By the end of this module, you should be able to fit and diagnose these models:
Model binary outcomes -> Logistic regression
Model count outcomes -> Binomial, Poisson regression
Think of this module as:
> And what if Y isn’t continuous?
A binary variable takes on two values, often coded as 1 and 0
We do not think of the values as numbers, but as classes or categories
The category coded 1 is often called the event or success
The category coded 0 is often called the non-event or failure
Working with binary data, you can solve both a regression and a classification problem
Regression: model the mean of a random variable
Classification: tell me what category an observation belongs to
Classification Problem
\(Y\) takes on one of a discrete number of outcomes or states (categories), \(Y \in \{ C_1, C_2,\cdots, C_k \}\)
The prediction problem is now to assign a new observation to one of the \(k\) categories
The number of categories can be as small as \(k=2\) (binary data), or very large (\(k=1,000\) in image classification)
Regression Problem
Classification can be solved with regression methods!
The binary probability distribution with success probability \(\pi\):
\[\begin{aligned} Y^* &\in \{C_1, C_2\} \\[9pt] Y &= \left \{ \begin{array}{ll} 1 & \quad \text{if } Y^* = C_1 \\ 0 & \quad \text{if } Y^* = C_2 \end{array} \right . \\[9pt] \text{Pr}(Y = y) &= \left \{ \begin{array}{ll} \pi & y=1 \\ (1-\pi) & y=0 \end{array} \right . \\[9pt] E[Y] &= 1 \times \text{Pr}(Y=1) + 0 \times \text{Pr}(Y=0) \\ &= \pi \end{aligned}\]
Regression is about modeling \(E[Y]=\pi\).
Assign an observation to the most likely category, given its predictor (input) values.
With two categories: \[\begin{array}{cc} \widehat{\pi} > 0.5 & \quad \text{predict category} \, Y=1 \\ \widehat{\pi} \leq 0.5 & \quad \text{predict category} \, Y=0 \end{array}\]
Note: the threshold 0.5 is just an example; we can be smarter about it.
Fundamental problems with using linear regression for such data
the response can take on any value; it is not restricted to two values
the mean of the response can take on any value; it is not confined to the \([0,1]\) interval
the predicted values can be smaller than 0 or larger than 1; how could this be a probability?
linear regression assumes additive errors with zero mean and equal variance
binary data do not have equal variance: \(Var[Y] = \pi \times (1-\pi)\)
Question: where is this variance smallest and where is it largest?
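A quick sketch in R (the grid of \(\pi\) values is illustrative) traces the variance function and makes the answer visible:

```r
# Variance of a binary variable as a function of pi: Var[Y] = pi * (1 - pi)
p <- seq(0, 1, by = 0.01)
v <- p * (1 - p)

p[which.max(v)]   # variance peaks at pi = 0.5 (value 0.25)
range(v)          # and is smallest (0) at pi = 0 and pi = 1
plot(p, v, type = "l", xlab = expression(pi), ylab = "Var[Y]")
```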
Solution
Notation: \({\bf x}^\prime{\bf \beta} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\)
Allow \({\bf x}^\prime{\bf \beta}\) to take on any possible value (on the real line)
Then use a function that maps \({\bf x}^\prime{\bf \beta}\) to a valid mean
For example, if the mean is a probability, we use a function \(g(\cdot)\) such that \[\begin{aligned} g(\pi) &= {\bf x}^\prime {\bf \beta} \\ \pi &= g^{-1}({\bf x}^\prime {\bf \beta})\\ 0 &\leq \pi \leq 1 \end{aligned}\]
\(\quad g(\cdot)\) is called the link function.
\(\quad g^{-1}(\cdot)\) is called the inverse link function.
In logistic regression the target is binary and the link function (usually) is the logit function \[\text{logit}(\pi) = \text{log} \Big( \frac{\pi}{1-\pi} \Big) = {\bf x}^\prime {\bf \beta}\]
The inverse link function is \[\pi = \frac{\exp({\bf x}^\prime{\bf \beta})}{1+\exp({\bf x}^\prime{\bf \beta})} = \frac{1}{1+\exp(-{\bf x}^\prime{\bf \beta})}\]
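In R the logit and its inverse are available as qlogis() and plogis() (the quantile function and c.d.f. of the standard logistic distribution), so the mapping can be checked numerically:

```r
# logit link and inverse link via base R's logistic distribution functions
pi_hat <- 0.8
eta <- qlogis(pi_hat)               # logit: log(0.8 / 0.2), about 1.386
plogis(eta)                         # inverse logit: back to 0.8

# same as the formulas above
log(pi_hat / (1 - pi_hat))          # equals qlogis(pi_hat)
1 / (1 + exp(-eta))                 # equals plogis(eta)
```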
Now we have a technique to move from the regression predictor \({\bf x}^\prime {\bf \beta}\) to the mean and back.
A logistic regression is fit with the glm() function in R.
GLM stands for generalized linear model, a rich class of models that includes logistic regression.
In R you select logistic regression for binary data
with family="binomial", since the binary distribution is a
special case of the binomial distribution.
Interpretation of logistic regression coefficients
\(\beta_j\) measures the change in logit units (on the scale of the link function) per unit change of the input
\(\beta_j\) does not measure the change in the probability as \(x_j\) changes by one unit
\(\beta_j > 0\): As \(x_j\) increases the predicted probability increases
\(\beta_j < 0\): As \(x_j\) increases the predicted probability decreases
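A small numerical illustration (the coefficient value is made up) shows why \(\beta_j\) is not a change in probability: exponentiating it gives an odds ratio, and the implied probability change depends on where you start.

```r
# Hypothetical coefficient: one unit of x_j adds beta_j to the log odds,
# i.e. multiplies the odds pi/(1-pi) by exp(beta_j)
beta_j <- 0.7
exp(beta_j)                       # odds ratio ~ 2: odds roughly double per unit of x_j

# the implied change in probability is not constant:
plogis(0 + beta_j) - plogis(0)    # ~ 0.168 starting from pi = 0.5
plogis(3 + beta_j) - plogis(3)    # ~ 0.023 starting from pi ~ 0.95
```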
To calculate predicted values in R use the predict() function.
The type="link" or type="response" option indicates whether to compute predictions on the scale of the link function or on the scale of the mean (= the scale of the response).
dfyes <- data.frame(balance=1000, income=20000, student="Yes")
predict(lg,newdata=dfyes, type="link")
predict(lg,newdata=dfyes, type="response")
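The two prediction scales are connected by the inverse link. A self-contained sketch with simulated data (variable names are illustrative, not the Default data) confirms that type="response" is just the inverse logit applied to type="link":

```r
# Simulated binary data and a logistic fit
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = "binomial")

newd    <- data.frame(x = 1)
eta_hat <- predict(fit, newdata = newd, type = "link")      # x'beta, any real value
pi_hat  <- predict(fit, newdata = newd, type = "response")  # probability in (0, 1)
all.equal(unname(pi_hat), unname(plogis(eta_hat)))          # TRUE
```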
Let’s look at a logistic regression example in R.
Before we get deeper into residual analysis and classification, a quick detour to the exponential family of distributions.
A distribution is a member of the family if its density or mass function can be written as \[{\displaystyle f(y)=h(y)\,g({\boldsymbol {\theta }})\,\exp \big({\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (y)\big)}\] where \({\bf \theta}\) is a vector of parameters to be estimated
Important members include
Discrete distributions: Binary (Bernoulli), Binomial, Negative Binomial, Poisson
Continuous distributions: Normal (Gaussian), Gamma, Exponential, Inverse Gaussian, Beta
A generalized linear model (GLM) consists of a
random component: distribution of \(Y\) is a member of the exponential family
systematic component: the linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\)
link function: maps the mean \(E[Y] = \mu\) to the linear predictor \(\eta\): \(g(\mu) = \eta\)
The inverse of the link function maps from the linear predictor \(\eta\) to the mean \(\mu\): \(g^{-1}(\eta) = \mu\).
The form of the distribution suggests a link function, called the canonical or natural link function.
Examples of generalized linear models
| Type | Distribution | Canonical Link | \(g(\mu)\) | \(g^{-1}(\eta)\) |
|---|---|---|---|---|
| Binary | Bernoulli (Binary) | logit | \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) | \(\frac{1}{1+\exp\{-\eta\}}\) |
| Count | Binomial | logit | \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) | \(\frac{1}{1+\exp\{-\eta\}}\) |
| Count | Poisson | log | \(\text{ln}\{\mu\}\) | \(\exp\{\eta\}\) |
| Cont’s | Normal (Gaussian) | Identity | \(\mu\) | \(\eta\) |
| Cont’s | Exponential | Inverse | \(-\frac{1}{\mu}\) | \(-\frac{1}{\eta}\) |
A logistic regression is a generalized linear model with the following components
random component: \(Y\) follows a binary (\(\pi\)) distribution.
systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).
link function: maps from probabilities (0,1) to \(\mathbb{R}\).
Any cumulative distribution function (c.d.f.) can serve as the inverse link function for this type of model.
The logit link is common; it is the inverse of the c.d.f. of the standard logistic distribution.
The probit link, based on the normal distribution, is also popular.
Taking a page from the linear regression book, we could look at
\(p\)-values for hypothesis tests about \(\beta\)s
Goodness-of-fit statistics like \(R^2\)
Diagnostics (residuals, case deletion, …)
The Problem
GLMs are not based on sums of squares and least-squares estimation. They are based on the maximum likelihood principle.
\(R^2\) statistics for generalized linear models are sketchy. They do not have the same interpretation in terms of % variability explained.
Residual plots show a lot of inherent structure if \(Y\) takes on only a few values
The generalized linear models define a number of residual types.
If \(\widehat{\mu}_i = g^{-1}({\bf x}_i^\prime\widehat{{\bf \beta}})\) is the predicted mean, then we can work with
Raw (response) residual: \(y_i - \widehat{\mu}_i\)
“Working” residuals: \(\frac{y_i-\widehat{\mu}_i}{\widehat{\text{Var}}[Y_i]}\)
Pearson residuals: \(\frac{y_i - \widehat{\mu}_i}{\sqrt{\widehat{\text{Var}}[Y_i]}}\)
Deviance residuals: \(d_i = \text{sign}(y_i - \widehat{\mu}_i) \sqrt{-2 \big[ y_i \, \text{ln}\,\widehat{\mu}_i + (1-y_i)\,\text{ln}\big(1-\widehat{\mu}_i\big) \big]}\); their squares sum to the model deviance
The deviance residuals are based on likelihood theory and differ for each distribution in the exponential family.
Recommendation: use the Pearson and deviance residuals.
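For binary data the Pearson residual is easy to reproduce by hand, which makes for a useful sanity check (simulated data, illustrative names):

```r
# Fit a logistic model to simulated binary data
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = "binomial")

mu_hat <- fitted(fit)                                  # estimated means pi_i
# Pearson residual: raw residual scaled by the estimated standard deviation
pearson_by_hand <- (y - mu_hat) / sqrt(mu_hat * (1 - mu_hat))
all.equal(unname(pearson_by_hand),
          unname(residuals(fit, type = "pearson")))    # TRUE
```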
Four sets of residuals for the logistic regression of the Default data.
lg <- glm(default ~ income + balance + student,
family="binomial",data=train)
par(mfrow=c(2,2))
par(cex=0.7)
par(mai=c(0.6,0.6,0.2,0.3))
plot(residuals(lg, type="response"), ylab="Raw residual")
plot(residuals(lg, type="working" ), ylab="Working residual" )
plot(residuals(lg, type="pearson" ), ylab="Pearson residual" )
plot(residuals(lg, type="deviance"), ylab="Deviance residual")
Let’s see how these look in R.
The residual plots in logistic regression are difficult to interpret
There is a lot of structure in the plots due to \(y \in \{0,1\}\)
Structure is usually a warning sign in residual plots, but here it is inherent to the binary response, so the plots are not very informative
Since we often use logistic regression as a classifier it is common to calculate the misclassification rate
glm outputs
Before we talk about misclassification statistics, what are these summary(glm()) outputs telling us?
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = def_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
z value: the Wald statistic, which tests the null hypothesis that the true coefficient is zero, based on likelihood criteria.
Pr(>|z|): The p-value corresponding to the
z-value, indicating the statistical significance of the predictor in the
model.
Null deviance: Measures the fit of the model with
only the intercept (no predictors).
Residual deviance: Measures the fit of the model
with all the included predictors. A significant drop in deviance from
the null to the residual suggests that the predictors improve the
model’s fit.
AIC: Akaike Information Criterion used for comparing
different models of the same data; lower values generally indicate a
better model fit. See details in future classes.
Number of Fisher Scoring iterations: The number of
iterations the algorithm took to converge.
Remember: The deviance is \(-2\times\) the difference between the log-likelihood at the maximum likelihood estimate and the log-likelihood for a “saturated model” (model with a separate parameter for each observation thus a perfect fit).
For qualitative targets that take on values in \(k\) categories we classify by assigning classes based on predicted category probabilities.
In logistic regression with \(k=2\) the classification rule is \[\begin{array}{cc} \widehat{\pi} > c & \quad \text{predict category} \, Y=1 \\ \widehat{\pi} \leq c & \quad \text{predict category} \, Y=0 \end{array}\] for some value \(c\) (usually \(c=0.5\), the Bayes classifier).
This leads to a \(2 \times 2\) table of observed and classified (predicted) values called the confusion matrix.
| Prediction | True: Yes | True: No |
|---|---|---|
| Yes | True Positive (TP) | False Positive (FP) |
| No | False Negative (FN) | True Negative (TN) |
Values on the diagonal (TP & TN) are correctly classified
Values off the diagonal (FN & FP) are errors; remember errors do not have to have symmetric consequences
Columns might be re-ordered in software (declare what is “positive”, see below)
False positive rate: FPR = FP / (FP + TN)
the proportion of incorrect “Yes” classifications among the true
“No”s
False negative rate: FNR = FN / (FN + TP)
the proportion of incorrect “No” classifications among the true
“Yes”s
Sensitivity: TP / (FN + TP) = 1 - FNR
the true positive rate, complement of FNR
the ratio of true positives to what should have been predicted as
positive
Specificity: TN / (FP + TN) = 1 - FPR
the true negative rate, complement of FPR
the ratio of true negatives to what should have been predicted as
negative
Misclassification rate: (FP + FN) / (TP+TN+FP+FN)
also called error rate, the proportion of incorrect classifications
Accuracy: (TP+TN) / (TP+TN+FP+FN) = 1 - Misclassification rate
the proportion of correct classifications
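These definitions are easy to compute directly. A sketch using the cell counts from the test-data confusion matrix of the Default example (TP = 9, FP = 7, FN = 26, TN = 958):

```r
# Confusion-matrix statistics computed from the four cell counts
TP <- 9; FP <- 7; FN <- 26; TN <- 958
sensitivity <- TP / (FN + TP)                    # 9/35,    ~ 0.2571
specificity <- TN / (FP + TN)                    # 958/965, ~ 0.9927
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 967/1000 = 0.967
misclass    <- 1 - accuracy                      # 0.033
c(sensitivity = sensitivity, specificity = specificity,
  accuracy = accuracy, misclass = misclass)
```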
Calculating the confusion matrix for the test data.
The event coded \(Y=1\) is “Yes”, default on credit card.
pred_prob_test <- predict(lg,newdata=test, type="response")
classify_test <- as.factor(ifelse(pred_prob_test > 0.5,"Yes","No"))
table(classify_test,test$default, dnn=list("Classified","Observed"))
Result
Observed
Classified No Yes
No 958 26
Yes 7 9
Why is this called a “confusion” matrix? Where is the “confusion”?
For the Default data set the confusion matrix along with
many statistics can be obtained in R with the
confusionMatrix() function in the caret
package.
library(caret)
predicted_prob_test <- predict(lg, newdata=test, type="response")
classify <- as.factor(ifelse(predicted_prob_test > 0.5,"Yes","No"))
confusionMatrix(classify,test$default, positive="Yes")
The positive="Yes" option tells R to
compute the statistic assuming that “Yes” is the event of interest (the
“positive” event).
Otherwise it chooses the first level of the factor, which would be “No” in this example.
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 958 26
Yes 7 9
Accuracy : 0.967
95% CI : (0.954, 0.9772)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.407906
Kappa : 0.3384
Mcnemar's Test P-Value : 0.001728
Sensitivity : 0.2571
Specificity : 0.9927
Pos Pred Value : 0.5625
Neg Pred Value : 0.9736
Prevalence : 0.0350
Detection Rate : 0.0090
Detection Prevalence : 0.0160
Balanced Accuracy : 0.6249
'Positive' Class : Yes
Other statistics from the confusionMatrix() output
No Information Rate: The proportion of the largest class. If you predict all obs to be “No” you get an accuracy of 0.965. This is called a naïve classifier.
P-Value (Acc > NIR): \(p\)-value of test that accuracy of model is better than the naïve no-information classifier.
Detection Rate: TP / (TP+TN+FP+FN) = 9 / 1,000 = 0.009
Balanced Accuracy: (Sensitivity + Specificity)/2 = (0.2571 + 0.9927) / 2 = 0.6249
Other terms you may come across
Precision: Ratio of true positives to anything predicted as positive, TP / (TP + FP) = 9 / (9 + 7) = 0.5625
Recall = Sensitivity
Training data overstates ability to classify new observations
If prediction is important, use confusion statistics based on test data
Misclassification rate is often used like \(MSE\) in regression with continuous data
Sensitivity and specificity might be more relevant
Consider the consequences of errors.
Is a false positive equally as bad as a false negative?
The confusion matrix changes with choice of cutoff \(c\)
From the test confusion matrix
False positive rate (0.0073) is the error rate among non-defaulters. Only 7 out of 7+958 = 965 were incorrectly labeled as defaulters. That is a low FPR, high specificity.
False negative rate (0.7429) is the error rate among defaulters: 26 of the 26+9 = 35 true defaulters were missed. That is a high FNR. The model has a low sensitivity of 1 - 0.7429 = 0.2571.
If you want to use the model to detect defaults before they occur, it is probably unacceptable.
Count data, where the possible values of the target variable are countable \((0, 1, 2, 3,\cdots)\), arise in many applications. We consider two important cases
The \(LD_{50}\) is the lethal dosage
of an agent (a herbicide, insecticide, etc.) that kills 50% of the
organisms exposed to the agent. The table shows the number of insect
larvae killed at various concentrations of an insecticide. For each
concentration, 20 larvae were exposed to the insecticide.
| Concentration | 0.375% | 0.75% | 1.5% | 3% | 6% | 12% | 24% |
|---|---|---|---|---|---|---|---|
| No of larvae killed | 0 | 1 | 8 | 11 | 16 | 18 | 20 |
This is an example of counts out of a total.
We will assume that the number of larvae killed is a Binomial(20,\(\pi\)) random variable.
The data can also be treated as binary and modeled with logistic regression:
\(7 \times 20 = 140\) observations
each response is either 1 if the larva died, or 0 if it survived.
there are 74 1’s and 66 0’s
Remember the bike sharing rental data that we worked with during our in-class coding activities.
Bike is an example of a count variable we model as a
Poisson variable.
It is not a count that can be expressed as a proportion out of a total; it is a count per unit of time (hour).
A binomial regression is a generalized linear model with the following components
random component: the counts (i.e. response Y) are assumed to follow a Binomial(\(n,\pi\)) distribution. \(n\) does not have to be the same for the observations.
systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).
link function: maps from probabilities (0,1) to \(\mathbb{R}\).
The logit link is common; it is the inverse of the c.d.f. of the standard logistic distribution.
The probit link, based on the normal distribution, is also popular.
Connection between logistic regression for binary data and binomial regression for binomial counts:
Binary (Bernoulli) distribution
A Bernoulli(\(\pi\)) random variable Y takes on value 1 with probability \(\pi\) and value 0 with probability \(1-\pi\)
The probability mass function (p.m.f.) is \[\text{Pr}(Y=y) = \pi^y (1-\pi)^{1-y}\]
Binomial distribution
Suppose \(Y_1, \cdots, Y_n\) are independent Bernoulli(\(\pi\)) random variables with success probability \(\pi\)
\(Z = \sum_{i=1}^n Y_i\) follows a Binomial(\(n,\pi\)) distribution with p.m.f.
\[\text{Pr}(Z=z) = {n \choose z} \pi^z (1-\pi)^{n-z}\]
In the insecticide example 20 larvae are exposed to an insecticide at a particular concentration \(x\).
The 20 larvae represent Bernoulli\((\pi(x))\) random variables.
Assuming that they respond independently to the insecticide and that they share the same mortality rate at a given concentration, the number of larvae out of 20 that succumb to the insecticide is a Binomial\((20,\pi(x))\) random variable.
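In R the binomial counts can be fit directly by giving glm() a two-column response cbind(successes, failures). A sketch using the insecticide table above, with log10 concentration as the predictor (an assumption about the model specification):

```r
# Binomial (events/trials) regression of the insecticide data
conc   <- c(0.375, 0.75, 1.5, 3, 6, 12, 24)   # % concentration
killed <- c(0, 1, 8, 11, 16, 18, 20)          # out of n = 20 larvae each
n      <- rep(20, length(conc))

fit <- glm(cbind(killed, n - killed) ~ log10(conc), family = "binomial")
coef(fit)        # intercept and slope on the logit (log odds) scale
```

The two-column form is equivalent to fitting the 140 individual binary responses with logistic regression.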
Let’s look at this example in R.
Predict the mortality rate at concentration 5% \[\begin{aligned}
\widehat{\pi}(\text{conc}=5\%) &= \frac{1}{1+\exp(- (-1.7305 +
4.1651 \times \text{log}_{10}(5)) )} \\
&= \frac{1}{1+\exp(-1.18078)} = 0.7651
\end{aligned}\] There is a \(76.51\%\) chance that a larva exposed to a 5% concentration of the insecticide will die.
A Poisson regression is a generalized linear model with the following components
random component: the counts follow a Poisson(\(\lambda\)) distribution.
systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).
link function: the link function maps from \((0,\infty)\) to the \(\mathbb{R}\).
The canonical link is the \(\text{log}()\) function.
A random variable \(Y\) with Poisson\((\lambda)\) distribution has mean \(\lambda\), variance \(\lambda\) and probability mass function \[\text{Pr}(Y=y) = \frac{\lambda^y}{y!} e^{-\lambda}, \quad y \in \{0,1,2,3,\cdots \}\]
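The p.m.f. can be verified against base R's dpois():

```r
# Check the Poisson p.m.f. formula by hand
lambda <- 3
y <- 0:10
pmf_by_hand <- lambda^y / factorial(y) * exp(-lambda)
all.equal(pmf_by_hand, dpois(y, lambda))   # TRUE
sum(y * dpois(y, lambda))                  # close to the mean, lambda = 3
```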
As \(\lambda\) increases the Poisson p.m.f. approaches a Normal distribution
The following plots show the probability mass functions (pmf) for Poisson distributions with \(\lambda = 2\) and \(\lambda = 5\).
As \(\lambda\) increases, the Poisson pmf looks more symmetric and very similar to a normal distribution. The following graph shows the histogram and the density for 10,000 draws from a Poisson(20) and a Normal(20,20) distribution, respectively. The empirical distributions are very close.
Because a Poisson distribution with large \(\lambda\) resembles a Normal distribution,
count data are often fitted with standard linear regression methods
(lm()).
That is not a good idea, because
Poisson count data are not homoscedastic: \(E[Y] = Var[Y] = \lambda\) (i.e. \(\lambda(x)\)).
\(\Rightarrow\) The equal variance assumption of linear regression for a continuous target is not met
The linear regression model does not guarantee \(\widehat{y} \geq 0\)
The predictions of the two models can be very different.
Try it out with the bike rental data
The syntax to fit a Poisson regression is very similar to that for logistic or binomial regression; just add family="poisson" to the glm() call.
Let’s go back to our Seoul bike rental counts and fit a Poisson regression.
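A minimal sketch on simulated hourly counts (names and coefficients are made up, not the actual Seoul data):

```r
# Poisson regression of simulated hourly rental counts on temperature
set.seed(3)
temp  <- runif(300, -5, 30)                        # hourly temperature
rents <- rpois(300, lambda = exp(2 + 0.05 * temp)) # true log link model
fit <- glm(rents ~ temp, family = "poisson")

coef(fit)                                          # close to the true (2, 0.05)
predict(fit, newdata = data.frame(temp = 20),
        type = "response")                         # fitted mean count at 20 degrees
```

As with logistic regression, type="response" applies the inverse link, here exp(), to the linear predictor.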
Examples of generalized linear models
| Type | Distribution | Canonical Link | \(g(\mu)\) | \(g^{-1}(\eta)\) |
|---|---|---|---|---|
| Binary | Bernoulli (Binary) | logit | \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) | \(\frac{1}{1+\exp\{-\eta\}}\) |
| Count | Binomial | logit | \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) | \(\frac{1}{1+\exp\{-\eta\}}\) |
| Count | Poisson | log | \(\text{ln}\{\mu\}\) | \(\exp\{\eta\}\) |
| Cont’s | Normal (Gaussian) | Identity | \(\mu\) | \(\eta\) |
| Cont’s | Exponential | Inverse | \(-\frac{1}{\mu}\) | \(-\frac{1}{\eta}\) |