Advanced Regression Models

Learning Goals

By the end of this module, you should be able to fit and diagnose these models:

  • Model binary outcomes -> Logistic regression

  • Model count outcomes -> Binomial, Poisson regression

Think of this module as:
> And what if Y isn’t continuous?

Regression with Binary Targets

Binary Target

  • A binary variable takes on two values, often coded as 1 and 0

  • We do not think of the values as numbers, but as classes or categories

  • The category coded 1 is often called the event or success

  • The category coded 0 is often called the non-event or failure

  • Working with binary data, you can solve both a regression and a classification problem:
    Regression: model the mean of a random variable. Classification: assign an observation to one of the categories.

Regression or Classification

Classification Problem

  • \(Y\) takes on one of a discrete number of outcomes or states (categories), \(Y \in \{ C_1, C_2,\cdots, C_k \}\)

  • The prediction problem is now to assign a new observation to one of the \(k\) categories

  • The number of categories can be as small as \(k=2\) (binary data), or very large (\(k=1,000\) in image classification)

Regression Problem

  • Develop a model (function) to predict the mean of \(Y\) from inputs.

Binary Regression is a Classifier

Classification can be solved with regression methods!

The binary probability distribution with success probability \(\pi\):

\[\begin{aligned} Y^* &\in \{C_1, C_2\} \\[9pt] Y &= \left \{ \begin{array}{ll} 1 & \quad \text{if } Y^* = C_1 \\ 0 & \quad \text{if } Y^* = C_2 \end{array} \right . \\[9pt] \text{Pr}(Y = y) &= \left \{ \begin{array}{ll} \pi & y=1 \\ (1-\pi) & y=0 \end{array} \right . \\[9pt] E[Y] &= 1 \times \text{Pr}(Y=1) + 0 \times \text{Pr}(Y=0) \\ &= \pi \end{aligned}\]

Regression is about modeling \(E[Y]=\pi\).

Binary Regression is a Classifier

  • The mean (expected value) of the binary (0,1) target is the success probability.


  • Since regression models the mean we need to find a function that expresses \[\pi = f(x_1,\cdots,x_p,\beta_0,\cdots,\beta_p)\]


  • With estimates of \(\beta_0, \cdots, \beta_p\) we can then estimate the probability of the event of interest \[\widehat{\pi} = \widehat{f} = f(x_1,\cdots,x_p,\widehat{\beta}_0,\cdots,\widehat{\beta}_p)\]

Bayes Classifier

Assign an observation to the most likely category, given its predictor (input) values.

With two categories: \[\begin{array}{cc} \widehat{\pi} > 0.5 & \quad \text{predict category} \, Y=1 \\ \widehat{\pi} \leq 0.5 & \quad \text{predict category} \, Y=0 \end{array}\]

Note: the threshold 0.5 is just one choice; we can be smarter about it.

Logistic Regression

  • In logistic regression the target is binary and the link function (usually) is the logit function \[\text{logit}(\pi) = \text{log} \Big( \frac{\pi}{1-\pi} \Big) = {\bf x}^\prime {\bf \beta}\]

  • The inverse link function is \[\pi = \frac{\exp({\bf x}^\prime{\bf \beta})}{1+\exp({\bf x}^\prime{\bf \beta})} = \frac{1}{1+\exp(-{\bf x}^\prime{\bf \beta})}\]

  • Now we have a technique to move from the regression predictor \({\bf x}^\prime {\bf \beta}\) to the mean and back.
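The round trip between the two scales can be sketched in a couple of lines of R; the built-in `qlogis()` and `plogis()` functions implement the logit and its inverse:

```r
# logit link: from a probability to the linear-predictor scale
pi <- 0.8
eta <- log(pi / (1 - pi))        # logit(pi), same as qlogis(pi)

# inverse link: from the linear predictor back to a probability
pi_back <- 1 / (1 + exp(-eta))   # same as plogis(eta)

eta       # 1.386294
pi_back   # 0.8
```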

Logistic Regression in R

  • A logistic regression is computed with the glm() function in R.

  • GLM stands for generalized linear model, a rich class of models that includes logistic regression.

  • In R you select logistic regression for binary data with family="binomial", since the binary distribution is a special case of the binomial distribution.
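A minimal, self-contained sketch of the call (with simulated data, since no data set has been introduced at this point):

```r
set.seed(1)
n <- 500
x <- rnorm(n)
pi <- plogis(-1 + 2 * x)              # true model: logit(pi) = -1 + 2x
y <- rbinom(n, size = 1, prob = pi)   # binary 0/1 target

fit <- glm(y ~ x, family = "binomial")
coef(fit)                             # estimates should be near (-1, 2)
```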

Calculating Predictions

Interpretation of logistic regression coefficients

  • \(\beta_j\) measures the change in logit units (on the scale of the link function) per unit change of the input

  • \(\beta_j\) does not measure the change in the probability as \(x_j\) changes by one unit

  • \(\beta_j > 0\): As \(x_j\) increases the predicted probability increases

  • \(\beta_j < 0\): As \(x_j\) increases the predicted probability decreases
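One useful consequence: \(\exp(\beta_j)\) is an odds ratio, while the change in probability per unit of \(x_j\) depends on where you start. A small numeric illustration with made-up coefficients \(\beta_0 = -2\), \(\beta_1 = 1\):

```r
b0 <- -2; b1 <- 1                     # hypothetical coefficients

p <- function(x) plogis(b0 + b1 * x)  # predicted probability at x

p(1) - p(0)   # change in probability from x = 0 to x = 1
p(3) - p(2)   # a different change, for the same one-unit step

# the odds ratio, by contrast, is constant:
odds <- function(x) p(x) / (1 - p(x))
odds(1) / odds(0)   # exp(b1) = exp(1) = 2.718...
odds(3) / odds(2)   # same value
```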

Calculating Predictions in R

  • To calculate predicted values in R use the predict() function.

  • The type="link" or type="response" option indicates whether to compute predictions on the scale of the link function or on the scale of the mean (=the scale of the response).

  • dfyes <- data.frame(balance=1000, income=20000, student="Yes")
    predict(lg, newdata=dfyes, type="link")
    predict(lg, newdata=dfyes, type="response")

  • Let’s look at a logistic regression example in R.

Exponential Family, Generalized Linear Models

Exponential Family

Before we get deeper into residual analysis and classification, a quick detour to the exponential family of distributions.

A distribution is a member of the family if its density or mass function can be written as \[{\displaystyle f(y)=h(y)\,g({\boldsymbol {\theta }})\,\exp \big({\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (y)\big)}\] where \({\boldsymbol{\theta}}\) is a vector of parameters to be estimated

Important members include

  • Discrete distributions: Binary (Bernoulli), Binomial, Negative Binomial, Poisson

  • Continuous distributions: Normal (Gaussian), Gamma, Exponential, Inverse Gaussian, Beta
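For instance, the Bernoulli mass function can be rearranged into this form, which is also where the logit arises as the natural parameter:

\[\text{Pr}(Y=y) = \pi^y(1-\pi)^{1-y} = (1-\pi)\,\exp\Big( y \,\text{log}\Big(\frac{\pi}{1-\pi}\Big)\Big)\]

with \(h(y)=1\), \(g(\pi)=1-\pi\), \(T(y)=y\), and \(\eta(\pi)=\text{logit}(\pi)\).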

Generalized Linear Model (GLM)

A generalized linear model (GLM) consists of a

  • random component: distribution of \(Y\) is a member of the exponential family

  • systematic component: the linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\)

  • link function: maps the mean \(E[Y] = \mu\) to the linear predictor \(\eta\): \(g(\mu) = \eta\)


  • The inverse of the link function maps from the linear predictor \(\eta\) to the mean \(\mu\): \(g^{-1}(\eta) = \mu\).

  • The form of the distribution suggests a link function, called the canonical or natural link function.

Members of GLM family

Examples of generalized linear models

Type Distribution Canonical Link \(g(\mu)\) \(g^{-1}(\eta)\)
Binary Bernoulli (Binary) logit \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) \(\frac{1}{1+\exp\{-\eta\}}\)
Count Binomial logit \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) \(\frac{1}{1+\exp\{-\eta\}}\)
Count Poisson log \(\text{ln}\{\mu\}\) \(\exp\{\eta\}\)
Cont’s Normal (Gaussian) identity \(\mu\) \(\eta\)
Cont’s Exponential inverse \(-\frac{1}{\mu}\) \(-\frac{1}{\eta}\)

Logistic Regression as a GLM

A logistic regression is a generalized linear model with the following components

  • random component: \(Y\) follows a binary (\(\pi\)) distribution.

  • systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).

  • link function: maps from probabilities (0,1) to \(\mathbb{R}\).
    Any cumulative distribution function (c.d.f.) can serve as the inverse link function for this type of model.
    The logit link is the most common; it is the inverse of the c.d.f. of the standard logistic distribution.
    The probit link, based on the normal c.d.f., is also popular.

Logistic Model Assessment

Taking a page from the linear regression book, we could look at

  • \(p\)-values for hypothesis tests about \(\beta\)s

  • Goodness-of-fit statistics like \(R^2\)

  • Diagnostics (residuals, case deletion, …)

The Problem

  • GLMs are not based on sums of squares and least-squares estimation. They are based on the maximum likelihood principle.

  • \(R^2\) statistics for generalized linear models are problematic: they do not have the same interpretation in terms of % variability explained.

  • Residual plots show a lot of inherent structure if \(Y\) takes on only a few values

Residuals in GLMs

Generalized linear models define a number of residual types.

If \(\widehat{\mu}_i = g^{-1}({\bf x}^\prime\widehat{{\bf \beta}})\) is the predicted mean, then we can work with

  • Raw (response) residual: \(y_i - \widehat{\mu}_i\)

  • “Working” residuals: \(\frac{y_i-\widehat{\mu}_i}{\widehat{\text{Var}}[Y_i]}\)

  • Pearson residuals: \(\frac{y_i - \widehat{\mu}_i}{\sqrt{\widehat{\text{Var}}[Y_i]}}\)

  • Deviance residuals: \(d_i = \text{sign}(y_i - \widehat{\mu}_i)\, \sqrt{-2 \big[ y_i \,\text{ln}\, \widehat{\mu}_i + (1-y_i) \,\text{ln}\big(1-\widehat{\mu}_i\big) \big]}\) (shown here for binary data; their squares sum to the model deviance)

The deviance residuals are based on likelihood theory and are different for each distribution in the exponential family.

We recommend using the Pearson and deviance residuals.
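As a quick check of these definitions, here is a self-contained sketch (simulated data) comparing a hand-computed Pearson residual with what `residuals()` returns; for binary data \(\widehat{\text{Var}}[Y_i] = \widehat{\mu}_i(1-\widehat{\mu}_i)\):

```r
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = "binomial")

mu <- fitted(fit)                          # estimated means (probabilities)
pearson_by_hand <- (y - mu) / sqrt(mu * (1 - mu))

# matches R's built-in Pearson residuals
all.equal(unname(pearson_by_hand),
          unname(residuals(fit, type = "pearson")))   # TRUE
```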

GLM Residuals in R

Four sets of residuals for logistic regression of Default data.

lg <- glm(default ~ income + balance + student, 
          family="binomial",data=train)

par(mfrow=c(2,2))
par(cex=0.7) 
par(mai=c(0.6,0.6,0.2,0.3))

plot(residuals(lg, type="response"), ylab="Raw residual")
plot(residuals(lg, type="working" ), ylab="Working residual" )
plot(residuals(lg, type="pearson" ), ylab="Pearson residual" )
plot(residuals(lg, type="deviance"), ylab="Deviance residual")  

Let’s see how these look in R.

Logistic Residuals

  • The residual plots in logistic regression are difficult to interpret

  • There is a lot of structure in the plots due to \(y \in \{0,1\}\)
    Structure is typically a source of concerns for residuals so this is not an informative quantity.

  • Since we often use logistic regression as a classifier it is common to calculate the misclassification rate

Interpretation of glm outputs

Before we talk about misclassification statistics, what are these summary(glm()) outputs telling us?

## 
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial", 
##     data = def_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
  • z value: the Wald statistic (the estimate divided by its standard error), which tests the null hypothesis that the true coefficient is zero.

  • Pr(>|z|): The p-value corresponding to the z-value, indicating the statistical significance of the predictor in the model.

  • Null deviance: Measures the fit of the model with only the intercept (no predictors).

  • Residual deviance: Measures the fit of the model with all the included predictors. A significant drop in deviance from the null to the residual suggests that the predictors improve the model’s fit.

  • AIC: Akaike Information Criterion used for comparing different models of the same data; lower values generally indicate a better model fit. See details in future classes.

  • Number of Fisher Scoring iterations: The number of iterations the algorithm took to converge.

  • Remember: The deviance is \(-2\times\) the difference between the log-likelihood at the maximum likelihood estimate and the log-likelihood for a “saturated model” (model with a separate parameter for each observation thus a perfect fit).

Confusion Matrix

For a qualitative target that takes on values in \(k\) categories, we classify by assigning each observation to a class based on its predicted category probabilities.

In logistic regression with \(k=2\) the classification rule is \[\begin{array}{cc} \widehat{\pi} > c & \text{predict category} \, Y=1 \\ \widehat{\pi} \leq c & \text{predict category} \, Y=0 \end{array}\] for some value \(c\) (usually \(c=0.5\), the Bayes classifier).

This leads to a \(2 \times 2\) table of observed and classified (predicted) values called the confusion matrix.

True (Reference)
Prediction Yes No
Yes True Positive (TP) False Positive (FP)
No False Negative (FN) True Negative (TN)


  • Values on the diagonal (TP & TN) are correctly classified

  • Values off the diagonal (FN & FP) are errors; remember that errors need not have symmetric consequences

  • Columns might be re-ordered in software
    (declare what is “positive”, see below)

Confusion Statistics

  • False positive rate: FPR = FP / (FP + TN)
    the proportion of incorrect “Yes” classifications among the true “No”s

  • False negative rate: FNR = FN / (FN + TP)
    the proportion of incorrect “No” classifications among the true “Yes”s


  • Sensitivity: TP / (FN + TP) = 1 - FNR
    the true positive rate, complement of FNR
    the ratio of true positives to what should have been predicted as positive

  • Specificity: TN / (FP + TN) = 1 - FPR
    the true negative rate, complement of FPR
    the ratio of true negatives to what should have been predicted as negative


  • Misclassification rate: (FP + FN) / (TP+TN+FP+FN)
    also called error rate, the proportion of incorrect classifications

  • Accuracy: (TP+TN) / (TP+TN+FP+FN) = 1 - Misclassification rate
    the proportion of correct classifications
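Using the confusion matrix counts reported later for the Default test data (TP = 9, TN = 958, FP = 7, FN = 26), these definitions give:

```r
TP <- 9; TN <- 958; FP <- 7; FN <- 26

FPR         <- FP / (FP + TN)                    # 0.0073
FNR         <- FN / (FN + TP)                    # 0.7429
sensitivity <- TP / (FN + TP)                    # 0.2571
specificity <- TN / (FP + TN)                    # 0.9927
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 0.967
```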

Confusion Matrix in R

Calculating the confusion matrix for the test data.

The event coded \(Y=1\) is “Yes”, default on credit card.

pred_prob_test <- predict(lg,newdata=test, type="response")
classify_test <- as.factor(ifelse(pred_prob_test > 0.5,"Yes","No"))
table(classify_test,test$default, dnn=list("Classified","Observed"))

Result

            Observed
Classified    No Yes
       No    958  26
       Yes     7   9    

Why is this called a “confusion” matrix? Where is the “confusion”?

For the Default data set the confusion matrix along with many statistics can be obtained in R with the confusionMatrix() function in the caret package.

library(caret)
predicted_prob_test <- predict(lg, newdata=test, type="response")

classify <- as.factor(ifelse(predicted_prob_test > 0.5,"Yes","No"))

confusionMatrix(classify,test$default, positive="Yes")

The positive="Yes" option tells R to compute the statistic assuming that “Yes” is the event of interest (the “positive” event).

Otherwise it chooses the first level of the factor, which would be “No” in this example.

Confusion Statistics in R

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  958  26
       Yes   7   9
                                         
               Accuracy : 0.967          
                 95% CI : (0.954, 0.9772)
    No Information Rate : 0.965          
    P-Value [Acc > NIR] : 0.407906       
                                         
                  Kappa : 0.3384         
                                         
 Mcnemar's Test P-Value : 0.001728       
                                         
            Sensitivity : 0.2571         
            Specificity : 0.9927         
         Pos Pred Value : 0.5625         
         Neg Pred Value : 0.9736         
             Prevalence : 0.0350         
         Detection Rate : 0.0090         
   Detection Prevalence : 0.0160         
      Balanced Accuracy : 0.6249         
                                         
       'Positive' Class : Yes               

Other statistics from the confusionMatrix() output

  • No Information Rate: The proportion of the largest class. If you predict all observations to be “No” you get an accuracy of 0.965. This is called a naïve classifier.

  • P-Value (Acc > NIR): \(p\)-value of a test that the accuracy of the model is better than the naïve no-information classifier.

  • Detection Rate: TP / (TP+TN+FP+FN) = 9 / 1,000 = 0.009

  • Balanced Accuracy: (Sensitivity + Specificity)/2 = (0.2571 + 0.9927) / 2 = 0.6249

Other terms you may come across

  • Precision: Ratio of true positives to everything predicted as positive, TP / (TP + FP) = 9 / (9 + 7) = 0.5625

  • Recall = Sensitivity

How Well Does the Model Perform?

  • Training data overstates ability to classify new observations

  • If prediction is important, use confusion statistics based on test data

  • Misclassification rate is often used like \(MSE\) in regression with continuous data

  • Sensitivity and specificity might be more relevant

  • Consider the consequences of errors.
    Is a false positive equally as bad as a false negative?

  • The confusion matrix changes with choice of cutoff \(c\)

Confusion Matrix

From the test confusion matrix

  • False positive rate (0.0072) is the error rate among non-defaulters. Only 7 out of 7+958 = 965 were incorrectly labeled as defaulters. That is a low FPR, high specificity.

  • False negative rate (0.7428) is the error rate among defaulters. That is a high FNR. The model has a low sensitivity of 1 - 0.7428 = 0.2572.

  • If you want to use the model to detect defaults before they occur, this performance is probably unacceptable.

Count Target

Count Target

Count data, where the possible values of the target variable are countable \((0, 1, 2, 3,\cdots)\), arise in many applications. We consider two important cases

  • Counts out of a total: here we are counting the number of occurrences of an event out of a total number of events. For example, the number of defective manufactured items out of \(n=100\). The count target takes on values in \((0, 1, 2, \cdots, n)\) and can be expressed as a proportion.
    \(Y\) is frequently assumed to follow a Binomial distribution.


  • Counts per unit: here we are counting events per day, per square foot, or some other unit. For example, the number of daily complaints in a customer service department. There is no upper bound for the possible values: \(0, 1, 2, \cdots\).
    \(Y\) is frequently assumed to follow a Poisson distribution.

Count Target Example: insecticide dosage

The \(LD_{50}\) is the lethal dosage of an agent (a herbicide, insecticide, etc.) that kills 50% of the organisms exposed to the agent. The table shows the number of insect larvae killed at various concentrations of an insecticide. For each concentration, 20 larvae were exposed to the insecticide.

Concentration 0.375% 0.75% 1.5% 3% 6% 12% 24%
No of larvae killed 0 1 8 11 16 18 20

This is an example of counts out of a total.

We will assume that the number of larvae killed is a Binomial(20,\(\pi\)) random variable.

The data can also be treated as binary and modeled with logistic regression:

  • \(7 \times 20 = 140\) observations

  • each response is either 1 if the larva died, or 0 if it survived.

  • there are 74 1’s and 66 0’s

Count Target Example: bike rental counts

Remember the bike sharing rental data that we worked with during our in-class coding activities.

Bike is an example of a count variable we model as a Poisson variable.

It is not a count that can be expressed as a proportion, it is a count per unit (hour).

Binomial Regression

A binomial regression is a generalized linear model with the following components

  • random component: the counts (i.e. response Y) are assumed to follow a Binomial(\(n,\pi\)) distribution. \(n\) does not have to be the same for the observations.

  • systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).

  • link function: maps from probabilities (0,1) to \(\mathbb{R}\).
    The logit function is common, it is the inverse of the c.d.f of the standard logistic distribution.
    The probit link, based on the normal distribution, is also popular.

Binomial Distribution

Connection between logistic regression for binary data and binomial regression for binomial counts:

  • Binary (Bernoulli) distribution

    • A Bernoulli(\(\pi\)) random variable Y takes on value 1 with probability \(\pi\) and value 0 with probability \(1-\pi\)

    • The probability mass function (p.m.f.) is \[\text{Pr}(Y=y) = \pi^y (1-\pi)^{1-y}\]


  • Binomial distribution

    • Suppose \(Y_1, \cdots, Y_n\) are independent Bernoulli(\(\pi\)) random variables with success probability \(\pi\)

    • \(Z = \sum_{i=1}^n Y_i\) follows a Binomial(\(n,\pi\)) distribution with p.m.f.

    • \[\text{Pr}(Z=z) = {n \choose z} \pi^z (1-\pi)^{n-z}\]
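A quick numerical sanity check of this p.m.f., using R's built-in `dbinom()`:

```r
n <- 20; pi <- 0.4; z <- 7

# the binomial p.m.f. written out by hand
choose(n, z) * pi^z * (1 - pi)^(n - z)   # 0.1659...
dbinom(z, size = n, prob = pi)           # same value

# the mean of a Binomial(n, pi) is n * pi: simulate sums of Bernoullis
set.seed(7)
draws <- replicate(10000, sum(rbinom(n, 1, pi)))
mean(draws)                              # close to 20 * 0.4 = 8
```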

Count Target

In the insecticide example 20 larvae are exposed to an insecticide at a particular concentration \(x\).


The 20 larvae represent Bernoulli\((\pi(x))\) random variables.


Assuming that they respond independently to the insecticide and that they share the same mortality rate at a given concentration, the number of larvae out of 20 that succumb to the insecticide is a Binomial\((20,\pi(x))\) random variable.

Binomial Regression in R

Let’s look at this example in R.
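A sketch of the fit, entering the table above as counts of killed and surviving larvae via `cbind()`; the concentration enters on the \(\text{log}_{10}\) scale, as in the calculations below:

```r
conc   <- c(0.375, 0.75, 1.5, 3, 6, 12, 24)   # insecticide concentration (%)
killed <- c(0, 1, 8, 11, 16, 18, 20)          # larvae killed, out of 20 each
n      <- 20

# binomial regression: successes and failures on the left-hand side
fit <- glm(cbind(killed, n - killed) ~ log10(conc),
           family = "binomial")
coef(fit)   # intercept and slope on the logit scale

# predicted mortality rate at a 5% concentration
predict(fit, newdata = data.frame(conc = 5), type = "response")
```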

Predict the mortality rate at concentration 5% \[\begin{aligned} \widehat{\pi}(\text{conc}=5\%) &= \frac{1}{1+\exp(- (-1.7305 + 4.1651 \times \text{log}_{10}(5)) )} \\ &= \frac{1}{1+\exp(-1.18078)} = 0.7651 \end{aligned}\] There is a \(76.51\%\) chance that a larva exposed to a 5% concentration of the insecticide will die.

  • Question: Find the \(LD_{50}\), that is the dosage lethal to 50% of the animals.


  • Solution: This is an inverse prediction problem: for a given value of \(\pi(x)\) you need to determine \(x\).
    You can find any value for \(x\) that corresponds to a particular mortality rate \(\alpha\) by solving the following equation for \(x_\alpha\) \[\text{logit}(\alpha) = \beta_0 + \beta_1\text{log}_{10}(x_\alpha)\] After a little math you get \[x_\alpha = 10^{(\text{logit}(\alpha) - \beta_0)/\beta_1}\] Plugging in 0.5 for \(\alpha\) and the estimates for \(\beta_0\) and \(\beta_1\) we get \[x_{0.5} = LD_{50} = 10^{(\text{logit}(0.5) + 1.7305)/4.1651} = 10^{0.4155} = 2.603\]

Poisson Regression

A Poisson regression is a generalized linear model with the following components

  • random component: the counts follow a Poisson(\(\lambda\)) distribution.

  • systematic component: a linear predictor \(\eta = {\bf x}^\prime{\bf \beta}\).

  • link function: maps from the mean’s range \((0,\infty)\) to \(\mathbb{R}\).
    The canonical link is the \(\text{log}()\) function.

  • A random variable \(Y\) with Poisson\((\lambda)\) distribution has mean \(\lambda\), variance \(\lambda\) and probability mass function \[\text{Pr}(Y=y) = \frac{\lambda^y}{y!} e^{-\lambda}, \quad y \in \{0,1,2,3,\cdots \}\]

  • As \(\lambda\) increases the Poisson p.m.f. approaches a Normal distribution

Poisson distribution

The following plots show the probability mass functions (pmf) for Poisson distributions with \(\lambda = 2\) and \(\lambda = 5\).

As \(\lambda\) increases, the Poisson pmf looks more symmetric and very similar to a normal distribution. The following graph shows the histogram and the density for 10,000 draws from a Poisson(20) and a Normal(20,20) distribution, respectively. The empirical distributions are very close.
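The closeness for \(\lambda = 20\) can also be checked numerically, e.g. by comparing the Poisson p.m.f. at its mean to the matching normal density (mean 20, variance 20, as above):

```r
lambda <- 20

dpois(lambda, lambda)                             # ~0.0888
dnorm(lambda, mean = lambda, sd = sqrt(lambda))   # ~0.0892

# the two densities agree to about half a percent at the mean
abs(dpois(lambda, lambda) - dnorm(lambda, lambda, sqrt(lambda)))
```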


Poisson or Normal Approximation

Because a Poisson distribution with large \(\lambda\) resembles a Normal distribution, count data are often fitted with standard linear regression methods (lm()).
That is not a good idea, because

  • Poisson count data are not homoscedastic: \(E[Y] = \text{Var}[Y] = \lambda\), and \(\lambda = \lambda(x)\) changes with the inputs.
    \(\Rightarrow\) The equal-variance assumption of linear regression for a continuous target is not met

  • The linear regression model does not guarantee \(\widehat{y} \geq 0\)

  • The predictions of the two models can be very different.

  • Try it out with the bike rental data

Poisson Regression in R

The syntax to fit a Poisson regression is very similar to that for logistic or binomial regression; just add family="poisson" to the glm() call.

Let’s go back to our Seoul bike rental counts and fit a Poisson regression.
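A self-contained sketch of the syntax (with simulated counts, since the bike data's column names are not shown here; for the real data you would swap in its formula and data arguments):

```r
set.seed(3)
n <- 300
temp <- runif(n, 0, 30)                # hypothetical predictor (temperature)
lambda <- exp(1 + 0.05 * temp)         # log link: log(lambda) = 1 + 0.05*temp
counts <- rpois(n, lambda)             # count target

fit <- glm(counts ~ temp, family = "poisson")
coef(fit)                              # estimates should be near (1, 0.05)

# predictions on the response (count) scale are always positive
all(predict(fit, type = "response") > 0)   # TRUE
```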

Members of GLM family

Examples of generalized linear models

Type Distribution Canonical Link \(g(\mu)\) \(g^{-1}(\eta)\)
Binary Bernoulli (Binary) logit \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) \(\frac{1}{1+\exp\{-\eta\}}\)
Count Binomial logit \(\text{ln}\left\{\frac{\mu}{1-\mu}\right\}\) \(\frac{1}{1+\exp\{-\eta\}}\)
Count Poisson log \(\text{ln}\{\mu\}\) \(\exp\{\eta\}\)
Cont’s Normal (Gaussian) identity \(\mu\) \(\eta\)
Cont’s Exponential inverse \(-\frac{1}{\mu}\) \(-\frac{1}{\eta}\)