By the end of this module, you should be able to:
Regression is vast. We focus on statistical learning models with a continuous response:
This module sets the foundation for everything that follows.
Two flavors, same idea:
Simple linear regression (SLR): one input \[Y = \beta_0 + \beta_1 x + \epsilon\]
Multiple linear regression (MLR): many inputs \[Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon\] The \(x_j\) can be completely different variables or transformations of one another, e.g. a \(p\)-th order polynomial model \[ Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \epsilon\]
“Linear” means linear in the parameters — not necessarily a straight line.
Interpretation:
👉 If correlation measures association, can we use it for regression modeling and prediction?
Scatterplots for data sets with different degrees of correlation
STRONG CORRELATION DOES NOT IMPLY LINEAR DEPENDENCE IN THE DATA
Anscombe’s Quartet: Be careful interpreting the numerical value of \(\rho\) without looking at a graph. The following relationships all have \(\rho=0.816\) and the same regression \(y = 3 + \frac{1}{2} x\)
Anscombe quartet examples
Lesson: Do not trust a single number without a plot.
In practice, we have a finite amount of data and need to calculate the correlation’s empirical estimator
Pearson’s correlation coefficient:
\[\displaystyle \begin{array}{ll}\widehat{\rho}_{x,y} & = \frac{1}{n-1} \frac{1}{s_x s_y}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\\ & = \frac{1}{n-1} \sum_{i=1}^n \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right ) \end{array}\] where \(\bar{x}\) is the empirical mean of \(X\) and \(s_x\) its sample standard deviation
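Before turning to the PISA data, here is a minimal sketch (on simulated data, so the numbers are purely illustrative) of how the formula above matches R's built-in cor():

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)
n <- length(x)

# Pearson correlation computed directly from the formula above
rho_hat <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# It agrees with R's built-in estimator
c(by_hand = rho_hat, built_in = cor(x, y))
```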
\(\rightarrow\) Let’s look at some examples in R with the PISA dataset.
Matrix form: \[\begin{array}{ll}{\bf Y}_{(n\times1)} & = {\bf X}_{(n \times 2)}{\bf\beta}_{(2 \times 1)} + {\bf\epsilon}_{(n \times 1)}\\ \left [ \begin{array}{r} Y_1 \\ \vdots \\ Y_n \end{array} \right ] &= \left [ \begin{array}{rr} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right ] \times \left [ \begin{array}{r} \beta_0 \\ \beta_1 \end{array} \right ] + \left [ \begin{array}{r} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right ] \end{array}\] Key question: How does this differ from correlation?
Connection between correlation and linear regression analysis
In SLR, \(X\) is considered fixed (not random)
In correlation analysis both \(X\) and \(Y\) are random variables
In regression we distinguish a target (output) \(Y\) and input \(X\), it is not a symmetric relationship between the variables, whereas \(\text{Corr}(X,Y) = \text{Corr}(Y,X)\)
In SLR, this holds: \(\text{Corr}(Y,\widehat{Y}) = |\text{Corr}(X,Y)|\)
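A quick numerical check of this last identity, on simulated data (illustrative only; note the absolute value, which matters when the slope is negative):

```r
set.seed(2)
x <- runif(100)
y <- 1 - 3 * x + rnorm(100, sd = 0.5)   # negative slope on purpose
fit <- lm(y ~ x)

# cor(y, fitted values) equals |cor(x, y)| in simple linear regression
c(cor_xy = cor(x, y), cor_y_fitted = cor(y, fitted(fit)))
```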
Let’s look at R examples
Based on the output of the R example, fill in values for the following:
\(\widehat{\beta}_0\) =
\(\widehat{\beta}_1\) =
Number of observations used in the analysis =
Number of observations in the data set =
Median of the residuals =
Mean of the residuals =
Estimate of the variance of \(\widehat{\beta}_1\) =
The \(p\)-value for the hypothesis that \(\beta_1 = 0\) =
Estimate of \(\sigma\), the standard deviation of \(\epsilon\) =
The proportion of variability in Y explained by the regression on X =
If logGDPp increases by 2 units, the MathMean changes by =
\({\bf Y}\) ``lives’’ in an \(n\)-dimensional space (it is a vector of length \(n\))
\({\bf X}\) can generate a space of dimension \(\text{rank}({\bf X})\)
Any vector \({\bf X}{\bf \beta}\) also lives in the space generated by \({\bf X}\)
If we have to choose among all possible \({\bf \beta}\)s, we might as well choose the \({\bf \beta}\) that makes \({\bf X}{\bf \beta}\) closest to \({\bf Y}\) (Pythagorean theorem)
That is the ordinary least square (OLS) estimator \(\widehat{{\bf \beta}}\)
\(\widehat{Y}\) is the orthogonal projection of \(Y\) onto the space generated by \({\bf X}\)
Figure 4.5. from Schabenberger & Pierce
Note: two random vectors that are orthogonal are uncorrelated!
If OLS is an orthogonal projection of \({\bf Y}\) onto \({\bf X}\), then \(\widehat{{\bf Y}}\) and \({\bf Y} - {\bf X}\widehat{{\bf \beta}}\) should be orthogonal.
This implies: \(\widehat{{\bf Y}}^\prime\left({\bf Y} - {\bf X}\widehat{{\bf \beta}}\right) = 0\), i.e. the Pythagorean decomposition \(||{\bf Y}||^2 = ||\widehat{{\bf Y}}||^2 + ||{\bf Y} - \widehat{{\bf Y}}||^2\), which underlies the sums-of-squares decomposition used below.
Let’s look at this in an R example
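As a sketch (simulated data, not the PISA example), the orthogonality can be verified numerically:

```r
set.seed(3)
x <- rnorm(60)
y <- 2 + x + rnorm(60)
fit <- lm(y ~ x)

# Residuals are orthogonal to every column of the design matrix ...
crossprod(model.matrix(fit), resid(fit))   # numerically zero
# ... and hence also to the fitted values
sum(fitted(fit) * resid(fit))              # numerically zero
```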
\[R^2 = \frac{\text{SSModel}}{\text{SSTotal}} = 1 - \frac{\text{SSError}}{\text{SSTotal}}\]
Proportion of variance explained by the model over total variance
\(0 \leq R^2 \leq 1\)
\(R^2 = 0\): no linear relationship between Y and X
\(R^2 = 1\): perfect linear relationship between Y and X (SSError = 0)
In SLR & MLR: \(R^2\) is the square of the correlation between \(Y\) and \(\widehat{Y}\)
In SLR: \(R^2 = \text{Corr}(X,Y)^2\)
In MLR: correlation with \(X\) no longer applies
\(R^2\) is infamous: it is a useful statistic, but it should be handled with care
High R² ≠ good model.
Plots still matter.
The following code simulates data where the variance of the errors depends on \(X\) and the mean function is not a straight line.
A simple linear regression model is not appropriate!
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)                # error sd grows with x
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps   # nonlinear mean function
plot(x, y, pch = 20, col = 2)
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53525 -0.18020 0.02811 0.16882 0.46896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87190 0.05860 14.88 <2e-16 ***
## x -1.69268 0.09359 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared: 0.7695, Adjusted R-squared: 0.7671
## F-statistic: 327.1 on 1 and 98 DF, p-value: < 2.2e-16
Why is the \(R^2\) so high, even though the model is clearly wrong?
Common choices: \(\log(x)\), \(\sqrt{x}\), \(1/x\), polynomial terms
A good transformation can make a simple linear model appropriate:
The true model is \(Y = \beta_0 + \beta_1 \log(x) + \epsilon\)
A simple linear regression of \(Y\) on \(X\) does not fit the data (left panel)
A simple linear regression of \(Y\) on \(X^* = \log(X)\) fits well
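A minimal sketch of this situation on simulated data (the coefficients 1 and 2 below are arbitrary choices, not the values behind the figure):

```r
set.seed(4)
x <- runif(200, 0.5, 10)
y <- 1 + 2 * log(x) + rnorm(200, sd = 0.4)   # true model is linear in log(x)

fit_raw <- lm(y ~ x)        # misspecified: straight line in x
fit_log <- lm(y ~ log(x))   # linear after transforming the input

c(R2_raw = summary(fit_raw)$r.squared,
  R2_log = summary(fit_log)$r.squared)

# Residual plots make the difference visible
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "y ~ x")
plot(fitted(fit_log), resid(fit_log), main = "y ~ log(x)")
```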
In Statistical Learning, we want to predict the target for new values of the input and we want to know the uncertainty associated with those predictions.
Suppose we want to predict at value \(x_0\) of the input. We have two choices:
Predict the value of a new observation: \(Y_0 = f(x_0) + \epsilon\)
This results in a random variable
Prediction interval: uncertainty about a new observation
Predict the expected value (the mean) of the target: \(E[Y_0] = f(x_0)\)
This results in a deterministic quantity
Confidence interval: uncertainty about the mean response
YES, both lead to the same estimate but with different respective uncertainties:
Prediction intervals are always wider than confidence ones.
Predicting \(Y_0\) in the simple linear regression model
\(Y_0\) is a random variable: \(Y_0 = \beta_0 + \beta_1 x_0 + \epsilon\)
\(\epsilon\) is a random variable with mean 0 and variance \(\sigma^2\)
An obvious predictor would be \(\widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1 x_0 + \widehat{\epsilon}\); since the best guess for the new error is its mean, \(0\), this is just \(\widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1 x_0\)
Predicting \(E[Y_0]\)
\(E[Y_0]\) is a constant (not a random variable): \(\beta_0 + \beta_1 x_0\)
The estimator is \(\widehat{E}[Y_0]= \widehat{\beta}_0 + \widehat{\beta}_1 x_0\)
Goal: uncertainty about the conditional mean \(\mu(x_0) = \mathbb{E}[Y \mid X = x_0]\)
Linear regression model: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon} \sim (0,\sigma^2\mathbf{I})\)
Estimated mean at \(x_0\): \(\widehat{\mu}(x_0) = \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}}\)
Confidence interval (level \(1-\alpha\)): \(\quad \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}} \;\pm\; t_{n-p,\,1-\alpha/2}\; \widehat{\sigma} \sqrt{\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}\)
Interpretation: over repeated samples, a fraction \(1-\alpha\) of such intervals covers the true mean response \(\mu(x_0)\)
Goal: uncertainty about a future response \(Y_0 = \mu(x_0) + \varepsilon_0,\quad \varepsilon_0 \sim (0,\sigma^2)\)
Point prediction: \(\widehat{Y}_0 = \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}}\)
Prediction interval (level \(1-\alpha\)): \(\quad \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}} \;\pm\; t_{n-p,\,1-\alpha/2}\; \widehat{\sigma}\sqrt{1 + \mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}\)
Interpretation: the interval is built to cover a single future observation \(Y_0\) at \(x_0\) with probability \(1-\alpha\)
\[\text{CI: } \widehat{\mu}(x_0) \pm \text{estimation uncertainty}\\ \text{PI: } \widehat{\mu}(x_0) \pm (\text{estimation} + \text{error})\]
If you are interested in the precision of predicting new unobserved values (individual predictions): → use prediction intervals.
If you care about average trends: → use confidence intervals.
Let’s work on some examples in R.
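As a starting point, here is a minimal sketch on simulated data showing how predict() returns both interval types (the data and the choice \(x_0 = 5\) are arbitrary):

```r
set.seed(5)
x <- runif(100, 0, 10)
y <- 3 + 1.5 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

new_x <- data.frame(x = 5)

# Confidence interval for the mean response at x0 = 5
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Prediction interval for a new observation at x0 = 5 (always wider)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
```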
Same model, bigger X: \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
More precisely
\[\begin{array}{ll} \mathbf{Y}_{(n\times1)} &= \mathbf{X}_{(n \times p+1)}\boldsymbol{\beta}_{(p+1 \times 1)} + \boldsymbol{\epsilon}_{(n \times 1)}\\ \left [ \begin{array}{r} Y_1 \\ \vdots \\ Y_n \end{array} \right ] &= \left [ \begin{array}{rrcr} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{array}\right ] \times \left [ \begin{array}{r} \beta_0 \\ \vdots \\ \beta_p \end{array} \right ] + \left [ \begin{array}{r} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right ] \end{array}\]
Which variables to include? all of them, a subset, a superset (feature engineering)
Redundant predictors (collinearity)
Interpretation of partial effects
What if \(p\) is large? Or \(p \gg n\)?
We are less worried about the linearity of the output with respect to any single input; what matters is the joint effect of the inputs on the target. There may be some nonlinearity between the response and one input, but once the other inputs are in the model the assumptions might still be met.
Some inputs might be qualitative (factors), e.g. when interesting comparisons among groups are to be made
How do we interpret and diagnose a model with multiple parameters?
This is where statistical learning really begins!
The OLS estimator of the model parameters is, as before, \[\widehat{{\bf \beta}} = \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y}\]
It is an unbiased estimator for \({\bf \beta}\) with variance \(\sigma^2 \left ( {\bf X}^\prime {\bf X} \right) ^{-1}\)
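A sketch of the estimator computed directly from its matrix formula, checked against lm() on simulated data (the coefficients are arbitrary):

```r
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)   # design matrix with an intercept column

# OLS estimator (X'X)^{-1} X'Y, computed via solve() for the sketch
beta_hat <- solve(crossprod(X), crossprod(X, y))

cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ x1 + x2)))
```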
Do we interpret the \(R^2\) value any different than in the SLR model?
No, it still measures the proportion of variation in Y that is explained by the model terms
No, it still measures the squared correlation between \(y\) and \(\widehat{y}\)
Yes, in that it no longer describes the strength of the relationship between \(Y\) and any of the individual input variables
In summary(lm()), R provides the \(R^2_{adj}\), which is “independent” of \(p\) (the number of predictors)
\[R^2 = \frac{\text{SSModel}}{\text{SSTotal}} = 1 - \frac{\text{SSError}}{\text{SSTotal}} = 1 - \frac{\widehat{\sigma}^2}{\text{SSTotal}}(n-p-1)\] Whereas the adjusted \(R^2\)
\[R^2_{adj} = 1 - \frac{\text{SSError}/(n-p-1)}{\text{SSTotal}/(n-1)} = 1 - \frac{\widehat{\sigma}^2}{\text{SSTotal}}(n-1)\]
\(R^2_{adj}\) is only nominally independent of \(p\); in practice the sums of squares still reflect the inclusion of more predictors.
Moreover, \(R^2_{adj}\) has a very high variance!
That means you should still be cautious when using \(R^2_{adj}\).
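A small illustration of this caution, on simulated data: adding pure-noise predictors inflates \(R^2\) but not (by much) \(R^2_{adj}\).

```r
set.seed(7)
n <- 50
x <- rnorm(n)
y <- 1 + x + rnorm(n)
noise <- matrix(rnorm(n * 5), n, 5)   # 5 predictors unrelated to y

fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + noise)

# R^2 can only increase when predictors are added; R^2_adj penalizes this
c(R2_small    = summary(fit_small)$r.squared,
  R2_big      = summary(fit_big)$r.squared,
  R2adj_small = summary(fit_small)$adj.r.squared,
  R2adj_big   = summary(fit_big)$adj.r.squared)
```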
ANOVA compares explained variability to unexplained variability.
Two equivalent viewpoints:
Group comparison view (classical ANOVA)
Are the mean responses the same across groups, judged by comparing between-group and within-group variability?
Example: Do average test scores differ across teaching methods?
Model comparison view (regression perspective)
Does adding predictors to a model significantly improve fit?
Example: Does including predictors improve prediction compared to an intercept-only model?
ANOVA is a tool for comparing nested models
ANOVA decomposes the total variability: Total SS = Explained SS + Unexplained SS
- It tests whether the explained sum of squares is large relative to the unexplained (residual) sum of squares, after accounting for degrees of freedom
- That ratio gives the F-statistic, which we use to test relevant sets of predictors
Under linear regression assumptions, the variance of \(Y\) is decomposed into two parts, corresponding to the regression and to the error, respectively.
This decomposition is called the ANalysis Of VAriance (ANOVA).
\[\begin{array}{ll} ||{\bf Y}||^2 &= ||{\bf X}\widehat{{\bf \beta}}||^2 + ||{\bf Y} - {\bf X}\widehat{{\bf \beta}}||^2\\ {\bf Y}^\prime {\bf Y} &= \widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} + \left({\bf Y} - {\bf X}\widehat{{\bf \beta}}\right)^\prime \left({\bf Y} - {\bf X}\widehat{{\bf \beta}} \right)\\ {\bf Y}^\prime {\bf Y} &= \widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} + \left({\bf Y}^\prime{\bf Y} - \widehat{{\bf \beta}}^\prime{\bf X}^\prime{\bf Y}\right) \\ \text{SSTotal} &= \text{SSModel} + \text{SSError} \end{array}\]
Interpretation:
Corrected SSTotal: the total variability in the output; it does not account for any inputs
Corrected SSModel: The variability in \({\bf Y}\) explained by the inputs
SSError (RSS): The unexplained residual variability
Degrees of freedom: Represent the number of pieces of information contributing to the sums-of-squares term
Mean Squares (MS): Divide a sum of squares by its degrees of freedom
ANOVA is just regression, summarized.
This table is valid for SLR and MLR
If the model contains an intercept, we adjust the sum of squares for the presence of the intercept
The total and model sums-of-squares are then said to be corrected for the mean
The standard ANOVA table looks as follows
| Source | Degrees of freedom (df) | Sum of Squares (SS) | Mean Squares (MS) |
|---|---|---|---|
| Model | \(\mbox{rank}({\bf X})-1\) | \(\widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} - n\bar{{\bf Y}}^2\) | SSModel/\((\mbox{rank}({\bf X})-1)\) |
| Error (RSS) | \(n-\mbox{rank}({\bf X})\) | \({\bf Y}^\prime{\bf Y} - \widehat{{\bf \beta}}^\prime{\bf X}^\prime{\bf Y}\) | SSError/\((n-\mbox{rank}({\bf X}))\) |
| Total | \(n-1\) | \({\bf Y}^\prime{\bf Y} - n\bar{{\bf Y}}^2\) | |
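The decomposition in the table can be verified numerically; a sketch on simulated data:

```r
set.seed(8)
x <- rnorm(80)
y <- 2 + 3 * x + rnorm(80)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)             # corrected total SS
ss_model <- sum((fitted(fit) - mean(y))^2)   # corrected model SS
ss_error <- sum(resid(fit)^2)                # RSS

c(model_plus_error = ss_model + ss_error, total = ss_total)   # equal
anova(fit)   # the same quantities, in table form
```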
Why divide by \((n-\mbox{rank}({\bf X}))\)?
Because estimating parameters uses information.
Example: Consider the estimation of the sample variance in a sample of size \(n\)
The usual estimate is \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2\)
If we knew \(\mu\), the correct estimator would be \(s^2_{*} = \frac{1}{n} \sum_{i=1}^n (y_i - \mu)^2\)
Why are we dividing by \(n-1\) instead of \(n\)?
We estimated one parameter of the population: \(\bar{y}\) as the estimate of the population mean \(\mu\)
If \(\bar{y} = 4.5\), you can choose \(n-1\) arbitrary values. Given your \(n-1\) choices, the \(n\)th value is now fixed
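A quick simulation (arbitrary settings: \(n = 10\), \(\sigma^2 = 4\)) illustrating why the divisor \(n-1\) is used: dividing by \(n\) systematically underestimates the variance.

```r
set.seed(9)
sigma2 <- 4
sims <- replicate(10000, {
  y <- rnorm(10, mean = 0, sd = sqrt(sigma2))
  c(div_n_minus_1 = sum((y - mean(y))^2) / 9,
    div_n         = sum((y - mean(y))^2) / 10)
})

rowMeans(sims)   # close to 4 with n-1; noticeably below 4 with n
```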
If we have \(k\) predictors, a naive procedure would be to check all the possible models that can be constructed with them and then select the best one in terms of a selected criterion (prediction, AIC/BIC, ..) \(\rightarrow\) This is computationally infeasible in large data contexts.
Why it matters:
Avoid overfitting / Balance bias and variance
Improve model interpretability
We want to avoid having more predictors than data points (\(p \ll n\))
\(R^2\) increases with the number of predictors (a deceiving metric!)
Approach:
Forward: starts from the simplest model, adds predictors sequentially
Backward: starts from the full (all predictors) model, removes predictors sequentially
A mix of the above
Methods:
ANOVA: sum of squares reduction testing
Other criteria: AIC/BIC, via R built-in functions such as step() (a sketch follows below)
We will talk about AIC/BIC in a future class
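A minimal sketch of stepwise selection with R's built-in step(), using the Boston data also used below (the backward direction and the default AIC criterion are just one possible choice):

```r
library(MASS)   # provides the Boston data set

fit_full <- lm(medv ~ ., data = Boston)

# Backward stepwise selection based on AIC (more on AIC/BIC later)
fit_step <- step(fit_full, direction = "backward", trace = FALSE)
formula(fit_step)
```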
# Example: fit two nested models and compare them with anova()
library(MASS)   # provides the Boston data set
fit1 <- lm(medv ~ lstat, data = Boston)
fit2 <- lm(medv ~ lstat + nox + ptratio, data = Boston)
# Compare models using ANOVA
anova_result <- anova(fit1, fit2)
print(anova_result)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + nox + ptratio
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 502 16802 2 2670.1 39.888 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can test whether a reduced model (i.e. one with fewer predictors) fits the data significantly worse (in the statistical sense).
Suppose you have two models: a full model \(f({\bf X})\) and a reduced model \(g({\bf X})\). Here “full” means the model has more predictors than the “reduced” one.
The statistic \[F = \frac{(\text{SSError}_r - \text{SSError}_f)/k}{\text{MSError}_f}\] follows, under the null hypothesis, a Snedecor \(F_{k,\,n-\mbox{rank}({\bf X}_f)}\) distribution.
The subscripts \(r\) and \(f\) refer to the reduced and the full model, respectively, and \(k\) is the number of excluded predictors between full and reduced models.
Note: the reduced model is nested within the full model
We are testing whether the predictors added when going from the reduced to the full model are significantly different from \(0\).
In the above example, we are testing:
\[H_0: \beta_{nox}=0 \quad \& \quad \beta_{ptratio}=0 \quad\text{VS}\quad H_1: \beta_{nox}\neq 0 \quad\text{or}\quad \beta_{ptratio}\neq0 \]
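The F statistic from the anova(fit1, fit2) output above can be reproduced by hand; a sketch:

```r
library(MASS)   # Boston data, as above

fit1 <- lm(medv ~ lstat, data = Boston)                   # reduced model
fit2 <- lm(medv ~ lstat + nox + ptratio, data = Boston)   # full model

sse_r <- sum(resid(fit1)^2)
sse_f <- sum(resid(fit2)^2)
k     <- 2                    # predictors dropped from full to reduced
df_f  <- df.residual(fit2)    # n - rank(X_f)

F_stat <- ((sse_r - sse_f) / k) / (sse_f / df_f)
c(F = F_stat, p_value = pf(F_stat, k, df_f, lower.tail = FALSE))
# Matches the F value and p-value reported by anova(fit1, fit2)
```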
The anova() function can also be used on a single model.
In that case, it performs the analysis sequentially, on an increasing set of predictors: one predictor is added at a time to the already selected ones (in their order of appearance in the model formula).
If you change the predictor order, the ANOVA result changes.
# Example: sequential ANOVA on a single fitted model
fit <- lm(medv ~ lstat + nox + ptratio, data = Boston)

# Sequential (type I) sums of squares, one predictor added at a time
anova(fit)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## lstat 1 23243.9 23243.9 694.4570 <2e-16 ***
## nox 1 4.8 4.8 0.1433 0.7052
## ptratio 1 2665.3 2665.3 79.6318 <2e-16 ***
## Residuals 502 16802.3 33.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm vs anova: be careful
summary(lm()) → t-tests for individual coefficients
anova(lm()) → F-tests for sets of coefficients
For a single predictor, the t-test and the F-test are equivalent (\(F = t^2\)); this is not true in the multiple regression context.
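A quick check of \(F = t^2\) in the single-predictor case, using the Boston model from above:

```r
library(MASS)
fit <- lm(medv ~ lstat, data = Boston)

t_val <- summary(fit)$coefficients["lstat", "t value"]
F_val <- anova(fit)["lstat", "F value"]

c(t_squared = t_val^2, F = F_val)   # identical for a single predictor
```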
Exploration tools in R (response: medv):
Key lesson:
Interpretation becomes harder as models become more powerful.
Regression is:
And most importantly:
A modeling conversation between data, assumptions, and purpose.