Intro

Learning goals

By the end of this module, you should be able to:

  • Model continuous targets using regression
  • Connect correlation, regression, and ANOVA (same story, different lenses)
  • Understand least squares through geometry, not just formulas
  • Fit, interpret, and diagnose regression models in R
  • Interpret confidence vs prediction intervals
  • See when and why transformations matter
  • Build and compare multiple linear regression models

Scope

Regression is vast. We focus on statistical learning models with a continuous response:

  • Continuous outcomes (e.g. weight, test scores) → linear & nonlinear regression
  • Counts (e.g. number of events) → Poisson regression (later in this class)
  • Binary outcomes (0/1) → Logistic regression (later in this class)

This module sets the foundation for everything that follows.

Linear regression: the big picture

Two flavors, same idea:

  • Simple linear regression (SLR): one input \[Y = \beta_0 + \beta_1 x + \epsilon\]

  • Multiple linear regression (MLR): many inputs \[Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon\] The \(x_j\)'s can be completely different variables or transformations of one another, e.g. a \(p\)th-order polynomial model \[ Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \epsilon\]

“Linear” means linear in the parameters — not necessarily a straight line.

Correlation (a useful detour)

Interpretation: the correlation coefficient \(\rho\) measures the strength and direction of the linear association between two variables, with \(-1 \leq \rho \leq 1\).

👉 If correlation measures association, can we use it for regression modeling and prediction?

Correlation Examples

Scatterplots for data sets with different degrees of correlation

[Figure: examples of different degrees of correlation]

A warning: Anscombe’s Quartet

STRONG CORRELATION DOES NOT IMPLY LINEAR DEPENDENCE IN THE DATA

Anscombe’s Quartet: Be careful interpreting the numerical value of \(\rho\) without looking at a graph. The following relationships all have \(\rho=0.816\) and the same regression \(y = 3 + \frac{1}{2} x\)

[Figure: the four Anscombe quartet data sets]

Lesson: Do not trust a single number without a plot.

Estimating correlation

In practice, we only observe a finite sample, so we compute an empirical estimate of the correlation

Pearson’s correlation coefficient:

\[\displaystyle \begin{array}{ll}\widehat{\rho}_{x,y} & = \frac{1}{n-1} \frac{1}{s_x s_y}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\\ & = \frac{1}{n-1} \sum_{i=1}^n \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right ) \end{array}\] where \(\bar{x}\) is the empirical mean of \(X\) and \(s_x\) its sample standard deviation

\(\rightarrow\) Let’s look at some examples in R with the PISA dataset.
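
For instance, a minimal sketch (the data frame name pisa and the column names MathMean and logGDPp are assumptions, following the PISA example used later in this module):

# Hypothetical `pisa` data frame with columns MathMean and logGDPp
x <- pisa$logGDPp
y <- pisa$MathMean

# Built-in Pearson correlation (complete pairs only)
cor(x, y, use = "complete.obs")

# The same estimator coded by hand
cc <- complete.cases(x, y)
xs <- x[cc]; ys <- y[cc]
sum((xs - mean(xs)) / sd(xs) * (ys - mean(ys)) / sd(ys)) / (length(xs) - 1)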

Simple Linear Regression (SLR)

Model setup

  • Continuous response \(Y\)
  • One predictor \(X\)
  • Errors: mean \(0\), constant variance

Matrix form: \[\begin{array}{ll}{\bf Y}_{(n\times1)} & = {\bf X}_{(n \times 2)}{\bf\beta}_{(2 \times 1)} + {\bf\epsilon}_{(n \times 1)}\\ \left [ \begin{array}{r} Y_1 \\ \vdots \\ Y_n \end{array} \right ] &= \left [ \begin{array}{rr} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{array}\right ] \times \left [ \begin{array}{r} \beta_0 \\ \beta_1 \end{array} \right ] + \left [ \begin{array}{r} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right ] \end{array}\] Key question: How does this differ from correlation?

Correlation vs regression

Connection between correlation and linear regression analysis

  • In SLR, \(X\) is considered fixed (not random)

  • In correlation analysis both \(X\) and \(Y\) are random variables

  • In regression we distinguish a target (output) \(Y\) and an input \(X\); the relationship is not symmetric, whereas \(\text{Corr}(X,Y) = \text{Corr}(Y,X)\)

  • In SLR, this holds: \(\text{Corr}(X,Y) = \text{Corr}(Y,\widehat{Y})\)

PISA scores example

Let’s look at R examples
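
A minimal sketch of the kind of fit shown in the R session (the data frame pisa and its columns MathMean and logGDPp are assumed names, following the questions below):

# Hypothetical PISA data frame `pisa`; column names follow the questions below
fit <- lm(MathMean ~ logGDPp, data = pisa)

summary(fit)          # coefficient estimates, standard errors, t-tests,
                      # residual standard error (estimate of sigma), R-squared
nobs(fit)             # number of observations actually used in the fit
nrow(pisa)            # number of observations in the data set
mean(residuals(fit))  # numerically zero when the model has an intercept
vcov(fit)["logGDPp", "logGDPp"]   # estimated variance of beta1-hat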

Based on the output on the R example, fill in values for the following

\(\widehat{\beta}_0\) =
\(\widehat{\beta}_1\) =

Number of observations used in the analysis =
Number of observations in the data set =

Median of the residuals =
Mean of the residuals =

Estimate of the variance of \(\widehat{\beta}_1\) =
The \(p\)-value for the hypothesis that \(\beta_1 = 0\):

Estimate of \(\sigma\), the standard deviation of \(\epsilon\) =

The proportion of variability in Y explained by the regression on X =

If logGDPp increases by 2 units, the MathMean changes by =

Geometry of least squares

\({\bf Y}\) “lives” in an \(n\)-dimensional space (it is a vector of length \(n\))

\({\bf X}\) can generate a space of dimension \(\text{rank}({\bf X})\)

Any vector \({\bf X}{\bf \beta}\) also lives in the space generated by \({\bf X}\)

If we have to choose among all possible \({\bf \beta}\)s, we might as well choose the one that makes \({\bf X}{\bf \beta}\) closest to \({\bf Y}\) (Pythagorean theorem)

That is the ordinary least square (OLS) estimator \(\widehat{{\bf \beta}}\)

\(\widehat{Y}\) is the orthogonal projection of \(Y\) onto the space generated by \({\bf X}\)

[Figure 4.5 from Schabenberger & Pierce]

Note: two random vectors that are orthogonal are uncorrelated!

If OLS is an orthogonal projection of \({\bf Y}\) onto \({\bf X}\), then \(\widehat{{\bf Y}}\) and \({\bf Y} - {\bf X}\widehat{{\bf \beta}}\) should be orthogonal.

This implies:

  • Residuals ⟂ fitted values
  • Residuals are uncorrelated with predictors

Let’s look at this in an R example
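
Here is a minimal numerical check of these orthogonality claims on simulated data (the numbers are illustrative only):

set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.3)
fit <- lm(y ~ x)

sum(residuals(fit) * fitted(fit))  # ~ 0: residuals orthogonal to fitted values
sum(residuals(fit) * x)            # ~ 0: residuals orthogonal to the predictor
sum(residuals(fit))                # ~ 0: orthogonal to the intercept column
cor(residuals(fit), fitted(fit))   # ~ 0: uncorrelated as well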

R² (friend or enemy)

\[R^2 = \frac{\text{SSModel}}{\text{SSTotal}} = 1 - \frac{\text{SSError}}{\text{SSTotal}}\]

Proportion of variance explained by the model over total variance

  • \(0 \leq R^2 \leq 1\)

  • \(R^2 = 0\): no linear relationship between Y and X

  • \(R^2 = 1\): perfect linear relationship between Y and X (SSError = 0)

  • In SLR & MLR: \(R^2\) is the square of the correlation between \(Y\) and \(\widehat{Y}\)

  • In SLR: \(R^2 = \text{Corr}(X,Y)^2\)

  • In MLR: the identity \(R^2 = \text{Corr}(X,Y)^2\) no longer applies, since there are several inputs

Why R² is dangerous

\(R^2\) is infamous: it is a useful statistic, but it must be handled with care

  • \(R^2\) always increases when you add predictors, see MLR slides
  • Can be large for a bad model
  • “Chasing \(R^2\)” invariably leads to overfitting
  • Says nothing about model assumptions
  • It is OK to leave variability unexplained in the training data

High R² ≠ good model.

Plots still matter.

Example: Bad model, Great \(R^2\)

The following code simulates data where the variance of the errors depends on \(X\) and the mean function is not a straight line.

A simple linear regression model is not appropriate!

x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)                # error SD grows with x (heteroscedastic)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps   # nonlinear mean function
plot(x, y, pch = 20, col = 2)

reg <- lm(y ~ x)
summary(reg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53525 -0.18020  0.02811  0.16882  0.46896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.87190    0.05860   14.88   <2e-16 ***
## x           -1.69268    0.09359  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7671 
## F-statistic: 327.1 on 1 and 98 DF,  p-value: < 2.2e-16

Why is the \(R^2\) so high?

Transforming predictors

  • Linear model ≠ straight line
  • Some nonlinear effects can be handled via transformations
  • Some heteroscedasticity can be handled via transformations

Common choices:

  • log(X), X², \(\sqrt{X}\)

Good transformations:

  • Improve fit
  • Stabilize variance
  • Simplify interpretation

The true model is \(Y = \beta_0 + \beta_1 \log(x) + \epsilon\)

A simple linear regression of \(Y\) on \(X\) does not fit the data (left panel)

A simple linear regression of \(Y\) on \(X^* = \log(X)\) fits well
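
A short simulation sketch of this scenario (the parameter values are arbitrary):

set.seed(42)
x <- runif(200, min = 0.5, max = 10)
y <- 2 + 3 * log(x) + rnorm(200, sd = 0.5)   # data generated from the log model

fit_raw <- lm(y ~ x)        # straight line in x: misspecified
fit_log <- lm(y ~ log(x))   # linear in log(x): matches the true model

par(mfrow = c(1, 2))
plot(x, y, pch = 20, main = "y vs x"); abline(fit_raw, col = 2)
plot(log(x), y, pch = 20, main = "y vs log(x)"); abline(fit_log, col = 2)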

Prediction vs Confidence Intervals

In Statistical Learning, we want to predict the target for new values of the input and we want to know the uncertainty associated with those predictions.

Suppose we want to predict at a value \(x_0\) of the input. We have two choices:

  • Predict the value of a new observation: \(Y_0 = f(x_0) + \epsilon\)
    This results in a random variable
    Prediction interval: uncertainty about a new observation

  • Predict the expected value (the mean) of the target: \(E[Y_0] = f(x_0)\)
    This results in a deterministic quantity
    Confidence interval: uncertainty about the mean response

  • YES, both lead to the same estimate but different respective uncertainties:
    Prediction intervals are always wider than confidence ones.

Predicting \(Y_0\) in the simple linear regression model

  • \(Y_0\) is a random variable: \(Y_0 = \beta_0 + \beta_1 x_0 + \epsilon\)

  • \(\epsilon\) is a random variable with mean 0 and variance \(\sigma^2\)

  • An obvious predictor would be \(\widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1 x_0 + \widehat{\epsilon}_0\)

  • But the best prediction of a new error is its mean, \(0\), so that is just \(\widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1 x_0\)

Predicting \(E[Y_0]\)

  • \(E[Y_0]\) is a constant (not a random variable): \(\beta_0 + \beta_1 x_0\)

  • The estimator is \(\widehat{E}[Y_0]= \widehat{\beta}_0 + \widehat{\beta}_1 x_0\)

Confidence Interval for Mean Response

Goal: uncertainty about the conditional mean \(\mu(x_0) = \mathbb{E}[Y \mid X = x_0]\)

Linear regression model: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon} \sim (0,\sigma^2\mathbf{I})\)

Estimated mean at \(x_0\): \(\widehat{\mu}(x_0) = \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}}\)

Confidence interval (level \(1-\alpha\)): \(\quad \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}} \;\pm\; t_{n-p,\,1-\alpha/2}\; \widehat{\sigma} \sqrt{\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}\)

Interpretation

  • Uncertainty about the regression function
  • Depends only on estimation uncertainty
  • Narrower where data are dense

Prediction Interval for New Obs.

Goal: uncertainty about a future response \(Y_0 = \mu(x_0) + \varepsilon_0,\quad \varepsilon_0 \sim (0,\sigma^2)\)

Point prediction: \(\widehat{Y}_0 = \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}}\)

Prediction interval (level \(1-\alpha\)): \(\quad \mathbf{x}_0^\top \widehat{\boldsymbol{\beta}} \;\pm\; t_{n-p,\,1-\alpha/2}\; \widehat{\sigma}\sqrt{1 + \mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}\)

Interpretation

  • Uncertainty about an individual outcome
  • Always wider than confidence intervals
  • Includes irreducible noise (\(+1\) term)

Practical takeaway

\[\text{CI: } \widehat{\mu}(x_0) \pm \text{estimation uncertainty}\\ \text{PI: } \widehat{\mu}(x_0) \pm (\text{estimation} + \text{error})\]

If you are interested in the precision of predicting new unobserved values/individual predictions:
→ use prediction intervals.

If you care about average trends:
→ use confidence intervals.

Prediction examples in R

Let’s work on some examples
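
As a starting point, a sketch using predict() on the Boston data (the MASS package is assumed to be installed; the data set is used again later in this module):

library(MASS)   # for the Boston data set

fit <- lm(medv ~ lstat, data = Boston)
new_x <- data.frame(lstat = c(5, 10, 20))

# Confidence intervals: uncertainty about the mean response at these lstat values
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Prediction intervals: uncertainty about a new individual observation (wider)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)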

Multiple Linear Regression (MLR)

From SLR to MLR

Same model, bigger X: \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)

More precisely

\[\begin{array}{ll} \mathbf{Y}_{(n\times1)} &= \mathbf{X}_{(n \times p+1)}\boldsymbol{\beta}_{(p+1 \times 1)} + \boldsymbol{\epsilon}_{(n \times 1)}\\ \left [ \begin{array}{r} Y_1 \\ \vdots \\ Y_n \end{array} \right ] &= \left [ \begin{array}{rrcr} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{array}\right ] \times \left [ \begin{array}{r} \beta_0 \\ \vdots \\ \beta_p \end{array} \right ] + \left [ \begin{array}{r} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right ] \end{array}\]

New challenges in MLR

  • Which variables to include? all of them, a subset, a superset (feature engineering)

  • Redundant predictors (collinearity)

  • Interpretation of partial effects

  • What if \(p\) is large? Or \(p \gg n\)?

  • We are less worried about the linearity of the response with respect to any single input; what matters is the joint effect of the inputs on the target. There may be some nonlinearity between the response and one input, but once other inputs are in the model the assumptions may be met.

  • Some inputs might be qualitative (factors),
    e.g. when interesting comparisons among groups are to be made

  • How do we interpret and diagnose a model with multiple parameters?

This is where statistical learning really begins!

OLS for MLR

The OLS estimator of the model parameters is, as before, \[\widehat{{\bf \beta}} = \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y}\]

It is an unbiased estimator for \({\bf \beta}\) with variance \(\sigma^2 \left ( {\bf X}^\prime {\bf X} \right) ^{-1}\)
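
A quick numerical check on simulated data that \(\left({\bf X}^\prime{\bf X}\right)^{-1}{\bf X}^\prime{\bf Y}\) reproduces what lm() computes:

set.seed(7)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                             # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))            # the two columns agree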

\(R^2\) for MLR

Do we interpret the \(R^2\) value any different than in the SLR model?

  • No, it still measures the proportion of variation in Y that is explained by the model terms

  • No, it still measures the squared correlation between \(y\) and \(\widehat{y}\)

  • Yes, in that it no longer describes the strength of the relationship between \(Y\) and any of the individual input variables

\(R^2\) versus \(R^2_{adj}\) for MLR

In summary(lm()), R provides the \(R^2_{adj}\), which is “independent” of \(p\) (number of predictors)

\[R^2 = \frac{\text{SSModel}}{\text{SSTotal}} = 1 - \frac{\text{SSError}}{\text{SSTotal}} = 1 - \frac{\widehat{\sigma}^2}{\text{SSTotal}}(n-p-1)\] whereas the adjusted \(R^2\) is

\[R^2_{adj} = 1 - \frac{\text{SSError}/(n-p-1)}{\text{SSTotal}/(n-1)} = 1 - \frac{\widehat{\sigma}^2}{\text{SSTotal}}(n-1)\]

  • \(R^2_{adj}\) is only explicitly independent of \(p\); in practice the sums of squares still reflect the inclusion of more predictors.

  • Moreover, \(R^2_{adj}\) has a very high variance!

  • That means you should still be cautious when using \(R^2_{adj}\).
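
A small sketch (simulated data) computing both quantities by hand and comparing them with summary(lm()):

set.seed(11)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)    # x2 is pure noise

fit <- lm(y ~ x1 + x2)
p   <- length(coef(fit)) - 1
ss_err <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)

1 - ss_err / ss_tot                              # R^2, same as summary(fit)$r.squared
1 - (ss_err / (n - p - 1)) / (ss_tot / (n - 1))  # adjusted R^2, same as summary(fit)$adj.r.squared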

ANOVA: the variance decomposition

ANOVA compares explained variability to unexplained variability.

Two equivalent viewpoints:

  • Group comparison view (classical ANOVA)
    Are the mean responses the same across groups via their variance?
    Example: Do average test scores differ across teaching methods?

  • Model comparison view (regression perspective)
    Does adding predictors to a model significantly improve fit?
    Example: Does including predictors improve prediction compared to an intercept-only model?
    ANOVA is a tool for comparing nested models

ANOVA decomposes the total variability: Total SS = Explained SS + Unexplained SS

  • It tests whether the explained sum of squares is large relative to the unexplained (residual) sum of squares, after accounting for degrees of freedom
  • That ratio gives the F-statistic, which we use to test relevant sets of predictors

One equation to rule them all

Under the linear regression assumptions, the total variability of \(Y\) is decomposed into two parts, one corresponding to the regression and one to the error.

This decomposition is called the ANalysis Of VAriance (ANOVA).

\[\begin{array}{ll} ||{\bf Y}||^2 &= ||{\bf X}\widehat{{\bf \beta}}||^2 + ||{\bf Y} - {\bf X}\widehat{{\bf \beta}}||^2\\ {\bf Y}^\prime {\bf Y} &= \widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} + \left({\bf Y} - {\bf X}\widehat{{\bf \beta}}\right)^\prime \left({\bf Y} - {\bf X}\widehat{{\bf \beta}} \right)\\ {\bf Y}^\prime {\bf Y} &= \widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} + \left({\bf Y}^\prime{\bf Y} - \widehat{{\bf \beta}}^\prime{\bf X}^\prime{\bf Y}\right) \\ \text{SSTotal} &= \text{SSModel} + \text{SSError} \end{array}\]

Interpretation:

  • Total variability = explained (model with inputs) + unexplained (error)
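
A quick numerical check of the decomposition on simulated data (here in its mean-corrected form, as used in the ANOVA table below):

set.seed(3)
x <- rnorm(80)
y <- 2 + x + rnorm(80)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)            # corrected total SS
ss_model <- sum((fitted(fit) - mean(y))^2)  # corrected model SS
ss_error <- sum(residuals(fit)^2)           # residual SS

ss_total
ss_model + ss_error                         # identical, up to rounding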

ANOVA table (decoded)

Corrected SSTotal: the total variability in the output; it does not account for any inputs

Corrected SSModel: The variability in \({\bf Y}\) explained by the inputs

SSError (RSS): The unexplained residual variability

Degrees of freedom: Represent the number of pieces of information contributing to the sums-of-squares term

Mean Squares (MS): Divide a sum of squares by its degrees of freedom

ANOVA is just regression, summarized.

This table is valid for SLR and MLR

If the model contains an intercept, we adjust the sum of squares for the presence of the intercept

The total and model sums-of-squares are then said to be corrected for the mean

The standard ANOVA table looks as follows

Source | Degrees of freedom (df) | Sum of Squares (SS) | Mean Squares (MS)
Model | \(\mbox{rank}({\bf X})-1\) | \(\widehat{{\bf \beta}}^\prime{\bf X}^\prime {\bf Y} - n\bar{{\bf Y}}^2\) | SSModel/\((\mbox{rank}({\bf X})-1)\)
Error (RSS) | \(n-\mbox{rank}({\bf X})\) | \({\bf Y}^\prime{\bf Y} - \widehat{{\bf \beta}}^\prime{\bf X}^\prime{\bf Y}\) | SSError/\((n-\mbox{rank}({\bf X}))\)
Total | \(n-1\) | \({\bf Y}^\prime{\bf Y} - n\bar{{\bf Y}}^2\) | 

ANOVA compares explained variability to unexplained variability.

MSE and degrees of freedom

  • Training MSE: average squared residual
  • MSError: adjusted for parameters already estimated

Why divide by \((n-\mbox{rank}({\bf X}))\)?

Because estimating parameters uses information.

Degrees of freedom are confusing

Example: Consider the estimation of the sample variance in a sample of size \(n\)

The usual estimate is \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2\)

If we knew \(\mu\), the correct estimator would be \(s^2_{*} = \frac{1}{n} \sum_{i=1}^n (y_i - \mu)^2\)

Why are we dividing by \(n-1\) instead of \(n\)?

  • We estimated one parameter of the population: \(\bar{y}\) as the estimate of the population mean \(\mu\)

  • If \(\bar{y} = 4.5\), you can choose \(n-1\) arbitrary values. Given your \(n-1\) choices, the \(n\)th value is now fixed
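
A quick illustration in R (var() already uses the \(n-1\) divisor):

set.seed(5)
y <- rnorm(30, mean = 10, sd = 2)

var(y)                                   # R's var() divides by n - 1
sum((y - mean(y))^2) / (length(y) - 1)   # same value
sum((y - mean(y))^2) / length(y)         # the divide-by-n version is smaller (biased)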

Predictor Selection in Multiple Linear Regression

If we have \(k\) predictors, a naive procedure would be to check all the possible models that can be built from them and then select the best one according to a chosen criterion (prediction error, AIC/BIC, ...)  \(\rightarrow\) This is computationally infeasible in large data contexts.

  • Why it matters:
    Avoid overfitting / Balance bias and variance
    Improve model interpretability
    We want to avoid having more predictors than data points (\(p \ll n\))
    \(R^2\) increases with the number of predictors (deceiving metric!)

  • Approach:
    Forward: starts from the simplest model, adds predictors sequentially
    Backward: starts from the full (all predictors) model, removes predictors sequentially
    A Mix of the above

  • Methods:
    ANOVA: sum of squares reduction testing
    Other criteria such as AIC/BIC (via R's built-in step() function for stepwise selection)
    We will talk about AIC/BIC in a future class

ANOVA for Model Comparison

  • Purpose: Compare nested models
    Perform a sequence of sum of squares reduction tests
  • R function: anova()
# Example: fit two nested models (Boston data from the MASS package)
library(MASS)
fit1 <- lm(medv ~ lstat, data = Boston)
fit2 <- lm(medv ~ lstat + nox + ptratio, data = Boston)

# Compare models using ANOVA
anova_result <- anova(fit1, fit2)
print(anova_result)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + nox + ptratio
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    504 19472                                  
## 2    502 16802  2    2670.1 39.888 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA: Sum of Squares Reduction Test

We can test whether a reduced model (i.e. one with fewer predictors) fits the data significantly worse (in the statistical sense).

Suppose you have two models: full model \(f({\bf X})\) and reduced model \(g({\bf X})\),
Here “full” means the model has more predictors than the “reduced” one.

  • The statistic \[F = \frac{(\text{SSError}_r - \text{SSError}_f)/k}{\text{MSError}_f}\] follows a Snedecor \(F_{k,\,n-\text{rank}({\bf X}_f)}\) distribution, where the denominator degrees of freedom are those of the full model's error term.

  • The subscripts \(r\) and \(f\) refer to the reduced and the full model, respectively, and \(k\) is the number of predictors excluded from the full model to obtain the reduced one.

  • Note: the reduced model is nested within the full model

We are testing whether the predictors added when going from the reduced to the full model are significantly different from \(0\).

In the above example, we are testing:

\[H_0: \beta_{nox}=0 \quad \& \quad \beta_{ptratio}=0 \quad\text{VS}\quad H_1: \beta_{nox}\neq 0 \quad\text{or}\quad \beta_{ptratio}\neq0 \]

  • If \(H_0\) is true, then \(F\) is expected to be small since \((\text{SSError}_r - \text{SSError}_f)\) will be close to zero \(\rightarrow\) We will use p-values to decide about \(H_0\)
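
The F statistic reported by anova(fit1, fit2) above can be reproduced by hand from this formula:

library(MASS)
fit1 <- lm(medv ~ lstat, data = Boston)                  # reduced model
fit2 <- lm(medv ~ lstat + nox + ptratio, data = Boston)  # full model

ss_r <- sum(residuals(fit1)^2)   # SSError of the reduced model
ss_f <- sum(residuals(fit2)^2)   # SSError of the full model
k    <- 2                        # predictors dropped to obtain the reduced model
df_f <- df.residual(fit2)        # n - rank(X) for the full model

F_stat <- ((ss_r - ss_f) / k) / (ss_f / df_f)
F_stat                                   # matches the F value reported by anova(fit1, fit2)
pf(F_stat, k, df_f, lower.tail = FALSE)  # matches its p-value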

One more thing about anova()

The anova() function can also be used on a single model.
In that case, it performs the analysis sequentially on an increasing set of predictors:

one predictor is added at a time to those already in the model, in their order of appearance in the model formula.

If you change the predictor order, the ANOVA result changes.

# Example: sequential ANOVA on a single fitted model
fit <- lm(medv ~ lstat + nox + ptratio, data = Boston)
# Compare models using ANOVA
anova(fit)
## Analysis of Variance Table
## 
## Response: medv
##            Df  Sum Sq Mean Sq  F value Pr(>F)    
## lstat       1 23243.9 23243.9 694.4570 <2e-16 ***
## nox         1     4.8     4.8   0.1433 0.7052    
## ptratio     1  2665.3  2665.3  79.6318 <2e-16 ***
## Residuals 502 16802.3    33.5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Testing: lm VS anova

Be careful

summary(lm()) → t-tests for individual coefficients

anova(lm()) → F-tests for sets of coefficients

For a single predictor, the t-test and F-test are equivalent (\(F=t^2\));
this equivalence does not carry over to tests of several coefficients in multiple regression.
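
A quick check of the \(F = t^2\) equivalence with a single-predictor fit on the Boston data:

library(MASS)
fit <- lm(medv ~ lstat, data = Boston)

t_val <- summary(fit)$coefficients["lstat", "t value"]
t_val^2                          # equals ...
anova(fit)["lstat", "F value"]   # ... the F statistic of the one-predictor ANOVA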

Boston Housing example in R

  • 506 neighborhoods
  • Goal: predict median house value (medv)
  • Rich set of predictors

Exploration tools (a short sketch follows this list):

  • Scatterplot matrices
  • Correlation plots
  • Full MLR fit in lm()
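
A minimal exploration sketch along these lines (assuming the Boston data from the MASS package, as above):

library(MASS)   # Boston data

# Scatterplot matrix of the response against a few predictors
pairs(Boston[, c("medv", "lstat", "rm", "nox", "ptratio")], pch = 20)

# Correlation of every variable with the response
round(cor(Boston)[, "medv"], 2)

# Full multiple linear regression fit
fit_all <- lm(medv ~ ., data = Boston)
summary(fit_all)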

Key lesson:

Interpretation becomes harder as models become more powerful.

Big-picture takeaway

Regression is:

  • Geometry (projections)
  • Variance decomposition (ANOVA)
  • Prediction with uncertainty

And most importantly:

A modeling conversation between data, assumptions, and purpose.