Learning Goals

This module has the following learning goals:

Introduction

Linear Regression Model

TL;DR: We love linear models, but ...

The linear regression model is a workhorse in Data Analytics.

  • it is simple, easy to understand and to explain

  • it is computationally and mathematically straightforward, e.g., closed-form expressions for leave-one-out CV

  • it is surprisingly competitive against more complex alternatives

  • it is based on least squares (LS), which is unbiased if the model is correct.

  • it works well if \(n \gg p\)

Shortcomings of Linear Regression Model

TL;DR: We might have too many inputs \(p\) relative to \(n\)

In these situations the linear regression model breaks down, or can at least be improved upon.

  • It has low bias, but not necessarily lowest MSE

  • As \(p\) grows, for fixed \(n\), the variance of the LS estimates increases, the LS estimates become unstable

  • This is exacerbated by multi-collinearity, which also increases in \(p\)

  • When \(p > n\), a unique LS estimator does not exist

How can these shortcomings be addressed?

  • Variable selection (Feature selection):

    • eliminate unimportant inputs from the model

    • use automated algorithms for selecting good inputs

  • Regularization: Also known as shrinkage estimation, it limits the variability of the estimates by forcing (shrinking) them towards zero (in absolute value)

  • Dimension reduction: Instead of using all \(p\) inputs, use statistical techniques to reduce the \(p\)-dimensional problem to an \(m\)-dimensional problem (\(m < p\)). Use \(m\) linear combinations of the \(p\) variables as inputs to standard least-squares regression.

Feature Selection

Why Feature Selection Matters

  • Overfitting vs. Underfitting: too many features can lead to overfitting, while too few can underfit.

  • Interpretability: simpler models with fewer features are often easier to interpret and communicate.

Core Methods for Feature Selection

  • Filter Methods: Correlation Analysis: use correlation coefficients (e.g., Pearson’s) to identify features strongly related to the target; Variance Inflation Factor (see the in-class activity about regression diagnostics); ANOVA

  • Wrapper Methods: Forward/Backward/Stepwise Selection: addition/removal of features based on model performance.

  • Embedded Methods: Regularization (LASSO, Ridge, Elastic Net): Use L1/L2 penalties to shrink coefficients and perform automatic feature selection ; Tree-Based Feature Importance: Use random forests or gradient boosting to rank features by importance.

Feature Selection

Filter method

From the exploratory data analysis, you could have spotted potential collinearity.


In that case, you should calculate VIFs and run ANOVA tests to check for these potential redundancies and for the significance of predictors.


If that is not feasible or the EDA is inconclusive, try the next methods (feature selection by wrapper method or regularization).

Wrapper method

  • Several methods for selecting a subset from all possible inputs

    • Best subset selection

    • Forward selection

    • Backward selection

    • Stepwise selection

Feature Selection (wrapper method)

  1. For \(k=1, 2, \cdots, p\):
    A. Let \(\{M_k\}\) denote the set of candidate models that have exactly \(k\) inputs
    B. Choose the best of those candidates and call it \(M_k\)

  2. Choose the single best model among the \(M_1, \cdots, M_p\)


  • Methods differ in the construction of the candidate set in step 1

  • In step 1.B models are chosen based on \(R^2\), \(\text{SSError}\), or \(p\)-values

  • In step 2 models are chosen using cross-validation

Selection metrics: Mallows’ \(C_p\)

Two definitions: \[\begin{aligned} C_p &= \frac{1}{n} \left(\text{SSError} + 2d\widehat{\sigma}^2 \right) \\ C^\prime_p &= \frac{\text{SSError}}{\widehat{\sigma}^2} + 2d - n \end{aligned}\]

  • \(d\) is the number of regressors in the model

  • \(p\) is the number of regressors in the full model

  • \(\widehat{\sigma}^2\) is based on estimating \(\sigma^2\) in the full model (Why?)

Usage: Prefer the model with the smallest \(C_p\) or \(C^\prime_p\).
Among models with the same # of regressors \(d\), this selects the model with smallest SSError
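The second definition is easy to compute by hand. A minimal base-R sketch (mtcars and the chosen predictors are illustrative, not part of the notes); note that, by construction, \(C^\prime_p\) of the full model equals its number of parameters, a handy sanity check:

```r
# Mallows' C'_p = SSError/sigma2_hat + 2d - n, with sigma^2 estimated
# from the FULL model. Here d counts all estimated coefficients,
# including the intercept (conventions differ).
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
sub  <- lm(mpg ~ wt + hp, data = mtcars)

cp_prime <- function(model, sigma2_hat, n) {
  sum(resid(model)^2) / sigma2_hat + 2 * length(coef(model)) - n
}

s2 <- summary(full)$sigma^2   # sigma^2 from the full model
n  <- nrow(mtcars)

cp_prime(sub,  s2, n)   # compare candidate models: smaller is better
cp_prime(full, s2, n)   # equals length(coef(full)) by construction
```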

Selection metrics: AIC

AIC (Akaike’s information criterion) is based on maximum likelihood and information theory \[\mbox{AIC} \,=\,2k-2\ln({\widehat{L}})\]

With \(k\) the number of estimated parameters and \(\widehat{L}\) the likelihood at the estimated maximum.

Applied to linear regression with Gaussian errors: \[AIC = \frac{1}{n}(\text{SSError} + 2k\widehat{\sigma}^2)\]

  • Usage: Prefer the model with the smaller AIC.

  • For other model families there is no \(C_p\), but there is always an AIC.
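Base R computes AIC directly from a fitted model; a small sketch (mtcars is illustrative, not part of the notes):

```r
# AIC = 2k - 2 ln(L-hat); prefer the smaller value.
m1 <- lm(mpg ~ wt,      data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(m1)
AIC(m2)   # hp improves the fit enough to lower the AIC here

# extractAIC() drops additive constants, so only AIC *differences*
# are comparable between the two functions:
extractAIC(m2)[2] - extractAIC(m1)[2]
```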

Selection metrics: BIC

BIC (Bayesian information criterion) is similar to AIC: \[\mbox{BIC} =k\ln(n)-2\ln({\widehat{L}})\]

Applied to linear regression with Gaussian errors: \[BIC = \frac{1}{n}(\text{SSError} + \ln(n)\,k\,\widehat{\sigma}^2)\]

  • BIC applies a larger penalty than \(C_p\) or AIC if \(\ln(n) > 2\), i.e., whenever \(n > 7\).

  • BIC tends to select smaller models than \(C_p\)

  • BIC is a consistent model selection criterion: it selects the data-generating model with probability approaching 1 as \(n \rightarrow \infty\)

  • Usage: Prefer the model with smaller BIC
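BIC is also available directly in base R, and the relationship to AIC can be checked through AIC()'s k argument (mtcars is illustrative, not part of the notes):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

BIC(m)               # k*log(n) - 2 ln(L-hat)
AIC(m, k = log(n))   # the same number: BIC is AIC with penalty log(n) per parameter
log(n) > 2           # TRUE for n > 7: BIC penalizes harder than AIC
```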

Selection metrics: Adjusted \(R^2\)

\(R_{adj}^2\) adds a correction to \(R^2\) to penalize larger models.

\[R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}\]

\[\text{Adjusted} \, R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}\left ( \frac{n-1}{n-k-1} \right )\]

  • SSError always decreases when regressors are added \(\Rightarrow R^2\) increases

  • Adding “junk” variables will decrease SSError only slightly, but incurs a penalty through the factor \((n-1)/(n-k-1)\)

  • Usage: Prefer the model with the largest Adjusted \(R^2\)
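The two formulas can be verified against summary.lm() in base R (mtcars and the chosen predictors are illustrative, not part of the notes):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2   # number of regressors (without the intercept)

ssE <- sum(resid(fit)^2)
ssT <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2     <- 1 - ssE / ssT
adj_r2 <- 1 - (ssE / ssT) * (n - 1) / (n - k - 1)

c(r2,     summary(fit)$r.squared)       # identical
c(adj_r2, summary(fit)$adj.r.squared)   # identical
```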

All Subset Regression

All Subsets algorithm

  1. For \(k=1, 2, \cdots, p\) do the following:
    • fit all \({p \choose k} = \frac{p!}{k! (p-k)!}\) models with \(k\) predictors
    • Use small SSError, large \(R^2\), small \(p\)-value, etc., to choose the best of these models, call it \(M_k\).
      (\(M_1\) is the best model with 1 input, \(M_2\) is the best model with 2 inputs, ...)
  2. Using CV, \(C_p\), BIC, or Adjusted \(R^2\) choose the overall best model across the \(M_1, \cdots, M_p\).

Note: some references include \(M_0\) the intercept-only model. It is not likely to be the winning model and we omit it here.

Properties

  • \(\oplus\) Evaluates all possible models (searches entire model space)

  • \(\oplus\) Simple and conceptually appealing

  • \(\ominus\) Total number of models is \(2^p\)

  • \(\ominus\) Quickly becomes computationally infeasible.

  • \(\ominus\) With \(p=40\)

    • 40 models with 1 input

    • 780 models with 2 inputs

    • 9880 models with 3 inputs

    • 1.099512e+12 total models
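These counts follow directly from the binomial coefficients, which base R computes with choose():

```r
choose(40, 1)           # 40 models with 1 input
choose(40, 2)           # 780 models with 2 inputs
choose(40, 3)           # 9880 models with 3 inputs
sum(choose(40, 0:40))   # 2^40 = 1,099,511,627,776 models in total
```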

All Subsets in R

We use the regsubsets() function in the leaps package or the train() function in the caret package. method="exhaustive" is the default in regsubsets().

library(leaps)
regfit <- regsubsets(Balance ~ ., data=Credit, nvmax=NULL)
summary(regfit)

1 subsets of each size up to 11
Selection Algorithm: exhaustive
          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1  ( 1 )  " "    " "   "*"    " "   " " " "       " "    " "        " "        " "         " "       
2  ( 1 )  "*"    " "   "*"    " "   " " " "       " "    " "        " "        " "         " "       
3  ( 1 )  "*"    " "   "*"    " "   " " " "       " "    "*"        " "        " "         " "       
4  ( 1 )  "*"    "*"   " "    "*"   " " " "       " "    "*"        " "        " "         " "       
5  ( 1 )  "*"    "*"   "*"    "*"   " " " "       " "    "*"        " "        " "         " "       
6  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       " "    "*"        " "        " "         " "       
7  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "        " "         " "       
8  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "        " "         "*"       
9  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        "*"        " "         "*"       
10  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        "*"        "*"         "*"       
11  ( 1 ) "*"    "*"   "*"    "*"   "*" "*"       "*"    "*"        "*"        "*"         "*"          

Mallows’ \(C_p\), BIC, and Adjusted \(R^2\) for all subset regression

s_all <- summary(regfit)
s_all$cp
s_all$bic
s_all$adjr2

> s_all$cp
 [1] 1800.308406  685.196514   41.133867   11.148910    8.131573    5.574883
        6.462042    7.845931    9.192355   10.472883   12.000000

> s_all$bic
 [1]  -535.9468  -814.1798 -1173.3585 -1198.0527 -1197.0957 -1195.7321
     -1190.8790 -1185.5192 -1180.1989 -1174.9476 -1169.4433

> s_all$adjr2
 [1] 0.7452098 0.8744888 0.9494991 0.9531099 0.9535789 0.9539961
     0.9540098 0.9539649 0.9539243 0.9538912 0.9538287
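With the summary object in hand, which.min()/which.max() identify the winning model size for each criterion, and coef() extracts that model's estimates. A sketch, assuming the leaps package and the Credit data (from the ISLR package) are available:

```r
library(leaps)

regfit <- regsubsets(Balance ~ ., data = Credit, nvmax = NULL)
s_all  <- summary(regfit)

which.min(s_all$cp)      # 6, per the output above
which.min(s_all$bic)     # 4
which.max(s_all$adjr2)   # 7

coef(regfit, which.min(s_all$bic))   # coefficients of the BIC winner
```

Note that the three criteria need not agree; cross-validation (next sections) is a common tie-breaker.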

Forward Selection

Start from the simplest model, add predictors sequentially.

Forward algorithm

  1. For \(k=0, 1, \cdots, p-1\) do the following:

    • The candidate set at step \(k\) consists of the models that add one input currently not in the model.

    • Using small SSError, large \(R^2\), small \(p\)-values, etc., choose the best model at stage \(k\) as the one that improves most by adding one of the remaining \(p-k\) inputs. Call this model \(M_{k+1}\)

  2. Using CV, \(C_p\), BIC, or Adjusted \(R^2\) choose the overall best model across the \(M_1, \cdots, M_p\).

Forward selection with four predictors.

| Stage (\(k\)) | Current Model | Candidate Set | Most Improved |
|---|---|---|---|
| 0 | None | \({\bf X}_1\), \({\bf X}_2\), \({\bf X}_3\), \({\bf X}_4\) | \({\bf X}_2\) |
| 1 | \({\bf X}_2\) | \({\bf X}_1\), \({\bf X}_3\), \({\bf X}_4\) | \({\bf X}_4\) |
| 2 | \({\bf X}_2\), \({\bf X}_4\) | \({\bf X}_1\), \({\bf X}_3\) | \({\bf X}_1\) |
| 3 | \({\bf X}_2\), \({\bf X}_4\), \({\bf X}_1\) | \({\bf X}_3\) | \({\bf X}_3\) |
| 4 | \({\bf X}_2\), \({\bf X}_4\), \({\bf X}_1\), \({\bf X}_3\) | None | |

Pros

  • Computationally efficient, easy to understand

  • “Conditional” interpretation: given the variables in the model…

  • Can be applied if \(p > n\)!

Cons

  • A predictor that was added cannot be removed

  • Not guaranteed to find the best model at stage \(k \ge 1\) or overall

Forward selection in R

Forward selection results with regsubsets().

library(leaps)
regfit <-  regsubsets(Balance ~ ., data=Credit, method="forward", nvmax=NULL)
summary(regfit)

Selection Algorithm: forward
          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1  ( 1 )  " "    " "   "*"    " "   " " " "       " "    " "        " "        " "         " "       
2  ( 1 )  "*"    " "   "*"    " "   " " " "       " "    " "        " "        " "         " "       
3  ( 1 )  "*"    " "   "*"    " "   " " " "       " "    "*"        " "        " "         " "       
4  ( 1 )  "*"    "*"   "*"    " "   " " " "       " "    "*"        " "        " "         " "       
5  ( 1 )  "*"    "*"   "*"    "*"   " " " "       " "    "*"        " "        " "         " "       
6  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       " "    "*"        " "        " "         " "       
7  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "        " "         " "       
8  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "        " "         "*"       
9  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        "*"        " "         "*"       
10  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        "*"        "*"         "*"       
11  ( 1 ) "*"    "*"   "*"    "*"   "*" "*"       "*"    "*"        "*"        "*"         "*" 

All Subset vs. Forward Selection

All Subset and Forward selection agree with respect to all \(M_k\), except for \(k=4\).

  • All Subset: Income, Limit, Cards, StudentYes

  • Forward: Income, Limit, Rating, StudentYes

Even when two selection methods arrive at the same model size \(k\), the selected inputs can differ.

Backward Elimination

Starts from the full model, removes predictors sequentially

Backward algorithm

  1. Start from the full model with all \(p\) predictors, \(M_p\).

  2. For \(k=p, p-1, \cdots, 1\) do the following:

    • The candidate set at step \(k\) are the models that remove one input from the current model.

    • Using SSError, \(R^2\), \(p\)-values, etc., choose the best model at stage \(k\) as the one that deteriorates least when one of the \(k\) inputs in the current model is removed. Call this model \(M_{k-1}\)

  3. Using CV, \(C_p\), BIC, or Adjusted \(R^2\) choose the overall best model across the \(M_0, \cdots, M_p\).

Backward elimination with four predictors.

| Stage (\(k\)) | Current Model | Candidate Set | Least Important |
|---|---|---|---|
| 4 | \({\bf X}_1, {\bf X}_2, {\bf X}_3, {\bf X}_4\) | \({\bf X}_1, {\bf X}_2, {\bf X}_3, {\bf X}_4\) | \({\bf X}_2\) |
| 3 | \({\bf X}_1, {\bf X}_3, {\bf X}_4\) | \({\bf X}_1, {\bf X}_3, {\bf X}_4\) | \({\bf X}_4\) |
| 2 | \({\bf X}_1, {\bf X}_3\) | \({\bf X}_1, {\bf X}_3\) | \({\bf X}_1\) |
| 1 | \({\bf X}_3\) | \({\bf X}_3\) | \({\bf X}_3\) |
| 0 | None | None | |

Pros

  • Computationally efficient, easy to understand

  • “Conditional” interpretation: given the variables in the model, ...

  • Sees the full model, most likely to provide an unbiased estimate of \(\sigma^2\)

Cons

  • A predictor that was removed cannot be added back at a later step

  • Cannot be applied if \(p > n\) because the full model cannot be fit!

  • Not guaranteed to find the best model at stage \(k\) or overall

Stepwise Selection

Combines the pros of forward selection and backward elimination.

  • Start with an empty model and add the best variable (forward step)

  • If a variable was added, try and remove any of the variables in the model (backward step)

Pros

  • Searches larger model space than forward or backward

  • Possibly evaluates models of size \(k\) several times

  • Variables added earlier can become unimportant at later steps and can be removed
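Stepwise selection is available through the same regsubsets() interface. A sketch, assuming the leaps package and the Credit data as in the earlier examples:

```r
library(leaps)

# method="seqrep" (sequential replacement) is leaps' stepwise algorithm
step_fit <- regsubsets(Balance ~ ., data = Credit,
                       method = "seqrep", nvmax = NULL)
summary(step_fit)
```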

Selection with Cross-validation

We can combine the selection algorithms with a direct estimate of the test error by using cross-validation.

Use the train() function in the caret package.

It uses leaps underneath.

Set the method= parameter of train() to leapForward, leapBackward, or leapSeq for forward, backward, and stepwise, respectively.

Backward elimination with 10-fold cross-validation.

set.seed(123)
train.control <- trainControl(method="cv", number=10)

bkwd.model <- train(Balance ~ . , data =Credit,
                    method = "leapBackward", 
                    tuneGrid = data.frame(nvmax = 1:11),
                    trControl = train.control)

bkwd.model$results

# the index of the selected ("best") model
bkwd.model$bestTune$nvmax
[1] 6

summary(bkwd.model$finalModel)

coef(bkwd.model$finalModel, bkwd.model$bestTune$nvmax)

Selection Algorithm: backward
         Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1  ( 1 ) " "    "*"   " "    " "   " " " "       " "    " "        " "        " "         " "       
2  ( 1 ) "*"    "*"   " "    " "   " " " "       " "    " "        " "        " "         " "       
3  ( 1 ) "*"    "*"   " "    " "   " " " "       " "    "*"        " "        " "         " "       
4  ( 1 ) "*"    "*"   " "    "*"   " " " "       " "    "*"        " "        " "         " "       
5  ( 1 ) "*"    "*"   "*"    "*"   " " " "       " "    "*"        " "        " "         " "       
6  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       " "    "*"        " "        " "         " "       

 (Intercept)       Income        Limit       Rating        Cards          Age   StudentYes 
-493.7341870   -7.7950824    0.1936914    1.0911874   18.2118976   -0.6240560  425.6099369 

Summary: Feature Selection in R

| Method | leaps `method=` | caret::train `method=` |
|---|---|---|
| All Subset (regression) | exhaustive | |
| Forward (selection) | forward | leapForward |
| Backward (elimination) | backward | leapBackward |
| Stepwise (selection) | seqrep | leapSeq |

Feature Selection in Generalized Linear Models

You can use the bestglm() function in the bestglm package.

It implements the exhaustive search method in the leaps package to perform an all subsets approach.

The model is passed as a single data frame with the target variable as the last column.

In the binomial (non-logistic) GLM case, the last two columns of Xy are the counts of ‘successes’ and ‘failures’.

The function will work for up to 15 inputs.

See examples in this week’s code.

Regularization

Shrinkage Estimation

  • Shrinkage estimation is a statistical estimation technique that imposes constraints on the parameter estimates. The resulting estimates are closer to zero in absolute value than the unconstrained estimates. We say the estimates were shrunk toward zero.
  • Shrinking coefficients can significantly reduce the variability of the coefficients. Intuitive?

  • Feature selection can be seen as an extreme form of shrinkage where the coefficients of predictors not in the model are set to zero.

  • The most important shrinkage methods in regression are Ridge regression and the Lasso.

Ridge Regression

Least Squares minimizes \[\text{SSError} = \sum_{i=1}^n \left ( y_i - {\bf x}_i^\prime {\bf \beta} \right )^2\]

Ridge Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p \beta_j^2\]

  • The term \(\lambda \sum_{j=1}^p \beta_j^2\) is called the ridge penalty or shrinkage penalty

  • \(\lambda \geq 0\) is a hyper-parameter; we need to find a value for it

Ridge Regression

Effects of the shrinkage penalty \[\lambda \sum_{j=1}^p \beta_j^2\]

  • For given \(\lambda\), the penalty is small if the \(\beta_j \rightarrow 0\)

  • As \(\lambda\) increases the penalty becomes more severe

  • As \(\lambda\) increases the \(\beta_j \rightarrow 0\)

  • If \(\lambda\) is large, you essentially fit an intercept-only model

  • For \(\lambda=0\) the least squares estimates result

  • For every value of \(\lambda\) there is a different vector of \({\bf \beta}\) estimates

  • \(\lambda\) is a hyper-parameter for which we have to find a value \(\Rightarrow\) cross-validation

Computational considerations

  • The intercept \(\beta_0\) is not shrunk; with centered inputs it represents the mean of \(Y\)

  • There is one shrinkage parameter \(\lambda\), it applies to all \(\beta_j\)

  • The \({\bf X}\) matrix is centered and scaled so that all columns have mean 0 and standard deviation 1. Otherwise a common \(\lambda\) parameter would not make sense.
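These conventions can be made concrete with the closed-form ridge estimator \(\widehat{\beta}_\lambda = ({\bf X}^\prime{\bf X} + \lambda {\bf I})^{-1}{\bf X}^\prime {\bf y}\) in base R. A minimal sketch (mtcars and the chosen columns are illustrative, not part of the notes):

```r
# Center/scale the inputs and center y, so the intercept drops out
# and a single lambda is comparable across columns.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

ridge_beta <- function(lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

ridge_beta(0)     # lambda = 0 reproduces the least squares estimates
ridge_beta(100)   # coefficients shrunk toward zero
```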

Ridge Regression vs Variable Selection

Variable Selection

  • binary decision: predictor \(j\) is in (\(|\beta_j| > 0\)) or out (\(\beta_j = 0\))

  • requires \(n > p\) for the models considered

  • still can produce highly variable estimates if \({\bf X}\) matrix is poorly conditioned (dependencies among the \({\bf X}\)s)

Ridge Regression

  • can be used even if \(p > n\)

  • allows all predictors to be in the model, even if they make only tiny contributions

  • trades some bias for lower variance

Ridge Regression in R

Ridge regression (and Lasso) can be performed in R with the glmnet() function in the glmnet package.

In contrast to other modeling procedures, glmnet() does not accept the usual model-building formula (\(y \sim {\bf X}_1 + {\bf X}_2 + ...\)). Instead, you have to supply the \({\bf X}\) matrix and the \({\bf y}\) vector to the function.

glmnet() will standardize the columns of \({\bf X}\) by default and report the regression coefficients on the original scale.

The Hitters data set in ISLR contains salaries and records for baseball players. The following statements build the \({\bf X}\) matrix (without the intercept column) and the \({\bf y}\) vector and obtain Ridge regression estimates for \(\lambda = 100, 10, 0.1, 0\).
(alpha=0 is for Ridge regression, alpha=1 is for Lasso.)

Hit <- na.omit(Hitters)
X <- model.matrix(Salary ~ ., data=Hit)[,-1]
y <- Hit$Salary

grid <- c(100, 10, 0.1, 0)
ridge_reg <- glmnet(X, y, alpha=0, lambda=grid)

Choosing \(\lambda\)

You can choose the hyper-parameter \(\lambda\) by cross-validation with the cv.glmnet() function, very similar to using cv.glm() in the previous module.

set.seed(6543)
cv.out <- cv.glmnet(X, y, alpha=0, nfolds=10, type.measure="mse")
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam
[1] 25.52821

log(bestlam)
[1] 3.239784
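Once \(\lambda\) has been chosen, coefficients and predictions at that value are obtained by passing s=. A self-contained sketch, assuming the glmnet and ISLR packages are available:

```r
library(glmnet)
library(ISLR)   # source of the Hitters data

Hit <- na.omit(Hitters)
X <- model.matrix(Salary ~ ., data = Hit)[, -1]
y <- Hit$Salary

set.seed(6543)
cv.out <- cv.glmnet(X, y, alpha = 0, nfolds = 10, type.measure = "mse")

coef(cv.out, s = "lambda.min")                       # estimates at the chosen lambda
predict(cv.out, newx = X[1:5, ], s = "lambda.min")   # scoring (new) observations
```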

Ridge Regression when \(p > n\)

When \(p > n\), least squares estimation breaks down.

Let’s simulate a data set with \(n=5\) and \(p=20\) predictors.

set.seed(1234)
vec <- runif(100)
X <- matrix(vec, nrow=5, ncol=20)
y <- rnorm(dim(X)[1]) + rowSums(X)  

linreg <- lm(y ~ X)

Least squares “runs out” of degrees of freedom after four predictors.

The fitted model is saturated, that is, a perfect fit that interpolates the y data.

> coef(linreg)
(Intercept)          X1          X2          X3          X4          X5          X6          X7          X8
  -7.638833   12.242983   -3.934432   10.842579   15.268963          NA          NA          NA          NA 
         X9          X10         X11         X12         X13         X14         X15         X16         X17
         NA          NA          NA          NA          NA          NA          NA          NA          NA 
         X18         X19         X20 
         NA          NA          NA

> round(y - predict(linreg),6)
1 2 3 4 5 
0 0 0 0 0 

Ridge regression on the other hand allows all predictors to contribute

ridge_reg <- glmnet(X,y,alpha=0,lambda=c(100,10,0.1,0.01))
coef(ridge_reg)

21 x 4 sparse Matrix of class "dgCMatrix"
                  s0       s1       s2       s3
(Intercept)  7.68687  6.99069  5.73141  5.67905
V1          -0.01715 -0.09326 -0.14427 -0.15359
V2           0.02197  0.13473  0.31032  0.33219
V3           0.01824  0.13018  0.38462  0.40144
V4           0.02882  0.16602  0.31569  0.31926
V5           0.06578  0.40821  1.01352  1.09996
V6          -0.01910 -0.13293 -0.42493 -0.46014
V7           0.00925  0.05357  0.03708  0.07006
V8           0.02107  0.13840  0.36602  0.37237
V9           0.03712  0.27334  0.88313  0.92015
V10          0.02202  0.14437  0.41119  0.42711
V11         -0.05368 -0.34858 -0.91773 -0.91176
V12         -0.00808 -0.06854 -0.27528 -0.28008
V13          0.01568  0.07942  0.07134  0.05679
V14          0.01493  0.06842  0.00454  0.00741
V15          0.00370  0.04326  0.22480  0.22586
V16          0.05926  0.39137  1.06970  1.08200
V17          0.03020  0.19274  0.48713  0.48341
V18          0.02981  0.18198  0.42613  0.42566
V19          0.00409  0.04189  0.19267  0.19308
V20          0.04678  0.30709  0.85295  0.85442 

And the predictions do not perfectly interpolate the y-data.

predict(ridge_reg, newx=X)

           s0       s1       s2       s3
[1,] 7.919595 8.441812 9.155059 9.168386
[2,] 7.804364 7.785944 7.982064 7.990984
[3,] 7.705466 7.081441 5.750685 5.713745
[4,] 7.784464 7.659786 7.612904 7.614922
[5,] 7.861569 8.106474 8.574746 8.587421

y
[1] 9.169844 7.991979 5.709654 7.615122 8.588858    

Lasso Regression

An arguable disadvantage of Ridge regression is that it

  • includes all predictors in the model and thus

  • gives non-zero estimates \(\beta_1, \cdots, \beta_p\),

  • so it cannot be used for feature selection

  • and the resulting models can be difficult to interpret.

Solution: LASSO (Least Absolute Shrinkage and Selection Operator) combines regularization (shrinkage) with variable selection.

The difference between Ridge and Lasso is the penalty term.

Least Squares minimizes \[\text{SSError} = \sum_{i=1}^n \left ( y_i - {\bf x}_i^\prime {\bf \beta} \right )^2\]

Ridge Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p \beta_j^2\] Lasso Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p |\beta_j|\]

The Lasso effect

  • As \(\lambda\) grows some \(\beta_j\) will be exactly zero

  • Interpret this as the predictor is no longer important

  • Depending on \(\lambda\), different predictors will be in the model

In R use glmnet() with the alpha=1 parameter to get Lasso parameter estimates.

glmnet() fits a class of models called elastic nets. The elastic net penalty is \[\lambda \left \{ \frac{1-\alpha}{2} \sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \right \}\]
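The penalty itself is easy to write down in base R, which makes the two special cases visible (a small illustrative sketch; enet_penalty is not a glmnet function):

```r
# Elastic net penalty: lambda * ((1-alpha)/2 * sum(beta^2) + alpha * sum(|beta|))
enet_penalty <- function(beta, lambda, alpha)
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))

b <- c(2, -1, 0.5)
enet_penalty(b, lambda = 1, alpha = 0)   # ridge case: sum(b^2)/2 = 2.625
enet_penalty(b, lambda = 1, alpha = 1)   # lasso case: sum(|b|)   = 3.5
```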

Lasso Regression in R

Lasso regression for the Hitters data set with \(\lambda = 100, 10, 1, 0.1, 0\).

grid <- c(100, 10, 1, 0.1 , 0)

lasso_reg <- glmnet(X, y, alpha=1, lambda=grid)

coef(lasso_reg)

Lasso regression estimates for \(\lambda = 100, 10, 1, 0.1, 0\).

                   s0         s1         s2         s3         s4
(Intercept) 220.10409   -1.49700  150.27526  163.35927  164.53816
AtBat         .          .         -1.90102   -1.98737   -1.98522
Hits          1.13626    2.01155    6.72889    7.40745    7.46595
HmRun         .          .          1.33127    3.71491    4.17264
Runs          .          .         -1.03154   -2.14140   -2.28368
RBI           .          .          .         -0.84049   -1.00296
Walks         1.18265    2.24853    5.56223    6.16142    6.22837
Years         .          .         -6.90650   -4.30318   -3.86027
CAtBat        .          .         -0.07605   -0.15629   -0.16934
CHits         .          .          .          0.14104    0.15983
CHmRun        .          0.04705    0.13180   -0.00474   -0.08136
CRuns         0.11012    0.21931    1.15880    1.37567    1.40784
CRBI          0.31456    0.40399    0.61220    0.73167    0.76981
CWalks        .          .         -0.72112   -0.79961   -0.80606
LeagueN       .         18.93190   46.61968   61.12354   62.74030
DivisionW     .       -115.19852 -116.21235 -116.95948 -117.12701
PutOuts       0.00330    0.23596    0.28144    0.28230    0.28189
Assists       .          .          0.29052    0.36238    0.37108
Errors        .         -0.78124   -2.85412   -3.32928   -3.38591
NewLeagueN    .          .         -9.75083  -23.67106  -25.28518

Choosing \(\lambda\)

The value of \(\lambda\) has great impact on Ridge and Lasso results.

You can compute the test error for different values of \(\lambda\) using

  • validation data set

  • leave-one-out cross-validation

  • \(k\)-fold cross-validation

The cv.glmnet function in R helps greatly with the workflow.

Choosing \(\lambda\) by 10-fold cross-validation in Lasso regression.

cv.out <- cv.glmnet(X,y,alpha=1)

bestlam <- cv.out$lambda.min
bestlam
[1] 2.436791    

log(bestlam)
[1] 0.8906821

plot(cv.out) 

Lasso Simulation

See R code for a simulation to see whether Lasso regression is able to recover the underlying model when \(p > n\).
Spoiler alert: it does!

Lasso or Ridge Regression

  • No method uniformly dominates the other in terms of prediction accuracy

  • Lasso performs better when a few predictors have large coefficients and the remaining predictors are close to zero

  • Ridge regression performs better when the response is related to many predictors of equal size

  • Lasso performs feature selection

  • Lasso models are easier to interpret

  • With Lasso models it is easier to predict (score) new observations. Why?

Dimension Reduction

Dimension Reduction

Regression is a dimension-reducing technique

  • \({\bf Y}_{n \times 1}\) is a vector in \(n\)-dimensional space

  • The fitted values \({\bf X}{\bf \beta}\) lie in a \((p+1)\)-dimensional subspace of that space

  • Sometimes \(p\) is too large

We have seen techniques that deal with \(p\) being (possibly) too large

  • Feature selection: set some \(\beta_j = 0\)

  • Lasso & Ridge regression: place a constraint on the coefficient estimates

Another technique is to apply a two-step method

  1. Obtain inputs \(Z_1,\cdots,Z_M\) from \({\bf X}_1, \cdots, {\bf X}_p\), where \(M < p\)

  2. Fit the model using \(Z_1, \cdots, Z_M\) as the predictors

The \(Z_m\) are linear combinations of all predictors \[Z_m = \sum_{j=1}^p \phi_{jm}{\bf X}_j\] The \(\phi_{jm}\) are called the loadings.

The regression model in step 2 becomes
\[\textbf{Y}_{n \times 1} = \theta_0 + \textbf{Z}_{n \times M}{\bf \theta} + {\bf \epsilon}\]

The dimension of the problem has been reduced from \(p+1\) to \(M+1\)

You can show that this is equivalent to a linear model with \[\beta_j = \sum_{m=1}^M\theta_m \phi_{jm}\] In other words, we control variability by constraining the coefficients \(\beta_j\) to take on this form.
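This identity can be checked numerically with base R's prcomp() (a sketch; mtcars and the chosen columns are illustrative, not part of the notes). With \(M\) equal to the number of inputs, the implied \(\beta\) reproduces ordinary least squares exactly:

```r
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg

pc <- prcomp(X, center = FALSE, scale. = FALSE)   # X is already standardized
M  <- 2
Z  <- pc$x[, 1:M]                                 # scores on the first M components

theta <- coef(lm(y ~ Z))[-1]                      # regression on the components
beta_implied <- pc$rotation[, 1:M] %*% theta      # beta_j = sum_m theta_m * phi_jm
beta_implied
```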

Principal Component Regression

Principal component analysis (PCA) is an unsupervised learning method that finds linear combinations of \(p\) inputs that explain decreasing amounts of variability among the \({\bf X}\)s. These linear combinations are called the principal components and are orthogonal to each other.

The principal components project in the directions in which the \({\bf X}\)s are most variable.

PCA is used to visualize high-dimensional data and to reduce the dimensionality of a problem to the most important principal components (those that explain the most variability).

Principal component regression (PCR) uses \(M\) principal components from a PCA as the inputs in a regression model (instead of the \(p\) \({\bf X}\)s).

PCA is an unsupervised technique: the Y-data does not participate in computing the principal components.

PCR is a supervised technique with target \({\bf Y}\)

PCR Assumption: The directions in which \({\bf X}_1,\cdots,{\bf X}_p\) are most variable are the directions most closely related to \({\bf Y}\). Put differently: the directions that explain a lot about the \({\bf X}\)s also explain a lot about \(Y\).

\(M\), the number of principal components to use is a hyper-parameter

Choose \(M\) using cross-validation

PCR in R

In R use pcr() in the pls package.

See R code for a worked example.

pcr.fit <- pcr(Salary ~ ., data=Hit,
               scale=TRUE, 
               subset=train, 
               validation="CV",
               segments=5)

summary(pcr.fit)

Principal component regression with 5-fold CV and ncomp=10

Data:   X dimension: 263 19 
       Y dimension: 263 1
Fit method: svdpc
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 5 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
CV             452    351.7    351.2    350.5    348.2    345.8    342.5    343.4    347.0    347.1     345.3
adjCV          452    351.0    350.4    349.6    347.1    344.7    340.9    341.8    344.9    345.1     343.0

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96    96.28     97.26
Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75    46.86     47.76

Validation plot with MSE (instead of RMSE)

validationplot(pcr.fit,val.type="MSEP",legendpos="topright")

Score plot for the first four principal components

scoreplot(pcr.fit, comps=1:4, cex=0.7)