This module has the following learning goals:
Goal 1: Understand ways in which the linear model can break down or can be improved
Goal 2: Learn methods for selecting inputs in the linear regression model (all subsets, forward, backward, stepwise)
Goal 3: Ridge regression and Lasso regularization for problems with large \(p\)
Goal 4: Dimension reduction with Principal Component Analysis (PCA)
Goal 5: Implement all methods in R
TL;DR: We love linear models, but ...
The linear regression model is a workhorse in Data Analytics.
it is simple, easy to understand and to explain
it is computationally and mathematically straightforward, e.g., closed-form expressions for leave-one-out CV
it is surprisingly competitive against more complex alternatives
it is based on least squares (LS), which is unbiased if the model is correct.
it works well if \(n \gg p\)
TL;DR: We might have too many inputs \(p\) relative to \(n\)
The linear regression model then breaks down or can be improved.
It has low bias, but not necessarily lowest MSE
As \(p\) grows, for fixed \(n\), the variance of the LS estimates increases, the LS estimates become unstable
This is exacerbated by multicollinearity, which tends to become more severe as \(p\) grows
When \(p > n\), a unique LS estimator does not exist
Variable selection (Feature selection):
eliminate unimportant inputs from the model
use automated algorithms for selecting good inputs
Regularization: Also known as shrinkage estimation, it limits the variability of the estimates by forcing (shrinking) them towards zero (in absolute value)
Dimension reduction: Instead of using all \(p\) inputs, use statistical techniques to reduce the \(p\)-dimensional problem to an \(m\)-dimensional problem (\(m < p\)). Use \(m\) linear combinations of the \(p\) variables as inputs to standard least-squares regression.
Why Feature Selection Matters
Overfitting vs. Underfitting: Explain how too many features can lead to overfitting, while too few can underfit.
Interpretability: Emphasize that simpler models with fewer features are often easier to interpret and communicate.
Filter Methods: Correlation Analysis: Use correlation coefficients (e.g., Pearson’s) to identify features strongly related to the target; Variance Inflation Factor (VIF; see in-class activity about regression diagnostics); ANOVA
Wrapper Methods: Forward/Backward/Stepwise Selection: addition/removal of features based on model performance.
Embedded Methods: Regularization (LASSO, Ridge, Elastic Net): Use L1/L2 penalties to shrink coefficients and perform automatic feature selection ; Tree-Based Feature Importance: Use random forests or gradient boosting to rank features by importance.
From the exploratory data analysis, you could have spotted potential collinearity.
In that case, compute VIFs and run ANOVA tests to check these potential redundancies and the significance of predictors.
If that is not feasible or the EDA is inconclusive, try the next methods (feature selection by wrapper method or regularization).
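As a quick sketch of that VIF check, assuming the Credit data from the ISLR2 package and the vif() function from the car package (the chosen predictors are illustrative):

```r
library(car)    # provides vif()
library(ISLR2)  # Credit data set (assumed)

fit <- lm(Balance ~ Income + Limit + Rating + Age, data = Credit)
vif(fit)  # very large VIFs (Limit and Rating are almost perfectly correlated) flag redundancy
```

A common rule of thumb treats VIFs above 5 or 10 as signs of problematic collinearity.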
Several methods for selecting a subset from all possible inputs
Best subset selection
Forward selection
Backward selection
Stepwise selection
1. For \(k=1, 2, \cdots, p\):
   A. Let \(\{M_k\}\) denote the set of candidate models that have exactly \(k\) inputs
   B. Choose the best of those candidates and call it \(M_k\)
2. Choose the single best model among \(M_1, \cdots, M_p\)
Methods differ in the construction of the candidate set in step 1.A
In step 1.B models are chosen based on \(R^2\) or \(SS(\text{Error})\) or \(p\)-values
In step 2 models are chosen using cross-validation
Two definitions: \[\begin{aligned} C_p &= \frac{1}{n} \left(\text{SSError} + 2d\widehat{\sigma}^2 \right) \\ C^\prime_p &= \frac{\text{SSError}}{\widehat{\sigma}^2} + 2d - n \end{aligned}\]
\(d\) is the number of regressors in the model
\(p\) is the number of regressors in the full model
\(\widehat{\sigma}^2\) is based on estimating \(\sigma^2\) in the full model (Why?)
Usage: Prefer the model with the smallest \(C_p\) or \(C^\prime_p\).
Among models with the same # of regressors \(d\), this selects the model with smallest SSError
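As a sketch, both versions of \(C_p\) can be computed by hand for a candidate model, with \(\widehat{\sigma}^2\) taken from the full model. The Credit data from the ISLR2 package and the chosen submodel are illustrative assumptions; note also that conventions differ on whether \(d\) counts the intercept.

```r
library(ISLR2)  # Credit data set (assumed)

full <- lm(Balance ~ ., data = Credit)
sub  <- lm(Balance ~ Income + Limit + Rating + Student, data = Credit)

n      <- nrow(Credit)
sigma2 <- summary(full)$sigma^2     # sigma^2 estimated from the full model
sse    <- sum(residuals(sub)^2)     # SSError of the candidate model
d      <- length(coef(sub))         # number of estimated coefficients (incl. intercept here)

Cp       <- (sse + 2 * d * sigma2) / n
Cp_prime <- sse / sigma2 + 2 * d - n
c(Cp = Cp, Cp_prime = Cp_prime)
```

The two versions are related by \(C_p = \widehat{\sigma}^2 (C^\prime_p + n)/n\), so they rank models identically.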
AIC (Akaike’s information criterion) is based on maximum likelihood and information theory \[\mbox{AIC} \,=\,2k-2\ln({\widehat{L}})\]
With \(k\) the number of estimated parameters and \(\widehat{L}\) the likelihood at the estimated maximum.
Applied to linear regression with Gaussian errors: \[\text{AIC} = \frac{1}{n}\left(\text{SSError} + 2k\widehat{\sigma}^2\right)\]
Usage: Prefer the model with the smaller AIC.
For other model families there is no \(C_p\), but there is always an AIC.
BIC (Bayesian information criterion) is similar to AIC: \[\mbox{BIC} =k\ln(n)-2\ln({\widehat{L}})\]
Applied to linear regression with Gaussian errors: \[\text{BIC} = \frac{1}{n}\left(\text{SSError} + \ln(n)\,k\,\widehat{\sigma}^2\right)\]
BIC applies a larger penalty than \(C_p\) or AIC if \(\ln(n) > 2\), i.e., if \(n > e^2 \approx 7.4\).
BIC tends to select smaller models than \(C_p\)
BIC is consistent for model selection: it selects the data-generating model with probability approaching 1 as \(n\) grows (provided that model is among the candidates)
Usage: Prefer the model with smaller BIC
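Both criteria are available directly in base R via AIC() and BIC(). A sketch, assuming the Credit data from the ISLR2 package (note that base R reports the criteria on the \(-2\ln\widehat{L}\) scale, which differs from the formulas above by constants but ranks models the same way):

```r
library(ISLR2)  # Credit data set (assumed)

small <- lm(Balance ~ Income + Limit + Student, data = Credit)
full  <- lm(Balance ~ ., data = Credit)

AIC(small); AIC(full)   # prefer the smaller value
BIC(small); BIC(full)   # BIC penalizes the larger model more heavily when log(n) > 2
```

For the same fitted model, BIC exceeds AIC by \(k(\ln(n) - 2)\), which is positive here since \(n = 400\).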
\(R_{adj}^2\) adds a correction to \(R^2\) to penalize larger models.
\[R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}\]
\[\text{Adjusted} \, R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}\left ( \frac{n-1}{n-k-1} \right )\]
SSError always decreases when regressors are added \(\Rightarrow R^2\) increases
Adding “junk” variables will not decrease SSError much but inflates the penalty factor \((n-1)/(n-k-1)\)
Usage: Prefer the model with the largest Adjusted \(R^2\)
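Since \(\text{SSError}/\text{SSTotal} = 1 - R^2\), the adjustment can be verified by hand from summary() output. A minimal sketch on simulated data (the data set and coefficients are made up for illustration):

```r
set.seed(42)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), junk = rnorm(n))
d$y <- 2 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + junk, data = d)
s   <- summary(fit)
k   <- length(coef(fit)) - 1                   # number of regressors, excluding the intercept
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
all.equal(adj, s$adj.r.squared)                # → TRUE: matches summary()'s value
```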
All Subsets algorithm
Note: some references include \(M_0\) the intercept-only model. It is not likely to be the winning model and we omit it here.
Properties
\(\oplus\) Evaluates all possible models (searches entire model space)
\(\oplus\) Simple and conceptually appealing
\(\ominus\) Total number of models is \(2^p\)
\(\ominus\) Quickly becomes computationally infeasible.
\(\ominus\) With \(p=40\)
40 models with 1 input
780 models with 2 inputs
9880 models with 3 inputs
\(2^{40} \approx 1.1 \times 10^{12}\) total models
We use the regsubsets() function in the leaps package or the train() function in the caret package. method="exhaustive" is the default in regsubsets().
library(leaps)
regfit <- regsubsets(Balance ~ ., data=Credit, nvmax=NULL)
summary(regfit)
1 subsets of each size up to 11
Selection Algorithm: exhaustive
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " " " "
3 ( 1 ) "*" " " "*" " " " " " " " " "*" " " " " " "
4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " " " " " "
5 ( 1 ) "*" "*" "*" "*" " " " " " " "*" " " " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " " " "
7 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " " " "*"
9 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" " " "*"
10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Mallows’ \(C_p\), BIC, and Adjusted \(R^2\) for all subsets regression
s_all <- summary(regfit)
s_all$cp
s_all$bic
s_all$adjr2
> s_all$cp
 [1] 1800.308406  685.196514   41.133867   11.148910    8.131573    5.574883
 [7]    6.462042    7.845931    9.192355   10.472883   12.000000
> s_all$bic
 [1]  -535.9468  -814.1798 -1173.3585 -1198.0527 -1197.0957 -1195.7321
 [7] -1190.8790 -1185.5192 -1180.1989 -1174.9476 -1169.4433
> s_all$adjr2
 [1] 0.7452098 0.8744888 0.9494991 0.9531099 0.9535789 0.9539961
 [7] 0.9540098 0.9539649 0.9539243 0.9538912 0.9538287
Start from the simplest model, add predictors sequentially.
Forward algorithm
For \(k=0, 1, \cdots, p-1\) do the following:
The candidate set at step \(k\) are the models with one additional input that is currently not in the model.
Using SSError, large \(R^2\), small \(p\)-value, etc., choose the best model at stage \(k\) as the one that improves most by adding one of the remaining \(p-k\) inputs. Call this model \(M_{k+1}\)
Using CV, \(C_p\), BIC, or Adjusted \(R^2\) choose the overall best model across the \(M_1, \cdots, M_p\).
Forward selection with four predictors.
| Stage (\(k\)) | Current Model | Candidate Set | Most Improved |
|---|---|---|---|
| 0 | None | \({\bf X}_1\), \({\bf X}_2\), \({\bf X}_3\), \({\bf X}_4\) | \({\bf X}_2\) |
| 1 | \({\bf X}_2\) | \({\bf X}_1\), \({\bf X}_3\), \({\bf X}_4\) | \({\bf X}_4\) |
| 2 | \({\bf X}_2\), \({\bf X}_4\) | \({\bf X}_1\), \({\bf X}_3\) | \({\bf X}_1\) |
| 3 | \({\bf X}_2\), \({\bf X}_4\), \({\bf X}_1\) | \({\bf X}_3\) | \({\bf X}_3\) |
| 4 | \({\bf X}_2\), \({\bf X}_4\), \({\bf X}_1\), \({\bf X}_3\) | None | — |
Pros
Computationally efficient, easy to understand
“Conditional” interpretation: given the variables in the model…
Can be applied if \(p > n\)!
Cons
A predictor that was added cannot be removed
Not guaranteed to find the best model at stage \(k \ge 1\) or overall
Forward selection results with regsubsets().
library(leaps)
regfit <- regsubsets(Balance ~ ., data=Credit, method="forward", nvmax=NULL)
summary(regfit)
Selection Algorithm: forward
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " " " "
3 ( 1 ) "*" " " "*" " " " " " " " " "*" " " " " " "
4 ( 1 ) "*" "*" "*" " " " " " " " " "*" " " " " " "
5 ( 1 ) "*" "*" "*" "*" " " " " " " "*" " " " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " " " "
7 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " " " "*"
9 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" " " "*"
10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
All Subset and Forward selection agree with respect to all \(M_k\), except for \(k=4\).
All Subset: Income, Limit, Cards, StudentYes
Forward: Income, Limit, Rating, StudentYes
Even when two selection methods arrive at the same model size \(k\), the selected inputs can differ.
Starts from the full model, removes predictors sequentially
Backward algorithm
Start from the full model with all \(p\) predictors, \(M_p\).
For \(k=p, p-1, \cdots, 1\) do the following:
The candidate set at step \(k\) are the models that remove one input from the current model.
Using SSError, the drop in \(R^2\), large \(p\)-values, etc., choose the best model at stage \(k\) as the one that deteriorates least when one of the current \(k\) inputs is removed. Call this model \(M_{k-1}\)
Using CV, \(C_p\), BIC, or Adjusted \(R^2\) choose the overall best model across the \(M_0, \cdots, M_p\).
Backward elimination with four predictors.
| Stage (\(k\)) | Current Model | Removal Candidates | Least Important |
|---|---|---|---|
| 4 | \({\bf X}_1, {\bf X}_2, {\bf X}_3, {\bf X}_4\) | \({\bf X}_1, {\bf X}_2, {\bf X}_3, {\bf X}_4\) | \({\bf X}_2\) |
| 3 | \({\bf X}_1, {\bf X}_3, {\bf X}_4\) | \({\bf X}_1, {\bf X}_3, {\bf X}_4\) | \({\bf X}_4\) |
| 2 | \({\bf X}_1, {\bf X}_3\) | \({\bf X}_1, {\bf X}_3\) | \({\bf X}_1\) |
| 1 | \({\bf X}_3\) | \({\bf X}_3\) | \({\bf X}_3\) |
| 0 | None | None | — |
Pros
Computationally efficient, easy to understand
“Conditional” interpretation: given the variables in the model, ...
Sees the full model, most likely to provide an unbiased estimate of \(\sigma^2\)
Cons
A predictor that was removed cannot be added back at a later step
Cannot be applied if \(p > n\) because the full model cannot be fit!
Not guaranteed to find the best model at stage \(k\) or overall
Combines the pros of forward selection and backward elimination.
Start with an empty model and add the best variable (forward step)
If a variable was added, try and remove any of the variables in the model (backward step)
Pros
Searches larger model space than forward or backward
May evaluate models of size \(k\) several times, exploring more of the model space
Variables added earlier can become unimportant at later steps and can be removed
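Base R's step() function implements a closely related AIC-based stepwise search with direction="both". A sketch, assuming the Credit data from the ISLR2 package:

```r
library(ISLR2)  # Credit data set (assumed)

null_model <- lm(Balance ~ 1, data = Credit)   # intercept-only start
full_model <- lm(Balance ~ ., data = Credit)   # largest model considered

step_model <- step(null_model,
                   scope     = list(lower = formula(null_model),
                                    upper = formula(full_model)),
                   direction = "both",   # forward steps with backward checks
                   trace     = FALSE)
formula(step_model)   # the selected model
```

Unlike regsubsets(), step() uses AIC at every addition/removal decision rather than comparing models of equal size first.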
We can combine the selection algorithms with a direct estimate of the test error by using cross-validation.
Use the train() function in the caret package. It uses leaps underneath.
Set the method= parameter of train() to leapForward, leapBackward, or leapSeq for forward, backward, and stepwise selection, respectively.
Backward elimination with 10-fold cross-validation.
set.seed(123)
train.control <- trainControl(method="cv", number=10)
bkwd.model <- train(Balance ~ . , data =Credit,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:11),
trControl = train.control)
bkwd.model$results
# the index of the selected ("best") model
bkwd.model$bestTune$nvmax
[1] 6
summary(bkwd.model$finalModel)
coef(bkwd.model$finalModel, bkwd.model$bestTune$nvmax)
Selection Algorithm: backward
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth RegionWest
1 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
3 ( 1 ) "*" "*" " " " " " " " " " " "*" " " " " " "
4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " " " " " "
5 ( 1 ) "*" "*" "*" "*" " " " " " " "*" " " " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " " " "
(Intercept) Income Limit Rating Cards Age StudentYes
-493.7341870 -7.7950824 0.1936914 1.0911874 18.2118976 -0.6240560 425.6099369
| | leaps method= | caret::train method= |
|---|---|---|
| All Subset (regression) | exhaustive | |
| Forward (selection) | forward | leapForward |
| Backward (elimination) | backward | leapBackward |
| Stepwise (selection) | seqrep | leapSeq |
You can use the bestglm() function in the bestglm package.
It implements the exhaustive search of the leaps package to perform an all subsets approach.
The model is passed as a single data frame Xy with the target variable as the last column.
In the binomial (non-logistic) GLM case, the last two columns of Xy are the counts of ‘successes’ and ‘failures’.
The function will work for up to 15 inputs.
See examples in this week’s code.
Shrinkage estimation is a statistical estimation technique that imposes constraints on the parameter estimates. The resulting estimates are closer to zero in absolute value than the unconstrained estimates. We say the estimates were shrunk toward zero.
Shrinking coefficients can significantly reduce the variability of the coefficient estimates. Intuitive?
Feature selection can be seen as an extreme form of shrinkage where the coefficients of predictors not in the model are set to zero.
The most important shrinkage methods in regression are Ridge regression and the Lasso.
Least Squares minimizes \[\text{SSError} = \sum_{i=1}^n \left ( y_i - {\bf x}_i^\prime {\bf \beta} \right )^2\]
Ridge Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p \beta_j^2\]
The term \(\lambda \sum_{j=1}^p \beta_j^2\) is called the ridge penalty or shrinkage penalty
\(\lambda \geq 0\) is a hyper-parameter; we need to find a value for it
Effects of the shrinkage penalty \(\lambda \sum_{j=1}^p \beta_j^2\)
For given \(\lambda\), the penalty is small if the \(\beta_j\) are close to zero
As \(\lambda\) increases the penalty becomes more severe
As \(\lambda\) increases the \(\beta_j \rightarrow 0\)
If \(\lambda\) is large, you essentially fit an intercept-only model
For \(\lambda=0\) the least squares estimates result
For every value of \(\lambda\) there is a different vector of \({\bf \beta}\) estimates
\(\lambda\) is a hyper-parameter for which we have to find a value \(\Rightarrow\) cross-validation
Computational considerations
The intercept \(\beta_0\) is not shrunk; it represents the mean response when the (centered) inputs are zero
There is one shrinkage parameter \(\lambda\), it applies to all \(\beta_j\)
The \({\bf X}\) matrix is centered and scaled so that all columns have mean 0 and standard deviation 1. Otherwise a common \(\lambda\) parameter would not make sense.
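For intuition, the ridge estimator on the centered and scaled \({\bf X}\) has the closed form \(\widehat{\beta}^{ridge} = ({\bf X}^\prime{\bf X} + \lambda {\bf I})^{-1}{\bf X}^\prime {\bf y}\). A minimal sketch on simulated data (glmnet's internal scaling conventions differ slightly, so its numbers need not match these exactly):

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

Xs <- scale(X)          # columns: mean 0, sd 1
yc <- y - mean(y)       # center y; the intercept is not shrunk

lambda     <- 10
beta_ridge <- solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc))
beta_ls    <- solve(crossprod(Xs), crossprod(Xs, yc))    # lambda = 0 recovers least squares
cbind(ridge = beta_ridge, ls = beta_ls)                  # ridge estimates pulled toward zero
```

For any \(\lambda > 0\) the ridge coefficient vector has strictly smaller Euclidean norm than the least squares vector.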
Variable Selection
binary decision: predictor \(j\) is in (\(|\beta_j| > 0\)) or out (\(\beta_j = 0\))
requires \(n > p\) for the models considered
still can produce highly variable estimates if \({\bf X}\) matrix is poorly conditioned (dependencies among the \({\bf X}\)s)
Ridge Regression
can be used even if \(p > n\)
allows all predictors to be in the model, even if they make only tiny contributions
trades some bias for lower variance
Ridge regression (and Lasso) can be performed in R with the glmnet() function in the glmnet package.
In contrast to other modeling procedures, glmnet() does not accept the usual model-building formula (\(y \sim {\bf X}_1 + {\bf X}_2 + ...\)). Instead, you have to supply the \({\bf X}\) matrix and the \({\bf y}\) vector to the function.
glmnet() will standardize the columns of \({\bf X}\) by default and report the regression coefficients on the original scale.
The Hitters data set in ISLR contains salaries and records for baseball players. The following statements build the \({\bf X}\) matrix (without intercept) and the \({\bf y}\) vector and obtain Ridge regression estimates for \(\lambda = 100, 10, 0.1, 0\). (alpha=0 is for Ridge regression, alpha=1 is for Lasso.)
Hit <- na.omit(Hitters)
X <- model.matrix(Salary ~ ., data=Hit)[,-1]
y <- Hit$Salary
grid <- c(100, 10, 0.1, 0)
ridge_reg <- glmnet(X, y, alpha=0, lambda=grid)
You can choose the hyper-parameter \(\lambda\) by cross-validation with the cv.glmnet() function, very similar to using cv.glm() in the previous module.
set.seed(6543)
cv.out <- cv.glmnet(X, y, alpha=0, nfolds=10, type.measure="mse")
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
[1] 25.52821
log(bestlam)
[1] 3.239784
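With bestlam in hand, coefficients and predictions at that \(\lambda\) follow directly (continuing the Hitters example above; glmnet's documentation recommends predicting from the full solution path rather than refitting at a single \(\lambda\)):

```r
coef(cv.out, s = "lambda.min")                    # ridge coefficients at the best lambda
ridge_all <- glmnet(X, y, alpha = 0)              # full solution path
head(predict(ridge_all, s = bestlam, newx = X))   # predictions at the best lambda
```

cv.glmnet also stores lambda.1se, a larger (more heavily shrunk) value whose CV error is within one standard error of the minimum.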
When \(p > n\), least squares estimation breaks down.
Let’s simulate a data set with \(n=5\) and \(p=20\) predictors.
set.seed(1234)
vec <- runif(100)
X <- matrix(vec, nrow=5, ncol=20)
y <- rnorm(dim(X)[1]) + rowSums(X)
linreg <- lm(y ~ X)
Least squares “runs out” of degrees of freedom after four predictors.
The fitted model is saturated, that is, a perfect fit that interpolates the y data.
> coef(linreg)
(Intercept) X1 X2 X3 X4 X5 X6 X7 X8
-7.638833 12.242983 -3.934432 10.842579 15.268963 NA NA NA NA
X9 X10 X11 X12 X13 X14 X15 X16 X17
NA NA NA NA NA NA NA NA NA
X18 X19 X20
NA NA NA
> round(y - predict(linreg),6)
1 2 3 4 5
0 0 0 0 0
Ridge regression, on the other hand, allows all predictors to contribute:
ridge_reg <- glmnet(X,y,alpha=0,lambda=c(100,10,0.1,0.01))
coef(ridge_reg)
21 x 4 sparse Matrix of class "dgCMatrix"
s0 s1 s2 s3
(Intercept) 7.68687 6.99069 5.73141 5.67905
V1 -0.01715 -0.09326 -0.14427 -0.15359
V2 0.02197 0.13473 0.31032 0.33219
V3 0.01824 0.13018 0.38462 0.40144
V4 0.02882 0.16602 0.31569 0.31926
V5 0.06578 0.40821 1.01352 1.09996
V6 -0.01910 -0.13293 -0.42493 -0.46014
V7 0.00925 0.05357 0.03708 0.07006
V8 0.02107 0.13840 0.36602 0.37237
V9 0.03712 0.27334 0.88313 0.92015
V10 0.02202 0.14437 0.41119 0.42711
V11 -0.05368 -0.34858 -0.91773 -0.91176
V12 -0.00808 -0.06854 -0.27528 -0.28008
V13 0.01568 0.07942 0.07134 0.05679
V14 0.01493 0.06842 0.00454 0.00741
V15 0.00370 0.04326 0.22480 0.22586
V16 0.05926 0.39137 1.06970 1.08200
V17 0.03020 0.19274 0.48713 0.48341
V18 0.02981 0.18198 0.42613 0.42566
V19 0.00409 0.04189 0.19267 0.19308
V20 0.04678 0.30709 0.85295 0.85442
And the predictions do not perfectly interpolate the \(y\)-data.
predict(ridge_reg, newx=X)
s0 s1 s2 s3
[1,] 7.919595 8.441812 9.155059 9.168386
[2,] 7.804364 7.785944 7.982064 7.990984
[3,] 7.705466 7.081441 5.750685 5.713745
[4,] 7.784464 7.659786 7.612904 7.614922
[5,] 7.861569 8.106474 8.574746 8.587421
y
[1] 9.169844 7.991979 5.709654 7.615122 8.588858
An arguable disadvantage of Ridge regression is that it includes all predictors, giving nonzero estimates \(\beta_1, \cdots, \beta_p\). It therefore cannot be used for feature selection and can be difficult to interpret.
Solution: LASSO (Least Absolute Shrinkage and Selection Operator) combines regularization (shrinkage) with variable selection.
The difference between Ridge and Lasso is the penalty term.
Least Squares minimizes \[\text{SSError} = \sum_{i=1}^n \left ( y_i - {\bf x}_i^\prime {\bf \beta} \right )^2\]
Ridge Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p \beta_j^2\] Lasso Regression minimizes \[\text{SSError} + \lambda \sum_{j=1}^p |\beta_j|\]
The Lasso effect
As \(\lambda\) grows some \(\beta_j\) will be exactly zero
Interpret this as: the predictor is no longer important
Depending on \(\lambda\), different predictors will be in the model
In R use glmnet() with the alpha=1 parameter to get Lasso parameter estimates.
glmnet() fits a class of models called elastic nets. The elastic net penalty is \[\lambda \left \{ \frac{1-\alpha}{2} \sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \right \}\]
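For \(0 < \alpha < 1\) the penalty mixes the two: the L1 part can zero out coefficients while the L2 part shrinks the rest. A sketch continuing with the Hitters X and y from above (alpha = 0.5 and lambda = 10 are arbitrary illustrative choices):

```r
enet <- glmnet(X, y, alpha = 0.5, lambda = 10)
coef(enet)   # a mix of exact zeros (from the L1 part) and shrunk estimates (from the L2 part)
```

Elastic nets are often preferred over the pure Lasso when predictors are strongly correlated, since the L2 part stabilizes the selection.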
Lasso regression for the Hitters data set with \(\lambda = 100, 10, 1, 0.1, 0\).
grid <- c(100, 10, 1, 0.1 , 0)
lasso_reg <- glmnet(X, y, alpha=1, lambda=grid)
coef(lasso_reg)
Lasso regression estimates for \(\lambda = 100, 10, 1, 0.1, 0\).
s0 s1 s2 s3 s4
(Intercept) 220.10409 -1.49700 150.27526 163.35927 164.53816
AtBat . . -1.90102 -1.98737 -1.98522
Hits 1.13626 2.01155 6.72889 7.40745 7.46595
HmRun . . 1.33127 3.71491 4.17264
Runs . . -1.03154 -2.14140 -2.28368
RBI . . . -0.84049 -1.00296
Walks 1.18265 2.24853 5.56223 6.16142 6.22837
Years . . -6.90650 -4.30318 -3.86027
CAtBat . . -0.07605 -0.15629 -0.16934
CHits . . . 0.14104 0.15983
CHmRun . 0.04705 0.13180 -0.00474 -0.08136
CRuns 0.11012 0.21931 1.15880 1.37567 1.40784
CRBI 0.31456 0.40399 0.61220 0.73167 0.76981
CWalks . . -0.72112 -0.79961 -0.80606
LeagueN . 18.93190 46.61968 61.12354 62.74030
DivisionW . -115.19852 -116.21235 -116.95948 -117.12701
PutOuts 0.00330 0.23596 0.28144 0.28230 0.28189
Assists . . 0.29052 0.36238 0.37108
Errors . -0.78124 -2.85412 -3.32928 -3.38591
NewLeagueN . . -9.75083 -23.67106 -25.28518
The value of \(\lambda\) has great impact on Ridge and Lasso results.
You can compute the test error for different values of \(\lambda\) using:
validation data set
leave-one-out cross-validation
\(k\)-fold cross-validation
The cv.glmnet() function in R helps greatly with the workflow.
Choosing \(\lambda\) by 10-fold cross-validation in Lasso regression.
cv.out <- cv.glmnet(X,y,alpha=1)
bestlam <- cv.out$lambda.min
bestlam
[1] 2.436791
log(bestlam)
[1] 0.8906821
plot(cv.out)
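To list which predictors the Lasso retains at the CV-chosen \(\lambda\) (continuing with cv.out from the chunk above):

```r
lasso_coef <- coef(cv.out, s = "lambda.min")        # sparse coefficient vector
rownames(lasso_coef)[as.vector(lasso_coef != 0)]    # nonzero terms, including the intercept
```

This is exactly the "feature selection" property: predictors whose coefficients were driven to zero simply drop out of the list.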
See R code for a simulation to see whether Lasso regression is able to recover the underlying model when \(p > n\).
Spoiler alert: it does!
No method uniformly dominates the other in terms of prediction accuracy
Lasso performs better when a few predictors have large coefficients and the remaining predictors are close to zero
Ridge regression performs better when the response is related to many predictors of equal size
Lasso performs feature selection
Lasso models are easier to interpret
With Lasso models it is easier to predict (score) new observations. Why?
Regression is a dimension-reducing technique
\({\bf Y}_{n \times 1}\) is a vector in \(n\)-dimensional space
\({\bf X}{\bf \beta}\) lies in a \((p+1)\)-dimensional subspace of that space
Sometimes \(p\) is too large
We have seen techniques that deal with \(p\) being (possibly) too large
Feature selection: set some \(\beta_j = 0\)
Lasso & Ridge regression: place a constraint on the coefficient estimates
Another technique is to apply a two-step method
Obtain inputs \(Z_1,\cdots,Z_M\) from \({\bf X}_1, \cdots, {\bf X}_p\), where \(M < p\)
Fit the model using \(Z_1, \cdots, Z_M\) as the predictors
The \(Z_m\) are linear combinations of all predictors \[Z_m = \sum_{j=1}^p \phi_{jm}{\bf X}_j\] The \(\phi_{jm}\) are called the loadings.
The regression model in step 2 becomes
\[\textbf{Y}_{n \times 1} = \theta_0 + \textbf{Z}_{n \times M}{\bf \theta} + {\bf \epsilon}\]
The dimension of the problem has been reduced from \(p+1\) to \(M+1\)
You can show that this is equivalent to a linear model with \[\beta_j = \sum_{m=1}^M\theta_m \phi_{jm}\] In other words, we control variability by constraining the coefficients \(\beta_j\) to take on this form.
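The equivalence can be checked numerically: regress \(y\) on the first \(M\) principal component scores, then map \(\theta\) back to implied \(\beta\)s through the loadings. A sketch on simulated data (all names and dimensions here are made up for illustration):

```r
set.seed(1)
n <- 100; p <- 6; M <- 2
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% rnorm(p) + rnorm(n))

pc    <- prcomp(X, center = TRUE, scale. = TRUE)
Z     <- pc$x[, 1:M]              # scores: the first M linear combinations
fit   <- lm(y ~ Z)                # step-2 regression on the Z's
theta <- coef(fit)[-1]
phi   <- pc$rotation[, 1:M]       # loadings phi_{jm}

beta_implied <- phi %*% theta     # beta_j = sum_m theta_m * phi_{jm}

# Identical fitted values as a linear model in the standardized X's:
all.equal(as.vector(fitted(fit)),
          as.vector(coef(fit)[1] + scale(X) %*% beta_implied))   # → TRUE
```

So PCR is an ordinary linear model whose \(p\) coefficients are constrained to lie in the \(M\)-dimensional space spanned by the loading vectors.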
Principal component analysis (PCA) is an unsupervised learning method that finds linear combinations of the \(p\) inputs that explain decreasing amounts of variability among the \({\bf X}\)s. These linear combinations are called the principal components and are orthogonal to each other.
The principal components point in the directions in which the \({\bf X}\)s are most variable.
PCA is used to visualize high-dimensional data and to reduce the dimensionality of a problem to the most important principal components (those that explain the most variability).
Principal component regression (PCR) uses \(M\) principal components from a PCA as the inputs in a regression model (instead of the \(p\) \({\bf X}\)s).
PCA is an unsupervised technique: the \(Y\)-data do not participate in computing the principal components.
PCR is a supervised technique with target \({\bf Y}\)
PCR Assumption: The directions in which \({\bf X}_1,\cdots,{\bf X}_p\) are most variable are the directions most closely related to \({\bf Y}\). In other words: things that explain a lot about the \({\bf X}\)s also explain a lot about \(Y\).
\(M\), the number of principal components to use, is a hyper-parameter
Choose \(M\) using cross-validation
In R use the pcr() function in the pls package.
See R code for a worked example.
pcr.fit <- pcr(Salary ~ ., data=Hit,
scale=TRUE,
subset=train,
validation="CV",
segments=5)
summary(pcr.fit)
Principal component regression with 5-fold CV and ncomp=10
Data:    X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 5 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
CV 452 351.7 351.2 350.5 348.2 345.8 342.5 343.4 347.0 347.1 345.3
adjCV 452 351.0 350.4 349.6 347.1 344.7 340.9 341.8 344.9 345.1 343.0
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
X        38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96    96.28    97.26
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75 46.86 47.76
Validation plot with MSE (instead of RMSE)
validationplot(pcr.fit,val.type="MSEP",legendpos="topright")
Score plot for the first four principal components
scoreplot(pcr.fit, comps=1:4, cex=0.7)