▶ Set validation goals
▶ Formalize the concept of a loss function/metric
▶ Understand the differences between a validation (hold-out) data set, leave-one-out, and k-fold cross-validation
▶ Appreciate k-fold CV as a general-purpose technique for assessing model performance
▶ Appreciate the bootstrap as a general-purpose technique to estimate uncertainty
▶ Remember to plot results and data!
Never evaluate on the training data; use test data for validation and work in cross-validation settings
Use quantitative metrics AND visualization
Compare to simple baselines (see the sketch after this list):
Persistence (naive) model: \(\hat{x}_{t+1|t} = x_t\)
Seasonal average
Reduced models
Quantify as much uncertainty as you can
Prepare a communication plan
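A minimal sketch of the baseline comparison, using a simulated AR(1) series; the split, series, and candidate model are illustrative assumptions, not course data:
set.seed(1)
x <- arima.sim(list(ar = 0.8), n = 100)            # hypothetical time series
train_x <- x[1:80]; test_x <- x[81:100]            # simple temporal split
fit <- arima(train_x, order = c(1, 0, 0))          # candidate model
model_fc <- predict(fit, n.ahead = 20)$pred        # model forecasts for the test period
naive_fc <- rep(tail(train_x, 1), 20)              # persistence: carry the last observed value forward
c(model_RMSE = sqrt(mean((test_x - model_fc)^2)),  # the candidate should beat the naive baseline
  naive_RMSE = sqrt(mean((test_x - naive_fc)^2)))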
Training data
- leads to overfitting because you can drive Ave(err) →0
- does not measure how well the model generalizes to new obs
- does not measure the true performance of the model
- BUT, easy to calculate as a first check; we always have training data
Test data
- data that was not used in fitting the model
- measures true performance of the model
- helps in selecting proper level of flexibility of the model
- difficult to calculate, requires test data
The big question: where do we get the test data?
Training Data: Used to fit or train the model. The model learns patterns, relationships, and parameters from this data.
Validation Data: Used to guide model selection (e.g., decision trees vs. random forests) and to tune hyperparameters (e.g., tree depth, regularization strength) before the final evaluation.
Testing Data: Used for the final, unbiased evaluation of the model’s performance. It simulates how the model will perform on completely unseen data in the real world.
Why it matters: Avoiding Data Leakage + Realistic Performance
Definition: A technique to create test data and assess how well a model generalizes to new data
Purpose: Estimate out-of-sample (test) performance and guide model selection
Approach: Split the data into training and test portions; the variants below differ in how the split is made
Hold-Out Method:
- Split data into a training set with \(n-m\) datapoints and a test set of size \(m\).
- Simple but can be sensitive to the split.
k-Fold Cross-Validation:
- Divide data into k folds.
- Train on k-1 folds, test on the remaining fold.
- Repeat for each fold.
Leave-One-Out Cross-Validation (LOOCV):
- Extreme case of k-Fold where k = n (the number of observations).
- Computationally expensive but low bias.
Cross-validation for time series:
- Problem: Randomly splitting time series breaks temporal dependencies, leading to unrealistic performance estimates.
- Solutions:
Rolling Window: Train on a window of past data, test on the next period. E.g., train on 2010–2018, test on 2019; then train on 2010–2019, test on 2020 (see the sketch below).
Time Series k-Fold: Split data into k contiguous folds, preserving temporal order.
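A minimal sketch of an expanding-window (rolling-origin) evaluation on a simulated annual series; the AR(1) model and window sizes are illustrative assumptions:
set.seed(1)
y <- ts(cumsum(rnorm(20)), start = 2001)           # hypothetical annual series, 2001-2020
errors <- numeric(0)
for (n_train in 10:19) {                           # grow the training window one year at a time
  fit <- arima(y[1:n_train], order = c(1, 0, 0))   # simple AR(1) as a stand-in model
  fc  <- predict(fit, n.ahead = 1)$pred            # one-step-ahead forecast
  errors <- c(errors, y[n_train + 1] - fc)         # compare with the next, held-out year
}
sqrt(mean(errors^2))                               # rolling-origin RMSE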
Cross-validation for spatial data:
- Problem: Nearby observations are spatially correlated, so random folds leak information and give overly confident performance estimates.
- Solutions: Divide space into blocks/clusters and use the blocks as folds; leave out entire clusters for testing (see the sketch below).
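A minimal sketch of spatial block CV on simulated data; the k-means blocking and the linear model are illustrative assumptions:
set.seed(1)
coords <- data.frame(lon = runif(200), lat = runif(200))          # hypothetical site coordinates
dat <- data.frame(coords, x = rnorm(200))
dat$y <- 2 * dat$x + 5 * dat$lat + rnorm(200)                     # simulated spatially structured response
dat$fold <- kmeans(coords, centers = 5)$cluster                   # 5 spatial blocks used as folds
cv_err <- sapply(1:5, function(k) {
  fit <- lm(y ~ x, data = dat[dat$fold != k, ])                   # train on the other blocks
  mean((dat$y[dat$fold == k] - predict(fit, dat[dat$fold == k, ]))^2)
})
mean(cv_err)                                                      # block-CV estimate of prediction error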
Stratified k-Fold (for Classification):
- Ensures each fold has the same proportion of class labels as the full dataset.
- The data are first grouped by class label; for example, with two classes (A and B), all class-A samples and all class-B samples are separated.
- Samples from each class are then allocated to folds proportionally: if the original dataset has 80% class A and 20% class B, each fold will also contain around 80% class A and 20% class B.
- Within each class, the samples are shuffled randomly before being assigned to folds, so each fold is a random subset of the class, not just the first or last n samples.
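A minimal sketch of stratified 5-fold CV using caret::createFolds, which stratifies on the outcome; the Default data are only an example:
library(caret)
data(Default, package = "ISLR")
set.seed(1)
folds <- createFolds(Default$default, k = 5)                           # list of hold-out index vectors, stratified by class
sapply(folds, function(idx) prop.table(table(Default$default[idx])))   # class proportions are preserved in every fold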
Randomly choose a subset of size \(m\) from the observations, this is the hold-out set
Train the model on the remaining \(n-m\) observations
Predict or classify the data in the hold-out set based on the trained model
Calculate the test error in the hold-out set.
Problems:
- Variability of results across different sets of \(m\) observations
- May need stratified sampling to preserve distributions
- How big should \(m\) be (relative to \(n\))? 50:50? 80:20? 90:10?
Modeling miles per gallon (mpg) as a function of horsepower for 392 different cars. The dataset comes with the ISLR package. ISLR uses a 50:50 split to obtain the test data.
library(ISLR2)
data(Auto)
set.seed(1)
train <- sample(nrow(Auto),196) # randomly select 196 of the 392 rows as the training set
head(Auto[train,]) # select the training set
head(Auto[-train,]) # select the test set
Find more details in the week’s code.
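A minimal sketch of completing the hold-out evaluation with this split; the simple regression of mpg on horsepower follows the slide above:
fit <- lm(mpg ~ horsepower, data = Auto, subset = train)                 # fit on the 196 training rows
test_mse <- mean((Auto$mpg[-train] - predict(fit, Auto[-train, ]))^2)    # squared-error loss on the held-out rows
test_mse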
Advantages
- easy to do, uses some of the training obs for validation
- by fixing random number seed method is repeatable between runs
- a general method that can be applied to any model
Disadvantages
- has a random element; results change depending on which observations are selected
- variability from run to run can be large, especially for noisy data
- has a tendency to overestimate the test error compared to
leave-one-out CV
- not every observation gets to be used in training and in testing
Step 1. Set the index of the hold-out observation to \(i\)
(Start with \(i=1\))
Step 2. Hold-out observation \(i\) and train the model on the remaining \(n-1\) observations.
Step 3. Compute \(e_i\), the test error for the held-out observation
Step 4. Put the observation back into the data, increment \(i\), and go back to Step 2.
Step 5. Repeat until you have held out each observation once.
Compute the test error as the average of individual errors \(\displaystyle \frac{1}{n}\sum_{i=1}^n e_i\)
Advantages
no randomness involved, LOOCV produces identical results upon repetition
every observation gets to be used in training (\(n−1\) times) and test (once)
tends not to overestimate the test error as much as the validation test approach
a general method that can be applied to any model
Disadvantage
computationally expensive: in general the model must be refit \(n\) times
Compute the test error by LOOCV for the SLR model of mpg on horsepower by fitting 392 regression models.
Auto2 <- Auto
PRESS <- 0                                    # running sum of squared prediction errors
for (i in 1:nrow(Auto2)) {
  hp <- Auto2[i,"horsepower"]                 # remember the held-out predictor value
  Auto2[i,"horsepower"] <- NA                 # setting it to NA drops row i from the fit (na.omit)
  m <- lm(mpg ~ horsepower, data=Auto2, na.action=na.omit)
  Auto2[i,"horsepower"] <- hp                 # restore the value
  yhat_i <- predict(m, newdata=Auto2[i,])     # predict the held-out observation
  PRESS <- PRESS + (Auto2[i,"mpg"] - yhat_i)^2
}
cat("LOOCV Prediction Error = ", PRESS/nrow(Auto2))
LOOCV Prediction Error = 24.23151
For some types of models we are in luck.
Denote as \(\widehat{y}_{-i}\) the predicted value for the \(i\)th observation derived from a regression that excluded (only) the \(i\)th observation
The test error via LOOCV is \(\displaystyle \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y}_{-i})^2\)
In linear regression models the following holds \[y_i - \widehat{y}_{-i} = \frac{y_i - \widehat{y}_i}{1-h_{ii}}\] where \(h_{ii}\) is the leverage of the \(i\)th observation.
The prediction sum of squares (PRESS) statistic is
\[PRESS = \sum_{i=1}^n \left ( y_i -
\widehat{y}_{-i} \right ) ^2\] Models with a small PRESS
statistic have good predictive ability and generalize well to new
observations.
The average PRESS value \[\frac{1}{n}PRESS =
\frac{1}{n}\sum_{i=1}^n(y_i - \widehat{y}_{-i})^2\] is the
leave-one-out cross-validation error (for squared-error loss).
Compute the LOOCV test error by applying the leverage formula.
slr <- lm(mpg ~ horsepower, data=Auto)        # single fit on all observations
leverage <- hatvalues(slr)                    # diagonal of the hat matrix
PRESS_res <- slr$residuals / (1-leverage)     # leave-one-out (PRESS) residuals
pred_error <- mean(PRESS_res^2)
pred_error
Linear Regression
- The LOOCV test error can be calculated based on a
single fit using all \(n\) observations
- LOOCV is computationally cheap, just need to get the diagonals of the
“hat” matrix (the leverages)
Generalized Linear Models
- LOOCV can be computed cheaply based on exact formulas or
approximations. You can always validate the hard way whether the
calculation is exact.
Complex Models
- Need to calculate LOOCV the hard way, by fitting the model \(n\) times.
- Or use approximations
Hybrid technique between the hold-out approach and LOOCV.
randomly partition the observations into \(k\) non-overlapping sets, called folds
fit the model \(k\) times, each time holding out a different fold
calculate the test error from the \(n_j\) hold-out samples in fold \(j\): \(E_j = \frac{1}{n_j} \sum_{i=1}^{n_j} e_i\)
average the \(k\) test errors \[\frac{1}{k} \sum_{j=1}^k E_j\] or study their distribution
Advantages
- computationally cheaper than LOOCV (only \(k\) fits)
- every observation is used once for testing
- less variability than the hold-out approach
Disadvantages
- some randomness in how folds are assigned
- more bias than LOOCV, since each fit uses fewer than \(n-1\) observations
10-fold cross-validation for the horsepower model.
library(boot)                            # cv.glm() is in the boot package
glm.fit <- glm(mpg ~ horsepower, data=Auto)
cv.err <- cv.glm(Auto, glm.fit, K=10)    # 10-fold CV estimate of the prediction error
cv.err$K
[1] 10
cv.err$delta[1]
[1] 24.3664
See comparison in R between 5-fold and 10-fold CV.
Which method should you choose?
▶ Hold-out approach has high variability and is biased
▶ LOOCV produces an approximately unbiased estimator of the true test error
▶ k-fold CV has an intermediate level of bias (more than LOOCV) and some variability (less than hold-out approach)
▶ The variance of LOOCV is quite high because the \(n\) error estimates are highly correlated (why is that?)
Deterministic metrics measure the discrepancy between estimated (predicted) and true testing values.
Discrepancies are called errors \(e = y - \widehat{y}\).
Skill score: Measures improvement relative to a reference prediction, such as the mean (see the sketch after this list).
Visual Diagnostics:
- Predictions vs observed test data (e.g. QQ-plot)
- PIT (Probability Integral Transform) histograms
- Predicted values vs. index, time, or space (are patterns captured?)
Goal: Check calibration (match between prediction and testing observation) and sharpness (prediction variance) of forecasts.
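A minimal sketch of the deterministic metrics and a skill score listed above; the observed and predicted vectors are made-up values for illustration:
y    <- c(3.1, 2.8, 4.0, 3.6, 5.2)      # hypothetical observed test values
yhat <- c(3.0, 3.1, 3.7, 3.9, 4.8)      # hypothetical model predictions
e    <- y - yhat                        # errors
mae  <- mean(abs(e))                    # mean absolute error
rmse <- sqrt(mean(e^2))                 # root mean squared error
mse_ref <- mean((y - mean(y))^2)        # reference prediction: the test mean
skill   <- 1 - mean(e^2) / mse_ref      # skill score: 1 = perfect, 0 = no better than the reference
c(MAE = mae, RMSE = rmse, Skill = skill)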
Accuracy: Proportion of correct
predictions.
Precision: Proportion of true positives among
predicted positives.
Recall (Sensitivity): Proportion of true
positives among actual positives.
F1-Score: Harmonic mean of precision and
recall.
Confusion Matrix: Table summarizing true
vs. predicted labels.
ROC Curve: Plot of true positive rate vs. false positive rate.
AUC: Area under the ROC curve (0.5 = random, 1 = perfect).
Please go back to the advanced regression class to review validation metrics for classification.
library(caret)
data(Default, package = "ISLR")
# Define control parameters for 5-fold CV
ctrl <- trainControl(method = "cv", number = 5,
summaryFunction = twoClassSummary, classProbs = TRUE)
# Fit a logistic regression model
model <- train(default ~ ., data = Default,
method = "glm", family = "binomial", trControl = ctrl)
print(model) # Print cross-validation results
# Confusion Matrix
predictions <- predict(model, newdata = Default)
confusionMatrix(predictions, Default$default)
You want the predictions to show the same patterns as the observations:
Probability distribution
Matching the distribution alone is usually not enough
Temporal or spatial patterns
Other statistics like correlation functions, tail indicators, …
Think about the bike rental example: we hope to capture patterns like the seasonal and daily cycles.
Purpose: Compare the distribution of your model’s predictions to the ‘true’ testing data
Interpretation:
- Points on the line: predictions and observations have matching distributions
- Deviations: predictions over- or under-estimate the ‘truth’
When to Use: After performing predictions on testing data from a fitted statistical model
See this week’s code
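A minimal sketch of such a QQ-plot, reusing the Auto 50:50 hold-out split and the simple regression from earlier:
library(ISLR2)
set.seed(1)
train <- sample(nrow(Auto), 196)                          # same 50:50 split as before
fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
qqplot(Auto$mpg[-train], predict(fit, Auto[-train, ]),    # compare distributions of observed and predicted mpg
       xlab = "Observed test mpg", ylab = "Predicted test mpg")
abline(0, 1, col = "red")                                 # points near the line: matching distributions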
Purpose: Check the calibration of probabilistic forecasts: for a calibrated forecast, the PIT values (the predictive CDF evaluated at the observations) are uniformly distributed.
When to Use: For probabilistic models (e.g., Bayesian regression, quantile regression, or ensemble forecasts).
Interpretation: A roughly uniform PIT histogram indicates good calibration; U-shaped, hump-shaped, or skewed histograms indicate miscalibration (see the table below).
# Simulate ensemble forecasts and observations
set.seed(123)
observed <- rnorm(100)
# each row holds 100 ensemble draws centred on the corresponding observation
predicted <- matrix(rnorm(100 * 100, mean = observed, sd = 0.5), ncol = 100)
# PIT value for observation i: fraction of ensemble members at or below the observation
pit_values <- sapply(seq_along(observed), function(i) mean(predicted[i, ] <= observed[i]))
# Plot PIT histogram; a calibrated forecast yields a roughly uniform histogram
hist(pit_values, breaks = 20, main = "PIT Histogram", xlab = "PIT Values", ylab = "Frequency")
abline(v = c(0, 1), col = "red", lty = 2)
| Plot/Histogram | Well-Calibrated Model | Poorly Calibrated Model |
|---|---|---|
| QQ-Plot | Points follow the line | Points deviate from the line |
| PIT Histogram | Uniform distribution | U-shaped, hump-shaped, or skewed |
QQ-Plots: Use for checking distribution match of predictions with testing `truth’
PIT Histograms: Use for checking calibration of probabilistic forecasts
Action: If plots indicate poor fit, consider transforming variables, changing model assumptions, or using a different model family.
- Model uncertainty:
Wrong assumptions or structure
Can be controlled to a certain extent, hard to quantify
- Parameter uncertainty:
Estimation error due to finite sample size
Can be quantified with Hessian information
- Data uncertainty:
Due to random variation in observations
Can be quantified (confidence/prediction intervals, bootstrap, …)
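A minimal sketch of the interval side, using the Auto data from earlier; the query horsepower value is arbitrary:
library(ISLR2)
m <- lm(mpg ~ horsepower, data = Auto)
new_car <- data.frame(horsepower = 100)           # hypothetical query point
predict(m, new_car, interval = "confidence")      # uncertainty about the mean response (parameter uncertainty)
predict(m, new_car, interval = "prediction")      # wider: also includes observation-level (data) noise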
To bootstrap means to get oneself out of a situation
by applying existing resources. What does that have to do with
statistical learning?
Suppose you want to know the variability of a quantity, e.g., an estimator such as \(\widehat{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\).
The bootstrap solution: resample the data with replacement, recompute the quantity on each resample, and use the spread of these values to estimate its variability.
Your data have natural variability that is important to understand and communicate
There are two main ways to quantify this type of uncertainty:
- Calculate confidence/prediction intervals
- Bootstrap data and estimate variability of quantities of interest, e.g., estimated coefficients, predictions, \(R^2\), …
Use the boot() function with three arguments: (i) the data, (ii) a function you write that returns the statistic(s) of interest, and (iii) the number of bootstrap samples (replicates). The function you write has at least two arguments: the data and an index vector to select observations. This example calculates the standard error of the \(R^2\) statistic via the bootstrap.
library(boot)
boot.fn <- function(data, index) {
  # statistic of interest: R^2 of the model fit to the bootstrap sample selected by index
  summary(lm(mpg ~ poly(horsepower,2), data=data, subset=index))$r.squared
}
set.seed(123)
bs <- boot(data=Auto, boot.fn, R=1000)   # 1000 bootstrap replicates
bs
How does the bootstrap relate to cross-validation?
The bootstrap
- is based on sampling with replacement from the training data
- produces estimates reflecting sampling variability
- can be used to improve statistical learning methods; the technique is called bagging
Bagging
- a general procedure to reduce the variance of a learning method
- applied to learning methods that have high variance (e.g., decision trees)
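A minimal sketch of bagging with the Auto data: predictions from models fit to bootstrap resamples are averaged; the polynomial degree, number of resamples, and prediction grid are arbitrary choices for illustration:
library(ISLR2)
set.seed(1)
grid <- data.frame(horsepower = seq(50, 200, by = 10))
preds <- replicate(200, {
  idx <- sample(nrow(Auto), replace = TRUE)                            # bootstrap resample of the rows
  predict(lm(mpg ~ poly(horsepower, 5), data = Auto[idx, ]), grid)     # refit a flexible, high-variance model
})
bagged <- rowMeans(preds)      # bagged prediction: averaging across resamples reduces variance
head(cbind(grid, bagged))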
Why the Hessian Matters
The Hessian matrix (second derivatives of the log-likelihood) at the MLE quantifies the curvature of the likelihood surface
The negative Hessian at the MLE is the observed Fisher information \(I\); its inverse approximates the variance-covariance matrix of the MLEs:
\[\text{Var}(\widehat{\beta}) \approx I^{-1}(\widehat{\beta}) = -H^{-1}(\widehat{\beta})\]
Standard errors for coefficients come from the diagonal of this matrix
Used for confidence intervals and hypothesis testing (e.g., Wald tests)
Off-diagonal entries give the covariances between the estimated parameters
Variance-Covariance Matrix (Inverse Hessian):
model <- glm(y ~ x1 + x2, family = binomial, data = df)   # df: generic data frame with binary response y, predictors x1, x2
vcov_matrix <- vcov(model)                                 # inverse of the observed Fisher information
Numerical Hessian (if needed):
library(numDeriv)
# negative log-likelihood of the logistic model above, as a function of the coefficient vector
neg_log_lik <- function(coef) {
  eta <- coef[1] + coef[2] * df$x1 + coef[3] * df$x2       # linear predictor
  -sum(dbinom(df$y, size = 1, prob = plogis(eta), log = TRUE))
}
H <- hessian(neg_log_lik, coef(model))   # numerical Hessian at the MLE
solve(H)                                 # approximate variance-covariance matrix; compare with vcov(model)
Forecasts are inputs to decisions, not ends in themselves
Tailor outputs to user needs:
Risk managers need quantiles
Planners need scenario ranges
Emphasize interpretability over aesthetics
Same forecast communicated in two ways:
Demand will be 1000 MW (false precision)
There is an 80% chance demand will stay between 950 and 1050 MW (transparent)