Validation of Statistical Models

Learning Goals

▶ Set validation goals

▶ Formalize the concept of a loss function/metric

▶ Understand the differences between validation data set, leave-one-out and k-fold cross-validation

▶ Appreciate k-fold CV as a general-purpose technique for model assessment and to measure model performance

▶ Appreciate the bootstrap as a general-purpose technique to estimate uncertainty

▶ Remember to plot results and data!

Why Validate Models?

  • Overfitting vs. Underfitting:
    • Overfitting: Model fits training data too closely, performs poorly on new data, high variance
    • Underfitting: Model is too simple, fails to capture patterns, high bias
  • Generalization:
    • Goal: Build models that perform well on unseen data.
  • Prediction Realism (“perform well”):
    • Quantify how realistic the predictions are compared to observed data
    • Use cross-validation to avoid a biased, “lucky” evaluation on a single favorable split

Key Principles of Validation

  • Never evaluate on the training data: use held-out test data for validation and work in cross-validation settings

  • Use quantitative metrics AND visualization

  • Compare to simple baselines (a minimal comparison sketch follows this list):
    • Persistence (naive) model: \(\hat{x}_{t+1|t} = x_t\)
    • Seasonal average
    • Reduced models

  • Quantify as much uncertainty as you can

  • Prepare a communication plan
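
A minimal sketch of the baseline comparison mentioned above, using a toy series (all names are placeholders): the persistence forecast of \(x_{t+1}\) is the current value \(x_t\), and a useful model should have a smaller test error.

# Persistence (naive) baseline: forecast x[t+1] by x[t]
set.seed(1)
x <- cumsum(rnorm(200))                            # toy time series
persistence_forecast <- head(x, -1)                # forecasts for times 2..200
actual               <- tail(x, -1)                # observed values at times 2..200
rmse_persistence <- sqrt(mean((actual - persistence_forecast)^2))
rmse_persistence                                   # a candidate model should beat this test RMSE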

Cross-Validation Techniques

Training/Testing Data

Training data
- rewards overfitting: the average training error Ave(err) can be driven toward 0
- does not measure how well the model generalizes to new observations
- does not measure the true performance of the model
- BUT, easy to calculate as a first check, since we always have training data

Test data
- data that was not used in fitting the model
- measures true performance of the model
- helps in selecting proper level of flexibility of the model
- harder to obtain: requires data that was held out from model fitting

The big question: where do we get the test data?

Training/Testing/Validation

  1. Training Data: Used to fit or train the model. The model learns patterns, relationships, and parameters from this data.

  2. Validation Data: Used to tune the model’s hyperparameters and guide model selection before final evaluation, e.g., choosing between decision trees and random forests, or setting tree depth and regularization strength.

  3. Testing Data: Used for the final, unbiased evaluation of the model’s performance.
    It simulates how the model will perform on completely unseen data in the real world.

Why it matters: Avoiding Data Leakage + Realistic Performance

What is Cross-Validation?

  • Definition: Technique to create test data and assess how well a model generalizes to new data

  • Purpose:

    • Assess prediction skills and generalizability of a model
    • (Optimize model hyperparameters (e.g. LASSO shrinkage parameter))
  • Approach: Training vs. Test Data:

    • Training: Used to fit the model.
    • Test: Used to evaluate prediction performance.

Types of Cross-Validation:

Hold-Out Method:
- Split data into a training set with \(n-m\) datapoints and a test set of size \(m\).
- Simple but can be sensitive to the split.

k-Fold Cross-Validation:
- Divide data into k folds.
- Train on k-1 folds, test on the remaining fold.
- Repeat for each fold.

Leave-One-Out Cross-Validation (LOOCV):
- Extreme case of k-Fold where k = n (number of observations).
- Computationally expensive but low bias.

Types of Advanced Cross-Validation:

Cross-validation for time series:
- Problem: Randomly splitting time series breaks temporal dependencies, leading to unrealistic performance.
- Solutions:
Rolling-Origin Window: Train on past data up to an origin, test on the next window. E.g., train on 2010–2018, test on 2019; then train on 2010–2019, test on 2020.
Time Series k-Fold: Split data into k contiguous folds, preserving order.
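
A minimal sketch of a rolling-origin (expanding-window) evaluation, using a toy data frame with columns year and y (the model and names are placeholders):

set.seed(1)
d <- data.frame(year = 2010:2020, y = rnorm(11))               # toy yearly data
errors <- sapply(2018:2019, function(origin) {
    fit  <- lm(y ~ year, data = subset(d, year <= origin))     # train on all years up to the origin
    test <- subset(d, year == origin + 1)                      # test on the following year
    mean((test$y - predict(fit, newdata = test))^2)
})
mean(errors)                                                   # average test MSE over the origins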

Cross-validation for spatial data:
- Problem: Spatial autocorrelation lets nearby observations leak information across random folds, producing overly optimistic performance estimates.
- Solutions: Divide space into blocks/clusters, use blocks as folds.
Leave out entire clusters for testing.

Stratified k-Fold (for Classification):
- Ensures each fold has the same proportion of class labels as the dataset.

  1. First group the data by class labels.
    For example, if you have two classes (A and B), it separates all samples of class A and all samples of class B.

  2. For each fold, the algorithm allocates samples from each class proportionally.
    If the original dataset has 80% class A and 20% class B, each fold will also contain around 80% class A and 20% class B.

  3. Within each class, the samples are shuffled randomly before being assigned to folds.
    This ensures that the samples in each fold are a random subset of the class, not just the first or last n samples.
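
A minimal sketch of stratified folds with the caret package: createFolds() samples within the levels of a factor outcome, so each fold roughly preserves the class proportions.

library(caret)
set.seed(1)
y <- factor(rep(c("A", "B"), times = c(80, 20)))   # imbalanced outcome: 80% A, 20% B
folds <- createFolds(y, k = 5)                     # list of stratified test-set indices
sapply(folds, function(idx) table(y[idx]))         # each fold keeps roughly the 80/20 split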

Hold-out Method

  1. Randomly choose a subset of size \(m\) from the observations; this is the hold-out set

  2. Train the model on the remaining \(n-m\) observations

  3. Predict or classify the data in the hold-out set based on the trained model

  4. Calculate the test error in the hold-out set.


Problems:
- Variability of results across different sets of \(m\) observations
- May need stratified sampling to preserve distributions
- How big should \(m\) be (relative to \(n\))? 50:50? 80:20? 90:10?

Hold-out Method in R

Modeling miles per gallon mpg as a function of horsepower for 392 different cars.

The dataset comes with the ISLR2 package. Following ISLR, a 50:50 split is used to obtain the test data.

library(ISLR2)
data(Auto)
set.seed(1)
train <- sample(nrow(Auto),196)

head(Auto[train,])    # select the training set
head(Auto[-train,])   # select the test set
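
To complete the example, a minimal sketch that fits on the training half and estimates the test error on the hold-out half (the exact value depends on the seed):

lm.fit <- lm(mpg ~ horsepower, data=Auto, subset=train)     # fit on the training set
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)          # test MSE on the hold-out set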


Find more details in the week’s code.

Hold-out Method Summary

Advantages
- easy to do; sets aside some of the available observations for validation
- by fixing the random number seed the method is repeatable between runs
- a general method that can be applied to any model


Disadvantages
- has a random element to it; results change depending on which observations are selected
- variability from run to run can be large, especially for noisy data
- has a tendency to overestimate the test error compared to leave-one-out CV
- not every observation gets to be used in training and in testing

Leave-One-Out Cross-Validation

Step 1. Set the index of the hold-out observation to \(i\)
(Start with \(i=1\))

Step 2. Hold-out observation \(i\) and train the model on the remaining \(n-1\) observations.

Step 3. Compute \(e_i\), the test error for the held-out observation

Step 4. Put the observation back into the data, increment \(i\) and go back to Step 2.

Step 5. Repeat until you have held out each observation once.

Compute the test error as the average of individual errors \(\displaystyle \frac{1}{n}\sum_{i=1}^n e_i\)

Advantages

  • no randomness involved, LOOCV produces identical results upon repetition

  • every observation gets to be used in training (\(n−1\) times) and test (once)

  • tends not to overestimate the test error as much as the validation set (hold-out) approach

  • a general method that can be applied to any model

Disadvantage

  • can be computationally intensive, if the model has to be fit \(n\) times

LOOCV the Hard Way

Compute the test error by LOOCV for the SLR model with horsepower by fitting 392 regression models.

Auto2 <- Auto
PRESS <- 0
for (i in 1:nrow(Auto2)) {
    hp <- Auto2[i,"horsepower"]
    Auto2[i,"horsepower"] <- NA                    # mask obs i so na.omit drops it from the fit
    m <- lm(mpg ~ horsepower, data=Auto2, na.action=na.omit)
    Auto2[i,"horsepower"] <- hp                    # restore the value for prediction
    yhat_i <- predict(m, newdata=Auto2[i,])        # predict the held-out observation
    PRESS <- PRESS + (Auto2[i,"mpg"] - yhat_i)^2   # accumulate the squared prediction error
}
cat("LOOCV Prediction Error = ", PRESS/nrow(Auto2)) 

LOOCV Prediction Error =  24.23151

LOOCV in practice

For some types of models we are in luck.

  • Denote as \(\widehat{y}_{-i}\) the predicted value for the \(i\)th observation derived from a regression that excluded (only) the \(i\)th observation

  • The test error via LOOCV is \(\displaystyle \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y}_{-i})^2\)

  • In linear regression models the following holds \[y_i - \widehat{y}_{-i} = \frac{y_i - \widehat{y}_i}{1-h_{ii}}\] where \(h_{ii}\) is the leverage of the \(i\)th observation.

Prediction sum of squares

The prediction sum of squares (PRESS) statistic is \[PRESS = \sum_{i=1}^n \left ( y_i - \widehat{y}_{-i} \right ) ^2\] Models with a small PRESS statistic have good predictive ability and generalize well to new observations.
The average PRESS value \[\frac{1}{n}PRESS = \frac{1}{n}\sum_{i=1}^n(y_i - \widehat{y}_{-i})^2\] is the leave-one-out cross-validation error (for squared-error loss).

LOOCV from Leverage

Compute the LOOCV test error by applying the leverage formula.

slr <- lm(mpg ~ horsepower, data=Auto)      # single fit on all observations
leverage <- hatvalues(slr)                  # diagonal of the hat matrix, h_ii
PRESS_res <- slr$residuals / (1-leverage)   # deleted residuals y_i - yhat_{-i}
pred_error <- mean(PRESS_res^2)             # LOOCV test error (average PRESS)
pred_error

LOOCV Summary

Linear Regression
- The LOOCV test error can be calculated based on a single fit using all \(n\) observations
- LOOCV is computationally cheap, just need to get the diagonals of the “hat” matrix (the leverages)

Generalized Linear Models
- LOOCV can be computed cheaply based on exact formulas or approximations. You can always validate the hard way whether the calculation is exact.

Complex Models
- Need to calculate LOOCV the hard way, by fitting the model \(n\) times.
- Or use approximations

\(k\)-Fold Cross-Validation

Hybrid technique between the hold-out approach and LOOCV.

  • randomly partition the observations into \(k\) non-overlapping sets, called folds

  • fit the model \(k\) times, each time holding out a different fold

  • calculate the test error from the \(n_j\) hold-out samples in fold \(j\): \(E_j = \frac{1}{n_j} \sum_{i=1}^{n_j} e_i\)

  • average the \(k\) test errors \[\frac{1}{k} \sum_{j=1}^k E_j\] or study their distribution

  • Advantages

    • not as variable upon repetition as the hold-out method 
    • not as computationally intensive as LOOCV when models have to be refit
    • every observation gets to be used in training (\(k-1\) times) and test (once)
    • less bias in computing the test error than the hold-out method
    • a very general method applicable to any model
  • Disadvantages

    • has a random element when splitting the data into \(k\) sets
    • possibly computationally intensive when fitting the model \(k\) times

\(10\)-fold CV in R

10-fold cross-validation for the horsepower model.

    library(boot)   # provides cv.glm()
    glm.fit <- glm(mpg ~ horsepower, data=Auto)
    cv.err <- cv.glm(Auto, glm.fit, K=10)
    cv.err$K
    cv.err$delta[1]

    > cv.err$K
    [1] 10
    > cv.err$delta[1]
    [1] 24.3664


See comparison in R between 5-fold and 10-fold CV.
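
A minimal sketch of such a comparison, reusing glm.fit from above (the exact values vary with the random fold assignment):

    set.seed(17)
    cv5  <- cv.glm(Auto, glm.fit, K = 5)$delta[1]    # 5-fold CV estimate of the test error
    cv10 <- cv.glm(Auto, glm.fit, K = 10)$delta[1]   # 10-fold CV estimate of the test error
    c(cv5 = cv5, cv10 = cv10)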

Comparison

Which method should you choose?

▶ Hold-out approach has high variability and is biased

▶ LOOCV produces a nearly unbiased estimator of the true test error

▶ k-fold CV has an intermediate level of bias (more than LOOCV) and some variability (less than hold-out approach)

▶ The variance of LOOCV is quite high because the \(n\) error estimates are highly correlated (why is that?)

Metrics and Visuals

Metrics and Visuals

Deterministic metrics to measure the discrepancy between estimated (predicted) and true testing values.
Discrepancies are called errors \(e=y-\widehat{y}\).

Skill score: Measures improvement relative to a reference prediction or mean.
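
For squared-error loss, the skill score is commonly defined relative to a reference forecast (e.g., persistence or the climatological mean): \[SS = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{reference}}}\] \(SS = 1\) indicates a perfect forecast, \(SS = 0\) no improvement over the reference, and \(SS < 0\) a model worse than the reference.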


Visual Diagnostics:
- Predictions vs observed test data (e.g. QQ-plot)
- PIT (Probability Integral Transform) histograms
- Predicted values vs. index, time, or space (are patterns captured?)


Goal: Check calibration (match between predictions and testing observations) and sharpness (concentration of the predictive distribution) of forecasts.

Validation Metrics Regressions

  • Mean Squared Error (MSE):
    • Average squared difference between observed and predicted values.
  • Root Mean Squared Error (RMSE):
    • Square root of MSE, in original units.
  • Mean Absolute Error (MAE):
    • Average absolute difference between observed and predicted values.
  • R-squared (R²):
    • Proportion of variance explained by the model.
  • Coverage percentage:
    • Fraction of prediction/confidence intervals that contain the observed target value.
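
A minimal sketch computing these metrics on toy held-out data (all names below are placeholders):

set.seed(1)
y_test <- rnorm(50, mean = 10)                      # toy observed test values
y_hat  <- y_test + rnorm(50, sd = 1)                # toy predictions
lwr <- y_hat - 1.96; upr <- y_hat + 1.96            # toy 95% prediction intervals

rmse <- sqrt(mean((y_test - y_hat)^2))              # root mean squared error
mae  <- mean(abs(y_test - y_hat))                   # mean absolute error
r2   <- 1 - sum((y_test - y_hat)^2) / sum((y_test - mean(y_test))^2)  # R-squared on the test data
coverage <- mean(y_test >= lwr & y_test <= upr)     # share of intervals containing the observation
c(RMSE = rmse, MAE = mae, R2 = r2, Coverage = coverage)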

Validation Metrics For Classification

  • Accuracy: Proportion of correct predictions.

  • Precision: Proportion of true positives among predicted positives.

  • Recall (Sensitivity): Proportion of true positives among actual positives.

  • F1-Score: Harmonic mean of precision and recall.

  • Confusion Matrix: Table summarizing true vs. predicted labels.

  • ROC Curve: Plot of true positive rate vs. false positive rate.

  • AUC: Area under the ROC curve (0.5 = random, 1 = perfect).

Example in R: Classification Metrics

Please go back to the advanced regression class to review validation metrics for classification.

library(caret)
data(Default, package = "ISLR")

# Define control parameters for 5-fold CV with class probabilities and an ROC-based summary
ctrl <- trainControl(method = "cv", number = 5, 
                summaryFunction = twoClassSummary, classProbs = TRUE)

# Fit a logistic regression model, selecting by ROC (AUC)
model <- train(default ~ ., data = Default, 
                method = "glm", family = "binomial",
                metric = "ROC", trControl = ctrl)
print(model) # Print cross-validation results

# Confusion matrix (computed here on the full data set, for illustration)
predictions <- predict(model, newdata = Default)
confusionMatrix(predictions, Default$default)

Visualization for Validation

You want the predictions to have similar patterns as the observations:

  • Probability distribution
    Matching distribution is usually not enough

  • Patterns like temporal or spatial patterns

  • Other statistics like correlation functions, tail indicators,…


Think about the bike rental example: we hope to capture patterns like the seasonal and daily cycles.

QQ-Plots for Predictions

Purpose: Compare the distribution of your model’s predictions to the ‘true’ testing data


Interpretation:
- Points on the line: Predictions and observations have matching distribution
- Deviations: predictions over- or under-estimate the ‘truth’


When to Use: After performing predictions on testing data from a fitted statistical model

R Code Example:

See this week’s code
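
In addition, a minimal QQ-plot sketch with toy vectors (observed_test and predicted_test are placeholders for your test observations and predictions):

set.seed(1)
observed_test  <- rnorm(200, mean = 5, sd = 2)       # toy observed test values
predicted_test <- rnorm(200, mean = 5, sd = 2)       # toy model predictions
qqplot(observed_test, predicted_test,
       xlab = "Observed quantiles", ylab = "Predicted quantiles",
       main = "QQ-Plot: predictions vs. observations")
abline(0, 1, col = "red")                            # points near the line: matching distributions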

Probability Integral Transform (PIT) Histograms

Purpose: 

  • Assess the calibration of probabilistic forecasts
  • If the model is well-calibrated, the PIT values should be uniformly distributed between 0 and 1

When to Use: For probabilistic models (e.g., Bayesian regression, quantile regression, or ensemble forecasts).

Interpretation:

  • Uniform distribution: Model is well-calibrated
  • U-shaped or hump-shaped: Model is under- or over-dispersed

R Code Example:

# Compute PIT values directly from an ensemble forecast (no dedicated package required)
# Simulate ensemble forecasts and matching observations (toy example)
set.seed(123)
mu <- rnorm(100)                                        # underlying means
observed <- rnorm(100, mean = mu, sd = 1)               # observations
# 100 ensemble members per observation, drawn from the same distribution as the observations
predicted <- matrix(rnorm(100 * 100, mean = mu, sd = 1), nrow = 100)

# PIT value for observation i: empirical forecast CDF evaluated at the observation
pit_values <- sapply(seq_len(nrow(predicted)),
                     function(i) mean(predicted[i, ] <= observed[i]))

# Plot PIT histogram; roughly flat bars indicate a well-calibrated forecast
hist(pit_values, breaks = 20, main = "PIT Histogram", xlab = "PIT Values", ylab = "Frequency")
abline(h = length(pit_values) / 20, col = "red", lty = 2)  # expected bin count under uniformity

Interpretation Guidelines

  • QQ-Plot: a well-calibrated model shows points following the line; a poorly calibrated model shows points deviating from the line

  • PIT Histogram: a well-calibrated model shows a uniform distribution; a poorly calibrated model shows a U-shaped, hump-shaped, or skewed distribution

Key Takeaways:

  • QQ-Plots: Use for checking the distribution match of predictions with the testing ‘truth’

  • PIT Histograms: Use for checking calibration of probabilistic forecasts

  • Action: If plots indicate poor fit, consider transforming variables, changing model assumptions, or using a different model family.


  • Features: Always evaluate other features of your data!

Uncertainty Quantification

Sources of Uncertainty

  • Model uncertainty:
    Wrong assumptions or structure
    Can be controlled to a certain extent, hard to quantify
  • Parameter uncertainty:
    Estimation error due to finite sample size
    Can be quantified with Hessian information
  • Data uncertainty:
    Due to random variation in observations
    Can be quantified (confidence/prediction intervals, bootstrap, …)

Bootstrap

To bootstrap means to get oneself out of a situation by applying existing resources. What does that have to do with statistical learning?

Suppose you want to know the variability of a quantity, e.g., an estimator such as \(\widehat{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\).

The bootstrap solution:

  • Draw samples from the training data itself
  • Samples are drawn with replacement
  • Compute the statistic(s) of interest in each sample
  • Study the distribution of the statistic(s) or calculate the standard deviation, mean, …
  • Bootstrap samples are called replicates; typically \(R \geq 1{,}000\)
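
A minimal base-R sketch bootstrapping the mean of a toy sample:

set.seed(1)
y <- rnorm(50, mean = 10, sd = 3)                             # toy sample
boot_means <- replicate(1000, mean(sample(y, replace=TRUE)))  # resample with replacement, recompute the mean
sd(boot_means)                                                # bootstrap estimate of the standard error
hist(boot_means, main = "Bootstrap distribution of the mean")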

Reflecting Data Uncertainty

Your data have natural variability that is important to understand and communicate

There are two main ways to quantify this type of uncertainty:
- Calculate confidence/prediction intervals
- Bootstrap data and estimate variability of quantities of interest
e.g. estimated coefficients, predictions, \(R^2\), …

Bootstrap in R

Use the boot() function with three arguments: (i) the data, (ii) a function you write that returns the statistic(s) of interest, and (iii) the number of bootstrap samples (replicates).
The function you write has at least two arguments: the data and an index vector to select observations. This example calculates the standard error of the \(R^2\) statistic via bootstrap.

library(boot)
boot.fn <- function(data, index) {
    summary(lm(mpg ~ poly(horsepower, 2), data=data,
               subset=index))$r.squared
}
set.seed(123)
bs <- boot(data=Auto, statistic=boot.fn, R=1000)
bs
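
As a follow-up, a percentile confidence interval for the bootstrapped statistic can be obtained from the same object:

boot.ci(bs, type = "perc")   # 95% percentile interval for R-squared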

Bootstrap and Bagging

How does the bootstrap relate to cross-validation?

The bootstrap
- is based on sampling with replacement from the training data
- produces estimates reflecting sampling variability
- can be used to improve statistical learning methods; the technique is called bagging

Bagging
- general procedure to reduce the variance of a learning method
- applied to learning methods that have high variance
(e.g., decision trees)

Hessian and Estimation Uncertainty

Why the Hessian Matters

  • The Hessian matrix (second derivatives of the log-likelihood) at the MLE quantifies the curvature of the likelihood surface

  • The observed Fisher information is the negative Hessian, \(I(\widehat{\beta}) = -H(\widehat{\beta})\); its inverse approximates the variance-covariance matrix of the MLEs:

\[\text{Var}(\widehat{\beta}) \approx I^{-1}(\widehat{\beta}) = -H^{-1}(\widehat{\beta})\]

  • Key insight: Flatter curvature → higher uncertainty; steeper curvature → lower uncertainty.

Practical Implications

  • Standard errors for coefficients are the square roots of the diagonal entries of this matrix

  • Used for confidence intervals and hypothesis testing (e.g., Wald tests)

  • Off-diagonal entries provide covariance information between pairs of estimated parameters

Extracting the Hessian in R

  1. Variance-Covariance Matrix (Inverse Hessian):

    model <- glm(y ~ x1 + x2, family = binomial, data = df)
    vcov_matrix <- vcov(model)  # inverse of the observed Fisher information (-H^{-1})
  2. Numerical Hessian (if needed):

    library(numDeriv)
    X <- model.matrix(model)                 # design matrix of the fitted glm
    negloglik <- function(beta) {            # negative log-likelihood of the logistic model
      p <- plogis(X %*% beta)
      -sum(dbinom(df$y, size = 1, prob = p, log = TRUE))
    }
    H <- hessian(negloglik, coef(model))     # numerical Hessian of the negative log-likelihood
    vcov_numeric <- solve(H)                 # approximate variance-covariance matrix
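
As a quick check (using the objects defined above), the standard errors from the numerical Hessian should closely match those from vcov():

    sqrt(diag(vcov_numeric))   # standard errors from the numerical Hessian
    sqrt(diag(vcov_matrix))    # standard errors from the built-in variance-covariance matrix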

Visualizing Uncertainty

  • Plot confidence ellipses using the covariance matrix


  • Compare Hessians for different models to assess relative uncertainty


  • Create an ensemble of models reflecting parameter uncertainty by drawing random samples of parameter estimates (assuming a Gaussian distribution with the MLE as mean and the covariance from above); see the sketch after this list


  • This covariance can also be estimated from bootstrapped model parameters. Note that this bootstrapping implies fitting the model many times, which can be computationally challenging.
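
A minimal sketch of the parameter-ensemble idea for the horsepower model used earlier (MASS::mvrnorm is one way to draw the samples; assumes the Auto data are loaded):

library(MASS)
set.seed(1)
slr <- lm(mpg ~ horsepower, data = Auto)
# Draw 200 coefficient vectors from a Gaussian centered at the estimates with the estimated covariance
beta_samples <- mvrnorm(200, mu = coef(slr), Sigma = vcov(slr))
# Each row gives one member of an ensemble of fitted lines reflecting parameter uncertainty
plot(mpg ~ horsepower, data = Auto, col = "grey")
for (j in seq_len(nrow(beta_samples)))
    abline(a = beta_samples[j, 1], b = beta_samples[j, 2], col = rgb(0, 0, 1, 0.05))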

Communication

Communication: From Forecasts to Decisions

Forecasts are inputs to decisions, not ends in themselves

Tailor outputs to user needs:
Risk managers need quantiles
Planners need scenario ranges

Emphasize interpretability over aesthetics

Same forecast communicated in two ways:
Demand will be 1000 MW (false precision)
There is an 80% chance demand will stay between 950 and 1050 MW (transparent)

Summary

Common Pitfalls and Best Practices

Pitfalls:

  • Data Leakage: Using test or validation information in training
  • Class Imbalance: Accuracy can be misleading for imbalanced datasets
  • Over-Reliance on Single Metric: Use multiple metrics and PLOTS

Best Practices:

  • Always Use Cross-Validation: Robust performance estimates
  • Choose Metrics and Visualization Aligned with Goals: Precision/recall for imbalanced data
  • Visualize Results: QQ-plots, correlations, residual plots, and confusion matrices