Models

Why Models?

“All models are wrong, but some are useful.” — George Box

Remember, we said that data are a materialization of an underlying process and are not ground truth.

A similar insight applies to statistical models.

Key idea: Models are simplifications, not mirrors of reality.

Their value comes from insight, prediction, and decision-making—not perfection.

Learning Goals

By the end of this module, you should be able to:

  • Explain why randomness is essential in data models

  • Write the generic form of a statistical learning model

  • Distinguish linear vs nonlinear models (correctly!)

  • Recognize common modeling assumptions—and when they fail

  • Fit and interpret basic models in R

Deterministic vs Stochastic Thinking

A Deterministic (Mathematical) Model

Example: pollutant concentration in a river \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t)\]

  • Inputs → Output, no uncertainty, no room for chance
  • Same inputs → same output, every time
  • Assumes perfect knowledge of the system
  • Predicts with certainty

Reality check: rivers, sensors, and environments are messy and complex.

Why Determinism Fails in Practice

Hidden assumptions made in deterministic modeling:

  • Perfect measurements
  • Perfect mixing
  • No turbulence or local effects
  • Constant decay rates in space and time

Question: Can we ever measure and control all of this?

In reality:

  • Measurements are made in a highly dynamic environment (river)
  • How can we account for all influences in the environment?
  • How can we account for sampling the environment?

Adding Randomness (On Purpose)

We upgrade the model by admitting uncertainty: \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t) + \epsilon\] \(\epsilon\): random error, mean 0, variance \(\sigma^2\)
Often assume: \(\epsilon \sim N(0,\sigma^2)\)

  • Interpretation:
    Individual observations vary
    The average behavior follows the deterministic exponential model \[\text{E}[\alpha(s,t)] = \alpha_0(s-ct)\exp(-\mu t)\]

  • Consequence: Because \(\epsilon\) is random, so is \(\alpha(s,t)\) (randomness is contagious)

  • Can you think of an extension model for \(\epsilon\)?
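A quick way to see what the error term adds is to simulate from the stochastic model. The sketch below is purely illustrative: the initial profile, the flow speed, the decay rate, and the error variance are all made-up values.

# Illustrative simulation of the stochastic pollutant model (all values made up)
set.seed(1)
alpha0  <- function(u) 10 * exp(-u^2 / 50)    # assumed initial concentration profile
c_speed <- 2                                  # assumed flow speed
mu      <- 0.1                                # assumed decay rate
sigma   <- 0.5                                # error standard deviation
t_grid  <- seq(0, 10, by = 0.5)
s       <- 5                                  # fixed location
mean_conc <- alpha0(s - c_speed * t_grid) * exp(-mu * t_grid)          # deterministic part
obs_conc  <- mean_conc + rnorm(length(t_grid), mean = 0, sd = sigma)   # add random error
plot(t_grid, obs_conc, xlab = "time", ylab = "concentration")
lines(t_grid, mean_conc, lwd = 2)             # observations scatter around the mean curve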

“Correct on Average”

  • The model is not exact for every single observation
  • It is correct in expectation

Think about the consequences and differences for:

  • Predicting your credit score vs

  • Estimating the average credit score of people like you

Why Randomness is Essential

Randomness helps us:

  • Model unobserved or unmeasurable influences
  • Justify random sampling and experiments
  • Build simpler, more interpretable models
  • Partially overcome unpredictability and unknown components

Occam’s Razor:

  • Among competing explanations, prefer the simpler one
  • Stochastic models are often more parsimonious (fewer parameters) and easier to explain
  • They often explain more with less.

Parsimony Example

  • Goal: develop a model to predict cardiovascular disease occurrence (yes/no)

  • Reality: dozens of biological factors (family history), clinical factors (other pathologies, age, sex), and lifestyle factors (nutrition, sedentary behavior)

  • Start simple:

    • Let \(Y \in \{0,1\}\): the disease occurs or not
    • \(Y \sim \text{Bernoulli}(\pi)\) \[\Pr(Y=1)=\pi, \quad \Pr(Y=0)=1-\pi\]
  • In this case, we need to estimate \(\pi\) from the data to make the model usable and match it to our data.

  • If we have a random sample of patients with 0/1 disease outputs, then the sample mean is a possible estimator: \[\widehat{\pi} = \frac{1}{n}\sum_{i=1}^n Y_i.\]
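In R, this estimator is just the sample mean of the 0/1 outcomes. A minimal sketch with simulated data (the true \(\pi\) and sample size are invented):

set.seed(42)
y      <- rbinom(200, size = 1, prob = 0.15)  # simulated 0/1 outcomes, true pi = 0.15
pi_hat <- mean(y)                              # sample mean as estimator of pi
pi_hat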

Statistical Models

A statistical model is a stochastic model with unknown parameters

Unknown parameters are estimated on available data

Going back to the above example:

  • \(\pi\): event probability (parameter to be estimated)
  • Data → estimate parameters \(\widehat{\pi} = \frac{1}{n}\sum_{i=1}^n Y_i\)

Key idea:

  • If Y is random, every function of Y is random (randomness is contagious)
    \(\widehat{\pi}\) is a random variable
  • Parameter estimates are random variables with their own distributional properties
  • We choose parameter estimates with satisfactory properties (see the Estimation chapter)
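A small simulation makes the point: repeat the sampling many times and \(\widehat{\pi}\) varies from sample to sample. The values below (true \(\pi = 0.15\), \(n = 200\), 2000 replicates) are chosen for illustration only.

set.seed(7)
pi_hat_reps <- replicate(2000, mean(rbinom(200, size = 1, prob = 0.15)))  # 2000 estimates
hist(pi_hat_reps, breaks = 30, xlab = expression(hat(pi)), main = "Sampling distribution")
c(mean(pi_hat_reps), sd(pi_hat_reps))   # close to 0.15 and sqrt(0.15 * 0.85 / 200)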

Why Statistical Learning?

  • This Bernoulli model is a possible model for disease occurrence, but it is too simplistic

  • In the above model, \(\widehat{\pi}\) is solely derived from the 0/1 disease outputs,
    but does not account for other obvious factors (age, sex, family history)

  • Other factors are less obvious

  • Express \(\pi\) as a function of other inputs and parameters \[\pi = f(x_1, x_2, \ldots, x_p; \theta_0, \theta_1, \theta_2, \ldots, \theta_p)\]
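One common (but not the only) choice for \(f\) is the inverse-logit, i.e. logistic regression. A hedged sketch with simulated inputs; the variable names and coefficient values are invented for illustration:

set.seed(3)
n   <- 500
age <- runif(n, 30, 80)
sex <- rbinom(n, 1, 0.5)
pi_true <- 1 / (1 + exp(-(-6 + 0.08 * age + 0.5 * sex)))   # pi as a function of inputs
y   <- rbinom(n, 1, pi_true)                               # 0/1 disease outcomes
fit <- glm(y ~ age + sex, family = binomial)               # estimates the theta's from data
coef(fit)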

We build models to:

Explain

  • How do inputs affect outputs? What is the input-output relationship?
  • Which input variables matter?
  • What functional relationships are plausible?

Predict

  • Estimate/predict unseen outcomes
  • Generalize to new data

Often: we want both.

When Prediction Matters Most

Prediction is crucial when outcomes are:

  • Destructive (stress testing materials)
  • Expensive (medical tests)
  • Rare (fraud detection)
  • In the future (stock prices, demand)

The Basic Model Template

Most (not all) statistical learning models look like:

Response = Structure + Error

\[Y = f(x_1,\ldots,x_p;\theta_0,\theta_1,\ldots,\theta_k) + \epsilon\]

  • Y: Response: what we care about
    Other names are output, outcome, target, dependent variable
    Upper-case Y when referring to the random variable
    Lower-case y is the observed value

  • Structure: systematic part of the model (mean function)
    It expresses the expected value (the mean) of the response as a function of inputs and parameters and is therefore also called the mean function

    • \(x_1,...,x_p\): the inputs, a.k.a. regressors, explanatory variables, features, independent variables
    • \(f\) functional form between input and output
      E.g. linear regression, polynomial regression
    • \(\theta_0,\theta_1,\ldots,\theta_k\): the parameters to be estimated from the data
  • \(\epsilon\): Error: the random component(s) (inherent stochasticity, unpredictability,…)
    \(\epsilon\) is the model error; we assume that \(\epsilon\) has mean \(0\) and variance \(\sigma^2\). It defines the distributional properties of the response.

Observational Model

In practice, we will have access to a finite set of observations for the problem at stake,

The observational model introduces an index for the \(n\) observations: \[Y_i = f(x_{1i},x_{2i},\ldots,x_{pi}; \theta_0,\theta_1,\ldots,\theta_k) + \epsilon_i, \qquad i=1,\ldots, n\]

  • \(Y_1, Y_2, \ldots, Y_n\) are the sample of size \(n\)
    The \(Y_i\) are random variables

  • \(y_1, y_2, \ldots, y_n\) are the observed values, e.g., \(12.4\), \(5.7\), \(9.4\)
    The \(y_i\) are not random variables, but their realizations

  • The inputs \(x_{1i}, \ldots, x_{pi}\) are typically not random variables
    Inputs treated as given (conditional modeling)

  • We will see in the Estimation class how to arrange these observations

Example: Mitscherlich Equation

The Mitscherlich model is used in plant studies to express growth (yield) as a function of an input (fertilizer). \[ \mbox{Abstract form (response/structure/error)} \quad Y = f(x;\xi,\lambda,\kappa) + \epsilon \] \[\mbox{Specific formulation} \quad Y = \lambda + (\xi - \lambda)\exp\left\{-\kappa x\right\} + \epsilon\]

  • \(Y\) is the yield

  • \(x\) is the amount of input, e.g., nitrogen fertilizer

Always take some time to understand the parameters of your model:

  • \(\lambda\) is …

  • \(\xi\) is …

  • \(\kappa\) is …

  • \(\epsilon \sim (0,\sigma^2)\)

Mitscherlich Equation

Can you “estimate” the \(\lambda\), \(\xi\), and \(\kappa\) parameters visually?

How would you fit such a model on a dataset?

Prediction & Explanation

Explanatory inference, also called confirmatory inference, is concerned with

  • understanding the relationship between response and inputs

  • understanding the importance of the inputs

  • testing hypotheses about the response

  • ISLR calls this “Inference” in Sec. 2.1

Predictive inference is concerned with

  • developing an algorithm that predicts the outcome well

  • generalizing the algorithm to new observations (not measured, not yet seen, etc.)

Mitscherlich Example

Many applications have predictive and confirmatory goals

Predictive Insights

  • Provide a confidence interval for the average yield at \(x_{new}=35.7\) kg/ha

  • At what level of nitrogen does the yield reach 75% of its maximum (inverse prediction)?
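As a sketch of the inverse-prediction calculation (assuming \(\lambda\) is the asymptotic yield as \(x\) grows, with \(\lambda > \xi\)), set the mean function equal to \(0.75\,\lambda\) and solve for \(x\):

\[\lambda + (\xi - \lambda)\exp(-\kappa x) = 0.75\,\lambda \;\Rightarrow\; \exp(-\kappa x) = \frac{0.25\,\lambda}{\lambda - \xi} \;\Rightarrow\; x = \frac{1}{\kappa}\log\!\left(\frac{\lambda - \xi}{0.25\,\lambda}\right)\]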

Confirmatory Insights

  • The asymptotic yield is greater than 75

  • Increasing nitrogen application raises the yield by no more than 30 units

  • The rate of change in yield is less than \(p\) units per unit of nitrogen once 100 kg/ha of nitrogen have been applied.

Confirmatory Insights

  • Good confirmatory models are not necessarily best at prediction

  • Parameter interpretation is very important

  • Hypotheses are expressed in terms of parameters and their relationships

Predictive Insights

  • Models that predict well are not necessarily best at testing hypotheses

  • Parameter interpretation is less important

  • A biased model can be more desirable than an unbiased model if it predicts new data better

Many disciplines place high value on interpretability of the models.
Sometimes, prediction is used as a validation rather than an end goal.

For example,

  • Biology
  • Life sciences
  • Economics
  • Financial Services & Insurance
  • Physical sciences
  • Geosciences

Classes of Models

Linear vs Nonlinear Models

Linear ≠ straight line

A model is linear if it is linear in its parameters

Nonlinearity does not refer to curvature of the mean function

A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on one or more parameters

  • Mitscherlich Example \[ f(N;\xi,\lambda,\kappa) = \lambda + (\xi - \lambda)\exp\left\{ -\kappa N \right\}\] \[\frac{\partial f(N;\xi,\lambda,\kappa)}{\partial \lambda} = 1 -\exp\left\{ -\kappa N \right\}\]
  • This derivative depends on the parameter \(\kappa\), so the model is nonlinear
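For contrast, in a linear model the derivatives of the mean function with respect to the parameters are free of the parameters:

\[f(x;\beta_0,\beta_1) = \beta_0 + \beta_1 x, \qquad \frac{\partial f}{\partial \beta_0} = 1, \qquad \frac{\partial f}{\partial \beta_1} = x\]

Neither derivative involves \(\beta_0\) or \(\beta_1\), so the model is linear in its parameters.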

Are the following models linear or nonlinear? Response: \(Y\) ; Predictor: \(x\), \(x_1\), \(x_2\)

Model 1: \(E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
\(\rightarrow\) Multiple linear regression

Model 2:
\(E[Y] = \left \{ \begin{array}{ll} \beta_0 + \beta_1 x & x \leq \alpha \\ \beta_0 + \beta_1\alpha & x > \alpha \end{array} \right.\)
\(\rightarrow\) Plateau (hockey-stick) model is non-linear

Example of a plateau model

Model 3: \(E[Y] = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
\(\rightarrow\) Polynomial regression (nonlinear shape, linear model)

Linear Models (Regression Example)

Regression: the inputs are (continuous) numeric variables \[Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\] with \(\epsilon_i\) i.i.d. (independent and identically distributed)
and \(\epsilon_i \sim (0, \sigma^2)\)

Simple linear regression: \(Y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i\)

Multiple linear regression: more than one input (regressor variable)
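A minimal sketch of fitting such a model in R; the data are simulated so that the fitted coefficients can be compared with the known values:

set.seed(11)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1)   # true beta's: 2, 1.5, -0.8
fit <- lm(y ~ x1 + x2)                              # estimates beta_0, beta_1, beta_2 and sigma^2
summary(fit)$coefficients                           # compare the estimates with 2, 1.5, -0.8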

Linear Models Components

Linear Regression Example

Components

  • The \(\beta\) are the \(p+1\) parameters of the mean function to be estimated

  • \(\beta_0\) is called the intercept

  • \(p\) = number of inputs

  • \(\sigma^2\), the error variance is also a parameter to be estimated


Interpretation

  • \(\beta_0\) is the mean of \(Y\) when \(x_1 = 0, \cdots, x_p = 0\)

  • \(\beta_j\) is the change in the mean of \(Y\) when \(x_j\) increases by 1 unit and all other \(x\)s do not change

  • Example \(\beta_3 = 4.2\): If \(x_3\) increases by 1 unit, \(\text{E}[Y]\) increases by 4.2 units

Linear Models Assumptions

Linear Regression Example

  • The inputs \(x_{1i}, \cdots , x_{pi}\) are fixed, not random

  • The errors \(\epsilon_i\) have zero mean (model is correct on average)

  • The errors \(\epsilon_i\) have equal variance (homoscedasticity)

  • The errors \(\epsilon_i\) are independent

  • Because \(\epsilon_i\) has mean (=expected value) \(0\), it follows that \(\begin{array}{ll} \text{E}[Y_i] &= \text{E} \left[\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i \right]\\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \text{E}[\epsilon_i] \\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} \end{array}\)

  • Because \(\beta_0+\beta_1 x_{1i}+\cdots+\beta_p x_{pi}\) is the expected value (=the mean) of \(Y_i\),
    it is also called the mean function.
    It is implicit in this derivation that the inputs \(x\) are not random.

Conditioning Saves Us

“X is not random” really means:

We model Y given X = x

\[(Y\mid X=x) = \beta_0 + \beta_1 x + \epsilon\] Conditioning turns observed inputs into constants.

  • Describe situations in which the assumptions of the linear model are not met.

  • Is normality of the errors an assumption of the linear model?

Assuming we want to predict the abalone height, which assumptions of linear regression are violated in the following relationships?

Linear Models

  • Many models contain both continuous and qualitative inputs.

  • Take advantage of qualitative inputs to build interesting models and address interesting questions

  • Suppose a qualitative variable segments the data into groups
    (e.g. genders, age groups, income classes, tax brackets, …)
    Do groups share the same basic model?
    Do slopes and/or intercepts differ between groups?
    Is there an interaction between groups and continuous inputs?

  • Let’s work with the abalone data

In R, it is easy to implement interactions between predictors (continuous and-or categorical).

Using the function lm:

  • lm(Y ~ X + Z + X:Z) includes each of the predictors individually (the main effects) and their interaction (the : operator specifies the interaction term only).

  • lm(Y ~ X*Z) is the compact way of writing lm(Y ~ X + Z + X:Z)

  • When one predictor is continuous and the other categorical, these interaction terms have the following meaning:
    Assume X is continuous and Z has two categories,
    lm(Y ~ X) fits the same intercept and slope for both categories,
    lm(Y ~ X+Z) fits the same slope for both categories but different intercept,
    lm(Y ~ X:Z) fits the same intercept but different slope per category,
    lm(Y ~ X*Z) fits different intercept and slope per category.

This is model comparison, not just fitting.

Think about the abalone data when we predict the height from diameter and sex
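A sketch of what this comparison could look like in R, assuming the abalone data sit in a data frame called abalone with columns Height, Diameter, and Sex (the names may differ in your copy of the data):

fit_common    <- lm(Height ~ Diameter,       data = abalone)  # same intercept and slope for all sexes
fit_intercept <- lm(Height ~ Diameter + Sex, data = abalone)  # different intercepts, same slope
fit_full      <- lm(Height ~ Diameter * Sex, data = abalone)  # different intercepts and slopes
anova(fit_common, fit_intercept, fit_full)                    # compare the nested models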

Nonlinear Regression (Reality Bites)

Remember:

  • A nonlinear model depends on its parameters in a nonlinear fashion.

  • A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on one or more parameters


Nonlinear models are more finicky to work with than linear models:

  • Finding parameter estimates is done through iterative numerical optimization

  • Providing good starting values for the parameters helps greatly

  • Good software computes analytic derivatives rather than finite difference derivatives

  • Statistical properties of estimators are not as clear cut as with linear models

We can still use least-squares to fit such models

Typically, in R we use the nls function or the nls2 package

Example: Plateau model
You specify the mean function as the first argument of nls or pass an object with the formula for the mean function.

# Piecewise (plateau) mean function: linear in x up to the change point alpha,
# constant beyond it (the logical comparisons are coerced to 0/1)
plat_model <- y ~ (b0 + b1*x)     * (x <= alpha) +
                  (b0 + b1*alpha) * (x >  alpha)

# nls needs starting values for the iterative optimization
plat_fit <- nls(plat_model,
                data  = whatever,
                start = list(b0 = 0, b1 = 5, alpha = 40))

Let’s look at an example in R with the Mitscherlich model
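A sketch of what the nls call could look like; the data frame yield_df with columns nitrogen and yield, and the starting values, are invented for illustration:

mits_model <- yield ~ lambda + (xi - lambda) * exp(-kappa * nitrogen)   # Mitscherlich mean function
mits_fit   <- nls(mits_model,
                  data  = yield_df,
                  start = list(lambda = 80, xi = 20, kappa = 0.05))     # rough starting values
summary(mits_fit)                                                       # estimates of lambda, xi, kappa
predict(mits_fit, newdata = data.frame(nitrogen = 35.7))                # predicted mean yield at x_new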

Non-parametric Regression

Non-parametric regression refers to a family of regression methods where you don’t assume a fixed functional form (like linear, quadratic, etc.) between predictors \(X\) and response \(Y\)

Instead of \(Y = \beta_0 + \beta_1 X + \epsilon\), you have \(Y=f(X) + \epsilon\).

Key characteristics

  • No predefined model form for \(f\)
    The fit is constructed entirely from information in the data
  • Model complexity grows with sample size
  • Usually local or smooth fits
  • Sensitive to:
    • sample size
    • smoothing parameters (bandwidth, number of neighbors, tree depth, etc.)

Non-parametric Regression Examples

  1. k-Nearest Neighbors (kNN) regression
    • Predict by averaging nearby points, very intuitive
    • Curse of dimensionality hits hard
  2. Kernel regression (Nadaraya–Watson)
    • Predict by smoothing weighted average
    • Bandwidth \(h\) controls bias–variance tradeoff
  3. Local polynomial regression (LOESS / LOWESS)
    • Fit a small regression locally around each \(x\)
  4. Regression trees
    • Partition feature space into regions
    • Basis for Random Forests and Boosting
  5. Splines and smoothing splines
    • \(f(x)\) is a sum of smooth spline basis functions

We will review these models and their implementations in more detail in future classes.
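As a small preview of the first method in the list above, here is a hand-rolled kNN regression in base R on simulated data (the choice of k is arbitrary):

set.seed(5)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
knn_predict <- function(x0, x, y, k = 10) {
  idx <- order(abs(x - x0))[1:k]   # indices of the k nearest neighbours of x0
  mean(y[idx])                     # predict by averaging their responses
}
x_grid <- seq(0, 10, length.out = 200)
y_hat  <- sapply(x_grid, knn_predict, x = x, y = y, k = 10)
plot(x, y); lines(x_grid, y_hat, lwd = 2)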

Parametric vs Nonparametric

Parametric

few parameters, strong assumptions, rigid shape

  • Explicit expression for the mean function in terms of inputs and parameters
  • Linear regression model: \(Y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon\)
  • Nonlinear model: \(Y= \lambda + (\xi−\lambda) \exp(−\kappa x)\)
  • Less flexible but easier to explain

Nonparametric

many (or infinitely many) parameters, weak assumptions, flexible shape

  • No fixed or explicit form for the mean function,
    we control how wiggly the function is (its smoothness)
  • It still depends on calculating unknowns (parameters)
  • Very flexible, some harder to explain and interpret
  • Typically low bias but high variance in small-data regimes

Melanoma Data

How would you model such relationships?

  • How do we express a mean function for the following data of skin cancer incidences per 100,000 over a 37-year period?

  • Is a simple linear regression sufficient: \(Y = \beta_0 + \beta_1 x + \epsilon\)?

  • How about a polynomial model: \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\)?

  • How about a loess model (local polynomial weighted regression)?

Parametric vs Nonparametric

Small span → very flexible (can overfit)
Large span → smoother (can underfit)
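A quick illustration of the span parameter with R's loess on simulated data (the span values are chosen only to show the contrast):

set.seed(9)
x <- seq(0, 10, length.out = 150)
y <- sin(x) + rnorm(150, sd = 0.4)
fit_flex   <- loess(y ~ x, span = 0.2)   # small span: wiggly fit, can overfit
fit_smooth <- loess(y ~ x, span = 0.9)   # large span: smooth fit, can underfit
plot(x, y)
lines(x, predict(fit_flex),   lwd = 2)
lines(x, predict(fit_smooth), lwd = 2, lty = 2)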


Advantages

  • Captures nonlinear relationships easily
  • Does not assume any particular functional form
  • Works well for exploratory analysis and visualization

Limitations

  • Computationally expensive for large datasets
  • Does not easily provide a formula or model for prediction outside the observed range (no “global” model)
  • Sensitive to outliers (though “robust LOWESS” versions exist)

Choosing a Model

There is no single “best” model.

We compare using:

  • Prediction error
  • Interpretability
  • Scientific plausibility
  • Stability
  • Scalability if you work with large data

Takeaways

  • The model is a useful abstraction for the purpose of analysis (confirmatory, prediction)
    Models are tools, not truths

  • There are often competing models for the same task and data set, and you want to compare several models!

  • Randomness is a feature, not a bug

  • Linear models are a foundation—not a destination

  • Estimate the model parameters from data

  • Assess the quality of the model based on the parameter estimates and various criteria (See Validation class)

  • Statistical learning is about tradeoffs