“All models are wrong, but some are useful.” — George Box
Remember, we said that data are a materialization of an underlying process and are not ground truth.
We have similar insights about statistical models.
Key idea: Models are simplifications, not mirrors of reality.
Their value comes from insight, prediction, and decision-making—not perfection.
By the end of this module, you should be able to:
Explain why randomness is essential in data models
Write the generic form of a statistical learning model
Distinguish linear vs nonlinear models (correctly!)
Recognize common modeling assumptions—and when they fail
Fit and interpret basic models in R
Example: pollutant concentration in a river \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t)\]
Reality check: rivers, sensors, and environments are messy and complex.
Hidden assumptions made in deterministic modeling:
Question: Can we ever measure and control all of this?
In reality:
We upgrade the model by admitting uncertainty: \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t) + \epsilon\]
\(\epsilon\): random error, mean 0, variance \(\sigma^2\)
Often assume: \(\epsilon \sim N(0,\sigma^2)\)
Interpretation:
Individual observations vary
The average behavior follows the deterministic exponential model \[\mbox{E}[\alpha(s,t)] = \alpha_0(s-ct)\exp(-\mu t)\]
Consequence: Because \(\epsilon\) is random, so is \(\alpha(s,t)\) (randomness is contagious)
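A minimal simulation sketch of this stochastic model in R (the concentration profile, transport speed, decay rate, and error standard deviation below are all assumed values for illustration):
# Simulate the stochastic pollutant model at a fixed location s over time
alpha0 <- function(u) 10 * exp(-u^2 / 50)        # assumed initial concentration profile
speed <- 2; mu <- 0.1; sigma <- 0.5              # assumed transport speed, decay rate, error sd
s <- 30; time <- seq(0, 20, by = 1)
mean_conc <- alpha0(s - speed * time) * exp(-mu * time)             # deterministic mean E[alpha(s,t)]
obs_conc  <- mean_conc + rnorm(length(time), mean = 0, sd = sigma)  # observations vary around the mean
Each run of the last line gives a different realization of the observations, while the mean curve stays the same.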
Can you think of an extension model for \(\epsilon\)?
Think about the consequences and differences for:
Predicting your credit score vs
Estimating the average credit score of people like you
Randomness helps us:
Occam’s Razor:
Goal: develop a model to predict cardiovascular disease occurrence (yes/no)
Reality: dozens of biological factors (family history), clinical factors (other pathologies, age, sex), and lifestyle factors (nutrition, sedentary behavior)
Start simple: model the yes/no outcome as \(Y \sim \text{Bernoulli}(\pi)\), where \(\pi\) is the probability of disease.
In this case, we need to estimate \(\pi\) from the data to make the model usable.
If we have a random sample of patients with 0/1 disease outputs, then the sample mean is a possible estimator: \[\widehat{\pi} = \frac{1}{n}\sum_{i=1}^n Y_i.\]
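A minimal sketch of this estimator in R (the outcomes below are simulated; the sample size and true probability are assumptions for illustration):
y <- rbinom(200, size = 1, prob = 0.15)   # hypothetical 0/1 disease indicators
pi_hat <- mean(y)                         # sample mean as estimator of pi
pi_hat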
A statistical model is a stochastic model with unknown parameters
Unknown parameters are estimated on available data
Going back to the above example:
Key idea:
This Bernoulli model is a possible model for disease occurrence, but it is too simplistic
In the above model, \(\widehat{\pi}\) is solely derived from the 0/1 disease outputs, but does not account for other obvious factors (age, sex, family history)
Other factors are less obvious
Express \(\pi\) as a function of other inputs and parameters \[\pi = f(x_1, x_2, \ldots, x_p; \theta_0, \theta_1, \theta_2, \ldots, \theta_p)\]
We build models to:
Often: we want both.
Prediction is crucial when outcomes are:
Most (not all) statistical learning models look like:
Response = Structure + Error
\[Y = f(x_1,\ldots,x_p;\theta_0,\theta_1,\ldots,\theta_k) + \epsilon\]
Y: Response: what we care about
Other names are output, outcome, target, dependent variable
Upper-case Y when referring to the random variable
Lower-case y is the observed value
Structure: systematic part of the model (mean function)
It expresses the expected value (the mean) of the response as a function of inputs and parameters and is therefore also called the mean function
\(\epsilon\): Error: the random component (inherent stochasticity, unpredictability, …)
\(\epsilon\) is the model error; we assume that \(\epsilon\) has mean \(0\) and variance \(\sigma^2\)
It defines the distributional properties of the response
In practice, we will have access to a finite set of observations for the problem at hand.
The observational model introduces an index for the \(n\) observations: \[Y_i = f(x_{1i},x_{2i},\ldots,x_{pi}, \theta_0,\theta_1,\ldots,\theta_k) + \epsilon_i, \qquad i=1,\cdots, n\]
\(Y_1, Y_2, \ldots, Y_n\) are the sample of size \(n\)
The \(Y_i\) are random variables
\(y_1, y_2, \ldots, y_n\) are the observed values, e.g., \(12.4\), \(5.7\), \(9.4\)
The \(y_i\) are not random variables, but their realizations
The inputs \(x_{1i}, \ldots, x_{pi}\) are typically not random variables
Inputs treated as given (conditional modeling)
We will see in the Estimation class how to arrange these observations
The Mitscherlich model is used in plant studies to express growth (yield) as a function of an input (fertilizer). \[ \mbox{Abstract form (response/structure/error)} \quad Y = f(x,\xi,\lambda,\kappa) + \epsilon, \] \[\mbox{Specific formulation} \quad Y = \lambda + (\xi - \lambda)\exp\left\{-\kappa x\right \} + \epsilon\]
\(Y\) is the yield
\(x\) is the amount of input, e.g., nitrogen fertilizer
Always take some time to understand the parameters of your model:
\(\lambda\) is …
\(\xi\) is …
\(\kappa\) is …
\(\epsilon \sim (0,\sigma^2)\)
Mitscherlich Equation
Can you ``estimate’’ the \(\lambda, \xi\), and \(\kappa\) parameters visually?
How would you fit such a model on a dataset?
Explanatory inference, also called confirmatory inference, is concerned with
understanding the relationship between response and inputs
understanding the importance of the inputs
testing hypotheses about the response
ISLR calls this ``Inference’’ in Sec. 2.1
Predictive inference is concerned with
developing an algorithm that predicts the outcome well
generalizing the algorithm to new observations (not measured, not yet seen, etc.)
Mitscherlich Example
Many applications have predictive and confirmatory goals
Predictive Insights
Provide a confidence interval for the average yield at \(x_{new}=35.7\) kg/ha
At what level of nitrogen does the yield achieve 75% of the maximum (inverse prediction)?
Confirmatory Insights
The asymptotic yield is greater than 75
Increasing nitrogen application raises the yield by no more than 30 units
The rate of change in yield is less than \(p\) units per unit of nitrogen once \(x = 100\) kg/ha has been applied.
Confirmatory Insights
Good confirmatory models are not necessarily best at prediction
Parameter interpretation is very important
Hypotheses are expressed in terms of parameters and their relationships
Predictive Insights
Models that predict well are not necessarily best at testing hypotheses
Parameter interpretation is less important
A model that is biased can be more desirable than an unbiased model
Many disciplines place high value on interpretability of the models.
Sometimes, prediction is used as a validation rather than an end goal.
For example,
Linear ≠ straight line
A model is linear if it is linear in its parameters
Nonlinearity does not refer to curvature of the mean function
A model is nonlinear if at least one derivative of the mean function wrt the parameters depends on one or more parameters
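As a quick worked check of this definition (two simple mean functions chosen purely for illustration): for \(E[Y] = \beta_0 + \beta_1 x\), the derivatives \(\partial E[Y]/\partial \beta_0 = 1\) and \(\partial E[Y]/\partial \beta_1 = x\) involve no parameters, so the model is linear; for \(E[Y] = \xi \exp(-\kappa x)\), the derivative \(\partial E[Y]/\partial \kappa = -x\,\xi \exp(-\kappa x)\) depends on \(\xi\) and \(\kappa\), so the model is nonlinear.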
Are the following models linear or nonlinear? Response: \(Y\) ; Predictor: \(x\), \(x_1\), \(x_2\)
Model 1: \(E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
\(\rightarrow\) Multiple linear regression
Model 2: \(E[Y] = \left\{ \begin{array}{ll} \beta_0 + \beta_1 x & x \leq \alpha \\ \beta_0 + \beta_1\alpha & x > \alpha \end{array} \right.\)
\(\rightarrow\) Plateau (hockey-stick) model is nonlinear
Example of a plateau model
Model 3: \(E[Y] = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
\(\rightarrow\) Polynomial regression (nonlinear shape, linear model)
Regression: the inputs are (continuous) numeric variables \[Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\] with \(\epsilon_i\) i.i.d. (independent and identically distributed) and \(\epsilon_i \sim (0, \sigma^2)\)
Simple linear regression: \(Y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i\)
Multiple linear regression: more than one input (regressor variable)
Linear Regression Example
Components
The \(\beta\) are the \(p+1\) parameters of the mean function to be estimated
\(\beta_0\) is called the intercept
\(p\) = number of inputs
\(\sigma^2\), the error variance is also a parameter to be estimated
Interpretation
\(\beta_0\) is the mean of \(Y\) when \(x_1 = 0, \cdots, x_p = 0\)
\(\beta_j\) is the change in the mean of \(Y\) when \(x_j\) increases by 1 unit and all other \(x\)s do not change
Example \(\beta_3 = 4.2\): If \(x_3\) increases by 1 unit, \(\text{E}[Y]\) increases by 4.2 units
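A minimal R sketch of fitting and reading such a model (the data are simulated and all coefficient values are assumptions for illustration):
# Simulate data and fit a multiple linear regression
n <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)
fit <- lm(y ~ x1 + x2)
coef(fit)      # estimates of beta0, beta1, beta2
summary(fit)   # standard errors and the estimate of sigma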
Linear Regression Example
The inputs \(x_{1i}, \cdots , x_{pi}\) are fixed, not random
The errors \(\epsilon_i\) have zero mean (model is correct on average)
The errors \(\epsilon_i\) have equal variance (homoscedasticity)
The errors \(\epsilon_i\) are independent
Because \(\epsilon_i\) has mean (=expected value) \(0\), it follows that \(\begin{array}{ll} \text{E}[Y_i] &= \text{E} \left[\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i \right]\\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \text{E}[\epsilon_i] \\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} \end{array}\)
Because \(\beta_0+\beta_1 x_{1i}+\cdots+\beta_p x_{pi}\) is the expected value (= the mean) of \(Y_i\), it is also called the mean function.
It is implicit in this derivation that the inputs \(x\) are not random.
Conditioning Saves Us
“X is not random” really means:
We model Y given X = x
\[(Y\mid X=x) = \beta_0 + \beta_1 x + \epsilon\] Conditioning turns observed inputs into constants.
Describe situations in which the assumptions of the linear model are not met.
Is normality of the errors an assumption of the linear model?
Assuming we want to predict the abalone height, which assumptions of linear regression are violated in the following relationships?
Many models contain both continuous and qualitative inputs.
Take advantage of qualitative inputs to build interesting models and address interesting questions
Suppose a qualitative variable segments the data into
groups
(e.g. genders, age groups, income classes, tax brackets, …)
Do groups share the same basic model?
Do slopes and/or intercepts differ between groups?
Is there an interaction between groups and continuous inputs?
Let’s work with the abalone data
In R, it is easy to implement interactions between predictors (continuous and/or categorical).
Using the function lm:
lm(Y ~ X + Z + X:Z) includes each of the predictors individually (the main effects) and their interaction (: accounts for the interaction only).
lm(Y ~ X*Z) is the compact way of writing lm(Y ~ X + Z + X:Z)
When one predictor is continuous and the other categorical, these interaction terms have the following meaning:
Assume X is continuous and Z has two categories.
lm(Y ~ X) fits the same intercept and slope for both categories.
lm(Y ~ X + Z) fits the same slope for both categories but different intercepts.
lm(Y ~ X:Z) fits the same intercept but a different slope per category.
lm(Y ~ X*Z) fits a different intercept and slope per category.
This is model comparison, not just fitting.
Think about the abalone data when we predict the height from diameter and sex
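A minimal R sketch of these four specifications on the abalone data (assuming a data frame abalone with columns Height, Diameter, and Sex; the data frame and column names are assumptions):
fit_pooled <- lm(Height ~ Diameter, data = abalone)         # same intercept and slope for all sexes
fit_int    <- lm(Height ~ Diameter + Sex, data = abalone)   # sex-specific intercepts, common slope
fit_slope  <- lm(Height ~ Diameter:Sex, data = abalone)     # common intercept, sex-specific slopes
fit_full   <- lm(Height ~ Diameter * Sex, data = abalone)   # sex-specific intercepts and slopes
anova(fit_int, fit_full)   # nested-model comparison: does the interaction improve the fit?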
Remember:
A nonlinear model depends on its parameters in a nonlinear fashion.
A model is nonlinear if at least one derivative of the mean function wrt the parameters depends on one or more parameters
Nonlinear models are more finicky to work with than linear models:
Finding parameter estimates is done through iterative numerical optimization
Providing good starting values for the parameters helps greatly
Good software computes analytic derivatives rather than finite difference derivatives
Statistical properties of estimators are not as clear cut as with linear models
We can still use least-squares to fit such models
Typically, in R we use nls or the library nls2
Example: Plateau model
You specify the mean function as the first argument of nls or pass an object with the formula for the mean function.
plat_model <- y ~ (b0 + b1*x)*(x <= alpha) + (b0 + b1*alpha)*(x > alpha)  # plateau mean function
plat_fit <- nls(plat_model,
                data = whatever,                           # your data frame
                start = list(b0 = 0, b1 = 5, alpha = 40))  # starting values for the iterative optimizer
Let’s look at an example in R with the Mitscherlich model
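A minimal sketch of how such a fit might look with nls (the data frame name, column names, and starting values below are assumptions for illustration):
mits_model <- yield ~ lambda + (xi - lambda) * exp(-kappa * nitrogen)    # Mitscherlich mean function
mits_fit <- nls(mits_model,
                data = plant,                                      # assumed data frame with yield and nitrogen
                start = list(lambda = 80, xi = 20, kappa = 0.05))  # rough starting values, e.g., read off a plot
summary(mits_fit)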
Non-parametric regression refers to a family of regression methods where you don’t assume a fixed functional form (like linear, quadratic, etc.) between predictors \(X\) and response \(Y\)
Instead of \(Y = \beta_0 + \beta_1 X + \epsilon\), you have \(Y=f(X) + \epsilon\).
Key characteristics
We will review these models and their implementations in more details in future classes.
Parametric: few parameters, strong assumptions, rigid shape
Non-parametric: many (or infinitely many) parameters, weak assumptions, flexible shape
Melanoma Data
How would you model such relationships?
How do we express a mean function for the following data of skin cancer incidences per 100,000 over a 37-year period?
Is a simple linear regression sufficient: \(Y = \beta_0 + \beta_1 x + \epsilon\)?
How about a polynomial model: \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\)?
How about a loess model (locally weighted polynomial regression)?
Small span → very flexible (can overfit)
Large span → smoother (can underfit)
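A minimal R sketch of these candidate fits on the melanoma data (assuming a data frame melanoma with columns year and incidence; the data frame and column names are assumptions):
fit_lin   <- lm(incidence ~ year, data = melanoma)                  # simple linear regression
fit_poly  <- lm(incidence ~ poly(year, 3), data = melanoma)         # cubic polynomial
fit_small <- loess(incidence ~ year, data = melanoma, span = 0.25)  # small span: flexible, may overfit
fit_large <- loess(incidence ~ year, data = melanoma, span = 0.75)  # large span: smoother, may underfit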
Advantages
Limitations
There is no single “best” model.
We compare using:
The model is a useful abstraction for the purpose of analysis (confirmatory, prediction)
Models are tools, not truths
There are often competing models for the same task and data set AND you want to compare several models!
Randomness is a feature, not a bug
Linear models are a foundation—not a destination
Estimate the model parameters from data
Assess the quality of the model based on the parameter estimates and various criteria (See Validation class)
Statistical learning is about tradeoffs