Models

Why Models?

“All models are wrong, but some are useful.” — George Box

Remember, we said that data are a materialization of an underlying process and are not ground truth.

A similar insight applies to statistical models.

Key idea: Models are simplifications, not mirrors of reality.

Their value comes from insight, prediction, and decision-making—not perfection.

Learning Goals

By the end of this module, you should be able to:

  • Explain why randomness is essential in data models

  • Write the generic form of a statistical learning model

  • Distinguish linear vs nonlinear models (correctly!)

  • Recognize common modeling assumptions—and when they fail

  • Fit and interpret basic models in R

Deterministic vs Stochastic Thinking

A Deterministic (Mathematical) Model

Example: pollutant concentration in a river \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t)\]

  • Inputs → Output, no uncertainty, no room for chance
  • Same inputs → same output, every time
  • Assumes perfect knowledge of the system
  • Predicts with certainty

Reality check: rivers, sensors, and environments are messy and complex.

Why Determinism Fails in Practice

Hidden assumptions made in deterministic modeling:

  • Perfect measurements
  • Perfect mixing
  • No turbulence or local effects
  • Constant decay rates in space and time

Question: Can we ever measure and control all of this?

In reality:

  • Measurements are made in a highly dynamic environment (river)
  • How can we account for all influences in the environment?
  • How can we account for sampling the environment?

Adding Randomness (On Purpose)

We upgrade the model by admitting uncertainty: \[\alpha(s,t) = \alpha_0(s-ct)\exp(-\mu t) + \epsilon\] \(\epsilon\): random error, mean 0, variance \(\sigma^2\)
Often assume: \(\epsilon \sim N(0,\sigma^2)\)

  • Interpretation:
    Individual observations vary
    The average behavior follows the deterministic exponential model \[\text{E}[\alpha(s,t)] = \alpha_0(s-ct)\exp(-\mu t)\]

  • Consequence: Because \(\epsilon\) is random, so is \(\alpha(s,t)\) (randomness is contagious)

  • Can you think of an extension model for \(\epsilon\)?
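A quick way to see what the error term adds is to simulate from the stochastic model. The sketch below is purely illustrative: the initial profile, the flow speed, the decay rate, and the error variance are all made-up values.

# Illustrative simulation of the stochastic pollutant model (all values made up)
set.seed(1)
alpha0  <- function(u) 10 * exp(-u^2 / 50)    # assumed initial concentration profile
c_speed <- 2                                  # assumed flow speed
mu      <- 0.1                                # assumed decay rate
sigma   <- 0.5                                # error standard deviation
t_grid  <- seq(0, 10, by = 0.5)
s       <- 5                                  # fixed location
mean_conc <- alpha0(s - c_speed * t_grid) * exp(-mu * t_grid)          # deterministic part
obs_conc  <- mean_conc + rnorm(length(t_grid), mean = 0, sd = sigma)   # add random error
plot(t_grid, obs_conc, xlab = "time", ylab = "concentration")
lines(t_grid, mean_conc, lwd = 2)             # observations scatter around the mean curve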

“Correct on Average”

  • The model is not exact for every single observation
  • It is correct in expectation

Think about the consequences and differences for:

  • Predicting your credit score vs

  • Estimating the average credit score of people like you

Why Randomness is Essential

Randomness helps us:

  • Model unobserved or unmeasurable influences
  • Justify random sampling and experiments
  • Build simpler, more interpretable models
  • Partially overcome unpredictability and unknown components

Occam’s Razor:

  • Among competing explanations, prefer the simpler one
  • Stochastic models are often more parsimonious (fewer parameters) and easier to explain
  • They often explain more with less.

Parsimony Example

  • Goal: develop a model to predict cardiovascular disease occurrence (yes/no)

  • Reality: dozens of biological factors (family history), clinical factors (other pathologies, age, sex), and lifestyle factors (nutrition, sedentary behavior)

  • Start simple:

    • Let \(Y \in \{0,1\}\): the disease occurs or not
    • \(Y \sim \text{Bernoulli}(\pi)\) \[\Pr(Y=1)=\pi, \quad \Pr(Y=0)=1-\pi\]
  • In this case, we need to estimate \(\pi\) from the data to make the model usable and match it to our data.

  • If we have a random sample of patients with 0/1 disease outputs, then the sample mean is a possible estimator: \[\widehat{\pi} = \frac{1}{n}\sum_{i=1}^n Y_i.\]
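In R, this estimator is just the sample mean of the 0/1 outcomes. A minimal sketch with simulated data (the true \(\pi\) and sample size are invented):

set.seed(42)
y      <- rbinom(200, size = 1, prob = 0.15)  # simulated 0/1 outcomes, true pi = 0.15
pi_hat <- mean(y)                              # sample mean as estimator of pi
pi_hat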

Statistical Models

A statistical model is a stochastic model with unknown parameters

Unknown parameters are estimated on available data

Going back to the above example:

  • \(\pi\): event probability (parameter to be estimated)
  • Data → estimate parameters \(\widehat{\pi} = \frac{1}{n}\sum_{i=1}^n Y_i\)

Key idea:

  • If Y is random, every function of Y is random (randomness is contagious)
    \(\widehat{\pi}\) is a random variable
  • Parameter estimates are random variables with their own distributional properties
  • We choose parameter estimates with satisfactory properties (see the Estimation chapter)
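A small simulation makes the point: repeat the sampling many times and \(\widehat{\pi}\) varies from sample to sample. The values below (true \(\pi = 0.15\), \(n = 200\), 2000 replicates) are chosen for illustration only.

set.seed(7)
pi_hat_reps <- replicate(2000, mean(rbinom(200, size = 1, prob = 0.15)))  # 2000 estimates
hist(pi_hat_reps, breaks = 30, xlab = expression(hat(pi)), main = "Sampling distribution")
c(mean(pi_hat_reps), sd(pi_hat_reps))   # close to 0.15 and sqrt(0.15 * 0.85 / 200)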

Why Statistical Learning?

  • This Bernoulli model is a possible model for disease occurrence, but it is too simplistic

  • In the above model, \(\widehat{\pi}\) is solely derived from the 0/1 disease outputs,
    but does not account for other obvious factors (age, sex, family history)

  • Other factors are less obvious

  • Express \(\pi\) as a function of other inputs and parameters \[\pi = f(x_1, x_2, \ldots, x_p; \theta_0, \theta_1, \theta_2, \ldots, \theta_p)\]
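One common (but not the only) choice for \(f\) is the inverse-logit, i.e. logistic regression. A hedged sketch with simulated inputs; the variable names and coefficient values are invented for illustration:

set.seed(3)
n   <- 500
age <- runif(n, 30, 80)
sex <- rbinom(n, 1, 0.5)
pi_true <- 1 / (1 + exp(-(-6 + 0.08 * age + 0.5 * sex)))   # pi as a function of inputs
y   <- rbinom(n, 1, pi_true)                               # 0/1 disease outcomes
fit <- glm(y ~ age + sex, family = binomial)               # estimates the theta's from data
coef(fit)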

We build models to:

Explain

  • How do inputs affect outputs? What is the input-output relationship?
  • Which input variables matter?
  • What functional relationships are plausible?

Predict

  • Estimate/predict unseen outcomes
  • Generalize to new data

Often: we want both.

When Prediction Matters Most

Prediction is crucial when outcomes are:

  • Destructive (stress testing materials)
  • Expensive (medical tests)
  • Rare (fraud detection)
  • In the future (stock prices, demand)

The Basic Model Template

Most (not all) statistical learning models look like:

Response = Structure + Error

\[Y = f(x_1,\ldots,x_p;\theta_0,\theta_1,\ldots,\theta_k) + \epsilon\]

  • Y: Response: what we care about
    Other names are output, outcome, target, dependent variable
    Upper-case Y when referring to the random variable
    Lower-case y is the observed value

  • Structure: systematic part of the model (mean function)
    It expresses the expected value (the mean) of the response as a function of inputs and parameters and is therefore also called the mean function

    • \(x_1,...,x_p\): the inputs, a.k.a. regressors, explanatory variables, features, independent variables
    • \(f\) functional form between input and output
      E.g. linear regression, polynomial regression
    • \(\theta_0,\theta_1,\ldots,\theta_k\): the parameters to be estimated from the data
  • \(\epsilon\): Error: the random component(s) (inherent stochasticity, unpredictability,…)
    \(\epsilon\) is the model error; we assume that \(\epsilon\) has mean \(0\) and variance \(\sigma^2\). It defines the distributional properties of the response.

Observational Model

In practice, we will have access to a finite set of observations for the problem at stake,

The observational model introduces an index for the \(n\) observations: \[Y_i = f(x_{1i},x_{2i},\ldots,x_{pi}; \theta_0,\theta_1,\ldots,\theta_k) + \epsilon_i, \qquad i=1,\ldots, n\]

  • \(Y_1, Y_2, \ldots, Y_n\) are the sample of size \(n\)
    The \(Y_i\) are random variables

  • \(y_1, y_2, \ldots, y_n\) are the observed values, e.g., \(12.4\), \(5.7\), \(9.4\)
    The \(y_i\) are not random variables, but their realizations

  • The inputs \(x_{1i}, \ldots, x_{pi}\) are typically not random variables
    Inputs treated as given (conditional modeling)

  • We will see in the Estimation class how to arrange these observations

Example: Mitscherlich Equation

The Mitscherlich model is used in plant studies to express growth (yield) as a function of an input (fertilizer). \[ \mbox{Abstract form (response/structure/error)} \quad Y = f(x;\xi,\lambda,\kappa) + \epsilon \] \[\mbox{Specific formulation} \quad Y = \lambda + (\xi - \lambda)\exp\left\{-\kappa x\right\} + \epsilon\]

  • \(Y\) is the yield

  • \(x\) is the amount of input, e.g., nitrogen fertilizer

Always take some time to understand the parameters of your model:

  • \(\lambda\) is …

  • \(\xi\) is …

  • \(\kappa\) is …

  • \(\epsilon \sim (0,\sigma^2)\)

Mitscherlich Equation

Can you “estimate” the \(\lambda\), \(\xi\), and \(\kappa\) parameters visually?

How would you fit such a model on a dataset?

Prediction & Explanation

Explanatory inference, also called confirmatory inference, is concerned with

  • understanding the relationship between response and inputs

  • understanding the importance of the inputs

  • testing hypotheses about the response

  • ISLR calls this “Inference” in Sec. 2.1

Predictive inference is concerned with

  • developing an algorithm that predicts the outcome well

  • generalizing the algorithm to new observations (not measured, not yet seen, etc.)

Mitscherlich Example

Many applications have predictive and confirmatory goals

Predictive Insights

  • Provide a confidence interval for the average yield at \(x_{new}=35.7\) kg/ha

  • At what level of nitrogen does the yield reach 75% of its maximum (inverse prediction)?
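As a sketch of the inverse-prediction calculation (assuming \(\lambda\) is the asymptotic yield as \(x\) grows, with \(\lambda > \xi\)), set the mean function equal to \(0.75\,\lambda\) and solve for \(x\):

\[\lambda + (\xi - \lambda)\exp(-\kappa x) = 0.75\,\lambda \;\Rightarrow\; \exp(-\kappa x) = \frac{0.25\,\lambda}{\lambda - \xi} \;\Rightarrow\; x = \frac{1}{\kappa}\log\!\left(\frac{\lambda - \xi}{0.25\,\lambda}\right)\]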

Confirmatory Insights

  • The asymptotic yield is greater than 75

  • Increasing nitrogen application raises the yield by no more than 30 units

  • The rate of change in yield is less than \(p\) units per unit of nitrogen once 100 kg/ha of nitrogen have been applied.

Confirmatory Insights

  • Good confirmatory models are not necessarily best at prediction

  • Parameter interpretation is very important

  • Hypotheses are expressed in terms of parameters and their relationships

Predictive Insights

  • Models that predict well are not necessarily best at testing hypotheses

  • Parameter interpretation is less important

  • A biased model can be more desirable than an unbiased model if it predicts new data better

Many disciplines place high value on interpretability of the models.
Sometimes, prediction is used as a validation rather than an end goal.

For example,

  • Biology
  • Life sciences
  • Economics
  • Financial Services & Insurance
  • Physical sciences
  • Geosciences

Classes of Models

Linear vs Nonlinear Models

Linear ≠ straight line

A model is linear if it is linear in its parameters

Nonlinearity does not refer to curvature of the mean function

A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on one or more parameters

  • Mitscherlich Example \[ f(N;\xi,\lambda,\kappa) = \lambda + (\xi - \lambda)\exp\left\{ -\kappa N \right\}\] \[\frac{\partial f(N;\xi,\lambda,\kappa)}{\partial \lambda} = 1 -\exp\left\{ -\kappa N \right\}\]
  • This derivative depends on the parameter \(\kappa\), so the model is nonlinear
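For contrast, in a linear model the derivatives of the mean function with respect to the parameters are free of the parameters:

\[f(x;\beta_0,\beta_1) = \beta_0 + \beta_1 x, \qquad \frac{\partial f}{\partial \beta_0} = 1, \qquad \frac{\partial f}{\partial \beta_1} = x\]

Neither derivative involves \(\beta_0\) or \(\beta_1\), so the model is linear in its parameters.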

Are the following models linear or nonlinear? Response: \(Y\) ; Predictor: \(x\), \(x_1\), \(x_2\)

Model 1: \(E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
\(\rightarrow\) Multiple linear regression

Model 2:
\(E[Y] = \left \{ \begin{array}{ll} \beta_0 + \beta_1 x & x \leq \alpha \\ \beta_0 + \beta_1\alpha & x > \alpha \end{array} \right.\)
\(\rightarrow\) Plateau (hockey-stick) model is non-linear

Example of a plateau model

Model 3: \(E[Y] = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
\(\rightarrow\) Polynomial regression (nonlinear shape, linear model)

Linear Models (Regression Example)

Regression: the inputs are (continuous) numeric variables \[Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\] with \(\epsilon_i\) i.i.d. (independent and identically distributed)
and \(\epsilon_i \sim (0, \sigma^2)\)

Simple linear regression: \(Y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i\)

Multiple linear regression: more than one input (regressor variable)
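A minimal sketch of fitting such a model in R; the data are simulated so that the fitted coefficients can be compared with the known values:

set.seed(11)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1)   # true beta's: 2, 1.5, -0.8
fit <- lm(y ~ x1 + x2)                              # estimates beta_0, beta_1, beta_2 and sigma^2
summary(fit)$coefficients                           # compare the estimates with 2, 1.5, -0.8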

Linear Models Components

Linear Regression Example

Components

  • The \(\beta\) are the \(p+1\) parameters of the mean function to be estimated

  • \(\beta_0\) is called the intercept

  • \(p\) = number of inputs

  • \(\sigma^2\), the error variance is also a parameter to be estimated


Interpretation

  • \(\beta_0\) is the mean of \(Y\) when \(x_1 = 0, \cdots, x_p = 0\)

  • \(\beta_j\) is the change in the mean of \(Y\) when \(x_j\) increases by 1 unit and all other \(x\)s do not change

  • Example \(\beta_3 = 4.2\): If \(x_3\) increases by 1 unit, \(\text{E}[Y]\) increases by 4.2 units

Linear Models Assumptions

Linear Regression Example

  • The inputs \(x_{1i}, \cdots , x_{pi}\) are fixed, not random

  • The errors \(\epsilon_i\) have zero mean (model is correct on average)

  • The errors \(\epsilon_i\) have equal variance (homoscedasticity)

  • The errors \(\epsilon_i\) are independent

  • Because \(\epsilon_i\) has mean (=expected value) \(0\), it follows that \(\begin{array}{ll} \text{E}[Y_i] &= \text{E} \left[\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i \right]\\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \text{E}[\epsilon_i] \\ &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} \end{array}\)

  • Because \(\beta_0+\beta_1 x_{1i}+\cdots+\beta_p x_{pi}\) is the expected value (=the mean) of \(Y_i\),
    it is also called the mean function.
    It is implicit in this derivation that the inputs \(x\) are not random.

Conditioning Saves Us

“X is not random” really means:

We model Y given X = x

\[(Y\mid X=x) = \beta_0 + \beta_1 x + \epsilon\] Conditioning turns observed inputs into constants.

  • Describe situations in which the assumptions of the linear model are not met.

  • Is normality of the errors an assumption of the linear model?

Assuming we want to predict the abalone height, which assumptions of linear regression are violated in the following relationships?

Linear Models

  • Many models contain both continuous and qualitative inputs.

  • Take advantage of qualitative inputs to build interesting models and address interesting questions

  • Suppose a qualitative variable segments the data into groups
    (e.g. genders, age groups, income classes, tax brackets, …)
    Do groups share the same basic model?
    Do slopes and/or intercepts differ between groups?
    Is there an interaction between groups and continuous inputs?

  • Let’s work with the abalone data

In R, it is easy to implement interactions between predictors (continuous and-or categorical).

Using the function lm:

  • lm(Y ~ X + Z + X:Z) includes each of the predictors individually (the main effects) and their interaction (the : operator specifies the interaction term only).

  • lm(Y ~ X*Z) is the compact way of writing lm(Y ~ X + Z + X:Z)

  • When one predictor is continuous and the other categorical, these interaction terms have the following meaning:
    Assume X is continuous and Z has two categories,
    lm(Y ~ X) fits the same intercept and slope for both categories,
    lm(Y ~ X+Z) fits the same slope for both categories but different intercept,
    lm(Y ~ X:Z) fits the same intercept but different slope per category,
    lm(Y ~ X*Z) fits different intercept and slope per category.

This is model comparison, not just fitting.

Think about the abalone data when we predict the height from diameter and sex
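A sketch of what this comparison could look like in R, assuming the abalone data sit in a data frame called abalone with columns Height, Diameter, and Sex (the names may differ in your copy of the data):

fit_common    <- lm(Height ~ Diameter,       data = abalone)  # same intercept and slope for all sexes
fit_intercept <- lm(Height ~ Diameter + Sex, data = abalone)  # different intercepts, same slope
fit_full      <- lm(Height ~ Diameter * Sex, data = abalone)  # different intercepts and slopes
anova(fit_common, fit_intercept, fit_full)                    # compare the nested models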

Nonlinear Regression (Reality Bites)

Remember:

  • A nonlinear model depends on its parameters in a nonlinear fashion.

  • A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on one or more parameters


Nonlinear models are more finicky to work with than linear models:

  • Finding parameter estimates is done through iterative numerical optimization

  • Providing good starting values for the parameters helps greatly

  • Good software computes analytic derivatives rather than finite difference derivatives

  • Statistical properties of estimators are not as clear cut as with linear models

We can still use least-squares to fit such models

Typically, in R we use the nls function or the nls2 package

Example: Plateau model
You specify the mean function as the first argument of nls or pass an object with the formula for the mean function.

# Piecewise (plateau) mean function: linear in x up to the change point alpha,
# constant beyond it (the logical comparisons are coerced to 0/1)
plat_model <- y ~ (b0 + b1*x)     * (x <= alpha) +
                  (b0 + b1*alpha) * (x >  alpha)

# nls needs starting values for the iterative optimization
plat_fit <- nls(plat_model,
                data  = whatever,
                start = list(b0 = 0, b1 = 5, alpha = 40))

Let’s look at an example in R with the Mitscherlich model
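A sketch of what the nls call could look like; the data frame yield_df with columns nitrogen and yield, and the starting values, are invented for illustration:

mits_model <- yield ~ lambda + (xi - lambda) * exp(-kappa * nitrogen)   # Mitscherlich mean function
mits_fit   <- nls(mits_model,
                  data  = yield_df,
                  start = list(lambda = 80, xi = 20, kappa = 0.05))     # rough starting values
summary(mits_fit)                                                       # estimates of lambda, xi, kappa
predict(mits_fit, newdata = data.frame(nitrogen = 35.7))                # predicted mean yield at x_new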

Non-parametric Regression

Non-parametric regression refers to a family of regression methods where you don’t assume a fixed functional form (like linear, quadratic, etc.) between predictors \(X\) and response \(Y\)

Instead of \(Y = \beta_0 + \beta_1 X + \epsilon\), you have \(Y=f(X) + \epsilon\).

Key characteristics

  • No predefined model form for \(f\)
    The fit is constructed entirely from information in the data
  • Model complexity grows with sample size
  • Usually local or smooth fits
  • Sensitive to:
    • sample size
    • smoothing parameters (bandwidth, number of neighbors, tree depth, etc.)

Non-parametric Regression Examples

  1. k-Nearest Neighbors (kNN) regression
    • Predict by averaging nearby points, very intuitive
    • Curse of dimensionality hits hard
  2. Kernel regression (Nadaraya–Watson)
    • Predict by smoothing weighted average
    • Bandwidth \(h\) controls bias–variance tradeoff
  3. Local polynomial regression (LOESS / LOWESS)
    • Fit a small regression locally around each \(x\)
  4. Regression trees
    • Partition feature space into regions
    • Basis for Random Forests and Boosting
  5. Splines and smoothing splines
    • \(f(x)\) is a sum of smooth spline basis functions

We will review these models and their implementations in more detail in future classes.
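As a small preview of the first method in the list above, here is a hand-rolled kNN regression in base R on simulated data (the choice of k is arbitrary):

set.seed(5)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
knn_predict <- function(x0, x, y, k = 10) {
  idx <- order(abs(x - x0))[1:k]   # indices of the k nearest neighbours of x0
  mean(y[idx])                     # predict by averaging their responses
}
x_grid <- seq(0, 10, length.out = 200)
y_hat  <- sapply(x_grid, knn_predict, x = x, y = y, k = 10)
plot(x, y); lines(x_grid, y_hat, lwd = 2)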

Parametric vs Nonparametric

Parametric

few parameters, strong assumptions, rigid shape

  • Explicit expression for the mean function in terms of inputs and parameters
  • Linear regression model: \(Y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon\)
  • Nonlinear model: \(Y= \lambda + (\xi−\lambda) \exp(−\kappa x)\)
  • Less flexible but easier to explain

Nonparametric

many (or infinitely many) parameters, weak assumptions, flexible shape

  • No fixed or explicit form for the mean function,
    we control how wiggly the function is (its smoothness)
  • It still depends on calculating unknowns (parameters)
  • Very flexible, some harder to explain and interpret
  • Typically low bias but high variance in small-data regimes

Melanoma Data

How would you model such relationships?

  • How do we express a mean function for the following data of skin cancer incidences per 100,000 over a 37-year period?

  • Is a simple linear regression sufficient: \(Y = \beta_0 + \beta_1 x + \epsilon\)?

  • How about a polynomial model: \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\)?

  • How about a loess model (local polynomial weighted regression)?

Parametric vs Nonparametric

Small span → very flexible (can overfit)
Large span → smoother (can underfit)
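A quick illustration of the span parameter with R's loess on simulated data (the span values are chosen only to show the contrast):

set.seed(9)
x <- seq(0, 10, length.out = 150)
y <- sin(x) + rnorm(150, sd = 0.4)
fit_flex   <- loess(y ~ x, span = 0.2)   # small span: wiggly fit, can overfit
fit_smooth <- loess(y ~ x, span = 0.9)   # large span: smooth fit, can underfit
plot(x, y)
lines(x, predict(fit_flex),   lwd = 2)
lines(x, predict(fit_smooth), lwd = 2, lty = 2)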


Advantages

  • Captures nonlinear relationships easily
  • Does not assume any particular functional form
  • Works well for exploratory analysis and visualization

Limitations

  • Computationally expensive for large datasets
  • Does not easily provide a formula or model for prediction outside the observed range (no “global” model)
  • Sensitive to outliers (though “robust LOWESS” versions exist)

Choosing a Model

There is no single “best” model.

We compare using:

  • Prediction error
  • Interpretability
  • Scientific plausibility
  • Stability
  • Scalability if you work with large data

Takeaways

  • The model is a useful abstraction for the purpose of analysis (confirmatory, prediction)
    Models are tools, not truths

  • There are often competing models for the same task and data set, and you want to compare several models!

  • Randomness is a feature, not a bug

  • Linear models are a foundation—not a destination

  • Estimate the model parameters from data

  • Assess the quality of the model based on the parameter estimates and various criteria (See Validation class)

  • Statistical learning is about tradeoffs