Understand how parameters contribute to learning models
Difference between parameters and hyperparameters
What is an estimation principle, and how does it differ from its computational implementation
Expressing models in vector/matrix notation
Ordinary least squares (OLS)
Perform OLS in R
Handling rank deficiency of \({\bf X}\) and \(n < p\)
Introduction to maximum likelihood estimation
Whether the learning model is
a regression or classification model
a parametric or nonparametric model
a linear or logistic regression
a decision tree, random forest, or gradient boosting tree
a convolutional or a recurrent neural net
and so forth
the model contains
parameters that need to be estimated from the
data
hyperparameters that need to be determined
A parameter is an unknown quantity that is
estimated during the training process. For example, the intercept \(\beta_0\), slope \(\beta_1\), and error variance \(\sigma^2\) are parameters in a simple
linear regression model.
A hyperparameter controls the learning process
itself. Examples are the depth (number of levels) of a decision tree,
the number of predictors in a regression, the number of layers in a neural
net, and the learning rate in stochastic gradient descent.
Hyperparameters are not directly estimated from the data. Instead, we
use default values, guesses, or an optimization process to find good
values for these parameters.
This process is called hyperparameter tuning.
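A minimal sketch of hyperparameter tuning in base R, under illustrative assumptions: the polynomial degree of a regression is treated as the hyperparameter, the data are simulated, and a simple train/validation split stands in for a full cross-validation scheme.

```r
# Hyperparameter tuning sketch: choose the polynomial degree by grid search
# on a held-out validation set. Data are simulated for illustration.
set.seed(42)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(n, sd = 0.5)  # true model is quadratic

train <- sample(n, 150)                     # 150 training, 50 validation rows
degrees <- 1:6                              # candidate hyperparameter values
val_mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = data.frame(x = x, y = y), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x[-train]))
  mean((y[-train] - pred)^2)                # validation mean squared error
})
best_degree <- degrees[which.min(val_mse)]  # the tuned hyperparameter value
best_degree
```

The grid, the split, and the error metric are all choices of the analyst; packages such as `caret` or `tidymodels` automate this loop, but the logic is the same.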
A simple linear regression model has 2 parameters (3 if we count the error variance \(\sigma^2\))
A random forest might have 20 parameters
A credit scoring model might have 50 or 200 parameters
GPT-1 has 117 million parameters
GPT-2 has 1.5 billion parameters
GPT-3 has 175 billion parameters
GPT-4? It is anyone’s guess, OpenAI won’t say
An estimation principle is a mathematical
rule to find estimators for the parameters that have desirable
properties.
Which estimation principles are available for a data analytics problem
depends on
the nature of the problem
the assumptions we are willing to make
what we know about the data-generating mechanism
the complexity of the problem
Separately from the estimation principle we need to think about how to solve the estimation problem computationally.
Just because you can write down the math does not mean you can compute it.
The estimation algorithm that works well for 10 parameters might be impossible for 1M parameters
Techniques that work for 1M parameters can do poorly on “small” problems
Intuitively, there has to be a relationship between \(n\), the size of the training data, and \(p\), the number of parameters
Is it possible to estimate 1,000 parameters from a data set with 200 observations?
Least Squares
Maximum Likelihood
Bayes Estimation
One of the most important estimation principles.
For models with additive error structure, \({\bf Y} = {\bf X} {\bf \beta} + {\bf \epsilon}\)
Requires no distributional assumptions beyond the mean and variance of the errors
Variants
OLS: ordinary least
squares
errors are uncorrelated and have the same variance; \({\bf \epsilon} \sim ({\bf 0}, \sigma^2{\bf I})\)
WLS: weighted least
squares
errors are uncorrelated with different variances; \({\bf \epsilon} \sim ({\bf 0},
\sigma^2\textbf{W})\)
GLS: generalized least
squares
errors are correlated; \({\bf \epsilon} \sim
({\bf 0}, \textbf{V})\)
Objective: find the estimator that minimizes the sum of the squared errors \[{\bf \epsilon}^\prime {\bf \epsilon} = ({\bf Y} - {\bf X}{\bf \beta})^\prime ({\bf Y} - {\bf X}{\bf \beta})\]
Solution: Expand the expression, take derivatives
with respect to \({\bf \beta}\), set to
\({\bf 0}\) and solve
If \({\bf X}\) is a full-rank matrix, then \[\begin{aligned} \widehat{{\bf \beta}} &= \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ \widehat{{\bf Y}} = {\bf X}\widehat{{\bf \beta}} &= {\bf X} \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ &= \textbf{H}{\bf Y} \end{aligned}\] \(\left ( {\bf X}^\prime {\bf X} \right) ^{-1}\) is called the inverse matrix of \(({\bf X}^\prime{\bf X})\)
\(({\bf X}^\prime{\bf X})\) is called the crossproduct matrix of \({\bf X}\).
\(\textbf{H}\) is called the
“hat” matrix.
If you multiply it into the data \({\bf
Y}\) you get the predicted values \(\widehat{{\bf Y}}\).
\(\widehat{{\bf \beta}}\) is an unbiased estimator of \({\bf \beta}\): \(E[\widehat{{\bf \beta}}] = {\bf \beta}\).
\(\widehat{{\bf Y}}\) is an unbiased estimator of \({\bf X}{\bf \beta}\): \(E[\widehat{{\bf Y}}] = {\bf X}{\bf \beta}\)
The error variance \(\sigma^2\) can be estimated as \(\widehat{{\bf \epsilon}}^\prime \widehat{{\bf \epsilon}} / (n - r({\bf X}))\)
\(r({\bf X})\) is called the rank of matrix \({\bf X}\), the number of linearly independent rows and columns in the matrix
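The closed-form OLS solution above can be carried out "by hand" in R and checked against `lm()`. A sketch with simulated data (names and values are illustrative):

```r
# OLS via the normal equations, checked against lm().
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
X <- cbind(1, x1, x2)                     # model matrix with intercept column
beta <- c(2, 1, -0.5)
y <- drop(X %*% beta + rnorm(n, sd = 0.3))

XtX <- crossprod(X)                       # X'X, the crossproduct matrix
beta_hat <- solve(XtX, crossprod(X, y))   # (X'X)^{-1} X'Y
H <- X %*% solve(XtX) %*% t(X)            # the "hat" matrix
y_hat <- H %*% y                          # predicted values Y-hat = HY
sigma2_hat <- sum((y - y_hat)^2) / (n - qr(X)$rank)  # estimate of sigma^2

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), unname(coef(fit)))   # TRUE
```

In practice `lm()` uses a QR decomposition rather than inverting \({\bf X}^\prime{\bf X}\), which is numerically more stable; the estimates agree.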
Why are we paying attention to the rank of \({\bf X}\)?
the rank of an \((n \times k)\) matrix cannot exceed the smaller of \(n\) or \(k\): \(r({\bf X}_{n \times (p+1)}) \leq \min(n, p+1)\)
If a square matrix is not of full rank, its inverse does not exist. How can we then compute the OLS estimator?
\(({\bf X}^\prime {\bf X})\) is of dimension \((p+1, p+1)\). In a full-rank case, its rank is \(p+1\)
But it is also true (a linear algebra result) that \(r({\bf X}^\prime{\bf X}) = r({\bf X})\)
Situations when \(({\bf X}^\prime {\bf X})\) is not of full rank (is rank-deficient)
Columns of \({\bf X}\) are linearly dependent: one column can be expressed as a linear function of other columns
models with qualitative inputs (classification factors)
including the same input variable multiple times
including a linear combination of inputs, e.g., \({\bf X}_1\), \({\bf X}_2\), \({\bf X}_3 = {\bf X}_1 + {\bf X}_2\).
\(n < p+1\): Fewer observations than parameters.
feature engineering can add many potential inputs
small data sets
very large black-box models, e.g. in neural nets \(p \gg n\)
In the first case (linear dependence), it does not matter if \(n \gg p\): the model will always be rank-deficient.
The cases \(p > n\) and \(p \gg n\) must be handled differently.
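A quick sketch of the linear-dependence case in R, with simulated data: one column is an exact linear combination of two others, so \({\bf X}\) is rank-deficient no matter how large \(n\) is, and `lm()` aliases one coefficient.

```r
# Rank deficiency from linear dependence: x3 = x1 + x2 exactly.
set.seed(7)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2                       # exact linear combination of columns
y <- 1 + x1 + 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2, x3)
qr(X)$rank                          # 3, not 4: (X'X) has no inverse

fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                           # the coefficient for x3 comes back NA
```

`lm()` detects the deficiency via pivoting in its QR decomposition and reports the aliased coefficient as `NA` rather than failing outright.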
Variable selection methods (forward, backward,
stepwise)
Don’t let all variables into the model, pick good ones
Shrinkage estimators (Ridge regression, Lasso
regression)
Apply a penalty term that shrinks the estimates toward zero
Dimension reduction (principal component
analysis, partial least squares)
Create \(k < p\) linear combinations
of the data and use those as inputs
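Of these remedies, shrinkage is the easiest to sketch directly: adding a ridge penalty \(\lambda{\bf I}\) to \({\bf X}^\prime{\bf X}\) makes the system solvable even when \(p > n\). The simulated data and the fixed value of \(\lambda\) below are illustrative; in practice \(\lambda\) is a hyperparameter that would itself be tuned.

```r
# Ridge regression sketch: (X'X + lambda*I) is invertible even when p > n.
set.seed(3)
n <- 50; p <- 100                   # fewer observations than parameters
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5)) # only 5 inputs truly matter
y <- X %*% beta + rnorm(n)

# OLS fails here: crossprod(X) is 100 x 100 but has rank at most n = 50
qr(crossprod(X))$rank

lambda <- 1                         # penalty strength, fixed for illustration
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
length(beta_ridge)                  # 100 shrunken estimates despite n = 50
```

Production code would use a package such as `glmnet` (which also implements the Lasso) and choose \(\lambda\) by cross-validation.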
If we think of data \({\bf Y}\) as
the realization of a random data-generating mechanism then it would make
sense
to look at the probability distribution of the data
use as parameter estimates those values that make it most likely to have generated the observed data
In other words: find the most likely explanation for the observed data
This requires that we know the joint probability
distribution of the \(n\) data
points.
That distribution, considered a function of the parameters, is called
the likelihood function.
Thankfully, that is not that difficult when the data points are independent (a random sample)
Suppose we flip a coin 50 times and it lands on “heads” 20 times.
What is the maximum likelihood estimate of the probability that the
coin lands on “heads”?
For each flip we are dealing with a binary (Bernoulli) distribution \[\text{Pr}(Y = y) = \pi^y \times
(1-\pi)^{1-y} \qquad \text{where } y = \left \{\begin{array}{ll} 1 &
\text{if heads} \\ 0 & \text{if tails}\end{array} \right .\]
and \(\pi\) is the probability of
“heads”.
Because the coin flips are independent, the joint probability distribution of all 50 tosses is \[\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\]
The likelihood is the joint distribution given the
data, and is interpreted as a function of the parameters (here, \(\pi\))
The value \(\widehat{\pi}\) that maximizes \(\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\) also maximizes the log of the function: \[\begin{aligned} \ell(\widehat{\pi}; {\bf Y}) &= \log \left ( \prod_{i=1}^{50} \widehat{\pi} ^{y_i} \times (1-\widehat{\pi}) ^{1-y_i} \right ) \\ &= \sum_{i=1}^{50} y_i \log(\widehat{\pi}) + (1-y_i)\log(1-\widehat{\pi})\\ &= \log(\widehat{\pi}) \times 20 + \log(1-\widehat{\pi}) \times 30 \end{aligned}\]
Now take the derivative of \(\ell(\widehat{\pi}; {\bf Y})\) with respect to \(\widehat{\pi}\), set the derivative to zero and solve for \(\widehat{\pi}\). \[\frac{\partial \ell(\widehat{\pi}; {\bf Y})}{\partial \widehat{\pi}} = 20 \frac{1}{\widehat{\pi}} - 30 \frac{1}{1-\widehat{\pi}} \equiv 0\]
The solution is \(\widehat{\pi} = 20/50 = 0.4\).
More generally, the maximum likelihood estimator (MLE) in this case is just the sample proportion of successes, i.e., the sample mean: \(\widehat{\pi}_{ML} = \frac{1}{n}\sum_{i=1}^{n} Y_i\).
Why does this all matter?
MLEs have very desirable properties and we like them a lot; they can be biased, however
You need to make a distributional assumption to derive the MLE.
OLS assumes errors have mean zero and variance \(\sigma^2\).
An MLE assumption for regression could be that the data are normally
distributed (a much stronger assumption).
Certain classes of models are fit by MLE, e.g., generalized linear models (logistic regression, Poisson regression, etc.)
Software might derive MLEs. That means there is an implicit distributional assumption.
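A sketch of this in R, with simulated data: `glm()` with `family = binomial` fits a logistic regression by maximum likelihood, which carries the implicit assumption that the responses are Bernoulli-distributed.

```r
# Logistic regression: a generalized linear model fit by MLE.
set.seed(11)
n <- 300
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))  # true success probabilities
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)   # MLE via iteratively reweighted LS
coef(fit)                              # estimates of (-0.5, 1.5)
logLik(fit)                            # the maximized log-likelihood
```

Nothing in the call mentions a likelihood, yet choosing `family = binomial` is precisely the distributional assumption that defines the MLE being computed.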