Learning Goals

Estimation Concepts

Parameters & Hyperparameters

Whether the learning model is

  • a regression or classification model

  • a parametric or nonparametric model

  • a linear or logistic regression

  • a decision tree, random forest, or gradient boosting tree

  • a convolutional or a recurrent neural net

  • and so forth

the model contains

  • parameters that need to be estimated from the data

  • hyperparameters that need to be determined

  • A parameter is an unknown quantity that is estimated during the training process. For example, the intercept \(\beta_0\), slope \(\beta_1\), and error variance \(\sigma^2\) are parameters in a simple linear regression model.

  • A hyperparameter controls the learning process itself. Examples are the depth (number of levels) of a decision tree, the number of predictors in a regression, the number of layers in a neural net, and the learning rate in stochastic gradient descent.
    Hyperparameters are not directly estimated from the data. Instead, we use default values, educated guesses, or an optimization process to find good hyperparameter values.
    This process is called hyperparameter tuning.
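A minimal sketch of hyperparameter tuning, assuming NumPy; the data, the candidate grid, and the `ridge_fit` helper are all made up for illustration. The idea: fit the model on a training set for each candidate hyperparameter value, score on a held-out validation set, and keep the best.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data (made up for illustration)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# Hold out a validation set
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

def ridge_fit(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y; lam is a hyperparameter."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Tuning loop: one fit per candidate value, scored by validation MSE
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {lam: np.mean((y_val - X_val @ ridge_fit(X_tr, y_tr, lam)) ** 2)
           for lam in candidates}
best_lam = min(val_mse, key=val_mse.get)
```

Note that the candidate grid itself is a choice, not an estimate; this simple loop is what cross-validation tools automate.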

What is a large model?

  • A simple linear regression model has 2 parameters

  • A random forest might have 20 parameters

  • A credit scoring model might have 50 or 200 parameters

  • GPT-1 has 117 million parameters

  • GPT-2 has 1.5 billion parameters

  • GPT-3 has 175 billion parameters

  • GPT-4? It is anyone’s guess; OpenAI won’t say

Definition: Estimation Principles

An estimation principle is a mathematical rule for finding parameter estimators with desirable properties.
Which estimation principles are available for a data analytics problem depends on

  • the nature of the problem

  • the assumptions we are willing to make

  • what we know about the data-generating mechanism

  • the complexity of the problem

Separately from the estimation principle we need to think about how to solve the estimation problem computationally.

Challenges

  • Just because you can write down the math does not mean you can compute it.

  • The estimation algorithm that works well for 10 parameters might be impossible for 1M parameters

  • Techniques that work for 1M parameters can do poorly on “small” problems

  • Intuitively, there has to be a relationship between \(n\), the size of the training data, and \(p\), the number of parameters

  • Is it possible to estimate 1,000 parameters from a data set with 200 observations?

Important Estimation Principles

  • Least Squares

  • Maximum Likelihood

  • Bayes Estimation

Least Squares Estimation

Overview

  • One of the most important estimation principles.

  • For models with additive error structure, \({\bf Y} = {\bf X} {\bf \beta} + {\bf \epsilon}\)

  • Requires no distributional assumptions beyond the mean and variance of the errors

  • Variants

    • OLS: ordinary least squares
      errors are uncorrelated and have same variance; \({\bf \epsilon} \sim ({\bf 0}, \sigma^2{\bf I})\)

    • WLS: weighted least squares
      errors are uncorrelated and have different variances; \({\bf \epsilon} \sim ({\bf 0}, \sigma^2\textbf{W})\), where \(\textbf{W}\) is diagonal

    • GLS: generalized least squares
      errors are correlated; \({\bf \epsilon} \sim ({\bf 0}, \textbf{V})\)
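The WLS variant can be sketched in NumPy with simulated heteroscedastic data; the variance function `sigma2` is assumed known here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data where the error variance grows with x
n = 200
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = 0.25 * x ** 2                      # assumed-known variance function
y = 1.0 + 0.5 * x + rng.normal(scale=np.sqrt(sigma2), size=n)

# WLS: beta_hat = (X' W^{-1} X)^{-1} X' W^{-1} Y with W = diag(sigma2),
# which downweights the noisier observations
Winv = np.diag(1.0 / sigma2)
beta_wls = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)

# OLS for comparison: same estimand, but it ignores the unequal variances
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Both estimators are unbiased here; WLS is preferred because it has smaller variance when the unequal error variances are known (up to a constant).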

Ordinary Least Squares

Objective: find the estimator that minimizes the sum of the squared errors \[{\bf \epsilon}^\prime {\bf \epsilon} = ({\bf Y} - {\bf X}{\bf \beta})^\prime ({\bf Y} - {\bf X}{\bf \beta})\]

Solution: Expand the expression, take derivatives with respect to \({\bf \beta}\), set to \({\bf 0}\) and solve

If \({\bf X}\) has full column rank, then \[\begin{aligned} \widehat{{\bf \beta}} &= \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ \widehat{{\bf Y}} = {\bf X}\widehat{{\bf \beta}} &= {\bf X} \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ &= \textbf{H}{\bf Y} \end{aligned}\] \(\left ( {\bf X}^\prime {\bf X} \right) ^{-1}\) denotes the inverse of \(({\bf X}^\prime{\bf X})\); it exists because \({\bf X}\), and hence \(({\bf X}^\prime{\bf X})\), has full rank.
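The closed-form solution can be sketched in NumPy (simulated data; `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated simple linear regression: y = 1.5 + 2.0 x + eps
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column (full column rank here)
X = np.column_stack([np.ones(n), x])

# OLS estimate: solve the normal equations (X'X) beta = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X'; multiplying it into Y gives fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# Error variance estimate: e'e / (n - r(X)), with r(X) = 2 here
resid = y - y_hat
sigma2_hat = resid @ resid / (n - 2)
```

On ill-conditioned problems a QR- or SVD-based solver such as `np.linalg.lstsq` is numerically preferable; the normal-equations form is shown because it mirrors the formulas above.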

OLS Properties

  • \(({\bf X}^\prime{\bf X})\) is called the crossproduct matrix of \({\bf X}\).

  • \(\textbf{H}\) is called the “hat” matrix.
    If you multiply it into the data \({\bf Y}\) you get the predicted values \(\widehat{{\bf Y}}\).

  • \(\widehat{{\bf \beta}}\) is an unbiased estimator of \({\bf \beta}\): \(E[\widehat{{\bf \beta}}] = {\bf \beta}\).

  • \(\widehat{{\bf Y}}\) is an unbiased estimator of \({\bf X}{\bf \beta}\): \(E[\widehat{{\bf Y}}] = {\bf X}{\bf \beta}\)

  • The error variance \(\sigma^2\) can be estimated as \(\widehat{{\bf \epsilon}}^\prime \widehat{{\bf \epsilon}} / (n - r({\bf X}))\)

  • \(r({\bf X})\) denotes the rank of matrix \({\bf X}\): the number of linearly independent rows (equivalently, columns) of the matrix

The Rank of \({\bf X}\)

Why are we paying attention to the rank of \({\bf X}\)?

  • the rank of an \((n \times k)\) matrix cannot exceed the smaller of \(n\) and \(k\): \(r({\bf X}_{(n \times (p+1))}) \leq \text{min}(n, p+1)\)

  • If a square matrix is not of full rank, its inverse does not exist. How can we then compute the OLS estimator?

  • \(({\bf X}^\prime {\bf X})\) is of dimension \((p+1, p+1)\). In a full-rank case, its rank is \(p+1\)

  • But it is also true (a linear algebra result) that \(r({\bf X}^\prime{\bf X}) = r({\bf X})\)

Situations when \(({\bf X}^\prime {\bf X})\) is not of full rank (is rank-deficient)

  1. Columns of \({\bf X}\) are linearly dependent: one column can be expressed as a linear function of other columns

    • models with qualitative inputs (classification factors)

    • including the same input variable multiple times

    • including linear combinations of inputs, e.g., \({\bf X}_1, {\bf X}_2, {\bf X}_3 = {\bf X}_1 + {\bf X}_2\).

  2. \(n < p+1\): Fewer observations than parameters.

    • feature engineering can add many potential inputs

    • small data sets

    • very large black-box models, e.g., neural nets with \(p \gg n\)

In case 1, it does not matter whether \(n \gg p\); the model will always be rank-deficient.
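A quick numeric illustration of case 1 (simulated data): with an exact linear dependence among the columns, \({\bf X}\) stays rank-deficient no matter how large \(n\) is, and \({\bf X}^\prime{\bf X}\) inherits the same rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many observations, but x3 = x1 + x2 is an exact linear combination
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2
X = np.column_stack([np.ones(n), x1, x2, x3])   # 4 columns

rank_X = np.linalg.matrix_rank(X)               # 3, not 4
rank_XtX = np.linalg.matrix_rank(X.T @ X)       # same as the rank of X
```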

Solving Rank Problems

The cases \(p > n\) and \(p \gg n\) must be handled differently.

  • Variable selection methods (forward, backward, stepwise)
    Don’t let all variables into the model, pick good ones

  • Shrinkage estimators (Ridge regression, Lasso regression)
    Apply a penalty term that shrinks the estimates toward zero

  • Dimension reduction (principal component analysis, partial least squares)
    Create \(k < p\) linear combinations of the data and use those as inputs
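A sketch of why shrinkage helps computationally when \(p > n\) (simulated data): \({\bf X}^\prime{\bf X}\) is singular, so the OLS estimator cannot be computed, but the ridge system \(({\bf X}^\prime{\bf X} + \lambda {\bf I})\) is invertible for any \(\lambda > 0\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Fewer observations than parameters: OLS has no unique solution
n, p = 50, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X                     # 200 x 200, but rank at most 50

# Ridge: adding lam * I to X'X makes the system nonsingular for lam > 0
lam = 1.0
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
```

The penalty trades a little bias for a well-defined, lower-variance estimator; the value of \(\lambda\) is a hyperparameter to be tuned.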

Maximum Likelihood Estimation

The Principle

If we think of the data \({\bf Y}\) as the realization of a random data-generating mechanism, then it would make sense

  • to look at the probability distribution of the data

  • to use as parameter estimates those values that make it most likely to have generated the observed data

  • In other words: find the most likely explanation for the observed data


This requires that we know the joint probability distribution of the \(n\) data points.

That distribution, considered a function of the parameters, is called the likelihood function.

Thankfully, that is not that difficult when the data points are independent (a random sample)

Example of Maximum Likelihood Estimation

Suppose we are flipping a coin 50 times, it lands on “heads” 20 times.

What is the maximum likelihood estimate of the probability that the coin lands on “heads”?

For each flip we are dealing with a Bernoulli (binary) distribution \[\text{Pr}(Y = y) = \pi^y \times (1-\pi)^{1-y} \qquad \text{where } y = \left \{\begin{array}{ll} 1 & \text{if heads} \\ 0 & \text{if tails}\end{array} \right .\] and \(\pi = \text{Pr}(\text{Heads})\) is the probability of “heads”.

Because the coin flips are independent, the joint probability distribution of all 50 tosses is \[\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\]

Maximum Likelihood Estimation

The likelihood is the joint distribution given the data, and is interpreted as a function of the parameters (here, \(\pi\))

The value \(\widehat{\pi}\) that maximizes \(\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\) also maximizes the log of the function: \[\begin{aligned} \ell(\widehat{\pi}; {\bf Y}) &= \log \left ( \prod_{i=1}^{50} \widehat{\pi} ^{y_i} \times (1-\widehat{\pi}) ^{1-y_i} \right ) \\ &= \sum_{i=1}^{50} y_i \log(\widehat{\pi}) + (1-y_i)\log(1-\widehat{\pi})\\ &= \log(\widehat{\pi}) \times 20 + \log(1-\widehat{\pi}) \times 30 \end{aligned}\]

Now take the derivative of \(\ell(\widehat{\pi}; {\bf Y})\) with respect to \(\widehat{\pi}\), set the derivative to zero and solve for \(\widehat{\pi}\). \[\frac{\partial \ell(\widehat{\pi}; {\bf Y})}{\partial \widehat{\pi}} = 20 \frac{1}{\widehat{\pi}} - 30 \frac{1}{1-\widehat{\pi}} \equiv 0\]

The solution is \(\widehat{\pi} = 20/50 = 0.4\).

More generally, the maximum likelihood estimator (MLE) of \(\pi\) in this case is just the sample mean: \(\widehat{\pi}_{ML} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i\).
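The coin-flip MLE can be checked numerically; a sketch that evaluates the log-likelihood on a grid for the 20-heads-in-50-flips data from the example:

```python
import numpy as np

# 20 heads, 30 tails, as in the example
y = np.array([1] * 20 + [0] * 30)

def log_lik(pi, y):
    """Bernoulli log-likelihood: sum of y*log(pi) + (1-y)*log(1-pi)."""
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Evaluate on a grid over (0, 1) and pick the maximizer
grid = np.linspace(0.001, 0.999, 999)
pi_hat = grid[np.argmax([log_lik(p, y) for p in grid])]

# The grid maximizer agrees with the closed form: the sample mean 20/50 = 0.4
```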

Why does this all matter?

  • MLEs have very desirable properties, and we like them a lot; they can be biased, however

  • You need to make a distributional assumption to derive an MLE. OLS assumes only that the errors have mean zero and variance \(\sigma^2\).
    An MLE for a regression model could instead assume that the data are normally distributed (a much stronger assumption).

  • Certain classes of models are fit by MLE, e.g., generalized linear models (logistic regression, Poisson regression, etc.)

  • Software might derive MLEs. That means there is an implicit distributional assumption.