Understand how parameters contribute to learning models
Difference between parameters and hyperparameters
What is an estimation principle, and how does it differ from its computational implementation
Expressing models in vector/matrix notation
Ordinary least squares (OLS)
Perform OLS in R
Handling rank deficiency of \({\bf X}\) and \(n < p\)
Introduction to maximum likelihood estimation
Whether the learning model is
a regression or classification model
a parametric or nonparametric model
a linear or logistic regression
a decision tree, random forest, or gradient boosting tree
a convolutional or a recurrent neural net
and so forth
the model contains
parameters that need to be estimated from the
data
hyperparameters that need to be determined
A parameter is an unknown quantity that is
estimated during the training process. For example, the intercept \(\beta_0\), slope \(\beta_1\), and error variance \(\sigma^2\) are parameters in a simple
linear regression model.
A hyperparameter controls the learning process
itself. Examples are the depth (number of levels) of a decision tree,
the number of predictors in a regression, the number of layers in a neural
net, and the learning rate in stochastic gradient descent.
Hyperparameters are not directly estimated from the data. Instead, we
use default values, guesses, or an optimization process to find good
values for these parameters.
This process is called hyperparameter tuning.
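A minimal sketch of hyperparameter tuning in base R, under illustrative assumptions: the polynomial degree of a regression is treated as the hyperparameter, the data are simulated, and a simple train/validation split stands in for a full cross-validation scheme.

```r
# Hyperparameter tuning sketch: choose the polynomial degree by grid search
# on a held-out validation set. Data are simulated for illustration.
set.seed(42)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(n, sd = 0.5)  # true model is quadratic

train <- sample(n, 150)                     # 150 training, 50 validation rows
degrees <- 1:6                              # candidate hyperparameter values
val_mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = data.frame(x = x, y = y), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x[-train]))
  mean((y[-train] - pred)^2)                # validation mean squared error
})
best_degree <- degrees[which.min(val_mse)]  # the tuned hyperparameter value
best_degree
```

The grid, the split, and the error metric are all choices of the analyst; packages such as `caret` or `tidymodels` automate this loop, but the logic is the same.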
A simple linear regression model has 2 parameters (3 if we count the error variance \(\sigma^2\))
A random forest might have 20 parameters
A credit scoring model might have 50 or 200 parameters
GPT-1 has 117 million parameters
GPT-2 has 1.5 billion parameters
GPT-3 has 175 billion parameters
GPT-4? It is anyone’s guess, OpenAI won’t say
An estimation principle is a mathematical
rule to find estimators for the parameters that have desirable
properties.
Which estimation principles are available for a data analytics problem
depends on
the nature of the problem
the assumptions we are willing to make
what we know about the data-generating mechanism
the complexity of the problem
Separately from the estimation principle we need to think about how to solve the estimation problem computationally.
Just because you can write down the math does not mean you can compute it.
The estimation algorithm that works well for 10 parameters might be impossible for 1M parameters
Techniques that work for 1M parameters can do poorly on “small” problems
Intuitively, there has to be a relationship between \(n\), the size of the training data, and \(p\), the number of parameters
Is it possible to estimate 1,000 parameters from a data set with 200 observations?
Least Squares
Maximum Likelihood
Bayes Estimation
One of the most important estimation principles.
For models with additive error structure, \({\bf Y} = {\bf X} {\bf \beta} + {\bf \epsilon}\)
Requires no distributional assumptions beyond the mean and variance of the errors
Variants
OLS: ordinary least
squares
errors are uncorrelated and have the same variance; \({\bf \epsilon} \sim ({\bf 0}, \sigma^2{\bf I})\)
WLS: weighted least
squares
errors are uncorrelated with different variances; \({\bf \epsilon} \sim ({\bf 0},
\sigma^2\textbf{W})\)
GLS: generalized least
squares
errors are correlated; \({\bf \epsilon} \sim
({\bf 0}, \textbf{V})\)
Objective: find the estimator that minimizes the sum of the squared errors \[{\bf \epsilon}^\prime {\bf \epsilon} = ({\bf Y} - {\bf X}{\bf \beta})^\prime ({\bf Y} - {\bf X}{\bf \beta})\]
Solution: Expand the expression, take derivatives
with respect to \({\bf \beta}\), set to
\({\bf 0}\) and solve
If \({\bf X}\) is a full-rank matrix, then \[\begin{aligned} \widehat{{\bf \beta}} &= \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ \widehat{{\bf Y}} = {\bf X}\widehat{{\bf \beta}} &= {\bf X} \left ( {\bf X}^\prime {\bf X} \right) ^{-1} {\bf X}^\prime{\bf Y} \\ &= \textbf{H}{\bf Y} \end{aligned}\] \(\left ( {\bf X}^\prime {\bf X} \right) ^{-1}\) is called the inverse matrix of \(({\bf X}^\prime{\bf X})\)
\(({\bf X}^\prime{\bf X})\) is called the crossproduct matrix of \({\bf X}\).
\(\textbf{H}\) is called the
“hat” matrix.
If you multiply it into the data \({\bf
Y}\) you get the predicted values \(\widehat{{\bf Y}}\).
\(\widehat{{\bf \beta}}\) is an unbiased estimator of \({\bf \beta}\): \(E[\widehat{{\bf \beta}}] = {\bf \beta}\).
\(\widehat{{\bf Y}}\) is an unbiased estimator of \({\bf X}{\bf \beta}\): \(E[\widehat{{\bf Y}}] = {\bf X}{\bf \beta}\)
The error variance \(\sigma^2\) can be estimated as \(\widehat{{\bf \epsilon}}^\prime \widehat{{\bf \epsilon}} / (n - r({\bf X}))\)
\(r({\bf X})\) is called the rank of matrix \({\bf X}\), the number of linearly independent rows and columns in the matrix
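The closed-form OLS solution above can be carried out "by hand" in R and checked against `lm()`. A sketch with simulated data (names and values are illustrative):

```r
# OLS via the normal equations, checked against lm().
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
X <- cbind(1, x1, x2)                     # model matrix with intercept column
beta <- c(2, 1, -0.5)
y <- drop(X %*% beta + rnorm(n, sd = 0.3))

XtX <- crossprod(X)                       # X'X, the crossproduct matrix
beta_hat <- solve(XtX, crossprod(X, y))   # (X'X)^{-1} X'Y
H <- X %*% solve(XtX) %*% t(X)            # the "hat" matrix
y_hat <- H %*% y                          # predicted values Y-hat = HY
sigma2_hat <- sum((y - y_hat)^2) / (n - qr(X)$rank)  # estimate of sigma^2

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), unname(coef(fit)))   # TRUE
```

In practice `lm()` uses a QR decomposition rather than inverting \({\bf X}^\prime{\bf X}\), which is numerically more stable; the estimates agree.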
Why are we paying attention to the rank of \({\bf X}\)?
the rank of an \((n \times k)\) matrix cannot exceed the smaller of \(n\) or \(k\): \(r({\bf X}_{n \times (p+1)}) \leq \min(n, p+1)\)
If a square matrix is not of full rank, its inverse does not exist. How can we then compute the OLS estimator?
\(({\bf X}^\prime {\bf X})\) is of dimension \((p+1, p+1)\). In a full-rank case, its rank is \(p+1\)
But it is also true (a linear algebra result) that \(r({\bf X}^\prime{\bf X}) = r({\bf X})\)
Situations when \(({\bf X}^\prime {\bf X})\) is not of full rank (is rank-deficient)
Columns of \({\bf X}\) are linearly dependent: one column can be expressed as a linear function of other columns
models with qualitative inputs (classification factors)
including the same input variable multiple times
including a linear combination of inputs, e.g., \({\bf X}_1\), \({\bf X}_2\), \({\bf X}_3 = {\bf X}_1 + {\bf X}_2\).
\(n < p+1\): Fewer observations than parameters.
feature engineering can add many potential inputs
small data sets
very large black-box models, e.g. in neural nets \(p \gg n\)
In the first case (linear dependence), it does not matter if \(n \gg p\): the model will always be rank-deficient.
The cases \(p > n\) and \(p \gg n\) must be handled differently.
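A quick sketch of the linear-dependence case in R, with simulated data: one column is an exact linear combination of two others, so \({\bf X}\) is rank-deficient no matter how large \(n\) is, and `lm()` aliases one coefficient.

```r
# Rank deficiency from linear dependence: x3 = x1 + x2 exactly.
set.seed(7)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2                       # exact linear combination of columns
y <- 1 + x1 + 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2, x3)
qr(X)$rank                          # 3, not 4: (X'X) has no inverse

fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                           # the coefficient for x3 comes back NA
```

`lm()` detects the deficiency via pivoting in its QR decomposition and reports the aliased coefficient as `NA` rather than failing outright.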
Variable selection methods (forward, backward,
stepwise)
Don’t let all variables into the model, pick good ones
Shrinkage estimators (Ridge regression, Lasso
regression)
Apply a penalty term that shrinks the estimates toward zero
Dimension reduction (principal component
analysis, partial least squares)
Create \(k < p\) linear combinations
of the data and use those as inputs
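Of these remedies, shrinkage is the easiest to sketch directly: adding a ridge penalty \(\lambda{\bf I}\) to \({\bf X}^\prime{\bf X}\) makes the system solvable even when \(p > n\). The simulated data and the fixed value of \(\lambda\) below are illustrative; in practice \(\lambda\) is a hyperparameter that would itself be tuned.

```r
# Ridge regression sketch: (X'X + lambda*I) is invertible even when p > n.
set.seed(3)
n <- 50; p <- 100                   # fewer observations than parameters
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5)) # only 5 inputs truly matter
y <- X %*% beta + rnorm(n)

# OLS fails here: crossprod(X) is 100 x 100 but has rank at most n = 50
qr(crossprod(X))$rank

lambda <- 1                         # penalty strength, fixed for illustration
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
length(beta_ridge)                  # 100 shrunken estimates despite n = 50
```

Production code would use a package such as `glmnet` (which also implements the Lasso) and choose \(\lambda\) by cross-validation.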
If we think of data \({\bf Y}\) as
the realization of a random data-generating mechanism then it would make
sense
to look at the probability distribution of the data
use as parameter estimates those values that make it most likely to have generated the observed data
In other words: find the most likely explanation for the observed data
This requires that we know the joint probability
distribution of the \(n\) data
points.
That distribution, considered a function of the parameters, is called
the likelihood function.
Thankfully, that is not that difficult when the data points are independent (a random sample)
Suppose we flip a coin 50 times and it lands on “heads” 20 times.
What is the maximum likelihood estimate of the probability that the
coin lands on “heads”?
For each flip we are dealing with a binary (Bernoulli) distribution \[\text{Pr}(Y = y) = \pi^y \times
(1-\pi)^{1-y} \qquad \text{where } y = \left \{\begin{array}{ll} 1 &
\text{if heads} \\ 0 & \text{if tails}\end{array} \right .\]
and \(\pi\) is the probability of
“heads”.
Because the coin flips are independent, the joint probability distribution of all 50 tosses is \[\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\]
The likelihood is the joint distribution given the
data, and is interpreted as a function of the parameters (here, \(\pi\))
The value \(\widehat{\pi}\) that maximizes \(\prod_{i=1}^{50} \pi ^{y_i} \times (1-\pi) ^{1-y_i}\) also maximizes the log of the function: \[\begin{aligned} \ell(\widehat{\pi}; {\bf Y}) &= \log \left ( \prod_{i=1}^{50} \widehat{\pi} ^{y_i} \times (1-\widehat{\pi}) ^{1-y_i} \right ) \\ &= \sum_{i=1}^{50} y_i \log(\widehat{\pi}) + (1-y_i)\log(1-\widehat{\pi})\\ &= \log(\widehat{\pi}) \times 20 + \log(1-\widehat{\pi}) \times 30 \end{aligned}\]
Now take the derivative of \(\ell(\widehat{\pi}; {\bf Y})\) with respect to \(\widehat{\pi}\), set the derivative to zero and solve for \(\widehat{\pi}\). \[\frac{\partial \ell(\widehat{\pi}; {\bf Y})}{\partial \widehat{\pi}} = 20 \frac{1}{\widehat{\pi}} - 30 \frac{1}{1-\widehat{\pi}} \equiv 0\]
The solution is \(\widehat{\pi} = 20/50 = 0.4\).
More generally, the maximum likelihood estimator (MLE) in this case is just the sample proportion of successes, i.e., the sample mean: \(\widehat{\pi}_{ML} = \frac{1}{n}\sum_{i=1}^{n} Y_i\).
Why does this all matter?
MLEs have very desirable properties and we like them a lot; they can be biased, however
You need to make a distributional assumption to derive the MLE.
OLS assumes errors have mean zero and variance \(\sigma^2\).
An MLE assumption for regression could be that the data are normally
distributed (a much stronger assumption).
Certain classes of models are fit by MLE, e.g., generalized linear models (logistic regression, Poisson regression, etc.)
Software might derive MLEs. That means there is an implicit distributional assumption.
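A sketch of this in R, with simulated data: `glm()` with `family = binomial` fits a logistic regression by maximum likelihood, which carries the implicit assumption that the responses are Bernoulli-distributed.

```r
# Logistic regression: a generalized linear model fit by MLE.
set.seed(11)
n <- 300
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))  # true success probabilities
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)   # MLE via iteratively reweighted LS
coef(fit)                              # estimates of (-0.5, 1.5)
logLik(fit)                            # the maximized log-likelihood
```

Nothing in the call mentions a likelihood, yet choosing `family = binomial` is precisely the distributional assumption that defines the MLE being computed.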