Goal 1: How are decision trees used in regression and classification
Goal 2: Why is pruning a tree important to prevent overfitting
Goal 3: Why do ensembles have lower variance
Goal 3: Bagging improves prediction accuracy of trees
Goal 4: Random forests improve prediction accuracy of trees
Tree-based methods can be used for predicting a
continuous target (regression trees) or to classify a
categorical response (classification trees).
The basic idea
divide the predictor space into non-overlapping segments (regions)
to make a prediction for a given observation, we use the mean of the training data in the region to which it belongs
since the set of splitting rules to segment the predictor space can be summarized by a tree such approaches are called decision tree methods.
these methods are simple and useful for interpretation.
Carving up the predictor space Years and
Hits for the Hitters data into 3 regions to
predict log(Salary).
The predicted value is the average of the observations that fall into the regions.
This can be represented visually as a decision tree.
For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).
Trees consist of nodes and branches.
The nodes are either internal if they split the data or terminal if they appear at the bottom of the tree.
Terminal nodes are also called leaves.
The depth of a tree refers to the longest path from the root/top node to any leaf node in the tree.
Pros
Easy to interpret, easy to explain, nice visual representation
Order of splits reflects variable importance
Easily accommodate qualitative predictors
Performs well when relationship between target and predictors highly nonlinear
Cons
Trees often not as accurate as regression or classification methods
Easy to overfit, poor test performance
Sensitive to small changes in the data
We want to predict a response or class Y from inputs \(X_1\), \(X_2\),…\(X_p\). We do this by growing a binary tree.
At each internal node in the tree, we apply a test to one of the inputs, say \(X_i\).
Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree.
Eventually we come to a leaf node, where we make a prediction.
This prediction aggregates or averages all the training data points which reach that leaf.
Supervised prediction method for a
continuous target.
Space generated by predictors \(X_1, \cdots, X_p\) is divided into \(J\) non-overlapping regions \(R_1, \cdots, R_{J}\). \(J\) is the number of terminal nodes.
For input observations falling into \(R_j\), the predicted value is the average of the target variable in that region: \(\widehat{y}_{R_j}\)
The goal is to find regions minimizing the prediction error \[\text{SSError} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^2\]
In practice, one cannot solve that optimization problem.
Instead we follow a greedy approach (locally optimal).
How to choose \(J\) and the regions
\(R_1, \cdots, R_{J}\)?
Building a regression (or classification) tree involves 3 steps
A simple case: 1 continuous input variable
The tree equivalent to simple linear regression
We know that \(J = 2\): \(R_1 = \{X | X < s\}, R_2 = \{X | X \geq s\}\)
Now we just have to find the split point \(s\) that minimizes the heterogeneity of \(Y\) within \(R_1\) and \(R_2\)
You can do it via optimization, binary search, etc …
A more complicated case: \(p\) input
predictors
- a very high-dimensional problem
- cannot consider all possible partitions into \(J\) boxes
\(\Rightarrow\) use recursive binary splitting
Goal: Partition the feature space into regions where the response variable is as homogeneous as possible.
Result: A binary tree where each internal node represents a split, and each terminal node (leaf) represents a predicted value (mean for regression, class for classification).
Optimization: The algorithm does not guarantee the globally optimal tree but works well in practice.
Start at the top of the tree
Successively split the predictor (=input) space. Each split
Method is greedy because it does not look
back/ahead
It considers only the best split at this step rather than earlier/future
splits later in the tree-building process
The process stops when a criterion is reached, e.g., all regions contain less than \(m\) obs
Specifically, growing an initial tree \(T_0\):
Define 2 half-planes for each predictor \(X_j\) and split point \(s\): \[R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}\]
Find the predictor \(j\) and split point \(s\) that minimize the sum of squared errors (RSS) in the resulting regions: \[\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y}_{R_2})^2\]
Continue splitting each resulting region until a stopping criterion is met (e.g., minimum node size).
Maximum Tree Depth
Purpose: Prevents complexity and overfitting
Minimum Number of Observations in a Node
Purpose: Ensures splits rely on enough data
Minimum Number of Observations in a Leaf
Purpose: Prevents leaves too specific to small subset
Minimum Decrease in Impurity
Purpose: Ensures meaningful splits, improve performance
Maximum Number of Terminal Nodes
Purpose: Controls the overall complexity of the tree
Minimum Improvement in Performance
Purpose: Prevents splits with insignificantly improvement
That term is mainly used in classification tree (Gini impurity), but you may encounter the term for regression trees.
It refers to the metric used as splitting criterion: MSE, Gini, entropy, …
Impurity is equivalent to maximizing information gain or maximizing the reduction in impurity.
Prediction error (e.g., misclassification rate, MSE on a hold-out set) is what you ultimately want to minimize for the entire tree, but it’s not directly used as the splitting criterion during tree construction (too expensive).
RFit a regression tree with the tree() function in the
tree library.
library(tree)
cred.tree <- tree(Balance ~ ., data=Credit[train,])
summary(credit_tree)
##
## Regression tree:
## tree(formula = Balance ~ ., data = Credit[train, ])
## Variables actually used in tree construction:
## [1] "Rating" "Limit" "Income" "Student"
## Number of terminal nodes: 10
## Residual mean deviance: 17930 = 3406000 / 190
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -338.20 -60.45 -34.37 0.00 66.37 481.90
RVisualize the tree with plot and add split labels with
text
plot(credit_tree)
text(credit_tree, pretty=0, cex=0.65)
How to interpret the partition?
This can be represented visually as a decision tree.
For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).
R Packages Comparisionrpart VS tree
Trees grown top down tend to overfit and perform poorly on test data
sets.
The strategy is to grow a large tree and to prune it back to a subtree with good test error.
An alternative could have been to prevent a tree from growing from the beginning by constraining parameters (splits, depth, …) but the tree may stop too early, missing useful splits and then underfits.
Usually, a full tree \(T_0\) has high test error and is too complex (too many leaves)
A subtree (a pruned version of \(T_0\)) might perform better
But we cannot investigate all possible subtrees, so we select a small set of subtrees for considerations
Use cross-validation to select the “best” subtree from the set
The entire process is called cost complexity pruning (also known as weakest link pruning)
Cost complexity pruning
How to choose the sequence of trees to look at for pruning?
Pruning removes parts of the tree that provide little to no additional power in predicting target variables, ultimately resulting in a more robust and manageable model.
For each value \(\alpha\) there is a tree \(T \subset T_0\) such that \[\sum_{m=1}^{J}\sum_{i \in R_m} \left( y_i - \widehat{y}_{R_m} \right)^2 + \alpha J\] is as small as possible
\(\alpha\) is similar to the shrinkage parameter in Ridge and Lasso regression. Cost complexity pruning is a regularization method
\(\alpha = 0 \, \Rightarrow T_0\), we get the full tree
As \(\alpha\) grows the penalty for having many leaves increases
\(\alpha = \infty\), we get the null tree with a single leaf node
Cost complexity pruning generates a series of trees \(T_0\), …,\(T_m\) with \(T_0\) the initial full tree and \(T_m\) the root alone.
At step \(i\), the tree is created by removing a subtree from tree \(i-1\) and replacing it with a leaf node with value chosen as in the tree building algorithm.
Step 1: Train a Full Tree to its maximum possible depth
Step 2: Define a Range of Pruning Parameters
e.g., ccp_alpha in scikit-learn or
cp in rpart
Controls the trade-off between tree complexity and fit
Larger values result in more aggressive pruning
Step 3: Perform Cross-Validation. For each value of \(\alpha\), prune the tree and evaluate its performance using cross-validation
Step 4: Select the Optimal Pruning Parameter. Choose \(\alpha\) that minimizes the cross-validated error (or maximizes classification accuracy
Step 5: Prune the Tree. Train the final tree using the optimal \(\alpha\)
Step 6: Evaluate the Pruned Tree on test data
RUse cv.tree() and plot the results (deviance or
classification error) versus tree size. Here we would choose \(J= 5\).
cv_reg_tree <- cv.tree(credit_tree, K=10)
plot(cv_reg_tree$size, cv_reg_tree$dev, type="b", ylab="Deviance", xlab="No. leaves")
Obtain the pruned tree from the full tree.
tree_pruned <- prune.tree(credit_tree, best=5)
plot(tree_pruned)
text(tree_pruned, pretty=0)
Go back to the full tree and compare
Predict on the test data set and calculate the test error.
pred_tree <- predict(tree_pruned,Credit[-train,])
mean( (Credit[-train,"Balance"] - pred_tree)^2)
[1] 65574.69
| Feature | Linear Regression | Tree Regression |
|---|---|---|
| Model Form | Linear equation: \(y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon\) | Hierarchical splits: Recursive partitioning of feature space |
| Assumptions | Linearity, homoscedasticity, independence | No strict assumptions; handles non-linearity naturally |
| Interpretability | High (coefficients directly indicate feature importance) | Moderate (intuitive tree structure, complex deep trees) |
| Handling Non-linearity | Poor (requires feature engineering) | Excellent (automatically captures non-linearity) |
| Outliers | Sensitive (outliers can heavily influence coefficients) | Robust (outliers have limited impact on splits) |
| Feature Scaling | Required (sensitive to scale of features) | Not required (scale-invariant) |
They work pretty much like regression trees,
except
The response (target) is categorical rather than quantitative
Instead of SSError we minimize (locally and globally) a loss function suitable for classification problems
Given regions \(R_1, \cdots, R_{J}\), predict an observation to belong to the most commonly occurring class in the region to which the observation belongs (predict by majority vote).
Let \(\widehat{p}_{mj}\) denote the
proportion of training observations in region \(R_m\) that fall into class \(j\)
Misclassification error: \(\displaystyle E_{m} = 1 - \max_j(\widehat{p}_{mj})\)
Gini Index: measures node purity \({\small G_{m} = \displaystyle \sum_{j=1}^k \widehat{p}_{mj}(1-\widehat{p}_{mj})}\)
Entropy: \({\small \displaystyle D_{m} = -1 \sum_{j=1}^{k} \widehat{p}_{mj} \text{log}(\widehat{p}_{mj})}\)
Deviance (cross-entropy): \({\small \displaystyle -2 \sum_{m=1}^{J} \sum_{j=1}^k n_{mj} \text{log}(\widehat{p}_{mj})}\)
Small values for these loss statistics indicates a good fit
\(p(1-p)\) is largest for \(p=0.5\)
and small as \(p \rightarrow 0\) or
\(p \rightarrow 1\)
Gini measures how pure the nodes are,
a small value indicates observations are mostly from the same
class
Similarly, entropy will be near zero as \(p \rightarrow 0\) or \(p \rightarrow 1\)
Typical application:
Growing trees: Gini index, entropy, or deviance
Pruning trees: misclassification rate
RLet’s look at codes
An ensemble method is an approach that combines many simple building blocks to obtain a single model.
An ensemble method designed to improve the stability and accuracy of machine learning models.
A weak learner is a model that does not perform well on its own, sometimes barely better than random, but performs well in an ensemble.
Bagging is short for Bootstrap Aggregating
is a general technique and can be applied to any learning method
is based on the idea of the bootstrap, to resample from the data set with replacement
the bootstrap can reduce the variability of a statistical learning method by averaging across weak learners
How it works:
Randomly sample the training data with replacement (bootstrap samples).
Train a separate model (e.g., decision tree) on each sample.
Combine the predictions of all models (e.g., by averaging for regression or voting for classification).
Why use it?
Reduces variance and prevents overfitting.
Works well with high-variance models like deep decision trees.
You have a data set of size \(n\)
Draw \(B\) samples of size \(n\) with replacement
\(\Rightarrow\) the bootstrap
samples
Grow decision trees deep without pruning to each
of the \(B\) bootstrap samples.
Each tree will have high variance but low bias.
Prediction
Regression trees: predicted value of target is average
of predictions from \(B\) trees
Classification trees: predicted class by taking
majority vote of classifications from \(B\) trees
Bagging provides a slick way to estimate the test error!
A bootstrap sample includes about 2/3 of the observations.
The probability that the \(n\)th bootstrap observation is the \(j\)th observation in the data frame is \(1/n\), because sampling is with replacement. \(\Rightarrow\) The probability that the \(n\)th bootstrap observation is not the \(j\)th observation is \(1-1/n\).
For a bootstrap sample of size \(n\), the probability that a particular
observation is not in the bootstrap sample is \((1 - 1/n)^n\)
Let’s look at this in R
Pros
Improves accuracy by averaging across bootstrap samples
Built-in holdout sample (OOB) allows estimation of test error
Cons
Loss of interpretability, results not represented by one tree
Trees can be highly correlated
In the presence of strong predictors these are used in most trees in the top split \(\Rightarrow\) the bagged trees all look similar
Averaging would be more effective if the trees were less correlated
Improvement over bagged trees
Trees are decorrelated
At each split, consider only a subset \(m\) of the \(p\) predictors
Choose the \(m\) inputs randomly at each split
Usually \(m \sim \sqrt{p}\) in classification problems and \(m \sim \frac{p}{3}\) in regression problems \(\Rightarrow\) not all inputs considered at each split
Pros
Averaging decorrelated trees is more effective in reducing variability
Strong predictors are not allowed to dominate most trees
Other predictors are getting a chance
A Random Forest with \(m = p\) equals Bagging
With a large number of correlated inputs, choose a small value for \(m\)
As with Bagging, increasing \(B\) does not overfit
Choose \(B\) large enough for the error rate to settle down.
RUse the randomForest() function in the
randomForest package in R. Note that \(p = 12\). Choosing mtry = 12
results in a bagged tree.
data(Boston)
train <- sample (1: nrow(Boston), nrow(Boston) / 2)
bag.boston <- randomForest(medv ~ . ,
data =Boston[train,],
mtry =12, # m = p -> Bagging
importance=TRUE)
Use the randomForest() function in the
randomForest package in R. Note that \(p=12\). Choosing mtry \(< p\) results in a Random Forest.
rf.boston <- randomForest(medv ~ . ,
data =Boston[train,],
mtry =6, # m < p -> Random Forest
importance=TRUE)
Variable importance in bagged trees and random forests
importance(bag.boston)
%IncMSE IncNodePurity
crim 26.608592 876.066482
zn 5.148020 64.377478
indus 5.087023 106.219327
chas -1.389269 9.968199
nox 19.760345 246.301715
rm 53.614293 12080.891286
age 15.776943 334.643934
dis 6.455965 249.905169
rad 1.067095 73.142867
tax 9.395352 153.639752
ptratio 9.254806 145.370983
lstat 50.662598 4838.203779
The varImpPlot() function produces a plot of the
variable importance measures.
Boosting is an ensemble ensemble learning method that combines a set of less accurate models (called “weak learners” i.e. only slightly better than random guessing) to create a single, highly accurate model (“strong learner”).
Unlike other ensemble methods that build models in parallel (e.g. bagging or rf), boosting algorithms build models sequentially.
Each new model in the sequence is trained to correct the errors made by its predecessors.
This iterative process allows the overall model to improve its accuracy, particularly by reducing bias.
Boosting is a popular and effective technique used in supervised learning for both classification and regression tasks.
AdaBoost is regarded as the first practical boosting algorithm.
It was designed for a binary classification problem.
Sequential Learning: Each new model focuses more on the observations that were misclassified by the previous models.
Adaptive Weighting: After each iteration, the weights of misclassified observations are increased, so the next model pays more attention to them. Correctly classified observations have their weights decreased.
Final Prediction: The predictions from all weak learners are combined using a weighted vote, where each model’s vote is weighted by its accuracy.
Initialize weights: equal weights to all training observations (e.g., \(w_i = \frac{1}{N}\))
Train weak learners: For each iteration \(m=1,2,...,M\)
Train a weak learner \(h_m(x)\) on the
weighted data
Compute weighted error rate \(\epsilon_m =
\sum_{i=1}^N w_i \cdot I(y_i \neq h_m(x_i))\)
Update weights:
Compute the model’s weight \(\alpha_m =
\frac{1}{2} \ln\left(\frac{1 -
\epsilon_m}{\epsilon_m}\right)\)
Update observation weights: \(w_i \leftarrow
w_i \cdot \exp(\alpha_m \cdot I(y_i \neq h_m(x_i)))\)
Normalize the weights so they sum to 1
Final Model: \(\displaystyle H(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right)\)
Focus on errors: AdaBoost is particularly effective because it forces subsequent models to focus on difficult cases, reducing both bias and variance.
Versatility: AdaBoost can work with any weak learner (e.g., linear classifiers, SVMs).
Theoretical guarantees: Under certain conditions, AdaBoost can drive the training error to zero as the number of weak learners increases.
Limitations:
Sensitive to noisy data and outliers (since it focuses on misclassified
points).
Can overfit if the number of weak learners is too large or if they are
too complex.
Researchers realized that AdaBoost could be viewed as a functional gradient descent process, optimizing a loss function.
AdaBoost was shown to minimize an exponential loss.
The framework could be generalized to other loss functions (e.g., MSE, logistic loss).
Gradient boosting generalizes boosting as an
optimization problem.
Instead of focusing on misclassified points, it fits new models to the
gradient of the loss function (e.g., residuals for regression).
Core Idea: Builds models sequentially, where each new model corrects the errors of the previous one by optimizing a loss function using gradient descent.
Key Features:
Works with any differentiable loss function (e.g., MSE, log-loss). Can
use any weak learner, but decision trees are most common.
Highly flexible and often achieves state-of-the-art performance.
Limitations: Computationally intensive; requires careful tuning of hyperparameters (e.g., learning rate, tree depth).
Popular Implementations:
XGBoost (Extreme Gradient Boosting) (2014): Optimized for speed and
performance.
LightGBM (2017): Designed for distributed and efficient training.
CatBoost (2017): Handles categorical features natively.
Loss function: \(L(y, F(x))\) (e.g.,
squared error for regression)
Number of iterations (trees): M
Start with a constant model \(F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\) e.g. for squared error: \(F_0(x) = \frac{1}{n}\sum_{i=1}^n y_i\)
Iterative training, for m=1, …,M:
Final model \(F(x) = F_M(x)\)
Fitting the training set too closely can lead to degradation of the model’s generalization ability.
One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of base models) to balance over- and under-fitting.
Shrinkage: An important part of gradient boosting is
regularization by shrinkage which uses a modified update rule: \(F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma
_{m}h_{m}(x),\quad 0<\nu \leq 1,\)
where parameter \(\nu \in [0,1]\) is
called the ``learning rate”.
Empirically, it has been found that small learning rates yields dramatic improvements in models’ generalization ability over gradient boosting without shrinking (\(\nu =1\)).
However, it comes at the price of increasing computational time both during training and querying: lower learning rate requires more iterations.
Key Parameters
Learning rate \(\nu\): Controls the contribution of each tree.
Number of trees \(M\): More trees can improve performance but risk overfitting.
Tree complexity: Depth of the weak learners.
Each new tree is trained to predict the pseudo-residuals (negative gradients of the loss function)
Each tree focuses on correcting the errors of the current model
GBTs often outperform other ensemble methods (like random forests) on structured data
GBT can optimize any differentiable loss function (regression,…)
Shallow trees (weak learners) and shrinkage (learning rate) helps control overfitting, especially with early stopping
GBTs provide feature importance scores, helping feature selection and interpretability
GBT can be combined with techniques like class weighting or custom loss functions to handle imbalanced datasets
Unlike gradient boosting that works as gradient descent in function space
XGBoost uses a second order Taylor approximation in the loss function
Interpretability is often sacrificed in these boosting methods
Initialize model with a constant value: \({\widehat {f}}_{(0)}(x)={\underset {\theta }{\arg \min }}\sum _{i=1}^{N}L(y_{i},\theta)\)
For m = 1 to M:
Compute the ‘gradients’ and ‘hessians’ \[{\widehat {g}}_{m}(x_{i})=\left[{\frac {\partial L(y_{i},f(x_{i}))}{\partial f(x_{i})}}\right]_{f(x)={\widehat {f}}_{(m-1)}(x)} \mbox{ and }\quad {\widehat {h}}_{m}(x_{i})=\left[{\frac {\partial ^{2}L(y_{i},f(x_{i}))}{\partial f(x_{i})^{2}}}\right]_{f(x)={\widehat {f}}_{(m-1)}(x)}\]
Fit a base learner (or weak learner, e.g. tree) using the training set \(\left\{x_{i},{\dfrac {{\widehat {g}}_{m}(x_{i})}{{\widehat {h}}_{m}(x_{i})}}\right\}_{i=1}^{N}\) by solving the optimization problem below: \({\widehat {\phi }}_{m}={\underset {\phi \in \mathbf {\Phi } }{\arg \min }}\sum _{i=1}^{N}{\frac {1}{2}}{\widehat {h}}_{m}(x_{i})\left[\phi (x_{i})-{\frac {{\widehat {g}}_{m}(x_{i})}{{\widehat {h}}_{m}(x_{i})}}\right]^{2}\)
Update the model: \({\widehat{f}}_{(m)}(x)={\widehat {f}}_{(m-1)}(x)-\nu {\widehat {\phi }}_{m}(x)\)
Output \({\widehat {f}}(x)={\widehat {f}}_{(M)}(x)=\sum _{m=0}^{M}{\widehat {f}}_{m}(x)\)
Let’s compare these methods in R
Gradient boosting is a general algorithmic framework
for building predictive models.
It works by sequentially adding weak learners, each of which corrects
the errors of the previous ensemble.
The “weak learner” can be any model (e.g., linear regression, logistic regression, decision trees, etc.), as long as you can fit it to the pseudo-residuals (negative gradients of the loss function).
When people say “gradient boosting with trees,” they mean that the weak learners are decision trees (usually shallow trees, like stumps or trees with 3–8 leaves).
The most common implementation because trees are:
Flexible: can model non-linear relationships and interactions)
Robust to outliers and missing values Require little feature
engineering.
Examples: gbm, xgboost,
lightgbm, and catboost all use trees as weak
learners.