Goal 1: How are decision trees used in regression and classification
Goal 2: Why is pruning a tree important to prevent overfitting
Goal 3: Why do ensembles have lower variance
Goal 4: Bagging improves prediction accuracy of trees
Goal 5: Random forests improve prediction accuracy of trees
Tree-based methods can be used to predict a continuous target (regression trees) or to classify a categorical response (classification trees).
The basic idea:
- Divide the predictor space into non-overlapping segments (regions).
- To make a prediction for a given observation, use the mean of the training observations in the region to which it belongs.
- Since the set of splitting rules used to segment the predictor space can be summarized by a tree, such approaches are called decision tree methods.
- These methods are simple and useful for interpretation.
Carving up the predictor space of Years and Hits for the Hitters data into 3 regions to predict log(Salary).
The predicted value is the average of the observations that fall into the region.
This can be represented visually as a decision tree.
For a player with 6 years of experience and 100 hits per year, the predicted salary is \(1{,}000 \times \exp(5.998) \approx \$402{,}622\) (Salary is recorded in thousands of dollars).
Trees consist of nodes and branches.
The nodes are either internal if they split the data or terminal if they appear at the bottom of the tree.
Terminal nodes are also called leaves.
The depth of a tree refers to the longest path from the root/top node to any leaf node in the tree.
Pros
- Easy to interpret, easy to explain, nice visual representation
- The order of splits reflects variable importance
- Easily accommodate qualitative predictors
- Perform well when the relationship between target and predictors is highly nonlinear
Cons
- Often not as accurate as other regression or classification methods
- Easy to overfit, leading to poor test performance
- Sensitive to small changes in the data
We want to predict a response or class Y from inputs \(X_1\), \(X_2\),…\(X_p\). We do this by growing a binary tree.
At each internal node in the tree, we apply a test to one of the inputs, say \(X_i\).
Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree.
Eventually we come to a leaf node, where we make a prediction.
This prediction aggregates or averages all the training data points which reach that leaf.
Supervised prediction method for a
continuous target.
Space generated by predictors \(X_1, \cdots, X_p\) is divided into \(J\) non-overlapping regions \(R_1, \cdots, R_{J}\). \(J\) is the number of terminal nodes.
For input observations falling into \(R_j\), the predicted value is the average of the target variable in that region: \(\widehat{y}_{R_j}\)
The goal is to find regions minimizing the prediction error \[{\tiny \text{SSError} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^2}\]
In practice, one cannot solve that optimization problem.
Instead we follow a greedy approach (locally optimal).
How to choose \(J\) and the regions
\(R_1, \cdots, R_{J}\)?
Building a regression (or classification) tree involves 3 steps
A simple case: one continuous input variable
The tree equivalent of simple linear regression
We know that \(J = 2\): \(R_1 = \{X | X < s\}, R_2 = \{X | X \geq s\}\)
Now we just have to find the split point \(s\) that minimizes the heterogeneity of \(Y\) within \(R_1\) and \(R_2\)
You can do this via numerical optimization or, more simply, an exhaustive search over candidate split points, as in the sketch below
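A minimal sketch of the exhaustive search (my own illustration, not the course's code): candidate split points are the midpoints between consecutive sorted unique values of \(X\), and we keep the one minimizing the within-region SSE.

best_split_1d <- function(x, y) {
  u <- sort(unique(x))
  cand <- (head(u, -1) + tail(u, -1)) / 2          # midpoints between sorted unique values
  sse <- sapply(cand, function(s) {
    left <- y[x < s]; right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cand[which.min(sse)], sse = min(sse))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 1, 3) + rnorm(100, sd = 0.3)  # true split at 0.4
best_split_1d(x, y)                                # recovers a split point near 0.4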
A more complicated case: \(p\) input
predictors
- a very high-dimensional problem
- cannot consider all possible partitions into \(J\) boxes
\(\Rightarrow\) use recursive binary splitting
Goal: Partition the feature space into regions where the response variable is as homogeneous as possible.
Result: A binary tree where each internal node represents a split, and each terminal node (leaf) represents a predicted value (mean for regression, class for classification).
Optimization: The algorithm does not guarantee the globally optimal tree but works well in practice.
Start at the top of the tree
Successively split the predictor (= input) space; each split divides one region into two
The method is greedy because it does not look back or ahead:
it considers only the best split at the current step, not the effect of earlier or future splits in the tree-building process
The process stops when a criterion is reached, e.g., all regions contain fewer than \(m\) observations
Specifically, growing an initial tree \(T_0\):
Define 2 half-planes for each predictor \(X_j\) and split point \(s\): \[R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}\]
Find the predictor \(j\) and split point \(s\) that minimize the sum of squared errors (RSS) in the resulting regions: \[\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y}_{R_2})^2\]
Continue splitting each resulting region until a stopping criterion is met (e.g., minimum node size).
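A self-contained sketch of one greedy step (my own illustration): scan every predictor \(X_j\) and candidate split point \(s\), and return the pair minimizing the RSS criterion above; recursive binary splitting applies this repeatedly to each resulting region.

best_split <- function(X, y) {                     # X: data frame of numeric predictors
  split_sse <- function(x, s) {
    left <- y[x < s]; right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }
  best <- list(var = NA, s = NA, sse = Inf)
  for (j in seq_along(X)) {
    u <- sort(unique(X[[j]]))
    for (s in (head(u, -1) + tail(u, -1)) / 2) {   # candidate split points for X_j
      v <- split_sse(X[[j]], s)
      if (v < best$sse) best <- list(var = names(X)[j], s = s, sse = v)
    }
  }
  best
}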
Maximum Tree Depth
Purpose: Prevents excessive complexity and overfitting
Minimum Number of Observations in a Node
Purpose: Ensures splits rely on enough data
Minimum Number of Observations in a Leaf
Purpose: Prevents leaves that are too specific to a small subset
Minimum Decrease in Impurity
Purpose: Ensures splits are meaningful and improve performance
Maximum Number of Terminal Nodes
Purpose: Controls the overall complexity of the tree
Minimum Improvement in Performance
Purpose: Prevents splits with insignificant improvement
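Most of these criteria map directly onto hyperparameters of common implementations. A hedged sketch using rpart (the parameter values are illustrative; a maximum number of terminal nodes has no direct rpart equivalent, scikit-learn exposes it as max_leaf_nodes):

library(rpart)
library(ISLR)                        # provides the Credit data
ctrl <- rpart.control(
  maxdepth  = 5,                     # maximum tree depth
  minsplit  = 20,                    # min. observations in a node to attempt a split
  minbucket = 7,                     # min. observations in any leaf
  cp        = 0.01                   # min. relative improvement for a split to be kept
)
fit <- rpart(Balance ~ ., data = Credit, control = ctrl)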
The term impurity is mainly used for classification trees (Gini impurity), but you may also encounter it for regression trees.
It refers to the metric used as the splitting criterion: MSE, Gini, entropy, …
Minimizing impurity is equivalent to maximizing information gain, i.e., maximizing the reduction in impurity.
Prediction error (e.g., misclassification rate, MSE on a hold-out set) is what you ultimately want to minimize for the entire tree, but it is not directly used as the splitting criterion during tree construction (too expensive).
Fit a regression tree with the tree() function in the tree library.
library(tree)
library(ISLR)                                     # provides the Credit data
train <- sample(1:nrow(Credit), nrow(Credit)/2)   # training-row indices (half of the data)
credit_tree <- tree(Balance ~ ., data=Credit[train,])
summary(credit_tree)
##
## Regression tree:
## tree(formula = Balance ~ ., data = Credit[train, ])
## Variables actually used in tree construction:
## [1] "Rating" "Limit" "Income" "Student"
## Number of terminal nodes: 10
## Residual mean deviance: 22850 = 4341000 / 190
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -417.60 -63.39 -38.45 0.00 81.73 511.60
Visualize the tree with plot() and add split labels with text().
plot(credit_tree)
text(credit_tree, pretty=0, cex=0.65)
How to interpret the partition?
R Packages Comparison: rpart vs. tree
Trees grown top down tend to overfit and perform poorly on test data
sets.
The strategy is to grow a large tree and to prune it back to a subtree with good test error.
An alternative would be to limit growth from the start by constraining parameters (number of splits, depth, …), but the tree may then stop too early, miss useful splits, and underfit.
Usually, a full tree \(T_0\) has high test error and is too complex (too many leaves)
A subtree (a pruned version of \(T_0\)) might perform better
But we cannot investigate all possible subtrees, so we select a small set of subtrees for consideration
Use cross-validation to select the “best” subtree from the set
The entire process is called cost complexity pruning (also known as weakest link pruning)
Cost complexity pruning
How to choose the sequence of trees to look at for pruning?
Pruning removes parts of the tree that provide little to no additional power in predicting target variables, ultimately resulting in a more robust and manageable model.
For each value of \(\alpha \geq 0\) there is a subtree \(T \subset T_0\) such that \[\sum_{m=1}^{J}\sum_{i \in R_m} \left( y_i - \widehat{y}_{R_m} \right)^2 + \alpha J\] is as small as possible, where \(J\) is the number of terminal nodes of \(T\)
\(\alpha\) is similar to the shrinkage parameter in Ridge and Lasso regression. Cost complexity pruning is a regularization method
\(\alpha = 0 \, \Rightarrow T_0\), we get the full tree
As \(\alpha\) grows the penalty for having many leaves increases
\(\alpha = \infty\), we get the null tree with a single leaf node
Cost complexity pruning generates a series of trees \(T_0\), …,\(T_m\) with \(T_0\) the initial full tree and \(T_m\) the root alone.
At step \(i\), the tree is created by removing a subtree from tree \(i-1\) and replacing it with a leaf node with value chosen as in the tree building algorithm.
Step 1: Train a Full Tree to its maximum possible depth
Step 2: Define a Range of Pruning Parameters
e.g., ccp_alpha in scikit-learn or
cp in rpart
Controls the trade-off between tree complexity and fit
Larger values result in more aggressive pruning
Step 3: Perform Cross-Validation. For each value of \(\alpha\), prune the tree and evaluate its performance using cross-validation
Step 4: Select the Optimal Pruning Parameter. Choose the \(\alpha\) that minimizes the cross-validated error (or maximizes classification accuracy)
Step 5: Prune the Tree. Train the final tree using the optimal \(\alpha\)
Step 6: Evaluate the Pruned Tree on test data
Use cv.tree() and plot the results (deviance or classification error) versus tree size. Here we would choose \(J = 5\).
cv_reg_tree <- cv.tree(credit_tree, K=10)
plot(cv_reg_tree$size, cv_reg_tree$dev, type="b", ylab="Deviance", xlab="No. leaves")
Obtain the pruned tree from the full tree.
tree_pruned <- prune.tree(credit_tree, best=5)
plot(tree_pruned)
text(tree_pruned, pretty=0)
Go back to the full tree and compare
Predict on the test data set and calculate the test error.
pred_tree <- predict(tree_pruned, Credit[-train,])
mean((Credit[-train, "Balance"] - pred_tree)^2)
[1] 65574.69
| Feature | Linear Regression | Tree Regression |
|---|---|---|
| Model Form | Linear equation: \(y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon\) | Hierarchical splits: Recursive partitioning of feature space |
| Assumptions | Linearity, homoscedasticity, independence | No strict assumptions; handles non-linearity naturally |
| Interpretability | High (coefficients directly indicate feature importance) | Moderate (intuitive tree structure, complex deep trees) |
| Handling Non-linearity | Poor (requires feature engineering) | Excellent (automatically captures non-linearity) |
| Outliers | Sensitive (outliers can heavily influence coefficients) | Robust (outliers have limited impact on splits) |
| Feature Scaling | Required (sensitive to scale of features) | Not required (scale-invariant) |
They work pretty much like regression trees, except:
- The response (target) is categorical rather than quantitative
- Instead of the sum of squared errors, we minimize (greedily, as before) a loss function suitable for classification problems
Given regions \(R_1, \cdots, R_{J}\), predict an observation to belong to the most commonly occurring class in the region to which the observation belongs (predict by majority vote).
Let \(\widehat{p}_{mj}\) denote the
proportion of training observations in region \(R_m\) that fall into class \(j\)
Misclassification error: \(\displaystyle E_{m} = 1 - \max_j(\widehat{p}_{mj})\)
Gini Index: measures node purity \({\small G_{m} = \displaystyle \sum_{j=1}^k \widehat{p}_{mj}(1-\widehat{p}_{mj})}\)
Entropy: \({\small \displaystyle D_{m} = -\sum_{j=1}^{k} \widehat{p}_{mj} \log(\widehat{p}_{mj})}\)
Deviance (cross-entropy): \({\small \displaystyle -2 \sum_{m=1}^{J} \sum_{j=1}^k n_{mj} \log(\widehat{p}_{mj})}\)
Small values of these loss statistics indicate a good fit
\(p(1-p)\) is largest for \(p = 0.5\) and small as \(p \rightarrow 0\) or \(p \rightarrow 1\)
The Gini index therefore measures how pure the nodes are: a small value indicates that observations are mostly from the same class
Similarly, entropy will be near zero as \(p \rightarrow 0\) or \(p \rightarrow 1\), as the short sketch below illustrates
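A tiny R sketch of these statistics as functions of the class-proportion vector \(\widehat{p}_{m}\) (my own illustration):

gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # treat 0*log(0) as 0
miscl   <- function(p) 1 - max(p)

p <- c(0.9, 0.1)                     # a fairly pure node
c(gini(p), entropy(p), miscl(p))     # 0.18, 0.33, 0.10
p <- c(0.5, 0.5)                     # maximally impure two-class node
c(gini(p), entropy(p), miscl(p))     # 0.50, 0.69, 0.50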
Typical application:
Growing trees: Gini index, entropy, or deviance
Pruning trees: misclassification rate
Let's look at the code in R.
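For example, a classification tree on the Carseats data from the ISLR package (a standard example; the binary target High is constructed here for the illustration):

library(ISLR)
library(tree)
Carseats$High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))  # binary target
carseats_tree <- tree(High ~ . - Sales, data=Carseats)            # classification tree
summary(carseats_tree)        # misclassification rate, number of leaves
plot(carseats_tree)
text(carseats_tree, pretty=0, cex=0.6)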
An ensemble method combines many simple building blocks to obtain a single model.
Ensemble methods are designed to improve the stability and accuracy of machine learning models.
A weak learner is a model that does not perform well on its own, sometimes barely better than random, but performs well in an ensemble.
Bagging is short for Bootstrap Aggregating. It
- is a general technique and can be applied to any learning method
- is based on the idea of the bootstrap: resample from the data set with replacement
- can reduce the variability of a statistical learning method by averaging across weak learners; averaging \(B\) independent predictions, each with variance \(\sigma^2\), yields variance \(\sigma^2/B\)
How it works:
Randomly sample the training data with replacement (bootstrap samples).
Train a separate model (e.g., decision tree) on each sample.
Combine the predictions of all models (e.g., by averaging for regression or voting for classification).
Why use it?
Reduces variance and prevents overfitting.
Works well with high-variance models like deep decision trees.
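A quick simulation of the variance-reduction argument above (my own sketch): the average of \(B\) independent predictions, each with variance \(\sigma^2 = 1\), has variance \(\sigma^2/B\).

set.seed(42)
B <- 100
avg <- replicate(10000, mean(rnorm(B)))  # 10,000 averages of B independent "predictions"
var(avg)                                 # close to 1/B = 0.01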
You have a data set of size \(n\)
Draw \(B\) samples of size \(n\) with replacement
\(\Rightarrow\) the bootstrap
samples
Grow a deep decision tree, without pruning, on each of the \(B\) bootstrap samples.
Each tree will have high variance but low bias.
Prediction
Regression trees: predicted value of target is average
of predictions from \(B\) trees
Classification trees: predicted class by taking
majority vote of classifications from \(B\) trees
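A minimal hand-rolled version for the Credit example (a sketch assuming the tree package, the Credit data, and the train index from earlier; in practice, randomForest with mtry = p does this for you):

set.seed(1)
B <- 100
preds <- replicate(B, {
  boot <- sample(train, replace=TRUE)               # bootstrap sample of training rows
  fit  <- tree(Balance ~ ., data=Credit[boot,])     # unpruned tree (default controls)
  predict(fit, newdata=Credit[-train,])             # predictions on the test set
})
bag_pred <- rowMeans(preds)                         # average across the B trees
mean((Credit[-train, "Balance"] - bag_pred)^2)      # bagged test MSE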
Bagging provides a slick way to estimate the test error!
A bootstrap sample includes about 2/3 of the observations.
Any single bootstrap draw equals the \(j\)th observation with probability \(1/n\), because sampling is with replacement \(\Rightarrow\) a single draw is not the \(j\)th observation with probability \(1 - 1/n\).
For a bootstrap sample of size \(n\), the probability that a particular observation appears nowhere in the sample is therefore \((1 - 1/n)^n \rightarrow e^{-1} \approx 0.368\) as \(n \rightarrow \infty\), so roughly one third of the observations are left out. These out-of-bag (OOB) observations form a built-in hold-out set.
Let’s look at this in R
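A direct computation (my own snippet):

n <- c(10, 100, 1000, 1e6)
round((1 - 1/n)^n, 4)   # probability a given observation is left out: 0.3487 0.3660 0.3677 0.3679
exp(-1)                 # the limit, ~0.3679, so about 2/3 of observations are included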
Pros
Improves accuracy by averaging across bootstrap samples
Built-in holdout sample (OOB) allows estimation of test error
Cons
Loss of interpretability, results not represented by one tree
Trees can be highly correlated
In the presence of strong predictors, most trees use them in the top splits \(\Rightarrow\) the bagged trees all look similar
Averaging would be more effective if the trees were less correlated
Improvement over bagged trees
Trees are decorrelated
At each split, consider only a subset \(m\) of the \(p\) predictors
Choose the \(m\) inputs randomly at each split
\(m \sim \sqrt{p} \, \Rightarrow\) not all inputs considered at each split
Pros
Averaging decorrelated trees is more effective in reducing variability
Strong predictors are not allowed to dominate most trees
Other predictors are getting a chance
A Random Forest with \(m = p\) equals Bagging
With a large number of correlated inputs, choose a small value for \(m\)
As with Bagging, increasing \(B\) does not overfit
Choose \(B\) large enough for the error rate to settle down.
Use the randomForest() function in the randomForest package in R. Note that \(p = 12\). Choosing mtry = 12 results in a bagged tree.
library(randomForest)
library(ISLR2)      # Boston data with 12 predictors (this version ships with ISLR2)
data(Boston)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
bag.boston <- randomForest(medv ~ .,
                           data=Boston[train,],
                           mtry=12,          # m = p -> Bagging
                           importance=TRUE)
Use the randomForest() function in the
randomForest package in R. Note that \(p=12\). Choosing mtry \(< p\) results in a Random Forest.
rf.boston <- randomForest(medv ~ .,
                          data=Boston[train,],
                          mtry=6,            # m < p -> Random Forest
                          importance=TRUE)
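To check that \(B\) is large enough for the error rate to settle down (see the point above), the plot method for a randomForest object shows the OOB error as a function of the number of trees:

plot(rf.boston)   # OOB MSE vs. number of trees; the curve flattens once B is large enough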
Variable importance in bagged trees and random forests
importance(bag.boston)
%IncMSE IncNodePurity
crim 26.608592 876.066482
zn 5.148020 64.377478
indus 5.087023 106.219327
chas -1.389269 9.968199
nox 19.760345 246.301715
rm 53.614293 12080.891286
age 15.776943 334.643934
dis 6.455965 249.905169
rad 1.067095 73.142867
tax 9.395352 153.639752
ptratio 9.254806 145.370983
lstat 50.662598 4838.203779
The varImpPlot() function produces a plot of the
variable importance measures.
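For the random forest above:

varImpPlot(rf.boston)   # plots %IncMSE and IncNodePurity for each predictor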