Learning Goals

Goal 1: How are decision trees used in regression and classification
Goal 2: Why is pruning a tree important to prevent overfitting
Goal 3: Why do ensembles have lower variance
Goal 3: Bagging improves prediction accuracy of trees
Goal 4: Random forests improve prediction accuracy of trees

Introduction

Definitions

Tree-based methods can be used for predicting a continuous target (regression trees) or to classify a categorical response (classification trees).

The basic idea

divide the predictor space into non-overlapping segments (regions)
to make a prediction for a given observation, we use the mean of the training data in the region to which it belongs
since the set of splitting rules to segment the predictor space can be summarized by a tree such approaches are called decision tree methods.
these methods are simple and useful for interpretation.

Example

Carving up the predictor space Years and Hits for the Hitters data into 3 regions to predict log(Salary).

The predicted value is the average of the observations that fall into the regions.

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is $\exp(5.998) = \$402,622$.

Decisions Trees

Trees consist of nodes and branches.

The nodes are either internal if they split the data or terminal if they appear at the bottom of the tree.

Terminal nodes are also called leaves.

The depth of a tree refers to the longest path from the root/top node to any leaf node in the tree.

Pros and Cons of Trees

Pros

Easy to interpret, easy to explain, nice visual representation
Order of splits reflects variable importance
Easily accommodate qualitative predictors
Performs well when relationship between target and predictors highly nonlinear

Cons

Trees often not as accurate as regression or classification methods
Easy to overfit, poor test performance
Sensitive to small changes in the data

Regression Trees

Basics of Regression Trees

We want to predict a response or class Y from inputs $X_1$, $X_2$,…$X_p$. We do this by growing a binary tree.
At each internal node in the tree, we apply a test to one of the inputs, say $X_i$.
Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree.
Eventually we come to a leaf node, where we make a prediction.
This prediction aggregates or averages all the training data points which reach that leaf.

Regression Tree

Supervised prediction method for a continuous target.
Space generated by predictors $X_1, \cdots, X_p$ is divided into $J$ non-overlapping regions $R_1, \cdots, R_{J}$. $J$ is the number of terminal nodes.
For input observations falling into $R_j$, the predicted value is the average of the target variable in that region: $\widehat{y}_{R_j}$
The goal is to find regions minimizing the prediction error \[{\tiny \text{SSError} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^2}\]
In practice, one cannot solve that optimization problem.
Instead we follow a greedy approach (locally optimal).

How to choose $J$ and the regions $R_1, \cdots, R_{J}$?

Building a regression (or classification) tree involves 3 steps

Growing the tree
- top-down greedy algorithm: recursive binary splitting
Pruning the tree
- obtain a subtree by cost complexity pruning
- often uses cross-validation to select tree complexity
Predicting (or classifying) with the (pruned) tree
- pass observations through the tree and find their terminal node

Growing a Tree

A simple case: 1 continuous input variable

The tree equivalent to simple linear regression

We know that $J = 2$: $R_1 = \{X | X < s\}, R_2 = \{X | X \geq s\}$
Now we just have to find the split point $s$ that minimizes the heterogeneity of $Y$ within $R_1$ and $R_2$
You can do it via optimization, binary search, etc …

A more complicated case: $p$ input predictors
- a very high-dimensional problem
- cannot consider all possible partitions into $J$ boxes

$\Rightarrow$ use recursive binary splitting

Goal: Partition the feature space into regions where the response variable is as homogeneous as possible.

Result: A binary tree where each internal node represents a split, and each terminal node (leaf) represents a predicted value (mean for regression, class for classification).

Optimization: The algorithm does not guarantee the globally optimal tree but works well in practice.

Recursive Binary Splitting

Start at the top of the tree
Successively split the predictor (=input) space. Each split
- looks for the input $X_i$ and split $s$ to minimize heterogeneity/impurity of $Y$ within regions
- considers only the data in the current sub-tree
- leads to two new branches
Method is greedy because it does not look back/ahead
It considers only the best split at this step rather than earlier/future splits later in the tree-building process
The process stops when a criterion is reached, e.g., all regions contain less than $m$ obs

Specifically, growing an initial tree $T_0$:

Define 2 half-planes for each predictor $X_j$ and split point $s$: \[R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}\]
Find the predictor $j$ and split point $s$ that minimize the sum of squared errors (RSS) in the resulting regions: \[\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y}_{R_2})^2\]
Continue splitting each resulting region until a stopping criterion is met (e.g., minimum node size).

Some Stopping Criteria

Maximum Tree Depth
Purpose: Prevents complexity and overfitting
Minimum Number of Observations in a Node
Purpose: Ensures splits rely on enough data
Minimum Number of Observations in a Leaf
Purpose: Prevents leaves too specific to small subset
Minimum Decrease in Impurity
Purpose: Ensures meaningful splits, improve performance
Maximum Number of Terminal Nodes
Purpose: Controls the overall complexity of the tree
Minimum Improvement in Performance
Purpose: Prevents splits with insignificantly improvement

Impurity VS Prediction Error

That term is mainly used in classification tree (Gini impurity), but you may encounter the term for regression trees.

It refers to the metric used as splitting criterion: MSE, Gini, entropy, …

Impurity is equivalent to maximizing information gain or maximizing the reduction in impurity.

Prediction error (e.g., misclassification rate, MSE on a hold-out set) is what you ultimately want to minimize for the entire tree, but it’s not directly used as the splitting criterion during tree construction (too expensive).

Regression Trees in `R`

Fit a regression tree with the tree() function in the tree library.

library(tree)
cred.tree <- tree(Balance ~ ., data=Credit[train,])
summary(credit_tree)

## 
## Regression tree:
## tree(formula = Balance ~ ., data = Credit[train, ])
## Variables actually used in tree construction:
## [1] "Rating"  "Limit"   "Income"  "Student"
## Number of terminal nodes:  10 
## Residual mean deviance:  22850 = 4341000 / 190 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -417.60  -63.39  -38.45    0.00   81.73  511.60

Visualizing Trees in `R`

Visualize the tree with plot and add split labels with text

plot(credit_tree) text(credit_tree, pretty=0, cex=0.65)

Returning Example

How to interpret the partition?

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is $\exp(5.998) = \$402,622$.

`R` Packages Comparision

rpart VS tree

Pruning

Pruning a Regression Tree

Trees grown top down tend to overfit and perform poorly on test data sets.

The strategy is to grow a large tree and to prune it back to a subtree with good test error.

An alternative could have been to prevent a tree from growing from the beginning by constraining parameters (splits, depth, …) but the tree may stop too early, missing useful splits and then underfits.

Usually, a full tree $T_0$ has high test error and is too complex (too many leaves)
A subtree (a pruned version of $T_0$) might perform better
But we cannot investigate all possible subtrees, so we select a small set of subtrees for considerations
Use cross-validation to select the “best” subtree from the set
The entire process is called cost complexity pruning (also known as weakest link pruning)

Cost complexity pruning

How to choose the sequence of trees to look at for pruning?
Pruning removes parts of the tree that provide little to no additional power in predicting target variables, ultimately resulting in a more robust and manageable model.
For each value $\alpha$ there is a tree $T \subset T_0$ such that \[\sum_{m=1}^{J}\sum_{i \in R_m} \left( y_i - \widehat{y}_{R_m} \right)^2 + \alpha J\] is as small as possible
$\alpha$ is similar to the shrinkage parameter in Ridge and Lasso regression. Cost complexity pruning is a regularization method
$\alpha = 0 \, \Rightarrow T_0$, we get the full tree
As $\alpha$ grows the penalty for having many leaves increases
$\alpha = \infty$, we get the null tree with a single leaf node

Pruning Algorithm

Cost complexity pruning generates a series of trees $T_0$, …,$T_m$ with $T_0$ the initial full tree and $T_m$ the root alone.

At step $i$, the tree is created by removing a subtree from tree $i-1$ and replacing it with a leaf node with value chosen as in the tree building algorithm.

Pruning Procedure

Step 1: Train a Full Tree to its maximum possible depth

Step 2: Define a Range of Pruning Parameters
e.g., ccp_alpha in scikit-learn or cp in rpart
Controls the trade-off between tree complexity and fit
Larger values result in more aggressive pruning

Step 3: Perform Cross-Validation. For each value of $\alpha$, prune the tree and evaluate its performance using cross-validation

Step 4: Select the Optimal Pruning Parameter. Choose $\alpha$ that minimizes the cross-validated error (or maximizes classification accuracy

Step 5: Prune the Tree. Train the final tree using the optimal $\alpha$

Step 6: Evaluate the Pruned Tree on test data

Regression Trees in `R`

Use cv.tree() and plot the results (deviance or classification error) versus tree size. Here we would choose $J= 5$.

cv_reg_tree <- cv.tree(credit_tree, K=10) plot(cv_reg_tree$size, cv_reg_tree$dev, type="b", ylab="Deviance", xlab="No. leaves")

Obtain the pruned tree from the full tree.

tree_pruned <- prune.tree(credit_tree, best=5)
plot(tree_pruned)
text(tree_pruned, pretty=0)

Go back to the full tree and compare

Predict on the test data set and calculate the test error.

pred_tree <- predict(tree_pruned,Credit[-train,])
mean( (Credit[-train,"Balance"] - pred_tree)^2)
[1] 65574.69

Linear VS Tree Regression

Feature	Linear Regression	Tree Regression
Model Form	Linear equation: $y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon$	Hierarchical splits: Recursive partitioning of feature space
Assumptions	Linearity, homoscedasticity, independence	No strict assumptions; handles non-linearity naturally
Interpretability	High (coefficients directly indicate feature importance)	Moderate (intuitive tree structure, complex deep trees)
Handling Non-linearity	Poor (requires feature engineering)	Excellent (automatically captures non-linearity)
Outliers	Sensitive (outliers can heavily influence coefficients)	Robust (outliers have limited impact on splits)
Feature Scaling	Required (sensitive to scale of features)	Not required (scale-invariant)

Use Linear Regression if:
- The relationship between predictors and response is linear.
- You need a simple, interpretable model.
- You want to quantify the effect of each predictor (e.g., coefficients).

Use Tree Regression if:
- The relationship is non-linear or complex.
- You want automatic feature selection and interaction detection.
- You need a model that is robust to outliers and scale.

Classification Trees

Classification Tree

They work pretty much like regression trees, except

The response (target) is categorical rather than quantitative
Instead of SSError we minimize (locally and globally) a loss function suitable for classification problems
Given regions $R_1, \cdots, R_{J}$, predict an observation to belong to the most commonly occurring class in the region to which the observation belongs (predict by majority vote).

Loss Functions for Classification Trees

Let $\widehat{p}_{mj}$ denote the proportion of training observations in region $R_m$ that fall into class $j$

Misclassification error: $\displaystyle E_{m} = 1 - \max_j(\widehat{p}_{mj})$
Gini Index: measures node purity ${\small G_{m} = \displaystyle \sum_{j=1}^k \widehat{p}_{mj}(1-\widehat{p}_{mj})}$
Entropy: ${\small \displaystyle D_{m} = -1 \sum_{j=1}^{k} \widehat{p}_{mj} \text{log}(\widehat{p}_{mj})}$
Deviance (cross-entropy): ${\small \displaystyle -2 \sum_{m=1}^{J} \sum_{j=1}^k n_{mj} \text{log}(\widehat{p}_{mj})}$
Small values for these loss statistics indicates a good fit
$p(1-p)$ is largest for $p=0.5$
and small as $p \rightarrow 0$ or $p \rightarrow 1$
Gini measures how pure the nodes are, a small value indicates observations are mostly from the same class
Similarly, entropy will be near zero as $p \rightarrow 0$ or $p \rightarrow 1$
Typical application:
Growing trees: Gini index, entropy, or deviance
Pruning trees: misclassification rate

Example in `R`

Let’s look at codes

Bagging

Ensemble

An ensemble method is an approach that combines many simple building blocks to obtain a single model.
An ensemble method designed to improve the stability and accuracy of machine learning models.
A weak learner is a model that does not perform well on its own, sometimes barely better than random, but performs well in an ensemble.

Bagging

Bagging is short for Bootstrap Aggregating
is a general technique and can be applied to any learning method
is based on the idea of the bootstrap, to resample from the data set with replacement
the bootstrap can reduce the variability of a statistical learning method by averaging across weak learners

How it works:

Randomly sample the training data with replacement (bootstrap samples).
Train a separate model (e.g., decision tree) on each sample.
Combine the predictions of all models (e.g., by averaging for regression or voting for classification).

Why use it?

Reduces variance and prevents overfitting.

Works well with high-variance models like deep decision trees.

Bagging Trees Procedure

You have a data set of size $n$

Draw $B$ samples of size $n$ with replacement
$\Rightarrow$ the bootstrap samples
Grow decision trees deep without pruning to each of the $B$ bootstrap samples.
Each tree will have high variance but low bias.
Prediction
Regression trees: predicted value of target is average of predictions from $B$ trees
Classification trees: predicted class by taking majority vote of classifications from $B$ trees

Out-of-bag Error

Bagging provides a slick way to estimate the test error!

A bootstrap sample includes about 2/3 of the observations.
The probability that the $n$th bootstrap observation is the $j$th observation in the data frame is $1/n$, because sampling is with replacement. $\Rightarrow$ The probability that the $n$th bootstrap observation is not the $j$th observation is $1-1/n$.
For a bootstrap sample of size $n$, the probability that a particular observation is not in the bootstrap sample is $(1 - 1/n)^n$
Let’s look at this in R

Pros and Cons of Bagging

Pros

Improves accuracy by averaging across bootstrap samples
Built-in holdout sample (OOB) allows estimation of test error

Cons

Loss of interpretability, results not represented by one tree
Trees can be highly correlated
In the presence of strong predictors these are used in most trees in the top split $\Rightarrow$ the bagged trees all look similar

Averaging would be more effective if the trees were less correlated

Random Forests

Improvement over bagged trees
Trees are decorrelated
At each split, consider only a subset $m$ of the $p$ predictors
Choose the $m$ inputs randomly at each split
$m \sim \sqrt{p} \, \Rightarrow$ not all inputs considered at each split

Pros

Averaging decorrelated trees is more effective in reducing variability
Strong predictors are not allowed to dominate most trees
Other predictors are getting a chance
A Random Forest with $m = p$ equals Bagging
With a large number of correlated inputs, choose a small value for $m$
As with Bagging, increasing $B$ does not overfit
Choose $B$ large enough for the error rate to settle down.

Bagging Trees and Random Forests in `R`

Use the randomForest() function in the randomForest package in R. Note that $p = 12$. Choosing mtry = 12 results in a bagged tree.

data(Boston)
train <- sample (1: nrow(Boston), nrow(Boston) / 2)

bag.boston <- randomForest(medv ~ . , 
                           data      =Boston[train,], 
                           mtry      =12,    # m = p -> Bagging
                           importance=TRUE)

Use the randomForest() function in the randomForest package in R. Note that $p=12$. Choosing mtry $< p$ results in a Random Forest.

rf.boston <- randomForest(medv ~ . , 
                          data      =Boston[train,], 
                          mtry      =6,  # m < p -> Random Forest
                          importance=TRUE)

Variable importance in bagged trees and random forests

importance(bag.boston)

          %IncMSE IncNodePurity
crim    26.608592    876.066482
zn       5.148020     64.377478
indus    5.087023    106.219327
chas    -1.389269      9.968199
nox     19.760345    246.301715
rm      53.614293  12080.891286
age     15.776943    334.643934
dis      6.455965    249.905169
rad      1.067095     73.142867
tax      9.395352    153.639752
ptratio  9.254806    145.370983
lstat   50.662598   4838.203779

The varImpPlot() function produces a plot of the variable importance measures.

Tree-based Methods

Week 10

Julie Bessac

2026-04-12

Learning Goals

Introduction

Definitions

Example

Decisions Trees

Pros and Cons of Trees

Regression Trees

Basics of Regression Trees

Regression Tree

Growing a Tree

Recursive Binary Splitting

Some Stopping Criteria

Impurity VS Prediction Error

Regression Trees in `R`

Visualizing Trees in `R`

Returning Example

`R` Packages Comparision

Pruning

Pruning a Regression Tree

Pruning Algorithm

Pruning Procedure

Regression Trees in `R`

Linear VS Tree Regression

Classification Trees

Classification Tree

Loss Functions for Classification Trees

Example in `R`

Bagging

Ensemble

Bagging

Bagging Trees Procedure

Out-of-bag Error

Pros and Cons of Bagging

Random Forests

Random Forests

Bagging Trees and Random Forests in `R`

Feature	Linear Regression	Tree Regression
Model Form	Linear equation: \(y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon\)	Hierarchical splits: Recursive partitioning of feature space
Assumptions	Linearity, homoscedasticity, independence	No strict assumptions; handles non-linearity naturally
Interpretability	High (coefficients directly indicate feature importance)	Moderate (intuitive tree structure, complex deep trees)
Handling Non-linearity	Poor (requires feature engineering)	Excellent (automatically captures non-linearity)
Outliers	Sensitive (outliers can heavily influence coefficients)	Robust (outliers have limited impact on splits)
Feature Scaling	Required (sensitive to scale of features)	Not required (scale-invariant)

Tree-based Methods

Week 10

Julie Bessac

2026-04-12

Learning Goals

Introduction

Definitions

Example

Decisions Trees

Pros and Cons of Trees

Regression Trees

Basics of Regression Trees

Regression Tree

Growing a Tree

Recursive Binary Splitting

Some Stopping Criteria

Impurity VS Prediction Error

Regression Trees in R

Visualizing Trees in R

Returning Example

R Packages Comparision

Pruning

Pruning a Regression Tree

Pruning Algorithm

Pruning Procedure

Regression Trees in R

Linear VS Tree Regression

Classification Trees

Classification Tree

Loss Functions for Classification Trees

Example in R

Bagging

Ensemble

Bagging

Bagging Trees Procedure

Out-of-bag Error

Pros and Cons of Bagging

Random Forests

Random Forests

Bagging Trees and Random Forests in R

Regression Trees in `R`

Visualizing Trees in `R`

`R` Packages Comparision

Regression Trees in `R`

Example in `R`

Bagging Trees and Random Forests in `R`