Learning Goals

Introduction

Definitions

Tree-based methods can be used for predicting a continuous target (regression trees) or to classify a categorical response (classification trees).

The basic idea

  • divide the predictor space into non-overlapping segments (regions)

  • to make a prediction for a given observation, we use the mean of the training data in the region to which it belongs

  • since the set of splitting rules used to segment the predictor space can be summarized by a tree, such approaches are called decision tree methods.

  • these methods are simple and useful for interpretation.

Example

Carving up the predictor space Years and Hits for the Hitters data into 3 regions to predict log(Salary).

The predicted value is the average of the training observations that fall into that region.

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).
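As a quick sanity check in R (Salary in the Hitters data is recorded in thousands of dollars, and the tree predicts log(Salary)):

exp(5.998) * 1000    # roughly 402,622 dollars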

Decision Trees

Trees consist of nodes and branches.

The nodes are either internal if they split the data or terminal if they appear at the bottom of the tree.

Terminal nodes are also called leaves.

The depth of a tree refers to the longest path from the root/top node to any leaf node in the tree.

Pros and Cons of Trees

Pros

  • Easy to interpret, easy to explain, nice visual representation

  • Order of splits reflects variable importance

  • Easily accommodate qualitative predictors

  • Perform well when the relationship between target and predictors is highly nonlinear

Cons

  • Trees are often not as accurate as other regression or classification methods

  • Easy to overfit, poor test performance

  • Sensitive to small changes in the data

Regression Trees

Basics of Regression Trees

  • We want to predict a response or class \(Y\) from inputs \(X_1, X_2, \ldots, X_p\). We do this by growing a binary tree.

  • At each internal node in the tree, we apply a test to one of the inputs, say \(X_i\).

  • Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree.

  • Eventually we come to a leaf node, where we make a prediction.

  • This prediction aggregates or averages all the training data points which reach that leaf.

Regression Tree

  • Supervised prediction method for a continuous target.

  • Space generated by predictors \(X_1, \cdots, X_p\) is divided into \(J\) non-overlapping regions \(R_1, \cdots, R_{J}\). \(J\) is the number of terminal nodes.

  • For input observations falling into \(R_j\), the predicted value is the average of the target variable in that region: \(\widehat{y}_{R_j}\)

  • The goal is to find regions minimizing the prediction error \[{\tiny \text{SSError} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^2}\]

  • In practice, one cannot solve that optimization problem.
    Instead we follow a greedy approach (locally optimal).

How to choose \(J\) and the regions \(R_1, \cdots, R_{J}\)?

Building a regression (or classification) tree involves 3 steps

  1. Growing the tree 
    • top-down greedy algorithm: recursive binary splitting
  2. Pruning the tree
    • obtain a subtree by cost complexity pruning
    • often uses cross-validation to select tree complexity
  3. Predicting (or classifying) with the (pruned) tree 
    • pass observations through the tree and find their terminal node

Growing a Tree

A simple case: 1 continuous input variable 

The tree equivalent to simple linear regression

  • We know that \(J = 2\): \(R_1 = \{X | X < s\}, R_2 = \{X | X \geq s\}\)

  • Now we just have to find the split point \(s\) that minimizes the heterogeneity of \(Y\) within \(R_1\) and \(R_2\)

  • In practice, the best split point can be found by an exhaustive search over candidate values of \(X\) (e.g., the observed values), as in the sketch below
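As an illustration only (not part of the original notes), a minimal sketch of such an exhaustive search in base R; best_split() and the simulated data are hypothetical:

# Hypothetical helper: exhaustive search for the split point s on one predictor
# that minimizes the within-region sum of squared errors
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  sse <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    if (length(left) == 0 || length(right) == 0) return(Inf)
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}

# Simulated example: the true change point is at x = 0.4
set.seed(1)
x <- runif(200)
y <- ifelse(x < 0.4, 1, 3) + rnorm(200, sd = 0.3)
best_split(x, y)    # should be close to 0.4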

A more complicated case: \(p\) input predictors

  • a very high-dimensional problem

  • cannot consider all possible partitions into \(J\) boxes

\(\Rightarrow\) use recursive binary splitting


Goal: Partition the feature space into regions where the response variable is as homogeneous as possible.

Result: A binary tree where each internal node represents a split, and each terminal node (leaf) represents a predicted value (mean for regression, class for classification).

Optimization: The algorithm does not guarantee the globally optimal tree but works well in practice.

Recursive Binary Splitting

  • Start at the top of the tree

  • Successively split the predictor (=input) space. Each split

    • looks for the input \(X_i\) and split \(s\) to minimize heterogeneity/impurity of \(Y\) within regions
    • considers only the data in the current sub-tree
    • leads to two new branches
       
  • The method is greedy because it does not look back or ahead:
    it considers only the best split at the current step, rather than revisiting earlier splits or anticipating future splits later in the tree-building process

  • The process stops when a criterion is reached, e.g., all regions contain fewer than \(m\) observations

Specifically, growing an initial tree \(T_0\):

  1. Define 2 half-planes for each predictor \(X_j\) and split point \(s\): \[R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}\]

  2. Find the predictor \(j\) and split point \(s\) that minimize the residual sum of squares (RSS) in the resulting regions: \[\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y}_{R_2})^2\]

  3. Continue splitting each resulting region until a stopping criterion is met (e.g., minimum node size).
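For illustration (a rough sketch, not the implementation used by tree or rpart), recursive binary splitting for a numeric predictor matrix X and response y, stopping when a node has fewer than min_obs observations; grow() is a hypothetical helper:

grow <- function(X, y, min_obs = 10) {
  # Stopping criterion: too few observations -> leaf predicting the mean of y
  if (nrow(X) < min_obs) return(list(leaf = TRUE, pred = mean(y)))
  best <- list(rss = Inf)
  # Greedy search over all predictors j and candidate split points s (step 2)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      in_R1 <- X[, j] < s                                  # half-planes from step 1
      if (!any(in_R1) || all(in_R1)) next
      rss <- sum((y[in_R1]  - mean(y[in_R1]))^2) +
             sum((y[!in_R1] - mean(y[!in_R1]))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s, in_R1 = in_R1)
    }
  }
  if (!is.finite(best$rss)) return(list(leaf = TRUE, pred = mean(y)))
  # Recurse on the two new regions (step 3)
  list(leaf = FALSE, var = best$j, split = best$s,
       left  = grow(X[best$in_R1, , drop = FALSE],  y[best$in_R1],  min_obs),
       right = grow(X[!best$in_R1, , drop = FALSE], y[!best$in_R1], min_obs))
}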

Some Stopping Criteria

  1. Maximum Tree Depth
    Purpose: Prevents complexity and overfitting

  2. Minimum Number of Observations in a Node
    Purpose: Ensures splits rely on enough data

  3. Minimum Number of Observations in a Leaf
    Purpose: Prevents leaves too specific to small subset

  4. Minimum Decrease in Impurity
    Purpose: Ensures splits are meaningful and improve performance

  5. Maximum Number of Terminal Nodes
    Purpose: Controls the overall complexity of the tree

  6. Minimum Improvement in Performance
    Purpose: Prevents splits with insignificant improvement
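For concreteness, a sketch of how several of these criteria map onto control parameters of rpart (the values are illustrative; Credit and train are the objects used in the R examples later in these notes):

library(rpart)

fit <- rpart(Balance ~ ., data = Credit[train, ], method = "anova",
             control = rpart.control(maxdepth  = 5,      # 1. maximum tree depth
                                     minsplit  = 20,     # 2. min obs in a node to attempt a split
                                     minbucket = 7,      # 3. min obs allowed in a leaf
                                     cp        = 0.01))  # 4./6. min improvement required to keep a split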

Impurity VS Prediction Error

The term impurity is mainly used for classification trees (e.g., Gini impurity), but you may also encounter it for regression trees.

It refers to the metric used as splitting criterion: MSE, Gini, entropy, …

Minimizing impurity at each split is equivalent to maximizing information gain, i.e., maximizing the reduction in impurity.

Prediction error (e.g., misclassification rate, MSE on a hold-out set) is what you ultimately want to minimize for the entire tree, but it’s not directly used as the splitting criterion during tree construction (too expensive).

Regression Trees in R

Fit a regression tree with the tree() function in the tree library.

library(tree)
library(ISLR)    # assumed source of the Credit data; train is a vector of training row indices

credit_tree <- tree(Balance ~ ., data=Credit[train,])
summary(credit_tree)

## 
## Regression tree:
## tree(formula = Balance ~ ., data = Credit[train, ])
## Variables actually used in tree construction:
## [1] "Rating"  "Limit"   "Income"  "Student"
## Number of terminal nodes:  10 
## Residual mean deviance:  22850 = 4341000 / 190 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -417.60  -63.39  -38.45    0.00   81.73  511.60

Visualizing Trees in R

Visualize the tree with plot and add split labels with text

plot(credit_tree)
text(credit_tree, pretty=0, cex=0.65)

Returning to the Example

How to interpret the partition?

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).

R Packages Comparison

rpart VS tree
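As a rough comparison (a sketch, assuming the same Credit data and train index used above), the equivalent fit with rpart; rpart embeds cross-validation in the fit and prunes via the complexity parameter cp, whereas tree relies on cv.tree() and prune.tree():

library(rpart)

credit_rpart <- rpart(Balance ~ ., data = Credit[train, ], method = "anova")

printcp(credit_rpart)    # cross-validated error for each value of cp
plotcp(credit_rpart)     # visual aid for choosing cp

# Prune at the cp value with the smallest cross-validated error
best_cp <- credit_rpart$cptable[which.min(credit_rpart$cptable[, "xerror"]), "CP"]
credit_rpart_pruned <- prune(credit_rpart, cp = best_cp)

plot(credit_rpart_pruned)
text(credit_rpart_pruned, cex = 0.65)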

Pruning

Pruning a Regression Tree

Trees grown top down tend to overfit and perform poorly on test data sets.

The strategy is to grow a large tree and to prune it back to a subtree with good test error.

An alternative would be to restrict the tree from the start by constraining its parameters (number of splits, depth, …), but then the tree may stop too early, miss useful splits, and underfit.

  • Usually, a full tree \(T_0\) has high test error and is too complex (too many leaves)

  • A subtree (a pruned version of \(T_0\)) might perform better

  • But we cannot investigate all possible subtrees, so we select a small set of subtrees for consideration

  • Use cross-validation to select the “best” subtree from the set

  • The entire process is called cost complexity pruning (also known as weakest link pruning)

Cost complexity pruning

  • How to choose the sequence of trees to look at for pruning?

  • Pruning removes parts of the tree that provide little to no additional power in predicting target variables, ultimately resulting in a more robust and manageable model.

  • For each value of \(\alpha \geq 0\) there is a subtree \(T \subseteq T_0\) such that \[\sum_{m=1}^{J}\sum_{i \in R_m} \left( y_i - \widehat{y}_{R_m} \right)^2 + \alpha J\] is as small as possible, where \(J\) is the number of terminal nodes of \(T\)

  • \(\alpha\) is similar to the shrinkage parameter in Ridge and Lasso regression.  Cost complexity pruning is a regularization method

  • \(\alpha = 0 \, \Rightarrow T_0\), we get the full tree

  • As \(\alpha\) grows the penalty for having many leaves increases

  • As \(\alpha \rightarrow \infty\), we get the null tree with a single leaf node (the root)

Pruning Algorithm

Cost complexity pruning generates a series of trees \(T_0\), …,\(T_m\) with \(T_0\) the initial full tree and \(T_m\) the root alone.

At step \(i\), the tree is created by removing a subtree from tree \(i-1\) and replacing it with a leaf node with value chosen as in the tree building algorithm.

Pruning Procedure

Step 1: Train a Full Tree to its maximum possible depth

Step 2: Define a Range of Pruning Parameters
e.g., ccp_alpha in scikit-learn or cp in rpart
Controls the trade-off between tree complexity and fit
Larger values result in more aggressive pruning

Step 3: Perform Cross-Validation. For each value of \(\alpha\), prune the tree and evaluate its performance using cross-validation

Step 4: Select the Optimal Pruning Parameter. Choose \(\alpha\) that minimizes the cross-validated error (or maximizes classification accuracy)

Step 5: Prune the Tree. Train the final tree using the optimal \(\alpha\)

Step 6: Evaluate the Pruned Tree on test data

Regression Trees in R

Use cv.tree() and plot the results (deviance or classification error) versus tree size. Here we would choose \(J= 5\).

cv_reg_tree <- cv.tree(credit_tree, K=10)
plot(cv_reg_tree$size, cv_reg_tree$dev, type="b", ylab="Deviance", xlab="No. leaves")

Obtain the pruned tree from the full tree.

tree_pruned <- prune.tree(credit_tree, best=5)
plot(tree_pruned)
text(tree_pruned, pretty=0)

Go back to the full tree and compare

Predict on the test data set and calculate the test error.

pred_tree <- predict(tree_pruned,Credit[-train,])
mean( (Credit[-train,"Balance"] - pred_tree)^2)
[1] 65574.69
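For the comparison with the full (unpruned) tree, a quick sketch using the same objects:

pred_full <- predict(credit_tree, Credit[-train,])
mean((Credit[-train, "Balance"] - pred_full)^2)    # test MSE of the unpruned tree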

Linear VS Tree Regression

  • Model Form
    Linear Regression: linear equation \(y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon\)
    Tree Regression: hierarchical splits, recursive partitioning of the feature space

  • Assumptions
    Linear Regression: linearity, homoscedasticity, independence
    Tree Regression: no strict assumptions; handles non-linearity naturally

  • Interpretability
    Linear Regression: high (coefficients directly indicate feature importance)
    Tree Regression: moderate (intuitive tree structure, but deep trees become complex)

  • Handling Non-linearity
    Linear Regression: poor (requires feature engineering)
    Tree Regression: excellent (automatically captures non-linearity)

  • Outliers
    Linear Regression: sensitive (outliers can heavily influence coefficients)
    Tree Regression: robust (outliers have limited impact on splits)

  • Feature Scaling
    Linear Regression: required (sensitive to the scale of features)
    Tree Regression: not required (scale-invariant)

  • Use Linear Regression if:
    • The relationship between predictors and response is linear.
    • You need a simple, interpretable model.
    • You want to quantify the effect of each predictor (e.g., coefficients).


  • Use Tree Regression if:
    • The relationship is non-linear or complex.
    • You want automatic feature selection and interaction detection.
    • You need a model that is robust to outliers and scale.

Classification Trees

Classification Tree

They work pretty much like regression trees, except

  • The response (target) is categorical rather than quantitative

  • Instead of SSError we minimize (locally and globally) a loss function suitable for classification problems

  • Given regions \(R_1, \cdots, R_{J}\), predict an observation to belong to the most commonly occurring class in the region to which the observation belongs (predict by majority vote).

Loss Functions for Classification Trees

Let \(\widehat{p}_{mj}\) denote the proportion of training observations in region \(R_m\) that fall into class \(j\)

  • Misclassification error: \(\displaystyle E_{m} = 1 - \max_j(\widehat{p}_{mj})\)

  • Gini Index: measures node purity \({\small G_{m} = \displaystyle \sum_{j=1}^k \widehat{p}_{mj}(1-\widehat{p}_{mj})}\)

  • Entropy: \({\small \displaystyle D_{m} = - \sum_{j=1}^{k} \widehat{p}_{mj} \text{log}(\widehat{p}_{mj})}\)

  • Deviance (cross-entropy): \({\small \displaystyle -2 \sum_{m=1}^{J} \sum_{j=1}^k n_{mj} \text{log}(\widehat{p}_{mj})}\)

  • Small values of these loss statistics indicate a good fit

  • \(p(1-p)\) is largest for \(p=0.5\)
    and small as \(p \rightarrow 0\) or \(p \rightarrow 1\).
    The Gini index therefore measures how pure the nodes are: a small value indicates the observations in a node are mostly from the same class

  • Similarly, entropy will be near zero as \(p \rightarrow 0\) or \(p \rightarrow 1\)

  • Typical application:
    Growing trees: Gini index, entropy, or deviance
    Pruning trees: misclassification rate
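A small numerical illustration (hypothetical class proportions) of how these node-level measures behave:

impurity <- function(p) {                 # p: class proportions in a node
  c(misclassification = 1 - max(p),
    gini              = sum(p * (1 - p)),
    entropy           = -sum(p * log(p)))
}

impurity(c(0.5, 0.5))    # impure node: all measures are large
impurity(c(0.9, 0.1))    # nearly pure node: all measures are small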

Example in R

Let’s look at the code.
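A minimal sketch of fitting a classification tree in R, assuming the Carseats data from the ISLR package (a binary high-sales indicator is created as the response):

library(tree)
library(ISLR)                            # assumed source of the Carseats data

# Binary response: is this a high-sales store?
Carseats$High <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))

# Grow the classification tree (exclude Sales, which defines the response)
carseats_tree <- tree(High ~ . - Sales, data = Carseats)
summary(carseats_tree)

plot(carseats_tree)
text(carseats_tree, pretty = 0, cex = 0.65)

# Predict by majority vote in each leaf and tabulate against the observed class
pred <- predict(carseats_tree, Carseats, type = "class")
table(Predicted = pred, Actual = Carseats$High)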

Bagging

Ensemble

  • An ensemble method is an approach that combines many simple building blocks to obtain a single model.

  • Ensemble methods are designed to improve the stability and accuracy of machine learning models.

  • A weak learner is a model that does not perform well on its own, sometimes barely better than random, but performs well in an ensemble.

Bagging

  • Bagging is short for Bootstrap Aggregating

  • is a general technique and can be applied to any learning method

  • is based on the idea of the bootstrap, to resample from the data set with replacement

  • averaging fits across many bootstrap samples (bagging) reduces the variability of a statistical learning method

How it works:

  1. Randomly sample the training data with replacement (bootstrap samples).

  2. Train a separate model (e.g., decision tree) on each sample.

  3. Combine the predictions of all models (e.g., by averaging for regression or voting for classification).


Why use it?

Reduces variance and prevents overfitting.

Works well with high-variance models like deep decision trees.

Bagging Trees Procedure

You have a data set of size \(n\)

  1. Draw \(B\) samples of size \(n\) with replacement
    \(\Rightarrow\) the bootstrap samples

  2. Grow decision trees deep without pruning to each of the \(B\) bootstrap samples.
    Each tree will have high variance but low bias.

  3. Prediction
    Regression trees: predicted value of target is average of predictions from \(B\) trees
    Classification trees: predicted class by taking majority vote of classifications from \(B\) trees
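A minimal sketch of this procedure "by hand" for regression trees, assuming the Boston data from ISLR2 (the same data used in the randomForest examples below); in practice one would simply call randomForest() with mtry = p:

library(tree)
library(ISLR2)                                # assumed source of the Boston data

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
B     <- 100                                  # number of bootstrap samples
preds <- matrix(NA, nrow(Boston) - length(train), B)

for (b in 1:B) {
  boot_idx   <- sample(train, length(train), replace = TRUE)   # 1. bootstrap sample
  fit        <- tree(medv ~ ., data = Boston[boot_idx, ])      # 2. grow a tree on it
  preds[, b] <- predict(fit, newdata = Boston[-train, ])       #    test-set predictions
}

bagged_pred <- rowMeans(preds)                # 3. average the B predictions
mean((Boston[-train, "medv"] - bagged_pred)^2)    # test MSE of the bagged ensemble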

Out-of-bag Error

Bagging provides a slick way to estimate the test error!

  • A bootstrap sample includes about 2/3 of the observations.

  • The probability that a given bootstrap draw is the \(j\)th observation in the data frame is \(1/n\), because sampling is with replacement. \(\Rightarrow\) The probability that a given draw is not the \(j\)th observation is \(1-1/n\).

  • For a bootstrap sample of size \(n\), the probability that a particular observation is not in the bootstrap sample is \((1 - 1/n)^n \rightarrow e^{-1} \approx 0.37\) as \(n\) grows, which is why roughly 2/3 of the observations are included. The omitted (out-of-bag, OOB) observations act as a holdout set for estimating the test error.

  • Let’s look at this in R
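A quick look at \((1 - 1/n)^n\) in R:

n <- c(5, 10, 50, 100, 1000, 10000)
round((1 - 1/n)^n, 4)    # probability an observation is left out of a bootstrap sample
exp(-1)                  # the limit as n grows: about 0.368, i.e., roughly 1/3 left out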

Pros and Cons of Bagging

Pros

  • Improves accuracy by averaging across bootstrap samples

  • Built-in holdout sample (OOB) allows estimation of test error

Cons

  • Loss of interpretability, results not represented by one tree

  • Trees can be highly correlated

  • In the presence of strong predictors, those predictors are used in the top split of most trees \(\Rightarrow\) the bagged trees all look similar

Averaging would be more effective if the trees were less correlated

Random Forests

Random Forests

  • Improvement over bagged trees

  • Trees are decorrelated

  • At each split, consider only a subset \(m\) of the \(p\) predictors

  • Choose the \(m\) inputs randomly at each split

  • \(m \sim \sqrt{p} \, \Rightarrow\) not all inputs considered at each split

Pros

  • Averaging decorrelated trees is more effective in reducing variability

  • Strong predictors are not allowed to dominate most trees

  • Other predictors are getting a chance

  • A Random Forest with \(m = p\) equals Bagging

  • With a large number of correlated inputs, choose a small value for \(m\)

  • As with Bagging, increasing \(B\) does not cause overfitting

  • Choose \(B\) large enough for the error rate to settle down.

Bagging Trees and Random Forests in R

Use the randomForest() function in the randomForest package in R. Note that \(p = 12\). Choosing mtry = 12 results in a bagged tree.

library(randomForest)
library(ISLR2)    # assumed source of the Boston data (12 predictors)

data(Boston)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

bag.boston <- randomForest(medv ~ . , 
                           data      =Boston[train,], 
                           mtry      =12,    # m = p -> Bagging
                           importance=TRUE) 

Use the randomForest() function in the randomForest package in R. Note that \(p=12\). Choosing mtry \(< p\) results in a Random Forest.

rf.boston <- randomForest(medv ~ . , 
                          data      =Boston[train,], 
                          mtry      =6,  # m < p -> Random Forest
                          importance=TRUE)
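A sketch of comparing a few values of mtry on the test set (reusing the Boston objects above; the chosen values are illustrative):

set.seed(1)
for (m in c(4, 6, 12)) {
  fit  <- randomForest(medv ~ ., data = Boston[train, ], mtry = m, ntree = 500)
  pred <- predict(fit, newdata = Boston[-train, ])
  cat("mtry =", m, " test MSE =", round(mean((Boston[-train, "medv"] - pred)^2), 1), "\n")
}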

Variable importance in bagged trees and random forests

importance(bag.boston)

          %IncMSE IncNodePurity
crim    26.608592    876.066482
zn       5.148020     64.377478
indus    5.087023    106.219327
chas    -1.389269      9.968199
nox     19.760345    246.301715
rm      53.614293  12080.891286
age     15.776943    334.643934
dis      6.455965    249.905169
rad      1.067095     73.142867
tax      9.395352    153.639752
ptratio  9.254806    145.370983
lstat   50.662598   4838.203779

The varImpPlot() function produces a plot of the variable importance measures.
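For example:

varImpPlot(bag.boston)    # dotcharts of %IncMSE and IncNodePurity for each predictor
varImpPlot(rf.boston)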