Learning Goals

Introduction

Definitions

Tree-based methods can be used for predicting a continuous target (regression trees) or to classify a categorical response (classification trees).

The basic idea

  • divide the predictor space into non-overlapping segments (regions)

  • to make a prediction for a given observation, we use the mean of the training data in the region to which it belongs

  • since the set of splitting rules to segment the predictor space can be summarized by a tree such approaches are called decision tree methods.

  • these methods are simple and useful for interpretation.

Example

Carving up the predictor space Years and Hits for the Hitters data into 3 regions to predict log(Salary).

The predicted value is the average of the observations that fall into the regions.

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).

Decisions Trees

Trees consist of nodes and branches.

The nodes are either internal if they split the data or terminal if they appear at the bottom of the tree.

Terminal nodes are also called leaves.

The depth of a tree refers to the longest path from the root/top node to any leaf node in the tree.

Pros and Cons of Trees

Pros

  • Easy to interpret, easy to explain, nice visual representation

  • Order of splits reflects variable importance

  • Easily accommodate qualitative predictors

  • Performs well when relationship between target and predictors highly nonlinear

Cons

  • Trees often not as accurate as regression or classification methods

  • Easy to overfit, poor test performance

  • Sensitive to small changes in the data

Regression Trees

Basics of Regression Trees

  • We want to predict a response or class Y from inputs \(X_1\), \(X_2\),…\(X_p\). We do this by growing a binary tree.

  • At each internal node in the tree, we apply a test to one of the inputs, say \(X_i\).

  • Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree.

  • Eventually we come to a leaf node, where we make a prediction.

  • This prediction aggregates or averages all the training data points which reach that leaf.

Regression Tree

  • Supervised prediction method for a continuous target.

  • Space generated by predictors \(X_1, \cdots, X_p\) is divided into \(J\) non-overlapping regions \(R_1, \cdots, R_{J}\). \(J\) is the number of terminal nodes.

  • For input observations falling into \(R_j\), the predicted value is the average of the target variable in that region: \(\widehat{y}_{R_j}\)

  • The goal is to find regions minimizing the prediction error \[\text{SSError} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^2\]

  • In practice, one cannot solve that optimization problem.
    Instead we follow a greedy approach (locally optimal).

How to choose \(J\) and the regions \(R_1, \cdots, R_{J}\)?

Building a regression (or classification) tree involves 3 steps

  1. Growing the tree 
    • top-down greedy algorithm: recursive binary splitting
  2. Pruning the tree
    • obtain a subtree by cost complexity pruning
    • often uses cross-validation to select tree complexity
  3. Predicting (or classifying) with the (pruned) tree 
    • pass observations through the tree and find their terminal node

Growing a Tree

A simple case: 1 continuous input variable 

The tree equivalent to simple linear regression

  • We know that \(J = 2\): \(R_1 = \{X | X < s\}, R_2 = \{X | X \geq s\}\)

  • Now we just have to find the split point \(s\) that minimizes the heterogeneity of \(Y\) within \(R_1\) and \(R_2\)

  • You can do it via optimization, binary search, etc …

A more complicated case: \(p\) input predictors
- a very high-dimensional problem
- cannot consider all possible partitions into \(J\) boxes

\(\Rightarrow\) use recursive binary splitting


Goal: Partition the feature space into regions where the response variable is as homogeneous as possible.

Result: A binary tree where each internal node represents a split, and each terminal node (leaf) represents a predicted value (mean for regression, class for classification).

Optimization: The algorithm does not guarantee the globally optimal tree but works well in practice.

Recursive Binary Splitting

  • Start at the top of the tree

  • Successively split the predictor (=input) space. Each split

    • looks for the input \(X_i\) and split \(s\) to minimize heterogeneity/impurity of \(Y\) within regions
    • considers only the data in the current sub-tree
    • leads to two new branches
       
  • Method is greedy because it does not look back/ahead
    It considers only the best split at this step rather than earlier/future splits later in the tree-building process

  • The process stops when a criterion is reached, e.g., all regions contain less than \(m\) obs

Specifically, growing an initial tree \(T_0\):

  1. Define 2 half-planes for each predictor \(X_j\) and split point \(s\): \[R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}\]

  2. Find the predictor \(j\) and split point \(s\) that minimize the sum of squared errors (RSS) in the resulting regions: \[\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y}_{R_2})^2\]

  3. Continue splitting each resulting region until a stopping criterion is met (e.g., minimum node size).

Some Stopping Criteria

  1. Maximum Tree Depth
    Purpose: Prevents complexity and overfitting

  2. Minimum Number of Observations in a Node
    Purpose: Ensures splits rely on enough data

  3. Minimum Number of Observations in a Leaf
    Purpose: Prevents leaves too specific to small subset

  4. Minimum Decrease in Impurity
    Purpose: Ensures meaningful splits, improve performance

  5. Maximum Number of Terminal Nodes
    Purpose: Controls the overall complexity of the tree

  6. Minimum Improvement in Performance
    Purpose: Prevents splits with insignificantly improvement

Impurity VS Prediction Error

That term is mainly used in classification tree (Gini impurity), but you may encounter the term for regression trees.

It refers to the metric used as splitting criterion: MSE, Gini, entropy, …

Impurity is equivalent to maximizing information gain or maximizing the reduction in impurity.

Prediction error (e.g., misclassification rate, MSE on a hold-out set) is what you ultimately want to minimize for the entire tree, but it’s not directly used as the splitting criterion during tree construction (too expensive).

Regression Trees in R

Fit a regression tree with the tree() function in the tree library.

library(tree)
cred.tree <- tree(Balance ~ ., data=Credit[train,])
summary(credit_tree)

## 
## Regression tree:
## tree(formula = Balance ~ ., data = Credit[train, ])
## Variables actually used in tree construction:
## [1] "Rating"  "Limit"   "Income"  "Student"
## Number of terminal nodes:  10 
## Residual mean deviance:  17930 = 3406000 / 190 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -338.20  -60.45  -34.37    0.00   66.37  481.90

Visualizing Trees in R

Visualize the tree with plot and add split labels with text

plot(credit_tree) text(credit_tree, pretty=0, cex=0.65)

Returning Example

How to interpret the partition?

This can be represented visually as a decision tree.

For a player with 6 years experience and 100 hits per year the predicted salary is \(\exp(5.998) = \$402,622\).

R Packages Comparision

rpart VS tree

Pruning

Pruning a Regression Tree

Trees grown top down tend to overfit and perform poorly on test data sets.

The strategy is to grow a large tree and to prune it back to a subtree with good test error.

An alternative could have been to prevent a tree from growing from the beginning by constraining parameters (splits, depth, …) but the tree may stop too early, missing useful splits and then underfits.

  • Usually, a full tree \(T_0\) has high test error and is too complex (too many leaves)

  • A subtree (a pruned version of \(T_0\)) might perform better

  • But we cannot investigate all possible subtrees, so we select a small set of subtrees for considerations

  • Use cross-validation to select the “best” subtree from the set

  • The entire process is called cost complexity pruning (also known as weakest link pruning)

Cost complexity pruning

  • How to choose the sequence of trees to look at for pruning?

  • Pruning removes parts of the tree that provide little to no additional power in predicting target variables, ultimately resulting in a more robust and manageable model.

  • For each value \(\alpha\) there is a tree \(T \subset T_0\) such that \[\sum_{m=1}^{J}\sum_{i \in R_m} \left( y_i - \widehat{y}_{R_m} \right)^2 + \alpha J\] is as small as possible

  • \(\alpha\) is similar to the shrinkage parameter in Ridge and Lasso regression.  Cost complexity pruning is a regularization method

  • \(\alpha = 0 \, \Rightarrow T_0\), we get the full tree

  • As \(\alpha\) grows the penalty for having many leaves increases

  • \(\alpha = \infty\), we get the null tree with a single leaf node

Pruning Algorithm

Cost complexity pruning generates a series of trees \(T_0\), …,\(T_m\) with \(T_0\) the initial full tree and \(T_m\) the root alone.

At step \(i\), the tree is created by removing a subtree from tree \(i-1\) and replacing it with a leaf node with value chosen as in the tree building algorithm.

Pruning Procedure

Step 1: Train a Full Tree to its maximum possible depth

Step 2: Define a Range of Pruning Parameters
e.g., ccp_alpha in scikit-learn or cp in rpart
Controls the trade-off between tree complexity and fit
Larger values result in more aggressive pruning

Step 3: Perform Cross-Validation. For each value of \(\alpha\), prune the tree and evaluate its performance using cross-validation

Step 4: Select the Optimal Pruning Parameter. Choose \(\alpha\) that minimizes the cross-validated error (or maximizes classification accuracy

Step 5: Prune the Tree. Train the final tree using the optimal \(\alpha\)

Step 6: Evaluate the Pruned Tree on test data

Regression Trees in R

Use cv.tree() and plot the results (deviance or classification error) versus tree size. Here we would choose \(J= 5\).

cv_reg_tree <- cv.tree(credit_tree, K=10) plot(cv_reg_tree$size, cv_reg_tree$dev, type="b", ylab="Deviance", xlab="No. leaves")

Obtain the pruned tree from the full tree.

tree_pruned <- prune.tree(credit_tree, best=5)
plot(tree_pruned)
text(tree_pruned, pretty=0)

Go back to the full tree and compare

Predict on the test data set and calculate the test error.

pred_tree <- predict(tree_pruned,Credit[-train,])
mean( (Credit[-train,"Balance"] - pred_tree)^2)
[1] 65574.69

Linear VS Tree Regression

Feature Linear Regression Tree Regression
Model Form Linear equation: \(y = \beta_0 + \beta_1x_1 + \dots + \beta_p x_p + \epsilon\) Hierarchical splits: Recursive partitioning of feature space
Assumptions Linearity, homoscedasticity, independence No strict assumptions; handles non-linearity naturally
Interpretability High (coefficients directly indicate feature importance) Moderate (intuitive tree structure, complex deep trees)
Handling Non-linearity Poor (requires feature engineering) Excellent (automatically captures non-linearity)
Outliers Sensitive (outliers can heavily influence coefficients) Robust (outliers have limited impact on splits)
Feature Scaling Required (sensitive to scale of features) Not required (scale-invariant)
  • Use Linear Regression if:
    • The relationship between predictors and response is linear.
    • You need a simple, interpretable model.
    • You want to quantify the effect of each predictor (e.g., coefficients).


  • Use Tree Regression if:
    • The relationship is non-linear or complex.
    • You want automatic feature selection and interaction detection.
    • You need a model that is robust to outliers and scale.

Classification Trees

Classification Tree

They work pretty much like regression trees, except

  • The response (target) is categorical rather than quantitative

  • Instead of SSError we minimize (locally and globally) a loss function suitable for classification problems

  • Given regions \(R_1, \cdots, R_{J}\), predict an observation to belong to the most commonly occurring class in the region to which the observation belongs (predict by majority vote).

Loss Functions for Classification Trees

Let \(\widehat{p}_{mj}\) denote the proportion of training observations in region \(R_m\) that fall into class \(j\)

  • Misclassification error: \(\displaystyle E_{m} = 1 - \max_j(\widehat{p}_{mj})\)

  • Gini Index: measures node purity \({\small G_{m} = \displaystyle \sum_{j=1}^k \widehat{p}_{mj}(1-\widehat{p}_{mj})}\)

  • Entropy: \({\small \displaystyle D_{m} = -1 \sum_{j=1}^{k} \widehat{p}_{mj} \text{log}(\widehat{p}_{mj})}\)

  • Deviance (cross-entropy): \({\small \displaystyle -2 \sum_{m=1}^{J} \sum_{j=1}^k n_{mj} \text{log}(\widehat{p}_{mj})}\)

  • Small values for these loss statistics indicates a good fit

  • \(p(1-p)\) is largest for \(p=0.5\)
    and small as \(p \rightarrow 0\) or \(p \rightarrow 1\)
    Gini measures how pure the nodes are, a small value indicates observations are mostly from the same class

  • Similarly, entropy will be near zero as \(p \rightarrow 0\) or \(p \rightarrow 1\)

  • Typical application:
    Growing trees: Gini index, entropy, or deviance
    Pruning trees: misclassification rate

Example in R

Let’s look at codes

Bagging

Ensemble

  • An ensemble method is an approach that combines many simple building blocks to obtain a single model.

  • An ensemble method designed to improve the stability and accuracy of machine learning models.

  • A weak learner is a model that does not perform well on its own, sometimes barely better than random, but performs well in an ensemble.

Bagging

  • Bagging is short for Bootstrap Aggregating

  • is a general technique and can be applied to any learning method

  • is based on the idea of the bootstrap, to resample from the data set with replacement

  • the bootstrap can reduce the variability of a statistical learning method by averaging across weak learners

How it works:

  1. Randomly sample the training data with replacement (bootstrap samples).

  2. Train a separate model (e.g., decision tree) on each sample.

  3. Combine the predictions of all models (e.g., by averaging for regression or voting for classification).


Why use it?

Reduces variance and prevents overfitting.

Works well with high-variance models like deep decision trees.

Bagging Trees Procedure

You have a data set of size \(n\)

  1. Draw \(B\) samples of size \(n\) with replacement
    \(\Rightarrow\) the bootstrap samples

  2. Grow decision trees deep without pruning to each of the \(B\) bootstrap samples.
    Each tree will have high variance but low bias.

  3. Prediction
    Regression trees: predicted value of target is average of predictions from \(B\) trees
    Classification trees: predicted class by taking majority vote of classifications from \(B\) trees

Out-of-bag Error

Bagging provides a slick way to estimate the test error!

  • A bootstrap sample includes about 2/3 of the observations.

  • The probability that the \(n\)th bootstrap observation is the \(j\)th observation in the data frame is \(1/n\), because sampling is with replacement. \(\Rightarrow\) The probability that the \(n\)th bootstrap observation is not the \(j\)th observation is \(1-1/n\).

  • For a bootstrap sample of size \(n\), the probability that a particular observation is not in the bootstrap sample is \((1 - 1/n)^n\)

  • Let’s look at this in R

Pros and Cons of Bagging

Pros

  • Improves accuracy by averaging across bootstrap samples

  • Built-in holdout sample (OOB) allows estimation of test error

Cons

  • Loss of interpretability, results not represented by one tree

  • Trees can be highly correlated

  • In the presence of strong predictors these are used in most trees in the top split \(\Rightarrow\) the bagged trees all look similar

Averaging would be more effective if the trees were less correlated

Random Forests

Random Forests

  • Improvement over bagged trees

  • Trees are decorrelated

  • At each split, consider only a subset \(m\) of the \(p\) predictors

  • Choose the \(m\) inputs randomly at each split

  • Usually \(m \sim \sqrt{p}\) in classification problems and \(m \sim \frac{p}{3}\) in regression problems \(\Rightarrow\) not all inputs considered at each split

Pros

  • Averaging decorrelated trees is more effective in reducing variability

  • Strong predictors are not allowed to dominate most trees

  • Other predictors are getting a chance

  • A Random Forest with \(m = p\) equals Bagging

  • With a large number of correlated inputs, choose a small value for \(m\)

  • As with Bagging, increasing \(B\) does not overfit

  • Choose \(B\) large enough for the error rate to settle down.

Bagging Trees and Random Forests in R

Use the randomForest() function in the randomForest package in R. Note that \(p = 12\). Choosing mtry = 12 results in a bagged tree.

data(Boston)
train <- sample (1: nrow(Boston), nrow(Boston) / 2)

bag.boston <- randomForest(medv ~ . , 
                           data      =Boston[train,], 
                           mtry      =12,    # m = p -> Bagging
                           importance=TRUE) 

Use the randomForest() function in the randomForest package in R. Note that \(p=12\). Choosing mtry \(< p\) results in a Random Forest.

rf.boston <- randomForest(medv ~ . , 
                          data      =Boston[train,], 
                          mtry      =6,  # m < p -> Random Forest
                          importance=TRUE)

Variable importance in bagged trees and random forests

importance(bag.boston)

          %IncMSE IncNodePurity
crim    26.608592    876.066482
zn       5.148020     64.377478
indus    5.087023    106.219327
chas    -1.389269      9.968199
nox     19.760345    246.301715
rm      53.614293  12080.891286
age     15.776943    334.643934
dis      6.455965    249.905169
rad      1.067095     73.142867
tax      9.395352    153.639752
ptratio  9.254806    145.370983
lstat   50.662598   4838.203779

The varImpPlot() function produces a plot of the variable importance measures.

Boosting

What is boosting?

  • Boosting is an ensemble ensemble learning method that combines a set of less accurate models (called “weak learners” i.e. only slightly better than random guessing) to create a single, highly accurate model (“strong learner”).

  • Unlike other ensemble methods that build models in parallel (e.g. bagging or rf), boosting algorithms build models sequentially.

  • Each new model in the sequence is trained to correct the errors made by its predecessors.

  • This iterative process allows the overall model to improve its accuracy, particularly by reducing bias.

  • Boosting is a popular and effective technique used in supervised learning for both classification and regression tasks.

AdaBoost

  • AdaBoost is regarded as the first practical boosting algorithm.

  • It was designed for a binary classification problem.

  • Sequential Learning: Each new model focuses more on the observations that were misclassified by the previous models.

  • Adaptive Weighting: After each iteration, the weights of misclassified observations are increased, so the next model pays more attention to them. Correctly classified observations have their weights decreased.

  • Final Prediction: The predictions from all weak learners are combined using a weighted vote, where each model’s vote is weighted by its accuracy.

AdaBoost Algorithm

  1. Initialize weights: equal weights to all training observations (e.g., \(w_i = \frac{1}{N}\))

  2. Train weak learners: For each iteration \(m=1,2,...,M\)
    Train a weak learner \(h_m(x)\) on the weighted data
    Compute weighted error rate \(\epsilon_m = \sum_{i=1}^N w_i \cdot I(y_i \neq h_m(x_i))\)

  3. Update weights:
    Compute the model’s weight \(\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)\)
    Update observation weights: \(w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot I(y_i \neq h_m(x_i)))\)
    Normalize the weights so they sum to 1

  4. Final Model: \(\displaystyle H(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right)\)

Key Features of AdaBoost

  • Focus on errors: AdaBoost is particularly effective because it forces subsequent models to focus on difficult cases, reducing both bias and variance.

  • Versatility: AdaBoost can work with any weak learner (e.g., linear classifiers, SVMs).

  • Theoretical guarantees: Under certain conditions, AdaBoost can drive the training error to zero as the number of weak learners increases.

  • Limitations:
    Sensitive to noisy data and outliers (since it focuses on misclassified points).
    Can overfit if the number of weak learners is too large or if they are too complex.

Gradient Boosting

Researchers realized that AdaBoost could be viewed as a functional gradient descent process, optimizing a loss function.

AdaBoost was shown to minimize an exponential loss.

The framework could be generalized to other loss functions (e.g., MSE, logistic loss).

Gradient boosting generalizes boosting as an optimization problem.
Instead of focusing on misclassified points, it fits new models to the gradient of the loss function (e.g., residuals for regression).

Core Idea: Builds models sequentially, where each new model corrects the errors of the previous one by optimizing a loss function using gradient descent.

Key Features:
Works with any differentiable loss function (e.g., MSE, log-loss).  Can use any weak learner, but decision trees are most common.
Highly flexible and often achieves state-of-the-art performance.

Limitations: Computationally intensive; requires careful tuning of hyperparameters (e.g., learning rate, tree depth).

Popular Implementations:
XGBoost (Extreme Gradient Boosting) (2014): Optimized for speed and performance.
LightGBM (2017): Designed for distributed and efficient training.  CatBoost (2017): Handles categorical features natively.

Gradient Boosting Algorithm

Loss function: \(L(y, F(x))\) (e.g., squared error for regression)
Number of iterations (trees): M

  1. Start with a constant model \(F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\)  e.g. for squared error: \(F_0(x) = \frac{1}{n}\sum_{i=1}^n y_i\)

  2. Iterative training, for m=1, …,M:

    1. Compute pseudo-residuals (negative gradients of the loss function) \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}\)
      e.g. for squared error: \(r_{im} = y_i - F_{m-1}(x_i)\)
    2. Train a weak learner(e.g. tree) \(h_m(x)\) to predict \(r_{im}\)
    3. Compute multiplier \(\gamma_{m}\) by solving \(\gamma_{m}={\underset {\gamma }{\operatorname {argmin}}}\sum_{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})\right)\)
    4. Update the model \(F_{m}(x)=F_{m-1}(x)+\gamma_{m}h_{m}(x)\)
  3. Final model \(F(x) = F_M(x)\)

Gradient Boosting Regularization

Fitting the training set too closely can lead to degradation of the model’s generalization ability.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of base models) to balance over- and under-fitting.

Shrinkage: An important part of gradient boosting is regularization by shrinkage which uses a modified update rule: \(F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma _{m}h_{m}(x),\quad 0<\nu \leq 1,\)
where parameter \(\nu \in [0,1]\) is called the ``learning rate”.

Empirically, it has been found that small learning rates yields dramatic improvements in models’ generalization ability over gradient boosting without shrinking (\(\nu =1\)).

However, it comes at the price of increasing computational time both during training and querying: lower learning rate requires more iterations.

Gradient Boosting Parameters

Key Parameters

  • Learning rate \(\nu\): Controls the contribution of each tree.

  • Number of trees \(M\): More trees can improve performance but risk overfitting.

  • Tree complexity: Depth of the weak learners.

Gradient-Boosted for Trees

Each new tree is trained to predict the pseudo-residuals (negative gradients of the loss function)

Each tree focuses on correcting the errors of the current model

GBTs often outperform other ensemble methods (like random forests) on structured data

GBT can optimize any differentiable loss function (regression,…)

Shallow trees (weak learners) and shrinkage (learning rate) helps control overfitting, especially with early stopping

GBTs provide feature importance scores, helping feature selection and interpretability

GBT can be combined with techniques like class weighting or custom loss functions to handle imbalanced datasets

XGBoost

Unlike gradient boosting that works as gradient descent in function space

XGBoost uses a second order Taylor approximation in the loss function

Interpretability is often sacrificed in these boosting methods

XGBoost Algorithm

Initialize model with a constant value: \({\widehat {f}}_{(0)}(x)={\underset {\theta }{\arg \min }}\sum _{i=1}^{N}L(y_{i},\theta)\)

For m = 1 to M:

  1. Compute the ‘gradients’ and ‘hessians’ \[{\widehat {g}}_{m}(x_{i})=\left[{\frac {\partial L(y_{i},f(x_{i}))}{\partial f(x_{i})}}\right]_{f(x)={\widehat {f}}_{(m-1)}(x)} \mbox{ and }\quad {\widehat {h}}_{m}(x_{i})=\left[{\frac {\partial ^{2}L(y_{i},f(x_{i}))}{\partial f(x_{i})^{2}}}\right]_{f(x)={\widehat {f}}_{(m-1)}(x)}\]

  2. Fit a base learner (or weak learner, e.g. tree) using the training set \(\left\{x_{i},{\dfrac {{\widehat {g}}_{m}(x_{i})}{{\widehat {h}}_{m}(x_{i})}}\right\}_{i=1}^{N}\) by solving the optimization problem below: \({\widehat {\phi }}_{m}={\underset {\phi \in \mathbf {\Phi } }{\arg \min }}\sum _{i=1}^{N}{\frac {1}{2}}{\widehat {h}}_{m}(x_{i})\left[\phi (x_{i})-{\frac {{\widehat {g}}_{m}(x_{i})}{{\widehat {h}}_{m}(x_{i})}}\right]^{2}\)

  3. Update the model: \({\widehat{f}}_{(m)}(x)={\widehat {f}}_{(m-1)}(x)-\nu {\widehat {\phi }}_{m}(x)\)

  4. Output \({\widehat {f}}(x)={\widehat {f}}_{(M)}(x)=\sum _{m=0}^{M}{\widehat {f}}_{m}(x)\)

Practical Example in R

Let’s compare these methods in R

Recap

Gradient boosting is a general algorithmic framework for building predictive models.
It works by sequentially adding weak learners, each of which corrects the errors of the previous ensemble.

The “weak learner” can be any model (e.g., linear regression, logistic regression, decision trees, etc.), as long as you can fit it to the pseudo-residuals (negative gradients of the loss function).

When people say “gradient boosting with trees,” they mean that the weak learners are decision trees (usually shallow trees, like stumps or trees with 3–8 leaves).

The most common implementation because trees are:
Flexible: can model non-linear relationships and interactions)
Robust to outliers and missing values  Require little feature engineering.

Examples: gbm, xgboost, lightgbm, and catboost all use trees as weak learners.