Tree Ensembles
Definition
Tree ensembles combine many decision trees to produce predictions that are more accurate and robust than any individual tree. The two dominant approaches are bagging (random forests) and boosting (gradient boosting machines).
Intuition
A single decision tree is unstable: small data changes lead to very different trees. Ensembling averages out this variance. Bagging does this in parallel by training many trees on bootstrap samples; boosting does it sequentially by having each tree correct the errors of its predecessors.
Formal Description
Decision Tree
Greedily partitions feature space by choosing splits that maximise the decrease in an impurity criterion:
Gini impurity (for classification): $G = 1 - \sum_k p_k^2$, where $p_k$ is the proportion of class $k$ in the node.
Entropy: $H = -\sum_k p_k \log_2 p_k$.
MSE (for regression): choose the split minimising the weighted average of child variances.
The algorithm recurses until a stopping criterion (max depth, min samples per leaf, min impurity decrease).
Key hyperparameters: max_depth, min_samples_leaf, max_features.
Inductive bias: axis-aligned splits, so an oblique linear decision boundary in general position must be approximated by a staircase of many splits.
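The two classification criteria can be sketched in a few lines of Python (gini and entropy here are illustrative helpers written for this note, not any library's API):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum_k p_k * log2(p_k) over class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure leaf has zero impurity; a 50/50 binary node is maximally impure.
print(gini([1, 1, 1, 1]))     # 0.0
print(gini([0, 0, 1, 1]))     # 0.5
print(entropy([0, 0, 1, 1]))  # 1.0
```

A split is scored by the parent impurity minus the sample-weighted mean of the two child impurities.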
Random Forest (Bagging)
For $b = 1, \dots, B$ trees:
- Sample $n$ rows with replacement (bootstrap sample).
- At each split, consider only $m$ random features (typically $m \approx \sqrt{p}$ for classification, $p/3$ for regression).
- Grow tree to maximum depth (no pruning).
- Aggregate: majority vote (classification) or mean (regression).
Variance reduction: averaging $B$ i.i.d. trees each with variance $\sigma^2$ gives variance $\sigma^2 / B$ if the trees are uncorrelated; with pairwise correlation $\rho$ the ensemble variance is $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$, so decorrelating trees matters as much as adding more of them. Feature subsampling decorrelates trees.
Out-of-bag (OOB) error: each sample is out-of-bag for ~37% of trees, since the probability of being excluded from a bootstrap sample of size $n$ is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$; this provides free internal validation.
Feature importance: mean decrease in impurity (MDI) or mean decrease in accuracy via permutation. MDI is fast but biased toward high-cardinality features; permutation importance is more reliable.
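The ~37% OOB figure is easy to verify empirically; the sketch below (arbitrary choices of n and B) just counts how often each row is left out of a bootstrap sample:

```python
import random

random.seed(0)
n, B = 1000, 500          # rows and number of bootstrap samples (trees)
oob_counts = [0] * n      # times each row is out-of-bag

for _ in range(B):
    in_bag = set(random.randrange(n) for _ in range(n))  # bootstrap sample
    for i in range(n):
        if i not in in_bag:
            oob_counts[i] += 1

# Expect roughly (1 - 1/n)^n ≈ e^-1 ≈ 0.37 of trees to miss each row.
frac = sum(oob_counts) / (n * B)
print(round(frac, 3))
```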
Gradient Boosting
Sequentially builds trees, where tree $m$ fits the pseudo-residuals of the current ensemble:
$F_m(x) = F_{m-1}(x) + \eta \, h_m(x),$
where $h_m$ is fit to $r_{im} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}$ (the negative gradient of the loss).
For MSE loss: pseudo-residuals $r_{im} = y_i - F_{m-1}(x_i)$ (the true residuals). For log-loss: pseudo-residuals $r_{im} = y_i - p_i$ with $p_i = \sigma(F_{m-1}(x_i))$ (the error in probability space).
Shrinkage (learning rate) $\eta$: a smaller $\eta$ requires more trees but generalises better.
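The whole loop can be sketched from scratch for squared loss, using regression stumps as the weak learner (fit_stump and the toy 1-D data are illustrative, not any library's API):

```python
# Toy 1-D regression problem: y = x^2, noise-free.
X = [i / 10 for i in range(-20, 21)]
y = [x * x for x in X]

def fit_stump(X, y):
    """Best single-split regression stump minimising squared error."""
    best = None
    for t in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= t]
        right = [yi for xi, yi in zip(X, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

eta, M = 0.3, 200               # shrinkage and number of trees
F = [sum(y) / len(y)] * len(X)  # initial prediction: the mean
ensemble = []
for _ in range(M):
    residuals = [yi - fi for yi, fi in zip(y, F)]   # MSE pseudo-residuals
    h = fit_stump(X, residuals)                     # fit negative gradient
    F = [fi + eta * h(xi) for fi, xi in zip(F, X)]  # shrunken update
    ensemble.append(h)

mse = sum((yi - fi) ** 2 for yi, fi in zip(y, F)) / len(X)
print(mse)  # small: the additive ensemble has fit the curve
```

Swapping the residual computation for $y_i - p_i$ (and adding a sigmoid) gives the log-loss version.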
XGBoost
XGBoost adds a second-order (Newton) approximation and explicit regularisation:
$\mathcal{L}^{(m)} \approx \sum_i \left[ g_i f_m(x_i) + \tfrac{1}{2} h_i f_m(x_i)^2 \right] + \Omega(f_m),$
where $g_i = \partial_{\hat{y}_i} L(y_i, \hat{y}_i)$, $h_i = \partial_{\hat{y}_i}^2 L(y_i, \hat{y}_i)$ (both evaluated at the previous prediction $\hat{y}_i = F_{m-1}(x_i)$), and $\Omega(f) = \gamma T + \tfrac{\lambda}{2} \sum_{j=1}^{T} w_j^2$ penalises tree complexity ($T$ leaves with weights $w_j$).
Optimal leaf weight: $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$; optimal split gain:
$\text{Gain} = \tfrac{1}{2} \left[ \dfrac{G_L^2}{H_L + \lambda} + \dfrac{G_R^2}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$
with $G = \sum_{i \in I} g_i$ and $H = \sum_{i \in I} h_i$ over each candidate child.
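These two formulas are simple enough to evaluate directly; the helpers below (hypothetical names, written for this note) compute them from per-sample gradient and Hessian lists:

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain = 1/2 [GL^2/(HL+λ) + GR^2/(HR+λ) - (GL+GR)^2/(HL+HR+λ)] - γ."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Children pulling in opposite directions yield a large positive gain:
gL, hL = [-1.0, -0.8], [1.0, 1.0]
gR, hR = [0.9, 1.1], [1.0, 1.0]
print(leaf_weight(gL, hL))          # -(-1.8) / (2 + 1) = 0.6
print(split_gain(gL, hL, gR, hR))
```

Note how $\lambda$ shrinks leaf weights toward zero and $\gamma$ sets a minimum gain below which a split is not worth making.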
LightGBM
LightGBM uses GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for speed, plus leaf-wise (best-first) tree growth instead of level-wise:
- Leaf-wise growth finds deeper, more complex splits that can overfit on small datasets; mitigate with min_child_samples and num_leaves.
- LightGBM is typically 10–20× faster than vanilla GBDT for large datasets.
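As a starting point, a plausible (untuned, illustrative) parameter set for the scikit-learn-style LGBMClassifier/LGBMRegressor interface might look like:

```python
# Illustrative LightGBM parameters; the names match LightGBM's
# scikit-learn-style API, but the values are starting points, not tuned.
lgbm_params = {
    "n_estimators": 1000,
    "learning_rate": 0.05,
    "num_leaves": 63,          # leaf-wise growth: caps model complexity
    "min_child_samples": 50,   # guards against overfitting small leaves
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}
print(sorted(lgbm_params))
```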
Hyperparameter Tuning Guidance
| Hyperparameter | Effect | Range |
|---|---|---|
| n_estimators | More trees = better, up to diminishing returns | 100–5000 |
| max_depth (XGB/RF) | Depth of individual trees | 3–8 |
| num_leaves (LGB) | Max leaves (leaf-wise) | 31–255 |
| learning_rate (GBM) | Shrinkage; lower = better generalisation | 0.01–0.3 |
| subsample | Row subsampling per tree (stochastic GBM) | 0.5–1.0 |
| colsample_bytree | Column subsampling per tree | 0.5–1.0 |
| min_child_weight | Min sum of Hessians in a leaf (regularisation) | 1–300 |
| reg_alpha / reg_lambda | L1/L2 regularisation on leaf weights | 0–10 |
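One practical way to use these ranges is random search; the search space below is a hypothetical sketch (log-uniform for the learning rate, since its effect is multiplicative; uniform elsewhere):

```python
import math
import random

random.seed(0)

def sample_config():
    """Draw one random configuration from the ranges in the table above."""
    return {
        "n_estimators": random.randint(100, 5000),
        "max_depth": random.randint(3, 8),
        "learning_rate": 10 ** random.uniform(-2, math.log10(0.3)),
        "subsample": random.uniform(0.5, 1.0),
        "colsample_bytree": random.uniform(0.5, 1.0),
        "min_child_weight": random.randint(1, 300),
        "reg_lambda": random.uniform(0, 10),
    }

cfg = sample_config()
print(cfg)
```

Each sampled configuration would then be scored by cross-validation and the best retained.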
Applications
- Tabular ML competitions: GBMs dominate (XGBoost, LightGBM, CatBoost)
- Insurance: claim frequency/severity prediction; risk segmentation
- Fraud detection: high recall at low FPR
- Feature importance for regulatory model documentation
Trade-offs
| Model | Pros | Cons |
|---|---|---|
| Decision tree | Interpretable, fast | High variance, low accuracy |
| Random forest | Low variance, robust, OOB estimate | Slower inference, less accurate than GBM |
| GBM (XGB/LGB) | State-of-the-art tabular accuracy | Slow training, many hyperparameters, overfits on small datasets |