Random Forest Regression — Interactive Lab

A pedagogical companion to Introduction to Machine Learning: Random Forest Regression

Random Forest: many trees, one prediction

Can satellite imagery predict how well a Bolivian municipality is developing? The post trains a Random Forest regressor on 64-dimensional satellite-image embeddings to predict the Municipal Sustainable Development Index (IMDS). The headline result: the model explains about 23% of IMDS variance (R² = 0.23, RMSE = 6.52, MAE = 4.72 on the held-out test set). That is meaningful signal, and also a hard ceiling. Tuning hyperparameters barely moves the needle: the baseline (R² = 0.2307) and the tuned forest (R² = 0.2297) differ by 0.001 on the test set, far too little to tell apart on 68 held-out municipalities.

This app turns the post's three central ideas into knobs you can move. Sweep sparsity and signal to see why averaging many noisy learners wins on tabular data. Compare a flexible Random Forest-style fit against a linear OLS baseline on a known non-linear DGP. And explore the post's real feature-importance ranking from 339 Bolivian municipalities, where A59, A42, and A26 emerge as the top embedding dimensions under permutation importance.

Why averaging tames variance — L1 vs L2 as a sketch

A single decision tree is high-variance: nudge the training data and the tree's split locations jump. A forest of B trees, each trained on a bootstrap resample, averages those jumps away. The animation below illustrates a related idea with penalised regression: heavy penalisation (L1, orange) zeroes coefficients hard; lighter shrinkage (L2, steel-blue) compresses them smoothly. Bagging works in a similar spirit: many noisy trees, each fit to a slightly different resample, are averaged into a lower-variance prediction. Poll many noisy jurors and their individual errors tend to cancel.

The animation uses LASSO/Ridge as a sketch metaphor for "many small averagings tame variance". A Random Forest with n_estimators = 500 grows 500 different trees on 500 bootstrap samples of the 271 training municipalities and averages their predictions. Sliding the penalty up here is loosely analogous to regularising each tree more strongly, for example by capping its depth.
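
The variance claim can be checked with a few lines of simulation. The sketch below is a toy under stated assumptions, not the post's pipeline: it swaps the decision tree for a 1-nearest-neighbour learner (another low-bias, high-variance model), uses a made-up sine-wave DGP, and compares how much the prediction at a single point wobbles across fresh training sets, with and without bagging. All names and parameter values are illustrative.

```python
import math
import random
import statistics


def true_f(x):
    # Hypothetical smooth signal; any non-linear function would do.
    return math.sin(2 * math.pi * x)


def draw_training_set(rng, n_rows, noise_sd):
    xs = [rng.random() for _ in range(n_rows)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def one_nn_predict(xs, ys, x0):
    """High-variance base learner: return the y of the training x closest to x0."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bagged_predict(rng, xs, ys, x0, n_models):
    """Average the base learner over bootstrap resamples of one training set."""
    n_rows = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # rows drawn with replacement
        preds.append(one_nn_predict([xs[i] for i in idx], [ys[i] for i in idx], x0))
    return statistics.fmean(preds)


rng = random.Random(0)
x0, n_rows, noise_sd, n_models, n_repeats = 0.5, 40, 0.5, 25, 200

single, bagged = [], []
for _ in range(n_repeats):
    xs, ys = draw_training_set(rng, n_rows, noise_sd)  # a fresh training set each time
    single.append(one_nn_predict(xs, ys, x0))
    bagged.append(bagged_predict(rng, xs, ys, x0, n_models))

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"variance of a single learner: {var_single:.3f}")
print(f"variance of the bagged mean : {var_bagged:.3f}")
```

With these settings the bagged prediction should vary noticeably less than the single learner. A real forest additionally draws a random feature subset at each split, which decorrelates the trees and lets the averaging go further.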

Tab 2

Sparsity Lab

Slide n, p, and signal. Watch how a flexible forest-style fit handles many candidate features when only a few truly matter — the same regime the satellite embeddings sit in (64 features, 271 training rows).

Tab 3

RF vs Linear Showdown

Same data, two models. See when a flexible non-linear fit beats a linear baseline — and when the gain vanishes. Run 100 simulations to see the bias-variance picture.

Tab 4

Feature Importance

The post's Bolivia results, interactively. Toggle outcomes (R², RMSE, MAE) and methods (Baseline vs Tuned RF) to compare. Hover for SEs, CIs, and the number of estimators used.

Glossary (open a card if a term is unfamiliar)

Random Forest
An ensemble of decorrelated decision trees. Bagging + random feature subsets at each split. Prediction is the average of all B trees.
Decision tree
Recursive binary splits on features. Each leaf gives a prediction. High-variance individually; that is what averaging is for.
Bagging
Bootstrap aggregating. Train B models on B resamples of the data; average their predictions. Reduces variance without inflating bias.
Train/test split
Hold out a portion of data for honest evaluation. In the post: 271 train, 68 test out of 339 municipalities.
Cross-validation
K-fold rotating exam on the training set. Provides a stable estimate of model performance. 5-fold CV R² in the post: 0.2526 (+/- 0.0728).
R², RMSE, MAE
R² is fraction-of-variance-explained; RMSE penalises large errors; MAE is average absolute error in target units. Post: 0.23, 6.52, 4.72.
Feature importance (MDI)
Mean Decrease in Impurity — how much each feature reduces error across all splits. Fast, but biased toward continuous features.
Permutation importance
Shuffle one feature, measure drop in R². Less biased than MDI; computed on the test set. A59 wins in the post.
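
For readers who prefer code to definitions, the three metrics in the glossary can be written out directly. This is a minimal sketch on four made-up numbers (not the post's data); the function names are illustrative.

```python
import math


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses count more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen only to make the arithmetic easy to follow.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]

print(f"R2   = {r_squared(y_true, y_pred):.3f}")
print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"MAE  = {mae(y_true, y_pred):.3f}")
```

RMSE is never smaller than MAE on the same predictions; the gap between them (6.52 vs 4.72 in the post) grows when a few errors are much larger than the rest.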

Sparsity Lab — when only a few features matter

The post's setup has 64 candidate features (satellite embeddings) and only 339 observations. Most of the predictive signal is concentrated in a handful of dimensions — the rest are weakly informative or noise. Drag the penalty slider and watch coefficients shrink to exactly zero, one at a time. The surviving features mimic the role of "important features" in a Random Forest: those whose impurity contribution exceeds a threshold.

Controls

  • n: in the post, training n = 271 municipalities.
  • p: the post has p = 64 satellite-embedding dimensions. About 15% have a true nonzero effect; the rest are noise.
  • signal: magnitude of the truly relevant coefficients relative to noise.
  • penalty: slide left for less shrinkage (more features survive); right for more.

Readouts

  • features kept (|I|), out of the p candidates
  • α̂ from raw LASSO: shrunk toward zero
  • α̂ from post-OLS: refit on the selected support
  • true α = 0.50, held fixed for comparison
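
Why do coefficients hit exactly zero, and why does the post-OLS estimate differ from the raw LASSO one? A minimal sketch, assuming an orthonormal design (so the LASSO solution reduces to soft-thresholding each OLS coefficient separately). The coefficient values below are hypothetical, and this is not necessarily how the app computes its readouts.

```python
def soft_threshold(coef, penalty):
    """LASSO solution for one coefficient under an orthonormal design:
    move it toward zero by `penalty`, and pin it at zero if it would cross."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


# Hypothetical OLS coefficients: a few strong ones and a tail of near-noise.
ols_coefs = [0.50, -0.42, 0.31, 0.08, -0.05, 0.03, -0.02, 0.01]


def count_kept(penalty):
    """Number of coefficients that survive the penalty (the |I| readout)."""
    return sum(1 for b in ols_coefs if soft_threshold(b, penalty) != 0.0)


for penalty in (0.0, 0.04, 0.10, 0.35):
    raw_lasso = [soft_threshold(b, penalty) for b in ols_coefs]
    # Post-OLS: refit without penalty on the survivors. With an orthonormal
    # design this simply restores the unshrunk OLS value for each survivor.
    post_ols = [b if s != 0.0 else 0.0 for b, s in zip(ols_coefs, raw_lasso)]
    print(f"penalty={penalty:.2f}  kept={count_kept(penalty)}  "
          f"raw first coef={raw_lasso[0]:.2f}  post-OLS first coef={post_ols[0]:.2f}")
```

As the penalty grows, the tail drops out first and the leading coefficient's raw estimate drifts further below 0.50, while the post-OLS refit stays at 0.50. That is the gap between the "shrunk toward zero" and "refit on selected support" readouts.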

What to look for

  • Sparsity grows with the penalty. Slide right: more features are pinned to zero. Slide left: more re-enter. This mirrors how a Random Forest's MDI importance gives a long-tailed ranking — only a handful of features dominate.
  • The post's permutation-importance plot shows the same shape: A59, A42, A26 carry most of the predictive load; the remaining 50+ dimensions contribute marginally. The forest, like LASSO, finds a low-dimensional core.
  • With p ≈ n / 4 (the post's regime: 64 features, 271 training rows), both LASSO and Random Forest behave well. With p approaching n, even the forest struggles, which is a warning for tabular data with many weak features.

RF vs Linear Showdown — flexible vs. structured

Same simulated data. The post's partial-dependence plots reveal threshold effects — IMDS rises sharply at low values of an embedding dimension then plateaus. Random Forests capture these non-linearities; OLS does not. Here the "Rigorous" (theory-driven) penalty stands in for a conservative, structured fit and the "CV" (data-driven) penalty for a more flexible one. The two answers can diverge — the same way an RF and an OLS can disagree when the truth is non-linear.

Controls

  • n: capped at 300 so the "Run 100 sims" button finishes quickly.
  • p: capped at 50 for the 100-sim run.
  • signal: magnitude of the truly relevant coefficients.
  • asymmetry: 0 = symmetric features · 1 = treatment dominates. Stand-in for non-linearity.

Structured (Linear / OLS-like)

Theory-driven penalty (Belloni et al. 2012)

Readouts: α̂, SE(α̂), |I_y|, |I_d|, union |I_y ∪ I_d|, λ_y, λ_d

Flexible (RF-like)

Cross-validated penalty (lambda.min)

Readouts: α̂, SE(α̂), |I_y|, |I_d|, union |I_y ∪ I_d|, λ_y, λ_d

What to look for

  • When the truth is linear (low asymmetry, low signal), both fits agree. The structured / OLS-like answer wins on standard error. Random Forest provides no advantage on truly linear DGPs.
  • When the truth is non-linear or sparse (high signal, high asymmetry), the flexible fit can drift. Random Forests pay for flexibility with variance — exactly what bagging is designed to control.
  • The post's lesson: on the Bolivia data, the tuned RF gained nothing over the baseline on the test set (R² 0.2297 tuned vs 0.2307 baseline). The performance ceiling came from the features, not from model flexibility.

Bias vs. variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different randomness) to see whether the bias is systematic.
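
What "systematic bias" looks like over repeated draws can be sketched in a few lines. The example assumes a one-regressor DGP with true α = 0.50 and uses a soft-thresholded OLS slope as a stand-in for a raw penalised estimate. It is a simplified illustration with hypothetical settings, not the app's pipeline.

```python
import random
import statistics

TRUE_ALPHA = 0.50  # same value as the "true α" readout


def ols_slope(ds, ys):
    mean_d, mean_y = statistics.fmean(ds), statistics.fmean(ys)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(ds, ys))
    sxx = sum((d - mean_d) ** 2 for d in ds)
    return sxy / sxx


def soft_threshold(coef, penalty):
    """Shrink toward zero by `penalty`; stand-in for a raw LASSO estimate."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


def one_simulation(rng, n_rows, penalty):
    """One fresh draw of the data, returning (unshrunk, shrunk) estimates of alpha."""
    ds = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    ys = [TRUE_ALPHA * d + rng.gauss(0.0, 1.0) for d in ds]
    unshrunk = ols_slope(ds, ys)
    return unshrunk, soft_threshold(unshrunk, penalty)


rng = random.Random(42)
draws = [one_simulation(rng, n_rows=100, penalty=0.10) for _ in range(100)]
unshrunk = [u for u, _ in draws]
shrunk = [s for _, s in draws]

bias_unshrunk = statistics.fmean(unshrunk) - TRUE_ALPHA
bias_shrunk = statistics.fmean(shrunk) - TRUE_ALPHA
print(f"unshrunk: bias={bias_unshrunk:+.3f}  sd={statistics.pstdev(unshrunk):.3f}")
print(f"shrunk  : bias={bias_shrunk:+.3f}  sd={statistics.pstdev(shrunk):.3f}")
```

Any single run of either estimator can land above or below 0.50. Only the average over many runs reveals that the shrunk estimate sits consistently below the truth while the unshrunk one centres on it.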

The post's results — interactively

These numbers come from ml_rf_results.csv in the post's folder — the baseline vs tuned comparison across three metrics (R², RMSE, MAE). Toggle outcomes and methods to compare. The bar chart underneath ranks the top-20 embedding dimensions by permutation importance — A59, A42, A26 lead, but importance is spread broadly across dimensions, exactly as the post notes.

What to look for

  • Baseline and Tuned RF are essentially tied on the test set. R² is 0.2307 vs 0.2297; RMSE is identical at 6.52; MAE differs by 0.04. The tuning helped on cross-validation (0.2526 → 0.2721) but the gain didn't transfer to the specific 68-municipality test set.
  • Toggle CV vs test outcomes. The CV bars (gold/teal) sit a bit higher than the test bars (steel) — the test set is small (68 obs) and noisy, and a single 20% holdout is exactly the regime where CV gives the steadier read.
  • The bottom selection bars show all 64 features in play for both methods — Random Forest does not zero features out the way LASSO does. The "importance" ranking does the analogous job: a soft, post-hoc ordering.
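
The split sizes behind these bars follow from simple arithmetic. A minimal sketch of the 20% holdout and the 5-fold rotation on the training rows; the helper name is illustrative, and a real pipeline would typically shuffle rows before splitting.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, val_idx) pairs so that each row is validated exactly once."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over early folds
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


n_total = 339
n_test = round(0.20 * n_total)  # held-out municipalities
n_train = n_total - n_test      # rows available for training and cross-validation

folds = list(k_fold_indices(n_train, n_folds=5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(f"train={n_train}, test={n_test}, CV validation fold sizes={fold_sizes}")
```

Each CV score is an average over five validation folds that together cover all 271 training rows, whereas the test score rests on one draw of 68 rows. That difference is why the CV figure is the steadier of the two.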

Outcomes

Methods

Why does tuning gain so little?

RandomizedSearchCV improved the CV R² from 0.2526 to 0.2721, a gain of about 2 points of R². But on the held-out test set the gain disappears (0.2307 → 0.2297). This is a common pattern with small datasets (n = 271 train, 68 test). The CV gain of 0.0195 is well inside the ±0.0728 spread reported for the baseline CV score, so the winning configuration may simply be the sampled setting that happened to fit the training folds' idiosyncrasies best. On a different random 20% holdout, the gains would look different again. The ceiling here is structural: satellite embeddings only carry so much development-related signal, and no amount of forest tuning can manufacture more.
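
How noisy is an R² computed on 68 rows? The simulation below gives a rough sense of scale. It assumes a Gaussian toy DGP whose population R² is 0.23 and a model that predicts the signal perfectly, both of which are simplifications rather than properties of the post's data.

```python
import math
import random
import statistics


def simulated_test_r2(rng, n_test, true_r2):
    """Draw one test set of size n_test and return the R² measured on it."""
    signal_sd = math.sqrt(true_r2)       # predictable part of the target
    noise_sd = math.sqrt(1.0 - true_r2)  # unpredictable part
    preds = [rng.gauss(0.0, signal_sd) for _ in range(n_test)]
    ys = [p + rng.gauss(0.0, noise_sd) for p in preds]
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(2024)
r2_draws = [simulated_test_r2(rng, n_test=68, true_r2=0.23) for _ in range(2000)]

mean_r2 = statistics.fmean(r2_draws)
sd_r2 = statistics.pstdev(r2_draws)
observed_gap = abs(0.2297 - 0.2307)  # tuned vs baseline test R² in the post

print(f"mean test R² across draws: {mean_r2:.3f}")
print(f"sd of test R² across draws: {sd_r2:.3f}")
print(f"observed tuned-vs-baseline gap: {observed_gap:.4f}")
```

Under these assumptions the holdout-to-holdout spread in R² is far larger than the 0.001 gap between the two forests, which is consistent with reading the test-set comparison as a tie.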

Connecting back to Tab 2

The Sparsity Lab showed a long-tailed coefficient ranking: a handful of features dominate, the rest contribute marginally. The post's permutation-importance plot shows the same shape on real data:

  • Permutation top 3: A59, A42, A26. A59 and A42 also appear in the MDI top 3, so the two methods agree on the leaders.
  • MDI top 3: A30, A59, A42 — A30 ranks higher under MDI but drops considerably under permutation importance. That's the classic MDI bias toward high-cardinality / continuous features the post warns about.
  • Combined effective dimensionality: 6–8 features explain most of the model's predictive power; the remaining 56+ contribute marginally.

The takeaway from the post (§ Feature Importance) is therefore visible twice: once on a controlled simulation where sparsity is a parameter you set, and once on the original 339-municipality panel where it is the empirical pattern.
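
Permutation importance itself is short enough to write out. A minimal sketch, assuming a hypothetical fixed "model" (a hand-written function standing in for a fitted forest) and simulated data with three features, one of which the model ignores. The names here are illustrative and are not the post's code.

```python
import random
import statistics


def r_squared(y_true, y_pred):
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_drop_in_r2(predict, rows, ys, n_repeats, rng):
    """For each feature: shuffle its column, re-score, and average the drop in R²."""
    baseline = r_squared(ys, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - r_squared(ys, [predict(r) for r in shuffled]))
        importances.append(statistics.fmean(drops))
    return importances


def toy_model(row):
    # Stand-in for a fitted model: leans on feature 0, a little on 1, ignores 2.
    return 2.0 * row[0] + 0.5 * row[1]


rng = random.Random(7)
rows = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(300)]
ys = [toy_model(r) + rng.gauss(0.0, 0.5) for r in rows]

importances = permutation_drop_in_r2(toy_model, rows, ys, n_repeats=10, rng=rng)
for j, value in enumerate(importances):
    print(f"feature {j}: mean drop in R² = {value:.3f}")
```

A feature the model never uses gets an importance of exactly zero, because shuffling it leaves every prediction unchanged. That makes the method a direct read on what the fitted model relies on, which MDI, computed from training-time splits, does not guarantee.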