Random Forest: many trees, one prediction
Can satellite imagery predict how well a Bolivian municipality is developing? The post trains a Random Forest regressor on 64-dimensional satellite-image embeddings to predict the Municipal Sustainable Development Index (IMDS). The headline result: the model explains about 23% of IMDS variance (R² = 0.23, RMSE = 6.52, MAE = 4.72 on the held-out test set). That is meaningful signal, but it is also a hard ceiling. Tuning hyperparameters barely moves the needle: the baseline (R² = 0.2307) and the tuned forest (R² = 0.2297) differ by only 0.001 in test R², far too small a gap to tell apart on a 68-municipality test set.
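For readers who want the three headline metrics spelled out, here is a minimal sketch of how R², RMSE, and MAE are computed from held-out predictions. The function name and the four IMDS-like values are made up for illustration; they are not taken from the post.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, MAE) for paired lists of observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # variation the model leaves unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae

# Made-up values for illustration only (NOT data from the post).
observed = [50.0, 55.0, 60.0, 65.0]
predicted = [52.0, 54.0, 63.0, 61.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2={r2:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")  # R2=0.76  RMSE=2.74  MAE=2.50
```

Read this way, an R² of 0.23 says the residual sum of squares is still about 77% of the total variation in IMDS.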
This app turns the post's three central ideas into knobs you can move. Sweep sparsity and signal to see why averaging many noisy learners wins on tabular data. Compare a flexible Random Forest-style fit against a linear OLS baseline on a known non-linear data-generating process (DGP). And explore the post's real feature-importance ranking from 339 Bolivian municipalities, where A59, A42, and A26 emerge as the top embedding dimensions under permutation importance.
Why averaging tames variance — L1 vs L2 as a sketch
A single decision tree is high-variance: nudge the training data and the tree's split locations jump. A forest of B trees, each trained on a bootstrap resample, averages those jumps away. The animation below illustrates a related idea with penalised regression: the L1 penalty (orange) pushes coefficients to exactly zero, while the L2 penalty (steel-blue) shrinks them smoothly toward zero without eliminating them. The bagging analogy is loose but useful: many noisy, individually unstable trees are averaged into a lower-variance prediction. Average many noisy juries and their individual errors tend to cancel.
The animation uses LASSO/Ridge as a sketch metaphor for "many small averagings tame variance". A Random Forest with n_estimators = 500 grows 500 different trees on 500 bootstrap samples of the 271 training municipalities and averages their predictions. Sliding the penalty up here is loosely analogous to regularising each tree more strongly, for example by capping its depth.
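The variance argument can be checked with a toy simulation. The sketch below is not a Random Forest: it assumes each "learner" returns the truth plus independent Gaussian noise, which is an idealisation. Real bagged trees are positively correlated, so their variance falls more slowly than 1/B. Function names and settings are illustrative.

```python
import random
import statistics

def averaged_prediction(n_learners, noise_sd, rng, truth=0.0):
    """Average n_learners noisy guesses of the same true value."""
    return statistics.fmean(truth + rng.gauss(0.0, noise_sd) for _ in range(n_learners))

def spread_of_average(n_learners, noise_sd=1.0, n_repeats=2000, seed=0):
    """Standard deviation of the averaged prediction across repeated experiments."""
    rng = random.Random(seed)
    draws = [averaged_prediction(n_learners, noise_sd, rng) for _ in range(n_repeats)]
    return statistics.pstdev(draws)

spread_1 = spread_of_average(1)      # a single noisy learner
spread_100 = spread_of_average(100)  # an average of 100 independent learners
print(f"sd with 1 learner:    {spread_1:.3f}")
print(f"sd with 100 learners: {spread_100:.3f}")
```

Under the independence assumption the spread shrinks roughly with the square root of the number of learners, so 100 learners give about a tenth of the single-learner spread.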
Sparsity Lab
Slide n, p, and signal. Watch how a flexible forest-style fit handles many candidate features when only a few truly matter — the same regime the satellite embeddings sit in (64 features, 271 training rows).
RF vs Linear Showdown
Same data, two models. See when a flexible non-linear fit beats a linear baseline — and when the gain vanishes. Run 100 simulations to see the bias-variance picture.
Feature Importance
The post's Bolivia results, interactively. Toggle outcomes (R², RMSE, MAE) and methods (Baseline vs Tuned RF) to compare. Hover for SEs, CIs, and the number of estimators used.
Glossary (open a card if a term is unfamiliar)
Random Forest
Decision tree
Bagging
Train/test split
Cross-validation
R², RMSE, MAE
Feature importance (MDI)
Permutation importance
Sparsity Lab — when only a few features matter
The post's setup has 64 candidate features (satellite embeddings) and only 339 observations, 271 of which are used for training. Most of the predictive signal is concentrated in a handful of dimensions; the rest are weakly informative or noise. Drag the penalty slider and watch coefficients shrink to exactly zero, one at a time. The surviving features play a role similar to the "important features" in a Random Forest: those whose impurity contribution stands out from the rest.
What to look for
- Sparsity grows with the penalty. Slide right: more features are pinned to zero. Slide left: more re-enter. This mirrors how a Random Forest's MDI importance gives a long-tailed ranking — only a handful of features dominate.
- The post's permutation-importance plot shows the same shape: A59, A42, and A26 lead the ranking, and most of the remaining dimensions contribute only marginally. The forest, like LASSO, finds a low-dimensional core.
- With p ≈ n / 5 (the post's regime), both LASSO and Random Forest behave well. With p approaching n, even the forest struggles — a warning for tabular data with many weak features.
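The "pinned to exactly zero" behaviour has a simple closed form in one special case. Assuming an orthonormal design (an assumption made here for illustration; it will not hold for the satellite embeddings), the LASSO solution soft-thresholds each least-squares coefficient, while Ridge only rescales it. The coefficient values and penalty grid below are hypothetical.

```python
def soft_threshold(coef, penalty):
    """LASSO coefficient under an orthonormal design:
    minimises 0.5 * (coef - b)**2 + penalty * |b| over b."""
    magnitude = max(abs(coef) - penalty, 0.0)
    return magnitude if coef >= 0 else -magnitude

def ridge_shrink(coef, penalty):
    """Ridge coefficient under an orthonormal design:
    minimises 0.5 * (coef - b)**2 + 0.5 * penalty * b**2 over b."""
    return coef / (1.0 + penalty)

def count_zeros(coefs):
    return sum(1 for b in coefs if b == 0.0)

# Hypothetical least-squares coefficients: a few large, several small.
ols_coefs = [3.0, -1.5, 0.8, 0.3, -0.1, 0.05]
penalties = [0.0, 0.2, 1.0, 2.0]

lasso_zero_counts = []
ridge_zero_counts = []
for penalty in penalties:
    lasso = [soft_threshold(b, penalty) for b in ols_coefs]
    ridge = [ridge_shrink(b, penalty) for b in ols_coefs]
    lasso_zero_counts.append(count_zeros(lasso))
    ridge_zero_counts.append(count_zeros(ridge))
    print(f"penalty={penalty}: LASSO zeros={lasso_zero_counts[-1]}, Ridge zeros={ridge_zero_counts[-1]}")
```

As the penalty rises, LASSO zeroes the small coefficients first and the large ones last, while Ridge never produces an exact zero. That ordering is what makes the surviving set resemble an importance ranking.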
RF vs Linear Showdown — flexible vs. structured
Same simulated data. The post's partial-dependence plots reveal threshold effects — IMDS rises sharply at low values of an embedding dimension then plateaus. Random Forests capture these non-linearities; OLS does not. Here the "Rigorous" (theory-driven) penalty stands in for a conservative, structured fit and the "CV" (data-driven) penalty for a more flexible one. The two answers can diverge — the same way an RF and an OLS can disagree when the truth is non-linear.
Structured (Linear / OLS-like)
Theory-driven penalty (Belloni et al. 2012)
Flexible (RF-like)
Cross-validated penalty (lambda.min)
What to look for
- When the truth is linear (low asymmetry, low signal), both fits agree. The structured / OLS-like answer wins on standard error. Random Forest provides no advantage on truly linear DGPs.
- When the truth is non-linear or sparse (high signal, high asymmetry), the flexible fit can drift. Random Forests pay for flexibility with variance — exactly what bagging is designed to control.
- The post's lesson: on the Bolivia data, the tuned RF gained almost nothing over the baseline (test R² 0.2297 vs 0.2307). The performance ceiling came from the features, not from model flexibility.
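A minimal sketch of this comparison, assuming scikit-learn and NumPy are installed. The data-generating process is hypothetical, chosen to mimic the threshold-then-plateau shape described above; it is neither the app's simulation nor the post's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_features = 600, 5
X = rng.uniform(0.0, 1.0, size=(n_rows, n_features))

# Hypothetical threshold DGP: the outcome rises steeply in feature 0 up to 0.2,
# then plateaus. The other four features are pure noise.
y = 5.0 * np.minimum(X[:, 0], 0.2) + rng.normal(0.0, 0.1, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

r2_ols = r2_score(y_test, ols.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"OLS test R2:    {r2_ols:.3f}")
print(f"Forest test R2: {r2_forest:.3f}")
```

Swapping the outcome for a purely linear one, such as y = X[:, 0] plus noise, is a quick way to see the forest's advantage disappear.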
Bias vs. variance over many simulations
Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different randomness) to see whether the bias is systematic.
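The repeated-simulation idea can be sketched with a toy pair of estimators. This is an illustrative stand-in, not the app's pipeline: it assumes the target is a population mean and compares the plain sample mean with a deliberately shrunk version (multiplied by 0.8, a hypothetical choice). All parameter values are assumptions.

```python
import random
import statistics

def run_simulations(n_sims=100, n_obs=30, true_mean=2.0, noise_sd=1.0, shrink=0.8, seed=0):
    """Repeat the same estimation n_sims times with fresh random draws."""
    rng = random.Random(seed)
    plain, shrunk = [], []
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n_obs)]
        estimate = statistics.fmean(sample)
        plain.append(estimate)            # unbiased, more variable
        shrunk.append(shrink * estimate)  # biased toward zero, less variable
    summary = {}
    for name, draws in (("plain", plain), ("shrunk", shrunk)):
        summary[name] = {
            "bias": statistics.fmean(draws) - true_mean,  # systematic error
            "sd": statistics.stdev(draws),                # run-to-run noise
        }
    return summary

summary = run_simulations()
for name, stats in summary.items():
    print(f"{name:>6}: bias={stats['bias']:+.3f}  sd={stats['sd']:.3f}")
```

A bias that stays clearly away from zero across 100 runs is systematic; a bias that hovers near zero with a sizeable sd is just noise from a single draw.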
The post's results — interactively
These numbers come from ml_rf_results.csv in the post's folder: the baseline vs tuned comparison across three metrics (R², RMSE, MAE). Toggle outcomes and methods to compare. The bar chart underneath ranks the top-20 embedding dimensions by permutation importance. A59, A42, and A26 lead, but importance is spread broadly across dimensions, exactly as the post notes.
What to look for
- Baseline and Tuned RF are essentially tied on the test set. R² is 0.2307 vs 0.2297; RMSE is identical at 6.52; MAE differs by 0.04. The tuning helped on cross-validation (0.2526 → 0.2721) but the gain didn't transfer to the specific 68-municipality test set.
- Toggle CV vs test outcomes. The CV bars (gold/teal) sit a bit higher than the test bars (steel) — the test set is small (68 obs) and noisy, and a single 20% holdout is exactly the regime where CV gives the steadier read.
- The bottom selection bars show all 64 features in play for both methods — Random Forest does not zero features out the way LASSO does. The "importance" ranking does the analogous job: a soft, post-hoc ordering.
Why does tuning gain so little?
RandomizedSearchCV improved the CV R² from 0.2526 to 0.2721, a modest gain of about 0.02. But on the held-out test set the gain disappears (0.2307 → 0.2297). This is a common pattern with small datasets (n = 271 train, 68 test). The tuner samples a limited number of hyperparameter combinations and keeps the one with the best cross-validated score, so part of the winner's edge can reflect idiosyncrasies of the training folds rather than a real improvement. On a different random 20% holdout, the gains would look different again. The ceiling here is structural: satellite embeddings only carry so much development-related signal, and no amount of forest tuning can manufacture more.
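The baseline-versus-tuned workflow can be sketched as follows, assuming scikit-learn and NumPy are installed. The data are a synthetic stand-in with the post's shape (339 rows, 64 features), not the Bolivia data, and the search space, n_iter, and cv values are hypothetical because the post's actual settings are not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: two weakly informative features, 62 noise features.
X = rng.normal(size=(339, 64))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=339)

# A 20% holdout of 339 rows gives 271 training and 68 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

baseline = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Hypothetical search space, for illustration only.
param_distributions = {
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3, 5, 10],
    "max_features": [0.3, 0.5, 1.0],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=6,        # number of sampled configurations
    cv=3,            # folds used to score each configuration
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

print("best params:     ", search.best_params_)
print("best CV R2:      ", round(search.best_score_, 3))
print("baseline test R2:", round(baseline.score(X_test, y_test), 3))
print("tuned test R2:   ", round(search.best_estimator_.score(X_test, y_test), 3))
```

Changing random_state in train_test_split and rerunning is a direct way to see how much the test-set comparison moves from one 68-row holdout to another.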
Connecting back to Tab 2
The Sparsity Lab showed a long-tailed coefficient ranking: a handful of features dominate, the rest contribute marginally. The post's permutation-importance plot shows the same shape on real data:
- Permutation top 3: A59, A42, A26 (A59 and A42 also appear in the MDI top 3).
- MDI top 3: A30, A59, A42 — A30 ranks higher under MDI but drops considerably under permutation importance. That's the classic MDI bias toward high-cardinality / continuous features the post warns about.
- Combined effective dimensionality: 6–8 features explain most of the model's predictive power; the remaining 56+ contribute marginally.
The takeaway from the post (§ Feature Importance) is therefore visible twice: once on a controlled simulation where sparsity is a parameter you set, and once on the original 339-municipality panel where it is the empirical pattern.
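To reproduce this kind of MDI-versus-permutation comparison on your own data, a minimal sketch with scikit-learn could look like the following. The data are synthetic and the feature names F00 to F09 are hypothetical; they are not the post's embedding dimensions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = [f"F{j:02d}" for j in range(10)]  # hypothetical names
X = rng.normal(size=(400, 10))
# Only the first two features carry signal; the other eight are noise.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importance, computed from the training data during fitting.
mdi = forest.feature_importances_

# Permutation importance: drop in test-set score when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)

def top_k(scores, k=3):
    """Names of the k highest-scoring features, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [feature_names[j] for j in order]

print("MDI top 3:        ", top_k(mdi))
print("Permutation top 3:", top_k(perm.importances_mean))
```

Because permutation importance is measured on held-out rows, it is the more direct check on whether a feature that scores well under MDI actually helps prediction.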