Three Methods for Robust Variable Selection

BMA, LASSO, and WALS — graded against a known answer key

4 triple-robust predictors
85.7% LASSO / WALS sensitivity
4,096 candidate models

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

With 12 candidate drivers, \(2^{12}=4{,}096\) models give 4,096 different answers

You advise a government on climate policy. You have a dozen candidate drivers of CO\(_2\) emissions and a limited budget.

Run one regression and report it, and you have assumed the other 4,095 models are wrong. Which subset truly matters — and which are red herrings?
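The model count can be checked by brute force. A minimal sketch, assuming twelve placeholder regressor names (`x1` to `x12` are hypothetical labels, not the variables used in this deck):

```python
from itertools import combinations

# Placeholder names for the 12 candidate regressors (hypothetical labels).
candidates = [f"x{i}" for i in range(1, 13)]

# Every model is a subset of the candidates, including the empty model.
models = [
    subset
    for size in range(len(candidates) + 1)
    for subset in combinations(candidates, size)
]

n_models = len(models)
print(n_models)        # total number of candidate models
print(n_models - 1)    # models implicitly discarded when only one is reported
```

Each regressor is either in or out of a model, which is where the \(2^{12}\) comes from.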

We built an answer key: 7 true predictors, 5 pure-noise impostors

Noise variables (trade openness, tourism, credit) are deliberately correlated with GDP and other true predictors — the multicollinearity that makes naive OLS unreliable.

Naive OLS already flirts with spurious significance

0.98

\(R^2\) of the kitchen-sink OLS — a great fit that still cannot tell signal from noise

Where we’re going

  • The lab: 120 countries, 12 candidate regressors, a known answer key
  • BMA — average 4,096 models, read off Posterior Inclusion Probabilities
  • LASSO — an L1 penalty that zeroes weak controls automatically
  • WALS — fast frequentist averaging that returns t-statistics
  • The payoff: which variables survive all three

The Investigation

Act II

Three mechanically distinct answers to one question

BMA

  • Average all 4,096 models
  • Weight by posterior probability
  • Output: \(\Pr(M_k\mid y)\) → PIP

LASSO

  • One L1-penalized fit
  • Drives weak coefficients to zero
  • Output: a sparse subset

WALS

  • Frequentist averaging
  • Orthogonalize, then average
  • Output: t-statistics

Different machinery, same target — agreement across them is what earns credibility.

BMA is just Bayes’ rule applied to 4,096 models

\[P(M_k\mid y)=\frac{P(y\mid M_k)\,P(M_k)}{\sum_{l=1}^{2^K} P(y\mid M_l)\,P(M_l)}\]

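The marginal likelihood \(P(y\mid M_k)\) is a built-in Occam’s razor: a complex model spreads its probability over many possible datasets, so it scores well only if the extra variables improve the fit enough. Here \(K=12\), so the sum in the denominator runs over all \(2^{12}=4{,}096\) models.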
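The formula is a normalization over models, which is easiest to compute in log space. A minimal sketch, assuming a uniform model prior by default and three made-up log marginal likelihoods (the numbers and the function name `posterior_model_probs` are illustrative, not taken from the analysis):

```python
import math

def posterior_model_probs(log_marglik, prior=None):
    """Bayes' rule over a finite set of models, computed in log space."""
    n = len(log_marglik)
    if prior is None:
        prior = [1.0 / n] * n          # assumption: uniform prior over models
    log_joint = [ll + math.log(p) for ll, p in zip(log_marglik, prior)]
    shift = max(log_joint)             # subtract the max to avoid underflow
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods log P(y | M_k) for three toy models.
log_ml = [-120.0, -118.0, -125.0]
probs = posterior_model_probs(log_ml)
for k, p in enumerate(probs):
    print(f"M_{k}: {p:.3f}")
```

With equal priors, a gap of 2 in log marginal likelihood translates into a posterior odds ratio of \(e^2\) between two models.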

A variable’s PIP is a weighted democratic vote across models

\[\text{PIP}_j=\sum_{k:\,j\in M_k} P(M_k\mid y)\]

Each of the 4,096 models votes for which variables matter — but better-fitting models get louder voices. We call PIP \(\geq 0.80\) “robust” (Raftery 1995).
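The weighted vote can be illustrated on a toy problem. A minimal sketch, assuming five hypothetical models with made-up posterior probabilities (the variable names and numbers are for illustration only):

```python
def inclusion_probabilities(model_probs):
    """Sum posterior model probabilities over the models containing each variable."""
    pip = {}
    for variables, prob in model_probs.items():
        for v in variables:
            pip[v] = pip.get(v, 0.0) + prob
    return pip

# Hypothetical posterior model probabilities (they sum to 1).
model_probs = {
    frozenset({"gdp"}): 0.10,
    frozenset({"gdp", "fossil"}): 0.70,
    frozenset({"gdp", "tourism"}): 0.05,
    frozenset({"gdp", "fossil", "tourism"}): 0.10,
    frozenset({"fossil"}): 0.05,
}

pip = inclusion_probabilities(model_probs)
robust = sorted(v for v, p in pip.items() if p >= 0.80)  # the 0.80 cutoff used here
print({v: round(p, 2) for v, p in pip.items()})
print(robust)
```

In this toy case `gdp` and `fossil` clear the 0.80 line because the high-probability models include them, while `tourism` appears only in low-probability models.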

BMA flags four robust drivers and zero false positives

GDP (PIP = 1.00), trade network (0.986), fossil fuel (0.948), industry (0.841) clear the 0.80 line; all five noise variables sit below 0.15.

The top models agree: the same four variables, every time

Variable-inclusion map of the top 100 models. Column width = posterior probability; blue = positive coefficient, gray = excluded. The core four form solid bands across the whole axis.

LASSO trades a little bias for a large cut in variance

\[\text{MSE}=\text{Bias}^2+\text{Variance}+\text{Irreducible noise}\]

As complexity rises, bias falls but variance explodes. The optimal model lives in between — exactly where regularized methods operate.

The L1 diamond has corners — that is why LASSO selects

\[\hat\beta_{\text{LASSO}}=\arg\min_\beta\ \frac{1}{2n}\|y-X\beta\|^2+\lambda\sum_{j=1}^{p}|\beta_j|\]

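The L1 constraint region is a diamond whose corners lie on the axes, so the solution often lands on a corner where a coefficient is exactly zero. The L2 (Ridge) region is a smooth circle, so coefficients shrink but do not reach exactly zero.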
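The difference shows up in closed form in a special case. A minimal sketch, assuming an orthonormal design (\(X^\top X/n = I\)), under which the LASSO objective above is solved by soft-thresholding each OLS coefficient, and assuming a ridge penalty written as \(\lambda\sum_j\beta_j^2\); the OLS coefficients are made up:

```python
def lasso_orthonormal(b_ols, lam):
    """Soft-thresholding: the LASSO solution for one coefficient when X'X/n = I."""
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0  # anything within [-lam, lam] is set to exactly zero

def ridge_orthonormal(b_ols, lam):
    """Ridge solution for one coefficient when X'X/n = I (penalty lam * beta^2)."""
    return b_ols / (1.0 + 2.0 * lam)

b_ols = [1.20, 0.30, -0.05]   # hypothetical OLS coefficients
lam = 0.5

lasso = [lasso_orthonormal(b, lam) for b in b_ols]
ridge = [ridge_orthonormal(b, lam) for b in b_ols]
print("lasso:", lasso)   # the two small coefficients become exactly 0.0
print("ridge:", ridge)   # every coefficient is halved, none is zero
```

LASSO subtracts a fixed amount and truncates at zero, while ridge rescales proportionally, which is why only the former selects.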

Noise dies first; GDP is the last variable standing

Regularization path — as \(\lambda\) grows (left→right), orange noise variables hit zero first; GDP (\(\beta=1.200\)) persists longest. Dashed/dotted lines mark \(\lambda_{\min}\) and \(\lambda_{1\text{se}}\).

At the parsimonious penalty, LASSO keeps six variables — all real

Six bars survive (steel blue = true predictor correctly kept); gray bars are dropped. No orange — zero noise variables falsely selected.

Post-LASSO un-shrinks the coefficients back toward the truth

| Variable | LASSO \(\hat\beta\) | Post-LASSO \(\hat\beta\) | True \(\beta\) |
|---|---|---|---|
| log_gdp | 1.190 | 1.165 | 1.200 |
| fossil_fuel | 0.007 | 0.012 | 0.012 |
| urban_pop | 0.004 | 0.008 | 0.010 |
| trade_network | 0.631 | 0.898 | 0.500 |

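LASSO selects; OLS on the selected set re-estimates without the shrinkage penalty. Here that brings fossil_fuel and urban_pop to or near their true values, while trade_network moves further from its true 0.500 (0.631 to 0.898).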
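The two-step logic can be sketched on simulated data. This is an illustration under stated assumptions, not the pipeline behind the table: the data are synthetic (four standard-normal regressors, two with nonzero coefficients), there is no intercept, the penalty `lam=0.2` is fixed by hand rather than cross-validated, and the LASSO is fitted with a plain coordinate-descent loop written for this example.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum|b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every fitted effect except variable j's.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Synthetic data: two true predictors, two pure-noise columns (assumed setup).
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 4))
true_beta = np.array([2.0, 1.0, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(n)

# Step 1: LASSO picks the support. Step 2: OLS on that support only.
beta_lasso = lasso_cd(X, y, lam=0.2)
support = np.flatnonzero(beta_lasso != 0.0)
beta_post = np.zeros(4)
beta_post[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]

print("support:   ", support)
print("LASSO:     ", np.round(beta_lasso, 3))
print("Post-LASSO:", np.round(beta_post, 3))
```

The LASSO coefficients on the selected variables sit below the refitted OLS values because the penalty shrinks them; the refit removes that shrinkage but inherits whatever errors the selection step made.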

WALS averages with the same prior LASSO uses for selection

\[p(\gamma_j)\propto\exp(-|\gamma_j|/\tau)\]

The Laplace prior used by WALS is sharply peaked at zero but has heavier tails than a Gaussian: skeptical but open-minded. Up to an additive constant and the scale \(1/\tau\), its negative log is LASSO’s L1 penalty.
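The link to the L1 penalty follows from taking logs of the prior. A short derivation, assuming independent priors across the \(\gamma_j\) (the constant collects the normalizing terms):

\[-\log p(\gamma_j)=\frac{|\gamma_j|}{\tau}+\text{const},\qquad -\sum_{j}\log p(\gamma_j)=\frac{1}{\tau}\sum_{j}|\gamma_j|+\text{const}\]

So \(1/\tau\) plays the role that \(\lambda\) plays in the LASSO objective, up to scaling. The correspondence is in the functional form only: LASSO minimizes a penalized criterion, whereas WALS averages over the prior rather than maximizing.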

WALS makes GDP tower: \(|t|=34.62\), far above every other variable

Six variables clear the \(|t|\geq 2\) line; GDP’s bar runs off the chart at 34.62, trade network next at 4.39. Noise variables all sit below 1.5.

The Resolution

Act III

Four variables are triple-robust — the strongest claims the data supports

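| Variable | BMA PIP | LASSO | WALS \(\lvert t\rvert\) | Methods |
|---|---|---|---|---|
| log_gdp | 1.000 | yes | 34.62 | 3 |
| trade_network | 0.986 | yes | 4.39 | 3 |
| fossil_fuel | 0.948 | yes | 3.26 | 3 |
| industry | 0.841 | yes | 4.01 | 3 |
| urban_pop | 0.648 | yes | 3.11 | 2 |
| democracy | 0.607 | yes | 2.58 | 2 |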

All five noise variables: flagged by none. Agreement across mechanically distinct methods is what earns credibility.
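The "Methods" column is a count of how many decision rules each variable passes. A minimal sketch that recomputes it from the table, using the cutoffs stated earlier (PIP \(\geq 0.80\), kept by LASSO, \(|t|\geq 2\)); the helper name `methods_flagging` is illustrative:

```python
PIP_CUTOFF = 0.80   # BMA robustness threshold
T_CUTOFF = 2.0      # WALS |t| threshold

# (variable, BMA PIP, kept by LASSO, WALS |t|), copied from the table.
rows = [
    ("log_gdp",       1.000, True, 34.62),
    ("trade_network", 0.986, True, 4.39),
    ("fossil_fuel",   0.948, True, 3.26),
    ("industry",      0.841, True, 4.01),
    ("urban_pop",     0.648, True, 3.11),
    ("democracy",     0.607, True, 2.58),
]

def methods_flagging(pip, lasso_kept, t_abs):
    """Number of methods (0 to 3) that flag a variable."""
    return int(pip >= PIP_CUTOFF) + int(lasso_kept) + int(t_abs >= T_CUTOFF)

counts = {name: methods_flagging(pip, kept, t) for name, pip, kept, t in rows}
triple_robust = [name for name, c in counts.items() if c == 3]

print(counts)
print(triple_robust)
```

The two variables with a count of 2 fail only the BMA cutoff, which is the split discussed next.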

Three columns of agreement — and two honest splits

Method-agreement heatmap. Top four rows solid steel blue across all three methods; bottom five (noise) solid orange. Urban_pop and democracy split blue (LASSO/WALS) vs orange (BMA).

BMA and WALS line up — but BMA’s bar is set higher

BMA PIP vs WALS \(|t|\). Upper-right quadrant = robust by both (the core four). Urban_pop and democracy: high \(|t|\) but PIP < 0.80 — BMA’s conservatism made visible.

All three recover GDP almost exactly; small effects are harder

Estimates vs true coefficients, faceted by method. Points on the 45° line = perfect recovery. GDP lands on the line for all three; trade network is overshot by all (low-variance regressor).

Same data, perfect specificity — but LASSO/WALS see more

| Method | Sensitivity | Specificity | Accuracy |
|---|---|---|---|
| BMA | 57.1% | 100% | 75.0% |
| LASSO | 85.7% | 100% | 91.7% |
| WALS | 85.7% | 100% | 91.7% |

Zero false positives across the board; the gap is in catching the moderate true effects.
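These percentages follow from the answer key (7 true predictors, 5 noise variables) and the counts reported above: BMA flags 4 true predictors, LASSO and WALS flag 6 each, and no method flags a noise variable. A minimal sketch of the arithmetic (the function name `selection_metrics` is illustrative):

```python
def selection_metrics(true_pos, false_pos, n_true=7, n_noise=5):
    """Sensitivity, specificity, and accuracy of a variable-selection result."""
    true_neg = n_noise - false_pos
    sensitivity = true_pos / n_true
    specificity = true_neg / n_noise
    accuracy = (true_pos + true_neg) / (n_true + n_noise)
    return sensitivity, specificity, accuracy

# (true positives, false positives) per method, from the results above.
results = {"BMA": (4, 0), "LASSO": (6, 0), "WALS": (6, 0)}

metrics = {m: selection_metrics(tp, fp) for m, (tp, fp) in results.items()}
for method, (sens, spec, acc) in metrics.items():
    print(f"{method}: {sens:.1%} sensitivity, {spec:.0%} specificity, {acc:.1%} accuracy")
```

BMA's 57.1% is 4 of 7 true predictors; LASSO's and WALS's 85.7% is 6 of 7. Accuracy counts correct calls over all 12 candidates.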

Does triangulation make this causal? No — it disciplines selection, not identification

Objection. Agreeing across three methods still can’t manufacture a causal effect.

Response. Correct. Triangulation buys robustness of selection, not identification. These coefficients are conditional associations; causal claims would still need exogeneity, no confounding, and correct functional form. The synthetic answer key validates the methods, not a CO\(_2\) policy.

When three different methods agree, believe the variable — not any single model.