DoubleML 401(k) — Interactive Lab

A pedagogical companion to Double Machine Learning with 401(k) Data: From Eligibility Effects to Complier Analysis

Does 401(k) access really cause households to save more?

A simple comparison says eligible households have \$19,559 more in net financial assets than ineligible ones. But eligible households also earn \$15,368 more in income, on average. How much of that \$19,559 gap is the 401(k) — and how much is the income gap dressed up in 401(k) clothing? Double Machine Learning (DML) answers that question by using flexible ML models to strip out confounding before estimating the causal effect.

This app lets you turn the dials yourself. In four tabs you will: watch a naive estimate decompose into causal effect + bias as confounding grows; compare PLR, IRM, and IIVM estimators side-by-side; and explore the 12 real DML estimates from the 1991 SIPP pension dataset.

The confounding picture in one animation

Two coefficients shrink as you regularise: the steel-blue path is the naive estimate that ignores income; the orange path is the DML estimate that partials it out. Both start at the same place, but only the DML path lands on the truth.

Tab 2

Confounding Lab

Increase income's role as confounder and watch the naive estimate inflate while DML stays anchored to the true effect.

Tab 3

PLR vs IRM vs IIVM

Same simulated data, three estimators. See how PLR and IRM agree on the ATE, while IIVM (an IV-style method) recovers a larger LATE for compliers.

Tab 4

Forest Plot

The 12 real DML estimates from the post. Toggle estimands and methods. Hover to see SE, CI, and the role of each ML learner.

Glossary (open a card if a term is unfamiliar)

Confounder
A variable that influences both treatment and outcome. Income is the dominant confounder here — eligible households earn \$15,368 more on average.
ATE — Average Treatment Effect
The average causal effect of treatment across the whole population. PLR and IRM both estimate the ATE of 401(k) eligibility.
LATE — Local Average Treatment Effect
The causal effect on compliers — units whose treatment status responds to the instrument. IIVM estimates the LATE of 401(k) participation.
PLR — Partially Linear Regression
Pins a constant treatment effect, lets confounders enter through a flexible ML model. Workhorse DML estimator. ATE here: \$8,730.
IRM — Interactive Regression Model
Drops the constant-effect assumption; uses doubly-robust (AIPW) scoring with propensity scores. ATE here: \$8,213.
IIVM — Interactive IV Model
DML adapted for binary instruments. Targets the LATE on compliers. Uses eligibility as instrument for participation. LATE here: \$11,746.
Cross-fitting (K-fold)
Train nuisance models on K-1 folds, predict on the held-out fold. Rotate. The DML guard against overfitting bias.
Orthogonal score
An estimating equation whose derivative w.r.t. nuisance-function errors is zero at the truth. Neyman-orthogonal scores are what make ML-based nuisance estimation harmless.
Propensity score
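Predicted probability that a unit is treated, given covariates. IRM uses it to reweight observations, so that units whose treatment status is rare for their covariates (for example, ineligible high-income households) count for more.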
Complier
A household that participates because it is eligible, but would not otherwise. The marginal population — the focus of the LATE.
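
The cross-fitting and orthogonal-score entries above can be made concrete with a small sketch. It is not the post's code: the data are simulated, the true α = 0.5 and the single confounder are assumptions, and plain OLS stands in for the ML learners so that the example needs only NumPy. The helper names `ols_fit_predict` and `cross_fitted_plr` are hypothetical.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Fit OLS with an intercept on the training folds, predict on the held-out fold."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef = np.linalg.lstsq(A_train, y_train, rcond=None)[0]
    return A_test @ coef

def cross_fitted_plr(y, d, X, n_folds=5, seed=0):
    """Partialling-out estimate of alpha with K-fold cross-fitting."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance models are trained on the other K-1 folds only.
        y_hat = ols_fit_predict(X[train], y[train], X[test_idx])
        d_hat = ols_fit_predict(X[train], d[train], X[test_idx])
        y_res[test_idx] = y[test_idx] - y_hat
        d_res[test_idx] = d[test_idx] - d_hat
    # Orthogonal score: mean((y_res - alpha * d_res) * d_res) = 0, solved for alpha.
    alpha_hat = (d_res @ y_res) / (d_res @ d_res)
    psi = (y_res - alpha_hat * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return alpha_hat, se

# Simulated data (assumption): one confounder drives both treatment and outcome.
rng = np.random.default_rng(42)
n, p = 2000, 5
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n)
y = 0.5 * d + 2.0 * X[:, 0] + rng.standard_normal(n)

naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)   # slope that ignores X
alpha_hat, se = cross_fitted_plr(y, d, X)
print(f"naive slope = {naive:.3f}, cross-fitted PLR = {alpha_hat:.3f} (SE {se:.3f})")
```

In this design the naive slope absorbs the confounder's effect, while the residual-on-residual estimate should land near the assumed 0.5.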

Confounding Lab — see how naive estimates inflate

Simulated data with one treatment column and many candidate controls — the way a DML practitioner would think of the 401(k) problem. The true treatment coefficient is α = 0.5. The LASSO chooses how many controls to keep based on a single penalty parameter λ. Drag λ and watch the controls drop out one at a time — this is the nuisance-function step DML performs internally.

More data ⇒ each control's coefficient is estimated more precisely.
About 15% of these have a true nonzero effect; the rest are noise.
How strongly the relevant controls predict the outcome — proxy for income's role.
Slide left for less shrinkage (more controls survive); right for more.
  • controls kept (|I|), out of the candidates
  • α̂ from raw LASSO: biased toward zero by shrinkage
  • α̂ from post-OLS (DML-style): refit on selected support, the DML estimate
  • true α: 0.50, held fixed for comparison

What to look for

  • Sparsity grows with λ. Slide right: more controls are pinned to zero. Slide left: more re-enter. At λ ≈ 0 you recover OLS — which would blow up if p ≥ n.
  • Post-OLS α̂ tracks the true α more closely. Raw LASSO shrinks the treatment too. DML uses LASSO for selection, then refits unpenalised — the same logic used by the post's PLR estimator internally.
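  • The orange treatment line stays in. Try p = 100 and a large λ: 90+ controls disappear, but the orange line keeps a meaningful value. That is what refitting the treatment without a penalty buys you; cross-fitting addresses a separate problem, the overfitting bias that arises when nuisance models are trained and evaluated on the same data.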
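
The contrast between raw LASSO and post-OLS described above can be reproduced outside the app. The sketch below is an assumption-laden stand-in, not the app's code: it uses a hand-written coordinate-descent LASSO (the hypothetical helper `lasso_cd`), a fixed λ = 0.3, and simulated data with true α = 0.5 and two relevant controls out of 20.

```python
import numpy as np

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values inside [-lam, lam] become zero."""
    return np.sign(value) * max(abs(value) - lam, 0.0)

def lasso_cd(Z, y, lam, n_sweeps=200):
    """Minimise (1/2n)||y - Zb||^2 + lam * ||b||_1 by coordinate descent."""
    n, k = Z.shape
    b = np.zeros(k)
    col_scale = (Z ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(k):
            resid += Z[:, j] * b[j]              # add back column j's current fit
            rho = Z[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= Z[:, j] * b[j]              # subtract the updated fit
    return b

# Simulated data (assumption): treatment d is column 0, controls follow.
rng = np.random.default_rng(0)
n, p, alpha = 2000, 20, 0.5
X = rng.standard_normal((n, p))
d = 0.2 * X[:, 0] + 0.2 * X[:, 1] + rng.standard_normal(n)
y = alpha * d + X[:, 0] + X[:, 1] + rng.standard_normal(n)

Z = np.column_stack([d, X])
Z = Z - Z.mean(axis=0)        # centring removes the need for an intercept
y_c = y - y.mean()

# Raw LASSO penalises every coefficient, including the treatment.
b_raw = lasso_cd(Z, y_c, lam=0.3)
alpha_raw = b_raw[0]

# Post-OLS: keep the treatment plus the selected controls, refit without a penalty.
kept = np.union1d([0], np.flatnonzero(b_raw))
b_post = np.linalg.lstsq(Z[:, kept], y_c, rcond=None)[0]
alpha_post = b_post[0]
n_controls_kept = len(kept) - 1

print(f"controls kept: {n_controls_kept} of {p}")
print(f"raw LASSO alpha = {alpha_raw:.3f}, post-OLS alpha = {alpha_post:.3f}")
```

In this design the raw coefficient is pulled toward zero while the refit sits close to 0.5. With stronger confounding the raw LASSO coefficient can instead be pushed upward, because the shrunken controls leave part of their effect for the treatment to absorb.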

PLR vs IRM vs IIVM — three estimators, one truth

Same simulated data. PLR uses partialling out, IRM uses doubly-robust AIPW with propensity scores, IIVM uses instrumental variables. The three approaches make different assumptions and target different estimands. Tweak the sliders and watch how they agree (PLR ≈ IRM) — and how IIVM systematically picks up a larger effect (the LATE).
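
To make the doubly-robust AIPW idea concrete, here is a minimal sketch under strong simplifying assumptions: a single discrete confounder (a hypothetical three-level income bracket), nuisance functions estimated as cell means rather than with ML learners, no cross-fitting, and a true constant effect of 2.0. It illustrates the form of the score, not the post's IRM implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.integers(0, 3, size=n)                    # income bracket 0, 1, 2 (assumption)
p_treat = np.array([0.2, 0.5, 0.8])[x]            # richer brackets are treated more often
d = (rng.random(n) < p_treat).astype(float)
y = 2.0 * d + 3.0 * x + rng.standard_normal(n)    # true ATE = 2.0

naive = y[d == 1].mean() - y[d == 0].mean()

# Nuisance estimates: propensity m(x) and outcome regressions g1(x), g0(x) as cell means.
m_hat = np.empty(n)
g1_hat = np.empty(n)
g0_hat = np.empty(n)
for k in range(3):
    cell = x == k
    m_hat[cell] = d[cell].mean()
    g1_hat[cell] = y[cell & (d == 1)].mean()
    g0_hat[cell] = y[cell & (d == 0)].mean()

# AIPW score: regression difference plus inverse-propensity-weighted residuals.
psi = (g1_hat - g0_hat
       + d * (y - g1_hat) / m_hat
       - (1 - d) * (y - g0_hat) / (1 - m_hat))
ate_aipw = psi.mean()
se_aipw = psi.std(ddof=1) / np.sqrt(n)

print(f"naive difference = {naive:.3f}")
print(f"AIPW ATE = {ate_aipw:.3f} (SE {se_aipw:.3f})")
```

With in-sample cell means the weighted residual terms average to zero within each cell, so the point estimate equals the stratified regression difference. With ML learners and cross-fitting those correction terms generally do not vanish, which is where the double robustness matters.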

Capped at 300 so the "Run 100 sims" button finishes quickly.
Capped at 50 for the 100-sim run.
Common scale for both π (treatment→controls) and θ (outcome→controls).
0 = controls predict both Y and D equally · 1 = controls predict D well, Y barely.

PLR / IRM ATE estimators

Partialling-out (PLR) and doubly-robust AIPW (IRM) target the ATE.

α̂ (PLR-style)
SE(α̂)
|I_y|
|I_d|
union |I_y ∪ I_d|
target estimand: ATE

IIVM IV estimator

CV-tuned LASSO nuisances, IV-style scoring. Targets the LATE on compliers.

α̂ (IIVM-style)
SE(α̂)
|I_y|
|I_d|
union |I_y ∪ I_d|
target estimand: LATE

Why does this happen?

  • PLR ≈ IRM under constant effects. When the true treatment effect is the same for every household, partialling out and AIPW give nearly identical answers — as the post's Table-2 results show (\$8,730 vs \$8,213 ATE).
  • IIVM > PLR/IRM when compliers benefit more. The LATE captures the effect on marginal participants — exactly the population that responds to a policy change. In the real 401(k) data: \$11,746 LATE vs \$8,730 ATE.
  • The naive gap (steel-blue line) is much larger than any DML estimate. That gap is mostly income confounding — the bias DML strips out.
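
One mechanical reason an IV-style estimate of participation can exceed an effect of eligibility is that it rescales by take-up. The sketch below uses the unconditional Wald ratio rather than the IIVM score with ML nuisances, and every number in it is an illustrative assumption: randomly assigned eligibility, a 70% take-up rate among eligible units, and a unit effect of participation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
z = (rng.random(n) < 0.4).astype(float)     # eligibility, the instrument
complier = rng.random(n) < 0.7              # would participate if eligible (assumption)
d = z * complier                            # participation requires eligibility
y = 1.0 * d + rng.standard_normal(n)        # effect of participation = 1.0

# Effect of eligibility on the outcome (intention-to-treat).
itt = y[z == 1].mean() - y[z == 0].mean()
# Effect of eligibility on participation (first stage).
first_stage = d[z == 1].mean() - d[z == 0].mean()
# Wald ratio: effect of participation for compliers.
late_wald = itt / first_stage

print(f"eligibility effect = {itt:.3f}")
print(f"take-up (first stage) = {first_stage:.3f}")
print(f"Wald LATE = {late_wald:.3f}")
```

Because the first stage is below one, the ratio is larger than the eligibility effect whenever that effect is positive.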

Bias vs. variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different ε and v) to see whether PLR/IRM and IIVM bias is systematic.
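
A repeated-simulation check of this kind can be sketched as follows. The design is an assumption, not the app's: one confounder, true α = 0.5, n = 300 per draw, and OLS residualisation in place of the LASSO nuisances. The helpers `slope` and `residualise` are hypothetical names.

```python
import numpy as np

def slope(u, w):
    """OLS slope of w on u, with both variables centred."""
    u = u - u.mean()
    w = w - w.mean()
    return (u @ w) / (u @ u)

def residualise(v, x):
    """Residual of v after a linear regression on x."""
    return (v - v.mean()) - slope(x, v) * (x - x.mean())

rng = np.random.default_rng(3)
alpha, n, n_sims = 0.5, 300, 100
naive_draws = np.empty(n_sims)
po_draws = np.empty(n_sims)

for s in range(n_sims):
    # Fresh noise each run, same parameters.
    x = rng.standard_normal(n)
    d = x + rng.standard_normal(n)
    y = alpha * d + x + rng.standard_normal(n)
    naive_draws[s] = slope(d, y)
    po_draws[s] = slope(residualise(d, x), residualise(y, x))

naive_bias = naive_draws.mean() - alpha
po_bias = po_draws.mean() - alpha
print(f"naive:           bias = {naive_bias:+.3f}, sd = {naive_draws.std(ddof=1):.3f}")
print(f"partialling-out: bias = {po_bias:+.3f}, sd = {po_draws.std(ddof=1):.3f}")
```

A bias that stays far from zero across runs is systematic; a bias that is small relative to the spread across runs is consistent with sampling noise.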

The post's forest plot — interactively

These numbers come straight from all_results.csv and naive_estimates.csv in the post's folder — the same data used to produce the grand-comparison figure. Toggle estimands and methods to compare. Hover a point for SE, 95% CI, and the number of covariates used.

What to look for

  • Toggle "Naive (mean diff)" off to zoom into the DML estimates. The naive bars (\$19,559 for eligibility and \$27,372 for participation) compress the x-axis — they are more than double the corresponding DML estimates.
  • PLR and IRM cluster tightly between \$7,800 and \$9,400 across 4 ML learners. This convergence is the robustness check: two distinct DML frameworks land on the same ATE.
  • IIVM sits systematically higher (\$11,200 to \$12,300). This is the LATE-vs-ATE gap, not noise: compliers respond more strongly to eligibility than the average household.

Estimand

Methods

The naive estimate, decomposed

For 401(k) eligibility, the naive mean difference of \$19,559 decomposes into two components:

  • Causal effect (DML ATE) ≈ \$8,730 — what the 401(k) genuinely contributes.
  • Confounding bias ≈ \$10,829 — pre-existing differences (income, education, age) that have nothing to do with the plan.

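In percent terms, 55% of the naive gap (\$10,829 of \$19,559) is confounding bias. The income gap alone, \$15,368, explains most of it. DML removes this bias by adjusting for the covariates with flexible nuisance models; cross-fitting and orthogonal scoring keep the resulting estimate valid when those models are fit by ML.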
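
The decomposition is plain arithmetic on the two figures quoted above, shown here as a short check. The variable names are illustrative; only the two dollar amounts come from the post.

```python
naive_gap = 19_559   # naive mean difference for eligibility, in dollars
dml_ate = 8_730      # PLR estimate of the ATE, in dollars

confounding_bias = naive_gap - dml_ate
bias_share = confounding_bias / naive_gap

print(f"confounding bias = ${confounding_bias:,}")
print(f"share of naive gap = {bias_share:.1%}")
```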

Connecting back to Tab 3

The simulated comparisons you just explored map directly onto the real pension data:

  • PLR vs IRM agreement: the simulation shows they agree under constant effects; the data shows they agree at \$8,730 vs \$8,213.
  • IIVM > ATE: the simulation shows IV-style scoring picks up the complier population; the data shows IIVM = \$11,746, about 35% larger than the PLR ATE of \$8,730.
  • Learner robustness: 4 ML learners (Lasso, RF, Tree, XGB) move each estimator by less than \$1,500 — a hallmark of well-functioning DML.

The policy takeaway from the post: expanding 401(k) eligibility raises net financial assets by roughly \$8,500 per newly eligible household, and participation raises them by roughly \$12,000 for marginal participants, the population most affected by an eligibility expansion.