DoubleML 401(k) Interactive Lab

Does 401(k) access really cause households to save more?

A simple comparison says eligible households have \$19,559 more in net financial assets than ineligible ones. But eligible households also earn \$15,368 more in income, on average. How much of that \$19,559 gap is the 401(k) — and how much is the income gap dressed up in 401(k) clothing? Double Machine Learning (DML) answers that question by using flexible ML models to strip out confounding before estimating the causal effect.

This app lets you turn the dials yourself. In four tabs you will: watch a naive estimate decompose into causal effect + bias as confounding grows; compare PLR, IRM, and IIVM estimators side-by-side; and explore the 12 real DML estimates from the 1991 SIPP pension dataset.

The confounding picture in one animation

Two coefficients shrink as you regularise: the steel-blue path is the naive estimate that ignores income; the orange path is the DML estimate that partials it out. Both start at the same place, but only the DML path lands on the truth.
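The contrast between the two paths can be reproduced with a few lines of simulation. The sketch below is illustrative only, not the app's code: it assumes a single confounder (standing in for income), a linear outcome model, and plain linear regression in place of the flexible ML learners, with no cross-fitting. All function names and parameter values are hypothetical.

```python
import random

def residuals(x, y):
    """Residuals of y after a simple linear regression on x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def simulate(n, true_effect, confounding, seed):
    """Income drives both treatment take-up and the outcome (assumed design)."""
    rng = random.Random(seed)
    income = [rng.gauss(0, 1) for _ in range(n)]
    treated = [1.0 if inc + rng.gauss(0, 1) > 0 else 0.0 for inc in income]
    outcome = [true_effect * d + confounding * inc + rng.gauss(0, 1)
               for d, inc in zip(treated, income)]
    return income, treated, outcome

def naive_gap(treated, outcome):
    """Difference in mean outcomes, ignoring income."""
    y1 = [y for d, y in zip(treated, outcome) if d == 1.0]
    y0 = [y for d, y in zip(treated, outcome) if d == 0.0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def partialled_out(income, treated, outcome):
    """Regress outcome residuals on treatment residuals (partialling out income)."""
    ry = residuals(income, outcome)
    rd = residuals(income, treated)
    return sum(a * b for a, b in zip(rd, ry)) / sum(a * a for a in rd)

income, treated, outcome = simulate(n=5000, true_effect=0.5, confounding=2.0, seed=0)
naive = naive_gap(treated, outcome)
adjusted = partialled_out(income, treated, outcome)
print(f"naive gap: {naive:.2f}  partialled-out: {adjusted:.2f}  truth: 0.50")
```

Raising `confounding` widens the naive gap while the partialled-out estimate stays near the true effect, which is the pattern the animation shows.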

Tab 2

Confounding Lab

Increase income's role as confounder and watch the naive estimate inflate while DML stays anchored to the true effect.

Tab 3

PLR vs IRM vs IIVM

Same simulated data, three estimators. See how PLR and IRM agree on the ATE, while IIVM (an IV-style method) recovers a larger LATE for compliers.

Tab 4

Forest Plot

The 12 real DML estimates from the post. Toggle estimands and methods. Hover to see SE, CI, and the role of each ML learner.

Glossary (open a card if a term is unfamiliar)

Confounder

A variable that influences both treatment and outcome. Income is the dominant confounder here — eligible households earn \$15,368 more on average.

ATE — Average Treatment Effect

The average causal effect of treatment across the whole population. PLR and IRM both estimate the ATE of 401(k) eligibility.

LATE — Local Average Treatment Effect

The causal effect on compliers — units whose treatment status responds to the instrument. IIVM estimates the LATE of 401(k) participation.

PLR — Partially Linear Regression

Assumes a constant treatment effect and lets confounders enter through a flexible ML model. The workhorse DML estimator. ATE here: \$8,730.

IRM — Interactive Regression Model

Drops the constant-effect assumption; uses doubly-robust (AIPW) scoring with propensity scores. ATE here: \$8,213.

IIVM — Interactive IV Model

DML adapted for binary instruments. Targets the LATE on compliers. Uses eligibility as instrument for participation. LATE here: \$11,746.

Cross-fitting (K-fold)

Train nuisance models on K-1 folds, predict on the held-out fold. Rotate. The DML guard against overfitting bias.
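The fold rotation described in this card can be sketched as follows. This is a minimal illustration under assumptions, not the DoubleML implementation: `make_folds`, `cross_fit`, `fit_line`, and `predict_line` are hypothetical helpers, and a simple linear regression stands in for the ML learner.

```python
import random

def make_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_fit(x, y, k, fit, predict, seed=0):
    """Out-of-fold predictions: each point is predicted by a model that never saw it."""
    n = len(y)
    preds = [None] * n
    for held_out in make_folds(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, x[i])
    return preds

def fit_line(x, y):
    """Toy learner: simple linear regression, returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def predict_line(model, xi):
    intercept, slope = model
    return intercept + slope * xi

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [2.0 * xi + rng.gauss(0, 0.5) for xi in x]
y_hat = cross_fit(x, y, k=5, fit=fit_line, predict=predict_line)
residual = [yi - pi for yi, pi in zip(y, y_hat)]  # out-of-fold residuals
```

In DML the same rotation is applied to each nuisance model, and the out-of-fold residuals feed the final estimating equation.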

Orthogonal score

An estimating equation whose derivative w.r.t. nuisance-function errors is zero at the truth. With a Neyman-orthogonal score, small errors in the ML-based nuisance estimates have only a second-order effect on the target estimate.
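As a worked example, the partialling-out score commonly used for PLR can be written as below. The notation is an assumption made for illustration: W = (Y, D, X) is one observation, α is the treatment coefficient, and the nuisance functions are ℓ(X) = E[Y | X] and m(X) = E[D | X].

```latex
\psi(W;\alpha,\eta)
  = \bigl(Y - \ell(X) - \alpha\,(D - m(X))\bigr)\,\bigl(D - m(X)\bigr),
\qquad \eta = (\ell, m)

% Moment condition and Neyman orthogonality at the truth (\alpha_0, \eta_0):
\mathbb{E}\bigl[\psi(W;\alpha_0,\eta_0)\bigr] = 0,
\qquad
\frac{\partial}{\partial r}\,
\mathbb{E}\bigl[\psi\bigl(W;\alpha_0,\eta_0 + r(\eta - \eta_0)\bigr)\bigr]\Big|_{r=0} = 0
```

The second condition says that perturbing ℓ or m slightly away from the truth does not change the moment condition to first order.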

Propensity score

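Predicted probability that a unit is treated, given covariates. IRM uses it to weight each observation inversely to the probability of the group it ended up in, so unlikely treated units and unlikely controls count more.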

Complier

A household that participates because it is eligible, but would not otherwise. The marginal population — the focus of the LATE.

Confounding Lab — see how naive estimates inflate

Simulated data with one treatment column and many candidate controls — the way a DML practitioner would think of the 401(k) problem. The true treatment coefficient is α = 0.5. The LASSO chooses how many controls to keep based on a single penalty parameter λ. Drag λ and watch the controls drop out one at a time — this is the nuisance-function step DML performs internally.

Sample size n = 200

More data ⇒ each control's coefficient is estimated more precisely.

Number of controls p = 40

About 15% of these have a true nonzero effect; the rest are noise.

Signal strength = 0.60

How strongly the relevant controls predict the outcome, a proxy for income's role.

Penalty λ

Slide left for less shrinkage (more controls survive); right for more.

Readouts (filled in once the simulation runs)

controls kept (|I|), out of the p candidates

α̂ from raw LASSO: biased toward zero by shrinkage

α̂ from post-OLS (DML-style): refit on the selected support, the DML estimate

true α = 0.50: held fixed for comparison

What to look for

Sparsity grows with λ. Slide right: more controls are pinned to zero. Slide left: more re-enter. At λ ≈ 0 you recover OLS — which would blow up if p ≥ n.
Post-OLS α̂ tracks the true α more closely. Raw LASSO shrinks the treatment too. DML uses LASSO for selection, then refits unpenalised — the same logic used by the post's PLR estimator internally.
The orange treatment line stays in. Try p = 100 and a large λ: 90+ controls disappear, but the orange line keeps a meaningful value. This is LASSO selection at work; cross-fitting plays a different role in DML, guarding the nuisance fits against overfitting bias.
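The shrinkage and refit logic in the points above has a closed form in one special case. The sketch below assumes an orthonormal design (XᵀX = I) and the objective ½‖y − Xβ‖² + λ‖β‖₁, where each LASSO coefficient is the soft-thresholded OLS coefficient and an unpenalised refit on the selected support returns the OLS coefficients unchanged. The coefficient values are made up for illustration and are not output from the app.

```python
def soft_threshold(b, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def lasso_and_post_ols(ols, lam):
    """Return raw LASSO coefficients and the unpenalised refit on the kept support."""
    raw = {name: soft_threshold(b, lam) for name, b in ols.items()}
    post = {name: ols[name] for name, b in raw.items() if b != 0.0}
    return raw, post

# Hypothetical OLS coefficients: one treatment and three candidate controls.
ols = {"treatment": 0.50, "control_1": 0.30, "control_2": 0.05, "control_3": -0.02}

raw_small, post_small = lasso_and_post_ols(ols, lam=0.1)
raw_large, post_large = lasso_and_post_ols(ols, lam=0.4)

print("lambda=0.1 kept:", sorted(post_small), "raw treatment:", round(raw_small["treatment"], 3))
print("lambda=0.4 kept:", sorted(post_large), "raw treatment:", round(raw_large["treatment"], 3))
```

A larger λ keeps fewer coefficients and pulls the raw treatment coefficient further toward zero, while the refit value on the kept support is not shrunk.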

PLR vs IRM vs IIVM — three estimators, one truth

Same simulated data. PLR uses partialling out, IRM uses doubly-robust AIPW with propensity scores, IIVM uses instrumental variables. The three approaches make different assumptions and target different estimands. Tweak the sliders and watch how they agree (PLR ≈ IRM) — and how IIVM systematically picks up a larger effect (the LATE).
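For reference, the doubly-robust (AIPW) score behind the IRM-style ATE estimate can be written as below. The notation is assumed for illustration: g(d, X) = E[Y | D = d, X] is the outcome regression, m(X) = P(D = 1 | X) is the propensity score, and α is the ATE.

```latex
\psi(W;\alpha,\eta)
  = g(1,X) - g(0,X)
  + \frac{D\,\bigl(Y - g(1,X)\bigr)}{m(X)}
  - \frac{(1-D)\,\bigl(Y - g(0,X)\bigr)}{1 - m(X)}
  - \alpha,
\qquad \eta = (g, m)
```

Setting the sample average of ψ to zero and solving for α gives the estimate. The two weighted residual terms correct the outcome-regression contrast, which is why the estimate remains consistent if either g or m is estimated well.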

Sample size n = 200

Capped at 300 so the "Run 100 sims" button finishes quickly.

Number of controls p = 40

Capped at 50 for the 100-sim run.

Signal strength = 0.50

Common scale for both π (treatment→controls) and θ (outcome→controls).

Confounding strength = 0.30

0 = controls predict both Y and D equally · 1 = controls predict D well, Y barely.

PLR / IRM ATE estimators

Partialling-out (PLR) and doubly-robust AIPW (IRM) target the ATE.

Readouts (filled in once the simulation runs): α̂ (PLR-style), SE(α̂), |I_y|, |I_d|, union |I_y ∪ I_d|

Target estimand: ATE

IIVM IV estimator

CV-tuned LASSO nuisances, IV-style scoring. Targets the LATE on compliers.

Readouts (filled in once the simulation runs): α̂ (IIVM-style), SE(α̂), |I_y|, |I_d|, union |I_y ∪ I_d|

Target estimand: LATE

Why does this happen?

PLR ≈ IRM under constant effects. When the true treatment effect is the same for every household, partialling out and AIPW give nearly identical answers — as the post's Table-2 results show (\$8,730 vs \$8,213 ATE).
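IIVM > PLR/IRM when compliers benefit more. The LATE captures the effect on marginal participants, exactly the population that responds to a policy change. The two numbers also refer to different treatments: the ATE is the effect of eligibility, the LATE the effect of participation. In the real 401(k) data: \$11,746 LATE vs \$8,730 ATE.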
The naive gap (steel-blue line) is much larger than any DML estimate. That gap is mostly income confounding — the bias DML strips out.
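The gap between an eligibility effect and a complier effect can be illustrated with the simplest IV estimator, the Wald ratio. This is a sketch under strong assumptions, not the IIVM estimator: the instrument is randomly assigned, there are no covariates, ineligible households cannot participate, and the effect of participation is the same for every complier. The 70% compliance share and the effect of 2.0 are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def wald_late(z, d, y):
    """Intention-to-treat effect, first stage, and their ratio (the Wald estimate)."""
    itt = mean([yi for zi, yi in zip(z, y) if zi == 1]) - mean([yi for zi, yi in zip(z, y) if zi == 0])
    first_stage = mean([di for zi, di in zip(z, d) if zi == 1]) - mean([di for zi, di in zip(z, d) if zi == 0])
    return itt, first_stage, itt / first_stage

rng = random.Random(0)
n = 20000
z = [rng.randint(0, 1) for _ in range(n)]                        # eligibility (instrument)
complier = [rng.random() < 0.7 for _ in range(n)]                # would participate if eligible
d = [1 if zi == 1 and c else 0 for zi, c in zip(z, complier)]    # participation
y = [2.0 * di + rng.gauss(0, 1) for di in d]                     # outcome

itt, first_stage, late = wald_late(z, d, y)
print(f"effect of eligibility: {itt:.2f}  compliance share: {first_stage:.2f}  LATE: {late:.2f}")
```

Because only part of the eligible group participates, the effect of eligibility is diluted, and dividing by the compliance share scales it back up to the effect on those who actually participate.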

Bias vs. variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different ε and v) to see whether PLR/IRM and IIVM bias is systematic.
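A repeated-simulation loop of this kind can be sketched as follows. It is an illustration under assumptions rather than the app's pipeline: one confounder, linear regression in place of LASSO or other ML learners, no cross-fitting, and hypothetical parameter values. Each run draws fresh noise while the parameters stay fixed.

```python
import random
import statistics

def residualize(x, y):
    """Residuals of y after a simple linear regression on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def one_run(rng, n, true_effect, confounding):
    """One fresh draw: returns (naive estimate, partialled-out estimate)."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    d = [1.0 if xi + rng.gauss(0, 1) > 0 else 0.0 for xi in x]      # treatment noise v
    y = [true_effect * di + confounding * xi + rng.gauss(0, 1)      # outcome noise eps
         for di, xi in zip(d, x)]
    y1 = [yi for di, yi in zip(d, y) if di == 1.0]
    y0 = [yi for di, yi in zip(d, y) if di == 0.0]
    naive = statistics.fmean(y1) - statistics.fmean(y0)
    rd, ry = residualize(x, d), residualize(x, y)
    adjusted = sum(a * b for a, b in zip(rd, ry)) / sum(a * a for a in rd)
    return naive, adjusted

rng = random.Random(42)
true_effect = 0.5
runs = [one_run(rng, n=200, true_effect=true_effect, confounding=2.0) for _ in range(100)]
naive_estimates, adjusted_estimates = zip(*runs)

naive_bias = statistics.fmean(naive_estimates) - true_effect
adjusted_bias = statistics.fmean(adjusted_estimates) - true_effect
print(f"naive: bias {naive_bias:.2f}, sd {statistics.stdev(naive_estimates):.2f}")
print(f"partialled-out: bias {adjusted_bias:.2f}, sd {statistics.stdev(adjusted_estimates):.2f}")
```

Averaging over runs separates the two error sources: a bias that persists across runs is systematic, while the standard deviation reflects run-to-run noise.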

