Double Machine Learning with 401(k) Data

From eligibility effects to complier analysis

$8,730 DML ATE of eligibility

55% of the naive gap was bias

$11,746 IIVM LATE on compliers

Carlos Mendez

Nagoya University (GSID)

July 8, 2026

The Tension

Act I

Eligible households hold $19,559 more — but is the 401(k) the cause?

Over $7 trillion sits in U.S. 401(k) accounts. Eligible households hold $19,559 more in net financial assets than ineligible ones.

But eligible households also earn $15,368 more income. Is the gap the plan — or the people who get the plan?

One dataset, three models, very different answers

Naive baselines (gray) versus three DML models across four ML learners, $\hat\theta$ with 95% CI. Naive overstates; the DML estimates of the eligibility ATE are less than half the naive gap.

Where we’re going

The confounding problem: why naive comparisons mislead
The DML recipe: orthogonal scores + cross-fitting
Three models, two estimands: PLR and IRM (ATE), IIVM (LATE)
Four ML learners — and why the answer barely moves
The lesson: separate “is it real?” from “for whom?”

The Investigation

Act II

The lab: 9,915 households from the 1991 SIPP

Outcome $Y$ — net total financial assets (net_tfa), median just $1,499
Treatment $D$ — 401(k) eligibility (e401), 37.1% eligible
Endogenous treatment — actual participation (p401), 26.2%
Controls $X$ — 9 covariates: income, age, education, family size, and 5 more

Eligibility is set by the employer; participation is a household choice. That distinction drives all three models.

Income drives both access and savings — the textbook confounder

Eligible households (blue) earn far more and cluster in the high-income, high-wealth region. Income opens a backdoor path from access to savings.

The naive estimate is the causal effect plus confounding bias

\[\hat\Delta_{\text{naive}} = \underbrace{\theta_0}_{\text{causal effect}} + \underbrace{\text{bias}}_{\text{income, education, }\ldots}\]

The naive $19,559 is not wrong arithmetic — it is the right gap answering the wrong question. DML’s whole job is to subtract the second term.
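The decomposition can be checked on made-up data. The sketch below is a toy simulation, not the 401(k) sample: the confounder `x`, the logistic eligibility rule, and the true effect of 1.0 are all assumptions chosen for illustration.

```python
import math
import random

def simulate_naive_gap(n=20000, theta=1.0, seed=0):
    """Return (naive_gap, theta) for a toy confounded sample."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                  # confounder (think: standardized income)
        p = 1.0 / (1.0 + math.exp(-x))           # higher x makes treatment more likely
        d = 1 if rng.random() < p else 0
        y = theta * d + 2.0 * x + rng.gauss(0.0, 1.0)  # x also raises the outcome
        (treated if d else untreated).append(y)
    naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)
    return naive, theta

naive_gap, true_effect = simulate_naive_gap()
bias = naive_gap - true_effect                   # the second term in the decomposition
print(f"naive gap = {naive_gap:.2f}, true effect = {true_effect:.2f}, bias = {bias:.2f}")
```

Because `x` raises both the chance of treatment and the outcome, the difference in means lands well above the true effect even though the arithmetic is correct.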

DML strips the confounding with two nuisance functions

\[Y = \theta_0 D + g_0(X) + \varepsilon, \qquad D = m_0(X) + V\]

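$\ell_0(X) = E[Y\mid X]$

Predict savings from covariates. Residual $\tilde Y = Y - \hat\ell_0(X)$ is the unexplained savings. Note that $\ell_0(X) = \theta_0 m_0(X) + g_0(X)$, which differs from the structural $g_0$ in the equation above.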

$m_0(X) = E[D\mid X]$

Predict eligibility from covariates. Residual $\tilde D = D - \hat m_0(X)$ is the surprise eligibility.

Regress $\tilde Y$ on $\tilde D$: the slope is $\hat\theta_0$. Both residuals are purged of the observed covariates, so under conditional exogeneity only the causal channel remains.
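A minimal sketch of the residual-on-residual step, on toy data rather than the 401(k) sample. One-regressor least squares stands in for the ML learners (an assumption that works here only because the simulated outcome is linear in `x`), and there is no cross-fitting yet; all names are illustrative.

```python
import math
import random

def ols_fit(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residualize(x, target):
    """Residuals of target after a linear fit on x."""
    intercept, slope = ols_fit(x, target)
    return [t - (intercept + slope * xi) for xi, t in zip(x, target)]

def partial_out(x, d, y):
    """Slope of residualized y on residualized d (residuals have mean zero)."""
    y_res = residualize(x, y)
    d_res = residualize(x, d)
    return sum(u * v for u, v in zip(d_res, y_res)) / sum(u * u for u in d_res)

# Toy data: x confounds both d and y; the true effect of d is 1.0.
rng = random.Random(0)
x, d, y = [], [], []
for _ in range(20000):
    xi = rng.gauss(0.0, 1.0)
    di = 1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0
    x.append(xi)
    d.append(di)
    y.append(1.0 * di + 2.0 * xi + rng.gauss(0.0, 1.0))

theta_hat = partial_out(x, d, y)
print(f"theta_hat = {theta_hat:.3f}")  # should land close to the true 1.0
```

Swapping `residualize` for a flexible learner is what turns this into the ML version; the final regression step stays the same.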

Two safeguards make ML-based nuisance estimation harmless

Neyman-orthogonal score

The derivative of the moment condition with respect to the nuisance functions is zero at the truth, so small errors in the nuisance fits affect $\hat\theta_0$ only at second order.
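One way to verify the claim for the partially linear model, as a sketch. The notation is assumed here rather than taken from the slides: $\ell_0(X) = E[Y\mid X]$ is the outcome regression, $\eta = (\ell, m)$ collects the nuisance functions, and $\delta = (\delta_\ell, \delta_m)$ is an arbitrary perturbation direction. The partialling-out score is

\[\psi(W;\theta,\eta) = \big(Y - \ell(X) - \theta\,(D - m(X))\big)\,\big(D - m(X)\big)\]

At the truth, $D - m_0(X) = V$ and $Y - \ell_0(X) - \theta_0 V = \varepsilon$, so differentiating along $\eta_0 + r\delta$ gives

\[\frac{\partial}{\partial r} E\big[\psi(W;\theta_0,\eta_0 + r\delta)\big]\Big|_{r=0} = E\big[(\theta_0\,\delta_m(X) - \delta_\ell(X))\,V\big] - E\big[\varepsilon\,\delta_m(X)\big] = 0\]

Both terms vanish because $E[V\mid X] = 0$ and $E[\varepsilon\mid X] = 0$: a first-order error in either nuisance function leaves the moment condition unchanged.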

Cross-fitting (K-fold)

Fit the nuisance models on $K-1$ folds, predict on the held-out fold, rotate. No row is ever scored by a model that was trained on it.

Orthogonality removes first-order regularization bias; cross-fitting removes overfitting bias. Together they let Lasso, forests, or XGBoost serve as nuisance learners while standard inference stays valid, provided the learners converge fast enough (faster than $n^{-1/4}$).
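The two safeguards combine into a short procedure. The sketch below is a hand-rolled illustration on toy data, not the DoubleML implementation: a one-regressor linear fit stands in for the ML learners, and the fold scheme, helper names, and simulated effect of 1.0 are all assumptions.

```python
import math
import random

def ols_fit(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def make_folds(n, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def out_of_fold_residuals(x, target, folds):
    """Residual for each row from a model fitted on the other folds only."""
    res = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        a, b = ols_fit([x[i] for i in train], [target[i] for i in train])
        for i in held_out:
            res[i] = target[i] - (a + b * x[i])
    return res

def cross_fitted_plr(x, d, y, k=3, seed=0):
    """Cross-fitted partialling-out estimate and its score-based standard error."""
    folds = make_folds(len(x), k, seed)
    y_res = out_of_fold_residuals(x, y, folds)
    d_res = out_of_fold_residuals(x, d, folds)
    n = len(x)
    j = sum(v * v for v in d_res) / n
    theta = sum(u * v for u, v in zip(y_res, d_res)) / n / j
    psi_sq = sum(((u - theta * v) * v) ** 2 for u, v in zip(y_res, d_res)) / n
    se = math.sqrt(psi_sq / (j * j) / n)
    return theta, se, folds

# Toy data: x confounds d and y; the true effect of d is 1.0.
rng = random.Random(0)
x, d, y = [], [], []
for _ in range(6000):
    xi = rng.gauss(0.0, 1.0)
    di = 1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0
    x.append(xi)
    d.append(di)
    y.append(1.0 * di + 2.0 * xi + rng.gauss(0.0, 1.0))

theta_hat, se, folds = cross_fitted_plr(x, d, y)
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```

Each row's residual comes from a model trained without that row, which is the property the cross-fitting slide describes.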

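A few lines fit a DML model in Python

import doubleml as dml
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# data is assumed to be a pandas DataFrame and features_base the list of control columns
data_dml = dml.DoubleMLData(data, y_col="net_tfa", d_cols="e401", x_cols=features_base)

ml_l = RandomForestRegressor(...)   # learner for E[Y|X], the outcome nuisance (hyperparameters elided)
ml_m = RandomForestClassifier(...)  # learner for E[D|X], the treatment nuisance (hyperparameters elided)

model = dml.DoubleMLPLR(data_dml, ml_l=ml_l, ml_m=ml_m, n_folds=3)
model.fit()                          # cross-fitting + orthogonal score, internally
print(model.summary)                 # coefficient, standard error, confidence interval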

PLR: a constant-effect ATE of $8,730 — less than half the naive gap

PLR estimates across four ML learners; all cluster near $8,000–$9,400, far below the $19,559 naive line (dashed).

IRM relaxes the constant-effect assumption — and lands at $8,213

\[\theta_0 = E\!\left[g_0(1,X) - g_0(0,X) + \frac{D\,(Y-g_0(1,X))}{m_0(X)} - \frac{(1-D)\,(Y-g_0(0,X))}{1-m_0(X)}\right]\]

The doubly-robust (AIPW) score: an outcome model corrected by inverse-propensity weighting. Consistent if either $g_0$ or $m_0$ is right — a safety net.
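A minimal sketch of the AIPW score on toy data. To stay self-contained it uses a discrete covariate so that both nuisances are plain cell means (an assumption for illustration; no ML learner and no cross-fitting), and the simulated effect $1 + x$ is made up so that the true ATE is 2.

```python
import random

def simulate(n=30000, seed=1):
    """Toy sample with a discrete confounder x and a heterogeneous effect 1 + x."""
    rng = random.Random(seed)
    propensity = {0: 0.2, 1: 0.5, 2: 0.8}     # treatment is likelier at high x
    rows = []
    for _ in range(n):
        x = rng.choice([0, 1, 2])
        d = 1 if rng.random() < propensity[x] else 0
        y = 3.0 * x + (1.0 + x) * d + rng.gauss(0.0, 1.0)
        rows.append((x, d, y))
    return rows

def aipw_ate(rows):
    """Average of the doubly robust score with cell-mean nuisance estimates."""
    cells = {}
    for x, d, y in rows:
        c = cells.setdefault(x, {"n": 0, "n1": 0, "y1": 0.0, "y0": 0.0})
        c["n"] += 1
        c["n1"] += d
        c["y1" if d else "y0"] += y
    m = {x: c["n1"] / c["n"] for x, c in cells.items()}                # m(x) = P(D=1 | x)
    g1 = {x: c["y1"] / c["n1"] for x, c in cells.items()}              # g(1, x) = E[Y | D=1, x]
    g0 = {x: c["y0"] / (c["n"] - c["n1"]) for x, c in cells.items()}   # g(0, x) = E[Y | D=0, x]
    scores = [
        g1[x] - g0[x]
        + d * (y - g1[x]) / m[x]
        - (1 - d) * (y - g0[x]) / (1.0 - m[x])
        for x, d, y in rows
    ]
    return sum(scores) / len(scores)

ate_hat = aipw_ate(simulate())
print(f"AIPW ATE = {ate_hat:.3f}")  # true ATE in this simulation is 2.0
```

The first two terms are the outcome-model contrast; the last two are the inverse-propensity corrections from the score above.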

Two different recipes, same answer: PLR and IRM agree within $517

IRM estimates across four learners ($7,924–$8,559) are even tighter than PLR, with smaller standard errors.

Participation is a choice — so we instrument it with eligibility

\[\theta_{\text{LATE}} = \frac{E[Y\mid Z=1] - E[Y\mid Z=0]}{E[D\mid Z=1] - E[D\mid Z=0]}\]

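Here the roles shift: participation (p401) is now the treatment $D$, and eligibility (e401) becomes the instrument $Z$. Participation is endogenous because unobserved traits such as financial discipline drive it. Eligibility is a nudge: it opens the door without forcing anyone through.

A Wald-type ratio: the instrument’s effect on savings, divided by its effect on participation. IIVM estimates the covariate-adjusted analogue, conditioning both effects on $X$.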
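The ratio is simple enough to compute by hand. The sketch below uses eight made-up rows, not survey data, and ignores covariates; `wald_late` is an illustrative helper name.

```python
def mean(values):
    return sum(values) / len(values)

def wald_late(y, d, z):
    """Effect of z on y divided by effect of z on d (no covariates)."""
    itt = mean([yi for yi, zi in zip(y, z) if zi == 1]) - mean(
        [yi for yi, zi in zip(y, z) if zi == 0]
    )
    first_stage = mean([di for di, zi in zip(d, z) if zi == 1]) - mean(
        [di for di, zi in zip(d, z) if zi == 0]
    )
    if first_stage == 0:
        raise ValueError("instrument does not move the treatment")
    return itt / first_stage

# Hypothetical toy rows: z = instrument, d = treatment taken, y = outcome.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 0, 1, 1, 0, 0]
y = [1, 2, 3, 2, 6, 8, 3, 3]

late = wald_late(y, d, z)
print(late)  # (5.0 - 2.0) / (0.5 - 0.0) = 6.0
```

The instrument raises the mean outcome by 3 while moving only half of the instrumented rows into treatment, so the implied effect per participant it moves is 6.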

The instrument only moves the compliers — so the LATE is their effect

Type	Behavior	In the LATE?
Always-takers	Participate regardless of eligibility	no
Never-takers	Never participate, even if eligible	no
Compliers	Participate because eligible	yes
Defiers	Assumed not to exist (monotonicity)	—

The LATE is the effect of participation on the marginal households a policy actually moves.
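A toy simulation makes the table concrete. The type shares, baselines, and effects below are invented for illustration. Types are never observed row by row, but under monotonicity their shares, and the complier effect, can be recovered from $(Z, D, Y)$ alone.

```python
import random

TYPES = ["always", "never", "complier"]
SHARES = [0.2, 0.4, 0.4]                      # assumed population shares
EFFECT = {"always": 0.5, "never": 1.0, "complier": 3.0}
BASELINE = {"always": 2.0, "never": 0.0, "complier": 1.0}

def simulate(n=40000, seed=2):
    """Rows of (z, d, y); the latent type is used to generate data, then dropped."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        kind = rng.choices(TYPES, weights=SHARES)[0]
        z = rng.randint(0, 1)                 # randomized instrument
        d = 1 if kind == "always" or (kind == "complier" and z == 1) else 0
        y = BASELINE[kind] + EFFECT[kind] * d + rng.gauss(0.0, 1.0)
        rows.append((z, d, y))
    return rows

def mean(values):
    return sum(values) / len(values)

rows = simulate()
d_z1 = [d for z, d, _ in rows if z == 1]
d_z0 = [d for z, d, _ in rows if z == 0]
y_z1 = [y for z, _, y in rows if z == 1]
y_z0 = [y for z, _, y in rows if z == 0]

share_always = mean(d_z0)                     # treated even without the instrument
share_never = 1.0 - mean(d_z1)                # untreated even with the instrument
share_compliers = mean(d_z1) - mean(d_z0)     # the first stage
late = (mean(y_z1) - mean(y_z0)) / share_compliers

print(f"always={share_always:.2f} never={share_never:.2f} compliers={share_compliers:.2f}")
print(f"LATE = {late:.2f}")                   # close to the complier effect of 3.0
```

The always-taker and never-taker effects (0.5 and 1.0 here) never enter the estimate, because the instrument does not change their behavior.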

The IIVM LATE is $11,746, larger than the eligibility ATE because take-up among the eligible is below 100%

IIVM LATE estimates across four learners ($11,215–$12,281) sit well above the ATE band, as expected for compliers.

Estimates barely move across four learners — orthogonality at work

Whole-picture comparison: naive (gray), PLR (steel), IRM (orange), IIVM (teal). Within each model the four learners cluster tightly.

The Resolution

Act III

55% of the naive eligibility gap was pure confounding bias

55%

of the $19,559 naive gap (≈ $10,829) was confounding bias from income, education, and other covariates, not causal effect; the PLR ATE is $8,730
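The headline shares follow from three numbers quoted in these slides; a quick arithmetic check (variable names are illustrative):

```python
naive_gap = 19_559   # naive eligible-vs-ineligible gap in net financial assets (USD)
plr_ate = 8_730      # PLR estimate of the eligibility ATE (USD)
irm_ate = 8_213      # IRM estimate of the eligibility ATE (USD)

bias = naive_gap - plr_ate           # part of the naive gap not attributed to eligibility
bias_share = bias / naive_gap        # fraction of the naive gap that is bias
causal_share = plr_ate / naive_gap   # fraction that survives adjustment
model_gap = plr_ate - irm_ate        # disagreement between the two ATE models

print(f"bias = ${bias:,} ({bias_share:.1%} of the naive gap)")
print(f"causal share = {causal_share:.1%}, PLR minus IRM = ${model_gap}")
```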

Eligibility genuinely raises savings — about $8,500 per household

$8,730

PLR mean ATE (IRM: $8,213); every 95% CI across two models and four learners excludes zero

For the households a policy actually moves, the effect of participating is about $12,000

$11,746

IIVM LATE on compliers — the marginal participants an eligibility expansion targets

Does DML make this causal? No — two assumptions still carry the weight

Objection. Letting an ML model pick the controls can’t manufacture identification.

Response. Correct. DML disciplines estimation, not identification.

The ATE needs conditional exogeneity; the LATE adds instrument validity and monotonicity.

Separate “is the effect real?” from “for whom?” — and let cross-fitting do the rest.