Double Machine Learning with 401(k) Data

From eligibility effects to complier analysis

$8,730 DML ATE of eligibility
55% of the naive gap was bias
$11,746 IIVM LATE on compliers

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

Eligible households hold $19,559 more — but is the 401(k) the cause?

Over $7 trillion sits in U.S. 401(k) accounts. Eligible households hold $19,559 more in net financial assets than ineligible ones.

But eligible households also earn $15,368 more income. Is the gap the plan — or the people who get the plan?

One dataset, three models, very different answers

Naive baselines (gray) versus three DML models across four ML learners, \(\hat\theta \pm 95\%\) CI. Naive overstates; DML cuts it roughly in half.

Where we’re going

  • The confounding problem: why naive comparisons mislead
  • The DML recipe: orthogonal scores + cross-fitting
  • Three models, two estimands: PLR and IRM (ATE), IIVM (LATE)
  • Four ML learners — and why the answer barely moves
  • The lesson: separate “is it real?” from “for whom?”

The Investigation

Act II

The lab: 9,915 households from the 1991 SIPP survey

  • Outcome \(Y\) — net total financial assets (net_tfa), median just $1,499
  • Treatment \(D\) — 401(k) eligibility (e401), 37.1% eligible
  • Endogenous treatment — actual participation (p401), 26.2%
  • Controls \(X\) — 9 covariates: income, age, education, family size, and 5 more

Eligibility is set by the employer; participation is a household choice. That distinction drives all three models.

Income drives both access and savings — the textbook confounder

Eligible households (blue) earn far more and cluster in the high-income, high-wealth region. Income opens a backdoor path from access to savings.

The naive estimate is the causal effect plus confounding bias

\[\hat\Delta_{\text{naive}} = \underbrace{\theta_0}_{\text{causal effect}} + \underbrace{\text{bias}}_{\text{income, education, }\ldots}\]

The naive $19,559 is not wrong arithmetic — it is the right gap answering the wrong question. DML’s whole job is to subtract the second term.
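A small simulation can make the decomposition concrete. This is a minimal sketch on synthetic data, not the SIPP sample: the true effect `THETA`, the income coefficient of 3.0, and the logistic eligibility rule are all hypothetical choices made for illustration.

```python
import math
import random

rng = random.Random(0)
THETA = 1.0   # hypothetical true causal effect of eligibility
n = 20_000

treated, control = [], []
for _ in range(n):
    income = rng.gauss(0.0, 1.0)
    # Higher income makes eligibility more likely (the backdoor path).
    p_eligible = 1.0 / (1.0 + math.exp(-income))
    d = 1 if rng.random() < p_eligible else 0
    # Income also raises savings directly, independent of eligibility.
    y = THETA * d + 3.0 * income + rng.gauss(0.0, 1.0)
    (treated if d else control).append(y)

naive_gap = sum(treated) / len(treated) - sum(control) / len(control)
bias = naive_gap - THETA
print(f"naive gap = {naive_gap:.2f}, true effect = {THETA:.2f}, bias = {bias:.2f}")
```

The difference in means lands well above `THETA` because eligible units also have higher income, which is the second term in the equation above.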

DML strips the confounding with two nuisance functions

\[Y = \theta_0 D + g_0(X) + \varepsilon, \qquad D = m_0(X) + V\]

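\(\ell_0(X) = E[Y\mid X]\)

Predict savings from covariates. Residual \(\tilde Y = Y - \hat\ell_0(X)\) is the unexplained savings. In the model above \(\ell_0(X) = \theta_0\,m_0(X) + g_0(X)\), so the outcome learner targets \(\ell_0\) rather than \(g_0\) itself.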

\(m_0(X) = E[D\mid X]\)

Predict eligibility from covariates. Residual \(\tilde D = D - \hat m_0(X)\) is the surprise eligibility.

Regress \(\tilde Y\) on \(\tilde D\): the slope is \(\hat\theta_0\). Both residuals are cleaned of confounding, so only the causal channel remains.
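Why the residual-on-residual slope recovers \(\theta_0\) can be sketched in two steps. The sketch writes \(\ell_0(X) = E[Y\mid X]\) for the conditional mean of the outcome and assumes \(E[\varepsilon\mid D, X] = 0\) and \(E[V\mid X] = 0\); these are the usual partially linear model conditions, stated here as assumptions rather than taken from the slides.

\[\ell_0(X) = E[Y\mid X] = \theta_0\,m_0(X) + g_0(X)\]

\[Y - \ell_0(X) = \theta_0\,\bigl(D - m_0(X)\bigr) + \varepsilon \quad\Longrightarrow\quad \theta_0 = \frac{E\bigl[(D - m_0(X))\,(Y - \ell_0(X))\bigr]}{E\bigl[(D - m_0(X))^2\bigr]}\]

Subtracting the first line from the structural equation removes \(g_0(X)\) entirely, which is why only the two conditional means need to be learned.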

Two safeguards make ML-based nuisance estimation harmless

Neyman-orthogonal score

The estimating equation’s derivative with respect to small nuisance errors is zero at the truth, so sloppy nuisance estimates barely move \(\hat\theta_0\).

Cross-fitting (K-fold)

Fit the nuisance models on \(K-1\) folds, predict on the held-out fold, rotate. No row is ever scored by a model that saw it.

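Orthogonality removes the first-order effect of regularization bias; cross-fitting removes overfitting bias. Together they let Lasso, forests, or XGBoost serve as nuisance learners while \(\hat\theta_0\) keeps valid standard errors, provided the learners converge fast enough (roughly faster than \(n^{-1/4}\)).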
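The cross-fitting loop can be sketched in plain Python. This is an illustration under stated assumptions, not the DoubleML implementation: the data are synthetic with a hypothetical true effect of 2.0, the treatment is continuous for simplicity, and a one-covariate least-squares line stands in for the ML learners. The helper names `fit_line` and `cross_fit_plr` are made up for this example.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def cross_fit_plr(x, d, y, k=3, seed=0):
    """Partialling-out estimate with K-fold cross-fitted residuals."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering all rows
    y_res, d_res = [0.0] * n, [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Nuisance fits use only rows outside the held-out fold.
        a_l, b_l = fit_line([x[i] for i in train], [y[i] for i in train])
        a_m, b_m = fit_line([x[i] for i in train], [d[i] for i in train])
        for i in held_out:
            y_res[i] = y[i] - (a_l + b_l * x[i])
            d_res[i] = d[i] - (a_m + b_m * x[i])
    # Final stage: slope of outcome residuals on treatment residuals.
    return sum(u * v for u, v in zip(d_res, y_res)) / sum(v * v for v in d_res)

rng = random.Random(1)
THETA = 2.0   # hypothetical true effect
x = [rng.gauss(0, 1) for _ in range(3000)]
d = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [THETA * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

theta_hat = cross_fit_plr(x, d, y)
print(f"cross-fitted estimate: {theta_hat:.3f} (truth {THETA})")
```

Swapping `fit_line` for a flexible learner changes nothing in the loop structure; the fold rotation is what keeps each row's prediction out-of-sample.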

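A few lines fit a DML model in Python

import doubleml as dml
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# data: pandas DataFrame with the SIPP sample; features_base: list of control column names
data_dml = dml.DoubleMLData(data, y_col="net_tfa", d_cols="e401", x_cols=features_base)

ml_l = RandomForestRegressor(...)   # outcome nuisance E[Y|X] (hyperparameters elided)
ml_m = RandomForestClassifier(...)  # treatment nuisance m0(X) = E[D|X] (hyperparameters elided)

model = dml.DoubleMLPLR(data_dml, ml_l=ml_l, ml_m=ml_m, n_folds=3)
model.fit()                          # cross-fitting + orthogonal score, internally
print(model.summary)                 # coefficient, standard error, 95% CI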

PLR: a constant-effect ATE of $8,730 — less than half the naive gap

PLR estimates across four ML learners; all cluster near $8,000–$9,400, far below the $19,559 naive line (dashed).

IRM relaxes the constant-effect assumption — and lands at $8,213

\[\theta_0 = E\!\left[g_0(1,X) - g_0(0,X) + \frac{D\,(Y-g_0(1,X))}{m_0(X)} - \frac{(1-D)\,(Y-g_0(0,X))}{1-m_0(X)}\right]\]

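The doubly robust (AIPW) score: an outcome model corrected by inverse-propensity weighting, with \(g_0(d,X) = E[Y\mid D=d, X]\) and \(m_0(X) = P(D=1\mid X)\). It stays consistent if either \(g_0\) or \(m_0\) is correctly specified, which acts as a safety net.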
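The score above translates almost term by term into code. The sketch below assumes the nuisance predictions are already available as lists; the function name `aipw_ate` and the four-row toy inputs are hypothetical and only serve to show the arithmetic.

```python
def aipw_ate(y, d, g1, g0, m):
    """Average the AIPW score given outcome predictions g1, g0 and propensities m."""
    scores = [
        g1_i - g0_i
        + d_i * (y_i - g1_i) / m_i
        - (1 - d_i) * (y_i - g0_i) / (1 - m_i)
        for y_i, d_i, g1_i, g0_i, m_i in zip(y, d, g1, g0, m)
    ]
    return sum(scores) / len(scores)

# Toy inputs: two treated and two control rows, constant nuisance predictions.
y = [3.0, 1.0, 4.0, 2.0]
d = [1, 0, 1, 0]
g1 = [2.0, 2.0, 2.0, 2.0]   # predicted outcome under treatment
g0 = [1.0, 1.0, 1.0, 1.0]   # predicted outcome under control
m = [0.5, 0.5, 0.5, 0.5]    # predicted propensity

ate = aipw_ate(y, d, g1, g0, m)
print(ate)
```

In practice the propensities are usually trimmed away from 0 and 1 before dividing, since the weights explode near the boundaries.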

Two different recipes, same answer: PLR and IRM agree within $517

IRM estimates across four learners ($7,924–$8,559) are even tighter than PLR, with smaller standard errors.

Participation is a choice — so we instrument it with eligibility

\[\theta_{\text{LATE}} = \frac{E[Y\mid Z=1] - E[Y\mid Z=0]}{E[D\mid Z=1] - E[D\mid Z=0]}\]

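Notation shifts here: \(D\) is now participation (p401) and \(Z\) is eligibility (e401). Participation is endogenous because it depends on unobserved traits such as financial discipline. Eligibility is a nudge: it opens the door without forcing anyone through.

A Wald-type ratio: the instrument’s effect on savings, divided by its effect on participation. The IIVM estimates both pieces conditional on \(X\) with ML learners.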
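The unconditional version of this ratio is short enough to write out. The sketch below uses a hypothetical helper `wald_late` and a made-up eight-row dataset; it ignores covariates entirely, so it illustrates the ratio itself rather than the covariate-adjusted IIVM estimator.

```python
def mean(values):
    return sum(values) / len(values)

def wald_late(y, d, z):
    """Instrument's effect on the outcome divided by its effect on take-up."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    d1 = [di for di, zi in zip(d, z) if zi == 1]
    d0 = [di for di, zi in zip(d, z) if zi == 0]
    first_stage = mean(d1) - mean(d0)
    if first_stage == 0:
        raise ValueError("instrument does not move participation")
    return (mean(y1) - mean(y0)) / first_stage

# Toy data: half of the eligible participate, and participants hold 8 more.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [10.0, 10.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

late = wald_late(y, d, z)
print(late)
```

The eligibility gap in outcomes is 4, but only half of the eligible take up, so the effect per participant is twice the gap.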

The instrument only moves the compliers — so the LATE is their effect

| Type | Behavior | In the LATE? |
|---|---|---|
| Always-takers | Participate regardless of eligibility | no |
| Never-takers | Never participate, even if eligible | no |
| Compliers | Participate because eligible | yes |
| Defiers | Assumed not to exist (monotonicity) | n/a |

The LATE is the effect of participation on the marginal households a policy actually moves.

The IIVM LATE is $11,746, larger than the ATE because take-up is below 100%

IIVM LATE estimates across four learners ($11,215–$12,281) sit well above the ATE band, as expected for compliers.
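A back-of-envelope check shows why the LATE sits above the ATE band. The sketch reuses figures quoted on the slides and adds one assumption that the slides do not state: only eligible households can participate, so take-up among the ineligible is zero. It also ignores covariates, so the result is a rough range rather than a replication of the IIVM estimate.

```python
# Figures quoted on the slides.
share_eligible = 0.371        # e401
share_participating = 0.262   # p401
ate_plr, ate_irm = 8_730, 8_213
late_iivm = 11_746

# Assumption (not from the slides): only eligible households can participate,
# so E[D|Z=0] = 0 and E[D|Z=1] = P(participate) / P(eligible).
first_stage = share_participating / share_eligible

# Scale each eligibility effect by take-up among the eligible.
late_from_plr = ate_plr / first_stage
late_from_irm = ate_irm / first_stage

print(f"take-up among eligible: {first_stage:.3f}")
print(f"implied LATE range: ${late_from_irm:,.0f} to ${late_from_plr:,.0f} (IIVM: ${late_iivm:,})")
```

Dividing either eligibility effect by a take-up rate of roughly 70% gives a per-participant effect that brackets the reported IIVM figure.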

Estimates barely move across four learners — orthogonality at work

Whole-picture comparison: naive (gray), PLR (steel), IRM (orange), IIVM (teal). Within each model the four learners cluster tightly.

The Resolution

Act III

55% of the naive eligibility gap was pure confounding bias

55%

of the $19,559 naive gap (≈ $10,829) was income-driven bias, not causal effect — the ATE is $8,730
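The headline share follows directly from the figures on the slides. The short check below only restates that arithmetic; the variable names are chosen for this example.

```python
naive_gap = 19_559   # raw difference in net financial assets
ate_plr = 8_730      # PLR estimate of the eligibility effect
ate_irm = 8_213      # IRM estimate of the eligibility effect

bias = naive_gap - ate_plr
bias_share = bias / naive_gap

print(f"bias = ${bias:,} ({bias_share:.1%} of the naive gap)")
print(f"PLR minus IRM = ${ate_plr - ate_irm:,}")
```

Using the IRM estimate instead of the PLR one would put the bias share slightly higher, since the IRM effect is $517 smaller.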

Eligibility genuinely raises savings — about $8,500 per household

$8,730

PLR mean ATE (IRM: $8,213); every 95% CI across two models and four learners excludes zero

For the households a policy actually moves, the effect is about $12,000

$11,746

IIVM LATE on compliers — the marginal participants an eligibility expansion targets

Does DML make this causal? No — two assumptions still carry the weight

Objection. Letting an ML model pick the controls can’t manufacture identification.

Response. Correct. DML disciplines estimation, not identification.

The ATE needs conditional exogeneity; the LATE adds instrument validity and monotonicity.

Separate “is the effect real?” from “for whom?” — and let cross-fitting do the rest.