Double Machine Learning — Interactive Lab

Why Double Machine Learning?

Does a cash bonus actually shorten unemployment, or do the people who get the offer simply differ from those who do not? Standard regression can adjust for measured differences, but only if the relationship between covariates and the outcome is roughly linear. Double Machine Learning (DML) uses flexible ML models to partial out covariates, then estimates the treatment effect on the cleaned residuals — buying valid standard errors without imposing linearity on the nuisance functions.

This app lets you turn the dials. In four tabs you will: compare L1 and L2 shrinkage and watch L1 drive coefficients exactly to zero (the engine inside one of the post's nuisance learners); tune λ on a LASSO nuisance step and see which controls survive; simulate confounded data and see DML's partialling-out beat naive OLS; and reproduce the post's headline forest plot (Naive OLS, OLS+Covariates, DML (Random Forest), DML (LASSO)) on the Pennsylvania Bonus Experiment.

L1 (LASSO) vs. L2 (Ridge) — why the LASSO nuisance learner selects

Both penalties shrink coefficients. Only L1 drives them exactly to zero. The animation below shows the same coefficient under the two penalties as λ grows: the orange L1 estimate hits zero at a finite λ and stays there, while the steel-blue L2 estimate approaches zero but never reaches it. This is why a LASSO-based nuisance learner in DML doubles as a variable-selection device, which is useful when $g(X)$ and $m(X)$ depend on only a handful of covariates.
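The contrast can be reproduced in a few lines. This is a minimal sketch, assuming a single regressor scaled so that its unpenalized coefficient `b_ols` is also the least-squares solution, with the L1 objective $\tfrac12(b_{ols}-b)^2 + \lambda|b|$ and the L2 objective $\tfrac12(b_{ols}-b)^2 + \tfrac{\lambda}{2}b^2$. The function names and the value 0.8 are illustrative and not taken from the app.

```python
import numpy as np

def lasso_shrink(b_ols, lam):
    """Soft-thresholding: the L1 solution for one standardized regressor."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: the L2 solution for the same regressor."""
    return b_ols / (1.0 + lam)

b_ols = 0.8  # hypothetical unpenalized coefficient
for lam in [0.0, 0.4, 0.8, 1.6]:
    l1 = lasso_shrink(b_ols, lam)
    l2 = ridge_shrink(b_ols, lam)
    print(f"lambda={lam:.1f}  L1={l1:.3f}  L2={l2:.3f}")
```

Under these assumptions the L1 estimate is exactly zero once λ reaches |b_ols|, while the L2 estimate is only divided by a growing factor and stays nonzero for every finite λ.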

Tab 2

Nuisance LASSO

Slide λ on the outcome-regression nuisance step. Watch which controls survive and how the post-OLS refit recovers the unshrunk coefficient on the treatment.

Tab 3

DML vs OLS Simulator

Generate confounded data, then estimate the treatment effect two ways. See how DML's residualization tracks the true α even when naive OLS drifts.

Tab 4

Forest Plot — Bonus Experiment

The post's four estimates on a single axis, with 95% CIs for the two DML rows. Hover for SE, CI, and number of controls used.

Glossary (open a card if a term is unfamiliar)

Partial-linear model (PLR)

$Y = \theta D + g(X) + \varepsilon$. Outcome equals a linear-in-treatment term plus a flexible (possibly non-linear) function of covariates plus noise. Linearity is imposed only on $D$ — the covariates can enter $g$ however ML wants.

Nuisance functions $g, m$

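The flexible parts. The outcome nuisance $E[Y \mid X]$ predicts the outcome from covariates, and $m(X) = E[D \mid X]$ predicts the treatment from covariates. (In the PLR above, $E[Y \mid X] = \theta m(X) + g(X)$, so the conditional mean fitted for partialling-out differs from the structural $g$ by $\theta m(X)$; this lab writes $\hat g$ for the fitted outcome regression.) "Nuisance" because we don't care about them for inference, only for cleaning the residuals.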

Treatment effect $\theta$

The single number we care about: the average effect of the treatment on the outcome, holding covariates fixed via $g$ and $m$. In this post, $\hat\theta = -0.074$ (DML-RF): the bonus lowers log unemployment duration by 0.074, which corresponds to shortening unemployment duration by roughly 7%.

Cross-fitting

Sample-split + swap. Estimate $g, m$ on one fold, the treatment effect on the other, then rotate and average across all folds. Removes the overfitting bias that contaminates plug-in estimators.
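To see why the swap matters, compare residuals from a flexible learner evaluated on its own training rows with residuals from models that never saw the row they predict. This is a minimal sketch on simulated data; the data-generating process, the Random Forest settings, and the 5 folds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 400, 5
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + rng.normal(size=n)  # hypothetical outcome, unit-variance noise

# Plug-in: fit and predict on the same rows.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
resid_in = y - rf.predict(X)

# Cross-fitting: each row is predicted by a model trained on the other folds.
resid_out = np.empty(n)
folds = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    fold_model = RandomForestRegressor(n_estimators=100, random_state=0)
    fold_model.fit(X[train_idx], y[train_idx])
    resid_out[test_idx] = y[test_idx] - fold_model.predict(X[test_idx])

print("in-sample residual variance:  ", resid_in.var())
print("out-of-fold residual variance:", resid_out.var())
```

The in-sample residuals look too clean because the forest has partly memorised the noise; the out-of-fold residuals are the honest ones that DML carries into the final regression.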

Neyman orthogonality

The score function has zero expected gradient w.r.t. the nuisance parameters at the truth. Small ML errors in $\hat g, \hat m$ therefore have no first-order effect on $\hat\theta$: the lever is balanced at the fulcrum.
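For the partialling-out estimator the property can be written out explicitly. The following is a sketch using the residual notation of the Partialling-out card; the symbols $W = (Y, D, X)$, $\eta$ (the pair of nuisance functions), and the perturbation directions $\delta_Y(X), \delta_D(X)$ are introduced here for illustration.

$$
\psi(W; \theta, \eta) = (\tilde Y - \theta \tilde D)\,\tilde D, \qquad \tilde Y = Y - E[Y \mid X], \quad \tilde D = D - E[D \mid X].
$$

Setting the sample average of $\psi$ to zero gives $\hat\theta = \sum_i \tilde D_i \tilde Y_i / \sum_i \tilde D_i^2$. Perturbing the two conditional means by $r\,\delta_Y(X)$ and $r\,\delta_D(X)$ and differentiating at $r = 0$:

$$
\frac{\partial}{\partial r} E\Big[\big(\tilde Y - r\delta_Y - \theta(\tilde D - r\delta_D)\big)\big(\tilde D - r\delta_D\big)\Big]\Big|_{r=0}
= -E[\delta_Y \tilde D] + \theta\,E[\delta_D \tilde D] - E[(\tilde Y - \theta \tilde D)\,\delta_D] = 0.
$$

The first two terms vanish because $E[\tilde D \mid X] = 0$; the third vanishes because $\tilde Y - \theta \tilde D = \varepsilon$ under the PLR and $E[\varepsilon \mid X] = 0$.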

Partialling-out

Regress $Y$ on $X$, regress $D$ on $X$, take the residuals $\tilde Y, \tilde D$, then regress $\tilde Y$ on $\tilde D$ to get $\hat\theta$. Frisch-Waugh-Lovell with ML in the first stage.
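The Frisch-Waugh-Lovell part of this recipe can be checked numerically when the first stage is linear. This is a minimal sketch on simulated data (the coefficients and sample size are made up for illustration): the residual-on-residual slope matches the coefficient on $D$ from the full regression of $Y$ on $D$ and $X$. DML keeps the same last step but swaps the linear first stage for an ML learner with cross-fitting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)   # treatment depends on X
Y = 0.5 * D + X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=n)

def residualize(v, X):
    """Residual of v after a linear regression on X plus an intercept."""
    Z = np.column_stack([np.ones(len(v)), X])
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

# Partialling-out: residualize Y and D, then regress residual on residual.
Y_res, D_res = residualize(Y, X), residualize(D, X)
theta_fwl = (D_res @ Y_res) / (D_res @ D_res)

# Benchmark: coefficient on D in the full regression of Y on [1, D, X].
full = np.column_stack([np.ones(n), D, X])
theta_full = np.linalg.lstsq(full, Y, rcond=None)[0][1]

print(theta_fwl, theta_full)
```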

Learner-agnosticism

Any sufficiently flexible ML algorithm can serve as the nuisance learner. In the post, swapping Random Forest for LASSO moves $\hat\theta$ from -0.0736 to -0.0712, a shift of 0.0024, which is less than 7% of one SE.

RCT precision boost

In a randomised experiment, covariates can't fix bias (there is none) but they can absorb residual variance in $Y$. This is why DML's $\hat\theta$ is close to OLS's here — the gain is precision, not bias removal.

Nuisance LASSO — the engine inside DML-LASSO

DML needs to fit two nuisance functions: $\hat g(X)$ predicts $Y$ from covariates, $\hat m(X)$ predicts $D$ from covariates. When the post swaps Random Forest for LassoCV, this slider is the knob that's being tuned automatically. Drag the λ slider and watch covariates shrink to exactly zero — and see how the unshrunk post-OLS estimate of α survives.

Sample size n 200

More data $\Rightarrow$ each control's coefficient is estimated more precisely.

Number of controls p 40

About 15% of these have a true nonzero effect; the rest are noise.

Signal strength 0.60

Magnitude of the truly-relevant coefficients relative to noise.

Penalty λ

Slide left for less shrinkage (more covariates survive); right for more.

Readouts (values update as the sliders move):

covariates kept (|I|): count out of the p candidates
α̂ from raw LASSO: shrunk toward zero
α̂ from post-OLS: refit on selected support
true α: 0.50, held fixed for comparison

What to look for

Sparsity grows with λ. Slide right — more controls pin to zero. Slide left — they re-enter. At λ ≈ 0 the LASSO collapses to OLS; at λ ≈ λ_max everything is zero.
Raw LASSO α̂ is shrunk; post-OLS α̂ is not. This is why DML confines LASSO to the nuisance fits ($\hat g, \hat m$) and estimates the final causal step without the penalty.
The treatment column (orange) stays in by construction. Try a large p and large λ — controls drop out, but the orange line remains the focus of inference.
This is what LassoCV() does internally in the post. In Tab 3 of the post, $\hat\theta = -0.0712$ comes from running this engine inside a 5-fold cross-fitting loop.
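The select-then-refit pattern from the list above can be sketched with scikit-learn. The settings mirror the tab's defaults where possible (p = 40, about 15% relevant controls, signal 0.60, true α = 0.50); the larger n, the fixed penalty `alpha=0.2`, and a treatment drawn independently of the controls are assumptions made so that a single draw is stable. The treatment is forced into the refit regardless of what LASSO selects.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, true_alpha = 2000, 40, 0.5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:6] = 0.6                      # 6 of 40 controls (15%) carry signal
D = rng.normal(size=n)              # treatment, independent of X in this sketch
Y = true_alpha * D + X @ beta + rng.normal(size=n)

# Step 1: LASSO on [D, X]; every coefficient, including D's, is penalized.
W = np.column_stack([D, X])
lasso = Lasso(alpha=0.2).fit(W, Y)
alpha_raw = lasso.coef_[0]
kept = np.flatnonzero(lasso.coef_[1:])   # indices of surviving controls

# Step 2: unpenalized OLS of Y on an intercept, D, and the surviving controls.
Z = np.column_stack([np.ones(n), D, X[:, kept]])
alpha_post = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

print(f"kept {len(kept)} of {p} controls")
print(f"raw LASSO alpha: {alpha_raw:.3f}   post-OLS alpha: {alpha_post:.3f}")
```

Raising `alpha` in step 1 drops more controls and pulls `alpha_raw` further toward zero, while the refit in step 2 removes the shrinkage from the treatment coefficient.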

DML vs OLS — when does partialling-out matter?

The Pennsylvania Bonus Experiment is an RCT, so OLS and DML give similar estimates (the difference is precision, not bias). In this simulator we move into the territory where DML shines: observational data with covariates that drive both Y and D. The asymmetry slider controls how strongly the covariates influence treatment assignment. Crank it up and watch naive OLS drift away from the true α while DML's residualized estimator stays close.

Sample size n 200

Capped at 300 so the "Run 100 sims" button finishes quickly.

Number of controls p 40

Capped at 50 for the 100-sim run.

Signal strength 0.50

Common scale for both the outcome model and treatment model.

Confounding asymmetry 0.80

0 = no confounding (RCT-like) · 1 = covariates strongly predict D, weakly predict Y. Higher = more reward for DML's partialling-out.
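A single run of this comparison can be sketched outside the app. The data-generating process below is an assumption (the app's exact mapping from the asymmetry slider to coefficients is not stated here): six controls drive both treatment and outcome, naive OLS regresses Y on D alone, and the DML estimate uses out-of-fold LassoCV predictions followed by residual-on-residual OLS.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, p, true_alpha = 300, 40, 0.5
X = rng.normal(size=(n, p))
coef_d = np.zeros(p)
coef_d[:6] = 0.8                    # covariates -> treatment (hypothetical strength)
coef_y = np.zeros(p)
coef_y[:6] = 0.5                    # covariates -> outcome (hypothetical strength)
D = X @ coef_d + rng.normal(size=n)
Y = true_alpha * D + X @ coef_y + rng.normal(size=n)

# Naive OLS: slope of Y on D alone, ignoring the confounders.
Dc, Yc = D - D.mean(), Y - Y.mean()
alpha_naive = (Dc @ Yc) / (Dc @ Dc)

# DML: out-of-fold predictions of Y and D from X, then OLS on the residuals.
Y_res = Y - cross_val_predict(LassoCV(cv=3), X, Y, cv=5)
D_res = D - cross_val_predict(LassoCV(cv=3), X, D, cv=5)
alpha_dml = (D_res @ Y_res) / (D_res @ D_res)

# Standard error from the empirical variance of the orthogonal score.
psi = (Y_res - alpha_dml * D_res) * D_res
se_dml = np.sqrt(np.mean(psi**2) / n) / np.mean(D_res**2)

print(f"true alpha {true_alpha:.2f} | naive OLS {alpha_naive:.3f} | "
      f"DML {alpha_dml:.3f} (SE {se_dml:.3f})")
```

Setting `coef_d` to zero gives the RCT-like regime, where the naive and DML estimates should roughly agree.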

DML (rigorous λ)

Partialling-out with theory-driven LASSO nuisance, then OLS on the residuals.

Reported for this estimator: α̂, SE(α̂), |I_y| (controls kept in $\hat g$), |I_d| (controls kept in $\hat m$), the union |I_y ∪ I_d|, and the penalties λ_y, λ_d.

DML (CV-tuned λ)

λ from 3-fold cross-validation on the nuisance regressions — closer to the post's LassoCV().

Reported for this estimator: α̂, SE(α̂), |I_y| (controls kept in $\hat g$), |I_d| (controls kept in $\hat m$), the union |I_y ∪ I_d|, and the penalties λ_y, λ_d.

What to look for

Both DML estimators stay close to the true α. The vertical line is the truth; both cards' α̂ remain near it, even at high asymmetry.
CV over-selects when covariates predict D. Many marginally predictive controls leak into $\hat m$, eating into the variation in D that identifies α. This is the over-selection story the post warns about when comparing learners.
The rigorous penalty is deliberately conservative. Its Bonferroni-style factor $\Phi^{-1}(1-\gamma/(2p))$ keeps selection error small relative to estimation noise — a different objective than minimising prediction MSE.
In the post's RCT setting, this difference is muted. The treatment is randomised, so $m(X)$ is roughly constant and CV vs rigorous nearly coincide. Move the asymmetry slider to 0 to see the RCT-like regime; crank it to 1 for the observational regime.
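The Bonferroni-style factor mentioned in the list above is easy to evaluate with the standard library. The `rigorous_lambda` helper is a sketch under an assumed convention, $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1-\gamma/(2p))$ with $c = 1.1$ and $\gamma = 0.1/\log n$, which is a common default in the rigorous-LASSO literature; the app's own constants and scaling are not stated here and may differ.

```python
from math import log, sqrt
from statistics import NormalDist

def bonferroni_factor(p, gamma):
    """The quantile Phi^{-1}(1 - gamma / (2p)) quoted in the text."""
    return NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))

def rigorous_lambda(n, p, c=1.1, gamma=None):
    """Assumed convention: lambda = 2 * c * sqrt(n) * Phi^{-1}(1 - gamma/(2p))."""
    if gamma is None:
        gamma = 0.1 / log(n)
    return 2.0 * c * sqrt(n) * bonferroni_factor(p, gamma)

for p in [10, 40, 400]:
    print(p, round(bonferroni_factor(p, gamma=0.05), 3), round(rigorous_lambda(200, p), 1))
```

The factor grows only slowly with p, so the penalty rises gently as controls are added, and it depends on n, p, and γ alone rather than on out-of-sample prediction error, which is the quantity CV minimises.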

Bias vs variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different ε and v) to see whether the CV bias is systematic.

The post's headline estimates — interactively

These numbers come straight from the post's Summary table for the Pennsylvania Bonus Experiment (n = 5,099). All four methods estimate the same quantity: the effect of the bonus offer (tg) on log unemployment duration (inuidur1). Toggle methods to compare. Hover a point for the SE, 95% CI, and number of controls.

What to look for

All four estimates agree on the direction. Every method gives a negative coefficient — the bonus shortens unemployment duration. The disagreement is at the second decimal place, not the sign.
Naive OLS is the largest in absolute value (-0.0855). Adding covariates, whether linearly (-0.0717) or via flexible ML (-0.0736 RF / -0.0712 LASSO), shrinks the estimate. In an RCT this shift reflects covariate adjustment for precision, not bias removal.
DML-RF and DML-LASSO are within 0.0024 of each other — less than 7% of one SE. Learner-agnosticism is the post's robustness signal.
Only the two DML rows show CI bars. The OLS rows in the post are reported without DoubleML's cross-fitted standard errors — the post uses them as benchmarks, not as the primary inference target.
Both DML CIs exclude zero — but just barely. Upper bounds are -0.0041 and -0.0018. The bonus effect is statistically significant at 5%, but the true magnitude could plausibly be anywhere from a 14% reduction to almost nothing.

Outcome: Log unemployment duration

Methods (toggle to show or hide): Naive OLS, OLS + Covariates, DoubleML (RF), DoubleML (Lasso)

Connecting back to Tab 3

The DML-rigorous vs DML-CV comparison you experimented with on simulated data is the same machinery the post applies to the Pennsylvania experiment — just with two different ML learners (RF and LASSO) instead of two different λ-selection rules:

DML (RF): α̂ = -0.0736, SE = 0.0354, 95% CI = [-0.143, -0.004]
DML (LASSO): α̂ = -0.0712, SE = 0.0354, 95% CI = [-0.141, -0.002]
Naive OLS: α̂ = -0.0855 — biggest in absolute value, but no cross-fitted SE
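The intervals listed above follow from the usual normal approximation, $\hat\alpha \pm z_{0.975}\cdot SE$. The sketch below recomputes them from the rounded estimates and standard errors shown on this page, so the last digit can differ from figures computed with unrounded values. It also checks the size of the RF-to-LASSO shift in SE units and converts the log-point estimate to an exact percentage change.

```python
from math import exp
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value, about 1.96

# (estimate, standard error) as reported on this page
estimates = {
    "DML (RF)": (-0.0736, 0.0354),
    "DML (LASSO)": (-0.0712, 0.0354),
}

intervals = {}
for name, (alpha_hat, se) in estimates.items():
    intervals[name] = (alpha_hat - z * se, alpha_hat + z * se)
    lo, hi = intervals[name]
    print(f"{name}: 95% CI = [{lo:.4f}, {hi:.4f}]")

# Difference between the two learners, measured in standard errors.
shift_in_se = abs(-0.0736 - (-0.0712)) / 0.0354
print(f"RF vs LASSO shift: {shift_in_se:.3f} SE")

# Exact percentage change in duration implied by a log-point coefficient.
pct_change = (exp(-0.0736) - 1.0) * 100.0
print(f"implied change in duration: {pct_change:.2f}%")
```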

The takeaway: in this RCT, all roads lead to roughly -0.07. DML's contribution is not a different point estimate but valid inference machinery that would still work if treatment were not randomised. The simulator in Tab 3 makes that hypothetical concrete.

Why no CI bars on the OLS rows?

The post reports the two OLS coefficients as reference points and computes formal inference only for the DML estimates (where the cross-fitted standard errors are available out of the box from the doubleml package). Standard OLS SEs are available too and would be slightly narrower than DML's — but they assume the linear specification is correct and lose validity if $g(X)$ is truly non-linear. Showing only the DML CIs is honest about which inference machinery the post relies on for the headline claim.