Why Double Machine Learning?
Does a cash bonus actually shorten unemployment, or do the people who get the offer simply differ from those who do not? Standard regression can adjust for measured differences, but only if the relationship between covariates and the outcome is roughly linear. Double Machine Learning (DML) uses flexible ML models to partial out covariates, then estimates the treatment effect on the cleaned residuals — buying valid standard errors without imposing linearity on the nuisance functions.
This app lets you turn the dials. In four tabs you will: watch L1 shrinkage drive coefficients to zero (the engine inside one of the post's nuisance learners); simulate confounded data and see DML's partialling-out beat naive OLS; and reproduce the post's headline forest plot — Naive OLS, OLS+Covariates, DML (Random Forest), DML (LASSO) — on the Pennsylvania Bonus Experiment.
L1 (LASSO) vs. L2 (Ridge) — why the LASSO nuisance learner selects
Both penalties shrink coefficients, but only L1 drives them exactly to zero. The animation below shows the same coefficient under the two penalties as λ grows: the orange L1 estimate hits zero at a finite λ and stays there, while the steel-blue L2 estimate approaches zero asymptotically but never reaches it. This is why a LASSO-based learner in DML doubles as a variable-selection device, which is useful when $g(X)$ and $m(X)$ depend on only a handful of covariates.
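The contrast between the two penalties can be sketched in closed form for a single coefficient. The snippet assumes a standardised, orthogonal design and one particular scaling of the objective (stated in the docstrings); the function names and the starting value `b_ols` are hypothetical and not taken from the app.

```python
import numpy as np

def l1_shrink(b, lam):
    """Soft-thresholding: minimiser of 0.5*(b - beta)**2 + lam*|beta|."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def l2_shrink(b, lam):
    """Ridge shrinkage: minimiser of 0.5*(b - beta)**2 + 0.5*lam*beta**2."""
    return b / (1.0 + lam)

b_ols = 1.0                                   # hypothetical unpenalised coefficient
lams = np.array([0.0, 0.5, 1.0, 2.0, 10.0])   # illustrative penalty grid
l1_path = l1_shrink(b_ols, lams)
l2_path = l2_shrink(b_ols, lams)

for lam, b1, b2 in zip(lams, l1_path, l2_path):
    print(f"lambda={lam:5.1f}  L1={b1:.3f}  L2={b2:.3f}")
```

Under this scaling the L1 estimate is exactly zero once λ reaches the size of the unpenalised coefficient, while the L2 estimate is only ever divided by a larger number.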
Nuisance LASSO
Slide λ on the outcome-regression nuisance step. Watch which controls survive and how the post-OLS refit recovers the unshrunk coefficient on the treatment.
DML vs OLS Simulator
Generate confounded data, then estimate the treatment effect two ways. See how DML's residualization tracks the true α even when naive OLS drifts.
Forest Plot — Bonus Experiment
The post's four estimates on a single axis, with 95% CIs for the two DML rows. Hover for SE, CI, and number of controls used.
Glossary (open a card if a term is unfamiliar)
Partial-linear model (PLR)
Nuisance functions $g, m$
Treatment effect $\theta$
Cross-fitting
Neyman orthogonality
Partialling-out
Learner-agnosticism
RCT precision boost
Nuisance LASSO — the engine inside DML-LASSO
DML needs to fit two nuisance functions: $\hat g(X)$ predicts $Y$ from covariates, and $\hat m(X)$ predicts $D$ from covariates. When the post swaps Random Forest for LassoCV, this slider is the knob that is being tuned automatically. Drag the λ slider and watch covariates shrink to exactly zero, and see how the unshrunk post-OLS estimate of α survives.
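In symbols, and using this page's convention that $\hat g$ predicts $Y$ and $\hat m$ predicts $D$, the residual-on-residual estimator that the two nuisance fits feed into can be written as below. This is the standard partialling-out form; the tilde notation is introduced here only for illustration, and $\hat\alpha$ is the quantity the post writes as $\hat\theta$.

```latex
\tilde Y_i = Y_i - \hat g(X_i), \qquad
\tilde D_i = D_i - \hat m(X_i), \qquad
\hat\alpha = \frac{\sum_{i=1}^{n} \tilde D_i \, \tilde Y_i}{\sum_{i=1}^{n} \tilde D_i^{2}}
```

The penalty only enters through $\hat g$ and $\hat m$; the final ratio is an unpenalised least-squares slope.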
What to look for
- Sparsity grows with λ. Slide right and more controls pin to zero; slide left and they re-enter. At λ ≈ 0 the LASSO collapses to OLS; at λ ≥ λ_max every control is zero.
- Raw LASSO α̂ is shrunk; post-OLS α̂ is not. This is why DoubleML confines the LASSO penalty to the nuisance fits ($\hat g, \hat m$) and estimates the final causal step without the penalty.
- The treatment column (orange) stays in by construction. Try a large p and large λ — controls drop out, but the orange line remains the focus of inference.
- This is what LassoCV() does internally in the post. In Tab 3 of the post, $\hat\theta = -0.0712$ comes from running this engine inside a 5-fold cross-fitting loop.
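The behaviour described in the bullets above can be reproduced with a small coordinate-descent LASSO. This is a minimal sketch, not the app's code: the data-generating process, the λ grid, and the choice to keep the treatment in by giving it a zero penalty weight are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by each coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(Z, y, lam, weights, n_sweeps=200):
    """Coordinate descent for (1/2n)*||y - Z b||^2 + lam * sum_j weights[j]*|b_j|."""
    n, k = Z.shape
    beta = np.zeros(k)
    col_ms = (Z ** 2).mean(axis=0)           # mean square of each column
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(k):
            resid += Z[:, j] * beta[j]       # add back coordinate j's current fit
            rho = Z[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * weights[j]) / col_ms[j]
            resid -= Z[:, j] * beta[j]
    return beta

def post_ols_alpha(D, X, y, keep):
    """Unpenalised refit of y on the treatment plus the selected controls."""
    W = np.column_stack([D, X[:, keep]])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[0]

# Hypothetical simulated data: only the first two controls matter for Y.
rng = np.random.default_rng(0)
n, p, alpha_true = 500, 20, 0.5
X = rng.normal(size=(n, p))
D = 0.5 * X[:, 0] + rng.normal(size=n)
Y = alpha_true * D + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
X, D, Y = X - X.mean(axis=0), D - D.mean(), Y - Y.mean()   # centre, so no intercept

Z = np.column_stack([D, X])
weights = np.r_[0.0, np.ones(p)]             # assumption: treatment is unpenalised

lambdas = [0.01, 0.1, 5.0]
n_selected, post_alpha = [], []
for lam in lambdas:
    beta = lasso_cd(Z, Y, lam, weights)
    keep = beta[1:] != 0.0                   # controls that survive the penalty
    n_selected.append(int(keep.sum()))
    post_alpha.append(post_ols_alpha(D, X, Y, keep))
    print(f"lambda={lam:5.2f}  controls kept={n_selected[-1]:2d}  "
          f"post-OLS alpha={post_alpha[-1]:.3f}")
```

At the largest λ no control survives, so the refit reduces to a regression of Y on D alone and inherits the confounding through the dropped controls. That is the reason over-aggressive selection matters for the causal step.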
DML vs OLS — when does partialling-out matter?
The Pennsylvania Bonus Experiment is an RCT, so OLS and DML give similar estimates (the difference is precision, not bias). In this simulator we move into the territory where DML shines: observational data with covariates that drive both Y and D. The asymmetry slider controls how strongly the covariates influence treatment assignment. Crank it up and watch naive OLS drift away from the true α while DML's residualized estimator stays close.
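A compact way to see the same effect outside the app is to simulate confounded data and compare the two estimators directly. The sketch below assumes a linear data-generating process and uses plain OLS as the nuisance learner, which is a stand-in for the app's LASSO learners; `simulate`, `dml_plr`, and the `asymmetry` argument are hypothetical names chosen to mirror the slider.

```python
import numpy as np

def ols_fit_predict(X_tr, y_tr, X_te):
    """Fit OLS with intercept on the training fold, predict on the held-out fold."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

def dml_plr(Y, D, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate and its standard error."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    y_res, d_res = np.empty(n), np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        y_res[test_idx] = Y[test_idx] - ols_fit_predict(X[train], Y[train], X[test_idx])
        d_res[test_idx] = D[test_idx] - ols_fit_predict(X[train], D[train], X[test_idx])
    theta = (d_res @ y_res) / (d_res @ d_res)        # residual-on-residual slope
    psi = (y_res - theta * d_res) * d_res            # score evaluated at theta
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se

def naive_ols(Y, D):
    """Slope from regressing Y on D alone, ignoring the covariates."""
    d = D - D.mean()
    return (d @ (Y - Y.mean())) / (d @ d)

def simulate(n, p, alpha, asymmetry, rng):
    """Covariates drive Y always, and drive D in proportion to `asymmetry`."""
    X = rng.normal(size=(n, p))
    b = np.full(p, 0.5)
    D = asymmetry * (X @ b) + rng.normal(size=n)
    Y = alpha * D + X @ b + rng.normal(size=n)
    return Y, D, X

rng = np.random.default_rng(1)
alpha_true = 0.5
results = {}
for asymmetry in (0.0, 1.0):
    Y, D, X = simulate(n=2000, p=10, alpha=alpha_true, asymmetry=asymmetry, rng=rng)
    theta, se = dml_plr(Y, D, X)
    results[asymmetry] = {"naive": naive_ols(Y, D), "dml": theta, "se": se}
    print(asymmetry, results[asymmetry])
```

With `asymmetry=0.0` the treatment is unrelated to the covariates and both estimators should land near the truth; with `asymmetry=1.0` only the residualised estimator should.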
DML (rigorous λ)
Partialling-out with theory-driven LASSO nuisance, then OLS on the residuals.
DML (CV-tuned λ)
λ from 3-fold cross-validation on the nuisance regressions — closer to the post's LassoCV().
What to look for
- Both DML estimators stay close to the true α. The vertical line is the truth, and both cards' α̂ remain near it even at high asymmetry.
- CV over-selects when covariates predict D. Many marginally predictive controls leak into $\hat m$, eating into the variation in D that identifies α. This is the over-selection story the post warns about when comparing learners.
- The rigorous penalty is deliberately conservative. Its Bonferroni-style factor $\Phi^{-1}(1-\gamma/(2p))$ keeps selection error small relative to estimation noise — a different objective than minimising prediction MSE.
- In the post's RCT setting, this difference is muted. The treatment is randomised, so $m(X)$ is roughly constant and CV vs rigorous nearly coincide. Move the asymmetry slider to 0 to see the RCT-like regime; crank it to 1 for the observational regime.
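The quantile factor in the rigorous penalty is easy to evaluate with the standard library. The sketch below computes only $\Phi^{-1}(1-\gamma/(2p))$; the value of `gamma` is a hypothetical choice, and the leading constants that turn this factor into an actual λ depend on how the objective is scaled, which this page does not specify.

```python
from statistics import NormalDist

def bonferroni_factor(p, gamma):
    """Phi^{-1}(1 - gamma / (2p)): the quantile factor in the rigorous penalty."""
    return NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))

gamma = 0.05   # hypothetical error level, chosen for illustration
factors = {p: bonferroni_factor(p, gamma) for p in (1, 10, 100, 1000)}

for p, f in factors.items():
    print(f"p={p:5d}  factor={f:.3f}")
```

The factor grows only slowly with the number of candidate controls, which is what lets the penalty guard against false selections across all of them at once without becoming prohibitively large.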
Bias vs variance over many simulations
Single runs are noisy. Run the whole pipeline 100 times with fresh draws (same parameters, different ε and v) to see whether the CV bias is systematic.
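The logic of repeating the pipeline can be sketched as follows: the mean of the estimates minus the truth measures systematic bias, and their standard deviation measures run-to-run noise. This illustration compares naive OLS with a full-sample OLS residualisation rather than the app's rigorous and CV-tuned LASSO estimators; the data-generating process and the name `one_run` are assumptions.

```python
import numpy as np

ALPHA_TRUE = 0.5

def one_run(rng, n=500, p=5, asym=1.0):
    """One fresh draw of (eps, v) and the two resulting estimates of alpha."""
    X = rng.normal(size=(n, p))
    D = asym * 0.5 * X.sum(axis=1) + rng.normal(size=n)               # v
    Y = ALPHA_TRUE * D + 0.5 * X.sum(axis=1) + rng.normal(size=n)     # eps
    d_c = D - D.mean()
    naive = (d_c @ (Y - Y.mean())) / (d_c @ d_c)
    A = np.column_stack([np.ones(n), X])
    d_res = D - A @ np.linalg.lstsq(A, D, rcond=None)[0]
    y_res = Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]
    partial = (d_res @ y_res) / (d_res @ d_res)
    return naive, partial

rng = np.random.default_rng(42)
draws = np.array([one_run(rng) for _ in range(100)])   # columns: naive, partialled-out
bias = draws.mean(axis=0) - ALPHA_TRUE
sd = draws.std(axis=0, ddof=1)

for name, b, s in zip(("naive OLS", "partialled-out"), bias, sd):
    print(f"{name:15s} bias={b:+.3f}  sd={s:.3f}")
```

A bias that is many times larger than the standard deviation cannot be explained by an unlucky draw, which is the distinction the repeated runs are meant to expose.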
The post's headline estimates — interactively
These numbers come straight from the post's Summary table for the Pennsylvania Bonus Experiment (n = 5,099). All four methods estimate the same quantity: the effect of the bonus offer (tg) on log unemployment duration (inuidur1). Toggle methods to compare. Hover a point for the SE, 95% CI, and number of controls.
What to look for
- All four estimates agree on the direction. Every method gives a negative coefficient — the bonus shortens unemployment duration. The disagreement is at the second decimal place, not the sign.
- Naive OLS is the largest in absolute value (-0.0855). Adding covariates linearly (-0.0717), or via flexible ML (-0.0736 RF / -0.0712 LASSO), all shrink the estimate. In an RCT this shift is precision improvement, not bias removal.
- DML-RF and DML-LASSO are within 0.0024 of each other — less than 7% of one SE. Learner-agnosticism is the post's robustness signal.
- Only the two DML rows show CI bars. The OLS rows in the post are reported without DoubleML's cross-fitted standard errors — the post uses them as benchmarks, not as the primary inference target.
- Both DML CIs exclude zero, but just barely. Upper bounds are -0.0041 and -0.0018. The bonus effect is statistically significant at 5%, but the true magnitude could plausibly be anywhere from a reduction of about 0.14 log points (roughly 13% in duration) to almost nothing.
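The arithmetic behind these bullets can be checked from the estimates and standard errors shown on this page. The sketch assumes normal-approximation intervals (estimate ± 1.96 × SE) and uses the rounded inputs, so the last digit can differ slightly from values computed with the post's unrounded numbers.

```python
import math

# Point estimates and standard errors as reported on this page (rounded).
estimates = {
    "DML (RF)": (-0.0736, 0.0354),
    "DML (LASSO)": (-0.0712, 0.0354),
}
Z_95 = 1.96   # normal critical value for a two-sided 95% interval

intervals = {}
for name, (theta, se) in estimates.items():
    lower, upper = theta - Z_95 * se, theta + Z_95 * se
    intervals[name] = (lower, upper)
    # The outcome is in logs, so exp(.) - 1 converts a bound to a proportional change.
    pct_lower = 100.0 * (math.exp(lower) - 1.0)
    print(f"{name:12s} CI=[{lower:.4f}, {upper:.4f}]  "
          f"largest plausible reduction={pct_lower:.1f}%")

# Gap between the two learners, expressed as a fraction of one standard error.
gap_in_se = abs(-0.0736 - (-0.0712)) / 0.0354
print(f"learner gap = {gap_in_se:.3f} SE")
```

Because the outcome is log duration, a bound near -0.14 is a log-point change; the exponential transform gives the corresponding percentage change in duration.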
Connecting back to Tab 3
The DML-rigorous vs DML-CV comparison you experimented with on simulated data is the same machinery the post applies to the Pennsylvania experiment — just with two different ML learners (RF and LASSO) instead of two different λ-selection rules:
- DML (RF): α̂ = -0.0736, SE = 0.0354, 95% CI = [-0.143, -0.004]
- DML (LASSO): α̂ = -0.0712, SE = 0.0354, 95% CI = [-0.141, -0.002]
- Naive OLS: α̂ = -0.0855 — biggest in absolute value, but no cross-fitted SE
The takeaway: in this RCT, all roads lead to roughly -0.07. DML's contribution is not a different point estimate but valid inference machinery that would still work if treatment were not randomised. The simulator in Tab 3 makes that hypothetical concrete.
Why no CI bars on the OLS rows?
The post reports the two OLS coefficients as reference points and computes formal inference only for the DML estimates (where the cross-fitted standard errors are available out of the box from the doubleml package). Standard OLS SEs are available too and would be slightly narrower than DML's, but they assume the linear specification is correct and lose validity if $g(X)$ is truly non-linear. Showing only the DML CIs makes explicit which inference machinery the post relies on for the headline claim.