From an average effect to a personalised policy
A government runs a job-training programme and wants to know three things at once. Does training cause more months of employment? Does the effect depend on who you are? Can we use those differences to assign training better? Causal Machine Learning answers these as three estimands — the ATE, the GATE, and the IATE — and turns the IATE into a welfare-maximising rule.
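For reference, the three estimands and the resulting rule can be written in standard potential-outcomes notation. The symbols G (a group indicator such as Dutch proficiency), X (the covariates), and c (the per-person cost of training) are notational assumptions for this sketch rather than the post's own symbols.

```latex
\begin{aligned}
\text{ATE} &= \mathbb{E}\,[\,Y(1) - Y(0)\,] \\
\text{GATE}(g) &= \mathbb{E}\,[\,Y(1) - Y(0) \mid G = g\,] \\
\text{IATE}(x) = \tau(x) &= \mathbb{E}\,[\,Y(1) - Y(0) \mid X = x\,] \\
d^{*}(x) &= \mathbf{1}\{\tau(x) > c\}
\end{aligned}
```

The last line is the welfare-maximising rule: treat exactly those people whose individual effect exceeds the cost.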
This app lets you turn the dials yourself. In four tabs you will: compare the naive, DoubleML, and Causal-Forest estimators on the post's actual numbers; simulate confounding bias and watch a regression adjustment close most of it; and benchmark an IATE-based assignment rule against treating everyone and against an oracle.
Why naive comparison fails — and what DoubleML fixes
In observational data, the units who got treated are not a random sample. Caseworkers steer low-Dutch-proficiency jobseekers (who benefit most from training) into the programme — but those same jobseekers also have lower baseline employment. A simple difference-in-means therefore underestimates the programme's effect. The animation below shows the two estimators across simulated draws: the muted dot is the naive estimate (always below the true ATE), the steel-blue dot is DoubleML (covers the truth).
ATE Forest Plot
Naive vs DoubleML vs CausalForestDML on the post's actual numbers, with the truth as a reference line. Hover for SE and CI; toggle methods.
Confounding Sim
Crank up the confounding asymmetry and watch the naive estimator drift away from the truth while a regression-adjusted estimator stays close. Run 100 simulations to see the bias-variance picture.
GATE & Policy
GATE by Dutch proficiency: estimated vs truth. Then drag the cost slider and watch the welfare ranking of four assignment rules.
Glossary (open a card if a term is unfamiliar)
Potential outcomes Y(0), Y(1)
ATE
GATE
IATE
Propensity score π(x)
Unconfoundedness
Cross-fitting
Doubly-robust score
Causal forest
Welfare-maximising rule
The ATE — three estimators, one truth
These numbers come straight from method_comparison.csv in the post's folder. The true ATE is 5.628 months (orange dashed line). Toggle estimators on/off and hover a point for SE, 95% CI, and bias. The story to look for: the naive interval misses the truth entirely; DoubleML covers it; the CausalForestDML mean-of-IATEs is precise but slightly under-covers.
What to look for
- The naive 95% CI [4.93, 5.30] sits entirely below the truth. This is visible confounding bias — the kind you cannot see in a real application because the truth is unknown.
- DoubleML's [5.36, 5.68] straddles 5.628. Cross-fitted random-forest nuisances absorb the dependence of both treatment and outcome on covariates, and the orthogonal score corrects for residual nuisance error.
- CausalForestDML's CI is the tightest but under-covers. It is an interval for the average of individual predictions, not the population ATE — use it for ranking and heterogeneity, not ATE inference.
Why does the naive estimator under-estimate the effect?
In the synthetic DGP, caseworkers steer low-Dutch-proficiency jobseekers (who benefit most from training, mean τ ≈ 7.6 months) into the programme. Those same jobseekers also have lower baseline employment for reasons the covariates capture. The naive difference-in-means cannot disentangle the programme's effect from that selection effect — it gets pulled toward zero. DoubleML uses flexible random-forest nuisances plus the doubly-robust score to remove the confounding.
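The mechanism described above can be sketched in a few lines of plain Python. This is a toy illustration, not the post's DGP or the DoubleML implementation: it assumes a single binary covariate x (1 = low Dutch proficiency), made-up effect sizes with a true ATE of 5.0, and cell means as the nuisance estimators where the post uses random forests. All function names are hypothetical.

```python
import math
import random
import statistics


def simulate(n, seed):
    """Toy confounded data: x=1 raises treatment odds and the effect, lowers the baseline."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        d = 1 if rng.random() < (0.8 if x else 0.3) else 0   # caseworker steering
        tau = 7.0 if x else 3.0                               # assumed effects, ATE = 5.0
        y = 10.0 - 5.0 * x + d * tau + rng.gauss(0.0, 1.0)    # lower baseline when x=1
        rows.append((x, d, y))
    return rows


def naive_ate(rows):
    """Plain difference in means between treated and untreated."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def fit_nuisances(train):
    """Per-cell propensity and outcome means (stand-in for flexible learners)."""
    fit = {}
    for x in (0, 1):
        cell = [(d, y) for xx, d, y in train if xx == x]
        e = statistics.fmean([d for d, _ in cell])
        mu1 = statistics.fmean([y for d, y in cell if d == 1])
        mu0 = statistics.fmean([y for d, y in cell if d == 0])
        fit[x] = (e, mu1, mu0)
    return fit


def doubly_robust_scores(rows, n_folds=2):
    """Cross-fitting: nuisances are fitted on the other fold(s), scores computed out of fold."""
    scores = []
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        test = [r for i, r in enumerate(rows) if i % n_folds == k]
        fit = fit_nuisances(train)
        for x, d, y in test:
            e, mu1, mu0 = fit[x]
            psi = (mu1 - mu0) + d * (y - mu1) / e - (1 - d) * (y - mu0) / (1 - e)
            scores.append(psi)
    return scores


rows = simulate(20_000, seed=0)
naive = naive_ate(rows)
scores = doubly_robust_scores(rows)
dr = statistics.fmean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))
print(f"naive = {naive:.2f}, doubly robust = {dr:.2f} (SE {se:.3f}), truth = 5.00")
```

In this toy setup the naive estimate lands well below 5.0 because the treated group over-represents the low-baseline cell, while the cross-fitted doubly-robust mean stays close to the truth.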
Confounding Sim — watch the bias appear and disappear
Same data-generating process as Tab 2, but you control the confounding asymmetry. Slide it from 0 (clean RCT-like data) to 1 (heavy confounding where treatment is well-predicted by covariates). The naive estimator drifts; the adjusted estimator stays close to the true ATE = 0.5. Click "Run 100 simulations" to see the full bias-variance picture.
What to look for
- At asymmetry = 0, both estimators agree. Treatment is as good as random; a simple difference is unbiased.
- Slide asymmetry to 0.8. The naive bar drifts left of the true-ATE orange line; the adjusted bar stays close.
- Push n up. The confidence intervals tighten but the naive bias does not shrink: bias is a property of the estimator, not the sample size.
Bias vs variance over many simulations
Single runs are noisy. Run the whole pipeline 100 times (same parameters, different draws) to see whether the naive bias is systematic.
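A minimal sketch of such a repeated-simulation experiment follows. It is not the app's code: the DGP (one binary confounder, a constant effect of 0.5, treatment probability 0.5 ± 0.4 × asymmetry) and the stratified regression adjustment are assumptions chosen to keep the example short, and all names are hypothetical.

```python
import random
import statistics

TRUE_ATE = 0.5


def one_draw(n, asymmetry, rng):
    """One dataset; asymmetry in [0, 1] controls how strongly x drives treatment."""
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = 0.5 + 0.4 * asymmetry * (1 if x else -1)
        d = 1 if rng.random() < p else 0
        y = -1.0 * x + TRUE_ATE * d + rng.gauss(0.0, 1.0)  # x=1 lowers the baseline
        rows.append((x, d, y))
    return rows


def mean_diff(rows):
    """Difference in mean outcome, treated minus untreated."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def adjusted(rows):
    """Regression adjustment on a binary x: within-cell differences weighted by cell share."""
    total = 0.0
    for x in (0, 1):
        cell = [r for r in rows if r[0] == x]
        total += len(cell) / len(rows) * mean_diff(cell)
    return total


def monte_carlo(n, asymmetry, reps=100, seed=0):
    """Repeat the pipeline and summarise bias and spread of both estimators."""
    rng = random.Random(seed)
    naive, adj = [], []
    for _ in range(reps):
        rows = one_draw(n, asymmetry, rng)
        naive.append(mean_diff(rows))
        adj.append(adjusted(rows))
    return {
        "naive_bias": statistics.fmean(naive) - TRUE_ATE,
        "naive_sd": statistics.stdev(naive),
        "adj_bias": statistics.fmean(adj) - TRUE_ATE,
        "adj_sd": statistics.stdev(adj),
    }


clean = monte_carlo(n=1000, asymmetry=0.0)
small = monte_carlo(n=500, asymmetry=0.8)
large = monte_carlo(n=4000, asymmetry=0.8)
for label, res in [("asym=0, n=1000", clean), ("asym=0.8, n=500", small), ("asym=0.8, n=4000", large)]:
    print(label, {k: round(v, 3) for k, v in res.items()})
```

Under these assumptions the naive bias is about -0.8 × asymmetry regardless of n, so raising n narrows the spread of the naive estimates around the wrong value; the adjusted estimator's bias stays near zero.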
From group effects to a personalised policy
The population ATE hides the most policy-relevant signal: effects are bigger for some people than for others. Above, the GATE by Dutch proficiency declines from 7.47 months (no Dutch) to 2.91 months (native) — a 2.6× gap that lines up with the truth. Below, drag the cost slider to see how an IATE-based rule ("treat only where τ̂ > cost") compares to treat-all and an oracle that knows the true τ.
GATE by Dutch proficiency
Numbers come from gate_by_dutch.csv. The error bars are the 95% CIs of the doubly-robust pseudo-outcome group means. Every CI covers its corresponding truth (orange bar).
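How a group mean of pseudo-outcomes becomes a GATE with a 95% CI can be sketched as follows. The pseudo-outcomes here are made up (normal noise around two hypothetical group effects), the group labels are placeholders, and the interval uses a plain normal approximation; none of this reproduces gate_by_dutch.csv.

```python
import math
import random
import statistics


def gate_table(groups, pseudo_outcomes, z=1.96):
    """Group means of pseudo-outcomes with normal-approximation confidence intervals."""
    by_group = {}
    for g, psi in zip(groups, pseudo_outcomes):
        by_group.setdefault(g, []).append(psi)
    table = {}
    for g, values in by_group.items():
        mean = statistics.fmean(values)
        se = statistics.stdev(values) / math.sqrt(len(values))
        table[g] = {"n": len(values), "gate": mean, "ci": (mean - z * se, mean + z * se)}
    return table


# Made-up inputs: hypothetical group effects plus noise, standing in for
# the doubly-robust pseudo-outcomes a fitted model would produce.
rng = random.Random(1)
assumed_gate = {"none": 7.5, "native": 3.0}
groups, psi = [], []
for g, tau in assumed_gate.items():
    for _ in range(2000):
        groups.append(g)
        psi.append(tau + rng.gauss(0.0, 3.0))

table = gate_table(groups, psi)
for g, row in table.items():
    lo, hi = row["ci"]
    print(f"{g}: GATE = {row['gate']:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}], n = {row['n']}")
```

Because each pseudo-outcome is noisy on its own, the interval width is driven by the group size: small groups get wide error bars even when the point estimate looks plausible.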
Welfare under four assignment rules
In the post, the welfare formula fixes the per-person cost of training at cost = 4 months. The rules are evaluated against the true τ in the synthetic DGP. The IATE rule treats 83.9% of the cohort (within 0.2 percentage points of the oracle) and captures 99.5% of oracle welfare.
What to look for
- The IATE rule beats treat-all by 7.4% (1.749 vs 1.628 months per person) — the central practical reason to estimate individual effects rather than stop at the ATE.
- The gap to the oracle is just 0.009 months. The 0.40-month MAE in the individual estimates produces only a tiny welfare loss because the misclassified workers cluster near the cost cutoff, where τ is close to the cost and each mistake therefore costs little.
- Treat-none yields zero net welfare. The cost ensures that you only gain on people for whom τ > cost — without targeting, the cost can wipe out the average effect.
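The comparison of the four rules can be sketched as below, assuming welfare is the average of (τ − cost) over the people a rule treats and zero for everyone else. The true effects and the noisy estimates are simulated with made-up parameters, so the printed numbers will not match the post's.

```python
import random
import statistics


def welfare(treat, tau, cost):
    """Mean net gain per person: tau - cost if treated, 0 otherwise (evaluated on true tau)."""
    return statistics.fmean([(t - cost) if d else 0.0 for d, t in zip(treat, tau)])


rng = random.Random(0)
COST = 4.0

# Hypothetical cohort: true effects and noisy estimates of them.
tau = [rng.gauss(5.6, 2.0) for _ in range(10_000)]
tau_hat = [t + rng.gauss(0.0, 0.5) for t in tau]

rules = {
    "treat_none": [False] * len(tau),
    "treat_all": [True] * len(tau),
    "iate_rule": [th > COST for th in tau_hat],   # uses estimates only
    "oracle": [t > COST for t in tau],            # knows the true tau
}

results = {name: welfare(d, tau, COST) for name, d in rules.items()}
share_treated = {name: sum(d) / len(d) for name, d in rules.items()}

for name in rules:
    print(f"{name:10s} welfare = {results[name]:.3f}, treated = {share_treated[name]:.1%}")
```

The same formula explains the treat-all figure quoted above: with a true ATE of 5.628 months and a cost of 4, treating everyone yields 5.628 − 4 = 1.628 months per person.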
Connecting back to Tab 2
The forest plot in Tab 2 shows that DoubleML closes 79% of the bias gap and that its 95% CI covers the true ATE. This tab shows the payoff of going further: the individual effect estimates from CausalForestDML, fed into a simple decision rule, recover almost all of the welfare an omniscient planner could achieve. The operational division of labour the literature recommends is DoubleML for the ATE and a causal forest for ranking and personalised policy.