Causal ML for Policy Evaluation

From an average effect to a personalised policy

A government runs a job-training programme and wants to know three things at once. Does training cause more months of employment? Does the effect depend on who you are? Can we use those differences to assign training better? Causal Machine Learning answers these as three estimands — the ATE, the GATE, and the IATE — and turns the IATE into a welfare-maximising rule.

This app lets you turn the dials yourself. This overview is the first of four tabs. In the other three you will compare the naive, DoubleML, and Causal-Forest estimators on the post's actual numbers; simulate confounding bias and watch a regression adjustment close most of it; and benchmark an IATE-based assignment rule against treating everyone and against an oracle.

Why naive comparison fails — and what DoubleML fixes

In observational data, the units who got treated are not a random sample. Caseworkers steer low-Dutch-proficiency jobseekers (who benefit most from training) into the programme, but those same jobseekers also have lower baseline employment. Because that baseline gap outweighs their larger gains, a simple difference-in-means underestimates the programme's effect. The animation below shows the two estimators across simulated draws: the muted dot is the naive estimate (always below the true ATE), and the steel-blue dot is DoubleML (whose interval covers the truth).

Tab 2

ATE Forest Plot

Naive vs DoubleML vs CausalForestDML on the post's actual numbers, with the truth as a reference line. Hover for SE and CI; toggle methods.

Tab 3

Confounding Sim

Crank up the confounding asymmetry and watch the naive estimator drift away from the truth while a regression-adjusted estimator stays close. Run 100 simulations to see the bias-variance picture.

Tab 4

GATE & Policy

GATE by Dutch proficiency: estimated vs truth. Then drag the cost slider and watch the welfare ranking of four assignment rules.

Glossary (open a card if a term is unfamiliar)

Potential outcomes Y(0), Y(1)

The outcome a unit would have under each treatment value. We observe only one — the other is counterfactual and must be estimated.

ATE

Average Treatment Effect, E[Y(1) − Y(0)]. The headline policy number — one effect for the whole population.

GATE

Group Average Treatment Effect, E[Y(1) − Y(0) | Z = z]. The ATE restricted to a pre-specified subgroup.

IATE

Individualised Average Treatment Effect, τ(x) = E[Y(1) − Y(0) | X = x]. One effect per covariate profile, and the input to a personalised assignment rule.
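The three estimands nest inside one another, which is why a mean of IATEs can stand in for the ATE. A short sketch of that relationship, assuming the grouping variable Z is discrete and is itself one of the covariates in X (as Dutch proficiency appears to be here):

```latex
\begin{align*}
\tau(x) &= \mathbb{E}[\,Y(1) - Y(0) \mid X = x\,] && \text{(IATE)} \\
\mathrm{GATE}(z) &= \mathbb{E}[\,Y(1) - Y(0) \mid Z = z\,]
                  = \mathbb{E}[\,\tau(X) \mid Z = z\,] && \text{(GATE)} \\
\mathrm{ATE} &= \mathbb{E}[\,Y(1) - Y(0)\,]
              = \mathbb{E}[\,\tau(X)\,]
              = \sum_{z} P(Z = z)\,\mathrm{GATE}(z) && \text{(ATE)}
\end{align*}
```

Each equality is the law of iterated expectations: averaging the finer estimand over the coarser conditioning set recovers the coarser one.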

Propensity score π(x)

P(D = 1 | X = x). Overlap requires π(x) bounded away from 0 and 1 for every relevant profile — without overlap, there is no counterfactual to estimate from.

Unconfoundedness

Conditional on X, treatment assignment is as good as random. The identifying assumption that justifies DoubleML and CausalForestDML over a naive comparison.

Cross-fitting

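Split the sample into K folds; fit the nuisance models on K−1 folds; predict on the held-out fold; rotate until every unit has an out-of-fold prediction. This keeps a flexible learner's over-fitting to its own training sample out of the final estimate.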
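The fold logic can be sketched in a few lines of plain Python. This is an illustrative sketch only: `make_folds`, `cross_fit`, and the toy mean learner are hypothetical names, not the post's code and not any library's API.

```python
import random
from statistics import mean


def make_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]


def cross_fit(x, y, k, fit, predict, seed=0):
    """Return one out-of-fold prediction per unit."""
    n = len(y)
    preds = [None] * n
    for held_out in make_folds(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        # Fit on the K-1 training folds only.
        model = fit([x[i] for i in train], [y[i] for i in train])
        # Predict for the units the model has never seen.
        for i in held_out:
            preds[i] = predict(model, x[i])
    return preds


# Toy learner (hypothetical): ignores x and predicts the training mean of y.
def fit_mean(x_train, y_train):
    return mean(y_train)


def predict_mean(model, x_i):
    return model


x = [0, 1, 0, 1, 0, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof = cross_fit(x, y, k=3, fit=fit_mean, predict=predict_mean)
print(oof)
```

Every entry of `oof` comes from a model trained without that unit, which is the property the score relies on. In practice the toy learner would be replaced by a flexible one such as a random forest.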

Doubly-robust score

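A score that yields a consistent ATE estimate if either the outcome model or the propensity model is correctly specified. DoubleML's IRM model uses a score of this kind; the "double" in the name refers to double (debiased) machine learning rather than to double robustness itself.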
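The "either/or" property is easiest to see by breaking one nuisance at a time. The sketch below uses the standard AIPW form of the score on a made-up DGP with one binary covariate; all numbers, names, and the DGP are assumptions for illustration, not the post's data.

```python
import random
from statistics import mean


def simulate(n, seed=0):
    """Hypothetical DGP: x = 1 marks a group with lower baseline, larger effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        pi = 0.8 if x else 0.3                 # true propensity P(D = 1 | x)
        d = 1 if rng.random() < pi else 0
        y0 = 10 - 4 * x + rng.gauss(0, 1)      # baseline outcome
        tau = 2 + 2 * x                        # treatment effect
        rows.append((x, d, y0 + tau * d))
    return rows


def aipw_ate(rows, m0, m1, pi):
    """Mean of the AIPW (doubly-robust) score for given nuisance functions."""
    scores = []
    for x, d, y in rows:
        s = m1(x) - m0(x)
        s += d * (y - m1(x)) / pi(x)
        s -= (1 - d) * (y - m0(x)) / (1 - pi(x))
        scores.append(s)
    return mean(scores)


rows = simulate(20000)
true_ate = 3.0  # E[2 + 2x] with P(x = 1) = 0.5

naive = mean(y for _, d, y in rows if d) - mean(y for _, d, y in rows if not d)

# Correct nuisances for this DGP.
good_m0 = lambda x: 10 - 4 * x
good_m1 = lambda x: 12 - 2 * x
good_pi = lambda x: 0.8 if x else 0.3
# Deliberately wrong nuisances.
bad_m = lambda x: 9.0
bad_pi = lambda x: 0.5

ate_bad_outcome = aipw_ate(rows, bad_m, bad_m, good_pi)
ate_bad_propensity = aipw_ate(rows, good_m0, good_m1, bad_pi)
ate_both_bad = aipw_ate(rows, bad_m, bad_m, bad_pi)

print(f"naive={naive:.2f}  bad outcome={ate_bad_outcome:.2f}  "
      f"bad propensity={ate_bad_propensity:.2f}  both bad={ate_both_bad:.2f}")
```

With one nuisance wrong the estimate should still land near the true ATE of 3; with both wrong the protection is gone. Here the nuisances are supplied by hand rather than learned, so this isolates the score's robustness and says nothing about cross-fitting.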

Causal forest

A random forest adapted for causal estimation: trees split on heterogeneity in the treatment effect, not the outcome. Each leaf approximates a local CATE.

Welfare-maximising rule

A policy that treats unit i if and only if τ̂_i > cost. Benchmarked against treat-none, treat-all, and an oracle that uses the true τ.

The ATE — three estimators, one truth

These numbers come straight from method_comparison.csv in the post's folder. The true ATE is 5.628 months (orange dashed line). Toggle estimators on/off and hover a point for SE, 95% CI, and bias. The story to look for: the naive interval misses the truth entirely; DoubleML covers it; the CausalForestDML mean-of-IATEs is precise but slightly under-covers.

Estimators

Naive (DiM) · DoubleML (IRM) · CausalForestDML · Truth (5.628)

Naive bias: −0.517 (CI misses truth)
DoubleML bias: −0.108 (CI covers truth)
DML bias closed: 79% (vs naive baseline)
True ATE: 5.628 (months employed)

What to look for

The naive 95% CI [4.93, 5.30] sits entirely below the truth. This is visible confounding bias — the kind you cannot see in a real application because the truth is unknown.
DoubleML's [5.36, 5.68] straddles 5.628. Cross-fitted random-forest nuisances absorb the dependence of both treatment and outcome on covariates, and the orthogonal score corrects for residual nuisance error.
CausalForestDML's CI is the tightest but under-covers. It is an interval for the average of individual predictions, not the population ATE — use it for ranking and heterogeneity, not ATE inference.
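The headline cards follow from the reported numbers by simple arithmetic. A minimal check, using only the values quoted on this tab (the helper `covers` is a hypothetical name):

```python
truth = 5.628                      # true ATE, months employed
naive_bias, dml_bias = -0.517, -0.108
naive_ci = (4.93, 5.30)
dml_ci = (5.36, 5.68)


def covers(ci, value):
    """True if value lies inside the closed interval ci."""
    return ci[0] <= value <= ci[1]


# Point estimates implied by bias = estimate - truth.
naive_est = truth + naive_bias
dml_est = truth + dml_bias

# Share of the naive bias that DoubleML removes.
bias_closed = 1 - abs(dml_bias) / abs(naive_bias)

print(f"naive covers truth: {covers(naive_ci, truth)}")
print(f"DoubleML covers truth: {covers(dml_ci, truth)}")
print(f"bias closed: {bias_closed:.0%}")
```

The implied point estimates also sit at the midpoints of their reported intervals, which is a quick consistency check on the cards.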

Why does the naive estimator under-estimate the effect?

In the synthetic DGP, caseworkers steer low-Dutch-proficiency jobseekers (who benefit most from training, mean τ ≈ 7.6 months) into the programme. Those same jobseekers also have lower baseline employment for reasons the covariates capture. The naive difference-in-means cannot disentangle the programme's effect from that selection effect — it gets pulled toward zero. DoubleML uses flexible random-forest nuisances plus the doubly-robust score to remove the confounding.
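The direction of the naive bias is not automatic, and a short decomposition shows why. Writing ATT for the average effect among the treated, the standard identity is:

```latex
\begin{align*}
\underbrace{\mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0]}_{\text{naive difference-in-means}}
  &= \underbrace{\mathbb{E}[\,Y(1) - Y(0) \mid D = 1\,]}_{\mathrm{ATT}}
   + \underbrace{\mathbb{E}[\,Y(0) \mid D = 1\,] - \mathbb{E}[\,Y(0) \mid D = 0\,]}_{\text{baseline selection}} \\[4pt]
\text{naive} - \mathrm{ATE}
  &= \underbrace{(\mathrm{ATT} - \mathrm{ATE})}_{\ge 0 \text{ here}}
   + \underbrace{\bigl(\mathbb{E}[\,Y(0) \mid D = 1\,] - \mathbb{E}[\,Y(0) \mid D = 0\,]\bigr)}_{< 0 \text{ here}}
\end{align*}
```

Steering high-benefit jobseekers into training pushes the first term up, while their lower baseline employment pushes the second term down. The reported naive bias of −0.517 months implies that the baseline term dominates in this DGP.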

Confounding Sim — watch the bias appear and disappear

The confounding mechanism mirrors Tab 2, but this simulation has its own scale (true ATE = 0.5) and lets you control the confounding asymmetry. Slide it from 0 (clean RCT-like data) to 1 (heavy confounding where treatment is well-predicted by covariates). The naive estimator drifts; the adjusted estimator stays close to the true ATE. Click "Run 100 simulations" to see the full bias-variance picture.

Sample size n 300

More data ⇒ tighter intervals, same bias direction.

Controls p 20

Number of pre-treatment covariates available for adjustment.

Signal strength 0.60

How strongly covariates predict the outcome.

Confounding asymmetry 0.70

0 = treatment as good as random · 1 = treatment heavily predicted by X.

Naive estimator (simple difference) · Naive bias (vs true ATE)
Adjusted estimator (regression on X) · Adjusted bias (vs true ATE)

What to look for

At asymmetry = 0, both estimators agree. Treatment is as good as random; a simple difference is unbiased.
Slide asymmetry to 0.8. The naive bar drifts left of the true-ATE orange line; the adjusted bar stays close.
Push n up. Confidence intervals tighten but the naive bias does not shrink: confounding bias is a property of the estimator and does not average out as the sample grows.

Bias vs variance over many simulations

Single runs are noisy. Run the whole pipeline 100 times (same parameters, different draws) to see whether the naive bias is systematic.
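A stripped-down version of this experiment fits in plain Python. The sketch below is an assumption-laden stand-in for the app's simulator: it uses a single covariate instead of p controls, a logistic propensity whose slope is `3 * asym`, and a linear outcome, none of which are taken from the app's code.

```python
import math
import random
from statistics import mean, stdev

TRUE_ATE = 0.5


def draw(n, signal, asym, rng):
    """One sample: x raises treatment odds and lowers the outcome."""
    xs, ds, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        p = 1 / (1 + math.exp(-3 * asym * x))   # asym = 0 gives p = 0.5
        d = 1 if rng.random() < p else 0
        y = TRUE_ATE * d - signal * x + rng.gauss(0, 1)
        xs.append(x)
        ds.append(d)
        ys.append(y)
    return xs, ds, ys


def naive(ds, ys):
    """Simple difference in means between treated and control."""
    treated = [y for d, y in zip(ds, ys) if d]
    control = [y for d, y in zip(ds, ys) if not d]
    return mean(treated) - mean(control)


def residualise(v, x):
    """Residuals from an OLS fit of v on x with an intercept."""
    vbar, xbar = mean(v), mean(x)
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - vbar) for a, b in zip(x, v)) / sxx
    return [b - vbar - slope * (a - xbar) for a, b in zip(x, v)]


def adjusted(xs, ds, ys):
    """Coefficient on D in OLS of Y on (1, D, X), via Frisch-Waugh-Lovell."""
    rd, ry = residualise(ds, xs), residualise(ys, xs)
    return sum(a * b for a, b in zip(rd, ry)) / sum(a * a for a in rd)


def monte_carlo(reps, n, signal, asym, seed=0):
    """Mean bias and spread of both estimators over repeated draws."""
    rng = random.Random(seed)
    nv, ad = [], []
    for _ in range(reps):
        xs, ds, ys = draw(n, signal, asym, rng)
        nv.append(naive(ds, ys))
        ad.append(adjusted(xs, ds, ys))
    return {
        "naive_bias": mean(nv) - TRUE_ATE,
        "naive_sd": stdev(nv),
        "adjusted_bias": mean(ad) - TRUE_ATE,
        "adjusted_sd": stdev(ad),
    }


summary = monte_carlo(reps=100, n=300, signal=0.6, asym=0.7)
print({k: round(v, 3) for k, v in summary.items()})
```

The mean bias across repetitions answers the "is it systematic?" question, while the standard deviations show how noisy any single run is. Because the outcome here is linear in x by construction, plain regression adjustment is correctly specified; a nonlinear DGP is where flexible nuisances would start to matter.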

From group effects to a personalised policy

The population ATE hides the most policy-relevant signal: effects are bigger for some people than for others. Above, the GATE by Dutch proficiency declines from 7.47 months (no Dutch) to 2.91 months (native) — a 2.6× gap that lines up with the truth. Below, drag the cost slider to see how an IATE-based rule ("treat only where τ̂ > cost") compares to treat-all and an oracle that knows the true τ.

GATE by Dutch proficiency

Numbers come from gate_by_dutch.csv. The error bars are the 95% CIs of the doubly-robust pseudo-outcome group means. Every CI covers its corresponding truth (orange bar).
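Once each worker has a doubly-robust pseudo-outcome, a GATE is just a group mean with a standard normal-approximation interval. A minimal sketch with made-up pseudo-outcomes (the values, group labels, and `gate_table` helper are hypothetical and are not read from gate_by_dutch.csv):

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def gate_table(pseudo_outcomes, groups, z=1.96):
    """Group mean of pseudo-outcomes with a normal-approximation 95% CI."""
    by_group = defaultdict(list)
    for psi, g in zip(pseudo_outcomes, groups):
        by_group[g].append(psi)
    table = {}
    for g, vals in by_group.items():
        est = mean(vals)
        se = stdev(vals) / sqrt(len(vals))   # standard error of the group mean
        table[g] = (est, est - z * se, est + z * se)
    return table


# Hypothetical pseudo-outcomes for two proficiency groups.
psi = [8.0, 7.0, 7.5, 3.0, 2.5, 3.5]
grp = ["none", "none", "none", "native", "native", "native"]

table = gate_table(psi, grp)
for g, (est, lo, hi) in table.items():
    print(f"{g}: {est:.2f} [{lo:.2f}, {hi:.2f}]")
```

With only three observations per group the intervals here are purely illustrative; the normal approximation is meant for the much larger group sizes of a real cohort.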

Mean abs error: 0.10 months, across 4 groups
Effect ratio: 2.6×, lowest vs highest Dutch
IATE corr. with truth: 0.956 (CausalForestDML)
IATE MAE: 0.40 months per worker

Welfare under four assignment rules

Welfare is the average net gain per person: τ − cost for each person treated and 0 for each person left untreated, with the per-person cost of training fixed at 4 months in the post. The rules are evaluated against the true τ in the synthetic DGP. The IATE rule treats 83.9% of the cohort (within 0.2 percentage points of the oracle) and captures 99.5% of oracle welfare.

What to look for

The IATE rule beats treat-all by 7.4% (1.749 vs 1.628 months per person) — the central practical reason to estimate individual effects rather than stop at the ATE.
The gap to the oracle is just 0.009 months. The 0.40-month MAE in the individual estimates produces only a tiny welfare loss because the mis-ranked workers cluster near the cost cutoff where the welfare slope is shallow.
Treat-none yields zero net welfare. Treating someone adds to welfare only when τ > cost, so without targeting the cost eats into the average effect: treat-all nets 5.628 − 4 = 1.628 months per person.
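The four-rule comparison can be reproduced in spirit on simulated effects. The sketch below assumes normally distributed true effects and noisy estimates; those distributions and the `welfare` helper are illustrative assumptions, so the printed numbers will not match the post's 1.749 and 1.628.

```python
import random
from statistics import mean


def welfare(tau_true, treat, cost):
    """Average net gain per person: tau - cost if treated, else 0."""
    return mean((t - cost) if w else 0.0 for t, w in zip(tau_true, treat))


rng = random.Random(0)
n, cost = 20000, 4.0

# Hypothetical true effects and noisy IATE estimates.
tau = [rng.gauss(5.6, 2.0) for _ in range(n)]
tau_hat = [t + rng.gauss(0, 0.5) for t in tau]

rules = {
    "treat-none": [False] * n,
    "treat-all": [True] * n,
    "IATE rule": [th > cost for th in tau_hat],   # uses estimates
    "oracle": [t > cost for t in tau],            # uses the truth
}

# Every rule is scored against the true tau, as in the post.
results = {name: welfare(tau, treat, cost) for name, treat in rules.items()}
for name, value in results.items():
    print(f"{name}: {value:.3f} months per person")
```

The oracle is an upper bound by construction, since it treats exactly the people with a positive net gain. How close the IATE rule gets depends on how often estimation error flips someone across the cost threshold.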

Connecting back to Tab 2

The forest plot in Tab 2 shows that DoubleML closes 79% of the bias gap and that its 95% CI covers the true ATE. This tab shows the payoff of going further: the individual effect estimates from CausalForestDML, fed into a simple decision rule, recover almost all of the welfare an omniscient planner could achieve. The operational division of labour the literature recommends is DoubleML for the ATE and a causal forest for ranking and personalised policy.