DoWhy — Interactive Lab

Why a four-step framework for causal inference?

Did a job training program actually cause participants to earn more, or did people who signed up just differ from those who did not? A simple difference in means gives $1,794, but is that the program's effect or a mirage created by confounders like age, education, and prior earnings? DoWhy answers this with four explicit steps: Model, Identify, Estimate, Refute.

This app lets you walk the four steps yourself. You'll see how confounders create spurious effects, compare five different ATE estimators on the actual Lalonde data, and stress-test the result with placebo and stability tests — the same refutation tests DoWhy runs in Step 4.

The causal graph — confounders create backdoor paths

The animation below shows the three actors in every causal inference problem: confounders (C), treatment (T), and outcome (Y). The orange pulse traces the backdoor path T ← C → Y that confounders open: the spurious channel that adjustment must close. The steel pulse then shows the only path left after adjustment, T → Y: the genuine causal effect.
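To make "backdoor path" concrete, here is a minimal sketch that enumerates the paths from T to Y whose first edge points into T, using the three-node graph from the animation. The edge list and the helper names backdoor_paths and render are illustrative assumptions, not DoWhy functions, and the code only lists paths; it does not check whether a given adjustment set blocks them.

```python
# Hypothetical toy graph: each edge is a (cause, effect) pair.
EDGES = [("C", "T"), ("C", "Y"), ("T", "Y")]

def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome whose first edge points INTO treatment."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    directed = set(edges)
    paths = []

    def walk(node, path):
        # Depth-first search over the undirected skeleton, never revisiting a node.
        if node == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])

    for first in sorted(neighbors.get(treatment, ())):
        if (first, treatment) in directed:  # the arrow enters the treatment
            walk(first, [treatment, first])
    return paths

def render(path, edges):
    """Draw a path with arrows that respect the edge directions."""
    directed = set(edges)
    out = path[0]
    for a, b in zip(path, path[1:]):
        out += (" -> " if (a, b) in directed else " <- ") + b
    return out

paths = backdoor_paths(EDGES, "T", "Y")
for p in paths:
    print(render(p, EDGES))  # prints: T <- C -> Y
```

The direct edge T -> Y is not listed because it leaves the treatment rather than entering it; that is the causal path the adjustment is meant to preserve.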

Tab 2

Confounder Lab

Slide the confounding strength yourself. Watch how the naive estimate drifts away from the true ATE — and how backdoor adjustment pulls it back.

Tab 3

Estimator Comparison

The post's headline figure, interactively. Compare five DoWhy estimators on the Lalonde data. Toggle methods and hover for SEs and CIs.

Tab 4

Refutation Tests

Stress-test the estimate. Placebo, random common cause, data subset — see why the placebo test is the most convincing of the three.

Glossary (open a card if a term is unfamiliar)

ATE — Average Treatment Effect

E[Y(1) − Y(0)]. The expected effect if a random person in the population received treatment. The estimand of interest in this post.

ATT — Effect on the Treated

E[Y(1) − Y(0) | T=1]. The effect among people who actually received treatment. PS matching estimates this, not the ATE.

Confounder

A variable that causes both treatment and outcome. You must adjust for confounders to isolate the causal effect.

Backdoor criterion

A graph rule for choosing which variables to adjust for. Conditioning on a set that blocks all backdoor paths and contains no descendants of the treatment identifies the ATE.

Propensity score e(X)

P(T = 1 | X). The probability of receiving treatment given covariates, usually estimated by logistic regression. IPW weights treated units by 1/e(X) and controls by 1/(1 − e(X)).

IPW — Inverse Probability Weighting

Re-weight each unit by the inverse of the probability of the treatment it actually received. This creates a pseudo-population in which treatment looks randomly assigned.

Doubly Robust (AIPW)

Combines an outcome model with a propensity score model. Consistent if either model is correctly specified: two shots at a valid estimate.

Refutation test

A perturbation of the data (fake treatment, added random confounder, subsample) used to stress-test a causal estimate. The estimate should change in predictable ways.

Placebo test

Replace the real treatment with a random one. If the estimate stays large, the original effect was probably an artifact.

Identification

Whether the causal effect can be computed from observable data alone. Step 2 of DoWhy answers this before any estimation runs.
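The ATE and ATT cards above differ only in who is averaged over. As a hedged illustration, the sketch below simulates both potential outcomes for every unit (something real data never provides) so the two averages can be computed directly. The data-generating process is an assumption made up for this example: the individual effect is 2x and the probability of treatment is x, so units that benefit more are also more likely to be treated.

```python
import random

random.seed(0)
units = []
for _ in range(10_000):
    x = random.random()                    # covariate, uniform on [0, 1)
    y0 = 1.0 + random.gauss(0, 1)          # potential outcome without treatment
    y1 = y0 + 2.0 * x                      # potential outcome with treatment
    t = 1 if random.random() < x else 0    # high-x units are treated more often
    units.append((t, y0, y1))

# ATE: average individual effect over everyone.
effects = [y1 - y0 for _, y0, y1 in units]
ate = sum(effects) / len(effects)

# ATT: the same average, restricted to units that were actually treated.
treated_effects = [y1 - y0 for t, y0, y1 in units if t == 1]
att = sum(treated_effects) / len(treated_effects)

print(f"ATE = {ate:.2f}, ATT = {att:.2f}")
```

Under these assumptions the population values are ATE = 1 and ATT = 4/3, so an estimator that targets the treated group reports a larger number without either figure being wrong.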

Confounder Lab — feel how confounding biases the naive estimate

The simulated data has one treatment d and many covariates that influence both d and outcome y. The true ATE is fixed at α = 0.5 (steel line in the chart). Slide the confounding strength and watch the orange "naive" bar drift away from the truth — then watch the teal "adjusted" bar (using the backdoor criterion) snap back to α. The bigger the gap, the more adjustment matters.

Sample size n = 200

More data ⇒ tighter estimates; bias from confounding does not shrink with n.

Number of covariates p = 20

All of them are confounders here (affect both d and y).

Confounding strength = 0.70

How strongly covariates push both d and y. Stronger ⇒ larger naive bias.

Asymmetry = 0.30

0 = covariates push d and y equally · 1 = covariates push d much more than y.

Naive estimate

α̂ from a simple OLS of y on d alone — no adjustment.

Readouts: α̂, SE(α̂), bias from truth.

Backdoor-adjusted

α̂ from OLS of y on d and all p covariates, which is the backdoor adjustment for this graph.

Readouts: α̂, SE(α̂), bias from truth.

What to look for

Slide confounding strength right. The orange "naive" bar drifts far from the true α; the teal "adjusted" bar stays close to it. This is the bias confounders create when ignored.
Slide asymmetry right. When covariates predict treatment but barely predict the outcome, the gap between naive and adjusted shrinks — confounding only bites when covariates affect both sides.
Increase the sample size. Both bars get tighter, but the naive bar's bias does not shrink. More data does not fix confounding; adjustment does.
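The behaviour described in the list above can be reproduced offline with a small simulation. This is a sketch under stated assumptions, not the app's own code: it uses a single confounder x instead of p covariates, a continuous treatment d, the same strength on both sides (asymmetry 0), and hypothetical helper names. The adjusted estimate uses the Frisch-Waugh-Lovell shortcut, which gives the same coefficient on d as an OLS of y on d and x.

```python
import random

ALPHA = 0.5  # true effect of d on y, as in the lab

def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def naive_and_adjusted(n, strength, seed=0):
    """Simulate one dataset and return (naive, adjusted) estimates of ALPHA."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]                # confounder
    d = [strength * xi + rng.gauss(0, 1) for xi in x]      # treatment
    y = [ALPHA * di + strength * xi + rng.gauss(0, 1)      # outcome
         for di, xi in zip(d, x)]

    naive = slope(d, y)  # y on d alone, no adjustment

    # Frisch-Waugh-Lovell: strip from d the part explained by x,
    # then regress y on what is left.
    g = slope(x, d)
    mx, md = sum(x) / n, sum(d) / n
    d_resid = [di - md - g * (xi - mx) for di, xi in zip(d, x)]
    adjusted = slope(d_resid, y)
    return naive, adjusted

for strength in (0.0, 0.7, 1.5):
    naive, adjusted = naive_and_adjusted(n=5000, strength=strength)
    print(f"strength={strength:.1f}  naive={naive:.3f}  adjusted={adjusted:.3f}")
```

In this simplified setup the naive bias has the closed form strength² / (strength² + 1), which does not involve n, so a larger sample only tightens the naive estimate around the wrong value.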

Bias vs. variance over many simulations

Single runs are noisy. Run the pipeline 100 times with fresh draws (same parameters, different ε) to see whether the naive bias is systematic.
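A rough offline version of this experiment, assuming a single confounder, a continuous treatment, and the slider values shown above (n = 200, strength 0.70): repeat the simulation 100 times and summarize each estimator by its average error (bias) and its run-to-run spread (standard deviation). The helper names are hypothetical and the data-generating process is a simplification of whatever the app uses.

```python
import random
import statistics

ALPHA, STRENGTH, N, REPS = 0.5, 0.7, 200, 100

def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def one_run(rng):
    """One fresh draw of the data; returns (naive, adjusted) estimates."""
    x = [rng.gauss(0, 1) for _ in range(N)]
    d = [STRENGTH * xi + rng.gauss(0, 1) for xi in x]
    y = [ALPHA * di + STRENGTH * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]
    naive = slope(d, y)
    # Residualize d on x, then regress y on the residual (same coefficient
    # as the d term in an OLS of y on d and x).
    g, mx, md = slope(x, d), statistics.mean(x), statistics.mean(d)
    d_resid = [di - md - g * (xi - mx) for di, xi in zip(d, x)]
    return naive, slope(d_resid, y)

rng = random.Random(123)
runs = [one_run(rng) for _ in range(REPS)]
naive_draws = [r[0] for r in runs]
adjusted_draws = [r[1] for r in runs]

for name, draws in (("naive", naive_draws), ("adjusted", adjusted_draws)):
    bias = statistics.mean(draws) - ALPHA
    spread = statistics.stdev(draws)
    print(f"{name:9s} bias={bias:+.3f}  sd={spread:.3f}")
```

If the naive bias were just noise it would average out across runs; a bias several times larger than the spread divided by the square root of the number of runs indicates a systematic error.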

Five estimators on the Lalonde data — the post's headline figure

These numbers come straight from the post's analysis of the National Supported Work demonstration (n = 445, 185 treated, 260 controls). The chart shows six estimates of the ATE of job training on 1978 earnings: a naive baseline plus five adjusted estimators, namely three propensity-score methods (IPW, stratification, matching), regression adjustment, and the doubly robust AIPW estimator. Toggle methods and hover for SEs and CIs.

What to look for

All six estimates are positive, between $1,559 and $1,794. The training program raised 1978 earnings — the question is by how much.
The five adjusted methods cluster tightly in $1,559–$1,736, while the naive estimate is highest at $1,794. The $58–$235 gap reflects bias from finite-sample covariate imbalances that adjustment removes.
Doubly Robust (AIPW) (teal, $1,620) is the most credible single estimate: it stays consistent if either the outcome model or the propensity model is correctly specified. The dashed teal line marks its value across all bars.
PS Matching ($1,736) is the highest adjusted method because it discards unmatched controls, shifting the estimand from ATE toward ATT.

Why do five methods agree?

Three different paradigms — outcome modeling (regression adjustment), treatment modeling (IPW, stratification, matching), and doubly robust (AIPW) — converge on roughly the same answer. That convergence is the strongest evidence the effect is real. If only one paradigm worked, we could blame model misspecification. When all three agree, no single model can be the reason the estimate is positive.

The Lalonde dataset comes from an actual randomized experiment (the NSW), so confounding should already be small — covariate adjustment mostly improves precision, not bias. In observational data the same methods would diverge more, and the doubly robust estimator would be even more valuable.
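The three paradigms named above can be compared on simulated data where the truth is known. The sketch below is an illustration under assumptions, not the post's analysis and not DoWhy's estimators: one confounder x, a true ATE of 1.0, a hand-written Newton fit for the logistic propensity model, and per-arm linear outcome models. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(xs, ys):
    """OLS (intercept, slope) of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return my - b * mx, b

def fit_logit(xs, ts, steps=15):
    """Logistic regression of ts on xs (intercept a, slope b) by Newton's method."""
    a = b = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(xs, ts):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += ti - p                 # gradient of the log-likelihood
            g1 += (ti - p) * xi
            h00 += w                     # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Assumed data-generating process: true ATE = 1.0, x confounds t and y.
random.seed(1)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
t = [1 if random.random() < sigmoid(0.8 * xi) else 0 for xi in x]
y = [1.0 * ti + 2.0 * xi + random.gauss(0, 1) for ti, xi in zip(t, x)]

treated = [(xi, yi) for xi, ti, yi in zip(x, t, y) if ti == 1]
control = [(xi, yi) for xi, ti, yi in zip(x, t, y) if ti == 0]
naive = (sum(yi for _, yi in treated) / len(treated)
         - sum(yi for _, yi in control) / len(control))

# 1. Outcome modeling: fit y ~ x in each arm, predict both outcomes for everyone.
a1, b1 = fit_line([p[0] for p in treated], [p[1] for p in treated])
a0, b0 = fit_line([p[0] for p in control], [p[1] for p in control])
mu1 = [a1 + b1 * xi for xi in x]
mu0 = [a0 + b0 * xi for xi in x]
reg_adj = sum(m1 - m0 for m1, m0 in zip(mu1, mu0)) / n

# 2. Treatment modeling: weight by the inverse estimated propensity.
a, b = fit_logit(x, t)
e = [sigmoid(a + b * xi) for xi in x]
ipw = sum(ti * yi / ei - (1 - ti) * yi / (1 - ei)
          for ti, yi, ei in zip(t, y, e)) / n

# 3. Doubly robust (AIPW): outcome-model prediction plus a weighted residual correction.
aipw = sum(m1 - m0 + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
           for m1, m0, ti, yi, ei in zip(mu1, mu0, t, y, e)) / n

print(f"naive={naive:.2f}  reg_adj={reg_adj:.2f}  ipw={ipw:.2f}  aipw={aipw:.2f}")
```

Here both working models are correctly specified, so all three adjusted estimates should land near 1.0 while the naive difference in means does not. Agreement of this kind rules out misspecification of one model as the explanation; it cannot rule out a confounder that none of the models observe.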

Refutation tests — does the effect survive sabotage?

Step 4 of DoWhy's workflow is the most overlooked. It applies the same estimator to a deliberately perturbed dataset and asks: does the estimate change the way it should? Hover any bar to see the test's logic, the p-value, and what it does or does not prove.

Three perturbations of the Lalonde estimate

The original ATE from Regression Adjustment is $1,676. Each test below perturbs the data and re-estimates. The orange bar is the original; the teal bar is the perturbed estimate.

Placebo Treatment

Replace d with a random permutation. If the effect was causal, the estimate should collapse to ≈ 0.

original: $1,676

placebo: $62

p-value: 0.92

Effect dropped 96% — strongest evidence in this panel. The estimator does not "make up" effects from random noise.

Random Common Cause

Add a randomly-generated extra confounder. If the original adjustment set was adequate, adding noise should barely move the estimate.

original: $1,676

with random: $1,676

p-value: 0.90

Difference: less than $1. Model is not overly sensitive to which covariates are included.

Data Subset (80%)

Re-estimate on random 80% subsamples. A stable population effect should fluctuate only a little.

original: $1,676

subset mean: $1,728

p-value: 0.80

Slight upward drift is normal sampling variation. No single observation is driving the result.

What to look for

Compare the three teal bars in the chart. The Placebo bar (≈ $62) sits next to zero; the other two sit on top of the original. That contrast is the refutation argument.
Hover Placebo Treatment. A 96% collapse is the smoking gun: the estimator only "finds" the effect when the real treatment is in the data. With a random treatment it finds nothing.
Note the p-values. All three are large (0.80–0.92). For refutation tests, a large p-value means the test passes: the perturbed estimate is statistically indistinguishable from what a sound original estimate predicts (zero for the placebo, the original value for the other two).
Caveat: passing all three tests does not prove causation — they only rule out specific failure modes. Failing any of them, however, would be a strong signal that something is wrong.
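The logic of the three perturbations can be sketched by hand on simulated data. This is not DoWhy's refuter implementation and not the Lalonde data: the data-generating process (true effect 1.5, one confounder x), the normal-equations OLS helper, and all names are assumptions made for illustration, and no p-values are computed.

```python
import random

def ols(columns, y):
    """OLS coefficients of y on an intercept plus the given columns (normal equations)."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][k] / A[i][i] for i in range(k)]

def effect(d, covariates, y):
    """Coefficient on d from an OLS of y on d and the covariates."""
    return ols([d] + covariates, y)[1]

rng = random.Random(42)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]                              # confounder
d = [1.0 if 0.7 * xi + rng.gauss(0, 1) > 0 else 0.0 for xi in x]     # binary treatment
y = [1.5 * di + 1.0 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]  # outcome

original = effect(d, [x], y)

# Placebo treatment: shuffle d so it carries no information about y.
placebo_d = d[:]
rng.shuffle(placebo_d)
placebo = effect(placebo_d, [x], y)

# Random common cause: add a pure-noise covariate to the adjustment set.
noise = [rng.gauss(0, 1) for _ in range(n)]
with_random = effect(d, [x, noise], y)

# Data subset: re-estimate on 20 random 80% subsamples and average.
subset_estimates = []
for _ in range(20):
    idx = rng.sample(range(n), int(0.8 * n))
    subset_estimates.append(
        effect([d[i] for i in idx], [[x[i] for i in idx]], [y[i] for i in idx]))
subset_mean = sum(subset_estimates) / len(subset_estimates)

print(f"original={original:.3f}  placebo={placebo:.3f}  "
      f"with_random={with_random:.3f}  subset_mean={subset_mean:.3f}")
```

The expected pattern matches the chart: the placebo estimate falls toward zero while the other two stay close to the original.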

Connecting back to the four steps

Model declared the eight covariates as common causes. Identify confirmed the backdoor criterion applies and produced the adjustment formula. Estimate produced five point estimates that agree. Refute showed that the estimate behaves the way a genuine effect should: it collapses under a placebo, and it stays stable under added noise and across subsamples. Each step answers a different question. None alone is sufficient. The combination is what makes the conclusion credible.
