DoWhy Causal Inference — Interactive Lab

Why are naive comparisons misleading?

Suppose a company wants to know whether working from home (WFH) improves productivity. They compute the difference in average productivity between WFH and office employees, get +1.39 productivity points, and call it a day. The catch: the true causal effect is only +1.00. The naive comparison overstates the effect by 39% because introverts both self-select into WFH and are independently more productive. This app lets you build intuition for the four DoWhy steps (Model, Identify, Estimate, Refute) that recover the true answer.

In four tabs you will: watch confounders flow along backdoor paths in the DAG below, compare five estimators against a known truth (the post's headline figure), tune confounder strength and see the naive estimate drift while backdoor and IV estimates stay anchored, and run the same three refutation tests DoWhy uses to stress-test a causal claim.

The causal graph — where confounding lives

Five nodes, six arrows. Two confounders (introversion, num. children) open backdoor paths from WFH to productivity by pointing at both. The instrument (subway disruption) is special: it touches WFH but not productivity. The single orange arrow is the causal effect we want to estimate. Particles flow along each path so you can see the bias channels in motion.

Backdoor paths (gray, dashed) bias the naive comparison. The instrument path (teal) bypasses them. Only the orange arrow is the true causal effect.
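As a rough illustration of the graph described above, the sketch below writes the six arrows as an edge list and enumerates the paths that leave WFH through an incoming arrow. The node names and the helper `backdoor_paths` are hypothetical labels chosen for this example, not identifiers from the app or from DoWhy, and the helper only lists candidate paths; it does not check whether a path is blocked by a collider.

```python
# Edges of the five-node DAG described above (node names are illustrative labels).
EDGES = [
    ("introversion", "wfh"),
    ("introversion", "productivity"),
    ("num_children", "wfh"),
    ("num_children", "productivity"),
    ("subway_disruption", "wfh"),
    ("wfh", "productivity"),  # the causal effect of interest
]


def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome whose first edge points INTO treatment.

    This only enumerates candidate paths in the undirected skeleton; it does
    not test whether each path is open or blocked (no d-separation check).
    """
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    parents = sorted(src for src, dst in edges if dst == treatment)

    paths = []

    def walk(node, path):
        if node == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbors[node]):
            if nxt not in path:
                walk(nxt, path + [nxt])

    for parent in parents:
        walk(parent, [treatment, parent])
    return paths


paths = backdoor_paths(EDGES, "wfh", "productivity")
for p in paths:
    print(" - ".join(p))
```

The instrument contributes no path here: starting from subway disruption, the only neighbor is WFH itself, which is why it can isolate variation in WFH without touching productivity.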

Tab 2

Methods Showdown

Five estimators on the post's simulated dataset, with the true ATE marked. Toggle methods to see which CIs cover the truth.

Tab 3

Confounding Simulator

Tune confounder strength and sample size live. Watch the naive bar drift while the backdoor and IV bars stay near 1.0.

Tab 4

Refutation Lab

Re-run DoWhy's three robustness tests — placebo, random common cause, data subset — and see why the original estimate survives.

Glossary (open a card if a term is unfamiliar)

Confounder

A variable that affects both treatment and outcome, opening a "backdoor" that biases naive comparisons. Here: introversion and num. children.

Causal graph (DAG)

A directed acyclic graph encoding which variables cause which. DoWhy's "Step 1 — Model" is drawing this diagram.

Backdoor criterion

If a set of variables blocks every backdoor path and contains no descendant of the treatment, conditioning on it identifies the causal effect. Here: {introversion, num_children}.

Propensity score e(x)

The conditional probability of treatment given the covariates: P(T = 1 | X = x). The engine behind IPW and AIPW.

IPW

Inverse-probability weighting. Reweights observations by 1/e(x) (treated) or 1/(1−e(x)) (control) to mimic a randomised experiment.

Doubly Robust (AIPW)

Combines an outcome regression with an IPW-weighted correction computed from that regression's residuals. Stays consistent if either the outcome model or the propensity model is correct.

Instrumental variable (IV)

A source of variation in treatment that affects the outcome only through treatment. Here: subway disruption. Valid even with unmeasured confounders, but noisier.

Exclusion restriction

The IV assumption: the instrument has no direct effect on the outcome except through treatment. Cannot be tested from data — it is a domain claim.

Refutation test

DoWhy's "Step 4". Stress-tests — placebo treatment, random common cause, data subset — that probe the estimate's stability.

ATE

Average Treatment Effect. The expected difference in potential outcomes if everyone were treated vs. everyone untreated. True ATE = 1.0 in our simulated DGP.
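For readers who prefer formulas, the IPW and Doubly Robust cards above correspond to the standard textbook estimators below. The notation is an assumption of this note rather than the post's: Y_i is the outcome, T_i the binary treatment, ê(X_i) the estimated propensity score, and μ̂_1, μ̂_0 the fitted outcome regressions for treated and control units. The post's implementation may use a normalised or otherwise modified variant.

```latex
% ATE as a contrast of potential outcomes
\tau = \mathbb{E}\left[ Y(1) - Y(0) \right]

% Inverse-probability weighting (Horvitz--Thompson form)
\hat{\tau}_{\mathrm{IPW}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \frac{T_i Y_i}{\hat{e}(X_i)}
         - \frac{(1 - T_i)\, Y_i}{1 - \hat{e}(X_i)} \right]

% Augmented IPW (doubly robust): outcome regression plus weighted residual correction
\hat{\tau}_{\mathrm{AIPW}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
         + \frac{T_i \left( Y_i - \hat{\mu}_1(X_i) \right)}{\hat{e}(X_i)}
         - \frac{(1 - T_i) \left( Y_i - \hat{\mu}_0(X_i) \right)}{1 - \hat{e}(X_i)} \right]
```

The two correction terms in the AIPW expression have mean zero when the outcome model is right, and they repair the outcome model's errors when the propensity model is right, which is where the "either model" guarantee comes from.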

Methods Showdown — five estimators, one truth

These numbers come straight from estimation_results.csv in the post's folder — the same data behind Figure 3. The vertical blue line is the true ATE = 1.0. Click checkboxes to toggle methods on and off. Hover any point for the standard error, CI, bias, and whether the CI covers the truth.

True ATE: 1.00 (simulated; known by construction)

N (sample size): 5,000 (simulated WFH dataset)


What to look for

The Naive CI [1.245, 1.526] does not contain the true ATE. It is precisely wrong — a small SE around the wrong target.
Three backdoor methods sit on top of the truth. Linear regression (1.0051), IPW (1.0275), and Doubly Robust (1.0115) all recover the true effect within 3%, with narrow CIs (width 0.24–0.30).
IV is centered near truth but far less precise. Its CI is 1.30 wide, roughly 4 to 5 times the width of the backdoor CIs (0.24–0.30). Robustness to unmeasured confounders comes at the cost of variance.
Hover Doubly Robust: its SE (0.0623) is nearly identical to regression's (0.0614). When both models are correct, DR achieves the semiparametric efficiency bound.
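The coverage claims in the list above can be checked by hand. The sketch below rebuilds 95% intervals from the point estimates and standard errors quoted in the text, assuming a normal approximation (estimate ± 1.96 × SE); that construction is an assumption, since the page does not say how its CIs were computed. Only figures stated above are used, so IPW and IV are left out.

```python
TRUE_ATE = 1.0
Z95 = 1.96  # normal-approximation critical value (an assumption about how the CIs were built)

# Point estimates and standard errors quoted in the text above.
reported = {
    "Linear regression": (1.0051, 0.0614),
    "Doubly Robust": (1.0115, 0.0623),
}


def wald_ci(estimate, se, z=Z95):
    """Symmetric interval: estimate plus or minus z standard errors."""
    return estimate - z * se, estimate + z * se


def covers(ci, value=TRUE_ATE):
    """True when the interval contains the given value."""
    lo, hi = ci
    return lo <= value <= hi


intervals = {name: wald_ci(est, se) for name, (est, se) in reported.items()}
intervals["Naive"] = (1.245, 1.526)  # CI quoted directly in the text

for name, (lo, hi) in intervals.items():
    print(f"{name:18s} CI=[{lo:.3f}, {hi:.3f}] "
          f"width={hi - lo:.3f} covers truth: {covers((lo, hi))}")
```

Under this assumption the regression interval comes out about 0.24 wide, which matches the lower end of the width range quoted above.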

Why does the naive estimate overshoot?

In the WFH group, mean introversion is 5.19 versus 4.55 in the office group, and mean num. children is 1.58 versus 1.33. Both confounders push the WFH group toward higher productivity for reasons that have nothing to do with working from home. The naive comparison attributes the entire 1.39-point gap to WFH; the truth is that 0.39 of it is a confounder fingerprint. Backdoor adjustment removes the fingerprint by conditioning on the confounders directly. IV removes it by relying only on subway-disruption-induced variation in WFH, which is independent of personality and family size by construction.
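To make the "confounder fingerprint" concrete, here is a minimal simulation sketch. It is deliberately simpler than the post's data: a single binary confounder, and selection probabilities of 0.7 and 0.3 that are illustrative assumptions rather than the post's DGP. The backdoor adjustment is done by plain stratification instead of the regression the post uses.

```python
import random
from statistics import mean


def simulate(n=20000, true_ate=1.0, beta_outcome=0.8, seed=0):
    """Simplified DGP: one binary confounder drives both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        introvert = 1 if rng.random() < 0.5 else 0
        p_wfh = 0.7 if introvert else 0.3  # introverts self-select into WFH (assumed values)
        wfh = 1 if rng.random() < p_wfh else 0
        y = true_ate * wfh + beta_outcome * introvert + rng.gauss(0, 1)
        rows.append((introvert, wfh, y))
    return rows


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def backdoor_adjusted(rows):
    """Stratify on the confounder, then average stratum effects by stratum size."""
    total = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        total += naive_difference(stratum) * len(stratum) / len(rows)
    return total


rows = simulate()
naive = naive_difference(rows)
adjusted = backdoor_adjusted(rows)
print(f"naive    = {naive:.3f}")
print(f"adjusted = {adjusted:.3f}")
```

In this toy setup the expected naive bias is 0.8 × (0.7 − 0.3) = 0.32, so the naive number should land well above 1 while the stratified estimate sits near 1.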

Confounding Simulator — tune the bias yourself

The post's data has confounder effects β_introversion = 0.8 on productivity and 0.3 on the WFH log-odds. Drag the confounder sliders and watch three estimators in real time: Naive drifts away from the truth as confounding grows; Backdoor (linear regression adjusted) and IV stay anchored. Each estimate ships with a 95% CI so you can also see the bias-variance trade-off.

Sample size n: 2000

More data shrinks every CI (roughly in proportion to 1/√n), but the naive bias does not go away.

Confounder → outcome β: 0.80

Direct effect of introversion on productivity. Larger ⇒ more confounding bias.

Confounder → treatment β: 0.30

Log-odds coefficient: how much introversion pushes people toward WFH.

Instrument strength: 1.00

Log-odds coefficient: how forcefully subway disruption pushes people to WFH. Below ~0.3 the IV becomes weak.

Live readouts (filled in when the simulator runs): the Naive and Backdoor cards show α̂, SE, bias, and whether the CI covers the truth; the IV (2SLS) card shows α̂, SE, bias, and the first-stage F.

What to look for

Set "Confounder → outcome β" to 0. The Naive bar snaps to the truth — no confounding means no bias.
Crank "Confounder → outcome β" up to 2. The Naive estimate climbs past 2; Backdoor stays near 1.0 with a tight CI; IV stays near 1.0 with a wide CI.
Drop "Instrument strength" below 0.3. The first-stage F collapses and the IV CI explodes — a textbook weak-instrument failure mode.
Sample size only shrinks CIs. No amount of n fixes the naive bias — confounding is identification failure, not noise.
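The weak-instrument point in the list above can be sketched with a small simulation. The DGP below is an assumption made for illustration (a binary instrument, one unobserved standard-normal confounder, a logistic treatment model with a centred instrument term) and is not the app's exact simulator. With a single binary instrument and no covariates, the Wald ratio used here coincides with 2SLS, and the first-stage F is the squared t-statistic on the instrument.

```python
import math
import random
from statistics import mean


def simulate(n, instrument_strength, beta_treatment=0.3, beta_outcome=0.8,
             true_ate=1.0, seed=0):
    """Assumed DGP: z shifts treatment only; u shifts both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)                 # confounder, treated as unobserved
        z = 1 if rng.random() < 0.5 else 0  # instrument, independent of u
        logit = instrument_strength * (z - 0.5) + beta_treatment * u
        t = 1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0
        y = true_ate * t + beta_outcome * u + rng.gauss(0, 1)
        rows.append((z, t, y))
    return rows


def wald_iv(rows):
    """Return (Wald IV estimate, first-stage F) for a single binary instrument."""
    t1 = [t for z, t, _ in rows if z == 1]
    t0 = [t for z, t, _ in rows if z == 0]
    y1 = [y for z, _, y in rows if z == 1]
    y0 = [y for z, _, y in rows if z == 0]
    m1, m0 = mean(t1), mean(t0)
    first_stage = m1 - m0
    reduced_form = mean(y1) - mean(y0)
    # Pooled residual variance of the first-stage regression of t on z.
    ss_resid = sum((t - m1) ** 2 for t in t1) + sum((t - m0) ** 2 for t in t0)
    s2 = ss_resid / (len(rows) - 2)
    f_stat = first_stage ** 2 / (s2 * (1 / len(t1) + 1 / len(t0)))
    return reduced_form / first_stage, f_stat


results = {}
for strength in (1.0, 0.1):
    results[strength] = wald_iv(simulate(n=5000, instrument_strength=strength))
    iv, f_stat = results[strength]
    print(f"strength={strength:.1f}  IV estimate={iv:.3f}  first-stage F={f_stat:.1f}")
```

As the strength shrinks, the denominator of the Wald ratio approaches zero, so small sampling noise in the first stage turns into large swings in the IV estimate.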

Refutation Lab — does the estimate survive a stress test?

DoWhy's "Step 4 — Refute" runs automated stress tests. Placebo treatment permutes the treatment label and re-estimates (a good estimator should now return zero). Random common cause adds an irrelevant fake confounder (the estimate should not move). Data subset drops 20% of the data at random (the estimate should stay close). All three results below are from the post's actual refutation run on the linear-regression estimate of 1.005.

1 · Placebo Treatment

Permute the treatment column 100 times and re-estimate. Because a shuffled treatment cannot cause anything, the estimated effect should collapse to zero.

original α̂: 1.0051

new effect: −0.00003

p-value: 0.96

verdict: PASS

2 · Random Common Cause

Add a randomly generated variable as if it were an extra common cause. Because it is pure noise, a robust estimate should not move.

original α̂: 1.0051

new effect: 1.0051

p-value: 0.98

verdict: PASS

3 · Data Subset (80%)

Re-estimate on a random 80% subsample, 100 times. A stable estimate should not drift.

original α̂: 1.0051

new effect: 0.9988

p-value: 0.64

verdict: PASS

How to read these tests

Placebo collapses to zero (−0.00003). If a method "found" an effect on a randomly permuted treatment, you would not trust it on the real one. Ours did not.
Random common cause leaves the estimate unchanged. A fragile estimate would bounce when extra variables enter the regression; ours is anchored.
Data subset is essentially the same number. A sample-driven artefact would shift under resampling; ours does not.
Refutation is necessary but not sufficient. Tests can fail to detect violation of the exclusion restriction or omitted-confounder bias. Pair them with domain judgment.
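The logic of the three tests can be sketched in a few lines of plain Python. This is not DoWhy's implementation and it does not reproduce DoWhy's p-values: the data come from an assumed toy DGP with one binary confounder, the estimator is simple stratification rather than the post's linear regression, and the helper names (`placebo`, `random_common_cause`, `data_subset`) are hypothetical.

```python
import random


def simulate(n=5000, true_ate=1.0, seed=0):
    """Toy DGP (assumed): rows are (covariate tuple, treatment, outcome)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        introvert = 1 if rng.random() < 0.5 else 0
        wfh = 1 if rng.random() < (0.7 if introvert else 0.3) else 0
        y = true_ate * wfh + 0.8 * introvert + rng.gauss(0, 1)
        rows.append(((introvert,), wfh, y))
    return rows


def adjusted_effect(rows):
    """Stratify on the covariate tuple and size-weight the per-stratum differences."""
    strata = {}
    for x, t, y in rows:
        strata.setdefault(x, {0: [], 1: []})[t].append(y)
    total, used = 0.0, 0
    for groups in strata.values():
        if groups[0] and groups[1]:  # skip strata missing one arm
            size = len(groups[0]) + len(groups[1])
            diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
            total += diff * size
            used += size
    return total / used


def placebo(rows, rng):
    """Shuffle the treatment column, breaking any link to the outcome."""
    shuffled = [t for _, t, _ in rows]
    rng.shuffle(shuffled)
    return adjusted_effect([(x, t, y) for (x, _, y), t in zip(rows, shuffled)])


def random_common_cause(rows, rng):
    """Append an irrelevant random binary covariate to the adjustment set."""
    return adjusted_effect([(x + (rng.randint(0, 1),), t, y) for x, t, y in rows])


def data_subset(rows, rng, fraction=0.8):
    """Re-estimate on a random subsample drawn without replacement."""
    return adjusted_effect(rng.sample(rows, int(fraction * len(rows))))


rng = random.Random(1)
rows = simulate()
original = adjusted_effect(rows)
placebo_mean = sum(placebo(rows, rng) for _ in range(100)) / 100
rcc_mean = sum(random_common_cause(rows, rng) for _ in range(100)) / 100
subset_mean = sum(data_subset(rows, rng) for _ in range(100)) / 100

print(f"original            = {original:.4f}")
print(f"placebo treatment   = {placebo_mean:.4f}")
print(f"random common cause = {rcc_mean:.4f}")
print(f"data subset (80%)   = {subset_mean:.4f}")
```

Note what the sketch cannot do: if the toy data had a confounder missing from the covariate tuple, all three checks would still look reassuring, which is the "necessary but not sufficient" caveat in practice.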

Connecting back to the post

The four DoWhy steps map to the four tabs: Model (Tab 1's DAG) → Identify (Tab 2: backdoor for linear regression, IPW, and Doubly Robust; IV for the instrumental-variable estimate) → Estimate (Tab 2's point estimates; Tab 3's tunable version) → Refute (Tab 4). Each step constrains the next: a wrong DAG breaks identification; a missed confounder breaks the backdoor estimate; a violated exclusion restriction breaks IV; a brittle estimate fails refutation. The transparency is the contribution.
