DoWhy Causal Inference — Interactive Lab

A pedagogical companion to A Beginner's Guide to Causal Inference with DoWhy in Python

Why are naive comparisons misleading?

Suppose a company wants to know whether working from home (WFH) improves productivity. They compute the difference in average productivity between WFH and office employees, get +1.39 productivity points, and call it a day. The catch: the true causal effect is only +1.00. The naive comparison is off by 39% because introverts both self-select into WFH and are independently more productive. This app lets you build intuition for the four DoWhy steps — Model, Identify, Estimate, Refute — that recover the true answer.

In four tabs you will: watch confounders flow along backdoor paths in the DAG below, compare five estimators against a known truth (the post's headline figure), tune confounder strength and see the naive estimate drift while backdoor and IV estimates stay anchored, and run the same three refutation tests DoWhy uses to stress-test a causal claim.

The causal graph — where confounding lives

Five nodes, six arrows. Two confounders (introversion, num. children) open backdoor paths from WFH to productivity because each points at both. The instrument (subway disruption) is special: it affects WFH but has no direct arrow into productivity. The single orange arrow is the causal effect we want to estimate. Particles flow along each path so you can see the bias channels in motion.

Backdoor paths (gray, dashed) bias the naive comparison. The instrument path (teal) bypasses them. Only the orange arrow is the true causal effect.
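
The backdoor paths in this graph can also be listed mechanically. The sketch below is a minimal illustration using only the standard library; the node names are illustrative labels for the five variables described above. It enumerates candidate backdoor paths (paths whose first edge points into the treatment) and does not check whether a path is blocked by a collider, which is not an issue in this particular graph.

```python
# Edges of the five-node DAG described above (node names are illustrative).
EDGES = [
    ("introversion", "wfh"),
    ("introversion", "productivity"),
    ("num_children", "wfh"),
    ("num_children", "productivity"),
    ("subway_disruption", "wfh"),
    ("wfh", "productivity"),
]


def undirected_paths(edges, start, end):
    """Enumerate all simple paths from start to end, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours[node]):
            if nxt not in path:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


def backdoor_paths(edges, treatment, outcome):
    """Keep only paths whose first edge points INTO the treatment."""
    edge_set = set(edges)
    return [
        p for p in undirected_paths(edges, treatment, outcome)
        if (p[1], treatment) in edge_set
    ]


paths = backdoor_paths(EDGES, "wfh", "productivity")
for p in paths:
    print(" - ".join(p))
```

The direct arrow from WFH to productivity is excluded because it leaves the treatment rather than entering it, and the instrument never reaches productivity except through WFH.
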
Tab 2

Methods Showdown

Five estimators run on the post's simulated dataset, with the true ATE marked. Toggle methods to see which CIs cover the truth.

Tab 3

Confounding Simulator

Tune confounder strength and sample size live. Watch the naive bar drift while the backdoor and IV bars stay near 1.0.

Tab 4

Refutation Lab

Re-run DoWhy's three robustness tests — placebo, random common cause, data subset — and see why the original estimate survives.

Glossary (open a card if a term is unfamiliar)

Confounder
A variable that affects both treatment and outcome, opening a "backdoor" that biases naive comparisons. Here: introversion and num. children.
Causal graph (DAG)
A directed acyclic graph encoding which variables cause which. DoWhy's "Step 1 — Model" is drawing this diagram.
Backdoor criterion
If a set of variables blocks every backdoor path, conditioning on it identifies the causal effect. Here: {introversion, num_children}.
Propensity score e(x)
The conditional probability of treatment given the covariates: P(T = 1 | X = x). The engine behind IPW and AIPW.
IPW
Inverse-probability weighting. Reweights observations by 1/e(x) (treated) or 1/(1−e(x)) (control) to mimic a randomised experiment.
Doubly Robust (AIPW)
Combines an outcome regression with the IPW reweight plus a correction. Stays consistent if either the outcome or the propensity model is correct.
Instrumental variable (IV)
A source of variation in treatment that affects the outcome only through treatment. Here: subway disruption. Valid even with unmeasured confounders, but noisier.
Exclusion restriction
The IV assumption: the instrument has no direct effect on the outcome except through treatment. Cannot be tested from data — it is a domain claim.
Refutation test
DoWhy's "Step 4". Stress-tests — placebo treatment, random common cause, data subset — that probe the estimate's stability.
ATE
Average Treatment Effect. The expected difference in potential outcomes if everyone were treated vs. everyone untreated. True ATE = 1.0 in our simulated DGP.
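
To make the propensity score, IPW, and AIPW entries concrete, here is a minimal NumPy sketch. It is not the post's code: the data-generating process is an assumption (only the 0.8 and 0.3 introversion coefficients and the true ATE of 1.0 come from the text), the logistic fit is a hand-rolled Newton iteration, and the IPW estimate uses the normalised (Hájek) form of the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed, centred confounders; only the 0.8 / 0.3 introversion coefficients come from the text.
x1 = rng.normal(0.0, 1.5, n)             # introversion (centred)
x2 = rng.poisson(1.4, n) - 1.4           # number of children (centred)
X = np.column_stack([np.ones(n), x1, x2])
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 * x1 + 0.2 * x2))))
y = 1.0 * t + 0.8 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, n)  # true ATE = 1.0


def fit_logistic(X, t, iters=25):
    """Logistic regression by Newton-Raphson (no regularisation)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta


# Propensity score e(x) = P(T = 1 | X = x).
e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, t)))

# IPW: weight treated units by 1/e(x) and controls by 1/(1 - e(x)).
w1 = t / e_hat
w0 = (1 - t) / (1.0 - e_hat)
ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


def ols_predict(X_fit, y_fit, X_all):
    """Fit OLS on one arm and predict for every unit."""
    coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    return X_all @ coef


# AIPW: outcome-model prediction plus an inverse-probability-weighted residual correction.
mu1 = ols_predict(X[t == 1], y[t == 1], X)
mu0 = ols_predict(X[t == 0], y[t == 0], X)
aipw = np.mean(
    mu1 - mu0
    + t * (y - mu1) / e_hat
    - (1 - t) * (y - mu0) / (1.0 - e_hat)
)

print(f"IPW  = {ipw:.3f}")
print(f"AIPW = {aipw:.3f}")
```

Both estimates should land near 1.0 here because the outcome model and the propensity model are both correctly specified in this simulated setup.
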

Methods Showdown — five estimators, one truth

These numbers come straight from estimation_results.csv in the post's folder — the same data behind Figure 3. The vertical blue line is the true ATE = 1.0. Click checkboxes to toggle methods on and off. Hover any point for the standard error, CI, bias, and whether the CI covers the truth.


True ATE: 1.00 (simulated; known by construction)
N (sample size): 5,000 (simulated WFH dataset)

What to look for

  • The Naive CI [1.245, 1.526] does not contain the true ATE. It is precisely wrong — a small SE around the wrong target.
  • Three backdoor methods sit on top of the truth. Linear regression (1.0051), IPW (1.0275), and Doubly Robust (1.0115) all recover the true effect within 3%, with narrow CIs (width 0.24–0.30).
  • IV is centered near truth but far less precise. Its CI is 1.30 wide, roughly 4 to 5 times the width of the backdoor CIs (0.24–0.30). Robustness to unmeasured confounders comes at the cost of variance.
  • Hover Doubly Robust: its SE (0.0623) is nearly identical to regression's (0.0614). When both models are correct, DR achieves the semiparametric efficiency bound.
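
The coverage checks in the list above can be reproduced from the quoted numbers. This sketch assumes the intervals are normal-approximation (Wald) intervals with a 1.96 critical value, which the text does not state; the naive interval is taken directly from the bullet rather than recomputed.

```python
TRUE_ATE = 1.0
Z95 = 1.96  # assumed: two-sided 95% normal critical value


def wald_ci(estimate, se, z=Z95):
    """Normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se


def covers(ci, truth=TRUE_ATE):
    lo, hi = ci
    return lo <= truth <= hi


# Point estimates and standard errors quoted in the text.
reg_ci = wald_ci(1.0051, 0.0614)
dr_ci = wald_ci(1.0115, 0.0623)
naive_ci = (1.245, 1.526)  # interval quoted directly in the text

for name, ci in [("naive", naive_ci), ("regression", reg_ci), ("doubly robust", dr_ci)]:
    width = ci[1] - ci[0]
    print(f"{name:14s} CI=[{ci[0]:.3f}, {ci[1]:.3f}] width={width:.2f} covers={covers(ci)}")
```

Under that assumption the regression interval comes out about 0.24 wide, matching the lower end of the width range quoted above.
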

Why does the naive estimate overshoot?

In the WFH group, mean introversion is 5.19 versus 4.55 in the office group, and mean num. children is 1.58 versus 1.33. Both confounders push the WFH group toward higher productivity for reasons that have nothing to do with working from home. The naive comparison attributes the entire 1.39-point gap to WFH; the truth is that 0.39 of it is a confounder fingerprint. Backdoor adjustment removes the fingerprint by conditioning on the confounders directly. IV removes it by relying only on subway-disruption-induced variation in WFH, which is independent of personality and family size by construction.
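
The mechanism is easy to reproduce in a few lines. The sketch below simulates a dataset in the spirit of the one described here and compares the raw difference in means with a regression that conditions on both confounders. The introversion coefficients (0.8 on productivity, 0.3 on the WFH log-odds), the sample size, and the true effect of 1.0 come from the text; every other number is an assumption, so the output will not match the post's figures exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
TRUE_ATE = 1.0

# Confounders (distributions are assumptions, not the post's exact DGP).
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.4, n)

# Treatment: introversion enters the WFH log-odds with coefficient 0.3 (from the text);
# the children coefficient of 0.2 is assumed.
log_odds = 0.3 * (introversion - 5.0) + 0.2 * (num_children - 1.4)
wfh = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Outcome: true effect 1.0 and introversion effect 0.8 (from the text);
# the children effect of 0.3 is assumed.
productivity = (
    TRUE_ATE * wfh
    + 0.8 * introversion
    + 0.3 * num_children
    + rng.normal(0.0, 1.0, n)
)

# Naive estimate: raw difference in group means.
naive = productivity[wfh == 1].mean() - productivity[wfh == 0].mean()

# Backdoor adjustment: OLS of productivity on treatment plus both confounders.
X = np.column_stack([np.ones(n), wfh, introversion, num_children])
coef, *_ = np.linalg.lstsq(X, productivity, rcond=None)
adjusted = coef[1]

print(f"naive    = {naive:.3f}")
print(f"adjusted = {adjusted:.3f}")
```

The naive number overshoots because the WFH group is more introverted and has more children on average; the adjusted coefficient holds both fixed.
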

Confounding Simulator — tune the bias yourself

In the post's data, introversion has a direct effect of β_introversion = 0.8 on productivity and a coefficient of 0.3 on the WFH log-odds. Drag the confounder sliders and watch three estimators in real time: Naive drifts away from the truth as confounding grows; Backdoor (adjusted linear regression) and IV stay anchored. Each estimate ships with a 95% CI so you can also see the bias-variance trade-off.

More data shrinks every CI (roughly in proportion to 1/√n), but the naive bias does not go away.
Direct effect of introversion on productivity. Larger ⇒ more confounding bias.
Log-odds coefficient: how much introversion pushes people toward WFH.
Log-odds coefficient: how forcefully subway disruption pushes people to WFH. Below ~0.3 the IV becomes weak.

Readouts shown for each estimator:

  • Naive: α̂, SE, bias, covers truth?
  • Backdoor: α̂, SE, bias, covers truth?
  • IV (2SLS): α̂, SE, bias, first-stage F

What to look for

  • Set "Confounder → outcome β" to 0. The Naive bar snaps to the truth — no confounding means no bias.
  • Crank "Confounder → outcome β" up to 2. The Naive estimate climbs past 2; Backdoor stays near 1.0 with a tight CI; IV stays near 1.0 with a wide CI.
  • Drop "Instrument strength" below 0.3. The first-stage F collapses and the IV CI explodes — a textbook weak-instrument failure mode.
  • Sample size only shrinks CIs. No amount of n fixes the naive bias — confounding is identification failure, not noise.
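
The IV behaviour described above can be sketched with a single binary instrument. With one instrument and no covariates, 2SLS reduces to the Wald ratio cov(z, y) / cov(z, t), and the first-stage F is the squared t-statistic of the instrument in a regression of treatment on the instrument. The simulation below is an assumption-laden illustration, not the app's code: `simulate_iv` is a hypothetical helper, the confounder is deliberately unmeasured, and all coefficients other than the true effect of 1.0 are made up.

```python
import numpy as np


def simulate_iv(strength, n=5_000, seed=0):
    """Return (naive, iv, first_stage_F) for one simulated dataset.

    `strength` is the instrument's log-odds coefficient on treatment.
    """
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, n).astype(float)    # instrument: subway disruption
    u = rng.normal(0.0, 1.0, n)                   # unmeasured confounder
    log_odds = -0.5 + strength * z + 0.8 * u
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds))).astype(float)
    y = 1.0 * t + 1.0 * u + rng.normal(0.0, 1.0, n)  # true effect = 1.0

    # Naive difference in means is biased because u drives both t and y.
    naive = y[t == 1].mean() - y[t == 0].mean()

    # Wald / 2SLS estimate with one instrument and no covariates.
    iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

    # First-stage F: squared t-statistic of z in the regression of t on z.
    Z = np.column_stack([np.ones(n), z])
    b, *_ = np.linalg.lstsq(Z, t, rcond=None)
    resid = t - Z @ b
    var_b = (resid @ resid / (n - 2)) / np.sum((z - z.mean()) ** 2)
    f_stat = b[1] ** 2 / var_b
    return naive, iv, f_stat


naive_s, iv_s, f_strong = simulate_iv(strength=1.5)
naive_w, iv_w, f_weak = simulate_iv(strength=0.1)
print(f"strong instrument: naive={naive_s:.2f}  IV={iv_s:.2f}  F={f_strong:.1f}")
print(f"weak instrument:   naive={naive_w:.2f}  IV={iv_w:.2f}  F={f_weak:.1f}")
```

With the strong instrument the IV estimate sits near 1.0 even though the confounder is never observed. With the weak instrument the denominator cov(z, t) is close to zero, so the ratio becomes unstable and the first-stage F drops sharply.
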

Refutation Lab — does the estimate survive a stress test?

DoWhy's "Step 4 — Refute" runs automated stress tests. Placebo treatment permutes the treatment label and re-estimates (a good estimator should now return zero). Random common cause adds an irrelevant fake confounder (the estimate should not move). Data subset drops 20% of the data at random (the estimate should stay close). All three results below are from the post's actual refutation run on the linear-regression estimate of 1.005.

1 · Placebo Treatment

Permute the treatment column 100 times and re-estimate. With the treatment shuffled, the estimated effect should collapse to zero.

original α̂: 1.0051
new effect: −0.00003
p-value: 0.96
verdict: PASS

2 · Random Common Cause

Add a randomly generated extra confounder. A robust estimate should not move.

original α̂: 1.0051
new effect: 1.0051
p-value: 0.98
verdict: PASS

3 · Data Subset (80%)

Re-estimate on a random 80% subsample, 100 times. A stable estimate should not drift.

original α̂: 1.0051
new effect: 0.9988
p-value: 0.64
verdict: PASS

How to read these tests

  • Placebo collapses to zero (−0.00003). If a method "found" an effect on a randomly permuted treatment, you would not trust it on the real one. Ours did not.
  • Random common cause leaves the estimate unchanged. A fragile estimate would bounce when extra variables enter the regression; ours is anchored.
  • Data subset is essentially the same number. A sample-driven artefact would shift under resampling; ours does not.
  • Refutation is necessary but not sufficient. These tests can miss a violated exclusion restriction or an omitted confounder. Pair them with domain judgment.
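
The logic of the three tests can be imitated by hand. The sketch below is not DoWhy's implementation and does not reproduce the numbers above; it re-creates each idea with NumPy on an assumed simulated dataset (true effect 1.0, two measured confounders) and an adjusted OLS estimator. `ols_effect` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Assumed DGP: two centred confounders, true treatment effect 1.0.
C = np.column_stack([rng.normal(0.0, 1.5, n), rng.poisson(1.4, n) - 1.4])
log_odds = 0.3 * C[:, 0] + 0.2 * C[:, 1]
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds))).astype(float)
y = 1.0 * t + 0.8 * C[:, 0] + 0.3 * C[:, 1] + rng.normal(0.0, 1.0, n)


def ols_effect(t, covariates, y):
    """Coefficient on treatment from OLS of y on [1, t, covariates]."""
    X = np.column_stack([np.ones(len(y)), t, covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


original = ols_effect(t, C, y)

# 1. Placebo treatment: shuffle the treatment column; the effect should vanish.
placebo = np.mean([ols_effect(rng.permutation(t), C, y) for _ in range(100)])

# 2. Random common cause: add an irrelevant covariate; the estimate should not move.
extra = rng.normal(size=(n, 1))
random_cc = ols_effect(t, np.column_stack([C, extra]), y)

# 3. Data subset: re-estimate on random 80% subsamples; the estimate should stay close.
k = int(0.8 * n)
subset_effects = []
for _ in range(100):
    idx = rng.choice(n, size=k, replace=False)
    subset_effects.append(ols_effect(t[idx], C[idx], y[idx]))
subset = np.mean(subset_effects)

print(f"original            = {original:.4f}")
print(f"placebo treatment   = {placebo:.4f}")
print(f"random common cause = {random_cc:.4f}")
print(f"data subset (80%)   = {subset:.4f}")
```

None of these checks can reveal a confounder that was never measured: the estimator would pass all three while still being biased, which is why the last bullet above asks for domain judgment.
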

Connecting back to the post

The four DoWhy steps map to the four tabs: Model (Tab 1's DAG) → Identify (Tab 2: backdoor for the three adjustment methods, IV for the instrumental-variable estimate) → Estimate (Tab 2's point estimates; Tab 3's tunable version) → Refute (Tab 4). Each step constrains the next: a wrong DAG breaks identification; a missed confounder breaks the backdoor estimate; a violated exclusion restriction breaks IV; a brittle estimate fails refutation. That transparency, with every assumption stated and stress-tested, is the contribution.