Why are naive comparisons misleading?
Suppose a company wants to know whether working from home (WFH) improves productivity. They compute the difference in average productivity between WFH and office employees, get +1.39 productivity points, and call it a day. The catch: the true causal effect is only +1.00. The naive comparison is off by 39% because introverts both self-select into WFH and are independently more productive. This app lets you build intuition for the four DoWhy steps — Model, Identify, Estimate, Refute — that recover the true answer.
In four tabs you will: watch confounders flow along backdoor paths in the DAG below, compare five estimators against a known truth (the post's headline figure), tune confounder strength and see the naive estimate drift while backdoor and IV estimates stay anchored, and run the same three refutation tests DoWhy uses to stress-test a causal claim.
The causal graph — where confounding lives
Five nodes, six arrows. Two confounders (introversion, num. children) open backdoor paths from WFH to productivity by pointing at both. The instrument (subway disruption) is special: it touches WFH but not productivity. The single orange arrow is the causal effect we want to estimate. Particles flow along each path so you can see the bias channels in motion.
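To make the graph concrete, here is a minimal sketch that writes the five nodes and six arrows as an edge list and enumerates the backdoor paths from WFH to productivity. The node names (`wfh`, `subway_disruption`, and so on) and the helper functions are illustrative assumptions, not identifiers from the post or from DoWhy. The sketch only lists paths whose first arrow points into the treatment; it does not check for colliders, which is enough here because neither backdoor path in this DAG contains one.

```python
# Edge list for the DAG described above; node names are illustrative assumptions.
EDGES = [
    ("introversion", "wfh"),
    ("introversion", "productivity"),
    ("num_children", "wfh"),
    ("num_children", "productivity"),
    ("subway_disruption", "wfh"),
    ("wfh", "productivity"),
]


def neighbors(node):
    """Yield (adjacent_node, arrow) for every edge touching `node`."""
    for src, dst in EDGES:
        if src == node:
            yield dst, "->"
        elif dst == node:
            yield src, "<-"


def backdoor_paths(treatment, outcome):
    """List simple paths from treatment to outcome whose first arrow points INTO treatment."""
    found = []

    def walk(node, visited, rendered):
        if node == outcome:
            found.append(rendered)
            return
        for nxt, arrow in neighbors(node):
            if nxt in visited:
                continue
            if node == treatment and arrow != "<-":
                continue  # skip the causal arrow; backdoor paths enter the treatment
            walk(nxt, visited | {nxt}, f"{rendered} {arrow} {nxt}")

    walk(treatment, {treatment}, treatment)
    return found


paths = backdoor_paths("wfh", "productivity")
for p in paths:
    print(p)
```

The instrument never shows up in the result: its only arrow leads into WFH, so any path through it dead-ends before reaching productivity.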
Methods Showdown
Five estimators run on the post's simulated dataset, with the true ATE marked. Toggle methods to see which CIs cover the truth.
Confounding Simulator
Tune confounder strength and sample size live. Watch the naive bar drift while the backdoor and IV bars stay near 1.0.
Refutation Lab
Re-run DoWhy's three robustness tests — placebo, random common cause, data subset — and see why the original estimate survives.
Glossary (open a card if a term is unfamiliar)
Confounder
Causal graph (DAG)
Backdoor criterion (adjustment set here: {introversion, num_children})
Propensity score e(x)
IPW
Doubly Robust (AIPW)
Instrumental variable (IV)
Exclusion restriction
Refutation test
ATE
Methods Showdown — five estimators, one truth
These numbers come straight from estimation_results.csv in the post's folder, the same data behind Figure 3. The vertical blue line is the true ATE = 1.0. Use the checkboxes to toggle methods on and off. Hover over any point for the standard error, CI, bias, and whether the CI covers the truth.
Toggle estimators
What to look for
- The Naive CI [1.245, 1.526] does not contain the true ATE. It is precisely wrong — a small SE around the wrong target.
- Three backdoor methods sit on top of the truth. Linear regression (1.0051), IPW (1.0275), and Doubly Robust (1.0115) all recover the true effect within 3%, with narrow CIs (width 0.24–0.30).
- IV is centered near the truth but far less precise. Its CI is 1.30 wide, roughly four to five times the width of the backdoor CIs (0.24–0.30). Robustness to unmeasured confounders comes at the cost of variance.
- Hover Doubly Robust: its SE (0.0623) is nearly identical to regression's (0.0614). When both models are correct, DR achieves the semiparametric efficiency bound.
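The coverage claims above can be checked by hand. The sketch below rebuilds 95% intervals from the point estimates and standard errors quoted in the list, assuming a normal approximation (estimate ± 1.96 × SE); the post does not say exactly how its CIs were computed, so treat the reconstructed bounds as approximate. The function names are hypothetical.

```python
TRUE_ATE = 1.0
Z_95 = 1.96  # normal critical value; an assumption about how the CIs were built


def wald_ci(estimate, se, z=Z_95):
    """Normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se


def covers(ci, truth=TRUE_ATE):
    """True if the interval contains the true effect."""
    lo, hi = ci
    return lo <= truth <= hi


naive_ci = (1.245, 1.526)           # quoted directly in the text
reg_ci = wald_ci(1.0051, 0.0614)    # linear regression estimate and SE
dr_ci = wald_ci(1.0115, 0.0623)     # Doubly Robust estimate and SE

for name, ci in [("Naive", naive_ci), ("Linear regression", reg_ci), ("Doubly Robust", dr_ci)]:
    lo, hi = ci
    print(f"{name:18s} [{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}  covers truth: {covers(ci)}")
```

Under that assumption the regression interval comes out about 0.24 wide, matching the lower end of the width range quoted above.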
Why does the naive estimate overshoot?
In the WFH group, mean introversion is 5.19 versus 4.55 in the office group, and mean num. children is 1.58 versus 1.33. Both confounders push the WFH group toward higher productivity for reasons that have nothing to do with working from home. The naive comparison attributes the entire 1.39-point gap to WFH; the truth is that 0.39 of it is a confounder fingerprint. Backdoor adjustment removes the fingerprint by conditioning on the confounders directly. IV removes it by relying only on subway-disruption-induced variation in WFH, which is independent of personality and family size by construction.
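The same mechanism can be reproduced in a toy simulation. This is a simplified sketch, not the post's data-generating process: it uses a single binary confounder (`introvert`), assumed WFH probabilities of 0.7 and 0.3, and backdoor adjustment by stratification rather than regression. All names and parameter values are illustrative.

```python
import random
from statistics import fmean


def simulate(n=20000, true_ate=1.0, conf_effect=0.8, seed=0):
    """Toy data: introverts are more likely to WFH and are more productive anyway."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        introvert = int(rng.random() < 0.5)
        wfh = int(rng.random() < (0.7 if introvert else 0.3))
        productivity = true_ate * wfh + conf_effect * introvert + rng.gauss(0, 1)
        rows.append((introvert, wfh, productivity))
    return rows


def diff_in_means(rows):
    """Naive comparison: mean outcome of WFH minus mean outcome of office."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return fmean(treated) - fmean(control)


def backdoor_stratified(rows):
    """Compare within each confounder level, then average weighted by stratum size."""
    estimate = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        estimate += (len(stratum) / len(rows)) * diff_in_means(stratum)
    return estimate


rows = simulate()
naive_est = diff_in_means(rows)
adjusted_est = backdoor_stratified(rows)
print(f"naive:    {naive_est:.3f}")
print(f"adjusted: {adjusted_est:.3f}")
```

In this toy setup the naive estimate converges to 1 + 0.8 × (0.7 − 0.3) = 1.32 rather than 1.0, because 70% of the WFH group are introverts versus 30% of the office group. Conditioning on the confounder removes that gap.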
Confounding Simulator — tune the bias yourself
In the post's data, introversion has a coefficient of β_introversion = 0.8 on productivity and 0.3 on the WFH log-odds. Drag the confounder sliders and watch three estimators in real time: Naive drifts away from the truth as confounding grows; Backdoor (covariate-adjusted linear regression) and IV stay anchored. Each estimate ships with a 95% CI so you can also see the bias-variance trade-off.
Naive
Backdoor
IV (2SLS)
What to look for
- Set "Confounder → outcome β" to 0. The Naive bar snaps to the truth — no confounding means no bias.
- Crank "Confounder → outcome β" up to 2. The Naive estimate climbs past 2; Backdoor stays near 1.0 with a tight CI; IV stays near 1.0 with a wide CI.
- Drop "Instrument strength" below 0.3. The first-stage F collapses and the IV CI explodes — a textbook weak-instrument failure mode.
- Sample size only shrinks CIs. No amount of n fixes the naive bias — confounding is identification failure, not noise.
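The slider behaviour can be approximated offline. The sketch below is a toy stand-in for the simulator, not its actual code: it uses one confounder that the estimators never see, a binary instrument, and the Wald ratio (which coincides with 2SLS when both instrument and treatment are binary). The parameter scales, including what counts as a "weak" instrument, are assumptions and do not map onto the app's slider values.

```python
import random
from statistics import fmean


def simulate(n, conf_effect, instrument_strength, true_ate=1.0, seed=0):
    """Toy data: z -> d -> y, with confounder u driving both d and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)              # confounder, never used by the estimators
        z = int(rng.random() < 0.5)      # binary instrument (e.g. a subway disruption)
        d = int(0.8 * u + instrument_strength * z + rng.gauss(0, 1) > 0)
        y = true_ate * d + conf_effect * u + rng.gauss(0, 1)
        rows.append((z, d, y))
    return rows


def naive(rows):
    """Difference in mean outcome between treated and untreated."""
    return fmean([y for _, d, y in rows if d == 1]) - fmean([y for _, d, y in rows if d == 0])


def first_stage(rows):
    """How much the instrument shifts the probability of treatment."""
    return fmean([d for z, d, _ in rows if z == 1]) - fmean([d for z, d, _ in rows if z == 0])


def wald_iv(rows):
    """Wald estimator: reduced-form effect of z on y divided by the first stage."""
    reduced = fmean([y for z, _, y in rows if z == 1]) - fmean([y for z, _, y in rows if z == 0])
    return reduced / first_stage(rows)


N = 40000
no_conf = simulate(N, conf_effect=0.0, instrument_strength=1.5)
strong_conf = simulate(N, conf_effect=2.0, instrument_strength=1.5)
weak_iv = simulate(N, conf_effect=2.0, instrument_strength=0.05)

print(f"no confounding:     naive={naive(no_conf):.2f}")
print(f"strong confounding: naive={naive(strong_conf):.2f}  IV={wald_iv(strong_conf):.2f}")
print(f"first stage: strong={first_stage(strong_conf):.3f}  weak={first_stage(weak_iv):.3f}")
print(f"weak-instrument IV: {wald_iv(weak_iv):.2f}")
```

With a weak instrument the first stage is close to zero, so the ratio divides by a tiny number and sampling noise in the numerator is amplified. Re-running with different seeds shows the weak-instrument IV estimate swinging widely while the strong-instrument one stays near 1.0.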
Refutation Lab — does the estimate survive a stress test?
DoWhy's "Step 4 — Refute" runs automated stress tests. Placebo treatment permutes the treatment label and re-estimates (a good estimator should now return zero). Random common cause adds an irrelevant fake confounder (the estimate should not move). Data subset drops 20% of the data at random (the estimate should stay close). All three results below are from the post's actual refutation run on the linear-regression estimate of 1.005.
1 · Placebo Treatment
Permute the treatment column 100 times and re-estimate. With the treatment scrambled, the estimated effect should collapse to zero.
2 · Random Common Cause
Add a randomly generated variable as an extra common cause. A robust estimate should not move.
3 · Data Subset (80%)
Re-estimate on a random 80% subsample, 100 times. A stable estimate should not drift.
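The logic of the three tests can be sketched without DoWhy. The code below is a hand-rolled illustration on toy data, not DoWhy's implementation and not the post's refutation run: it uses a stratified backdoor estimator with one binary confounder, and every function name and parameter is an assumption made for the example.

```python
import random
from statistics import fmean


def simulate(n=4000, true_ate=1.0, seed=0):
    """Toy data with one binary confounder x affecting treatment d and outcome y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        d = int(rng.random() < (0.7 if x else 0.3))
        y = true_ate * d + 0.8 * x + rng.gauss(0, 1)
        data.append({"x": x, "d": d, "y": y})
    return data


def estimate(data, treatment="d", covariates=("x",)):
    """Stratified backdoor estimate: size-weighted within-stratum mean differences."""
    strata = {}
    for row in data:
        strata.setdefault(tuple(row[c] for c in covariates), []).append(row)
    total, used = 0.0, 0
    for rows in strata.values():
        treated = [r["y"] for r in rows if r[treatment] == 1]
        control = [r["y"] for r in rows if r[treatment] == 0]
        if treated and control:  # skip strata that lack one of the two groups
            total += len(rows) * (fmean(treated) - fmean(control))
            used += len(rows)
    return total / used


def placebo_treatment(data, rng, reps=100):
    """Permute the treatment labels and re-estimate; the mean should be near zero."""
    labels = [r["d"] for r in data]
    results = []
    for _ in range(reps):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        fake = [dict(r, placebo=s) for r, s in zip(data, shuffled)]
        results.append(estimate(fake, treatment="placebo"))
    return fmean(results)


def random_common_cause(data, rng):
    """Adjust for an extra, purely random binary variable w."""
    augmented = [dict(r, w=int(rng.random() < 0.5)) for r in data]
    return estimate(augmented, covariates=("x", "w"))


def data_subset(data, rng, fraction=0.8, reps=100):
    """Re-estimate on random subsamples drawn without replacement."""
    k = int(fraction * len(data))
    return fmean([estimate(rng.sample(data, k)) for _ in range(reps)])


data = simulate()
rng = random.Random(1)
original = estimate(data)
placebo = placebo_treatment(data, rng)
with_random_cause = random_common_cause(data, rng)
subset = data_subset(data, rng)

print(f"original estimate:   {original:.3f}")
print(f"placebo treatment:   {placebo:.3f}")
print(f"random common cause: {with_random_cause:.3f}")
print(f"80% data subset:     {subset:.3f}")
```

A passing run shows the placebo value near zero and the other two values close to the original estimate, which is the pattern the three panels are meant to display.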
How to read these tests
- Placebo collapses to zero (−0.00003). If a method "found" an effect on a randomly permuted treatment, you would not trust it on the real one. Ours did not.
- Random common cause leaves the estimate unchanged. A fragile estimate would bounce when extra variables enter the regression; ours is anchored.
- Data subset is essentially the same number. A sample-driven artefact would shift under resampling; ours does not.
- Refutation is necessary but not sufficient. These tests can miss a violated exclusion restriction or an omitted confounder, so pair them with domain judgment.
Connecting back to the post
The four DoWhy steps map to the four tabs: Model (Tab 1's DAG) → Identify (Tab 2: backdoor for the first three methods, IV for the fourth) → Estimate (Tab 2's point estimates; Tab 3's tunable version) → Refute (Tab 4). Each step constrains the next: a wrong DAG breaks identification; a missed confounder breaks the backdoor estimate; a violated exclusion restriction breaks IV; a brittle estimate fails refutation. Making every assumption explicit at each step is the contribution.