Did the policy actually do something?
Suppose an education ministry rolls out AI tutoring bots in some cities but not others. Did the AI tools actually improve learning, or were those cities already on an upward trajectory? Difference-in-Differences (DiD) answers this by comparing the change in outcomes over time between a treated group and a control group — using the control group as a mirror for what would have happened without treatment.
This app lets you turn the dials yourself. The post has three takeaways, each foregrounded in its own tab: (1) the classic 2×2 DiD recovers the true ATT of 5.0 closely, with an estimate of 5.12; (2) naive Two-Way Fixed Effects (TWFE) breaks under staggered adoption, because 28.3% of its weight falls on "forbidden comparisons" that drag the estimate down to 2.18, versus 2.41 from Callaway–Sant'Anna; and (3) HonestDiD sensitivity analysis shows the conclusion survives parallel-trends violations more than 15× the worst pre-treatment deviation.
The DiD intuition
The animation below shows the schematic logic of any DiD. Two outcomes drift upward over time at the same rate (parallel trends). At the vertical dashed line, a policy kicks in and the orange (treated) curve jumps. The gap between the orange curve and the dashed teal counterfactual is the ATT: what the policy did to the people it reached.
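The same logic fits in a few lines of arithmetic. The sketch below is a minimal illustration, not the app's own code; the function name `did_2x2` and the four group means are made up for the example.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences from four group means."""
    treated_change = treated_post - treated_pre
    # Under parallel trends, the control group's change stands in for the
    # change the treated group would have seen without the policy.
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical means: both groups drift up by 2, the treated group gains 5 more.
estimate = did_2x2(treated_pre=10.0, treated_post=17.0,
                   control_pre=8.0, control_post=10.0)
print(estimate)  # 5.0
```

The level difference between the groups (10 versus 8 before the policy) drops out. Only the difference in changes matters.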
DiD Simulator
Slide the true ATT, parallel-trend slope, and sample size. Watch the 2×2 estimator converge (or not) to the truth across 100 simulations.
Forest Plot
Five estimators across three settings — 2×2 ATT, staggered TWFE vs Callaway–Sant'Anna, and HonestDiD's robust CI at five M values.
HonestDiD
Drag a sensitivity-parameter slider and watch the robust 95% CI widen. Find the breakdown value where the CI first includes zero.
Glossary (open a card if a term is unfamiliar)
ATT (Average Treatment Effect on the Treated)
Parallel trends
Classic 2×2 design
Event study
TWFE (Two-Way Fixed Effects)
Forbidden comparisons
Callaway–Sant'Anna (CS)
Bacon decomposition
HonestDiD breakdown value M̄
Pre-trend test
DiD Simulator — when does the estimator work?
Imagine the post's classic 2×2 setup: half the units are treated, half are not, with one pre-period and one post-period. The true ATT and the parallel-trends slope are knobs you control. Drag the sliders and watch how the 2×2 estimator recovers (or misses) the truth across 100 simulated draws.
2×2 DiD estimate
Ground truth
What to look for
- Set pre-trend gap to 0. The 2×2 estimate hovers around the true ATT. Some draws are noisy, but the bias is zero on average — this is what "DiD works" looks like.
- Crank up the pre-trend gap. The estimate now systematically misses the truth. The bias scales linearly with the gap — this is exactly the parallel-trends bias from §5 of the post.
- Try ATT = 0 with a non-zero pre-trend. The estimator now finds a "treatment effect" where none exists. This is why pre-trend tests matter — and why HonestDiD (Tab 4) bounds rather than tests.
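A short derivation shows why the bias tracks the gap one-for-one. The symbols are assumptions introduced here, since the app's exact data-generating process is not shown: δ is the common change in untreated outcomes between the two periods, and g is the extra change the treated group would have experienced even without the policy (the pre-trend gap).

```latex
\begin{aligned}
\mathbb{E}\bigl[\widehat{\mathrm{DiD}}\bigr]
  &= \underbrace{(\delta + g + \mathrm{ATT})}_{\text{treated change}}
   - \underbrace{\delta}_{\text{control change}} \\
  &= \mathrm{ATT} + g .
\end{aligned}
```

Under this setup the bias equals g exactly, and with ATT = 0 the estimator reports g as if it were a treatment effect.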
Bias vs. variance over many simulations
Single runs are noisy. Run the 2×2 design 100 times with fresh draws (same parameters, different ε) to see how the estimate spreads around the true ATT.
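For readers who want to reproduce the idea offline, here is a minimal Monte Carlo sketch using only the standard library. It is an assumption-laden stand-in for the app's simulator: the function names, the Gaussian noise on each unit's post-minus-pre change, and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_did(true_att, trend, pretrend_gap, n_per_group, noise_sd, rng):
    """One simulated 2x2 design; returns the DiD estimate."""
    def mean_change(extra):
        # Each unit's post-minus-pre change: common trend + extra + noise.
        changes = [trend + extra + rng.gauss(0.0, noise_sd)
                   for _ in range(n_per_group)]
        return statistics.fmean(changes)

    treated_change = mean_change(pretrend_gap + true_att)
    control_change = mean_change(0.0)
    return treated_change - control_change


def monte_carlo(n_sims=100, seed=0, **params):
    """Repeat the design n_sims times; return the mean and SD of the estimates."""
    rng = random.Random(seed)
    draws = [simulate_did(rng=rng, **params) for _ in range(n_sims)]
    return statistics.fmean(draws), statistics.stdev(draws)


params = dict(true_att=5.0, trend=2.0, n_per_group=200, noise_sd=3.0)
for gap in (0.0, 1.5):
    mean_est, sd_est = monte_carlo(pretrend_gap=gap, **params)
    print(f"gap={gap}: mean estimate={mean_est:.2f}, spread (SD)={sd_est:.2f}")
```

Both runs reuse the same seed, so they see identical noise. The difference between the two mean estimates therefore isolates the bias introduced by the gap, while the SD shows the variance that remains even when parallel trends holds.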
DiD estimators side by side — the post's headline numbers
These numbers come straight from the post's three case studies: the classic 2×2 with true ATT = 5; the staggered-adoption comparison of TWFE versus Callaway–Sant'Anna; and HonestDiD's robust CI at five M values. Toggle outcomes and methods to compare. Hover any point to see the SE, CI, and diagnostic context.
What to look for
- Toggle the "Staggered" row. Naive TWFE returns 2.18; Callaway–Sant'Anna returns 2.41. The roughly 10% upward correction (2.18 to 2.41) reflects removing the 28.3% of TWFE weight that sits on forbidden comparisons.
- Look at the HonestDiD row. As M climbs from 0 to 15, the CI widens dramatically (from [2.53, 2.66] to [0.38, 4.81]) but the lower bound stays above zero. That is what "exceptionally robust" looks like.
- Compare the classic 2×2 to its event-study average. Both recover the true ATT of 5 within sampling error (5.12 and 4.80), confirming that the basic estimator works when treatment timing is shared.
Outcomes (settings)
Methods
Why does TWFE underestimate in the staggered setting?
With three cohorts (treated at periods 3, 5, 7) and a never-treated group, TWFE pools all of them into a single regression. Because treatment effects grow over time, a cohort-3 unit is already treated at period 5 and its outcome is still rising from its own treatment. When TWFE uses that unit as a "control" for cohort 5, part of cohort 5's effect is differenced away and the comparison underestimates cohort 5's true effect. Bacon decomposition reveals that 28.3% of TWFE's weight comes from these forbidden comparisons, pulling the average down to 2.18 versus the cleaner Callaway–Sant'Anna estimate of 2.41.
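A noise-free toy example makes the mechanism concrete. This is a sketch under stated assumptions rather than the post's data: the outcome function, the common trend of 2 per period, and the effect path (1.0 on adoption, growing by 0.5 per period) are all hypothetical.

```python
def outcome(cohort, t, slope=2.0):
    """Noise-free outcome: common trend plus an effect that grows with exposure.

    `cohort` is the first treated period, or None for never-treated units.
    """
    y0 = slope * t
    if cohort is None or t < cohort:
        return y0
    exposure = t - cohort
    return y0 + 1.0 + 0.5 * exposure  # hypothetical effect path: 1.0, 1.5, 2.0, ...


def did(treated, control, pre, post):
    """2x2 DiD for one treated cohort against one control cohort."""
    treated_change = outcome(treated, post) - outcome(treated, pre)
    control_change = outcome(control, post) - outcome(control, pre)
    return treated_change - control_change


# Cohort 5's true effect in its first treated period is 1.0.
clean = did(treated=5, control=None, pre=4, post=5)   # never-treated control
forbidden = did(treated=5, control=3, pre=4, post=5)  # already-treated control
print(clean, forbidden)  # 1.0 0.5
```

The forbidden comparison returns 0.5 instead of 1.0 because cohort 3's own effect grew by 0.5 between periods 4 and 5, and that growth is subtracted from cohort 5's effect.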
Connecting back to Tab 2
The 2×2 DiD simulator on the previous tab is exactly the building block of every method shown here — Callaway–Sant'Anna in particular is just a careful weighted average of valid 2×2 comparisons. The forbidden-comparison problem only appears when the 2×2 building block uses an already-treated unit as a "control".
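The "weighted average of valid 2×2 comparisons" idea can be sketched as follows. This is a simplified illustration, not the Callaway–Sant'Anna implementation: it assumes a noise-free hypothetical outcome function, equal-sized cohorts, eight periods, and an unweighted average over cohort-period cells, whereas the real estimator works from data and offers several weighting schemes.

```python
def outcome(cohort, t):
    """Hypothetical noise-free outcome; cohort=None means never treated."""
    y0 = 2.0 * t
    if cohort is None or t < cohort:
        return y0
    return y0 + 1.0 + 0.5 * (t - cohort)  # effect grows with exposure


def att_gt(g, t):
    """ATT(g, t): cohort g versus never-treated, with g - 1 as the base period."""
    base = g - 1
    treated_change = outcome(g, t) - outcome(g, base)
    control_change = outcome(None, t) - outcome(None, base)
    return treated_change - control_change


cohorts, last_period = (3, 5, 7), 8
# Only post-treatment cells, and only never-treated units as controls.
cells = {(g, t): att_gt(g, t)
         for g in cohorts
         for t in range(g, last_period + 1)}
overall = sum(cells.values()) / len(cells)
print(f"{len(cells)} valid cells, simple average ATT = {overall:.3f}")
```

Every cell uses a never-treated control, so no already-treated unit ever serves as a comparison group.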
HonestDiD — how robust is the conclusion?
Parallel trends is untestable for the post-treatment period — we never observe the counterfactual. Instead of assuming the violation is exactly zero, HonestDiD bounds it. The parameter M says: "the worst post-treatment violation could be up to M times the worst pre-treatment deviation". Drag the M slider and watch the robust CI widen — the point at which it first includes zero is the breakdown value.
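Assuming the app uses HonestDiD's relative-magnitudes restriction, which the "M times the worst pre-treatment deviation" wording suggests, the bound can be written formally. Here δ_t denotes the difference in trends between treated and control groups at event time t, with treatment starting at t = 0; the notation is introduced for illustration.

```latex
\Delta^{RM}(M) = \Bigl\{ \delta \;:\;
  |\delta_{t+1} - \delta_t| \;\le\; M \cdot \max_{s < 0} |\delta_{s+1} - \delta_s|
  \ \text{ for all } t \ge 0 \Bigr\}
```

Setting M = 0 forces post-treatment trend differences to stay flat, and larger M allows progressively larger departures.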
Robust CI at this M
Reading the result
At M = 0 the CI is the standard Callaway–Sant'Anna interval [2.53, 2.66] — narrow and decisive. As M grows the CI widens symmetrically. The breakdown value is where the CI first includes zero. In this post, the breakdown exceeds M = 15 — meaning the treatment conclusion holds even if post-treatment violations are more than 15 times worse than anything observed pre-treatment. By convention, a breakdown above M = 3 is considered strong evidence.
What to look for
- Slide M to 0. The robust CI collapses to the standard CS CI [2.53, 2.66] — no allowance for parallel-trends violations.
- Slide M to 3. The CI is [2.10, 3.09] — still comfortably above zero. The conventional "robust" threshold is passed easily.
- Slide M to 15. The CI is [0.38, 4.81] — the lower bound barely stays positive. The breakdown lies just beyond this slider's range; the post estimates M̄ ≳ 15.
- Compare CI widths. Width at M = 0 is 0.13; width at M = 15 is 4.43 — 34× wider. This is the price you pay for honesty about untestable assumptions.
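The numbers in this list can be checked, and the breakdown value roughly located, with a few lines of arithmetic. The sketch below uses only the three intervals reported above and assumes the lower bound moves linearly in M. That linearity is an assumption made for illustration; HonestDiD computes each interval separately, so the true breakdown value may differ.

```python
# Robust 95% CIs reported in the text, keyed by M.
reported = {0: (2.53, 2.66), 3: (2.10, 3.09), 15: (0.38, 4.81)}

widths = {m: hi - lo for m, (lo, hi) in reported.items()}
ratio = widths[15] / widths[0]

# Slope of the lower bound per unit of M, from the two outermost points.
# ASSUMPTION: the lower bound keeps falling at this rate beyond M = 15.
slope = (reported[15][0] - reported[0][0]) / (15 - 0)
breakdown = -reported[0][0] / slope  # M where the extrapolated bound hits zero

# Sanity check: the same line should roughly reproduce the M = 3 lower bound.
predicted_at_3 = reported[0][0] + slope * 3

print(f"width ratio (M=15 vs M=0): {ratio:.1f}")
print(f"predicted lower bound at M=3: {predicted_at_3:.2f}")
print(f"extrapolated breakdown M: {breakdown:.1f}")
```

The line through the M = 0 and M = 15 lower bounds also passes through the reported M = 3 value, which lends some support to the linear reading. Under that assumption the breakdown lands a little under 18, consistent with the post's M̄ ≳ 15.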
Why bound rather than test?
A pre-trend test (the post's p = 0.29) only fails to reject parallel trends — it cannot confirm them. Roth (2022) showed that conditioning on passing a pre-test can distort subsequent inference. HonestDiD takes a different stance: it accepts that parallel trends might fail, asks how badly, and reports the answer that survives every "how badly" up to your tolerance.