Difference-in-Differences — Interactive Lab

A pedagogical companion to Introduction to Difference-in-Differences in Python

Did the policy actually do something?

Suppose an education ministry rolls out AI tutoring bots in some cities but not others. Did the AI tools actually improve learning, or were those cities already on an upward trajectory? Difference-in-Differences (DiD) answers this by comparing the change in outcomes over time between a treated group and a control group — using the control group as a mirror for what would have happened without treatment.
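
As a minimal sketch of that comparison, the 2×2 arithmetic needs only four group means. The scores below are made-up illustrative numbers (not data from the post), chosen so that the answer matches the post's true ATT of 5.

```python
# Hypothetical average test scores; illustrative values, not taken from the post.
treated_pre, treated_post = 60.0, 68.0   # cities that received AI tutoring bots
control_pre, control_post = 58.0, 61.0   # cities that did not

treated_change = treated_post - treated_pre   # policy effect plus shared trend
control_change = control_post - control_pre   # shared trend only (the "mirror")

# DiD: the treated change net of the change the control group experienced.
did_estimate = treated_change - control_change

# Implied counterfactual: where treated cities would be under parallel trends.
counterfactual_post = treated_pre + control_change

print(f"treated change: {treated_change}, control change: {control_change}")
print(f"DiD estimate: {did_estimate}")
print(f"counterfactual treated outcome: {counterfactual_post}")
```

A naive before/after comparison of the treated cities alone would report 8 points; subtracting the control group's 3-point change leaves 5.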

This app lets you turn the dials yourself. It foregrounds three takeaways from the post, each in its own tab: (1) the classic 2×2 DiD recovers the true ATT of 5.0 with an estimate of 5.12; (2) naive Two-Way Fixed Effects (TWFE) breaks under staggered adoption, where 28.3% of its weight falls on "forbidden comparisons" that drag the estimate down to 2.18, versus 2.41 from Callaway–Sant'Anna; and (3) HonestDiD sensitivity analysis shows that the conclusion survives parallel-trends violations more than 15× the worst pre-treatment deviation.

The DiD intuition — exactly-zero shrinkage isn't the story here

The animation below shows the schematic logic of any DiD. Two outcomes drift upward over time at the same rate (parallel trends). At the dashed line, a policy kicks in and the orange (treated) curve jumps. The gap between the orange curve and the dashed teal counterfactual is the ATT — what the policy did to the people it reached.

The orange line shows L1 (LASSO-style) shrinkage and the dashed steel line shows L2 (Ridge-style) — they are stand-in shapes for "treatment shifts a level" and "the counterfactual stays smooth". DiD does not use any shrinkage — but the visual logic of "two curves diverge after a cut-point" is the same.
Tab 2

DiD Simulator

Slide the true ATT, parallel-trend slope, and sample size. Watch the 2×2 estimator converge (or not) to the truth across 100 simulations.

Tab 3

Forest Plot

Five estimators across three settings — 2×2 ATT, staggered TWFE vs Callaway–Sant'Anna, and HonestDiD's robust CI at five M values.

Tab 4

HonestDiD

Drag a sensitivity-parameter slider and watch the robust 95% CI widen. Find the breakdown value where the CI first includes zero.

Glossary (open a card if a term is unfamiliar)

ATT (Average Treatment Effect on the Treated)
The expected outcome under treatment minus the expected counterfactual, averaged over treated units. The post's true ATT = 5.0 by construction; the classic 2×2 estimator recovers 5.12.
Parallel trends
The identifying assumption: absent treatment, treated and control groups would have followed the same trajectory. Fundamentally untestable for the post-treatment period — only its pre-treatment counterpart can be checked.
Classic 2×2 design
Two groups (treated, control) × two periods (pre, post). The DiD estimator is the change in the treated group minus the change in the control group. This is the most transparent case and the easiest to understand.
Event study
Separate ATT for each period relative to treatment. Reveals dynamics: in the post, effects grow from 1.97 immediately after treatment to 3.27 six periods later.
TWFE (Two-Way Fixed Effects)
A single regression with unit + time fixed effects and a treatment dummy. Works with one treatment time; breaks under staggered adoption when effects vary across cohorts.
Forbidden comparisons
TWFE silently uses already-treated units as controls for later-treated units. With time-varying effects (here, effects that grow over time) this pulls the estimate downward. These comparisons carry 28.3% of TWFE's weight in the staggered example.
Callaway–Sant'Anna (CS)
A staggered-adoption estimator that computes ATT(g, t) for each cohort × period using only valid 2×2 comparisons, then aggregates with explicit weights. Doubly robust: it combines an outcome model with a propensity-score model and stays consistent if either one is correctly specified.
Bacon decomposition
Goodman-Bacon's diagnostic that decomposes TWFE into its underlying 2×2 comparisons by type (clean, valid timing, forbidden). Tells you what fraction of TWFE comes from comparisons you should not trust.
HonestDiD breakdown value M̄
The largest violation of parallel trends (in units of the worst pre-treatment deviation) at which the CI still excludes zero. The post's M̄ ≳ 15 — exceptionally robust.
Pre-trend test
Run a regression and test whether pre-treatment leads are jointly zero. The post's test gives slope difference 0.12 with p = 0.29 — fails to reject, so parallel trends look plausible.
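
A minimal sketch of a slope-difference pre-trend check, in the spirit of the last glossary entry. This is a simplified stand-in, not the post's regression: it fits a separate pre-period slope for each unit and compares group means with a normal approximation. The simulated data, the function names, and all parameter values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def unit_pre_slope(ys):
    """OLS slope of one unit's pre-treatment outcomes on the period index."""
    t_bar = (len(ys) - 1) / 2
    sxx = sum((t - t_bar) ** 2 for t in range(len(ys)))
    y_bar = fmean(ys)
    return sum((t - t_bar) * (y - y_bar) for t, y in enumerate(ys)) / sxx

def simulate_pre_slopes(n_units, extra_slope, rng, n_pre=4, sigma=1.0):
    """Pre-period slopes for units whose trend is 1.0 + extra_slope (assumed DGP)."""
    slopes = []
    for _ in range(n_units):
        level = rng.gauss(50, 5)  # unit-specific level; it drops out of the slope
        ys = [level + (1.0 + extra_slope) * t + rng.gauss(0, sigma)
              for t in range(n_pre)]
        slopes.append(unit_pre_slope(ys))
    return slopes

def slope_difference_test(treated_slopes, control_slopes):
    """Difference in mean pre-trend slopes and a two-sided normal p-value."""
    diff = fmean(treated_slopes) - fmean(control_slopes)
    se = math.sqrt(variance(treated_slopes) / len(treated_slopes)
                   + variance(control_slopes) / len(control_slopes))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, p_value

rng = random.Random(0)
control = simulate_pre_slopes(200, extra_slope=0.0, rng=rng)
treated_parallel = simulate_pre_slopes(200, extra_slope=0.0, rng=rng)
treated_diverging = simulate_pre_slopes(200, extra_slope=1.0, rng=rng)

diff_ok, p_ok = slope_difference_test(treated_parallel, control)
diff_bad, p_bad = slope_difference_test(treated_diverging, control)
print(f"parallel pre-trends:  slope diff = {diff_ok:.3f}, p = {p_ok:.3f}")
print(f"diverging pre-trends: slope diff = {diff_bad:.3f}, p = {p_bad:.3g}")
```

A large p-value in the first case only fails to reject equal pre-trends; it says nothing directly about the post-treatment period.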

DiD Simulator — when does the estimator work?

Imagine the post's classic 2×2 setup: half the units are treated, half are not, with one pre-period and one post-period. The true ATT and the parallel-trends slope are knobs you control. Drag the sliders and watch how the 2×2 estimator recovers (or misses) the truth across 100 simulated draws.

Total observations (half treated, half control). Larger n ⇒ tighter sampling distribution.
The known causal effect we want to recover. Try ATT = 0 to see a null result; try a negative value to see the sign flip.
Violation of parallel trends, measured per period. 0 = perfect parallel trends. Larger |value| = larger bias.
Standard deviation of the idiosyncratic shock.

2×2 DiD estimate

ATT̂
SE(ATT̂)
t-stat
95% CI lower
95% CI upper

Ground truth

True ATT
Bias (ATT̂ − ATT)
Bias source: parallel trends
CI covers truth
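
The quantities in the two panels above can be sketched for a single simulated draw as follows. The data-generating process (a shared trend of 1.0 per period, normal shocks), the function name `simulate_changes`, and the parameter values are assumptions made for illustration; the app's actual implementation is not shown here.

```python
import math
import random
from statistics import fmean, variance

def simulate_changes(n, att, gap, sigma, rng):
    """Per-unit outcome changes (post minus pre) for treated and control units.

    Assumed DGP (hypothetical): all units share a trend of 1.0; treated units
    additionally get `gap` (the parallel-trends violation) and the effect `att`.
    """
    half = n // 2
    # Differencing two independent N(0, sigma) shocks gives sd = sigma * sqrt(2).
    change_sd = sigma * math.sqrt(2)
    treated = [1.0 + gap + att + rng.gauss(0, change_sd) for _ in range(half)]
    control = [1.0 + rng.gauss(0, change_sd) for _ in range(half)]
    return treated, control

rng = random.Random(42)
true_att, gap = 5.0, 0.0
treated, control = simulate_changes(n=400, att=true_att, gap=gap, sigma=2.0, rng=rng)

# 2x2 DiD: mean change among treated minus mean change among controls.
att_hat = fmean(treated) - fmean(control)
# Standard error of a difference between two independent sample means.
se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
t_stat = att_hat / se
# 95% CI under a normal approximation.
ci_lower, ci_upper = att_hat - 1.96 * se, att_hat + 1.96 * se

bias = att_hat - true_att
covers_truth = ci_lower <= true_att <= ci_upper
print(f"ATT-hat = {att_hat:.3f}, SE = {se:.3f}, t = {t_stat:.2f}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}], bias = {bias:.3f}, covers truth: {covers_truth}")
```

Setting `gap` to a non-zero value shifts `att_hat` by roughly that amount while leaving the standard error unchanged, so the interval can be tight and still miss the truth.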

What to look for

  • Set pre-trend gap to 0. The 2×2 estimate hovers around the true ATT. Some draws are noisy, but the bias is zero on average — this is what "DiD works" looks like.
  • Crank up the pre-trend gap. The estimate now systematically misses the truth. The bias scales linearly with the gap — this is exactly the parallel-trends bias from §5 of the post.
  • Try ATT = 0 with a non-zero pre-trend. The estimator now finds a "treatment effect" where none exists. This is why pre-trend tests matter — and why HonestDiD (Tab 4) bounds rather than tests.

Bias vs. variance over many simulations

Single runs are noisy. Run the 2×2 design 100 times with fresh draws (same parameters, different ε) to see how the estimate spreads around the true ATT.
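
A minimal sketch of that repeated-sampling exercise, under an assumed data-generating process (shared trend, normal shocks). The function names `did_draw` and `monte_carlo` and all parameter values are hypothetical. With no pre-trend gap the estimates should centre on the true ATT; with a gap they should centre on the ATT plus the gap.

```python
import math
import random
from statistics import fmean, stdev

def did_draw(n, att, gap, sigma, rng):
    """One simulated 2x2 DiD estimate from n units (half treated, half control)."""
    half = n // 2
    change_sd = sigma * math.sqrt(2)  # sd of (post shock minus pre shock)
    treated = [1.0 + gap + att + rng.gauss(0, change_sd) for _ in range(half)]
    control = [1.0 + rng.gauss(0, change_sd) for _ in range(half)]
    return fmean(treated) - fmean(control)

def monte_carlo(n, att, gap, sigma, reps=100, seed=0):
    """Mean and spread of the estimator over `reps` fresh draws."""
    rng = random.Random(seed)
    estimates = [did_draw(n, att, gap, sigma, rng) for _ in range(reps)]
    return fmean(estimates), stdev(estimates)

true_att = 5.0
mean_ok, sd_ok = monte_carlo(n=200, att=true_att, gap=0.0, sigma=1.0)
mean_gap, sd_gap = monte_carlo(n=200, att=true_att, gap=2.0, sigma=1.0)
mean_big, sd_big = monte_carlo(n=2000, att=true_att, gap=2.0, sigma=1.0)

print(f"gap = 0, n = 200:  mean = {mean_ok:.3f}, sd = {sd_ok:.3f}")
print(f"gap = 2, n = 200:  mean = {mean_gap:.3f}, sd = {sd_gap:.3f}")
print(f"gap = 2, n = 2000: mean = {mean_big:.3f}, sd = {sd_big:.3f}")
```

The last line is the point of the exercise: a larger sample shrinks the spread but leaves the bias from the pre-trend gap untouched.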

DiD estimators side by side — the post's headline numbers

These numbers come straight from the post's three case studies: the classic 2×2 with true ATT = 5; the staggered-adoption comparison of TWFE versus Callaway–Sant'Anna; and HonestDiD's robust CI at five M values. Toggle outcomes and methods to compare. Hover any point to see the SE, CI, and diagnostic context.

What to look for

  • Toggle the "Staggered" row. Naive TWFE returns 2.18; Callaway–Sant'Anna returns 2.41. The upward correction of 0.23 (about 10%) is what removing the 28.3% weight on forbidden comparisons buys.
  • Look at the HonestDiD row. As M climbs from 0 to 15, the CI widens dramatically (from [2.53, 2.66] to [0.38, 4.81]) but the lower bound stays above zero. That is what "exceptionally robust" looks like.
  • Compare the classic 2×2 to its event-study average. Both recover the true ATT of 5 within sampling error (5.12 and 4.80), confirming that the basic estimator works when treatment timing is shared.

Outcomes (settings)

Methods

Why does TWFE underestimate in the staggered setting?

With three cohorts (treated at periods 3, 5, 7) and a never-treated group, TWFE pools all of them into a single regression. Because treatment effects grow over time, a cohort-3 unit that is already treated at period 5 has an elevated, still-rising outcome. When TWFE uses that unit as a "control" for cohort 5, the control's own effect growth is subtracted from cohort 5's change, so the comparison underestimates cohort 5's true effect. Bacon decomposition reveals that 28.3% of TWFE's weight comes from these forbidden comparisons, pulling the average down to 2.18 versus the cleaner Callaway–Sant'Anna estimate of 2.41.
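
The mechanism can be sketched with a noise-free toy example. The effect path (1.0 on impact, growing by 0.5 per period) and the helper names are assumptions for illustration; only the cohort timing (periods 3 and 5, plus a never-treated group) follows the setup described above.

```python
def effect(event_time):
    """Assumed treatment effect: 1.0 on impact, growing by 0.5 per period."""
    return 1.0 + 0.5 * event_time

def outcome(cohort, period):
    """Noise-free mean outcome; cohort=None denotes the never-treated group."""
    y = 1.0 * period  # common trend shared by every group
    if cohort is not None and period >= cohort:
        y += effect(period - cohort)
    return y

def did(treated_cohort, control_cohort, pre, post):
    """2x2 DiD of `treated_cohort` against `control_cohort` between two periods."""
    treated_change = outcome(treated_cohort, post) - outcome(treated_cohort, pre)
    control_change = outcome(control_cohort, post) - outcome(control_cohort, pre)
    return treated_change - control_change

true_effect = effect(0)                 # cohort 5's effect in its first treated period
clean = did(5, None, pre=4, post=5)     # never-treated group as control
forbidden = did(5, 3, pre=4, post=5)    # already-treated cohort 3 as control

print(f"true effect for cohort 5 at period 5: {true_effect}")
print(f"clean comparison (never-treated):     {clean}")
print(f"forbidden comparison (cohort 3):      {forbidden}")
```

The forbidden comparison falls short by exactly the amount cohort 3's own effect grew between periods 4 and 5, which is why growing effects bias it downward.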

Connecting back to Tab 2

The 2×2 DiD simulator on the previous tab is exactly the building block of every method shown here — Callaway–Sant'Anna in particular is just a careful weighted average of valid 2×2 comparisons. The forbidden-comparison problem only appears when the 2×2 building block uses an already-treated unit as a "control".
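
A minimal sketch of that building-block idea, assuming noise-free group means, a never-treated control group, a base period of g − 1, and equal weights across cells. It omits the doubly robust adjustment and inference that the actual Callaway–Sant'Anna estimator provides. The effect path and the 8-period panel length are hypothetical.

```python
COHORTS = (3, 5, 7)      # first treated period of each cohort
PERIODS = range(1, 9)    # assumed panel of 8 periods (hypothetical)

def effect(event_time):
    """Assumed treatment effect: 1.0 on impact, growing by 0.5 per period."""
    return 1.0 + 0.5 * event_time

def outcome(cohort, period):
    """Noise-free mean outcome; cohort=None denotes the never-treated group."""
    y = 1.0 * period  # common trend shared by every group
    if cohort is not None and period >= cohort:
        y += effect(period - cohort)
    return y

def att_gt(cohort, period):
    """ATT(g, t): cohort g's change since period g - 1, net of the never-treated change."""
    base = cohort - 1
    treated_change = outcome(cohort, period) - outcome(cohort, base)
    control_change = outcome(None, period) - outcome(None, base)
    return treated_change - control_change

# One valid 2x2 comparison per post-treatment cohort-period cell.
att = {(g, t): att_gt(g, t) for g in COHORTS for t in PERIODS if t >= g}

# Simple aggregation: equal weight on every cell (assumes equal-sized cohorts).
overall = sum(att.values()) / len(att)

for (g, t), value in sorted(att.items()):
    print(f"ATT(g={g}, t={t}) = {value}")
print(f"overall ATT (equal cell weights) = {overall:.4f}")
```

Because no cell ever uses an already-treated group as its control, each ATT(g, t) equals the assumed effect at that event time.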

HonestDiD — how robust is the conclusion?

Parallel trends is untestable for the post-treatment period — we never observe the counterfactual. Instead of assuming the violation is exactly zero, HonestDiD bounds it. The parameter M says: "the worst post-treatment violation could be up to M times the worst pre-treatment deviation". Drag the M slider and watch the robust CI widen — the point at which it first includes zero is the breakdown value.

M = 0 assumes parallel trends hold exactly. M = 15 allows post-treatment violations up to 15× as large as the worst deviation observed pre-treatment.

Robust CI at this M

Overall ATT: 2.5958
Lower bound
Upper bound
CI width
Excludes zero?

Reading the result

At M = 0 the CI is the standard Callaway–Sant'Anna interval [2.53, 2.66] — narrow and decisive. As M grows the CI widens symmetrically. The breakdown value is where the CI first includes zero. In this post, the breakdown exceeds M = 15 — meaning the treatment conclusion holds even if post-treatment violations are more than 15 times worse than anything observed pre-treatment. By convention, a breakdown above M = 3 is considered strong evidence.
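
As a back-of-the-envelope check, the three robust CIs reported in this lab (at M = 0, 3, and 15) can be used to see how fast the lower bound falls and roughly where it would cross zero. This is not the HonestDiD procedure: it only fits a straight line through the reported lower bounds, and extending that line beyond M = 15 is an assumption.

```python
# Robust 95% CIs reported in this lab at three values of M.
robust_ci = {0: (2.53, 2.66), 3: (2.10, 3.09), 15: (0.38, 4.81)}

# CI widths and how much wider the interval is at M = 15 than at M = 0.
widths = {m: upper - lower for m, (lower, upper) in robust_ci.items()}
width_ratio = widths[15] / widths[0]

# Slope of the lower bound per unit of M, from the two extreme points.
lower_0, lower_15 = robust_ci[0][0], robust_ci[15][0]
slope = (lower_15 - lower_0) / 15

# Check the straight line against the middle point (reported lower bound: 2.10).
predicted_lower_3 = lower_0 + slope * 3

# ASSUMPTION: the same line continues beyond M = 15.
breakdown_m = -lower_0 / slope

print(f"width at M=0: {widths[0]:.2f}, at M=15: {widths[15]:.2f}, ratio: {width_ratio:.1f}")
print(f"lower bound falls by {-slope:.3f} per unit of M")
print(f"predicted lower bound at M=3: {predicted_lower_3:.2f}")
print(f"extrapolated breakdown value: M = {breakdown_m:.1f}")
```

Under that linear extrapolation the lower bound would reach zero somewhat above M = 15, which is consistent with the post's M̄ ≳ 15.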

What to look for

  • Slide M to 0. The robust CI collapses to the standard CS CI [2.53, 2.66] — no allowance for parallel-trends violations.
  • Slide M to 3. The CI is [2.10, 3.09] — still comfortably above zero. The conventional "robust" threshold is passed easily.
  • Slide M to 15. The CI is [0.38, 4.81] — the lower bound barely stays positive. The breakdown lies just beyond this slider's range; the post estimates M̄ ≳ 15.
  • Compare CI widths. Width at M = 0 is 0.13; width at M = 15 is 4.43 — 34× wider. This is the price you pay for honesty about untestable assumptions.

Why bound rather than test?

A pre-trend test (the post's p = 0.29) only fails to reject parallel trends — it cannot confirm them. Roth (2022) showed that conditioning on passing a pre-test can distort subsequent inference. HonestDiD takes a different stance: it accepts that parallel trends might fail, asks how badly, and reports the answer that survives every "how badly" up to your tolerance.