Introduction to Difference-in-Differences in Python

Did the policy actually do something?

Suppose an education ministry rolls out AI tutoring bots in some cities but not others. Did the AI tools actually improve learning, or were those cities already on an upward trajectory? Difference-in-Differences (DiD) answers this by comparing the change in outcomes over time between a treated group and a control group — using the control group as a mirror for what would have happened without treatment.

This app lets you turn the dials yourself. It foregrounds three takeaways from the post, each in its own tab: (1) the classic 2×2 DiD recovers the true ATT of 5.0 with an estimate of 5.12; (2) naive Two-Way Fixed Effects (TWFE) breaks under staggered adoption, where 28.3% of its weight falls on "forbidden comparisons" that drag the estimate down to 2.18 versus the Callaway–Sant'Anna 2.41; and (3) HonestDiD sensitivity analysis shows the conclusion survives parallel-trends violations more than 15× the worst pre-treatment deviation.

The DiD intuition — exactly-zero shrinkage isn't the story here

The animation below shows the schematic logic of any DiD. Two outcomes drift upward over time at the same rate (parallel trends). At the dashed line, a policy kicks in and the orange (treated) curve jumps. The gap between the orange curve and the dashed teal counterfactual is the ATT — what the policy did to the people it reached.

The orange line is drawn as an L1 (LASSO-style) shrinkage curve and the dashed steel line as an L2 (Ridge-style) curve. They are only stand-in shapes for "treatment shifts a level" and "the counterfactual stays smooth". DiD itself uses no shrinkage, but the visual logic of "two curves diverge after a cut-point" is the same.

Tab 2: DiD Simulator

Slide the true ATT, parallel-trend slope, and sample size. Watch the 2×2 estimator converge (or not) to the truth across 100 simulations.

Tab 3: Forest Plot

Five estimators across three settings: 2×2 ATT, staggered TWFE vs Callaway–Sant'Anna, and HonestDiD's robust CI at five M values.

Tab 4: HonestDiD

Drag a sensitivity-parameter slider and watch the robust 95% CI widen. Find the breakdown value where the CI first includes zero.

Glossary (open a card if a term is unfamiliar)

ATT (Average Treatment Effect on the Treated)

The expected outcome under treatment minus the expected counterfactual, averaged over treated units. The post's true ATT = 5.0 by construction; the classic 2×2 estimator recovers 5.12.
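In potential-outcomes notation, the definition above can be written as follows. The symbols are an assumption introduced here for illustration, not notation taken from the post: Y(1) and Y(0) are a unit's outcomes with and without treatment, and D = 1 marks treated units.

```latex
\mathrm{ATT} = \mathbb{E}\big[\, Y_{i,\text{post}}(1) - Y_{i,\text{post}}(0) \mid D_i = 1 \,\big]
```

The second term is never observed for treated units, which is why DiD needs the control group to stand in for it.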

Parallel trends

The identifying assumption: absent treatment, treated and control groups would have followed the same trajectory. Fundamentally untestable for the post-treatment period — only its pre-treatment counterpart can be checked.

Classic 2×2 design

Two groups (treated, control) × two periods (pre, post). The DiD estimator is the change in the treated group minus the change in the control group. This is the most transparent and easiest case to understand.
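A minimal sketch of the 2×2 arithmetic, assuming the data arrive as an outcome array plus two 0/1 indicator arrays. The function name `did_2x2` and the four cell values are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def did_2x2(y, treated, post):
    """Change in the treated group's mean minus change in the control group's mean."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    post = np.asarray(post, dtype=bool)
    change_treated = y[treated & post].mean() - y[treated & ~post].mean()
    change_control = y[~treated & post].mean() - y[~treated & ~post].mean()
    return change_treated - change_control

# Hypothetical noise-free cell means: control goes 10 -> 12, treated goes 20 -> 27.
y = np.array([10.0, 12.0, 20.0, 27.0])
treated = np.array([0, 0, 1, 1])
post = np.array([0, 1, 0, 1])

estimate = did_2x2(y, treated, post)
print(estimate)  # (27 - 20) - (12 - 10) = 5.0
```

The control group's change of 2 is treated as the trend the treated group would have followed anyway, so only the remaining 5 is attributed to the policy.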

Event study

Separate ATT for each period relative to treatment. Reveals dynamics: in the post, effects grow from 1.97 immediately after treatment to 3.27 six periods later.

TWFE (Two-Way Fixed Effects)

A single regression with unit + time fixed effects and a treatment dummy. Works with one treatment time; breaks under staggered adoption when effects vary across cohorts.
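As a rough illustration of the "works with one treatment time" case, the sketch below fits TWFE by ordinary least squares on explicit unit and time dummies and checks that, in a balanced two-period panel, the treatment coefficient coincides with the 2×2 DiD of mean changes. The helper `twfe` and the simulated panel are assumptions made here, not the post's code.

```python
import numpy as np

def twfe(y, unit, time, d):
    """OLS of y on unit dummies, time dummies, and treatment dummy d.
    Returns the coefficient on d."""
    _, u_idx = np.unique(unit, return_inverse=True)
    _, t_idx = np.unique(time, return_inverse=True)
    unit_dummies = np.eye(u_idx.max() + 1)[u_idx]
    time_dummies = np.eye(t_idx.max() + 1)[t_idx][:, 1:]  # drop first period (collinearity)
    X = np.column_stack([unit_dummies, time_dummies, np.asarray(d, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta[-1]

# Hypothetical balanced panel: 200 units, 2 periods, first half treated in period 1.
rng = np.random.default_rng(0)
n_units = 200
unit = np.repeat(np.arange(n_units), 2)
time = np.tile([0, 1], n_units)
treated_unit = np.arange(n_units) < n_units // 2
d = (treated_unit[unit] & (time == 1)).astype(float)
alpha = rng.normal(size=n_units)[unit]            # unit fixed effects
y = alpha + 1.0 * time + 5.0 * d + rng.normal(size=unit.size)

beta_twfe = twfe(y, unit, time, d)

# Same quantity computed as a difference of mean changes.
dy = y[time == 1] - y[time == 0]                  # one change per unit, in unit order
did = dy[treated_unit].mean() - dy[~treated_unit].mean()
print(beta_twfe, did)
```

With a single adoption date the two numbers agree up to floating-point error; the equivalence is what stops holding once adoption is staggered.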

Forbidden comparisons

TWFE silently uses already-treated units as controls for later-treated units. When effects grow over time, as they do in the post's event study, this pulls the estimate downward. In the staggered example, 28.3% of TWFE's weight falls on these comparisons.
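To see why such a comparison misleads, consider a noise-free toy example. Everything in it is hypothetical: two cohorts adopting at periods 2 and 5, a common trend of 0.5 per period, and an effect that grows by 1.0 each period after adoption. The sketch isolates a single forbidden 2×2 comparison rather than reproducing the full TWFE weighting.

```python
import numpy as np

def effect(t, g, step=1.0):
    """Hypothetical dynamic effect: `step` in the adoption period g, growing by `step` per period."""
    return np.where(t >= g, step * (t - g + 1), 0.0)

t = np.arange(8)
trend = 0.5 * t
g_early, g_late = 2, 5
y_early = 10.0 + trend + effect(t, g_early)
y_late = 20.0 + trend + effect(t, g_late)

# Window around the late cohort's adoption. The early cohort is ALREADY treated
# in both the "pre" and "post" periods, so it is not a clean control.
pre, post = [3, 4], [5, 6]
change_late = y_late[post].mean() - y_late[pre].mean()
change_early = y_early[post].mean() - y_early[pre].mean()

forbidden_did = change_late - change_early
true_att_late = effect(t, g_late)[post].mean()
print(forbidden_did, true_att_late)  # -0.5 versus 1.5
```

The early cohort's effect keeps growing inside the window, and that growth is subtracted from the late cohort's change as if it were trend. The comparison returns a negative number even though the true effect is positive.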

Callaway–Sant'Anna (CS)

A staggered-adoption estimator that computes ATT(g, t) for each cohort × period using only valid 2×2 comparisons, then aggregates with explicit weights. It is doubly robust: it combines an outcome model with a propensity score and stays consistent if either one is correctly specified.
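The building block can be sketched without covariates. The version below is a deliberately simplified, unconditional ATT(g, t) that compares cohort g with never-treated units, using period g − 1 as the base. It is not the doubly robust estimator, and the function `att_gt`, the toy panel, and the equal-weight average at the end are all assumptions for illustration.

```python
import numpy as np

def att_gt(y, cohort, g, t):
    """Unconditional ATT(g, t): cohort g versus never-treated units, base period g - 1.
    y has shape (n_units, n_periods); cohort holds each unit's adoption period,
    with np.inf for never-treated units."""
    in_cohort = cohort == g
    never = np.isinf(cohort)
    delta = y[:, t] - y[:, g - 1]          # change since the last untreated period
    return delta[in_cohort].mean() - delta[never].mean()

# Hypothetical noise-free panel: cohorts adopt at 2 and 5, two units never adopt.
cohort = np.array([2, 2, 5, 5, np.inf, np.inf])
t = np.arange(8)
alpha = np.arange(6.0)                     # unit fixed effects
effect = np.where(t[None, :] >= cohort[:, None],
                  t[None, :] - cohort[:, None] + 1, 0.0)
y = alpha[:, None] + 0.5 * t[None, :] + effect

estimates = {(g, s): att_gt(y, cohort, g, s) for g in (2, 5) for s in range(g, 8)}
overall = np.mean(list(estimates.values()))  # equal weights, one of several possible choices
print(estimates[(2, 2)], estimates[(5, 7)], overall)
```

Because already-treated units never serve as controls, each ATT(g, t) recovers the effect built into the toy data, however fast it grows.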

Bacon decomposition

Goodman-Bacon's diagnostic that decomposes the TWFE estimate into its underlying 2×2 comparisons by type (clean, valid timing, forbidden). It tells you what fraction of the TWFE weight comes from comparisons you should not trust.

HonestDiD breakdown value M̄

The largest violation of parallel trends (in units of the worst pre-treatment deviation) at which the CI still excludes zero. The post's M̄ ≳ 15 — exceptionally robust.

Pre-trend test

Run a regression and test whether pre-treatment leads are jointly zero. The post's test gives slope difference 0.12 with p = 0.29 — fails to reject, so parallel trends look plausible.

DiD Simulator — when does the estimator work?

Imagine the post's classic 2×2 setup: half the units are treated, half are not, with one pre-period and one post-period. The true ATT and the parallel-trends slope are knobs you control. Drag the sliders and watch how the 2×2 estimator recovers (or misses) the truth across 100 simulated draws.

Sample size n: 200

Total observations (half treated, half control). Larger n ⇒ tighter sampling distribution.

True ATT: 5.00

The known causal effect we want to recover. Try ATT = 0 to see a null result; try a negative value to see a sign flip.

Pre-trend gap: 0.00

Violation of parallel trends, measured per period. 0 = perfect parallel trends. Larger |value| = larger bias.

Noise σ: 1.00

Standard deviation of the idiosyncratic shock.

2×2 DiD estimate

Readouts: ATT̂, SE(ATT̂), t-stat, 95% CI lower, 95% CI upper.

Ground truth

Readouts: True ATT; Bias (ATT̂ − ATT); Bias source: parallel trends; CI covers truth.

What to look for

Set pre-trend gap to 0. The 2×2 estimate hovers around the true ATT. Some draws are noisy, but the bias is zero on average — this is what "DiD works" looks like.
Crank up the pre-trend gap. The estimate now systematically misses the truth. The bias scales linearly with the gap — this is exactly the parallel-trends bias from §5 of the post.
Try ATT = 0 with a non-zero pre-trend. The estimator now finds a "treatment effect" where none exists. This is why pre-trend tests matter — and why HonestDiD (Tab 4) bounds rather than tests.
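The three checks above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the app's actual data-generating process: n is read as the number of units, each unit is observed once before and once after, and the pre-trend gap enters as an extra change for treated units only. The function name `did_estimate` and the common trend of 1.0 are hypothetical.

```python
import numpy as np

def did_estimate(n, att, gap, sigma, rng, trend=1.0):
    """One 2x2 draw. `gap` is the extra drift of the treated group (a parallel-trends violation)."""
    treated = np.arange(n) < n // 2
    y_pre = rng.normal(0.0, sigma, n)
    y_post = trend + (att + gap) * treated + rng.normal(0.0, sigma, n)
    change = y_post - y_pre
    return change[treated].mean() - change[~treated].mean()

rng = np.random.default_rng(1)
results = {}
for gap in (0.0, 1.0, 2.0):
    # True ATT is zero, so whatever the estimator reports is bias plus noise.
    results[gap] = did_estimate(n=20_000, att=0.0, gap=gap, sigma=1.0, rng=rng)
    print(f"gap={gap:.1f}  estimate={results[gap]:.3f}")
```

With a large sample the noise is small, so each estimate sits close to the gap itself: the estimator reports a "treatment effect" equal to the violation.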

Bias vs. variance over many simulations

Single runs are noisy. Run the 2×2 design 100 times with fresh draws (same parameters, different ε) to see how the estimate spreads around the true ATT.
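A hedged sketch of that repeated-sampling exercise follows, using the slider defaults shown above (n = 200, ATT = 5, gap = 0, σ = 1). The data-generating process and the helper `one_draw` are assumptions made for illustration; the app may simulate differently.

```python
import numpy as np

def one_draw(rng, n=200, att=5.0, gap=0.0, sigma=1.0):
    """One 2x2 draw: returns (estimate, standard error)."""
    treated = np.arange(n) < n // 2
    y_pre = rng.normal(0.0, sigma, n)
    y_post = 1.0 + (att + gap) * treated + rng.normal(0.0, sigma, n)
    dy = y_post - y_pre
    est = dy[treated].mean() - dy[~treated].mean()
    se = np.sqrt(dy[treated].var(ddof=1) / treated.sum()
                 + dy[~treated].var(ddof=1) / (~treated).sum())
    return est, se

rng = np.random.default_rng(42)
true_att = 5.0
draws = np.array([one_draw(rng, att=true_att) for _ in range(100)])
est, se = draws[:, 0], draws[:, 1]

# Share of the 100 normal-approximation 95% CIs that contain the true ATT.
covered = np.abs(est - true_att) <= 1.96 * se
print(f"mean estimate : {est.mean():.3f}")
print(f"spread (sd)   : {est.std(ddof=1):.3f}")
print(f"CI coverage   : {covered.mean():.2f}")
```

The mean across draws speaks to bias, the spread speaks to variance, and the coverage rate shows whether the reported intervals are honest when parallel trends hold.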

HonestDiD — how robust is the conclusion?

Parallel trends is untestable for the post-treatment period — we never observe the counterfactual. Instead of assuming the violation is exactly zero, HonestDiD bounds it. The parameter M says: "the worst post-treatment violation could be up to M times the worst pre-treatment deviation". Drag the M slider and watch the robust CI widen — the point at which it first includes zero is the breakdown value.

Sensitivity parameter M: 1.0

M = 0 assumes parallel trends hold exactly. M = 15 allows 15× worse post-treatment violations than anything observed pre-treatment.

Robust CI at this M

Overall ATT: 2.5958

Readouts: lower bound, upper bound, CI width, and whether the CI excludes zero.

Reading the result

At M = 0 the CI is the standard Callaway–Sant'Anna interval [2.53, 2.66] — narrow and decisive. As M grows the CI widens symmetrically. The breakdown value is where the CI first includes zero. In this post, the breakdown exceeds M = 15 — meaning the treatment conclusion holds even if post-treatment violations are more than 15 times worse than anything observed pre-treatment. By convention, a breakdown above M = 3 is considered strong evidence.

What to look for

Slide M to 0. The robust CI collapses to the standard CS CI [2.53, 2.66] — no allowance for parallel-trends violations.
Slide M to 3. The CI is [2.10, 3.09] — still comfortably above zero. The conventional "robust" threshold is passed easily.
Slide M to 15. The CI is [0.38, 4.81] — the lower bound barely stays positive. The breakdown lies just beyond this slider's range; the post estimates M̄ ≳ 15.
Compare CI widths. Width at M = 0 is 0.13; width at M = 15 is 4.43, about 34 times as wide. This is the price you pay for honesty about untestable assumptions.
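The three intervals quoted above are enough for a back-of-the-envelope check on the breakdown value. The sketch below fits a straight line to the CI half-widths and extrapolates to the M at which the lower bound would reach zero. Linearity in M is an assumption made here: it happens to fit the three reported points, but it is not how HonestDiD computes its intervals, and the inputs are rounded to two decimals.

```python
import numpy as np

# Robust CIs as reported in the text, rounded to two decimals.
m_values = np.array([0.0, 3.0, 15.0])
lower = np.array([2.53, 2.10, 0.38])
upper = np.array([2.66, 3.09, 4.81])

center = (lower + upper) / 2               # roughly 2.595 at every M
half_width = (upper - lower) / 2

# Assumption: half-width grows linearly in M (a straight line through the points).
slope, intercept = np.polyfit(m_values, half_width, 1)

# The lower bound hits zero when the half-width equals the center.
breakdown = (center.mean() - intercept) / slope
print(f"half-width = {intercept:.3f} + {slope:.3f} * M")
print(f"extrapolated breakdown M = {breakdown:.1f}")
```

Under this linear reading the lower bound would cross zero a little below M = 18, which agrees with the statement that the breakdown lies beyond the slider's range of 15.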

Why bound rather than test?

A pre-trend test (the post's p = 0.29) only fails to reject parallel trends — it cannot confirm them. Roth (2022) showed that conditioning on passing a pre-test can distort subsequent inference. HonestDiD takes a different stance: it accepts that parallel trends might fail, asks how badly, and reports the answer that survives every "how badly" up to your tolerance.
