Difference-in-Differences — Interactive Lab

A pedagogical companion to Difference-in-Differences for Policy Evaluation: A Tutorial using R

Why modern Difference-in-Differences?

For decades, the workhorse for evaluating policy interventions has been the two-way fixed-effects (TWFE) regression. With staggered treatment adoption and heterogeneous treatment effects, however, TWFE silently mixes valid and invalid comparisons — already-treated units sneak in as the "control" for later-treated units — and the resulting coefficient can be biased toward zero, or even take the wrong sign.

This app lets you reproduce the post's headline findings interactively. It has four tabs. After this overview, you will (1) slide the parallel-trends violation knob and watch the DiD estimate respond; (2) compare TWFE to the Callaway-Sant'Anna group-time ATT across 100 synthetic panels; and (3) explore the actual minimum-wage results: TWFE = −0.038, doubly robust = −0.065, breakdown M̄ ≈ 0.67.

Parallel trends — the assumption that does all the work

DiD is identified by one assumption: in the absence of treatment, the treated group's outcome would have moved in parallel with the control's. The orange line is what we actually observe; the dashed grey line is the unobserved counterfactual. Their gap at t+3 is the DiD effect.

Adjust the divergence slider in Tab 2 to see how the estimated DiD effect moves with the assumed deviation from parallel trends.
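To make the role of the assumption concrete, here is a minimal sketch of the 2×2 DiD arithmetic. All numbers and names (`true_effect`, `common_trend`, `violation`) are hypothetical and chosen only for illustration; they are not the app's settings. The point is that any extra drift in the treated group passes one-for-one into the estimate.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 DiD: change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical inputs, for illustration only.
true_effect = -0.05   # effect of treatment on the treated group
common_trend = 0.02   # change both groups would share absent treatment
violation = 0.01      # extra drift in the treated group (0 means parallel trends holds)

control_pre = 1.00
control_post = control_pre + common_trend
treated_pre = 1.20
treated_post = treated_pre + common_trend + violation + true_effect

estimate = did_2x2(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {estimate:+.3f}")
print(f"true effect:  {true_effect:+.3f}")
print(f"bias:         {estimate - true_effect:+.3f}")
```

Setting `violation = 0` makes the estimate coincide with `true_effect`; any other value shows up in full as bias.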

Tab 2

Parallel Trends

Slide the violation knob. Toggle pre-trends. See how the dashed counterfactual moves and how the DiD estimate changes as a result.

Tab 3

TWFE vs CS Showdown

Simulate staggered panels with cohort-specific dynamics. Run 100 sims and watch the TWFE distribution drift away from the truth while CS stays centred.

Tab 4

Forest Plot + Event Study

The post's full menu: TWFE, CS overall, doubly robust, IPW, not-yet-treated, anticipation, lagged outcomes. Hover for SEs and CIs.

Glossary (open a card if a term is unfamiliar)

Parallel trends
Absent treatment, treated and control would move in lockstep. The single identifying assumption of DiD.
Staggered adoption
Different units enter treatment at different dates. The post has cohorts G ∈ {0, 2004, 2006}, where G = 0 presumably marks never-treated units.
TWFE regression
Two-way fixed effects: Y_it = θ_t + η_i + α D_it + v_it. Equivalent to 2×2 DiD when there's only one treatment date — but biased with staggered timing under heterogeneity.
Group-time ATT — ATT(g, t)
The treatment effect for cohort g at calendar time t. The Callaway-Sant'Anna building block; aggregated into overall ATT or event study.
Forbidden comparisons
When TWFE uses already-treated units as the "control" for later-treated units. The source of the negative-weight problem.
Doubly robust DiD
Combines an outcome regression and a propensity-score model. Consistent if either model is correct — belt and suspenders.
Event study ATT(e)
ATT aggregated by event time e = t − g, not calendar time. Lines up each cohort at its own "year zero."
HonestDiD breakdown M̄
The threshold of allowed parallel-trends violation at which the CI first crosses zero. Bigger M̄ = more robust. The post reports M̄ ≈ 0.67.
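
The group-time and event-study entries above fit together as a simple re-indexing step. The sketch below is a minimal illustration with made-up ATT(g, t) values and cohort sizes (`att_gt`, `cohort_size`, and `event_study` are hypothetical names, not the post's code or a library API): each ATT(g, t) is assigned to event time e = t − g and averaged with cohort-size weights.

```python
from collections import defaultdict

# Hypothetical ATT(g, t) values and cohort sizes, for illustration only.
att_gt = {
    (2004, 2004): -0.02, (2004, 2005): -0.04, (2004, 2006): -0.06,
    (2006, 2006): -0.01, (2006, 2007): -0.03,
}
cohort_size = {2004: 20, 2006: 40}


def event_study(att_gt, cohort_size):
    """Average ATT(g, t) within each event time e = t - g, weighting by cohort size."""
    num = defaultdict(float)
    den = defaultdict(float)
    for (g, t), att in att_gt.items():
        e = t - g
        num[e] += cohort_size[g] * att
        den[e] += cohort_size[g]
    return {e: num[e] / den[e] for e in sorted(num)}


att_e = event_study(att_gt, cohort_size)
for e, value in att_e.items():
    print(f"e = {e}: ATT(e) = {value:+.4f}")
```

Only the 2004 cohort is observed two periods after treatment here, so ATT(e = 2) rests on that cohort alone. This changing composition is worth keeping in mind when reading any event-study plot.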

The three takeaways the app re-runs in code

  1. TWFE understates the effect by ~33%. The minimum-wage TWFE estimate is −0.038 vs. the proper overall ATT of −0.057; 64% of the bias is pre-treatment contamination, 36% is improper post-treatment weighting.
  2. The doubly robust ATT (−0.065) is stable. Across estimation methods (regression, IPW, DR), comparison groups (never-treated, not-yet-treated), and base periods (universal, varying), the estimate moves only by 0.001.
  3. HonestDiD breakdown M̄ ≈ 0.67. The on-impact effect survives parallel-trends violations up to ~67% of the largest observed pre-trend deviation before the CI first crosses zero.

TWFE vs Callaway-Sant'Anna — a head-to-head

With one treatment date, TWFE = 2×2 DiD = CS. With staggered treatment and heterogeneous dynamics, TWFE silently uses already-treated units as controls. The damage shows up as bias toward zero. Crank the dials and confirm.

Strongly time-varying effects worsen TWFE's negative-weight problem.
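
The mechanism can be reproduced without any noise at all. The following is a minimal sketch under stated assumptions, not the app's simulator: three equally sized hypothetical cohorts (never treated, first treated in period 3, first treated in period 6), eight periods, and a negative effect that grows with event time. TWFE is computed by double-demeaning, which is valid here because the panel is balanced; the CS-style estimate compares each cohort with never-treated units using base period g − 1.

```python
# Noise-free staggered panel with hypothetical cohorts and effect sizes.
PERIODS = range(1, 9)               # t = 1..8
COHORTS = {0: 10, 3: 10, 6: 10}     # first-treated period -> units (0 = never treated)


def effect(g, t):
    """Treatment effect grows in magnitude with event time e = t - g."""
    return -(1 + t - g) if g and t >= g else 0.0


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


# Rows are (unit, cohort, period, treated dummy, outcome).
rows, unit = [], 0
for g, n in COHORTS.items():
    for _ in range(n):
        unit += 1
        for t in PERIODS:
            d = 1 if g and t >= g else 0
            y = 0.1 * unit + 0.5 * t + effect(g, t)   # unit FE + time FE + effect
            rows.append((unit, g, t, d, y))


def within(rows, col):
    """Double-demean column `col` by unit and period (balanced panel only)."""
    by_unit, by_time = {}, {}
    for r in rows:
        by_unit.setdefault(r[0], []).append(r[col])
        by_time.setdefault(r[2], []).append(r[col])
    mu_i = {k: mean(v) for k, v in by_unit.items()}
    mu_t = {k: mean(v) for k, v in by_time.items()}
    grand = mean(r[col] for r in rows)
    return [r[col] - mu_i[r[0]] - mu_t[r[2]] + grand for r in rows]


# TWFE coefficient on D via the within transformation.
d_tilde, y_tilde = within(rows, 3), within(rows, 4)
twfe = sum(d * y for d, y in zip(d_tilde, y_tilde)) / sum(d * d for d in d_tilde)


def cell_mean(g, t):
    return mean(r[4] for r in rows if r[1] == g and r[2] == t)


def att_gt(g, t):
    """ATT(g, t) against never-treated units, with base period g - 1."""
    return (cell_mean(g, t) - cell_mean(g, g - 1)) - (cell_mean(0, t) - cell_mean(0, g - 1))


# Overall ATT: cohort-size-weighted average over post-treatment (g, t) cells.
cells = [(g, t) for g in COHORTS if g for t in PERIODS if t >= g]
weights = [COHORTS[g] for g, _ in cells]
cs_overall = sum(w * att_gt(g, t) for w, (g, t) in zip(weights, cells)) / sum(weights)

truth = mean(effect(r[1], r[2]) for r in rows if r[3] == 1)
print(f"true ATT   {truth:+.3f}")
print(f"CS-style   {cs_overall:+.3f}")
print(f"TWFE       {twfe:+.3f}")
```

In this configuration the true ATT is −3.0, the CS-style aggregate recovers it, and TWFE comes out at about −1.79: the early cohort's late periods receive negative weight because that cohort serves as a control for the later one.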

[Result cards: Callaway-Sant'Anna (α̂, overall ATT) and TWFE (α̂, single coefficient), each shown alongside the true ATT and the bias]

Distribution across 100 simulated panels

Each draw re-randomises noise (same DGP, same seed family). With staggered timing and heterogeneous dynamics, TWFE's histogram drifts to the right of the true value, while the CS-style estimator's histogram remains centred on it.

The post's headline results, interactively

Every estimator the post computes, plotted on one axis. The post's overall ATT is −0.057 (CS unconditional) or −0.065 (doubly robust with covariates). Toggle methods on/off; hover for SEs and confidence intervals.


Event study — ATT(e) by event time

Three estimators stacked on the same axis. TWFE (Sun-Abraham) and Callaway-Sant'Anna use no covariates; Doubly robust conditions on log population and log average pay.

HonestDiD sensitivity — when does the result break?

The dashed orange line marks the breakdown M̄ ≈ 0.67: post-treatment parallel-trends violations up to ~67% of the largest pre-treatment deviation still leave the on-impact effect statistically below zero. Beyond that, the CI crosses zero.
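
The idea of a breakdown value can be illustrated with a deliberately simplified back-of-envelope calculation. This is not the HonestDiD procedure: it treats the largest pre-treatment deviation as known and simply widens a normal confidence interval by the worst-case bias M × `max_pre_dev`. The function name `breakdown_m` and all inputs are hypothetical and are not the post's estimates.

```python
def breakdown_m(beta0, se, max_pre_dev, z=1.96):
    """Smallest M at which beta0 +/- (z * se + M * max_pre_dev) first contains zero.

    Simplification: max_pre_dev is treated as known, and the interval is a
    normal CI widened by the worst-case bias. Not the HonestDiD algorithm.
    """
    slack = abs(beta0) - z * se    # distance from the CI edge to zero when M = 0
    return max(slack, 0.0) / max_pre_dev


# Hypothetical on-impact estimate, standard error, and largest pre-trend deviation.
m_bar = breakdown_m(beta0=-0.05, se=0.01, max_pre_dev=0.02)
print(f"breakdown M = {m_bar:.2f}")
```

A result of 0 means the interval already contains zero under exact parallel trends; larger values mean the conclusion tolerates larger violations relative to what was seen before treatment.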

TWFE weight decomposition (post §5.3)

TWFE estimate: −0.0381
Proper overall ATT: −0.0571
Total TWFE bias: 0.0190
  … from pre-treatment contamination: 64.2%
  … from post-treatment weighting: 35.8%
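
The figures in this decomposition can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative, and the sign convention (bias = TWFE − proper ATT) is an assumption chosen because it reproduces the positive 0.0190 shown.

```python
twfe = -0.0381
overall_att = -0.0571
pre_share, post_share = 0.642, 0.358   # shares reported in the decomposition

bias = twfe - overall_att                  # positive: TWFE sits closer to zero
understatement = bias / abs(overall_att)   # bias relative to the proper ATT
pre_part = pre_share * bias                # portion from pre-treatment contamination
post_part = post_share * bias              # portion from post-treatment weighting

print(f"total bias:        {bias:.4f}")
print(f"understatement:    {understatement:.1%}")
print(f"pre-treatment:     {pre_part:.4f}")
print(f"post-treatment:    {post_part:.4f}")
```

The ratio 0.0190 / 0.0571 is where the "~33%" understatement in the takeaways comes from.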