DiD for Regional Data — Interactive Lab

Why does DiD give two different answers?

Did the Affordable Care Act's Medicaid expansion reduce adult mortality? The simplest 2×2 difference-in-differences (DiD) on 2,604 U.S. counties gives two opposite answers: an unweighted ATT of +0.12 deaths per 100,000 (no effect, or a tiny increase) and a population-weighted ATT of −2.56 (a meaningful reduction).

Same data. Same arithmetic. Same identifying assumption. The only difference is how you average across counties of very different size. Weighting silently changes the target parameter — not the precision, the parameter itself. The unweighted answer is the effect on the typical treated county; the weighted answer is the effect on the typical treated adult. They answer different questions.
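The point that weighting changes the target can be sketched in a few lines. The example below is a minimal illustration on hypothetical toy data (six made-up counties, not the Medicaid panel); the function name `did_2x2` and all numbers are assumptions for illustration. The same four-means arithmetic is run twice, once with equal weights and once with population weights.

```python
import numpy as np

def did_2x2(y_pre, y_post, treated, weights=None):
    """2x2 DiD from four (optionally weighted) cell means."""
    y_pre = np.asarray(y_pre, dtype=float)
    y_post = np.asarray(y_post, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    w = np.ones_like(y_pre) if weights is None else np.asarray(weights, dtype=float)

    def change(mask):
        # Weighted mean change (post minus pre) within one group.
        return np.average(y_post[mask] - y_pre[mask], weights=w[mask])

    return change(treated) - change(~treated)

# Hypothetical toy data: one large treated county, three small ones, two controls.
pop     = np.array([900.0, 50.0, 30.0, 20.0, 500.0, 500.0])
treated = np.array([True, True, True, True, False, False])
y_pre   = np.array([300.0, 400.0, 410.0, 420.0, 350.0, 360.0])
effect  = np.array([-4.0, 2.0, 2.0, 2.0, 0.0, 0.0])  # heterogeneous true effects
y_post  = y_pre + 5.0 + effect                       # common trend of +5 for everyone

att_unweighted = did_2x2(y_pre, y_post, treated)               # county-level average
att_weighted   = did_2x2(y_pre, y_post, treated, weights=pop)  # person-level average
print(att_unweighted, att_weighted)
```

In this toy setup parallel trends holds exactly, yet the unweighted estimate is positive (about +0.5) and the population-weighted one is negative (about −3.4), because the single large county carries 90% of the treated population.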

The parallel-trends assumption, animated

DiD assumes that, absent treatment, the treated group would have moved in parallel with the control group. The dashed orange line below shows that counterfactual trend; the solid orange line shows the treated group's actual trajectory after the policy kicks in. The vertical teal bar at the right edge is the ATT: the gap between the treated group's actual outcome and its parallel-trends counterfactual.
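Written out, the counterfactual and the ATT in the two-period case look roughly as follows. The notation is an assumption introduced here for illustration (D = 1 for the treated group, Y(0) for the untreated potential outcome); the source does not print a formula.

```latex
\begin{aligned}
\text{counterfactual:}\quad
  & \mathbb{E}[Y_{\text{post}}(0)\mid D=1]
    = \mathbb{E}[Y_{\text{pre}}\mid D=1]
    + \bigl(\mathbb{E}[Y_{\text{post}}\mid D=0]-\mathbb{E}[Y_{\text{pre}}\mid D=0]\bigr) \\
\text{ATT:}\quad
  & \mathbb{E}[Y_{\text{post}}\mid D=1]-\mathbb{E}[Y_{\text{post}}(0)\mid D=1]
\end{aligned}
```

The first line is the parallel-trends assumption itself: the treated group's untreated path is its own pre-period level plus the control group's change.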

Weighting Simulator

Drag the heterogeneity slider and watch the unweighted and population-weighted means diverge — even though every county satisfies parallel trends.

Forest Plot

The post's headline numbers — seven estimators (2×2 means, TWFE, OR, IPW, DRDID, 2×T dynamic, G×T dynamic), unweighted vs population-weighted.

Event Study

The full G×T dynamic ATT(e) trajectory from the post, e = −10 to +5, with shaded 95% confidence bands.

The three key takeaways

The 2×2 sign reversal is real and structural. Unweighted ATT(2014) = +0.12; weighted ATT(2014) = −2.56. The pre-period gap is nearly identical in both regimes (−54.77 vs −53.68), so the reversal is driven by which counties dominate the averages.
Weighting choice dominates methodology. Within either weighting, the five 2×2 estimators (cell means, TWFE, OR, IPW, DRDID) agree to within 1.7 deaths per 100,000. Across weightings, the gap is 2.5–10 deaths per 100,000. The estimator menu matters less than the weighting button.
Power is the binding constraint. None of the six 2×2 covariate-adjusted 95% CIs excludes zero. Only the unweighted G×T event study at e = 5 escapes — at +16.96 deaths per 100,000, in the opposite-of-expected direction.

Glossary (open a card if a term is unfamiliar)

DiD (Difference-in-Differences)

Treated group's change minus control group's change. Two groups, two periods, four means — no regression required.

ATT (Average Treatment effect on the Treated)

The effect of the policy on the units that actually received it. Not the effect on a random unit — the effect on the treated subpopulation.

Parallel-trends assumption

Absent treatment, the treated group's average outcome would have changed by the same amount as the control group's. The key identifying restriction for the 2×2 DiD.

Unweighted vs population-weighted

Equal weights → effect on the typical treated county. Population weights → effect on the typical treated adult. Different target parameters, not better/worse precision.

TWFE (Two-way fixed effects)

A regression with unit and time fixed effects that recovers the same DiD coefficient as cell means on a balanced 2×2 panel.
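The equivalence can be checked numerically. The sketch below builds a hypothetical balanced panel (five units, two periods, simulated outcomes) and compares the TWFE coefficient from an ordinary least-squares fit with the cell-means DiD. All data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical balanced panel: 5 units x 2 periods, first 3 units treated in period 1.
rng = np.random.default_rng(0)
n_units = 5
treated_unit = np.array([1, 1, 1, 0, 0])
unit = np.repeat(np.arange(n_units), 2)   # 0,0,1,1,...
post = np.tile([0, 1], n_units)           # 0,1,0,1,...
d = treated_unit[unit] * post             # treatment indicator D_it
y = rng.normal(size=2 * n_units) + 3.0 * unit + 1.5 * post - 2.0 * d

# TWFE design: one dummy per unit, a post-period dummy, and D_it.
unit_dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
X = np.column_stack([unit_dummies, post, d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
twfe_coef = beta[-1]                      # coefficient on D_it

# Cell-means DiD on the same data.
def cell_mean(treat, period):
    return y[(treated_unit[unit] == treat) & (post == period)].mean()

did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(twfe_coef, did_means)
```

The two numbers agree up to floating-point error. The equivalence relies on the panel being balanced and having exactly two periods; it is not claimed here for staggered designs.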

DRDID (doubly-robust DiD)

Combines an outcome regression (OR) and inverse-propensity weighting (IPW). Consistent if either model is correctly specified.

Callaway-Sant'Anna ATT(g,t)

A group-time estimator for staggered adoption. One ATT per cohort × calendar year cell; aggregate later.
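For reference, one common form of the group-time parameter is sketched below for post-treatment cells (t ≥ g). It is the unconditional version with a never-treated comparison group (C = 1) and base period g − 1; whether the post uses this comparison group, covariates, or a different base period is not stated in this section, so treat the expression as an assumption.

```latex
ATT(g,t) \;=\; \mathbb{E}\bigl[\,Y_t - Y_{g-1} \mid G = g\,\bigr]
          \;-\; \mathbb{E}\bigl[\,Y_t - Y_{g-1} \mid C = 1\,\bigr],
\qquad t \ge g
```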

HonestDiD M̄

Rambachan-Roth sensitivity parameter. Allows post-period parallel-trends violations up to M̄ times the worst pre-period violation.
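In symbols, the relative-magnitudes restriction is usually written along the following lines, where δ_t denotes the difference in trends between treated and comparison groups at event time t. The source does not print the formula, so the exact notation here is an assumption.

```latex
\Delta^{RM}(\bar{M}) \;=\;
\Bigl\{\, \delta \;:\; \forall\, t \ge 0,\;
   \lvert \delta_{t+1} - \delta_t \rvert
   \;\le\; \bar{M} \cdot \max_{s < 0} \lvert \delta_{s+1} - \delta_s \rvert
\,\Bigr\}
```

Setting M̄ = 0 imposes exact parallel trends in the post-period; larger values allow progressively larger departures.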

Weighting Simulator — why the sign can flip

Two cohorts: a small group of large urban counties and a large group of small rural counties. Both share parallel trends. The treatment effect is different in each cohort. Drag the effect heterogeneity slider and watch the unweighted mean (blue) and population-weighted mean (orange) diverge — exactly the mechanism that flips the sign on the real Medicaid data.

Large-urban treatment effect −6.0

Effect on big counties (high pop weight). Negative = mortality reduction.

Small-rural treatment effect +2.0

Effect on small counties (low pop weight). Set both equal to suppress the reversal.

Urban share of counties

Fraction of counties that are large-urban. Population shares are computed from population sizes.

Population ratio (urban / rural) 15

How much bigger a typical urban county is than a typical rural one.

Unweighted ATT: treated county average
Pop-weighted ATT: treated adult average
Weighting gap: |unw − wt|
Sign flip: do the two ATTs disagree on sign?

What to look for

Set both effects to the same value. The unweighted and weighted ATTs collapse to the same number. Heterogeneity is required for the reversal.
Increase the population ratio. A bigger urban / rural population gap amplifies the divergence — the population-weighted ATT migrates toward the urban effect, the unweighted ATT toward the rural one.
Reproduce the post's headline. Urban effect = −6, rural = +2, urban share = 0.3, ratio = 15 gives weighted ≈ −3, unweighted ≈ +0.4 — the same sign reversal the manuscript reports on the actual Medicaid data.
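The mechanism behind the sliders can be sketched as a two-group mixture. The function below is a hypothetical stand-in, not the simulator's actual code (which is not shown in the source), and the inputs are illustrative, so the simulator's own readouts may differ. It assumes the unweighted ATT averages the two effects by county share and the weighted ATT averages them by population share.

```python
def simulate_atts(effect_urban, effect_rural, urban_share, pop_ratio):
    """Return (unweighted, pop_weighted) ATT for a two-cohort mixture."""
    rural_share = 1.0 - urban_share
    # County-level average: every county counts once.
    unweighted = urban_share * effect_urban + rural_share * effect_rural
    # Person-level average: an urban county is pop_ratio times larger than a rural one.
    urban_pop = urban_share * pop_ratio
    rural_pop = rural_share * 1.0
    w_urban = urban_pop / (urban_pop + rural_pop)
    weighted = w_urban * effect_urban + (1.0 - w_urban) * effect_rural
    return unweighted, weighted

# Illustrative inputs: few large urban counties, many small rural ones.
unw, wt = simulate_atts(-6.0, 2.0, urban_share=0.1, pop_ratio=15.0)
gap = abs(unw - wt)
sign_flip = (unw > 0) != (wt > 0)
print(unw, wt, gap, sign_flip)   # roughly +1.2, -3.0, 4.2, True

# With identical effects the two averages coincide, whatever the weights.
same_unw, same_wt = simulate_atts(-2.0, -2.0, urban_share=0.1, pop_ratio=15.0)
```

With an urban share of 0.1 and a population ratio of 15, urban counties hold 62.5% of the population, which is enough to pull the weighted average below zero while the county average stays positive.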

Where does the post's asymmetry come from?

The real Medicaid panel has the same structure: the never-expansion cohort is 46.9% of counties but only 38.2% of adults; the 2014-expansion cohort is 37.6% of counties but 49.5% of adults. Switching to population weights takes 8.7 percentage points of mass away from the never-expansion cohort and adds 11.9 to the 2014-expansion cohort, which is the reweighting mechanism this simulator illustrates.

The post's estimator forest plot

Every number on this plot comes from the post's own results CSVs. Each method-by-weighting cell is one estimate with a 95% confidence interval. The dominant fact is colour: every blue dot (unweighted) sits to the right of its orange counterpart (population-weighted), and that gap of 2.5–10 deaths per 100,000 is wider than the spread across estimators within a weighting.

What to look for

Toggle off the dynamic methods. The 2×2-only view (cell-means, TWFE, OR, IPW, DRDID) shows the within-weighting estimator agreement: the orange dots span at most 1.7 deaths per 100,000, and so do the blue ones.
Toggle on the dynamic methods. The 2×T row is the post's most dramatic divergence: +9.43 unweighted vs −0.68 weighted. Pooling across the 2014, 2015, 2016, and 2019 cohorts (the G×T row) shrinks the gap to 7.7 but does not close it.
Hover any dot. See the exact estimate, SE, and CI. Notice that no weighted CI excludes zero by a comfortable margin and no unweighted CI excludes zero at all — except the G×T dynamic.

The unweighted-vs-weighted gap, by stage

The post's Section 11 headline table shows how the gap grows once the design moves from a single 2×2 contrast to dynamic, multi-period estimates:

2×2 cell-means: gap = 2.69 (only one cohort)
2×2 DRDID: gap = 2.53 (covariate-adjusted)
2×T dynamic: gap = 10.11 (one cohort, eleven periods — the widest gap)
G×T dynamic: gap = 7.65 (all four cohorts pooled)

The gap is largest in the dynamic designs (2×T and G×T), where effects can vary across event time and, in G×T, across cohorts, and smallest when the four-cell 2×2 design forces a single ATT(2014). That is the manuscript's core lesson made visible at a glance: methodology and target parameter are orthogonal axes of choice, and the second dominates the first.

G×T event study — ATT(e) across cohorts

The Callaway-Sant'Anna group-time framework produces an ATT for every cohort × calendar-year cell. Aggregated across cohorts at fixed event time e, you get a single ATT(e) per relative year. Below: e = −10 to +5, both weightings, shaded 95% confidence bands. The orange dashed line at e = −0.5 separates leads (placebo test for parallel trends) from lags (causal effect).

Show: Unweighted ATT(e), Population-weighted ATT(e)

What to look for

Pre-period leads at e = −10, −9, −8 are sharply negative under both weightings (around −15 to −26 deaths per 100,000, CIs excluding zero). These are driven entirely by the small 2019 cohort — the only cohort with a pre-history that long.
From e = −7 onward, leads settle near zero. Approximate parallel trends holds across the bulk of the comparison window, even though the assumption is technically violated at long pre-horizons.
Post-treatment divergence. Unweighted ATT(e) climbs from −0.45 at e = 0 to +16.96 at e = 5 (CI [+6.83, +27.09], excludes zero). Weighted ATT(e) oscillates within [−3.74, +4.49] — every weighted CI overlaps zero.
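The changing cohort composition behind these points can be made concrete. The sketch below uses the four cohorts named in the post and assumes a 2009–2019 calendar window; that window is an inference (it is consistent with eleven periods and with e running from −10 to +5) and is not stated in this section. The ATT(g,t) values and cohort weights in the usage example are made up.

```python
# Cohorts named in the text; the 2009-2019 window is an assumption.
COHORTS = [2014, 2015, 2016, 2019]
YEARS = range(2009, 2020)

def cohorts_at(e):
    """Cohorts whose calendar year g + e falls inside the panel."""
    return [g for g in COHORTS if (g + e) in YEARS]

def aggregate_att(e, att_gt, cohort_weight):
    """Weighted average of ATT(g, g+e) over the cohorts observed at event time e."""
    gs = cohorts_at(e)
    total = sum(cohort_weight[g] for g in gs)
    return sum(cohort_weight[g] * att_gt[(g, g + e)] for g in gs) / total

# Which cohorts contribute at each event time?
composition = {e: cohorts_at(e) for e in range(-10, 6)}
print(composition[-10], composition[-7], composition[5])

# Hypothetical ATT(g,t) values and cohort weights, for e = 0 only.
toy_att = {(2014, 2014): -1.0, (2015, 2015): 1.0, (2016, 2016): 3.0, (2019, 2019): 5.0}
toy_weight = {2014: 6.0, 2015: 2.0, 2016: 1.0, 2019: 1.0}
att_e0 = aggregate_att(0, toy_att, toy_weight)
```

Under the assumed window, only the 2019 cohort is observed at e = −10 to −8, and only the 2014 cohort at e = 5, so the endpoints of the event study each rest on a single cohort. Swapping `cohort_weight` from county counts to population totals is what moves the aggregate between the two target parameters.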

What does this mean for the policy question?

The dynamic-aggregated ATT averaged over e ≥ 0 is +7.92 unweighted versus +0.27 population-weighted. For the typical treated adult (the policy-relevant target parameter), there is no statistically credible mortality effect in either direction. For the typical treated county-as-a-unit, the unweighted G×T design gives a positive sign — opposite of what one might expect for an insurance-expansion policy.

The manuscript flags this case as pedagogical rather than as the best-possible estimate of Medicaid's mortality effect. HonestDiD's sensitivity analysis underscores why: at M̄ = 0 the unweighted bound is entirely positive [+2.01, +14.09], but by M̄ = 0.25 it crosses zero, so a post-period parallel-trends violation no more than a quarter the size of the worst pre-period violation is enough to overturn the sign conclusion. The weighted bound already straddles zero at M̄ = 0.
