Difference-in-Differences for Regional Data

Did Medicaid expansion cut mortality? It depends on who you weight by

+0.122 · unweighted 2x2 ATT
−2.563 · population-weighted · sign flip
2,604 counties × 11 years

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

Same data, opposite answers — only the weights changed

Twenty million people gained Medicaid under the ACA’s staggered roll-out. Did fewer of them die?

In the simplest four-cell DiD, unweighted says \(+0.12\) deaths per 100,000 (no help). Population-weighted says \(-2.56\) (lives saved). Which number is the answer?

The 2x2 DiD flips sign when you weight by population

Cell-means: 2014 expanders (orange) vs never-expanders (blue), unweighted (left) and population-weighted (right). Nearly parallel slopes on the left (DiD \(\approx 0\)), visibly divergent on the right (DiD \(\approx -2.6\)).

Where we’re going

  • The data: a 2,604-county, 11-year panel with staggered expansion timing
  • The 2x2 — and why Levels = TWFE = Long-difference are one estimator
  • Covariate adjustment: OR, IPW, doubly robust DRDID
  • Modern staggered estimators: 2xT event study and Callaway–Sant’Anna
  • HonestDiD: how fragile is the answer to pre-trend violations?

The Investigation

Act II

The estimand: an ATT whose weight is the question

\[\text{ATT}_\omega(2014) = \big(\mathbb{E}_\omega[Y_{2014}\mid D{=}1] - \mathbb{E}_\omega[Y_{2013}\mid D{=}1]\big) - \big(\mathbb{E}_\omega[Y_{2014}\mid D{=}0] - \mathbb{E}_\omega[Y_{2013}\mid D{=}0]\big)\]

The treated group’s change minus the control group’s change — both means taken under weighting scheme \(\omega\).

Equal weights \(\Rightarrow\) the typical treated county. Population weights \(\Rightarrow\) the typical treated adult. The subscript on \(\mathbb{E}_\omega\) carries the choice into the parameter itself.
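A minimal sketch of how the weight enters the estimand, in base R. The six-county toy data and the column names `d`, `pop`, `y2013`, `y2014` are invented for illustration and are not the study data.

```r
# Toy data: hypothetical values, NOT the study data.
toy <- data.frame(
  d     = c(1, 1, 1, 0, 0, 0),              # 1 = expanded in 2014
  pop   = c(900, 50, 50, 100, 100, 800),    # adults (thousands)
  y2013 = c(300, 420, 410, 380, 390, 310),  # deaths per 100,000
  y2014 = c(296, 428, 417, 384, 393, 313)
)

# ATT_omega(2014): treated change minus control change,
# with every mean taken under the weights w.
att_2x2 <- function(df, w) {
  change  <- df$y2014 - df$y2013
  treated <- df$d == 1
  weighted.mean(change[treated], w[treated]) -
    weighted.mean(change[!treated], w[!treated])
}

att_county <- att_2x2(toy, rep(1, nrow(toy)))  # equal weights: typical county
att_adult  <- att_2x2(toy, toy$pop)            # population weights: typical adult
print(c(per_county = att_county, per_adult = att_adult))
```

In this toy example the one large treated county improves while the two small ones worsen, so the per-county and per-adult answers take opposite signs from the same four cell means.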

Weighting shifts about 12 points of mass toward the 2014 cohort, mostly from never-expanders

treat_year | counties | share of counties | share of adults (2013)
0 (never) | 1,222 | 46.9% | 38.2%
2014 | 978 | 37.6% | 49.5%
2015 / 2016 / 2019 | 404 | 15.6% | 12.4%

Population weighting rebalances toward large, urban 2014 expanders and away from small, rural never-expanders: a different comparison, not a tighter one.
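The size of the shift can be read off the table by differencing the two share columns (adult share minus county share, in percentage points, using the rounded table values):

\[
\underbrace{49.5 - 37.6}_{\text{2014 cohort}} = +11.9, \qquad
\underbrace{38.2 - 46.9}_{\text{never-expanders}} = -8.7, \qquad
\underbrace{12.4 - 15.6}_{\text{2015 / 2016 / 2019}} = -3.2
\]

The three changes sum to zero, as they must: mass gained by the 2014 cohort is mass lost by the other groups.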

On a balanced 2x2, Levels = TWFE = Long-difference — they are one estimator

library(fixest)
# (a) levels: DiD is the interaction coefficient
feols(crude_rate_20_64 ~ D * Post, data = short_data, cluster = ~county_code)
# (b) two-way FE: same DiD, main effects absorbed by the fixed effects
feols(crude_rate_20_64 ~ D:Post | county_code + year, data = short_data, cluster = ~county_code)
# (c) long difference: collapse to one row per county
feols(diff ~ D, data = short_long_diff, cluster = ~county_code)

All three return \(+0.122\) unweighted and \(-2.563\) weighted — identical to three decimals.
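The equivalence can be checked on simulated data. The sketch below is an assumption-laden stand-in: it uses base R `lm()` with dummy variables instead of `feols()` so that it runs without extra packages, and the panel (`id`, `D`, `Post`, `y`, `pop`) is simulated, not the study data.

```r
set.seed(1)
n   <- 40                                  # hypothetical counties
pop <- runif(n, 1, 100)                    # time-invariant county weight

# Balanced two-period panel: rows 1..n are 2013, rows n+1..2n are 2014.
panel <- data.frame(
  id   = rep(seq_len(n), times = 2),
  year = rep(c(2013, 2014), each = n)
)
panel$D    <- as.numeric(panel$id <= n / 2)   # 1 = 2014 expander
panel$Post <- as.numeric(panel$year == 2014)
panel$y    <- 300 + 5 * panel$D + 2 * panel$Post -
  1.5 * panel$D * panel$Post + rnorm(nrow(panel), sd = 3)

# One row per county: the 2013-to-2014 change.
long <- data.frame(
  D    = panel$D[seq_len(n)],
  diff = panel$y[panel$year == 2014] - panel$y[panel$year == 2013]
)

did_three_ways <- function(w) {            # w: one weight per county
  panel$w <- w[panel$id]                   # local copies; globals untouched
  long$w  <- w
  m_levels <- lm(y ~ D * Post, data = panel, weights = w)
  m_twfe   <- lm(y ~ D:Post + factor(id) + factor(year), data = panel, weights = w)
  m_long   <- lm(diff ~ D, data = long, weights = w)
  c(levels = unname(coef(m_levels)["D:Post"]),
    twfe   = unname(coef(m_twfe)["D:Post"]),
    long   = unname(coef(m_long)["D"]))
}

res <- rbind(unweighted = did_three_ways(rep(1, n)),
             weighted   = did_three_ways(pop))
print(round(res, 6))
```

Within each row the three columns agree up to floating-point error; only moving between rows, that is, changing the weight, changes the estimate.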

Three specifications collapse onto two points: the form is interchangeable, the weight is not

Six DiD estimates (3 specs × 2 weights) with 95% CIs. The three rows within each weighting are superimposed; the weighting moves the point estimate by 2.7 deaths per 100,000.

Expansion counties were whiter, richer, and worse-overlapping — so adjust

Propensity-score densities by expansion status, unweighted (left) and population-weighted (right). Weighting piles treated mass near \(0.85\) and spreads control mass bimodally — overlap gets worse, not better.

Doubly robust DRDID is consistent if either model is right

\[\widehat{\text{ATT}}_{\text{DR}} = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{w}_{D=1}(D_i) - \hat{w}_{D=0}(D_i, X_i)\big)\big(\Delta Y_i - \hat{\mu}_{\Delta, D=0}(X_i)\big)\]

Each county contributes a propensity-weighted residual: how far its 2013-to-2014 change strayed from the outcome model’s prediction for an untreated unit with the same covariates.

Belt-and-suspenders: the outcome model or the propensity model can be wrong — not both.
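The formula above can be sketched directly in base R. This is a hand-rolled illustration of the displayed estimator on simulated data, assuming a logistic propensity model and a linear outcome model; it is not the DRDID package's implementation, and `x`, `d`, `dy`, and `tau` are hypothetical names.

```r
set.seed(42)
n   <- 5000
x   <- rnorm(n)                          # one hypothetical covariate
d   <- rbinom(n, 1, plogis(0.5 * x))     # expansion more likely when x is high
tau <- -2                                # true ATT in this simulation
dy  <- 1 + 2 * x + tau * d + rnorm(n)    # pre-to-post change in the outcome

dr_att <- function(dy, d, x) {
  # Propensity model: P(D = 1 | X)
  ps <- fitted(glm(d ~ x, family = binomial()))
  # Outcome model: E[dY | X, D = 0], fit on untreated units only
  fit0 <- lm(dy ~ x, subset = d == 0)
  mu0  <- predict(fit0, newdata = data.frame(x = x))
  # Normalized weights for treated and (reweighted) untreated units
  w1   <- d / mean(d)
  odds <- ps * (1 - d) / (1 - ps)
  w0   <- odds / mean(odds)
  mean((w1 - w0) * (dy - mu0))
}

att_dr    <- dr_att(dy, d, x)
att_naive <- mean(dy[d == 1]) - mean(dy[d == 0])  # ignores selection on x
print(c(truth = tau, naive = att_naive, doubly_robust = att_dr))
```

Because treated units have systematically higher `x`, the naive difference in changes is biased, while the doubly robust estimate lands close to the simulated truth.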

Covariate adjustment nudges the point but the weighting gap survives intact

method | unweighted | population-weighted
Outcome regression (OR) | \(-1.615\) | \(-3.459\)
IPW | \(-0.859\) | \(-3.842\)
Doubly robust (DRDID) | \(-1.226\) | \(-3.756\)

Within-weighting spread \(\le 0.8\); across-weighting gap \(1.8\) to \(3.0\) (\(\approx 2.5\) for DRDID). None of the six 95% CIs excludes zero.

The 2x2 throws away nine years — the event study uses them all

The dynamic design estimates an \(\text{ATT}(e)\) for every year relative to expansion, with \(e = -1\) omitted as the baseline.

Leads (\(e \le -2\))

  • Placebo for parallel trends
  • Should hover near zero

Lags (\(e \ge 0\))

  • Trace the effect over time
  • Where the policy answer lives
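A sketch of the leads-and-lags logic for one treated cohort against never-treated controls, in base R on simulated data. With a balanced panel and no covariates, each \(\text{ATT}(e)\) here is a difference in mean changes from the \(e=-1\) baseline; the group sizes and the effect path are assumptions made up for the example.

```r
set.seed(7)
n_per <- 200                       # counties per group (hypothetical)
years <- 2009:2019
g     <- 2014                      # expansion year of the treated cohort
d     <- rep(c(1, 0), each = n_per)

# Simulated outcomes: group gap + common year trend + noise, plus a
# treatment effect of 1.5 * (e + 1) for e >= 0 in the treated group.
y <- sapply(years, function(t) {
  e <- t - g
  50 + 10 * d + 0.8 * (t - 2009) + d * (e >= 0) * 1.5 * (e + 1) +
    rnorm(2 * n_per)
})
colnames(y) <- years

# ATT(e): change since the baseline year g - 1, treated minus control.
base  <- y[, as.character(g - 1)]
att_e <- sapply(years, function(t) {
  change <- y[, as.character(t)] - base
  mean(change[d == 1]) - mean(change[d == 0])
})
names(att_e) <- years - g          # relative time e
print(round(att_e, 2))
```

The leads hover near zero because the simulation imposes parallel trends, the baseline is exactly zero by construction, and the lags trace the simulated effect path.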

After expansion, the unweighted path climbs to +16.96 while the weighted stays flat

Dynamic \(\text{ATT}(e)\) for the 2014 cohort with shaded 95% CIs; dotted line at \(e=-0.5\) separates leads from lags. The two weightings track together pre-2014, then diverge sharply.

Callaway–Sant’Anna uses all the timing — one ATT(g,t) per cohort-year cell

\[\text{ATT}(g,t) = \mathbb{E}_\omega\big[Y_{i,t}(g) - Y_{i,t}(\infty)\mid G_i = g\big]\]

The effect of first expanding in year \(g\), relative to never expanding, evaluated at calendar time \(t\), restricted to units whose actual start year is \(g\).

Never let an already-treated unit serve as a “control” for a newly treated one: that is the forbidden comparison naive TWFE quietly makes.
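A minimal sketch of one \(\text{ATT}(g,t)\) cell without covariates, using only never-expanders as controls. The helper `att_gt_sketch()` is a hypothetical name written for this example (it is not the `did` package's function), and the three-cohort panel is simulated.

```r
# ATT(g, t): change from year g - 1 to year t for cohort g,
# minus the same change for never-expanders, under weights w.
att_gt_sketch <- function(y, G, w, g, t) {
  # y: counties x years matrix with years as column names
  # G: first expansion year per county, 0 = never expanded
  change  <- y[, as.character(t)] - y[, as.character(g - 1)]
  treated <- G == g
  never   <- G == 0                # already-treated units are never controls
  weighted.mean(change[treated], w[treated]) -
    weighted.mean(change[never], w[never])
}

set.seed(11)
years <- 2009:2019
G     <- rep(c(0, 2014, 2016), times = c(300, 200, 100))
pop   <- runif(length(G), 1, 100)  # hypothetical adult populations
eff   <- ifelse(G == 2014, 2, ifelse(G == 2016, -3, 0))  # assumed effects
y <- sapply(years, function(t) {
  40 + 0.5 * (t - 2009) + (G > 0 & t >= G) * eff + rnorm(length(G))
})
colnames(y) <- years

ones <- rep(1, length(G))
cells <- c(
  g2014_t2016  = att_gt_sketch(y, G, ones, 2014, 2016),
  g2016_t2016  = att_gt_sketch(y, G, ones, 2016, 2016),
  g2016_t2014  = att_gt_sketch(y, G, ones, 2016, 2014),  # pre-period placebo
  g2016_t2016w = att_gt_sketch(y, G, pop, 2016, 2016)    # population-weighted
)
print(round(cells, 2))
```

Each cohort is compared only with counties that never expand, so the 2014 cohort is never used as a control for the 2016 cohort.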

Four cohorts, four stories — the 2014 cohort flips sign with weighting

By-cohort \(\text{ATT}(g)\) bars with 95% CIs. The 2014 cohort flips (\(+9.43 \to -0.68\)); the 2016 cohort is the only weighted CI excluding zero (negative), on just 93 counties.

Pool the cohorts and the weighted dynamic path still hugs zero

GxT event study pooled across all four cohorts. Early leads (\(e=-10,-9\)) sharply negative (the lone 2019 cohort); from \(e=-7\) on, leads settle near zero; post-treatment the unweighted path reaches \(+16.96\) while the weighted stays within \(\pm 5\).

The Resolution

Act III

The headline: weighting moves the 2x2 answer from +0.12 to −2.56 deaths per 100,000

−2.56

the 2x2 ATT swings from \(+0.12\) (per county) to \(-2.56\) (per adult) — a sign flip with an identical pre-period gap

Across five stages, the weighting gap dwarfs the method gap

stage | unweighted | population-weighted | gap
2x2 cell-means / TWFE | \(+0.122\) | \(-2.563\) | \(2.69\)
2x2 DRDID | \(-1.226\) | \(-3.756\) | \(2.53\)
2xT dynamic (avg \(e\ge0\)) | \(+9.428\) | \(-0.684\) | \(10.11\)
GxT dynamic (avg \(e\ge0\)) | \(+7.917\) | \(+0.266\) | \(7.65\)

The gap is widest in the dynamic designs that average over post-expansion years (\(7.65\) to \(10.11\)), narrowest where the 2x2 forces a single comparison (\(2.53\) to \(2.69\)).
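The gap column is simply the unweighted estimate minus the population-weighted one. A short check in base R, with the point estimates typed in from the table above:

```r
stages <- data.frame(
  stage      = c("2x2 cell-means / TWFE", "2x2 DRDID",
                 "2xT dynamic", "GxT dynamic"),
  unweighted = c(0.122, -1.226, 9.428, 7.917),
  weighted   = c(-2.563, -3.756, -0.684, 0.266)
)
# Gap = unweighted minus population-weighted point estimate.
stages$gap <- stages$unweighted - stages$weighted
print(stages)
```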

HonestDiD: every conclusion collapses by \(\bar{M} = 0.25\)

HonestDiD bounds across \(\bar{M}\), faceted by weighting. At \(\bar{M}=0\) the unweighted bound is all-positive \([+2.01, +14.09]\); the weighted already straddles zero \([-6.07, +6.07]\). By \(\bar{M}=0.25\) the unweighted bound also includes zero, and both saturate at the grid limit (\(\pm 66\)) by \(\bar{M}=1\).
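The intuition behind \(\bar{M}\) can be sketched with a deliberately simplified calculation. The relative-magnitudes idea caps the post-period violation of parallel trends at \(\bar{M}\) times the largest consecutive change seen in the pre-period coefficients. The code below computes only that identified set for the first post-period and ignores sampling uncertainty, so it is not the HonestDiD package's procedure and will not reproduce the bounds above; all coefficients are hypothetical.

```r
# Hypothetical event-study coefficients (NOT the study's estimates).
beta_pre   <- c("-4" = 0.6, "-3" = -0.4, "-2" = 0.9, "-1" = 0)  # e = -1 is the baseline
beta_post1 <- 3.0                                               # estimate at e = 0

# Largest consecutive change among the pre-period coefficients.
max_pre_jump <- max(abs(diff(beta_pre)))

# Identified set for the e = 0 effect when the post-period violation
# may be at most m_bar times the largest pre-period jump.
rm_bounds <- function(m_bar) {
  slack <- m_bar * max_pre_jump
  c(lower = beta_post1 - slack, upper = beta_post1 + slack)
}

m_grid <- c(0, 0.25, 0.5, 1)
bounds <- t(sapply(m_grid, rm_bounds))
rownames(bounds) <- m_grid
print(bounds)
```

Here the set collapses to a point at \(\bar{M}=0\) and widens linearly in \(\bar{M}\); intervals that also account for estimation error already have positive width at \(\bar{M}=0\).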

Does the staggered machinery make this causal? No — assumptions still carry the weight

Objection. Callaway–Sant’Anna and DRDID are modern, robust estimators — surely they pin down the effect.

Response. They fix the aggregation (no forbidden comparisons) and discipline selection on covariates — but identification still rests on parallel trends, and HonestDiD shows it is fragile here. The ATT is identified only under conditional parallel trends; the estimators cannot manufacture it.

Underpowered data, two estimands — not better and worse answers to one question

Per treated county

  • unweighted GxT \(= +7.92\)
  • 2xT reaches \(+16.96\) by \(e=5\)
  • a heterogeneity object

Per treated adult

  • weighted GxT \(= +0.27\)
  • DRDID \(= -3.76 \pm 3.29\)
  • the policy-relevant target

No CI excludes zero by a comfortable margin. When the policy is denominated in people, weight by people.

When units differ in size, choosing the weight is choosing the causal question.