Difference-in-Differences for Policy Evaluation

Does the minimum wage cut teen jobs? A modern DiD walkthrough in R

−0.057 true overall ATT · TWFE missed by ⅓

−0.065 doubly robust · stable everywhere

0.67 HonestDiD breakdown · robust

Carlos Mendez

Nagoya University (GSID)

July 8, 2026

The Tension

Act I

Same panel, same question — but the answer depends on the estimator you trust

Did raising the minimum wage cost teens their jobs? Run the textbook two-way fixed-effects (TWFE) regression and you get $-0.038$.

Run the modern estimator on the same counties and you get $-0.057$ — half again as large. Which one is right?

A frozen federal floor of $5.15 turned the states into a natural experiment

From 2001 to 2007 the federal minimum wage sat frozen at $5.15/hour, while states raised their own floors at different times.

Counties fall into cohorts indexed by their first-increase year $G \in \{0, 2004, 2006\}$: 102 treated in 2004, 226 in 2006, and 1,417 never-treated.

Staggered adoption: there is no single “before” and “after” for the country — each state has its own clock.

TWFE understates the true effect by a third — that gap is the whole talk

TWFE event study (Sun-Abraham interaction-weighted estimator). A pre-trend wobble at event time $-3$; effects deepen after treatment.

Where we’re going

The lab: a staggered minimum-wage panel with a never-treated control
Why TWFE breaks — forbidden comparisons and negative weights
Callaway–Sant’Anna group-time $\mathrm{ATT}(g,t)$, then aggregate
Doubly robust covariates, then stress-test parallel trends with HonestDiD

The Investigation

Act II

Identification rests on one assumption: parallel trends

\[\mathrm{ATT} = E[\Delta Y_{t^{\ast}} \mid D=1] - E[\Delta Y_{t^{\ast}} \mid D=0]\]

The treated group’s change in outcomes, minus the comparison group’s change. Valid only if both groups would have moved in parallel absent treatment.

The estimand is the ATT — the effect on the counties that actually raised their wage.
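A minimal sketch of the arithmetic behind this formula, using made-up group means (hypothetical numbers, not estimates from the minimum-wage panel):

```r
# Hypothetical mean log teen employment before and after the treatment date t*.
means <- data.frame(
  group = c("treated", "comparison"),
  pre   = c(5.50, 5.20),
  post  = c(5.46, 5.22)
)

# Within-group change in the outcome (the Delta Y terms in the formula).
change <- means$post - means$pre

# Treated change minus comparison change: the 2x2 DiD estimate of the ATT.
att_2x2 <- change[1] - change[2]
att_2x2
```

Here the treated group falls by 0.04 while the comparison group rises by 0.02, so the estimate is about −0.06. It is an ATT only if the comparison group's +0.02 is what the treated group would have experienced without the policy.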

TWFE silently uses already-treated counties as controls — a forbidden comparison

With staggered timing, TWFE mixes two kinds of comparison:

Good: treated vs not-yet-treated or never-treated
Bad: later-treated vs already-treated — the control is itself under treatment

Like grading a student’s improvement against classmates who already took the test — the “control” already saw the answer key.
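A minimal simulation sketch of the problem, on a made-up three-county panel (one county per cohort, no noise, effects that deepen by 0.02 log points per year of exposure). Every number here is an illustrative assumption, not the minimum-wage data.

```r
# Toy staggered panel: cohorts first treated in 2004, 2006, and never (coded 0).
cohort <- c(2004, 2006, 0)
panel  <- expand.grid(year = 2001:2007, id = seq_along(cohort))
panel$G <- cohort[panel$id]

# Treatment indicator: on from the cohort's first-increase year onward.
panel$D <- as.integer(panel$G > 0 & panel$year >= panel$G)

# Assumed true effect: -0.02 on impact, deepening by -0.02 per extra year.
panel$tau <- ifelse(panel$D == 1, -0.02 * (panel$year - panel$G + 1), 0)

# Outcome = county effect + common year trend + treatment effect (no noise).
panel$y <- 0.1 * panel$id + 0.01 * (panel$year - 2001) + panel$tau

# Textbook TWFE regression with county and year fixed effects.
twfe <- unname(coef(lm(y ~ D + factor(id) + factor(year), data = panel))["D"])

# Benchmark: the simple average effect over all treated county-years.
true_att <- mean(panel$tau[panel$D == 1])

c(twfe = twfe, true_att = true_att)
```

In this toy design the TWFE coefficient comes out at roughly −0.033 against an average treated effect of roughly −0.043. Parallel trends holds exactly by construction, so the gap comes entirely from how the regression weights the treated cells: the long-exposure cells of the early cohort get very little weight.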

The fix: estimate one clean effect per cohort × period, then aggregate

\[\mathrm{ATT}(g,t) = E[\,Y_t(g) - Y_t(0) \mid G = g\,]\]

The effect for cohort $g$, measured in calendar period $t$ — each cell compared only to clean (never- or not-yet-treated) controls.

Callaway–Sant’Anna (2021): separate identification from estimation. Build the grid of $\mathrm{ATT}(g,t)$ first, average it later.
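To make the definition concrete, here is a minimal sketch that computes one unconditional $\mathrm{ATT}(g,t)$ cell by hand, using never-treated controls and period $g-1$ as the base. The panel is a toy with known effects (an assumption for illustration), and the helper `att_gt_by_hand` is hypothetical, not part of the `did` package.

```r
# Toy balanced panel: two counties per cohort; G = 0 means never treated.
cohort <- c(2004, 2004, 2006, 2006, 0, 0)
panel  <- expand.grid(year = 2001:2007, id = seq_along(cohort))
panel$G <- cohort[panel$id]

# Assumed true effect: -0.02 per year of exposure, starting on impact.
treated <- panel$G > 0 & panel$year >= panel$G
tau <- ifelse(treated, -0.02 * (panel$year - panel$G + 1), 0)
panel$y <- 0.1 * panel$id + 0.01 * (panel$year - 2001) + tau

# ATT(g, t): change from g - 1 to t for cohort g, minus the same change
# for the never-treated cohort.
att_gt_by_hand <- function(data, g, t) {
  base <- g - 1
  delta <- function(rows) {
    mean(data$y[rows & data$year == t]) - mean(data$y[rows & data$year == base])
  }
  delta(data$G == g) - delta(data$G == 0)
}

att_gt_by_hand(panel, g = 2004, t = 2006)   # third year of exposure for G = 2004
att_gt_by_hand(panel, g = 2006, t = 2007)   # second year of exposure for G = 2006
```

Because the toy data contain no noise, each cell recovers the effect that was built in (−0.06 and −0.04 here). No already-treated county ever enters a comparison.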

A few lines of `did` replace the broken regression

```r
attgt <- did::att_gt(yname = "lemp",                 # outcome: log teen employment
                     idname = "id", gname = "G",     # unit id, first-treatment cohort
                     tname = "year",                 # calendar period
                     control_group = "nevertreated", # compare only to never-treated counties
                     base_period = "universal",      # same reference period (g - 1) for every cell
                     data = data2)
attO  <- did::aggte(attgt, type = "group")           # overall ATT
attes <- did::aggte(attgt, type = "dynamic")         # event study
```

Each cohort’s effect deepens with exposure — dynamics TWFE flattens away

Group-time ATTs by cohort (Callaway–Sant’Anna). The G=2004 line falls further every year of exposure.

Aggregated cleanly, the true overall ATT is −0.057 — not TWFE’s −0.038

Estimator	Overall ATT	SE	Sig. 5%?
TWFE	−0.038	0.008	yes
Callaway–Sant’Anna	−0.057	0.008	yes

TWFE understated the negative effect by about a third — attenuated toward zero.

The trajectory is clear: small on impact, large after three years

Event-study aggregation. Pre-trends hover near zero (with a wobble at $e=-3$); post-treatment effects grow monotonically in magnitude.
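Both the overall ATT and the event study are weighted averages of the same $\mathrm{ATT}(g,t)$ grid. A minimal sketch of the two aggregations follows, using the cohort sizes stated earlier (102 and 226 counties) and a hypothetical grid of cell estimates. The `att` values are made up for illustration and standard errors are ignored.

```r
# Hypothetical post-treatment ATT(g, t) cells (t >= g); values are invented.
grid <- data.frame(
  g   = c(2004, 2004, 2004, 2004, 2006, 2006),
  t   = c(2004, 2005, 2006, 2007, 2006, 2007),
  att = c(-0.02, -0.05, -0.08, -0.12, -0.03, -0.06)
)
n_g <- c("2004" = 102, "2006" = 226)   # treated counties per cohort

# Overall ATT: average each cohort over its post periods,
# then weight cohorts by their share of treated counties.
theta_g <- sapply(split(grid$att, grid$g), mean)
overall <- sum(theta_g * n_g[names(theta_g)] / sum(n_g))

# Event study: for each exposure length e = t - g, average across the cohorts
# observed at that exposure, weighting by cohort size.
grid$e <- grid$t - grid$g
grid$n <- n_g[as.character(grid$g)]
dynamic <- sapply(split(grid, grid$e),
                  function(cell) weighted.mean(cell$att, cell$n))

overall
dynamic
```

Longer exposures are identified only from the early cohort, so the composition of the event-study average changes with $e$. That is worth keeping in mind when reading the right-hand side of the plot.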

Where does TWFE’s bias come from? 64% pre-trend contamination, 36% bad weights

TWFE weight scatter. Orange pre-treatment cells get nonzero TWFE weight (they should get zero); blue post-treatment weights differ from the proper teal $\mathrm{ATT}^O$ weights.

Conditioning on covariates barely moves the answer: −0.065, three ways

Method	Overall ATT	SE
Unconditional	−0.057	0.008
Regression adj.	−0.064	0.008
IPW	−0.065	0.008
Doubly robust	−0.065	0.008

Three different adjustment strategies agree to within 0.001, so the estimate is not an artifact of one modeling choice.
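A minimal sketch of the doubly robust idea in a two-period setting, on simulated data with a built-in effect of −0.065 (chosen only to echo the slide; nothing here uses the county panel). It combines an outcome regression fitted on controls with propensity-score odds weights, in the spirit of the doubly robust DiD estimators behind `did`. It is not that package's implementation, and all variable names are illustrative.

```r
set.seed(1)
n <- 2000
x  <- rnorm(n)                                   # one covariate
d  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))       # treatment depends on x
dy <- 0.05 * x - 0.065 * d + rnorm(n, sd = 0.1)  # outcome change; trend depends on x

# Piece 1: propensity score, P(treated | x).
ps <- fitted(glm(d ~ x, family = binomial))

# Piece 2: outcome regression for the change in y, fitted on controls only.
or_fit <- lm(dy ~ x, subset = d == 0)
m0 <- predict(or_fit, newdata = data.frame(x = x))

# Doubly robust combination: compare regression-adjusted changes, reweighting
# controls by the odds of treatment so they resemble the treated group.
resid0  <- dy - m0
w_treat <- d / mean(d)
w_ctrl  <- ps * (1 - d) / (1 - ps)
w_ctrl  <- w_ctrl / mean(w_ctrl)
att_dr  <- mean(w_treat * resid0) - mean(w_ctrl * resid0)

# Naive unconditional comparison, for contrast.
att_raw <- mean(dy[d == 1]) - mean(dy[d == 0])
c(doubly_robust = att_dr, unconditional = att_raw)
```

The estimator stays consistent if either the propensity model or the outcome model is correctly specified, which is why agreement among regression adjustment, IPW, and the doubly robust estimate is reassuring.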

Covariates also clean up the pre-trend — the early wobble shrinks and loses significance

Doubly robust event study (controlling for log population and log average pay). The $e=-3$ pre-trend shrinks from $-0.034$ to $-0.022$ and is no longer significant.

−0.065 survives most researcher choices; only allowing for anticipation moves it

−0.065

doubly robust ATT · unchanged under a varying base period (−0.065) and not-yet-treated controls (−0.065); allowing for anticipation pulls it to −0.040
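A minimal sketch of how such checks could be organized, changing one choice at a time. The arguments `control_group`, `base_period`, `anticipation`, `xformla`, and `est_method` are real `did::att_gt` arguments. The covariate column names `lpop` and `lavg_pay`, the one-period anticipation window, and the helper `overall_att` are assumptions for illustration, not taken from the talk.

```r
# Baseline: the att_gt() call from the talk plus doubly robust covariates.
# `lpop` and `lavg_pay` are hypothetical column names.
baseline <- list(yname = "lemp", idname = "id", gname = "G", tname = "year",
                 xformla = ~ lpop + lavg_pay, est_method = "dr",
                 control_group = "nevertreated", base_period = "universal",
                 anticipation = 0)

# Each check changes exactly one researcher choice.
specs <- list(
  baseline       = baseline,
  varying_base   = modifyList(baseline, list(base_period = "varying")),
  not_yet        = modifyList(baseline, list(control_group = "notyettreated")),
  anticipation_1 = modifyList(baseline, list(anticipation = 1))
)

# Fit one specification and return its overall ATT (needs the did package).
overall_att <- function(spec, data) {
  fit <- do.call(did::att_gt, c(spec, list(data = data)))
  did::aggte(fit, type = "group")$overall.att
}

# Usage, assuming `data2` is the county panel:
# sapply(specs, overall_att, data = data2)
```

Keeping every specification as a one-line change from the baseline makes it clear which choice is responsible when a number moves.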

The Resolution

Act III

How fragile is this? The on-impact effect breaks only at $\bar M \approx 0.67$

HonestDiD sensitivity: the on-impact CI widens as allowed parallel-trends violations grow. It first touches zero near $\bar M = 0.67$.
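Under the relative-magnitudes restriction, $\bar M$ caps the post-treatment violation of parallel trends at $\bar M$ times the largest violation seen in the pre-period. The breakdown value is the smallest $\bar M$ whose robust confidence interval includes zero. A minimal sketch of reading it off a sensitivity table follows. The table is made up for illustration; in practice one with columns `lb`, `ub`, and `Mbar` would come from `HonestDiD::createSensitivityResults_relativeMagnitudes()`. The helper `breakdown_mbar` is hypothetical.

```r
# Hypothetical sensitivity results: one robust CI per value of Mbar.
# These numbers are invented and do not reproduce the talk's figure.
sens <- data.frame(
  Mbar = c(0.00, 0.25, 0.50, 0.75, 1.00),
  lb   = c(-0.045, -0.050, -0.056, -0.063, -0.071),
  ub   = c(-0.010, -0.006, -0.002,  0.004,  0.011)
)

# Smallest Mbar on the grid whose interval contains zero;
# NA if no interval on the grid does.
breakdown_mbar <- function(sens) {
  covers_zero <- sens$lb <= 0 & sens$ub >= 0
  if (!any(covers_zero)) return(NA_real_)
  min(sens$Mbar[covers_zero])
}

breakdown_mbar(sens)
```

The answer is only as fine as the grid of $\bar M$ values. A denser grid around the first interval that covers zero pins the breakdown point down more precisely.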

A dose-response emerges: each extra $1 cuts teen jobs ~5% at one year, ~9% at three

ATT per dollar of minimum-wage increase. The per-dollar effect deepens with cumulative exposure.
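One simple way to build a per-dollar effect is to divide each estimate by the size of the wage increase it corresponds to. The sketch below assumes that construction, which the talk does not spell out, and uses invented state minimum wages and effect sizes.

```r
federal_floor <- 5.15   # frozen federal minimum wage, dollars per hour

# Hypothetical cells: an ATT in log points and the state minimum wage behind it.
cells <- data.frame(
  att            = c(-0.03, -0.06),
  state_min_wage = c(5.85, 6.15)
)

# Dose: dollars above the federal floor.
cells$dose <- cells$state_min_wage - federal_floor

# Effect per dollar, in log points and as an approximate percent change.
cells$att_per_dollar <- cells$att / cells$dose
cells$pct_per_dollar <- 100 * (exp(cells$att_per_dollar) - 1)

cells
```

Dividing by the dose treats the effect as linear in dollars, which is an extra assumption on top of parallel trends.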

Do modern DiD estimators make this causal? Not by themselves

Objection. Fancy estimators and covariate adjustment can’t manufacture identification — maybe minimum-wage states were just on a different employment path.

Response. Correct, and we never claim otherwise. The ATT is identified only under (conditional) parallel trends. HonestDiD makes the dependence on that assumption explicit and quantifies exactly how much violation the conclusion can absorb.

What to take home: clean comparisons change the number and the story

TWFE understated the ATT by ~⅓ (−0.038 vs −0.057); 64% of the bias is pre-trend contamination
The doubly robust ATT of −0.065 is stable across methods, controls, and base periods
Effects accumulate: −0.027 on impact to −0.147 after three years; ~5% per $1 at one year, ~9% at three
Robust to parallel-trends violations up to $\bar M \approx 0.67$ — suggestive, not bulletproof

Let the design, not the regression default, choose your comparisons.