event-study | Carlos Mendez

Do Industrial Parks Work? Evaluating Place-Based Policy in Ethiopia with Difference-in-Differences

Fri, 12 Jun 2026 00:00:00 +0000

Abstract

Governments across the developing world spend billions on industrial parks (fenced zones with serviced land, power and customs to lure factories), yet whether these place-based subsidies actually lift the surrounding economy, and who inside it benefits, remains hotly contested. This tutorial asks whether Ethiopia’s industrial parks raised local economic activity, urbanization, household living standards, and women’s economic agency, and how each effect can be measured credibly when parks are not placed at random. It replicates Huang, Wang & Xu (2026) on synthetic calibrated data combining a satellite district-year panel of 139 woredas observed annually over 2005–2020 (2,224 rows; 17 park-hosting woredas treated on a staggered 2008–2020 rollout against 122 propensity-score-matched never-treated controls) with two Ethiopia DHS repeated cross-sections: 13,200 households and 17,900 individuals across five survey rounds. It estimates a static two-way fixed-effects difference-in-differences and an event study with pyfixest, cross-checks them against the modern Sun-Abraham, Borusyak/Gardner and Callaway-Sant’Anna staggered estimators plus a Goodman-Bacon decomposition with diff-diff, and runs survey-weighted repeated-cross-section DiD with Conley spatial standard errors. A park raises inverse-hyperbolic-sine nighttime light by +0.215 (p < 0.01), the four staggered estimators agree within 0.046 units with 95.4% clean Bacon weight, and households gain durables (+0.229), housing (+0.248) and wealth (+0.383); crucially, average non-agricultural employment is insignificant (+0.091) yet the female effect is large (+0.140, p < 0.01). These findings imply that well-sited parks can reshape a local economy and women’s lives, but only a sex-disaggregated analysis reveals it.

1. Overview

Industrial parks are one of the most popular instruments of modern development policy. The recipe is simple: the government clears a tract of land, installs power, water, roads and a one-stop customs office, and rents serviced plots to manufacturers — usually in textiles, garments or leather. Ethiopia bet heavily on this model, opening more than twenty parks across eighteen districts between 2008 and 2021. The hope was that factories would cluster, create jobs, and pull a largely rural region into a wage economy. But place-based subsidies are controversial precisely because they might do little more than relocate activity that would have happened anyway — or light up a fenced enclave while the surrounding districts see nothing.

So the question this post tackles is genuinely two-sided: do industrial parks raise local economic activity, and — just as important — for whom? A park could boost satellite-measured luminosity yet leave household living standards flat. It could create jobs on average, yet only for men. Measuring this credibly is hard, because the government did not flip a coin to decide where parks go — it chose districts near cities and roads, which were already growing faster. We need a research design that nets out those pre-existing differences, and that handles a staggered rollout where parks opened in different years. That design is difference-in-differences (DiD), and the modern staggered-robust toolkit built around it.

Why not just one DiD regression? Because the workhorse two-way fixed-effects (TWFE) estimator can mislead under staggered timing: it secretly uses already-treated districts as controls for later-treated ones, a “forbidden comparison” that can flip the sign of the estimate when effects grow over time. A central goal of this tutorial is to show that worry being checked rather than ignored — we run four estimators side by side and decompose exactly where the TWFE number comes from. The estimand throughout is the average treatment effect on the treated (ATT) — the effect on the districts (and people) that actually got a park — identified under a parallel-trends assumption, in an explicitly observational setting where the parks were not randomly placed.
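A toy calculation makes the forbidden comparison concrete. The sketch below is purely illustrative: the two districts, the `effect` path and every number in it are hypothetical assumptions, not values from the paper or the dataset. It shows that when an already-treated district with a still-growing effect is used as the "control", the resulting 2×2 can have the wrong sign.

```python
# Hypothetical toy: two districts, one treated early (year 1), one late (year 3).
# Assumed effect path: +1 in the opening year, growing by 2 each year after.
def effect(k):
    """Treatment effect k years after opening (0 before opening)."""
    return 1 + 2 * k if k >= 0 else 0

def outcome(open_year, t):
    """Noise-free outcome: a common baseline of 10 plus the treatment effect."""
    return 10 + effect(t - open_year)

EARLY, LATE = 1, 3   # opening years (hypothetical)
pre, post = 2, 3     # one-year window around the late district's opening

# Forbidden 2x2: the late district versus the ALREADY-treated early district.
late_change = outcome(LATE, post) - outcome(LATE, pre)     # effect switches on
early_change = outcome(EARLY, post) - outcome(EARLY, pre)  # effect keeps growing
forbidden_did = late_change - early_change

true_effect = effect(0)  # the late district's actual effect in its opening year
print(f"true effect at opening : {true_effect:+d}")
print(f"forbidden 2x2 estimate : {forbidden_did:+d}")
```

The early district's outcome rises faster than the late district's over the window, only because its own effect is still building, so subtracting it turns a positive effect into a negative estimate. A never-treated comparison would not have this problem.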

A note on the data (please read this). This tutorial replicates Huang, Wang & Xu (2026), but it runs on synthetic data built for teaching. The paper’s real inputs (harmonized nighttime lights, the GISD30 impervious-surface product, confidential Ethiopia DHS micro-data, the official park list) are licensed or restricted. Our dataset is calibrated so that re-running the paper’s analyses reproduces its findings — the signs, the significance stars, and the approximate magnitudes of the key coefficients. Most results track the paper closely; a handful of magnitudes differ, and we tabulate exactly which in Section 13. Use this to learn the methods, not to draw new conclusions about Ethiopia.

1.1 Learning objectives

By the end of this tutorial, you will be able to:

Frame a staggered place-based policy as a quasi-experiment, and explain why a treated-vs-never-treated comparison identifies the ATT under parallel trends.
Estimate a static two-way fixed-effects difference-in-differences and a dynamic event study on satellite outcomes with pyfixest.
Compare TWFE against the modern Sun-Abraham, Borusyak/Gardner and Callaway-Sant’Anna estimators, and diagnose the staggered negative-weights problem with a Goodman-Bacon decomposition using diff-diff.
Apply survey-weighted repeated-cross-section DiD to DHS household welfare and individual employment, and read a heterogeneity split that turns a null average into a sharp finding.
Defend your inference when treatment is spatially clustered, using Conley spatial-HAC standard errors and restricted-control-pool checks.

1.2 Study design

The diagram below maps the whole tutorial. Three data streams flow into one DiD design, that design is estimated by an escalating ladder of estimators, and the estimates answer three outcome families. Read it left to right: the staggered park rollout splits woredas (Ethiopia’s local districts) into treated and never-treated; we observe satellite, household, and individual outcomes; we climb from a naive 2×2 to the modern staggered-robust estimators; and we report effects on activity, welfare, and women’s empowerment.

graph LR
subgraph DATA["Three data streams"]
A["<b>Satellite</b><br/>district x year<br/>panel"]
B["<b>DHS household</b><br/>repeated<br/>cross-section"]
C["<b>DHS individual</b><br/>repeated<br/>cross-section"]
end
subgraph DESIGN["DiD design"]
D["<b>Staggered rollout</b><br/>17 treated woredas<br/>vs 122 controls"]
end
subgraph LADDER["Estimator ladder"]
E["Naive 2x2"]
F["Static TWFE<br/>+ event study"]
G["Sun-Abraham /<br/>Borusyak /<br/>Callaway-Sant'Anna"]
end
subgraph OUT["Outcome families"]
H["<b>Activity</b><br/>lights, impervious"]
I["<b>Welfare</b><br/>durables, wealth"]
J["<b>Empowerment</b><br/>female jobs, agency"]
end
A --> D
B --> D
C --> D
D --> E --> F --> G
F --> H
B --> I
C --> J
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#d97757,stroke:#141413,color:#fff
style G fill:#00d4c8,stroke:#141413,color:#fff
style J fill:#00d4c8,stroke:#141413,color:#fff

The key idea the diagram encodes is that one estimand — the ATT — threads through everything. The naive 2×2 is the cartoon version; TWFE and its event-study view are the workhorse; and the three modern estimators are the robustness insurance that the workhorse has not been led astray by staggered timing. Each box maps onto a section below, and the gender finding (the teal “Empowerment” box) is where the analysis lands.

1.3 Where are the industrial parks located?

Ethiopia placed its parks deliberately — clustered around the capital, Addis Ababa, and the main transport corridors, yet reaching into peripheral regions of the country. Before we build any statistical machinery, it helps to see the real geography we are modeling.

Source: Appendix Figure A2 in Huang, Wang & Xu (2026), “The socioeconomic impacts of industrial parks in Ethiopia.” The map shows the paper’s real park locations for geographic context; this tutorial’s analysis runs on synthetic data calibrated to reproduce the paper’s results.

That deliberate clustering near cities and roads is exactly the kind of non-random placement our design has to handle — so before estimating anything, the next section pins down the vocabulary that makes the treated-versus-control comparison credible.

2. Key concepts

The post leans on a small vocabulary repeatedly, and the later sections assume you can move between these terms quickly. Each concept below has three parts. The definition is always visible; the example and analogy sit behind clickable cards — open them when you need them, leave them closed for a quick scan. If a later section mentions “forbidden comparisons” or “repeated cross-section” and the term feels slippery, this is the section to re-read.

1. Staggered difference-in-differences. Units adopt treatment at different times, not all at once. We compare the change in outcomes for a treated group to the change for a not-yet-treated or never-treated group. With many adoption dates, the design is a stack of overlapping 2×2 comparisons.

Example

Ethiopia’s parks open across eight cohorts: 1 woreda in 2008, then 2 in 2014, 2 in 2015, 3 in 2016, 3 in 2017, 2 in 2018, 2 in 2019, and 2 in 2020 — 17 treated woredas in total, each turning on in its own year.

Analogy

A city installs streetlights block by block over a decade. To judge their effect you cannot just compare “before any lights” to “after all lights” — you must line up each block against its own opening date and a block that never got lit.

2. Parallel trends. The identifying assumption of DiD: absent the park, treated and control woredas would have followed the same path on average. Their levels can differ; their trends must match. We cannot prove it, but a flat pre-treatment event study makes it credible.

Example

The four pre-opening event-study leads run from −0.0275 to −0.0013 and the largest absolute t among them is just 2.17 — close enough to flat to read as parallel trends holding before the parks open.

Analogy

Two boats on the same current sit at different points but drift in step. Only an engine — the treatment — should make one pull ahead. If they were already diverging before the engine fired, the comparison is broken.

3. ATT $E[Y_i(1) - Y_i(0) \mid D_i = 1]$. The Average effect of the Treatment on the Treated — the effect on the districts that got a park, not on a random district. DiD, TWFE, and all three modern estimators here target the ATT, not the population-wide ATE.

Example

The +0.215 light effect is the ATT for the 17 park woredas. It does not promise that placing a park in any random district would raise its lights that much — only that these districts, given these parks, ended up that much brighter.

Analogy

The bonus speed measured on the car that actually got the new engine — not a promise about any car you might pick off the street.

4. TWFE bias, negative weights, and forbidden comparisons. Under staggered timing, the two-way fixed-effects regression quietly uses already-treated units as controls for later-treated ones. Those “forbidden” comparisons can get negative weights and bias — even flip the sign of — the estimate when effects grow over time.

Example

Here the danger is tiny: the Goodman-Bacon decomposition shows the forbidden later-vs-earlier comparisons carry only 1.21% of the total weight (and average +0.0135), while clean treated-vs-never comparisons carry 95.42%.

Analogy

Grading a class on a curve where some students were secretly given the exam early and then used as the “average” everyone else is scored against. If only a couple of students got the early peek, the curve is barely distorted — which is the situation here.

5. Event study. Instead of one ATT, estimate one coefficient per year-relative-to-opening (event time $k$). Plotting them shows the dynamic path: flat leads before opening (no anticipation) and rising lags after (the effect building up).

Example

The light effect is +0.115 the year a park opens ($k = 0$), climbs to +0.193 at $k = +1$ and +0.219 at $k = +2$, and plateaus at +0.484 by $k = +4$ — a slow build, not an instant jump.

Analogy

A medical chart that plots a patient’s temperature day by day around the start of a drug, rather than reporting a single before/after average. The shape of the curve tells you when and how the drug works.

6. Repeated-cross-section DiD. When each survey round interviews different households (no panel key), you cannot use household fixed effects. The effect is identified off district × round group means: compare treated vs control districts before vs after their park opens, absorbing district and region×round fixed effects.

Example

The DHS data are five rounds (2000, 2005, 2011, 2016, 2019) of fresh respondents. So the household regression uses | district_id + region_id^survey_round — district and region-by-round fixed effects — with no household effect, and only coarse event phases $\{-3, …, +1\}$.

Analogy

Polling a city’s mood with a fresh sample of pedestrians each year. You cannot track any one person over time, but you can still compare how neighborhoods shifted relative to each other.

7. Survey weights and clustered/Conley standard errors. The DHS is a complex sample, so regressions are weighted by the sampling weight. Standard errors are clustered on district (allowing a district’s errors to correlate over time) and, for the satellite panel, hardened with Conley spatial-HAC errors that also allow nearby districts to correlate.

Example

For the light ATT the cluster SE (0.0792) and the Conley-HAC SE (0.0799) are nearly identical and 2.43× the naive HC0 SE (0.0329) — yet the +0.215 estimate stays significant at t = 2.69.

Analogy

Counting a milling crowd. If everyone keeps shuffling between seats, you have far fewer truly independent heads than the rows suggest — honest standard errors admit that.

8. SUTVA and spillovers. The stable-unit-treatment-value assumption says one unit’s treatment does not affect another’s outcome. If a park lifts its neighbours’ lights, the never-treated controls are contaminated and the ATT is biased. A nearby test checks for exactly this leakage.

Example

The nearby coefficient (control districts within 10 km of a park) is +0.0648 and insignificant (t = 1.06), while the host effect stays +0.2712 — no measurable spillover, so SUTVA is plausible here.

Analogy

Testing whether a new factory’s smoke drifts onto the neighbouring farm. If the farm’s crops are unchanged, you can fairly use it as a clean comparison for the factory’s own land.
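The Conley idea in concept 7 can be sketched as a distance kernel: pairs of districts closer than a cutoff receive a positive weight in the variance calculation, and pairs beyond it receive zero. The block below is a minimal illustration under stated assumptions, not the tutorial's implementation: the Bartlett (linear-decay) kernel, the 100 km cutoff, the helper names and the coordinates are all hypothetical choices made for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bartlett_weight(dist_km, cutoff_km):
    """Linear-decay kernel: 1 at distance 0, falling to 0 at the cutoff."""
    return max(0.0, 1.0 - dist_km / cutoff_km)

# Illustrative coordinates (lat, lon); NOT taken from the dataset.
points = {"A": (9.00, 38.75), "B": (8.75, 38.98), "C": (7.05, 38.48)}
CUTOFF_KM = 100.0  # assumed cutoff

for i in points:
    for j in points:
        if i < j:  # visit each unordered pair once
            d = haversine_km(*points[i], *points[j])
            print(f"{i}-{j}: {d:6.1f} km  weight {bartlett_weight(d, CUTOFF_KM):.2f}")
```

Nearby pairs keep a weight close to 1 and so contribute their error cross-products to the variance, while distant pairs drop out. Clustering on district is the special case in which only a district's own observations are allowed to correlate.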

3. Setup and the two star libraries

Two specialist packages do the heavy lifting, and each gets a one-line introduction the first time it appears:

pyfixest runs fixed-effects regressions with a fast, Stata-flavored formula syntax: everything left of the | is estimated, everything right of it is absorbed as fixed effects. It also ships an event_study helper with the modern saturated (Sun-Abraham) and did2s (Borusyak/Gardner) estimators built in.
diff-diff is a teaching-oriented package for difference-in-differences. We use its DifferenceInDifferences, CallawaySantAnna, and BaconDecomposition classes — the last two are exactly the staggered-robust tools this post needs.

# In Colab, install the two estimation libraries first:
# !pip install pyfixest==0.50.1 diff-diff==3.5.2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyfixest as pf
import diff_diff as dd
np.random.seed(42) # reproducibility
# Site dark-theme palette for figures
STEEL_BLUE, WARM_ORANGE, TEAL = "#6a9bcc", "#d97757", "#00d4c8"
DARK_NAVY, GRID_LINE, LIGHT_TEXT = "#0f1729", "#1f2b5e", "#c8d0e0"

The satellite specifications need two small design helpers. The first builds the staggered first_treat column the modern estimators require: treated woredas get their park’s opening year, and never-treated controls get 0 — not NaN, because a missing value would silently drop the 122 controls that every staggered estimator needs as its clean comparison group. The second builds the “with-trends” interactions that absorb the faster pre-existing urban trend of treated woredas (more on why in Section 6).
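As a plain-Python illustration of that coding rule, the sketch below derives event time and the static treatment indicator from a `first_treat` value. The helper names `event_time` and `treatment_indicator` are hypothetical (they are not pyfixest or diff-diff functions) and the three example woredas are made up; the only thing carried over from the text is the convention that never-treated controls are coded 0.

```python
def event_time(year, first_treat):
    """Years relative to opening; None for never-treated units (first_treat == 0)."""
    if first_treat == 0:
        return None  # never-treated: no event clock, always a clean control
    return year - first_treat

def treatment_indicator(year, first_treat):
    """Static DiD indicator: 1 once the park is open, else 0."""
    k = event_time(year, first_treat)
    return int(k is not None and k >= 0)

# Three hypothetical woredas: a 2016 opener, a 2008 opener, and a control.
for ft in (2016, 2008, 0):
    row = [(y, event_time(y, ft), treatment_indicator(y, ft))
           for y in (2007, 2015, 2016, 2020)]
    print(ft, row)
```

The control keeps every one of its rows with a treatment indicator of 0, which is what lets it serve as the comparison group in every year. The pandas helpers used in the rest of the tutorial follow.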

def add_first_treat(d):
    """Treated woredas get their open_year; never-treated controls get 0."""
    out = d.copy()
    out["first_treat"] = out["open_year"].fillna(0).astype(int)
    return out

def add_trend_terms(d):
    """Centre time at 2012 and interact it with 2007 baseline characteristics,
    so each woreda can follow its own linear trend (the paper's even columns)."""
    out = d.copy()
    out["t"] = out["year"] - 2012
    for c in ["urbanization_rate_2007", "employment_rate_2007",
              "log_pop_density_2007", "share_christian_2007", "share_amharic_2007"]:
        out[f"t_{c}"] = out["t"] * out[c]
    return out

TREND_TERMS = ["t_urbanization_rate_2007", "t_employment_rate_2007",
               "t_log_pop_density_2007", "t_share_christian_2007",
               "t_share_amharic_2007"]

With the tooling in place, the next step is to load the three data layers and understand why they are structured so differently.

4. The three datasets

Evaluating a place-based policy forces a measurement choice to the surface. National statistics would barely flinch at a few new factories, so we need sub-national data — and at three different grains. We load all three straight from the post’s data folder on GitHub, so the code runs unchanged in Colab.

BASE = ("https://raw.githubusercontent.com/cmg777/starter-academic-v501/"
"master/content/post/python_did_industrial_park/data/")
district = pd.read_csv(BASE + "industrial_park_district_panel.csv")
household = pd.read_csv(BASE + "industrial_park_household_rcs.csv")
individual = pd.read_csv(BASE + "industrial_park_individual_rcs.csv")
print("district panel :", district.shape)
print("household RCS :", household.shape)
print("individual RCS :", individual.shape)
print("treated woredas:", district.loc[district.treated == 1, "district_id"].nunique())
print("control woredas:", district.loc[district.treated == 0, "district_id"].nunique())

district panel : (2224, 34)
household RCS : (13200, 13)
individual RCS : (17900, 22)
treated woredas: 17
control woredas: 122

The three layers have fundamentally different structures, and that distinction drives every downstream choice. The district layer is a balanced panel — 139 woredas × 16 years (2005–2020) = 2,224 rows — so it supports a genuine panel event study with annual event time. The household and individual layers are repeated cross-sections: five DHS rounds of different respondents (13,200 households and 17,900 individuals), with no within-respondent panel key, so they admit only coarse event phases and survey-weighted regressions, never unit fixed effects. The treatment split is small on the treated side — 17 park woredas against 122 matched controls — which is exactly why several effects below are borderline and why honest standard errors matter.
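To make the repeated-cross-section logic concrete, here is a minimal sketch of a survey-weighted DiD computed from cell means alone. Everything in it is hypothetical: the eight respondent records, the weights and the helper `weighted_mean` are invented for illustration, and the sketch leaves out the district and region-by-round fixed effects that the actual household regressions absorb.

```python
# Hypothetical respondents. Each survey round interviews different people,
# so there is no respondent ID to follow over time.
rows = [
    # (treated_district, post_round, survey_weight, outcome)
    (1, 0, 1.0, 0.20), (1, 0, 1.0, 0.30),
    (1, 1, 2.0, 0.50), (1, 1, 1.0, 0.80),
    (0, 0, 1.0, 0.10), (0, 0, 1.0, 0.20),
    (0, 1, 1.0, 0.15), (0, 1, 3.0, 0.25),
]

def weighted_mean(treated, post):
    """Survey-weighted mean outcome in one treated-by-post cell."""
    cell = [(w, y) for tr, po, w, y in rows if tr == treated and po == post]
    return sum(w * y for w, y in cell) / sum(w for w, _ in cell)

treated_change = weighted_mean(1, 1) - weighted_mean(1, 0)
control_change = weighted_mean(0, 1) - weighted_mean(0, 0)
did = treated_change - control_change
print(f"repeated-cross-section DiD = {did:+.3f}")
```

No individual appears twice, yet the estimate is still well defined, because identification comes from how group means move rather than from tracking anyone over time.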

4.1 The staggered rollout

The single feature that makes this a staggered design is that parks opened in different years. Tabulating the treated woredas by opening year shows the cohort structure that every modern estimator below keys on.

cohorts = (district[district.treated == 1]
.drop_duplicates("district_id")
.groupby("open_year").size())
print(cohorts.rename("n_treated_woredas").to_string())
print("total treated:", int(cohorts.sum()))

open_year
2008 1
2014 2
2015 2
2016 3
2017 3
2018 2
2019 2
2020 2
total treated: 17

The rollout is genuinely staggered: a single anchor woreda opens in 2008 (the Eastern Industrial Park), then the main build-out runs 2014–2020 with two to three woredas per year. This spread is what makes a naive before/after impossible, because there is no single “before”, and what makes the staggered-robust estimators in Section 8 necessary rather than decorative. It also means that every event time in the ±5-year window reported in Section 7 has at least three treated woredas behind it, so the dynamic path is estimated off real data at each lag.
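The support behind each event time can be counted directly from the cohort table. The sketch below assumes the balanced 2005–2020 panel described earlier and uses a hypothetical helper, `support`, to count how many treated woredas are observed exactly k years from their own opening.

```python
# Cohort sizes copied from the rollout table (opening year -> treated woredas).
cohorts = {2008: 1, 2014: 2, 2015: 2, 2016: 3, 2017: 3, 2018: 2, 2019: 2, 2020: 2}
FIRST_YEAR, LAST_YEAR = 2005, 2020  # panel window

def support(k):
    """Number of treated woredas observed at event time k inside the panel."""
    return sum(n for g, n in cohorts.items() if FIRST_YEAR <= g + k <= LAST_YEAR)

for k in range(-5, 6):
    print(f"k = {k:+d}: {support(k):2d} treated woredas")
```

Support thins out at the longest lags because only the earliest cohorts have been open long enough to reach them, which is worth keeping in mind when reading the late event-study coefficients.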

4.2 The outcomes, and a transparent word on the data

The satellite layer carries two outcomes: ihs_light, the inverse hyperbolic sine of nighttime luminosity (a log-like transform that handles zeros), and impervious_ratio, the share of a woreda’s land that is built-up surface, observed only every five years. The household layer carries durable goods per capita, a housing-quality indicator, and the standardized wealth index. The individual layer carries non-agricultural employment plus, for women, decision-making power, savings-account ownership, and acceptance of domestic violence.

for col, layer, df in [("ihs_light", "district", district),
                       ("durable_goods_pc", "household", household),
                       ("nonag_employment", "individual", individual)]:
    s = df[col]
    print(f"{col:18s} ({layer:10s}) N={s.notna().sum():6d} "
          f"mean={s.mean():.3f} sd={s.std():.3f}")

ihs_light (district ) N= 2224 mean=0.352 sd=0.715
durable_goods_pc (household ) N= 12207 mean=0.308 sd=0.487
nonag_employment (individual ) N= 17219 mean=0.343 sd=0.475

These means anchor every magnitude that follows. Durable goods average 0.308 items per capita, so the +0.229 ATT we find later is a ~74% lift off that base; non-agricultural employment averages 0.343, so a +0.140 effect for women is a large move. Before modeling, though, one caveat must be stated plainly: the data are synthetic. The data-generating process was tuned so that re-running the paper’s regressions recovers its coefficients (within about 0.02 on the headline cells), with the same signs and stars; spatial and serial shocks were injected so the standard errors behave realistically without moving the point estimates. We hold ourselves to that in Section 13. With the measurement settled, let us look at the data before regressing it.
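Two pieces of arithmetic from this section are easy to check by hand: the inverse hyperbolic sine behind ihs_light, and the relative size of an effect against its sample mean. The sketch below is illustrative only. The input light values are arbitrary, and the helper name `ihs` is ours rather than a column or function from the dataset.

```python
import math

def ihs(x):
    """Inverse hyperbolic sine: ln(x + sqrt(x^2 + 1))."""
    return math.asinh(x)

# Defined at zero (unlike log) and close to log(2x) once x is large.
for x in (0.0, 0.5, 5.0, 50.0):
    approx = math.log(2 * x) if x > 0 else float("nan")
    print(f"x = {x:5.1f}  ihs = {ihs(x):.4f}  log(2x) = {approx:.4f}")

# Relative size of the durable-goods effect quoted above: ATT / sample mean.
lift = 0.229 / 0.308
print(f"durables lift = {lift:.1%}")
```

The gap between the two columns is large near zero and negligible for bright values, which is why the log-like reading of an IHS coefficient is safer for well-lit districts than for dark ones.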

5. Exploratory analysis: the case for parallel trends

Good causal work looks at the data before it models it. The first and most important view plots treated and control group-mean light over time — the picture difference-in-differences was invented for. One subtlety drives how we draw it: because of the synthetic bright-base device (treated park-cities are modelled as intrinsically much brighter than rural controls, a level difference the district fixed effect absorbs), plotting raw light levels would put the two groups miles apart and hide the trends. So we plot light indexed to each group’s own pre-2008 mean — baseline-normalized — which makes the “matched-then-diverge” picture read correctly.

# baseline-normalize each group's mean light to its pre-2008 average
g = (district.assign(grp=np.where(district.treated == 1, "Treated", "Control"))
.groupby(["grp", "year"])["ihs_light"].mean().reset_index())
base = g[g.year < 2008].groupby("grp")["ihs_light"].mean()
g["normed"] = g.apply(lambda r: r.ihs_light - base[r.grp], axis=1)
# (full dark-theme styling is in script.py)

Indexed to each group’s pre-2008 mean, the treated and control series sit on top of each other through the pre-rollout era: in 2008 the treated group is at −0.0018 and the control group at −0.0030, essentially identical, the visual signature of parallel trends holding before treatment turns on. From the 2014 build-out onward the treated series climbs and then levels off (+0.083 in 2014 → +0.186 in 2016 → +0.244 in 2017 → +0.237 in 2020) while the controls hover around zero with no trend. The eye already sees a matched pair of groups that diverge only after the parks open; the rest of the post is about measuring that divergence and trusting the measurement.

The staggered structure is easier to see one cohort at a time. The “staircase” figure traces each opening-year cohort’s mean light against the flat never-treated baseline.

Each cohort turns up at its own opening year — the 2016 cohort lifts off in 2016, the 2018 cohort in 2018 — while the never-treated line stays flat and even drifts down slightly, sharpening the contrast. This is the staggered design made visual: there is no single treatment date, so any honest estimator must align each cohort to its own clock. The next view confirms a second design fact — that treatment is not scattered randomly across the map.

Plotting the 17 treated woredas (orange) and 122 controls (blue) by longitude and latitude shows the treated units are spatially clustered, not randomly sprinkled — parks went to a handful of regions near cities and roads. Clustered treatment means a regional shock could hit several treated woredas at once, so their errors are unlikely to be independent. That is precisely the problem Conley spatial standard errors fix in Section 11. Finally, a distributional view shows the bright-base device head-on.

The boxplots split IHS light by group and pre/post period. Treated woredas sit far above controls in level — the synthetic bright base — and shift up further after opening, while controls barely move. The large level gap looks alarming but is harmless: the district fixed effect absorbs any time-invariant brightness, leaving the DiD coefficient untouched. With the intuition built, we can put the first number on the table.

6. From a naive 2×2 to the static TWFE ATT

6.1 The naive 2×2 (and why it understates the effect)

The simplest possible estimate collapses the whole staggered design at the median opening year (2017), forms four treated/control × pre/post cell means, and takes the difference of differences. diff-diff’s DifferenceInDifferences class returns it with a standard error.

d = district.copy()
d["post"] = (d.year >= 2017).astype(int) # collapse at the median opening year
cells = d.groupby(["treated", "post"])["ihs_light"].mean().unstack("post")
print(cells.round(4))
res = dd.DifferenceInDifferences(cluster="district_id").fit(
d, outcome="ihs_light", treatment="treated", time="post")
print(f"\nDiD ATT = {res.att:+.4f} (SE {res.se:.4f}, p = {res.p_value:.4f})")

post 0 1
treated
0 0.0990 0.0909
1 2.1308 2.3237
DiD ATT = +0.2011 (SE 0.0885, p = 0.0232)

Treated light rises +0.1929 post-opening while controls fall −0.0082, so the difference-in-differences is +0.2011 (SE 0.0885, p = 0.0232): significant at 5%, with the by-hand and diff-diff estimates agreeing to four decimals. But this blended 2×2 understates the dynamic effect: the park’s impact ramps up over roughly five years (the event study below reaches +0.48), so averaging the small early post-years with the large late ones pulls the mean toward 0.20. The single 2017 cutoff also mislabels timing: woredas whose parks opened before 2017 contribute already-treated years to the “pre” cell, and woredas opening after 2017 contribute not-yet-treated years to the “post” cell, and both shrink the contrast. The fix is to let the effect vary over time, align each woreda to its own opening year, and absorb confounders with fixed effects.
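The by-hand version of this 2×2 takes only a few lines. The sketch below copies the four cell means from the printed table; because those means are rounded to four decimals, the result can differ from the reported +0.2011 in the last digit.

```python
# Cell means copied from the printed 2x2 table (rounded to four decimals).
means = {
    ("control", "pre"): 0.0990, ("control", "post"): 0.0909,
    ("treated", "pre"): 2.1308, ("treated", "post"): 2.3237,
}

treated_change = means[("treated", "post")] - means[("treated", "pre")]
control_change = means[("control", "post")] - means[("control", "pre")]
did = treated_change - control_change

print(f"treated change = {treated_change:+.4f}")
print(f"control change = {control_change:+.4f}")
print(f"DiD            = {did:+.4f}")
```

The level gap between the groups (roughly 2.1 versus 0.1) never enters the calculation: only the two changes do, which is the sense in which DiD differences out time-invariant brightness.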

6.2 The static TWFE difference-in-differences

The workhorse specification adds two-way fixed effects. For woreda $d$ in year $t$:

$$Y_{dt} = \beta \, D_{dt} + \alpha_d + \gamma_{r(d),t} + \varepsilon_{dt}$$

In words, this says that the outcome $Y_{dt}$ (here ihs_light) equals a park effect $\beta$ times the treatment indicator $D_{dt}$ (the treatment column, which is 1 once a woreda’s park is open), plus a woreda fixed effect $\alpha_d$ that absorbs anything permanent about a district (including its bright base), plus a region-by-year fixed effect $\gamma_{r(d),t}$ that absorbs shocks common to a whole region in a given year, plus noise $\varepsilon_{dt}$. The coefficient $\beta$ is the ATT: the average park effect on the treated woredas. The “with-trends” specification adds the t_* interactions, which let a woreda’s linear trend vary with its 2007 baseline characteristics. In pyfixest, the part after the | lists the fixed effects to absorb:

dt = add_trend_terms(district)
out_rows = []
for ycol, label in [("ihs_light", "IHS night-light"),
                    ("light_intensity", "Raw night-light"),
                    ("impervious_ratio", "Impervious ratio")]:
    m0 = pf.feols(f"{ycol} ~ treatment | district_id + region^year",
                  data=dt, vcov={"CRV1": "district_id"})
    m1 = pf.feols(f"{ycol} ~ treatment + " + " + ".join(TREND_TERMS) +
                  " | district_id + region^year",
                  data=dt, vcov={"CRV1": "district_id"})
    out_rows.append((label, m0.coef()["treatment"], m0.se()["treatment"],
                     m1.coef()["treatment"], m1.se()["treatment"]))
for label, b0, se0, b1, se1 in out_rows:
    print(f"{label:18s} no-trends {b0:+.4f} ({se0:.4f}) "
          f"with-trends {b1:+.4f} ({se1:.4f})")

IHS night-light no-trends +0.2704 (0.1007) with-trends +0.2152 (0.0833)
Raw night-light no-trends +1.7316 (0.4807) with-trends +1.6181 (0.4540)
Impervious ratio no-trends +0.0292 (0.0042) with-trends +0.0263 (0.0037)

The static TWFE regression recovers the paper’s headline: a park raises IHS nighttime light by +0.2152 with trend interactions (SE 0.0833, t = 2.58, significant at 1%) and +0.2704 without them — roughly a 21–27% increase in luminosity, since the IHS coefficient reads approximately as a proportional change at these magnitudes. The drop from 0.27 to 0.21 when trends are added is a textbook differential-trend confound: treated woredas were already more urban in 2007 and trending up faster, so the time × urbanization interaction absorbs that slope and the with-trends estimate is the cleaner ATT. The impervious-surface ratio rises +0.0263 with trends (SE 0.0037, t = 7.07) — about 2.6 percentage points of built-up land, ~82% of its 0.032 mean, and the most precisely estimated satellite coefficient in the study. The raw-light coefficient runs high (+1.618 vs the paper’s 1.276), a documented synthetic artifact of the bright-base device that we flag again in the reproduction audit. With a static ATT in hand, we unfold it across event time.
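The t-statistics and the proportional reading of the IHS coefficients can be recomputed from the printed output. The sketch below works from the rounded coefficients and standard errors in the table, so its t-statistics can differ slightly from those quoted in the text. The helper `pct_change` is a hypothetical name, and treating an IHS coefficient as a log-point change is an approximation that assumes light levels well away from zero.

```python
import math

# (coefficient, clustered SE) pairs copied from the TWFE output above.
results = {
    "IHS light, no trends":    (0.2704, 0.1007),
    "IHS light, with trends":  (0.2152, 0.0833),
    "Impervious, with trends": (0.0263, 0.0037),
}

def pct_change(b):
    """Proportional change implied by a log-point coefficient: exp(b) - 1."""
    return math.exp(b) - 1

for label, (b, se) in results.items():
    print(f"{label:24s} t = {b / se:5.2f}")

for b in (0.2152, 0.2704):
    print(f"b = {b:.4f}  ->  exp(b) - 1 = {pct_change(b):.1%}")
```

Reading the coefficient itself as the percent change is the first-order approximation; the exponentiated figure is somewhat larger, and the gap widens as the coefficient grows.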

7. The event study: the dynamic path

A single ATT hides when the effect arrives. The event study estimates one coefficient per year-relative-to-opening, normalized to the year before opening ($k = -1$). For woreda $d$ in year $t$, with cohort opening year $g$:

$$Y_{dt} = \sum_{k \neq -1} \delta_k \, \mathbf{1}[t - g = k] + \alpha_d + \gamma_{r(d),t} + \varepsilon_{dt}$$

In words, this says we replace the single treatment dummy with a set of dummies, one for each event time $k$ (years since the park opened), each carrying its own coefficient $\delta_k$. The pre-opening coefficients ($k < 0$) should hug zero if parallel trends and no-anticipation hold; the post-opening coefficients ($k \geq 0$) trace how the effect builds. Here $\mathbf{1}[t - g = k]$ is an indicator equal to 1 when woreda $d$ is exactly $k$ years from its own opening, $\alpha_d$ and $\gamma_{r(d),t}$ are the same fixed effects as before, and the omitted $k = -1$ is the reference. We estimate the clean leads and lags with pyfixest’s saturated (Sun-Abraham) estimator, whose .aggregate() collapses the cohort dimension to one effect per $k$.

df = add_first_treat(district)
m = pf.event_study(df, yname="ihs_light", idname="district_id", tname="year",
gname="first_treat", estimator="saturated", att=True)
es = m.aggregate().reset_index()
es["event_time"] = es["period"].astype(float)
es = es[(es.event_time >= -5) & (es.event_time <= 5)].sort_values("event_time")
print(es[["event_time", "Estimate", "Std. Error", "Pr(>|t|)"]].round(4).to_string(index=False))

 event_time Estimate Std. Error Pr(>|t|)
-5.0 -0.0139 0.0176 0.4288
-4.0 -0.0013 0.0138 0.9226
-3.0 -0.0275 0.0127 0.0304
-2.0 -0.0135 0.0077 0.0791
0.0 0.1153 0.0295 0.0001
1.0 0.1928 0.0422 0.0000
2.0 0.2187 0.0641 0.0006
3.0 0.3138 0.0880 0.0004
4.0 0.4844 0.0463 0.0000
5.0 0.4697 0.0712 0.0000

The figure tells the whole story in one arc. The four pre-opening leads hug zero: they range from −0.0275 to −0.0013, and the largest absolute t among them is 2.17 (the $k = -3$ lead, p = 0.030). That single lead is individually significant at the 5% level, but the leads are small next to the post-opening effects and do not drift toward the treatment date, so we read them as a flat pre-trend rather than a violation. The jump arrives with the opening itself: the effect is already +0.1153 at $k = 0$ (p = 0.0001), climbs through +0.1928 ($k = +1$), +0.2187 ($k = +2$) and +0.3138 ($k = +3$), and plateaus at +0.4844 ($k = +4$) and +0.4697 ($k = +5$). This rising-then-flattening dynamic is exactly why the naive 2×2 (+0.2011) understated the long-run ATT: it averaged the small early years with the large late ones. The flat pre-period is the central piece of suggestive support for parallel trends, though it is never a proof, since the assumption concerns the unobserved post-period counterfactual. A skeptic might still worry the +0.215 TWFE headline is an artifact of staggered timing; the next section confronts that worry directly.
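The pre-trend reading can be verified from the printed table. The sketch below copies the estimates and standard errors by event time, computes a t-statistic for each lead, and takes a simple unweighted mean of the post-opening coefficients. That mean is only a summary of the table; it is not the same object as the static TWFE coefficient.

```python
# (estimate, standard error) by event time, copied from the table above.
event_study = {
    -5: (-0.0139, 0.0176), -4: (-0.0013, 0.0138),
    -3: (-0.0275, 0.0127), -2: (-0.0135, 0.0077),
    0: (0.1153, 0.0295), 1: (0.1928, 0.0422), 2: (0.2187, 0.0641),
    3: (0.3138, 0.0880), 4: (0.4844, 0.0463), 5: (0.4697, 0.0712),
}

# t-statistics for the pre-opening leads (k < 0)
lead_t = {k: b / se for k, (b, se) in event_study.items() if k < 0}
max_abs_t = max(abs(t) for t in lead_t.values())

# Unweighted mean of the post-opening coefficients (k = 0..5)
post = [b for k, (b, _) in event_study.items() if k >= 0]
post_mean = sum(post) / len(post)

print({k: round(t, 2) for k, t in lead_t.items()})
print(f"largest |t| among leads = {max_abs_t:.2f}")
print(f"simple mean of k = 0..5 = {post_mean:+.4f}")
```

Every lead is negative and an order of magnitude smaller than the later lags, so even the one lead that crosses the 5% threshold is small in economic terms.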

8. Modern staggered estimators: the negative-weights teaching moment

Here is the worry stated precisely. Under staggered adoption, the TWFE regression does not only compare treated woredas to never-treated ones. It also uses already-treated woredas as controls for later-treated ones — a “forbidden comparison.” When treatment effects grow over time (as ours clearly do, from +0.12 to +0.48), those forbidden comparisons receive negative weights and can bias TWFE, in extreme cases flipping its sign. The fix is a generation of estimators — Sun-Abraham, Borusyak/Gardner, and Callaway-Sant’Anna — that only ever compare treated cohorts to clean (not-yet- or never-treated) controls. Each targets the same ATT; if they agree with TWFE, the negative-weights problem is not biting.
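To see how a forbidden comparison can go wrong, consider a toy example with two cohorts and no noise. The numbers below are hypothetical and chosen only for illustration: every treated unit has a positive effect that grows by one unit per year of exposure, yet the later-vs-earlier 2×2 comparison comes out negative.

```python
# Toy illustration (hypothetical numbers, not the tutorial's data).
# Two cohorts, no noise, no unit or time effects: the outcome equals the treatment effect.
PERIODS = range(1, 7)            # years 1..6
EARLY_START, LATE_START = 2, 5   # adoption years of the two cohorts

def effect(t, start):
    """True effect: 0 before adoption, then 1, 2, 3, ... with each year of exposure."""
    return float(t - start + 1) if t >= start else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

early = {t: effect(t, EARLY_START) for t in PERIODS}
late = {t: effect(t, LATE_START) for t in PERIODS}

# Forbidden 2x2: the late cohort is "treated", the already-treated early cohort is the "control".
# The window covers the years in which the early cohort is already treated (2..6).
pre = [t for t in PERIODS if EARLY_START <= t < LATE_START]   # years 2, 3, 4
post = [t for t in PERIODS if t >= LATE_START]                # years 5, 6

late_change = mean(late[t] for t in post) - mean(late[t] for t in pre)
early_change = mean(early[t] for t in post) - mean(early[t] for t in pre)
forbidden_did = late_change - early_change

true_att_late = mean(late[t] for t in post)

print(f"true ATT for the late cohort : {true_att_late:+.2f}")
print(f"later-vs-earlier 2x2 DiD     : {forbidden_did:+.2f}")
```

The control cohort's outcome keeps rising because its own effect is still growing, and the 2×2 subtracts that growth from the late cohort's gain.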

def stars(t):
    """Significance stars from a t-stat (10% / 5% / 1%)."""
    a = abs(t)
    return "***" if a > 2.576 else "**" if a > 1.960 else "*" if a > 1.645 else ""

def cell(b, se):
    """Format a regression cell like '+0.2699*** (0.1005)'."""
    return f"{b:+.4f}{stars(b / se)} ({se:.4f})"

df = add_first_treat(district)
Y = "ihs_light"

# TWFE benchmark
m_twfe = pf.event_study(df, yname=Y, idname="district_id", tname="year",
                        gname="first_treat", estimator="twfe", att=True)
twfe_b, twfe_se = m_twfe.coef().iloc[0], m_twfe.se().iloc[0]

# Sun-Abraham (saturated): average the clean post-period (k = 0..5) effects
m_sa = pf.event_study(df, yname=Y, idname="district_id", tname="year",
                      gname="first_treat", estimator="saturated", att=True)
sa = m_sa.aggregate()
sa.index = sa.index.astype(float)
sa_post = sa[(sa.index >= 0) & (sa.index <= 5)]
sa_b = float(sa_post["Estimate"].mean())
# SE of the simple average, treating the six event-time estimates as independent
sa_se = float(np.sqrt((sa_post["Std. Error"].astype(float) ** 2).mean() / len(sa_post)))

# Borusyak/Gardner imputation (did2s)
m_d2s = pf.event_study(df, yname=Y, idname="district_id", tname="year",
                       gname="first_treat", estimator="did2s", att=True)
d2s_b, d2s_se = m_d2s.coef().iloc[0], m_d2s.se().iloc[0]

# Callaway-Sant'Anna against the never-treated group
cs = dd.CallawaySantAnna(control_group="never_treated", cluster="district_id").fit(
    df, outcome=Y, unit="district_id", time="year",
    first_treat="first_treat", aggregate="simple")

print(f"TWFE ATT : {cell(twfe_b, twfe_se)}")
print(f"Sun-Abraham ATT (avg k=0..5) : {cell(sa_b, sa_se)}")
print(f"Borusyak/Gardner ATT (did2s) : {cell(d2s_b, d2s_se)}")
print(f"Callaway-Sant'Anna ATT : {cell(cs.att, cs.se)}")

TWFE ATT : +0.2699*** (0.1005)
Sun-Abraham ATT (avg k=0..5) : +0.2991*** (0.0246)
Borusyak/Gardner ATT (did2s) : +0.3022*** (0.0907)
Callaway-Sant'Anna ATT : +0.2561*** (0.0763)
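
The Sun-Abraham row and the spread across estimators can be reproduced by hand from the printed numbers. The sketch below copies the six post-period coefficients from the event-study output earlier and the four ATTs from the output above. Treating the six event-time estimates as independent when combining their standard errors is an assumption carried over from the code, not a general result.

```python
import math

# Post-period (k = 0..5) estimates and standard errors copied from the event-study table
post = [(0.1153, 0.0295), (0.1928, 0.0422), (0.2187, 0.0641),
        (0.3138, 0.0880), (0.4844, 0.0463), (0.4697, 0.0712)]

n = len(post)
sa_att = sum(est for est, _ in post) / n
# SE of a simple average under independence: sqrt(sum of variances) / n
sa_se = math.sqrt(sum(se ** 2 for _, se in post)) / n

# ATTs copied from the four-estimator output
atts = {"TWFE": 0.2699, "Sun-Abraham": 0.2991,
        "Borusyak/Gardner": 0.3022, "Callaway-Sant'Anna": 0.2561}
spread = max(atts.values()) - min(atts.values())

print(f"Sun-Abraham average: {sa_att:+.4f} (SE {sa_se:.4f})")
print(f"spread across estimators: {spread:.4f}")
```

If the event-time estimates are positively correlated, as estimates that share a control group usually are, the independence shortcut will understate the standard error of the average.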

All four estimators target the same ATT and land in a tight band: TWFE +0.2699, Sun-Abraham +0.2991, Borusyak/Gardner +0.3022, and Callaway-Sant’Anna +0.2561. That is a spread of only 0.046 IHS units across methods that, in other settings, can diverge sharply. Each is significant at 1%. They agree here because a large never-treated comparison group (the 122 controls) dominates the comparisons, so even though the effect grows with exposure, the conditions that make TWFE’s forbidden comparisons dangerous barely bind. This agreement is the methodological payoff: a reader worried that the headline is a negative-weighting artifact can see three staggered-robust estimators reproduce it. To show why they agree, we decompose the TWFE number itself.

The Goodman-Bacon decomposition breaks the TWFE coefficient into the weighted average of every underlying 2×2 comparison, labeling each by type. diff-diff does it in one call.

bac = dd.BaconDecomposition().fit(df, outcome=Y, unit="district_id",
                                  time="year", first_treat="first_treat")
bdf = bac.to_dataframe()
print(f"Goodman-Bacon: TWFE = {bac.twfe_estimate:+.4f} decomposes into "
      f"{len(bdf)} 2x2 comparisons.")
print(bdf.groupby("comparison_type")
         .apply(lambda g: pd.Series({"total_weight": g.weight.sum(),
                                     "weighted_avg_estimate": np.average(g.estimate, weights=g.weight)}))
         .round(4))

Goodman-Bacon: TWFE = +0.2699 decomposes into 64 2x2 comparisons.
comparison_type total_weight weighted_avg_estimate
earlier_vs_later 0.0338 0.3370
later_vs_earlier 0.0121 0.0135
treated_vs_never 0.9542 0.2708

The decomposition is reassuring. The clean treated-vs-never-treated comparisons carry 95.42% of the total weight and average +0.2708 — essentially the headline. The “forbidden” later-vs-earlier comparisons (already-treated units used as controls, the ones that can flip TWFE’s sign) carry just 1.21% of the weight and contribute a near-zero +0.0135; clean earlier-vs-later comparisons add another 3.38% at +0.337. With at most ~1.2% of the weight on biased comparisons, TWFE is barely contaminated here — the empirical reason the four estimators agreed. The general lesson is worth keeping: the negative-weights problem is real in principle but empirically negligible whenever a large never-treated pool dominates the weighting, as the 122 PSM controls do. Having trusted the average, we can now ask where the effect is strongest.
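The decomposition identity can be checked directly: the weight-times-estimate products of the three comparison types should add back up to the TWFE coefficient. The following is a minimal sketch using the rounded values printed above, so agreement holds only up to rounding.

```python
# (total weight, weighted-average estimate) per comparison type, copied from the output
bacon = {
    "earlier_vs_later": (0.0338, 0.3370),
    "later_vs_earlier": (0.0121, 0.0135),
    "treated_vs_never": (0.9542, 0.2708),
}

total_weight = sum(w for w, _ in bacon.values())
recomposed = sum(w * est for w, est in bacon.values())
clean_share = bacon["treated_vs_never"][0] / total_weight

print(f"weights sum to        : {total_weight:.4f}")
print(f"recomposed TWFE       : {recomposed:+.4f}  (reported +0.2699)")
print(f"treated-vs-never share: {clean_share:.1%}")
```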

9. Heterogeneity and spillovers

9.1 Where parks work: distance and roads

Place-based policy is, by definition, about place — so the effect should depend on where the park sits. We interact the treatment with distance moderators (a negative interaction means the effect fades with distance) and road-density moderators (a positive interaction means roads amplify it), each on the with-trends spec.

dt = add_trend_terms(district)
for mod in ["dist_addis_km", "dist_state_capital_km", "dist_nearest_city_km",
            "primary_road_density", "paved_road_density"]:
    m = pf.feols(f"ihs_light ~ treatment + treatment:{mod} + " +
                 " + ".join(TREND_TERMS) + " | district_id + region^year",
                 data=dt, vcov={"CRV1": "district_id"})
    b, se = m.coef()[f"treatment:{mod}"], m.se()[f"treatment:{mod}"]
    print(f"{mod:24s} interaction {b:+.5f} (se {se:.5f}, t {b/se:+.2f})")

dist_addis_km interaction -0.00822 (se 0.00232, t -3.54)
dist_state_capital_km interaction -0.00862 (se 0.00406, t -2.13)
dist_nearest_city_km interaction -0.03352 (se 0.00684, t -4.90)
primary_road_density interaction +0.32640 (se 0.84748, t +0.39)
paved_road_density interaction +0.66945 (se 0.32174, t +2.08)

Location fundamentals sharply moderate park effectiveness, exactly as the paper argues. All three distance interactions are negative (the park effect fades with distance from economic centers) and all three are significant: distance to nearest city (−0.0335, t = −4.90, the steepest decay), distance to Addis (−0.0082, t = −3.54), and distance to the state capital (−0.0086, t = −2.13). Both road interactions are positive, meaning denser roads amplify the effect, with paved-road density significant (+0.6695, t = 2.08) but primary-road density correctly signed yet far from significant (+0.3264, t = 0.39). That last result is an honest synthetic limitation: with only 17 treated woredas the mutually correlated moderators cannot all be estimated precisely at once, so it is unsurprising that one of the two road interactions reads non-significant. The point estimates all carry the predicted sign; precision, not direction, is what the small treated sample cannot fully deliver. A related question is whether the park’s gain is truly new or merely stolen from its neighbours.
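The significance calls above follow from the printed coefficients and standard errors. The sketch below recomputes each t-statistic and applies the same normal-approximation thresholds as the stars helper from Section 8, reproduced here so the block runs on its own.

```python
def stars(t):
    """Significance stars from a t-stat (10% / 5% / 1%), normal approximation."""
    a = abs(t)
    return "***" if a > 2.576 else "**" if a > 1.960 else "*" if a > 1.645 else ""

# (coefficient, standard error) of each treatment x moderator interaction, from the output
interactions = {
    "dist_addis_km":         (-0.00822, 0.00232),
    "dist_state_capital_km": (-0.00862, 0.00406),
    "dist_nearest_city_km":  (-0.03352, 0.00684),
    "primary_road_density":  (+0.32640, 0.84748),
    "paved_road_density":    (+0.66945, 0.32174),
}

verdicts = {name: stars(b / se) for name, (b, se) in interactions.items()}
for name, (b, se) in interactions.items():
    print(f"{name:24s} t = {b / se:+.2f} {verdicts[name] or '(ns)'}")
```

With t = 0.39, the primary-road interaction is well short of even the 10% threshold of 1.645.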

9.2 Spillovers: does a park lift its neighbours?

The spillover test adds a nearby indicator — control woredas within 10 km of an operational park — to the Table 1 spec. If parks merely displace activity from neighbours, nearby should be negative; if the gains are net-new, it should be zero.

for ycol, label in [("ihs_light", "IHS night-light"),
                    ("light_intensity", "Raw night-light")]:
    m = pf.feols(f"{ycol} ~ treatment + nearby | district_id + region^year",
                 data=district, vcov={"CRV1": "district_id"})
    print(f"{label:18s} treatment {m.coef()['treatment']:+.4f} "
          f"nearby {m.coef()['nearby']:+.4f} (t {m.coef()['nearby']/m.se()['nearby']:+.2f})")

IHS night-light treatment +0.2712 nearby +0.0648 (t +1.06)
Raw night-light treatment +1.7328 nearby +0.0927 (t +1.35)

The nearby coefficient is +0.0648 (SE 0.0610, t = 1.06) for IHS light and +0.0927 (t = 1.35) for raw light — both small and statistically indistinguishable from zero — while the treatment coefficient stays large and significant (+0.2712). The reading is no spillover: the park lifts its host woreda by ~0.27 IHS but leaves immediate neighbours essentially unchanged, so the host’s gain is net-new activity, not displacement. This also reassures on SUTVA: with no measurable geographic spillover, the never-treated controls are not contaminated by proximity to a park, so the main ATT is not biased by treated-on-control externalities. Economically, the parks behave like relatively self-contained enclaves with weak local supplier linkages. So far the story is about lights and land — but did the parks change how people actually live?
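One way to make "indistinguishable from zero" concrete is a confidence interval. The following is a minimal sketch, assuming a normal approximation with the 1.96 critical value. The raw-light standard error is not printed, so it is backed out from the reported coefficient and t-statistic and is therefore approximate.

```python
Z95 = 1.96  # normal-approximation critical value (an assumption)

def ci95(b, se):
    """Two-sided 95% confidence interval for a coefficient."""
    return b - Z95 * se, b + Z95 * se

ihs_lo, ihs_hi = ci95(0.0648, 0.0610)   # coefficient and SE as reported in the text
raw_se = 0.0927 / 1.35                  # SE implied by the printed t-stat (rounded inputs)
raw_lo, raw_hi = ci95(0.0927, raw_se)

print(f"IHS nearby : [{ihs_lo:+.4f}, {ihs_hi:+.4f}]")
print(f"raw nearby : [{raw_lo:+.4f}, {raw_hi:+.4f}]")
```

Both intervals include zero. The upper bound for IHS light is not negligible, though, so the data are consistent with no spillover without ruling out a modest positive one.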

10. Household welfare and women’s empowerment

10.1 Household living standards (Table 5)

We now switch to the DHS household repeated cross-section. Because each round samples different households, there is no household panel key, so we use no household fixed effect — the effect is identified off district × round group means, with district and region×round fixed effects and DHS survey weights. We report each outcome with and without household-size and head-age controls.
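The logic of a survey-weighted repeated-cross-section DiD can be sketched on toy data. Everything below is hypothetical (the rows, the weights, and the helper name). It strips out the fixed effects and controls and shows only the core idea: the estimate is a difference-in-differences of weighted group means rather than of within-household changes.

```python
# Toy repeated cross-section: (district_group, period, outcome, survey_weight).
# Different households are sampled in each period, so there is no household ID.
rows = [
    ("treated", "pre",  1.0, 1.0), ("treated", "pre",  3.0, 3.0),
    ("treated", "post", 4.0, 2.0), ("treated", "post", 5.0, 2.0),
    ("control", "pre",  2.0, 1.0), ("control", "pre",  2.0, 1.0),
    ("control", "post", 3.0, 1.0), ("control", "post", 2.0, 3.0),
]

def weighted_mean(group, period):
    """Survey-weighted mean outcome for one group x period cell."""
    cell = [(y, w) for g, p, y, w in rows if g == group and p == period]
    return sum(y * w for y, w in cell) / sum(w for _, w in cell)

treated_change = weighted_mean("treated", "post") - weighted_mean("treated", "pre")
control_change = weighted_mean("control", "post") - weighted_mean("control", "pre")
did = treated_change - control_change

print(f"treated change {treated_change:+.2f}, "
      f"control change {control_change:+.2f}, DiD {did:+.2f}")
```

The regressions that follow do the same thing at scale, with district and region×round fixed effects playing the role of the group and period means.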

for ycol, label in [("durable_goods_pc", "Durable goods p.c."),
                    ("housing_quality", "Housing quality"),
                    ("wealth_index", "Wealth index")]:
    m0 = pf.feols(f"{ycol} ~ treatment | district_id + region_id^survey_round",
                  data=household, weights="survey_weight", vcov={"CRV1": "district_id"})
    m1 = pf.feols(f"{ycol} ~ treatment + hh_size + age_head | "
                  "district_id + region_id^survey_round",
                  data=household, weights="survey_weight", vcov={"CRV1": "district_id"})
    print(f"{label:18s} no-controls {m0.coef()['treatment']:+.4f} "
          f"with-controls {m1.coef()['treatment']:+.4f} ({m1.se()['treatment']:.4f})")

Durable goods p.c. no-controls +0.2489 with-controls +0.2286 (0.0284)
Housing quality no-controls +0.2484 with-controls +0.2480 (0.0193)
Wealth index no-controls +0.3875 with-controls +0.3825 (0.0461)

All three living-standards outcomes rise sharply and significantly. Durable goods per capita gain +0.2286 with controls (SE 0.0284, t = 8.06), a ~74% increase against a 0.308 mean. Housing quality (an indicator for having electricity, piped water, a toilet, and a finished floor) rises +0.2480, so the probability of clearing that bar jumps ~24.8 percentage points off a 30.7% base. The composite wealth index rises +0.3825 standard deviations (SE 0.0461, t = 8.29). Crucially, adding controls barely moves any estimate (durables 0.249 → 0.229, the others essentially unchanged). This suggests that the district + region×round design already absorbs most of the confounding these covariates could pick up; they are only mildly correlated with treatment. As at the satellite level, the timing is clean.

# RCS event study uses coarse phase dummies (no balanced unit x time grid).
# _rcs_event_study() is defined in the companion script.py.
es = _rcs_event_study(household, "durable_goods_pc", controls=["hh_size", "age_head"])
print(es.round(4).to_string(index=False))

 event_phase estimate se p_value
-3.0 -0.0197 0.0482 0.6840
-2.0 0.0236 0.0329 0.4757
0.0 0.2606 0.0398 0.0000
1.0 0.1513 0.0387 0.0001

Because the DHS data are repeated cross-sections, the household event study uses coarse phase dummies rather than annual event time. The two pre-opening phases are flat and insignificant — phase −3 at −0.0197 (p = 0.68) and phase −2 at +0.0236 (p = 0.48), both straddling zero — so there is no differential pre-trend in household durables. The effect then jumps to +0.2606 at phase 0 (p < 0.0001) and stays strongly positive at +0.1513 at phase +1. This is the RCS counterpart to the satellite event study’s no-anticipation evidence, with the honest caveat that two pre-phases make a low-powered test. Now to the question the whole post has been building toward: who got the jobs?

10.2 Employment and women’s empowerment (Tables 6–7): the climax

This is the analytical climax, and a textbook case for heterogeneity analysis. We estimate non-agricultural employment for the full sample, then split by sex, using the same survey-weighted RCS design.

ctrl = "hh_size + age_head + age + age_sq"
for label, sub in [("Full sample", individual),
                   ("Women", individual[individual.sex == 1]),
                   ("Men", individual[individual.sex == 0])]:
    m = pf.feols(f"nonag_employment ~ treatment + {ctrl} | "
                 "district_id + region_id^survey_round",
                 data=sub, weights="survey_weight", vcov={"CRV1": "district_id"})
    b, se = m.coef()["treatment"], m.se()["treatment"]
    print(f"{label:12s} {b:+.4f} ({se:.4f}) t {b/se:+.2f}")

Full sample +0.0911 (0.0580) t +1.57 <-- NULL on average
Women +0.1404 (0.0468) t +3.00 <-- SIGNIFICANT for women
Men +0.0176 (0.0934) t +0.19

The average non-agricultural employment effect is +0.0911 (SE 0.0580, t = 1.57) — insignificant — which, read alone, would suggest parks do not move employment at all. But pooling the sexes hides a strong gendered split: the female effect is +0.1404 (SE 0.0468, t = 3.00, significant at 1%) — about a 14-percentage-point rise in women’s non-agricultural employment — while the male effect is +0.0176 (t = 0.19), essentially zero. The parks, concentrated in textiles and garments, pull women into factory wage work; the men were largely already off-farm, so the average washes out. A reader who quoted only the full-sample number would badly misread the study — the sex split is the finding, not a footnote. The empowerment cascade follows the jobs.

women = individual[individual.sex == 1]
for ycol, label in [("decision_power", "Decision power"),
                    ("savings_account", "Savings account"),
                    ("dv_accept", "Accepts DV")]:
    m = pf.feols(f"{ycol} ~ treatment + {ctrl} | "
                 "district_id + region_id^survey_round",
                 data=women, weights="survey_weight", vcov={"CRV1": "district_id"})
    b, se = m.coef()["treatment"], m.se()["treatment"]
    print(f"{label:18s} {b:+.4f} ({se:.4f}) t {b/se:+.2f}")

Decision power +0.1096 (0.0194) t +5.66
Savings account +0.3153 (0.0182) t +17.34
Accepts DV -0.2096 (0.0254) t -8.24

With factory jobs, women’s outcomes shift across the board (all estimates are for the women-only sample). Decision-making power rises +0.1096 (SE 0.0194, t = 5.66); savings-account ownership rises +0.3153 (SE 0.0182, t = 17.34), enormous against a 6.3% base; and acceptance of domestic violence falls by 0.2096 (SE 0.0254, t = −8.24), a ~21-point reduction off a 63.5% base. Economic agency translates into household bargaining power and shifting gender norms. The event study below confirms the timing.

The female-employment and decision-power event studies both sit near zero in the pre-phases and turn up at and after phase 0 (female employment jumps to +0.1311 at phase 0, p = 0.013), reinforcing the no-anticipation reading — women’s gains appear with the park, not before it. The gender result is the substantive heart of the study; one robustness battery remains to decide whether to trust the satellite headline that anchors it.
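Percentage-point effects read differently depending on the baseline. The sketch below converts the three Section 10 effects whose baselines are quoted in the text (durables, savings accounts, and acceptance of domestic violence) into relative changes. It is simple arithmetic on rounded published numbers, so the ratios are approximate.

```python
# outcome -> (estimated effect, baseline mean), both taken from the text of Section 10
effects = {
    "durable goods p.c.": (+0.2286, 0.308),
    "savings account":    (+0.3153, 0.063),
    "accepts DV":         (-0.2096, 0.635),
}

relative = {name: b / base for name, (b, base) in effects.items()}
for name, ratio in relative.items():
    print(f"{name:20s} {ratio:+.0%} relative to baseline")
```

Read this way, the savings effect is roughly five times its baseline, while the fall in acceptance of domestic violence is about a third of its baseline.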

11. Robustness: Conley spatial standard errors and restricted pools

Recall from the map that all 17 treated woredas cluster spatially. When treated units are packed together, a regional shock hits several at once, so their errors are not independent draws — and the naive standard error, which assumes independence, will be too small. The fix is a Conley spatial-HAC standard error, which allows a district’s errors to correlate with itself over time (serial) and with nearby districts in the same year (spatial). The point estimate never changes; only the standard error does. We compute four standard errors for the with-trends light ATT and re-estimate it on restricted control pools.
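The idea behind a spatial correction can be sketched in a few lines. The following is a deliberately simplified illustration, not the companion script's conley_se_for_spec: it handles a sample mean rather than a regression, uses a uniform distance kernel with planar coordinates, and ignores the serial (time) dimension. The function name, data, and cutoff are all hypothetical.

```python
import math

def conley_var_of_mean(values, coords, cutoff_km):
    """Variance of the sample mean, letting errors of units within cutoff_km covary.

    Uniform kernel: pairs no farther apart than the cutoff get weight 1, all others 0.
    """
    n = len(values)
    mean = sum(values) / n
    resid = [v - mean for v in values]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) <= cutoff_km:
                total += resid[i] * resid[j]
    return total / n ** 2

# Hypothetical data: two tight pairs of districts, 100 km apart (planar km coordinates)
values = [1.0, 2.0, 4.0, 5.0]
coords = [(0.0, 0.0), (0.0, 3.0), (100.0, 0.0), (100.0, 4.0)]

var_independent = conley_var_of_mean(values, coords, cutoff_km=0.0)   # own terms only
var_spatial = conley_var_of_mean(values, coords, cutoff_km=10.0)      # neighbours within 10 km

print(f"SE assuming independence: {math.sqrt(var_independent):.3f}")
print(f"SE with spatial kernel  : {math.sqrt(var_spatial):.3f}")
```

Because neighbouring residuals share a sign in this toy, adding their cross-products raises the variance. A uniform kernel is the simplest choice, but it does not guarantee a positive variance in general, which is one reason tapered kernels are common in practice.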

# four SEs for the with-trends IHS-light ATT (full Conley sandwich in script.py)
se_tab = conley_se_for_spec(add_trend_terms(district), "ihs_light",
                            ["treatment"] + TREND_TERMS)
print(se_tab.loc[se_tab.term == "treatment",
                 ["estimate", "se_naive", "se_clustered", "se_conley", "se_hac"]]
      .round(4).to_string(index=False))

 estimate se_naive se_clustered se_conley se_hac
0.2152 0.0329 0.0792 0.0346 0.0799

The satellite headline survives honest standard errors. The most conservative of the four, the Conley spatial-HAC SE (the se_hac column), is 0.0799, or 2.43× the naive HC0 SE of 0.0329, yet the ATT of +0.2152 stays significant (t = 2.69, significant at 1%). Notice the cluster SE (0.0792) and the Conley-HAC SE (0.0799) are nearly identical: clustering at the district level already captures most of the dependence, so spatial correlation beyond the district adds little here. The estimate is also stable when we change the comparison group, whether by dropping the Addis Ababa region or by restricting controls to those far from any city.

specs = {"Full sample": district,
         "Drop Addis region": district[district.region != "Addis Ababa"],
         "Controls >= 50km from city": district[(district.treated == 1) |
                                                (district.dist_nearest_city_km >= 50)]}
for name, sub in specs.items():
    m = pf.feols("ihs_light ~ treatment + " + " + ".join(TREND_TERMS) +
                 " | district_id + region^year",
                 data=add_trend_terms(sub), vcov={"CRV1": "district_id"})
    b, se = m.coef()["treatment"], m.se()["treatment"]
    print(f"{name:28s} {b:+.4f} ({se:.4f}) N={m._N}")

Full sample +0.2152 (0.0833) N=2224
Drop Addis region +0.1550 (0.0910) N=1984
Controls >= 50km from city +0.2143 (0.0854) N=1392

Dropping the Addis Ababa region pulls the estimate to +0.1550 (still significant at 10%, N = 1,984) and restricting controls to those at least 50 km from a city holds it at +0.2143 (significant at 5%, N = 1,392). Combined with the Section 8 agreement of Sun-Abraham, Borusyak/Gardner, and Callaway-Sant’Anna, the satellite result is robust to both the standard-error specification and the choice of comparison group. With the evidence assembled, we can return to the opening question.

12. Discussion

What we found. Yes, and the “for whom” matters as much as the “whether.” A park raises local nighttime light by about +0.215 IHS (~21%) and built-up land by ~2.6 percentage points, with the effect building over five years to a +0.48 plateau and no detectable spillover to neighbours. The four estimators agree within 0.046, and 95.4% of the Bacon weight sits on clean comparisons, so the staggered-DiD negative-weights problem is not biting. Households near a park gain durables (+0.229), housing quality (+0.248), and wealth (+0.383 SD). And the central result: average non-agricultural employment is an insignificant +0.091, yet women’s employment rises a significant +0.140, raising their decision power (+0.110) and savings (+0.315) and lowering acceptance of domestic violence (−0.210). The parks reshaped the local economy, and they did so largely through women.

So what? Two design lessons follow directly. First, on site selection: the effect fades steeply with distance from cities (−0.0335 per km to the nearest city) and is amplified by paved roads (+0.6695). A park dropped in a remote, poorly-connected woreda would do far less — proximity to existing economic centers is first-order, so place-based policy should follow the roads. Second, on sector and inclusion: because the employment and empowerment gains run through female-intensive sectors (textiles, garments), a policymaker who measured only the average employment effect would conclude the parks failed on jobs and miss their largest social return. Evaluations of place-based policy should be sex-disaggregated by default.

Limitations and the observational caveat. Be appropriately humble. The data are synthetic — calibrated to teach the methods, not to report new facts about Ethiopia. The treated group is tiny (17 woredas), so several effects are borderline; the primary-road interaction is correctly signed but imprecise, and the raw-light coefficient runs high. Most fundamentally, this is an observational study: the parks were not randomly placed, so identification rests on parallel trends, not randomization. The flat pre-trends and the null spillover support that assumption but never prove it. The adjustment here — district and region×year fixed effects, baseline-trend interactions, and the PSM-matched controls — is confounding control, not the precision-only adjustment of a randomized experiment. The ATT we report is the effect on these parks in this setting; it travels only as far as that.

13. Reproduction audit: synthetic data vs the paper

Because the data are synthetic, transparency demands we line our numbers up against the published ones. The data-generating process was tuned to match the paper coefficient by coefficient; signs and significance agree throughout, and the headline magnitudes land within about 0.02. We also disclose four documented gaps rather than paper over them.

Result	This synthetic data	Paper (reported)	Sign	Significance
Table 1: IHS light, no trends	+0.2704***	≈ +0.265**	✓	✓
Table 1: IHS light, with trends	+0.2152***	≈ +0.214**	✓	✓
Table 1: raw light, with trends	+1.6181***	≈ +1.276**	✓	partial (high)
Table 1: impervious, with trends	+0.0263***	≈ +0.028**	✓	✓
Table 2: `nearby` spillover (IHS)	+0.0648 (ns)	≈ 0 (ns)	✓	✓
Table 3: distance to nearest city	−0.0335***	negative & sig.	✓	✓
Table 4: paved-road density	+0.6695**	positive	✓	✓
Table 4: primary-road density	+0.3264 (ns)	positive	✓	partial (ns)
Table 5: durables (controls)	+0.2286***	≈ +0.226***	✓	✓
Table 5: housing (controls)	+0.2480***	≈ +0.252***	✓	✓
Table 5: wealth (controls)	+0.3825***	≈ +0.409*	✓	✓
Table 6: employment, full sample	+0.0911 (ns)	≈ +0.110 (ns)	✓	✓
Table 6: employment, women	+0.1404***	≈ +0.133***	✓	✓
Table 6: employment, men	+0.0176 (ns)	≈ +0.015 (ns)	✓	✓
Table 7: decision power	+0.1096***	≈ +0.103***	✓	✓
Table 7: savings account	+0.3153***	≈ +0.318***	✓	✓
Table 7: DV acceptance	−0.2096***	≈ −0.212***	✓	✓
Staggered: TWFE / SA / BG / CS ATT	+0.270 / +0.299 / +0.302 / +0.256	“closely track baseline”	✓	✓

Stars: *** p < .01, ** p < .05, * p < .10.

Of the audited cells, the great majority land on target in sign, significance, and magnitude (within ~0.02 on the headline coefficients). Four gaps are documented and bounded. (1) The raw-light coefficient runs high (~1.6 vs 1.276): keeping treated woredas essentially always-lit (for a clean IHS event study with only 17 clusters) removes the zero-dilution that would otherwise pull the raw mean down. This is a deliberate bright-base device that protects the on-target IHS coefficient. (2) The primary-road interaction is correctly signed and on-magnitude but clearly non-significant (t = 0.39), because the 17-treated sample cannot make both road interactions precise at once. (3) Light levels are not matched: treated woredas carry an intrinsically bright base (~4–5) and controls a dim one (~0.1), unlike the paper’s PSM-matched 0.94/0.87, which is exactly why the EDA figure is baseline-normalized. (4) The decision-power mean (~0.88) sits a touch below the paper’s 0.899 because the linear-probability clipping ceiling caps the achievable effect. Everywhere else, direction and significance track the paper closely. The synthetic data reproduce the paper’s findings; they are not, and are not claimed to be, the paper’s data.
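The audit can also be run mechanically. The sketch below takes the rows of the table for which the paper column gives a point estimate to three decimals, computes the absolute gap, and flags gaps above 0.02. The tolerance and the row labels are illustrative choices.

```python
TOLERANCE = 0.02  # illustrative threshold, matching the "within ~0.02" wording

# label -> (synthetic estimate, paper estimate), copied from the audit table
audit = {
    "IHS light, no trends":     (0.2704, 0.265),
    "IHS light, with trends":   (0.2152, 0.214),
    "raw light, with trends":   (1.6181, 1.276),
    "impervious, with trends":  (0.0263, 0.028),
    "durables (controls)":      (0.2286, 0.226),
    "housing (controls)":       (0.2480, 0.252),
    "wealth (controls)":        (0.3825, 0.409),
    "employment, full sample":  (0.0911, 0.110),
    "employment, women":        (0.1404, 0.133),
    "employment, men":          (0.0176, 0.015),
    "decision power":           (0.1096, 0.103),
    "savings account":          (0.3153, 0.318),
    "DV acceptance":            (-0.2096, -0.212),
}

gaps = {label: abs(synth - paper) for label, (synth, paper) in audit.items()}
flagged = [label for label, gap in gaps.items() if gap > TOLERANCE]

for label, gap in gaps.items():
    print(f"{label:26s} gap {gap:.4f}{'  <-- above tolerance' if label in flagged else ''}")
```

Under this strict cutoff the raw-light coefficient is flagged, as documented, and the wealth index also misses by a small margin (a gap of about 0.027).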

14. Summary and takeaways

Number to remember	Value
Light ATT (with trends)	+0.2152*** (~21%)
Four-estimator spread	0.046 IHS units
Clean Bacon weight	95.4%
Wealth-index ATT	+0.383 SD
Female employment ATT	+0.140*** (vs +0.091 ns full sample)
Light SE: naive → Conley-HAC	0.0329 → 0.0799

A park raises local activity ~21% — and the staggered-bias worry does not bite. The with-trends TWFE ATT is +0.2152***, and TWFE, Sun-Abraham (+0.299), Borusyak/Gardner (+0.302), and Callaway-Sant’Anna (+0.256) all agree within 0.046 because 95.4% of the Bacon weight is clean treated-vs-never comparisons. When a large never-treated pool dominates, plain TWFE is barely contaminated.
The average hides the finding — split by sex. Full-sample non-ag employment is an insignificant +0.091, but the female effect is +0.140*** and the male effect is ~0. The empowerment cascade follows: decision power +0.110, savings +0.315, and acceptance of domestic violence −0.210, all highly significant. Heterogeneity analysis turned a null into the study’s headline.
Honest inference matters but does not overturn the result (a limitation in spirit). With all 17 treated woredas clustered in space, the Conley-HAC SE (0.0799) is 2.43× the naive HC0 SE (0.0329); the ATT still clears significance (t = 2.69), and the small treated sample is why the primary-road interaction and the raw-light level remain imprecise or off-target.
Next step. Re-estimate the event study with a Callaway-Sant’Anna dynamic aggregation to compare its lag-by-lag path against the saturated one, add a sensitivity analysis (à la Rambachan-Roth) that asks how large a pre-trend violation would overturn the +0.215 ATT, and test whether labor-intensive parks drive the female-employment effect more than capital-intensive ones.

15. Exercises

Drop the anchor cohort. Re-run the staggered estimators after excluding the single 2008 woreda, so all treated units come from the 2014–2020 build-out. Do the four ATTs still agree within 0.05, and does the Goodman-Bacon clean-weight share change? What does that tell you about how much one early cohort drives the comparison structure?
Stress-test the gender result. Add an interaction treatment:sex to the full-sample employment regression instead of splitting the data. Does the interaction coefficient recover the female-minus-male gap (≈ +0.123)? Why might the pooled-interaction and split-sample approaches give slightly different standard errors?
Move the collapse year. The naive 2×2 in Section 6.1 collapsed the design at the median opening year (2017). Recompute it collapsing at 2014 and at 2019. How much does the blended ATT move, and why does the choice of collapse year matter for a staggered design but not for a single-date one?

16. References

Huang, G., Wang, M., & Xu, H. (2026). The socioeconomic impacts of industrial parks in Ethiopia. Journal of Urban Economics. https://doi.org/10.1016/j.jue.2026.103867
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230. https://doi.org/10.1016/j.jeconom.2020.12.001
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175–199. https://doi.org/10.1016/j.jeconom.2020.09.006
Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting event-study designs: Robust and efficient estimation. Review of Economic Studies, 91(6), 3253–3285. https://doi.org/10.1093/restud/rdae007
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277. https://doi.org/10.1016/j.jeconom.2021.03.014
Conley, T. G. (1999). GMM estimation with cross-sectional dependence. Journal of Econometrics, 92(1), 1–45. https://doi.org/10.1016/S0304-4076(98)00084-0
pyfixest documentation — https://pyfixest.org/
diff-diff documentation — https://github.com/igerber/diff-diff
Ethiopia Demographic and Health Surveys (DHS), 2000–2019 — The DHS Program, ICF / Ethiopian Public Health Institute. https://dhsprogram.com/
Chen, Z., Yu, B., Yang, C., et al. (2021). An extended time series (2000–2018) of global NPP-VIIRS-like nighttime light data. Earth System Science Data, 13(3), 889–906. https://doi.org/10.5194/essd-13-889-2021
Zhang, X., Liu, L., Zhao, T., et al. (2022). GISD30: Global 30-m impervious-surface dynamic dataset. Earth System Science Data, 14(4), 1831–1856. https://doi.org/10.5194/essd-14-1831-2022

This tutorial is a teaching replication built on synthetic data; see the data note in Section 1 and the reproduction audit in Section 13. The companion script.py regenerates every figure and table.

Staggered Synthetic Difference-in-Differences (SDID) in Stata: Gender Quotas and Women in Parliament

Sun, 07 Jun 2026 00:00:00 +0000

Abstract

Most real-world policies are not adopted on a single clock — parliamentary gender quotas, minimum-wage laws, and carbon taxes arrive in different units in different years, a staggered-adoption design where naive two-way fixed-effects difference-in-differences quietly breaks by using already-treated units as controls and placing negative weights on some effects. This tutorial extends synthetic difference-in-differences (SDID) to staggered adoption and applies it in Stata to a question in political economy: do parliamentary gender quotas raise the share of women in national parliaments? It uses the quota_example dataset distributed with the sdid package (Bhalotra, Clarke, Gomes & Venkataramani, 2023) — a balanced panel of 119 countries observed annually from 1990 to 2015 (3,094 observations), in which 9 countries adopt a quota across 7 cohorts (2000, 2002, 2003, 2005, 2010, 2012, 2013) and 110 remain never-treated. The method estimates a separate, clean SDID per cohort against the never-treated donor pool, then aggregates the cohort effects into the overall ATT with non-negative treated-period-share weights, complemented by the sdid_event event study and bootstrap, jackknife, and placebo inference. The overall ATT is +8.03 percentage points (SE 3.74, p = 0.032), robust to a log-GDP control (8.05 optimized, 8.06 projected), but the cohort effects swing from −3.5 to +21.8 points, with flat pre-adoption placebos supporting parallel synthetic trends and dynamic effects that appear immediately and persist for over a decade. The lesson is that a single headline number summarizes real heterogeneity, and that transparent, non-negative cohort weighting is essential when treatment timing is staggered.

1. Overview

In a previous tutorial, one unit — California — adopted one policy — Proposition 99 — in one year — 1989. That block design is the textbook setting for synthetic difference-in-differences (SDID). But most real policies do not arrive on a single clock. Parliamentary gender quotas, minimum-wage laws, carbon taxes, and clean-air regulations are adopted by different units in different years. This is the staggered adoption design, and it is where naive panel methods quietly break.

This tutorial extends SDID to staggered adoption and applies it in Stata to a real question in political economy: do parliamentary gender quotas raise the share of women in national parliaments? We use the quota_example dataset that ships with the sdid package — 119 countries observed annually from 1990 to 2015, in which 9 countries adopt a gender quota across 7 different cohorts (2000, 2002, 2003, 2005, 2010, 2012, and 2013).

The headline is a story about heterogeneity. The overall effect of quotas is about +8 percentage points of women in parliament, but the cohort-by-cohort effects swing from −3.5 to +21.8 points. A single number hides that range — and, as we will see, the naive two-way fixed-effects regression that most people reach for first can hide even more.

Why does staggered timing break the naive regression?

The workhorse for panel policy evaluation is the two-way fixed-effects (TWFE) regression — unit dummies, time dummies, and a treatment dummy. With one adoption date it estimates a clean difference-in-differences. With staggered timing and heterogeneous effects, the same regression implicitly uses already-treated units as controls for later adopters (“forbidden comparisons”). The result is a variance-weighted average of every 2×2 comparison in the panel, and some of those weights can be negative — so the estimate can even take the wrong sign (Goodman-Bacon, 2021; de Chaisemartin & D’Haultfœuille, 2020). Staggered SDID sidesteps this by estimating a separate, clean SDID effect for each adoption cohort and aggregating with transparent, non-negative weights.

graph TD
subgraph "Block design — predecessor (Prop 99)"
B1["California<br/>adopts 1989"] --> BATT["one ATT"]
B2["other states<br/>never treated"] --> BATT
end
subgraph "Staggered design — this post (gender quotas)"
S1["cohort 2000"] --> SATT["aggregate ATT"]
S2["cohort 2002"] --> SATT
S3["cohorts 2003 to 2013"] --> SATT
SC["110 never-treated<br/>controls"] -.donor pool.-> SATT
end
style B1 fill:#d97757,stroke:#141413,color:#fff
style B2 fill:#6a9bcc,stroke:#141413,color:#fff
style BATT fill:#00d4c8,stroke:#141413,color:#141413
style S1 fill:#d97757,stroke:#141413,color:#fff
style S2 fill:#d97757,stroke:#141413,color:#fff
style S3 fill:#d97757,stroke:#141413,color:#fff
style SC fill:#6a9bcc,stroke:#141413,color:#fff
style SATT fill:#00d4c8,stroke:#141413,color:#141413

1.1 Learning objectives

By the end of this tutorial you will be able to:

Explain why staggered adoption breaks naive TWFE difference-in-differences, and how per-cohort SDID avoids the forbidden-comparison problem.
Derive the SDID estimator from first principles — unit weights $\omega$, time weights $\lambda$, and the weighted two-way fixed-effects objective — and the rule that aggregates cohort-specific effects $\hat{\tau}_a$ into one overall ATT.
Estimate the effect of gender quotas with sdid on a staggered panel, add a covariate two different ways (optimized vs projected), and choose among bootstrap, jackknife, and placebo inference.
Read an SDID event-study plot produced by sdid_event, distinguishing pre-trend placebo coefficients from post-period dynamic effects.

2. Key concepts at a glance

Each card gives a plain-language definition, a concrete example from this quota study, and an everyday analogy. Open any term that is unfamiliar.

1. ATT (average treatment effect on the treated) — the question we actually answer.

Definition. The effect of adopting a quota on the women-in-parliament share, in the countries that adopted one, averaged over their post-adoption years. It is not the effect a quota would have everywhere — only where one was actually tried.

Example. Our headline ATT is +8.0 percentage points: across the nine adopting countries, quotas raised women’s parliamentary share by about eight points relative to their no-quota counterfactual.

Analogy. Like asking “how much did the patients who took the drug improve?” — not “how much would everyone improve?” You measure only the units that were actually treated.

2. Synthetic control — a made-to-order comparison country.

Definition. A weighted blend of never-treated “donor” countries, built so its pre-adoption path mimics the treated cohort. It stands in for the unobservable counterfactual: what the cohort’s outcome would have been without a quota.

Example. The 2002 cohort’s synthetic control mixes dozens of donors (Belgium, Paraguay, Cuba, …) so that, before 2002, the blend tracks the cohort’s trend — then keeps going as the cohort would have without the law.

Analogy. A stunt double cast to match the lead actor’s build and movement — close enough that, in the shots you cannot film the star, the double stands in convincingly.

3. Unit weights (ω) — how much each donor counts.

Definition. Non-negative weights, one per donor country, summing to one, that build the synthetic control. Each cohort gets its own ω.

Example. In the 2000 cohort, 80 donors receive nonzero weight — Argentina ≈ 0.061, Guatemala ≈ 0.057, Austria ≈ 0.045 — a diffuse blend rather than one or two stand-ins.

Analogy. A recipe calling for many ingredients in small, precise amounts: no single one dominates, so the dish survives a bad batch of any one ingredient.

4. Time weights (λ) — which "before" years matter.

Definition. Non-negative weights on the pre-adoption years, summing to one, that decide which pre-periods define the baseline. They up-weight the years most like the post-period.

Example. For the 2002 cohort, λ concentrates on the late 1990s and 2001 rather than spreading evenly across 1990–2001 — the recent past is the relevant baseline.

Analogy. Forecasting tomorrow’s weather, you trust last week far more than the same date five years ago. Time weights formalize “recent and similar counts more.”

5. Adoption cohort (a) — units that switch on together.

Definition. The set of countries that first adopt a quota in the same calendar year. Staggered SDID runs one self-contained SDID per cohort, always against the never-treated controls.

Example. There are seven cohorts — 2000, 2002, 2003, 2005, 2010, 2012, 2013 — with two countries each in 2002 and 2003, and one in the rest.

Analogy. School graduating classes: the “class of 2002” and the “class of 2010” share a start date and are analyzed as groups, even though all attend the same school.

6. Staggered adoption & the forbidden comparison — why the naive regression breaks.

Definition. Staggered adoption means units are treated at different times. The hazard: a two-way fixed-effects regression can use already-treated units as controls for later adopters — a “forbidden comparison” that places negative weights on some effects and can flip the sign.

Example. When the 2012 cohort adopts, a naive TWFE quietly treats the 2002 cohort — already treated, already changed — as part of its control group. Staggered SDID never does this: each cohort is compared only to the 110 never-treated countries.

Analogy. Timing a late runner against runners who already crossed the line and slowed to a walk — your “control” is contaminated because it has already run the race.

7. Event time (relative period) — every cohort on its own clock.

Definition. Time measured relative to each cohort’s own adoption year (… −2, −1, 0, +1 …), so cohorts that adopted in different calendar years can be lined up and averaged.

Example. Event time 0 is the year 2000 for the first cohort but 2013 for the last; re-centring lets us ask “what happens three years after a quota?” across all cohorts at once.

Analogy. Comparing marathon runners by their own start gun, not the wall clock: a runner who started at 9:05 and one who started at 9:20 are both “at mile 10” measured from their own start.

8. ATT aggregation — from many cohort effects to one number.

Definition. The overall ATT is a weighted average of the cohort effects, each weighted by its share of treated unit-by-post-period observations — earlier, longer-exposed, larger cohorts count more.

Example. The seven cohort effects span −3.5 to +21.8; weighted by treated country-years they average to +8.0 (the plain unweighted mean would be ≈ 7.1).

Analogy. A course grade that weights the final exam more than a pop quiz: the cohorts you observe for longer carry more of the final mark.

9. Pre-trend placebo test — the assumption you can see.

Definition. Event-study coefficients for the pre-adoption periods. If treated and synthetic-control countries moved in parallel before treatment, these sit near zero — a falsification check.

Example. For the 2002 cohort, all twelve pre-period placebos fall in [−0.2, +0.8] points — flat, so we cannot reject parallel synthetic trends.

Analogy. Checking a scale by weighing nothing first: if it does not read zero when empty, you distrust every later reading. Flat placebos are that “reads zero when empty” check.

10. Bootstrap, jackknife, placebo — three rulers for uncertainty.

Definition. Three ways to attach a standard error to the ATT. With many treated units all three are available; they share one point estimate but report different spread.

Example. On the two-cohort subsample the ATT is 10.3 for all three, but the SE is 4.7 (bootstrap), 6.0 (jackknife, most conservative), and 2.3 (placebo, tightest).

Analogy. Measuring a table with a tape, a folding ruler, and a laser: they agree on the length but disagree on the error bars — the cautious carpenter reports the widest.

3. The data: gender quotas across 119 countries

We use quota_example.dta, the balanced panel from Bhalotra, Clarke, Gomes & Venkataramani (2023) distributed with the sdid package. The outcome is the percentage of seats held by women in the national parliament; the treatment is the adoption of a reserved-seat gender quota; the covariate is log GDP per capita.

webuse set www.damianclarke.net/stata/
webuse quota_example, clear
label variable quota "Parliamentary gender quota"
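* country is a string, so build a numeric panel id for xtset
encode country, generate(country_id)
xtset country_id year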
codebook country year quota womparl lngdp, compact

Variable Obs Unique Mean Min Max Label
----------------------------------------------------------------------------
country 3094 119 . . . Country
year 3094 26 2002.5 1990 2015 Year
quota 3094 2 .0303814 0 1 =1 if country has a gender quota
womparl 3094 449 14.96531 0 63.8 Women in parliament
lngdp 2990 2956 9.154291 5.8701 11.61789 log(GDP)
----------------------------------------------------------------------------

The panel is balanced: 119 countries times 26 years equals 3,094 observations, with no gaps in the outcome or treatment (lngdp has 104 missing values, which will matter only when we add the covariate). The treatment indicator quota equals one for just 3% of observations, a reminder that treated country-years are scarce. Crucially, quota is absorbing — once a country adopts a quota it stays treated — which SDID requires.
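The three properties claimed above (one row per country-year, a balanced panel, an absorbing treatment) can each be checked with an assertion. The sketch below is illustrative: `n_years`, `switched_off` and `n_treated` are helper names introduced here, and the block works on a preserved copy so that whatever is in memory is left untouched. Given the 3.04% treated share of 3,094 rows, the final count should be 94 treated country-years.

```stata
* Work on a preserved copy so the data in memory is left as it was
preserve
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* 1. One row per country-year (isid errors out if duplicates exist)
isid country year

* 2. Balanced: every country is observed in all 26 years
bysort country: generate n_years = _N
assert n_years == 26

* 3. Absorbing: within a country, quota never falls from 1 back to 0
bysort country (year): generate byte switched_off = (quota < quota[_n-1]) if _n > 1
assert switched_off != 1

* 4. Number of treated country-years
count if quota == 1
scalar n_treated = r(N)
display "treated country-years: " n_treated
restore
```

If any assertion fails on your own data, fix the panel before estimation.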

Variable	Role	Symbol	Description
`country`	unit	$i$	119 countries (9 ever-treated, 110 never-treated)
`year`	time	$t$	1990–2015 (26 years)
`womparl`	outcome	$Y_{it}$	% women in the national parliament
`quota`	treatment	$W_{it}$	1 once a country has a quota, 0 before / never
`lngdp`	covariate	$X_{it}$	log GDP per capita

The estimand. Our target is the average treatment effect on the treated (ATT): the effect of adopting a quota on the women-in-parliament share in the countries that adopted one, averaged over their post-adoption years. Formally,

$$ \tau = \frac{1}{N_{tr}\, T_{post}} \sum_{i:\, W_i = 1}\ \sum_{t > T_{pre}} \left[\, Y_{it}(1) - Y_{it}(0) \,\right] $$

In words: for every treated country and every post-adoption year, take the gap between the share of women with a quota, $Y_{it}(1)$, and the share that would have occurred without one, $Y_{it}(0)$ — then average. The first term is observed; the second is the counterfactual that the synthetic control must impute, because we never see a quota-adopting country in the parallel world where it abstained.

An observational, not experimental, setting. Quotas are not randomly assigned. Countries that adopt them early may differ systematically — they may be wealthier, more democratic, or already on a rising trajectory of women’s representation. That is exactly why we need a method that builds a credible counterfactual from comparison countries rather than assuming a simple before/after change would have held. Identification rests on assumptions we will keep visible: that treated and synthetic-control countries share a common (synthetic) trend absent treatment, no anticipation of the quota, no spillovers across countries, and that adoption timing is not itself driven by the outcome’s future path.

3.1 The staggered structure

Before modelling, let us see the timing directly. The adoption year is the first year a country is treated; we tabulate the cohorts.

bysort country (year): egen firsttreat = min(cond(quota==1, year, .))
preserve
keep country firsttreat
duplicates drop
tab firsttreat, missing
restore

 firsttreat | Freq. Percent Cum.
------------+-----------------------------------
2000 | 1 0.84 0.84
2002 | 2 1.68 2.52
2003 | 2 1.68 4.20
2005 | 1 0.84 5.04
2010 | 1 0.84 5.88
2012 | 1 0.84 6.72
2013 | 1 0.84 7.56
. | 110 92.44 100.00
------------+-----------------------------------
Total | 119 100.00

Nine countries adopt a quota, spread across seven cohorts; the 2002 and 2003 cohorts contain two countries each, the rest one. The remaining 110 countries are never treated — they form the donor pool from which every cohort’s synthetic control is built. This staircase of adoption dates is the defining feature of a staggered design, and the reason a single “post” dummy is too blunt.

4. Exploratory analysis with `panelview`

A staggered design is best understood by looking at it. The panelview command (Xu & Hua) draws two pictures we need: a heatmap of who is treated when, and the raw outcome trajectories colored by treatment status.

ssc install panelview, replace
panelview womparl quota, i(country) t(year) type(treat) bytiming
panelview womparl quota, i(country) t(year) type(outcome)

The treatment heatmap (type(treat), sorted with bytiming) makes the staggered structure unmistakable: the dark treated cells appear in the top-right corner as a staircase, each step a different cohort switching on between 2000 and 2013, against a sea of never-treated controls. This is the visual opposite of a block design, where every treated cell would switch on in the same column.

The outcome plot (type(outcome)) overlays all 119 women-in-parliament series, with the 9 treated countries in orange. Several treated countries start near the bottom of the distribution and climb steeply after their adoption year — a hint of a positive effect — but the climbs begin at different times, and a few treated countries barely move. No single “treated average” line could summarize this; we need cohort-specific counterfactuals.

preserve
generate evertreat = !missing(firsttreat)
collapse (mean) womparl, by(evertreat year)
* ... reshape and plot ever- vs never-adopting means, then restore ...

Collapsing to group means tells a cautionary tale. The ever-adopting countries (orange) start the 1990s below the never-adopting countries (about 4% vs 10% women in parliament) and end above them by 2015 (about 23% vs 22%). A naive eyeball difference-in-differences on these two lines would be badly confounded: the groups began at different levels and the “treated” line aggregates countries that switched on in seven different years. The raw means motivate the machinery to come — we must compare each cohort to a tailored synthetic control, not to the grand average.

5. Synthetic difference-in-differences from first principles

Before tackling staggered timing, fix ideas with a single cohort. SDID (Arkhangelsky et al., 2021) is a weighted two-way fixed-effects regression. It chooses an ATT, a constant, unit fixed effects, and time fixed effects to minimize a weighted sum of squared residuals:

$$ \left(\hat{\tau}, \hat{\mu}, \hat{\alpha}, \hat{\beta}\right) = \arg\min_{\tau,\mu,\alpha,\beta} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau\right)^{2}\, \hat{\omega}_i\, \hat{\lambda}_t $$

In words: run a difference-in-differences regression, but weight each observation by a unit weight $\hat{\omega}_i$ times a time weight $\hat{\lambda}_t$. Here $\alpha_i$ is a country fixed effect, $\beta_t$ a year fixed effect, $W_{it}$ the treatment dummy, and $\tau$ the ATT we want. Set all weights equal and you recover ordinary DiD; the weights are what make SDID special. They are not free parameters — each solves its own optimization.
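Because the weights are fixed before this regression is run, its solution has a closed form. As a sketch for the single-cohort case (the symbol $\hat{\delta}_i$ is introduced here and follows Arkhangelsky et al., 2021; treated units are weighted uniformly by $1/N_{tr}$ and post-periods by $1/T_{post}$), the ATT is a weighted double difference:

$$ \hat{\delta}_i = \frac{1}{T_{post}} \sum_{t=T_{pre}+1}^{T} Y_{it} - \sum_{t=1}^{T_{pre}} \hat{\lambda}_t\, Y_{it}, \qquad \hat{\tau} = \frac{1}{N_{tr}} \sum_{i:\, W_i = 1} \hat{\delta}_i \ -\ \sum_{i=1}^{N_{co}} \hat{\omega}_i\, \hat{\delta}_i $$

Each $\hat{\delta}_i$ is a unit's post-period average minus its $\lambda$-weighted pre-period baseline; $\hat{\tau}$ is the treated average of that change minus the $\omega$-weighted control average. Setting $\hat{\omega}_i = 1/N_{co}$ and $\hat{\lambda}_t = 1/T_{pre}$ turns this into the textbook two-group, two-period DiD.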

The unit weights are chosen so that a weighted blend of control countries tracks the treated cohort across the pre-period:

$$ \hat{\omega} = \arg\min_{\omega_0,\, \omega \ge 0} \sum_{t=1}^{T_{pre}} \left(\omega_0 + \sum_{i=1}^{N_{co}} \omega_i\, Y_{it} - \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} Y_{it}\right)^{2} + \zeta^{2}\, T_{pre}\, \lVert \omega \rVert^{2} $$

The bracketed term asks the synthetic control $\sum_i \omega_i Y_{it}$ (plus an intercept $\omega_0$) to match the treated average in every pre-adoption year. The intercept $\omega_0$ is the SDID twist: it lets the synthetic match the treated trend without matching its level, because any constant level gap is later absorbed by the unit fixed effect $\alpha_i$. The final term is a ridge penalty with regularization strength $\zeta$; it spreads weight across many donors instead of concentrating it on a few, which stabilizes the estimate. (Synthetic control, by contrast, drops $\omega_0$ and the penalty and must match the level too.)

The time weights are the mirror image — they pick the pre-period years that best predict each control country’s post-period average:

$$ \hat{\lambda} = \arg\min_{\lambda_0,\, \lambda \ge 0} \sum_{i=1}^{N_{co}} \left(\lambda_0 + \sum_{t=1}^{T_{pre}} \lambda_t\, Y_{it} - \frac{1}{T_{post}} \sum_{t=T_{pre}+1}^{T} Y_{it}\right)^{2} + \zeta_{\lambda}^{2}\, N_{co}\, \lVert \lambda \rVert^{2} $$

Years that look most like the post-period get the most weight, so the “before” comparison is built from the most relevant history rather than a flat average over possibly-irrelevant early years. The two weighting schemes together are what distinguish SDID from its cousins, as the table summarizes.

Method	Unit weights $\omega$	Time weights $\lambda$	Unit FE $\alpha_i$	Must match
DiD	uniform	uniform	yes	trend on all controls
Synthetic control	optimized	uniform	no	level and trend
SDID	optimized	optimized	yes	trend (level gap allowed)
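Both sets of weights can be inspected after estimation. The sketch below isolates the 2002 cohort and lists them. It assumes that the installed version of sdid accepts `vce(noinference)` and stores the weights in `e(lambda)` and `e(omega)`, as described in Clarke et al. (2024); check `help sdid` if either is missing. The whole block runs on a preserved copy of the data.

```stata
* Work on a preserved copy so the data in memory is left as it was
preserve
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* Adoption year per country (missing for never-treated countries)
bysort country (year): egen firsttreat = min(cond(quota==1, year, .))

* Keep the 2002 cohort plus the never-treated donor pool
keep if firsttreat == 2002 | missing(firsttreat)

* Point estimate only: vce(noinference) is assumed to skip resampling
sdid womparl country year quota, vce(noinference)

* Assumed stored results: time weights (lambda) and unit weights (omega)
matrix list e(lambda)
matrix list e(omega)
restore
```

Time weights concentrated on a few years and unit weights spread over many donors are the pattern the ridge penalty is designed to produce.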

6. The staggered extension: per-cohort effects and their aggregation

Staggered SDID is a disarmingly simple idea: do the single-cohort analysis once per adoption cohort, then average. For each cohort $a$, take only that cohort’s treated countries plus the pure never-treated controls, solve the SDID problem above on that sub-panel to get its own $\hat{\omega}_a$, $\hat{\lambda}_a$, and cohort effect $\hat{\tau}_a$. Because each cohort is compared only to never-treated controls, an already-treated unit is never used as a control for a later adopter — precisely the contamination that breaks naive TWFE.

graph LR
POOL["110 never-treated<br/>controls (donor pool)"]
C1["Cohort 2000<br/>+ controls"]
C2["Cohort 2002<br/>+ controls"]
CD["Cohorts 2003…2013<br/>+ controls"]
T1["SDID &rarr; &tau;<sub>2000</sub> = 8.4"]
T2["SDID &rarr; &tau;<sub>2002</sub> = 7.0"]
TD["SDID &rarr; &tau;<sub>a</sub><br/>(&minus;3.5 … +21.8)"]
ATT["Aggregate ATT = 8.0<br/>weighted by treated periods"]
POOL --> C1 --> T1 --> ATT
POOL --> C2 --> T2 --> ATT
POOL --> CD --> TD --> ATT
style POOL fill:#6a9bcc,stroke:#141413,color:#fff
style C1 fill:#d97757,stroke:#141413,color:#fff
style C2 fill:#d97757,stroke:#141413,color:#fff
style CD fill:#d97757,stroke:#141413,color:#fff
style T1 fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style T2 fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style TD fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style ATT fill:#00d4c8,stroke:#141413,color:#141413
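The per-cohort logic in the diagram can be mimicked by hand with a loop: keep one cohort plus the never-treated pool, estimate, and repeat. This is a sketch of the idea, not the package's internal code. It assumes sdid accepts `vce(noinference)` and stores the point estimate in `e(ATT)`. The printed values should be close to the cohort effects that the single sdid call returns in `e(tau)`, but that match is not guaranteed here.

```stata
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* Adoption year per country (missing for never-treated countries)
bysort country (year): egen firsttreat = min(cond(quota==1, year, .))

foreach a of numlist 2000 2002 2003 2005 2010 2012 2013 {
    preserve
    * This cohort's treated countries plus the never-treated donor pool
    keep if firsttreat == `a' | missing(firsttreat)
    * Single-cohort SDID, point estimate only (no resampling)
    quietly sdid womparl country year quota, vce(noinference)
    display "cohort `a': tau = " %8.4f e(ATT)
    restore
}
```

No already-treated country ever enters a later cohort's sub-panel, which is the whole point of the design.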

The overall ATT aggregates the cohort effects with non-negative weights equal to each cohort’s share of treated unit-by-post-period observations:

$$ \widehat{ATT} = \sum_{a \in \mathcal{A}} \frac{N_{tr}^{a}\, T_{post}^{a}}{\sum_{b \in \mathcal{A}} N_{tr}^{b}\, T_{post}^{b}}\ \hat{\tau}_a $$

In words: a cohort counts in proportion to how many treated country-years it contributes. The 2000 cohort, treated for 16 years (2000–2015), carries more weight than the 2013 cohort, treated for only 3. This is the staggered generalization of single-cohort SDID, and — unlike TWFE — every weight is positive and interpretable. (When each cohort has one treated unit, this reduces to the post-period share $T_{post}^{a}/T_{post}$ from Clarke et al., 2024.)
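The aggregation weights can be read straight off the data, because $N_{tr}^{a}\, T_{post}^{a}$ is simply the number of treated country-years in cohort $a$. In the sketch below, `treated_obs`, `total_obs`, `weight` and `total_treated` are helper names introduced here. Given the cohort sizes and a panel ending in 2015, the counts should be 16, 28, 26, 11, 6, 4 and 3, for a total of 94.

```stata
* Work on a preserved copy so the data in memory is left as it was
preserve
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* Adoption year per country (missing for never-treated countries)
bysort country (year): egen firsttreat = min(cond(quota==1, year, .))

* Treated country-years per cohort = N_tr^a * T_post^a
keep if quota == 1
contract firsttreat, freq(treated_obs)

* Each cohort's share of all treated country-years is its weight
egen total_obs = total(treated_obs)
generate weight = treated_obs / total_obs
list firsttreat treated_obs weight, noobs

scalar total_treated = total_obs[1]
restore
```

The two-country 2002 and 2003 cohorts carry the largest weights because they combine two treated units with long post-periods.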

7. Estimation in Stata

One command does the whole staggered procedure. We request bootstrap inference and a fixed seed for reproducibility.

sdid womparl country year quota, vce(bootstrap) seed(1213)
matrix list e(tau)

Synthetic Difference-in-Differences Estimator
-----------------------------------------------------------------------------
womparl | ATT Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------
quota | 8.03410 3.74040 2.15 0.032 0.70305 15.36516
-----------------------------------------------------------------------------

The overall ATT is +8.03 percentage points (SE 3.74, $t=2.15$, $p=0.032$), with a 95% confidence interval of [0.70, 15.37] that excludes zero. Substantively: adopting a parliamentary gender quota raises the share of women in parliament by about eight percentage points in the adopting countries — a large effect against a sample mean of 15%, and statistically distinguishable from no effect at the 5% level.

The single number, though, is the average of a very heterogeneous set of cohort effects, returned in e(tau):

T[7,3]
Tau Std.Err. Time
r1 8.3888685 .68278345 2000
r2 6.9677465 .64102999 2002
r3 13.952256 9.1289943 2003
r4 -3.4505431 .75603453 2005
r5 2.7490355 .44799502 2010
r6 21.762716 .91589982 2012
r7 -.82032354 .83151601 2013

The cohort effects span an enormous range: from −3.5 points (2005 cohort) to +21.8 points (2012 cohort), with the 2003 cohort essentially uninformative (SE 9.13, a confidence interval that runs from −4 to +32). The teal line marks the aggregate ATT of 8.0. Notice that this aggregate is not the simple average of the seven cohort effects, which would be about 7.1. It is the treated-period-weighted average from the aggregation formula, which up-weights the earlier, longer-exposed 2000, 2002, and 2003 cohorts. The lesson of the figure is that “+8 points on average” is a summary of real heterogeneity, not a universal constant; some quotas were transformative, others did nothing measurable.
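The gap between the weighted and unweighted averages is easy to verify with matrix arithmetic. The sketch below copies the seven cohort effects from the `e(tau)` listing and pairs them with the treated country-years per cohort (16, 28, 26, 11, 6, 4, 3, from the cohort sizes and a panel ending in 2015). The matrix and scalar names are illustrative. It should print 8.0341 and 7.0785.

```stata
* Cohort effects copied from e(tau), cohorts 2000 to 2013
matrix tau = (8.3888685 \ 6.9677465 \ 13.952256 \ -3.4505431 \ 2.7490355 \ 21.762716 \ -.82032354)

* Treated country-years per cohort: N_tr * T_post
matrix n = (16 \ 28 \ 26 \ 11 \ 6 \ 4 \ 3)

* Weighted and unweighted sums via inner products
matrix ones = J(7, 1, 1)
matrix wsum = n' * tau
matrix nsum = n' * ones
matrix usum = ones' * tau

scalar att_weighted   = wsum[1,1] / nsum[1,1]
scalar att_unweighted = usum[1,1] / 7

display "weighted ATT   = " %6.4f att_weighted
display "unweighted ATT = " %6.4f att_unweighted
```

The weighted figure matches the headline ATT from the sdid output, which confirms that the aggregate is nothing more than this treated-period-weighted mean.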

To see the synthetic-control machinery underneath one cohort, the figure below plots the 2002 cohort against its synthetic control. Because SDID matches the pre-period trend and lets the unit fixed effect absorb the level gap, we anchor the synthetic to the treated cohort by its $\lambda$-weighted pre-period gap so the two align before adoption.

The treated 2002 cohort (orange) and its anchored synthetic control (blue dashed) track each other closely before 2002 — the synthetic was built precisely to do so — and then diverge: the treated cohort climbs to roughly 15% women in parliament while the synthetic counterfactual reaches only about 9–10%. That post-2002 gap is the cohort effect, about +7 points, matching $\hat{\tau}_{2002}=6.97$ from e(tau).

Which pre-period years anchor that comparison? The time weights $\hat{\lambda}_t$ for the 2002 cohort do not spread evenly over 1990–2001 — they concentrate on the years just before adoption.

The bars show SDID’s baseline for the 2002 cohort leaning on the late 1990s and 2001 — the pre-adoption years whose level most resembles the post-adoption period — rather than weighting all twelve pre-years equally as a plain difference-in-differences would. This is the time-weighting half of SDID at work: it builds the “before” from the most relevant history, which is also the baseline the event study below measures against.

8. Adding a covariate: optimized vs projected

Does the quota effect simply reflect economic development — richer countries both grow GDP and elect more women? We can condition on log GDP per capita. The sdid command offers two routes, and SDID needs a balanced panel, so we first drop the country-years with missing lngdp.

drop if missing(lngdp)
sdid womparl country year quota, vce(bootstrap) seed(2022) covariates(lngdp, optimized)
sdid womparl country year quota, vce(bootstrap) seed(1213) covariates(lngdp, projected)

SDID + lngdp (optimized) ATT = 8.0515 SE = 3.0466
SDID + lngdp (projected) ATT = 8.0593 SE = 3.1191

The two methods differ in how they estimate the covariate’s coefficient. The optimized method (Arkhangelsky et al., 2021) folds the covariate adjustment into the SDID optimization itself, estimating it jointly with the weights: flexible but computationally heavy. The projected method (Kranz, 2022) instead regresses the outcome on the covariate among the untreated observations first, then runs SDID on the residuals, which is much faster and numerically more stable. Reassuringly, here the two estimates differ by less than 0.01 (8.05 and 8.06) and are essentially unchanged from the no-covariate estimate of 8.03. Controlling for income does not explain away the quota effect; the result is robust to the most obvious confounder.
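The two-step logic of the projected method can be imitated by hand. The sketch below is a rough analogue, not the package's implementation: it estimates the log-GDP slope once on all untreated observations with country and year fixed effects, subtracts the fitted covariate contribution, and runs plain SDID on the adjusted outcome. The names `country_id`, `b_gdp` and `womparl_adj` are introduced here, and because sdid may carry out its projection differently (for example within each cohort's sub-panel), the result need not match 8.0593 exactly.

```stata
* Work on a preserved copy so the data in memory is left as it was
preserve
webuse set www.damianclarke.net/stata/
webuse quota_example, clear
drop if missing(lngdp)

* Numeric country id for the absorbed fixed effect (hypothetical helper name)
encode country, generate(country_id)

* Step 1: covariate slope from untreated observations, country + year FE
areg womparl lngdp i.year if quota == 0, absorb(country_id)
scalar b_gdp = _b[lngdp]

* Step 2: remove the covariate's fitted contribution from the outcome
generate womparl_adj = womparl - b_gdp * lngdp

* Step 3: plain SDID on the adjusted outcome
sdid womparl_adj country year quota, vce(bootstrap) seed(1213)
restore
```

Step 1 is an ordinary regression, which is why the projected route is so much cheaper than optimizing the slope jointly with the weights.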

9. The event study with `sdid_event`

A single ATT — even per cohort — cannot tell us when the effect appears, or whether treated and control countries were already diverging before the quota. For that we need an event study: the treatment effect traced out by years relative to adoption. The modern sdid_event command (Ciccia, Clarke & Pailañir, 2024) computes exactly this for SDID, including pre-period placebo estimates that serve as a parallel-trends test.

The dynamic effect at event time $\ell$ is the treated-minus-synthetic gap in that period, net of the same gap at baseline, where — characteristically for SDID — the baseline is the $\lambda$-weighted pre-period average rather than a single “year −1”:

$$ \delta_{\ell} = \left(\bar{Y}_{\ell}^{tr} - \bar{Y}_{\ell}^{co}\right) - \left(\bar{Y}_{base}^{tr} - \bar{Y}_{base}^{co}\right), \qquad \bar{Y}_{base}^{g} = \sum_{t=1}^{T_{pre}} \hat{\lambda}_t\, \bar{Y}_t^{g} $$
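For a single cohort, the dynamic effects and the ATT are linked by simple averaging. As a sketch, assume the control mean in each period is the $\hat{\omega}$-weighted synthetic control and that the cohort is observed for $L$ post-adoption periods, indexed $\ell = 1, \dots, L$ to match Effect_1, Effect_2, … (the symbol $L$ is introduced here). The baseline gap is the same for every $\ell$, so it passes through the average unchanged:

$$ \frac{1}{L} \sum_{\ell=1}^{L} \delta_{\ell} = \frac{1}{L} \sum_{\ell=1}^{L} \left(\bar{Y}_{\ell}^{tr} - \bar{Y}_{\ell}^{co}\right) - \left(\bar{Y}_{base}^{tr} - \bar{Y}_{base}^{co}\right) = \hat{\tau}_a $$

The right-hand side is the average post-period gap minus the $\lambda$-weighted pre-period gap, which is the double difference that defines the cohort effect. The event study therefore decomposes $\hat{\tau}_a$ period by period rather than estimating something new.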

sdid_event handles the full staggered panel directly, returning a cohort-aggregated ATT plus dynamic effects. To read the dynamics transparently we focus the plot on the 2002 cohort — the package authors’ own worked example — which gives a clean event-time axis; the full-panel call confirms the same aggregated ATT (≈ 8.06).

ssc install sdid_event, replace
* full staggered panel: aggregated ATT + cohort-aggregated dynamic effects
sdid_event womparl country year quota, vce(bootstrap) brep(100) effects(8) placebo(5) covariates(lngdp)
* clean event study on the 2002 cohort, with all placebos
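* firsttreat is the adoption year built in Section 3.1 (missing for never-treated countries)
keep if firsttreat==2002 | missing(firsttreat)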
sdid_event womparl country year quota, vce(placebo) brep(100) placebo(all) covariates(lngdp)

 | Estimate SE LB CI UB CI Switchers
-------------+------------------------------------------------------
ATT | 6.853472 3.372744 .2428928 13.46405 2
Effect_1 | 4.086404 1.191517 1.75103 6.421778 2
Effect_2 | 9.164442 1.522799 6.179756 12.14913 2
Effect_3 | 7.938504 2.182572 3.660663 12.21635 2
... |
Placebo_1 | -.218417 .470226 -1.14006 .703227 2
Placebo_2 | .242148 .884557 -1.491584 1.975880 2
... |

This plot rewards careful reading, and there are three things to look for.

First, the baseline is $\lambda$-weighted, not “the year before.” Unlike a textbook event study that normalizes to $t=-1$, SDID measures everything against the optimally weighted pre-period average. That is why the zero line is a weighted baseline; do not read it as the single pre-adoption year.

Second, the points to the left of zero are placebo tests. Every pre-adoption coefficient (Placebo_1 through Placebo_12, event times −1 to −12) sits within a whisker of zero — ranging only from about −0.2 to +0.8. Because the treated cohort and its synthetic control moved in parallel before 2002, we cannot reject that the parallel-(synthetic-)trends assumption holds. This is the identifying assumption made visible and, here, survived.

Third, the points to the right of zero are the dynamic ATT. The effect appears immediately at adoption (Effect_1 = +4.1 points at event time 0), roughly doubles within a year or two (Effect_2 = +9.2), and then settles in the +6 to +9 range for over a decade. Quotas do not just shift the level once; they sustain a higher share of women in parliament. Aggregated by the same treated-period logic as before, these dynamic effects reproduce the cohort’s overall ATT of about +7 points — but the plot shows the shape the single number conceals.

10. Inference: bootstrap, jackknife, and placebo

With one treated unit (California), the previous tutorial could only use placebo/permutation inference. With nine treated units here, all three of sdid’s variance estimators are on the table. To keep the comparison clean — jackknife needs more than one treated unit per adoption period — we follow Clarke et al. (2024) and restrict to the two-country 2002 and 2003 cohorts by dropping the five single-country cohorts.

graph TD
Q1{"How many<br/>treated units?"}
Q1 -->|"One (e.g. California)"| PL1["Placebo only<br/>jackknife undefined"]
Q1 -->|"Many (e.g. 9 quota adopters)"| Q2{"More controls than treated?<br/>no singleton cohorts?"}
Q2 -->|"Yes"| ALL["All three available"]
Q2 -->|"Singleton cohorts"| PL2["Placebo / bootstrap<br/>jackknife drops out"]
ALL --> BOOT["bootstrap<br/>SE 4.7 (default)"]
ALL --> JACK["jackknife<br/>SE 6.0 (most conservative)"]
ALL --> PLAC["placebo<br/>SE 2.3 (homoskedastic)"]
style Q1 fill:#141413,stroke:#6a9bcc,color:#fff
style Q2 fill:#141413,stroke:#6a9bcc,color:#fff
style PL1 fill:#d97757,stroke:#141413,color:#fff
style PL2 fill:#d97757,stroke:#141413,color:#fff
style ALL fill:#00d4c8,stroke:#141413,color:#141413
style BOOT fill:#6a9bcc,stroke:#141413,color:#fff
style JACK fill:#6a9bcc,stroke:#141413,color:#fff
style PLAC fill:#6a9bcc,stroke:#141413,color:#fff

drop if inlist(country,"Algeria","Kenya","Samoa","Swaziland","Tanzania")
sdid womparl country year quota, vce(bootstrap) seed(1213)
sdid womparl country year quota, vce(placebo) seed(1213)
sdid womparl country year quota, vce(jackknife)

method att se ci_l ci_u
bootstrap 10.33066 4.7291 1.0618 19.5995
placebo 10.33066 2.3404 5.7436 14.9178
jackknife 10.33066 6.0056 -1.4401 22.1014

The point estimate is identical across all three methods (10.33 points on this subsample) because the inference procedure changes only the standard error, never the estimate. But the standard errors differ by a factor of about 2.6: jackknife is the most conservative (SE 6.01, a confidence interval that crosses zero), placebo is the tightest (SE 2.34) but rests on a homoskedasticity assumption and requires more controls than treated units, and bootstrap sits in between (SE 4.73) and is the default. The practical takeaway: with only a handful of treated units, report the bootstrap as your headline but cross-check it; a result that is “significant” under placebo but not under jackknife deserves caution. (The subsample ATT of 10.3 is larger than the full-sample 8.0 because dropping the five single-country cohorts discards the negative 2005 and 2013 effects.)
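The interval bounds in the table follow from the usual normal approximation, ATT ± 1.96 × SE. The sketch below recomputes them from the reported point estimate and standard errors; the scalar and local names are illustrative. It should reproduce the three intervals in the table and an SE ratio of about 2.57 between jackknife and placebo.

```stata
* Reported point estimate and the 97.5th percentile of the standard normal
scalar att = 10.33066
scalar z   = invnormal(0.975)

* Reported standard errors, in the same order as the method names
local methods bootstrap placebo jackknife
local ses     4.7291 2.3404 6.0056

forvalues k = 1/3 {
    local m  : word `k' of `methods'
    local se : word `k' of `ses'
    * Lower and upper bounds of the 95% interval
    display "`m'" _col(12) %8.4f (att - z*`se') _col(24) %8.4f (att + z*`se')
}

* Spread between the most and least conservative standard errors
display "jackknife SE / placebo SE = " %4.2f (6.0056/2.3404)
```

Only the jackknife interval includes zero, so the choice of variance estimator decides whether the subsample result counts as significant at the 5% level.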

11. Robustness and discussion

Three caveats keep the result honest. Effect concentration: the +8 aggregate leans heavily on a few cohorts — the 2012 cohort alone contributes a +21.8 effect, and the early 2000/2002/2003 cohorts carry most of the aggregation weight. Drop the 2012 cohort and the average falls noticeably. Fragile counterfactuals: with only 110 controls and as few as one treated country per cohort, some synthetic controls are imprecise — the 2003 cohort’s standard error of 9.13 is the tell. Identifying assumptions: SDID still requires no anticipation, an absorbing treatment, no cross-country spillovers, and that quota timing is not itself a response to the outcome’s trajectory; the flat event-study placebos support, but cannot prove, the parallel-trends part. Finally, quota_example is a teaching subset of Bhalotra et al. (2023); these numbers illustrate the method, not a final verdict on quota policy.

12. Summary and key takeaways

Method. Staggered SDID estimates a separate, clean synthetic difference-in-differences for each adoption cohort — comparing it only to never-treated controls — and aggregates the cohort effects $\hat{\tau}_a$ with non-negative, treated-period-share weights. This avoids the negative-weighting trap that contaminates naive two-way fixed-effects DiD under staggered timing.
Result. Gender quotas raise the share of women in parliament by an overall ATT of +8.0 percentage points (SE 3.74, $p=0.032$), robust to a log-GDP control (8.05 optimized, 8.06 projected). Cohort effects range widely, from −3.5 to +21.8 points — heterogeneity the single number hides.
Event study. The sdid_event plot shows pre-adoption placebo coefficients near zero (parallel synthetic trends) and post-adoption effects that appear immediately and persist for over a decade — the dynamics behind the average.
Inference. With nine treated units, bootstrap, jackknife, and placebo are all available; they share one point estimate (10.3 on the two-cohort illustration) but report standard errors of 4.7, 6.0, and 2.3. Jackknife is the most conservative.
Bridge. The block design (Proposition 99, the previous tutorial) and the staggered design here are two faces of one estimator — the staggered version is just single-cohort SDID, done once per cohort and averaged.

13. Exercises

Re-aggregate by hand. Pull e(tau) and each cohort’s treated unit-count and post-period length. Verify that the treated-period-weighted average of the seven $\hat{\tau}_a$ reproduces the overall ATT of 8.03, and show that it differs from the unweighted mean (≈ 7.1). Which cohorts move the aggregate the most?
Inference sensitivity. Re-run the full nine-country sample with vce(bootstrap) and then vce(placebo) at reps(500). How much do the standard error and confidence interval move, and which would you report given only nine treated units?
Drop the outlier cohort. Re-estimate the overall ATT excluding the 2012 cohort (the +21.8 outlier). How far does the aggregate fall, and what does that tell you about how concentrated the average effect is?
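
As a starting point for the re-aggregation and drop-the-outlier exercises, the weighting can be written out explicitly. This is a sketch in notation assumed here, not output from sdid: $N^{a}_{tr}$ is the number of treated units in cohort $a$, $T^{a}_{post}$ is that cohort’s number of post-adoption periods, and $w_a$ is the resulting treated-period share.

$$
\widehat{ATT} \;=\; \sum_{a} w_a \,\hat{\tau}_a,
\qquad
w_a \;=\; \frac{N^{a}_{tr}\, T^{a}_{post}}{\sum_{b} N^{b}_{tr}\, T^{b}_{post}},
\qquad
w_a \ge 0,\quad \sum_{a} w_a = 1 .
$$

The unweighted mean $\frac{1}{A}\sum_{a}\hat{\tau}_a$ (with $A$ the number of cohorts) coincides with this only when every $w_a = 1/A$, which is why the two numbers differ here. If the remaining cohort estimates are unchanged when cohort $c$ is removed, which should hold without covariates because each cohort is compared only to never-treated controls, the leave-one-cohort-out aggregate follows by renormalizing the weights:

$$
\widehat{ATT}_{-c} \;=\; \sum_{a \neq c} \frac{w_a}{1 - w_c}\,\hat{\tau}_a \;=\; \frac{\widehat{ATT} - w_c\,\hat{\tau}_c}{1 - w_c}.
$$

Treat this identity as a hand check on the Stata re-estimation, not a substitute for it: with a covariate adjustment the cohort estimates themselves may shift once the sample changes.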

14. References

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088–4118.
Clarke, D., Pailañir, D., Athey, S., & Imbens, G. (2024). On Synthetic Difference-in-Differences and Related Estimation Methods in Stata. The Stata Journal, 24(4). Package: ssc install sdid.
Ciccia, D. (2024). A Short Note on Event-Study Synthetic Difference-in-Differences Estimators. Package: ssc install sdid_event.
Bhalotra, S., Clarke, D., Gomes, J. F., & Venkataramani, A. (2023). Maternal Mortality and Women’s Political Power. Journal of the European Economic Association. (Source of the quota_example data.)
Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing. Journal of Econometrics, 225(2), 254–277.
de Chaisemartin, C., & D’Haultfœuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. American Economic Review, 110(9), 2964–2996.
Xu, Y., & Hua, L. panelView: Visualizing Panel Data. Package: ssc install panelview.

Related tutorials on this site: Synthetic Difference-in-Differences (the block design) · Difference-in-Differences.

15. Acknowledgments

This tutorial uses the sdid command (Clarke, Pailañir, Athey & Imbens), the sdid_event command (Ciccia, Clarke & Pailañir), and panelview (Xu & Hua). The data, quota_example, is distributed with sdid and draws on Bhalotra, Clarke, Gomes & Venkataramani (2023). All estimates were produced by the companion analysis.do and verified against Clarke et al. (2024). AI tools (Claude Code) assisted with drafting and figure preparation; all code was executed and every number checked by the author.

