Downloads
Each dataset is available as a labeled Stata .dta and its source file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| prop99_example | state-year | 1,209 × 4 | prop99_example.dta | prop99_example.dta |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
use "${BASE}prop99_example.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
df = pd.read_stata(BASE + "prop99_example.dta")
# load every dataset at once
files = ["prop99_example"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "prop99_example.dta", "prop99_example.dta")
df, meta = pyreadstat.read_dta("prop99_example.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
df <- read_dta(paste0(BASE, "prop99_example.dta"))
Overview & sources
Companion data for a hands-on Stata tutorial on synthetic difference-in-differences (SDID), applied to re-evaluate California's Proposition 99 — the 1988 ballot measure that raised the cigarette excise tax by 25 cents a pack and funded an anti-smoking campaign. The file is the canonical strongly balanced panel distributed with the sdid package (originally from Abadie, Diamond & Hainmueller 2010, and used by Arkhangelsky et al. 2021): 39 US states observed annually from 1970–2000 — 1,209 observations — with annual cigarette sales in packs per capita as the sole outcome. California is the single treated unit; the policy bites from 1989 onward. The post writes DiD, synthetic control, and SDID as one weighted two-way fixed-effects regression and estimates the ATT of Proposition 99 with the sdid command, cross-checking synthetic control against synth2.
prop99_example is a strongly balanced state-year panel (one row per state × year, no gaps) carrying a single outcome — cigarette packs per capita — and a 0/1 treatment indicator. Of the 1,209 observations only 12 are treated (California, 1989–2000). The panel deliberately contains no covariates, so synthetic control and SDID see exactly the same information set (the pre-period smoking paths) — an apples-to-apples comparison.
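The structure described above can be checked in a few lines of pandas. The sketch below does not download the file: it builds a stand-in skeleton with the same shape (39 units × 31 years), and every unit name other than California is a placeholder, so only the counts are meaningful. The same checks should apply unchanged to the real `df` once it is loaded.

```python
import pandas as pd

# Stand-in skeleton with the shape of prop99_example (assumption: the
# control-state names are placeholders; only California is named here).
states = ["California"] + [f"control_{k:02d}" for k in range(1, 39)]
years = range(1970, 2001)
df = pd.DataFrame(
    [(s, y) for s in states for y in years], columns=["state", "year"]
)

# Treatment indicator as defined in the text: California from 1989 onward.
df["treated"] = ((df["state"] == "California") & (df["year"] >= 1989)).astype(int)

# Strongly balanced: one row per state x year, no gaps and no duplicates.
n_states = df["state"].nunique()
n_years = df["year"].nunique()
is_balanced = (
    len(df) == n_states * n_years
    and not df.duplicated(["state", "year"]).any()
)

print(len(df), n_states, n_years, is_balanced, int(df["treated"].sum()))
# prints: 1209 39 31 True 12
```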
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Abadie, Diamond & Hainmueller (2010) | Original Proposition 99 panel (39 states, 1970–2000, packs per capita) | Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746 |
| sdid package (Clarke et al. 2024) | Distribution of prop99_example.dta; the sdid estimation command | Clarke, D., Pailañir, D., Athey, S., & Imbens, G. (2024). On Synthetic Difference-in-Differences and Related Estimation Methods in Stata. The Stata Journal (st0757). https://doi.org/10.1177/1536867X241297184 |
| Method references | Estimators and concepts | Arkhangelsky, Athey, Hirshberg, Imbens & Wager (2021): synthetic DiD; Abadie & Gardeazabal (2003): synthetic control; Yan & Chen (2023): synth2. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Synthetic Difference-in-Differences (SDID) in Stata: Re-evaluating California's Proposition 99 [Data set]. https://carlos-mendez.org/post/stata_sdid/
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088–4118. https://doi.org/10.1257/aer.20190159
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
BibTeX
@misc{mendez2026statasdid,
author = {Mendez, Carlos},
title = {Synthetic Difference-in-Differences (SDID) in Stata: Re-evaluating California's Proposition 99},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_sdid/}},
note = {Data set}
}
@article{arkhangelsky2021synthetic,
author = {Arkhangelsky, Dmitry and Athey, Susan and Hirshberg, David A. and Imbens, Guido W. and Wager, Stefan},
title = {Synthetic Difference-in-Differences},
journal = {American Economic Review},
volume = {111}, number = {12}, pages = {4088--4118}, year = {2021},
doi = {10.1257/aer.20190159}
}
@article{abadie2010synthetic,
author = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
title = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
journal = {Journal of the American Statistical Association},
volume = {105}, number = {490}, pages = {493--505}, year = {2010},
doi = {10.1198/jasa.2009.ap08746}
}
Variable explorer
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| packspercapita | continuous | Cigarette sales (packs per capita) | Annual per-capita cigarette pack sales, the sole outcome Y_it. Mean about 119 packs; range roughly 41-296. | packs per capita per year | prop99_example | Abadie et al. (2010) / sdid package |
| state | identifier | State | US state name, the panel unit. 39 states: California (treated) plus 38 control states forming the donor pool. | string | prop99_example | Abadie et al. (2010) / sdid package |
| treated | dummy | Treated indicator (Prop 99) | Treatment status W_it: 1 for California in 1989-2000 (the 12 post-Proposition-99 years), 0 otherwise. Only 12 of 1,209 observations are treated. | 0/1 | prop99_example | Abadie et al. (2010) / sdid package |
| year | year | Year | Calendar year, the panel time index (19 pre-treatment years 1970-1988, 12 post-treatment years 1989-2000). | year | prop99_example | Abadie et al. (2010) / sdid package |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | prop99_example |
|---|---|
| packspercapita | ● |
| state | ● |
| treated | ● |
| year | ● |
Construction & formulas
Every estimator in the post is the same weighted two-way fixed-effects (TWFE)
regression over packspercapita (Y) and the treatment indicator
treated (W), changing only the weights — the unifying view of
Arkhangelsky et al. (2021).
- Synthetic DiD objective: (τ̂, μ̂, α̂, β̂) = argmin Σ_i Σ_t (Y_it − μ − α_i − β_t − W_it·τ)² · ω̂_i · λ̂_t, a TWFE regression weighted by unit weights ω̂_i and time weights λ̂_t. Here α_i is the state fixed effect, β_t is the year fixed effect, and τ is the ATT.
- DiD: the special case with uniform ω and λ (plain TWFE); credibility rests entirely on parallel trends versus all controls.
- Synthetic control: keeps optimized unit weights ω but drops the time weights λ and the unit fixed effects α — so it must match California's pre-period level and trend.
- Unit weights ω: chosen so the weighted controls track California's pre-period path (with intercept ω₀ and a ridge penalty ζ² for stability).
- Time weights λ: chosen so weighted pre-period years line up with the post-period — here SDID places all pre-period weight on 1986–1988.
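To make the unifying view concrete, here is a minimal sketch on simulated data (not the Proposition 99 panel). It fits the weighted TWFE regression by weighted least squares and compares τ̂ with a weighted double difference of means. With uniform weights that double difference is plain DiD. The helper names `weighted_twfe` and `double_difference`, the simulated panel, and the hand-picked non-uniform weights are all illustrative assumptions. The sdid command chooses ω̂ and λ̂ by optimization, which is not reproduced here.

```python
import numpy as np

def weighted_twfe(Y, W, omega, lam):
    """tau from a TWFE regression of Y on W with weights omega_i * lam_t.

    Y, W: (N, T) arrays; omega: (N,) unit weights; lam: (T,) time weights.
    """
    N, T = Y.shape
    unit = np.repeat(np.arange(N), T)   # unit index of each stacked row
    time = np.tile(np.arange(T), N)     # time index of each stacked row
    # Intercept, unit dummies (first dropped), time dummies (first dropped), W.
    X = np.column_stack([
        np.ones(N * T),
        (unit[:, None] == np.arange(1, N)).astype(float),
        (time[:, None] == np.arange(1, T)).astype(float),
        W.ravel().astype(float),
    ])
    w = np.sqrt(np.outer(omega, lam).ravel())  # WLS via sqrt-weight rescaling
    coef, *_ = np.linalg.lstsq(X * w[:, None], Y.ravel() * w, rcond=None)
    return coef[-1]

def double_difference(Y, W, omega, lam):
    """Weighted (post - pre) change of treated units minus that of controls."""
    treated, post = W.any(axis=1), W.any(axis=0)
    def wmean(values, weights):
        return np.sum(values * weights) / np.sum(weights)
    delta = np.array([
        wmean(Y[i, post], lam[post]) - wmean(Y[i, ~post], lam[~post])
        for i in range(Y.shape[0])
    ])
    return wmean(delta[treated], omega[treated]) - wmean(delta[~treated], omega[~treated])

# Simulated block design: unit 0 is treated in the last 3 of 8 periods.
rng = np.random.default_rng(0)
N, T, T_pre = 6, 8, 5
W = np.zeros((N, T), dtype=bool)
W[0, T_pre:] = True
Y = (rng.normal(size=(N, 1)) + rng.normal(size=(1, T))
     + rng.normal(scale=0.1, size=(N, T)) - 2.0 * W)   # true effect: -2

# DiD: uniform weights.
ones_n, ones_t = np.ones(N), np.ones(T)
tau_did = weighted_twfe(Y, W, ones_n, ones_t)

# Illustrative non-uniform weights (hand-picked, NOT optimized as in sdid).
omega = np.array([1.0, 0.4, 0.3, 0.2, 0.05, 0.05])
lam = np.array([0.05, 0.05, 0.2, 0.3, 0.4, 1 / 3, 1 / 3, 1 / 3])
tau_weighted = weighted_twfe(Y, W, omega, lam)

print(np.isclose(tau_did, double_difference(Y, W, ones_n, ones_t)))
print(np.isclose(tau_weighted, double_difference(Y, W, omega, lam)))
```

Both comparisons are expected to agree up to floating-point error: with product weights ω_i·λ_t and a single treatment block, the regression coefficient on W reduces to a weighted double difference, so the estimators differ only in which controls and which pre-period years receive weight.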
Estimand (ATT).
τ = (1 / (N_tr·T_post)) · Σ_{i: W_i=1} Σ_{t > T_pre} [ Y_it(1) − Y_it(0) ]
— the effect of Proposition 99 on California over the post-1988 period, where the counterfactual
Y_it(0) is never observed and each method imputes it differently. Here
N_tr = 1 (California). Because California was not randomly assigned, this is an
observational design.
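Specialized to this panel, with N_tr = 1 (California, abbreviated CA here) and T_post = 12 (1989–2000), the estimand reduces to a simple average over the twelve post-treatment years. Nothing beyond the counts stated above is assumed:

```latex
\tau = \frac{1}{12} \sum_{t=1989}^{2000} \left[ Y_{\mathrm{CA},t}(1) - Y_{\mathrm{CA},t}(0) \right]
```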
The datasets
Each dataset has a full variable dictionary and a statistics table with data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| state (identifier) | State | US state name, the panel unit. 39 states: California (treated) plus 38 control states forming the donor pool. | From the distributed dataset; encoded to a numeric id in the post (encode state, gen(id)) for xtset/synth2. | string | Abadie et al. (2010) / sdid package | 39 states |
| year (year) | Year | Calendar year, the panel time index (19 pre-treatment years 1970-1988, 12 post-treatment years 1989-2000). | Annual observations; strongly balanced (every state observed every year, no gaps). | year | Abadie et al. (2010) / sdid package | 1970-2000 |
| packspercapita (continuous) | Cigarette sales (packs per capita) | Annual per-capita cigarette pack sales, the sole outcome Y_it. Mean about 119 packs; range roughly 41-296. | Distributed outcome series; the only outcome in the panel (no income, price, or demographic covariates). | packs per capita per year | Abadie et al. (2010) / sdid package | all state-years |
| treated (dummy) | Treated indicator (Prop 99) | Treatment status W_it: 1 for California in 1989-2000 (the 12 post-Proposition-99 years), 0 otherwise. Only 12 of 1,209 observations are treated. | 1 where state == California and year >= 1989; the single-treated-unit block design. | 0/1 | Abadie et al. (2010) / sdid package | all state-years |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| state | 100% | 1,209 | 39 | - | - | - | - | - |
| year | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| packspercapita | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| treated | 100% | 1,209 | 2 | 0 | 0.010 | 0 | 1.00 | 0.099 |
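The year and treated rows follow from the panel's shape alone, so they can be reproduced without the file. The sketch below rebuilds both columns from the stated counts (39 states, years 1970–2000, 12 treated state-years), using only the Python standard library. It assumes the table reports the sample standard deviation (n − 1 denominator).

```python
import statistics

n_states = 39
years = list(range(1970, 2001))

# Year column of a balanced panel: every year repeated once per state.
year_col = years * n_states
# Treated column: 12 ones (California, 1989-2000), zeros elsewhere.
treated_col = [1] * 12 + [0] * (len(year_col) - 12)

n_obs = len(year_col)
year_mean = statistics.mean(year_col)
year_sd = statistics.stdev(year_col)        # sample SD (n - 1)
treated_mean = statistics.mean(treated_col)
treated_sd = statistics.stdev(treated_col)  # sample SD (n - 1)

print(n_obs, year_mean, round(year_sd, 2))
print(round(treated_mean, 3), round(treated_sd, 3))
```

Rounded to the table's precision, these come out to N = 1,209, a year mean of 1985 with SD 8.95, and a treated mean of 0.010 (that is, 12/1,209) with SD 0.099.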
Known limitations & caveats
- Single treated unit. Only 12 of 1,209 observations are treated (California, 1989–2000). With one treated unit, placebo (permutation) inference is the only valid procedure — the jackknife is undefined and the bootstrap is unreliable; statistical power is inherently limited.
- Outcome-only panel. The file carries no income, price, or demographic covariates — only cigarette packs per capita. This is deliberate: synthetic control and SDID see exactly the same information, so any difference in their answers comes from the estimator, not from different predictors.
- Strongly balanced. Every state is observed in every year 1970–2000 with no gaps; this balance is required by all three estimators in the post.
- Observational, not experimental. California was not randomly assigned Proposition 99; identification assumes a stable comparison group, no large contemporaneous shock unique to California in 1989, and no cross-state spillovers (e.g., border cigarette purchases contaminating donor states).
- Distributed dataset. Values are byte-faithful to the canonical prop99_example.dta shipped with the sdid package; this data dictionary adds only metadata (variable + value labels).
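The placebo idea in the first caveat can be sketched as follows. This is a simplified illustration on simulated data with a plain DiD estimator, not the sdid command's own placebo routine. Each control unit in turn is treated as if it had received the policy, with the real treated unit excluded, and the spread of those placebo estimates serves as a rough standard error for the actual estimate. All names and numbers here are hypothetical.

```python
import random
import statistics

def did(panel, unit, donors, t_pre):
    """Post-minus-pre change of `unit` minus the donors' average change."""
    def change(u):
        series = panel[u]
        return statistics.mean(series[t_pre:]) - statistics.mean(series[:t_pre])
    return change(unit) - statistics.mean([change(u) for u in donors])

random.seed(1)
n_controls, t_total, t_pre, true_effect = 20, 12, 8, -3.0

# Simulated panel: unit level + common trend + noise; unit "T" gets the effect.
panel = {}
for unit in ["T"] + [f"c{k}" for k in range(n_controls)]:
    level = random.gauss(0, 5)
    panel[unit] = [
        level + 0.5 * t + random.gauss(0, 1)
        + (true_effect if unit == "T" and t >= t_pre else 0.0)
        for t in range(t_total)
    ]

controls = [u for u in panel if u != "T"]
estimate = did(panel, "T", controls, t_pre)

# Placebo step: pretend each control was treated; the real treated unit is left out.
placebos = [
    did(panel, c, [u for u in controls if u != c], t_pre) for c in controls
]
placebo_se = statistics.stdev(placebos)

# Share of placebo effects at least as large in magnitude as the actual estimate.
p_value = sum(abs(p) >= abs(estimate) for p in placebos) / len(placebos)

print(round(estimate, 2), round(placebo_se, 2), p_value)
```

With 20 placebo units the smallest attainable nonzero p-value is 1/20, which is one way to see why power is limited when the donor pool is small.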