Data dictionary · California's Proposition 99: The Synthetic DiD Panel

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows	Stata	Source
`prop99_example`	state-year	1,209 × 4	prop99_example.dta	prop99_example.dta

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
use "${BASE}prop99_example.dta", clear
describe
notes

Python

# Colab/Jupyter only: the leading "!" is a notebook shell escape, not plain Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
df = pd.read_stata(BASE + "prop99_example.dta")

# load every dataset at once
files = ["prop99_example"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "prop99_example.dta", "prop99_example.dta")
df, meta = pyreadstat.read_dta("prop99_example.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_sdid/data/"
df <- read_dta(paste0(BASE, "prop99_example.dta"))

Overview & sources

Companion data for a hands-on Stata tutorial on synthetic difference-in-differences (SDID), applied to re-evaluate California's Proposition 99, the 1988 ballot measure that raised the cigarette excise tax by 25 cents a pack and funded an anti-smoking campaign. The file is the canonical strongly balanced panel distributed with the sdid package (originally from Abadie, Diamond & Hainmueller 2010, and used by Arkhangelsky et al. 2021): 39 US states observed annually from 1970 to 2000 (39 × 31 = 1,209 observations), with annual cigarette sales in packs per capita as the sole outcome. California is the single treated unit, and treatment takes effect from 1989 onward. The post writes DiD, synthetic control, and SDID as one weighted two-way fixed-effects regression and estimates the average treatment effect on the treated (ATT) of Proposition 99 with the sdid command, cross-checking synthetic control against synth2.

One file, one outcome. prop99_example is a strongly balanced state-year panel (one row per state × year, no gaps) carrying a single outcome — cigarette packs per capita — and a 0/1 treatment indicator. Of the 1,209 observations only 12 are treated (California, 1989–2000). The panel deliberately contains no covariates, so synthetic control and SDID see exactly the same information set (the pre-period smoking paths) — an apples-to-apples comparison.
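The panel dimensions quoted above follow directly from the design, and a short sketch can confirm the arithmetic without downloading anything. It rebuilds only the state × year skeleton and the treatment rule. The control labels (`control_01`, ...) are placeholders invented for illustration, not the state names in the file.

```python
from itertools import product

N_STATES, FIRST_YEAR, LAST_YEAR, POLICY_YEAR = 39, 1970, 2000, 1989

# Placeholder unit labels: only "California" is taken from the text above.
states = ["California"] + [f"control_{k:02d}" for k in range(1, N_STATES)]
years = range(FIRST_YEAR, LAST_YEAR + 1)

# One row per state x year, with treated = 1 for California from 1989 onward.
rows = [
    {"state": s, "year": y, "treated": int(s == "California" and y >= POLICY_YEAR)}
    for s, y in product(states, years)
]

n_rows = len(rows)
n_treated = sum(r["treated"] for r in rows)
# Strongly balanced: every (state, year) pair appears exactly once.
is_balanced = len({(r["state"], r["year"]) for r in rows}) == len(states) * len(years)
print(n_rows, n_treated, is_balanced)  # 1209 12 True
```

The same three checks (row count, treated count, uniqueness of the state-year key) can be run on the downloaded file as a quick integrity test.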

Data sources

Source	Provides	Reference / URL
Abadie, Diamond & Hainmueller (2010)	Original Proposition 99 panel (39 states, 1970–2000, packs per capita)	Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
sdid package (Clarke et al. 2024)	Distribution of prop99_example.dta; the sdid estimation command	Clarke, D., Pailañir, D., Athey, S., & Imbens, G. (2024). On Synthetic Difference-in-Differences and Related Estimation Methods in Stata. The Stata Journal (st0757). https://doi.org/10.1177/1536867X241297184
Method references	Estimators and concepts	Arkhangelsky, Athey, Hsiao, Imbens & Wager (2021) — synthetic DiD; Abadie & Gardeazabal (2003) — synthetic control; Yan & Chen (2023) — synth2.

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Synthetic Difference-in-Differences (SDID) in Stata: Re-evaluating California's Proposition 99 [Data set]. https://carlos-mendez.org/post/stata_sdid/

Arkhangelsky, D., Athey, S., Hsiao, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088–4118. https://doi.org/10.1257/aer.20190159

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746

BibTeX

@misc{mendez2026statasdid,
  author       = {Mendez, Carlos},
  title        = {Synthetic Difference-in-Differences (SDID) in Stata: Re-evaluating California's Proposition 99},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_sdid/}},
  note         = {Data set}
}

@article{arkhangelsky2021synthetic,
  author  = {Arkhangelsky, Dmitry and Athey, Susan and Hsiao, David A. and Imbens, Guido W. and Wager, Stefan},
  title   = {Synthetic Difference-in-Differences},
  journal = {American Economic Review},
  volume  = {111}, number = {12}, pages = {4088--4118}, year = {2021},
  doi     = {10.1257/aer.20190159}
}
@article{abadie2010synthetic,
  author  = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
  title   = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
  journal = {Journal of the American Statistical Association},
  volume  = {105}, number = {490}, pages = {493--505}, year = {2010},
  doi     = {10.1198/jasa.2009.ap08746}
}

Variable explorer


Variable	Type	Label	Definition	Units	In files	Source
`packspercapita`	continuous	Cigarette sales (packs per capita)	Annual per-capita cigarette pack sales, the sole outcome Y_it. Mean about 119 packs; range roughly 41-296.	packs per capita per year	prop99_example	Abadie et al. (2010) / sdid package
`state`	identifier	State	US state name, the panel unit. 39 states: California (treated) plus 38 control states forming the donor pool.	string	prop99_example	Abadie et al. (2010) / sdid package
`treated`	dummy	Treated indicator (Prop 99)	Treatment status W_it: 1 for California in 1989-2000 (the 12 post-Proposition-99 years), 0 otherwise. Only 12 of 1,209 observations are treated.	0/1	prop99_example	Abadie et al. (2010) / sdid package
`year`	year	Year	Calendar year, the panel time index (19 pre-treatment years 1970-1988, 12 post-treatment years 1989-2000).	year	prop99_example	Abadie et al. (2010) / sdid package

Cross-file variable index

Which file each variable appears in (● = present).

Variable	prop99_example
`packspercapita`	●
`state`	●
`treated`	●
`year`	●

Construction & formulas

Every estimator in the post is the same weighted two-way fixed-effects (TWFE) regression over packspercapita (Y) and the treatment indicator treated (W), changing only the weights — the unifying view of Arkhangelsky et al. (2021).

Synthetic DiD objective: (τ̂, μ̂, α̂, β̂) = argmin Σ_i Σ_t (Y_it − μ − α_i − β_t − W_it·τ)² · ω̂_i · λ̂_t — a TWFE regression weighted by unit weights ω̂_i and time weights λ̂_t. α_i = state fixed effect; β_t = year fixed effect; τ = the ATT.
DiD: the special case with uniform ω and λ (plain TWFE) — credibility rests entirely on parallel trends versus all controls.
Synthetic control: keeps optimized unit weights ω but drops the time weights λ and the unit fixed effects α — so it must match California's pre-period level and trend.
Unit weights ω: chosen so the weighted controls track California's pre-period path (with intercept ω₀ and a ridge penalty ζ² for stability).
Time weights λ: chosen so weighted pre-period years line up with the post-period — here SDID places all pre-period weight on 1986–1988.
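For a single treated unit in a block design like this one, Arkhangelsky et al. (2021) express the weighted TWFE fit as a weighted double difference, which makes the role of ω and λ easy to see. The sketch below illustrates that form under stated assumptions: the three-unit panel is made up (it is not prop99_example), the weights are supplied by hand rather than optimized, and the function name `weighted_double_diff` is hypothetical.

```python
from statistics import mean

def weighted_double_diff(y_treated, y_controls, omega, lam, t_pre):
    """Weighted double difference for one treated unit (illustrative only).

    y_treated  : outcomes of the treated unit, ordered by time
    y_controls : list of outcome lists, one per control unit
    omega      : unit weights over the controls (sum to 1)
    lam        : time weights over the first t_pre periods (sum to 1)
    """
    def delta(y):
        # Post-period mean minus the lambda-weighted pre-period level.
        post_mean = mean(y[t_pre:])
        pre_level = sum(l * v for l, v in zip(lam, y[:t_pre]))
        return post_mean - pre_level

    control_change = sum(w * delta(y) for w, y in zip(omega, y_controls))
    return delta(y_treated) - control_change

# Made-up panel: 3 pre-periods, 2 post-periods, built-in effect of -5.
controls = [[10, 11, 12, 13, 14], [20, 21, 22, 23, 24]]
treated = [30, 31, 32, 28, 29]

# DiD: uniform unit weights and uniform time weights.
tau_did = weighted_double_diff(treated, controls, [0.5, 0.5], [1/3, 1/3, 1/3], 3)
# SDID-style: non-uniform unit weights, all pre-period weight on the last year.
tau_sdid = weighted_double_diff(treated, controls, [0.25, 0.75], [0, 0, 1], 3)
print(tau_did, tau_sdid)
```

Both calls recover the same effect here only because the made-up controls follow exactly parallel trends. On real data the choice of ω and λ changes the answer, which is the point of the comparison in the post.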

Estimand (ATT). τ = (1 / (N_tr·T_post)) · Σ_{i: W_i=1} Σ_{t > T_pre} [ Y_it(1) − Y_it(0) ] — the effect of Proposition 99 on California over the post-1988 period, where the counterfactual Y_it(0) is never observed and each method imputes it differently. Here N_tr = 1 (California). Because California was not randomly assigned, this is an observational design.
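With N_tr = 1 and the 12 post-treatment years in this panel, the estimand reduces to a plain average over California's post-1988 years. Written out, with the subscript CA introduced here as shorthand for California:

```latex
\tau \;=\; \frac{1}{12} \sum_{t=1989}^{2000} \Bigl[\, Y_{\mathrm{CA},t}(1) - Y_{\mathrm{CA},t}(0) \,\Bigr]
```

The first term is observed in the data; the second is the counterfactual that each estimator imputes.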

The datasets

The dataset entry below gives the full variable dictionary plus a statistics table with data coverage.


state-year 1,209 × 4 · 1970-2000 · 39 US states (strongly balanced)

Panel key: state × year · Use: estimate the ATT of California's Proposition 99 (DiD / synthetic control / SDID).

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	Coverage
`state`	identifier	State	US state name, the panel unit. 39 states: California (treated) plus 38 control states forming the donor pool.	From the distributed dataset; encoded to a numeric id in the post (encode state, gen(id)) for xtset/synth2.	string	Abadie et al. (2010) / sdid package	39 states
`year`	year	Year	Calendar year, the panel time index (19 pre-treatment years 1970-1988, 12 post-treatment years 1989-2000).	Annual observations; strongly balanced (every state observed every year, no gaps).	year	Abadie et al. (2010) / sdid package	1970-2000
`packspercapita`	continuous	Cigarette sales (packs per capita)	Annual per-capita cigarette pack sales, the sole outcome Y_it. Mean about 119 packs; range roughly 41-296.	Distributed outcome series; the only outcome in the panel (no income, price, or demographic covariates).	packs per capita per year	Abadie et al. (2010) / sdid package	all state-years
`treated`	dummy	Treated indicator (Prop 99)	Treatment status W_it: 1 for California in 1989-2000 (the 12 post-Proposition-99 years), 0 otherwise. Only 12 of 1,209 observations are treated.	1 where state == California and year >= 1989; the single-treated-unit block design.	0/1	Abadie et al. (2010) / sdid package	all state-years

Summary statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`state`	100%	1,209	39	n/a	n/a	n/a	n/a	n/a
`year`	100%	1,209	31	1970	1985.0	1985	2000	8.95
`packspercapita`	100%	1,209	703	40.70	118.9	116.3	296.2	32.77
`treated`	100%	1,209	2	0	0.010	0	1	0.099
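The `year` and `treated` rows in the statistics table are pinned down by the panel design alone, so they can be reproduced without the data file. A minimal check, assuming the SD column is the sample standard deviation (n − 1 denominator):

```python
from statistics import mean, stdev

years = list(range(1970, 2001)) * 39      # 31 years repeated for 39 states
treated = [1] * 12 + [0] * (1209 - 12)    # 12 treated state-years, rest zero

year_mean, year_sd = mean(years), stdev(years)
treated_mean, treated_sd = mean(treated), stdev(treated)
print(f"{year_mean:.1f} {year_sd:.2f} {treated_mean:.3f} {treated_sd:.3f}")
# 1985.0 8.95 0.010 0.099
```

The `packspercapita` row cannot be checked this way because it depends on the actual outcome values.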

Known limitations & caveats

Single treated unit. Only 12 of 1,209 observations are treated (California, 1989–2000). With one treated unit, placebo (permutation) inference is the only one of sdid's three variance options (bootstrap, jackknife, placebo) that applies: the jackknife is undefined and the bootstrap is unreliable. Statistical power is inherently limited.
Outcome-only panel. The file carries no income, price, or demographic covariates — only cigarette packs per capita. This is deliberate: synthetic control and SDID see exactly the same information, so any difference in their answers comes from the estimator, not from different predictors.
Strongly balanced. Every state is observed in every year 1970–2000 with no gaps; this balance is required by all three estimators in the post.
Observational, not experimental. California was not randomly assigned Proposition 99; identification assumes a stable comparison group, no large contemporaneous shock unique to California in 1989, and no cross-state spillovers (e.g., border cigarette purchases contaminating donor states).
Distributed dataset. Values are byte-faithful to the canonical prop99_example.dta shipped with the sdid package; this data dictionary adds only metadata (variable + value labels).
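The placebo idea behind the single-treated-unit caveat can be sketched in a few lines: pretend each control unit was treated, re-run the estimator on the remaining controls, and use the spread of those placebo estimates as a noise yardstick. The version below is a deliberate simplification and not the sdid implementation. It uses plain DiD with uniform weights instead of re-optimized ω and λ, enumerates every control once instead of drawing random placebo assignments, and runs on a made-up four-unit panel. The function names are hypothetical.

```python
from statistics import mean, stdev

def did_estimate(y_treated, y_controls, t_pre):
    """Plain DiD: treated pre/post change minus the average control change."""
    def change(y):
        return mean(y[t_pre:]) - mean(y[:t_pre])
    return change(y_treated) - mean(change(y) for y in y_controls)

def placebo_se(y_controls, t_pre):
    """Treat each control in turn as pseudo-treated and collect the estimates."""
    placebo = []
    for i, pseudo_treated in enumerate(y_controls):
        rest = y_controls[:i] + y_controls[i + 1:]
        placebo.append(did_estimate(pseudo_treated, rest, t_pre))
    # Spread of the placebo estimates (sample standard deviation).
    return stdev(placebo), placebo

# Made-up control panel: 2 pre-periods, 2 post-periods, no true effect anywhere.
controls = [
    [10, 12, 13, 15],
    [20, 21, 24, 25],
    [30, 30, 31, 33],
    [40, 43, 44, 45],
]
se, placebo = placebo_se(controls, t_pre=2)
print(round(se, 3), [round(p, 3) for p in placebo])
```

A treated-unit estimate that is large relative to `se` would be hard to attribute to the ordinary variation seen among untreated units.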