Data dictionary · Six Ways to Evaluate a Policy: Proposition 99

Downloads

Each dataset is available as a labeled Stata .dta file and as its source .csv file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`proposition99`	state-year	1,209 × 7	proposition99.dta	proposition99.csv
`data_california`	year (California only)	31 × 8	data_california.dta	data_california.csv
`data_imputed`	state-year	1,209 × 7	data_imputed.dta	data_imputed.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
use "${BASE}proposition99.dta", clear
describe
notes

Python

# Notebook shell magic (Colab/Jupyter); in a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df = pd.read_stata(BASE + "proposition99.dta")

# load every dataset at once
files = ["proposition99", "data_california", "data_imputed"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "proposition99.dta", "proposition99.dta")
df, meta = pyreadstat.read_dta("proposition99.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df <- read_dta(paste0(BASE, "proposition99.dta"))

Overview & sources

Companion data for a hands-on R tutorial that evaluates California's 1989 Proposition 99 cigarette tax by applying six estimator families — naive pre-post, difference-in-differences, two interrupted time-series variants (linear growth curve and an AICc-selected ARIMA), regression discontinuity on time, synthetic control (tidysynth), and Bayesian structural time series (CausalImpact) — to one shared panel and asking how much they disagree. The data are a balanced annual panel of 39 U.S. states over 1970–2000 (1,209 state-year rows) with per-capita cigarette pack sales as the outcome and four covariates, prepared from the canonical Abadie, Diamond & Hainmueller (2010) Proposition 99 dataset distributed by the causalpolicy.nl workshop. Every causal method targets the average treatment effect on the treated (ATT) for California; the entire pipeline is open and reproducible.

Three files. proposition99 is the source 39-state annual panel (one row per state × year, 1970–2000) with four partly-missing covariates. data_california is the California-only series (one row per year) with a prepost factor marking the 1989 policy date — the analysis subset used by the naive, ITS and RDD estimators. data_imputed is the full panel after one round of mice random-forest imputation that fills every covariate gap — the complete-case input for the CausalImpact model.
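The relationship between the source panel and the California-only file can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own R code: `california_subset` is a hypothetical helper, the rows carry only `state` and `year` (no real measurements), and the Pre/Post rule follows the `prepost` definition given in this dictionary (Post = 1989 onward).

```python
POLICY_YEAR = 1989  # first Post year, per the prepost definition in this dictionary

def california_subset(rows):
    """Keep California rows and attach a Pre/Post label (hypothetical helper)."""
    subset = []
    for row in rows:
        if row["state"] != "California":
            continue
        label = "Post" if int(row["year"]) >= POLICY_YEAR else "Pre"
        subset.append({**row, "prepost": label})
    return subset

# Skeleton rows only: two states x 1970-2000, with no outcome or covariate values.
rows = [{"state": s, "year": y}
        for s in ("California", "Nevada")
        for y in range(1970, 2001)]

california = california_subset(rows)
counts = {label: sum(r["prepost"] == label for r in california)
          for label in ("Pre", "Post")}
print(len(california), counts)
```

On the skeleton above this yields 31 California rows split into 19 Pre and 12 Post years, matching the shape reported for data_california.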

Data sources

Source	Provides	Reference / URL
Abadie, Diamond & Hainmueller (2010)	Original Proposition 99 panel: per-capita cigarette sales + covariates for 39 states, 1970–2000	Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505.
causalpolicy.nl workshop (ODISSEI)	Prepared proposition99.rds distribution mirrored as proposition99.csv	ODISSEI Social Data Science Team. Causal Policy Evaluation workshop. https://causalpolicy.nl/
Method references	Estimators and concepts	Holland (1986); Rubin (1974); Hyndman & Athanasopoulos (2021, fpp3); Brodersen et al. (2015, CausalImpact); van Buuren & Groothuis-Oudshoorn (2011, mice).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99 [Data set]. https://carlos-mendez.org/post/r_causalpolicy_workshop/

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505.

BibTeX

@misc{mendez2026rcausalpolicyworkshop,
  author       = {Mendez, Carlos},
  title        = {Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_causalpolicy_workshop/}},
  note         = {Data set}
}

@article{abadie2010synthetic,
  author  = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
  title   = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
  journal = {Journal of the American Statistical Association},
  volume  = {105}, number = {490}, pages = {493--505}, year = {2010}
}

Variable explorer (all 8 variables)

Variable	Type	Label	Definition	Units	In files	Source
`age15to24`	continuous	Share of population aged 15–24	Fraction of the state population aged 15 to 24 (covariate).	0-1 (share)	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv
`beer`	continuous	Per-capita beer consumption	Per-capita beer consumption (covariate proxy for tobacco-related behaviour).	gallons per capita	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv
`cigsale`	continuous	Per-capita cigarette pack sales	Annual per-capita sales of cigarette packs (the policy outcome).	packs per capita	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010)
`lnincome`	continuous	Log per-capita income	Natural log of per-capita personal income (covariate).	log US$	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv
`prepost`	identifier	Pre/Post policy period (1=Post)	Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward.	Pre/Post	data_california	Derived (this study)
`retprice`	continuous	Retail cigarette price	Average retail price of cigarettes (covariate).	US cents per pack	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010)
`state`	identifier	U.S. state name	Name of the U.S. state (treated unit is California; the other 38 are donor states).	string	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010)
`year`	year	Calendar year	Annual time index; 1989 is the Proposition 99 policy date.	year	proposition99, data_california, data_imputed	Abadie, Diamond & Hainmueller (2010)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	proposition99	data_california	data_imputed
`age15to24`	●	●	●
`beer`	●	●	●
`cigsale`	●	●	●
`lnincome`	●	●	●
`prepost`		●
`retprice`	●	●	●
`state`	●	●	●
`year`	●	●	●

Construction & formulas

Every method targets the average treatment effect on the treated (ATT) for California over 1989–2000, each building the missing counterfactual Ŷ₁ₜ(0) — California's sales without Proposition 99 — a different way.
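One common way to write this target, assuming the effect is averaged with equal weight over the 12 post-policy years (an assumption here; individual estimators may report a level break or a cumulative effect instead), is:

```latex
\[
\widehat{\mathrm{ATT}}
  = \frac{1}{T_{\text{post}}} \sum_{t=1989}^{2000}
    \bigl[\, Y_{1t} - \hat{Y}_{1t}(0) \,\bigr],
\qquad T_{\text{post}} = 12 .
\]
```

Here Y₁ₜ is California's observed per-capita cigarette sales in year t and Ŷ₁ₜ(0) is the method-specific counterfactual.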

Naive pre-post: Ŷ₁ₜ(0) = Ȳ₁,pre — California's own pre-period mean.
Difference-in-Differences: Ŷ₁ₜ(0) = Ȳ₁,pre + (Ȳ₀,post − Ȳ₀,pre), i.e. California's pre-period mean plus the pre-to-post change in the comparison state (Nevada, subscript 0).
ITS (growth curve): Ŷ₁ₜ(0) = α̂ + β̂·t — extrapolate California's pre-period linear fit.
ITS (ARIMA): Ŷ₁ₜ(0) = forecast from an AICc-selected ARIMA(1,2,0) fitted on 1970–1988.
RDD on time: segmented regression cigsale ~ year0 + prepost + year0:prepost; the level break prepost coefficient is the headline effect.
Synthetic Control: Ŷ₁ₜ(0) = Σ wᵢ* Yᵢₜ, a convex blend of donor states (wᵢ* ≥ 0, Σ wᵢ* = 1) with weights chosen to minimise RMSE over the pre-period (1970–1988).
CausalImpact (BSTS): y₁ₜ = μₜ + βᵀxₜ + εₜ — a local-level trend plus a regression on donor-state series and covariates, fit on the pre-period and projected forward.
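Four of the simpler counterfactuals above can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's R pipeline: the function names are hypothetical, the two short series are made-up toy numbers (not values from the panel), and `year0` is assumed to be `year - 1989`, so that the RDD level break is read at the policy date. The RDD helper relies on the fact that a fully interacted segmented regression gives the same fit as separate lines on each side of the break.

```python
from statistics import mean

POLICY_YEAR = 1989  # first treated year, per this dictionary

def fit_line(xs, ys):
    """OLS for y = intercept + slope * x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def split(series):
    """Split a {year: value} series into pre- and post-policy dicts."""
    pre = {t: y for t, y in series.items() if t < POLICY_YEAR}
    post = {t: y for t, y in series.items() if t >= POLICY_YEAR}
    return pre, post

def att_naive(treated):
    """Post mean minus the treated unit's own pre-period mean."""
    pre, post = split(treated)
    return mean(post.values()) - mean(pre.values())

def att_did(treated, control):
    """Naive change minus the control unit's pre-to-post change."""
    c_pre, c_post = split(control)
    return att_naive(treated) - (mean(c_post.values()) - mean(c_pre.values()))

def att_its_linear(treated):
    """Mean gap between post values and the extrapolated pre-period line."""
    pre, post = split(treated)
    a, b = fit_line(list(pre), list(pre.values()))
    return mean(y - (a + b * t) for t, y in post.items())

def rdd_level_break(treated):
    """Jump in fitted level at year0 = 0, from separate lines per segment."""
    pre, post = split(treated)
    a_pre, _ = fit_line([t - POLICY_YEAR for t in pre], list(pre.values()))
    a_post, _ = fit_line([t - POLICY_YEAR for t in post], list(post.values()))
    return a_post - a_pre

# Toy series for illustration only (NOT taken from the Proposition 99 panel).
treated = {1985: 100.0, 1986: 98.0, 1987: 96.0, 1988: 94.0,
           1989: 80.0, 1990: 76.0, 1991: 72.0}
control = {1985: 120.0, 1986: 118.0, 1987: 116.0, 1988: 114.0,
           1989: 112.0, 1990: 110.0, 1991: 108.0}

print(att_naive(treated), att_did(treated, control),
      att_its_linear(treated), rdd_level_break(treated))
```

Even on this toy input the four answers differ (about −21, −14, −14 and −12), because each estimator builds a different counterfactual from the same observations.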

Covariate imputation (for data_imputed): missing lnincome, beer and age15to24 values are filled by a single random-forest imputation draw, mice(m = 1, method = "rf"), under a global set.seed(42). Only these three NA-bearing covariates change; the fully observed cigsale and retprice are byte-identical to the source panel.
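The claim that imputation touches only the missing cells can be checked mechanically. The sketch below is one possible check, not part of the published pipeline: `compare_imputed` is a hypothetical helper, missing values are assumed to be represented as `None`, and the two-row tables are made-up toy values.

```python
def compare_imputed(source, imputed, key=("state", "year")):
    """Count filled gaps per column and list observed cells that changed."""
    lookup = {tuple(row[k] for k in key): row for row in imputed}
    filled, altered = {}, []
    for row in source:
        row_key = tuple(row[k] for k in key)
        other = lookup[row_key]
        for column, value in row.items():
            if column in key:
                continue
            if value is None:
                # A gap in the source: count it if the imputed file filled it.
                if other[column] is not None:
                    filled[column] = filled.get(column, 0) + 1
            elif other[column] != value:
                # An observed value that differs after imputation is a red flag.
                altered.append((row_key, column))
    return filled, altered

# Toy rows for illustration only (NOT values from the panel).
source = [
    {"state": "A", "year": 1970, "cigsale": 100.0, "beer": None},
    {"state": "A", "year": 1971, "cigsale": 101.0, "beer": 20.0},
]
imputed = [
    {"state": "A", "year": 1970, "cigsale": 100.0, "beer": 21.5},
    {"state": "A", "year": 1971, "cigsale": 101.0, "beer": 20.0},
]

filled, altered = compare_imputed(source, imputed)
print(filled, altered)
```

An empty `altered` list is what the description above would lead one to expect when the real proposition99 and data_imputed files are compared.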

The datasets

Each dataset below has a variable dictionary followed by a statistics table with data coverage.

proposition99 · state-year · 1,209 × 7 · 1970–2000 · 39 U.S. states (balanced)

Panel key: state × year · Shared input panel for all six estimators; outcome = cigsale, treated unit = California from 1989.
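A balanced panel keyed on state × year means every state appears exactly once in every year, so 39 states × 31 years gives 1,209 rows. A small check along these lines is sketched below; `panel_report` is a hypothetical helper and the state labels are placeholders, not the real state names.

```python
from collections import Counter
from itertools import product

def panel_report(rows, key=("state", "year")):
    """Summarise key uniqueness and balance for a long-format panel."""
    keys = [tuple(row[k] for k in key) for row in rows]
    units = sorted({k[0] for k in keys})
    periods = sorted({k[1] for k in keys})
    duplicates = [k for k, n in Counter(keys).items() if n > 1]
    missing = sorted(set(product(units, periods)) - set(keys))
    return {
        "rows": len(rows),
        "units": len(units),
        "periods": len(periods),
        "duplicates": duplicates,
        "missing": missing,
        "balanced": not duplicates and not missing,
    }

# Skeleton with placeholder labels: 39 units x 31 years (1970-2000).
skeleton = [{"state": f"state_{i:02d}", "year": year}
            for i in range(1, 40)
            for year in range(1970, 2001)]

report = panel_report(skeleton)
print(report["rows"], report["units"], report["periods"], report["balanced"])
```

Running the same function on rows loaded from proposition99.csv would be a quick way to confirm the shape stated in the header line.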

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`state` identifier	U.S. state name	Name of the U.S. state (treated unit is California; the other 38 are donor states).	From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only.	string	Abadie, Diamond & Hainmueller (2010)	39 states (1 in data_california.csv)
`year` year	Calendar year	Annual time index; 1989 is the Proposition 99 policy date.	Source panel runs 1970–2000; the last full pre-period year is 1988.	year	Abadie, Diamond & Hainmueller (2010)	1970-2000
`cigsale` continuous	Per-capita cigarette pack sales	Annual per-capita sales of cigarette packs (the policy outcome).	Observed; fully populated in all three files (no imputation needed).	packs per capita	Abadie, Diamond & Hainmueller (2010)	fully observed
`lnincome` continuous	Log per-capita income	Natural log of per-capita personal income (covariate).	Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation.	log US$	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	16.1% missing in source; complete in imputed
`beer` continuous	Per-capita beer consumption	Per-capita beer consumption (covariate proxy for tobacco-related behaviour).	Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation.	gallons per capita	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	54.8% missing in source; complete in imputed
`age15to24` continuous	Share of population aged 15–24	Fraction of the state population aged 15 to 24 (covariate).	Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation.	0-1 (share)	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	32.3% missing in source; complete in imputed
`retprice` continuous	Retail cigarette price	Average retail price of cigarettes (covariate).	Observed; fully populated in all three files (no imputation needed).	US cents per pack	Abadie, Diamond & Hainmueller (2010)	fully observed

Statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`state`	100%	1,209	39	–	–	–	–	–
`year`	100%	1,209	31	1970	1985.0	1985	2000	8.95
`cigsale`	100%	1,209	703	40.70	118.9	116.3	296.2	32.77
`lnincome`	84%	1,014	1,014	9.40	9.86	9.86	10.49	0.171
`beer`	45%	546	145	2.50	23.43	23.30	40.40	4.22
`age15to24`	68%	819	819	0.129	0.175	0.178	0.204	0.015
`retprice`	100%	1,209	849	27.30	108.3	95.50	351.2	64.38
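The Coverage column and the missing-share percentages quoted elsewhere in this dictionary both follow from the N column and the 1,209 total rows. A short arithmetic check, using only the counts printed in the table above (`coverage_summary` is a hypothetical helper name):

```python
TOTAL_ROWS = 1209  # state-year rows in the source panel

# Non-missing counts (N) for the three covariates with gaps.
NON_MISSING = {"lnincome": 1014, "beer": 546, "age15to24": 819}

def coverage_summary(non_missing, total):
    """Return {variable: (coverage string, missing-share string)}."""
    summary = {}
    for name, n in non_missing.items():
        coverage = n / total
        summary[name] = (f"{coverage:.0%}", f"{1 - coverage:.1%}")
    return summary

summary = coverage_summary(NON_MISSING, TOTAL_ROWS)
for name, (coverage, missing) in summary.items():
    print(f"{name}: coverage {coverage}, missing {missing}")
# lnincome: coverage 84%, missing 16.1%
# beer: coverage 45%, missing 54.8%
# age15to24: coverage 68%, missing 32.3%
```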

data_california · year (California only) · 31 × 8 · 1970–2000 · California (31 years: 19 Pre + 12 Post)

Panel key: year · Within-California analysis subset for the naive pre-post, ITS growth-curve, ITS-ARIMA and RDD-on-time estimators.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`state` identifier	U.S. state name	Name of the U.S. state (treated unit is California; the other 38 are donor states).	From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only.	string	Abadie, Diamond & Hainmueller (2010)	39 states (1 in data_california.csv)
`year` year	Calendar year	Annual time index; 1989 is the Proposition 99 policy date.	Source panel runs 1970–2000; the last full pre-period year is 1988.	year	Abadie, Diamond & Hainmueller (2010)	1970-2000
`cigsale` continuous	Per-capita cigarette pack sales	Annual per-capita sales of cigarette packs (the policy outcome).	Observed; fully populated in all three files (no imputation needed).	packs per capita	Abadie, Diamond & Hainmueller (2010)	fully observed
`lnincome` continuous	Log per-capita income	Natural log of per-capita personal income (covariate).	Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation.	log US$	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	16.1% missing in source; complete in imputed
`beer` continuous	Per-capita beer consumption	Per-capita beer consumption (covariate proxy for tobacco-related behaviour).	Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation.	gallons per capita	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	54.8% missing in source; complete in imputed
`age15to24` continuous	Share of population aged 15–24	Fraction of the state population aged 15 to 24 (covariate).	Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation.	0-1 (share)	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	32.3% missing in source; complete in imputed
`retprice` continuous	Retail cigarette price	Average retail price of cigarettes (covariate).	Observed; fully populated in all three files (no imputation needed).	US cents per pack	Abadie, Diamond & Hainmueller (2010)	fully observed
`prepost` identifier	Pre/Post policy period (1=Post)	Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward.	factor(year > 1988, labels = c('Pre','Post')); present only in data_california.csv.	Pre/Post	Derived (this study)	data_california.csv only (19 Pre, 12 Post)

Statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`state`	100%	31	1	–	–	–	–	–
`year`	100%	31	31	1970	1985.0	1985	2000	9.09
`cigsale`	100%	31	31	41.60	94.59	102.8	128.0	30.01
`lnincome`	84%	26	26	9.93	10.07	10.09	10.18	0.076
`beer`	45%	14	14	19.10	22.26	22.95	25.00	2.11
`age15to24`	68%	21	21	0.150	0.176	0.180	0.190	0.012
`retprice`	100%	31	30	38.80	119.9	98.00	351.2	77.90
`prepost`	100%	31	2	–	–	–	–	–

data_imputed · state-year · 1,209 × 7 · 1970–2000 · 39 U.S. states (balanced)

Panel key: state × year · Complete-case input for the CausalImpact Bayesian structural time-series model; the file itself is stored long, one row per state × year.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`state` identifier	U.S. state name	Name of the U.S. state (treated unit is California; the other 38 are donor states).	From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only.	string	Abadie, Diamond & Hainmueller (2010)	39 states (1 in data_california.csv)
`year` year	Calendar year	Annual time index; 1989 is the Proposition 99 policy date.	Source panel runs 1970–2000; the last full pre-period year is 1988.	year	Abadie, Diamond & Hainmueller (2010)	1970-2000
`cigsale` continuous	Per-capita cigarette pack sales	Annual per-capita sales of cigarette packs (the policy outcome).	Observed; fully populated in all three files (no imputation needed).	packs per capita	Abadie, Diamond & Hainmueller (2010)	fully observed
`lnincome` continuous	Log per-capita income	Natural log of per-capita personal income (covariate).	Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation.	log US$	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	16.1% missing in source; complete in imputed
`beer` continuous	Per-capita beer consumption	Per-capita beer consumption (covariate proxy for tobacco-related behaviour).	Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation.	gallons per capita	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	54.8% missing in source; complete in imputed
`age15to24` continuous	Share of population aged 15–24	Fraction of the state population aged 15 to 24 (covariate).	Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation.	0-1 (share)	Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv	32.3% missing in source; complete in imputed
`retprice` continuous	Retail cigarette price	Average retail price of cigarettes (covariate).	Observed; fully populated in all three files (no imputation needed).	US cents per pack	Abadie, Diamond & Hainmueller (2010)	fully observed

Statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`state`	100%	1,209	39	–	–	–	–	–
`year`	100%	1,209	31	1970	1985.0	1985	2000	8.95
`cigsale`	100%	1,209	703	40.70	118.9	116.3	296.2	32.77
`lnincome`	100%	1,209	1,014	9.40	9.87	9.87	10.49	0.177
`beer`	100%	1,209	145	2.50	22.88	22.20	40.40	4.18
`age15to24`	100%	1,209	819	0.129	0.168	0.170	0.204	0.018
`retprice`	100%	1,209	849	27.30	108.3	95.50	351.2	64.38

Known limitations & caveats

Single imputation. data_imputed uses one random-forest draw (m = 1) for tutorial speed; with multiple imputation (m > 1) or a different model the CausalImpact estimate can move by 1–3 packs.
Covariate missingness is heavy. In the source panel beer is missing 54.8% of rows, age15to24 32.3%, and lnincome 16.1%; cigsale and retprice are fully observed. The imputed values are model-based fills, not measurements.
RDD is segmented regression. The workshop labels the year-as-running-variable specification 'RDD'; with calendar time as the running variable it reduces to a piecewise (segmented) regression, not the classical sharp RDD on a continuous score.
Counterfactual, not data, drives the estimate. Five of the six estimator families converge on a 13–20 pack reduction, while DiD-vs-Nevada collapses to −5.7 packs (p = 0.31) and the ARIMA variant of ITS flips to +4.5 packs. The disagreement is the lesson, so triangulate across estimators.