Interactive data dictionary

Six Ways to Evaluate a Policy: Proposition 99

Companion data for an R tutorial that runs six causal estimators on California's 1989 cigarette-tax reform.

3 datasets · 8 variables · 39 states · years 1970–2000

Downloads

Each dataset is available as a labeled Stata .dta file and as its source .csv file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
| --- | --- | --- | --- | --- |
| proposition99 | state-year | 1,209 × 7 | proposition99.dta | proposition99.csv |
| data_california | year (California only) | 31 × 8 | data_california.dta | data_california.csv |
| data_imputed | state-year | 1,209 × 7 | data_imputed.dta | data_imputed.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
use "${BASE}proposition99.dta", clear
describe
notes

Python

# Notebook-only shell command (Colab/Jupyter); in a terminal run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df = pd.read_stata(BASE + "proposition99.dta")

# load every dataset at once
files = ["proposition99", "data_california", "data_imputed"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "proposition99.dta", "proposition99.dta")
df, meta = pyreadstat.read_dta("proposition99.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df <- read_dta(paste0(BASE, "proposition99.dta"))

Overview & sources

Companion data for a hands-on R tutorial that evaluates California's 1989 Proposition 99 cigarette tax by applying six estimator families to one shared panel and asking how much they disagree. The six families are naive pre-post, difference-in-differences, interrupted time series (two variants: a linear growth curve and an AICc-selected ARIMA), regression discontinuity on time, synthetic control (tidysynth), and Bayesian structural time series (CausalImpact). The data are a balanced annual panel of 39 U.S. states over 1970–2000 (1,209 state-year rows) with per-capita cigarette pack sales as the outcome and four covariates, prepared from the canonical Abadie, Diamond & Hainmueller (2010) Proposition 99 dataset distributed by the causalpolicy.nl workshop. Every causal method targets the average treatment effect on the treated (ATT) for California; the entire pipeline is open and reproducible.

Three files. proposition99 is the source 39-state annual panel (one row per state × year, 1970–2000) with four covariates, three of which (lnincome, beer, age15to24) are partly missing. data_california is the California-only series (one row per year) with a prepost factor marking the 1989 policy date; it is the analysis subset used by the naive, ITS and RDD estimators. data_imputed is the full panel after one round of mice random-forest imputation that fills every covariate gap; it is the complete-case input for the CausalImpact model.
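A balanced panel means every state appears exactly once in every year, so 39 states × 31 years gives the 1,209 rows quoted above. The sketch below shows one way to check that property on plain (state, year) pairs, which could be built from a loaded file with something like `zip(df["state"], df["year"])`. The function name `check_balanced` and the placeholder state names are illustrative assumptions, not part of the tutorial.

```python
from collections import Counter

def check_balanced(keys):
    """Return (n_states, n_years, is_balanced) for an iterable of (state, year) pairs."""
    keys = list(keys)
    states = {state for state, _ in keys}
    years = {year for _, year in keys}
    counts = Counter(keys)
    no_duplicates = all(c == 1 for c in counts.values())
    # Balanced: every state appears exactly once in every year.
    is_balanced = no_duplicates and len(counts) == len(states) * len(years)
    return len(states), len(years), is_balanced

# Toy keys with placeholder state names (not the real file): 39 states x 31 years.
toy_keys = [(f"state_{i:02d}", year) for i in range(39) for year in range(1970, 2001)]
n_states, n_years, balanced = check_balanced(toy_keys)
print(n_states, n_years, len(toy_keys), balanced)  # 39 31 1209 True
```

Dropping any single row from `toy_keys` makes the check return `False`, which is the failure mode worth catching before running panel estimators.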

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Abadie, Diamond & Hainmueller (2010) | Original Proposition 99 panel: per-capita cigarette sales + covariates for 39 states, 1970–2000 | Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. |
| causalpolicy.nl workshop (ODISSEI) | Prepared proposition99.rds distribution mirrored as proposition99.csv | ODISSEI Social Data Science Team. Causal Policy Evaluation workshop. https://causalpolicy.nl/ |
| Method references | Estimators and concepts | Holland (1986); Rubin (1974); Hyndman & Athanasopoulos (2021, fpp3); Brodersen et al. (2015, CausalImpact); van Buuren & Groothuis-Oudshoorn (2011, mice). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99 [Data set]. https://carlos-mendez.org/post/r_causalpolicy_workshop/

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505.

BibTeX

@misc{mendez2026rcausalpolicyworkshop,
  author       = {Mendez, Carlos},
  title        = {Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_causalpolicy_workshop/}},
  note         = {Data set}
}

@article{abadie2010synthetic,
  author  = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
  title   = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
  journal = {Journal of the American Statistical Association},
  volume  = {105}, number = {490}, pages = {493--505}, year = {2010}
}

Variable explorer (all 8 variables)


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| age15to24 | continuous | min 0.129, median 0.178, max 0.204 | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | 0-1 (share) | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| beer | continuous | min 2.5, median 23.3, max 40.4 | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | gallons per capita | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| cigsale | continuous | min 40.7, median 116, max 296 | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | packs per capita | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| lnincome | continuous | min 9.4, median 9.86, max 10.5 | Log per-capita income | Natural log of per-capita personal income (covariate). | log US$ | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| prepost | identifier | | Pre/Post policy period (1=Post) | Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward. | Pre/Post | data_california | Derived (this study) |
| retprice | continuous | min 27.3, median 95.5, max 351 | Retail cigarette price | Average retail price of cigarettes (covariate). | US cents per pack | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| state | identifier | | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | string | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| year | year | | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | year | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |

Cross-file variable index

Which file each variable appears in (● = present).

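| Variable | proposition99 | data_california | data_imputed |
| --- | --- | --- | --- |
| age15to24 | ● | ● | ● |
| beer | ● | ● | ● |
| cigsale | ● | ● | ● |
| lnincome | ● | ● | ● |
| prepost | | ● | |
| retprice | ● | ● | ● |
| state | ● | ● | ● |
| year | ● | ● | ● |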

Construction & formulas

Every method targets the average treatment effect on the treated (ATT) for California over 1989–2000. Each builds the missing counterfactual Ŷ₁ₜ(0), California's sales without Proposition 99, in a different way.
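To make the target concrete, the sketch below treats the ATT as the simple average, over the post-period years, of observed sales minus the counterfactual. That averaging convention is an assumption for illustration; the page only states that the estimators target the ATT for 1989–2000. The counterfactual shown is the naive pre-post one (the pre-period mean carried forward), and the function names and numbers are made up, not taken from the tutorial or the data files.

```python
from statistics import mean

def att(observed, counterfactual):
    """Average post-period gap between observed and counterfactual outcomes."""
    if len(observed) != len(counterfactual):
        raise ValueError("series must cover the same post-period years")
    return mean(y - y0 for y, y0 in zip(observed, counterfactual))

def naive_counterfactual(pre_outcomes, n_post):
    """Naive pre-post: carry the pre-period mean forward as a flat counterfactual."""
    return [mean(pre_outcomes)] * n_post

# Made-up numbers for illustration only (not values from the data files).
pre = [120.0, 118.0, 116.0, 114.0]
post = [100.0, 95.0, 90.0]
estimate = att(post, naive_counterfactual(pre, len(post)))
print(estimate)  # -22.0
```

Swapping `naive_counterfactual` for any other method's predicted series leaves `att` unchanged, which is why the estimators can be compared on one scale.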

Covariate imputation (for data_imputed): missing lnincome, beer and age15to24 values are filled with one round of random-forest multiple imputation, mice(m = 1, method = "rf"), under a global set.seed(42). Only the three NA-bearing covariates change; the fully-observed cigsale and retprice are byte-identical to the source panel.
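The claim that only the NA-bearing covariates change can be checked by comparing the source and imputed tables row by row. The following is a minimal sketch on plain lists of dicts, with `None` standing in for a missing value; the helper names and the toy rows are hypothetical, and a real check would first align both files on state and year.

```python
def changed_columns(source_rows, imputed_rows):
    """Columns whose values differ between two row-aligned lists of dicts."""
    if len(source_rows) != len(imputed_rows):
        raise ValueError("tables must have the same number of rows")
    changed = set()
    for src, imp in zip(source_rows, imputed_rows):
        changed.update(col for col in src if src[col] != imp[col])
    return changed

def observed_values_preserved(source_rows, imputed_rows):
    """True if every non-missing source value is unchanged after imputation."""
    return all(
        src[col] == imp[col]
        for src, imp in zip(source_rows, imputed_rows)
        for col in src
        if src[col] is not None  # None stands in for NA in this toy example
    )

# Toy rows with made-up values (not taken from the data files).
toy_source = [
    {"state": "A", "year": 1970, "cigsale": 100.0, "retprice": 30.0, "beer": None},
    {"state": "A", "year": 1971, "cigsale": 98.0, "retprice": 31.0, "beer": 20.0},
]
toy_imputed = [
    {"state": "A", "year": 1970, "cigsale": 100.0, "retprice": 30.0, "beer": 21.5},
    {"state": "A", "year": 1971, "cigsale": 98.0, "retprice": 31.0, "beer": 20.0},
]
print(changed_columns(toy_source, toy_imputed))            # {'beer'}
print(observed_values_preserved(toy_source, toy_imputed))  # True
```

On the real files, the expectation from the text above is that the changed set contains at most lnincome, beer and age15to24.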

The datasets

Each dataset below is shown with its full variable dictionary and a statistics table with distribution summaries and data coverage.


proposition99 · state-year · 1,209 × 7 · 1970-2000 · 39 U.S. states (balanced)

Panel key: state x year · Shared input panel for all six estimators; outcome = cigsale, treated unit = California from 1989.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- |
| state (identifier) | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only. | string | Abadie, Diamond & Hainmueller (2010) | 39 states (1 in data_california.csv) |
| year (year) | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | Source panel runs 1970–2000; the last full pre-period year is 1988. | year | Abadie, Diamond & Hainmueller (2010) | 1970-2000 |
| cigsale (continuous) | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | Observed; fully populated in all three files (no imputation needed). | packs per capita | Abadie, Diamond & Hainmueller (2010) | fully observed |
| lnincome (continuous) | Log per-capita income | Natural log of per-capita personal income (covariate). | Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation. | log US$ | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 16.1% missing in source; complete in imputed |
| beer (continuous) | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation. | gallons per capita | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 54.8% missing in source; complete in imputed |
| age15to24 (continuous) | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation. | 0-1 (share) | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 32.3% missing in source; complete in imputed |
| retprice (continuous) | Retail cigarette price | Average retail price of cigarettes (covariate). | Observed; fully populated in all three files (no imputation needed). | US cents per pack | Abadie, Diamond & Hainmueller (2010) | fully observed |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| state | | 100% | 1,209 | 39 | | | | | |
| year | | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| cigsale | min 40.7, median 116, max 296 | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| lnincome | min 9.4, median 9.86, max 10.5 | 84% | 1,014 | 1,014 | 9.40 | 9.86 | 9.86 | 10.49 | 0.171 |
| beer | min 2.5, median 23.3, max 40.4 | 45% | 546 | 145 | 2.50 | 23.43 | 23.30 | 40.40 | 4.22 |
| age15to24 | min 0.129, median 0.178, max 0.204 | 68% | 819 | 819 | 0.129 | 0.175 | 0.178 | 0.204 | 0.015 |
| retprice | min 27.3, median 95.5, max 351 | 100% | 1,209 | 849 | 27.30 | 108.3 | 95.50 | 351.2 | 64.38 |
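The Coverage column and the "missing in source" percentages in the variable dictionary are two views of the same counts. As a quick arithmetic check, the sketch below recomputes both from the N column and the 1,209 total rows; the helper name `coverage` is an illustrative assumption.

```python
def coverage(n_observed, n_rows):
    """Return (coverage %, missing %) for a column, each rounded to one decimal."""
    share = n_observed / n_rows
    return round(100 * share, 1), round(100 * (1 - share), 1)

N_ROWS = 1209  # 39 states x 31 years
observed = {"lnincome": 1014, "beer": 546, "age15to24": 819}  # N column of the table

for name, n in observed.items():
    cov, miss = coverage(n, N_ROWS)
    print(f"{name}: {cov}% observed, {miss}% missing")
# lnincome: 83.9% observed, 16.1% missing
# beer: 45.2% observed, 54.8% missing
# age15to24: 67.7% observed, 32.3% missing
```

The table rounds coverage to whole percents (84%, 45%, 68%), while the dictionary reports the missing share to one decimal.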

data_california · year (California only) · 31 × 8 · 1970-2000 · California (31 years: 19 Pre + 12 Post)

Panel key: year · Within-California analysis subset for the naive pre-post, ITS growth-curve, ITS-ARIMA and RDD-on-time estimators.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- |
| state (identifier) | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only. | string | Abadie, Diamond & Hainmueller (2010) | 39 states (1 in data_california.csv) |
| year (year) | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | Source panel runs 1970–2000; the last full pre-period year is 1988. | year | Abadie, Diamond & Hainmueller (2010) | 1970-2000 |
| cigsale (continuous) | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | Observed; fully populated in all three files (no imputation needed). | packs per capita | Abadie, Diamond & Hainmueller (2010) | fully observed |
| lnincome (continuous) | Log per-capita income | Natural log of per-capita personal income (covariate). | Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation. | log US$ | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 16.1% missing in source; complete in imputed |
| beer (continuous) | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation. | gallons per capita | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 54.8% missing in source; complete in imputed |
| age15to24 (continuous) | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation. | 0-1 (share) | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 32.3% missing in source; complete in imputed |
| retprice (continuous) | Retail cigarette price | Average retail price of cigarettes (covariate). | Observed; fully populated in all three files (no imputation needed). | US cents per pack | Abadie, Diamond & Hainmueller (2010) | fully observed |
| prepost (identifier) | Pre/Post policy period (1=Post) | Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward. | factor(year > 1988, labels = c('Pre','Post')); present only in data_california.csv. | Pre/Post | Derived (this study) | data_california.csv only (19 Pre, 12 Post) |
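The prepost construction is a single threshold on year. For readers working in Python rather than R, here is a rough equivalent of the `factor(year > 1988, ...)` rule, with a count confirming the 19 Pre and 12 Post split; the constant and function names are illustrative and not part of the tutorial.

```python
from collections import Counter

POLICY_YEAR = 1989  # Proposition 99 policy date, per the dictionary above

def prepost_label(year):
    """Label a year 'Pre' (up to 1988) or 'Post' (1989 onward)."""
    return "Post" if year >= POLICY_YEAR else "Pre"

years = range(1970, 2001)  # 1970-2000 inclusive
labels = [prepost_label(y) for y in years]
counts = Counter(labels)
print(counts["Pre"], counts["Post"])  # 19 12
```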

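Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| state | | 100% | 31 | 1 | | | | | |
| year | | 100% | 31 | 31 | 1970 | 1985.0 | 1985 | 2000 | 9.09 |
| cigsale | min 41.6, median 103, max 128 | 100% | 31 | 31 | 41.60 | 94.59 | 102.8 | 128.0 | 30.01 |
| lnincome | min 9.93, median 10.1, max 10.2 | 84% | 26 | 26 | 9.93 | 10.07 | 10.09 | 10.18 | 0.076 |
| beer | min 19.1, median 22.9, max 25 | 45% | 14 | 14 | 19.10 | 22.26 | 22.95 | 25.00 | 2.11 |
| age15to24 | min 0.15, median 0.18, max 0.19 | 68% | 21 | 21 | 0.150 | 0.176 | 0.180 | 0.190 | 0.012 |
| retprice | min 38.8, median 98, max 351 | 100% | 31 | 30 | 38.80 | 119.9 | 98.00 | 351.2 | 77.90 |
| prepost | | 100% | 31 | 2 | | | | | |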

data_imputed · state-year · 1,209 × 7 · 1970-2000 · 39 U.S. states (balanced)

Panel key: state x year · Complete-case wide-format input for the CausalImpact Bayesian structural time-series model.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- |
| state (identifier) | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only. | string | Abadie, Diamond & Hainmueller (2010) | 39 states (1 in data_california.csv) |
| year (year) | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | Source panel runs 1970–2000; the last full pre-period year is 1988. | year | Abadie, Diamond & Hainmueller (2010) | 1970-2000 |
| cigsale (continuous) | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | Observed; fully populated in all three files (no imputation needed). | packs per capita | Abadie, Diamond & Hainmueller (2010) | fully observed |
| lnincome (continuous) | Log per-capita income | Natural log of per-capita personal income (covariate). | Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation. | log US$ | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 16.1% missing in source; complete in imputed |
| beer (continuous) | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation. | gallons per capita | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 54.8% missing in source; complete in imputed |
| age15to24 (continuous) | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation. | 0-1 (share) | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 32.3% missing in source; complete in imputed |
| retprice (continuous) | Retail cigarette price | Average retail price of cigarettes (covariate). | Observed; fully populated in all three files (no imputation needed). | US cents per pack | Abadie, Diamond & Hainmueller (2010) | fully observed |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| state | | 100% | 1,209 | 39 | | | | | |
| year | | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| cigsale | min 40.7, median 116, max 296 | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| lnincome | min 9.4, median 9.87, max 10.5 | 100% | 1,209 | 1,014 | 9.40 | 9.87 | 9.87 | 10.49 | 0.177 |
| beer | min 2.5, median 22.2, max 40.4 | 100% | 1,209 | 145 | 2.50 | 22.88 | 22.20 | 40.40 | 4.18 |
| age15to24 | min 0.129, median 0.17, max 0.204 | 100% | 1,209 | 819 | 0.129 | 0.168 | 0.170 | 0.204 | 0.018 |
| retprice | min 27.3, median 95.5, max 351 | 100% | 1,209 | 849 | 27.30 | 108.3 | 95.50 | 351.2 | 64.38 |

Known limitations & caveats