Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) | stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| proposition99 | state-year | 1,209 × 7 | proposition99.dta | proposition99.csv |
| data_california | year (California only) | 31 × 8 | data_california.dta | data_california.csv |
| data_imputed | state-year | 1,209 × 7 | data_imputed.dta | data_imputed.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
use "${BASE}proposition99.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df = pd.read_stata(BASE + "proposition99.dta")
# load every dataset at once
files = ["proposition99", "data_california", "data_imputed"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "proposition99.dta", "proposition99.dta")
df, meta = pyreadstat.read_dta("proposition99.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_causalpolicy_workshop/data/"
df <- read_dta(paste0(BASE, "proposition99.dta"))
Overview & sources
Companion data for a hands-on R tutorial that evaluates California's 1989 Proposition 99 cigarette tax by applying six estimator families — naive pre-post, difference-in-differences, two interrupted time-series variants (linear growth curve and an AICc-selected ARIMA), regression discontinuity on time, synthetic control (tidysynth), and Bayesian structural time series (CausalImpact) — to one shared panel and asking how much they disagree. The data are a balanced annual panel of 39 U.S. states over 1970–2000 (1,209 state-year rows) with per-capita cigarette pack sales as the outcome and four covariates, prepared from the canonical Abadie, Diamond & Hainmueller (2010) Proposition 99 dataset distributed by the causalpolicy.nl workshop. Every causal method targets the average treatment effect on the treated (ATT) for California; the entire pipeline is open and reproducible.
proposition99 is the source 39-state annual panel (one row per state × year, 1970–2000) with four covariates, three of which (lnincome, beer, age15to24) are partly missing. data_california is the California-only series (one row per year) with a prepost factor marking the 1989 policy date; it is the analysis subset used by the naive, ITS and RDD estimators. data_imputed is the full panel after one round of mice random-forest imputation that fills every covariate gap, giving the complete-case input for the CausalImpact model.
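As a rough illustration of how the California series relates to the full panel, the sketch below filters a toy panel to California and derives a Pre/Post flag at the 1989 policy date. The helper name `make_california` and the toy sales values are assumptions for illustration only; the column names `state`, `year` and `prepost` follow the variable dictionary, and this is not the tutorial's own code.

```python
import pandas as pd

def make_california(panel: pd.DataFrame, policy_year: int = 1989) -> pd.DataFrame:
    """Filter a state-year panel to California and add a Pre/Post flag.

    Hypothetical helper: mirrors the documented rule
    Pre = up to 1988, Post = 1989 onward.
    """
    ca = panel.loc[panel["state"] == "California"].copy()
    ca["prepost"] = pd.Categorical(
        ["Post" if y >= policy_year else "Pre" for y in ca["year"]],
        categories=["Pre", "Post"],
        ordered=True,
    )
    return ca.sort_values("year").reset_index(drop=True)

# Toy panel with made-up sales values (two states, four years).
toy = pd.DataFrame({
    "state": ["California"] * 4 + ["Nevada"] * 4,
    "year": [1987, 1988, 1989, 1990] * 2,
    "cigsale": [100.0, 98.0, 90.0, 85.0, 150.0, 148.0, 146.0, 144.0],
})
ca = make_california(toy)
print(ca[["year", "prepost"]])
```

Applied to the real 1970–2000 panel, the same rule should give the 19 Pre and 12 Post years reported in the data_california dictionary.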
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Abadie, Diamond & Hainmueller (2010) | Original Proposition 99 panel: per-capita cigarette sales + covariates for 39 states, 1970–2000 | Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505. |
| causalpolicy.nl workshop (ODISSEI) | Prepared proposition99.rds distribution mirrored as proposition99.csv | ODISSEI Social Data Science Team. Causal Policy Evaluation workshop. https://causalpolicy.nl/ |
| Method references | Estimators and concepts | Holland (1986); Rubin (1974); Hyndman & Athanasopoulos (2021, fpp3); Brodersen et al. (2015, CausalImpact); van Buuren & Groothuis-Oudshoorn (2011, mice). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99 [Data set]. https://carlos-mendez.org/post/r_causalpolicy_workshop/
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493–505.
BibTeX
@misc{mendez2026rcausalpolicyworkshop,
author = {Mendez, Carlos},
title = {Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_causalpolicy_workshop/}},
note = {Data set}
}
@article{abadie2010synthetic,
author = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
title = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
journal = {Journal of the American Statistical Association},
volume = {105}, number = {490}, pages = {493--505}, year = {2010}
}
Variable explorer (all 8 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| age15to24 | continuous | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | 0-1 (share) | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| beer | continuous | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | gallons per capita | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| cigsale | continuous | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | packs per capita | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| lnincome | continuous | Log per-capita income | Natural log of per-capita personal income (covariate). | log US$ | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv |
| prepost | identifier | Pre/Post policy period (1=Post) | Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward. | Pre/Post | data_california | Derived (this study) |
| retprice | continuous | Retail cigarette price | Average retail price of cigarettes (covariate). | US cents per pack | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| state | identifier | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | string | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
| year | year | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | year | proposition99, data_california, data_imputed | Abadie, Diamond & Hainmueller (2010) |
Cross-file variable index
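Which file each variable appears in (● = present).
| Variable | proposition99 | data_california | data_imputed |
|---|---|---|---|
| age15to24 | ● | ● | ● |
| beer | ● | ● | ● |
| cigsale | ● | ● | ● |
| lnincome | ● | ● | ● |
| prepost | | ● | |
| retprice | ● | ● | ● |
| state | ● | ● | ● |
| year | ● | ● | ● |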
Construction & formulas
Every method targets the average treatment effect on the treated (ATT) for California over 1989–2000. Each builds the missing counterfactual Ŷ₁ₜ(0), California's sales without Proposition 99, in a different way.
- Naive pre-post: Ŷ₁ₜ(0) = Ȳ₁,pre, i.e. California's own pre-period mean.
- Difference-in-Differences: Ŷ₁ₜ(0) = Ȳ₁,pre + (Ȳ₀,post − Ȳ₀,pre), i.e. add Nevada's pre-to-post change.
- ITS (growth curve): Ŷ₁ₜ(0) = α̂ + β̂·t, i.e. extrapolate California's pre-period linear fit.
- ITS (ARIMA): Ŷ₁ₜ(0) = forecast from an AICc-selected ARIMA(1,2,0) fitted on 1970–1988.
- RDD on time: segmented regression cigsale ~ year0 + prepost + year0:prepost; the level-break coefficient on prepost is the headline effect.
- Synthetic Control: Ŷ₁ₜ(0) = Σ wᵢ* Yᵢₜ, a convex weighted blend of donor states minimising pre-1988 RMSE.
- CausalImpact (BSTS): y₁ₜ = μₜ + βᵀxₜ + εₜ, a local-level trend plus a regression on donor-state series and covariates, fit on the pre-period and projected forward.
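The three simplest counterfactuals above (naive pre-post, difference-in-differences, and the linear growth curve) can be sketched in a few lines of plain Python. This is a minimal sketch on made-up numbers, not the tutorial's R code; the function names and the toy series are assumptions, and "control" stands in for the single comparison state.

```python
from statistics import mean

def naive_counterfactual(y1_pre):
    """Naive pre-post: the treated unit's own pre-period mean."""
    return mean(y1_pre)

def did_counterfactual(y1_pre, y0_pre, y0_post):
    """DiD: treated pre-period mean plus the control's pre-to-post change."""
    return mean(y1_pre) + (mean(y0_post) - mean(y0_pre))

def linear_counterfactual(t_pre, y1_pre, t_post):
    """ITS growth curve: OLS line fitted on the pre-period, extrapolated forward."""
    t_bar, y_bar = mean(t_pre), mean(y1_pre)
    beta = sum((t - t_bar) * (y - y_bar) for t, y in zip(t_pre, y1_pre)) / sum(
        (t - t_bar) ** 2 for t in t_pre
    )
    alpha = y_bar - beta * t_bar
    return [alpha + beta * t for t in t_post]

def att(y1_post, y_cf):
    """Average gap between observed post-period outcomes and the counterfactual."""
    return mean(obs - cf for obs, cf in zip(y1_post, y_cf))

# Made-up toy series (not the Proposition 99 data).
t_pre, t_post = [0.0, 1.0, 2.0, 3.0], [4.0, 5.0]
y1_pre, y1_post = [100.0, 98.0, 96.0, 94.0], [80.0, 76.0]   # treated unit
y0_pre, y0_post = [120.0, 118.0, 116.0, 114.0], [112.0, 110.0]  # control unit

n_post = len(y1_post)
att_naive = att(y1_post, [naive_counterfactual(y1_pre)] * n_post)
att_did = att(y1_post, [did_counterfactual(y1_pre, y0_pre, y0_post)] * n_post)
att_its = att(y1_post, linear_counterfactual(t_pre, y1_pre, t_post))
print(att_naive, att_did, att_its)
```

On these toy numbers the naive estimate is larger in magnitude than the other two because the pre-period mean ignores the existing downward trend, which is the kind of disagreement the comparison is meant to expose.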
Covariate imputation (for data_imputed): missing lnincome, beer and age15to24 values are filled with one round of random-forest imputation, mice(m = 1, method = "rf"), under a global set.seed(42). Only these three NA-bearing covariates change; the fully observed cigsale and retprice are byte-identical to the source panel.
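One way to check the claim that only the NA-bearing covariates change is to compare the source and imputed frames column by column. The sketch below does this on a toy pair of frames; the helper names `changed_columns` and `observed_untouched` and the filled value are hypothetical, and the same functions could be pointed at the real proposition99 and data_imputed files after loading them.

```python
import numpy as np
import pandas as pd

def changed_columns(source: pd.DataFrame, imputed: pd.DataFrame) -> list:
    """Return the columns whose values differ between two aligned frames.

    A missing value in both frames counts as equal.
    """
    changed = []
    for col in source.columns:
        a, b = source[col], imputed[col]
        same = (a == b) | (a.isna() & b.isna())
        if not bool(same.all()):
            changed.append(col)
    return changed

def observed_untouched(source: pd.DataFrame, imputed: pd.DataFrame, col: str) -> bool:
    """True if every originally observed value in `col` is unchanged."""
    mask = source[col].notna()
    return bool((source.loc[mask, col] == imputed.loc[mask, col]).all())

# Toy frames with made-up values: one beer gap, filled in the "imputed" copy.
source = pd.DataFrame({
    "cigsale": [100.0, 98.0, 95.0],
    "retprice": [40.0, 42.0, 45.0],
    "beer": [22.0, np.nan, 23.5],
})
imputed = source.copy()
imputed.loc[1, "beer"] = 22.8  # stand-in for a model-based fill

changed = changed_columns(source, imputed)
print(changed, observed_untouched(source, imputed, "beer"))
```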
The datasets
Each dataset below has a variable dictionary and a statistics table with data coverage.
Variable dictionary (proposition99)
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| state | identifier | U.S. state name | Name of the U.S. state (treated unit is California; the other 38 are donor states). | From the source Abadie–Diamond–Hainmueller panel; data_california.csv is filtered to California only. | string | Abadie, Diamond & Hainmueller (2010) | 39 states (1 in data_california.csv) |
| year | year | Calendar year | Annual time index; 1989 is the Proposition 99 policy date. | Source panel runs 1970–2000; the last full pre-period year is 1988. | year | Abadie, Diamond & Hainmueller (2010) | 1970-2000 |
| cigsale | continuous | Per-capita cigarette pack sales | Annual per-capita sales of cigarette packs (the policy outcome). | Observed; fully populated in all three files (no imputation needed). | packs per capita | Abadie, Diamond & Hainmueller (2010) | fully observed |
| lnincome | continuous | Log per-capita income | Natural log of per-capita personal income (covariate). | Source covariate; missing in 16.1% of source rows, filled in data_imputed.csv by random-forest imputation. | log US$ | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 16.1% missing in source; complete in imputed |
| beer | continuous | Per-capita beer consumption | Per-capita beer consumption (covariate proxy for tobacco-related behaviour). | Source covariate; missing in 54.8% of source rows, filled in data_imputed.csv by random-forest imputation. | gallons per capita | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 54.8% missing in source; complete in imputed |
| age15to24 | continuous | Share of population aged 15–24 | Fraction of the state population aged 15 to 24 (covariate). | Source covariate; missing in 32.3% of source rows, filled in data_imputed.csv by random-forest imputation. | 0-1 (share) | Abadie, Diamond & Hainmueller (2010); imputed (mice rf) in data_imputed.csv | 32.3% missing in source; complete in imputed |
| retprice | continuous | Retail cigarette price | Average retail price of cigarettes (covariate). | Observed; fully populated in all three files (no imputation needed). | US cents per pack | Abadie, Diamond & Hainmueller (2010) | fully observed |
Distribution & statistics (proposition99)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| state | 100% | 1,209 | 39 | — | — | — | — | — |
| year | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| cigsale | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| lnincome | 84% | 1,014 | 1,014 | 9.40 | 9.86 | 9.86 | 10.49 | 0.171 |
| beer | 45% | 546 | 145 | 2.50 | 23.43 | 23.30 | 40.40 | 4.22 |
| age15to24 | 68% | 819 | 819 | 0.129 | 0.175 | 0.178 | 0.204 | 0.015 |
| retprice | 100% | 1,209 | 849 | 27.30 | 108.3 | 95.50 | 351.2 | 64.38 |
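The Coverage, N and Distinct columns can be recomputed from any of the files with a few pandas calls. The sketch below is one possible way to do it, shown on a toy frame with made-up values; the helper name `coverage_table` is an assumption, and coverage is taken to mean the share of non-missing rows.

```python
import pandas as pd

def coverage_table(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column share of non-missing rows, non-missing count, and distinct values."""
    return pd.DataFrame({
        "coverage_pct": (df.notna().mean() * 100).round(1),
        "n": df.notna().sum(),
        "distinct": df.nunique(),  # nunique() ignores missing values by default
    })

# Toy frame with made-up values: beer is observed in 2 of 4 rows.
toy_stats = pd.DataFrame({
    "cigsale": [100.0, 98.0, 95.0, 95.0],
    "beer": [20.0, None, None, 22.0],
})
cov = coverage_table(toy_stats)
print(cov)
```

For the source panel, the same arithmetic gives, for example, beer at 546 observed rows out of 1,209, or roughly 45% coverage.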
Variable dictionary (data_california)
The seven variables in the first variable dictionary above apply unchanged; this file adds one derived variable.
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| prepost | identifier | Pre/Post policy period (1=Post) | Factor marking the Proposition 99 period: Pre = up to 1988, Post = 1989 onward. | factor(year > 1988, labels = c('Pre','Post')); present only in data_california.csv. | Pre/Post | Derived (this study) | data_california.csv only (19 Pre, 12 Post) |
Distribution & statistics (data_california)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| state | 100% | 31 | 1 | — | — | — | — | — |
| year | 100% | 31 | 31 | 1970 | 1985.0 | 1985 | 2000 | 9.09 |
| cigsale | 100% | 31 | 31 | 41.60 | 94.59 | 102.8 | 128.0 | 30.01 |
| lnincome | 84% | 26 | 26 | 9.93 | 10.07 | 10.09 | 10.18 | 0.076 |
| beer | 45% | 14 | 14 | 19.10 | 22.26 | 22.95 | 25.00 | 2.11 |
| age15to24 | 68% | 21 | 21 | 0.150 | 0.176 | 0.180 | 0.190 | 0.012 |
| retprice | 100% | 31 | 30 | 38.80 | 119.9 | 98.00 | 351.2 | 77.90 |
| prepost | 100% | 31 | 2 | — | — | — | — | — |
Variable dictionary (data_imputed)
Same seven variables as the first variable dictionary above, with identical definitions, units, and sources; lnincome, beer, and age15to24 are complete here after random-forest imputation.
Distribution & statistics (data_imputed)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| state | 100% | 1,209 | 39 | — | — | — | — | — |
| year | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| cigsale | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| lnincome | 100% | 1,209 | 1,014 | 9.40 | 9.87 | 9.87 | 10.49 | 0.177 |
| beer | 100% | 1,209 | 145 | 2.50 | 22.88 | 22.20 | 40.40 | 4.18 |
| age15to24 | 100% | 1,209 | 819 | 0.129 | 0.168 | 0.170 | 0.204 | 0.018 |
| retprice | 100% | 1,209 | 849 | 27.30 | 108.3 | 95.50 | 351.2 | 64.38 |
Known limitations & caveats
- Single imputation. data_imputed uses one random-forest draw (m = 1) for tutorial speed; with multiple imputation (m > 1) or a different model the CausalImpact estimate can move by 1–3 packs.
- Covariate missingness is heavy. In the source panel beer is missing in 54.8% of rows, age15to24 in 32.3%, and lnincome in 16.1%; cigsale and retprice are fully observed. The imputed values are model-based fills, not measurements.
- RDD is segmented regression. The workshop labels the year-as-running-variable specification 'RDD'; with calendar time as the running variable it reduces to a piecewise (segmented) regression, not the classical sharp RDD on a continuous score.
- Counterfactual, not data, drives the estimate. Five of the seven estimators (six families, with ITS run two ways) converge on a 13–20 pack reduction, while DiD-vs-Nevada collapses to −5.7 packs (p = 0.31) and ARIMA-based ITS flips to +4.5 packs. The disagreement is the lesson, so triangulate across estimators.
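To make the "RDD is segmented regression" caveat concrete, the sketch below fits the piecewise specification cigsale ~ year0 + prepost + year0:prepost by ordinary least squares with NumPy. It is a sketch under stated assumptions: `year0` is taken to be the year centred at 1989 (the source does not define it), the function name `segmented_fit` is hypothetical, and the series is synthetic with a known level break rather than the California data.

```python
import numpy as np

def segmented_fit(year, y, policy_year=1989):
    """OLS fit of y ~ year0 + prepost + year0:prepost.

    Assumes year0 = year - policy_year, so the prepost coefficient is the
    level break at the policy date and the interaction is the slope change.
    """
    year = np.asarray(year, dtype=float)
    y = np.asarray(y, dtype=float)
    year0 = year - policy_year
    post = (year >= policy_year).astype(float)
    X = np.column_stack([np.ones_like(year0), year0, post, year0 * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept", "year0", "prepost", "year0:prepost"], coef))

# Synthetic, noiseless series with a built-in level break of -15
# and a slope change of -1 at 1989 (made-up numbers).
years = np.arange(1980, 1996)
year0 = years - 1989
post = (years >= 1989).astype(float)
sales = 100.0 - 2.0 * year0 - 15.0 * post - 1.0 * year0 * post

fit = segmented_fit(years, sales)
print(fit)
```

Because the running variable is calendar time, nothing here depends on units near a cutoff being comparable, which is why the specification is better read as a piecewise trend model than as a classical sharp RDD.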