Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| r_sc_bayes_spatial_source_data | state-year | 1,209 × 6 | r_sc_bayes_spatial_source_data.dta | r_sc_bayes_spatial_source_data.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_sc_bayes_spatial/data/"
use "${BASE}r_sc_bayes_spatial_source_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_sc_bayes_spatial/data/"
df = pd.read_stata(BASE + "r_sc_bayes_spatial_source_data.dta")
# load every dataset at once
files = ["r_sc_bayes_spatial_source_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "r_sc_bayes_spatial_source_data.dta", "r_sc_bayes_spatial_source_data.dta")
df, meta = pyreadstat.read_dta("r_sc_bayes_spatial_source_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_sc_bayes_spatial/data/"
df <- read_dta(paste0(BASE, "r_sc_bayes_spatial_source_data.dta"))
Overview & sources
Companion data for an R tutorial that replicates the California case study of Sakaguchi & Tagawa (2026) on cigarette consumption and Proposition 99. The single file is the balanced state-year panel bundled in the scspill replication package — the same real tobacco panel introduced by Abadie, Diamond & Hainmueller (2010): per-capita cigarette sales (cigsale) and real retail price (retprice) for 39 US states over 1970–2000. California is the one treated unit (Prop 99 switches on in 1988); the other 38 states form the donor pool. The post fits three nested synthetic-control estimators on this panel — classical SCM (tidysynth), a Bayesian horseshoe-prior SCM, and a Bayesian spatial SCM with a spatial autoregressive (SAR) layer — and reads the ATT, the donor weights, and the per-state spillovers off each.
r_sc_bayes_spatial_source_data.csv is a balanced annual state panel: one row per state × year, 39 states × 31 years = 1,209 rows. The panel covers 18 pre-treatment years (1970–1987) and 13 post-treatment years (1988–2000), and the treatment dummy equals 1 only for the 13 California rows in 1988–2000. The contiguity weights used by the SAR layer (California's w row and the 38×38 donor W matrix) ship separately inside the scspill package and are not part of this CSV.
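The balance claim above (39 states × 31 years = 1,209 rows, no gaps) is easy to verify after loading. The following is a minimal sketch, not part of the original post: `check_balanced_panel` is a hypothetical helper, and `toy` is a stand-in frame with the documented shape rather than the real data.

```python
import pandas as pd

def check_balanced_panel(df, unit="state", time="year"):
    """Return (n_units, n_periods, is_balanced) for a long-format panel."""
    n_units = df[unit].nunique()
    n_periods = df[time].nunique()
    # Balanced means: every unit-period pair appears exactly once.
    no_duplicates = not df.duplicated([unit, time]).any()
    is_balanced = no_duplicates and len(df) == n_units * n_periods
    return n_units, n_periods, is_balanced

# Hypothetical stand-in with the documented shape: 39 units x 31 years (1970-2000).
# Swap in the DataFrame loaded from the real file to check the actual panel.
toy = pd.DataFrame(
    [(f"state_{i:02d}", y) for i in range(1, 40) for y in range(1970, 2001)],
    columns=["state", "year"],
)
n_units, n_periods, balanced = check_balanced_panel(toy)
print(n_units, n_periods, len(toy), balanced)
```

Dropping any single row from `toy` makes the check return `False`, which is the failure mode to look for if a download is truncated.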
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Sakaguchi & Tagawa (2026) | Inspiring study + the scspill replication package that bundles this exact panel and the spatial weights | Sakaguchi, S. & Tagawa, H. (2026). Identification and Bayesian Inference for Synthetic Control Methods with Spillover Effects. The Econometrics Journal. https://doi.org/10.1093/ectj/utag006 (replication package: Zenodo record 19066186). |
| Abadie, Diamond & Hainmueller (2010) | Original California tobacco panel (cigsale, retprice) and the synthetic control method | Abadie, A., Diamond, A. & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746 |
| Method references | Estimators and concepts | Carvalho, Polson & Scott (2010, horseshoe prior); LeSage & Pace (2009, spatial econometrics). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Bayesian Spatial Synthetic Control: California's Proposition 99 in R [Data set]. https://carlos-mendez.org/post/r_sc_bayes_spatial/
Sakaguchi, S., & Tagawa, H. (2026). Identification and Bayesian Inference for Synthetic Control Methods with Spillover Effects. The Econometrics Journal. https://doi.org/10.1093/ectj/utag006
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
BibTeX
@misc{mendez2026rscbayesspatial,
author = {Mendez, Carlos},
title = {Bayesian Spatial Synthetic Control: California's Proposition 99 in R},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_sc_bayes_spatial/}},
note = {Data set}
}
@article{sakaguchi2026spillover,
author = {Sakaguchi, Shosei and Tagawa, Hisahiro},
title = {Identification and {Bayesian} Inference for Synthetic Control Methods with Spillover Effects},
journal = {The Econometrics Journal},
year = {2026},
doi = {10.1093/ectj/utag006}
}
@article{abadie2010synthetic,
author = {Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens},
title = {Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program},
journal = {Journal of the American Statistical Association},
volume = {105}, number = {490}, pages = {493--505}, year = {2010},
doi = {10.1198/jasa.2009.ap08746}
}
Variable explorer
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| cigsale | continuous | Per-capita cigarette sales (packs) | Annual per-capita cigarette pack sales; the synthetic-control outcome. | packs per capita per year | r_sc_bayes_spatial_source_data | scspill package (Abadie et al. 2010) |
| retprice | continuous | Real retail price (cents/pack) | Average real retail price of a cigarette pack; the lone SAR covariate X. | cents per pack (real) | r_sc_bayes_spatial_source_data | scspill package (Abadie et al. 2010) |
| state | identifier | State name | US state identifier (the treated unit is California; the other 38 are donors). | string | r_sc_bayes_spatial_source_data | scspill package (Abadie et al. 2010) |
| state_id | identifier | State numeric ID | Integer index for the state (1-39); California is 3. | integer (1-39) | r_sc_bayes_spatial_source_data | scspill package |
| treatment | dummy | Treatment dummy (1=CA post-Prop 99) | 1 for California in the post-treatment period, else 0; flags the treated unit-years. | 0/1 | r_sc_bayes_spatial_source_data | Constructed in analysis.R |
| year | year | Calendar year | Annual time index of the panel. | year | r_sc_bayes_spatial_source_data | scspill package (Abadie et al. 2010) |
Cross-file variable index
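Which file each variable appears in (● = present).
| Variable | r_sc_bayes_spatial_source_data |
|---|---|
| cigsale | ● |
| retprice | ● |
| state | ● |
| state_id | ● |
| treatment | ● |
| year | ● |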
Construction & formulas
The data are an observed state-year panel. Apart from the treatment dummy, the post derives no new columns, and the three estimators below all operate on the same `cigsale` outcome. California is the treated unit; the 38 other states are donors.
- ATT: `ATT = mean_t( Y_obs(CA, t) − Y_synth(CA, t) )`, averaged over the post-treatment years 1988–2000, where the synthetic California is a weighted blend of donor states.
- Classical SCM (Stage 1): choose donor weights on the simplex (`α_j ≥ 0, Σ α_j = 1`) that minimize pre-treatment fit error, via `tidysynth`.
- Bayesian horseshoe SCM (Stage 2): `α_j ~ N(0, τ² λ_j²)` with `τ, λ_j ~ HalfCauchy(0, 1)`, which applies heavy-tailed shrinkage to the same weights and relaxes the simplex.
- Bayesian spatial SCM (Stage 3): `Y_c,t = ρ·W·Y_c,t + X_c,t·β + Y_c^lag·α + ε_t` adds a SAR layer with contiguity matrix `W` and spatial autocorrelation `ρ`, relaxing SUTVA, so per-donor spillovers become estimable.
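To make the ATT and simplex-weight definitions concrete, here is a small pure-Python sketch. It is an illustration only: `synthetic_series` and `att` are hypothetical helpers, the numbers are invented toy values rather than panel data, and the weights are fixed by hand instead of being estimated as `tidysynth` would do.

```python
import math

def synthetic_series(donor_rows, weights):
    """Blend donor outcomes; donor_rows[t][j] is donor j's outcome in period t."""
    # Classical SCM restricts weights to the simplex: non-negative, summing to 1.
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1 (simplex)")
    return [sum(w * y for w, y in zip(weights, row)) for row in donor_rows]

def att(y_obs, y_synth, years, first_post_year=1988):
    """Mean observed-minus-synthetic gap over years >= first_post_year."""
    gaps = [o - s for o, s, yr in zip(y_obs, y_synth, years) if yr >= first_post_year]
    return sum(gaps) / len(gaps)

# Hypothetical toy numbers (not from the panel): two donors, four years.
years = [1986, 1987, 1988, 1989]
donor_rows = [[100.0, 120.0], [98.0, 118.0], [96.0, 116.0], [94.0, 114.0]]
weights = [0.5, 0.5]
y_obs = [110.0, 108.0, 100.0, 96.0]

y_synth = synthetic_series(donor_rows, weights)
toy_att = att(y_obs, y_synth, years)
print(y_synth, toy_att)
```

In this toy case the synthetic unit matches the observed series exactly before 1988, so the whole post-period gap is attributed to treatment. The Bayesian stages replace the hand-set weights with posterior draws, and the same gap average can then be computed per draw.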
The treatment dummy is constructed as `1 if state == "California" & year ≥ 1988 else 0` (the package's Prop 99 convention). Everything else in the file is observed data read directly from the replication package.
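The dummy rule can be reproduced in pandas as follows. This is a sketch under stated assumptions: the post builds the column in analysis.R, so `add_treatment` is a hypothetical Python equivalent, and the three-state frame is toy data with the documented year range.

```python
import pandas as pd

def add_treatment(df, treated="California", first_post_year=1988):
    """Return a copy of df with a 0/1 treatment column for treated unit-years."""
    out = df.copy()
    is_treated = (out["state"] == treated) & (out["year"] >= first_post_year)
    out["treatment"] = is_treated.astype(int)
    return out

# Toy frame: three states over 1970-2000 (the real panel has 39 states).
toy = pd.DataFrame(
    [(s, y) for s in ["California", "Nevada", "Utah"] for y in range(1970, 2001)],
    columns=["state", "year"],
)
toy = add_treatment(toy)
n_ones = int(toy["treatment"].sum())
print(n_ones)
```

The count of ones depends only on the treated state's post-period length (1988 through 2000), so it matches the 13 treated rows documented for the full panel.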
The datasets
The single dataset is documented below with its full variable dictionary, followed by summary statistics and data coverage for each variable.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| state (identifier) | State name | US state identifier (the treated unit is California; the other 38 are donors). | From the scspill replication package panel (Abadie et al. 2010). | string | scspill package (Abadie et al. 2010) | 39 states |
| state_id (identifier) | State numeric ID | Integer index for the state (1-39); California is 3. | Sequential package index aligned to the alphabetical state list. | integer (1-39) | scspill package | 39 states |
| year (year) | Calendar year | Annual time index of the panel. | Observed year, 1970-2000 (balanced; 31 years per state). | year | scspill package (Abadie et al. 2010) | 1970-2000 |
| cigsale (continuous) | Per-capita cigarette sales (packs) | Annual per-capita cigarette pack sales; the synthetic-control outcome. | Observed tax-paid cigarette sales per capita, from the Abadie et al. (2010) tobacco data. | packs per capita per year | scspill package (Abadie et al. 2010) | all rows |
| retprice (continuous) | Real retail price (cents/pack) | Average real retail price of a cigarette pack; the lone SAR covariate X. | Observed retail price per pack from the Abadie et al. (2010) tobacco data. | cents per pack (real) | scspill package (Abadie et al. 2010) | all rows |
| treatment (dummy) | Treatment dummy (1=CA post-Prop 99) | 1 for California in the post-treatment period, else 0; flags the treated unit-years. | 1 if state == 'California' & year >= 1988, else 0 (package Prop 99 convention). | 0/1 | Constructed in analysis.R | 13 ones (California 1988-2000); 1,196 zeros |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| state | 100% | 1,209 | 39 | – | – | – | – | – |
| state_id | 100% | 1,209 | 39 | – | – | – | – | – |
| year | 100% | 1,209 | 31 | 1970 | 1985.0 | 1985 | 2000 | 8.95 |
| cigsale | 100% | 1,209 | 703 | 40.70 | 118.9 | 116.3 | 296.2 | 32.77 |
| retprice | 100% | 1,209 | 849 | 27.30 | 108.3 | 95.50 | 351.2 | 64.38 |
| treatment | 100% | 1,209 | 2 | 0 | 0.011 | 0 | 1.00 | 0.103 |
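The `year` and `treatment` rows follow from the panel design alone, so they can be reproduced without downloading anything. The sketch below rebuilds both columns from the documented counts and assumes the SD column is the sample standard deviation (n − 1 denominator); that is an inference from the reported values, not something the page states.

```python
import statistics

# year: each of the 31 years 1970-2000 repeated once per state (39 states).
years = [y for _ in range(39) for y in range(1970, 2001)]

# treatment: 13 ones (California 1988-2000) and 1,196 zeros.
treatment = [1] * 13 + [0] * (1209 - 13)

year_sd = round(statistics.stdev(years), 2)          # sample SD, n - 1 denominator
treat_mean = round(statistics.mean(treatment), 3)
treat_sd = round(statistics.stdev(treatment), 3)
print(statistics.mean(years), statistics.median(years), year_sd)
print(treat_mean, treat_sd)
```

The population SD of `year` would round to 8.94 rather than 8.95, which is why the sample formula is the assumption that fits the table.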
Known limitations & caveats
- Narrow predictor set. The shipped panel carries only `cigsale` and `retprice`, not the log income, youth-share, or beer-sales predictors that Abadie et al. (2010) used. This is the main reason the classical ATT (≈ −18) is smaller in magnitude than Abadie et al.'s published ≈ −27.
- Spatial weights live elsewhere. The SAR layer needs California's contiguity vector `w` and the 38×38 donor contiguity matrix `W`, which ship inside the `scspill` package, not in this CSV. The data dictionary documents only the outcome panel.
- Tutorial-scale MCMC. The post runs 5,000 iterations (vs the paper's 100,000), so the SAR ρ effective sample size is only ~3 and the Stage 3 credible interval is illustrative, not paper-grade. This is an analysis caveat, not a data caveat; the panel itself is exact.
- Treatment timing. Prop 99 was approved in November 1988 and took effect January 1989, but the package convention codes 1988 as the first post-treatment year; the `treatment` dummy follows that convention.