Six Ways to Evaluate a Policy — Interactive Lab

A pedagogical companion to Six Ways to Evaluate a Policy using R: Comparative Case Studies of Proposition 99

Six ways to fill in the missing counterfactual

In January 1989 California raised its cigarette tax by 25 cents per pack (Proposition 99). Per-capita cigarette sales then dropped from 116 packs in 1988 to 60 packs in 2000 — a 48% fall. But smoking was declining nationwide too. The whole post is built around one question: how much of that drop was caused by the tax, and how much would have happened anyway?

Every causal estimator on offer answers that question the same way: construct a counterfactual — what California's sales would have looked like without Proposition 99 — and report the gap between observed and counterfactual as the policy effect. What changes from method to method is how the counterfactual is built. This app lets you see all six constructions side-by-side, simulate the easy and the hard cases, and reproduce the post's headline forest plot.
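The arithmetic behind that framing can be sketched in a few lines of Python. The observed figures are the ones quoted above; the counterfactual value is a made-up placeholder for illustration, not an estimate from the post.

```python
# Observed per-capita sales quoted in the text (packs per person).
sales_1988 = 116.0
sales_2000 = 60.0

raw_drop = sales_1988 - sales_2000            # total decline: policy plus everything else
raw_drop_pct = 100.0 * raw_drop / sales_1988  # the "48% fall"

# HYPOTHETICAL counterfactual for 2000: what sales might have been without the tax.
# Illustrative only; each method in the lab constructs its own version of this number.
counterfactual_2000 = 80.0

# Every estimator reports the same quantity: observed minus counterfactual.
policy_effect = sales_2000 - counterfactual_2000

print(f"raw drop: {raw_drop:.0f} packs ({raw_drop_pct:.0f}%)")
print(f"effect under the assumed counterfactual: {policy_effect:+.0f} packs")
```

Only part of the raw 56-pack drop is attributed to the policy once a counterfactual is in place; how large that part is depends entirely on how the counterfactual is built.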

One observed series, six constructed counterfactuals

The orange curve is California's actual smoking history (1970–2000). The vertical dashed line is the 1989 policy threshold. Click any method button below to overlay that method's estimate of the no-Proposition-99 counterfactual. The shaded gap between the orange observed curve and the chosen dashed counterfactual is the policy effect that method reports.

Tab 2

Counterfactual Simulator

Generate a stylised policy panel. Set the control state's secular trend, then watch DiD-with-one-control collapse to noise while a multi-donor blend recovers the true effect.

Tab 3

Seven-Method Forest Plot

The post's headline figure. Toggle methods to compare the 13 to 28 pack consensus against the DiD-vs-Nevada and ITS-ARIMA outliers.

Tab 4

Bias & Variance Lab

Run 100 simulations with a true effect you set. Watch which methods are biased, which are noisy, and how the donor pool size changes Synthetic Control's variance.

Glossary (open a card if a term is unfamiliar)

Counterfactual
What the treated unit's outcome would have been in the absence of the treatment. The unobservable column the whole tutorial is trying to estimate.
ATT — Average Treatment effect on the Treated
The effect averaged over only the units that received the policy. Every causal method here targets California's ATT, not a population-wide ATE.
Parallel trends
The DiD assumption: treated and control would have moved in parallel without the policy. Differences in levels are allowed; differences in trends are not.
Donor pool
Synthetic Control's set of untreated states. The weights are chosen by the data to mimic the treated unit's pre-period, rather than by hand-picking one neighbour.
RMSPE ratio
Synthetic Control's main fit/effect statistic. The pre-period RMSPE measures fit quality; the post/pre ratio measures effect size in unitless terms. California's ratio is 120.5 — the highest in the panel.
Fisher exact p-value
Refit Synthetic Control 38 times, once for each donor state pretending to be treated, and rank all 39 units by their post/pre RMSPE ratio. The treated unit's rank ÷ total units is the p-value. California's is 1/39 ≈ 0.026.
BSTS — Bayesian Structural Time Series
CausalImpact's model: y_t = μ_t + β′x_t + ε_t. Fit on the pre-period, then project the no-policy counterfactual forward as a posterior distribution with a credible interval.
Posterior credible interval
A Bayesian interval that contains the true parameter with 95% probability, given the data and the model. Unlike a frequentist confidence interval, it supports that direct probability statement.
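The RMSPE ratio and the placebo-rank p-value from the glossary can be sketched as follows. This is a minimal illustration, not the post's code: the function names are invented here, and apart from California's ratio of 120.5 every input number is hypothetical.

```python
import math

def rmspe(gaps):
    """Root mean squared prediction error of observed-minus-synthetic gaps."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

def rmspe_ratio(pre_gaps, post_gaps):
    """Post-period RMSPE divided by pre-period RMSPE (unitless)."""
    return rmspe(post_gaps) / rmspe(pre_gaps)

def placebo_p_value(treated_ratio, placebo_ratios):
    """Rank of the treated unit's ratio among all units, divided by the number of units."""
    all_ratios = [treated_ratio] + list(placebo_ratios)
    rank = sum(1 for r in all_ratios if r >= treated_ratio)  # rank 1 = largest ratio
    return rank / len(all_ratios)

# HYPOTHETICAL gaps for one unit: tight pre-period fit, large post-period divergence.
ratio = rmspe_ratio(pre_gaps=[1.0, -1.0, 1.0, -1.0], post_gaps=[-10.0, -10.0])

# HYPOTHETICAL placebo ratios for the 38 donors, all smaller than California's 120.5.
placebo_ratios = [1.0 + 0.5 * i for i in range(38)]
p = placebo_p_value(treated_ratio=120.5, placebo_ratios=placebo_ratios)

print(round(ratio, 1), round(p, 3))
```

Because the treated unit out-ranks every placebo, the p-value is 1/39, the smallest value a panel of 39 units can produce.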

Counterfactual Simulator — why one control is fragile, and many is robust

Simulate a stylised state-year panel where California is treated in year 0 and there are J donor states. You set the true ATT, the noise level, and the control-trend asymmetry: how far the single chosen control state's trend departs from California's secular trend. Then watch the gap between (a) DiD against one neighbour and (b) a synthetic-control-style weighted blend.

Donor states (J)
A bigger donor pool lets the blend match the treated unit's pre-period more precisely.
True ATT
The known causal effect. We will see which method recovers it.
Noise (σ)
Year-to-year sales noise. Bigger σ ⇒ less precise estimates.
Control-trend asymmetry
0 = chosen control follows California's secular trend perfectly · 1 = control drifts in the opposite direction. This is the Nevada-vs-many-states fingerprint from §6.
True ATT
held fixed for comparison
Naive pre-post α̂
no counterfactual at all
DiD α̂ (one control)
subtract single neighbour's change
SCM-style α̂ (blend)
weighted average of all J donors

What to look for

  • Push asymmetry to 0. The single control mirrors California's secular trend and DiD collapses to roughly zero (or to the true ATT only if you also lower the secular trend) — Nevada's fingerprint in §6.
  • Push asymmetry to 1. The control state's trend flips against California's secular drift. DiD now overstates the effect because it subtracts a positive change instead of a negative one.
  • Watch the teal SCM-style estimate. Even as the asymmetry slider whips back and forth, the blended estimate stays close to the true ATT, because averaging J donors washes out any single donor's idiosyncratic trend.
  • Drop J to 5. The SCM-style blend gets noisier and starts to track the single-DiD estimate. This is the intuition behind the "at least 20 donors" rule of thumb for Synthetic Control.
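A single run of such a simulator might look like the sketch below. It is not the app's implementation. The data-generating process, the parameter names `control_trend_gap` and `donor_trend_sd`, and the equal-weight blend are all assumptions made for illustration, and `control_trend_gap` (in packs per year) does not reproduce the scale of the asymmetry slider.

```python
import random
from statistics import mean

def simulate_once(n_donors=20, true_att=-20.0, sigma=3.0, secular_trend=-2.0,
                  control_trend_gap=-1.5, donor_trend_sd=0.5,
                  n_pre=10, n_post=10, seed=0):
    rng = random.Random(seed)
    years = list(range(-n_pre, n_post))   # treatment starts in year 0

    def series(trend, effect):
        # Linear trend, a level shift from year 0 onward, and Gaussian noise.
        return [100.0 + trend * t + (effect if t >= 0 else 0.0) + rng.gauss(0.0, sigma)
                for t in years]

    def change(y):
        # Post-period mean minus pre-period mean.
        pre = [v for v, t in zip(y, years) if t < 0]
        post = [v for v, t in zip(y, years) if t >= 0]
        return mean(post) - mean(pre)

    california = series(secular_trend, true_att)
    # ASSUMPTION: the hand-picked control's trend is off by control_trend_gap per year.
    single_control = series(secular_trend + control_trend_gap, 0.0)
    # ASSUMPTION: donor trends scatter randomly around the shared secular trend.
    donors = [series(secular_trend + rng.gauss(0.0, donor_trend_sd), 0.0)
              for _ in range(n_donors)]
    # Equal weights stand in for the blend; these are NOT optimised SCM weights.
    blend = [mean(vals) for vals in zip(*donors)]

    return {
        "naive": change(california),
        "did": change(california) - change(single_control),
        "blend": change(california) - change(blend),
    }

print(simulate_once())
```

Calling `simulate_once(sigma=0.0, donor_trend_sd=0.0)` removes the randomness: naive pre-post returns -40 (the effect plus the trend accumulated between the two period means), DiD returns -5 because the control falls almost as fast as California, and the blend returns the true -20.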

The post's seven-method forest plot — interactively

These numbers come straight from table_cross_method.csv in the post's folder — the same data used to produce fig9_cross_method_forest.png. Toggle methods to compare. Five of the seven estimators agree on a 13 to 28 packs/capita reduction. DiD-vs-Nevada (−5.7, CI crosses zero) and ITS-ARIMA (+4.5) are the outliers — for completely different reasons.

What to look for

  • Toggle the two outliers off (DiD vs Nevada and ITS-ARIMA) and watch the remaining five methods cluster between −13 and −28 packs/capita.
  • Compare Synthetic Control (teal, −18.8) and CausalImpact (steel, −12.8). Both build their counterfactual from many donor states. Their intervals overlap. They are the post's most defensible answers.
  • RDD on time (−20.1) lands in the middle of the consensus group, but inherits all the pre-trend mis-specification risk from ITS — its tight CI is conditional on the segmented-regression model being right.
  • Naive pre-post and ITS (growth curve) are nearly identical (≈ −27 vs −28). Both use only within-California information, so they cannot separate the policy effect from the nationwide secular decline.
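The consensus-versus-outlier split can be checked directly from the point estimates quoted on this tab. The sketch below hard-codes those numbers rather than reading table_cross_method.csv, whose column layout is not shown here; the two values marked approximate are the rounded figures from the text.

```python
# Point estimates quoted in the text (packs per capita).
estimates = {
    "Naive pre-post": -27.0,       # approximate: quoted as about -27
    "DiD vs Nevada": -5.7,
    "ITS (growth curve)": -28.0,   # approximate: quoted as about -28
    "ITS-ARIMA": 4.5,
    "RDD on time": -20.1,
    "Synthetic Control": -18.8,
    "CausalImpact": -12.8,
}

outliers = {"DiD vs Nevada", "ITS-ARIMA"}
consensus = {m: v for m, v in estimates.items() if m not in outliers}

lo, hi = min(consensus.values()), max(consensus.values())
print(f"{len(consensus)} consensus methods, range {lo:.1f} to {hi:.1f}")
```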

Methods

Synthetic California's donor recipe

tidysynth chose convex weights (non-negative, summing to one) that minimise pre-1989 RMSE on the lagged outcomes and four covariates. The optimal mix turns out to be a five-state cocktail of Utah, Nevada, Montana, Colorado and Connecticut, which together absorb 99.8% of the weight. Read this as "the synthetic California that best matches the real California's 1970–1988 trajectory is built from these five states in these proportions".
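The idea of convex weights that minimise pre-period error can be sketched without any package. The example below is a simplification under stated assumptions: it matches pre-period outcomes only (no covariates or predictor weights), it uses exponentiated-gradient descent rather than whatever optimiser tidysynth runs, and the three donor series are toy numbers in arbitrary units.

```python
import math

def fit_convex_weights(treated, donors, steps=2000, lr=1.0):
    """Minimise pre-period mean squared error over weights that are >= 0 and sum to 1.

    Exponentiated-gradient descent: multiplicative updates keep every weight
    positive, and renormalising keeps the weights on the simplex.
    NOTE: lr is tuned to this toy data's scale and would need adjusting otherwise.
    """
    k, n = len(donors), len(treated)
    w = [1.0 / k] * k
    for _ in range(steps):
        synth = [sum(w[j] * donors[j][t] for j in range(k)) for t in range(n)]
        resid = [synth[t] - treated[t] for t in range(n)]
        grad = [2.0 / n * sum(resid[t] * donors[j][t] for t in range(n)) for j in range(k)]
        w = [w[j] * math.exp(-lr * grad[j]) for j in range(k)]
        total = sum(w)
        w = [x / total for x in w]
    return w

# TOY pre-period outcomes for three hypothetical donors.
donors = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# The "treated" unit is built as an exact 60/30/10 blend, so the answer is known.
true_weights = [0.6, 0.3, 0.1]
treated = [sum(tw * d[t] for tw, d in zip(true_weights, donors)) for t in range(4)]

weights = fit_convex_weights(treated, donors)
print([round(x, 3) for x in weights])
```

Building the treated series from known weights makes it easy to see whether the fit recovers the recipe; with real data there is no such ground truth and pre-period fit is judged by RMSPE instead.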

Why does DiD vs Nevada collapse?

Nevada is geographically and culturally adjacent to California. Its own cigarette sales fell by 21.3 packs over 1984–1993 — almost as much as California's 27.0 pack drop. When DiD subtracts that Nevada change from California's change, almost all of California's drop is absorbed. Synthetic Control's response is to blend 38 donor states using data-driven weights, so no single similar-trending control can dominate.
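The collapse is plain arithmetic. The sketch below uses only the two changes quoted in the paragraph above; the variable names are chosen here for illustration.

```python
# Changes in per-capita sales over 1984-1993, as quoted in the text (packs).
california_change = -27.0
nevada_change = -21.3

# Difference-in-differences: the treated unit's change minus the control's change.
did_estimate = california_change - nevada_change

# Share of California's drop that the Nevada comparison subtracts away.
share_absorbed = nevada_change / california_change

print(f"DiD estimate: {did_estimate:+.1f} packs")
print(f"share of California's drop absorbed by Nevada: {share_absorbed:.0%}")
```

Roughly four fifths of California's decline is cancelled by a single control state, which leaves the −5.7 estimate shown in the forest plot.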

Why does ITS-ARIMA flip sign?

The AICc-selected ARIMA(1, 2, 0) model double-differences California's pre-period series. That picks up the acceleration of the late 1980s downward trend and extrapolates that acceleration aggressively. The model's counterfactual lands below California's observed post-period sales, implying the policy "raised" smoking by ≈ 4.5 packs. This is the textbook warning against single-best-AIC ITS without a comparison-unit reality check.
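The mechanism can be illustrated with a hand-rolled forecast. This is a sketch under assumptions, not the post's fitted model: the series is a toy with an accelerating decline, the AR coefficient `phi` is passed in as a hypothetical value rather than estimated, and the function names are invented here. It shows that a twice-differenced model carries the most recent slope and acceleration forward, whereas a straight-line trend carries the average slope.

```python
def forecast_arima_120(y, phi, horizon):
    """Point forecasts for a twice-differenced AR(1): d2_t = phi * d2_{t-1} + noise."""
    level = y[-1]
    slope = y[-1] - y[-2]                       # most recent first difference
    accel = (y[-1] - y[-2]) - (y[-2] - y[-3])   # most recent second difference
    forecasts = []
    for _ in range(horizon):
        accel = phi * accel    # acceleration decays geometrically when |phi| < 1
        slope += accel         # slope keeps steepening while acceleration persists
        level += slope
        forecasts.append(level)
    return forecasts

def forecast_linear_trend(y, horizon):
    """Ordinary least-squares line through the whole series, extended forward."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    sxy = sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y))
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    slope = sxy / sxx
    return [y_bar + slope * (n + h - t_bar) for h in range(horizon)]

# TOY pre-period series: the decline speeds up by one pack every year.
pre_period = [100.0, 99.0, 97.0, 94.0, 90.0, 85.0]

arima_like = forecast_arima_120(pre_period, phi=0.5, horizon=3)   # phi is HYPOTHETICAL
linear = forecast_linear_trend(pre_period, horizon=3)

print("ARIMA(1,2,0)-style:", arima_like)
print("linear trend:      ", [round(v, 1) for v in linear])
```

The double-differenced forecast drops faster than the straight line at every horizon. If observed post-period sales sit between the two curves, the same data yield a positive "effect" against the first counterfactual and a negative one against the second.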

Bias & Variance Lab — run the methods 100 times

Single runs are noisy. Run the simulator from Tab 2 one hundred times with fresh random draws (same parameters) to see whether DiD-with-one-control is systematically biased — and whether the multi-donor blend's variance is small enough to be useful. Each simulation regenerates J donor states from scratch and applies all three estimators.

Donor states (J)
Capped at 40 so 100 simulations finish quickly.
True ATT
The known causal effect. The orange vertical line on the histogram.
Noise (σ)
Year-to-year sales noise.
Control-trend asymmetry
Same as Tab 2: controls how badly the single DiD control mismatches.

Run the simulation

Each click runs 100 fresh draws and tallies the three estimators' answers.

Connecting back to the post

  • The orange Naive-pre-post histogram sits on the true α only when the simulator has no secular trend; with a trend switched on, it is centred far from α because the pre-post difference absorbs the trend along with the policy effect. This is why the post calls Naive pre-post "descriptive, not causal".
  • The orange DiD histogram is wide — its variance comes from picking a single noisy control unit. Bigger σ ⇒ wider DiD histogram. The post's Nevada DiD has the same fragility.
  • The teal SCM-style histogram is tight and centred on the true α at J = 20+. Averaging J donors shrinks the variance of the donor-side noise by a factor of J (its standard deviation by √J), whereas single-control DiD carries one donor's noise in full. This is the principled justification for Synthetic Control over hand-picked DiD.
  • Drop J to 5, run again. The teal histogram widens. At J = 5 the blend is still better than single-DiD, but the gain over DiD is small. This is the empirical justification for the "at least 20 donors" rule of thumb.
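The variance comparison can be reproduced with a small Monte Carlo. The sketch below is an assumption-laden stand-in for the lab, not its code: it omits the secular trend and the asymmetry control (so both estimators are unbiased and only their spread differs), it uses equal donor weights, and the function names and default values are invented here.

```python
import random
from statistics import mean, stdev

def one_draw(rng, n_donors, true_att, sigma, n_pre=10, n_post=10):
    def change(effect):
        # Post-period mean minus pre-period mean for one unit with pure noise.
        pre = [rng.gauss(0.0, sigma) for _ in range(n_pre)]
        post = [effect + rng.gauss(0.0, sigma) for _ in range(n_post)]
        return mean(post) - mean(pre)

    treated = change(true_att)
    donors = [change(0.0) for _ in range(n_donors)]
    did = treated - donors[0]         # single hand-picked control
    blend = treated - mean(donors)    # equal-weight average of all J donors
    return did, blend

def run_lab(n_donors, n_sims=100, true_att=-20.0, sigma=5.0, seed=1):
    rng = random.Random(seed)
    draws = [one_draw(rng, n_donors, true_att, sigma) for _ in range(n_sims)]
    did, blend = zip(*draws)
    return {
        "did_mean": mean(did), "did_sd": stdev(did),
        "blend_mean": mean(blend), "blend_sd": stdev(blend),
    }

for j in (5, 20):
    print(j, {k: round(v, 2) for k, v in run_lab(j).items()})
```

In this setup only the donor-side noise shrinks with J. California's own noise is common to both estimators, so the blend's total spread falls by less than √J relative to DiD, and most of the attainable gain is already reached at moderate J.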