Six Ways to Evaluate a Policy

Comparative case studies of California’s Proposition 99, in R

−18.9 synthetic-control ATT (packs/capita)
+4.5 ARIMA sign flip · same data
0.026 placebo Fisher exact p

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

California’s smoking fell 48% after Proposition 99 — but how much was the tax?

In January 1989 California raised its cigarette tax by 25 cents per pack. Average per-capita sales fell from 116 packs (1970–1988) to 60 packs (1989–2000).

But the whole country was smoking less. How much of California’s drop did the policy actually cause?

One dataset, six estimators (ITS run two ways): the estimates range from −28 to +4.5 packs

Effect on per-capita cigarette sales, 95% intervals · naive, DiD, two ITS, RDD-on-time, synthetic control, CausalImpact. Dashed line = zero.

Where we’re going

  • The estimand: every method targets the ATT on California, 1989–2000
  • Each method imputes the same missing piece — California’s no-policy counterfactual
  • Six estimators, escalating discipline: naive · DiD · ITS · RDD · synthetic control · CausalImpact
  • The lesson: the choice of counterfactual, not the data, drives the estimate

The Investigation

Act II

Causal inference is a missing-data problem: we never see \(Y(0)\) for treated California

\[Y_{it} = D_{it}\, Y_{it}(1) + (1 - D_{it})\, Y_{it}(0)\]

Each state-year has two potential outcomes; we observe only one. For California after 1989 the missing one is \(Y_{it}(0)\) — sales without the tax.

Every estimator in this deck is a different way to impute that missing \(Y_{it}(0)\).
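A minimal sketch of the switching equation on a toy panel may help fix ideas. The data frame `toy` and every number in it are made up for illustration; only the formula comes from the slide.

```r
# Toy panel: two units, two periods. All numbers are illustrative, not Prop 99 data.
toy <- data.frame(
  unit = c("treated", "treated", "control", "control"),
  post = c(0, 1, 0, 1),
  y0   = c(100, 90, 110, 100),  # potential outcome without the policy
  y1   = c(100, 75, 110, 85)    # potential outcome with the policy
)
toy$d <- as.integer(toy$unit == "treated" & toy$post == 1)

# Switching equation: Y = D * Y(1) + (1 - D) * Y(0)
toy$y_obs <- toy$d * toy$y1 + (1 - toy$d) * toy$y0

# What an analyst actually sees: Y(0) is missing exactly where D = 1
toy$y0_seen <- ifelse(toy$d == 1, NA, toy$y0)
toy[, c("unit", "post", "d", "y_obs", "y0_seen")]
```

The single `NA` in `y0_seen` is the cell each estimator has to fill in.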

The target is the ATT on California, not the population-wide ATE

\[\text{ATT} = \mathbb{E}\big[Y_{it}(1) - Y_{it}(0) \,\big|\, D_{it} = 1\big]\]

ATT (our target)

  • effect on the unit actually treated
  • California, post-1989
  • identifiable with one treated unit

ATE (not estimable here)

  • effect averaged over every state
  • needs policy variation we lack
  • Prop 99 was never randomized
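The difference between the two estimands can be shown with hypothetical unit-level effects. The vector `tau` below is invented and would never be observable in practice; the sketch only illustrates which units each average runs over.

```r
# Hypothetical unit-level effects tau = Y(1) - Y(0); values are illustrative only.
tau <- c(A = -20, B = -5, C = -2, D = -1)
d   <- c(A = 1,   B = 0,  C = 0,  D = 0)   # only unit A is treated

att <- mean(tau[d == 1])  # average over treated units only
ate <- mean(tau)          # average over every unit
c(ATT = att, ATE = ate)
```

With one treated unit, the ATT is simply that unit's own effect, which is why it remains a sensible target here.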

The shared logic: effect = observed − counterfactual; only the counterfactual differs

Fixed for everyone

  • California’s observed post-1989 series
  • the policy date (1989)
  • the 39-state panel

What each method changes

  • naive → California’s pre-mean
  • DiD → Nevada’s pre→post change
  • ITS → California’s own pre-trend
  • SCM → weighted donor blend
  • CausalImpact → Bayesian fit on donors

Six dashed arrows feed one counterfactual box; the gap to the observed series is the effect.
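The shared logic can be written as one small helper that takes any counterfactual series. The function name `effect_gap` and the numbers are assumptions made for this sketch, not part of the original analysis.

```r
# Effect = mean(observed - counterfactual) over the post-period.
effect_gap <- function(observed, counterfactual) {
  stopifnot(length(observed) == length(counterfactual))
  mean(observed - counterfactual)
}

# Illustrative post-period series (made-up numbers, not the Prop 99 data)
observed <- c(80, 75, 70)

cf_pre_mean <- rep(100, 3)     # counterfactual "frozen at the pre-period mean"
cf_trend    <- c(95, 90, 85)   # counterfactual "pre-trend extended forward"

c(pre_mean = effect_gap(observed, cf_pre_mean),
  trend    = effect_gap(observed, cf_trend))
```

The observed series is identical in both calls; only the counterfactual argument changes the answer.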

Naive pre-post overstates the effect at −27 packs — it has no counterfactual at all

| Quantity | Estimate | HAC SE | \(p\) |
|---|---|---|---|
| Pre-period mean (1984–88) | 98.98 | | |
| Post-period shift | −27.02 | 5.30 | <0.001 |

Estimand: purely descriptive. Identifying assumption: California’s pre-level would have frozen — almost certainly false.
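As a rough sketch, the naive estimate is the coefficient on a post-period dummy in a one-regressor model. The series below is made up. Plain `lm()` standard errors ignore serial correlation, so the HAC standard error in the table would need a separate correction (for example Newey-West, available in the `sandwich` package), which is not shown here.

```r
# Made-up annual series: 5 pre-years and 5 post-years (not the Prop 99 data)
sales <- c(100, 98, 101, 99, 102, 75, 72, 70, 74, 69)
post  <- rep(c(0, 1), each = 5)

fit <- lm(sales ~ post)

# Coefficient on post = post-period mean minus pre-period mean
shift <- unname(coef(fit)["post"])
shift
```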

DiD against Nevada collapses to −5.7 packs — a single bad control destroys the contrast

California (orange) vs Nevada (blue): both already falling before 1989, so the difference-in-differences contrast nearly vanishes.
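The two-by-two contrast behind DiD can be sketched by hand and checked against the interaction term of a saturated regression. The group means below are invented for illustration and are not the California or Nevada figures.

```r
# Made-up group-period means (packs per capita), for illustration only
means <- data.frame(
  state = c("treated", "treated", "control", "control"),
  post  = c(0, 1, 0, 1),
  y     = c(100, 70, 120, 95)
)
means$treated <- as.integer(means$state == "treated")

# 2x2 difference-in-differences by hand:
# (treated post - treated pre) - (control post - control pre)
did_by_hand <- (70 - 100) - (95 - 120)

# Same number as the interaction coefficient in a saturated regression
fit <- lm(y ~ treated * post, data = means)
did_reg <- unname(coef(fit)["treated:post"])

c(by_hand = did_by_hand, regression = did_reg)
```

When the control falls almost as fast as the treated unit, the second bracket cancels most of the first, which is the mechanism the Nevada comparison illustrates.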

ITS extrapolates California’s own pre-trend: a straight line gives −28 packs

\[\widehat{Y_{1t}(0)} = \hat\alpha + \hat\beta\, t, \qquad t > t^*\]

A linear fit on 1970–1988 (\(\hat\beta = -1.78\) packs/year, \(R^2 = 0.74\)) extended forward; the gap to observed averages −28.3 packs.

Assumption: the pre-trend has the right shape. No comparison unit — so it can’t separate policy from the secular decline.
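A minimal sketch of the linear ITS step, on a made-up series built so the answer is known in advance: fit a line on the pre-period only, extrapolate it, and average the gap.

```r
years <- 1970:2000
pre   <- years <= 1988

# Made-up series: a straight pre-trend, then a 10-pack drop below it after 1988
sales <- 120 - 2 * (years - 1970) - ifelse(pre, 0, 10)
dat   <- data.frame(year = years, sales = sales)

# Fit the line on the pre-period only, then extrapolate into the post-period
fit <- lm(sales ~ year, data = dat[pre, ])
counterfactual <- predict(fit, newdata = dat[!pre, ])

# Average gap between observed and extrapolated sales, 1989-2000
its_gap <- mean(dat$sales[!pre] - counterfactual)
its_gap
```

Because the toy trend is exactly linear, the gap recovers the 10-pack drop; with real data any curvature in the true no-policy path would be absorbed into the estimate.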

Swap the line for an AICc-chosen ARIMA and the effect flips sign to +4.5 packs

+4.5

ITS via ARIMA(1,2,0) · the counterfactual bends below observed, implying the tax raised smoking

Why the ARIMA counterfactual misfires: it forecasts below what California actually did

ARIMA(1,2,0) counterfactual (dashed, with 95% band) sits below the observed orange series throughout 1989–2000.
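One way to see how such a forecast behaves is to fit the same order with base R's `arima()` on a toy series. The data are simulated, and the order is fixed by hand rather than chosen by AICc as in the slides. With \(d = 2\) the forecast carries forward the slope of the last few pre-period years, so a steepening decline just before 1989 is projected onward.

```r
set.seed(1)
years <- 1970:1988

# Made-up pre-period series with a decline that steepens over time
sales <- 120 - 0.1 * (years - 1970)^2 + rnorm(length(years), sd = 1)

# ARIMA(1,2,0): one AR term on the twice-differenced series
fit <- arima(sales, order = c(1, 2, 0), method = "ML")

# Forecast 12 steps ahead (1989-2000) with approximate 95% bands
fc <- predict(fit, n.ahead = 12)
counterfactual <- as.numeric(fc$pred)
lower <- counterfactual - 1.96 * as.numeric(fc$se)
upper <- counterfactual + 1.96 * as.numeric(fc$se)
```

The forecast standard errors grow with the horizon, which matches the widening band described in the figure.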

RDD-on-time finds a −20 pack level break right at 1989

Piecewise pre/post linear fit (\(R^2 = 0.97\)): a clear level jump and a steeper slope at the 1989 threshold.
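The piecewise fit can be sketched as a regression on time centred at the threshold, a post-1989 dummy, and their interaction. The series below is made up so that the level jump and slope change are known; the variable names are assumptions for this sketch.

```r
years <- 1970:2000
t_c   <- years - 1989                 # running variable centred at the threshold
post  <- as.integer(years >= 1989)

# Made-up series with a -20 level jump and a steeper slope after the threshold
sales <- ifelse(post == 1, 80 - 4 * t_c, 100 - 2 * t_c)

fit <- lm(sales ~ t_c * post)
level_jump   <- unname(coef(fit)["post"])      # break in level at t_c = 0
slope_change <- unname(coef(fit)["t_c:post"])  # change in slope after the threshold

c(level_jump = level_jump, slope_change = slope_change)
```

Centring time at 1989 is what makes the `post` coefficient read as the jump at the threshold rather than at year zero.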

Synthetic control blends donors instead of picking one: −18.9 packs

California (observed) vs synthetic California: near-perfect pre-1989 fit, then a gap opens and widens to ~30 packs by 2000.
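The "blend" is a set of non-negative donor weights that sum to one, chosen to track the treated unit before the policy. The base-R sketch below is a simplified stand-in: it matches on pre-period outcomes only, uses simulated donors, and optimises through a softmax, whereas tidysynth also weights predictors through a V matrix.

```r
set.seed(42)
n_pre  <- 12
donors <- matrix(rnorm(n_pre * 3, mean = 100, sd = 10), nrow = n_pre,
                 dimnames = list(NULL, c("donor1", "donor2", "donor3")))

# Made-up treated unit: an exact convex blend of the donors in the pre-period
true_w  <- c(0.5, 0.3, 0.2)
treated <- as.numeric(donors %*% true_w)

# Softmax keeps weights non-negative and summing to one; last logit fixed at 0
to_weights <- function(theta) {
  e <- exp(c(theta, 0))
  e / sum(e)
}

# Pre-period mean squared prediction error for a given set of logits
pre_mspe <- function(theta) mean((treated - donors %*% to_weights(theta))^2)

opt <- optim(c(0, 0), pre_mspe, method = "BFGS",
             control = list(maxit = 1000, reltol = 1e-14))
w_hat <- to_weights(opt$par)
round(w_hat, 3)
```

Because the weights are constrained to a convex combination, the synthetic unit can only interpolate among donors, never extrapolate beyond them.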

Synthetic California is five states, anchored on lagged cigarette sales

Faceted plot_weights output: five donor states carry ~100% of the weight (left); two lagged outcomes dominate the V matrix (right).

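One pipeline fits synthetic control end-to-end with tidysynth

library(tidysynth)

prop99_syn <- prop99 |>
  # set up the panel, the treated unit, and the intervention time; also build placebos
  synthetic_control(outcome = cigsale, unit = state, time = year,
                    i_unit = "California", i_time = 1988,
                    generate_placebos = TRUE) |>
  # one lagged-outcome predictor shown here; further lags follow the same pattern
  generate_predictor(time_window = 1980, cigsale_1980 = cigsale) |>
  # fit donor weights on the pre-period
  generate_weights(optimization_window = 1970:1988) |>
  # build the synthetic California series
  generate_control()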

The matching works: synthetic California nails the pre-period the donor average misses

| Predictor | California | Synthetic | Donor avg |
|---|---|---|---|
| cigsale_1975 | 127.1 | 127.0 | 136.9 |
| cigsale_1980 | 120.2 | 120.2 | 138.1 |
| cigsale_1988 | 90.1 | 91.4 | 114.2 |

On 1988 sales, synthetic California (91.4) almost matches the real 90.1 — the donor average (114.2) is 24 packs off.

Inference without a standard error: California ranks 1st of 39, Fisher exact \(p = 0.026\)

0.026

Placebo Fisher exact \(p\) · MSPE ratio 123.9, more than 2.5× the next-highest state
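The arithmetic behind the p-value is a rank: the share of the 39 units whose post/pre MSPE ratio is at least as large as California's. In the sketch below California's ratio is taken from the slide, while the 38 placebo ratios are hypothetical placeholders chosen only to sit below it; `placebo_p` is a name introduced for illustration.

```r
# Permutation-style p-value: share of units whose MSPE ratio is at least as
# large as the treated unit's (the treated unit counts itself).
placebo_p <- function(ratios, treated) {
  mean(ratios >= ratios[treated])
}

# California's ratio is from the slide; the 38 donor ratios are hypothetical.
ratios <- c(California = 123.9, setNames(seq(1, 38), paste0("donor", 1:38)))

p <- placebo_p(ratios, "California")
round(p, 3)
```

Ranking first of 39 gives 1/39, which is the smallest p-value this test can produce with this donor pool.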

CausalImpact hands the donors to a Bayesian model: −13 packs, 92% posterior probability

Top: observed California vs Bayesian counterfactual with a widening credible band. Bottom: cumulative effect reaching ~−150 packs by 2000.

The Resolution

Act III

Three of the seven estimates cluster at a 13–20 pack reduction; the disagreement is informative

| Method | Estimand | Estimate |
|---|---|---|
| Naive pre-post | descriptive | −27.0 |
| DiD vs Nevada | ATT | −5.7 |
| ITS growth | post-gap | −28.3 |
| ITS ARIMA | post-gap | +4.5 |
| RDD on time | level jump | −20.1 |
| Synthetic Control | ATT | −18.9 |
| CausalImpact | ATT | −12.8 |

Consensus cluster (RDD, SCM, CausalImpact): −13 to −20. Outliers: DiD collapses, ARIMA flips.
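The spread and the consensus band can be checked directly from the table. The estimates are copied from the slide; the band is applied to rounded values, which is an assumption about how "−13 to −20" was read off.

```r
# Point estimates copied from the summary table (packs per capita)
estimates <- c(
  "Naive pre-post"    = -27.0,
  "DiD vs Nevada"     = -5.7,
  "ITS growth"        = -28.3,
  "ITS ARIMA"         =  4.5,
  "RDD on time"       = -20.1,
  "Synthetic Control" = -18.9,
  "CausalImpact"      = -12.8
)

# Spread across all rows
range(estimates)

# Rows whose rounded estimate falls inside the -13 to -20 band
in_cluster <- round(estimates) >= -20 & round(estimates) <= -13
names(estimates)[in_cluster]
```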

The synthetic-control ATT is −18.9 packs, robust where single comparisons fail

−18.9

ATT on California, 1989–2000 · Fisher exact \(p = 0.026\) · matches Abadie et al. (2010) within rounding

Does machine-selected matching make this causal? No — assumptions still carry the weight

Objection. Synthetic control’s data-driven donor weighting manufactures the identification.

Response. It does not. The ATT is identified only if a convex donor combination tracks California’s no-policy path — a pre-trend assumption the placebo test stresses but cannot prove. Triangulating across six estimators discloses where each assumption breaks; it never relaxes them.

The choice of counterfactual — not the data — is the design decision that moves the answer.