Six Ways to Evaluate a Policy

Comparative case studies of California’s Proposition 99, in R

−18.9 · synthetic-control ATT (packs/capita)

+4.5 · ARIMA sign flip · same data

0.026 · placebo Fisher exact p

Carlos Mendez

Nagoya University (GSID)

July 8, 2026

The Tension

Act I

California’s smoking fell 48% after Proposition 99 — but how much was the tax?

In January 1989 California raised its cigarette tax by 25 cents per pack. Average per-capita sales fell from 116 packs a year in 1970–1988 to 60 packs a year in 1989–2000.

But the whole country was smoking less. How much of California’s drop did the policy actually cause?

One dataset, six estimators — and they disagree from −28 to +4.5 packs

Effect on per-capita cigarette sales, 95% intervals · naive, DiD, two ITS, RDD-on-time, synthetic control, CausalImpact. Dashed line = zero.

Where we’re going

The estimand: every method targets the ATT on California, 1989–2000
Each method imputes the same missing piece — California’s no-policy counterfactual
Six estimators, escalating discipline: naive · DiD · ITS · RDD · synthetic control · CausalImpact
The lesson: the choice of counterfactual, not the data, drives the estimate

The Investigation

Act II

Causal inference is a missing-data problem: we never see \(Y(0)\) for treated California

\[Y_{it} = D_{it}\, Y_{it}(1) + (1 - D_{it})\, Y_{it}(0)\]

Each state-year has two potential outcomes; we observe only one. For California after 1989 the missing one is \(Y_{it}(0)\) — sales without the tax.

Every estimator in this deck is a different way to impute that missing \(Y_{it}(0)\).
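
A minimal R sketch of the switching equation above, using made-up numbers (the `y0`, `y1`, and `d` vectors are hypothetical, not the Prop 99 data). It illustrates that once a period is treated, its \(Y(0)\) drops out of what we can observe:

```r
# Hypothetical potential outcomes for one unit over four periods
# (illustrative values only, not the Prop 99 data).
y0 <- c(100, 98, 96, 94)   # outcome without the policy
y1 <- c(100, 98, 86, 80)   # outcome with the policy
d  <- c(0, 0, 1, 1)        # treatment indicator: policy on in periods 3-4

# Switching equation: Y = D * Y(1) + (1 - D) * Y(0)
y_obs <- d * y1 + (1 - d) * y0

# In real data Y(0) is unobserved wherever D = 1
y0_seen <- ifelse(d == 1, NA, y0)

# The ATT averages Y(1) - Y(0) over exactly those treated periods
att_true <- mean((y1 - y0)[d == 1])
```

Here `att_true` can be computed only because the toy example reveals both potential outcomes; with real data the `NA` entries in `y0_seen` have to be imputed.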

The target is the ATT on California, not the population-wide ATE

\[\text{ATT} = \mathbb{E}\big[Y_{it}(1) - Y_{it}(0) \,\big|\, D_{it} = 1\big]\]

ATT (our target)

effect on the unit actually treated
California, post-1989
identifiable with one treated unit

ATE (not estimable here)

effect averaged over every state
needs policy variation we lack
Prop 99 was never randomized

The shared logic: effect = observed − counterfactual; only the counterfactual differs

Fixed for everyone

California’s observed post-1989 series
the policy date (1989)
the 39-state panel

What each method changes

naive → California’s pre-mean
DiD → Nevada’s pre→post change
ITS → California’s own pre-trend
RDD → California’s pre-1989 fit at the threshold
SCM → weighted donor blend
CausalImpact → Bayesian fit on donors

Six dashed arrows feed one counterfactual box; the gap to the observed series is the effect.
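
Written out as a formula, and assuming California is indexed as unit 1 with a twelve-year post-period (1989–2000), the methods that report an average post-period gap all compute the same quantity and differ only in the \(\widehat{Y_{1t}(0)}\) they plug in:

```latex
\[\widehat{\text{ATT}} = \frac{1}{12} \sum_{t=1989}^{2000} \Big( Y_{1t} - \widehat{Y_{1t}(0)} \Big)\]
```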

Naive pre-post overstates the effect at −27 packs: its only counterfactual is a frozen pre-period mean

Quantity	Estimate	HAC SE	\(p\)
Pre-period mean (1984–88)	98.98	—	—
Post-period shift	−27.02	5.30	<0.001

Estimand: purely descriptive. Identifying assumption: California’s pre-level would have frozen — almost certainly false.
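
A sketch of how a pre-post shift with a HAC standard error could be computed in R. The series is simulated (hypothetical values and window), and the use of `sandwich::NeweyWest()` with one lag is an assumption; the deck does not say how its HAC SE was obtained.

```r
# Simulated yearly series (hypothetical values, not the Prop 99 data)
set.seed(1)
toy <- data.frame(year = 1984:1993)
toy$post <- as.integer(toy$year >= 1989)
toy$y <- 100 - 25 * toy$post + rnorm(nrow(toy), sd = 3)

# Pre-post regression: intercept = pre-period mean, coefficient on `post` = shift
fit <- lm(y ~ post, data = toy)
shift <- unname(coef(fit)["post"])

# The same number as the raw difference in means
diff_means <- mean(toy$y[toy$post == 1]) - mean(toy$y[toy$post == 0])

# HAC (Newey-West) standard errors, if the sandwich package is installed
if (requireNamespace("sandwich", quietly = TRUE)) {
  hac_se <- sqrt(diag(sandwich::NeweyWest(fit, lag = 1, prewhite = FALSE)))
  print(hac_se["post"])
}
```

The regression adds nothing beyond a difference in means; the HAC correction changes only the uncertainty, not the missing counterfactual.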

DiD against Nevada collapses to −5.7 packs — a single bad control destroys the contrast

California (orange) vs Nevada (blue): both series were already falling before 1989, and Nevada’s pre-to-post drop is almost as large as California’s, so the difference-in-differences contrast nearly vanishes.
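
The arithmetic behind a two-group, two-period DiD can be sketched as below. The helper `did_2x2()` and the four means are hypothetical, chosen only to show how a control that falls nearly as much as the treated unit shrinks the estimate:

```r
# 2x2 difference-in-differences from four group-period means
did_2x2 <- function(treated_pre, treated_post, control_pre, control_post) {
  (treated_post - treated_pre) - (control_post - control_pre)
}

# Hypothetical means (not the California/Nevada values)
did_toy <- did_2x2(treated_pre = 100, treated_post = 70,
                   control_pre = 120, control_post = 95)
did_toy
# (70 - 100) - (95 - 120) = -30 - (-25) = -5
```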

ITS extrapolates California’s own pre-trend: a straight line gives −28 packs

\[\widehat{Y_{1t}(0)} = \hat\alpha + \hat\beta\, t, \qquad t > t^*\]

A linear fit on 1970–1988 (\(\hat\beta = -1.78\) packs/year, \(R^2 = 0.74\)) extended forward; the gap to observed averages −28.3 packs.

Assumption: the pre-trend has the right shape. No comparison unit — so it can’t separate policy from the secular decline.
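
A minimal sketch of the linear ITS recipe in R, on a noise-free toy series (hypothetical values) so the recovered gap is easy to check: fit the trend on the pre-period only, extrapolate it, and average the post-period gap.

```r
# Noise-free toy series (hypothetical): linear decline, then a 10-unit drop
years <- 1970:2000
post  <- years >= 1989
y     <- 150 - 2 * (years - 1970) - 10 * post

toy <- data.frame(year = years, y = y)

# Fit the trend on the pre-period only
pre_fit <- lm(y ~ year, data = toy[!post, ])

# Extrapolate it forward as the counterfactual Y(0)
y0_hat <- predict(pre_fit, newdata = toy[post, ])

# Average post-period gap: observed minus counterfactual
its_gap <- mean(toy$y[post] - y0_hat)
```

The estimate is only as good as the extrapolated line: any curvature in the true no-policy path is attributed to the policy.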

Swap the line for an AICc-chosen ARIMA and the effect flips sign to +4.5 packs

+4.5

ITS via ARIMA(1,2,0) · the counterfactual bends below observed, implying the tax raised smoking

Why the ARIMA counterfactual misfires: it forecasts below what California actually did

ARIMA(1,2,0) counterfactual (dashed, with 95% band) sits below the observed orange series throughout 1989–2000.
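
A sketch of an ARIMA-based counterfactual using base R's `stats::arima()`, on a simulated pre-period series (hypothetical values). The order (1,2,0) is fixed by hand here; the deck does not say which tool selected it, though `forecast::auto.arima()` uses AICc by default and would be one plausible choice. With two rounds of differencing and no drift term, the forecast tends to continue along the most recent local slope, which is how a steep late pre-period decline can push the counterfactual below the observed series.

```r
# Toy pre-period series (hypothetical): a decline that accelerates near the end
set.seed(42)
t_pre <- 1:19
y_pre <- ts(130 - 0.1 * t_pre^2 + rnorm(19, sd = 1), start = 1970)

# ARIMA(1,2,0): an AR(1) on the twice-differenced series
fit <- arima(y_pre, order = c(1, 2, 0), method = "ML")

# Twelve-step-ahead forecast serves as the counterfactual path
fc <- predict(fit, n.ahead = 12)
y0_hat <- as.numeric(fc$pred)

# Approximate 95% band from the forecast standard errors
band <- cbind(lower = y0_hat - 1.96 * as.numeric(fc$se),
              upper = y0_hat + 1.96 * as.numeric(fc$se))
```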

RDD-on-time finds a −20 pack level break right at 1989

Piecewise pre/post linear fit (\(R^2 = 0.97\)): a clear level jump and a steeper slope at the 1989 threshold.
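
One way to sketch the piecewise fit in R is a regression with separate intercepts and slopes on each side of the threshold. The series below is noise-free and hypothetical, built with a known −20 level break so the coefficients can be checked; the variable names are illustrative.

```r
# Noise-free toy series (hypothetical): a -20 level break and a steeper slope at 1989
years <- 1970:2000
t_c   <- years - 1989                 # time centred on the threshold
post  <- as.integer(years >= 1989)
y     <- 125 - 1.5 * t_c - 20 * post - 1.0 * post * t_c

toy <- data.frame(y = y, t_c = t_c, post = post)

# Separate intercepts and slopes on each side of the threshold
fit <- lm(y ~ post * t_c, data = toy)

level_break  <- unname(coef(fit)["post"])      # jump at t_c = 0
slope_change <- unname(coef(fit)["post:t_c"])  # change in slope after 1989
```

Centring time on 1989 is what makes the `post` coefficient read as the level break at the threshold rather than at year zero.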

Synthetic control blends donors instead of picking one: −18.9 packs

California (observed) vs synthetic California: near-perfect pre-1989 fit, then a gap opens and widens to ~30 packs by 2000.

Synthetic California is five states, anchored on lagged cigarette sales

Faceted plot_weights output: five donor states carry ~100% of the weight (left); two lagged outcomes dominate the V matrix (right).
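
For reference, the standard synthetic-control weight problem that such a plot summarizes can be written as below. The symbols are assumptions introduced here rather than the deck's own: \(X_1\) is California's predictor vector, \(X_0\) the donors' predictor matrix, \(W = (w_2, \dots, w_{39})\) the donor weights, and \(V\) the diagonal predictor-importance matrix.

```latex
\[W^*(V) = \arg\min_{W}\; (X_1 - X_0 W)^\top V\, (X_1 - X_0 W) \quad \text{s.t.}\quad w_j \ge 0,\;\; \sum_{j=2}^{39} w_j = 1\]
\[\widehat{Y_{1t}(0)} = \sum_{j=2}^{39} w_j^*\, Y_{jt}\]
```

The non-negativity and adding-up constraints are what keep the counterfactual a convex blend of donors and typically drive most weights to zero.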

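One pipeline fits synthetic control end-to-end with `tidysynth`

library(tidysynth)

# prop99: state-year panel with columns state, year, cigsale
prop99_syn <- prop99 |>
  synthetic_control(outcome = cigsale, unit = state, time = year,
                    i_unit = "California", i_time = 1988,
                    generate_placebos = TRUE) |>
  # One lagged-outcome predictor shown; others are added the same way
  generate_predictor(time_window = 1980, cigsale_1980 = cigsale) |>
  # Fit donor weights on the pre-period, then build synthetic California
  generate_weights(optimization_window = 1970:1988) |>
  generate_control()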

The matching works: synthetic California nails the pre-period the donor average misses

Predictor	California	Synthetic	Donor avg
cigsale_1975	127.1	127.0	136.9
cigsale_1980	120.2	120.2	138.1
cigsale_1988	90.1	91.4	114.2

On 1988 sales, synthetic California (91.4) almost matches the real 90.1 — the donor average (114.2) is 24 packs off.

Inference without a standard error: California ranks 1st of 39, Fisher exact \(p = 0.026\)

0.026

Placebo Fisher exact \(p\) · MSPE ratio 123.9, more than 2.5× the next-highest state
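
The rank-based p-value can be sketched in a few lines of R. The helper `placebo_p()` is hypothetical, and the 38 donor ratios are simulated; only California's ratio (123.9) and the panel size (39) come from the deck.

```r
# Rank-based placebo p-value: share of units whose post/pre MSPE ratio
# is at least as large as the treated unit's
placebo_p <- function(mspe_ratio, treated) {
  mean(mspe_ratio >= mspe_ratio[treated])
}

# Hypothetical ratios for 39 units; only the treated value (123.9) is from the deck
set.seed(7)
ratios <- c(California = 123.9, runif(38, min = 0.5, max = 45))

p <- placebo_p(ratios, "California")
round(p, 3)
# 1 of 39 units is at least as extreme: 1 / 39 = 0.026
```

With 39 units the smallest attainable p-value is 1/39, so ranking first is the strongest result this test can deliver.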

CausalImpact hands the donors to a Bayesian model: −13 packs, 92% posterior probability

Top: observed California vs Bayesian counterfactual with a widening credible band. Bottom: cumulative effect reaching ~−150 packs by 2000.
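
A hedged sketch of the `CausalImpact` workflow on simulated data (two hypothetical donor series and a built-in effect), not the deck's actual model or donor set. It assumes the `CausalImpact` package is installed and skips the fit otherwise.

```r
# Simulated data (hypothetical): one treated series and two donor series
set.seed(123)
n      <- 31                                   # e.g. 1970-2000
donor1 <- 120 - 1.5 * (1:n) + rnorm(n, sd = 2)
donor2 <- 110 - 1.2 * (1:n) + rnorm(n, sd = 2)
y      <- 0.6 * donor1 + 0.4 * donor2 + rnorm(n, sd = 1)
y[20:n] <- y[20:n] - 12                        # built-in post-period effect

# CausalImpact expects the response in the first column, covariates after it
dat <- cbind(y = y, donor1 = donor1, donor2 = donor2)
pre_period  <- c(1, 19)                        # row indices, i.e. 1970-1988
post_period <- c(20, 31)                       # row indices, i.e. 1989-2000

if (requireNamespace("CausalImpact", quietly = TRUE)) {
  library(CausalImpact)
  impact <- CausalImpact(dat, pre_period, post_period)
  summary(impact)        # average and cumulative effects with credible intervals
  print(plot(impact))    # observed vs counterfactual, pointwise, cumulative
}
```

As a consistency check on the deck's own numbers, an average effect of −12.8 packs over twelve post-period years implies a cumulative effect of about 12 × −12.8 = −153.6, in line with the roughly −150 in the bottom panel.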

The Resolution

Act III

Three of six methods converge on a 13–20 pack reduction, and the disagreement among the rest is informative

Method	Estimand	Estimate
Naive pre-post	descriptive	−27.0
DiD vs Nevada	ATT	−5.7
ITS growth	post-gap	−28.3
ITS ARIMA	post-gap	+4.5
RDD on time	level jump	−20.1
Synthetic Control	ATT	−18.9
CausalImpact	ATT	−12.8

Consensus cluster (RDD, SCM, CausalImpact): −13 to −20. Outliers: naive and linear ITS overshoot (−27 to −28), DiD collapses, ARIMA flips.
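
The grouping can be reproduced from the table with a short R check. The estimates are the deck's; rounding to whole packs before testing the −13 to −20 band is an assumption about how the cluster was defined.

```r
# Estimates as reported in the table above
est <- c("Naive pre-post" = -27.0, "DiD vs Nevada" = -5.7,
         "ITS growth" = -28.3, "ITS ARIMA" = 4.5,
         "RDD on time" = -20.1, "Synthetic Control" = -18.9,
         "CausalImpact" = -12.8)

# Flag estimates that round into the -13 to -20 band
in_band <- round(est) >= -20 & round(est) <= -13
names(est)[in_band]

# Spread across all estimates
range(est)
```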

The synthetic-control ATT is −18.9 packs, robust where single comparisons fail

−18.9

ATT on California, 1989–2000 · Fisher exact \(p = 0.026\) · matches Abadie et al. (2010) within rounding

Does machine-selected matching make this causal? No — assumptions still carry the weight

Objection. Synthetic control’s data-driven donor weighting manufactures the identification.

Response. It does not. The ATT is identified only if a convex donor combination tracks California’s no-policy path — a pre-trend assumption the placebo test stresses but cannot prove. Triangulating across six estimators discloses where each assumption breaks; it never relaxes them.