<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Difference-in-Differences (DiD) | Carlos Mendez</title><link>https://carlos-mendez.org/category/difference-in-differences-did/</link><atom:link href="https://carlos-mendez.org/category/difference-in-differences-did/index.xml" rel="self" type="application/rss+xml"/><description>Difference-in-Differences (DiD)</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Difference-in-Differences (DiD)</title><link>https://carlos-mendez.org/category/difference-in-differences-did/</link></image><item><title>Difference-in-Differences for Policy Evaluation: A Tutorial using R</title><link>https://carlos-mendez.org/post/r_did/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_did/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Does raising the minimum wage reduce employment among young workers? This question has been at the center of one of the longest-running debates in labor economics, and the &lt;strong>Difference-in-Differences (DID)&lt;/strong> method has been the primary tool for answering it. In this tutorial, we analyze how state-level minimum wage increases between 2001 and 2007 affected teen employment in the United States &amp;mdash; a period when the federal minimum wage was frozen at \$5.15 per hour, while individual states raised their own minimum wages at different times. This variation in treatment timing creates a natural experiment ideally suited for DID.&lt;/p>
&lt;p>For decades, applied researchers implemented DID using a simple &lt;strong>two-way fixed effects (TWFE)&lt;/strong> regression &amp;mdash; a panel regression with unit and time fixed effects. Recent research has revealed that this approach can produce severely biased estimates when there is &lt;strong>staggered treatment adoption&lt;/strong> (units treated at different times) and &lt;strong>treatment effect heterogeneity&lt;/strong> (effects that vary across groups or over time). The TWFE regression implicitly makes &amp;ldquo;forbidden comparisons&amp;rdquo; that use already-treated units as the comparison group, and it assigns negative weights to some group-time treatment effects. These problems are not theoretical curiosities &amp;mdash; they lead to meaningful differences in empirical estimates.&lt;/p>
&lt;p>This tutorial walks through the complete modern DID workflow. We begin with the traditional TWFE regression and demonstrate its limitations. We then introduce the &lt;strong>Callaway and Sant&amp;rsquo;Anna (2021)&lt;/strong> framework for estimating group-time average treatment effects, $ATT(g,t)$, that cleanly separate identification from estimation. We extend the analysis with covariates using doubly robust estimation, assess the sensitivity of results to violations of parallel trends using &lt;strong>HonestDiD&lt;/strong> (Rambachan and Roth, 2023), and explore how to handle heterogeneous treatment doses across states. The tutorial is based on Callaway&amp;rsquo;s (2022) chapter &amp;ldquo;Difference-in-Differences for Policy Evaluation&amp;rdquo; and the accompanying LSU workshop materials.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the parallel trends assumption and why TWFE regressions break down with staggered treatment adoption and treatment effect heterogeneity&lt;/li>
&lt;li>Estimate group-time average treatment effects using &lt;code>att_gt()&lt;/code> from the &lt;code>did&lt;/code> package and aggregate them into overall ATTs and event studies&lt;/li>
&lt;li>Diagnose TWFE bias through weight decomposition, identifying negative weights and pre-treatment contamination&lt;/li>
&lt;li>Apply doubly robust estimation with conditional parallel trends and assess robustness to base period and comparison group choices&lt;/li>
&lt;li>Conduct HonestDiD sensitivity analysis to evaluate how robust findings are to violations of parallel trends&lt;/li>
&lt;/ul>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;did&amp;quot;, &amp;quot;fixest&amp;quot;, &amp;quot;HonestDiD&amp;quot;, &amp;quot;DRDID&amp;quot;, &amp;quot;BMisc&amp;quot;,
&amp;quot;modelsummary&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;dplyr&amp;quot;, &amp;quot;pte&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
# twfeweights is GitHub-only
if (!requireNamespace(&amp;quot;twfeweights&amp;quot;, quietly = TRUE)) {
remotes::install_github(&amp;quot;bcallaway11/twfeweights&amp;quot;)
}
# pte may also require GitHub install if not on CRAN
if (!requireNamespace(&amp;quot;pte&amp;quot;, quietly = TRUE)) {
remotes::install_github(&amp;quot;bcallaway11/pte&amp;quot;)
}
library(did)
library(fixest)
library(twfeweights)
library(HonestDiD)
library(DRDID)
library(BMisc)
library(modelsummary)
library(ggplot2)
library(dplyr)
&lt;/code>&lt;/pre>
&lt;h2 id="3-data-loading-and-exploration">3. Data Loading and Exploration&lt;/h2>
&lt;p>The dataset comes from Callaway and Sant&amp;rsquo;Anna (2021) and contains county-level panel data on teen employment and state minimum wages across the United States from 2001 to 2007. During this period, the federal minimum wage remained constant at \$5.15 per hour, while several states raised their state-level minimum wages above the federal floor at different points in time. States that raised their minimum wages form the &amp;ldquo;treated&amp;rdquo; groups, identified by the year their first increase took effect. States that never raised their minimum wage above the federal level during this period form the &amp;ldquo;never-treated&amp;rdquo; comparison group.&lt;/p>
&lt;pre>&lt;code class="language-r"># Load data from Callaway's GitHub repository
load(url(&amp;quot;https://github.com/bcallaway11/did_chapter/raw/master/mw_data_ch2.RData&amp;quot;))
# Filter: keep groups 0 (never-treated), 2004, 2006, 2007; drop Northeast region
mw_data_ch2 &amp;lt;- subset(mw_data_ch2,
(G %in% c(2004, 2006, 2007, 0)) &amp;amp; (region != &amp;quot;1&amp;quot;))
# Main analysis subset: drop G=2007, keep year &amp;gt;= 2003
data2 &amp;lt;- subset(mw_data_ch2, G != 2007 &amp;amp; year &amp;gt;= 2003)
head(data2[, c(&amp;quot;id&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;lemp&amp;quot;, &amp;quot;lpop&amp;quot;, &amp;quot;region&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> id year G lemp lpop region
6 1001 2003 0 5.253534 10.07352 3
7 1001 2004 0 5.288267 10.06966 3
8 1001 2005 0 5.267858 10.06235 3
9 1001 2006 0 5.298317 10.05546 3
10 1001 2007 0 5.232025 10.04953 3
31 1003 2003 0 6.822197 11.16740 3
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r"># Counties by treatment group
data2 %&amp;gt;%
filter(year == 2003) %&amp;gt;%
group_by(G) %&amp;gt;%
summarise(n_counties = n(), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> G n_counties
1 0 1417
2 2004 102
3 2006 226
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 8,725 county-year observations spanning 1,745 counties over five years (2003&amp;ndash;2007). There are two treatment groups: 102 counties in states that first raised their minimum wage in 2004 (G=2004) and 226 counties in states that did so in 2006 (G=2006). The remaining 1,417 counties are in states that kept their minimum wage at the federal level throughout the period and serve as the never-treated comparison group. We drop the G=2007 group (states raising their minimum wage right before the federal increase) to maintain a cleaner analysis window, following the workshop approach.&lt;/p>
&lt;pre>&lt;code class="language-r"># Summary statistics
summary(data2[, c(&amp;quot;lemp&amp;quot;, &amp;quot;lpop&amp;quot;, &amp;quot;lavg_pay&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> lemp lpop lavg_pay
Min. : 1.099 Min. : 6.397 Min. : 9.646
1st Qu.: 4.615 1st Qu.: 9.149 1st Qu.:10.117
Median : 5.517 Median : 9.931 Median :10.225
Mean : 5.594 Mean :10.030 Mean :10.245
3rd Qu.: 6.458 3rd Qu.:10.762 3rd Qu.:10.352
Max. :11.173 Max. :15.492 Max. :11.223
&lt;/code>&lt;/pre>
&lt;p>The outcome variable &lt;code>lemp&lt;/code> is log teen employment, with a mean of 5.59 (corresponding to roughly 270 teen workers per county). The covariates &lt;code>lpop&lt;/code> (log county population, mean 10.03) and &lt;code>lavg_pay&lt;/code> (log average county pay, mean 10.25) capture differences in county size and economic conditions that could affect employment trends. These covariates will become important when we condition the parallel trends assumption on observables in Section 6.&lt;/p>
&lt;h2 id="4-the-basic-did-framework">4. The Basic DID Framework&lt;/h2>
&lt;h3 id="41-did-intuition-and-parallel-trends">4.1 DID Intuition and Parallel Trends&lt;/h3>
&lt;p>The core idea behind Difference-in-Differences is simple: compare how outcomes change over time for the treated group relative to a comparison group. If the treated and comparison groups would have followed &lt;strong>parallel trends&lt;/strong> in the absence of treatment, then any divergence after treatment can be attributed to the treatment itself. Formally, the Average Treatment Effect on the Treated (ATT) is identified as:&lt;/p>
&lt;p>$$ATT = E[\Delta Y_{t^{\ast}} \mid D=1] - E[\Delta Y_{t^{\ast}} \mid D=0]$$&lt;/p>
&lt;p>where $\Delta Y_{t^{\ast}}$ is the change in outcomes from the pre-treatment period to the post-treatment period, $D=1$ indicates treated units, and $D=0$ indicates untreated units. The ATT equals the change in outcomes for the treated group, adjusted by the change in outcomes for the comparison group.&lt;/p>
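&lt;p>As a quick numerical illustration, the two-by-two DID estimate is just this double difference applied to four group means (the numbers below are made up for the sketch):&lt;/p>
&lt;pre>&lt;code class="language-r"># Hypothetical mean log employment by group and period
y_treat_pre  &amp;lt;- 5.60; y_treat_post &amp;lt;- 5.55   # treated group
y_ctrl_pre   &amp;lt;- 5.50; y_ctrl_post  &amp;lt;- 5.52   # comparison group
# ATT = (change for treated) - (change for comparison)
att_2x2 &amp;lt;- (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
att_2x2  # -0.07: treated fell by 0.05 while the comparison group rose by 0.02
&lt;/code>&lt;/pre>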
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Before Treatment&amp;quot;
A[&amp;quot;Treated Group&amp;lt;br/&amp;gt;Pre-treatment Y&amp;quot;]
B[&amp;quot;Control Group&amp;lt;br/&amp;gt;Pre-treatment Y&amp;quot;]
end
subgraph &amp;quot;After Treatment&amp;quot;
C[&amp;quot;Treated Group&amp;lt;br/&amp;gt;Post-treatment Y&amp;quot;]
D[&amp;quot;Control Group&amp;lt;br/&amp;gt;Post-treatment Y&amp;quot;]
end
A --&amp;gt;|&amp;quot;ΔY treated&amp;quot;| C
B --&amp;gt;|&amp;quot;ΔY control&amp;quot;| D
C -.-&amp;gt;|&amp;quot;ATT = ΔY treated − ΔY control&amp;quot;| E[&amp;quot;Causal Effect&amp;quot;]
style A fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>In the textbook case with exactly two periods and two groups, the TWFE regression $Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$ delivers an estimate of $\alpha$ that is numerically identical to the simple DID estimator, even in the presence of treatment effect heterogeneity. Here, $\theta_t$ represents time fixed effects (captured by &lt;code>year&lt;/code> in the regression), $\eta_i$ represents unit fixed effects (captured by &lt;code>id&lt;/code>), $D_{it}$ is the treatment indicator (&lt;code>post&lt;/code>), and $v_{it}$ are idiosyncratic unobservables.&lt;/p>
&lt;p>However, this equivalence breaks down when there are &lt;strong>multiple time periods&lt;/strong> and &lt;strong>variation in treatment timing&lt;/strong>. In our application, states raised their minimum wages at different times (2004 and 2006), creating a staggered treatment adoption design.&lt;/p>
&lt;p>The TWFE regression implicitly makes two types of comparisons: (1) &amp;ldquo;good comparisons&amp;rdquo; that compare treated groups to not-yet-treated groups, and (2) &amp;ldquo;bad comparisons&amp;rdquo; (sometimes called &amp;ldquo;forbidden comparisons&amp;rdquo;) that use already-treated groups as the comparison group. To see why this is problematic, imagine grading a student&amp;rsquo;s improvement by comparing them to classmates who already took the test last week &amp;mdash; those &amp;ldquo;comparison&amp;rdquo; students are themselves affected by the test, so they no longer represent a valid counterfactual. Similarly, already-treated units may themselves be experiencing treatment effects, contaminating the estimate.&lt;/p>
&lt;p>Moreover, under treatment effect heterogeneity, the TWFE coefficient $\alpha$ is a weighted average of underlying group-time treatment effects, and some of these weights can be &lt;strong>negative&lt;/strong>. It is as if you tried to compute an average score but accidentally gave some students a negative weight &amp;mdash; their positive performance would drag the average down. This means TWFE could, in principle, produce a negative estimate even when all true treatment effects are positive.&lt;/p>
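&lt;p>A tiny numerical sketch, with invented effects and weights, shows how a single negative weight can flip the sign of the aggregate:&lt;/p>
&lt;pre>&lt;code class="language-r">att_vals &amp;lt;- c(0.02, 0.20)   # two true group-time effects, both positive
w_proper &amp;lt;- c(0.5, 0.5)     # proper weights: nonnegative, sum to one
w_twfe   &amp;lt;- c(1.5, -0.5)    # hypothetical TWFE-style weights: sum to one, one negative
sum(w_proper * att_vals)    #  0.11: inside the range of the true effects
sum(w_twfe * att_vals)      # -0.07: negative although both true effects are positive
&lt;/code>&lt;/pre>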
&lt;h3 id="42-twfe-regression">4.2 TWFE Regression&lt;/h3>
&lt;p>Let us start with the traditional TWFE approach to establish a baseline estimate.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_res &amp;lt;- fixest::feols(lemp ~ post | id + year,
data = data2,
cluster = &amp;quot;id&amp;quot;)
summary(twfe_res)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OLS estimation, Dep. Var.: lemp
Observations: 8,725
Fixed-effects: id: 1,745, year: 5
Standard-errors: Clustered (id)
Estimate Std. Error t value Pr(&amp;gt;|t|)
post -0.03812 0.008489 -4.49036 7.5762e-06 ***
---
RMSE: 0.116264 Adj. R2: 0.9926
Within R2: 0.003711
&lt;/code>&lt;/pre>
&lt;p>The TWFE regression estimates that minimum wage increases reduced log teen employment by 0.038 (SE = 0.008), which is statistically significant. Interpreted naively, this suggests that states raising their minimum wage experienced a roughly 3.8% decline in teen employment relative to states that did not. However, this single coefficient attempts to summarize the entire treatment effect across two different treatment groups, multiple post-treatment periods, and varying lengths of exposure &amp;mdash; a task that, as we will show, is not well-served by TWFE under treatment effect heterogeneity.&lt;/p>
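&lt;p>Because the outcome is in logs, the coefficient is only approximately a percent change; the exact implied change is easy to recover:&lt;/p>
&lt;pre>&lt;code class="language-r"># Exact percent change implied by a log-point coefficient of -0.038
100 * (exp(-0.038) - 1)  # about -3.73%, close to the -3.8% approximation
&lt;/code>&lt;/pre>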
&lt;p>&lt;img src="r_did_01_twfe_event_study.png" alt="TWFE Event Study based on the Sun-Abraham interaction-weighted estimator.">&lt;/p>
&lt;p>The TWFE event study above uses &lt;code>fixest::sunab()&lt;/code> to estimate dynamic treatment effects within the TWFE framework. The coefficients suggest a small pre-trend violation at event time $-3$ and increasingly negative post-treatment effects. While the Sun-Abraham correction improves upon the standard TWFE event study by addressing some of the weighting issues, we will see that the Callaway-Sant&amp;rsquo;Anna approach provides a more principled decomposition of the treatment effect.&lt;/p>
&lt;h2 id="5-group-time-att-the-callaway-santanna-approach">5. Group-Time ATT: The Callaway-Sant&amp;rsquo;Anna Approach&lt;/h2>
&lt;h3 id="51-estimating-attgt">5.1 Estimating ATT(g,t)&lt;/h3>
&lt;p>The Callaway and Sant&amp;rsquo;Anna (2021) framework addresses the limitations of TWFE by working with &lt;strong>group-time average treatment effects&lt;/strong>:&lt;/p>
&lt;p>$$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G = g]$$&lt;/p>
&lt;p>where $Y_t(g)$ is the potential outcome at time $t$ if first treated in period $g$, $Y_t(0)$ is the untreated potential outcome, and $G = g$ identifies units in treatment group $g$. In words, $ATT(g,t)$ is the average treatment effect for units first treated in period $g$, measured at time $t$. These building-block parameters are identified under the parallel trends assumption using clean comparisons: each treated group is compared only to units that are never treated (or not yet treated), avoiding the forbidden comparisons that plague TWFE.&lt;/p>
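&lt;p>Under parallel trends with a universal base period, each $ATT(g,t)$ is itself a simple double difference of group means. A hand computation on a hypothetical mini-panel (the means below are made up for the sketch):&lt;/p>
&lt;pre>&lt;code class="language-r"># Hypothetical mean outcomes by group (rows) and year (columns)
ybar &amp;lt;- rbind(
  `2004` = c(`2003` = 5.40, `2004` = 5.37, `2005` = 5.33),  # first treated in 2004
  `0`    = c(`2003` = 5.50, `2004` = 5.51, `2005` = 5.52)   # never-treated
)
# ATT(g = 2004, t = 2005), base period g - 1 = 2003:
att_2004_2005 &amp;lt;- (ybar[&amp;quot;2004&amp;quot;, &amp;quot;2005&amp;quot;] - ybar[&amp;quot;2004&amp;quot;, &amp;quot;2003&amp;quot;]) -
                 (ybar[&amp;quot;0&amp;quot;, &amp;quot;2005&amp;quot;] - ybar[&amp;quot;0&amp;quot;, &amp;quot;2003&amp;quot;])
att_2004_2005  # -0.09
&lt;/code>&lt;/pre>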
&lt;pre>&lt;code class="language-r">attgt &amp;lt;- did::att_gt(yname = &amp;quot;lemp&amp;quot;,
idname = &amp;quot;id&amp;quot;,
gname = &amp;quot;G&amp;quot;,
tname = &amp;quot;year&amp;quot;,
data = data2,
control_group = &amp;quot;nevertreated&amp;quot;,
base_period = &amp;quot;universal&amp;quot;)
tidy(attgt)[, 1:5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> term group time estimate std.error
ATT(2004,2003) 2004 2003 0.00000000 NA
ATT(2004,2004) 2004 2004 -0.03266653 0.02149279
ATT(2004,2005) 2004 2005 -0.06827991 0.02098524
ATT(2004,2006) 2004 2006 -0.12335404 0.02089502
ATT(2004,2007) 2004 2007 -0.13109136 0.02326712
ATT(2006,2003) 2006 2003 -0.03408910 0.01165128
ATT(2006,2004) 2006 2004 -0.01669977 0.00817406
ATT(2006,2005) 2006 2005 0.00000000 NA
ATT(2006,2006) 2006 2006 -0.01939335 0.00892409
ATT(2006,2007) 2006 2007 -0.06607568 0.00965073
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>att_gt()&lt;/code> function estimates each $ATT(g,t)$ separately. For the G=2004 group, the treatment effect grows over time: $-0.033$ on impact (2004), $-0.068$ one year later (2005), $-0.123$ two years later (2006), and $-0.131$ three years later (2007). This pattern suggests &lt;strong>treatment effect dynamics&lt;/strong> &amp;mdash; the negative employment effect of minimum wage increases deepens with longer exposure. For the G=2006 group, the on-impact effect is smaller ($-0.019$) and grows to $-0.066$ after one year. The pre-treatment estimates for G=2006 show a concerning value of $-0.034$ at event time $-3$ (year 2003), suggesting a possible violation of the parallel trends assumption for this group &amp;mdash; a point we will revisit in the sensitivity analysis.&lt;/p>
&lt;p>&lt;img src="r_did_02_attgt.png" alt="Group-time average treatment effects for each treatment cohort, estimated with the Callaway-Sant&amp;rsquo;Anna method.">&lt;/p>
&lt;h3 id="52-aggregation-overall-att-and-event-study">5.2 Aggregation: Overall ATT and Event Study&lt;/h3>
&lt;p>Group-time ATTs are informative but numerous. The &lt;code>aggte()&lt;/code> function aggregates them into summary parameters. The &lt;strong>overall ATT&lt;/strong> weights each $ATT(g,t)$ by the group size and the number of post-treatment periods:&lt;/p>
&lt;pre>&lt;code class="language-r">attO &amp;lt;- did::aggte(attgt, type = &amp;quot;group&amp;quot;)
summary(attO)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0571 0.008 -0.0727 -0.0415 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0888 0.0197 -0.1309 -0.0468 *
2006 -0.0427 0.0083 -0.0604 -0.0251 *
&lt;/code>&lt;/pre>
&lt;p>The overall ATT is $-0.057$ (SE = 0.008), substantially larger in magnitude than the TWFE estimate of $-0.038$. The Callaway-Sant&amp;rsquo;Anna framework reveals that TWFE &lt;strong>understated&lt;/strong> the negative employment effect by about one-third. The group-level results show that the G=2004 group experienced a larger average effect ($-0.089$) than the G=2006 group ($-0.043$), which makes sense: the G=2004 group is exposed for more post-treatment periods, so its growing dynamic effects have more time to accumulate.&lt;/p>
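&lt;p>The overall ATT can be reconstructed by hand from the group effects and group sizes (numbers transcribed from the output and from Section 3):&lt;/p>
&lt;pre>&lt;code class="language-r"># Reconstruct the overall ATT from the group effects, weighted by group size
g_size &amp;lt;- c(`2004` = 102, `2006` = 226)   # counties per treatment group
g_att  &amp;lt;- c(`2004` = -0.0888, `2006` = -0.0427)
sum(g_size / sum(g_size) * g_att)  # about -0.057, matching the reported overall ATT
&lt;/code>&lt;/pre>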
&lt;p>The &lt;strong>event study&lt;/strong> aggregation is equally informative:&lt;/p>
&lt;pre>&lt;code class="language-r">attes &amp;lt;- did::aggte(attgt, type = &amp;quot;dynamic&amp;quot;)
summary(attes)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall summary of ATT's based on event-study/dynamic aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0862 0.0124 -0.1106 -0.0618 *
Dynamic Effects:
Event time Estimate Std. Error [95% Simult. Conf. Band]
-3 -0.0341 0.0119 -0.0623 -0.0059 *
-2 -0.0167 0.0076 -0.0348 0.0014
-1 0.0000 NA NA NA
0 -0.0235 0.0081 -0.0426 -0.0044 *
1 -0.0668 0.0086 -0.0870 -0.0465 *
2 -0.1234 0.0203 -0.1714 -0.0753 *
3 -0.1311 0.0230 -0.1855 -0.0767 *
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_03_cs_event_study.png" alt="Event study aggregation of group-time ATTs showing the trajectory of treatment effects relative to the treatment year.">&lt;/p>
&lt;p>The event study reveals a clear pattern: the on-impact effect at $e=0$ is $-0.024$, growing to $-0.067$ at $e=1$, $-0.123$ at $e=2$, and $-0.131$ at $e=3$. The post-treatment effects are all statistically significant and increasingly negative, consistent with the minimum wage having a cumulative negative effect on teen employment over time. However, the pre-trend at $e=-3$ is $-0.034$ and statistically significant, which raises a flag about the validity of the parallel trends assumption. The pre-trend at $e=-2$ is smaller ($-0.017$) and not significant. We will formally assess the robustness of these results to parallel trends violations using HonestDiD in Section 7.&lt;/p>
&lt;h3 id="53-twfe-weight-decomposition">5.3 TWFE Weight Decomposition&lt;/h3>
&lt;p>Why does TWFE produce a different estimate than Callaway-Sant&amp;rsquo;Anna? Both the TWFE coefficient and the overall $ATT^O$ can be written as weighted averages of the same underlying $ATT(g,t)$ values:&lt;/p>
&lt;p>$$ATT^O = \sum_{g,t} w^O(g,t) \cdot ATT(g,t)$$&lt;/p>
&lt;p>The difference lies in the weights. The proper $ATT^O$ weights reflect group size and number of post-treatment periods, while the TWFE weights are driven by the estimation method and can assign nonzero weight to pre-treatment periods or even negative weight to some post-treatment cells. The &lt;code>twfeweights&lt;/code> package makes these weights explicit.&lt;/p>
&lt;pre>&lt;code class="language-r">tw_obj &amp;lt;- twfeweights::twfe_weights(attgt)
tw &amp;lt;- tw_obj$weights_df
wO_obj &amp;lt;- twfeweights::attO_weights(attgt)
wO &amp;lt;- wO_obj$weights_df
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">TWFE estimate from weights: -0.0381
ATT^O estimate from weights: -0.0571
TWFE post-treatment component: -0.0503
Pre-treatment contamination: 0.0122
Total TWFE bias: 0.019
Fraction of bias from pre-treatment: 0.6422
Fraction of bias from post-treatment weighting: 0.3578
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_04_twfe_weights.png" alt="TWFE weight scatter plot showing how each group-time ATT is weighted. Circles are TWFE weights; teal diamonds are the proper ATT-O weights for post-treatment cells.">&lt;/p>
&lt;p>The weight decomposition is revealing. The TWFE estimate ($-0.038$) differs from the proper overall ATT ($-0.057$) by a total bias of $0.019$ &amp;mdash; meaning TWFE attenuates the negative employment effect toward zero. Of this bias, &lt;strong>64.2%&lt;/strong> comes from pre-treatment contamination: the TWFE regression assigns nonzero weights to pre-treatment $ATT(g,t)$ values, which should receive zero weight in any proper treatment effect parameter. The remaining &lt;strong>35.8%&lt;/strong> of the bias comes from TWFE assigning different post-treatment weights than the proper $ATT^O$ weights. The figure shows this visually: the orange pre-treatment dots receive nonzero TWFE weights (horizontal position), and the post-treatment TWFE weights (blue circles) differ systematically from the proper $ATT^O$ weights (teal diamonds).&lt;/p>
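&lt;p>The decomposition reported above is just arithmetic on three numbers, which makes the accounting transparent:&lt;/p>
&lt;pre>&lt;code class="language-r"># Accounting for the TWFE bias using the reported numbers
twfe &amp;lt;- -0.0381   # TWFE estimate recovered from the weights
attO &amp;lt;- -0.0571   # proper overall ATT
post &amp;lt;- -0.0503   # part of the TWFE estimate coming from post-treatment cells
pre_contam &amp;lt;- twfe - post   # 0.0122: weight leaking onto pre-treatment cells
total_bias &amp;lt;- twfe - attO   # 0.0190: total attenuation toward zero
c(pre_share = pre_contam / total_bias,                  # about 0.64
  post_share = (total_bias - pre_contam) / total_bias)  # about 0.36
&lt;/code>&lt;/pre>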
&lt;h2 id="6-relaxing-parallel-trends">6. Relaxing Parallel Trends&lt;/h2>
&lt;h3 id="61-conditional-parallel-trends-with-covariates">6.1 Conditional Parallel Trends with Covariates&lt;/h3>
&lt;p>The unconditional parallel trends assumption may be too strong if treatment and comparison groups differ on observable characteristics that affect outcome trends. For example, states that raised their minimum wages may have larger populations or higher average pay levels, and these characteristics could correlate with employment trends even absent the minimum wage change. &lt;strong>Conditional parallel trends&lt;/strong> weakens the assumption: trends need only be parallel after conditioning on covariates. The &lt;code>did&lt;/code> package offers three estimation methods for this setting. Regression adjustment models the outcome as a function of covariates; inverse probability weighting (IPW) reweights the comparison group to match the treated group&amp;rsquo;s covariate distribution; and the &lt;strong>doubly robust&lt;/strong> (DR) estimator combines both approaches, remaining consistent if either the outcome model or the propensity score model is correctly specified &amp;mdash; like wearing both a belt and suspenders.&lt;/p>
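&lt;p>To build intuition for what the doubly robust estimator does, here is a minimal sketch on simulated two-period data. It is deliberately simplified relative to the &lt;code>DRDID&lt;/code> internals, and the data-generating process and true effect of $-0.06$ are invented for the illustration:&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(1)
n  &amp;lt;- 5000
x  &amp;lt;- rnorm(n)                                  # covariate driving trends and selection
d  &amp;lt;- rbinom(n, 1, plogis(0.5 * x))             # treatment depends on x
dy &amp;lt;- 0.5 * x - 0.06 * d + rnorm(n, sd = 0.5)   # outcome change; true ATT = -0.06
# Outcome regression fit on the comparison group only
mu0 &amp;lt;- predict(lm(dy ~ x, subset = d == 0), newdata = data.frame(x = x))
# Propensity score
ps &amp;lt;- glm(d ~ x, family = binomial)$fitted.values
# Doubly robust ATT: reweighted comparison group, residualized outcome change
w1 &amp;lt;- d / mean(d)
w0 &amp;lt;- ps * (1 - d) / (1 - ps)
w0 &amp;lt;- w0 / mean(w0)
att_dr &amp;lt;- mean((w1 - w0) * (dy - mu0))
att_dr  # close to the true ATT of -0.06
&lt;/code>&lt;/pre>
&lt;p>The estimate remains consistent if either the outcome regression or the propensity score model is misspecified &amp;mdash; the sense in which the estimator is doubly robust.&lt;/p>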
&lt;pre>&lt;code class="language-r"># Regression adjustment
cs_reg &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;reg&amp;quot;, data = data2)
attO_reg &amp;lt;- aggte(cs_reg, type = &amp;quot;group&amp;quot;)
# Inverse probability weighting
cs_ipw &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;ipw&amp;quot;, data = data2)
attO_ipw &amp;lt;- aggte(cs_ipw, type = &amp;quot;group&amp;quot;)
# Doubly robust
cs_dr &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_dr &amp;lt;- aggte(cs_dr, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Overall ATT&lt;/th>
&lt;th>SE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Unconditional&lt;/td>
&lt;td>$-0.057$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression adj.&lt;/td>
&lt;td>$-0.064$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>$-0.065$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Doubly robust&lt;/td>
&lt;td>$-0.065$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Controlling for log population and log average pay increases the estimated negative employment effect from $-0.057$ to approximately $-0.065$. Reassuringly, the three conditional methods produce nearly identical estimates; their agreement suggests that covariate adjustment is not introducing model-dependence artifacts.&lt;/p>
&lt;p>&lt;img src="r_did_05_dr_event_study.png" alt="Event study from the doubly robust estimator conditioning on log population and log average pay.">&lt;/p>
&lt;p>The doubly robust event study shows the same qualitative pattern as the unconditional analysis: near-zero pre-trends (the pre-trend at $e=-3$ shrinks from $-0.034$ to $-0.022$ and is no longer significant) and increasingly negative post-treatment effects ($-0.027$ at $e=0$, $-0.077$ at $e=1$, $-0.135$ at $e=2$, $-0.147$ at $e=3$). The improved pre-trend behavior after conditioning on covariates suggests that some of the apparent pre-trend violations in the unconditional analysis were driven by differences in county characteristics between treatment and comparison groups.&lt;/p>
&lt;h3 id="62-robustness-base-period-comparison-group-and-anticipation">6.2 Robustness: Base Period, Comparison Group, and Anticipation&lt;/h3>
&lt;p>The Callaway-Sant&amp;rsquo;Anna framework allows the researcher to make several important choices. We now check that our results are robust to these choices.&lt;/p>
&lt;p>&lt;strong>Varying base period:&lt;/strong> Instead of comparing all pre-treatment and post-treatment periods to a single universal base period ($t = g-1$), we can use a varying base period that compares each period $t$ to period $t-1$.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_varying &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;varying&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_varying &amp;lt;- aggte(cs_varying, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Varying base period ATT^O: -0.0646 (SE: 0.0081)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Not-yet-treated comparison group:&lt;/strong> Instead of using only the never-treated group as the comparison, we can also include units that are not yet treated at time $t$.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_nyt &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;notyettreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_nyt &amp;lt;- aggte(cs_nyt, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Not-yet-treated ATT^O: -0.0649 (SE: 0.008)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Anticipation:&lt;/strong> If states announced their minimum wage increases before they took effect, workers and firms might adjust their behavior in anticipation. We allow for one period of anticipation by setting &lt;code>anticipation = 1&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_antic &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, anticipation = 1, data = data2)
attO_antic &amp;lt;- aggte(cs_antic, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">With anticipation (1 period) ATT^O: -0.0396 (SE: 0.0098)
&lt;/code>&lt;/pre>
&lt;p>The results are reassuringly stable across specifications. Switching to a varying base period ($-0.065$) or using the not-yet-treated comparison group ($-0.065$) produces virtually identical estimates to our baseline doubly robust result ($-0.065$). Allowing for one period of anticipation reduces the estimated ATT to $-0.040$ (SE = 0.010), which makes sense &amp;mdash; if some of the treatment effect occurs before the official implementation date, excluding that period from post-treatment narrows the estimated effect. The consistency across the first three specifications gives us confidence that the main findings are not driven by specific methodological choices.&lt;/p>
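&lt;p>For quick reference, the overall estimates from the specifications above can be collected into one small table (numbers transcribed from the outputs in this section):&lt;/p>
&lt;pre>&lt;code class="language-r"># Collect the reported overall ATT estimates into one comparison table
robust_tab &amp;lt;- data.frame(
  spec = c(&amp;quot;DR baseline&amp;quot;, &amp;quot;Varying base period&amp;quot;,
           &amp;quot;Not-yet-treated controls&amp;quot;, &amp;quot;Anticipation (1 period)&amp;quot;),
  att  = c(-0.0650, -0.0646, -0.0649, -0.0396),
  se   = c(0.0080, 0.0081, 0.0080, 0.0098)
)
robust_tab
&lt;/code>&lt;/pre>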
&lt;h2 id="7-sensitivity-analysis-when-parallel-trends-may-fail">7. Sensitivity Analysis: When Parallel Trends May Fail&lt;/h2>
&lt;p>Even after conditioning on covariates, the parallel trends assumption is not directly testable &amp;mdash; pre-trends being close to zero is necessary but not sufficient for parallel trends to hold in post-treatment periods. The &lt;strong>HonestDiD&lt;/strong> approach of Rambachan and Roth (2023) provides a principled sensitivity analysis: it asks how large violations of parallel trends can be before the post-treatment results break down. The &amp;ldquo;relative magnitude&amp;rdquo; variant compares the size of potential post-treatment violations to the observed size of pre-treatment deviations from parallel trends.&lt;/p>
&lt;p>The &lt;code>HonestDiD&lt;/code> package requires a small helper function to interface with the &lt;code>did&lt;/code> package&amp;rsquo;s event study objects. This helper (available in the companion R script and in &lt;a href="https://github.com/bcallaway11/did_chapter" target="_blank" rel="noopener">Callaway&amp;rsquo;s workshop materials&lt;/a>) extracts the influence function (a statistical tool for computing standard errors in complex estimators) and variance-covariance matrix from the event study, then passes them to &lt;code>HonestDiD&lt;/code>&amp;rsquo;s sensitivity routines. The parameter $\bar{M}$ bounds the ratio of the maximum post-treatment deviation from parallel trends to the maximum pre-treatment deviation &amp;mdash; in other words, it is a stress test asking &amp;ldquo;how much worse can things get after treatment compared to what we already see before treatment?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-r"># Helper function from Callaway's workshop (references/honest_did.R)
# Bridges the did package's AGGTEobj to HonestDiD's sensitivity functions
source(&amp;quot;references/honest_did.R&amp;quot;)
attgt_hd &amp;lt;- did::att_gt(yname = &amp;quot;lemp&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
                        tname = &amp;quot;year&amp;quot;, data = data2,
                        control_group = &amp;quot;nevertreated&amp;quot;,
                        base_period = &amp;quot;universal&amp;quot;)
cs_es_hd &amp;lt;- aggte(attgt_hd, type = &amp;quot;dynamic&amp;quot;)
hd_rm &amp;lt;- honest_did(es = cs_es_hd, e = 0, type = &amp;quot;relative_magnitude&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Original CI: [-0.0404, -0.0066]
Robust CIs:
lb ub Mbar
-0.0401 -0.00871 0.000
-0.0435 -0.00523 0.222
-0.0470 -0.00174 0.444
-0.0505 0.00523 0.667
-0.0575 0.01220 0.889
-0.0644 0.01920 1.111
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_06_honestdid.png" alt="HonestDiD sensitivity analysis showing how the confidence interval for the on-impact effect widens as the allowed magnitude of parallel trends violations increases.">&lt;/p>
&lt;p>The sensitivity analysis reveals that the on-impact effect ($e=0$) is robust to moderate violations of parallel trends, but not to large ones. The original 95% confidence interval is $[-0.040, -0.007]$, comfortably below zero. As $\bar{M}$ increases &amp;mdash; meaning we allow post-treatment violations of parallel trends to be larger relative to pre-treatment violations &amp;mdash; the confidence interval widens. The &lt;strong>breakdown point&lt;/strong> is at $\bar{M} \approx 0.67$: if post-treatment violations are no more than about 67% as large as the pre-treatment deviations from parallel trends, the negative employment effect remains statistically significant. Beyond that threshold, the confidence interval includes zero and we can no longer rule out a null effect. Given the moderate pre-trend violations we observed (especially at $e=-3$), this suggests that the results should be interpreted with some caution &amp;mdash; the evidence is suggestive of a negative employment effect, but it is not bulletproof.&lt;/p>
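&lt;p>The breakdown point can be read mechanically off the table of robust confidence intervals: it is the smallest $\bar{M}$ at which the interval first covers zero. As a quick illustration, the sign pattern can be checked in a few lines of plain Python (the bounds are copied from the output above; this is not part of the R workflow):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

# (lb, ub, Mbar) rows copied from the HonestDiD robust CIs above
robust_cis = [
    (-0.0401, -0.00871, 0.000),
    (-0.0435, -0.00523, 0.222),
    (-0.0470, -0.00174, 0.444),
    (-0.0505,  0.00523, 0.667),
    (-0.0575,  0.01220, 0.889),
    (-0.0644,  0.01920, 1.111),
]

for lb, ub, mbar in robust_cis:
    # A CI excludes zero exactly when lb and ub share the same sign,
    # i.e. when their product is positive
    excludes_zero = math.copysign(1.0, lb * ub) == 1.0
    assert excludes_zero == (mbar in (0.000, 0.222, 0.444))
&lt;/code>&lt;/pre>
&lt;p>The first interval covering zero is the $\bar{M} = 0.667$ row, which is why the text reports a breakdown point of roughly $\bar{M} \approx 0.67$.&lt;/p>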
&lt;h2 id="8-more-complicated-treatment-regimes">8. More Complicated Treatment Regimes&lt;/h2>
&lt;h3 id="81-heterogeneous-treatment-doses">8.1 Heterogeneous Treatment Doses&lt;/h3>
&lt;p>So far, we have treated all minimum wage increases as a binary &amp;ldquo;treated or not&amp;rdquo; event. But states raised their minimum wages by very different amounts &amp;mdash; some by as little as \$0.10 above the federal floor, others by over \$1.00. A \$0.25 increase and a \$1.70 increase should not be expected to have the same employment effect. To account for this, we can normalize the treatment effect by the size of the minimum wage increase, computing an &lt;strong>ATT per dollar&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r"># Use full data including G=2007 for more treated states
data3 &amp;lt;- subset(mw_data_ch2, year &amp;gt;= 2003)
treated_state_list &amp;lt;- unique(subset(data3, G != 0)$state_name)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_07_state_mw.png" alt="Minimum wage trajectories showing the heterogeneous timing and magnitude of state minimum wage increases above the federal floor.">&lt;/p>
&lt;p>The figure reveals substantial variation across states. Illinois raised its minimum wage early (2004) and by a relatively large amount, while Florida and Colorado made smaller increases later. This heterogeneity in treatment dose motivates the per-dollar normalization.&lt;/p>
&lt;h3 id="82-att-per-dollar-event-study">8.2 ATT Per Dollar Event Study&lt;/h3>
&lt;p>We compute state-specific ATTs using the doubly robust panel DID estimator from the &lt;code>DRDID&lt;/code> package, then divide each by the size of the minimum wage increase above the federal level.&lt;/p>
&lt;pre>&lt;code class="language-r"># For each treated state and post-treatment period, compute ATT
# using the doubly robust panel estimator, then normalize by dose
for (state in treated_state_list) {
  g &amp;lt;- unique(subset(data3, state_name == state)$G)
  for (period in 2004:2007) {
    Y1 &amp;lt;- c(subset(data3, state_name == state &amp;amp; year == period)$lemp,
            subset(data3, G == 0 &amp;amp; year == period)$lemp)
    Y0 &amp;lt;- c(subset(data3, state_name == state &amp;amp; year == g - 1)$lemp,
            subset(data3, G == 0 &amp;amp; year == g - 1)$lemp)
    D &amp;lt;- c(rep(1, sum(data3$state_name == state &amp;amp; data3$year == period)),
           rep(0, sum(data3$G == 0 &amp;amp; data3$year == period)))
    attst &amp;lt;- DRDID::drdid_panel(Y1, Y0, D, covariates = NULL)
    treat_amount &amp;lt;- unique(subset(data3, state_name == state &amp;amp;
                                  year == period)$state_mw) - 5.15
    att_per_dollar &amp;lt;- attst$ATT / treat_amount
  }
}
# Note: this is a simplified excerpt. See analysis.R for the full
# implementation with result storage, event study aggregation, and plots.
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall ATT per dollar: -0.0297 (SE: 0.0155)
Event study ATT per dollar:
event_time att se ci_lower ci_upper
0 -0.028 0.020 -0.066 0.010
1 -0.055 0.012 -0.079 -0.031
2 -0.091 0.015 -0.120 -0.062
3 -0.097 0.017 -0.130 -0.064
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_08_att_per_dollar.png" alt="Event study of treatment effects normalized by the dollar amount of the minimum wage increase, showing the employment response per dollar of additional minimum wage.">&lt;/p>
&lt;p>The dose-normalized results tell a consistent story. The on-impact effect per dollar is $-0.028$ (not quite significant at the 5% level), but the effect grows substantially with exposure: $-0.055$ after one year, $-0.091$ after two years, and $-0.097$ after three years. These per-dollar estimates imply that a \$1 increase in the minimum wage is associated with a decline of 0.055 log points in teen employment after one year (approximately 5.3%) and 0.097 log points after three years (approximately 9.2%). The post-treatment estimates from $e=1$ onward are all statistically significant. The overall ATT per dollar of $-0.030$ (SE = 0.016) averages across all post-treatment periods, but the event study makes clear that the cumulative effects are substantially larger.&lt;/p>
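&lt;p>Because the outcome is log employment, a coefficient $b$ in log points corresponds to a percent change of $100 \times (e^{b} - 1)$. The conversions quoted above can be verified with a couple of lines (Python used here purely as a calculator):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def log_points_to_percent(b):
    # percent change implied by a log-point estimate b
    return 100.0 * (math.exp(b) - 1.0)

# Per-dollar event study estimates from the output above
assert round(log_points_to_percent(-0.055), 2) == -5.35   # about -5.3 percent
assert round(log_points_to_percent(-0.097), 2) == -9.24   # about -9.2 percent
&lt;/code>&lt;/pre>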
&lt;h2 id="9-alternative-identification-strategies">9. Alternative Identification Strategies&lt;/h2>
&lt;p>The DID framework relies on the parallel trends assumption. Alternative identification strategies relax this assumption in different ways. The &lt;code>pte&lt;/code> package implements a &lt;strong>lagged outcomes&lt;/strong> strategy, which conditions on lagged outcome values rather than assuming parallel trends. Instead of assuming that treated and untreated groups would have followed the same trend, this approach assumes that conditioning on the previous period&amp;rsquo;s outcome level makes treatment assignment as good as random &amp;mdash; two counties with the same employment level last year are treated as comparable, whether or not their state went on to raise its minimum wage.&lt;/p>
&lt;pre>&lt;code class="language-r">library(pte)
data2_lo &amp;lt;- data2
data2_lo$G2 &amp;lt;- data2_lo$G
lo_res &amp;lt;- pte::pte_default(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;,
                           gname = &amp;quot;G2&amp;quot;, data = data2_lo,
                           d_outcome = FALSE, lagged_outcome_cov = TRUE)
summary(lo_res)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall ATT: -0.061 (SE: 0.008, 95% CI: [-0.077, -0.045])
Dynamic Effects:
Event Time Estimate Std. Error [95% Conf. Band]
-2 0.014 0.008 -0.010 0.038
-1 0.010 0.007 -0.009 0.030
0 -0.024 0.009 -0.049 0.000
1 -0.074 0.008 -0.097 -0.050 *
2 -0.129 0.019 -0.185 -0.073 *
3 -0.140 0.023 -0.206 -0.074 *
&lt;/code>&lt;/pre>
&lt;p>The lagged outcomes strategy produces an overall ATT of $-0.061$ (SE = 0.008), very close to the DID estimates with covariates ($-0.065$). The pre-trends under this alternative identification strategy are close to zero (0.014 at $e=-2$ and 0.010 at $e=-1$, both insignificant), and the post-treatment trajectory ($-0.024$ on impact, $-0.074$ at $e=1$, $-0.129$ at $e=2$, $-0.140$ at $e=3$) closely mirrors the DID event study. This convergence across identification strategies strengthens the case that the estimated negative employment effects reflect a genuine causal relationship rather than an artifact of any particular set of assumptions.&lt;/p>
&lt;h2 id="10-discussion-and-takeaways">10. Discussion and Takeaways&lt;/h2>
&lt;p>This tutorial demonstrates why &lt;strong>TWFE regressions are unreliable&lt;/strong> with staggered treatment adoption and treatment effect heterogeneity, and how modern DID methods provide a principled alternative. The TWFE coefficient of $-0.038$ understates the true overall ATT of $-0.057$ by about one-third, with the bias driven primarily by pre-treatment contamination (64% of the total bias) and improper post-treatment weighting (36%). The Callaway-Sant&amp;rsquo;Anna framework cleanly separates identification from estimation by first computing group-time ATTs and then aggregating them into target parameters of interest.&lt;/p>
&lt;p>The substantive findings suggest that state-level minimum wage increases above the federal floor reduced teen employment, with effects that grew over time. The doubly robust estimator with covariates yields an overall ATT of $-0.065$ (SE = 0.008), and the dose-normalized analysis finds effects of approximately $-0.055$ per dollar after one year and $-0.097$ per dollar after three years. These results are robust across estimation methods (regression adjustment, IPW, doubly robust), comparison group definitions (never-treated, not-yet-treated), and base period choices (universal, varying).&lt;/p>
&lt;p>However, the results come with important caveats. The HonestDiD sensitivity analysis shows that the on-impact effect loses statistical significance when post-treatment parallel trends violations exceed about 67% of the pre-treatment deviations. The pre-treatment coefficient at $e=-3$ is moderately significant in the unconditional analysis, though it shrinks after covariate adjustment. These patterns suggest that while the evidence points toward negative employment effects, the magnitude should be interpreted with some caution. As Callaway (2022) notes, this application is primarily intended to illustrate the methodology rather than to settle the minimum wage debate.&lt;/p>
&lt;p>The modern DID toolkit demonstrated here &amp;mdash; &lt;code>did&lt;/code> for group-time ATTs, &lt;code>twfeweights&lt;/code> for diagnosing TWFE problems, &lt;code>HonestDiD&lt;/code> for sensitivity analysis, and &lt;code>DRDID&lt;/code> for doubly robust estimation &amp;mdash; provides applied researchers with a complete workflow for credible causal inference in staggered treatment settings. The key lesson is that DID is not just a regression &amp;mdash; it is an identification strategy that requires careful attention to the structure of the treatment, the comparison group, and the plausibility of the underlying assumptions.&lt;/p>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>TWFE understates the true ATT by ~33% ($-0.038$ vs $-0.057$), with 64% of the bias from pre-treatment contamination and 36% from improper post-treatment weighting&lt;/li>
&lt;li>The doubly robust ATT of $-0.065$ is stable across estimation methods (regression, IPW, DR), comparison groups (never-treated, not-yet-treated), and base periods (universal, varying)&lt;/li>
&lt;li>Employment effects accumulate over time: $-0.027$ on impact, growing to $-0.147$ after three years under the doubly robust specification&lt;/li>
&lt;li>The on-impact effect is robust to parallel trends violations up to 67% of pre-trend magnitude ($\bar{M} \approx 0.67$), but not beyond&lt;/li>
&lt;li>Per-dollar normalization reveals that a \$1 minimum wage increase reduces teen employment by approximately 5.3% after one year and 9.2% after three years&lt;/li>
&lt;/ol>
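&lt;p>The arithmetic behind takeaway 1 is worth making explicit: the TWFE coefficient and the overall group-time ATT differ by 0.019 log points, which is about one-third of the true effect. A quick check (plain Python, with the two estimates copied from above):&lt;/p>
&lt;pre>&lt;code class="language-python">twfe_estimate = -0.038   # TWFE coefficient from the earlier regression
overall_att   = -0.057   # overall ATT from the group-time framework

gap = overall_att - twfe_estimate   # the part of the effect TWFE misses
assert round(gap, 3) == -0.019

share_understated = gap / overall_att
assert round(share_understated, 2) == 0.33   # TWFE misses about one-third
&lt;/code>&lt;/pre>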
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Expand the sample:&lt;/strong> Re-run the analysis using &lt;code>data3&lt;/code> (which includes the G=2007 group) and compare the results. Does including the additional treatment group change the overall ATT or the event study pattern?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Alternative covariates:&lt;/strong> Experiment with different covariate specifications in the doubly robust estimator. What happens if you include only &lt;code>lpop&lt;/code>? Only &lt;code>lavg_pay&lt;/code>? Does the choice of covariates meaningfully affect the pre-trends?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Smoothness sensitivity:&lt;/strong> Run the HonestDiD smoothness-based sensitivity analysis (&lt;code>type = &amp;quot;smoothness&amp;quot;&lt;/code>) in addition to the relative magnitude analysis. How do the two approaches compare in terms of the robustness of the results?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-references">12. References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Callaway, B. (2022). Difference-in-Differences for Policy Evaluation. In &lt;em>Handbook of Labor, Human Resources, and Population Economics&lt;/em>. Springer. &lt;a href="https://link.springer.com/referenceworkentry/10.1007/978-3-319-57365-6_352-1" target="_blank" rel="noopener">Published version&lt;/a> | &lt;a href="https://bcallaway11.github.io/files/Callaway-Chapter-2022/main.pdf" target="_blank" rel="noopener">Working paper&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Callaway, B. and Sant&amp;rsquo;Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 200&amp;ndash;230. &lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">doi:10.1016/j.jeconom.2020.12.001&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 254&amp;ndash;277. &lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">doi:10.1016/j.jeconom.2021.03.014&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rambachan, A. and Roth, J. (2023). A More Credible Approach to Parallel Trends. &lt;em>Review of Economic Studies&lt;/em>, 90(5), 2555&amp;ndash;2591. &lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">doi:10.1093/restud/rdad018&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>de Chaisemartin, C. and D&amp;rsquo;Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. &lt;em>American Economic Review&lt;/em>, 110(9), 2964&amp;ndash;2996.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 175&amp;ndash;199.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>did&lt;/code> package: &lt;a href="https://cran.r-project.org/package=did" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://github.com/bcallaway11/did" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>fixest&lt;/code> package: &lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://lrberge.github.io/fixest/" target="_blank" rel="noopener">Documentation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>twfeweights&lt;/code> package: &lt;a href="https://github.com/bcallaway11/twfeweights" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>HonestDiD&lt;/code> package: &lt;a href="https://cran.r-project.org/package=HonestDiD" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://github.com/asheshrambachan/HonestDiD" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the post may still contain errors, so exercise caution before applying its contents to actual research projects.&lt;/p></description></item><item><title>Sensitivity Analysis for Parallel Trends in Difference-in-Differences Using honestdid in Stata</title><link>https://carlos-mendez.org/post/stata_honestdid/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_honestdid/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Difference-in-differences (DiD) is one of the most widely used methods for estimating causal effects in the social sciences. But every DiD estimate rests on a single critical assumption &amp;mdash; &lt;strong>parallel trends&lt;/strong> &amp;mdash; and that assumption is fundamentally untestable. With only two periods of data, researchers cannot check whether treated and control groups followed similar trends before treatment. With multiple periods, researchers can run a pre-trends test, but as Roth (2022) demonstrated, these tests have low statistical power and can create a false sense of security.&lt;/p>
&lt;p>So what can researchers do? The &lt;code>honestdid&lt;/code> package, developed by Rambachan and Roth (2023), provides a formal &lt;strong>sensitivity analysis&lt;/strong> framework. Instead of asking the binary question &amp;ldquo;Do parallel trends hold?&amp;rdquo; it asks a more useful question: &amp;ldquo;How large would violations of parallel trends need to be before my conclusion changes?&amp;rdquo; The answer &amp;mdash; called the &lt;strong>breakdown value&lt;/strong> &amp;mdash; is a single number that tells the reader exactly how robust the result is.&lt;/p>
&lt;p>This tutorial teaches the method in two self-contained parts. &lt;strong>Part 1&lt;/strong> starts with the simplest possible DiD &amp;mdash; two groups, two periods &amp;mdash; where parallel trends cannot be tested at all. We show how &lt;code>honestdid&lt;/code> can still provide meaningful robustness analysis in this limited-data setting. &lt;strong>Part 2&lt;/strong> extends to a multi-period event study, where we have more pre-treatment data and can deploy the full toolkit, including both relative magnitudes and smoothness restrictions. Throughout, we use data from the Affordable Care Act&amp;rsquo;s Medicaid expansion to study the effect of expanding health insurance eligibility on insurance coverage.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Construct a simple 2x2 difference-in-differences estimate and understand the parallel trends assumption&lt;/li>
&lt;li>Recognize that parallel trends &lt;strong>cannot be tested&lt;/strong> with only two periods of data&lt;/li>
&lt;li>Apply &lt;code>honestdid&lt;/code> with relative magnitudes (DeltaRM) to assess robustness even in the 2x2 case&lt;/li>
&lt;li>Interpret breakdown values as a quantitative measure of how robust a DiD result is&lt;/li>
&lt;li>Estimate a multi-period event study and run a conventional pre-trends test&lt;/li>
&lt;li>Explain why pre-trends tests have low power and can mislead researchers&lt;/li>
&lt;li>Apply both DeltaRM and smoothness restrictions (DeltaSD) to multi-period DiD&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-study-context-----medicaid-expansion">2. Study context &amp;mdash; Medicaid expansion&lt;/h2>
&lt;p>The Affordable Care Act (ACA) gave US states the option to expand Medicaid eligibility to low-income adults. Some states expanded in 2014, while others chose not to expand at all. This creates a natural quasi-experiment: states that expanded serve as the &lt;strong>treatment group&lt;/strong>, and states that never expanded serve as the &lt;strong>control group&lt;/strong>. The outcome of interest is the share of the population with health insurance coverage (&lt;code>dins&lt;/code>).&lt;/p>
&lt;p>This is an &lt;strong>observational study&lt;/strong>, not a randomized experiment. States were not randomly assigned to expand Medicaid &amp;mdash; they chose to do so based on political and economic factors. This means that the parallel trends assumption is a genuine concern: states that chose to expand may have been on different insurance coverage trajectories than non-expanders even before 2014.&lt;/p>
&lt;p>Our target estimand is the &lt;strong>average treatment effect on the treated (ATT)&lt;/strong> &amp;mdash; the effect of Medicaid expansion on insurance coverage in the states that expanded. We will use this dataset in two ways: first restricted to a narrow window around the treatment year (Part 1), then with the full panel spanning 2008&amp;ndash;2015 (Part 2).&lt;/p>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Type&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>stfips&lt;/code>&lt;/td>
&lt;td>State FIPS code&lt;/td>
&lt;td>Panel ID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>year&lt;/code>&lt;/td>
&lt;td>Calendar year (2008&amp;ndash;2015)&lt;/td>
&lt;td>Time variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>dins&lt;/code>&lt;/td>
&lt;td>Share of population with health insurance&lt;/td>
&lt;td>Outcome (0&amp;ndash;1)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>yexp2&lt;/code>&lt;/td>
&lt;td>Year of Medicaid expansion (missing if never)&lt;/td>
&lt;td>Treatment timing&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="3-analytical-roadmap">3. Analytical roadmap&lt;/h2>
&lt;p>The diagram below shows how the tutorial progresses. Each part is self-contained, with its own estimation and sensitivity analysis.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;2x2 DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimation&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Sensitivity&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Relative Magnitudes&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Event Study&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimation&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Sensitivity&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;RM + Smoothness&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Part 1 uses a simple before-and-after comparison where parallel trends is untestable. Part 2 leverages the full panel to run richer sensitivity analyses, including smoothness restrictions that require multiple pre-treatment periods.&lt;/p>
&lt;hr>
&lt;h2 id="4-setup-----data-loading-and-packages">4. Setup &amp;mdash; data loading and packages&lt;/h2>
&lt;p>We begin by installing the required packages and loading the Medicaid expansion dataset.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install required packages
capture ssc install require, replace
capture ssc install ftools, replace
capture ssc install reghdfe, replace
capture ssc install coefplot, replace
capture ssc install drdid, replace
capture ssc install csdid, replace
capture net install honestdid, from(&amp;quot;https://raw.githubusercontent.com/mcaceresb/stata-honestdid/main&amp;quot;) replace
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">(output omitted)
&lt;/code>&lt;/pre>
&lt;p>Now we load the data and examine its structure. The dataset contains state-level panel data on health insurance coverage from 2008 to 2015, with information on when each state expanded Medicaid eligibility.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load data
use &amp;quot;https://raw.githubusercontent.com/Mixtape-Sessions/Advanced-DID/main/Exercises/Data/ehec_data.dta&amp;quot;, clear
* Examine the data
des
tab year
tab yexp2, m
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Contains data
obs: 552
vars: 5
variable name type format label variable label
stfips byte %8.0g STATEFIP state FIPS code
year int %8.0g YEAR Census/ACS survey year
dins float %9.0g Insurance Rate among low-income
childless adults
yexp2 float %9.0g Year of Medicaid Expansion
W float %9.0g total survey weight
year | Freq.
-----------+----------
2008 | 46
2009 | 46
... | ...
2019 | 46
-----------+----------
Total | 552
yexp2 | Freq.
------------+----------
2014 | 264
2015 | 36
2016 | 24
2017 | 12
2019 | 24
. | 192
------------+----------
Total | 552
&lt;/code>&lt;/pre>
&lt;p>The data contains 552 observations across 46 states and 12 years (2008&amp;ndash;2019). States expanded Medicaid in different years &amp;mdash; 22 in 2014, 3 in 2015, 2 in 2016, 1 in 2017, and 2 in 2019 &amp;mdash; while 16 states never expanded (missing &lt;code>yexp2&lt;/code>). For a clean two-group comparison, we restrict the sample to 2014-expanders and never-expanders, and keep only years 2008&amp;ndash;2015.&lt;/p>
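&lt;p>Because each state contributes one observation per year, the cohort sizes quoted above follow directly from the frequency table: each &lt;code>yexp2&lt;/code> frequency divided by the 12 survey years gives the number of states in that expansion cohort. A quick arithmetic check (plain Python, for illustration only):&lt;/p>
&lt;pre>&lt;code class="language-python">years = 12   # survey years 2008 through 2019

# Frequencies from the yexp2 tabulation above
cohort_freqs = {2014: 264, 2015: 36, 2016: 24, 2017: 12, 2019: 24}
never_freq = 192

states_by_cohort = {g: f // years for g, f in cohort_freqs.items()}
assert states_by_cohort == {2014: 22, 2015: 3, 2016: 2, 2017: 1, 2019: 2}
assert never_freq // years == 16
assert sum(states_by_cohort.values()) + 16 == 46   # total states in the panel
&lt;/code>&lt;/pre>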
&lt;pre>&lt;code class="language-stata">* Restrict to 2008--2015, keep only 2014 expanders and never-expanders
keep if (year &amp;lt;= 2015) &amp;amp; (missing(yexp2) | (yexp2 == 2014))
* Create treatment indicator
gen byte D = (yexp2 == 2014)
* Verify sample
tab D
tab year
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> D | Freq.
------------+----------
0 | 128
1 | 176
------------+----------
Total | 304
year | Freq.
------------+----------
2008 | 38
2009 | 38
2010 | 38
2011 | 38
2012 | 38
2013 | 38
2014 | 38
2015 | 38
------------+----------
Total | 304
&lt;/code>&lt;/pre>
&lt;p>Our analysis sample contains 38 states observed across 8 years (2008&amp;ndash;2015): 22 treatment states that expanded Medicaid in 2014 and 16 control states that never expanded. This balanced panel provides the foundation for both parts of the tutorial.&lt;/p>
&lt;hr>
&lt;h1 id="part-1-simple-2x2-difference-in-differences">Part 1: Simple 2x2 Difference-in-Differences&lt;/h1>
&lt;h2 id="5-the-2x2-did-----concept-and-estimation">5. The 2x2 DiD &amp;mdash; concept and estimation&lt;/h2>
&lt;h3 id="51-collapsing-to-two-periods">5.1 Collapsing to two periods&lt;/h3>
&lt;p>The 2x2 DiD is the simplest version of difference-in-differences: two groups (treated and control) observed in two time periods (before and after treatment). We collapse our multi-year data into a single pre-treatment average (2008&amp;ndash;2013) and a single post-treatment average (2014&amp;ndash;2015).&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
PRE_T[&amp;quot;&amp;lt;b&amp;gt;Treated States&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-2014 average&amp;quot;]
PRE_C[&amp;quot;&amp;lt;b&amp;gt;Control States&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-2014 average&amp;quot;]
POST_T[&amp;quot;&amp;lt;b&amp;gt;Treated States&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-2014 average&amp;quot;]
POST_C[&amp;quot;&amp;lt;b&amp;gt;Control States&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-2014 average&amp;quot;]
DID[&amp;quot;&amp;lt;b&amp;gt;DiD Estimate&amp;lt;/b&amp;gt;&amp;quot;]
PRE_T --&amp;gt;|&amp;quot;Change in Treated&amp;quot;| POST_T
PRE_C --&amp;gt;|&amp;quot;Change in Control&amp;quot;| POST_C
POST_T --&amp;gt; DID
POST_C --&amp;gt; DID
style PRE_T fill:#00d4c8,stroke:#141413,color:#141413
style POST_T fill:#00d4c8,stroke:#141413,color:#141413
style PRE_C fill:#6a9bcc,stroke:#141413,color:#fff
style POST_C fill:#6a9bcc,stroke:#141413,color:#fff
style DID fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>To see the four means that define the 2x2 DiD, we create a post-treatment indicator and compute group averages.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Create post indicator
gen byte post = (year &amp;gt;= 2014)
* Compute the four group means
preserve
collapse (mean) dins, by(D post)
list, clean noobs
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> D post dins
0 0 .6189702
0 1 .6836083
1 0 .6544622
1 1 .7808657
&lt;/code>&lt;/pre>
&lt;p>The four cells of the 2x2 table reveal the raw pattern. Control states (D = 0) saw insurance coverage rise from 61.90% to 68.36% &amp;mdash; a gain of 6.46 percentage points reflecting nationwide trends. Treated states (D = 1) saw a larger increase from 65.45% to 78.09% &amp;mdash; a gain of 12.64 percentage points. The DiD estimate is the difference of these two changes: 12.64 - 6.46 = &lt;strong>6.18 percentage points&lt;/strong>. This is the causal effect of Medicaid expansion on insurance coverage among low-income childless adults, under the parallel trends assumption.&lt;/p>
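&lt;p>The double difference can be reproduced directly from the four cell means (plain Python as a calculator, with the means copied from the collapse output above):&lt;/p>
&lt;pre>&lt;code class="language-python"># Four cell means from the collapse output above
control_pre,  control_post = 0.6189702, 0.6836083
treated_pre,  treated_post = 0.6544622, 0.7808657

change_control = control_post - control_pre    # nationwide trend
change_treated = treated_post - treated_pre    # trend plus treatment effect
did = change_treated - change_control          # difference-in-differences

assert round(100 * change_control, 2) == 6.46
assert round(100 * change_treated, 2) == 12.64
assert round(100 * did, 2) == 6.18             # matches the regression below
&lt;/code>&lt;/pre>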
&lt;h3 id="52-regression-based-2x2-did">5.2 Regression-based 2x2 DiD&lt;/h3>
&lt;p>The same estimate emerges from a regression. The 2x2 DiD regression specification is:&lt;/p>
&lt;p>$$Y_{it} = \alpha + \beta \cdot \text{Treat}_i + \gamma \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, the outcome for state $i$ in period $t$ equals a baseline level ($\alpha$), a treatment group fixed effect ($\beta$), a post-period fixed effect ($\gamma$), and the interaction ($\delta$) &amp;mdash; which is the DiD estimate. The coefficient $\delta$ captures how much the treated group&amp;rsquo;s outcome changed relative to the control group&amp;rsquo;s change.&lt;/p>
&lt;pre>&lt;code class="language-stata">* 2x2 DiD regression
reg dins i.D##i.post, cluster(stfips)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 304
F(3, 37) = 182.58
Prob &amp;gt; F = 0.0000
R-squared = 0.4722
Root MSE = .05526
(Std. err. adjusted for 38 clusters in stfips)
------------------------------------------------------------------------------
| Robust
dins | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.D | .035492 .0176856 2.01 0.052 -.0003425 .0713265
1.post | .0646382 .0052781 12.25 0.000 .0539437 .0753326
|
D#post |
1 1 | .0617653 .0085367 7.24 0.000 .0444682 .0790624
|
_cons | .6189702 .0122906 50.36 0.000 .5940671 .6438732
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The regression confirms the manual calculation: the interaction coefficient (&lt;code>1.D#1.post&lt;/code>) is 0.0618, corresponding to a 6.18 percentage point increase in insurance coverage. The effect is highly statistically significant (t = 7.24, p &amp;lt; 0.001), with a 95% confidence interval of [4.45, 7.91] percentage points. The standard errors are clustered at the state level to account for within-state correlation over time.&lt;/p>
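&lt;p>As a small cross-check of the regression table, the reported t statistic is simply the interaction coefficient divided by its clustered standard error (values copied from the output above; Python for illustration only):&lt;/p>
&lt;pre>&lt;code class="language-python"># Values copied from the regression table above
coef = 0.0617653   # 1.D#1.post interaction
se   = 0.0085367   # cluster-robust standard error

assert round(100 * coef, 2) == 6.18    # effect in percentage points
assert round(coef / se, 2) == 7.24     # reported t statistic
&lt;/code>&lt;/pre>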
&lt;h3 id="53-the-parallel-trends-problem-in-the-2x2">5.3 The parallel trends problem in the 2x2&lt;/h3>
&lt;p>This estimate relies on a crucial assumption: absent Medicaid expansion, treated and control states would have followed the &lt;strong>same trend&lt;/strong> in insurance coverage. Formally, the parallel trends assumption states:&lt;/p>
&lt;p>$$E[Y_{it}(0) | \text{Treat}_i = 1] - E[Y_{it-1}(0) | \text{Treat}_i = 1] = E[Y_{it}(0) | \text{Treat}_i = 0] - E[Y_{it-1}(0) | \text{Treat}_i = 0]$$&lt;/p>
&lt;p>In words, the change in untreated potential outcomes would have been the same for both groups. The problem is that &lt;strong>with only two periods of data, we have no way to test this&lt;/strong>. We observe each group once before treatment and once after. There is no earlier period to check whether trends were already diverging.&lt;/p>
&lt;p>Imagine a single photograph of two runners side by side mid-race. They appear to be even, and you assume they have been running at the same pace all along &amp;mdash; but what if one had been accelerating? With only one snapshot, you cannot know. This is exactly the situation in the 2x2 DiD: we assume parallel trends because we have no evidence against it, but we also have no evidence for it.&lt;/p>
&lt;p>The &lt;code>honestdid&lt;/code> package provides a way forward. Instead of assuming parallel trends holds perfectly, it asks: &lt;strong>&amp;ldquo;How large would the violation of parallel trends need to be before the DiD result breaks down?&amp;quot;&lt;/strong> The next section makes this precise.&lt;/p>
&lt;p>&lt;img src="stata_honestdid_2x2_means.png" alt="Line plot showing treated and control group means before and after Medicaid expansion, with a dashed counterfactual line showing where the treated group would have been under parallel trends. The gap between the actual treated line and the counterfactual is the DiD estimate.">
&lt;em>Figure 1: Group means and counterfactual trend. The dashed line shows where treated states would have been without Medicaid expansion (parallel trends assumption). The gap between the solid treated line and the dashed counterfactual is the DiD estimate of 6.18 pp.&lt;/em>&lt;/p>
&lt;hr>
&lt;h2 id="6-sensitivity-analysis-for-the-2x2-did">6. Sensitivity analysis for the 2x2 DiD&lt;/h2>
&lt;p>Before applying sensitivity analysis, note that the 2x2 DiD estimate of 6.18 pp averages across all pre-treatment years (2008&amp;ndash;2013) and all post-treatment years (2014&amp;ndash;2015). The event study estimates in this section and in Part 2 measure year-specific effects relative to the reference year 2013. These are different parameters &amp;mdash; the event study will show 4.23 pp for 2014 and 6.87 pp for 2015, which bracket the 2x2 average.&lt;/p>
&lt;h3 id="61-setting-up-the-event-study-for-honestdid">6.1 Setting up the event study for honestdid&lt;/h3>
&lt;p>To apply &lt;code>honestdid&lt;/code>, we need coefficients in an event study format &amp;mdash; at least one pre-treatment coefficient and one post-treatment coefficient, relative to a reference period. We restrict the data to a narrow three-year window around the treatment year: 2012 (one year before the reference), 2013 (the reference period, just before treatment), and 2014 (the treatment year).&lt;/p>
&lt;p>This gives us the simplest event study possible: one pre-period coefficient (the 2012 vs 2013 difference between treated and control) and one post-period coefficient (the 2014 vs 2013 difference). The pre-period coefficient tells us whether the groups were already diverging before treatment.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Restrict to 3-year window: 2012, 2013, 2014
preserve
keep if inrange(year, 2012, 2014)
* Create Dyear variable (treatment-year interaction)
gen Dyear = cond(D, year, 2013)
* Event study with 2013 as reference
reghdfe dins b2013.Dyear, absorb(stfips year) cluster(stfips) noconstant
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 114
Absorbing 2 HDFE groups F( 2, 37) = 16.27
R-squared = 0.9604
Number of clusters (stfips) = 38 Root MSE = 0.0174
(Std. err. adjusted for 38 clusters in stfips)
------------------------------------------------------------------------------
| Robust
dins | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
Dyear |
2012 | -.0062865 .0059107 -1.06 0.294 -.0182626 .0056897
2014 | .0423401 .0082657 5.12 0.000 .0255923 .059088
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The pre-period coefficient for 2012 is -0.0063, which is small in magnitude and statistically insignificant (t = -1.06, p = 0.294). This suggests that treated and control states were on similar trajectories in the year before treatment. The post-period coefficient for 2014 is 0.0423, indicating that Medicaid expansion increased insurance coverage by 4.23 percentage points relative to the reference year, a highly significant effect (t = 5.12, p &amp;lt; 0.001).&lt;/p>
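&lt;p>As a quick cross-check (a sketch, assuming the three-year window is still the active sample; &lt;code>post14&lt;/code> is a hypothetical helper indicator), the 2014 event-study coefficient should match a simple 2x2 DiD comparing 2013 to 2014, since with two groups and two periods the saturated TWFE model reduces to the 2x2 comparison.&lt;/p>
&lt;pre>&lt;code class="language-stata">* 2x2 cross-check on 2013 vs 2014 only
gen byte post14 = (year == 2014)
regress dins i.D##i.post14 if inrange(year, 2013, 2014), vce(cluster stfips)
* The 1.D#1.post14 coefficient should equal the 2014 event-study
* coefficient (0.0423) up to rounding; clustered SEs may differ slightly.
drop post14
&lt;/code>&lt;/pre>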
&lt;h3 id="62-introducing-relative-magnitudes-deltarm">6.2 Introducing relative magnitudes (DeltaRM)&lt;/h3>
&lt;p>Now we apply the core innovation of Rambachan and Roth (2023). The &lt;strong>relative magnitudes&lt;/strong> restriction bounds the post-treatment violation of parallel trends relative to the largest pre-treatment violation:&lt;/p>
&lt;p>$$\Delta^{RM}(\bar{M}): \quad |\delta_t^{\text{post}}| \leq \bar{M} \cdot \max_{s \in \text{pre}} |\delta_s|$$&lt;/p>
&lt;p>In words, this restriction says: &amp;ldquo;the true deviation from parallel trends after treatment can be at most $\bar{M}$ times as large as the largest true deviation in the pre-treatment period.&amp;rdquo; We do not observe these true deviations directly &amp;mdash; the package uses the estimated pre-period coefficients and their uncertainty to construct valid confidence intervals. The parameter $\bar{M}$ &amp;mdash; read as &amp;ldquo;M-bar&amp;rdquo; &amp;mdash; controls how much violation we allow:&lt;/p>
&lt;ul>
&lt;li>$\bar{M} = 0$: exact parallel trends in the &lt;strong>post-treatment&lt;/strong> period (strongest assumption), though pre-treatment deviations are still allowed&lt;/li>
&lt;li>$\bar{M} = 1$: post-treatment violation can be as large as the worst pre-treatment violation&lt;/li>
&lt;li>$\bar{M} = 2$: post-treatment violation can be twice the worst pre-treatment violation&lt;/li>
&lt;/ul>
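&lt;p>To make the restriction concrete, a small sketch (assuming the Section 6.1 event study is still the active estimation) translates each $\bar{M}$ into the absolute post-treatment violation it permits, using the lone 2012 pre-period coefficient as the scaling factor.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Back-of-the-envelope: what each Mbar permits, given the 2012 coefficient
scalar maxpre = abs(_b[2012.Dyear])
foreach Mbar of numlist 0 0.5 1 2 {
    display "Mbar = `Mbar' allows |post violation| up to " %6.4f `Mbar'*maxpre
}
&lt;/code>&lt;/pre>
&lt;p>This is only a rough scale: the formal restriction applies to the unobserved deviations $\delta$, and &lt;code>honestdid&lt;/code> accounts for the sampling uncertainty in the estimated coefficients.&lt;/p>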
&lt;p>Think of the breakdown value like a bridge stress test. Engineers do not just ask &amp;ldquo;Can the bridge hold the expected load?&amp;rdquo; They ask &amp;ldquo;How much MORE load can it take before it fails?&amp;rdquo; The &lt;strong>breakdown value&lt;/strong> is that safety margin for your DiD estimate &amp;mdash; the value of $\bar{M}$ at which the confidence interval first includes zero and the conclusion reverses.&lt;/p>
&lt;p>The diagram below summarizes the &lt;code>honestdid&lt;/code> workflow &amp;mdash; from the event study coefficients all the way to the breakdown value.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Event Study&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Coefficients + VCV&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Choose Restriction&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;DeltaRM or DeltaSD&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Set M Values&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;mvec(0, 0.5, 1, ...)&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Robust CIs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for each M&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Breakdown Value&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CI first includes zero&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
style E fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="63-running-honestdid">6.3 Running honestdid&lt;/h3>
&lt;p>We apply &lt;code>honestdid&lt;/code> to the event study results from the three-year window. With &lt;code>pre(1/1)&lt;/code>, we tell the package that coefficient position 1 (the 2012 coefficient) is the pre-period, and &lt;code>post(3/3)&lt;/code> specifies position 3 (the 2014 coefficient) as the post-period, skipping the omitted 2013 reference at position 2.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sensitivity analysis: relative magnitudes
honestdid, pre(1/1) post(3/3) mvec(0(0.5)2)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">| M | lb | ub |
| ------- | ------ | ------ |
| . | 0.026 | 0.059 | (Original)
| 0.0000 | 0.026 | 0.059 |
| 0.5000 | 0.022 | 0.060 |
| 1.0000 | 0.017 | 0.064 |
| 1.5000 | 0.010 | 0.069 |
| 2.0000 | 0.003 | 0.076 |
(method = C-LF, Delta = DeltaRM)
&lt;/code>&lt;/pre>
&lt;p>The table shows robust confidence intervals for different values of $\bar{M}$, constructed using the C-LF (conditional least-favorable) method &amp;mdash; a procedure that accounts for both sampling uncertainty and the worst-case bias allowed by the restriction. The first row ($\bar{M}$ = .) shows the original confidence interval without any sensitivity adjustment: [0.026, 0.059]. As $\bar{M}$ increases, we allow larger violations of parallel trends, and the confidence interval widens. Even at $\bar{M}$ = 2 &amp;mdash; allowing post-treatment violations twice as large as the pre-treatment difference &amp;mdash; the lower bound remains positive at 0.003. The result is remarkably robust: the conclusion that Medicaid expansion increased insurance coverage survives even generous assumptions about parallel trends violations.&lt;/p>
&lt;h3 id="64-the-sensitivity-plot">6.4 The sensitivity plot&lt;/h3>
&lt;p>We can visualize the sensitivity analysis with a plot that shows how the confidence interval expands as we relax the parallel trends assumption.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Generate the sensitivity plot
honestdid, pre(1/1) post(3/3) mvec(0(0.5)2) coefplot
graph export &amp;quot;stata_honestdid_2x2_rm.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_honestdid_2x2_rm.png" alt="Sensitivity plot showing robust confidence intervals for the 2x2 DiD estimate under the relative magnitudes restriction, with M-bar on the x-axis and the treatment effect on the y-axis. The confidence interval widens as M-bar increases but remains above zero throughout.">
&lt;em>Figure 2: Relative magnitudes sensitivity for the 2x2 DiD. The CI stays above zero even at M-bar = 2.&lt;/em>&lt;/p>
&lt;p>Each point on the plot shows the robust confidence interval at a given $\bar{M}$. Moving right on the x-axis means allowing progressively larger violations of parallel trends. The breakdown value is where the confidence interval first touches zero. In this case, the confidence interval stays above zero even at $\bar{M}$ = 2 (lower bound = 0.003), meaning the result is robust to post-treatment violations that are at least twice as large as the pre-treatment divergence we observed.&lt;/p>
&lt;h3 id="65-what-did-we-learn">6.5 What did we learn?&lt;/h3>
&lt;p>Even with just three periods of data &amp;mdash; barely more than the textbook 2x2 &amp;mdash; &lt;code>honestdid&lt;/code> lets us go far beyond the simple assertion &amp;ldquo;we assume parallel trends holds.&amp;rdquo; We can now say: &amp;ldquo;Our result is robust to post-treatment violations of parallel trends that are at least twice as large as the pre-treatment difference between groups.&amp;rdquo; This is a much more informative and credible statement.&lt;/p>
&lt;p>However, with only one pre-period coefficient, we are limited to the relative magnitudes restriction. The &lt;strong>smoothness restriction&lt;/strong> (DeltaSD) &amp;mdash; which bounds how quickly the trend can change direction &amp;mdash; requires at least two pre-period coefficients to compute second differences. To unlock this richer analysis, we need more pre-treatment data. That is exactly what Part 2 provides.&lt;/p>
&lt;p>Now that we have established Part 1&amp;rsquo;s results, we restore the full dataset and move to the multi-period analysis.&lt;/p>
&lt;pre>&lt;code class="language-stata">restore
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h1 id="part-2-multi-period-difference-in-differences">Part 2: Multi-period Difference-in-Differences&lt;/h1>
&lt;h2 id="7-from-2x2-to-event-study">7. From 2x2 to event study&lt;/h2>
&lt;h3 id="71-why-more-periods-help">7.1 Why more periods help&lt;/h3>
&lt;p>With the full panel (2008&amp;ndash;2015), we have five pre-treatment years instead of just one. This gives us two advantages. First, we can &lt;strong>visually inspect&lt;/strong> whether treated and control groups were on similar trajectories before 2014. Second, &lt;code>honestdid&lt;/code> has richer information to calibrate the scale of potential violations, and we unlock the smoothness restriction that was unavailable in Part 1.&lt;/p>
&lt;p>The multi-period event study estimates a separate treatment effect for each year relative to a reference year. The specification is:&lt;/p>
&lt;p>$$Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \cdot \mathbb{1}[K_{it} = k] + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, the outcome for state $i$ in year $t$ depends on state fixed effects ($\alpha_i$), year fixed effects ($\lambda_t$), and a set of event-time indicators. $K_{it}$ measures event time &amp;mdash; years relative to treatment onset (2014). The reference period $k = -1$ (year 2013) is omitted, so each $\beta_k$ measures the treated-control difference in year $k$ relative to the year just before treatment. The pre-treatment coefficients ($\beta_{-6}$ through $\beta_{-2}$) show whether trends were already diverging; the post-treatment coefficients ($\beta_0$ and $\beta_1$) capture the treatment effect.&lt;/p>
&lt;p>&lt;strong>Variable mapping:&lt;/strong> $Y$ = &lt;code>dins&lt;/code>, $\alpha_i$ = state dummies (absorbed by &lt;code>reghdfe&lt;/code>), $\lambda_t$ = year dummies, and $K_{it}$ = &lt;code>Dyear&lt;/code> interaction variable.&lt;/p>
&lt;h3 id="72-estimation">7.2 Estimation&lt;/h3>
&lt;p>We now estimate the event study using all eight years of data. The variable &lt;code>Dyear&lt;/code> interacts treatment status with calendar year, and we omit 2013 as the reference.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Create Dyear for event study (full sample)
gen Dyear = cond(D, year, 2013)
* Full event study: 2008--2015 with 2013 as reference
reghdfe dins b2013.Dyear, absorb(stfips year) cluster(stfips) noconstant
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 304
Absorbing 2 HDFE groups F( 7, 37) = 10.37
R-squared = 0.9505
Number of clusters (stfips) = 38 Root MSE = 0.0185
(Std. err. adjusted for 38 clusters in stfips)
------------------------------------------------------------------------------
| Robust
dins | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
Dyear |
2008 | -.0095956 .0076769 -1.25 0.219 -.0251505 .0059593
2009 | -.0132771 .0073502 -1.81 0.079 -.02817 .0016159
2010 | -.0018712 .0067698 -0.28 0.784 -.0155881 .0118457
2011 | -.0064012 .0070425 -0.91 0.369 -.0206707 .0078682
2012 | -.0062865 .005944 -1.06 0.297 -.0183302 .0057573
2014 | .0423401 .0083124 5.09 0.000 .0254977 .0591826
2015 | .0687134 .0108512 6.33 0.000 .0467268 .0906999
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The five pre-treatment coefficients (2008&amp;ndash;2012) are all small in magnitude and statistically insignificant, ranging from -0.0133 to -0.0019. This suggests that treated and control states followed similar insurance coverage trajectories before Medicaid expansion. The post-treatment coefficients show a sharp break: insurance coverage jumped by 4.23 percentage points in 2014 and 6.87 percentage points in 2015, both highly significant. The growing effect over time is consistent with gradual Medicaid enrollment &amp;mdash; eligible individuals signing up over the first two years of the program.&lt;/p>
&lt;p>We visualize these coefficients in a standard event study plot.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Event study plot
coefplot, vertical yline(0, lcolor(gs8)) ///
xline(5.5, lpattern(dash) lcolor(gs8)) ///
ciopts(recast(rcap)) ///
ytitle(&amp;quot;Effect on insurance share&amp;quot;) xtitle(&amp;quot;Year&amp;quot;) ///
title(&amp;quot;Event Study: Medicaid Expansion and Insurance Coverage&amp;quot;)
graph export &amp;quot;stata_honestdid_event_study.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_honestdid_event_study.png" alt="Event study plot showing pre-treatment coefficients clustered around zero from 2008 to 2012 and a sharp positive jump in 2014 and 2015, with a dashed vertical line marking the treatment year.">
&lt;em>Figure 3: Event study coefficients. Pre-treatment coefficients hover near zero; post-treatment effects are large and significant.&lt;/em>&lt;/p>
&lt;p>The event study plot makes the pattern visually clear. Pre-treatment coefficients hover around zero with no discernible trend, while post-treatment coefficients jump sharply upward. The dashed vertical line marks the onset of treatment in 2014.&lt;/p>
&lt;h3 id="73-conventional-pre-trends-test">7.3 Conventional pre-trends test&lt;/h3>
&lt;p>The standard approach is to conduct a joint F-test of all pre-treatment coefficients. If we fail to reject the null that all pre-period coefficients are jointly zero, we conclude that parallel trends &amp;ldquo;holds.&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-stata">* Joint test of pre-treatment coefficients
test 2008.Dyear 2009.Dyear 2010.Dyear 2011.Dyear 2012.Dyear
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ( 1) 2008.Dyear = 0
( 2) 2009.Dyear = 0
( 3) 2010.Dyear = 0
( 4) 2011.Dyear = 0
( 5) 2012.Dyear = 0
F( 5, 37) = 0.86
Prob &amp;gt; F = 0.5178
&lt;/code>&lt;/pre>
&lt;p>The pre-trends test yields an F-statistic of 0.86 with a p-value of 0.518, providing no evidence against parallel trends. But should we trust this binary verdict? The next section explains why the answer is no.&lt;/p>
&lt;hr>
&lt;h2 id="8-why-pre-trends-tests-are-not-enough">8. Why pre-trends tests are not enough&lt;/h2>
&lt;p>Your DiD passed the pre-trends test. But should you trust it?&lt;/p>
&lt;p>Think of a pre-trends test as a &lt;strong>smoke detector that only beeps for large fires&lt;/strong>. A fire too small to trigger the alarm can still burn down the house. Similarly, a pre-trends test can fail to detect violations of parallel trends that are large enough to overturn your conclusions. Roth (2022) demonstrated two important problems with conventional pre-trends tests:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Low power.&lt;/strong> Pre-trends tests often cannot detect violations of parallel trends that are economically meaningful. A test with 50 observations per group may require a violation three times larger than the treatment effect to reject the null at 5% significance. Violations smaller than this detection threshold go unnoticed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pre-test bias.&lt;/strong> Conditioning on passing the pre-trends test introduces bias. The estimates that survive the pre-test are a selected sample &amp;mdash; they look better than they should. Researchers who report &amp;ldquo;parallel trends holds&amp;rdquo; are unknowingly presenting results that have been filtered to appear more credible than they are.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The fundamental issue is that the pre-trends test asks a binary question &amp;mdash; &amp;ldquo;reject or not?&amp;rdquo; &amp;mdash; when what we really need is a &lt;strong>continuous measure&lt;/strong> of robustness. Instead of asking &amp;ldquo;Are parallel trends exactly satisfied?&amp;rdquo; we should ask &amp;ldquo;How robust are our conclusions to plausible violations?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
PT[&amp;quot;&amp;lt;b&amp;gt;Parallel Trends Assumption&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(untestable)&amp;quot;]
CONV[&amp;quot;&amp;lt;b&amp;gt;Conventional Approach&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-trends test&amp;lt;br/&amp;gt;(binary: reject or not)&amp;quot;]
HONEST[&amp;quot;&amp;lt;b&amp;gt;HonestDiD Approach&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Sensitivity analysis&amp;lt;br/&amp;gt;(how much violation&amp;lt;br/&amp;gt;can we tolerate?)&amp;quot;]
RESULT_C[&amp;quot;Parallel trends holds&amp;lt;br/&amp;gt;(false confidence)&amp;quot;]
RESULT_H[&amp;quot;Results robust up to&amp;lt;br/&amp;gt;M-bar = X violations&amp;lt;br/&amp;gt;(calibrated conclusion)&amp;quot;]
PT --&amp;gt; CONV
PT --&amp;gt; HONEST
CONV --&amp;gt; RESULT_C
HONEST --&amp;gt; RESULT_H
style PT fill:#141413,stroke:#d97757,color:#fff
style CONV fill:#d97757,stroke:#141413,color:#fff
style HONEST fill:#00d4c8,stroke:#141413,color:#141413
style RESULT_C fill:#d97757,stroke:#141413,color:#fff
style RESULT_H fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>honestdid&lt;/code> approach replaces the binary verdict with a quantitative statement: &amp;ldquo;Our result is robust to violations of parallel trends up to $\bar{M}$ times the largest pre-treatment violation.&amp;rdquo; This is like reporting the load at which a bridge fails, rather than just saying &amp;ldquo;the bridge passed inspection.&amp;rdquo;&lt;/p>
&lt;hr>
&lt;h2 id="9-sensitivity-analysis-----relative-magnitudes-full-panel">9. Sensitivity analysis &amp;mdash; relative magnitudes (full panel)&lt;/h2>
&lt;h3 id="91-rm-with-5-pre-periods">9.1 RM with 5 pre-periods&lt;/h3>
&lt;p>We now apply the same relative magnitudes restriction from Part 1, but with the richer information from five pre-treatment periods. The equation is the same:&lt;/p>
&lt;p>$$\Delta^{RM}(\bar{M}): \quad |\delta_t^{\text{post}}| \leq \bar{M} \cdot \max_{s \in \text{pre}} |\delta_s|$$&lt;/p>
&lt;p>With five pre-period coefficients instead of one, the &amp;ldquo;max pre-period violation&amp;rdquo; is calibrated from more data points, giving a more reliable scale for what constitutes a plausible violation.&lt;/p>
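&lt;p>A short sketch (assuming the full event study from Section 7.2 is the active estimation) extracts this scaling factor, the largest absolute pre-period coefficient (about 0.013 here, from the 2009 estimate):&lt;/p>
&lt;pre>&lt;code class="language-stata">* Largest absolute pre-period coefficient: the RM scaling factor
scalar maxpre = 0
foreach y of numlist 2008/2012 {
    scalar maxpre = max(maxpre, abs(_b[`y'.Dyear]))
}
display "max pre-period deviation: " %6.4f maxpre
&lt;/code>&lt;/pre>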
&lt;pre>&lt;code class="language-stata">* Relative magnitudes: full panel
honestdid, pre(1/5) post(7/8) mvec(0(0.5)2)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">| M | lb | ub |
| ------- | ------ | ------ |
| . | 0.026 | 0.059 | (Original)
| 0.0000 | 0.027 | 0.058 |
| 0.5000 | 0.021 | 0.063 |
| 1.0000 | 0.013 | 0.071 |
| 1.5000 | 0.003 | 0.081 |
| 2.0000 | -0.007 | 0.091 |
(method = C-LF, Delta = DeltaRM)
&lt;/code>&lt;/pre>
&lt;p>With five pre-periods calibrating the scale of violations, the confidence intervals widen faster than in the 2x2 case. At $\bar{M}$ = 0 (exact parallel trends), the robust CI is [0.027, 0.058]. At $\bar{M}$ = 1, allowing violations as large as the worst pre-period deviation, the CI remains positive: [0.013, 0.071]. At $\bar{M}$ = 1.5, the lower bound is barely positive at 0.003. At $\bar{M}$ = 2, the lower bound turns negative at -0.007. The breakdown value is approximately $\bar{M}$ = 1.5&amp;ndash;2 &amp;mdash; the post-treatment violation of parallel trends would need to be about 1.5 to 2 times as large as the worst pre-treatment deviation to overturn the conclusion.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sensitivity plot: relative magnitudes
honestdid, pre(1/5) post(7/8) mvec(0(0.5)2) coefplot
graph export &amp;quot;stata_honestdid_rm_full.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_honestdid_rm_full.png" alt="Sensitivity plot for the relative magnitudes restriction with five pre-treatment periods, showing the robust confidence interval widening as M-bar increases from 0 to 2.">
&lt;em>Figure 4: Relative magnitudes sensitivity with 5 pre-periods. The CI crosses zero between M-bar = 1.5 and 2.&lt;/em>&lt;/p>
&lt;p>The sensitivity plot confirms the pattern: the confidence interval steadily widens as we allow larger violations, crossing zero between $\bar{M}$ = 1.5 and 2. Compared to the 2x2 case in Part 1, where the CI stayed positive even at $\bar{M}$ = 2, the full-panel analysis yields a slightly smaller breakdown value. This happens because having more pre-period coefficients can produce a larger &amp;ldquo;max pre-period violation&amp;rdquo; (the scaling factor on the right-hand side of the relative magnitudes formula), which scales up the allowed post-treatment violation for any given $\bar{M}$.&lt;/p>
&lt;h3 id="92-focusing-on-the-average-post-treatment-effect">9.2 Focusing on the average post-treatment effect&lt;/h3>
&lt;p>By default, &lt;code>honestdid&lt;/code> examines the first post-treatment period. We can instead ask about the average treatment effect across both post-treatment periods (2014 and 2015) using the &lt;code>l_vec&lt;/code> option, which specifies weights for combining the post-period coefficients.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Average effect across 2014 and 2015
matrix l_vec = 0.5 \ 0.5
honestdid, pre(1/5) post(7/8) mvec(0(0.5)2) l_vec(l_vec)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">| M | lb | ub |
| ------- | ------ | ------ |
| . | 0.039 | 0.072 | (Original)
| 0.0000 | 0.039 | 0.072 |
| 0.5000 | 0.029 | 0.079 |
| 1.0000 | 0.014 | 0.092 |
| 1.5000 | -0.002 | 0.107 |
| 2.0000 | -0.019 | 0.123 |
(method = C-LF, Delta = DeltaRM)
&lt;/code>&lt;/pre>
&lt;p>The average treatment effect across 2014&amp;ndash;2015 has a larger point estimate (the original CI is [0.039, 0.072]) because the 2015 effect is larger than the 2014 effect. The breakdown value for the average effect is between $\bar{M}$ = 1 and 1.5: at $\bar{M}$ = 1 the lower bound is still positive (0.014), but at $\bar{M}$ = 1.5 it turns negative (-0.002). Interestingly, the average effect is slightly &lt;em>less&lt;/em> robust than the first-period effect alone (breakdown between 1 and 1.5 versus between 1.5 and 2). This can happen because potential trend deviations accumulate over a longer horizon, so averaging in the later period admits larger worst-case bias.&lt;/p>
&lt;hr>
&lt;h2 id="10-sensitivity-analysis-----smoothness-restrictions">10. Sensitivity analysis &amp;mdash; smoothness restrictions&lt;/h2>
&lt;h3 id="101-introducing-deltasd">10.1 Introducing DeltaSD&lt;/h3>
&lt;p>Relative magnitudes asks: &amp;ldquo;How large can the violation be?&amp;rdquo; A complementary question is: &amp;ldquo;How quickly can the trend change direction?&amp;rdquo; This is the &lt;strong>smoothness restriction&lt;/strong> (DeltaSD), which bounds the second differences of the trend deviation.&lt;/p>
&lt;p>Think of the two restrictions like driving rules. Relative magnitudes imposes a &lt;strong>speed limit&lt;/strong> &amp;mdash; the violation cannot exceed $\bar{M}$ times the maximum observed pre-treatment violation. Smoothness imposes an &lt;strong>acceleration limit&lt;/strong> &amp;mdash; the violation cannot change direction too sharply between consecutive periods. A car might be going fast but safely if it accelerated gradually; a sudden swerve is dangerous even at moderate speed.&lt;/p>
&lt;p>Formally, the smoothness restriction bounds the second difference:&lt;/p>
&lt;p>$$\Delta^{SD}(M): \quad |(\delta_{t+1} - \delta_t) - (\delta_t - \delta_{t-1})| \leq M \quad \text{for all } t$$&lt;/p>
&lt;p>In words, the &amp;ldquo;acceleration&amp;rdquo; of the parallel trends violation &amp;mdash; how much the slope changes from one period to the next &amp;mdash; cannot exceed $M$ for any consecutive triple of periods. When $M = 0$, the trend deviation is perfectly linear (constant slope). Larger $M$ allows more curvature.&lt;/p>
&lt;p>This restriction was &lt;strong>not available in Part 1&lt;/strong> because it requires at least two pre-period coefficients to compute second differences (you need three points to calculate one &amp;ldquo;acceleration&amp;rdquo;). With five pre-periods, we can now use this richer restriction.&lt;/p>
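&lt;p>To get a feel for plausible values of $M$, a short sketch (assuming the full event study from Section 7.2 is the active estimation) computes the second differences of the estimated pre-period coefficients; the omitted 2013 coefficient enters as zero by construction.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Second differences of the pre-period path (a rough guide to M)
matrix pre = (_b[2008.Dyear], _b[2009.Dyear], _b[2010.Dyear], ///
              _b[2011.Dyear], _b[2012.Dyear], 0)
forvalues j = 2/5 {
    local sd = (el(pre,1,`j'+1) - el(pre,1,`j')) ///
             - (el(pre,1,`j') - el(pre,1,`j'-1))
    display "second difference at position `j': " %8.4f `sd'
}
&lt;/code>&lt;/pre>
&lt;p>These estimated second differences are noisy versions of the true ones, but their magnitudes (roughly 0.005&amp;ndash;0.016 here) suggest the scale at which the &lt;code>mvec()&lt;/code> grid should be set.&lt;/p>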
&lt;h3 id="102-running-honestdid-with-deltasd">10.2 Running honestdid with DeltaSD&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Smoothness restriction
honestdid, pre(1/5) post(7/8) mvec(0(0.005)0.04) delta(sd)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">| M | lb | ub |
| ------- | ------ | ------ |
| . | 0.026 | 0.059 | (Original)
| 0.0000 | 0.026 | 0.058 |
| 0.0050 | 0.013 | 0.061 |
| 0.0100 | 0.007 | 0.065 |
| 0.0150 | 0.002 | 0.070 |
| 0.0200 | -0.003 | 0.075 |
| 0.0250 | -0.008 | 0.080 |
| 0.0300 | -0.013 | 0.085 |
| 0.0350 | -0.018 | 0.090 |
| 0.0400 | -0.023 | 0.095 |
(method = FLCI, Delta = DeltaSD)
&lt;/code>&lt;/pre>
&lt;p>Note that &lt;code>honestdid&lt;/code> automatically selects the FLCI (fixed-length confidence interval) method for smoothness restrictions, rather than the C-LF method used for relative magnitudes; FLCI constructs a confidence interval with optimal length under the smoothness restriction. At $M$ = 0 (a perfectly linear trend deviation), the robust CI is [0.026, 0.058]. At $M$ = 0.01, the CI is [0.007, 0.065], still comfortably above zero. At $M$ = 0.015, the lower bound is barely positive at 0.002, and at $M$ = 0.02 it turns negative at -0.003. The breakdown value is therefore approximately $M$ = 0.015&amp;ndash;0.02: the rate of divergence from parallel trends would need to change by more than 1.5&amp;ndash;2 percentage points between consecutive periods to overturn the finding.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Smoothness sensitivity plot
honestdid, pre(1/5) post(7/8) mvec(0(0.005)0.04) delta(sd) coefplot
graph export &amp;quot;stata_honestdid_sd_full.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_honestdid_sd_full.png" alt="Sensitivity plot for the smoothness restriction showing the robust confidence interval widening as M increases from 0 to 0.04, crossing zero near M = 0.02.">
&lt;em>Figure 5: Smoothness restriction sensitivity. The CI crosses zero near M = 0.02.&lt;/em>&lt;/p>
&lt;p>The smoothness restriction yields a different perspective. Unlike relative magnitudes &amp;mdash; where $\bar{M}$ is a dimensionless multiplier &amp;mdash; the smoothness parameter $M$ is measured in the same units as the outcome (insurance share). A breakdown value of $M$ = 0.015&amp;ndash;0.02 means the rate of divergence from parallel trends would need to shift by about 1.5&amp;ndash;2 percentage points between consecutive periods to invalidate the result.&lt;/p>
&lt;h3 id="103-comparing-rm-vs-sd">10.3 Comparing RM vs SD&lt;/h3>
&lt;p>The two approaches offer complementary views of robustness:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Restriction&lt;/th>
&lt;th>Parameter&lt;/th>
&lt;th>Breakdown Value&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Relative Magnitudes&lt;/td>
&lt;td>$\bar{M}$&lt;/td>
&lt;td>~1.5&amp;ndash;2&lt;/td>
&lt;td>Post violation can be up to 1.5&amp;ndash;2x the max pre violation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Smoothness&lt;/td>
&lt;td>$M$&lt;/td>
&lt;td>~0.015&amp;ndash;0.02&lt;/td>
&lt;td>Rate of trend divergence can shift by up to 1.5&amp;ndash;2 pp between periods&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="104-when-to-choose-which-restriction">10.4 When to choose which restriction&lt;/h3>
&lt;ul>
&lt;li>Use &lt;strong>DeltaRM&lt;/strong> when: (a) you have few pre-periods &amp;mdash; it works with just one, (b) the pre-treatment coefficients look like random noise around zero with no clear trend, or (c) you want a dimensionless measure of robustness that is easy to communicate&lt;/li>
&lt;li>Use &lt;strong>DeltaSD&lt;/strong> when: (a) you have two or more pre-periods, (b) there is a visible pre-trend (non-zero slope) and you want to formalize how much the slope can change, or (c) you want bounds measured in the outcome&amp;rsquo;s units&lt;/li>
&lt;li>&lt;strong>Report both&lt;/strong> when feasible, as we did here, to provide a complete picture&lt;/li>
&lt;/ul>
&lt;p>In general, relative magnitudes is the more popular choice because it is intuitive and works with minimal data. Smoothness restrictions are complementary &amp;mdash; they capture a different form of violation (abrupt changes in trend direction rather than large absolute deviations).&lt;/p>
&lt;h3 id="105-how-to-report-honestdid-results-in-a-paper">10.5 How to report honestdid results in a paper&lt;/h3>
&lt;p>Many readers will want to apply this method in their own work. Here is example text you can adapt for a manuscript:&lt;/p>
&lt;blockquote>
&lt;p>We conduct sensitivity analysis following Rambachan and Roth (2023). Under relative magnitudes restrictions, the treatment effect on insurance coverage remains statistically significant for $\bar{M}$ up to 1.5 (95% robust CI: [0.003, 0.081]). Under smoothness restrictions, the result is robust for $M$ up to 0.015 (95% robust CI: [0.002, 0.070]). These breakdown values indicate that post-treatment deviations from parallel trends would need to be at least 1.5 times the largest pre-treatment deviation to overturn the conclusion.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="11-extension-----staggered-did-with-csdid-and-honestdid">11. Extension &amp;mdash; staggered DiD with csdid and honestdid&lt;/h2>
&lt;h3 id="111-why-staggered-timing-matters">11.1 Why staggered timing matters&lt;/h3>
&lt;p>Our analysis so far restricted attention to states expanding in 2014 and compared them to never-expanders. But different states expanded Medicaid at different times &amp;mdash; some in 2014, others in 2015 or later. Callaway and Sant&amp;rsquo;Anna (2021) showed that standard two-way fixed effects (TWFE) regressions can produce misleading estimates when treatment timing varies across units, especially if treatment effects are heterogeneous over time. The &lt;code>csdid&lt;/code> package provides a heterogeneity-robust estimator that correctly handles staggered treatment adoption.&lt;/p>
&lt;p>We reload the dataset and apply &lt;code>csdid&lt;/code> followed by &lt;code>honestdid&lt;/code>. We keep the same two-group sample (2014-expanders vs never-treated) to demonstrate the &lt;code>csdid&lt;/code> workflow. With a single treatment cohort, the TWFE and Callaway-Sant&amp;rsquo;Anna estimates should agree &amp;mdash; but in settings with multiple treatment cohorts and heterogeneous effects, they can diverge substantially.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Reload full dataset for staggered analysis
use &amp;quot;https://raw.githubusercontent.com/Mixtape-Sessions/Advanced-DID/main/Exercises/Data/ehec_data.dta&amp;quot;, clear
* Restrict to 2008--2015, keep 2014-expanders and never-expanders
keep if (year &amp;lt;= 2015) &amp;amp; (missing(yexp2) | (yexp2 == 2014))
* Replace missing yexp2 with 0 for csdid (never-treated)
replace yexp2 = 0 if missing(yexp2)
* Callaway-Sant'Anna estimator
* long2: compare each post-period to base period (long differences)
* notyet: use not-yet-treated units as additional controls
csdid dins, ivar(stfips) time(year) gvar(yexp2) long2 notyet
* Aggregate to event study
csdid_estat event, window(-5 1) estore(csdid)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">ATT by Periods Before and After treatment
Event Study:Dynamic effects
------------------------------------------------------------------------------
| Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Pre_avg | -.0074863 .0056726 -1.32 0.187 -.0186045 .0036318
Post_avg | .0555267 .0083153 6.68 0.000 .0392291 .0718244
Tm6 | -.0095956 .0073982 -1.30 0.195 -.0240958 .0049045
Tm5 | -.0132771 .0070833 -1.87 0.061 -.0271601 .000606
Tm4 | -.0018712 .006524 -0.29 0.774 -.0146579 .0109155
Tm3 | -.0064012 .0067868 -0.94 0.346 -.0197031 .0069006
Tm2 | -.0062865 .0057282 -1.10 0.272 -.0175135 .0049406
Tp0 | .0423401 .0080105 5.29 0.000 .0266398 .0580405
Tp1 | .0687134 .0104571 6.57 0.000 .0482177 .089209
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The Callaway-Sant&amp;rsquo;Anna event study confirms the pattern from our TWFE analysis: the pre-treatment coefficients (Tm6 through Tm2) are all small and statistically insignificant at the 5% level (Tm5 is the closest, at p = 0.061), while the post-treatment effects (Tp0 = 0.0423 in 2014, Tp1 = 0.0687 in 2015) are large and highly significant. The average post-treatment effect is 5.55 percentage points.&lt;/p>
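&lt;p>Note the internal consistency of the output: the reported Post_avg is simply the mean of the two post-period coefficients, $(0.0423 + 0.0687)/2 = 0.0555$, and Pre_avg is likewise the mean of the five coefficients Tm6 through Tm2.&lt;/p>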
&lt;h3 id="112-applying-honestdid-to-staggered-estimates">11.2 Applying honestdid to staggered estimates&lt;/h3>
&lt;p>We now apply &lt;code>honestdid&lt;/code> to the Callaway-Sant&amp;rsquo;Anna event study estimates. The &lt;code>pre()&lt;/code> and &lt;code>post()&lt;/code> indices refer to the event-time coefficient positions, skipping the Pre_avg and Post_avg summary rows at positions 1&amp;ndash;2.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Restore csdid results and apply honestdid
estimates restore csdid
* csdid_estat stores: Pre_avg(1), Post_avg(2), Tm6(3)..Tm2(7), Tp0(8), Tp1(9)
honestdid, pre(3/7) post(8/9) mvec(0(0.5)2) coefplot
graph export &amp;quot;stata_honestdid_csdid.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">| M | lb | ub |
| ------- | ------ | ------ |
| . | 0.027 | 0.058 | (Original)
| 0.0000 | 0.027 | 0.058 |
| 0.5000 | 0.022 | 0.062 |
| 1.0000 | 0.014 | 0.071 |
| 1.5000 | 0.004 | 0.080 |
| 2.0000 | -0.007 | 0.090 |
(method = C-LF, Delta = DeltaRM, alpha = 0.050)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_honestdid_csdid.png" alt="Sensitivity plot for the relative magnitudes restriction applied to the Callaway-Sant&amp;rsquo;Anna staggered DiD estimates.">
&lt;em>Figure 6: Sensitivity analysis for staggered DiD. Breakdown value is consistent with the TWFE analysis.&lt;/em>&lt;/p>
&lt;p>The staggered-robust estimates from &lt;code>csdid&lt;/code> produce a breakdown value between $\bar{M}$ = 1.5 and 2: at $\bar{M}$ = 1.5 the lower bound is still positive (0.004), but at $\bar{M}$ = 2 it turns negative (-0.007). This is nearly identical to the TWFE analysis in Section 9 &amp;mdash; reassuring, and expected, since we restricted the sample to a single treatment cohort (2014 expanders vs never-treated), a setting where TWFE is reliable. In settings with multiple treatment cohorts and heterogeneous effects, the TWFE and staggered estimates can diverge substantially, which makes this comparison an important robustness check.&lt;/p>
&lt;hr>
&lt;h2 id="12-discussion-and-summary">12. Discussion and summary&lt;/h2>
&lt;h3 id="121-summary-of-all-sensitivity-analyses">12.1 Summary of all sensitivity analyses&lt;/h3>
&lt;p>The table below collects every sensitivity analysis from this tutorial. Scanning across the rows reveals which settings and restrictions yield stronger or weaker robustness.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Analysis&lt;/th>
&lt;th>Setting&lt;/th>
&lt;th>Restriction&lt;/th>
&lt;th>Breakdown Value&lt;/th>
&lt;th>Robustness&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Section 6&lt;/td>
&lt;td>2x2 (1 pre-period)&lt;/td>
&lt;td>DeltaRM&lt;/td>
&lt;td>&amp;gt; 2&lt;/td>
&lt;td>Very robust&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Section 9.1&lt;/td>
&lt;td>Full panel, first period&lt;/td>
&lt;td>DeltaRM&lt;/td>
&lt;td>~1.5&amp;ndash;2&lt;/td>
&lt;td>Robust&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Section 9.2&lt;/td>
&lt;td>Full panel, average effect&lt;/td>
&lt;td>DeltaRM&lt;/td>
&lt;td>~1&amp;ndash;1.5&lt;/td>
&lt;td>Moderately robust&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Section 10&lt;/td>
&lt;td>Full panel, first period&lt;/td>
&lt;td>DeltaSD&lt;/td>
&lt;td>~0.015&amp;ndash;0.02&lt;/td>
&lt;td>Moderately robust&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Section 11&lt;/td>
&lt;td>Staggered (csdid)&lt;/td>
&lt;td>DeltaRM&lt;/td>
&lt;td>~1.5&amp;ndash;2&lt;/td>
&lt;td>Consistent with TWFE&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The first-period treatment effect is the most robust finding across all approaches. The average effect over 2014&amp;ndash;2015 is slightly less robust because it accumulates potential violations over a longer horizon. The smoothness-restriction breakdown value (~0.015&amp;ndash;0.02) is expressed in the outcome&amp;rsquo;s units and is therefore not directly comparable to the dimensionless relative magnitudes bound; the two restrictions formalize different kinds of deviation from parallel trends.&lt;/p>
&lt;p>This tutorial demonstrated how to move beyond the binary question &amp;ldquo;Do parallel trends hold?&amp;rdquo; to the much more useful question &amp;ldquo;How robust are my results to violations of parallel trends?&amp;rdquo; The &lt;code>honestdid&lt;/code> package makes this transition straightforward in Stata.&lt;/p>
&lt;h3 id="122-key-takeaways">12.2 Key takeaways&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Method insight &amp;mdash; the breakdown value replaces the pre-trends test.&lt;/strong> The breakdown value is the single most informative number to report alongside any DiD estimate. It tells the reader exactly how much they need to doubt parallel trends before the result breaks down. For the Medicaid expansion, the breakdown value is approximately $\bar{M}$ = 1.5&amp;ndash;2 under relative magnitudes and $M$ = 0.015&amp;ndash;0.02 under smoothness restrictions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data insight &amp;mdash; Medicaid expansion robustly increased insurance coverage.&lt;/strong> The 2x2 DiD estimate of 6.18 percentage points survives sensitivity analysis. In the full-panel event study, the 4.23 percentage point effect in 2014 remains significant up to approximately $\bar{M}$ = 1.5&amp;ndash;2, meaning the post-treatment violation would need to be roughly 1.5 to 2 times the worst pre-period deviation to overturn the result.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Practical insight &amp;mdash; honestdid works even with limited data.&lt;/strong> Part 1 showed that sensitivity analysis is possible with just one pre-period coefficient. You do not need a long panel to use this tool &amp;mdash; though more pre-treatment periods unlock richer analyses (DeltaSD).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Limitation &amp;mdash; sensitivity is not identification.&lt;/strong> The breakdown value tells you how much violation is tolerable, not whether violations actually occur. Subject-matter knowledge about the specific policy context remains essential for assessing whether the parallel trends assumption is plausible.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Next step &amp;mdash; apply honestdid to your own DiD.&lt;/strong> Every DiD analysis should report a breakdown value. The package works with &lt;code>reghdfe&lt;/code>, &lt;code>csdid&lt;/code>, &lt;code>did_multiplegt&lt;/code>, and &lt;code>jwdid&lt;/code> &amp;mdash; any estimator that produces event-study coefficients and a variance-covariance matrix. Tip: use &lt;code>honestdid, coefplot cached&lt;/code> to re-plot previous results without recomputation &amp;mdash; useful for customizing graph appearance.&lt;/p>
&lt;/li>
&lt;/ol>
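&lt;p>The breakdown value in takeaway 1 has a simple formal definition: it is the largest violation parameter at which the robust confidence interval still excludes zero. Under relative magnitudes,&lt;/p>
&lt;p>$$\bar{M}^{*} = \sup \left\{ \bar{M} \ge 0 : 0 \notin \mathrm{CI}_{\text{robust}}(\bar{M}) \right\}$$&lt;/p>
&lt;p>and analogously for $M$ under smoothness restrictions.&lt;/p>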
&lt;p>For policymakers evaluating the ACA&amp;rsquo;s Medicaid expansion, the sensitivity analysis provides calibrated confidence: the insurance coverage gains are genuine and not an artifact of differential trends between expanding and non-expanding states, unless those differential trends were very large relative to the patterns observed before the policy change.&lt;/p>
&lt;hr>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Expand the 2x2 window.&lt;/strong> In Part 1, we used a 3-year window (2012&amp;ndash;2014). Expand it to 4 years (2011&amp;ndash;2014) to get 2 pre-periods. Now try the smoothness restriction (&lt;code>delta(sd)&lt;/code>) &amp;mdash; does it change your conclusion about robustness?&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-stata">* Starter code: restrict to 2011--2014 and re-run
keep if inrange(year, 2011, 2014)
gen Dyear = cond(D, year, 2013)
reghdfe dins b2013.Dyear, absorb(stfips year) cluster(stfips) noconstant
honestdid, pre(1/2) post(4/4) mvec(0(0.005)0.04) delta(sd)
&lt;/code>&lt;/pre>
&lt;ol start="2">
&lt;li>&lt;strong>Focus on the 2015 effect.&lt;/strong> In Part 2, modify &lt;code>l_vec&lt;/code> to focus on only the second post-period (2015). Is the 2015 effect more or less robust than the 2014 effect? Why might longer-horizon effects differ in robustness?&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-stata">* Starter code: l_vec selects only the second post-period
matrix l_vec = (0 \ 1)
honestdid, pre(1/5) post(7/8) mvec(0(0.5)2) l_vec(l_vec)
&lt;/code>&lt;/pre>
&lt;ol start="3">
&lt;li>&lt;strong>Compare TWFE and staggered estimates.&lt;/strong> Run the relative magnitudes analysis on both the TWFE (Section 9) and staggered (Section 11) estimates with the same &lt;code>mvec()&lt;/code> grid. Are the breakdown values similar? If they differ, what does that tell you about treatment effect heterogeneity?&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-stata">* Starter code: after running both TWFE and csdid analyses,
* compare the breakdown values from these two commands:
* TWFE: honestdid, pre(1/5) post(7/8) mvec(0(0.5)2)
* csdid: honestdid, pre(3/7) post(8/9) mvec(0(0.5)2)
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">Rambachan, A. &amp;amp; Roth, J. (2023). A More Credible Approach to Parallel Trends. &lt;em>Review of Economic Studies&lt;/em>, 90(5), 2555&amp;ndash;2591.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/aeri.20210236" target="_blank" rel="noopener">Roth, J. (2022). Pre-test with Caution: Event-Study Estimates after Testing for Parallel Trends. &lt;em>American Economic Review: Insights&lt;/em>, 4(3), 305&amp;ndash;322.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">Callaway, B. &amp;amp; Sant&amp;rsquo;Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 200&amp;ndash;230.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/mcaceresb/stata-honestdid" target="_blank" rel="noopener">HonestDiD Stata Package &amp;mdash; Rambachan &amp;amp; Roth.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/asheshrambachan/HonestDiD" target="_blank" rel="noopener">HonestDiD R Package &amp;mdash; Rambachan &amp;amp; Roth.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/Mixtape-Sessions/Advanced-DID" target="_blank" rel="noopener">Mixtape Sessions &amp;mdash; Advanced DiD (dataset source).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://friosavila.github.io/stpackages/csdid.html" target="_blank" rel="noopener">csdid Stata Package &amp;mdash; Rios-Avila, F.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">reghdfe &amp;mdash; Linear Models with Many Levels of Fixed Effects &amp;mdash; Correia, S.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content in this post may still have errors. Caution is advised when applying the contents of this post to real research projects.&lt;/p></description></item><item><title>Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata</title><link>https://carlos-mendez.org/post/stata_rct/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_rct/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Cash transfer programs are among the most common development interventions worldwide. Governments and international organizations spend billions of dollars each year providing direct cash transfers to low-income households. But how do we rigorously evaluate whether these programs actually work? This tutorial walks through the complete workflow of analyzing a &lt;strong>randomized controlled trial (RCT)&lt;/strong> with &lt;strong>panel data&lt;/strong> in Stata &amp;mdash; from verifying that randomization succeeded, to estimating treatment effects using increasingly sophisticated methods, to comparing results across all approaches.&lt;/p>
&lt;p>We use simulated data from a hypothetical cash transfer program targeting 2,000 households in a developing country. The key advantage of simulated data is that we know the &lt;strong>true treatment effect&lt;/strong> before we begin: the program increases household consumption by &lt;strong>12%&lt;/strong> (0.12 log points). This known ground truth gives us a perfect benchmark to evaluate how well each econometric method recovers the correct answer.&lt;/p>
&lt;p>The tutorial progresses from simple to sophisticated. We start with basic balance checks, then estimate treatment effects three different ways using only endline data &amp;mdash; regression adjustment (RA), inverse probability weighting (IPW), and doubly robust (DR) methods. Next, we unlock the full power of panel data with difference-in-differences (DiD) and its doubly robust extension (DRDID). Finally, we address the real-world complication of imperfect compliance.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Verify baseline balance using t-tests, standardized mean differences, and balance plots&lt;/li>
&lt;li>Distinguish between ATE and ATT and identify which estimand each method targets&lt;/li>
&lt;li>Understand three estimation strategies &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust &amp;mdash; and when to use each&lt;/li>
&lt;li>Estimate treatment effects using all three approaches and compare their results&lt;/li>
&lt;li>Leverage panel data structure with difference-in-differences and understand why DiD estimates ATT&lt;/li>
&lt;li>Apply doubly robust difference-in-differences (DRDID) for modern panel data analysis&lt;/li>
&lt;li>Separate the effect of treatment offer from treatment receipt under imperfect compliance&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-study-design">2. Study design&lt;/h2>
&lt;p>This RCT evaluates a cash transfer program designed to boost household consumption. The study tracks 2,000 households across two survey waves &amp;mdash; a &lt;strong>baseline&lt;/strong> in 2021 (before the program) and an &lt;strong>endline&lt;/strong> in 2024 (after the program was implemented). The diagram below summarizes the experimental design.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
POP[&amp;quot;&amp;lt;b&amp;gt;2,000 Households&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Balanced panel&amp;lt;br/&amp;gt;(observed in 2021 and 2024)&amp;quot;]
STRAT[&amp;quot;&amp;lt;b&amp;gt;Stratified Randomization&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Within poverty strata&amp;quot;]
TRT[&amp;quot;&amp;lt;b&amp;gt;Treatment Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;Offered cash transfer&amp;quot;]
CTL[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;No offer&amp;quot;]
COMP1[&amp;quot;85% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP2[&amp;quot;15% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
COMP3[&amp;quot;5% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP4[&amp;quot;95% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
BASE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment survey&amp;quot;]
END[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment survey&amp;quot;]
POP --&amp;gt; BASE
BASE --&amp;gt; STRAT
STRAT --&amp;gt; TRT
STRAT --&amp;gt; CTL
TRT --&amp;gt; COMP1
TRT --&amp;gt; COMP2
CTL --&amp;gt; COMP3
CTL --&amp;gt; COMP4
COMP1 --&amp;gt; END
COMP2 --&amp;gt; END
COMP3 --&amp;gt; END
COMP4 --&amp;gt; END
style POP fill:#6a9bcc,stroke:#141413,color:#fff
style STRAT fill:#d97757,stroke:#141413,color:#fff
style TRT fill:#00d4c8,stroke:#141413,color:#141413
style CTL fill:#6a9bcc,stroke:#141413,color:#fff
style BASE fill:#6a9bcc,stroke:#141413,color:#fff
style END fill:#d97757,stroke:#141413,color:#fff
style COMP1 fill:#00d4c8,stroke:#141413,color:#141413
style COMP2 fill:#141413,stroke:#d97757,color:#fff
style COMP3 fill:#d97757,stroke:#141413,color:#fff
style COMP4 fill:#141413,stroke:#6a9bcc,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The randomization was &lt;strong>stratified by poverty status&lt;/strong> (block randomization), ensuring that treatment and control groups started with similar proportions of poor and non-poor households. A critical real-world feature of this study is &lt;strong>imperfect compliance&lt;/strong> &amp;mdash; only 85% of households offered the treatment actually received the cash transfer, while 5% of control households received it through other channels.&lt;/p>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Type&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>id&lt;/code>&lt;/td>
&lt;td>Household identifier&lt;/td>
&lt;td>Panel ID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>year&lt;/code>&lt;/td>
&lt;td>Survey year (2021 or 2024)&lt;/td>
&lt;td>Time variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>post&lt;/code>&lt;/td>
&lt;td>Endline indicator (1 = 2024)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>treat&lt;/code>&lt;/td>
&lt;td>Random assignment to offer (intent-to-treat)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>D&lt;/code>&lt;/td>
&lt;td>Actual receipt of cash transfer&lt;/td>
&lt;td>Binary (endogenous)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y&lt;/code>&lt;/td>
&lt;td>Log monthly consumption&lt;/td>
&lt;td>Continuous (outcome)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>age&lt;/code>&lt;/td>
&lt;td>Age of household head&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>female&lt;/code>&lt;/td>
&lt;td>Female-headed household&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>poverty&lt;/code>&lt;/td>
&lt;td>Poverty status at baseline&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>edu&lt;/code>&lt;/td>
&lt;td>Years of education&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y0&lt;/code>&lt;/td>
&lt;td>Log monthly consumption at baseline (pre-treatment)&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Offer vs. receipt&lt;/strong> &amp;mdash; The variable &lt;code>treat&lt;/code> captures random assignment to the program offer. It is exogenous (determined by randomization) and unrelated to household characteristics. The variable &lt;code>D&lt;/code> captures actual receipt of the cash transfer. It is &lt;strong>endogenous&lt;/strong> &amp;mdash; households that chose to take up the program may differ systematically from those that did not. Most methods in this tutorial estimate the effect of the &lt;strong>offer&lt;/strong> (intent-to-treat). Section 10 addresses the effect of &lt;strong>receipt&lt;/strong>.&lt;/p>
&lt;/blockquote>
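&lt;p>As a rough sketch of what Section 10 exploits: under standard instrumental-variables assumptions, with the randomized offer serving as an instrument for receipt, the effect of receipt (the local average treatment effect) is the intent-to-treat effect scaled by the difference in take-up rates between the two arms &amp;mdash; here roughly $0.85 - 0.05 = 0.80$:&lt;/p>
&lt;p>$$\text{LATE} = \frac{E[Y \mid \text{treat}=1] - E[Y \mid \text{treat}=0]}{P(D=1 \mid \text{treat}=1) - P(D=1 \mid \text{treat}=0)} \approx \frac{\text{ITT}}{0.80}$$&lt;/p>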
&lt;hr>
&lt;h2 id="3-analytical-roadmap">3. Analytical roadmap&lt;/h2>
&lt;p>The diagram below shows the progression of methods we will use. Each stage builds on the previous one, adding complexity and robustness.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Balance&amp;lt;br/&amp;gt;Checks&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 5&amp;lt;/i&amp;gt;&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Cross-sectional&amp;lt;br/&amp;gt;RA / IPW / DR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Sections 7--8&amp;lt;/i&amp;gt;&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;br/&amp;gt;DiD / DR-DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 9&amp;lt;/i&amp;gt;&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Endogenous&amp;lt;br/&amp;gt;Treatment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 10&amp;lt;/i&amp;gt;&amp;quot;]
A --&amp;gt; B
B --&amp;gt; C
C --&amp;gt; D
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We first establish that randomization worked (balance checks). Then we estimate treatment effects three ways using only endline data &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust methods. Next, we leverage the full panel structure with difference-in-differences. Finally, we address imperfect compliance by separating the effect of the offer from the effect of receipt.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-loading-and-exploration">4. Data loading and exploration&lt;/h2>
&lt;p>We begin by loading the simulated dataset from a public GitHub repository and examining its structure.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
des y age edu female poverty treat D
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Contains data
Observations: 4,000
Variables: 10
Variable Storage Display Value
name type format label Variable label
─────────────────────────────────────────────────────────────
y float %9.0g Log monthly consumption
age float %9.0g
edu float %9.0g
female float %9.0g
poverty float %9.0g
treat float %9.0g Assignment to offer (Z)
D float %9.0g Receipt of cash transfer
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 4,000 observations &amp;mdash; 2,000 households observed at two time points (baseline 2021 and endline 2024). The outcome variable &lt;code>y&lt;/code> is log monthly consumption, &lt;code>treat&lt;/code> is the random assignment indicator, and &lt;code>D&lt;/code> is the actual receipt indicator.&lt;/p>
&lt;p>Now let us examine summary statistics at baseline and endline separately.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==0
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.0154 .4348886 8.454445 11.48253
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 0 0 0 0
&lt;/code>&lt;/pre>
&lt;p>At baseline, mean log consumption is approximately 10.02, the average household head is 35 years old with 12 years of education, about 51% of households are female-headed, and 31% are in poverty. The treatment assignment rate (&lt;code>treat&lt;/code>) is 51.8%, close to the even split expected from randomization. Crucially, the receipt variable &lt;code>D&lt;/code> is zero for all households at baseline &amp;mdash; the program had not yet been implemented.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==1
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.1137 .4382183 8.638689 11.55002
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 .4615 .4986402 0 1
&lt;/code>&lt;/pre>
&lt;p>At endline, mean consumption has risen to approximately 10.11, reflecting both the natural time trend and the treatment effect. The receipt variable &lt;code>D&lt;/code> is now non-zero &amp;mdash; about 46% of all households received the cash transfer (combining treated households who took up the program and control households who received it through other channels).&lt;/p>
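&lt;p>This figure is consistent with the design&amp;rsquo;s compliance rates: with 51.8% of households assigned to treatment, the expected receipt rate is&lt;/p>
&lt;p>$$0.518 \times 0.85 + 0.482 \times 0.05 \approx 0.464$$&lt;/p>
&lt;p>close to the observed mean of 0.4615.&lt;/p>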
&lt;p>Finally, we declare the panel structure so Stata knows we have repeated observations.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtset id year
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel variable: id (strongly balanced)
Time variable: year, 2021 to 2024, but with gaps
Delta: 1 unit
&lt;/code>&lt;/pre>
&lt;p>The panel is &lt;strong>strongly balanced&lt;/strong> &amp;mdash; all 2,000 households appear in both survey waves, with no attrition. This is an ideal scenario that simplifies our analysis.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-balance-checks">5. Baseline balance checks&lt;/h2>
&lt;p>Before estimating any treatment effects, we must verify that randomization produced comparable treatment and control groups at baseline. This is the most fundamental quality check in any RCT.&lt;/p>
&lt;h3 id="51-t-tests-and-proportion-tests">5.1 T-tests and proportion tests&lt;/h3>
&lt;p>We compare the treatment and control groups on all baseline characteristics using two-sample t-tests for continuous variables and proportion tests for binary variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">ttest y if post==0, by(treat)
ttest age if post==0, by(treat)
ttest edu if post==0, by(treat)
prtest female if post==0, by(treat)
prtest poverty if post==0, by(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Variable | Control Mean Treat Mean Diff p-value
────────────+──────────────────────────────────────────────
y | 10.025 10.006 0.019 0.330
age | 35.335 34.931 0.404 0.350
edu | 11.974 12.077 -0.103 0.247
female | 0.484 0.531 -0.046 0.038 **
poverty | 0.307 0.318 -0.011 0.612
&lt;/code>&lt;/pre>
&lt;p>Most variables show no statistically significant differences between the treatment and control groups. However, the variable &lt;code>female&lt;/code> has a p-value of 0.038 &amp;mdash; a statistically significant imbalance: the treatment group has about 4.6 percentage points more female-headed households than the control group. Such an imbalance can easily arise by chance (at the 5% level, roughly one in twenty tests will reject even under perfect randomization), but it should still be addressed in our estimation.&lt;/p>
&lt;h3 id="52-balance-table-with-standardized-mean-differences">5.2 Balance table with standardized mean differences&lt;/h3>
&lt;p>P-values are sensitive to sample size &amp;mdash; a large sample can make tiny differences &amp;ldquo;significant.&amp;rdquo; Standardized mean differences (SMDs) provide a scale-free measure of imbalance that is more informative. The SMD is computed as the difference in group means divided by the pooled standard deviation &amp;mdash; this puts all variables on the same scale regardless of their units. The common rule of thumb is that SMDs below 10% indicate adequate balance.&lt;/p>
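&lt;p>Concretely, for a covariate with treatment- and control-group means $\bar{x}_T$, $\bar{x}_C$ and standard deviations $s_T$, $s_C$, the SMD is&lt;/p>
&lt;p>$$\text{SMD} = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{(s_T^2 + s_C^2)/2}}$$&lt;/p>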
&lt;pre>&lt;code class="language-stata">capture ssc install ietoolkit, replace
iebaltab y age edu female poverty if post==0, grpvar(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">                  (1)           (2)        (1)-(2)
Control Treatment Difference
y 10.025 10.006 0.019
(0.014) (0.014) (0.019)
age 35.335 34.931 0.404
(0.316) (0.295) (0.432)
edu 11.974 12.077 -0.103
(0.063) (0.063) (0.089)
female 0.484 0.531 -0.046**
(0.016) (0.016) (0.022)
poverty 0.307 0.318 -0.011
(0.015) (0.014) (0.021)
N 964 1,036
&lt;/code>&lt;/pre>
&lt;p>The balance table confirms our t-test findings. With 964 control and 1,036 treatment households, all variables are well balanced except &lt;code>female&lt;/code>, which shows a statistically significant difference (marked with **). The outcome variable &lt;code>y&lt;/code> has a negligible difference of 0.019 at baseline &amp;mdash; the groups started with essentially identical consumption levels.&lt;/p>
&lt;h3 id="53-visual-balance-plot">5.3 Visual balance plot&lt;/h3>
&lt;p>A balance plot provides a visual overview of all SMDs at once, making it easy to spot problematic variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">net install balanceplot, from(&amp;quot;https://tdmize.github.io/data&amp;quot;) replace
balanceplot y age edu i.female i.poverty, group(treat) table nodropdv
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_balance_plot.png" alt="Balance plot showing standardized mean differences for all covariates. All variables fall within the 10% threshold, with female closest at approximately 9.3%.">&lt;/p>
&lt;p>The balance plot shows that all SMDs fall below the 10% threshold (indicated by the dashed vertical lines). The variable &lt;code>female&lt;/code> has the largest SMD at approximately 9.3% &amp;mdash; close to but still below the conventional threshold. The remaining variables &amp;mdash; consumption, age, education, and poverty &amp;mdash; all have SMDs well below 5%. Overall, randomization was successful, but we should control for &lt;code>female&lt;/code> (and other covariates) in our estimation to improve precision.&lt;/p>
&lt;h3 id="54-aipw-as-a-formal-balance-test">5.4 AIPW as a formal balance test&lt;/h3>
&lt;p>As a final and more formal balance check, we can use the Augmented Inverse Probability Weighting (AIPW) estimator on &lt;strong>baseline data only&lt;/strong>. If randomization was successful, the estimated &amp;ldquo;treatment effect&amp;rdquo; at baseline should be zero &amp;mdash; since the program had not yet been implemented, there should be no difference between groups.&lt;/p>
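&lt;p>For reference, a sketch of the AIPW point estimate: it combines an outcome model $m_z(X)$ for each treatment arm $z$ with a propensity score $e(X)$, so that the estimate is consistent if either model is correctly specified (hence &amp;ldquo;doubly robust&amp;rdquo;):&lt;/p>
&lt;p>$$\widehat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left[ m_1(X_i) - m_0(X_i) + \frac{T_i\,(Y_i - m_1(X_i))}{e(X_i)} - \frac{(1-T_i)\,(Y_i - m_0(X_i))}{1 - e(X_i)} \right]$$&lt;/p>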
&lt;pre>&lt;code class="language-stata">preserve
keep if post==0
teffects aipw (y age edu i.female i.poverty) (treat age edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Tip:&lt;/strong> The &lt;code>preserve&lt;/code> command saves a snapshot of the current data. After the balance analysis, use &lt;code>restore&lt;/code> to return to the full dataset. The companion do-file handles this automatically.&lt;/p>
&lt;/blockquote>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | -.0244086 .018861 -1.29 0.196 -.0613754 .0125582
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.02792 .0138363 724.75 0.000 10.0008 10.05504
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW-estimated &amp;ldquo;ATE&amp;rdquo; at baseline is -0.024 with a p-value of 0.196 &amp;mdash; not statistically significant. This confirms that there is no detectable pre-treatment difference between the groups after adjusting for covariates. The treatment and control groups were statistically comparable before the program began.&lt;/p>
&lt;p>Now we run the diagnostic checks for the AIPW model.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance overid
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overidentification test for covariate balance
H0: Covariates are balanced
chi2(5) = 3.216
Prob &amp;gt; chi2 = 0.6670
&lt;/code>&lt;/pre>
&lt;p>The overidentification test fails to reject the null hypothesis of covariate balance (p = 0.667). There is no statistical evidence of residual imbalance after weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance summarize
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> |Standardized differences Variance ratio
| Raw Weighted Raw Weighted
----------------+------------------------------------------------
age | -.0417918 .0002505 .9318894 .9446877
edu | .0519015 -6.96e-06 1.071677 1.078214
female |
1 | .0929611 6.51e-06 .9970775 .9999996
poverty |
1 | .0226764 .0002864 1.018475 1.000233
&lt;/code>&lt;/pre>
&lt;p>The balance summary shows the raw standardized difference for &lt;code>female&lt;/code> at 0.093, consistent with our earlier findings. After weighting, every standardized difference shrinks to below 0.001 &amp;mdash; excellent balance. The variance ratios are all close to 1.0, indicating similar spread across the two groups.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance density y
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_density_y.png" alt="Density plot showing the distribution of log consumption for treatment and control groups, before and after AIPW weighting. The weighted distributions overlap almost perfectly.">&lt;/p>
&lt;p>The density plot confirms that after AIPW weighting, the distributions of log consumption in the treatment and control groups overlap almost perfectly. Any small pre-existing differences in the outcome variable have been eliminated by the weighting scheme.&lt;/p>
&lt;pre>&lt;code class="language-stata">teffects overlap
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_overlap_baseline.png" alt="Overlap plot showing kernel densities of estimated propensity scores for treatment and control groups. Both distributions span approximately 0.43 to 0.55 with substantial overlap.">&lt;/p>
&lt;p>The overlap plot shows that propensity scores for both groups are concentrated between approximately 0.43 and 0.55 &amp;mdash; well within the range where matching and weighting are feasible. There are no extreme propensity scores near 0 or 1, confirming that the common support condition is satisfied. This is expected in a well-designed RCT where treatment probability is approximately 0.50 for all households.&lt;/p>
&lt;pre>&lt;code class="language-stata">restore
&lt;/code>&lt;/pre>
&lt;p>This AIPW-based balance analysis also serves a pedagogical purpose: it introduces the concept of &lt;strong>doubly robust&lt;/strong> estimation before we use it for treatment effect estimation in Section 8.&lt;/p>
&lt;hr>
&lt;h2 id="6-what-are-we-estimating-ate-vs-att">6. What are we estimating? ATE vs. ATT&lt;/h2>
&lt;p>Before diving into estimation, we need to be precise about &lt;strong>what&lt;/strong> we are trying to estimate. There are two fundamental causal quantities in program evaluation.&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect (ATE)&lt;/strong> answers the policymaker&amp;rsquo;s question: &lt;em>&amp;ldquo;What would happen if we scaled this program to the entire population?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATE = E[Y(1) - Y(0)]$$&lt;/p>
&lt;p>where $Y(1)$ is the potential outcome under treatment and $Y(0)$ is the potential outcome under control, averaged over the &lt;strong>entire population&lt;/strong> (both treated and untreated).&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> answers the evaluator&amp;rsquo;s question: &lt;em>&amp;ldquo;Did the program benefit those who were assigned to it?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATT = E[Y(1) - Y(0) \mid T = 1]$$&lt;/p>
&lt;p>This averages the treatment effect only over the &lt;strong>treated group&lt;/strong> &amp;mdash; the households that were assigned to receive the cash transfer.&lt;/p>
&lt;p>In a well-designed RCT with &lt;strong>homogeneous treatment effects&lt;/strong> (the program affects everyone equally), ATE and ATT are the same. But when treatment effects are &lt;strong>heterogeneous&lt;/strong> (the program benefits some households more than others), they can differ. For example, if poorer households benefit more from cash transfers and the treatment group has a higher share of poor households, the ATT could be larger than the ATE.&lt;/p>
&lt;p>Understanding this distinction is critical because different methods target different estimands. Cross-sectional methods (RA, IPW, DR) can estimate &lt;strong>either&lt;/strong> ATE or ATT. Difference-in-differences inherently estimates the &lt;strong>ATT only&lt;/strong>. We will return to this point in Section 9.&lt;/p>
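&lt;p>A small simulation makes the distinction concrete. In the hypothetical numbers below (Python, purely illustrative), poor households benefit more and are over-represented among the treated, so the ATT exceeds the ATE:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical population in which poor households benefit more
poor = rng.binomial(1, 0.4, n)
effect = np.where(poor == 1, 0.20, 0.05)   # individual treatment effects

# Suppose assignment over-represents poor households among the treated
p_treat = np.where(poor == 1, 0.7, 0.3)
treat = rng.binomial(1, p_treat)

ate = effect.mean()              # over everyone: 0.4*0.20 + 0.6*0.05 = 0.11
att = effect[treat == 1].mean()  # over the treated only: about 0.14
print(round(ate, 3), round(att, 3))
```

&lt;p>Under pure 50/50 randomization the two lines would agree; the gap here is driven entirely by the unequal assignment probabilities.&lt;/p>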
&lt;blockquote>
&lt;p>&lt;strong>Note on RCTs&lt;/strong> &amp;mdash; In a randomized experiment, treatment assignment is independent of potential outcomes. This means that simple comparisons between treatment and control groups are already unbiased estimates of the ATE. When we add covariates (regression adjustment, IPW, doubly robust), we are not removing bias &amp;mdash; we are &lt;strong>improving precision&lt;/strong> by accounting for residual variation. This is different from observational studies, where covariate adjustment is needed to address confounding.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="7-three-strategies-for-causal-estimation">7. Three strategies for causal estimation&lt;/h2>
&lt;p>We now understand &lt;em>what&lt;/em> we want to estimate (ATE and ATT from Section 6). The question becomes &lt;em>how&lt;/em> to estimate it. Three families of methods exist, each taking a fundamentally different approach to solving the missing-data problem at the heart of causal inference. Each method models a different part of the data-generating process, and understanding these differences is essential for interpreting results and choosing the right tool.&lt;/p>
&lt;h3 id="71-regression-adjustment-ra-----modeling-the-outcome">7.1 Regression Adjustment (RA) &amp;mdash; modeling the outcome&lt;/h3>
&lt;p>Regression adjustment solves the missing-data problem by &lt;strong>predicting the unobserved potential outcomes&lt;/strong>. It fits separate regression models for treated and untreated groups. For each household, it uses these models to predict two potential outcomes: what consumption would be if treated, $\hat{\mu}_1(X_i)$, and what consumption would be if untreated, $\hat{\mu}_0(X_i)$. Since we only observe one of these for each household, the model fills in the missing counterfactual. The treatment effect for each household is the difference between the two predictions, and the ATE is the average across all households.&lt;/p>
&lt;p>The Stata documentation describes this succinctly: &lt;em>&amp;ldquo;RA estimators use means of predicted outcomes for each treatment level to estimate each POM. ATEs and ATETs are differences in estimated POMs.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; predicting exam scores.&lt;/strong> Imagine two study methods (A and B) being tested on students. You observe each student using only one method. RA fits a model predicting test scores based on student characteristics (prior GPA, hours studied) separately for method-A and method-B users. Then, for &lt;em>every&lt;/em> student, it predicts what their score would have been under &lt;em>both&lt;/em> methods &amp;mdash; even the one they did not use. The average difference in predicted scores is the treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Each household observed&amp;lt;br/&amp;gt;under ONE treatment only&amp;quot;]
M0[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using control group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
M1[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using treated group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
P0[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₀&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
P1[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₁&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Average of&amp;lt;br/&amp;gt;(Ŷ₁ − Ŷ₀)&amp;quot;]
DATA --&amp;gt; M0
DATA --&amp;gt; M1
M0 --&amp;gt; P0
M1 --&amp;gt; P1
P0 --&amp;gt; ATE
P1 --&amp;gt; ATE
style DATA fill:#141413,stroke:#6a9bcc,color:#fff
style M0 fill:#6a9bcc,stroke:#141413,color:#fff
style M1 fill:#6a9bcc,stroke:#141413,color:#fff
style P0 fill:#6a9bcc,stroke:#141413,color:#fff
style P1 fill:#6a9bcc,stroke:#141413,color:#fff
style ATE fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The RA estimator.&lt;/strong> Formally, the ATE under regression adjustment is:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $\hat{\mu}_1(X)$ is the predicted outcome under treatment (fitted from treated observations) and $\hat{\mu}_0(X)$ is the predicted outcome under control (fitted from untreated observations), both evaluated at each household&amp;rsquo;s covariates $X_i$. In plain language: for each household, the model predicts what their consumption would be if they received the cash transfer and what it would be if they did not. The difference is the household&amp;rsquo;s estimated treatment effect. Averaging these across all $N$ households gives the ATE.&lt;/p>
&lt;p>For the ATT, we restrict the average to treated units only:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATT} = \frac{1}{N_1} \sum_{i: T_i = 1} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $N_1$ is the number of treated households.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Consider Household A: a 40-year-old female in poverty with 10 years of education. The treated outcome model predicts her consumption at 10.17 log points. The untreated outcome model predicts 10.05. Her estimated individual treatment effect is $10.17 - 10.05 = 0.12$. Averaging such predictions over all 2,000 endline households gives the ATE.&lt;/p>
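&lt;p>The RA recipe &amp;mdash; fit per-arm models, predict both potential outcomes for everyone, average the difference &amp;mdash; takes only a few lines to implement. The following Python sketch uses simulated data with an assumed true effect of 0.12 (for illustration only; the tutorial itself works in Stata):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Simulated endline data (illustrative, not the tutorial's dataset):
# a binary covariate, random assignment, and a true effect of 0.12
x = rng.binomial(1, 0.5, n).astype(float)
t = rng.binomial(1, 0.5, n)
y = 10.0 + 0.3 * x + 0.12 * t + rng.normal(0, 0.4, n)

def fit_ols(xv, yv):
    """OLS of yv on a constant and xv; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0]

# Fit the outcome model separately in each arm ...
b1 = fit_ols(x[t == 1], y[t == 1])
b0 = fit_ols(x[t == 0], y[t == 0])
# ... then predict BOTH potential outcomes for every household
mu1 = b1[0] + b1[1] * x
mu0 = b0[0] + b0[1] * x

ate_ra = (mu1 - mu0).mean()
print(round(ate_ra, 3))   # close to the true effect of 0.12
```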
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ra&lt;/code> command fits linear outcome models by default. The first parenthesis specifies the outcome model (outcome variable + covariates), and the second specifies the treatment variable: &lt;code>teffects ra (y c.age c.edu i.female i.poverty) (treat), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; model misspecification.&lt;/strong> RA&amp;rsquo;s Achilles heel is that it relies entirely on the outcome model being correctly specified. If consumption depends on age nonlinearly (for example, a U-shaped relationship), but we assume a linear model, the predictions $\hat{\mu}_1$ and $\hat{\mu}_0$ will be systematically wrong, biasing the ATE. As the Stata manual notes, RA works well when the outcome model is correct, but &amp;ldquo;relying on a correctly specified outcome model with little data is extremely risky.&amp;rdquo; RA gives the right answer &lt;strong>only if the outcome model is correct&lt;/strong>. If it is wrong, the ATE estimate can be biased even with infinite data.&lt;/p>
&lt;p>What if we are unsure about the functional form of the outcome model? Is there an approach that avoids modeling the outcome entirely?&lt;/p>
&lt;h3 id="72-inverse-probability-weighting-ipw-----modeling-the-treatment-assignment">7.2 Inverse Probability Weighting (IPW) &amp;mdash; modeling the treatment assignment&lt;/h3>
&lt;p>IPW takes the opposite approach. Instead of modeling consumption, it models the probability of being assigned to treatment &amp;mdash; the &lt;strong>propensity score&lt;/strong>, defined as $p(X) = \Pr(T = 1 \mid X)$. It then reweights observations so that the treatment and control groups become comparable. The Stata documentation explains: &lt;em>&amp;ldquo;IPW estimators use weighted averages of the observed outcome variable to estimate means of the potential outcomes. The weights account for the missing data inherent in the potential-outcome framework.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>The logic is elegant: in a perfectly randomized experiment, every household has the same 50% chance of treatment, and a simple comparison of means is unbiased. When chance imbalances arise (like our 9.3% gender SMD), the estimated propensity scores deviate slightly from 0.50. IPW corrects for these imbalances by making the reweighted sample look as if randomization had been perfect &amp;mdash; without ever modeling the outcome.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; opinion polling.&lt;/strong> Election pollsters know their survey overrepresents some demographics. If 60% of respondents are college graduates but only 35% of voters are, pollsters give lower weight to each college graduate&amp;rsquo;s response and higher weight to non-graduates. IPW does the same thing for treatment groups &amp;mdash; it reweights households so the treated and control groups have the same covariate distribution.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Treatment and control groups&amp;lt;br/&amp;gt;may have imbalances&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Estimate propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;p(X) = Pr(T=1 | X)&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;via logistic regression&amp;lt;/i&amp;gt;&amp;quot;]
WT[&amp;quot;&amp;lt;b&amp;gt;Compute weights&amp;lt;/b&amp;gt;&amp;quot;]
WTR[&amp;quot;Treated: weight = 1/p(X)&amp;quot;]
WCT[&amp;quot;Control: weight = 1/(1−p(X))&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Weighted mean(treated)&amp;lt;br/&amp;gt;− Weighted mean(control)&amp;quot;]
DATA --&amp;gt; PS
PS --&amp;gt; WT
WT --&amp;gt; WTR
WT --&amp;gt; WCT
WTR --&amp;gt; ATE
WCT --&amp;gt; ATE
style DATA fill:#141413,stroke:#d97757,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style WT fill:#d97757,stroke:#141413,color:#fff
style WTR fill:#d97757,stroke:#141413,color:#fff
style WCT fill:#d97757,stroke:#141413,color:#fff
style ATE fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The propensity score.&lt;/strong> The propensity score is estimated via logistic regression:&lt;/p>
&lt;p>$$\hat{p}(X_i) = \Pr(T_i = 1 \mid X_i) = \text{logit}^{-1}(\hat{\alpha} + \hat{\beta}' X_i)$$&lt;/p>
&lt;p>In plain language: we fit a logistic model predicting whether each household was assigned to treatment, based on their covariates (age, education, gender, poverty status). The predicted probability is their propensity score.&lt;/p>
&lt;p>&lt;strong>The IPW estimator.&lt;/strong> The ATE under IPW is:&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i \cdot Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>Each treated household&amp;rsquo;s outcome is divided by its probability of being treated &amp;mdash; this upweights treated households that &amp;ldquo;look like&amp;rdquo; control households (the Stata manual calls this placing &amp;ldquo;a larger weight on those observations for which $y_{1i}$ is observed even though its observation was not likely&amp;rdquo;). Each control household&amp;rsquo;s outcome is divided by its probability of being in the control group. The reweighting creates a pseudo-population where treatment assignment is independent of covariates.&lt;/p>
&lt;p>For the ATT, only the control group needs reweighting (because the treated group is already the reference population):&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATT} = \frac{1}{N_1} \sum_{i=1}^{N} \left[ T_i \cdot Y_i - \frac{(1 - T_i) \cdot \hat{p}(X_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> In our RCT, a female household in poverty might have $\hat{p}(X) = 0.52$ (slightly more likely to be treated due to the gender imbalance). If treated, her weight is $1/0.52 = 1.92$. If in the control group, her weight is $1/(1 - 0.52) = 2.08$. A male non-poor household might have $\hat{p}(X) = 0.49$, giving weights close to 2.0 in either group. These mild adjustments rebalance the groups to remove the chance gender imbalance.&lt;/p>
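&lt;p>The IPW formula is easy to implement directly. The sketch below (Python, with a simulated single binary covariate and an assumed true effect of 0.12 &amp;mdash; not the tutorial&amp;rsquo;s dataset) estimates the propensity score cell by cell as a stand-in for the logit fit and shows the reweighting undoing a chance imbalance:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Illustrative data: a chance imbalance makes female households
# slightly more likely to land in the treated group
female = rng.binomial(1, 0.5, n)
p_true = np.where(female == 1, 0.54, 0.46)
t = rng.binomial(1, p_true)
y = 10.0 + 0.2 * female + 0.12 * t + rng.normal(0, 0.4, n)  # true effect 0.12

# Estimate the propensity score cell by cell (a stand-in for logit
# when there is a single binary covariate)
p_hat = np.where(female == 1, t[female == 1].mean(), t[female == 0].mean())

# IPW estimator of the ATE
ate_ipw = (t * y / p_hat - (1 - t) * y / (1 - p_hat)).mean()

# Naive difference in means, contaminated by the imbalance
naive = y[t == 1].mean() - y[t == 0].mean()
print(round(ate_ipw, 3), round(naive, 3))
```

&lt;p>The naive contrast is pulled upward because the treated group has more female households (who consume more on average); the reweighted estimate removes that contamination without touching the outcome model.&lt;/p>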
&lt;p>&lt;strong>Why IPW matters even in RCTs.&lt;/strong> In a perfect RCT, the true propensity score is exactly 0.50 for everyone, and IPW does nothing. But finite samples produce chance imbalances. IPW uses the estimated propensity scores (which deviate slightly from 0.50) to correct for these imbalances without making any assumptions about how covariates affect the outcome.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ipw&lt;/code> command fits a logistic treatment model by default. Note that the first parenthesis specifies only the outcome variable (no covariates &amp;mdash; IPW does not model the outcome), and the second specifies the treatment model: &lt;code>teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; extreme weights.&lt;/strong> IPW&amp;rsquo;s vulnerability is extreme propensity scores. If $\hat{p}(X) = 0.01$ for some household, the weight becomes $1/0.01 = 100$ &amp;mdash; that single household dominates the ATE estimate, causing high variance and instability. The Stata manual warns: &lt;em>&amp;ldquo;When propensity scores are extreme (near 0 or 1), the inverse weights become very large, producing unstable estimates.&amp;rdquo;&lt;/em> This happens when the treatment and control groups have poor &lt;strong>overlap&lt;/strong> &amp;mdash; some covariate combinations appear only in one group. In our well-designed RCT, all propensity scores are between 0.43 and 0.55 (we verified this in Section 5.4), so extreme weights are not a concern.&lt;/p>
&lt;p>RA works well if the outcome model is correct but can be biased if it is wrong. IPW works well if the propensity score model is correct but can be unstable if it is wrong. Is there a method that protects us against both types of misspecification?&lt;/p>
&lt;h3 id="73-doubly-robust-dr-----modeling-both">7.3 Doubly Robust (DR) &amp;mdash; modeling both&lt;/h3>
&lt;p>Doubly robust methods combine RA and IPW into a single estimator. They fit an outcome model &lt;strong>and&lt;/strong> estimate a propensity score. The key property &amp;mdash; the reason they are called &amp;ldquo;doubly robust&amp;rdquo; &amp;mdash; is that the estimator is consistent (converges to the true treatment effect with enough data) if &lt;strong>either&lt;/strong> the outcome model &lt;strong>or&lt;/strong> the propensity score model is correctly specified. You do not need both to be right &amp;mdash; just one.&lt;/p>
&lt;p>The Stata manual describes this property: &lt;em>&amp;ldquo;AIPW estimators model both the outcome and the treatment probability. A surprising fact is that only one of the two models must be correctly specified to consistently estimate the treatment effects.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; backup power.&lt;/strong> Think of a house with two independent power sources: the electrical grid (the outcome model) and a solar panel system (the propensity score model). If the grid goes down (outcome model is misspecified), solar power keeps the lights on. If clouds block the solar panels (propensity score model is wrong), the grid still works. As long as at least one power source is functioning, the house stays lit. That is doubly robust estimation &amp;mdash; as long as at least one model is correct, the estimator gives the right answer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;quot;]
RA_C[&amp;quot;&amp;lt;b&amp;gt;RA component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict Ŷ₁ and Ŷ₀&amp;lt;br/&amp;gt;for each household&amp;quot;]
IPW_C[&amp;quot;&amp;lt;b&amp;gt;IPW component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate propensity&amp;lt;br/&amp;gt;score p(X)&amp;quot;]
RESID[&amp;quot;&amp;lt;b&amp;gt;Prediction errors&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Y − Ŷ for each&amp;lt;br/&amp;gt;household&amp;quot;]
CORRECT[&amp;quot;&amp;lt;b&amp;gt;Bias-correction term&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;IPW-weighted residuals&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;DR estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;= RA prediction&amp;lt;br/&amp;gt;+ Bias correction&amp;quot;]
DATA --&amp;gt; RA_C
DATA --&amp;gt; IPW_C
RA_C --&amp;gt; RESID
IPW_C --&amp;gt; CORRECT
RESID --&amp;gt; CORRECT
RA_C --&amp;gt; DR
CORRECT --&amp;gt; DR
style DATA fill:#141413,stroke:#00d4c8,color:#fff
style RA_C fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_C fill:#d97757,stroke:#141413,color:#fff
style RESID fill:#6a9bcc,stroke:#141413,color:#fff
style CORRECT fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The AIPW estimator.&lt;/strong> The most common doubly robust form is Augmented Inverse Probability Weighting (AIPW):&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>This equation has two clearly interpretable components:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RA component&lt;/strong> (first two terms): $\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$ &amp;mdash; the regression adjustment prediction, exactly as in Section 7.1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bias-correction component&lt;/strong> (last two terms): IPW-weighted residuals $(Y_i - \hat{\mu})$ &amp;mdash; the difference between actual and predicted outcomes, weighted by inverse propensity scores&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In plain language: start with the RA prediction of each household&amp;rsquo;s treatment effect. Then ask: how far off was that prediction from reality? Weight those prediction errors by the propensity score. If RA was already right, the errors average to zero and you just get RA. If RA was wrong but IPW is right, the weighted errors exactly cancel the RA bias.&lt;/p>
&lt;p>&lt;strong>Why the magic works &amp;mdash; four scenarios.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Outcome model correct, propensity model wrong:&lt;/strong> The residuals $(Y_i - \hat{\mu})$ are zero on average, so the correction terms vanish. DR reduces to RA. Correct answer.&lt;/li>
&lt;li>&lt;strong>Propensity model correct, outcome model wrong:&lt;/strong> The IPW reweighting is valid, so the correction terms fix the RA bias. Correct answer.&lt;/li>
&lt;li>&lt;strong>Both models correct:&lt;/strong> Both components work together, producing the most efficient estimate.&lt;/li>
&lt;li>&lt;strong>Both models wrong:&lt;/strong> Neither safety net catches the error. The estimate can be biased. DR provides insurance, not invincibility.&lt;/li>
&lt;/ol>
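&lt;p>The first two scenarios can be checked numerically. The following Python sketch (illustrative simulated data, not the tutorial&amp;rsquo;s dataset) pairs a deliberately misspecified outcome model with the correct propensity score: plain RA is badly biased, while the AIPW correction recovers the assumed true effect of 0.12.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Illustrative data: treatment probability depends on x (a true logit model),
# the outcome also depends on x, and the true treatment effect is 0.12
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-x))              # the TRUE propensity score
t = rng.binomial(1, p)
y = 10.0 + 0.5 * x + 0.12 * t + rng.normal(0, 0.3, n)

# Deliberately WRONG outcome models: intercept-only means per arm
mu1 = np.full(n, y[t == 1].mean())
mu0 = np.full(n, y[t == 0].mean())

# RA with the wrong outcome model is just the biased difference in means
ate_ra = (mu1 - mu0).mean()

# AIPW with the correct propensity score: the weighted residuals
# cancel the RA bias (scenario 2)
ate_aipw = (mu1 - mu0
            + t * (y - mu1) / p
            - (1 - t) * (y - mu0) / (1 - p)).mean()
print(round(ate_ra, 3), round(ate_aipw, 3))
```

&lt;p>Swapping in a correct outcome model and a wrong propensity model (scenario 1) would likewise leave the AIPW estimate on target.&lt;/p>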
&lt;p>&lt;strong>AIPW vs. IPWRA in Stata.&lt;/strong> Stata offers two doubly robust commands. &lt;code>teffects aipw&lt;/code> augments the IPW estimator with an outcome-model correction (the equation above). &lt;code>teffects ipwra&lt;/code> applies propensity score weights to the regression adjustment &amp;mdash; arriving at the same property from the other direction. Both are doubly robust and produce nearly identical results in practice.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> Both commands require specifying the outcome model in the first parenthesis and the treatment model in the second: &lt;code>teffects ipwra (y c.age c.edu i.female i.poverty) (treat c.age c.edu i.female i.poverty), vce(robust)&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong.&lt;/strong> DR fails only when &lt;strong>both&lt;/strong> models are wrong. Getting at least one of two models approximately right is much easier than getting a single specified model exactly right, so this failure mode is comparatively unlikely. Still, the Stata manual notes: &lt;em>&amp;ldquo;When both the outcome and the treatment model are misspecified, which estimator is more robust is a matter of debate.&amp;rdquo;&lt;/em> Using flexible specifications (polynomials, interactions) further reduces the risk of both models failing simultaneously.&lt;/p>
&lt;h3 id="comparison-of-the-three-approaches">Comparison of the three approaches&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>RA&lt;/th>
&lt;th>IPW&lt;/th>
&lt;th>DR (AIPW/IPWRA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Models the outcome?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Models the treatment?&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Key equation&lt;/td>
&lt;td>$\hat{\mu}_1(X) - \hat{\mu}_0(X)$&lt;/td>
&lt;td>$T \cdot Y / \hat{p}(X)$&lt;/td>
&lt;td>RA + IPW-weighted residuals&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if outcome model correct?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if treatment model correct?&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Main vulnerability&lt;/td>
&lt;td>Outcome misspecification&lt;/td>
&lt;td>Extreme weights&lt;/td>
&lt;td>Both models wrong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stata command&lt;/td>
&lt;td>&lt;code>teffects ra&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipw&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipwra&lt;/code> / &lt;code>teffects aipw&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-mermaid">graph LR
RA[&amp;quot;&amp;lt;b&amp;gt;Regression Adjustment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the outcome&amp;quot;]
IPW[&amp;quot;&amp;lt;b&amp;gt;Inverse Probability&amp;lt;br/&amp;gt;Weighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the treatment&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;Doubly Robust&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models both&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Consistent if either&amp;lt;br/&amp;gt;model is correct&amp;lt;/i&amp;gt;&amp;quot;]
RA --&amp;gt; DR
IPW --&amp;gt; DR
style RA fill:#6a9bcc,stroke:#141413,color:#fff
style IPW fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimator combines the strengths of both RA and IPW. It is the &lt;strong>standard recommendation in modern causal inference&lt;/strong> because it provides an extra layer of protection against model misspecification. Now that we understand what each method does, what it assumes, and what can go wrong, let us apply all three to our cash transfer data and compare their results.&lt;/p>
&lt;hr>
&lt;h2 id="8-cross-sectional-estimation-at-endline-----ra-ipw-and-dr">8. Cross-sectional estimation at endline &amp;mdash; RA, IPW, and DR&lt;/h2>
&lt;p>We now estimate treatment effects using only endline data. For each method, we compute both the &lt;strong>ATE&lt;/strong> (the policymaker&amp;rsquo;s quantity) and the &lt;strong>ATT&lt;/strong> (the evaluator&amp;rsquo;s quantity).&lt;/p>
&lt;h3 id="81-simple-difference-in-means">8.1 Simple difference in means&lt;/h3>
&lt;p>The simplest approach is to compare mean outcomes between treated and control groups at endline.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
reg y treat, robust
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 2,000
F(1, 1998) = 35.43
Prob &amp;gt; F = 0.0000
R-squared = 0.0174
Root MSE = .43449
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
treat | .1157465 .0194443 5.95 0.000 .0776132 .1538798
_cons | 10.05374 .014001 718.07 0.000 10.02628 10.0812
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The simple difference in means yields an estimate of 0.116 (SE = 0.019, p &amp;lt; 0.001, 95% CI [0.078, 0.154]). Because the outcome is in logs, this means being offered the cash transfer increased household consumption by approximately 11.6%. This estimate is close to the true effect of 12% and is our benchmark for comparison. However, it does not adjust for the gender imbalance we discovered at baseline.&lt;/p>
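&lt;p>Because $y$ is in logs, the coefficient is a change in log points, and the exact percent effect is $100 \times (e^{\beta} - 1)$. A quick numerical check (in Python rather than Stata) shows that the log-point approximation slightly understates the exact percent change:&lt;/p>

```python
import math

# Coefficient on treat from the endline regression above
b = 0.1157465

approx_pct = 100 * b                   # log-point approximation: ~11.6%
exact_pct = 100 * (math.exp(b) - 1)    # exact percent change: ~12.3%
print(round(approx_pct, 1), round(exact_pct, 1))
```

&lt;p>The gap between the two is small here because the coefficient is modest; it widens as coefficients grow.&lt;/p>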
&lt;h3 id="82-regression-adjustment-----ate-and-att">8.2 Regression Adjustment &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>Regression adjustment models the outcome as a function of treatment and covariates, then computes predicted outcomes under treatment and control for each observation.&lt;/p>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect (ATE)
teffects ra (y c.age c.edu i.female i.poverty) (treat), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1125431 .0190927 5.89 0.000 .0751221 .1499641
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05503 .0138703 724.93 0.000 10.02785 10.08222
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect on the Treated (ATT)
teffects ra (y c.age c.edu i.female i.poverty) (treat), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1132537 .0191498 5.91 0.000 .0757208 .1507865
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05623 .0140082 717.88 0.000 10.02878 10.08369
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The RA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). The ATE and ATT are nearly identical, consistent with treatment effects that are approximately &lt;strong>homogeneous&lt;/strong> across households. The RA approach models the outcome with covariates (age, education, gender, poverty), which adjusts for the baseline gender imbalance and can improve precision.&lt;/p>
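&lt;p>To make the mechanics of regression adjustment concrete, here is a minimal Python sketch on simulated data. The variables, coefficients, and sample are illustrative assumptions, not the tutorial&amp;rsquo;s dataset; the three steps mirror the logic behind &lt;code>teffects ra&lt;/code>.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Illustrative simulated data (not the tutorial's dataset)
age = rng.normal(40, 10, n)
female = rng.integers(0, 2, n).astype(float)
treat = rng.integers(0, 2, n).astype(float)   # randomized assignment
y = 10 + 0.002 * age + 0.05 * female + 0.12 * treat + rng.normal(0, 0.4, n)

# Step 1: fit a linear outcome model y ~ treat + age + female
X = np.column_stack([np.ones(n), treat, age, female])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: predict each unit's outcome under treatment and under control
X1 = X.copy(); X1[:, 1] = 1.0
X0 = X.copy(); X0[:, 1] = 0.0
po_diff = X1 @ beta - X0 @ beta

# Step 3: average the predicted differences over everyone (ATE),
# or over the treated subsample only (ATT)
ate = po_diff.mean()
att = po_diff[treat == 1].mean()
```

&lt;p>Because this sketch has no treatment-by-covariate interactions, the predicted difference is the same for every household and ATE = ATT by construction; &lt;code>teffects ra&lt;/code> fits separate outcome models by treatment group, so its ATE and ATT can differ slightly.&lt;/p>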
&lt;h3 id="83-inverse-probability-weighting-----ate-and-att">8.3 Inverse Probability Weighting &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>IPW reweights observations based on their estimated probability of treatment, without modeling the outcome.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect (ATE)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126713 .0190886 5.90 0.000 .0752583 .1500844
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05495 .0138651 725.20 0.000 10.02778 10.08213
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect on the Treated (ATT)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1134031 .0191397 5.93 0.000 .0758899 .1509162
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05608 .0140004 718.27 0.000 10.02864 10.08352
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The IPW estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA results, which is expected in a well-designed RCT where propensity scores are near 0.50 for all households. Notice that IPW does &lt;strong>not&lt;/strong> model the outcome &amp;mdash; it only models the treatment assignment process using the propensity score. The close agreement between RA and IPW gives us confidence that both the outcome model and the treatment model are approximately correct.&lt;/p>
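&lt;p>The reweighting logic of IPW can be sketched in a few lines of Python. This illustration (hypothetical variables, not the tutorial&amp;rsquo;s data) uses a saturated propensity model, estimating the treatment probability within each gender cell, and the normalized (Hajek) IPW estimator:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Illustrative data: treatment probability differs by gender,
# and gender also shifts the outcome (mild confounding)
female = rng.integers(0, 2, n).astype(float)
p_true = np.where(female == 1, 0.6, 0.4)
treat = rng.binomial(1, p_true).astype(float)
y = 10 + 0.05 * female + 0.12 * treat + rng.normal(0, 0.4, n)

# Saturated propensity model: estimate P(treat = 1 | female) by cell means
p1 = treat[female == 1].mean()
p0 = treat[female == 0].mean()
p_hat = np.where(female == 1, p1, p0)

# Normalized (Hajek) IPW estimate of the ATE: no outcome model is fit;
# only the treatment assignment process is modeled
w1 = treat / p_hat
w0 = (1 - treat) / (1 - p_hat)
ate = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()
```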
&lt;h3 id="84-doubly-robust-----ate-and-att-ipwra">8.4 Doubly Robust &amp;mdash; ATE and ATT (IPWRA)&lt;/h3>
&lt;p>The doubly robust IPWRA estimator combines outcome modeling and propensity score weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect (ATE)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .112639 .0190901 5.90 0.000 .0752231 .1500549
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .0138677 725.07 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect on the Treated (ATT)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), atet vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1133162 .0191469 5.92 0.000 .0757889 .1508435
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05617 .0140019 718.20 0.000 10.02873 10.08361
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust IPWRA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA and IPW estimates, confirming that all three approaches converge in this well-designed RCT. The DR method provides the most reliable cross-sectional estimate because it is protected against misspecification of either the outcome or treatment model.&lt;/p>
&lt;h3 id="85-doubly-robust-----aipw-alternative">8.5 Doubly Robust &amp;mdash; AIPW alternative&lt;/h3>
&lt;p>As a robustness check, we can also compute the doubly robust estimate using the AIPW formulation instead of IPWRA.&lt;/p>
&lt;pre>&lt;code class="language-stata">* AIPW: Average Treatment Effect (ATE)
teffects aipw (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear by ML
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126412 .0190903 5.90 0.000 .075225 .1500574
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .013868 725.05 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW estimate of ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) is virtually identical to the IPWRA result (0.113). Both are doubly robust &amp;mdash; the difference lies in the computational approach (AIPW augments the IPW estimator with a bias-correction term, while IPWRA applies IPW weights to the regression adjustment), but the theoretical properties are the same and, in this application, the estimates agree to three decimal places.&lt;/p>
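&lt;p>The AIPW augmentation term can be written out directly. The Python sketch below uses illustrative simulated data (and plugs in the true propensity score for brevity); it shows how each unit&amp;rsquo;s outcome-model prediction is corrected by an inverse-probability-weighted residual:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-0.5 * x))        # propensity score (true model, for brevity)
d = rng.binomial(1, p).astype(float)
y = 10 + 0.3 * x + 0.12 * d + rng.normal(0, 0.4, n)

def ols(xs, ys):
    # Least-squares fit of ys on [1, xs]
    X = np.column_stack([np.ones(len(xs)), xs])
    b, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return b

# Outcome models fitted separately on treated and control units
b1 = ols(x[d == 1], y[d == 1])
b0 = ols(x[d == 0], y[d == 0])
mu1 = b1[0] + b1[1] * x
mu0 = b0[0] + b0[1] * x

# AIPW: model prediction plus an IPW-weighted bias-correction residual
psi1 = mu1 + d * (y - mu1) / p
psi0 = mu0 + (1 - d) * (y - mu0) / (1 - p)
ate = np.mean(psi1 - psi0)
```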
&lt;h3 id="86-cross-sectional-comparison">8.6 Cross-sectional comparison&lt;/h3>
&lt;p>The table below summarizes all cross-sectional estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several patterns emerge from this comparison. First, &lt;strong>ATE and ATT are nearly identical&lt;/strong> for every method, confirming that treatment effects are homogeneous across households. Second, &lt;strong>RA, IPW, and DR all give remarkably similar results&lt;/strong> (all approximately 0.113) because, in this well-designed RCT, randomization ensures that both the outcome model and the propensity score model are approximately correct. Third, the simple difference in means (0.116) is slightly higher than the covariate-adjusted estimates (0.113); this shift in the point estimate reflects the adjustment for the baseline gender imbalance, while the standard errors are essentially unchanged. Finally, all confidence intervals contain the true effect of 0.12 &amp;mdash; every method successfully recovers the correct answer.&lt;/p>
&lt;p>The real value of doubly robust methods becomes apparent in less ideal settings. When one model might be misspecified &amp;mdash; a common situation in practice &amp;mdash; DR methods provide insurance that RA or IPW alone cannot offer.&lt;/p>
&lt;hr>
&lt;h2 id="9-leveraging-panel-data-----difference-in-differences">9. Leveraging panel data &amp;mdash; Difference-in-Differences&lt;/h2>
&lt;p>All estimates in Section 8 used only endline data. But we have panel data &amp;mdash; the same 2,000 households observed before and after the intervention. Can we do better?&lt;/p>
&lt;h3 id="91-why-use-panel-data">9.1 Why use panel data?&lt;/h3>
&lt;p>Cross-sectional methods (RA, IPW, DR) compare treated and control groups at a single point in time &amp;mdash; the endline. They control for &lt;strong>observable&lt;/strong> covariates like age, education, and gender. But there may be &lt;strong>unobservable&lt;/strong> characteristics &amp;mdash; household motivation, geographic advantages, cultural factors &amp;mdash; that differ between groups and affect consumption. No amount of cross-sectional covariate adjustment can control for these, because we simply do not observe them.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; comparing students across schools.&lt;/strong> Imagine comparing test scores between students at a charter school (treatment) and a traditional school (control). You can adjust for observable differences like family income and prior grades. But what about unmeasured factors &amp;mdash; parental involvement, neighborhood quality, student ambition? A cross-sectional comparison cannot disentangle the school effect from these hidden differences. Now suppose you observe the &lt;em>same students&lt;/em> before and after they switch schools. By comparing each student&amp;rsquo;s score change, you automatically cancel out all fixed student characteristics &amp;mdash; because they are the same at both time points. That is the power of panel data.&lt;/p>
&lt;p>Panel data methods like difference-in-differences (DiD) solve this problem by comparing each household &lt;strong>to itself&lt;/strong> over time. By looking at how each household&amp;rsquo;s consumption changed from baseline to endline, we effectively control for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> (household fixed effects). This is a powerful advantage that cross-sectional methods cannot replicate.&lt;/p>
&lt;h4 id="the-did-estimator">The DiD estimator&lt;/h4>
&lt;p>The DiD estimator computes a simple but powerful quantity &amp;mdash; a &amp;ldquo;difference of differences&amp;rdquo;:&lt;/p>
&lt;p>$$\hat{\tau}_{DiD} = \underbrace{(\bar{Y}_{treat,post} - \bar{Y}_{treat,pre})}_{\text{Change for treated}} - \underbrace{(\bar{Y}_{control,post} - \bar{Y}_{control,pre})}_{\text{Change for control}}$$&lt;/p>
&lt;p>The first difference ($\bar{Y}_{treat,post} - \bar{Y}_{treat,pre}$) captures the treatment group&amp;rsquo;s change over time &amp;mdash; the treatment effect &lt;strong>plus&lt;/strong> any common time trend (e.g., economic growth that affects all households). The second difference ($\bar{Y}_{control,post} - \bar{Y}_{control,pre}$) captures the control group&amp;rsquo;s change &amp;mdash; the common time trend &lt;strong>only&lt;/strong>, since they did not receive treatment. Subtracting the second from the first removes the time trend, isolating the treatment effect.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Suppose the treated group&amp;rsquo;s average log consumption went from 10.01 at baseline to 10.17 at endline (change = +0.16). The control group went from 10.03 to 10.06 (change = +0.03). The DiD estimate is $0.16 - 0.03 = 0.13$ &amp;mdash; close to the true effect of 0.12. The control group&amp;rsquo;s +0.03 change captures the natural time trend that would have affected everyone, and subtracting it isolates the treatment effect.&lt;/p>
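&lt;p>The arithmetic of the mini example is easy to verify; here it is as a short Python check (the four group means are the illustrative values above):&lt;/p>

```python
# Four group means from the mini example (illustrative values)
treat_pre, treat_post = 10.01, 10.17
ctrl_pre, ctrl_post = 10.03, 10.06

# DiD = (change for treated) - (change for control)
did = (treat_post - treat_pre) - (ctrl_post - ctrl_pre)
print(round(did, 2))   # 0.13
```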
&lt;h4 id="the-parallel-trends-assumption">The parallel trends assumption&lt;/h4>
&lt;p>The key identifying assumption of DiD is the &lt;strong>parallel trends assumption (PTA)&lt;/strong>: absent the treatment, the treatment and control groups would have followed the same time trend. Formally:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Notation note&lt;/strong> &amp;mdash; In the DiD literature and in the Sant&amp;rsquo;Anna and Zhao (2020) paper, $D$ denotes treatment group assignment (equivalent to our &lt;code>treat&lt;/code> variable). This differs from our data dictionary where &lt;code>D&lt;/code> is the receipt indicator. In this section and Section 9.4, we follow the paper&amp;rsquo;s convention: $D = 1$ means assigned to treatment, $D = 0$ means assigned to control.&lt;/p>
&lt;/blockquote>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1] = E[Y_1(0) - Y_0(0) \mid D = 0]$$&lt;/p>
&lt;p>This says that the average change in &lt;em>untreated&lt;/em> potential outcomes is the same for the treated and control groups. Note that this does &lt;strong>not&lt;/strong> require the two groups to have the same &lt;em>level&lt;/em> of consumption &amp;mdash; only the same &lt;em>trend&lt;/em>. The treated group can start higher or lower, as long as their consumption would have evolved at the same rate as the control group in the absence of the program.&lt;/p>
&lt;p>In an RCT, the parallel trends assumption is very plausible because randomization ensures the groups were similar at baseline. Any pre-existing differences between groups occurred by chance and are unlikely to produce different time trends. This makes DiD a strong estimator in our setting.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
subgraph &amp;quot;Parallel Trends Assumption&amp;quot;
PRE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;quot;]
POST[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;quot;]
end
PRE --&amp;gt;|&amp;quot;Treated group&amp;lt;br/&amp;gt;change = effect + trend&amp;quot;| POST
PRE --&amp;gt;|&amp;quot;Control group&amp;lt;br/&amp;gt;change = trend only&amp;quot;| POST
style PRE fill:#6a9bcc,stroke:#141413,color:#fff
style POST fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="92-why-does-did-estimate-att-and-not-ate">9.2 Why does DiD estimate ATT and not ATE?&lt;/h3>
&lt;p>This is a point that many beginners miss, so it is worth explaining carefully.&lt;/p>
&lt;p>Recall from Section 6 that the ATT is $E[Y_1(1) - Y_1(0) \mid D = 1]$ &amp;mdash; the effect on those who were treated. Sant&amp;rsquo;Anna and Zhao (2020) make this explicit: the main challenge is computing $E[Y_1(0) \mid D = 1]$ &amp;mdash; what would the treated group&amp;rsquo;s consumption have been at endline &lt;em>without&lt;/em> the program?&lt;/p>
&lt;p>DiD solves this by using the control group&amp;rsquo;s time trend as a stand-in. Specifically, it constructs the counterfactual for the treated group as:&lt;/p>
&lt;p>$$\underbrace{E[Y_1(0) \mid D = 1]}_{\text{Counterfactual}} = \underbrace{E[Y_0 \mid D = 1]}_{\text{Treated at baseline}} + \underbrace{(E[Y_1 \mid D = 0] - E[Y_0 \mid D = 0])}_{\text{Control group&amp;rsquo;s time trend}}$$&lt;/p>
&lt;p>This counterfactual is &lt;strong>specific to the treated group&lt;/strong> &amp;mdash; it starts from their baseline level and adds the control group&amp;rsquo;s trend. DiD therefore estimates what happened to the treated group relative to this counterfactual. This is precisely the ATT.&lt;/p>
&lt;p>&lt;strong>Why not the ATE?&lt;/strong> To estimate the ATE, we would also need the treatment effect for the untreated &amp;mdash; what would happen if we gave the program to those who did not receive it. DiD does not provide this, because the counterfactual it constructs runs in only one direction (control trend applied to treated baseline, not treated trend applied to control baseline).&lt;/p>
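&lt;p>Using the illustrative numbers from the mini example in Section 9.1, the counterfactual construction can be traced step by step:&lt;/p>

```python
# Counterfactual: treated group's baseline mean plus the control group's trend
treated_baseline = 10.01
control_trend = 10.06 - 10.03        # control group's change = +0.03
counterfactual = treated_baseline + control_trend

# ATT = observed treated endline mean minus the counterfactual
att = 10.17 - counterfactual
print(round(att, 2))   # 0.13
```

&lt;p>This counterfactual (10.04) is built for the treated group only, which is why the resulting estimate is the ATT rather than the ATE.&lt;/p>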
&lt;p>&lt;strong>In our RCT context&lt;/strong>, since treatment was randomly assigned, ATE and ATT are likely very similar (as we saw in Section 8). But in observational studies with heterogeneous treatment effects, this distinction matters greatly. A job-training program might have a larger effect on those who voluntarily enrolled (ATT) than it would have on randomly selected workers (ATE).&lt;/p>
&lt;h3 id="93-basic-did-with-panel-fixed-effects">9.3 Basic DiD with panel fixed effects&lt;/h3>
&lt;p>We now implement the basic DiD estimator using Stata&amp;rsquo;s &lt;code>xtdidregress&lt;/code> command, which handles the panel structure and computes clustered standard errors.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
* Create the treatment-post interaction
gen treat_post = treat * post
label var treat_post &amp;quot;Treated x Post (1 only for treated in 2024)&amp;quot;
* Declare panel structure
xtset id year
* Basic DiD with individual fixed effects
xtdidregress (y) (treat_post), group(id) time(year) vce(cluster id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Number of obs = 4,000
Number of groups = 2,000
Outcome model : linear
Treatment model: none
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat_post | .1347161 .0272737 4.94 0.000 .0812282 .188204
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The basic DiD estimate of the ATT is 0.135 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.081, 0.188]). This is slightly higher than the cross-sectional estimates (0.113&amp;ndash;0.116), but its confidence interval still contains the true effect of 0.12. The larger standard error (0.027 vs. 0.019) reflects the additional variability introduced by differencing within households. Standard errors are clustered at the household level to account for serial correlation within panels.&lt;/p>
&lt;p>The key advantage of this DiD estimate is that it controls for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> of each household. In an RCT, randomization already handles confounding, so the cross-sectional and panel estimates are similar. But in observational settings, DiD&amp;rsquo;s ability to absorb household fixed effects can correct biases that cross-sectional methods cannot.&lt;/p>
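&lt;p>In the two-group, two-period case, the fixed-effects DiD coefficient coincides with a pooled regression on a treatment-by-post interaction, which in turn reproduces the four-means difference of differences. A Python sketch on simulated two-period data (illustrative, not the tutorial&amp;rsquo;s dataset) checks this equivalence:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
treat = rng.integers(0, 2, n).astype(float)
unit_fe = rng.normal(0, 0.3, n)          # time-invariant household heterogeneity
y_pre = 10.00 + unit_fe + rng.normal(0, 0.2, n)
y_post = 10.03 + unit_fe + 0.12 * treat + rng.normal(0, 0.2, n)

# Pooled OLS: y ~ treat + post + treat*post; the interaction is the ATT
y = np.concatenate([y_pre, y_post])
t = np.concatenate([treat, treat])
post = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), t, post, t * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
att_ols = beta[3]

# Identical (up to floating point) to the difference of the four group means
att_means = (y_post[treat == 1].mean() - y_pre[treat == 1].mean()) \
          - (y_post[treat == 0].mean() - y_pre[treat == 0].mean())
```

&lt;p>Because the 2&amp;times;2 regression is saturated, the interaction coefficient reproduces the four-means DiD exactly; &lt;code>xtdidregress&lt;/code> additionally absorbs household fixed effects and clusters the standard errors.&lt;/p>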
&lt;h3 id="94-from-cross-sectional-dr-to-panel-dr-----doubly-robust-did-drdid">9.4 From cross-sectional DR to panel DR &amp;mdash; Doubly Robust DiD (DRDID)&lt;/h3>
&lt;p>In Section 7, we saw that doubly robust methods combine outcome modeling and propensity score modeling for cross-sectional data. &lt;strong>DRDID extends this logic to the panel setting.&lt;/strong> It combines the DiD framework (using pre/post variation) with doubly robust covariate adjustment.&lt;/p>
&lt;p>This approach was introduced by Sant&amp;rsquo;Anna and Zhao (2020) in a landmark paper published in the &lt;em>Journal of Econometrics&lt;/em>. They proposed estimators that are &amp;ldquo;consistent if either (but not necessarily both) a propensity score or outcome regression working models are correctly specified&amp;rdquo; &amp;mdash; bringing the doubly robust property from the cross-sectional world into the DiD framework.&lt;/p>
&lt;h4 id="why-do-we-need-drdid">Why do we need DRDID?&lt;/h4>
&lt;p>Recall from Section 9.2 that basic DiD relies on the &lt;strong>parallel trends assumption&lt;/strong> &amp;mdash; absent treatment, the treated and control groups would have followed the same time trend. But what if parallel trends holds only &lt;strong>conditional on covariates&lt;/strong>? For example, what if consumption trends differ between poor and non-poor households, but within each poverty group the trends are parallel?&lt;/p>
&lt;p>In this case, we need a &lt;strong>conditional&lt;/strong> parallel trends assumption:&lt;/p>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1, X] = E[Y_1(0) - Y_0(0) \mid D = 0, X]$$&lt;/p>
&lt;p>This says that the average change in untreated potential outcomes is the same for treated and control groups &lt;em>who share the same covariates&lt;/em> $X$. Note that this allows for covariate-specific time trends (e.g., different consumption growth rates for poor and non-poor households) while still identifying the ATT.&lt;/p>
&lt;p>Under this conditional parallel trends assumption, there are two ways to estimate the ATT:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Outcome regression (OR) approach&lt;/strong> &amp;mdash; model how the outcome evolves over time for the control group, and use that model to predict the counterfactual evolution for the treated group&lt;/li>
&lt;li>&lt;strong>IPW approach&lt;/strong> &amp;mdash; reweight the control group so its covariate distribution matches the treated group, then compute the standard DiD&lt;/li>
&lt;/ul>
&lt;p>The problem is the same as in the cross-sectional case: OR requires a correctly specified outcome model, and IPW requires a correctly specified propensity score model. Sant&amp;rsquo;Anna and Zhao&amp;rsquo;s insight was that &lt;strong>you can combine both into a single estimator that works if either model is correct&lt;/strong>.&lt;/p>
&lt;h4 id="the-drdid-estimator-for-panel-data">The DRDID estimator for panel data&lt;/h4>
&lt;p>When panel data are available (as in our case &amp;mdash; same households observed at baseline and endline), the DRDID estimator takes a particularly clean form. Let $\Delta Y_i = Y_{i,post} - Y_{i,pre}$ denote each household&amp;rsquo;s change in consumption. The DR DID estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{DiD} = \frac{1}{N} \sum_{i=1}^{N} \left[ w_1(D_i) - w_0(D_i, X_i) \right] \left[ \Delta Y_i - \hat{\mu}_{0,\Delta}(X_i) \right]$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$w_1(D_i) = D_i / \bar{D}$, where $\bar{D}$ is the fraction treated, assigns equal weight to each treated unit&lt;/li>
&lt;li>$w_0(D_i, X_i)$ reweights control units using the propensity score $\hat{p}(X)$, so they resemble the treated group&lt;/li>
&lt;li>$\hat{\mu}_{0,\Delta}(X_i) = \hat{\mu}_{0,post}(X_i) - \hat{\mu}_{0,pre}(X_i)$ is the predicted change in consumption for the control group, fitted from control-group data&lt;/li>
&lt;/ul>
&lt;p>In plain language: for each household, compute the change in consumption over time ($\Delta Y$) and subtract the model-predicted change for the control group ($\hat{\mu}_{0,\Delta}$). This residual captures the treatment effect plus any prediction error. Then reweight these residuals using IPW so that the control group matches the treated group&amp;rsquo;s covariate profile.&lt;/p>
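&lt;p>The plain-language recipe above translates almost line for line into code. The Python sketch below implements the panel DR DiD estimator on simulated data with a covariate-specific trend; all names and numbers are illustrative assumptions, and the true propensity model is used for the weights to keep the sketch short:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-0.4 * x))           # propensity score (true model, for brevity)
d = rng.binomial(1, p).astype(float)
# Change in outcome: common trend 0.03, covariate-specific trend 0.1*x, ATT 0.12
dy = 0.03 + 0.1 * x + 0.12 * d + rng.normal(0, 0.3, n)

# Outcome model: regress the CONTROL group's change on covariates,
# then predict the counterfactual change for every household
Xmat = np.column_stack([np.ones(n), x])
b0, *_ = np.linalg.lstsq(Xmat[d == 0], dy[d == 0], rcond=None)
mu0 = Xmat @ b0

# Weights: w1 puts equal mass on treated units; w0 reweights controls
# by p/(1-p) so they resemble the treated group's covariate profile
w1 = d / d.mean()
w0 = p * (1 - d) / (1 - p)
w0 = w0 / w0.mean()

# DR DiD: weighted average of residualized changes
att = np.mean((w1 - w0) * (dy - mu0))
```

&lt;p>Here both working models happen to be correct, so the estimate lands near the assumed ATT of 0.12; the doubly robust property says either model alone being correct would suffice.&lt;/p>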
&lt;h4 id="why-is-this-doubly-robust">Why is this doubly robust?&lt;/h4>
&lt;p>The doubly robust property works through the same logic as in the cross-sectional case (Section 7.3), but applied to &lt;strong>changes&lt;/strong> rather than levels:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>If the outcome model is correct&lt;/strong> ($\hat{\mu}_{0,\Delta}(X) = E[\Delta Y \mid D=0, X]$), then the residuals $\Delta Y_i - \hat{\mu}_{0,\Delta}(X_i)$ average to zero for the control group, regardless of the propensity score weights. The estimator reduces to an outcome-regression DiD. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If the propensity score model is correct&lt;/strong> ($\hat{p}(X) = \Pr(D=1 \mid X)$), the IPW reweighting makes the control group comparable to the treated group, regardless of the outcome model. The correction term fixes any bias from a misspecified outcome model. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are correct&lt;/strong>, the estimator achieves the &lt;strong>semiparametric efficiency bound&lt;/strong> &amp;mdash; it is the most precise estimator possible given the assumptions. Sant&amp;rsquo;Anna and Zhao proved this formally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are wrong&lt;/strong>, the estimator can be biased &amp;mdash; double robustness provides one layer of insurance, not two.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
DY[&amp;quot;&amp;lt;b&amp;gt;Panel data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY = Y_post − Y_pre&amp;lt;br/&amp;gt;for each household&amp;quot;]
OR[&amp;quot;&amp;lt;b&amp;gt;Outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict control group's&amp;lt;br/&amp;gt;consumption change&amp;lt;br/&amp;gt;μ̂₀,Δ(X)&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate p(X)&amp;lt;br/&amp;gt;= Pr(D=1 | X)&amp;quot;]
RES[&amp;quot;&amp;lt;b&amp;gt;Residuals&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY − μ̂₀,Δ(X)&amp;quot;]
IPW_W[&amp;quot;&amp;lt;b&amp;gt;IPW reweighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Make controls look&amp;lt;br/&amp;gt;like treated group&amp;quot;]
DRDID[&amp;quot;&amp;lt;b&amp;gt;DR-DiD estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ATT = weighted average&amp;lt;br/&amp;gt;of residuals&amp;quot;]
DY --&amp;gt; RES
OR --&amp;gt; RES
PS --&amp;gt; IPW_W
RES --&amp;gt; DRDID
IPW_W --&amp;gt; DRDID
style DY fill:#141413,stroke:#00d4c8,color:#fff
style OR fill:#6a9bcc,stroke:#141413,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style RES fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_W fill:#d97757,stroke:#141413,color:#fff
style DRDID fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;h4 id="what-drdid-adds-over-basic-did-and-twfe">What DRDID adds over basic DiD and TWFE&lt;/h4>
&lt;p>Sant&amp;rsquo;Anna and Zhao (2020) also showed that the standard two-way fixed effects (TWFE) estimator &amp;mdash; the workhorse of applied economics &amp;mdash; can produce misleading results when treatment effects are heterogeneous across covariates. Specifically, the TWFE estimator implicitly assumes (i) that treatment effects are the same for all covariate values, and (ii) that there are no covariate-specific time trends. When these assumptions fail, &amp;ldquo;the estimand is, in general, different from the ATT, and policy evaluation based on it may be misleading.&amp;rdquo; DRDID avoids both of these pitfalls by allowing for flexible outcome models and covariate-specific trends.&lt;/p>
&lt;h4 id="stata-implementation">Stata implementation&lt;/h4>
&lt;p>The &lt;code>drdid&lt;/code> package (Rios-Avila, Sant&amp;rsquo;Anna, and Callaway) implements the estimators from the paper.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install the drdid package (only needed once)
ssc install drdid, replace
* Doubly Robust DiD with DRIPW estimator
drdid y c.age c.edu i.female i.poverty, ivar(id) time(year) treatment(treat) dripw
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Doubly robust difference-in-differences estimator
Outcome model : least squares
Treatment model: inverse probability
──────────────────────────────────────────────────────────────────────────────
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The DRDID estimate of the ATT is 0.137 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.084, 0.191]). The &lt;code>dripw&lt;/code> option specifies the Doubly Robust Inverse Probability Weighting estimator, which uses a linear least squares model for the outcome evolution of the control group and a logistic model for the propensity score. The result is slightly higher than basic DiD (0.135) and close to the true effect of 0.12.&lt;/p>
&lt;p>&lt;strong>Alternative: Stata 17+ built-in command.&lt;/strong> Stata 17 and later versions include a built-in doubly robust DiD estimator that does not require installing external packages.&lt;/p>
&lt;pre>&lt;code class="language-stata">xthdidregress aipw (y c.age c.edu i.female i.poverty) ///
(treat_post c.age c.edu i.female i.poverty), group(id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Heterogeneous-treatment-effects regression Number of obs = 4,000
Number of panels = 2,000
Estimator: Augmented IPW
Panel variable: id
Treatment level: id
Control group: Never treated
(Std. err. adjusted for 2,000 clusters in id)
──────────────────────────────────────────────────────────────────────────────
| Robust
Cohort | ATET std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
year |
2024 | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
Note: ATET computed using covariates.
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>xthdidregress aipw&lt;/code> command produces the same ATT estimate of 0.137 (SE = 0.027, 95% CI [0.084, 0.191]) as the &lt;code>drdid&lt;/code> package &amp;mdash; confirming that both implement the same doubly robust DiD methodology. The output labels the result as &amp;ldquo;Cohort year 2024&amp;rdquo; because &lt;code>xthdidregress&lt;/code> is designed for settings with staggered treatment adoption across multiple cohorts; in our two-period design, there is only one treatment cohort (households treated in 2024). As the Stata manual explains, &amp;ldquo;AIPW models both treatment and outcome. If at least one of the models is correctly specified, it provides consistent estimates, a property called double robustness.&amp;rdquo;&lt;/p>
&lt;p>The agreement between &lt;code>drdid&lt;/code> (community package) and &lt;code>xthdidregress aipw&lt;/code> (built-in) provides a useful robustness check &amp;mdash; researchers can verify their results using both implementations.&lt;/p>
&lt;h4 id="panel-data-vs-repeated-cross-sections">Panel data vs. repeated cross-sections&lt;/h4>
&lt;p>An important result from Sant&amp;rsquo;Anna and Zhao (2020) is that panel data are &lt;strong>strictly more efficient&lt;/strong> than repeated cross-sections for estimating the ATT under the DiD framework. The intuition is straightforward: with panel data, we observe each household&amp;rsquo;s individual change over time ($\Delta Y_i$), which eliminates household-level variation. With repeated cross-sections, we can only compare group averages at different time points, which introduces additional noise. The efficiency gain is larger when the sample sizes in the pre and post periods are more imbalanced.&lt;/p>
&lt;p>In our study, we have a balanced panel (same 2,000 households at baseline and endline), so we benefit from this efficiency advantage.&lt;/p>
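&lt;p>A small Monte Carlo sketch in Python (with illustrative parameters, not the tutorial data) makes the efficiency argument concrete: the within-household change cancels the persistent household component, while a comparison of two independent cross-sections carries that component as extra noise.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
reps, n = 2000, 1000
panel_est, rcs_est = [], []
for _ in range(reps):
    u = rng.normal(0, 1, n)               # persistent household effects
    y0 = u + rng.normal(0, 0.5, n)        # baseline outcome
    y1 = u + 0.5 + rng.normal(0, 0.5, n)  # endline: common time trend of 0.5
    panel_est.append((y1 - y0).mean())    # panel: differencing cancels u
    u_new = rng.normal(0, 1, n)           # repeated cross-section: fresh households
    y0_new = u_new + rng.normal(0, 0.5, n)
    rcs_est.append(y1.mean() - y0_new.mean())

print(np.std(panel_est), np.std(rcs_est))  # panel spread is well below the RCS spread
```

&lt;p>Both estimators recover the common time trend of 0.5 on average, but the panel estimator has a much smaller sampling spread, consistent with the Sant&amp;rsquo;Anna and Zhao (2020) efficiency result.&lt;/p>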
&lt;h3 id="95-cross-sectional-vs-panel-comparison">9.5 Cross-sectional vs. panel comparison&lt;/h3>
&lt;p>The table below compares our best cross-sectional estimates with the panel-based DiD estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data Used&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RA&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR (IPWRA)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several important patterns emerge from this comparison. Cross-sectional methods estimate &lt;strong>ATE&lt;/strong> using only endline data, while DiD methods estimate &lt;strong>ATT&lt;/strong> using both survey waves. The two DR-DiD implementations (&lt;code>drdid&lt;/code> and &lt;code>xthdidregress aipw&lt;/code>) produce identical results, confirming methodological consistency. The DiD estimates (0.135&amp;ndash;0.137) are slightly higher than the cross-sectional estimates (0.113), but &lt;strong>all confidence intervals contain the true effect of 0.12&lt;/strong>. DiD&amp;rsquo;s wider standard errors (0.027 vs. 0.019) reflect the additional variability from differencing within households.&lt;/p>
&lt;p>The key value of DiD is &lt;strong>not&lt;/strong> tighter standard errors &amp;mdash; it is &lt;strong>robustness to time-invariant unobservables.&lt;/strong> In observational settings, where randomization is unavailable, DiD can remove biases that cross-sectional methods cannot address. In this RCT, randomization already handles confounding, so the estimates are similar. DRDID adds doubly robust protection on top of DiD, making it the most robust panel method in this tutorial&amp;rsquo;s toolkit.&lt;/p>
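&lt;p>A minimal Python simulation (hypothetical parameters, not the tutorial data) shows what this robustness buys. When selection into treatment depends on an unobserved fixed household trait, a cross-sectional comparison is badly biased, while differencing each household against itself removes the trait entirely.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
u = rng.normal(0, 1, n)                      # unobserved fixed household trait
treat = (u + rng.normal(0, 1, n) > 0) * 1.0  # selection into treatment depends on u
tau = 0.12                                   # true treatment effect
y0 = u + rng.normal(0, 0.3, n)               # baseline outcome
y1 = u + 0.5 + tau * treat + rng.normal(0, 0.3, n)  # endline: trend 0.5 plus effect

naive = y1[treat == 1].mean() - y1[treat == 0].mean()   # contaminated by u
dy = y1 - y0
did = dy[treat == 1].mean() - dy[treat == 0].mean()     # u differences out
print(naive, did)  # naive is far above 0.12; did is close to it
```

&lt;p>The naive endline comparison absorbs the selection on the fixed trait, while the DiD contrast recovers the true effect of 0.12 up to sampling noise.&lt;/p>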
&lt;hr>
&lt;h2 id="10-offer-vs-receipt-----endogenous-treatment-advanced">10. Offer vs. receipt &amp;mdash; endogenous treatment (advanced)&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Note:&lt;/strong> This section addresses the advanced topic of imperfect compliance and endogenous treatment. Readers new to causal inference may wish to skip this section on a first reading and return to it later.&lt;/p>
&lt;/blockquote>
&lt;h3 id="101-the-compliance-problem">10.1 The compliance problem&lt;/h3>
&lt;p>All estimates in Sections 8 and 9 measure the effect of &lt;strong>being offered&lt;/strong> the cash transfer (&lt;code>treat&lt;/code>), not the effect of &lt;strong>actually receiving&lt;/strong> it (&lt;code>D&lt;/code>). This is the intent-to-treat (ITT) approach &amp;mdash; it captures the policy-relevant effect of the offer, regardless of whether households complied.&lt;/p>
&lt;p>But what about the effect of actual receipt? This is more complex because compliance is &lt;strong>not random&lt;/strong>. Only 85% of treated households received the transfer, and 5% of control households received it through other channels. The households that chose to take up the program may differ systematically from those that did not &amp;mdash; they may be more motivated, more financially constrained, or better connected. Naively comparing receivers to non-receivers would introduce &lt;strong>selection bias&lt;/strong>.&lt;/p>
&lt;p>The solution is to use the random assignment (&lt;code>treat&lt;/code>) as an &lt;strong>instrumental variable&lt;/strong> for actual receipt (&lt;code>D&lt;/code>). Because &lt;code>treat&lt;/code> was randomly assigned, it is independent of household characteristics and satisfies the requirements for a valid instrument. This allows us to isolate the causal effect of receipt, at least for the subset of households whose receipt was determined by the offer (the &amp;ldquo;compliers&amp;rdquo;).&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; prescriptions and pills.&lt;/strong> Imagine a doctor randomly prescribes a medication to some patients, but not all patients fill their prescription. We cannot simply compare those who took the pill to those who did not, because pill-takers may be more health-conscious. Instead, we use the random prescription (the &amp;ldquo;offer&amp;rdquo;) as a nudge &amp;mdash; it strongly predicts whether you take the pill but does not directly affect your health except through the pill. That is the instrumental variable approach: using the random offer to estimate the causal effect of actual receipt.&lt;/p>
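&lt;p>The same logic can be checked with back-of-envelope arithmetic: scaling the offer (ITT) effect by the share of compliers gives a rough receipt effect. The numbers below come from this tutorial (an ITT estimate of about 0.113, take-up of 85% among treated households and 5% among controls); this simple Wald ratio is only an approximation and will not exactly match the maximum-likelihood estimate reported in the next subsection.&lt;/p>

```python
itt = 0.113              # effect of being offered the transfer (ITT estimate)
takeup_treated = 0.85    # share of offered households that received the transfer
takeup_control = 0.05    # share of control households that received it anyway
first_stage = takeup_treated - takeup_control  # effect of the offer on receipt

late = itt / first_stage  # Wald estimate of the effect of receipt (for compliers)
print(round(late, 3))     # 0.141
```

&lt;p>The back-of-envelope value of about 0.141 sits between the doubly robust receipt estimate and the endogenous-treatment estimate discussed below, as one would expect from a crude scaling of the ITT.&lt;/p>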
&lt;h3 id="102-endogenous-treatment-regression">10.2 Endogenous treatment regression&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;code>etregress&lt;/code> command estimates the effect of an endogenous treatment variable, using the random assignment as an excluded instrument.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Endogenous treatment regression
etregress y c.age i.female i.poverty c.edu, ///
treat(D = treat c.age i.female i.poverty c.edu) vce(robust)
* Mark estimation sample
gen byte esample = e(sample)
* ATE of receipt
margins r.D if esample==1
* ATT of receipt
margins, predict(cte) subpop(if D==1 &amp;amp; esample==1)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression with endogenous treatment Number of obs = 2,000
Estimator: Maximum likelihood Wald chi2(5) = 92.23
Log pseudolikelihood = -1797.6297 Prob &amp;gt; chi2 = 0.0000
──────────────────────────────────────────────────────────────────────────────
| Robust
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
y |
age | .003187 .0010016 3.18 0.001 .001224 .0051501
1.female | .0801465 .0189552 4.23 0.000 .042995 .117298
1.poverty | -.1030302 .0205984 -5.00 0.000 -.1434023 -.062658
edu | .0182634 .0045243 4.04 0.000 .0093959 .0271308
1.D | .1471 .0246775 5.96 0.000 .0987329 .1954671
_cons | 9.705642 .0694641 139.72 0.000 9.569495 9.841789
─────────────+────────────────────────────────────────────────────────────────
D |
treat | 2.55806 .0802103 31.89 0.000 2.40085 2.715269
_cons | -1.844408 .2847883 -6.48 0.000 -2.402582 -1.286233
─────────────+────────────────────────────────────────────────────────────────
/athrho | -.0060068 .0481062 -0.12 0.901 -.1002933 .0882796
sigma | .4245195 .0066426 .411698 .4377404
──────────────────────────────────────────────────────────────────────────────
Wald test of indep. eqns. (rho = 0): chi2(1) = 0.02 Prob &amp;gt; chi2 = 0.9006
ATE of receipt (margins r.D):
──────────────────────────────────────────────────────────────────────────────
D | Contrast std. err. [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
(1 vs 0) | .1471 .0246775 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
ATT of receipt (margins, predict(cte)):
──────────────────────────────────────────────────────────────────────────────
_cons | Margin std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
| .1471 .0246775 5.96 0.000 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>etregress&lt;/code> output reveals several important findings. The coefficient on &lt;code>D&lt;/code> (receipt) is 0.147 (SE = 0.025, p &amp;lt; 0.001, 95% CI [0.099, 0.195]), which is the estimated effect of actually receiving the cash transfer. This is larger than the offer-based estimates (0.113&amp;ndash;0.116) because not everyone who was offered the program received it &amp;mdash; the per-recipient effect is naturally larger than the per-offer effect. The Wald test of independent equations (rho = 0) has p = 0.901, indicating no evidence of endogeneity &amp;mdash; consistent with a well-designed RCT where unobservable factors do not drive both treatment receipt and consumption. The &lt;code>margins&lt;/code> commands confirm that both the ATE and ATT of receipt are 0.147 (identical in this case because the model assumes a constant treatment effect).&lt;/p>
&lt;h3 id="103-doubly-robust-estimation-of-receipt-effect">10.3 Doubly robust estimation of receipt effect&lt;/h3>
&lt;p>We can also estimate the receipt effect using a doubly robust approach, incorporating the baseline outcome &lt;code>y0&lt;/code> as an additional control variable (an ANCOVA-style adjustment) and including &lt;code>treat&lt;/code> (the random assignment) as a covariate in the treatment model for &lt;code>D&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Doubly robust ATE of receipt, controlling for baseline outcome
teffects ipwra (y y0 c.age i.female i.poverty c.edu) ///
(D c.age i.female i.poverty c.edu treat), vce(robust)
* Diagnostic checks
tebalance summarize age edu i.female i.poverty
tebalance summarize, baseline
tebalance density y0
tebalance density age
teffects overlap
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
D |
(1 vs 0) | .1172686 .0322495 3.64 0.000 .0540608 .1804764
─────────────+────────────────────────────────────────────────────────────────
POmean |
D |
0 | 10.03361 .0171459 585.19 0.000 10 10.06722
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimate of the ATE of receipt is 0.117 (SE = 0.032, 95% CI [0.054, 0.180]). This is slightly lower than the &lt;code>etregress&lt;/code> estimate (0.147) and closer to the true effect of 0.12. The wider standard error (0.032 vs. 0.025) reflects the additional flexibility of the doubly robust approach. This specification includes &lt;code>y0&lt;/code> (the baseline outcome) in the outcome model, which controls for pre-treatment differences in consumption levels. The variable &lt;code>treat&lt;/code> appears in the treatment model for &lt;code>D&lt;/code> because random assignment is the strongest predictor of receipt.&lt;/p>
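&lt;p>Why does conditioning on the baseline outcome help? A short Python sketch (illustrative parameters, not the tutorial data) shows the ANCOVA logic: because the baseline outcome absorbs persistent household-level variation, the adjusted estimator is noticeably more precise than a post-only comparison, even under pure randomization.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n, tau = 1000, 1000, 0.12
post_only, ancova = [], []
for _ in range(reps):
    treat = rng.integers(0, 2, n).astype(float)   # randomized treatment
    u = rng.normal(0, 1, n)                       # persistent household component
    y0 = u + rng.normal(0, 0.5, n)                # baseline outcome
    y1 = u + tau * treat + rng.normal(0, 0.5, n)  # endline outcome
    post_only.append(y1[treat == 1].mean() - y1[treat == 0].mean())
    X = np.column_stack([np.ones(n), treat, y0])  # ANCOVA: control for baseline
    ancova.append(np.linalg.lstsq(X, y1, rcond=None)[0][1])

print(np.std(post_only), np.std(ancova))  # ANCOVA spread is noticeably smaller
```

&lt;p>Both estimators are centered on the true effect of 0.12, but the ANCOVA coefficient varies much less across replications because the baseline outcome soaks up the household-level noise.&lt;/p>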
&lt;p>The diagnostic graphs below verify adequate covariate balance and propensity score overlap for the receipt model.&lt;/p>
&lt;p>&lt;img src="stata_rct_density_y0_receipt.png" alt="Density plot of baseline consumption (y0) for receivers and non-receivers, before and after IPWRA weighting.">&lt;/p>
&lt;p>&lt;img src="stata_rct_overlap_receipt.png" alt="Overlap plot showing propensity score distributions for receivers and non-receivers of the cash transfer.">&lt;/p>
&lt;p>The density and overlap plots confirm that the IPWRA weighting achieves good balance between receivers and non-receivers. After weighting, the effective sample sizes are approximately 999 treated and 1,001 control (rebalanced from the raw 923 receivers and 1,077 non-receivers). The weighted covariate means are closely aligned &amp;mdash; for example, the weighted mean age is 35.0 for receivers versus 35.2 for non-receivers, and the weighted poverty rate is 31.1% versus 31.4%. The propensity scores show sufficient overlap for reliable estimation.&lt;/p>
&lt;hr>
&lt;h2 id="11-comparing-all-estimates-----the-big-picture">11. Comparing all estimates &amp;mdash; the big picture&lt;/h2>
&lt;p>The table below brings together all estimates from the tutorial, providing a comprehensive overview of how different methods, estimands, and data structures relate to each other.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>9&lt;/td>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10&lt;/td>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>11&lt;/td>
&lt;td>Endogenous treatment (&lt;code>etregress&lt;/code>)&lt;/td>
&lt;td>IV&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.147&lt;/td>
&lt;td style="text-align:center">0.025&lt;/td>
&lt;td style="text-align:center">[0.099, 0.195]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>12&lt;/td>
&lt;td>DR receipt (&lt;code>teffects ipwra&lt;/code>)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.117&lt;/td>
&lt;td style="text-align:center">0.032&lt;/td>
&lt;td style="text-align:center">[0.054, 0.180]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="four-key-takeaways">Four key takeaways&lt;/h3>
&lt;p>&lt;strong>1. RA vs. IPW vs. DR.&lt;/strong> In this well-designed RCT, all three cross-sectional approaches give remarkably similar results (0.113&amp;ndash;0.116). This convergence occurs because randomization ensures that both the outcome model and the propensity score model are approximately correct. The differences are small &amp;mdash; but in observational studies, where one model might be misspecified, the choice of method matters much more. Doubly robust methods are the safest bet because they remain consistent if either model is correct.&lt;/p>
&lt;p>&lt;strong>2. ATE vs. ATT.&lt;/strong> For all cross-sectional methods, ATE and ATT are nearly identical (0.113&amp;ndash;0.116). This confirms that treatment effects are roughly homogeneous across households in this simulation. When treatment effects are heterogeneous &amp;mdash; for example, if the program benefits poorer households more &amp;mdash; ATE and ATT can diverge. The researcher must choose the estimand that matches their policy question: ATE for scaling decisions, ATT for program evaluation.&lt;/p>
&lt;p>&lt;strong>3. Cross-sectional vs. DiD.&lt;/strong> DiD estimates (0.135&amp;ndash;0.137) are slightly higher than cross-sectional estimates (0.113&amp;ndash;0.116), but all confidence intervals contain the true effect of 0.12. DiD&amp;rsquo;s main advantage is controlling for &lt;strong>time-invariant unobservable&lt;/strong> household characteristics &amp;mdash; less important in an RCT (where randomization handles confounding) but critical in quasi-experimental settings. DRDID extends the doubly robust logic to the panel setting, providing the most robust estimator in our toolkit. DiD inherently estimates the &lt;strong>ATT&lt;/strong> because its counterfactual is constructed specifically for the treated group.&lt;/p>
&lt;p>&lt;strong>4. Offer vs. receipt.&lt;/strong> The effect of actually receiving the cash transfer (0.117&amp;ndash;0.147) is larger than the effect of being offered it (0.113&amp;ndash;0.116), because imperfect compliance dilutes the offer-based estimates. The doubly robust receipt estimate (0.117) is closest to the true effect of 0.12, while the endogenous treatment model (0.147) is slightly higher. All confidence intervals contain 0.12.&lt;/p>
&lt;hr>
&lt;h2 id="12-summary-and-key-takeaways">12. Summary and key takeaways&lt;/h2>
&lt;p>The cash transfer program increased household consumption by approximately &lt;strong>11&amp;ndash;14%&lt;/strong> across all estimation methods, close to the true effect of &lt;strong>12%&lt;/strong>. Every confidence interval contained the true value, demonstrating that all methods successfully recovered the correct answer.&lt;/p>
&lt;h3 id="seven-methodological-lessons">Seven methodological lessons&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Always verify baseline balance&lt;/strong> before estimating treatment effects. Even with randomization, chance imbalances can occur &amp;mdash; as we saw with the gender variable (SMD = 9.3%).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Be explicit about your estimand.&lt;/strong> ATE answers the policymaker&amp;rsquo;s question (&amp;ldquo;What if we scale this up?&amp;rdquo;), while ATT answers the evaluator&amp;rsquo;s question (&amp;ldquo;Did it help the participants?&amp;rdquo;). Different methods target different estimands.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Regression adjustment models the outcome; IPW models treatment assignment; doubly robust does both.&lt;/strong> These three approaches represent fundamentally different strategies for causal estimation. Understanding what each models &amp;mdash; and what can go wrong &amp;mdash; is essential for choosing the right method.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>In a well-designed RCT, all three approaches converge.&lt;/strong> But doubly robust methods provide insurance against model misspecification, making them the standard recommendation in modern causal inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Panel data controls for time-invariant unobservables&lt;/strong> that cross-sectional methods cannot address. By comparing each household to itself over time, DiD absorbs household fixed effects &amp;mdash; motivation, geography, family culture &amp;mdash; that are invisible to cross-sectional approaches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>DiD inherently estimates the ATT&lt;/strong> because its counterfactual is specific to the treated group. The control group&amp;rsquo;s time trend provides a counterfactual for what the treated group would have experienced without the program &amp;mdash; but it does not tell us what would happen if the program were given to the untreated.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Doubly robust DiD (DRDID)&lt;/strong> extends the DR logic to the panel setting. It combines the power of DiD (controlling for household fixed effects) with the robustness of doubly robust estimation (protection against model misspecification), making it the most robust panel estimator available.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>This tutorial uses &lt;strong>simulated data&lt;/strong> with known parameters. Real-world data may exhibit more complex compliance patterns, heterogeneous effects, and missing data.&lt;/li>
&lt;li>The panel has only &lt;strong>two periods&lt;/strong> (baseline and endline), limiting our ability to test for pre-treatment trends or estimate dynamic treatment effects.&lt;/li>
&lt;li>Treatment effects are &lt;strong>homogeneous&lt;/strong> by construction. In practice, researchers should explore heterogeneity across subgroups.&lt;/li>
&lt;/ul>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;ul>
&lt;li>Apply these methods to &lt;strong>real-world RCT data&lt;/strong> from actual cash transfer programs&lt;/li>
&lt;li>Explore &lt;strong>heterogeneous treatment effects&lt;/strong> by gender, poverty status, or education level&lt;/li>
&lt;li>Extend to &lt;strong>multi-period panels&lt;/strong> with staggered treatment adoption, using modern DiD methods (Callaway and Sant&amp;rsquo;Anna, 2021)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Heterogeneous effects by gender.&lt;/strong> Estimate treatment effects separately for male-headed and female-headed households using IPWRA. Are the effects different? Does ATE still equal ATT when you restrict to subgroups?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model misspecification.&lt;/strong> Compare the RA, IPW, and DR estimates when you deliberately misspecify the outcome model by omitting &lt;code>edu&lt;/code> and &lt;code>age&lt;/code> from the covariate list. Which method is most robust to this misspecification? What does this tell you about the value of doubly robust estimation?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Basic DiD vs. doubly robust DiD.&lt;/strong> Re-run the DiD analysis using the basic &lt;code>xtdidregress&lt;/code> command (no covariates) and compare it with the &lt;code>drdid&lt;/code> results (with covariates). How much do the estimates differ? What does this tell you about the role of covariate adjustment in DiD?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.stata.com/manuals/teteffects.pdf" target="_blank" rel="noopener">Stata &lt;code>teffects&lt;/code> documentation &amp;mdash; Treatment-effects estimation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.06.003" target="_blank" rel="noopener">Sant&amp;rsquo;Anna, P.H.C. &amp;amp; Zhao, J. (2020). Doubly Robust Difference-in-Differences Estimators. &lt;em>Journal of Econometrics&lt;/em>, 219(1), 101&amp;ndash;122&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1017/CBO9781139025751" target="_blank" rel="noopener">Imbens, G. &amp;amp; Rubin, D. (2015). &lt;em>Causal Inference for Statistics, Social, and Biomedical Sciences&lt;/em>. Cambridge University Press&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://friosavila.github.io/stpackages/drdid.html" target="_blank" rel="noopener">Rios-Avila, F., Sant&amp;rsquo;Anna, P.H.C., &amp;amp; Callaway, B. &lt;code>drdid&lt;/code> &amp;mdash; Doubly Robust DID estimators for Stata&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dimewiki.worldbank.org/iebaltab" target="_blank" rel="noopener">World Bank &lt;code>ietoolkit&lt;/code> / &lt;code>iebaltab&lt;/code> documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://tdmize.github.io/data/" target="_blank" rel="noopener">Mize, T. &lt;code>balanceplot&lt;/code> &amp;mdash; Stata module for covariate balance visualization&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/Gr_fu5deDMk" target="_blank" rel="noopener">RCT Analysis: Cash Transfers, Panel Data, and Doubly Robust Estimation (YouTube)&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, this post may still contain errors, so caution is needed when applying its contents to real research projects.&lt;/p></description></item><item><title>Introduction to Difference-in-Differences in Python</title><link>https://carlos-mendez.org/post/python_did/</link><pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_did/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>An education ministry rolls out AI tutoring bots in some cities but not others. Did the AI tools actually improve learning, or were those cities already on an upward trajectory? This is the core challenge of &lt;strong>policy evaluation&lt;/strong>: separating the genuine effect of an intervention from pre-existing trends and selection differences between treated and untreated groups. The seminal study by &lt;a href="https://www.jstor.org/stable/2118030" target="_blank" rel="noopener">Card and Krueger (1994)&lt;/a> pioneered this approach in a different context &amp;mdash; examining how a minimum wage increase in New Jersey affected fast-food employment compared to neighboring Pennsylvania.&lt;/p>
&lt;p>&lt;strong>Difference-in-Differences (DiD)&lt;/strong> is the workhorse method for answering such questions. The idea is elegantly simple: compare the change in outcomes over time between a group that received treatment and a group that did not. If both groups were evolving similarly before treatment &amp;mdash; the &lt;em>parallel trends&lt;/em> assumption &amp;mdash; then the difference in their changes isolates the causal effect. Think of it as using the control group as a mirror: it shows what would have happened to the treated group had the policy never been implemented.&lt;/p>
&lt;p>The &lt;strong>&lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">diff-diff&lt;/a>&lt;/strong> Python package, developed by &lt;a href="https://github.com/igerber/diff-diff" target="_blank" rel="noopener">Gerber (2026)&lt;/a>, provides a unified, scikit-learn-style API for 13+ DiD estimators validated against their R counterparts. These range from the classic 2x2 design to modern methods for staggered adoption. In this tutorial, we start with the simplest case, build up to event studies and multi-cohort designs, and finish with sensitivity analysis that quantifies how robust the findings are to violations of parallel trends. All examples use synthetic &lt;strong>panel data&lt;/strong> &amp;mdash; datasets where the same units (cities, firms, individuals) are observed repeatedly over multiple time periods &amp;mdash; with known true effects, so every estimate can be verified against ground truth.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the logic of the 2x2 DiD design and why it identifies causal effects under parallel trends&lt;/li>
&lt;li>Estimate the Average Treatment Effect on the Treated (ATT) using classic DiD&lt;/li>
&lt;li>Test the parallel trends assumption with pre-treatment trend comparisons&lt;/li>
&lt;li>Interpret event study plots that reveal dynamic treatment effects over time&lt;/li>
&lt;li>Recognize why Two-Way Fixed Effects fails under staggered adoption and how Callaway-Sant&amp;rsquo;Anna corrects for it&lt;/li>
&lt;li>Assess robustness of causal conclusions using Bacon decomposition diagnostics and HonestDiD sensitivity analysis&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://colab.research.google.com/github/cmg777/starter-academic-v501/blob/master/content/post/python_did/notebook.ipynb" target="_blank">&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">&lt;/a>&lt;/p>
&lt;h2 id="conceptual-framework-what-is-difference-in-differences">Conceptual framework: What is Difference-in-Differences?&lt;/h2>
&lt;p>Imagine a school district deploys AI tutoring bots in some schools but not others, and you want to know whether the AI tools improved learning outcomes. You could compare learning scores at AI-equipped schools versus non-equipped schools after deployment. But AI-equipped schools might have had stronger students to begin with &amp;mdash; perhaps the district piloted the technology in its highest-performing schools. A simple post-treatment comparison confounds the AI effect with pre-existing differences. Alternatively, you could compare a single school before and after the AI rollout &amp;mdash; but learning scores might have been rising everywhere due to a new curriculum or improved teacher training, not the AI tools.&lt;/p>
&lt;p>DiD combines these two simpler approaches, eliminating selection bias and the effect of time in turn. The logic proceeds through &lt;strong>successive differencing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>First difference&lt;/strong>: Compare a unit before and after treatment. This eliminates time-invariant differences between groups (e.g., one school always scores higher than another), but confounds the treatment effect with common time trends (e.g., district-wide learning improvements from a new curriculum).&lt;/li>
&lt;li>&lt;strong>Second difference&lt;/strong>: Difference the first differences between treated and control groups. This eliminates the common time trends, leaving only the treatment effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TB
subgraph &amp;quot;Before Treatment&amp;quot;
A[&amp;quot;&amp;lt;b&amp;gt;Treated Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment outcome&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment outcome&amp;quot;]
end
subgraph &amp;quot;After Treatment&amp;quot;
C[&amp;quot;&amp;lt;b&amp;gt;Treated Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment outcome&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment outcome&amp;quot;]
end
A --&amp;gt;|&amp;quot;Change in&amp;lt;br/&amp;gt;treated&amp;quot;| C
B --&amp;gt;|&amp;quot;Change in&amp;lt;br/&amp;gt;control&amp;quot;| D
style A fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="the-did-estimator">The DiD estimator&lt;/h3>
&lt;p>The 2x2 DiD estimator formalizes this double comparison. Let $k$ denote the treated group and $U$ the untreated group:&lt;/p>
&lt;p>$$\hat{\delta}^{2 \times 2}_{kU} = \big( \bar{Y}_k^{Post} - \bar{Y}_k^{Pre} \big) - \big( \bar{Y}_U^{Post} - \bar{Y}_U^{Pre} \big)$$&lt;/p>
&lt;p>In words: take the before-and-after change in the treated group, subtract the before-and-after change in the control group, and the remainder is the treatment effect. Here $\bar{Y}_k^{Post}$ is the average outcome for treated units in the post-treatment period (rows where &lt;code>treated = 1&lt;/code> and &lt;code>post = 1&lt;/code>), and similarly for the other three terms.&lt;/p>
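&lt;p>As a quick numerical sketch, the estimator reduces to arithmetic on four averages. The four cell means below are made-up illustrative values, not estimates from any real dataset:&lt;/p>

```python
# Hypothetical cell means for a 2x2 design (illustrative values only)
y_treated_pre, y_treated_post = 11.0, 18.7  # treated group, before / after
y_control_pre, y_control_post = 10.6, 13.1  # control group, before / after

# DiD = (change in treated group) - (change in control group)
did = (y_treated_post - y_treated_pre) - (y_control_post - y_control_pre)
print(did)
```

&lt;p>The treated group gained 7.7 while the control group gained 2.5, so the double difference attributes the extra 5.2 units to treatment.&lt;/p>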
&lt;h3 id="what-did-actually-estimates-the-potential-outcomes-framework">What DiD actually estimates: The potential outcomes framework&lt;/h3>
&lt;p>The sample-means formula above tells us &lt;em>how to compute&lt;/em> DiD from data, but it does not tell us &lt;em>what causal quantity&lt;/em> DiD recovers or &lt;em>under what assumptions&lt;/em> it is valid. To answer these deeper questions, we need the &lt;strong>potential outcomes framework&lt;/strong> (&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, 1974&lt;/a>).&lt;/p>
&lt;p>The key idea is that every unit has &lt;em>two&lt;/em> potential outcomes at every point in time, but we only ever observe one of them:&lt;/p>
&lt;ul>
&lt;li>$Y^1_{i}$ &amp;mdash; the outcome unit $i$ would experience &lt;strong>with&lt;/strong> treatment&lt;/li>
&lt;li>$Y^0_{i}$ &amp;mdash; the outcome unit $i$ would experience &lt;strong>without&lt;/strong> treatment&lt;/li>
&lt;/ul>
&lt;p>For a treated city, we observe $Y^1$ (what actually happened after adopting AI tutoring) but never $Y^0$ (what &lt;em>would have&lt;/em> happened had the city not adopted AI). For a control city, we observe $Y^0$ but never $Y^1$. This is the &lt;strong>fundamental problem of causal inference&lt;/strong>: for any individual unit, the causal effect $Y^1_{i} - Y^0_{i}$ is unobservable because one potential outcome is always missing.&lt;/p>
&lt;p>Since we cannot measure individual effects, we aim for the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; the average causal effect across all treated units in the post-treatment period:&lt;/p>
&lt;p>$$ATT = E[Y^1_k - Y^0_k | Post]$$&lt;/p>
&lt;p>In words: what is the average difference between what treated units actually experienced and what they &lt;em>would have&lt;/em> experienced without treatment, measured in the post-treatment period? Here $E[\cdot]$ denotes the expected value (population average), $k$ indexes the treated group, and the conditioning on $Post$ restricts attention to the post-treatment period. In our data, $E[Y^1_k | Post]$ corresponds to the average &lt;code>outcome&lt;/code> for rows where &lt;code>treated = 1&lt;/code> and &lt;code>post = 1&lt;/code> &amp;mdash; that is, $\bar{Y}_k^{Post}$ from the previous formula.&lt;/p>
&lt;p>The challenge is that $E[Y^0_k | Post]$ &amp;mdash; the average untreated outcome for the treated group after treatment &amp;mdash; is a &lt;strong>counterfactual&lt;/strong> that we never observe. Treated cities received the policy, so we cannot see what their outcomes would have been without it. This is where DiD&amp;rsquo;s clever trick comes in.&lt;/p>
&lt;h3 id="from-sample-means-to-potential-outcomes">From sample means to potential outcomes&lt;/h3>
&lt;p>Let us now connect the sample-means formula to potential outcomes by rewriting each $\bar{Y}$ term. For the &lt;strong>control group&lt;/strong>, which never receives treatment, the observed outcome always equals the untreated potential outcome: $Y_U = Y^0_U$ in both periods. For the &lt;strong>treated group&lt;/strong>, the observed outcome equals the untreated potential outcome before treatment ($Y_k = Y^0_k$ when $Pre$) and the treated potential outcome after ($Y_k = Y^1_k$ when $Post$). Substituting these into the DiD formula:&lt;/p>
&lt;p>$$\hat{\delta}^{2 \times 2}_{kU} = \big( \underbrace{\bar{Y}_k^{Post}}_{= E[Y^1_k | Post]} - \underbrace{\bar{Y}_k^{Pre}}_{= E[Y^0_k | Pre]} \big) - \big( \underbrace{\bar{Y}_U^{Post}}_{= E[Y^0_U | Post]} - \underbrace{\bar{Y}_U^{Pre}}_{= E[Y^0_U | Pre]} \big)$$&lt;/p>
&lt;p>On the left of the outer subtraction, the treated group&amp;rsquo;s pre-treatment mean uses $Y^0_k$ (no treatment yet) and post-treatment mean uses $Y^1_k$ (treatment is active). On the right, both control group means use $Y^0_U$ (never treated). Now we apply a standard algebraic trick: &lt;strong>add and subtract&lt;/strong> the unobserved counterfactual $E[Y^0_k | Post]$ inside the first parenthesis:&lt;/p>
&lt;p>$$= \big( E[Y^1_k | Post] - E[Y^0_k | Post] + E[Y^0_k | Post] - E[Y^0_k | Pre] \big) - \big( E[Y^0_U | Post] - E[Y^0_U | Pre] \big)$$&lt;/p>
&lt;p>Rearranging by grouping the first two terms (the ATT) and the remaining four (the bias):&lt;/p>
&lt;p>$$= \underbrace{E[Y^1_k | Post] - E[Y^0_k | Post]}_{ATT} + \underbrace{\big( E[Y^0_k | Post] - E[Y^0_k | Pre] \big) - \big( E[Y^0_U | Post] - E[Y^0_U | Pre] \big)}_{Bias}$$&lt;/p>
&lt;p>This is the fundamental decomposition of the DiD estimator (&lt;a href="https://mixtape.scunning.com/09-difference_in_differences" target="_blank" rel="noopener">Cunningham, 2021&lt;/a>). The first term is the &lt;strong>ATT&lt;/strong> &amp;mdash; the causal quantity we want. The second term is the &lt;strong>non-parallel trends bias&lt;/strong> &amp;mdash; the difference in how the two groups&amp;rsquo; untreated outcomes would have evolved over time. The bias term compares the untreated trajectory of the treated group ($E[Y^0_k | Post] - E[Y^0_k | Pre]$) against the untreated trajectory of the control group ($E[Y^0_U | Post] - E[Y^0_U | Pre]$). If the bias term is zero, the DiD estimator cleanly identifies the ATT.&lt;/p>
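&lt;p>The decomposition can be verified numerically. In simulated data we control &lt;em>both&lt;/em> potential outcomes (something impossible with real data; all values below are illustrative), so the sample DiD minus the bias term equals the true effect exactly:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # units per group; large so sample means approximate expectations

# Untreated potential outcomes Y^0: both groups share a +2.5 time trend,
# so parallel trends holds by construction (illustrative values only)
y0_treated_pre = 11.0 + rng.normal(0, 1, n)
y0_treated_post = y0_treated_pre + 2.5 + rng.normal(0, 1, n)
y0_control_pre = 10.6 + rng.normal(0, 1, n)
y0_control_post = y0_control_pre + 2.5 + rng.normal(0, 1, n)

# Treated potential outcome Y^1 adds a true ATT of 5.0
y1_treated_post = y0_treated_post + 5.0

# Observed DiD: Y^1 for treated-post, Y^0 everywhere else
did = (y1_treated_post.mean() - y0_treated_pre.mean()) - (
    y0_control_post.mean() - y0_control_pre.mean()
)

# Bias term: difference in the two groups' untreated trajectories
bias = (y0_treated_post.mean() - y0_treated_pre.mean()) - (
    y0_control_post.mean() - y0_control_pre.mean()
)

print(round(did, 3), round(bias, 3))  # did - bias equals the true ATT of 5.0
```

&lt;p>Because parallel trends holds by construction here, the bias term is near zero and DiD recovers the ATT. Adding a group-specific trend to &lt;code>y0_treated_post&lt;/code> would shift the bias term, and hence the DiD estimate, one-for-one.&lt;/p>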
&lt;h3 id="parallel-trends-assumption">Parallel trends assumption&lt;/h3>
&lt;p>The bias term vanishes when the treated and control groups would have followed the same trajectory absent treatment:&lt;/p>
&lt;p>$$E[Y^0_k | Post] - E[Y^0_k | Pre] = E[Y^0_U | Post] - E[Y^0_U | Pre]$$&lt;/p>
&lt;p>This is the &lt;strong>parallel trends assumption&lt;/strong>. It does not require the groups to have the same outcome levels &amp;mdash; only the same &lt;em>trends&lt;/em>. Two cities can have different learning scores, but if their learning scores were rising at the same speed before the AI rollout, DiD can credibly estimate the policy&amp;rsquo;s impact. Importantly, this assumption is &lt;strong>fundamentally untestable&lt;/strong> because the counterfactual outcome $E[Y^0_k | Post]$ &amp;mdash; what would have happened to the treated group absent treatment &amp;mdash; is never observed. We can check whether trends were parallel in the pre-treatment period, but this does not guarantee they would have remained parallel afterward. This limitation is why Section 11 introduces HonestDiD sensitivity analysis.&lt;/p>
&lt;h3 id="regression-formulation">Regression formulation&lt;/h3>
&lt;p>In practice, DiD is implemented as a regression with an interaction term:&lt;/p>
&lt;p>$$Y_{it} = \alpha + \gamma \cdot Treated_i + \lambda \cdot Post_t + \delta \cdot (Treated_i \times Post_t) + \varepsilon_{it}$$&lt;/p>
&lt;p>where $Treated_i$ is the group indicator (our &lt;code>treated&lt;/code> column), $Post_t$ is the time indicator (our &lt;code>post&lt;/code> column), and $\delta$ is the DiD treatment effect. The coefficient $\gamma$ captures the pre-existing level difference between groups, and $\lambda$ captures the common time trend. This regression mechanically constructs the counterfactual from the control group&amp;rsquo;s trajectory: $\delta$ measures the extra change in the treated group, and it identifies the ATT only if the treated group&amp;rsquo;s counterfactual trend truly equals the control group&amp;rsquo;s trend.&lt;/p>
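&lt;p>A minimal sketch of this regression using plain NumPy least squares on simulated data (the parameter values are illustrative, not taken from the tutorial&amp;rsquo;s dataset) shows that the interaction coefficient reproduces the four-means formula:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # observations per cell (illustrative)

# Simulate a 2x2 panel: intercept 10, group gap 0.5, time trend 2.5, effect 5.0
treated = np.repeat([0, 0, 1, 1], n)
post = np.tile(np.repeat([0, 1], n), 2)
y = 10 + 0.5 * treated + 2.5 * post + 5.0 * treated * post + rng.normal(0, 1, 4 * n)

# Design matrix: intercept, Treated, Post, Treated x Post
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta = coef[3]  # the DiD coefficient

# The same number from the four cell means
def cell_mean(t, p):
    return y[(treated == t) & (post == p)].mean()

did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(round(delta, 3), round(did_means, 3))  # identical up to floating point
```

&lt;p>Because the regression is saturated (one parameter per cell), the OLS interaction coefficient and the double difference of cell means are the same number; the regression form simply adds standard errors and a natural place for covariates.&lt;/p>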
&lt;p>&lt;strong>Estimand clarity:&lt;/strong> DiD targets the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; the average impact of treatment on those units that actually received it. This differs from the Average Treatment Effect (ATE), which averages over the entire population including units that were never treated. The ATT answers: &amp;ldquo;For the units that received the policy, how much did it change their outcomes?&amp;rdquo; This is typically the policy-relevant question, since the decision-maker wants to know whether the intervention helped the people it was aimed at.&lt;/p>
&lt;p>Now that we understand the logic, let us implement it step by step using the &lt;code>diff-diff&lt;/code> package.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package:&lt;/p>
&lt;pre>&lt;code class="language-python"># Run in terminal (or use !pip install in a notebook)
pip install diff-diff
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets configuration variables. The &lt;code>diff-diff&lt;/code> package provides &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_did_data()&lt;/code>&lt;/a> to create synthetic panel data with known treatment effects, &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>DifferenceInDifferences()&lt;/code>&lt;/a> for the classic 2x2 estimator, and several advanced estimators for multi-period and staggered designs.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from diff_diff import (
DifferenceInDifferences,
MultiPeriodDiD,
CallawaySantAnna,
BaconDecomposition,
HonestDiD,
generate_did_data,
generate_staggered_data,
check_parallel_trends,
)
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
# Dark-theme palette
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
&lt;/code>&lt;/pre>
&lt;h2 id="classic-2x2-did-design">Classic 2x2 DiD design&lt;/h2>
&lt;p>The simplest DiD setup has two groups (treated and control) observed at two time points (before and after treatment). We start here because the 2x2 case makes the mechanics of DiD transparent before moving to more complex designs.&lt;/p>
&lt;h3 id="generating-synthetic-panel-data">Generating synthetic panel data&lt;/h3>
&lt;p>We use &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_did_data()&lt;/code>&lt;/a> to create a synthetic panel where the true treatment effect is exactly 5.0 units. This known ground truth lets us verify that the estimator recovers the correct answer. The function creates a balanced panel with &lt;code>n_units&lt;/code> units observed over &lt;code>n_periods&lt;/code> periods, where &lt;code>treatment_fraction&lt;/code> of units receive treatment starting at &lt;code>treatment_period&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2 = generate_did_data(
n_units=100,
n_periods=10,
treatment_effect=5.0,
treatment_period=5,
treatment_fraction=0.5,
seed=RANDOM_SEED,
)
print(f&amp;quot;Dataset shape: {data_2x2.shape}&amp;quot;)
print(f&amp;quot;Columns: {data_2x2.columns.tolist()}&amp;quot;)
print(&amp;quot;\nTreatment groups:&amp;quot;)
print(data_2x2.groupby(&amp;quot;treated&amp;quot;)[&amp;quot;unit&amp;quot;].nunique().rename(
{0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Treated&amp;quot;}))
print(f&amp;quot;\nPeriods: {sorted(int(p) for p in data_2x2['period'].unique())}&amp;quot;)
print(&amp;quot;Treatment period: 5 (post = 1 for periods &amp;gt;= 5)&amp;quot;)
print(&amp;quot;True treatment effect: 5.0&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (1000, 6)
Columns: ['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']
Treatment groups:
treated
Control 50
Treated 50
Name: unit, dtype: int64
Periods: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Treatment period: 5 (post = 1 for periods &amp;gt;= 5)
True treatment effect: 5.0
&lt;/code>&lt;/pre>
&lt;p>The synthetic panel contains 1,000 observations: 100 units observed across 10 periods (0 through 9). Half the units (50) are assigned to treatment, which begins at period 5. The dataset includes a &lt;code>true_effect&lt;/code> column that equals 0.0 in pre-treatment periods and 5.0 in post-treatment periods for treated units, providing a built-in benchmark. The &lt;code>post&lt;/code> indicator is 1 for periods 5&amp;ndash;9 and 0 for periods 0&amp;ndash;4, matching the binary time dimension of the classic 2x2 framework.&lt;/p>
&lt;h3 id="exploring-the-2x2-dataset">Exploring the 2x2 dataset&lt;/h3>
&lt;p>Before estimating any model, we inspect the raw data to understand its structure. The &lt;code>.head()&lt;/code> method shows the first rows so we can see how each observation is organized as a unit-period pair.&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.head(10)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period treated post outcome true_effect
0 0 1 0 10.231272 0.0
0 1 1 0 12.408662 0.0
0 2 1 0 11.253170 0.0
0 3 1 0 12.846950 0.0
0 4 1 0 11.675816 0.0
0 5 1 1 17.903997 5.0
0 6 1 1 17.659412 5.0
0 7 1 1 18.770401 5.0
0 8 1 1 20.449742 5.0
0 9 1 1 18.382114 5.0
&lt;/code>&lt;/pre>
&lt;p>Each row is one unit in one period. The &lt;code>unit&lt;/code> column identifies the individual, &lt;code>period&lt;/code> tracks time, &lt;code>treated&lt;/code> indicates group assignment (time-invariant), and &lt;code>post&lt;/code> flags observations after the treatment period. The &lt;code>outcome&lt;/code> column is what we aim to explain, and &lt;code>true_effect&lt;/code> is the ground truth we will try to recover. This unit-period structure is the hallmark of &lt;strong>panel data&lt;/strong> &amp;mdash; repeated observations on the same units over time.&lt;/p>
&lt;p>Summary statistics confirm the design parameters:&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period treated post outcome true_effect
count 1000.000000 1000.000000 1000.00000 1000.00000 1000.000000 1000.000000
mean 49.500000 4.500000 0.50000 0.50000 13.380874 1.250000
std 28.880514 2.873719 0.50025 0.50025 3.752000 2.166147
min 0.000000 0.000000 0.00000 0.00000 4.965883 0.000000
25% 24.750000 2.000000 0.00000 0.00000 10.716817 0.000000
50% 49.500000 4.500000 0.50000 0.50000 12.558536 0.000000
75% 74.250000 7.000000 1.00000 1.00000 15.926784 1.250000
max 99.000000 9.000000 1.00000 1.00000 24.294992 5.000000
&lt;/code>&lt;/pre>
&lt;p>The means of &lt;code>treated&lt;/code> and &lt;code>post&lt;/code> are both exactly 0.50, confirming a perfectly balanced design: half the units are treated, and half the time periods are post-treatment. The outcome ranges from about 5.0 to 24.3 with a mean of 13.4, reflecting the combination of time trends, unit effects, and treatment effects. The &lt;code>true_effect&lt;/code> mean of 1.25 comes from the fact that only 25% of observations (treated units in post-treatment periods) have a non-zero effect of 5.0.&lt;/p>
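&lt;p>The 1.25 figure is a simple weighted average: the effect of 5.0 is active only for the treated half of units during the post-treatment half of periods, i.e. one quarter of all observations:&lt;/p>

```python
# Fraction of observations with an active effect: treated (1/2) x post (1/2)
share_active = 0.5 * 0.5
print(share_active * 5.0)  # mean of the true_effect column: 1.25
```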
&lt;p>A crosstab reveals the 2x2 structure that gives DiD its name:&lt;/p>
&lt;pre>&lt;code class="language-python">pd.crosstab(data_2x2[&amp;quot;treated&amp;quot;], data_2x2[&amp;quot;post&amp;quot;], margins=True)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>post 0 1 All
treated
0 250 250 500
1 250 250 500
All 500 500 1000
&lt;/code>&lt;/pre>
&lt;p>This is the core of the 2x2 design: 250 observations in each of the four cells (control-pre, control-post, treated-pre, treated-post). The balanced allocation means each cell has equal weight in the estimator, which maximizes statistical power. In observational studies, these cell sizes are rarely equal, but the DiD estimator adjusts for imbalance automatically.&lt;/p>
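&lt;p>The point about imbalance can be sketched directly: because the estimator works with cell &lt;em>means&lt;/em>, unequal cell counts affect precision but not what the estimator targets. A small simulation with deliberately unbalanced cells (all sizes and levels below are illustrative):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Unbalanced cell sizes; true DiD is (18.7 - 11.1) - (13.1 - 10.6) = 5.1
sizes = {("control", "pre"): 300, ("control", "post"): 120,
         ("treated", "pre"): 80, ("treated", "post"): 200}
levels = {("control", "pre"): 10.6, ("control", "post"): 13.1,
          ("treated", "pre"): 11.1, ("treated", "post"): 18.7}

# Sample each cell with noise, then average: DiD still targets 5.1
means = {k: rng.normal(levels[k], 1.0, n).mean() for k, n in sizes.items()}
did = (means[("treated", "post")] - means[("treated", "pre")]) - (
    means[("control", "post")] - means[("control", "pre")]
)
print(round(did, 2))  # close to 5.1 despite the imbalance
```

&lt;p>The smallest cell (80 observations here) dominates the standard error, which is why balanced designs are more powerful even though imbalance does not bias the estimate.&lt;/p>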
&lt;p>Finally, we examine how the outcome varies across the four cells:&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.groupby([&amp;quot;treated&amp;quot;, &amp;quot;post&amp;quot;])[&amp;quot;outcome&amp;quot;].describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> count mean std min 25% 50% 75% max
treated post
0 0 250.0 10.614957 1.871283 5.670539 9.261649 10.781139 11.866492 15.825691
1 250.0 13.086386 1.968271 8.158302 11.777457 13.149548 14.600075 18.372485
1 0 250.0 11.114546 2.015353 4.965883 9.909285 11.065526 12.494486 16.804462
1 250.0 18.707609 1.905034 13.182572 17.296981 18.870692 20.070330 24.294992
&lt;/code>&lt;/pre>
&lt;p>In the pre-treatment period, both groups have similar mean outcomes: 10.61 for the control group and 11.11 for the treated group &amp;mdash; a negligible difference of 0.50 that suggests the groups started on comparable footing. In the post-treatment period, the control group mean rises to 13.09 (a gain of 2.47), while the treated group mean jumps to 18.71 (a gain of 7.59). The extra gain for the treated group (7.59 - 2.47 = 5.12) closely approximates the treatment effect that DiD will formally estimate. The raw numbers already hint that something happened to the treated group beyond the natural time trend.&lt;/p>
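&lt;p>The back-of-the-envelope calculation above can be reproduced directly from the four cell means in the table:&lt;/p>

```python
# Cell means copied from the groupby output above
means = {
    ("control", "pre"): 10.614957, ("control", "post"): 13.086386,
    ("treated", "pre"): 11.114546, ("treated", "post"): 18.707609,
}

gain_treated = means[("treated", "post")] - means[("treated", "pre")]
gain_control = means[("control", "post")] - means[("control", "pre")]
did_manual = gain_treated - gain_control
print(round(gain_treated, 2), round(gain_control, 2), round(did_manual, 2))
# 7.59 2.47 5.12
```

&lt;p>This hand calculation previews the formal estimate: the estimator in the next subsection is this same double difference, plus standard errors and inference.&lt;/p>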
&lt;p>The box plot below visualizes these distributions:&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
groups = [
(&amp;quot;Control, Pre&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 0) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 0)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Control, Post&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 0) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 1)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Treated, Pre&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 1) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 0)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Treated, Post&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 1) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 1)][&amp;quot;outcome&amp;quot;]),
]
bp = ax.boxplot(
[g[1] for g in groups],
tick_labels=[g[0] for g in groups],
patch_artist=True,
widths=0.5,
medianprops=dict(color=WHITE_TEXT, linewidth=2),
)
box_colors = [STEEL_BLUE, STEEL_BLUE, WARM_ORANGE, WARM_ORANGE]
for patch, color in zip(bp[&amp;quot;boxes&amp;quot;], box_colors):
patch.set_facecolor(color)
patch.set_alpha(0.6)
ax.set_ylabel(&amp;quot;Outcome&amp;quot;)
ax.set_title(&amp;quot;Outcome Distribution by Treatment Group and Period&amp;quot;)
plt.savefig(&amp;quot;did_outcome_distribution.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_outcome_distribution.png" alt="Box plot showing outcome distributions for control and treated groups in pre and post periods. Both groups start with similar distributions, but the treated group shifts markedly upward in the post period.">&lt;/p>
&lt;p>The box plot makes the treatment effect visible at a glance. In the pre-treatment period, control (steel blue) and treated (warm orange) boxes overlap almost completely, centered around 10.6&amp;ndash;11.1. Both groups shift upward in the post period due to the natural time trend, but the treated group shifts &lt;em>more&lt;/em> &amp;mdash; its median jumps to around 18.9, compared to 13.1 for the control. The extra upward shift for the treated group is the treatment effect that DiD will formally estimate. Notice also that the spread (box height) remains similar across all four groups, suggesting that treatment affects the level but not the variability of outcomes.&lt;/p>
&lt;h3 id="visualizing-parallel-trends">Visualizing parallel trends&lt;/h3>
&lt;p>Before estimating the treatment effect, we check whether the treated and control groups followed similar trajectories in the pre-treatment period. This visual inspection is the first step in assessing whether the parallel trends assumption is plausible. If the two groups were diverging before treatment, any post-treatment difference could reflect pre-existing trends rather than a causal effect.&lt;/p>
&lt;pre>&lt;code class="language-python">treated_means = data_2x2[data_2x2[&amp;quot;treated&amp;quot;] == 1].groupby(&amp;quot;period&amp;quot;)[&amp;quot;outcome&amp;quot;].mean()
control_means = data_2x2[data_2x2[&amp;quot;treated&amp;quot;] == 0].groupby(&amp;quot;period&amp;quot;)[&amp;quot;outcome&amp;quot;].mean()
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.plot(control_means.index, control_means.values, &amp;quot;o-&amp;quot;,
color=STEEL_BLUE, linewidth=2, markersize=7, label=&amp;quot;Control group&amp;quot;)
ax.plot(treated_means.index, treated_means.values, &amp;quot;s-&amp;quot;,
color=WARM_ORANGE, linewidth=2, markersize=7, label=&amp;quot;Treated group&amp;quot;)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5,
alpha=0.7, label=&amp;quot;Treatment onset&amp;quot;)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Average Outcome&amp;quot;)
ax.set_title(&amp;quot;Parallel Trends: Treatment vs Control Groups&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_parallel_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_parallel_trends.png" alt="Parallel trends plot showing treatment and control groups tracking closely in pre-treatment periods 0-4, then diverging sharply after treatment onset at period 5.">&lt;/p>
&lt;p>The two groups move in lockstep during periods 0 through 4, confirming that the parallel trends assumption holds in this synthetic dataset. Both lines fluctuate around similar values with no visible divergence before period 5. After treatment onset, the treated group (warm orange) jumps upward while the control group (steel blue) continues its prior trajectory. The gap between the two lines in the post-treatment period visually represents the treatment effect &amp;mdash; roughly 5 units, consistent with the true effect built into the data.&lt;/p>
&lt;h3 id="estimating-the-treatment-effect">Estimating the treatment effect&lt;/h3>
&lt;p>With parallel trends confirmed visually, we apply the classic DiD estimator. The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>DifferenceInDifferences()&lt;/code>&lt;/a> class implements the 2x2 design with analytical standard errors. The &lt;code>.fit()&lt;/code> method takes the data along with column names for the outcome, treatment indicator, and time indicator (pre/post).&lt;/p>
&lt;pre>&lt;code class="language-python">did = DifferenceInDifferences()
results_2x2 = did.fit(data_2x2, outcome=&amp;quot;outcome&amp;quot;,
treatment=&amp;quot;treated&amp;quot;, time=&amp;quot;post&amp;quot;)
results_2x2.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>======================================================================
Difference-in-Differences Estimation Results
======================================================================
Observations: 1000
Treated units: 500
Control units: 500
R-squared: 0.7332
----------------------------------------------------------------------
Parameter Estimate Std. Err. t-stat P&amp;gt;|t|
----------------------------------------------------------------------
ATT 5.1216 0.2455 20.863 0.0000 ***
----------------------------------------------------------------------
95% Confidence Interval: [4.6399, 5.6034]
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
======================================================================
&lt;/code>&lt;/pre>
&lt;p>The estimated ATT is 5.12, close to the true effect of 5.0, with a standard error of 0.25. The t-statistic of 20.86 and p-value near zero confirm that the effect is highly statistically significant. The 95% confidence interval [4.64, 5.60] comfortably contains the true value of 5.0, demonstrating that the classic DiD estimator successfully recovers the known treatment effect. The small deviation from 5.0 (an overestimate of 0.12) reflects sampling variability, not estimator bias &amp;mdash; with 100 units and 10 periods, some random noise is expected.&lt;/p>
&lt;h3 id="visualizing-the-counterfactual">Visualizing the counterfactual&lt;/h3>
&lt;p>DiD&amp;rsquo;s power lies in constructing a &lt;strong>counterfactual&lt;/strong> &amp;mdash; what would have happened to the treated group without treatment. We build this by projecting the control group&amp;rsquo;s post-treatment trajectory, shifted up by the pre-treatment gap between the groups. The shaded area between the actual treated outcomes and this counterfactual line represents the estimated causal effect.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.plot(control_means.index, control_means.values, &amp;quot;o-&amp;quot;,
color=STEEL_BLUE, linewidth=2, markersize=7, label=&amp;quot;Control group&amp;quot;)
ax.plot(treated_means.index, treated_means.values, &amp;quot;s-&amp;quot;,
color=WARM_ORANGE, linewidth=2, markersize=7, label=&amp;quot;Treated group&amp;quot;)
# Counterfactual: treated group without treatment
pre_diff = treated_means.loc[:4].mean() - control_means.loc[:4].mean()
counterfactual = control_means.loc[5:] + pre_diff
ax.plot(counterfactual.index, counterfactual.values, &amp;quot;s--&amp;quot;,
color=TEAL, linewidth=2, markersize=7,
label=&amp;quot;Counterfactual (no treatment)&amp;quot;)
ax.fill_between(counterfactual.index, counterfactual.values,
treated_means.loc[5:].values, alpha=0.2, color=TEAL,
label=f&amp;quot;Treatment effect (ATT ≈ {results_2x2.att:.1f})&amp;quot;)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.7)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Average Outcome&amp;quot;)
ax.set_title(&amp;quot;DiD Treatment Effect: Observed vs Counterfactual&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_treatment_effect.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_treatment_effect.png" alt="Counterfactual plot showing the treated group diverging from its projected path after treatment. The teal shaded area between the actual and counterfactual lines represents the causal effect.">&lt;/p>
&lt;p>The teal dashed line traces where the treated group would have been without the intervention, constructed by shifting the control group&amp;rsquo;s post-treatment path to match the treated group&amp;rsquo;s pre-treatment level. The shaded gap between the actual treated outcomes (warm orange) and this counterfactual (teal) is the estimated causal effect &amp;mdash; approximately 5.1 units per period. This visualization makes the DiD logic tangible: the control group&amp;rsquo;s trajectory serves as the mirror image of the treated group&amp;rsquo;s no-treatment path, and the extra gain above that mirror is what the policy caused.&lt;/p>
&lt;h2 id="testing-parallel-trends">Testing parallel trends&lt;/h2>
&lt;p>The visual check suggested parallel trends hold, but a formal statistical test provides more rigorous evidence. The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>check_parallel_trends()&lt;/code>&lt;/a> function compares the pre-treatment time trends of the treated and control groups by estimating a linear slope for each group across the pre-treatment periods, then testing whether the two slopes are statistically different.&lt;/p>
&lt;pre>&lt;code class="language-python">pt_result = check_parallel_trends(
data_2x2,
outcome=&amp;quot;outcome&amp;quot;,
time=&amp;quot;period&amp;quot;,
treatment_group=&amp;quot;treated&amp;quot;,
pre_periods=[0, 1, 2, 3, 4],
)
print(f&amp;quot;Treated group pre-trend slope: {pt_result['treated_trend']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['treated_trend_se']:.4f})&amp;quot;)
print(f&amp;quot;Control group pre-trend slope: {pt_result['control_trend']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['control_trend_se']:.4f})&amp;quot;)
print(f&amp;quot;Trend difference: {pt_result['trend_difference']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['trend_difference_se']:.4f})&amp;quot;)
print(f&amp;quot;t-statistic: {pt_result['t_statistic']:.4f}&amp;quot;)
print(f&amp;quot;p-value: {pt_result['p_value']:.4f}&amp;quot;)
print(f&amp;quot;Parallel trends plausible: {pt_result['parallel_trends_plausible']}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Treated group pre-trend slope: 0.5262 (SE = 0.0839)
Control group pre-trend slope: 0.4047 (SE = 0.0798)
Trend difference: 0.1216 (SE = 0.1158)
t-statistic: 1.0497
p-value: 0.2938
Parallel trends plausible: True
&lt;/code>&lt;/pre>
&lt;p>The pre-treatment trend slopes are 0.53 for the treated group and 0.40 for the control group &amp;mdash; a difference of 0.12 with a p-value of 0.29. Since p &amp;gt; 0.05, we fail to reject the null hypothesis that the trends are equal, supporting the parallel trends assumption. However, a critical caveat: &lt;em>failing to reject is not the same as confirming&lt;/em>. The test has limited power, especially with only 5 pre-treatment periods. Even if the trends differed slightly, this test might not detect it. Moreover, &lt;a href="https://doi.org/10.1257/aeri.20210236" target="_blank" rel="noopener">Roth (2022)&lt;/a> shows that conditioning on passing a pre-test can distort subsequent inference &amp;mdash; estimated effects may be biased toward zero and confidence intervals may have incorrect coverage. This is why Section 11 introduces HonestDiD, which asks: &amp;ldquo;How wrong could parallel trends be before our conclusion changes?&amp;rdquo; That question is more informative than a binary pass/fail test.&lt;/p>
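&lt;p>To see what a slope-comparison test does under the hood, the sketch below rebuilds the idea from scratch on synthetic pre-treatment data: regress the outcome on time, group, and their interaction, and read the trend difference off the interaction term. This is an illustrative analogue in &lt;code>statsmodels&lt;/code>, not the library&amp;rsquo;s internal implementation:&lt;/p>

```python
# From-scratch analogue of a slope-based parallel trends check on
# synthetic pre-treatment data (illustrative, not diff-diff internals).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
pre = pd.DataFrame({
    "period": np.tile(np.arange(5), 40),   # 5 pre-periods, 40 obs each
    "group": np.repeat([0, 1], 100),       # 0 = control, 1 = treated
})
# Both groups share a slope of 0.5, so trends should look parallel.
pre["outcome"] = (10 + 0.5 * pre["period"] + 2 * pre["group"]
                  + rng.normal(0, 1, 200))

# The interaction coefficient is the difference in pre-trend slopes;
# its p-value is the test reported above.
fit = smf.ols("outcome ~ period * group", data=pre).fit()
slope_diff = fit.params["period:group"]
p_value = fit.pvalues["period:group"]
print(f"trend difference: {slope_diff:.3f}, p-value: {p_value:.3f}")
```

Because the true slopes are equal here, the p-value should usually be large, but as the Roth (2022) caveat above warns, a large p-value reflects limited power as much as genuinely parallel trends.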
&lt;h2 id="event-study-dynamic-treatment-effects">Event study: Dynamic treatment effects&lt;/h2>
&lt;p>The 2x2 estimator produces a single ATT that averages across all post-treatment periods. But treatment effects often change over time &amp;mdash; they might build up gradually, appear immediately, or fade out. An &lt;strong>event study&lt;/strong> (also called dynamic DiD) estimates separate effects for each period relative to treatment, revealing the full trajectory.&lt;/p>
&lt;p>The event study extends the basic DiD regression by replacing the single treatment effect $\delta$ with a set of period-specific coefficients &amp;mdash; one for each period before and after treatment:&lt;/p>
&lt;p>$$Y_{it} = \gamma_i + \lambda_t + \sum_{k=-K+1}^{-2} \beta_k^{lead} D_{it}^k + \sum_{k=0}^{L} \beta_k^{lag} D_{it}^k + \varepsilon_{it}$$&lt;/p>
&lt;p>Let us unpack each component of this equation:&lt;/p>
&lt;ul>
&lt;li>$Y_{it}$ is the outcome for unit $i$ at time $t$ &amp;mdash; the variable we are trying to explain (our &lt;code>outcome&lt;/code> column).&lt;/li>
&lt;li>$\gamma_i$ are &lt;strong>unit fixed effects&lt;/strong> &amp;mdash; a separate intercept for each unit that absorbs all time-invariant characteristics. For example, if one city always has higher learning scores than another due to demographics or school funding levels, $\gamma_i$ captures that permanent difference. In practice, this is equivalent to demeaning each unit&amp;rsquo;s outcome by its own time-average.&lt;/li>
&lt;li>$\lambda_t$ are &lt;strong>time fixed effects&lt;/strong> &amp;mdash; a separate intercept for each period that absorbs shocks common to all units at a given time. If a national curriculum reform in period 3 raises learning outcomes for everyone equally, $\lambda_t$ captures that common shift. Together with unit fixed effects, this implements the &amp;ldquo;two-way&amp;rdquo; in TWFE.&lt;/li>
&lt;li>$D_{it}^k$ is a &lt;strong>relative-time indicator&lt;/strong> (also called an event-time dummy): it equals 1 when unit $i$ at time $t$ is exactly $k$ periods away from its treatment onset, and 0 otherwise. For a unit first treated at period 5, we have $D_{i,3}^{-2} = 1$ (two periods before treatment), $D_{i,5}^{0} = 1$ (the treatment period itself), $D_{i,7}^{2} = 1$ (two periods after treatment), and so on.&lt;/li>
&lt;li>$\beta_k^{lead}$ (for $k = -K+1, \ldots, -2$) are the &lt;strong>lead coefficients&lt;/strong> &amp;mdash; pre-treatment effects at each period before treatment. These serve as &lt;strong>placebo tests&lt;/strong>: if the treated and control groups were evolving similarly before the intervention, all lead coefficients should be close to zero and statistically insignificant. A significant lead coefficient signals a pre-existing divergence, which would undermine the parallel trends assumption. The summation starts at $k = -K+1$ (the earliest available lead) and stops at $k = -2$, because the period immediately before treatment ($k = -1$) is &lt;strong>omitted as the reference period&lt;/strong> and normalized to zero. All other coefficients are estimated relative to this baseline.&lt;/li>
&lt;li>$\beta_k^{lag}$ (for $k = 0, 1, \ldots, L$) are the &lt;strong>lag coefficients&lt;/strong> &amp;mdash; post-treatment effects at each period after treatment onset. The coefficient $\beta_0^{lag}$ captures the &lt;strong>instantaneous effect&lt;/strong> at the moment treatment begins, $\beta_1^{lag}$ captures the effect one period later, and so on through $\beta_L^{lag}$ at $L$ periods after treatment. These coefficients trace out the &lt;strong>dynamic treatment effect trajectory&lt;/strong>: they reveal whether the effect appears immediately or builds up gradually, whether it persists or fades out, and whether it stabilizes at a constant level or continues to grow.&lt;/li>
&lt;li>$\varepsilon_{it}$ is the error term, capturing all unobserved factors not absorbed by the fixed effects or treatment indicators.&lt;/li>
&lt;/ul>
&lt;p>The key insight is that this single equation simultaneously tests the identifying assumption &lt;em>and&lt;/em> estimates the treatment effect. The leads ($\beta_k^{lead}$) test parallel trends period by period, while the lags ($\beta_k^{lag}$) reveal how the treatment effect evolves over time. In our tutorial, treatment begins at period 5 and the reference period is 4 ($k = -1$), so we have 4 lead coefficients at $k = -5, -4, -3, -2$ (corresponding to periods 0&amp;ndash;3) and 5 lag coefficients at $k = 0, 1, 2, 3, 4$ (that is, $L = 4$, corresponding to periods 5&amp;ndash;9).&lt;/p>
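&lt;p>The specification can also be estimated directly with dummy variables. A minimal sketch on small synthetic two-group data (an assumed setup, not the tutorial&amp;rsquo;s estimator), using &lt;code>statsmodels&lt;/code> with the period just before treatment as the omitted reference:&lt;/p>

```python
# Sketch of an event-study regression with unit and time fixed effects,
# fit on synthetic data (assumed setup; true effect of 5 from period 5).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for unit in range(60):
    treat = 1 if unit >= 30 else 0
    for period in range(10):
        effect = 5.0 if (treat == 1 and period >= 5) else 0.0
        y = 10 + 0.4 * period + 2 * treat + effect + rng.normal(0, 1)
        rows.append({"unit": unit, "period": period, "treat": treat, "y": y})
df = pd.DataFrame(rows)

# Event time: periods relative to treatment onset (period 5) for the
# treated group; control units sit in the reference bucket throughout.
df["rel"] = np.where(df["treat"] == 1, df["period"] - 5, -1)

# rel = -1 is omitted, so every coefficient is relative to that baseline.
fit = smf.ols(
    "y ~ C(unit) + C(period) + C(rel, Treatment(reference=-1))", data=df
).fit()
lag0 = fit.params["C(rel, Treatment(reference=-1))[T.0]"]
print(f"estimated effect at event time 0: {lag0:.2f}")
```

The lead coefficients from this fit should hover near zero and the lags near 5, mirroring the pattern in the `MultiPeriodDiD()` summary below.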
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>MultiPeriodDiD()&lt;/code>&lt;/a> estimator fits this specification, using one pre-treatment period as the reference point.&lt;/p>
&lt;pre>&lt;code class="language-python">event = MultiPeriodDiD()
results_event = event.fit(
data_2x2,
outcome=&amp;quot;outcome&amp;quot;,
treatment=&amp;quot;treated&amp;quot;,
time=&amp;quot;period&amp;quot;,
post_periods=[5, 6, 7, 8, 9],
reference_period=4,
)
results_event.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>================================================================================
Multi-Period Difference-in-Differences Estimation Results
================================================================================
Observations: 1000
Treated observations: 500
Control observations: 500
Pre-treatment periods: 5
Post-treatment periods: 5
R-squared: 0.7648
--------------------------------------------------------------------------------
Pre-Period Effects (Parallel Trends Test)
--------------------------------------------------------------------------------
Period Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
0 -0.5167 0.5121 -1.009 0.3132
1 -0.5050 0.5031 -1.004 0.3157
2 -0.2804 0.5228 -0.536 0.5919
3 -0.3227 0.5187 -0.622 0.5340
[ref: 4] 0.0000 --- --- ---
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Post-Period Treatment Effects
--------------------------------------------------------------------------------
Period Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
5 4.6509 0.5162 9.011 0.0000 ***
6 4.8285 0.5227 9.238 0.0000 ***
7 4.6907 0.5068 9.255 0.0000 ***
8 4.7888 0.4908 9.757 0.0000 ***
9 5.0244 0.5203 9.657 0.0000 ***
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Average Treatment Effect (across post-periods)
--------------------------------------------------------------------------------
Parameter Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
Avg ATT 4.7967 0.3923 12.227 0.0000 ***
--------------------------------------------------------------------------------
95% Confidence Interval: [4.0269, 5.5665]
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
================================================================================
&lt;/code>&lt;/pre>
&lt;p>The pre-treatment coefficients (periods 0&amp;ndash;3) are all small and statistically insignificant, ranging from -0.52 to -0.28 with p-values well above 0.05. This confirms that the treated and control groups were evolving similarly before the intervention &amp;mdash; the period-by-period placebo test passes. In contrast, all five post-treatment effects (periods 5&amp;ndash;9) are large and highly significant, ranging from 4.65 to 5.02 with t-statistics above 9.0. The average ATT across post periods is 4.80 with a 95% CI of [4.03, 5.57], consistent with the true effect of 5.0. The effects are remarkably stable over time, indicating no fade-out or build-up &amp;mdash; the treatment shifts outcomes by roughly 5 units immediately and maintains that shift.&lt;/p>
&lt;p>The event study plot below makes these dynamics visible:&lt;/p>
&lt;pre>&lt;code class="language-python">es_df = results_event.to_dataframe()
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
pre = es_df[~es_df[&amp;quot;is_post&amp;quot;]]
post = es_df[es_df[&amp;quot;is_post&amp;quot;]]
ax.errorbar(pre[&amp;quot;period&amp;quot;], pre[&amp;quot;effect&amp;quot;], yerr=1.96 * pre[&amp;quot;se&amp;quot;],
fmt=&amp;quot;o&amp;quot;, color=STEEL_BLUE, capsize=4, linewidth=2,
markersize=8, label=&amp;quot;Pre-treatment&amp;quot;)
ax.errorbar(post[&amp;quot;period&amp;quot;], post[&amp;quot;effect&amp;quot;], yerr=1.96 * post[&amp;quot;se&amp;quot;],
fmt=&amp;quot;s&amp;quot;, color=WARM_ORANGE, capsize=4, linewidth=2,
markersize=8, label=&amp;quot;Post-treatment&amp;quot;)
# Reference period
ax.plot(4, 0, &amp;quot;D&amp;quot;, color=WHITE_TEXT, markersize=10, zorder=5,
label=&amp;quot;Reference period&amp;quot;)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1, alpha=0.5)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.5)
ax.axhline(y=5.0, color=TEAL, linestyle=&amp;quot;:&amp;quot;, linewidth=1.5, alpha=0.7,
label=&amp;quot;True effect (5.0)&amp;quot;)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Estimated Effect&amp;quot;)
ax.set_title(&amp;quot;Event Study: Dynamic Treatment Effects&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_event_study.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_event_study.png" alt="Event study plot with pre-treatment coefficients clustered near zero and post-treatment coefficients jumping to approximately 5.0. Confidence intervals shown for each period.">&lt;/p>
&lt;p>The event study plot tells the DiD story at a glance. Pre-treatment coefficients (steel blue circles) hover near the zero line, their confidence intervals all crossing zero &amp;mdash; this is the visual signature of valid parallel trends. At the treatment cutoff (dashed vertical line), the estimates jump sharply to around 5.0 (warm orange squares), and the teal dotted line at 5.0 shows that every post-treatment estimate is close to the true effect. The confidence intervals in the post-treatment period are narrow and well above zero, confirming both statistical significance and accuracy.&lt;/p>
&lt;p>With the classic 2x2 case established, the next question is: what happens when different units adopt treatment at different times?&lt;/p>
&lt;h2 id="staggered-adoption-why-twfe-fails">Staggered adoption: Why TWFE fails&lt;/h2>
&lt;p>In many real-world policies, treatment does not begin simultaneously for all units. AI tutoring platforms roll out city by city, digital infrastructure investments phase in over years, and educational technology grants expand district by district. This is &lt;strong>staggered adoption&lt;/strong> &amp;mdash; different units start treatment at different times.&lt;/p>
&lt;p>The traditional approach is &lt;strong>Two-Way Fixed Effects (TWFE)&lt;/strong> regression, which estimates a single treatment coefficient using unit and time fixed effects:&lt;/p>
&lt;p>$$Y_{it} = \gamma_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}$$&lt;/p>
&lt;p>Here $\gamma_i$ absorbs all time-invariant unit characteristics (unit fixed effects), $\lambda_t$ absorbs all common time shocks (time fixed effects), $D_{it}$ is a treatment indicator that equals 1 when unit $i$ is treated at time $t$, and $\delta$ is the single treatment effect that TWFE estimates. When all treated units adopt at the same time, $\delta$ correctly recovers the ATT. But with staggered timing, the single coefficient $\delta$ is a weighted average of many underlying 2x2 comparisons &amp;mdash; and some of those comparisons are problematic.&lt;/p>
&lt;p>The problem is that TWFE makes &lt;strong>forbidden comparisons&lt;/strong>: it implicitly uses already-treated units as controls for newly-treated units. If treatment effects grow over time, these forbidden comparisons produce negative bias, pulling the overall estimate downward. Think of it this way: if early adopters have been benefiting from treatment for three years and their outcomes have grown substantially, TWFE compares newly-treated units to these high-performing early adopters. The newly-treated units look &lt;em>worse&lt;/em> by comparison, even though they are genuinely benefiting from treatment. In extreme cases with heterogeneous treatment effects across cohorts, TWFE can even assign &lt;strong>negative weights&lt;/strong> to some 2x2 comparisons, potentially flipping the sign of the estimate opposite to every unit&amp;rsquo;s true treatment effect (this does not occur in our example, but is documented in &lt;a href="https://doi.org/10.1257/aer.20181169" target="_blank" rel="noopener">de Chaisemartin &amp;amp; D&amp;rsquo;Haultfoeuille, 2020&lt;/a>).&lt;/p>
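&lt;p>A deliberately extreme toy makes this concrete. With only an early and a late cohort, no never-treated units, and effects that grow each period, the TWFE coefficient can even turn negative while every true effect is positive. The sketch below uses assumed deterministic data (no noise, not the tutorial&amp;rsquo;s dataset) so the bias is pure arithmetic, not sampling error:&lt;/p>

```python
# Toy demonstration of TWFE sign reversal under staggered adoption with
# growing effects (assumed deterministic data; no never-treated units).
import pandas as pd
import statsmodels.formula.api as smf

rows = []
for unit in range(20):
    first = 2 if unit >= 10 else 6   # early cohort at t=2, late at t=6
    for period in range(10):
        post = 1 if period >= first else 0
        # True effect starts at 1 and grows by 1 each period after onset.
        effect = (period - first + 1) * post
        rows.append({"unit": unit, "period": period, "d": post,
                     "y": 10 + effect})
df = pd.DataFrame(rows)

# Static TWFE: unit and period fixed effects plus a single coefficient.
fit = smf.ols("y ~ C(unit) + C(period) + d", data=df).fit()
true_avg = df.loc[df["d"] == 1, "y"].sub(10).mean()  # average true effect
print(f"TWFE: {fit.params['d']:.2f}  vs  true average effect: {true_avg:.2f}")
```

Every treated unit-period has a strictly positive effect, yet the forbidden later-vs-earlier comparisons drag the TWFE coefficient below zero. The tutorial&amp;rsquo;s own data are less extreme, but the mechanism is the same one the Bacon decomposition diagnoses below.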
&lt;h3 id="generating-staggered-adoption-data">Generating staggered adoption data&lt;/h3>
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_staggered_data()&lt;/code>&lt;/a> function creates a panel with multiple treatment cohorts &amp;mdash; groups of units that begin treatment in different periods &amp;mdash; plus a never-treated group.&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag = generate_staggered_data(
n_units=300,
n_periods=10,
seed=RANDOM_SEED,
)
print(f&amp;quot;Dataset shape: {data_stag.shape}&amp;quot;)
cohorts = data_stag.groupby(&amp;quot;first_treat&amp;quot;)[&amp;quot;unit&amp;quot;].nunique()
print(&amp;quot;\nCohort sizes:&amp;quot;)
for ft, n in cohorts.items():
label = &amp;quot;Never-treated&amp;quot; if ft == 0 else f&amp;quot;First treated in period {ft}&amp;quot;
print(f&amp;quot; {label}: {n} units&amp;quot;)
print(f&amp;quot;\nTotal units: {cohorts.sum()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (3000, 7)
Cohort sizes:
Never-treated: 90 units
First treated in period 3: 60 units
First treated in period 5: 75 units
First treated in period 7: 75 units
Total units: 300
&lt;/code>&lt;/pre>
&lt;p>The staggered panel has 3,000 observations (300 units across 10 periods). Three treatment cohorts adopt at different times: 60 units start treatment in period 3, 75 in period 5, and 75 in period 7. Another 90 units are never treated, serving as a clean control group. The &lt;code>first_treat&lt;/code> column records when each unit first received treatment (0 for never-treated). This staggered structure is where naive TWFE breaks down, as the next section demonstrates.&lt;/p>
&lt;h3 id="exploring-the-staggered-dataset">Exploring the staggered dataset&lt;/h3>
&lt;p>The staggered dataset has a richer structure than the 2x2 case. Inspecting the first rows reveals additional columns:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.head(10)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
0 0 11.278161 0 0 0 0.0
0 1 11.835615 0 0 0 0.0
0 2 11.542112 0 0 0 0.0
0 3 11.716260 0 0 0 0.0
0 4 12.289791 0 0 0 0.0
0 5 10.978501 0 0 0 0.0
0 6 11.426795 0 0 0 0.0
0 7 11.433938 0 0 0 0.0
0 8 11.108223 0 0 0 0.0
0 9 12.035899 0 0 0 0.0
&lt;/code>&lt;/pre>
&lt;p>Unit 0 is never-treated, so all indicators stay at zero across all 10 periods. To understand the staggered structure, we need to see what happens to treated units. The columns have distinct roles:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>first_treat&lt;/code>&lt;/strong>: the period when a unit first receives treatment (0 = never treated)&lt;/li>
&lt;li>&lt;strong>&lt;code>treat&lt;/code>&lt;/strong>: &lt;strong>time-invariant&lt;/strong> group membership &amp;mdash; equals 1 for any unit &lt;em>ever&lt;/em> assigned to treatment, 0 for never-treated&lt;/li>
&lt;li>&lt;strong>&lt;code>treated&lt;/code>&lt;/strong>: &lt;strong>time-varying&lt;/strong> post-treatment indicator &amp;mdash; equals 0 before treatment onset and switches to 1 at &lt;code>first_treat&lt;/code>&lt;/li>
&lt;li>&lt;strong>&lt;code>true_effect&lt;/code>&lt;/strong>: the known ground-truth treatment effect at each period, used for verification&lt;/li>
&lt;/ul>
&lt;p>The distinction between &lt;code>treat&lt;/code> and &lt;code>treated&lt;/code> is crucial: &lt;code>treat&lt;/code> tells you &lt;em>who&lt;/em> is in the treatment group (a permanent label), while &lt;code>treated&lt;/code> tells you &lt;em>when&lt;/em> they are actually under treatment (a dynamic state). For never-treated units, both are always 0. For treated units, &lt;code>treat&lt;/code> is always 1, but &lt;code>treated&lt;/code> flips from 0 to 1 at the unit&amp;rsquo;s treatment onset.&lt;/p>
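&lt;p>Given only &lt;code>first_treat&lt;/code>, both indicators can be reconstructed. A minimal sketch, assuming the same coding convention as above (0 = never treated):&lt;/p>

```python
# Deriving the permanent group label (treat) and the time-varying
# post-treatment indicator (treated) from first_treat.
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "unit":        [0, 0, 0, 1, 1, 1],
    "period":      [0, 1, 2, 0, 1, 2],
    "first_treat": [0, 0, 0, 1, 1, 1],   # unit 1 first treated at period 1
})
# treat: permanent membership in the treatment group (ever treated).
panel["treat"] = (panel["first_treat"] > 0).astype(int)
# treated: switches on at first_treat, but only for treated-group units.
# (Boolean multiplication acts as an elementwise AND here.)
panel["treated"] = np.where(
    (panel["treat"] == 1) * (panel["period"] >= panel["first_treat"]), 1, 0
)
print(panel)
```

Unit 0 keeps both indicators at 0, while unit 1 has `treat=1` throughout and `treated` flipping from 0 to 1 at period 1, exactly the pattern in the displayed panels.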
&lt;p>An early-treated unit from cohort 3 illustrates this structure:&lt;/p>
&lt;pre>&lt;code class="language-python">early_unit = data_stag[data_stag[&amp;quot;first_treat&amp;quot;] == 3][&amp;quot;unit&amp;quot;].iloc[0]
data_stag[data_stag[&amp;quot;unit&amp;quot;] == early_unit]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
90 0 13.299816 3 0 1 0.0
90 1 12.897337 3 0 1 0.0
90 2 11.882534 3 0 1 0.0
90 3 14.724679 3 1 1 2.0
90 4 16.139340 3 1 1 2.2
90 5 14.433891 3 1 1 2.4
90 6 15.949127 3 1 1 2.6
90 7 15.832888 3 1 1 2.8
90 8 17.125174 3 1 1 3.0
90 9 16.685332 3 1 1 3.2
&lt;/code>&lt;/pre>
&lt;p>Unit 90 has &lt;code>treat=1&lt;/code> throughout (it belongs to the treatment group), but &lt;code>treated&lt;/code> flips from 0 to 1 at period 3 &amp;mdash; the moment it enters the post-treatment state. The &lt;code>true_effect&lt;/code> is 0 in the pre-treatment periods, then starts at 2.0 and grows by 0.2 each period, reaching 3.2 by period 9. This growing effect pattern is what makes staggered DiD challenging: the treatment effect for cohort 3 at period 7 (2.8) is very different from the effect at period 3 (2.0).&lt;/p>
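&lt;p>The ground-truth pattern in these tables can be written as a small function, assuming the parameterization implied above (a jump of 2.0 at onset, growing by 0.2 per period since onset):&lt;/p>

```python
# Assumed ground-truth effect rule implied by the displayed panels:
# 0 before onset and for never-treated units, else 2.0 + 0.2 per period
# since treatment began.
def true_effect(period, first_treat):
    if first_treat > 0 and period >= first_treat:
        return 2.0 + 0.2 * (period - first_treat)
    return 0.0

# Cohort 3 and cohort 7 at period 9, matching the tables above
# (3.2 and 2.4 respectively).
print(true_effect(9, 3), true_effect(9, 7))
```

The same rule reproduces every `true_effect` value shown for units 90 and 91, which is what makes the cohort asymmetry mechanical: earlier cohorts simply have more periods of growth behind them.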
&lt;p>Now compare with a late-treated unit from cohort 7:&lt;/p>
&lt;pre>&lt;code class="language-python">late_unit = data_stag[data_stag[&amp;quot;first_treat&amp;quot;] == 7][&amp;quot;unit&amp;quot;].iloc[0]
data_stag[data_stag[&amp;quot;unit&amp;quot;] == late_unit]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
91 0 7.987886 7 0 1 0.0
91 1 8.168639 7 0 1 0.0
91 2 8.904022 7 0 1 0.0
91 3 7.984438 7 0 1 0.0
91 4 8.373931 7 0 1 0.0
91 5 7.543381 7 0 1 0.0
91 6 8.981115 7 0 1 0.0
91 7 10.105654 7 1 1 2.0
91 8 10.505532 7 1 1 2.2
91 9 11.074785 7 1 1 2.4
&lt;/code>&lt;/pre>
&lt;p>Unit 91 also has &lt;code>treat=1&lt;/code> throughout, but &lt;code>treated&lt;/code> does not flip until period 7 &amp;mdash; giving it a much longer pre-treatment phase (7 periods vs 3 for cohort 3) and only 3 post-treatment periods. Its &lt;code>true_effect&lt;/code> starts at 2.0 at period 7 and reaches only 2.4 by period 9, compared to cohort 3&amp;rsquo;s 3.2. This asymmetry &amp;mdash; early cohorts accumulating larger effects over more post-treatment periods &amp;mdash; is precisely what causes TWFE to produce biased estimates when it uses already-treated cohort 3 units as &amp;ldquo;controls&amp;rdquo; for cohort 7.&lt;/p>
&lt;p>Let us examine how the staggered structure differs from the 2x2 case in scale and treatment coverage. With multiple cohorts adopting at different times, the fraction of observations in the post-treatment state is no longer 50%:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
count 3000.000000 3000.00000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 149.500000 4.50000 11.287067 3.600000 0.340000 0.700000 0.829000
std 86.616497 2.87276 2.528589 2.709695 0.473788 0.458334 1.173464
min 0.000000 0.00000 4.521385 0.000000 0.000000 0.000000 0.000000
25% 74.750000 2.00000 9.461867 0.000000 0.000000 0.000000 0.000000
50% 149.500000 4.50000 11.107083 4.000000 0.000000 1.000000 0.000000
75% 224.250000 7.00000 13.078036 5.500000 1.000000 1.000000 2.200000
max 299.000000 9.00000 20.616391 7.000000 1.000000 1.000000 3.200000
&lt;/code>&lt;/pre>
&lt;p>With 3,000 observations and 300 units, this panel is three times larger than the 2x2 case. The &lt;code>first_treat&lt;/code> variable has a mean of 3.60, reflecting the mix of never-treated (0) and cohorts treated at periods 3, 5, and 7. The &lt;code>treated&lt;/code> mean of 0.34 tells us that 34% of all unit-period observations are in a post-treatment state &amp;mdash; less than half because late cohorts contribute fewer treated periods than early cohorts.&lt;/p>
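&lt;p>The 34% figure follows directly from the cohort sizes and treatment dates, since each cohort contributes one post-treatment observation per unit for every period from its onset onward:&lt;/p>

```python
# Checking the treated-share arithmetic from the cohort structure
# reported above: first_treat -> number of units in that cohort.
cohorts = {3: 60, 5: 75, 7: 75}
n_units, n_periods = 300, 10

# Each cohort contributes (n_periods - first_treat) post periods per unit.
post_obs = sum(n * (n_periods - ft) for ft, n in cohorts.items())
share = post_obs / (n_units * n_periods)
print(post_obs, share)   # 1020 post-treatment observations, share 0.34
```
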
&lt;p>A crosstab of the number of &lt;strong>treated&lt;/strong> (post-treatment) units by cohort and period reveals the staggered rollout:&lt;/p>
&lt;pre>&lt;code class="language-python">pd.crosstab(data_stag[&amp;quot;first_treat&amp;quot;], data_stag[&amp;quot;period&amp;quot;],
values=data_stag[&amp;quot;treated&amp;quot;], aggfunc=&amp;quot;sum&amp;quot;).fillna(0).astype(int)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>period 0 1 2 3 4 5 6 7 8 9
first_treat
0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 60 60 60 60 60 60 60
5 0 0 0 0 0 75 75 75 75 75
7 0 0 0 0 0 0 0 75 75 75
&lt;/code>&lt;/pre>
&lt;p>The staggered structure is immediately visible: zeros give way to treatment counts as each cohort enters the post-treatment state. At period 2, no units are yet treated. At period 3, 60 units from cohort 3 enter treatment. At period 5, cohort 5 adds 75 more, bringing the total to 135. By period 7, all 210 treated units are in post-treatment. The never-treated group (row 0) remains at zero throughout. This growing treated population &amp;mdash; and the fact that cohort 3 has been treated for 4 periods by the time cohort 7 starts &amp;mdash; is the asymmetry that makes TWFE unreliable. When TWFE uses cohort 3 as a &amp;ldquo;control&amp;rdquo; for cohort 7, it compares against units whose outcomes already incorporate a treatment effect of 2.8, not the untreated counterfactual.&lt;/p>
&lt;p>The pivoted outcome means by cohort and period reveal the staggered treatment pattern:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.groupby([&amp;quot;first_treat&amp;quot;, &amp;quot;period&amp;quot;])[&amp;quot;outcome&amp;quot;].mean().unstack()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>period 0 1 2 3 4 5 6 7 8 9
first_treat
0 9.92 9.95 10.17 10.28 10.40 10.46 10.53 10.68 10.78 10.88
3 10.39 10.51 10.59 12.82 13.07 13.33 13.60 13.99 14.22 14.56
5 10.08 10.17 10.33 10.32 10.58 12.70 12.90 13.11 13.64 13.77
7 9.61 9.76 9.73 10.04 10.00 10.10 10.35 12.25 12.59 12.91
&lt;/code>&lt;/pre>
&lt;p>All four cohorts track closely in their pre-treatment periods (values near 9.6&amp;ndash;10.6 in periods 0&amp;ndash;2), confirming parallel pre-trends. The divergence is sharp and cohort-specific: cohort 3 jumps at period 3 (from 10.59 to 12.82), cohort 5 jumps at period 5 (from 10.58 to 12.70), and cohort 7 jumps at period 7 (from 10.35 to 12.25). The never-treated group follows a smooth, gentle upward trend throughout. By period 9, all treated cohorts have outcomes around 12.9&amp;ndash;14.6, substantially above the never-treated group&amp;rsquo;s 10.88 &amp;mdash; but they arrived at those levels at different times.&lt;/p>
&lt;p>The line plot below visualizes these divergent trajectories:&lt;/p>
&lt;pre>&lt;code class="language-python">cohort_means = data_stag.groupby([&amp;quot;first_treat&amp;quot;, &amp;quot;period&amp;quot;])[&amp;quot;outcome&amp;quot;].mean().unstack(level=0)
cohort_colors = {0: STEEL_BLUE, 3: WARM_ORANGE, 5: TEAL, 7: WHITE_TEXT}
cohort_labels = {0: &amp;quot;Never-treated&amp;quot;, 3: &amp;quot;Cohort 3&amp;quot;, 5: &amp;quot;Cohort 5&amp;quot;, 7: &amp;quot;Cohort 7&amp;quot;}
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
for ft in sorted(cohort_means.columns):
ax.plot(cohort_means.index, cohort_means[ft], &amp;quot;o-&amp;quot;,
color=cohort_colors[ft], linewidth=2, markersize=6,
label=cohort_labels[ft])
# Vertical lines at treatment onsets
for ft in [3, 5, 7]:
ax.axvline(x=ft - 0.5, color=cohort_colors[ft], linestyle=&amp;quot;--&amp;quot;,
linewidth=1.2, alpha=0.5)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Mean Outcome&amp;quot;)
ax.set_title(&amp;quot;Staggered Adoption: Cohort Mean Outcomes Over Time&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_staggered_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_staggered_trends.png" alt="Line plot showing four cohorts tracking together before treatment, then diverging upward at their respective treatment onset periods. Dashed vertical lines mark each cohort&amp;rsquo;s treatment timing.">&lt;/p>
&lt;p>The plot makes the staggered adoption pattern unmistakable. All four lines run in parallel during the early pre-treatment periods, then each treated cohort jumps upward at its treatment onset (marked by a dashed vertical line in the corresponding color). Cohort 3 (warm orange) diverges first at period 3, followed by cohort 5 (teal) at period 5, and cohort 7 (white) at period 7. The never-treated group (steel blue) continues its steady, gentle upward trend without any jump. This visualization explains &lt;em>why TWFE fails&lt;/em>: between periods 3 and 7, TWFE uses cohort 3 (already treated and elevated) as a comparison for cohort 7 (not yet treated). Since cohort 3&amp;rsquo;s outcomes are inflated by treatment, the comparison underestimates cohort 7&amp;rsquo;s true effect when it eventually adopts.&lt;/p>
&lt;h3 id="bacon-decomposition-diagnosing-twfe">Bacon decomposition: Diagnosing TWFE&lt;/h3>
&lt;p>The &lt;strong>Goodman-Bacon decomposition&lt;/strong> (&lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">Goodman-Bacon, 2021&lt;/a>) reveals exactly how TWFE constructs its estimate. The key insight is that the TWFE coefficient $\hat{\delta}$ is a weighted average of all possible 2x2 DiD comparisons between pairs of treatment cohorts:&lt;/p>
&lt;p>$$\hat{\delta}^{TWFE} = \sum_{k} s_{kU} \hat{\delta}_{kU} + \sum_{e \neq U} \sum_{l &amp;gt; e} \big( s_{el} \hat{\delta}_{el} + s_{le} \hat{\delta}_{le} \big)$$&lt;/p>
&lt;p>The first sum covers &lt;strong>clean comparisons&lt;/strong> between each treated cohort $k$ and the never-treated group $U$, weighted by $s_{kU}$. The double sum covers comparisons between pairs of treated cohorts: $\hat{\delta}_{el}$ compares earlier-treated ($e$) against later-treated ($l$) units, and $\hat{\delta}_{le}$ compares later-treated against earlier-treated units. The weights $s$ are proportional to each subsample&amp;rsquo;s size and the variance of the treatment indicator within each pair &amp;mdash; groups treated in the middle of the panel receive the most weight. Crucially, the weights sum to one, so the TWFE estimate is a proper weighted average.&lt;/p>
&lt;p>The three types of comparisons have very different reliability:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Treated vs never-treated&lt;/strong> ($\hat{\delta}_{kU}$): Clean comparisons using permanently untreated units as controls. These are the gold standard.&lt;/li>
&lt;li>&lt;strong>Earlier vs later treated&lt;/strong> ($\hat{\delta}_{el}$): Uses not-yet-treated units as controls. Valid as long as treatment has not yet affected the later cohort.&lt;/li>
&lt;li>&lt;strong>Later vs earlier treated&lt;/strong> ($\hat{\delta}_{le}$): The &lt;strong>forbidden comparisons&lt;/strong>. Uses already-treated units as controls. If treatment effects evolve over time, these comparisons are contaminated because the &amp;ldquo;controls&amp;rdquo; are themselves experiencing treatment effects.&lt;/li>
&lt;/ol>
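&lt;p>A deterministic toy shows why the third type is contaminated. Suppose the early cohort&amp;rsquo;s effect grows by 1 per period while the late cohort&amp;rsquo;s true effect is a constant 2 (illustrative numbers, not the tutorial&amp;rsquo;s data):&lt;/p>

```python
# Why a 'later vs earlier' comparison misleads: a deterministic 2x2
# that uses an already-treated early cohort as the control.
baseline = 10.0

def early(t):
    # Early cohort, treated from t=0: effect is t + 1 at period t.
    return baseline + (t + 1)

def late(t):
    # Late cohort, treated from t=2 onward with a constant effect of 2.
    return baseline + (2 if t >= 2 else 0)

# 2x2 DiD for the late cohort, t=1 (pre) vs t=2 (post), with the
# already-treated early cohort as the "control" group.
did = (late(2) - late(1)) - (early(2) - early(1))
print(did)   # (2) - (1) = 1: the true effect of 2 is understated
```

The "control" group&amp;rsquo;s outcome rose by 1 between the two periods purely because its own treatment effect was still growing, and that growth is subtracted from the late cohort&amp;rsquo;s true effect.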
&lt;pre>&lt;code class="language-python">bacon = BaconDecomposition()
bacon_results = bacon.fit(
data_stag, outcome=&amp;quot;outcome&amp;quot;, unit=&amp;quot;unit&amp;quot;,
time=&amp;quot;period&amp;quot;, first_treat=&amp;quot;first_treat&amp;quot;,
)
bacon_results.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=====================================================================================
Goodman-Bacon Decomposition of Two-Way Fixed Effects
=====================================================================================
Total observations: 3000
Treatment timing groups: 3
Never-treated units: 90
Total 2x2 comparisons: 9
-------------------------------------------------------------------------------------
TWFE Decomposition
-------------------------------------------------------------------------------------
TWFE Estimate: 2.1822
Weighted Sum of 2x2 Estimates: 2.1052
Decomposition Error: 0.076977
-------------------------------------------------------------------------------------
Weight Breakdown by Comparison Type
-------------------------------------------------------------------------------------
Comparison Type Weight Avg Effect Contribution
-------------------------------------------------------------------------------------
Treated vs Never-treated 0.4331 2.3745 1.0284
Earlier vs Later treated 0.2836 2.1999 0.6238
Later vs Earlier (forbidden) 0.2834 1.5989 0.4531
-------------------------------------------------------------------------------------
Total 1.0000 2.1052
-------------------------------------------------------------------------------------
WARNING: 28.3% of weight is on 'forbidden' comparisons where
already-treated units serve as controls. This can bias TWFE
when treatment effects are heterogeneous over time.
Consider using Callaway-Sant'Anna or other robust estimators.
=====================================================================================
&lt;/code>&lt;/pre>
&lt;p>The decomposition reveals that 28.3% of TWFE&amp;rsquo;s weight falls on forbidden comparisons &amp;mdash; cases where already-treated units serve as controls. These forbidden comparisons produce an average effect of only 1.60, substantially lower than the 2.37 from clean treated-vs-never-treated comparisons. This downward pull drags the TWFE estimate to 2.18, below the true treatment effect. The clean comparisons (treated vs never-treated) account for 43.3% of the weight and produce the most reliable estimates, while the earlier-vs-later comparisons (28.4% weight) sit in between. The decomposition error of 0.08 reflects higher-order interaction terms that the 2x2 decomposition does not fully capture.&lt;/p>
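&lt;p>Each row&amp;rsquo;s contribution in the table is simply its weight times its average effect, and the contributions sum to the weighted 2x2 total. Recomputing from the displayed (rounded) numbers recovers the table up to rounding error:&lt;/p>

```python
# Recomputing the decomposition table from its displayed (rounded)
# weights and average effects: contribution = weight * avg effect.
rows = {
    "treated_vs_never": (0.4331, 2.3745),
    "earlier_vs_later": (0.2836, 2.1999),
    "later_vs_earlier": (0.2834, 1.5989),  # the forbidden comparisons
}
contributions = {k: w * e for k, (w, e) in rows.items()}
total = sum(contributions.values())   # close to the reported 2.1052
print({k: round(v, 4) for k, v in contributions.items()}, round(total, 4))
```

This also makes the mechanism explicit: the forbidden row&amp;rsquo;s low average effect (1.60) earns 28.3% of the weight, which is exactly how it drags the weighted total below the clean comparisons&amp;rsquo; 2.37.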
&lt;p>The following plot visualizes the decomposition:&lt;/p>
&lt;pre>&lt;code class="language-python">bacon_df = bacon_results.to_dataframe()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.patch.set_linewidth(0)

# Left panel: scatter by comparison type
type_colors = {
    &amp;quot;Treated vs Never-treated&amp;quot;: STEEL_BLUE,
    &amp;quot;Earlier vs Later treated&amp;quot;: WARM_ORANGE,
    &amp;quot;Later vs Earlier (forbidden)&amp;quot;: &amp;quot;#e8856c&amp;quot;,
    &amp;quot;treated_vs_never&amp;quot;: STEEL_BLUE,
    &amp;quot;earlier_vs_later&amp;quot;: WARM_ORANGE,
    &amp;quot;later_vs_earlier&amp;quot;: &amp;quot;#e8856c&amp;quot;,
}
for comp_type in bacon_df[&amp;quot;comparison_type&amp;quot;].unique():
    subset = bacon_df[bacon_df[&amp;quot;comparison_type&amp;quot;] == comp_type]
    color = type_colors.get(comp_type, LIGHT_TEXT)
    axes[0].scatter(subset[&amp;quot;weight&amp;quot;], subset[&amp;quot;estimate&amp;quot;],
                    s=80, color=color, alpha=0.7, edgecolors=DARK_NAVY,
                    label=comp_type)
axes[0].axhline(y=bacon_results.twfe_estimate, color=WHITE_TEXT,
                linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.7,
                label=f&amp;quot;TWFE = {bacon_results.twfe_estimate:.2f}&amp;quot;)
axes[0].set_xlabel(&amp;quot;Weight&amp;quot;)
axes[0].set_ylabel(&amp;quot;2×2 DiD Estimate&amp;quot;)
axes[0].set_title(&amp;quot;Bacon Decomposition: Individual Comparisons&amp;quot;)
axes[0].legend(fontsize=9, loc=&amp;quot;lower right&amp;quot;)

# Right panel: bar chart of weights by type
type_summary = bacon_df.groupby(&amp;quot;comparison_type&amp;quot;).agg(
    weight=(&amp;quot;weight&amp;quot;, &amp;quot;sum&amp;quot;),
    avg_effect=(&amp;quot;estimate&amp;quot;, lambda x: np.average(
        x, weights=bacon_df.loc[x.index, &amp;quot;weight&amp;quot;])),
).reset_index()
bar_colors = [type_colors.get(t, LIGHT_TEXT)
              for t in type_summary[&amp;quot;comparison_type&amp;quot;]]
axes[1].barh(range(len(type_summary)), type_summary[&amp;quot;weight&amp;quot;],
             color=bar_colors, edgecolor=DARK_NAVY, height=0.6)
axes[1].set_yticks(range(len(type_summary)))
axes[1].set_yticklabels(type_summary[&amp;quot;comparison_type&amp;quot;], fontsize=10)
axes[1].set_xlabel(&amp;quot;Total Weight&amp;quot;)
axes[1].set_title(&amp;quot;Weight Distribution by Comparison Type&amp;quot;)
for i, (w, e) in enumerate(zip(type_summary[&amp;quot;weight&amp;quot;],
                               type_summary[&amp;quot;avg_effect&amp;quot;])):
    axes[1].text(w + 0.01, i, f&amp;quot;{w:.1%} (avg = {e:.2f})&amp;quot;,
                 va=&amp;quot;center&amp;quot;, fontsize=10)
plt.tight_layout()
plt.savefig(&amp;quot;did_bacon_decomposition.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_bacon_decomposition.png" alt="Two-panel Bacon decomposition plot. Left: scatter of individual 2x2 estimates colored by comparison type with TWFE reference line. Right: horizontal bars showing total weight by comparison type.">&lt;/p>
&lt;p>The left panel shows each individual 2x2 comparison as a point, colored by type. The forbidden comparisons (dark orange) cluster at lower effect estimates than the clean comparisons (steel blue), visually demonstrating how they pull TWFE downward. The right panel makes the weight problem stark: nearly a third of the total weight goes to comparisons where already-treated units masquerade as controls. For a policymaker relying on the TWFE estimate of 2.18, this contamination means the reported effect underestimates the true treatment impact.&lt;/p>
&lt;h2 id="callaway-santanna-the-modern-solution">Callaway-Sant&amp;rsquo;Anna: The modern solution&lt;/h2>
&lt;p>The &lt;strong>Callaway-Sant&amp;rsquo;Anna (CS) estimator&lt;/strong> (&lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">Callaway &amp;amp; Sant&amp;rsquo;Anna, 2021&lt;/a>) avoids forbidden comparisons entirely. Instead of a single pooled regression, CS starts from a fundamental building block &amp;mdash; the &lt;strong>group-time ATT&lt;/strong>:&lt;/p>
&lt;p>$$ATT(g, t) = E[Y_t(g) - Y_t(\infty) \mid G = g], \quad \text{for } t \geq g$$&lt;/p>
&lt;p>Here $g$ denotes the cohort (the period when a unit first becomes treated), $t$ is the current calendar period, $Y_t(g)$ is the potential outcome at time $t$ if first treated in period $g$, and $Y_t(\infty)$ is the potential outcome under perpetual non-treatment. The conditioning on $G = g$ restricts attention to units in cohort $g$. This yields a separate treatment effect estimate for each combination of cohort and calendar period, using only clean comparisons.&lt;/p>
&lt;p>With never-treated controls, the group-time ATT is identified as:&lt;/p>
&lt;p>$$ATT(g, t) = E[Y_t - Y_{g-1} \mid G = g] - E[Y_t - Y_{g-1} \mid G = \infty]$$&lt;/p>
&lt;p>In words: take the change in outcomes from the period just before treatment ($g - 1$) to the current period ($t$) for cohort $g$ units, and subtract the same change for never-treated units ($G = \infty$). This is a 2x2 DiD comparison that uses only the never-treated group as controls, eliminating all forbidden comparisons by construction.&lt;/p>
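This identification formula can be verified on a quick simulation. The sketch below (illustrative names and numbers, not the tutorial's dataset) builds one cohort first treated at $g = 4$ and a never-treated group that share a common time trend but differ in levels, then recovers the effect as a difference of mean changes:

```python
import numpy as np

rng = np.random.default_rng(42)
n, g, t, tau = 2000, 4, 6, 2.5       # cohort first treated at g = 4, evaluated at t = 6

alpha_g = rng.normal(5.0, 1.0, n)    # unit effects, cohort g (level differs from controls)
alpha_inf = rng.normal(3.0, 1.0, n)  # unit effects, never-treated group
trend = {g - 1: 1.0, t: 3.0}         # common time effects => parallel trends holds

y_g = {p: alpha_g + trend[p] + rng.normal(0, 0.5, n) for p in trend}
y_inf = {p: alpha_inf + trend[p] + rng.normal(0, 0.5, n) for p in trend}
y_g[t] = y_g[t] + tau                # effect appears only after treatment starts

# ATT(g, t): change for cohort g minus change for never-treated units
att = (y_g[t] - y_g[g - 1]).mean() - (y_inf[t] - y_inf[g - 1]).mean()
print(att)                           # close to the true tau = 2.5
```

The level gap between the groups never enters: differencing each unit against its own period $g - 1$ value removes it, and subtracting the never-treated change removes the common trend.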
&lt;h3 id="the-doubly-robust-estimator">The doubly robust estimator&lt;/h3>
&lt;p>In practice, Callaway and Sant&amp;rsquo;Anna implement a &lt;strong>doubly robust&lt;/strong> version of this estimator. Before diving into the formal equation, here is the core idea: the doubly robust estimator adjusts the comparison between treated and control units in &lt;em>two&lt;/em> ways simultaneously &amp;mdash; by reweighting the control group to look more similar to the treated group (inverse-probability weighting), and by directly modeling and subtracting the expected outcome change for controls (outcome regression). Think of it as wearing both a belt &lt;em>and&lt;/em> suspenders: if either adjustment is correctly specified, the estimate is valid, even if the other one is wrong. This double protection makes the estimator more reliable than methods that rely on a single modeling assumption.&lt;/p>
&lt;p>The formal equation combines inverse-probability weighting with an outcome regression adjustment:&lt;/p>
&lt;p>$$ATT(g, t) = \mathbb{E}\left[\left(\frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)\,C}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)\,C}{1-p_g(X)}\right]}\right)\left(Y_t - Y_{g-1} - m_{g,t}^{nev}(X)\right)\right]$$&lt;/p>
&lt;p>where $C$ is an indicator for belonging to the never-treated group ($C = 1$ if $G = \infty$, and $0$ otherwise).&lt;/p>
&lt;p>This equation multiplies two terms inside the expectation &amp;mdash; a &lt;strong>weighting term&lt;/strong> (first parentheses) and an &lt;strong>outcome term&lt;/strong> (second parentheses). Let us unpack each one.&lt;/p>
&lt;p>&lt;strong>The weighting term:&lt;/strong> $\frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)\,C}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)\,C}{1-p_g(X)}\right]}$&lt;/p>
&lt;p>This term determines &lt;em>how much each observation contributes&lt;/em> to the ATT estimate. It works differently for treated and control units:&lt;/p>
&lt;ul>
&lt;li>$G_g$ is a &lt;strong>group indicator&lt;/strong> that equals 1 if the unit belongs to cohort $g$ and 0 otherwise. Dividing by $\mathbb{E}[G_g]$ (the share of units in cohort $g$) normalizes so that treated units receive equal weight on average. For a treated unit in cohort $g$, the first fraction contributes a positive value; for never-treated units, $G_g = 0$ so the first fraction is zero.&lt;/li>
&lt;li>$p_g(X)$ is the &lt;strong>generalized propensity score&lt;/strong> &amp;mdash; the probability of being in cohort $g$ (rather than the never-treated group) given covariates $X$. This is estimated via logit regression of cohort membership on covariates. The ratio $\frac{p_g(X)}{1-p_g(X)}$ gives the odds of being in cohort $g$, and dividing by its expectation normalizes the weights. The indicator $C$ switches the second fraction on only for never-treated units, where it creates a &lt;strong>negative weight&lt;/strong> that is larger for control units whose covariates resemble the treated cohort &amp;mdash; effectively selecting the most comparable controls. For treated units, $C = 0$, so the second fraction vanishes and only the positive weight from the first fraction remains.&lt;/li>
&lt;/ul>
&lt;p>The intuition is similar to propensity score matching: if a never-treated city has covariates (population, per-student spending, teacher-student ratio) that look very much like a treated city, it receives a larger (more negative) weight, making it contribute more as a counterfactual. Cities with covariates far from the treated group receive near-zero weight. This &lt;strong>rebalances&lt;/strong> the control group so that the covariate distribution of the weighted controls matches that of the treated cohort.&lt;/p>
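The rebalancing can be seen in a minimal sketch with a single binary covariate, where the propensity score can be estimated by simple stratum shares (a simplification for brevity; real applications fit a logit). The odds weights make the reweighted control covariate mean match the treated mean:

```python
import numpy as np

# Illustrative data: 100 controls (50/50 split on X), 100 treated (20/80 split)
X = np.array([0] * 50 + [1] * 50 + [0] * 20 + [1] * 80)  # binary covariate
G = np.array([0] * 100 + [1] * 100)                      # 1 = cohort g, 0 = never-treated

p = {x: G[X == x].mean() for x in (0, 1)}       # propensity P(G = g | X = x) by stratum
odds = np.array([p[x] / (1 - p[x]) for x in X])  # p(X) / (1 - p(X))

ctrl = G == 0
w = odds[ctrl] / odds[ctrl].mean()               # normalized weights for controls

print(X[G == 1].mean())                          # treated covariate mean: 0.8
print(np.average(X[ctrl], weights=w))            # reweighted control mean: ~0.8
```

The raw control group has only 50% of units with $X = 1$ versus 80% among the treated; after odds-weighting, the weighted control distribution of $X$ matches the treated distribution exactly.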
&lt;p>&lt;strong>The outcome term:&lt;/strong> $Y_t - Y_{g-1} - m_{g,t}^{nev}(X)$&lt;/p>
&lt;p>This term measures the &lt;strong>adjusted outcome change&lt;/strong> for each unit:&lt;/p>
&lt;ul>
&lt;li>$Y_t - Y_{g-1}$ is the raw change in outcomes from the baseline period ($g - 1$, the period just before cohort $g$ starts treatment) to the current period $t$. This is the same first difference used in any DiD estimator.&lt;/li>
&lt;li>$m_{g,t}^{nev}(X)$ is the &lt;strong>outcome regression adjustment&lt;/strong> &amp;mdash; the expected change $E[Y_t - Y_{g-1} \mid X, G = \infty]$ for never-treated units with covariates $X$. In practice, this is estimated by regressing the outcome change $\Delta Y = Y_t - Y_{g-1}$ on covariates $X$ using only the never-treated group. Subtracting $m_{g,t}^{nev}(X)$ removes the portion of the outcome change that would have occurred &lt;em>anyway&lt;/em> based on observable characteristics &amp;mdash; even without treatment. What remains is the treatment-induced change that cannot be explained by covariates alone.&lt;/li>
&lt;/ul>
&lt;p>Think of it this way: if cities with higher per-student spending tend to improve learning scores faster regardless of AI adoption, $m_{g,t}^{nev}(X)$ captures that covariate-driven growth trajectory. Subtracting it ensures that the estimated treatment effect is not confounded by differential growth rates across different types of cities.&lt;/p>
&lt;p>&lt;strong>Why &amp;ldquo;doubly robust&amp;rdquo;?&lt;/strong> The estimator combines &lt;em>both&lt;/em> adjustment strategies &amp;mdash; inverse-probability weighting (through the weighting term) and outcome regression (through $m_{g,t}^{nev}(X)$). The key advantage is that the ATT estimate is consistent if &lt;em>either&lt;/em> the propensity score model or the outcome regression model is correctly specified &amp;mdash; both do not need to be right simultaneously. If the propensity score model is wrong but the outcome regression is correct, the $m_{g,t}^{nev}(X)$ adjustment still removes confounding. If the outcome regression is wrong but the propensity score is correct, the reweighting still produces a valid comparison group. This double layer of protection makes the estimator more reliable in practice than methods relying on a single modeling assumption.&lt;/p>
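A compact numerical sketch conveys the doubly robust recipe. The data and the stratum-based nuisance estimates below are our own construction for illustration (not the diff-diff implementation): a binary covariate drives both selection into the cohort and the untreated trend, so a naive difference in mean changes is confounded, while the doubly robust combination recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau = 4000, 2.0

# Covariate X raises both the chance of treatment and the untreated outcome trend
X = rng.binomial(1, 0.5, n)
G = rng.binomial(1, np.where(X == 1, 0.7, 0.3))        # 1 = cohort g, 0 = never-treated
dY = 1.0 + 2.0 * X + tau * G + rng.normal(0, 0.5, n)   # outcome change Y_t - Y_{g-1}

# Nuisance estimates by stratum: propensity p(x), control outcome model m(x)
p = np.array([G[X == x].mean() for x in (0, 1)])[X]
m = np.array([dY[(X == x) & (G == 0)].mean() for x in (0, 1)])[X]

w_treat = G / G.mean()                  # first fraction of the weighting term
odds = p * (1 - G) / (1 - p)            # odds, switched on for controls only (C = 1)
w_ctrl = odds / odds.mean()             # normalized comparison weights
att_dr = np.mean((w_treat - w_ctrl) * (dY - m))

print(att_dr)                           # close to the true ATT of 2.0
naive = dY[G == 1].mean() - dY[G == 0].mean()
print(naive)                            # biased upward by covariate-driven trends
```

The naive comparison lands near 2.8 because treated units disproportionately have $X = 1$ and therefore faster untreated growth; the weighted, regression-adjusted estimate strips that confounding out.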
&lt;p>&lt;strong>Note on the no-covariate case:&lt;/strong> In this tutorial, we do not pass covariates to &lt;code>CallawaySantAnna()&lt;/code>. Without covariates, the propensity score $p_g(X)$ reduces to the unconditional probability of being in cohort $g$ (simply the group share), and $m_{g,t}^{nev}(X)$ reduces to the simple mean outcome change among never-treated units. The doubly robust estimator then collapses to the basic difference-in-means formula shown earlier. The full equation is presented here because it is the general form that practitioners encounter when working with real data and covariates.&lt;/p>
&lt;p>The group-time ATTs are then &lt;strong>aggregated&lt;/strong> into summary parameters. Any summary is a weighted average of the building blocks:&lt;/p>
&lt;p>$$\theta = \sum_{g} \sum_{t \geq g} w_{g,t} \cdot ATT(g, t), \quad \sum_{g,t} w_{g,t} = 1$$&lt;/p>
&lt;p>Two aggregations are especially useful. The &lt;strong>overall ATT&lt;/strong> weights by cohort size:&lt;/p>
&lt;p>$$\theta^{O} = \sum_{g} \theta(g) \cdot P(G = g), \quad \text{where } \theta(g) = \frac{1}{T - g + 1} \sum_{t=g}^{T} ATT(g, t)$$&lt;/p>
&lt;p>The &lt;strong>event study aggregation&lt;/strong> averages across cohorts at each relative time $e$ (periods since treatment onset):&lt;/p>
&lt;p>$$\theta_D(e) = \sum_{g} ATT(g, g + e) \cdot P(G = g \mid g + e \leq T)$$&lt;/p>
&lt;p>This event study aggregation is the CS analogue of the leads-and-lags event study, but free from the forbidden comparison contamination that plagues TWFE-based event studies.&lt;/p>
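Both aggregations are simple weighted averages and can be sketched directly from a table of group-time ATTs (the numbers below are invented for illustration, with $T = 4$ periods and two cohorts):

```python
# Hypothetical group-time ATTs: cohort g -> {calendar period t: ATT(g, t)}
att_gt = {2: {2: 1.0, 3: 1.5, 4: 2.0}, 3: {3: 0.8, 4: 1.2}}
cohort_n = {2: 60, 3: 40}       # cohort sizes among ever-treated units
T = 4

# Overall ATT: average each cohort's ATT(g, t) over t, then weight by cohort share
share = {g: n / sum(cohort_n.values()) for g, n in cohort_n.items()}
theta_overall = sum(share[g] * sum(atts.values()) / len(atts)
                    for g, atts in att_gt.items())
print(theta_overall)            # 0.6 * 1.5 + 0.4 * 1.0 = 1.3

# Event-study aggregation: average ATT(g, g + e) over cohorts observed at e
def theta_event(e):
    cohorts = [g for g in att_gt if g + e <= T and (g + e) in att_gt[g]]
    total = sum(cohort_n[g] for g in cohorts)
    return sum(cohort_n[g] / total * att_gt[g][g + e] for g in cohorts)

print(theta_event(0))           # 0.6 * 1.0 + 0.4 * 0.8 = 0.92
print(theta_event(2))           # only cohort 2 is observed at e = 2: 2.0
```

Note how the event-study weights renormalize at each relative period: at $e = 2$ only the early cohort is still observed, so it receives all the weight, which is why long-horizon event-study estimates lean on the earliest-treated cohorts.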
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>CallawaySantAnna()&lt;/code>&lt;/a> class takes &lt;code>control_group&lt;/code> to specify which units serve as controls. Using &lt;code>&amp;quot;never_treated&amp;quot;&lt;/code> restricts comparisons to units that never received treatment, the cleanest possible counterfactual. The &lt;code>base_period=&amp;quot;universal&amp;quot;&lt;/code> option uses a single reference period ($g - 1$) for all relative time comparisons within each cohort, rather than letting each relative period use its own baseline. This ensures that the pre-treatment coefficients are proper placebo tests: each one measures the outcome change from $g - 1$ to an earlier period, so a coefficient near zero means the treated and control groups were evolving similarly over that specific interval. With a universal base period, the period immediately before treatment ($e = -1$) is normalized to zero by construction.&lt;/p>
&lt;pre>&lt;code class="language-python">cs = CallawaySantAnna(control_group=&amp;quot;never_treated&amp;quot;, base_period=&amp;quot;universal&amp;quot;)
results_cs = cs.fit(
    data_stag, outcome=&amp;quot;outcome&amp;quot;, unit=&amp;quot;unit&amp;quot;,
    time=&amp;quot;period&amp;quot;, first_treat=&amp;quot;first_treat&amp;quot;,
    aggregate=&amp;quot;event_study&amp;quot;,
)
results_cs.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=====================================================================================
Callaway-Sant'Anna Staggered Difference-in-Differences Results
=====================================================================================
Total observations: 3000
Treated units: 210
Never-treated units: 90
Treatment cohorts: 3
Time periods: 10
Control group: never_treated
Base period: universal
-------------------------------------------------------------------------------------
Overall Average Treatment Effect on the Treated
-------------------------------------------------------------------------------------
Parameter Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
-------------------------------------------------------------------------------------
ATT 2.4136 0.0552 43.753 0.0000 ***
-------------------------------------------------------------------------------------
95% Confidence Interval: [2.3055, 2.5217]
-------------------------------------------------------------------------------------
Event Study (Dynamic) Effects
-------------------------------------------------------------------------------------
Rel. Period Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
-------------------------------------------------------------------------------------
-7 -0.1344 0.1171 -1.148 0.2510
-6 -0.0188 0.1126 -0.167 0.8671
-5 -0.1435 0.0813 -1.766 0.0774 .
-4 -0.0091 0.0744 -0.122 0.9028
-3 -0.0697 0.0560 -1.244 0.2134
-2 -0.0709 0.0631 -1.124 0.2610
-1 0.0000 nan nan nan
0 1.9713 0.0645 30.551 0.0000 ***
1 2.1416 0.0577 37.124 0.0000 ***
2 2.2969 0.0644 35.644 0.0000 ***
3 2.6763 0.0796 33.642 0.0000 ***
4 2.7925 0.0800 34.898 0.0000 ***
5 3.0259 0.1227 24.669 0.0000 ***
6 3.2663 0.1090 29.961 0.0000 ***
-------------------------------------------------------------------------------------
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
=====================================================================================
&lt;/code>&lt;/pre>
&lt;p>The overall CS estimate of the ATT is 2.41 (SE = 0.06, p &amp;lt; 0.001), with a 95% CI of [2.31, 2.52]. This is higher than the TWFE estimate of 2.18, confirming that TWFE was biased downward by the forbidden comparisons. The event study reveals dynamic effects that grow over time: the effect starts at 1.97 in the first period after treatment and increases to 3.27 by six periods post-treatment. This pattern of growing effects is exactly the scenario where TWFE fails most dramatically &amp;mdash; the forbidden comparisons use units with large accumulated effects as controls for newly-treated units, producing a downward-biased average.&lt;/p>
&lt;p>With the universal base period, relative period -1 is the reference and is normalized to zero by construction. The remaining pre-treatment estimates all hover near zero &amp;mdash; the largest in magnitude is -0.14 at relative period -5 (p = 0.08), which does not reach significance at the 5% level. None of the six estimated pre-treatment coefficients is individually significant, providing clean support for the parallel trends assumption. This contrasts with the varying base period specification, where each pre-treatment coefficient uses a different baseline, making the placebo tests harder to interpret collectively.&lt;/p>
&lt;p>The event study plot visualizes these dynamics, showing how the treatment effect builds over time relative to treatment onset:&lt;/p>
&lt;pre>&lt;code class="language-python">cs_df = results_cs.to_dataframe(&amp;quot;event_study&amp;quot;)
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
pre_cs = cs_df[cs_df[&amp;quot;relative_period&amp;quot;] &amp;lt; 0]
post_cs = cs_df[cs_df[&amp;quot;relative_period&amp;quot;] &amp;gt;= 0]
ax.errorbar(pre_cs[&amp;quot;relative_period&amp;quot;], pre_cs[&amp;quot;effect&amp;quot;],
            yerr=1.96 * pre_cs[&amp;quot;se&amp;quot;], fmt=&amp;quot;o&amp;quot;, color=STEEL_BLUE,
            capsize=4, linewidth=2, markersize=8, label=&amp;quot;Pre-treatment&amp;quot;)
ax.errorbar(post_cs[&amp;quot;relative_period&amp;quot;], post_cs[&amp;quot;effect&amp;quot;],
            yerr=1.96 * post_cs[&amp;quot;se&amp;quot;], fmt=&amp;quot;s&amp;quot;, color=TEAL,
            capsize=4, linewidth=2, markersize=8, label=&amp;quot;Post-treatment&amp;quot;)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1, alpha=0.5)
ax.axvline(x=-0.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.5)
ax.set_xlabel(&amp;quot;Periods Relative to Treatment&amp;quot;)
ax.set_ylabel(&amp;quot;Estimated ATT&amp;quot;)
ax.set_title(&amp;quot;Callaway-Sant'Anna: Event Study for Staggered Adoption&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
plt.savefig(&amp;quot;did_staggered_att.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_staggered_att.png" alt="Callaway-Sant&amp;rsquo;Anna event study plot showing pre-treatment effects near zero (with period -1 normalized to zero) and post-treatment effects growing steadily from about 2.0 to 3.3.">&lt;/p>
&lt;p>The CS event study plot shows the hallmark pattern of a valid DiD analysis: pre-treatment coefficients (steel blue) cluster tightly around zero &amp;mdash; with relative period -1 pinned at exactly zero as the universal base period &amp;mdash; then post-treatment coefficients (teal) rise sharply and progressively. The upward slope in the post-treatment period reveals that the treatment effect accumulates over time, growing from roughly 2.0 immediately after treatment to 3.3 six periods later. This dynamic pattern would have been obscured by TWFE&amp;rsquo;s single pooled estimate and further distorted by its forbidden comparisons.&lt;/p>
&lt;h2 id="choosing-the-right-estimator">Choosing the right estimator&lt;/h2>
&lt;p>With multiple DiD estimators available, the choice depends on the data structure. The following decision flowchart guides the selection:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;&amp;lt;b&amp;gt;Panel data with&amp;lt;br/&amp;gt;treatment &amp;amp; control&amp;lt;/b&amp;gt;&amp;quot;] --&amp;gt; B{&amp;quot;Single treatment&amp;lt;br/&amp;gt;period?&amp;quot;}
B --&amp;gt;|Yes| C[&amp;quot;&amp;lt;b&amp;gt;Classic 2×2 DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;DifferenceInDifferences()&amp;quot;]
B --&amp;gt;|No| D{&amp;quot;Staggered&amp;lt;br/&amp;gt;adoption?&amp;quot;}
D --&amp;gt;|&amp;quot;No&amp;lt;br/&amp;gt;(same timing)&amp;quot;| E[&amp;quot;&amp;lt;b&amp;gt;Multi-Period DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;MultiPeriodDiD()&amp;quot;]
D --&amp;gt;|Yes| F{&amp;quot;Never-treated&amp;lt;br/&amp;gt;group available?&amp;quot;}
F --&amp;gt;|Yes| G[&amp;quot;&amp;lt;b&amp;gt;Callaway-Sant'Anna&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CallawaySantAnna()&amp;quot;]
F --&amp;gt;|No| H[&amp;quot;&amp;lt;b&amp;gt;Sun-Abraham / Stacked DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;SunAbraham() / StackedDiD()&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;(not covered here)&amp;lt;/i&amp;gt;&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#6a9bcc,stroke:#141413,color:#fff
style G fill:#00d4c8,stroke:#141413,color:#fff
style H fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The following table summarizes when to use each estimator:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Scenario&lt;/th>
&lt;th>Estimator&lt;/th>
&lt;th>Advantage&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single treatment time, 2 groups&lt;/td>
&lt;td>&lt;code>DifferenceInDifferences()&lt;/code>&lt;/td>
&lt;td>Simplest, most transparent&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Single treatment time, many periods&lt;/td>
&lt;td>&lt;code>MultiPeriodDiD()&lt;/code>&lt;/td>
&lt;td>Period-by-period effects, pre-trend test&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Staggered, never-treated available&lt;/td>
&lt;td>&lt;code>CallawaySantAnna()&lt;/code>&lt;/td>
&lt;td>Clean comparisons, flexible aggregation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Staggered, no never-treated group&lt;/td>
&lt;td>&lt;code>SunAbraham()&lt;/code>&lt;/td>
&lt;td>Interaction-weighted, uses not-yet-treated&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Diagnosing TWFE bias&lt;/td>
&lt;td>&lt;code>BaconDecomposition()&lt;/code>&lt;/td>
&lt;td>Reveals forbidden comparison weights&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The decision logic is straightforward: if all treated units start at the same time, use the classic estimator or the multi-period event study. If treatment timing varies, use Callaway-Sant&amp;rsquo;Anna (or Sun-Abraham if no never-treated group exists). Always run Bacon decomposition on TWFE results to check for contamination from forbidden comparisons. The &lt;code>diff-diff&lt;/code> package also offers &lt;code>SyntheticDiD()&lt;/code>, &lt;code>ImputationDiD()&lt;/code>, and &lt;code>ContinuousDiD()&lt;/code> for specialized settings, but the estimators above cover the vast majority of applied research.&lt;/p>
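The same decision logic can be written as a small helper (the function is our own sketch, not part of the diff-diff package; the returned strings are the estimator class names used in this tutorial):

```python
def choose_estimator(single_treatment_period: bool, staggered: bool,
                     never_treated_available: bool) -> str:
    """Suggest an estimator following the decision flowchart above."""
    if single_treatment_period:
        return "DifferenceInDifferences"  # classic 2x2 setting
    if not staggered:
        return "MultiPeriodDiD"           # common timing, many periods
    if never_treated_available:
        return "CallawaySantAnna"         # clean never-treated comparisons
    return "SunAbraham"                   # or StackedDiD, using not-yet-treated

print(choose_estimator(False, True, True))   # CallawaySantAnna
print(choose_estimator(False, True, False))  # SunAbraham
```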
&lt;h2 id="sensitivity-analysis-honestdid">Sensitivity analysis: HonestDiD&lt;/h2>
&lt;p>Every DiD analysis rests on parallel trends &amp;mdash; but this assumption is fundamentally &lt;strong>untestable&lt;/strong> for the post-treatment period. Pre-treatment trend tests (Section 6) check whether trends were parallel &lt;em>before&lt;/em> treatment, but they cannot guarantee that trends would have remained parallel &lt;em>after&lt;/em> treatment in the absence of the intervention. A new regulation might coincide with an economic downturn that affects treated regions differently, violating parallel trends even though pre-trends looked clean.&lt;/p>
&lt;p>&lt;strong>HonestDiD&lt;/strong> (&lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">Rambachan &amp;amp; Roth, 2023&lt;/a>) addresses this problem directly. Instead of assuming parallel trends hold exactly, it bounds the degree of violation using a &lt;strong>relative magnitudes restriction&lt;/strong>. Let $\delta_t = E[Y^0_t - Y^0_{t-1} \mid G = g] - E[Y^0_t - Y^0_{t-1} \mid G = \infty]$ denote the parallel trends violation at period $t$ &amp;mdash; the difference in untreated outcome trends between the treated cohort and the never-treated group. HonestDiD constrains the post-treatment violations relative to the largest pre-treatment violation:&lt;/p>
&lt;p>$$|\delta_t| \leq M \cdot \max_{t' &amp;lt; g} |\delta_{t'}|, \quad \text{for all } t \geq g$$&lt;/p>
&lt;p>The parameter $M$ controls the degree of allowed departure. At $M = 0$, the method assumes perfect parallel trends ($\delta_t = 0$ for all post-treatment periods) and recovers the standard CI. As $M$ increases, it allows for progressively larger post-treatment violations, widening the robust CI. The &lt;strong>breakdown value&lt;/strong> of $M$ is where the CI first includes zero &amp;mdash; the point at which the treatment conclusion becomes fragile.&lt;/p>
&lt;p>Think of $M$ as a stress test dial. Turning it up to $M = 1$ says: &amp;ldquo;The worst post-treatment violation could be as large as the worst thing we saw pre-treatment.&amp;rdquo; Turning it to $M = 5$ says: &amp;ldquo;The violation could be five times worse.&amp;rdquo; If the effect remains significant even at high $M$, the finding is genuinely robust.&lt;/p>
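Before running the estimator, a back-of-envelope sketch conveys the mechanics. The real HonestDiD interval comes from an optimization over admissible violation paths; here we simply widen the $M = 0$ interval by $M$ times the largest pre-treatment deviation, which happens to reproduce this tutorial's printed CIs up to rounding (base numbers are taken from the output shown below):

```python
# Illustrative linear-widening approximation, NOT the Rambachan-Roth construction
base_lb, base_ub = 2.5324, 2.6592   # robust CI at M = 0 (from this tutorial's output)
delta_max = 0.1435                  # largest pre-treatment deviation (rel. period -5)

def robust_ci(M):
    """Widen each side of the M = 0 interval by M times delta_max."""
    return base_lb - M * delta_max, base_ub + M * delta_max

for M in (1.0, 5.0, 15.0):
    lb, ub = robust_ci(M)
    print(f"M = {M:>4}: CI = [{lb:.4f}, {ub:.4f}]")

# The lower bound crosses zero (the breakdown point) near:
print(f"M = {base_lb / delta_max:.1f}")
```

Under this approximation the breakdown point sits near $M \approx 17.6$, consistent with the lower bound remaining positive across the entire grid up to $M = 15$ in the results below.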
&lt;pre>&lt;code class="language-python">M_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0, 10.0, 12.0, 15.0]
sensitivity = []
for M in M_values:
    honest = HonestDiD(method=&amp;quot;relative_magnitude&amp;quot;, M=M)
    hres = honest.fit(results_cs)
    sensitivity.append({
        &amp;quot;M&amp;quot;: M,
        &amp;quot;ci_lb&amp;quot;: hres.ci_lb,
        &amp;quot;ci_ub&amp;quot;: hres.ci_ub,
        &amp;quot;significant&amp;quot;: hres.ci_lb &amp;gt; 0,
    })
    print(f&amp;quot;M = {M:.1f}: CI = [{hres.ci_lb:.4f}, {hres.ci_ub:.4f}]&amp;quot;
          f&amp;quot; {'significant' if hres.ci_lb &amp;gt; 0 else 'includes zero'}&amp;quot;)
sens_df = pd.DataFrame(sensitivity)

# Find breakdown point
breakdown_M = (sens_df[~sens_df[&amp;quot;significant&amp;quot;]][&amp;quot;M&amp;quot;].min()
               if not sens_df[&amp;quot;significant&amp;quot;].all()
               else sens_df[&amp;quot;M&amp;quot;].max())
print(f&amp;quot;\nBreakdown value of M: {breakdown_M:.1f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>M = 0.0: CI = [2.5324, 2.6592] significant
M = 0.5: CI = [2.4606, 2.7310] significant
M = 1.0: CI = [2.3889, 2.8028] significant
M = 1.5: CI = [2.3171, 2.8745] significant
M = 2.0: CI = [2.2453, 2.9463] significant
M = 3.0: CI = [2.1018, 3.0898] significant
M = 4.0: CI = [1.9583, 3.2334] significant
M = 5.0: CI = [1.8148, 3.3769] significant
M = 7.0: CI = [1.5277, 3.6639] significant
M = 10.0: CI = [1.0971, 4.0945] significant
M = 12.0: CI = [0.8101, 4.3816] significant
M = 15.0: CI = [0.3795, 4.8122] significant
Breakdown value of M: 15.0
&lt;/code>&lt;/pre>
&lt;p>At $M = 0$ (perfect parallel trends), the CI is narrow: [2.53, 2.66]. As $M$ increases, the CI widens symmetrically. At $M = 10$, the lower bound remains comfortably positive (1.10), and even at $M = 15$, it barely stays above zero (0.38). The breakdown value exceeds $M = 15$ &amp;mdash; the treatment effect remains statistically significant even if post-treatment violations of parallel trends are more than 15 times larger than the worst pre-treatment deviation. This is exceptionally robust &amp;mdash; in practice, a breakdown value above $M = 3$ is considered strong evidence that the finding is not driven by parallel trends violations. The improvement over the varying base period specification (which had a breakdown of $M = 12$) reflects the universal base period&amp;rsquo;s tighter pre-treatment estimates, which give HonestDiD a smaller &amp;ldquo;worst pre-treatment deviation&amp;rdquo; to scale against.&lt;/p>
&lt;p>The sensitivity plot maps the robust CI as a function of $M$, making the breakdown point visually apparent:&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.fill_between(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_lb&amp;quot;], sens_df[&amp;quot;ci_ub&amp;quot;],
                alpha=0.25, color=STEEL_BLUE, label=&amp;quot;95% Robust CI&amp;quot;)
ax.plot(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_lb&amp;quot;], &amp;quot;-&amp;quot;, color=STEEL_BLUE, linewidth=2)
ax.plot(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_ub&amp;quot;], &amp;quot;-&amp;quot;, color=STEEL_BLUE, linewidth=2)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1.5, alpha=0.7)
att_val = results_cs.overall_att
ax.axhline(y=att_val, color=TEAL, linestyle=&amp;quot;:&amp;quot;, linewidth=1.5,
           alpha=0.7, label=f&amp;quot;Overall ATT = {att_val:.2f}&amp;quot;)
ax.axvline(x=breakdown_M, color=WARM_ORANGE, linestyle=&amp;quot;--&amp;quot;,
           linewidth=2, alpha=0.8,
           label=f&amp;quot;Breakdown (M = {breakdown_M:.1f})&amp;quot;)
ax.set_xlabel(&amp;quot;Sensitivity Parameter M\n&amp;quot;
              &amp;quot;(maximum post-treatment violation relative to &amp;quot;
              &amp;quot;largest pre-treatment violation)&amp;quot;)
ax.set_ylabel(&amp;quot;Treatment Effect (ATT)&amp;quot;)
ax.set_title(&amp;quot;HonestDiD Sensitivity Analysis: Robustness of the ATT&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
plt.savefig(&amp;quot;did_honest_sensitivity.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_honest_sensitivity.png" alt="HonestDiD sensitivity plot showing the 95% robust CI widening as M increases. The CI band is steel blue, the ATT is a teal dotted line, and the breakdown point at M=15 is marked with an orange dashed line.">&lt;/p>
&lt;p>The sensitivity plot tells the robustness story at a glance. The steel blue band shows the 95% robust CI expanding as $M$ grows &amp;mdash; allowing for larger violations of parallel trends. The teal dotted line marks the overall ATT of 2.41, which sits comfortably within the CI for all values of $M$. The warm orange dashed line at $M = 15$ marks the boundary of our grid, with the lower CI bound still positive (0.38) at that point &amp;mdash; the true breakdown lies even further out. In practical terms, the treatment conclusion would only be overturned if post-treatment parallel trend violations were more than 15 times worse than anything observed in the pre-treatment data &amp;mdash; an extreme scenario that would require a dramatic structural break coinciding precisely with the treatment timing.&lt;/p>
&lt;p>Best practice is to always report the breakdown value alongside the point estimate. A finding with a breakdown at $M = 0.5$ is fragile &amp;mdash; even mild violations destroy the conclusion. A finding with a breakdown at $M = 15$ or above, as in this example, provides strong evidence that the effect is genuine regardless of moderate parallel trends violations.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>Returning to the motivating question &amp;mdash; did AI tutoring actually improve learning? &amp;mdash; the evidence from both the classic and modern DiD estimators is clear: treatment produced a genuine, statistically significant positive effect. In the 2x2 setting, the estimated ATT of 5.12 (95% CI: [4.64, 5.60]) closely matches the true effect of 5.0, confirming that the classic estimator works well when all units start treatment simultaneously. The event study further validates this finding by showing near-zero pre-treatment coefficients (the largest is -0.52 with p = 0.31) and stable post-treatment effects around 4.7&amp;ndash;5.0.&lt;/p>
&lt;p>The staggered adoption setting reveals a more nuanced picture. Naive TWFE estimation produces a biased estimate of 2.18, pulled downward by the 28.3% weight on forbidden comparisons where already-treated units serve as controls. The Callaway-Sant&amp;rsquo;Anna estimator corrects this bias, finding an overall ATT of 2.41 &amp;mdash; and the event study shows that the effect is not constant but grows over time, from 1.97 immediately after treatment to 3.27 six periods later. For an education policymaker, this dynamic pattern means the AI initiative&amp;rsquo;s full benefits take time to materialize: evaluating the program too early would underestimate its long-run impact.&lt;/p>
&lt;p>The HonestDiD sensitivity analysis provides the final piece of evidence. With a breakdown value exceeding $M = 15$, the treatment conclusion is robust to post-treatment parallel trends violations more than 15 times larger than anything observed pre-treatment. This level of robustness far exceeds the $M = 3$ threshold typically considered strong in applied research. Even a skeptic who doubts the parallel trends assumption would find it difficult to argue that the treatment had no effect.&lt;/p>
&lt;p>Two important caveats apply. First, these results use synthetic data with known true effects, so the estimators are guaranteed to work under their assumptions. Real-world applications face additional challenges &amp;mdash; measurement error in learning assessments, spillover effects between treated and control cities (e.g., students in control cities accessing AI tools on their own), and the possibility that AI adoption depends on unobserved factors correlated with learning outcomes. Second, the treatment effects in the staggered dataset grow linearly over time by construction. In practice, effects may follow more complex trajectories &amp;mdash; plateauing, fading out, or accelerating &amp;mdash; which would require careful specification of the event study window and aggregation weights.&lt;/p>
&lt;h2 id="summary-and-key-takeaways">Summary and key takeaways&lt;/h2>
&lt;p>This tutorial walked through the DiD toolkit from its simplest form to its most robust modern extensions. Four key takeaways emerge:&lt;/p>
&lt;p>&lt;strong>Method insight:&lt;/strong> DiD targets the &lt;strong>ATT&lt;/strong> by using untreated units as a counterfactual for how treated units would have evolved without intervention. The classic 2x2 estimator (ATT = 5.12, SE = 0.25) works well when all units start treatment simultaneously, but staggered adoption requires modern estimators like Callaway-Sant&amp;rsquo;Anna to avoid TWFE&amp;rsquo;s forbidden comparison bias.&lt;/p>
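&lt;p>As a concrete illustration of the 2x2 logic, the ATT is simply a difference of two differences in cell means. The sketch below uses made-up numbers rather than the synthetic data from the tutorial:&lt;/p>

```python
# Minimal 2x2 DiD on a toy dataset (illustrative numbers only):
# rows are (group, period, outcome).
data = [
    ("treated", "pre", 10.0), ("treated", "pre", 12.0),
    ("treated", "post", 20.0), ("treated", "post", 22.0),
    ("control", "pre", 8.0), ("control", "pre", 10.0),
    ("control", "post", 12.0), ("control", "post", 14.0),
]

def cell_mean(data, group, period):
    """Average outcome in one of the four group-by-period cells."""
    vals = [y for g, p, y in data if g == group and p == period]
    return sum(vals) / len(vals)

# ATT = (treated post - treated pre) - (control post - control pre):
# the control group change estimates the counterfactual trend.
att = (cell_mean(data, "treated", "post") - cell_mean(data, "treated", "pre")) \
    - (cell_mean(data, "control", "post") - cell_mean(data, "control", "pre"))
# treated change = 21 - 11 = 10; control change = 13 - 9 = 4; att = 6.0
```

This hand computation is numerically identical to the interaction coefficient in the regression formulation of the 2x2 estimator.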
&lt;p>&lt;strong>Data insight:&lt;/strong> The classic DiD recovered the true effect of 5.0 within sampling error (95% CI: [4.64, 5.60]). In the staggered setting, TWFE estimated 2.18 while the cleaner CS estimator found 2.41 &amp;mdash; a 10% upward correction driven by eliminating the 28.3% weight on forbidden comparisons that dragged TWFE down. The CS event study further revealed that treatment effects grow over time, from 1.97 immediately after treatment to 3.27 six periods later.&lt;/p>
&lt;p>&lt;strong>Practical limitation:&lt;/strong> Parallel trends is untestable for the post-treatment period. Pre-treatment tests (p = 0.29 in our example) can only fail to reject, not confirm. HonestDiD provides a principled solution by computing robust confidence intervals under bounded violations. Our breakdown value exceeding $M = 15$ means the conclusion survives violations more than 15 times the worst pre-treatment departure &amp;mdash; exceptionally strong robustness.&lt;/p>
&lt;p>&lt;strong>Next steps:&lt;/strong> This tutorial used synthetic data &amp;mdash; the 2x2 dataset with a constant treatment effect and the staggered dataset with effects that grow over time. Real-world applications should consider adding covariates to the CS estimator (via the &lt;code>covariates&lt;/code> argument), exploring continuous treatment intensity with &lt;code>ContinuousDiD()&lt;/code>, and comparing CS results against &lt;code>SunAbraham()&lt;/code> or &lt;code>ImputationDiD()&lt;/code> as robustness checks. The &lt;code>diff-diff&lt;/code> package supports all of these within the same API.&lt;/p>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Null effect test.&lt;/strong> Modify the &lt;code>generate_did_data()&lt;/code> call to set &lt;code>treatment_effect=0.0&lt;/code>. Run the full 2x2 analysis and event study. Does the estimator correctly find a zero effect? What do the pre- and post-treatment event study coefficients look like?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Covariates in Callaway-Sant&amp;rsquo;Anna.&lt;/strong> Add covariates to the staggered data (e.g., unit-level characteristics) and pass them via the &lt;code>covariates&lt;/code> argument in &lt;code>CallawaySantAnna().fit()&lt;/code>. Compare the ATT with and without covariate adjustment. When does covariate adjustment matter most?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sun-Abraham comparison.&lt;/strong> Estimate the staggered treatment effect using &lt;code>SunAbraham(control_group=&amp;quot;never_treated&amp;quot;)&lt;/code> instead of &lt;code>CallawaySantAnna()&lt;/code>. Compare the overall ATT and event study coefficients. Under what conditions do the two estimators differ?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HonestDiD with finer M grid.&lt;/strong> Run the sensitivity analysis with &lt;code>M_values = np.arange(0, 15, 0.5)&lt;/code> to find the exact breakdown point. How does the breakdown change if you use &lt;code>method=&amp;quot;smoothness&amp;quot;&lt;/code> instead of &lt;code>&amp;quot;relative_magnitude&amp;quot;&lt;/code>?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">Callaway, B. &amp;amp; Sant&amp;rsquo;Anna, P. H. C. (2021). Difference-in-Differences with Multiple Time Periods. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 200&amp;ndash;230.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/igerber/diff-diff" target="_blank" rel="noopener">Gerber, I. (2026). diff-diff: Difference-in-Differences Causal Inference for Python. GitHub repository.&lt;/a> &amp;mdash; &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 254&amp;ndash;277.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">Rambachan, A. &amp;amp; Roth, J. (2023). A More Credible Approach to Parallel Trends. &lt;em>Review of Economic Studies&lt;/em>, 90(5), 2555&amp;ndash;2591.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/aeri.20210236" target="_blank" rel="noopener">Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. &lt;em>American Economic Review: Insights&lt;/em>, 4(3), 305&amp;ndash;322.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.09.006" target="_blank" rel="noopener">Sun, L. &amp;amp; Abraham, S. (2021). Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 175&amp;ndash;199.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/2118030" target="_blank" rel="noopener">Card, D. &amp;amp; Krueger, A. B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. &lt;em>American Economic Review&lt;/em>, 84(4), 772&amp;ndash;793.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mixtape.scunning.com/09-difference_in_differences" target="_blank" rel="noopener">Cunningham, S. (2021). &lt;em>Causal Inference: The Mixtape&lt;/em>. Yale University Press. Chapter 9: Difference-in-Differences.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/aer.20181169" target="_blank" rel="noopener">de Chaisemartin, C. &amp;amp; D&amp;rsquo;Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. &lt;em>American Economic Review&lt;/em>, 110(9), 2964&amp;ndash;2996.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. &lt;em>Journal of Educational Psychology&lt;/em>, 66(5), 688&amp;ndash;701.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the post may still contain errors, so apply its contents to real research projects with caution.&lt;/p></description></item><item><title>Heterogeneous treatment effects via two-stage DID</title><link>https://carlos-mendez.org/post/r_two_stage_did/</link><pubDate>Mon, 29 Jul 2024 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_two_stage_did/</guid><description>&lt;h2 id="homogeneous-treatment-effects">Homogeneous Treatment Effects&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🎯 &lt;strong>Purpose&lt;/strong>:
Estimate treatment effects when the treatment is not randomly assigned.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>📉 &lt;strong>Parallel Trends Assumption&lt;/strong>:
In the absence of treatment, the treated and untreated groups would have followed parallel paths over time.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🔄 &lt;strong>Two-Way Fixed-Effects (TWFE) Model&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Static Model&lt;/strong>:&lt;/li>
&lt;/ul>
&lt;p>$$
y_{igt} = \mu_g + \eta_t + \tau D_{gt} + \epsilon_{igt}
$$&lt;/p>
&lt;ul>
&lt;li>$ y_{igt} $: Outcome variable.&lt;/li>
&lt;li>$ i $: Individual.&lt;/li>
&lt;li>$ t $: Time.&lt;/li>
&lt;li>$ g $: Group.&lt;/li>
&lt;li>$ \mu_g $: Group fixed-effects.&lt;/li>
&lt;li>$ \eta_t $: Time fixed-effects.&lt;/li>
&lt;li>$ D_{gt} $: Indicator for treatment status.&lt;/li>
&lt;li>$ \tau $: Average treatment effect on the treated (ATT).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>❗ &lt;strong>Limitations&lt;/strong>:
Assumes constant treatment effects across groups and time, which is often unrealistic.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="heterogeneous-treatment-effects">Heterogeneous Treatment Effects&lt;/h2>
&lt;ul>
&lt;li>🔄 &lt;strong>Enhanced TWFE Model&lt;/strong>:
$$
y_{igt} = \mu_g + \eta_t + \tau_{gt} D_{gt} + \epsilon_{igt}
$$
&lt;ul>
&lt;li>Allows treatment effects ($ \tau_{gt} $) to vary by group and time.&lt;/li>
&lt;li>Aggregates group-time average treatment effects into an overall average treatment effect ($ \tau $).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="dynamic-event-study-twfe-model">Dynamic Event-Study TWFE Model&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🔄 &lt;strong>Model&lt;/strong>:
$$
y_{igt} = \mu_g + \eta_t + \sum_{k=-L}^{-2} \tau_k D_{gt}^k + \sum_{k=0}^{K} \tau_k D_{gt}^k + \epsilon_{igt}
$$&lt;/p>
&lt;ul>
&lt;li>Allows for treatment effects to change over time.&lt;/li>
&lt;li>$ D_{gt}^k $: Lags and leads of treatment status; the $ k = -1 $ period is omitted as the reference category.&lt;/li>
&lt;li>Coefficients ($ \tau_k $) represent the average effect of being treated for $ k $ periods.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🎯 &lt;strong>Estimation Goals&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Objective&lt;/strong>: Estimate the average treatment effect of being exposed for $ k $ periods.&lt;/li>
&lt;li>&lt;strong>Average Treatment Effect&lt;/strong>:
$$
\tau_k = \sum_{g,t : t-g=k} \frac{N_{gt}}{N_k} \tau_{gt}
$$
&lt;ul>
&lt;li>$ N_{gt} $: Number of observations in group $ g $ and time $ t $.&lt;/li>
&lt;li>$ N_k $: Total number of observations with $ t - g = k $.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
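&lt;p>The aggregation formula above is an $N$-weighted average over the $(g,t)$ cells that share event time $k = t - g$. A minimal sketch in plain Python, with hypothetical group-time effects and cell sizes:&lt;/p>

```python
# Hypothetical group-time effects tau_{gt} with cell sizes N_{gt}
# (illustrative numbers only). Keys are (g, t): treatment cohort g,
# calendar time t; event time is k = t - g.
cells = {
    (2, 2): (1.0, 50), (2, 3): (1.5, 50), (2, 4): (2.0, 50),
    (3, 3): (2.0, 50), (3, 4): (3.0, 50),
}

def event_study_tau(cells, k):
    """tau_k = sum over {(g,t): t-g=k} of (N_gt / N_k) * tau_gt."""
    matched = [(tau, n) for (g, t), (tau, n) in cells.items() if t - g == k]
    n_k = sum(n for _, n in matched)  # total observations at event time k
    return sum(tau * n for tau, n in matched) / n_k

tau_0 = event_study_tau(cells, 0)  # pools cohorts 2 and 3 at k = 0 -> 1.5
tau_1 = event_study_tau(cells, 1)  # pools cohorts 2 and 3 at k = 1 -> 2.25
```

The same weighting scheme, applied to consistently estimated $\tau_{gt}$ rather than the TWFE coefficients, is what guarantees non-negative weights.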
&lt;h2 id="negative-weighting-problem">Negative Weighting Problem&lt;/h2>
&lt;ul>
&lt;li>❗ &lt;strong>Issue&lt;/strong>: Under staggered adoption, traditional TWFE models can assign negative weights to some group-time treatment effects, biasing the overall treatment effect estimate.&lt;/li>
&lt;li>🛠 &lt;strong>Solution by Gardner (2021)&lt;/strong>:
&lt;ul>
&lt;li>Use a two-stage approach to estimate group and time fixed-effects from untreated/not-yet-treated observations and then estimate treatment effects using residualized outcomes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="two-stage-differences-in-differences">Two-stage differences in differences&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🌱 &lt;strong>Gardner (2021) Approach&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>🔍 &lt;strong>Key Insight&lt;/strong>: Under parallel trends, group and time effects are identified from the untreated/not-yet-treated observations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>📜 &lt;strong>Procedure&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>🥇 &lt;strong>First Stage&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the model:&lt;/p>
&lt;p>\begin{equation}
y_{igt} = \mu_g + \eta_t + \epsilon_{igt}
\end{equation}&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using only untreated/not-yet-treated observations ($D_{gt} = 0$).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Obtain estimates for group and time effects ($\mu_g$ and $\eta_t$).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🥈 &lt;strong>Second Stage&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Regress the adjusted outcomes ($y_{igt} - \hat{\mu}_g - \hat{\eta}_t$) on treatment status ($D_{gt}$) in the full sample to estimate the treatment effect ($\tau$).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>🎯 &lt;strong>Rationale&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The parallel trends assumption implies that residuals ($\epsilon_{igt}$) are uncorrelated with the treatment dummy, leading to a consistent estimator for the average treatment effect.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
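&lt;p>The two-stage logic can be sketched on a tiny noiseless panel. For simplicity, the first stage below recovers the fixed effects in closed form from the never-treated group and the pre-treatment period, rather than by OLS as in Gardner (2021); all numbers are illustrative:&lt;/p>

```python
# Toy noiseless panel: y = mu_g + eta_t + tau * D, where group 1 is
# treated from t = 2 and group 2 is never treated (illustrative values).
mu, eta, tau = {1: 0.0, 2: 1.0}, {1: 0.0, 2: 0.5, 3: 1.0}, 2.0
D = {(g, t): int(g == 1 and t >= 2) for g in (1, 2) for t in (1, 2, 3)}
y = {(g, t): mu[g] + eta[t] + tau * D[g, t] for (g, t) in D}

# First stage: recover fixed effects from untreated cells only (D = 0).
# Here the never-treated group pins down the time effects (normalizing
# eta_1 = 0), and period 1, before anyone is treated, pins down the
# group effects.
eta_hat = {t: y[2, t] - y[2, 1] for t in (1, 2, 3)}
mu_hat = {g: y[g, 1] for g in (1, 2)}

# Second stage: residualize outcomes and average the residuals over the
# treated cells; with a binary treatment this equals the second-stage
# regression coefficient on D.
resid = {k: y[k] - mu_hat[k[0]] - eta_hat[k[1]] for k in y}
treated = [k for k in D if D[k] == 1]
tau_hat = sum(resid[k] for k in treated) / len(treated)  # recovers 2.0
```

Because the fixed effects are estimated only from untreated/not-yet-treated observations, no already-treated unit ever serves as a control, which is precisely how the negative-weighting problem is avoided.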
&lt;center>
&lt;div class="alert alert-note">
&lt;div>
Learn by coding using this &lt;a href="https://colab.research.google.com/drive/1A5zxj9SU8phTTCHBkt1fQkFX1xhFbycI?usp=sharing">Google Colab notebook&lt;/a>.
&lt;/div>
&lt;/div>
&lt;/center></description></item><item><title>Staggered DiD (Ex1)</title><link>https://carlos-mendez.org/post/r_staggered_did/</link><pubDate>Sun, 03 Sep 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_staggered_did/</guid><description>&lt;p>An introduction to difference in differences with multiple time periods and staggered treatment adoption. This tutorial is based on &lt;a href="https://github.com/Mixtape-Sessions/Advanced-DID/tree/main/Exercises/Exercise-1" target="_blank" rel="noopener">Exercise 1&lt;/a> of the Advanced DiD mixed tape session of Jonathan Roth. You can run and extend the analysis of this case study using &lt;a href="https://colab.research.google.com/drive/14LJEYHZTlw5wtIK0bR0lOza7lQiO0krc?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item><item><title>Staggered DiD</title><link>https://carlos-mendez.org/post/r_staggered_did1/</link><pubDate>Sat, 02 Sep 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_staggered_did1/</guid><description>&lt;p>An introduction to difference in differences with multiple time periods and staggered treatment adoption. You can run and extend the analysis of this case study using &lt;a href="https://colab.research.google.com/drive/1ucJmhyvb7pn01zyQji0xVZy_nZbo3_jB?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item><item><title>Basic DiD</title><link>https://carlos-mendez.org/post/r_basic_did/</link><pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_basic_did/</guid><description>&lt;p>In this case study, we use the Differences in Differences (DiD) method to analyze the effect of a garbage incinerator&amp;rsquo;s location on housing prices. 
DiD estimates the effect of a treatment (here, the placement of a garbage incinerator) on an outcome (housing prices) by comparing the average change in the outcome over time for the treatment group with the corresponding change for the control group. You can run and extend the analysis of this case study using &lt;a href="https://posit.cloud/content/6182152" target="_blank" rel="noopener">Posit cloud&lt;/a> or &lt;a href="https://colab.research.google.com/drive/14LJEYHZTlw5wtIK0bR0lOza7lQiO0krc?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item></channel></rss>