<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Panel Data | Carlos Mendez</title><link>https://carlos-mendez.org/category/panel-data/</link><atom:link href="https://carlos-mendez.org/category/panel-data/index.xml" rel="self" type="application/rss+xml"/><description>Panel Data</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Panel Data</title><link>https://carlos-mendez.org/category/panel-data/</link></image><item><title>Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata</title><link>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</link><pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Do all countries respond the same way to inflation? To interest rates? To democratic transitions? Most panel data models assume yes. They force every country to share the same slope coefficients. That is a strong assumption &amp;mdash; and often a wrong one.&lt;/p>
&lt;p>Here is a preview of what we will discover. When we estimate the effect of inflation on savings across 56 countries, the pooled model says: &amp;ldquo;no significant effect.&amp;rdquo; But that average is a lie. One group of countries saves &lt;em>less&lt;/em> when inflation rises. Another group saves &lt;em>more&lt;/em>. The pooled estimate averages a negative and a positive effect, producing a misleading zero.&lt;/p>
&lt;p>The &lt;strong>Classifier-LASSO&lt;/strong> (C-LASSO) method solves this problem. Developed by Su, Shi, and Phillips (2016), it discovers &lt;strong>latent groups&lt;/strong> in your panel data. Countries within each group share the same coefficients. Countries across groups can differ. Think of it like a sorting hat: rather than treating all countries as identical or all as unique, C-LASSO sorts them into a small number of groups with shared behavioral patterns.&lt;/p>
&lt;p>This tutorial demonstrates the &lt;code>classifylasso&lt;/code> Stata command (Huang, Wang, and Zhou 2024) with two applications:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Savings behavior&lt;/strong> across 56 countries (1995&amp;ndash;2010) &amp;mdash; where inflation affects savings in &lt;em>opposite directions&lt;/em> depending on the country group&lt;/li>
&lt;li>&lt;strong>Democracy and economic growth&lt;/strong> across 98 countries (1970&amp;ndash;2010) &amp;mdash; where the pooled estimate of +1.05 masks a split of +2.15 in one group and -0.94 in another&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why assuming homogeneous slopes can be misleading in panel data&lt;/li>
&lt;li>Learn the Classifier-LASSO method for identifying latent group structures&lt;/li>
&lt;li>Implement &lt;code>classifylasso&lt;/code> in Stata with both static and dynamic specifications&lt;/li>
&lt;li>Use postestimation commands (&lt;code>classogroup&lt;/code>, &lt;code>classocoef&lt;/code>, &lt;code>predict gid&lt;/code>) to visualize and interpret results&lt;/li>
&lt;li>Compare pooled fixed-effects estimates with group-specific C-LASSO estimates&lt;/li>
&lt;/ul>
&lt;p>The diagram below maps the tutorial&amp;rsquo;s progression. We start simple and build complexity step by step.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Savings data&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled &amp;amp;&amp;lt;br/&amp;gt;fixed effects&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Static model&amp;lt;br/&amp;gt;(no lagged DV)&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Dynamic model&amp;lt;br/&amp;gt;(jackknife)&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Democracy&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Application&amp;lt;br/&amp;gt;(two-way FE)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled vs&amp;lt;br/&amp;gt;group-specific&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#d97757,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#141413
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="2-the-problem-homogeneous-vs-heterogeneous-slopes">2. The Problem: Homogeneous vs Heterogeneous Slopes&lt;/h2>
&lt;h3 id="21-three-approaches-to-slope-heterogeneity">2.1 Three approaches to slope heterogeneity&lt;/h3>
&lt;p>Imagine 56 students taking the same exam. &lt;strong>Approach 1&lt;/strong> assumes they all studied the same way &amp;mdash; one average study strategy explains everyone&amp;rsquo;s score. &lt;strong>Approach 2&lt;/strong> gives each student a unique strategy &amp;mdash; but with only a few data points per student, the estimates are noisy. &lt;strong>Approach 3&lt;/strong> (C-LASSO) discovers that students naturally fall into 2&amp;ndash;3 study groups. Students within a group share the same strategy. Students across groups differ.&lt;/p>
&lt;p>The same logic applies to panel data. The standard fixed-effects model is:&lt;/p>
&lt;p>$$y_{it} = \mu_i + \boldsymbol{\beta}' \mathbf{x}_{it} + u_{it}$$&lt;/p>
&lt;p>Here, $y_{it}$ is the outcome for country $i$ at time $t$. The term $\mu_i$ captures country-specific intercepts (fixed effects). The slope vector $\boldsymbol{\beta}$ links the regressors $\mathbf{x}_{it}$ to the outcome. The critical assumption: $\boldsymbol{\beta}$ is the &lt;strong>same for all countries&lt;/strong>. Japan and Nigeria get the same coefficient on inflation. That may be wrong.&lt;/p>
&lt;p>At the other extreme, we could run separate regressions for each country. But with only $T = 15$ time periods per country, individual estimates are noisy. We lose statistical power.&lt;/p>
&lt;p>C-LASSO introduces a middle ground. It assumes countries belong to $K$ latent groups:&lt;/p>
&lt;p>$$\boldsymbol{\beta}_i = \boldsymbol{\alpha}_k \quad \text{if} \quad i \in G_k, \quad k = 1, \ldots, K$$&lt;/p>
&lt;p>In words, country $i$ gets the slope coefficients of its group $G_k$. The method estimates three things simultaneously: the number of groups $K$, which countries belong to which group, and each group&amp;rsquo;s coefficients $\boldsymbol{\alpha}_k$. You do not need to specify the groups in advance. The data reveals them.&lt;/p>
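&lt;p>To make the latent group structure concrete, here is a minimal simulation sketch &amp;mdash; in Python rather than Stata, purely for intuition. All numbers (two groups, slopes of $-0.2$ and $0.5$, the noise levels) are made up; nothing here comes from the tutorial&amp;rsquo;s data. Each unit draws its slope from one of two group-level values, and unit-by-unit within regressions recover estimates that cluster around those values.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, K = 56, 15, 2                    # units, periods, latent groups
alpha = np.array([-0.2, 0.5])          # group-specific slopes (made up)
group = rng.integers(0, K, size=N)     # true but unobserved memberships

mu = rng.normal(0, 1, size=N)          # unit fixed effects
x = rng.normal(0, 1, size=(N, T))
u = rng.normal(0, 0.3, size=(N, T))
y = mu[:, None] + alpha[group][:, None] * x + u

# Within-demeaned unit-by-unit OLS slopes cluster around alpha_k
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_i = (xd * yd).sum(axis=1) / (xd ** 2).sum(axis=1)
for k in range(K):
    print(k, round(beta_i[group == k].mean(), 2))
```

&lt;p>With only $T = 15$ periods each individual slope is noisy, but the group averages sit close to the true $\boldsymbol{\alpha}_k$. That is exactly the structure C-LASSO is built to exploit.&lt;/p>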
&lt;h3 id="22-why-not-just-use-k-means">2.2 Why not just use K-means?&lt;/h3>
&lt;p>A natural question: why not run individual regressions first and then cluster the coefficients with K-means? C-LASSO has two advantages. First, it estimates group membership and coefficients &lt;strong>jointly&lt;/strong>. A two-step approach (estimate, then cluster) propagates first-stage errors into the grouping. Second, C-LASSO&amp;rsquo;s penalty structure naturally pulls similar countries toward the same group. It is a statistically principled sorting mechanism, not an ad-hoc post-processing step.&lt;/p>
&lt;hr>
&lt;h2 id="3-the-classifier-lasso-method">3. The Classifier-LASSO Method&lt;/h2>
&lt;h3 id="31-the-c-lasso-objective-function">3.1 The C-LASSO objective function&lt;/h3>
&lt;p>C-LASSO minimizes a penalized least-squares objective:&lt;/p>
&lt;p>$$Q_{NT,\lambda}^{(K)} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \boldsymbol{\beta}_i' \mathbf{x}_{it})^2 + \frac{\lambda_{NT}}{N} \sum_{i=1}^{N} \prod_{k=1}^{K} \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$$&lt;/p>
&lt;p>The first term is the standard sum of squared residuals. It measures how well the model fits the data. The second term is the &lt;strong>penalty&lt;/strong>. It encourages each country&amp;rsquo;s coefficients $\boldsymbol{\beta}_i$ to be close to one of the group centers $\boldsymbol{\alpha}_k$.&lt;/p>
&lt;p>Think of each group center as a &lt;strong>planet with gravitational pull&lt;/strong>. If a country&amp;rsquo;s coefficients are close to &lt;em>any&lt;/em> planet, the product $\prod_k \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$ shrinks toward zero. The penalty becomes small. The country gets pulled into that group. If the coefficients are far from all planets, the penalty stays large. The tuning parameter $\lambda_{NT} = c_\lambda T^{-1/3}$ controls how strong this gravitational pull is.&lt;/p>
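&lt;p>The &amp;ldquo;gravitational pull&amp;rdquo; is easy to verify numerically. The Python sketch below, with invented centers and coefficients, evaluates the product penalty for a unit close to one center and a unit far from both.&lt;/p>

```python
import numpy as np

def classo_penalty(beta_i, centers):
    # Product over groups of the distance to each group center;
    # near zero if beta_i is close to ANY center
    dists = [np.linalg.norm(beta_i - a) for a in centers]
    return np.prod(dists)

centers = [np.array([-0.2, -0.2]), np.array([0.5, 0.3])]  # two planets
near = np.array([-0.19, -0.21])   # close to the first center
far = np.array([2.0, 2.0])        # far from both centers
print(round(classo_penalty(near, centers), 4))
print(round(classo_penalty(far, centers), 4))
```

&lt;p>Because the terms multiply, proximity to &lt;em>any one&lt;/em> center is enough to collapse the whole penalty, so each unit is pulled toward its nearest group.&lt;/p>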
&lt;h3 id="32-three-step-estimation-procedure">3.2 Three-step estimation procedure&lt;/h3>
&lt;p>The &lt;code>classifylasso&lt;/code> command works in three steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sort countries into groups.&lt;/strong> For each candidate number of groups $K$, the algorithm iteratively updates group centers and reassigns countries until convergence. Starting values come from unit-by-unit regressions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Re-estimate within groups (postlasso).&lt;/strong> The LASSO penalty biases the coefficient estimates. So after sorting, we discard the penalized estimates and re-run plain OLS within each group. Think of it like a talent show: LASSO is the audition that selects who is in which group, but the final performance (the coefficient estimates) is unpenalized. This postlasso step gives us valid standard errors and confidence intervals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pick the best $K$ (information criterion).&lt;/strong> How many groups are there? The command tests $K = 1, 2, \ldots, K_{\max}$ and picks the $K$ that minimizes an information criterion. The IC acts like a &lt;strong>referee&lt;/strong> balancing two concerns: fit (more groups fit better) and complexity (more groups risk overfitting). It works like AIC or BIC. The tuning parameter $\rho_{NT} = c_\rho (NT)^{-1/2}$ controls how harshly the referee penalizes extra groups.&lt;/p>
&lt;/li>
&lt;/ol>
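&lt;p>The sorting logic can be caricatured in a few lines of Python. This is &lt;em>not&lt;/em> the actual penalized C-LASSO optimization: it skips the penalty and the information-criterion step and simply alternates assignment with within-group re-estimation on toy unit-level slopes (closer in spirit to k-means), but it conveys the iterate-until-convergence structure of steps 1 and 2.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy unit-level slopes scattered around two true centers
true = np.repeat([-0.2, 0.5], 28)
beta_i = true + rng.normal(0, 0.08, size=56)

centers = np.array([beta_i.min(), beta_i.max()])  # crude starting values
for it in range(20):
    # Step 1: sort units into groups (nearest center)
    g = np.argmin(np.abs(beta_i[:, None] - centers[None, :]), axis=1)
    # Step 2: post-lasso-style re-estimation within each group
    new = np.array([beta_i[g == k].mean() for k in range(2)])
    if np.allclose(new, centers):
        break
    centers = new
print(np.round(centers, 2))
```

&lt;p>The recovered centers land near the true group slopes after a handful of iterations. The real command additionally penalizes the assignment and then repeats the whole exercise for each candidate $K$ before the information criterion picks a winner.&lt;/p>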
&lt;h3 id="33-dynamic-panels-and-nickell-bias">3.3 Dynamic panels and Nickell bias&lt;/h3>
&lt;p>What if your model includes a lagged dependent variable, like $y_{i,t-1}$? This creates a problem called &lt;strong>Nickell bias&lt;/strong>. When you demean the data to remove fixed effects, the demeaned lagged outcome becomes correlated with the demeaned error. The result: biased coefficients.&lt;/p>
&lt;p>The &lt;code>classifylasso&lt;/code> command offers a &lt;code>dynamic&lt;/code> option to fix this. It uses the &lt;strong>half-panel jackknife&lt;/strong> (Dhaene and Jochmans 2015). The idea is simple: split the time series in half. Estimate the model on each half. Combine the two estimates in a way that cancels the bias. Problem solved.&lt;/p>
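&lt;p>The correction arithmetic is worth seeing once. If the fixed-effects estimator has a Nickell-type bias of order $1/T$, halving the panel doubles the bias, so the combination $2\hat{\theta}_{\text{full}} - \tfrac{1}{2}(\hat{\theta}_1 + \hat{\theta}_2)$ cancels the leading term. A stylized Python check with made-up numbers:&lt;/p>

```python
# Suppose the estimator's expectation is theta + b/T (a 1/T-order bias)
theta, b = 0.70, -1.5
T = 15

full = theta + b / T            # full-panel estimate
half = theta + b / (T / 2)      # average of the two half-panel estimates
jack = 2 * full - half          # half-panel jackknife combination
print(full, half, jack)         # jack recovers theta in this stylized setup
```

&lt;p>In real data the bias is not exactly $b/T$, so the jackknife removes the leading term rather than all bias, but the cancellation logic is just this arithmetic.&lt;/p>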
&lt;p>Now that we understand the method, let&amp;rsquo;s apply it to real data.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-exploration-savings">4. Data Exploration: Savings&lt;/h2>
&lt;h3 id="41-load-and-describe-the-data">4.1 Load and describe the data&lt;/h3>
&lt;p>Our first application uses a panel of 56 countries over 15 years, from Su, Shi, and Phillips (2016). The outcome is the savings-to-GDP ratio. The regressors are lagged savings, CPI inflation, real interest rates, and GDP growth.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
summarize savings lagsavings cpi interest gdp
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
savings | 840 -2.87e-08 1.000596 -2.495871 2.893858
lagsavings | 840 5.81e-08 1.000596 -2.832278 2.91508
cpi | 840 3.56e-09 1.000596 -2.773791 3.548945
interest | 840 -7.17e-09 1.000596 -3.600348 3.277582
gdp | 840 1.06e-08 1.000596 -3.554419 2.461317
&lt;/code>&lt;/pre>
&lt;p>The panel is strongly balanced: 56 countries $\times$ 15 years = 840 observations. All variables are standardized to mean zero and standard deviation one. This means coefficients are in standard-deviation units. A coefficient of 0.18 means &amp;ldquo;a one-SD increase in CPI is associated with a 0.18-SD change in savings.&amp;rdquo; The balanced structure matters: C-LASSO requires all countries to be observed in all time periods.&lt;/p>
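&lt;p>The standard-deviation-units reading has a tidy algebraic basis: the OLS slope from regressing a z-scored outcome on a z-scored regressor equals their correlation. A quick Python illustration with simulated data (the numbers are arbitrary):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 4, size=840)            # raw regressor, arbitrary units
y = 2.0 * x + rng.normal(0, 5, size=840)   # raw outcome

def z(v):                                  # standardize to mean 0, SD 1
    return (v - v.mean()) / v.std()

zx, zy = z(x), z(y)
beta_std = (zx * zy).mean()                # OLS slope on z-scored data
print(round(beta_std, 2))                  # SD change in y per one-SD move in x
```

&lt;p>Whatever the raw units were, the standardized slope answers one question: how many standard deviations of the outcome move with one standard deviation of the regressor.&lt;/p>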
&lt;h3 id="42-visualize-cross-country-heterogeneity">4.2 Visualize cross-country heterogeneity&lt;/h3>
&lt;p>Before running any regressions, it helps to visualize how savings trajectories differ across countries. The &lt;code>xtline&lt;/code> command overlays all 56 country lines on a single plot:&lt;/p>
&lt;pre>&lt;code class="language-stata">xtline savings, overlay ///
title(&amp;quot;Savings-to-GDP Ratio Across 56 Countries&amp;quot;, size(medium)) ///
subtitle(&amp;quot;Each line represents one country&amp;quot;, size(small)) ///
ytitle(&amp;quot;Savings / GDP&amp;quot;) xtitle(&amp;quot;Year&amp;quot;) legend(off)
graph export &amp;quot;stata_panel_lasso_cluster_fig1_savings_scatter.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig1_savings_scatter.png" alt="Spaghetti plot of savings-to-GDP ratio across 56 countries, showing wide dispersion in trajectories.">
&lt;em>Figure 1: Savings-to-GDP ratio across 56 countries (1995&amp;ndash;2010). Each line represents one country, revealing substantial heterogeneity in savings dynamics.&lt;/em>&lt;/p>
&lt;p>The spaghetti plot tells a clear story: countries do not move in lockstep. Some maintain positive savings ratios throughout. Others swing below zero. The lines diverge, cross, and cluster &amp;mdash; suggesting that different countries follow fundamentally different savings dynamics. This is exactly the kind of heterogeneity that C-LASSO is designed to detect. Perhaps subsets of countries share similar responses, even if the full panel does not.&lt;/p>
&lt;p>But first, let&amp;rsquo;s see what the standard models say.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-pooled-and-fixed-effects-regressions">5. Baseline: Pooled and Fixed Effects Regressions&lt;/h2>
&lt;p>Before applying C-LASSO, we establish a benchmark by estimating the standard pooled OLS and fixed-effects models. These models assume that all 56 countries share the same slope coefficients.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Pooled OLS
regress savings lagsavings cpi interest gdp
* Standard Fixed Effects
xtreg savings lagsavings cpi interest gdp, fe
* Robust Fixed Effects (reghdfe)
reghdfe savings lagsavings cpi interest gdp, absorb(code) vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Pooled OLS FE (robust)
lagsavings 0.6051 0.6051
cpi 0.0301 0.0301
interest 0.0059 0.0059
gdp 0.1882 0.1882
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS and fixed-effects estimates are virtually identical. R-squared is 0.438. Lagged savings dominates (coefficient 0.605, $p &amp;lt; 0.001$). GDP growth matters too (0.188, $p &amp;lt; 0.001$).&lt;/p>
&lt;p>Now look at the two remaining variables. CPI: 0.030. Interest rate: 0.006. Both statistically insignificant. A textbook conclusion would be: &amp;ldquo;Inflation and interest rates do not affect savings.&amp;rdquo;&lt;/p>
&lt;p>But what if the average is lying? Imagine a city where half the neighborhoods warm up by 5 degrees and the other half cool down by 5 degrees. The citywide average temperature change is zero. A meteorologist reporting &amp;ldquo;no change&amp;rdquo; would be wrong &amp;mdash; there &lt;em>are&lt;/em> changes, just in opposite directions. This is exactly what we will discover with C-LASSO.&lt;/p>
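&lt;p>The cancellation in the meteorologist example takes two lines of Python to verify:&lt;/p>

```python
group1 = [-5.0] * 28   # neighborhoods cooling by 5 degrees
group2 = [+5.0] * 28   # neighborhoods warming by 5 degrees

citywide = sum(group1 + group2) / len(group1 + group2)
print(citywide)        # 0.0: 'no change' on average, two real changes underneath
```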
&lt;hr>
&lt;h2 id="6-classifier-lasso-savings-static-model">6. Classifier-LASSO: Savings, Static Model&lt;/h2>
&lt;h3 id="61-estimation">6.1 Estimation&lt;/h3>
&lt;p>We start with the simplest C-LASSO specification: a static model without the lagged dependent variable. This lets us focus on the core mechanics before adding complexity.&lt;/p>
&lt;pre>&lt;code class="language-stata">classifylasso savings cpi interest gdp, grouplist(1/5) tolerance(1e-4)
&lt;/code>&lt;/pre>
&lt;p>The command searches over $K = 1$ to $K = 5$ groups and reports the information criterion (IC) for each:&lt;/p>
&lt;pre>&lt;code class="language-text">Estimation 1: Group Number = 1; IC = 0.054
Estimation 2: Group Number = 2; IC = -0.028 ← minimum
Estimation 3: Group Number = 3; IC = 0.059
Estimation 4: Group Number = 4; IC = 0.131
Estimation 5: Group Number = 5; IC = 0.213
* Selected Group Number: 2
&lt;/code>&lt;/pre>
&lt;p>The IC is minimized at $K = 2$, with values rising monotonically as $K$ increases beyond 2. This clear U-shape provides strong evidence for exactly two latent groups in the data.&lt;/p>
&lt;h3 id="62-group-specific-coefficients">6.2 Group-specific coefficients&lt;/h3>
&lt;pre>&lt;code class="language-stata">classoselect, postselection
predict gid_static, gid
tabulate gid_static
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Group 1 (34 countries, 510 obs): Within R-sq. = 0.2019
cpi | -0.1813 (z = -4.29, p &amp;lt; 0.001)
interest | -0.1966 (z = -4.64, p &amp;lt; 0.001)
gdp | 0.3346 (z = 7.98, p &amp;lt; 0.001)
Group 2 (22 countries, 330 obs): Within R-sq. = 0.2369
cpi | 0.4781 (z = 9.10, p &amp;lt; 0.001)
interest | 0.2631 (z = 5.01, p &amp;lt; 0.001)
gdp | 0.1117 (z = 2.23, p = 0.026)
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Look at CPI.&lt;/p>
&lt;p>In &lt;strong>Group 1&lt;/strong> (34 countries), higher inflation &lt;em>reduces&lt;/em> savings: coefficient $-0.181$ ($p &amp;lt; 0.001$). In &lt;strong>Group 2&lt;/strong> (22 countries), higher inflation &lt;em>increases&lt;/em> savings: coefficient $+0.478$ ($p &amp;lt; 0.001$). The sign flips completely.&lt;/p>
&lt;p>The same reversal appears for the interest rate: $-0.197$ in Group 1 versus $+0.263$ in Group 2.&lt;/p>
&lt;p>Now the pooled CPI coefficient of $+0.030$ makes sense. It was averaging $-0.181$ and $+0.478$ &amp;mdash; a negative and a positive effect canceling each other out. The &amp;ldquo;insignificant&amp;rdquo; result was not evidence of no effect. It was evidence of &lt;strong>two opposing effects&lt;/strong> hidden inside the average.&lt;/p>
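&lt;p>A rough sanity check in Python: the pooled slope is &lt;em>not&lt;/em> literally the observation-weighted mean of the group slopes (pooled OLS weights each group by its regressor variation), but even the crude weighted average shows how the opposing signs collapse toward zero.&lt;/p>

```python
# Group sizes and CPI slopes from the static C-LASSO fit above
n1, b1 = 34, -0.1813
n2, b2 = 22, 0.4781

crude = (n1 * b1 + n2 * b2) / (n1 + n2)
print(round(crude, 3))   # small in absolute terms, as the pooled 0.030 is,
                         # even though both group effects are large
```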
&lt;p>Why the reversal? In Group 1, higher inflation erodes the real value of savings, discouraging people from saving. In Group 2, higher inflation may trigger &lt;strong>precautionary savings&lt;/strong> &amp;mdash; households save &lt;em>more&lt;/em> precisely because the economic environment feels uncertain. Same macroeconomic shock, opposite behavioral response.&lt;/p>
&lt;h3 id="63-group-selection-plot">6.3 Group selection plot&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig2_group_selection_static.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig2_group_selection_static.png" alt="Information criterion and iteration count by number of groups for the static savings model. IC is minimized at K=2.">
&lt;em>Figure 2: Group selection for the static savings model. The information criterion (left axis) is minimized at K=2, with a clear U-shape from K=3 onward.&lt;/em>&lt;/p>
&lt;p>The triangle marks the IC minimum at $K = 2$. The left axis shows IC values; the right axis shows iterations to convergence. Notice: $K = 2$ converged quickly (about 3 iterations). Models with $K \geq 3$ hit the maximum of 20 iterations. When the algorithm struggles to converge, it is a sign of overparameterization &amp;mdash; too many groups for the data to support.&lt;/p>
&lt;p>So far, we have found two groups with a static model. But we omitted lagged savings. Let&amp;rsquo;s add it back.&lt;/p>
&lt;hr>
&lt;h2 id="7-classifier-lasso-savings-dynamic-model">7. Classifier-LASSO: Savings, Dynamic Model&lt;/h2>
&lt;h3 id="71-adding-the-lagged-dependent-variable">7.1 Adding the lagged dependent variable&lt;/h3>
&lt;p>Savings are highly persistent. The pooled coefficient on &lt;code>lagsavings&lt;/code> was 0.605 &amp;mdash; a country&amp;rsquo;s savings this year strongly predicts its savings next year. Omitting this variable may bias everything else. We now add it back and replicate Su, Shi, and Phillips (2016). The &lt;code>dynamic&lt;/code> option activates the half-panel jackknife to correct Nickell bias.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
classifylasso savings lagsavings cpi interest gdp, ///
grouplist(1/5) lambda(1.5485) tolerance(1e-4) dynamic
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 9min57s.
Group 1 (31 countries, 465 obs): Within R-sq. = 0.4988
lagsavings | 0.6952 (z = 18.15, p &amp;lt; 0.001)
cpi | -0.1602 (z = -4.09, p &amp;lt; 0.001)
interest | -0.1490 (z = -4.04, p &amp;lt; 0.001)
gdp | 0.2892 (z = 7.62, p &amp;lt; 0.001)
Group 2 (25 countries, 375 obs): Within R-sq. = 0.4372
lagsavings | 0.6939 (z = 19.45, p &amp;lt; 0.001)
cpi | 0.1967 (z = 4.93, p &amp;lt; 0.001)
interest | 0.1225 (z = 2.98, p = 0.003)
gdp | 0.1127 (z = 2.38, p = 0.018)
&lt;/code>&lt;/pre>
&lt;p>Again, C-LASSO selects $K = 2$ groups. The sign reversal on CPI survives: $-0.160$ in Group 1 versus $+0.197$ in Group 2. Same for the interest rate: $-0.149$ versus $+0.123$.&lt;/p>
&lt;p>Here is what is interesting about the &lt;code>lagsavings&lt;/code> coefficient. Both groups show nearly identical persistence: 0.695 in Group 1 and 0.694 in Group 2. Think of it like a speedometer. Both groups of countries cruise at the same speed (savings persistence). But they swerve in opposite directions when they hit a pothole (an inflation or interest rate shock). The heterogeneity is about &lt;em>reactions to shocks&lt;/em>, not about baseline behavior.&lt;/p>
&lt;p>Adding lagged savings also improved the fit. Within R-squared jumped from 0.20&amp;ndash;0.24 (static) to 0.44&amp;ndash;0.50 (dynamic). The lagged variable clearly matters.&lt;/p>
&lt;h3 id="72-coefficient-plots">7.2 Coefficient plots&lt;/h3>
&lt;p>The &lt;code>classocoef&lt;/code> postestimation command visualizes group-specific coefficients with 95% confidence bands:&lt;/p>
&lt;pre>&lt;code class="language-stata">classocoef cpi
graph export &amp;quot;stata_panel_lasso_cluster_fig3_coef_cpi.png&amp;quot;, replace width(2400)
classocoef interest
graph export &amp;quot;stata_panel_lasso_cluster_fig4_coef_interest.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig3_coef_cpi.png" alt="CPI coefficient estimates and 95% confidence bands by group, showing a clear sign reversal with non-overlapping confidence intervals.">
&lt;em>Figure 3: Heterogeneous effects of CPI on savings. Group 1 (31 countries) shows a negative effect; Group 2 (25 countries) shows a positive effect. Confidence bands do not overlap.&lt;/em>&lt;/p>
&lt;p>This is the &amp;ldquo;smoking gun&amp;rdquo; figure. The two horizontal lines are the group-specific coefficients. The dashed lines show 95% confidence bands. The bands do not overlap. This is not a marginal difference. It is a robust sign reversal.&lt;/p>
&lt;p>For 31 countries (Group 1), higher inflation reduces savings ($-0.160$, $p &amp;lt; 0.001$). For 25 countries (Group 2), higher inflation increases savings ($+0.197$, $p &amp;lt; 0.001$). A pooled model averages these opposing forces and finds CPI &amp;ldquo;insignificant.&amp;rdquo; That is aggregation bias at work.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig4_coef_interest.png" alt="Interest rate coefficient estimates and 95% confidence bands by group, showing the same sign reversal pattern as CPI.">
&lt;em>Figure 4: Heterogeneous effects of the interest rate on savings. The same sign reversal pattern as CPI: negative in Group 1, positive in Group 2.&lt;/em>&lt;/p>
&lt;p>The interest rate tells the same story. Group 1 countries save &lt;em>less&lt;/em> when rates rise ($-0.149$). Group 2 countries save &lt;em>more&lt;/em> ($+0.123$).&lt;/p>
&lt;p>Why? One interpretation: in Group 1 (more developed financial markets), higher returns make consumption more attractive &amp;mdash; the &lt;strong>substitution effect&lt;/strong> dominates. In Group 2 (limited financial access), higher returns make saving more rewarding &amp;mdash; the &lt;strong>income effect&lt;/strong> dominates.&lt;/p>
&lt;p>We have now established that latent groups exist in savings data. The next question: does the same pattern appear in a completely different economic context?&lt;/p>
&lt;hr>
&lt;h2 id="8-democracy-application-does-democracy-cause-growth">8. Democracy Application: Does Democracy Cause Growth?&lt;/h2>
&lt;h3 id="81-the-acemoglu-et-al-2019-question">8.1 The Acemoglu et al. (2019) question&lt;/h3>
&lt;p>&amp;ldquo;Democracy does cause growth.&amp;rdquo; That is the title of a famous 2019 paper by Acemoglu, Naidu, Restrepo, and Robinson in the &lt;em>Journal of Political Economy&lt;/em>. Their evidence: a pooled two-way fixed-effects model with lagged GDP finds a positive, significant effect.&lt;/p>
&lt;p>But we have learned to be skeptical of pooled estimates. Does this average apply to all 98 countries? Or does it mask the same kind of sign reversal we found in savings?&lt;/p>
&lt;h3 id="82-data-exploration">8.2 Data exploration&lt;/h3>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/democracy.dta&amp;quot;, clear
xtset country year
summarize lnPGDP Democracy ly1
tabulate Democracy
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnPGDP | 4,018 758.5558 162.9137 405.6728 1094.003
Democracy | 4,018 .5450473 .4980286 0 1
ly1 | 3,920 757.7754 162.6702 405.6728 1094.003
Democracy | Freq. Percent
------------+-----------------------------------
0 | 1,828 45.50
1 | 2,190 54.50
&lt;/code>&lt;/pre>
&lt;p>The panel covers 98 countries from 1970 to 2010 &amp;mdash; 4,018 observations. The binary &lt;code>Democracy&lt;/code> indicator is 1 for democratic country-years and 0 otherwise. About 55% of observations are democratic, reflecting the global wave of democratization. The dependent variable &lt;code>lnPGDP&lt;/code> (log per-capita GDP, scaled) ranges from 406 to 1,094 &amp;mdash; the full spectrum from low-income to high-income countries.&lt;/p>
&lt;h3 id="83-pooled-fixed-effects-benchmark">8.3 Pooled fixed-effects benchmark&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe lnPGDP Democracy ly1, absorb(country year) cluster(country)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 3,920
R-squared = 0.9991
Within R-sq. = 0.9607
(Std. err. adjusted for 98 clusters in country)
lnPGDP | Coefficient Robust std. err. t P&amp;gt;|t|
Democracy | 1.054992 .369806 2.85 0.005
ly1 | .970495 .0059964 161.85 0.000
&lt;/code>&lt;/pre>
&lt;p>Democracy is associated with a 1.055-unit increase in log per-capita GDP ($p = 0.005$, clustered SE = 0.370). Lagged GDP has a coefficient of 0.970 &amp;mdash; strong persistence. This replicates Acemoglu et al. (2019): on average, democracy promotes growth.&lt;/p>
&lt;p>On average. But we already know what &amp;ldquo;on average&amp;rdquo; can hide. Let&amp;rsquo;s run C-LASSO.&lt;/p>
&lt;h3 id="84-c-lasso-revealing-the-heterogeneity">8.4 C-LASSO: revealing the heterogeneity&lt;/h3>
&lt;pre>&lt;code class="language-stata">classifylasso lnPGDP Democracy ly1, ///
grouplist(1/5) rho(0.2) absorb(country year) ///
cluster(country) dynamic optmaxiter(300)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 2h33min41s.
Group 1 (57 countries, 2,280 obs): Within R-sq. = 0.9609
Democracy | 2.151397 (z = 3.94, p &amp;lt; 0.001)
ly1 | 1.032752 (z = 149.97, p &amp;lt; 0.001)
Group 2 (41 countries, 1,640 obs): Within R-sq. = 0.9538
Democracy | -0.935589 (z = -2.69, p = 0.007)
ly1 | 0.979327 (z = 95.73, p &amp;lt; 0.001)
&lt;/code>&lt;/pre>
&lt;p>This is the tutorial&amp;rsquo;s most striking finding.&lt;/p>
&lt;p>The pooled coefficient of $+1.055$ is &lt;strong>not representative of any actual country group&lt;/strong>. It is a weighted average of two fundamentally different effects:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Group 1&lt;/strong> (57 countries): democracy effect = $+2.151$ ($p &amp;lt; 0.001$). More than twice the pooled estimate.&lt;/li>
&lt;li>&lt;strong>Group 2&lt;/strong> (41 countries): democracy effect = $-0.936$ ($p = 0.007$). Negative and significant.&lt;/li>
&lt;/ul>
&lt;p>The coefficient literally changes sign. For 58% of countries, democratic transitions are associated with GDP gains. For the remaining 42%, they are associated with GDP declines. The pooled model sees one number. C-LASSO sees two stories.&lt;/p>
&lt;p>Note: these are conditional associations within the panel model. A causal interpretation requires the same identifying assumptions as Acemoglu et al. (2019).&lt;/p>
&lt;h3 id="85-visualizing-the-democracy-growth-split">8.5 Visualizing the democracy-growth split&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig5_democracy_selection.png&amp;quot;, replace width(2400)
classocoef Democracy
graph export &amp;quot;stata_panel_lasso_cluster_fig6_democracy_coef.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig5_democracy_selection.png" alt="Information criterion and iteration count for the democracy model. IC is minimized at K=2, though values are close across specifications.">
&lt;em>Figure 5: Group selection for the democracy-growth model. IC is minimized at K=2, though values are close across all K (range 3.267&amp;ndash;3.280).&lt;/em>&lt;/p>
&lt;p>The IC selects $K = 2$. But look closely: the IC values range from 3.267 to 3.280 &amp;mdash; a span of just 0.013. The 2-group structure is optimal but not overwhelmingly so. This is a useful reminder: always check sensitivity to the tuning parameter $\rho$.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig6_democracy_coef.png" alt="Democracy coefficient polarization across two groups: Group 1 (57 countries) shows a positive effect around +2.2, Group 2 (41 countries) shows a negative effect around -1.0.">
&lt;em>Figure 6: Heterogeneous effects of democracy on economic growth. Group 1 (57 countries) shows a positive effect (+2.15); Group 2 (41 countries) shows a negative effect (-0.94). The pooled estimate of +1.05 describes neither group.&lt;/em>&lt;/p>
&lt;p>This is the key figure of the tutorial. Each dot is one country&amp;rsquo;s individual coefficient estimate. The horizontal lines show group-specific postlasso estimates with 95% confidence bands.&lt;/p>
&lt;p>The polarization is unmistakable. Group 1 (left cluster): strongly positive. Group 2 (right cluster): negative. Neither group&amp;rsquo;s confidence band crosses zero. Both effects are statistically significant.&lt;/p>
&lt;p>This is not &amp;ldquo;some countries benefit, others see no effect.&amp;rdquo; It is a genuine sign reversal. Democracy is associated with growth in one group and with decline in another.&lt;/p>
&lt;hr>
&lt;h2 id="9-comparison-what-the-pooled-model-misses">9. Comparison: What the Pooled Model Misses&lt;/h2>
&lt;h3 id="91-summary-table">9.1 Summary table&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Pooled FE&lt;/th>
&lt;th>C-LASSO Group 1&lt;/th>
&lt;th>C-LASSO Group 2&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Democracy coefficient&lt;/strong>&lt;/td>
&lt;td>+1.055&lt;/td>
&lt;td>+2.151&lt;/td>
&lt;td>-0.936&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Standard error&lt;/strong>&lt;/td>
&lt;td>0.370&lt;/td>
&lt;td>0.546&lt;/td>
&lt;td>0.348&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>p-value&lt;/strong>&lt;/td>
&lt;td>0.005&lt;/td>
&lt;td>&amp;lt; 0.001&lt;/td>
&lt;td>0.007&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Lagged GDP&lt;/strong>&lt;/td>
&lt;td>0.970&lt;/td>
&lt;td>1.033&lt;/td>
&lt;td>0.979&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Countries&lt;/strong>&lt;/td>
&lt;td>98&lt;/td>
&lt;td>57&lt;/td>
&lt;td>41&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Observations&lt;/strong>&lt;/td>
&lt;td>3,920&lt;/td>
&lt;td>2,280&lt;/td>
&lt;td>1,640&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="92-simpsons-paradox-in-panel-data">9.2 Simpson&amp;rsquo;s paradox in panel data&lt;/h3>
&lt;p>This is &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong> &amp;mdash; the phenomenon where a trend that appears in aggregated data reverses when you look at subgroups.&lt;/p>
&lt;p>Here is a concrete analogy. A hospital treats two types of patients: mild cases and severe cases. For mild cases, Treatment A has a higher survival rate. For severe cases, Treatment A also has a higher survival rate. But when you pool all patients together, Treatment B appears better &amp;mdash; because it treats a disproportionate number of mild (easy) cases. The aggregate reverses the subgroup trend.&lt;/p>
&lt;p>The same thing happened here. The pooled democracy estimate of $+1.055$ sits between $+2.151$ and $-0.936$. It describes neither group accurately. A policymaker relying on the pooled result would conclude that democracy universally promotes growth. They would miss that for 41 countries (42% of the sample), the relationship runs in the opposite direction.&lt;/p>
&lt;p>The savings model showed the same pattern. The insignificant pooled CPI coefficient ($+0.030$) masked significant effects of $-0.160$ and $+0.197$. When effects have opposite signs, pooling does not just underestimate the magnitude. It produces a qualitatively wrong conclusion.&lt;/p>
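&lt;p>The arithmetic of this masking is easy to reproduce. Below is an illustrative Python simulation (not the tutorial&amp;rsquo;s Stata code or data): two equal-sized groups share one regressor series, and their true slopes are set to the postlasso CPI estimates of $-0.160$ and $+0.197$.&lt;/p>

```python
import numpy as np

# Illustrative simulation, not the tutorial's data: two equal-sized
# groups whose true slopes match the postlasso CPI estimates
# (-0.160 and +0.197) from the savings model.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(5, 2, n)                      # shared regressor (e.g. inflation)
y_neg = -0.160 * x + rng.normal(0, 0.1, n)   # group 1: saves less
y_pos = 0.197 * x + rng.normal(0, 0.1, n)    # group 2: saves more

def slope(xs, ys):
    """OLS slope from a simple regression of ys on xs."""
    return np.polyfit(xs, ys, 1)[0]

b1 = slope(x, y_neg)
b2 = slope(x, y_pos)
b_pooled = slope(np.concatenate([x, x]), np.concatenate([y_neg, y_pos]))
print(b1, b2, b_pooled)  # two opposite-signed slopes; pooled slope near zero
```

&lt;p>The two group slopes are recovered with opposite signs, while the pooled slope is their average and lands near zero, echoing the insignificant pooled CPI coefficient of $+0.030$.&lt;/p>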
&lt;h3 id="93-robustness-of-the-group-structure">9.3 Robustness of the group structure&lt;/h3>
&lt;p>Across all three C-LASSO specifications &amp;mdash; static savings, dynamic savings, and democracy &amp;mdash; the IC consistently selected $K = 2$ groups. The CPI sign reversal survived the switch from static to dynamic, despite a shift in group composition (34/22 to 31/25). This consistency suggests the latent groups are real structural features of the data, not artifacts of a particular specification.&lt;/p>
&lt;hr>
&lt;h2 id="10-summary-and-takeaways">10. Summary and Takeaways&lt;/h2>
&lt;h3 id="101-what-we-learned">10.1 What we learned&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Pooled estimates can be misleading.&lt;/strong> The insignificant pooled CPI coefficient ($+0.030$) in the savings model masked opposing effects of $-0.160$ and $+0.197$ in two latent groups. The pooled democracy coefficient ($+1.055$) masked a split of $+2.151$ versus $-0.936$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>C-LASSO finds latent groups.&lt;/strong> In all three specifications, the information criterion selected $K = 2$ groups, revealing binary latent structures in both datasets. The &lt;code>classifylasso&lt;/code> command handles the full workflow: estimation, group selection, and postestimation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The &lt;code>dynamic&lt;/code> option corrects Nickell bias.&lt;/strong> When lagged dependent variables are included, the half-panel jackknife bias correction preserves the group structure while improving within-group R-squared (from 0.20&amp;ndash;0.24 in the static model to 0.44&amp;ndash;0.50 in the dynamic model).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Postestimation tools aid interpretation.&lt;/strong> The &lt;code>classogroup&lt;/code> command visualizes the information criterion, &lt;code>classocoef&lt;/code> plots group-specific coefficients with confidence bands, and &lt;code>predict gid&lt;/code> assigns countries to groups.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="102-limitations">10.2 Limitations&lt;/h3>
&lt;p>Three caveats. First, the IC values in the democracy model were very close across $K = 1$ through $K = 5$ (range 3.267&amp;ndash;3.280). The 2-group structure is optimal but not dominant. Second, the datasets use numeric country codes, not names. We cannot easily identify which countries are in which group. Third, C-LASSO is computationally intensive. The democracy model took over 2.5 hours. Plan accordingly.&lt;/p>
&lt;h3 id="103-exercises">10.3 Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity analysis.&lt;/strong> Re-run the democracy model with &lt;code>rho(0.5)&lt;/code> and &lt;code>rho(1.0)&lt;/code> instead of &lt;code>rho(0.2)&lt;/code>. Does the selected number of groups change? How sensitive are the group assignments to this tuning parameter?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Extended lag structure.&lt;/strong> Following the reference &lt;code>empirical.do&lt;/code>, estimate the democracy model with 2, 3, and 4 lags of GDP (&lt;code>ly1-ly2&lt;/code>, &lt;code>ly1-ly3&lt;/code>, &lt;code>ly1-ly4&lt;/code>). Do the group-specific democracy coefficients remain stable?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Static vs dynamic comparison.&lt;/strong> Run &lt;code>classifylasso savings cpi interest gdp&lt;/code> (without &lt;code>dynamic&lt;/code>) on the savings data and compare group assignments with the dynamic model using &lt;code>tabulate gid_static gid_dynamic&lt;/code>. How many countries switch groups?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Su, L., Shi, Z., and Phillips, P. C. B. (2016). &lt;a href="https://doi.org/10.3982/ECTA12560" target="_blank" rel="noopener">Identifying latent structures in panel data&lt;/a>. &lt;em>Econometrica&lt;/em>, 84(6), 2215&amp;ndash;2264.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Huang, W., Wang, Y., and Zhou, L. (2024). &lt;a href="https://doi.org/10.1177/1536867X241233664" target="_blank" rel="noopener">Identify latent group structures in panel data: The classifylasso command&lt;/a>. &lt;em>Stata Journal&lt;/em>, 24(1), 173&amp;ndash;203.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Acemoglu, D., Naidu, S., Restrepo, P., and Robinson, J. A. (2019). &lt;a href="https://doi.org/10.1086/700936" target="_blank" rel="noopener">Democracy does cause growth&lt;/a>. &lt;em>Journal of Political Economy&lt;/em>, 127(1), 47&amp;ndash;100.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dhaene, G. and Jochmans, K. (2015). &lt;a href="https://doi.org/10.1093/restud/rdv007" target="_blank" rel="noopener">Split-panel jackknife estimation of fixed-effect models&lt;/a>. &lt;em>Review of Economic Studies&lt;/em>, 82(3), 991&amp;ndash;1030.&lt;/p>
&lt;/li>
&lt;/ol></description></item><item><title>What Does TWFE Actually Do? Manual Demeaning and the FWL Theorem</title><link>https://carlos-mendez.org/post/r_demeaning_twfe/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_demeaning_twfe/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Two-way fixed effects (TWFE) is one of the most widely used estimators in applied economics. Packages like &lt;code>fixest&lt;/code> make it easy to estimate TWFE models with a single line of code. But what does the estimator actually &lt;em>do&lt;/em> to the data? Why do time-invariant regressors like geography or colonial origin get dropped? And if you run &lt;code>lm()&lt;/code> on manually demeaned data, should you get the same answer?&lt;/p>
&lt;p>This tutorial answers these questions by taking TWFE apart. We estimate a standard growth regression with country and time fixed effects, then replicate the exact same coefficients by hand &amp;mdash; subtracting country means, time means, and adding back the grand mean before running ordinary least squares. The result is not an approximation: the coefficients match to 12 significant digits. The theoretical foundation for this equivalence is the &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong>, a fundamental result in econometrics that connects controlling for variables in a regression to projecting them out by residualization.&lt;/p>
&lt;p>We use a balanced panel of 150 countries observed over 8 time periods from the Barro convergence dataset. Along the way, we also discover why standard errors from naive &lt;code>lm()&lt;/code> on demeaned data are wrong &amp;mdash; and why you should always use a dedicated panel estimator for inference.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand what two-way fixed effects does mechanically to the data and why time-invariant regressors are dropped&lt;/li>
&lt;li>Implement the two-way demeaning formula step by step: subtract country means, subtract time means, add back the grand mean&lt;/li>
&lt;li>Verify the Frisch-Waugh-Lovell theorem empirically by comparing &lt;code>feols()&lt;/code> and &lt;code>lm()&lt;/code> coefficients&lt;/li>
&lt;li>Interpret why naive standard errors from &lt;code>lm()&lt;/code> on demeaned data are incorrect and how &lt;code>fixest&lt;/code> corrects them&lt;/li>
&lt;li>Visualize the demeaning transformation to build intuition about within-variation identification&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-frisch-waugh-lovell-theorem">2. The Frisch-Waugh-Lovell Theorem&lt;/h2>
&lt;p>Before diving into code, let us build the conceptual foundation. The FWL theorem answers a simple question: if you want to estimate the effect of $X$ on $Y$ while controlling for a set of variables $Z$, do you need to include everything in one big regression?&lt;/p>
&lt;p>Think of it like noise-canceling headphones. Instead of listening to music with the engine noise mixed in, the headphones first &lt;em>subtract out&lt;/em> the engine noise from what you hear. The result is the same music you would hear in a silent room. The FWL theorem says: instead of including all control variables in one regression, you can first &amp;ldquo;subtract them out&amp;rdquo; from both $Y$ and $X$, and then regress the residuals on each other. The coefficient on $X$ will be identical either way.&lt;/p>
&lt;h3 id="applying-fwl-to-two-way-fixed-effects">Applying FWL to two-way fixed effects&lt;/h3>
&lt;p>In a TWFE model, the &amp;ldquo;controls&amp;rdquo; $Z$ are the full set of country dummies and time dummies. Including all these dummies is equivalent to subtracting group means. For a variable $x_{it}$ observed for country $i$ in period $t$, the &lt;strong>two-way demeaned&lt;/strong> version is:&lt;/p>
&lt;p>$$\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$$&lt;/p>
&lt;p>In words, this formula says: take the observed value, subtract the country average (to remove persistent country differences), subtract the time-period average (to remove common shocks), and add back the overall average (to correct for double-subtracting the grand mean).&lt;/p>
&lt;p>Here is what each symbol means:&lt;/p>
&lt;ul>
&lt;li>$x_{it}$ is the observed value for country $i$ at time $t$ &amp;mdash; in code, this is a single cell in the panel dataset&lt;/li>
&lt;li>$\bar{x}_{i \cdot}$ is the &lt;strong>country mean&lt;/strong> &amp;mdash; the average of $x$ across all periods for country $i$&lt;/li>
&lt;li>$\bar{x}_{\cdot t}$ is the &lt;strong>time mean&lt;/strong> &amp;mdash; the average of $x$ across all countries in period $t$&lt;/li>
&lt;li>$\bar{x}_{\cdot \cdot}$ is the &lt;strong>grand mean&lt;/strong> &amp;mdash; the overall average of $x$ across all observations&lt;/li>
&lt;/ul>
&lt;h3 id="why-add-back-the-grand-mean">Why add back the grand mean?&lt;/h3>
&lt;p>When we subtract both the country mean and the time mean, the grand mean gets subtracted &lt;em>twice&lt;/em> &amp;mdash; once as part of $\bar{x}_{i \cdot}$ and once as part of $\bar{x}_{\cdot t}$. Adding $\bar{x}_{\cdot \cdot}$ back corrects for this double subtraction. Think of it like a Venn diagram with two overlapping circles. If you subtract both circles entirely, the overlap region gets removed twice. Adding the overlap back once restores the correct amount. Without this correction, the demeaned variables would not be centered at zero, and the equivalence with TWFE would break.&lt;/p>
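&lt;p>A tiny numeric sketch makes the double subtraction concrete. The snippet below uses Python (for illustration only; the tutorial&amp;rsquo;s code is in R) on a hypothetical 2-by-2 panel:&lt;/p>

```python
import numpy as np

# Toy 2-entity x 2-period panel (rows = entities, columns = periods)
x = np.array([[1.0, 3.0],
              [5.0, 9.0]])
entity_means = x.mean(axis=1, keepdims=True)  # per-entity averages
time_means = x.mean(axis=0, keepdims=True)    # per-period averages
grand_mean = x.mean()                          # overall average = 4.5

wrong = x - entity_means - time_means              # grand mean removed twice
right = x - entity_means - time_means + grand_mean # corrected

print(wrong.mean())  # -4.5: off by exactly one grand mean
print(right.mean())  # 0.0: properly centered
```

&lt;p>Without the add-back, every demeaned value is off by exactly one grand mean, so the transformed data is not centered at zero.&lt;/p>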
&lt;p>The FWL theorem guarantees this equivalence formally:&lt;/p>
&lt;p>$$\hat{\beta}_{\text{TWFE}} = \hat{\beta}_{\text{OLS on demeaned data}}$$&lt;/p>
&lt;p>In words, the slope coefficients from a regression that includes a full set of entity and time dummies are exactly equal to the slopes from OLS applied to the two-way demeaned data. Not approximately &amp;mdash; exactly. Let us verify this with real data.&lt;/p>
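&lt;p>Before turning to the real panel, the theorem can be sanity-checked on synthetic data. The following Python sketch is purely illustrative (made-up data and names; the tutorial&amp;rsquo;s own verification uses R on the Barro panel): it compares the coefficient on $x$ from the long regression with the slope from regressing residualized $y$ on residualized $x$.&lt;/p>

```python
import numpy as np

# Synthetic FWL check: coefficient on x from the long regression equals
# the slope from a simple regression of residualized y on residualized x.
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=(n, 3))                                # controls Z
x = z @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)
y = 2.0 * x + z @ np.array([1.0, 0.3, -0.7]) + rng.normal(size=n)

def residualize(v, Z):
    """Residual of v after OLS on Z plus a constant."""
    Zc = np.column_stack([np.ones(len(v)), Z])
    beta, *_ = np.linalg.lstsq(Zc, v, rcond=None)
    return v - Zc @ beta

# Long regression: y on constant, x, and Z
X_long = np.column_stack([np.ones(n), x, z])
b_long, *_ = np.linalg.lstsq(X_long, y, rcond=None)

# FWL route: partial Z out of both y and x, then a simple regression
y_t = residualize(y, z)
x_t = residualize(x, z)
b_fwl = (x_t @ y_t) / (x_t @ x_t)

print(b_long[1] - b_fwl)  # ~0, equal to machine precision
```

&lt;p>The two routes agree to floating-point precision, exactly as the theorem predicts; the demeaning used by TWFE is the special case where $Z$ is a full set of entity and time dummies.&lt;/p>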
&lt;h2 id="3-setup">3. Setup&lt;/h2>
&lt;p>We need &lt;code>fixest&lt;/code> for TWFE estimation and &lt;code>tidyverse&lt;/code> for data wrangling and visualization. The &lt;code>scales&lt;/code> package provides axis formatting utilities.&lt;/p>
&lt;pre>&lt;code class="language-r">library(fixest)
library(tidyverse)
library(scales)
set.seed(42)
# Site color palette
STEEL_BLUE &amp;lt;- &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE &amp;lt;- &amp;quot;#d97757&amp;quot;
NEAR_BLACK &amp;lt;- &amp;quot;#141413&amp;quot;
TEAL &amp;lt;- &amp;quot;#00d4c8&amp;quot;
# Variables to demean
VARS_TO_DEMEAN &amp;lt;- c(&amp;quot;growth&amp;quot;, &amp;quot;ln_y_initial&amp;quot;, &amp;quot;log_s_k&amp;quot;,
                    &amp;quot;log_n_gd&amp;quot;, &amp;quot;log_hcap&amp;quot;, &amp;quot;gov_cons&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>We define the six variables that will be demeaned: the dependent variable (&lt;code>growth&lt;/code>) and all five regressors. Keeping them in a vector allows us to apply the demeaning formula programmatically rather than copying and pasting for each variable.&lt;/p>
&lt;h2 id="4-data-loading-and-panel-structure">4. Data Loading and Panel Structure&lt;/h2>
&lt;p>We load a balanced panel dataset with 150 countries observed over 8 time periods. The data comes from a Barro convergence exercise where the key question is whether poorer countries grow faster (conditional convergence). We convert &lt;code>id&lt;/code> and &lt;code>time&lt;/code> to factors so R treats them as categorical grouping variables.&lt;/p>
&lt;pre>&lt;code class="language-r">panel_data &amp;lt;- read.csv(&amp;quot;referenceMaterials/barro_convergence_panel.csv&amp;quot;)
panel_data$id &amp;lt;- factor(panel_data$id)
panel_data$time &amp;lt;- factor(panel_data$time)
cat(&amp;quot;Countries:&amp;quot;, nlevels(panel_data$id), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Time periods:&amp;quot;, nlevels(panel_data$time), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Total observations:&amp;quot;, nrow(panel_data), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Balanced panel:&amp;quot;, all(table(panel_data$id) == nlevels(panel_data$time)), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Countries: 150
Time periods: 8
Total observations: 1200
Balanced panel: TRUE
&lt;/code>&lt;/pre>
&lt;p>The dataset is a perfectly balanced panel of 150 countries observed across 8 time periods, yielding 1,200 total observations. A balanced panel means every country appears in every period with no missing cells &amp;mdash; the ideal setting for demonstrating the demeaning formula. The key variables are:&lt;/p>
&lt;ul>
&lt;li>&lt;code>growth&lt;/code>: annualized GDP per capita growth rate (dependent variable)&lt;/li>
&lt;li>&lt;code>ln_y_initial&lt;/code>: log of initial income (convergence term)&lt;/li>
&lt;li>&lt;code>log_s_k&lt;/code>: log of the investment share&lt;/li>
&lt;li>&lt;code>log_n_gd&lt;/code>: log of population growth plus depreciation&lt;/li>
&lt;li>&lt;code>log_hcap&lt;/code>: log of human capital&lt;/li>
&lt;li>&lt;code>gov_cons&lt;/code>: government consumption share&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="r_demeaning_twfe_panel_structure.png" alt="Panel structure: 150 countries across 8 time periods, all cells filled.">
&lt;em>Panel structure heatmap showing all 150 countries observed across 8 time periods with no missing cells.&lt;/em>&lt;/p>
&lt;p>The heatmap confirms the balanced structure. Every one of the 150 countries is observed in all 8 time periods. This balance simplifies our demeaning procedure because we can use the closed-form formula directly, without the iterative projection that unbalanced panels would require.&lt;/p>
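&lt;p>For intuition on what &amp;ldquo;iterative projection&amp;rdquo; means, here is an illustrative Python sketch (toy data, not the Barro panel, and Python rather than the tutorial&amp;rsquo;s R): when cells are missing, the closed-form formula no longer removes both fixed effects in one shot, but alternately sweeping out entity means and time means converges to the two-way within transformation.&lt;/p>

```python
import numpy as np
import pandas as pd

# Sketch of the alternating-projections idea needed for UNBALANCED panels
# (hypothetical toy data; the tutorial's balanced panel avoids this).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "id": np.repeat(np.arange(6), 4),
    "time": np.tile(np.arange(4), 6),
    "x": rng.normal(size=24),
}).drop(index=[0, 5, 13])   # remove three cells, making the panel unbalanced

v = df["x"].to_numpy(dtype=float)
for _ in range(2000):       # alternate entity demeaning and time demeaning
    v = v - df.assign(v=v).groupby("id")["v"].transform("mean").to_numpy()
    v = v - df.assign(v=v).groupby("time")["v"].transform("mean").to_numpy()

df["x_dm"] = v              # entity and time means of x_dm are now both ~0
```

&lt;p>On a balanced panel a single sweep already reproduces the closed-form result $x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$, which is why the direct formula suffices for our data.&lt;/p>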
&lt;h2 id="5-twfe-estimation-with-fixest">5. TWFE Estimation with fixest&lt;/h2>
&lt;p>The &lt;code>fixest&lt;/code> package makes TWFE estimation straightforward. The formula uses &lt;code>|&lt;/code> to separate the regressors (left) from the fixed effects dimensions (right). Writing &lt;code>| id + time&lt;/code> tells &lt;code>feols()&lt;/code> to absorb both country and time fixed effects. Internally, &lt;code>fixest&lt;/code> performs an efficient iterative demeaning algorithm to remove the fixed effects before estimating the slope coefficients.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_model &amp;lt;- feols(
  growth ~ ln_y_initial + log_s_k + log_n_gd + log_hcap + gov_cons | id + time,
  data = panel_data
)
summary(twfe_model)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OLS estimation, Dep. Var.: growth
Observations: 1,200
Fixed-effects: id: 150, time: 8
Standard-errors: Clustered (id)
Estimate Std. Error t value Pr(&amp;gt;|t|)
ln_y_initial -0.055286 0.003744 -14.765156 &amp;lt; 2.2e-16 ***
log_s_k 0.019725 0.007583 2.601311 0.010223 *
log_n_gd -0.049614 0.022168 -2.238117 0.026696 *
log_hcap 0.009081 0.014564 0.623549 0.533877
gov_cons -0.102795 0.046398 -2.215501 0.028243 *
RMSE: 0.020517 Adj. R2: 0.755103
Within R2: 0.176777
&lt;/code>&lt;/pre>
&lt;p>The TWFE model reveals strong conditional beta-convergence &amp;mdash; the hypothesis that poorer countries tend to grow faster, so income levels converge over time. The coefficient on log initial income is -0.055 (t = -14.77, p &amp;lt; 2.2e-16), meaning that a 1% higher initial income is associated with 0.055 percentage points slower subsequent growth, after controlling for the other covariates. Investment has the expected positive effect (0.020, p = 0.010), population growth has the expected negative effect (-0.050, p = 0.027), and government consumption is significantly negative (-0.103, p = 0.028). Human capital is positive but not statistically significant (0.009, p = 0.534). The model explains 75.5% of total variation (Adj. R-squared = 0.755), though only 17.7% of the within-variation (Within R-squared = 0.177) &amp;mdash; typical for panel models where fixed effects absorb most cross-country heterogeneity.&lt;/p>
&lt;p>Now let us replicate these coefficients by hand.&lt;/p>
&lt;h2 id="6-manual-demeaning-----step-by-step">6. Manual Demeaning &amp;mdash; Step by Step&lt;/h2>
&lt;p>We now walk through the demeaning procedure one step at a time. The goal is to transform every variable so that the country and time effects are removed. We will then run plain OLS on the result and verify that the coefficients match.&lt;/p>
&lt;h3 id="step-1-country-means">Step 1: Country means&lt;/h3>
&lt;p>For each country, we compute the average of each variable across all time periods. This gives us one mean per country per variable &amp;mdash; capturing persistent country characteristics like geography, institutions, or long-run income level.&lt;/p>
&lt;pre>&lt;code class="language-r">country_means &amp;lt;- panel_data |&amp;gt;
  group_by(id) |&amp;gt;
  summarise(across(all_of(VARS_TO_DEMEAN), mean), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-2-time-means">Step 2: Time means&lt;/h3>
&lt;p>For each time period, we compute the average of each variable across all countries. These time means capture common shocks or trends that affect all countries in a given period &amp;mdash; for instance, a global recession or a worldwide productivity boom.&lt;/p>
&lt;pre>&lt;code class="language-r">time_means &amp;lt;- panel_data |&amp;gt;
  group_by(time) |&amp;gt;
  summarise(across(all_of(VARS_TO_DEMEAN), mean), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-3-grand-mean">Step 3: Grand mean&lt;/h3>
&lt;p>The grand mean is simply the overall average of each variable across all countries and all time periods. It is a single number per variable, and we need it to correct for the double subtraction.&lt;/p>
&lt;pre>&lt;code class="language-r">grand_means &amp;lt;- colMeans(panel_data[VARS_TO_DEMEAN])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> growth ln_y_initial log_s_k log_n_gd log_hcap gov_cons
-0.1243637 5.3643127 -1.5699117 -2.6569021 0.6645657 0.1461335
&lt;/code>&lt;/pre>
&lt;h3 id="step-4-apply-the-demeaning-formula">Step 4: Apply the demeaning formula&lt;/h3>
&lt;p>Now we bring everything together. We merge the country means and time means back into the main dataset, then apply the formula $\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$ programmatically to each variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Merge means
panel_dm &amp;lt;- panel_data |&amp;gt;
  left_join(
    country_means |&amp;gt; rename_with(~ paste0(.x, &amp;quot;_cmean&amp;quot;), all_of(VARS_TO_DEMEAN)),
    by = &amp;quot;id&amp;quot;
  ) |&amp;gt;
  left_join(
    time_means |&amp;gt; rename_with(~ paste0(.x, &amp;quot;_tmean&amp;quot;), all_of(VARS_TO_DEMEAN)),
    by = &amp;quot;time&amp;quot;
  )
# Apply demeaning formula
for (v in VARS_TO_DEMEAN) {
  panel_dm[[paste0(v, &amp;quot;_dm&amp;quot;)]] &amp;lt;-
    panel_dm[[v]] -
    panel_dm[[paste0(v, &amp;quot;_cmean&amp;quot;)]] -
    panel_dm[[paste0(v, &amp;quot;_tmean&amp;quot;)]] +
    grand_means[v]
}
&lt;/code>&lt;/pre>
&lt;p>Let us verify that the demeaning worked correctly. If the formula is implemented right, the mean of each demeaned variable should be approximately zero.&lt;/p>
&lt;pre>&lt;code class="language-text">Mean of demeaned variables (should be ~0):
growth_dm : -8.114169e-17
ln_y_initial_dm : 8.295170e-15
log_s_k_dm : -1.482923e-15
log_n_gd_dm : 1.599953e-15
log_hcap_dm : 5.384582e-17
gov_cons_dm : 1.832302e-16
&lt;/code>&lt;/pre>
&lt;p>All six demeaned variables have means on the order of $10^{-15}$ to $10^{-17}$ &amp;mdash; effectively zero within floating-point precision. The demeaning formula is implemented correctly: the within-variation that remains is purely the deviation from both entity-specific and time-specific patterns.&lt;/p>
&lt;h2 id="7-ols-on-the-demeaned-data">7. OLS on the Demeaned Data&lt;/h2>
&lt;p>With the demeaning complete, we run a standard OLS regression on the demeaned variables using base R&amp;rsquo;s &lt;code>lm()&lt;/code>. We deliberately use &lt;code>lm()&lt;/code> rather than &lt;code>feols()&lt;/code> to emphasize that this is plain ordinary least squares &amp;mdash; no fixed effects machinery is involved.&lt;/p>
&lt;pre>&lt;code class="language-r">manual_model &amp;lt;- lm(
  growth_dm ~ ln_y_initial_dm + log_s_k_dm + log_n_gd_dm + log_hcap_dm + gov_cons_dm,
  data = panel_dm
)
summary(manual_model)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
(Intercept) 5.035e-16 5.938e-04 0.000 1.00000
ln_y_initial_dm -5.529e-02 3.618e-03 -15.282 &amp;lt; 2e-16 ***
log_s_k_dm 1.972e-02 6.846e-03 2.881 0.00403 **
log_n_gd_dm -4.961e-02 1.820e-02 -2.726 0.00651 **
log_hcap_dm 9.081e-03 1.370e-02 0.663 0.50751
gov_cons_dm -1.028e-01 4.411e-02 -2.331 0.01994 *
Residual standard error: 0.02057 on 1194 degrees of freedom
Multiple R-squared: 0.1768
&lt;/code>&lt;/pre>
&lt;p>Two things stand out. First, the &lt;strong>intercept is 5.03 x 10^-16&lt;/strong> &amp;mdash; effectively zero. After proper two-way demeaning, the mean of all demeaned variables is near zero, so there is nothing left for the intercept to capture. This is a good sanity check: if the grand mean correction had been omitted, the intercept would be non-zero. Second, the &lt;strong>slope coefficients&lt;/strong> look identical to those from &lt;code>feols()&lt;/code>. But &amp;ldquo;look identical&amp;rdquo; is not the same as &amp;ldquo;are identical.&amp;rdquo; The next section proves they are.&lt;/p>
&lt;h2 id="8-coefficient-comparison-the-proof">8. Coefficient Comparison: The Proof&lt;/h2>
&lt;p>We now place the coefficients from both approaches side by side and compute their difference. If the FWL theorem holds, the slope coefficients must be identical up to floating-point precision.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_coefs &amp;lt;- coef(twfe_model)
manual_coefs &amp;lt;- coef(manual_model)[-1] # drop intercept
names(manual_coefs) &amp;lt;- names(twfe_coefs)
comparison &amp;lt;- data.frame(
  feols_TWFE = round(twfe_coefs, 12),
  Manual_OLS = round(manual_coefs, 12),
  Difference = twfe_coefs - manual_coefs
)
all.equal(unname(twfe_coefs), unname(manual_coefs))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Side-by-side coefficient comparison:
variable feols_TWFE manual_OLS difference
ln_y_initial -0.055286009819 -0.055286009819 -4.163336342e-17
log_s_k 0.019724899416 0.019724899416 3.469446952e-18
log_n_gd -0.049613972524 -0.049613972524 -2.775557562e-16
log_hcap 0.009081150621 0.009081150621 3.469446952e-17
gov_cons -0.102795317426 -0.102795317426 -3.053113318e-16
Maximum absolute difference: 3.053113e-16
all.equal() test: TRUE
&lt;/code>&lt;/pre>
&lt;p>This is the central result of the tutorial. All five slope coefficients are identical to at least 12 significant digits. The largest difference is 3.05 x 10^-16 &amp;mdash; on the order of IEEE 754 double-precision machine epsilon (~2.2 x 10^-16). R&amp;rsquo;s &lt;code>all.equal()&lt;/code> function confirms equality within its default tolerance. This is not an approximation: it is an exact algebraic identity guaranteed by the Frisch-Waugh-Lovell theorem.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_coef_comparison.png" alt="TWFE and manual demeaning coefficients overlap perfectly for all five variables.">
&lt;em>Coefficient comparison: feols TWFE (blue circles) and manual demeaning OLS (orange triangles) occupy the exact same positions.&lt;/em>&lt;/p>
&lt;p>The dot plot makes the equivalence visually concrete. For each of the five covariates, the steel blue circle (feols TWFE) and warm orange triangle (manual demeaning OLS) occupy the exact same position. Government consumption has the largest coefficient in magnitude at -0.103, while the convergence parameter (log initial income) sits at -0.055. The dashed zero line helps distinguish positive from negative effects.&lt;/p>
&lt;h2 id="9-visualizing-what-demeaning-does">9. Visualizing What Demeaning Does&lt;/h2>
&lt;p>The coefficient equivalence is proven, but what does demeaning &lt;em>look like&lt;/em>? How does it change the data? The following visualizations build intuition about the transformation.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_scatter_before_after.png" alt="Raw data shows wide cross-country spread; demeaned data collapses to a narrow range around zero.">
&lt;em>Before vs after two-way demeaning: the wide cross-country spread (left) collapses to a narrow range around zero (right).&lt;/em>&lt;/p>
&lt;p>The faceted scatter plot tells the story. In the left panel (raw data), 10 countries are plotted with log initial income on the x-axis and growth on the y-axis. Each country&amp;rsquo;s observations form a distinct cluster at different income levels &amp;mdash; the x-axis spans roughly 3 to 9. In the right panel (after demeaning), the same data is compressed to approximately -0.5 to 0.3 around zero. The between-country income differences and common time trends have been stripped away, leaving only the &lt;strong>within-variation&lt;/strong> &amp;mdash; the deviations from each country&amp;rsquo;s own average and each period&amp;rsquo;s common trend. This is the variation that identifies the TWFE coefficient.&lt;/p>
&lt;h3 id="decomposing-the-formula-for-one-country">Decomposing the formula for one country&lt;/h3>
&lt;p>To see exactly how the formula works, let us trace each component for Country 1&amp;rsquo;s growth rate across all 8 periods.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_decomposition.png" alt="Observed values, country mean, time means, grand mean, and the demeaned residual for Country 1.">
&lt;em>Demeaning decomposition for Country 1: observed growth (blue), country mean (orange dashed), time means (teal), grand mean (gray), and the demeaned residual (black).&lt;/em>&lt;/p>
&lt;p>The decomposition makes the formula concrete. The observed growth values (blue line) decline from about -0.18 to -0.07. The country mean (orange dashed line) is a flat horizontal at -0.127 &amp;mdash; this is $\bar{x}_{i \cdot}$. The time means (teal dot-dash line) capture the common cross-country trend, declining from -0.189 to -0.076 &amp;mdash; this is $\bar{x}_{\cdot t}$. The grand mean (gray dotted) sits at -0.124 &amp;mdash; this is $\bar{x}_{\cdot \cdot}$. The demeaned series (black line) is the residual: $\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$. It fluctuates around zero, capturing only the within-country, within-period deviations that TWFE uses for identification.&lt;/p>
&lt;h2 id="10-a-caveat-standard-errors-differ">10. A Caveat: Standard Errors Differ&lt;/h2>
&lt;p>While the coefficients are identical, the &lt;strong>standard errors&lt;/strong> from &lt;code>lm()&lt;/code> on demeaned data are wrong. This is a critical practical point that many textbooks gloss over.&lt;/p>
&lt;pre>&lt;code class="language-r">se_naive &amp;lt;- summary(manual_model)$coefficients[-1, &amp;quot;Std. Error&amp;quot;]
se_feols_iid &amp;lt;- se(twfe_model, se = &amp;quot;iid&amp;quot;)
se_feols_cl &amp;lt;- se(twfe_model) # default: clustered by first FE
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Standard error comparison:
variable se_naive_lm se_feols_iid se_feols_cluster
ln_y_initial 0.00361766 0.00388000 0.00374436
log_s_k 0.00684559 0.00734199 0.00758268
log_n_gd 0.01820117 0.01952104 0.02216773
log_hcap 0.01369872 0.01469209 0.01456365
gov_cons 0.04410809 0.04730660 0.04639822
&lt;/code>&lt;/pre>
&lt;p>Why do they differ? The &lt;code>lm()&lt;/code> function does not know that 157 degrees of freedom were consumed by estimating 150 country effects and 8 time effects (minus 1 for normalization). It uses $df = N \times T - K - 1 = 1{,}194$ residual degrees of freedom (five slopes plus an intercept, as the &lt;code>lm()&lt;/code> output above reports) when the correct value is $df = N \times T - (N + T - 1) - K = 1{,}038$. This makes naive SEs systematically too small &amp;mdash; they understate uncertainty by 7&amp;ndash;22% depending on the variable.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_se_comparison.png" alt="Naive lm() SEs are systematically smaller than both feols variants.">
&lt;em>Standard error comparison: naive lm() (gray) systematically underestimates uncertainty compared to feols IID (orange) and clustered (blue).&lt;/em>&lt;/p>
&lt;p>The grouped bar chart makes the pattern clear. For every variable, the gray bars (naive &lt;code>lm()&lt;/code>) are shorter than the orange (feols IID) and blue (feols clustered) bars. The gap is most visible for &lt;code>log(n+g+d)&lt;/code>, where the naive SE is 0.0182 versus 0.0222 for clustered &amp;mdash; a 22% understatement. The feols IID SEs correct for the degrees-of-freedom adjustment, while the clustered SEs additionally account for within-entity serial correlation. The practical lesson: &lt;strong>always use a dedicated panel estimator for inference&lt;/strong>, even though &lt;code>lm()&lt;/code> on demeaned data gives the correct point estimates.&lt;/p>
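&lt;p>The degrees-of-freedom arithmetic can be checked directly. Here is a short illustrative Python calculation (using the panel dimensions and the reported &lt;code>ln_y_initial&lt;/code> standard error from above; not part of the tutorial&amp;rsquo;s R code):&lt;/p>

```python
import math

# Degrees-of-freedom accounting for naive lm() vs feols IID standard errors,
# using the tutorial's panel dimensions (illustrative arithmetic).
N, T, K = 150, 8, 5             # countries, periods, slope coefficients
nobs = N * T                    # 1,200 observations

df_naive = nobs - K - 1         # lm() on demeaned data: 5 slopes + intercept
df_fe = nobs - (N + T - 1) - K  # absorbed country and time effects counted

inflation = math.sqrt(df_naive / df_fe)
print(df_naive, df_fe)          # 1194 1038
print(round(inflation, 4))      # 1.0725

# Scaling the naive SE for ln_y_initial reproduces the feols IID SE:
print(round(0.00361766 * inflation, 5))  # 0.00388
```

&lt;p>Multiplying the naive standard error by $\sqrt{df_{\text{naive}} / df_{\text{FE}}} \approx 1.07$ reproduces the &lt;code>feols&lt;/code> IID standard error to the reported precision: the gap between the two IID variants is the degrees-of-freedom correction.&lt;/p>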
&lt;h2 id="11-discussion">11. Discussion&lt;/h2>
&lt;p>This tutorial has demonstrated a fundamental equivalence in econometrics. TWFE is not a special estimator &amp;mdash; it is ordinary least squares applied to data that has been demeaned by entity and time. The &lt;code>fixest&lt;/code> package automates this process efficiently, but the underlying operation is straightforward subtraction. The FWL theorem guarantees the equivalence mathematically, and our empirical verification confirms it to machine precision.&lt;/p>
&lt;p>Three practical insights emerge:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Demeaning reveals what FE can and cannot identify.&lt;/strong> Any variable that does not vary within a country over time (like geography or colonial history) has a country mean equal to itself. After demeaning, such a variable becomes zero everywhere and drops out of the regression. This is why fixed effects models cannot estimate the effect of time-invariant characteristics.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The grand mean correction is not optional.&lt;/strong> Omitting the $+ \bar{x}_{\cdot \cdot}$ term in the demeaning formula would double-subtract the overall level, producing a non-zero intercept and subtly wrong demeaned values. The correction is algebraically necessary for the FWL equivalence to hold.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Correct coefficients do not mean correct inference.&lt;/strong> The &lt;code>lm()&lt;/code> standard errors are too small because they ignore the degrees of freedom consumed by the absorbed fixed effects. In applied work, this means artificially narrow confidence intervals and inflated t-statistics. Always use &lt;code>feols()&lt;/code> or an equivalent panel estimator for standard errors and hypothesis testing.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-summary-and-next-steps">12. Summary and Next Steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>TWFE estimation via &lt;code>feols()&lt;/code> and OLS on manually demeaned data produce identical coefficients &amp;mdash; the maximum difference across 5 coefficients is $3.05 \times 10^{-16}$, confirming the FWL theorem.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The demeaning formula subtracts entity means and time means, then adds back the grand mean to correct for double subtraction. After demeaning, all variable means are effectively zero (on the order of $10^{-15}$).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Within R-squared of 0.177 versus the overall Adjusted R-squared of 0.755 shows that most variation in growth is absorbed by the fixed effects, not by the regressors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Naive &lt;code>lm()&lt;/code> standard errors understate uncertainty by 7&amp;ndash;22% because they ignore the 157 degrees of freedom consumed by the fixed effects. Always use a dedicated panel estimator for inference.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The dataset is simulated, so coefficient values reflect the data-generating process rather than real-world economic dynamics.&lt;/li>
&lt;li>The tutorial assumes a balanced panel. With unbalanced panels, the simple closed-form demeaning still works algebraically, but &lt;code>fixest&lt;/code> uses a more efficient iterative algorithm.&lt;/li>
&lt;li>The SE comparison covers only IID and entity-clustered SEs. Other corrections (heteroskedasticity-robust, Driscoll-Kraay for cross-sectional dependence) may be relevant in applied work.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Apply the demeaning logic to understand why specific variables drop out of your own FE models.&lt;/li>
&lt;li>Explore heterogeneous treatment effects with interaction-weighted TWFE estimators.&lt;/li>
&lt;li>Read Cunningham (2021), &lt;em>Causal Inference: The Mixtape&lt;/em>, Chapter 9, for the connection between TWFE demeaning and difference-in-differences designs.&lt;/li>
&lt;/ul>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Omit the grand mean correction.&lt;/strong> Modify the demeaning formula to skip the $+ \bar{x}_{\cdot \cdot}$ term. Run &lt;code>lm()&lt;/code> on the incorrectly demeaned data. What happens to the intercept? Do the slope coefficients still match the TWFE estimates? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>One-way demeaning.&lt;/strong> Repeat the exercise using only entity demeaning (subtract country means, skip time means). Compare the coefficients to a one-way FE model (&lt;code>feols(growth ~ ... | id)&lt;/code>). Verify the equivalence and examine how the coefficients change compared to the two-way specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Visualize a different variable.&lt;/strong> Recreate the demeaning decomposition plot (Section 9) for &lt;code>log_s_k&lt;/code> (investment share) instead of &lt;code>growth&lt;/code>. Does the country mean, time mean, or within-variation dominate for this variable? What does this tell you about the source of variation that identifies its coefficient?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Frisch, R. and Waugh, F.V. (1933). &amp;ldquo;Partial Time Regressions as Compared with Individual Trends.&amp;rdquo; &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Lovell, M.C. (1963). &amp;ldquo;Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis.&amp;rdquo; &lt;em>Journal of the American Statistical Association&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Berge, L. (2018). &lt;em>fixest: Fast Fixed-Effects Estimations&lt;/em>. R package. &lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">CRAN&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cunningham, S. (2021). &lt;em>Causal Inference: The Mixtape&lt;/em>. Yale University Press. &lt;a href="https://mixtape.scunning.com/" target="_blank" rel="noopener">Online edition&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Barro, R.J. and Sala-i-Martin, X. (2004). &lt;em>Economic Growth&lt;/em>. 2nd edition. MIT Press.&lt;/p>
&lt;/li>
&lt;/ol></description></item><item><title>Standard Errors in Panel Data: A Beginner's Guide in Python</title><link>https://carlos-mendez.org/post/python_panel_ses/</link><pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_panel_ses/</guid><description>&lt;p>&lt;a href="https://colab.research.google.com/github/cmg777/starter-academic-v501/blob/master/content/post/python_panel_ses/notebook.ipynb" target="_blank">&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">&lt;/a>&lt;/p>
&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you run a regression and find that R&amp;amp;D spending significantly boosts firm performance, with a t-statistic of 30. Sounds like a rock-solid result. But what if that impressive t-statistic is an illusion &amp;mdash; a consequence of using the wrong formula for your standard errors? In panel data, where the same firms are observed year after year, this is not a hypothetical worry. The repeated observations within each firm create &lt;em>correlation patterns&lt;/em> that violate the assumptions behind ordinary standard errors, and ignoring these patterns can make your estimates look far more precise than they actually are.&lt;/p>
&lt;p>Standard errors are the bridge between a point estimate and a statistical conclusion. If that bridge is built on the wrong assumptions, the conclusion collapses. In a classic cross-sectional regression with independent observations, conventional standard errors work well. But panel data &amp;mdash; where firm 1 in 2015 is related to firm 1 in 2016 &amp;mdash; breaks the independence assumption. A firm that performs well one year tends to perform well the next. Errors within the same firm are correlated, and this &lt;em>within-cluster correlation&lt;/em> means conventional standard errors understate the true uncertainty surrounding your estimates.&lt;/p>
&lt;p>The solution is to use standard error estimators that account for the structure of the data. In this tutorial, we build a simulated panel of 100 firms over 10 years with a &lt;em>known true effect&lt;/em>, then systematically compare six approaches to standard error estimation: conventional, White (heteroskedasticity-robust), entity-clustered, time-clustered, two-way clustered, and Driscoll-Kraay. Along the way, we discover two critical lessons. First, no standard error estimator can rescue a biased estimator &amp;mdash; fixed effects are needed to remove omitted variable bias. Second, even after fixing bias, the &lt;em>choice&lt;/em> of standard error estimator determines whether our confidence intervals have the coverage they promise. The tutorial is inspired by and builds upon the excellent reference by &lt;a href="https://vincent.codes.finance/posts/panel-ols-standard-errors/" target="_blank" rel="noopener">Gregoire (2024)&lt;/a>, while using original simulated data and explanations.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why within-cluster correlation invalidates conventional standard errors in panel data&lt;/li>
&lt;li>Implement six standard error estimators using Python&amp;rsquo;s &lt;code>linearmodels&lt;/code> package&lt;/li>
&lt;li>Compare how different SE choices affect t-statistics and inference for the same regression&lt;/li>
&lt;li>Assess empirical rejection rates via Monte Carlo simulation to identify which SEs correctly control size &amp;mdash; that is, reject a true null hypothesis no more than 5% of the time&lt;/li>
&lt;li>Distinguish between the bias problem (which SEs cannot fix) and the inference problem (which SEs can fix)&lt;/li>
&lt;/ul>
&lt;h2 id="2-setup-and-imports">2. Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package if needed:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install linearmodels
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>linearmodels&lt;/code> library, developed by &lt;a href="https://bashtage.github.io/linearmodels/" target="_blank" rel="noopener">Kevin Sheppard&lt;/a>, extends &lt;code>statsmodels&lt;/code> with specialized panel data estimators. It provides &lt;a href="https://bashtage.github.io/linearmodels/panel/panel/linearmodels.panel.model.PanelOLS.html" target="_blank" rel="noopener">PanelOLS&lt;/a> for fixed effects regressions with flexible covariance options. The &lt;code>from_formula()&lt;/code> method accepts R-style formulas where &lt;code>EntityEffects&lt;/code> and &lt;code>TimeEffects&lt;/code> keywords absorb group-level fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from linearmodels.panel import PanelOLS
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>&lt;strong>Dark theme figure styling&lt;/strong> (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="3-the-data-generating-process">3. The data generating process&lt;/h2>
&lt;h3 id="31-why-simulated-data">3.1 Why simulated data?&lt;/h3>
&lt;p>When studying standard errors, simulated data has a decisive advantage over real data: we &lt;em>know the true answer&lt;/em>. If the true effect of R&amp;amp;D on performance is exactly 0.5, we can check whether each standard error estimator produces confidence intervals that contain 0.5 roughly 95% of the time. With real data, we never know the truth, so we cannot directly evaluate whether our SEs are working correctly.&lt;/p>
&lt;p>Think of it like testing a thermometer. You would not test it in unknown conditions &amp;mdash; you would dip it in ice water (0 degrees C) and boiling water (100 degrees C) to see if it reads correctly. Simulated data serves as our &amp;ldquo;known temperature.&amp;rdquo;&lt;/p>
&lt;h3 id="32-the-dgp">3.2 The DGP&lt;/h3>
&lt;p>Our data generating process creates a panel of 100 firms observed over 10 years. The key feature is that &lt;em>firm ability&lt;/em> &amp;mdash; an unobserved characteristic that differs across firms but stays constant over time &amp;mdash; affects both R&amp;amp;D intensity and firm performance. This creates omitted variable bias in pooled regressions, exactly the scenario that motivates fixed effects.&lt;/p>
&lt;p>The true model is:&lt;/p>
&lt;p>$$y_{it} = 2.0 + 0.5 \cdot x_{it} + \mu_i + \lambda_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, firm performance ($y$) equals a constant (2.0) plus the true causal effect of R&amp;amp;D intensity ($x$) times 0.5, plus a firm-specific effect ($\mu_i$), a time-specific effect ($\lambda_t$), and an idiosyncratic error ($\varepsilon_{it}$). The firm effect $\mu_i$ is correlated with $x_{it}$ &amp;mdash; more capable firms invest more in R&amp;amp;D &amp;mdash; which means pooled OLS will overestimate the true effect. The errors follow an AR(1) &amp;mdash; or &lt;em>first-order autoregressive&lt;/em> &amp;mdash; process within each firm, meaning each year&amp;rsquo;s error depends on the previous year&amp;rsquo;s error (with autocorrelation coefficient $\rho = 0.5$). This creates the within-cluster serial correlation that makes standard error choice critical.&lt;/p>
&lt;p>In code, $y$ corresponds to our &lt;code>y&lt;/code> column, $x$ is &lt;code>x&lt;/code> (R&amp;amp;D intensity), and $\mu_i$ is the unobserved firm fixed effect that we will absorb with &lt;code>EntityEffects&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_panel(n_firms=100, n_years=10, seed=42):
&amp;quot;&amp;quot;&amp;quot;Simulate a panel dataset with firm and time effects.
True DGP:
y_it = 2.0 + 0.5 * x_it + mu_i + lambda_t + eps_it
Where mu_i is correlated with x_it (firm ability drives both
R&amp;amp;D and performance), and eps_it has AR(1) serial correlation
within firms (rho = 0.5).
The TRUE causal effect of x on y is beta = 0.5.
&amp;quot;&amp;quot;&amp;quot;
rng = np.random.default_rng(seed)
firms = np.repeat(np.arange(1, n_firms + 1), n_years)
years = np.tile(np.arange(2010, 2010 + n_years), n_firms)
# Firm-level unobserved heterogeneity (ability)
firm_ability = rng.normal(0, 2, n_firms)
mu = np.repeat(firm_ability, n_years)
# Time effects (business cycle)
time_shocks = rng.normal(0, 0.5, n_years)
lam = np.tile(time_shocks, n_firms)
# Treatment: R&amp;amp;D intensity (correlated with firm ability)
x = 3.0 + 0.8 * mu + rng.normal(0, 1.5, n_firms * n_years)
# Idiosyncratic errors with within-firm AR(1) serial correlation
eps = np.zeros(n_firms * n_years)
rho_ar = 0.5
for i in range(n_firms):
start = i * n_years
eps[start] = rng.normal(0, 1.5)
for t in range(1, n_years):
eps[start + t] = rho_ar * eps[start + t - 1] + rng.normal(0, 1.5)
# True model
y = 2.0 + 0.5 * x + mu + lam + eps
return pd.DataFrame({&amp;quot;firm&amp;quot;: firms, &amp;quot;year&amp;quot;: years, &amp;quot;y&amp;quot;: y, &amp;quot;x&amp;quot;: x})
df = simulate_panel(n_firms=100, n_years=10, seed=42)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;Number of firms: {df['firm'].nunique()}&amp;quot;)
print(f&amp;quot;Number of years: {df['year'].nunique()}&amp;quot;)
print(df.head())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset shape: (1000, 4)
Number of firms: 100
Number of years: 10
firm year y x
1 2010 6.721042 4.139183
1 2011 5.889161 3.844151
1 2012 2.355109 2.596322
1 2013 2.589589 1.318461
1 2014 3.569626 3.595742
&lt;/code>&lt;/pre>
&lt;p>The simulated panel contains 1,000 observations &amp;mdash; 100 firms, each observed over 10 years from 2010 to 2019. Firm 1&amp;rsquo;s performance (&lt;code>y&lt;/code>) ranges from about 2.4 to 6.7 across the decade, and its R&amp;amp;D intensity (&lt;code>x&lt;/code>) varies between 1.3 and 4.1. These year-to-year fluctuations within a single firm represent the &lt;em>within-firm variation&lt;/em> that fixed effects regressions exploit, while the systematic differences across firms (some consistently high, others consistently low) represent the &lt;em>between-firm variation&lt;/em> that firm fixed effects absorb.&lt;/p>
&lt;pre>&lt;code class="language-python">print(df.describe().round(4))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> firm year y x
count 1000.0000 1000.0000 1000.0000 1000.0000
mean 50.5000 2014.5000 2.9699 2.8984
std 28.8805 2.8737 2.9686 1.9783
min 1.0000 2010.0000 -7.0880 -3.0834
25% 25.7500 2012.0000 0.9376 1.5721
50% 50.5000 2014.5000 2.9351 2.9669
75% 75.2500 2017.0000 5.0383 4.1769
max 100.0000 2019.0000 13.5170 9.1612
&lt;/code>&lt;/pre>
&lt;p>Firm performance (&lt;code>y&lt;/code>) averages 2.97 with a standard deviation of 2.97, spanning from -7.09 to 13.52. R&amp;amp;D intensity (&lt;code>x&lt;/code>) averages 2.90 with a standard deviation of 1.98. The wide ranges in both variables reflect the combination of genuine within-firm fluctuations and the large cross-firm differences injected by firm fixed effects. Next, we decompose this total variation to understand how much comes from differences &lt;em>between&lt;/em> firms versus changes &lt;em>within&lt;/em> firms over time.&lt;/p>
&lt;h2 id="4-exploring-the-panel-structure">4. Exploring the panel structure&lt;/h2>
&lt;p>Before estimating any model, we need to understand the structure of our panel data. A key diagnostic is the &lt;em>decomposition of variance&lt;/em> into between-firm and within-firm components. This tells us where the action is &amp;mdash; and why pooled OLS can go wrong.&lt;/p>
&lt;h3 id="41-between-vs-within-variation">4.1 Between vs. within variation&lt;/h3>
&lt;p>Think of variation in firm performance like variation in student test scores within a school. Some variation comes from differences &lt;em>between&lt;/em> students (some students are consistently stronger than others) and some comes from variation &lt;em>within&lt;/em> students over time (a student scores differently on different exams). In panel data, the &amp;ldquo;between&amp;rdquo; component captures persistent firm-level differences, while the &amp;ldquo;within&amp;rdquo; component captures how each firm deviates from its own average over time.&lt;/p>
&lt;pre>&lt;code class="language-python"># Panel balance check
obs_per_firm = df.groupby(&amp;quot;firm&amp;quot;).size()
print(f&amp;quot;Observations per firm: min={obs_per_firm.min()}, &amp;quot;
      f&amp;quot;max={obs_per_firm.max()}, mean={obs_per_firm.mean():.1f}&amp;quot;)
print(f&amp;quot;Panel is {'balanced' if obs_per_firm.nunique() == 1 else 'unbalanced'}&amp;quot;)
# Within vs between variation
overall_std_y = df[&amp;quot;y&amp;quot;].std()
between_std_y = df.groupby(&amp;quot;firm&amp;quot;)[&amp;quot;y&amp;quot;].mean().std()
within_std_y = df.groupby(&amp;quot;firm&amp;quot;)[&amp;quot;y&amp;quot;].transform(lambda g: g - g.mean()).std()
print(f&amp;quot;\nVariation in y:&amp;quot;)
print(f&amp;quot; Overall std: {overall_std_y:.4f}&amp;quot;)
print(f&amp;quot; Between std: {between_std_y:.4f}&amp;quot;)
print(f&amp;quot; Within std: {within_std_y:.4f}&amp;quot;)
overall_std_x = df[&amp;quot;x&amp;quot;].std()
between_std_x = df.groupby(&amp;quot;firm&amp;quot;)[&amp;quot;x&amp;quot;].mean().std()
within_std_x = df.groupby(&amp;quot;firm&amp;quot;)[&amp;quot;x&amp;quot;].transform(lambda g: g - g.mean()).std()
print(f&amp;quot;\nVariation in x:&amp;quot;)
print(f&amp;quot; Overall std: {overall_std_x:.4f}&amp;quot;)
print(f&amp;quot; Between std: {between_std_x:.4f}&amp;quot;)
print(f&amp;quot; Within std: {within_std_x:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations per firm: min=10, max=10, mean=10.0
Panel is balanced
Variation in y:
Overall std: 2.9686
Between std: 2.4645
Within std: 1.6715
Variation in x:
Overall std: 1.9783
Between std: 1.3751
Within std: 1.4282
&lt;/code>&lt;/pre>
&lt;p>The decomposition reveals an important pattern. For firm performance (&lt;code>y&lt;/code>), the between-firm standard deviation (2.46) is substantially larger than the within-firm standard deviation (1.67). This means that &lt;em>persistent differences across firms&lt;/em> account for more of the total variation than year-to-year fluctuations within individual firms. For R&amp;amp;D intensity (&lt;code>x&lt;/code>), the split is more even: between-firm variation (1.38) is comparable in size to within-firm variation (1.43). Since firm fixed effects absorb all between-firm variation, this tells us that fixed effects will have a large impact on the regression &amp;mdash; they are removing a dominant source of variation that is confounded with the treatment.&lt;/p>
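&lt;p>As a quick consistency check, the between and within components should add up to the overall variance (the law of total variance). This holds exactly in a balanced panel when all variances use the population convention (&lt;code>ddof=0&lt;/code>); pandas&amp;rsquo; default &lt;code>ddof=1&lt;/code> introduces small discrepancies. A minimal sketch on a toy panel (illustrative data, not the tutorial&amp;rsquo;s simulation):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

# Toy balanced panel: 4 firms x 5 years
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'firm': np.repeat(np.arange(4), 5),
    'y': np.repeat(rng.normal(0, 2, 4), 5) + rng.normal(0, 1, 20),
})

# Population-variance (ddof=0) decomposition: overall = between + within
overall_var = toy['y'].var(ddof=0)
firm_means = toy.groupby('firm')['y'].transform('mean')
between_var = firm_means.var(ddof=0)              # variance of firm means
within_var = (toy['y'] - firm_means).var(ddof=0)  # variance around own firm mean

print(np.isclose(overall_var, between_var + within_var))  # True
&lt;/code>&lt;/pre>
&lt;p>The cross term vanishes because within-firm deviations average to zero inside each firm, so the two components partition the overall variance exactly.&lt;/p>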
&lt;h3 id="42-within-firm-correlations">4.2 Within-firm correlations&lt;/h3>
&lt;pre>&lt;code class="language-python">within_corr = (
df.groupby(&amp;quot;firm&amp;quot;)
.apply(lambda g: g[&amp;quot;y&amp;quot;].corr(g[&amp;quot;x&amp;quot;]), include_groups=False)
)
print(f&amp;quot;Within-firm correlation (y, x):&amp;quot;)
print(f&amp;quot; Mean: {within_corr.mean():.4f}&amp;quot;)
print(f&amp;quot; Median: {within_corr.median():.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Within-firm correlation (y, x):
Mean: 0.4100
Median: 0.4624
&lt;/code>&lt;/pre>
&lt;p>The average within-firm correlation between R&amp;amp;D and performance is 0.41, with a median of 0.46. This moderate positive correlation is what we expect given the true effect ($\beta = 0.5$): years in which a firm invests more in R&amp;amp;D tend to be years in which that firm performs better. The correlation sits well below 1 because the AR(1) errors add noise; with $\beta = 0.5$ and the within-firm standard deviations from Section 4.1, the implied correlation is roughly $0.5 \times 1.43 / 1.67 \approx 0.43$, close to what we observe.&lt;/p>
&lt;pre>&lt;code class="language-python"># Figure: Panel structure and within-firm correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.patch.set_linewidth(0)
# Left: x vs y colored by firm (sample 10 firms)
rng_plot = np.random.default_rng(99)
sample_firms = sorted(rng_plot.choice(df[&amp;quot;firm&amp;quot;].unique(), 10, replace=False))
colors_sample = [STEEL_BLUE, WARM_ORANGE, TEAL, &amp;quot;#e8956a&amp;quot;, &amp;quot;#c4623d&amp;quot;,
&amp;quot;#8fbfcc&amp;quot;, &amp;quot;#e0a57a&amp;quot;, &amp;quot;#5cc8c0&amp;quot;, &amp;quot;#b0c4de&amp;quot;, &amp;quot;#f0c8a0&amp;quot;]
for i, fid in enumerate(sample_firms):
sub = df[df[&amp;quot;firm&amp;quot;] == fid]
axes[0].scatter(sub[&amp;quot;x&amp;quot;], sub[&amp;quot;y&amp;quot;], color=colors_sample[i % len(colors_sample)],
alpha=0.7, s=30, edgecolors=DARK_NAVY, linewidths=0.5)
axes[0].set_xlabel(&amp;quot;R&amp;amp;D intensity (x)&amp;quot;)
axes[0].set_ylabel(&amp;quot;Firm performance (y)&amp;quot;)
axes[0].set_title(&amp;quot;10 sampled firms: x vs y&amp;quot;, fontweight=&amp;quot;bold&amp;quot;)
# Right: within-firm correlation distribution
axes[1].hist(within_corr, bins=20, color=STEEL_BLUE, edgecolor=DARK_NAVY, alpha=0.85)
axes[1].axvline(within_corr.mean(), color=WARM_ORANGE, linewidth=2,
linestyle=&amp;quot;--&amp;quot;, label=f&amp;quot;Mean = {within_corr.mean():.2f}&amp;quot;)
axes[1].set_xlabel(&amp;quot;Within-firm correlation (y, x)&amp;quot;)
axes[1].set_ylabel(&amp;quot;Number of firms&amp;quot;)
axes[1].set_title(&amp;quot;Distribution of within-firm correlations&amp;quot;, fontweight=&amp;quot;bold&amp;quot;)
axes[1].legend()
plt.tight_layout()
plt.savefig(&amp;quot;panel_ses_eda.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="panel_ses_eda.png" alt="Panel structure scatter plots showing 10 sampled firms and distribution of within-firm correlations.">&lt;/p>
&lt;p>The left panel shows how the 10 sampled firms form distinct &lt;em>clusters&lt;/em> in the scatter plot &amp;mdash; each firm occupies a different region of the x-y space. This visual clustering is the between-firm variation that fixed effects remove. The right panel shows that most firms have a positive within-firm correlation between R&amp;amp;D and performance, with the distribution centered around 0.41. A few firms have near-zero or negative correlations, reflecting the random noise in the simulation. These within-firm relationships are what fixed effects regressions actually estimate.&lt;/p>
&lt;p>Now that we understand the panel structure, we are ready to set up the MultiIndex that &lt;code>linearmodels&lt;/code> requires and begin estimating models.&lt;/p>
&lt;h2 id="5-setting-up-the-multiindex">5. Setting up the MultiIndex&lt;/h2>
&lt;p>The &lt;code>linearmodels&lt;/code> package requires panel data to be stored in a pandas DataFrame with a &lt;a href="https://pandas.pydata.org/docs/user_guide/advanced.html" target="_blank" rel="noopener">MultiIndex&lt;/a>: the entity (firm) as the first level and the time period (year) as the second. This structure tells the package which observations belong to the same firm and how they are ordered in time &amp;mdash; information it needs to compute clustered standard errors and absorb fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">df_panel = df.set_index([&amp;quot;firm&amp;quot;, &amp;quot;year&amp;quot;])
print(f&amp;quot;MultiIndex levels: {df_panel.index.names}&amp;quot;)
print(df_panel.head(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">MultiIndex levels: ['firm', 'year']
y x
firm year
1 2010 6.721042 4.139183
2011 5.889161 3.844151
2012 2.355109 2.596322
&lt;/code>&lt;/pre>
&lt;p>The MultiIndex now encodes the panel structure directly in the DataFrame. Firm 1&amp;rsquo;s three displayed observations span 2010&amp;ndash;2012, and &lt;code>linearmodels&lt;/code> uses this ordering to know which observations to group when computing entity-clustered standard errors. With the data properly indexed, we can now estimate our first model.&lt;/p>
&lt;h2 id="6-pooled-ols-----the-naive-baseline">6. Pooled OLS &amp;mdash; the naive baseline&lt;/h2>
&lt;h3 id="61-conventional-standard-errors">6.1 Conventional standard errors&lt;/h3>
&lt;p>We begin with the simplest possible approach: pooled OLS with conventional standard errors. This estimator ignores the panel structure entirely &amp;mdash; it treats all 1,000 observations as if they were independent draws, like 1,000 different firms each observed once. We use &lt;a href="https://bashtage.github.io/linearmodels/panel/panel/linearmodels.panel.model.PanelOLS.from_formula.html" target="_blank" rel="noopener">PanelOLS.from_formula()&lt;/a> with &lt;code>cov_type=&amp;quot;unadjusted&amp;quot;&lt;/code> to request conventional (homoskedastic) standard errors. The formula &lt;code>&amp;quot;y ~ 1 + x&amp;quot;&lt;/code> specifies a regression of firm performance on R&amp;amp;D intensity with an intercept.&lt;/p>
&lt;pre>&lt;code class="language-python">mod_pooled = PanelOLS.from_formula(&amp;quot;y ~ 1 + x&amp;quot;, data=df_panel)
res_pooled = mod_pooled.fit(cov_type=&amp;quot;unadjusted&amp;quot;)
beta_pooled = res_pooled.params[&amp;quot;x&amp;quot;]
se_pooled = res_pooled.std_errors[&amp;quot;x&amp;quot;]
t_pooled = res_pooled.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;Coefficient on x: {beta_pooled:.4f}&amp;quot;)
print(f&amp;quot;Conventional SE: {se_pooled:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_pooled:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficient on x: 1.0318
Conventional SE: 0.0345
t-statistic: 29.9151
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS coefficient is 1.03 &amp;mdash; &lt;em>more than double&lt;/em> the true value of 0.5. This is omitted variable bias in action. Because high-ability firms both invest more in R&amp;amp;D and perform better, the regression attributes to R&amp;amp;D what is actually driven by unobserved ability. The conventional standard error of 0.0345 looks impressively small, yielding a t-statistic of 29.9. But this precision is doubly misleading: the point estimate itself is biased, and the standard error is too small because it ignores within-firm error correlation.&lt;/p>
&lt;p>This is the first major lesson: &lt;strong>a biased estimator with small standard errors is worse than a noisy but unbiased one&lt;/strong>. The conventional SE tells us we can be very confident that the effect is around 1.03 &amp;mdash; but 1.03 is the &lt;em>wrong answer&lt;/em>. No standard error correction can fix this; we need a different estimator (fixed effects) to address the bias. We will get there in Section 9. But first, let us see what happens when we try progressively better standard errors on the same biased pooled model.&lt;/p>
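&lt;p>To preview why an estimator change, rather than a better SE, is what fixes the bias, the within transformation can be sketched in a few lines of numpy. This is a scaled-down stand-in for the tutorial&amp;rsquo;s DGP (firm effect only, no time effects or AR(1) errors), so the exact numbers differ from the results above:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n_firms, n_years = 200, 10
firm = np.repeat(np.arange(n_firms), n_years)
mu = np.repeat(rng.normal(0, 2, n_firms), n_years)        # firm ability
x = 3.0 + 0.8 * mu + rng.normal(0, 1.5, n_firms * n_years)
y = 2.0 + 0.5 * x + mu + rng.normal(0, 1.5, n_firms * n_years)

def slope(a, b):
    # Simple OLS slope of b on a (with intercept)
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c @ b_c) / (a_c @ a_c)

beta_pooled = slope(x, y)                                 # contaminated by mu

# Within transformation: subtract each firm's own mean first
x_dm = x - np.bincount(firm, weights=x)[firm] / n_years
y_dm = y - np.bincount(firm, weights=y)[firm] / n_years
beta_within = slope(x_dm, y_dm)                           # close to the true 0.5

print(round(beta_pooled, 2), round(beta_within, 2))
&lt;/code>&lt;/pre>
&lt;p>Clustered or robust standard errors applied to &lt;code>beta_pooled&lt;/code> would widen its confidence interval, not move its value; only the within transformation (what &lt;code>EntityEffects&lt;/code> automates) removes the correlation between the regressor and the omitted firm effect.&lt;/p>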
&lt;h3 id="62-white-heteroskedasticity-robust-standard-errors">6.2 White (heteroskedasticity-robust) standard errors&lt;/h3>
&lt;p>The next step up from conventional SEs is the &lt;em>White estimator&lt;/em>, also called &lt;em>heteroskedasticity-consistent&lt;/em> (HC) standard errors. While conventional SEs assume all errors have the same variance, the White estimator allows the error variance to differ across observations. Think of it as replacing a one-size-fits-all uncertainty measure with one tailored to each data point. In &lt;code>linearmodels&lt;/code>, we request it with &lt;code>cov_type=&amp;quot;robust&amp;quot;&lt;/code>.&lt;/p>
&lt;p>The White covariance estimator is:&lt;/p>
&lt;p>$$\hat{\Sigma}_{\text{White}} = (X'X)^{-1} \left( \sum_{i=1}^{N} X_i' \hat{e}_i^2 X_i \right) (X'X)^{-1}$$&lt;/p>
&lt;p>In words, this replaces the constant variance assumption with the squared residuals $\hat{e}_i^2$ from each observation, producing standard errors that are robust to heteroskedasticity &amp;mdash; situations where the spread of errors varies with the level of $X$.&lt;/p>
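&lt;p>The formula maps directly onto a few lines of numpy. The sketch below computes conventional and White (HC0, i.e. without small-sample corrections) standard errors for a toy heteroskedastic regression; &lt;code>linearmodels&lt;/code> applies the same sandwich logic with an additional finite-sample adjustment, so its numbers would differ slightly:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0, 1, n)
e = rng.normal(0, 1, n) * (1 + 0.5 * np.abs(x))   # error spread grows with |x|
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])              # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional: sigma^2 (X'X)^{-1}, one variance for all observations
sigma2 = resid @ resid / (n - 2)
se_conv = np.sqrt(sigma2 * np.diag(XtX_inv))

# White sandwich: (X'X)^{-1} (sum_i X_i' e_i^2 X_i) (X'X)^{-1}
meat = X.T @ (X * resid[:, None] ** 2)
cov_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(cov_white))

print(se_conv[1].round(4), se_white[1].round(4))  # robust SE on x is larger here
&lt;/code>&lt;/pre>
&lt;p>Because the error variance rises with $|x|$, the squared residuals are largest exactly where the regressor is most informative, and the robust SE on the slope exceeds the conventional one.&lt;/p>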
&lt;pre>&lt;code class="language-python">res_white = mod_pooled.fit(cov_type=&amp;quot;robust&amp;quot;)
se_white = res_white.std_errors[&amp;quot;x&amp;quot;]
t_white = res_white.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;White SE: {se_white:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_white:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">White SE: 0.0361
t-statistic: 28.5897
&lt;/code>&lt;/pre>
&lt;p>The White SE (0.0361) is only slightly larger than the conventional SE (0.0345), and the t-statistic barely budges from 29.9 to 28.6. This is because heteroskedasticity is not the main problem here &amp;mdash; &lt;em>within-cluster correlation&lt;/em> is. The White estimator treats each observation as independent, just with potentially different variances. It does not account for the fact that firm 1&amp;rsquo;s error in 2015 is correlated with firm 1&amp;rsquo;s error in 2016. For panel data with serial correlation, we need standard errors that account for this clustering.&lt;/p>
&lt;h2 id="7-clustered-standard-errors">7. Clustered standard errors&lt;/h2>
&lt;h3 id="71-the-intuition-behind-clustering">7.1 The intuition behind clustering&lt;/h3>
&lt;p>Clustering is the workhorse correction for panel data standard errors. The idea is simple: if errors within a firm are correlated, then 10 observations from the same firm do not contain as much &lt;em>independent&lt;/em> information as 10 observations from 10 different firms. Clustering acknowledges this by allowing arbitrary correlation among all observations within the same cluster.&lt;/p>
&lt;p>Think of surveying students in classrooms. If you survey 100 students from 10 classrooms (10 per classroom), you do not have 100 independent data points &amp;mdash; students in the same classroom share the same teacher, curriculum, and classroom environment. The effective sample size is closer to 10 (the number of classrooms) than 100 (the number of students). Clustering adjusts the standard errors to reflect this reduced effective sample size.&lt;/p>
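&lt;p>The &amp;ldquo;effective sample size&amp;rdquo; intuition can be quantified with the classic design-effect formula $n_{\text{eff}} = n / (1 + (m-1)\rho)$, where $m$ is the cluster size and $\rho$ is the intraclass correlation. A minimal sketch (the helper function is our own, not a &lt;code>linearmodels&lt;/code> API):&lt;/p>

```python
# Effective sample size under equal-sized clusters via the design effect.
def effective_sample_size(n_obs, cluster_size, icc):
    """n_eff = n / (1 + (m - 1) * rho) for intraclass correlation rho."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_obs / design_effect

# 100 students in 10 classrooms of 10 each:
print(effective_sample_size(100, 10, 0.0))   # 100.0: independent students
print(effective_sample_size(100, 10, 1.0))   # 10.0: perfectly correlated
print(effective_sample_size(100, 10, 0.5))   # ~18.2: in between
```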
&lt;h3 id="72-entity-clustered-ses">7.2 Entity-clustered SEs&lt;/h3>
&lt;p>Entity clustering allows arbitrary correlation among all observations within the same firm. We request it by setting &lt;code>cluster_entity=True&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python"># Entity-clustered
res_cl_entity = mod_pooled.fit(cov_type=&amp;quot;clustered&amp;quot;, cluster_entity=True)
se_cl_entity = res_cl_entity.std_errors[&amp;quot;x&amp;quot;]
t_cl_entity = res_cl_entity.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;Entity-clustered SE: {se_cl_entity:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_cl_entity:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Entity-clustered SE: 0.0621
t-statistic: 16.6233
&lt;/code>&lt;/pre>
&lt;p>Entity-clustered SEs (0.0621) are 80% larger than conventional SEs (0.0345) and 72% larger than the White SEs (0.0361). The t-statistic drops from 29.9 to 16.6 &amp;mdash; still highly significant in this case, but the inflation in standard errors demonstrates how much conventional SEs understate uncertainty when within-firm correlation is present. In a setting with a weaker true effect, this correction could flip a &amp;ldquo;significant&amp;rdquo; result to &amp;ldquo;insignificant.&amp;rdquo;&lt;/p>
&lt;h3 id="73-time-clustered-ses">7.3 Time-clustered SEs&lt;/h3>
&lt;p>Time clustering allows correlation among all firms &lt;em>within the same year&lt;/em>. This matters when firms face common shocks &amp;mdash; a recession, a regulatory change, or a market-wide technology shift that affects all firms simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-python"># Time-clustered
res_cl_time = mod_pooled.fit(cov_type=&amp;quot;clustered&amp;quot;, cluster_time=True)
se_cl_time = res_cl_time.std_errors[&amp;quot;x&amp;quot;]
t_cl_time = res_cl_time.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;Time-clustered SE: {se_cl_time:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_cl_time:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Time-clustered SE: 0.0168
t-statistic: 61.2757
&lt;/code>&lt;/pre>
&lt;p>Time-clustered SEs (0.0168) are actually &lt;em>smaller&lt;/em> than conventional SEs, and the t-statistic jumps to 61.3. This happens because our DGP has only weak time effects ($\lambda_t \sim N(0, 0.5)$) but strong firm effects. There is a second problem: the asymptotic theory that justifies clustered SEs (the mathematical guarantees that hold as the number of clusters grows) requires many clusters, and time clustering here has only 10 year-clusters to work with. As a rule of thumb, cluster on the dimension that has at least 40&amp;ndash;50 groups. Here, entity clustering (100 firms) is far more appropriate than time clustering (10 years).&lt;/p>
&lt;h3 id="74-two-way-clustered-ses">7.4 Two-way clustered SEs&lt;/h3>
&lt;p>Two-way clustering allows correlation along &lt;em>both&lt;/em> dimensions simultaneously &amp;mdash; within firms over time and across firms within the same year. This is the most conservative approach, proposed by &lt;a href="https://doi.org/10.1198/jbes.2010.07136" target="_blank" rel="noopener">Cameron, Gelbach, and Miller (2011)&lt;/a>. In &lt;code>linearmodels&lt;/code>, set both &lt;code>cluster_entity=True&lt;/code> and &lt;code>cluster_time=True&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python"># Two-way clustered
res_cl_both = mod_pooled.fit(cov_type=&amp;quot;clustered&amp;quot;,
                             cluster_entity=True, cluster_time=True)
se_cl_both = res_cl_both.std_errors[&amp;quot;x&amp;quot;]
t_cl_both = res_cl_both.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;Two-way clustered SE: {se_cl_both:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_cl_both:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Two-way clustered SE: 0.0532
t-statistic: 19.3829
&lt;/code>&lt;/pre>
&lt;p>The two-way clustered SE (0.0532) falls between the entity-clustered (0.0621) and time-clustered (0.0168) estimates. This makes sense: the two-way estimator combines information from both clustering dimensions. Since the time dimension contributes little (weak time effects, few clusters), the two-way SE is somewhat smaller than entity-only clustering. In practice, two-way clustering is recommended when both dimensions have enough clusters and both types of correlation are plausible.&lt;/p>
&lt;h2 id="8-a-side-by-side-comparison-so-far">8. A side-by-side comparison so far&lt;/h2>
&lt;p>Before introducing fixed effects, let us pause to see all the pooled OLS standard errors side by side. Remember: the point estimate (1.0318) is the same for all of them &amp;mdash; only the standard errors and hence the confidence intervals differ.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model / SE Type&lt;/th>
&lt;th>Coefficient&lt;/th>
&lt;th>Std. Error&lt;/th>
&lt;th>t-stat&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Pooled OLS (conventional)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0345&lt;/td>
&lt;td>29.92&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (White/HC)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0361&lt;/td>
&lt;td>28.59&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: entity)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0621&lt;/td>
&lt;td>16.62&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: time)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0168&lt;/td>
&lt;td>61.28&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: both)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0532&lt;/td>
&lt;td>19.38&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The entity-clustered SE is 1.8 times the conventional SE. But recall that all these models estimate the &lt;em>wrong&lt;/em> coefficient (1.03 vs. the true 0.5). Correcting standard errors on a biased estimator is like putting better tires on a car driving in the wrong direction. Next, we fix the direction with fixed effects.&lt;/p>
&lt;h2 id="9-entity-fixed-effects-with-clustered-ses">9. Entity fixed effects with clustered SEs&lt;/h2>
&lt;h3 id="91-why-fixed-effects-solve-the-bias">9.1 Why fixed effects solve the bias&lt;/h3>
&lt;p>Fixed effects regression removes all time-invariant differences between firms before estimating the coefficient. Mathematically, it subtracts each firm&amp;rsquo;s time-average from its observations &amp;mdash; a process called &lt;em>demeaning&lt;/em>. After demeaning, the unobserved firm ability $\mu_i$ vanishes because it is constant over time, and we estimate $\beta$ using only the within-firm variation in $x$ and $y$. This eliminates the omitted variable bias that inflated the pooled OLS estimate.&lt;/p>
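&lt;p>To see the demeaning logic in isolation, here is a toy sketch (simulated data with made-up parameters, separate from the tutorial&amp;rsquo;s DGP) showing that the within transformation removes a firm effect that biases the pooled slope:&lt;/p>

```python
# Within transformation by hand: subtract each firm's mean, then run OLS.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_firms, n_years, beta = 50, 10, 0.5
firm = np.repeat(np.arange(n_firms), n_years)
mu = rng.normal(scale=2.0, size=n_firms)[firm]      # firm "ability"
x = 0.8 * mu + rng.normal(size=n_firms * n_years)   # x correlated with ability
y = beta * x + mu + rng.normal(size=n_firms * n_years)

df = pd.DataFrame({"firm": firm, "x": x, "y": y})
# Demean within firm: mu drops out because it is constant over time
xd = df["x"] - df.groupby("firm")["x"].transform("mean")
yd = df["y"] - df.groupby("firm")["y"].transform("mean")

beta_pooled = np.polyfit(df["x"], df["y"], 1)[0]    # biased upward
beta_within = np.polyfit(xd, yd, 1)[0]              # close to the true 0.5
print(beta_pooled, beta_within)
```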
&lt;p>In &lt;code>linearmodels&lt;/code>, adding &lt;code>EntityEffects&lt;/code> to the formula absorbs firm fixed effects:&lt;/p>
&lt;pre>&lt;code class="language-python">mod_fe = PanelOLS.from_formula(&amp;quot;y ~ 1 + x + EntityEffects&amp;quot;, data=df_panel)
res_fe_cl = mod_fe.fit(cov_type=&amp;quot;clustered&amp;quot;, cluster_entity=True)
beta_fe = res_fe_cl.params[&amp;quot;x&amp;quot;]
se_fe_cl = res_fe_cl.std_errors[&amp;quot;x&amp;quot;]
t_fe_cl = res_fe_cl.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;FE coefficient on x: {beta_fe:.4f}&amp;quot;)
print(f&amp;quot;Entity-clustered SE: {se_fe_cl:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_fe_cl:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">FE coefficient on x: 0.4829
Entity-clustered SE: 0.0357
t-statistic: 13.5250
&lt;/code>&lt;/pre>
&lt;p>The fixed effects coefficient (0.4829) is dramatically closer to the true value of 0.5 than the pooled estimate (1.0318). The remaining gap of 0.017 is sampling noise, not systematic bias. The entity-clustered SE of 0.0357 is actually &lt;em>smaller&lt;/em> than the pooled entity-clustered SE (0.0621) because fixed effects remove the between-firm variation that was inflating the residuals.&lt;/p>
&lt;h3 id="92-two-way-fixed-effects">9.2 Two-way fixed effects&lt;/h3>
&lt;p>We can also absorb time fixed effects by adding &lt;code>TimeEffects&lt;/code>, which removes year-specific shocks common to all firms. This controls for business cycle effects, regulatory changes, or any other year-level phenomenon.&lt;/p>
&lt;pre>&lt;code class="language-python">mod_twfe = PanelOLS.from_formula(&amp;quot;y ~ 1 + x + EntityEffects + TimeEffects&amp;quot;,
                                 data=df_panel)
res_twfe = mod_twfe.fit(cov_type=&amp;quot;clustered&amp;quot;, cluster_entity=True)
beta_twfe = res_twfe.params[&amp;quot;x&amp;quot;]
se_twfe = res_twfe.std_errors[&amp;quot;x&amp;quot;]
t_twfe = res_twfe.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;TWFE coefficient on x: {beta_twfe:.4f}&amp;quot;)
print(f&amp;quot;Entity-clustered SE: {se_twfe:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_twfe:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">TWFE coefficient on x: 0.4796
Entity-clustered SE: 0.0376
t-statistic: 12.7392
&lt;/code>&lt;/pre>
&lt;p>Adding time fixed effects barely changes the estimate (0.4796 vs. 0.4829) and slightly increases the standard error (0.0376 vs. 0.0357). This makes sense: the time effects in our DGP are small ($\lambda_t \sim N(0, 0.5)$), so absorbing them provides only a minor correction while consuming 9 additional degrees of freedom. In real applications where macroeconomic shocks are substantial, two-way FE can make a bigger difference.&lt;/p>
&lt;h2 id="10-driscoll-kraay-standard-errors">10. Driscoll-Kraay standard errors&lt;/h2>
&lt;p>&lt;a href="https://doi.org/10.1162/003465398557549" target="_blank" rel="noopener">Driscoll and Kraay (1998)&lt;/a> proposed a standard error estimator that accounts for both cross-sectional correlation (across firms within a period) and temporal dependence (within firms over time), using a kernel-based approach similar to Newey-West but applied to cross-sectional averages. In &lt;code>linearmodels&lt;/code>, we request it with &lt;code>cov_type=&amp;quot;kernel&amp;quot;&lt;/code> and a Bartlett kernel (equivalent to &lt;a href="https://doi.org/10.2307/1913610" target="_blank" rel="noopener">Newey and West (1987)&lt;/a> weighting). The &lt;code>bandwidth&lt;/code> parameter controls how many time lags of correlation the estimator accounts for &amp;mdash; a bandwidth of 3 means it incorporates correlations up to 3 years apart, with declining weights for longer lags.&lt;/p>
&lt;pre>&lt;code class="language-python">res_dk = mod_pooled.fit(cov_type=&amp;quot;kernel&amp;quot;, kernel=&amp;quot;bartlett&amp;quot;, bandwidth=3)
se_dk = res_dk.std_errors[&amp;quot;x&amp;quot;]
t_dk = res_dk.tstats[&amp;quot;x&amp;quot;]
print(f&amp;quot;Driscoll-Kraay SE (BW=3): {se_dk:.4f}&amp;quot;)
print(f&amp;quot;t-statistic: {t_dk:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Driscoll-Kraay SE (BW=3): 0.0158
t-statistic: 65.4073
&lt;/code>&lt;/pre>
&lt;p>The Driscoll-Kraay SE (0.0158) is the smallest we have seen &amp;mdash; even smaller than conventional SEs. This reflects the estimator&amp;rsquo;s focus on cross-sectional dependence, which is weak in our simulation (firms are independent given their fixed effects). In applications with strong cross-sectional correlation &amp;mdash; for example, banks exposed to the same macroeconomic shock &amp;mdash; Driscoll-Kraay SEs can be substantially larger. The key feature is robustness to &lt;em>cross-sectional dependence&lt;/em> that entity clustering alone cannot handle.&lt;/p>
&lt;h2 id="11-full-comparison">11. Full comparison&lt;/h2>
&lt;h3 id="111-summary-table">11.1 Summary table&lt;/h3>
&lt;p>Now we can see all eight model-SE combinations in a single table. The true coefficient is $\beta = 0.5$. The &amp;ldquo;Reject H0&amp;rdquo; column tests the default null H0: $\beta = 0$ (not H0: $\beta = 0.5$). In Section 12, the Monte Carlo explicitly tests against the true value.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model / SE Type&lt;/th>
&lt;th>Coefficient&lt;/th>
&lt;th>Std. Error&lt;/th>
&lt;th>t-stat&lt;/th>
&lt;th>Reject H0 (5%)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Pooled OLS (conventional)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0345&lt;/td>
&lt;td>29.92&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (White/HC)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0361&lt;/td>
&lt;td>28.59&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: entity)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0621&lt;/td>
&lt;td>16.62&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: time)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0168&lt;/td>
&lt;td>61.28&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (cluster: both)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0532&lt;/td>
&lt;td>19.38&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Entity FE (cluster: entity)&lt;/td>
&lt;td>0.4829&lt;/td>
&lt;td>0.0357&lt;/td>
&lt;td>13.53&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Two-way FE (cluster: entity)&lt;/td>
&lt;td>0.4796&lt;/td>
&lt;td>0.0376&lt;/td>
&lt;td>12.74&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled OLS (Driscoll-Kraay)&lt;/td>
&lt;td>1.0318&lt;/td>
&lt;td>0.0158&lt;/td>
&lt;td>65.41&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Two patterns stand out. First, all pooled models estimate a coefficient around 1.03 &amp;mdash; more than double the true 0.5 &amp;mdash; while both FE models recover estimates close to 0.5. This is the bias-versus-variance distinction: &lt;strong>standard errors address precision, not accuracy&lt;/strong>. Second, among the FE models (which have the right coefficient), entity-clustered SEs are appropriately sized relative to the true uncertainty.&lt;/p>
&lt;h3 id="112-standard-error-comparison">11.2 Standard error comparison&lt;/h3>
&lt;pre>&lt;code class="language-python"># Figure: SE comparison bar chart (code in script.py)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="panel_ses_comparison.png" alt="Bar chart comparing standard error estimates across all eight model-SE combinations.">&lt;/p>
&lt;p>The bar chart reveals the full spectrum of standard error estimates. Entity-clustered SEs on the pooled model (0.0621) are the largest &amp;mdash; they correctly reflect high within-firm correlation but sit atop a biased estimate. The FE models' entity-clustered SEs (0.036&amp;ndash;0.038) are smaller because fixed effects absorbed the between-firm variation that inflated residuals. At the other extreme, Driscoll-Kraay (0.0158) and time-clustered (0.0168) SEs are the smallest, reflecting the weak cross-sectional and time-level correlation in our data.&lt;/p>
&lt;h3 id="113-confidence-intervals">11.3 Confidence intervals&lt;/h3>
&lt;pre>&lt;code class="language-python"># Figure: Confidence intervals across methods (code in script.py)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="panel_ses_ci.png" alt="Confidence interval plot showing 95% CIs across all eight methods, with a dashed line at the true beta of 0.5.">&lt;/p>
&lt;p>The confidence interval plot delivers the tutorial&amp;rsquo;s core visual message. The teal dashed line at $\beta = 0.5$ is the truth. All five pooled OLS intervals (blue) are far to the right &amp;mdash; none come close to covering the true value, regardless of which SE estimator we use. The two FE intervals (orange) are centered near 0.5 and easily cover it. The lesson is unmistakable: &lt;strong>standard errors cannot rescue a biased point estimate&lt;/strong>, but combined with a consistent estimator, they produce intervals with correct coverage.&lt;/p>
&lt;h2 id="12-monte-carlo-simulation-----which-ses-get-the-right-rejection-rate">12. Monte Carlo simulation &amp;mdash; which SEs get the right rejection rate?&lt;/h2>
&lt;h3 id="121-the-experiment">12.1 The experiment&lt;/h3>
&lt;p>The confidence interval plot above shows one simulation. But how do we know whether those intervals &lt;em>typically&lt;/em> contain the true value? A single simulation could be lucky or unlucky. To rigorously evaluate each SE estimator, we need a &lt;em>Monte Carlo simulation&lt;/em>: generate hundreds of independent datasets from the same DGP, estimate the model on each, and check how often the 95% confidence interval covers the true $\beta = 0.5$.&lt;/p>
&lt;p>If an SE estimator is correctly sized, its 95% CI should cover the truth 95% of the time, meaning it &lt;em>rejects&lt;/em> the true null hypothesis only 5% of the time. An SE that is too small produces intervals that are too narrow, leading to &lt;em>over-rejection&lt;/em> &amp;mdash; false positives in more than 5% of simulations.&lt;/p>
&lt;p>We focus on Entity FE models because they produce unbiased estimates. This isolates the SE question: given that the point estimate is right on average, do the standard errors correctly quantify the remaining uncertainty?&lt;/p>
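&lt;p>The bookkeeping behind an empirical rejection rate can be sketched with a numpy-only toy version. Note this uses plain OLS on iid data purely to illustrate the computation; the tutorial&amp;rsquo;s actual loop runs the &lt;code>linearmodels&lt;/code> FE estimators:&lt;/p>

```python
# Toy rejection-rate computation: test the TRUE null H0: beta = 0.5 at 5%.
import numpy as np

rng = np.random.default_rng(123)
true_beta, n, n_sim, crit = 0.5, 200, 500, 1.96
rejections = 0
for _ in range(n_sim):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    se = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])
    if abs((beta_hat[1] - true_beta) / se) > crit:
        rejections += 1

rate = rejections / n_sim
print(rate)   # a correctly sized test should land near the nominal 0.05
```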
&lt;h3 id="122-results">12.2 Results&lt;/h3>
&lt;pre>&lt;code class="language-python">N_SIM = 500
# ... (Monte Carlo loop runs Entity FE with 6 different SE types) ...
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Empirical rejection rates at 5% level (H0: beta=0.5 is true):
FE + conventional : 0.060 (30/500) ~correct
FE + White (HC) : 0.064 (32/500) ~correct
FE + cluster: entity : 0.066 (33/500) ~correct
FE + cluster: time : 0.090 (45/500)
FE + cluster: both : 0.078 (39/500) ~correct
TWFE + cluster: entity : 0.032 (16/500) ~correct
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Figure: Monte Carlo rejection rates (code in script.py)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="panel_ses_montecarlo.png" alt="Monte Carlo rejection rates for six FE model and SE combinations, with a dashed line at the nominal 5% level.">&lt;/p>
&lt;p>The Monte Carlo results across 500 simulations reveal meaningful differences. Entity FE with entity-clustered SEs rejects at 6.6% &amp;mdash; close to the nominal 5% and well within the range expected from simulation noise. Conventional SEs (6.0%) and White SEs (6.4%) also perform well here because, &lt;em>after&lt;/em> absorbing firm fixed effects, the remaining within-firm errors are approximately homoskedastic with moderate serial correlation that 100 clusters can handle.&lt;/p>
&lt;p>The outlier is FE with time-clustered SEs at 9.0% &amp;mdash; nearly double the nominal rate. This over-rejection occurs because time clustering with only 10 year-clusters violates the large-cluster asymptotic assumption. With 10 clusters, the finite-sample correction is insufficient, and the SEs are too small. TWFE with entity-clustered SEs (3.2%) is slightly conservative, meaning its confidence intervals are a bit wider than necessary &amp;mdash; a benign property compared to over-rejection.&lt;/p>
&lt;h3 id="123-standard-error-ratios">12.3 Standard error ratios&lt;/h3>
&lt;pre>&lt;code class="language-python"># Figure: SE ratios relative to entity-clustered (code in script.py)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="panel_ses_ratios.png" alt="SE ratios relative to entity-clustered standard errors as the benchmark.">&lt;/p>
&lt;p>This figure normalizes all standard errors to the entity-clustered SE (the recommended default). Ratios below 1.0 indicate SEs that are &lt;em>smaller&lt;/em> than entity-clustered &amp;mdash; and therefore potentially over-confident. Conventional SEs and White SEs on the pooled model are about 0.55&amp;ndash;0.58 times the entity-clustered SE, confirming they understate uncertainty by roughly 40%. The FE-based entity-clustered SE (0.57x) is smaller because fixed effects reduce residual variance &amp;mdash; this is a genuine precision gain, not an artifact of ignoring correlation.&lt;/p>
&lt;h2 id="13-discussion">13. Discussion&lt;/h2>
&lt;h3 id="131-answering-the-case-study-question">13.1 Answering the case study question&lt;/h3>
&lt;p>We asked: &lt;em>when firms are observed over multiple years, how does our choice of standard error estimator change what we conclude about the effect of R&amp;amp;D spending on firm performance?&lt;/em> The answer has two parts.&lt;/p>
&lt;p>&lt;strong>First, the bias problem.&lt;/strong> Pooled OLS estimates R&amp;amp;D&amp;rsquo;s effect at 1.03 &amp;mdash; more than double the true 0.5. This bias comes from omitted firm ability, not from standard error choice. Entity fixed effects reduce the estimate to 0.48, close to the truth. No standard error correction can fix a biased coefficient.&lt;/p>
&lt;p>&lt;strong>Second, the inference problem.&lt;/strong> Even after fixing bias with FE, standard error choice matters. In our Monte Carlo, time-clustered SEs on FE models rejected the true null at 9.0% instead of 5%. Entity-clustered SEs maintained correct size at 6.6%. For a practitioner, using the wrong SEs could mean reporting a &amp;ldquo;significant&amp;rdquo; finding that is actually a false positive.&lt;/p>
&lt;h3 id="132-practical-guidance">13.2 Practical guidance&lt;/h3>
&lt;p>Following the recommendations of &lt;a href="https://doi.org/10.1093/rfs/hhn053" target="_blank" rel="noopener">Petersen (2009)&lt;/a>, here is a decision framework:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Always start with fixed effects&lt;/strong> if the panel has entity-level unobserved heterogeneity. Without FE, standard error corrections address precision but not bias.&lt;/li>
&lt;li>&lt;strong>Cluster on the dimension with more groups.&lt;/strong> Entity clustering (100 firms) is more reliable than time clustering (10 years) because clustered SEs rely on large-cluster asymptotics.&lt;/li>
&lt;li>&lt;strong>Two-way clustering is the safe default&lt;/strong> when both dimensions have enough clusters (rule of thumb: at least 40&amp;ndash;50 each). It accounts for both types of dependence simultaneously.&lt;/li>
&lt;li>&lt;strong>Driscoll-Kraay is specialized.&lt;/strong> Use it when cross-sectional dependence is strong and the number of time periods is large (e.g., long macroeconomic panels).&lt;/li>
&lt;/ol>
&lt;h2 id="14-summary-and-next-steps">14. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Standard errors cannot fix bias.&lt;/strong> Pooled OLS overestimated the R&amp;amp;D effect at 1.03 (true: 0.5) regardless of which SE estimator was applied. Entity fixed effects recovered an estimate of 0.48 &amp;mdash; close to the truth. Always address the &lt;em>model&lt;/em> before worrying about the &lt;em>standard errors&lt;/em>.&lt;/li>
&lt;li>&lt;strong>Clustering dimension matters.&lt;/strong> Entity-clustered SEs (0.0621) were 80% larger than conventional SEs (0.0345) on the pooled model, reflecting the within-firm correlation that conventional SEs ignore. Time-clustered SEs (0.0168) were misleadingly small because only 10 year-clusters provided too few groups for reliable asymptotic inference.&lt;/li>
&lt;li>&lt;strong>Monte Carlo validation is essential.&lt;/strong> Entity-clustered SEs on the FE model rejected the true null at 6.6% (close to the nominal 5%), while time-clustered SEs rejected at 9.0% &amp;mdash; nearly double the expected rate. Simulation is the only way to verify that your SE choice controls size in your specific data structure.&lt;/li>
&lt;li>&lt;strong>The FE + entity-clustered combination is the reliable default.&lt;/strong> It addresses both bias (via FE) and inference (via clustering). Two-way clustering adds insurance against cross-sectional correlation when both dimensions have enough groups.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Our simulation uses balanced panels. With unbalanced panels (firms entering and exiting), some SE estimators require additional adjustments.&lt;/li>
&lt;li>We used 100 firms and 10 years. Results may differ with fewer clusters or different cluster-size ratios.&lt;/li>
&lt;li>The DGP has a simple AR(1) error structure. Real data may have more complex dependence patterns.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Apply these techniques to a real firm-level dataset (e.g., Compustat) and compare SE estimates.&lt;/li>
&lt;li>Explore bootstrap-based approaches for clustered inference with few clusters (wild cluster bootstrap).&lt;/li>
&lt;li>Study the Cameron-Gelbach-Miller multi-way clustering theory for panels with more than two clustering dimensions.&lt;/li>
&lt;/ul>
&lt;h2 id="15-exercises">15. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Modify the DGP.&lt;/strong> Change the AR(1) coefficient from 0.5 to 0.9 (stronger serial correlation) and re-run the Monte Carlo. Which SE estimators are most affected? Does entity-clustering still control size at 5%?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Reduce the number of firms.&lt;/strong> Set &lt;code>n_firms=20&lt;/code> (keeping &lt;code>n_years=10&lt;/code>) and re-run the Monte Carlo. With only 20 entity clusters, do entity-clustered SEs still perform well? At what cluster count do they start to break down?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add cross-sectional dependence.&lt;/strong> Modify &lt;code>simulate_panel()&lt;/code> so that each year has a common shock ($\delta_t$) that enters &lt;em>all&lt;/em> firms' errors: &lt;code>eps[start + t] += delta_t&lt;/code>. Re-run the analysis and check whether entity-clustered SEs still control size, or whether Driscoll-Kraay / two-way clustering becomes necessary.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://vincent.codes.finance/posts/panel-ols-standard-errors/" target="_blank" rel="noopener">Gregoire, V. (2024). Panel OLS Standard Errors. &lt;em>Vincent Codes Finance&lt;/em>.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://bashtage.github.io/linearmodels/panel/index.html" target="_blank" rel="noopener">linearmodels &amp;mdash; Kevin Sheppard. Panel Data Models Documentation.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1912934" target="_blank" rel="noopener">White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. &lt;em>Econometrica&lt;/em>, 48(4), 817&amp;ndash;838.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1198/jbes.2010.07136" target="_blank" rel="noopener">Cameron, A. C., Gelbach, J. B., &amp;amp; Miller, D. L. (2011). Robust Inference with Multiway Clustering. &lt;em>Journal of Business &amp;amp; Economic Statistics&lt;/em>, 29(2), 238&amp;ndash;249.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1162/003465398557549" target="_blank" rel="noopener">Driscoll, J. C. &amp;amp; Kraay, A. C. (1998). Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data. &lt;em>Review of Economics and Statistics&lt;/em>, 80(4), 549&amp;ndash;560.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1913610" target="_blank" rel="noopener">Newey, W. K. &amp;amp; West, K. D. (1987). A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. &lt;em>Econometrica&lt;/em>, 55(3), 703&amp;ndash;708.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/rfs/hhn053" target="_blank" rel="noopener">Petersen, M. A. (2009). Estimating Standard Errors in Finance Panel Data Sets. &lt;em>Review of Financial Studies&lt;/em>, 22(1), 435&amp;ndash;480.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Dynamic Panel BMA: Which Factors Truly Drive Economic Growth?</title><link>https://carlos-mendez.org/post/r_dynamic_bma/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_dynamic_bma/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you are advising a government on how to accelerate long-run economic growth. Your team has compiled a panel dataset covering 73 countries across four decades, with nine candidate drivers: investment, education, population growth, trade openness, government spending, life expectancy, democracy, investment prices, and population size. The natural question is: &lt;strong>which of these factors truly drive economic growth &amp;mdash; and can we trust our answers when today&amp;rsquo;s GDP might itself be shaped by those same factors?&lt;/strong>&lt;/p>
&lt;p>What is BMA? Imagine trying to predict salaries using education, experience, age, and industry. You could build one model with all four variables, or drop industry, or use only experience and education. With just 4 candidates, there are $2^4 = 16$ possible models. Which is correct? &lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> does not pick one &amp;mdash; it averages predictions from all 16, giving more weight to models that fit the data well. This avoids betting everything on one specification that might be wrong.&lt;/p>
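&lt;p>The averaging idea can be seen in miniature. The sketch below is in Python for illustration only (the tutorial itself uses the &lt;code>bdsm&lt;/code> R package) and uses a standard BIC approximation to the model weights rather than the package&amp;rsquo;s actual likelihood:&lt;/p>

```python
# Miniature BMA: enumerate every subset of K candidate regressors, score each
# model by BIC, and turn scores into approximate posterior model weights.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(7)
n, K = 300, 4
X = rng.normal(size=(n, K))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first regressor matters

def bic(cols):
    """BIC of an OLS model with an intercept plus the given columns."""
    Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    rss = np.sum((y - Xm @ beta) ** 2)
    return n * np.log(rss / n) + Xm.shape[1] * np.log(n)

models = [cols for r in range(K + 1) for cols in combinations(range(K), r)]
scores = np.array([bic(m) for m in models])            # 2**4 = 16 models
weights = np.exp(-0.5 * (scores - scores.min()))
weights /= weights.sum()                               # approximate posteriors

# Posterior inclusion probability (PIP) of each regressor
pip = [sum(w for m, w in zip(models, weights) if j in m) for j in range(K)]
print(pip)   # PIP of the true regressor should be near 1, the rest small
```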
&lt;p>The worry at the end of that question is &lt;em>reverse causality&lt;/em>: the possibility that GDP growth causes higher investment rather than the other way around. Cross-sectional BMA handles model uncertainty well, but it assumes regressors are strictly exogenous. When that assumption fails, BMA can confidently point to the wrong variables.&lt;/p>
&lt;p>This tutorial introduces the &lt;a href="https://cran.r-project.org/web/packages/bdsm/index.html" target="_blank" rel="noopener">bdsm&lt;/a> (Bayesian Dynamic Systems Modeling) R package, which extends BMA to dynamic panel data with weakly exogenous regressors. Built on the methodology of Moral-Benito (2012, 2013, 2016), it simultaneously addresses model uncertainty and reverse causality by incorporating a lagged dependent variable, entity fixed effects, and time fixed effects into the BMA framework.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Companion tutorial.&lt;/strong> For a cross-sectional perspective using BMA, LASSO, and WALS on synthetic data, see the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">R tutorial on variable selection&lt;/a>. The current tutorial builds on those foundations by moving from cross-sectional to panel data and from strict to weak exogeneity.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why cross-sectional BMA can be misleading when regressors are endogenous, and how dynamic panel BMA addresses this&lt;/li>
&lt;li>Prepare panel data for the Bayesian DSM package using &lt;code>join_lagged_col()&lt;/code> and &lt;code>feature_standardization()&lt;/code>&lt;/li>
&lt;li>Run Bayesian Model Averaging with &lt;code>bma()&lt;/code> and interpret Posterior Inclusion Probabilities (PIPs &amp;mdash; how often a variable appears in the best-fitting models), posterior means, and model probabilities&lt;/li>
&lt;li>Assess the sensitivity of results to prior specification by varying the expected model size (how many variables the prior expects to matter) and applying dilution priors (which adjust for correlated variables)&lt;/li>
&lt;li>Analyze jointness (which variables tend to appear in models together) to discover which growth determinants are complements versus substitutes&lt;/li>
&lt;/ul>
&lt;p>The package also includes a smaller 3-regressor example (&lt;code>small_model_space&lt;/code>) for practice &amp;mdash; see the companion R script for details.&lt;/p>
&lt;p>&lt;strong>Data Prep&lt;/strong> (lag DV, demean, standardize) &lt;strong>→ Model Space&lt;/strong> (estimate all 2&lt;sup>K&lt;/sup> models)
&lt;strong>→ BMA&lt;/strong> (PIPs, posterior means) &lt;strong>→ Sensitivity&lt;/strong> (vary priors, EMS, dilution) &lt;strong>→ Jointness&lt;/strong> (complements vs. substitutes) &lt;strong>→ Findings&lt;/strong> (robust growth determinants)&lt;/p>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;p>We need the Bayesian Dynamic Systems Modeling package for dynamic panel BMA and &lt;code>tidyverse&lt;/code> for data manipulation. The &lt;code>parallel&lt;/code> package (included with base R) enables parallel computing for the model space estimation step.&lt;/p>
&lt;pre>&lt;code class="language-r"># Install bdsm if needed
if (!requireNamespace(&amp;quot;bdsm&amp;quot;, quietly = TRUE)) {
  install.packages(&amp;quot;bdsm&amp;quot;)
}
# Load packages
library(bdsm)
library(tidyverse)
library(parallel)
set.seed(42)
&lt;/code>&lt;/pre>
&lt;h2 id="3-why-dynamic-panel-bma">3. Why Dynamic Panel BMA?&lt;/h2>
&lt;h3 id="31-the-endogeneity-problem">3.1 The endogeneity problem&lt;/h3>
&lt;p>Standard BMA assumes that all regressors are &lt;em>strictly exogenous&lt;/em> &amp;mdash; meaning they are determined outside the model and are uncorrelated with the error term at any point in time. In growth economics, this assumption almost never holds.&lt;/p>
&lt;p>Think of it this way: imagine judging a runner&amp;rsquo;s training program by their final race time, but faster runners also &lt;em>chose&lt;/em> better programs. You cannot tell whether the program caused the speed or the speed attracted the program. This is &lt;strong>reverse causality&lt;/strong>, and it contaminates cross-sectional regressions. Countries that grow faster invest more, trade more, urbanize faster, and attract more education spending &amp;mdash; not just the other way around.&lt;/p>
&lt;p>When BMA is applied to cross-sectional data with endogenous regressors, it can confidently assign high inclusion probabilities to variables that appear important only because they are &lt;em>consequences&lt;/em> of growth rather than &lt;em>causes&lt;/em> of it. The model averaging machinery works perfectly &amp;mdash; but the individual models it averages over are biased.&lt;/p>
&lt;p>The solution is to include &lt;em>last period&amp;rsquo;s GDP&lt;/em> as a regressor. By controlling for where a country &lt;em>was&lt;/em>, we isolate which new factors push it forward &amp;mdash; breaking the feedback loop. The next section shows why this dynamic structure arises naturally from economic growth theory.&lt;/p>
&lt;h3 id="32-from-the-solow-model-to-a-dynamic-equation">3.2 From the Solow model to a dynamic equation&lt;/h3>
&lt;p>Why does a dynamic equation &amp;mdash; one with lagged GDP on the right-hand side &amp;mdash; arise naturally in growth economics? The answer comes from the &lt;strong>Solow growth model&lt;/strong> and its convergence prediction. The Solow model predicts that poorer countries should grow faster than richer ones, conditional on their structural characteristics (&lt;strong>beta convergence&lt;/strong>). Through a series of algebraic steps &amp;mdash; defining a persistence parameter, substituting observable country characteristics for the unobserved steady state, and adding fixed effects &amp;mdash; the convergence equation yields the following dynamic panel model:&lt;/p>
&lt;p>$$\ln y_{it} = \alpha \ln y_{i,t-1} + \beta' x_{it} + \eta_i + \zeta_t + v_{it}$$&lt;/p>
&lt;p>This is the &lt;strong>dynamic panel model&lt;/strong> that the Bayesian DSM package estimates. The coefficient $\alpha$ has a direct economic interpretation: it measures the &lt;strong>persistence of GDP&lt;/strong> across periods. A value of $\alpha$ close to 1 means slow convergence &amp;mdash; countries stay near their current income level for a long time. A value close to 0 means fast convergence &amp;mdash; countries quickly reach their steady state. Our BMA results will reveal $\alpha \approx 0.92$, indicating very slow convergence: after a decade, countries have closed only about 8% of the gap between their current GDP and their steady state.&lt;/p>
&lt;p>The key insight is that the lagged dependent variable is not an ad hoc addition &amp;mdash; it arises directly from the Solow model&amp;rsquo;s convergence prediction. Any study of growth determinants that omits lagged GDP is implicitly assuming $\alpha = 0$, which means assuming &lt;em>instantaneous convergence&lt;/em> &amp;mdash; a prediction strongly rejected by the data. For the full step-by-step derivation from the Solow convergence equation, see Appendix B.&lt;/p>
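&lt;p>The economic meaning of $\alpha$ can be checked with two lines of arithmetic (a back-of-the-envelope sketch, not package output). It treats the income gap as decaying geometrically at rate $\alpha$ per decade, which is exactly what the AR(1) structure of the dynamic equation implies:&lt;/p>
&lt;pre>&lt;code class="language-r"># Persistence alpha = 0.92 implies how fast the income gap closes
alpha = 0.92
gap_closed_per_decade = 1 - alpha   # share of the gap closed each decade
half_life = log(0.5) / log(alpha)   # decades until half the gap is gone
round(gap_closed_per_decade, 2)     # 0.08, i.e. about 8% per decade
round(half_life, 1)                 # about 8.3 decades (~83 years)
&lt;/code>&lt;/pre>
&lt;p>A half-life of roughly eight decades underlines just how slow convergence at $\alpha \approx 0.92$ really is.&lt;/p>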
&lt;h3 id="33-weak-exogeneity-and-the-role-of-each-component">3.3 Weak exogeneity and the role of each component&lt;/h3>
&lt;p>Each component of the dynamic panel equation plays a distinct role:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Lagged dependent variable&lt;/strong> ($y_{it-1}$): Think of this as a student&amp;rsquo;s previous exam score &amp;mdash; it captures all the accumulated history that got a country to its current level. After controlling for where a country &lt;em>was&lt;/em>, we can ask: among countries at the same starting point, which factors predict who grows faster?&lt;/li>
&lt;li>&lt;strong>Entity fixed effects&lt;/strong> ($\eta_i$): Like grading on a curve within each classroom &amp;mdash; these absorb time-invariant country traits such as geography, colonial history, and institutional heritage. We compare each country to its own average, not to other countries.&lt;/li>
&lt;li>&lt;strong>Time fixed effects&lt;/strong> ($\zeta_t$): These remove global shocks that affect all countries simultaneously, such as oil crises or the Asian financial crisis.&lt;/li>
&lt;/ul>
&lt;p>The key assumption is &lt;strong>weak exogeneity&lt;/strong>: current regressors can be correlated with &lt;em>past&lt;/em> shocks but not with the &lt;em>current&lt;/em> shock $v_{it}$. This is much weaker than strict exogeneity &amp;mdash; it allows past GDP growth to influence current investment (feedback effects) while requiring only that the current unexpected shock to GDP does not simultaneously cause changes in investment. In practical terms, weak exogeneity permits the realistic feedback loops that plague growth regressions while still allowing consistent estimation.&lt;/p>
&lt;p>To see what this assumption does and does not allow, consider a concrete example. Suppose an oil price shock in 1985 affects both GDP and trade openness simultaneously. Weak exogeneity tolerates this: the common shock is absorbed by the time fixed effect $\zeta_t$, and regressors may be freely correlated with the fixed effects. What it rules out is that the &lt;em>unexplained&lt;/em> part of today&amp;rsquo;s GDP shock &amp;mdash; the idiosyncratic error $v_{it}$ &amp;mdash; directly causes today&amp;rsquo;s investment to change within the same period.&lt;/p>
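&lt;p>The distinction can be made tangible with a small simulation (an illustrative sketch with made-up parameters, not part of the package). We generate a series in which the regressor reacts to &lt;em>last&lt;/em> period&amp;rsquo;s outcome but never to the current shock &amp;mdash; so weak exogeneity holds by construction while strict exogeneity fails:&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(42)
n = 5000
v = rnorm(n)            # idiosyncratic outcome shocks
y = numeric(n)
x = numeric(n)
for (t in 2:n) {
  x[t] = 0.5 * y[t - 1] + rnorm(1)          # regressor reacts to PAST outcomes only
  y[t] = 0.7 * y[t - 1] + 0.3 * x[t] + v[t] # outcome with feedback loop
}
# Correlated with past shocks (feedback), uncorrelated with the current shock
cor(x[3:n], v[2:(n - 1)])   # clearly positive: strict exogeneity fails
cor(x[2:n], v[2:n])         # near zero: weak exogeneity holds
&lt;/code>&lt;/pre>
&lt;p>The first correlation is substantial because last period&amp;rsquo;s shock feeds into last period&amp;rsquo;s outcome, which the regressor responds to; the second is essentially zero because nothing in the construction lets the current shock reach the current regressor.&lt;/p>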
&lt;h3 id="34-from-cross-sectional-to-dynamic-panel-bma">3.4 From cross-sectional to dynamic panel BMA&lt;/h3>
&lt;p>&lt;strong>Cross-sectional BMA&lt;/strong> uses a single time snapshot, assumes strict exogeneity, includes no lagged dependent variable, and has no fixed effects. &lt;strong>Dynamic panel BMA&lt;/strong> uses multiple time periods, requires only weak exogeneity, includes a lagged dependent variable, and controls for entity and time fixed effects. Both approaches address model uncertainty by averaging across all possible model specifications.&lt;/p>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion cross-sectional tutorial&lt;/a>, we averaged across 4,096 models of CO&lt;sub>2&lt;/sub> emissions using synthetic data. Here we apply the same BMA principle &amp;mdash; weighting models by how well they fit the data &amp;mdash; but to a panel of 73 countries over four decades, using the methodology that handles the endogeneity that cross-sectional BMA cannot.&lt;/p>
&lt;h2 id="4-the-dataset">4. The Dataset&lt;/h2>
&lt;h3 id="41-loading-the-data">4.1 Loading the data&lt;/h3>
&lt;p>The package includes two versions of the Moral-Benito (2016) economic growth dataset. The &lt;code>economic_growth&lt;/code> version has the lagged dependent variable already merged into the panel structure (with NAs in the initial period), while &lt;code>original_economic_growth&lt;/code> keeps it as a separate column.&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;economic_growth&amp;quot;)
data(&amp;quot;original_economic_growth&amp;quot;)
cat(&amp;quot;economic_growth:&amp;quot;, dim(economic_growth), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Countries:&amp;quot;, length(unique(economic_growth$country)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, sort(unique(economic_growth$year)), &amp;quot;\n&amp;quot;)
head(economic_growth, 5)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">economic_growth: 365 12
Countries: 73
Years: 1960 1970 1980 1990 2000
# A tibble: 5 x 12
year country gdp ish sed pgrw pop ipr opem gsh lnlex polity
&amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 1960 1 8.25 NA NA NA NA NA NA NA NA NA
2 1970 1 8.37 0.122 0.139 0.0235 10.9 61.1 1.08 0.191 3.88 0.15
3 1980 1 8.54 0.207 0.141 0.0300 13.9 92.3 1.06 0.203 4.00 0.15
4 1990 1 8.63 0.203 0.28 0.0303 18.9 100. 0.898 0.232 4.10 0.15
5 2000 1 8.66 0.115 0.774 0.0215 25.3 81.2 0.636 0.219 4.21 0.575
&lt;/code>&lt;/pre>
&lt;p>The panel covers 73 countries observed at 10-year intervals from 1960 to 2000, yielding 5 periods per country (365 total rows, including the initial 1960 observation). The 1960 row for each country contains only the initial GDP level &amp;mdash; all regressors are NA because there is no &amp;ldquo;previous decade&amp;rdquo; to compute changes from. The four subsequent decades (1970&amp;ndash;2000) contain the 292 usable observations.&lt;/p>
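&lt;p>The row counts follow directly from the panel dimensions:&lt;/p>
&lt;pre>&lt;code class="language-r"># Panel bookkeeping: 5 decades per country, first decade has no regressors
n_countries = 73
n_periods   = 5
n_countries * n_periods         # 365 total rows
n_countries * (n_periods - 1)   # 292 usable observations
&lt;/code>&lt;/pre>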
&lt;h3 id="42-variable-descriptions">4.2 Variable descriptions&lt;/h3>
&lt;p>The dataset contains the dependent variable (log GDP per capita) and 9 candidate growth determinants:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th style="text-align:center">Expected sign&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>gdp&lt;/code>&lt;/td>
&lt;td>Log real GDP per capita (dependent variable)&lt;/td>
&lt;td style="text-align:center">&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ish&lt;/code>&lt;/td>
&lt;td>Investment share of GDP&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>sed&lt;/code>&lt;/td>
&lt;td>Secondary school enrollment rate&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pgrw&lt;/code>&lt;/td>
&lt;td>Population growth rate&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pop&lt;/code>&lt;/td>
&lt;td>Population (millions)&lt;/td>
&lt;td style="text-align:center">?&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ipr&lt;/code>&lt;/td>
&lt;td>Investment price (relative to US)&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>opem&lt;/code>&lt;/td>
&lt;td>Trade openness ((imports + exports) / GDP)&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>gsh&lt;/code>&lt;/td>
&lt;td>Government consumption share of GDP&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>lnlex&lt;/code>&lt;/td>
&lt;td>Log life expectancy at birth&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>polity&lt;/code>&lt;/td>
&lt;td>Democracy index (0 = autocracy, 1 = democracy)&lt;/td>
&lt;td style="text-align:center">?&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>These variables are standard in the empirical growth literature, following Sala-i-Martin, Doppelhofer, and Miller (2004). Investment share and education are expected to have positive effects on growth, while population growth and government consumption are typically associated with slower growth. The signs for population and democracy are theoretically ambiguous.&lt;/p>
&lt;p>The 292 usable observations span 73 countries over four decades. Log GDP per capita ranges from 6.02 to 10.45, reflecting substantial income inequality &amp;mdash; the richest country is roughly 80 times wealthier than the poorest in per capita terms. Investment share averages 16.9% of GDP but ranges from 1.2% to 65.3%, indicating enormous variation in capital accumulation across countries and decades. Population growth averages 1.9% per decade, with one country experiencing slight population decline (&amp;ndash;0.6%).&lt;/p>
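&lt;p>Because GDP enters in logs, the &amp;ldquo;roughly 80 times&amp;rdquo; claim follows from exponentiating the range (a quick sanity check on the numbers quoted above):&lt;/p>
&lt;pre>&lt;code class="language-r"># Log GDP per capita ranges from 6.02 to 10.45; the income ratio is
exp(10.45 - 6.02)   # about 84, i.e. roughly an 80-fold gap
&lt;/code>&lt;/pre>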
&lt;h2 id="5-data-preparation">5. Data Preparation&lt;/h2>
&lt;p>The Bayesian DSM package requires two data preprocessing steps before estimation: standardization (scaling) and demeaning (removing entity and time fixed effects). These steps ensure numerical stability and allow the model to focus on within-country, within-period variation.&lt;/p>
&lt;h3 id="51-understanding-the-data-structure">5.1 Understanding the data structure&lt;/h3>
&lt;p>If your data has the lagged dependent variable as a separate column (like &lt;code>original_economic_growth&lt;/code>), you first need to merge it into the panel structure using &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>join_lagged_col()&lt;/code>&lt;/a>. This function creates the initial period row with NAs:&lt;/p>
&lt;pre>&lt;code class="language-r"># Demonstration: converting original format to package format
eg_joined &amp;lt;- join_lagged_col(
  df = original_economic_growth,
  col = gdp,
  col_lagged = lag_gdp,
  timestamp_col = year,
  entity_col = country,
  timestep = 10 # 10-year intervals
)
cat(&amp;quot;Result:&amp;quot;, dim(eg_joined), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Result: 365 12
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>economic_growth&lt;/code> dataset already has this structure, so we can use it directly.&lt;/p>
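&lt;p>Conceptually, merging a lag into the panel amounts to shifting the dependent variable down one period within each country. A base-R sketch on a made-up two-country panel (the package&amp;rsquo;s &lt;code>join_lagged_col()&lt;/code> additionally creates the initial-period rows with NA regressors):&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy panel (made-up numbers), initial-period rows already present
toy = data.frame(
  country = rep(1:2, each = 2),
  year    = rep(c(1970, 1980), 2),
  gdp     = c(8.3, 8.5, 7.1, 7.4)
)
# Shift gdp one period within each country; the first period gets NA
toy$lag_gdp = ave(toy$gdp, toy$country,
                  FUN = function(g) c(NA, head(g, -1)))
toy
&lt;/code>&lt;/pre>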
&lt;h3 id="52-standardization-and-demeaning">5.2 Standardization and demeaning&lt;/h3>
&lt;p>Data preparation involves two calls to &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>feature_standardization()&lt;/code>&lt;/a>. The first call &lt;em>standardizes&lt;/em> all regressors to have mean zero and unit variance &amp;mdash; this puts all variables on the same scale so that the BMA coefficients are directly comparable. The second call &lt;em>demeans&lt;/em> by time period to remove time fixed effects.&lt;/p>
&lt;p>Think of demeaning by time as subtracting the global average for each decade. If every country&amp;rsquo;s GDP grew in the 1990s due to the tech boom, demeaning removes that common trend. What remains is each country&amp;rsquo;s deviation from the global pattern &amp;mdash; the variation that country-specific factors must explain.&lt;/p>
&lt;pre>&lt;code class="language-r"># Step 1: Standardize all regressors (mean=0, sd=1)
# Makes variables comparable: GDP and population are on vastly different scales
data_std &amp;lt;- feature_standardization(
  df = economic_growth,
  excluded_cols = c(country, year, gdp)
)
# Step 2: Demean by time period (remove time fixed effects)
# Subtracts each decade's global average, isolating country-specific variation
data_prepared &amp;lt;- feature_standardization(
  df = data_std,
  group_by_col = year,
  excluded_cols = country,
  scale = FALSE
)
head(data_prepared, 5)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"># A tibble: 5 x 12
year country gdp ish sed pgrw pop ipr opem gsh lnlex polity
&amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 1960 1 0.292 NA NA NA NA NA NA NA NA NA
2 1970 1 0.121 -0.493 -0.534 0.163 -0.151 -0.271 1.13 0.0496 -0.549 -0.578
3 1980 1 0.0573 0.241 -0.697 0.942 -0.181 0.0635 1.07 -0.0226 0.0167 -0.578
4 1990 1 0.0724 0.456 -0.932 1.09 -0.203 0.208 0.724 0.101 -0.0655 -0.578
5 2000 1 -0.0823 -0.505 -0.778 0.465 -0.218 -0.0620 -0.120 0.120 -0.107 0.112
&lt;/code>&lt;/pre>
&lt;p>After preparation, all regressor values are centered around zero. Country 1&amp;rsquo;s investment share (&lt;code>ish&lt;/code>) was 0.49 standard deviations below the global average in 1970 but 0.46 standard deviations above average in 1990, showing meaningful within-country variation over time. The GDP column retains its original scale because it is the dependent variable.&lt;/p>
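&lt;p>What the two &lt;code>feature_standardization()&lt;/code> calls accomplish can be reproduced in base R on a toy column (an illustrative sketch with made-up data, not the package&amp;rsquo;s implementation):&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy regressor: 2 countries x 3 years (made-up values)
toy = data.frame(
  country = rep(1:2, each = 3),
  year    = rep(c(1970, 1980, 1990), 2),
  ish     = c(0.10, 0.20, 0.15, 0.30, 0.25, 0.35)
)
# Step 1: standardize across the whole panel (mean 0, sd 1)
toy$ish_std = as.numeric(scale(toy$ish))
# Step 2: demean by year (subtract each year's cross-country average)
toy$ish_dm = toy$ish_std - ave(toy$ish_std, toy$year)
# Each year's demeaned values now average to zero
tapply(toy$ish_dm, toy$year, mean)
&lt;/code>&lt;/pre>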
&lt;h2 id="6-estimating-the-full-model-space">6. Estimating the Full Model Space&lt;/h2>
&lt;p>With 9 candidate regressors, there are $2^9 = 512$ possible regression models. The package estimates every single one via numerical optimization of the &lt;em>marginal likelihood&lt;/em> &amp;mdash; the probability of observing the data given a particular model, after integrating out all parameter uncertainty. Think of this as a cooking competition with 512 recipes &amp;mdash; each uses a different combination of 9 ingredients, and the marginal likelihood scores each recipe by balancing flavor (fit) against unnecessary complexity (overfitting).&lt;/p>
&lt;p>To be concrete: model 1 might include only investment and education. Model 2 adds trade openness. Model 3 uses education and democracy but drops investment. Each of the 512 combinations gets its own likelihood estimated separately, and BMA weights them by how well they fit the data.&lt;/p>
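&lt;p>The model space itself is easy to enumerate: every subset of the 9 regressors is one candidate model. A quick sketch of the bookkeeping (not how the package stores it internally):&lt;/p>
&lt;pre>&lt;code class="language-r">regressors = c('ish', 'sed', 'pgrw', 'pop', 'ipr', 'opem',
               'gsh', 'lnlex', 'polity')
# One row per model: TRUE means the regressor is included
model_grid = expand.grid(rep(list(c(FALSE, TRUE)), length(regressors)))
names(model_grid) = regressors
nrow(model_grid)                # 512 = 2^9 candidate models
sum(rowSums(model_grid) == 2)   # 36 models use exactly two regressors
&lt;/code>&lt;/pre>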
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>optim_model_space()&lt;/code>&lt;/a> function handles this computation. For the full 9-regressor case, this is the most computationally intensive step &amp;mdash; it can take several minutes depending on the machine. The package helpfully includes a precomputed &lt;code>full_model_space&lt;/code> object so we can skip the wait:&lt;/p>
&lt;pre>&lt;code class="language-r"># Load precomputed model space (or compute from scratch)
data(&amp;quot;full_model_space&amp;quot;)
# To compute from scratch (takes several minutes):
# full_model_space &amp;lt;- optim_model_space(
#   df = data_prepared,
#   dep_var_col = gdp,
#   timestamp_col = year,
#   entity_col = country,
#   init_value = 0.5
# )
cat(&amp;quot;Parameters matrix:&amp;quot;, dim(full_model_space$params), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Statistics matrix:&amp;quot;, dim(full_model_space$stats), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Parameters matrix: 106 512
Statistics matrix: 22 512
&lt;/code>&lt;/pre>
&lt;p>The result is a list with two elements. The &lt;code>$params&lt;/code> matrix contains 106 estimated parameters for each of the 512 models &amp;mdash; these include the structural parameters ($\alpha$, $\beta$), reduced-form parameters, and variance components. The &lt;code>$stats&lt;/code> matrix stores 22 statistics per model, including the log-likelihood, BIC, regular standard errors, and robust (heteroskedasticity-consistent) standard errors.&lt;/p>
&lt;p>Why use marginal likelihood instead of R-squared? Unlike R-squared, which always improves when you add variables, the marginal likelihood penalizes complexity. It accounts for the fact that more parameters make it easier to fit noise. A model with 9 regressors that barely improves fit over a 5-regressor model will receive a &lt;em>lower&lt;/em> marginal likelihood score &amp;mdash; the extra parameters were not worth the complexity cost.&lt;/p>
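&lt;p>The complexity penalty is easy to demonstrate with simulated data (an illustrative sketch; the package optimizes a marginal likelihood, while here base R&amp;rsquo;s &lt;code>BIC()&lt;/code> serves as a stand-in):&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(1)
n = 200
x = rnorm(n)
y = 1 + 0.5 * x + rnorm(n)
junk = matrix(rnorm(n * 5), n, 5)   # five pure-noise regressors
fit_small = lm(y ~ x)
fit_big   = lm(y ~ x + junk)
# R-squared can only go up when regressors are added...
c(summary(fit_small)$r.squared, summary(fit_big)$r.squared)
# ...but BIC adds a log(n) penalty per extra parameter, so the
# junk-augmented model typically scores worse (higher BIC)
c(BIC(fit_small), BIC(fit_big))
&lt;/code>&lt;/pre>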
&lt;p>Before jumping into BMA, let us first establish a benchmark using a standard regression approach &amp;mdash; this will help us appreciate what BMA adds.&lt;/p>
&lt;h2 id="7-benchmark-kitchen-sink-fixed-effects">7. Benchmark: Kitchen-Sink Fixed Effects&lt;/h2>
&lt;p>Before running BMA, it is useful to establish a benchmark. What happens if we simply throw all 9 regressors into a single fixed effects regression? This &amp;ldquo;kitchen-sink&amp;rdquo; approach is the default in applied work &amp;mdash; but it commits to one model specification and ignores the uncertainty about which variables belong.&lt;/p>
&lt;pre>&lt;code class="language-r"># Kitchen-sink FE regression with all 9 regressors
fe_full &amp;lt;- lm(gdp ~ lag_gdp + ish + sed + pgrw + pop + ipr +
                opem + gsh + lnlex + polity +
                factor(country) + factor(year),
              data = original_economic_growth)
summary(fe_full)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">FE regression coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
lag_gdp 0.6188 0.0501 12.3521 0.0000
ish 0.4646 0.2331 1.9934 0.0475
sed 0.0162 0.0337 0.4798 0.6319
pgrw -2.3352 2.1409 -1.0907 0.2767
pop 0.0016 0.0004 4.5092 0.0000
ipr -0.0003 0.0003 -1.0817 0.2806
opem 0.1199 0.0379 3.1652 0.0018
gsh -0.7448 0.2700 -2.7585 0.0063
lnlex 0.1153 0.2440 0.4727 0.6369
polity -0.1656 0.0570 -2.9065 0.0041
Significant at 5%: lag_gdp, ish, pop, opem, gsh, polity
R-squared: 0.988
N observations: 292
&lt;/code>&lt;/pre>
&lt;p>The kitchen-sink model finds 6 of 10 variables significant at the 5% level: lagged GDP, investment share, population, trade openness, government share, and democracy. Education, population growth, investment price, and life expectancy are insignificant. But this result depends entirely on this particular specification &amp;mdash; drop one variable or add another, and the significance pattern may change. This is the model uncertainty problem that BMA is designed to solve.&lt;/p>
&lt;p>The lagged GDP coefficient of 0.619 is notably lower than the BMA posterior mean (0.919), suggesting that the kitchen-sink model&amp;rsquo;s coefficient estimates are pulled by multicollinearity among the 9 regressors. BMA handles this by averaging over specifications that include different subsets.&lt;/p>
&lt;p>Notice how the FE model forces a binary judgment: education is &amp;lsquo;insignificant&amp;rsquo; (p = 0.63) and trade is &amp;lsquo;significant&amp;rsquo; (p = 0.002). BMA replaces this all-or-nothing verdict with a nuanced probability scale: education has PIP = 0.72 (moderate evidence) and trade has PIP = 0.77 (positive evidence). The difference between &amp;lsquo;insignificant&amp;rsquo; and &amp;lsquo;moderate evidence&amp;rsquo; matters for policy &amp;mdash; a policymaker who ignores education entirely because of a p-value threshold may be discarding useful information.&lt;/p>
&lt;p>The kitchen-sink model commits to one specification and produces one set of p-values. But we saw that which variables look &amp;lsquo;significant&amp;rsquo; depends entirely on which others are in the model. Drop one variable, and the significance pattern reshuffles. BMA solves this by never committing to a single specification &amp;mdash; it averages over all 512, letting the data decide which matter most.&lt;/p>
&lt;h2 id="8-bayesian-model-averaging">8. Bayesian Model Averaging&lt;/h2>
&lt;h3 id="81-running-bma">8.1 Running BMA&lt;/h3>
&lt;p>Now we can perform Bayesian Model Averaging across all 512 models. The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>bma()&lt;/code>&lt;/a> function takes the precomputed model space and the prepared data, weights each model by its posterior probability, and computes weighted averages of the coefficients:&lt;/p>
&lt;p>&lt;em>Focus on two columns: &lt;strong>PIP&lt;/strong> (how important is this variable?) and &lt;strong>%(+)&lt;/strong> (is its effect consistently positive or negative?).&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-r">bma_results &amp;lt;- bma(full_model_space, df = data_prepared, round = 3)
# Binomial prior results
print(bma_results[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.919 0.077 0.109 0.919 0.077 0.109 100.000
ish 0.773 0.063 0.045 0.062 0.082 0.034 0.059 100.000
sed 0.717 0.030 0.057 0.074 0.042 0.064 0.084 69.922
pgrw 0.714 0.018 0.030 0.052 0.025 0.033 0.060 99.609
pop 0.990 0.119 0.065 0.082 0.121 0.064 0.081 100.000
ipr 0.656 -0.034 0.033 0.044 -0.051 0.027 0.046 0.000
opem 0.766 0.034 0.030 0.033 0.044 0.026 0.031 100.000
gsh 0.751 -0.015 0.041 0.091 -0.020 0.046 0.104 30.859
lnlex 0.864 0.088 0.075 0.098 0.102 0.071 0.099 100.000
polity 0.678 -0.057 0.046 0.053 -0.084 0.030 0.044 0.000
&lt;/code>&lt;/pre>
&lt;p>The binomial prior results reveal a clear hierarchy among the 9 candidate regressors. Population size (&lt;code>pop&lt;/code>) dominates with PIP = 0.990 &amp;mdash; appearing in virtually every high-quality model &amp;mdash; followed by life expectancy (&lt;code>lnlex&lt;/code>) at 0.864 and investment share (&lt;code>ish&lt;/code>) at 0.773. At the other end, investment price (&lt;code>ipr&lt;/code>) at 0.656 and democracy (&lt;code>polity&lt;/code>) at 0.678 show the weakest evidence, though even these exceed 0.5. The lagged GDP coefficient of 0.919 confirms strong persistence: a country&amp;rsquo;s current GDP is heavily determined by its past GDP.&lt;/p>
&lt;h3 id="82-understanding-the-bma-statistics">8.2 Understanding the BMA statistics&lt;/h3>
&lt;p>Each column in the BMA output captures a different aspect of the evidence:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Beginner tip:&lt;/strong> For a first reading, focus on three columns: &lt;strong>PIP&lt;/strong> (does this variable matter?), &lt;strong>PM&lt;/strong> (what is its average effect?), and &lt;strong>%(+)&lt;/strong> (is the effect consistently positive or negative?). The remaining columns (PSDR, PMcon, PSDcon, PSDRcon) are useful for advanced robustness checks but can be skipped on a first pass.&lt;/p>
&lt;/blockquote>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Statistic&lt;/th>
&lt;th>Full name&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>PIP&lt;/strong>&lt;/td>
&lt;td>Posterior Inclusion Probability&lt;/td>
&lt;td>Fraction of posterior probability mass in models that include this variable. Think of it as a &lt;strong>batting average&lt;/strong>: PIP = 0.99 means the variable appeared in 99% of high-scoring models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PM&lt;/strong>&lt;/td>
&lt;td>Posterior Mean&lt;/td>
&lt;td>Weighted average of the coefficient across all models (including zeros from models that exclude the variable)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSD&lt;/strong>&lt;/td>
&lt;td>Posterior Standard Deviation&lt;/td>
&lt;td>Uncertainty around PM, incorporating both within-model and across-model variation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSDR&lt;/strong>&lt;/td>
&lt;td>Robust Posterior SD&lt;/td>
&lt;td>PSD computed from heteroskedasticity-robust standard errors&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PMcon&lt;/strong>&lt;/td>
&lt;td>Conditional Posterior Mean&lt;/td>
&lt;td>Average coefficient only across models that &lt;em>include&lt;/em> the variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSDcon&lt;/strong>&lt;/td>
&lt;td>Conditional PSD&lt;/td>
&lt;td>Uncertainty conditional on inclusion&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSDRcon&lt;/strong>&lt;/td>
&lt;td>Conditional Robust PSD&lt;/td>
&lt;td>Robust uncertainty conditional on inclusion&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>%(+)&lt;/strong>&lt;/td>
&lt;td>Positive sign share&lt;/td>
&lt;td>Percentage of models where the coefficient is positive. Values near 0% or 100% indicate stable sign&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The central quantity driving all these statistics is the &lt;strong>posterior model probability&lt;/strong> (PMP). Each model $M_j$ receives a weight proportional to its marginal likelihood times its prior probability:&lt;/p>
&lt;p>$$\mathbb{P}(M_j | \text{data}) = \frac{\exp(-\frac{1}{2} BIC_j) \cdot \mathbb{P}(M_j)}{\sum_{i=1}^{2^K} \exp(-\frac{1}{2} BIC_i) \cdot \mathbb{P}(M_i)}$$&lt;/p>
&lt;p>In words, this equation says that each model&amp;rsquo;s posterior probability is its prior probability times a data-fit term (approximated by the BIC), divided by the sum across all $2^K$ models to ensure the probabilities add to 1. Models that fit the data well without too many parameters receive higher posterior probability. The PIP for a variable is then the sum of PMPs across all models that include it.&lt;/p>
&lt;p>To make this concrete: if model A has BIC = &amp;ndash;800 and model B has BIC = &amp;ndash;798, model A fits the data better. The gap of 2 BIC points translates into posterior odds of $e^{2/2} = e \approx 2.72$ in favor of A, so after normalizing, model A receives about 73% of the posterior probability while model B gets 27%. The PIP of a variable included only in model A would then be at least 0.73.&lt;/p>
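&lt;p>Posterior model probabilities can be computed from BIC values in a few lines. Because $\exp(-\frac{1}{2}BIC)$ overflows for large |BIC|, the standard trick is to subtract the maximum on the log scale before exponentiating (a generic sketch, assuming a flat prior over models):&lt;/p>
&lt;pre>&lt;code class="language-r">bic = c(A = -800, B = -798)
logw = -bic / 2                 # log weights under a flat model prior
logw = logw - max(logw)         # stabilize before exponentiating
pmp  = exp(logw) / sum(exp(logw))
round(pmp, 3)                   # A gets about 0.731, B about 0.269
&lt;/code>&lt;/pre>
&lt;p>The same log-sum-exp normalization scales to all $2^K$ models at once.&lt;/p>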
&lt;p>The ratio &lt;strong>|PM/PSD|&lt;/strong> &amp;mdash; the posterior mean divided by its posterior standard deviation &amp;mdash; is a key robustness criterion. Raftery (1995) considers a variable &lt;em>robust&lt;/em> when |PM/PSD| &amp;gt; 1. More stringent thresholds include |PM/PSD| &amp;gt; 1.3 (Masanjala and Papageorgiou, 2008) and |PM/PSD| &amp;gt; 2 (Sala-i-Martin et al., 2004).&lt;/p>
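&lt;p>Applying the |PM/PSD| criterion to the posterior means and standard deviations printed in Section 8.1 takes one line (values transcribed from the table above &amp;mdash; a quick check, not additional package output):&lt;/p>
&lt;pre>&lt;code class="language-r">pm  = c(ish = 0.063, sed = 0.030, pgrw = 0.018, pop = 0.119, ipr = -0.034,
        opem = 0.034, gsh = -0.015, lnlex = 0.088, polity = -0.057)
psd = c(0.045, 0.057, 0.030, 0.065, 0.033,
        0.030, 0.041, 0.075, 0.046)
round(abs(pm / psd), 2)   # Raftery (1995): values above 1 indicate robustness
&lt;/code>&lt;/pre>
&lt;p>By this arithmetic, six of the nine regressors clear the loosest bar of |PM/PSD| &amp;gt; 1 (population highest at about 1.83), while none reach the Sala-i-Martin et al. (2004) threshold of 2.&lt;/p>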
&lt;h3 id="83-interpreting-pips-with-rafterys-classification">8.3 Interpreting PIPs with Raftery&amp;rsquo;s classification&lt;/h3>
&lt;p>Raftery (1995) provides a standard classification for the strength of evidence based on PIP values:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>PIP range&lt;/th>
&lt;th>Evidence&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&amp;gt; 0.99&lt;/td>
&lt;td>Very strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.95 &amp;ndash; 0.99&lt;/td>
&lt;td>Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.75 &amp;ndash; 0.95&lt;/td>
&lt;td>Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.50 &amp;ndash; 0.75&lt;/td>
&lt;td>Weak&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Under the binomial prior, &lt;code>pop&lt;/code> (PIP = 0.990) reaches &lt;em>strong&lt;/em> evidence &amp;mdash; just short of the &amp;ldquo;very strong&amp;rdquo; threshold at 0.99. Life expectancy (&lt;code>lnlex&lt;/code> at 0.864), investment share (&lt;code>ish&lt;/code> at 0.773), trade openness (&lt;code>opem&lt;/code> at 0.766), and government share (&lt;code>gsh&lt;/code> at 0.751) fall in the &lt;em>positive&lt;/em> evidence range. The remaining four variables &amp;mdash; education, population growth, investment price, and democracy &amp;mdash; show &lt;em>weak&lt;/em> evidence (0.65&amp;ndash;0.72). No variable has PIP below 0.5, suggesting the data supports relatively large models.&lt;/p>
&lt;p>The &lt;strong>sign stability&lt;/strong> column (%(+)) provides an additional robustness check. Six of the nine regressors have perfectly stable signs: investment share, population, trade openness, and life expectancy are always positive (100%), while investment price and democracy are always negative (0%). Population growth is positive in 99.6% of models &amp;mdash; effectively stable. Government share has %(+) = 30.9%, meaning its sign is negative in about 70% of models &amp;mdash; moderately unstable. Education has %(+) = 69.9%, with a positive coefficient in about 70% of models but negative in 30%.&lt;/p>
&lt;p>The following chart visualizes the PIPs with color-coded evidence tiers. We first define a dark-theme palette and extract the BMA statistics into a data frame, then build the plot:&lt;/p>
&lt;pre>&lt;code class="language-r"># Dark theme palette (matching site navbar/footer)
DARK_BG &amp;lt;- &amp;quot;#0f1729&amp;quot;
LIGHT_TEXT &amp;lt;- &amp;quot;#c8d0e0&amp;quot;
LIGHTER_TEXT &amp;lt;- &amp;quot;#e8ecf2&amp;quot;
# Extract BMA statistics into a data frame
bma_tab &amp;lt;- bma_results[[1]]
pip_df &amp;lt;- data.frame(
  variable = rownames(bma_tab)[-1],
  pip = bma_tab[-1, &amp;quot;PIP&amp;quot;],
  pm = bma_tab[-1, &amp;quot;PM&amp;quot;],
  psd = bma_tab[-1, &amp;quot;PSD&amp;quot;],
  sign_pos = bma_tab[-1, &amp;quot;%(+)&amp;quot;]
)
# Readable labels and robustness classification
var_labels &amp;lt;- c(ish = &amp;quot;Investment share&amp;quot;, sed = &amp;quot;Education&amp;quot;,
                pgrw = &amp;quot;Population growth&amp;quot;, pop = &amp;quot;Population&amp;quot;,
                ipr = &amp;quot;Investment price&amp;quot;, opem = &amp;quot;Trade openness&amp;quot;,
                gsh = &amp;quot;Government share&amp;quot;, lnlex = &amp;quot;Life expectancy&amp;quot;,
                polity = &amp;quot;Democracy&amp;quot;)
pip_df$label &amp;lt;- var_labels[pip_df$variable]
pip_df$robustness &amp;lt;- cut(pip_df$pip,
breaks = c(0, 0.50, 0.75, 1),
labels = c(&amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot;, &amp;quot;Moderate (0.50-0.75)&amp;quot;,
&amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot;),
include.lowest = TRUE)
# PIP bar chart
ggplot(pip_df, aes(x = reorder(label, pip), y = pip,
fill = robustness)) +
geom_col(width = 0.65) +
geom_hline(yintercept = 0.75, linetype = &amp;quot;dashed&amp;quot;,
color = LIGHT_TEXT) +
geom_hline(yintercept = 0.50, linetype = &amp;quot;dotted&amp;quot;,
color = LIGHT_TEXT, alpha = 0.6) +
coord_flip() +
scale_fill_manual(values = c(
&amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Moderate (0.50-0.75)&amp;quot; = &amp;quot;#00d4c8&amp;quot;,
&amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot; = &amp;quot;#d97757&amp;quot;)) +
labs(x = NULL, y = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;,
fill = &amp;quot;Evidence strength&amp;quot;,
title = &amp;quot;BMA: Posterior Inclusion Probabilities&amp;quot;,
subtitle = &amp;quot;Binomial prior (EMS = 4.5), 512 models averaged&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_pip.png" alt="Posterior Inclusion Probabilities for all 9 regressors, sorted by PIP with threshold lines.">&lt;/p>
&lt;p>Population dominates the chart at PIP = 0.990, followed by life expectancy at 0.864. Five variables clear the 0.75 &amp;ldquo;positive evidence&amp;rdquo; threshold, while the remaining four &amp;mdash; democracy, education, population growth, and investment price &amp;mdash; fall in the &amp;ldquo;moderate&amp;rdquo; zone between 0.50 and 0.75. Compared to the kitchen-sink benchmark where 6 of 10 variables were significant at 5%, BMA paints a more nuanced picture: it grades each variable on a continuous scale of importance rather than imposing a binary significant/insignificant cutoff.&lt;/p>
&lt;h2 id="9-visualizing-model-probabilities">9. Visualizing Model Probabilities&lt;/h2>
&lt;h3 id="91-prior-versus-posterior-model-probabilities">9.1 Prior versus posterior model probabilities&lt;/h3>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>model_pmp()&lt;/code>&lt;/a> function visualizes how the data transforms our prior beliefs about which models are best. The prior assigns probability to each of the 512 models, and the data concentrates posterior mass on the models that fit best:&lt;/p>
&lt;pre>&lt;code class="language-r">pmp_plots &amp;lt;- model_pmp(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_03_model_pmp_combined.png" alt="Prior and posterior model probabilities across all 512 models.">&lt;/p>
&lt;p>The prior (dashed line) is relatively flat, reflecting the uniform prior assumption. The posterior (solid line) concentrates dramatically: a handful of models capture the bulk of the posterior mass, while most models receive negligible probability. This concentration is the signature of informative data &amp;mdash; the 73-country, 4-decade panel provides enough information to strongly favor certain model specifications.&lt;/p>
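&lt;p>The mechanics behind this concentration are plain Bayes: each model&amp;rsquo;s posterior probability is its prior probability times its marginal likelihood, renormalized over the model space. A toy sketch with illustrative numbers (not the actual 512-model output):&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy model space: flat prior over 4 models plus log marginal likelihoods
prior &amp;lt;- rep(1 / 4, 4)
logml &amp;lt;- c(-100, -102, -106, -110)
# Posterior model probabilities: prior x likelihood, normalized
# (subtract max(logml) before exponentiating for numerical stability)
w &amp;lt;- prior * exp(logml - max(logml))
pmp &amp;lt;- w / sum(w)
round(pmp, 3) # a flat prior turns into a sharply concentrated posterior
&lt;/code>&lt;/pre>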
&lt;h3 id="92-model-sizes">9.2 Model sizes&lt;/h3>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>model_sizes()&lt;/code>&lt;/a> function shows the distribution of prior and posterior probabilities across model sizes (number of included regressors, excluding the lagged dependent variable):&lt;/p>
&lt;pre>&lt;code class="language-r">size_plots &amp;lt;- model_sizes(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_05_model_sizes.png" alt="Prior and posterior distribution over model sizes.">&lt;/p>
&lt;p>The expected model sizes confirm this visually:&lt;/p>
&lt;pre>&lt;code class="language-r">print(bma_results[[16]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Prior models size Posterior model size
Binomial 4.5 6.908
Binomial-beta 4.5 8.556
&lt;/code>&lt;/pre>
&lt;p>The posterior strongly favors larger models. While the binomial prior centers mass on models with 4&amp;ndash;5 regressors (EMS = 4.5), the posterior shifts toward 7 regressors (6.908). Under the binomial-beta prior, the shift is even more dramatic: the posterior expected model size reaches 8.556, meaning the data wants to include nearly all 9 candidate regressors. This is consistent with the finding that all variables have PIP above 0.65 &amp;mdash; the data sees signal in most candidates.&lt;/p>
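&lt;p>The posterior expected model size reported above is just a PMP-weighted average of model sizes. A minimal sketch with toy numbers:&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy posterior: sizes of 4 models and their posterior probabilities
size &amp;lt;- c(9, 8, 8, 7)
pmp &amp;lt;- c(0.50, 0.25, 0.15, 0.10)
# Expected model size: sum over models of size x PMP
ems_post &amp;lt;- sum(size * pmp)
ems_post # 8.4
&lt;/code>&lt;/pre>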
&lt;h2 id="10-examining-top-models">10. Examining Top Models&lt;/h2>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>best_models()&lt;/code>&lt;/a> function lets us inspect the specific variable combinations and coefficient estimates in the top-ranked models:&lt;/p>
&lt;pre>&lt;code class="language-r">best8 &amp;lt;- best_models(bma_results, criterion = 1, best = 8)
print(best8[[1]]) # Inclusion matrix
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Reading the inclusion matrix: each column is a model (ranked by fit), each row is a variable. A value of 1 means the variable is included in that model. Look for variables that appear in every top model &amp;mdash; those are the most robust.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-text"> 'No. 1' 'No. 2' 'No. 3' 'No. 4' 'No. 5' 'No. 6' 'No. 7' 'No. 8'
gdp_lag 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
ish 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
sed 1.000 1.000 1.000 0.000 1.000 1.000 1.000 1.000
pgrw 1.000 1.000 1.000 1.000 0.000 1.000 1.000 1.000
pop 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
ipr 1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000
opem 1.000 1.000 1.000 1.000 1.000 1.000 0.000 1.000
gsh 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000
lnlex 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
polity 1.000 1.000 0.000 1.000 1.000 1.000 1.000 1.000
PMP 0.089 0.044 0.042 0.036 0.035 0.029 0.026 0.025
&lt;/code>&lt;/pre>
&lt;p>A striking pattern emerges: the top model includes &lt;em>all 9 regressors&lt;/em> (PMP = 8.9%), and the next 7 best models are each formed by dropping exactly one variable from the full set. This &amp;ldquo;kitchen sink minus one&amp;rdquo; pattern confirms that the data supports large models.&lt;/p>
&lt;p>Two variables are never dropped across the top 8 models: &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> &amp;mdash; they appear in all 8, consistent with their high PIPs of 0.990 and 0.864. The variables dropped in models 2&amp;ndash;8 are &lt;code>ipr&lt;/code>, &lt;code>polity&lt;/code>, &lt;code>sed&lt;/code>, &lt;code>pgrw&lt;/code>, &lt;code>gsh&lt;/code>, &lt;code>opem&lt;/code>, and &lt;code>ish&lt;/code> &amp;mdash; precisely the variables with the lowest PIPs.&lt;/p>
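&lt;p>The link between this inclusion matrix and the PIPs can be checked by hand: a variable&amp;rsquo;s PIP is the sum of the PMPs of all models that include it. The sketch below applies that identity to a hypothetical three-model subset with renormalized PMPs, so the numbers are illustrative only:&lt;/p>
&lt;pre>&lt;code class="language-r"># Inclusion indicators for 2 variables across 3 models (1 = included)
inc &amp;lt;- rbind(pop = c(1, 1, 1),
             ipr = c(1, 0, 1))
# PMPs of the 3 models, renormalized to sum to 1
pmp &amp;lt;- c(0.089, 0.044, 0.042)
pmp &amp;lt;- pmp / sum(pmp)
# PIP = inclusion-weighted sum of PMPs
pip &amp;lt;- inc %*% pmp
round(pip, 3) # pop: 1.000, ipr: 0.749
&lt;/code>&lt;/pre>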
&lt;p>We can also examine the coefficient estimates in the best model using the knitr-formatted output:&lt;/p>
&lt;pre>&lt;code class="language-r"># Estimation results for the best model (knitr format)
print(best8[[5]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Best model (No. 1) estimates:
gdp_lag 0.954 (0.076)*** pop 0.065 (0.056)
ish 0.079 (0.032)** ipr -0.056 (0.027)**
sed 0.034 (0.065) opem 0.043 (0.025)*
pgrw 0.025 (0.033) gsh -0.043 (0.050)
lnlex 0.151 (0.060)** polity -0.092 (0.032)***
&lt;/code>&lt;/pre>
&lt;p>In the best model (No. 1), the lagged GDP coefficient is 0.954 (SE = 0.076, significant at 1%), confirming the very slow convergence we derived from the Solow model. Investment share has a positive and significant coefficient of 0.079, while democracy has a negative and highly significant coefficient of &amp;ndash;0.092. Life expectancy is positive and significant at 0.151. Education, despite being included in 7 of the top 8 models, has a small coefficient (0.034) with a large standard error (0.065) &amp;mdash; explaining its moderate PIP despite frequent inclusion.&lt;/p>
&lt;p>This combination &amp;mdash; high inclusion rate but imprecise coefficient &amp;mdash; happens when most models agree that education &lt;em>belongs&lt;/em> in the model but disagree about its magnitude. Some estimate a positive effect of +0.08, others a negative effect of &amp;ndash;0.02. The variable is probably relevant, but the data does not pin down its direction.&lt;/p>
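&lt;p>This distinction is what separates the conditional columns (PMcon, PSDcon) from the unconditional ones (PM, PSD) in the BMA tables. Below is a sketch of the textbook BMA moment formulas, which treat an excluded coefficient as zero; the inputs are illustrative, not package internals:&lt;/p>
&lt;pre>&lt;code class="language-r"># Illustrative inputs: PIP, conditional mean, conditional SD
pip &amp;lt;- 0.718
pm_con &amp;lt;- 0.081
sd_con &amp;lt;- 0.034
# Unconditional mean: conditional mean shrunk toward zero by the PIP
pm &amp;lt;- pip * pm_con
# Unconditional variance: PIP-weighted second moment minus squared mean
v &amp;lt;- pip * (sd_con^2 + pm_con^2) - pm^2
c(PM = round(pm, 3), PSD = round(sqrt(v), 3)) # PM 0.058, PSD 0.046
&lt;/code>&lt;/pre>
&lt;p>Frequent inclusion with a wobbly magnitude inflates the between-model part of this variance, which is exactly the education pattern.&lt;/p>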
&lt;p>Beyond these top models, how do the coefficients distribute across all 512 specifications? The next section examines the full posterior distributions.&lt;/p>
&lt;h2 id="11-coefficient-distributions">11. Coefficient Distributions&lt;/h2>
&lt;p>Before examining individual coefficient distributions, it is helpful to see all posterior means and their uncertainty at a glance. We compute approximate 95% credible intervals as the posterior mean plus or minus two posterior standard deviations:&lt;/p>
&lt;pre>&lt;code class="language-r"># Approximate 95% credible intervals
pip_df$ci_low &amp;lt;- pip_df$pm - 2 * pip_df$psd
pip_df$ci_high &amp;lt;- pip_df$pm + 2 * pip_df$psd
# Coefficient point-range plot
ggplot(pip_df, aes(x = reorder(label, pip), y = pm,
                   color = robustness)) +
  geom_hline(yintercept = 0, linetype = &amp;quot;solid&amp;quot;,
             color = LIGHT_TEXT, alpha = 0.4) +
  geom_pointrange(aes(ymin = ci_low, ymax = ci_high),
                  size = 0.6, linewidth = 0.8) +
  coord_flip() +
  scale_color_manual(values = c(
    &amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
    &amp;quot;Moderate (0.50-0.75)&amp;quot; = &amp;quot;#00d4c8&amp;quot;,
    &amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot; = &amp;quot;#d97757&amp;quot;)) +
  labs(x = NULL, y = &amp;quot;Posterior Mean Coefficient&amp;quot;,
       color = &amp;quot;Evidence strength&amp;quot;,
       title = &amp;quot;BMA: Posterior Coefficient Estimates&amp;quot;,
       subtitle = &amp;quot;Points = posterior mean, bars = PM +/- 2*PSD&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_coef.png" alt="Posterior coefficient estimates with approximate 95% credible intervals for all 9 regressors.">&lt;/p>
&lt;p>Population and life expectancy have the largest positive posterior means, with credible intervals that do not cross zero &amp;mdash; consistent with their high PIPs. Democracy (polity) has a clearly negative effect, also with an interval that excludes zero. Investment price is negative but with a wider interval. Education and government share have credible intervals that straddle zero, reflecting sign instability. Compared to the kitchen-sink FE model, BMA produces posterior means that account for model uncertainty: the intervals are wider than standard confidence intervals because they incorporate variation &lt;em>across&lt;/em> model specifications, not just &lt;em>within&lt;/em> a single specification.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>coef_hist()&lt;/code>&lt;/a> function provides more detailed views of the full posterior distribution of each coefficient across all 512 models, weighted by posterior model probability:&lt;/p>
&lt;pre>&lt;code class="language-r">coef_plots &amp;lt;- coef_hist(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Population&lt;/strong> &amp;mdash; the most robust determinant:&lt;/p>
&lt;pre>&lt;code class="language-r">print(coef_plots[[5]])
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_09_coef_hist_pop.png" alt="Posterior coefficient distribution for population.">&lt;/p>
&lt;p>Population has a tight, entirely positive distribution centered around 0.12, confirming strong and stable evidence for a positive effect on growth.&lt;/p>
&lt;p>These results hold under the default binomial prior. But how sensitive are they to our choice of prior? The next section stress-tests the findings.&lt;/p>
&lt;h2 id="12-sensitivity-to-prior-specification">12. Sensitivity to Prior Specification&lt;/h2>
&lt;p>A critical step in any BMA analysis is checking whether the results change when we alter our prior beliefs. If a variable&amp;rsquo;s PIP is high under one prior but low under another, we should be cautious about declaring it a robust determinant. The following chart compares PIPs across three prior specifications at a glance:&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract PIPs from three prior specifications
bma_tab_bb &amp;lt;- bma_results[[2]] # Binomial-beta
bma_tab_ems2 &amp;lt;- bma_ems2[[1]] # Skeptical (EMS = 2)
sens_df &amp;lt;- data.frame(
  label = pip_df$label,
  Binomial = pip_df$pip,
  BinBeta = bma_tab_bb[-1, &amp;quot;PIP&amp;quot;],
  EMS2 = bma_tab_ems2[-1, &amp;quot;PIP&amp;quot;])
# Pivot to long format for ggplot
sens_long &amp;lt;- sens_df %&amp;gt;%
  pivot_longer(cols = c(Binomial, BinBeta, EMS2),
               names_to = &amp;quot;prior&amp;quot;, values_to = &amp;quot;pip&amp;quot;) %&amp;gt;%
  mutate(prior = factor(prior,
                        levels = c(&amp;quot;EMS2&amp;quot;, &amp;quot;Binomial&amp;quot;, &amp;quot;BinBeta&amp;quot;),
                        labels = c(&amp;quot;Skeptical (EMS=2)&amp;quot;,
                                   &amp;quot;Binomial (EMS=4.5)&amp;quot;,
                                   &amp;quot;Binomial-Beta&amp;quot;)))
# Connecting segments showing the range across priors
seg_df &amp;lt;- sens_df %&amp;gt;%
  mutate(pip_min = pmin(Binomial, BinBeta, EMS2),
         pip_max = pmax(Binomial, BinBeta, EMS2))
# Dumbbell chart
ggplot() +
  geom_vline(xintercept = 0.75, linetype = &amp;quot;dashed&amp;quot;,
             color = LIGHT_TEXT) +
  geom_vline(xintercept = 0.50, linetype = &amp;quot;dotted&amp;quot;,
             color = LIGHT_TEXT, alpha = 0.6) +
  geom_segment(data = seg_df,
               aes(x = pip_min, xend = pip_max,
                   y = reorder(label, Binomial),
                   yend = reorder(label, Binomial)),
               color = LIGHT_TEXT, alpha = 0.3, linewidth = 1.5) +
  geom_point(data = sens_long,
             aes(x = pip, y = reorder(label, pip), color = prior),
             size = 3.5) +
  scale_color_manual(values = c(
    &amp;quot;Skeptical (EMS=2)&amp;quot; = &amp;quot;#d97757&amp;quot;,
    &amp;quot;Binomial (EMS=4.5)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
    &amp;quot;Binomial-Beta&amp;quot; = &amp;quot;#00d4c8&amp;quot;)) +
  labs(x = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;, y = NULL,
       color = &amp;quot;Model prior&amp;quot;,
       title = &amp;quot;Prior Sensitivity: How Robust Are the PIPs?&amp;quot;,
       subtitle = &amp;quot;Same data, three different prior specifications&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_sensitivity.png" alt="Prior sensitivity: PIPs under three different prior specifications.">&lt;/p>
&lt;p>The width of each horizontal segment shows how much a variable&amp;rsquo;s PIP changes across priors. Population is rock-solid: its PIP barely moves (0.964&amp;ndash;0.998) regardless of the prior. Life expectancy shows moderate sensitivity (0.637&amp;ndash;0.974). The bottom four variables &amp;mdash; democracy, education, population growth, and investment price &amp;mdash; are the most sensitive, with PIPs ranging from 0.34 to 0.94 depending on the prior. This visual makes the key message immediately clear: &lt;strong>only population and life expectancy are robust across all prior specifications&lt;/strong>.&lt;/p>
&lt;h3 id="121-binomial-versus-binomial-beta-prior">12.1 Binomial versus binomial-beta prior&lt;/h3>
&lt;p>The default analysis already computes both priors. The &lt;strong>binomial prior&lt;/strong> assigns each variable an independent probability of inclusion equal to EMS/K (where EMS is the expected model size and K is the number of regressors). The &lt;strong>binomial-beta prior&lt;/strong> is more flexible &amp;mdash; it places a prior on the inclusion probability itself, allowing the data to determine how many variables should be included.&lt;/p>
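&lt;p>Under the binomial prior, the prior probability of any specific model therefore depends only on how many regressors it includes. A quick sketch:&lt;/p>
&lt;pre>&lt;code class="language-r"># Binomial model prior: each of K regressors enters independently
# with probability theta = EMS / K
K &amp;lt;- 9
EMS &amp;lt;- 4.5
theta &amp;lt;- EMS / K # 0.5, so every model is equally likely a priori
# Prior probability of a model that includes k regressors
prior_model &amp;lt;- function(k) theta^k * (1 - theta)^(K - k)
prior_model(3) # 1/512, the same for every k when theta = 0.5
&lt;/code>&lt;/pre>
&lt;p>With EMS = K/2 the binomial prior is uniform over all 512 models, which matches the flat prior line in the Section 9 chart.&lt;/p>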
&lt;p>Under the binomial-beta prior, all PIPs increase substantially. Population reaches 0.998, life expectancy reaches 0.974, and even the lowest-ranked variable (investment price) reaches 0.924. The posterior expected model size jumps to 8.556 &amp;mdash; the binomial-beta prior allows the data to express its preference for large models even more strongly than the binomial prior.&lt;/p>
&lt;p>Comparing PIPs across the two priors:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">PIP (Binomial)&lt;/th>
&lt;th style="text-align:center">PIP (Binomial-Beta)&lt;/th>
&lt;th style="text-align:center">Sign&lt;/th>
&lt;th style="text-align:center">Evidence strength&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>pop&lt;/td>
&lt;td style="text-align:center">0.990&lt;/td>
&lt;td style="text-align:center">0.998&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Very strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>lnlex&lt;/td>
&lt;td style="text-align:center">0.864&lt;/td>
&lt;td style="text-align:center">0.974&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ish&lt;/td>
&lt;td style="text-align:center">0.773&lt;/td>
&lt;td style="text-align:center">0.954&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>opem&lt;/td>
&lt;td style="text-align:center">0.766&lt;/td>
&lt;td style="text-align:center">0.952&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>gsh&lt;/td>
&lt;td style="text-align:center">0.751&lt;/td>
&lt;td style="text-align:center">0.948&lt;/td>
&lt;td style="text-align:center">&amp;ndash;/+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>sed&lt;/td>
&lt;td style="text-align:center">0.717&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">+/&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>pgrw&lt;/td>
&lt;td style="text-align:center">0.714&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>polity&lt;/td>
&lt;td style="text-align:center">0.678&lt;/td>
&lt;td style="text-align:center">0.929&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ipr&lt;/td>
&lt;td style="text-align:center">0.656&lt;/td>
&lt;td style="text-align:center">0.924&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The ranking is stable across priors &amp;mdash; &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> remain the top two, and &lt;code>ipr&lt;/code> and &lt;code>polity&lt;/code> remain the bottom two. However, the absolute PIP values depend heavily on the prior, with the binomial-beta prior being far more inclusive. This is expected: the binomial-beta prior concentrates mass on larger models when the data supports them.&lt;/p>
&lt;h3 id="122-varying-expected-model-size">12.2 Varying expected model size&lt;/h3>
&lt;p>The expected model size (EMS) controls how many regressors the prior expects to be relevant. The default EMS = K/2 = 4.5. Let us see what happens with a skeptical prior (EMS = 2, expecting only 2 of 9 regressors to matter) and a generous prior (EMS = 8):&lt;/p>
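&lt;p>Before turning to the results, it helps to see what these EMS values imply. Under the binomial prior the number of included regressors follows a Binomial(K, EMS/K) distribution, so the implied prior over model sizes can be sketched in base R:&lt;/p>
&lt;pre>&lt;code class="language-r"># Implied prior distribution over model sizes for three EMS choices
K &amp;lt;- 9
ems_values &amp;lt;- c(skeptical = 2, default = 4.5, generous = 8)
size_priors &amp;lt;- sapply(ems_values, function(ems)
  dbinom(0:K, size = K, prob = ems / K))
rownames(size_priors) &amp;lt;- paste0(&amp;quot;k=&amp;quot;, 0:K)
round(size_priors, 3)
# The skeptical prior piles mass on 1-3 regressors,
# the generous prior on 7-9
&lt;/code>&lt;/pre>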
&lt;p>With the skeptical EMS = 2 prior, only &lt;code>pop&lt;/code> (PIP = 0.964) and &lt;code>lnlex&lt;/code> (PIP = 0.637) remain above 0.5 under the binomial prior. Investment share drops to 0.483 and democracy falls to 0.372. This tells us that population and life expectancy are the most robust determinants &amp;mdash; they survive even when the prior is heavily biased toward sparse models.&lt;/p>
&lt;p>With EMS = 8, all PIPs exceed 0.94 &amp;mdash; nearly identical to the binomial-beta results, confirming that the data&amp;rsquo;s preference for large models is consistent across prior specifications.&lt;/p>
&lt;p>Full output tables for each prior specification are in Appendix C.&lt;/p>
&lt;h3 id="123-dilution-prior">12.3 Dilution prior&lt;/h3>
&lt;p>Imagine two variables that measure almost the same thing &amp;mdash; say, &amp;lsquo;years of schooling&amp;rsquo; and &amp;lsquo;literacy rate.&amp;rsquo; Including both in a model is redundant, and any model that includes both gets an inflated likelihood simply because it has two ways to capture the same variation.&lt;/p>
&lt;p>When regressors are correlated with each other, standard priors can overcount evidence by giving high probability to models that include near-duplicate variables. The &lt;strong>dilution prior&lt;/strong> (George, 2010) penalizes models whose regressors are highly correlated, adjusting the model prior by the determinant of the correlation matrix:&lt;/p>
&lt;p>$$\mathbb{P}_D(M_j) \propto \mathbb{P}(M_j) \cdot |COR_j|^{\omega}$$&lt;/p>
&lt;p>In words, this formula says that the diluted prior for model $j$ equals the standard prior multiplied by a penalty term. The penalty is the determinant of the correlation matrix among model $j$&amp;rsquo;s regressors, raised to the power $\omega$. When regressors are highly correlated, this determinant is close to zero, pushing the diluted prior toward zero. The parameter $\omega$ controls the strength of the penalty (default = 0.5).&lt;/p>
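&lt;p>The determinant penalty is easy to see numerically: with a near-duplicate pair of regressors the correlation matrix is close to singular, so its determinant, and hence the diluted prior weight, collapses toward zero. A minimal illustration with hypothetical correlations:&lt;/p>
&lt;pre>&lt;code class="language-r"># Correlation matrices for two hypothetical two-regressor models
cor_low &amp;lt;- matrix(c(1, 0.1, 0.1, 1), 2, 2) # weakly correlated pair
cor_high &amp;lt;- matrix(c(1, 0.95, 0.95, 1), 2, 2) # near-duplicate pair
omega &amp;lt;- 0.5
# Dilution weight |COR|^omega applied to each model's prior
det(cor_low)^omega # 0.995: essentially no penalty
det(cor_high)^omega # 0.312: prior weight cut by more than two thirds
&lt;/code>&lt;/pre>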
&lt;pre>&lt;code class="language-r"># Dilution prior with default omega = 0.5
bma_dil &amp;lt;- bma(full_model_space, df = data_prepared,
round = 3, dilution = 1)
print(bma_dil[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.919 0.077 0.107 0.919 0.077 0.107 100.000
ish 0.718 0.058 0.046 0.062 0.081 0.034 0.059 100.000
sed 0.640 0.026 0.055 0.070 0.041 0.064 0.084 69.922
pgrw 0.653 0.017 0.030 0.050 0.026 0.034 0.060 99.609
pop 0.989 0.125 0.065 0.082 0.126 0.064 0.081 100.000
ipr 0.638 -0.033 0.033 0.044 -0.052 0.027 0.045 0.000
opem 0.743 0.034 0.030 0.033 0.046 0.026 0.031 100.000
gsh 0.740 -0.013 0.040 0.090 -0.017 0.046 0.104 30.859
lnlex 0.808 0.081 0.075 0.098 0.100 0.071 0.099 100.000
polity 0.598 -0.049 0.047 0.053 -0.083 0.030 0.044 0.000
&lt;/code>&lt;/pre>
&lt;p>The dilution prior modestly reduces PIPs compared to the standard binomial prior &amp;mdash; for example, &lt;code>ish&lt;/code> drops from 0.773 to 0.718, and &lt;code>polity&lt;/code> drops from 0.678 to 0.598. The posterior expected model size decreases from 6.91 to 6.53. Importantly, the ranking remains unchanged: &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> stay at the top, and the sign stability is unaffected. The dilution prior provides a useful robustness check against multicollinearity inflation.&lt;/p>
&lt;pre>&lt;code class="language-r">sizes_dil &amp;lt;- model_sizes(bma_dil)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_16_sizes_dilution.png" alt="Model sizes under the dilution prior.">&lt;/p>
&lt;p>Having examined the evidence from every angle &amp;mdash; PIPs, coefficients, and sensitivity &amp;mdash; let us now synthesize the findings.&lt;/p>
&lt;h2 id="13-summary-of-findings">13. Summary of Findings&lt;/h2>
&lt;h3 id="131-the-robust-determinants">13.1 The robust determinants&lt;/h3>
&lt;p>Combining evidence across all prior specifications, we can classify each regressor by its robustness:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">PIP (Bin.)&lt;/th>
&lt;th style="text-align:center">PIP (Bin-Beta)&lt;/th>
&lt;th style="text-align:center">PIP (EMS=2)&lt;/th>
&lt;th style="text-align:center">Sign&lt;/th>
&lt;th style="text-align:center">Verdict&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>pop&lt;/td>
&lt;td style="text-align:center">0.990&lt;/td>
&lt;td style="text-align:center">0.998&lt;/td>
&lt;td style="text-align:center">0.964&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">&lt;strong>Robust&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>lnlex&lt;/td>
&lt;td style="text-align:center">0.864&lt;/td>
&lt;td style="text-align:center">0.974&lt;/td>
&lt;td style="text-align:center">0.637&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">&lt;strong>Robust&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ish&lt;/td>
&lt;td style="text-align:center">0.773&lt;/td>
&lt;td style="text-align:center">0.954&lt;/td>
&lt;td style="text-align:center">0.483&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>opem&lt;/td>
&lt;td style="text-align:center">0.766&lt;/td>
&lt;td style="text-align:center">0.952&lt;/td>
&lt;td style="text-align:center">0.468&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>gsh&lt;/td>
&lt;td style="text-align:center">0.751&lt;/td>
&lt;td style="text-align:center">0.948&lt;/td>
&lt;td style="text-align:center">0.459&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>sed&lt;/td>
&lt;td style="text-align:center">0.717&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">0.420&lt;/td>
&lt;td style="text-align:center">+/&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>pgrw&lt;/td>
&lt;td style="text-align:center">0.714&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">0.414&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>polity&lt;/td>
&lt;td style="text-align:center">0.678&lt;/td>
&lt;td style="text-align:center">0.929&lt;/td>
&lt;td style="text-align:center">0.372&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ipr&lt;/td>
&lt;td style="text-align:center">0.656&lt;/td>
&lt;td style="text-align:center">0.924&lt;/td>
&lt;td style="text-align:center">0.344&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Bottom line:&lt;/strong> If you are advising a government on growth policy, population dynamics and public health (life expectancy) are the two levers with the strongest evidence across all modeling assumptions. Investment and trade openness show promise under the default prior but become ambiguous under skeptical specifications. Education and democracy &amp;mdash; despite their intuitive appeal &amp;mdash; are fragile in this framework.&lt;/p>
&lt;/blockquote>
&lt;p>Only two variables &amp;mdash; &lt;strong>population&lt;/strong> and &lt;strong>life expectancy&lt;/strong> &amp;mdash; survive as robust determinants across all prior specifications, maintaining PIP above 0.5 even under the most skeptical prior (EMS = 2). Both have stable positive signs and their coefficients are precisely estimated. Investment share and trade openness show positive evidence under the default prior but become ambiguous under the skeptical prior.&lt;/p>
&lt;h3 id="132-connecting-to-cross-sectional-results">13.2 Connecting to cross-sectional results&lt;/h3>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion cross-sectional tutorial&lt;/a>, we found that BMA, LASSO, and WALS converged on the same set of robust variables for CO&lt;sub>2&lt;/sub> emissions in synthetic data. The dynamic panel BMA analysis here reveals an important nuance: &lt;strong>controlling for reverse causality through the lagged dependent variable and fixed effects changes the landscape of robust determinants&lt;/strong>. The strong persistence of GDP (lagged coefficient = 0.92) absorbs much of the cross-sectional variation, leaving fewer variables with strong independent explanatory power. This is exactly the kind of insight that cross-sectional BMA misses.&lt;/p>
&lt;h2 id="14-conclusion">14. Conclusion&lt;/h2>
&lt;h3 id="141-key-takeaways">14.1 Key takeaways&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Method insight:&lt;/strong> Dynamic panel BMA handles endogeneity that cross-sectional BMA cannot. By including a lagged dependent variable ($\alpha$ = 0.92) and entity/time fixed effects, the Bayesian DSM package allows BMA to work with weakly exogenous regressors, avoiding the bias that plagues standard growth regressions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data insight:&lt;/strong> Of 9 candidate growth determinants, only population size (PIP = 0.990) and life expectancy (PIP = 0.864) are robust across all prior specifications. This confirms the &amp;ldquo;fragility&amp;rdquo; of growth determinants documented by Sala-i-Martin et al. (2004) &amp;mdash; most variables that appear important in one specification become ambiguous under different priors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sensitivity insight:&lt;/strong> Results are moderately sensitive to prior choice. Under the skeptical EMS = 2 prior, only &lt;code>pop&lt;/code> (PIP = 0.964) remains very strong, while even &lt;code>lnlex&lt;/code> drops to 0.637. The binomial-beta prior pushes all variables above PIP = 0.92, reflecting the data&amp;rsquo;s preference for large models (posterior EMS = 8.6).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Jointness insight:&lt;/strong> All regressor pairs are complements (HCGHM &amp;gt; 0), with the strongest complementarity between population and life expectancy (0.71). No substitution effects were detected, suggesting these growth determinants capture distinct dimensions of the development process. See Appendix A for the full jointness analysis.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="142-limitations-and-next-steps">14.2 Limitations and next steps&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Computation cost:&lt;/strong> The &lt;code>optim_model_space()&lt;/code> step estimates all $2^K$ models via numerical optimization. With 9 regressors (512 models), this is feasible. With 15+ regressors ($2^{15}$ = 32,768 models), computation time grows exponentially. For larger variable sets, Markov Chain Monte Carlo (MCMC) sampling over the model space may be necessary.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weak exogeneity assumption:&lt;/strong> While weaker than strict exogeneity, the weak exogeneity assumption still requires that current regressors are uncorrelated with current shocks. If contemporaneous feedback is strong (e.g., a GDP shock immediately changes investment in the same period), the estimates may still be biased.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Extensions:&lt;/strong> The package offers additional features not covered here, including parallel computing for faster model space estimation (&lt;code>cl&lt;/code> parameter in &lt;code>optim_model_space()&lt;/code>), robust standard errors for heteroskedasticity, and the full suite of reduced-form parameters for understanding the dynamic feedback structure.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="143-exercises">14.3 Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Vary the dilution parameter.&lt;/strong> Run &lt;code>bma()&lt;/code> with &lt;code>dilution = 1&lt;/code> and &lt;code>dil.Par = 2&lt;/code> (stronger dilution). How do the PIPs change compared to &lt;code>dil.Par = 0.5&lt;/code>? Which variables are most affected by multicollinearity adjustment?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Examine the small model space.&lt;/strong> Use &lt;code>small_model_space&lt;/code> with only &lt;code>ish&lt;/code>, &lt;code>sed&lt;/code>, and &lt;code>pgrw&lt;/code>. Run the full BMA workflow (including &lt;code>model_pmp()&lt;/code>, &lt;code>model_sizes()&lt;/code>, &lt;code>best_models()&lt;/code>, and &lt;code>jointness()&lt;/code>). Do the PIP rankings change when the competition among regressors is limited to 3?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare standard and robust standard errors.&lt;/strong> Run &lt;code>best_models()&lt;/code> with &lt;code>robust = TRUE&lt;/code> and compare the coefficient significance to the default (regular SE). Are there variables that lose or gain significance under robust inference?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="appendix-a-jointness-analysis">Appendix A: Jointness Analysis&lt;/h2>
&lt;h3 id="what-is-jointness">What is jointness?&lt;/h3>
&lt;p>So far we have examined each regressor individually. But growth determinants do not work in isolation &amp;mdash; they interact. &lt;strong>Jointness&lt;/strong> measures whether two regressors tend to appear in models &lt;em>together&lt;/em> (complements) or &lt;em>separately&lt;/em> (substitutes).&lt;/p>
&lt;p>Think of peanut butter and jelly: each is fine alone, but they show up together so often that their inclusion is correlated. In growth regressions, investment and trade openness might be complements &amp;mdash; countries that invest heavily also trade more, and models that capture one effect benefit from including the other. Conversely, two measures of education (enrollment and literacy) might be substitutes &amp;mdash; including one makes the other redundant.&lt;/p>
&lt;h3 id="three-jointness-measures">Three jointness measures&lt;/h3>
&lt;p>The package implements three jointness measures. The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>jointness()&lt;/code>&lt;/a> function computes pairwise relationships between all regressors:&lt;/p>
&lt;p>&lt;strong>Hofmarcher et al. (HCGHM)&lt;/strong> ranges from &amp;ndash;1 (perfect substitutes) to +1 (perfect complements), with 0 indicating independence. This is the recommended default measure.&lt;/p>
&lt;p>&lt;strong>Ley-Steel (LS)&lt;/strong> ranges from 0 to infinity, where higher values indicate stronger complementarity.&lt;/p>
&lt;p>&lt;strong>Doppelhofer-Weeks (DW)&lt;/strong> classifies relationships as: below &amp;ndash;2 (strong substitutes), &amp;ndash;2 to &amp;ndash;1 (significant substitutes), &amp;ndash;1 to 1 (unrelated), 1 to 2 (significant complements), above 2 (strong complements).&lt;/p>
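&lt;p>The LS and DW variants are computed through the same interface; a minimal sketch, assuming &lt;code>bma_results&lt;/code> from the main workflow and that the &lt;code>measure&lt;/code> argument accepts the codes &lt;code>LS&lt;/code> and &lt;code>DW&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-r"># Pairwise jointness under the two alternative measures
jointness(bma_results, measure = &amp;quot;LS&amp;quot;)  # 0 to infinity; higher = stronger complements
jointness(bma_results, measure = &amp;quot;DW&amp;quot;)  # thresholds at -2, -1, 1, 2
&lt;/code>&lt;/pre>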
&lt;h3 id="jointness-matrices">Jointness matrices&lt;/h3>
&lt;p>The HCGHM jointness matrix (above diagonal = binomial prior, below diagonal = binomial-beta prior):&lt;/p>
&lt;pre>&lt;code class="language-r">jointness(bma_results, measure = &amp;quot;HCGHM&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ish sed pgrw pop ipr opem gsh lnlex polity
ish NA 0.216 0.207 0.530 0.150 0.262 0.243 0.366 0.181
sed 0.805 NA 0.154 0.421 0.115 0.199 0.189 0.288 0.125
pgrw 0.805 0.778 NA 0.416 0.124 0.198 0.186 0.283 0.131
pop 0.905 0.874 0.874 NA 0.304 0.517 0.489 0.711 0.346
ipr 0.781 0.756 0.758 0.845 NA 0.153 0.138 0.209 0.102
opem 0.829 0.801 0.802 0.902 0.780 NA 0.241 0.372 0.169
gsh 0.821 0.794 0.794 0.893 0.772 0.819 NA 0.340 0.154
lnlex 0.864 0.835 0.835 0.944 0.810 0.863 0.853 NA 0.227
polity 0.790 0.763 0.764 0.855 0.744 0.787 0.779 0.817 NA
&lt;/code>&lt;/pre>
&lt;p>All HCGHM values are positive, meaning every pair of regressors acts as complements rather than substitutes. The strongest complementarity under the binomial prior (above diagonal) is between &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> at 0.711 &amp;mdash; population size and life expectancy tend to appear in the best models together. The &lt;code>pop&lt;/code>-&lt;code>ish&lt;/code> pair (0.530) and &lt;code>pop&lt;/code>-&lt;code>opem&lt;/code> pair (0.517) are also moderately complementary. Investment price (&lt;code>ipr&lt;/code>) shows the weakest complementarity with other variables, consistent with its lowest PIP.&lt;/p>
&lt;p>Under the binomial-beta prior (below diagonal), all jointness values increase substantially &amp;mdash; reaching 0.944 for the &lt;code>pop&lt;/code>-&lt;code>lnlex&lt;/code> pair. This is because the binomial-beta prior favors larger models, making it more likely that any two variables appear together.&lt;/p>
&lt;p>The Doppelhofer-Weeks measure confirms these patterns: all pairwise DW values fall between &amp;ndash;1 and +1, with the strongest relationship again between population and life expectancy (DW = 0.153).&lt;/p>
&lt;h2 id="appendix-b-solow-convergence-derivation">Appendix B: Solow Convergence Derivation&lt;/h2>
&lt;p>The Solow model predicts that poorer countries should grow faster than richer ones, conditional on their structural characteristics. This is called &lt;strong>beta convergence&lt;/strong>. Mathematically, the model implies that around the steady state, log GDP per capita evolves according to (Barro and Sala-i-Martin, 2004):&lt;/p>
&lt;p>$$\ln y_{it} = (1 - e^{-\lambda \tau}) \ln y^*_i + e^{-\lambda \tau} \ln y_{i,t-1}$$&lt;/p>
&lt;p>In words, a country&amp;rsquo;s current GDP ($\ln y_{it}$) is a weighted average of two forces: its long-run steady-state level ($\ln y^*_i$), determined by fundamentals like savings and technology, and its GDP in the previous period ($\ln y_{i,t-1}$), which captures where the country currently stands. The parameter $\lambda$ is the &lt;strong>speed of convergence&lt;/strong> &amp;mdash; how fast countries close the gap to their steady state &amp;mdash; and $\tau$ is the time between observations (10 years in our data).&lt;/p>
&lt;p>Now define $\alpha = e^{-\lambda \tau}$. The convergence equation becomes:&lt;/p>
&lt;p>$$\ln y_{it} = \alpha \ln y_{i,t-1} + (1 - \alpha) \ln y^*_i$$&lt;/p>
&lt;p>This is already a dynamic equation &amp;mdash; current GDP depends on lagged GDP. The next step is to recognize that the steady state $\ln y^*_i$ is not observed directly. Instead, it depends on country characteristics such as investment rates, education, trade openness, and institutional quality. Writing these as $\beta' x_{it}$, and adding country fixed effects ($\eta_i$) for unobserved fundamentals, time effects ($\zeta_t$) for global shocks, and an error term ($v_{it}$), we arrive at the dynamic panel equation presented in Section 3.2.&lt;/p>
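&lt;p>The mapping $\alpha = e^{-\lambda \tau}$ can be inverted to recover the convergence speed from an estimated persistence coefficient. A back-of-the-envelope sketch in R, plugging in the posterior mean of &lt;code>gdp_lag&lt;/code> ($\approx 0.94$, see Appendix C) and $\tau = 10$:&lt;/p>
&lt;pre>&lt;code class="language-r"># Invert alpha = exp(-lambda * tau) to recover the speed of convergence
alpha &amp;lt;- 0.943                    # persistence of log GDP (posterior mean of gdp_lag)
tau &amp;lt;- 10                         # years between observations
lambda &amp;lt;- -log(alpha) / tau       # speed of convergence (per year)
half_life &amp;lt;- log(2) / lambda      # years needed to close half the gap to steady state
round(c(lambda = lambda, half_life = half_life), 3)
&lt;/code>&lt;/pre>
&lt;p>A persistence of 0.943 over a decade implies a convergence speed of roughly 0.6% per year &amp;mdash; a half-life of over a century, consistent with the slow conditional convergence typically found in growth regressions.&lt;/p>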
&lt;h2 id="appendix-c-full-sensitivity-output">Appendix C: Full Sensitivity Output&lt;/h2>
&lt;h3 id="binomial-beta-prior">Binomial-beta prior&lt;/h3>
&lt;pre>&lt;code class="language-r"># Binomial-beta results (already computed)
print(bma_results[[2]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.943 0.078 0.130 0.943 0.078 0.130 100.000
ish 0.954 0.076 0.036 0.066 0.079 0.032 0.065 100.000
sed 0.938 0.035 0.063 0.094 0.037 0.064 0.097 69.922
pgrw 0.938 0.024 0.033 0.059 0.026 0.033 0.061 99.609
pop 0.998 0.080 0.062 0.083 0.080 0.062 0.083 100.000
ipr 0.924 -0.050 0.030 0.052 -0.054 0.027 0.052 0.000
opem 0.952 0.041 0.026 0.034 0.043 0.025 0.034 100.000
gsh 0.948 -0.034 0.049 0.120 -0.036 0.049 0.123 30.859
lnlex 0.974 0.134 0.069 0.105 0.138 0.066 0.104 100.000
polity 0.929 -0.084 0.038 0.053 -0.090 0.031 0.049 0.000
&lt;/code>&lt;/pre>
&lt;h3 id="skeptical-prior-ems--2">Skeptical prior (EMS = 2)&lt;/h3>
&lt;pre>&lt;code class="language-r"># Skeptical prior: EMS = 2
bma_ems2 &amp;lt;- bma(full_model_space, df = data_prepared, round = 3, EMS = 2)
print(bma_ems2[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.922 0.081 0.102 0.922 0.081 0.102 100.000
ish 0.483 0.042 0.050 0.059 0.088 0.034 0.057 100.000
sed 0.420 0.015 0.046 0.057 0.036 0.065 0.084 69.922
pgrw 0.414 0.009 0.025 0.040 0.023 0.034 0.061 99.609
pop 0.964 0.144 0.066 0.082 0.149 0.061 0.079 100.000
ipr 0.344 -0.019 0.031 0.037 -0.055 0.028 0.045 0.000
opem 0.468 0.024 0.032 0.033 0.052 0.026 0.030 100.000
gsh 0.459 -0.003 0.032 0.071 -0.007 0.047 0.105 30.859
lnlex 0.637 0.051 0.068 0.087 0.081 0.069 0.097 100.000
polity 0.372 -0.029 0.042 0.046 -0.079 0.031 0.043 0.000
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1162/REST_a_00154" target="_blank" rel="noopener">Moral-Benito, E. (2012). Determinants of Economic Growth: A Bayesian Panel Data Approach. &lt;em>Review of Economics and Statistics&lt;/em>, 94(2), 566&amp;ndash;579.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/07350015.2013.818003" target="_blank" rel="noopener">Moral-Benito, E. (2013). Likelihood-Based Estimation of Dynamic Panels with Predetermined Regressors. &lt;em>Journal of Business and Economic Statistics&lt;/em>, 31(4), 451&amp;ndash;472.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.2429" target="_blank" rel="noopener">Moral-Benito, E. (2016). Growth Empirics in Panel Data Under Model Uncertainty and Weak Exogeneity. &lt;em>Journal of Applied Econometrics&lt;/em>, 31(3), 582&amp;ndash;602.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/web/packages/bdsm/index.html" target="_blank" rel="noopener">Wyszynski, M., Beck, K., and Dubel, M. (2025). Bayesian Dynamic Systems Modeling. R package version 0.3.0. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/0002828042002570" target="_blank" rel="noopener">Sala-i-Martin, X., Doppelhofer, G., and Miller, R.I. (2004). Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach. &lt;em>American Economic Review&lt;/em>, 94(4), 813&amp;ndash;835.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.623" target="_blank" rel="noopener">Fernandez, C., Ley, E., and Steel, M.F.J. (2001). Model Uncertainty in Cross-Country Growth Regressions. &lt;em>Journal of Applied Econometrics&lt;/em>, 16(5), 563&amp;ndash;576.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.1046" target="_blank" rel="noopener">Doppelhofer, G. and Weeks, M. (2009). Jointness of Growth Determinants. &lt;em>Journal of Applied Econometrics&lt;/em>, 24(2), 209&amp;ndash;244.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.1057" target="_blank" rel="noopener">Ley, E. and Steel, M.F.J. (2009). On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regression. &lt;em>Journal of Applied Econometrics&lt;/em>, 24(4), 651&amp;ndash;674.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery, A.E. (1995). Bayesian Model Selection in Social Research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Spatial Dynamic Panel Data Modeling in R: Cigarette Demand Across US States</title><link>https://carlos-mendez.org/post/r_sdpdmod/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_sdpdmod/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>When a state raises its cigarette tax, smokers near the border may simply drive to a neighboring state with lower prices. This cross-border shopping effect means that cigarette consumption in one state depends not only on its own prices and income but also on the prices and consumption patterns of its neighbors. Ignoring these &lt;strong>spatial spillovers&lt;/strong> leads to biased estimates of how prices and income affect cigarette demand &amp;mdash; a problem that standard panel data methods cannot address.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/package=SDPDmod" target="_blank" rel="noopener">SDPDmod&lt;/a> R package (Simonovska, 2025) provides an integrated workflow for spatial panel data modeling. It offers three core capabilities: (1) &lt;strong>Bayesian model comparison&lt;/strong> across six spatial specifications using log-marginal posterior probabilities, (2) &lt;strong>maximum likelihood estimation&lt;/strong> of spatial autoregressive (SAR) and spatial Durbin (SDM) models with optional Lee-Yu bias correction for fixed effects, and (3) &lt;strong>impact decomposition&lt;/strong> into direct, indirect (spillover), and total effects &amp;mdash; including short-run and long-run effects for dynamic models. This tutorial applies all three capabilities to the classic Cigar dataset: cigarette consumption across 46 US states from 1963 to 1992.&lt;/p>
&lt;p>The tutorial follows a progressive approach. We start with the simplest spatial model (SAR) and build toward the most general specification (dynamic SDM with Lee-Yu correction). At each step, we interpret the results in terms of the cigarette market and compare them to simpler models. By the end, you will see how spatial spillovers and habit persistence jointly shape cigarette demand &amp;mdash; and why models that ignore either one can produce misleading policy conclusions.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Load and row-normalize the &lt;code>usa46&lt;/code> binary contiguity matrix from SDPDmod&lt;/li>
&lt;li>Prepare the Cigar panel dataset with log-transformed real prices and income&lt;/li>
&lt;li>Use &lt;code>blmpSDPD()&lt;/code> for Bayesian model comparison across OLS, SAR, SDM, SEM, SDEM, and SLX specifications&lt;/li>
&lt;li>Estimate static SAR and SDM models using &lt;code>SDPDm()&lt;/code> with individual and two-way fixed effects&lt;/li>
&lt;li>Apply the Lee-Yu transformation to correct incidental parameter bias in spatial panels&lt;/li>
&lt;li>Estimate dynamic spatial models with temporal and spatiotemporal lags&lt;/li>
&lt;li>Decompose effects into direct, indirect, and total using &lt;code>impactsSDPDm()&lt;/code>, distinguishing short-run from long-run effects&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;p>The tutorial follows a six-stage pipeline, moving from data preparation through increasingly rich spatial panel models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Data &amp;amp; W&amp;lt;br/&amp;gt;(Section 3-4)&amp;quot;] --&amp;gt; B[&amp;quot;Bayesian&amp;lt;br/&amp;gt;Comparison&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
B --&amp;gt; B2[&amp;quot;Non-Spatial&amp;lt;br/&amp;gt;Baseline&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
B2 --&amp;gt; C[&amp;quot;Static SAR&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
C --&amp;gt; D[&amp;quot;Static SDM&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
D --&amp;gt; E[&amp;quot;Dynamic SDM&amp;lt;br/&amp;gt;(Section 9)&amp;quot;]
E --&amp;gt; F[&amp;quot;Impact&amp;lt;br/&amp;gt;Decomposition&amp;lt;br/&amp;gt;(Section 10)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style B2 fill:#141413,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#d97757,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Each stage builds on the previous one. The Bayesian comparison tells us &lt;em>which&lt;/em> model family fits the data best. The static models establish baseline spatial effects. The dynamic models add habit persistence and separate short-run from long-run responses. The impact decomposition translates all of this into policy-relevant direct and spillover effects.&lt;/p>
&lt;h2 id="3-setup-and-data-preparation">3. Setup and Data Preparation&lt;/h2>
&lt;h3 id="31-install-and-load-packages">3.1 Install and load packages&lt;/h3>
&lt;p>The analysis requires five packages: &lt;code>SDPDmod&lt;/code> for spatial panel modeling, &lt;code>plm&lt;/code> for the Cigar dataset, &lt;code>ggplot2&lt;/code> and &lt;code>reshape2&lt;/code> for visualization, and &lt;code>dplyr&lt;/code> for data manipulation.&lt;/p>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;SDPDmod&amp;quot;, &amp;quot;plm&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;reshape2&amp;quot;, &amp;quot;dplyr&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
library(SDPDmod)
library(plm)
library(ggplot2)
library(reshape2)
library(dplyr)
&lt;/code>&lt;/pre>
&lt;h3 id="32-load-and-prepare-the-cigar-dataset">3.2 Load and prepare the Cigar dataset&lt;/h3>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/plm/vignettes/A_plmPackage.html" target="_blank" rel="noopener">Cigar dataset&lt;/a> (Baltagi, 1992) contains panel data on cigarette consumption in 46 US states from 1963 to 1992. The key variables are &lt;code>sales&lt;/code> (packs per capita), &lt;code>price&lt;/code> (average price per pack in cents), &lt;code>ndi&lt;/code> (per capita disposable income), &lt;code>pimin&lt;/code> (minimum price in adjoining states), and &lt;code>cpi&lt;/code> (consumer price index). We create log-transformed real values to work with &lt;strong>elasticities&lt;/strong> &amp;mdash; in a log-log model, each coefficient represents the percentage change in consumption for a one-percent change in the corresponding variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Load Cigar dataset
data(&amp;quot;Cigar&amp;quot;, package = &amp;quot;plm&amp;quot;)
data1 &amp;lt;- Cigar
# Create log-transformed variables
data1$logc &amp;lt;- log(data1$sales) # log cigarette packs per capita
data1$logp &amp;lt;- log(data1$price / data1$cpi) # log real price
data1$logy &amp;lt;- log(data1$ndi / data1$cpi) # log real per capita income
# Inspect panel structure
cat(&amp;quot;States:&amp;quot;, length(unique(data1$state)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, length(unique(data1$year)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Observations:&amp;quot;, nrow(data1), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">States: 46
Years: 30
Observations: 1380
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">head(data1[, c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;sales&amp;quot;, &amp;quot;price&amp;quot;, &amp;quot;ndi&amp;quot;, &amp;quot;logc&amp;quot;, &amp;quot;logp&amp;quot;, &amp;quot;logy&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> state year sales price ndi logc logp logy
1 1 63 93.9 28.6 1558.305 4.542230 -0.06759329 3.930354
2 1 64 95.4 29.8 1684.073 4.558079 -0.03947881 3.994983
3 1 65 98.5 29.8 1809.842 4.590057 -0.05547915 4.051007
4 1 66 96.4 31.5 1915.160 4.568506 -0.02817088 4.079398
5 1 67 95.5 31.6 2023.546 4.559126 -0.05539878 4.104051
6 1 68 88.4 35.6 2202.486 4.481872 0.02272825 4.147724
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">summary(data1[, c(&amp;quot;logc&amp;quot;, &amp;quot;logp&amp;quot;, &amp;quot;logy&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> logc logp logy
Min. :3.978 Min. :-0.60981 Min. :3.766
1st Qu.:4.681 1st Qu.:-0.20492 1st Qu.:4.423
Median :4.797 Median :-0.10079 Median :4.557
Mean :4.793 Mean :-0.10642 Mean :4.545
3rd Qu.:4.892 3rd Qu.:-0.01225 3rd Qu.:4.686
Max. :5.697 Max. : 0.36399 Max. :5.117
&lt;/code>&lt;/pre>
&lt;p>The panel is balanced with 46 states observed over 30 years (1,380 total observations). Log cigarette consumption (&lt;code>logc&lt;/code>) has a mean of 4.793, corresponding to about 121 packs per capita per year. Real prices (&lt;code>logp&lt;/code>) average -0.106 in log terms, and real per capita income (&lt;code>logy&lt;/code>) averages 4.545. The variation across states and over time in both prices and income is what allows us to identify price and income elasticities &amp;mdash; and the spatial structure across neighboring states is what motivates the spatial models.&lt;/p>
&lt;p>The dataset also includes &lt;code>pimin&lt;/code>, the minimum cigarette price in adjoining states. This variable is inherently spatial &amp;mdash; it measures price competition from neighbors. We do not include &lt;code>pimin&lt;/code> directly in our models because the SDM&amp;rsquo;s spatially lagged price term &lt;code>W*logp&lt;/code> captures the same channel more flexibly. To see why, note that &lt;code>log(pimin/cpi)&lt;/code> and the spatial lag of &lt;code>logp&lt;/code> have a correlation of 0.92 &amp;mdash; they measure essentially the same thing, but the spatial lag uses the full contiguity structure rather than just the cheapest neighbor.&lt;/p>
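&lt;p>That correlation can be checked directly. A minimal sketch, assuming &lt;code>data1&lt;/code> is sorted by state and then year (as the Cigar data ships) and that the row order of &lt;code>W&lt;/code> matches the state coding:&lt;/p>
&lt;pre>&lt;code class="language-r"># Reshape logp into a 30-year x 46-state matrix (data sorted by state, then year)
P &amp;lt;- matrix(data1$logp, nrow = 30)
lagP &amp;lt;- P %*% t(W)                          # spatial lag of log real price, year by year
Pmin &amp;lt;- matrix(log(data1$pimin / data1$cpi), nrow = 30)
cor(as.vector(lagP), as.vector(Pmin))       # close to 0.92, per the discussion above
&lt;/code>&lt;/pre>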
&lt;h3 id="33-exploratory-visualization">3.3 Exploratory visualization&lt;/h3>
&lt;p>Before building models, we plot the raw data. The spaghetti plot below shows cigarette sales per capita for all 46 states over time, with five states highlighted for comparison.&lt;/p>
&lt;pre>&lt;code class="language-r"># Highlight selected states
highlight_states &amp;lt;- c(&amp;quot;CA&amp;quot;, &amp;quot;NY&amp;quot;, &amp;quot;NC&amp;quot;, &amp;quot;KY&amp;quot;, &amp;quot;UT&amp;quot;)
ggplot(data1, aes(x = year + 1900, y = sales, group = state_abbr)) +
geom_line(data = subset(data1, !(state_abbr %in% highlight_states)),
color = &amp;quot;gray80&amp;quot;, linewidth = 0.3) +
geom_line(data = subset(data1, state_abbr %in% highlight_states),
aes(color = state_abbr), linewidth = 1) +
labs(title = &amp;quot;Cigarette Sales per Capita Across 46 US States (1963-1992)&amp;quot;,
x = &amp;quot;Year&amp;quot;, y = &amp;quot;Packs per Capita&amp;quot;, color = &amp;quot;State&amp;quot;) +
theme_minimal()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_SDPDmod_fig4_eda_spaghetti.png" alt="Cigarette sales per capita across 46 US states from 1963 to 1992, with five states highlighted">&lt;/p>
&lt;p>Two patterns jump out. First, &lt;strong>temporal persistence is striking&lt;/strong>: states that consumed heavily in the 1960s (like Kentucky, a major tobacco-producing state with over 150 packs per capita) remained high consumers throughout the period, while low-consumption states like Utah stayed low. This visual persistence foreshadows the dominant role of the lagged dependent variable ($\tau \approx 0.86$) in the dynamic models. Second, there is a &lt;strong>general downward trend&lt;/strong> after the late 1970s, visible across nearly all states, reflecting the cumulative effect of anti-smoking campaigns, health awareness, and rising taxes. Time fixed effects in our panel models will absorb this common trend, isolating the within-state, within-year variation that identifies price and income elasticities.&lt;/p>
&lt;h3 id="34-load-and-row-normalize-the-spatial-weight-matrix">3.4 Load and row-normalize the spatial weight matrix&lt;/h3>
&lt;p>A spatial weight matrix $W$ encodes which states are neighbors. The &lt;code>usa46&lt;/code> matrix included in SDPDmod is a binary contiguity matrix: $w_{ij} = 1$ if states $i$ and $j$ share a border, and $w_{ij} = 0$ otherwise. Row-normalization converts these binary entries into weights that sum to one for each row, so the spatial lag $Wy$ equals the &lt;em>weighted average&lt;/em> of neighboring states' values.&lt;/p>
&lt;pre>&lt;code class="language-r"># Load binary contiguity matrix of 46 US states
data(&amp;quot;usa46&amp;quot;, package = &amp;quot;SDPDmod&amp;quot;)
cat(&amp;quot;Dimensions:&amp;quot;, dim(usa46), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Non-zero entries:&amp;quot;, sum(usa46 != 0), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Average neighbors per state:&amp;quot;, round(mean(rowSums(usa46)), 2), &amp;quot;\n&amp;quot;)
# Row-normalize
W &amp;lt;- rownor(usa46)
cat(&amp;quot;Row-normalized:&amp;quot;, isrownor(W), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dimensions: 46 46
Non-zero entries: 188
Average neighbors per state: 4.09
Row-normalized: TRUE
&lt;/code>&lt;/pre>
&lt;p>The matrix has 188 non-zero entries out of 2,116 cells (8.9% density), meaning the average state shares a border with about 4 neighbors. After row-normalization, the spatial lag of any variable equals the simple average of that variable across a state&amp;rsquo;s contiguous neighbors. For example, the spatial lag of cigarette consumption for a state with 4 neighbors equals the average consumption in those 4 neighboring states.&lt;/p>
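&lt;p>The mechanics are easy to see on a toy example. A minimal base-R sketch replicating what &lt;code>rownor()&lt;/code> does:&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy 3-unit contiguity matrix: unit 1 borders units 2 and 3
A &amp;lt;- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)
W_toy &amp;lt;- A / rowSums(A)        # row-normalize: each row now sums to 1
y &amp;lt;- c(10, 20, 30)
as.numeric(W_toy %*% y)        # spatial lag: 25 (mean of 20 and 30), then 10, 10
&lt;/code>&lt;/pre>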
&lt;h2 id="4-visualizing-the-spatial-weight-matrix">4. Visualizing the Spatial Weight Matrix&lt;/h2>
&lt;p>Before estimating spatial models, it helps to visualize the neighborhood structure. The heatmap below shows the binary contiguity matrix, with each colored cell indicating a pair of neighboring states.&lt;/p>
&lt;pre>&lt;code class="language-r"># Use state abbreviations for the axes
rownames(usa46) &amp;lt;- state_abbr
colnames(usa46) &amp;lt;- state_abbr
usa46_df &amp;lt;- melt(usa46)
colnames(usa46_df) &amp;lt;- c(&amp;quot;State_i&amp;quot;, &amp;quot;State_j&amp;quot;, &amp;quot;Connection&amp;quot;)
usa46_df$Connection &amp;lt;- factor(usa46_df$Connection, levels = c(0, 1),
labels = c(&amp;quot;Not neighbors&amp;quot;, &amp;quot;Neighbors&amp;quot;))
ggplot(usa46_df, aes(x = State_j, y = State_i, fill = Connection)) +
geom_tile(color = &amp;quot;white&amp;quot;, linewidth = 0.1) +
scale_fill_manual(values = c(&amp;quot;Not neighbors&amp;quot; = &amp;quot;gray95&amp;quot;,
&amp;quot;Neighbors&amp;quot; = &amp;quot;#6a9bcc&amp;quot;)) +
labs(title = &amp;quot;Binary Contiguity Matrix of 46 US States&amp;quot;,
x = &amp;quot;State j&amp;quot;, y = &amp;quot;State i&amp;quot;) +
theme_minimal()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_SDPDmod_fig1_weight_matrix.png" alt="Binary contiguity matrix heatmap showing neighborhood structure of 46 US states">&lt;/p>
&lt;p>The sparse pattern confirms that most state pairs are &lt;em>not&lt;/em> neighbors &amp;mdash; only 8.9% of cells are colored. With state abbreviations on the axes, you can verify specific neighborhood relationships: California (CA) neighbors Arizona (AZ), Nevada (NV), and Oregon (OR); Missouri (MO) has the most neighbors at 8. The sparsity is typical of contiguity-based weight matrices and means that spatial effects operate through a relatively small number of direct neighbor relationships. The row-normalized version ensures that each state&amp;rsquo;s spatial lag is an equally weighted average of its neighbors, regardless of whether a state has 2 neighbors or 8.&lt;/p>
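&lt;p>The neighbor counts behind these observations come straight from the row sums of the binary matrix. A quick check, again assuming &lt;code>state_abbr&lt;/code> holds the 46 state codes in matrix order:&lt;/p>
&lt;pre>&lt;code class="language-r">neigh &amp;lt;- rowSums(usa46)
state_abbr[which.max(neigh)]   # most-connected state (Missouri, per the text)
summary(neigh)                 # distribution of neighbor counts
&lt;/code>&lt;/pre>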
&lt;h3 id="42-alternative-weight-matrices">4.2 Alternative weight matrices&lt;/h3>
&lt;p>The SDPDmod package provides several functions for constructing weight matrices from scratch: &lt;code>mOrdNbr()&lt;/code> for higher-order contiguity from shapefiles, &lt;code>mNearestN()&lt;/code> for k-nearest neighbors, &lt;code>InvDistMat()&lt;/code> for inverse distance, and &lt;code>DistWMat()&lt;/code> as a unified wrapper. Since our results may depend on the choice of $W$, we construct a &lt;strong>2nd-order contiguity matrix&lt;/strong> as a robustness check. This matrix treats two states as neighbors if they can be reached from each other in two steps &amp;mdash; that is, if they share at least one common neighbor (friends-of-friends).&lt;/p>
&lt;pre>&lt;code class="language-r"># 2nd-order contiguity: states reachable in 2 steps
W2_raw &amp;lt;- (usa46 %*% usa46) &amp;gt; 0 # TRUE if states i and j share at least one common neighbor
W2_combined &amp;lt;- W2_raw * 1
diag(W2_combined) &amp;lt;- 0 # remove self-connections
W2 &amp;lt;- rownor(W2_combined)
cat(&amp;quot;Original W non-zero entries:&amp;quot;, sum(usa46 != 0), &amp;quot;\n&amp;quot;)
cat(&amp;quot;2nd-order W non-zero entries:&amp;quot;, sum(W2_combined != 0), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Avg neighbors (original):&amp;quot;, round(mean(rowSums(usa46)), 2), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Avg neighbors (2nd-order):&amp;quot;, round(mean(rowSums(W2_combined)), 2), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Original W non-zero entries: 188
2nd-order W non-zero entries: 486
Avg neighbors (original): 4.09
Avg neighbors (2nd-order): 10.57
&lt;/code>&lt;/pre>
&lt;p>The 2nd-order matrix is much denser: 486 non-zero entries versus 188, with an average of 10.6 neighbors per state instead of 4.1. This broader definition of &amp;ldquo;neighbor&amp;rdquo; captures indirect spatial relationships &amp;mdash; for example, New York and Ohio do not share a border, but they share Pennsylvania as a common neighbor. We will use this alternative $W$ for a robustness check in Section 11.&lt;/p>
&lt;h2 id="5-bayesian-model-comparison-with-blmpsdpd">5. Bayesian Model Comparison with &lt;code>blmpSDPD()&lt;/code>&lt;/h2>
&lt;h3 id="51-the-spatial-model-family">5.1 The spatial model family&lt;/h3>
&lt;p>Before estimating any single model, we use Bayesian model comparison to let the data tell us which spatial specification fits best. The SDPDmod package supports six models that differ in &lt;em>where&lt;/em> spatial dependence enters the equation. The general spatial panel model takes the form:&lt;/p>
&lt;p>$$y_t = \rho W y_t + X_t \beta + W X_t \theta + u_t, \quad u_t = \lambda W u_t + \epsilon_t$$&lt;/p>
&lt;p>In words, the outcome $y_t$ can depend on neighbors' outcomes (through $\rho$), on spatially lagged covariates (through $\theta$), and spatial correlation can appear in the error term (through $\lambda$). Different restrictions on these parameters yield different models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
GNS[&amp;quot;General Nesting&amp;lt;br/&amp;gt;ρ, θ, λ&amp;quot;] --&amp;gt;|&amp;quot;λ = 0&amp;quot;| SDM[&amp;quot;SDM&amp;lt;br/&amp;gt;ρ, θ&amp;quot;]
GNS --&amp;gt;|&amp;quot;θ = 0&amp;quot;| SAC[&amp;quot;SAC&amp;lt;br/&amp;gt;ρ, λ&amp;quot;]
GNS --&amp;gt;|&amp;quot;ρ = 0&amp;quot;| SDEM[&amp;quot;SDEM&amp;lt;br/&amp;gt;θ, λ&amp;quot;]
SDM --&amp;gt;|&amp;quot;θ = 0&amp;quot;| SAR[&amp;quot;SAR&amp;lt;br/&amp;gt;ρ&amp;quot;]
SDM --&amp;gt;|&amp;quot;ρ = 0&amp;quot;| SLX[&amp;quot;SLX&amp;lt;br/&amp;gt;θ&amp;quot;]
SAC --&amp;gt;|&amp;quot;λ = 0&amp;quot;| SAR
SDEM --&amp;gt;|&amp;quot;ρ = 0&amp;quot;| SEM[&amp;quot;SEM&amp;lt;br/&amp;gt;λ&amp;quot;]
SDEM --&amp;gt;|&amp;quot;λ = 0&amp;quot;| SLX
SAR --&amp;gt;|&amp;quot;ρ = 0&amp;quot;| OLS[&amp;quot;OLS&amp;lt;br/&amp;gt;No spatial&amp;quot;]
SEM --&amp;gt;|&amp;quot;λ = 0&amp;quot;| OLS
SLX --&amp;gt;|&amp;quot;θ = 0&amp;quot;| OLS
style SDM fill:#d97757,stroke:#141413,color:#fff
style SAR fill:#6a9bcc,stroke:#141413,color:#fff
style SEM fill:#6a9bcc,stroke:#141413,color:#fff
style SDEM fill:#6a9bcc,stroke:#141413,color:#fff
style SLX fill:#6a9bcc,stroke:#141413,color:#fff
style OLS fill:#141413,stroke:#141413,color:#fff
style GNS fill:#00d4c8,stroke:#141413,color:#fff
style SAC fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model&lt;/th>
&lt;th>Equation&lt;/th>
&lt;th>Key Parameters&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>OLS&lt;/td>
&lt;td>$y_t = X_t \beta + \epsilon_t$&lt;/td>
&lt;td>None spatial&lt;/td>
&lt;td>No spatial dependence&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SAR&lt;/td>
&lt;td>$y_t = \rho W y_t + X_t \beta + \epsilon_t$&lt;/td>
&lt;td>$\rho$&lt;/td>
&lt;td>Neighbors' outcomes affect own outcome&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SEM&lt;/td>
&lt;td>$y_t = X_t \beta + u_t$, $u_t = \lambda W u_t + \epsilon_t$&lt;/td>
&lt;td>$\lambda$&lt;/td>
&lt;td>Spatial correlation in unobservables&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SLX&lt;/td>
&lt;td>$y_t = X_t \beta + W X_t \theta + \epsilon_t$&lt;/td>
&lt;td>$\theta$&lt;/td>
&lt;td>Neighbors' covariates affect own outcome&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SDM&lt;/td>
&lt;td>$y_t = \rho W y_t + X_t \beta + W X_t \theta + \epsilon_t$&lt;/td>
&lt;td>$\rho, \theta$&lt;/td>
&lt;td>Both neighbors' outcomes and covariates matter&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SDEM&lt;/td>
&lt;td>$y_t = X_t \beta + W X_t \theta + u_t$, $u_t = \lambda W u_t + \epsilon_t$&lt;/td>
&lt;td>$\theta, \lambda$&lt;/td>
&lt;td>Spatially lagged X plus spatial errors&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The &lt;a href="https://rdrr.io/cran/SDPDmod/man/blmpSDPD.html" target="_blank" rel="noopener">&lt;code>blmpSDPD()&lt;/code>&lt;/a> function computes Bayesian log-marginal posterior probabilities for each model. Unlike classical hypothesis tests that compare models pairwise, this approach assigns a probability to every candidate model simultaneously, making it straightforward to assess which specification the data favors.&lt;/p>
&lt;h3 id="52-static-comparison-with-individual-fixed-effects">5.2 Static comparison with individual fixed effects&lt;/h3>
&lt;p>We begin by comparing all six models under a static specification with individual (state) fixed effects only. This controls for time-invariant differences across states &amp;mdash; such as tobacco culture or geographic remoteness &amp;mdash; but does not control for common time trends like federal tax changes.&lt;/p>
&lt;pre>&lt;code class="language-r">res_ind &amp;lt;- blmpSDPD(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = list(&amp;quot;ols&amp;quot;, &amp;quot;sar&amp;quot;, &amp;quot;sdm&amp;quot;, &amp;quot;sem&amp;quot;, &amp;quot;sdem&amp;quot;, &amp;quot;slx&amp;quot;),
effect = &amp;quot;individual&amp;quot;)
res_ind
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Log-marginal posteriors:
ols sar sdm sem sdem slx
1 884.7551 938.6934 1046.487 993.192 1039.671 930.0585
Model probabilities:
ols sar sdm sem sdem slx
1 0 0 0.9989 0 0.0011 0
&lt;/code>&lt;/pre>
&lt;p>With individual fixed effects, the SDM receives a posterior probability of 99.89%, dominating all other specifications. The SDEM gets only 0.11%, and the remaining models receive essentially zero probability. This overwhelming support for the SDM indicates that both the spatial lag of the dependent variable ($\rho W y$) and the spatial lags of covariates ($W X \theta$) are important for explaining cigarette consumption &amp;mdash; neighbors' prices and income matter above and beyond neighbors' consumption levels.&lt;/p>
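&lt;p>The probabilities in the second row are simply a softmax of the log-marginal posteriors in the first row. As a quick sanity check (hand arithmetic on the printed values above, not part of the &lt;code>blmpSDPD()&lt;/code> output), we can reproduce them directly:&lt;/p>
&lt;pre>&lt;code class="language-r">lmp &amp;lt;- c(ols = 884.7551, sar = 938.6934, sdm = 1046.487,
         sem = 993.192, sdem = 1039.671, slx = 930.0585)
# subtract the maximum before exponentiating to avoid numerical overflow
probs &amp;lt;- exp(lmp - max(lmp)) / sum(exp(lmp - max(lmp)))
round(probs, 4)
#    ols    sar    sdm    sem   sdem    slx
# 0.0000 0.0000 0.9989 0.0000 0.0011 0.0000
&lt;/code>&lt;/pre>
&lt;p>Because the log-marginal posteriors differ by several units, the exponentiation concentrates essentially all probability mass on the best model.&lt;/p>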
&lt;h3 id="53-static-comparison-with-two-way-fixed-effects">5.3 Static comparison with two-way fixed effects&lt;/h3>
&lt;p>Adding time fixed effects controls for common shocks that affect all states simultaneously, such as national anti-smoking campaigns or federal excise tax changes. This typically absorbs much of the cross-sectional variation, so we might expect the model rankings to shift.&lt;/p>
&lt;pre>&lt;code class="language-r">res_tw &amp;lt;- blmpSDPD(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = list(&amp;quot;ols&amp;quot;, &amp;quot;sar&amp;quot;, &amp;quot;sdm&amp;quot;, &amp;quot;sem&amp;quot;, &amp;quot;sdem&amp;quot;, &amp;quot;slx&amp;quot;),
effect = &amp;quot;twoways&amp;quot;,
prior = &amp;quot;beta&amp;quot;) # beta prior concentrates probability near moderate rho values
res_tw
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Log-marginal posteriors:
ols sar sdm sem sdem slx
1 1076.602 1095.993 1100.727 1099.415 1100.621 1080.323
Model probabilities:
ols sar sdm sem sdem slx
1 0 0.004 0.4592 0.1237 0.4131 0
&lt;/code>&lt;/pre>
&lt;p>With two-way fixed effects and a beta prior, the race tightens considerably. The SDM still leads with 45.92% probability, but the SDEM is close behind at 41.31%. The SEM receives 12.37%, while the SAR drops to just 0.4%. This tells us that spatial effects in the covariates ($\theta$) remain important, but there is genuine uncertainty about whether the spatial lag of the dependent variable ($\rho$) or the spatial error term ($\lambda$) best captures the remaining spatial dependence.&lt;/p>
&lt;h3 id="54-dynamic-comparison-with-two-way-fixed-effects">5.4 Dynamic comparison with two-way fixed effects&lt;/h3>
&lt;p>Cigarette consumption is highly persistent over time &amp;mdash; smokers who consumed heavily last year tend to do so again this year. Dynamic models add the lagged dependent variable $y_{t-1}$ and potentially its spatial lag $W y_{t-1}$ to capture this habit persistence.&lt;/p>
&lt;pre>&lt;code class="language-r">res_dyn &amp;lt;- blmpSDPD(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = list(&amp;quot;sar&amp;quot;, &amp;quot;sdm&amp;quot;, &amp;quot;sem&amp;quot;, &amp;quot;sdem&amp;quot;, &amp;quot;slx&amp;quot;),
effect = &amp;quot;twoways&amp;quot;,
ldet = &amp;quot;mc&amp;quot;, # Monte Carlo approximation for the log-determinant (faster for dynamic models)
dynamic = TRUE,
prior = &amp;quot;uniform&amp;quot;) # uniform prior assigns equal weight to all valid rho values
res_dyn
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Log-marginal posteriors:
sar sdm sem sdem slx
1 1987.651 1986.906 1987.799 1986.924 1987.388
Model probabilities:
sar sdm sem sdem slx
1 0.2573 0.1221 0.2984 0.1243 0.1979
&lt;/code>&lt;/pre>
&lt;p>The dynamic comparison produces a dramatically different picture: all five models receive similar probabilities, with the SEM slightly ahead at 29.84%, followed by SAR at 25.73% and SLX at 19.79%. The log-marginal posteriors are nearly identical (within 1 unit), reflecting the fact that once temporal dynamics are included, the remaining spatial signal is much weaker. The lagged dependent variable absorbs much of the persistence that spatial models previously captured.&lt;/p>
&lt;h3 id="55-summary-of-model-comparison">5.5 Summary of model comparison&lt;/h3>
&lt;p>The figure below summarizes the posterior probabilities across all three specification comparisons (see &lt;code>analysis.R&lt;/code> for the full figure code).&lt;/p>
&lt;p>&lt;img src="r_SDPDmod_fig2_model_comparison.png" alt="Bayesian model probabilities across three specifications: static individual FE, static two-way FE, and dynamic two-way FE">&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Specification&lt;/th>
&lt;th>Top Model&lt;/th>
&lt;th>Probability&lt;/th>
&lt;th>Runner-up&lt;/th>
&lt;th>Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Static, Individual FE&lt;/td>
&lt;td>SDM&lt;/td>
&lt;td>99.89%&lt;/td>
&lt;td>SDEM&lt;/td>
&lt;td>0.11%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Static, Two-way FE&lt;/td>
&lt;td>SDM&lt;/td>
&lt;td>45.92%&lt;/td>
&lt;td>SDEM&lt;/td>
&lt;td>41.31%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Dynamic, Two-way FE&lt;/td>
&lt;td>SEM&lt;/td>
&lt;td>29.84%&lt;/td>
&lt;td>SAR&lt;/td>
&lt;td>25.73%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The Bayesian comparison reveals three key insights. First, spatial dependence is unambiguously present &amp;mdash; OLS and SLX never win. Second, the SDM is the preferred static model, which means both the spatial lag of $y$ and the spatial lags of $X$ contribute to explaining cigarette consumption. Third, adding dynamics substantially weakens the ability to discriminate among spatial specifications, because the lagged dependent variable captures much of the temporal persistence that spatial lags previously absorbed. Given that the SDM leads in two of three comparisons and nests the SAR as a special case, we will estimate both the SAR and SDM in the sections that follow, with and without dynamics.&lt;/p>
&lt;h2 id="6-non-spatial-baseline">6. Non-Spatial Baseline&lt;/h2>
&lt;p>Before introducing spatial models, we establish a benchmark using a standard &lt;strong>two-way fixed effects&lt;/strong> panel regression with no spatial terms. This is the model that most applied researchers would start with &amp;mdash; it controls for state-specific and year-specific unobserved heterogeneity but assumes that each state&amp;rsquo;s consumption depends only on its own prices and income, with no spillovers from neighbors.&lt;/p>
&lt;pre>&lt;code class="language-r">pdata &amp;lt;- pdata.frame(data1, index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;))
mod_fe &amp;lt;- plm(logc ~ logp + logy, data = pdata, model = &amp;quot;within&amp;quot;,
effect = &amp;quot;twoways&amp;quot;)
summary(mod_fe)$coefficients
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.0348844 0.04151906 -24.92553 1.881060e-112
logy 0.5285428 0.04658276 11.34632 1.603837e-28
&lt;/code>&lt;/pre>
&lt;p>The non-spatial two-way FE model estimates a price elasticity of -1.035 and an income elasticity of 0.529, both highly significant. The within R-squared is 0.394, meaning that price and income explain about 39% of the within-state, within-year variation in cigarette consumption after removing fixed effects. These estimates serve as the benchmark against which we measure the value added by spatial models. As we will see, the SAR and SDM models produce similar &lt;em>direct&lt;/em> price effects (around -1.00) but reveal substantial &lt;em>indirect&lt;/em> (spillover) effects that the non-spatial model entirely misses &amp;mdash; the total price elasticity in the SDM is -1.23, about 19% larger than the non-spatial estimate.&lt;/p>
&lt;h2 id="7-static-sar-model-estimation">7. Static SAR Model Estimation&lt;/h2>
&lt;h3 id="71-sar-with-individual-fixed-effects">7.1 SAR with individual fixed effects&lt;/h3>
&lt;p>The Spatial Autoregressive (SAR) model adds a single spatial parameter $\rho$ that captures how much a state&amp;rsquo;s cigarette consumption depends on the weighted average of its neighbors' consumption. The model is:&lt;/p>
&lt;p>$$y_t = \rho W y_t + X_t \beta + \mu_i + \epsilon_t$$&lt;/p>
&lt;p>In words, cigarette consumption in state $i$ depends on (1) the average consumption of neighboring states (weighted by $W$, with strength $\rho$), (2) the state&amp;rsquo;s own price and income ($X_t \beta$), and (3) a state-specific intercept ($\mu_i$). The &lt;a href="https://rdrr.io/cran/SDPDmod/man/SDPDm.html" target="_blank" rel="noopener">&lt;code>SDPDm()&lt;/code>&lt;/a> function estimates this model by maximum likelihood. The &lt;code>index&lt;/code> argument specifies the panel identifiers, &lt;code>model = &amp;quot;sar&amp;quot;&lt;/code> selects the spatial lag specification, and &lt;code>effect = &amp;quot;individual&amp;quot;&lt;/code> includes state fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_sar_ind &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sar&amp;quot;,
effect = &amp;quot;individual&amp;quot;)
summary(mod_sar_ind)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sar panel model with individual fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.297576 0.028444 10.462 &amp;lt; 2.2e-16 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.5320053 0.0254445 -20.9085 &amp;lt;2e-16 ***
logy -0.0007088 0.0152139 -0.0466 0.9628
&lt;/code>&lt;/pre>
&lt;p>The spatial autoregressive coefficient $\rho = 0.298$ is highly significant ($t = 10.46$), confirming strong spatial dependence in cigarette consumption. A state&amp;rsquo;s consumption is positively influenced by its neighbors' consumption levels. The price elasticity is -0.532 ($t = -20.91$), meaning a 1% increase in real price reduces consumption by about 0.53%. However, the income coefficient is essentially zero (-0.001, $p = 0.96$), suggesting that with only state fixed effects, income variation does not significantly predict consumption &amp;mdash; likely because state fixed effects absorb cross-sectional income differences, while the within-state time variation in income is confounded with common time trends.&lt;/p>
&lt;h3 id="72-sar-with-two-way-fixed-effects">7.2 SAR with two-way fixed effects&lt;/h3>
&lt;p>Adding time fixed effects controls for year-specific shocks common to all states and typically changes the coefficient estimates substantially.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_sar_tw &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sar&amp;quot;,
effect = &amp;quot;twoways&amp;quot;)
summary(mod_sar_tw)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sar panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.18659 0.02863 6.5173 7.159e-11 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.994860 0.039906 -24.930 &amp;lt; 2.2e-16 ***
logy 0.463555 0.046019 10.073 &amp;lt; 2.2e-16 ***
&lt;/code>&lt;/pre>
&lt;p>With two-way fixed effects, three things change. First, the spatial coefficient drops from 0.298 to 0.187 &amp;mdash; still highly significant but weaker, because time fixed effects absorb some of the common spatial trends. Second, the price elasticity nearly doubles from -0.53 to -0.99, suggesting that the individual-FE-only model was biased by confounding time trends with prices. Third, income becomes strongly significant (0.464, $t = 10.07$): once common time trends are removed, higher real income is associated with &lt;em>more&lt;/em> cigarette consumption, consistent with cigarettes being a normal good at the state level.&lt;/p>
&lt;h3 id="73-impact-decomposition-for-static-sar">7.3 Impact decomposition for static SAR&lt;/h3>
&lt;p>In spatial models, the raw coefficients $\beta$ do not directly tell us how a change in one state&amp;rsquo;s price affects its own consumption. Because of the spatial feedback loop &amp;mdash; my consumption affects my neighbor&amp;rsquo;s, which in turn affects mine &amp;mdash; the actual effect is larger than $\beta$ alone. The &lt;a href="https://rdrr.io/cran/SDPDmod/man/impactsSDPDm.html" target="_blank" rel="noopener">&lt;code>impactsSDPDm()&lt;/code>&lt;/a> function decomposes the total effect into a &lt;strong>direct effect&lt;/strong> (impact on own state) and an &lt;strong>indirect effect&lt;/strong> (spillover to and from neighbors).&lt;/p>
&lt;pre>&lt;code class="language-r">imp_sar_tw &amp;lt;- impactsSDPDm(mod_sar_tw)
summary(imp_sar_tw)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Impact estimates for spatial (static) model
Direct:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.001155 0.038855 -25.767 &amp;lt; 2.2e-16 ***
logy 0.465947 0.044678 10.429 &amp;lt; 2.2e-16 ***
Indirect:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.223484 0.040877 -5.4672 4.571e-08 ***
logy 0.103540 0.018939 5.4670 4.578e-08 ***
Total:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.224639 0.060815 -20.137 &amp;lt; 2.2e-16 ***
logy 0.569487 0.052965 10.752 &amp;lt; 2.2e-16 ***
&lt;/code>&lt;/pre>
&lt;p>The impact decomposition reveals that a 1% increase in a state&amp;rsquo;s own real price reduces its consumption by 1.00% directly, plus an additional 0.22% through spatial feedback &amp;mdash; for a total price elasticity of -1.22. Think of it this way: when one state raises prices, its consumption drops, which in turn reduces the &amp;ldquo;pull&amp;rdquo; on neighboring states' consumption through the spatial lag, creating a ripple effect that feeds back to the original state. Similarly, a 1% income increase raises own-state consumption by 0.47% directly and by 0.10% through neighbors, for a total income elasticity of 0.57. The indirect effects are about 18% of the total effect, indicating economically meaningful spatial spillovers.&lt;/p>
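&lt;p>To see where these numbers come from: in the SAR model the matrix of partial effects is $S = (I - \rho W)^{-1} \beta$, the direct effect is the average diagonal of $S$, and the indirect effect is the average off-diagonal row sum (LeSage and Pace, 2009). The sketch below reproduces this arithmetic using the reported estimates with a toy 4-unit contiguity matrix, so the direct/indirect split is illustrative only (the actual analysis uses the 46-state $W$); the average total effect, however, equals $\beta/(1-\rho)$ exactly for any row-normalized $W$.&lt;/p>
&lt;pre>&lt;code class="language-r">rho  &amp;lt;- 0.18659      # reported spatial coefficient (SAR, two-way FE)
beta &amp;lt;- -0.994860    # reported price coefficient

# toy 4-unit &amp;quot;chain&amp;quot; contiguity matrix, row-normalized (illustrative only)
W_toy &amp;lt;- matrix(c(0, 1, 0, 0,
                  1, 0, 1, 0,
                  0, 1, 0, 1,
                  0, 0, 1, 0), nrow = 4, byrow = TRUE)
W_toy &amp;lt;- W_toy / rowSums(W_toy)

S &amp;lt;- solve(diag(4) - rho * W_toy) * beta  # partial effects dy_i / dx_j
direct   &amp;lt;- mean(diag(S))                 # average own-state effect
total    &amp;lt;- mean(rowSums(S))              # equals beta / (1 - rho) = -1.223
indirect &amp;lt;- total - direct                # average spillover
&lt;/code>&lt;/pre>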
&lt;h2 id="8-static-sdm-with-lee-yu-correction">8. Static SDM with Lee-Yu Correction&lt;/h2>
&lt;h3 id="81-sdm-with-two-way-fixed-effects">8.1 SDM with two-way fixed effects&lt;/h3>
&lt;p>The Spatial Durbin Model (SDM) extends the SAR by adding spatially lagged covariates $W X$, allowing neighbors' prices and income to directly affect a state&amp;rsquo;s consumption (beyond the indirect channel through $\rho W y$):&lt;/p>
&lt;p>$$y_t = \rho W y_t + X_t \beta + W X_t \theta + \mu_i + \gamma_t + \epsilon_t$$&lt;/p>
&lt;p>In words, this says that cigarette consumption depends on neighbors' consumption ($\rho$), own prices and income ($\beta$), &lt;em>and&lt;/em> neighbors' prices and income ($\theta$). Here $\mu_i$ captures state fixed effects and $\gamma_t$ captures time fixed effects. The SDM is the natural model when we believe that cross-border shopping responds directly to neighboring states' prices &amp;mdash; not just indirectly through neighbors' consumption levels.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_sdm_tw &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sdm&amp;quot;,
effect = &amp;quot;twoways&amp;quot;)
summary(mod_sdm_tw)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sdm panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.222591 0.032825 6.7812 1.192e-11 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.002878 0.040094 -25.0134 &amp;lt; 2.2e-16 ***
logy 0.600876 0.057207 10.5036 &amp;lt; 2.2e-16 ***
W*logp 0.048490 0.080807 0.6001 0.5484546
W*logy -0.292794 0.078158 -3.7462 0.0001795 ***
&lt;/code>&lt;/pre>
&lt;p>The SDM reveals an interesting asymmetry. The spatial lag of price (&lt;code>W*logp = 0.049&lt;/code>) is not significant ($p = 0.55$), meaning that neighboring states' prices do not directly affect own consumption once the spatial lag of consumption ($\rho = 0.223$) is accounted for. However, the spatial lag of income (&lt;code>W*logy = -0.293&lt;/code>) is highly significant ($t = -3.75$): when neighboring states become richer, own-state consumption &lt;em>decreases&lt;/em>. This negative spillover in income may reflect a substitution effect &amp;mdash; as neighbors' incomes rise, their consumers may shift toward premium or out-of-state purchasing channels, reducing the spatial demand that pulls up consumption in the focal state.&lt;/p>
&lt;h3 id="82-sdm-with-lee-yu-bias-correction">8.2 SDM with Lee-Yu bias correction&lt;/h3>
&lt;p>Fixed effects in spatial panels create an &lt;strong>incidental parameter problem&lt;/strong>: the large number of fixed effects (46 states + 30 years = 76 parameters) introduces a small-sample bias in the maximum likelihood estimator, particularly for the spatial autoregressive coefficient $\rho$ and the variance $\sigma^2$. The Lee-Yu transformation (Lee and Yu, 2010) corrects this bias by orthogonally transforming the data to concentrate out the fixed effects before estimation.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_sdm_ly &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sdm&amp;quot;,
effect = &amp;quot;twoways&amp;quot;,
LYtrans = TRUE)
summary(mod_sdm_ly)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sdm panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.262211 0.032081 8.1735 2.996e-16 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.001334 0.041121 -24.3509 &amp;lt; 2.2e-16 ***
logy 0.602729 0.058673 10.2726 &amp;lt; 2.2e-16 ***
W*logp 0.090779 0.082185 1.1046 0.2693
W*logy -0.313251 0.079982 -3.9165 8.983e-05 ***
&lt;/code>&lt;/pre>
&lt;p>The Lee-Yu correction increases $\rho$ from 0.223 to 0.262 &amp;mdash; an 18% upward correction, indicating that the uncorrected estimator underestimated spatial dependence. The slope coefficients change only marginally (the price coefficient moves from -1.003 to -1.001), which is expected with $T = 30$ years. For short panels ($T &amp;lt; 10$), the Lee-Yu correction would matter much more. We will use the Lee-Yu corrected version as our preferred static SDM.&lt;/p>
&lt;h3 id="83-comparison-sar-vs-sdm">8.3 Comparison: SAR vs. SDM&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>FE (no spatial)&lt;/th>
&lt;th>SAR (Ind FE)&lt;/th>
&lt;th>SAR (TW FE)&lt;/th>
&lt;th>SDM (TW FE)&lt;/th>
&lt;th>SDM (TW FE, LY)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.298&lt;/td>
&lt;td>0.187&lt;/td>
&lt;td>0.223&lt;/td>
&lt;td>0.262&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>logp&lt;/td>
&lt;td>-1.035&lt;/td>
&lt;td>-0.532&lt;/td>
&lt;td>-0.995&lt;/td>
&lt;td>-1.003&lt;/td>
&lt;td>-1.001&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>logy&lt;/td>
&lt;td>0.529&lt;/td>
&lt;td>-0.001&lt;/td>
&lt;td>0.464&lt;/td>
&lt;td>0.601&lt;/td>
&lt;td>0.603&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*logp&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.049&lt;/td>
&lt;td>0.091&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*logy&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>-0.293&lt;/td>
&lt;td>-0.313&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\hat{\sigma}^2$&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.0067&lt;/td>
&lt;td>0.0051&lt;/td>
&lt;td>0.0050&lt;/td>
&lt;td>0.0052&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Two patterns stand out. First, the price coefficient is remarkably stable across the SDM specifications (around -1.00), while it was biased in the SAR with individual FE only (-0.53). Second, adding the SDM terms increases the income coefficient from 0.46 (SAR) to 0.60 (SDM), because the negative spatial lag of income (&lt;code>W*logy&lt;/code> $\approx$ -0.31) absorbs part of the spatial income effect that the SAR was attributing to the spatial lag $\rho$.&lt;/p>
&lt;h3 id="84-impact-decomposition-for-static-sdm">8.4 Impact decomposition for static SDM&lt;/h3>
&lt;p>The impact decomposition for the SDM differs fundamentally from the SAR because the $W X$ terms create additional channels for indirect effects.&lt;/p>
&lt;pre>&lt;code class="language-r">imp_sdm_ly &amp;lt;- impactsSDPDm(mod_sdm_ly)
summary(imp_sdm_ly)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Impact estimates for spatial (static) model
Direct:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.010329 0.040149 -25.164 &amp;lt; 2.2e-16 ***
logy 0.588471 0.054940 10.711 &amp;lt; 2.2e-16 ***
Indirect:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.21925 0.09439 -2.3228 0.02019 *
logy -0.19721 0.09108 -2.1652 0.03037 *
Total:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.229575 0.105631 -11.6403 &amp;lt; 2.2e-16 ***
logy 0.391262 0.086184 4.5398 5.63e-06 ***
&lt;/code>&lt;/pre>
&lt;p>The SDM impact decomposition tells a richer story than the SAR. For price, the results are similar: a direct effect of -1.01 and an indirect (spillover) effect of -0.22, summing to a total price elasticity of -1.23. However, for income, the SDM flips the sign of the indirect effect: it is now &lt;em>negative&lt;/em> (-0.20) instead of positive (0.10 in the SAR). This means that when neighboring states' incomes rise, the focal state&amp;rsquo;s consumption actually &lt;em>decreases&lt;/em> &amp;mdash; consistent with the significant negative &lt;code>W*logy&lt;/code> coefficient we saw earlier. The total income elasticity in the SDM (0.39) is therefore lower than in the SAR (0.57), because the positive direct effect (0.59) is partially offset by the negative spillover (-0.20). This sign reversal of the income spillover is an important finding that the SAR cannot detect.&lt;/p>
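&lt;p>The sign reversal can be traced directly to the coefficients. Under a row-normalized $W$, the average total effect of a covariate in the SDM is $(\beta + \theta)/(1 - \rho)$, so the significantly negative &lt;code>W*logy&lt;/code> coefficient pulls the total income elasticity well below its direct effect. A back-of-the-envelope check with the Lee-Yu estimates (hand arithmetic, not part of the package output):&lt;/p>
&lt;pre>&lt;code class="language-r">rho &amp;lt;- 0.262211

# income: own coefficient plus spatial-lag coefficient (Lee-Yu SDM)
beta_logy  &amp;lt;- 0.602729
theta_logy &amp;lt;- -0.313251
(beta_logy + theta_logy) / (1 - rho)  # ~0.392, near the reported total of 0.391

# price: theta is small and insignificant, so the total stays near beta / (1 - rho)
beta_logp  &amp;lt;- -1.001334
theta_logp &amp;lt;- 0.090779
(beta_logp + theta_logp) / (1 - rho)  # ~-1.234, near the reported total of -1.230
&lt;/code>&lt;/pre>
&lt;p>The small differences from the reported totals arise because &lt;code>impactsSDPDm()&lt;/code> simulates the full $(I - \rho W)^{-1}(\beta I + \theta W)$ matrix rather than using this scalar shortcut.&lt;/p>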
&lt;h2 id="9-dynamic-spatial-panel-models">9. Dynamic Spatial Panel Models&lt;/h2>
&lt;h3 id="91-why-dynamics-habit-persistence-in-cigarette-consumption">9.1 Why dynamics? Habit persistence in cigarette consumption&lt;/h3>
&lt;p>Cigarette consumption is strongly habit-forming. Nicotine addiction creates a direct link between past and present consumption: last year&amp;rsquo;s smokers are very likely to be this year&amp;rsquo;s smokers. Ignoring this temporal persistence in a static model means that the spatial coefficient $\rho$ must absorb &lt;em>both&lt;/em> spatial spillovers and the serial correlation in consumption patterns, leading to biased estimates of the true spatial effect. Dynamic models explicitly include the lagged dependent variable $y_{t-1}$ (with coefficient $\tau$, capturing &lt;strong>habit persistence&lt;/strong>) and optionally its spatial lag $W y_{t-1}$ (with coefficient $\eta$, capturing &lt;strong>spatiotemporal diffusion&lt;/strong>):&lt;/p>
&lt;p>$$y_t = \rho W y_t + \tau y_{t-1} + \eta W y_{t-1} + X_t \beta + W X_t \theta + \mu_i + \gamma_t + \epsilon_t$$&lt;/p>
&lt;p>In words, this equation says that today&amp;rsquo;s cigarette consumption depends on: neighbors' current consumption ($\rho$), own past consumption ($\tau$, habit persistence), neighbors' past consumption ($\eta$, spatiotemporal diffusion), own prices and income ($\beta$), and neighbors' prices and income ($\theta$). Here $y_{t-1}$ corresponds to &lt;code>logc(t-1)&lt;/code> in the output, and $Wy_{t-1}$ corresponds to &lt;code>W*logc(t-1)&lt;/code>.&lt;/p>
&lt;h3 id="92-dynamic-sar-with-temporal-lag-only">9.2 Dynamic SAR with temporal lag only&lt;/h3>
&lt;p>We start by adding only the temporal lag $y_{t-1}$ without the spatiotemporal lag $W y_{t-1}$, to isolate the effect of habit persistence on the spatial coefficient.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_dsar_tl &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sar&amp;quot;,
effect = &amp;quot;twoways&amp;quot;,
LYtrans = TRUE,
dynamic = TRUE,
tlaginfo = list(ind = NULL, tl = TRUE, stl = FALSE))
summary(mod_dsar_tl)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sar dynamic panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.0095932 0.0169929 0.5645 0.5724
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logc(t-1) 0.866212 0.012785 67.7523 &amp;lt; 2.2e-16 ***
logp -0.254617 0.023047 -11.0478 &amp;lt; 2.2e-16 ***
logy 0.084437 0.023719 3.5598 0.0003711 ***
&lt;/code>&lt;/pre>
&lt;p>This result is striking. The temporal lag coefficient $\tau = 0.866$ is enormous ($t = 67.75$), confirming that cigarette consumption is extremely persistent &amp;mdash; about 87% of last year&amp;rsquo;s consumption carries over to this year. More remarkably, the spatial autoregressive coefficient $\rho$ collapses from 0.262 (static SDM) to just 0.010 and becomes &lt;em>non-significant&lt;/em> ($p = 0.57$). This suggests that what appeared to be contemporaneous spatial dependence in the static model was largely a proxy for temporal persistence: states that consumed heavily in the past continue to do so, and neighboring states happen to share similar histories. The short-run price elasticity also drops sharply from -1.00 to -0.25, because the lagged dependent variable now captures the cumulative effect of past prices.&lt;/p>
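&lt;p>One way to appreciate the size of $\tau = 0.866$: treating consumption as a simple AR(1) process (and ignoring the now-negligible spatial feedback), a shock decays geometrically at rate $\tau$, so the number of years until it halves is $\log(0.5)/\log(\tau)$. A quick illustrative calculation:&lt;/p>
&lt;pre>&lt;code class="language-r">tau &amp;lt;- 0.866212                   # estimated temporal lag coefficient
half_life &amp;lt;- log(0.5) / log(tau)  # years until a consumption shock halves
round(half_life, 1)               # 4.8
&lt;/code>&lt;/pre>
&lt;p>A shock to consumption thus takes almost five years to dissipate by half, which is why static estimates that omit $y_{t-1}$ absorb so much of this persistence into $\rho$.&lt;/p>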
&lt;h3 id="93-dynamic-sar-with-temporal-and-spatiotemporal-lags">9.3 Dynamic SAR with temporal and spatiotemporal lags&lt;/h3>
&lt;p>Adding the spatiotemporal lag $W y_{t-1}$ allows us to test whether neighboring states' &lt;em>past&lt;/em> consumption patterns affect current consumption.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_dsar_full &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sar&amp;quot;,
effect = &amp;quot;twoways&amp;quot;,
LYtrans = TRUE,
dynamic = TRUE,
tlaginfo = list(ind = NULL, tl = TRUE, stl = TRUE))
summary(mod_dsar_full)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sar dynamic panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.703004 0.021363 32.907 &amp;lt; 2.2e-16 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logc(t-1) 0.882056 0.013012 67.789 &amp;lt; 2e-16 ***
W*logc(t-1) -0.727317 0.026033 -27.938 &amp;lt; 2e-16 ***
logp -0.243591 0.023337 -10.438 &amp;lt; 2e-16 ***
logy 0.055595 0.023933 2.323 0.02018 *
&lt;/code>&lt;/pre>
&lt;p>Adding the spatiotemporal lag dramatically changes the picture. The spatial coefficient $\rho$ jumps to 0.703, and the spatiotemporal lag $\eta = -0.727$ is strongly negative ($t = -27.94$). The temporal lag $\tau = 0.882$ remains dominant. The large $\rho$ combined with the nearly equal-and-opposite $\eta$ suggests a complex dynamic pattern: states with high &lt;em>current&lt;/em> neighbor consumption tend to have higher own consumption ($\rho &amp;gt; 0$), but states whose neighbors consumed heavily &lt;em>last year&lt;/em> tend to have &lt;em>lower&lt;/em> current consumption ($\eta &amp;lt; 0$). However, the near-cancellation of $\rho$ and $\eta$ may also indicate multicollinearity between $Wy_t$ and $Wy_{t-1}$, making the individual coefficients hard to interpret reliably. The dynamic SDM in Section 9.4, which adds covariates' spatial lags, provides a more stable decomposition.&lt;/p>
&lt;h3 id="94-dynamic-sdm-with-both-lags-and-lee-yu-correction">9.4 Dynamic SDM with both lags and Lee-Yu correction&lt;/h3>
&lt;p>The most general model combines all elements: spatial lag of $y$, temporal lag, spatiotemporal lag, and spatial lags of $X$, all with Lee-Yu bias correction.&lt;/p>
&lt;pre>&lt;code class="language-r">mod_dsdm &amp;lt;- SDPDm(formula = logc ~ logp + logy, data = data1, W = W,
index = c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;),
model = &amp;quot;sdm&amp;quot;,
effect = &amp;quot;twoways&amp;quot;,
LYtrans = TRUE,
dynamic = TRUE,
tlaginfo = list(ind = NULL, tl = TRUE, stl = TRUE))
summary(mod_dsdm)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">sdm dynamic panel model with twoways fixed effects
Spatial autoregressive coefficient:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
rho 0.162189 0.036753 4.4129 1.02e-05 ***
Coefficients:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logc(t-1) 0.864412 0.012879 67.1163 &amp;lt; 2.2e-16 ***
W*logc(t-1) -0.096270 0.038810 -2.4805 0.0131186 *
logp -0.270872 0.023145 -11.7031 &amp;lt; 2.2e-16 ***
logy 0.104262 0.029783 3.5007 0.0004641 ***
W*logp 0.195595 0.043870 4.4585 8.254e-06 ***
W*logy -0.032464 0.039520 -0.8215 0.4113891
&lt;/code>&lt;/pre>
&lt;p>The dynamic SDM produces the most nuanced picture. Habit persistence remains dominant ($\tau = 0.864$, $t = 67.12$). The spatial coefficient $\rho = 0.162$ is significant but much smaller than in the static model ($\rho = 0.262$), confirming that static models overstate contemporaneous spatial dependence by conflating it with temporal persistence. The spatiotemporal lag is weakly significant ($\eta = -0.096$, $p = 0.013$). Notably, the spatial lag of price (&lt;code>W*logp = 0.196&lt;/code>) is now &lt;em>positive&lt;/em> and significant ($t = 4.46$), a reversal from the static SDM where it was not significant. This positive coefficient means that when neighboring states' prices rise, own-state consumption &lt;em>increases&lt;/em> &amp;mdash; precisely the cross-border shopping effect we hypothesized. Smokers respond to neighbors' price increases by purchasing more in their own (now relatively cheaper) state. The spatial lag of income (&lt;code>W*logy = -0.032&lt;/code>) is no longer significant once dynamics are included.&lt;/p>
&lt;h3 id="95-impact-decomposition-short-run-and-long-run-effects">9.5 Impact decomposition: short-run and long-run effects&lt;/h3>
&lt;p>For dynamic models, &lt;code>impactsSDPDm()&lt;/code> separates effects into &lt;strong>short-run&lt;/strong> (immediate, one-period) and &lt;strong>long-run&lt;/strong> (cumulative, steady-state) impacts. The long-run effects account for the feedback loop through the lagged dependent variable: a price change today affects consumption today, which affects consumption next year (through $\tau$), which feeds back again, and so on until a new equilibrium is reached.&lt;/p>
&lt;pre>&lt;code class="language-r">imp_dsdm &amp;lt;- impactsSDPDm(mod_dsdm)
summary(imp_dsdm)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Impact estimates for spatial dynamic model
========================================================
Short-term
Direct:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.261569 0.022830 -11.457 &amp;lt; 2.2e-16 ***
logy 0.101759 0.029667 3.430 0.0006035 ***
Indirect:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp 0.178932 0.046861 3.8183 0.0001344 ***
logy -0.015109 0.042210 -0.3579 0.7203812
Total:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -0.082637 0.052143 -1.5848 0.1130
logy 0.086650 0.037890 2.2868 0.0222 *
========================================================
Long-term
Direct:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.92836 0.20580 -9.3702 &amp;lt; 2.2e-16 ***
logy 0.80149 0.22655 3.5378 0.0004034 ***
Indirect:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp 0.91054 0.58271 1.5626 0.1181
logy 0.48361 1.54612 0.3128 0.7544
Total:
Estimate Std. Error t-value Pr(&amp;gt;|t|)
logp -1.01783 0.66733 -1.5252 0.1272
logy 1.28510 1.59825 0.8041 0.4214
&lt;/code>&lt;/pre>
&lt;p>The gap between short-run and long-run effects is dramatic. The &lt;strong>short-run direct price elasticity&lt;/strong> is only -0.26, meaning that a 1% price increase immediately reduces consumption by just 0.26%. But the &lt;strong>long-run direct price elasticity&lt;/strong> is -1.93 &amp;mdash; more than seven times larger &amp;mdash; because the habit persistence mechanism ($\tau = 0.864$) amplifies the initial shock over time. Think of it as a snowball effect: a small reduction today accumulates year after year because lower consumption this year leads to lower consumption next year, and so on.&lt;/p>
&lt;p>The short-run indirect (spillover) effect of price is &lt;em>positive&lt;/em> (0.179): when a state raises its prices, neighboring states' consumption increases in the short run, consistent with cross-border shopping. This positive spillover partly offsets the direct negative effect, making the short-run &lt;em>total&lt;/em> price elasticity (-0.083) small and statistically non-significant. In the long run, the indirect price effect remains positive (0.911) but becomes imprecisely estimated and non-significant, while the direct effect (-1.928) dominates. The long-run total effects for both price and income are estimated with large standard errors, reflecting the uncertainty inherent in extrapolating dynamic effects to the steady state. The non-significance of these long-run totals means that, despite large point estimates, we cannot reliably predict the net cumulative impact of price or income changes across the full spatial system. Note that the long-run effects assume the system reaches a stable equilibrium, which requires the stationarity condition $|\tau + \rho \eta| &amp;lt; 1$ to hold.&lt;/p>
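As a back-of-envelope check, the long-run amplification can be approximated from the short-run estimate and the habit-persistence coefficient alone. This sketch ignores the spatial feedback that the exact table values account for, so it only approximates the reported long-run direct effect:

```python
# Back-of-envelope check of the long-run amplification, using the dynamic
# SDM estimates reported above. This ignores spatial feedback, so it only
# approximates the exact long-run direct effect in the table.
tau, rho, eta = 0.864, 0.162, -0.096     # temporal, spatial, spatiotemporal
sr_direct_logp = -0.261569               # short-run direct price effect

# Stationarity condition for a steady state: |tau + rho*eta| < 1
assert abs(tau + rho * eta) < 1

# Geometric accumulation: sr * (1 + tau + tau^2 + ...) = sr / (1 - tau)
lr_approx = sr_direct_logp / (1 - tau)
print(round(lr_approx, 2))               # close to the reported -1.93
```

The small gap between this approximation and the table's -1.93 comes from the spatial terms omitted here.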
&lt;h3 id="96-comparison-of-dynamic-specifications">9.6 Comparison of dynamic specifications&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Static SDM (LY)&lt;/th>
&lt;th>Dyn SAR (tl)&lt;/th>
&lt;th>Dyn SAR (tl+stl)&lt;/th>
&lt;th>Dyn SDM (LY)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>0.262&lt;/td>
&lt;td>0.010&lt;/td>
&lt;td>0.703&lt;/td>
&lt;td>0.162&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\tau$ (logc_{t-1})&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.866&lt;/td>
&lt;td>0.882&lt;/td>
&lt;td>0.864&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\eta$ (W*logc_{t-1})&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>-0.727&lt;/td>
&lt;td>-0.096&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>logp&lt;/td>
&lt;td>-1.001&lt;/td>
&lt;td>-0.255&lt;/td>
&lt;td>-0.244&lt;/td>
&lt;td>-0.271&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>logy&lt;/td>
&lt;td>0.603&lt;/td>
&lt;td>0.084&lt;/td>
&lt;td>0.056&lt;/td>
&lt;td>0.104&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*logp&lt;/td>
&lt;td>0.091&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.196&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*logy&lt;/td>
&lt;td>-0.313&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>-0.032&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\hat{\sigma}^2$&lt;/td>
&lt;td>0.0052&lt;/td>
&lt;td>0.0012&lt;/td>
&lt;td>0.0012&lt;/td>
&lt;td>0.0012&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The table reveals that temporal dynamics fundamentally reshape the spatial story. The temporal lag coefficient ($\tau \approx 0.86$) is remarkably stable across all dynamic specifications, confirming that habit persistence is the dominant force. The spatial coefficient $\rho$ varies widely depending on whether the spatiotemporal lag is included, highlighting the sensitivity of spatial inference to the dynamic specification. The short-run price and income elasticities in the dynamic models are a fraction of the static estimates &amp;mdash; roughly one-quarter for price and one-sixth for income &amp;mdash; because the lagged dependent variable now carries the cumulative effect.&lt;/p>
&lt;h2 id="10-effect-decomposition-summary">10. Effect Decomposition Summary&lt;/h2>
&lt;p>The figure below compares the direct, indirect, and total effects of price and income across three model-horizon combinations: the static SDM, and the short-run and long-run effects from the dynamic SDM.&lt;/p>
&lt;pre>&lt;code class="language-r"># See analysis.R for the full figure code
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_SDPDmod_fig3_impact_decomposition.png" alt="Effect decomposition comparing direct, indirect, and total effects for price and income across the static SDM and the dynamic SDM (short-run and long-run)">&lt;/p>
&lt;p>Four patterns stand out from this comparison. First, the &lt;strong>static SDM overstates the short-run response&lt;/strong> to price changes: its direct price effect (-1.01) is nearly four times larger than the dynamic short-run direct effect (-0.26). A policymaker using the static estimate to predict the immediate revenue impact of a cigarette tax increase would be far too optimistic about consumption reductions.&lt;/p>
&lt;p>Second, &lt;strong>spatial spillovers change sign between static and dynamic models&lt;/strong>. In the static SDM, the indirect price effect is negative (-0.22), meaning price increases reduce neighbors' consumption. In the dynamic SDM&amp;rsquo;s short run, it is &lt;em>positive&lt;/em> (0.18), consistent with cross-border shopping: when one state raises prices, its neighbors' sales increase as smokers cross the border. This sign reversal underscores the importance of properly specifying temporal dynamics.&lt;/p>
&lt;p>Third, &lt;strong>long-run effects are much larger but imprecisely estimated&lt;/strong>. The long-run direct price elasticity (-1.93) is the largest estimate in the analysis, reflecting decades of accumulated habit adjustments. However, the wide confidence intervals on long-run total effects mean that precise long-run predictions require caution.&lt;/p>
&lt;p>Fourth, &lt;strong>income effects are more robust&lt;/strong>. The direct income elasticity is positive and significant in all specifications (ranging from 0.10 in the short run to 0.80 in the long run), confirming that cigarettes behave as a normal good. The indirect income effects are less stable and generally not significant in the dynamic specification.&lt;/p>
&lt;h2 id="11-discussion">11. Discussion&lt;/h2>
&lt;p>This tutorial demonstrates three key findings about spatial dynamics in cigarette demand. First, &lt;strong>spatial dependence is real and economically meaningful&lt;/strong>, but its magnitude depends critically on the model specification. The Bayesian comparison (Section 5) unanimously rejects non-spatial models, and the total price elasticity in the static SDM (-1.23) is 22% larger than the direct effect alone (-1.01). A state that ignores spatial spillovers when evaluating a cigarette tax increase will underestimate both the consumption reduction in its own state and the cross-border effects on neighbors.&lt;/p>
&lt;p>Second, &lt;strong>habit persistence dominates the dynamic structure&lt;/strong>. The temporal lag coefficient ($\tau \approx 0.86$) is by far the largest and most precisely estimated parameter in every dynamic model. Once dynamics are included, the contemporaneous spatial coefficient weakens dramatically, and what appeared to be spatial dependence in the static model is revealed to be largely temporal persistence. This does not mean spatial effects are absent &amp;mdash; they remain significant at $\rho = 0.16$ in the dynamic SDM &amp;mdash; but they are much smaller than the static model suggests.&lt;/p>
&lt;p>Third, &lt;strong>the dynamic SDM uncovers a cross-border shopping effect&lt;/strong> that the static model misses. The positive and significant &lt;code>W*logp&lt;/code> coefficient (0.196) in the dynamic SDM means that when neighboring states raise prices, own-state consumption &lt;em>increases&lt;/em> in the short run. This is the signature of cross-border purchasing. The effect is masked in the static model because the spatial lag $\rho Wy$ absorbs it, and it only emerges when the temporal dynamics are properly specified.&lt;/p>
&lt;p>A fourth finding relates to &lt;strong>robustness to the weight matrix&lt;/strong>. Re-estimating the static SDM with a 2nd-order contiguity matrix (which expands the average number of neighbors from 4.1 to 10.6) yields a stronger spatial coefficient ($\rho = 0.449$ vs. 0.262) and a significant &lt;code>W*logp&lt;/code> coefficient (0.337, $p = 0.009$) that was not significant with the 1st-order matrix. This suggests that cross-border shopping effects may extend beyond immediately adjacent states, and that the choice of spatial weight matrix matters substantively for policy conclusions.&lt;/p>
&lt;p>From a software perspective, the SDPDmod package provides a streamlined R workflow that covers the complete spatial panel modeling pipeline &amp;mdash; from Bayesian model selection through estimation to impact decomposition &amp;mdash; in a coherent framework. The &lt;code>blmpSDPD()&lt;/code> function is particularly valuable for applied researchers, as it replaces the ad hoc sequence of Wald tests with a principled, simultaneous comparison of all candidate models.&lt;/p>
&lt;h2 id="12-summary-and-next-steps">12. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Spatial models matter for tobacco policy:&lt;/strong> the total price elasticity (-1.23 in the static SDM) is 22% larger than the direct effect alone, meaning unilateral state tax increases generate spillovers to neighboring states that standard panel models miss.&lt;/li>
&lt;li>&lt;strong>Bayesian model comparison provides principled model selection:&lt;/strong> the SDM is overwhelmingly preferred in static specifications (99.89% probability with individual FE), but adding dynamics reduces the ability to discriminate among spatial models, with all specifications receiving similar posterior probabilities.&lt;/li>
&lt;li>&lt;strong>Habit persistence is the dominant dynamic force:&lt;/strong> the temporal lag coefficient $\tau \approx 0.86$ dwarfs the contemporaneous spatial effect ($\rho = 0.16$), and static models conflate short-run and long-run responses. The short-run price elasticity (-0.26) is one-quarter of the static estimate (-1.01).&lt;/li>
&lt;li>&lt;strong>Cross-border shopping emerges in the dynamic SDM:&lt;/strong> the positive spatial lag of price (&lt;code>W*logp = 0.20&lt;/code>) means that neighboring states' price increases boost own consumption in the short run &amp;mdash; the clearest evidence of border-crossing behavior.&lt;/li>
&lt;/ul>
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_panel/">Stata spatial panel tutorial&lt;/a> that applies &lt;code>xsmle&lt;/code> to the same dataset, and the &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_cross_section/">Stata cross-sectional spatial tutorial&lt;/a> for a simpler introduction to spatial models without the temporal dimension. The SDPDmod package is documented in Simonovska (2025) and available on &lt;a href="https://cran.r-project.org/package=SDPDmod" target="_blank" rel="noopener">CRAN&lt;/a>.&lt;/p>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Build your own W.&lt;/strong> In Section 4.2 we constructed a 2nd-order contiguity matrix. Re-run &lt;code>blmpSDPD()&lt;/code> with this alternative &lt;code>W2&lt;/code> instead of the original &lt;code>W&lt;/code>. Does the Bayesian model comparison still favor the SDM? How do the model probabilities change when the definition of &amp;ldquo;neighbor&amp;rdquo; is broader?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Include pimin directly.&lt;/strong> Add &lt;code>lpm = log(pimin/cpi)&lt;/code> as an additional covariate in the SAR model: &lt;code>logc ~ logp + logy + lpm&lt;/code>. Compare the results to the SDM&amp;rsquo;s &lt;code>W*logp&lt;/code> coefficient. Does &lt;code>lpm&lt;/code> remain significant alongside the spatial lag of the dependent variable? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SAR vs. SDM indirect effects.&lt;/strong> Compare the impact decomposition from the static SAR (Section 7.3) and static SDM (Section 8.4). The indirect income effect &lt;em>reverses sign&lt;/em> (positive in SAR, negative in SDM). Write a paragraph explaining this reversal in terms of the cross-border shopping mechanism.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Subsample analysis.&lt;/strong> Split the data into two periods (1963&amp;ndash;1977 and 1978&amp;ndash;1992). Re-estimate the dynamic SDM for each period. Does the habit persistence coefficient ($\tau$) change over time? Has the spatial coefficient ($\rho$) strengthened or weakened as anti-smoking policies intensified?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1007/s10614-025-11056-2" target="_blank" rel="noopener">Simonovska, R. (2025). SDPDmod: An R Package for Spatial Dynamic Panel Data Modeling. &lt;em>Computational Economics&lt;/em>.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/0954-349X%2892%2990010-4" target="_blank" rel="noopener">Baltagi, B. H. &amp;amp; Levin, D. (1992). Cigarette Taxation: Raising Revenues and Reducing Consumption. &lt;em>Structural Change and Economic Dynamics&lt;/em>, 3(2), 321&amp;ndash;335.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2009.08.001" target="_blank" rel="noopener">Lee, L.-F. &amp;amp; Yu, J. (2010). Estimation of Spatial Autoregressive Panel Data Models with Fixed Effects. &lt;em>Journal of Econometrics&lt;/em>, 154(2), 165&amp;ndash;185.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.spasta.2014.02.002" target="_blank" rel="noopener">LeSage, J. P. (2014). Spatial Econometric Panel Data Model Specification: A Bayesian Approach. &lt;em>Spatial Statistics&lt;/em>, 9, 122&amp;ndash;145.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1007/978-3-642-40340-8" target="_blank" rel="noopener">Elhorst, J. P. (2014). &lt;em>Spatial Econometrics: From Cross-Sectional Data to Spatial Panels.&lt;/em> Springer.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1201/9781420064254" target="_blank" rel="noopener">LeSage, J. P. &amp;amp; Pace, R. K. (2009). &lt;em>Introduction to Spatial Econometrics.&lt;/em> Chapman &amp;amp; Hall/CRC.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Spatial Dynamic Panels with Common Factors in Stata: Credit Risk in US Banking</title><link>https://carlos-mendez.org/post/stata_spxtivdfreg/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_spxtivdfreg/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>The 2007&amp;ndash;2009 Global Financial Crisis revealed that credit risk does not stay contained within individual banks. Non-performing loans surged across the US banking system through two distinct channels &amp;mdash; &lt;strong>spatial spillovers&lt;/strong> from balance-sheet interdependencies among interconnected banks, and &lt;strong>common factors&lt;/strong> from macroeconomic shocks (interest rate changes, housing market collapses, unemployment spikes) that hit all banks simultaneously. Ignoring either channel leads to biased estimates of credit risk determinants and misleading policy prescriptions. Standard spatial panel packages in Stata &amp;mdash; such as &lt;code>xsmle&lt;/code> and &lt;code>spxtregress&lt;/code> &amp;mdash; can model spatial spillovers but cannot account for unobserved common factors, leaving a critical gap in the econometrician&amp;rsquo;s toolkit.&lt;/p>
&lt;p>The &lt;code>spxtivdfreg&lt;/code> package (Kripfganz &amp;amp; Sarafidis, 2025) fills this gap by implementing a &lt;strong>defactored instrumental variables&lt;/strong> estimator that simultaneously handles four sources of endogeneity: spatial lags of the dependent variable, temporal lags (dynamic persistence), endogenous regressors, and unobserved common factors. The estimator first removes common factors from the data using a principal-components-based defactoring procedure, then applies IV/GMM estimation to the defactored model. This approach avoids the incidental parameters bias that plagues maximum likelihood methods and does not require bias corrections like the Lee-Yu adjustment used in &lt;code>xsmle&lt;/code>.&lt;/p>
&lt;p>This tutorial replicates the empirical application from Kripfganz and Sarafidis (2025), which models non-performing loan ratios across 350 US commercial banks over the period 2006:Q1 to 2014:Q4 &amp;mdash; a sample that spans the entire GFC episode. We estimate the full spatial dynamic panel model with common factors, demonstrate what happens when common factors or the spatial lag are omitted, compute short-run and long-run spillover effects, and compare homogeneous and heterogeneous slope specifications.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Understand the four sources of endogeneity in spatial dynamic panel models: spatial lag, temporal lag, endogenous regressors, and common factors&lt;/li>
&lt;li>Estimate the full spatial dynamic panel model with common factors using &lt;code>spxtivdfreg&lt;/code>&lt;/li>
&lt;li>Compare estimation results with and without common factors to assess the consequences of ignoring latent macroeconomic shocks&lt;/li>
&lt;li>Compare estimation results with and without the spatial lag to evaluate the importance of bank interconnectedness&lt;/li>
&lt;li>Compute and interpret short-run and long-run direct, indirect, and total effects using &lt;code>estat impact&lt;/code>&lt;/li>
&lt;li>Estimate heterogeneous slope models with the mean-group (MG) estimator to assess cross-bank parameter heterogeneity&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-the-modeling-framework">2. The modeling framework&lt;/h2>
&lt;p>Credit risk in a banking system is shaped by forces operating at three different levels: the individual bank (its own financial ratios and management quality), the network of interconnected banks (spatial spillovers through lending relationships, common borrowers, and contagion), and the macroeconomy (interest rates, GDP growth, and other aggregate shocks that affect all banks). The spatial dynamic panel model with common factors captures all three levels in a single equation.&lt;/p>
&lt;p>The diagram below illustrates the four sources of endogeneity that the &lt;code>spxtivdfreg&lt;/code> estimator must address simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Y[&amp;quot;&amp;lt;b&amp;gt;NPL&amp;lt;sub&amp;gt;it&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Non-performing&amp;lt;br/&amp;gt;loan ratio&amp;quot;]
WY[&amp;quot;&amp;lt;b&amp;gt;W · NPL&amp;lt;sub&amp;gt;t&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Spatial lag&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Bank interdependence&amp;lt;/i&amp;gt;&amp;quot;]
LY[&amp;quot;&amp;lt;b&amp;gt;NPL&amp;lt;sub&amp;gt;i,t-1&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Temporal lag&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Risk persistence&amp;lt;/i&amp;gt;&amp;quot;]
X[&amp;quot;&amp;lt;b&amp;gt;INEFF&amp;lt;sub&amp;gt;it&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Endogenous&amp;lt;br/&amp;gt;regressor&amp;quot;]
F[&amp;quot;&amp;lt;b&amp;gt;f&amp;lt;sub&amp;gt;t&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Common factors&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Macro shocks&amp;lt;/i&amp;gt;&amp;quot;]
Z[&amp;quot;&amp;lt;b&amp;gt;Z&amp;lt;sub&amp;gt;it&amp;lt;/sub&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Instruments&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;INTEREST, lags&amp;lt;/i&amp;gt;&amp;quot;]
WY --&amp;gt;|&amp;quot;ψ&amp;quot;| Y
LY --&amp;gt;|&amp;quot;ρ&amp;quot;| Y
X --&amp;gt;|&amp;quot;β&amp;quot;| Y
F -.-&amp;gt;|&amp;quot;λ&amp;lt;sub&amp;gt;i&amp;lt;/sub&amp;gt;&amp;quot;| Y
Z -.-&amp;gt;|&amp;quot;IV&amp;quot;| X
style Y fill:#d97757,stroke:#141413,color:#fff
style WY fill:#6a9bcc,stroke:#141413,color:#fff
style LY fill:#6a9bcc,stroke:#141413,color:#fff
style X fill:#00d4c8,stroke:#141413,color:#141413
style F fill:#141413,stroke:#d97757,color:#fff
style Z fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The spatial lag ($W \cdot NPL$) creates endogeneity because bank $i$&amp;rsquo;s credit risk depends on bank $j$&amp;rsquo;s credit risk, and vice versa &amp;mdash; a simultaneity problem. The temporal lag ($NPL_{i,t-1}$) is endogenous because it correlates with the bank-specific fixed effect. The endogenous regressor (operational inefficiency, $INEFF$) is correlated with the error term. And the common factors ($f_t$) enter both the regressors and the error, inducing cross-sectional dependence and omitted variable bias.&lt;/p>
&lt;p>The model is specified as:&lt;/p>
&lt;p>$$NPL_{it} = \psi \sum_{j=1}^{N} w_{ij} \, NPL_{jt} + \rho \, NPL_{i,t-1} + x_{it} \beta + \alpha_i + \lambda_i' f_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, this equation says that the non-performing loan ratio of bank $i$ at time $t$ depends on: the &lt;strong>spatial lag&lt;/strong> $\psi W \cdot NPL$ (the weighted average NPL of interconnected banks), the &lt;strong>temporal lag&lt;/strong> $\rho \, NPL_{i,t-1}$ (the bank&amp;rsquo;s own past credit risk, capturing persistence), the &lt;strong>bank-specific covariates&lt;/strong> $x_{it} \beta$ (financial ratios like capital adequacy, profitability, and liquidity), the &lt;strong>individual fixed effect&lt;/strong> $\alpha_i$ (time-invariant bank characteristics), and the &lt;strong>interactive fixed effect&lt;/strong> $\lambda_i' f_t$ (unobserved common factors with heterogeneous loadings).&lt;/p>
&lt;h3 id="variable-mapping">Variable mapping&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Symbol&lt;/th>
&lt;th>Meaning&lt;/th>
&lt;th>Stata variable&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$NPL_{it}$&lt;/td>
&lt;td>Non-performing loans / total loans (%)&lt;/td>
&lt;td>&lt;code>NPL&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\psi$&lt;/td>
&lt;td>Spatial autoregressive parameter&lt;/td>
&lt;td>&lt;code>[W]NPL&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>Temporal autoregressive parameter&lt;/td>
&lt;td>&lt;code>L1.NPL&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$x_{it}$&lt;/td>
&lt;td>Bank-specific covariates&lt;/td>
&lt;td>&lt;code>INEFF&lt;/code>, &lt;code>CAR&lt;/code>, &lt;code>SIZE&lt;/code>, &amp;hellip;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\alpha_i$&lt;/td>
&lt;td>Bank fixed effect (absorbed)&lt;/td>
&lt;td>&lt;code>absorb(ID)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\lambda_i' f_t$&lt;/td>
&lt;td>Interactive fixed effect (defactored)&lt;/td>
&lt;td>estimated by &lt;code>spxtivdfreg&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$w_{ij}$&lt;/td>
&lt;td>Spatial weight (interconnection)&lt;/td>
&lt;td>&lt;code>W.csv&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="comparison-with-existing-stata-packages">Comparison with existing Stata packages&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>&lt;code>spxtivdfreg&lt;/code>&lt;/th>
&lt;th>&lt;code>xsmle&lt;/code>&lt;/th>
&lt;th>&lt;code>spxtregress&lt;/code>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Estimation method&lt;/td>
&lt;td>IV/GMM (defactored)&lt;/td>
&lt;td>Maximum likelihood&lt;/td>
&lt;td>Quasi-ML&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Common factors&lt;/td>
&lt;td>Yes (estimated)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Endogenous regressors&lt;/td>
&lt;td>Yes (IV)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Limited&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Dynamic (temporal lag)&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes (&lt;code>dlag&lt;/code>)&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Bias correction needed&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes (Lee-Yu)&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Heterogeneous slopes (MG)&lt;/td>
&lt;td>Yes (&lt;code>mg&lt;/code> option)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The key advantage of &lt;code>spxtivdfreg&lt;/code> is its ability to handle unobserved common factors &amp;mdash; latent macroeconomic shocks that affect all banks but with heterogeneous intensity. Maximum likelihood methods in &lt;code>xsmle&lt;/code> assume cross-sectional independence conditional on the spatial weight matrix, which is violated when common factors are present. The defactored IV approach removes these factors before estimation, producing consistent estimates even in the presence of strong cross-sectional dependence.&lt;/p>
&lt;hr>
&lt;h2 id="3-setup-and-data-loading">3. Setup and data loading&lt;/h2>
&lt;p>Before running any spatial dynamic panel models, we need three Stata packages: &lt;code>xtivdfreg&lt;/code> (the core estimation engine), &lt;code>reghdfe&lt;/code> (for absorbing fixed effects), and &lt;code>ftools&lt;/code> (a dependency of &lt;code>reghdfe&lt;/code>). The &lt;code>spxtivdfreg&lt;/code> command is the spatial panel wrapper around &lt;code>xtivdfreg&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install packages (if not already installed)
capture which xtivdfreg
if _rc {
ssc install xtivdfreg
}
capture which reghdfe
if _rc {
ssc install reghdfe
}
capture which ftools
if _rc {
ssc install ftools
}
&lt;/code>&lt;/pre>
&lt;h3 id="31-data-loading-and-panel-setup">3.1 Data loading and panel setup&lt;/h3>
&lt;p>The dataset contains quarterly financial ratios for 350 US commercial banks from 2006:Q1 to 2014:Q4, yielding 36 quarters and 12,600 total observations. After absorbing fixed effects and creating lags, the effective estimation sample is 12,250 observations (350 banks times 35 periods).&lt;/p>
&lt;pre>&lt;code class="language-stata">clear all
use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_spxtivdfreg/references/v113i06.dta&amp;quot;, clear
xtset ID TIME
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel variable: ID (strongly balanced)
Time variable: TIME, 1 to 36
Delta: 1 unit
&lt;/code>&lt;/pre>
&lt;p>The panel is strongly balanced &amp;mdash; all 350 banks are observed in all 36 quarters. The &lt;code>xtset&lt;/code> command declares &lt;code>ID&lt;/code> as the bank identifier and &lt;code>TIME&lt;/code> as the quarterly time index.&lt;/p>
&lt;p>The sample period is rich with major macro-financial events that all banks experienced &amp;mdash; precisely the kind of aggregate shocks that common factors are designed to capture:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;2006--2007&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-crisis&amp;lt;br/&amp;gt;Housing bubble&amp;lt;br/&amp;gt;Low NPL ratios&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;2007--2009&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Global Financial&amp;lt;br/&amp;gt;Crisis&amp;lt;br/&amp;gt;NPL surge&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;2010--2011&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Dodd-Frank Act&amp;lt;br/&amp;gt;Stress tests&amp;lt;br/&amp;gt;Capital rebuilding&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;2012--2014&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Recovery&amp;lt;br/&amp;gt;Basel III phase-in&amp;lt;br/&amp;gt;NPL normalization&amp;quot;]
A --&amp;gt; B
B --&amp;gt; C
C --&amp;gt; D
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#141413,stroke:#d97757,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>These regime shifts (housing bubble, financial crisis, regulatory tightening, recovery) are exactly the unobserved common factors that the &lt;code>spxtivdfreg&lt;/code> estimator extracts. Standard two-way fixed effects would capture them only if they affected all 350 banks equally &amp;mdash; but the interactive fixed effect structure $\lambda_i' f_t$ allows each bank to respond with different intensity to the same aggregate shock.&lt;/p>
&lt;h3 id="32-summary-statistics">3.2 Summary statistics&lt;/h3>
&lt;pre>&lt;code class="language-stata">summarize NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY INTEREST
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
NPL | 12,600 1.7283 2.1067 0 23.0378
INEFF | 12,600 .6425 .1726 .2007 2.9037
CAR | 12,600 13.5550 5.6198 1.3800 86.8400
SIZE | 12,600 14.6883 1.4234 11.9466 20.4618
BUFFER | 12,600 5.5550 5.2691 -6.6200 78.8400
PROFIT | 12,600 .8001 5.0380 -132.0700 40.9900
QUALITY | 12,600 .2827 .6245 -4.9482 27.8659
LIQUIDITY | 12,600 .7699 .2224 .0122 2.3217
INTEREST | 12,600 -1.9074 .9328 -5.1644 2.5187
&lt;/code>&lt;/pre>
&lt;p>Mean NPL is 1.73%, reflecting the mixture of pre-crisis, crisis, and post-crisis quarters in the sample. The standard deviation of 2.11 percentage points indicates substantial variation both across banks and over time &amp;mdash; some banks had NPL ratios as high as 23%. Mean LIQUIDITY (loan-to-deposit ratio) is 0.77, meaning the average bank lent out 77 cents for every dollar of deposits. The wide range of CAR (1.38% to 86.84%) reflects the heterogeneity in capital structures across US commercial banks.&lt;/p>
&lt;h3 id="33-variables">3.3 Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Mean&lt;/th>
&lt;th>Std. Dev.&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>NPL&lt;/code>&lt;/td>
&lt;td>Non-performing loans / total loans (%)&lt;/td>
&lt;td>1.728&lt;/td>
&lt;td>2.107&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>INEFF&lt;/code>&lt;/td>
&lt;td>Operational inefficiency (endogenous)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>CAR&lt;/code>&lt;/td>
&lt;td>Capital adequacy ratio&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>SIZE&lt;/code>&lt;/td>
&lt;td>ln(total assets)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>BUFFER&lt;/code>&lt;/td>
&lt;td>Capital buffer (leverage ratio minus 8%)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>PROFIT&lt;/code>&lt;/td>
&lt;td>Return on equity, annualized&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>QUALITY&lt;/code>&lt;/td>
&lt;td>Loan loss provisions / assets (%)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>LIQUIDITY&lt;/code>&lt;/td>
&lt;td>Loan-to-deposit ratio&lt;/td>
&lt;td>0.770&lt;/td>
&lt;td>0.222&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>INTEREST&lt;/code>&lt;/td>
&lt;td>Interest expenses / deposits (instrument for INEFF)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The dependent variable &lt;code>NPL&lt;/code> measures credit risk as the share of non-performing loans in total loans, expressed in percentage points; its mean and dispersion were discussed in Section 3.2. The variable &lt;code>INEFF&lt;/code> (operational inefficiency) is treated as &lt;strong>endogenous&lt;/strong> and instrumented using &lt;code>INTEREST&lt;/code> (interest expenses relative to deposits) along with lagged values of the exogenous regressors.&lt;/p>
&lt;h3 id="33-the-spatial-weight-matrix">3.3 The spatial weight matrix&lt;/h3>
&lt;p>The spatial weight matrix $W$ is a 350-by-350 matrix that defines the network structure among banks. Unlike geographic contiguity matrices used in regional analysis, this matrix is constructed from &lt;strong>economic distance&lt;/strong> &amp;mdash; specifically, Spearman&amp;rsquo;s rank correlation of bank debt-to-asset ratios. Two banks are defined as &amp;ldquo;neighbors&amp;rdquo; if their debt ratio correlation exceeds the 95th percentile of the empirical distribution.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Download the W matrix to the current working directory
copy &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_spxtivdfreg/references/W.csv&amp;quot; &amp;quot;W.csv&amp;quot;, replace
* The W matrix (350 x 350, row-standardized, 6,300 nonzero entries) is loaded
* automatically by spxtivdfreg via the spmatrix(&amp;quot;W.csv&amp;quot;, import) option
&lt;/code>&lt;/pre>
&lt;p>The matrix is row-standardized so that each row sums to one, meaning the spatial lag of a variable equals the &lt;strong>weighted average&lt;/strong> among a bank&amp;rsquo;s neighbors. With 6,300 nonzero entries across 350 banks, the average bank has approximately 18 neighbors &amp;mdash; banks whose debt structures are sufficiently correlated to suggest economic interdependence. To illustrate: suppose Bank A and Bank B have a Spearman rank correlation of 0.92 in their quarterly debt ratios, while the 95th percentile threshold is 0.87. Since 0.92 exceeds 0.87, Bank A and Bank B are classified as neighbors ($w_{AB} &amp;gt; 0$). After row-standardization, $w_{AB}$ equals $1/18$ if Bank A has 18 neighbors. This economic-distance approach captures financial contagion channels that geographic proximity alone would miss, since two banks on opposite coasts can be highly interconnected through similar lending portfolios.&lt;/p>
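&lt;p>To make the construction concrete, the sketch below builds an economic-distance weight matrix from synthetic debt-ratio series using the same recipe: Spearman rank correlations, a 95th-percentile threshold, and row-standardization. The data, the 20-bank dimension, and all variable names are illustrative stand-ins, not the actual 350-bank panel.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic quarterly debt-to-asset ratios for 20 banks over 35 quarters
# (illustrative stand-in for the 350-bank panel)
n_banks, n_quarters = 20, 35
debt_ratios = rng.random((n_banks, n_quarters))

# Spearman correlation = Pearson correlation of within-bank ranks
ranks = np.argsort(np.argsort(debt_ratios, axis=1), axis=1).astype(float)
rho = np.corrcoef(ranks)  # 20 x 20 rank-correlation matrix

# Two banks are neighbors if their correlation exceeds the 95th percentile
# of the off-diagonal empirical distribution
off_diag = rho[~np.eye(n_banks, dtype=bool)]
threshold = np.percentile(off_diag, 95)
W = ((rho > threshold) & ~np.eye(n_banks, dtype=bool)).astype(float)

# Row-standardize so each row sums to one (isolated banks keep a zero row)
row_sums = W.sum(axis=1, keepdims=True)
W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

print(W.shape, W.sum(axis=1).max())
```

With this normalization, the spatial lag $W y$ is the simple average of each bank's neighbors, exactly as in the tutorial's 350-by-350 matrix.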
&lt;hr>
&lt;h2 id="4-full-model-with-common-factors">4. Full model with common factors&lt;/h2>
&lt;p>We now estimate the full spatial dynamic panel model with unobserved common factors. The &lt;code>spxtivdfreg&lt;/code> command takes the dependent variable (&lt;code>NPL&lt;/code>) and the regressors, with options specifying the model structure: &lt;code>absorb(ID)&lt;/code> absorbs bank fixed effects, &lt;code>splag&lt;/code> includes the spatial lag of NPL, &lt;code>tlags(1)&lt;/code> adds the first temporal lag, &lt;code>spmatrix(&amp;quot;W.csv&amp;quot;, import)&lt;/code> loads the weight matrix, and &lt;code>iv(...)&lt;/code> specifies the instrumental variables. The &lt;code>std&lt;/code> option standardizes the variables before extracting principal components for the factor estimation, which improves numerical stability when covariates have very different scales.&lt;/p>
&lt;pre>&lt;code class="language-stata">spxtivdfreg NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) splag tlags(1) spmatrix(&amp;quot;W.csv&amp;quot;, import) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, splags lag(1)) std
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Defactored instrumental variables estimation
Group variable: ID Number of obs = 12,250
Time variable: TIME Number of groups = 350
Number of instruments = 28 Obs per group:
Number of factors in X = 2 min = 35
Number of factors in u = 1 avg = 35.0
max = 35
Second-stage estimator (model with homogeneous slope coefficients)
--------------------------------------------------------------------------
Robust
NPL | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
------+-------------------------------------------------------------------
NPL |
L1. | .2898521 .0543794 5.33 0.000 .1832704 .3964339
|
INEFF | .4473777 .1045636 4.28 0.000 .2424368 .6523186
CAR | .0305078 .0057852 5.27 0.000 .019169 .0418465
SIZE | .2225966 .0941614 2.36 0.018 .0380436 .4071496
BUFFER| -.0545049 .0118678 -4.59 0.000 -.0777653 -.0312445
PROFIT| -.0053351 .0018411 -2.90 0.004 -.0089437 -.0017266
QUALITY| .1830412 .0307657 5.95 0.000 .1227415 .2433408
LIQUIDITY| 2.452391 .2696471 9.09 0.000 1.923892 2.980889
_cons | -4.510715 1.311453 -3.44 0.001 -7.081115 -1.940315
------+-------------------------------------------------------------------
W |
NPL | .3943206 .0848856 4.65 0.000 .2279479 .5606932
------+-------------------------------------------------------------------
sigma_f | .64162366 (std. dev. of factor error component)
sigma_e | .90381799 (std. dev. of idiosyncratic error component)
rho | .33509009 (fraction of variance due to factors)
--------------------------------------------------------------------------
Hansen test: chi2(19) = 18.8250, Prob &amp;gt; chi2 = 0.4681
&lt;/code>&lt;/pre>
&lt;p>The estimator identifies &lt;strong>2 common factors in the regressors&lt;/strong> and &lt;strong>1 common factor in the error term&lt;/strong>, capturing latent macroeconomic forces that drive credit risk across the banking system. These factors represent unobserved aggregate shocks &amp;mdash; such as Federal Reserve interest rate decisions, housing market fluctuations, and changes in regulatory stringency &amp;mdash; that affect all banks simultaneously but with bank-specific intensities (heterogeneous factor loadings $\lambda_i$).&lt;/p>
&lt;p>The &lt;strong>spatial autoregressive parameter&lt;/strong> $\psi = 0.394$ (z = 4.65, p &amp;lt; 0.001) indicates strong positive spatial spillovers: when the average NPL ratio of a bank&amp;rsquo;s neighbors increases by 1 percentage point, the bank&amp;rsquo;s own NPL ratio increases by 0.39 percentage points, holding all else constant. This captures financial contagion through interconnected lending networks &amp;mdash; when one bank&amp;rsquo;s borrowers default, it can trigger a cascade of defaults among economically linked banks.&lt;/p>
&lt;p>The &lt;strong>temporal persistence parameter&lt;/strong> $\rho = 0.290$ (z = 5.33, p &amp;lt; 0.001) shows that credit risk is moderately persistent: about 29% of a bank&amp;rsquo;s current NPL ratio is inherited from the previous quarter. This reflects the gradual resolution of non-performing loans through workout processes, foreclosures, and write-offs.&lt;/p>
&lt;p>Among the covariates, &lt;strong>LIQUIDITY&lt;/strong> has the largest effect at 2.452 (z = 9.09, p &amp;lt; 0.001), meaning that a 1 percentage point increase in the loan-to-deposit ratio is associated with a 2.45 percentage point increase in non-performing loans. Banks that extend more credit relative to their deposit base face higher credit risk. &lt;strong>INEFF&lt;/strong> (operational inefficiency) enters with a coefficient of 0.447 (z = 4.28, p &amp;lt; 0.001), confirming that poorly managed banks experience higher default rates &amp;mdash; a finding consistent with the &amp;ldquo;bad management&amp;rdquo; hypothesis in the banking literature. &lt;strong>BUFFER&lt;/strong> enters negatively at -0.055 (z = -4.59, p &amp;lt; 0.001), indicating that better-capitalized banks (those with larger capital buffers above the 8% regulatory minimum) have lower credit risk.&lt;/p>
&lt;p>The &lt;strong>variance decomposition&lt;/strong> at the bottom of the output reveals that common factors explain a substantial share of the error variance: $\sigma_f = 0.642$ and $\sigma_e = 0.904$, yielding $\rho_{factor} = 0.335$. This means that &lt;strong>33.5% of the residual variance&lt;/strong> is attributable to unobserved common factors &amp;mdash; macroeconomic shocks that a model without factors would absorb into biased coefficient estimates.&lt;/p>
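&lt;p>As a quick arithmetic check, the reported variance share can be reproduced directly from the two standard deviations in the output (a back-of-the-envelope sketch, not part of the &lt;code>spxtivdfreg&lt;/code> code):&lt;/p>

```python
# Reproduce rho = sigma_f^2 / (sigma_f^2 + sigma_e^2) from the output above
sigma_f = 0.64162366   # std. dev. of factor error component
sigma_e = 0.90381799   # std. dev. of idiosyncratic error component

rho_factor = sigma_f**2 / (sigma_f**2 + sigma_e**2)
print(round(rho_factor, 5))  # ≈ 0.33509, matching the reported rho
```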
&lt;p>The &lt;strong>Hansen J-test&lt;/strong> for overidentifying restrictions yields chi2(19) = 18.825 with p = 0.468, which &lt;strong>does not reject&lt;/strong> the null hypothesis that the instruments are valid. This provides confidence that the IV strategy &amp;mdash; using &lt;code>INTEREST&lt;/code> and lagged values of exogenous regressors as instruments &amp;mdash; is appropriate.&lt;/p>
&lt;hr>
&lt;h2 id="5-what-happens-without-common-factors">5. What happens without common factors?&lt;/h2>
&lt;p>To assess the consequences of ignoring latent macroeconomic shocks, we re-estimate the model with the &lt;code>factmax(0)&lt;/code> option, which forces the estimator to set the number of common factors to zero. This specification is equivalent to a standard spatial dynamic panel model without interactive fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-stata">spxtivdfreg NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) splag tlags(1) spmatrix(&amp;quot;W.csv&amp;quot;, import) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, splags lag(1)) std factmax(0)
&lt;/code>&lt;/pre>
&lt;p>The table below compares the coefficient estimates from the full model (with factors) and the restricted model (without factors).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">With factors&lt;/th>
&lt;th style="text-align:center">Without factors&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\psi$ (W*NPL)&lt;/td>
&lt;td style="text-align:center">0.394*** (0.085)&lt;/td>
&lt;td style="text-align:center">0.288*** (0.038)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$ (L1.NPL)&lt;/td>
&lt;td style="text-align:center">0.290*** (0.054)&lt;/td>
&lt;td style="text-align:center">0.594*** (0.034)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">0.447*** (0.105)&lt;/td>
&lt;td style="text-align:center">0.366*** (0.107)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.031*** (0.006)&lt;/td>
&lt;td style="text-align:center">0.017*** (0.004)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.223** (0.094)&lt;/td>
&lt;td style="text-align:center">0.089 (0.061)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.055*** (0.012)&lt;/td>
&lt;td style="text-align:center">-0.025** (0.010)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.005*** (0.002)&lt;/td>
&lt;td style="text-align:center">-0.006*** (0.002)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.183*** (0.031)&lt;/td>
&lt;td style="text-align:center">0.283*** (0.029)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">2.452*** (0.270)&lt;/td>
&lt;td style="text-align:center">0.843*** (0.180)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Factors ($r_x$, $r_u$)&lt;/td>
&lt;td style="text-align:center">2, 1&lt;/td>
&lt;td style="text-align:center">0, 0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>J-test&lt;/td>
&lt;td style="text-align:center">18.825 [0.468]&lt;/td>
&lt;td style="text-align:center">48.151 [0.000]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The differences are striking and systematic. Without common factors, the &lt;strong>temporal persistence doubles&lt;/strong> from $\rho = 0.290$ to $\rho = 0.594$. This inflation occurs because unobserved common factors are serially correlated (macroeconomic conditions evolve gradually), and when they are excluded from the model, the temporal lag absorbs their persistence. In other words, the model without factors confuses macroeconomic persistence with bank-level credit risk persistence.&lt;/p>
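&lt;p>The mechanism can be illustrated with a small Monte Carlo sketch (all parameters are invented for illustration, not taken from the NPL data): a panel with true persistence 0.3 and a serially correlated common factor, estimated by pooled OLS that ignores the factor, yields a substantially inflated autoregressive coefficient.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho_true, phi = 200, 60, 0.3, 0.8  # units, periods, true AR, factor AR

# Serially correlated common factor and heterogeneous loadings
f = np.zeros(T)
for t in range(1, T):
    f[t] = phi * f[t - 1] + rng.standard_normal()
lam = rng.uniform(0.5, 1.5, size=N)

# y_it = rho * y_i,t-1 + lambda_i * f_t + e_it
y = np.zeros((N, T))
for t in range(1, T):
    y[:, t] = rho_true * y[:, t - 1] + lam * f[t] + rng.standard_normal(N)

# Pooled OLS of y_t on y_t-1, omitting the factor: because y_t-1 also
# contains lambda_i * f_t-1 and f is persistent, the lag coefficient
# absorbs the factor's serial correlation and is biased upward
y_lag = y[:, :-1].ravel()
y_cur = y[:, 1:].ravel()
rho_hat = (y_lag @ y_cur) / (y_lag @ y_lag)
print(round(rho_hat, 2))  # well above the true value of 0.3
```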
&lt;p>The &lt;strong>spatial autoregressive parameter drops&lt;/strong> from $\psi = 0.394$ to $\psi = 0.288$ &amp;mdash; a 27% decrease. This is counterintuitive at first glance: one might expect omitting factors to inflate the spatial parameter (since common factors create cross-sectional dependence that could be mistaken for spatial spillovers). However, the inflated temporal lag in the no-factor model absorbs some of the spatial dynamics, compressing $\psi$ downward. The lesson is that omitting common factors distorts &lt;strong>all&lt;/strong> coefficient estimates in complex and non-obvious ways.&lt;/p>
&lt;p>The &lt;strong>LIQUIDITY coefficient collapses&lt;/strong> from 2.452 to 0.843 &amp;mdash; a 66% reduction. This suggests that much of the effect of liquidity on credit risk operates through common factors: during the GFC, aggregate liquidity conditions deteriorated system-wide, and banks with high loan-to-deposit ratios were disproportionately affected. Without factors to absorb these aggregate movements, the LIQUIDITY coefficient is biased downward.&lt;/p>
&lt;p>Most critically, the &lt;strong>Hansen J-test rejects&lt;/strong> in the no-factor model: chi2 = 48.151 with p &amp;lt; 0.001. This rejection means that the instruments are not valid under the no-factor specification &amp;mdash; the model is misspecified. The common factors that enter both the regressors and the error term invalidate the exclusion restriction when they are not accounted for. This provides a formal statistical justification for including common factors: the J-test passes (p = 0.468) with factors and fails (p &amp;lt; 0.001) without them.&lt;/p>
&lt;p>&lt;strong>SIZE&lt;/strong> becomes statistically insignificant without factors (coefficient = 0.089, standard error = 0.061), whereas it is significant at the 5% level in the full model (0.223, standard error = 0.094). This reversal illustrates how omitting common factors can mask genuine relationships: larger banks are more exposed to systematic macro shocks (they have larger factor loadings), and without factors in the model, this exposure is incorrectly attributed to noise rather than to bank size.&lt;/p>
&lt;hr>
&lt;h2 id="6-what-happens-without-the-spatial-lag">6. What happens without the spatial lag?&lt;/h2>
&lt;p>To isolate the contribution of spatial spillovers, we now estimate a model that includes common factors but removes the spatially lagged dependent variable. This is done by dropping the &lt;code>splag&lt;/code> option. Without the spatial lag, the model reduces to a dynamic panel with common factors &amp;mdash; equivalent to the &lt;code>xtivdfreg&lt;/code> command.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Without spatial lag (spxtivdfreg without splag option)
spxtivdfreg NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) tlags(1) spmatrix(&amp;quot;W.csv&amp;quot;, import) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, lag(1)) std
* Equivalent specification with xtivdfreg
xtivdfreg NPL L.NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, lag(1)) std
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">Full model&lt;/th>
&lt;th style="text-align:center">Without spatial lag&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\psi$ (W*NPL)&lt;/td>
&lt;td style="text-align:center">0.394*** (0.085)&lt;/td>
&lt;td style="text-align:center">&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$ (L1.NPL)&lt;/td>
&lt;td style="text-align:center">0.290*** (0.054)&lt;/td>
&lt;td style="text-align:center">0.323*** (0.055)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">0.447*** (0.105)&lt;/td>
&lt;td style="text-align:center">0.638*** (0.116)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.031*** (0.006)&lt;/td>
&lt;td style="text-align:center">0.030*** (0.006)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.223** (0.094)&lt;/td>
&lt;td style="text-align:center">0.346*** (0.096)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.055*** (0.012)&lt;/td>
&lt;td style="text-align:center">-0.045*** (0.016)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.005*** (0.002)&lt;/td>
&lt;td style="text-align:center">-0.004** (0.002)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.183*** (0.031)&lt;/td>
&lt;td style="text-align:center">0.183*** (0.036)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">2.452*** (0.270)&lt;/td>
&lt;td style="text-align:center">2.534*** (0.311)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Factors ($r_x$, $r_u$)&lt;/td>
&lt;td style="text-align:center">2, 1&lt;/td>
&lt;td style="text-align:center">2, 1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>J-test&lt;/td>
&lt;td style="text-align:center">18.825 [0.468]&lt;/td>
&lt;td style="text-align:center">8.174 [0.226]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>When the spatial lag is removed, the &lt;strong>temporal persistence increases&lt;/strong> from $\rho = 0.290$ to $\rho = 0.323$ &amp;mdash; the temporal lag partially absorbs the missing spatial dynamics. The &lt;strong>INEFF coefficient inflates&lt;/strong> from 0.447 to 0.638 (a 43% increase), and &lt;strong>SIZE&lt;/strong> rises from 0.223 to 0.346 (a 55% increase). Without the spatial lag to capture bank interdependence, these covariates must do more work to explain the cross-sectional variation in credit risk, leading to upward bias.&lt;/p>
&lt;p>Importantly, both specifications pass the J-test (p = 0.468 and p = 0.226, respectively), meaning that both models have valid instruments. The choice between them must therefore be based on economic reasoning rather than diagnostic tests alone. The full model with the spatial lag is preferred because financial theory predicts bank interdependence, and the spatial autoregressive parameter $\psi = 0.394$ is highly significant (z = 4.65, p &amp;lt; 0.001).&lt;/p>
&lt;hr>
&lt;h2 id="7-short-run-and-long-run-effects">7. Short-run and long-run effects&lt;/h2>
&lt;p>In spatial dynamic panel models, the coefficient on a variable does not directly measure its total effect on the dependent variable. Because of the spatial lag ($\psi W \cdot NPL$) and the temporal lag ($\rho \, NPL_{i,t-1}$), a shock to any covariate propagates through the system both across banks (through the spatial multiplier) and over time (through dynamic accumulation). The &lt;code>estat impact&lt;/code> command decomposes these effects into &lt;strong>direct effects&lt;/strong> (the impact of a bank&amp;rsquo;s own covariate on its own NPL), &lt;strong>indirect effects&lt;/strong> (the impact transmitted through the network of interconnected banks), and &lt;strong>total effects&lt;/strong> (direct plus indirect).&lt;/p>
&lt;p>The long-run effects account for the full dynamic accumulation of a permanent change in a covariate. Setting $NPL_t = NPL_{t-1}$ in the model and solving for the steady state gives $NPL = ((1-\rho)I - \psi W)^{-1}\beta X$; because $W$ is row-standardized, the total (direct plus indirect) long-run effect of a covariate reduces to:&lt;/p>
&lt;p>$$\text{Total LR effect} = \frac{\beta}{1 - \rho - \psi}$$&lt;/p>
&lt;p>In words, this equation says that a permanent 1-unit increase in a covariate has a total long-run effect equal to its short-run coefficient $\beta$ amplified by the combined temporal-spatial multiplier $1/(1-\rho-\psi)$: the effect compounds over time as it feeds back through lagged NPL, and it spreads across the bank network through the spatial lag. For LIQUIDITY, $2.452/(1 - 0.290 - 0.394) \approx 7.76$, which matches the total long-run effect of 7.765 reported in Section 7.2. The diagram below illustrates this decomposition.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
B[&amp;quot;&amp;lt;b&amp;gt;Short-run&amp;lt;br/&amp;gt;coefficient&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;β = 2.452&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;(LIQUIDITY)&amp;lt;/i&amp;gt;&amp;quot;]
T[&amp;quot;&amp;lt;b&amp;gt;Temporal&amp;lt;br/&amp;gt;multiplier&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;1/(1−ρ)&amp;lt;br/&amp;gt;= 1/(1−0.290)&amp;lt;br/&amp;gt;= 1.408&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Direct&amp;lt;br/&amp;gt;effect&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;3.547&amp;quot;]
S[&amp;quot;&amp;lt;b&amp;gt;Spatial&amp;lt;br/&amp;gt;multiplier&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;1/(1−ψ)&amp;lt;br/&amp;gt;= 1/(1−0.394)&amp;lt;br/&amp;gt;= 1.650&amp;quot;]
I[&amp;quot;&amp;lt;b&amp;gt;Indirect&amp;lt;br/&amp;gt;effect&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;4.218&amp;quot;]
Tot[&amp;quot;&amp;lt;b&amp;gt;Total&amp;lt;br/&amp;gt;effect&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;7.765&amp;quot;]
B --&amp;gt;|&amp;quot;× temporal&amp;quot;| T
T --&amp;gt;|&amp;quot;= direct&amp;quot;| D
D --&amp;gt;|&amp;quot;× spatial&amp;quot;| S
S --&amp;gt;|&amp;quot;= indirect&amp;quot;| I
D --&amp;gt; Tot
I --&amp;gt; Tot
style B fill:#6a9bcc,stroke:#141413,color:#fff
style T fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
style S fill:#d97757,stroke:#141413,color:#fff
style I fill:#141413,stroke:#d97757,color:#fff
style Tot fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
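&lt;p>As a numeric check, solving the model's steady state with a row-standardized $W$ implies a total long-run multiplier of $1/(1-\rho-\psi)$. The sketch below plugs in the point estimates from Section 4 and reproduces the &lt;code>estat impact&lt;/code> totals (a back-of-the-envelope calculation, not a replication of the delta-method standard errors).&lt;/p>

```python
# Point estimates from the full model with common factors (Section 4)
rho = 0.2898521    # temporal lag, L1.NPL
psi = 0.3943206    # spatial lag, W*NPL
beta = 2.452391    # LIQUIDITY coefficient

# Steady state NPL = ((1-rho)I - psi*W)^(-1) beta X implies, for a
# row-standardized W, a total long-run effect of beta / (1 - rho - psi)
total_lr = beta / (1 - rho - psi)
print(round(total_lr, 3))  # ≈ 7.765, the reported total LR effect

# Short-run total: beta scaled by the spatial multiplier 1/(1 - psi)
total_sr = beta / (1 - psi)
print(round(total_sr, 3))  # ≈ 4.05, close to the reported 4.090
```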
&lt;pre>&lt;code class="language-stata">* Short-run effects (full model with factors)
estat impact, sr
&lt;/code>&lt;/pre>
&lt;h3 id="71-short-run-effects">7.1 Short-run effects&lt;/h3>
&lt;p>The short-run effects capture the immediate one-period impact of a covariate change, including the contemporaneous spatial spillover but not the dynamic accumulation over time.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">SR Direct&lt;/th>
&lt;th style="text-align:center">SR Indirect&lt;/th>
&lt;th style="text-align:center">SR Total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">0.457&lt;/td>
&lt;td style="text-align:center">0.289&lt;/td>
&lt;td style="text-align:center">0.746&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.031&lt;/td>
&lt;td style="text-align:center">0.020&lt;/td>
&lt;td style="text-align:center">0.051&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.227&lt;/td>
&lt;td style="text-align:center">0.144&lt;/td>
&lt;td style="text-align:center">0.371&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.056&lt;/td>
&lt;td style="text-align:center">-0.035&lt;/td>
&lt;td style="text-align:center">-0.091&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.005&lt;/td>
&lt;td style="text-align:center">-0.003&lt;/td>
&lt;td style="text-align:center">-0.009&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.187&lt;/td>
&lt;td style="text-align:center">0.118&lt;/td>
&lt;td style="text-align:center">0.305&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">2.505&lt;/td>
&lt;td style="text-align:center">1.585&lt;/td>
&lt;td style="text-align:center">4.090&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>In the short run, indirect effects are roughly 63% of direct effects &amp;mdash; the spatial multiplier $(I - \psi W)^{-1}$ amplifies every shock by about 1.63x. For LIQUIDITY, the short-run total is 4.09 &amp;mdash; already substantially larger than the regression coefficient (2.452) due to spatial amplification alone.&lt;/p>
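&lt;p>The spatial amplification noted above can be verified with a small matrix example: for any row-standardized $W$, the rows of $(I - \psi W)^{-1}$ sum to exactly $1/(1-\psi) \approx 1.65$, consistent with the roughly 1.63x ratio of total to direct effects in the table. The 4-bank network below is purely illustrative.&lt;/p>

```python
import numpy as np

psi = 0.3943206  # spatial autoregressive parameter from Section 4

# A small row-standardized weight matrix (4 illustrative banks)
W = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
])

# Spatial multiplier matrix (I - psi W)^(-1)
M = np.linalg.inv(np.eye(4) - psi * W)

# Each row sums to 1/(1 - psi): a shock common to all banks is
# amplified by the same factor regardless of the network's topology
print(M.sum(axis=1))   # all ≈ 1.651
print(1 / (1 - psi))   # ≈ 1.651
```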
&lt;pre>&lt;code class="language-stata">* Long-run effects (full model with factors)
estat impact, lr
&lt;/code>&lt;/pre>
&lt;h3 id="72-long-run-effects-with-common-factors">7.2 Long-run effects with common factors&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">Direct&lt;/th>
&lt;th style="text-align:center">Indirect&lt;/th>
&lt;th style="text-align:center">Total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">0.647*** (0.159)&lt;/td>
&lt;td style="text-align:center">0.769** (0.335)&lt;/td>
&lt;td style="text-align:center">1.417*** (0.427)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.044*** (0.009)&lt;/td>
&lt;td style="text-align:center">0.052** (0.024)&lt;/td>
&lt;td style="text-align:center">0.097*** (0.029)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.322** (0.142)&lt;/td>
&lt;td style="text-align:center">0.383* (0.198)&lt;/td>
&lt;td style="text-align:center">0.705** (0.310)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.079*** (0.018)&lt;/td>
&lt;td style="text-align:center">-0.094** (0.043)&lt;/td>
&lt;td style="text-align:center">-0.173*** (0.054)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.008*** (0.002)&lt;/td>
&lt;td style="text-align:center">-0.009** (0.005)&lt;/td>
&lt;td style="text-align:center">-0.017*** (0.006)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.265*** (0.047)&lt;/td>
&lt;td style="text-align:center">0.315** (0.141)&lt;/td>
&lt;td style="text-align:center">0.580*** (0.167)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">3.547*** (0.445)&lt;/td>
&lt;td style="text-align:center">4.218** (1.742)&lt;/td>
&lt;td style="text-align:center">7.765*** (1.904)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The long-run effects reveal that &lt;strong>indirect (spillover) effects are comparable to or larger than direct effects&lt;/strong> for every variable. For LIQUIDITY, the direct long-run effect is 3.547 and the indirect effect is 4.218, yielding a total of 7.765 &amp;mdash; meaning that a permanent 1 percentage point increase in the loan-to-deposit ratio across all banks would increase the system-wide NPL ratio by nearly 7.8 percentage points in the long run. The indirect effect exceeds the direct effect because the spatial multiplier amplifies shocks across a network in which the average bank has 18 neighbors.&lt;/p>
&lt;p>For INEFF (operational inefficiency), the total long-run effect is 1.417 &amp;mdash; more than three times the short-run coefficient of 0.447. A permanent deterioration in management quality cascades through the banking network as inefficient banks generate non-performing loans that spread to their interconnected counterparts through shared borrowers and counterparty risk.&lt;/p>
&lt;p>The BUFFER variable has a total long-run effect of -0.173, meaning that a 1 percentage point increase in capital buffers above the 8% regulatory minimum reduces system-wide NPL by 0.173 percentage points in the long run. Both the direct channel (-0.079, well-capitalized banks absorb losses better) and the indirect channel (-0.094, their stability reduces contagion to neighbors) contribute to this protective effect.&lt;/p>
&lt;h3 id="73-long-run-effects-without-common-factors">7.3 Long-run effects without common factors&lt;/h3>
&lt;p>To see how omitting common factors distorts spillover estimates, we compare the long-run effects from the full model (with factors) to those from the &lt;code>factmax(0)&lt;/code> specification.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Long-run effects (model without factors)
spxtivdfreg NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) splag tlags(1) spmatrix(&amp;quot;W.csv&amp;quot;, import) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, splags lag(1)) std factmax(0)
estat impact, lr
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">With factors (Total)&lt;/th>
&lt;th style="text-align:center">Without factors (Total)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">1.417***&lt;/td>
&lt;td style="text-align:center">3.117**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.097***&lt;/td>
&lt;td style="text-align:center">0.145**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.705**&lt;/td>
&lt;td style="text-align:center">0.756 (n.s.)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.173***&lt;/td>
&lt;td style="text-align:center">-0.212*&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.017***&lt;/td>
&lt;td style="text-align:center">-0.053***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.580***&lt;/td>
&lt;td style="text-align:center">2.407***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">7.765***&lt;/td>
&lt;td style="text-align:center">7.176**&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The comparison reveals &lt;strong>severe distortion&lt;/strong> in the no-factor model&amp;rsquo;s long-run effects. The total effect of QUALITY more than quadruples from 0.580 to 2.407, and INEFF more than doubles from 1.417 to 3.117. These inflated estimates arise because the no-factor model attributes macroeconomic variation to the covariates: when aggregate loan quality deteriorates during a recession, the no-factor model incorrectly assigns this entire movement to the bank-level QUALITY and INEFF variables rather than recognizing the common factor (the recession itself).&lt;/p>
&lt;p>Conversely, SIZE loses statistical significance in the no-factor model (total effect = 0.756, not significant), even though it is significant in the full model (0.705, p &amp;lt; 0.05). The common factors capture macro-financial conditions that disproportionately affect larger banks, and without these factors, the SIZE effect is masked by omitted variable bias.&lt;/p>
&lt;hr>
&lt;h2 id="8-heterogeneous-slopes-the-mean-group-estimator">8. Heterogeneous slopes: the mean-group estimator&lt;/h2>
&lt;p>The models estimated so far assume that all banks share the same slope coefficients &amp;mdash; that is, the effect of LIQUIDITY on NPL is identical for all 350 banks. This is a strong assumption. Banks differ in their business models, geographic markets, and risk management practices, and these differences may translate into heterogeneous responses to the same financial ratios. The &lt;code>mg&lt;/code> (mean-group) option in &lt;code>spxtivdfreg&lt;/code> relaxes this assumption by estimating bank-specific slopes and reporting their cross-sectional average.&lt;/p>
&lt;pre>&lt;code class="language-stata">spxtivdfreg NPL INEFF CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, ///
absorb(ID) splag tlags(1) spmatrix(&amp;quot;W.csv&amp;quot;, import) ///
iv(INTEREST CAR SIZE BUFFER PROFIT QUALITY LIQUIDITY, splags lag(1)) std mg
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">Homogeneous (pooled)&lt;/th>
&lt;th style="text-align:center">Heterogeneous (MG)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\psi$ (W*NPL)&lt;/td>
&lt;td style="text-align:center">0.394*** (0.085)&lt;/td>
&lt;td style="text-align:center">0.032 (0.051)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$ (L1.NPL)&lt;/td>
&lt;td style="text-align:center">0.290*** (0.054)&lt;/td>
&lt;td style="text-align:center">0.301*** (0.015)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>INEFF&lt;/td>
&lt;td style="text-align:center">0.447*** (0.105)&lt;/td>
&lt;td style="text-align:center">0.759*** (0.158)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CAR&lt;/td>
&lt;td style="text-align:center">0.031*** (0.006)&lt;/td>
&lt;td style="text-align:center">0.218*** (0.026)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SIZE&lt;/td>
&lt;td style="text-align:center">0.223** (0.094)&lt;/td>
&lt;td style="text-align:center">2.004*** (0.339)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BUFFER&lt;/td>
&lt;td style="text-align:center">-0.055*** (0.012)&lt;/td>
&lt;td style="text-align:center">-0.376*** (0.042)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PROFIT&lt;/td>
&lt;td style="text-align:center">-0.005*** (0.002)&lt;/td>
&lt;td style="text-align:center">-0.018*** (0.006)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>QUALITY&lt;/td>
&lt;td style="text-align:center">0.183*** (0.031)&lt;/td>
&lt;td style="text-align:center">0.287** (0.139)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">2.452*** (0.270)&lt;/td>
&lt;td style="text-align:center">6.330*** (0.506)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>_cons&lt;/td>
&lt;td style="text-align:center">-4.511*** (1.311)&lt;/td>
&lt;td style="text-align:center">-29.013*** (4.167)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The most striking result is that the &lt;strong>spatial autoregressive parameter becomes insignificant&lt;/strong> under the MG estimator: $\psi = 0.032$ (z = 0.62, p = 0.536). This suggests that the strong spatial spillovers found in the pooled model ($\psi = 0.394$) may partly reflect slope heterogeneity rather than genuine bank-to-bank contagion. When each bank is allowed its own coefficient on LIQUIDITY, SIZE, and other variables, the average spatial lag effect shrinks to near zero. This is a common finding in spatial econometrics: imposing homogeneous slopes in the presence of slope heterogeneity can create spurious spatial dependence.&lt;/p>
&lt;p>The &lt;strong>covariate coefficients increase substantially&lt;/strong> under the MG estimator. SIZE jumps from 0.223 to 2.004 (a nine-fold increase), BUFFER from -0.055 to -0.376 (a seven-fold increase), and CAR from 0.031 to 0.218 (a seven-fold increase). These larger MG coefficients suggest that the pooled model&amp;rsquo;s homogeneity restriction attenuates individual bank-level effects toward zero. The MG standard errors are generally smaller than the pooled standard errors for the temporal lag ($\rho$: 0.015 vs. 0.054) but larger for some covariates, reflecting the averaging of heterogeneous bank-specific estimates.&lt;/p>
&lt;p>The &lt;strong>temporal persistence&lt;/strong> remains stable: $\rho = 0.301$ (MG) versus $\rho = 0.290$ (pooled). This robustness suggests that credit risk persistence is a genuine phenomenon shared across all banks, not an artifact of slope heterogeneity. Whether a bank is large or small, well-managed or poorly managed, about 30% of its current NPL ratio is inherited from the previous quarter.&lt;/p>
&lt;p>The MG estimator is only $\sqrt{N}$-consistent (versus $\sqrt{NT}$-consistent for the pooled estimator), making it inherently less efficient and more susceptible to outliers. With 350 banks and 35 time periods, a handful of banks with extreme coefficient estimates can shift the MG average substantially. To investigate, individual bank-specific estimates can be inspected using the &lt;code>mg(101)&lt;/code> option (which displays estimates for the bank with ID 101) or extracted from the &lt;code>e(b_mg)&lt;/code> and &lt;code>e(se_mg)&lt;/code> matrices for further analysis &amp;mdash; for example, to compute trimmed or median estimates that are robust to outlier influence. However, further exploration of individual heterogeneity is beyond the scope of this tutorial.&lt;/p>
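&lt;p>To illustrate the last point, once the bank-specific estimates are extracted from &lt;code>e(b_mg)&lt;/code>, robust summaries can be computed in any language. The Python sketch below uses &lt;em>hypothetical&lt;/em> numbers (not the tutorial&amp;rsquo;s data) to show how a median or trimmed mean down-weights outlier banks:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch with hypothetical bank-specific estimates of the
# spatial-lag parameter; two outlier banks distort the plain average.
import statistics

psi_i = [0.03, 0.05, -0.02, 0.04, 0.01, 0.02, 1.90, 2.10]

mean_all = statistics.fmean(psi_i)                    # pulled up by the outliers
median_all = statistics.median(psi_i)                 # robust to them
trimmed_mean = statistics.fmean(sorted(psi_i)[2:-2])  # 25% trim per side

print(round(mean_all, 3))      # 0.516
print(round(median_all, 3))    # 0.035
print(round(trimmed_mean, 3))  # 0.035
&lt;/code>&lt;/pre>
&lt;p>The median and trimmed mean agree closely with each other and differ sharply from the plain mean, which is the signature of outlier influence.&lt;/p>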
&lt;hr>
&lt;h2 id="9-model-comparison-and-specification-guidance">9. Model comparison and specification guidance&lt;/h2>
&lt;p>The following table summarizes the four model specifications estimated in this tutorial, highlighting the key coefficient estimates and diagnostic tests.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th style="text-align:center">Full model&lt;/th>
&lt;th style="text-align:center">No factors&lt;/th>
&lt;th style="text-align:center">No spatial lag&lt;/th>
&lt;th style="text-align:center">Heterogeneous (MG)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\psi$ (spatial)&lt;/td>
&lt;td style="text-align:center">0.394***&lt;/td>
&lt;td style="text-align:center">0.288***&lt;/td>
&lt;td style="text-align:center">&amp;mdash;&lt;/td>
&lt;td style="text-align:center">0.032&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$ (temporal)&lt;/td>
&lt;td style="text-align:center">0.290***&lt;/td>
&lt;td style="text-align:center">0.594***&lt;/td>
&lt;td style="text-align:center">0.323***&lt;/td>
&lt;td style="text-align:center">0.301***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LIQUIDITY&lt;/td>
&lt;td style="text-align:center">2.452***&lt;/td>
&lt;td style="text-align:center">0.843***&lt;/td>
&lt;td style="text-align:center">2.534***&lt;/td>
&lt;td style="text-align:center">6.330***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Factors&lt;/td>
&lt;td style="text-align:center">$r_x$=2, $r_u$=1&lt;/td>
&lt;td style="text-align:center">0, 0&lt;/td>
&lt;td style="text-align:center">$r_x$=2, $r_u$=1&lt;/td>
&lt;td style="text-align:center">$r_x$=2, $r_u$=1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>J-test p-value&lt;/td>
&lt;td style="text-align:center">0.468&lt;/td>
&lt;td style="text-align:center">0.000&lt;/td>
&lt;td style="text-align:center">0.226&lt;/td>
&lt;td style="text-align:center">&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slopes&lt;/td>
&lt;td style="text-align:center">Homogeneous&lt;/td>
&lt;td style="text-align:center">Homogeneous&lt;/td>
&lt;td style="text-align:center">Homogeneous&lt;/td>
&lt;td style="text-align:center">Heterogeneous&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The decision diagram below provides a practical guide for choosing among these specifications.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
START[&amp;quot;&amp;lt;b&amp;gt;Start&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Spatial dynamic panel&amp;lt;br/&amp;gt;with suspected factors&amp;quot;]
JTEST[&amp;quot;&amp;lt;b&amp;gt;J-test&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate with factors&amp;lt;br/&amp;gt;and without factors&amp;quot;]
FACTORS[&amp;quot;&amp;lt;b&amp;gt;Include factors&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;J-test fails without&amp;lt;br/&amp;gt;(p &amp;lt; 0.05)&amp;quot;]
NOFACT[&amp;quot;&amp;lt;b&amp;gt;No factors needed&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;J-test passes without&amp;lt;br/&amp;gt;(p ≥ 0.05)&amp;quot;]
SPLAG[&amp;quot;&amp;lt;b&amp;gt;Spatial lag?&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Is ψ significant?&amp;quot;]
FULL[&amp;quot;&amp;lt;b&amp;gt;Full model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;spxtivdfreg with&amp;lt;br/&amp;gt;splag + factors&amp;quot;]
NOSPL[&amp;quot;&amp;lt;b&amp;gt;xtivdfreg&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Dynamic panel&amp;lt;br/&amp;gt;with factors only&amp;quot;]
MG[&amp;quot;&amp;lt;b&amp;gt;MG estimator&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Test slope&amp;lt;br/&amp;gt;heterogeneity&amp;quot;]
START --&amp;gt; JTEST
JTEST --&amp;gt;|&amp;quot;J rejects without factors&amp;quot;| FACTORS
JTEST --&amp;gt;|&amp;quot;J passes without factors&amp;quot;| NOFACT
FACTORS --&amp;gt; SPLAG
SPLAG --&amp;gt;|&amp;quot;ψ significant&amp;quot;| FULL
SPLAG --&amp;gt;|&amp;quot;ψ not significant&amp;quot;| NOSPL
FULL --&amp;gt; MG
style START fill:#141413,stroke:#d97757,color:#fff
style JTEST fill:#6a9bcc,stroke:#141413,color:#fff
style FACTORS fill:#00d4c8,stroke:#141413,color:#141413
style NOFACT fill:#d97757,stroke:#141413,color:#fff
style SPLAG fill:#6a9bcc,stroke:#141413,color:#fff
style FULL fill:#00d4c8,stroke:#141413,color:#141413
style NOSPL fill:#d97757,stroke:#141413,color:#fff
style MG fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The J-test is the first and most important diagnostic: in our application, it unambiguously rejects the no-factor specification (p &amp;lt; 0.001), confirming that common factors must be included. With factors, the spatial lag is highly significant ($\psi = 0.394$, z = 4.65), supporting the full model. The MG estimator provides a robustness check that reveals potential slope heterogeneity, but its insignificant spatial lag should be interpreted cautiously &amp;mdash; it may indicate genuine absence of spillovers, or it may reflect the difficulty of estimating bank-specific spatial parameters with only 35 time periods.&lt;/p>
&lt;hr>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;h3 id="methodological-implications">Methodological implications&lt;/h3>
&lt;p>The &lt;code>spxtivdfreg&lt;/code> package represents a significant advance in the spatial panel toolkit for Stata. By combining defactored IV estimation with spatial lag modeling, it addresses a long-standing limitation of existing packages: the inability to account for unobserved common factors. The results in this tutorial demonstrate that ignoring common factors leads to three specific problems: (1) inflated temporal persistence ($\rho$ doubling from 0.290 to 0.594), (2) distorted covariate effects (LIQUIDITY falling by 66% from 2.452 to 0.843), and (3) invalid instruments (J-test rejecting at p &amp;lt; 0.001). These are not minor specification issues &amp;mdash; they fundamentally change the economic story that emerges from the analysis.&lt;/p>
&lt;p>Readers who have worked through the companion &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_panel/">spatial panel regression tutorial with &lt;code>xsmle&lt;/code>&lt;/a> may wonder: what would happen if we used &lt;code>xsmle&lt;/code> on this banking dataset? Since &lt;code>xsmle&lt;/code> uses maximum likelihood without common factors, its estimates would resemble the &amp;ldquo;Without factors&amp;rdquo; column in Section 5 &amp;mdash; with temporal persistence inflated to $\rho \approx 0.59$, spatial spillovers compressed to $\psi \approx 0.29$, and the LIQUIDITY effect attenuated by two-thirds. The J-test rejection (p &amp;lt; 0.001) confirms that this ML specification is misspecified. The &lt;code>spxtivdfreg&lt;/code> approach avoids these problems by defactoring the data before estimation.&lt;/p>
&lt;h3 id="empirical-implications">Empirical implications&lt;/h3>
&lt;p>The empirical application reveals that credit risk in US banking operates through multiple interacting channels. The short-run coefficient on LIQUIDITY (2.452) implies that a 10 percentage point increase in the loan-to-deposit ratio increases non-performing loans by about 0.25 percentage points in the current quarter. But the long-run total effect (7.765) is more than three times larger, reflecting the amplification through temporal persistence and spatial contagion. This means that the true cost of excessive lending is far larger than what contemporaneous cross-sectional regressions suggest.&lt;/p>
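&lt;p>The reported long-run effects can be reproduced, to a first approximation, from the short-run coefficients and the two autoregressive parameters. The scalar multiplier $1/(1 - \rho - \psi)$ used below is an assumption inferred from the reported numbers; the exact effect decomposition is computed by &lt;code>estat impact&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python"># Back-of-envelope check (assumed multiplier, not spxtivdfreg output):
# the long-run total effect of a covariate with short-run coefficient b
# is approximately b / (1 - rho - psi).
rho = 0.290   # temporal persistence
psi = 0.394   # spatial autoregressive parameter
multiplier = 1.0 / (1.0 - rho - psi)

beta_liquidity = 2.452
beta_buffer = -0.055

print(round(beta_liquidity * multiplier, 3))  # close to the reported 7.765
print(round(beta_buffer * multiplier, 3))     # close to the reported -0.173
&lt;/code>&lt;/pre>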
&lt;p>The common factors that the estimator identifies &amp;mdash; 2 in the regressors and 1 in the error &amp;mdash; capture aggregate forces such as Federal Reserve monetary policy, the collapse of the housing market, and the tightening of interbank lending during the crisis. These factors account for 33.5% of the residual variance, underscoring the importance of modeling macro-financial shocks explicitly rather than assuming they are absorbed by time fixed effects. Traditional two-way fixed effects would capture these factors only if they had &lt;strong>homogeneous&lt;/strong> effects across banks, but the interactive fixed effect structure $\lambda_i' f_t$ allows for &lt;strong>heterogeneous&lt;/strong> loadings &amp;mdash; some banks are more sensitive to interest rate shocks, others to housing market conditions.&lt;/p>
&lt;h3 id="policy-implications">Policy implications&lt;/h3>
&lt;p>For banking regulators, the indirect long-run effects are particularly informative. The total long-run effect of BUFFER on NPL is -0.173, meaning that a system-wide 1 percentage point increase in capital buffers above the 8% minimum would reduce non-performing loans by 0.17 percentage points across the network. This effect is roughly split between the direct channel (banks with more capital absorb losses better) and the indirect channel (their stability reduces contagion to connected banks). This decomposition supports macroprudential policies that target &lt;strong>system-wide&lt;/strong> capital requirements rather than bank-specific ones, since the spillover benefits of higher capital buffers are nearly as large as the direct benefits.&lt;/p>
&lt;hr>
&lt;h2 id="11-summary-and-next-steps">11. Summary and next steps&lt;/h2>
&lt;p>This tutorial demonstrated the complete workflow for estimating spatial dynamic panel models with unobserved common factors in Stata using the &lt;code>spxtivdfreg&lt;/code> package. The key takeaways are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Common factors are essential.&lt;/strong> The J-test rejects the no-factor model (p &amp;lt; 0.001), and omitting factors inflates temporal persistence from $\rho = 0.290$ to $\rho = 0.594$ &amp;mdash; a doubling that confuses macroeconomic persistence with bank-level credit risk dynamics.&lt;/li>
&lt;li>&lt;strong>Spatial spillovers are economically significant.&lt;/strong> The spatial autoregressive parameter $\psi = 0.394$ implies that a 1 percentage point increase in neighbors' NPL raises a bank&amp;rsquo;s own NPL by 0.39 percentage points. Long-run indirect effects exceed direct effects for most variables.&lt;/li>
&lt;li>&lt;strong>Long-run total effects are large.&lt;/strong> For LIQUIDITY, the total long-run effect is 7.765 &amp;mdash; more than three times the short-run coefficient of 2.452 &amp;mdash; reflecting amplification through both temporal persistence and spatial contagion.&lt;/li>
&lt;li>&lt;strong>Slope heterogeneity matters for interpretation.&lt;/strong> The mean-group estimator drives the spatial lag to insignificance ($\psi = 0.032$, p = 0.536), suggesting that the pooled model&amp;rsquo;s strong spatial spillovers may partly reflect cross-bank heterogeneity in covariate effects.&lt;/li>
&lt;/ul>
&lt;p>For further study, the companion tutorial on &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_panel/">spatial panel regression with xsmle&lt;/a> covers maximum likelihood estimation of static and dynamic spatial panels, including the Spatial Durbin Model with Wald specification tests and the Lee-Yu bias correction. For cross-sectional spatial models, see the &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_cross_section/">cross-sectional spatial regression tutorial&lt;/a>. The original paper by Kripfganz and Sarafidis (2025) provides the full theoretical derivation and Monte Carlo simulations that establish the estimator&amp;rsquo;s properties.&lt;/p>
&lt;hr>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Endogeneity of INEFF.&lt;/strong> The full model treats &lt;code>INEFF&lt;/code> (operational inefficiency) as endogenous and uses &lt;code>INTEREST&lt;/code> (interest expenses / deposits) as an excluded instrument. Re-estimate the model treating &lt;code>INEFF&lt;/code> as exogenous by removing &lt;code>INTEREST&lt;/code> from the &lt;code>iv()&lt;/code> option and adding &lt;code>INEFF&lt;/code> to the exogenous instrument list. Does the coefficient on &lt;code>INEFF&lt;/code> change substantially? What does this tell you about the direction of endogeneity bias?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Alternative factor structure.&lt;/strong> The estimator automatically selects 2 factors in the regressors and 1 in the error. Use the &lt;code>factmax()&lt;/code> option to constrain the maximum number of factors to 1 or 3 and re-estimate the model. Compare the spatial parameter $\psi$, the J-test statistic, and the variance decomposition ($\rho_{factor}$). How sensitive are the results to the assumed number of common factors?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Short-run vs. long-run effects.&lt;/strong> Use &lt;code>estat impact, sr&lt;/code> to compute the short-run direct, indirect, and total effects and compare them to the long-run effects in Table 3. For which variable is the ratio of long-run to short-run total effect the largest? What does this ratio tell you about the relative importance of temporal persistence vs. spatial amplification for that variable?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.18637/jss.v113.i06" target="_blank" rel="noopener">Kripfganz, S. &amp;amp; Sarafidis, V. (2025). Estimating spatial dynamic panel data models with unobserved common factors in Stata. &lt;em>Journal of Statistical Software&lt;/em>, 113(6).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1177/1536867X211045558" target="_blank" rel="noopener">Kripfganz, S. &amp;amp; Sarafidis, V. (2021). Instrumental-variable estimation of large-T panel-data models with common factors. &lt;em>Stata Journal&lt;/em>, 21(3), 659&amp;ndash;686.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/07474938.2011.611458" target="_blank" rel="noopener">Sarafidis, V. &amp;amp; Wansbeek, T. (2012). Cross-sectional dependence in panel data analysis. &lt;em>Econometric Reviews&lt;/em>, 31(5), 483&amp;ndash;531.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/j.1468-0262.2006.00692.x" target="_blank" rel="noopener">Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. &lt;em>Econometrica&lt;/em>, 74(4), 967&amp;ndash;1012.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://link.springer.com/book/10.1007/978-3-642-40340-8" target="_blank" rel="noopener">Elhorst, J. P. (2014). &lt;em>Spatial Econometrics: From Cross-Sectional Data to Spatial Panels&lt;/em>. Springer.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1177/1536867X1701700109" target="_blank" rel="noopener">Belotti, F., Hughes, G., &amp;amp; Mortari, A. P. (2017). Spatial panel-data models using Stata. &lt;em>Stata Journal&lt;/em>, 17(1), 139&amp;ndash;180.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Difference-in-Differences for Policy Evaluation: A Tutorial using R</title><link>https://carlos-mendez.org/post/r_did/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_did/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Does raising the minimum wage reduce employment among young workers? This question has been at the center of one of the longest-running debates in labor economics, and the &lt;strong>Difference-in-Differences (DID)&lt;/strong> method has been the primary tool for answering it. In this tutorial, we analyze how state-level minimum wage increases between 2001 and 2007 affected teen employment in the United States &amp;mdash; a period when the federal minimum wage was frozen at \$5.15 per hour, while individual states raised their own minimum wages at different times. This variation in treatment timing creates a natural experiment ideally suited for DID.&lt;/p>
&lt;p>For decades, applied researchers implemented DID using a simple &lt;strong>two-way fixed effects (TWFE)&lt;/strong> regression &amp;mdash; a panel regression with unit and time fixed effects. Recent research has revealed that this approach can produce severely biased estimates when there is &lt;strong>staggered treatment adoption&lt;/strong> (units treated at different times) and &lt;strong>treatment effect heterogeneity&lt;/strong> (effects that vary across groups or over time). The TWFE regression implicitly makes &amp;ldquo;forbidden comparisons&amp;rdquo; that use already-treated units as the comparison group, and it assigns negative weights to some group-time treatment effects. These problems are not theoretical curiosities &amp;mdash; they lead to meaningful differences in empirical estimates.&lt;/p>
&lt;p>This tutorial walks through the complete modern DID workflow. We begin with the traditional TWFE regression and demonstrate its limitations. We then introduce the &lt;strong>Callaway and Sant&amp;rsquo;Anna (2021)&lt;/strong> framework for estimating group-time average treatment effects, $ATT(g,t)$, that cleanly separate identification from estimation. We extend the analysis with covariates using doubly robust estimation, assess the sensitivity of results to violations of parallel trends using &lt;strong>HonestDiD&lt;/strong> (Rambachan and Roth, 2023), and explore how to handle heterogeneous treatment doses across states. The tutorial is based on Callaway&amp;rsquo;s (2022) chapter &amp;ldquo;Difference-in-Differences for Policy Evaluation&amp;rdquo; and the accompanying LSU workshop materials.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the parallel trends assumption and why TWFE regressions break down with staggered treatment adoption and treatment effect heterogeneity&lt;/li>
&lt;li>Estimate group-time average treatment effects using &lt;code>att_gt()&lt;/code> from the &lt;code>did&lt;/code> package and aggregate them into overall ATTs and event studies&lt;/li>
&lt;li>Diagnose TWFE bias through weight decomposition, identifying negative weights and pre-treatment contamination&lt;/li>
&lt;li>Apply doubly robust estimation with conditional parallel trends and assess robustness to base period and comparison group choices&lt;/li>
&lt;li>Conduct HonestDiD sensitivity analysis to evaluate how robust findings are to violations of parallel trends&lt;/li>
&lt;/ul>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;did&amp;quot;, &amp;quot;fixest&amp;quot;, &amp;quot;HonestDiD&amp;quot;, &amp;quot;DRDID&amp;quot;, &amp;quot;BMisc&amp;quot;,
&amp;quot;modelsummary&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;dplyr&amp;quot;, &amp;quot;pte&amp;quot;, &amp;quot;remotes&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
# twfeweights is GitHub-only
if (!requireNamespace(&amp;quot;twfeweights&amp;quot;, quietly = TRUE)) {
remotes::install_github(&amp;quot;bcallaway11/twfeweights&amp;quot;)
}
# pte may also require GitHub install if not on CRAN
if (!requireNamespace(&amp;quot;pte&amp;quot;, quietly = TRUE)) {
remotes::install_github(&amp;quot;bcallaway11/pte&amp;quot;)
}
library(did)
library(fixest)
library(twfeweights)
library(HonestDiD)
library(DRDID)
library(BMisc)
library(modelsummary)
library(ggplot2)
library(dplyr)
&lt;/code>&lt;/pre>
&lt;h2 id="3-data-loading-and-exploration">3. Data Loading and Exploration&lt;/h2>
&lt;p>The dataset comes from Callaway and Sant&amp;rsquo;Anna (2021) and contains county-level panel data on teen employment and state minimum wages across the United States from 2001 to 2007. During this period, the federal minimum wage remained constant at \$5.15 per hour, while several states raised their state-level minimum wages above the federal floor at different points in time. States that raised their minimum wages form the &amp;ldquo;treated&amp;rdquo; groups, identified by the year their first increase took effect. States that never raised their minimum wage above the federal level during this period form the &amp;ldquo;never-treated&amp;rdquo; comparison group.&lt;/p>
&lt;pre>&lt;code class="language-r"># Load data from Callaway's GitHub repository
load(url(&amp;quot;https://github.com/bcallaway11/did_chapter/raw/master/mw_data_ch2.RData&amp;quot;))
# Filter: keep groups 0 (never-treated), 2004, 2006, 2007; drop Northeast region
mw_data_ch2 &amp;lt;- subset(mw_data_ch2,
(G %in% c(2004, 2006, 2007, 0)) &amp;amp; (region != &amp;quot;1&amp;quot;))
# Main analysis subset: drop G=2007, keep year &amp;gt;= 2003
data2 &amp;lt;- subset(mw_data_ch2, G != 2007 &amp;amp; year &amp;gt;= 2003)
head(data2[, c(&amp;quot;id&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;lemp&amp;quot;, &amp;quot;lpop&amp;quot;, &amp;quot;region&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> id year G lemp lpop region
6 1001 2003 0 5.253534 10.07352 3
7 1001 2004 0 5.288267 10.06966 3
8 1001 2005 0 5.267858 10.06235 3
9 1001 2006 0 5.298317 10.05546 3
10 1001 2007 0 5.232025 10.04953 3
31 1003 2003 0 6.822197 11.16740 3
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r"># Counties by treatment group
data2 %&amp;gt;%
filter(year == 2003) %&amp;gt;%
group_by(G) %&amp;gt;%
summarise(n_counties = n(), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> G n_counties
1 0 1417
2 2004 102
3 2006 226
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 8,725 county-year observations spanning 1,745 counties over five years (2003&amp;ndash;2007). There are two treatment groups: 102 counties in states that first raised their minimum wage in 2004 (G=2004) and 226 counties in states that did so in 2006 (G=2006). The remaining 1,417 counties are in states that kept their minimum wage at the federal level throughout the period and serve as the never-treated comparison group. We drop the G=2007 group (states raising their minimum wage right before the federal increase) to maintain a cleaner analysis window, following the workshop approach.&lt;/p>
&lt;pre>&lt;code class="language-r"># Summary statistics
summary(data2[, c(&amp;quot;lemp&amp;quot;, &amp;quot;lpop&amp;quot;, &amp;quot;lavg_pay&amp;quot;)])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> lemp lpop lavg_pay
Min. : 1.099 Min. : 6.397 Min. : 9.646
1st Qu.: 4.615 1st Qu.: 9.149 1st Qu.:10.117
Median : 5.517 Median : 9.931 Median :10.225
Mean : 5.594 Mean :10.030 Mean :10.245
3rd Qu.: 6.458 3rd Qu.:10.762 3rd Qu.:10.352
Max. :11.173 Max. :15.492 Max. :11.223
&lt;/code>&lt;/pre>
&lt;p>The outcome variable &lt;code>lemp&lt;/code> is log teen employment, with a mean of 5.59 (corresponding to roughly 270 teen workers per county). The covariates &lt;code>lpop&lt;/code> (log county population, mean 10.03) and &lt;code>lavg_pay&lt;/code> (log average county pay, mean 10.25) capture differences in county size and economic conditions that could affect employment trends. These covariates will become important when we condition the parallel trends assumption on observables in Section 7.&lt;/p>
&lt;h2 id="4-the-basic-did-framework">4. The Basic DID Framework&lt;/h2>
&lt;h3 id="41-did-intuition-and-parallel-trends">4.1 DID Intuition and Parallel Trends&lt;/h3>
&lt;p>The core idea behind Difference-in-Differences is simple: compare how outcomes change over time for the treated group relative to a comparison group. If the treated and comparison groups would have followed &lt;strong>parallel trends&lt;/strong> in the absence of treatment, then any divergence after treatment can be attributed to the treatment itself. Formally, the Average Treatment Effect on the Treated (ATT) is identified as:&lt;/p>
&lt;p>$$ATT = E[\Delta Y_{t^{\ast}} \mid D=1] - E[\Delta Y_{t^{\ast}} \mid D=0]$$&lt;/p>
&lt;p>where $\Delta Y_{t^{\ast}}$ is the change in outcomes from the pre-treatment period to the post-treatment period, $D=1$ indicates treated units, and $D=0$ indicates untreated units. The ATT equals the change in outcomes for the treated group, adjusted by the change in outcomes for the comparison group.&lt;/p>
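&lt;p>The estimand is simple arithmetic on four group means. The sketch below (Python, with made-up numbers) computes the two changes and their difference:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged numeric illustration of the 2x2 DID estimand (made-up numbers):
# ATT = (change for treated group) - (change for comparison group)
treated_pre, treated_post = 5.60, 5.50
control_pre, control_post = 5.40, 5.45

att = (treated_post - treated_pre) - (control_post - control_pre)
print(round(att, 2))  # -0.15
&lt;/code>&lt;/pre>
&lt;p>The treated group fell by 0.10 while the comparison group rose by 0.05, so the implied treatment effect is $-0.15$.&lt;/p>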
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Before Treatment&amp;quot;
A[&amp;quot;Treated Group&amp;lt;br/&amp;gt;Pre-treatment Y&amp;quot;]
B[&amp;quot;Control Group&amp;lt;br/&amp;gt;Pre-treatment Y&amp;quot;]
end
subgraph &amp;quot;After Treatment&amp;quot;
C[&amp;quot;Treated Group&amp;lt;br/&amp;gt;Post-treatment Y&amp;quot;]
D[&amp;quot;Control Group&amp;lt;br/&amp;gt;Post-treatment Y&amp;quot;]
end
A --&amp;gt;|&amp;quot;ΔY treated&amp;quot;| C
B --&amp;gt;|&amp;quot;ΔY control&amp;quot;| D
C -.-&amp;gt;|&amp;quot;ATT = ΔY treated − ΔY control&amp;quot;| E[&amp;quot;Causal Effect&amp;quot;]
style A fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>In the textbook case with exactly two periods and two groups, the TWFE regression $Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$ delivers an estimate of $\alpha$ that is numerically identical to the simple DID estimator, even in the presence of treatment effect heterogeneity. Here, $\theta_t$ represents time fixed effects (captured by &lt;code>year&lt;/code> in the regression), $\eta_i$ represents unit fixed effects (captured by &lt;code>id&lt;/code>), $D_{it}$ is the treatment indicator (&lt;code>post&lt;/code>), and $v_{it}$ are idiosyncratic unobservables.&lt;/p>
&lt;p>However, this equivalence breaks down when there are &lt;strong>multiple time periods&lt;/strong> and &lt;strong>variation in treatment timing&lt;/strong>. In our application, states raised their minimum wages at different times (2004 and 2006), creating a staggered treatment adoption design.&lt;/p>
&lt;p>The TWFE regression implicitly makes two types of comparisons: (1) &amp;ldquo;good comparisons&amp;rdquo; that compare treated groups to not-yet-treated groups, and (2) &amp;ldquo;bad comparisons&amp;rdquo; (sometimes called &amp;ldquo;forbidden comparisons&amp;rdquo;) that use already-treated groups as the comparison group. To see why this is problematic, imagine grading a student&amp;rsquo;s improvement by comparing them to classmates who already took the test last week &amp;mdash; those &amp;ldquo;comparison&amp;rdquo; students are themselves affected by the test, so they no longer represent a valid counterfactual. Similarly, already-treated units may themselves be experiencing treatment effects, contaminating the estimate.&lt;/p>
&lt;p>Moreover, under treatment effect heterogeneity, the TWFE coefficient $\alpha$ is a weighted average of underlying group-time treatment effects, and some of these weights can be &lt;strong>negative&lt;/strong>. It is as if you tried to compute an average score but accidentally gave some students a negative weight &amp;mdash; their positive performance would drag the average down. This means TWFE could, in principle, produce a negative estimate even when all true treatment effects are positive.&lt;/p>
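&lt;p>A stylized numeric example makes the danger concrete. With one negative weight, a weighted average of uniformly positive group-time effects can come out negative (the weights below are made up for illustration, not actual TWFE weights):&lt;/p>
&lt;pre>&lt;code class="language-python"># Stylized illustration (made-up numbers): the weights sum to one, but
# one of them is negative, so the weighted average flips sign even
# though every underlying effect is positive.
effects = [0.10, 0.12, 1.00]   # all true group-time effects positive
weights = [0.70, 0.55, -0.25]  # weights sum to 1, one is negative

twfe_alpha = sum(w * e for w, e in zip(weights, effects))
print(round(twfe_alpha, 3))  # -0.114
&lt;/code>&lt;/pre>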
&lt;h3 id="42-twfe-regression">4.2 TWFE Regression&lt;/h3>
&lt;p>Let us start with the traditional TWFE approach to establish a baseline estimate.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_res &amp;lt;- fixest::feols(lemp ~ post | id + year,
data = data2,
cluster = &amp;quot;id&amp;quot;)
summary(twfe_res)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OLS estimation, Dep. Var.: lemp
Observations: 8,725
Fixed-effects: id: 1,745, year: 5
Standard-errors: Clustered (id)
Estimate Std. Error t value Pr(&amp;gt;|t|)
post -0.03812 0.008489 -4.49036 7.5762e-06 ***
---
RMSE: 0.116264 Adj. R2: 0.9926
Within R2: 0.003711
&lt;/code>&lt;/pre>
&lt;p>The TWFE regression estimates that minimum wage increases reduced log teen employment by 0.038 (SE = 0.008), which is statistically significant. Interpreted naively, this suggests that states raising their minimum wage experienced a 3.8% decline in teen employment relative to states that did not. However, this single coefficient attempts to summarize the entire treatment effect across two different treatment groups, multiple post-treatment periods, and varying lengths of exposure &amp;mdash; a task that, as we will show, is not well-served by TWFE under treatment effect heterogeneity.&lt;/p>
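&lt;p>A quick side note on interpretation: with a log outcome, the coefficient is only approximately a percentage change. The exact conversion is $100 \times (e^{b} - 1)$:&lt;/p>
&lt;pre>&lt;code class="language-python"># Converting a log-outcome coefficient to a percentage change.
import math

b = -0.03812  # TWFE estimate on post
print(round(100 * b, 1))                   # -3.8 (log-point approximation)
print(round(100 * (math.exp(b) - 1), 1))   # -3.7 (exact percentage change)
&lt;/code>&lt;/pre>
&lt;p>For coefficients this small the two agree to a tenth of a percentage point, so the 3.8% reading in the text is harmless.&lt;/p>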
&lt;p>&lt;img src="r_did_01_twfe_event_study.png" alt="TWFE Event Study based on the Sun-Abraham interaction-weighted estimator.">&lt;/p>
&lt;p>The TWFE event study above uses &lt;code>fixest::sunab()&lt;/code> to estimate dynamic treatment effects within the TWFE framework. The coefficients suggest a small pre-trend violation at event time $-3$ and increasingly negative post-treatment effects. While the Sun-Abraham correction improves upon the standard TWFE event study by addressing some of the weighting issues, we will see that the Callaway-Sant&amp;rsquo;Anna approach provides a more principled decomposition of the treatment effect.&lt;/p>
&lt;h2 id="5-group-time-att-the-callaway-santanna-approach">5. Group-Time ATT: The Callaway-Sant&amp;rsquo;Anna Approach&lt;/h2>
&lt;h3 id="51-estimating-attgt">5.1 Estimating ATT(g,t)&lt;/h3>
&lt;p>The Callaway and Sant&amp;rsquo;Anna (2021) framework addresses the limitations of TWFE by working with &lt;strong>group-time average treatment effects&lt;/strong>:&lt;/p>
&lt;p>$$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G = g]$$&lt;/p>
&lt;p>where $Y_t(g)$ is the potential outcome at time $t$ if first treated in period $g$, $Y_t(0)$ is the untreated potential outcome, and $G = g$ identifies units in treatment group $g$. In words, $ATT(g,t)$ is the average treatment effect for units first treated in period $g$, measured at time $t$. These building-block parameters are identified under the parallel trends assumption using clean comparisons: each treated group is compared only to units that are never treated (or not yet treated), avoiding the forbidden comparisons that plague TWFE.&lt;/p>
&lt;pre>&lt;code class="language-r">attgt &amp;lt;- did::att_gt(yname = &amp;quot;lemp&amp;quot;,
idname = &amp;quot;id&amp;quot;,
gname = &amp;quot;G&amp;quot;,
tname = &amp;quot;year&amp;quot;,
data = data2,
control_group = &amp;quot;nevertreated&amp;quot;,
base_period = &amp;quot;universal&amp;quot;)
tidy(attgt)[, 1:5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> term group time estimate std.error
ATT(2004,2003) 2004 2003 0.00000000 NA
ATT(2004,2004) 2004 2004 -0.03266653 0.02149279
ATT(2004,2005) 2004 2005 -0.06827991 0.02098524
ATT(2004,2006) 2004 2006 -0.12335404 0.02089502
ATT(2004,2007) 2004 2007 -0.13109136 0.02326712
ATT(2006,2003) 2006 2003 -0.03408910 0.01165128
ATT(2006,2004) 2006 2004 -0.01669977 0.00817406
ATT(2006,2005) 2006 2005 0.00000000 NA
ATT(2006,2006) 2006 2006 -0.01939335 0.00892409
ATT(2006,2007) 2006 2007 -0.06607568 0.00965073
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>att_gt()&lt;/code> function estimates each $ATT(g,t)$ separately. For the G=2004 group, the treatment effect grows over time: $-0.033$ on impact (2004), $-0.068$ one year later (2005), $-0.123$ two years later (2006), and $-0.131$ three years later (2007). This pattern suggests &lt;strong>treatment effect dynamics&lt;/strong> &amp;mdash; the negative employment effect of minimum wage increases deepens with longer exposure. For the G=2006 group, the on-impact effect is smaller ($-0.019$) and grows to $-0.066$ after one year. The pre-treatment estimates for G=2006 show a concerning value of $-0.034$ at event time $-3$ (year 2003), suggesting a possible violation of the parallel trends assumption for this group &amp;mdash; a point we will revisit in the sensitivity analysis.&lt;/p>
&lt;p>&lt;img src="r_did_02_attgt.png" alt="Group-time average treatment effects for each treatment cohort, estimated with the Callaway-Sant&amp;rsquo;Anna method.">&lt;/p>
&lt;h3 id="52-aggregation-overall-att-and-event-study">5.2 Aggregation: Overall ATT and Event Study&lt;/h3>
&lt;p>Group-time ATTs are informative but numerous. The &lt;code>aggte()&lt;/code> function aggregates them into summary parameters. The &lt;strong>overall ATT&lt;/strong> weights each $ATT(g,t)$ by the group size and the number of post-treatment periods:&lt;/p>
&lt;pre>&lt;code class="language-r">attO &amp;lt;- did::aggte(attgt, type = &amp;quot;group&amp;quot;)
summary(attO)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0571 0.008 -0.0727 -0.0415 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0888 0.0197 -0.1309 -0.0468 *
2006 -0.0427 0.0083 -0.0604 -0.0251 *
&lt;/code>&lt;/pre>
&lt;p>The overall ATT is $-0.057$ (SE = 0.008), substantially larger in magnitude than the TWFE estimate of $-0.038$. The Callaway-Sant&amp;rsquo;Anna framework reveals that TWFE &lt;strong>understated&lt;/strong> the negative employment effect by about one-third. The group-level results show that the G=2004 group experienced a larger average effect ($-0.089$) than the G=2006 group ($-0.043$), which makes sense: the G=2004 group has been treated for more periods, so its average reflects more of the growing dynamic effects.&lt;/p>
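&lt;p>In our notation (a sketch of the aggregation that &lt;code>aggte(type = &amp;quot;group&amp;quot;)&lt;/code> performs, following Callaway and Sant&amp;rsquo;Anna, 2021), each group effect averages that group&amp;rsquo;s post-treatment $ATT(g,t)$ values, and the overall ATT then weights the groups by their share of the treated population:&lt;/p>
&lt;p>$$ATT_g = \frac{1}{T - g + 1} \sum_{t=g}^{T} ATT(g,t), \qquad ATT^{O} = \sum_{g} P(G = g \mid \text{treated}) \cdot ATT_g$$&lt;/p>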
&lt;p>The &lt;strong>event study&lt;/strong> aggregation is equally informative:&lt;/p>
&lt;pre>&lt;code class="language-r">attes &amp;lt;- did::aggte(attgt, type = &amp;quot;dynamic&amp;quot;)
summary(attes)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall summary of ATT's based on event-study/dynamic aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0862 0.0124 -0.1106 -0.0618 *
Dynamic Effects:
Event time Estimate Std. Error [95% Simult. Conf. Band]
-3 -0.0341 0.0119 -0.0623 -0.0059 *
-2 -0.0167 0.0076 -0.0348 0.0014
-1 0.0000 NA NA NA
0 -0.0235 0.0081 -0.0426 -0.0044 *
1 -0.0668 0.0086 -0.0870 -0.0465 *
2 -0.1234 0.0203 -0.1714 -0.0753 *
3 -0.1311 0.0230 -0.1855 -0.0767 *
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_03_cs_event_study.png" alt="Event study aggregation of group-time ATTs showing the trajectory of treatment effects relative to the treatment year.">&lt;/p>
&lt;p>The event study reveals a clear pattern: the on-impact effect at $e=0$ is $-0.024$, growing to $-0.067$ at $e=1$, $-0.123$ at $e=2$, and $-0.131$ at $e=3$. The post-treatment effects are all statistically significant and increasingly negative, consistent with the minimum wage having a cumulative negative effect on teen employment over time. However, the pre-trend at $e=-3$ is $-0.034$ and marginally significant, which raises a flag about the validity of the parallel trends assumption. The pre-trend at $e=-2$ is smaller ($-0.017$) and not significant. We will formally assess the robustness of these results to parallel trends violations using HonestDiD in Section 7.&lt;/p>
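&lt;p>The dynamic aggregation behind this table can be written analogously (a sketch in our notation): each event-time effect averages the group-time effects at exposure length $e$, weighting groups by size among those observed at that event time:&lt;/p>
&lt;p>$$ATT^{es}(e) = \sum_{g} P(G = g \mid G + e \le T) \cdot ATT(g, g + e)$$&lt;/p>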
&lt;h3 id="53-twfe-weight-decomposition">5.3 TWFE Weight Decomposition&lt;/h3>
&lt;p>Why does TWFE produce a different estimate than Callaway-Sant&amp;rsquo;Anna? Both the TWFE coefficient and the overall $ATT^O$ can be written as weighted averages of the same underlying $ATT(g,t)$ values:&lt;/p>
&lt;p>$$ATT^O = \sum_{g,t} w^O(g,t) \cdot ATT(g,t)$$&lt;/p>
&lt;p>The difference lies in the weights. The proper $ATT^O$ weights reflect group size and number of post-treatment periods, while the TWFE weights are driven by the estimation method and can assign nonzero weight to pre-treatment periods or even negative weight to some post-treatment cells. The &lt;code>twfeweights&lt;/code> package makes these weights explicit.&lt;/p>
&lt;pre>&lt;code class="language-r">tw_obj &amp;lt;- twfeweights::twfe_weights(attgt)
tw &amp;lt;- tw_obj$weights_df
wO_obj &amp;lt;- twfeweights::attO_weights(attgt)
wO &amp;lt;- wO_obj$weights_df
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">TWFE estimate from weights: -0.0381
ATT^O estimate from weights: -0.0571
TWFE post-treatment component: -0.0503
Pre-treatment contamination: 0.0122
Total TWFE bias: 0.019
Fraction of bias from pre-treatment: 0.6422
Fraction of bias from post-treatment weighting: 0.3578
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_04_twfe_weights.png" alt="TWFE weight scatter plot showing how each group-time ATT is weighted. Circles are TWFE weights; teal diamonds are the proper ATT-O weights for post-treatment cells.">&lt;/p>
&lt;p>The weight decomposition is revealing. The TWFE estimate ($-0.038$) differs from the proper overall ATT ($-0.057$) by a total bias of $0.019$ &amp;mdash; meaning TWFE attenuates the negative employment effect toward zero. Of this bias, &lt;strong>64.2%&lt;/strong> comes from pre-treatment contamination: the TWFE regression assigns nonzero weights to pre-treatment $ATT(g,t)$ values, which should receive zero weight in any proper treatment effect parameter. The remaining &lt;strong>35.8%&lt;/strong> of the bias comes from TWFE assigning different post-treatment weights than the proper $ATT^O$ weights. The figure shows this visually: the orange pre-treatment dots receive nonzero TWFE weights (horizontal position), and the post-treatment TWFE weights (blue circles) differ systematically from the proper $ATT^O$ weights (teal diamonds).&lt;/p>
&lt;h2 id="6-relaxing-parallel-trends">6. Relaxing Parallel Trends&lt;/h2>
&lt;h3 id="61-conditional-parallel-trends-with-covariates">6.1 Conditional Parallel Trends with Covariates&lt;/h3>
&lt;p>The unconditional parallel trends assumption may be too strong if treatment and comparison groups differ on observable characteristics that affect outcome trends. For example, states that raised their minimum wages may have larger populations or higher average pay levels, and these characteristics could correlate with employment trends even absent the minimum wage change. &lt;strong>Conditional parallel trends&lt;/strong> weakens the assumption: trends need only be parallel after conditioning on covariates. The &lt;code>did&lt;/code> package offers three estimation methods for this setting. Regression adjustment models the outcome as a function of covariates; inverse probability weighting (IPW) reweights the comparison group to match the treated group&amp;rsquo;s covariate distribution; and the &lt;strong>doubly robust&lt;/strong> (DR) estimator combines both approaches, remaining consistent if either the outcome model or the propensity score model is correctly specified &amp;mdash; like wearing both a belt and suspenders.&lt;/p>
&lt;pre>&lt;code class="language-r"># Regression adjustment
cs_reg &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;reg&amp;quot;, data = data2)
attO_reg &amp;lt;- aggte(cs_reg, type = &amp;quot;group&amp;quot;)
# Inverse probability weighting
cs_ipw &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;ipw&amp;quot;, data = data2)
attO_ipw &amp;lt;- aggte(cs_ipw, type = &amp;quot;group&amp;quot;)
# Doubly robust
cs_dr &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_dr &amp;lt;- aggte(cs_dr, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Overall ATT&lt;/th>
&lt;th>SE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Unconditional&lt;/td>
&lt;td>$-0.057$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression adj.&lt;/td>
&lt;td>$-0.064$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>$-0.065$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Doubly robust&lt;/td>
&lt;td>$-0.065$&lt;/td>
&lt;td>0.008&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Controlling for log population and log average pay increases the estimated negative employment effect from $-0.057$ to approximately $-0.065$ across all three conditional methods. The three estimation methods produce nearly identical estimates, which is reassuring. The fact that all three methods agree suggests that covariate adjustment is not introducing model-dependence artifacts.&lt;/p>
&lt;p>&lt;img src="r_did_05_dr_event_study.png" alt="Event study from the doubly robust estimator conditioning on log population and log average pay.">&lt;/p>
&lt;p>The doubly robust event study shows the same qualitative pattern as the unconditional analysis: near-zero pre-trends (the pre-trend at $e=-3$ shrinks from $-0.034$ to $-0.022$ and is no longer significant) and increasingly negative post-treatment effects ($-0.027$ at $e=0$, $-0.077$ at $e=1$, $-0.135$ at $e=2$, $-0.147$ at $e=3$). The improved pre-trend behavior after conditioning on covariates suggests that some of the apparent pre-trend violations in the unconditional analysis were driven by differences in county characteristics between treatment and comparison groups.&lt;/p>
&lt;h3 id="62-robustness-base-period-comparison-group-and-anticipation">6.2 Robustness: Base Period, Comparison Group, and Anticipation&lt;/h3>
&lt;p>The Callaway-Sant&amp;rsquo;Anna framework allows the researcher to make several important choices. We now check that our results are robust to these choices.&lt;/p>
&lt;p>&lt;strong>Varying base period:&lt;/strong> Instead of comparing all pre-treatment and post-treatment periods to a single universal base period ($t = g-1$), we can use a varying base period that compares each period $t$ to period $t-1$.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_varying &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;varying&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_varying &amp;lt;- aggte(cs_varying, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Varying base period ATT^O: -0.0646 (SE: 0.0081)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Not-yet-treated comparison group:&lt;/strong> Instead of using only the never-treated group as the comparison, we can also include units that are not yet treated at time $t$.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_nyt &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;notyettreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, data = data2)
attO_nyt &amp;lt;- aggte(cs_nyt, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Not-yet-treated ATT^O: -0.0649 (SE: 0.008)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Anticipation:&lt;/strong> If states announced their minimum wage increases before they took effect, workers and firms might adjust their behavior in anticipation. We allow for one period of anticipation by setting &lt;code>anticipation = 1&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-r">cs_antic &amp;lt;- att_gt(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
xformla = ~lpop + lavg_pay,
control_group = &amp;quot;nevertreated&amp;quot;, base_period = &amp;quot;universal&amp;quot;,
est_method = &amp;quot;dr&amp;quot;, anticipation = 1, data = data2)
attO_antic &amp;lt;- aggte(cs_antic, type = &amp;quot;group&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">With anticipation (1 period) ATT^O: -0.0396 (SE: 0.0098)
&lt;/code>&lt;/pre>
&lt;p>The results are reassuringly stable across specifications. Switching to a varying base period ($-0.065$) or using the not-yet-treated comparison group ($-0.065$) produces virtually identical estimates to our baseline doubly robust result ($-0.065$). Allowing for one period of anticipation reduces the estimated ATT to $-0.040$ (SE = 0.010), which makes sense: if some of the treatment effect occurs before the official implementation date, excluding that period from the post-treatment window narrows the estimated effect. The consistency across the first three specifications gives us confidence that the main findings are not driven by specific methodological choices.&lt;/p>
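&lt;p>For a side-by-side comparison, the overall estimates from these robustness checks can be collected into a small table (a sketch using the objects created above; &lt;code>aggte()&lt;/code> results store the overall estimate in &lt;code>overall.att&lt;/code> and its standard error in &lt;code>overall.se&lt;/code>):&lt;/p>
&lt;pre>&lt;code class="language-r"># Collect the overall ATT and SE from each robustness specification
robust_tab &amp;lt;- data.frame(
  spec = c(&amp;quot;baseline DR&amp;quot;, &amp;quot;varying base period&amp;quot;,
           &amp;quot;not-yet-treated&amp;quot;, &amp;quot;anticipation = 1&amp;quot;),
  att  = c(attO_dr$overall.att, attO_varying$overall.att,
           attO_nyt$overall.att, attO_antic$overall.att),
  se   = c(attO_dr$overall.se, attO_varying$overall.se,
           attO_nyt$overall.se, attO_antic$overall.se)
)
robust_tab
&lt;/code>&lt;/pre>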
&lt;h2 id="7-sensitivity-analysis-when-parallel-trends-may-fail">7. Sensitivity Analysis: When Parallel Trends May Fail&lt;/h2>
&lt;p>Even after conditioning on covariates, the parallel trends assumption is not directly testable &amp;mdash; pre-trends close to zero are reassuring, but they do not guarantee that trends remain parallel in post-treatment periods. The &lt;strong>HonestDiD&lt;/strong> approach of Rambachan and Roth (2023) provides a principled sensitivity analysis: it asks how large violations of parallel trends can be before the post-treatment results break down. The &amp;ldquo;relative magnitude&amp;rdquo; variant compares the size of potential post-treatment violations to the observed size of pre-treatment deviations from parallel trends.&lt;/p>
&lt;p>The &lt;code>HonestDiD&lt;/code> package requires a small helper function to interface with the &lt;code>did&lt;/code> package&amp;rsquo;s event study objects. This helper (available in the companion R script and in &lt;a href="https://github.com/bcallaway11/did_chapter" target="_blank" rel="noopener">Callaway&amp;rsquo;s workshop materials&lt;/a>) extracts the influence function (a statistical tool for computing standard errors in complex estimators) and variance-covariance matrix from the event study, then passes them to &lt;code>HonestDiD&lt;/code>&amp;rsquo;s sensitivity routines. The parameter $\bar{M}$ bounds the ratio of the maximum post-treatment deviation from parallel trends to the maximum pre-treatment deviation &amp;mdash; in other words, it is a stress test asking &amp;ldquo;how much worse can things get after treatment compared to what we already see before treatment?&amp;rdquo;&lt;/p>
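&lt;p>Formally (a sketch of the relative magnitudes restriction in the notation of Rambachan and Roth, 2023), the post-treatment trend violations $\delta$ are allowed to vary within the set&lt;/p>
&lt;p>$$\Delta^{RM}(\bar{M}) = \left\{ \delta : |\delta_{t+1} - \delta_t| \le \bar{M} \max_{s \le -1} |\delta_{s+1} - \delta_s| \ \text{for all } t \ge 0 \right\}$$&lt;/p>
&lt;p>so that consecutive period-to-period deviations after treatment are at most $\bar{M}$ times the largest consecutive deviation observed before treatment.&lt;/p>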
&lt;pre>&lt;code class="language-r"># Helper function from Callaway's workshop (references/honest_did.R)
# Bridges the did package's AGGTEobj to HonestDiD's sensitivity functions
source(&amp;quot;references/honest_did.R&amp;quot;)
attgt_hd &amp;lt;- did::att_gt(yname = &amp;quot;lemp&amp;quot;, idname = &amp;quot;id&amp;quot;, gname = &amp;quot;G&amp;quot;,
tname = &amp;quot;year&amp;quot;, data = data2,
control_group = &amp;quot;nevertreated&amp;quot;,
base_period = &amp;quot;universal&amp;quot;)
cs_es_hd &amp;lt;- aggte(attgt_hd, type = &amp;quot;dynamic&amp;quot;)
hd_rm &amp;lt;- honest_did(es = cs_es_hd, e = 0, type = &amp;quot;relative_magnitude&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Original CI: [-0.0404, -0.0066]
Robust CIs:
lb ub Mbar
-0.0401 -0.00871 0.000
-0.0435 -0.00523 0.222
-0.0470 -0.00174 0.444
-0.0505 0.00523 0.667
-0.0575 0.01220 0.889
-0.0644 0.01920 1.111
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_06_honestdid.png" alt="HonestDiD sensitivity analysis showing how the confidence interval for the on-impact effect widens as the allowed magnitude of parallel trends violations increases.">&lt;/p>
&lt;p>The sensitivity analysis reveals that the on-impact effect ($e=0$) is robust to moderate violations of parallel trends, but not to large ones. The original 95% confidence interval is $[-0.040, -0.007]$, comfortably below zero. As $\bar{M}$ increases &amp;mdash; meaning we allow post-treatment violations of parallel trends to be larger relative to pre-treatment violations &amp;mdash; the confidence interval widens. The &lt;strong>breakdown point&lt;/strong> lies near $\bar{M} \approx 0.67$: at $\bar{M} = 0.444$ the robust confidence interval still excludes zero, but by $\bar{M} = 0.667$ it includes zero and we can no longer rule out a null effect. Given the moderate pre-trend violations we observed (especially at $e=-3$), this suggests that the results should be interpreted with some caution &amp;mdash; the evidence is suggestive of a negative employment effect, but it is not bulletproof.&lt;/p>
&lt;h2 id="8-more-complicated-treatment-regimes">8. More Complicated Treatment Regimes&lt;/h2>
&lt;h3 id="81-heterogeneous-treatment-doses">8.1 Heterogeneous Treatment Doses&lt;/h3>
&lt;p>So far, we have treated all minimum wage increases as a binary &amp;ldquo;treated or not&amp;rdquo; event. But states raised their minimum wages by very different amounts &amp;mdash; some by as little as \$0.10 above the federal floor, others by over \$1.00. A \$0.25 increase and a \$1.70 increase should not be expected to have the same employment effect. To account for this, we can normalize the treatment effect by the size of the minimum wage increase, computing an &lt;strong>ATT per dollar&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r"># Use full data including G=2007 for more treated states
data3 &amp;lt;- subset(mw_data_ch2, year &amp;gt;= 2003)
treated_state_list &amp;lt;- unique(subset(data3, G != 0)$state_name)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_07_state_mw.png" alt="Minimum wage trajectories showing the heterogeneous timing and magnitude of state minimum wage increases above the federal floor.">&lt;/p>
&lt;p>The figure reveals substantial variation across states. Illinois raised its minimum wage early (2004) and by a relatively large amount, while Florida and Colorado made smaller increases later. This heterogeneity in treatment dose motivates the per-dollar normalization.&lt;/p>
&lt;h3 id="82-att-per-dollar-event-study">8.2 ATT Per Dollar Event Study&lt;/h3>
&lt;p>We compute state-specific ATTs using the doubly robust panel DID estimator from the &lt;code>DRDID&lt;/code> package, then divide each by the size of the minimum wage increase above the federal level.&lt;/p>
&lt;pre>&lt;code class="language-r"># For each treated state and post-treatment period, compute ATT
# using the doubly robust panel estimator, then normalize by dose
for (state in treated_state_list) {
g &amp;lt;- unique(subset(data3, state_name == state)$G)
for (period in 2004:2007) {
Y1 &amp;lt;- c(subset(data3, state_name == state &amp;amp; year == period)$lemp,
subset(data3, G == 0 &amp;amp; year == period)$lemp)
Y0 &amp;lt;- c(subset(data3, state_name == state &amp;amp; year == g - 1)$lemp,
subset(data3, G == 0 &amp;amp; year == g - 1)$lemp)
D &amp;lt;- c(rep(1, sum(data3$state_name == state &amp;amp; data3$year == period)),
rep(0, sum(data3$G == 0 &amp;amp; data3$year == period)))
attst &amp;lt;- DRDID::drdid_panel(Y1, Y0, D, covariates = NULL)
treat_amount &amp;lt;- unique(subset(data3, state_name == state &amp;amp;
year == period)$state_mw) - 5.15
att_per_dollar &amp;lt;- attst$ATT / treat_amount
}
}
# Note: this is a simplified excerpt. See analysis.R for the full
# implementation with result storage, event study aggregation, and plots.
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall ATT per dollar: -0.0297 (SE: 0.0155)
Event study ATT per dollar:
event_time att se ci_lower ci_upper
0 -0.028 0.020 -0.066 0.010
1 -0.055 0.012 -0.079 -0.031
2 -0.091 0.015 -0.120 -0.062
3 -0.097 0.017 -0.130 -0.064
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_did_08_att_per_dollar.png" alt="Event study of treatment effects normalized by the dollar amount of the minimum wage increase, showing the employment response per dollar of additional minimum wage.">&lt;/p>
&lt;p>The dose-normalized results tell a consistent story. The on-impact effect per dollar is $-0.028$ (not quite significant at the 5% level), but the effect grows substantially with exposure: $-0.055$ after one year, $-0.091$ after two years, and $-0.097$ after three years. These per-dollar estimates imply that a \$1 increase in the minimum wage is associated with a decline of 0.055 log points in teen employment after one year (approximately 5.3%) and 0.097 log points after three years (approximately 9.2%). The post-treatment estimates from $e=1$ onward are all statistically significant. The overall ATT per dollar of $-0.030$ (SE = 0.016) averages across all post-treatment periods, but the event study makes clear that the cumulative effects are substantially larger.&lt;/p>
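&lt;p>The percentage figures quoted above follow from the exact transformation of log points into percentage changes, $100 \cdot (e^{\hat{\beta}} - 1)$:&lt;/p>
&lt;p>$$100 \cdot (e^{-0.055} - 1) \approx -5.35\%, \qquad 100 \cdot (e^{-0.097} - 1) \approx -9.24\%$$&lt;/p>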
&lt;h2 id="9-alternative-identification-strategies">9. Alternative Identification Strategies&lt;/h2>
&lt;p>The DID framework relies on the parallel trends assumption. Alternative identification strategies relax this assumption in different ways. The &lt;code>pte&lt;/code> package implements a &lt;strong>lagged outcomes&lt;/strong> strategy, which conditions on lagged outcome values rather than assuming parallel trends. Instead of assuming that treated and untreated groups would have followed the same trend, this approach assumes that controlling for the previous period&amp;rsquo;s outcome level makes treatment assignment as good as random: conditional on last year&amp;rsquo;s employment level, counties in states that raised their minimum wage are assumed comparable to counties in states that did not.&lt;/p>
&lt;pre>&lt;code class="language-r">library(pte)
data2_lo &amp;lt;- data2
data2_lo$G2 &amp;lt;- data2_lo$G
lo_res &amp;lt;- pte::pte_default(yname = &amp;quot;lemp&amp;quot;, tname = &amp;quot;year&amp;quot;, idname = &amp;quot;id&amp;quot;,
gname = &amp;quot;G2&amp;quot;, data = data2_lo,
d_outcome = FALSE, lagged_outcome_cov = TRUE)
summary(lo_res)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overall ATT: -0.061 (SE: 0.008, 95% CI: [-0.077, -0.045])
Dynamic Effects:
Event Time Estimate Std. Error [95% Conf. Band]
-2 0.014 0.008 -0.010 0.038
-1 0.010 0.007 -0.009 0.030
0 -0.024 0.009 -0.049 0.000
1 -0.074 0.008 -0.097 -0.050 *
2 -0.129 0.019 -0.185 -0.073 *
3 -0.140 0.023 -0.206 -0.074 *
&lt;/code>&lt;/pre>
&lt;p>The lagged outcomes strategy produces an overall ATT of $-0.061$ (SE = 0.008), very close to the DID estimates with covariates ($-0.065$). The pre-trends under this alternative identification strategy are close to zero (0.014 at $e=-2$ and 0.010 at $e=-1$, both insignificant), and the post-treatment trajectory ($-0.024$ on impact, $-0.074$ at $e=1$, $-0.129$ at $e=2$, $-0.140$ at $e=3$) closely mirrors the DID event study. The convergence of results across different identification strategies strengthens the case that the estimated negative employment effects are reflecting a genuine causal relationship rather than an artifact of any particular set of assumptions.&lt;/p>
&lt;h2 id="10-discussion-and-takeaways">10. Discussion and Takeaways&lt;/h2>
&lt;p>This tutorial demonstrates why &lt;strong>TWFE regressions are unreliable&lt;/strong> with staggered treatment adoption and treatment effect heterogeneity, and how modern DID methods provide a principled alternative. The TWFE coefficient of $-0.038$ understates the true overall ATT of $-0.057$ by about one-third, with 64% of the bias coming from pre-treatment contamination and the remaining 36% from improper post-treatment weighting. The Callaway-Sant&amp;rsquo;Anna framework cleanly separates identification from estimation by first computing group-time ATTs and then aggregating them into target parameters of interest.&lt;/p>
&lt;p>The substantive findings suggest that state-level minimum wage increases above the federal floor reduced teen employment, with effects that grew over time. The doubly robust estimator with covariates yields an overall ATT of $-0.065$ (SE = 0.008), and the dose-normalized analysis finds effects of approximately $-0.055$ per dollar after one year and $-0.097$ per dollar after three years. These results are robust across estimation methods (regression adjustment, IPW, doubly robust), comparison group definitions (never-treated, not-yet-treated), and base period choices (universal, varying).&lt;/p>
&lt;p>However, the results come with important caveats. The HonestDiD sensitivity analysis shows that the on-impact effect loses statistical significance when post-treatment parallel trends violations exceed about 67% of the pre-treatment deviations. The pre-treatment coefficient at $e=-3$ is moderately significant in the unconditional analysis, though it shrinks after covariate adjustment. These patterns suggest that while the evidence points toward negative employment effects, the magnitude should be interpreted with some caution. As Callaway (2022) notes, this application is primarily intended to illustrate the methodology rather than to settle the minimum wage debate.&lt;/p>
&lt;p>The modern DID toolkit demonstrated here &amp;mdash; &lt;code>did&lt;/code> for group-time ATTs, &lt;code>twfeweights&lt;/code> for diagnosing TWFE problems, &lt;code>HonestDiD&lt;/code> for sensitivity analysis, and &lt;code>DRDID&lt;/code> for doubly robust estimation &amp;mdash; provides applied researchers with a complete workflow for credible causal inference in staggered treatment settings. The key lesson is that DID is not just a regression &amp;mdash; it is an identification strategy that requires careful attention to the structure of the treatment, the comparison group, and the plausibility of the underlying assumptions.&lt;/p>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>TWFE understates the true ATT by ~33% ($-0.038$ vs $-0.057$), with 64% of the bias from pre-treatment contamination and 36% from improper post-treatment weighting&lt;/li>
&lt;li>The doubly robust ATT of $-0.065$ is stable across estimation methods (regression, IPW, DR), comparison groups (never-treated, not-yet-treated), and base periods (universal, varying)&lt;/li>
&lt;li>Employment effects accumulate over time: $-0.027$ on impact, growing to $-0.147$ after three years under the doubly robust specification&lt;/li>
&lt;li>The on-impact effect is robust to parallel trends violations up to 67% of pre-trend magnitude ($\bar{M} \approx 0.67$), but not beyond&lt;/li>
&lt;li>Per-dollar normalization reveals that a \$1 minimum wage increase reduces teen employment by approximately 5.3% after one year and 9.2% after three years&lt;/li>
&lt;/ol>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Expand the sample:&lt;/strong> Re-run the analysis using &lt;code>data3&lt;/code> (which includes the G=2007 group) and compare the results. Does including the additional treatment group change the overall ATT or the event study pattern?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Alternative covariates:&lt;/strong> Experiment with different covariate specifications in the doubly robust estimator. What happens if you include only &lt;code>lpop&lt;/code>? Only &lt;code>lavg_pay&lt;/code>? Does the choice of covariates meaningfully affect the pre-trends?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Smoothness sensitivity:&lt;/strong> Run the HonestDiD smoothness-based sensitivity analysis (&lt;code>type = &amp;quot;smoothness&amp;quot;&lt;/code>) in addition to the relative magnitude analysis. How do the two approaches compare in terms of the robustness of the results?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-references">12. References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Callaway, B. (2022). Difference-in-Differences for Policy Evaluation. In &lt;em>Handbook of Labor, Human Resources, and Population Economics&lt;/em>. Springer. &lt;a href="https://link.springer.com/referenceworkentry/10.1007/978-3-319-57365-6_352-1" target="_blank" rel="noopener">Published version&lt;/a> | &lt;a href="https://bcallaway11.github.io/files/Callaway-Chapter-2022/main.pdf" target="_blank" rel="noopener">Working paper&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Callaway, B. and Sant&amp;rsquo;Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 200&amp;ndash;230. &lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">doi:10.1016/j.jeconom.2020.12.001&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 254&amp;ndash;277. &lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">doi:10.1016/j.jeconom.2021.03.014&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rambachan, A. and Roth, J. (2023). A More Credible Approach to Parallel Trends. &lt;em>Review of Economic Studies&lt;/em>, 90(5), 2555&amp;ndash;2591. &lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">doi:10.1093/restud/rdad018&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>de Chaisemartin, C. and D&amp;rsquo;Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. &lt;em>American Economic Review&lt;/em>, 110(9), 2964&amp;ndash;2996.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 175&amp;ndash;199.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>did&lt;/code> package: &lt;a href="https://cran.r-project.org/package=did" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://github.com/bcallaway11/did" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>fixest&lt;/code> package: &lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://lrberge.github.io/fixest/" target="_blank" rel="noopener">Documentation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>twfeweights&lt;/code> package: &lt;a href="https://github.com/bcallaway11/twfeweights" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>HonestDiD&lt;/code> package: &lt;a href="https://cran.r-project.org/package=HonestDiD" target="_blank" rel="noopener">CRAN&lt;/a> | &lt;a href="https://github.com/asheshrambachan/HonestDiD" target="_blank" rel="noopener">GitHub&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol></description></item><item><title>Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata</title><link>https://carlos-mendez.org/post/stata_rct/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_rct/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Cash transfer programs are among the most common development interventions worldwide. Governments and international organizations spend billions of dollars each year providing direct cash transfers to low-income households. But how do we rigorously evaluate whether these programs actually work? This tutorial walks through the complete workflow of analyzing a &lt;strong>randomized controlled trial (RCT)&lt;/strong> with &lt;strong>panel data&lt;/strong> in Stata &amp;mdash; from verifying that randomization succeeded, to estimating treatment effects using increasingly sophisticated methods, to comparing results across all approaches.&lt;/p>
&lt;p>We use simulated data from a hypothetical cash transfer program targeting 2,000 households in a developing country. The key advantage of simulated data is that we know the &lt;strong>true treatment effect&lt;/strong> before we begin: the program increases household consumption by &lt;strong>12%&lt;/strong> (0.12 log points). This known ground truth gives us a perfect benchmark to evaluate how well each econometric method recovers the correct answer.&lt;/p>
&lt;p>The tutorial progresses from simple to sophisticated. We start with basic balance checks, then estimate treatment effects three different ways using only endline data &amp;mdash; regression adjustment (RA), inverse probability weighting (IPW), and doubly robust (DR) methods. Next, we unlock the full power of panel data with difference-in-differences (DiD) and its doubly robust extension (DRDID). Finally, we address the real-world complication of imperfect compliance.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Verify baseline balance using t-tests, standardized mean differences, and balance plots&lt;/li>
&lt;li>Distinguish between ATE and ATT and identify which estimand each method targets&lt;/li>
&lt;li>Understand three estimation strategies &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust &amp;mdash; and when to use each&lt;/li>
&lt;li>Estimate treatment effects using all three approaches and compare their results&lt;/li>
&lt;li>Leverage panel data structure with difference-in-differences and understand why DiD estimates ATT&lt;/li>
&lt;li>Apply doubly robust difference-in-differences (DRDID) for modern panel data analysis&lt;/li>
&lt;li>Separate the effect of treatment offer from treatment receipt under imperfect compliance&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-study-design">2. Study design&lt;/h2>
&lt;p>This RCT evaluates a cash transfer program designed to boost household consumption. The study tracks 2,000 households across two survey waves &amp;mdash; a &lt;strong>baseline&lt;/strong> in 2021 (before the program) and an &lt;strong>endline&lt;/strong> in 2024 (after the program was implemented). The diagram below summarizes the experimental design.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
POP[&amp;quot;&amp;lt;b&amp;gt;2,000 Households&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Balanced panel&amp;lt;br/&amp;gt;(observed in 2021 and 2024)&amp;quot;]
STRAT[&amp;quot;&amp;lt;b&amp;gt;Stratified Randomization&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Within poverty strata&amp;quot;]
TRT[&amp;quot;&amp;lt;b&amp;gt;Treatment Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;Offered cash transfer&amp;quot;]
CTL[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;No offer&amp;quot;]
COMP1[&amp;quot;85% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP2[&amp;quot;15% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
COMP3[&amp;quot;5% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP4[&amp;quot;95% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
BASE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment survey&amp;quot;]
END[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment survey&amp;quot;]
POP --&amp;gt; BASE
BASE --&amp;gt; STRAT
STRAT --&amp;gt; TRT
STRAT --&amp;gt; CTL
TRT --&amp;gt; COMP1
TRT --&amp;gt; COMP2
CTL --&amp;gt; COMP3
CTL --&amp;gt; COMP4
COMP1 --&amp;gt; END
COMP2 --&amp;gt; END
COMP3 --&amp;gt; END
COMP4 --&amp;gt; END
style POP fill:#6a9bcc,stroke:#141413,color:#fff
style STRAT fill:#d97757,stroke:#141413,color:#fff
style TRT fill:#00d4c8,stroke:#141413,color:#141413
style CTL fill:#6a9bcc,stroke:#141413,color:#fff
style BASE fill:#6a9bcc,stroke:#141413,color:#fff
style END fill:#d97757,stroke:#141413,color:#fff
style COMP1 fill:#00d4c8,stroke:#141413,color:#141413
style COMP2 fill:#141413,stroke:#d97757,color:#fff
style COMP3 fill:#d97757,stroke:#141413,color:#fff
style COMP4 fill:#141413,stroke:#6a9bcc,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The randomization was &lt;strong>stratified by poverty status&lt;/strong> (block randomization), ensuring that treatment and control groups started with similar proportions of poor and non-poor households. A critical real-world feature of this study is &lt;strong>imperfect compliance&lt;/strong> &amp;mdash; only 85% of households offered the treatment actually received the cash transfer, while 5% of control households received it through other channels.&lt;/p>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Type&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>id&lt;/code>&lt;/td>
&lt;td>Household identifier&lt;/td>
&lt;td>Panel ID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>year&lt;/code>&lt;/td>
&lt;td>Survey year (2021 or 2024)&lt;/td>
&lt;td>Time variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>post&lt;/code>&lt;/td>
&lt;td>Endline indicator (1 = 2024)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>treat&lt;/code>&lt;/td>
&lt;td>Random assignment to offer (intent-to-treat)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>D&lt;/code>&lt;/td>
&lt;td>Actual receipt of cash transfer&lt;/td>
&lt;td>Binary (endogenous)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y&lt;/code>&lt;/td>
&lt;td>Log monthly consumption&lt;/td>
&lt;td>Continuous (outcome)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>age&lt;/code>&lt;/td>
&lt;td>Age of household head&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>female&lt;/code>&lt;/td>
&lt;td>Female-headed household&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>poverty&lt;/code>&lt;/td>
&lt;td>Poverty status at baseline&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>edu&lt;/code>&lt;/td>
&lt;td>Years of education&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y0&lt;/code>&lt;/td>
&lt;td>Log monthly consumption at baseline (pre-treatment)&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Offer vs. receipt&lt;/strong> &amp;mdash; The variable &lt;code>treat&lt;/code> captures random assignment to the program offer. It is exogenous (determined by randomization) and unrelated to household characteristics. The variable &lt;code>D&lt;/code> captures actual receipt of the cash transfer. It is &lt;strong>endogenous&lt;/strong> &amp;mdash; households that chose to take up the program may differ systematically from those that did not. Most methods in this tutorial estimate the effect of the &lt;strong>offer&lt;/strong> (intent-to-treat). Section 10 addresses the effect of &lt;strong>receipt&lt;/strong>.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="3-analytical-roadmap">3. Analytical roadmap&lt;/h2>
&lt;p>The diagram below shows the progression of methods we will use. Each stage builds on the previous one, adding complexity and robustness.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Balance&amp;lt;br/&amp;gt;Checks&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 5&amp;lt;/i&amp;gt;&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Cross-sectional&amp;lt;br/&amp;gt;RA / IPW / DR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Sections 7-8&amp;lt;/i&amp;gt;&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;br/&amp;gt;DiD / DR-DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 9&amp;lt;/i&amp;gt;&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Endogenous&amp;lt;br/&amp;gt;Treatment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 10&amp;lt;/i&amp;gt;&amp;quot;]
A --&amp;gt; B
B --&amp;gt; C
C --&amp;gt; D
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We first establish that randomization worked (balance checks). Then we estimate treatment effects three ways using only endline data &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust methods. Next, we leverage the full panel structure with difference-in-differences. Finally, we address imperfect compliance by separating the effect of the offer from the effect of receipt.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-loading-and-exploration">4. Data loading and exploration&lt;/h2>
&lt;p>We begin by loading the simulated dataset from a public GitHub repository and examining its structure.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
des y age edu female poverty treat D
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Contains data
Observations: 4,000
Variables: 10
Variable Storage Display Value
name type format label Variable label
─────────────────────────────────────────────────────────────
y float %9.0g Log monthly consumption
age float %9.0g
edu float %9.0g
female float %9.0g
poverty float %9.0g
treat float %9.0g Assignment to offer (Z)
D float %9.0g Receipt of cash transfer
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 4,000 observations &amp;mdash; 2,000 households observed at two time points (baseline 2021 and endline 2024). The outcome variable &lt;code>y&lt;/code> is log monthly consumption, &lt;code>treat&lt;/code> is the random assignment indicator, and &lt;code>D&lt;/code> is the actual receipt indicator.&lt;/p>
&lt;p>Now let us examine summary statistics at baseline and endline separately.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==0
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.0154 .4348886 8.454445 11.48253
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 0 0 0 0
&lt;/code>&lt;/pre>
&lt;p>At baseline, mean log consumption is approximately 10.02, the average household head is 35 years old with 12 years of education, about 51% of households are female-headed, and 31% are in poverty. Treatment assignment (&lt;code>treat&lt;/code>) averages about 52%, close to the 50/50 split expected from randomization. Crucially, the receipt variable &lt;code>D&lt;/code> is zero for all households at baseline &amp;mdash; the program had not yet been implemented.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==1
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.1137 .4382183 8.638689 11.55002
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 .4615 .4986402 0 1
&lt;/code>&lt;/pre>
&lt;p>At endline, mean consumption has risen to approximately 10.11, reflecting both the natural time trend and the treatment effect. The receipt variable &lt;code>D&lt;/code> is now non-zero &amp;mdash; about 46% of all households received the cash transfer (combining treated households who took up the program and control households who received it through other channels).&lt;/p>
&lt;p>Finally, we declare the panel structure so Stata knows we have repeated observations.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtset id year
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel variable: id (strongly balanced)
Time variable: year, 2021 to 2024, but with gaps
Delta: 1 unit
&lt;/code>&lt;/pre>
&lt;p>The panel is &lt;strong>strongly balanced&lt;/strong> &amp;mdash; all 2,000 households appear in both survey waves, with no attrition. This is an ideal scenario that simplifies our analysis.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-balance-checks">5. Baseline balance checks&lt;/h2>
&lt;p>Before estimating any treatment effects, we must verify that randomization produced comparable treatment and control groups at baseline. This is the most fundamental quality check in any RCT.&lt;/p>
&lt;h3 id="51-t-tests-and-proportion-tests">5.1 T-tests and proportion tests&lt;/h3>
&lt;p>We compare the treatment and control groups on all baseline characteristics using two-sample t-tests for continuous variables and proportion tests for binary variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">ttest y if post==0, by(treat)
ttest age if post==0, by(treat)
ttest edu if post==0, by(treat)
prtest female if post==0, by(treat)
prtest poverty if post==0, by(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Variable | Control Mean Treat Mean Diff p-value
────────────+──────────────────────────────────────────────
y | 10.025 10.006 0.019 0.330
age | 35.335 34.931 0.404 0.350
edu | 11.974 12.077 -0.103 0.247
female | 0.484 0.531 -0.046 0.038 **
poverty | 0.307 0.318 -0.011 0.612
&lt;/code>&lt;/pre>
&lt;p>Most variables show no statistically significant differences between the treatment and control groups. However, the variable &lt;code>female&lt;/code> has a p-value of 0.038 &amp;mdash; a statistically significant imbalance. The treatment group has about 4.6 percentage points more female-headed households than the control group. This imbalance occurred purely by chance but must be addressed in our estimation.&lt;/p>
&lt;h3 id="52-balance-table-with-standardized-mean-differences">5.2 Balance table with standardized mean differences&lt;/h3>
&lt;p>P-values are sensitive to sample size &amp;mdash; a large sample can make tiny differences &amp;ldquo;significant.&amp;rdquo; Standardized mean differences (SMDs) provide a scale-free measure of imbalance that is more informative. The SMD is computed as the difference in group means divided by the pooled standard deviation &amp;mdash; this puts all variables on the same scale regardless of their units. The common rule of thumb is that SMDs below 10% indicate adequate balance.&lt;/p>
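&lt;p>Because the formula is so simple, the SMD is easy to verify by hand. The sketch below (plain Python rather than Stata, using the female-headed shares from the t-test table above; treat the exact numbers as illustrative) computes the SMD for a binary covariate:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def smd(mean_t, mean_c, sd_t, sd_c):
    # Standardized mean difference: difference in group means
    # divided by the pooled standard deviation.
    pooled_sd = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)
    return (mean_t - mean_c) / pooled_sd

# For a binary covariate, sd = sqrt(p * (1 - p)).
p_t, p_c = 0.531, 0.484  # female-headed shares in treatment and control
sd_t = math.sqrt(p_t * (1 - p_t))
sd_c = math.sqrt(p_c * (1 - p_c))

print(round(smd(p_t, p_c, sd_t, sd_c), 3))  # 0.094, just below the 10% threshold
&lt;/code>&lt;/pre>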
&lt;pre>&lt;code class="language-stata">capture ssc install ietoolkit, replace
iebaltab y age edu female poverty if post==0, grpvar(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> (1) (2) (1)-(2)
Control Treatment Difference
y 10.025 10.006 0.019
(0.014) (0.014) (0.019)
age 35.335 34.931 0.404
(0.316) (0.295) (0.432)
edu 11.974 12.077 -0.103
(0.063) (0.063) (0.089)
female 0.484 0.531 -0.046**
(0.016) (0.016) (0.022)
poverty 0.307 0.318 -0.011
(0.015) (0.014) (0.021)
N 964 1,036
&lt;/code>&lt;/pre>
&lt;p>The balance table confirms our t-test findings. With 964 control and 1,036 treatment households, all variables are well balanced except &lt;code>female&lt;/code>, which shows a statistically significant difference (marked with **). The outcome variable &lt;code>y&lt;/code> has a negligible difference of 0.019 at baseline &amp;mdash; the groups started with essentially identical consumption levels.&lt;/p>
&lt;h3 id="53-visual-balance-plot">5.3 Visual balance plot&lt;/h3>
&lt;p>A balance plot provides a visual overview of all SMDs at once, making it easy to spot problematic variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">net install balanceplot, from(&amp;quot;https://tdmize.github.io/data&amp;quot;) replace
balanceplot y age edu i.female i.poverty, group(treat) table nodropdv
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_balance_plot.png" alt="Balance plot showing standardized mean differences for all covariates. All variables fall within the 10% threshold, with female closest at approximately 9.3%.">&lt;/p>
&lt;p>The balance plot shows that all SMDs fall below the 10% threshold (indicated by the dashed vertical lines). The variable &lt;code>female&lt;/code> has the largest SMD at approximately 9.3% &amp;mdash; close to but still below the conventional threshold. The remaining variables &amp;mdash; consumption, age, education, and poverty &amp;mdash; all have SMDs well below 5%. Overall, randomization was successful, but we should control for &lt;code>female&lt;/code> (and other covariates) in our estimation to improve precision.&lt;/p>
&lt;h3 id="54-aipw-as-a-formal-balance-test">5.4 AIPW as a formal balance test&lt;/h3>
&lt;p>As a final and more formal balance check, we can use the Augmented Inverse Probability Weighting (AIPW) estimator on &lt;strong>baseline data only&lt;/strong>. If randomization was successful, the estimated &amp;ldquo;treatment effect&amp;rdquo; at baseline should be zero &amp;mdash; since the program had not yet been implemented, there should be no difference between groups.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve
keep if post==0
teffects aipw (y age edu i.female i.poverty) (treat age edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Tip:&lt;/strong> The &lt;code>preserve&lt;/code> command saves a snapshot of the current data. After the balance analysis, use &lt;code>restore&lt;/code> to return to the full dataset. The companion do-file handles this automatically.&lt;/p>
&lt;/blockquote>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | -.0244086 .018861 -1.29 0.196 -.0613754 .0125582
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.02792 .0138363 724.75 0.000 10.0008 10.05504
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW-estimated &amp;ldquo;ATE&amp;rdquo; at baseline is -0.024 with a p-value of 0.196 &amp;mdash; not statistically significant. This confirms that there is no detectable pre-treatment difference between the groups after adjusting for covariates. The treatment and control groups were statistically comparable before the program began.&lt;/p>
&lt;p>Now we run the diagnostic checks for the AIPW model.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance overid
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overidentification test for covariate balance
H0: Covariates are balanced
chi2(5) = 3.216
Prob &amp;gt; chi2 = 0.6670
&lt;/code>&lt;/pre>
&lt;p>The overidentification test fails to reject the null hypothesis of covariate balance (p = 0.667). There is no statistical evidence of residual imbalance after weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance summarize
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> |Standardized differences Variance ratio
| Raw Weighted Raw Weighted
----------------+------------------------------------------------
age | -.0417918 .0002505 .9318894 .9446877
edu | .0519015 -6.96e-06 1.071677 1.078214
female |
1 | .0929611 6.51e-06 .9970775 .9999996
poverty |
1 | .0226764 .0002864 1.018475 1.000233
&lt;/code>&lt;/pre>
&lt;p>The balance summary reveals that the raw standardized differences (before weighting) show the &lt;code>female&lt;/code> imbalance at 0.093, consistent with our earlier findings. After weighting, all standardized differences shrink to near zero (all below 0.001) &amp;mdash; excellent balance. The variance ratios are all close to 1.0, indicating similar spread across groups.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance density y
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_density_y.png" alt="Density plot showing the distribution of log consumption for treatment and control groups, before and after AIPW weighting. The weighted distributions overlap almost perfectly.">&lt;/p>
&lt;p>The density plot confirms that after AIPW weighting, the distributions of log consumption in the treatment and control groups overlap almost perfectly. Any small pre-existing differences in the outcome variable have been eliminated by the weighting scheme.&lt;/p>
&lt;pre>&lt;code class="language-stata">teffects overlap
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_overlap_baseline.png" alt="Overlap plot showing kernel densities of estimated propensity scores for treatment and control groups. Both distributions span approximately 0.43 to 0.55 with substantial overlap.">&lt;/p>
&lt;p>The overlap plot shows that propensity scores for both groups are concentrated between approximately 0.43 and 0.55 &amp;mdash; well within the range where matching and weighting are feasible. There are no extreme propensity scores near 0 or 1, confirming that the common support condition is satisfied. This is expected in a well-designed RCT where treatment probability is approximately 0.50 for all households.&lt;/p>
&lt;pre>&lt;code class="language-stata">restore
&lt;/code>&lt;/pre>
&lt;p>This AIPW-based balance analysis also serves a pedagogical purpose: it introduces the concept of &lt;strong>doubly robust&lt;/strong> estimation before we use it for treatment effect estimation in Section 8.&lt;/p>
&lt;hr>
&lt;h2 id="6-what-are-we-estimating-ate-vs-att">6. What are we estimating? ATE vs. ATT&lt;/h2>
&lt;p>Before diving into estimation, we need to be precise about &lt;strong>what&lt;/strong> we are trying to estimate. There are two fundamental causal quantities in program evaluation.&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect (ATE)&lt;/strong> answers the policymaker&amp;rsquo;s question: &lt;em>&amp;ldquo;What would happen if we scaled this program to the entire population?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATE = E[Y(1) - Y(0)]$$&lt;/p>
&lt;p>where $Y(1)$ is the potential outcome under treatment and $Y(0)$ is the potential outcome under control, averaged over the &lt;strong>entire population&lt;/strong> (both treated and untreated).&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> answers the evaluator&amp;rsquo;s question: &lt;em>&amp;ldquo;Did the program benefit those who were assigned to it?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATT = E[Y(1) - Y(0) \mid T = 1]$$&lt;/p>
&lt;p>This averages the treatment effect only over the &lt;strong>treated group&lt;/strong> &amp;mdash; the households that were assigned to receive the cash transfer.&lt;/p>
&lt;p>In a well-designed RCT with &lt;strong>homogeneous treatment effects&lt;/strong> (the program affects everyone equally), ATE and ATT are the same. But when treatment effects are &lt;strong>heterogeneous&lt;/strong> (the program benefits some households more than others), they can differ. For example, if poorer households benefit more from cash transfers and the treatment group has a higher share of poor households, the ATT could be larger than the ATE.&lt;/p>
&lt;p>Understanding this distinction is critical because different methods target different estimands. Cross-sectional methods (RA, IPW, DR) can estimate &lt;strong>either&lt;/strong> ATE or ATT. Difference-in-differences inherently estimates the &lt;strong>ATT only&lt;/strong>. We will return to this point in Section 9.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note on RCTs&lt;/strong> &amp;mdash; In a randomized experiment, treatment assignment is independent of potential outcomes. This means that simple comparisons between treatment and control groups are already unbiased estimates of the ATE. When we add covariates (regression adjustment, IPW, doubly robust), we are not removing bias &amp;mdash; we are &lt;strong>improving precision&lt;/strong> by accounting for residual variation. This is different from observational studies, where covariate adjustment is needed to address confounding.&lt;/p>
&lt;/blockquote>
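&lt;p>To make the ATE/ATT distinction concrete, consider a toy calculation (hypothetical effect sizes, not estimates from this tutorial&amp;rsquo;s data): suppose poor households gain 0.20 log points from the transfer, non-poor households gain 0.05, and the treated group happens to contain a higher share of poor households.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy population: 100 poor and 100 non-poor households.
# Hypothetical individual effects: 0.20 if poor, 0.05 if not.
effects = [0.20] * 100 + [0.05] * 100

# ATE averages over the entire population.
ate = sum(effects) / len(effects)

# Suppose the treated group contains 70 poor and 30 non-poor households;
# the ATT averages only over that group.
treated_effects = [0.20] * 70 + [0.05] * 30
att = sum(treated_effects) / len(treated_effects)

print(round(ate, 3), round(att, 3))  # 0.125 0.155
&lt;/code>&lt;/pre>
&lt;p>Here the ATT (0.155) exceeds the ATE (0.125) purely because of who ended up in the treated group &amp;mdash; exactly the heterogeneity scenario described above.&lt;/p>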
&lt;hr>
&lt;h2 id="7-three-strategies-for-causal-estimation">7. Three strategies for causal estimation&lt;/h2>
&lt;p>We now understand &lt;em>what&lt;/em> we want to estimate (ATE and ATT from Section 6). The question becomes &lt;em>how&lt;/em> to estimate it. Three families of methods exist, each taking a fundamentally different approach to solving the missing-data problem at the heart of causal inference. Each method models a different part of the data-generating process, and understanding these differences is essential for interpreting results and choosing the right tool.&lt;/p>
&lt;h3 id="71-regression-adjustment-ra-----modeling-the-outcome">7.1 Regression Adjustment (RA) &amp;mdash; modeling the outcome&lt;/h3>
&lt;p>Regression adjustment solves the missing-data problem by &lt;strong>predicting the unobserved potential outcomes&lt;/strong>. It fits separate regression models for treated and untreated groups. For each household, it uses these models to predict two potential outcomes: what consumption would be if treated, $\hat{\mu}_1(X_i)$, and what consumption would be if untreated, $\hat{\mu}_0(X_i)$. Since we only observe one of these for each household, the model fills in the missing counterfactual. The treatment effect for each household is the difference between the two predictions, and the ATE is the average across all households.&lt;/p>
&lt;p>The Stata documentation describes this succinctly: &lt;em>&amp;ldquo;RA estimators use means of predicted outcomes for each treatment level to estimate each POM. ATEs and ATETs are differences in estimated POMs.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; predicting exam scores.&lt;/strong> Imagine two study methods (A and B) being tested on students. You observe each student using only one method. RA fits a model predicting test scores based on student characteristics (prior GPA, hours studied) separately for method-A and method-B users. Then, for &lt;em>every&lt;/em> student, it predicts what their score would have been under &lt;em>both&lt;/em> methods &amp;mdash; even the one they did not use. The average difference in predicted scores is the treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Each household observed&amp;lt;br/&amp;gt;under ONE treatment only&amp;quot;]
M0[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using control group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
M1[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using treated group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
P0[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₀&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
P1[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₁&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Average of&amp;lt;br/&amp;gt;(Ŷ₁ − Ŷ₀)&amp;quot;]
DATA --&amp;gt; M0
DATA --&amp;gt; M1
M0 --&amp;gt; P0
M1 --&amp;gt; P1
P0 --&amp;gt; ATE
P1 --&amp;gt; ATE
style DATA fill:#141413,stroke:#6a9bcc,color:#fff
style M0 fill:#6a9bcc,stroke:#141413,color:#fff
style M1 fill:#6a9bcc,stroke:#141413,color:#fff
style P0 fill:#6a9bcc,stroke:#141413,color:#fff
style P1 fill:#6a9bcc,stroke:#141413,color:#fff
style ATE fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The RA estimator.&lt;/strong> Formally, the ATE under regression adjustment is:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $\hat{\mu}_1(X)$ is the predicted outcome under treatment (fitted from treated observations) and $\hat{\mu}_0(X)$ is the predicted outcome under control (fitted from untreated observations), both evaluated at each household&amp;rsquo;s covariates $X_i$. In plain language: for each household, the model predicts what their consumption would be if they received the cash transfer and what it would be if they did not. The difference is the household&amp;rsquo;s estimated treatment effect. Averaging these across all $N$ households gives the ATE.&lt;/p>
&lt;p>For the ATT, we restrict the average to treated units only:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATT} = \frac{1}{N_1} \sum_{i: T_i = 1} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $N_1$ is the number of treated households.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Consider Household A: a 40-year-old female in poverty with 10 years of education. The treated outcome model predicts her consumption at 10.17 log points. The untreated outcome model predicts 10.05. Her estimated individual treatment effect is $10.17 - 10.05 = 0.12$. Averaging such predictions over all 2,000 endline households gives the ATE.&lt;/p>
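&lt;p>The RA recipe &amp;mdash; fit an outcome model per arm, predict both potential outcomes for everyone, average the difference &amp;mdash; can be sketched in a few lines. The toy example below (plain Python, made-up numbers, a single binary covariate) uses cell means as a saturated outcome model:&lt;/p>
&lt;pre>&lt;code class="language-python"># Each record: (treated, poor, outcome in log points). Made-up values.
data = [
    (1, 1, 10.15), (1, 1, 10.25), (1, 0, 10.40), (1, 0, 10.30),
    (0, 1, 10.00), (0, 1, 9.90), (0, 0, 10.30), (0, 0, 10.20),
]

def mu(t, x):
    # Outcome model for arm t: with one binary covariate, the cell mean
    # is a saturated regression of y on x within that arm.
    ys = [y for (ti, xi, y) in data if ti == t and xi == x]
    return sum(ys) / len(ys)

# Predict both potential outcomes for every household, then average.
effects = [mu(1, x) - mu(0, x) for (t, x, y) in data]
ate = sum(effects) / len(effects)
print(round(ate, 3))  # 0.175
&lt;/code>&lt;/pre>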
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ra&lt;/code> command fits linear outcome models by default. The first parenthesis specifies the outcome model (outcome variable + covariates), and the second specifies the treatment variable: &lt;code>teffects ra (y c.age c.edu i.female i.poverty) (treat), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; model misspecification.&lt;/strong> RA&amp;rsquo;s Achilles heel is that it relies entirely on the outcome model being correctly specified. If consumption depends on age nonlinearly (for example, a U-shaped relationship), but we assume a linear model, the predictions $\hat{\mu}_1$ and $\hat{\mu}_0$ will be systematically wrong, biasing the ATE. As the Stata manual notes, RA works well when the outcome model is correct, but &amp;ldquo;relying on a correctly specified outcome model with little data is extremely risky.&amp;rdquo; RA gives the right answer &lt;strong>only if the outcome model is correct&lt;/strong>. If it is wrong, the ATE estimate can be biased even with infinite data.&lt;/p>
&lt;p>What if we are unsure about the functional form of the outcome model? Is there an approach that avoids modeling the outcome entirely?&lt;/p>
&lt;h3 id="72-inverse-probability-weighting-ipw-----modeling-the-treatment-assignment">7.2 Inverse Probability Weighting (IPW) &amp;mdash; modeling the treatment assignment&lt;/h3>
&lt;p>IPW takes the opposite approach. Instead of modeling consumption, it models the probability of being assigned to treatment &amp;mdash; the &lt;strong>propensity score&lt;/strong>, defined as $p(X) = \Pr(T = 1 \mid X)$. It then reweights observations so that the treatment and control groups become comparable. The Stata documentation explains: &lt;em>&amp;ldquo;IPW estimators use weighted averages of the observed outcome variable to estimate means of the potential outcomes. The weights account for the missing data inherent in the potential-outcome framework.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>The logic is elegant: in a perfectly randomized experiment, every household has the same 50% chance of treatment, and a simple comparison of means is unbiased. When chance imbalances arise (like our 9.3% gender SMD), the estimated propensity scores deviate slightly from 0.50. IPW corrects for these imbalances by making the reweighted sample look as if randomization had been perfect &amp;mdash; without ever modeling the outcome.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; opinion polling.&lt;/strong> Election pollsters know their survey overrepresents some demographics. If 60% of respondents are college graduates but only 35% of voters are, pollsters give lower weight to each college graduate&amp;rsquo;s response and higher weight to non-graduates. IPW does the same thing for treatment groups &amp;mdash; it reweights households so the treated and control groups have the same covariate distribution.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Treatment and control groups&amp;lt;br/&amp;gt;may have imbalances&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Estimate propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;p(X) = Pr(T=1 | X)&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;via logistic regression&amp;lt;/i&amp;gt;&amp;quot;]
WT[&amp;quot;&amp;lt;b&amp;gt;Compute weights&amp;lt;/b&amp;gt;&amp;quot;]
WTR[&amp;quot;Treated: weight = 1/p(X)&amp;quot;]
WCT[&amp;quot;Control: weight = 1/(1−p(X))&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Weighted mean(treated)&amp;lt;br/&amp;gt;− Weighted mean(control)&amp;quot;]
DATA --&amp;gt; PS
PS --&amp;gt; WT
WT --&amp;gt; WTR
WT --&amp;gt; WCT
WTR --&amp;gt; ATE
WCT --&amp;gt; ATE
style DATA fill:#141413,stroke:#d97757,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style WT fill:#d97757,stroke:#141413,color:#fff
style WTR fill:#d97757,stroke:#141413,color:#fff
style WCT fill:#d97757,stroke:#141413,color:#fff
style ATE fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The propensity score.&lt;/strong> The propensity score is estimated via logistic regression:&lt;/p>
&lt;p>$$\hat{p}(X_i) = \Pr(T_i = 1 \mid X_i) = \text{logit}^{-1}(\hat{\alpha} + \hat{\beta}' X_i)$$&lt;/p>
&lt;p>In plain language: we fit a logistic model predicting whether each household was assigned to treatment, based on their covariates (age, education, gender, poverty status). The predicted probability is their propensity score.&lt;/p>
&lt;p>&lt;strong>The IPW estimator.&lt;/strong> The ATE under IPW is:&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i \cdot Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>Each treated household&amp;rsquo;s outcome is divided by its probability of being treated &amp;mdash; this upweights treated households that &amp;ldquo;look like&amp;rdquo; control households (the Stata manual calls this placing &amp;ldquo;a larger weight on those observations for which $y_{1i}$ is observed even though its observation was not likely&amp;rdquo;). Each control household&amp;rsquo;s outcome is divided by its probability of being in the control group. The reweighting creates a pseudo-population where treatment assignment is independent of covariates.&lt;/p>
&lt;p>For the ATT, only the control group needs reweighting (because the treated group is already the reference population):&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATT} = \frac{1}{N_1} \sum_{i=1}^{N} \left[ T_i \cdot Y_i - \frac{(1 - T_i) \cdot \hat{p}(X_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> In our RCT, a female household in poverty might have $\hat{p}(X) = 0.52$ (slightly more likely to be treated due to the gender imbalance). If treated, her weight is $1/0.52 = 1.92$. If in the control group, her weight is $1/(1 - 0.52) = 2.08$. A male non-poor household might have $\hat{p}(X) = 0.49$, giving weights close to 2.0 in either group. These mild adjustments rebalance the groups to remove the chance gender imbalance.&lt;/p>
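&lt;p>The full IPW recipe &amp;mdash; estimate propensity scores, form weights, compare weighted means &amp;mdash; can be sketched in a few lines. The simulation below is illustrative, not the &lt;code>dataSIM4RCT&lt;/code> data: the sample size, coefficients, and single &lt;code>female&lt;/code> covariate are all made up. With one binary covariate, the logistic model is saturated, so the estimated propensity score is simply the share of treated households within each gender.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, tau = 20_000, 0.12   # hypothetical sample size and true effect (log points)

# Simulated RCT with a chance gender imbalance: females are slightly
# over-assigned to treatment, and gender also shifts the outcome.
female = rng.binomial(1, 0.5, n)
p_true = np.where(female == 1, 0.52, 0.48)
T = rng.binomial(1, p_true)
Y = 10 + 0.1 * female + tau * T + rng.normal(0, 0.4, n)  # log consumption

# Step 1: estimate p(X). A saturated logistic model with one binary
# covariate reduces to the within-gender share of treated households.
p_hat = np.where(female == 1, T[female == 1].mean(), T[female == 0].mean())

# Step 2: inverse-probability weights (treated: 1/p; control: 1/(1-p)).
w = np.where(T == 1, 1 / p_hat, 1 / (1 - p_hat))

# Step 3: ATE as the difference of weighted means.
ate = (np.average(Y[T == 1], weights=w[T == 1])
       - np.average(Y[T == 0], weights=w[T == 0]))
print(round(ate, 3))
```

&lt;p>The normalized weighted means used here are the H&amp;aacute;jek form of the estimator, which is far more stable than the raw sum in the ATE formula above when the outcome has a large mean.&lt;/p>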
&lt;p>&lt;strong>Why IPW matters even in RCTs.&lt;/strong> In a perfect RCT, the true propensity score is exactly 0.50 for everyone, and IPW does nothing. But finite samples produce chance imbalances. IPW uses the estimated propensity scores (which deviate slightly from 0.50) to correct for these imbalances without making any assumptions about how covariates affect the outcome.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ipw&lt;/code> command fits a logistic treatment model by default. Note that the first parenthesis specifies only the outcome variable (no covariates &amp;mdash; IPW does not model the outcome), and the second specifies the treatment model: &lt;code>teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; extreme weights.&lt;/strong> IPW&amp;rsquo;s vulnerability is extreme propensity scores. If $\hat{p}(X) = 0.01$ for some household, the weight becomes $1/0.01 = 100$ &amp;mdash; that single household dominates the ATE estimate, causing high variance and instability. The Stata manual warns: &lt;em>&amp;ldquo;When propensity scores are extreme (near 0 or 1), the inverse weights become very large, producing unstable estimates.&amp;quot;&lt;/em> This happens when the treatment and control groups have poor &lt;strong>overlap&lt;/strong> &amp;mdash; some covariate combinations appear only in one group. In our well-designed RCT, all propensity scores are between 0.43 and 0.55 (we verified this in Section 5.4), so extreme weights are not a concern.&lt;/p>
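&lt;p>A tiny numeric illustration of the extreme-weight problem (the propensity scores below are made up): one household with $\hat{p}(X) = 0.01$ receives a weight of 100 and dominates the weighted average.&lt;/p>

```python
import numpy as np

# Hypothetical propensity scores for four treated households;
# the last one is extreme.
p = np.array([0.50, 0.45, 0.10, 0.01])
w = 1 / p                    # inverse-probability weights: 2, 2.2, 10, 100
share = w[-1] / w.sum()      # weight share of the extreme household
print(w, round(share, 2))
```

&lt;p>The extreme household carries roughly 88% of the total weight, so its outcome almost single-handedly determines the weighted mean.&lt;/p>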
&lt;p>RA works well if the outcome model is correct but can be biased if it is wrong. IPW works well if the propensity score model is correct but can be unstable if it is wrong. Is there a method that protects us against both types of misspecification?&lt;/p>
&lt;h3 id="73-doubly-robust-dr-----modeling-both">7.3 Doubly Robust (DR) &amp;mdash; modeling both&lt;/h3>
&lt;p>Doubly robust methods combine RA and IPW into a single estimator. They fit an outcome model &lt;strong>and&lt;/strong> estimate a propensity score. The key property &amp;mdash; the reason they are called &amp;ldquo;doubly robust&amp;rdquo; &amp;mdash; is that the estimator is consistent (converges to the true treatment effect with enough data) if &lt;strong>either&lt;/strong> the outcome model &lt;strong>or&lt;/strong> the propensity score model is correctly specified. You do not need both to be right &amp;mdash; just one.&lt;/p>
&lt;p>The Stata manual describes this property: &lt;em>&amp;ldquo;AIPW estimators model both the outcome and the treatment probability. A surprising fact is that only one of the two models must be correctly specified to consistently estimate the treatment effects.&amp;quot;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; backup power.&lt;/strong> Think of a house with two independent power sources: the electrical grid (the outcome model) and a solar panel system (the propensity score model). If the grid goes down (outcome model is misspecified), solar power keeps the lights on. If clouds block the solar panels (propensity score model is wrong), the grid still works. As long as at least one power source is functioning, the house stays lit. That is doubly robust estimation &amp;mdash; as long as at least one model is correct, the estimator gives the right answer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;quot;]
RA_C[&amp;quot;&amp;lt;b&amp;gt;RA component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict Ŷ₁ and Ŷ₀&amp;lt;br/&amp;gt;for each household&amp;quot;]
IPW_C[&amp;quot;&amp;lt;b&amp;gt;IPW component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate propensity&amp;lt;br/&amp;gt;score p(X)&amp;quot;]
RESID[&amp;quot;&amp;lt;b&amp;gt;Prediction errors&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Y − Ŷ for each&amp;lt;br/&amp;gt;household&amp;quot;]
CORRECT[&amp;quot;&amp;lt;b&amp;gt;Bias-correction term&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;IPW-weighted residuals&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;DR estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;= RA prediction&amp;lt;br/&amp;gt;+ Bias correction&amp;quot;]
DATA --&amp;gt; RA_C
DATA --&amp;gt; IPW_C
RA_C --&amp;gt; RESID
IPW_C --&amp;gt; CORRECT
RESID --&amp;gt; CORRECT
RA_C --&amp;gt; DR
CORRECT --&amp;gt; DR
style DATA fill:#141413,stroke:#00d4c8,color:#fff
style RA_C fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_C fill:#d97757,stroke:#141413,color:#fff
style RESID fill:#6a9bcc,stroke:#141413,color:#fff
style CORRECT fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The AIPW estimator.&lt;/strong> The most common doubly robust form is Augmented Inverse Probability Weighting (AIPW):&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>This equation has two clearly interpretable components:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RA component&lt;/strong> (first two terms): $\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$ &amp;mdash; the regression adjustment prediction, exactly as in Section 7.1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bias-correction component&lt;/strong> (last two terms): IPW-weighted residuals $(Y_i - \hat{\mu})$ &amp;mdash; the difference between actual and predicted outcomes, weighted by inverse propensity scores&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In plain language: start with the RA prediction of each household&amp;rsquo;s treatment effect. Then ask: how far off was that prediction from reality? Weight those prediction errors by the inverse propensity score (for treated households) or the inverse of one minus it (for control households). If RA was already right, the errors average to zero and you just get RA. If RA was wrong but IPW is right, the weighted errors exactly cancel the RA bias.&lt;/p>
&lt;p>&lt;strong>Why the magic works &amp;mdash; four scenarios.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Outcome model correct, propensity model wrong:&lt;/strong> The residuals $(Y_i - \hat{\mu})$ are zero on average, so the correction terms vanish. DR reduces to RA. Correct answer.&lt;/li>
&lt;li>&lt;strong>Propensity model correct, outcome model wrong:&lt;/strong> The IPW reweighting is valid, so the correction terms fix the RA bias. Correct answer.&lt;/li>
&lt;li>&lt;strong>Both models correct:&lt;/strong> Both components work together, producing the most efficient estimate.&lt;/li>
&lt;li>&lt;strong>Both models wrong:&lt;/strong> Neither safety net catches the error. The estimate can be biased. DR provides insurance, not invincibility.&lt;/li>
&lt;/ol>
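&lt;p>Scenario 2 &amp;mdash; wrong outcome model, correct propensity model &amp;mdash; can be checked with a small simulation. Everything below is hypothetical: an observational setup where a confounder $X$ drives both treatment and outcome, an outcome model deliberately misspecified by ignoring $X$ entirely (so the RA estimate collapses to a confounded difference in group means), and the true propensity score standing in for the &amp;ldquo;correct&amp;rdquo; treatment model.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 50_000, 1.0            # hypothetical sample size and true effect

# Observational data: X raises both treatment probability and the outcome.
X = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-1.2 * X))  # true propensity score (the correct model)
T = rng.binomial(1, p)
Y = 2 * X + tau * T + rng.normal(0, 1, n)

# Misspecified outcome model: ignore X, so mu1 and mu0 are group means
# and the RA estimate is just the confounded difference in means.
mu1, mu0 = Y[T == 1].mean(), Y[T == 0].mean()
ra = mu1 - mu0

# AIPW: RA prediction plus IPW-weighted residuals. With a correct
# propensity score, the correction term cancels the outcome-model bias.
aipw = np.mean(mu1 - mu0
               + T * (Y - mu1) / p
               - (1 - T) * (Y - mu0) / (1 - p))
print(round(ra, 2), round(aipw, 2))
```

&lt;p>On a typical draw the misspecified RA estimate lands far above the true effect of 1.0, while the AIPW estimate stays close to it &amp;mdash; the safety net in action.&lt;/p>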
&lt;p>&lt;strong>AIPW vs. IPWRA in Stata.&lt;/strong> Stata offers two doubly robust commands. &lt;code>teffects aipw&lt;/code> augments the IPW estimator with an outcome-model correction (the equation above). &lt;code>teffects ipwra&lt;/code> applies propensity score weights to the regression adjustment &amp;mdash; arriving at the same property from the other direction. Both are doubly robust and produce nearly identical results in practice.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> Both commands require specifying the outcome model in the first parenthesis and the treatment model in the second: &lt;code>teffects ipwra (y c.age c.edu i.female i.poverty) (treat c.age c.edu i.female i.poverty), vce(robust)&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong.&lt;/strong> DR fails only when &lt;strong>both&lt;/strong> models are wrong. This is much less likely than either single model being wrong &amp;mdash; getting at least one model approximately right is much easier than getting both perfectly right. However, the Stata manual notes: &lt;em>&amp;ldquo;When both the outcome and the treatment model are misspecified, which estimator is more robust is a matter of debate.&amp;quot;&lt;/em> Using flexible specifications (polynomials, interactions) reduces the risk of both models failing simultaneously.&lt;/p>
&lt;h3 id="comparison-of-the-three-approaches">Comparison of the three approaches&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>RA&lt;/th>
&lt;th>IPW&lt;/th>
&lt;th>DR (AIPW/IPWRA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Models the outcome?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Models the treatment?&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Key equation&lt;/td>
&lt;td>$\hat{\mu}_1(X) - \hat{\mu}_0(X)$&lt;/td>
&lt;td>$T \cdot Y / \hat{p}(X)$&lt;/td>
&lt;td>RA + IPW-weighted residuals&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if outcome model correct?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if treatment model correct?&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Main vulnerability&lt;/td>
&lt;td>Outcome misspecification&lt;/td>
&lt;td>Extreme weights&lt;/td>
&lt;td>Both models wrong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stata command&lt;/td>
&lt;td>&lt;code>teffects ra&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipw&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipwra&lt;/code> / &lt;code>teffects aipw&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-mermaid">graph LR
RA[&amp;quot;&amp;lt;b&amp;gt;Regression Adjustment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the outcome&amp;quot;]
IPW[&amp;quot;&amp;lt;b&amp;gt;Inverse Probability&amp;lt;br/&amp;gt;Weighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the treatment&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;Doubly Robust&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models both&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Consistent if either&amp;lt;br/&amp;gt;model is correct&amp;lt;/i&amp;gt;&amp;quot;]
RA --&amp;gt; DR
IPW --&amp;gt; DR
style RA fill:#6a9bcc,stroke:#141413,color:#fff
style IPW fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimator combines the strengths of both RA and IPW. It is the &lt;strong>standard recommendation in modern causal inference&lt;/strong> because it provides an extra layer of protection against model misspecification. Now that we understand what each method does, what it assumes, and what can go wrong, let us apply all three to our cash transfer data and compare their results.&lt;/p>
&lt;hr>
&lt;h2 id="8-cross-sectional-estimation-at-endline-----ra-ipw-and-dr">8. Cross-sectional estimation at endline &amp;mdash; RA, IPW, and DR&lt;/h2>
&lt;p>We now estimate treatment effects using only endline data. For each method, we compute both the &lt;strong>ATE&lt;/strong> (the policymaker&amp;rsquo;s quantity) and the &lt;strong>ATT&lt;/strong> (the evaluator&amp;rsquo;s quantity).&lt;/p>
&lt;h3 id="81-simple-difference-in-means">8.1 Simple difference in means&lt;/h3>
&lt;p>The simplest approach is to compare mean outcomes between treated and control groups at endline.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
reg y treat, robust
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 2,000
F(1, 1998) = 35.43
Prob &amp;gt; F = 0.0000
R-squared = 0.0174
Root MSE = .43449
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
treat | .1157465 .0194443 5.95 0.000 .0776132 .1538798
_cons | 10.05374 .014001 718.07 0.000 10.02628 10.0812
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The simple difference in means yields an estimate of 0.116 (SE = 0.019, p &amp;lt; 0.001, 95% CI [0.078, 0.154]). Because the outcome is in logs, this means being offered the cash transfer increased household consumption by approximately 11.6%. This estimate is close to the true effect of 12% and is our benchmark for comparison. However, it does not adjust for the gender imbalance we discovered at baseline.&lt;/p>
&lt;h3 id="82-regression-adjustment-----ate-and-att">8.2 Regression Adjustment &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>Regression adjustment models the outcome as a function of treatment and covariates, then computes predicted outcomes under treatment and control for each observation.&lt;/p>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect (ATE)
teffects ra (y c.age c.edu i.female i.poverty) (treat), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1125431 .0190927 5.89 0.000 .0751221 .1499641
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05503 .0138703 724.93 0.000 10.02785 10.08222
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect on the Treated (ATT)
teffects ra (y c.age c.edu i.female i.poverty) (treat), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1132537 .0191498 5.91 0.000 .0757208 .1507865
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05623 .0140082 717.88 0.000 10.02878 10.08369
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The RA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). The ATE and ATT are nearly identical, which confirms that treatment effects are approximately &lt;strong>homogeneous&lt;/strong> across households. The RA approach models the outcome with covariates (age, education, gender, poverty), which adjusts for the baseline gender imbalance and can improve precision.&lt;/p>
&lt;h3 id="83-inverse-probability-weighting-----ate-and-att">8.3 Inverse Probability Weighting &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>IPW reweights observations based on their estimated probability of treatment, without modeling the outcome.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect (ATE)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126713 .0190886 5.90 0.000 .0752583 .1500844
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05495 .0138651 725.20 0.000 10.02778 10.08213
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect on the Treated (ATT)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1134031 .0191397 5.93 0.000 .0758899 .1509162
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05608 .0140004 718.27 0.000 10.02864 10.08352
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The IPW estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA results, which is expected in a well-designed RCT where propensity scores are near 0.50 for all households. Notice that IPW does &lt;strong>not&lt;/strong> model the outcome &amp;mdash; it only models the treatment assignment process using the propensity score. The close agreement between RA and IPW gives us confidence that both the outcome model and the treatment model are approximately correct.&lt;/p>
&lt;h3 id="84-doubly-robust-----ate-and-att-ipwra">8.4 Doubly Robust &amp;mdash; ATE and ATT (IPWRA)&lt;/h3>
&lt;p>The doubly robust IPWRA estimator combines outcome modeling and propensity score weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect (ATE)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .112639 .0190901 5.90 0.000 .0752231 .1500549
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .0138677 725.07 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect on the Treated (ATT)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), atet vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1133162 .0191469 5.92 0.000 .0757889 .1508435
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05617 .0140019 718.20 0.000 10.02873 10.08361
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust IPWRA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA and IPW estimates, confirming that all three approaches converge in this well-designed RCT. The DR method provides the most reliable cross-sectional estimate because it is protected against misspecification of either the outcome or treatment model.&lt;/p>
&lt;h3 id="85-doubly-robust-----aipw-alternative">8.5 Doubly Robust &amp;mdash; AIPW alternative&lt;/h3>
&lt;p>As a robustness check, we can also compute the doubly robust estimate using the AIPW formulation instead of IPWRA.&lt;/p>
&lt;pre>&lt;code class="language-stata">* AIPW: Average Treatment Effect (ATE)
teffects aipw (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear by ML
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126412 .0190903 5.90 0.000 .075225 .1500574
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .013868 725.05 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW estimate of ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) is virtually identical to the IPWRA result (0.113). Both are doubly robust &amp;mdash; the difference lies in the computational approach (AIPW augments the IPW estimator with a bias-correction term, while IPWRA applies IPW weights to the regression adjustment), but the asymptotic properties are the same and, as seen here, the estimates agree to four decimal places.&lt;/p>
&lt;h3 id="86-cross-sectional-comparison">8.6 Cross-sectional comparison&lt;/h3>
&lt;p>The table below summarizes all cross-sectional estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several patterns emerge from this comparison. First, &lt;strong>ATE and ATT are nearly identical&lt;/strong> for every method, confirming that treatment effects are homogeneous across households. Second, &lt;strong>RA, IPW, and DR all give remarkably similar results&lt;/strong> (all approximately 0.113) because, in this well-designed RCT, randomization ensures that both the outcome model and the propensity score model are approximately correct. Third, the simple difference in means (0.116) is slightly higher than the covariate-adjusted estimates (0.113); the shift in the point estimate reflects the correction for the chance gender imbalance, while the covariates also deliver a small precision gain (the robust SE falls from 0.0194 to 0.0191). Finally, all confidence intervals contain the true effect of 0.12 &amp;mdash; every method successfully recovers the correct answer.&lt;/p>
&lt;p>The real value of doubly robust methods becomes apparent in less ideal settings. When one model might be misspecified &amp;mdash; a common situation in practice &amp;mdash; DR methods provide insurance that RA or IPW alone cannot offer.&lt;/p>
&lt;hr>
&lt;h2 id="9-leveraging-panel-data-----difference-in-differences">9. Leveraging panel data &amp;mdash; Difference-in-Differences&lt;/h2>
&lt;p>All estimates in Section 8 used only endline data. But we have panel data &amp;mdash; the same 2,000 households observed before and after the intervention. Can we do better?&lt;/p>
&lt;h3 id="91-why-use-panel-data">9.1 Why use panel data?&lt;/h3>
&lt;p>Cross-sectional methods (RA, IPW, DR) compare treated and control groups at a single point in time &amp;mdash; the endline. They control for &lt;strong>observable&lt;/strong> covariates like age, education, and gender. But there may be &lt;strong>unobservable&lt;/strong> characteristics &amp;mdash; household motivation, geographic advantages, cultural factors &amp;mdash; that differ between groups and affect consumption. No amount of cross-sectional covariate adjustment can control for these, because we simply do not observe them.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; comparing students across schools.&lt;/strong> Imagine comparing test scores between students at a charter school (treatment) and a traditional school (control). You can adjust for observable differences like family income and prior grades. But what about unmeasured factors &amp;mdash; parental involvement, neighborhood quality, student ambition? A cross-sectional comparison cannot disentangle the school effect from these hidden differences. Now suppose you observe the &lt;em>same students&lt;/em> before and after they switch schools. By comparing each student&amp;rsquo;s score change, you automatically cancel out all fixed student characteristics &amp;mdash; because they are the same at both time points. That is the power of panel data.&lt;/p>
&lt;p>Panel data methods like difference-in-differences (DiD) solve this problem by comparing each household &lt;strong>to itself&lt;/strong> over time. By looking at how each household&amp;rsquo;s consumption changed from baseline to endline, we effectively control for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> (household fixed effects). This is a powerful advantage that cross-sectional methods cannot replicate.&lt;/p>
&lt;h4 id="the-did-estimator">The DiD estimator&lt;/h4>
&lt;p>The DiD estimator computes a simple but powerful quantity &amp;mdash; a &amp;ldquo;difference of differences&amp;rdquo;:&lt;/p>
&lt;p>$$\hat{\tau}_{DiD} = \underbrace{(\bar{Y}_{treat,post} - \bar{Y}_{treat,pre})}_{\text{Change for treated}} - \underbrace{(\bar{Y}_{control,post} - \bar{Y}_{control,pre})}_{\text{Change for control}}$$&lt;/p>
&lt;p>The first difference ($\bar{Y}_{treat,post} - \bar{Y}_{treat,pre}$) captures the treatment group&amp;rsquo;s change over time &amp;mdash; the treatment effect &lt;strong>plus&lt;/strong> any common time trend (e.g., economic growth that affects all households). The second difference ($\bar{Y}_{control,post} - \bar{Y}_{control,pre}$) captures the control group&amp;rsquo;s change &amp;mdash; the common time trend &lt;strong>only&lt;/strong>, since they did not receive treatment. Subtracting the second from the first removes the time trend, isolating the treatment effect.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Suppose the treated group&amp;rsquo;s average log consumption went from 10.01 at baseline to 10.17 at endline (change = +0.16). The control group went from 10.03 to 10.06 (change = +0.03). The DiD estimate is $0.16 - 0.03 = 0.13$ &amp;mdash; close to the true effect of 0.12. The control group&amp;rsquo;s +0.03 change captures the natural time trend that would have affected everyone, and subtracting it isolates the treatment effect.&lt;/p>
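&lt;p>The arithmetic of the mini example is just two subtractions (the group means below are the hypothetical values quoted above):&lt;/p>

```python
# Hypothetical group means of log consumption from the mini example.
treat_pre, treat_post = 10.01, 10.17
ctrl_pre, ctrl_post = 10.03, 10.06

# Difference of differences: (change for treated) - (change for control).
did = (treat_post - treat_pre) - (ctrl_post - ctrl_pre)
print(round(did, 2))
```

&lt;p>The control group&amp;rsquo;s +0.03 change is the estimate of the common time trend; subtracting it leaves the 0.13 treatment effect.&lt;/p>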
&lt;h4 id="the-parallel-trends-assumption">The parallel trends assumption&lt;/h4>
&lt;p>The key identifying assumption of DiD is the &lt;strong>parallel trends assumption (PTA)&lt;/strong>: absent the treatment, the treatment and control groups would have followed the same time trend. Formally:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Notation note&lt;/strong> &amp;mdash; In the DiD literature and in the Sant&amp;rsquo;Anna and Zhao (2020) paper, $D$ denotes treatment group assignment (equivalent to our &lt;code>treat&lt;/code> variable). This differs from our data dictionary where &lt;code>D&lt;/code> is the receipt indicator. In this section and Section 9.4, we follow the paper&amp;rsquo;s convention: $D = 1$ means assigned to treatment, $D = 0$ means assigned to control.&lt;/p>
&lt;/blockquote>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1] = E[Y_1(0) - Y_0(0) \mid D = 0]$$&lt;/p>
&lt;p>This says that the average change in &lt;em>untreated&lt;/em> potential outcomes is the same for the treated and control groups. Note that this does &lt;strong>not&lt;/strong> require the two groups to have the same &lt;em>level&lt;/em> of consumption &amp;mdash; only the same &lt;em>trend&lt;/em>. The treated group can start higher or lower, as long as their consumption would have evolved at the same rate as the control group in the absence of the program.&lt;/p>
&lt;p>In an RCT, the parallel trends assumption is very plausible because randomization ensures the groups were similar at baseline. Any pre-existing differences between groups occurred by chance and are unlikely to produce different time trends. This makes DiD a strong estimator in our setting.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
subgraph &amp;quot;Parallel Trends Assumption&amp;quot;
PRE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;quot;]
POST[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;quot;]
end
PRE --&amp;gt;|&amp;quot;Treated group&amp;lt;br/&amp;gt;change = effect + trend&amp;quot;| POST
PRE --&amp;gt;|&amp;quot;Control group&amp;lt;br/&amp;gt;change = trend only&amp;quot;| POST
style PRE fill:#6a9bcc,stroke:#141413,color:#fff
style POST fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="92-why-does-did-estimate-att-and-not-ate">9.2 Why does DiD estimate ATT and not ATE?&lt;/h3>
&lt;p>This is a point that many beginners miss, so it is worth explaining carefully.&lt;/p>
&lt;p>Recall from Section 6 that the ATT is $E[Y_1(1) - Y_1(0) \mid D = 1]$ &amp;mdash; the effect on those who were treated. Sant&amp;rsquo;Anna and Zhao (2020) make this explicit: the main challenge is computing $E[Y_1(0) \mid D = 1]$ &amp;mdash; what would the treated group&amp;rsquo;s consumption have been at endline &lt;em>without&lt;/em> the program?&lt;/p>
&lt;p>DiD solves this by using the control group&amp;rsquo;s time trend as a stand-in. Specifically, it constructs the counterfactual for the treated group as:&lt;/p>
&lt;p>$$\underbrace{E[Y_1(0) \mid D = 1]}_{\text{Counterfactual}} = \underbrace{E[Y_0 \mid D = 1]}_{\text{Treated at baseline}} + \underbrace{(E[Y_1 \mid D = 0] - E[Y_0 \mid D = 0])}_{\text{Control group&amp;rsquo;s time trend}}$$&lt;/p>
&lt;p>This counterfactual is &lt;strong>specific to the treated group&lt;/strong> &amp;mdash; it starts from their baseline level and adds the control group&amp;rsquo;s trend. DiD therefore estimates what happened to the treated group relative to this counterfactual. This is precisely the ATT.&lt;/p>
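&lt;p>A quick numeric check, reusing the group means from the mini example in Section 9.1, shows the identity in action (plain Python, illustrative only):&lt;/p>
&lt;pre>&lt;code class="language-python"># Counterfactual = treated baseline + control group trend
treat_pre, treat_post = 10.01, 10.17
ctrl_pre, ctrl_post = 10.03, 10.06

control_trend = ctrl_post - ctrl_pre            # what time alone would do
counterfactual = treat_pre + control_trend      # E[Y1(0) given D = 1]
att = treat_post - counterfactual               # ATT recovered by DiD
print(round(counterfactual, 2), round(att, 2))
&lt;/code>&lt;/pre>
&lt;p>The counterfactual endline mean is 10.04, and the implied ATT of 0.13 is exactly the difference of differences computed in Section 9.1.&lt;/p>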
&lt;p>&lt;strong>Why not the ATE?&lt;/strong> To estimate the ATE, we would also need the treatment effect for the untreated &amp;mdash; what would happen if we gave the program to those who did not receive it. DiD does not provide this, because the counterfactual it constructs runs in only one direction (control trend applied to treated baseline, not treated trend applied to control baseline).&lt;/p>
&lt;p>&lt;strong>In our RCT context&lt;/strong>, since treatment was randomly assigned, ATE and ATT are likely very similar (as we saw in Section 8). But in observational studies with heterogeneous treatment effects, this distinction matters greatly. A job-training program might have a larger effect on those who voluntarily enrolled (ATT) than it would have on randomly selected workers (ATE).&lt;/p>
&lt;h3 id="93-basic-did-with-panel-fixed-effects">9.3 Basic DiD with panel fixed effects&lt;/h3>
&lt;p>We now implement the basic DiD estimator using Stata&amp;rsquo;s &lt;code>xtdidregress&lt;/code> command, which handles the panel structure and computes clustered standard errors.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
* Create the treatment-post interaction
gen treat_post = treat * post
label var treat_post &amp;quot;Treated x Post (1 only for treated in 2024)&amp;quot;
* Declare panel structure
xtset id year
* Basic DiD with individual fixed effects
xtdidregress (y) (treat_post), group(id) time(year) vce(cluster id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Number of obs = 4,000
Number of groups = 2,000
Outcome model : linear
Treatment model: none
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat_post | .1347161 .0272737 4.94 0.000 .0812282 .188204
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The basic DiD estimate of the ATT is 0.135 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.081, 0.188]). This is slightly higher than the cross-sectional estimates (0.113&amp;ndash;0.116) but still contains the true effect of 0.12 within its confidence interval. The wider standard error (0.027 vs. 0.019) reflects the additional variability introduced by differencing within households. Standard errors are clustered at the household level to account for serial correlation within panels.&lt;/p>
&lt;p>The key advantage of this DiD estimate is that it controls for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> of each household. In an RCT, randomization already handles confounding, so the cross-sectional and panel estimates are similar. But in observational settings, DiD&amp;rsquo;s ability to absorb household fixed effects can correct biases that cross-sectional methods cannot.&lt;/p>
&lt;h3 id="94-from-cross-sectional-dr-to-panel-dr-----doubly-robust-did-drdid">9.4 From cross-sectional DR to panel DR &amp;mdash; Doubly Robust DiD (DRDID)&lt;/h3>
&lt;p>In Section 7, we saw that doubly robust methods combine outcome modeling and propensity score modeling for cross-sectional data. &lt;strong>DRDID extends this logic to the panel setting.&lt;/strong> It combines the DiD framework (using pre/post variation) with doubly robust covariate adjustment.&lt;/p>
&lt;p>This approach was introduced by Sant&amp;rsquo;Anna and Zhao (2020) in a landmark paper published in the &lt;em>Journal of Econometrics&lt;/em>. They proposed estimators that are &amp;ldquo;consistent if either (but not necessarily both) a propensity score or outcome regression working models are correctly specified&amp;rdquo; &amp;mdash; bringing the doubly robust property from the cross-sectional world into the DiD framework.&lt;/p>
&lt;h4 id="why-do-we-need-drdid">Why do we need DRDID?&lt;/h4>
&lt;p>Recall from Section 9.2 that basic DiD relies on the &lt;strong>parallel trends assumption&lt;/strong> &amp;mdash; absent treatment, the treated and control groups would have followed the same time trend. But what if parallel trends holds only &lt;strong>conditional on covariates&lt;/strong>? For example, what if consumption trends differ between poor and non-poor households, but within each poverty group the trends are parallel?&lt;/p>
&lt;p>In this case, we need a &lt;strong>conditional&lt;/strong> parallel trends assumption:&lt;/p>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1, X] = E[Y_1(0) - Y_0(0) \mid D = 0, X]$$&lt;/p>
&lt;p>This says that the average change in untreated potential outcomes is the same for treated and control groups &lt;em>who share the same covariates&lt;/em> $X$. Note that this allows for covariate-specific time trends (e.g., different consumption growth rates for poor and non-poor households) while still identifying the ATT.&lt;/p>
&lt;p>Under this conditional parallel trends assumption, there are two ways to estimate the ATT:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Outcome regression (OR) approach&lt;/strong> &amp;mdash; model how the outcome evolves over time for the control group, and use that model to predict the counterfactual evolution for the treated group&lt;/li>
&lt;li>&lt;strong>IPW approach&lt;/strong> &amp;mdash; reweight the control group so its covariate distribution matches the treated group, then compute the standard DiD&lt;/li>
&lt;/ul>
&lt;p>The problem is the same as in the cross-sectional case: OR requires a correctly specified outcome model, and IPW requires a correctly specified propensity score model. Sant&amp;rsquo;Anna and Zhao&amp;rsquo;s insight was that &lt;strong>you can combine both into a single estimator that works if either model is correct&lt;/strong>.&lt;/p>
&lt;h4 id="the-drdid-estimator-for-panel-data">The DRDID estimator for panel data&lt;/h4>
&lt;p>When panel data are available (as in our case &amp;mdash; same households observed at baseline and endline), the DRDID estimator takes a particularly clean form. Let $\Delta Y_i = Y_{i,post} - Y_{i,pre}$ denote each household&amp;rsquo;s change in consumption. The estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{DiD} = \frac{1}{N} \sum_{i=1}^{N} \left[ w_1(D_i) - w_0(D_i, X_i) \right] \left[ \Delta Y_i - \hat{\mu}_{0,\Delta}(X_i) \right]$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$w_1(D_i) = D_i / \bar{D}$ assigns equal weight to each treated unit, where $\bar{D}$ is the fraction of units treated&lt;/li>
&lt;li>$w_0(D_i, X_i)$ reweights control units using the propensity score $\hat{p}(X)$, so they resemble the treated group&lt;/li>
&lt;li>$\hat{\mu}_{0,\Delta}(X_i) = \hat{\mu}_{0,post}(X_i) - \hat{\mu}_{0,pre}(X_i)$ is the predicted change in consumption for the control group, fitted from control-group data&lt;/li>
&lt;/ul>
&lt;p>In plain language: for each household, compute the change in consumption over time ($\Delta Y$) and subtract the model-predicted change for the control group ($\hat{\mu}_{0,\Delta}$). This residual captures the treatment effect plus any prediction error. Then reweight these residuals using IPW so that the control group matches the treated group&amp;rsquo;s covariate profile.&lt;/p>
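&lt;p>To make the formula concrete, here is a minimal Python sketch of the panel DRDID estimator on simulated two-period data. The data-generating process (one covariate, a logistic propensity score, a covariate-specific trend of $0.5X$) is our own illustrative assumption, and we plug in the &lt;em>true&lt;/em> component models where &lt;code>drdid&lt;/code> would plug in fitted ones:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n, tau = 20_000, 0.12                 # true ATT set to 0.12, as in the tutorial
x = rng.uniform(-1, 1, n)             # a single covariate
p = 1 / (1 + np.exp(-x))              # propensity score Pr(D=1 given X)
d = rng.binomial(1, p)
mu0_delta = 0.5 * x                   # E[dY given D=0, X]: covariate-specific trend
dy = mu0_delta + tau * d + rng.normal(0, 1, n)   # observed change dY

w1 = d / d.mean()                     # equal weight on each treated unit
w0 = p * (1 - d) / (1 - p)            # reweight controls toward treated X profile
w0 = w0 / w0.mean()
att = np.mean((w1 - w0) * (dy - mu0_delta))
print(round(att, 2))
&lt;/code>&lt;/pre>
&lt;p>The estimate lands close to the true ATT of 0.12. In practice both components are estimated: a regression of $\Delta Y$ on $X$ in the control group for $\hat{\mu}_{0,\Delta}$ and a logit for $\hat{p}(X)$, which is what the &lt;code>dripw&lt;/code> option of &lt;code>drdid&lt;/code> does.&lt;/p>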
&lt;h4 id="why-is-this-doubly-robust">Why is this doubly robust?&lt;/h4>
&lt;p>The doubly robust property works through the same logic as in the cross-sectional case (Section 7.3), but applied to &lt;strong>changes&lt;/strong> rather than levels:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>If the outcome model is correct&lt;/strong> ($\hat{\mu}_{0,\Delta}(X) = E[\Delta Y \mid D=0, X]$), then the residuals $\Delta Y_i - \hat{\mu}_{0,\Delta}(X_i)$ average to zero for the control group, regardless of the propensity score weights. The estimator reduces to an outcome-regression DiD. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If the propensity score model is correct&lt;/strong> ($\hat{p}(X) = \Pr(D=1 \mid X)$), the IPW reweighting makes the control group comparable to the treated group, regardless of the outcome model. The correction term fixes any bias from a misspecified outcome model. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are correct&lt;/strong>, the estimator achieves the &lt;strong>semiparametric efficiency bound&lt;/strong> &amp;mdash; it is the most precise estimator possible given the assumptions. Sant&amp;rsquo;Anna and Zhao proved this formally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are wrong&lt;/strong>, the estimator can be biased &amp;mdash; double robustness provides one layer of insurance, not two.&lt;/p>
&lt;/li>
&lt;/ol>
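&lt;p>A small simulation makes these cases tangible. We feed the estimator deliberately misspecified working models and check which combinations still recover the true ATT of 0.12. The data-generating process is our own illustrative assumption, not the tutorial&amp;rsquo;s data:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(3)
n, tau = 100_000, 0.12
x = rng.uniform(-1, 1, n)
p_true = 1 / (1 + np.exp(-2 * x))     # true propensity score
d = rng.binomial(1, p_true)
mu0_true = 0.5 * x                    # true E[dY given D=0, X]
dy = mu0_true + tau * d + rng.normal(0, 1, n)

def dr_att(p, mu0):
    # panel DR-DiD estimator for the supplied working models
    w1 = d / d.mean()
    w0 = p * (1 - d) / (1 - p)
    w0 = w0 / w0.mean()
    return np.mean((w1 - w0) * (dy - mu0))

ps_wrong = np.full(n, d.mean())       # propensity model that ignores X
or_wrong = np.zeros(n)                # outcome model that ignores X

print(round(dr_att(p_true, mu0_true), 2))    # both correct:  close to 0.12
print(round(dr_att(p_true, or_wrong), 2))    # only PS right: close to 0.12
print(round(dr_att(ps_wrong, mu0_true), 2))  # only OR right: close to 0.12
print(round(dr_att(ps_wrong, or_wrong), 2))  # both wrong:    biased
&lt;/code>&lt;/pre>
&lt;p>Only the last line, where both working models ignore the covariate, drifts well above 0.12 &amp;mdash; double robustness is one layer of insurance, not two.&lt;/p>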
&lt;pre>&lt;code class="language-mermaid">graph TD
DY[&amp;quot;&amp;lt;b&amp;gt;Panel data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY = Y_post − Y_pre&amp;lt;br/&amp;gt;for each household&amp;quot;]
OR[&amp;quot;&amp;lt;b&amp;gt;Outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict control group's&amp;lt;br/&amp;gt;consumption change&amp;lt;br/&amp;gt;μ̂₀,Δ(X)&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate p(X)&amp;lt;br/&amp;gt;= Pr(D=1 | X)&amp;quot;]
RES[&amp;quot;&amp;lt;b&amp;gt;Residuals&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY − μ̂₀,Δ(X)&amp;quot;]
IPW_W[&amp;quot;&amp;lt;b&amp;gt;IPW reweighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Make controls look&amp;lt;br/&amp;gt;like treated group&amp;quot;]
DRDID[&amp;quot;&amp;lt;b&amp;gt;DR-DiD estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ATT = weighted average&amp;lt;br/&amp;gt;of residuals&amp;quot;]
DY --&amp;gt; RES
OR --&amp;gt; RES
PS --&amp;gt; IPW_W
RES --&amp;gt; DRDID
IPW_W --&amp;gt; DRDID
style DY fill:#141413,stroke:#00d4c8,color:#fff
style OR fill:#6a9bcc,stroke:#141413,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style RES fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_W fill:#d97757,stroke:#141413,color:#fff
style DRDID fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;h4 id="what-drdid-adds-over-basic-did-and-twfe">What DRDID adds over basic DiD and TWFE&lt;/h4>
&lt;p>Sant&amp;rsquo;Anna and Zhao (2020) also showed that the standard two-way fixed effects (TWFE) estimator &amp;mdash; the workhorse of applied economics &amp;mdash; can produce misleading results when treatment effects are heterogeneous across covariates. Specifically, the TWFE estimator implicitly assumes (i) that treatment effects are the same for all covariate values, and (ii) that there are no covariate-specific time trends. When these assumptions fail, &amp;ldquo;the estimand is, in general, different from the ATT, and policy evaluation based on it may be misleading.&amp;rdquo; DRDID avoids both of these pitfalls by allowing for flexible outcome models and covariate-specific trends.&lt;/p>
&lt;h4 id="stata-implementation">Stata implementation&lt;/h4>
&lt;p>The &lt;code>drdid&lt;/code> package (Rios-Avila, Sant&amp;rsquo;Anna, and Callaway) implements the estimators from the paper.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install the drdid package (only needed once)
ssc install drdid, replace
* Doubly Robust DiD with DRIPW estimator
drdid y c.age c.edu i.female i.poverty, ivar(id) time(year) treatment(treat) dripw
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Doubly robust difference-in-differences estimator
Outcome model : least squares
Treatment model: inverse probability
──────────────────────────────────────────────────────────────────────────────
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The DRDID estimate of the ATT is 0.137 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.084, 0.191]). The &lt;code>dripw&lt;/code> option specifies the Doubly Robust Inverse Probability Weighting estimator, which uses a linear least squares model for the outcome evolution of the control group and a logistic model for the propensity score. The result is slightly higher than basic DiD (0.135) and close to the true effect of 0.12.&lt;/p>
&lt;p>&lt;strong>Alternative: Stata 17+ built-in command.&lt;/strong> Stata 17 and later versions include a built-in doubly robust DiD estimator that does not require installing external packages.&lt;/p>
&lt;pre>&lt;code class="language-stata">xthdidregress aipw (y c.age c.edu i.female i.poverty) ///
(treat_post c.age c.edu i.female i.poverty), group(id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Heterogeneous-treatment-effects regression Number of obs = 4,000
Number of panels = 2,000
Estimator: Augmented IPW
Panel variable: id
Treatment level: id
Control group: Never treated
(Std. err. adjusted for 2,000 clusters in id)
──────────────────────────────────────────────────────────────────────────────
| Robust
Cohort | ATET std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
year |
2024 | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
Note: ATET computed using covariates.
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>xthdidregress aipw&lt;/code> command produces the same ATT estimate of 0.137 (SE = 0.027, 95% CI [0.084, 0.191]) as the &lt;code>drdid&lt;/code> package &amp;mdash; confirming that both implement the same doubly robust DiD methodology. The output labels the result as &amp;ldquo;Cohort year 2024&amp;rdquo; because &lt;code>xthdidregress&lt;/code> is designed for settings with staggered treatment adoption across multiple cohorts; in our two-period design, there is only one treatment cohort (households treated in 2024). As the Stata manual explains, &amp;ldquo;AIPW models both treatment and outcome. If at least one of the models is correctly specified, it provides consistent estimates, a property called double robustness.&amp;rdquo;&lt;/p>
&lt;p>The agreement between &lt;code>drdid&lt;/code> (community package) and &lt;code>xthdidregress aipw&lt;/code> (built-in) provides a useful robustness check &amp;mdash; researchers can verify their results using both implementations.&lt;/p>
&lt;h4 id="panel-data-vs-repeated-cross-sections">Panel data vs. repeated cross-sections&lt;/h4>
&lt;p>An important result from Sant&amp;rsquo;Anna and Zhao (2020) is that panel data are &lt;strong>strictly more efficient&lt;/strong> than repeated cross-sections for estimating the ATT under the DiD framework. The intuition is straightforward: with panel data, we observe each household&amp;rsquo;s individual change over time ($\Delta Y_i$), which eliminates household-level variation. With repeated cross-sections, we can only compare group averages at different time points, which introduces additional noise. The efficiency gain is larger when the sample sizes in the pre and post periods are more imbalanced.&lt;/p>
&lt;p>In our study, we have a balanced panel (same 2,000 households at baseline and endline), so we benefit from this efficiency advantage.&lt;/p>
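&lt;p>The efficiency claim is easy to see in a toy Monte Carlo: when household effects are large, differencing the &lt;em>same&lt;/em> households removes them, while comparing &lt;em>different&lt;/em> households across waves does not. The numbers below are illustrative, not drawn from the tutorial&amp;rsquo;s data:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n, reps, trend = 500, 2000, 0.12
panel, rcs = [], []
for _ in range(reps):
    a = rng.normal(0, 1, n)                   # household fixed effects (large)
    y_pre = a + rng.normal(0, 0.3, n)
    y_post = a + trend + rng.normal(0, 0.3, n)
    panel.append(np.mean(y_post - y_pre))     # same households, differenced
    # repeated cross-sections: a fresh sample of households in the post wave
    a2 = rng.normal(0, 1, n)
    y_post2 = a2 + trend + rng.normal(0, 0.3, n)
    rcs.append(np.mean(y_post2) - np.mean(y_pre))
print(round(np.std(panel), 3), round(np.std(rcs), 3))
&lt;/code>&lt;/pre>
&lt;p>Both estimators center on the true trend of 0.12, but the panel version is several times more precise because the household effects cancel within each difference.&lt;/p>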
&lt;h3 id="95-cross-sectional-vs-panel-comparison">9.5 Cross-sectional vs. panel comparison&lt;/h3>
&lt;p>The table below compares our best cross-sectional estimates with the panel-based DiD estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data Used&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RA&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR (IPWRA)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several important patterns emerge from this comparison. Cross-sectional methods estimate &lt;strong>ATE&lt;/strong> using only endline data, while DiD methods estimate &lt;strong>ATT&lt;/strong> using both survey waves. The two DR-DiD implementations (&lt;code>drdid&lt;/code> and &lt;code>xthdidregress aipw&lt;/code>) produce identical results, confirming methodological consistency. The DiD estimates (0.135&amp;ndash;0.137) are slightly higher than the cross-sectional estimates (0.113), but &lt;strong>all confidence intervals contain the true effect of 0.12&lt;/strong>. DiD&amp;rsquo;s wider standard errors (0.027 vs. 0.019) reflect the additional variability from differencing within households.&lt;/p>
&lt;p>The key value of DiD is &lt;strong>not&lt;/strong> tighter standard errors &amp;mdash; it is &lt;strong>robustness to time-invariant unobservables.&lt;/strong> In observational settings where randomization does not hold, DiD can correct biases that cross-sectional methods cannot address. In this RCT, randomization already handles confounding, so the estimates are similar. DRDID adds doubly robust protection on top of DiD, making it the most robust panel method available.&lt;/p>
&lt;hr>
&lt;h2 id="10-offer-vs-receipt-----endogenous-treatment-advanced">10. Offer vs. receipt &amp;mdash; endogenous treatment (advanced)&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Note:&lt;/strong> This section addresses the advanced topic of imperfect compliance and endogenous treatment. Readers new to causal inference may wish to skip this section on a first reading and return to it later.&lt;/p>
&lt;/blockquote>
&lt;h3 id="101-the-compliance-problem">10.1 The compliance problem&lt;/h3>
&lt;p>All estimates in Sections 8 and 9 measure the effect of &lt;strong>being offered&lt;/strong> the cash transfer (&lt;code>treat&lt;/code>), not the effect of &lt;strong>actually receiving&lt;/strong> it (&lt;code>D&lt;/code>). This is the intent-to-treat (ITT) approach &amp;mdash; it captures the policy-relevant effect of the offer, regardless of whether households complied.&lt;/p>
&lt;p>But what about the effect of actual receipt? This is more complex because compliance is &lt;strong>not random&lt;/strong>. Only 85% of treated households received the transfer, and 5% of control households received it through other channels. The households that chose to take up the program may differ systematically from those that did not &amp;mdash; they may be more motivated, more financially constrained, or better connected. Naively comparing receivers to non-receivers would introduce &lt;strong>selection bias&lt;/strong>.&lt;/p>
&lt;p>The solution is to use the random assignment (&lt;code>treat&lt;/code>) as an &lt;strong>instrumental variable&lt;/strong> for actual receipt (&lt;code>D&lt;/code>). Because &lt;code>treat&lt;/code> was randomly assigned, it is independent of household characteristics and satisfies the requirements for a valid instrument. This allows us to isolate the causal effect of receipt, at least for the subset of households whose receipt was determined by the offer (the &amp;ldquo;compliers&amp;rdquo;).&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; prescriptions and pills.&lt;/strong> Imagine a doctor randomly prescribes a medication to some patients, but not all patients fill their prescription. We cannot simply compare those who took the pill to those who did not, because pill-takers may be more health-conscious. Instead, we use the random prescription (the &amp;ldquo;offer&amp;rdquo;) as a nudge &amp;mdash; it strongly predicts whether you take the pill but does not directly affect your health except through the pill. That is the instrumental variable approach: using the random offer to estimate the causal effect of actual receipt.&lt;/p>
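&lt;p>The simplest version of this idea is the Wald estimator: divide the effect of the offer on the outcome (the ITT) by the effect of the offer on receipt (the first stage). A back-of-envelope calculation with the numbers from this tutorial (the simple-regression ITT of 0.116 and the take-up rates above):&lt;/p>
&lt;pre>&lt;code class="language-python"># ITT estimate from Section 8 and take-up rates from the study design
itt = 0.116            # effect of being offered the transfer
takeup_treated = 0.85  # share of offered households that received it
takeup_control = 0.05  # share of control households that received it

first_stage = takeup_treated - takeup_control   # offer effect on receipt
late = itt / first_stage                        # receipt effect (compliers)
print(round(late, 3))
&lt;/code>&lt;/pre>
&lt;p>The result, about 0.145, is the complier average effect of receipt, and it previews the &lt;code>etregress&lt;/code> estimate in the next subsection.&lt;/p>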
&lt;h3 id="102-endogenous-treatment-regression">10.2 Endogenous treatment regression&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;code>etregress&lt;/code> command estimates the effect of an endogenous treatment variable, using the random assignment as an excluded instrument.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Endogenous treatment regression
etregress y c.age i.female i.poverty c.edu, ///
treat(D = treat c.age i.female i.poverty c.edu) vce(robust)
* Mark estimation sample
gen byte esample = e(sample)
* ATE of receipt
margins r.D if esample==1
* ATT of receipt
margins, predict(cte) subpop(if D==1 &amp;amp; esample==1)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression with endogenous treatment Number of obs = 2,000
Estimator: Maximum likelihood Wald chi2(5) = 92.23
Log pseudolikelihood = -1797.6297 Prob &amp;gt; chi2 = 0.0000
──────────────────────────────────────────────────────────────────────────────
| Robust
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
y |
age | .003187 .0010016 3.18 0.001 .001224 .0051501
1.female | .0801465 .0189552 4.23 0.000 .042995 .117298
1.poverty | -.1030302 .0205984 -5.00 0.000 -.1434023 -.062658
edu | .0182634 .0045243 4.04 0.000 .0093959 .0271308
1.D | .1471 .0246775 5.96 0.000 .0987329 .1954671
_cons | 9.705642 .0694641 139.72 0.000 9.569495 9.841789
─────────────+────────────────────────────────────────────────────────────────
D |
treat | 2.55806 .0802103 31.89 0.000 2.40085 2.715269
_cons | -1.844408 .2847883 -6.48 0.000 -2.402582 -1.286233
─────────────+────────────────────────────────────────────────────────────────
/athrho | -.0060068 .0481062 -0.12 0.901 -.1002933 .0882796
sigma | .4245195 .0066426 .411698 .4377404
──────────────────────────────────────────────────────────────────────────────
Wald test of indep. eqns. (rho = 0): chi2(1) = 0.02 Prob &amp;gt; chi2 = 0.9006
ATE of receipt (margins r.D):
──────────────────────────────────────────────────────────────────────────────
D | Contrast std. err. [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
(1 vs 0) | .1471 .0246775 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
ATT of receipt (margins, predict(cte)):
──────────────────────────────────────────────────────────────────────────────
_cons | Margin std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
| .1471 .0246775 5.96 0.000 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>etregress&lt;/code> output reveals several important findings. The coefficient on &lt;code>D&lt;/code> (receipt) is 0.147 (SE = 0.025, p &amp;lt; 0.001, 95% CI [0.099, 0.195]), which is the estimated effect of actually receiving the cash transfer. This is larger than the offer-based estimates (0.113&amp;ndash;0.116) because not everyone who was offered the program received it &amp;mdash; the per-recipient effect is naturally larger than the per-offer effect. The Wald test of independent equations (rho = 0) has p = 0.901, indicating no evidence of endogeneity &amp;mdash; consistent with a well-designed RCT where unobservable factors do not drive both treatment receipt and consumption. The &lt;code>margins&lt;/code> commands confirm that both the ATE and ATT of receipt are 0.147 (identical in this case because the model assumes a constant treatment effect).&lt;/p>
&lt;h3 id="103-doubly-robust-estimation-of-receipt-effect">10.3 Doubly robust estimation of receipt effect&lt;/h3>
&lt;p>We can also estimate the receipt effect using a doubly robust approach, incorporating the baseline outcome &lt;code>y0&lt;/code> as an additional control variable (an ANCOVA-style adjustment) and including &lt;code>treat&lt;/code> (the random assignment) as a covariate in the treatment model for &lt;code>D&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Doubly robust ATE of receipt, controlling for baseline outcome
teffects ipwra (y y0 c.age i.female i.poverty c.edu) ///
(D c.age i.female i.poverty c.edu treat), vce(robust)
* Diagnostic checks
tebalance summarize age edu i.female i.poverty
tebalance summarize, baseline
tebalance density y0
tebalance density age
teffects overlap
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
D |
(1 vs 0) | .1172686 .0322495 3.64 0.000 .0540608 .1804764
─────────────+────────────────────────────────────────────────────────────────
POmean |
D |
0 | 10.03361 .0171459 585.19 0.000 10 10.06722
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimate of the ATE of receipt is 0.117 (SE = 0.032, 95% CI [0.054, 0.180]). This is slightly lower than the &lt;code>etregress&lt;/code> estimate (0.147) and closer to the true effect of 0.12. The wider standard error (0.032 vs. 0.025) reflects the additional flexibility of the doubly robust approach. This specification includes &lt;code>y0&lt;/code> (the baseline outcome) in the outcome model, which controls for pre-treatment differences in consumption levels. The variable &lt;code>treat&lt;/code> appears in the treatment model for &lt;code>D&lt;/code> because random assignment is the strongest predictor of receipt.&lt;/p>
&lt;p>The diagnostic graphs below verify adequate covariate balance and propensity score overlap for the receipt model.&lt;/p>
&lt;p>&lt;img src="stata_rct_density_y0_receipt.png" alt="Density plot of baseline consumption (y0) for receivers and non-receivers, before and after IPWRA weighting.">&lt;/p>
&lt;p>&lt;img src="stata_rct_overlap_receipt.png" alt="Overlap plot showing propensity score distributions for receivers and non-receivers of the cash transfer.">&lt;/p>
&lt;p>The density and overlap plots confirm that the IPWRA weighting achieves good balance between receivers and non-receivers. After weighting, the effective sample sizes are approximately 999 treated and 1,001 control (rebalanced from the raw 923 receivers and 1,077 non-receivers). The weighted covariate means are closely aligned &amp;mdash; for example, the weighted mean age is 35.0 for receivers versus 35.2 for non-receivers, and the weighted poverty rate is 31.1% versus 31.4%. The propensity scores show sufficient overlap for reliable estimation.&lt;/p>
&lt;hr>
&lt;h2 id="11-comparing-all-estimates-----the-big-picture">11. Comparing all estimates &amp;mdash; the big picture&lt;/h2>
&lt;p>The table below brings together all estimates from the tutorial, providing a comprehensive overview of how different methods, estimands, and data structures relate to each other.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>9&lt;/td>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10&lt;/td>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>11&lt;/td>
&lt;td>Endogenous treatment (&lt;code>etregress&lt;/code>)&lt;/td>
&lt;td>IV&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.147&lt;/td>
&lt;td style="text-align:center">0.025&lt;/td>
&lt;td style="text-align:center">[0.099, 0.195]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>12&lt;/td>
&lt;td>DR receipt (&lt;code>teffects ipwra&lt;/code>)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.117&lt;/td>
&lt;td style="text-align:center">0.032&lt;/td>
&lt;td style="text-align:center">[0.054, 0.180]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="four-key-takeaways">Four key takeaways&lt;/h3>
&lt;p>&lt;strong>1. RA vs. IPW vs. DR.&lt;/strong> In this well-designed RCT, all three cross-sectional approaches give remarkably similar results (0.113&amp;ndash;0.116). This convergence occurs because randomization ensures that both the outcome model and the propensity score model are approximately correct. The differences are small &amp;mdash; but in observational studies, where one model might be misspecified, the choice of method matters much more. Doubly robust methods are the safest bet because they remain consistent if either model is correct.&lt;/p>
&lt;p>&lt;strong>2. ATE vs. ATT.&lt;/strong> For all cross-sectional methods, ATE and ATT are nearly identical (0.113&amp;ndash;0.116). This confirms that treatment effects are roughly homogeneous across households in this simulation. When treatment effects are heterogeneous &amp;mdash; for example, if the program benefits poorer households more &amp;mdash; ATE and ATT can diverge. The researcher must choose the estimand that matches their policy question: ATE for scaling decisions, ATT for program evaluation.&lt;/p>
&lt;p>&lt;strong>3. Cross-sectional vs. DiD.&lt;/strong> DiD estimates (0.135&amp;ndash;0.137) are slightly higher than cross-sectional estimates (0.113&amp;ndash;0.116), but all confidence intervals contain the true effect of 0.12. DiD&amp;rsquo;s main advantage is controlling for &lt;strong>time-invariant unobservable&lt;/strong> household characteristics &amp;mdash; less important in an RCT (where randomization handles confounding) but critical in quasi-experimental settings. DRDID extends the doubly robust logic to the panel setting, providing the most robust estimator in our toolkit. DiD inherently estimates the &lt;strong>ATT&lt;/strong> because its counterfactual is constructed specifically for the treated group.&lt;/p>
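&lt;p>The DiD counterfactual logic reduces to a simple difference of differences. The following Python sketch uses made-up cell means (not numbers from this tutorial) to show the canonical 2&amp;times;2 computation:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical cell means of log consumption (illustrative only)
treated_pre, treated_post = 9.90, 10.16
control_pre, control_post = 9.92, 10.04

# ATT = (change for treated) - (change for controls)
att = (treated_post - treated_pre) - (control_post - control_pre)
print(round(att, 2))
&lt;/code>&lt;/pre>
&lt;p>The control group&amp;rsquo;s trend (0.12 here) stands in for what the treated group would have experienced without the program; subtracting it isolates the ATT.&lt;/p>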
&lt;p>&lt;strong>4. Offer vs. receipt.&lt;/strong> The effect of actually receiving the cash transfer (0.117&amp;ndash;0.147) is larger than the effect of being offered it (0.113&amp;ndash;0.116), because imperfect compliance dilutes the offer-based estimates. The doubly robust receipt estimate (0.117) is closest to the true effect of 0.12, while the endogenous treatment model (0.147) is slightly higher. All confidence intervals contain 0.12.&lt;/p>
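&lt;p>The dilution from imperfect compliance follows the instrumental-variables (Wald) logic: with one-sided noncompliance, the offer effect (ITT) is approximately the receipt effect among compliers scaled down by the compliance rate. A back-of-the-envelope Python sketch, where the compliance rate is a hypothetical value rather than one computed from the tutorial&amp;rsquo;s data:&lt;/p>
&lt;pre>&lt;code class="language-python">itt = 0.113        # offer effect from the table above
compliance = 0.92  # hypothetical share of offered households that received

# Wald estimate of the effect of receipt among compliers
late = itt / compliance
print(round(late, 3))
&lt;/code>&lt;/pre>
&lt;p>Scaling the ITT up by the compliance rate yields a receipt effect slightly above the offer effect, consistent with the pattern in the table.&lt;/p>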
&lt;hr>
&lt;h2 id="12-summary-and-key-takeaways">12. Summary and key takeaways&lt;/h2>
&lt;p>The cash transfer program increased household consumption by approximately &lt;strong>11&amp;ndash;14%&lt;/strong> across all estimation methods, close to the true effect of &lt;strong>12%&lt;/strong>. Every confidence interval contained the true value, demonstrating that all methods successfully recovered the correct answer.&lt;/p>
&lt;h3 id="seven-methodological-lessons">Seven methodological lessons&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Always verify baseline balance&lt;/strong> before estimating treatment effects. Even with randomization, chance imbalances can occur &amp;mdash; as we saw with the gender variable (SMD = 9.3%).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Be explicit about your estimand.&lt;/strong> ATE answers the policymaker&amp;rsquo;s question (&amp;ldquo;What if we scale this up?&amp;rdquo;), while ATT answers the evaluator&amp;rsquo;s question (&amp;ldquo;Did it help the participants?&amp;rdquo;). Different methods target different estimands.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Regression adjustment models the outcome; IPW models treatment assignment; doubly robust does both.&lt;/strong> These three approaches represent fundamentally different strategies for causal estimation. Understanding what each models &amp;mdash; and what can go wrong &amp;mdash; is essential for choosing the right method.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>In a well-designed RCT, all three approaches converge.&lt;/strong> But doubly robust methods provide insurance against model misspecification, making them the standard recommendation in modern causal inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Panel data controls for time-invariant unobservables&lt;/strong> that cross-sectional methods cannot address. By comparing each household to itself over time, DiD absorbs household fixed effects &amp;mdash; motivation, geography, family culture &amp;mdash; that are invisible to cross-sectional approaches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>DiD inherently estimates the ATT&lt;/strong> because its counterfactual is specific to the treated group. The control group&amp;rsquo;s time trend provides a counterfactual for what the treated group would have experienced without the program &amp;mdash; but it does not tell us what would happen if the program were given to the untreated.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Doubly robust DiD (DRDID)&lt;/strong> extends the DR logic to the panel setting. It combines the power of DiD (controlling for household fixed effects) with the robustness of doubly robust estimation (protection against model misspecification), making it the most robust panel estimator available.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>This tutorial uses &lt;strong>simulated data&lt;/strong> with known parameters. Real-world data may exhibit more complex compliance patterns, heterogeneous effects, and missing data.&lt;/li>
&lt;li>The panel has only &lt;strong>two periods&lt;/strong> (baseline and endline), limiting our ability to test for pre-treatment trends or estimate dynamic treatment effects.&lt;/li>
&lt;li>Treatment effects are &lt;strong>homogeneous&lt;/strong> by construction. In practice, researchers should explore heterogeneity across subgroups.&lt;/li>
&lt;/ul>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;ul>
&lt;li>Apply these methods to &lt;strong>real-world RCT data&lt;/strong> from actual cash transfer programs&lt;/li>
&lt;li>Explore &lt;strong>heterogeneous treatment effects&lt;/strong> by gender, poverty status, or education level&lt;/li>
&lt;li>Extend to &lt;strong>multi-period panels&lt;/strong> with staggered treatment adoption, using modern DiD methods (Callaway and Sant&amp;rsquo;Anna, 2021)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Heterogeneous effects by gender.&lt;/strong> Estimate treatment effects separately for male-headed and female-headed households using IPWRA. Are the effects different? Does ATE still equal ATT when you restrict to subgroups?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model misspecification.&lt;/strong> Compare the RA, IPW, and DR estimates when you deliberately misspecify the outcome model by omitting &lt;code>edu&lt;/code> and &lt;code>age&lt;/code> from the covariate list. Which method is most robust to this misspecification? What does this tell you about the value of doubly robust estimation?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Basic DiD vs. doubly robust DiD.&lt;/strong> Re-run the DiD analysis using the basic &lt;code>xtdidregress&lt;/code> command (no covariates) and compare it with the &lt;code>drdid&lt;/code> results (with covariates). How much do the estimates differ? What does this tell you about the role of covariate adjustment in DiD?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.stata.com/manuals/teteffects.pdf" target="_blank" rel="noopener">Stata &lt;code>teffects&lt;/code> documentation &amp;mdash; Treatment-effects estimation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.06.003" target="_blank" rel="noopener">Sant&amp;rsquo;Anna, P.H.C. &amp;amp; Zhao, J. (2020). Doubly Robust Difference-in-Differences Estimators. &lt;em>Journal of Econometrics&lt;/em>, 219(1), 101&amp;ndash;122&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1017/CBO9781139025751" target="_blank" rel="noopener">Imbens, G. &amp;amp; Rubin, D. (2015). &lt;em>Causal Inference for Statistics, Social, and Biomedical Sciences&lt;/em>. Cambridge University Press&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://friosavila.github.io/stpackages/drdid.html" target="_blank" rel="noopener">Rios-Avila, F., Sant&amp;rsquo;Anna, P.H.C., &amp;amp; Callaway, B. &lt;code>drdid&lt;/code> &amp;mdash; Doubly Robust DID estimators for Stata&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dimewiki.worldbank.org/iebaltab" target="_blank" rel="noopener">World Bank &lt;code>ietoolkit&lt;/code> / &lt;code>iebaltab&lt;/code> documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://tdmize.github.io/data/" target="_blank" rel="noopener">Mize, T. &lt;code>balanceplot&lt;/code> &amp;mdash; Stata module for covariate balance visualization&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/Gr_fu5deDMk" target="_blank" rel="noopener">RCT Analysis: Cash Transfers, Panel Data, and Doubly Robust Estimation (YouTube)&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>High-Dimensional Fixed Effects Regression: An Introduction in Python</title><link>https://carlos-mendez.org/post/python_pyfixest/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pyfixest/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you want to know whether union membership raises wages. You run a regression and find a strong positive association: union workers earn 18% more. But wait &amp;mdash; what if the workers who join unions are also more motivated, more experienced, or work in industries that pay well regardless? That 18% could be mostly &lt;em>selection&lt;/em>, not a genuine union effect. This is one of the most pervasive problems in empirical research: &lt;strong>omitted variable bias&lt;/strong>. Any time your data is grouped &amp;mdash; by individual, firm, country, or time period &amp;mdash; unobserved characteristics that differ across groups can contaminate your estimates, leading to conclusions that look solid but are fundamentally misleading.&lt;/p>
&lt;p>&lt;strong>Fixed effects regression&lt;/strong> is the workhorse solution. By absorbing all time-invariant group-level heterogeneity &amp;mdash; a worker&amp;rsquo;s innate ability, a firm&amp;rsquo;s management culture, a country&amp;rsquo;s institutional quality &amp;mdash; fixed effects eliminate an entire class of confounders in one step. The result is striking: in the wage panel we analyze below, the apparent union premium drops from 18% to just 7% once we account for individual fixed effects, revealing that more than half the raw association was driven by who selects into unions, not what unions do. This kind of dramatic correction is routine in applied research, which is why fixed effects appear in virtually every empirical paper that uses panel data.&lt;/p>
&lt;p>Modern implementations make this computationally painless. Rather than estimating thousands of dummy variables, they use a &lt;em>demeaning&lt;/em> algorithm that sweeps out group means before estimation. &lt;a href="https://pyfixest.org/" target="_blank" rel="noopener">PyFixest&lt;/a> brings this approach to Python with a concise formula syntax inspired by R&amp;rsquo;s &lt;code>fixest&lt;/code> package &amp;mdash; the most popular fixed effects library in the R ecosystem. In this tutorial we use PyFixest to build from simple OLS through one-way and two-way fixed effects, compare inference methods, perform instrumental variable estimation, analyze a real wage panel, and run event study designs for difference-in-differences &amp;mdash; all with a few lines of code. Along the way, we will see &lt;em>why&lt;/em> fixed effects work (by manually reproducing them via demeaning), discover what they &lt;em>cannot&lt;/em> do (estimate time-invariant effects like education), learn when standard TWFE breaks down in staggered treatment designs, and apply the CRE/Mundlak approach to recover the very coefficients that one-way FE absorb.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why unobserved group heterogeneity biases OLS and how fixed effects remove that bias&lt;/li>
&lt;li>Implement one-way and two-way fixed effects regressions using PyFixest&amp;rsquo;s formula syntax&lt;/li>
&lt;li>Compare multiple model specifications efficiently using PyFixest&amp;rsquo;s stepwise operators&lt;/li>
&lt;li>Assess robustness by computing standard errors under different clustering assumptions&lt;/li>
&lt;li>Decompose panel variation into between and within components to diagnose what FE can and cannot estimate&lt;/li>
&lt;li>Frame a real wage panel through the Mincer equation and its panel extensions&lt;/li>
&lt;li>Recover time-invariant coefficients (education, race) using the CRE/Mundlak approach&lt;/li>
&lt;li>Apply fixed effects to event study designs with staggered treatment adoption&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Content outline.&lt;/strong> Sections 2&amp;ndash;4 set up the environment and establish an OLS baseline. Sections 5&amp;ndash;6 introduce fixed effects &amp;mdash; first through PyFixest&amp;rsquo;s absorption syntax, then by reproducing the same result manually via demeaning, building intuition for what FE actually does to the data. Section 7 shows how to compare multiple specifications in a single call, and Section 8 explores how standard error choices affect inference. Section 9 extends to two-way FE, and Section 10 combines FE with instrumental variables. Section 11 is the core case study: a real wage panel framed by the Mincer equation, where we decompose within and between variation, see how one-way FE absorb time-invariant variables like education, stress-test the common trends assumption with group-specific time effects, and recover education&amp;rsquo;s coefficient through the CRE/Mundlak approach. Section 12 applies FE to event study designs, with a careful discussion of why period −1 serves as the universal baseline. Throughout, each section builds on the previous &amp;mdash; the manual demeaning in Section 6 explains why education vanishes in Section 11, and the stepwise comparison in Section 7 foreshadows the specification table in Section 11.&lt;/p>
&lt;h2 id="2-setup-and-imports">2. Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required packages if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install pyfixest
&lt;/code>&lt;/pre>
&lt;p>The following code imports PyFixest and standard data science libraries. PyFixest provides &lt;a href="https://pyfixest.org/reference/estimation.feols.html" target="_blank" rel="noopener">feols()&lt;/a> as its main estimation function, which accepts R-style formulas with a pipe &lt;code>|&lt;/code> separator for fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyfixest as pf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>&lt;strong>Dark theme figure styling&lt;/strong> (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="3-data-loading-and-exploration">3. Data loading and exploration&lt;/h2>
&lt;h3 id="31-loading-the-dataset">3.1 Loading the dataset&lt;/h3>
&lt;p>PyFixest includes a built-in synthetic dataset designed for demonstrating fixed effects regression. We load it with &lt;a href="https://pyfixest.org/reference/utils.get_data.html" target="_blank" rel="noopener">pf.get_data()&lt;/a>, which returns a DataFrame with outcome variables (&lt;code>Y&lt;/code>, &lt;code>Y2&lt;/code>), covariates (&lt;code>X1&lt;/code>, &lt;code>X2&lt;/code>), fixed effect identifiers (&lt;code>f1&lt;/code>, &lt;code>f2&lt;/code>, &lt;code>f3&lt;/code>, &lt;code>group_id&lt;/code>), instruments (&lt;code>Z1&lt;/code>, &lt;code>Z2&lt;/code>), and sampling weights.&lt;/p>
&lt;pre>&lt;code class="language-python">data = pf.get_data()
print(f&amp;quot;Dataset shape: {data.shape}&amp;quot;)
print(f&amp;quot;\nColumn names: {list(data.columns)}&amp;quot;)
print(data.head())
print(data.describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset shape: (1000, 11)
Column names: ['Y', 'Y2', 'X1', 'X2', 'f1', 'f2', 'f3', 'group_id', 'Z1', 'Z2', 'weights']
Y Y2 X1 X2 ... group_id Z1 Z2 weights
0 NaN 2.357103 0.0 0.457858 ... 9.0 -0.330607 1.054826 0.661478
1 -1.458643 5.163147 NaN -4.998406 ... 8.0 NaN -4.113690 0.772732
2 0.169132 0.751140 2.0 1.558480 ... 16.0 1.207778 0.465282 0.990929
3 3.319513 -2.656368 1.0 1.560402 ... 3.0 2.869997 0.467570 0.021123
4 0.134420 -1.866416 2.0 -3.472232 ... 14.0 0.835819 -3.115669 0.790815
Y Y2 X1 ... Z1 Z2 weights
count 999.000 1000.000 999.000 ... 999.000 1000.000 1000.000
mean -0.127 -0.309 1.043 ... 1.040 -0.113 0.495
std 2.305 5.584 0.808 ... 1.307 3.172 0.291
min -6.536 -16.974 0.000 ... -2.825 -11.576 0.000
25% -1.732 -4.029 0.000 ... 0.121 -2.252 0.248
50% -0.211 -0.459 1.000 ... 1.040 -0.064 0.469
75% 1.576 3.528 2.000 ... 1.946 2.028 0.746
max 6.907 17.156 2.000 ... 4.601 11.420 1.000
&lt;/code>&lt;/pre>
&lt;p>The dataset has 1,000 observations across 11 columns. The outcome &lt;code>Y&lt;/code> has a mean of -0.127 and standard deviation of 2.305, while &lt;code>X1&lt;/code> takes discrete values 0, 1, and 2. A few observations have missing values (1 missing in &lt;code>Y&lt;/code>, &lt;code>X1&lt;/code>, &lt;code>f1&lt;/code>, and &lt;code>Z1&lt;/code>), which PyFixest handles automatically by dropping incomplete cases. The &lt;code>group_id&lt;/code> variable identifies the group each observation belongs to, and this is the dimension we will control for with fixed effects.&lt;/p>
&lt;h3 id="32-visualizing-group-structure">3.2 Visualizing group structure&lt;/h3>
&lt;p>Before estimating any model, it helps to see how the relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code> varies across groups. If groups have different average levels of &lt;code>Y&lt;/code>, standard OLS will mix within-group variation (what we care about) with between-group variation (which may reflect confounders).&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 6))
groups = data[&amp;quot;group_id&amp;quot;].unique()
n_groups = len(groups)
cmap = plt.cm.tab20
for i, g in enumerate(sorted(groups)):
    subset = data[data[&amp;quot;group_id&amp;quot;] == g]
    ax.scatter(subset[&amp;quot;X1&amp;quot;], subset[&amp;quot;Y&amp;quot;], alpha=0.5, s=20,
               color=cmap(i / n_groups),
               label=f&amp;quot;Group {g}&amp;quot; if i &amp;lt; 5 else None)
ax.set_xlabel(&amp;quot;X1&amp;quot;, fontsize=13)
ax.set_ylabel(&amp;quot;Y&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Outcome (Y) vs Covariate (X1) by Group&amp;quot;, fontsize=15, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(title=&amp;quot;Group (first 5)&amp;quot;, fontsize=9)
plt.savefig(&amp;quot;pyfixest_scatter_by_group.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_scatter_by_group.png" alt="Scatter plot of Y versus X1 colored by group membership, showing different intercepts across groups.">&lt;/p>
&lt;p>The scatter plot reveals that different groups have distinct average levels of &lt;code>Y&lt;/code> &amp;mdash; some clusters sit higher and others lower on the vertical axis. Within each group, however, &lt;code>Y&lt;/code> tends to decrease as &lt;code>X1&lt;/code> increases. This visual separation between groups is exactly the kind of heterogeneity that fixed effects regression absorbs, allowing us to isolate the within-group relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code>.&lt;/p>
&lt;h2 id="4-simple-ols-baseline-no-fixed-effects">4. Simple OLS baseline (no fixed effects)&lt;/h2>
&lt;p>To establish a benchmark, we first estimate a standard OLS regression of &lt;code>Y&lt;/code> on &lt;code>X1&lt;/code> without any fixed effects. The model is:&lt;/p>
&lt;p>$$Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i$$&lt;/p>
&lt;p>In words, we assume the outcome $Y$ is a linear function of $X_1$ plus random noise $\epsilon$. This gives us the overall association, mixing both within-group and between-group variation. We use heteroskedasticity-robust standard errors (&lt;code>HC1&lt;/code>) to account for non-constant variance.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_ols = pf.feols(&amp;quot;Y ~ X1&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(fit_ols.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.919 | 0.112 | 8.223 | 0.000 | 0.699 | 1.138 |
| X1 | -1.000 | 0.082 | -12.134 | 0.000 | -1.162 | -0.838 |
---
RMSE: 2.158 R2: 0.123
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS estimates a coefficient of -1.000 on &lt;code>X1&lt;/code> (SE = 0.082, p &amp;lt; 0.001), with an R-squared of 0.123. This means that a one-unit increase in &lt;code>X1&lt;/code> is associated with a 1.0-point decrease in &lt;code>Y&lt;/code> on average. However, this estimate ignores group-level differences &amp;mdash; it could be biased if &lt;code>X1&lt;/code> correlates with unobserved group characteristics. The model explains only 12.3% of the total variation in &lt;code>Y&lt;/code>, leaving substantial unexplained heterogeneity. Let us now see how fixed effects change the picture.&lt;/p>
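&lt;p>Before turning to fixed effects, it helps to see the bias mechanism in isolation. The following self-contained sketch uses simulated data with assumed parameters (separate from PyFixest&amp;rsquo;s example dataset): an unobserved group effect correlates with the covariate, so pooled OLS overstates the true slope of 1.0, while demeaning within groups recovers it.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_groups, per_group, beta = 50, 40, 1.0
g = np.repeat(np.arange(n_groups), per_group)
alpha = rng.normal(size=n_groups)[g]          # unobserved group effect
x = alpha + rng.normal(size=g.size)           # covariate correlated with alpha
y = beta * x + 2 * alpha + rng.normal(size=g.size)

df = pd.DataFrame({&amp;quot;g&amp;quot;: g, &amp;quot;x&amp;quot;: x, &amp;quot;y&amp;quot;: y})
b_pooled = np.polyfit(df[&amp;quot;x&amp;quot;], df[&amp;quot;y&amp;quot;], 1)[0]

# Within transformation: subtract group means, then regress
xd = df[&amp;quot;x&amp;quot;] - df.groupby(&amp;quot;g&amp;quot;)[&amp;quot;x&amp;quot;].transform(&amp;quot;mean&amp;quot;)
yd = df[&amp;quot;y&amp;quot;] - df.groupby(&amp;quot;g&amp;quot;)[&amp;quot;y&amp;quot;].transform(&amp;quot;mean&amp;quot;)
b_within = np.polyfit(xd, yd, 1)[0]
print(round(b_pooled, 2), round(b_within, 2))
&lt;/code>&lt;/pre>
&lt;p>In this construction the pooled slope is inflated toward 2.0 by the confounder, while the within slope sits near the true 1.0. The gap between the two is the omitted variable bias that fixed effects remove.&lt;/p>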
&lt;h2 id="5-one-way-fixed-effects">5. One-way fixed effects&lt;/h2>
&lt;p>The following diagram illustrates the core problem fixed effects solve. When an unobserved group characteristic correlates with both the covariate and the outcome, it creates a &lt;em>backdoor path&lt;/em> that biases OLS. Fixed effects block this path by absorbing all group-level variation.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Group Characteristics&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(unobserved)&amp;quot;] --&amp;gt;|&amp;quot;correlates&amp;quot;| X[&amp;quot;&amp;lt;b&amp;gt;X1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(covariate)&amp;quot;]
A --&amp;gt;|&amp;quot;affects&amp;quot;| Y[&amp;quot;&amp;lt;b&amp;gt;Y&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
X --&amp;gt;|&amp;quot;causal effect β = ?&amp;quot;| Y
FE[&amp;quot;&amp;lt;b&amp;gt;Fixed Effects&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(absorbs A)&amp;quot;] -.-&amp;gt;|&amp;quot;blocks backdoor&amp;quot;| A
style A fill:#d97757,stroke:#141413,color:#fff
style X fill:#6a9bcc,stroke:#141413,color:#fff
style Y fill:#00d4c8,stroke:#141413,color:#fff
style FE fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5
&lt;/code>&lt;/pre>
&lt;h3 id="51-absorbing-group-heterogeneity">5.1 Absorbing group heterogeneity&lt;/h3>
&lt;p>Fixed effects regression controls for all time-invariant group characteristics by effectively adding a separate intercept for each group. In PyFixest, we specify fixed effects after a pipe &lt;code>|&lt;/code> in the formula. The syntax &lt;code>Y ~ X1 | group_id&lt;/code> means: regress &lt;code>Y&lt;/code> on &lt;code>X1&lt;/code>, absorbing &lt;code>group_id&lt;/code> fixed effects. Think of this as asking: &amp;ldquo;within each group, what is the relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code>?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-python">fit_fe1 = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(fit_fe1.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: group_id
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -1.019 | 0.083 | -12.234 | 0.000 | -1.182 | -0.856 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126
&lt;/code>&lt;/pre>
&lt;p>With &lt;code>group_id&lt;/code> fixed effects absorbed, the coefficient on &lt;code>X1&lt;/code> shifts slightly to -1.019 (SE = 0.083). The within R-squared of 0.126 tells us how much of the within-group variation in &lt;code>Y&lt;/code> is explained by &lt;code>X1&lt;/code> after removing group means. Compared to the pooled OLS estimate of -1.000, the fixed effects estimate is similar in this synthetic dataset, suggesting that &lt;code>X1&lt;/code> does not strongly correlate with group-level unobservables here. In real data, the shift can be dramatic &amp;mdash; that gap is the omitted variable bias that fixed effects remove.&lt;/p>
&lt;h3 id="52-equivalence-with-dummy-variables">5.2 Equivalence with dummy variables&lt;/h3>
&lt;p>Under the hood, fixed effects absorption produces the same point estimates as including explicit dummy variables for each group. PyFixest&amp;rsquo;s &lt;code>C()&lt;/code> operator creates these dummies. The key advantage of absorption is computational: with thousands of groups, estimating thousands of dummy coefficients is slow and memory-intensive, while demeaning is fast.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_dummy = pf.feols(&amp;quot;Y ~ X1 + C(group_id)&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(f&amp;quot;X1 coefficient (FE absorption): {fit_fe1.coef()['X1']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (dummy vars): {fit_dummy.coef()['X1']:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">X1 coefficient (FE absorption): -1.0190
X1 coefficient (dummy vars): -1.0190
&lt;/code>&lt;/pre>
&lt;p>Both approaches yield identical coefficients of -1.0190 on &lt;code>X1&lt;/code>, confirming that FE absorption and dummy variable inclusion are algebraically equivalent. The absorption approach simply avoids estimating and storing the hundreds or thousands of group intercepts that are typically not of interest &amp;mdash; what econometricians call &lt;em>nuisance parameters&lt;/em>.&lt;/p>
&lt;h2 id="6-understanding-fixed-effects-via-manual-demeaning">6. Understanding fixed effects via manual demeaning&lt;/h2>
&lt;h3 id="61-the-within-transformation">6.1 The within transformation&lt;/h3>
&lt;p>To build intuition for what fixed effects actually do, we can perform the &lt;em>within transformation&lt;/em> manually. For each observation, we subtract its group mean from both &lt;code>Y&lt;/code> and &lt;code>X1&lt;/code>. This removes all between-group variation, leaving only the deviations from each group&amp;rsquo;s average. Regressing the demeaned &lt;code>Y&lt;/code> on the demeaned &lt;code>X1&lt;/code> recovers the same coefficient as the FE estimator. It is like centering each group at the origin &amp;mdash; the only variation left is how individuals within a group differ from their group&amp;rsquo;s typical level.&lt;/p>
&lt;p>The fixed effects estimator solves:&lt;/p>
&lt;p>$$\hat{\beta}_{FE} = \left(\sum_{i=1}^{N} \ddot{X}_i' \ddot{X}_i\right)^{-1} \sum_{i=1}^{N} \ddot{X}_i' \ddot{Y}_i$$&lt;/p>
&lt;p>where $\ddot{X}_{it} = X_{it} - \bar{X}_i$ and $\ddot{Y}_{it} = Y_{it} - \bar{Y}_i$ are the demeaned variables. In words, the FE estimator uses only within-group deviations from group means, eliminating any bias from group-level confounders.&lt;/p>
&lt;pre>&lt;code class="language-python"># Manual demeaning (within transformation)
data_dm = data.copy()
for col in [&amp;quot;Y&amp;quot;, &amp;quot;X1&amp;quot;]:
    group_means = data_dm.groupby(&amp;quot;group_id&amp;quot;)[col].transform(&amp;quot;mean&amp;quot;)
    data_dm[f&amp;quot;{col}_dm&amp;quot;] = data_dm[col] - group_means
fit_demeaned = pf.feols(&amp;quot;Y_dm ~ X1_dm&amp;quot;, data=data_dm, vcov=&amp;quot;HC1&amp;quot;)
print(f&amp;quot;X1 coefficient (FE absorption): {fit_fe1.coef()['X1']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (manual demean): {fit_demeaned.coef()['X1_dm']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (OLS, no FE): {fit_ols.coef()['X1']:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">X1 coefficient (FE absorption): -1.0190
X1 coefficient (manual demean): -1.0190
X1 coefficient (OLS, no FE): -1.0001
&lt;/code>&lt;/pre>
&lt;p>The manual demeaning produces a coefficient of -1.0190, exactly matching the FE absorption result, while pooled OLS gave -1.0001. This confirms that fixed effects regression is mathematically equivalent to subtracting group means from every variable before running OLS. The gap between -1.019 (FE) and -1.000 (OLS) is the confounding contributed by between-group variation; it is small in this synthetic dataset, consistent with the weak correlation between &lt;code>X1&lt;/code> and the group effects noted earlier.&lt;/p>
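&lt;p>The same equivalence can be verified from first principles without PyFixest. The following minimal numpy sketch (simulated toy data, not this tutorial&amp;rsquo;s dataset) builds a panel where the regressor is correlated with group effects, then computes the slope once with explicit group dummies and once via the within transformation:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per = 5, 40
g = np.repeat(np.arange(n_groups), n_per)         # group ids
alpha = rng.normal(0, 2, n_groups)                # group intercepts
x = rng.normal(size=g.size) + 0.5 * alpha[g]      # x correlated with group effect
y = alpha[g] - 1.0 * x + rng.normal(size=g.size)  # true slope = -1

# (a) OLS with explicit group dummies (no separate global intercept)
D = np.eye(n_groups)[g]
beta_dummy = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# (b) within transformation: subtract group means, then slope of y_dm on x_dm
def demean(v):
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

x_dm, y_dm = demean(x), demean(y)
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)
print(beta_dummy, beta_within)  # identical up to floating point
```

Because `x` is built to correlate with the group effects, a pooled OLS slope on this data would be badly biased, while both fixed-effects computations recover a slope near -1.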
&lt;h3 id="62-visualizing-the-demeaning">6.2 Visualizing the demeaning&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: Raw data
for i, g in enumerate(sorted(groups)[:5]):
    subset = data[data[&amp;quot;group_id&amp;quot;] == g]
    axes[0].scatter(subset[&amp;quot;X1&amp;quot;], subset[&amp;quot;Y&amp;quot;], alpha=0.4, s=20,
                    color=cmap(i / n_groups))
axes[0].set_xlabel(&amp;quot;X1 (raw)&amp;quot;, fontsize=13)
axes[0].set_ylabel(&amp;quot;Y (raw)&amp;quot;, fontsize=13)
axes[0].set_title(&amp;quot;Raw Data: Between + Within Variation&amp;quot;, fontsize=13, fontweight=&amp;quot;bold&amp;quot;)
# Right: Demeaned data
axes[1].scatter(data_dm[&amp;quot;X1_dm&amp;quot;], data_dm[&amp;quot;Y_dm&amp;quot;], alpha=0.4, s=20, color=STEEL_BLUE)
x_range = np.linspace(data_dm[&amp;quot;X1_dm&amp;quot;].min(), data_dm[&amp;quot;X1_dm&amp;quot;].max(), 100)
y_pred = fit_demeaned.coef()[&amp;quot;X1_dm&amp;quot;] * x_range
axes[1].plot(x_range, y_pred, color=WARM_ORANGE, linewidth=2.5,
             label=f&amp;quot;FE slope = {fit_demeaned.coef()['X1_dm']:.3f}&amp;quot;)
axes[1].set_xlabel(&amp;quot;X1 (demeaned)&amp;quot;, fontsize=13)
axes[1].set_ylabel(&amp;quot;Y (demeaned)&amp;quot;, fontsize=13)
axes[1].set_title(&amp;quot;Demeaned Data: Within-Group Variation Only&amp;quot;, fontsize=13, fontweight=&amp;quot;bold&amp;quot;)
axes[1].legend(fontsize=11)
plt.savefig(&amp;quot;pyfixest_demeaning.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_demeaning.png" alt="Side-by-side comparison of raw data (left) showing scattered clusters at different vertical levels, and demeaned data (right) centered at the origin with a clear negative slope.">&lt;/p>
&lt;p>The left panel shows the raw data with groups scattered at different vertical levels &amp;mdash; this between-group variation is what confounds the OLS estimate. The right panel shows the demeaned data: all groups are now centered at the origin, and the clear negative slope of -1.019 reflects the pure within-group relationship. This visual makes the FE intuition concrete: by removing group averages, we eliminate confounding from any variable that is constant within groups. Now let us explore how to estimate multiple specifications efficiently.&lt;/p>
&lt;h2 id="7-multiple-estimation-with-stepwise-operators">7. Multiple estimation with stepwise operators&lt;/h2>
&lt;h3 id="71-cumulative-stepwise-fixed-effects">7.1 Cumulative stepwise fixed effects&lt;/h3>
&lt;p>One of PyFixest&amp;rsquo;s most powerful features is its formula operators for estimating multiple models in a single call. The &lt;code>csw0()&lt;/code> operator adds fixed effects &lt;em>cumulatively&lt;/em>: &lt;code>csw0(f1, f2)&lt;/code> estimates three models &amp;mdash; no FE, then &lt;code>f1&lt;/code> only, then &lt;code>f1 + f2&lt;/code> &amp;mdash; in one line. This is far more efficient than writing three separate calls and makes it easy to see how results change as we add controls.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_multi = pf.feols(&amp;quot;Y ~ X1 | csw0(f1, f2)&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
# Print summary for each model
models = fit_multi.all_fitted_models
for key in models:
    m = models[key]
    print(f&amp;quot;\nModel: {key}&amp;quot;)
    m.summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Model: Y~X1
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| Intercept | 0.919 | 0.112 | 8.223 | 0.000 |
| X1 | -1.000 | 0.082 | -12.134 | 0.000 |
---
RMSE: 2.158 R2: 0.123
Model: Y~X1|f1
Estimation: OLS
Dep. var.: Y, Fixed effects: f1
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| X1 | -0.949 | 0.067 | -14.094 | 0.000 |
---
RMSE: 1.73 R2: 0.437 R2 Within: 0.161
Model: Y~X1|f1+f2
Estimation: OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| X1 | -0.919 | 0.060 | -15.440 | 0.000 |
---
RMSE: 1.441 R2: 0.609 R2 Within: 0.200
&lt;/code>&lt;/pre>
&lt;p>The coefficient on &lt;code>X1&lt;/code> shifts from -1.000 (no FE) to -0.949 (with &lt;code>f1&lt;/code>) to -0.919 (with &lt;code>f1 + f2&lt;/code>), while the overall R-squared jumps from 0.123 to 0.437 to 0.609. Adding &lt;code>f1&lt;/code> alone explains an additional 31 percentage points of variation &amp;mdash; a massive improvement that shows how much group-level heterogeneity &lt;code>f1&lt;/code> captures. Adding &lt;code>f2&lt;/code> on top of &lt;code>f1&lt;/code> brings R-squared to 0.609, meaning the two fixed effect dimensions together account for over 60% of the total variation in &lt;code>Y&lt;/code>. The standard error on &lt;code>X1&lt;/code> also shrinks from 0.082 to 0.060, reflecting the precision gain from reducing residual noise.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Specification&lt;/th>
&lt;th>X1 Coef.&lt;/th>
&lt;th>SE&lt;/th>
&lt;th>R-squared&lt;/th>
&lt;th>R-squared Within&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>No FE&lt;/td>
&lt;td>-1.000&lt;/td>
&lt;td>0.082&lt;/td>
&lt;td>0.123&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FE: f1&lt;/td>
&lt;td>-0.949&lt;/td>
&lt;td>0.067&lt;/td>
&lt;td>0.437&lt;/td>
&lt;td>0.161&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FE: f1 + f2&lt;/td>
&lt;td>-0.919&lt;/td>
&lt;td>0.060&lt;/td>
&lt;td>0.609&lt;/td>
&lt;td>0.200&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="72-visualizing-coefficient-stability">7.2 Visualizing coefficient stability&lt;/h3>
&lt;p>The table above shows the numbers, but a figure makes the comparison more immediate. Plotting the coefficient with its 95% confidence interval across specifications reveals both the stability of the point estimate and the precision gain from adding fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python"># Coefficient comparison across specifications
model_names = [&amp;quot;No FE&amp;quot;, &amp;quot;FE: f1&amp;quot;, &amp;quot;FE: f1 + f2&amp;quot;]
coefs = [models[k].coef()[&amp;quot;X1&amp;quot;] for k in models]
ses = [models[k].se()[&amp;quot;X1&amp;quot;] for k in models]
fig, ax = plt.subplots(figsize=(8, 5))
y_pos = np.arange(len(model_names))
ax.barh(y_pos, coefs, xerr=[1.96 * s for s in ses], height=0.5,
        color=[STEEL_BLUE, WARM_ORANGE, TEAL], edgecolor=DARK_NAVY, capsize=5)
ax.set_yticks(y_pos)
ax.set_yticklabels(model_names, fontsize=12)
ax.set_xlabel(&amp;quot;Coefficient on X1&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Effect of X1 Across Fixed Effect Specifications&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.axvline(x=0, color=NEAR_BLACK, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_coef_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_coef_comparison.png" alt="Horizontal bar chart comparing X1 coefficient estimates across no FE, one-way FE, and two-way FE specifications, all showing negative effects near -1.0 with narrowing confidence intervals.">&lt;/p>
&lt;p>The coefficient comparison chart shows that the point estimate on &lt;code>X1&lt;/code> remains stable around -1.0 across all three specifications, with confidence intervals narrowing as we add fixed effects. This stability suggests the estimate is robust to the inclusion of group-level controls. In applied research, large shifts across specifications would signal omitted variable concerns, making this type of comparison essential for assessing credibility.&lt;/p>
&lt;h2 id="8-inference-choosing-the-right-standard-errors">8. Inference: choosing the right standard errors&lt;/h2>
&lt;h3 id="81-comparing-standard-error-estimators">8.1 Comparing standard error estimators&lt;/h3>
&lt;p>The choice of standard errors can dramatically change statistical inference, even when point estimates remain the same. Standard (iid) errors assume all observations are independent and identically distributed. Heteroskedasticity-robust (HC1) errors relax the constant-variance assumption. Cluster-robust (CRV) errors account for arbitrary correlation within groups &amp;mdash; essential when observations within a group are not independent, like repeated measurements of the same individual. Think of it like estimating average height: if you measure the same person ten times, those ten measurements are not ten independent observations, and your standard error should reflect that.&lt;/p>
&lt;pre>&lt;code class="language-python">se_types = {
    &amp;quot;iid&amp;quot;: &amp;quot;iid&amp;quot;,
    &amp;quot;HC1 (robust)&amp;quot;: &amp;quot;HC1&amp;quot;,
    &amp;quot;CRV1 (group_id)&amp;quot;: {&amp;quot;CRV1&amp;quot;: &amp;quot;group_id&amp;quot;},
    &amp;quot;CRV1 (group_id + f2)&amp;quot;: {&amp;quot;CRV1&amp;quot;: &amp;quot;group_id + f2&amp;quot;},
    &amp;quot;CRV3 (group_id)&amp;quot;: {&amp;quot;CRV3&amp;quot;: &amp;quot;group_id&amp;quot;},
}
print(f&amp;quot;{'SE Type':&amp;lt;22} {'SE(X1)':&amp;lt;10} {'t-stat':&amp;lt;10} {'p-value':&amp;lt;10}&amp;quot;)
print(&amp;quot;-&amp;quot; * 52)
for name, vcov in se_types.items():
    fit_tmp = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=vcov)
    print(f&amp;quot;{name:&amp;lt;22} {fit_tmp.se()['X1']:&amp;lt;10.4f} &amp;quot;
          f&amp;quot;{fit_tmp.tstat()['X1']:&amp;lt;10.3f} {fit_tmp.pvalue()['X1']:&amp;lt;10.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">SE Type SE(X1) t-stat p-value
----------------------------------------------------
iid 0.0858 -11.875 0.0000
HC1 (robust) 0.0833 -12.234 0.0000
CRV1 (group_id) 0.1172 -8.696 0.0000
CRV1 (group_id + f2) 0.1207 -8.445 0.0000
CRV3 (group_id) 0.1247 -8.174 0.0000
&lt;/code>&lt;/pre>
&lt;p>The standard error on &lt;code>X1&lt;/code> ranges from 0.0833 (HC1) to 0.1247 (CRV3), a 50% increase depending on the assumption about error correlation. While all p-values remain below 0.001 in this case, the t-statistic drops from 12.2 to 8.2 &amp;mdash; a substantial difference that could determine significance for weaker effects. Cluster-robust SEs (CRV1) inflate to 0.1172 because they account for within-group correlation. The CRV3 estimator, which provides a more conservative finite-sample correction, gives the largest SE of 0.1247. In practice, you should cluster at the level where you believe errors are correlated.&lt;/p>
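&lt;p>The mechanics behind the cluster-robust inflation can be sketched by hand. Here is a minimal numpy implementation of the CRV1 sandwich formula on simulated data (not this tutorial&amp;rsquo;s dataset), where both the regressor and the errors carry a within-cluster component so the clustered SE visibly exceeds the iid one:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_per = 30, 20
g = np.repeat(np.arange(G), n_per)
n = g.size
x = 0.7 * rng.normal(size=G)[g] + rng.normal(size=n)  # cluster component in x
u = rng.normal(size=G)[g] + rng.normal(size=n)        # cluster component in errors
y = 2.0 * x + u

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# iid standard errors for comparison
sigma2 = resid @ resid / (n - k)
se_iid = np.sqrt(np.diag(sigma2 * bread))

# CRV1 "meat": sum over clusters of (X_g' u_g)(X_g' u_g)'
meat = np.zeros((k, k))
for j in range(G):
    s = X[g == j].T @ resid[g == j]
    meat += np.outer(s, s)
c = G / (G - 1) * (n - 1) / (n - k)  # CRV1 small-sample correction
se_crv1 = np.sqrt(np.diag(c * bread @ meat @ bread))
print(se_iid[1], se_crv1[1])  # clustered SE on the slope is markedly larger
```

The iid formula treats all 600 observations as independent; the sandwich sums score contributions cluster by cluster, letting errors correlate arbitrarily within each of the 30 clusters, which is exactly why the SE grows.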
&lt;h3 id="82-visualizing-the-se-tradeoff">8.2 Visualizing the SE tradeoff&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
se_names = list(se_types.keys())
se_vals = []
for name, vcov in se_types.items():
    fit_tmp = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=vcov)
    se_vals.append(fit_tmp.se()[&amp;quot;X1&amp;quot;])
colors = [STEEL_BLUE, WARM_ORANGE, TEAL, &amp;quot;#e8956a&amp;quot;, &amp;quot;#f0a88c&amp;quot;]
bars = ax.bar(range(len(se_names)), se_vals, color=colors, edgecolor=DARK_NAVY, width=0.6)
ax.set_xticks(range(len(se_names)))
ax.set_xticklabels(se_names, rotation=25, ha=&amp;quot;right&amp;quot;, fontsize=10)
ax.set_ylabel(&amp;quot;Standard Error of X1&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Standard Errors Under Different Assumptions&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
for i, v in enumerate(se_vals):
    ax.text(i, v + 0.002, f&amp;quot;{v:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, fontsize=10, fontweight=&amp;quot;bold&amp;quot;)
plt.savefig(&amp;quot;pyfixest_se_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_se_comparison.png" alt="Bar chart showing standard errors increasing from iid (0.0858) to CRV3 (0.1247), illustrating how clustering assumptions inflate uncertainty.">&lt;/p>
&lt;p>The bar chart makes the progression vivid: moving from iid to cluster-robust standard errors increases uncertainty by nearly 50%. The iid and HC1 estimates are similar because heteroskedasticity is not a major concern here. The real jump occurs when we account for within-group correlation (CRV1), and the CRV3 bias-corrected estimator is the most conservative. For applied work with grouped data, defaulting to cluster-robust errors is the safest choice &amp;mdash; underestimating standard errors leads to falsely significant results.&lt;/p>
&lt;h2 id="9-two-way-fixed-effects">9. Two-way fixed effects&lt;/h2>
&lt;p>When data has two grouping dimensions &amp;mdash; for example, firms and years, or workers and occupations &amp;mdash; two-way fixed effects absorb unobserved heterogeneity along both dimensions. In PyFixest, we simply list both FE variables after the pipe: &lt;code>Y ~ X1 + X2 | f1 + f2&lt;/code>. This absorbs all factors that are constant within each level of &lt;code>f1&lt;/code> and each level of &lt;code>f2&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_twoway = pf.feols(&amp;quot;Y ~ X1 + X2 | f1 + f2&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
fit_twoway.summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -0.924 | 0.056 | -16.375 | 0.000 | -1.035 | -0.813 |
| X2 | -0.174 | 0.015 | -11.246 | 0.000 | -0.204 | -0.144 |
---
RMSE: 1.346 R2: 0.659 R2 Within: 0.303
&lt;/code>&lt;/pre>
&lt;p>Adding both &lt;code>f1&lt;/code> and &lt;code>f2&lt;/code> as fixed effects plus the additional covariate &lt;code>X2&lt;/code> yields an R-squared of 0.659 and a within R-squared of 0.303. The coefficient on &lt;code>X1&lt;/code> is -0.924 (SE = 0.056) and &lt;code>X2&lt;/code> is -0.174 (SE = 0.015), both highly significant. The within R-squared of 0.303 means that &lt;code>X1&lt;/code> and &lt;code>X2&lt;/code> together explain about 30% of the variation in &lt;code>Y&lt;/code> after absorbing both dimensions of fixed effects &amp;mdash; a substantial improvement over the 20% with &lt;code>X1&lt;/code> alone in the previous section.&lt;/p>
&lt;h2 id="10-instrumental-variables-with-fixed-effects">10. Instrumental variables with fixed effects&lt;/h2>
&lt;p>Sometimes the explanatory variable itself is &lt;em>endogenous&lt;/em> &amp;mdash; correlated with the error term due to measurement error, simultaneity, or omitted variables that fixed effects do not capture. Instrumental variables (IV) estimation addresses this by using external variables (instruments) that affect the outcome only through the endogenous variable. Think of instruments as a natural experiment embedded in the data: &lt;code>Z&lt;/code> affects &lt;code>X&lt;/code> but has no direct path to &lt;code>Y&lt;/code>, so any association between &lt;code>Z&lt;/code> and &lt;code>Y&lt;/code> must flow through &lt;code>X&lt;/code>. In PyFixest, the IV syntax uses a second pipe: &lt;code>Y2 ~ 1 | f1 + f2 | X1 ~ Z1 + Z2&lt;/code>. This reads: outcome &lt;code>Y2&lt;/code>, no exogenous controls (just the intercept &lt;code>1&lt;/code>), fixed effects &lt;code>f1 + f2&lt;/code>, and endogenous variable &lt;code>X1&lt;/code> instrumented by &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>.&lt;/p>
&lt;p>The IV estimator recovers the coefficient on &lt;code>X1&lt;/code> by first predicting &lt;code>X1&lt;/code> using the instruments, then using these predictions in the second-stage regression:&lt;/p>
&lt;p>$$\text{First stage: } X_1 = \pi_0 + \pi_1 Z_1 + \pi_2 Z_2 + \alpha_i + \gamma_t + \nu$$&lt;/p>
&lt;p>$$\text{Second stage: } Y_2 = \beta X_1^{predicted} + \alpha_i + \gamma_t + \epsilon$$&lt;/p>
&lt;p>In words, the first stage isolates the variation in &lt;code>X1&lt;/code> that is driven by the instruments &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>, stripping away the endogenous component. The second stage then uses only this &amp;ldquo;clean&amp;rdquo; variation to estimate the effect of &lt;code>X1&lt;/code> on &lt;code>Y2&lt;/code>. Here, $\alpha_i$ corresponds to the &lt;code>f1&lt;/code> fixed effects, $\gamma_t$ corresponds to the &lt;code>f2&lt;/code> fixed effects, and $\beta$ is the causal parameter of interest that we recover from the &lt;code>X1&lt;/code> coefficient in PyFixest&amp;rsquo;s output.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_iv = pf.feols(&amp;quot;Y2 ~ 1 | f1 + f2 | X1 ~ Z1 + Z2&amp;quot;, data=data)
fit_iv.summary()
print(f&amp;quot;\nFirst-stage F-statistic: {fit_iv._f_stat_1st_stage:.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: IV
Dep. var.: Y2, Fixed effects: f1+f2
Inference: iid
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -1.600 | 0.336 | -4.768 | 0.000 | -2.259 | -0.942 |
---
First-stage F-statistic: 311.54
&lt;/code>&lt;/pre>
&lt;p>The IV estimate of &lt;code>X1&lt;/code> is -1.600 (SE = 0.336), substantially larger in magnitude than the OLS estimate of approximately -1.0. This divergence suggests that the OLS coefficient on &lt;code>X1&lt;/code> is attenuated &amp;mdash; a classic sign of measurement error or endogeneity that biases OLS toward zero. The first-stage F-statistic of 311.54 is well above the conventional threshold of 10, indicating that &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code> are strong instruments. Strong instruments mean the IV estimate is reliable; with weak instruments, IV can perform worse than OLS. Note that with heterogeneous treatment effects, IV identifies the &lt;em>Local Average Treatment Effect&lt;/em> (LATE) &amp;mdash; the effect for units whose treatment status is shifted by the instruments &amp;mdash; rather than the Average Treatment Effect (ATE) for the entire population.&lt;/p>
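&lt;p>The two-stage logic above can be replicated by hand. The following numpy sketch of 2SLS uses a single instrument on simulated data (hypothetical variables, not this tutorial&amp;rsquo;s dataset): a common unobserved shock makes OLS biased, while instrumenting recovers the true coefficient:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
z = rng.normal(size=n)                    # instrument: shifts x, excluded from y
e_common = rng.normal(size=n)             # unobserved shock hitting both x and y
x = 0.8 * z + e_common + rng.normal(size=n)
y = -1.5 * x + 2.0 * e_common + rng.normal(size=n)  # true effect of x is -1.5

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS: biased because e_common sits in both x and the error term
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

# 2SLS by hand: first stage x on z, second stage y on the fitted values
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
beta_iv = np.linalg.lstsq(np.column_stack([np.ones(n), x_hat]), y, rcond=None)[0][1]
print(beta_ols, beta_iv)  # OLS is pulled toward zero; 2SLS lands near -1.5
```

Note this manual version gets the point estimate right but not the standard errors: valid 2SLS inference must use residuals computed with the actual `x`, not the fitted values, which is one reason to rely on PyFixest&amp;rsquo;s built-in IV syntax in practice.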
&lt;h2 id="11-panel-data-application-wage-determinants">11. Panel data application: wage determinants&lt;/h2>
&lt;h3 id="111-the-wage-panel-variables-and-structure">11.1 The wage panel: variables and structure&lt;/h3>
&lt;p>To see fixed effects in action with real data, we analyze the Vella and Verbeek (1998) panel of 545 young men observed over 8 years (1980&amp;ndash;1987) from the National Longitudinal Survey of Youth (NLSY). This dataset, used in many econometrics textbooks, is ideal for studying wage determinants because it tracks the same workers as they enter the labor market, gain experience, change jobs, and make decisions about union membership and marriage. The key challenge is that unobserved individual ability differs across workers and correlates with both wages and these covariates &amp;mdash; a classic case for one-way fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">url = &amp;quot;https://raw.githubusercontent.com/bashtage/linearmodels/main/linearmodels/datasets/wage_panel/wage_panel.csv.bz2&amp;quot;
wage_df = pd.read_csv(url, compression=&amp;quot;bz2&amp;quot;)
print(f&amp;quot;Wage panel shape: {wage_df.shape}&amp;quot;)
print(wage_df.describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Wage panel shape: (4360, 12)
nr year black exper hisp ... educ union lwage expersq occupation
count 4360.000 4360.000 4360.000 4360.000 4360.000 ... 4360.000 4360.000 4360.000 4360.000 4360.000
mean 5262.059 1983.500 0.116 6.500 0.161 ... 11.768 0.244 1.649 50.425 4.989
std 3496.150 2.292 0.320 2.292 0.367 ... 1.353 0.430 0.533 40.782 2.320
min 13.000 1980.000 0.000 1.000 0.000 ... 3.000 0.000 -3.579 1.000 1.000
25% 2329.000 1981.750 0.000 4.750 0.000 ... 11.000 0.000 1.351 16.000 4.000
50% 4569.000 1983.500 0.000 6.500 0.000 ... 12.000 0.000 1.671 36.000 5.000
75% 8406.000 1985.250 0.000 8.250 0.000 ... 12.000 0.000 1.991 81.000 6.000
max 12548.000 1987.000 1.000 12.000 1.000 ... 16.000 1.000 4.052 324.000 9.000
&lt;/code>&lt;/pre>
&lt;p>The panel contains 4,360 observations (545 individuals over 8 years) with 12 variables. Before running any model, it is important to understand how each variable is defined and measured.&lt;/p>
&lt;p>&lt;strong>Outcome variable:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>lwage&lt;/code> &amp;mdash; the natural logarithm of hourly wage. The log transformation means that coefficients are interpreted as approximate percentage changes. The mean of 1.649 corresponds to about \$5.20 per hour in 1980s dollars ($e^{1.649} \approx 5.20$). The standard deviation of 0.533 indicates substantial wage dispersion: the gap between a worker at the 25th percentile (\$3.86/hr) and the 75th percentile (\$7.32/hr) is roughly a doubling of wages.&lt;/li>
&lt;/ul>
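&lt;p>The log-to-dollar conversions quoted above are quick to verify directly from the summary statistics:&lt;/p>

```python
import numpy as np

# Log-wage values taken from the describe() output above
mean_lwage, p25, p75 = 1.649, 1.351, 1.991
print(np.exp(mean_lwage))        # about 5.20 dollars/hour at the mean
print(np.exp(p25), np.exp(p75))  # about 3.86 and 7.32 at the quartiles
print(np.exp(p75 - p25))         # about 1.9: roughly a doubling across the IQR
```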
&lt;p>&lt;strong>Time-varying covariates&lt;/strong> (change within a worker over time):&lt;/p>
&lt;ul>
&lt;li>&lt;code>hours&lt;/code> &amp;mdash; annual hours worked. Mean of 2,191 (roughly 42 hours per week for 52 weeks). Ranges from 120 to 4,992, capturing both part-time spells and heavy overtime. We include hours to control for labor supply differences that affect hourly wage calculations.&lt;/li>
&lt;li>&lt;code>union&lt;/code> &amp;mdash; binary indicator (1 = covered by a union contract in the current year, 0 = not covered). About 24.4% of person-year observations are union-covered. Workers can move in and out of union jobs across years, and this within-worker variation in union status is what one-way FE use to identify the union wage premium.&lt;/li>
&lt;li>&lt;code>married&lt;/code> &amp;mdash; binary indicator (1 = currently married, 0 = not married). About 43.9% of observations are married. Since these are young men tracked from their early twenties, many transition from single to married during the panel, providing within-worker variation.&lt;/li>
&lt;li>&lt;code>exper&lt;/code> &amp;mdash; years of potential labor market experience, defined as age minus years of education minus 6. Ranges from 1 to 12 years. In this balanced panel where every worker is observed in every year, experience increases by exactly 1 each year, making it perfectly collinear with entity + year fixed effects. We therefore use &lt;code>expersq&lt;/code> instead in FE models.&lt;/li>
&lt;li>&lt;code>expersq&lt;/code> &amp;mdash; experience squared ($exper^2$). Captures the well-documented concavity in the experience&amp;ndash;earnings profile: wages rise with experience but at a diminishing rate. Unlike &lt;code>exper&lt;/code>, the squared term is a nonlinear function of time, so it is not collinear with entity + year FE and can be estimated.&lt;/li>
&lt;li>&lt;code>occupation&lt;/code> &amp;mdash; occupational category, coded 1 through 9 (9 distinct categories). Workers can and do switch occupations across years. This variable can be used as an additional fixed effect dimension.&lt;/li>
&lt;/ul>
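&lt;p>The collinearity point about &lt;code>exper&lt;/code> is worth demonstrating. In a balanced panel, experience is exactly a worker effect plus a year effect, so two-way demeaning annihilates it, while the squared term survives. A small numpy sketch on a toy panel (not the NLSY data):&lt;/p>

```python
import numpy as np

# Balanced toy panel: 4 workers x 3 years, exper rises by 1 each year
workers = np.repeat(np.arange(4), 3)
years = np.tile(np.arange(3), 4)
start = np.array([2.0, 5.0, 1.0, 7.0])  # experience in year 0
exper = start[workers] + years          # worker effect + year effect, exactly

def demean_by(v, idx):
    means = np.bincount(idx, weights=v) / np.bincount(idx)
    return v - means[idx]

# Two-way within transformation (balanced panel: one pass each way suffices)
resid_exper = demean_by(demean_by(exper, workers), years)
resid_expersq = demean_by(demean_by(exper**2, workers), years)
print(np.allclose(resid_exper, 0))    # True: exper is fully absorbed
print(np.allclose(resid_expersq, 0))  # False: the squared term survives
```

The surviving part of `exper**2` is the interaction of the worker and year deviations, which is why `expersq` remains estimable under entity + year fixed effects while `exper` does not.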
&lt;p>&lt;strong>Time-invariant covariates&lt;/strong> (fixed for each worker across all years):&lt;/p>
&lt;ul>
&lt;li>&lt;code>educ&lt;/code> &amp;mdash; years of completed schooling at the start of the panel. Mean of 11.77 years (just below a high school diploma), ranging from 3 to 16 years. Because the sample tracks young men who have already finished their schooling, education does not change over time. The median of 12 years (exactly a high school diploma) and the 75th percentile of 12 years indicate that most workers in this sample have a high school education, with a smaller group holding college degrees.&lt;/li>
&lt;li>&lt;code>black&lt;/code> &amp;mdash; binary indicator (1 = Black, 0 = non-Black). About 11.6% of workers are Black. Because race does not change over time, one-way FE absorb any wage differences associated with being Black.&lt;/li>
&lt;li>&lt;code>hisp&lt;/code> &amp;mdash; binary indicator (1 = Hispanic, 0 = non-Hispanic). About 16.1% of workers are Hispanic. Like &lt;code>black&lt;/code>, this is absorbed by one-way FE.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Panel identifiers:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>nr&lt;/code> &amp;mdash; unique worker identifier (545 distinct workers). This defines the entity dimension for fixed effects.&lt;/li>
&lt;li>&lt;code>year&lt;/code> &amp;mdash; calendar year, taking values 1980 through 1987. The panel is balanced: every worker appears in every year, giving exactly $545 \times 8 = 4,360$ observations.&lt;/li>
&lt;/ul>
&lt;p>The distinction between time-varying and time-invariant variables is the most consequential feature of this dataset for fixed effects analysis. Time-invariant variables will be perfectly collinear with entity dummies and cannot be estimated under one-way FE. Time-varying variables survive the within transformation and their effects can be identified. We verify this classification empirically:&lt;/p>
&lt;pre>&lt;code class="language-python">invariance = wage_df.groupby(&amp;quot;nr&amp;quot;)[[&amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;]].nunique()
print(&amp;quot;Max unique values per worker:&amp;quot;)
print(invariance.max())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Max unique values per worker:
educ 1
black 1
hisp 1
dtype: int64
&lt;/code>&lt;/pre>
&lt;p>Each worker has exactly one value of education, race, and ethnicity across all eight years &amp;mdash; confirming these are truly time-invariant. By contrast, occupation is time-varying:&lt;/p>
&lt;pre>&lt;code class="language-python">occ_changes = wage_df.groupby(&amp;quot;nr&amp;quot;)[&amp;quot;occupation&amp;quot;].nunique()
print(f&amp;quot;Workers who change occupation: {(occ_changes &amp;gt; 1).sum()} / {len(occ_changes)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Workers who change occupation: 484 / 545
&lt;/code>&lt;/pre>
&lt;p>Nearly 89% of workers switch occupations at least once during the panel. This high rate of switching makes occupation a valid candidate for a fixed effect dimension of its own (Section 11.5). By contrast, a variable like education, which never changes within a worker, would produce a column of zeros after demeaning and must be dropped &amp;mdash; a point we return to in Sections 11.3 and 11.4.&lt;/p>
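&lt;p>The &amp;ldquo;column of zeros&amp;rdquo; point is easy to see directly. A tiny pandas sketch with a toy panel (hypothetical numbers, mirroring the wage panel&amp;rsquo;s structure):&lt;/p>

```python
import pandas as pd

toy = pd.DataFrame({
    "nr":    [1, 1, 2, 2, 3, 3],        # three workers, two years each
    "educ":  [12, 12, 16, 16, 10, 10],  # never changes within a worker
    "union": [0, 1, 1, 1, 0, 0],        # changes for worker 1
})
demeaned = toy[["educ", "union"]] - toy.groupby("nr")[["educ", "union"]].transform("mean")
print(demeaned)  # educ demeans to exactly zero; union retains variation
```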
&lt;h3 id="112-within-vs-between-variation">11.2 Within vs between variation&lt;/h3>
&lt;p>Before estimating any model, it helps to decompose the variation in each variable into &lt;em>between-worker&lt;/em> variation (permanent differences across workers) and &lt;em>within-worker&lt;/em> variation (changes over a worker&amp;rsquo;s career). This decomposition foreshadows what one-way fixed effects can and cannot estimate.&lt;/p>
&lt;pre>&lt;code class="language-python">cols = [&amp;quot;lwage&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;expersq&amp;quot;, &amp;quot;educ&amp;quot;]
between = wage_df.groupby(&amp;quot;nr&amp;quot;)[cols].mean().std()
for col in cols:
wage_df[f&amp;quot;{col}_within&amp;quot;] = wage_df[col] - wage_df.groupby(&amp;quot;nr&amp;quot;)[col].transform(&amp;quot;mean&amp;quot;)
within = wage_df[[f&amp;quot;{c}_within&amp;quot; for c in cols]].std()
variation = pd.DataFrame({&amp;quot;Between&amp;quot;: between, &amp;quot;Within&amp;quot;: within}).round(4)
print(variation)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Between Within
lwage 0.3907 0.3623
hours 381.7831 418.6057
union 0.3294 0.2760
married 0.3766 0.3236
expersq 26.3513 31.1431
educ 1.7476 0.0000
&lt;/code>&lt;/pre>
&lt;p>The raw standard deviations differ wildly across variables (hours is in the hundreds, union is a fraction), so we normalize by computing each variable&amp;rsquo;s &lt;em>within share&lt;/em> &amp;mdash; the fraction of total variation that comes from within-worker changes over time. This puts all variables on the same 0&amp;ndash;100% scale:&lt;/p>
&lt;pre>&lt;code class="language-python">total = np.sqrt(between**2 + within**2)
within_share = (within / total).fillna(0)  # educ has zero within variation, so its share is 0
between_share = 1 - within_share
fig, ax = plt.subplots(figsize=(10, 5))
y_pos = np.arange(len(cols))
bar_height = 0.55
# Stacked horizontal bars: between (left) + within (right) = 100%
ax.barh(y_pos, between_share.values, bar_height,
        label=&amp;quot;Between (cross-worker)&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.barh(y_pos, within_share.values, bar_height, left=between_share.values,
        label=&amp;quot;Within (over career)&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.set_yticks(y_pos)
ax.set_yticklabels(cols)  # name each bar after its variable
ax.legend()
plt.savefig(&amp;quot;pyfixest_within_between.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_within_between.png" alt="Stacked horizontal bar chart showing the within vs between share of total variation for key wage panel variables, with education at 100% between variation.">&lt;/p>
&lt;p>The decomposition reveals a critical pattern. Education is 100% between-worker variation &amp;mdash; its within share is exactly 0% &amp;mdash; because no worker changes their education level during the panel. This means one-way FE literally cannot estimate education&amp;rsquo;s effect: the demeaned education column is all zeros. Log wages have a 68% within share and 32% between share, meaning most wage variation comes from changes over a worker&amp;rsquo;s career rather than permanent differences across workers. Variables with substantial within shares &amp;mdash; union (64%), married (65%), hours (74%), expersq (76%) &amp;mdash; can be estimated under one-way FE because they change over a worker&amp;rsquo;s career. The higher the within share, the more statistical power one-way FE retains for that variable.&lt;/p>
&lt;h3 id="113-the-mincer-equation-and-its-panel-extensions">11.3 The Mincer equation and its panel extensions&lt;/h3>
&lt;p>Before estimating any models, it helps to lay out the econometric framework that organizes all subsequent specifications. The &lt;strong>classic Mincer equation&lt;/strong> (Mincer, 1974) is the workhorse model of labor economics:&lt;/p>
&lt;p>$$\ln(wage_i) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper_i^2 + \epsilon_i$$&lt;/p>
&lt;p>This log-linear specification models wages as a function of years of schooling and experience, with experience entering quadratically to capture concave returns &amp;mdash; each additional year of experience raises wages, but by a diminishing amount. It is a cross-sectional model, estimating the average relationship across all workers at a single point in time.&lt;/p>
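&lt;p>The concavity is easy to see mechanically. Differentiating the quadratic gives a marginal return of $\beta_2 + 2\beta_3 \cdot exper$, which declines linearly in experience and crosses zero at the peak $exper^* = -\beta_2 / (2\beta_3)$. The quick sketch below uses illustrative placeholder coefficients, not estimates from this data:&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative (not estimated) Mincer coefficients
beta2, beta3 = 0.08, -0.002

def marginal_return(exper):
    # d ln(wage) / d exper for the quadratic profile
    return beta2 + 2 * beta3 * exper

peak = -beta2 / (2 * beta3)
print(f"at  5 years: {marginal_return(5):+.3f}")   # +0.060
print(f"at 25 years: {marginal_return(25):+.3f}")  # -0.020
print(f"profile peaks at {peak:.0f} years")        # 20 years
&lt;/code>&lt;/pre>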
&lt;p>The &lt;strong>extended Mincer equation&lt;/strong> adds controls for union membership, marital status, hours worked, and demographic characteristics:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta_0 + \beta_1 educ_i + \beta_2 expersq_{it} + \beta_3 union_{it} + \beta_4 married_{it} + \beta_5 hours_{it} + \beta_6 black_i + \beta_7 hisp_i + \epsilon_{it}$$&lt;/p>
&lt;p>The &lt;strong>panel FE extension&lt;/strong> replaces explicit controls for time-invariant characteristics with entity and time fixed effects:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \alpha_i + \delta_t + \epsilon_{it}$$&lt;/p>
&lt;p>where $X_{it}$ denotes time-varying covariates (union, married, hours, experience), $Z_i$ denotes time-invariant characteristics (education, race), $\alpha_i$ captures one-way fixed effects (one intercept per worker), and $\delta_t$ captures year fixed effects. The key insight: when we include $\alpha_i$, the time-invariant variables $Z_i$ become perfectly collinear with the entity dummies and are absorbed. We gain protection against omitted variable bias from all unobserved time-invariant confounders, but we lose the ability to estimate $\gamma$.&lt;/p>
&lt;p>The &lt;strong>CRE/Mundlak extension&lt;/strong> &amp;mdash; the Mundlak (1978) device &amp;mdash; offers a way to recover $\gamma$:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \pi \bar{X}_i + \epsilon_{it}$$&lt;/p>
&lt;p>where $\bar{X}_i$ are individual means of the time-varying variables. This replaces entity dummies with individual means, which model the correlation between unobserved heterogeneity and the covariates. The result: $\hat{\beta} \approx \hat{\beta}_{FE}$ for the time-varying variables, while $\gamma$ is now estimable because we no longer include entity dummies that absorb it.&lt;/p>
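&lt;p>The claim that $\hat{\beta} \approx \hat{\beta}_{FE}$ is in fact exact in a balanced panel, and it is easy to verify on simulated data. The sketch below uses synthetic data (not the wage panel) and plain NumPy least squares: the within estimator and OLS on $(1, x_{it}, \bar{x}_i)$ return the same slope on $x$ even though the heterogeneity is built to correlate with $\bar{x}_i$:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n, t = 200, 6
x = rng.normal(size=(n, t))
alpha = 2.0 * x.mean(axis=1) + rng.normal(size=n)   # heterogeneity correlated with x-bar
y = 0.5 * x + alpha[:, None] + rng.normal(size=(n, t))

# Within (FE) estimator: demean y and x by individual, then compute the slope
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_fe = (xd * yd).sum() / (xd ** 2).sum()

# Mundlak regression: y on an intercept, x, and the individual mean of x
xbar = np.repeat(x.mean(axis=1), t)
X = np.column_stack([np.ones(n * t), x.ravel(), xbar])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]

print(beta_fe, beta_mundlak)   # identical up to floating-point error
&lt;/code>&lt;/pre>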
&lt;p>Sections 11.4&amp;ndash;11.7 estimate these models progressively: pooled OLS and one-way FE (11.4), two-way and three-way FE (11.5), interactive fixed effects with group-specific year effects (11.6), and CRE/Mundlak (11.7).&lt;/p>
&lt;h3 id="114-from-pooled-ols-to-one-way-fe-the-education-tradeoff">11.4 From pooled OLS to one-way FE: the education tradeoff&lt;/h3>
&lt;p>We begin with the extended Mincer equation estimated by pooled OLS, which includes both time-varying and time-invariant variables:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_pooled = pf.feols(
    &amp;quot;lwage ~ educ + expersq + union + married + hours + black + hisp&amp;quot;,
    data=wage_df, vcov=&amp;quot;HC1&amp;quot;
)
print(fit_pooled.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: 0
Inference: HC1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.265 | 0.069 | 3.823 | 0.000 | 0.129 | 0.402 |
| educ | 0.106 | 0.005 | 22.924 | 0.000 | 0.097 | 0.115 |
| expersq | 0.003 | 0.000 | 16.930 | 0.000 | 0.003 | 0.004 |
| union | 0.183 | 0.016 | 11.205 | 0.000 | 0.151 | 0.215 |
| married | 0.141 | 0.015 | 9.308 | 0.000 | 0.111 | 0.171 |
| hours | -0.000 | 0.000 | -3.139 | 0.002 | -0.000 | -0.000 |
| black | -0.135 | 0.024 | -5.549 | 0.000 | -0.182 | -0.087 |
| hisp | 0.013 | 0.020 | 0.670 | 0.503 | -0.025 | 0.052 |
---
RMSE: 0.484 R2: 0.175
&lt;/code>&lt;/pre>
&lt;p>Pooled OLS estimates a 10.6% return to each year of education, an 18.3% union premium, and a 14.1% marriage premium. Black workers earn about 13.5% less, while the Hispanic coefficient is small and insignificant. The R-squared is 0.175 &amp;mdash; these variables explain less than a fifth of wage variation.&lt;/p>
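&lt;p>One caveat on reading these numbers: coefficients in a log-wage regression are log points, and the exact percentage effect is $e^{\beta} - 1$. The quick check below shows the approximation is close for small coefficients but drifts for larger ones; the 0.183 union coefficient is closer to a 20% premium than 18.3%:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

for name, b in [("educ", 0.106), ("union", 0.183), ("married", 0.141)]:
    exact = math.exp(b) - 1
    print(f"{name}: {b:.3f} log points → {100 * exact:.1f}% exact effect")
&lt;/code>&lt;/pre>
&lt;p>We follow the common convention of reading log points directly as percentages, keeping this approximation in mind.&lt;/p>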
&lt;p>Now we estimate the one-way FE model, which absorbs all time-invariant worker characteristics:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_entity = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr&amp;quot;,
                      data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(fit_entity.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: nr
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| expersq | 0.004 | 0.000 | 16.537 | 0.000 | 0.003 | 0.004 |
| union | 0.078 | 0.024 | 3.319 | 0.001 | 0.032 | 0.125 |
| married | 0.115 | 0.022 | 5.217 | 0.000 | 0.071 | 0.158 |
| hours | -0.000 | 0.000 | -3.807 | 0.000 | -0.000 | -0.000 |
---
RMSE: 0.335 R2: 0.605 R2 Within: 0.145
&lt;/code>&lt;/pre>
&lt;p>One-way fixed effects dramatically improve model fit: R-squared jumps from 0.175 (pooled OLS) to 0.605, meaning worker-level heterogeneity accounts for over 40 percentage points of explained variation. The union premium drops from 18.3% to 7.8% (SE = 0.024) &amp;mdash; more than half the pooled estimate was driven by selection (workers who join unions differ systematically from those who do not). The marriage premium falls from 14.1% to 11.5% (SE = 0.022), a smaller reduction suggesting that marital status is less confounded by unobserved ability. The &lt;code>expersq&lt;/code> coefficient of 0.004 captures the concavity of the experience&amp;ndash;earnings profile within workers over time. Notice that &lt;code>educ&lt;/code>, &lt;code>black&lt;/code>, and &lt;code>hisp&lt;/code> are absent: these time-invariant variables are perfectly collinear with the 545 worker dummies and cannot be estimated under one-way FE.&lt;/p>
&lt;p>To see what happens when we try to include a time-invariant variable alongside one-way FE:&lt;/p>
&lt;pre>&lt;code class="language-python">import warnings
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter(&amp;quot;always&amp;quot;)
    fit_educ = pf.feols(&amp;quot;lwage ~ expersq + union + married + educ | nr&amp;quot;,
                        data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(f&amp;quot;Coefficients estimated: {list(fit_educ.coef().index)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients estimated: ['expersq', 'union', 'married']
&lt;/code>&lt;/pre>
&lt;p>Education is silently dropped. This is not a bug &amp;mdash; it is a fundamental consequence of the within transformation (Section 6):&lt;/p>
&lt;p>$$\ddot{educ}_{it} = educ_i - \bar{educ}_i = 0 \quad \text{for all } t$$&lt;/p>
&lt;p>Because a worker&amp;rsquo;s education does not change over the eight years of the panel, the demeaned value is exactly zero for every observation. A column of zeros is perfectly collinear with the entity dummies, so it must be dropped. The same applies to &lt;code>black&lt;/code> and &lt;code>hisp&lt;/code>.&lt;/p>
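&lt;p>A toy example with hypothetical numbers makes the mechanics visible. In the two-worker panel below, demeaning zeroes out &lt;code>educ&lt;/code> exactly while preserving within-worker variation in &lt;code>union&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

toy = pd.DataFrame({
    "nr":    [1, 1, 1, 2, 2, 2],
    "educ":  [12, 12, 12, 16, 16, 16],   # time-invariant within worker
    "union": [0, 1, 1, 0, 0, 1],         # time-varying
})
demeaned = toy[["educ", "union"]] - toy.groupby("nr")[["educ", "union"]].transform("mean")
print(demeaned)
print((demeaned["educ"] == 0).all())   # True: a column of exact zeros
&lt;/code>&lt;/pre>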
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Pooled OLS&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>educ&lt;/td>
&lt;td>0.106&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.003&lt;/td>
&lt;td>0.004&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>0.078&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.141&lt;/td>
&lt;td>0.115&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>black&lt;/td>
&lt;td>-0.135&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hisp&lt;/td>
&lt;td>0.013&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>R-squared&lt;/td>
&lt;td>0.175&lt;/td>
&lt;td>0.605&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>This table crystallizes the fundamental tradeoff. Pooled OLS estimates everything &amp;mdash; education, race, union, marriage &amp;mdash; but its estimates are biased by unobserved ability. One-Way FE eliminates the ability bias, and the union premium drops from 18.3% to 7.8%, revealing that more than half the raw association was selection. But the price is steep: education, Black, and Hispanic are all absorbed into the individual intercepts. We cannot estimate the return to schooling or the racial wage gap under one-way FE. Sections 11.5&amp;ndash;11.6 push further with additional FE dimensions, and Section 11.7 shows how CRE partially resolves this tradeoff.&lt;/p>
&lt;h3 id="115-two-way-and-three-way-fixed-effects">11.5 Two-way and three-way fixed effects&lt;/h3>
&lt;p>Adding year fixed effects to one-way FE creates a two-way FE (TWFE) model that absorbs both individual heterogeneity and common time trends:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_panel = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr + year&amp;quot;,
                     data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr + year&amp;quot;})
&lt;/code>&lt;/pre>
&lt;p>We can go further by adding occupation as a third fixed effect dimension. As we saw in Section 11.1, nearly 89% of workers switch occupations during the panel, so occupation is a valid time-varying dimension:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_threeway = pf.feols(
    &amp;quot;lwage ~ expersq + union + married + hours | nr + year + C(occupation)&amp;quot;,
    data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;}
)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Pooled OLS&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;th>Two-Way FE&lt;/th>
&lt;th>Three-Way FE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.003&lt;/td>
&lt;td>0.004&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.073&lt;/td>
&lt;td>0.075&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.141&lt;/td>
&lt;td>0.115&lt;/td>
&lt;td>0.048&lt;/td>
&lt;td>0.047&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>R-squared&lt;/td>
&lt;td>0.175&lt;/td>
&lt;td>0.605&lt;/td>
&lt;td>0.631&lt;/td>
&lt;td>0.632&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(2, 2, figsize=(12, 8))
panel_models = {&amp;quot;Pooled OLS&amp;quot;: fit_pooled, &amp;quot;One-Way FE&amp;quot;: fit_entity,
&amp;quot;Two-Way FE&amp;quot;: fit_panel, &amp;quot;Three-Way FE&amp;quot;: fit_threeway}
panel_vars = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;]
panel_colors = [STEEL_BLUE, WARM_ORANGE, TEAL, &amp;quot;#e8956a&amp;quot;]
for idx, var in enumerate(panel_vars):
ax = axes.flatten()[idx]
model_names_p = list(panel_models.keys())
coefs_p = [panel_models[m].coef()[var] for m in model_names_p]
ses_p = [panel_models[m].se()[var] for m in model_names_p]
ax.bar(range(4), coefs_p, yerr=[1.96 * s for s in ses_p],
color=panel_colors, edgecolor=DARK_NAVY, width=0.5, capsize=4)
ax.set_xticks(range(4))
ax.set_xticklabels(model_names_p, fontsize=8, rotation=15)
ax.set_title(var, fontsize=12, fontweight=&amp;quot;bold&amp;quot;)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
fig.suptitle(&amp;quot;Coefficient Estimates Across FE Specifications&amp;quot;,
fontsize=14, fontweight=&amp;quot;bold&amp;quot;, y=1.02)
plt.savefig(&amp;quot;pyfixest_wage_extended.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_wage_extended.png" alt="Four-panel chart comparing coefficient estimates across pooled OLS, one-way FE, two-way FE, and three-way FE specifications.">&lt;/p>
&lt;p>The results show diminishing returns to additional FE dimensions. The big action was one-way FE: R-squared jumps from 0.175 to 0.605, and the union premium drops from 18.3% to 7.8%. Adding year effects (TWFE) pushes R-squared to 0.631 and the union premium stabilizes at 7.3%. Adding occupation as a third dimension barely moves anything &amp;mdash; R-squared rises to 0.632 and the union premium is 7.5%. The &lt;code>expersq&lt;/code> coefficient flips sign with TWFE (-0.006) because year effects absorb common trends in experience and wages. The stability of the union and marriage coefficients across the last three specifications suggests these estimates are robust to additional controls for time trends and occupational sorting.&lt;/p>
&lt;h3 id="116-interactive-fixed-effects">11.6 Interactive fixed effects&lt;/h3>
&lt;p>Sections 11.4&amp;ndash;11.5 used &lt;em>additive&lt;/em> fixed effects (&lt;code>nr + year&lt;/code>), where every individual shares the same set of year effects. &lt;strong>Interactive&lt;/strong> (or &lt;em>interacted&lt;/em>) fixed effects generalize this by allowing one FE dimension to vary across levels of another &amp;mdash; producing group-specific intercepts for each time period. Instead of a single set of year dummies shared by all workers, we estimate separate year effects for each demographic group.&lt;/p>
&lt;p>Why does this matter? Black and non-Black workers may face different labor market trends during the 1980s. If macroeconomic shocks hit these groups differently, a common set of year effects would be misspecified. We can test this by allowing year effects to vary by race:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \alpha_i + \gamma_{t,g(i)} + \epsilon_{it}$$&lt;/p>
&lt;p>where $g(i) \in \{Black, non\text{-}Black\}$, so we estimate separate year effects for each racial group.&lt;/p>
&lt;p>Pyfixest implements interactive FE with the &lt;strong>caret operator&lt;/strong> (&lt;code>^&lt;/code>): the syntax &lt;code>year^black&lt;/code> in the fixed-effects slot creates a separate year dummy for each value of &lt;code>black&lt;/code>. This mirrors R&amp;rsquo;s fixest package. The equivalent manual approach is to concatenate the columns (&lt;code>wage_df[&amp;quot;year_black&amp;quot;] = wage_df[&amp;quot;year&amp;quot;].astype(str) + &amp;quot;_&amp;quot; + wage_df[&amp;quot;black&amp;quot;].astype(str)&lt;/code>) and absorb the resulting string variable, but the caret operator is preferred because it keeps the interaction structure visible in the formula.&lt;/p>
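&lt;p>As a small illustration with hypothetical values, both syntaxes define the same grouping: each distinct (year, black) pair becomes one fixed-effect level, so a full design yields years × groups levels:&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

toy = pd.DataFrame({
    "year":  [1980, 1980, 1981, 1981, 1980, 1981],
    "black": [0, 1, 0, 1, 0, 0],
})
toy["year_black"] = toy["year"].astype(str) + "_" + toy["black"].astype(str)
print(sorted(toy["year_black"].unique()))
# ['1980_0', '1980_1', '1981_0', '1981_1'] → 2 years × 2 groups = 4 levels
&lt;/code>&lt;/pre>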
&lt;pre>&lt;code class="language-python"># Pyfixest caret operator for interacted fixed effects
fit_gtrends = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr + year^black&amp;quot;,
                       data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(fit_gtrends.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: nr+year^black
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| expersq | -0.006 | 0.001 | -5.878 | 0.000 | -0.008 | -0.004 |
| union | 0.074 | 0.024 | 3.129 | 0.002 | 0.028 | 0.121 |
| married | 0.045 | 0.020 | 2.262 | 0.024 | 0.006 | 0.084 |
| hours | -0.000 | 0.000 | -0.393 | 0.694 | -0.001 | 0.001 |
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Two-Way FE (additive)&lt;/th>
&lt;th>Interactive FE (year × race)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.073&lt;/td>
&lt;td>0.074&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.048&lt;/td>
&lt;td>0.045&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
vars_plot = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;]
x = np.arange(len(vars_plot))
width = 0.35
twfe_coefs = [fit_panel.coef()[v] for v in vars_plot]
gtrend_coefs = [fit_gtrends.coef()[v] for v in vars_plot]
ax.bar(x - width/2, twfe_coefs, width, label=&amp;quot;Two-Way FE&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.bar(x + width/2, gtrend_coefs, width, label=&amp;quot;Interactive FE&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.set_xticks(x)
ax.set_xticklabels(vars_plot, fontsize=11)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Additive vs Interactive Fixed Effects&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_group_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_group_trends.png" alt="Side-by-side bar chart comparing additive TWFE and interactive fixed effect coefficient estimates.">&lt;/p>
&lt;p>The coefficients are nearly identical under both specifications. Moving from additive to interactive fixed effects barely changes the estimated returns to union membership (7.3% → 7.4%), marriage (4.8% → 4.5%), or experience. This stability indicates that year effects are similar across racial groups &amp;mdash; the additive TWFE specification is not misspecified by imposing common year effects. The interactive model uses 545 one-way FE plus 16 group-year FE (8 years × 2 groups) = 561 FE parameters to explain 4,360 observations &amp;mdash; well short of saturation. Had the coefficients shifted substantially, that would have signaled that Black and non-Black workers face sufficiently different macro trends to warrant group-specific year effects, and that the standard additive TWFE was masking this heterogeneity.&lt;/p>
&lt;h3 id="117-recovering-time-invariant-effects-the-cremundlak-approach">11.7 Recovering time-invariant effects: the CRE/Mundlak approach&lt;/h3>
&lt;p>Sections 11.4&amp;ndash;11.6 revealed a fundamental tradeoff in panel econometrics. One-way FE eliminate omitted variable bias from all unobserved time-invariant confounders &amp;mdash; a powerful guarantee &amp;mdash; but they absorb education, race, and ethnicity in the process. Pooled OLS estimates coefficients for everything, but those estimates are biased whenever unobserved worker traits correlate with the covariates. We want the best of both worlds: the bias protection of FE with the ability to estimate time-invariant effects.&lt;/p>
&lt;p>Imagine you could describe each worker&amp;rsquo;s &amp;ldquo;type&amp;rdquo; not with a unique ID but with a summary of their career trajectory &amp;mdash; their average union participation rate, average hours worked, average marital status, and so on. Two workers with similar career averages are arguably similar in unobserved ways too: a worker who spends 80% of their career in a union likely differs systematically from one who never joins. The &lt;strong>Correlated Random Effects&lt;/strong> (CRE) model &amp;mdash; also called the &lt;strong>Mundlak (1978) device&lt;/strong> &amp;mdash; operationalizes this intuition by replacing the 545 entity dummies with a handful of individual-mean variables that capture the same correlation structure.&lt;/p>
&lt;p>&lt;strong>The CRE equation.&lt;/strong> Recall from Section 11.3 that the CRE equation replaces entity dummies $\alpha_i$ with individual means $\bar{X}_i$ of the time-varying variables:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \pi \bar{X}_i + \epsilon_{it}$$&lt;/p>
&lt;p>In words, this equation says that a worker&amp;rsquo;s log wage depends on three components: (1) their current values of time-varying covariates ($X_{it}$), (2) their permanent characteristics ($Z_i$ like education and race), and (3) a set of correction terms ($\bar{X}_i$) that capture the &lt;em>average&lt;/em> level of each time-varying variable across their career. In our code, $X_{it}$ corresponds to &lt;code>expersq&lt;/code>, &lt;code>union&lt;/code>, &lt;code>married&lt;/code>, and &lt;code>hours&lt;/code> in each year; $Z_i$ corresponds to &lt;code>educ&lt;/code>, &lt;code>black&lt;/code>, and &lt;code>hisp&lt;/code>; and $\bar{X}_i$ corresponds to the &lt;code>*_mean&lt;/code> columns we compute below.&lt;/p>
&lt;p>&lt;strong>Why does including $\bar{X}_i$ work?&lt;/strong> The individual means proxy for the unobserved individual effect $\alpha_i$. Consider union membership: if workers who join unions more often (high $\overline{union}_i$) also have higher unobserved ability or motivation, then $\overline{union}_i$ captures that correlation. Once we control for it, the remaining within-person variation in union status is &amp;ldquo;clean&amp;rdquo; &amp;mdash; and the time-invariant variables are no longer collinear with entity dummies (because there are no entity dummies).&lt;/p>
&lt;p>&lt;strong>Contrast with FE.&lt;/strong> One-way FE assumes $\alpha_i$ can be &lt;em>anything&lt;/em> &amp;mdash; completely unrestricted. CRE assumes $\alpha_i = \pi \bar{X}_i + \text{error}$ &amp;mdash; the individual effect is a linear function of the career averages. This is a stronger assumption, but it buys back education and race. The payoff: $\hat{\beta}$ for time-varying variables should approximately match the one-way FE estimates (because the means absorb the same correlation), while $\gamma$ for time-invariant variables is now estimable.&lt;/p>
&lt;pre>&lt;code class="language-python">mundlak_vars = [&amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;expersq&amp;quot;]
for var in mundlak_vars:
    wage_df[f&amp;quot;{var}_mean&amp;quot;] = wage_df.groupby(&amp;quot;nr&amp;quot;)[var].transform(&amp;quot;mean&amp;quot;)
fit_mundlak = pf.feols(
    &amp;quot;lwage ~ expersq + union + married + hours + educ + black + hisp &amp;quot;
    &amp;quot;+ expersq_mean + union_mean + married_mean + hours_mean&amp;quot;,
    data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;}
)
print(fit_mundlak.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: 0
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.276 | 0.073 | 3.798 | 0.000 | 0.133 | 0.418 |
| expersq | 0.004 | 0.000 | 13.284 | 0.000 | 0.004 | 0.005 |
| union | 0.078 | 0.019 | 4.050 | 0.000 | 0.040 | 0.116 |
| married | 0.115 | 0.017 | 6.664 | 0.000 | 0.081 | 0.149 |
| hours | -0.000 | 0.000 | -0.007 | 0.994 | -0.000 | 0.000 |
| educ | 0.094 | 0.005 | 17.295 | 0.000 | 0.083 | 0.104 |
| black | -0.140 | 0.024 | -5.930 | 0.000 | -0.187 | -0.094 |
| hisp | 0.009 | 0.019 | 0.469 | 0.639 | -0.028 | 0.045 |
| expersq_mean | -0.003 | 0.001 | -3.498 | 0.001 | -0.005 | -0.001 |
| union_mean | 0.179 | 0.037 | 4.838 | 0.000 | 0.106 | 0.251 |
| married_mean | -0.041 | 0.042 | -0.969 | 0.333 | -0.123 | 0.042 |
| hours_mean | 0.002 | 0.001 | 3.109 | 0.002 | 0.001 | 0.003 |
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;th>CRE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.004&lt;/td>
&lt;td>0.004&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.078&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.115&lt;/td>
&lt;td>0.115&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>educ&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>0.094&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>black&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>-0.140&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hisp&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>0.009&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 6))
compare_vars = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;]
x = np.arange(len(compare_vars))
width = 0.25
pooled_vals = [fit_pooled.coef()[v] for v in compare_vars]
entity_vals = [fit_entity.coef()[v] if v in fit_entity.coef().index else 0 for v in compare_vars]
mundlak_vals = [fit_mundlak.coef()[v] if v in fit_mundlak.coef().index else 0 for v in compare_vars]
ax.bar(x - width, pooled_vals, width, label=&amp;quot;Pooled OLS&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.bar(x, entity_vals, width, label=&amp;quot;One-Way FE&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.bar(x + width, mundlak_vals, width, label=&amp;quot;CRE&amp;quot;, color=TEAL, edgecolor=DARK_NAVY)
ax.set_xticks(x)
ax.set_xticklabels(compare_vars, fontsize=10, rotation=15)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Pooled OLS vs One-Way FE vs CRE&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_mundlak.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_mundlak.png" alt="Grouped bar chart comparing Pooled OLS, One-Way FE, and CRE coefficient estimates, showing CRE recovers education while matching one-way FE on time-varying variables.">&lt;/p>
&lt;p>The CRE model bridges one-way FE and pooled OLS. For time-varying variables (union, married, hours, expersq), the CRE coefficients closely match the one-way FE estimates &amp;mdash; confirming that the individual means successfully proxy for entity dummies. For time-invariant variables, CRE recovers what one-way FE cannot: education&amp;rsquo;s coefficient is 0.094 per year of schooling (a 9.4% return), and the Black wage gap is -0.140 (14.0% lower wages). These are close to the pooled OLS estimates, but now they are estimated in a framework that controls for the correlation between unobserved heterogeneity and the covariates (via the individual means).&lt;/p>
&lt;p>The CRE correction terms ($\pi$ coefficients) are informative in their own right. The &lt;code>union_mean&lt;/code> coefficient of 0.179 is large and highly significant ($p &amp;lt; 0.001$): workers with persistently higher union participation earn substantially more &lt;em>on average&lt;/em>, even after controlling for the within-person union effect (0.078). This gap &amp;mdash; 0.179 versus 0.078 &amp;mdash; is evidence of positive selection into unions: workers who join unions more often tend to have higher unobserved ability or to work in higher-paying industries. The &lt;code>hours_mean&lt;/code> coefficient (0.002, $p = 0.002$) suggests that workers who consistently work longer hours earn more per hour on average, while &lt;code>married_mean&lt;/code> is small and insignificant, indicating that selection into marriage is not strongly associated with unobserved wage determinants once other factors are controlled.&lt;/p>
&lt;p>The caveat is that CRE relies on the assumption that unobserved heterogeneity correlates with covariates &lt;em>only through their individual means&lt;/em> &amp;mdash; a stronger assumption than one-way FE, which makes no such restriction. However, this assumption is testable. The CRE correction terms provide a built-in Hausman-type test: if $\pi = 0$ jointly (all correction terms are zero), the random effects and one-way FE estimators converge to the same estimates, and the simpler, more efficient random effects model is preferred. In our case, the large and significant &lt;code>union_mean&lt;/code> and &lt;code>hours_mean&lt;/code> coefficients strongly reject $\pi = 0$, confirming that unobserved heterogeneity &lt;em>does&lt;/em> correlate with the covariates and that FE or CRE is needed over pooled OLS. Exercise 6 asks you to formalize this test.&lt;/p>
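&lt;p>As a rough preview of that exercise, the reported t-statistics already imply the joint rejection. Treating the four correction terms as uncorrelated (the exact Wald test would use their full covariance matrix), a back-of-the-envelope statistic is simply the sum of squared t-values, compared against a chi-squared distribution with four degrees of freedom:&lt;/p>
&lt;pre>&lt;code class="language-python"># t-statistics of the correction terms, as reported above
t_stats = {"expersq_mean": -3.498, "union_mean": 4.838,
           "married_mean": -0.969, "hours_mean": 3.109}
W = sum(t ** 2 for t in t_stats.values())
crit = 9.49   # chi-squared(4) critical value at the 5% level
print(f"W = {W:.1f} vs critical value {crit}")
print("Reject pi = 0 jointly:", W > crit)   # rejected by a wide margin
&lt;/code>&lt;/pre>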
&lt;h3 id="118-what-fixed-effects-absorb-vs-what-survives">11.8 What fixed effects absorb vs. what survives&lt;/h3>
&lt;p>The wage panel illustrates a general principle: one-way fixed effects absorb everything about a person that does not change over the observation window. Variables that &lt;em>do&lt;/em> change over time &amp;mdash; like union status, marital status, and occupation &amp;mdash; survive the within transformation and can be estimated. The CRE/Mundlak approach (Section 11.7) partially resolves the tradeoff by recovering time-invariant coefficients. The diagram below summarizes this partition and recovery:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
subgraph &amp;quot;Absorbed by One-Way FE&amp;quot;
ED[&amp;quot;&amp;lt;b&amp;gt;Education&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(time-invariant)&amp;quot;]
AB[&amp;quot;&amp;lt;b&amp;gt;Ability&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(unobserved)&amp;quot;]
RC[&amp;quot;&amp;lt;b&amp;gt;Race&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(time-invariant)&amp;quot;]
end
subgraph &amp;quot;Estimated (time-varying)&amp;quot;
UN[&amp;quot;&amp;lt;b&amp;gt;Union&amp;lt;/b&amp;gt;&amp;quot;]
MA[&amp;quot;&amp;lt;b&amp;gt;Married&amp;lt;/b&amp;gt;&amp;quot;]
OC[&amp;quot;&amp;lt;b&amp;gt;Occupation&amp;lt;/b&amp;gt;&amp;quot;]
end
subgraph &amp;quot;Recovery strategies&amp;quot;
MK[&amp;quot;&amp;lt;b&amp;gt;CRE/Mundlak&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(individual means)&amp;quot;]
end
UN --&amp;gt; W[&amp;quot;&amp;lt;b&amp;gt;Log Wage&amp;lt;/b&amp;gt;&amp;quot;]
MA --&amp;gt; W
OC --&amp;gt; W
ED -.-&amp;gt; W
AB -.-&amp;gt; W
MK -.-&amp;gt;|&amp;quot;recovers γ&amp;quot;| ED
MK -.-&amp;gt;|&amp;quot;recovers γ&amp;quot;| RC
style ED fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style AB fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style RC fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style UN fill:#6a9bcc,stroke:#141413,color:#fff
style MA fill:#6a9bcc,stroke:#141413,color:#fff
style OC fill:#6a9bcc,stroke:#141413,color:#fff
style W fill:#00d4c8,stroke:#141413,color:#fff
style MK fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5
&lt;/code>&lt;/pre>
&lt;p>The dashed arrows from the orange (absorbed) variables indicate that their effects on wages are &lt;em>real&lt;/em> but &lt;em>unestimable&lt;/em> under one-way FE &amp;mdash; they are folded into each worker&amp;rsquo;s individual intercept. The solid arrows from the blue (estimated) variables show the effects we can identify: changes in union status, marital status, and occupation that occur within a worker&amp;rsquo;s career. The dark blue CRE/Mundlak node represents the recovery strategy from Section 11.7: by substituting individual means for entity dummies, we recover the coefficients $\gamma$ for education and race while producing time-varying estimates that closely match one-way FE. This partially resolves the tradeoff from Section 11.4, though at the cost of a stronger modeling assumption.&lt;/p>
&lt;h2 id="12-event-study-difference-in-differences">12. Event study: difference-in-differences&lt;/h2>
&lt;h3 id="121-staggered-treatment-adoption">12.1 Staggered treatment adoption&lt;/h3>
&lt;p>Event studies are a popular extension of fixed effects that estimate dynamic treatment effects around the time of an intervention. In a &lt;em>staggered&lt;/em> design, different groups (states, firms, individuals) receive treatment at different times &amp;mdash; for example, states adopting a minimum wage increase in different years. The standard approach uses TWFE with relative-time indicators. However, this can produce biased estimates when treatment timing varies across groups and effects are heterogeneous. The DID2S estimator (Gardner, 2022) addresses this by separating the estimation into two stages: first estimating fixed effects from untreated observations, then recovering treatment effects from the residuals. The target estimand in this design is the &lt;em>Average Treatment Effect on the Treated&lt;/em> (ATT) &amp;mdash; the average effect for units that actually received treatment.&lt;/p>
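&lt;p>Before turning to the package, the two-stage logic is worth seeing by hand. The sketch below is a simplified illustration on simulated data, not Gardner&amp;rsquo;s full estimator: stage one fits unit and year effects using only untreated observations, and stage two reads the ATT off the residuals of the treated observations:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
units, years, tau = 50, 10, 2.0
df = pd.DataFrame({'unit': np.repeat(np.arange(units), years),
                   'year': np.tile(np.arange(years), units)})
g = rng.integers(4, 12, size=units)   # adoption year; 10 or 11 means never treated
df['treat'] = (df['year'].to_numpy() >= g[df['unit'].to_numpy()]).astype(int)
u_fe, t_fe = rng.normal(size=units), np.linspace(0.0, 1.0, years)
df['y'] = (u_fe[df['unit'].to_numpy()] + t_fe[df['year'].to_numpy()]
           + tau * df['treat'] + rng.normal(0.0, 0.1, len(df)))

# Stage 1: two-way fixed effects fit on UNTREATED observations only
dummies = lambda s, p: pd.get_dummies(s, prefix=p, dtype=float)
D_all = pd.concat([dummies(df['unit'], 'u'), dummies(df['year'], 't')], axis=1)
untreated = df['treat'] == 0
coef, *_ = np.linalg.lstsq(D_all[untreated].to_numpy(),
                           df.loc[untreated, 'y'].to_numpy(), rcond=None)

# Stage 2: residualize ALL observations, then average residuals of the treated
df['resid'] = df['y'] - D_all.to_numpy() @ coef
att = df.loc[df['treat'] == 1, 'resid'].mean()
print(att)   # recovers the true tau of 2.0 up to noise
&lt;/code>&lt;/pre>
&lt;p>Because the counterfactual is estimated only from untreated cells, already-treated units never serve as controls, which is precisely the contamination that biases TWFE under staggered adoption.&lt;/p>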
&lt;p>PyFixest provides both approaches. We use a simulated dataset with staggered treatment adoption across states:&lt;/p>
&lt;pre>&lt;code class="language-python">df_het = pd.read_csv(
&amp;quot;https://raw.githubusercontent.com/py-econometrics/pyfixest/master/pyfixest/did/data/df_het.csv&amp;quot;
)
print(f&amp;quot;DiD dataset shape: {df_het.shape}&amp;quot;)
print(f&amp;quot;Columns: {list(df_het.columns)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">DiD dataset shape: (46500, 14)
Columns: ['unit', 'state', 'group', 'unit_fe', 'g', 'year', 'year_fe', 'treat',
'rel_year', 'rel_year_binned', 'error', 'te', 'te_dynamic', 'dep_var']
&lt;/code>&lt;/pre>
&lt;p>The event study dataset contains 46,500 observations across units nested in states, with a binary treatment indicator and relative time variable measuring periods before and after treatment onset. The &lt;code>dep_var&lt;/code> column is the outcome we want to explain, and &lt;code>rel_year&lt;/code> measures the distance in years from each unit&amp;rsquo;s treatment date (negative values are pre-treatment). This structure is typical of policy evaluation studies where different states adopt a policy at different times.&lt;/p>
&lt;h3 id="122-year-1-as-the-universal-baseline">12.2 Year −1 as the universal baseline&lt;/h3>
&lt;p>Both estimators use &lt;code>ref=-1.0&lt;/code>, setting the last pre-treatment period as the baseline. This choice is not arbitrary &amp;mdash; it is the conventional and most informative reference point for three reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Closest to treatment onset.&lt;/strong> Period −1 is the last observation before treatment begins. Using it as the baseline minimizes the extrapolation distance: we compare each period&amp;rsquo;s outcome to the most recent untreated state, rather than to some distant past.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Universal across cohorts.&lt;/strong> In staggered designs, different states adopt treatment in different calendar years. But &lt;code>rel_year = -1&lt;/code> has the same meaning for every cohort: &amp;ldquo;the last year before this group was treated.&amp;rdquo; It aligns all cohorts to a common relative-time clock, making the coefficients directly comparable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transparent parallel trends test.&lt;/strong> Pre-treatment coefficients (periods −20 through −2) measure deviations from the baseline. If these coefficients are near zero, the treated and control groups were on parallel trajectories &lt;em>before&lt;/em> treatment &amp;mdash; validating the key identifying assumption. Choosing −1 as the baseline makes this test as transparent as possible: any non-zero pre-treatment coefficient is a direct signal of differential pre-trends.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>How to read the event study plot.&lt;/strong> Each coefficient represents the difference in outcomes between treatment and control groups, relative to their difference at period −1. Pre-treatment coefficients near zero validate parallel trends. The coefficient at period 0 is the immediate treatment effect. Post-treatment coefficients show how the effect evolves over time. If we had chosen a different baseline (say, period −5), all coefficients would shift by a constant &amp;mdash; the &lt;em>shape&lt;/em> of the event study would be identical, but the levels would change. The convention of using −1 simply makes the plot easiest to interpret.&lt;/p>
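&lt;p>The constant-shift property is easy to verify. In the sketch below (made-up period means), each event study coefficient is the gap between a period&amp;rsquo;s mean and the baseline period&amp;rsquo;s mean, so moving the baseline from −1 to −5 shifts every coefficient by the same constant:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
periods = list(range(-5, 4))                  # relative years -5 .. 3
mean = {t: rng.normal() for t in periods}     # average outcome gap per period

coef_ref1 = {t: mean[t] - mean[-1] for t in periods if t != -1}
coef_ref5 = {t: mean[t] - mean[-5] for t in periods if t != -5}

shifts = [coef_ref5[t] - coef_ref1[t] for t in periods if t not in (-1, -5)]
print(shifts)   # every entry equals mean[-1] - mean[-5]
&lt;/code>&lt;/pre>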
&lt;h3 id="123-twfe-vs-did2s">12.3 TWFE vs DID2S&lt;/h3>
&lt;p>We estimate event study coefficients using both TWFE and DID2S, with period -1 (the year before treatment) as the reference category. The &lt;code>i()&lt;/code> operator in PyFixest creates indicator variables for each relative year, analogous to the &lt;code>i()&lt;/code> function in R&amp;rsquo;s &lt;code>fixest&lt;/code> package.&lt;/p>
&lt;pre>&lt;code class="language-python"># TWFE event study
fit_twfe = pf.feols(
&amp;quot;dep_var ~ i(rel_year, ref=-1.0) | state + year&amp;quot;,
data=df_het, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;},
)
# DID2S (Gardner 2022) -- two-stage estimator
fit_did2s = pf.did2s(
df_het, yname=&amp;quot;dep_var&amp;quot;,
first_stage=&amp;quot;~ 0 | state + year&amp;quot;,
second_stage=&amp;quot;~ i(rel_year, ref=-1.0)&amp;quot;,
treatment=&amp;quot;treat&amp;quot;, cluster=&amp;quot;state&amp;quot;,
)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Extract coefficients from both estimators for plotting
import re
def parse_rel_years(coef_dict, se_dict):
years, vals, ses_list = [], [], []
for k in coef_dict.index:
match = re.search(r'\[T\.(-?\d+\.?\d*)\]', str(k))
if match:
years.append(float(match.group(1)))
vals.append(coef_dict[k])
ses_list.append(se_dict[k])
return years, vals, ses_list
twfe_years, twfe_vals, twfe_ses = parse_rel_years(fit_twfe.coef(), fit_twfe.se())
did2s_years, did2s_vals, did2s_ses = parse_rel_years(fit_did2s.coef(), fit_did2s.se())
&lt;/code>&lt;/pre>
&lt;p>PyFixest stores event study coefficients with names like &lt;code>[T.-5.0]&lt;/code>, &lt;code>[T.0.0]&lt;/code>, etc. The helper function above extracts the relative year from each coefficient name and pairs it with the estimate and standard error, giving us arrays ready for plotting.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(12, 6))
offset = 0.15
ax.errorbar([y - offset for y in twfe_years], twfe_vals,
yerr=[1.96*s for s in twfe_ses],
fmt='o', color=STEEL_BLUE, capsize=3, label='TWFE')
ax.errorbar([y + offset for y in did2s_years], did2s_vals,
yerr=[1.96*s for s in did2s_ses],
fmt='s', color=WARM_ORANGE, capsize=3, label='DID2S (Gardner 2022)')
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
ax.axvline(x=-0.5, color=LIGHT_TEXT, linewidth=1, linestyle=&amp;quot;--&amp;quot;, alpha=0.6)
ax.plot(-1, 0, 'D', color=TEAL, markersize=10, zorder=5,
label=&amp;quot;Baseline (t = −1)&amp;quot;)
ax.set_xlabel(&amp;quot;Relative Year&amp;quot;, fontsize=13)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Event Study: TWFE vs DID2S&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
plt.savefig(&amp;quot;pyfixest_event_study.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_event_study.png" alt="Event study plot comparing TWFE and DID2S coefficient estimates across relative years, showing flat pre-trends and rising post-treatment effects.">&lt;/p>
&lt;p>Both estimators show near-zero pre-treatment coefficients (validating the parallel trends assumption) and a sharp jump at treatment onset. The immediate treatment effect at period 0 is approximately 1.3&amp;ndash;1.4, growing steadily to about 2.8 by period 20. The TWFE estimates (blue circles) are slightly larger than DID2S (orange squares) in post-treatment periods &amp;mdash; this upward bias is the well-documented problem with TWFE under staggered adoption and heterogeneous effects. The DID2S estimator corrects this by using only untreated observations to estimate the counterfactual, producing cleaner estimates of the dynamic treatment path.&lt;/p>
&lt;h2 id="13-hypothesis-testing-wald-test">13. Hypothesis testing: Wald test&lt;/h2>
&lt;p>PyFixest supports joint hypothesis testing via &lt;a href="https://pyfixest.org/reference/estimation.feols_.Feols.wald_test.html" target="_blank" rel="noopener">Wald tests&lt;/a>, which assess whether multiple coefficients are simultaneously equal to zero. This is useful when you want to test whether a group of related variables jointly matters, not just one at a time.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_wald = pf.feols(&amp;quot;Y ~ X1 + X2 | f1&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
R = np.eye(2) # Test both X1=0 and X2=0 jointly
wald_result = fit_wald.wald_test(R=R)
print(&amp;quot;Wald test (joint null: X1=0, X2=0):&amp;quot;)
print(wald_result)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Wald test (joint null: X1=0, X2=0):
statistic 1.554006e+02
pvalue 1.110223e-16
&lt;/code>&lt;/pre>
&lt;p>The Wald test statistic is 155.4 with a p-value effectively zero (below $10^{-16}$), overwhelmingly rejecting the null hypothesis that both &lt;code>X1&lt;/code> and &lt;code>X2&lt;/code> have zero effect on &lt;code>Y&lt;/code>. This joint test is more informative than individual t-tests because it accounts for the correlation between the two coefficient estimates. In practice, Wald tests are essential for testing hypotheses about groups of variables, such as whether all interaction terms or all year dummies are jointly significant.&lt;/p>
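&lt;p>Under the hood, the Wald statistic is the quadratic form $W = (R\hat{\beta})'(R\hat{V}R')^{-1}(R\hat{\beta})$, which is $\chi^2_q$ under the null with $q$ restrictions. A minimal sketch on simulated data, using a classical homoskedastic covariance for brevity (the PyFixest call above uses HC1 instead):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
V = np.linalg.inv(X.T @ X) * (resid @ resid) / (n - 3)   # classical vcov

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])          # joint null: both slopes are zero
b = R @ beta
W = b @ np.linalg.inv(R @ V @ R.T) @ b   # compare to a chi-squared with 2 df
print(W)
&lt;/code>&lt;/pre>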
&lt;h2 id="14-wild-cluster-bootstrap">14. Wild cluster bootstrap&lt;/h2>
&lt;p>When the number of clusters is small (roughly below 50), cluster-robust standard errors can be unreliable. The &lt;em>wild cluster bootstrap&lt;/em> provides more accurate inference in this setting by simulating the distribution of the test statistic under the null hypothesis. PyFixest integrates with the &lt;code>wildboottest&lt;/code> package to make this straightforward:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_boot = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;group_id&amp;quot;})
boot_result = fit_boot.wildboottest(param=&amp;quot;X1&amp;quot;, reps=999, seed=42)
print(boot_result)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">param X1
t value -8.616818459577098
Pr(&amp;gt;|t|) 0.0
bootstrap_type 11
inference CRV(group_id)
impose_null True
&lt;/code>&lt;/pre>
&lt;p>The wild bootstrap t-statistic of -8.62 and p-value of 0.0 confirm that the effect of &lt;code>X1&lt;/code> remains highly significant even under the more conservative bootstrap inference. The &lt;code>impose_null=True&lt;/code> setting means the bootstrap simulates data under the null hypothesis of no effect, which generally provides better size control in finite samples. With only ~20 groups in this dataset, the bootstrap p-value is more trustworthy than the asymptotic cluster-robust p-value.&lt;/p>
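&lt;p>The mechanics are worth sketching once by hand. The toy version below (simulated data, a simple bivariate regression rather than the FE model above) imposes the null, flips the sign of each cluster&amp;rsquo;s restricted residuals with Rademacher weights, and compares the original slope to the bootstrap distribution:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(3)
G, m = 12, 30                              # few clusters, 30 obs each
cl = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m) + 0.5 * rng.normal(size=G)[cl]
y = 1.0 * x + 0.5 * rng.normal(size=G)[cl] + rng.normal(size=G * m)

def slope(y, x):
    xd = x - x.mean()
    return (xd @ y) / (xd @ xd)

b_hat = slope(y, x)
resid0 = y - y.mean()         # residuals with the null (slope = 0) imposed
count, reps = 0, 999
for _ in range(reps):
    w = rng.choice([-1.0, 1.0], size=G)    # one Rademacher draw per cluster
    y_star = y.mean() + w[cl] * resid0     # resampled outcome under the null
    if abs(slope(y_star, x)) >= abs(b_hat):
        count += 1
p_value = (count + 1) / (reps + 1)
print(p_value)
&lt;/code>&lt;/pre>
&lt;p>Because weights are drawn at the cluster level, the bootstrap preserves arbitrary within-cluster correlation, which is exactly what asymptotic CRV inference struggles with when clusters are few.&lt;/p>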
&lt;h2 id="15-discussion">15. Discussion&lt;/h2>
&lt;p>This tutorial posed a simple question: how do unobserved group-level characteristics bias regression estimates, and how can we account for them? The answer, demonstrated across multiple settings, is that fixed effects regression removes this bias by focusing on within-group variation only.&lt;/p>
&lt;p>The synthetic data showed that OLS estimates shift from -1.000 to -1.019 when absorbing group fixed effects &amp;mdash; a modest change in this controlled setting, but one that demonstrates the mechanism. The real-world wage panel told a more dramatic story: the union wage premium dropped from 18.3% (pooled OLS) to 7.3% (two-way FE), revealing that more than half of the apparent union premium reflects worker selection rather than a genuine union effect. This has direct implications for labor economists and policymakers: overestimating the union premium leads to overestimating the economic impact of declining unionization.&lt;/p>
&lt;p>Framing the wage panel through the Mincer equation (Section 11.3) provided a unifying thread for the entire analysis. The classic Mincer specification &amp;mdash; log wages as a function of education, experience, and experience squared &amp;mdash; is the starting point for virtually all empirical wage research. By extending it with additional controls and then progressively adding fixed effects, we traced a clear arc from pooled cross-sectional estimation to panel methods that account for unobserved heterogeneity. The within-versus-between decomposition (Section 11.2) made this arc concrete: education has zero within-worker variation, so one-way FE cannot estimate its effect, while variables like union status and marital status have substantial within-worker variation and can be identified.&lt;/p>
&lt;p>The wage panel also highlighted a fundamental tradeoff in fixed effects estimation: the very mechanism that removes ability bias &amp;mdash; absorbing all time-invariant individual characteristics &amp;mdash; also prevents estimation of time-invariant variables like education. This is not a limitation to be worked around but a defining feature of the method. The CRE/Mundlak approach (Section 11.7) offers a principled resolution: by including individual means of time-varying variables as additional regressors, it proxies for the unobserved heterogeneity that one-way FE would absorb, recovering education&amp;rsquo;s coefficient (0.094 per year of schooling) while producing time-varying estimates that closely match one-way FE. The key assumption &amp;mdash; that unobserved heterogeneity correlates with covariates only through their individual means &amp;mdash; is stronger than FE&amp;rsquo;s assumption of no time-varying confounding, but it is the price of recovering time-invariant effects.&lt;/p>
&lt;p>The three-way FE extension (adding occupation fixed effects) showed that occupation sorting explains negligible additional wage variation beyond individual and time effects, confirming that the dominant source of wage heterogeneity is persistent individual characteristics. The group-specific time trends analysis (Section 11.6) showed that allowing Black and non-Black workers to have different year effects produces estimates nearly identical to standard TWFE, supporting the common trends assumption in this particular panel. This is a useful diagnostic in practice: if group-specific trends substantially change the coefficients, the researcher should worry about whether the standard TWFE results are confounded by differential macro trends.&lt;/p>
&lt;p>PyFixest makes the entire workflow &amp;mdash; from simple OLS through two-way FE, IV, CRE/Mundlak, and event studies &amp;mdash; accessible with a concise formula syntax. The ability to estimate multiple specifications in one call (&lt;code>csw0&lt;/code>) and compare inference methods (iid, HC1, CRV1, CRV3, wild bootstrap) means researchers can quickly build a comprehensive picture of how sensitive their results are to modeling choices.&lt;/p>
&lt;h2 id="16-summary-and-next-steps">16. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Fixed effects remove group-level confounding.&lt;/strong> In the wage panel, individual FE reduced the apparent union premium from 18.3% to 7.8%, revealing that over half the raw premium reflects selection on unobserved ability. Without FE, policy conclusions about unionization would be substantially biased.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The within-between decomposition diagnoses what FE can estimate.&lt;/strong> Decomposing each variable&amp;rsquo;s variation into between-worker and within-worker components reveals which coefficients survive one-way FE. Education has zero within variation and is absorbed; union status and marital status have substantial within shares (64% and 65%) and can be estimated. This diagnostic should precede any panel analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The Mincer equation provides a unifying framework for wage regressions.&lt;/strong> Framing the analysis through the classic Mincer specification &amp;mdash; and its extensions to panel data &amp;mdash; makes the progression from pooled OLS to one-way FE to CRE/Mundlak a coherent arc rather than a collection of ad hoc specifications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Standard errors matter as much as point estimates.&lt;/strong> Clustering standard errors inflated the SE on &lt;code>X1&lt;/code> by 50% compared to iid errors (0.1247 vs 0.0833). With weaker effects, this difference could flip a result from significant to insignificant &amp;mdash; always cluster at the appropriate level.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple specifications are a robustness check, not a fishing exercise.&lt;/strong> The coefficient on &lt;code>X1&lt;/code> remained stable around -1.0 across no FE, one-way FE, and two-way FE. In the wage panel, the union premium stabilized at 7.3&amp;ndash;7.8% across one-way FE, two-way FE, three-way FE, and group-specific time trends &amp;mdash; strong evidence that these estimates are robust.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Group-specific time trends test the common trends assumption.&lt;/strong> Allowing Black and non-Black workers to have different year effects produced estimates nearly identical to standard TWFE, supporting the assumption that both groups faced similar macroeconomic trends during 1980&amp;ndash;1987. When this test fails, standard TWFE results may be unreliable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>One-Way FE cannot estimate time-invariant effects, but CRE can recover them.&lt;/strong> Education was silently dropped from the one-way FE model because the within transformation reduces any constant variable to zero. The CRE model partially resolves this tradeoff by substituting individual means of time-varying variables for entity dummies, recovering education&amp;rsquo;s coefficient (0.094 per year) while producing time-varying estimates that match one-way FE. The cost is a stronger modeling assumption &amp;mdash; that unobserved heterogeneity correlates with covariates only through their individual means.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>TWFE event studies can be biased with staggered adoption.&lt;/strong> The DID2S estimator produced cleaner estimates by separating counterfactual estimation from treatment effect recovery. When treatment timing varies, always compare TWFE with a robust alternative like DID2S.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The event study baseline is not arbitrary.&lt;/strong> Setting &lt;code>ref=-1&lt;/code> (the last pre-treatment period) is the convention because it provides the most transparent test of parallel trends and minimizes extrapolation from the baseline to treatment onset. All cohorts in a staggered design share this reference point, making it the natural common clock.&lt;/p>
&lt;/li>
&lt;/ol>
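&lt;p>The within-share diagnostic in takeaway 2 takes only a few lines to compute. A sketch on simulated data, with hypothetical variable names mirroring the wage panel:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n, t = 100, 8
ids = np.repeat(np.arange(n), t)
df = pd.DataFrame({
    'id': ids,
    'educ': np.repeat(rng.integers(10, 17, n), t).astype(float),  # time-invariant
    'union': rng.integers(0, 2, n * t).astype(float),             # time-varying
})

def within_share(df, col):
    dev = df[col] - df.groupby('id')[col].transform('mean')
    return dev.var() / df[col].var()

print(within_share(df, 'educ'), within_share(df, 'union'))
# educ share is exactly 0: one-way FE would absorb it
&lt;/code>&lt;/pre>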
&lt;p>&lt;strong>Limitations:&lt;/strong> Fixed effects only remove time-invariant confounders. If a relevant confounder changes over time within groups, FE cannot address it. Additionally, FE estimation discards all between-group variation, which reduces statistical power and makes it impossible to estimate the effects of time-invariant variables &amp;mdash; as we saw directly in Section 11.2, where education&amp;rsquo;s within share was exactly zero. CRE offers a partial resolution, but its assumption that unobserved heterogeneity correlates with covariates only through individual means may not hold in all settings &amp;mdash; if ability correlates with the &lt;em>trajectory&lt;/em> of union membership rather than its mean, the CRE estimates would still be biased. The group-specific time trends test (Section 11.6) is a useful diagnostic but is not definitive: passing it does not prove that common trends hold, only that the data are consistent with the assumption along the dimension tested. Finally, the datasets here are synthetic or well-studied &amp;mdash; in messy real-world data, the parallel trends assumption underlying event studies may not hold.&lt;/p>
&lt;p>&lt;strong>Next steps:&lt;/strong> The CRE/Mundlak approach demonstrated in Section 11.7 can be extended in several directions: Wooldridge (2010, Ch. 10) develops the correlated random effects framework more formally, including CRE probit and tobit models for limited dependent variables. Hausman-Taylor estimation offers an alternative strategy for recovering time-invariant coefficients under different identifying assumptions. Beyond the wage panel, explore PyFixest&amp;rsquo;s support for Poisson regression (&lt;code>pf.fepois&lt;/code>) for count data, quantile regression (&lt;code>pf.quantreg&lt;/code>) for distributional effects, and the &lt;code>pf.event_study()&lt;/code> common API for streamlined event study estimation with multiple estimators. For more advanced inference, investigate randomization inference via &lt;code>fit.ritest()&lt;/code> and multiple testing corrections with &lt;code>pf.bonferroni()&lt;/code> and &lt;code>pf.rwolf()&lt;/code>.&lt;/p>
&lt;h2 id="17-exercises">17. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Varying the clustering level.&lt;/strong> Re-estimate the one-way FE model (&lt;code>Y ~ X1 | group_id&lt;/code>) with different clustering variables: &lt;code>f1&lt;/code>, &lt;code>f2&lt;/code>, and &lt;code>f3&lt;/code>. How do the standard errors change? Which clustering level produces the most conservative inference, and why?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weak instruments.&lt;/strong> Modify the IV specification to use only &lt;code>Z1&lt;/code> as an instrument (instead of both &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>). How does the first-stage F-statistic change? How does the IV coefficient and its standard error respond to the weaker first stage?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CRE with additional means.&lt;/strong> In Section 11.7, we included individual means only for the time-varying regressors. What happens if you also include year fixed effects alongside the CRE correction terms (i.e., add &lt;code>| year&lt;/code> to the CRE specification)? Do the time-varying coefficients shift closer to the TWFE estimates? Does the education coefficient change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Group-specific trends by other dimensions.&lt;/strong> Section 11.6 allowed year effects to vary by race (&lt;code>black&lt;/code>). Repeat this analysis using &lt;code>hisp&lt;/code> instead, or using a union-status interaction (&lt;code>C(year):C(union)&lt;/code>). Do the results differ from the standard TWFE specification? What does this tell you about the common trends assumption along different group dimensions?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Within-between decomposition on new data.&lt;/strong> Download a panel dataset of your choice (e.g., Penn World Table, World Development Indicators) and compute the within-versus-between decomposition for all variables. Which variables have the highest within share? What does this predict about which coefficients will survive one-way FE? Verify by estimating both pooled OLS and one-way FE models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hausman test via CRE.&lt;/strong> The CRE model provides a simple Hausman-type test: if the coefficients on the individual means ($\bar{X}_i$) are jointly zero, then pooled OLS and one-way FE yield the same estimates, and random effects is efficient. Test whether the four CRE correction terms (union_mean, married_mean, hours_mean, expersq_mean) are jointly significant using a Wald test. What does the result imply about the choice between random effects and fixed effects for this panel?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="18-references">18. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="http://scorreia.com/research/hdfe.pdf" target="_blank" rel="noopener">Correia, S. (2016). A Feasible Estimator for Linear Models with Multi-Way Fixed Effects. Working Paper.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2021.10.004" target="_blank" rel="noopener">Gardner, J. (2022). Two-Stage Differences in Differences. Journal of Econometrics.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/py-econometrics/pyfixest" target="_blank" rel="noopener">Fischer, A. and Schar, S. (2024). PyFixest: Fast High-Dimensional Fixed Effects Estimation in Python.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/quickstart.html" target="_blank" rel="noopener">PyFixest Documentation &amp;ndash; Quickstart Guide.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/%28SICI%291099-1255%28199803/04%2913:2%3c163::AID-JAE460%3e3.0.CO;2-Y" target="_blank" rel="noopener">Vella, F. and Verbeek, M. (1998). Whose Wages Do Unions Raise? A Dynamic Model of Unionism and Wage Rate Determination for Young Men. Journal of Applied Econometrics.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.3368/jhr.50.2.317" target="_blank" rel="noopener">Cameron, A.C. and Miller, D.L. (2015). A Practitioner&amp;rsquo;s Guide to Cluster-Robust Inference. Journal of Human Resources.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.nber.org/books-and-chapters/schooling-experience-and-earnings" target="_blank" rel="noopener">Mincer, J. (1974). &lt;em>Schooling, Experience, and Earnings.&lt;/em> Columbia University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1913646" target="_blank" rel="noopener">Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. &lt;em>Econometrica&lt;/em>, 46(1), 69&amp;ndash;85.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mitpress.mit.edu/9780262232586/" target="_blank" rel="noopener">Wooldridge, J.M. (2010). &lt;em>Econometric Analysis of Cross Section and Panel Data.&lt;/em> 2nd ed. MIT Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/00401706.2013.806694" target="_blank" rel="noopener">Olea, J.L.M. and Pflueger, C. (2013). A Robust Test for Weak Instruments. Journal of Business &amp;amp; Economic Statistics.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to Difference-in-Differences in Python</title><link>https://carlos-mendez.org/post/python_did/</link><pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_did/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>An education ministry rolls out AI tutoring bots in some cities but not others. Did the AI tools actually improve learning, or were those cities already on an upward trajectory? This is the core challenge of &lt;strong>policy evaluation&lt;/strong>: separating the genuine effect of an intervention from pre-existing trends and selection differences between treated and untreated groups. The seminal study by &lt;a href="https://www.jstor.org/stable/2118030" target="_blank" rel="noopener">Card and Krueger (1994)&lt;/a> pioneered this approach in a different context &amp;mdash; examining how a minimum wage increase in New Jersey affected fast-food employment compared to neighboring Pennsylvania.&lt;/p>
&lt;p>&lt;strong>Difference-in-Differences (DiD)&lt;/strong> is the workhorse method for answering such questions. The idea is elegantly simple: compare the change in outcomes over time between a group that received treatment and a group that did not. If both groups were evolving similarly before treatment &amp;mdash; the &lt;em>parallel trends&lt;/em> assumption &amp;mdash; then the difference in their changes isolates the causal effect. Think of it as using the control group as a mirror: it shows what would have happened to the treated group had the policy never been implemented.&lt;/p>
&lt;p>The &lt;strong>&lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">diff-diff&lt;/a>&lt;/strong> Python package, developed by &lt;a href="https://github.com/igerber/diff-diff" target="_blank" rel="noopener">Gerber (2026)&lt;/a>, provides a unified, scikit-learn-style API for 13+ DiD estimators validated against their R counterparts. These range from the classic 2x2 design to modern methods for staggered adoption. In this tutorial, we start with the simplest case, build up to event studies and multi-cohort designs, and finish with sensitivity analysis that quantifies how robust the findings are to violations of parallel trends. All examples use synthetic &lt;strong>panel data&lt;/strong> &amp;mdash; datasets where the same units (cities, firms, individuals) are observed repeatedly over multiple time periods &amp;mdash; with known true effects, so every estimate can be verified against ground truth.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the logic of the 2x2 DiD design and why it identifies causal effects under parallel trends&lt;/li>
&lt;li>Estimate the Average Treatment Effect on the Treated (ATT) using classic DiD&lt;/li>
&lt;li>Test the parallel trends assumption with pre-treatment trend comparisons&lt;/li>
&lt;li>Interpret event study plots that reveal dynamic treatment effects over time&lt;/li>
&lt;li>Recognize why Two-Way Fixed Effects fails under staggered adoption and how Callaway-Sant&amp;rsquo;Anna corrects for it&lt;/li>
&lt;li>Assess robustness of causal conclusions using Bacon decomposition diagnostics and HonestDiD sensitivity analysis&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://colab.research.google.com/github/cmg777/starter-academic-v501/blob/master/content/post/python_did/notebook.ipynb" target="_blank">&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">&lt;/a>&lt;/p>
&lt;h2 id="conceptual-framework-what-is-difference-in-differences">Conceptual framework: What is Difference-in-Differences?&lt;/h2>
&lt;p>Imagine a school district deploys AI tutoring bots in some schools but not others, and you want to know whether the AI tools improved learning outcomes. You could compare learning scores at AI-equipped schools versus non-equipped schools after deployment. But AI-equipped schools might have had stronger students to begin with &amp;mdash; perhaps the district piloted the technology in its highest-performing schools. A simple post-treatment comparison confounds the AI effect with pre-existing differences. Alternatively, you could compare a single school before and after the AI rollout &amp;mdash; but learning scores might have been rising everywhere due to a new curriculum or improved teacher training, not the AI tools.&lt;/p>
&lt;p>DiD combines these two simpler comparisons so that selection bias and common time effects are each eliminated in turn. The logic proceeds through &lt;strong>successive differencing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>First difference&lt;/strong>: Compare a unit before and after treatment. This eliminates time-invariant differences between groups (e.g., one school always scores higher than another), but confounds the treatment effect with common time trends (e.g., district-wide learning improvements from a new curriculum).&lt;/li>
&lt;li>&lt;strong>Second difference&lt;/strong>: Difference the first differences between treated and control groups. This eliminates the common time trends, leaving only the treatment effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TB
subgraph &amp;quot;Before Treatment&amp;quot;
A[&amp;quot;&amp;lt;b&amp;gt;Treated Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment outcome&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment outcome&amp;quot;]
end
subgraph &amp;quot;After Treatment&amp;quot;
C[&amp;quot;&amp;lt;b&amp;gt;Treated Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment outcome&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment outcome&amp;quot;]
end
A --&amp;gt;|&amp;quot;Change in&amp;lt;br/&amp;gt;treated&amp;quot;| C
B --&amp;gt;|&amp;quot;Change in&amp;lt;br/&amp;gt;control&amp;quot;| D
style A fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="the-did-estimator">The DiD estimator&lt;/h3>
&lt;p>The 2x2 DiD estimator formalizes this double comparison. Let $k$ denote the treated group and $U$ the untreated group:&lt;/p>
&lt;p>$$\hat{\delta}^{2 \times 2}_{kU} = \big( \bar{Y}_k^{Post} - \bar{Y}_k^{Pre} \big) - \big( \bar{Y}_U^{Post} - \bar{Y}_U^{Pre} \big)$$&lt;/p>
&lt;p>In words: take the before-and-after change in the treated group, subtract the before-and-after change in the control group, and the remainder is the treatment effect. Here $\bar{Y}_k^{Post}$ is the average outcome for treated units in the post-treatment period (rows where &lt;code>treated = 1&lt;/code> and &lt;code>post = 1&lt;/code>), and similarly for the other three terms.&lt;/p>
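&lt;p>The formula is just arithmetic on four cell means. As a minimal sketch with illustrative numbers (the values below are made up for illustration, not taken from the dataset estimated later):&lt;/p>

```python
# Sketch: plug illustrative cell means into the 2x2 DiD formula.
# All four numbers are invented for illustration.
y_treated_pre = 11.0   # Ybar_k^Pre
y_treated_post = 18.7  # Ybar_k^Post
y_control_pre = 10.6   # Ybar_U^Pre
y_control_post = 13.1  # Ybar_U^Post

# First differences: before-and-after change within each group
change_treated = y_treated_post - y_treated_pre  # 7.7
change_control = y_control_post - y_control_pre  # 2.5

# Second difference: the DiD estimate
did_estimate = change_treated - change_control
print(f"DiD estimate: {did_estimate:.1f}")  # prints 5.2
```

&lt;p>This same two-step arithmetic is what the regression formulation later in this section automates.&lt;/p>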
&lt;h3 id="what-did-actually-estimates-the-potential-outcomes-framework">What DiD actually estimates: The potential outcomes framework&lt;/h3>
&lt;p>The sample-means formula above tells us &lt;em>how to compute&lt;/em> DiD from data, but it does not tell us &lt;em>what causal quantity&lt;/em> DiD recovers or &lt;em>under what assumptions&lt;/em> it is valid. To answer these deeper questions, we need the &lt;strong>potential outcomes framework&lt;/strong> (&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, 1974&lt;/a>).&lt;/p>
&lt;p>The key idea is that every unit has &lt;em>two&lt;/em> potential outcomes at every point in time, but we only ever observe one of them:&lt;/p>
&lt;ul>
&lt;li>$Y^1_{i}$ &amp;mdash; the outcome unit $i$ would experience &lt;strong>with&lt;/strong> treatment&lt;/li>
&lt;li>$Y^0_{i}$ &amp;mdash; the outcome unit $i$ would experience &lt;strong>without&lt;/strong> treatment&lt;/li>
&lt;/ul>
&lt;p>For a treated city, we observe $Y^1$ (what actually happened after adopting AI tutoring) but never $Y^0$ (what &lt;em>would have&lt;/em> happened had the city not adopted AI). For a control city, we observe $Y^0$ but never $Y^1$. This is the &lt;strong>fundamental problem of causal inference&lt;/strong>: for any individual unit, the causal effect $Y^1_{i} - Y^0_{i}$ is unobservable because one potential outcome is always missing.&lt;/p>
&lt;p>Since we cannot measure individual effects, we aim for the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; the average causal effect across all treated units in the post-treatment period:&lt;/p>
&lt;p>$$ATT = E[Y^1_k - Y^0_k | Post]$$&lt;/p>
&lt;p>In words: what is the average difference between what treated units actually experienced and what they &lt;em>would have&lt;/em> experienced without treatment, measured in the post-treatment period? Here $E[\cdot]$ denotes the expected value (population average), $k$ indexes the treated group, and the conditioning on $Post$ restricts attention to the post-treatment period. In our data, $E[Y^1_k | Post]$ corresponds to the average &lt;code>outcome&lt;/code> for rows where &lt;code>treated = 1&lt;/code> and &lt;code>post = 1&lt;/code> &amp;mdash; that is, $\bar{Y}_k^{Post}$ from the previous formula.&lt;/p>
&lt;p>The challenge is that $E[Y^0_k | Post]$ &amp;mdash; the average untreated outcome for the treated group after treatment &amp;mdash; is a &lt;strong>counterfactual&lt;/strong> that we never observe. Treated cities received the policy, so we cannot see what their outcomes would have been without it. This is where DiD&amp;rsquo;s clever trick comes in.&lt;/p>
&lt;h3 id="from-sample-means-to-potential-outcomes">From sample means to potential outcomes&lt;/h3>
&lt;p>Let us now connect the sample-means formula to potential outcomes by rewriting each $\bar{Y}$ term. For the &lt;strong>control group&lt;/strong>, which never receives treatment, the observed outcome always equals the untreated potential outcome: $Y_U = Y^0_U$ in both periods. For the &lt;strong>treated group&lt;/strong>, the observed outcome equals the untreated potential outcome before treatment ($Y_k = Y^0_k$ when $Pre$) and the treated potential outcome after ($Y_k = Y^1_k$ when $Post$). Substituting these into the DiD formula:&lt;/p>
&lt;p>$$\hat{\delta}^{2 \times 2}_{kU} = \big( \underbrace{\bar{Y}_k^{Post}}_{= E[Y^1_k | Post]} - \underbrace{\bar{Y}_k^{Pre}}_{= E[Y^0_k | Pre]} \big) - \big( \underbrace{\bar{Y}_U^{Post}}_{= E[Y^0_U | Post]} - \underbrace{\bar{Y}_U^{Pre}}_{= E[Y^0_U | Pre]} \big)$$&lt;/p>
&lt;p>On the left of the outer subtraction, the treated group&amp;rsquo;s pre-treatment mean uses $Y^0_k$ (no treatment yet) and post-treatment mean uses $Y^1_k$ (treatment is active). On the right, both control group means use $Y^0_U$ (never treated). Now we apply a standard algebraic trick: &lt;strong>add and subtract&lt;/strong> the unobserved counterfactual $E[Y^0_k | Post]$ inside the first parenthesis:&lt;/p>
&lt;p>$$= \big( E[Y^1_k | Post] - E[Y^0_k | Post] + E[Y^0_k | Post] - E[Y^0_k | Pre] \big) - \big( E[Y^0_U | Post] - E[Y^0_U | Pre] \big)$$&lt;/p>
&lt;p>Rearranging by grouping the first two terms and the last three:&lt;/p>
&lt;p>$$= \underbrace{E[Y^1_k | Post] - E[Y^0_k | Post]}_{ATT} + \underbrace{\big( E[Y^0_k | Post] - E[Y^0_k | Pre] \big) - \big( E[Y^0_U | Post] - E[Y^0_U | Pre] \big)}_{Bias}$$&lt;/p>
&lt;p>This is the fundamental decomposition of the DiD estimator (&lt;a href="https://mixtape.scunning.com/09-difference_in_differences" target="_blank" rel="noopener">Cunningham, 2021&lt;/a>). The first term is the &lt;strong>ATT&lt;/strong> &amp;mdash; the causal quantity we want. The second term is the &lt;strong>non-parallel trends bias&lt;/strong> &amp;mdash; the difference in how the two groups&amp;rsquo; untreated outcomes would have evolved over time. The bias term compares the untreated trajectory of the treated group ($E[Y^0_k | Post] - E[Y^0_k | Pre]$) against the untreated trajectory of the control group ($E[Y^0_U | Post] - E[Y^0_U | Pre]$). If the bias term is zero, the DiD estimator cleanly identifies the ATT.&lt;/p>
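&lt;p>The add-and-subtract step can be checked numerically. Here we assume hypothetical values for the potential-outcome means (the labels and numbers are ours, chosen so that the untreated trends happen to be parallel):&lt;/p>

```python
# Hypothetical potential-outcome means (illustrative values only)
E_y1_k_post = 18.0  # treated group, with treatment, post period (observed)
E_y0_k_post = 13.5  # treated group, without treatment, post (counterfactual)
E_y0_k_pre = 11.0   # treated group, pre period (untreated, observed)
E_y0_U_post = 13.0  # control group, post period (observed)
E_y0_U_pre = 10.5   # control group, pre period (observed)

# The DiD estimator uses only the observable pieces
did = (E_y1_k_post - E_y0_k_pre) - (E_y0_U_post - E_y0_U_pre)

# Its decomposition into ATT plus non-parallel-trends bias
att = E_y1_k_post - E_y0_k_post
bias = (E_y0_k_post - E_y0_k_pre) - (E_y0_U_post - E_y0_U_pre)

print(f"DiD = {did}, ATT = {att}, bias = {bias}")
```

&lt;p>Because both untreated trajectories were set to rise by 2.5 units, the bias term is zero and DiD recovers the ATT exactly; changing the counterfactual trend breaks that equality.&lt;/p>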
&lt;h3 id="parallel-trends-assumption">Parallel trends assumption&lt;/h3>
&lt;p>The bias term vanishes when the treated and control groups would have followed the same trajectory absent treatment:&lt;/p>
&lt;p>$$E[Y^0_k | Post] - E[Y^0_k | Pre] = E[Y^0_U | Post] - E[Y^0_U | Pre]$$&lt;/p>
&lt;p>This is the &lt;strong>parallel trends assumption&lt;/strong>. It does not require the groups to have the same outcome levels &amp;mdash; only the same &lt;em>trends&lt;/em>. Two cities can have different learning scores, but if their learning scores were rising at the same speed before the AI rollout, DiD can credibly estimate the policy&amp;rsquo;s impact. Importantly, this assumption is &lt;strong>fundamentally untestable&lt;/strong> because the counterfactual outcome $E[Y^0_k | Post]$ &amp;mdash; what would have happened to the treated group absent treatment &amp;mdash; is never observed. We can check whether trends were parallel in the pre-treatment period, but this does not guarantee they would have remained parallel afterward. This limitation is why Section 11 introduces HonestDiD sensitivity analysis.&lt;/p>
&lt;h3 id="regression-formulation">Regression formulation&lt;/h3>
&lt;p>In practice, DiD is implemented as a regression with an interaction term:&lt;/p>
&lt;p>$$Y_{it} = \alpha + \gamma \cdot Treated_i + \lambda \cdot Post_t + \delta \cdot (Treated_i \times Post_t) + \varepsilon_{it}$$&lt;/p>
&lt;p>where $Treated_i$ is the group indicator (our &lt;code>treated&lt;/code> column), $Post_t$ is the time indicator (our &lt;code>post&lt;/code> column), and $\delta$ is the DiD treatment effect. The coefficient $\gamma$ captures the pre-existing level difference between groups, and $\lambda$ captures the common time trend. This regression mechanically constructs the counterfactual using the control group&amp;rsquo;s trajectory &amp;mdash; it always estimates the $\delta$ coefficient as the extra change in the treated group, which is only valid if the counterfactual trend truly equals the control group&amp;rsquo;s trend.&lt;/p>
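&lt;p>The equivalence between the interaction coefficient and the double difference of cell means can be verified with plain least squares. This sketch uses only NumPy on a small simulated design of our own; any OLS routine would do:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
treated = np.repeat([0.0, 0.0, 1.0, 1.0], 50)  # 100 control, 100 treated rows
post = np.tile([0.0, 1.0], 100)                # alternating pre/post rows
y = (10 + 1.0 * treated + 2.5 * post + 5.0 * treated * post
     + rng.normal(0, 1, 200))

# Design matrix: intercept, Treated, Post, Treated x Post
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
alpha, gamma, lam, delta = np.linalg.lstsq(X, y, rcond=None)[0]

# The interaction coefficient equals the double difference of cell means
m = lambda g, p: y[(treated == g) & (post == p)].mean()
double_diff = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))
print(f"delta (OLS): {delta:.3f}  double difference: {double_diff:.3f}")
```

&lt;p>In the saturated 2x2 case the two numbers agree to machine precision; the regression form matters in practice because it also delivers standard errors and extends naturally to covariates.&lt;/p>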
&lt;p>&lt;strong>Estimand clarity:&lt;/strong> DiD targets the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; the average impact of treatment on those units that actually received it. This differs from the Average Treatment Effect (ATE), which averages over the entire population including units that were never treated. The ATT answers: &amp;ldquo;For the units that received the policy, how much did it change their outcomes?&amp;rdquo; This is typically the policy-relevant question, since the decision-maker wants to know whether the intervention helped the people it was aimed at.&lt;/p>
&lt;p>Now that we understand the logic, let us implement it step by step using the &lt;code>diff-diff&lt;/code> package.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package:&lt;/p>
&lt;pre>&lt;code class="language-python"># Run in terminal (or use !pip install in a notebook)
pip install diff-diff
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets configuration variables. The &lt;code>diff-diff&lt;/code> package provides &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_did_data()&lt;/code>&lt;/a> to create synthetic panel data with known treatment effects, &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>DifferenceInDifferences()&lt;/code>&lt;/a> for the classic 2x2 estimator, and several advanced estimators for multi-period and staggered designs.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from diff_diff import (
DifferenceInDifferences,
MultiPeriodDiD,
CallawaySantAnna,
BaconDecomposition,
HonestDiD,
generate_did_data,
generate_staggered_data,
check_parallel_trends,
)
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
# Dark-theme palette
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
&lt;/code>&lt;/pre>
&lt;h2 id="classic-2x2-did-design">Classic 2x2 DiD design&lt;/h2>
&lt;p>The simplest DiD setup has two groups (treated and control) observed at two time points (before and after treatment). We start here because the 2x2 case makes the mechanics of DiD transparent before moving to more complex designs.&lt;/p>
&lt;h3 id="generating-synthetic-panel-data">Generating synthetic panel data&lt;/h3>
&lt;p>We use &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_did_data()&lt;/code>&lt;/a> to create a synthetic panel where the true treatment effect is exactly 5.0 units. This known ground truth lets us verify that the estimator recovers the correct answer. The function creates a balanced panel with &lt;code>n_units&lt;/code> units observed over &lt;code>n_periods&lt;/code> periods, where &lt;code>treatment_fraction&lt;/code> of units receive treatment starting at &lt;code>treatment_period&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2 = generate_did_data(
n_units=100,
n_periods=10,
treatment_effect=5.0,
treatment_period=5,
treatment_fraction=0.5,
seed=RANDOM_SEED,
)
print(f&amp;quot;Dataset shape: {data_2x2.shape}&amp;quot;)
print(f&amp;quot;Columns: {data_2x2.columns.tolist()}&amp;quot;)
print(f&amp;quot;\nTreatment groups:&amp;quot;)
print(data_2x2.groupby(&amp;quot;treated&amp;quot;)[&amp;quot;unit&amp;quot;].nunique().rename(
{0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Treated&amp;quot;}))
print(f&amp;quot;\nPeriods: {sorted(int(p) for p in data_2x2['period'].unique())}&amp;quot;)
print(f&amp;quot;Treatment period: 5 (post = 1 for periods &amp;gt;= 5)&amp;quot;)
print(f&amp;quot;True treatment effect: 5.0&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (1000, 6)
Columns: ['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']
Treatment groups:
treated
Control 50
Treated 50
Name: unit, dtype: int64
Periods: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Treatment period: 5 (post = 1 for periods &amp;gt;= 5)
True treatment effect: 5.0
&lt;/code>&lt;/pre>
&lt;p>The synthetic panel contains 1,000 observations: 100 units observed across 10 periods (0 through 9). Half the units (50) are assigned to treatment, which begins at period 5. The dataset includes a &lt;code>true_effect&lt;/code> column that equals 0.0 in pre-treatment periods and 5.0 in post-treatment periods for treated units, providing a built-in benchmark. The &lt;code>post&lt;/code> indicator is 1 for periods 5&amp;ndash;9 and 0 for periods 0&amp;ndash;4, matching the binary time dimension of the classic 2x2 framework.&lt;/p>
&lt;h3 id="exploring-the-2x2-dataset">Exploring the 2x2 dataset&lt;/h3>
&lt;p>Before estimating any model, we inspect the raw data to understand its structure. The &lt;code>.head()&lt;/code> method shows the first rows so we can see how each observation is organized as a unit-period pair.&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.head(10)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period treated post outcome true_effect
0 0 1 0 10.231272 0.0
0 1 1 0 12.408662 0.0
0 2 1 0 11.253170 0.0
0 3 1 0 12.846950 0.0
0 4 1 0 11.675816 0.0
0 5 1 1 17.903997 5.0
0 6 1 1 17.659412 5.0
0 7 1 1 18.770401 5.0
0 8 1 1 20.449742 5.0
0 9 1 1 18.382114 5.0
&lt;/code>&lt;/pre>
&lt;p>Each row is one unit in one period. The &lt;code>unit&lt;/code> column identifies the individual, &lt;code>period&lt;/code> tracks time, &lt;code>treated&lt;/code> indicates group assignment (time-invariant), and &lt;code>post&lt;/code> flags observations after the treatment period. The &lt;code>outcome&lt;/code> column is what we aim to explain, and &lt;code>true_effect&lt;/code> is the ground truth we will try to recover. This unit-period structure is the hallmark of &lt;strong>panel data&lt;/strong> &amp;mdash; repeated observations on the same units over time.&lt;/p>
&lt;p>Summary statistics confirm the design parameters:&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period treated post outcome true_effect
count 1000.000000 1000.000000 1000.00000 1000.00000 1000.000000 1000.000000
mean 49.500000 4.500000 0.50000 0.50000 13.380874 1.250000
std 28.880514 2.873719 0.50025 0.50025 3.752000 2.166147
min 0.000000 0.000000 0.00000 0.00000 4.965883 0.000000
25% 24.750000 2.000000 0.00000 0.00000 10.716817 0.000000
50% 49.500000 4.500000 0.50000 0.50000 12.558536 0.000000
75% 74.250000 7.000000 1.00000 1.00000 15.926784 1.250000
max 99.000000 9.000000 1.00000 1.00000 24.294992 5.000000
&lt;/code>&lt;/pre>
&lt;p>The means of &lt;code>treated&lt;/code> and &lt;code>post&lt;/code> are both exactly 0.50, confirming a perfectly balanced design: half the units are treated, and half the time periods are post-treatment. The outcome ranges from about 5.0 to 24.3 with a mean of 13.4, reflecting the combination of time trends, unit effects, and treatment effects. The &lt;code>true_effect&lt;/code> mean of 1.25 comes from the fact that only 25% of observations (treated units in post-treatment periods) have a non-zero effect of 5.0.&lt;/p>
&lt;p>A crosstab reveals the 2x2 structure that gives DiD its name:&lt;/p>
&lt;pre>&lt;code class="language-python">pd.crosstab(data_2x2[&amp;quot;treated&amp;quot;], data_2x2[&amp;quot;post&amp;quot;], margins=True)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>post 0 1 All
treated
0 250 250 500
1 250 250 500
All 500 500 1000
&lt;/code>&lt;/pre>
&lt;p>This is the core of the 2x2 design: 250 observations in each of the four cells (control-pre, control-post, treated-pre, treated-post). The balanced allocation means each cell has equal weight in the estimator, which maximizes statistical power. In observational studies, these cell sizes are rarely equal, but the DiD estimator adjusts for imbalance automatically.&lt;/p>
&lt;p>Finally, we examine how the outcome varies across the four cells:&lt;/p>
&lt;pre>&lt;code class="language-python">data_2x2.groupby([&amp;quot;treated&amp;quot;, &amp;quot;post&amp;quot;])[&amp;quot;outcome&amp;quot;].describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> count mean std min 25% 50% 75% max
treated post
0 0 250.0 10.614957 1.871283 5.670539 9.261649 10.781139 11.866492 15.825691
1 250.0 13.086386 1.968271 8.158302 11.777457 13.149548 14.600075 18.372485
1 0 250.0 11.114546 2.015353 4.965883 9.909285 11.065526 12.494486 16.804462
1 250.0 18.707609 1.905034 13.182572 17.296981 18.870692 20.070330 24.294992
&lt;/code>&lt;/pre>
&lt;p>In the pre-treatment period, both groups have similar mean outcomes: 10.61 for the control group and 11.11 for the treated group &amp;mdash; a negligible difference of 0.50 that suggests the groups started on comparable footing. In the post-treatment period, the control group mean rises to 13.09 (a gain of 2.47), while the treated group mean jumps to 18.71 (a gain of 7.59). The extra gain for the treated group (7.59 - 2.47 = 5.12) closely approximates the treatment effect that DiD will formally estimate. The raw numbers already hint that something happened to the treated group beyond the natural time trend.&lt;/p>
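&lt;p>The hand calculation above is easy to script. The sketch below rebuilds the double difference from a &lt;code>groupby&lt;/code> table on a small synthetic panel of our own making; the arithmetic only needs the four cell means, so any panel with &lt;code>treated&lt;/code>, &lt;code>post&lt;/code>, and &lt;code>outcome&lt;/code> columns works the same way:&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_units, n_periods = 40, 10
df = pd.DataFrame({
    'unit': np.repeat(np.arange(n_units), n_periods),
    'period': np.tile(np.arange(n_periods), n_units),
})
df['treated'] = (df['unit'] < n_units // 2).astype(int)
df['post'] = (df['period'] >= 5).astype(int)
df['outcome'] = (10 + df['treated'] + 0.5 * df['period']
                 + 5.0 * df['treated'] * df['post']
                 + rng.normal(0, 1, len(df)))

# Four cell means, then the double difference by hand
cells = df.groupby(['treated', 'post'])['outcome'].mean()
did_by_hand = ((cells.loc[(1, 1)] - cells.loc[(1, 0)])
               - (cells.loc[(0, 1)] - cells.loc[(0, 0)]))
print(f"Hand-computed DiD: {did_by_hand:.2f}")
```

&lt;p>With a built-in effect of 5.0, the hand-computed double difference lands close to 5, mirroring the back-of-the-envelope calculation from the summary table.&lt;/p>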
&lt;p>The box plot below visualizes these distributions:&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
groups = [
(&amp;quot;Control, Pre&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 0) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 0)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Control, Post&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 0) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 1)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Treated, Pre&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 1) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 0)][&amp;quot;outcome&amp;quot;]),
(&amp;quot;Treated, Post&amp;quot;, data_2x2[(data_2x2[&amp;quot;treated&amp;quot;] == 1) &amp;amp; (data_2x2[&amp;quot;post&amp;quot;] == 1)][&amp;quot;outcome&amp;quot;]),
]
bp = ax.boxplot(
[g[1] for g in groups],
tick_labels=[g[0] for g in groups],
patch_artist=True,
widths=0.5,
medianprops=dict(color=WHITE_TEXT, linewidth=2),
)
box_colors = [STEEL_BLUE, STEEL_BLUE, WARM_ORANGE, WARM_ORANGE]
for patch, color in zip(bp[&amp;quot;boxes&amp;quot;], box_colors):
patch.set_facecolor(color)
patch.set_alpha(0.6)
ax.set_ylabel(&amp;quot;Outcome&amp;quot;)
ax.set_title(&amp;quot;Outcome Distribution by Treatment Group and Period&amp;quot;)
plt.savefig(&amp;quot;did_outcome_distribution.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_outcome_distribution.png" alt="Box plot showing outcome distributions for control and treated groups in pre and post periods. Both groups start with similar distributions, but the treated group shifts markedly upward in the post period.">&lt;/p>
&lt;p>The box plot makes the treatment effect visible at a glance. In the pre-treatment period, control (steel blue) and treated (warm orange) boxes overlap almost completely, centered around 10.6&amp;ndash;11.1. Both groups shift upward in the post period due to the natural time trend, but the treated group shifts &lt;em>more&lt;/em> &amp;mdash; its median jumps to around 18.9, compared to 13.1 for the control. The extra upward shift for the treated group is the treatment effect that DiD will formally estimate. Notice also that the spread (box height) remains similar across all four groups, suggesting that treatment affects the level but not the variability of outcomes.&lt;/p>
&lt;h3 id="visualizing-parallel-trends">Visualizing parallel trends&lt;/h3>
&lt;p>Before estimating the treatment effect, we check whether the treated and control groups followed similar trajectories in the pre-treatment period. This visual inspection is the first step in assessing whether the parallel trends assumption is plausible. If the two groups were diverging before treatment, any post-treatment difference could reflect pre-existing trends rather than a causal effect.&lt;/p>
&lt;pre>&lt;code class="language-python">treated_means = data_2x2[data_2x2[&amp;quot;treated&amp;quot;] == 1].groupby(&amp;quot;period&amp;quot;)[&amp;quot;outcome&amp;quot;].mean()
control_means = data_2x2[data_2x2[&amp;quot;treated&amp;quot;] == 0].groupby(&amp;quot;period&amp;quot;)[&amp;quot;outcome&amp;quot;].mean()
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.plot(control_means.index, control_means.values, &amp;quot;o-&amp;quot;,
color=STEEL_BLUE, linewidth=2, markersize=7, label=&amp;quot;Control group&amp;quot;)
ax.plot(treated_means.index, treated_means.values, &amp;quot;s-&amp;quot;,
color=WARM_ORANGE, linewidth=2, markersize=7, label=&amp;quot;Treated group&amp;quot;)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5,
alpha=0.7, label=&amp;quot;Treatment onset&amp;quot;)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Average Outcome&amp;quot;)
ax.set_title(&amp;quot;Parallel Trends: Treatment vs Control Groups&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_parallel_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_parallel_trends.png" alt="Parallel trends plot showing treatment and control groups tracking closely in pre-treatment periods 0-4, then diverging sharply after treatment onset at period 5.">&lt;/p>
&lt;p>The two groups move in lockstep during periods 0 through 4, confirming that the parallel trends assumption holds in this synthetic dataset. Both lines fluctuate around similar values with no visible divergence before period 5. After treatment onset, the treated group (warm orange) jumps upward while the control group (steel blue) continues its prior trajectory. The gap between the two lines in the post-treatment period visually represents the treatment effect &amp;mdash; roughly 5 units, consistent with the true effect built into the data.&lt;/p>
&lt;h3 id="estimating-the-treatment-effect">Estimating the treatment effect&lt;/h3>
&lt;p>With parallel trends confirmed visually, we apply the classic DiD estimator. The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>DifferenceInDifferences()&lt;/code>&lt;/a> class implements the 2x2 design with analytical standard errors. The &lt;code>.fit()&lt;/code> method takes the data along with column names for the outcome, treatment indicator, and time indicator (pre/post).&lt;/p>
&lt;pre>&lt;code class="language-python">did = DifferenceInDifferences()
results_2x2 = did.fit(data_2x2, outcome=&amp;quot;outcome&amp;quot;,
treatment=&amp;quot;treated&amp;quot;, time=&amp;quot;post&amp;quot;)
results_2x2.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>======================================================================
Difference-in-Differences Estimation Results
======================================================================
Observations: 1000
Treated units: 500
Control units: 500
R-squared: 0.7332
----------------------------------------------------------------------
Parameter Estimate Std. Err. t-stat P&amp;gt;|t|
----------------------------------------------------------------------
ATT 5.1216 0.2455 20.863 0.0000 ***
----------------------------------------------------------------------
95% Confidence Interval: [4.6399, 5.6034]
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
======================================================================
&lt;/code>&lt;/pre>
&lt;p>The estimated ATT is 5.12, close to the true effect of 5.0, with a standard error of 0.25. The t-statistic of 20.86 and p-value near zero confirm that the effect is highly statistically significant. The 95% confidence interval [4.64, 5.60] comfortably contains the true value of 5.0, demonstrating that the classic DiD estimator successfully recovers the known treatment effect. The small deviation from 5.0 (an overestimate of 0.12) reflects sampling variability, not estimator bias &amp;mdash; with 100 units and 10 periods, some random noise is expected.&lt;/p>
&lt;h3 id="visualizing-the-counterfactual">Visualizing the counterfactual&lt;/h3>
&lt;p>DiD&amp;rsquo;s power lies in constructing a &lt;strong>counterfactual&lt;/strong> &amp;mdash; what would have happened to the treated group without treatment. We build this by projecting the control group&amp;rsquo;s post-treatment trajectory, shifted up by the pre-treatment gap between the groups. The shaded area between the actual treated outcomes and this counterfactual line represents the estimated causal effect.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.plot(control_means.index, control_means.values, &amp;quot;o-&amp;quot;,
color=STEEL_BLUE, linewidth=2, markersize=7, label=&amp;quot;Control group&amp;quot;)
ax.plot(treated_means.index, treated_means.values, &amp;quot;s-&amp;quot;,
color=WARM_ORANGE, linewidth=2, markersize=7, label=&amp;quot;Treated group&amp;quot;)
# Counterfactual: treated group without treatment
pre_diff = treated_means.loc[:4].mean() - control_means.loc[:4].mean()
counterfactual = control_means.loc[5:] + pre_diff
ax.plot(counterfactual.index, counterfactual.values, &amp;quot;s--&amp;quot;,
color=TEAL, linewidth=2, markersize=7,
label=&amp;quot;Counterfactual (no treatment)&amp;quot;)
ax.fill_between(counterfactual.index, counterfactual.values,
treated_means.loc[5:].values, alpha=0.2, color=TEAL,
label=f&amp;quot;Treatment effect (ATT ≈ {results_2x2.att:.1f})&amp;quot;)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.7)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Average Outcome&amp;quot;)
ax.set_title(&amp;quot;DiD Treatment Effect: Observed vs Counterfactual&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_treatment_effect.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_treatment_effect.png" alt="Counterfactual plot showing the treated group diverging from its projected path after treatment. The teal shaded area between the actual and counterfactual lines represents the causal effect.">&lt;/p>
&lt;p>The teal dashed line traces where the treated group would have been without the intervention, constructed by shifting the control group&amp;rsquo;s post-treatment path to match the treated group&amp;rsquo;s pre-treatment level. The shaded gap between the actual treated outcomes (warm orange) and this counterfactual (teal) is the estimated causal effect &amp;mdash; approximately 5.1 units per period. This visualization makes the DiD logic tangible: the control group&amp;rsquo;s trajectory serves as the mirror image of the treated group&amp;rsquo;s no-treatment path, and the extra gain above that mirror is what the policy caused.&lt;/p>
&lt;h2 id="testing-parallel-trends">Testing parallel trends&lt;/h2>
&lt;p>The visual check suggested parallel trends hold, but a formal statistical test provides more rigorous evidence. The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>check_parallel_trends()&lt;/code>&lt;/a> function compares the pre-treatment time trends of the treated and control groups by estimating a linear slope for each group across the pre-treatment periods, then testing whether the two slopes are statistically different.&lt;/p>
&lt;pre>&lt;code class="language-python">pt_result = check_parallel_trends(
data_2x2,
outcome=&amp;quot;outcome&amp;quot;,
time=&amp;quot;period&amp;quot;,
treatment_group=&amp;quot;treated&amp;quot;,
pre_periods=[0, 1, 2, 3, 4],
)
print(f&amp;quot;Treated group pre-trend slope: {pt_result['treated_trend']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['treated_trend_se']:.4f})&amp;quot;)
print(f&amp;quot;Control group pre-trend slope: {pt_result['control_trend']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['control_trend_se']:.4f})&amp;quot;)
print(f&amp;quot;Trend difference: {pt_result['trend_difference']:.4f}&amp;quot;
f&amp;quot; (SE = {pt_result['trend_difference_se']:.4f})&amp;quot;)
print(f&amp;quot;t-statistic: {pt_result['t_statistic']:.4f}&amp;quot;)
print(f&amp;quot;p-value: {pt_result['p_value']:.4f}&amp;quot;)
print(f&amp;quot;Parallel trends plausible: {pt_result['parallel_trends_plausible']}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Treated group pre-trend slope: 0.5262 (SE = 0.0839)
Control group pre-trend slope: 0.4047 (SE = 0.0798)
Trend difference: 0.1216 (SE = 0.1158)
t-statistic: 1.0497
p-value: 0.2938
Parallel trends plausible: True
&lt;/code>&lt;/pre>
&lt;p>The pre-treatment trend slopes are 0.53 for the treated group and 0.40 for the control group &amp;mdash; a difference of 0.12 with a p-value of 0.29. Since p &amp;gt; 0.05, we fail to reject the null hypothesis that the trends are equal, supporting the parallel trends assumption. However, a critical caveat: &lt;em>failing to reject is not the same as confirming&lt;/em>. The test has limited power, especially with only 5 pre-treatment periods. Even if the trends differed slightly, this test might not detect it. Moreover, &lt;a href="https://doi.org/10.1257/aeri.20210236" target="_blank" rel="noopener">Roth (2022)&lt;/a> shows that conditioning on passing a pre-test can distort subsequent inference &amp;mdash; estimated effects may be biased toward zero and confidence intervals may have incorrect coverage. This is why Section 11 introduces HonestDiD, which asks: &amp;ldquo;How wrong could parallel trends be before our conclusion changes?&amp;rdquo; That question is more informative than a binary pass/fail test.&lt;/p>
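&lt;p>The mechanics behind this test can be approximated by hand (our sketch, not the package's implementation): fit a least-squares line to each group's pre-period means and compare the slopes. The group means below are illustrative numbers, not the tutorial's actual data:&lt;/p>

```python
import numpy as np

# Illustrative pre-treatment group means for periods 0-4 (made-up values)
pre_periods = np.arange(5)
treated_pre = np.array([11.0, 11.6, 12.0, 12.6, 13.1])
control_pre = np.array([10.6, 11.0, 11.5, 11.8, 12.2])

# Slope of each group's linear pre-trend
slope_treated = np.polyfit(pre_periods, treated_pre, 1)[0]
slope_control = np.polyfit(pre_periods, control_pre, 1)[0]
print(f"Treated slope:  {slope_treated:.2f}")
print(f"Control slope:  {slope_control:.2f}")
print(f"Difference:     {slope_treated - slope_control:.2f}")
```

&lt;p>A formal test additionally needs a standard error for the slope difference (for example, clustered by unit), which is what the package reports; a small p-value would flag diverging pre-trends. As the surrounding discussion stresses, a large p-value is reassuring but not confirmatory.&lt;/p>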
&lt;h2 id="event-study-dynamic-treatment-effects">Event study: Dynamic treatment effects&lt;/h2>
&lt;p>The 2x2 estimator produces a single ATT that averages across all post-treatment periods. But treatment effects often change over time &amp;mdash; they might build up gradually, appear immediately, or fade out. An &lt;strong>event study&lt;/strong> (also called dynamic DiD) estimates separate effects for each period relative to treatment, revealing the full trajectory.&lt;/p>
&lt;p>The event study extends the basic DiD regression by replacing the single treatment effect $\delta$ with a set of period-specific coefficients &amp;mdash; one for each period before and after treatment:&lt;/p>
&lt;p>$$Y_{it} = \gamma_i + \lambda_t + \sum_{k=-K+1}^{-2} \beta_k^{lead} D_{it}^k + \sum_{k=0}^{L} \beta_k^{lag} D_{it}^k + \varepsilon_{it}$$&lt;/p>
&lt;p>Let us unpack each component of this equation:&lt;/p>
&lt;ul>
&lt;li>$Y_{it}$ is the outcome for unit $i$ at time $t$ &amp;mdash; the variable we are trying to explain (our &lt;code>outcome&lt;/code> column).&lt;/li>
&lt;li>$\gamma_i$ are &lt;strong>unit fixed effects&lt;/strong> &amp;mdash; a separate intercept for each unit that absorbs all time-invariant characteristics. For example, if one city always has higher learning scores than another due to demographics or school funding levels, $\gamma_i$ captures that permanent difference. In practice, this is equivalent to demeaning each unit&amp;rsquo;s outcome by its own time-average.&lt;/li>
&lt;li>$\lambda_t$ are &lt;strong>time fixed effects&lt;/strong> &amp;mdash; a separate intercept for each period that absorbs shocks common to all units at a given time. If a national curriculum reform in period 3 raises learning outcomes for everyone equally, $\lambda_t$ captures that common shift. Together with unit fixed effects, this implements the &amp;ldquo;two-way&amp;rdquo; in TWFE.&lt;/li>
&lt;li>$D_{it}^k$ is a &lt;strong>relative-time indicator&lt;/strong> (also called an event-time dummy): it equals 1 when unit $i$ at time $t$ is exactly $k$ periods away from its treatment onset, and 0 otherwise. For a unit first treated at period 5, we have $D_{i,3}^{-2} = 1$ (two periods before treatment), $D_{i,5}^{0} = 1$ (the treatment period itself), $D_{i,7}^{2} = 1$ (two periods after treatment), and so on.&lt;/li>
&lt;li>$\beta_k^{lead}$ (for $k = -K+1, \ldots, -2$) are the &lt;strong>lead coefficients&lt;/strong> &amp;mdash; pre-treatment effects at each period before treatment. These serve as &lt;strong>placebo tests&lt;/strong>: if the treated and control groups were evolving similarly before the intervention, all lead coefficients should be close to zero and statistically insignificant. A significant lead coefficient signals a pre-existing divergence, which would undermine the parallel trends assumption. The summation starts at $k = -K+1$ (the earliest available lead) and stops at $k = -2$, because the period immediately before treatment ($k = -1$) is &lt;strong>omitted as the reference period&lt;/strong> and normalized to zero. All other coefficients are estimated relative to this baseline.&lt;/li>
&lt;li>$\beta_k^{lag}$ (for $k = 0, 1, \ldots, L$) are the &lt;strong>lag coefficients&lt;/strong> &amp;mdash; post-treatment effects at each period after treatment onset. The coefficient $\beta_0^{lag}$ captures the &lt;strong>instantaneous effect&lt;/strong> at the moment treatment begins, $\beta_1^{lag}$ captures the effect one period later, and so on through $\beta_L^{lag}$ at $L$ periods after treatment. These coefficients trace out the &lt;strong>dynamic treatment effect trajectory&lt;/strong>: they reveal whether the effect appears immediately or builds up gradually, whether it persists or fades out, and whether it stabilizes at a constant level or continues to grow.&lt;/li>
&lt;li>$\varepsilon_{it}$ is the error term, capturing all unobserved factors not absorbed by the fixed effects or treatment indicators.&lt;/li>
&lt;/ul>
&lt;p>The key insight is that this single equation simultaneously tests the identifying assumption &lt;em>and&lt;/em> estimates the treatment effect. The leads ($\beta_k^{lead}$) test parallel trends period by period, while the lags ($\beta_k^{lag}$) reveal how the treatment effect evolves over time. In our tutorial, treatment begins at period 5 and the reference period is 4 ($k = -1$), so we have 4 lead coefficients at $k = -5, -4, -3, -2$ (corresponding to periods 0&amp;ndash;3) and, with $L = 4$, five lag coefficients at $k = 0, 1, 2, 3, 4$ (corresponding to periods 5&amp;ndash;9).&lt;/p>
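&lt;p>Constructing the relative-time dummies $D_{it}^k$ is mechanical: subtract each unit&amp;rsquo;s treatment onset from the calendar period and build one indicator per event time, skipping $k = -1$. A minimal sketch for a single hypothetical unit (not the library&amp;rsquo;s internals):&lt;/p>

```python
import numpy as np

# Hypothetical single unit first treated at period 5, observed over periods 0..9
first_treat = 5
periods = np.arange(10)
rel_time = periods - first_treat                 # event time k = t - first_treat

# One indicator per event time, omitting k = -1 as the reference period
ks = [k for k in range(-5, 5) if k != -1]
D = np.column_stack([(rel_time == k).astype(int) for k in ks])

print(D[3].tolist())  # period 3 is k = -2: [0, 0, 0, 1, 0, 0, 0, 0, 0]
print(D[4].tolist())  # period 4 is k = -1, the omitted baseline: all zeros
```

&lt;p>The all-zero row for the reference period is what normalizes $\beta_{-1}$ to zero: every other coefficient is measured relative to that baseline.&lt;/p>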
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>MultiPeriodDiD()&lt;/code>&lt;/a> estimator fits this specification, using one pre-treatment period as the reference point.&lt;/p>
&lt;pre>&lt;code class="language-python">event = MultiPeriodDiD()
results_event = event.fit(
data_2x2,
outcome=&amp;quot;outcome&amp;quot;,
treatment=&amp;quot;treated&amp;quot;,
time=&amp;quot;period&amp;quot;,
post_periods=[5, 6, 7, 8, 9],
reference_period=4,
)
results_event.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>================================================================================
Multi-Period Difference-in-Differences Estimation Results
================================================================================
Observations: 1000
Treated observations: 500
Control observations: 500
Pre-treatment periods: 5
Post-treatment periods: 5
R-squared: 0.7648
--------------------------------------------------------------------------------
Pre-Period Effects (Parallel Trends Test)
--------------------------------------------------------------------------------
Period Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
0 -0.5167 0.5121 -1.009 0.3132
1 -0.5050 0.5031 -1.004 0.3157
2 -0.2804 0.5228 -0.536 0.5919
3 -0.3227 0.5187 -0.622 0.5340
[ref: 4] 0.0000 --- --- ---
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Post-Period Treatment Effects
--------------------------------------------------------------------------------
Period Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
5 4.6509 0.5162 9.011 0.0000 ***
6 4.8285 0.5227 9.238 0.0000 ***
7 4.6907 0.5068 9.255 0.0000 ***
8 4.7888 0.4908 9.757 0.0000 ***
9 5.0244 0.5203 9.657 0.0000 ***
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Average Treatment Effect (across post-periods)
--------------------------------------------------------------------------------
Parameter Estimate Std. Err. t-stat P&amp;gt;|t| Sig.
--------------------------------------------------------------------------------
Avg ATT 4.7967 0.3923 12.227 0.0000 ***
--------------------------------------------------------------------------------
95% Confidence Interval: [4.0269, 5.5665]
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
================================================================================
&lt;/code>&lt;/pre>
&lt;p>The pre-treatment coefficients (periods 0&amp;ndash;3) are all small and statistically insignificant, ranging from -0.52 to -0.28 with p-values well above 0.05. This confirms that the treated and control groups were evolving similarly before the intervention &amp;mdash; the period-by-period placebo test passes. In contrast, all five post-treatment effects (periods 5&amp;ndash;9) are large and highly significant, ranging from 4.65 to 5.02 with t-statistics above 9.0. The average ATT across post periods is 4.80 with a 95% CI of [4.03, 5.57], consistent with the true effect of 5.0. The effects are remarkably stable over time, indicating no fade-out or build-up &amp;mdash; the treatment shifts outcomes by roughly 5 units immediately and maintains that shift.&lt;/p>
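&lt;p>As a quick sanity check, the reported average ATT is, up to rounding, consistent with the unweighted mean of the five post-period estimates in the table above:&lt;/p>

```python
# Post-period estimates copied from the summary table above
post_effects = [4.6509, 4.8285, 4.6907, 4.7888, 5.0244]
avg_att = sum(post_effects) / len(post_effects)
print(round(avg_att, 4))  # 4.7967, matching the reported Avg ATT
```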
&lt;p>The event study plot below makes these dynamics visible:&lt;/p>
&lt;pre>&lt;code class="language-python">es_df = results_event.to_dataframe()
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
pre = es_df[~es_df[&amp;quot;is_post&amp;quot;]]
post = es_df[es_df[&amp;quot;is_post&amp;quot;]]
ax.errorbar(pre[&amp;quot;period&amp;quot;], pre[&amp;quot;effect&amp;quot;], yerr=1.96 * pre[&amp;quot;se&amp;quot;],
fmt=&amp;quot;o&amp;quot;, color=STEEL_BLUE, capsize=4, linewidth=2,
markersize=8, label=&amp;quot;Pre-treatment&amp;quot;)
ax.errorbar(post[&amp;quot;period&amp;quot;], post[&amp;quot;effect&amp;quot;], yerr=1.96 * post[&amp;quot;se&amp;quot;],
fmt=&amp;quot;s&amp;quot;, color=WARM_ORANGE, capsize=4, linewidth=2,
markersize=8, label=&amp;quot;Post-treatment&amp;quot;)
# Reference period
ax.plot(4, 0, &amp;quot;D&amp;quot;, color=WHITE_TEXT, markersize=10, zorder=5,
label=&amp;quot;Reference period&amp;quot;)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1, alpha=0.5)
ax.axvline(x=4.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.5)
ax.axhline(y=5.0, color=TEAL, linestyle=&amp;quot;:&amp;quot;, linewidth=1.5, alpha=0.7,
label=&amp;quot;True effect (5.0)&amp;quot;)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Estimated Effect&amp;quot;)
ax.set_title(&amp;quot;Event Study: Dynamic Treatment Effects&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_event_study.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_event_study.png" alt="Event study plot with pre-treatment coefficients clustered near zero and post-treatment coefficients jumping to approximately 5.0. Confidence intervals shown for each period.">&lt;/p>
&lt;p>The event study plot tells the DiD story at a glance. Pre-treatment coefficients (steel blue circles) hover near the zero line, their confidence intervals all crossing zero &amp;mdash; this is the visual signature of valid parallel trends. At the treatment cutoff (dashed vertical line), the estimates jump sharply to around 5.0 (warm orange squares), and the teal dotted line at 5.0 shows that every post-treatment estimate is close to the true effect. The confidence intervals in the post-treatment period are narrow and well above zero, confirming both statistical significance and accuracy.&lt;/p>
&lt;p>With the classic 2x2 case established, the next question is: what happens when different units adopt treatment at different times?&lt;/p>
&lt;h2 id="staggered-adoption-why-twfe-fails">Staggered adoption: Why TWFE fails&lt;/h2>
&lt;p>In many real-world policies, treatment does not begin simultaneously for all units. AI tutoring platforms roll out city by city, digital infrastructure investments phase in over years, and educational technology grants expand district by district. This is &lt;strong>staggered adoption&lt;/strong> &amp;mdash; different units start treatment at different times.&lt;/p>
&lt;p>The traditional approach is &lt;strong>Two-Way Fixed Effects (TWFE)&lt;/strong> regression, which estimates a single treatment coefficient using unit and time fixed effects:&lt;/p>
&lt;p>$$Y_{it} = \gamma_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}$$&lt;/p>
&lt;p>Here $\gamma_i$ absorbs all time-invariant unit characteristics (unit fixed effects), $\lambda_t$ absorbs all common time shocks (time fixed effects), $D_{it}$ is a treatment indicator that equals 1 when unit $i$ is treated at time $t$, and $\delta$ is the single treatment effect that TWFE estimates. With a single treatment period, $\delta$ correctly recovers the ATT. But with staggered timing, the single coefficient $\delta$ is a weighted average of many underlying 2x2 comparisons &amp;mdash; and some of those comparisons are problematic.&lt;/p>
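&lt;p>The mechanics of $\hat{\delta}$ can be sketched with the two-way within transformation, which sweeps out both sets of fixed effects before a simple slope calculation. On a noiseless toy panel (hypothetical numbers: two units, four periods, one unit treated from period 2), the estimator recovers the true $\delta = 3$ exactly:&lt;/p>

```python
import numpy as np

# Toy panel: outcome = unit effect + time effect + 3 * D, no noise,
# so the within estimator should recover delta = 3 exactly.
a = np.array([1.0, 2.0])             # unit fixed effects
b = np.array([0.0, 1.0, 2.0, 3.0])   # time fixed effects
D = np.array([[0.0, 0.0, 1.0, 1.0],  # unit 0 treated from period 2
              [0.0, 0.0, 0.0, 0.0]]) # unit 1 never treated
y = a[:, None] + b[None, :] + 3.0 * D

def within(x):
    """Two-way within transformation: subtract unit and time means, add grand mean."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

delta = (within(D) * within(y)).sum() / (within(D) ** 2).sum()
print(delta)  # 3.0
```

&lt;p>With staggered timing and heterogeneous effects, this same arithmetic silently mixes the clean and forbidden comparisons described next.&lt;/p>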
&lt;p>The problem is that TWFE makes &lt;strong>forbidden comparisons&lt;/strong>: it implicitly uses already-treated units as controls for newly-treated units. If treatment effects grow over time, these forbidden comparisons produce negative bias, pulling the overall estimate downward. Think of it this way: if early adopters have been benefiting from treatment for three years and their outcomes have grown substantially, TWFE compares newly-treated units to these high-performing early adopters. The newly-treated units look &lt;em>worse&lt;/em> by comparison, even though they are genuinely benefiting from treatment. In extreme cases with heterogeneous treatment effects across cohorts, TWFE can even assign &lt;strong>negative weights&lt;/strong> to some 2x2 comparisons, potentially flipping the sign of the estimate opposite to every unit&amp;rsquo;s true treatment effect (this does not occur in our example, but is documented in &lt;a href="https://doi.org/10.1257/aer.20181169" target="_blank" rel="noopener">de Chaisemartin &amp;amp; D&amp;rsquo;Haultfoeuille, 2020&lt;/a>).&lt;/p>
&lt;h3 id="generating-staggered-adoption-data">Generating staggered adoption data&lt;/h3>
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>generate_staggered_data()&lt;/code>&lt;/a> function creates a panel with multiple treatment cohorts &amp;mdash; groups of units that begin treatment in different periods &amp;mdash; plus a never-treated group.&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag = generate_staggered_data(
n_units=300,
n_periods=10,
seed=RANDOM_SEED,
)
print(f&amp;quot;Dataset shape: {data_stag.shape}&amp;quot;)
cohorts = data_stag.groupby(&amp;quot;first_treat&amp;quot;)[&amp;quot;unit&amp;quot;].nunique()
print(f&amp;quot;\nCohort sizes:&amp;quot;)
for ft, n in cohorts.items():
label = &amp;quot;Never-treated&amp;quot; if ft == 0 else f&amp;quot;First treated in period {ft}&amp;quot;
print(f&amp;quot; {label}: {n} units&amp;quot;)
print(f&amp;quot;\nTotal units: {cohorts.sum()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (3000, 7)
Cohort sizes:
Never-treated: 90 units
First treated in period 3: 60 units
First treated in period 5: 75 units
First treated in period 7: 75 units
Total units: 300
&lt;/code>&lt;/pre>
&lt;p>The staggered panel has 3,000 observations (300 units across 10 periods). Three treatment cohorts adopt at different times: 60 units start treatment in period 3, 75 in period 5, and 75 in period 7. Another 90 units are never treated, serving as a clean control group. The &lt;code>first_treat&lt;/code> column records when each unit first received treatment (0 for never-treated). This staggered structure is where naive TWFE breaks down, as the next section demonstrates.&lt;/p>
&lt;h3 id="exploring-the-staggered-dataset">Exploring the staggered dataset&lt;/h3>
&lt;p>The staggered dataset has a richer structure than the 2x2 case. Inspecting the first rows reveals additional columns:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.head(10)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
0 0 11.278161 0 0 0 0.0
0 1 11.835615 0 0 0 0.0
0 2 11.542112 0 0 0 0.0
0 3 11.716260 0 0 0 0.0
0 4 12.289791 0 0 0 0.0
0 5 10.978501 0 0 0 0.0
0 6 11.426795 0 0 0 0.0
0 7 11.433938 0 0 0 0.0
0 8 11.108223 0 0 0 0.0
0 9 12.035899 0 0 0 0.0
&lt;/code>&lt;/pre>
&lt;p>Unit 0 is never-treated, so all indicators stay at zero across all 10 periods. To understand the staggered structure, we need to see what happens to treated units. The columns have distinct roles:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>first_treat&lt;/code>&lt;/strong>: the period when a unit first receives treatment (0 = never treated)&lt;/li>
&lt;li>&lt;strong>&lt;code>treat&lt;/code>&lt;/strong>: &lt;strong>time-invariant&lt;/strong> group membership &amp;mdash; equals 1 for any unit &lt;em>ever&lt;/em> assigned to treatment, 0 for never-treated&lt;/li>
&lt;li>&lt;strong>&lt;code>treated&lt;/code>&lt;/strong>: &lt;strong>time-varying&lt;/strong> post-treatment indicator &amp;mdash; equals 0 before treatment onset and switches to 1 at &lt;code>first_treat&lt;/code>&lt;/li>
&lt;li>&lt;strong>&lt;code>true_effect&lt;/code>&lt;/strong>: the known ground-truth treatment effect at each period, used for verification&lt;/li>
&lt;/ul>
&lt;p>The distinction between &lt;code>treat&lt;/code> and &lt;code>treated&lt;/code> is crucial: &lt;code>treat&lt;/code> tells you &lt;em>who&lt;/em> is in the treatment group (a permanent label), while &lt;code>treated&lt;/code> tells you &lt;em>when&lt;/em> they are actually under treatment (a dynamic state). For never-treated units, both are always 0. For treated units, &lt;code>treat&lt;/code> is always 1, but &lt;code>treated&lt;/code> flips from 0 to 1 at the unit&amp;rsquo;s treatment onset.&lt;/p>
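&lt;p>Both indicators derive mechanically from &lt;code>first_treat&lt;/code>. A minimal sketch with toy values (not the data generator&amp;rsquo;s code): one never-treated unit, one cohort-3 unit, and one cohort-7 unit, observed at period 5:&lt;/p>

```python
import numpy as np

first_treat = np.array([0, 3, 7])  # 0 = never treated
period = 5

treat = (first_treat > 0).astype(int)                                # ever treated?
treated = ((first_treat > 0) & (period >= first_treat)).astype(int)  # under treatment now?
print(treat.tolist(), treated.tolist())  # [0, 1, 1] [0, 1, 0]
```

&lt;p>At period 5, the cohort-7 unit is labeled &lt;code>treat=1&lt;/code> but &lt;code>treated=0&lt;/code>: it belongs to the treatment group yet is still in its pre-treatment phase.&lt;/p>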
&lt;p>An early-treated unit from cohort 3 illustrates this structure:&lt;/p>
&lt;pre>&lt;code class="language-python">early_unit = data_stag[data_stag[&amp;quot;first_treat&amp;quot;] == 3][&amp;quot;unit&amp;quot;].iloc[0]
data_stag[data_stag[&amp;quot;unit&amp;quot;] == early_unit]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
90 0 13.299816 3 0 1 0.0
90 1 12.897337 3 0 1 0.0
90 2 11.882534 3 0 1 0.0
90 3 14.724679 3 1 1 2.0
90 4 16.139340 3 1 1 2.2
90 5 14.433891 3 1 1 2.4
90 6 15.949127 3 1 1 2.6
90 7 15.832888 3 1 1 2.8
90 8 17.125174 3 1 1 3.0
90 9 16.685332 3 1 1 3.2
&lt;/code>&lt;/pre>
&lt;p>Unit 90 has &lt;code>treat=1&lt;/code> throughout (it belongs to the treatment group), but &lt;code>treated&lt;/code> flips from 0 to 1 at period 3 &amp;mdash; the moment it enters the post-treatment state. The &lt;code>true_effect&lt;/code> is 0 in the pre-treatment periods, then starts at 2.0 and grows by 0.2 each period, reaching 3.2 by period 9. This growing effect pattern is what makes staggered DiD challenging: the treatment effect for cohort 3 at period 7 (2.8) is very different from the effect at period 3 (2.0).&lt;/p>
&lt;p>Now compare with a late-treated unit from cohort 7:&lt;/p>
&lt;pre>&lt;code class="language-python">late_unit = data_stag[data_stag[&amp;quot;first_treat&amp;quot;] == 7][&amp;quot;unit&amp;quot;].iloc[0]
data_stag[data_stag[&amp;quot;unit&amp;quot;] == late_unit]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
91 0 7.987886 7 0 1 0.0
91 1 8.168639 7 0 1 0.0
91 2 8.904022 7 0 1 0.0
91 3 7.984438 7 0 1 0.0
91 4 8.373931 7 0 1 0.0
91 5 7.543381 7 0 1 0.0
91 6 8.981115 7 0 1 0.0
91 7 10.105654 7 1 1 2.0
91 8 10.505532 7 1 1 2.2
91 9 11.074785 7 1 1 2.4
&lt;/code>&lt;/pre>
&lt;p>Unit 91 also has &lt;code>treat=1&lt;/code> throughout, but &lt;code>treated&lt;/code> does not flip until period 7 &amp;mdash; giving it a much longer pre-treatment phase (7 periods vs 3 for cohort 3) and only 3 post-treatment periods. Its &lt;code>true_effect&lt;/code> starts at 2.0 at period 7 and reaches only 2.4 by period 9, compared to cohort 3&amp;rsquo;s 3.2. This asymmetry &amp;mdash; early cohorts accumulating larger effects over more post-treatment periods &amp;mdash; is precisely what causes TWFE to produce biased estimates when it uses already-treated cohort 3 units as &amp;ldquo;controls&amp;rdquo; for cohort 7.&lt;/p>
&lt;p>Let us examine how the staggered structure differs from the 2x2 case in scale and treatment coverage. With multiple cohorts adopting at different times, the fraction of observations in post-treatment state is no longer 50%:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.describe()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> unit period outcome first_treat treated treat true_effect
count 3000.000000 3000.00000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 149.500000 4.50000 11.287067 3.600000 0.340000 0.700000 0.829000
std 86.616497 2.87276 2.528589 2.709695 0.473788 0.458334 1.173464
min 0.000000 0.00000 4.521385 0.000000 0.000000 0.000000 0.000000
25% 74.750000 2.00000 9.461867 0.000000 0.000000 0.000000 0.000000
50% 149.500000 4.50000 11.107083 4.000000 0.000000 1.000000 0.000000
75% 224.250000 7.00000 13.078036 5.500000 1.000000 1.000000 2.200000
max 299.000000 9.00000 20.616391 7.000000 1.000000 1.000000 3.200000
&lt;/code>&lt;/pre>
&lt;p>With 3,000 observations and 300 units, this panel is three times larger than the 2x2 case. The &lt;code>first_treat&lt;/code> variable has a mean of 3.60, reflecting the mix of never-treated (0) and cohorts treated at periods 3, 5, and 7. The &lt;code>treated&lt;/code> mean of 0.34 tells us that 34% of all unit-period observations are in a post-treatment state &amp;mdash; less than half because late cohorts contribute fewer treated periods than early cohorts.&lt;/p>
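&lt;p>The 0.34 follows directly from the cohort sizes: each cohort contributes one post-treatment observation per unit for every period from its onset onward:&lt;/p>

```python
# Cohort sizes from the summary above: first_treat -> number of units
cohorts = {3: 60, 5: 75, 7: 75}
n_units, n_periods = 300, 10

treated_obs = sum(n * (n_periods - onset) for onset, n in cohorts.items())
print(treated_obs, treated_obs / (n_units * n_periods))  # 1020 0.34
```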
&lt;p>A crosstab of the number of &lt;strong>treated&lt;/strong> (post-treatment) units by cohort and period reveals the staggered rollout:&lt;/p>
&lt;pre>&lt;code class="language-python">pd.crosstab(data_stag[&amp;quot;first_treat&amp;quot;], data_stag[&amp;quot;period&amp;quot;],
values=data_stag[&amp;quot;treated&amp;quot;], aggfunc=&amp;quot;sum&amp;quot;).fillna(0).astype(int)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>period 0 1 2 3 4 5 6 7 8 9
first_treat
0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 60 60 60 60 60 60 60
5 0 0 0 0 0 75 75 75 75 75
7 0 0 0 0 0 0 0 75 75 75
&lt;/code>&lt;/pre>
&lt;p>The staggered structure is immediately visible: zeros give way to treatment counts in a staircase pattern as each cohort enters the post-treatment state. At period 2, no units are yet treated. At period 3, 60 units from cohort 3 enter treatment. At period 5, cohort 5 adds 75 more, bringing the total to 135. By period 7, all 210 treated units are in post-treatment. The never-treated group (row 0) remains at zero throughout. This growing treated population &amp;mdash; and the fact that cohort 3 has been treated for 4 periods by the time cohort 7 starts &amp;mdash; is the asymmetry that makes TWFE unreliable. When TWFE uses cohort 3 as a &amp;ldquo;control&amp;rdquo; for cohort 7, it compares against units whose outcomes already incorporate a treatment effect of 2.8, not the untreated counterfactual.&lt;/p>
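&lt;p>The per-period totals implied by this crosstab can be reproduced from the cohort sizes alone &amp;mdash; a one-liner counts how many units are in post-treatment state at each period:&lt;/p>

```python
cohorts = {3: 60, 5: 75, 7: 75}  # first_treat -> cohort size
counts = [sum(n for onset, n in cohorts.items() if t >= onset) for t in range(10)]
print(counts)  # [0, 0, 0, 60, 60, 135, 135, 210, 210, 210]
```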
&lt;p>The pivoted outcome means by cohort and period reveal the staggered treatment pattern:&lt;/p>
&lt;pre>&lt;code class="language-python">data_stag.groupby([&amp;quot;first_treat&amp;quot;, &amp;quot;period&amp;quot;])[&amp;quot;outcome&amp;quot;].mean().unstack()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>period 0 1 2 3 4 5 6 7 8 9
first_treat
0 9.92 9.95 10.17 10.28 10.40 10.46 10.53 10.68 10.78 10.88
3 10.39 10.51 10.59 12.82 13.07 13.33 13.60 13.99 14.22 14.56
5 10.08 10.17 10.33 10.32 10.58 12.70 12.90 13.11 13.64 13.77
7 9.61 9.76 9.73 10.04 10.00 10.10 10.35 12.25 12.59 12.91
&lt;/code>&lt;/pre>
&lt;p>All four cohorts track closely in their pre-treatment periods (values near 9.6&amp;ndash;10.6 in periods 0&amp;ndash;2), confirming parallel pre-trends. The divergence is sharp and cohort-specific: cohort 3 jumps at period 3 (from 10.59 to 12.82), cohort 5 jumps at period 5 (from 10.58 to 12.70), and cohort 7 jumps at period 7 (from 10.35 to 12.25). The never-treated group follows a smooth, gentle upward trend throughout. By period 9, all treated cohorts have outcomes around 12.9&amp;ndash;14.6, substantially above the never-treated group&amp;rsquo;s 10.88 &amp;mdash; but they arrived at those levels at different times.&lt;/p>
&lt;p>The line plot below visualizes these divergent trajectories:&lt;/p>
&lt;pre>&lt;code class="language-python">cohort_means = data_stag.groupby([&amp;quot;first_treat&amp;quot;, &amp;quot;period&amp;quot;])[&amp;quot;outcome&amp;quot;].mean().unstack(level=0)
cohort_colors = {0: STEEL_BLUE, 3: WARM_ORANGE, 5: TEAL, 7: WHITE_TEXT}
cohort_labels = {0: &amp;quot;Never-treated&amp;quot;, 3: &amp;quot;Cohort 3&amp;quot;, 5: &amp;quot;Cohort 5&amp;quot;, 7: &amp;quot;Cohort 7&amp;quot;}
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
for ft in sorted(cohort_means.columns):
ax.plot(cohort_means.index, cohort_means[ft], &amp;quot;o-&amp;quot;,
color=cohort_colors[ft], linewidth=2, markersize=6,
label=cohort_labels[ft])
# Vertical lines at treatment onsets
for ft in [3, 5, 7]:
ax.axvline(x=ft - 0.5, color=cohort_colors[ft], linestyle=&amp;quot;--&amp;quot;,
linewidth=1.2, alpha=0.5)
ax.set_xlabel(&amp;quot;Period&amp;quot;)
ax.set_ylabel(&amp;quot;Mean Outcome&amp;quot;)
ax.set_title(&amp;quot;Staggered Adoption: Cohort Mean Outcomes Over Time&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_xticks(range(10))
plt.savefig(&amp;quot;did_staggered_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_staggered_trends.png" alt="Line plot showing four cohorts tracking together before treatment, then diverging upward at their respective treatment onset periods. Dashed vertical lines mark each cohort&amp;rsquo;s treatment timing.">&lt;/p>
&lt;p>The plot makes the staggered adoption pattern unmistakable. All four lines run in parallel during the early pre-treatment periods, then each treated cohort jumps upward at its treatment onset (marked by a dashed vertical line in the corresponding color). Cohort 3 (warm orange) diverges first at period 3, followed by cohort 5 (teal) at period 5, and cohort 7 (white) at period 7. The never-treated group (steel blue) continues its steady, gentle upward trend without any jump. This visualization explains &lt;em>why TWFE fails&lt;/em>: between periods 3 and 7, TWFE uses cohort 3 (already treated and elevated) as a comparison for cohort 7 (not yet treated). Since cohort 3&amp;rsquo;s outcomes are inflated by treatment, the comparison underestimates cohort 7&amp;rsquo;s true effect when it eventually adopts.&lt;/p>
&lt;h3 id="bacon-decomposition-diagnosing-twfe">Bacon decomposition: Diagnosing TWFE&lt;/h3>
&lt;p>The &lt;strong>Goodman-Bacon decomposition&lt;/strong> (&lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">Goodman-Bacon, 2021&lt;/a>) reveals exactly how TWFE constructs its estimate. The key insight is that the TWFE coefficient $\hat{\delta}$ is a weighted average of all possible 2x2 DiD comparisons between pairs of treatment cohorts:&lt;/p>
&lt;p>$$\hat{\delta}^{TWFE} = \sum_{k} s_{kU} \hat{\delta}_{kU} + \sum_{e \neq U} \sum_{l &amp;gt; e} \big( s_{el} \hat{\delta}_{el} + s_{le} \hat{\delta}_{le} \big)$$&lt;/p>
&lt;p>The first sum covers &lt;strong>clean comparisons&lt;/strong> between each treated cohort $k$ and the never-treated group $U$, weighted by $s_{kU}$. The double sum covers comparisons between pairs of treated cohorts: $\hat{\delta}_{el}$ compares earlier-treated ($e$) against later-treated ($l$) units, and $\hat{\delta}_{le}$ compares later-treated against earlier-treated units. The weights $s$ are proportional to each subsample&amp;rsquo;s size and the variance of the treatment indicator within each pair &amp;mdash; groups treated in the middle of the panel receive the most weight. Crucially, the weights sum to one, so the TWFE estimate is a proper weighted average.&lt;/p>
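&lt;p>The &amp;ldquo;middle of the panel&amp;rdquo; intuition is easy to see in a rough sketch: in a comparison against the never-treated group, the variance of the treatment dummy is $\bar{D}(1-\bar{D})$, which peaks when a cohort spends half the panel treated. Using our three onsets over 10 periods (a back-of-the-envelope illustration, not the full weight formula):&lt;/p>

```python
# Treated share of the 10 periods for each cohort, and the dummy variance Dbar*(1-Dbar)
var_weight = {}
for onset in [3, 5, 7]:
    dbar = (10 - onset) / 10
    var_weight[onset] = dbar * (1 - dbar)
    print(f"onset {onset}: Dbar = {dbar:.1f}, variance = {var_weight[onset]:.2f}")
# Cohort 5, treated for exactly half the panel, has the largest variance term,
# so it receives the most weight in its comparisons.
```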
&lt;p>The three types of comparisons have very different reliability:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Treated vs never-treated&lt;/strong> ($\hat{\delta}_{kU}$): Clean comparisons using permanently untreated units as controls. These are the gold standard.&lt;/li>
&lt;li>&lt;strong>Earlier vs later treated&lt;/strong> ($\hat{\delta}_{el}$): Uses not-yet-treated units as controls. Valid as long as treatment has not yet affected the later cohort.&lt;/li>
&lt;li>&lt;strong>Later vs earlier treated&lt;/strong> ($\hat{\delta}_{le}$): The &lt;strong>forbidden comparisons&lt;/strong>. Uses already-treated units as controls. If treatment effects evolve over time, these comparisons are contaminated because the &amp;ldquo;controls&amp;rdquo; are themselves experiencing treatment effects.&lt;/li>
&lt;/ol>
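&lt;p>A tiny noiseless example shows how a forbidden comparison goes wrong when effects grow over time. Keeping only the treatment effect (no trends, no noise) with the same path as our simulated data &amp;mdash; 2.0 at onset, growing 0.2 per period &amp;mdash; we compare cohort 7 against already-treated cohort 3, using periods 5&amp;ndash;6 as the &amp;ldquo;pre&amp;rdquo; window and 7&amp;ndash;9 as the &amp;ldquo;post&amp;rdquo; window (the window choice is illustrative):&lt;/p>

```python
def effect(t, onset):
    # Treatment effect path from the simulated data: 2.0 at onset, +0.2 per period
    return 2.0 + 0.2 * (t - onset) if t >= onset else 0.0

def mean_effect(ts, onset):
    return sum(effect(t, onset) for t in ts) / len(ts)

pre, post = [5, 6], [7, 8, 9]
# 2x2 DiD of later-treated (onset 7) against an already-treated "control" (onset 3)
did = ((mean_effect(post, 7) - mean_effect(pre, 7))
       - (mean_effect(post, 3) - mean_effect(pre, 3)))
print(round(did, 4))  # 1.7
```

&lt;p>The late cohort&amp;rsquo;s true average effect over periods 7&amp;ndash;9 is 2.2, but the forbidden comparison yields only 1.7: the &amp;ldquo;control&amp;rdquo; cohort&amp;rsquo;s own effect grew by 0.5 over the window, and that growth is subtracted off as if it were a common trend.&lt;/p>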
&lt;pre>&lt;code class="language-python">bacon = BaconDecomposition()
bacon_results = bacon.fit(
data_stag, outcome=&amp;quot;outcome&amp;quot;, unit=&amp;quot;unit&amp;quot;,
time=&amp;quot;period&amp;quot;, first_treat=&amp;quot;first_treat&amp;quot;,
)
bacon_results.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=====================================================================================
Goodman-Bacon Decomposition of Two-Way Fixed Effects
=====================================================================================
Total observations: 3000
Treatment timing groups: 3
Never-treated units: 90
Total 2x2 comparisons: 9
-------------------------------------------------------------------------------------
TWFE Decomposition
-------------------------------------------------------------------------------------
TWFE Estimate: 2.1822
Weighted Sum of 2x2 Estimates: 2.1052
Decomposition Error: 0.076977
-------------------------------------------------------------------------------------
Weight Breakdown by Comparison Type
-------------------------------------------------------------------------------------
Comparison Type Weight Avg Effect Contribution
-------------------------------------------------------------------------------------
Treated vs Never-treated 0.4331 2.3745 1.0284
Earlier vs Later treated 0.2836 2.1999 0.6238
Later vs Earlier (forbidden) 0.2834 1.5989 0.4531
-------------------------------------------------------------------------------------
Total 1.0000 2.1052
-------------------------------------------------------------------------------------
WARNING: 28.3% of weight is on 'forbidden' comparisons where
already-treated units serve as controls. This can bias TWFE
when treatment effects are heterogeneous over time.
Consider using Callaway-Sant'Anna or other robust estimators.
=====================================================================================
&lt;/code>&lt;/pre>
&lt;p>The decomposition reveals that 28.3% of TWFE&amp;rsquo;s weight falls on forbidden comparisons &amp;mdash; cases where already-treated units serve as controls. These forbidden comparisons produce an average effect of only 1.60, substantially lower than the 2.37 from clean treated-vs-never-treated comparisons. This downward pull drags the TWFE estimate to 2.18, below the true treatment effect. The clean comparisons (treated vs never-treated) account for 43.3% of the weight and produce the most reliable estimates, while the earlier-vs-later comparisons (28.4% weight) sit in between. The decomposition error of 0.08 reflects higher-order interaction terms that the 2x2 decomposition does not fully capture.&lt;/p>
&lt;p>The following plot visualizes the decomposition:&lt;/p>
&lt;pre>&lt;code class="language-python">bacon_df = bacon_results.to_dataframe()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.patch.set_linewidth(0)
# Left panel: scatter by comparison type
type_colors = {
&amp;quot;Treated vs Never-treated&amp;quot;: STEEL_BLUE,
&amp;quot;Earlier vs Later treated&amp;quot;: WARM_ORANGE,
&amp;quot;Later vs Earlier (forbidden)&amp;quot;: &amp;quot;#e8856c&amp;quot;,
&amp;quot;treated_vs_never&amp;quot;: STEEL_BLUE,
&amp;quot;earlier_vs_later&amp;quot;: WARM_ORANGE,
&amp;quot;later_vs_earlier&amp;quot;: &amp;quot;#e8856c&amp;quot;,
}
for comp_type in bacon_df[&amp;quot;comparison_type&amp;quot;].unique():
subset = bacon_df[bacon_df[&amp;quot;comparison_type&amp;quot;] == comp_type]
color = type_colors.get(comp_type, LIGHT_TEXT)
axes[0].scatter(subset[&amp;quot;weight&amp;quot;], subset[&amp;quot;estimate&amp;quot;],
s=80, color=color, alpha=0.7, edgecolors=DARK_NAVY,
label=comp_type)
axes[0].axhline(y=bacon_results.twfe_estimate, color=WHITE_TEXT,
linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.7,
label=f&amp;quot;TWFE = {bacon_results.twfe_estimate:.2f}&amp;quot;)
axes[0].set_xlabel(&amp;quot;Weight&amp;quot;)
axes[0].set_ylabel(&amp;quot;2×2 DiD Estimate&amp;quot;)
axes[0].set_title(&amp;quot;Bacon Decomposition: Individual Comparisons&amp;quot;)
axes[0].legend(fontsize=9, loc=&amp;quot;lower right&amp;quot;)
# Right panel: bar chart of weights by type
type_summary = bacon_df.groupby(&amp;quot;comparison_type&amp;quot;).agg(
weight=(&amp;quot;weight&amp;quot;, &amp;quot;sum&amp;quot;),
avg_effect=(&amp;quot;estimate&amp;quot;, lambda x: np.average(
x, weights=bacon_df.loc[x.index, &amp;quot;weight&amp;quot;])),
).reset_index()
bar_colors = [type_colors.get(t, LIGHT_TEXT)
for t in type_summary[&amp;quot;comparison_type&amp;quot;]]
axes[1].barh(range(len(type_summary)), type_summary[&amp;quot;weight&amp;quot;],
color=bar_colors, edgecolor=DARK_NAVY, height=0.6)
axes[1].set_yticks(range(len(type_summary)))
axes[1].set_yticklabels(type_summary[&amp;quot;comparison_type&amp;quot;], fontsize=10)
axes[1].set_xlabel(&amp;quot;Total Weight&amp;quot;)
axes[1].set_title(&amp;quot;Weight Distribution by Comparison Type&amp;quot;)
for i, (w, e) in enumerate(zip(type_summary[&amp;quot;weight&amp;quot;],
type_summary[&amp;quot;avg_effect&amp;quot;])):
axes[1].text(w + 0.01, i, f&amp;quot;{w:.1%} (avg = {e:.2f})&amp;quot;,
va=&amp;quot;center&amp;quot;, fontsize=10)
plt.tight_layout()
plt.savefig(&amp;quot;did_bacon_decomposition.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_bacon_decomposition.png" alt="Two-panel Bacon decomposition plot. Left: scatter of individual 2x2 estimates colored by comparison type with TWFE reference line. Right: horizontal bars showing total weight by comparison type.">&lt;/p>
&lt;p>The left panel shows each individual 2x2 comparison as a point, colored by type. The forbidden comparisons (dark orange) cluster at lower effect estimates than the clean comparisons (steel blue), visually demonstrating how they pull TWFE downward. The right panel makes the weight problem stark: nearly a third of the total weight goes to comparisons where already-treated units masquerade as controls. For a policymaker relying on the TWFE estimate of 2.18, this contamination means the reported effect underestimates the true treatment impact.&lt;/p>
&lt;h2 id="callaway-santanna-the-modern-solution">Callaway-Sant&amp;rsquo;Anna: The modern solution&lt;/h2>
&lt;p>The &lt;strong>Callaway-Sant&amp;rsquo;Anna (CS) estimator&lt;/strong> (&lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">Callaway &amp;amp; Sant&amp;rsquo;Anna, 2021&lt;/a>) avoids forbidden comparisons entirely. Instead of a single pooled regression, CS starts from a fundamental building block &amp;mdash; the &lt;strong>group-time ATT&lt;/strong>:&lt;/p>
&lt;p>$$ATT(g, t) = E[Y_t(g) - Y_t(\infty) \mid G = g], \quad \text{for } t \geq g$$&lt;/p>
&lt;p>Here $g$ denotes the cohort (the period when a unit first becomes treated), $t$ is the current calendar period, $Y_t(g)$ is the potential outcome at time $t$ if first treated in period $g$, and $Y_t(\infty)$ is the potential outcome under perpetual non-treatment. The conditioning on $G = g$ restricts attention to units in cohort $g$. This yields a separate treatment effect estimate for each combination of cohort and calendar period, using only clean comparisons.&lt;/p>
&lt;p>With never-treated controls, the group-time ATT is identified as:&lt;/p>
&lt;p>$$ATT(g, t) = E[Y_t - Y_{g-1} \mid G = g] - E[Y_t - Y_{g-1} \mid G = \infty]$$&lt;/p>
&lt;p>In words: take the change in outcomes from the period just before treatment ($g - 1$) to the current period ($t$) for cohort $g$ units, and subtract the same change for never-treated units ($G = \infty$). This is a 2x2 DiD comparison that uses only the never-treated group as controls, eliminating all forbidden comparisons by construction.&lt;/p>
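&lt;p>To make the building block concrete, here is a minimal hand computation of $ATT(g, t)$ on an invented four-unit panel. All numbers, column names, and the helper function are illustrative only; the estimator introduced below performs this internally (with covariate adjustments on top):&lt;/p>

```python
import pandas as pd

# Invented panel: units 1-2 form cohort g = 3, units 3-4 are never treated
# (coded 0). Only the baseline period (g - 1 = 2) and the evaluation
# period (t = 4) are needed for ATT(g, t).
df = pd.DataFrame({
    "unit":        [1, 1, 2, 2, 3, 3, 4, 4],
    "period":      [2, 4, 2, 4, 2, 4, 2, 4],
    "first_treat": [3, 3, 3, 3, 0, 0, 0, 0],
    "outcome":     [1.0, 4.0, 2.0, 5.5, 1.5, 2.5, 2.5, 3.5],
})

def att_gt(data, g, t):
    """ATT(g, t) = E[Y_t - Y_{g-1} | G = g] - E[Y_t - Y_{g-1} | G = never]."""
    base = data[data["period"] == g - 1].set_index("unit")["outcome"]
    curr = data[data["period"] == t].set_index("unit")["outcome"]
    change = curr - base                      # first difference per unit
    cohort = data.drop_duplicates("unit").set_index("unit")["first_treat"]
    return change[cohort == g].mean() - change[cohort == 0].mean()

print(att_gt(df, g=3, t=4))  # (3.0 + 3.5)/2 - (1.0 + 1.0)/2 = 2.25
```

&lt;p>The treated units gained 3.25 on average while the never-treated units gained 1.0 over the same interval, so the clean 2x2 comparison attributes the difference of 2.25 to treatment.&lt;/p>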
&lt;h3 id="the-doubly-robust-estimator">The doubly robust estimator&lt;/h3>
&lt;p>In practice, Callaway and Sant&amp;rsquo;Anna implement a &lt;strong>doubly robust&lt;/strong> version of this estimator. Before diving into the formal equation, here is the core idea: the doubly robust estimator adjusts the comparison between treated and control units in &lt;em>two&lt;/em> ways simultaneously &amp;mdash; by reweighting the control group to look more similar to the treated group (inverse-probability weighting), and by directly modeling and subtracting the expected outcome change for controls (outcome regression). Think of it as wearing both a belt &lt;em>and&lt;/em> suspenders: if either adjustment is correctly specified, the estimate is valid, even if the other one is wrong. This double protection makes the estimator more reliable than methods that rely on a single modeling assumption.&lt;/p>
&lt;p>The formal equation combines inverse-probability weighting with an outcome regression adjustment:&lt;/p>
&lt;p>$$ATT(g, t) = \mathbb{E}\left[\left(\frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)}{1-p_g(X)}\right]}\right)\left(Y_t - Y_{g-1} - m_{g,t}^{nev}(X)\right)\right]$$&lt;/p>
&lt;p>This equation multiplies two terms inside the expectation &amp;mdash; a &lt;strong>weighting term&lt;/strong> (first parentheses) and an &lt;strong>outcome term&lt;/strong> (second parentheses). Let us unpack each one.&lt;/p>
&lt;p>&lt;strong>The weighting term:&lt;/strong> $\frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)}{1-p_g(X)}\right]}$&lt;/p>
&lt;p>This term determines &lt;em>how much each observation contributes&lt;/em> to the ATT estimate. It works differently for treated and control units:&lt;/p>
&lt;ul>
&lt;li>$G_g$ is a &lt;strong>group indicator&lt;/strong> that equals 1 if the unit belongs to cohort $g$ and 0 otherwise. Dividing by $\mathbb{E}[G_g]$ (the share of units in cohort $g$) normalizes so that treated units receive equal weight on average. For a treated unit in cohort $g$, the first fraction contributes a positive value; for never-treated units, $G_g = 0$ so the first fraction is zero.&lt;/li>
&lt;li>$p_g(X)$ is the &lt;strong>generalized propensity score&lt;/strong> &amp;mdash; the probability of being in cohort $g$ (rather than the never-treated group) given covariates $X$. This is estimated via logit regression of cohort membership on covariates. The ratio $\frac{p_g(X)}{1-p_g(X)}$ gives the odds of being in cohort $g$, and dividing by its expectation normalizes the weights. For never-treated units, this second fraction creates a &lt;strong>negative weight&lt;/strong> that is larger for control units whose covariates resemble the treated cohort &amp;mdash; effectively selecting the most comparable controls. For treated units, the two fractions partially cancel, leaving a net positive weight.&lt;/li>
&lt;/ul>
&lt;p>The intuition is similar to propensity score matching: if a never-treated city has covariates (population, per-student spending, teacher-student ratio) that look very much like a treated city, it receives a larger (more negative) weight, making it contribute more as a counterfactual. Cities with covariates far from the treated group receive near-zero weight. This &lt;strong>rebalances&lt;/strong> the control group so that the covariate distribution of the weighted controls matches that of the treated cohort.&lt;/p>
&lt;p>&lt;strong>The outcome term:&lt;/strong> $Y_t - Y_{g-1} - m_{g,t}^{nev}(X)$&lt;/p>
&lt;p>This term measures the &lt;strong>adjusted outcome change&lt;/strong> for each unit:&lt;/p>
&lt;ul>
&lt;li>$Y_t - Y_{g-1}$ is the raw change in outcomes from the baseline period ($g - 1$, the period just before cohort $g$ starts treatment) to the current period $t$. This is the same first difference used in any DiD estimator.&lt;/li>
&lt;li>$m_{g,t}^{nev}(X)$ is the &lt;strong>outcome regression adjustment&lt;/strong> &amp;mdash; the expected change $E[Y_t - Y_{g-1} \mid X, G = \infty]$ for never-treated units with covariates $X$. In practice, this is estimated by regressing the outcome change $\Delta Y = Y_t - Y_{g-1}$ on covariates $X$ using only the never-treated group. Subtracting $m_{g,t}^{nev}(X)$ removes the portion of the outcome change that would have occurred &lt;em>anyway&lt;/em> based on observable characteristics &amp;mdash; even without treatment. What remains is the treatment-induced change that cannot be explained by covariates alone.&lt;/li>
&lt;/ul>
&lt;p>Think of it this way: if cities with higher per-student spending tend to improve learning scores faster regardless of AI adoption, $m_{g,t}^{nev}(X)$ captures that covariate-driven growth trajectory. Subtracting it ensures that the estimated treatment effect is not confounded by differential growth rates across different types of cities.&lt;/p>
&lt;p>&lt;strong>Why &amp;ldquo;doubly robust&amp;rdquo;?&lt;/strong> The estimator combines &lt;em>both&lt;/em> adjustment strategies &amp;mdash; inverse-probability weighting (through the weighting term) and outcome regression (through $m_{g,t}^{nev}(X)$). The key advantage is that the ATT estimate is consistent if &lt;em>either&lt;/em> the propensity score model or the outcome regression model is correctly specified &amp;mdash; both do not need to be right simultaneously. If the propensity score model is wrong but the outcome regression is correct, the $m_{g,t}^{nev}(X)$ adjustment still removes confounding. If the outcome regression is wrong but the propensity score is correct, the reweighting still produces a valid comparison group. This double layer of protection makes the estimator more reliable in practice than methods relying on a single modeling assumption.&lt;/p>
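&lt;p>The mechanics can be sketched in a few lines of NumPy on simulated data (everything below is invented for illustration). With a single binary covariate, the propensity score and the outcome regression reduce to conditional frequencies and cell means, so no fitting library is needed; as in standard implementations, the inverse-odds part of the weight is applied to control units only:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one binary covariate x drives both selection into
# cohort g and the trend in the untreated outcome change.
n = 2000
x = rng.integers(0, 2, n)                            # binary covariate
p_true = np.where(x == 1, 0.6, 0.3)                  # P(cohort g | x)
G = (rng.random(n) < p_true).astype(float)           # 1 = cohort g, 0 = never treated
tau = 2.0                                            # true ATT
dy = 1.0 + 0.5 * x + tau * G + rng.normal(0, 1, n)   # outcome change Y_t - Y_{g-1}

# Propensity score p_g(x): with one binary covariate this is just the
# conditional frequency of cohort membership within each x cell.
p_hat = np.where(x == 1, G[x == 1].mean(), G[x == 0].mean())

# Outcome regression m(x) = E[dy | x, G = never]: cell means among controls.
m_hat = np.where(x == 1,
                 dy[(x == 1) & (G == 0)].mean(),
                 dy[(x == 0) & (G == 0)].mean())

# Doubly robust ATT: weighting term times adjusted outcome term. The
# inverse-odds weight applies to controls only (the (1 - G) factor).
w_treated = G / G.mean()
odds = (1 - G) * p_hat / (1 - p_hat)
w_control = odds / odds.mean()
att_dr = np.mean((w_treated - w_control) * (dy - m_hat))
print(f"Doubly robust ATT: {att_dr:.2f} (true effect = {tau})")
```

&lt;p>In this toy design the weighted control term centers at zero, and what survives is the covariate-adjusted outcome change among treated units, which recovers the true effect of 2.0 up to sampling noise.&lt;/p>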
&lt;p>&lt;strong>Note on the no-covariate case:&lt;/strong> In this tutorial, we do not pass covariates to &lt;code>CallawaySantAnna()&lt;/code>. Without covariates, the propensity score $p_g(X)$ reduces to the unconditional probability of being in cohort $g$ (simply the group share), and $m_{g,t}^{nev}(X)$ reduces to the simple mean outcome change among never-treated units. The doubly robust estimator then collapses to the basic difference-in-means formula shown earlier. The full equation is presented here because it is the general form that practitioners encounter when working with real data and covariates.&lt;/p>
&lt;p>The group-time ATTs are then &lt;strong>aggregated&lt;/strong> into summary parameters. Any summary is a weighted average of the building blocks:&lt;/p>
&lt;p>$$\theta = \sum_{g} \sum_{t \geq g} w_{g,t} \cdot ATT(g, t), \quad \sum_{g,t} w_{g,t} = 1$$&lt;/p>
&lt;p>Two aggregations are especially useful. The &lt;strong>overall ATT&lt;/strong> weights by cohort size:&lt;/p>
&lt;p>$$\theta^{O} = \sum_{g} \theta(g) \cdot P(G = g), \quad \text{where } \theta(g) = \frac{1}{T - g + 1} \sum_{t=g}^{T} ATT(g, t)$$&lt;/p>
&lt;p>The &lt;strong>event study aggregation&lt;/strong> averages across cohorts at each relative time $e$ (periods since treatment onset):&lt;/p>
&lt;p>$$\theta_D(e) = \sum_{g} ATT(g, g + e) \cdot P(G = g \mid g + e \leq T)$$&lt;/p>
&lt;p>This event study aggregation is the CS analogue of the leads-and-lags event study, but free from the forbidden comparison contamination that plagues TWFE-based event studies.&lt;/p>
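&lt;p>The aggregation step is plain arithmetic once the $ATT(g, t)$ building blocks are in hand. The sketch below applies both formulas to an invented two-cohort example (all numbers made up):&lt;/p>

```python
import numpy as np

# Invented group-time ATTs for two cohorts (g = 2 and g = 3) over T = 4 periods
T = 4
att = {(2, 2): 1.0, (2, 3): 1.5, (2, 4): 2.0,
       (3, 3): 1.2, (3, 4): 1.6}
n_g = {2: 60, 3: 40}                                  # cohort sizes
P_g = {g: n / sum(n_g.values()) for g, n in n_g.items()}

# Overall ATT: average each cohort's post-treatment ATTs, then weight by share
theta_g = {g: np.mean([att[(g, t)] for t in range(g, T + 1)]) for g in n_g}
overall = sum(theta_g[g] * P_g[g] for g in n_g)

# Event study: at relative time e, average ATT(g, g + e) across the cohorts
# still observed at that horizon, renormalizing their shares to sum to one
def event_study(e):
    gs = [g for g in n_g if g + e <= T]
    total = sum(n_g[g] for g in gs)
    return sum(att[(g, g + e)] * n_g[g] for g in gs) / total

print(f"Overall ATT: {overall:.2f}")                  # 1.5*0.6 + 1.4*0.4 = 1.46
print([round(event_study(e), 2) for e in (0, 1, 2)])  # [1.08, 1.54, 2.0]
```

&lt;p>Note how the event study at $e = 2$ uses only cohort $g = 2$, because cohort $g = 3$ is never observed two periods after treatment; this compositional shift across horizons is why CS reports the conditioning weights $P(G = g \mid g + e \leq T)$ explicitly.&lt;/p>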
&lt;p>The &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">&lt;code>CallawaySantAnna()&lt;/code>&lt;/a> class takes &lt;code>control_group&lt;/code> to specify which units serve as controls. Using &lt;code>&amp;quot;never_treated&amp;quot;&lt;/code> restricts comparisons to units that never received treatment, the cleanest possible counterfactual. The &lt;code>base_period=&amp;quot;universal&amp;quot;&lt;/code> option uses a single reference period ($g - 1$) for all relative time comparisons within each cohort, rather than letting each relative period use its own baseline. This ensures that the pre-treatment coefficients are proper placebo tests: each one measures the outcome change from $g - 1$ to an earlier period, so a coefficient near zero means the treated and control groups were evolving similarly over that specific interval. With a universal base period, the period immediately before treatment ($e = -1$) is normalized to zero by construction.&lt;/p>
&lt;pre>&lt;code class="language-python">cs = CallawaySantAnna(control_group=&amp;quot;never_treated&amp;quot;, base_period=&amp;quot;universal&amp;quot;)
results_cs = cs.fit(
data_stag, outcome=&amp;quot;outcome&amp;quot;, unit=&amp;quot;unit&amp;quot;,
time=&amp;quot;period&amp;quot;, first_treat=&amp;quot;first_treat&amp;quot;,
aggregate=&amp;quot;event_study&amp;quot;,
)
results_cs.print_summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=====================================================================================
Callaway-Sant'Anna Staggered Difference-in-Differences Results
=====================================================================================
Total observations:   3000
Treated units:        210
Never-treated units:  90
Treatment cohorts:    3
Time periods:         10
Control group:        never_treated
Base period:          universal
-------------------------------------------------------------------------------------
Overall Average Treatment Effect on the Treated
-------------------------------------------------------------------------------------
  Parameter     Estimate    Std. Err.       t-stat        P&amp;gt;|t|   Sig.
-------------------------------------------------------------------------------------
        ATT       2.4136       0.0552       43.753       0.0000    ***
-------------------------------------------------------------------------------------
95% Confidence Interval: [2.3055, 2.5217]
-------------------------------------------------------------------------------------
Event Study (Dynamic) Effects
-------------------------------------------------------------------------------------
Rel. Period     Estimate    Std. Err.       t-stat        P&amp;gt;|t|   Sig.
-------------------------------------------------------------------------------------
         -7      -0.1344       0.1171       -1.148       0.2510
         -6      -0.0188       0.1126       -0.167       0.8671
         -5      -0.1435       0.0813       -1.766       0.0774    .
         -4      -0.0091       0.0744       -0.122       0.9028
         -3      -0.0697       0.0560       -1.244       0.2134
         -2      -0.0709       0.0631       -1.124       0.2610
         -1       0.0000          nan          nan          nan
          0       1.9713       0.0645       30.551       0.0000    ***
          1       2.1416       0.0577       37.124       0.0000    ***
          2       2.2969       0.0644       35.644       0.0000    ***
          3       2.6763       0.0796       33.642       0.0000    ***
          4       2.7925       0.0800       34.898       0.0000    ***
          5       3.0259       0.1227       24.669       0.0000    ***
          6       3.2663       0.1090       29.961       0.0000    ***
-------------------------------------------------------------------------------------
Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
=====================================================================================
&lt;/code>&lt;/pre>
&lt;p>The overall CS estimate of the ATT is 2.41 (SE = 0.06, p &amp;lt; 0.001), with a 95% CI of [2.31, 2.52]. This is higher than the TWFE estimate of 2.18, confirming that TWFE was biased downward by the forbidden comparisons. The event study reveals dynamic effects that grow over time: the effect starts at 1.97 in the first period after treatment and increases to 3.27 by six periods post-treatment. This pattern of growing effects is exactly the scenario where TWFE fails most dramatically &amp;mdash; the forbidden comparisons use units with large accumulated effects as controls for newly-treated units, producing a downward-biased average.&lt;/p>
&lt;p>With the universal base period, relative period -1 is the reference and is normalized to zero by construction. The remaining pre-treatment estimates all hover near zero &amp;mdash; the largest in magnitude is -0.14 at relative period -5 (p = 0.08), which does not reach significance at the 5% level. None of the six estimated pre-treatment coefficients is individually significant, providing clean support for the parallel trends assumption. This contrasts with the varying base period specification, where each pre-treatment coefficient uses a different baseline, making the placebo tests harder to interpret collectively.&lt;/p>
&lt;p>The event study plot visualizes these dynamics, showing how the treatment effect builds over time relative to treatment onset:&lt;/p>
&lt;pre>&lt;code class="language-python">cs_df = results_cs.to_dataframe(&amp;quot;event_study&amp;quot;)
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
pre_cs = cs_df[cs_df[&amp;quot;relative_period&amp;quot;] &amp;lt; 0]
post_cs = cs_df[cs_df[&amp;quot;relative_period&amp;quot;] &amp;gt;= 0]
ax.errorbar(pre_cs[&amp;quot;relative_period&amp;quot;], pre_cs[&amp;quot;effect&amp;quot;],
yerr=1.96 * pre_cs[&amp;quot;se&amp;quot;], fmt=&amp;quot;o&amp;quot;, color=STEEL_BLUE,
capsize=4, linewidth=2, markersize=8, label=&amp;quot;Pre-treatment&amp;quot;)
ax.errorbar(post_cs[&amp;quot;relative_period&amp;quot;], post_cs[&amp;quot;effect&amp;quot;],
yerr=1.96 * post_cs[&amp;quot;se&amp;quot;], fmt=&amp;quot;s&amp;quot;, color=TEAL,
capsize=4, linewidth=2, markersize=8, label=&amp;quot;Post-treatment&amp;quot;)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1, alpha=0.5)
ax.axvline(x=-0.5, color=LIGHT_TEXT, linestyle=&amp;quot;--&amp;quot;, linewidth=1.5, alpha=0.5)
ax.set_xlabel(&amp;quot;Periods Relative to Treatment&amp;quot;)
ax.set_ylabel(&amp;quot;Estimated ATT&amp;quot;)
ax.set_title(&amp;quot;Callaway-Sant'Anna: Event Study for Staggered Adoption&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
plt.savefig(&amp;quot;did_staggered_att.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_staggered_att.png" alt="Callaway-Sant&amp;rsquo;Anna event study plot showing pre-treatment effects near zero (with period -1 normalized to zero) and post-treatment effects growing steadily from about 2.0 to 3.3.">&lt;/p>
&lt;p>The CS event study plot shows the hallmark pattern of a valid DiD analysis: pre-treatment coefficients (steel blue) cluster tightly around zero &amp;mdash; with relative period -1 pinned at exactly zero as the universal base period &amp;mdash; then post-treatment coefficients (teal) rise sharply and progressively. The upward slope in the post-treatment period reveals that the treatment effect accumulates over time, growing from roughly 2.0 immediately after treatment to 3.3 six periods later. This dynamic pattern would have been obscured by TWFE&amp;rsquo;s single pooled estimate and further distorted by its forbidden comparisons.&lt;/p>
&lt;h2 id="choosing-the-right-estimator">Choosing the right estimator&lt;/h2>
&lt;p>With multiple DiD estimators available, the choice depends on the data structure. The following decision flowchart guides the selection:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;&amp;lt;b&amp;gt;Panel data with&amp;lt;br/&amp;gt;treatment &amp;amp; control&amp;lt;/b&amp;gt;&amp;quot;] --&amp;gt; B{&amp;quot;Single treatment&amp;lt;br/&amp;gt;period?&amp;quot;}
B --&amp;gt;|Yes| C[&amp;quot;&amp;lt;b&amp;gt;Classic 2×2 DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;DifferenceInDifferences()&amp;quot;]
B --&amp;gt;|No| D{&amp;quot;Staggered&amp;lt;br/&amp;gt;adoption?&amp;quot;}
D --&amp;gt;|&amp;quot;No&amp;lt;br/&amp;gt;(same timing)&amp;quot;| E[&amp;quot;&amp;lt;b&amp;gt;Multi-Period DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;MultiPeriodDiD()&amp;quot;]
D --&amp;gt;|Yes| F{&amp;quot;Never-treated&amp;lt;br/&amp;gt;group available?&amp;quot;}
F --&amp;gt;|Yes| G[&amp;quot;&amp;lt;b&amp;gt;Callaway-Sant'Anna&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CallawaySantAnna()&amp;quot;]
F --&amp;gt;|No| H[&amp;quot;&amp;lt;b&amp;gt;Sun-Abraham / Stacked DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;SunAbraham() / StackedDiD()&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;(not covered here)&amp;lt;/i&amp;gt;&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#6a9bcc,stroke:#141413,color:#fff
style G fill:#00d4c8,stroke:#141413,color:#fff
style H fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The following table summarizes when to use each estimator:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Scenario&lt;/th>
&lt;th>Estimator&lt;/th>
&lt;th>Advantage&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single treatment time, 2 groups&lt;/td>
&lt;td>&lt;code>DifferenceInDifferences()&lt;/code>&lt;/td>
&lt;td>Simplest, most transparent&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Single treatment time, many periods&lt;/td>
&lt;td>&lt;code>MultiPeriodDiD()&lt;/code>&lt;/td>
&lt;td>Period-by-period effects, pre-trend test&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Staggered, never-treated available&lt;/td>
&lt;td>&lt;code>CallawaySantAnna()&lt;/code>&lt;/td>
&lt;td>Clean comparisons, flexible aggregation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Staggered, no never-treated group&lt;/td>
&lt;td>&lt;code>SunAbraham()&lt;/code>&lt;/td>
&lt;td>Interaction-weighted, uses not-yet-treated&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Diagnosing TWFE bias&lt;/td>
&lt;td>&lt;code>BaconDecomposition()&lt;/code>&lt;/td>
&lt;td>Reveals forbidden comparison weights&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The decision logic is straightforward: if all treated units start at the same time, use the classic estimator or the multi-period event study. If treatment timing varies, use Callaway-Sant&amp;rsquo;Anna (or Sun-Abraham if no never-treated group exists). Always run Bacon decomposition on TWFE results to check for contamination from forbidden comparisons. The &lt;code>diff-diff&lt;/code> package also offers &lt;code>SyntheticDiD()&lt;/code>, &lt;code>ImputationDiD()&lt;/code>, and &lt;code>ContinuousDiD()&lt;/code> for specialized settings, but the estimators above cover the vast majority of applied research.&lt;/p>
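&lt;p>For readers who prefer the logic spelled out in code, the flowchart condenses into a small helper. The function and its boolean flags are hypothetical; the returned strings are the &lt;code>diff-diff&lt;/code> class names from the table above:&lt;/p>

```python
def choose_estimator(single_period: bool, staggered: bool,
                     has_never_treated: bool) -> str:
    """Return the estimator class name suggested by the decision flowchart."""
    if single_period:
        return "DifferenceInDifferences"   # classic 2x2
    if not staggered:
        return "MultiPeriodDiD"            # many periods, common treatment timing
    if has_never_treated:
        return "CallawaySantAnna"
    return "SunAbraham"                    # or StackedDiD

# The staggered-adoption dataset in this tutorial has a never-treated group:
print(choose_estimator(single_period=False, staggered=True,
                       has_never_treated=True))  # CallawaySantAnna
```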
&lt;h2 id="sensitivity-analysis-honestdid">Sensitivity analysis: HonestDiD&lt;/h2>
&lt;p>Every DiD analysis rests on parallel trends &amp;mdash; but this assumption is fundamentally &lt;strong>untestable&lt;/strong> for the post-treatment period. Pre-treatment trend tests (Section 6) check whether trends were parallel &lt;em>before&lt;/em> treatment, but they cannot guarantee that trends would have remained parallel &lt;em>after&lt;/em> treatment in the absence of the intervention. A new regulation might coincide with an economic downturn that affects treated regions differently, violating parallel trends even though pre-trends looked clean.&lt;/p>
&lt;p>&lt;strong>HonestDiD&lt;/strong> (&lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">Rambachan &amp;amp; Roth, 2023&lt;/a>) addresses this problem directly. Instead of assuming parallel trends hold exactly, it bounds the degree of violation using a &lt;strong>relative magnitudes restriction&lt;/strong>. Let $\delta_t = E[Y^0_t - Y^0_{t-1} \mid G = g] - E[Y^0_t - Y^0_{t-1} \mid G = \infty]$ denote the parallel trends violation at period $t$ &amp;mdash; the difference in untreated outcome trends between the treated cohort and the never-treated group. HonestDiD constrains the post-treatment violations relative to the largest pre-treatment violation:&lt;/p>
&lt;p>$$|\delta_t| \leq M \cdot \max_{t' &amp;lt; g} |\delta_{t'}|, \quad \text{for all } t \geq g$$&lt;/p>
&lt;p>The parameter $M$ controls the degree of allowed departure. At $M = 0$, the method assumes perfect parallel trends ($\delta_t = 0$ for all post-treatment periods) and recovers the standard CI. As $M$ increases, it allows for progressively larger post-treatment violations, widening the robust CI. The &lt;strong>breakdown value&lt;/strong> of $M$ is where the CI first includes zero &amp;mdash; the point at which the treatment conclusion becomes fragile.&lt;/p>
&lt;p>Think of $M$ as a stress test dial. Turning it up to $M = 1$ says: &amp;ldquo;The worst post-treatment violation could be as large as the worst thing we saw pre-treatment.&amp;rdquo; Turning it to $M = 5$ says: &amp;ldquo;The violation could be five times worse.&amp;rdquo; If the effect remains significant even at high $M$, the finding is genuinely robust.&lt;/p>
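&lt;p>To translate the restriction into outcome units, take the six estimated pre-treatment deviations from the CS event study output above. The worst absolute deviation is about 0.14, so the bound on any post-treatment violation scales linearly with $M$:&lt;/p>

```python
import numpy as np

# Pre-treatment deviations from the CS event study output above
pre_violations = np.array([-0.1344, -0.0188, -0.1435, -0.0091, -0.0697, -0.0709])
max_pre = np.abs(pre_violations).max()   # worst pre-treatment deviation: 0.1435

def violation_bound(M, max_pre_violation):
    """Largest post-treatment violation |delta_t| allowed at sensitivity level M."""
    return M * max_pre_violation

for M in (0.0, 1.0, 3.0, 15.0):
    print(f"M = {M:4.1f}: |delta_t| <= {violation_bound(M, max_pre):.3f}")
```

&lt;p>At $M = 15$ the restriction tolerates violations of roughly 2.15, nearly the size of the estimated treatment effect itself, which is why surviving that level counts as an extreme stress test.&lt;/p>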
&lt;pre>&lt;code class="language-python">M_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0, 10.0, 12.0, 15.0]
sensitivity = []
for M in M_values:
honest = HonestDiD(method=&amp;quot;relative_magnitude&amp;quot;, M=M)
hres = honest.fit(results_cs)
sensitivity.append({
&amp;quot;M&amp;quot;: M,
&amp;quot;ci_lb&amp;quot;: hres.ci_lb,
&amp;quot;ci_ub&amp;quot;: hres.ci_ub,
&amp;quot;significant&amp;quot;: hres.ci_lb &amp;gt; 0,
})
print(f&amp;quot;M = {M:.1f}: CI = [{hres.ci_lb:.4f}, {hres.ci_ub:.4f}]&amp;quot;
f&amp;quot; {'significant' if hres.ci_lb &amp;gt; 0 else 'includes zero'}&amp;quot;)
sens_df = pd.DataFrame(sensitivity)
# Find breakdown point
breakdown_M = (sens_df[~sens_df[&amp;quot;significant&amp;quot;]][&amp;quot;M&amp;quot;].min()
if not sens_df[&amp;quot;significant&amp;quot;].all()
else sens_df[&amp;quot;M&amp;quot;].max())
print(f&amp;quot;\nBreakdown value of M: {breakdown_M:.1f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>M = 0.0: CI = [2.5324, 2.6592] significant
M = 0.5: CI = [2.4606, 2.7310] significant
M = 1.0: CI = [2.3889, 2.8028] significant
M = 1.5: CI = [2.3171, 2.8745] significant
M = 2.0: CI = [2.2453, 2.9463] significant
M = 3.0: CI = [2.1018, 3.0898] significant
M = 4.0: CI = [1.9583, 3.2334] significant
M = 5.0: CI = [1.8148, 3.3769] significant
M = 7.0: CI = [1.5277, 3.6639] significant
M = 10.0: CI = [1.0971, 4.0945] significant
M = 12.0: CI = [0.8101, 4.3816] significant
M = 15.0: CI = [0.3795, 4.8122] significant
Breakdown value of M: 15.0
&lt;/code>&lt;/pre>
&lt;p>At $M = 0$ (perfect parallel trends), the CI is narrow: [2.53, 2.66]. As $M$ increases, the CI widens symmetrically. At $M = 10$, the lower bound remains comfortably positive (1.10), and even at $M = 15$, it barely stays above zero (0.38). The breakdown value exceeds $M = 15$ &amp;mdash; the treatment effect remains statistically significant even if post-treatment violations of parallel trends are more than 15 times larger than the worst pre-treatment deviation. This is exceptionally robust &amp;mdash; in practice, a breakdown value above $M = 3$ is considered strong evidence that the finding is not driven by parallel trends violations. The improvement over the varying base period specification (which had a breakdown of $M = 12$) reflects the universal base period&amp;rsquo;s tighter pre-treatment estimates, which give HonestDiD a smaller &amp;ldquo;worst pre-treatment deviation&amp;rdquo; to scale against.&lt;/p>
&lt;p>The sensitivity plot maps the robust CI as a function of $M$, making the breakdown point visually apparent:&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_linewidth(0)
ax.fill_between(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_lb&amp;quot;], sens_df[&amp;quot;ci_ub&amp;quot;],
alpha=0.25, color=STEEL_BLUE, label=&amp;quot;95% Robust CI&amp;quot;)
ax.plot(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_lb&amp;quot;], &amp;quot;-&amp;quot;, color=STEEL_BLUE, linewidth=2)
ax.plot(sens_df[&amp;quot;M&amp;quot;], sens_df[&amp;quot;ci_ub&amp;quot;], &amp;quot;-&amp;quot;, color=STEEL_BLUE, linewidth=2)
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=1.5, alpha=0.7)
att_val = results_cs.overall_att
ax.axhline(y=att_val, color=TEAL, linestyle=&amp;quot;:&amp;quot;, linewidth=1.5,
alpha=0.7, label=f&amp;quot;Overall ATT = {att_val:.2f}&amp;quot;)
ax.axvline(x=breakdown_M, color=WARM_ORANGE, linestyle=&amp;quot;--&amp;quot;,
linewidth=2, alpha=0.8,
label=f&amp;quot;Breakdown (M = {breakdown_M:.1f})&amp;quot;)
ax.set_xlabel(&amp;quot;Sensitivity Parameter M\n&amp;quot;
&amp;quot;(maximum post-treatment violation relative to &amp;quot;
&amp;quot;largest pre-treatment violation)&amp;quot;)
ax.set_ylabel(&amp;quot;Treatment Effect (ATT)&amp;quot;)
ax.set_title(&amp;quot;HonestDiD Sensitivity Analysis: Robustness of the ATT&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
plt.savefig(&amp;quot;did_honest_sensitivity.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="did_honest_sensitivity.png" alt="HonestDiD sensitivity plot showing the 95% robust CI widening as M increases. The CI band is steel blue, the ATT is a teal dotted line, and the breakdown point at M=15 is marked with an orange dashed line.">&lt;/p>
&lt;p>The sensitivity plot tells the robustness story at a glance. The steel blue band shows the 95% robust CI expanding as $M$ grows &amp;mdash; allowing for larger violations of parallel trends. The teal dotted line marks the overall ATT of 2.41, which lies within the robust CI for every $M \geq 1$. The warm orange dashed line at $M = 15$ marks the boundary of our grid, with the lower CI bound still positive (0.38) at that point &amp;mdash; the true breakdown lies even further out. In practical terms, the treatment conclusion would only be overturned if post-treatment parallel trend violations were more than 15 times worse than anything observed in the pre-treatment data &amp;mdash; an extreme scenario that would require a dramatic structural break coinciding precisely with the treatment timing.&lt;/p>
&lt;p>Best practice is to always report the breakdown value alongside the point estimate. A finding with a breakdown at $M = 0.5$ is fragile &amp;mdash; even mild violations destroy the conclusion. A finding with a breakdown at $M = 15$ or above, as in this example, provides strong evidence that the effect is genuine regardless of moderate parallel trends violations.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>Returning to the motivating question &amp;mdash; did AI tutoring actually improve learning? &amp;mdash; the evidence from both the classic and modern DiD estimators is clear: treatment produced a genuine, statistically significant positive effect. In the 2x2 setting, the estimated ATT of 5.12 (95% CI: [4.64, 5.60]) closely matches the true effect of 5.0, confirming that the classic estimator works well when all units start treatment simultaneously. The event study further validates this finding by showing near-zero pre-treatment coefficients (the largest is -0.52 with p = 0.31) and stable post-treatment effects around 4.7&amp;ndash;5.0.&lt;/p>
&lt;p>The staggered adoption setting reveals a more nuanced picture. Naive TWFE estimation produces a biased estimate of 2.18, pulled downward by the 28.3% weight on forbidden comparisons where already-treated units serve as controls. The Callaway-Sant&amp;rsquo;Anna estimator corrects this bias, finding an overall ATT of 2.41 &amp;mdash; and the event study shows that the effect is not constant but grows over time, from 1.97 immediately after treatment to 3.27 six periods later. For an education policymaker, this dynamic pattern means the AI initiative&amp;rsquo;s full benefits take time to materialize: evaluating the program too early would underestimate its long-run impact.&lt;/p>
&lt;p>The HonestDiD sensitivity analysis provides the final piece of evidence. With a breakdown value exceeding $M = 15$, the treatment conclusion is robust to post-treatment parallel trends violations more than 15 times larger than anything observed pre-treatment. This level of robustness far exceeds the $M = 3$ threshold typically considered strong in applied research. Even a skeptic who doubts the parallel trends assumption would find it difficult to argue that the treatment had no effect.&lt;/p>
&lt;p>Two important caveats apply. First, these results use synthetic data with known true effects, so the estimators are guaranteed to work under their assumptions. Real-world applications face additional challenges &amp;mdash; measurement error in learning assessments, spillover effects between treated and control cities (e.g., students in control cities accessing AI tools on their own), and the possibility that AI adoption depends on unobserved factors correlated with learning outcomes. Second, the treatment effects in the staggered dataset grow linearly over time by construction. In practice, effects may follow more complex trajectories &amp;mdash; plateauing, fading out, or accelerating &amp;mdash; which would require careful specification of the event study window and aggregation weights.&lt;/p>
&lt;h2 id="summary-and-key-takeaways">Summary and key takeaways&lt;/h2>
&lt;p>This tutorial walked through the DiD toolkit from its simplest form to its most robust modern extensions. Four key takeaways emerge:&lt;/p>
&lt;p>&lt;strong>Method insight:&lt;/strong> DiD targets the &lt;strong>ATT&lt;/strong> by using untreated units as a counterfactual for how treated units would have evolved without intervention. The classic 2x2 estimator (ATT = 5.12, SE = 0.25) works well when all units start treatment simultaneously, but staggered adoption requires modern estimators like Callaway-Sant&amp;rsquo;Anna to avoid TWFE&amp;rsquo;s forbidden comparison bias.&lt;/p>
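&lt;p>The 2x2 computation described above reduces to one line of arithmetic. A minimal sketch with hypothetical group means (not the tutorial data):&lt;/p>

```python
# Minimal 2x2 difference-in-differences on hypothetical group means:
# ATT = (treated post - treated pre) - (control post - control pre)
treated_pre, treated_post = 10.0, 17.0
control_pre, control_post = 8.0, 10.0

att = (treated_post - treated_pre) - (control_post - control_pre)
print(att)  # 5.0
```

&lt;p>The control-group change (here 2.0) estimates the counterfactual drift; subtracting it from the treated-group change isolates the treatment effect.&lt;/p>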
&lt;p>&lt;strong>Data insight:&lt;/strong> The classic DiD recovered the true effect of 5.0 within sampling error (95% CI: [4.64, 5.60]). In the staggered setting, TWFE estimated 2.18 while the cleaner CS estimator found 2.41 &amp;mdash; a 10% upward correction driven by eliminating the 28.3% weight on forbidden comparisons that dragged TWFE down. The CS event study further revealed that treatment effects grow over time, from 1.97 immediately after treatment to 3.27 six periods later.&lt;/p>
&lt;p>&lt;strong>Practical limitation:&lt;/strong> Parallel trends is untestable for the post-treatment period. Pre-treatment tests (p = 0.29 in our example) can only fail to reject, not confirm. HonestDiD provides a principled solution by computing robust confidence intervals under bounded violations. Our breakdown value exceeding $M = 15$ means the conclusion survives violations more than 15 times the worst pre-treatment departure &amp;mdash; exceptionally strong robustness.&lt;/p>
&lt;p>&lt;strong>Next steps:&lt;/strong> This tutorial used synthetic data &amp;mdash; the 2x2 dataset with a constant treatment effect and the staggered dataset with effects that grow over time. Real-world applications should consider adding covariates to the CS estimator (via the &lt;code>covariates&lt;/code> argument), exploring continuous treatment intensity with &lt;code>ContinuousDiD()&lt;/code>, and comparing CS results against &lt;code>SunAbraham()&lt;/code> or &lt;code>ImputationDiD()&lt;/code> as robustness checks. The &lt;code>diff-diff&lt;/code> package supports all of these within the same API.&lt;/p>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Null effect test.&lt;/strong> Modify the &lt;code>generate_did_data()&lt;/code> call to set &lt;code>treatment_effect=0.0&lt;/code>. Run the full 2x2 analysis and event study. Does the estimator correctly find a zero effect? What do the pre- and post-treatment event study coefficients look like?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Covariates in Callaway-Sant&amp;rsquo;Anna.&lt;/strong> Add covariates to the staggered data (e.g., unit-level characteristics) and pass them via the &lt;code>covariates&lt;/code> argument in &lt;code>CallawaySantAnna().fit()&lt;/code>. Compare the ATT with and without covariate adjustment. When does covariate adjustment matter most?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sun-Abraham comparison.&lt;/strong> Estimate the staggered treatment effect using &lt;code>SunAbraham(control_group=&amp;quot;never_treated&amp;quot;)&lt;/code> instead of &lt;code>CallawaySantAnna()&lt;/code>. Compare the overall ATT and event study coefficients. Under what conditions do the two estimators differ?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HonestDiD with finer M grid.&lt;/strong> Run the sensitivity analysis with &lt;code>M_values = np.arange(0, 15, 0.5)&lt;/code> to find the exact breakdown point. How does the breakdown change if you use &lt;code>method=&amp;quot;smoothness&amp;quot;&lt;/code> instead of &lt;code>&amp;quot;relative_magnitude&amp;quot;&lt;/code>?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.12.001" target="_blank" rel="noopener">Callaway, B. &amp;amp; Sant&amp;rsquo;Anna, P. H. C. (2021). Difference-in-Differences with Multiple Time Periods. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 200&amp;ndash;230.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/igerber/diff-diff" target="_blank" rel="noopener">Gerber, I. (2026). diff-diff: Difference-in-Differences Causal Inference for Python. GitHub repository.&lt;/a> &amp;mdash; &lt;a href="https://diff-diff.readthedocs.io/en/stable/" target="_blank" rel="noopener">Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2021.03.014" target="_blank" rel="noopener">Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 254&amp;ndash;277.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdad018" target="_blank" rel="noopener">Rambachan, A. &amp;amp; Roth, J. (2023). A More Credible Approach to Parallel Trends. &lt;em>Review of Economic Studies&lt;/em>, 90(5), 2555&amp;ndash;2591.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/aeri.20210236" target="_blank" rel="noopener">Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. &lt;em>American Economic Review: Insights&lt;/em>, 4(3), 305&amp;ndash;322.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.09.006" target="_blank" rel="noopener">Sun, L. &amp;amp; Abraham, S. (2021). Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects. &lt;em>Journal of Econometrics&lt;/em>, 225(2), 175&amp;ndash;199.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/2118030" target="_blank" rel="noopener">Card, D. &amp;amp; Krueger, A. B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. &lt;em>American Economic Review&lt;/em>, 84(4), 772&amp;ndash;793.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mixtape.scunning.com/09-difference_in_differences" target="_blank" rel="noopener">Cunningham, S. (2021). &lt;em>Causal Inference: The Mixtape&lt;/em>. Yale University Press. Chapter 9: Difference-in-Differences.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/aer.20181169" target="_blank" rel="noopener">de Chaisemartin, C. &amp;amp; D&amp;rsquo;Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. &lt;em>American Economic Review&lt;/em>, 110(9), 2964&amp;ndash;2996.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. &lt;em>Journal of Educational Psychology&lt;/em>, 66(5), 688&amp;ndash;701.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Heterogeneous treatment effects via two-stage DID</title><link>https://carlos-mendez.org/post/r_two_stage_did/</link><pubDate>Mon, 29 Jul 2024 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_two_stage_did/</guid><description>&lt;h2 id="homogeneous-treatment-effects">Homogeneous Treatment Effects&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🎯 &lt;strong>Purpose&lt;/strong>:
Estimate treatment effects when the treatment is not randomly assigned.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>📉 &lt;strong>Parallel Trends Assumption&lt;/strong>:
In the absence of treatment, the treated and untreated groups would have followed parallel paths over time.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🔄 &lt;strong>Two-Way Fixed-Effects (TWFE) Model&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Static Model&lt;/strong>:&lt;/li>
&lt;/ul>
&lt;p>$$
y_{igt} = \mu_g + \eta_t + \tau D_{gt} + \epsilon_{igt}
$$&lt;/p>
&lt;ul>
&lt;li>$ y_{igt} $: Outcome variable.&lt;/li>
&lt;li>$ i $: Individual.&lt;/li>
&lt;li>$ t $: Time.&lt;/li>
&lt;li>$ g $: Group.&lt;/li>
&lt;li>$ \mu_g $: Group fixed-effects.&lt;/li>
&lt;li>$ \eta_t $: Time fixed-effects.&lt;/li>
&lt;li>$ D_{gt} $: Indicator for treatment status.&lt;/li>
&lt;li>$ \tau $: Average treatment effect on the treated (ATT).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>❗ &lt;strong>Limitations&lt;/strong>:
Assumes constant treatment effects across groups and time, which is often unrealistic.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="heterogeneous-treatment-effects">Heterogeneous Treatment Effects&lt;/h2>
&lt;ul>
&lt;li>🔄 &lt;strong>Enhanced TWFE Model&lt;/strong>:
$$
y_{igt} = \mu_g + \eta_t + \tau_{gt} D_{gt} + \epsilon_{igt}
$$
&lt;ul>
&lt;li>Allows treatment effects ($ \tau_{gt} $) to vary by group and time.&lt;/li>
&lt;li>Aggregates group-time average treatment effects into an overall average treatment effect ($ \tau $).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="dynamic-event-study-twfe-model">Dynamic Event-Study TWFE Model&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🔄 &lt;strong>Model&lt;/strong>:
$$
y_{igt} = \mu_g + \eta_t + \sum_{k=-L}^{-2} \tau_k D_{gt}^k + \sum_{k=0}^{K} \tau_k D_{gt}^k + \epsilon_{igt}
$$&lt;/p>
&lt;ul>
&lt;li>Allows for treatment effects to change over time.&lt;/li>
&lt;li>$ D_{gt}^k $: Lags and leads of treatment status.&lt;/li>
&lt;li>Coefficients ($ \tau_k $) represent the average effect of being treated for $ k $ periods.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🎯 &lt;strong>Estimation Goals&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Objective&lt;/strong>: Estimate the average treatment effect of being exposed for $ k $ periods.&lt;/li>
&lt;li>&lt;strong>Average Treatment Effect&lt;/strong>:
$$
\tau_k = \sum_{g,t : t-g=k} \frac{N_{gt}}{N_k} \tau_{gt}
$$
&lt;ul>
&lt;li>$ N_{gt} $: Number of observations in group $ g $ and time $ t $.&lt;/li>
&lt;li>$ N_k $: Total number of observations with $ t - g = k $.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
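&lt;p>The aggregation above is a cell-size-weighted average of the group-time effects. A numeric sketch with made-up cell sizes and effects (all values hypothetical):&lt;/p>

```python
# tau_k as a weighted average of group-time effects tau_gt,
# weighted by cell sizes N_gt (hypothetical numbers).
cells = [(100, 2.0), (300, 3.0)]        # (N_gt, tau_gt) pairs with t - g = k
N_k = sum(n for n, _ in cells)
tau_k = sum(n * tau for n, tau in cells) / N_k
print(tau_k)  # 2.75
```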
&lt;h2 id="negative-weighting-problem">Negative Weighting Problem&lt;/h2>
&lt;ul>
&lt;li>❗ &lt;strong>Issue&lt;/strong>: With staggered adoption and heterogeneous effects, the TWFE estimate is a weighted average of group-time treatment effects in which some weights can be negative, so the overall estimate can be biased and may even fall outside the range of the underlying effects.&lt;/li>
&lt;li>🛠 &lt;strong>Solution by Gardner (2021)&lt;/strong>:
&lt;ul>
&lt;li>Use a two-stage approach to estimate group and time fixed-effects from untreated/not-yet-treated observations and then estimate treatment effects using residualized outcomes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="two-stage-differences-in-differences">Two-stage differences in differences&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🌱 &lt;strong>Gardner (2021) Approach&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>🔍 &lt;strong>Key Insight&lt;/strong>: Under parallel trends, group and time effects are identified from the untreated/not-yet-treated observations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>📜 &lt;strong>Procedure&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>🥇 &lt;strong>First Stage&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Estimate the model:&lt;/p>
&lt;p>\begin{equation}
y_{igt} = \mu_g + \eta_t + \epsilon_{igt}
\end{equation}&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using only untreated/not-yet-treated observations ($D_{gt} = 0$).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Obtain estimates for group and time effects ($\mu_g$ and $\eta_t$).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>🥈 &lt;strong>Second Stage&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Regress adjusted outcomes ($y_{igt} - \hat{\mu}_g - \hat{\eta}_t$, using the first-stage estimates) on treatment status ($D_{gt}$) in the full sample to estimate treatment effects ($\tau$).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>🎯 &lt;strong>Rationale&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The parallel trends assumption implies that residuals ($\epsilon_{igt}$) are uncorrelated with the treatment dummy, leading to a consistent estimator for the average treatment effect.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
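&lt;p>The two-stage procedure can be sketched end to end on simulated data. This is an illustrative implementation in plain NumPy, not the estimator from any particular package:&lt;/p>

```python
import numpy as np

# Two-stage DiD sketch: stage 1 fits group and time effects on
# untreated observations only; stage 2 regresses the residualized
# outcome on treatment status.
rng = np.random.default_rng(0)
G, T, tau_true = 20, 10, 2.0
g = np.repeat(np.arange(G), T)          # group index per observation
t = np.tile(np.arange(T), G)            # time index per observation
mu, eta = rng.normal(size=G), rng.normal(size=T)
# first 10 groups treated from period 5 onward
D = ((g // 10 == 0) * (t >= 5)).astype(float)
y = mu[g] + eta[t] + tau_true * D + rng.normal(scale=0.1, size=G * T)

# Stage 1: dummy-variable OLS on untreated cells; lstsq returns a
# minimum-norm solution despite the dummy-trap rank deficiency, and
# the fitted values are unique when groups and periods are connected.
X = np.hstack([np.eye(G)[g], np.eye(T)[t]])
m = D == 0
coef, *_ = np.linalg.lstsq(X[m], y[m], rcond=None)

# Stage 2: residualize and regress on D (no-intercept OLS slope)
y_adj = y - X @ coef
tau_hat = (y_adj @ D) / (D @ D)
print(round(tau_hat, 2))  # close to 2.0
```

&lt;p>Because the first stage never uses treated observations, the estimated fixed effects are uncontaminated by treatment, which is what removes the negative-weighting problem.&lt;/p>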
&lt;center>
&lt;div class="alert alert-note">
&lt;div>
Learn by coding using this &lt;a href="https://colab.research.google.com/drive/1A5zxj9SU8phTTCHBkt1fQkFX1xhFbycI?usp=sharing">Google Colab notebook&lt;/a>.
&lt;/div>
&lt;/div>
&lt;/center></description></item><item><title>Spatial Panel Regression in Stata: Cigarette Demand Across US States</title><link>https://carlos-mendez.org/post/stata_sp_regression_panel/</link><pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_sp_regression_panel/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Cigarette taxation is a state-level policy instrument, but consumption in one state does not exist in isolation. When a state raises its tobacco tax, consumers near state borders may simply drive across to buy cheaper cigarettes in a neighboring state. This &lt;strong>cross-border shopping&lt;/strong> effect means that a state&amp;rsquo;s cigarette consumption depends not only on its own prices and income but also on the prices and income of its neighbors. Standard panel data models &amp;mdash; pooled OLS, fixed effects, and two-way fixed effects &amp;mdash; cannot capture these spatial spillovers because they treat each state as an independent observation.&lt;/p>
&lt;p>This tutorial introduces &lt;strong>spatial panel regression&lt;/strong> as a framework for modeling geographic interdependence in panel data. We use the classic Baltagi cigarette demand dataset, which tracks per-capita cigarette consumption, real prices, and real per-capita income across 46 US states from 1963 to 1992. Starting from non-spatial panel models as a baseline, we progressively build toward the &lt;strong>Spatial Durbin Model (SDM)&lt;/strong> &amp;mdash; a flexible specification that includes both the spatial lag of the dependent variable and spatial lags of the explanatory variables. We then use &lt;strong>Wald tests&lt;/strong> to determine whether simpler spatial models (SAR, SLX, or SEM) are adequate, and finally extend the framework to &lt;strong>dynamic spatial panels&lt;/strong> that account for habit persistence in cigarette consumption.&lt;/p>
&lt;p>All estimation is performed using the &lt;code>xsmle&lt;/code> package in Stata, which implements maximum likelihood estimation for a family of spatial panel models with fixed effects. The spatial weight matrix is a binary contiguity matrix that defines two states as neighbors if they share a common border, row-standardized so that the spatial lag of a variable equals the average value among a state&amp;rsquo;s neighbors.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Estimate non-spatial panel models (pooled OLS, region FE, time FE, two-way FE) and compare their price and income elasticities&lt;/li>
&lt;li>Construct and load a row-standardized spatial weight matrix for panel data in Stata&lt;/li>
&lt;li>Estimate the Spatial Durbin Model (SDM) with two-way fixed effects using the &lt;code>xsmle&lt;/code> package&lt;/li>
&lt;li>Apply the Lee and Yu bias correction for spatial panels with moderate time dimensions&lt;/li>
&lt;li>Use Wald tests to evaluate whether the SDM simplifies to SAR, SLX, or SEM&lt;/li>
&lt;li>Estimate dynamic spatial panel models with temporal and spatiotemporal lags to capture habit persistence&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-the-modeling-pipeline">2. The modeling pipeline&lt;/h2>
&lt;p>The tutorial follows a progressive approach &amp;mdash; each stage builds on the previous one by relaxing assumptions and adding complexity. The diagram below summarizes the path from data preparation through the final dynamic spatial models.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Data &amp;amp; W&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 3&amp;lt;/i&amp;gt;&amp;lt;br/&amp;gt;Panel setup&amp;lt;br/&amp;gt;Weight matrix&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Non-Spatial&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 4&amp;lt;/i&amp;gt;&amp;lt;br/&amp;gt;OLS, FE,&amp;lt;br/&amp;gt;Two-way FE&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;SDM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 6&amp;lt;/i&amp;gt;&amp;lt;br/&amp;gt;Spatial Durbin&amp;lt;br/&amp;gt;+ Lee-Yu&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Wald Tests&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 7&amp;lt;/i&amp;gt;&amp;lt;br/&amp;gt;SAR? SLX?&amp;lt;br/&amp;gt;SEM?&amp;quot;]
E[&amp;quot;&amp;lt;b&amp;gt;Dynamic&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 8&amp;lt;/i&amp;gt;&amp;lt;br/&amp;gt;Temporal &amp;amp;&amp;lt;br/&amp;gt;spatial lags&amp;quot;]
A --&amp;gt; B
B --&amp;gt; C
C --&amp;gt; D
D --&amp;gt; E
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We first establish non-spatial benchmarks to understand the baseline price and income elasticities. Then we introduce the Spatial Durbin Model to capture spillovers, apply Wald tests to check whether a simpler spatial specification suffices, and finally add dynamic components to account for the habit-forming nature of cigarette consumption.&lt;/p>
&lt;hr>
&lt;h2 id="3-setup-and-data-loading">3. Setup and data loading&lt;/h2>
&lt;p>Before running any spatial models, we need three Stata packages: &lt;code>spmat&lt;/code> for spatial weight matrix management, &lt;code>xsmle&lt;/code> for spatial panel estimation, and &lt;code>spwmatrix&lt;/code> for weight matrix conversion. If you have not installed them, uncomment the &lt;code>net install&lt;/code> lines below.&lt;/p>
&lt;pre>&lt;code class="language-stata">clear all
macro drop _all
set more off
version 12
* Install packages (uncomment if needed)
*net install st0292, from(http://www.stata-journal.com/software/sj13-2)
*net install xsmle, from(http://fmwww.bc.edu/RePEc/bocode/x)
*net install spwmatrix, from(http://fmwww.bc.edu/RePEc/bocode/s)
&lt;/code>&lt;/pre>
&lt;h3 id="31-spatial-weight-matrix">3.1 Spatial weight matrix&lt;/h3>
&lt;p>The spatial weight matrix &lt;strong>W&lt;/strong> defines the neighborhood structure among the 46 US states. We use a binary contiguity matrix where two states are neighbors if they share a common border. The matrix is stored in a &lt;code>.dta&lt;/code> file and converted to an &lt;code>spmat&lt;/code> object with row-standardization &amp;mdash; meaning that each row sums to one, so the spatial lag of a variable equals the &lt;strong>weighted average&lt;/strong> among a state&amp;rsquo;s neighbors.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load binary contiguity W matrix and convert to row-standardized spmat object
use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/cigar/Wct_bin.dta&amp;quot;, replace
spmat dta Wst m1-m46, norm(row) replace
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>spmat dta&lt;/code> command reads columns &lt;code>m1&lt;/code> through &lt;code>m46&lt;/code> from the loaded dataset and stores them as a spatial weight matrix object named &lt;code>Wst&lt;/code>. The &lt;code>norm(row)&lt;/code> option applies row-standardization, and &lt;code>replace&lt;/code> overwrites any existing matrix with the same name.&lt;/p>
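&lt;p>The row-standardization logic can be mirrored in a few lines of Python, using a toy 3-state contiguity matrix rather than the actual 46-state file:&lt;/p>

```python
import numpy as np

# Row-standardize a binary contiguity matrix so each row sums to one;
# the spatial lag W @ x is then the average of each unit's neighbors.
W_bin = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)   # toy 3-state contiguity
W = W_bin / W_bin.sum(axis=1, keepdims=True)

x = np.array([10.0, 20.0, 30.0])
print(W @ x)  # [25. 10. 10.]
```

&lt;p>State 1 borders states 2 and 3, so its spatial lag is their average (25); states 2 and 3 each border only state 1, so their lag is simply its value.&lt;/p>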
&lt;h3 id="32-panel-data-setup">3.2 Panel data setup&lt;/h3>
&lt;p>The Baltagi cigarette demand dataset contains three variables measured across 46 US states and 30 years (1963&amp;ndash;1992): log per-capita cigarette consumption (&lt;code>logc&lt;/code>), log real cigarette price (&lt;code>logp&lt;/code>), and log real per-capita disposable income (&lt;code>logy&lt;/code>).&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load panel data
use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/cigar/baltagi_cigar.dta&amp;quot;, clear
sort year state
xtset state year
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel variable: state (strongly balanced)
Time variable: year, 1963 to 1992
Delta: 1 unit
&lt;/code>&lt;/pre>
&lt;p>The panel is &lt;strong>strongly balanced&lt;/strong> &amp;mdash; all 46 states are observed in all 30 years, yielding 1,380 total observations. This balanced structure simplifies estimation and avoids the complications of missing data.&lt;/p>
&lt;h3 id="33-panel-summary-statistics">3.3 Panel summary statistics&lt;/h3>
&lt;p>The &lt;code>xtsum&lt;/code> command decomposes each variable&amp;rsquo;s variation into between-state and within-state components &amp;mdash; a key diagnostic for understanding what panel models can and cannot identify.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtsum
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Variable | Mean Std. dev. Min Max | Observations
-----------------+--------------------------------------------+----------------
logc overall | 4.625563 .2538233 3.736352 5.399758 | N = 1380
between | .225498 4.057739 5.19628 | n = 46
within | .1254968 4.110718 5.070093 | T = 30
| |
logp overall | 3.648067 .3364439 2.579455 4.588055 | N = 1380
between | .1927783 3.22723 4.021831 | n = 46
within | .2798008 2.780289 4.372397 | T = 30
| |
logy overall | 1.615786 .248717 .8676362 2.253795 | N = 1380
between | .1363281 1.294913 2.063736 | n = 46
within | .2098697 1.035539 2.106283 | T = 30
&lt;/code>&lt;/pre>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Mean&lt;/th>
&lt;th>Std. Dev.&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>logc&lt;/code>&lt;/td>
&lt;td>Log per-capita cigarette consumption (packs)&lt;/td>
&lt;td>4.626&lt;/td>
&lt;td>0.254&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logp&lt;/code>&lt;/td>
&lt;td>Log real price per pack (cents)&lt;/td>
&lt;td>3.648&lt;/td>
&lt;td>0.336&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logy&lt;/code>&lt;/td>
&lt;td>Log real per-capita disposable income&lt;/td>
&lt;td>1.616&lt;/td>
&lt;td>0.249&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mean log consumption is 4.63, corresponding to roughly 102 packs per capita per year. The between-state standard deviation of &lt;code>logc&lt;/code> (0.225) is larger than the within-state standard deviation (0.125), indicating that cross-state differences in consumption levels are more pronounced than changes within a single state over time. For &lt;code>logp&lt;/code>, the pattern reverses &amp;mdash; within-state variation (0.280) exceeds between-state variation (0.193), reflecting the fact that real prices changed substantially over this 30-year period due to tax policy changes and inflation. This decomposition foreshadows why fixed effects models, which exploit within-state variation, may produce different elasticity estimates than pooled models.&lt;/p>
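&lt;p>The decomposition that &lt;code>xtsum&lt;/code> reports can be reproduced directly: overall variation splits into the dispersion of unit means (between) and the dispersion around those means (within). A sketch on simulated balanced-panel data; Stata applies slightly different degrees-of-freedom adjustments, but the logic is the same:&lt;/p>

```python
import numpy as np

# Between/within decomposition of a balanced panel variable:
# overall variance = variance of unit means + variance within units.
rng = np.random.default_rng(1)
x = rng.normal(size=(46, 30)) + rng.normal(size=(46, 1))  # 46 units, 30 periods

unit_means = x.mean(axis=1, keepdims=True)
between_var = unit_means.var()        # dispersion of unit means
within_var = (x - unit_means).var()   # dispersion around unit means
overall_var = x.var()

print(np.isclose(overall_var, between_var + within_var))  # True
```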
&lt;hr>
&lt;h2 id="4-non-spatial-panel-models">4. Non-spatial panel models&lt;/h2>
&lt;p>Before introducing spatial dependence, we estimate four standard panel specifications to establish baseline price and income elasticities. Each model relaxes a different assumption about unobserved heterogeneity, and comparing their estimates reveals how sensitive the results are to the treatment of state-level and time-level confounders.&lt;/p>
&lt;h3 id="41-pooled-ols">4.1 Pooled OLS&lt;/h3>
&lt;p>Pooled OLS treats all 1,380 observations as independent, ignoring the panel structure entirely. It provides a naive benchmark.&lt;/p>
&lt;pre>&lt;code class="language-stata">reg logc logp logy
estimates store pool
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Source | SS df MS Number of obs = 1,380
-------------+---------------------------------- F(2, 1377) = 199.28
Model | 21.564818 2 10.7824090 Prob &amp;gt; F = 0.0000
Residual | 74.518523 1,377 .054116576 R-squared = 0.2244
-------------+---------------------------------- Adj R-squared = 0.2233
Total | 96.083341 1,379 .069676098 Root MSE = .23284
------------------------------------------------------------------------------
logc | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
logp | -.3857227 .0309752 -12.45 0.000 -.4464987 -.3249467
logy | .3724439 .0264568 14.08 0.000 .3205328 .4243551
_cons | 4.396312 .0531992 82.64 0.000 4.291951 4.500674
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Pooled OLS estimates a price elasticity of &lt;strong>-0.386&lt;/strong> and an income elasticity of &lt;strong>0.372&lt;/strong>, both statistically significant at the 1% level. However, the R-squared is only 0.224, and more importantly, this model assumes no systematic differences across states &amp;mdash; an untenable assumption given the large between-state variation we observed in the summary statistics.&lt;/p>
&lt;h3 id="42-region-fixed-effects">4.2 Region fixed effects&lt;/h3>
&lt;p>Region (state) fixed effects control for all time-invariant state characteristics &amp;mdash; geographic location, cultural attitudes toward smoking, historical tobacco production, and any other state-specific factor that does not change over the sample period.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtreg logc logp logy, fe
estimates store rfe
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Fixed-effects (within) regression Number of obs = 1,380
Group variable: state Number of groups = 46
R-squared: Obs per group:
Within = 0.4059 min = 30
Between = 0.0681 avg = 30.0
Overall = 0.1050 max = 30
F(2,1332) = 455.52
corr(u_i, Xb) = -0.8072 Prob &amp;gt; F = 0.0000
------------------------------------------------------------------------------
logc | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
logp | -.2307217 .0276419 -8.35 0.000 -.2849426 -.1765008
logy | -.0145419 .0389849 -0.37 0.709 -.0910300 .0619462
_cons | 4.619736 .0542965 85.09 0.000 4.513180 4.726293
------------------------------------------------------------------------------
sigma_u | .21834832
sigma_e | .09498463
rho | .84090063 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(45, 1332) = 85.78 Prob &amp;gt; F = 0.0000
&lt;/code>&lt;/pre>
&lt;p>After controlling for state fixed effects, the price elasticity drops to &lt;strong>-0.231&lt;/strong> &amp;mdash; substantially smaller in magnitude than the pooled OLS estimate of -0.386. This difference reveals that much of the apparent price sensitivity in pooled OLS was driven by &lt;strong>cross-state composition effects&lt;/strong>: low-price states tend to have higher consumption for reasons unrelated to price (e.g., tobacco-producing states have both lower prices and stronger smoking cultures). The income elasticity becomes statistically insignificant at &lt;strong>-0.015&lt;/strong> (p = 0.709), suggesting that within-state income changes over time do not strongly predict consumption changes once state-level heterogeneity is absorbed. The F-test for joint significance of state fixed effects is overwhelming (F = 85.78, p &amp;lt; 0.001), confirming that state heterogeneity is substantial.&lt;/p>
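&lt;p>What &lt;code>xtreg, fe&lt;/code> does under the hood is the within transformation: demean each variable by its unit mean and run OLS on the deviations. A sketch on simulated data (not the cigarette panel) in which the regressor is deliberately correlated with the unit effects:&lt;/p>

```python
import numpy as np

# Within (fixed-effects) estimator sketch: demeaning by unit removes
# the unit effects, so OLS on deviations recovers the slope even when
# the regressor is correlated with those effects.
rng = np.random.default_rng(2)
N, T, beta_true = 46, 30, -0.4
alpha = rng.normal(size=(N, 1))                 # unit fixed effects
x = rng.normal(size=(N, T)) + alpha             # regressor correlated with alpha
y = alpha + beta_true * x + rng.normal(scale=0.1, size=(N, T))

x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_hat = (x_dm * y_dm).sum() / (x_dm ** 2).sum()
print(round(beta_hat, 2))  # close to -0.4
```

&lt;p>Pooled OLS on the raw data would be pulled toward the unit effects; the within transformation removes that contamination, which is exactly why the region-FE elasticity differs from the pooled one.&lt;/p>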
&lt;h3 id="43-time-fixed-effects">4.3 Time fixed effects&lt;/h3>
&lt;p>Time fixed effects control for shocks common to all states in a given year &amp;mdash; federal anti-smoking campaigns, national health reports (such as the 1964 Surgeon General&amp;rsquo;s report), and macroeconomic fluctuations.&lt;/p>
&lt;pre>&lt;code class="language-stata">reg logc logp logy i.year
estimates store tfe
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Source | SS df MS Number of obs = 1,380
-------------+---------------------------------- F(31, 1348) = 41.04
Model | 48.7107267 31 1.57131054 Prob &amp;gt; F = 0.0000
Residual | 47.3726143 1,348 .03514290 R-squared = 0.5070
-------------+---------------------------------- Adj R-squared = 0.4957
Total | 96.083341 1,379 .069676098 Root MSE = .18747
------------------------------------------------------------------------------
logc | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
logp | -.8612867 .0389729 -22.10 0.000 -.9377676 -.7848058
logy | .8045032 .0466019 17.26 0.000 .7130647 .8959417
_cons | 3.958816 .0638297 62.02 0.000 3.833551 4.084081
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>With time fixed effects, the price elasticity jumps to &lt;strong>-0.861&lt;/strong> and the income elasticity to &lt;strong>0.805&lt;/strong> &amp;mdash; both much larger in magnitude than the pooled OLS estimates. By removing common year-level trends (such as the secular decline in smoking rates after the Surgeon General&amp;rsquo;s report), the model isolates cross-state differences in a given year. The R-squared increases to 0.507, a substantial improvement over pooled OLS.&lt;/p>
&lt;h3 id="44-two-way-fixed-effects">4.4 Two-way fixed effects&lt;/h3>
&lt;p>Two-way fixed effects combine state and time dummies, controlling simultaneously for state-specific time-invariant factors and year-specific common shocks. This is the most thorough non-spatial specification and serves as our benchmark.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtreg logc logp logy i.year, fe
estimates store rtfe
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Fixed-effects (within) regression Number of obs = 1,380
Group variable: state Number of groups = 46
R-squared: Obs per group:
Within = 0.7891 min = 30
Between = 0.0121 avg = 30.0
Overall = 0.0456 max = 30
F(31,1303) = 157.60
corr(u_i, Xb) = -0.5688 Prob &amp;gt; F = 0.0000
------------------------------------------------------------------------------
logc | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
logp | -.4020279 .0272553 -14.75 0.000 -.4555018 -.3485541
logy | .1193476 .0478095 2.50 0.013 .0255202 .2131749
_cons | 4.515994 .0533810 84.59 0.000 4.411254 4.620733
------------------------------------------------------------------------------
sigma_u | .21428785
sigma_e | .05601281
rho | .93607854 (fraction of variance due to u_i)
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The two-way FE model yields a price elasticity of &lt;strong>-0.402&lt;/strong> and an income elasticity of &lt;strong>0.119&lt;/strong>. The within R-squared is 0.789, a dramatic improvement over the region-only FE model (0.406), indicating that year effects absorb a large share of temporal variation. The price elasticity is roughly intermediate between the region-FE (-0.231) and time-FE (-0.861) estimates, illustrating how the choice of fixed effects changes the identifying variation and the resulting elasticity.&lt;/p>
&lt;h3 id="45-comparison-of-non-spatial-models">4.5 Comparison of non-spatial models&lt;/h3>
&lt;pre>&lt;code class="language-stata">estimates table pool rfe tfe rtfe, b(%7.2f) star(0.1 0.05 0.01) stf(%9.0f)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Pooled OLS&lt;/th>
&lt;th>Region FE&lt;/th>
&lt;th>Time FE&lt;/th>
&lt;th>Two-way FE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>logp&lt;/code>&lt;/td>
&lt;td>-0.39***&lt;/td>
&lt;td>-0.23***&lt;/td>
&lt;td>-0.86***&lt;/td>
&lt;td>-0.40***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logy&lt;/code>&lt;/td>
&lt;td>0.37***&lt;/td>
&lt;td>-0.01&lt;/td>
&lt;td>0.80***&lt;/td>
&lt;td>0.12**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>R-sq&lt;/td>
&lt;td>0.224&lt;/td>
&lt;td>0.406&lt;/td>
&lt;td>0.507&lt;/td>
&lt;td>0.789&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The four specifications tell a coherent story: price has a &lt;strong>consistently negative&lt;/strong> effect on cigarette consumption, but the magnitude varies from -0.23 (region FE) to -0.86 (time FE) depending on which sources of variation are exploited. The two-way FE estimate of -0.40 is the most credible non-spatial benchmark because it controls for both state heterogeneity and common time trends. However, all four models assume that each state&amp;rsquo;s consumption depends only on its &lt;strong>own&lt;/strong> price and income &amp;mdash; an assumption we will relax in the next section.&lt;/p>
&lt;hr>
&lt;h2 id="5-why-spatial-models">5. Why spatial models?&lt;/h2>
&lt;p>Even with two-way fixed effects, the models above ignore a potentially important channel: &lt;strong>spatial spillovers&lt;/strong>. If Virginia raises its cigarette tax, smokers in bordering states might change their behavior too &amp;mdash; either because they no longer cross into Virginia to buy cheaper cigarettes, or because Virginia&amp;rsquo;s policy signals a broader regional trend. Similarly, a rise in income in one state may increase consumption in neighboring states through commuting, trade, and social networks.&lt;/p>
&lt;p>The &lt;strong>Spatial Durbin Model (SDM)&lt;/strong> is a flexible framework that captures these spillovers through two channels:&lt;/p>
&lt;p>$$y_{it} = \rho \sum_{j=1}^{N} w_{ij} y_{jt} + x_{it} \beta + \sum_{j=1}^{N} w_{ij} x_{jt} \theta + \mu_i + \lambda_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, this equation says that cigarette consumption in state $i$ at time $t$ depends on three spatial components: (1) the &lt;strong>spatial lag of the dependent variable&lt;/strong> $\rho W y$ &amp;mdash; how much a state&amp;rsquo;s consumption is influenced by its neighbors' consumption, (2) the &lt;strong>own effects&lt;/strong> of price and income $X \beta$, and (3) the &lt;strong>spatial lags of the explanatory variables&lt;/strong> $W X \theta$ &amp;mdash; how neighbors' prices and incomes spill over. The parameters $\mu_i$ and $\lambda_t$ are state and year fixed effects, respectively.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Symbol&lt;/th>
&lt;th>Meaning&lt;/th>
&lt;th>Code variable&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$y_{it}$&lt;/td>
&lt;td>Log cigarette consumption in state $i$, year $t$&lt;/td>
&lt;td>&lt;code>logc&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>Spatial autoregressive parameter (neighbor consumption effect)&lt;/td>
&lt;td>&lt;code>[Spatial]rho&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$w_{ij}$&lt;/td>
&lt;td>Element of the row-standardized weight matrix&lt;/td>
&lt;td>&lt;code>Wst&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$x_{it}$&lt;/td>
&lt;td>Own price and income&lt;/td>
&lt;td>&lt;code>logp&lt;/code>, &lt;code>logy&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta$&lt;/td>
&lt;td>Own-variable coefficients&lt;/td>
&lt;td>&lt;code>[Main]logp&lt;/code>, &lt;code>[Main]logy&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\theta$&lt;/td>
&lt;td>Spatial lag coefficients (neighbor effects of X)&lt;/td>
&lt;td>&lt;code>[Wx]logp&lt;/code>, &lt;code>[Wx]logy&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
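&lt;p>To build intuition for the $Wy$ and $WX$ terms, it helps to construct one spatial lag by hand. Below is a minimal sketch, assuming &lt;code>Wst&lt;/code> was created as an &lt;code>spmat&lt;/code> object (as &lt;code>xsmle&lt;/code> expects) and that 1980 is one of the sample years; the variable name &lt;code>W_logp&lt;/code> is our own.&lt;/p>
&lt;pre>&lt;code class="language-stata">* For a single cross-section of the panel, compute the spatial lag of
* logp by hand: for each state, W_logp is the weighted average of its
* neighbors' log prices -- the regressor that [Wx]logp multiplies.
preserve
keep if year == 1980
spmat lag double W_logp Wst logp
summarize logp W_logp
restore
&lt;/code>&lt;/pre>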
&lt;p>A key advantage of the SDM is that it &lt;strong>nests&lt;/strong> three simpler spatial models as special cases. This means we can start with the general SDM and then test whether the data supports reducing it to a simpler specification.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
SDM[&amp;quot;&amp;lt;b&amp;gt;Spatial Durbin Model (SDM)&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + WXθ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Most general&amp;lt;/i&amp;gt;&amp;quot;]
SAR[&amp;quot;&amp;lt;b&amp;gt;SAR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;θ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SLX[&amp;quot;&amp;lt;b&amp;gt;SLX&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + WXθ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;ρ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SEM[&amp;quot;&amp;lt;b&amp;gt;SEM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + u, u = λWu + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;θ + ρβ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SDM --&amp;gt;|&amp;quot;θ = 0?&amp;quot;| SAR
SDM --&amp;gt;|&amp;quot;ρ = 0?&amp;quot;| SLX
SDM --&amp;gt;|&amp;quot;θ + ρβ = 0?&amp;quot;| SEM
style SDM fill:#00d4c8,stroke:#141413,color:#141413
style SAR fill:#6a9bcc,stroke:#141413,color:#fff
style SLX fill:#d97757,stroke:#141413,color:#fff
style SEM fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The &lt;strong>SAR&lt;/strong> (Spatial Autoregressive) model restricts $\theta = 0$, assuming that only neighbors' consumption (not their prices or incomes) matters. The &lt;strong>SLX&lt;/strong> (Spatial Lag of X) model restricts $\rho = 0$, assuming that neighbors' characteristics affect local consumption but there is no autoregressive feedback. The &lt;strong>SEM&lt;/strong> (Spatial Error Model) imposes the common factor restriction $\theta + \rho \beta = 0$, implying that spatial dependence operates entirely through correlated errors rather than substantive spillovers. In Section 7, we will use Wald tests to determine which, if any, of these restrictions the data supports.&lt;/p>
&lt;hr>
&lt;h2 id="6-spatial-durbin-model-sdm">6. Spatial Durbin Model (SDM)&lt;/h2>
&lt;h3 id="61-sdm-with-two-way-fixed-effects">6.1 SDM with two-way fixed effects&lt;/h3>
&lt;p>We now estimate the full Spatial Durbin Model with both state and year fixed effects. The &lt;code>xsmle&lt;/code> command performs maximum likelihood estimation for spatial panel models. The option &lt;code>type(both)&lt;/code> specifies two-way fixed effects, &lt;code>mod(sdm)&lt;/code> selects the Spatial Durbin specification, &lt;code>effects&lt;/code> requests the direct and indirect effects, and &lt;code>nsim(999)&lt;/code> computes their standard errors from 999 Monte Carlo simulations.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, fe type(both) wmat(Wst) mod(sdm) effects nsim(999) nolog
estimates store sdm1
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial Durbin model with fixed-effects Number of obs = 1,380
Group variable: state Number of groups = 46
Time variable: year
Obs per group:
min = 30
avg = 30.0
max = 30
Wald chi2(4) = 379.19
Log-likelihood = 1971.5204 Prob &amp;gt; chi2 = 0.0000
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.3068973 .0282114 -10.88 0.000 -.3621907 -.2516039
logy | .0781427 .0481269 1.62 0.104 -.0161843 .1724697
-------------+----------------------------------------------------------------
Wx |
logp | -.2060671 .0649703 -3.17 0.002 -.3334065 -.0787277
logy | .1803542 .0885162 2.04 0.042 .0068656 .3538428
-------------+----------------------------------------------------------------
Spatial |
rho | .2649571 .0327948 8.08 0.000 .2006804 .3292339
-------------+----------------------------------------------------------------
sigma2_e| .0027866
------------------------------------------------------------------------------
logp |
Direct | -.3131508 .0285649 -10.96 0.000 -.3691370 -.2571645
Indirect | -.3138174 .0812337 -3.86 0.000 -.4730325 -.1546023
Total | -.6269682 .0866710 -7.23 0.000 -.7968403 -.4570961
logy |
Direct | .0941302 .0488720 1.93 0.054 -.0016572 .1899176
Indirect | .2683417 .1099814 2.44 0.015 .0527821 .4839013
Total | .3624719 .1216523 2.98 0.003 .1240378 .6009060
&lt;/code>&lt;/pre>
&lt;p>The spatial autoregressive parameter $\rho$ is &lt;strong>0.265&lt;/strong> (z = 8.08, p &amp;lt; 0.001), indicating substantial positive spatial dependence &amp;mdash; states with higher-consuming neighbors tend to consume more themselves, even after controlling for own prices and income. The own price coefficient (&lt;code>[Main]logp&lt;/code>) is -0.307, while the spatial lag of neighbors' prices (&lt;code>[Wx]logp&lt;/code>) is -0.206, meaning that higher prices in neighboring states also reduce local consumption. This is consistent with the cross-border shopping hypothesis: when neighbors' prices rise, there are fewer opportunities for local consumers to shop across borders, reinforcing the local price effect.&lt;/p>
&lt;p>The &lt;strong>direct effect&lt;/strong> of price is -0.313, meaning that a 1% increase in a state&amp;rsquo;s own price reduces its consumption by 0.31%. The &lt;strong>indirect (spillover) effect&lt;/strong> of price is -0.314, nearly as large as the direct effect. This means that when all neighboring states raise prices by 1%, the resulting reduction in consumption in the focal state is comparable to the state raising its own price. The &lt;strong>total effect&lt;/strong> of price is -0.627 &amp;mdash; much larger than the two-way FE estimate of -0.402, revealing that non-spatial models substantially underestimate the true price sensitivity of cigarette demand.&lt;/p>
&lt;h3 id="62-lee-and-yu-bias-correction">6.2 Lee and Yu bias correction&lt;/h3>
&lt;p>In spatial panels with fixed effects, the maximum likelihood estimator suffers from the &lt;strong>incidental parameters problem&lt;/strong> &amp;mdash; the number of fixed effect parameters grows with the number of states, which introduces a bias term of order $1/T$. With $T = 30$ years, this bias may be non-negligible. Lee and Yu (2010) proposed a bias correction procedure that adjusts the ML estimates to eliminate the leading bias term.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, fe type(both) leeyu wmat(Wst) mod(sdm) effects nsim(999) nolog
estimates store sdm2
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial Durbin model with fixed-effects (Lee-Yu) Number of obs = 1,334
Group variable: state Number of groups = 46
Time variable: year
Obs per group:
min = 29
avg = 29.0
max = 29
Wald chi2(4) = 392.50
Log-likelihood = 1932.4681 Prob &amp;gt; chi2 = 0.0000
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.3044782 .0283901 -10.72 0.000 -.3601218 -.2488346
logy | .0770150 .0486311 1.58 0.113 -.0183001 .1723301
-------------+----------------------------------------------------------------
Wx |
logp | -.2083124 .0654876 -3.18 0.001 -.3366657 -.0799591
logy | .1869831 .0894718 2.09 0.037 .0116216 .3623446
-------------+----------------------------------------------------------------
Spatial |
rho | .2596348 .0332441 7.81 0.000 .1944776 .3247920
-------------+----------------------------------------------------------------
sigma2_e| .0027512
------------------------------------------------------------------------------
logp |
Direct | -.3104271 .0287814 -10.79 0.000 -.3668377 -.2540166
Indirect | -.3122946 .0825781 -3.78 0.000 -.4741447 -.1504446
Total | -.6227218 .0878439 -7.09 0.000 -.7948927 -.4505509
logy |
Direct | .0935487 .0494610 1.89 0.059 -.0033931 .1904905
Indirect | .2739264 .1115282 2.46 0.014 .0553351 .4925177
Total | .3674751 .1235608 2.97 0.003 .1253004 .6096498
&lt;/code>&lt;/pre>
&lt;p>The Lee-Yu correction uses $N \times (T-1) = 46 \times 29 = 1{,}334$ observations (one time period is lost in the transformation). The corrected estimates are very close to the uncorrected ones: $\rho$ changes from 0.265 to &lt;strong>0.260&lt;/strong>, the own price coefficient from -0.307 to -0.304, and the total price effect from -0.627 to &lt;strong>-0.623&lt;/strong>. This stability is reassuring &amp;mdash; with $T = 30$, the bias is already small. The closeness of the two sets of estimates provides confidence that the standard ML estimates are reliable for this dataset.&lt;/p>
&lt;h3 id="63-comparison">6.3 Comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>SDM (standard)&lt;/th>
&lt;th>SDM (Lee-Yu)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>0.265***&lt;/td>
&lt;td>0.260***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logp&lt;/code> (own)&lt;/td>
&lt;td>-0.307***&lt;/td>
&lt;td>-0.304***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logy&lt;/code> (own)&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.077&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>W*logp&lt;/code> (neighbors)&lt;/td>
&lt;td>-0.206***&lt;/td>
&lt;td>-0.208***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>W*logy&lt;/code> (neighbors)&lt;/td>
&lt;td>0.180**&lt;/td>
&lt;td>0.187**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Direct price effect&lt;/td>
&lt;td>-0.313***&lt;/td>
&lt;td>-0.310***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Indirect price effect&lt;/td>
&lt;td>-0.314***&lt;/td>
&lt;td>-0.312***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Total price effect&lt;/td>
&lt;td>-0.627***&lt;/td>
&lt;td>-0.623***&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The two sets of estimates are nearly identical, confirming that the incidental parameters bias is negligible with 30 time periods. For the remainder of this tutorial, we use the Lee-Yu corrected estimates as our preferred specification.&lt;/p>
&lt;hr>
&lt;h2 id="7-wald-specification-tests">7. Wald specification tests&lt;/h2>
&lt;p>The SDM is the most general model in the spatial panel family, nesting SAR, SLX, and SEM as special cases. Before accepting the full SDM, we should test whether the data supports a simpler specification. We do this by testing the parameter restrictions that define each nested model. If the restrictions are rejected, the simpler model is inadequate and we should retain the SDM.&lt;/p>
&lt;p>We first re-estimate the SDM with the Lee-Yu correction (the &lt;code>quietly&lt;/code> prefix suppresses output since we already displayed these results).&lt;/p>
&lt;pre>&lt;code class="language-stata">quietly xsmle logc logp logy, fe type(both) leeyu wmat(Wst) mod(sdm) effects nsim(999) nolog
&lt;/code>&lt;/pre>
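&lt;p>Since the same fit was already stored as &lt;code>sdm2&lt;/code> in Section 6, an equivalent and faster alternative is to restore the stored results rather than re-estimate, assuming they are still in memory:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Equivalent to the quiet re-estimation above: make the stored Lee-Yu
* SDM results the active estimates, so test and testnl can use them.
estimates restore sdm2
&lt;/code>&lt;/pre>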
&lt;h3 id="71-can-the-sdm-reduce-to-sar">7.1 Can the SDM reduce to SAR?&lt;/h3>
&lt;p>The SAR model restricts $\theta = 0$ &amp;mdash; that is, the spatial lags of the explanatory variables are zero. Under SAR, only neighbors' consumption matters, not their prices or incomes directly. We test this with a joint Wald test on the &lt;code>[Wx]&lt;/code> coefficients.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SAR? (NO if p &amp;lt; 0.05)
test ([Wx]logp = 0) ([Wx]logy = 0)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ( 1) [Wx]logp = 0
( 2) [Wx]logy = 0
chi2( 2) = 12.87
Prob &amp;gt; chi2 = 0.0016
&lt;/code>&lt;/pre>
&lt;p>The Wald test &lt;strong>rejects&lt;/strong> the SAR restriction (chi2 = 12.87, p = 0.002). This means that neighbors' prices and incomes have direct effects on local consumption beyond their influence through the spatial lag of consumption. Dropping the $WX$ terms from the model would misspecify the spatial dependence structure.&lt;/p>
&lt;h3 id="72-can-the-sdm-reduce-to-slx">7.2 Can the SDM reduce to SLX?&lt;/h3>
&lt;p>The SLX model restricts $\rho = 0$ &amp;mdash; there is no spatial autoregressive feedback through the dependent variable. Under SLX, neighbors' characteristics affect local consumption directly, but the spatial multiplier effect (where shocks propagate through the network) is absent.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SLX? (NO if p &amp;lt; 0.05)
test ([Spatial]rho = 0)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ( 1) [Spatial]rho = 0
chi2( 1) = 61.04
Prob &amp;gt; chi2 = 0.0000
&lt;/code>&lt;/pre>
&lt;p>The Wald test &lt;strong>overwhelmingly rejects&lt;/strong> the SLX restriction (chi2 = 61.04, p &amp;lt; 0.001). The spatial autoregressive parameter $\rho$ is far from zero, confirming that there is a genuine feedback mechanism: a shock to consumption in one state propagates to its neighbors, which in turn affects their neighbors, creating a spatial multiplier.&lt;/p>
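&lt;p>The size of this multiplier can be gauged from $\rho$ alone: with a row-standardized $W$, a uniform unit shock is amplified to $1/(1-\rho)$ in the aggregate. A back-of-envelope check with the Lee-Yu estimate:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Aggregate spatial multiplier implied by rho = 0.260: each unit
* shock grows to about 1.35 units once neighbor feedback is summed.
display 1 / (1 - 0.260)
&lt;/code>&lt;/pre>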
&lt;h3 id="73-can-the-sdm-reduce-to-sem">7.3 Can the SDM reduce to SEM?&lt;/h3>
&lt;p>The SEM (Spatial Error Model) imposes the common factor restriction $\theta + \rho \beta = 0$. Under this restriction, the spatial dependence is purely a &lt;strong>nuisance&lt;/strong> &amp;mdash; it enters through correlated error terms rather than through substantive economic spillovers. If SEM is adequate, the apparent spillover effects are an artifact of omitted spatially correlated variables, not genuine cross-border interactions.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SEM? (NO if p &amp;lt; 0.05)
testnl ([Wx]logp = -[Spatial]rho*[Main]logp) ([Wx]logy = -[Spatial]rho*[Main]logy)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> (1) [Wx]logp = -[Spatial]rho*[Main]logp
(2) [Wx]logy = -[Spatial]rho*[Main]logy
chi2( 2) = 8.49
Prob &amp;gt; chi2 = 0.0143
&lt;/code>&lt;/pre>
&lt;p>The Wald test &lt;strong>rejects&lt;/strong> the SEM common factor restriction (chi2 = 8.49, p = 0.014). The spatial dependence in cigarette demand is not merely a nuisance in the error term &amp;mdash; it reflects &lt;strong>substantive economic spillovers&lt;/strong> across state borders. This is exactly what economic theory predicts: cross-border shopping creates genuine causal links between neighboring states' prices and local consumption.&lt;/p>
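&lt;p>As a sanity check on this rejection, we can evaluate the common-factor quantity $\theta + \rho \beta$ for price directly from the coefficients still in memory; under SEM it should be close to zero. The equation-qualified names below follow the &lt;code>xsmle&lt;/code> output:&lt;/p>
&lt;pre>&lt;code class="language-stata">* theta + rho*beta for logp from the Lee-Yu SDM fit: roughly
* -0.208 + 0.260*(-0.304), clearly far from zero, matching the test.
display _b[Wx:logp] + _b[Spatial:rho]*_b[Main:logp]
&lt;/code>&lt;/pre>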
&lt;h3 id="74-summary-of-specification-tests">7.4 Summary of specification tests&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">graph TD
SDM[&amp;quot;&amp;lt;b&amp;gt;Spatial Durbin Model (SDM)&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;RETAINED&amp;quot;]
SAR[&amp;quot;&amp;lt;b&amp;gt;SAR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;θ = 0&amp;lt;br/&amp;gt;Rejected&amp;lt;br/&amp;gt;p = 0.002&amp;quot;]
SLX[&amp;quot;&amp;lt;b&amp;gt;SLX&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ρ = 0&amp;lt;br/&amp;gt;Rejected&amp;lt;br/&amp;gt;p &amp;lt; 0.001&amp;quot;]
SEM[&amp;quot;&amp;lt;b&amp;gt;SEM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;θ + ρβ = 0&amp;lt;br/&amp;gt;Rejected&amp;lt;br/&amp;gt;p = 0.014&amp;quot;]
SDM --&amp;gt;|&amp;quot;chi2 = 12.87&amp;quot;| SAR
SDM --&amp;gt;|&amp;quot;chi2 = 61.04&amp;quot;| SLX
SDM --&amp;gt;|&amp;quot;chi2 = 8.49&amp;quot;| SEM
style SDM fill:#00d4c8,stroke:#141413,color:#141413
style SAR fill:#d97757,stroke:#141413,color:#fff
style SLX fill:#d97757,stroke:#141413,color:#fff
style SEM fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>All three Wald tests reject the restricted models. The SDM cannot be simplified to SAR (neighbors' X variables matter), SLX (the autoregressive feedback matters), or SEM (the spatial dependence is substantive, not a nuisance). The &lt;strong>full SDM is the appropriate specification&lt;/strong> for modeling cigarette demand across US states. This result confirms that spatial spillovers in cigarette consumption operate through multiple channels simultaneously: direct cross-border effects of neighbors' prices and incomes, and feedback effects through the spatial lag of consumption itself.&lt;/p>
&lt;hr>
&lt;h2 id="8-dynamic-spatial-panel-models">8. Dynamic spatial panel models&lt;/h2>
&lt;p>Cigarette consumption is well known to be &lt;strong>habit-forming&lt;/strong> &amp;mdash; past consumption is a strong predictor of current consumption because of nicotine addiction. Standard (static) spatial models ignore this temporal persistence, which may bias the spatial parameter estimates. Dynamic spatial panel models extend the SDM by including lagged values of consumption, allowing us to separate habit persistence from spatial spillovers.&lt;/p>
&lt;p>The &lt;code>xsmle&lt;/code> package supports three dynamic specifications through the &lt;code>dlag()&lt;/code> option:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;code>dlag()&lt;/code>&lt;/th>
&lt;th>Dynamic term added&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$\tau \cdot y_{i,t-1}$&lt;/td>
&lt;td>Temporal lag: own past consumption&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$\psi \cdot \sum_j w_{ij} y_{j,t-1}$&lt;/td>
&lt;td>Spatiotemporal lag: neighbors' past consumption&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>Both $\tau \cdot y_{i,t-1}$ and $\psi \cdot \sum_j w_{ij} y_{j,t-1}$&lt;/td>
&lt;td>Full dynamic: own + neighbors' past consumption&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The most general dynamic SDM (with &lt;code>dlag(3)&lt;/code>) extends the static equation from Section 5 by adding two lagged terms:&lt;/p>
&lt;p>$$y_{it} = \tau \, y_{i,t-1} + \psi \sum_{j=1}^{N} w_{ij} \, y_{j,t-1} + \rho \sum_{j=1}^{N} w_{ij} \, y_{jt} + x_{it} \beta + \sum_{j=1}^{N} w_{ij} \, x_{jt} \theta + \mu_i + \lambda_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, this equation says that a state&amp;rsquo;s cigarette consumption depends on its &lt;strong>own past consumption&lt;/strong> ($\tau y_{i,t-1}$, capturing habit persistence), the &lt;strong>average past consumption of its neighbors&lt;/strong> ($\psi W y_{t-1}$, capturing spatiotemporal diffusion), and all the contemporaneous spatial terms from the static SDM. The parameter $\tau$ measures how strongly last year&amp;rsquo;s smoking predicts this year&amp;rsquo;s &amp;mdash; think of it as the &amp;ldquo;addiction coefficient.&amp;rdquo; The parameter $\psi$ captures whether neighbors' past behavior diffuses across borders over time.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Symbol&lt;/th>
&lt;th>Meaning&lt;/th>
&lt;th>Code variable&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\tau$&lt;/td>
&lt;td>Temporal lag (habit persistence)&lt;/td>
&lt;td>&lt;code>[Temporal]tau&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\psi$&lt;/td>
&lt;td>Spatiotemporal lag (neighbors' past consumption)&lt;/td>
&lt;td>&lt;code>[Temporal]psi&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$y_{i,t-1}$&lt;/td>
&lt;td>Own consumption last year&lt;/td>
&lt;td>&lt;code>dlag(1)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$W y_{t-1}$&lt;/td>
&lt;td>Average neighbors' consumption last year&lt;/td>
&lt;td>&lt;code>dlag(2)&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="81-non-dynamic-sdm-baseline">8.1 Non-dynamic SDM (baseline)&lt;/h3>
&lt;p>We re-estimate the static SDM as a baseline for comparison with the dynamic specifications.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, fe type(both) wmat(Wst) mod(sdm) effects nsim(999) nolog
eststo SDM0
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial Durbin model with fixed-effects Number of obs = 1,380
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.3068973 .0282114 -10.88 0.000 -.3621907 -.2516039
logy | .0781427 .0481269 1.62 0.104 -.0161843 .1724697
Wx |
logp | -.2060671 .0649703 -3.17 0.002 -.3334065 -.0787277
logy | .1803542 .0885162 2.04 0.042 .0068656 .3538428
Spatial |
rho | .2649571 .0327948 8.08 0.000 .2006804 .3292339
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;h3 id="82-dynamic-sdm-with-temporal-lag-tau-cdot-y_it-1">8.2 Dynamic SDM with temporal lag ($\tau \cdot y_{i,t-1}$)&lt;/h3>
&lt;p>Adding the temporal lag of own consumption captures habit persistence &amp;mdash; the tendency for this year&amp;rsquo;s smoking to depend on last year&amp;rsquo;s smoking, holding prices and income constant.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, dlag(1) fe type(both) wmat(Wst) mod(sdm) effects nsim(999) nolog
eststo dySDM1
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dynamic Spatial Durbin model with fixed-effects Number of obs = 1,334
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.1516305 .0226714 -6.69 0.000 -.1960657 -.1071954
logy | .0285493 .0376124 0.76 0.448 -.0451697 .1022683
Wx |
logp | -.0714289 .0521683 -1.37 0.171 -.1736769 .0308190
logy | .0592735 .0706984 0.84 0.402 -.0792929 .1978399
Spatial |
rho | .1021753 .0307624 3.32 0.001 .0418821 .1624685
Temporal |
tau | .6543218 .0196285 33.33 0.000 .6158507 .6927928
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The temporal lag coefficient $\tau$ is &lt;strong>0.654&lt;/strong> (z = 33.33, p &amp;lt; 0.001) &amp;mdash; a very strong habit persistence effect. Controlling for last year&amp;rsquo;s consumption dramatically reduces the other coefficients: the own price effect drops from -0.307 to &lt;strong>-0.152&lt;/strong>, and the spatial autoregressive parameter $\rho$ falls from 0.265 to &lt;strong>0.102&lt;/strong>. This means that much of the apparent spatial dependence in the static SDM was actually capturing &lt;strong>temporal autocorrelation&lt;/strong> that manifests spatially. The spatial lag of neighbors' prices (&lt;code>[Wx]logp&lt;/code>) becomes insignificant (p = 0.171), suggesting that once habit persistence is controlled for, the direct cross-border price spillover weakens considerably.&lt;/p>
&lt;h3 id="83-dynamic-sdm-with-spatiotemporal-lag-psi-cdot-w-cdot-y_it-1">8.3 Dynamic SDM with spatiotemporal lag ($\psi \cdot W \cdot y_{i,t-1}$)&lt;/h3>
&lt;p>Instead of own past consumption, this specification includes the spatial lag of past consumption &amp;mdash; how much neighbors smoked last year.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, dlag(2) fe type(both) wmat(Wst) mod(sdm) effects nsim(999) nolog
eststo dySDM2
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dynamic Spatial Durbin model with fixed-effects Number of obs = 1,334
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.2981475 .0280193 -10.64 0.000 -.3530643 -.2432307
logy | .0637218 .0478561 1.33 0.183 -.0300745 .1575181
Wx |
logp | -.1425379 .0647518 -2.20 0.028 -.2694490 -.0156268
logy | .1320869 .0888243 1.49 0.137 -.0420055 .3061793
Spatial |
rho | .1523264 .0369871 4.12 0.000 .0798330 .2248199
Temporal |
psi | .2712508 .0339714 7.98 0.000 .2046680 .3378335
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatiotemporal lag coefficient $\psi$ is &lt;strong>0.271&lt;/strong> (z = 7.98, p &amp;lt; 0.001), indicating that neighbors' past consumption does have a positive effect on current consumption. However, this effect is weaker than the own temporal lag ($\tau = 0.654$ in the previous specification). The spatial autoregressive parameter drops to $\rho = 0.152$, and the own price coefficient stays close to the static SDM value at -0.298.&lt;/p>
&lt;h3 id="84-full-dynamic-sdm-tau-cdot-y_it-1--psi-cdot-w-cdot-y_it-1">8.4 Full dynamic SDM ($\tau \cdot y_{i,t-1} + \psi \cdot W \cdot y_{i,t-1}$)&lt;/h3>
&lt;p>The most general dynamic specification includes both the temporal lag and the spatiotemporal lag.&lt;/p>
&lt;pre>&lt;code class="language-stata">xsmle logc logp logy, dlag(3) fe type(both) wmat(Wst) mod(sdm) effects nsim(999) nolog
eststo dySDM3
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dynamic Spatial Durbin model with fixed-effects Number of obs = 1,334
------------------------------------------------------------------------------
logc | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Main |
logp | -.1498627 .0226523 -6.62 0.000 -.1942603 -.1054651
logy | .0271398 .0376004 0.72 0.470 -.0465556 .1008351
Wx |
logp | -.0636842 .0524156 -1.21 0.224 -.1664169 .0390485
logy | .0471982 .0712803 0.66 0.508 -.0925087 .1869052
Spatial |
rho | .0803516 .0322458 2.49 0.013 .0171509 .1435524
Temporal |
tau | .6389621 .0208541 30.64 0.000 .5980889 .6798353
psi | .0494172 .0325896 1.52 0.130 -.0144571 .1132915
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>In the full dynamic model, the temporal lag dominates: $\tau = 0.639$ (z = 30.64, p &amp;lt; 0.001), while the spatiotemporal lag $\psi = 0.049$ is &lt;strong>not statistically significant&lt;/strong> (p = 0.130). This indicates that a state&amp;rsquo;s own past consumption is the primary driver of temporal persistence, and neighbors' past consumption does not add meaningful additional information once own habit persistence is controlled for. The spatial autoregressive parameter further drops to $\rho = 0.080$, and the spatial lags of price and income become insignificant.&lt;/p>
&lt;h3 id="85-comparison-of-dynamic-models">8.5 Comparison of dynamic models&lt;/h3>
&lt;pre>&lt;code class="language-stata">esttab SDM0 dySDM1 dySDM2 dySDM3, mtitle(&amp;quot;SDM&amp;quot; &amp;quot;dySDM1&amp;quot; &amp;quot;dySDM2&amp;quot; &amp;quot;dySDM3&amp;quot;)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>SDM (static)&lt;/th>
&lt;th>dySDM1 ($\tau$)&lt;/th>
&lt;th>dySDM2 ($\psi$)&lt;/th>
&lt;th>dySDM3 ($\tau + \psi$)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>logp&lt;/code> (own)&lt;/td>
&lt;td>-0.307***&lt;/td>
&lt;td>-0.152***&lt;/td>
&lt;td>-0.298***&lt;/td>
&lt;td>-0.150***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>logy&lt;/code> (own)&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.029&lt;/td>
&lt;td>0.064&lt;/td>
&lt;td>0.027&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>W*logp&lt;/code>&lt;/td>
&lt;td>-0.206***&lt;/td>
&lt;td>-0.071&lt;/td>
&lt;td>-0.143**&lt;/td>
&lt;td>-0.064&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>W*logy&lt;/code>&lt;/td>
&lt;td>0.180**&lt;/td>
&lt;td>0.059&lt;/td>
&lt;td>0.132&lt;/td>
&lt;td>0.047&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$&lt;/td>
&lt;td>0.265***&lt;/td>
&lt;td>0.102***&lt;/td>
&lt;td>0.152***&lt;/td>
&lt;td>0.080**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\tau$ (own lag)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.654***&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.639***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\psi$ (spatiotemporal lag)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.271***&lt;/td>
&lt;td>0.049&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The comparison reveals a clear pattern. First, &lt;strong>habit persistence is the dominant dynamic force&lt;/strong>: $\tau$ is large and highly significant whether estimated alone (0.654) or jointly with $\psi$ (0.639), while $\psi$ loses significance once $\tau$ is included. Second, &lt;strong>controlling for habit persistence substantially attenuates spatial spillover estimates&lt;/strong>: the spatial autoregressive parameter $\rho$ falls from 0.265 (static) to 0.080 (full dynamic), and the spatial lags of price and income become insignificant. This suggests that the static SDM&amp;rsquo;s spillover estimates partly capture omitted temporal dynamics. Third, the &lt;strong>short-run price elasticity&lt;/strong> in the dynamic model (-0.150) is about half the static SDM estimate (-0.307), but the long-run price elasticity &amp;mdash; computed as $\beta / (1 - \tau)$ &amp;mdash; is approximately $-0.150 / (1 - 0.639) = -0.416$, close to the two-way FE benchmark of -0.402. The static SDM conflates short-run and long-run responses into a single coefficient.&lt;/p>
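&lt;p>Because the long-run elasticity $\beta / (1 - \tau)$ is a nonlinear combination of two estimated coefficients, it carries its own sampling uncertainty. A sketch of how to obtain it with a delta-method standard error via &lt;code>nlcom&lt;/code>, re-running the full dynamic fit first:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Long-run own-price elasticity beta/(1 - tau) from the full dynamic
* SDM, with a delta-method standard error.
quietly xsmle logc logp logy, dlag(3) fe type(both) wmat(Wst) mod(sdm) nolog
nlcom _b[Main:logp] / (1 - _b[Temporal:tau])
&lt;/code>&lt;/pre>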
&lt;hr>
&lt;h2 id="9-discussion">9. Discussion&lt;/h2>
&lt;p>This tutorial demonstrates that &lt;strong>spatial dependence matters&lt;/strong> for modeling cigarette demand across US states. The Wald tests in Section 7 conclusively reject all three restricted spatial models (SAR, SLX, SEM), confirming that the Spatial Durbin Model is the appropriate specification. The total price effect in the static SDM (-0.627) is more than 50% larger than the two-way FE estimate (-0.402), revealing that non-spatial models systematically understate the true price sensitivity of cigarette demand by ignoring cross-border spillovers.&lt;/p>
&lt;p>The dynamic extensions in Section 8 provide important nuance. Once habit persistence is controlled for ($\tau \approx 0.65$), the spatial autoregressive parameter drops by two-thirds (from 0.265 to 0.080), and many spatial lag coefficients lose statistical significance. This does not mean spatial dependence is unimportant &amp;mdash; rather, it means that the &lt;strong>static SDM conflates temporal and spatial dynamics&lt;/strong>. In the dynamic model, the short-run own price elasticity is -0.15 and the long-run elasticity is approximately -0.42, offering policymakers a clearer picture of how quickly cigarette taxation takes effect.&lt;/p>
&lt;p>From a policy perspective, these results carry a direct implication: &lt;strong>state-level tobacco taxation has cross-border spillover effects that policymakers must consider&lt;/strong>. When a single state raises its cigarette tax, the demand reduction is partially offset by cross-border shopping. However, when neighboring states raise taxes simultaneously, the total demand reduction is amplified. This supports the case for coordinated regional or federal tobacco taxation rather than isolated state-level policies. The finding that habit persistence is the dominant dynamic force ($\tau \approx 0.65$) also suggests that the full impact of a tax increase takes several years to materialize, as consumers slowly adjust their consumption habits.&lt;/p>
&lt;hr>
&lt;h2 id="10-summary-and-next-steps">10. Summary and next steps&lt;/h2>
&lt;p>This tutorial covered the complete workflow for spatial panel regression in Stata &amp;mdash; from loading a spatial weight matrix and estimating non-spatial benchmarks, through the full Spatial Durbin Model with Wald specification tests, to dynamic spatial extensions. The key takeaways are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-spatial models understate price sensitivity.&lt;/strong> The two-way FE price elasticity is -0.40, but the SDM total effect is -0.63 &amp;mdash; a 57% increase that reflects cross-border spillovers ignored by standard panel models.&lt;/li>
&lt;li>&lt;strong>The SDM cannot be simplified.&lt;/strong> All three Wald tests reject the SAR, SLX, and SEM restrictions, meaning that spatial dependence operates through multiple channels simultaneously: neighbors' consumption ($\rho$), neighbors' prices ($\theta_{logp}$), and neighbors' income ($\theta_{logy}$).&lt;/li>
&lt;li>&lt;strong>Habit persistence dominates temporal dynamics.&lt;/strong> The temporal lag coefficient $\tau \approx 0.65$ is large and robust, while the spatiotemporal lag $\psi$ loses significance once $\tau$ is included. Static spatial models overstate contemporaneous spillovers by absorbing temporal autocorrelation.&lt;/li>
&lt;li>&lt;strong>Short-run vs. long-run elasticities differ substantially.&lt;/strong> The dynamic SDM&amp;rsquo;s short-run price elasticity (-0.15) is less than half its long-run counterpart (-0.42), information that is lost in static specifications.&lt;/li>
&lt;/ul>
&lt;p>For further study, consider applying these methods to other spatial datasets or exploring alternative spatial specifications. The companion tutorial on &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_cross_section/">cross-sectional spatial regression&lt;/a> covers the spatial models available for single-period data, including the full taxonomy of SAR, SEM, SLX, SDM, SDEM, and SAC models. For datasets where unobserved common factors (macroeconomic shocks, regulatory changes) may drive cross-sectional dependence beyond what the spatial weight matrix captures, see the &lt;a href="https://carlos-mendez.org/post/stata_spxtivdfreg/">spatial dynamic panels with common factors&lt;/a> tutorial, which uses the &lt;code>spxtivdfreg&lt;/code> package to combine spatial lags with defactored IV estimation. For Python implementations of spatial econometrics, see the PySAL ecosystem and the &lt;code>spreg&lt;/code> package.&lt;/p>
&lt;hr>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Alternative weight matrix.&lt;/strong> Replace the binary contiguity matrix with an inverse-distance weight matrix. Re-estimate the SDM and compare the spatial autoregressive parameter $\rho$ and the indirect effects. Does the choice of weight matrix change the substantive conclusions about cross-border spillovers?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SAR vs. SDM direct comparison.&lt;/strong> Estimate a SAR model (&lt;code>mod(sar)&lt;/code> in &lt;code>xsmle&lt;/code>) with two-way fixed effects and the Lee-Yu correction. Compare its price elasticity to the SDM. Given that the Wald test rejected the SAR restriction, how different are the elasticity estimates in practice?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Subsample analysis.&lt;/strong> Split the sample into two periods (1963&amp;ndash;1977 and 1978&amp;ndash;1992) and estimate the SDM separately for each. Did the spatial dependence structure of cigarette demand change over time? What historical events (e.g., the Surgeon General&amp;rsquo;s reports, the rise of anti-smoking legislation) might explain differences between the two periods?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://link.springer.com/book/10.1007/978-3-030-53953-5" target="_blank" rel="noopener">Baltagi, B. H. (2021). &lt;em>Econometric Analysis of Panel Data&lt;/em> (6th ed.). Springer.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://link.springer.com/book/10.1007/978-3-642-40340-8" target="_blank" rel="noopener">Elhorst, J. P. (2014). &lt;em>Spatial Econometrics: From Cross-Sectional Data to Spatial Panels&lt;/em>. Springer.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1201/9781420064254" target="_blank" rel="noopener">LeSage, J. P. &amp;amp; Pace, R. K. (2009). &lt;em>Introduction to Spatial Econometrics&lt;/em>. Chapman &amp;amp; Hall/CRC.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2009.08.001" target="_blank" rel="noopener">Lee, L. F. &amp;amp; Yu, J. (2010). Estimation of spatial autoregressive panel data models with fixed effects. &lt;em>Journal of Econometrics&lt;/em>, 154(2), 165&amp;ndash;185.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1177/1536867X1701700109" target="_blank" rel="noopener">Belotti, F., Hughes, G., &amp;amp; Mortari, A. P. (2017). Spatial panel-data models using Stata. &lt;em>Stata Journal&lt;/em>, 17(1), 139&amp;ndash;180.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/quarcs-lab/data-open/tree/master/cigar" target="_blank" rel="noopener">Baltagi cigarette demand dataset &amp;ndash; QUARCS Lab open data repository.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Staggered DiD (Ex1)</title><link>https://carlos-mendez.org/post/r_staggered_did/</link><pubDate>Sun, 03 Sep 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_staggered_did/</guid><description>&lt;p>An introduction to difference in differences with multiple time periods and staggered treatment adoption. This tutorial is based on &lt;a href="https://github.com/Mixtape-Sessions/Advanced-DID/tree/main/Exercises/Exercise-1" target="_blank" rel="noopener">Exercise 1&lt;/a> of the Advanced DiD mixed tape session of Jonathan Roth. You can run and extend the analysis of this case study using &lt;a href="https://colab.research.google.com/drive/14LJEYHZTlw5wtIK0bR0lOza7lQiO0krc?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item><item><title>Staggered DiD</title><link>https://carlos-mendez.org/post/r_staggered_did1/</link><pubDate>Sat, 02 Sep 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_staggered_did1/</guid><description>&lt;p>An introduction to difference in differences with multiple time periods and staggered treatment adoption. You can run and extend the analysis of this case study using &lt;a href="https://colab.research.google.com/drive/1ucJmhyvb7pn01zyQji0xVZy_nZbo3_jB?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>.&lt;/p></description></item></channel></rss>