<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Fixed Effects and TWFE | Carlos Mendez</title><link>https://carlos-mendez.org/category/fixed-effects-and-twfe/</link><atom:link href="https://carlos-mendez.org/category/fixed-effects-and-twfe/index.xml" rel="self" type="application/rss+xml"/><description>Fixed Effects and TWFE</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Thu, 02 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Fixed Effects and TWFE</title><link>https://carlos-mendez.org/category/fixed-effects-and-twfe/</link></image><item><title>What Does TWFE Actually Do? Manual Demeaning and the FWL Theorem</title><link>https://carlos-mendez.org/post/r_demeaning_twfe/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_demeaning_twfe/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Two-way fixed effects (TWFE) is one of the most widely used estimators in applied economics. Packages like &lt;code>fixest&lt;/code> make it easy to estimate TWFE models with a single line of code. But what does the estimator actually &lt;em>do&lt;/em> to the data? Why do time-invariant regressors like geography or colonial origin get dropped? And if you run &lt;code>lm()&lt;/code> on manually demeaned data, should you get the same answer?&lt;/p>
&lt;p>This tutorial answers these questions by taking TWFE apart. We estimate a standard growth regression with country and time fixed effects, then replicate the exact same coefficients by hand &amp;mdash; subtracting country means and time means, then adding back the grand mean, before running ordinary least squares. The result is not an approximation: the coefficients match to 12 significant digits. The theoretical foundation for this equivalence is the &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong>, a fundamental result in econometrics that connects controlling for variables in a regression to projecting them out by residualization.&lt;/p>
&lt;p>We use a balanced panel of 150 countries observed over 8 time periods from the Barro convergence dataset. Along the way, we also discover why standard errors from naive &lt;code>lm()&lt;/code> on demeaned data are wrong &amp;mdash; and why you should always use a dedicated panel estimator for inference.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand what two-way fixed effects does mechanically to the data and why time-invariant regressors are dropped&lt;/li>
&lt;li>Implement the two-way demeaning formula step by step: subtract country means, subtract time means, add back the grand mean&lt;/li>
&lt;li>Verify the Frisch-Waugh-Lovell theorem empirically by comparing &lt;code>feols()&lt;/code> and &lt;code>lm()&lt;/code> coefficients&lt;/li>
&lt;li>Interpret why naive standard errors from &lt;code>lm()&lt;/code> on demeaned data are incorrect and how &lt;code>fixest&lt;/code> corrects them&lt;/li>
&lt;li>Visualize the demeaning transformation to build intuition about within-variation identification&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-frisch-waugh-lovell-theorem">2. The Frisch-Waugh-Lovell Theorem&lt;/h2>
&lt;p>Before diving into code, let us build the conceptual foundation. The FWL theorem answers a simple question: if you want to estimate the effect of $X$ on $Y$ while controlling for a set of variables $Z$, do you need to include everything in one big regression?&lt;/p>
&lt;p>Think of it like noise-canceling headphones. Instead of listening to music with the engine noise mixed in, the headphones first &lt;em>subtract out&lt;/em> the engine noise from what you hear. The result is the same music you would hear in a silent room. The FWL theorem says: instead of including all control variables in one regression, you can first &amp;ldquo;subtract them out&amp;rdquo; from both $Y$ and $X$, and then regress the residuals on each other. The coefficient on $X$ will be identical either way.&lt;/p>
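&lt;p>Before applying the theorem to panel data, it can be checked on a tiny simulated example. The sketch below (illustrative code with made-up data, not part of the tutorial&amp;rsquo;s analysis) regresses $y$ on $x$ and $z$ jointly, then residualizes both $y$ and $x$ on $z$ and regresses the residuals on each other &amp;mdash; the slope on $x$ is identical:&lt;/p>
&lt;pre>&lt;code class="language-r"># Minimal FWL check on simulated data
set.seed(123)
n &amp;lt;- 200
z &amp;lt;- rnorm(n)
x &amp;lt;- 0.5 * z + rnorm(n)
y &amp;lt;- 2 * x - z + rnorm(n)
b_joint &amp;lt;- coef(lm(y ~ x + z))[&amp;quot;x&amp;quot;]                        # one big regression
b_fwl   &amp;lt;- coef(lm(resid(lm(y ~ z)) ~ resid(lm(x ~ z))))[2]   # residual-on-residual
all.equal(unname(b_joint), unname(b_fwl))  # TRUE
&lt;/code>&lt;/pre>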
&lt;h3 id="applying-fwl-to-two-way-fixed-effects">Applying FWL to two-way fixed effects&lt;/h3>
&lt;p>In a TWFE model, the &amp;ldquo;controls&amp;rdquo; $Z$ are the full set of country dummies and time dummies. Including all these dummies is equivalent to subtracting group means. For a variable $x_{it}$ observed for country $i$ in period $t$, the &lt;strong>two-way demeaned&lt;/strong> version is:&lt;/p>
&lt;p>$$\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$$&lt;/p>
&lt;p>In words, this formula says: take the observed value, subtract the country average (to remove persistent country differences), subtract the time-period average (to remove common shocks), and add back the overall average (to correct for double-subtracting the grand mean).&lt;/p>
&lt;p>Here is what each symbol means:&lt;/p>
&lt;ul>
&lt;li>$x_{it}$ is the observed value for country $i$ at time $t$ &amp;mdash; in code, this is a single cell in the panel dataset&lt;/li>
&lt;li>$\bar{x}_{i \cdot}$ is the &lt;strong>country mean&lt;/strong> &amp;mdash; the average of $x$ across all periods for country $i$&lt;/li>
&lt;li>$\bar{x}_{\cdot t}$ is the &lt;strong>time mean&lt;/strong> &amp;mdash; the average of $x$ across all countries in period $t$&lt;/li>
&lt;li>$\bar{x}_{\cdot \cdot}$ is the &lt;strong>grand mean&lt;/strong> &amp;mdash; the overall average of $x$ across all observations&lt;/li>
&lt;/ul>
&lt;h3 id="why-add-back-the-grand-mean">Why add back the grand mean?&lt;/h3>
&lt;p>When we subtract both the country mean and the time mean, the grand mean gets subtracted &lt;em>twice&lt;/em> &amp;mdash; once as part of $\bar{x}_{i \cdot}$ and once as part of $\bar{x}_{\cdot t}$. Adding $\bar{x}_{\cdot \cdot}$ back corrects for this double subtraction. Think of it like a Venn diagram with two overlapping circles. If you subtract both circles entirely, the overlap region gets removed twice. Adding the overlap back once restores the correct amount. Without this correction, the demeaned variables would not be centered at zero, and the equivalence with TWFE would break.&lt;/p>
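&lt;p>A quick toy check (illustrative code on a made-up 3-by-4 panel, not the tutorial&amp;rsquo;s data) shows both facts at once: with the grand mean added back, the demeaned variable averages exactly zero, while omitting the correction shifts every observation by the grand mean:&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy balanced panel: 3 units x 4 periods
d &amp;lt;- data.frame(id = rep(1:3, each = 4), t = rep(1:4, times = 3))
set.seed(1)
d$x &amp;lt;- rnorm(12)
d$x_dm  &amp;lt;- d$x - ave(d$x, d$id) - ave(d$x, d$t) + mean(d$x)  # correct formula
d$x_bad &amp;lt;- d$x - ave(d$x, d$id) - ave(d$x, d$t)              # grand mean omitted
round(mean(d$x_dm), 12)               # 0
all.equal(mean(d$x_bad), -mean(d$x))  # TRUE: off by exactly the grand mean
&lt;/code>&lt;/pre>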
&lt;p>The FWL theorem guarantees this equivalence formally:&lt;/p>
&lt;p>$$\hat{\beta}_{\text{TWFE}} = \hat{\beta}_{\text{OLS on demeaned data}}$$&lt;/p>
&lt;p>In words, the slope coefficients from a regression that includes a full set of entity and time dummies are exactly equal to the slopes from OLS applied to the two-way demeaned data. Not approximately &amp;mdash; exactly. Let us verify this with real data.&lt;/p>
&lt;h2 id="3-setup">3. Setup&lt;/h2>
&lt;p>We need &lt;code>fixest&lt;/code> for TWFE estimation and &lt;code>tidyverse&lt;/code> for data wrangling and visualization. The &lt;code>scales&lt;/code> package provides axis formatting utilities.&lt;/p>
&lt;pre>&lt;code class="language-r">library(fixest)
library(tidyverse)
library(scales)
set.seed(42)
# Site color palette
STEEL_BLUE &amp;lt;- &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE &amp;lt;- &amp;quot;#d97757&amp;quot;
NEAR_BLACK &amp;lt;- &amp;quot;#141413&amp;quot;
TEAL &amp;lt;- &amp;quot;#00d4c8&amp;quot;
# Variables to demean
VARS_TO_DEMEAN &amp;lt;- c(&amp;quot;growth&amp;quot;, &amp;quot;ln_y_initial&amp;quot;, &amp;quot;log_s_k&amp;quot;,
                    &amp;quot;log_n_gd&amp;quot;, &amp;quot;log_hcap&amp;quot;, &amp;quot;gov_cons&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>We define the six variables that will be demeaned: the dependent variable (&lt;code>growth&lt;/code>) and all five regressors. Keeping them in a vector allows us to apply the demeaning formula programmatically rather than copying and pasting for each variable.&lt;/p>
&lt;h2 id="4-data-loading-and-panel-structure">4. Data Loading and Panel Structure&lt;/h2>
&lt;p>We load a balanced panel dataset with 150 countries observed over 8 time periods. The data comes from a Barro convergence exercise where the key question is whether poorer countries grow faster (conditional convergence). We convert &lt;code>id&lt;/code> and &lt;code>time&lt;/code> to factors so R treats them as categorical grouping variables.&lt;/p>
&lt;pre>&lt;code class="language-r">panel_data &amp;lt;- read.csv(&amp;quot;referenceMaterials/barro_convergence_panel.csv&amp;quot;)
panel_data$id &amp;lt;- factor(panel_data$id)
panel_data$time &amp;lt;- factor(panel_data$time)
cat(&amp;quot;Countries:&amp;quot;, nlevels(panel_data$id), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Time periods:&amp;quot;, nlevels(panel_data$time), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Total observations:&amp;quot;, nrow(panel_data), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Balanced panel:&amp;quot;, all(table(panel_data$id) == nlevels(panel_data$time)), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Countries: 150
Time periods: 8
Total observations: 1200
Balanced panel: TRUE
&lt;/code>&lt;/pre>
&lt;p>The dataset is a perfectly balanced panel of 150 countries observed across 8 time periods, yielding 1,200 total observations. A balanced panel means every country appears in every period with no missing cells &amp;mdash; the ideal setting for demonstrating the demeaning formula. The key variables are:&lt;/p>
&lt;ul>
&lt;li>&lt;code>growth&lt;/code>: annualized GDP per capita growth rate (dependent variable)&lt;/li>
&lt;li>&lt;code>ln_y_initial&lt;/code>: log of initial income (convergence term)&lt;/li>
&lt;li>&lt;code>log_s_k&lt;/code>: log of the investment share&lt;/li>
&lt;li>&lt;code>log_n_gd&lt;/code>: log of population growth plus depreciation&lt;/li>
&lt;li>&lt;code>log_hcap&lt;/code>: log of human capital&lt;/li>
&lt;li>&lt;code>gov_cons&lt;/code>: government consumption share&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="r_demeaning_twfe_panel_structure.png" alt="Panel structure: 150 countries across 8 time periods, all cells filled.">
&lt;em>Panel structure heatmap showing all 150 countries observed across 8 time periods with no missing cells.&lt;/em>&lt;/p>
&lt;p>The heatmap confirms the balanced structure. Every one of the 150 countries is observed in all 8 time periods. This balance simplifies our demeaning procedure because we can use the closed-form formula directly, without the iterative projection that unbalanced panels would require.&lt;/p>
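&lt;p>For intuition about that iterative approach, here is a hedged sketch of the alternating-projections idea (&lt;code>demean_iter&lt;/code> is our illustrative helper, not a &lt;code>fixest&lt;/code> function): repeatedly sweep out entity means and then time means until both are zero. On a balanced panel this reproduces the closed-form formula; on unbalanced panels it simply takes more passes:&lt;/p>
&lt;pre>&lt;code class="language-r"># Alternating projections: subtract entity means, then time means, repeat
demean_iter &amp;lt;- function(x, id, t, tol = 1e-12, max_iter = 100) {
  for (k in seq_len(max_iter)) {
    x_new &amp;lt;- x - ave(x, id)          # sweep out entity means
    x_new &amp;lt;- x_new - ave(x_new, t)   # then time means
    if (max(abs(x_new - x)) &amp;lt; tol) break  # stop once nothing changes
    x &amp;lt;- x_new
  }
  x_new
}
# On a balanced toy panel it matches the closed-form formula
set.seed(2)
d &amp;lt;- data.frame(id = rep(1:5, each = 4), t = rep(1:4, times = 5))
d$x &amp;lt;- rnorm(20)
closed &amp;lt;- d$x - ave(d$x, d$id) - ave(d$x, d$t) + mean(d$x)
all.equal(demean_iter(d$x, d$id, d$t), closed)  # TRUE
&lt;/code>&lt;/pre>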
&lt;h2 id="5-twfe-estimation-with-fixest">5. TWFE Estimation with fixest&lt;/h2>
&lt;p>The &lt;code>fixest&lt;/code> package makes TWFE estimation straightforward. The formula uses &lt;code>|&lt;/code> to separate the regressors (left) from the fixed effects dimensions (right). Writing &lt;code>| id + time&lt;/code> tells &lt;code>feols()&lt;/code> to absorb both country and time fixed effects. Internally, &lt;code>fixest&lt;/code> performs an efficient iterative demeaning algorithm to remove the fixed effects before estimating the slope coefficients.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_model &amp;lt;- feols(
  growth ~ ln_y_initial + log_s_k + log_n_gd + log_hcap + gov_cons | id + time,
  data = panel_data
)
summary(twfe_model)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OLS estimation, Dep. Var.: growth
Observations: 1,200
Fixed-effects: id: 150, time: 8
Standard-errors: Clustered (id)
                 Estimate Std. Error    t value   Pr(&amp;gt;|t|)
ln_y_initial    -0.055286   0.003744 -14.765156  &amp;lt; 2.2e-16 ***
log_s_k          0.019725   0.007583   2.601311   0.010223 *
log_n_gd        -0.049614   0.022168  -2.238117   0.026696 *
log_hcap         0.009081   0.014564   0.623549   0.533877
gov_cons        -0.102795   0.046398  -2.215501   0.028243 *
RMSE: 0.020517 Adj. R2: 0.755103
Within R2: 0.176777
&lt;/code>&lt;/pre>
&lt;p>The TWFE model reveals strong conditional beta-convergence &amp;mdash; the hypothesis that poorer countries tend to grow faster, so income levels converge over time. The coefficient on log initial income is -0.055 (t = -14.77, p &amp;lt; 2.2e-16), meaning that a 1% higher initial income is associated with 0.055 percentage points slower subsequent growth, after controlling for the other covariates. Investment has the expected positive effect (0.020, p = 0.010), population growth has the expected negative effect (-0.050, p = 0.027), and government consumption is significantly negative (-0.103, p = 0.028). Human capital is positive but not statistically significant (0.009, p = 0.534). The model explains 75.5% of total variation (Adj. R-squared = 0.755), though only 17.7% of the within-variation (Within R-squared = 0.177) &amp;mdash; typical for panel models where fixed effects absorb most cross-country heterogeneity.&lt;/p>
&lt;p>Now let us replicate these coefficients by hand.&lt;/p>
&lt;h2 id="6-manual-demeaning-----step-by-step">6. Manual Demeaning &amp;mdash; Step by Step&lt;/h2>
&lt;p>We now walk through the demeaning procedure one step at a time. The goal is to transform every variable so that the country and time effects are removed. We will then run plain OLS on the result and verify that the coefficients match.&lt;/p>
&lt;h3 id="step-1-country-means">Step 1: Country means&lt;/h3>
&lt;p>For each country, we compute the average of each variable across all time periods. This gives us one mean per country per variable &amp;mdash; capturing persistent country characteristics like geography, institutions, or long-run income level.&lt;/p>
&lt;pre>&lt;code class="language-r">country_means &amp;lt;- panel_data |&amp;gt;
  group_by(id) |&amp;gt;
  summarise(across(all_of(VARS_TO_DEMEAN), mean), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-2-time-means">Step 2: Time means&lt;/h3>
&lt;p>For each time period, we compute the average of each variable across all countries. These time means capture common shocks or trends that affect all countries in a given period &amp;mdash; for instance, a global recession or a worldwide productivity boom.&lt;/p>
&lt;pre>&lt;code class="language-r">time_means &amp;lt;- panel_data |&amp;gt;
  group_by(time) |&amp;gt;
  summarise(across(all_of(VARS_TO_DEMEAN), mean), .groups = &amp;quot;drop&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-3-grand-mean">Step 3: Grand mean&lt;/h3>
&lt;p>The grand mean is simply the overall average of each variable across all countries and all time periods. It is a single number per variable, and we need it to correct for the double subtraction.&lt;/p>
&lt;pre>&lt;code class="language-r">grand_means &amp;lt;- colMeans(panel_data[VARS_TO_DEMEAN])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> growth ln_y_initial log_s_k log_n_gd log_hcap gov_cons
-0.1243637 5.3643127 -1.5699117 -2.6569021 0.6645657 0.1461335
&lt;/code>&lt;/pre>
&lt;h3 id="step-4-apply-the-demeaning-formula">Step 4: Apply the demeaning formula&lt;/h3>
&lt;p>Now we bring everything together. We merge the country means and time means back into the main dataset, then apply the formula $\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$ programmatically to each variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Merge means
panel_dm &amp;lt;- panel_data |&amp;gt;
  left_join(
    country_means |&amp;gt; rename_with(~ paste0(.x, &amp;quot;_cmean&amp;quot;), all_of(VARS_TO_DEMEAN)),
    by = &amp;quot;id&amp;quot;
  ) |&amp;gt;
  left_join(
    time_means |&amp;gt; rename_with(~ paste0(.x, &amp;quot;_tmean&amp;quot;), all_of(VARS_TO_DEMEAN)),
    by = &amp;quot;time&amp;quot;
  )
# Apply demeaning formula
for (v in VARS_TO_DEMEAN) {
  panel_dm[[paste0(v, &amp;quot;_dm&amp;quot;)]] &amp;lt;-
    panel_dm[[v]] -
    panel_dm[[paste0(v, &amp;quot;_cmean&amp;quot;)]] -
    panel_dm[[paste0(v, &amp;quot;_tmean&amp;quot;)]] +
    grand_means[v]
}
&lt;/code>&lt;/pre>
&lt;p>Let us verify that the demeaning worked correctly. If the formula is implemented right, the mean of each demeaned variable should be approximately zero.&lt;/p>
&lt;pre>&lt;code class="language-text">Mean of demeaned variables (should be ~0):
growth_dm : -8.114169e-17
ln_y_initial_dm : 8.295170e-15
log_s_k_dm : -1.482923e-15
log_n_gd_dm : 1.599953e-15
log_hcap_dm : 5.384582e-17
gov_cons_dm : 1.832302e-16
&lt;/code>&lt;/pre>
&lt;p>All six demeaned variables have means on the order of $10^{-15}$ to $10^{-17}$ &amp;mdash; effectively zero within floating-point precision. The demeaning formula is implemented correctly: the within-variation that remains is purely the deviation from both entity-specific and time-specific patterns.&lt;/p>
&lt;h2 id="7-ols-on-the-demeaned-data">7. OLS on the Demeaned Data&lt;/h2>
&lt;p>With the demeaning complete, we run a standard OLS regression on the demeaned variables using base R&amp;rsquo;s &lt;code>lm()&lt;/code>. We deliberately use &lt;code>lm()&lt;/code> rather than &lt;code>feols()&lt;/code> to emphasize that this is plain ordinary least squares &amp;mdash; no fixed effects machinery is involved.&lt;/p>
&lt;pre>&lt;code class="language-r">manual_model &amp;lt;- lm(
  growth_dm ~ ln_y_initial_dm + log_s_k_dm + log_n_gd_dm + log_hcap_dm + gov_cons_dm,
  data = panel_dm
)
summary(manual_model)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients:
                  Estimate Std. Error t value  Pr(&amp;gt;|t|)
(Intercept)      5.035e-16  5.938e-04   0.000  1.00000
ln_y_initial_dm -5.529e-02  3.618e-03 -15.282  &amp;lt; 2e-16 ***
log_s_k_dm       1.972e-02  6.846e-03   2.881  0.00403 **
log_n_gd_dm     -4.961e-02  1.820e-02  -2.726  0.00651 **
log_hcap_dm      9.081e-03  1.370e-02   0.663  0.50751
gov_cons_dm     -1.028e-01  4.411e-02  -2.331  0.01994 *
Residual standard error: 0.02057 on 1194 degrees of freedom
Multiple R-squared: 0.1768
&lt;/code>&lt;/pre>
&lt;p>Two things stand out. First, the &lt;strong>intercept is 5.03 x 10^-16&lt;/strong> &amp;mdash; effectively zero. After proper two-way demeaning, the mean of all demeaned variables is near zero, so there is nothing left for the intercept to capture. This is a good sanity check: if the grand mean correction had been omitted, the intercept would be non-zero. Second, the &lt;strong>slope coefficients&lt;/strong> look identical to those from &lt;code>feols()&lt;/code>. But &amp;ldquo;look identical&amp;rdquo; is not the same as &amp;ldquo;are identical.&amp;rdquo; The next section proves they are.&lt;/p>
&lt;h2 id="8-coefficient-comparison-the-proof">8. Coefficient Comparison: The Proof&lt;/h2>
&lt;p>We now place the coefficients from both approaches side by side and compute their difference. If the FWL theorem holds, the slope coefficients must be identical up to floating-point precision.&lt;/p>
&lt;pre>&lt;code class="language-r">twfe_coefs &amp;lt;- coef(twfe_model)
manual_coefs &amp;lt;- coef(manual_model)[-1] # drop intercept
names(manual_coefs) &amp;lt;- names(twfe_coefs)
comparison &amp;lt;- data.frame(
  feols_TWFE = round(twfe_coefs, 12),
  Manual_OLS = round(manual_coefs, 12),
  Difference = twfe_coefs - manual_coefs
)
all.equal(unname(twfe_coefs), unname(manual_coefs))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Side-by-side coefficient comparison:
variable          feols_TWFE       manual_OLS        difference
ln_y_initial -0.055286009819  -0.055286009819  -4.163336342e-17
log_s_k       0.019724899416   0.019724899416   3.469446952e-18
log_n_gd     -0.049613972524  -0.049613972524  -2.775557562e-16
log_hcap      0.009081150621   0.009081150621   3.469446952e-17
gov_cons     -0.102795317426  -0.102795317426  -3.053113318e-16
Maximum absolute difference: 3.053113e-16
all.equal() test: TRUE
&lt;/code>&lt;/pre>
&lt;p>This is the central result of the tutorial. All five slope coefficients are identical to at least 12 significant digits. The largest difference is 3.05 x 10^-16 &amp;mdash; on the order of IEEE 754 double-precision machine epsilon (~2.2 x 10^-16). R&amp;rsquo;s &lt;code>all.equal()&lt;/code> function confirms equality within its default tolerance. This is not an approximation: it is an exact algebraic identity guaranteed by the Frisch-Waugh-Lovell theorem.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_coef_comparison.png" alt="TWFE and manual demeaning coefficients overlap perfectly for all five variables.">
&lt;em>Coefficient comparison: feols TWFE (blue circles) and manual demeaning OLS (orange triangles) occupy the exact same positions.&lt;/em>&lt;/p>
&lt;p>The dot plot makes the equivalence visually concrete. For each of the five covariates, the steel blue circle (feols TWFE) and warm orange triangle (manual demeaning OLS) occupy the exact same position. Government consumption has the largest coefficient in magnitude at -0.103, while the convergence parameter (log initial income) sits at -0.055. The dashed zero line helps distinguish positive from negative effects.&lt;/p>
&lt;h2 id="9-visualizing-what-demeaning-does">9. Visualizing What Demeaning Does&lt;/h2>
&lt;p>The coefficient equivalence is proven, but what does demeaning &lt;em>look like&lt;/em>? How does it change the data? The following visualizations build intuition about the transformation.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_scatter_before_after.png" alt="Raw data shows wide cross-country spread; demeaned data collapses to a narrow range around zero.">
&lt;em>Before vs after two-way demeaning: the wide cross-country spread (left) collapses to a narrow range around zero (right).&lt;/em>&lt;/p>
&lt;p>The faceted scatter plot tells the story. In the left panel (raw data), 10 countries are plotted with log initial income on the x-axis and growth on the y-axis. Each country&amp;rsquo;s observations form a distinct cluster at different income levels &amp;mdash; the x-axis spans roughly 3 to 9. In the right panel (after demeaning), the same data is compressed to approximately -0.5 to 0.3 around zero. The between-country income differences and common time trends have been stripped away, leaving only the &lt;strong>within-variation&lt;/strong> &amp;mdash; the deviations from each country&amp;rsquo;s own average and each period&amp;rsquo;s common trend. This is the variation that identifies the TWFE coefficient.&lt;/p>
&lt;h3 id="decomposing-the-formula-for-one-country">Decomposing the formula for one country&lt;/h3>
&lt;p>To see exactly how the formula works, let us trace each component for Country 1&amp;rsquo;s growth rate across all 8 periods.&lt;/p>
&lt;p>&lt;img src="r_demeaning_twfe_decomposition.png" alt="Observed values, country mean, time means, grand mean, and the demeaned residual for Country 1.">
&lt;em>Demeaning decomposition for Country 1: observed growth (blue), country mean (orange dashed), time means (teal), grand mean (gray), and the demeaned residual (black).&lt;/em>&lt;/p>
&lt;p>The decomposition makes the formula concrete. The observed growth values (blue line) decline from about -0.18 to -0.07. The country mean (orange dashed line) is a flat horizontal at -0.127 &amp;mdash; this is $\bar{x}_{i \cdot}$. The time means (teal dot-dash line) capture the common cross-country trend, declining from -0.189 to -0.076 &amp;mdash; this is $\bar{x}_{\cdot t}$. The grand mean (gray dotted) sits at -0.124 &amp;mdash; this is $\bar{x}_{\cdot \cdot}$. The demeaned series (black line) is the residual: $\tilde{x}_{it} = x_{it} - \bar{x}_{i \cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot \cdot}$. It fluctuates around zero, capturing only the within-country, within-period deviations that TWFE uses for identification.&lt;/p>
&lt;h2 id="10-a-caveat-standard-errors-differ">10. A Caveat: Standard Errors Differ&lt;/h2>
&lt;p>While the coefficients are identical, the &lt;strong>standard errors&lt;/strong> from &lt;code>lm()&lt;/code> on demeaned data are wrong. This is a critical practical point that many textbooks gloss over.&lt;/p>
&lt;pre>&lt;code class="language-r">se_naive &amp;lt;- summary(manual_model)$coefficients[-1, &amp;quot;Std. Error&amp;quot;]
se_feols_iid &amp;lt;- se(twfe_model, se = &amp;quot;iid&amp;quot;)
se_feols_cl &amp;lt;- se(twfe_model) # default: clustered by first FE
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Standard error comparison:
variable     se_naive_lm se_feols_iid se_feols_cluster
ln_y_initial  0.00361766   0.00388000       0.00374436
log_s_k       0.00684559   0.00734199       0.00758268
log_n_gd      0.01820117   0.01952104       0.02216773
log_hcap      0.01369872   0.01469209       0.01456365
gov_cons      0.04410809   0.04730660       0.04639822
&lt;/code>&lt;/pre>
&lt;p>Why do they differ? The &lt;code>lm()&lt;/code> function does not know that 157 degrees of freedom were consumed by estimating 150 country effects and 8 time effects (minus 1 for normalization). It uses $df = N \times T - K - 1 = 1{,}194$ (the extra 1 is the intercept) when the correct value is $df = N \times T - N - T + 1 - K = 1{,}038$. This makes naive SEs systematically too small &amp;mdash; the correct standard errors are 7&amp;ndash;22% larger, depending on the variable.&lt;/p>
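&lt;p>The degrees-of-freedom arithmetic is easy to spell out with this tutorial&amp;rsquo;s sample sizes (a simple check, not part of the original code):&lt;/p>
&lt;pre>&lt;code class="language-r"># Degrees-of-freedom accounting: N = 150 countries, T = 8 periods, K = 5 slopes
N &amp;lt;- 150; T_ &amp;lt;- 8; K &amp;lt;- 5
df_naive   &amp;lt;- N * T_ - K - 1             # what lm() uses: 1194 (includes intercept)
df_correct &amp;lt;- N * T_ - (N + T_ - 1) - K  # after absorbing 157 FE parameters: 1038
c(df_naive = df_naive, df_correct = df_correct)
&lt;/code>&lt;/pre>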
&lt;p>&lt;img src="r_demeaning_twfe_se_comparison.png" alt="Naive lm() SEs are systematically smaller than both feols variants.">
&lt;em>Standard error comparison: naive lm() (gray) systematically underestimates uncertainty compared to feols IID (orange) and clustered (blue).&lt;/em>&lt;/p>
&lt;p>The grouped bar chart makes the pattern clear. For every variable, the gray bars (naive &lt;code>lm()&lt;/code>) are shorter than the orange (feols IID) and blue (feols clustered) bars. The gap is most visible for &lt;code>log(n+g+d)&lt;/code>, where the naive SE is 0.0182 versus 0.0222 for clustered &amp;mdash; the clustered SE is 22% larger. The feols IID SEs correct for the degrees-of-freedom adjustment, while the clustered SEs additionally account for within-entity serial correlation. The practical lesson: &lt;strong>always use a dedicated panel estimator for inference&lt;/strong>, even though &lt;code>lm()&lt;/code> on demeaned data gives the correct point estimates.&lt;/p>
&lt;h2 id="11-discussion">11. Discussion&lt;/h2>
&lt;p>This tutorial has demonstrated a fundamental equivalence in econometrics. TWFE is not a special estimator &amp;mdash; it is ordinary least squares applied to data that has been demeaned by entity and time. The &lt;code>fixest&lt;/code> package automates this process efficiently, but the underlying operation is straightforward subtraction. The FWL theorem guarantees the equivalence mathematically, and our empirical verification confirms it to machine precision.&lt;/p>
&lt;p>Three practical insights emerge:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Demeaning reveals what FE can and cannot identify.&lt;/strong> Any variable that does not vary within a country over time (like geography or colonial history) has a country mean equal to itself. After demeaning, such a variable becomes zero everywhere and drops out of the regression. This is why fixed effects models cannot estimate the effect of time-invariant characteristics.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The grand mean correction is not optional.&lt;/strong> Omitting the $+ \bar{x}_{\cdot \cdot}$ term in the demeaning formula would double-subtract the overall level, producing a non-zero intercept and subtly wrong demeaned values. The correction is algebraically necessary for the FWL equivalence to hold.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Correct coefficients do not mean correct inference.&lt;/strong> The &lt;code>lm()&lt;/code> standard errors are too small because they ignore the degrees of freedom consumed by the absorbed fixed effects. In applied work, this means artificially narrow confidence intervals and inflated t-statistics. Always use &lt;code>feols()&lt;/code> or an equivalent panel estimator for standard errors and hypothesis testing.&lt;/p>
&lt;/li>
&lt;/ol>
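&lt;p>The first insight is easy to confirm in code (a toy sketch with a made-up &amp;ldquo;geography&amp;rdquo; variable): anything constant within each unit is annihilated by the two-way demeaning formula, leaving no variation to identify a coefficient:&lt;/p>
&lt;pre>&lt;code class="language-r"># A time-invariant regressor is wiped out by two-way demeaning
d &amp;lt;- data.frame(id = rep(1:3, each = 4), t = rep(1:4, times = 3))
d$geo &amp;lt;- rep(c(1.2, -0.4, 0.7), each = 4)  # constant within each unit
geo_dm &amp;lt;- d$geo - ave(d$geo, d$id) - ave(d$geo, d$t) + mean(d$geo)
all(abs(geo_dm) &amp;lt; 1e-12)  # TRUE
&lt;/code>&lt;/pre>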
&lt;h2 id="12-summary-and-next-steps">12. Summary and Next Steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>TWFE estimation via &lt;code>feols()&lt;/code> and OLS on manually demeaned data produce identical coefficients &amp;mdash; the maximum difference across 5 coefficients is 3.05 x 10^-16, confirming the FWL theorem.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The demeaning formula subtracts entity means and time means, then adds back the grand mean to correct for double subtraction. After demeaning, all variable means are effectively zero (order of 10^-15).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Within R-squared of 0.177 versus the overall Adjusted R-squared of 0.755 shows that most variation in growth is absorbed by the fixed effects, not by the regressors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Naive &lt;code>lm()&lt;/code> standard errors understate uncertainty because they ignore the 157 degrees of freedom consumed by the absorbed fixed effects; the correct standard errors are 7&amp;ndash;22% larger. Always use a dedicated panel estimator for inference.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The dataset is simulated, so coefficient values reflect the data-generating process rather than real-world economic dynamics.&lt;/li>
&lt;li>The tutorial assumes a balanced panel. With unbalanced panels, the simple closed-form demeaning still works algebraically, but &lt;code>fixest&lt;/code> uses a more efficient iterative algorithm.&lt;/li>
&lt;li>The SE comparison covers only IID and entity-clustered SEs. Other corrections (heteroskedasticity-robust, Driscoll-Kraay for cross-sectional dependence) may be relevant in applied work.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Apply the demeaning logic to understand why specific variables drop out of your own FE models.&lt;/li>
&lt;li>Explore heterogeneous treatment effects with interaction-weighted TWFE estimators.&lt;/li>
&lt;li>Read Cunningham (2021), &lt;em>Causal Inference: The Mixtape&lt;/em>, Chapter 9, for the connection between TWFE demeaning and difference-in-differences designs.&lt;/li>
&lt;/ul>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Omit the grand mean correction.&lt;/strong> Modify the demeaning formula to skip the $+ \bar{x}_{\cdot \cdot}$ term. Run &lt;code>lm()&lt;/code> on the incorrectly demeaned data. What happens to the intercept? Do the slope coefficients still match the TWFE estimates? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>One-way demeaning.&lt;/strong> Repeat the exercise using only entity demeaning (subtract country means, skip time means). Compare the coefficients to a one-way FE model (&lt;code>feols(growth ~ ... | id)&lt;/code>). Verify the equivalence and examine how the coefficients change compared to the two-way specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Visualize a different variable.&lt;/strong> Recreate the demeaning decomposition plot (Section 9) for &lt;code>log_s_k&lt;/code> (investment share) instead of &lt;code>growth&lt;/code>. Does the country mean, time mean, or within-variation dominate for this variable? What does this tell you about the source of variation that identifies its coefficient?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Frisch, R. and Waugh, F.V. (1933). &amp;ldquo;Partial Time Regressions as Compared with Individual Trends.&amp;rdquo; &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Lovell, M.C. (1963). &amp;ldquo;Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis.&amp;rdquo; &lt;em>Journal of the American Statistical Association&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Berge, L. (2018). &lt;em>fixest: Fast Fixed-Effects Estimations&lt;/em>. R package. &lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">CRAN&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cunningham, S. (2021). &lt;em>Causal Inference: The Mixtape&lt;/em>. Yale University Press. &lt;a href="https://mixtape.scunning.com/" target="_blank" rel="noopener">Online edition&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Barro, R.J. and Sala-i-Martin, X. (2004). &lt;em>Economic Growth&lt;/em>. 2nd edition. MIT Press.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content in this post may still have errors. Caution is needed when applying the contents of this post to real research projects.&lt;/p></description></item><item><title>High-Dimensional Fixed Effects Regression: An Introduction in Python</title><link>https://carlos-mendez.org/post/python_pyfixest/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pyfixest/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you want to know whether union membership raises wages. You run a regression and find a strong positive association: union workers earn 18% more. But wait &amp;mdash; what if the workers who join unions are also more motivated, more experienced, or work in industries that pay well regardless? That 18% could be mostly &lt;em>selection&lt;/em>, not a genuine union effect. This is one of the most pervasive problems in empirical research: &lt;strong>omitted variable bias&lt;/strong>. Any time your data is grouped &amp;mdash; by individual, firm, country, or time period &amp;mdash; unobserved characteristics that differ across groups can contaminate your estimates, leading to conclusions that look solid but are fundamentally misleading.&lt;/p>
&lt;p>&lt;strong>Fixed effects regression&lt;/strong> is the workhorse solution. By absorbing all time-invariant group-level heterogeneity &amp;mdash; a worker&amp;rsquo;s innate ability, a firm&amp;rsquo;s management culture, a country&amp;rsquo;s institutional quality &amp;mdash; fixed effects eliminate an entire class of confounders in one step. The result is striking: in the wage panel we analyze below, the apparent union premium drops from 18% to just 7% once we account for individual fixed effects, revealing that more than half the raw association was driven by who selects into unions, not what unions do. This kind of dramatic correction is routine in applied research, which is why fixed effects appear in virtually every empirical paper that uses panel data.&lt;/p>
&lt;p>Modern implementations make this computationally painless. Rather than estimating thousands of dummy variables, they use a &lt;em>demeaning&lt;/em> algorithm that sweeps out group means before estimation. &lt;a href="https://pyfixest.org/" target="_blank" rel="noopener">PyFixest&lt;/a> brings this approach to Python with a concise formula syntax inspired by R&amp;rsquo;s &lt;code>fixest&lt;/code> package &amp;mdash; the most popular fixed effects library in the R ecosystem. In this tutorial we use PyFixest to build from simple OLS through one-way and two-way fixed effects, compare inference methods, perform instrumental variable estimation, analyze a real wage panel, and run event study designs for difference-in-differences &amp;mdash; all with a few lines of code. Along the way, we will see &lt;em>why&lt;/em> fixed effects work (by manually reproducing them via demeaning), discover what they &lt;em>cannot&lt;/em> do (estimate time-invariant effects like education), learn when standard TWFE breaks down in staggered treatment designs, and apply the CRE/Mundlak approach to recover the very coefficients that one-way FE absorb.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why unobserved group heterogeneity biases OLS and how fixed effects remove that bias&lt;/li>
&lt;li>Implement one-way and two-way fixed effects regressions using PyFixest&amp;rsquo;s formula syntax&lt;/li>
&lt;li>Compare multiple model specifications efficiently using PyFixest&amp;rsquo;s stepwise operators&lt;/li>
&lt;li>Assess robustness by computing standard errors under different clustering assumptions&lt;/li>
&lt;li>Decompose panel variation into between and within components to diagnose what FE can and cannot estimate&lt;/li>
&lt;li>Frame a real wage panel through the Mincer equation and its panel extensions&lt;/li>
&lt;li>Recover time-invariant coefficients (education, race) using the CRE/Mundlak approach&lt;/li>
&lt;li>Apply fixed effects to event study designs with staggered treatment adoption&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Content outline.&lt;/strong> Sections 2&amp;ndash;4 set up the environment and establish an OLS baseline. Sections 5&amp;ndash;6 introduce fixed effects &amp;mdash; first through PyFixest&amp;rsquo;s absorption syntax, then by reproducing the same result manually via demeaning, building intuition for what FE actually does to the data. Section 7 shows how to compare multiple specifications in a single call, and Section 8 explores how standard error choices affect inference. Section 9 extends to two-way FE, and Section 10 combines FE with instrumental variables. Section 11 is the core case study: a real wage panel framed by the Mincer equation, where we decompose within and between variation, see how one-way FE absorb time-invariant variables like education, stress-test the common trends assumption with group-specific time effects, and recover education&amp;rsquo;s coefficient through the CRE/Mundlak approach. Section 12 applies FE to event study designs, with a careful discussion of why period −1 serves as the universal baseline. Throughout, each section builds on the previous &amp;mdash; the manual demeaning in Section 6 explains why education vanishes in Section 11, and the stepwise comparison in Section 7 foreshadows the specification table in Section 11.&lt;/p>
&lt;h2 id="2-setup-and-imports">2. Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required packages if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install pyfixest
&lt;/code>&lt;/pre>
&lt;p>The following code imports PyFixest and standard data science libraries. PyFixest provides &lt;a href="https://pyfixest.org/reference/estimation.feols.html" target="_blank" rel="noopener">feols()&lt;/a> as its main estimation function, which accepts R-style formulas with a pipe &lt;code>|&lt;/code> separator for fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyfixest as pf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>&lt;strong>Dark theme figure styling&lt;/strong> (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="3-data-loading-and-exploration">3. Data loading and exploration&lt;/h2>
&lt;h3 id="31-loading-the-dataset">3.1 Loading the dataset&lt;/h3>
&lt;p>PyFixest includes a built-in synthetic dataset designed for demonstrating fixed effects regression. We load it with &lt;a href="https://pyfixest.org/reference/utils.get_data.html" target="_blank" rel="noopener">pf.get_data()&lt;/a>, which returns a DataFrame with outcome variables (&lt;code>Y&lt;/code>, &lt;code>Y2&lt;/code>), covariates (&lt;code>X1&lt;/code>, &lt;code>X2&lt;/code>), fixed effect identifiers (&lt;code>f1&lt;/code>, &lt;code>f2&lt;/code>, &lt;code>f3&lt;/code>, &lt;code>group_id&lt;/code>), instruments (&lt;code>Z1&lt;/code>, &lt;code>Z2&lt;/code>), and sampling weights.&lt;/p>
&lt;pre>&lt;code class="language-python">data = pf.get_data()
print(f&amp;quot;Dataset shape: {data.shape}&amp;quot;)
print(f&amp;quot;\nColumn names: {list(data.columns)}&amp;quot;)
print(data.head())
print(data.describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset shape: (1000, 11)
Column names: ['Y', 'Y2', 'X1', 'X2', 'f1', 'f2', 'f3', 'group_id', 'Z1', 'Z2', 'weights']
          Y        Y2   X1        X2  ...  group_id        Z1        Z2   weights
0       NaN  2.357103  0.0  0.457858  ...       9.0 -0.330607  1.054826  0.661478
1 -1.458643  5.163147  NaN -4.998406  ...       8.0       NaN -4.113690  0.772732
2  0.169132  0.751140  2.0  1.558480  ...      16.0  1.207778  0.465282  0.990929
3  3.319513 -2.656368  1.0  1.560402  ...       3.0  2.869997  0.467570  0.021123
4  0.134420 -1.866416  2.0 -3.472232  ...      14.0  0.835819 -3.115669  0.790815
              Y        Y2       X1  ...        Z1        Z2   weights
count   999.000  1000.000  999.000  ...   999.000  1000.000  1000.000
mean     -0.127    -0.309    1.043  ...     1.040    -0.113     0.495
std       2.305     5.584    0.808  ...     1.307     3.172     0.291
min      -6.536   -16.974    0.000  ...    -2.825   -11.576     0.000
25%      -1.732    -4.029    0.000  ...     0.121    -2.252     0.248
50%      -0.211    -0.459    1.000  ...     1.040    -0.064     0.469
75%       1.576     3.528    2.000  ...     1.946     2.028     0.746
max       6.907    17.156    2.000  ...     4.601    11.420     1.000
&lt;/code>&lt;/pre>
&lt;p>The dataset has 1,000 observations across 11 columns. The outcome &lt;code>Y&lt;/code> has a mean of -0.127 and standard deviation of 2.305, while &lt;code>X1&lt;/code> takes discrete values 0, 1, and 2. A few observations have missing values (1 missing in &lt;code>Y&lt;/code>, &lt;code>X1&lt;/code>, &lt;code>f1&lt;/code>, and &lt;code>Z1&lt;/code>), which PyFixest handles automatically by dropping incomplete cases. The &lt;code>group_id&lt;/code> variable identifies the group each observation belongs to, and this is the dimension we will control for with fixed effects.&lt;/p>
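&lt;p>Before estimation, the same complete-case rule can be previewed with plain pandas. Here is a minimal sketch on a toy frame (hypothetical values, not the PyFixest dataset) whose columns mirror the ones above:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

# Toy frame with one missing value each in Y and X1 (hypothetical data)
toy = pd.DataFrame({
    'Y': [1.0, np.nan, 2.5, 0.3, -0.7],
    'X1': [0.0, 1.0, np.nan, 2.0, 1.0],
    'group_id': [1, 1, 2, 2, 2],
})

# Count missing values per column as a pre-estimation check
print(toy.isna().sum())

# Listwise deletion keeps only complete rows, matching the drop of
# incomplete cases described above (998 of 1,000 rows survive there)
complete = toy.dropna()
print(len(complete))  # 3
&lt;/code>&lt;/pre>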
&lt;h3 id="32-visualizing-group-structure">3.2 Visualizing group structure&lt;/h3>
&lt;p>Before estimating any model, it helps to see how the relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code> varies across groups. If groups have different average levels of &lt;code>Y&lt;/code>, standard OLS will mix within-group variation (what we care about) with between-group variation (which may reflect confounders).&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 6))
groups = data[&amp;quot;group_id&amp;quot;].unique()
n_groups = len(groups)
cmap = plt.cm.tab20
for i, g in enumerate(sorted(groups)):
    subset = data[data[&amp;quot;group_id&amp;quot;] == g]
    ax.scatter(subset[&amp;quot;X1&amp;quot;], subset[&amp;quot;Y&amp;quot;], alpha=0.5, s=20,
               color=cmap(i / n_groups),
               label=f&amp;quot;Group {g}&amp;quot; if i &amp;lt; 5 else None)
ax.set_xlabel(&amp;quot;X1&amp;quot;, fontsize=13)
ax.set_ylabel(&amp;quot;Y&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Outcome (Y) vs Covariate (X1) by Group&amp;quot;, fontsize=15, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(title=&amp;quot;Group (first 5)&amp;quot;, fontsize=9)
plt.savefig(&amp;quot;pyfixest_scatter_by_group.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_scatter_by_group.png" alt="Scatter plot of Y versus X1 colored by group membership, showing different intercepts across groups.">&lt;/p>
&lt;p>The scatter plot reveals that different groups have distinct average levels of &lt;code>Y&lt;/code> &amp;mdash; some clusters sit higher and others lower on the vertical axis. Within each group, however, &lt;code>Y&lt;/code> tends to decrease as &lt;code>X1&lt;/code> increases. This visual separation between groups is exactly the kind of heterogeneity that fixed effects regression absorbs, allowing us to isolate the within-group relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code>.&lt;/p>
&lt;h2 id="4-simple-ols-baseline-no-fixed-effects">4. Simple OLS baseline (no fixed effects)&lt;/h2>
&lt;p>To establish a benchmark, we first estimate a standard OLS regression of &lt;code>Y&lt;/code> on &lt;code>X1&lt;/code> without any fixed effects. The model is:&lt;/p>
&lt;p>$$Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i$$&lt;/p>
&lt;p>In words, we assume the outcome $Y$ is a linear function of $X_1$ plus random noise $\epsilon$. This gives us the overall association, mixing both within-group and between-group variation. We use heteroskedasticity-robust standard errors (&lt;code>HC1&lt;/code>) to account for non-constant variance.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_ols = pf.feols(&amp;quot;Y ~ X1&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(fit_ols.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.919 | 0.112 | 8.223 | 0.000 | 0.699 | 1.138 |
| X1 | -1.000 | 0.082 | -12.134 | 0.000 | -1.162 | -0.838 |
---
RMSE: 2.158 R2: 0.123
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS estimates a coefficient of -1.000 on &lt;code>X1&lt;/code> (SE = 0.082, p &amp;lt; 0.001), with an R-squared of 0.123. This means that a one-unit increase in &lt;code>X1&lt;/code> is associated with a 1.0-point decrease in &lt;code>Y&lt;/code> on average. However, this estimate ignores group-level differences &amp;mdash; it could be biased if &lt;code>X1&lt;/code> correlates with unobserved group characteristics. The model explains only 12.3% of the total variation in &lt;code>Y&lt;/code>, leaving substantial unexplained heterogeneity. Let us now see how fixed effects change the picture.&lt;/p>
&lt;h2 id="5-one-way-fixed-effects">5. One-way fixed effects&lt;/h2>
&lt;p>The following diagram illustrates the core problem fixed effects solve. When an unobserved group characteristic correlates with both the covariate and the outcome, it creates a &lt;em>backdoor path&lt;/em> that biases OLS. Fixed effects block this path by absorbing all group-level variation.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Group Characteristics&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(unobserved)&amp;quot;] --&amp;gt;|&amp;quot;correlates&amp;quot;| X[&amp;quot;&amp;lt;b&amp;gt;X1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(covariate)&amp;quot;]
A --&amp;gt;|&amp;quot;affects&amp;quot;| Y[&amp;quot;&amp;lt;b&amp;gt;Y&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
X --&amp;gt;|&amp;quot;causal effect β = ?&amp;quot;| Y
FE[&amp;quot;&amp;lt;b&amp;gt;Fixed Effects&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(absorbs A)&amp;quot;] -.-&amp;gt;|&amp;quot;blocks backdoor&amp;quot;| A
style A fill:#d97757,stroke:#141413,color:#fff
style X fill:#6a9bcc,stroke:#141413,color:#fff
style Y fill:#00d4c8,stroke:#141413,color:#fff
style FE fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5
&lt;/code>&lt;/pre>
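&lt;p>The backdoor logic can be made concrete with a small simulation. The sketch below uses synthetic data with made-up parameters (independent of the tutorial dataset): a group-level characteristic &lt;code>A&lt;/code> feeds into both the covariate and the outcome, so the pooled slope overstates the true effect of 1.0, while the within-group (demeaned) slope recovers it:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_groups, n_per = 50, 20

# Unobserved group characteristic A feeds into both X1 and Y (backdoor path)
A = rng.normal(size=n_groups).repeat(n_per)
X1 = 0.8 * A + rng.normal(size=A.size)
Y = 1.0 * X1 + 2.0 * A + rng.normal(size=A.size)  # true slope is 1.0
df = pd.DataFrame({'Y': Y, 'X1': X1, 'g': np.arange(n_groups).repeat(n_per)})

# Pooled OLS slope: contaminated by the backdoor path through A
pooled = np.polyfit(df['X1'], df['Y'], 1)[0]

# Within slope: demeaning by group absorbs A, blocking the backdoor
Yd = df['Y'] - df.groupby('g')['Y'].transform('mean')
Xd = df['X1'] - df.groupby('g')['X1'].transform('mean')
within = np.polyfit(Xd, Yd, 1)[0]

print(round(pooled, 2), round(within, 2))  # pooled is biased upward, within is near 1.0
&lt;/code>&lt;/pre>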
&lt;h3 id="51-absorbing-group-heterogeneity">5.1 Absorbing group heterogeneity&lt;/h3>
&lt;p>Fixed effects regression controls for all time-invariant group characteristics by effectively adding a separate intercept for each group. In PyFixest, we specify fixed effects after a pipe &lt;code>|&lt;/code> in the formula. The syntax &lt;code>Y ~ X1 | group_id&lt;/code> means: regress &lt;code>Y&lt;/code> on &lt;code>X1&lt;/code>, absorbing &lt;code>group_id&lt;/code> fixed effects. Think of this as asking: &amp;ldquo;within each group, what is the relationship between &lt;code>X1&lt;/code> and &lt;code>Y&lt;/code>?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-python">fit_fe1 = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(fit_fe1.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: group_id
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -1.019 | 0.083 | -12.234 | 0.000 | -1.182 | -0.856 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126
&lt;/code>&lt;/pre>
&lt;p>With &lt;code>group_id&lt;/code> fixed effects absorbed, the coefficient on &lt;code>X1&lt;/code> shifts slightly to -1.019 (SE = 0.083). The within R-squared of 0.126 tells us how much of the within-group variation in &lt;code>Y&lt;/code> is explained by &lt;code>X1&lt;/code> after removing group means. Compared to the pooled OLS estimate of -1.000, the fixed effects estimate is similar in this synthetic dataset, suggesting that &lt;code>X1&lt;/code> does not strongly correlate with group-level unobservables here. In real data, the shift can be dramatic &amp;mdash; that gap is the omitted variable bias that fixed effects remove.&lt;/p>
&lt;h3 id="52-equivalence-with-dummy-variables">5.2 Equivalence with dummy variables&lt;/h3>
&lt;p>Under the hood, fixed effects absorption produces the same point estimates as including explicit dummy variables for each group. PyFixest&amp;rsquo;s &lt;code>C()&lt;/code> operator creates these dummies. The key advantage of absorption is computational: with thousands of groups, estimating thousands of dummy coefficients is slow and memory-intensive, while demeaning is fast.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_dummy = pf.feols(&amp;quot;Y ~ X1 + C(group_id)&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(f&amp;quot;X1 coefficient (FE absorption): {fit_fe1.coef()['X1']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (dummy vars): {fit_dummy.coef()['X1']:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">X1 coefficient (FE absorption): -1.0190
X1 coefficient (dummy vars): -1.0190
&lt;/code>&lt;/pre>
&lt;p>Both approaches yield identical coefficients of -1.0190 on &lt;code>X1&lt;/code>, confirming that FE absorption and dummy variable inclusion are algebraically equivalent. The absorption approach simply avoids estimating and storing the hundreds or thousands of group intercepts that are typically not of interest &amp;mdash; what econometricians call &lt;em>nuisance parameters&lt;/em>.&lt;/p>
&lt;h2 id="6-understanding-fixed-effects-via-manual-demeaning">6. Understanding fixed effects via manual demeaning&lt;/h2>
&lt;h3 id="61-the-within-transformation">6.1 The within transformation&lt;/h3>
&lt;p>To build intuition for what fixed effects actually do, we can perform the &lt;em>within transformation&lt;/em> manually. For each observation, we subtract its group mean from both &lt;code>Y&lt;/code> and &lt;code>X1&lt;/code>. This removes all between-group variation, leaving only the deviations from each group&amp;rsquo;s average. Regressing the demeaned &lt;code>Y&lt;/code> on the demeaned &lt;code>X1&lt;/code> recovers the same coefficient as the FE estimator. It is like centering each group at the origin &amp;mdash; the only variation left is how individuals within a group differ from their group&amp;rsquo;s typical level.&lt;/p>
&lt;p>The fixed effects estimator solves:&lt;/p>
&lt;p>$$\hat{\beta}_{FE} = \left(\sum_{i=1}^{N} \ddot{X}_i' \ddot{X}_i\right)^{-1} \sum_{i=1}^{N} \ddot{X}_i' \ddot{Y}_i$$&lt;/p>
&lt;p>where $\ddot{X}_i$ and $\ddot{Y}_i$ stack the demeaned observations $\ddot{X}_{it} = X_{it} - \bar{X}_i$ and $\ddot{Y}_{it} = Y_{it} - \bar{Y}_i$ for unit $i$. In words, the FE estimator uses only within-group deviations from group means, eliminating any bias from group-level confounders that are constant over time.&lt;/p>
&lt;pre>&lt;code class="language-python"># Manual demeaning (within transformation)
data_dm = data.copy()
for col in [&amp;quot;Y&amp;quot;, &amp;quot;X1&amp;quot;]:
    group_means = data_dm.groupby(&amp;quot;group_id&amp;quot;)[col].transform(&amp;quot;mean&amp;quot;)
    data_dm[f&amp;quot;{col}_dm&amp;quot;] = data_dm[col] - group_means
fit_demeaned = pf.feols(&amp;quot;Y_dm ~ X1_dm&amp;quot;, data=data_dm, vcov=&amp;quot;HC1&amp;quot;)
print(f&amp;quot;X1 coefficient (FE absorption): {fit_fe1.coef()['X1']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (manual demean): {fit_demeaned.coef()['X1_dm']:.4f}&amp;quot;)
print(f&amp;quot;X1 coefficient (OLS, no FE): {fit_ols.coef()['X1']:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">X1 coefficient (FE absorption): -1.0190
X1 coefficient (manual demean): -1.0190
X1 coefficient (OLS, no FE): -1.0001
&lt;/code>&lt;/pre>
&lt;p>The manual demeaning produces a coefficient of -1.0190, exactly matching the FE absorption result. The pooled OLS gave -1.0001 by comparison. This confirms that fixed effects regression is mathematically equivalent to subtracting group means from every variable before running OLS. The small gap between -1.019 (FE) and -1.000 (OLS) is the contribution of between-group variation, which demeaning removes; when a covariate correlates strongly with group-level confounders, that gap becomes the large omitted variable bias that fixed effects are designed to eliminate.&lt;/p>
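&lt;p>The closed-form expression from Section 6.1 can also be verified directly in NumPy. The sketch below runs on a small simulated panel (hypothetical data; the &lt;code>demean&lt;/code> helper is defined here and is not part of PyFixest) and confirms that the matrix formula and OLS on demeaned data return the same slope:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
G, T = 30, 10
g = np.arange(G).repeat(T)
alpha = rng.normal(size=G)[g]  # group fixed effects, constant within unit
x = 0.5 * alpha + rng.normal(size=G * T)
y = -1.0 * x + alpha + rng.normal(size=G * T)  # true slope is -1.0

def demean(v, g):
    # Subtract each observation's group mean (the within transformation)
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

xd, yd = demean(x, g), demean(y, g)

# FE estimator in closed form; with one regressor it is sum(xd*yd) / sum(xd*xd)
beta_fe = (xd @ yd) / (xd @ xd)

# Same slope from least squares on the demeaned data
beta_ols = np.polyfit(xd, yd, 1)[0]
print(np.isclose(beta_fe, beta_ols))  # True
&lt;/code>&lt;/pre>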
&lt;h3 id="62-visualizing-the-demeaning">6.2 Visualizing the demeaning&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: Raw data
for i, g in enumerate(sorted(groups)[:5]):
    subset = data[data[&amp;quot;group_id&amp;quot;] == g]
    axes[0].scatter(subset[&amp;quot;X1&amp;quot;], subset[&amp;quot;Y&amp;quot;], alpha=0.4, s=20,
                    color=cmap(i / n_groups))
axes[0].set_xlabel(&amp;quot;X1 (raw)&amp;quot;, fontsize=13)
axes[0].set_ylabel(&amp;quot;Y (raw)&amp;quot;, fontsize=13)
axes[0].set_title(&amp;quot;Raw Data: Between + Within Variation&amp;quot;, fontsize=13, fontweight=&amp;quot;bold&amp;quot;)
# Right: Demeaned data
axes[1].scatter(data_dm[&amp;quot;X1_dm&amp;quot;], data_dm[&amp;quot;Y_dm&amp;quot;], alpha=0.4, s=20, color=STEEL_BLUE)
x_range = np.linspace(data_dm[&amp;quot;X1_dm&amp;quot;].min(), data_dm[&amp;quot;X1_dm&amp;quot;].max(), 100)
y_pred = fit_demeaned.coef()[&amp;quot;X1_dm&amp;quot;] * x_range
axes[1].plot(x_range, y_pred, color=WARM_ORANGE, linewidth=2.5,
label=f&amp;quot;FE slope = {fit_demeaned.coef()['X1_dm']:.3f}&amp;quot;)
axes[1].set_xlabel(&amp;quot;X1 (demeaned)&amp;quot;, fontsize=13)
axes[1].set_ylabel(&amp;quot;Y (demeaned)&amp;quot;, fontsize=13)
axes[1].set_title(&amp;quot;Demeaned Data: Within-Group Variation Only&amp;quot;, fontsize=13, fontweight=&amp;quot;bold&amp;quot;)
axes[1].legend(fontsize=11)
plt.savefig(&amp;quot;pyfixest_demeaning.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_demeaning.png" alt="Side-by-side comparison of raw data (left) showing scattered clusters at different vertical levels, and demeaned data (right) centered at the origin with a clear negative slope.">&lt;/p>
&lt;p>The left panel shows the raw data with groups scattered at different vertical levels &amp;mdash; this between-group variation is what confounds the OLS estimate. The right panel shows the demeaned data: all groups are now centered at the origin, and the clear negative slope of -1.019 reflects the pure within-group relationship. This visual makes the FE intuition concrete: by removing group averages, we eliminate confounding from any variable that is constant within groups. Now let us explore how to estimate multiple specifications efficiently.&lt;/p>
&lt;h2 id="7-multiple-estimation-with-stepwise-operators">7. Multiple estimation with stepwise operators&lt;/h2>
&lt;h3 id="71-cumulative-stepwise-fixed-effects">7.1 Cumulative stepwise fixed effects&lt;/h3>
&lt;p>One of PyFixest&amp;rsquo;s most powerful features is its formula operators for estimating multiple models in a single call. The &lt;code>csw0()&lt;/code> operator adds fixed effects &lt;em>cumulatively&lt;/em>: &lt;code>csw0(f1, f2)&lt;/code> estimates three models &amp;mdash; no FE, then &lt;code>f1&lt;/code> only, then &lt;code>f1 + f2&lt;/code> &amp;mdash; in one line. This is far more efficient than writing three separate calls and makes it easy to see how results change as we add controls.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_multi = pf.feols(&amp;quot;Y ~ X1 | csw0(f1, f2)&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
# Print summary for each model
models = fit_multi.all_fitted_models
for key in models:
    m = models[key]
    print(f&amp;quot;\nModel: {key}&amp;quot;)
    print(m.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Model: Y~X1
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| Intercept | 0.919 | 0.112 | 8.223 | 0.000 |
| X1 | -1.000 | 0.082 | -12.134 | 0.000 |
---
RMSE: 2.158 R2: 0.123
Model: Y~X1|f1
Estimation: OLS
Dep. var.: Y, Fixed effects: f1
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| X1 | -0.949 | 0.067 | -14.094 | 0.000 |
---
RMSE: 1.73 R2: 0.437 R2 Within: 0.161
Model: Y~X1|f1+f2
Estimation: OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) |
|:--------------|-----------:|-------------:|----------:|-----------:|
| X1 | -0.919 | 0.060 | -15.440 | 0.000 |
---
RMSE: 1.441 R2: 0.609 R2 Within: 0.200
&lt;/code>&lt;/pre>
&lt;p>The coefficient on &lt;code>X1&lt;/code> shifts from -1.000 (no FE) to -0.949 (with &lt;code>f1&lt;/code>) to -0.919 (with &lt;code>f1 + f2&lt;/code>), while the overall R-squared jumps from 0.123 to 0.437 to 0.609. Adding &lt;code>f1&lt;/code> alone explains an additional 31 percentage points of variation &amp;mdash; a massive improvement that shows how much group-level heterogeneity &lt;code>f1&lt;/code> captures. Adding &lt;code>f2&lt;/code> on top of &lt;code>f1&lt;/code> brings R-squared to 0.609, meaning the two fixed effect dimensions together account for over 60% of the total variation in &lt;code>Y&lt;/code>. The standard error on &lt;code>X1&lt;/code> also shrinks from 0.082 to 0.060, reflecting the precision gain from reducing residual noise.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Specification&lt;/th>
&lt;th>X1 Coef.&lt;/th>
&lt;th>SE&lt;/th>
&lt;th>R-squared&lt;/th>
&lt;th>R-squared Within&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>No FE&lt;/td>
&lt;td>-1.000&lt;/td>
&lt;td>0.082&lt;/td>
&lt;td>0.123&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FE: f1&lt;/td>
&lt;td>-0.949&lt;/td>
&lt;td>0.067&lt;/td>
&lt;td>0.437&lt;/td>
&lt;td>0.161&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FE: f1 + f2&lt;/td>
&lt;td>-0.919&lt;/td>
&lt;td>0.060&lt;/td>
&lt;td>0.609&lt;/td>
&lt;td>0.200&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="72-visualizing-coefficient-stability">7.2 Visualizing coefficient stability&lt;/h3>
&lt;p>The table above shows the numbers, but a figure makes the comparison more immediate. Plotting the coefficient with its 95% confidence interval across specifications reveals both the stability of the point estimate and the precision gain from adding fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python"># Coefficient comparison across specifications
model_names = [&amp;quot;No FE&amp;quot;, &amp;quot;FE: f1&amp;quot;, &amp;quot;FE: f1 + f2&amp;quot;]
coefs = [models[k].coef()[&amp;quot;X1&amp;quot;] for k in models]
ses = [models[k].se()[&amp;quot;X1&amp;quot;] for k in models]
fig, ax = plt.subplots(figsize=(8, 5))
y_pos = np.arange(len(model_names))
ax.barh(y_pos, coefs, xerr=[1.96 * s for s in ses], height=0.5,
color=[STEEL_BLUE, WARM_ORANGE, TEAL], edgecolor=DARK_NAVY, capsize=5)
ax.set_yticks(y_pos)
ax.set_yticklabels(model_names, fontsize=12)
ax.set_xlabel(&amp;quot;Coefficient on X1&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Effect of X1 Across Fixed Effect Specifications&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.axvline(x=0, color=NEAR_BLACK, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_coef_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_coef_comparison.png" alt="Horizontal bar chart comparing X1 coefficient estimates across no FE, one-way FE, and two-way FE specifications, all showing negative effects near -1.0 with narrowing confidence intervals.">&lt;/p>
&lt;p>The coefficient comparison chart shows that the point estimate on &lt;code>X1&lt;/code> remains stable around -1.0 across all three specifications, with confidence intervals narrowing as we add fixed effects. This stability suggests the estimate is robust to the inclusion of group-level controls. In applied research, large shifts across specifications would signal omitted variable concerns, making this type of comparison essential for assessing credibility.&lt;/p>
&lt;h2 id="8-inference-choosing-the-right-standard-errors">8. Inference: choosing the right standard errors&lt;/h2>
&lt;h3 id="81-comparing-standard-error-estimators">8.1 Comparing standard error estimators&lt;/h3>
&lt;p>The choice of standard errors can dramatically change statistical inference, even when point estimates remain the same. Standard (iid) errors assume all observations are independent and identically distributed. Heteroskedasticity-robust (HC1) errors relax the constant-variance assumption. Cluster-robust (CRV) errors account for arbitrary correlation within groups &amp;mdash; essential when observations within a group are not independent, like repeated measurements of the same individual. Think of it like estimating average height: if you measure the same person ten times, those ten measurements are not ten independent observations, and your standard error should reflect that.&lt;/p>
&lt;pre>&lt;code class="language-python">se_types = {
&amp;quot;iid&amp;quot;: &amp;quot;iid&amp;quot;,
&amp;quot;HC1 (robust)&amp;quot;: &amp;quot;HC1&amp;quot;,
&amp;quot;CRV1 (group_id)&amp;quot;: {&amp;quot;CRV1&amp;quot;: &amp;quot;group_id&amp;quot;},
&amp;quot;CRV1 (group_id + f2)&amp;quot;: {&amp;quot;CRV1&amp;quot;: &amp;quot;group_id + f2&amp;quot;},
&amp;quot;CRV3 (group_id)&amp;quot;: {&amp;quot;CRV3&amp;quot;: &amp;quot;group_id&amp;quot;},
}
print(f&amp;quot;{'SE Type':&amp;lt;22} {'SE(X1)':&amp;lt;10} {'t-stat':&amp;lt;10} {'p-value':&amp;lt;10}&amp;quot;)
print(&amp;quot;-&amp;quot; * 52)
for name, vcov in se_types.items():
    fit_tmp = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=vcov)
    print(f&amp;quot;{name:&amp;lt;22} {fit_tmp.se()['X1']:&amp;lt;10.4f} &amp;quot;
          f&amp;quot;{fit_tmp.tstat()['X1']:&amp;lt;10.3f} {fit_tmp.pvalue()['X1']:&amp;lt;10.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">SE Type SE(X1) t-stat p-value
----------------------------------------------------
iid 0.0858 -11.875 0.0000
HC1 (robust) 0.0833 -12.234 0.0000
CRV1 (group_id) 0.1172 -8.696 0.0000
CRV1 (group_id + f2) 0.1207 -8.445 0.0000
CRV3 (group_id) 0.1247 -8.174 0.0000
&lt;/code>&lt;/pre>
&lt;p>The standard error on &lt;code>X1&lt;/code> ranges from 0.0833 (HC1) to 0.1247 (CRV3), a roughly 50% increase depending on the assumed error correlation structure. While all p-values remain below 0.001 in this case, the t-statistic drops from 12.2 to 8.2 &amp;mdash; a substantial difference that could determine significance for weaker effects. Cluster-robust SEs (CRV1) inflate to 0.1172 because they allow errors to be correlated within groups. The CRV3 estimator, which applies a more conservative finite-sample correction, gives the largest SE of 0.1247. In practice, cluster at the level where you believe errors are correlated.&lt;/p>
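&lt;p>To see where the cluster-robust numbers come from, the CRV1 sandwich can be computed by hand. The sketch below uses synthetic clustered data with made-up parameters, and the small-sample factor follows the common CRV1 convention (PyFixest may differ in minor details). When errors share a within-cluster component, the iid formula understates uncertainty:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(2)
G, T = 40, 25
N = G * T
g = np.arange(G).repeat(T)

# Regressor and error both share a common component within each cluster
x = rng.normal(size=N) + rng.normal(size=G)[g]
u = rng.normal(size=N) + rng.normal(size=G)[g]
y = 1.0 * x + u

X = np.column_stack([np.ones(N), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)
k = X.shape[1]

# iid variance: s2 times inv(X'X)
v_iid = (resid @ resid) / (N - k) * bread

# CRV1 sandwich: cluster-summed scores in the middle, plus the usual
# finite-sample factor G/(G-1) * (N-1)/(N-k)
meat = np.zeros((k, k))
for j in range(G):
    s = X[g == j].T @ resid[g == j]
    meat += np.outer(s, s)
scale = G / (G - 1) * (N - 1) / (N - k)
v_crv = scale * bread @ meat @ bread

se_iid, se_crv = np.sqrt(v_iid[1, 1]), np.sqrt(v_crv[1, 1])
print(round(se_iid, 4), round(se_crv, 4))  # the clustered SE is several times larger here
&lt;/code>&lt;/pre>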
&lt;h3 id="82-visualizing-the-se-tradeoff">8.2 Visualizing the SE tradeoff&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
se_names = list(se_types.keys())
se_vals = []
for name, vcov in se_types.items():
fit_tmp = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov=vcov)
se_vals.append(fit_tmp.se()[&amp;quot;X1&amp;quot;])
colors = [STEEL_BLUE, WARM_ORANGE, TEAL, &amp;quot;#e8956a&amp;quot;, &amp;quot;#f0a88c&amp;quot;]
bars = ax.bar(range(len(se_names)), se_vals, color=colors, edgecolor=DARK_NAVY, width=0.6)
ax.set_xticks(range(len(se_names)))
ax.set_xticklabels(se_names, rotation=25, ha=&amp;quot;right&amp;quot;, fontsize=10)
ax.set_ylabel(&amp;quot;Standard Error of X1&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Standard Errors Under Different Assumptions&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
for i, v in enumerate(se_vals):
ax.text(i, v + 0.002, f&amp;quot;{v:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, fontsize=10, fontweight=&amp;quot;bold&amp;quot;)
plt.savefig(&amp;quot;pyfixest_se_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_se_comparison.png" alt="Bar chart showing standard errors increasing from iid (0.0858) to CRV3 (0.1247), illustrating how clustering assumptions inflate uncertainty.">&lt;/p>
&lt;p>The bar chart makes the progression vivid: moving from iid to cluster-robust standard errors increases uncertainty by nearly 50%. The iid and HC1 estimates are similar because heteroskedasticity is not a major concern here. The real jump occurs when we account for within-group correlation (CRV1), and the CRV3 bias-corrected estimator is the most conservative. For applied work with grouped data, defaulting to cluster-robust errors is the safest choice &amp;mdash; underestimating standard errors leads to falsely significant results.&lt;/p>
&lt;h2 id="9-two-way-fixed-effects">9. Two-way fixed effects&lt;/h2>
&lt;p>When data has two grouping dimensions &amp;mdash; for example, firms and years, or workers and occupations &amp;mdash; two-way fixed effects absorb unobserved heterogeneity along both dimensions. In PyFixest, we simply list both FE variables after the pipe: &lt;code>Y ~ X1 + X2 | f1 + f2&lt;/code>. This absorbs all factors that are constant within each level of &lt;code>f1&lt;/code> and each level of &lt;code>f2&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_twoway = pf.feols(&amp;quot;Y ~ X1 + X2 | f1 + f2&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
print(fit_twoway.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference: HC1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -0.924 | 0.056 | -16.375 | 0.000 | -1.035 | -0.813 |
| X2 | -0.174 | 0.015 | -11.246 | 0.000 | -0.204 | -0.144 |
---
RMSE: 1.346 R2: 0.659 R2 Within: 0.303
&lt;/code>&lt;/pre>
&lt;p>Adding both &lt;code>f1&lt;/code> and &lt;code>f2&lt;/code> as fixed effects plus the additional covariate &lt;code>X2&lt;/code> yields an R-squared of 0.659 and a within R-squared of 0.303. The coefficient on &lt;code>X1&lt;/code> is -0.924 (SE = 0.056) and &lt;code>X2&lt;/code> is -0.174 (SE = 0.015), both highly significant. The within R-squared of 0.303 means that &lt;code>X1&lt;/code> and &lt;code>X2&lt;/code> together explain about 30% of the variation in &lt;code>Y&lt;/code> after absorbing both dimensions of fixed effects &amp;mdash; a substantial improvement over the 20% with &lt;code>X1&lt;/code> alone in the previous section.&lt;/p>
&lt;h2 id="10-instrumental-variables-with-fixed-effects">10. Instrumental variables with fixed effects&lt;/h2>
&lt;p>Sometimes the explanatory variable itself is &lt;em>endogenous&lt;/em> &amp;mdash; correlated with the error term due to measurement error, simultaneity, or omitted variables that fixed effects do not capture. Instrumental variables (IV) estimation addresses this by using external variables (instruments) that affect the outcome only through the endogenous variable. Think of instruments as a natural experiment embedded in the data: &lt;code>Z&lt;/code> affects &lt;code>X&lt;/code> but has no direct path to &lt;code>Y&lt;/code>, so any association between &lt;code>Z&lt;/code> and &lt;code>Y&lt;/code> must flow through &lt;code>X&lt;/code>. In PyFixest, the IV syntax uses a second pipe: &lt;code>Y2 ~ 1 | f1 + f2 | X1 ~ Z1 + Z2&lt;/code>. This reads: outcome &lt;code>Y2&lt;/code>, no exogenous controls (just the intercept &lt;code>1&lt;/code>), fixed effects &lt;code>f1 + f2&lt;/code>, and endogenous variable &lt;code>X1&lt;/code> instrumented by &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>.&lt;/p>
&lt;p>The IV estimator recovers the coefficient on &lt;code>X1&lt;/code> by first predicting &lt;code>X1&lt;/code> using the instruments, then using these predictions in the second-stage regression:&lt;/p>
&lt;p>$$\text{First stage: } X_1 = \pi_0 + \pi_1 Z_1 + \pi_2 Z_2 + \alpha_i + \gamma_t + \nu$$&lt;/p>
&lt;p>$$\text{Second stage: } Y_2 = \beta X_1^{predicted} + \alpha_i + \gamma_t + \epsilon$$&lt;/p>
&lt;p>In words, the first stage isolates the variation in &lt;code>X1&lt;/code> that is driven by the instruments &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>, stripping away the endogenous component. The second stage then uses only this &amp;ldquo;clean&amp;rdquo; variation to estimate the effect of &lt;code>X1&lt;/code> on &lt;code>Y2&lt;/code>. Here, $\alpha_i$ corresponds to the &lt;code>f1&lt;/code> fixed effects, $\gamma_t$ corresponds to the &lt;code>f2&lt;/code> fixed effects, and $\beta$ is the causal parameter of interest that we recover from the &lt;code>X1&lt;/code> coefficient in PyFixest&amp;rsquo;s output.&lt;/p>
&lt;pre>&lt;code class="language-python">fit_iv = pf.feols(&amp;quot;Y2 ~ 1 | f1 + f2 | X1 ~ Z1 + Z2&amp;quot;, data=data)
print(fit_iv.summary())
print(f&amp;quot;\nFirst-stage F-statistic: {fit_iv._f_stat_1st_stage:.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: IV
Dep. var.: Y2, Fixed effects: f1+f2
Inference: iid
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -1.600 | 0.336 | -4.768 | 0.000 | -2.259 | -0.942 |
---
First-stage F-statistic: 311.54
&lt;/code>&lt;/pre>
&lt;p>The IV estimate of &lt;code>X1&lt;/code> is -1.600 (SE = 0.336), substantially larger in magnitude than the OLS estimate of approximately -1.0. This divergence suggests that the OLS coefficient on &lt;code>X1&lt;/code> is attenuated &amp;mdash; a classic sign of measurement error or endogeneity that biases OLS toward zero. The first-stage F-statistic of 311.54 is well above the conventional threshold of 10, indicating that &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code> are strong instruments. Strong instruments mean the IV estimate is reliable; with weak instruments, IV can perform worse than OLS. Note that with heterogeneous treatment effects, IV identifies the &lt;em>Local Average Treatment Effect&lt;/em> (LATE) &amp;mdash; the effect for units whose treatment status is shifted by the instruments &amp;mdash; rather than the Average Treatment Effect (ATE) for the entire population.&lt;/p>
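&lt;p>The two-stage logic can be replicated by hand in a few lines. The simulation below is a self-contained illustration (synthetic data with made-up coefficients, not this tutorial&amp;rsquo;s dataset): a confounder biases the OLS slope well away from the true value of 2.0, while a manual two-stage procedure that regresses the endogenous variable on the instrument and reuses the fitted values recovers it:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)                        # instrument: moves x, excluded from y
u = rng.normal(size=n)                        # confounder: moves both x and y
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 2.0 * x - 3.0 * u + rng.normal(size=n)    # true coefficient on x is 2.0

def slope(w, v):
    """OLS slope from regressing v on w with an intercept."""
    W = np.column_stack([np.ones_like(w), w])
    return np.linalg.lstsq(W, v, rcond=None)[0][1]

beta_ols = slope(x, y)                        # biased: u enters both equations

# First stage: predict x from z; second stage: regress y on the prediction
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
beta_2sls = slope(x_hat, y)

print(f"OLS:  {beta_ols:.2f}")   # pulled well away from 2.0
print(f"2SLS: {beta_2sls:.2f}")  # close to 2.0
```

&lt;p>This is the same first-stage/second-stage structure that runs inside the IV formula above, although the package also computes correct second-stage standard errors, which a naive manual second stage does not.&lt;/p>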
&lt;h2 id="11-panel-data-application-wage-determinants">11. Panel data application: wage determinants&lt;/h2>
&lt;h3 id="111-the-wage-panel-variables-and-structure">11.1 The wage panel: variables and structure&lt;/h3>
&lt;p>To see fixed effects in action with real data, we analyze the Vella and Verbeek (1998) panel of 545 young men observed over 8 years (1980&amp;ndash;1987) from the National Longitudinal Survey of Youth (NLSY). This dataset, used in many econometrics textbooks, is ideal for studying wage determinants because it tracks the same workers as they enter the labor market, gain experience, change jobs, and make decisions about union membership and marriage. The key challenge is that unobserved individual ability differs across workers and correlates with both wages and these covariates &amp;mdash; a classic case for one-way fixed effects.&lt;/p>
&lt;pre>&lt;code class="language-python">url = &amp;quot;https://raw.githubusercontent.com/bashtage/linearmodels/main/linearmodels/datasets/wage_panel/wage_panel.csv.bz2&amp;quot;
wage_df = pd.read_csv(url, compression=&amp;quot;bz2&amp;quot;)
print(f&amp;quot;Wage panel shape: {wage_df.shape}&amp;quot;)
print(wage_df.describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Wage panel shape: (4360, 12)
nr year black exper hisp ... educ union lwage expersq occupation
count 4360.000 4360.000 4360.000 4360.000 4360.000 ... 4360.000 4360.000 4360.000 4360.000 4360.000
mean 5262.059 1983.500 0.116 6.500 0.161 ... 11.768 0.244 1.649 50.425 4.989
std 3496.150 2.292 0.320 2.292 0.367 ... 1.353 0.430 0.533 40.782 2.320
min 13.000 1980.000 0.000 1.000 0.000 ... 3.000 0.000 -3.579 1.000 1.000
25% 2329.000 1981.750 0.000 4.750 0.000 ... 11.000 0.000 1.351 16.000 4.000
50% 4569.000 1983.500 0.000 6.500 0.000 ... 12.000 0.000 1.671 36.000 5.000
75% 8406.000 1985.250 0.000 8.250 0.000 ... 12.000 0.000 1.991 81.000 6.000
max 12548.000 1987.000 1.000 12.000 1.000 ... 16.000 1.000 4.052 324.000 9.000
&lt;/code>&lt;/pre>
&lt;p>The panel contains 4,360 observations (545 individuals over 8 years) with 12 variables. Before running any model, it is important to understand how each variable is defined and measured.&lt;/p>
&lt;p>&lt;strong>Outcome variable:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>lwage&lt;/code> &amp;mdash; the natural logarithm of hourly wage. The log transformation means that coefficients are interpreted as approximate percentage changes. The mean of 1.649 corresponds to about \$5.20 per hour in 1980s dollars ($e^{1.649} \approx 5.20$). The standard deviation of 0.533 indicates substantial wage dispersion: the gap between a worker at the 25th percentile (\$3.86/hr) and the 75th percentile (\$7.32/hr) is roughly a doubling of wages.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Time-varying covariates&lt;/strong> (change within a worker over time):&lt;/p>
&lt;ul>
&lt;li>&lt;code>hours&lt;/code> &amp;mdash; annual hours worked. Mean of 2,191 (roughly 42 hours per week for 52 weeks). Ranges from 120 to 4,992, capturing both part-time spells and heavy overtime. We include hours to control for labor supply differences that affect hourly wage calculations.&lt;/li>
&lt;li>&lt;code>union&lt;/code> &amp;mdash; binary indicator (1 = covered by a union contract in the current year, 0 = not covered). About 24.4% of person-year observations are union-covered. Workers can move in and out of union jobs across years, and this within-worker variation in union status is what one-way FE use to identify the union wage premium.&lt;/li>
&lt;li>&lt;code>married&lt;/code> &amp;mdash; binary indicator (1 = currently married, 0 = not married). About 43.9% of observations are married. Since these are young men tracked from their early twenties, many transition from single to married during the panel, providing within-worker variation.&lt;/li>
&lt;li>&lt;code>exper&lt;/code> &amp;mdash; years of potential labor market experience, defined as age minus years of education minus 6. Ranges from 1 to 12 years. In this balanced panel where every worker is observed in every year, experience increases by exactly 1 each year, making it perfectly collinear with entity + year fixed effects. We therefore use &lt;code>expersq&lt;/code> instead in FE models.&lt;/li>
&lt;li>&lt;code>expersq&lt;/code> &amp;mdash; experience squared ($exper^2$). Captures the well-documented concavity in the experience&amp;ndash;earnings profile: wages rise with experience but at a diminishing rate. Unlike &lt;code>exper&lt;/code>, the squared term is a nonlinear function of time, so it is not collinear with entity + year FE and can be estimated.&lt;/li>
&lt;li>&lt;code>occupation&lt;/code> &amp;mdash; occupational category, coded 1 through 9 (9 distinct categories). Workers can and do switch occupations across years. This variable can be used as an additional fixed effect dimension.&lt;/li>
&lt;/ul>
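&lt;p>The collinearity claim for &lt;code>exper&lt;/code> is easy to verify. In any balanced panel, experience is the sum of a worker-specific starting level and a common year term, so the two-way within transformation leaves nothing behind. A toy example (made-up numbers, same structure as the real panel):&lt;/p>

```python
import pandas as pd

# Toy balanced panel: 3 workers observed 1980-1983; exper rises by 1 each year
toy = pd.DataFrame({
    "nr":    [1] * 4 + [2] * 4 + [3] * 4,
    "year":  [1980, 1981, 1982, 1983] * 3,
    "exper": [3, 4, 5, 6,  1, 2, 3, 4,  7, 8, 9, 10],
})

# Two-way within transformation: subtract worker and year means, add back grand mean
demeaned = (toy["exper"]
            - toy.groupby("nr")["exper"].transform("mean")
            - toy.groupby("year")["exper"].transform("mean")
            + toy["exper"].mean())

print(demeaned.abs().max())  # zero up to floating-point error
```

&lt;p>After subtracting worker means and year means (and adding back the grand mean), every entry is zero, which is exactly why &lt;code>exper&lt;/code> cannot appear alongside entity and year fixed effects while &lt;code>expersq&lt;/code> can.&lt;/p>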
&lt;p>&lt;strong>Time-invariant covariates&lt;/strong> (fixed for each worker across all years):&lt;/p>
&lt;ul>
&lt;li>&lt;code>educ&lt;/code> &amp;mdash; years of completed schooling at the start of the panel. Mean of 11.77 years (just below a high school diploma), ranging from 3 to 16 years. Because the sample tracks young men who have already finished their schooling, education does not change over time. The median of 12 years (exactly a high school diploma) and the 75th percentile of 12 years indicate that most workers in this sample have a high school education, with a smaller group holding college degrees.&lt;/li>
&lt;li>&lt;code>black&lt;/code> &amp;mdash; binary indicator (1 = Black, 0 = non-Black). About 11.6% of workers are Black. Because race does not change over time, one-way FE absorb any wage differences associated with being Black.&lt;/li>
&lt;li>&lt;code>hisp&lt;/code> &amp;mdash; binary indicator (1 = Hispanic, 0 = non-Hispanic). About 16.1% of workers are Hispanic. Like &lt;code>black&lt;/code>, this is absorbed by one-way FE.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Panel identifiers:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>nr&lt;/code> &amp;mdash; unique worker identifier (545 distinct workers). This defines the entity dimension for fixed effects.&lt;/li>
&lt;li>&lt;code>year&lt;/code> &amp;mdash; calendar year, taking values 1980 through 1987. The panel is balanced: every worker appears in every year, giving exactly $545 \times 8 = 4,360$ observations.&lt;/li>
&lt;/ul>
&lt;p>The distinction between time-varying and time-invariant variables is the most consequential feature of this dataset for fixed effects analysis. Time-invariant variables will be perfectly collinear with entity dummies and cannot be estimated under one-way FE. Time-varying variables survive the within transformation and their effects can be identified. We verify this classification empirically:&lt;/p>
&lt;pre>&lt;code class="language-python">invariance = wage_df.groupby(&amp;quot;nr&amp;quot;)[[&amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;]].nunique()
print(&amp;quot;Max unique values per worker:&amp;quot;)
print(invariance.max())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Max unique values per worker:
educ 1
black 1
hisp 1
dtype: int64
&lt;/code>&lt;/pre>
&lt;p>Each worker has exactly one value of education, race, and ethnicity across all eight years &amp;mdash; confirming these are truly time-invariant. By contrast, occupation is time-varying:&lt;/p>
&lt;pre>&lt;code class="language-python">occ_changes = wage_df.groupby(&amp;quot;nr&amp;quot;)[&amp;quot;occupation&amp;quot;].nunique()
print(f&amp;quot;Workers who change occupation: {(occ_changes &amp;gt; 1).sum()} / {len(occ_changes)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Workers who change occupation: 484 / 545
&lt;/code>&lt;/pre>
&lt;p>Nearly 89% of workers switch occupations at least once during the panel. This high rate of switching makes occupation a valid candidate for a fixed effect dimension of its own (Section 11.5). By contrast, a variable like education, which never changes within a worker, would produce a column of zeros after demeaning and must be dropped &amp;mdash; a point we return to in Sections 11.3 and 11.4.&lt;/p>
&lt;h3 id="112-within-vs-between-variation">11.2 Within vs between variation&lt;/h3>
&lt;p>Before estimating any model, it helps to decompose the variation in each variable into &lt;em>between-worker&lt;/em> variation (permanent differences across workers) and &lt;em>within-worker&lt;/em> variation (changes over a worker&amp;rsquo;s career). This decomposition foreshadows what one-way fixed effects can and cannot estimate.&lt;/p>
&lt;pre>&lt;code class="language-python">cols = [&amp;quot;lwage&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;expersq&amp;quot;, &amp;quot;educ&amp;quot;]
between = wage_df.groupby(&amp;quot;nr&amp;quot;)[cols].mean().std()
for col in cols:
wage_df[f&amp;quot;{col}_within&amp;quot;] = wage_df[col] - wage_df.groupby(&amp;quot;nr&amp;quot;)[col].transform(&amp;quot;mean&amp;quot;)
within = wage_df[[f&amp;quot;{c}_within&amp;quot; for c in cols]].std()
variation = pd.DataFrame({&amp;quot;Between&amp;quot;: between, &amp;quot;Within&amp;quot;: within}).round(4)
print(variation)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Between Within
lwage 0.3907 0.3623
hours 381.7831 418.6057
union 0.3294 0.2760
married 0.3766 0.3236
expersq 26.3513 31.1431
educ 1.7476 0.0000
&lt;/code>&lt;/pre>
&lt;p>The raw standard deviations differ wildly across variables (hours is in the hundreds, union is a fraction), so we normalize by computing each variable&amp;rsquo;s &lt;em>within share&lt;/em> &amp;mdash; the fraction of total variance (between-worker plus within-worker) contributed by within-worker changes over time. This puts all variables on the same 0&amp;ndash;100% scale:&lt;/p>
&lt;pre>&lt;code class="language-python"># Variance shares: the within and between components sum to 100% by construction
within_share = (within**2 / (between**2 + within**2)).fillna(0)
between_share = 1 - within_share
fig, ax = plt.subplots(figsize=(10, 5))
y_pos = np.arange(len(cols))
bar_height = 0.55
# Stacked horizontal bars: between (left) + within (right) = 100%
ax.barh(y_pos, between_share.values, bar_height,
        label=&amp;quot;Between (cross-worker)&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.barh(y_pos, within_share.values, bar_height, left=between_share.values,
        label=&amp;quot;Within (over career)&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.set_yticks(y_pos)
ax.set_yticklabels(cols, fontsize=11)
ax.set_xlabel(&amp;quot;Share of total variance&amp;quot;, fontsize=12)
ax.legend(loc=&amp;quot;lower right&amp;quot;, fontsize=10)
plt.savefig(&amp;quot;pyfixest_within_between.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_within_between.png" alt="Stacked horizontal bar chart showing the within vs between share of total variance for key wage panel variables, with education at 100% between variation.">&lt;/p>
&lt;p>The decomposition reveals a critical pattern. Education is 100% between-worker variation &amp;mdash; its within share is exactly 0% &amp;mdash; because no worker changes their education level during the panel. This means one-way FE literally cannot estimate education&amp;rsquo;s effect: the demeaned education column is all zeros. Log wages split almost evenly, with roughly 46% of their variance coming from changes over a worker&amp;rsquo;s career and 54% from permanent differences across workers. Variables with substantial within shares &amp;mdash; union (41%), married (42%), hours (55%), expersq (58%) &amp;mdash; can be estimated under one-way FE because they change over a worker&amp;rsquo;s career. The higher the within share, the more statistical power one-way FE retains for that variable.&lt;/p>
&lt;h3 id="113-the-mincer-equation-and-its-panel-extensions">11.3 The Mincer equation and its panel extensions&lt;/h3>
&lt;p>Before estimating any models, it helps to lay out the econometric framework that organizes all subsequent specifications. The &lt;strong>classic Mincer equation&lt;/strong> (Mincer, 1974) is the workhorse model of labor economics:&lt;/p>
&lt;p>$$\ln(wage_i) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper_i^2 + \epsilon_i$$&lt;/p>
&lt;p>This log-linear specification models wages as a function of years of schooling and experience, with experience entering quadratically to capture concave returns &amp;mdash; each additional year of experience raises wages, but by a diminishing amount. It is a cross-sectional model, estimating the average relationship across all workers at a single point in time.&lt;/p>
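&lt;p>The quadratic term has a concrete implication: the marginal return to experience, $\beta_2 + 2\beta_3 \cdot exper$, declines linearly and hits zero at $exper = -\beta_2/(2\beta_3)$, the peak of the earnings profile. A quick sketch with purely illustrative coefficients (assumed values, not estimates from this data):&lt;/p>

```python
# Hypothetical Mincer coefficients, chosen only to illustrate the quadratic shape
b_exper, b_expersq = 0.08, -0.002

def marginal_return(exper):
    """Derivative of log wage with respect to experience."""
    return b_exper + 2 * b_expersq * exper

peak = -b_exper / (2 * b_expersq)  # experience level where the profile flattens
print(f"Return at 5 years:  {marginal_return(5):.3f}")   # ~6% per additional year
print(f"Return at 15 years: {marginal_return(15):.3f}")  # ~2%: diminishing returns
print(f"Profile peaks near {peak:.0f} years")
```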
&lt;p>The &lt;strong>extended Mincer equation&lt;/strong> adds controls for union membership, marital status, hours worked, and demographic characteristics:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta_0 + \beta_1 educ_i + \beta_2 expersq_{it} + \beta_3 union_{it} + \beta_4 married_{it} + \beta_5 hours_{it} + \beta_6 black_i + \beta_7 hisp_i + \epsilon_{it}$$&lt;/p>
&lt;p>The &lt;strong>panel FE extension&lt;/strong> replaces explicit controls for time-invariant characteristics with entity and time fixed effects:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \alpha_i + \delta_t + \epsilon_{it}$$&lt;/p>
&lt;p>where $X_{it}$ denotes time-varying covariates (union, married, hours, experience), $Z_i$ denotes time-invariant characteristics (education, race), $\alpha_i$ captures one-way fixed effects (one intercept per worker), and $\delta_t$ captures year fixed effects. The key insight: when we include $\alpha_i$, the time-invariant variables $Z_i$ become perfectly collinear with the entity dummies and are absorbed. We gain protection against omitted variable bias from all unobserved time-invariant confounders, but we lose the ability to estimate $\gamma$.&lt;/p>
&lt;p>The &lt;strong>CRE/Mundlak extension&lt;/strong> &amp;mdash; the Mundlak (1978) device &amp;mdash; offers a way to recover $\gamma$:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \pi \bar{X}_i + \epsilon_{it}$$&lt;/p>
&lt;p>where $\bar{X}_i$ are individual means of the time-varying variables. This replaces entity dummies with individual means, which model the correlation between unobserved heterogeneity and the covariates. The result: $\hat{\beta} \approx \hat{\beta}_{FE}$ for the time-varying variables, while $\gamma$ is now estimable because we no longer include entity dummies that absorb it.&lt;/p>
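&lt;p>A minimal simulation makes the Mundlak equivalence concrete. The data below are synthetic (assumed coefficients, not the wage panel): unobserved heterogeneity is correlated with the regressor, so pooled OLS is biased, but adding the individual mean $\bar{X}_i$ reproduces the within (FE) slope:&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
N, T = 300, 6
idx = np.repeat(np.arange(N), T)
alpha = rng.normal(size=N)                         # unobserved heterogeneity
x = 0.7 * alpha[idx] + rng.normal(size=N * T)      # x correlated with alpha
y = 1.5 * x + alpha[idx] + rng.normal(size=N * T)  # true coefficient is 1.5

df = pd.DataFrame({"i": idx, "x": x, "y": y})
df["x_bar"] = df.groupby("i")["x"].transform("mean")   # Mundlak term

def ols_coef_on_x(cols):
    """Coefficient on the first regressor in cols from OLS with an intercept."""
    X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy() for c in cols])
    return np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][1]

# Within (FE) estimate via demeaning
xd = (df["x"] - df["x_bar"]).to_numpy()
yd = (df["y"] - df.groupby("i")["y"].transform("mean")).to_numpy()
beta_fe = xd @ yd / (xd @ xd)

print(f"Pooled OLS:  {ols_coef_on_x(['x']):.3f}")           # biased upward by alpha
print(f"Within (FE): {beta_fe:.3f}")                        # close to 1.5
print(f"Mundlak (x): {ols_coef_on_x(['x', 'x_bar']):.3f}")  # matches the FE estimate
```

&lt;p>The coefficient on &lt;code>x&lt;/code> in the Mundlak regression matches the demeaning-based FE estimate to numerical precision, which is the identity Section 11.7 exploits.&lt;/p>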
&lt;p>Sections 11.4&amp;ndash;11.7 estimate these models progressively: pooled OLS and one-way FE (11.4), two-way and three-way FE (11.5), group-specific time trends (11.6), and CRE/Mundlak (11.7).&lt;/p>
&lt;h3 id="114-from-pooled-ols-to-one-way-fe-the-education-tradeoff">11.4 From pooled OLS to one-way FE: the education tradeoff&lt;/h3>
&lt;p>We begin with the extended Mincer equation estimated by pooled OLS, which includes both time-varying and time-invariant variables:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_pooled = pf.feols(
&amp;quot;lwage ~ educ + expersq + union + married + hours + black + hisp&amp;quot;,
data=wage_df, vcov=&amp;quot;HC1&amp;quot;
)
print(fit_pooled.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: 0
Inference: HC1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.265 | 0.069 | 3.823 | 0.000 | 0.129 | 0.402 |
| educ | 0.106 | 0.005 | 22.924 | 0.000 | 0.097 | 0.115 |
| expersq | 0.003 | 0.000 | 16.930 | 0.000 | 0.003 | 0.004 |
| union | 0.183 | 0.016 | 11.205 | 0.000 | 0.151 | 0.215 |
| married | 0.141 | 0.015 | 9.308 | 0.000 | 0.111 | 0.171 |
| hours | -0.000 | 0.000 | -3.139 | 0.002 | -0.000 | -0.000 |
| black | -0.135 | 0.024 | -5.549 | 0.000 | -0.182 | -0.087 |
| hisp | 0.013 | 0.020 | 0.670 | 0.503 | -0.025 | 0.052 |
---
RMSE: 0.484 R2: 0.175
&lt;/code>&lt;/pre>
&lt;p>Pooled OLS estimates a 10.6% return to each year of education, an 18.3% union premium, and a 14.1% marriage premium. Black workers earn about 13.5% less, while the Hispanic coefficient is small and insignificant. The R-squared is 0.175 &amp;mdash; these variables explain less than a fifth of wage variation.&lt;/p>
&lt;p>Now we estimate the one-way FE model, which absorbs all time-invariant worker characteristics:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_entity = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(fit_entity.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: nr
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| expersq | 0.004 | 0.000 | 16.537 | 0.000 | 0.003 | 0.004 |
| union | 0.078 | 0.024 | 3.319 | 0.001 | 0.032 | 0.125 |
| married | 0.115 | 0.022 | 5.217 | 0.000 | 0.071 | 0.158 |
| hours | -0.000 | 0.000 | -3.807 | 0.000 | -0.000 | -0.000 |
---
RMSE: 0.335 R2: 0.605 R2 Within: 0.145
&lt;/code>&lt;/pre>
&lt;p>One-way fixed effects dramatically improve model fit: R-squared jumps from 0.175 (pooled OLS) to 0.605, meaning worker-level heterogeneity accounts for over 40 percentage points of explained variation. The union premium drops from 18.3% to 7.8% (SE = 0.024) &amp;mdash; more than half the pooled estimate was driven by selection (workers who join unions differ systematically from those who do not). The marriage premium falls from 14.1% to 11.5% (SE = 0.022), a smaller reduction suggesting that marital status is less confounded by unobserved ability. The &lt;code>expersq&lt;/code> coefficient of 0.004 captures the concavity of the experience&amp;ndash;earnings profile within workers over time. Notice that &lt;code>educ&lt;/code>, &lt;code>black&lt;/code>, and &lt;code>hisp&lt;/code> are absent: these time-invariant variables are perfectly collinear with the 545 worker dummies and cannot be estimated under one-way FE.&lt;/p>
&lt;p>To see what happens when we try to include a time-invariant variable alongside one-way FE:&lt;/p>
&lt;pre>&lt;code class="language-python">import warnings
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter(&amp;quot;always&amp;quot;)
fit_educ = pf.feols(&amp;quot;lwage ~ expersq + union + married + educ | nr&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(f&amp;quot;Coefficients estimated: {list(fit_educ.coef().index)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients estimated: ['expersq', 'union', 'married']
&lt;/code>&lt;/pre>
&lt;p>Education is silently dropped. This is not a bug &amp;mdash; it is a fundamental consequence of the within transformation (Section 6):&lt;/p>
&lt;p>$$\ddot{educ}_{it} = educ_i - \bar{educ}_i = 0 \quad \text{for all } t$$&lt;/p>
&lt;p>Because a worker&amp;rsquo;s education does not change over the eight years of the panel, the demeaned value is exactly zero for every observation. A column of zeros is perfectly collinear with the entity dummies, so it must be dropped. The same applies to &lt;code>black&lt;/code> and &lt;code>hisp&lt;/code>.&lt;/p>
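&lt;p>The zero-column argument takes only a few lines to verify on a toy panel (made-up values, same layout as the NLSY data):&lt;/p>

```python
import pandas as pd

# Toy panel: 2 workers x 3 years; educ is fixed per worker, union switches
toy = pd.DataFrame({
    "nr":    [1, 1, 1, 2, 2, 2],
    "educ":  [12, 12, 12, 16, 16, 16],  # time-invariant
    "union": [0, 1, 1, 0, 0, 1],        # time-varying
})
demeaned = toy[["educ", "union"]] - toy.groupby("nr")[["educ", "union"]].transform("mean")
print(demeaned)
```

&lt;p>The demeaned &lt;code>educ&lt;/code> column is identically zero while &lt;code>union&lt;/code> retains within-worker deviations, which is precisely the distinction between dropped and identified regressors.&lt;/p>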
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Pooled OLS&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>educ&lt;/td>
&lt;td>0.106&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.003&lt;/td>
&lt;td>0.004&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>0.078&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.141&lt;/td>
&lt;td>0.115&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>black&lt;/td>
&lt;td>-0.135&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hisp&lt;/td>
&lt;td>0.013&lt;/td>
&lt;td>dropped&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>R-squared&lt;/td>
&lt;td>0.175&lt;/td>
&lt;td>0.605&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>This table crystallizes the fundamental tradeoff. Pooled OLS estimates everything &amp;mdash; education, race, union, marriage &amp;mdash; but its estimates are biased by unobserved ability. One-Way FE eliminates the ability bias, and the union premium drops from 18.3% to 7.8%, revealing that more than half the raw association was selection. But the price is steep: education, Black, and Hispanic are all absorbed into the individual intercepts. We cannot estimate the return to schooling or the racial wage gap under one-way FE. Sections 11.5&amp;ndash;11.6 push further with additional FE dimensions, and Section 11.7 shows how CRE partially resolves this tradeoff.&lt;/p>
&lt;h3 id="115-two-way-and-three-way-fixed-effects">11.5 Two-way and three-way fixed effects&lt;/h3>
&lt;p>Adding year fixed effects to one-way FE creates a two-way FE (TWFE) model that absorbs both individual heterogeneity and common time trends:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_panel = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr + year&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr + year&amp;quot;})
&lt;/code>&lt;/pre>
&lt;p>We can go further by adding occupation as a third fixed effect dimension. As we saw in Section 11.1, nearly 89% of workers switch occupations during the panel, so occupation is a valid time-varying dimension:&lt;/p>
&lt;pre>&lt;code class="language-python">fit_threeway = pf.feols(
&amp;quot;lwage ~ expersq + union + married + hours | nr + year + C(occupation)&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;}
)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Pooled OLS&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;th>Two-Way FE&lt;/th>
&lt;th>Three-Way FE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.003&lt;/td>
&lt;td>0.004&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.073&lt;/td>
&lt;td>0.075&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.141&lt;/td>
&lt;td>0.115&lt;/td>
&lt;td>0.048&lt;/td>
&lt;td>0.047&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>R-squared&lt;/td>
&lt;td>0.175&lt;/td>
&lt;td>0.605&lt;/td>
&lt;td>0.631&lt;/td>
&lt;td>0.632&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(2, 2, figsize=(12, 8))
panel_models = {&amp;quot;Pooled OLS&amp;quot;: fit_pooled, &amp;quot;One-Way FE&amp;quot;: fit_entity,
&amp;quot;Two-Way FE&amp;quot;: fit_panel, &amp;quot;Three-Way FE&amp;quot;: fit_threeway}
panel_vars = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;]
panel_colors = [STEEL_BLUE, WARM_ORANGE, TEAL, &amp;quot;#e8956a&amp;quot;]
for idx, var in enumerate(panel_vars):
ax = axes.flatten()[idx]
model_names_p = list(panel_models.keys())
coefs_p = [panel_models[m].coef()[var] for m in model_names_p]
ses_p = [panel_models[m].se()[var] for m in model_names_p]
ax.bar(range(4), coefs_p, yerr=[1.96 * s for s in ses_p],
color=panel_colors, edgecolor=DARK_NAVY, width=0.5, capsize=4)
ax.set_xticks(range(4))
ax.set_xticklabels(model_names_p, fontsize=8, rotation=15)
ax.set_title(var, fontsize=12, fontweight=&amp;quot;bold&amp;quot;)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
fig.suptitle(&amp;quot;Coefficient Estimates Across FE Specifications&amp;quot;,
fontsize=14, fontweight=&amp;quot;bold&amp;quot;, y=1.02)
plt.savefig(&amp;quot;pyfixest_wage_extended.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_wage_extended.png" alt="Four-panel chart comparing coefficient estimates across pooled OLS, one-way FE, two-way FE, and three-way FE specifications.">&lt;/p>
&lt;p>The results show diminishing returns to additional FE dimensions. The big action was one-way FE: R-squared jumps from 0.175 to 0.605, and the union premium drops from 18.3% to 7.8%. Adding year effects (TWFE) pushes R-squared to 0.631 and the union premium stabilizes at 7.3%. Adding occupation as a third dimension barely moves anything &amp;mdash; R-squared rises to 0.632 and the union premium is 7.5%. The &lt;code>expersq&lt;/code> coefficient flips sign with TWFE (-0.006) because year effects absorb common trends in experience and wages. The stability of the union and marriage coefficients across the last three specifications suggests these estimates are robust to additional controls for time trends and occupational sorting.&lt;/p>
&lt;h3 id="116-interactive-fixed-effects">11.6 Interactive fixed effects&lt;/h3>
&lt;p>Sections 11.4&amp;ndash;11.5 used &lt;em>additive&lt;/em> fixed effects (&lt;code>nr + year&lt;/code>), where every individual shares the same set of year effects. &lt;strong>Interactive&lt;/strong> (or &lt;em>interacted&lt;/em>) fixed effects generalize this by allowing one FE dimension to vary across levels of another &amp;mdash; producing group-specific intercepts for each time period. Instead of a single set of year dummies shared by all workers, we estimate separate year effects for each demographic group.&lt;/p>
&lt;p>Why does this matter? Black and non-Black workers may face different labor market trends during the 1980s. If macroeconomic shocks hit these groups differently, a common set of year effects would be misspecified. We can test this by allowing year effects to vary by race:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \alpha_i + \gamma_{t,g(i)} + \epsilon_{it}$$&lt;/p>
&lt;p>where $g(i) \in \{Black, non\text{-}Black\}$, so we estimate separate year effects for each racial group.&lt;/p>
&lt;p>Pyfixest implements interactive FE with the &lt;strong>caret operator&lt;/strong> (&lt;code>^&lt;/code>): the syntax &lt;code>year^black&lt;/code> in the fixed-effects slot creates a separate year dummy for each value of &lt;code>black&lt;/code>. This mirrors R&amp;rsquo;s fixest package. The equivalent manual approach is to concatenate the columns (&lt;code>wage_df[&amp;quot;year_black&amp;quot;] = wage_df[&amp;quot;year&amp;quot;].astype(str) + &amp;quot;_&amp;quot; + wage_df[&amp;quot;black&amp;quot;].astype(str)&lt;/code>) and absorb the resulting string variable, but the caret operator is preferred because it keeps the interaction structure visible in the formula.&lt;/p>
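&lt;p>To see why the two approaches induce the same set of dummies, here is a minimal pure-Python sketch with toy data (no pyfixest): grouping observations on the &lt;em>pair&lt;/em> of values produces exactly the same partition as grouping on a concatenated key.&lt;/p>

```python
from collections import defaultdict

# Toy (year, black) values for six observations (hypothetical)
rows = [(1980, 0), (1980, 1), (1981, 0), (1981, 1), (1980, 0), (1981, 1)]

# Grouping on the pair of values: what year^black does conceptually
pair_groups = defaultdict(list)
for i, (year, black) in enumerate(rows):
    pair_groups[(year, black)].append(i)

# Grouping on a concatenated string key: the manual alternative
concat_groups = defaultdict(list)
for i, (year, black) in enumerate(rows):
    concat_groups[f'{year}_{black}'].append(i)

# Both induce the same partition of observations, hence the same dummies
assert sorted(pair_groups.values()) == sorted(concat_groups.values())
print(len(pair_groups))   # 4 interacted groups: 2 years x 2 race values
```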
&lt;pre>&lt;code class="language-python"># Pyfixest caret operator for interacted fixed effects
fit_gtrends = pf.feols(&amp;quot;lwage ~ expersq + union + married + hours | nr + year^black&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;})
print(fit_gtrends.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: nr+year^black
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|----------: |------------: |--------: |---------: |-----: |------: |
| expersq | -0.006 | 0.001 | -5.878 | 0.000 | -0.008 | -0.004 |
| union | 0.074 | 0.024 | 3.129 | 0.002 | 0.028 | 0.121 |
| married | 0.045 | 0.020 | 2.262 | 0.024 | 0.006 | 0.084 |
| hours | -0.000 | 0.000 | -0.393 | 0.694 | -0.001 | 0.001 |
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Two-Way FE (additive)&lt;/th>
&lt;th>Interactive FE (year × race)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;td>-0.006&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.073&lt;/td>
&lt;td>0.074&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.048&lt;/td>
&lt;td>0.045&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
vars_plot = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;]
x = np.arange(len(vars_plot))
width = 0.35
twfe_coefs = [fit_panel.coef()[v] for v in vars_plot]
gtrend_coefs = [fit_gtrends.coef()[v] for v in vars_plot]
ax.bar(x - width/2, twfe_coefs, width, label=&amp;quot;Two-Way FE&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.bar(x + width/2, gtrend_coefs, width, label=&amp;quot;Interactive FE&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.set_xticks(x)
ax.set_xticklabels(vars_plot, fontsize=11)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Additive vs Interactive Fixed Effects&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_group_trends.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_group_trends.png" alt="Side-by-side bar chart comparing additive TWFE and interactive fixed effect coefficient estimates.">&lt;/p>
&lt;p>The coefficients are nearly identical under both specifications. Moving from additive to interactive fixed effects barely changes the estimated returns to union membership (7.3% → 7.4%), marriage (4.8% → 4.5%), or experience. This stability indicates that year effects are similar across racial groups &amp;mdash; the additive TWFE specification is not misspecified by imposing common year effects. The interactive model uses 545 one-way FE plus 16 group-year FE (8 years × 2 groups) = 561 FE parameters to explain 4,360 observations &amp;mdash; well short of saturation. Had the coefficients shifted substantially, that would have signaled that Black and non-Black workers face sufficiently different macro trends to warrant group-specific year effects, and that the standard additive TWFE was masking this heterogeneity.&lt;/p>
&lt;h3 id="117-recovering-time-invariant-effects-the-cremundlak-approach">11.7 Recovering time-invariant effects: the CRE/Mundlak approach&lt;/h3>
&lt;p>Sections 11.4&amp;ndash;11.6 revealed a fundamental tradeoff in panel econometrics. One-way FE eliminate omitted variable bias from all unobserved time-invariant confounders &amp;mdash; a powerful guarantee &amp;mdash; but they absorb education, race, and ethnicity in the process. Pooled OLS estimates coefficients for everything, but those estimates are biased whenever unobserved worker traits correlate with the covariates. We want the best of both worlds: the bias protection of FE with the ability to estimate time-invariant effects.&lt;/p>
&lt;p>Imagine you could describe each worker&amp;rsquo;s &amp;ldquo;type&amp;rdquo; not with a unique ID but with a summary of their career trajectory &amp;mdash; their average union participation rate, average hours worked, average marital status, and so on. Two workers with similar career averages are arguably similar in unobserved ways too: a worker who spends 80% of their career in a union likely differs systematically from one who never joins. The &lt;strong>Correlated Random Effects&lt;/strong> (CRE) model &amp;mdash; also called the &lt;strong>Mundlak (1978) device&lt;/strong> &amp;mdash; operationalizes this intuition by replacing the 545 entity dummies with a handful of individual-mean variables that capture the same correlation structure.&lt;/p>
&lt;p>&lt;strong>The CRE equation.&lt;/strong> Recall from Section 11.3 that the CRE equation replaces entity dummies $\alpha_i$ with individual means $\bar{X}_i$ of the time-varying variables:&lt;/p>
&lt;p>$$\ln(wage_{it}) = \beta X_{it} + \gamma Z_i + \pi \bar{X}_i + \epsilon_{it}$$&lt;/p>
&lt;p>In words, this equation says that a worker&amp;rsquo;s log wage depends on three components: (1) their current values of time-varying covariates ($X_{it}$), (2) their permanent characteristics ($Z_i$ like education and race), and (3) a set of correction terms ($\bar{X}_i$) that capture the &lt;em>average&lt;/em> level of each time-varying variable across their career. In our code, $X_{it}$ corresponds to &lt;code>expersq&lt;/code>, &lt;code>union&lt;/code>, &lt;code>married&lt;/code>, and &lt;code>hours&lt;/code> in each year; $Z_i$ corresponds to &lt;code>educ&lt;/code>, &lt;code>black&lt;/code>, and &lt;code>hisp&lt;/code>; and $\bar{X}_i$ corresponds to the &lt;code>*_mean&lt;/code> columns we compute below.&lt;/p>
&lt;p>&lt;strong>Why does including $\bar{X}_i$ work?&lt;/strong> The individual means proxy for the unobserved individual effect $\alpha_i$. Consider union membership: if workers who join unions more often (high $\overline{union}_i$) also have higher unobserved ability or motivation, then $\overline{union}_i$ captures that correlation. Once we control for it, the remaining within-person variation in union status is &amp;ldquo;clean&amp;rdquo; &amp;mdash; and the time-invariant variables are no longer collinear with entity dummies (because there are no entity dummies).&lt;/p>
&lt;p>&lt;strong>Contrast with FE.&lt;/strong> One-way FE assumes $\alpha_i$ can be &lt;em>anything&lt;/em> &amp;mdash; completely unrestricted. CRE assumes $\alpha_i = \pi \bar{X}_i + \text{error}$ &amp;mdash; the individual effect is a linear function of the career averages. This is a stronger assumption, but it buys back education and race. The payoff: $\hat{\beta}$ for time-varying variables should approximately match the one-way FE estimates (because the means absorb the same correlation), while $\gamma$ for time-invariant variables is now estimable.&lt;/p>
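&lt;p>The equivalence for time-varying coefficients can be checked in standalone Python before touching the real data. The sketch below uses simulated data with a hypothetical true within slope of 0.5 and solves the three-variable OLS by hand via the normal equations; in a balanced panel, the Mundlak slope on $x$ matches the within (one-way FE) estimator exactly.&lt;/p>

```python
import random

random.seed(7)

# Simulated balanced panel (hypothetical): the entity effect alpha_i
# shifts both y and the level of x, so pooled OLS would be biased
N, T = 60, 5
ids, x, y = [], [], []
for i in range(N):
    alpha = random.gauss(0, 1)
    for t in range(T):
        xi = 2.0 * alpha + random.gauss(0, 1)   # x correlated with alpha
        ids.append(i)
        x.append(xi)
        y.append(0.5 * xi + alpha + random.gauss(0, 0.1))
n = N * T

# Entity means of x and y
xbar_i = {i: sum(x[k] for k in range(n) if ids[k] == i) / T for i in range(N)}
ybar_i = {i: sum(y[k] for k in range(n) if ids[k] == i) / T for i in range(N)}
xbar = [xbar_i[i] for i in ids]

# Within (one-way FE) estimator: slope of demeaned y on demeaned x
xd = [x[k] - xbar[k] for k in range(n)]
yd = [y[k] - ybar_i[ids[k]] for k in range(n)]
beta_within = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

def solve3(A, rhs):
    # Gauss-Jordan elimination with partial pivoting for a 3x3 system
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[r][3] / M[r][r] for r in range(3)]

# Mundlak/CRE regression: y on [1, x, xbar] via the normal equations
X = [[1.0, x[k], xbar[k]] for k in range(n)]
XtX = [[sum(r[a] * r[b] for r in X) for b in range(3)] for a in range(3)]
Xty = [sum(r[a] * yk for r, yk in zip(X, y)) for a in range(3)]
coefs = solve3(XtX, Xty)

# In a balanced panel the slope on x matches the within estimator exactly
assert round(coefs[1] - beta_within, 8) == 0
print(f'within = {beta_within:.4f}, CRE = {coefs[1]:.4f}')   # both near 0.5
```

&lt;p>This is the same equivalence the comparison table below displays for the real wage panel.&lt;/p>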
&lt;pre>&lt;code class="language-python">mundlak_vars = [&amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;expersq&amp;quot;]
for var in mundlak_vars:
    wage_df[f&amp;quot;{var}_mean&amp;quot;] = wage_df.groupby(&amp;quot;nr&amp;quot;)[var].transform(&amp;quot;mean&amp;quot;)
fit_mundlak = pf.feols(
&amp;quot;lwage ~ expersq + union + married + hours + educ + black + hisp &amp;quot;
&amp;quot;+ expersq_mean + union_mean + married_mean + hours_mean&amp;quot;,
data=wage_df, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;nr&amp;quot;}
)
print(fit_mundlak.summary())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Estimation: OLS
Dep. var.: lwage, Fixed effects: 0
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(&amp;gt;|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.276 | 0.073 | 3.798 | 0.000 | 0.133 | 0.418 |
| expersq | 0.004 | 0.000 | 13.284 | 0.000 | 0.004 | 0.005 |
| union | 0.078 | 0.019 | 4.050 | 0.000 | 0.040 | 0.116 |
| married | 0.115 | 0.017 | 6.664 | 0.000 | 0.081 | 0.149 |
| hours | -0.000 | 0.000 | -0.007 | 0.994 | -0.000 | 0.000 |
| educ | 0.094 | 0.005 | 17.295 | 0.000 | 0.083 | 0.104 |
| black | -0.140 | 0.024 | -5.930 | 0.000 | -0.187 | -0.094 |
| hisp | 0.009 | 0.019 | 0.469 | 0.639 | -0.028 | 0.045 |
| expersq_mean | -0.003 | 0.001 | -3.498 | 0.001 | -0.005 | -0.001 |
| union_mean | 0.179 | 0.037 | 4.838 | 0.000 | 0.106 | 0.251 |
| married_mean | -0.041 | 0.042 | -0.969 | 0.333 | -0.123 | 0.042 |
| hours_mean | 0.002 | 0.001 | 3.109 | 0.002 | 0.001 | 0.003 |
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>One-Way FE&lt;/th>
&lt;th>CRE&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>expersq&lt;/td>
&lt;td>0.004&lt;/td>
&lt;td>0.004&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>union&lt;/td>
&lt;td>0.078&lt;/td>
&lt;td>0.078&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>married&lt;/td>
&lt;td>0.115&lt;/td>
&lt;td>0.115&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hours&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;td>-0.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>educ&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>0.094&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>black&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>-0.140&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>hisp&lt;/td>
&lt;td>dropped&lt;/td>
&lt;td>0.009&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 6))
compare_vars = [&amp;quot;expersq&amp;quot;, &amp;quot;union&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;hours&amp;quot;, &amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;]
x = np.arange(len(compare_vars))
width = 0.25
pooled_vals = [fit_pooled.coef()[v] for v in compare_vars]
entity_vals = [fit_entity.coef()[v] if v in fit_entity.coef().index else 0 for v in compare_vars]
mundlak_vals = [fit_mundlak.coef()[v] if v in fit_mundlak.coef().index else 0 for v in compare_vars]
ax.bar(x - width, pooled_vals, width, label=&amp;quot;Pooled OLS&amp;quot;, color=STEEL_BLUE, edgecolor=DARK_NAVY)
ax.bar(x, entity_vals, width, label=&amp;quot;One-Way FE&amp;quot;, color=WARM_ORANGE, edgecolor=DARK_NAVY)
ax.bar(x + width, mundlak_vals, width, label=&amp;quot;CRE&amp;quot;, color=TEAL, edgecolor=DARK_NAVY)
ax.set_xticks(x)
ax.set_xticklabels(compare_vars, fontsize=10, rotation=15)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Pooled OLS vs One-Way FE vs CRE&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
ax.axhline(y=0, color=NEAR_BLACK, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
plt.savefig(&amp;quot;pyfixest_mundlak.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_mundlak.png" alt="Grouped bar chart comparing Pooled OLS, One-Way FE, and CRE coefficient estimates, showing CRE recovers education while matching one-way FE on time-varying variables.">&lt;/p>
&lt;p>The CRE model bridges one-way FE and pooled OLS. For time-varying variables (union, married, hours, expersq), the CRE coefficients closely match the one-way FE estimates &amp;mdash; confirming that the individual means successfully proxy for entity dummies. For time-invariant variables, CRE recovers what one-way FE cannot: education&amp;rsquo;s coefficient is 0.094 per year of schooling (a 9.4% return), and the Black wage gap is -0.140 (14.0% lower wages). These are close to the pooled OLS estimates, but now they are estimated in a framework that controls for the correlation between unobserved heterogeneity and the covariates (via the individual means).&lt;/p>
&lt;p>The CRE correction terms ($\pi$ coefficients) are informative in their own right. The &lt;code>union_mean&lt;/code> coefficient of 0.179 is large and highly significant ($p &amp;lt; 0.001$): workers with persistently higher union participation earn substantially more &lt;em>on average&lt;/em>, even after controlling for the within-person union effect (0.078). This gap &amp;mdash; 0.179 versus 0.078 &amp;mdash; is evidence of positive selection into unions: workers who join unions more often tend to have higher unobserved ability or to work in higher-paying industries. The &lt;code>hours_mean&lt;/code> coefficient (0.002, $p = 0.002$) suggests that workers who consistently work longer hours earn more per hour on average, while &lt;code>married_mean&lt;/code> is small and insignificant, indicating that selection into marriage is not strongly associated with unobserved wage determinants once other factors are controlled.&lt;/p>
&lt;p>The caveat is that CRE relies on the assumption that unobserved heterogeneity correlates with covariates &lt;em>only through their individual means&lt;/em> &amp;mdash; a stronger assumption than one-way FE, which makes no such restriction. However, this assumption is testable. The CRE correction terms provide a built-in Hausman-type test: if $\pi = 0$ jointly (all correction terms are zero), the random effects and fixed effects estimates coincide, and the simpler, more efficient random effects model suffices. In our case, the large and significant &lt;code>union_mean&lt;/code> and &lt;code>hours_mean&lt;/code> coefficients strongly reject $\pi = 0$, confirming that unobserved heterogeneity &lt;em>does&lt;/em> correlate with the covariates and that FE or CRE is needed over pooled OLS. Exercise 6 asks you to formalize this test.&lt;/p>
&lt;h3 id="118-what-fixed-effects-absorb-vs-what-survives">11.8 What fixed effects absorb vs. what survives&lt;/h3>
&lt;p>The wage panel illustrates a general principle: one-way fixed effects absorb everything about a person that does not change over the observation window. Variables that &lt;em>do&lt;/em> change over time &amp;mdash; like union status, marital status, and occupation &amp;mdash; survive the within transformation and can be estimated. The CRE/Mundlak approach (Section 11.7) partially resolves the tradeoff by recovering time-invariant coefficients. The diagram below summarizes this partition and recovery:&lt;/p>
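&lt;p>A tiny standalone check (toy data, hypothetical values) makes the partition concrete: demeaning by person zeroes out any time-invariant column, while time-varying columns retain their within variation.&lt;/p>

```python
# Toy panel: (person, educ, union) for two workers over three periods
panel = [
    (1, 12, 0), (1, 12, 1), (1, 12, 1),
    (2, 16, 0), (2, 16, 0), (2, 16, 1),
]
demeaned = {}
for col, name in [(1, 'educ'), (2, 'union')]:
    totals = {}
    for row in panel:
        totals.setdefault(row[0], []).append(row[col])
    means = {p: sum(v) / len(v) for p, v in totals.items()}
    demeaned[name] = [row[col] - means[row[0]] for row in panel]

print(demeaned['educ'])   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -- absorbed by FE
print(demeaned['union'])  # nonzero within-person variation: survives
```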
&lt;pre>&lt;code class="language-mermaid">graph LR
subgraph &amp;quot;Absorbed by One-Way FE&amp;quot;
ED[&amp;quot;&amp;lt;b&amp;gt;Education&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(time-invariant)&amp;quot;]
AB[&amp;quot;&amp;lt;b&amp;gt;Ability&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(unobserved)&amp;quot;]
RC[&amp;quot;&amp;lt;b&amp;gt;Race&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(time-invariant)&amp;quot;]
end
subgraph &amp;quot;Estimated (time-varying)&amp;quot;
UN[&amp;quot;&amp;lt;b&amp;gt;Union&amp;lt;/b&amp;gt;&amp;quot;]
MA[&amp;quot;&amp;lt;b&amp;gt;Married&amp;lt;/b&amp;gt;&amp;quot;]
OC[&amp;quot;&amp;lt;b&amp;gt;Occupation&amp;lt;/b&amp;gt;&amp;quot;]
end
subgraph &amp;quot;Recovery strategies&amp;quot;
MK[&amp;quot;&amp;lt;b&amp;gt;CRE/Mundlak&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(individual means)&amp;quot;]
end
UN --&amp;gt; W[&amp;quot;&amp;lt;b&amp;gt;Log Wage&amp;lt;/b&amp;gt;&amp;quot;]
MA --&amp;gt; W
OC --&amp;gt; W
ED -.-&amp;gt; W
AB -.-&amp;gt; W
MK -.-&amp;gt;|&amp;quot;recovers γ&amp;quot;| ED
MK -.-&amp;gt;|&amp;quot;recovers γ&amp;quot;| RC
style ED fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style AB fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style RC fill:#d97757,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style UN fill:#6a9bcc,stroke:#141413,color:#fff
style MA fill:#6a9bcc,stroke:#141413,color:#fff
style OC fill:#6a9bcc,stroke:#141413,color:#fff
style W fill:#00d4c8,stroke:#141413,color:#fff
style MK fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5
&lt;/code>&lt;/pre>
&lt;p>The dashed arrows from the orange (absorbed) variables indicate that their effects on wages are &lt;em>real&lt;/em> but &lt;em>unestimable&lt;/em> under one-way FE &amp;mdash; they are folded into each worker&amp;rsquo;s individual intercept. The solid arrows from the blue (estimated) variables show the effects we can identify: changes in union status, marital status, and occupation that occur within a worker&amp;rsquo;s career. The dark blue CRE/Mundlak node represents the recovery strategy from Section 11.7: by substituting individual means for entity dummies, we recover the coefficients $\gamma$ for education and race while producing time-varying estimates that closely match one-way FE. This partially resolves the tradeoff from Section 11.4, though at the cost of a stronger modeling assumption.&lt;/p>
&lt;h2 id="12-event-study-difference-in-differences">12. Event study: difference-in-differences&lt;/h2>
&lt;h3 id="121-staggered-treatment-adoption">12.1 Staggered treatment adoption&lt;/h3>
&lt;p>Event studies are a popular extension of fixed effects that estimate dynamic treatment effects around the time of an intervention. In a &lt;em>staggered&lt;/em> design, different groups (states, firms, individuals) receive treatment at different times &amp;mdash; for example, states adopting a minimum wage increase in different years. The standard approach uses TWFE with relative-time indicators. However, this can produce biased estimates when treatment timing varies across groups and effects are heterogeneous. The DID2S estimator (Gardner, 2022) addresses this by separating the estimation into two stages: first estimating fixed effects from untreated observations, then recovering treatment effects from the residuals. The target estimand in this design is the &lt;em>Average Treatment Effect on the Treated&lt;/em> (ATT) &amp;mdash; the average effect for units that actually received treatment.&lt;/p>
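&lt;p>The two-stage logic is easiest to see in the canonical 2x2 case. Below is a standalone sketch with made-up group-period means (not the &lt;code>df_het&lt;/code> data): stage 1 fits group and period effects from the three untreated cells, and the stage-2 residual of the lone treated cell reproduces the textbook difference-in-differences.&lt;/p>

```python
# Stage 1 fixed effects come from untreated cells only; stage 2 averages
# the residuals of treated cells. In the 2x2 case this is exact.
y = {('C', 0): 1.0, ('C', 1): 1.5,   # control group, periods 0 and 1
     ('T', 0): 2.0, ('T', 1): 3.0}   # treated group, treated in period 1

# Stage 1: normalize a_C = 0 and b_0 = 0; the three untreated cells then
# pin down the intercept, the group effect, and the period effect
mu = y[('C', 0)]            # intercept
a_T = y[('T', 0)] - mu      # group effect of T
b_1 = y[('C', 1)] - mu      # period effect of 1

# Stage 2: residual of the only treated cell is the estimated ATT
att = y[('T', 1)] - (mu + a_T + b_1)
print(att)  # 0.5

# Identical to the textbook 2x2 difference-in-differences
did = (y[('T', 1)] - y[('T', 0)]) - (y[('C', 1)] - y[('C', 0)])
assert att == did
```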
&lt;p>PyFixest provides both approaches. We use a simulated dataset with staggered treatment adoption across states:&lt;/p>
&lt;pre>&lt;code class="language-python">df_het = pd.read_csv(
&amp;quot;https://raw.githubusercontent.com/py-econometrics/pyfixest/master/pyfixest/did/data/df_het.csv&amp;quot;
)
print(f&amp;quot;DiD dataset shape: {df_het.shape}&amp;quot;)
print(f&amp;quot;Columns: {list(df_het.columns)}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">DiD dataset shape: (46500, 14)
Columns: ['unit', 'state', 'group', 'unit_fe', 'g', 'year', 'year_fe', 'treat',
'rel_year', 'rel_year_binned', 'error', 'te', 'te_dynamic', 'dep_var']
&lt;/code>&lt;/pre>
&lt;p>The event study dataset contains 46,500 observations across units nested in states, with a binary treatment indicator and relative time variable measuring periods before and after treatment onset. The &lt;code>dep_var&lt;/code> column is the outcome we want to explain, and &lt;code>rel_year&lt;/code> measures the distance in years from each unit&amp;rsquo;s treatment date (negative values are pre-treatment). This structure is typical of policy evaluation studies where different states adopt a policy at different times.&lt;/p>
&lt;h3 id="122-year-1-as-the-universal-baseline">12.2 Year −1 as the universal baseline&lt;/h3>
&lt;p>Both estimators use &lt;code>ref=-1.0&lt;/code>, setting the last pre-treatment period as the baseline. This choice is not arbitrary &amp;mdash; it is the conventional and most informative reference point for three reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Closest to treatment onset.&lt;/strong> Period −1 is the last observation before treatment begins. Using it as the baseline minimizes the extrapolation distance: we compare each period&amp;rsquo;s outcome to the most recent untreated state, rather than to some distant past.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Universal across cohorts.&lt;/strong> In staggered designs, different states adopt treatment in different calendar years. But &lt;code>rel_year = -1&lt;/code> has the same meaning for every cohort: &amp;ldquo;the last year before this group was treated.&amp;rdquo; It aligns all cohorts to a common relative-time clock, making the coefficients directly comparable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transparent parallel trends test.&lt;/strong> Pre-treatment coefficients (periods −20 through −2) measure deviations from the baseline. If these coefficients are near zero, the treated and control groups were on parallel trajectories &lt;em>before&lt;/em> treatment &amp;mdash; validating the key identifying assumption. Choosing −1 as the baseline makes this test as transparent as possible: any non-zero pre-treatment coefficient is a direct signal of differential pre-trends.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>How to read the event study plot.&lt;/strong> Each coefficient represents the difference in outcomes between treatment and control groups, relative to their difference at period −1. Pre-treatment coefficients near zero validate parallel trends. The coefficient at period 0 is the immediate treatment effect. Post-treatment coefficients show how the effect evolves over time. If we had chosen a different baseline (say, period −5), all coefficients would shift by a constant &amp;mdash; the &lt;em>shape&lt;/em> of the event study would be identical, but the levels would change. The convention of using −1 simply makes the plot easiest to interpret.&lt;/p>
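&lt;p>The baseline-shift invariance is easy to verify with hypothetical period means (standalone sketch, values made up): changing the reference period shifts every coefficient by a single constant, leaving the gaps between periods, i.e. the shape, unchanged.&lt;/p>

```python
# Hypothetical group-mean outcomes by relative year (values made up)
means = {-3: 0.10, -2: 0.00, -1: 0.05, 0: 1.30, 1: 1.80, 2: 2.20}

def event_coefs(ref):
    # Each coefficient is a period mean minus the baseline-period mean
    return {t: m - means[ref] for t, m in means.items() if t != ref}

c_ref1 = event_coefs(-1)   # conventional baseline: t = -1
c_ref3 = event_coefs(-3)   # alternative baseline: t = -3

# The gap between any two periods is identical under either baseline ...
for t in (0, 1, 2):
    gap1 = c_ref1[t] - c_ref1[-2]
    gap3 = c_ref3[t] - c_ref3[-2]
    assert round(gap1 - gap3, 12) == 0
# ... and the levels differ by one constant, means[-1] - means[-3]
assert round(c_ref3[0] - c_ref1[0], 12) == round(means[-1] - means[-3], 12)
```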
&lt;h3 id="123-twfe-vs-did2s">12.3 TWFE vs DID2S&lt;/h3>
&lt;p>We estimate event study coefficients using both TWFE and DID2S, with period -1 (the year before treatment) as the reference category. The &lt;code>i()&lt;/code> operator in PyFixest creates indicator variables for each relative year, analogous to R&amp;rsquo;s &lt;code>i()&lt;/code> function.&lt;/p>
&lt;pre>&lt;code class="language-python"># TWFE event study
fit_twfe = pf.feols(
&amp;quot;dep_var ~ i(rel_year, ref=-1.0) | state + year&amp;quot;,
data=df_het, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;state&amp;quot;},
)
# DID2S (Gardner 2022) -- two-stage estimator
fit_did2s = pf.did2s(
df_het, yname=&amp;quot;dep_var&amp;quot;,
first_stage=&amp;quot;~ 0 | state + year&amp;quot;,
second_stage=&amp;quot;~ i(rel_year, ref=-1.0)&amp;quot;,
treatment=&amp;quot;treat&amp;quot;, cluster=&amp;quot;state&amp;quot;,
)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Extract coefficients from both estimators for plotting
import re
def parse_rel_years(coef_dict, se_dict):
    years, vals, ses_list = [], [], []
    for k in coef_dict.index:
        match = re.search(r'\[T\.(-?\d+\.?\d*)\]', str(k))
        if match:
            years.append(float(match.group(1)))
            vals.append(coef_dict[k])
            ses_list.append(se_dict[k])
    return years, vals, ses_list
twfe_years, twfe_vals, twfe_ses = parse_rel_years(fit_twfe.coef(), fit_twfe.se())
did2s_years, did2s_vals, did2s_ses = parse_rel_years(fit_did2s.coef(), fit_did2s.se())
&lt;/code>&lt;/pre>
&lt;p>PyFixest stores event study coefficients with names like &lt;code>[T.-5.0]&lt;/code>, &lt;code>[T.0.0]&lt;/code>, etc. The helper function above extracts the relative year from each coefficient name and pairs it with the estimate and standard error, giving us arrays ready for plotting.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(12, 6))
offset = 0.15
ax.errorbar([y - offset for y in twfe_years], twfe_vals,
yerr=[1.96*s for s in twfe_ses],
fmt='o', color=STEEL_BLUE, capsize=3, label='TWFE')
ax.errorbar([y + offset for y in did2s_years], did2s_vals,
yerr=[1.96*s for s in did2s_ses],
fmt='s', color=WARM_ORANGE, capsize=3, label='DID2S (Gardner 2022)')
ax.axhline(y=0, color=LIGHT_TEXT, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, alpha=0.5)
ax.axvline(x=-0.5, color=LIGHT_TEXT, linewidth=1, linestyle=&amp;quot;--&amp;quot;, alpha=0.6)
ax.plot(-1, 0, 'D', color=TEAL, markersize=10, zorder=5,
label=&amp;quot;Baseline (t = −1)&amp;quot;)
ax.set_xlabel(&amp;quot;Relative Year&amp;quot;, fontsize=13)
ax.set_ylabel(&amp;quot;Coefficient Estimate&amp;quot;, fontsize=13)
ax.set_title(&amp;quot;Event Study: TWFE vs DID2S&amp;quot;, fontsize=14, fontweight=&amp;quot;bold&amp;quot;)
ax.legend(fontsize=11)
plt.savefig(&amp;quot;pyfixest_event_study.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pyfixest_event_study.png" alt="Event study plot comparing TWFE and DID2S coefficient estimates across relative years, showing flat pre-trends and rising post-treatment effects.">&lt;/p>
&lt;p>Both estimators show near-zero pre-treatment coefficients (validating the parallel trends assumption) and a sharp jump at treatment onset. The immediate treatment effect at period 0 is approximately 1.3&amp;ndash;1.4, growing steadily to about 2.8 by period 20. The TWFE estimates (blue circles) are slightly larger than DID2S (orange squares) in post-treatment periods &amp;mdash; this upward bias is the well-documented problem with TWFE under staggered adoption and heterogeneous effects. The DID2S estimator corrects this by using only untreated observations to estimate the counterfactual, producing cleaner estimates of the dynamic treatment path.&lt;/p>
&lt;h2 id="13-hypothesis-testing-wald-test">13. Hypothesis testing: Wald test&lt;/h2>
&lt;p>PyFixest supports joint hypothesis testing via &lt;a href="https://pyfixest.org/reference/estimation.feols_.Feols.wald_test.html" target="_blank" rel="noopener">Wald tests&lt;/a>, which assess whether multiple coefficients are simultaneously equal to zero. This is useful when you want to test whether a group of related variables jointly matters, not just one at a time.&lt;/p>
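&lt;p>Mechanically, the Wald statistic for $H_0: R\beta = 0$ is $W = (R\hat{\beta})'(R\hat{V}R')^{-1}(R\hat{\beta})$. The standalone sketch below uses made-up numbers (not the fit below) for a two-coefficient system with $R = I_2$, and shows why the correlation between estimates matters.&lt;/p>

```python
# Hypothetical coefficient vector and covariance matrix (made-up numbers);
# with R = I_2 the restriction is simply beta = 0
b = [0.8, -0.5]
V = [[0.010, 0.004],
     [0.004, 0.020]]

# Invert the 2x2 covariance by hand
det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
Vinv = [[V[1][1] / det, -V[0][1] / det],
        [-V[1][0] / det, V[0][0] / det]]

# Wald statistic W = b' V^{-1} b, chi-squared with 2 df under the null
W = sum(b[i] * Vinv[i][j] * b[j] for i in range(2) for j in range(2))
print(round(W, 2))       # 100.54

# With zero covariance, W collapses to the sum of squared t-statistics;
# the off-diagonal term is what a pair of individual t-tests ignores
W_diag = b[0] ** 2 / V[0][0] + b[1] ** 2 / V[1][1]
print(round(W_diag, 1))  # 76.5
```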
&lt;pre>&lt;code class="language-python">fit_wald = pf.feols(&amp;quot;Y ~ X1 + X2 | f1&amp;quot;, data=data, vcov=&amp;quot;HC1&amp;quot;)
R = np.eye(2) # Test both X1=0 and X2=0 jointly
wald_result = fit_wald.wald_test(R=R)
print(f&amp;quot;Wald test (joint null: X1=0, X2=0):&amp;quot;)
print(wald_result)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Wald test (joint null: X1=0, X2=0):
statistic 1.554006e+02
pvalue 1.110223e-16
&lt;/code>&lt;/pre>
&lt;p>The Wald test statistic is 155.4 with a p-value that is zero to machine precision (the reported $1.1 \times 10^{-16}$ is the limit of floating-point accuracy), overwhelmingly rejecting the null hypothesis that both &lt;code>X1&lt;/code> and &lt;code>X2&lt;/code> have zero effect on &lt;code>Y&lt;/code>. This joint test is more informative than individual t-tests because it accounts for the correlation between the two coefficient estimates. In practice, Wald tests are essential for testing hypotheses about groups of variables, such as whether all interaction terms or all year dummies are jointly significant.&lt;/p>
&lt;h2 id="14-wild-cluster-bootstrap">14. Wild cluster bootstrap&lt;/h2>
&lt;p>When the number of clusters is small (roughly below 50), cluster-robust standard errors can be unreliable. The &lt;em>wild cluster bootstrap&lt;/em> provides more accurate inference in this setting by simulating the distribution of the test statistic under the null hypothesis. PyFixest integrates with the &lt;code>wildboottest&lt;/code> package to make this straightforward:&lt;/p>
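&lt;p>Before the pyfixest call, here is a standalone sketch of the core idea on simulated data. It is a deliberately simplified version (unstudentized, Rademacher weights, regression through the origin for brevity), not what the &lt;code>wildboottest&lt;/code> package implements in full: impose the null, flip the restricted residuals by cluster-level &amp;plusmn;1 weights, re-estimate, and compare the observed statistic against the simulated distribution.&lt;/p>

```python
import random

random.seed(1)

# Simulated clustered data: 20 clusters of 10, true slope 1 on x,
# plus a cluster-level shock (hypothetical setup, not the data above)
G, n_per = 20, 10
xs, ys, gs = [], [], []
for g in range(G):
    shock = random.gauss(0, 0.5)
    for _ in range(n_per):
        xi = random.gauss(0, 1)
        xs.append(xi)
        gs.append(g)
        ys.append(1.0 * xi + shock + random.gauss(0, 0.5))

def slope(y):
    # OLS slope through the origin (no intercept, for brevity)
    return sum(a * b for a, b in zip(xs, y)) / sum(a * a for a in xs)

obs = slope(ys)

# Wild cluster bootstrap imposing the null slope = 0: the restricted
# residuals are y itself; each draw flips them by a cluster-level sign
reps, extreme = 999, 0
for _ in range(reps):
    w = {g: random.choice([-1.0, 1.0]) for g in range(G)}
    y_star = [yi * w[g] for yi, g in zip(ys, gs)]
    if abs(slope(y_star)) >= abs(obs):
        extreme += 1
p_boot = extreme / reps
print(f'slope = {obs:.3f}, bootstrap p = {p_boot:.3f}')
```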
&lt;pre>&lt;code class="language-python">fit_boot = pf.feols(&amp;quot;Y ~ X1 | group_id&amp;quot;, data=data, vcov={&amp;quot;CRV1&amp;quot;: &amp;quot;group_id&amp;quot;})
boot_result = fit_boot.wildboottest(param=&amp;quot;X1&amp;quot;, reps=999, seed=42)
print(boot_result)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">param X1
t value -8.616818459577098
Pr(&amp;gt;|t|) 0.0
bootstrap_type 11
inference CRV(group_id)
impose_null True
&lt;/code>&lt;/pre>
&lt;p>The wild bootstrap t-statistic of -8.62 and p-value of 0.0 confirm that the effect of &lt;code>X1&lt;/code> remains highly significant even under the more conservative bootstrap inference. The &lt;code>impose_null=True&lt;/code> setting means the bootstrap simulates data under the null hypothesis of no effect, which generally provides better size control in finite samples. With only ~20 groups in this dataset, the bootstrap p-value is more trustworthy than the asymptotic cluster-robust p-value.&lt;/p>
&lt;h2 id="15-discussion">15. Discussion&lt;/h2>
&lt;p>This tutorial posed a simple question: how do unobserved group-level characteristics bias regression estimates, and how can we account for them? The answer, demonstrated across multiple settings, is that fixed effects regression removes this bias by focusing on within-group variation only.&lt;/p>
&lt;p>The synthetic data showed that OLS estimates shift from -1.000 to -1.019 when absorbing group fixed effects &amp;mdash; a modest change in this controlled setting, but one that demonstrates the mechanism. The real-world wage panel told a more dramatic story: the union wage premium dropped from 18.3% (pooled OLS) to 7.3% (two-way FE), revealing that more than half of the apparent union premium reflects worker selection rather than a genuine union effect. This has direct implications for labor economists and policymakers: overestimating the union premium leads to overestimating the economic impact of declining unionization.&lt;/p>
&lt;p>Framing the wage panel through the Mincer equation (Section 11.3) provided a unifying thread for the entire analysis. The classic Mincer specification &amp;mdash; log wages as a function of education, experience, and experience squared &amp;mdash; is the starting point for virtually all empirical wage research. By extending it with additional controls and then progressively adding fixed effects, we traced a clear arc from pooled cross-sectional estimation to panel methods that account for unobserved heterogeneity. The within-versus-between decomposition (Section 11.2) made this arc concrete: education has zero within-worker variation, so one-way FE cannot estimate its effect, while variables like union status and marital status have substantial within-worker variation and can be identified.&lt;/p>
&lt;p>The wage panel also highlighted a fundamental tradeoff in fixed effects estimation: the very mechanism that removes ability bias &amp;mdash; absorbing all time-invariant individual characteristics &amp;mdash; also prevents estimation of time-invariant variables like education. This is not a limitation to be worked around but a defining feature of the method. The CRE/Mundlak approach (Section 11.7) offers a principled resolution: by including individual means of time-varying variables as additional regressors, it proxies for the unobserved heterogeneity that one-way FE would absorb, recovering education&amp;rsquo;s coefficient (0.094 per year of schooling) while producing time-varying estimates that closely match one-way FE. The key assumption &amp;mdash; that unobserved heterogeneity correlates with covariates only through their individual means &amp;mdash; is stronger than FE&amp;rsquo;s assumption of no time-varying confounding, but it is the price of recovering time-invariant effects.&lt;/p>
&lt;p>The three-way FE extension (adding occupation fixed effects) showed that occupation sorting explains negligible additional wage variation beyond individual and time effects, confirming that the dominant source of wage heterogeneity is persistent individual characteristics. The group-specific time trends analysis (Section 11.6) showed that allowing Black and non-Black workers to have different year effects produces estimates nearly identical to standard TWFE, supporting the common trends assumption in this particular panel. This is a useful diagnostic in practice: if group-specific trends substantially change the coefficients, the researcher should worry about whether the standard TWFE results are confounded by differential macro trends.&lt;/p>
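&lt;p>The group-specific trends diagnostic boils down to interacting year dummies with a group indicator and checking whether the coefficient of interest moves. A numpy sketch on simulated data where common trends hold by construction, so the two estimates should agree, as they did in the wage panel:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 200, 8
group = np.arange(n) % 2                  # binary indicator, e.g. `black`
year_fx = np.linspace(0, 0.4, t)          # a common macro trend
x = rng.normal(0, 1, (n, t))
y = 0.3 * x + year_fx[None, :] + rng.normal(0, 0.2, (n, t))

def slope(y, x, dummies):
    """OLS slope on x, controlling for an intercept and the given dummies."""
    Z = np.column_stack([np.ones(y.size), x.ravel(), dummies])
    return np.linalg.lstsq(Z, y.ravel(), rcond=None)[0][1]

years = np.tile(np.eye(t)[:, 1:], (n, 1))  # year dummies (base year dropped)
b_common = slope(y, x, years)
# interaction: each group gets its own set of year effects
g = np.repeat(group, t)[:, None]
b_specific = slope(y, x, np.column_stack([years, years * g]))
# b_common and b_specific agree when common trends hold
```

&lt;p>In real data, a large gap between the two estimates is the warning sign described above: the groups are on different macro trajectories, and standard TWFE may be confounded.&lt;/p>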
&lt;p>PyFixest makes the entire workflow &amp;mdash; from simple OLS through two-way FE, IV, CRE/Mundlak, and event studies &amp;mdash; accessible with a concise formula syntax. The ability to estimate multiple specifications in one call (&lt;code>csw0&lt;/code>) and compare inference methods (iid, HC1, CRV1, CRV3, wild bootstrap) means researchers can quickly build a comprehensive picture of how sensitive their results are to modeling choices.&lt;/p>
&lt;h2 id="16-summary-and-next-steps">16. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Fixed effects remove group-level confounding.&lt;/strong> In the wage panel, individual FE reduced the apparent union premium from 18.3% to 7.8%, revealing that over half the raw premium reflects selection on unobserved ability. Without FE, policy conclusions about unionization would be substantially biased.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The within-between decomposition diagnoses what FE can estimate.&lt;/strong> Decomposing each variable&amp;rsquo;s variation into between-worker and within-worker components reveals which coefficients survive one-way FE. Education has zero within variation and is absorbed; union status and marital status have substantial within shares (64% and 65%) and can be estimated. This diagnostic should precede any panel analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The Mincer equation provides a unifying framework for wage regressions.&lt;/strong> Framing the analysis through the classic Mincer specification &amp;mdash; and its extensions to panel data &amp;mdash; makes the progression from pooled OLS to one-way FE to CRE/Mundlak a coherent arc rather than a collection of ad hoc specifications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Standard errors matter as much as point estimates.&lt;/strong> Clustering standard errors inflated the SE on &lt;code>X1&lt;/code> by 50% compared to iid errors (0.1247 vs 0.0833). With weaker effects, this difference could flip a result from significant to insignificant &amp;mdash; always cluster at the appropriate level.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple specifications are a robustness check, not a fishing exercise.&lt;/strong> The coefficient on &lt;code>X1&lt;/code> remained stable around -1.0 across no FE, one-way FE, and two-way FE. In the wage panel, the union premium stabilized at 7.3&amp;ndash;7.8% across one-way FE, two-way FE, three-way FE, and group-specific time trends &amp;mdash; strong evidence that these estimates are robust.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Group-specific time trends test the common trends assumption.&lt;/strong> Allowing Black and non-Black workers to have different year effects produced estimates nearly identical to standard TWFE, supporting the assumption that both groups faced similar macroeconomic trends during 1980&amp;ndash;1987. When this test fails, standard TWFE results may be unreliable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>One-way FE cannot estimate time-invariant effects, but CRE can recover them.&lt;/strong> Education was silently dropped from the one-way FE model because the within transformation reduces any constant variable to zero. The CRE model partially resolves this tradeoff by substituting individual means of time-varying variables for entity dummies, recovering education&amp;rsquo;s coefficient (0.094 per year) while producing time-varying estimates that match one-way FE. The cost is a stronger modeling assumption &amp;mdash; that unobserved heterogeneity correlates with covariates only through their individual means.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>TWFE event studies can be biased with staggered adoption.&lt;/strong> The DID2S estimator produced cleaner estimates by separating counterfactual estimation from treatment effect recovery. When treatment timing varies, always compare TWFE with a robust alternative like DID2S.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The event study baseline is not arbitrary.&lt;/strong> Setting &lt;code>ref=-1&lt;/code> (the last pre-treatment period) is the convention because it provides the most transparent test of parallel trends and minimizes extrapolation from the baseline to treatment onset. All cohorts in a staggered design share this reference point, making it the natural common clock.&lt;/p>
&lt;/li>
&lt;/ol>
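&lt;p>Takeaway 4 deserves a concrete sketch. The CRV1 (Liang&amp;ndash;Zeger) sandwich estimator behind clustered standard errors fits in a few lines of numpy; the simulation below builds cluster-correlated errors and a cluster-constant regressor so the clustered SE is visibly larger than the iid one. This is a generic illustration, not the post&amp;rsquo;s data:&lt;/p>

```python
import numpy as np

def crv1_se(X, resid, cluster):
    """Cluster-robust (CRV1) standard errors for OLS.

    meat = sum over clusters g of (X_g' u_g)(X_g' u_g)'; the small-sample
    factor G/(G-1) * (N-1)/(N-k) matches common software defaults.
    """
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    labels = np.unique(cluster)
    for g in labels:
        s = X[cluster == g].T @ resid[cluster == g]
        meat += np.outer(s, s)
    G = len(labels)
    adj = G / (G - 1) * (n - 1) / (n - k)
    return np.sqrt(np.diag(adj * bread @ meat @ bread))

rng = np.random.default_rng(3)
G, m = 40, 25
cluster = np.repeat(np.arange(G), m)
x = np.repeat(rng.normal(0, 1, G), m)      # regressor constant within cluster
X = np.column_stack([np.ones(G * m), x])
y = 1.0 * x + np.repeat(rng.normal(0, 1, G), m) + rng.normal(0, 0.3, G * m)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
se_cluster = crv1_se(X, resid, cluster)
sigma2 = (resid @ resid) / (G * m - 2)
se_iid = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
# se_cluster[1] is several times se_iid[1]: iid inference is far too optimistic
```

&lt;p>The more the regressor and the errors co-move within clusters, the larger the gap &amp;mdash; which is why clustering at the right level can flip a marginal result.&lt;/p>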
&lt;p>&lt;strong>Limitations:&lt;/strong> Fixed effects only remove time-invariant confounders. If a relevant confounder changes over time within groups, FE cannot address it. Additionally, FE estimation discards all between-group variation, which reduces statistical power and makes it impossible to estimate the effects of time-invariant variables &amp;mdash; as we saw directly in Section 11.2, where education&amp;rsquo;s within share was exactly zero. CRE offers a partial resolution, but its assumption that unobserved heterogeneity correlates with covariates only through individual means may not hold in all settings &amp;mdash; if ability correlates with the &lt;em>trajectory&lt;/em> of union membership rather than its mean, the CRE estimates would still be biased. The group-specific time trends test (Section 11.6) is a useful diagnostic but is not definitive: passing it does not prove that common trends hold, only that the data are consistent with the assumption along the dimension tested. Finally, the datasets here are synthetic or well-studied &amp;mdash; in messy real-world data, the parallel trends assumption underlying event studies may not hold.&lt;/p>
&lt;p>&lt;strong>Next steps:&lt;/strong> The CRE/Mundlak approach demonstrated in Section 11.7 can be extended in several directions: Wooldridge (2010, Ch. 10) develops the correlated random effects framework more formally, including CRE probit and tobit models for limited dependent variables. Hausman-Taylor estimation offers an alternative strategy for recovering time-invariant coefficients under different identifying assumptions. Beyond the wage panel, explore PyFixest&amp;rsquo;s support for Poisson regression (&lt;code>pf.fepois&lt;/code>) for count data, quantile regression (&lt;code>pf.quantreg&lt;/code>) for distributional effects, and the &lt;code>pf.event_study()&lt;/code> common API for streamlined event study estimation with multiple estimators. For more advanced inference, investigate randomization inference via &lt;code>fit.ritest()&lt;/code> and multiple testing corrections with &lt;code>pf.bonferroni()&lt;/code> and &lt;code>pf.rwolf()&lt;/code>.&lt;/p>
&lt;h2 id="17-exercises">17. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Varying the clustering level.&lt;/strong> Re-estimate the one-way FE model (&lt;code>Y ~ X1 | group_id&lt;/code>) with different clustering variables: &lt;code>f1&lt;/code>, &lt;code>f2&lt;/code>, and &lt;code>f3&lt;/code>. How do the standard errors change? Which clustering level produces the most conservative inference, and why?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weak instruments.&lt;/strong> Modify the IV specification to use only &lt;code>Z1&lt;/code> as an instrument (instead of both &lt;code>Z1&lt;/code> and &lt;code>Z2&lt;/code>). How does the first-stage F-statistic change? How does the IV coefficient and its standard error respond to the weaker first stage?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CRE with additional means.&lt;/strong> In Section 11.7, we included individual means only for the time-varying regressors. What happens if you also include year fixed effects alongside the CRE correction terms (i.e., add &lt;code>| year&lt;/code> to the CRE specification)? Do the time-varying coefficients shift closer to the TWFE estimates? Does the education coefficient change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Group-specific trends by other dimensions.&lt;/strong> Section 11.6 allowed year effects to vary by race (&lt;code>black&lt;/code>). Repeat this analysis using &lt;code>hisp&lt;/code> instead, or using a union-status interaction (&lt;code>C(year):C(union)&lt;/code>). Do the results differ from the standard TWFE specification? What does this tell you about the common trends assumption along different group dimensions?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Within-between decomposition on new data.&lt;/strong> Download a panel dataset of your choice (e.g., Penn World Table, World Development Indicators) and compute the within-versus-between decomposition for all variables. Which variables have the highest within share? What does this predict about which coefficients will survive one-way FE? Verify by estimating both pooled OLS and one-way FE models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hausman test via CRE.&lt;/strong> The CRE model provides a simple Hausman-type test: if the coefficients on the individual means ($\bar{X}_i$) are jointly zero, then pooled OLS and one-way FE yield the same estimates, and random effects is efficient. Test whether the four CRE correction terms (union_mean, married_mean, hours_mean, expersq_mean) are jointly significant using a Wald test. What does the result imply about the choice between random effects and fixed effects for this panel?&lt;/p>
&lt;/li>
&lt;/ol>
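&lt;p>As a starting point for exercise 6, a generic Wald test helper; the numbers below are made up for illustration and are &lt;em>not&lt;/em> the wage-panel estimates:&lt;/p>

```python
import numpy as np

def wald_joint_test(beta, vcov, idx):
    """Wald statistic for H0: the coefficients in idx are jointly zero.

    W = b' V^{-1} b, with b the selected coefficients and V their
    variance block; under H0, W is chi-squared with len(idx) df.
    """
    b = np.asarray(beta)[idx]
    V = np.asarray(vcov)[np.ix_(idx, idx)]
    return float(b @ np.linalg.solve(V, b))

# hypothetical coefficients and (diagonal) vcov for the four mean terms
beta = np.array([0.05, 0.02, 0.10, 0.03])
vcov = np.diag(np.array([0.01, 0.02, 0.01, 0.02]) ** 2)
W = wald_joint_test(beta, vcov, [0, 1, 2, 3])
# compare W to the chi-squared(4) 5% critical value, 9.488
```

&lt;p>With real estimates, take beta and vcov from the fitted CRE model; rejecting the null is evidence against random effects and in favor of FE (or CRE).&lt;/p>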
&lt;h2 id="18-references">18. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="http://scorreia.com/research/hdfe.pdf" target="_blank" rel="noopener">Correia, S. (2016). A Feasible Estimator for Linear Models with Multi-Way Fixed Effects. Working Paper.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2021.10.004" target="_blank" rel="noopener">Gardner, J. (2022). Two-Stage Differences in Differences. Journal of Econometrics.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/py-econometrics/pyfixest" target="_blank" rel="noopener">Fischer, A. and Schar, S. (2024). PyFixest: Fast High-Dimensional Fixed Effects Estimation in Python.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/quickstart.html" target="_blank" rel="noopener">PyFixest Documentation &amp;ndash; Quickstart Guide.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/%28SICI%291099-1255%28199803/04%2913:2%3c163::AID-JAE460%3e3.0.CO;2-Y" target="_blank" rel="noopener">Vella, F. and Verbeek, M. (1998). Whose Wages Do Unions Raise? A Dynamic Model of Unionism and Wage Rate Determination for Young Men. Journal of Applied Econometrics.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.3368/jhr.50.2.317" target="_blank" rel="noopener">Cameron, A.C. and Miller, D.L. (2015). A Practitioner&amp;rsquo;s Guide to Cluster-Robust Inference. Journal of Human Resources.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.nber.org/books-and-chapters/schooling-experience-and-earnings" target="_blank" rel="noopener">Mincer, J. (1974). &lt;em>Schooling, Experience, and Earnings.&lt;/em> Columbia University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1913646" target="_blank" rel="noopener">Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. &lt;em>Econometrica&lt;/em>, 46(1), 69&amp;ndash;85.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mitpress.mit.edu/9780262232586/" target="_blank" rel="noopener">Wooldridge, J.M. (2010). &lt;em>Econometric Analysis of Cross Section and Panel Data.&lt;/em> 2nd ed. MIT Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/00401706.2013.806694" target="_blank" rel="noopener">Olea, J.L.M. and Pflueger, C. (2013). A Robust Test for Weak Instruments. Journal of Business &amp;amp; Economic Statistics.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the post may still contain errors, so apply its contents to real research projects with caution.&lt;/p></description></item></channel></rss>