<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>FWL Theorem | Carlos Mendez</title><link>https://carlos-mendez.org/category/fwl-theorem/</link><atom:link href="https://carlos-mendez.org/category/fwl-theorem/index.xml" rel="self" type="application/rss+xml"/><description>FWL Theorem</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>FWL Theorem</title><link>https://carlos-mendez.org/category/fwl-theorem/</link></image><item><title>Visualizing Regression with the FWL Theorem in R</title><link>https://carlos-mendez.org/post/r_fwlplot/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_fwlplot/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This is perhaps the most common question in applied regression &amp;mdash; and one of the hardest to answer intuitively. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship that lives in multidimensional space and cannot be directly plotted on a 2D scatter plot. Or can it?&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> provides the answer. It says that the coefficient on any variable in a multiple regression equals the slope from a simple bivariate regression &amp;mdash; after first &amp;ldquo;partialling out&amp;rdquo; the other variables from both the outcome and the variable of interest. Partialling out means regressing a variable on the controls and keeping only the leftover (residual) variation &amp;mdash; the part that the controls cannot explain. This means we &lt;em>can&lt;/em> visualize any regression coefficient as a 2D scatter plot, as long as we first remove the influence of the controls from both axes.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">fwlplot&lt;/a> R package (Butts &amp;amp; McDermott, 2024) turns this into a one-liner. It uses the same formula syntax as &lt;a href="https://lrberge.github.io/fixest/reference/feols.html" target="_blank" rel="noopener">&lt;code>fixest::feols()&lt;/code>&lt;/a> &amp;mdash; including the &lt;code>|&lt;/code> operator for fixed effects &amp;mdash; and produces a scatter plot of the residualized data with the regression line overlaid. The result is a visual answer to &amp;ldquo;what does controlling for X look like?&amp;rdquo;&lt;/p>
&lt;p>This tutorial builds intuition progressively. We start with simulated data where we &lt;em>know&lt;/em> the true effect, show how confounding creates a misleading picture, and use &lt;code>fwl_plot()&lt;/code> to reveal the truth. We then extend to real data with high-dimensional fixed effects &amp;mdash; first flights data (controlling for origin and destination airports) and then panel wage data (controlling for unobserved individual ability).&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>State the FWL theorem and explain its geometric intuition&lt;/li>
&lt;li>Use &lt;code>fwl_plot()&lt;/code> to visualize a bivariate relationship before and after controlling for confounders&lt;/li>
&lt;li>Demonstrate that manual FWL residualization reproduces &lt;code>feols()&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize what fixed effects &amp;ldquo;do&amp;rdquo; to data by comparing raw vs. residualized scatter plots&lt;/li>
&lt;li>Apply &lt;code>fwl_plot()&lt;/code> to real panel data with high-dimensional fixed effects&lt;/li>
&lt;li>Connect FWL to omitted variable bias and Simpson&amp;rsquo;s paradox&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Simulated&amp;lt;br/&amp;gt;Data&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;fwl_plot()&amp;lt;br/&amp;gt;Naive vs. FWL&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights Data&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;ggplot2&amp;lt;br/&amp;gt;&amp;amp; Recipe&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>fwl_plot()&lt;/code> first, then peek under the hood with manual FWL verification. From there we apply the same one-liner to increasingly complex real-world settings.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-and-load-packages">3.1 Install and load packages&lt;/h3>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;fwlplot&amp;quot;, &amp;quot;fixest&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;patchwork&amp;quot;,
&amp;quot;nycflights13&amp;quot;, &amp;quot;wooldridge&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
library(fwlplot)
library(fixest)
library(ggplot2)
library(patchwork)
library(nycflights13)
library(wooldridge)
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>fwlplot&lt;/code> package provides the &lt;code>fwl_plot()&lt;/code> function for FWL-residualized scatter plots. It is built on &lt;code>fixest&lt;/code>, which handles the residualization computation using fast demeaning algorithms. The &lt;code>patchwork&lt;/code> package lets us combine multiple ggplot2 plots side by side. The &lt;code>nycflights13&lt;/code> and &lt;code>wooldridge&lt;/code> packages provide the real datasets we will use later.&lt;/p>
&lt;h3 id="32-simulated-confounding-data">3.2 Simulated confounding data&lt;/h3>
&lt;p>To build intuition, we simulate a retail scenario where a store manager wants to know whether distributing coupons increases sales. The catch: &lt;strong>income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales. This creates a spurious negative correlation between coupons and sales, even though coupons genuinely boost sales.&lt;/p>
&lt;p>The causal structure looks like this:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income opens a &amp;ldquo;backdoor path&amp;rdquo; from coupons to sales: coupons ← income → sales. Unless we block this path by controlling for income, the naive estimate will be biased. The data generating process is:&lt;/p>
&lt;p>$$\text{income} \sim N(50, 10)$$&lt;/p>
&lt;p>$$\text{coupons} = 60 - 0.5 \times \text{income} + \epsilon_1, \quad \epsilon_1 \sim N(0, 5)$$&lt;/p>
&lt;p>$$\text{sales} = 10 + 0.2 \times \text{coupons} + 0.3 \times \text{income} + \epsilon_2, \quad \epsilon_2 \sim N(0, 3)$$&lt;/p>
&lt;p>In words, the true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>: each additional coupon increases sales by 0.2 units. But because income negatively drives coupons ($-0.5$) and positively drives sales ($+0.3$), a naive regression of sales on coupons alone will confound the coupon effect with the income effect, producing a biased estimate. The noise terms $\epsilon_1$ and $\epsilon_2$ correspond to the &lt;code>rnorm()&lt;/code> calls in the code below.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(42)
n &amp;lt;- 200
income &amp;lt;- rnorm(n, mean = 50, sd = 10)
dayofweek &amp;lt;- sample(1:7, n, replace = TRUE)
coupons &amp;lt;- 60 - 0.5 * income + rnorm(n, 0, 5)
sales &amp;lt;- 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rnorm(n, 0, 3)
store_data &amp;lt;- data.frame(
sales = round(sales, 2),
coupons = round(coupons, 2),
income = round(income, 2),
dayofweek = dayofweek
)
head(store_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income dayofweek
1 40.02 27.79 63.71 4
2 31.37 34.03 44.35 5
3 31.30 28.01 53.63 6
4 34.37 28.68 56.33 4
5 42.62 35.91 54.04 5
6 39.50 33.45 48.94 4
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">round(cor(store_data[, c(&amp;quot;sales&amp;quot;, &amp;quot;coupons&amp;quot;, &amp;quot;income&amp;quot;)]), 3)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income
sales 1.000 -0.166 0.500
coupons -0.166 1.000 -0.709
income 0.500 -0.709 1.000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true causal effect is positive (+0.2). This is because income is strongly negatively correlated with coupons (-0.709) and strongly positively correlated with sales (0.500). A naive analysis would conclude that coupons hurt sales &amp;mdash; a classic instance of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>, where the direction of an association reverses when a confounding variable is accounted for.&lt;/p>
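&lt;p>To see the reversal in the raw data itself, we can hold income roughly fixed by splitting stores into income terciles and computing the coupons&amp;ndash;sales correlation within each group (a quick sketch, not part of the main analysis; the &lt;code>income_group&lt;/code> variable is created here only for illustration). Because income still varies somewhat within each tercile, the within-group correlations should move toward the true positive effect without reaching it exactly:&lt;/p>
&lt;pre>&lt;code class="language-r"># Split stores into income terciles and correlate within groups
store_data$income_group &amp;lt;- cut(
  store_data$income,
  breaks = quantile(store_data$income, probs = c(0, 1/3, 2/3, 1)),
  labels = c(&amp;quot;low&amp;quot;, &amp;quot;mid&amp;quot;, &amp;quot;high&amp;quot;),
  include.lowest = TRUE
)
sapply(split(store_data, store_data$income_group),
       function(d) round(cor(d$sales, d$coupons), 3))
&lt;/code>&lt;/pre>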
&lt;h2 id="4-fwl_plot-in-action-naive-vs-controlled">4. fwl_plot() in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest way to see why confounding is dangerous: plot the raw relationship with &lt;code>fwl_plot()&lt;/code>. When no controls are specified, &lt;code>fwl_plot()&lt;/code> produces a standard scatter plot with a regression line:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.019$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The store manager who trusts this analysis would cancel the coupon program, losing real revenue.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-line-of-code">4.2 Controlling for income: one line of code&lt;/h3>
&lt;p>Now watch what happens when we add &lt;code>income&lt;/code> as a control &amp;mdash; just add it to the formula:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$) &amp;mdash; close to the true value of +0.2. The &lt;code>fwl_plot()&lt;/code> function residualized both coupons and sales on income behind the scenes, then plotted the residuals. The figure below shows both panels side by side:&lt;/p>
&lt;p>&lt;img src="r_fwlplot_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope; after FWL residualization on income (right), the slope reverses to positive">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales (a downward slope). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes. Once income is partialled out, the true positive effect of coupons emerges clearly. This is what &amp;ldquo;controlling for income&amp;rdquo; looks like geometrically &amp;mdash; and &lt;code>fwl_plot()&lt;/code> produces it in a single line.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>The &lt;code>fixest::feols()&lt;/code> function produces the same coefficient, confirmed by &lt;code>etable()&lt;/code> for side-by-side comparison:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_naive &amp;lt;- feols(sales ~ coupons, data = store_data)
fe_full &amp;lt;- feols(sales ~ coupons + income, data = store_data)
etable(fe_naive, fe_full, headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;Controlled&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full
Naive Controlled
Dependent Var.: sales sales
Constant 36.93*** (1.397) 11.34*** (3.008)
coupons -0.0934* (0.0393) 0.2123*** (0.0467)
income 0.3004*** (0.0325)
_______________ _________________ __________________
S.E. type IID IID
Observations 200 200
R2 0.02768 0.32148
Adj. R2 0.02277 0.31459
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3. Every number in this table corresponds to a visual feature of the &lt;code>fwl_plot()&lt;/code> scatter plots above.&lt;/p>
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be stated as a simple recipe. Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group. Similarly, FWL compares how many more or fewer coupons a store received &lt;em>for its income level&lt;/em> against how much higher or lower its sales were &lt;em>for its income level&lt;/em>.&lt;/p>
&lt;p>The three steps are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Regress sales on income&lt;/strong>, save the residuals (the part of sales that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress coupons on income&lt;/strong>, save the residuals (the part of coupons that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress the sales residuals on the coupon residuals&lt;/strong> &amp;mdash; the slope is the coupon coefficient&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-r"># Step 1: Residualize sales on income
resid_y &amp;lt;- resid(lm(sales ~ income, data = store_data))
# Step 2: Residualize coupons on income
resid_x &amp;lt;- resid(lm(coupons ~ income, data = store_data))
# Step 3: Regress residuals on residuals
fwl_manual &amp;lt;- lm(resid_y ~ resid_x)
# Compare coefficients
cat(&amp;quot;feols coefficient: &amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;], 6), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Manual FWL coefficient:&amp;quot;, round(coef(fwl_manual)[&amp;quot;resid_x&amp;quot;], 6), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">feols coefficient: 0.212288
Manual FWL coefficient: 0.212288
&lt;/code>&lt;/pre>
&lt;p>The coefficients match to six decimal places. This is not an approximation &amp;mdash; it is an exact algebraic identity. Every time you run a multiple regression, the software is implicitly performing these three steps for each coefficient.&lt;/p>
&lt;h3 id="52-the-formal-theorem">5.2 The formal theorem&lt;/h3>
&lt;p>For those who want the math, the FWL theorem states that in the regression $Y = X_1 \beta_1 + X_2 \beta_2 + \epsilon$, the coefficient $\hat{\beta}_1$ equals:&lt;/p>
&lt;p>$$\hat{\beta}_1 = (\tilde{X}_1' \tilde{X}_1)^{-1} \tilde{X}_1' \tilde{Y}, \quad \text{where} \quad \tilde{Y} = M_{X_2} Y, \quad \tilde{X}_1 = M_{X_2} X_1$$&lt;/p>
&lt;p>Here $M_{X_2} = I - X_2(X_2'X_2)^{-1}X_2'$ is the &amp;ldquo;residual-maker&amp;rdquo; matrix that projects out the effect of $X_2$. In our example, $Y$ is &lt;code>sales&lt;/code>, $X_1$ is &lt;code>coupons&lt;/code>, and $X_2$ is &lt;code>income&lt;/code>. The tilded variables $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from the &lt;code>resid()&lt;/code> calls above.&lt;/p>
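&lt;p>The matrix algebra can be sketched directly in R using only objects defined above (the names &lt;code>X2&lt;/code>, &lt;code>M&lt;/code>, and the tilded variables are ours): build the residual-maker matrix for income, residualize both sides, and recover the coupon coefficient:&lt;/p>
&lt;pre>&lt;code class="language-r"># Residual-maker matrix M = I - X2 (X2'X2)^{-1} X2'
X2 &amp;lt;- cbind(1, store_data$income)    # controls: intercept + income
M &amp;lt;- diag(nrow(store_data)) - X2 %*% solve(crossprod(X2)) %*% t(X2)
Y_tilde &amp;lt;- M %*% store_data$sales     # same as resid(lm(sales ~ income))
X1_tilde &amp;lt;- M %*% store_data$coupons  # same as resid(lm(coupons ~ income))
# beta_1 = (X1~' X1~)^{-1} X1~' Y~
solve(crossprod(X1_tilde), crossprod(X1_tilde, Y_tilde))
&lt;/code>&lt;/pre>
&lt;p>The result matches the &lt;code>feols()&lt;/code> coefficient on coupons (0.212288), since the projection is exactly what the &lt;code>resid()&lt;/code> calls compute.&lt;/p>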
&lt;h3 id="53-omitted-variable-bias-predicting-the-error">5.3 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding we saw is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly. When we omit income from the regression, the bias on the coupon coefficient is:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$) multiplied by the coefficient from the auxiliary regression of the omitted variable on the treatment ($\hat{\delta}$). Here $\hat{\gamma}$ is the effect of income on sales (in the full model) and $\hat{\delta}$ is the coefficient from regressing income on coupons.&lt;/p>
&lt;pre>&lt;code class="language-r">gamma_hat &amp;lt;- coef(fe_full)[&amp;quot;income&amp;quot;] # 0.3004
delta_hat &amp;lt;- coef(lm(income ~ coupons, data = store_data))[&amp;quot;coupons&amp;quot;] # approx. -1.02
ovb &amp;lt;- gamma_hat * delta_hat # -0.3057
cat(&amp;quot;OVB = gamma * delta:&amp;quot;, round(ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Naive = True + OVB:&amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;] + ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Actual naive:&amp;quot;, round(coef(fe_naive)[&amp;quot;coupons&amp;quot;], 4), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = gamma * delta: -0.3057
Naive = True + OVB: -0.0934
Actual naive: -0.0934
&lt;/code>&lt;/pre>
&lt;p>The OVB formula reproduces the naive coefficient exactly: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the auxiliary coefficient of income on coupons ($\hat{\delta} \approx -1.02$) gives a bias of $-0.306$, and true + bias $= 0.212 - 0.306 = -0.093$ matches the naive estimate. This is no coincidence: when the auxiliary regression is fit on the same sample, $\hat{\beta}_{\text{short}} = \hat{\beta}_{\text{long}} + \hat{\gamma}\hat{\delta}$ is an algebraic identity, not an approximation. The key insight: the bias is &lt;em>predictable&lt;/em>. If you know the direction of the confounder&amp;rsquo;s effects on both the treatment and the outcome, you know which way the naive estimate is biased.&lt;/p>
&lt;h3 id="54-adding-more-controls">5.4 Adding more controls&lt;/h3>
&lt;p>The FWL theorem extends naturally to any number of controls. The &lt;code>fwl_plot()&lt;/code> call handles it automatically:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_full3 &amp;lt;- feols(sales ~ coupons + income + dayofweek, data = store_data)
etable(fe_naive, fe_full, fe_full3,
headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;+ Income&amp;quot;, &amp;quot;+ Income + Day&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full fe_full3
Naive + Income + Income + Day
Dependent Var.: sales sales sales
Constant 36.93*** (1.397) 11.34*** (3.008) 9.640** (2.953)
coupons -0.0934* (0.0393) 0.2123*** (0.0467) 0.2219*** (0.0454)
income 0.3004*** (0.0325) 0.2961*** (0.0316)
dayofweek 0.4029*** (0.1095)
_______________ _________________ __________________ __________________
S.E. type IID IID IID
Observations 200 200 200
R2 0.02768 0.32148 0.36535
Adj. R2 0.02277 0.31459 0.35564
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig2_fwl_verification.png" alt="Three-panel FWL progression: no controls (left), controlling for income (center), controlling for income + day of week (right)">&lt;/p>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared jumps from 0.028 to 0.365 as we add controls. Each &lt;code>fwl_plot()&lt;/code> panel shows a tighter cloud as more variation is absorbed by the controls &amp;mdash; the residualized scatter becomes more focused on the &lt;em>coupon-specific&lt;/em> variation in sales.&lt;/p>
&lt;h2 id="6-visualizing-fixed-effects">6. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="61-what-are-fixed-effects">6.1 What are fixed effects?&lt;/h3>
&lt;p>Fixed effects are a special case of the FWL theorem applied to group dummy variables. When we include airport fixed effects in a regression, we are &amp;ldquo;partialling out&amp;rdquo; airport-specific means &amp;mdash; in other words, &lt;strong>demeaning&lt;/strong>. Demeaning means subtracting each group&amp;rsquo;s average from every observation in that group. The result is that we compare each airport to &lt;em>itself&lt;/em> rather than comparing different airports to each other.&lt;/p>
&lt;p>Think of it like a race handicap. Raw times compare runners who started at different positions. Demeaning each runner&amp;rsquo;s times converts them to &amp;ldquo;how much faster or slower than their personal average,&amp;rdquo; making the comparison fair. The FWL theorem guarantees that this demeaning procedure produces the same coefficients as including a full set of dummy variables in the regression.&lt;/p>
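&lt;p>The equivalence is easy to verify with the simulated store data, treating &lt;code>dayofweek&lt;/code> as the group (a minimal sketch; the &lt;code>_dm&lt;/code> variables are ours). Demeaning both variables by group and running a bivariate regression gives exactly the same slope as including a full set of day dummies:&lt;/p>
&lt;pre>&lt;code class="language-r"># Group demeaning vs. explicit dummies: identical slopes by FWL
d &amp;lt;- store_data
d$sales_dm &amp;lt;- d$sales - ave(d$sales, d$dayofweek)
d$coupons_dm &amp;lt;- d$coupons - ave(d$coupons, d$dayofweek)
coef(lm(sales_dm ~ coupons_dm, data = d))[&amp;quot;coupons_dm&amp;quot;]
coef(lm(sales ~ coupons + factor(dayofweek), data = d))[&amp;quot;coupons&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>The two coefficients coincide exactly, which is the FWL guarantee: demeaning is just partialling out the group dummies.&lt;/p>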
&lt;h3 id="62-flights-data-progressive-fixed-effects">6.2 Flights data: progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>nycflights13&lt;/code> dataset contains all flights that departed New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013. We ask: what is the relationship between air time and departure delay?&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;flights&amp;quot;, package = &amp;quot;nycflights13&amp;quot;)
flights_clean &amp;lt;- flights[complete.cases(flights[, c(&amp;quot;dep_delay&amp;quot;, &amp;quot;air_time&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;dest&amp;quot;)]), ]
flights_clean &amp;lt;- flights_clean[flights_clean$dep_delay &amp;lt; 120 &amp;amp; flights_clean$dep_delay &amp;gt; -30, ]
# Remove singleton origin-dest combos for stable FE estimation
od_counts &amp;lt;- table(paste(flights_clean$origin, flights_clean$dest))
flights_clean &amp;lt;- flights_clean[paste(flights_clean$origin, flights_clean$dest) %in%
names(od_counts[od_counts &amp;gt; 1]), ]
cat(&amp;quot;Observations:&amp;quot;, nrow(flights_clean), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 317578
&lt;/code>&lt;/pre>
&lt;p>To avoid overplotting, we draw a random sample of 5,000 flights; the &lt;code>fwl_plot()&lt;/code> calls below fit and plot on this subsample, while the regression tables in Section 6.3 use the full data. (Alternatively, the &lt;code>n_sample&lt;/code> argument samples only the plotted points while fitting the line on all rows.)&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(123)
flights_sample &amp;lt;- flights_clean[sample(nrow(flights_clean), 5000), ]
&lt;/code>&lt;/pre>
&lt;p>Now the power of &lt;code>fwl_plot()&lt;/code> &amp;mdash; three one-liners that progressively add fixed effects. In &lt;code>fixest&lt;/code> syntax, the &lt;code>|&lt;/code> operator separates regular covariates (left) from fixed effects (right), so &lt;code>dep_delay ~ air_time | origin + dest&lt;/code> means &amp;ldquo;regress departure delay on air time, with origin and destination fixed effects&amp;rdquo;:&lt;/p>
&lt;pre>&lt;code class="language-r"># No fixed effects
fwl_plot(dep_delay ~ air_time, data = flights_sample, ggplot = TRUE)
# Origin airport FE
fwl_plot(dep_delay ~ air_time | origin, data = flights_sample, ggplot = TRUE)
# Origin + destination FE
fwl_plot(dep_delay ~ air_time | origin + dest, data = flights_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig3_fixed_effects.png" alt="Progressive FWL plots: no FE (left), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>The visual transformation is striking. Panel A (no FE) shows a vague cloud with a nearly flat slope. Panel B (origin FE) removes the three origin-airport means, tightening the horizontal spread. Panel C (origin + destination FE) removes the 103 destination means as well, collapsing the air-time variation to &lt;em>within-route&lt;/em> deviations.&lt;/p>
&lt;h3 id="63-comparing-regression-tables">6.3 Comparing regression tables&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_none &amp;lt;- feols(dep_delay ~ air_time, data = flights_clean)
fe_origin &amp;lt;- feols(dep_delay ~ air_time | origin, data = flights_clean)
fe_both &amp;lt;- feols(dep_delay ~ air_time | origin + dest, data = flights_clean)
etable(fe_none, fe_origin, fe_both,
headers = c(&amp;quot;No FE&amp;quot;, &amp;quot;Origin FE&amp;quot;, &amp;quot;Origin + Dest FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_none fe_origin fe_both
No FE Origin FE Origin + Dest FE
Dependent Var.: dep_delay dep_delay dep_delay
air_time -0.0031*** (0.0004) -0.0061*** (0.0005) -0.0067. (0.0034)
Fixed-Effects: ------------------- ------------------- -----------------
origin No Yes Yes
dest No No Yes
_______________ ___________________ ___________________ _________________
Observations 317,578 317,578 317,578
R2 0.00016 0.00594 0.01296
Within R2 -- 0.00058 1.19e-5
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.003 (no FE), -0.006 (origin FE), -0.007 (origin + destination FE, significant at the 10% level only &amp;mdash; the &lt;code>.&lt;/code> marker indicates $p &amp;lt; 0.10$). The residualized scatter in Panel C answers a sharper question: &amp;ldquo;For flights on the &lt;em>same route&lt;/em>, does longer-than-usual air time predict higher-than-usual departure delay?&amp;rdquo; The answer is weakly negative and, with a within R2 of 1.19e-5, essentially zero in magnitude. Since departure delay is determined before takeoff, any within-route association with realized air time is better read as both variables responding to shared conditions (weather, winds, congestion) than as a causal effect of air time on delay.&lt;/p>
&lt;h2 id="7-panel-data-returns-to-experience">7. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="71-the-wage-panel">7.1 The wage panel&lt;/h3>
&lt;p>The &lt;code>wagepan&lt;/code> dataset from the Wooldridge textbook contains panel data on 545 individuals observed over 8 years (1980&amp;ndash;1987). A classic question in labor economics is: what is the return to experience?&lt;/p>
&lt;p>The challenge is &lt;strong>unobserved ability&lt;/strong>. Two people with 5 years of experience may earn very different wages because one is more talented, motivated, or well-connected. These personal traits &amp;mdash; which we cannot directly measure &amp;mdash; are the &amp;ldquo;unobserved ability&amp;rdquo; that creates omitted variable bias. More talented workers earn higher wages &lt;em>and&lt;/em> tend to accumulate experience in higher-paying jobs, so the naive correlation between experience and wages confounds ability with genuine experience effects.&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;wagepan&amp;quot;, package = &amp;quot;wooldridge&amp;quot;)
cat(&amp;quot;Observations:&amp;quot;, nrow(wagepan), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Individuals:&amp;quot;, length(unique(wagepan$nr)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, length(unique(wagepan$year)), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 4360
Individuals: 545
Years: 8
&lt;/code>&lt;/pre>
&lt;h3 id="72-pooled-ols-vs-individual-fixed-effects">7.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_pool &amp;lt;- feols(lwage ~ educ + exper + expersq, data = wagepan)
fe_fe &amp;lt;- feols(lwage ~ exper + expersq | nr, data = wagepan)
fe_twfe &amp;lt;- feols(lwage ~ exper + expersq | nr + year, data = wagepan)
etable(fe_pool, fe_fe, fe_twfe,
headers = c(&amp;quot;Pooled OLS&amp;quot;, &amp;quot;Individual FE&amp;quot;, &amp;quot;Individual + Year FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_pool fe_fe fe_twfe
Pooled OLS Individual FE Individual + Year FE
Dependent Var.: lwage lwage lwage
Constant -0.0564 (0.0639)
educ 0.1021*** (0.0047)
exper 0.1050*** (0.0102) 0.1223*** (0.0082)
expersq -0.0036*** (0.0007) -0.0045*** (0.0006) -0.0054*** (0.0007)
Fixed-Effects: ------------------- ------------------- -------------------
nr No Yes Yes
year No No Yes
_______________ ___________________ ___________________ ___________________
Observations 4,360 4,360 4,360
R2 0.14772 0.61727 0.61850
Within R2 -- 0.17270 0.01534
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. First, the &lt;code>educ&lt;/code> coefficient disappears from the individual FE column &amp;mdash; education is time-invariant for most individuals, so it is perfectly collinear with person dummies. Second, the &lt;code>exper&lt;/code> linear term disappears from the two-way FE column &amp;mdash; because experience increments by exactly one year for everyone, it is perfectly collinear with year dummies. Only &lt;code>expersq&lt;/code> (which varies non-linearly across individuals) survives.&lt;/p>
&lt;p>In the individual FE model, the experience coefficient &lt;em>increases&lt;/em> from 0.105 to 0.122. This means the within-person return to experience is larger than the pooled estimate. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation &amp;mdash; most of the &amp;ldquo;action&amp;rdquo; in wages comes from &lt;em>who you are&lt;/em>, not &lt;em>how many years you have worked&lt;/em>.&lt;/p>
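&lt;p>The individual-FE column can be reproduced by hand with the within transformation, the panel version of the demeaning from Section 6 (a sketch; the &lt;code>_dm&lt;/code> suffix is ours):&lt;/p>
&lt;pre>&lt;code class="language-r"># Within transformation: demean each variable by individual, then plain OLS
wp &amp;lt;- wagepan
wp$lwage_dm &amp;lt;- wp$lwage - ave(wp$lwage, wp$nr)
wp$exper_dm &amp;lt;- wp$exper - ave(wp$exper, wp$nr)
wp$expersq_dm &amp;lt;- wp$expersq - ave(wp$expersq, wp$nr)
coef(lm(lwage_dm ~ exper_dm + expersq_dm, data = wp))[-1]
coef(fe_fe)  # same exper and expersq point estimates
&lt;/code>&lt;/pre>
&lt;p>The point estimates coincide with the &lt;code>feols()&lt;/code> fixed-effects column (0.1223 and -0.0045); only the standard errors differ, because &lt;code>lm()&lt;/code> does not adjust the degrees of freedom for the 545 absorbed person means.&lt;/p>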
&lt;h3 id="73-visualizing-the-within-person-variation">7.3 Visualizing the within-person variation&lt;/h3>
&lt;p>Again, &lt;code>fwl_plot()&lt;/code> produces the before/after comparison in two one-liners. We sample 150 individuals for visual clarity (with 545 individuals the plot would be too dense):&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(456)
sample_ids &amp;lt;- sample(unique(wagepan$nr), 150)
wage_sample &amp;lt;- wagepan[wagepan$nr %in% sample_ids, ]
# Raw bivariate relationship
fwl_plot(lwage ~ exper, data = wage_sample, ggplot = TRUE)
# With individual fixed effects
fwl_plot(lwage ~ exper | nr, data = wage_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig4_panel_data.png" alt="Raw pooled cross-section (left) vs. individual fixed-effects residualized scatter (right) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A plots the raw bivariate relationship with a shallow slope of about 0.03. The wide fan of points reflects unobserved ability differences: individuals at the same experience level have wildly different wages. Panel B (individual FE) strips away each person&amp;rsquo;s average wage and average experience, leaving only the &lt;em>within-person&lt;/em> deviations. The slope steepens to roughly 0.12, about four times larger and in line with the full-panel individual-FE estimate of 0.122: an additional year of experience raises log wages by about 0.12 (roughly 12%) &lt;em>within the same individual&lt;/em>. The tighter cloud in Panel B shows that once we account for who each person is, the experience-wage relationship is much more precisely identified.&lt;/p>
&lt;h2 id="8-customization-and-quick-reference">8. Customization and Quick Reference&lt;/h2>
&lt;h3 id="81-ggplot2-integration">8.1 ggplot2 integration&lt;/h3>
&lt;p>The &lt;code>fwl_plot()&lt;/code> function can return a ggplot2 object by setting &lt;code>ggplot = TRUE&lt;/code>, allowing full customization with ggplot2 layers and themes. This is useful for publication-quality figures with consistent styling, faceting, or combining multiple plots with &lt;code>patchwork&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-r">p &amp;lt;- fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
fig5 &amp;lt;- p +
labs(title = &amp;quot;FWL Visualization: Coupons Effect on Sales&amp;quot;,
subtitle = &amp;quot;After residualizing on income&amp;quot;) +
theme_minimal(base_size = 13)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig5_ggplot_custom.png" alt="FWL scatter plot with ggplot2 customization showing coupons effect on sales after residualizing on income">&lt;/p>
&lt;h3 id="82-quick-reference-fwl_plot-recipes">8.2 Quick reference: fwl_plot() recipes&lt;/h3>
&lt;p>Here are the most common &lt;code>fwl_plot()&lt;/code> patterns you will use:&lt;/p>
&lt;pre>&lt;code class="language-r"># 1. Raw scatter (no controls)
fwl_plot(y ~ x, data = df)
# 2. Control for one or more variables
fwl_plot(y ~ x + control1 + control2, data = df)
# 3. Fixed effects (use | to separate)
fwl_plot(y ~ x | group_fe, data = df)
# 4. Multiple fixed effects
fwl_plot(y ~ x | fe1 + fe2, data = df)
# 5. Return ggplot2 object for customization
fwl_plot(y ~ x + control, data = df, ggplot = TRUE) + theme_minimal()
# 6. Sample points for large datasets (line uses all data)
fwl_plot(y ~ x | fe, data = big_data, n_sample = 5000)
&lt;/code>&lt;/pre>
&lt;h3 id="83-key-arguments">8.3 Key arguments&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Argument&lt;/th>
&lt;th>Purpose&lt;/th>
&lt;th>Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>formula&lt;/code>&lt;/td>
&lt;td>Same as &lt;code>feols()&lt;/code>: &lt;code>y ~ x + controls | FE&lt;/code>&lt;/td>
&lt;td>&lt;code>sales ~ coupons + income&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>data&lt;/code>&lt;/td>
&lt;td>Input data frame&lt;/td>
&lt;td>&lt;code>store_data&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ggplot&lt;/code>&lt;/td>
&lt;td>Return ggplot2 object (default: base R)&lt;/td>
&lt;td>&lt;code>ggplot = TRUE&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n_sample&lt;/code>&lt;/td>
&lt;td>Sample N points for large datasets&lt;/td>
&lt;td>&lt;code>n_sample = 5000&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>vcov&lt;/code>&lt;/td>
&lt;td>Variance-covariance specification&lt;/td>
&lt;td>&lt;code>vcov = &amp;quot;hetero&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For large datasets like the flights data (317K+ observations), the &lt;code>n_sample&lt;/code> argument is essential to avoid overplotting. The regression line is always computed on the full data &amp;mdash; only the &lt;em>plotted points&lt;/em> are sampled, so the slope is unaffected.&lt;/p>
&lt;h2 id="9-discussion">9. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it is the foundation of how modern regression software works. When &lt;code>fixest::feols()&lt;/code> estimates a model with fixed effects, it does not literally create and invert a matrix with thousands of dummy variables. Instead, it uses the FWL logic to demean the data and run OLS on the residuals. This is why &lt;code>fixest&lt;/code> can handle millions of observations with hundreds of thousands of fixed effects: the demeaning step is $O(N)$, while creating the full dummy matrix would be $O(N \times K)$.&lt;/p>
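&lt;p>The equivalence between explicit dummy variables and iterative demeaning can be sketched in a few lines of Python (an illustrative simulation of the alternating-demeaning idea; &lt;code>fixest&lt;/code>&amp;rsquo;s actual algorithm is a far more optimized variant):&lt;/p>

```python
# Illustrative two-way demeaning vs. explicit dummy matrices.
# Simulated data; group sizes and effects are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, k1, k2 = 2000, 30, 40
g1 = rng.integers(0, k1, n)                      # e.g. origin FE
g2 = rng.integers(0, k2, n)                      # e.g. destination FE
fe1, fe2 = rng.normal(size=k1), rng.normal(size=k2)
x = rng.normal(size=n) + 0.5 * fe1[g1]
y = -0.007 * x + fe1[g1] + fe2[g2] + rng.normal(0, 0.5, n)

def demean_two_way(v, iters=200):
    v = v.copy()
    for _ in range(iters):                        # alternating projections
        for g, k in ((g1, k1), (g2, k2)):
            means = np.bincount(g, weights=v, minlength=k) / np.bincount(g, minlength=k)
            v = v - means[g]
    return v

x_t, y_t = demean_two_way(x), demean_two_way(y)
beta_demean = (x_t @ y_t) / (x_t @ x_t)           # O(N) memory per pass

# Same coefficient from explicit dummy matrices: O(N x K) memory
D = np.column_stack([np.eye(k1)[g1], np.eye(k2)[g2]])
beta_dummy = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]
```

&lt;p>Both routes give the same slope, but the demeaning route never materializes the dummy matrix &lt;code>D&lt;/code>, which is what makes the approach scale.&lt;/p>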
&lt;p>As a diagnostic tool, FWL scatter plots reveal problems that regression tables hide. If the residualized scatter shows a curved relationship, your linear specification may be wrong. If it shows outliers, they may be driving the coefficient. If the cloud collapses to a near-vertical line (as in Panel C of the flights figure), the within-group variation may be too small to identify the effect reliably.&lt;/p>
&lt;p>The FWL theorem also connects to more advanced methods. &lt;strong>Double Machine Learning&lt;/strong> (Chernozhukov et al., 2018) generalizes the partialling-out idea by using machine learning models instead of linear regression to residualize the data. The Python FWL tutorial on this site takes that next step. The &lt;code>fwlplot&lt;/code> package does not do DML, but the visual intuition &amp;mdash; &amp;ldquo;look at the residualized scatter to see the conditional relationship&amp;rdquo; &amp;mdash; carries over directly.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic regression, Poisson regression, or other nonlinear models, the partialling-out logic does not hold exactly. The residualized scatter plot for a nonlinear model is at best an approximation of the conditional relationship, not an exact representation.&lt;/p>
&lt;h2 id="10-summary-and-next-steps">10. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> in our simulated data, the naive coupon coefficient was -0.093 (coupons &amp;ldquo;hurt&amp;rdquo; sales), while the true causal effect is +0.2. After controlling for income via &lt;code>fwl_plot()&lt;/code>, the estimate was +0.212, recovering the true effect.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the direction of the bias:&lt;/strong> $0.300 \times (-0.494) = -0.148$ correctly predicts the negative direction of the confounding; the full in-sample bias (naive minus controlled, $-0.093 - 0.212 = -0.305$) is recovered exactly when $\hat{\delta}$ comes from the auxiliary regression of income on coupons.&lt;/li>
&lt;li>&lt;strong>FWL is not an approximation &amp;mdash; it is an exact algebraic identity:&lt;/strong> the coefficient from partialling out controls matches &lt;code>feols()&lt;/code> to six decimal places. Every multiple regression coefficient &lt;em>can&lt;/em> be visualized as a bivariate scatter plot.&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> the flights data showed how adding origin and destination FE progressively transformed the scatter. The air-time coefficient changed from -0.003 (no FE) to -0.007 (origin + destination FE).&lt;/li>
&lt;li>&lt;strong>Panel FE reveal within-person effects:&lt;/strong> the wage data showed that controlling for individual ability via FE steepened the bivariate experience slope from 0.03 (pooled, no controls) to 0.122 (within-person), more than tripling the estimated return to experience.&lt;/li>
&lt;/ul>
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends the partialling-out logic to Double Machine Learning, and the &lt;a href="https://carlos-mendez.org/post/r_did/">R DID tutorial&lt;/a> that uses &lt;code>fixest&lt;/code> for difference-in-differences with staggered treatment adoption.&lt;/p>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Omitted variable direction.&lt;/strong> Use the OVB formula from Section 5.3 to predict what happens if you also omit &lt;code>dayofweek&lt;/code> (in addition to income). Run the naive regression &lt;code>lm(sales ~ coupons)&lt;/code> and compare the bias to $\hat{\gamma}_{income} \times \hat{\delta}_{income} + \hat{\gamma}_{day} \times \hat{\delta}_{day}$. Does the extended OVB formula still predict the direction correctly?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple controls.&lt;/strong> Use &lt;code>fwl_plot()&lt;/code> to visualize the coupon effect after controlling for both income and &lt;code>dayofweek&lt;/code>. Compare this to controlling for income alone. Does the scatter change visually? Does the coefficient change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Your own data.&lt;/strong> Pick a dataset from the &lt;code>wooldridge&lt;/code> package (e.g., &lt;code>hprice1&lt;/code>, &lt;code>wage2&lt;/code>, &lt;code>crime2&lt;/code>) and use &lt;code>fwl_plot()&lt;/code> to visualize a regression relationship before and after adding controls. Does the coefficient change substantially? Can you identify what the confounder is doing?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-datasets">12. Datasets&lt;/h2>
&lt;p>The datasets used in this tutorial are saved as CSV files in the post directory for reuse in other tutorials:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>File&lt;/th>
&lt;th>Rows&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>store_data.csv&lt;/code>&lt;/td>
&lt;td>200&lt;/td>
&lt;td>Simulated retail data (sales, coupons, income, dayofweek)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>flights_sample.csv&lt;/code>&lt;/td>
&lt;td>5,000&lt;/td>
&lt;td>Cleaned NYC flights sample (delays, air time, origin, dest)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>wagepan.csv&lt;/code>&lt;/td>
&lt;td>4,360&lt;/td>
&lt;td>Wooldridge wage panel (545 individuals, 8 years)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">Butts, K. &amp;amp; McDermott, G. (2024). fwlplot: Scatter Plot After Residualizing. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">Berge, L. (2018). fixest: Fast Fixed-Effects Estimations in R. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors; apply it to real research projects with caution.&lt;/p></description></item><item><title>Visualizing Regression with the FWL Theorem in Stata</title><link>https://carlos-mendez.org/post/stata_fwl/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_fwl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This question appears in every applied regression course, and the answer is surprisingly hard to visualize. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship in multidimensional space. This relationship cannot be directly plotted on a two-dimensional scatter. The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> changes this: it shows that the coefficient from a multiple regression equals the slope of a simple bivariate regression &amp;mdash; after first &lt;em>residualizing&lt;/em> (partialling out) the control variables from both the outcome and the variable of interest.&lt;/p>
&lt;p>The &lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">scatterfit&lt;/a> Stata package (Ahrens, 2024) makes this visual in one command. It takes a dependent variable, an independent variable, and optional controls or fixed effects, then produces a scatter plot of the residualized data with a fitted regression line. Built on &lt;code>reghdfe&lt;/code>, it handles high-dimensional fixed effects efficiently. It also offers features beyond what R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code> or Python&amp;rsquo;s manual FWL can do: &lt;strong>binned scatter plots&lt;/strong> for large datasets, &lt;strong>regression parameters printed directly on the plot&lt;/strong>, and &lt;strong>multiple fit types&lt;/strong> (linear, quadratic, lowess).&lt;/p>
&lt;p>This tutorial is the third in a trilogy &amp;mdash; see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R tutorial&lt;/a> and &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python tutorial&lt;/a> &amp;mdash; and uses the &lt;strong>same datasets&lt;/strong> for cross-language comparability. All data are loaded from GitHub URLs so the analysis is fully reproducible.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Use &lt;code>scatterfit&lt;/code> to visualize bivariate relationships with and without controls&lt;/li>
&lt;li>Demonstrate FWL residualization with &lt;code>controls()&lt;/code> and &lt;code>fcontrols()&lt;/code>&lt;/li>
&lt;li>Verify manually that FWL reproduces &lt;code>reghdfe&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize fixed effects using &lt;code>fcontrols()&lt;/code> on flights data&lt;/li>
&lt;li>Use binned scatter plots to summarize patterns in large datasets&lt;/li>
&lt;li>Show regression parameters directly on plots with &lt;code>regparameters()&lt;/code>&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Load Data&amp;lt;br/&amp;gt;from GitHub&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;Naive vs.&amp;lt;br/&amp;gt;FWL Scatter&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Binned&amp;lt;br/&amp;gt;Scatter&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>scatterfit&lt;/code>, verify manually, then apply the same tool to real flights data and panel wage data.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-packages">3.1 Install packages&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command requires &lt;code>reghdfe&lt;/code> and &lt;code>ftools&lt;/code> for high-dimensional fixed effects estimation. All packages are installed from SSC or GitHub:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install packages if not already installed
capture ssc install reghdfe, replace
capture ssc install ftools, replace
capture ssc install estout, replace
capture net install scatterfit, ///
from(&amp;quot;https://raw.githubusercontent.com/leojahrens/scatterfit/master&amp;quot;) replace
&lt;/code>&lt;/pre>
&lt;h3 id="32-load-the-simulated-store-data">3.2 Load the simulated store data&lt;/h3>
&lt;p>We load the same simulated retail dataset used in the R and Python FWL tutorials. The data are hosted on GitHub for reproducibility:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/store_data.csv&amp;quot;, clear
&lt;/code>&lt;/pre>
&lt;p>The data simulate a scenario where a store manager wants to know whether distributing coupons increases sales. &lt;strong>Income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The arrows in this diagram show causal relationships, and the numbers are the true effect sizes in the data generating process. The true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>, but income opens a &lt;strong>backdoor path&lt;/strong> &amp;mdash; an indirect route from coupons to sales that goes &lt;em>through&lt;/em> income (coupons $\leftarrow$ income $\rightarrow$ sales). Unless we block this path by controlling for income, the naive estimate will be biased downward.&lt;/p>
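&lt;p>A minimal simulation of this DAG makes the bias concrete. The Python sketch below assumes illustrative noise scales (the actual &lt;code>store_data.csv&lt;/code> was generated separately), but reproduces the qualitative pattern: a negative naive slope and a controlled estimate near the true +0.2:&lt;/p>

```python
# Sketch of the DAG above with assumed noise scales (the tutorial's
# store_data.csv was generated separately; this is illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 200
income  = rng.normal(50, 10, n)
coupons = 35 - 0.5 * (income - 50) + rng.normal(0, 3, n)   # income -> coupons
sales   = 10 + 0.2 * coupons + 0.3 * income + rng.normal(0, 2, n)

def ols(dep, *cols):
    X = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(X, dep, rcond=None)[0]

naive = ols(sales, coupons)[1]                  # biased: wrong sign
controlled = ols(sales, coupons, income)[1]     # near the true +0.2
print(naive, controlled)
```

&lt;p>With these scales the naive slope comes out negative while the controlled slope sits close to +0.2, the same qualitative pattern the actual data show below.&lt;/p>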
&lt;pre>&lt;code class="language-stata">summarize sales coupons income dayofweek
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
sales | 200 33.6747 3.811032 24.89 45.23
coupons | 200 34.85685 6.788834 18.72 53.25
income | 200 49.72545 9.745807 20.07 77.02
dayofweek | 200 3.915 1.996926 1 7
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">correlate sales coupons income
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> | sales coupons income
-------------+---------------------------
sales | 1.0000
coupons | -0.1664 1.0000
income | 0.5003 -0.7087 1.0000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true effect is positive (+0.2). Income is strongly negatively correlated with coupons (-0.709) and positively correlated with sales (0.500). A naive regression would wrongly conclude that coupons hurt sales.&lt;/p>
&lt;h2 id="4-scatterfit-in-action-naive-vs-controlled">4. scatterfit in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest &lt;code>scatterfit&lt;/code> call plots the raw relationship. The &lt;code>regparameters()&lt;/code> option prints the regression coefficient, p-value, and R-squared directly on the plot &amp;mdash; a feature unique to this Stata package:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, regparameters(coef pval r2) ///
opts(name(naive, replace) title(&amp;quot;A. Naive: No Controls&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.018$, $R^2 = 0.028$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The near-zero R-squared confirms that the naive model explains almost none of the variation in sales.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-option">4.2 Controlling for income: one option&lt;/h3>
&lt;p>Now add income as a control. In &lt;code>scatterfit&lt;/code>, the &lt;code>controls()&lt;/code> option specifies continuous variables to partial out using the FWL procedure. Behind the scenes, &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> to residualize both sales and coupons on income, then plots the residuals:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) regparameters(coef pval r2) ///
opts(name(controlled, replace) title(&amp;quot;B. FWL: Controlling for Income&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$, $R^2 = 0.32$) &amp;mdash; close to the true value of +0.2. The R-squared jumps from 0.03 to 0.32, showing that controlling for income explains a large share of the variation. Combining both panels:&lt;/p>
&lt;pre>&lt;code class="language-stata">graph combine naive controlled, ///
title(&amp;quot;What Does 'Controlling for Income' Look Like?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig1_naive_vs_controlled.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope with R2 of 0.028; after FWL residualization with controls(income), the slope reverses to positive with R2 of 0.32">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales ($R^2 = 0.028$). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes via &lt;code>controls(income)&lt;/code>. The true positive effect of coupons emerges clearly, and the $R^2$ rises to 0.32.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>We can compare the naive and controlled regressions side by side using Stata&amp;rsquo;s &lt;code>estimates store&lt;/code> and &lt;code>estimates table&lt;/code> workflow. The &lt;code>estimates store&lt;/code> command saves regression results under a name, and &lt;code>estimates table&lt;/code> displays multiple stored results in columns &amp;mdash; similar to R&amp;rsquo;s &lt;code>etable()&lt;/code> or Python&amp;rsquo;s &lt;code>stargazer&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">regress sales coupons
estimates store naive_ols
regress sales coupons income
estimates store full_ols
estimates table naive_ols full_ols, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------
Variable | naive_ols full_ols
-------------+------------------------
coupons | -0.0934 0.2123
| 0.0393 0.0467
income | 0.3004
| 0.0325
_cons | 36.9301 11.3352
| 1.3969 3.0080
-------------+------------------------
r2 | 0.0277 0.3215
N | 200 200
--------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3.&lt;/p>
&lt;h3 id="44-omitted-variable-bias-predicting-the-error">4.4 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$, from the full regression) multiplied by the coefficient from the auxiliary regression of the &lt;em>omitted&lt;/em> variable on the treatment ($\hat{\delta}$).&lt;/p>
&lt;pre>&lt;code class="language-stata">* gamma = effect of income on sales (in full model)
regress sales coupons income
local gamma = _b[income]    // 0.3004
* delta = auxiliary regression of income (omitted) on coupons (included)
regress income coupons
local delta = _b[coupons]   // -1.0175
* OVB = gamma * delta
display &amp;quot;OVB = &amp;quot; %9.4f `gamma' * `delta'
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = -0.3057
&lt;/code>&lt;/pre>
&lt;p>The OVB formula predicts a bias of -0.306: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the auxiliary coefficient of income on coupons ($\hat{\delta} = -1.018$) produces a large negative bias. The predicted naive coefficient (true + bias = 0.2123 + (-0.3057) = -0.0934) matches the actual naive coefficient (-0.0934) exactly: the OVB decomposition is an algebraic identity, not an approximation.&lt;/p>
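&lt;p>One subtlety is worth checking numerically: the decomposition is exact only when $\hat{\delta}$ comes from regressing the &lt;em>omitted&lt;/em> variable on the &lt;em>included&lt;/em> one. A quick cross-check in Python on simulated data with the same signs as our DGP:&lt;/p>

```python
# Numerical check that OVB is an exact in-sample identity when delta is
# the auxiliary coefficient of the omitted variable on the included one.
# Simulated data; the coefficients mirror our DGP's signs only.
import numpy as np

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)                      # plays the role of income
x = -0.5 * z + rng.normal(size=n)           # plays the role of coupons
y = 0.2 * x + 0.3 * z + rng.normal(size=n)  # plays the role of sales

def ols(dep, *cols):
    X = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(X, dep, rcond=None)[0]

beta_short = ols(y, x)[1]              # naive coefficient
beta_long, gamma = ols(y, x, z)[1:3]   # full-model coefficients
delta = ols(z, x)[1]                   # omitted regressed on included
ovb_holds = np.isclose(beta_short, beta_long + gamma * delta)
```

&lt;p>The check passes for any sample, not just in expectation, because the decomposition is pure OLS algebra.&lt;/p>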
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be implemented manually in Stata using &lt;code>regress&lt;/code> and &lt;code>predict&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Step 1: Residualize sales on income
regress sales income
predict resid_sales, residuals
* Step 2: Residualize coupons on income
regress coupons income
predict resid_coupons, residuals
* Step 3: Regress residuals on residuals
regress resid_sales resid_coupons
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">------------------------------------------------------------------------------
resid_sales | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
resid_coup~s | .2122882 .046581 4.56 0.000 .1204297 .3041466
_cons | -2.87e-09 .222537 -0.00 1.000 -.4388468 .4388468
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FWL coefficient on &lt;code>resid_coupons&lt;/code> is &lt;strong>0.212288&lt;/strong> &amp;mdash; exactly the same as the full regression coefficient on &lt;code>coupons&lt;/code> (0.212288). This is not an approximation; it is an algebraic identity. Formally, the FWL theorem says:&lt;/p>
&lt;p>$$\hat{\beta}_1 = \frac{\text{Cov}(\tilde{Y}, \tilde{X}_1)}{\text{Var}(\tilde{X}_1)}$$&lt;/p>
&lt;p>where $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from regressing $Y$ and $X_1$ on the controls $Z$. In our example, $\tilde{Y}$ is &lt;code>resid_sales&lt;/code> (the part of sales that income cannot explain) and $\tilde{X}_1$ is &lt;code>resid_coupons&lt;/code> (the part of coupons that income cannot explain). The ratio of their covariance to the variance of $\tilde{X}_1$ gives the slope we see in the regression above.&lt;/p>
&lt;p>Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group.&lt;/p>
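&lt;p>The same three-step recipe can be cross-checked outside Stata. A short Python sketch (simulated data; the names echo the tutorial&amp;rsquo;s variables) confirms that the residual-on-residual slope equals the multiple regression coefficient:&lt;/p>

```python
# The three-step FWL recipe in numpy (simulated data; names echo the
# tutorial's variables but the values are illustrative).
import numpy as np

rng = np.random.default_rng(3)
n = 200
income  = rng.normal(size=n)
coupons = -0.7 * income + rng.normal(size=n)
sales   = 0.2 * coupons + 0.3 * income + rng.normal(size=n)

def resid(y, x):                       # residuals from y ~ 1 + x
    X = np.column_stack([np.ones(n), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

r_sales   = resid(sales, income)       # step 1
r_coupons = resid(coupons, income)     # step 2
beta_fwl  = (r_coupons @ r_sales) / (r_coupons @ r_coupons)   # step 3: Cov/Var

# full multiple-regression coefficient on coupons
beta_full = np.linalg.lstsq(
    np.column_stack([np.ones(n), coupons, income]), sales, rcond=None)[0][1]
# beta_fwl equals beta_full up to floating point
```
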
&lt;h3 id="52-adding-more-controls">5.2 Adding more controls&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command handles any number of controls automatically:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, ///
regparameters(coef pval r2) opts(name(panel_a, replace) title(&amp;quot;A. No Controls&amp;quot;))
scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(panel_b, replace) title(&amp;quot;B. + Income&amp;quot;))
scatterfit sales coupons, controls(income dayofweek) ///
regparameters(coef pval r2) opts(name(panel_c, replace) title(&amp;quot;C. + Income + Day&amp;quot;))
graph combine panel_a panel_b panel_c, ///
title(&amp;quot;Progressive Controls: How the Scatter Changes&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig2_three_panels.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig2_three_panels.png" alt="Three-panel progression showing coefficient, p-value, and R2: no controls (left, R2 = 0.028), controlling for income (center, R2 = 0.32), controlling for income and day of week (right, R2 = 0.37)">&lt;/p>
&lt;pre>&lt;code class="language-stata">estimates table m1_naive m2_income m3_full, stats(r2 r2_a N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | m1_naive m2_income m3_full
-------------+------------------------------------
coupons | -0.0934 0.2123 0.2219
| 0.0393 0.0467 0.0454
income | 0.3004 0.2961
| 0.0325 0.0316
dayofweek | 0.4029
| 0.1095
_cons | 36.9301 11.3352 9.6398
| 1.3969 3.0080 2.9527
-------------+------------------------------------
r2 | 0.0277 0.3215 0.3654
r2_a | 0.0228 0.3146 0.3556
N | 200 200 200
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared &amp;mdash; now visible directly on each panel &amp;mdash; jumps from 0.028 to 0.32 to 0.37. Each scatterfit panel shows a tighter cloud as more variation is absorbed by the controls.&lt;/p>
&lt;h2 id="6-binned-scatter-plots">6. Binned Scatter Plots&lt;/h2>
&lt;h3 id="61-why-binned-scatters">6.1 Why binned scatters?&lt;/h3>
&lt;p>With large datasets (thousands or millions of observations), scatter plots become useless &amp;mdash; individual points merge into a solid blob. &lt;strong>Binned scatter plots&lt;/strong> solve this by grouping observations into quantile bins along the x-axis and plotting the bin means. The regression line is still estimated on the full data, so the slope is unaffected. This is one of &lt;code>scatterfit&lt;/code>&amp;rsquo;s key advantages over R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code>.&lt;/p>
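&lt;p>The binning logic itself is simple. A Python sketch (using &lt;code>pandas.qcut&lt;/code> as a stand-in for &lt;code>scatterfit&lt;/code>&amp;rsquo;s quantile bins; the data are illustrative) shows the two key pieces: the slope comes from the full data, while the plot keeps only one marker per quantile bin:&lt;/p>

```python
# Quantile-binning sketch: slope from the full data, markers from bin
# means (pandas.qcut stands in for scatterfit's bins; data illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({'x': rng.normal(size=10_000)})
df['y'] = 0.2 * df['x'] + rng.normal(size=10_000)

slope = np.polyfit(df['x'], df['y'], 1)[0]          # fitted on ALL rows

bins = df.groupby(pd.qcut(df['x'], 20), observed=True).mean()
# 'bins' has 20 rows: one (mean x, mean y) marker per quantile bin
```
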
&lt;h3 id="62-unbinned-vs-binned">6.2 Unbinned vs. binned&lt;/h3>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(unbinned, replace) title(&amp;quot;A. Unbinned (all points)&amp;quot;))
scatterfit sales coupons, controls(income) binned ///
regparameters(coef pval r2) opts(name(binned, replace) title(&amp;quot;B. Binned (20 quantiles)&amp;quot;))
graph combine unbinned binned, ///
title(&amp;quot;Binned Scatter: Summarizing Patterns in Large Data&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig3_binned_scatter.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig3_binned_scatter.png" alt="Unbinned scatter (left) vs. binned scatter with 20 quantiles (right), both showing the same FWL-residualized relationship with coefficient, p-value, and R2 annotations">&lt;/p>
&lt;p>Both panels show the same FWL-residualized relationship ($\beta = 0.21$, $R^2 = 0.32$), but the binned version (right) replaces 200 individual points with 20 bin-mean markers. For our small dataset the difference is modest, but for the flights data (5,000+ observations) or production datasets (millions of rows), binning is essential. The &lt;code>nquantiles()&lt;/code> option controls how many bins to use:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Fewer bins = smoother but less detail
scatterfit sales coupons, controls(income) binned nquantiles(10)
* More bins = more detail but noisier
scatterfit sales coupons, controls(income) binned nquantiles(30)
&lt;/code>&lt;/pre>
&lt;h2 id="7-visualizing-fixed-effects">7. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="71-load-the-flights-data">7.1 Load the flights data&lt;/h3>
&lt;p>We load the NYC flights sample &amp;mdash; 5,000 flights from New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/flights_sample.csv&amp;quot;, clear
summarize dep_delay air_time
tabulate origin
* Encode string variables for fixed effects (needed by scatterfit/reghdfe)
encode origin, gen(origin_fe)
encode dest, gen(dest_fe)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
dep_delay | 5,000 7.3172 22.83736 -20 119
air_time | 5,000 150.3636 93.47726 22 650
&lt;/code>&lt;/pre>
&lt;h3 id="72-progressive-fixed-effects">7.2 Progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>fcontrols()&lt;/code> option specifies categorical variables to absorb as fixed effects. This is analogous to &lt;code>feols(...| FE)&lt;/code> in R&amp;rsquo;s fixest:&lt;/p>
&lt;pre>&lt;code class="language-stata">* No fixed effects
scatterfit dep_delay air_time, regparameters(coef pval r2) ///
opts(name(fe_none, replace) title(&amp;quot;A. No Fixed Effects&amp;quot;))
* Origin airport FE
scatterfit dep_delay air_time, fcontrols(origin_fe) ///
regparameters(coef pval r2) opts(name(fe_origin, replace) title(&amp;quot;B. Origin FE&amp;quot;))
* Origin + destination FE
scatterfit dep_delay air_time, fcontrols(origin_fe dest_fe) ///
regparameters(coef pval r2) opts(name(fe_both, replace) title(&amp;quot;C. Origin + Dest FE&amp;quot;))
graph combine fe_none fe_origin fe_both, ///
title(&amp;quot;What Do Fixed Effects 'Do' to the Data?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig4_fixed_effects.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig4_fixed_effects.png" alt="Progressive FWL plots with coefficient, p-value, and R2: no FE (left, R2 near 0), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>Panel A shows the raw cloud with a nearly flat slope ($R^2 \approx 0$). Panel B removes the three origin-airport means, tightening the horizontal spread. Panel C removes the destination means as well, collapsing the variation to &lt;em>within-route&lt;/em> deviations and increasing $R^2$ substantially. The &lt;code>fcontrols()&lt;/code> option handles all the demeaning internally using &lt;code>reghdfe&lt;/code>.&lt;/p>
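&lt;p>As a cross-language sketch of what &lt;code>fcontrols()&lt;/code> does under the hood, the NumPy snippet below (simulated toy data, not the flights sample) checks that regressing on explicit group dummies and regressing on group-demeaned variables return the same slope &amp;mdash; the FWL identity behind fixed effects:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, G = 300, 3
group = rng.integers(0, G, n)            # e.g., an origin-airport id
x = rng.normal(size=n) + group           # regressor correlated with the groups
y = 2.0 * group + 0.5 * x + rng.normal(scale=0.3, size=n)

# Full regression: y on x plus one dummy per group (the dummies span the intercept)
D = np.eye(G)[group]
beta_dummies = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)[0][-1]

# FWL shortcut: subtract each group's mean from y and x, then take the OLS slope
def demean(v, ids):
    out = v.astype(float).copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

xt, yt = demean(x, group), demean(y, group)
beta_fwl = (xt @ yt) / (xt @ xt)
print(beta_dummies, beta_fwl)  # equal up to floating-point error
```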
&lt;h3 id="73-regression-table">7.3 Regression table&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress dep_delay air_time
estimates store fe0
reghdfe dep_delay air_time, absorb(origin_fe) vce(robust)
estimates store fe1
reghdfe dep_delay air_time, absorb(origin_fe dest_fe) vce(robust)
estimates store fe2
estimates table fe0 fe1 fe2, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | fe0 fe1 fe2
-------------+------------------------------------
air_time | -0.0050 -0.0079 -0.0324
| 0.0035 0.0034 0.0265
_cons | 8.0669 8.5072 12.1416
| 0.6117 0.6449 4.0186
-------------+------------------------------------
r2 | 0.0004 0.0055 0.0310
N | 5000 5000 4994
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.005 (no FE), -0.008 (origin FE), -0.032 (origin + destination FE). Note that these are estimated on the 5,000-observation sample, so the coefficients differ somewhat from the full-data estimates in the R tutorial. The key pattern is the same: adding fixed effects absorbs between-group variation and changes both the magnitude and precision of the coefficient. With origin + destination FE, 6 singleton observations are dropped (N = 4,994) &amp;mdash; singletons are routes with only one flight in the sample, where within-group variation cannot be estimated.&lt;/p>
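&lt;p>The singleton behavior is easy to see in miniature. In this hypothetical NumPy sketch (toy numbers, not the flights data), a route with a single flight demeans to exactly zero, so it contributes no within-route variation:&lt;/p>

```python
import numpy as np

route = np.array([0, 0, 0, 1, 1, 2])         # route 2 has only one flight
delay = np.array([5., 9., 7., 3., 1., 12.])

def within(v, ids):
    out = v.copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

d = within(delay, route)
print(d[route == 2])  # [0.] -- a singleton demeans to zero, so reghdfe drops it
```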
&lt;h2 id="8-panel-data-returns-to-experience">8. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="81-load-the-wage-panel">8.1 Load the wage panel&lt;/h3>
&lt;p>The wage panel contains 545 individuals observed over 8 years (1980&amp;ndash;1987). The classic question: what is the return to experience? The challenge is &lt;strong>unobserved ability&lt;/strong> &amp;mdash; two people with the same experience may earn very different wages because one is more talented, motivated, or well-connected. These unmeasured personal traits are the &amp;ldquo;unobserved ability&amp;rdquo; that individual fixed effects absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/wagepan.csv&amp;quot;, clear
xtset nr year
summarize lwage exper expersq educ
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lwage | 4,360 1.649147 .5326094 -3.579079 4.05186
exper | 4,360 6.514679 2.825873 0 18
expersq | 4,360 50.42477 40.78199 0 324
educ | 4,360 11.76697 1.746181 3 16
&lt;/code>&lt;/pre>
&lt;h3 id="82-pooled-ols-vs-individual-fixed-effects">8.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress lwage educ exper expersq
estimates store pool
reghdfe lwage exper expersq, absorb(nr)
estimates store fe_ind
reghdfe lwage exper expersq, absorb(nr year)
estimates store fe_twfe
estimates table pool fe_ind fe_twfe, stats(r2 N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | pool fe_ind fe_twfe
-------------+------------------------------------
educ | 0.1021
| 0.0047
exper | 0.1050 0.1223 (omitted)
| 0.0102 0.0082
expersq | -0.0036 -0.0045 -0.0054
| 0.0007 0.0006 0.0007
_cons | -0.0564 1.0807 1.9223
| 0.0639 0.0263 0.0359
-------------+------------------------------------
r2 | 0.1477 0.6173 0.6185
N | 4360 4360 4360
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. The &lt;code>educ&lt;/code> coefficient disappears from the individual FE column &amp;mdash; education is time-invariant (it does not change over the 8 years for any individual), so it is perfectly collinear with person dummies. Stata marks &lt;code>exper&lt;/code> as &lt;code>(omitted)&lt;/code> in the two-way FE column &amp;mdash; because experience increments by one year for everyone, it is perfectly collinear with year dummies. Only &lt;code>expersq&lt;/code> (which varies non-linearly) survives both sets of fixed effects. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation.&lt;/p>
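&lt;p>Both collinearity results can be verified by hand. The toy NumPy sketch below (hypothetical numbers, not the wage panel) shows that within-person demeaning wipes out a time-invariant variable entirely, and turns linearly incrementing experience into exactly the demeaned year trend:&lt;/p>

```python
import numpy as np

# Toy panel: two people observed for three years each
person = np.array([1, 1, 1, 2, 2, 2])
year   = np.array([1980, 1981, 1982, 1980, 1981, 1982])
educ   = np.array([12, 12, 12, 16, 16, 16])  # constant within person
exper  = np.array([0, 1, 2, 5, 6, 7])        # increments by one each year

def within(v, ids):
    out = v.astype(float).copy()
    for i in np.unique(ids):
        out[ids == i] -= out[ids == i].mean()
    return out

print(within(educ, person))   # all zeros: no within-person variation to estimate
print(within(exper, person))  # [-1, 0, 1, -1, 0, 1]
print(within(year, person))   # [-1, 0, 1, -1, 0, 1] -- identical, hence collinear
```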
&lt;h3 id="83-scatterfit-with-individual-fe">8.3 scatterfit with individual FE&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sample 150 individuals for visual clarity
preserve
set seed 456
bysort nr: gen first = (_n == 1)
gen rand = runiform() if first
bysort nr (rand): replace rand = rand[1]
sort rand nr year
egen rank = group(rand) if first
bysort nr (rank): replace rank = rank[1]
keep if rank &amp;lt;= 150
scatterfit lwage exper, regparameters(coef pval r2) ///
opts(name(wage_raw, replace) title(&amp;quot;A. Raw: Pooled Cross-Section&amp;quot;))
scatterfit lwage exper, fcontrols(nr) regparameters(coef pval r2) ///
opts(name(wage_fe, replace) title(&amp;quot;B. FWL: Individual Fixed Effects&amp;quot;))
graph combine wage_raw wage_fe, ///
title(&amp;quot;Controlling for Unobserved Ability&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig5_panel_data.png&amp;quot;, replace
restore
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig5_panel_data.png" alt="Raw pooled cross-section (left, R2 = 0.043) vs. individual fixed-effects residualized scatter (right, R2 = 0.59) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A shows a wide fan with a shallow slope ($R^2 = 0.043$) &amp;mdash; individuals at the same experience level have wildly different wages, reflecting unobserved ability. Panel B applies &lt;code>fcontrols(nr)&lt;/code> to strip away each person&amp;rsquo;s average wage and experience, leaving only &lt;em>within-person&lt;/em> deviations. The $R^2$ jumps from 0.04 to 0.59, showing that individual fixed effects explain most of the wage variation. The slope steepens sharply: the within-person return to experience is about 0.07 log points per year (roughly 7%), and the relationship is much more precisely identified once we control for who each person is.&lt;/p>
&lt;h2 id="9-advanced-fit-types-and-regression-parameters">9. Advanced: Fit Types and Regression Parameters&lt;/h2>
&lt;h3 id="91-multiple-fit-types">9.1 Multiple fit types&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays the coefficient, standard error, p-value, R-squared, and sample size directly on the plot. The &lt;code>scatterfit&lt;/code> command also supports fit types beyond linear &amp;mdash; quadratic and lowess &amp;mdash; as diagnostics for nonlinearity:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Linear fit with all regression parameters displayed on the plot
scatterfit sales coupons, controls(income) ///
regparameters(coef se pval r2 n)
graph export &amp;quot;stata_fwl_fig6_advanced.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig6_advanced.png" alt="Linear FWL fit with regression parameters (coefficient, SE, p-value, R-squared, N) displayed directly on the plot">&lt;/p>
&lt;pre>&lt;code class="language-stata">* Lowess fit: nonparametric check (note: lowess does not support controls())
scatterfit sales coupons, fit(lowess)
&lt;/code>&lt;/pre>
&lt;p>A quadratic fit (&lt;code>fit(quadratic)&lt;/code>, which does work with &lt;code>controls()&lt;/code>) serves as a diagnostic: if the relationship looks curved in the residualized scatter, your linear specification may be misspecified. Note that &lt;code>fit(lowess)&lt;/code> and &lt;code>fit(lpoly)&lt;/code> do not support &lt;code>controls()&lt;/code> in the current version of &lt;code>scatterfit&lt;/code> &amp;mdash; use them on raw or manually residualized data. For our simulated data (which is linear by construction), the quadratic fit closely follows the linear fit, confirming that the linear specification is appropriate.&lt;/p>

&lt;h3 id="92-regression-parameters-on-the-plot">9.2 Regression parameters on the plot&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays statistical information directly on the scatter plot. Available parameters:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Display&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>coef&lt;/code>&lt;/td>
&lt;td>Slope coefficient&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>se&lt;/code>&lt;/td>
&lt;td>Standard error&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pval&lt;/code>&lt;/td>
&lt;td>P-value&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>r2&lt;/code>&lt;/td>
&lt;td>R-squared&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n&lt;/code>&lt;/td>
&lt;td>Sample size&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-stata">* Show everything
scatterfit sales coupons, controls(income) regparameters(coef se pval r2 n)
&lt;/code>&lt;/pre>
&lt;p>This is especially useful for presentations and papers where you want to communicate both the visual pattern and the statistical evidence in a single figure.&lt;/p>
&lt;h3 id="93-quick-reference-scatterfit-recipes">9.3 Quick reference: scatterfit recipes&lt;/h3>
&lt;pre>&lt;code class="language-stata">* 1. Raw scatter (no controls)
scatterfit y x
* 2. Control for continuous variables (FWL)
scatterfit y x, controls(z1 z2)
* 3. Control for fixed effects (categorical)
scatterfit y x, fcontrols(group_fe)
* 4. Both continuous controls and fixed effects
scatterfit y x, controls(z1) fcontrols(group_fe)
* 5. Binned scatter (for large datasets)
scatterfit y x, controls(z1) binned nquantiles(20)
* 6. Show regression parameters on the plot
scatterfit y x, controls(z1) regparameters(coef pval r2)
* 7. Quadratic fit (works with controls)
scatterfit y x, controls(z1) fit(quadratic)
* 8. Lowess fit (does NOT support controls — use on raw data)
scatterfit y x, fit(lowess)
&lt;/code>&lt;/pre>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a pedagogical tool &amp;mdash; it is the computational engine behind Stata&amp;rsquo;s &lt;code>reghdfe&lt;/code> command. When &lt;code>reghdfe&lt;/code> estimates a model with fixed effects, it does not create a matrix with thousands of dummy variables. Instead, it uses an iterative demeaning algorithm (a generalization of FWL) to absorb the fixed effects, then runs OLS on the residuals. This is why &lt;code>reghdfe&lt;/code> can handle millions of observations with tens of thousands of fixed effects.&lt;/p>
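&lt;p>A minimal sketch of that iterative demeaning idea (simulated data; an illustration of the algorithm's logic, not &lt;code>reghdfe&lt;/code>'s actual implementation) alternately demeans over two fixed-effect sets until the vectors stop changing, then regresses residuals on residuals:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.integers(0, 20, n)               # first FE set (e.g., origin)
b = rng.integers(0, 30, n)               # second FE set (e.g., destination)
x = rng.normal(size=n) + 0.3 * a
y = 0.7 * x + a + 0.5 * b + rng.normal(size=n)

def demean_by(v, ids):
    out = v.copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

def absorb(v, tol=1e-12, max_iter=1000):
    """Alternately demean over both FE sets until the vector stops changing."""
    v = v.astype(float)
    for _ in range(max_iter):
        prev = v
        v = demean_by(demean_by(v, a), b)
        if np.max(np.abs(v - prev)) < tol:
            break
    return v

xt, yt = absorb(x), absorb(y)
beta = (xt @ yt) / (xt @ xt)
print(beta)  # should match OLS with all 50 dummies included explicitly
```

&lt;p>With two FE sets, this alternating scheme converges to the same slope as including every dummy explicitly &amp;mdash; without ever building the dummy matrix.&lt;/p>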
&lt;p>The &lt;code>scatterfit&lt;/code> package offers three advantages over the R and Python implementations of FWL visualization. First, &lt;strong>binned scatter plots&lt;/strong> (Section 6) are essential for large datasets where individual points merge into an unreadable blob. Second, &lt;strong>regression parameters on the plot&lt;/strong> (&lt;code>regparameters()&lt;/code>) combine the visual and statistical evidence in a single figure, reducing the back-and-forth between plots and tables. Third, &lt;strong>multiple fit types&lt;/strong> (&lt;code>fit(quadratic)&lt;/code>, &lt;code>fit(lowess)&lt;/code>) serve as built-in diagnostics for linearity.&lt;/p>
&lt;p>Across the three tutorials (Python, R, Stata), the key numbers are the same because we use the same datasets: the naive coupon coefficient is -0.093, the estimate after controlling for income is +0.212 (close to the true effect of +0.2), and the OVB is -0.148. The FWL theorem is the same in every language &amp;mdash; only the syntax changes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task&lt;/th>
&lt;th>Python&lt;/th>
&lt;th>R&lt;/th>
&lt;th>Stata&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Raw scatter&lt;/td>
&lt;td>&lt;code>plt.scatter(x, y)&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Control for Z&lt;/td>
&lt;td>manual &lt;code>resid()&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x + z)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, controls(z)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fixed effects&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x | fe)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, fcontrols(fe)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Binned scatter&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>scatterfit y x, binned&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stats on plot&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>regparameters(coef pval r2)&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Students who learn FWL in one language can immediately apply it in another.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic, Poisson, or other nonlinear models, the partialling-out logic does not hold exactly. Stata&amp;rsquo;s &lt;code>scatterfit&lt;/code> does support &lt;code>fitmodel(logit)&lt;/code> and &lt;code>fitmodel(poisson)&lt;/code>, but these are direct fits, not FWL residualizations.&lt;/p>
&lt;h2 id="11-summary-and-next-steps">11. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> the naive coupon coefficient was -0.093 (wrong sign), while the true causal effect is +0.2. After FWL residualization with &lt;code>controls(income)&lt;/code>, the estimate was +0.212.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the bias exactly:&lt;/strong> $0.300 \times (-0.494) = -0.148$, correctly predicting the negative direction and approximate magnitude of the confounding.&lt;/li>
&lt;li>&lt;strong>FWL is an exact identity:&lt;/strong> the manual three-step procedure in Stata (&lt;code>regress&lt;/code> + &lt;code>predict resid&lt;/code> + &lt;code>regress&lt;/code>) matches the full regression to six decimal places (0.212288).&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> &lt;code>fcontrols()&lt;/code> in &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> internally to demean the data, equivalent to &lt;code>feols(... | FE)&lt;/code> in R.&lt;/li>
&lt;li>&lt;strong>Binned scatter plots and on-plot statistics are Stata&amp;rsquo;s advantage:&lt;/strong> the &lt;code>binned&lt;/code> and &lt;code>regparameters()&lt;/code> options provide capabilities that the R and Python FWL tools lack.&lt;/li>
&lt;/ul>
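&lt;p>The OVB identity in the second bullet is exact OLS algebra, not an approximation. The NumPy sketch below (a fresh simulation with the same structure as the store data but a different seed, so the numbers differ from the tutorial's) checks that the short-regression slope equals the long-regression slope plus $\hat{\gamma} \times \hat{\delta}$:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
income  = rng.normal(50, 10, n)                      # confounder
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)    # treatment
sales   = 10 + 0.2 * coupons + 0.3 * income + rng.normal(0, 3, n)

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(sales, coupons)[1]                  # naive slope, income omitted
_, b_long, g_long = ols(sales, coupons, income)   # controlled slope, income effect
delta = ols(income, coupons)[1]                   # slope of omitted var on treatment
# OVB identity: short = long + gamma * delta, exactly
print(b_short, b_long + g_long * delta)
```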
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> using &lt;code>fwl_plot()&lt;/code> and the &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends FWL to Double Machine Learning.&lt;/p>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>OVB direction.&lt;/strong> In our simulation, predict the direction of the OVB if you also omit &lt;code>dayofweek&lt;/code>. Compute $\hat{\gamma}_{day} \times \hat{\delta}_{day}$ and add it to the income OVB. Does the total bias match the difference between the naive and the fully controlled coefficient?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Binned scatter with different bins.&lt;/strong> Re-run &lt;code>scatterfit sales coupons, controls(income) binned nquantiles(k)&lt;/code> for $k = 5, 10, 20, 50$. How does the visual change? At what point do you lose meaningful information?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>slopefit: heterogeneous effects.&lt;/strong> Use the &lt;code>slopefit&lt;/code> command: &lt;code>slopefit sales coupons income&lt;/code>. This shows how the coupon-sales slope varies across income levels. Do coupons work better in low-income or high-income neighborhoods?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">Ahrens, L. (2024). scatterfit: Scatter Plots with Fit Lines and Regression Results. GitHub.&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">Correia, S. (2016). reghdfe: Linear Models with Many Levels of Fixed Effects. Stata Journal.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>Datasets: simulated store data, NYC flights sample, and Wooldridge wage panel from the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> on this site.&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, this post may still contain errors, so caution is needed before applying its contents to real research projects.&lt;/p></description></item><item><title>The FWL Theorem: Making Multivariate Regressions Intuitive</title><link>https://carlos-mendez.org/post/python_fwl/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_fwl/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Including multiple variables in a regression raises a natural question: what does it actually mean to &amp;ldquo;control for&amp;rdquo; a confounder? The output is a coefficient, but a multivariate regression cannot be plotted on a simple two-dimensional scatter plot. This makes it hard to build intuition about what the regression is doing behind the scenes.&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> answers this question. It shows that any coefficient from a multivariate regression can be recovered from a simple univariate regression &amp;mdash; after removing the influence of all other variables through a procedure called &lt;em>partialling-out&lt;/em> (also known as &lt;em>residualization&lt;/em> or &lt;em>orthogonalization&lt;/em>). Think of it as stripping away the noise from other variables so that only the signal of interest remains.&lt;/p>
&lt;p>This tutorial is inspired by &lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud (2022)&lt;/a>, and applies the FWL theorem to a simulated retail scenario. A chain of stores distributes discount coupons and wants to know whether the coupons increase sales. The catch: neighborhood income affects both coupon usage and sales, creating a confounding relationship that makes the naive analysis misleading. The analysis uses FWL to untangle these effects, verifies the theorem step by step, and visualizes the conditional relationship that multivariate regression captures but hides from view.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the Frisch-Waugh-Lovell theorem and why it matters for causal inference&lt;/li>
&lt;li>Implement the partialling-out procedure using OLS residuals&lt;/li>
&lt;li>Visualize conditional relationships that multivariate regressions capture but cannot directly plot&lt;/li>
&lt;li>Compare naive and conditional estimates to see how omitted variable bias distorts results&lt;/li>
&lt;li>Connect FWL to modern applications such as Double Machine Learning&lt;/li>
&lt;/ul>
&lt;h2 id="the-causal-structure">The causal structure&lt;/h2>
&lt;p>Before looking at data, it helps to understand the causal relationships among the variables. A &lt;strong>Directed Acyclic Graph (DAG)&lt;/strong> &amp;mdash; a diagram where arrows indicate direct causal effects &amp;mdash; makes these assumptions explicit.&lt;/p>
&lt;p>In this retail scenario, three variables interact:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
I[&amp;quot;&amp;lt;b&amp;gt;Income&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(confounder)&amp;quot;] --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ fewer coupons&amp;quot;| C[&amp;quot;&amp;lt;b&amp;gt;Coupons&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
I --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ more spending&amp;quot;| S[&amp;quot;&amp;lt;b&amp;gt;Sales&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
C --&amp;gt;|&amp;quot;True causal&amp;lt;br/&amp;gt;effect: +0.2&amp;quot;| S
style I fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style S fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income acts as a &lt;strong>confounder&lt;/strong> &amp;mdash; a variable that influences both the treatment (coupon usage) and the outcome (sales). Wealthier neighborhoods use fewer coupons but spend more, creating a &lt;em>backdoor path&lt;/em> from coupons to sales through income. Ignoring income allows this backdoor path to generate a spurious negative association between coupons and sales, masking the true positive effect.&lt;/p>
&lt;p>To recover the genuine causal effect, the analysis must &lt;strong>block&lt;/strong> this backdoor path by conditioning on income. The FWL theorem provides an elegant way to do this and to visualize the result.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>The following code loads all necessary libraries. The analysis relies on &lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels&lt;/a> for OLS regression, &lt;a href="https://seaborn.pydata.org/" target="_blank" rel="noopener">seaborn&lt;/a> for regression plots, and &lt;a href="https://matplotlib.org/" target="_blank" rel="noopener">matplotlib&lt;/a> for figure customization. The &lt;code>RANDOM_SEED&lt;/code> ensures that every reader gets identical results.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note on figure styling:&lt;/strong> The figures in this post use a dark theme for visual consistency with the site. The companion &lt;code>script.py&lt;/code> includes the full styling code. To reproduce the dark-themed figures, add the following to your setup:&lt;/p>
&lt;details>&lt;summary>Dark theme settings (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python">DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT, &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False, &amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False, &amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True, &amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6, &amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT, &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;text.color&amp;quot;: WHITE_TEXT, &amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False, &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY, &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;/blockquote>
&lt;h2 id="data-simulation">Data simulation&lt;/h2>
&lt;p>Rather than importing data from an external source, this section builds a transparent data generating process (DGP) so that the &lt;strong>true causal effect&lt;/strong> is known in advance and the methods can be verified against it. Think of it as running a controlled experiment in a computer: set the rules, generate the data, and then check whether the statistical tools find the right answer.&lt;/p>
&lt;p>The DGP encodes the causal structure from the DAG above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>income&lt;/code> is drawn from a normal distribution centered at \$50K&lt;/li>
&lt;li>&lt;code>coupons&lt;/code> depends negatively on income (wealthier customers use fewer coupons) plus random noise&lt;/li>
&lt;li>&lt;code>sales&lt;/code> depends positively on both coupons (+0.2) and income (+0.3), plus a day-of-week effect and random noise&lt;/li>
&lt;/ul>
&lt;p>The true causal effect of coupons on sales is &lt;strong>exactly +0.2&lt;/strong> &amp;mdash; this is the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>, the average impact of coupons on sales across all stores. In concrete terms, every 1 percentage point increase in coupon usage causes a \$200 increase in daily sales (measured in thousands).&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_store_data(n=50, seed=42):
&amp;quot;&amp;quot;&amp;quot;Simulate retail store data with confounding by income.&amp;quot;&amp;quot;&amp;quot;
rng = np.random.default_rng(seed)
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = (10 + 0.2 * coupons + 0.3 * income
+ 0.5 * dayofweek + rng.normal(0, 3, n))
return pd.DataFrame({
&amp;quot;sales&amp;quot;: np.round(sales, 2),
&amp;quot;coupons&amp;quot;: np.round(coupons, 2),
&amp;quot;income&amp;quot;: np.round(income, 2),
&amp;quot;dayofweek&amp;quot;: dayofweek,
})
N = 50
df = simulate_store_data(n=N, seed=RANDOM_SEED)
print(&amp;quot;Dataset shape:&amp;quot;, df.shape)
print(df.head())
print(df.describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (50, 4)
sales coupons income dayofweek
0 37.37 36.93 53.05 6
1 36.88 38.06 39.60 6
2 33.09 32.04 57.50 6
3 35.09 33.43 59.41 5
4 27.01 43.21 30.49 4
sales coupons income dayofweek
count 50.00 50.00 50.00 50.00
mean 33.61 33.84 50.91 3.92
std 3.96 4.89 7.68 1.88
min 25.76 23.26 30.49 1.00
25% 31.30 31.53 45.78 2.00
50% 33.24 33.25 51.74 4.00
75% 36.00 36.89 56.42 5.75
max 44.38 43.79 71.42 7.00
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 50 stores with average daily sales of \$33,610, average coupon usage of 33.84%, and average neighborhood income of \$50,910. Sales range from \$25,760 to \$44,380, reflecting meaningful variation across stores. Coupon usage spans from 23% to 44%, and income ranges from \$30,490 to \$71,420. This variation provides enough signal to estimate the relationships of interest.&lt;/p>
&lt;h2 id="the-naive-relationship">The naive relationship&lt;/h2>
&lt;p>The simplest approach is to regress sales directly on coupon usage, ignoring income entirely. This is what a rushed analyst might do &amp;mdash; just look at whether stores with more coupon usage have higher or lower sales.&lt;/p>
&lt;pre>&lt;code class="language-python">sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False,
scatter_kws={&amp;quot;color&amp;quot;: STEEL_BLUE, &amp;quot;alpha&amp;quot;: 0.7, &amp;quot;edgecolors&amp;quot;: &amp;quot;white&amp;quot;, &amp;quot;s&amp;quot;: 60},
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;})
plt.legend()
plt.xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
plt.ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
plt.title(&amp;quot;Naive relationship: Sales vs. coupon usage&amp;quot;)
plt.savefig(&amp;quot;fwl_naive_regression.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_naive_regression.png" alt="Scatter plot showing a negative relationship between coupon usage and sales, with a downward-sloping regression line.">
&lt;em>Naive regression: the downward slope suggests coupons reduce sales, but this is driven by confounding from income.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-python">naive_model = smf.ols(&amp;quot;sales ~ coupons&amp;quot;, df).fit()
print(naive_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 37.1906 3.960 9.390 0.000 29.228 45.154
coupons -0.1059 0.116 -0.914 0.365 -0.339 0.127
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>The naive regression suggests that coupons have a &lt;strong>negative&lt;/strong> effect on sales: each additional percentage point of coupon usage is associated with \$106 less in daily sales. However, this coefficient is not statistically significant (p = 0.365), and the 95% confidence interval [-0.339, 0.127] spans both negative and positive values. More importantly, the true effect is +0.2, so this estimate is not just imprecise &amp;mdash; it points in the wrong direction. The confounder (income) is pulling the estimate downward because wealthier neighborhoods use fewer coupons but spend more.&lt;/p>
&lt;h2 id="controlling-for-income">Controlling for income&lt;/h2>
&lt;p>To block the backdoor path through income, the next step includes it as a control variable in the regression. This is the standard approach in applied work: add the confounder to the right-hand side of the regression equation.&lt;/p>
&lt;pre>&lt;code class="language-python">full_model = smf.ols(&amp;quot;sales ~ coupons + income&amp;quot;, df).fit()
print(full_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0278 7.181 0.700 0.487 -9.418 19.474
coupons 0.2673 0.120 2.222 0.031 0.025 0.509
income 0.3836 0.076 5.015 0.000 0.230 0.537
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>Controlling for income reverses the picture entirely. The coefficient on coupons is now &lt;strong>+0.2673&lt;/strong> (p = 0.031), indicating that each additional percentage point of coupon usage increases daily sales by about \$267. This is close to the true effect of +0.2, and the 95% confidence interval [0.025, 0.509] no longer includes zero. Income itself has a strong positive effect of +0.3836 (p &amp;lt; 0.001), confirming that wealthier neighborhoods spend more. By conditioning on income, the backdoor path is blocked and the estimate moves much closer to the true causal effect.&lt;/p>
&lt;p>But what is the regression actually &lt;em>doing&lt;/em> when it &amp;ldquo;controls for&amp;rdquo; income? This is where the FWL theorem provides a clear answer.&lt;/p>
&lt;h2 id="the-fwl-theorem">The FWL theorem&lt;/h2>
&lt;p>The Frisch-Waugh-Lovell theorem, first published by Ragnar Frisch and Frederick Waugh in 1933 and later given an elegant proof by Michael Lovell in 1963, provides a precise algebraic decomposition of what multivariate regression does under the hood.&lt;/p>
&lt;p>Consider a linear model with two sets of regressors:&lt;/p>
&lt;p>$$y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i$$&lt;/p>
&lt;p>In words, this equation says that the outcome $y$ (sales) equals the effect $\beta_1$ of the variable of interest $x_1$ (coupons), plus the effect $\beta_2$ of the control variable $x_2$ (income), plus an error term $\varepsilon$. In this analysis, $y$ corresponds to the &lt;code>sales&lt;/code> column, $x_1$ to &lt;code>coupons&lt;/code>, and $x_2$ to &lt;code>income&lt;/code>.&lt;/p>
&lt;p>The FWL theorem states that the &lt;strong>Ordinary Least Squares (OLS)&lt;/strong> estimator &amp;mdash; the standard method for fitting a regression line by minimizing squared prediction errors &amp;mdash; $\hat{\beta}_1$ from this multivariate regression is &lt;strong>identical&lt;/strong> to the estimator obtained from a simpler procedure:&lt;/p>
&lt;p>$$\hat{\beta}_1^{FWL} = \frac{\text{Cov}(\tilde{y}, \, \tilde{x}_1)}{\text{Var}(\tilde{x}_1)}$$&lt;/p>
&lt;p>where $\tilde{x}_1$ is the residual from regressing $x_1$ on $x_2$, and $\tilde{y}$ is the residual from regressing $y$ on $x_2$.&lt;/p>
&lt;p>In words, this says: to estimate the effect of coupons while controlling for income, we can (1) remove income&amp;rsquo;s influence from coupons, (2) remove income&amp;rsquo;s influence from sales, and (3) regress the cleaned sales on the cleaned coupons. The resulting coefficient is &lt;strong>exactly&lt;/strong> the same as the one from the full multivariate regression.&lt;/p>
&lt;p>This procedure is called &lt;strong>partialling-out&lt;/strong> because it removes the variation explained by the control variables, keeping only the residual variation &amp;mdash; the part that is &lt;em>orthogonal&lt;/em> to (independent of) income. The three equivalent estimators are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Full OLS:&lt;/strong> Regress $y$ on $x_1$ and $x_2$ jointly&lt;/li>
&lt;li>&lt;strong>Partial FWL:&lt;/strong> Regress $y$ on $\tilde{x}_1$ (residuals of $x_1$ on $x_2$)&lt;/li>
&lt;li>&lt;strong>Full FWL:&lt;/strong> Regress $\tilde{y}$ on $\tilde{x}_1$ (residuals of both variables on $x_2$)&lt;/li>
&lt;/ol>
&lt;p>All three produce the same $\hat{\beta}_1$. Only the full FWL (option 3), however, also reproduces the standard errors of the full regression, and even then only up to a small degrees-of-freedom adjustment, as the step-by-step verification below shows.&lt;/p>
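&lt;p>The equivalence in the covariance formula can be checked numerically. The sketch below is self-contained and deliberately uses plain NumPy instead of the statsmodels workflow used elsewhere in this post; the simulated variables are a hypothetical stand-in for the &lt;code>df&lt;/code> created earlier, with made-up coefficients.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n = 50
income = rng.normal(60, 15, n)
coupons = 50 - 0.4 * income + rng.normal(0, 5, n)
sales = 5 + 0.2 * coupons + 0.4 * income + rng.normal(0, 4, n)

def residualize(v, w):
    # Residuals of v after OLS on an intercept and w
    Z = np.column_stack([np.ones(len(v)), w])
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

# Estimator 1: full OLS of sales on an intercept, coupons, and income
X = np.column_stack([np.ones(n), coupons, income])
beta_full = np.linalg.lstsq(X, sales, rcond=None)[0][1]

# Estimator 3: the FWL covariance formula applied to the residuals
x_tilde = residualize(coupons, income)
y_tilde = residualize(sales, income)
beta_fwl = np.cov(y_tilde, x_tilde)[0, 1] / np.var(x_tilde, ddof=1)

print(np.isclose(beta_full, beta_fwl))  # True
&lt;/code>&lt;/pre>
&lt;p>Because the equivalence is algebraic rather than statistical, the check passes for any seed and any sample size.&lt;/p>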
&lt;h2 id="verifying-fwl-step-by-step">Verifying FWL step by step&lt;/h2>
&lt;p>Let us verify each step of the theorem using the simulated data.&lt;/p>
&lt;h3 id="step-1-residualize-coupons-only">Step 1: Residualize coupons only&lt;/h3>
&lt;p>First, we regress coupons on income and extract the residuals $\tilde{x}_1$. These residuals represent the variation in coupon usage that &lt;strong>cannot&lt;/strong> be explained by income &amp;mdash; the &amp;ldquo;purified&amp;rdquo; coupon signal. Then we regress sales on these residuals. Because residuals always average to zero by construction (they are &lt;em>mean-zero&lt;/em>), we drop the intercept from this regression.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize coupons with respect to income
df[&amp;quot;coupons_tilde&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().resid
# Regress sales on residualized coupons (no intercept)
fwl_step1 = smf.ols(&amp;quot;sales ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step1.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
                    coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
---------------------------------------------------------------------------------
coupons_tilde     0.2673      1.271      0.210      0.834      -2.288       2.822
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient is &lt;strong>exactly 0.2673&lt;/strong> &amp;mdash; identical to the full regression. However, the standard error has exploded from 0.120 to 1.271, making the estimate appear insignificant (p = 0.834). This happens because income was only partialled out from coupons but not from sales. The remaining variation in sales due to income inflates the residual variance of the regression, producing artificially large standard errors.&lt;/p>
&lt;h3 id="step-2-residualize-both-variables">Step 2: Residualize both variables&lt;/h3>
&lt;p>To fix the standard errors, we also residualize sales with respect to income. Now both variables have had income&amp;rsquo;s influence removed.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize sales with respect to income
df[&amp;quot;sales_tilde&amp;quot;] = smf.ols(&amp;quot;sales ~ income&amp;quot;, df).fit().resid
# Regress residualized sales on residualized coupons (no intercept)
fwl_step2 = smf.ols(&amp;quot;sales_tilde ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
                    coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
---------------------------------------------------------------------------------
coupons_tilde     0.2673      0.118      2.269      0.028       0.031       0.504
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient remains &lt;strong>exactly 0.2673&lt;/strong>, and now the standard error (0.118) and p-value (0.028) are nearly identical to the full regression (SE = 0.120, p = 0.031). The slight difference in standard errors comes from a degrees-of-freedom adjustment &amp;mdash; the full regression uses up an extra degree of freedom to estimate the income coefficient (leaving fewer data points for estimating uncertainty), while this univariate regression does not. The substantive conclusion is the same: coupons have a significant positive effect on sales after partialling out income.&lt;/p>
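&lt;p>This degrees-of-freedom adjustment is itself exact: the full regression and the step-2 FWL regression share the same residual sum of squares, so their standard errors on the coupon term differ only by the factor $\sqrt{(n-1)/(n-3)}$ (three parameters in the full model versus one slope and no intercept in the FWL regression). The following self-contained NumPy sketch verifies this on hypothetical simulated data standing in for the post&amp;rsquo;s &lt;code>df&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n = 50
income = rng.normal(60, 15, n)
coupons = 50 - 0.4 * income + rng.normal(0, 5, n)
sales = 5 + 0.2 * coupons + 0.4 * income + rng.normal(0, 4, n)

def ols(X, y):
    # Coefficients, residuals, and conventional OLS standard errors
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, resid, se

ones = np.ones(n)
# Full regression: the coupons SE uses n - 3 degrees of freedom
_, _, se_full = ols(np.column_stack([ones, coupons, income]), sales)

# FWL: residualize both variables, then regress without an intercept (n - 1 df)
_, x_tilde, _ = ols(np.column_stack([ones, income]), coupons)
_, y_tilde, _ = ols(np.column_stack([ones, income]), sales)
_, _, se_fwl = ols(x_tilde[:, None], y_tilde)

# Rescaling by the df ratio recovers the full-regression SE exactly
print(np.isclose(se_fwl[0] * np.sqrt((n - 1) / (n - 3)), se_full[1]))  # True
&lt;/code>&lt;/pre>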
&lt;h2 id="visualizing-partialling-out">Visualizing partialling-out&lt;/h2>
&lt;p>What does partialling-out actually look like? Regressing coupons on income produces fitted values that form a line through the data. The &lt;strong>residuals&lt;/strong> &amp;mdash; the vertical distances between each point and this line &amp;mdash; represent the coupon variation that income cannot explain.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_hat&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().predict()
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;income&amp;quot;], df[&amp;quot;coupons&amp;quot;], color=STEEL_BLUE, alpha=0.7,
           edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores&amp;quot;)
sns.regplot(x=&amp;quot;income&amp;quot;, y=&amp;quot;coupons&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.vlines(df[&amp;quot;income&amp;quot;],
          np.minimum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
          np.maximum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
          linestyle=&amp;quot;--&amp;quot;, color=NEAR_BLACK, alpha=0.5, linewidth=1,
          label=&amp;quot;Residuals&amp;quot;)
ax.set_xlabel(&amp;quot;Neighborhood income (thousands $)&amp;quot;)
ax.set_ylabel(&amp;quot;Coupon usage (%)&amp;quot;)
ax.set_title(&amp;quot;Partialling-out: removing income's effect on coupons&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_residuals_income.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_residuals_income.png" alt="Scatter plot of coupon usage versus income with a downward-sloping fitted line and vertical dashed lines showing residuals for each store.">
&lt;em>Partialling-out: the dashed lines are the residuals &amp;mdash; the coupon variation that income cannot explain.&lt;/em>&lt;/p>
&lt;p>The downward-sloping fitted line confirms that higher-income neighborhoods use fewer coupons. The vertical dashed lines are the residuals &amp;mdash; the part of coupon usage that income does not predict. Some stores use more coupons than their neighborhood income would suggest (positive residuals), and others use fewer (negative residuals). Partialling out income keeps only these residuals, effectively asking: &amp;ldquo;Among stores with similar income levels, which ones have unusually high or low coupon usage?&amp;rdquo;&lt;/p>
&lt;h2 id="the-conditional-relationship-revealed">The conditional relationship revealed&lt;/h2>
&lt;p>It is now possible to plot the relationship that the multivariate regression captures but cannot directly display: residualized sales against residualized coupons. Both variables have had income&amp;rsquo;s influence removed, so any remaining relationship is the &lt;strong>conditional&lt;/strong> effect of coupons on sales &amp;mdash; the effect after accounting for income differences.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde&amp;quot;], df[&amp;quot;sales_tilde&amp;quot;], color=STEEL_BLUE,
           alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores (residualized)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde&amp;quot;, y=&amp;quot;sales_tilde&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Residual coupon usage&amp;quot;)
ax.set_ylabel(&amp;quot;Residual sales&amp;quot;)
ax.set_title(&amp;quot;Conditional relationship after partialling-out income&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_partialled_out.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_partialled_out.png" alt="Scatter plot showing a positive relationship between residualized coupon usage and residualized sales, with an upward-sloping regression line.">
&lt;em>After removing income&amp;rsquo;s influence from both variables, the true positive effect of coupons on sales emerges.&lt;/em>&lt;/p>
&lt;p>The positive slope is now clearly visible. Stripping away the confounding influence of income reveals that stores where coupon usage is higher than expected (given their neighborhood income) tend to also have sales that are higher than expected. The slope of this line is exactly 0.2673 &amp;mdash; the same coefficient produced by the full multivariate regression.&lt;/p>
&lt;h2 id="scaling-for-interpretability">Scaling for interpretability&lt;/h2>
&lt;p>One drawback of the partialled-out plot is that both axes show residuals centered around zero, which makes the magnitudes hard to interpret. A negative coupon value of -5 does not mean the store has -5% coupon usage &amp;mdash; it means coupon usage is 5 percentage points below what income alone would predict.&lt;/p>
&lt;p>Adding the sample mean back to each residualized variable fixes this. The shift moves the axes without changing the slope.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_tilde_scaled&amp;quot;] = df[&amp;quot;coupons_tilde&amp;quot;] + df[&amp;quot;coupons&amp;quot;].mean()
df[&amp;quot;sales_tilde_scaled&amp;quot;] = df[&amp;quot;sales_tilde&amp;quot;] + df[&amp;quot;sales&amp;quot;].mean()
# Verify the coefficient is unchanged
scaled_model = smf.ols(&amp;quot;sales_tilde_scaled ~ coupons_tilde_scaled&amp;quot;, df).fit()
print(scaled_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>========================================================================================
                           coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               24.5585      4.053      6.059      0.000      16.409      32.708
coupons_tilde_scaled     0.2673      0.119      2.246      0.029       0.028       0.507
========================================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
           color=STEEL_BLUE, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60,
           label=&amp;quot;Stores (residualized + scaled)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
            ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Coupon usage (%, residualized + mean)&amp;quot;)
ax.set_ylabel(&amp;quot;Daily sales (thousands $, residualized + mean)&amp;quot;)
ax.set_title(&amp;quot;Scaled residuals: interpretable magnitudes&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_scaled_residuals.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_scaled_residuals.png" alt="Scatter plot of scaled residualized sales versus scaled residualized coupon usage, with axes now showing values in the original units centered around their means.">
&lt;em>Adding the sample means back to the residuals restores interpretable units without changing the slope.&lt;/em>&lt;/p>
&lt;p>The coefficient remains exactly 0.2673 (p = 0.029), confirming that adding the means back does not alter the estimated relationship. Now the axes are in interpretable units: coupon usage around 34% and daily sales around \$33,600. This scaled version is ideal for presentations and reports where the audience needs to understand both the direction and the magnitude of the conditional relationship at a glance.&lt;/p>
&lt;h2 id="extending-to-multiple-controls">Extending to multiple controls&lt;/h2>
&lt;p>The FWL theorem works with &lt;strong>any number&lt;/strong> of control variables, not just one. To demonstrate, the next step adds &lt;code>dayofweek&lt;/code> as a second control alongside income. The theorem says both controls can be partialled out simultaneously and the same coefficient on coupons will emerge.&lt;/p>
&lt;pre>&lt;code class="language-python"># Full regression with both controls
full_model_2 = smf.ols(&amp;quot;sales ~ coupons + income + dayofweek&amp;quot;, df).fit()
print(full_model_2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
                 coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.9825      7.172      0.555      0.581     -10.454      18.419
coupons        0.2706      0.119      2.266      0.028       0.030       0.511
income         0.3774      0.076      4.961      0.000       0.224       0.531
dayofweek      0.3195      0.245      1.306      0.198      -0.173       0.812
==============================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># FWL: partial out both income and dayofweek
df[&amp;quot;coupons_tilde_2&amp;quot;] = smf.ols(&amp;quot;coupons ~ income + dayofweek&amp;quot;, df).fit().resid
df[&amp;quot;sales_tilde_2&amp;quot;] = smf.ols(&amp;quot;sales ~ income + dayofweek&amp;quot;, df).fit().resid
fwl_multi = smf.ols(&amp;quot;sales_tilde_2 ~ coupons_tilde_2 - 1&amp;quot;, df).fit()
print(fwl_multi.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>===================================================================================
                      coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
coupons_tilde_2     0.2706      0.116      2.338      0.023       0.038       0.503
===================================================================================
&lt;/code>&lt;/pre>
&lt;p>With both controls, the full regression gives a coupon coefficient of 0.2706 (p = 0.028). The FWL procedure &amp;mdash; partialling out income and day of week from both sales and coupons &amp;mdash; yields the &lt;strong>identical&lt;/strong> coefficient of 0.2706 (p = 0.023). The day-of-week effect itself (0.3195, p = 0.198) is not statistically significant in this sample, but including it shifts the coupon point estimate slightly, from 0.2673 to 0.2706, and absorbs a little additional residual variance, nudging the standard error from 0.120 down to 0.119. This confirms that FWL extends to any number of controls.&lt;/p>
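&lt;p>As a reusable sketch, the partial-out-both-then-regress recipe can be wrapped in a small helper built on the same statsmodels formula API used throughout this post. The function &lt;code>fwl_coefficient&lt;/code> and the toy data below are hypothetical illustrations, not part of the original analysis.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fwl_coefficient(df, y, x, controls):
    # Coefficient on x from regressing y on x plus controls, via FWL
    rhs = &amp;quot; + &amp;quot;.join(controls)
    data = df.assign(
        x_tilde=smf.ols(f&amp;quot;{x} ~ {rhs}&amp;quot;, df).fit().resid,
        y_tilde=smf.ols(f&amp;quot;{y} ~ {rhs}&amp;quot;, df).fit().resid,
    )
    return smf.ols(&amp;quot;y_tilde ~ x_tilde - 1&amp;quot;, data).fit().params[&amp;quot;x_tilde&amp;quot;]

# Quick check on toy data with two confounders
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({&amp;quot;w1&amp;quot;: rng.normal(size=n), &amp;quot;w2&amp;quot;: rng.normal(size=n)})
toy[&amp;quot;d&amp;quot;] = 0.5 * toy[&amp;quot;w1&amp;quot;] - 0.3 * toy[&amp;quot;w2&amp;quot;] + rng.normal(size=n)
toy[&amp;quot;y&amp;quot;] = 0.2 * toy[&amp;quot;d&amp;quot;] + toy[&amp;quot;w1&amp;quot;] + toy[&amp;quot;w2&amp;quot;] + rng.normal(size=n)

full = smf.ols(&amp;quot;y ~ d + w1 + w2&amp;quot;, toy).fit().params[&amp;quot;d&amp;quot;]
print(np.isclose(full, fwl_coefficient(toy, &amp;quot;y&amp;quot;, &amp;quot;d&amp;quot;, [&amp;quot;w1&amp;quot;, &amp;quot;w2&amp;quot;])))  # True
&lt;/code>&lt;/pre>
&lt;p>Calling &lt;code>fwl_coefficient(df, &amp;quot;sales&amp;quot;, &amp;quot;coupons&amp;quot;, [&amp;quot;income&amp;quot;, &amp;quot;dayofweek&amp;quot;])&lt;/code> on the simulated store data should reproduce the 0.2706 estimate above.&lt;/p>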
&lt;h2 id="naive-vs-conditional-the-full-picture">Naive vs. conditional: the full picture&lt;/h2>
&lt;p>To appreciate how much the FWL procedure changes the conclusions, the next figure places the naive and conditional relationships side by side. The left panel shows the raw data; the right panel shows the same data after partialling out income.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: naive relationship
axes[0].scatter(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;sales&amp;quot;], color=STEEL_BLUE, alpha=0.7,
                edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[0])
axes[0].set_xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
axes[0].set_ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
axes[0].set_title(&amp;quot;Naive (no controls)&amp;quot;)
# Right: after partialling-out income
axes[1].scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
                color=TEAL, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
            ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[1])
axes[1].set_xlabel(&amp;quot;Coupon usage (%, after partialling-out)&amp;quot;)
axes[1].set_ylabel(&amp;quot;Daily sales (thousands $, after partialling-out)&amp;quot;)
axes[1].set_title(&amp;quot;After partialling-out income (FWL)&amp;quot;)
plt.suptitle(&amp;quot;The FWL theorem reveals the true relationship&amp;quot;,
             fontsize=14, fontweight=&amp;quot;bold&amp;quot;, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;fwl_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_comparison.png" alt="Two-panel figure comparing the naive negative relationship between sales and coupons on the left with the positive conditional relationship after partialling-out income on the right.">
&lt;em>Simpson&amp;rsquo;s paradox resolved: the naive negative slope (left) reverses to a positive slope (right) after partialling out income.&lt;/em>&lt;/p>
&lt;p>The contrast is striking. On the left, the naive analysis suggests a negative relationship (slope = -0.106) &amp;mdash; coupons appear to hurt sales. On the right, after removing income&amp;rsquo;s confounding influence, the true positive relationship emerges (slope = +0.267). This is a textbook example of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>: a trend that appears in aggregate data reverses when the data is properly conditioned on a relevant variable.&lt;/p>
&lt;h2 id="summary-of-results">Summary of results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Coupons coefficient&lt;/th>
&lt;th>Std. error&lt;/th>
&lt;th>p-value&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive OLS (no controls)&lt;/td>
&lt;td>-0.1059&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.365&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.120&lt;/td>
&lt;td>0.031&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 1 (residualize X only)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>1.271&lt;/td>
&lt;td>0.834&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 2 (residualize both)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.118&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.119&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>All FWL variants produce the same coefficient as the corresponding full regression, confirming the theorem. The coefficient of +0.267 is close to the true DGP value of +0.200, with the difference attributable to finite-sample noise in 50 observations.&lt;/p>
&lt;h2 id="applications-of-the-fwl-theorem">Applications of the FWL theorem&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it has practical applications across several domains.&lt;/p>
&lt;h3 id="data-visualization">Data visualization&lt;/h3>
&lt;p>As shown above, FWL makes it possible to plot the conditional relationship between two variables after controlling for confounders. This is invaluable when presenting regression results to non-technical audiences who understand scatter plots but not regression tables with multiple coefficients.&lt;/p>
&lt;h3 id="computational-efficiency">Computational efficiency&lt;/h3>
&lt;p>When a regression includes &lt;strong>high-dimensional fixed effects&lt;/strong> &amp;mdash; for example, year, industry, and country dummies that could add hundreds of columns &amp;mdash; computing the full regression becomes expensive. The FWL theorem allows software to partial out these fixed effects first, reducing the problem to a much smaller regression. Widely used packages that exploit this strategy include:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">reghdfe&lt;/a> in Stata&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/web/packages/fixest/index.html" target="_blank" rel="noopener">fixest&lt;/a> in R&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest&lt;/a> in Python &amp;mdash; a fast, user-friendly package for fixed-effects regression (including multi-way clustering and interaction effects), inspired by fixest&amp;rsquo;s R API&lt;/li>
&lt;li>&lt;a href="https://pyhdfe.readthedocs.io/en/stable/index.html" target="_blank" rel="noopener">pyhdfe&lt;/a> in Python&lt;/li>
&lt;/ul>
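&lt;p>The trick these packages exploit can be seen in miniature with one-way fixed effects: by FWL, regressing on a full set of group dummies is equivalent to demeaning each variable within its group, which avoids ever building the dummy matrix. The example below is a hypothetical sketch on simulated data, not the API of any package listed above.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: 200 firms across 10 industries with industry effects
rng = np.random.default_rng(3)
n, g = 200, 10
industry = rng.integers(0, g, n)
fe = rng.normal(0, 2, g)                  # industry-level effects
x = fe[industry] + rng.normal(size=n)     # regressor correlated with the effects
y = 0.7 * x + 3 * fe[industry] + rng.normal(size=n)
df = pd.DataFrame({&amp;quot;y&amp;quot;: y, &amp;quot;x&amp;quot;: x, &amp;quot;industry&amp;quot;: industry})

# Dummy-variable regression: one coefficient per industry
beta_dummies = smf.ols(&amp;quot;y ~ x + C(industry)&amp;quot;, df).fit().params[&amp;quot;x&amp;quot;]

# FWL shortcut: residualizing on the dummies is within-group demeaning
df[&amp;quot;x_dm&amp;quot;] = df[&amp;quot;x&amp;quot;] - df.groupby(&amp;quot;industry&amp;quot;)[&amp;quot;x&amp;quot;].transform(&amp;quot;mean&amp;quot;)
df[&amp;quot;y_dm&amp;quot;] = df[&amp;quot;y&amp;quot;] - df.groupby(&amp;quot;industry&amp;quot;)[&amp;quot;y&amp;quot;].transform(&amp;quot;mean&amp;quot;)
beta_within = smf.ols(&amp;quot;y_dm ~ x_dm - 1&amp;quot;, df).fit().params[&amp;quot;x_dm&amp;quot;]

print(np.isclose(beta_dummies, beta_within))  # True
&lt;/code>&lt;/pre>
&lt;p>With ten groups the dummy matrix is harmless; with hundreds of thousands of firm-by-year cells it is not, and the demeaning shortcut is what keeps the problem tractable.&lt;/p>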
&lt;h3 id="machine-learning-and-causal-inference">Machine learning and causal inference&lt;/h3>
&lt;p>Perhaps the most impactful modern application is &lt;strong>Double Machine Learning (DML)&lt;/strong>, developed by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). DML extends the FWL logic by replacing the OLS regressions in the partialling-out step with &lt;strong>flexible machine learning models&lt;/strong> (random forests, lasso, neural networks). This allows the control variables to have complex, nonlinear effects on both the treatment and the outcome &amp;mdash; while still recovering a valid causal estimate of the treatment effect.&lt;/p>
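&lt;p>To make the connection concrete, here is a minimal, self-contained sketch of the DML recipe under strong simplifying assumptions: a made-up nonlinear data-generating process, a crude binned-mean smoother standing in for a real machine learning model, and two-fold cross-fitting. The code illustrates the structure only; it is not taken from any DML package.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 2000
theta = 0.5                                    # true treatment effect
w = rng.uniform(0, 10, n)                      # confounder
d = np.sin(w) + rng.normal(0, 0.5, n)          # treatment depends on w nonlinearly
y = theta * d + w ** 2 / 10 + rng.normal(0, 0.5, n)

def fit_binned_mean(w_train, t_train, n_bins=20):
    # Crude nonparametric regression: mean of t within equal-width bins of w
    edges = np.linspace(w_train.min(), w_train.max(), n_bins + 1)
    idx = np.clip(np.digitize(w_train, edges) - 1, 0, n_bins - 1)
    means = np.array([t_train[idx == b].mean() if np.any(idx == b)
                      else t_train.mean() for b in range(n_bins)])
    return edges, means

def predict_binned_mean(model, w_new):
    edges, means = model
    idx = np.clip(np.digitize(w_new, edges) - 1, 0, len(means) - 1)
    return means[idx]

# Two-fold cross-fitting: residualize each half with a model from the other
d_res, y_res = np.empty(n), np.empty(n)
half_a, half_b = np.array_split(rng.permutation(n), 2)
for fit_idx, pred_idx in [(half_a, half_b), (half_b, half_a)]:
    for t, t_res in [(d, d_res), (y, y_res)]:
        model = fit_binned_mean(w[fit_idx], t[fit_idx])
        t_res[pred_idx] = t[pred_idx] - predict_binned_mean(model, w[pred_idx])

# The final step is pure FWL: regress outcome residuals on treatment residuals
theta_hat = np.cov(y_res, d_res)[0, 1] / np.var(d_res, ddof=1)
print(round(theta_hat, 2))  # close to the true 0.5
&lt;/code>&lt;/pre>
&lt;p>Replacing the binned means with any learner that has fit and predict steps (a random forest, gradient boosting, the lasso) leaves the cross-fitting and the final residual-on-residual regression unchanged; that is exactly the sense in which DML generalizes FWL.&lt;/p>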
&lt;p>If you want to see DML in action, check out the companion tutorial on &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Introduction to Causal Inference: Double Machine Learning&lt;/a>, which applies the partialling-out estimator to a real randomized experiment.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>This tutorial set out to answer a simple question: what does it mean to &amp;ldquo;control for&amp;rdquo; a variable in regression, and how can the result be visualized? The FWL theorem provides a definitive answer. Controlling for income in a regression of sales on coupons is equivalent to removing income&amp;rsquo;s influence from both variables and then regressing the residuals.&lt;/p>
&lt;p>In the simulated retail scenario, failing to control for income produced a misleading negative coefficient of -0.106, suggesting coupons reduce sales. After partialling out income, the coefficient reversed to +0.267 (p = 0.031), revealing that coupons genuinely increase sales by about \$267 per percentage point. This estimate is close to the true data-generating parameter of +0.200, with the gap attributable to sampling variability in just 50 stores.&lt;/p>
&lt;p>For a practitioner &amp;mdash; say, the marketing director of the retail chain &amp;mdash; the takeaway is clear. An analysis that ignored neighborhood income would conclude the coupon program was counterproductive. The FWL-based analysis shows it works, and provides a plot that makes this case visually compelling. The theorem bridges the gap between the numbers in a regression table and the intuitive two-variable scatter plot.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sign reversal.&lt;/strong> The naive coupon coefficient was -0.106 (negative, not significant). After controlling for income, it became +0.267 (positive, p = 0.031). Ignoring confounders can reverse not just the magnitude but the direction of an estimated effect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exact equivalence.&lt;/strong> The FWL procedure produced a coefficient of 0.2673 &amp;mdash; identical to the full multivariate regression down to four decimal places &amp;mdash; whether partialling out one control (income) or two (income + day of week). The theorem is not an approximation; it is an algebraic identity.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Visualization power.&lt;/strong> FWL reduces any multivariate regression to a univariate one, enabling scatter plots that display conditional relationships. This is especially valuable for communicating results to non-technical stakeholders.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Foundation for DML.&lt;/strong> FWL underpins modern causal inference methods like Double Machine Learning, where flexible ML learners replace OLS in the partialling-out step. Understanding FWL is a prerequisite for understanding DML.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Linearity assumption matters.&lt;/strong> The FWL procedure relies on OLS residualization, which assumes linear relationships between the controls and both the treatment and outcome. If income affects coupons or sales nonlinearly, OLS residuals will not fully remove the confounding &amp;mdash; motivating methods like DML that replace OLS with flexible learners.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data is simulated with a known linear DGP. In real data, the DGP is unknown and may be nonlinear, requiring methods like DML rather than plain OLS.&lt;/li>
&lt;li>The FWL theorem assumes a correctly specified linear model. If the relationship between income and coupons (or sales) is nonlinear, OLS residualization will not fully remove the confounding.&lt;/li>
&lt;li>With only 50 observations, the estimates have wide confidence intervals. Larger samples would sharpen the estimates.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Double Machine Learning&lt;/a> to learn how FWL extends to nonlinear settings.&lt;/li>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_dowhy/">Introduction to Causal Inference: The DoWhy Approach&lt;/a> for a full causal inference workflow with real data.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sample size sensitivity.&lt;/strong> Change &lt;code>N&lt;/code> from 50 to 500 in the &lt;code>simulate_store_data()&lt;/code> function. How do the naive and FWL coefficients change? How do the standard errors shrink? Is the naive estimate still misleading with a larger sample?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonlinear confounding.&lt;/strong> Modify the DGP so that income affects coupons nonlinearly: &lt;code>coupons = 60 - 0.01 * income**2 + noise&lt;/code>. Does the FWL procedure (with linear OLS residualization) still recover the true coefficient? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Real data application.&lt;/strong> Pick a dataset with a known confounder (e.g., the wage-education-ability relationship) and apply the FWL procedure. Visualize the naive and conditional relationships side by side.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud, M. (2022). Understanding the Frisch-Waugh-Lovell Theorem. &lt;em>Towards Data Science&lt;/em>.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1907330" target="_blank" rel="noopener">Frisch, R. and Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>Journal of the American Statistical Association&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/ectj/article/21/1/C1/5056401" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/restud/article-abstract/81/2/608/1523757" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest &amp;mdash; Fast Estimation of Fixed-Effects Models in Python&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels &amp;mdash; Statistical Modeling in Python&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors, so caution is warranted before applying it to real research projects.&lt;/p>