<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>FWL Theorem | Carlos Mendez</title><link>https://carlos-mendez.org/category/fwl-theorem/</link><atom:link href="https://carlos-mendez.org/category/fwl-theorem/index.xml" rel="self" type="application/rss+xml"/><description>FWL Theorem</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>FWL Theorem</title><link>https://carlos-mendez.org/category/fwl-theorem/</link></image><item><title>Visualizing Regression with the FWL Theorem in R</title><link>https://carlos-mendez.org/post/r_fwlplot/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_fwlplot/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This is perhaps the most common question in applied regression &amp;mdash; and one of the hardest to answer intuitively. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship that lives in multidimensional space and cannot be directly plotted on a 2D scatter plot. Or can it?&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> provides the answer. It says that the coefficient on any variable in a multiple regression equals the slope from a simple bivariate regression &amp;mdash; after first &amp;ldquo;partialling out&amp;rdquo; the other variables from both the outcome and the variable of interest. Partialling out means regressing a variable on the controls and keeping only the leftover (residual) variation &amp;mdash; the part that the controls cannot explain. This means we &lt;em>can&lt;/em> visualize any regression coefficient as a 2D scatter plot, as long as we first remove the influence of the controls from both axes.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">fwlplot&lt;/a> R package (Butts &amp;amp; McDermott, 2024) turns this into a one-liner. It uses the same formula syntax as &lt;a href="https://lrberge.github.io/fixest/reference/feols.html" target="_blank" rel="noopener">&lt;code>fixest::feols()&lt;/code>&lt;/a> &amp;mdash; including the &lt;code>|&lt;/code> operator for fixed effects &amp;mdash; and produces a scatter plot of the residualized data with the regression line overlaid. The result is a visual answer to &amp;ldquo;what does controlling for X look like?&amp;rdquo;&lt;/p>
&lt;p>This tutorial builds intuition progressively. We start with simulated data where we &lt;em>know&lt;/em> the true effect, show how confounding creates a misleading picture, and use &lt;code>fwl_plot()&lt;/code> to reveal the truth. We then extend to real data with high-dimensional fixed effects &amp;mdash; first flights data (controlling for origin and destination airports) and then panel wage data (controlling for unobserved individual ability).&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>State the FWL theorem and explain its geometric intuition&lt;/li>
&lt;li>Use &lt;code>fwl_plot()&lt;/code> to visualize a bivariate relationship before and after controlling for confounders&lt;/li>
&lt;li>Demonstrate that manual FWL residualization reproduces &lt;code>feols()&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize what fixed effects &amp;ldquo;do&amp;rdquo; to data by comparing raw vs. residualized scatter plots&lt;/li>
&lt;li>Apply &lt;code>fwl_plot()&lt;/code> to real panel data with high-dimensional fixed effects&lt;/li>
&lt;li>Connect FWL to omitted variable bias and Simpson&amp;rsquo;s paradox&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Simulated&amp;lt;br/&amp;gt;Data&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;fwl_plot()&amp;lt;br/&amp;gt;Naive vs. FWL&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights Data&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;ggplot2&amp;lt;br/&amp;gt;&amp;amp; Recipe&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>fwl_plot()&lt;/code> first, then peek under the hood with manual FWL verification. From there we apply the same one-liner to increasingly complex real-world settings.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-and-load-packages">3.1 Install and load packages&lt;/h3>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;fwlplot&amp;quot;, &amp;quot;fixest&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;patchwork&amp;quot;,
&amp;quot;nycflights13&amp;quot;, &amp;quot;wooldridge&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
library(fwlplot)
library(fixest)
library(ggplot2)
library(patchwork)
library(nycflights13)
library(wooldridge)
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>fwlplot&lt;/code> package provides the &lt;code>fwl_plot()&lt;/code> function for FWL-residualized scatter plots. It is built on &lt;code>fixest&lt;/code>, which handles the residualization computation using fast demeaning algorithms. The &lt;code>patchwork&lt;/code> package lets us combine multiple ggplot2 plots side by side. The &lt;code>nycflights13&lt;/code> and &lt;code>wooldridge&lt;/code> packages provide the real datasets we will use later.&lt;/p>
&lt;h3 id="32-simulated-confounding-data">3.2 Simulated confounding data&lt;/h3>
&lt;p>To build intuition, we simulate a retail scenario where a store manager wants to know whether distributing coupons increases sales. The catch: &lt;strong>income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales. This creates a spurious negative correlation between coupons and sales, even though coupons genuinely boost sales.&lt;/p>
&lt;p>The causal structure looks like this:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income opens a &amp;ldquo;backdoor path&amp;rdquo; from coupons to sales: coupons ← income → sales. Unless we block this path by controlling for income, the naive estimate will be biased. The data generating process is:&lt;/p>
&lt;p>$$\text{income} \sim N(50, 10)$$&lt;/p>
&lt;p>$$\text{coupons} = 60 - 0.5 \times \text{income} + \epsilon_1, \quad \epsilon_1 \sim N(0, 5)$$&lt;/p>
&lt;p>$$\text{sales} = 10 + 0.2 \times \text{coupons} + 0.3 \times \text{income} + \epsilon_2, \quad \epsilon_2 \sim N(0, 3)$$&lt;/p>
&lt;p>In words, the true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>: each additional coupon increases sales by 0.2 units. But because income negatively drives coupons ($-0.5$) and positively drives sales ($+0.3$), a naive regression of sales on coupons alone will confound the coupon effect with the income effect, producing a biased estimate. The noise terms $\epsilon_1$ and $\epsilon_2$ correspond to the &lt;code>rnorm()&lt;/code> calls in the code below.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(42)
n &amp;lt;- 200
income &amp;lt;- rnorm(n, mean = 50, sd = 10)
dayofweek &amp;lt;- sample(1:7, n, replace = TRUE)
coupons &amp;lt;- 60 - 0.5 * income + rnorm(n, 0, 5)
sales &amp;lt;- 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rnorm(n, 0, 3)
store_data &amp;lt;- data.frame(
sales = round(sales, 2),
coupons = round(coupons, 2),
income = round(income, 2),
dayofweek = dayofweek
)
head(store_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income dayofweek
1 40.02 27.79 63.71 4
2 31.37 34.03 44.35 5
3 31.30 28.01 53.63 6
4 34.37 28.68 56.33 4
5 42.62 35.91 54.04 5
6 39.50 33.45 48.94 4
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">round(cor(store_data[, c(&amp;quot;sales&amp;quot;, &amp;quot;coupons&amp;quot;, &amp;quot;income&amp;quot;)]), 3)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income
sales 1.000 -0.166 0.500
coupons -0.166 1.000 -0.709
income 0.500 -0.709 1.000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true causal effect is positive (+0.2). This is because income is strongly negatively correlated with coupons (-0.709) and strongly positively correlated with sales (0.500). A naive analysis would conclude that coupons hurt sales &amp;mdash; a classic instance of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>, where the direction of an association reverses when a confounding variable is accounted for.&lt;/p>
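&lt;p>To see the reversal in the raw data itself, we can hold income roughly fixed by splitting stores into income terciles and computing the coupons&amp;ndash;sales correlation within each group (a quick sketch, not part of the main analysis; the &lt;code>income_group&lt;/code> variable is created here only for illustration). Because income still varies somewhat within each tercile, the within-group correlations should move toward the true positive effect without reaching it exactly:&lt;/p>
&lt;pre>&lt;code class="language-r"># Split stores into income terciles and correlate within groups
store_data$income_group &amp;lt;- cut(
  store_data$income,
  breaks = quantile(store_data$income, probs = c(0, 1/3, 2/3, 1)),
  labels = c(&amp;quot;low&amp;quot;, &amp;quot;mid&amp;quot;, &amp;quot;high&amp;quot;),
  include.lowest = TRUE
)
sapply(split(store_data, store_data$income_group),
       function(d) round(cor(d$sales, d$coupons), 3))
&lt;/code>&lt;/pre>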
&lt;h2 id="4-fwl_plot-in-action-naive-vs-controlled">4. fwl_plot() in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest way to see why confounding is dangerous: plot the raw relationship with &lt;code>fwl_plot()&lt;/code>. When no controls are specified, &lt;code>fwl_plot()&lt;/code> produces a standard scatter plot with a regression line:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.019$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The store manager who trusts this analysis would cancel the coupon program, losing real revenue.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-line-of-code">4.2 Controlling for income: one line of code&lt;/h3>
&lt;p>Now watch what happens when we add &lt;code>income&lt;/code> as a control &amp;mdash; just add it to the formula:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$) &amp;mdash; close to the true value of +0.2. The &lt;code>fwl_plot()&lt;/code> function residualized both coupons and sales on income behind the scenes, then plotted the residuals. The figure below shows both panels side by side:&lt;/p>
&lt;p>&lt;img src="r_fwlplot_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope; after FWL residualization on income (right), the slope reverses to positive">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales (a downward slope). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes. Once income is partialled out, the true positive effect of coupons emerges clearly. This is what &amp;ldquo;controlling for income&amp;rdquo; looks like geometrically &amp;mdash; and &lt;code>fwl_plot()&lt;/code> produces it in a single line.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>The &lt;code>fixest::feols()&lt;/code> function produces the same coefficient, confirmed by &lt;code>etable()&lt;/code> for side-by-side comparison:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_naive &amp;lt;- feols(sales ~ coupons, data = store_data)
fe_full &amp;lt;- feols(sales ~ coupons + income, data = store_data)
etable(fe_naive, fe_full, headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;Controlled&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full
Naive Controlled
Dependent Var.: sales sales
Constant 36.93*** (1.397) 11.34*** (3.008)
coupons -0.0934* (0.0393) 0.2123*** (0.0467)
income 0.3004*** (0.0325)
_______________ _________________ __________________
S.E. type IID IID
Observations 200 200
R2 0.02768 0.32148
Adj. R2 0.02277 0.31459
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3. Every number in this table corresponds to a visual feature of the &lt;code>fwl_plot()&lt;/code> scatter plots above.&lt;/p>
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be stated as a simple recipe. Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group. Similarly, FWL compares how many more or fewer coupons a store received &lt;em>for its income level&lt;/em> against how much higher or lower its sales were &lt;em>for its income level&lt;/em>.&lt;/p>
&lt;p>The three steps are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Regress sales on income&lt;/strong>, save the residuals (the part of sales that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress coupons on income&lt;/strong>, save the residuals (the part of coupons that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress the sales residuals on the coupon residuals&lt;/strong> &amp;mdash; the slope is the coupon coefficient&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-r"># Step 1: Residualize sales on income
resid_y &amp;lt;- resid(lm(sales ~ income, data = store_data))
# Step 2: Residualize coupons on income
resid_x &amp;lt;- resid(lm(coupons ~ income, data = store_data))
# Step 3: Regress residuals on residuals
fwl_manual &amp;lt;- lm(resid_y ~ resid_x)
# Compare coefficients
cat(&amp;quot;feols coefficient: &amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;], 6), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Manual FWL coefficient:&amp;quot;, round(coef(fwl_manual)[&amp;quot;resid_x&amp;quot;], 6), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">feols coefficient: 0.212288
Manual FWL coefficient: 0.212288
&lt;/code>&lt;/pre>
&lt;p>The coefficients match to six decimal places. This is not an approximation &amp;mdash; it is an exact algebraic identity. Every time you run a multiple regression, the software is implicitly performing these three steps for each coefficient.&lt;/p>
&lt;h3 id="52-the-formal-theorem">5.2 The formal theorem&lt;/h3>
&lt;p>For those who want the math, the FWL theorem states that in the regression $Y = X_1 \beta_1 + X_2 \beta_2 + \epsilon$, the coefficient $\hat{\beta}_1$ equals:&lt;/p>
&lt;p>$$\hat{\beta}_1 = (\tilde{X}_1' \tilde{X}_1)^{-1} \tilde{X}_1' \tilde{Y}, \quad \text{where} \quad \tilde{Y} = M_{X_2} Y, \quad \tilde{X}_1 = M_{X_2} X_1$$&lt;/p>
&lt;p>Here $M_{X_2} = I - X_2(X_2'X_2)^{-1}X_2'$ is the &amp;ldquo;residual-maker&amp;rdquo; matrix that projects out the effect of $X_2$. In our example, $Y$ is &lt;code>sales&lt;/code>, $X_1$ is &lt;code>coupons&lt;/code>, and $X_2$ is &lt;code>income&lt;/code>. The tilded variables $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from the &lt;code>resid()&lt;/code> calls above.&lt;/p>
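&lt;p>The matrix algebra can be sketched directly in R using only objects defined above (the names &lt;code>X2&lt;/code>, &lt;code>M&lt;/code>, and the tilded variables are ours): build the residual-maker matrix for income, residualize both sides, and recover the coupon coefficient:&lt;/p>
&lt;pre>&lt;code class="language-r"># Residual-maker matrix M = I - X2 (X2'X2)^{-1} X2'
X2 &amp;lt;- cbind(1, store_data$income)    # controls: intercept + income
M &amp;lt;- diag(nrow(store_data)) - X2 %*% solve(crossprod(X2)) %*% t(X2)
Y_tilde &amp;lt;- M %*% store_data$sales     # same as resid(lm(sales ~ income))
X1_tilde &amp;lt;- M %*% store_data$coupons  # same as resid(lm(coupons ~ income))
# beta_1 = (X1~' X1~)^{-1} X1~' Y~
solve(crossprod(X1_tilde), crossprod(X1_tilde, Y_tilde))
&lt;/code>&lt;/pre>
&lt;p>The result matches the &lt;code>feols()&lt;/code> coefficient on coupons (0.212288), since the projection is exactly what the &lt;code>resid()&lt;/code> calls compute.&lt;/p>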
&lt;h3 id="53-omitted-variable-bias-predicting-the-error">5.3 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding we saw is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly. When we omit income from the regression, the bias on the coupon coefficient is:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$) multiplied by the coefficient from the auxiliary regression of the omitted variable on the treatment ($\hat{\delta}$). Here $\hat{\gamma}$ is the effect of income on sales (in the full model) and $\hat{\delta}$ is the coefficient from regressing income on coupons.&lt;/p>
&lt;pre>&lt;code class="language-r">gamma_hat &amp;lt;- coef(fe_full)[&amp;quot;income&amp;quot;] # 0.3004
delta_hat &amp;lt;- coef(lm(income ~ coupons, data = store_data))[&amp;quot;coupons&amp;quot;] # approx. -1.02
ovb &amp;lt;- gamma_hat * delta_hat # -0.3057
cat(&amp;quot;OVB = gamma * delta:&amp;quot;, round(ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Naive = True + OVB:&amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;] + ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Actual naive:&amp;quot;, round(coef(fe_naive)[&amp;quot;coupons&amp;quot;], 4), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = gamma * delta: -0.3057
Naive = True + OVB: -0.0934
Actual naive: -0.0934
&lt;/code>&lt;/pre>
&lt;p>The OVB formula reproduces the naive coefficient exactly: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the auxiliary coefficient of income on coupons ($\hat{\delta} \approx -1.02$) gives a bias of $-0.306$, and true + bias $= 0.212 - 0.306 = -0.093$ matches the naive estimate. This is no coincidence: when the auxiliary regression is fit on the same sample, $\hat{\beta}_{\text{short}} = \hat{\beta}_{\text{long}} + \hat{\gamma}\hat{\delta}$ is an algebraic identity, not an approximation. The key insight: the bias is &lt;em>predictable&lt;/em>. If you know the direction of the confounder&amp;rsquo;s effects on both the treatment and the outcome, you know which way the naive estimate is biased.&lt;/p>
&lt;h3 id="54-adding-more-controls">5.4 Adding more controls&lt;/h3>
&lt;p>The FWL theorem extends naturally to any number of controls. The &lt;code>fwl_plot()&lt;/code> call handles it automatically:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_full3 &amp;lt;- feols(sales ~ coupons + income + dayofweek, data = store_data)
etable(fe_naive, fe_full, fe_full3,
headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;+ Income&amp;quot;, &amp;quot;+ Income + Day&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full fe_full3
Naive + Income + Income + Day
Dependent Var.: sales sales sales
Constant 36.93*** (1.397) 11.34*** (3.008) 9.640** (2.953)
coupons -0.0934* (0.0393) 0.2123*** (0.0467) 0.2219*** (0.0454)
income 0.3004*** (0.0325) 0.2961*** (0.0316)
dayofweek 0.4029*** (0.1095)
_______________ _________________ __________________ __________________
S.E. type IID IID IID
Observations 200 200 200
R2 0.02768 0.32148 0.36535
Adj. R2 0.02277 0.31459 0.35564
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig2_fwl_verification.png" alt="Three-panel FWL progression: no controls (left), controlling for income (center), controlling for income + day of week (right)">&lt;/p>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared jumps from 0.028 to 0.365 as we add controls. Each &lt;code>fwl_plot()&lt;/code> panel shows a tighter cloud as more variation is absorbed by the controls &amp;mdash; the residualized scatter becomes more focused on the &lt;em>coupon-specific&lt;/em> variation in sales.&lt;/p>
&lt;h2 id="6-visualizing-fixed-effects">6. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="61-what-are-fixed-effects">6.1 What are fixed effects?&lt;/h3>
&lt;p>Fixed effects are a special case of the FWL theorem applied to group dummy variables. When we include airport fixed effects in a regression, we are &amp;ldquo;partialling out&amp;rdquo; airport-specific means &amp;mdash; in other words, &lt;strong>demeaning&lt;/strong>. Demeaning means subtracting each group&amp;rsquo;s average from every observation in that group. The result is that we compare each airport to &lt;em>itself&lt;/em> rather than comparing different airports to each other.&lt;/p>
&lt;p>Think of it like a race handicap. Raw times compare runners who started at different positions. Demeaning each runner&amp;rsquo;s times converts them to &amp;ldquo;how much faster or slower than their personal average,&amp;rdquo; making the comparison fair. The FWL theorem guarantees that this demeaning procedure produces the same coefficients as including a full set of dummy variables in the regression.&lt;/p>
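&lt;p>The equivalence is easy to verify with the simulated store data, treating &lt;code>dayofweek&lt;/code> as the group (a minimal sketch; the &lt;code>_dm&lt;/code> variables are ours). Demeaning both variables by group and running a bivariate regression gives exactly the same slope as including a full set of day dummies:&lt;/p>
&lt;pre>&lt;code class="language-r"># Group demeaning vs. explicit dummies: identical slopes by FWL
d &amp;lt;- store_data
d$sales_dm &amp;lt;- d$sales - ave(d$sales, d$dayofweek)
d$coupons_dm &amp;lt;- d$coupons - ave(d$coupons, d$dayofweek)
coef(lm(sales_dm ~ coupons_dm, data = d))[&amp;quot;coupons_dm&amp;quot;]
coef(lm(sales ~ coupons + factor(dayofweek), data = d))[&amp;quot;coupons&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>The two coefficients coincide exactly, which is the FWL guarantee: demeaning is just partialling out the group dummies.&lt;/p>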
&lt;h3 id="62-flights-data-progressive-fixed-effects">6.2 Flights data: progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>nycflights13&lt;/code> dataset contains all flights that departed New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013. We ask: what is the relationship between air time and departure delay?&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;flights&amp;quot;, package = &amp;quot;nycflights13&amp;quot;)
flights_clean &amp;lt;- flights[complete.cases(flights[, c(&amp;quot;dep_delay&amp;quot;, &amp;quot;air_time&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;dest&amp;quot;)]), ]
flights_clean &amp;lt;- flights_clean[flights_clean$dep_delay &amp;lt; 120 &amp;amp; flights_clean$dep_delay &amp;gt; -30, ]
# Remove singleton origin-dest combos for stable FE estimation
od_counts &amp;lt;- table(paste(flights_clean$origin, flights_clean$dest))
flights_clean &amp;lt;- flights_clean[paste(flights_clean$origin, flights_clean$dest) %in%
names(od_counts[od_counts &amp;gt; 1]), ]
cat(&amp;quot;Observations:&amp;quot;, nrow(flights_clean), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 317578
&lt;/code>&lt;/pre>
&lt;p>To avoid overplotting, we draw a random sample of 5,000 flights; the &lt;code>fwl_plot()&lt;/code> calls below fit and plot on this subsample, while the regression tables in Section 6.3 use the full data. (Alternatively, the &lt;code>n_sample&lt;/code> argument samples only the plotted points while fitting the line on all rows.)&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(123)
flights_sample &amp;lt;- flights_clean[sample(nrow(flights_clean), 5000), ]
&lt;/code>&lt;/pre>
&lt;p>Now the power of &lt;code>fwl_plot()&lt;/code> &amp;mdash; three one-liners that progressively add fixed effects. In &lt;code>fixest&lt;/code> syntax, the &lt;code>|&lt;/code> operator separates regular covariates (left) from fixed effects (right), so &lt;code>dep_delay ~ air_time | origin + dest&lt;/code> means &amp;ldquo;regress departure delay on air time, with origin and destination fixed effects&amp;rdquo;:&lt;/p>
&lt;pre>&lt;code class="language-r"># No fixed effects
fwl_plot(dep_delay ~ air_time, data = flights_sample, ggplot = TRUE)
# Origin airport FE
fwl_plot(dep_delay ~ air_time | origin, data = flights_sample, ggplot = TRUE)
# Origin + destination FE
fwl_plot(dep_delay ~ air_time | origin + dest, data = flights_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig3_fixed_effects.png" alt="Progressive FWL plots: no FE (left), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>The visual transformation is striking. Panel A (no FE) shows a vague cloud with a nearly flat slope. Panel B (origin FE) removes the three origin-airport means, tightening the horizontal spread. Panel C (origin + destination FE) removes the 103 destination means as well, collapsing the air-time variation to &lt;em>within-route&lt;/em> deviations.&lt;/p>
&lt;h3 id="63-comparing-regression-tables">6.3 Comparing regression tables&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_none &amp;lt;- feols(dep_delay ~ air_time, data = flights_clean)
fe_origin &amp;lt;- feols(dep_delay ~ air_time | origin, data = flights_clean)
fe_both &amp;lt;- feols(dep_delay ~ air_time | origin + dest, data = flights_clean)
etable(fe_none, fe_origin, fe_both,
headers = c(&amp;quot;No FE&amp;quot;, &amp;quot;Origin FE&amp;quot;, &amp;quot;Origin + Dest FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_none fe_origin fe_both
No FE Origin FE Origin + Dest FE
Dependent Var.: dep_delay dep_delay dep_delay
air_time -0.0031*** (0.0004) -0.0061*** (0.0005) -0.0067. (0.0034)
Fixed-Effects: ------------------- ------------------- -----------------
origin No Yes Yes
dest No No Yes
_______________ ___________________ ___________________ _________________
Observations 317,578 317,578 317,578
R2 0.00016 0.00594 0.01296
Within R2 -- 0.00058 1.19e-5
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.003 (no FE), -0.006 (origin FE), -0.007 (origin + destination FE, significant at the 10% level only &amp;mdash; the &lt;code>.&lt;/code> marker indicates $p &amp;lt; 0.10$). The residualized scatter in Panel C answers a sharper question: &amp;ldquo;For flights on the &lt;em>same route&lt;/em>, does longer-than-usual air time predict higher-than-usual departure delay?&amp;rdquo; The answer is weakly negative and, with a within R2 of 1.19e-5, essentially zero in magnitude. Since departure delay is determined before takeoff, any within-route association with realized air time is better read as both variables responding to shared conditions (weather, winds, congestion) than as a causal effect of air time on delay.&lt;/p>
&lt;h2 id="7-panel-data-returns-to-experience">7. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="71-the-wage-panel">7.1 The wage panel&lt;/h3>
&lt;p>The &lt;code>wagepan&lt;/code> dataset from the Wooldridge textbook contains panel data on 545 individuals observed over 8 years (1980&amp;ndash;1987). A classic question in labor economics is: what is the return to experience?&lt;/p>
&lt;p>The challenge is &lt;strong>unobserved ability&lt;/strong>. Two people with 5 years of experience may earn very different wages because one is more talented, motivated, or well-connected. These personal traits &amp;mdash; which we cannot directly measure &amp;mdash; are the &amp;ldquo;unobserved ability&amp;rdquo; that creates omitted variable bias. More talented workers earn higher wages &lt;em>and&lt;/em> tend to accumulate experience in higher-paying jobs, so the naive correlation between experience and wages confounds ability with genuine experience effects.&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;wagepan&amp;quot;, package = &amp;quot;wooldridge&amp;quot;)
cat(&amp;quot;Observations:&amp;quot;, nrow(wagepan), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Individuals:&amp;quot;, length(unique(wagepan$nr)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, length(unique(wagepan$year)), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 4360
Individuals: 545
Years: 8
&lt;/code>&lt;/pre>
&lt;h3 id="72-pooled-ols-vs-individual-fixed-effects">7.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_pool &amp;lt;- feols(lwage ~ educ + exper + expersq, data = wagepan)
fe_fe &amp;lt;- feols(lwage ~ exper + expersq | nr, data = wagepan)
fe_twfe &amp;lt;- feols(lwage ~ exper + expersq | nr + year, data = wagepan)
etable(fe_pool, fe_fe, fe_twfe,
headers = c(&amp;quot;Pooled OLS&amp;quot;, &amp;quot;Individual FE&amp;quot;, &amp;quot;Individual + Year FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_pool fe_fe fe_twfe
Pooled OLS Individual FE Individual + Year FE
Dependent Var.: lwage lwage lwage
Constant -0.0564 (0.0639)
educ 0.1021*** (0.0047)
exper 0.1050*** (0.0102) 0.1223*** (0.0082)
expersq -0.0036*** (0.0007) -0.0045*** (0.0006) -0.0054*** (0.0007)
Fixed-Effects: ------------------- ------------------- -------------------
nr No Yes Yes
year No No Yes
_______________ ___________________ ___________________ ___________________
Observations 4,360 4,360 4,360
R2 0.14772 0.61727 0.61850
Within R2 -- 0.17270 0.01534
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. First, the &lt;code>educ&lt;/code> coefficient disappears from the individual FE column &amp;mdash; education is time-invariant for most individuals, so it is perfectly collinear with person dummies. Second, the &lt;code>exper&lt;/code> linear term disappears from the two-way FE column &amp;mdash; because experience increments by exactly one year for everyone, it is perfectly collinear with year dummies. Only &lt;code>expersq&lt;/code> (which varies non-linearly across individuals) survives.&lt;/p>
&lt;p>In the individual FE model, the experience coefficient &lt;em>increases&lt;/em> from 0.105 to 0.122. This means the within-person return to experience is larger than the pooled estimate. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation &amp;mdash; most of the &amp;ldquo;action&amp;rdquo; in wages comes from &lt;em>who you are&lt;/em>, not &lt;em>how many years you have worked&lt;/em>.&lt;/p>
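&lt;p>The individual-FE column can be reproduced by hand with the within transformation, the panel version of the demeaning from Section 6 (a sketch; the &lt;code>_dm&lt;/code> suffix is ours):&lt;/p>
&lt;pre>&lt;code class="language-r"># Within transformation: demean each variable by individual, then plain OLS
wp &amp;lt;- wagepan
wp$lwage_dm &amp;lt;- wp$lwage - ave(wp$lwage, wp$nr)
wp$exper_dm &amp;lt;- wp$exper - ave(wp$exper, wp$nr)
wp$expersq_dm &amp;lt;- wp$expersq - ave(wp$expersq, wp$nr)
coef(lm(lwage_dm ~ exper_dm + expersq_dm, data = wp))[-1]
coef(fe_fe)  # same exper and expersq point estimates
&lt;/code>&lt;/pre>
&lt;p>The point estimates coincide with the &lt;code>feols()&lt;/code> fixed-effects column (0.1223 and -0.0045); only the standard errors differ, because &lt;code>lm()&lt;/code> does not adjust the degrees of freedom for the 545 absorbed person means.&lt;/p>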
&lt;h3 id="73-visualizing-the-within-person-variation">7.3 Visualizing the within-person variation&lt;/h3>
&lt;p>Again, &lt;code>fwl_plot()&lt;/code> produces the before/after comparison in two one-liners. We sample 150 individuals for visual clarity (with 545 individuals the plot would be too dense):&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(456)
sample_ids &amp;lt;- sample(unique(wagepan$nr), 150)
wage_sample &amp;lt;- wagepan[wagepan$nr %in% sample_ids, ]
# Raw bivariate relationship
fwl_plot(lwage ~ exper, data = wage_sample, ggplot = TRUE)
# With individual fixed effects
fwl_plot(lwage ~ exper | nr, data = wage_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig4_panel_data.png" alt="Raw pooled cross-section (left) vs. individual fixed-effects residualized scatter (right) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A plots the raw bivariate relationship with a shallow slope of about 0.03. The wide fan of points reflects unobserved ability differences: individuals at the same experience level have wildly different wages. Panel B (individual FE) strips away each person&amp;rsquo;s average wage and average experience, leaving only the &lt;em>within-person&lt;/em> deviations. The slope steepens to roughly 0.12, about four times larger and in line with the full-panel individual-FE estimate of 0.122: an additional year of experience raises log wages by about 0.12 (roughly 12%) &lt;em>within the same individual&lt;/em>. The tighter cloud in Panel B shows that once we account for who each person is, the experience-wage relationship is much more precisely identified.&lt;/p>
&lt;h2 id="8-customization-and-quick-reference">8. Customization and Quick Reference&lt;/h2>
&lt;h3 id="81-ggplot2-integration">8.1 ggplot2 integration&lt;/h3>
&lt;p>The &lt;code>fwl_plot()&lt;/code> function can return a ggplot2 object by setting &lt;code>ggplot = TRUE&lt;/code>, allowing full customization with ggplot2 layers and themes. This is useful for publication-quality figures with consistent styling, faceting, or combining multiple plots with &lt;code>patchwork&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-r">p &amp;lt;- fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
fig5 &amp;lt;- p +
labs(title = &amp;quot;FWL Visualization: Coupons Effect on Sales&amp;quot;,
subtitle = &amp;quot;After residualizing on income&amp;quot;) +
theme_minimal(base_size = 13)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig5_ggplot_custom.png" alt="FWL scatter plot with ggplot2 customization showing coupons effect on sales after residualizing on income">&lt;/p>
&lt;h3 id="82-quick-reference-fwl_plot-recipes">8.2 Quick reference: fwl_plot() recipes&lt;/h3>
&lt;p>Here are the most common &lt;code>fwl_plot()&lt;/code> patterns you will use:&lt;/p>
&lt;pre>&lt;code class="language-r"># 1. Raw scatter (no controls)
fwl_plot(y ~ x, data = df)
# 2. Control for one or more variables
fwl_plot(y ~ x + control1 + control2, data = df)
# 3. Fixed effects (use | to separate)
fwl_plot(y ~ x | group_fe, data = df)
# 4. Multiple fixed effects
fwl_plot(y ~ x | fe1 + fe2, data = df)
# 5. Return ggplot2 object for customization
fwl_plot(y ~ x + control, data = df, ggplot = TRUE) + theme_minimal()
# 6. Sample points for large datasets (line uses all data)
fwl_plot(y ~ x | fe, data = big_data, n_sample = 5000)
&lt;/code>&lt;/pre>
&lt;h3 id="83-key-arguments">8.3 Key arguments&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Argument&lt;/th>
&lt;th>Purpose&lt;/th>
&lt;th>Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>formula&lt;/code>&lt;/td>
&lt;td>Same as &lt;code>feols()&lt;/code>: &lt;code>y ~ x + controls | FE&lt;/code>&lt;/td>
&lt;td>&lt;code>sales ~ coupons + income&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>data&lt;/code>&lt;/td>
&lt;td>Input data frame&lt;/td>
&lt;td>&lt;code>store_data&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ggplot&lt;/code>&lt;/td>
&lt;td>Return ggplot2 object (default: base R)&lt;/td>
&lt;td>&lt;code>ggplot = TRUE&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n_sample&lt;/code>&lt;/td>
&lt;td>Sample N points for large datasets&lt;/td>
&lt;td>&lt;code>n_sample = 5000&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>vcov&lt;/code>&lt;/td>
&lt;td>Variance-covariance specification&lt;/td>
&lt;td>&lt;code>vcov = &amp;quot;hetero&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For large datasets like the flights data (317K+ observations), the &lt;code>n_sample&lt;/code> argument is essential to avoid overplotting. The regression line is always computed on the full data &amp;mdash; only the &lt;em>plotted points&lt;/em> are sampled, so the slope is unaffected.&lt;/p>
&lt;h2 id="9-discussion">9. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it is the foundation of how modern regression software works. When &lt;code>fixest::feols()&lt;/code> estimates a model with fixed effects, it does not literally create and invert a matrix with thousands of dummy variables. Instead, it uses the FWL logic to demean the data and run OLS on the residuals. This is why &lt;code>fixest&lt;/code> can handle millions of observations with hundreds of thousands of fixed effects: the demeaning step is $O(N)$, while creating the full dummy matrix would be $O(N \times K)$.&lt;/p>
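&lt;p>The equivalence between explicit dummy variables and iterative demeaning can be sketched in a few lines of Python (an illustrative simulation of the alternating-demeaning idea; &lt;code>fixest&lt;/code>&amp;rsquo;s actual algorithm is a far more optimized variant):&lt;/p>

```python
# Illustrative two-way demeaning vs. explicit dummy matrices.
# Simulated data; group sizes and effects are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, k1, k2 = 2000, 30, 40
g1 = rng.integers(0, k1, n)                      # e.g. origin FE
g2 = rng.integers(0, k2, n)                      # e.g. destination FE
fe1, fe2 = rng.normal(size=k1), rng.normal(size=k2)
x = rng.normal(size=n) + 0.5 * fe1[g1]
y = -0.007 * x + fe1[g1] + fe2[g2] + rng.normal(0, 0.5, n)

def demean_two_way(v, iters=200):
    v = v.copy()
    for _ in range(iters):                        # alternating projections
        for g, k in ((g1, k1), (g2, k2)):
            means = np.bincount(g, weights=v, minlength=k) / np.bincount(g, minlength=k)
            v = v - means[g]
    return v

x_t, y_t = demean_two_way(x), demean_two_way(y)
beta_demean = (x_t @ y_t) / (x_t @ x_t)           # O(N) memory per pass

# Same coefficient from explicit dummy matrices: O(N x K) memory
D = np.column_stack([np.eye(k1)[g1], np.eye(k2)[g2]])
beta_dummy = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]
```

&lt;p>Both routes give the same slope, but the demeaning route never materializes the dummy matrix &lt;code>D&lt;/code>, which is what makes the approach scale.&lt;/p>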
&lt;p>As a diagnostic tool, FWL scatter plots reveal problems that regression tables hide. If the residualized scatter shows a curved relationship, your linear specification may be wrong. If it shows outliers, they may be driving the coefficient. If the cloud collapses to a near-vertical line (as in Panel C of the flights figure), the within-group variation may be too small to identify the effect reliably.&lt;/p>
&lt;p>The FWL theorem also connects to more advanced methods. &lt;strong>Double Machine Learning&lt;/strong> (Chernozhukov et al., 2018) generalizes the partialling-out idea by using machine learning models instead of linear regression to residualize the data. The Python FWL tutorial on this site takes that next step. The &lt;code>fwlplot&lt;/code> package does not do DML, but the visual intuition &amp;mdash; &amp;ldquo;look at the residualized scatter to see the conditional relationship&amp;rdquo; &amp;mdash; carries over directly.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic regression, Poisson regression, or other nonlinear models, the partialling-out logic does not hold exactly. The residualized scatter plot for a nonlinear model is at best an approximation of the conditional relationship, not an exact representation.&lt;/p>
&lt;h2 id="10-summary-and-next-steps">10. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> in our simulated data, the naive coupon coefficient was -0.093 (coupons &amp;ldquo;hurt&amp;rdquo; sales), while the true causal effect is +0.2. After controlling for income via &lt;code>fwl_plot()&lt;/code>, the estimate was +0.212, recovering the true effect.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the direction of the bias:&lt;/strong> $0.300 \times (-0.494) = -0.148$ correctly predicts the negative direction of the confounding; the full in-sample bias (naive minus controlled, $-0.093 - 0.212 = -0.305$) is recovered exactly when $\hat{\delta}$ comes from the auxiliary regression of income on coupons.&lt;/li>
&lt;li>&lt;strong>FWL is not an approximation &amp;mdash; it is an exact algebraic identity:&lt;/strong> the coefficient from partialling out controls matches &lt;code>feols()&lt;/code> to six decimal places. Every multiple regression coefficient &lt;em>can&lt;/em> be visualized as a bivariate scatter plot.&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> the flights data showed how adding origin and destination FE progressively transformed the scatter. The air-time coefficient changed from -0.003 (no FE) to -0.007 (origin + destination FE).&lt;/li>
&lt;li>&lt;strong>Panel FE reveal within-person effects:&lt;/strong> the wage data showed that controlling for individual ability via FE steepened the bivariate experience slope from 0.03 (pooled, no controls) to 0.122 (within-person), more than tripling the estimated return to experience.&lt;/li>
&lt;/ul>
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends the partialling-out logic to Double Machine Learning, and the &lt;a href="https://carlos-mendez.org/post/r_did/">R DID tutorial&lt;/a> that uses &lt;code>fixest&lt;/code> for difference-in-differences with staggered treatment adoption.&lt;/p>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Omitted variable direction.&lt;/strong> Use the OVB formula from Section 5.3 to predict what happens if you also omit &lt;code>dayofweek&lt;/code> (in addition to income). Run the naive regression &lt;code>lm(sales ~ coupons)&lt;/code> and compare the bias to $\hat{\gamma}_{income} \times \hat{\delta}_{income} + \hat{\gamma}_{day} \times \hat{\delta}_{day}$. Does the extended OVB formula still predict the direction correctly?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple controls.&lt;/strong> Use &lt;code>fwl_plot()&lt;/code> to visualize the coupon effect after controlling for both income and &lt;code>dayofweek&lt;/code>. Compare this to controlling for income alone. Does the scatter change visually? Does the coefficient change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Your own data.&lt;/strong> Pick a dataset from the &lt;code>wooldridge&lt;/code> package (e.g., &lt;code>hprice1&lt;/code>, &lt;code>wage2&lt;/code>, &lt;code>crime2&lt;/code>) and use &lt;code>fwl_plot()&lt;/code> to visualize a regression relationship before and after adding controls. Does the coefficient change substantially? Can you identify what the confounder is doing?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-datasets">12. Datasets&lt;/h2>
&lt;p>The datasets used in this tutorial are saved as CSV files in the post directory for reuse in other tutorials:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>File&lt;/th>
&lt;th>Rows&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>store_data.csv&lt;/code>&lt;/td>
&lt;td>200&lt;/td>
&lt;td>Simulated retail data (sales, coupons, income, dayofweek)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>flights_sample.csv&lt;/code>&lt;/td>
&lt;td>5,000&lt;/td>
&lt;td>Cleaned NYC flights sample (delays, air time, origin, dest)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>wagepan.csv&lt;/code>&lt;/td>
&lt;td>4,360&lt;/td>
&lt;td>Wooldridge wage panel (545 individuals, 8 years)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">Butts, K. &amp;amp; McDermott, G. (2024). fwlplot: Scatter Plot After Residualizing. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">Berge, L. (2018). fixest: Fast Fixed-Effects Estimations in R. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors; apply it to real research projects with caution.&lt;/p></description></item><item><title>Visualizing Regression with the FWL Theorem in Stata</title><link>https://carlos-mendez.org/post/stata_fwl/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_fwl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This question appears in every applied regression course, and the answer is surprisingly hard to visualize. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship in multidimensional space. This relationship cannot be directly plotted on a two-dimensional scatter. The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> changes this: it shows that the coefficient from a multiple regression equals the slope of a simple bivariate regression &amp;mdash; after first &lt;em>residualizing&lt;/em> (partialling out) the control variables from both the outcome and the variable of interest.&lt;/p>
&lt;p>The &lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">scatterfit&lt;/a> Stata package (Ahrens, 2024) makes this visual in one command. It takes a dependent variable, an independent variable, and optional controls or fixed effects, then produces a scatter plot of the residualized data with a fitted regression line. Built on &lt;code>reghdfe&lt;/code>, it handles high-dimensional fixed effects efficiently. It also offers features beyond what R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code> or Python&amp;rsquo;s manual FWL can do: &lt;strong>binned scatter plots&lt;/strong> for large datasets, &lt;strong>regression parameters printed directly on the plot&lt;/strong>, and &lt;strong>multiple fit types&lt;/strong> (linear, quadratic, lowess).&lt;/p>
&lt;p>This tutorial is the third in a trilogy &amp;mdash; see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R tutorial&lt;/a> and &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python tutorial&lt;/a> &amp;mdash; and uses the &lt;strong>same datasets&lt;/strong> for cross-language comparability. All data are loaded from GitHub URLs so the analysis is fully reproducible.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Use &lt;code>scatterfit&lt;/code> to visualize bivariate relationships with and without controls&lt;/li>
&lt;li>Demonstrate FWL residualization with &lt;code>controls()&lt;/code> and &lt;code>fcontrols()&lt;/code>&lt;/li>
&lt;li>Verify manually that FWL reproduces &lt;code>reghdfe&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize fixed effects using &lt;code>fcontrols()&lt;/code> on flights data&lt;/li>
&lt;li>Use binned scatter plots to summarize patterns in large datasets&lt;/li>
&lt;li>Show regression parameters directly on plots with &lt;code>regparameters()&lt;/code>&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Load Data&amp;lt;br/&amp;gt;from GitHub&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;Naive vs.&amp;lt;br/&amp;gt;FWL Scatter&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Binned&amp;lt;br/&amp;gt;Scatter&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>scatterfit&lt;/code>, verify manually, then apply the same tool to real flights data and panel wage data.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-packages">3.1 Install packages&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command requires &lt;code>reghdfe&lt;/code> and &lt;code>ftools&lt;/code> for high-dimensional fixed effects estimation. All packages are installed from SSC or GitHub:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install packages if not already installed
capture ssc install reghdfe, replace
capture ssc install ftools, replace
capture ssc install estout, replace
capture net install scatterfit, ///
from(&amp;quot;https://raw.githubusercontent.com/leojahrens/scatterfit/master&amp;quot;) replace
&lt;/code>&lt;/pre>
&lt;h3 id="32-load-the-simulated-store-data">3.2 Load the simulated store data&lt;/h3>
&lt;p>We load the same simulated retail dataset used in the R and Python FWL tutorials. The data are hosted on GitHub for reproducibility:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/store_data.csv&amp;quot;, clear
&lt;/code>&lt;/pre>
&lt;p>The data simulate a scenario where a store manager wants to know whether distributing coupons increases sales. &lt;strong>Income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The arrows in this diagram show causal relationships, and the numbers are the true effect sizes in the data generating process. The true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>, but income opens a &lt;strong>backdoor path&lt;/strong> &amp;mdash; an indirect route from coupons to sales that goes &lt;em>through&lt;/em> income (coupons $\leftarrow$ income $\rightarrow$ sales). Unless we block this path by controlling for income, the naive estimate will be biased downward.&lt;/p>
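&lt;p>A minimal simulation of this DAG makes the bias concrete. The Python sketch below assumes illustrative noise scales (the actual &lt;code>store_data.csv&lt;/code> was generated separately), but reproduces the qualitative pattern: a negative naive slope and a controlled estimate near the true +0.2:&lt;/p>

```python
# Sketch of the DAG above with assumed noise scales (the tutorial's
# store_data.csv was generated separately; this is illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 200
income  = rng.normal(50, 10, n)
coupons = 35 - 0.5 * (income - 50) + rng.normal(0, 3, n)   # income -> coupons
sales   = 10 + 0.2 * coupons + 0.3 * income + rng.normal(0, 2, n)

def ols(dep, *cols):
    X = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(X, dep, rcond=None)[0]

naive = ols(sales, coupons)[1]                  # biased: wrong sign
controlled = ols(sales, coupons, income)[1]     # near the true +0.2
print(naive, controlled)
```

&lt;p>With these scales the naive slope comes out negative while the controlled slope sits close to +0.2, the same qualitative pattern the actual data show below.&lt;/p>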
&lt;pre>&lt;code class="language-stata">summarize sales coupons income dayofweek
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
sales | 200 33.6747 3.811032 24.89 45.23
coupons | 200 34.85685 6.788834 18.72 53.25
income | 200 49.72545 9.745807 20.07 77.02
dayofweek | 200 3.915 1.996926 1 7
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">correlate sales coupons income
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> | sales coupons income
-------------+---------------------------
sales | 1.0000
coupons | -0.1664 1.0000
income | 0.5003 -0.7087 1.0000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true effect is positive (+0.2). Income is strongly negatively correlated with coupons (-0.709) and positively correlated with sales (0.500). A naive regression would wrongly conclude that coupons hurt sales.&lt;/p>
&lt;h2 id="4-scatterfit-in-action-naive-vs-controlled">4. scatterfit in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest &lt;code>scatterfit&lt;/code> call plots the raw relationship. The &lt;code>regparameters()&lt;/code> option prints the regression coefficient, p-value, and R-squared directly on the plot &amp;mdash; a feature unique to this Stata package:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, regparameters(coef pval r2) ///
opts(name(naive, replace) title(&amp;quot;A. Naive: No Controls&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.018$, $R^2 = 0.028$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The near-zero R-squared confirms that the naive model explains almost none of the variation in sales.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-option">4.2 Controlling for income: one option&lt;/h3>
&lt;p>Now add income as a control. In &lt;code>scatterfit&lt;/code>, the &lt;code>controls()&lt;/code> option specifies continuous variables to partial out using the FWL procedure. Behind the scenes, &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> to residualize both sales and coupons on income, then plots the residuals:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) regparameters(coef pval r2) ///
opts(name(controlled, replace) title(&amp;quot;B. FWL: Controlling for Income&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$, $R^2 = 0.32$) &amp;mdash; close to the true value of +0.2. The R-squared jumps from 0.03 to 0.32, showing that controlling for income explains a large share of the variation. Combining both panels:&lt;/p>
&lt;pre>&lt;code class="language-stata">graph combine naive controlled, ///
title(&amp;quot;What Does 'Controlling for Income' Look Like?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig1_naive_vs_controlled.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope with R2 of 0.028; after FWL residualization with controls(income), the slope reverses to positive with R2 of 0.32">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales ($R^2 = 0.028$). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes via &lt;code>controls(income)&lt;/code>. The true positive effect of coupons emerges clearly, and the $R^2$ rises to 0.32.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>We can compare the naive and controlled regressions side by side using Stata&amp;rsquo;s &lt;code>estimates store&lt;/code> and &lt;code>estimates table&lt;/code> workflow. The &lt;code>estimates store&lt;/code> command saves regression results under a name, and &lt;code>estimates table&lt;/code> displays multiple stored results in columns &amp;mdash; similar to R&amp;rsquo;s &lt;code>etable()&lt;/code> or Python&amp;rsquo;s &lt;code>stargazer&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">regress sales coupons
estimates store naive_ols
regress sales coupons income
estimates store full_ols
estimates table naive_ols full_ols, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------
Variable | naive_ols full_ols
-------------+------------------------
coupons | -0.0934 0.2123
| 0.0393 0.0467
income | 0.3004
| 0.0325
_cons | 36.9301 11.3352
| 1.3969 3.0080
-------------+------------------------
r2 | 0.0277 0.3215
N | 200 200
--------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3.&lt;/p>
&lt;h3 id="44-omitted-variable-bias-predicting-the-error">4.4 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$, from the full regression) multiplied by the coefficient from the auxiliary regression of the &lt;em>omitted&lt;/em> variable on the treatment ($\hat{\delta}$).&lt;/p>
&lt;pre>&lt;code class="language-stata">* gamma = effect of income on sales (in full model)
regress sales coupons income
local gamma = _b[income]    // 0.3004
* delta = auxiliary regression of income (omitted) on coupons (included)
regress income coupons
local delta = _b[coupons]   // -1.0175
* OVB = gamma * delta
display &amp;quot;OVB = &amp;quot; %9.4f `gamma' * `delta'
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = -0.3057
&lt;/code>&lt;/pre>
&lt;p>The OVB formula predicts a bias of -0.306: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the auxiliary coefficient of income on coupons ($\hat{\delta} = -1.018$) produces a large negative bias. The predicted naive coefficient (true + bias = 0.2123 + (-0.3057) = -0.0934) matches the actual naive coefficient (-0.0934) exactly: the OVB decomposition is an algebraic identity, not an approximation.&lt;/p>
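&lt;p>One subtlety is worth checking numerically: the decomposition is exact only when $\hat{\delta}$ comes from regressing the &lt;em>omitted&lt;/em> variable on the &lt;em>included&lt;/em> one. A quick cross-check in Python on simulated data with the same signs as our DGP:&lt;/p>

```python
# Numerical check that OVB is an exact in-sample identity when delta is
# the auxiliary coefficient of the omitted variable on the included one.
# Simulated data; the coefficients mirror our DGP's signs only.
import numpy as np

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)                      # plays the role of income
x = -0.5 * z + rng.normal(size=n)           # plays the role of coupons
y = 0.2 * x + 0.3 * z + rng.normal(size=n)  # plays the role of sales

def ols(dep, *cols):
    X = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(X, dep, rcond=None)[0]

beta_short = ols(y, x)[1]              # naive coefficient
beta_long, gamma = ols(y, x, z)[1:3]   # full-model coefficients
delta = ols(z, x)[1]                   # omitted regressed on included
ovb_holds = np.isclose(beta_short, beta_long + gamma * delta)
```

&lt;p>The check passes for any sample, not just in expectation, because the decomposition is pure OLS algebra.&lt;/p>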
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be implemented manually in Stata using &lt;code>regress&lt;/code> and &lt;code>predict&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Step 1: Residualize sales on income
regress sales income
predict resid_sales, residuals
* Step 2: Residualize coupons on income
regress coupons income
predict resid_coupons, residuals
* Step 3: Regress residuals on residuals
regress resid_sales resid_coupons
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">------------------------------------------------------------------------------
resid_sales | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
resid_coup~s | .2122882 .046581 4.56 0.000 .1204297 .3041466
_cons | -2.87e-09 .222537 -0.00 1.000 -.4388468 .4388468
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FWL coefficient on &lt;code>resid_coupons&lt;/code> is &lt;strong>0.212288&lt;/strong> &amp;mdash; exactly the same as the full regression coefficient on &lt;code>coupons&lt;/code> (0.212288). This is not an approximation; it is an algebraic identity. Formally, the FWL theorem says:&lt;/p>
&lt;p>$$\hat{\beta}_1 = \frac{\text{Cov}(\tilde{Y}, \tilde{X}_1)}{\text{Var}(\tilde{X}_1)}$$&lt;/p>
&lt;p>where $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from regressing $Y$ and $X_1$ on the controls $Z$. In our example, $\tilde{Y}$ is &lt;code>resid_sales&lt;/code> (the part of sales that income cannot explain) and $\tilde{X}_1$ is &lt;code>resid_coupons&lt;/code> (the part of coupons that income cannot explain). The ratio of their covariance to the variance of $\tilde{X}_1$ gives the slope we see in the regression above.&lt;/p>
&lt;p>Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group.&lt;/p>
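&lt;p>The same three-step recipe can be cross-checked outside Stata. A short Python sketch (simulated data; the names echo the tutorial&amp;rsquo;s variables) confirms that the residual-on-residual slope equals the multiple regression coefficient:&lt;/p>

```python
# The three-step FWL recipe in numpy (simulated data; names echo the
# tutorial's variables but the values are illustrative).
import numpy as np

rng = np.random.default_rng(3)
n = 200
income  = rng.normal(size=n)
coupons = -0.7 * income + rng.normal(size=n)
sales   = 0.2 * coupons + 0.3 * income + rng.normal(size=n)

def resid(y, x):                       # residuals from y ~ 1 + x
    X = np.column_stack([np.ones(n), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

r_sales   = resid(sales, income)       # step 1
r_coupons = resid(coupons, income)     # step 2
beta_fwl  = (r_coupons @ r_sales) / (r_coupons @ r_coupons)   # step 3: Cov/Var

# full multiple-regression coefficient on coupons
beta_full = np.linalg.lstsq(
    np.column_stack([np.ones(n), coupons, income]), sales, rcond=None)[0][1]
# beta_fwl equals beta_full up to floating point
```
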
&lt;h3 id="52-adding-more-controls">5.2 Adding more controls&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command handles any number of controls automatically:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, ///
regparameters(coef pval r2) opts(name(panel_a, replace) title(&amp;quot;A. No Controls&amp;quot;))
scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(panel_b, replace) title(&amp;quot;B. + Income&amp;quot;))
scatterfit sales coupons, controls(income dayofweek) ///
regparameters(coef pval r2) opts(name(panel_c, replace) title(&amp;quot;C. + Income + Day&amp;quot;))
graph combine panel_a panel_b panel_c, ///
title(&amp;quot;Progressive Controls: How the Scatter Changes&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig2_three_panels.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig2_three_panels.png" alt="Three-panel progression showing coefficient, p-value, and R2: no controls (left, R2 = 0.028), controlling for income (center, R2 = 0.32), controlling for income and day of week (right, R2 = 0.37)">&lt;/p>
&lt;pre>&lt;code class="language-stata">estimates table m1_naive m2_income m3_full, stats(r2 r2_a N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | m1_naive m2_income m3_full
-------------+------------------------------------
coupons | -0.0934 0.2123 0.2219
| 0.0393 0.0467 0.0454
income | 0.3004 0.2961
| 0.0325 0.0316
dayofweek | 0.4029
| 0.1095
_cons | 36.9301 11.3352 9.6398
| 1.3969 3.0080 2.9527
-------------+------------------------------------
r2 | 0.0277 0.3215 0.3654
r2_a | 0.0228 0.3146 0.3556
N | 200 200 200
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared &amp;mdash; now visible directly on each panel &amp;mdash; jumps from 0.028 to 0.32 to 0.37. Each scatterfit panel shows a tighter cloud as more variation is absorbed by the controls.&lt;/p>
&lt;h2 id="6-binned-scatter-plots">6. Binned Scatter Plots&lt;/h2>
&lt;h3 id="61-why-binned-scatters">6.1 Why binned scatters?&lt;/h3>
&lt;p>With large datasets (thousands or millions of observations), scatter plots become useless &amp;mdash; individual points merge into a solid blob. &lt;strong>Binned scatter plots&lt;/strong> solve this by grouping observations into quantile bins along the x-axis and plotting the bin means. The regression line is still estimated on the full data, so the slope is unaffected. This is one of &lt;code>scatterfit&lt;/code>&amp;rsquo;s key advantages over R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code>.&lt;/p>
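&lt;p>The binning logic itself is simple. A Python sketch (using &lt;code>pandas.qcut&lt;/code> as a stand-in for &lt;code>scatterfit&lt;/code>&amp;rsquo;s quantile bins; the data are illustrative) shows the two key pieces: the slope comes from the full data, while the plot keeps only one marker per quantile bin:&lt;/p>

```python
# Quantile-binning sketch: slope from the full data, markers from bin
# means (pandas.qcut stands in for scatterfit's bins; data illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({'x': rng.normal(size=10_000)})
df['y'] = 0.2 * df['x'] + rng.normal(size=10_000)

slope = np.polyfit(df['x'], df['y'], 1)[0]          # fitted on ALL rows

bins = df.groupby(pd.qcut(df['x'], 20), observed=True).mean()
# 'bins' has 20 rows: one (mean x, mean y) marker per quantile bin
```
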
&lt;h3 id="62-unbinned-vs-binned">6.2 Unbinned vs. binned&lt;/h3>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(unbinned, replace) title(&amp;quot;A. Unbinned (all points)&amp;quot;))
scatterfit sales coupons, controls(income) binned ///
regparameters(coef pval r2) opts(name(binned, replace) title(&amp;quot;B. Binned (20 quantiles)&amp;quot;))
graph combine unbinned binned, ///
title(&amp;quot;Binned Scatter: Summarizing Patterns in Large Data&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig3_binned_scatter.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig3_binned_scatter.png" alt="Unbinned scatter (left) vs. binned scatter with 20 quantiles (right), both showing the same FWL-residualized relationship with coefficient, p-value, and R2 annotations">&lt;/p>
&lt;p>Both panels show the same FWL-residualized relationship ($\beta = 0.21$, $R^2 = 0.32$), but the binned version (right) replaces 200 individual points with 20 bin-mean markers. For our small dataset the difference is modest, but for the flights data (5,000+ observations) or production datasets (millions of rows), binning is essential. The &lt;code>nquantiles()&lt;/code> option controls how many bins to use:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Fewer bins = smoother but less detail
scatterfit sales coupons, controls(income) binned nquantiles(10)
* More bins = more detail but noisier
scatterfit sales coupons, controls(income) binned nquantiles(30)
&lt;/code>&lt;/pre>
&lt;h2 id="7-visualizing-fixed-effects">7. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="71-load-the-flights-data">7.1 Load the flights data&lt;/h3>
&lt;p>We load the NYC flights sample &amp;mdash; 5,000 flights from New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/flights_sample.csv&amp;quot;, clear
summarize dep_delay air_time
tabulate origin
* Encode string variables for fixed effects (needed by scatterfit/reghdfe)
encode origin, gen(origin_fe)
encode dest, gen(dest_fe)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
dep_delay | 5,000 7.3172 22.83736 -20 119
air_time | 5,000 150.3636 93.47726 22 650
&lt;/code>&lt;/pre>
&lt;h3 id="72-progressive-fixed-effects">7.2 Progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>fcontrols()&lt;/code> option specifies categorical variables to absorb as fixed effects. This is analogous to &lt;code>feols(...| FE)&lt;/code> in R&amp;rsquo;s fixest:&lt;/p>
&lt;pre>&lt;code class="language-stata">* No fixed effects
scatterfit dep_delay air_time, regparameters(coef pval r2) ///
opts(name(fe_none, replace) title(&amp;quot;A. No Fixed Effects&amp;quot;))
* Origin airport FE
scatterfit dep_delay air_time, fcontrols(origin_fe) ///
regparameters(coef pval r2) opts(name(fe_origin, replace) title(&amp;quot;B. Origin FE&amp;quot;))
* Origin + destination FE
scatterfit dep_delay air_time, fcontrols(origin_fe dest_fe) ///
regparameters(coef pval r2) opts(name(fe_both, replace) title(&amp;quot;C. Origin + Dest FE&amp;quot;))
graph combine fe_none fe_origin fe_both, ///
title(&amp;quot;What Do Fixed Effects 'Do' to the Data?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig4_fixed_effects.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig4_fixed_effects.png" alt="Progressive FWL plots with coefficient, p-value, and R2: no FE (left, R2 near 0), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>Panel A shows the raw cloud with a nearly flat slope ($R^2 \approx 0$). Panel B removes the three origin-airport means, tightening the horizontal spread. Panel C removes the destination means as well, collapsing the variation to &lt;em>within-route&lt;/em> deviations and increasing $R^2$ substantially. The &lt;code>fcontrols()&lt;/code> option handles all the demeaning internally using &lt;code>reghdfe&lt;/code>.&lt;/p>
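&lt;p>As a cross-language sketch of what &lt;code>fcontrols()&lt;/code> does under the hood, the NumPy snippet below (simulated toy data, not the flights sample) checks that regressing on explicit group dummies and regressing on group-demeaned variables return the same slope &amp;mdash; the FWL identity behind fixed effects:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, G = 300, 3
group = rng.integers(0, G, n)            # e.g., an origin-airport id
x = rng.normal(size=n) + group           # regressor correlated with the groups
y = 2.0 * group + 0.5 * x + rng.normal(scale=0.3, size=n)

# Full regression: y on x plus one dummy per group (the dummies span the intercept)
D = np.eye(G)[group]
beta_dummies = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)[0][-1]

# FWL shortcut: subtract each group's mean from y and x, then take the OLS slope
def demean(v, ids):
    out = v.astype(float).copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

xt, yt = demean(x, group), demean(y, group)
beta_fwl = (xt @ yt) / (xt @ xt)
print(beta_dummies, beta_fwl)  # equal up to floating-point error
```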
&lt;h3 id="73-regression-table">7.3 Regression table&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress dep_delay air_time
estimates store fe0
reghdfe dep_delay air_time, absorb(origin_fe) vce(robust)
estimates store fe1
reghdfe dep_delay air_time, absorb(origin_fe dest_fe) vce(robust)
estimates store fe2
estimates table fe0 fe1 fe2, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | fe0 fe1 fe2
-------------+------------------------------------
air_time | -0.0050 -0.0079 -0.0324
| 0.0035 0.0034 0.0265
_cons | 8.0669 8.5072 12.1416
| 0.6117 0.6449 4.0186
-------------+------------------------------------
r2 | 0.0004 0.0055 0.0310
N | 5000 5000 4994
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.005 (no FE), -0.008 (origin FE), -0.032 (origin + destination FE). Note that these are estimated on the 5,000-observation sample, so the coefficients differ somewhat from the full-data estimates in the R tutorial. The key pattern is the same: adding fixed effects absorbs between-group variation and changes both the magnitude and precision of the coefficient. With origin + destination FE, 6 singleton observations are dropped (N = 4,994) &amp;mdash; singletons are routes with only one flight in the sample, where within-group variation cannot be estimated.&lt;/p>
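&lt;p>The singleton behavior is easy to see in miniature. In this hypothetical NumPy sketch (toy numbers, not the flights data), a route with a single flight demeans to exactly zero, so it contributes no within-route variation:&lt;/p>

```python
import numpy as np

route = np.array([0, 0, 0, 1, 1, 2])         # route 2 has only one flight
delay = np.array([5., 9., 7., 3., 1., 12.])

def within(v, ids):
    out = v.copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

d = within(delay, route)
print(d[route == 2])  # [0.] -- a singleton demeans to zero, so reghdfe drops it
```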
&lt;h2 id="8-panel-data-returns-to-experience">8. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="81-load-the-wage-panel">8.1 Load the wage panel&lt;/h3>
&lt;p>The wage panel contains 545 individuals observed over 8 years (1980&amp;ndash;1987). The classic question: what is the return to experience? The challenge is &lt;strong>unobserved ability&lt;/strong> &amp;mdash; two people with the same experience may earn very different wages because one is more talented, motivated, or well-connected. These unmeasured personal traits are the &amp;ldquo;unobserved ability&amp;rdquo; that individual fixed effects absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/wagepan.csv&amp;quot;, clear
xtset nr year
summarize lwage exper expersq educ
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lwage | 4,360 1.649147 .5326094 -3.579079 4.05186
exper | 4,360 6.514679 2.825873 0 18
expersq | 4,360 50.42477 40.78199 0 324
educ | 4,360 11.76697 1.746181 3 16
&lt;/code>&lt;/pre>
&lt;h3 id="82-pooled-ols-vs-individual-fixed-effects">8.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress lwage educ exper expersq
estimates store pool
reghdfe lwage exper expersq, absorb(nr)
estimates store fe_ind
reghdfe lwage exper expersq, absorb(nr year)
estimates store fe_twfe
estimates table pool fe_ind fe_twfe, stats(r2 N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | pool fe_ind fe_twfe
-------------+------------------------------------
educ | 0.1021
| 0.0047
exper | 0.1050 0.1223 (omitted)
| 0.0102 0.0082
expersq | -0.0036 -0.0045 -0.0054
| 0.0007 0.0006 0.0007
_cons | -0.0564 1.0807 1.9223
| 0.0639 0.0263 0.0359
-------------+------------------------------------
r2 | 0.1477 0.6173 0.6185
N | 4360 4360 4360
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. The &lt;code>educ&lt;/code> coefficient disappears from the individual FE column &amp;mdash; education is time-invariant (it does not change over the 8 years for any individual), so it is perfectly collinear with person dummies. Stata marks &lt;code>exper&lt;/code> as &lt;code>(omitted)&lt;/code> in the two-way FE column &amp;mdash; because experience increments by one year for everyone, it is perfectly collinear with year dummies. Only &lt;code>expersq&lt;/code> (which varies non-linearly) survives both sets of fixed effects. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation.&lt;/p>
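&lt;p>Both collinearity results can be verified by hand. The toy NumPy sketch below (hypothetical numbers, not the wage panel) shows that within-person demeaning wipes out a time-invariant variable entirely, and turns linearly incrementing experience into exactly the demeaned year trend:&lt;/p>

```python
import numpy as np

# Toy panel: two people observed for three years each
person = np.array([1, 1, 1, 2, 2, 2])
year   = np.array([1980, 1981, 1982, 1980, 1981, 1982])
educ   = np.array([12, 12, 12, 16, 16, 16])  # constant within person
exper  = np.array([0, 1, 2, 5, 6, 7])        # increments by one each year

def within(v, ids):
    out = v.astype(float).copy()
    for i in np.unique(ids):
        out[ids == i] -= out[ids == i].mean()
    return out

print(within(educ, person))   # all zeros: no within-person variation to estimate
print(within(exper, person))  # [-1, 0, 1, -1, 0, 1]
print(within(year, person))   # [-1, 0, 1, -1, 0, 1] -- identical, hence collinear
```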
&lt;h3 id="83-scatterfit-with-individual-fe">8.3 scatterfit with individual FE&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sample 150 individuals for visual clarity
preserve
set seed 456
bysort nr: gen first = (_n == 1)
gen rand = runiform() if first
bysort nr (rand): replace rand = rand[1]
sort rand nr year
egen rank = group(rand) if first
bysort nr (rank): replace rank = rank[1]
keep if rank &amp;lt;= 150
scatterfit lwage exper, regparameters(coef pval r2) ///
opts(name(wage_raw, replace) title(&amp;quot;A. Raw: Pooled Cross-Section&amp;quot;))
scatterfit lwage exper, fcontrols(nr) regparameters(coef pval r2) ///
opts(name(wage_fe, replace) title(&amp;quot;B. FWL: Individual Fixed Effects&amp;quot;))
graph combine wage_raw wage_fe, ///
title(&amp;quot;Controlling for Unobserved Ability&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig5_panel_data.png&amp;quot;, replace
restore
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig5_panel_data.png" alt="Raw pooled cross-section (left, R2 = 0.043) vs. individual fixed-effects residualized scatter (right, R2 = 0.59) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A shows a wide fan with a shallow slope ($R^2 = 0.043$) &amp;mdash; individuals at the same experience level have wildly different wages, reflecting unobserved ability. Panel B applies &lt;code>fcontrols(nr)&lt;/code> to strip away each person&amp;rsquo;s average wage and experience, leaving only &lt;em>within-person&lt;/em> deviations. The $R^2$ jumps from 0.04 to 0.59, showing that individual fixed effects explain most of the wage variation. The slope steepens sharply: the within-person return to experience is about 0.07 log points per year (roughly 7%), and the relationship is much more precisely identified once we control for who each person is.&lt;/p>
&lt;h2 id="9-advanced-fit-types-and-regression-parameters">9. Advanced: Fit Types and Regression Parameters&lt;/h2>
&lt;h3 id="91-multiple-fit-types">9.1 Multiple fit types&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays the coefficient, standard error, p-value, R-squared, and sample size directly on the plot. The &lt;code>scatterfit&lt;/code> command also supports fit types beyond linear &amp;mdash; quadratic and lowess &amp;mdash; as diagnostics for nonlinearity:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Linear fit with all regression parameters displayed on the plot
scatterfit sales coupons, controls(income) ///
regparameters(coef se pval r2 n)
graph export &amp;quot;stata_fwl_fig6_advanced.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig6_advanced.png" alt="Linear FWL fit with regression parameters (coefficient, SE, p-value, R-squared, N) displayed directly on the plot">&lt;/p>
&lt;pre>&lt;code class="language-stata">* Lowess fit: nonparametric check (note: lowess does not support controls())
scatterfit sales coupons, fit(lowess)
&lt;/code>&lt;/pre>
&lt;p>A quadratic fit (&lt;code>fit(quadratic)&lt;/code>, which does work with &lt;code>controls()&lt;/code>) serves as a diagnostic: if the relationship looks curved in the residualized scatter, your linear specification may be misspecified. Note that &lt;code>fit(lowess)&lt;/code> and &lt;code>fit(lpoly)&lt;/code> do not support &lt;code>controls()&lt;/code> in the current version of &lt;code>scatterfit&lt;/code> &amp;mdash; use them on raw or manually residualized data. For our simulated data (which is linear by construction), the quadratic fit closely follows the linear fit, confirming that the linear specification is appropriate.&lt;/p>

&lt;h3 id="92-regression-parameters-on-the-plot">9.2 Regression parameters on the plot&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays statistical information directly on the scatter plot. Available parameters:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Display&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>coef&lt;/code>&lt;/td>
&lt;td>Slope coefficient&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>se&lt;/code>&lt;/td>
&lt;td>Standard error&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pval&lt;/code>&lt;/td>
&lt;td>P-value&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>r2&lt;/code>&lt;/td>
&lt;td>R-squared&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n&lt;/code>&lt;/td>
&lt;td>Sample size&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-stata">* Show everything
scatterfit sales coupons, controls(income) regparameters(coef se pval r2 n)
&lt;/code>&lt;/pre>
&lt;p>This is especially useful for presentations and papers where you want to communicate both the visual pattern and the statistical evidence in a single figure.&lt;/p>
&lt;h3 id="93-quick-reference-scatterfit-recipes">9.3 Quick reference: scatterfit recipes&lt;/h3>
&lt;pre>&lt;code class="language-stata">* 1. Raw scatter (no controls)
scatterfit y x
* 2. Control for continuous variables (FWL)
scatterfit y x, controls(z1 z2)
* 3. Control for fixed effects (categorical)
scatterfit y x, fcontrols(group_fe)
* 4. Both continuous controls and fixed effects
scatterfit y x, controls(z1) fcontrols(group_fe)
* 5. Binned scatter (for large datasets)
scatterfit y x, controls(z1) binned nquantiles(20)
* 6. Show regression parameters on the plot
scatterfit y x, controls(z1) regparameters(coef pval r2)
* 7. Quadratic fit (works with controls)
scatterfit y x, controls(z1) fit(quadratic)
* 8. Lowess fit (does NOT support controls — use on raw data)
scatterfit y x, fit(lowess)
&lt;/code>&lt;/pre>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a pedagogical tool &amp;mdash; it is the computational engine behind Stata&amp;rsquo;s &lt;code>reghdfe&lt;/code> command. When &lt;code>reghdfe&lt;/code> estimates a model with fixed effects, it does not create a matrix with thousands of dummy variables. Instead, it uses an iterative demeaning algorithm (a generalization of FWL) to absorb the fixed effects, then runs OLS on the residuals. This is why &lt;code>reghdfe&lt;/code> can handle millions of observations with tens of thousands of fixed effects.&lt;/p>
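&lt;p>A minimal sketch of that iterative demeaning idea (simulated data; an illustration of the algorithm's logic, not &lt;code>reghdfe&lt;/code>'s actual implementation) alternately demeans over two fixed-effect sets until the vectors stop changing, then regresses residuals on residuals:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.integers(0, 20, n)               # first FE set (e.g., origin)
b = rng.integers(0, 30, n)               # second FE set (e.g., destination)
x = rng.normal(size=n) + 0.3 * a
y = 0.7 * x + a + 0.5 * b + rng.normal(size=n)

def demean_by(v, ids):
    out = v.copy()
    for g in np.unique(ids):
        out[ids == g] -= out[ids == g].mean()
    return out

def absorb(v, tol=1e-12, max_iter=1000):
    """Alternately demean over both FE sets until the vector stops changing."""
    v = v.astype(float)
    for _ in range(max_iter):
        prev = v
        v = demean_by(demean_by(v, a), b)
        if np.max(np.abs(v - prev)) < tol:
            break
    return v

xt, yt = absorb(x), absorb(y)
beta = (xt @ yt) / (xt @ xt)
print(beta)  # should match OLS with all 50 dummies included explicitly
```

&lt;p>With two FE sets, this alternating scheme converges to the same slope as including every dummy explicitly &amp;mdash; without ever building the dummy matrix.&lt;/p>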
&lt;p>The &lt;code>scatterfit&lt;/code> package offers three advantages over the R and Python implementations of FWL visualization. First, &lt;strong>binned scatter plots&lt;/strong> (Section 6) are essential for large datasets where individual points merge into an unreadable blob. Second, &lt;strong>regression parameters on the plot&lt;/strong> (&lt;code>regparameters()&lt;/code>) combine the visual and statistical evidence in a single figure, reducing the back-and-forth between plots and tables. Third, &lt;strong>multiple fit types&lt;/strong> (&lt;code>fit(quadratic)&lt;/code>, &lt;code>fit(lowess)&lt;/code>) serve as built-in diagnostics for linearity.&lt;/p>
&lt;p>Across the three tutorials (Python, R, Stata), the key numbers are the same because we use the same datasets: the naive coupon coefficient is -0.093, the estimate after controlling for income is +0.212 (close to the true effect of +0.2), and the OVB is -0.148. The FWL theorem is the same in every language &amp;mdash; only the syntax changes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task&lt;/th>
&lt;th>Python&lt;/th>
&lt;th>R&lt;/th>
&lt;th>Stata&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Raw scatter&lt;/td>
&lt;td>&lt;code>plt.scatter(x, y)&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Control for Z&lt;/td>
&lt;td>manual &lt;code>resid()&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x + z)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, controls(z)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fixed effects&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x | fe)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, fcontrols(fe)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Binned scatter&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>scatterfit y x, binned&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stats on plot&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>regparameters(coef pval r2)&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Students who learn FWL in one language can immediately apply it in another.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic, Poisson, or other nonlinear models, the partialling-out logic does not hold exactly. Stata&amp;rsquo;s &lt;code>scatterfit&lt;/code> does support &lt;code>fitmodel(logit)&lt;/code> and &lt;code>fitmodel(poisson)&lt;/code>, but these are direct fits, not FWL residualizations.&lt;/p>
&lt;h2 id="11-summary-and-next-steps">11. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> the naive coupon coefficient was -0.093 (wrong sign), while the true causal effect is +0.2. After FWL residualization with &lt;code>controls(income)&lt;/code>, the estimate was +0.212.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the bias exactly:&lt;/strong> $0.300 \times (-0.494) = -0.148$, correctly predicting the negative direction and approximate magnitude of the confounding.&lt;/li>
&lt;li>&lt;strong>FWL is an exact identity:&lt;/strong> the manual three-step procedure in Stata (&lt;code>regress&lt;/code> + &lt;code>predict resid&lt;/code> + &lt;code>regress&lt;/code>) matches the full regression to six decimal places (0.212288).&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> &lt;code>fcontrols()&lt;/code> in &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> internally to demean the data, equivalent to &lt;code>feols(... | FE)&lt;/code> in R.&lt;/li>
&lt;li>&lt;strong>Binned scatter plots and on-plot statistics are Stata&amp;rsquo;s advantage:&lt;/strong> the &lt;code>binned&lt;/code> and &lt;code>regparameters()&lt;/code> options provide capabilities that the R and Python FWL tools lack.&lt;/li>
&lt;/ul>
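&lt;p>The OVB identity in the second bullet is exact OLS algebra, not an approximation. The NumPy sketch below (a fresh simulation with the same structure as the store data but a different seed, so the numbers differ from the tutorial's) checks that the short-regression slope equals the long-regression slope plus $\hat{\gamma} \times \hat{\delta}$:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
income  = rng.normal(50, 10, n)                      # confounder
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)    # treatment
sales   = 10 + 0.2 * coupons + 0.3 * income + rng.normal(0, 3, n)

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(sales, coupons)[1]                  # naive slope, income omitted
_, b_long, g_long = ols(sales, coupons, income)   # controlled slope, income effect
delta = ols(income, coupons)[1]                   # slope of omitted var on treatment
# OVB identity: short = long + gamma * delta, exactly
print(b_short, b_long + g_long * delta)
```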
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> using &lt;code>fwl_plot()&lt;/code> and the &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends FWL to Double Machine Learning.&lt;/p>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>OVB direction.&lt;/strong> In our simulation, predict the direction of the OVB if you also omit &lt;code>dayofweek&lt;/code>. Compute $\hat{\gamma}_{day} \times \hat{\delta}_{day}$ and add it to the income OVB. Does the total bias match the difference between the naive and the fully controlled coefficient?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Binned scatter with different bins.&lt;/strong> Re-run &lt;code>scatterfit sales coupons, controls(income) binned nquantiles(k)&lt;/code> for $k = 5, 10, 20, 50$. How does the visual change? At what point do you lose meaningful information?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>slopefit: heterogeneous effects.&lt;/strong> Use the &lt;code>slopefit&lt;/code> command: &lt;code>slopefit sales coupons income&lt;/code>. This shows how the coupon-sales slope varies across income levels. Do coupons work better in low-income or high-income neighborhoods?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">Ahrens, L. (2024). scatterfit: Scatter Plots with Fit Lines and Regression Results. GitHub.&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">Correia, S. (2016). reghdfe: Linear Models with Many Levels of Fixed Effects. Stata Journal.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>Datasets: simulated store data, NYC flights sample, and Wooldridge wage panel from the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> on this site.&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, this post may still contain errors, so caution is needed before applying its contents to real research projects.&lt;/p></description></item><item><title>The FWL Theorem: Making Multivariate Regressions Intuitive</title><link>https://carlos-mendez.org/post/python_fwl/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_fwl/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Including multiple variables in a regression raises a natural question: what does it actually mean to &amp;ldquo;control for&amp;rdquo; a confounder? The output is a coefficient, but a multivariate regression cannot be plotted on a simple two-dimensional scatter plot. This makes it hard to build intuition about what the regression is doing behind the scenes.&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> answers this question. It shows that any coefficient from a multivariate regression can be recovered from a simple univariate regression &amp;mdash; after removing the influence of all other variables through a procedure called &lt;em>partialling-out&lt;/em> (also known as &lt;em>residualization&lt;/em> or &lt;em>orthogonalization&lt;/em>). Think of it as stripping away the noise from other variables so that only the signal of interest remains.&lt;/p>
&lt;p>This tutorial is inspired by &lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud (2022)&lt;/a>, and applies the FWL theorem to a simulated retail scenario. A chain of stores distributes discount coupons and wants to know whether the coupons increase sales. The catch: neighborhood income affects both coupon usage and sales, creating a confounding relationship that makes the naive analysis misleading. The analysis uses FWL to untangle these effects, verifies the theorem step by step, and visualizes the conditional relationship that multivariate regression captures but hides from view.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the Frisch-Waugh-Lovell theorem and why it matters for causal inference&lt;/li>
&lt;li>Implement the partialling-out procedure using OLS residuals&lt;/li>
&lt;li>Visualize conditional relationships that multivariate regressions capture but cannot directly plot&lt;/li>
&lt;li>Compare naive and conditional estimates to see how omitted variable bias distorts results&lt;/li>
&lt;li>Connect FWL to modern applications such as Double Machine Learning&lt;/li>
&lt;/ul>
&lt;h2 id="the-causal-structure">The causal structure&lt;/h2>
&lt;p>Before looking at data, it helps to understand the causal relationships among the variables. A &lt;strong>Directed Acyclic Graph (DAG)&lt;/strong> &amp;mdash; a diagram where arrows indicate direct causal effects &amp;mdash; makes these assumptions explicit.&lt;/p>
&lt;p>In this retail scenario, three variables interact:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
I[&amp;quot;&amp;lt;b&amp;gt;Income&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(confounder)&amp;quot;] --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ fewer coupons&amp;quot;| C[&amp;quot;&amp;lt;b&amp;gt;Coupons&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
I --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ more spending&amp;quot;| S[&amp;quot;&amp;lt;b&amp;gt;Sales&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
C --&amp;gt;|&amp;quot;True causal&amp;lt;br/&amp;gt;effect: +0.2&amp;quot;| S
style I fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style S fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income acts as a &lt;strong>confounder&lt;/strong> &amp;mdash; a variable that influences both the treatment (coupon usage) and the outcome (sales). Wealthier neighborhoods use fewer coupons but spend more, creating a &lt;em>backdoor path&lt;/em> from coupons to sales through income. Ignoring income allows this backdoor path to generate a spurious negative association between coupons and sales, masking the true positive effect.&lt;/p>
&lt;p>To recover the genuine causal effect, the analysis must &lt;strong>block&lt;/strong> this backdoor path by conditioning on income. The FWL theorem provides an elegant way to do this and to visualize the result.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>The following code loads all necessary libraries. The analysis relies on &lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels&lt;/a> for OLS regression, &lt;a href="https://seaborn.pydata.org/" target="_blank" rel="noopener">seaborn&lt;/a> for regression plots, and &lt;a href="https://matplotlib.org/" target="_blank" rel="noopener">matplotlib&lt;/a> for figure customization. The &lt;code>RANDOM_SEED&lt;/code> ensures that every reader gets identical results.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note on figure styling:&lt;/strong> The figures in this post use a dark theme for visual consistency with the site. The companion &lt;code>script.py&lt;/code> includes the full styling code. To reproduce the dark-themed figures, add the following to your setup:&lt;/p>
&lt;details>&lt;summary>Dark theme settings (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python">DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT, &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False, &amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False, &amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True, &amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6, &amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT, &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;text.color&amp;quot;: WHITE_TEXT, &amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False, &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY, &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;/blockquote>
&lt;h2 id="data-simulation">Data simulation&lt;/h2>
&lt;p>Rather than importing data from an external source, this section builds a transparent data generating process (DGP) so that the &lt;strong>true causal effect&lt;/strong> is known in advance and the methods can be verified against it. Think of it as running a controlled experiment in a computer: set the rules, generate the data, and then check whether the statistical tools find the right answer.&lt;/p>
&lt;p>The DGP encodes the causal structure from the DAG above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>income&lt;/code> is drawn from a normal distribution centered at \$50K&lt;/li>
&lt;li>&lt;code>coupons&lt;/code> depends negatively on income (wealthier customers use fewer coupons) plus random noise&lt;/li>
&lt;li>&lt;code>sales&lt;/code> depends positively on both coupons (+0.2) and income (+0.3), plus a day-of-week effect and random noise&lt;/li>
&lt;/ul>
&lt;p>The true causal effect of coupons on sales is &lt;strong>exactly +0.2&lt;/strong> &amp;mdash; this is the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>, the average impact of coupons on sales across all stores. In concrete terms, every 1 percentage point increase in coupon usage causes a \$200 increase in daily sales (measured in thousands).&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_store_data(n=50, seed=42):
&amp;quot;&amp;quot;&amp;quot;Simulate retail store data with confounding by income.&amp;quot;&amp;quot;&amp;quot;
rng = np.random.default_rng(seed)
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = (10 + 0.2 * coupons + 0.3 * income
+ 0.5 * dayofweek + rng.normal(0, 3, n))
return pd.DataFrame({
&amp;quot;sales&amp;quot;: np.round(sales, 2),
&amp;quot;coupons&amp;quot;: np.round(coupons, 2),
&amp;quot;income&amp;quot;: np.round(income, 2),
&amp;quot;dayofweek&amp;quot;: dayofweek,
})
N = 50
df = simulate_store_data(n=N, seed=RANDOM_SEED)
print(&amp;quot;Dataset shape:&amp;quot;, df.shape)
print(df.head())
print(df.describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (50, 4)
sales coupons income dayofweek
0 37.37 36.93 53.05 6
1 36.88 38.06 39.60 6
2 33.09 32.04 57.50 6
3 35.09 33.43 59.41 5
4 27.01 43.21 30.49 4
sales coupons income dayofweek
count 50.00 50.00 50.00 50.00
mean 33.61 33.84 50.91 3.92
std 3.96 4.89 7.68 1.88
min 25.76 23.26 30.49 1.00
25% 31.30 31.53 45.78 2.00
50% 33.24 33.25 51.74 4.00
75% 36.00 36.89 56.42 5.75
max 44.38 43.79 71.42 7.00
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 50 stores with average daily sales of \$33,610, average coupon usage of 33.84%, and average neighborhood income of \$50,910. Sales range from \$25,760 to \$44,380, reflecting meaningful variation across stores. Coupon usage spans from 23% to 44%, and income ranges from \$30,490 to \$71,420. This variation provides enough signal to estimate the relationships of interest.&lt;/p>
&lt;h2 id="the-naive-relationship">The naive relationship&lt;/h2>
&lt;p>The simplest approach is to regress sales directly on coupon usage, ignoring income entirely. This is what a rushed analyst might do &amp;mdash; just look at whether stores with more coupon usage have higher or lower sales.&lt;/p>
&lt;pre>&lt;code class="language-python">sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False,
scatter_kws={&amp;quot;color&amp;quot;: STEEL_BLUE, &amp;quot;alpha&amp;quot;: 0.7, &amp;quot;edgecolors&amp;quot;: &amp;quot;white&amp;quot;, &amp;quot;s&amp;quot;: 60},
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;})
plt.legend()
plt.xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
plt.ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
plt.title(&amp;quot;Naive relationship: Sales vs. coupon usage&amp;quot;)
plt.savefig(&amp;quot;fwl_naive_regression.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_naive_regression.png" alt="Scatter plot showing a negative relationship between coupon usage and sales, with a downward-sloping regression line.">
&lt;em>Naive regression: the downward slope suggests coupons reduce sales, but this is driven by confounding from income.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-python">naive_model = smf.ols(&amp;quot;sales ~ coupons&amp;quot;, df).fit()
print(naive_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 37.1906 3.960 9.390 0.000 29.228 45.154
coupons -0.1059 0.116 -0.914 0.365 -0.339 0.127
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>The naive regression suggests that coupons have a &lt;strong>negative&lt;/strong> effect on sales: each additional percentage point of coupon usage is associated with \$106 less in daily sales. However, this coefficient is not statistically significant (p = 0.365), and the 95% confidence interval [-0.339, 0.127] spans both negative and positive values. More importantly, the true effect is +0.2, so this estimate is not just imprecise &amp;mdash; it points in the wrong direction. The confounder (income) is pulling the estimate downward because wealthier neighborhoods use fewer coupons but spend more.&lt;/p>
&lt;h2 id="controlling-for-income">Controlling for income&lt;/h2>
&lt;p>To block the backdoor path through income, the next step includes it as a control variable in the regression. This is the standard approach in applied work: add the confounder to the right-hand side of the regression equation.&lt;/p>
&lt;pre>&lt;code class="language-python">full_model = smf.ols(&amp;quot;sales ~ coupons + income&amp;quot;, df).fit()
print(full_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0278 7.181 0.700 0.487 -9.418 19.474
coupons 0.2673 0.120 2.222 0.031 0.025 0.509
income 0.3836 0.076 5.015 0.000 0.230 0.537
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>Controlling for income reverses the picture entirely. The coefficient on coupons is now &lt;strong>+0.2673&lt;/strong> (p = 0.031), indicating that each additional percentage point of coupon usage increases daily sales by about \$267. This is close to the true effect of +0.2, and the 95% confidence interval [0.025, 0.509] no longer includes zero. Income itself has a strong positive effect of +0.3836 (p &amp;lt; 0.001), confirming that wealthier neighborhoods spend more. By conditioning on income, the backdoor path is blocked and the estimate moves much closer to the true causal effect.&lt;/p>
&lt;p>But what is the regression actually &lt;em>doing&lt;/em> when it &amp;ldquo;controls for&amp;rdquo; income? This is where the FWL theorem provides a clear answer.&lt;/p>
&lt;h2 id="the-fwl-theorem">The FWL theorem&lt;/h2>
&lt;p>The Frisch-Waugh-Lovell theorem, first published by Ragnar Frisch and Frederick Waugh in 1933 and later given an elegant proof by Michael Lovell in 1963, provides a precise algebraic decomposition of what multivariate regression does under the hood.&lt;/p>
&lt;p>Consider a linear model with two sets of regressors:&lt;/p>
&lt;p>$$y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i$$&lt;/p>
&lt;p>In words, this equation says that the outcome $y$ (sales) equals the effect $\beta_1$ of the variable of interest $x_1$ (coupons), plus the effect $\beta_2$ of the control variable $x_2$ (income), plus an error term $\varepsilon$. In this analysis, $y$ corresponds to the &lt;code>sales&lt;/code> column, $x_1$ to &lt;code>coupons&lt;/code>, and $x_2$ to &lt;code>income&lt;/code>.&lt;/p>
&lt;p>The FWL theorem states that the &lt;strong>Ordinary Least Squares (OLS)&lt;/strong> estimator &amp;mdash; the standard method for fitting a regression line by minimizing squared prediction errors &amp;mdash; $\hat{\beta}_1$ from this multivariate regression is &lt;strong>identical&lt;/strong> to the estimator obtained from a simpler procedure:&lt;/p>
&lt;p>$$\hat{\beta}_1^{FWL} = \frac{\text{Cov}(\tilde{y}, \, \tilde{x}_1)}{\text{Var}(\tilde{x}_1)}$$&lt;/p>
&lt;p>where $\tilde{x}_1$ is the residual from regressing $x_1$ on $x_2$, and $\tilde{y}$ is the residual from regressing $y$ on $x_2$.&lt;/p>
&lt;p>In words, this says: to estimate the effect of coupons while controlling for income, we can (1) remove income&amp;rsquo;s influence from coupons, (2) remove income&amp;rsquo;s influence from sales, and (3) regress the cleaned sales on the cleaned coupons. The resulting coefficient is &lt;strong>exactly&lt;/strong> the same as the one from the full multivariate regression.&lt;/p>
&lt;p>This procedure is called &lt;strong>partialling-out&lt;/strong> because it removes the variation explained by the control variables, keeping only the residual variation &amp;mdash; the part that is &lt;em>orthogonal&lt;/em> to (independent of) income. The three equivalent estimators are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Full OLS:&lt;/strong> Regress $y$ on $x_1$ and $x_2$ jointly&lt;/li>
&lt;li>&lt;strong>Partial FWL:&lt;/strong> Regress $y$ on $\tilde{x}_1$ (residuals of $x_1$ on $x_2$)&lt;/li>
&lt;li>&lt;strong>Full FWL:&lt;/strong> Regress $\tilde{y}$ on $\tilde{x}_1$ (residuals of both variables on $x_2$)&lt;/li>
&lt;/ol>
&lt;p>All three produce the same $\hat{\beta}_1$. Only the full FWL (option 3), however, also reproduces the standard errors of the full regression, and even then only up to a small degrees-of-freedom adjustment, as the step-by-step verification below shows.&lt;/p>
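&lt;p>The equivalence in the covariance formula can be checked numerically. The sketch below is self-contained and deliberately uses plain NumPy instead of the statsmodels workflow used elsewhere in this post; the simulated variables are a hypothetical stand-in for the &lt;code>df&lt;/code> created earlier, with made-up coefficients.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n = 50
income = rng.normal(60, 15, n)
coupons = 50 - 0.4 * income + rng.normal(0, 5, n)
sales = 5 + 0.2 * coupons + 0.4 * income + rng.normal(0, 4, n)

def residualize(v, w):
    # Residuals of v after OLS on an intercept and w
    Z = np.column_stack([np.ones(len(v)), w])
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

# Estimator 1: full OLS of sales on an intercept, coupons, and income
X = np.column_stack([np.ones(n), coupons, income])
beta_full = np.linalg.lstsq(X, sales, rcond=None)[0][1]

# Estimator 3: the FWL covariance formula applied to the residuals
x_tilde = residualize(coupons, income)
y_tilde = residualize(sales, income)
beta_fwl = np.cov(y_tilde, x_tilde)[0, 1] / np.var(x_tilde, ddof=1)

print(np.isclose(beta_full, beta_fwl))  # True
&lt;/code>&lt;/pre>
&lt;p>Because the equivalence is algebraic rather than statistical, the check passes for any seed and any sample size.&lt;/p>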
&lt;h2 id="verifying-fwl-step-by-step">Verifying FWL step by step&lt;/h2>
&lt;p>Let us verify each step of the theorem using the simulated data.&lt;/p>
&lt;h3 id="step-1-residualize-coupons-only">Step 1: Residualize coupons only&lt;/h3>
&lt;p>First, we regress coupons on income and extract the residuals $\tilde{x}_1$. These residuals represent the variation in coupon usage that &lt;strong>cannot&lt;/strong> be explained by income &amp;mdash; the &amp;ldquo;purified&amp;rdquo; coupon signal. Then we regress sales on these residuals. Because residuals always average to zero by construction (they are &lt;em>mean-zero&lt;/em>), we drop the intercept from this regression.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize coupons with respect to income
df[&amp;quot;coupons_tilde&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().resid
# Regress sales on residualized coupons (no intercept)
fwl_step1 = smf.ols(&amp;quot;sales ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step1.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
                    coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
---------------------------------------------------------------------------------
coupons_tilde     0.2673      1.271      0.210      0.834      -2.288       2.822
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient is &lt;strong>exactly 0.2673&lt;/strong> &amp;mdash; identical to the full regression. However, the standard error has exploded from 0.120 to 1.271, making the estimate appear insignificant (p = 0.834). This happens because income was only partialled out from coupons but not from sales. The remaining variation in sales due to income inflates the residual variance of the regression, producing artificially large standard errors.&lt;/p>
&lt;h3 id="step-2-residualize-both-variables">Step 2: Residualize both variables&lt;/h3>
&lt;p>To fix the standard errors, we also residualize sales with respect to income. Now both variables have had income&amp;rsquo;s influence removed.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize sales with respect to income
df[&amp;quot;sales_tilde&amp;quot;] = smf.ols(&amp;quot;sales ~ income&amp;quot;, df).fit().resid
# Regress residualized sales on residualized coupons (no intercept)
fwl_step2 = smf.ols(&amp;quot;sales_tilde ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
                    coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
---------------------------------------------------------------------------------
coupons_tilde     0.2673      0.118      2.269      0.028       0.031       0.504
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient remains &lt;strong>exactly 0.2673&lt;/strong>, and now the standard error (0.118) and p-value (0.028) are nearly identical to the full regression (SE = 0.120, p = 0.031). The slight difference in standard errors comes from a degrees-of-freedom adjustment &amp;mdash; the full regression uses up an extra degree of freedom to estimate the income coefficient (leaving fewer data points for estimating uncertainty), while this univariate regression does not. The substantive conclusion is the same: coupons have a significant positive effect on sales after partialling out income.&lt;/p>
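&lt;p>This degrees-of-freedom adjustment is itself exact: the full regression and the step-2 FWL regression share the same residual sum of squares, so their standard errors on the coupon term differ only by the factor $\sqrt{(n-1)/(n-3)}$ (three parameters in the full model versus one slope and no intercept in the FWL regression). The following self-contained NumPy sketch verifies this on hypothetical simulated data standing in for the post&amp;rsquo;s &lt;code>df&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n = 50
income = rng.normal(60, 15, n)
coupons = 50 - 0.4 * income + rng.normal(0, 5, n)
sales = 5 + 0.2 * coupons + 0.4 * income + rng.normal(0, 4, n)

def ols(X, y):
    # Coefficients, residuals, and conventional OLS standard errors
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, resid, se

ones = np.ones(n)
# Full regression: the coupons SE uses n - 3 degrees of freedom
_, _, se_full = ols(np.column_stack([ones, coupons, income]), sales)

# FWL: residualize both variables, then regress without an intercept (n - 1 df)
_, x_tilde, _ = ols(np.column_stack([ones, income]), coupons)
_, y_tilde, _ = ols(np.column_stack([ones, income]), sales)
_, _, se_fwl = ols(x_tilde[:, None], y_tilde)

# Rescaling by the df ratio recovers the full-regression SE exactly
print(np.isclose(se_fwl[0] * np.sqrt((n - 1) / (n - 3)), se_full[1]))  # True
&lt;/code>&lt;/pre>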
&lt;h2 id="visualizing-partialling-out">Visualizing partialling-out&lt;/h2>
&lt;p>What does partialling-out actually look like? Regressing coupons on income produces fitted values that form a line through the data. The &lt;strong>residuals&lt;/strong> &amp;mdash; the vertical distances between each point and this line &amp;mdash; represent the coupon variation that income cannot explain.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_hat&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().predict()
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;income&amp;quot;], df[&amp;quot;coupons&amp;quot;], color=STEEL_BLUE, alpha=0.7,
           edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores&amp;quot;)
sns.regplot(x=&amp;quot;income&amp;quot;, y=&amp;quot;coupons&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.vlines(df[&amp;quot;income&amp;quot;],
          np.minimum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
          np.maximum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
          linestyle=&amp;quot;--&amp;quot;, color=NEAR_BLACK, alpha=0.5, linewidth=1,
          label=&amp;quot;Residuals&amp;quot;)
ax.set_xlabel(&amp;quot;Neighborhood income (thousands $)&amp;quot;)
ax.set_ylabel(&amp;quot;Coupon usage (%)&amp;quot;)
ax.set_title(&amp;quot;Partialling-out: removing income's effect on coupons&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_residuals_income.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_residuals_income.png" alt="Scatter plot of coupon usage versus income with a downward-sloping fitted line and vertical dashed lines showing residuals for each store.">
&lt;em>Partialling-out: the dashed lines are the residuals &amp;mdash; the coupon variation that income cannot explain.&lt;/em>&lt;/p>
&lt;p>The downward-sloping fitted line confirms that higher-income neighborhoods use fewer coupons. The vertical dashed lines are the residuals &amp;mdash; the part of coupon usage that income does not predict. Some stores use more coupons than their neighborhood income would suggest (positive residuals), and others use fewer (negative residuals). Partialling out income keeps only these residuals, effectively asking: &amp;ldquo;Among stores with similar income levels, which ones have unusually high or low coupon usage?&amp;rdquo;&lt;/p>
&lt;h2 id="the-conditional-relationship-revealed">The conditional relationship revealed&lt;/h2>
&lt;p>It is now possible to plot the relationship that the multivariate regression captures but cannot directly display: residualized sales against residualized coupons. Both variables have had income&amp;rsquo;s influence removed, so any remaining relationship is the &lt;strong>conditional&lt;/strong> effect of coupons on sales &amp;mdash; the effect after accounting for income differences.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde&amp;quot;], df[&amp;quot;sales_tilde&amp;quot;], color=STEEL_BLUE,
           alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores (residualized)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde&amp;quot;, y=&amp;quot;sales_tilde&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Residual coupon usage&amp;quot;)
ax.set_ylabel(&amp;quot;Residual sales&amp;quot;)
ax.set_title(&amp;quot;Conditional relationship after partialling-out income&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_partialled_out.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_partialled_out.png" alt="Scatter plot showing a positive relationship between residualized coupon usage and residualized sales, with an upward-sloping regression line.">
&lt;em>After removing income&amp;rsquo;s influence from both variables, the true positive effect of coupons on sales emerges.&lt;/em>&lt;/p>
&lt;p>The positive slope is now clearly visible. Stripping away the confounding influence of income reveals that stores where coupon usage is higher than expected (given their neighborhood income) tend to also have sales that are higher than expected. The slope of this line is exactly 0.2673 &amp;mdash; the same coefficient produced by the full multivariate regression.&lt;/p>
&lt;h2 id="scaling-for-interpretability">Scaling for interpretability&lt;/h2>
&lt;p>One drawback of the partialled-out plot is that both axes show residuals centered around zero, which makes the magnitudes hard to interpret. A negative coupon value of -5 does not mean the store has -5% coupon usage &amp;mdash; it means coupon usage is 5 percentage points below what income alone would predict.&lt;/p>
&lt;p>Adding the sample mean back to each residualized variable fixes this. The shift moves the axes without changing the slope.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_tilde_scaled&amp;quot;] = df[&amp;quot;coupons_tilde&amp;quot;] + df[&amp;quot;coupons&amp;quot;].mean()
df[&amp;quot;sales_tilde_scaled&amp;quot;] = df[&amp;quot;sales_tilde&amp;quot;] + df[&amp;quot;sales&amp;quot;].mean()
# Verify the coefficient is unchanged
scaled_model = smf.ols(&amp;quot;sales_tilde_scaled ~ coupons_tilde_scaled&amp;quot;, df).fit()
print(scaled_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>========================================================================================
                           coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               24.5585      4.053      6.059      0.000      16.409      32.708
coupons_tilde_scaled     0.2673      0.119      2.246      0.029       0.028       0.507
========================================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
           color=STEEL_BLUE, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60,
           label=&amp;quot;Stores (residualized + scaled)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
            ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Coupon usage (%, residualized + mean)&amp;quot;)
ax.set_ylabel(&amp;quot;Daily sales (thousands $, residualized + mean)&amp;quot;)
ax.set_title(&amp;quot;Scaled residuals: interpretable magnitudes&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_scaled_residuals.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_scaled_residuals.png" alt="Scatter plot of scaled residualized sales versus scaled residualized coupon usage, with axes now showing values in the original units centered around their means.">
&lt;em>Adding the sample means back to the residuals restores interpretable units without changing the slope.&lt;/em>&lt;/p>
&lt;p>The coefficient remains exactly 0.2673 (p = 0.029), confirming that adding the means back does not alter the estimated relationship. Now the axes are in interpretable units: coupon usage around 34% and daily sales around \$33,600. This scaled version is ideal for presentations and reports where the audience needs to understand both the direction and the magnitude of the conditional relationship at a glance.&lt;/p>
&lt;h2 id="extending-to-multiple-controls">Extending to multiple controls&lt;/h2>
&lt;p>The FWL theorem works with &lt;strong>any number&lt;/strong> of control variables, not just one. To demonstrate, the next step adds &lt;code>dayofweek&lt;/code> as a second control alongside income. The theorem says both controls can be partialled out simultaneously and the same coefficient on coupons will emerge.&lt;/p>
&lt;pre>&lt;code class="language-python"># Full regression with both controls
full_model_2 = smf.ols(&amp;quot;sales ~ coupons + income + dayofweek&amp;quot;, df).fit()
print(full_model_2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
                 coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.9825      7.172      0.555      0.581     -10.454      18.419
coupons        0.2706      0.119      2.266      0.028       0.030       0.511
income         0.3774      0.076      4.961      0.000       0.224       0.531
dayofweek      0.3195      0.245      1.306      0.198      -0.173       0.812
==============================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># FWL: partial out both income and dayofweek
df[&amp;quot;coupons_tilde_2&amp;quot;] = smf.ols(&amp;quot;coupons ~ income + dayofweek&amp;quot;, df).fit().resid
df[&amp;quot;sales_tilde_2&amp;quot;] = smf.ols(&amp;quot;sales ~ income + dayofweek&amp;quot;, df).fit().resid
fwl_multi = smf.ols(&amp;quot;sales_tilde_2 ~ coupons_tilde_2 - 1&amp;quot;, df).fit()
print(fwl_multi.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>===================================================================================
                      coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
coupons_tilde_2     0.2706      0.116      2.338      0.023       0.038       0.503
===================================================================================
&lt;/code>&lt;/pre>
&lt;p>With both controls, the full regression gives a coupon coefficient of 0.2706 (p = 0.028). The FWL procedure &amp;mdash; partialling out income and day of week from both sales and coupons &amp;mdash; yields the &lt;strong>identical&lt;/strong> coefficient of 0.2706 (p = 0.023). The day-of-week effect itself (0.3195, p = 0.198) is not statistically significant in this sample, but including it shifts the coupon point estimate slightly, from 0.2673 to 0.2706, and absorbs a little additional residual variance, nudging the standard error from 0.120 down to 0.119. This confirms that FWL extends to any number of controls.&lt;/p>
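&lt;p>As a reusable sketch, the partial-out-both-then-regress recipe can be wrapped in a small helper built on the same statsmodels formula API used throughout this post. The function &lt;code>fwl_coefficient&lt;/code> and the toy data below are hypothetical illustrations, not part of the original analysis.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fwl_coefficient(df, y, x, controls):
    # Coefficient on x from regressing y on x plus controls, via FWL
    rhs = &amp;quot; + &amp;quot;.join(controls)
    data = df.assign(
        x_tilde=smf.ols(f&amp;quot;{x} ~ {rhs}&amp;quot;, df).fit().resid,
        y_tilde=smf.ols(f&amp;quot;{y} ~ {rhs}&amp;quot;, df).fit().resid,
    )
    return smf.ols(&amp;quot;y_tilde ~ x_tilde - 1&amp;quot;, data).fit().params[&amp;quot;x_tilde&amp;quot;]

# Quick check on toy data with two confounders
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({&amp;quot;w1&amp;quot;: rng.normal(size=n), &amp;quot;w2&amp;quot;: rng.normal(size=n)})
toy[&amp;quot;d&amp;quot;] = 0.5 * toy[&amp;quot;w1&amp;quot;] - 0.3 * toy[&amp;quot;w2&amp;quot;] + rng.normal(size=n)
toy[&amp;quot;y&amp;quot;] = 0.2 * toy[&amp;quot;d&amp;quot;] + toy[&amp;quot;w1&amp;quot;] + toy[&amp;quot;w2&amp;quot;] + rng.normal(size=n)

full = smf.ols(&amp;quot;y ~ d + w1 + w2&amp;quot;, toy).fit().params[&amp;quot;d&amp;quot;]
print(np.isclose(full, fwl_coefficient(toy, &amp;quot;y&amp;quot;, &amp;quot;d&amp;quot;, [&amp;quot;w1&amp;quot;, &amp;quot;w2&amp;quot;])))  # True
&lt;/code>&lt;/pre>
&lt;p>Calling &lt;code>fwl_coefficient(df, &amp;quot;sales&amp;quot;, &amp;quot;coupons&amp;quot;, [&amp;quot;income&amp;quot;, &amp;quot;dayofweek&amp;quot;])&lt;/code> on the simulated store data should reproduce the 0.2706 estimate above.&lt;/p>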
&lt;h2 id="naive-vs-conditional-the-full-picture">Naive vs. conditional: the full picture&lt;/h2>
&lt;p>To appreciate how much the FWL procedure changes the conclusions, the next figure places the naive and conditional relationships side by side. The left panel shows the raw data; the right panel shows the same data after partialling out income.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: naive relationship
axes[0].scatter(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;sales&amp;quot;], color=STEEL_BLUE, alpha=0.7,
                edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[0])
axes[0].set_xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
axes[0].set_ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
axes[0].set_title(&amp;quot;Naive (no controls)&amp;quot;)
# Right: after partialling-out income
axes[1].scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
                color=TEAL, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
            ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[1])
axes[1].set_xlabel(&amp;quot;Coupon usage (%, after partialling-out)&amp;quot;)
axes[1].set_ylabel(&amp;quot;Daily sales (thousands $, after partialling-out)&amp;quot;)
axes[1].set_title(&amp;quot;After partialling-out income (FWL)&amp;quot;)
plt.suptitle(&amp;quot;The FWL theorem reveals the true relationship&amp;quot;,
             fontsize=14, fontweight=&amp;quot;bold&amp;quot;, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;fwl_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_comparison.png" alt="Two-panel figure comparing the naive negative relationship between sales and coupons on the left with the positive conditional relationship after partialling-out income on the right.">
&lt;em>Simpson&amp;rsquo;s paradox resolved: the naive negative slope (left) reverses to a positive slope (right) after partialling out income.&lt;/em>&lt;/p>
&lt;p>The contrast is striking. On the left, the naive analysis suggests a negative relationship (slope = -0.106) &amp;mdash; coupons appear to hurt sales. On the right, after removing income&amp;rsquo;s confounding influence, the true positive relationship emerges (slope = +0.267). This is a textbook example of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>: a trend that appears in aggregate data reverses when the data is properly conditioned on a relevant variable.&lt;/p>
&lt;h2 id="summary-of-results">Summary of results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Coupons coefficient&lt;/th>
&lt;th>Std. error&lt;/th>
&lt;th>p-value&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive OLS (no controls)&lt;/td>
&lt;td>-0.1059&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.365&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.120&lt;/td>
&lt;td>0.031&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 1 (residualize X only)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>1.271&lt;/td>
&lt;td>0.834&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 2 (residualize both)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.118&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.119&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>All FWL variants produce the same coefficient as the corresponding full regression, confirming the theorem. The coefficient of +0.267 is close to the true DGP value of +0.200, with the difference attributable to finite-sample noise in 50 observations.&lt;/p>
&lt;h2 id="applications-of-the-fwl-theorem">Applications of the FWL theorem&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it has practical applications across several domains.&lt;/p>
&lt;h3 id="data-visualization">Data visualization&lt;/h3>
&lt;p>As shown above, FWL makes it possible to plot the conditional relationship between two variables after controlling for confounders. This is invaluable when presenting regression results to non-technical audiences who understand scatter plots but not regression tables with multiple coefficients.&lt;/p>
&lt;h3 id="computational-efficiency">Computational efficiency&lt;/h3>
&lt;p>When a regression includes &lt;strong>high-dimensional fixed effects&lt;/strong> &amp;mdash; for example, year, industry, and country dummies that could add hundreds of columns &amp;mdash; computing the full regression becomes expensive. The FWL theorem allows software to partial out these fixed effects first, reducing the problem to a much smaller regression. Widely used packages that exploit this strategy include:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">reghdfe&lt;/a> in Stata&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/web/packages/fixest/index.html" target="_blank" rel="noopener">fixest&lt;/a> in R&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest&lt;/a> in Python &amp;mdash; a fast, user-friendly package for fixed-effects regression (including multi-way clustering and interaction effects), inspired by fixest&amp;rsquo;s R API&lt;/li>
&lt;li>&lt;a href="https://pyhdfe.readthedocs.io/en/stable/index.html" target="_blank" rel="noopener">pyhdfe&lt;/a> in Python&lt;/li>
&lt;/ul>
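&lt;p>The trick these packages exploit can be seen in miniature with one-way fixed effects: by FWL, regressing on a full set of group dummies is equivalent to demeaning each variable within its group, which avoids ever building the dummy matrix. The example below is a hypothetical sketch on simulated data, not the API of any package listed above.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: 200 firms across 10 industries with industry effects
rng = np.random.default_rng(3)
n, g = 200, 10
industry = rng.integers(0, g, n)
fe = rng.normal(0, 2, g)                  # industry-level effects
x = fe[industry] + rng.normal(size=n)     # regressor correlated with the effects
y = 0.7 * x + 3 * fe[industry] + rng.normal(size=n)
df = pd.DataFrame({&amp;quot;y&amp;quot;: y, &amp;quot;x&amp;quot;: x, &amp;quot;industry&amp;quot;: industry})

# Dummy-variable regression: one coefficient per industry
beta_dummies = smf.ols(&amp;quot;y ~ x + C(industry)&amp;quot;, df).fit().params[&amp;quot;x&amp;quot;]

# FWL shortcut: residualizing on the dummies is within-group demeaning
df[&amp;quot;x_dm&amp;quot;] = df[&amp;quot;x&amp;quot;] - df.groupby(&amp;quot;industry&amp;quot;)[&amp;quot;x&amp;quot;].transform(&amp;quot;mean&amp;quot;)
df[&amp;quot;y_dm&amp;quot;] = df[&amp;quot;y&amp;quot;] - df.groupby(&amp;quot;industry&amp;quot;)[&amp;quot;y&amp;quot;].transform(&amp;quot;mean&amp;quot;)
beta_within = smf.ols(&amp;quot;y_dm ~ x_dm - 1&amp;quot;, df).fit().params[&amp;quot;x_dm&amp;quot;]

print(np.isclose(beta_dummies, beta_within))  # True
&lt;/code>&lt;/pre>
&lt;p>With ten groups the dummy matrix is harmless; with hundreds of thousands of firm-by-year cells it is not, and the demeaning shortcut is what keeps the problem tractable.&lt;/p>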
&lt;h3 id="machine-learning-and-causal-inference">Machine learning and causal inference&lt;/h3>
&lt;p>Perhaps the most impactful modern application is &lt;strong>Double Machine Learning (DML)&lt;/strong>, developed by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). DML extends the FWL logic by replacing the OLS regressions in the partialling-out step with &lt;strong>flexible machine learning models&lt;/strong> (random forests, lasso, neural networks). This allows the control variables to have complex, nonlinear effects on both the treatment and the outcome &amp;mdash; while still recovering a valid causal estimate of the treatment effect.&lt;/p>
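&lt;p>To make the connection concrete, here is a minimal, self-contained sketch of the DML recipe under strong simplifying assumptions: a made-up nonlinear data-generating process, a crude binned-mean smoother standing in for a real machine learning model, and two-fold cross-fitting. The code illustrates the structure only; it is not taken from any DML package.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 2000
theta = 0.5                                    # true treatment effect
w = rng.uniform(0, 10, n)                      # confounder
d = np.sin(w) + rng.normal(0, 0.5, n)          # treatment depends on w nonlinearly
y = theta * d + w ** 2 / 10 + rng.normal(0, 0.5, n)

def fit_binned_mean(w_train, t_train, n_bins=20):
    # Crude nonparametric regression: mean of t within equal-width bins of w
    edges = np.linspace(w_train.min(), w_train.max(), n_bins + 1)
    idx = np.clip(np.digitize(w_train, edges) - 1, 0, n_bins - 1)
    means = np.array([t_train[idx == b].mean() if np.any(idx == b)
                      else t_train.mean() for b in range(n_bins)])
    return edges, means

def predict_binned_mean(model, w_new):
    edges, means = model
    idx = np.clip(np.digitize(w_new, edges) - 1, 0, len(means) - 1)
    return means[idx]

# Two-fold cross-fitting: residualize each half with a model from the other
d_res, y_res = np.empty(n), np.empty(n)
half_a, half_b = np.array_split(rng.permutation(n), 2)
for fit_idx, pred_idx in [(half_a, half_b), (half_b, half_a)]:
    for t, t_res in [(d, d_res), (y, y_res)]:
        model = fit_binned_mean(w[fit_idx], t[fit_idx])
        t_res[pred_idx] = t[pred_idx] - predict_binned_mean(model, w[pred_idx])

# The final step is pure FWL: regress outcome residuals on treatment residuals
theta_hat = np.cov(y_res, d_res)[0, 1] / np.var(d_res, ddof=1)
print(round(theta_hat, 2))  # close to the true 0.5
&lt;/code>&lt;/pre>
&lt;p>Replacing the binned means with any learner that has fit and predict steps (a random forest, gradient boosting, the lasso) leaves the cross-fitting and the final residual-on-residual regression unchanged; that is exactly the sense in which DML generalizes FWL.&lt;/p>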
&lt;p>If you want to see DML in action, check out the companion tutorial on &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Introduction to Causal Inference: Double Machine Learning&lt;/a>, which applies the partialling-out estimator to a real randomized experiment.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>This tutorial set out to answer a simple question: what does it mean to &amp;ldquo;control for&amp;rdquo; a variable in regression, and how can the result be visualized? The FWL theorem provides a definitive answer. Controlling for income in a regression of sales on coupons is equivalent to removing income&amp;rsquo;s influence from both variables and then regressing the residuals.&lt;/p>
&lt;p>In the simulated retail scenario, failing to control for income produced a misleading negative coefficient of -0.106, suggesting coupons reduce sales. After partialling out income, the coefficient reversed to +0.267 (p = 0.031), revealing that coupons genuinely increase sales by about \$267 per percentage point. This estimate is close to the true data-generating parameter of +0.200, with the gap attributable to sampling variability in just 50 stores.&lt;/p>
&lt;p>For a practitioner &amp;mdash; say, the marketing director of the retail chain &amp;mdash; the takeaway is clear. An analysis that ignored neighborhood income would conclude the coupon program was counterproductive. The FWL-based analysis shows it works, and provides a plot that makes this case visually compelling. The theorem bridges the gap between the numbers in a regression table and the intuitive two-variable scatter plot.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sign reversal.&lt;/strong> The naive coupon coefficient was -0.106 (negative, not significant). After controlling for income, it became +0.267 (positive, p = 0.031). Ignoring confounders can reverse not just the magnitude but the direction of an estimated effect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exact equivalence.&lt;/strong> The FWL procedure produced a coefficient of 0.2673 &amp;mdash; identical to the full multivariate regression down to four decimal places &amp;mdash; whether partialling out one control (income) or two (income + day of week). The theorem is not an approximation; it is an algebraic identity.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Visualization power.&lt;/strong> FWL reduces any multivariate regression to a univariate one, enabling scatter plots that display conditional relationships. This is especially valuable for communicating results to non-technical stakeholders.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Foundation for DML.&lt;/strong> FWL underpins modern causal inference methods like Double Machine Learning, where flexible ML learners replace OLS in the partialling-out step. Understanding FWL is a prerequisite for understanding DML.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Linearity assumption matters.&lt;/strong> The FWL procedure relies on OLS residualization, which assumes linear relationships between the controls and both the treatment and outcome. If income affects coupons or sales nonlinearly, OLS residuals will not fully remove the confounding &amp;mdash; motivating methods like DML that replace OLS with flexible learners.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data is simulated with a known linear DGP. In real data, the DGP is unknown and may be nonlinear, requiring methods like DML rather than plain OLS.&lt;/li>
&lt;li>The FWL theorem assumes a correctly specified linear model. If the relationship between income and coupons (or sales) is nonlinear, OLS residualization will not fully remove the confounding.&lt;/li>
&lt;li>With only 50 observations, the estimates have wide confidence intervals. Larger samples would sharpen the estimates.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Double Machine Learning&lt;/a> to learn how FWL extends to nonlinear settings.&lt;/li>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_dowhy/">Introduction to Causal Inference: The DoWhy Approach&lt;/a> for a full causal inference workflow with real data.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sample size sensitivity.&lt;/strong> Change &lt;code>N&lt;/code> from 50 to 500 in the &lt;code>simulate_store_data()&lt;/code> function. How do the naive and FWL coefficients change? How do the standard errors shrink? Is the naive estimate still misleading with a larger sample?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonlinear confounding.&lt;/strong> Modify the DGP so that income affects coupons nonlinearly: &lt;code>coupons = 60 - 0.01 * income**2 + noise&lt;/code>. Does the FWL procedure (with linear OLS residualization) still recover the true coefficient? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Real data application.&lt;/strong> Pick a dataset with a known confounder (e.g., the wage-education-ability relationship) and apply the FWL procedure. Visualize the naive and conditional relationships side by side.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud, M. (2022). Understanding the Frisch-Waugh-Lovell Theorem. &lt;em>Towards Data Science&lt;/em>.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1907330" target="_blank" rel="noopener">Frisch, R. and Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>Journal of the American Statistical Association&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/ectj/article/21/1/C1/5056401" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/restud/article-abstract/81/2/608/1523757" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest &amp;mdash; Fast Estimation of Fixed-Effects Models in Python&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels &amp;mdash; Statistical Modeling in Python&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors, so caution is warranted before applying it to real research projects.&lt;/p>