TWFE & Manual Demeaning — Interactive Lab

What does two-way fixed effects actually do?

Suppose you fit a panel regression with feols(y ~ x | id + time). What is the package doing under the hood? By the Frisch-Waugh-Lovell (FWL) theorem, on a balanced panel the fit is mathematically equivalent to transforming y and every regressor by (a) subtracting each unit's mean, (b) subtracting each period's mean, and (c) adding back the grand mean to correct for the double subtraction, then (d) running plain OLS on the transformed variables. The coefficients match to machine precision: in this post's panel of 150 countries × 8 periods, the maximum difference is 3.05 × 10⁻¹⁶, on the order of IEEE 754 double-precision machine epsilon (about 2.2 × 10⁻¹⁶).

This app lets you experiment with the transformation. In four tabs you will: sweep the within-variation knob and watch how cross-country differences disappear; reproduce the FWL coefficient equivalence on simulated panels; toggle covariates on the post's forest plot; and see why naive lm() standard errors are systematically too small.

Shrinkage vs. absorption — two different ways to deal with controls

The animation contrasts two strategies for handling many controls. L1 (LASSO) shrinks coefficients toward zero and sets some of them exactly to zero. L2 (Ridge) shrinks every coefficient but never sets one exactly to zero. Fixed effects take a third route: they absorb the unit-level and period-level variation entirely, leaving the coefficient identified only from within-variation. The TWFE coefficient is plain OLS on what is left over, with no shrinkage at all. The slider here is illustrative; Tab 2 has the actual demeaning controls.
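The contrast between the two penalties is easiest to see in the textbook special case of an orthonormal design, where both estimators have a closed form in terms of the OLS coefficient. The sketch below is an illustration in Python (standard library only), not code from the app: the function names, the penalty value and the example coefficients are all hypothetical, and the penalty is parameterised as ½‖y − Xb‖² + λ‖b‖₁ for LASSO and ½‖y − Xb‖² + (λ/2)‖b‖² for Ridge.

```python
def lasso_shrink(b_ols, lam):
    """Soft-thresholding: the LASSO solution when the design is orthonormal."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    return magnitude if b_ols >= 0 else -magnitude


def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: the Ridge solution when the design is orthonormal."""
    return b_ols / (1.0 + lam)


# Hypothetical OLS coefficients and penalty, chosen only for illustration.
ols_coefs = [1.5, -0.8, 0.3, -0.1]
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print("LASSO:", lasso_coefs)  # small coefficients are set exactly to zero
print("Ridge:", ridge_coefs)  # every coefficient shrinks, none reaches zero
```

Fixed effects have no counterpart to `lam` in this sketch: the dummies are partialled out in full rather than penalised.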

Tab 2

Demeaning Lab

Build a simulated panel. Watch unit means and time means stripped away. See within-variation emerge.

Tab 3

FWL Showdown

Two routes to the same coefficient: full LSDV regression vs OLS on demeaned data. Confirm machine-precision agreement across 100 simulations.

Tab 4

Forest Plot

The post's headline result: feols TWFE and manual demeaning OLS overlap exactly. Hover for SEs and confidence intervals.

Glossary (open a card if a term is unfamiliar)

Two-way fixed effects (TWFE)

A panel regression with both unit FE α_i and time FE λ_t. Each unit gets its own intercept; each period gets its own intercept. Absorbs every time-invariant unit characteristic and every unit-invariant time shock.

Within transformation (demeaning)

ỹ_it = y_it − ȳ_i· − ȳ_·t + ȳ_··. Subtract the unit's time-average, subtract the period's cross-section average, add back the grand mean. What remains is purely within-unit, within-period variation.
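As a minimal sketch of this formula, the Python snippet below (standard library only) demeans a small balanced panel stored as a list of rows, one row per unit and one column per period. The function name `two_way_demean` and the toy numbers are hypothetical; the check at the end confirms that every unit mean and every period mean of the transformed data is zero.

```python
def two_way_demean(panel):
    """Two-way within transformation for a balanced panel.

    panel[i][t] is the value for unit i in period t (no missing cells).
    """
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [
        [panel[i][t] - unit_means[i] - period_means[t] + grand_mean
         for t in range(n_periods)]
        for i in range(n_units)
    ]


# Toy balanced panel: 4 units x 3 periods (illustrative numbers only).
y = [
    [1.0, 2.0, 4.0],
    [3.0, 5.0, 6.0],
    [2.0, 2.0, 5.0],
    [7.0, 9.0, 8.0],
]
y_tilde = two_way_demean(y)

# After the transformation, every row and every column sums to zero.
row_sums = [sum(row) for row in y_tilde]
col_sums = [sum(row[t] for row in y_tilde) for t in range(len(y_tilde[0]))]
print(row_sums, col_sums)
```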

Frisch-Waugh-Lovell (FWL) theorem

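In OLS, the coefficient on a regressor in a multivariate regression equals the coefficient from a univariate regression of the residualised outcome on the residualised regressor, where both have been residualised on all the other regressors. Applied to FE: partialling out the unit and time dummies is identical to demeaning (in one pass when the panel is balanced).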
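The theorem can be checked on a generic regression with no fixed effects at all. The sketch below, in Python with the standard library only, fits y on x1, an intercept and x2 via the normal equations, then repeats the estimate by residualising y and x1 on the intercept and x2. The helper names (`solve`, `ols`, `residualise`) and the six data points are hypothetical, and solving the normal equations by Gaussian elimination is a simplification that is adequate only for tiny, well-conditioned examples.

```python
def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def ols(columns, y):
    """OLS coefficients from the normal equations; one list per regressor."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def residualise(v, controls):
    """Residuals from regressing v on the control columns."""
    coefs = ols(controls, v)
    return [v[i] - sum(c * col[i] for c, col in zip(coefs, controls))
            for i in range(len(v))]


# Illustrative data only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]
ones = [1.0] * len(y)

beta_full = ols([x1, ones, x2], y)[0]      # coefficient on x1, full regression
y_res = residualise(y, [ones, x2])         # partial the controls out of y
x1_res = residualise(x1, [ones, x2])       # partial the controls out of x1
beta_fwl = ols([x1_res], y_res)[0]         # univariate regression on residuals

print(beta_full, beta_fwl, abs(beta_full - beta_fwl))
```

In the fixed-effects case the "controls" are the unit and time dummies, and residualising on them is what the demeaning formula computes.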

LSDV (Least-Squares Dummy Variables)

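The "brute force" way to estimate FE: include a dummy for every unit and every period in an OLS regression (dropping one to avoid collinearity with the intercept). Numerically equivalent to the within (demeaning) estimator, but computationally expensive for large panels because every dummy becomes a column of the design matrix.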

Within-R²

The fraction of within-variation explained by the regressors after absorbing fixed effects. Distinct from overall R²: in this post, adj R² = 0.755 but within-R² is only 0.177.

Degrees of freedom (df)

Naive lm() reports df = N·T − K = 1200 − 5 = 1195. Correct df = N·T − N − T + 1 − K = 1038. The 157 missing df are consumed by the 150 country FE and 8 time FE: 150 + 8 − 1 = 157, where the 1 is a normalisation.

Clustered standard errors

SE adjustment that accounts for within-unit correlation in residuals. The default in feols() with FE models. Can inflate or deflate SEs depending on the correlation structure.

Balanced panel

Every unit observed in every period. This post uses 150 × 8 = 1,200 observations with no missing cells, making the closed-form demeaning formula work in one pass.

Demeaning Lab — see what TWFE strips away

Build a simulated panel of n units × T periods. The data-generating process bakes in unit-specific intercepts (some countries are always richer than others) and a common time trend (everyone grows on average). The Signal strength slider below lets you increase the treatment signal. The plot shows the LASSO path as a stand-in for the "many possible models" that demeaning makes unnecessary: with FE, you do not need to choose. You absorb them all.

Sample size n 200

Think of this as units × periods. The post uses 150 × 8 = 1,200.

Number of controls p 40

FE eliminates the need to pick. With p large, model selection becomes a real burden.

Signal strength 0.60

Magnitude of the truly-relevant coefficients relative to noise.

Penalty —

Slide left for less shrinkage; right for more. With TWFE, no penalty needed — just absorb.

controls kept (|I|)

—

out of — candidates

α̂ from raw LASSO

—

shrunk toward zero

α̂ from post-OLS

—

refit on selected support

true α

0.50

held fixed for comparison

What to look for

Cross-unit variation is large. In real panels, most variation lies between units (countries differ in income levels by factors of 100×). Demeaning removes all of that.
Within-variation is what identifies β. After two-way demeaning, only deviations from each unit's average and each period's average remain. In the post, the regressors explain only 17.7% of that within-variation (within-R² = 0.177).
FE never need a tuning parameter. Unlike LASSO's λ, the demeaning transformation is parameter-free. You always subtract the same means and add back the same grand mean.
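The points above can be made concrete with a variance decomposition. The Python sketch below (standard library only) simulates a balanced panel with large unit intercepts and a common trend, then splits the total sum of squares into a between-unit part, a between-period part and the within part that survives two-way demeaning. The panel size, the effect sizes and the seed are hypothetical choices for illustration, not the app's data-generating process.

```python
import random

rng = random.Random(0)
n_units, n_periods = 50, 8

# Assumed DGP: big unit intercepts, a mild common trend, unit-variance noise.
unit_effect = [rng.gauss(0.0, 5.0) for _ in range(n_units)]
panel = [
    [unit_effect[i] + 0.3 * t + rng.gauss(0.0, 1.0) for t in range(n_periods)]
    for i in range(n_units)
]

unit_means = [sum(row) / n_periods for row in panel]
period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
grand_mean = sum(unit_means) / n_units

total_ss = sum((v - grand_mean) ** 2 for row in panel for v in row)
between_unit_ss = n_periods * sum((m - grand_mean) ** 2 for m in unit_means)
between_period_ss = n_units * sum((m - grand_mean) ** 2 for m in period_means)
within_ss = sum(
    (panel[i][t] - unit_means[i] - period_means[t] + grand_mean) ** 2
    for i in range(n_units)
    for t in range(n_periods)
)

within_share = within_ss / total_ss
print(f"between units:   {between_unit_ss / total_ss:.3f}")
print(f"between periods: {between_period_ss / total_ss:.3f}")
print(f"within (left after demeaning): {within_share:.3f}")
```

On a balanced panel the three parts add up to the total exactly, so the within share is whatever the unit and period means do not account for.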

FWL Showdown — feols vs manual demeaning

The Frisch-Waugh-Lovell theorem guarantees that feols(y ~ x | id + time) and lm(y_demeaned ~ x_demeaned) produce identical coefficients on the slopes. Not approximately. Not "close." Identical to machine precision. The two cards below show both estimates side-by-side on the same simulated panel. Run 100 simulations to see the distributions overlap exactly.

Sample size n 200

Capped at 300 so the "Run 100 sims" button finishes quickly.

Number of controls p 40

Capped at 50 for the 100-sim run.

Signal strength 0.50

Common scale for both unit FE strength and treatment signal.

Unit-FE asymmetry 0.80

0 = unit FE explain little · 1 = unit FE dominate. Larger values stress-test the demeaning.

feols TWFE (LSDV)

Estimator: OLS on [d, X, unit dummies, time dummies] — feols absorbs them internally.

α̂—

SE(α̂) — correct df—

N units—

T periods—

df (correct)—

|α̂_LSDV − α̂_within|—

Manual demeaning

Estimator: ỹ_it = y_it − ȳ_i· − ȳ_·t + ȳ_··, then plain lm().

α̂—

SE(α̂) — naive lm() df—

N units—

T periods—

df (naive — wrong)—

|α̂_LSDV − α̂_within|—

Why the coefficients match to machine precision

Same projection. Both LSDV (with dummies) and demeaning project the data onto the same subspace — the orthogonal complement of the span of unit and time indicator vectors.
Coefficients identical, SEs differ. The slopes are guaranteed to match exactly. But naive lm() doesn't know about the absorbed FE, so its SE uses the wrong degrees of freedom (1195 instead of 1038 in the post's panel).
This works in one pass only for balanced panels. With unbalanced panels, the closed-form three-step demeaning no longer reproduces the TWFE estimate, and you need an iterative algorithm such as the one in fixest.

Stability of the FWL equivalence over many simulations

Any single run is one noisy draw of the estimate. Run the whole pipeline 100 times with fresh draws (same parameters, different random errors) to see that the two sampling distributions overlap perfectly, because the two estimates coincide in every draw.
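A stripped-down version of this experiment can be sketched in Python with the standard library only. It is not the app's code: it uses one regressor instead of p controls, a tiny 6 × 4 balanced panel, a hypothetical data-generating process, and an LSDV fit obtained by solving the normal equations with Gaussian elimination, which is fine for toy sizes but not how feols works internally. All function names are assumptions made for the example.

```python
import random


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def ols(columns, y):
    """OLS coefficients from the normal equations; one list per regressor."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def two_way_demean(panel):
    """Subtract unit means and period means, add back the grand mean."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    grand = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand
             for t in range(n_periods)] for i in range(n_units)]


def one_simulation(rng, n_units=6, n_periods=4, true_slope=0.5):
    """Return (LSDV slope, within slope) for one simulated balanced panel."""
    alpha = [rng.gauss(0.0, 2.0) for _ in range(n_units)]    # unit effects
    lam = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]    # period effects
    x = [[alpha[i] + rng.gauss(0.0, 1.0) for _ in range(n_periods)]
         for i in range(n_units)]
    y = [[true_slope * x[i][t] + alpha[i] + lam[t] + rng.gauss(0.0, 1.0)
          for t in range(n_periods)] for i in range(n_units)]
    cells = [(i, t) for i in range(n_units) for t in range(n_periods)]

    # Route 1: LSDV with an intercept, N-1 unit dummies and T-1 period dummies.
    columns = [[x[i][t] for i, t in cells], [1.0] * len(cells)]
    columns += [[1.0 if i == u else 0.0 for i, _ in cells] for u in range(1, n_units)]
    columns += [[1.0 if t == p else 0.0 for _, t in cells] for p in range(1, n_periods)]
    slope_lsdv = ols(columns, [y[i][t] for i, t in cells])[0]

    # Route 2: two-way demeaning followed by a univariate OLS slope.
    xd, yd = two_way_demean(x), two_way_demean(y)
    sxy = sum(xd[i][t] * yd[i][t] for i, t in cells)
    sxx = sum(xd[i][t] ** 2 for i, t in cells)
    return slope_lsdv, sxy / sxx


rng = random.Random(123)
results = [one_simulation(rng) for _ in range(100)]
max_gap = max(abs(a - b) for a, b in results)
print("largest |LSDV - within| over 100 simulations:", max_gap)
```

The individual slopes vary from draw to draw, while the gap between the two routes stays at the level of floating-point round-off in every draw.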

The post's coefficient table — interactively

These numbers come straight from coefficient_comparison.csv and se_comparison.csv in the post's folder — the same data used to produce the headline figures. Toggle outcomes (the five regressors) and methods (feols TWFE vs manual demeaning OLS). The two markers overlap exactly for every variable. The bars at the bottom show standard-error comparisons across naive, IID-corrected, and clustered approaches.

What to look for

The two markers always overlap. Every covariate's feols and manual estimate coincide far beyond the four decimal places shown. The largest absolute difference in the post is 3.05 × 10⁻¹⁶, which is pure floating-point round-off.
Confidence intervals differ slightly. Even though the point estimates match, the CIs use different SEs (clustered for feols, naive for lm). Hover any marker to inspect.
The SE bars below show the three SE variants side-by-side. Naive lm() (steel blue) ignores the absorbed FE and so uses too many df. feols IID (green) uses the correct df. feols cluster (orange) further adjusts for within-unit residual correlation. For log(n+g+d), cluster SE is ~22% larger than naive; for gov. consumption it is ~5% larger.
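The three SE variants in the list above can be sketched for a single regressor in Python (standard library only). This is an illustration under stated assumptions, not a reproduction of the bars: the simulated panel, the seed and the variable names are hypothetical, the naive df follows the N·T − K convention, and the cluster-robust SE is the plain sandwich formula without the finite-sample adjustments that feols applies.

```python
import math
import random


def two_way_demean(panel):
    """Subtract unit means and period means, add back the grand mean."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    grand = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand
             for t in range(n_periods)] for i in range(n_units)]


# Hypothetical balanced panel with one regressor.
rng = random.Random(1)
n_units, n_periods, n_slopes = 30, 6, 1
alpha = [rng.gauss(0.0, 2.0) for _ in range(n_units)]
x = [[alpha[i] + rng.gauss(0.0, 1.0) for _ in range(n_periods)]
     for i in range(n_units)]
y = [[0.5 * x[i][t] + alpha[i] + rng.gauss(0.0, 1.0) for t in range(n_periods)]
     for i in range(n_units)]

xd, yd = two_way_demean(x), two_way_demean(y)
sxx = sum(v * v for row in xd for v in row)
sxy = sum(a * b for rx, ry in zip(xd, yd) for a, b in zip(rx, ry))
slope = sxy / sxx
resid = [[b - slope * a for a, b in zip(rx, ry)] for rx, ry in zip(xd, yd)]
rss = sum(e * e for row in resid for e in row)

n_obs = n_units * n_periods
df_naive = n_obs - n_slopes                               # ignores absorbed FE
df_correct = n_obs - n_units - n_periods + 1 - n_slopes   # accounts for them

se_naive = math.sqrt(rss / df_naive / sxx)
se_iid = math.sqrt(rss / df_correct / sxx)

# Cluster by unit: sum x*e within each unit, then square and add up.
scores = [sum(a * e for a, e in zip(rx, re)) for rx, re in zip(xd, resid)]
se_cluster = math.sqrt(sum(s * s for s in scores)) / sxx  # no small-sample factor

print(f"naive {se_naive:.4f} | iid-corrected {se_iid:.4f} | clustered {se_cluster:.4f}")
```

The IID-corrected SE always exceeds the naive one by the factor √(df_naive / df_correct); the clustered SE can land on either side, depending on how the residuals are correlated within units.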

Outcomes (regressors)

Methods

feols TWFE Manual demeaning

Why are the naive lm() SEs wrong?

The naive lm() on demeaned data counts only the parameters it can see. With 5 slopes it reports 1200 − 5 = 1195 residual degrees of freedom (1194 if an intercept is also fitted, although the intercept is redundant on demeaned data). But the demeaning silently consumed another 150 + 8 − 1 = 157 df: one per country FE plus one per time FE, minus 1 normalisation. The correct df is 1200 − 5 − 157 = 1038. Dividing the residual sum of squares by 1195 instead of 1038 understates the residual variance, so the naive SEs are too small by a factor of √(1195/1038) ≈ 1.07 relative to the IID-corrected ones. feols() (from the fixest package) knows about the absorbed FE and gets the df right automatically.
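The degrees-of-freedom bookkeeping for the post's panel is short enough to write out. The Python sketch below (standard library only; variable names are hypothetical) reproduces the counts quoted in this app for 150 countries, 8 periods and 5 slopes, shows both the with-intercept and without-intercept naive counts, and computes the implied ratio between the IID-corrected and naive standard errors.

```python
import math

n_units, n_periods, n_slopes = 150, 8, 5

n_obs = n_units * n_periods                        # 1200 observations
absorbed = n_units + n_periods - 1                 # 157 df used by the FE
df_naive = n_obs - n_slopes                        # 1195: slopes only
df_naive_with_intercept = n_obs - n_slopes - 1     # 1194: slopes plus intercept
df_correct = n_obs - n_slopes - absorbed           # 1038

# Both SEs share the same residual sum of squares, so only the divisor differs.
se_ratio = math.sqrt(df_naive / df_correct)        # roughly 1.07

print(n_obs, absorbed, df_naive, df_naive_with_intercept, df_correct)
print(f"IID-corrected SE / naive SE = {se_ratio:.3f}")
```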

Connecting back to Tab 3

The FWL equivalence you just saw in simulation is exactly what happens on the real Barro convergence panel of 150 countries × 8 periods:

Log Initial Income (convergence parameter): feols = −0.0553, manual = −0.0553 (difference = −4.2 × 10⁻¹⁷).
Gov. Consumption Share: feols = −0.1028, manual = −0.1028 (difference = −3.1 × 10⁻¹⁶).
All five coefficients agree to 12 significant digits. R's all.equal() returns TRUE.

The takeaway from the post is therefore visible twice: once on a controlled simulation where you set the truth, and once on the original 150 × 8 = 1,200 observations panel that motivates the whole exercise.