Downloads
Each dataset is available as a labeled Stata .dta and its source file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| source_data | country-year (period) | 1,200 × 11 | source_data.dta | source_data.csv |
| data_demeaned | country-year (period) | 1,200 × 14 | data_demeaned.dta | data_demeaned.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_demeaning_twfe/data/"
use "${BASE}source_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_demeaning_twfe/data/"
df = pd.read_stata(BASE + "source_data.dta")
# load every dataset at once
files = ["source_data", "data_demeaned"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "source_data.dta", "source_data.dta")
df, meta = pyreadstat.read_dta("source_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_demeaning_twfe/data/"
df <- read_dta(paste0(BASE, "source_data.dta"))
Overview & sources
Companion data for a hands-on R tutorial that takes the two-way fixed effects (TWFE) estimator apart to show that it is nothing more than ordinary least squares applied to two-way demeaned data, the equivalence guaranteed by the Frisch–Waugh–Lovell (FWL) theorem. The tutorial uses a balanced, synthetic Barro convergence panel of 150 countries observed over 8 time periods (1,200 observations), regressing GDP-per-capita growth on log initial income, investment share, population growth, human capital, and government consumption. It estimates the model with country and time fixed effects via fixest::feols(), then replicates the coefficients by hand: subtracting country means, subtracting time means, and adding back the grand mean before running base R's lm(). The two routes match to at least 12 significant digits (the convergence coefficient is −0.055286 either way; the largest coefficient difference is 3.05×10⁻¹⁶, on the order of machine epsilon), while naive lm() standard errors understate uncertainty by 7–22% because they ignore the degrees of freedom consumed by the absorbed fixed effects.
source_data is the raw balanced country×period panel as loaded for the analysis (one row per country×period, 1,200 rows). data_demeaned is the within-transformed analysis dataset: the same 1,200 rows carrying the six model variables in both their raw form and their two-way demeaned form (the _dm columns), each equal to the observed value minus its country mean minus its time mean plus the grand mean.
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — a simulated balanced Barro convergence panel (open & reproducible) | Mendez, C. (2026). See the post's R script analysis.R and referenceMaterials/manual_demeaning_twfe_tutorial.qmd for the data-generating process. |
| Frisch–Waugh–Lovell theorem | The result that guarantees the TWFE / OLS-on-demeaned equivalence | Frisch, R., & Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. Econometrica, 1(4), 387–401. Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. JASA, 58(304), 993–1010. |
| Method references | Two-way fixed effects / within transformation, growth convergence, and the estimator | Berge, L. (2018). fixest: Fast Fixed-Effects Estimations (R package). Barro, R. J., & Sala-i-Martin, X. (2004). Economic Growth (2nd ed.). MIT Press. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). What Does TWFE Actually Do? Manual Demeaning and the FWL Theorem [Data set]. https://carlos-mendez.org/post/r_demeaning_twfe/
Frisch, R., & Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. Econometrica, 1(4), 387–401.
Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. Journal of the American Statistical Association, 58(304), 993–1010.
BibTeX
@misc{mendez2026rdemeaningtwfe,
author = {Mendez, Carlos},
title = {What Does TWFE Actually Do? Manual Demeaning and the FWL Theorem},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_demeaning_twfe/}},
note = {Data set}
}
@article{frisch1933partial,
author = {Frisch, Ragnar and Waugh, Frederick V.},
title = {Partial Time Regressions as Compared with Individual Trends},
journal = {Econometrica},
volume = {1}, number = {4}, pages = {387--401}, year = {1933}
}
@article{lovell1963seasonal,
author = {Lovell, Michael C.},
title = {Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis},
journal = {Journal of the American Statistical Association},
volume = {58}, number = {304}, pages = {993--1010}, year = {1963}
}
Variable explorer (all 17 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| gov_cons | continuous | Government consumption share | Government consumption as a share of GDP; a control regressor. | 0-1 (share) | source_data, data_demeaned | Simulation |
| gov_cons_dm | continuous | Demeaned government consumption share | Two-way demeaned government consumption share (deviation from country + time means, grand mean restored). | share (deviation) | data_demeaned | Derived (within transform) |
| growth | continuous | GDP per capita growth (dependent variable) | Annualized GDP-per-capita growth rate; the outcome regressed in the TWFE model. | rate (per year) | source_data, data_demeaned | Simulation |
| growth_dm | continuous | Demeaned GDP per capita growth | Two-way demeaned growth: deviation of growth from its country mean and time mean (grand mean added back). The within-variation that identifies the TWFE coefficient. | rate (deviation) | data_demeaned | Derived (within transform) |
| hcap | continuous | Human capital index | Human-capital stock proxy (e.g., schooling-based index). | index | source_data | Simulation |
| id | identifier | Country identifier | Sequential country index (the entity dimension of the panel; treated as a factor for the fixed effects). | integer code | source_data, data_demeaned | Simulation |
| ln_y_initial | continuous | Log initial income (convergence term) | Natural log of initial GDP per capita; the beta-convergence regressor (a negative slope means poorer countries grow faster). | log US$ | source_data, data_demeaned | Simulation |
| ln_y_initial_dm | continuous | Demeaned log initial income | Two-way demeaned log initial income (deviation from country + time means, grand mean restored). | log US$ (deviation) | data_demeaned | Derived (within transform) |
| log_hcap | continuous | Log human capital | Natural log of the human-capital index. | log index | source_data, data_demeaned | Derived |
| log_hcap_dm | continuous | Demeaned log human capital | Two-way demeaned log human capital (deviation from country + time means, grand mean restored). | log index (deviation) | data_demeaned | Derived (within transform) |
| log_n_gd | continuous | Log of population growth + g + d | Natural log of population growth plus the standard 0.05 for growth and depreciation; the Solow n + g + d regressor. | log rate | source_data, data_demeaned | Derived |
| log_n_gd_dm | continuous | Demeaned log(n + g + d) | Two-way demeaned log of population growth plus g + d (deviation from country + time means, grand mean restored). | log rate (deviation) | data_demeaned | Derived (within transform) |
| log_s_k | continuous | Log investment share | Natural log of the investment share of GDP (Solow capital-accumulation regressor). | log share | source_data, data_demeaned | Derived |
| log_s_k_dm | continuous | Demeaned log investment share | Two-way demeaned log investment share (deviation from country + time means, grand mean restored). | log share (deviation) | data_demeaned | Derived (within transform) |
| n_pop | continuous | Population growth rate | Population growth rate (the n in the Solow n + g + d term). | rate (per year) | source_data | Simulation |
| s_k | continuous | Investment share of GDP | Physical-capital investment share of GDP (Solow-style accumulation rate). | 0-1 (share) | source_data | Simulation |
| time | identifier | Time period identifier | Sequential time-period index (the time dimension of the panel; treated as a factor for the fixed effects). | integer code | source_data, data_demeaned | Simulation |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | source_data | data_demeaned |
|---|---|---|
| gov_cons | ● | ● |
| gov_cons_dm | | ● |
| growth | ● | ● |
| growth_dm | | ● |
| hcap | ● | |
| id | ● | ● |
| ln_y_initial | ● | ● |
| ln_y_initial_dm | | ● |
| log_hcap | ● | ● |
| log_hcap_dm | | ● |
| log_n_gd | ● | ● |
| log_n_gd_dm | | ● |
| log_s_k | ● | ● |
| log_s_k_dm | | ● |
| n_pop | ● | |
| s_k | ● | |
| time | ● | ● |
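The same index can be rebuilt from the two column lists. The sketch below is illustrative only: the lists are typed in by hand from the variable explorer (their order is an assumption), and with the real files you would pass `df.columns` from each loaded dataset instead.

```python
# Column names transcribed from the variable explorer; the ordering is assumed.
source_cols = [
    "id", "time", "growth", "ln_y_initial", "s_k", "n_pop",
    "hcap", "gov_cons", "log_s_k", "log_n_gd", "log_hcap",
]
model_vars = ["growth", "ln_y_initial", "log_s_k", "log_n_gd", "log_hcap", "gov_cons"]
demeaned_cols = ["id", "time"] + model_vars + [v + "_dm" for v in model_vars]

# Map each variable to (present in source_data, present in data_demeaned).
all_vars = sorted(set(source_cols) | set(demeaned_cols))
cross_index = {v: (v in source_cols, v in demeaned_cols) for v in all_vars}

for name, (in_source, in_demeaned) in cross_index.items():
    print(f"{name:16s} {'●' if in_source else ' '}  {'●' if in_demeaned else ' '}")
```

The column counts (11 and 14) and the total of 17 distinct variables agree with the figures in the Downloads table and the variable explorer.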
Construction & formulas
The model is a two-way fixed-effects growth regression over country i and period t:
- TWFE specification: $\text{growth}_{it} = \alpha_i + \lambda_t + \beta \cdot x_{it} + u_{it}$. Country fixed effects $\alpha_i$ and time fixed effects $\lambda_t$ absorb every time-invariant country trait and every country-invariant time shock.
- Two-way within (demeaning) transformation: $\tilde{x}_{it} = x_{it} - \bar{x}_{i\cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot\cdot}$, that is, the observed value minus the country mean ($\bar{x}_{i\cdot}$, average over periods), minus the time mean ($\bar{x}_{\cdot t}$, average over countries), plus the grand mean ($\bar{x}_{\cdot\cdot}$). The _dm columns in data_demeaned are exactly this transform of the six model variables.
- Grand-mean correction: the $+\,\bar{x}_{\cdot\cdot}$ term cancels the double-subtraction of the overall level (the grand mean is removed once via $\bar{x}_{i\cdot}$ and once via $\bar{x}_{\cdot t}$); without it the demeaned variables are not centered at zero.
- FWL equivalence: $\hat{\beta}_{\text{TWFE}} = \hat{\beta}_{\text{OLS on demeaned data}}$. Running plain OLS (lm()) on the _dm columns returns exactly the feols() TWFE slopes (up to machine precision).

Regressor construction from the raw inputs: log_s_k = log(s_k) (log investment share), log_n_gd = log(n_pop + 0.05) (log of population growth plus the standard 0.05 for growth + depreciation), and log_hcap = log(hcap) (log human capital).
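The transformation and the FWL equivalence above can be checked on a small example. The following is a minimal sketch, not the tutorial's R code: it uses a hypothetical 6×4 balanced toy panel with made-up variables `x` and `y`, and `demean_two_way` is an illustrative helper name. It compares the slope from OLS on demeaned columns with the slope from a regression that includes explicit unit and period dummies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel (not the tutorial's data): 6 units x 4 periods, balanced.
rng = np.random.default_rng(0)
n_id, n_time = 6, 4
panel = pd.DataFrame({
    "id": np.repeat(np.arange(n_id), n_time),
    "time": np.tile(np.arange(n_time), n_id),
})
panel["x"] = rng.normal(size=len(panel))
unit_fx = rng.normal(size=n_id)[panel["id"].to_numpy()]        # unit effects
time_fx = rng.normal(size=n_time)[panel["time"].to_numpy()]    # period effects
noise = rng.normal(scale=0.1, size=len(panel))
panel["y"] = unit_fx + time_fx - 0.5 * panel["x"] + noise


def demean_two_way(df, col, unit="id", period="time"):
    """Value - unit mean - period mean + grand mean (exact for balanced panels)."""
    unit_mean = df.groupby(unit)[col].transform("mean")
    period_mean = df.groupby(period)[col].transform("mean")
    return df[col] - unit_mean - period_mean + df[col].mean()


panel["x_dm"] = demean_two_way(panel, "x")
panel["y_dm"] = demean_two_way(panel, "y")

# Route 1: OLS slope on the demeaned columns (both are mean zero, so no intercept).
x_dm = panel["x_dm"].to_numpy()
y_dm = panel["y_dm"].to_numpy()
beta_demeaned = float(x_dm @ y_dm / (x_dm @ x_dm))

# Route 2: OLS with an intercept plus explicit unit and period dummies.
unit_d = pd.get_dummies(panel["id"], prefix="id", drop_first=True, dtype=float)
time_d = pd.get_dummies(panel["time"], prefix="t", drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(panel)), panel["x"].to_numpy(),
                     unit_d.to_numpy(), time_d.to_numpy()])
coefs, *_ = np.linalg.lstsq(X, panel["y"].to_numpy(), rcond=None)
beta_dummies = float(coefs[1])

print(beta_demeaned, beta_dummies)  # the two slopes agree up to floating-point error
```

With the real files, the same helper applied to the six model variables in source_data should reproduce the _dm columns of data_demeaned, assuming those columns were built with this closed-form transform as described above.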
The datasets
Each dataset has a full variable dictionary plus a statistics table with data coverage. source_data is listed first, followed by data_demeaned.
Variable dictionary (source_data)
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| id | identifier | Country identifier | Sequential country index (the entity dimension of the panel; treated as a factor for the fixed effects). | 1..150, one per synthetic country. | integer code | Simulation | both files |
| time | identifier | Time period identifier | Sequential time-period index (the time dimension of the panel; treated as a factor for the fixed effects). | 1..8, one per period. | integer code | Simulation | both files |
| growth | continuous | GDP per capita growth (dependent variable) | Annualized GDP-per-capita growth rate; the outcome regressed in the TWFE model. | Simulated outcome of the Barro convergence data-generating process. | rate (per year) | Simulation | both files |
| ln_y_initial | continuous | Log initial income (convergence term) | Natural log of initial GDP per capita; the beta-convergence regressor (a negative slope means poorer countries grow faster). | Simulated log initial income. | log US$ | Simulation | both files |
| s_k | continuous | Investment share of GDP | Physical-capital investment share of GDP (Solow-style accumulation rate). | Simulated; log_s_k = log(s_k). | 0-1 (share) | Simulation | source_data only |
| n_pop | continuous | Population growth rate | Population growth rate (the n in the Solow n + g + d term). | Simulated; log_n_gd = log(n_pop + 0.05). | rate (per year) | Simulation | source_data only |
| hcap | continuous | Human capital index | Human-capital stock proxy (e.g., schooling-based index). | Simulated; log_hcap = log(hcap). | index | Simulation | source_data only |
| gov_cons | continuous | Government consumption share | Government consumption as a share of GDP; a control regressor. | Simulated. | 0-1 (share) | Simulation | both files |
| log_s_k | continuous | Log investment share | Natural log of the investment share of GDP (Solow capital-accumulation regressor). | log(s_k). | log share | Derived | both files |
| log_n_gd | continuous | Log of population growth + g + d | Natural log of population growth plus the standard 0.05 for growth and depreciation; the Solow n + g + d regressor. | log(n_pop + 0.05). | log rate | Derived | both files |
| log_hcap | continuous | Log human capital | Natural log of the human-capital index. | log(hcap). | log index | Derived | both files |
Distribution & statistics (source_data)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| id | 100% | 1,200 | 150 | – | – | – | – | – |
| time | 100% | 1,200 | 8 | – | – | – | – | – |
| growth | 100% | 1,200 | 1,200 | -0.238 | -0.124 | -0.122 | -0.004 | 0.045 |
| ln_y_initial | 100% | 1,200 | 1,200 | 1.92 | 5.36 | 5.16 | 9.87 | 1.59 |
| s_k | 100% | 1,200 | 1,200 | 0.092 | 0.214 | 0.215 | 0.360 | 0.049 |
| n_pop | 100% | 1,200 | 1,198 | 0.005 | 0.020 | 0.020 | 0.038 | 0.005 |
| hcap | 100% | 1,200 | 1,200 | 1.02 | 1.98 | 1.98 | 2.98 | 0.347 |
| gov_cons | 100% | 1,200 | 1,200 | 0.070 | 0.146 | 0.145 | 0.220 | 0.028 |
| log_s_k | 100% | 1,200 | 1,200 | -2.39 | -1.57 | -1.54 | -1.02 | 0.244 |
| log_n_gd | 100% | 1,200 | 1,198 | -2.90 | -2.66 | -2.66 | -2.43 | 0.073 |
| log_hcap | 100% | 1,200 | 1,200 | 0.022 | 0.665 | 0.682 | 1.09 | 0.185 |
Variable dictionary (data_demeaned)
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| id | identifier | Country identifier | Sequential country index (the entity dimension of the panel; treated as a factor for the fixed effects). | 1..150, one per synthetic country. | integer code | Simulation | both files |
| time | identifier | Time period identifier | Sequential time-period index (the time dimension of the panel; treated as a factor for the fixed effects). | 1..8, one per period. | integer code | Simulation | both files |
| growth | continuous | GDP per capita growth (dependent variable) | Annualized GDP-per-capita growth rate; the outcome regressed in the TWFE model. | Simulated outcome of the Barro convergence data-generating process. | rate (per year) | Simulation | both files |
| ln_y_initial | continuous | Log initial income (convergence term) | Natural log of initial GDP per capita; the beta-convergence regressor (a negative slope means poorer countries grow faster). | Simulated log initial income. | log US$ | Simulation | both files |
| log_s_k | continuous | Log investment share | Natural log of the investment share of GDP (Solow capital-accumulation regressor). | log(s_k). | log share | Derived | both files |
| log_n_gd | continuous | Log of population growth + g + d | Natural log of population growth plus the standard 0.05 for growth and depreciation; the Solow n + g + d regressor. | log(n_pop + 0.05). | log rate | Derived | both files |
| log_hcap | continuous | Log human capital | Natural log of the human-capital index. | log(hcap). | log index | Derived | both files |
| gov_cons | continuous | Government consumption share | Government consumption as a share of GDP; a control regressor. | Simulated. | 0-1 (share) | Simulation | both files |
| growth_dm | continuous | Demeaned GDP per capita growth | Two-way demeaned growth: deviation of growth from its country mean and time mean (grand mean added back). The within-variation that identifies the TWFE coefficient. | growth - country_mean(growth) - time_mean(growth) + grand_mean(growth); mean ≈ 0. | rate (deviation) | Derived (within transform) | data_demeaned only |
| ln_y_initial_dm | continuous | Demeaned log initial income | Two-way demeaned log initial income (deviation from country + time means, grand mean restored). | ln_y_initial - country mean - time mean + grand mean; mean ≈ 0. | log US$ (deviation) | Derived (within transform) | data_demeaned only |
| log_s_k_dm | continuous | Demeaned log investment share | Two-way demeaned log investment share (deviation from country + time means, grand mean restored). | log_s_k - country mean - time mean + grand mean; mean ≈ 0. | log share (deviation) | Derived (within transform) | data_demeaned only |
| log_n_gd_dm | continuous | Demeaned log(n + g + d) | Two-way demeaned log of population growth plus g + d (deviation from country + time means, grand mean restored). | log_n_gd - country mean - time mean + grand mean; mean ≈ 0. | log rate (deviation) | Derived (within transform) | data_demeaned only |
| log_hcap_dm | continuous | Demeaned log human capital | Two-way demeaned log human capital (deviation from country + time means, grand mean restored). | log_hcap - country mean - time mean + grand mean; mean ≈ 0. | log index (deviation) | Derived (within transform) | data_demeaned only |
| gov_cons_dm | continuous | Demeaned government consumption share | Two-way demeaned government consumption share (deviation from country + time means, grand mean restored). | gov_cons - country mean - time mean + grand mean; mean ≈ 0. | share (deviation) | Derived (within transform) | data_demeaned only |
Distribution & statistics (data_demeaned)
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| id | 100% | 1,200 | 150 | – | – | – | – | – |
| time | 100% | 1,200 | 8 | – | – | – | – | – |
| growth | 100% | 1,200 | 1,200 | -0.238 | -0.124 | -0.122 | -0.004 | 0.045 |
| ln_y_initial | 100% | 1,200 | 1,200 | 1.92 | 5.36 | 5.16 | 9.87 | 1.59 |
| log_s_k | 100% | 1,200 | 1,200 | -2.39 | -1.57 | -1.54 | -1.02 | 0.244 |
| log_n_gd | 100% | 1,200 | 1,198 | -2.90 | -2.66 | -2.66 | -2.43 | 0.073 |
| log_hcap | 100% | 1,200 | 1,200 | 0.022 | 0.665 | 0.682 | 1.09 | 0.185 |
| gov_cons | 100% | 1,200 | 1,200 | 0.070 | 0.146 | 0.145 | 0.220 | 0.028 |
| growth_dm | 100% | 1,200 | 1,200 | -0.077 | -8.09e-17 | 6.88e-05 | 0.091 | 0.023 |
| ln_y_initial_dm | 100% | 1,200 | 1,200 | -0.690 | 8.30e-15 | -4.16e-04 | 0.574 | 0.164 |
| log_s_k_dm | 100% | 1,200 | 1,200 | -0.421 | -1.49e-15 | 0.003 | 0.377 | 0.087 |
| log_n_gd_dm | 100% | 1,200 | 1,200 | -0.124 | 1.60e-15 | 0.002 | 0.125 | 0.033 |
| log_hcap_dm | 100% | 1,200 | 1,200 | -0.337 | 5.38e-17 | -4.09e-05 | 0.177 | 0.044 |
| gov_cons_dm | 100% | 1,200 | 1,200 | -0.051 | 1.83e-16 | -5.08e-05 | 0.040 | 0.014 |
Known limitations & caveats
- Synthetic data. The panel is simulated; coefficient values reflect the data-generating process, not real-world economic dynamics — they are not empirical evidence about growth convergence.
- Coefficients match, standard errors do not. OLS on the demeaned data reproduces the TWFE point estimates exactly, but naive
lm()standard errors are too small (by 7–22% here) because they do not subtract the degrees of freedom consumed by the 150 country plus 8 time effects. Always use a dedicated panel estimator (e.g.feols()) for inference. - Balanced panel only. The closed-form two-way demeaning used here is exact because the panel is balanced; unbalanced panels require the iterative projection that
fixestperforms internally. - Demeaning kills time-invariant regressors. Any variable constant within a country over time equals its own country mean, so its demeaned column is zero everywhere and it drops out — this is why fixed-effects models cannot identify time-invariant characteristics.
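The degrees-of-freedom point in the second caveat can be made concrete with a little arithmetic. The sketch below is an assumption-laden illustration rather than the tutorial's calculation: it takes the textbook homoskedastic (iid) case, assumes the naive regression has five slopes plus an intercept, and counts the absorbed effects as 150 + 8 − 1 (one level is redundant with the other dimension). `se_inflation_factor` is a hypothetical helper name.

```python
import math


def se_inflation_factor(n_obs, n_slopes, n_units, n_periods, intercept=True):
    """Ratio of df-corrected to naive standard errors under iid errors.

    Naive OLS on demeaned data divides the residual sum of squares by
    n_obs - n_slopes (minus 1 if an intercept is fitted). The corrected
    residual df also subtracts the absorbed fixed effects.
    """
    naive_df = n_obs - n_slopes - (1 if intercept else 0)
    corrected_df = n_obs - n_slopes - (n_units + n_periods - 1)
    return math.sqrt(naive_df / corrected_df)


# Panel dimensions from this page: 1,200 rows, 5 regressors, 150 countries, 8 periods.
factor = se_inflation_factor(n_obs=1200, n_slopes=5, n_units=150, n_periods=8)
print(f"multiply naive SEs by about {factor:.3f}")
```

Under these assumptions the adjustment alone is roughly 7%, in line with the lower end of the range quoted above. The sketch does not attempt to reproduce the upper end, which may depend on how the reference standard errors were computed.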