Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| store_data | store (cross-section) | 200 × 4 | store_data.dta | store_data.csv |
| flights_sample | flight | 5,000 × 9 | flights_sample.dta | flights_sample.csv |
| wagepan | individual-year | 4,360 × 44 | wagepan.dta | wagepan.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/data/"
use "${BASE}store_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/data/"
df = pd.read_stata(BASE + "store_data.dta")
# load every dataset at once
files = ["store_data", "flights_sample", "wagepan"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "store_data.dta", "store_data.dta")
df, meta = pyreadstat.read_dta("store_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/data/"
df <- read_dta(paste0(BASE, "store_data.dta"))
Overview & sources
Companion data for a hands-on R tutorial on the fwlplot package (Butts & McDermott, 2024), which renders the Frisch–Waugh–Lovell (FWL) theorem as a picture: any multiple-regression coefficient equals the slope of a simple bivariate regression after partialling the other controls out of both axes. The post builds intuition across three datasets: an n=200 simulated retail cross-section where income confounds the coupon–sales relationship, the nycflights13 flights data (a 5,000-row cleaned sample), and Wooldridge's wagepan panel (545 individuals over 1980–1987). In the simulated case, controlling for income reverses the naive coupon slope of −0.093 to +0.212 (true effect +0.2); fixed effects on the flights and wage data show what "controlling for" looks like geometrically.
store_data is a simulated cross-section (one row per store, n=200) with sales, coupons, income and day-of-week. flights_sample is a 5,000-row sample of cleaned 2013 NYC departures (one row per flight) from the nycflights13 package. wagepan is a balanced wage panel (one row per individual × year; 545 individuals × 8 years = 4,360 rows, 1980–1987) from the Wooldridge package.
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Simulated (this study) | store_data — a synthetic retail cross-section with a known confounder (income) and a known true coupon effect (+0.2) | Mendez, C. (2026). See the post's R script analysis.R for the full data-generating process (set.seed(42)). |
| nycflights13 | flights_sample — a 5,000-row cleaned sample of on-time departures from New York's three airports in 2013 | Wickham, H. (2021). nycflights13: Flights that Departed NYC in 2013. CRAN. https://cran.r-project.org/package=nycflights13 (source: US Bureau of Transportation Statistics). |
| Wooldridge wagepan | wagepan — panel of 545 men over 8 years (1980–1987) used in Wooldridge's panel-data examples | Wooldridge, J. M. Introductory Econometrics. wagepan dataset via the wooldridge R package. https://cran.r-project.org/package=wooldridge (originally from Vella & Verbeek, 1998, J. Applied Econometrics). |
| Method references | FWL theorem and the fwlplot / fixest implementation | Frisch & Waugh (1933); Lovell (1963); Butts & McDermott (2024, fwlplot); Berge (2018, fixest). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Visualizing Regression with the FWL Theorem in R [Data set]. https://carlos-mendez.org/post/r_fwlplot/
Butts, K., & McDermott, G. (2024). fwlplot: Scatter Plot After Residualizing. CRAN. https://cran.r-project.org/package=fwlplot
Frisch, R., & Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. Econometrica, 1(4), 387–401.
Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. Journal of the American Statistical Association, 58(304), 993–1010.
Wickham, H. (2021). nycflights13: Flights that Departed NYC in 2013. CRAN.
BibTeX
@misc{mendez2026rfwlplot,
author = {Mendez, Carlos},
title = {Visualizing Regression with the FWL Theorem in R},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_fwlplot/}},
note = {Data set}
}
@misc{butts2024fwlplot,
author = {Butts, Kyle and McDermott, Grant},
title = {fwlplot: Scatter Plot After Residualizing},
year = {2024}, howpublished = {CRAN}, note = {R package},
url = {https://cran.r-project.org/package=fwlplot}
}
@article{frisch1933partial,
author = {Frisch, Ragnar and Waugh, Frederick V.},
title = {Partial Time Regressions as Compared with Individual Trends},
journal = {Econometrica}, volume = {1}, number = {4}, pages = {387--401}, year = {1933}
}
@article{lovell1963seasonal,
author = {Lovell, Michael C.},
title = {Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis},
journal = {Journal of the American Statistical Association},
volume = {58}, number = {304}, pages = {993--1010}, year = {1963}
}
Variable explorer (all 57 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| agric | dummy | Industry: agriculture (1=yes) | 1 if employed in agriculture, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| air_time | continuous | Air time (min) | Time in the air, in minutes (the regressor of interest in the flights example). | minutes | flights_sample | nycflights13 (US BTS) |
| arr_delay | continuous | Arrival delay (min) | Arrival delay in minutes. | minutes | flights_sample | nycflights13 (US BTS) |
| black | dummy | Race: Black (1=yes) | 1 if the individual is Black, else 0 (time-invariant). | 0/1 | wagepan | Wooldridge wagepan |
| bus | dummy | Industry: business/repair services (1=yes) | 1 if employed in business and repair services, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| carrier | identifier | Carrier code | Two-letter airline carrier code. | code | flights_sample | nycflights13 (US BTS) |
| construc | dummy | Industry: construction (1=yes) | 1 if employed in construction, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| coupons | continuous | Coupons distributed (treatment) | Number/intensity of coupons distributed (the regressor of interest). | count/index | store_data | Simulation (this study) |
| d81 | dummy | Year dummy: 1981 (1=yes) | 1 if the observation year is 1981, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d82 | dummy | Year dummy: 1982 (1=yes) | 1 if the observation year is 1982, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d83 | dummy | Year dummy: 1983 (1=yes) | 1 if the observation year is 1983, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d84 | dummy | Year dummy: 1984 (1=yes) | 1 if the observation year is 1984, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d85 | dummy | Year dummy: 1985 (1=yes) | 1 if the observation year is 1985, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d86 | dummy | Year dummy: 1986 (1=yes) | 1 if the observation year is 1986, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| d87 | dummy | Year dummy: 1987 (1=yes) | 1 if the observation year is 1987, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| day | identifier | Day of month (1-31) | Calendar day of month of the scheduled departure. | 1-31 | flights_sample | nycflights13 (US BTS) |
| dayofweek | identifier | Day of week (1-7) | Day-of-week indicator used as an additional control in §5.4. | 1-7 | store_data | Simulation (this study) |
| dep_delay | continuous | Departure delay (min) | Departure delay in minutes (the outcome in the flights regressions). | minutes | flights_sample | nycflights13 (US BTS) |
| dest | identifier | Destination airport (FE) | Destination airport code; used as a fixed effect alongside origin. | code | flights_sample | nycflights13 (US BTS) |
| educ | continuous | Years of education | Years of schooling (time-invariant; drops out under individual FE). | years | wagepan | Wooldridge wagepan |
| ent | dummy | Industry: entertainment (1=yes) | 1 if employed in entertainment, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| exper | continuous | Labor-market experience (years) | Years of (potential) labor-market experience (the regressor of interest in §7). | years | wagepan | Wooldridge wagepan |
| expersq | continuous | Experience squared | Square of labor-market experience (captures the concave wage-experience profile). | years^2 | wagepan | Wooldridge wagepan (derived) |
| fin | dummy | Industry: finance (1=yes) | 1 if employed in finance, insurance, or real estate, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| hisp | dummy | Ethnicity: Hispanic (1=yes) | 1 if the individual is Hispanic, else 0 (time-invariant). | 0/1 | wagepan | Wooldridge wagepan |
| hour | identifier | Scheduled departure hour (0-23) | Scheduled departure hour (local). | 0-23 | flights_sample | nycflights13 (US BTS) |
| hours | continuous | Annual hours worked | Annual hours worked. | hours/year | wagepan | Wooldridge wagepan |
| income | continuous | Neighborhood income (confounder) | Neighborhood income level (the confounder that drives both coupons and sales). | index units | store_data | Simulation (this study) |
| lwage | continuous | Log hourly wage | Natural log of the hourly wage (the outcome in the wage regressions). | log US$ | wagepan | Wooldridge wagepan |
| manuf | dummy | Industry: manufacturing (1=yes) | 1 if employed in manufacturing, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| married | dummy | Married (1=yes) | 1 if married, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| min | dummy | Industry: mining (1=yes) | 1 if employed in mining, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| month | identifier | Month of flight (1-12) | Calendar month of the scheduled departure. | 1-12 | flights_sample | nycflights13 (US BTS) |
| nr | identifier | Person identifier | Unique individual identifier (the panel unit; used as the individual fixed effect). | id | wagepan | Wooldridge wagepan |
| nrthcen | dummy | Region: North Central (1=yes) | 1 if resident of the North Central census region, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| nrtheast | dummy | Region: Northeast (1=yes) | 1 if resident of the Northeast census region, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ1 | dummy | Occupation group 1 (1=yes) | 1 if in occupation group 1, else 0 (occupational dummies occ1-occ9). | 0/1 | wagepan | Wooldridge wagepan |
| occ2 | dummy | Occupation group 2 (1=yes) | 1 if in occupation group 2, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ3 | dummy | Occupation group 3 (1=yes) | 1 if in occupation group 3, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ4 | dummy | Occupation group 4 (1=yes) | 1 if in occupation group 4, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ5 | dummy | Occupation group 5 (1=yes) | 1 if in occupation group 5, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ6 | dummy | Occupation group 6 (1=yes) | 1 if in occupation group 6, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ7 | dummy | Occupation group 7 (1=yes) | 1 if in occupation group 7, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ8 | dummy | Occupation group 8 (1=yes) | 1 if in occupation group 8, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| occ9 | dummy | Occupation group 9 (1=yes) | 1 if in occupation group 9, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| origin | identifier | Origin airport (FE) | Origin airport code, one of New York's three airports; used as a fixed effect. | code | flights_sample | nycflights13 (US BTS) |
| per | dummy | Industry: personal services (1=yes) | 1 if employed in personal services, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| poorhlth | dummy | Poor health (1=yes) | 1 if the individual reports being in poor health, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| pro | dummy | Industry: professional services (1=yes) | 1 if employed in professional and related services, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| pub | dummy | Industry: public administration (1=yes) | 1 if employed in public administration, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| rur | dummy | Rural residence (1=yes) | 1 if resident in a rural area, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| sales | continuous | Store sales (simulated) | Simulated sales for the store (the outcome variable). | index units | store_data | Simulation (this study) |
| south | dummy | Region: South (1=yes) | 1 if resident of the South census region, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| tra | dummy | Industry: transportation (1=yes) | 1 if employed in transportation, communications, or utilities, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| trad | dummy | Industry: trade (1=yes) | 1 if employed in wholesale or retail trade, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| union | dummy | Union contract (1=yes) | 1 if wage is set by a collective-bargaining agreement, else 0. | 0/1 | wagepan | Wooldridge wagepan |
| year | year | Calendar year (1980-1987) | Year of the observation. | year | wagepan | Wooldridge wagepan |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | store_data | flights_sample | wagepan |
|---|---|---|---|
| agric |  |  | ● |
| air_time |  | ● |  |
| arr_delay |  | ● |  |
| black |  |  | ● |
| bus |  |  | ● |
| carrier |  | ● |  |
| construc |  |  | ● |
| coupons | ● |  |  |
| d81 |  |  | ● |
| d82 |  |  | ● |
| d83 |  |  | ● |
| d84 |  |  | ● |
| d85 |  |  | ● |
| d86 |  |  | ● |
| d87 |  |  | ● |
| day |  | ● |  |
| dayofweek | ● |  |  |
| dep_delay |  | ● |  |
| dest |  | ● |  |
| educ |  |  | ● |
| ent |  |  | ● |
| exper |  |  | ● |
| expersq |  |  | ● |
| fin |  |  | ● |
| hisp |  |  | ● |
| hour |  | ● |  |
| hours |  |  | ● |
| income | ● |  |  |
| lwage |  |  | ● |
| manuf |  |  | ● |
| married |  |  | ● |
| min |  |  | ● |
| month |  | ● |  |
| nr |  |  | ● |
| nrthcen |  |  | ● |
| nrtheast |  |  | ● |
| occ1 |  |  | ● |
| occ2 |  |  | ● |
| occ3 |  |  | ● |
| occ4 |  |  | ● |
| occ5 |  |  | ● |
| occ6 |  |  | ● |
| occ7 |  |  | ● |
| occ8 |  |  | ● |
| occ9 |  |  | ● |
| origin |  | ● |  |
| per |  |  | ● |
| poorhlth |  |  | ● |
| pro |  |  | ● |
| pub |  |  | ● |
| rur |  |  | ● |
| sales | ● |  |  |
| south |  |  | ● |
| tra |  |  | ● |
| trad |  |  | ● |
| union |  |  | ● |
| year |  |  | ● |
Construction & formulas
The Frisch–Waugh–Lovell (FWL) theorem: in the regression Y = X₁β₁ + X₂β₂ + ε, the coefficient β₁ on the variable of interest equals the slope from a simple bivariate regression after partialling X₂ out of both axes:
- Step 1, residualize the outcome: regress Y on the controls X₂ and keep the residuals Ỹ = M₂Y.
- Step 2, residualize the regressor: regress X₁ on the controls X₂ and keep the residuals X̃₁ = M₂X₁.
- Step 3, residual-on-residual: regress Ỹ on X̃₁; the slope equals β₁ exactly (0.212288 in the store data, matching feols() to six decimals).
Here M₂ = I − X₂(X₂'X₂)⁻¹X₂' is the residual-maker matrix. Fixed effects are FWL applied to group dummies: including | origin + dest (flights) or | nr (wages) demeans each variable within group before fitting. fwl_plot() automates all of this and plots the residualized scatter (an added-variable plot) with the regression line overlaid.
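The three steps can be checked numerically with a minimal pure-Python sketch. It does not use the published files: it draws fresh data from the store data-generating process listed in the variable dictionary (without the day-of-week term, for brevity), with an arbitrary seed, so the numbers will differ from the post's 0.212288. The helper names (`slope_intercept`, `residuals`, `coef_two_regressors`) are illustrative, not part of fwlplot or fixest.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope_intercept(x, y):
    """OLS of y on x with an intercept; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def residuals(y, x):
    """Residuals from regressing y on x with an intercept (the M2 step)."""
    slope, intercept = slope_intercept(x, y)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

def coef_two_regressors(y, x1, x2):
    """Coefficient on x1 in OLS of y on (1, x1, x2), in closed form."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (b - my) for a, b in zip(x1, y))
    s2y = sum((a - m2) * (b - my) for a, b in zip(x2, y))
    return (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

# Assumed stand-in for store_data (arbitrary seed, not R's set.seed(42)).
rng = random.Random(0)
income = [rng.gauss(50, 10) for _ in range(200)]
coupons = [60 - 0.5 * inc + rng.gauss(0, 5) for inc in income]
sales = [10 + 0.2 * c + 0.3 * inc + rng.gauss(0, 3)
         for c, inc in zip(coupons, income)]

y_tilde = residuals(sales, income)     # Step 1: residualize the outcome
x_tilde = residuals(coupons, income)   # Step 2: residualize the regressor
beta_fwl, _ = slope_intercept(x_tilde, y_tilde)  # Step 3: residual-on-residual

beta_full = coef_two_regressors(sales, coupons, income)  # multiple regression
beta_naive, _ = slope_intercept(coupons, sales)          # no control

print(f"naive slope:        {beta_naive:.6f}")
print(f"multiple regression: {beta_full:.6f}")
print(f"FWL residual slope:  {beta_fwl:.6f}")
```

The last two printed values should agree to floating-point precision, while the naive slope is pulled away from them by the omitted income variable.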
Omitted variable bias: bias = γ × δ, where γ
is the effect of the omitted control on the outcome and δ is the slope of the omitted control
on the regressor. In the store data, 0.300 × (−0.494) = −0.148.
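The omitted-variable-bias identity can also be verified in-sample: the naive slope equals the controlled slope plus γ × δ, exactly. The sketch below is a hedged illustration on freshly simulated data (arbitrary seed, store-style data-generating process without the day-of-week term), so its numbers are not the post's; `cov` is a small helper defined here, not a library call.

```python
import random

def cov(u, v):
    """Sample covariance; cov(u, u) is the sample variance."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Assumed stand-in for store_data (arbitrary seed).
rng = random.Random(1)
income = [rng.gauss(50, 10) for _ in range(200)]
coupons = [60 - 0.5 * inc + rng.gauss(0, 5) for inc in income]
sales = [10 + 0.2 * c + 0.3 * inc + rng.gauss(0, 3)
         for c, inc in zip(coupons, income)]

# Short regression: sales on coupons only.
naive = cov(coupons, sales) / cov(coupons, coupons)

# Long regression: sales on coupons and income (closed form, two regressors).
det = cov(coupons, coupons) * cov(income, income) - cov(coupons, income) ** 2
beta1 = (cov(coupons, sales) * cov(income, income)
         - cov(income, sales) * cov(coupons, income)) / det
gamma = (cov(income, sales) * cov(coupons, coupons)
         - cov(coupons, sales) * cov(coupons, income)) / det

# Auxiliary regression: the omitted control (income) on the regressor (coupons).
delta = cov(coupons, income) / cov(coupons, coupons)

bias = gamma * delta
print(f"naive = {naive:.6f}, beta1 + gamma*delta = {beta1 + bias:.6f}")
```

The direction of the auxiliary regression matters: δ comes from regressing income on coupons. The reverse slope (coupons on income) is a different number and does not satisfy the identity.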
The datasets
Each dataset below has a full variable dictionary followed by a statistics table with data coverage.
Variable dictionary: store_data
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
sales continuous | Store sales (simulated) | Simulated sales for the store (the outcome variable). | sales = 10 + 0.2·coupons + 0.3·income + 0.5·dayofweek + N(0,3); rounded to 2 decimals. | index units | Simulation (this study) | 200 stores |
coupons continuous | Coupons distributed (treatment) | Number/intensity of coupons distributed (the regressor of interest). | coupons = 60 − 0.5·income + N(0,5); rounded to 2 decimals. Negatively driven by income (the confounder). | count/index | Simulation (this study) | 200 stores |
income continuous | Neighborhood income (confounder) | Neighborhood income level — the confounder that drives both coupons and sales. | income ~ N(50, 10); rounded to 2 decimals. | index units | Simulation (this study) | 200 stores |
dayofweek identifier | Day of week (1-7) | Day-of-week indicator used as an additional control in §5.4. | Uniform draw sample(1:7); 1=first day ... 7=last day. | 1-7 | Simulation (this study) | 200 stores |
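The Construction column above translates almost line by line into code. The following is a rough Python sketch of that data-generating process, not the post's analysis.R: it assumes the second argument of N(·,·) is a standard deviation (consistent with the SDs reported for this dataset), uses an arbitrary seed, and therefore will not reproduce the published store_data values.

```python
import random

rng = random.Random(2026)  # arbitrary seed; R's set.seed(42) draws differ
n_stores = 200

store_data = []
for _ in range(n_stores):
    income = rng.gauss(50, 10)                     # income ~ N(50, 10)
    coupons = 60 - 0.5 * income + rng.gauss(0, 5)  # lower income -> more coupons
    dayofweek = rng.randint(1, 7)                  # uniform draw from 1..7
    sales = (10 + 0.2 * coupons + 0.3 * income
             + 0.5 * dayofweek + rng.gauss(0, 3))  # true coupon effect is 0.2
    store_data.append({
        "sales": round(sales, 2),
        "coupons": round(coupons, 2),
        "income": round(income, 2),
        "dayofweek": dayofweek,
    })

# Quick look at the simulated means.
for var in ("sales", "coupons", "income"):
    avg = sum(row[var] for row in store_data) / n_stores
    print(f"{var:8s} mean = {avg:.2f}")
```

Because income raises sales directly but lowers coupons, stores with many coupons tend to sit in poorer neighborhoods, which is what biases the naive coupon slope downward.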
Distribution & statistics: store_data
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| sales | 100% | 200 | 191 | 24.89 | 33.67 | 33.55 | 45.23 | 3.81 |
| coupons | 100% | 200 | 190 | 18.72 | 34.86 | 34.82 | 53.25 | 6.79 |
| income | 100% | 200 | 192 | 20.07 | 49.73 | 49.84 | 77.02 | 9.75 |
| dayofweek | 100% | 200 | 7 | – | – | – | – | – |
Variable dictionary: flights_sample
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
dep_delay continuous | Departure delay (min) | Departure delay in minutes (the outcome in the flights regressions). | From nycflights13; cleaned sample keeps dep_delay in (−30, 120). | minutes | nycflights13 (US BTS) | 5,000 flights |
arr_delay continuous | Arrival delay (min) | Arrival delay in minutes. | From nycflights13 (carried in the saved sample; not used in the post's regressions). | minutes | nycflights13 (US BTS) | 5,000 flights |
air_time continuous | Air time (min) | Time in the air, in minutes (the regressor of interest in the flights example). | From nycflights13; cleaned to non-missing values. | minutes | nycflights13 (US BTS) | 5,000 flights |
origin identifier | Origin airport (FE) | Origin airport code — one of New York's three airports; used as a fixed effect. | From nycflights13: EWR, JFK, or LGA. | code | nycflights13 (US BTS) | 5,000 flights |
dest identifier | Destination airport (FE) | Destination airport code; used as a fixed effect alongside origin. | From nycflights13 (IATA destination code). | code | nycflights13 (US BTS) | 5,000 flights |
carrier identifier | Carrier code | Two-letter airline carrier code. | From nycflights13 (carried in the sample; not used in the post's regressions). | code | nycflights13 (US BTS) | 5,000 flights |
month identifier | Month of flight (1-12) | Calendar month of the scheduled departure. | From nycflights13. | 1-12 | nycflights13 (US BTS) | 5,000 flights |
day identifier | Day of month (1-31) | Calendar day of month of the scheduled departure. | From nycflights13. | 1-31 | nycflights13 (US BTS) | 5,000 flights |
hour identifier | Scheduled departure hour (0-23) | Scheduled departure hour (local). | From nycflights13. | 0-23 | nycflights13 (US BTS) | 5,000 flights |
Distribution & statistics: flights_sample
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| dep_delay | 100% | 5,000 | 137 | -20.00 | 7.32 | -2.00 | 119.0 | 22.84 |
| arr_delay | 100% | 5,000 | 191 | -66.00 | 1.40 | -6.00 | 166.0 | 29.43 |
| air_time | 100% | 5,000 | 361 | 22.00 | 150.4 | 130.0 | 650.0 | 93.48 |
| origin | 100% | 5,000 | 3 | – | – | – | – | – |
| dest | 100% | 5,000 | 96 | – | – | – | – | – |
| carrier | 100% | 5,000 | 15 | – | – | – | – | – |
| month | 100% | 5,000 | 12 | – | – | – | – | – |
| day | 100% | 5,000 | 31 | – | – | – | – | – |
| hour | 100% | 5,000 | 19 | – | – | – | – | – |
Variable dictionary: wagepan
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
nr identifier | Person identifier | Unique individual identifier (the panel unit; used as the individual fixed effect). | From the Wooldridge wagepan dataset. | id | Wooldridge wagepan | 545 individuals |
year year | Calendar year (1980-1987) | Year of the observation. | From wagepan; used as the year fixed effect in two-way FE models. | year | Wooldridge wagepan | 1980-1987 |
agric dummy | Industry: agriculture (1=yes) | 1 if employed in agriculture, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
black dummy | Race: Black (1=yes) | 1 if the individual is Black, else 0 (time-invariant). | From wagepan. | 0/1 | Wooldridge wagepan | panel |
bus dummy | Industry: business/repair services (1=yes) | 1 if employed in business and repair services, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
construc dummy | Industry: construction (1=yes) | 1 if employed in construction, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
ent dummy | Industry: entertainment (1=yes) | 1 if employed in entertainment, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
exper continuous | Labor-market experience (years) | Years of (potential) labor-market experience — the regressor of interest in §7. | From wagepan; increments by one year per individual per year. | years | Wooldridge wagepan | 0-18 |
fin dummy | Industry: finance (1=yes) | 1 if employed in finance, insurance, or real estate, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
hisp dummy | Ethnicity: Hispanic (1=yes) | 1 if the individual is Hispanic, else 0 (time-invariant). | From wagepan. | 0/1 | Wooldridge wagepan | panel |
poorhlth dummy | Poor health (1=yes) | 1 if the individual reports being in poor health, else 0. | From wagepan. | 0/1 | Wooldridge wagepan | panel |
hours continuous | Annual hours worked | Annual hours worked. | From wagepan. | hours/year | Wooldridge wagepan | panel |
manuf dummy | Industry: manufacturing (1=yes) | 1 if employed in manufacturing, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
married dummy | Married (1=yes) | 1 if married, else 0. | From wagepan. | 0/1 | Wooldridge wagepan | panel |
min dummy | Industry: mining (1=yes) | 1 if employed in mining, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
nrthcen dummy | Region: North Central (1=yes) | 1 if resident of the North Central census region, else 0. | From wagepan region indicators. | 0/1 | Wooldridge wagepan | panel |
nrtheast dummy | Region: Northeast (1=yes) | 1 if resident of the Northeast census region, else 0. | From wagepan region indicators. | 0/1 | Wooldridge wagepan | panel |
occ1 dummy | Occupation group 1 (1=yes) | 1 if in occupation group 1, else 0 (occupational dummies occ1-occ9). | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ2 dummy | Occupation group 2 (1=yes) | 1 if in occupation group 2, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ3 dummy | Occupation group 3 (1=yes) | 1 if in occupation group 3, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ4 dummy | Occupation group 4 (1=yes) | 1 if in occupation group 4, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ5 dummy | Occupation group 5 (1=yes) | 1 if in occupation group 5, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ6 dummy | Occupation group 6 (1=yes) | 1 if in occupation group 6, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ7 dummy | Occupation group 7 (1=yes) | 1 if in occupation group 7, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ8 dummy | Occupation group 8 (1=yes) | 1 if in occupation group 8, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
occ9 dummy | Occupation group 9 (1=yes) | 1 if in occupation group 9, else 0. | From wagepan occupation indicators. | 0/1 | Wooldridge wagepan | panel |
per dummy | Industry: personal services (1=yes) | 1 if employed in personal services, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
pro dummy | Industry: professional services (1=yes) | 1 if employed in professional and related services, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
pub dummy | Industry: public administration (1=yes) | 1 if employed in public administration, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
rur dummy | Rural residence (1=yes) | 1 if resident in a rural area, else 0. | From wagepan. | 0/1 | Wooldridge wagepan | panel |
south dummy | Region: South (1=yes) | 1 if resident of the South census region, else 0. | From wagepan region indicators. | 0/1 | Wooldridge wagepan | panel |
educ continuous | Years of education | Years of schooling (time-invariant; drops out under individual FE). | From wagepan. | years | Wooldridge wagepan | 3-16 |
tra dummy | Industry: transportation (1=yes) | 1 if employed in transportation, communications, or utilities, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
trad dummy | Industry: trade (1=yes) | 1 if employed in wholesale or retail trade, else 0. | From wagepan industry indicators. | 0/1 | Wooldridge wagepan | panel |
union dummy | Union contract (1=yes) | 1 if wage is set by a collective-bargaining agreement, else 0. | From wagepan. | 0/1 | Wooldridge wagepan | panel |
lwage continuous | Log hourly wage | Natural log of the hourly wage (the outcome in the wage regressions). | From wagepan (log of hourly wage). | log US$ | Wooldridge wagepan | panel |
d81 dummy | Year dummy: 1981 (1=yes) | 1 if the observation year is 1981, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d82 dummy | Year dummy: 1982 (1=yes) | 1 if the observation year is 1982, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d83 dummy | Year dummy: 1983 (1=yes) | 1 if the observation year is 1983, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d84 dummy | Year dummy: 1984 (1=yes) | 1 if the observation year is 1984, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d85 dummy | Year dummy: 1985 (1=yes) | 1 if the observation year is 1985, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d86 dummy | Year dummy: 1986 (1=yes) | 1 if the observation year is 1986, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
d87 dummy | Year dummy: 1987 (1=yes) | 1 if the observation year is 1987, else 0. | From wagepan year dummies d81-d87. | 0/1 | Wooldridge wagepan | panel |
expersq continuous | Experience squared | Square of labor-market experience (captures the concave wage-experience profile). | exper^2. | years^2 | Wooldridge wagepan (derived) | panel |
Distribution & statistics: wagepan
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| nr | 100% | 4,360 | 545 | – | – | – | – | – |
| year | 100% | 4,360 | 8 | 1980 | 1983.5 | 1983 | 1987 | 2.29 |
| agric | 100% | 4,360 | 2 | 0 | 0.032 | 0 | 1.00 | 0.176 |
| black | 100% | 4,360 | 2 | 0 | 0.116 | 0 | 1.00 | 0.320 |
| bus | 100% | 4,360 | 2 | 0 | 0.076 | 0 | 1.00 | 0.265 |
| construc | 100% | 4,360 | 2 | 0 | 0.075 | 0 | 1.00 | 0.263 |
| ent | 100% | 4,360 | 2 | 0 | 0.015 | 0 | 1.00 | 0.122 |
| exper | 100% | 4,360 | 19 | 0 | 6.51 | 6.00 | 18.00 | 2.83 |
| fin | 100% | 4,360 | 2 | 0 | 0.037 | 0 | 1.00 | 0.189 |
| hisp | 100% | 4,360 | 2 | 0 | 0.156 | 0 | 1.00 | 0.363 |
| poorhlth | 100% | 4,360 | 2 | 0 | 0.017 | 0 | 1.00 | 0.129 |
| hours | 100% | 4,360 | 1,276 | 120.0 | 2,191.3 | 2,080.0 | 4,992.0 | 566.4 |
| manuf | 100% | 4,360 | 2 | 0 | 0.282 | 0 | 1.00 | 0.450 |
| married | 100% | 4,360 | 2 | 0 | 0.439 | 0 | 1.00 | 0.496 |
| min | 100% | 4,360 | 2 | 0 | 0.016 | 0 | 1.00 | 0.124 |
| nrthcen | 100% | 4,360 | 2 | 0 | 0.258 | 0 | 1.00 | 0.437 |
| nrtheast | 100% | 4,360 | 2 | 0 | 0.190 | 0 | 1.00 | 0.392 |
| occ1 | 100% | 4,360 | 2 | 0 | 0.104 | 0 | 1.00 | 0.305 |
| occ2 | 100% | 4,360 | 2 | 0 | 0.092 | 0 | 1.00 | 0.288 |
| occ3 | 100% | 4,360 | 2 | 0 | 0.053 | 0 | 1.00 | 0.225 |
| occ4 | 100% | 4,360 | 2 | 0 | 0.111 | 0 | 1.00 | 0.315 |
| occ5 | 100% | 4,360 | 2 | 0 | 0.214 | 0 | 1.00 | 0.410 |
| occ6 | 100% | 4,360 | 2 | 0 | 0.202 | 0 | 1.00 | 0.402 |
| occ7 | 100% | 4,360 | 2 | 0 | 0.092 | 0 | 1.00 | 0.289 |
| occ8 | 100% | 4,360 | 2 | 0 | 0.015 | 0 | 1.00 | 0.120 |
| occ9 | 100% | 4,360 | 2 | 0 | 0.117 | 0 | 1.00 | 0.321 |
| per | 100% | 4,360 | 2 | 0 | 0.017 | 0 | 1.00 | 0.128 |
| pro | 100% | 4,360 | 2 | 0 | 0.076 | 0 | 1.00 | 0.266 |
| pub | 100% | 4,360 | 2 | 0 | 0.040 | 0 | 1.00 | 0.196 |
| rur | 100% | 4,360 | 2 | 0 | 0.204 | 0 | 1.00 | 0.403 |
| south | 100% | 4,360 | 2 | 0 | 0.351 | 0 | 1.00 | 0.477 |
| educ | 100% | 4,360 | 13 | 3.00 | 11.77 | 12.00 | 16.00 | 1.75 |
| tra | 100% | 4,360 | 2 | 0 | 0.066 | 0 | 1.00 | 0.248 |
| trad | 100% | 4,360 | 2 | 0 | 0.268 | 0 | 1.00 | 0.443 |
| union | 100% | 4,360 | 2 | 0 | 0.244 | 0 | 1.00 | 0.430 |
| lwage | 100% | 4,360 | 3,631 | -3.58 | 1.65 | 1.67 | 4.05 | 0.533 |
| d81 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d82 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d83 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d84 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d85 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d86 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| d87 | 100% | 4,360 | 2 | 0 | 0.125 | 0 | 1.00 | 0.331 |
| expersq | 100% | 4,360 | 19 | 0 | 50.42 | 36.00 | 324.0 | 40.78 |
Known limitations & caveats
- store_data is simulated. Values come from a known data-generating process (set.seed(42)); the true coupon effect is +0.2 by construction. It illustrates confounding and FWL, not a real retail market.
- flights_sample is a 5,000-row sample drawn (set.seed(123)) from approximately 317,578 cleaned flights. Cleaning keeps only departure delays in (−30, 120) minutes, drops rows with missing air time, origin, or destination, and keeps only routes with more than one observation. Slopes in the post's tables are estimated on the full cleaned data, not this sample; use the sample for plotting and teaching, not for headline estimates.
- wagepan covers employed men only, 1980–1987. Time-invariant traits (educ, black, hisp) drop under individual fixed effects; the linear exper term drops under two-way (individual + year) fixed effects because experience increments by one year for everyone (collinear with year dummies).
- FWL is exact only for linear regression. For logit/Poisson and other nonlinear models the residualized scatter is at best an approximation of the conditional relationship.
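The wagepan caveat about variables dropping out under fixed effects can be seen directly by demeaning. The sketch below uses a hypothetical three-person balanced panel (invented values, not rows from wagepan) and hand-written helpers. It shows that a time-invariant variable is identically zero after within-person demeaning, and that a variable that rises by one each year is identically zero after two-way (person and year) demeaning. The two-way formula used here, x − person mean − year mean + grand mean, is exact only for balanced panels.

```python
from collections import defaultdict

# Hypothetical mini-panel: 3 people observed every year 1980-1987 (balanced).
start = {
    1: {"educ": 12, "exper": 1},
    2: {"educ": 16, "exper": 0},
    3: {"educ": 10, "exper": 4},
}
rows = [
    {"nr": i, "year": t, "educ": p["educ"], "exper": p["exper"] + (t - 1980)}
    for i, p in start.items()
    for t in range(1980, 1988)
]

def group_means(rows, key, var):
    """Mean of `var` within each level of `key`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[key]] += r[var]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

def demean_one_way(rows, key, var):
    """Subtract the group mean: what a single fixed effect does to `var`."""
    m = group_means(rows, key, var)
    return [r[var] - m[r[key]] for r in rows]

def demean_two_way(rows, var):
    """x - person mean - year mean + grand mean (balanced panels only)."""
    by_person = group_means(rows, "nr", var)
    by_year = group_means(rows, "year", var)
    grand = sum(r[var] for r in rows) / len(rows)
    return [r[var] - by_person[r["nr"]] - by_year[r["year"]] + grand
            for r in rows]

educ_within = demean_one_way(rows, "nr", "educ")    # time-invariant trait
exper_within = demean_one_way(rows, "nr", "exper")  # still varies over time
exper_two_way = demean_two_way(rows, "exper")       # collinear with year

print("max |educ|  after person FE:        ", max(abs(v) for v in educ_within))
print("max |exper| after person FE:        ", max(abs(v) for v in exper_within))
print("max |exper| after person + year FE: ", max(abs(v) for v in exper_two_way))
```

Once a variable has no variation left after residualizing, there is nothing to plot or estimate, which is why its coefficient is not identified under those fixed effects.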