Data dictionary · Dynamic Panel Data Models: Employment Persistence

Downloads

Each dataset is available as a labeled Stata .dta file and as its source .csv file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`abdata`	firm-year	1,031 × 10	abdata.dta	abdata.csv
`data_prepared`	firm-year	1,031 × 20	data_prepared.dta	data_prepared.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
use "${BASE}abdata.dta", clear
describe
notes

Python

# Notebook (Colab/Jupyter) shell escape; in a terminal, run: pip install pyreadstat
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
df = pd.read_stata(BASE + "abdata.dta")

# load every dataset at once
files = ["abdata", "data_prepared"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "abdata.dta", "abdata.dta")
df, meta = pyreadstat.read_dta("abdata.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
df <- read_dta(paste0(BASE, "abdata.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates how persistent firm-level employment is, measured by the autoregressive coefficient ρ of a dynamic labor-demand equation. It uses the canonical Arellano and Bond (1991) panel of 140 UK manufacturing firms observed annually over 1976–1984 (1,031 firm-years, unbalanced). The tutorial walks the full estimator ladder: pooled OLS (biased upward by the omitted firm effect) and fixed effects (biased downward, the Nickell bias) via pyfixest, then Anderson–Hsiao IV, Arellano–Bond difference GMM, and Blundell–Bond system GMM via pydynpd. The AR(2), Hansen, and instrument-collapse diagnostics separate the one defensible estimate (system GMM, ρ̂ = 0.927) from four seductive wrong ones. This is the classic dynamic-panel teaching dataset, used by Arellano & Bond (1991), Blundell & Bond (1998), and Roodman (2009).

Two files. abdata is the raw input panel — one row per firm × year, unbalanced over 1976–1984 — carrying employment, wages, capital, and industry output in both levels and logs. data_prepared is the analysis sample: the same panel with the one-period lags, the two-period lag of employment, and the first differences (computed firm-by-firm, respecting firm boundaries) that every estimator runs on. Requiring a single lag drops the panel from 1,031 to 891 usable rows; the GMM estimators run on 751.
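The row counts follow directly from the shape of the panel. The sketch below is a minimal arithmetic check, assuming the firm counts reported under "Known limitations & caveats" (103 firms observed for 7 years, 23 for 8, and 14 for 9) and assuming each required lag or difference costs every firm exactly one leading year. The names `panel_shape` and `usable_rows` are illustrative and do not come from the tutorial.

```python
# Firms grouped by number of observed years: {years_observed: number_of_firms}.
# Counts are taken from the "Unbalanced panel" caveat in this data dictionary.
panel_shape = {7: 103, 8: 23, 9: 14}

def usable_rows(shape, years_lost):
    """Rows left when every firm loses its first `years_lost` observed years."""
    return sum(firms * max(years - years_lost, 0) for years, firms in shape.items())

for years_lost in range(4):
    print(years_lost, usable_rows(panel_shape, years_lost))
# 0 1031
# 1 891
# 2 751
# 3 611
```

Losing one year per firm removes 140 rows each time, which is why the counts step down by exactly 140.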

Data sources

Source	Provides	Reference / URL
Arellano & Bond (1991)	The original UK manufacturing employment panel (140 firms, 1976–1984) — the abdata dataset	Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Review of Economic Studies, 58(2), 277–297. https://doi.org/10.2307/2297968
pydynpd (Wu, Hua & Xu 2023)	Distribution of the dataset (bundled as abdata) and the published replication benchmark	Wu, D., Hua, L., & Xu, J. (2023). pydynpd: A Python package for dynamic panel model. Journal of Open Source Software, 8(83), 4416. https://doi.org/10.21105/joss.04416 — https://github.com/dazhwu/pydynpd
Method references	Estimators and concepts	Anderson & Hsiao (1981); Blundell & Bond (1998); Bond (2002); Roodman (2009); Windmeijer (2005); Nickell (1981).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Dynamic Panel Data Models in Python: From Nickell Bias to System GMM [Data set]. https://carlos-mendez.org/post/python_dynamic_panel/

Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Review of Economic Studies, 58(2), 277–297.

BibTeX

@misc{mendez2026pythondynamicpanel,
  author       = {Mendez, Carlos},
  title        = {Dynamic Panel Data Models in Python: From Nickell Bias to System GMM},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_dynamic_panel/}},
  note         = {Data set}
}

@article{arellano1991some,
  author  = {Arellano, Manuel and Bond, Stephen},
  title   = {Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations},
  journal = {Review of Economic Studies},
  volume  = {58}, number = {2}, pages = {277--297}, year = {1991},
  doi     = {10.2307/2297968}
}

Variable explorer: all 20 variables

Variable	Type	Label	Definition	Units	In files	Source
`cap`	continuous	Gross capital stock (level)	Firm gross capital stock (the level behind log capital k).	index/level	abdata, data_prepared	Arellano & Bond (1991)
`d_k`	continuous	First difference of log capital stock	Year-on-year change in the log capital stock.	log change	data_prepared	Derived (this study)
`d_k_lag1`	continuous	Lagged first difference of log capital stock	Last year's change in the log capital stock; a control in the differenced equation.	log change	data_prepared	Derived (this study)
`d_n`	continuous	First difference of log employment	Year-on-year change in log employment; the dependent variable of the differenced equation.	log change	data_prepared	Derived (this study)
`d_n_lag1`	continuous	Lagged first difference of log employment	Last year's change in log employment; the endogenous regressor in the Anderson-Hsiao 2SLS.	log change	data_prepared	Derived (this study)
`d_w`	continuous	First difference of log real wage	Year-on-year change in the log real wage.	log change	data_prepared	Derived (this study)
`d_w_lag1`	continuous	Lagged first difference of log real wage	Last year's change in the log real wage; a control in the differenced equation.	log change	data_prepared	Derived (this study)
`emp`	continuous	Employment (level)	Firm employment in thousands (the level behind log employment n).	thousands of employees	abdata, data_prepared	Arellano & Bond (1991)
`id`	identifier	Firm identifier	Sequential firm (panel unit) identifier; 140 UK manufacturing firms.	integer	abdata, data_prepared	Arellano & Bond (1991)
`indoutpt`	continuous	Industry output (level)	Industry-level output for the firm's sector (the level behind log industry output ys).	index/level	abdata, data_prepared	Arellano & Bond (1991)
`k`	continuous	Log capital stock	Natural log of the firm gross capital stock; a current control.	log level	abdata, data_prepared	Arellano & Bond (1991)
`k_lag1`	continuous	Log capital stock, one-period lag	Last year's log capital stock; a lagged control in the levels equation.	log level	data_prepared	Derived (this study)
`n`	continuous	Log employment	Natural log of firm employment; the dependent variable of the dynamic model.	log thousands	abdata, data_prepared	Arellano & Bond (1991)
`n_lag1`	continuous	Log employment, one-period lag	Last year's log employment; the lagged dependent variable carrying persistence rho (labeled L1.n in GMM output).	log thousands	data_prepared	Derived (this study)
`n_lag2`	continuous	Log employment, two-period lag	Log employment two years ago; the Anderson-Hsiao instrument for the differenced lag.	log thousands	data_prepared	Derived (this study)
`w`	continuous	Log real wage	Natural log of the firm real wage; a current control.	log level	abdata, data_prepared	Arellano & Bond (1991)
`w_lag1`	continuous	Log real wage, one-period lag	Last year's log real wage; a lagged control in the levels equation.	log level	data_prepared	Derived (this study)
`wage`	continuous	Real wage (level)	Firm real product wage (the level behind log wage w).	index/level	abdata, data_prepared	Arellano & Bond (1991)
`year`	year	Calendar year	Annual time index of the observation.	year	abdata, data_prepared	Arellano & Bond (1991)
`ys`	continuous	Log industry output	Natural log of industry output for the firm's sector (auxiliary variable).	log level	abdata, data_prepared	Arellano & Bond (1991)

Cross-file variable index

Which file each variable appears in (● = present).

Variable	abdata	data_prepared
`cap`	●	●
`d_k`		●
`d_k_lag1`		●
`d_n`		●
`d_n_lag1`		●
`d_w`		●
`d_w_lag1`		●
`emp`	●	●
`id`	●	●
`indoutpt`	●	●
`k`	●	●
`k_lag1`		●
`n`	●	●
`n_lag1`		●
`n_lag2`		●
`w`	●	●
`w_lag1`		●
`wage`	●	●
`year`	●	●
`ys`	●	●

Construction & formulas

The model is a dynamic labor-demand equation for log employment n, with current and lagged log real wages w and log capital k, a firm fixed effect α_i, year effects δ_t, and an idiosyncratic shock ε_it:

Levels → logs: n = log(emp), w = log(wage), k = log(cap), ys = log(indoutpt) — the estimation variables are the logged columns; the level columns are kept for reference.
Dynamic model: n_it = ρ·n_i,t-1 + β1·w_it + β2·w_i,t-1 + β3·k_it + β4·k_i,t-1 + α_i + δ_t + ε_it — ρ (the coefficient on n_lag1 / L1.n) is the persistence parameter of interest.
One-period lag (per firm): v_lag1 = v_i,t-1 for v ∈ {n, w, k}; n_lag2 = n_i,t-2 (the Anderson–Hsiao instrument).
First difference (per firm): d_v = v_it − v_i,t-1 for v ∈ {n, w, k}; d_v_lag1 = d_v_i,t-1. Differencing eliminates α_i exactly.
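Written out in full, the first difference of the dynamic model shows why α_i drops out. The block below restates the two equations with the symbols defined above; Δ denotes the within-firm first difference, so Δn_it is the column d_n and Δn_i,t-1 is d_n_lag1. The closing remark about the instrument assumes ε_it is serially uncorrelated, which is the condition the AR(2) test probes.

```latex
\begin{aligned}
n_{it} &= \rho\, n_{i,t-1} + \beta_1 w_{it} + \beta_2 w_{i,t-1}
          + \beta_3 k_{it} + \beta_4 k_{i,t-1}
          + \alpha_i + \delta_t + \varepsilon_{it} \\
\Delta n_{it} &= \rho\, \Delta n_{i,t-1} + \beta_1 \Delta w_{it} + \beta_2 \Delta w_{i,t-1}
          + \beta_3 \Delta k_{it} + \beta_4 \Delta k_{i,t-1}
          + \Delta\delta_t + \Delta\varepsilon_{it}
\end{aligned}
```

Because Δn_i,t-1 = n_i,t-1 − n_i,t-2 contains ε_i,t-1, which also sits inside Δε_it = ε_it − ε_i,t-1, the differenced lag is correlated with the differenced error. Under the assumption above, n_i,t-2 predates both shocks, which is why n_lag2 can serve as the Anderson–Hsiao instrument for d_n_lag1.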

All lags and differences are computed within firm id after sorting by [id, year], so no firm inherits a lag from another firm. The first observed year of every firm therefore has missing lags/differences.
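The within-firm construction can be sketched with pandas on a toy panel. This is a minimal illustration rather than the tutorial's own script: the data values are made up, and only the `groupby('id')[...].shift()` / `.diff()` pattern is taken from the variable dictionary on this page.

```python
import pandas as pd

# Toy two-firm panel (made-up values), deliberately unsorted.
toy = pd.DataFrame({
    "id":   [2, 1, 1, 2, 1, 2],
    "year": [1977, 1976, 1977, 1976, 1978, 1978],
    "n":    [2.1, 1.0, 1.2, 2.0, 1.5, 2.4],
})

# Sort first so shift/diff follow calendar order inside each firm.
toy = toy.sort_values(["id", "year"]).reset_index(drop=True)

by_firm = toy.groupby("id")["n"]
toy["n_lag1"] = by_firm.shift(1)   # n_{i,t-1}
toy["n_lag2"] = by_firm.shift(2)   # n_{i,t-2}
toy["d_n"] = by_firm.diff()        # n_it - n_{i,t-1}
toy["d_n_lag1"] = toy.groupby("id")["d_n"].shift(1)  # d_n_{i,t-1}

print(toy)
```

In the printed frame, firm 2's first row has a missing `n_lag1` instead of firm 1's last value, which is the firm-boundary behavior described above. Note that `shift` moves by row, not by calendar year, so this pattern matches the formulas only if each firm's years are consecutive. That is an assumption to check before reusing it on other panels.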

The datasets

Each dataset below is documented with its full variable dictionary and a statistics table with data coverage.

abdata · firm-year 1,031 × 10 · 1976-1984 · 140 firms (unbalanced; 1,031 firm-years)

Panel key: id × year · Raw input panel for the dynamic labor-demand estimators.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`id` identifier	Firm identifier	Sequential firm (panel unit) identifier; 140 UK manufacturing firms.	From the Arellano-Bond panel; the panel id passed to estimators as [id, year].	integer	Arellano & Bond (1991)	both files
`year` year	Calendar year	Annual time index of the observation.	From the Arellano-Bond panel; range 1976-1984.	year	Arellano & Bond (1991)	both files
`emp` continuous	Employment (level)	Firm employment in thousands (the level behind log employment n).	Raw level; n = log(emp).	thousands of employees	Arellano & Bond (1991)	both files
`wage` continuous	Real wage (level)	Firm real product wage (the level behind log wage w).	Raw level; w = log(wage).	index/level	Arellano & Bond (1991)	both files
`cap` continuous	Gross capital stock (level)	Firm gross capital stock (the level behind log capital k).	Raw level; k = log(cap).	index/level	Arellano & Bond (1991)	both files
`indoutpt` continuous	Industry output (level)	Industry-level output for the firm's sector (the level behind log industry output ys).	Raw level; ys = log(indoutpt).	index/level	Arellano & Bond (1991)	both files
`n` continuous	Log employment	Natural log of firm employment; the dependent variable of the dynamic model.	log(emp).	log thousands	Arellano & Bond (1991)	both files
`w` continuous	Log real wage	Natural log of the firm real wage; a current control.	log(wage).	log level	Arellano & Bond (1991)	both files
`k` continuous	Log capital stock	Natural log of the firm gross capital stock; a current control.	log(cap).	log level	Arellano & Bond (1991)	both files
`ys` continuous	Log industry output	Natural log of industry output for the firm's sector (auxiliary variable).	log(indoutpt).	log level	Arellano & Bond (1991)	both files

Statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`id`	100%	1,031	140	—	—	—	—	—
`year`	100%	1,031	9	1976	1979.7	1980	1984	2.22
`emp`	100%	1,031	955	0.104	7.89	2.29	108.6	15.93
`wage`	100%	1,031	1,029	8.02	23.92	24.01	45.23	5.65
`cap`	100%	1,031	1,001	0.012	2.51	0.518	47.11	6.25
`indoutpt`	100%	1,031	330	86.90	103.8	100.6	128.4	9.94
`n`	100%	1,031	955	-2.26	1.06	0.827	4.69	1.34
`w`	100%	1,031	1,029	2.08	3.14	3.18	3.81	0.263
`k`	100%	1,031	1,001	-4.43	-0.442	-0.658	3.85	1.51
`ys`	100%	1,031	330	4.46	4.64	4.61	4.85	0.094

data_prepared · firm-year 1,031 × 20 · 1976-1984 · 140 firms (1,031 rows; lag/difference columns missing in each firm's first year(s))

Panel key: id × year · Analysis sample with identically built lags/differences, so every estimator runs on the same transformed variables.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`id` identifier	Firm identifier	Sequential firm (panel unit) identifier; 140 UK manufacturing firms.	From the Arellano-Bond panel; the panel id passed to estimators as [id, year].	integer	Arellano & Bond (1991)	both files
`year` year	Calendar year	Annual time index of the observation.	From the Arellano-Bond panel; range 1976-1984.	year	Arellano & Bond (1991)	both files
`emp` continuous	Employment (level)	Firm employment in thousands (the level behind log employment n).	Raw level; n = log(emp).	thousands of employees	Arellano & Bond (1991)	both files
`wage` continuous	Real wage (level)	Firm real product wage (the level behind log wage w).	Raw level; w = log(wage).	index/level	Arellano & Bond (1991)	both files
`cap` continuous	Gross capital stock (level)	Firm gross capital stock (the level behind log capital k).	Raw level; k = log(cap).	index/level	Arellano & Bond (1991)	both files
`indoutpt` continuous	Industry output (level)	Industry-level output for the firm's sector (the level behind log industry output ys).	Raw level; ys = log(indoutpt).	index/level	Arellano & Bond (1991)	both files
`n` continuous	Log employment	Natural log of firm employment; the dependent variable of the dynamic model.	log(emp).	log thousands	Arellano & Bond (1991)	both files
`w` continuous	Log real wage	Natural log of the firm real wage; a current control.	log(wage).	log level	Arellano & Bond (1991)	both files
`k` continuous	Log capital stock	Natural log of the firm gross capital stock; a current control.	log(cap).	log level	Arellano & Bond (1991)	both files
`ys` continuous	Log industry output	Natural log of industry output for the firm's sector (auxiliary variable).	log(indoutpt).	log level	Arellano & Bond (1991)	both files
`n_lag1` continuous	Log employment, one-period lag	Last year's log employment; the lagged dependent variable carrying persistence rho (labeled L1.n in GMM output).	Within firm id: n_i,t-1 = groupby('id')['n'].shift(1).	log thousands	Derived (this study)	data_prepared
`w_lag1` continuous	Log real wage, one-period lag	Last year's log real wage; a lagged control in the levels equation.	Within firm id: w_i,t-1 = groupby('id')['w'].shift(1).	log level	Derived (this study)	data_prepared
`k_lag1` continuous	Log capital stock, one-period lag	Last year's log capital stock; a lagged control in the levels equation.	Within firm id: k_i,t-1 = groupby('id')['k'].shift(1).	log level	Derived (this study)	data_prepared
`n_lag2` continuous	Log employment, two-period lag	Log employment two years ago; the Anderson-Hsiao instrument for the differenced lag.	Within firm id: n_i,t-2 = groupby('id')['n'].shift(2).	log thousands	Derived (this study)	data_prepared
`d_n` continuous	First difference of log employment	Year-on-year change in log employment; the dependent variable of the differenced equation.	Within firm id: n_it - n_i,t-1 = groupby('id')['n'].diff().	log change	Derived (this study)	data_prepared
`d_w` continuous	First difference of log real wage	Year-on-year change in the log real wage.	Within firm id: w_it - w_i,t-1 = groupby('id')['w'].diff().	log change	Derived (this study)	data_prepared
`d_k` continuous	First difference of log capital stock	Year-on-year change in the log capital stock.	Within firm id: k_it - k_i,t-1 = groupby('id')['k'].diff().	log change	Derived (this study)	data_prepared
`d_n_lag1` continuous	Lagged first difference of log employment	Last year's change in log employment; the endogenous regressor in the Anderson-Hsiao 2SLS.	Within firm id: d_n_i,t-1 = groupby('id')['d_n'].shift(1).	log change	Derived (this study)	data_prepared
`d_w_lag1` continuous	Lagged first difference of log real wage	Last year's change in the log real wage; a control in the differenced equation.	Within firm id: d_w_i,t-1 = groupby('id')['d_w'].shift(1).	log change	Derived (this study)	data_prepared
`d_k_lag1` continuous	Lagged first difference of log capital stock	Last year's change in the log capital stock; a control in the differenced equation.	Within firm id: d_k_i,t-1 = groupby('id')['d_k'].shift(1).	log change	Derived (this study)	data_prepared

Statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`id`	100%	1,031	140	—	—	—	—	—
`year`	100%	1,031	9	1976	1979.7	1980	1984	2.22
`emp`	100%	1,031	955	0.104	7.89	2.29	108.6	15.93
`wage`	100%	1,031	1,029	8.02	23.92	24.01	45.23	5.65
`cap`	100%	1,031	1,001	0.012	2.51	0.518	47.11	6.25
`indoutpt`	100%	1,031	330	86.90	103.8	100.6	128.4	9.94
`n`	100%	1,031	955	-2.26	1.06	0.827	4.69	1.34
`w`	100%	1,031	1,029	2.08	3.14	3.18	3.81	0.263
`k`	100%	1,031	1,001	-4.43	-0.442	-0.658	3.85	1.51
`ys`	100%	1,031	330	4.46	4.64	4.61	4.85	0.094
`n_lag1`	86%	891	832	-2.10	1.08	0.857	4.69	1.34
`w_lag1`	86%	891	889	2.08	3.13	3.17	3.81	0.264
`k_lag1`	86%	891	866	-4.43	-0.413	-0.631	3.85	1.50
`n_lag2`	73%	751	702	-2.08	1.11	0.882	4.69	1.33
`d_n`	86%	891	886	-0.997	-0.044	-0.025	0.806	0.138
`d_w`	86%	891	891	-0.675	0.006	0.005	0.924	0.088
`d_k`	86%	891	891	-1.08	-0.036	-0.045	0.884	0.162
`d_n_lag1`	73%	751	746	-0.997	-0.038	-0.019	0.806	0.140
`d_w_lag1`	73%	751	751	-0.675	0.002	0.00076	0.924	0.090
`d_k_lag1`	73%	751	751	-1.08	-0.025	-0.030	0.884	0.163
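A table of this shape can be rebuilt from a loaded file with a few pandas reductions. The sketch below is one plausible way to do it, not the page's own generator: `summarize` is a hypothetical helper, the toy values are made up, and the use of the sample standard deviation (pandas' default, ddof=1) is an assumption about the SD column's convention.

```python
import pandas as pd

def summarize(df):
    """One row per column: coverage and basic statistics, ignoring missing values."""
    return pd.DataFrame({
        "N": df.count(),                                        # non-missing rows
        "Coverage": (df.notna().mean() * 100).round().astype(int),  # percent non-missing
        "Distinct": df.nunique(),
        "Min": df.min(),
        "Mean": df.mean(),
        "Median": df.median(),
        "Max": df.max(),
        "SD": df.std(),  # sample SD (ddof=1); assumed convention
    })

# Toy frame: two firms x two years, so each firm's first lag is missing.
nan = float("nan")
toy = pd.DataFrame({"n": [1.0, 1.2, 1.5, 2.0], "n_lag1": [nan, 1.0, nan, 1.5]})
table = summarize(toy)
print(table)

# Coverage arithmetic for the counts reported in the table above.
print(round(100 * 891 / 1031), round(100 * 751 / 1031))  # 86 73
```

Applied to data_prepared, the same function would take the frame returned by `pd.read_stata` after dropping the `id` column, for which means and SDs are not meaningful.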

Known limitations & caveats

Teaching dataset. This is a methods showcase on 1970s–80s UK manufacturing, not a current estimate of employment dynamics; the value is in the workflow, not the era-specific numbers.
Each lag burns data. With T as small as 7–9 per firm, every lag/difference costs each firm its first observed year(s): the panel falls from 1,031 rows to 891 (one lag), to 751 (GMM), to 611 (the two-lag replication spec). The lag/difference columns in data_prepared are missing for those dropped firm-years.
Unbalanced panel. 103 firms appear for 7 years, 23 for 8, and 14 for all 9; estimators handle this natively but observation counts vary by specification.
Variance is lopsided. The between-firm SD of log employment (1.339) is about seven times the within-firm SD (0.195), so the unobserved firm level dominates — which is exactly why naive estimators fail and why dynamic-panel GMM is needed.
Headline caveat. The system-GMM 95% CI [0.773, 1.081] includes ρ = 1 (a unit root), so 'employment is stationary' is not a defensible claim; only the point estimate (ρ̂ = 0.927) and the interval's lower bound (0.773) are.
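The between/within comparison in the "Variance is lopsided" caveat (1.339 / 0.195 ≈ 6.9) can be reproduced in spirit on a toy panel. The sketch below uses one common definition, the SD of firm means for "between" and the SD of deviations from each firm's own mean for "within". Whether the tutorial uses exactly this definition and degrees-of-freedom convention is an assumption, and the data are made up.

```python
import pandas as pd

# Toy balanced panel: firms differ a lot in level, little over time.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "n":  [1.0, 1.1, 1.2, 3.0, 3.1, 3.2, 5.0, 5.1, 5.2],
})

# Each row's own-firm average, aligned to the original rows.
firm_mean = toy.groupby("id")["n"].transform("mean")

# Between: spread of firm averages (one value per firm).
between_sd = toy.groupby("id")["n"].mean().std()
# Within: spread of each observation around its own firm's average.
within_sd = (toy["n"] - firm_mean).std()

print(f"between SD = {between_sd:.3f}, within SD = {within_sd:.3f}")
print(f"ratio = {between_sd / within_sd:.1f}")
```

When the ratio is large, most of the variation in n is a fixed firm level, which is the α_i that pooled OLS ignores and that the within and difference transformations remove.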