Downloads
Each dataset is available as a labeled Stata .dta file and as its source .csv file.
⇩ Download all data (ZIP), stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| abdata | firm-year | 1,031 × 10 | abdata.dta | abdata.csv |
| data_prepared | firm-year | 1,031 × 20 | data_prepared.dta | data_prepared.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
use "${BASE}abdata.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
df = pd.read_stata(BASE + "abdata.dta")
# load every dataset at once
files = ["abdata", "data_prepared"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "abdata.dta", "abdata.dta")
df, meta = pyreadstat.read_dta("abdata.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
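The variable labels stored in a .dta file can also be read with pandas alone. The following is a minimal sketch, not the page's own workflow: it writes a small toy file locally (the `toy` frame and `toy_labels` dictionary are hypothetical stand-ins, with labels borrowed from the variable dictionary) and reads the labels back through the reader that `pd.read_stata(..., iterator=True)` returns.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy panel standing in for abdata.dta (not the real data).
toy = pd.DataFrame({"id": [1, 1, 2], "year": [1976, 1977, 1976], "n": [1.0, 1.1, 2.0]})
toy_labels = {"id": "Firm identifier", "year": "Calendar year", "n": "Log employment"}

# Write a labeled .dta file to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "toy.dta")
toy.to_stata(path, write_index=False, variable_labels=toy_labels)

# iterator=True returns a reader object that exposes the stored labels.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()  # dict: column name -> label
    df_toy = reader.read()             # the data itself

print(labels)
```

Replacing `path` with a downloaded copy of abdata.dta should work the same way, assuming the labels were written into that file.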
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dynamic_panel/data/"
df <- read_dta(paste0(BASE, "abdata.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that estimates how persistent firm-level employment is — the autoregressive coefficient ρ of a dynamic labor-demand equation — using the canonical Arellano and Bond (1991) panel of 140 UK manufacturing firms observed annually over 1976–1984 (1,031 firm-years, unbalanced). The tutorial walks the full estimator ladder: pooled OLS (biased up by the omitted firm effect) and fixed effects (Nickell bias, biased down) via pyfixest, then Anderson–Hsiao IV, Arellano–Bond difference GMM and Blundell–Bond system GMM via pydynpd, with the AR(2), Hansen, and instrument-collapse diagnostics that separate the one defensible estimate (system GMM, ρ̂ = 0.927) from four seductive wrong ones. This dataset is the classic dynamic-panel teaching dataset, used by Arellano & Bond, Blundell & Bond (1998), and Roodman (2009).
abdata is the raw input panel — one row per firm × year, unbalanced over 1976–1984 — carrying employment, wages, capital, and industry output in both levels and logs. data_prepared is the analysis sample: the same panel with the one-period lags, the two-period lag of employment, and the first differences (computed firm-by-firm, respecting firm boundaries) that every estimator runs on. Requiring a single lag drops the panel from 1,031 to 891 usable rows; the GMM estimators run on 751.
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Arellano & Bond (1991) | The original UK manufacturing employment panel (140 firms, 1976–1984) — the abdata dataset | Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Review of Economic Studies, 58(2), 277–297. https://doi.org/10.2307/2297968 |
| pydynpd (Wu, Hua & Xu 2023) | Distribution of the dataset (bundled as abdata) and the published replication benchmark | Wu, D., Hua, L., & Xu, J. (2023). pydynpd: A Python package for dynamic panel model. Journal of Open Source Software, 8(83), 4416. https://doi.org/10.21105/joss.04416 — https://github.com/dazhwu/pydynpd |
| Method references | Estimators and concepts | Anderson & Hsiao (1981); Blundell & Bond (1998); Bond (2002); Roodman (2009); Windmeijer (2005); Nickell (1981). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Dynamic Panel Data Models in Python: From Nickell Bias to System GMM [Data set]. https://carlos-mendez.org/post/python_dynamic_panel/
Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Review of Economic Studies, 58(2), 277–297.
BibTeX
@misc{mendez2026pythondynamicpanel,
author = {Mendez, Carlos},
title = {Dynamic Panel Data Models in Python: From Nickell Bias to System GMM},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_dynamic_panel/}},
note = {Data set}
}
@article{arellano1991some,
author = {Arellano, Manuel and Bond, Stephen},
title = {Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations},
journal = {Review of Economic Studies},
volume = {58}, number = {2}, pages = {277--297}, year = {1991},
doi = {10.2307/2297968}
}
Variable explorer: all 20 variables
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| cap | continuous | Gross capital stock (level) | Firm gross capital stock (the level behind log capital k). | index/level | abdata, data_prepared | Arellano & Bond (1991) |
| d_k | continuous | First difference of log capital stock | Year-on-year change in the log capital stock. | log change | data_prepared | Derived (this study) |
| d_k_lag1 | continuous | Lagged first difference of log capital stock | Last year's change in the log capital stock; a control in the differenced equation. | log change | data_prepared | Derived (this study) |
| d_n | continuous | First difference of log employment | Year-on-year change in log employment; the dependent variable of the differenced equation. | log change | data_prepared | Derived (this study) |
| d_n_lag1 | continuous | Lagged first difference of log employment | Last year's change in log employment; the endogenous regressor in the Anderson-Hsiao 2SLS. | log change | data_prepared | Derived (this study) |
| d_w | continuous | First difference of log real wage | Year-on-year change in the log real wage. | log change | data_prepared | Derived (this study) |
| d_w_lag1 | continuous | Lagged first difference of log real wage | Last year's change in the log real wage; a control in the differenced equation. | log change | data_prepared | Derived (this study) |
| emp | continuous | Employment (level) | Firm employment in thousands (the level behind log employment n). | thousands of employees | abdata, data_prepared | Arellano & Bond (1991) |
| id | identifier | Firm identifier | Sequential firm (panel unit) identifier; 140 UK manufacturing firms. | integer | abdata, data_prepared | Arellano & Bond (1991) |
| indoutpt | continuous | Industry output (level) | Industry-level output for the firm's sector (the level behind log industry output ys). | index/level | abdata, data_prepared | Arellano & Bond (1991) |
| k | continuous | Log capital stock | Natural log of the firm gross capital stock; a current control. | log level | abdata, data_prepared | Arellano & Bond (1991) |
| k_lag1 | continuous | Log capital stock, one-period lag | Last year's log capital stock; a lagged control in the levels equation. | log level | data_prepared | Derived (this study) |
| n | continuous | Log employment | Natural log of firm employment; the dependent variable of the dynamic model. | log thousands | abdata, data_prepared | Arellano & Bond (1991) |
| n_lag1 | continuous | Log employment, one-period lag | Last year's log employment; the lagged dependent variable carrying persistence rho (labeled L1.n in GMM output). | log thousands | data_prepared | Derived (this study) |
| n_lag2 | continuous | Log employment, two-period lag | Log employment two years ago; the Anderson-Hsiao instrument for the differenced lag. | log thousands | data_prepared | Derived (this study) |
| w | continuous | Log real wage | Natural log of the firm real wage; a current control. | log level | abdata, data_prepared | Arellano & Bond (1991) |
| w_lag1 | continuous | Log real wage, one-period lag | Last year's log real wage; a lagged control in the levels equation. | log level | data_prepared | Derived (this study) |
| wage | continuous | Real wage (level) | Firm real product wage (the level behind log wage w). | index/level | abdata, data_prepared | Arellano & Bond (1991) |
| year | year | Calendar year | Annual time index of the observation. | year | abdata, data_prepared | Arellano & Bond (1991) |
| ys | continuous | Log industry output | Natural log of industry output for the firm's sector (auxiliary variable). | log level | abdata, data_prepared | Arellano & Bond (1991) |
Cross-file variable index
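Which file each variable appears in (● = present).
| Variable | abdata | data_prepared |
|---|---|---|
| id | ● | ● |
| year | ● | ● |
| emp | ● | ● |
| wage | ● | ● |
| cap | ● | ● |
| indoutpt | ● | ● |
| n | ● | ● |
| w | ● | ● |
| k | ● | ● |
| ys | ● | ● |
| n_lag1 | | ● |
| w_lag1 | | ● |
| k_lag1 | | ● |
| n_lag2 | | ● |
| d_n | | ● |
| d_w | | ● |
| d_k | | ● |
| d_n_lag1 | | ● |
| d_w_lag1 | | ● |
| d_k_lag1 | | ● |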
Construction & formulas
The model is a dynamic labor-demand equation for log employment n, with current and lagged log real wages w and log capital k, a firm fixed effect α_i, year effects δ_t, and an idiosyncratic shock ε_it:
- Levels → logs: n = log(emp), w = log(wage), k = log(cap), ys = log(indoutpt). The estimation variables are the logged columns; the level columns are kept for reference.
- Dynamic model: n_it = ρ·n_i,t-1 + β1·w_it + β2·w_i,t-1 + β3·k_it + β4·k_i,t-1 + α_i + δ_t + ε_it. Here ρ (the coefficient on n_lag1 / L1.n) is the persistence parameter of interest.
- One-period lag (per firm): v_lag1 = v_i,t-1 for v ∈ {n, w, k}; n_lag2 = n_i,t-2 (the Anderson–Hsiao instrument).
- First difference (per firm): d_v = v_it − v_i,t-1 for v ∈ {n, w, k}; d_v_lag1 = d_v_i,t-1. Differencing eliminates α_i exactly.
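To make the last point concrete, the worked step below subtracts the dynamic model at t-1 from the model at t, using only the symbols defined in the list. It is a sketch of the standard first-difference argument, not a derivation quoted from the tutorial.

```latex
\begin{aligned}
n_{it} - n_{i,t-1} &= \rho\,(n_{i,t-1} - n_{i,t-2}) + \beta_1 (w_{it} - w_{i,t-1}) + \beta_2 (w_{i,t-1} - w_{i,t-2}) \\
&\quad + \beta_3 (k_{it} - k_{i,t-1}) + \beta_4 (k_{i,t-1} - k_{i,t-2}) \\
&\quad + (\alpha_i - \alpha_i) + (\delta_t - \delta_{t-1}) + (\varepsilon_{it} - \varepsilon_{i,t-1}) \\[4pt]
\Delta n_{it} &= \rho\,\Delta n_{i,t-1} + \beta_1 \Delta w_{it} + \beta_2 \Delta w_{i,t-1} + \beta_3 \Delta k_{it} + \beta_4 \Delta k_{i,t-1} + \Delta\delta_t + \Delta\varepsilon_{it}
\end{aligned}
```

The firm effect cancels because α_i does not vary over time. In the column names of data_prepared, Δn_it is d_n, Δn_i,t-1 is d_n_lag1, and so on. Since Δn_i,t-1 contains n_i,t-1, which depends on ε_i,t-1, it is correlated with Δε_it; this is why d_n_lag1 is treated as endogenous and n_lag2 serves as its instrument.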
All lags and differences are computed within firm id after sorting by
[id, year], so no firm inherits a lag from another firm. The first observed year of
every firm therefore has missing lags/differences.
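The within-firm construction can be sketched in pandas as follows. This is an illustration on a hypothetical two-firm toy panel, not the tutorial's actual preparation script; the function name `add_lags_and_differences` is made up here, and the sketch assumes years are consecutive within each firm so that a one-row shift equals a one-year lag.

```python
import pandas as pd


def add_lags_and_differences(df: pd.DataFrame) -> pd.DataFrame:
    """Add within-firm lags and first differences of n, w, k (illustrative sketch)."""
    out = df.sort_values(["id", "year"]).copy()
    for v in ["n", "w", "k"]:
        # One-period lag and first difference, computed firm by firm.
        out[f"{v}_lag1"] = out.groupby("id")[v].shift(1)
        out[f"d_{v}"] = out.groupby("id")[v].diff()
    # Two-period lag of log employment.
    out["n_lag2"] = out.groupby("id")["n"].shift(2)
    for v in ["n", "w", "k"]:
        # Lagged first difference, again within firm.
        out[f"d_{v}_lag1"] = out.groupby("id")[f"d_{v}"].shift(1)
    return out


# Hypothetical toy panel: two firms, three consecutive years each.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "year": [1976, 1977, 1978] * 2,
    "n": [1.0, 1.1, 1.3, 2.0, 1.9, 1.7],
    "w": [3.0, 3.1, 3.2, 3.5, 3.4, 3.6],
    "k": [0.5, 0.6, 0.8, -0.2, -0.1, 0.0],
})
prepared = add_lags_and_differences(toy)
print(prepared[["id", "year", "n", "n_lag1", "n_lag2", "d_n", "d_n_lag1"]])
```

Grouping by `id` before every shift is what keeps firm 2's first year from inheriting firm 1's last value: each firm's first row has a missing `n_lag1`, and its first two rows have a missing `n_lag2` and `d_n_lag1`.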
The datasets
Each dataset below has a full variable dictionary plus a statistics table with data coverage.
Variable dictionary: abdata
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| id | identifier | Firm identifier | Sequential firm (panel unit) identifier; 140 UK manufacturing firms. | From the Arellano-Bond panel; the panel id passed to estimators as [id, year]. | integer | Arellano & Bond (1991) | both files |
| year | year | Calendar year | Annual time index of the observation. | From the Arellano-Bond panel; range 1976-1984. | year | Arellano & Bond (1991) | both files |
| emp | continuous | Employment (level) | Firm employment in thousands (the level behind log employment n). | Raw level; n = log(emp). | thousands of employees | Arellano & Bond (1991) | both files |
| wage | continuous | Real wage (level) | Firm real product wage (the level behind log wage w). | Raw level; w = log(wage). | index/level | Arellano & Bond (1991) | both files |
| cap | continuous | Gross capital stock (level) | Firm gross capital stock (the level behind log capital k). | Raw level; k = log(cap). | index/level | Arellano & Bond (1991) | both files |
| indoutpt | continuous | Industry output (level) | Industry-level output for the firm's sector (the level behind log industry output ys). | Raw level; ys = log(indoutpt). | index/level | Arellano & Bond (1991) | both files |
| n | continuous | Log employment | Natural log of firm employment; the dependent variable of the dynamic model. | log(emp). | log thousands | Arellano & Bond (1991) | both files |
| w | continuous | Log real wage | Natural log of the firm real wage; a current control. | log(wage). | log level | Arellano & Bond (1991) | both files |
| k | continuous | Log capital stock | Natural log of the firm gross capital stock; a current control. | log(cap). | log level | Arellano & Bond (1991) | both files |
| ys | continuous | Log industry output | Natural log of industry output for the firm's sector (auxiliary variable). | log(indoutpt). | log level | Arellano & Bond (1991) | both files |
Statistics: abdata
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| id | 100% | 1,031 | 140 | – | – | – | – | – |
| year | 100% | 1,031 | 9 | 1976 | 1979.7 | 1980 | 1984 | 2.22 |
| emp | 100% | 1,031 | 955 | 0.104 | 7.89 | 2.29 | 108.6 | 15.93 |
| wage | 100% | 1,031 | 1,029 | 8.02 | 23.92 | 24.01 | 45.23 | 5.65 |
| cap | 100% | 1,031 | 1,001 | 0.012 | 2.51 | 0.518 | 47.11 | 6.25 |
| indoutpt | 100% | 1,031 | 330 | 86.90 | 103.8 | 100.6 | 128.4 | 9.94 |
| n | 100% | 1,031 | 955 | -2.26 | 1.06 | 0.827 | 4.69 | 1.34 |
| w | 100% | 1,031 | 1,029 | 2.08 | 3.14 | 3.18 | 3.81 | 0.263 |
| k | 100% | 1,031 | 1,001 | -4.43 | -0.442 | -0.658 | 3.85 | 1.51 |
| ys | 100% | 1,031 | 330 | 4.46 | 4.64 | 4.61 | 4.85 | 0.094 |
Variable dictionary: data_prepared
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| id | identifier | Firm identifier | Sequential firm (panel unit) identifier; 140 UK manufacturing firms. | From the Arellano-Bond panel; the panel id passed to estimators as [id, year]. | integer | Arellano & Bond (1991) | both files |
| year | year | Calendar year | Annual time index of the observation. | From the Arellano-Bond panel; range 1976-1984. | year | Arellano & Bond (1991) | both files |
| emp | continuous | Employment (level) | Firm employment in thousands (the level behind log employment n). | Raw level; n = log(emp). | thousands of employees | Arellano & Bond (1991) | both files |
| wage | continuous | Real wage (level) | Firm real product wage (the level behind log wage w). | Raw level; w = log(wage). | index/level | Arellano & Bond (1991) | both files |
| cap | continuous | Gross capital stock (level) | Firm gross capital stock (the level behind log capital k). | Raw level; k = log(cap). | index/level | Arellano & Bond (1991) | both files |
| indoutpt | continuous | Industry output (level) | Industry-level output for the firm's sector (the level behind log industry output ys). | Raw level; ys = log(indoutpt). | index/level | Arellano & Bond (1991) | both files |
| n | continuous | Log employment | Natural log of firm employment; the dependent variable of the dynamic model. | log(emp). | log thousands | Arellano & Bond (1991) | both files |
| w | continuous | Log real wage | Natural log of the firm real wage; a current control. | log(wage). | log level | Arellano & Bond (1991) | both files |
| k | continuous | Log capital stock | Natural log of the firm gross capital stock; a current control. | log(cap). | log level | Arellano & Bond (1991) | both files |
| ys | continuous | Log industry output | Natural log of industry output for the firm's sector (auxiliary variable). | log(indoutpt). | log level | Arellano & Bond (1991) | both files |
| n_lag1 | continuous | Log employment, one-period lag | Last year's log employment; the lagged dependent variable carrying persistence rho (labeled L1.n in GMM output). | Within firm id: n_i,t-1 = groupby('id')['n'].shift(1). | log thousands | Derived (this study) | data_prepared |
| w_lag1 | continuous | Log real wage, one-period lag | Last year's log real wage; a lagged control in the levels equation. | Within firm id: w_i,t-1 = groupby('id')['w'].shift(1). | log level | Derived (this study) | data_prepared |
| k_lag1 | continuous | Log capital stock, one-period lag | Last year's log capital stock; a lagged control in the levels equation. | Within firm id: k_i,t-1 = groupby('id')['k'].shift(1). | log level | Derived (this study) | data_prepared |
| n_lag2 | continuous | Log employment, two-period lag | Log employment two years ago; the Anderson-Hsiao instrument for the differenced lag. | Within firm id: n_i,t-2 = groupby('id')['n'].shift(2). | log thousands | Derived (this study) | data_prepared |
| d_n | continuous | First difference of log employment | Year-on-year change in log employment; the dependent variable of the differenced equation. | Within firm id: n_it - n_i,t-1 = groupby('id')['n'].diff(). | log change | Derived (this study) | data_prepared |
| d_w | continuous | First difference of log real wage | Year-on-year change in the log real wage. | Within firm id: w_it - w_i,t-1 = groupby('id')['w'].diff(). | log change | Derived (this study) | data_prepared |
| d_k | continuous | First difference of log capital stock | Year-on-year change in the log capital stock. | Within firm id: k_it - k_i,t-1 = groupby('id')['k'].diff(). | log change | Derived (this study) | data_prepared |
| d_n_lag1 | continuous | Lagged first difference of log employment | Last year's change in log employment; the endogenous regressor in the Anderson-Hsiao 2SLS. | Within firm id: d_n_i,t-1 = groupby('id')['d_n'].shift(1). | log change | Derived (this study) | data_prepared |
| d_w_lag1 | continuous | Lagged first difference of log real wage | Last year's change in the log real wage; a control in the differenced equation. | Within firm id: d_w_i,t-1 = groupby('id')['d_w'].shift(1). | log change | Derived (this study) | data_prepared |
| d_k_lag1 | continuous | Lagged first difference of log capital stock | Last year's change in the log capital stock; a control in the differenced equation. | Within firm id: d_k_i,t-1 = groupby('id')['d_k'].shift(1). | log change | Derived (this study) | data_prepared |
Statistics: data_prepared
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| id | 100% | 1,031 | 140 | – | – | – | – | – |
| year | 100% | 1,031 | 9 | 1976 | 1979.7 | 1980 | 1984 | 2.22 |
| emp | 100% | 1,031 | 955 | 0.104 | 7.89 | 2.29 | 108.6 | 15.93 |
| wage | 100% | 1,031 | 1,029 | 8.02 | 23.92 | 24.01 | 45.23 | 5.65 |
| cap | 100% | 1,031 | 1,001 | 0.012 | 2.51 | 0.518 | 47.11 | 6.25 |
| indoutpt | 100% | 1,031 | 330 | 86.90 | 103.8 | 100.6 | 128.4 | 9.94 |
| n | 100% | 1,031 | 955 | -2.26 | 1.06 | 0.827 | 4.69 | 1.34 |
| w | 100% | 1,031 | 1,029 | 2.08 | 3.14 | 3.18 | 3.81 | 0.263 |
| k | 100% | 1,031 | 1,001 | -4.43 | -0.442 | -0.658 | 3.85 | 1.51 |
| ys | 100% | 1,031 | 330 | 4.46 | 4.64 | 4.61 | 4.85 | 0.094 |
| n_lag1 | 86% | 891 | 832 | -2.10 | 1.08 | 0.857 | 4.69 | 1.34 |
| w_lag1 | 86% | 891 | 889 | 2.08 | 3.13 | 3.17 | 3.81 | 0.264 |
| k_lag1 | 86% | 891 | 866 | -4.43 | -0.413 | -0.631 | 3.85 | 1.50 |
| n_lag2 | 73% | 751 | 702 | -2.08 | 1.11 | 0.882 | 4.69 | 1.33 |
| d_n | 86% | 891 | 886 | -0.997 | -0.044 | -0.025 | 0.806 | 0.138 |
| d_w | 86% | 891 | 891 | -0.675 | 0.006 | 0.005 | 0.924 | 0.088 |
| d_k | 86% | 891 | 891 | -1.08 | -0.036 | -0.045 | 0.884 | 0.162 |
| d_n_lag1 | 73% | 751 | 746 | -0.997 | -0.038 | -0.019 | 0.806 | 0.140 |
| d_w_lag1 | 73% | 751 | 751 | -0.675 | 0.002 | 7.59e-04 | 0.924 | 0.090 |
| d_k_lag1 | 73% | 751 | 751 | -1.08 | -0.025 | -0.030 | 0.884 | 0.163 |
Known limitations & caveats
- Teaching dataset. This is a methods showcase on 1970s–80s UK manufacturing, not a current estimate of employment dynamics; the value is in the workflow, not the era-specific numbers.
- Each lag burns data. With T as small as 7–9 per firm, every lag/difference costs each firm its first observed year(s): the panel falls from 1,031 rows to 891 (one lag), to 751 (GMM), to 611 (the two-lag replication spec). The lag/difference columns in data_prepared are missing for those dropped firm-years.
- Unbalanced panel. 103 firms appear for 7 years, 23 for 8, and 14 for all 9; estimators handle this natively but observation counts vary by specification.
- Variance is lopsided. The between-firm SD of log employment (1.339) is about seven times the within-firm SD (0.195), so the unobserved firm level dominates — which is exactly why naive estimators fail and why dynamic-panel GMM is needed.
- Headline caveat. The system-GMM 95% CI [0.773, 1.081] includes the unit root (ρ = 1), so 'employment is stationary' is not a defensible claim; only the point estimate and its lower bound are defensible.
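The row counts quoted in the caveats follow from the panel's shape alone. The short check below is a sketch in plain Python: the firm counts come from the "Unbalanced panel" item, while `usable_rows` and `lost_per_firm` are hypothetical names introduced here. It assumes each specification simply drops a fixed number of leading observations from every firm.

```python
# Firm counts by number of observed years, as stated in the caveats.
firms_by_length = {7: 103, 8: 23, 9: 14}


def usable_rows(firms_by_length: dict, lost_per_firm: int) -> int:
    """Rows left when every firm loses its first `lost_per_firm` observations."""
    return sum(
        count * max(years - lost_per_firm, 0)
        for years, count in firms_by_length.items()
    )


n_firms = sum(firms_by_length.values())
rows_by_loss = {lost: usable_rows(firms_by_length, lost) for lost in range(4)}

print(n_firms)       # 140
print(rows_by_loss)  # {0: 1031, 1: 891, 2: 751, 3: 611}
```

Each step down costs exactly 140 rows, one per firm, which matches the sequence 1,031, 891, 751, 611 reported for the full panel, the one-lag sample, the GMM sample, and the two-lag replication spec.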