Data dictionary · Taming Model Uncertainty in the Environmental Kuznets Curve

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`synthetic_ekc_panel`	country-year	1,600 × 18	synthetic_ekc_panel.dta	synthetic_ekc_panel.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
use "${BASE}synthetic_ekc_panel.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df = pd.read_stata(BASE + "synthetic_ekc_panel.dta")

# load every dataset at once
files = ["synthetic_ekc_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic_ekc_panel.dta", "synthetic_ekc_panel.dta")
df, meta = pyreadstat.read_dta("synthetic_ekc_panel.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df <- read_dta(paste0(BASE, "synthetic_ekc_panel.dta"))

Overview & sources

Companion data for a Stata tutorial that confronts the model-uncertainty problem in testing the Environmental Kuznets Curve (EKC). With 12 candidate control variables there are 2¹² = 4,096 possible regressions on log CO₂ per capita, and choosing one is a hidden assumption. The tutorial applies two principled solutions, Bayesian Model Averaging (BMA, via Stata’s bmaregress) and Post-Double-Selection LASSO (DSL, via dsregress), to estimate the inverted-N cubic relationship between log CO₂ and log GDP. The data is fully synthetic, built from a known data-generating process in which 5 controls truly affect emissions and 7 are pure noise, so each method can be graded against ground truth. With country and year fixed effects, BMA recovers the GDP cubic almost exactly and flags 6 of 8 true predictors (the three GDP terms plus three of the five true controls) at PIP ≥ 0.80 with zero false positives; DSL produces fast cluster-robust estimates. Stripping the fixed effects inflates the GDP coefficient and generates false positives.
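The size of the model space can be checked by brute force. The sketch below enumerates every subset of the 12 candidate controls; it assumes the three GDP terms are kept in every model, so only the controls vary.

```python
from itertools import combinations

# The 12 candidate controls named in the variable dictionary.
controls = [
    "fossil_fuel", "renewable", "urban", "democracy", "industry",  # true predictors
    "globalization", "pop_density", "corruption", "services",
    "trade", "fdi", "credit",                                       # noise
]

# One model per subset of controls (assumption: GDP terms are always included).
n_models = sum(
    1
    for k in range(len(controls) + 1)
    for _ in combinations(controls, k)
)

print(n_models)             # 4096
print(2 ** len(controls))   # 4096
```

Each control is either in or out of a model, which is why the count doubles with every extra candidate.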

One file. synthetic_ekc_panel is a balanced annual country panel — one row per country × year, 80 countries × 20 years (1995–2014) = 1,600 observations. It carries the outcome (log CO₂ per capita), the three GDP polynomial terms, and the 12 candidate controls (5 true predictors + 7 noise). Set the panel in Stata with xtset country_id year, yearly.
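A quick way to confirm the balanced structure after loading is to count rows per country. The helper below is a minimal sketch: the function name is_balanced is hypothetical, and the toy list stands in for the real (country_id, year) pairs so the example runs without a download.

```python
from collections import Counter

def is_balanced(rows, n_units, n_periods):
    """Check a panel given as (unit_id, period) pairs.

    Balanced here means: no duplicate pairs, exactly n_units units,
    n_periods distinct periods overall, and n_periods rows per unit.
    """
    rows = list(rows)
    per_unit = Counter(unit for unit, _ in rows)
    periods = {period for _, period in rows}
    return (
        len(set(rows)) == len(rows)
        and len(per_unit) == n_units
        and len(periods) == n_periods
        and all(count == n_periods for count in per_unit.values())
    )

# Toy stand-in for the real keys: 80 countries x 20 years (1995-2014).
toy = [(cid, yr) for cid in range(1, 81) for yr in range(1995, 2015)]
print(len(toy), is_balanced(toy, 80, 20))  # 1600 True
```

With the pandas frame from the loading snippet, the same check would be is_balanced(zip(df["country_id"], df["year"]), 80, 20).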

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated from a calibrated cubic-EKC data-generating process with a known answer key (open & reproducible)	Mendez, C. (2026). See the post's Stata do-file generate_data.do for the full DGP.
Gravina & Lanzafame (2025)	Inspiration for the panel structure and variable set (the synthetic data is NOT identical to the original)	Gravina, A. F., & Lanzafame, M. (2025). Inequality and the Environmental Kuznets Curve.
Method references	Estimators and concepts	Bayesian Model Averaging (Raftery 1995; Steel 2020); Post-Double-Selection LASSO (Belloni, Chernozhukov & Hansen 2014); Environmental Kuznets Curve (Grossman & Krueger 1995).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data [Data set]. https://carlos-mendez.org/post/stata_bma_dsl/

Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2), 608–650.

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163.

Grossman, G. M., & Krueger, A. B. (1995). Economic growth and the environment. Quarterly Journal of Economics, 110(2), 353–377.

BibTeX

@misc{mendez2026statabmadsl,
  author       = {Mendez, Carlos},
  title        = {Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_bma_dsl/}},
  note         = {Data set}
}

@article{belloni2014inference,
  author  = {Belloni, Alexandre and Chernozhukov, Victor and Hansen, Christian},
  title   = {Inference on Treatment Effects after Selection among High-Dimensional Controls},
  journal = {Review of Economic Studies},
  volume  = {81}, number = {2}, pages = {608--650}, year = {2014}
}
@article{raftery1995bayesian,
  author  = {Raftery, Adrian E.},
  title   = {Bayesian Model Selection in Social Research},
  journal = {Sociological Methodology},
  volume  = {25}, pages = {111--163}, year = {1995}
}
@article{grossman1995economic,
  author  = {Grossman, Gene M. and Krueger, Alan B.},
  title   = {Economic Growth and the Environment},
  journal = {Quarterly Journal of Economics},
  volume  = {110}, number = {2}, pages = {353--377}, year = {1995}
}

Variable explorer (all 18 variables)


Variable	Type	Label	Definition	Units	In files	Source
`corruption`	continuous	Corruption index	NOISE control (true coef 0): corruption score; no GDP correlation.	index (0-100)	synthetic_ekc_panel	Simulation (noise)
`country_id`	identifier	Country ID	Synthetic country identifier (1-80); no real-country mapping.	integer (1-80)	synthetic_ekc_panel	Simulation
`credit`	continuous	Domestic credit (% GDP)	NOISE control (true coef 0): domestic credit; moderately correlated with GDP.	% of GDP	synthetic_ekc_panel	Simulation (noise)
`democracy`	continuous	Democracy score	TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2.	index (-10 to 10)	synthetic_ekc_panel	Simulation (true predictor)
`fdi`	continuous	FDI inflows (% GDP)	NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation.	% of GDP	synthetic_ekc_panel	Simulation (noise)
`fossil_fuel`	continuous	Fossil fuel share (%)	TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2.	% (5-95)	synthetic_ekc_panel	Simulation (true predictor)
`globalization`	continuous	Globalization index	NOISE control (true coef 0): tricky decoy, strongly correlated with GDP.	index (20-95)	synthetic_ekc_panel	Simulation (noise)
`industry`	continuous	Industry VA (% GDP)	TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2.	% of GDP (5-60)	synthetic_ekc_panel	Simulation (true predictor)
`ln_co2`	continuous	CO2 per capita (log)	Outcome variable: natural log of CO2 emissions per capita.	log	synthetic_ekc_panel	Simulation (DGP outcome)
`ln_gdp`	continuous	GDP per capita (log)	Log GDP per capita; the income axis of the EKC.	log international $	synthetic_ekc_panel	Simulation
`ln_gdp_cb`	continuous	GDP per capita cubed (log)	Cubic GDP term (inverted-N) for the EKC polynomial.	log^3	synthetic_ekc_panel	Derived
`ln_gdp_sq`	continuous	GDP per capita squared (log)	Quadratic GDP term for the EKC polynomial.	log^2	synthetic_ekc_panel	Derived
`pop_density`	continuous	Population density	NOISE control (true coef 0): population density; no GDP correlation.	persons per km^2	synthetic_ekc_panel	Simulation (noise)
`renewable`	continuous	Renewable energy (%)	TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2.	% (1-80)	synthetic_ekc_panel	Simulation (true predictor)
`services`	continuous	Services VA (% GDP)	NOISE control (true coef 0): tricky decoy, strongly correlated with GDP.	% of GDP (10-80)	synthetic_ekc_panel	Simulation (noise)
`trade`	continuous	Trade openness (% GDP)	NOISE control (true coef 0): trade openness; moderately correlated with GDP.	% of GDP (10-200)	synthetic_ekc_panel	Simulation (noise)
`urban`	continuous	Urban population (%)	TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2.	% (10-95)	synthetic_ekc_panel	Simulation (true predictor)
`year`	year	Calendar year	Annual time index, 1995-2014.	year	synthetic_ekc_panel	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	synthetic_ekc_panel
`corruption`	●
`country_id`	●
`credit`	●
`democracy`	●
`fdi`	●
`fossil_fuel`	●
`globalization`	●
`industry`	●
`ln_co2`	●
`ln_gdp`	●
`ln_gdp_cb`	●
`ln_gdp_sq`	●
`pop_density`	●
`renewable`	●
`services`	●
`trade`	●
`urban`	●
`year`	●

Construction & formulas

The synthetic outcome, log CO₂ per capita, is generated from a cubic EKC (the inverted-N shape) with country and year fixed effects:

ln(CO₂)_it = β₁·ln(GDP)_it + β₂·ln(GDP)²_it + β₃·ln(GDP)³_it + Xᵗʳᵘᵉ·γ + α_i + δ_t + ε_it

with the true coefficients written into the DGP: β₁ = -7.10, β₂ = +0.81, β₃ = -0.03 (designed turning points near $1,895 and $34,647), five true control slopes (fossil_fuel +0.015, renewable -0.010, urban +0.007, democracy -0.005, industry +0.010), and seven noise controls with a true coefficient of exactly 0 (globalization, pop_density, corruption, services, trade, fdi, credit). Country fixed effects are N(0, 0.50) draws, the year effect is a downward decarbonization trend, and observation noise is N(0, 0.15).
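The designed turning points quoted above can be checked from the three GDP coefficients alone. The sketch below sets the derivative of the cubic, β₁ + 2β₂x + 3β₃x², to zero, solves for x = ln(GDP), and exponentiates; the function name turning_points is hypothetical, and the sign of the second derivative is used to label each point.

```python
import math

# True GDP coefficients from the DGP description.
b1, b2, b3 = -7.10, 0.81, -0.03

def turning_points(b1, b2, b3):
    """Return (ln_gdp, gdp, kind) for each turning point of the cubic."""
    disc = b2 ** 2 - 3 * b1 * b3
    if b3 == 0 or disc <= 0:
        return []  # no pair of distinct real turning points
    points = []
    for sign in (+1, -1):
        x = (-b2 + sign * math.sqrt(disc)) / (3 * b3)
        curvature = 2 * b2 + 6 * b3 * x  # second derivative at x
        kind = "trough" if curvature > 0 else "peak"
        points.append((x, math.exp(x), kind))
    return sorted(points)

for x, gdp, kind in turning_points(b1, b2, b3):
    print(f"{kind}: ln(GDP) = {x:.2f}, GDP = {gdp:,.0f}")
```

The lower point comes out as a trough and the upper one as a peak, so emissions fall, rise, then fall again as income grows, which is the inverted-N pattern.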

GDP polynomial: ln_gdp_sq = ln_gdp², ln_gdp_cb = ln_gdp³ — the linear/squared/cubic terms that trace the inverted-N.
Turning points (income where emissions change direction): x* = [-β₂ ± √(β₂² - 3β₁β₃)] / (3β₃), GDP* = exp(x*).
BMA: averages over the model space of 2¹² = 4,096 models; each variable’s posterior inclusion probability (PIP) is PIP_j = Σ_{k: x_j ∈ M_k} P(M_k | data), the posterior weight on models containing it (robustness threshold PIP ≥ 0.80). The posterior mean is β̂_j = Σ_k P(M_k | data)·β̂_{j,k}, where β̂_{j,k} is taken as 0 in models that exclude x_j.
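Double-Selection LASSO: run a LASSO of the outcome on the candidate controls, run a LASSO of each variable of interest on the same controls, take the union of the selected controls, then run OLS of the outcome on the variables of interest plus that union (cluster-robust SE). This "select, then regress" design protects against omitted-variable bias.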
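The PIP and posterior-mean formulas above reduce to weighted sums over models. The toy example below uses a model space of only two controls; every probability and coefficient in it is made up for illustration and is not output from bmaregress.

```python
# Hypothetical posterior model probabilities P(M_k | data); they sum to 1.
model_probs = {
    frozenset(): 0.05,
    frozenset({"fossil_fuel"}): 0.60,
    frozenset({"trade"}): 0.05,
    frozenset({"fossil_fuel", "trade"}): 0.30,
}

# Hypothetical within-model coefficient estimates.
model_betas = {
    frozenset(): {},
    frozenset({"fossil_fuel"}): {"fossil_fuel": 0.015},
    frozenset({"trade"}): {"trade": 0.004},
    frozenset({"fossil_fuel", "trade"}): {"fossil_fuel": 0.014, "trade": 0.001},
}

def pip(var):
    """Posterior weight on all models that contain var."""
    return sum(p for model, p in model_probs.items() if var in model)

def posterior_mean(var):
    """Model-averaged coefficient; a model without var contributes 0."""
    return sum(p * model_betas[model].get(var, 0.0)
               for model, p in model_probs.items())

for var in ("fossil_fuel", "trade"):
    flag = "robust" if pip(var) >= 0.80 else "not robust"
    print(f"{var}: PIP = {pip(var):.2f} ({flag}), "
          f"posterior mean = {posterior_mean(var):.4f}")
```

In this toy setup fossil_fuel clears the 0.80 threshold while trade does not, and the averaged coefficient is shrunk toward zero by the models that leave the variable out.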

The true predictors and several noise variables are deliberately correlated with GDP in the DGP (e.g. globalization, services strongly), so a naive regression would flag the noise as "significant"; the task for BMA and DSL is to see through that correlation.

The datasets

Each dataset is documented with its full variable dictionary plus a statistics table with data coverage.


country-year 1,600 × 18 · 1995-2014 · 80 countries (balanced; 1,600 obs)

Panel key: country_id x year · Test the inverted-N EKC and grade BMA vs Double-Selection LASSO against a known answer key.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`country_id` identifier	Country ID	Synthetic country identifier (1-80); no real-country mapping.	Sequential 1..80 assigned at panel creation (gen country_id = _n).	integer (1-80)	Simulation	80 countries
`year` year	Calendar year	Annual time index, 1995-2014.	1995 + sequence within country (20 years per country, balanced).	year	Simulation	1995-2014
`ln_co2` continuous	CO2 per capita (log)	Outcome variable: natural log of CO2 emissions per capita.	Generated from the DGP: b1*ln_gdp + b2*ln_gdp_sq + b3*ln_gdp_cb + 5 true controls + country FE + year FE + N(0,0.15) noise.	log	Simulation (DGP outcome)	country-year
`ln_gdp` continuous	GDP per capita (log)	Log GDP per capita; the income axis of the EKC.	Country baseline (uniform 7.0-11.5) + annual growth*(year-1995) + N(0,0.05) noise.	log international $	Simulation	country-year
`ln_gdp_sq` continuous	GDP per capita squared (log)	Quadratic GDP term for the EKC polynomial.	ln_gdp^2.	log^2	Derived	country-year
`ln_gdp_cb` continuous	GDP per capita cubed (log)	Cubic GDP term (inverted-N) for the EKC polynomial.	ln_gdp^3.	log^3	Derived	country-year
`fossil_fuel` continuous	Fossil fuel share (%)	TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2.	Country base (correlated with GDP) + N(0,3) noise - 0.3*(year-1995), bounded to [5,95].	% (5-95)	Simulation (true predictor)	country-year
`renewable` continuous	Renewable energy (%)	TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2.	Country base (negatively correlated with GDP) + N(0,2) + 0.4*(year-1995), bounded to [1,80].	% (1-80)	Simulation (true predictor)	country-year
`urban` continuous	Urban population (%)	TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2.	Country base (correlated with GDP) + N(0,1.5) + 0.3*(year-1995), bounded to [10,95].	% (10-95)	Simulation (true predictor)	country-year
`globalization` continuous	Globalization index	NOISE control (true coef 0): tricky decoy, strongly correlated with GDP.	Country base (strong GDP corr) + N(0,3) + 0.2*(year-1995), bounded to [20,95].	index (20-95)	Simulation (noise)	country-year
`pop_density` continuous	Population density	NOISE control (true coef 0): population density; no GDP correlation.	Log-normal base exp(N(4,1.2)) * (1+0.01*(year-1995)) + N(0,5), floored at 1.	persons per km^2	Simulation (noise)	country-year
`democracy` continuous	Democracy score	TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2.	Country base (uniform -5..10) + N(0,0.5), bounded to [-10,10].	index (-10 to 10)	Simulation (true predictor)	country-year
`corruption` continuous	Corruption index	NOISE control (true coef 0): corruption score; no GDP correlation.	Country base (uniform 0-100) + N(0,5), bounded to [0,100].	index (0-100)	Simulation (noise)	country-year
`industry` continuous	Industry VA (% GDP)	TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2.	Country base (correlated with GDP) + N(0,2) - 0.1*(year-1995), bounded to [5,60].	% of GDP (5-60)	Simulation (true predictor)	country-year
`services` continuous	Services VA (% GDP)	NOISE control (true coef 0): tricky decoy, strongly correlated with GDP.	Country base (strong GDP corr) + N(0,2) + 0.2*(year-1995), bounded to [10,80].	% of GDP (10-80)	Simulation (noise)	country-year
`trade` continuous	Trade openness (% GDP)	NOISE control (true coef 0): trade openness; moderately correlated with GDP.	Country base (moderate GDP corr) + N(0,5), bounded to [10,200].	% of GDP (10-200)	Simulation (noise)	country-year
`fdi` continuous	FDI inflows (% GDP)	NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation.	Country base N(3,4) + N(0,2).	% of GDP	Simulation (noise)	country-year
`credit` continuous	Domestic credit (% GDP)	NOISE control (true coef 0): domestic credit; moderately correlated with GDP.	Country base (moderate GDP corr) + N(0,5) + 0.3*(year-1995), floored at 5.	% of GDP	Simulation (noise)	country-year

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`country_id`	100%	1,600	80	—	—	—	—	—
`year`	100%	1,600	20	1995	2004.5	2004	2014	5.77
`ln_co2`	100%	1,600	1,599	-21.04	-19.04	-18.98	-16.83	0.786
`ln_gdp`	100%	1,600	1,600	6.97	9.58	9.60	11.97	1.33
`ln_gdp_sq`	100%	1,600	1,600	48.64	93.62	92.09	143.3	25.55
`ln_gdp_cb`	100%	1,600	1,600	339.2	931.1	883.8	1,715.2	373.8
`fossil_fuel`	100%	1,600	1,593	6.37	54.77	53.76	95.00	19.14
`renewable`	100%	1,600	1,600	1.00	29.54	28.93	64.22	11.97
`urban`	100%	1,600	1,600	15.95	53.67	53.30	91.63	14.78
`globalization`	100%	1,600	1,595	26.76	57.65	56.45	95.00	12.72
`pop_density`	100%	1,600	1,571	1.00	121.3	51.72	1,571.8	210.3
`democracy`	100%	1,600	1,594	-6.12	2.33	2.26	10.00	4.18
`corruption`	100%	1,600	1,554	0	52.35	50.51	100.0	28.53
`industry`	100%	1,600	1,600	5.84	24.64	24.61	45.33	6.18
`services`	100%	1,600	1,600	17.83	43.56	43.43	64.07	9.37
`trade`	100%	1,600	1,600	10.04	67.44	68.08	128.1	19.36
`fdi`	100%	1,600	1,600	-11.50	2.98	2.96	16.20	4.37
`credit`	100%	1,600	1,600	11.33	53.44	51.39	123.2	18.20
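As a sanity check on the table, the sample means can be plugged into the systematic part of the DGP. Because that part is linear in the GDP terms and the true controls, its value at the means equals the mean of the systematic part, so whatever is left over is the average of the fixed effects and noise. The sketch below uses the rounded means printed above, so the result is approximate.

```python
# True coefficients from the DGP description.
coefs = {
    "ln_gdp": -7.10, "ln_gdp_sq": 0.81, "ln_gdp_cb": -0.03,
    "fossil_fuel": 0.015, "renewable": -0.010, "urban": 0.007,
    "democracy": -0.005, "industry": 0.010,
}

# Sample means copied (rounded) from the statistics table.
means = {
    "ln_gdp": 9.58, "ln_gdp_sq": 93.62, "ln_gdp_cb": 931.1,
    "fossil_fuel": 54.77, "renewable": 29.54, "urban": 53.67,
    "democracy": 2.33, "industry": 24.64,
}
observed_mean_ln_co2 = -19.04

# Systematic part of the DGP evaluated at the sample means.
systematic = sum(coefs[v] * means[v] for v in coefs)

# Remainder attributed to country effects, the year trend, and noise.
leftover = observed_mean_ln_co2 - systematic

print(f"systematic part at the means: {systematic:.2f}")
print(f"left for fixed effects and noise: {leftover:.2f}")
```

The systematic part lands close to the observed mean of ln_co2, and the small negative remainder is in line with a downward year trend, which also explains why the outcome sits near -19 even though the DGP has no large intercept.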

Known limitations & caveats

Synthetic data. There is no real data behind this tutorial. Every value is simulated from a known data-generating process; results are internally consistent with the calibration but are NOT empirical evidence about real-world CO₂ emissions or the EKC.
Inspired by, not identical to. The panel is inspired by Gravina & Lanzafame (2025) but the values are fully synthetic and do not reproduce that study's data.
Known answer key. Exactly 5 controls (fossil_fuel, renewable, urban, democracy, industry) have a true non-zero effect; the other 7 are pure noise with a true coefficient of zero. Use this only to grade method accuracy, not to draw substantive conclusions.
Weak signals are hard. Two true controls (urban +0.007, democracy -0.005) have small coefficients and fall below the PIP ≥ 0.80 threshold even though they belong — a realistic limitation, not a data error.
Fixed effects matter. The cubic-EKC recovery and zero-false-positive selection hold only with country and year fixed effects; pooled (no-FE) estimates inflate the GDP coefficient 2–3× and generate false positives.
Country IDs are arbitrary. country_id 1–80 are synthetic identifiers with no real-country mapping; the CSV rows are not sorted by id.
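The answer key can be turned into a small grading helper. The sketch below is one possible way to score a set of PIPs against the known truth; the function name grade is hypothetical, and the example PIP values are made up to mirror the pattern described above, not taken from the tutorial's output.

```python
TRUE_CONTROLS = {"fossil_fuel", "renewable", "urban", "democracy", "industry"}
NOISE_CONTROLS = {"globalization", "pop_density", "corruption", "services",
                  "trade", "fdi", "credit"}

def grade(pips, threshold=0.80):
    """Compare variables flagged at PIP >= threshold with the answer key."""
    flagged = {var for var, p in pips.items() if p >= threshold}
    return {
        "true_positives": sorted(flagged & TRUE_CONTROLS),
        "false_positives": sorted(flagged & NOISE_CONTROLS),
        "missed": sorted(TRUE_CONTROLS - flagged),
    }

# Hypothetical PIPs for illustration only.
example_pips = {
    "fossil_fuel": 0.99, "renewable": 0.95, "industry": 0.90,
    "urban": 0.40, "democracy": 0.20,
    "globalization": 0.10, "services": 0.08, "trade": 0.05,
    "credit": 0.05, "fdi": 0.03, "corruption": 0.03, "pop_density": 0.02,
}

report = grade(example_pips)
for key, names in report.items():
    print(f"{key}: {names}")
```

Lowering the threshold would recover the weak signals sooner but also makes it easier for a decoy such as globalization or services to slip in, so the threshold is a trade-off between misses and false positives.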