Interactive data dictionary

Taming Model Uncertainty in the Environmental Kuznets Curve

A synthetic 80-country panel with a known answer key for grading BMA and Double-Selection LASSO.

1 dataset · 18 variables · 80 countries · 1995–2014

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × variables | Stata | Source |
|---|---|---|---|---|
| synthetic_ekc_panel | country-year | 1,600 × 18 | synthetic_ekc_panel.dta | synthetic_ekc_panel.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
use "${BASE}synthetic_ekc_panel.dta", clear
describe
notes

Python

!pip install -q pyreadstat  # notebook shell command (Colab/Jupyter); in a terminal run: pip install pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df = pd.read_stata(BASE + "synthetic_ekc_panel.dta")

# load every dataset at once
files = ["synthetic_ekc_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic_ekc_panel.dta", "synthetic_ekc_panel.dta")
df, meta = pyreadstat.read_dta("synthetic_ekc_panel.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df <- read_dta(paste0(BASE, "synthetic_ekc_panel.dta"))

Overview & sources

Companion data for a Stata tutorial that confronts the model-uncertainty problem in testing the Environmental Kuznets Curve (EKC). With 12 candidate control variables there are 2¹² = 4,096 possible regressions on log CO2 per capita, and choosing one is a hidden assumption. The tutorial applies two principled solutions, Bayesian Model Averaging (BMA, via Stata’s bmaregress) and Post-Double-Selection LASSO (DSL, via dsregress), to estimate the inverted-N cubic relationship between log CO2 and log GDP. The data is fully synthetic, built from a known data-generating process in which 5 controls truly affect emissions and 7 are pure noise, so each method can be graded against ground truth. With country and year fixed effects, BMA recovers the GDP cubic almost exactly and flags 6 of 8 true predictors (the 3 GDP terms plus the 5 true controls) at a posterior inclusion probability (PIP) ≥ 0.80 with zero false positives; DSL produces fast cluster-robust estimates. Stripping the fixed effects inflates the GDP coefficient and generates false positives.
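
The size of that model space is easy to verify. A minimal sketch in Python, counting only the controls and assuming each of the 12 candidates is either included in or excluded from a model (the list simply collects the control names from the variable dictionary on this page):

```python
from itertools import combinations

# The 12 candidate controls listed in the variable dictionary.
controls = [
    "fossil_fuel", "renewable", "urban", "democracy", "industry",  # true predictors
    "globalization", "pop_density", "corruption", "services",      # noise
    "trade", "fdi", "credit",                                       # noise
]

# Each control is either in or out of a model, so there are 2^12 subsets.
n_models = 2 ** len(controls)

# Cross-check by enumerating every subset of every size.
n_enumerated = sum(
    1 for k in range(len(controls) + 1) for _ in combinations(controls, k)
)

print(n_models, n_enumerated)
```

Both counts come out to 4,096, which is the number of candidate regressions a researcher implicitly chooses among when reporting a single specification.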

One file. synthetic_ekc_panel is a balanced annual country panel — one row per country × year, 80 countries × 20 years (1995–2014) = 1,600 observations. It carries the outcome (log CO2 per capita), the three GDP polynomial terms, and the 12 candidate controls (5 true predictors + 7 noise). Set the panel in Stata with xtset country_id year, yearly.

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values: simulated from a calibrated cubic-EKC data-generating process with a known answer key (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file generate_data.do for the full DGP. |
| Gravina & Lanzafame (2025) | Inspiration for the panel structure and variable set (the synthetic data is NOT identical to the original) | Gravina, A. F., & Lanzafame, M. (2025). Inequality and the Environmental Kuznets Curve. |
| Method references | Estimators and concepts | Bayesian Model Averaging (Raftery 1995; Steel 2020); Post-Double-Selection LASSO (Belloni, Chernozhukov & Hansen 2014); Environmental Kuznets Curve (Grossman & Krueger 1995). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data [Data set]. https://carlos-mendez.org/post/stata_bma_dsl/

Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2), 608–650.

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163.

Grossman, G. M., & Krueger, A. B. (1995). Economic growth and the environment. Quarterly Journal of Economics, 110(2), 353–377.

BibTeX

@misc{mendez2026statabmadsl,
  author       = {Mendez, Carlos},
  title        = {Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_bma_dsl/}},
  note         = {Data set}
}

@article{belloni2014inference,
  author  = {Belloni, Alexandre and Chernozhukov, Victor and Hansen, Christian},
  title   = {Inference on Treatment Effects after Selection among High-Dimensional Controls},
  journal = {Review of Economic Studies},
  volume  = {81}, number = {2}, pages = {608--650}, year = {2014}
}
@article{raftery1995bayesian,
  author  = {Raftery, Adrian E.},
  title   = {Bayesian Model Selection in Social Research},
  journal = {Sociological Methodology},
  volume  = {25}, pages = {111--163}, year = {1995}
}
@article{grossman1995economic,
  author  = {Grossman, Gene M. and Krueger, Alan B.},
  title   = {Economic Growth and the Environment},
  journal = {Quarterly Journal of Economics},
  volume  = {110}, number = {2}, pages = {353--377}, year = {1995}
}

Variable explorer: all 18 variables

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| corruption | continuous | min 0, median 50.5, max 100 | Corruption index | NOISE control (true coef 0): corruption score; no GDP correlation. | index (0-100) | synthetic_ekc_panel | Simulation (noise) |
| country_id | identifier | | Country ID | Synthetic country identifier (1-80); no real-country mapping. | integer (1-80) | synthetic_ekc_panel | Simulation |
| credit | continuous | min 11.3, median 51.4, max 123 | Domestic credit (% GDP) | NOISE control (true coef 0): domestic credit; moderately correlated with GDP. | % of GDP | synthetic_ekc_panel | Simulation (noise) |
| democracy | continuous | min -6.12, median 2.26, max 10 | Democracy score | TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2. | index (-10 to 10) | synthetic_ekc_panel | Simulation (true predictor) |
| fdi | continuous | min -11.5, median 2.96, max 16.2 | FDI inflows (% GDP) | NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation. | % of GDP | synthetic_ekc_panel | Simulation (noise) |
| fossil_fuel | continuous | min 6.37, median 53.8, max 95 | Fossil fuel share (%) | TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2. | % (5-95) | synthetic_ekc_panel | Simulation (true predictor) |
| globalization | continuous | min 26.8, median 56.5, max 95 | Globalization index | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | index (20-95) | synthetic_ekc_panel | Simulation (noise) |
| industry | continuous | min 5.84, median 24.6, max 45.3 | Industry VA (% GDP) | TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2. | % of GDP (5-60) | synthetic_ekc_panel | Simulation (true predictor) |
| ln_co2 | continuous | min -21, median -19, max -16.8 | CO2 per capita (log) | Outcome variable: natural log of CO2 emissions per capita. | log | synthetic_ekc_panel | Simulation (DGP outcome) |
| ln_gdp | continuous | min 6.97, median 9.6, max 12 | GDP per capita (log) | Log GDP per capita; the income axis of the EKC. | log international $ | synthetic_ekc_panel | Simulation |
| ln_gdp_cb | continuous | min 339, median 884, max 1,720 | GDP per capita cubed (log) | Cubic GDP term (inverted-N) for the EKC polynomial. | log^3 | synthetic_ekc_panel | Derived |
| ln_gdp_sq | continuous | min 48.6, median 92.1, max 143 | GDP per capita squared (log) | Quadratic GDP term for the EKC polynomial. | log^2 | synthetic_ekc_panel | Derived |
| pop_density | continuous | min 1, median 51.7, max 1,570 | Population density | NOISE control (true coef 0): population density; no GDP correlation. | persons per km^2 | synthetic_ekc_panel | Simulation (noise) |
| renewable | continuous | min 1, median 28.9, max 64.2 | Renewable energy (%) | TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2. | % (1-80) | synthetic_ekc_panel | Simulation (true predictor) |
| services | continuous | min 17.8, median 43.4, max 64.1 | Services VA (% GDP) | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | % of GDP (10-80) | synthetic_ekc_panel | Simulation (noise) |
| trade | continuous | min 10, median 68.1, max 128 | Trade openness (% GDP) | NOISE control (true coef 0): trade openness; moderately correlated with GDP. | % of GDP (10-200) | synthetic_ekc_panel | Simulation (noise) |
| urban | continuous | min 16, median 53.3, max 91.6 | Urban population (%) | TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2. | % (10-95) | synthetic_ekc_panel | Simulation (true predictor) |
| year | year | | Calendar year | Annual time index, 1995-2014. | year | synthetic_ekc_panel | Simulation |

Construction & formulas

The synthetic outcome — log CO2 per capita — is generated from a cubic EKC with country and year fixed effects (the inverted-N shape):

ln(CO2)_it = β₁·ln(GDP)_it + β₂·ln(GDP)²_it + β₃·ln(GDP)³_it + Xᵗʳᵘᵉ_it·γ + α_i + δ_t + ε_it

with the true coefficients written into the DGP: β₁ = -7.10, β₂ = +0.81, β₃ = -0.03 (designed turning points near $1,895 and $34,647 of GDP per capita); five true control slopes (fossil_fuel +0.015, renewable -0.010, urban +0.007, democracy -0.005, industry +0.010); and seven noise controls with a true coefficient of exactly 0 (globalization, pop_density, corruption, services, trade, fdi, credit). Country fixed effects are N(0, 0.50) draws, the year effect is a downward decarbonization trend, and observation noise is N(0, 0.15).
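
The turning points follow from setting the derivative of the cubic to zero, β₁ + 2β₂·x + 3β₃·x² = 0 with x = ln(GDP), and exponentiating the two roots. A minimal check in Python using the coefficients stated above (the variable names are illustrative):

```python
import math

# True GDP coefficients written into the DGP (from the text above).
b1, b2, b3 = -7.10, 0.81, -0.03

# d ln(CO2) / d ln(GDP) = b1 + 2*b2*x + 3*b3*x^2 = 0 is a quadratic in x = ln(GDP).
a, b, c = 3 * b3, 2 * b2, b1
disc = b * b - 4 * a * c
roots = sorted((-b + s * math.sqrt(disc)) / (2 * a) for s in (1, -1))

# Convert the log-income roots back to GDP per capita levels.
turning_points = [math.exp(x) for x in roots]
for x, gdp in zip(roots, turning_points):
    print(f"ln(GDP) = {x:.3f}  ->  GDP per capita of about {gdp:,.0f}")
```

The two roots land close to the designed values of $1,895 and $34,647. Because β₃ is negative, the slope is negative below the first root, positive between the roots, and negative above the second, which is the inverted-N shape.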

The true predictors and several noise variables are deliberately correlated with GDP in the DGP (e.g. globalization, services strongly), so a naive regression would flag the noise as "significant"; the task for BMA and DSL is to see through that correlation.
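
A toy simulation shows why correlation with GDP creates false positives. This is a sketch under simplified assumptions, not the tutorial's DGP: a single income variable x drives the outcome, a decoy z is correlated with x but has a true coefficient of zero, and all names, noise scales, and the seed are hypothetical.

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility of this sketch only
n = 2000

def slope(u, v):
    """OLS slope from regressing v on u (with an intercept)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var = sum((a - mu) ** 2 for a in u)
    return cov / var

def residuals(u, v):
    """Residuals of v after regressing it on u (with an intercept)."""
    b = slope(u, v)
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return [(y - mv) - b * (x - mu) for x, y in zip(u, v)]

x = [random.gauss(0, 1) for _ in range(n)]       # income (hypothetical scale)
z = [xi + random.gauss(0, 0.5) for xi in x]      # decoy: correlated with x, true coef 0
y = [xi + random.gauss(0, 0.5) for xi in x]      # outcome depends on x only

naive = slope(z, y)                                  # omits x: absorbs the income effect
partial = slope(residuals(x, z), residuals(x, y))    # Frisch-Waugh: controls for x

print(f"naive slope on decoy:      {naive:.3f}")
print(f"slope after controlling x: {partial:.3f}")
```

The naive slope is far from zero because the decoy stands in for the omitted income variable, while the slope after partialling out x is close to zero. BMA and DSL face the same problem at scale, with several decoys and several true predictors at once.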

The datasets

Each dataset is documented with its full variable dictionary and a statistics table that reports data coverage.

synthetic_ekc_panel · country-year · 1,600 × 18 · 1995-2014 · 80 countries (balanced; 1,600 obs)

Panel key: country_id x year · Test the inverted-N EKC and grade BMA vs Double-Selection LASSO against a known answer key.

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| country_id | identifier | Country ID | Synthetic country identifier (1-80); no real-country mapping. | Sequential 1..80 assigned at panel creation (gen country_id = _n). | integer (1-80) | Simulation | 80 countries |
| year | year | Calendar year | Annual time index, 1995-2014. | 1995 + sequence within country (20 years per country, balanced). | year | Simulation | 1995-2014 |
| ln_co2 | continuous | CO2 per capita (log) | Outcome variable: natural log of CO2 emissions per capita. | Generated from the DGP: b1*ln_gdp + b2*ln_gdp_sq + b3*ln_gdp_cb + 5 true controls + country FE + year FE + N(0,0.15) noise. | log | Simulation (DGP outcome) | country-year |
| ln_gdp | continuous | GDP per capita (log) | Log GDP per capita; the income axis of the EKC. | Country baseline (uniform 7.0-11.5) + annual growth*(year-1995) + N(0,0.05) noise. | log international $ | Simulation | country-year |
| ln_gdp_sq | continuous | GDP per capita squared (log) | Quadratic GDP term for the EKC polynomial. | ln_gdp^2. | log^2 | Derived | country-year |
| ln_gdp_cb | continuous | GDP per capita cubed (log) | Cubic GDP term (inverted-N) for the EKC polynomial. | ln_gdp^3. | log^3 | Derived | country-year |
| fossil_fuel | continuous | Fossil fuel share (%) | TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2. | Country base (correlated with GDP) + N(0,3) noise - 0.3*(year-1995), bounded to [5,95]. | % (5-95) | Simulation (true predictor) | country-year |
| renewable | continuous | Renewable energy (%) | TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2. | Country base (negatively correlated with GDP) + N(0,2) + 0.4*(year-1995), bounded to [1,80]. | % (1-80) | Simulation (true predictor) | country-year |
| urban | continuous | Urban population (%) | TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2. | Country base (correlated with GDP) + N(0,1.5) + 0.3*(year-1995), bounded to [10,95]. | % (10-95) | Simulation (true predictor) | country-year |
| globalization | continuous | Globalization index | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | Country base (strong GDP corr) + N(0,3) + 0.2*(year-1995), bounded to [20,95]. | index (20-95) | Simulation (noise) | country-year |
| pop_density | continuous | Population density | NOISE control (true coef 0): population density; no GDP correlation. | Log-normal base exp(N(4,1.2)) * (1+0.01*(year-1995)) + N(0,5), floored at 1. | persons per km^2 | Simulation (noise) | country-year |
| democracy | continuous | Democracy score | TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2. | Country base (uniform -5..10) + N(0,0.5), bounded to [-10,10]. | index (-10 to 10) | Simulation (true predictor) | country-year |
| corruption | continuous | Corruption index | NOISE control (true coef 0): corruption score; no GDP correlation. | Country base (uniform 0-100) + N(0,5), bounded to [0,100]. | index (0-100) | Simulation (noise) | country-year |
| industry | continuous | Industry VA (% GDP) | TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2. | Country base (correlated with GDP) + N(0,2) - 0.1*(year-1995), bounded to [5,60]. | % of GDP (5-60) | Simulation (true predictor) | country-year |
| services | continuous | Services VA (% GDP) | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | Country base (strong GDP corr) + N(0,2) + 0.2*(year-1995), bounded to [10,80]. | % of GDP (10-80) | Simulation (noise) | country-year |
| trade | continuous | Trade openness (% GDP) | NOISE control (true coef 0): trade openness; moderately correlated with GDP. | Country base (moderate GDP corr) + N(0,5), bounded to [10,200]. | % of GDP (10-200) | Simulation (noise) | country-year |
| fdi | continuous | FDI inflows (% GDP) | NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation. | Country base N(3,4) + N(0,2). | % of GDP | Simulation (noise) | country-year |
| credit | continuous | Domestic credit (% GDP) | NOISE control (true coef 0): domestic credit; moderately correlated with GDP. | Country base (moderate GDP corr) + N(0,5) + 0.3*(year-1995), floored at 5. | % of GDP | Simulation (noise) | country-year |
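
The construction rules for the panel keys and the GDP terms can be sketched in a few lines. This is an illustrative Python sketch, not the author's generate_data.do: the growth-rate range is a hypothetical placeholder (the dictionary does not state it), the N(0,0.05) noise is read as a standard deviation, and the seed is arbitrary.

```python
import random

random.seed(2026)  # arbitrary seed, for reproducibility of this sketch only

rows = []
for country_id in range(1, 81):              # sequential IDs 1..80
    baseline = random.uniform(7.0, 11.5)     # country baseline log GDP
    growth = random.uniform(0.0, 0.04)       # HYPOTHETICAL annual growth rate
    for year in range(1995, 2015):           # 20 years per country, balanced
        ln_gdp = baseline + growth * (year - 1995) + random.gauss(0, 0.05)
        rows.append({
            "country_id": country_id,
            "year": year,
            "ln_gdp": ln_gdp,
            "ln_gdp_sq": ln_gdp ** 2,        # derived quadratic term
            "ln_gdp_cb": ln_gdp ** 3,        # derived cubic term
        })

# A balanced panel has exactly one row per country-year pair.
keys = {(r["country_id"], r["year"]) for r in rows}
print(len(rows), len(keys))
```

Both counts equal 1,600 (80 countries × 20 years), matching the panel key country_id × year. The remaining variables would be layered on in the same way, each from its own construction rule in the table.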

Distribution & statistics

| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country_id | 100% | 1,600 | 80 | | | | | |
| year | 100% | 1,600 | 20 | 1995 | 2004.5 | 2004 | 2014 | 5.77 |
| ln_co2 | 100% | 1,600 | 1,599 | -21.04 | -19.04 | -18.98 | -16.83 | 0.786 |
| ln_gdp | 100% | 1,600 | 1,600 | 6.97 | 9.58 | 9.60 | 11.97 | 1.33 |
| ln_gdp_sq | 100% | 1,600 | 1,600 | 48.64 | 93.62 | 92.09 | 143.3 | 25.55 |
| ln_gdp_cb | 100% | 1,600 | 1,600 | 339.2 | 931.1 | 883.8 | 1,715.2 | 373.8 |
| fossil_fuel | 100% | 1,600 | 1,593 | 6.37 | 54.77 | 53.76 | 95.00 | 19.14 |
| renewable | 100% | 1,600 | 1,600 | 1.00 | 29.54 | 28.93 | 64.22 | 11.97 |
| urban | 100% | 1,600 | 1,600 | 15.95 | 53.67 | 53.30 | 91.63 | 14.78 |
| globalization | 100% | 1,600 | 1,595 | 26.76 | 57.65 | 56.45 | 95.00 | 12.72 |
| pop_density | 100% | 1,600 | 1,571 | 1.00 | 121.3 | 51.72 | 1,571.8 | 210.3 |
| democracy | 100% | 1,600 | 1,594 | -6.12 | 2.33 | 2.26 | 10.00 | 4.18 |
| corruption | 100% | 1,600 | 1,554 | 0 | 52.35 | 50.51 | 100.0 | 28.53 |
| industry | 100% | 1,600 | 1,600 | 5.84 | 24.64 | 24.61 | 45.33 | 6.18 |
| services | 100% | 1,600 | 1,600 | 17.83 | 43.56 | 43.43 | 64.07 | 9.37 |
| trade | 100% | 1,600 | 1,600 | 10.04 | 67.44 | 68.08 | 128.1 | 19.36 |
| fdi | 100% | 1,600 | 1,600 | -11.50 | 2.98 | 2.96 | 16.20 | 4.37 |
| credit | 100% | 1,600 | 1,600 | 11.33 | 53.44 | 51.39 | 123.2 | 18.20 |
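
Some of these statistics can be verified without downloading the data. In a balanced panel every year from 1995 to 2014 appears once per country, so the mean and standard deviation of year are fixed by the design. A quick Python check, assuming the table reports the sample standard deviation (n − 1 denominator):

```python
import statistics

# Each of the 20 years appears once for each of the 80 countries.
years = [year for _ in range(80) for year in range(1995, 2015)]

print(len(years))                           # number of observations
print(statistics.mean(years))               # mean of year
print(round(statistics.stdev(years), 2))    # sample standard deviation
```

This reproduces N = 1,600, a mean of 2004.5, and an SD of 5.77 for year, which is a cheap way to confirm that a downloaded copy is the full balanced panel.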

Known limitations & caveats