Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| synthetic_ekc_panel | country-year | 1,600 × 18 | synthetic_ekc_panel.dta | synthetic_ekc_panel.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
use "${BASE}synthetic_ekc_panel.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df = pd.read_stata(BASE + "synthetic_ekc_panel.dta")
# load every dataset at once
files = ["synthetic_ekc_panel"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "synthetic_ekc_panel.dta", "synthetic_ekc_panel.dta")
df, meta = pyreadstat.read_dta("synthetic_ekc_panel.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_bma_dsl/data/"
df <- read_dta(paste0(BASE, "synthetic_ekc_panel.dta"))
Overview & sources
Companion data for a Stata tutorial that confronts the model-uncertainty problem in testing the Environmental Kuznets Curve (EKC). With 12 candidate control variables there are 2¹² = 4,096 possible regressions on log CO2 per capita, and choosing one is a hidden assumption. The tutorial applies two principled solutions, Bayesian Model Averaging (BMA, via Stata’s bmaregress) and Post-Double-Selection LASSO (DSL, via dsregress), to estimate the inverted-N cubic relationship between log CO2 and log GDP. The data is fully synthetic, built from a known data-generating process in which 5 controls truly affect emissions and 7 are pure noise, so each method can be graded against ground truth. With country and year fixed effects, BMA recovers the GDP cubic almost exactly and flags 6 of the 8 true predictors (3 GDP terms + 5 true controls) at PIP ≥ 0.80 with zero false positives; DSL produces fast cluster-robust estimates. Stripping the fixed effects inflates the GDP coefficient and generates false positives.
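The size of the model space can be checked by brute force. The sketch below enumerates every subset of the 12 candidate controls, using the variable names from the dictionary further down. It assumes, as the 4,096 count implies, that the three GDP terms stay in every model and only the controls are switched in or out.

```python
from itertools import combinations

# The 12 candidate controls (5 true predictors + 7 noise variables).
CONTROLS = [
    "fossil_fuel", "renewable", "urban", "democracy", "industry",
    "globalization", "pop_density", "corruption", "services",
    "trade", "fdi", "credit",
]

# Every subset of the controls is one candidate regression.
models = [
    subset
    for size in range(len(CONTROLS) + 1)
    for subset in combinations(CONTROLS, size)
]

print(len(models))         # 4096
print(2 ** len(CONTROLS))  # 4096
```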
synthetic_ekc_panel is a balanced annual country panel — one row per country × year, 80 countries × 20 years (1995–2014) = 1,600 observations. It carries the outcome (log CO2 per capita), the three GDP polynomial terms, and the 12 candidate controls (5 true predictors + 7 noise). Set the panel in Stata with xtset country_id year, yearly.
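A quick way to confirm the balanced structure after loading is to count rows per country. The following is a minimal sketch in plain Python; the helper name `is_balanced` is hypothetical, and the `pairs` list is a stand-in for the real `(country_id, year)` columns rather than data read from the file.

```python
from collections import Counter

def is_balanced(pairs, n_units, n_periods):
    """Return True if (unit, period) pairs form a balanced panel."""
    pairs = list(pairs)
    if len(set(pairs)) != len(pairs):          # duplicate unit-period rows
        return False
    per_unit = Counter(unit for unit, _ in pairs)
    periods = {period for _, period in pairs}
    return (
        len(per_unit) == n_units
        and len(periods) == n_periods
        and all(count == n_periods for count in per_unit.values())
    )

# Stand-in for the (country_id, year) columns of the real file.
pairs = [(c, y) for c in range(1, 81) for y in range(1995, 2015)]
print(len(pairs))                   # 1600
print(is_balanced(pairs, 80, 20))   # True
```

With the pandas DataFrame from the loading snippet, the same check could be run on `zip(df["country_id"], df["year"])`.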
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated from a calibrated cubic-EKC data-generating process with a known answer key (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file generate_data.do for the full DGP. |
| Gravina & Lanzafame (2025) | Inspiration for the panel structure and variable set (the synthetic data is NOT identical to the original) | Gravina, A. F., & Lanzafame, M. (2025). Inequality and the Environmental Kuznets Curve. |
| Method references | Estimators and concepts | Bayesian Model Averaging (Raftery 1995; Steel 2020); Post-Double-Selection LASSO (Belloni, Chernozhukov & Hansen 2014); Environmental Kuznets Curve (Grossman & Krueger 1995). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data [Data set]. https://carlos-mendez.org/post/stata_bma_dsl/
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2), 608–650.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163.
Grossman, G. M., & Krueger, A. B. (1995). Economic growth and the environment. Quarterly Journal of Economics, 110(2), 353–377.
BibTeX
@misc{mendez2026statabmadsl,
author = {Mendez, Carlos},
title = {Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_bma_dsl/}},
note = {Data set}
}
@article{belloni2014inference,
author = {Belloni, Alexandre and Chernozhukov, Victor and Hansen, Christian},
title = {Inference on Treatment Effects after Selection among High-Dimensional Controls},
journal = {Review of Economic Studies},
volume = {81}, number = {2}, pages = {608--650}, year = {2014}
}
@article{raftery1995bayesian,
author = {Raftery, Adrian E.},
title = {Bayesian Model Selection in Social Research},
journal = {Sociological Methodology},
volume = {25}, pages = {111--163}, year = {1995}
}
@article{grossman1995economic,
author = {Grossman, Gene M. and Krueger, Alan B.},
title = {Economic Growth and the Environment},
journal = {Quarterly Journal of Economics},
volume = {110}, number = {2}, pages = {353--377}, year = {1995}
}
Variable explorer
All 18 variables, listed alphabetically with type, label, definition, units, file, and source.
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| corruption | continuous | Corruption index | NOISE control (true coef 0): corruption score; no GDP correlation. | index (0-100) | synthetic_ekc_panel | Simulation (noise) |
| country_id | identifier | Country ID | Synthetic country identifier (1-80); no real-country mapping. | integer (1-80) | synthetic_ekc_panel | Simulation |
| credit | continuous | Domestic credit (% GDP) | NOISE control (true coef 0): domestic credit; moderately correlated with GDP. | % of GDP | synthetic_ekc_panel | Simulation (noise) |
| democracy | continuous | Democracy score | TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2. | index (-10 to 10) | synthetic_ekc_panel | Simulation (true predictor) |
| fdi | continuous | FDI inflows (% GDP) | NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation. | % of GDP | synthetic_ekc_panel | Simulation (noise) |
| fossil_fuel | continuous | Fossil fuel share (%) | TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2. | % (5-95) | synthetic_ekc_panel | Simulation (true predictor) |
| globalization | continuous | Globalization index | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | index (20-95) | synthetic_ekc_panel | Simulation (noise) |
| industry | continuous | Industry VA (% GDP) | TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2. | % of GDP (5-60) | synthetic_ekc_panel | Simulation (true predictor) |
| ln_co2 | continuous | CO2 per capita (log) | Outcome variable: natural log of CO2 emissions per capita. | log | synthetic_ekc_panel | Simulation (DGP outcome) |
| ln_gdp | continuous | GDP per capita (log) | Log GDP per capita; the income axis of the EKC. | log international $ | synthetic_ekc_panel | Simulation |
| ln_gdp_cb | continuous | GDP per capita cubed (log) | Cubic GDP term (inverted-N) for the EKC polynomial. | log^3 | synthetic_ekc_panel | Derived |
| ln_gdp_sq | continuous | GDP per capita squared (log) | Quadratic GDP term for the EKC polynomial. | log^2 | synthetic_ekc_panel | Derived |
| pop_density | continuous | Population density | NOISE control (true coef 0): population density; no GDP correlation. | persons per km^2 | synthetic_ekc_panel | Simulation (noise) |
| renewable | continuous | Renewable energy (%) | TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2. | % (1-80) | synthetic_ekc_panel | Simulation (true predictor) |
| services | continuous | Services VA (% GDP) | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | % of GDP (10-80) | synthetic_ekc_panel | Simulation (noise) |
| trade | continuous | Trade openness (% GDP) | NOISE control (true coef 0): trade openness; moderately correlated with GDP. | % of GDP (10-200) | synthetic_ekc_panel | Simulation (noise) |
| urban | continuous | Urban population (%) | TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2. | % (10-95) | synthetic_ekc_panel | Simulation (true predictor) |
| year | year | Calendar year | Annual time index, 1995-2014. | year | synthetic_ekc_panel | Simulation |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | synthetic_ekc_panel |
|---|---|
| corruption | ● |
| country_id | ● |
| credit | ● |
| democracy | ● |
| fdi | ● |
| fossil_fuel | ● |
| globalization | ● |
| industry | ● |
| ln_co2 | ● |
| ln_gdp | ● |
| ln_gdp_cb | ● |
| ln_gdp_sq | ● |
| pop_density | ● |
| renewable | ● |
| services | ● |
| trade | ● |
| urban | ● |
| year | ● |
Construction & formulas
The synthetic outcome — log CO2 per capita — is generated from a cubic EKC with country and year fixed effects (the inverted-N shape):
ln(CO2)_it = β₁·ln(GDP)_it + β₂·ln(GDP)²_it + β₃·ln(GDP)³_it + Xᵗʳᵘᵉ_it·γ + α_i + δ_t + ε_it
with the true coefficients written into the DGP: β₁ = -7.10, β₂ = +0.81, β₃ = -0.03 (designed turning points near \$1,895 and \$34,647). The five true control slopes are fossil_fuel +0.015, renewable -0.010, urban +0.007, democracy -0.005, and industry +0.010; the seven noise controls (globalization, pop_density, corruption, services, trade, fdi, credit) have a true coefficient of exactly 0. Country fixed effects are N(0, 0.50) draws, the year effect is a downward decarbonization trend, and observation noise is N(0, 0.15).
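Because the DGP is linear in its regressors, the true coefficients applied to the sample means should land close to the sample mean of ln_co2. The sketch below runs that arithmetic with the rounded means published in the statistics table further down. It leaves out the country effects, the year trend, and the noise term, whose realized values are not listed on this page, so only approximate agreement is expected.

```python
# True coefficients from the DGP description.
COEFS = {
    "ln_gdp": -7.10, "ln_gdp_sq": 0.81, "ln_gdp_cb": -0.03,
    "fossil_fuel": 0.015, "renewable": -0.010, "urban": 0.007,
    "democracy": -0.005, "industry": 0.010,
}

# Sample means copied from the statistics table (rounded as published).
MEANS = {
    "ln_gdp": 9.58, "ln_gdp_sq": 93.62, "ln_gdp_cb": 931.1,
    "fossil_fuel": 54.77, "renewable": 29.54, "urban": 53.67,
    "democracy": 2.33, "industry": 24.64,
}

# Systematic part of ln(CO2) at the means, ignoring alpha_i, delta_t, eps_it.
predicted_mean = sum(COEFS[name] * MEANS[name] for name in COEFS)
reported_mean = -19.04  # mean of ln_co2 in the statistics table

print(round(predicted_mean, 2))                  # -18.98
print(round(predicted_mean - reported_mean, 2))  # 0.06
```

The gap of about 0.06 log points is small relative to the reported SD of 0.786, which is consistent with the omitted terms averaging close to zero.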
- GDP polynomial: ln_gdp_sq = ln_gdp², ln_gdp_cb = ln_gdp³. Together with ln_gdp, these linear, squared, and cubic terms trace the inverted-N.
- Turning points (income levels where emissions change direction): x* = [-β₂ ± √(β₂² - 3β₁β₃)] / (3β₃), GDP* = exp(x*).
- BMA: averages over the 2¹² = 4,096 model space. Each variable’s posterior inclusion probability (PIP) is PIP_j = Σ_{k: x_j ∈ M_k} P(M_k | data), the posterior weight on models containing it (robustness threshold PIP ≥ 0.80). The posterior mean is β̂_j = Σ_k P(M_k | data)·β̂_{j,k}.
- Double-Selection LASSO: LASSO-select controls for the outcome, then for each variable of interest; take the union; then run OLS of the outcome on the variables of interest plus that union (cluster-robust SE). This "select, then regress" sequence protects against omitted-variable bias.
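The turning-point formula in the list above can be checked against the designed values. This short sketch sets the derivative of the cubic to zero, solves the resulting quadratic, and converts the roots from logs to dollars; the function name `turning_points` is illustrative and not taken from the tutorial's do-files.

```python
import math

def turning_points(b1, b2, b3):
    """Roots of d/dx (b1*x + b2*x^2 + b3*x^3) = b1 + 2*b2*x + 3*b3*x^2 = 0."""
    disc = b2 ** 2 - 3 * b1 * b3
    if disc < 0:
        raise ValueError("no real turning points")
    root = math.sqrt(disc)
    xs = sorted([(-b2 + root) / (3 * b3), (-b2 - root) / (3 * b3)])
    return [(x, math.exp(x)) for x in xs]

for x_star, gdp_star in turning_points(-7.10, 0.81, -0.03):
    print(f"ln(GDP)* = {x_star:.3f}  ->  GDP* = ${gdp_star:,.0f}")
# ln(GDP)* = 7.547  ->  GDP* = $1,895
# ln(GDP)* = 10.453  ->  GDP* = $34,647
```

With β₃ < 0 the slope is negative below the first root, positive between the two roots, and negative above the second: emissions fall, rise, then fall again, which is the inverted-N.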
The true predictors and several noise variables are deliberately correlated with GDP in the DGP
(e.g. globalization, services strongly), so a naive regression would flag
the noise as "significant"; the task for BMA and DSL is to see through that correlation.
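To make the PIP and posterior-mean formulas from the BMA item concrete, here is a toy calculation over a model space of two hypothetical controls, x1 and x2. The posterior model probabilities and coefficients are invented for illustration; they are not output from bmaregress or from this dataset.

```python
# Hypothetical posterior over a 2-control model space (2**2 = 4 models).
# Each entry: (controls in the model, P(M_k | data), coefficient estimates).
MODELS = [
    ((),           0.05, {}),
    (("x1",),      0.60, {"x1": 0.020}),
    (("x2",),      0.05, {"x2": 0.004}),
    (("x1", "x2"), 0.30, {"x1": 0.018, "x2": 0.001}),
]

def pip(var):
    """Posterior inclusion probability: total weight on models containing var."""
    return sum(p for members, p, _ in MODELS if var in members)

def posterior_mean(var):
    """Model-averaged coefficient; a model that omits var contributes 0."""
    return sum(p * coefs.get(var, 0.0) for _, p, coefs in MODELS)

for var in ("x1", "x2"):
    verdict = "robust" if pip(var) >= 0.80 else "not robust"
    print(var, round(pip(var), 2), round(posterior_mean(var), 4), verdict)
# x1 0.9 0.0174 robust
# x2 0.35 0.0005 not robust
```

In this toy case x1 clears the 0.80 threshold because most of the posterior weight sits on models that include it, while x2 does not.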
The datasets
Each dataset is documented with its full variable dictionary plus a statistics table with data coverage.
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| country_id | identifier | Country ID | Synthetic country identifier (1-80); no real-country mapping. | Sequential 1..80 assigned at panel creation (gen country_id = _n). | integer (1-80) | Simulation | 80 countries |
| year | year | Calendar year | Annual time index, 1995-2014. | 1995 + sequence within country (20 years per country, balanced). | year | Simulation | 1995-2014 |
| ln_co2 | continuous | CO2 per capita (log) | Outcome variable: natural log of CO2 emissions per capita. | Generated from the DGP: b1*ln_gdp + b2*ln_gdp_sq + b3*ln_gdp_cb + 5 true controls + country FE + year FE + N(0,0.15) noise. | log | Simulation (DGP outcome) | country-year |
| ln_gdp | continuous | GDP per capita (log) | Log GDP per capita; the income axis of the EKC. | Country baseline (uniform 7.0-11.5) + annual growth*(year-1995) + N(0,0.05) noise. | log international $ | Simulation | country-year |
| ln_gdp_sq | continuous | GDP per capita squared (log) | Quadratic GDP term for the EKC polynomial. | ln_gdp^2. | log^2 | Derived | country-year |
| ln_gdp_cb | continuous | GDP per capita cubed (log) | Cubic GDP term (inverted-N) for the EKC polynomial. | ln_gdp^3. | log^3 | Derived | country-year |
| fossil_fuel | continuous | Fossil fuel share (%) | TRUE predictor (coef +0.015): fossil-fuel share of energy; more fossil fuels -> more CO2. | Country base (correlated with GDP) + N(0,3) noise - 0.3*(year-1995), bounded to [5,95]. | % (5-95) | Simulation (true predictor) | country-year |
| renewable | continuous | Renewable energy (%) | TRUE predictor (coef -0.010): renewable share of energy; more renewables -> less CO2. | Country base (negatively correlated with GDP) + N(0,2) + 0.4*(year-1995), bounded to [1,80]. | % (1-80) | Simulation (true predictor) | country-year |
| urban | continuous | Urban population (%) | TRUE predictor (coef +0.007, weak): urbanization rate; more urban -> more CO2. | Country base (correlated with GDP) + N(0,1.5) + 0.3*(year-1995), bounded to [10,95]. | % (10-95) | Simulation (true predictor) | country-year |
| globalization | continuous | Globalization index | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | Country base (strong GDP corr) + N(0,3) + 0.2*(year-1995), bounded to [20,95]. | index (20-95) | Simulation (noise) | country-year |
| pop_density | continuous | Population density | NOISE control (true coef 0): population density; no GDP correlation. | Log-normal base exp(N(4,1.2)) * (1+0.01*(year-1995)) + N(0,5), floored at 1. | persons per km^2 | Simulation (noise) | country-year |
| democracy | continuous | Democracy score | TRUE predictor (coef -0.005, weak): democracy index (-10 to 10); more democracy -> less CO2. | Country base (uniform -5..10) + N(0,0.5), bounded to [-10,10]. | index (-10 to 10) | Simulation (true predictor) | country-year |
| corruption | continuous | Corruption index | NOISE control (true coef 0): corruption score; no GDP correlation. | Country base (uniform 0-100) + N(0,5), bounded to [0,100]. | index (0-100) | Simulation (noise) | country-year |
| industry | continuous | Industry VA (% GDP) | TRUE predictor (coef +0.010): industry value added share; more industry -> more CO2. | Country base (correlated with GDP) + N(0,2) - 0.1*(year-1995), bounded to [5,60]. | % of GDP (5-60) | Simulation (true predictor) | country-year |
| services | continuous | Services VA (% GDP) | NOISE control (true coef 0): tricky decoy, strongly correlated with GDP. | Country base (strong GDP corr) + N(0,2) + 0.2*(year-1995), bounded to [10,80]. | % of GDP (10-80) | Simulation (noise) | country-year |
| trade | continuous | Trade openness (% GDP) | NOISE control (true coef 0): trade openness; moderately correlated with GDP. | Country base (moderate GDP corr) + N(0,5), bounded to [10,200]. | % of GDP (10-200) | Simulation (noise) | country-year |
| fdi | continuous | FDI inflows (% GDP) | NOISE control (true coef 0): foreign direct investment inflows; no GDP correlation. | Country base N(3,4) + N(0,2). | % of GDP | Simulation (noise) | country-year |
| credit | continuous | Domestic credit (% GDP) | NOISE control (true coef 0): domestic credit; moderately correlated with GDP. | Country base (moderate GDP corr) + N(0,5) + 0.3*(year-1995), floored at 5. | % of GDP | Simulation (noise) | country-year |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country_id | 100% | 1,600 | 80 | — | — | — | — | — |
| year | 100% | 1,600 | 20 | 1995 | 2004.5 | 2004 | 2014 | 5.77 |
| ln_co2 | 100% | 1,600 | 1,599 | -21.04 | -19.04 | -18.98 | -16.83 | 0.786 |
| ln_gdp | 100% | 1,600 | 1,600 | 6.97 | 9.58 | 9.60 | 11.97 | 1.33 |
| ln_gdp_sq | 100% | 1,600 | 1,600 | 48.64 | 93.62 | 92.09 | 143.3 | 25.55 |
| ln_gdp_cb | 100% | 1,600 | 1,600 | 339.2 | 931.1 | 883.8 | 1,715.2 | 373.8 |
| fossil_fuel | 100% | 1,600 | 1,593 | 6.37 | 54.77 | 53.76 | 95.00 | 19.14 |
| renewable | 100% | 1,600 | 1,600 | 1.00 | 29.54 | 28.93 | 64.22 | 11.97 |
| urban | 100% | 1,600 | 1,600 | 15.95 | 53.67 | 53.30 | 91.63 | 14.78 |
| globalization | 100% | 1,600 | 1,595 | 26.76 | 57.65 | 56.45 | 95.00 | 12.72 |
| pop_density | 100% | 1,600 | 1,571 | 1.00 | 121.3 | 51.72 | 1,571.8 | 210.3 |
| democracy | 100% | 1,600 | 1,594 | -6.12 | 2.33 | 2.26 | 10.00 | 4.18 |
| corruption | 100% | 1,600 | 1,554 | 0 | 52.35 | 50.51 | 100.0 | 28.53 |
| industry | 100% | 1,600 | 1,600 | 5.84 | 24.64 | 24.61 | 45.33 | 6.18 |
| services | 100% | 1,600 | 1,600 | 17.83 | 43.56 | 43.43 | 64.07 | 9.37 |
| trade | 100% | 1,600 | 1,600 | 10.04 | 67.44 | 68.08 | 128.1 | 19.36 |
| fdi | 100% | 1,600 | 1,600 | -11.50 | 2.98 | 2.96 | 16.20 | 4.37 |
| credit | 100% | 1,600 | 1,600 | 11.33 | 53.44 | 51.39 | 123.2 | 18.20 |
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial. Every value is simulated from a known data-generating process; results are internally consistent with the calibration but are NOT empirical evidence about real-world CO2 emissions or the EKC.
- Inspired by, not identical to. The panel is inspired by Gravina & Lanzafame (2025) but the values are fully synthetic and do not reproduce that study's data.
- Known answer key. Exactly 5 controls (fossil_fuel, renewable, urban, democracy, industry) have a true non-zero effect; the other 7 are pure noise with a true coefficient of zero. Use this only to grade method accuracy, not to draw substantive conclusions.
- Weak signals are hard. Two true controls (urban +0.007, democracy -0.005) have small coefficients and fall below the PIP ≥ 0.80 threshold even though they belong in the model. This is a realistic limitation, not a data error.
- Fixed effects matter. The cubic-EKC recovery and zero-false-positive selection hold only with country and year fixed effects; pooled (no-FE) estimates inflate the GDP coefficient 2–3× and generate false positives.
- Country IDs are arbitrary. country_id values 1–80 are synthetic identifiers with no real-country mapping; the CSV rows are not sorted by id.