Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| sim_resource_curse | district-year | 3,000 × 18 | sim_resource_curse.dta | sim_resource_curse.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")
# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that estimates heterogeneous causal effects of mining and mineral prices on economic development using EconML's CausalForestDML — a Double Machine Learning causal forest. The dataset is fully synthetic with a known data-generating process, so estimates can be checked against ground truth. Its structure mirrors Hodler, Lechner & Raschky (2023): 300 districts across 8 countries observed annually over 2003–2012 (3,000 district-year rows). Treatment has four levels — no mining (0) and mining at low (1), medium (2), and high (3) mineral prices — and is heavily imbalanced at 85%/5%/5%/5%. The outcome is log nighttime lights. The forest recovers an ATE of 0.240 for the basic mining effect (true 0.250), a non-linear price gradient, and the finding that institutions moderate the mining margin but not the price margin. The entire DGP is open and reproducible.
sim_resource_curse is a balanced annual district panel (one row per district × year) covering 300 districts in 8 countries over 2003–2012. It carries the four-level treatment, the binary mining flag and price index, two outcomes (log nighttime lights and a conflict dummy), and the geographic, institutional, and demographic covariates used as heterogeneity features (X) and first-stage controls (W).
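The panel structure described above can be checked with a few lines of pandas. The sketch below does not load the real file: it builds a hypothetical stand-in frame with the documented shape (300 districts, 2003–2012, treatment counts 2,550/150/150/150) and assigns treatment levels to rows at random, which is an assumption for illustration and not the dataset's actual assignment. The helper `is_balanced` is also hypothetical; the same checks should apply unchanged to the `df` loaded from the .dta file.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the documented shape: 300 districts x 10 years.
districts = np.arange(1, 301)
years = np.arange(2003, 2013)
panel = pd.MultiIndex.from_product(
    [districts, years], names=["district_id", "year"]
).to_frame(index=False)

# ASSUMPTION: treatment levels are shuffled across rows purely for illustration;
# only the counts (2,550 / 150 / 150 / 150) come from the documentation.
levels = np.repeat([0, 1, 2, 3], [2550, 150, 150, 150])
panel["treatment"] = np.random.default_rng(0).permutation(levels)

# The binary mining flag is defined as treatment >= 1.
panel["mining"] = (panel["treatment"] >= 1).astype(int)


def is_balanced(df, unit="district_id", time="year"):
    """True if every unit appears exactly once in every time period."""
    no_duplicates = not df.duplicated([unit, time]).any()
    periods_per_unit = df.groupby(unit)[time].nunique()
    return bool(no_duplicates and (periods_per_unit == df[time].nunique()).all())


shares = panel["treatment"].value_counts(normalize=True).sort_index()
print("balanced:", is_balanced(panel))
print(shares)
```

Running the same checks on the real data is a quick way to confirm that the download matches the documented grain before fitting any model.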
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated via a calibrated data-generating process with known ground-truth ATEs (open & reproducible) | Mendez, C. (2026). See the post's Python script.py for the full DGP and ground-truth parameters. |
| Hodler, Lechner & Raschky (2023) | Structural template the simulation mirrors (district panel, mining treatment, NTL outcome, institutional moderation) | Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968. https://doi.org/10.1371/journal.pone.0284968 |
| Method references | Estimators and concepts | Microsoft EconML (CausalForestDML); Chernozhukov et al. (2018, double/debiased ML); Wager & Athey (2018, causal forests); Robinson (1988); Athey, Tibshirani & Wager (2019, generalized random forests / BLB). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Python EconML [Data set]. https://carlos-mendez.org/post/python_EconML/
Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.
BibTeX
@misc{mendez2026pythoneconml,
author = {Mendez, Carlos},
title = {Causal Machine Learning and the Resource Curse with Python EconML},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_EconML/}},
note = {Data set}
}
@article{hodler2023institutions,
author = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
title = {Institutions and the resource curse: New insights from causal machine learning},
journal = {PLoS ONE},
volume = {18}, number = {6}, pages = {e0284968}, year = {2023},
doi = {10.1371/journal.pone.0284968}
}
@article{chernozhukov2018dml,
author = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther
and Hansen, Christian and Newey, Whitney and Robins, James},
title = {Double/debiased machine learning for treatment and structural parameters},
journal = {The Econometrics Journal},
volume = {21}, number = {1}, pages = {C1--C68}, year = {2018},
doi = {10.1111/ectj.12097}
}
@article{wager2018estimation,
author = {Wager, Stefan and Athey, Susan},
title = {Estimation and inference of heterogeneous treatment effects using random forests},
journal = {Journal of the American Statistical Association},
volume = {113}, number = {523}, pages = {1228--1242}, year = {2018},
doi = {10.1080/01621459.2017.1319839}
}
Variable explorer: all 18 variables
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| agri_suitability | continuous | Agricultural suitability | Index of land suitability for agriculture; a geographic X feature. | 0-1 | sim_resource_curse | Simulation |
| conflict | dummy | Conflict incidence (1=yes) | Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post). | 0/1 | sim_resource_curse | Simulation |
| country_id | identifier | Country identifier | Country to which the district belongs; used as a first-stage control (W). | integer ID | sim_resource_curse | Simulation |
| distance_capital | continuous | Distance to capital (m) | Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance). | meters | sim_resource_curse | Simulation |
| district_id | identifier | District identifier | Synthetic district ID; the panel unit observed across years. | integer ID | sim_resource_curse | Simulation |
| elevation | continuous | Elevation (m) | District mean elevation above sea level; a geographic X feature. | meters | sim_resource_curse | Simulation |
| ethnic_frac | continuous | Ethnic fractionalization | Index of ethnic/linguistic heterogeneity within the district; a demographic X feature. | 0-1 | sim_resource_curse | Simulation |
| exec_constraints | identifier | Constraints on the executive (1-6) | Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect. | 1-6 scale | sim_resource_curse | Simulation |
| gdp_pc | continuous | GDP per capita | District/country GDP per capita; an X heterogeneity feature. | US$ | sim_resource_curse | Simulation |
| mining | dummy | Mining active (1=yes) | Binary flag: whether the district-year has active mining (treatment >= 1). | 0/1 | sim_resource_curse | Simulation |
| ntl_log | continuous | Log nighttime lights (outcome) | Natural log of nighttime light intensity; the headline development outcome. | log NTL | sim_resource_curse | Simulation |
| population | continuous | Population | District population; a demographic X feature. | people | sim_resource_curse | Simulation |
| price_index | continuous | Mineral price index | Continuous mineral-price index that defines the mining treatment level (0 when no mining). | index (>=0) | sim_resource_curse | Simulation |
| quality_of_govt | continuous | Quality of government | Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding. | 0-1 | sim_resource_curse | Simulation |
| ruggedness | continuous | Terrain ruggedness | Terrain ruggedness index; a geographic X feature. | index (>=0) | sim_resource_curse | Simulation |
| temperature | continuous | Temperature (°C) | District mean temperature; a geographic X feature. | degrees Celsius | sim_resource_curse | Simulation |
| treatment | identifier | Treatment level (0-3) | Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices. | 0-3 (category) | sim_resource_curse | Simulation |
| year | year | Calendar year | Annual time index; used as a first-stage control (W). | year | sim_resource_curse | Simulation |
Construction & formulas
The estimator is EconML's CausalForestDML, a Double Machine Learning (DML) causal forest applied to the four-level treatment with log nighttime lights as the outcome.
- Partially linear model (Robinson 1988; Chernozhukov et al. 2018): Y = τ(X)·T + g₀(X,W) + ε and T = m₀(X,W) + v, where g₀ and m₀ are nuisance conditional means estimated by Gradient Boosting.
- Residualization: Ỹ = Y − E[Y|X,W], T̃ = T − m₀(X,W), then Ỹ = τ(X)·T̃ + ε, which is the Frisch–Waugh–Lovell logic. The causal forest is the second-stage learner that estimates the covariate-dependent slope τ(X).
- CATE: τ(x) = E[Y(1) − Y(0) | X = x], the per-profile treatment effect (here estimated for each pairwise treatment contrast).
- GATE: GATE_g = (1/n_g) · Σ_{i∈g} τ̂(Xᵢ), the CATE averaged over a pre-specified subgroup; SE = sqrt(mean(seᵢ²)/n_g) from the per-observation Bootstrap-of-Little-Bags standard errors.
- ATE: E[τ(X)], the overall average, reported per pairwise contrast with 90% BLB confidence intervals.
- Honest causal forest (Wager & Athey 2018): each tree uses one subsample to choose splits and a disjoint subsample to estimate leaf means; 5-fold cross-fitting via GroupKFold on district_id blocks within-district leakage.
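The residualization and GATE formulas in the list above can be illustrated with a small NumPy sketch. It is deliberately simplified and is not the post's pipeline: the toy data are hypothetical, the effect τ is a constant rather than a function of X, the nuisance means are fitted by ordinary least squares instead of Gradient Boosting, and there is no cross-fitting or forest. The helper names `residualize` and `gate` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical toy data: a single control W confounds both T and Y.
W = rng.normal(size=n)
T = 0.8 * W + rng.normal(size=n)                      # T = m0(W) + v
tau_true = 0.25                                       # constant effect (assumption)
Y = tau_true * T + 1.5 * W + rng.normal(scale=0.25, size=n)


def residualize(target, controls):
    """Subtract the least-squares fit of target on [1, controls]."""
    design = np.column_stack([np.ones(len(controls)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


Y_res = residualize(Y, W)   # stands in for Y - E[Y | X, W]
T_res = residualize(T, W)   # stands in for T - m0(X, W)

# Second stage: slope of residualized Y on residualized T (Frisch-Waugh-Lovell).
tau_hat = (T_res @ Y_res) / (T_res @ T_res)

# For comparison: regressing Y on T alone ignores W and is confounded.
naive = np.cov(T, Y)[0, 1] / np.var(T, ddof=1)

print(f"residualized slope: {tau_hat:.3f}   naive slope: {naive:.3f}")


def gate(tau_i, se_i):
    """Group average of per-observation effects and SE = sqrt(mean(se_i^2) / n_g)."""
    tau_i, se_i = np.asarray(tau_i), np.asarray(se_i)
    return tau_i.mean(), np.sqrt(np.mean(se_i**2) / len(tau_i))


# Hypothetical per-observation CATEs and standard errors for one subgroup.
g_mean, g_se = gate([0.20, 0.30, 0.25, 0.25], [0.10, 0.10, 0.10, 0.10])
print(f"GATE: {g_mean:.3f}  SE: {g_se:.3f}")
```

The residualized slope should land close to the true 0.25 while the naive slope is pulled far away by the confounder, which is the reason the first stage exists. In the causal forest, the single second-stage slope is replaced by a covariate-dependent τ(X).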
Synthetic data-generating process (ground truth, from script.py): the log-NTL
effect of mining is 0.25 at mean institutions with institutional moderation
0.15; the medium-price premium is 0.05 (small) and the high-price
premium is 0.30 (large), yielding ground-truth ATEs of 1-0 = 0.25, 2-0 = 0.30,
3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25. Outcome noise SD is 0.25. A parallel
conflict process (mining base 0.70, institutional dampening −0.50, price premia 0.15/0.50, base
rate 0.12) generates the conflict dummy. Because every confounder is built in, the Conditional
Independence Assumption holds by construction.
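The six ground-truth contrasts follow arithmetically from the base mining effect and the price premia quoted above. A minimal sketch of that arithmetic is given below; the zero premium at the low price level is implied by the 1-0 contrast equalling the base effect, and the variable names are illustrative rather than taken from script.py. The sketch uses the effect at mean institutions, so the institutional moderation term does not enter.

```python
from itertools import combinations

MINING_BASE = 0.25                           # log-NTL effect of mining at mean institutions
PRICE_PREMIUM = {1: 0.00, 2: 0.05, 3: 0.30}  # low (implied zero), medium, high

# Effect of each treatment level relative to no mining (level 0).
level_effect = {0: 0.0}
level_effect.update({lvl: MINING_BASE + prem for lvl, prem in PRICE_PREMIUM.items()})

# All pairwise contrasts "higher-lower", rounded to two decimals.
contrasts = {
    f"{hi}-{lo}": round(level_effect[hi] - level_effect[lo], 2)
    for lo, hi in combinations(sorted(level_effect), 2)
}

for name, value in contrasts.items():
    print(name, value)
```

These values are the benchmarks against which the estimated ATEs can be compared, for example the forest's 0.240 against the true 0.25 for the 1-0 contrast.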
The datasets
Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| district_id (identifier) | District identifier | Synthetic district ID; the panel unit observed across years. | 1..300, one per district (each district appears in all 10 years). | integer ID | Simulation | 300 districts |
| country_id (identifier) | Country identifier | Country to which the district belongs; used as a first-stage control (W). | 1..8; districts are nested within 8 countries. | integer ID | Simulation | 8 countries |
| year (year) | Calendar year | Annual time index; used as a first-stage control (W). | 2003-2012 (balanced; every district observed in each year). | year | Simulation | 2003-2012 |
| treatment (identifier) | Treatment level (0-3) | Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices. | 0 = no mining; 1/2/3 = mining at low/medium/high prices. Heavily imbalanced (85/5/5/5). | 0-3 (category) | Simulation | 2,550 / 150 / 150 / 150 rows |
| mining (dummy) | Mining active (1=yes) | Binary flag: whether the district-year has active mining (treatment >= 1). | 1 if treatment in {1,2,3}, else 0. | 0/1 | Simulation | 450 mining district-years |
| price_index (continuous) | Mineral price index | Continuous mineral-price index that defines the mining treatment level (0 when no mining). | 0 for untreated rows; positive and increasing across low/medium/high price levels. | index (>=0) | Simulation | 0 for non-mining; >0 for mining |
| exec_constraints (identifier) | Constraints on the executive (1-6) | Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect. | Discrete 1-6 scale assigned per country/district (X feature). | 1-6 scale | Simulation | 6 levels |
| quality_of_govt (continuous) | Quality of government | Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding. | Country-level index in [0.22, 0.70] (X feature). | 0-1 | Simulation | 8 distinct values |
| gdp_pc (continuous) | GDP per capita | District/country GDP per capita; an X heterogeneity feature. | Country-level value in [500, 5000] (8 distinct values). | US$ | Simulation | 8 distinct values |
| elevation (continuous) | Elevation (m) | District mean elevation above sea level; a geographic X feature. | Simulated per district (time-invariant). | meters | Simulation | 300 districts |
| temperature (continuous) | Temperature (°C) | District mean temperature; a geographic X feature. | Simulated per district (time-invariant). | degrees Celsius | Simulation | 300 districts |
| ruggedness (continuous) | Terrain ruggedness | Terrain ruggedness index; a geographic X feature. | Simulated per district (time-invariant). | index (>=0) | Simulation | 300 districts |
| distance_capital (continuous) | Distance to capital (m) | Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance). | Simulated per district (time-invariant). | meters | Simulation | 300 districts |
| agri_suitability (continuous) | Agricultural suitability | Index of land suitability for agriculture; a geographic X feature. | Simulated per district in [0, ~1] (time-invariant). | 0-1 | Simulation | 300 districts |
| population (continuous) | Population | District population; a demographic X feature. | Simulated per district (time-invariant). | people | Simulation | 300 districts |
| ethnic_frac (continuous) | Ethnic fractionalization | Index of ethnic/linguistic heterogeneity within the district; a demographic X feature. | Simulated per district in [~0.20, ~0.90] (time-invariant). | 0-1 | Simulation | 300 districts |
| ntl_log (continuous) | Log nighttime lights (outcome) | Natural log of nighttime light intensity; the headline development outcome. | Simulated from the DGP: mining base effect + institutional moderation + price premia + mean-zero noise (SD 0.25). | log NTL | Simulation | all 3,000 rows |
| conflict (dummy) | Conflict incidence (1=yes) | Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post). | Simulated from a parallel process: mining base 0.70, institutional dampening -0.50, price premia 0.15/0.50, base rate 0.12. | 0/1 | Simulation | 369 conflict district-years |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| district_id | 100% | 3,000 | 300 | – | – | – | – | – |
| country_id | 100% | 3,000 | 8 | – | – | – | – | – |
| year | 100% | 3,000 | 10 | 2003 | 2007.5 | 2007 | 2012 | 2.87 |
| treatment | 100% | 3,000 | 4 | – | – | – | – | – |
| mining | 100% | 3,000 | 2 | 0 | 0.150 | 0 | 1.00 | 0.357 |
| price_index | 100% | 3,000 | 451 | 0 | 0.123 | 0 | 1.81 | 0.324 |
| exec_constraints | 100% | 3,000 | 6 | – | – | – | – | – |
| quality_of_govt | 100% | 3,000 | 8 | 0.220 | 0.440 | 0.420 | 0.700 | 0.152 |
| gdp_pc | 100% | 3,000 | 8 | 500.0 | 2,198.0 | 1,800.0 | 5,000.0 | 1,469.9 |
| elevation | 100% | 3,000 | 281 | 0 | 499.1 | 502.2 | 1,357.2 | 302.0 |
| temperature | 100% | 3,000 | 298 | 13.99 | 23.91 | 24.04 | 35.00 | 3.92 |
| ruggedness | 100% | 3,000 | 261 | 0 | 24.42 | 24.12 | 76.95 | 17.80 |
| distance_capital | 100% | 3,000 | 300 | 10,814 | 268,100 | 262,877 | 497,464 | 144,040 |
| agri_suitability | 100% | 3,000 | 290 | 0 | 0.395 | 0.393 | 0.983 | 0.197 |
| population | 100% | 3,000 | 300 | 4,134.7 | 82,028 | 58,886 | 596,950 | 85,187 |
| ethnic_frac | 100% | 3,000 | 300 | 0.201 | 0.550 | 0.537 | 0.899 | 0.202 |
| ntl_log | 100% | 3,000 | 3,000 | -2.50 | -1.10 | -1.10 | 0.265 | 0.435 |
| conflict | 100% | 3,000 | 2 | 0 | 0.123 | 0 | 1.00 | 0.328 |
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial; results are internally consistent with the calibration but are not empirical evidence about real-world mining, prices, or development.
- Heavily imbalanced treatment. 85% of rows are untreated (no mining) and only 5% fall in each mining level (150 rows each); within-mining price contrasts (e.g. 3-1) draw on just 300 observations, so price-effect standard errors are large.
- No clustered standard errors. EconML's inference=True reports forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent; with this panel (same district across years) true clustered SEs would typically be larger. GroupKFold by district prevents first-stage leakage but does not cluster second-stage variance.
- Contemporaneous outcomes. Treatment and outcome are measured in the same year; the structural template (Hodler, Lechner & Raschky 2023) lags the outcome to t+1 to rule out reverse causality.
- Identification rests on the CIA. The Conditional Independence Assumption holds here by construction; in real data it is untestable and a causal forest does not relax it.
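The clustering caveat in the list above can be made concrete with a toy comparison of independent-observation and district-clustered standard errors for a simple sample mean. This is only an analogy under stated assumptions: the data are hypothetical draws with a shared district-level shock, the clustered formula is the basic sandwich form without a small-sample correction, and nothing here reproduces EconML's Bootstrap-of-Little-Bags computation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_districts, n_years = 300, 10

# Hypothetical values: a district-level shock shared across years plus row noise.
district_shock = rng.normal(scale=1.0, size=n_districts)
values = np.repeat(district_shock, n_years) + rng.normal(size=n_districts * n_years)
groups = np.repeat(np.arange(n_districts), n_years)

n = values.size
resid = values - values.mean()

# SE of the mean if all 3,000 rows were independent.
naive_se = np.sqrt(resid @ resid / (n * (n - 1)))

# Cluster-robust SE: sum residuals within each district before squaring.
cluster_sums = np.bincount(groups, weights=resid)
clustered_se = np.sqrt(cluster_sums @ cluster_sums) / n

print(f"naive SE: {naive_se:.4f}   clustered SE: {clustered_se:.4f}")
```

When rows within a district move together, the clustered SE is noticeably larger than the naive one, which is the direction of the bias this caveat warns about.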