Data dictionary · Causal Machine Learning and the Resource Curse

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows	Stata	Source
`sim_resource_curse`	district-year	3,000 × 18	sim_resource_curse.dta	sim_resource_curse.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")

# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates heterogeneous causal effects of mining and mineral prices on economic development using EconML's CausalForestDML — a Double Machine Learning causal forest. The dataset is fully synthetic with a known data-generating process, so estimates can be checked against ground truth. Its structure mirrors Hodler, Lechner & Raschky (2023): 300 districts across 8 countries observed annually over 2003–2012 (3,000 district-year rows). Treatment has four levels — no mining (0) and mining at low (1), medium (2), and high (3) mineral prices — and is heavily imbalanced at 85%/5%/5%/5%. The outcome is log nighttime lights. The forest recovers an ATE of 0.240 for the basic mining effect (true 0.250), a non-linear price gradient, and the finding that institutions moderate the mining margin but not the price margin. The entire DGP is open and reproducible.

One file. sim_resource_curse is a balanced annual district panel (one row per district × year) covering 300 districts in 8 countries over 2003–2012. It carries the four-level treatment, the binary mining flag and price index, two outcomes (log nighttime lights and a conflict dummy), and the geographic, institutional, and demographic covariates used as heterogeneity features (X) and first-stage controls (W).
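The structural claims above (a unique district × year key, a balanced panel, and the 85/5/5/5 treatment split) are easy to verify after loading. The following is a minimal sketch, not the post's own code: the helper name `check_panel` is hypothetical, and the toy frame is a made-up stand-in so the block runs without a download. On the real file you would pass the loaded `df` instead.

```python
import pandas as pd


def check_panel(df, unit="district_id", time="year", treat="treatment"):
    """Return basic structural facts about a unit-by-time panel."""
    n_units = df[unit].nunique()
    n_periods = df[time].nunique()
    # The panel key must identify each row exactly once.
    key_is_unique = not df.duplicated([unit, time]).any()
    return {
        "n_units": n_units,
        "n_periods": n_periods,
        "key_is_unique": key_is_unique,
        # Balanced: unique key and every unit observed in every period.
        "balanced": key_is_unique and len(df) == n_units * n_periods,
        # Share of rows at each treatment level.
        "treatment_shares": df[treat]
        .value_counts(normalize=True)
        .sort_index()
        .to_dict(),
    }


# Toy stand-in for the real file: 4 districts x 3 years (values are made up).
toy = pd.DataFrame({
    "district_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "year": [2003, 2004, 2005] * 4,
    "treatment": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3],
})
report = check_panel(toy)
print(report)
```

If the description above holds, running `check_panel(df)` on sim_resource_curse should report 300 units, 10 periods, a balanced panel, and shares close to 0.85/0.05/0.05/0.05.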

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated via a calibrated data-generating process with known ground-truth ATEs (open & reproducible)	Mendez, C. (2026). See the post's Python script.py for the full DGP and ground-truth parameters.
Hodler, Lechner & Raschky (2023)	Structural template the simulation mirrors (district panel, mining treatment, NTL outcome, institutional moderation)	Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968. https://doi.org/10.1371/journal.pone.0284968
Method references	Estimators and concepts	Microsoft EconML (CausalForestDML); Chernozhukov et al. (2018, double/debiased ML); Wager & Athey (2018, causal forests); Robinson (1988); Athey, Tibshirani & Wager (2019, generalized random forests / BLB).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Python EconML [Data set]. https://carlos-mendez.org/post/python_EconML/

Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.

BibTeX

@misc{mendez2026pythoneconml,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning and the Resource Curse with Python EconML},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_EconML/}},
  note         = {Data set}
}

@article{hodler2023institutions,
  author  = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
  title   = {Institutions and the resource curse: New insights from causal machine learning},
  journal = {PLoS ONE},
  volume  = {18}, number = {6}, pages = {e0284968}, year = {2023},
  doi     = {10.1371/journal.pone.0284968}
}
@article{chernozhukov2018dml,
  author  = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther
             and Hansen, Christian and Newey, Whitney and Robins, James},
  title   = {Double/debiased machine learning for treatment and structural parameters},
  journal = {The Econometrics Journal},
  volume  = {21}, number = {1}, pages = {C1--C68}, year = {2018},
  doi     = {10.1111/ectj.12097}
}
@article{wager2018estimation,
  author  = {Wager, Stefan and Athey, Susan},
  title   = {Estimation and inference of heterogeneous treatment effects using random forests},
  journal = {Journal of the American Statistical Association},
  volume  = {113}, number = {523}, pages = {1228--1242}, year = {2018},
  doi     = {10.1080/01621459.2017.1319839}
}

Variable explorer (all 18 variables)

Variable	Type	Label	Definition	Units	In files	Source
`agri_suitability`	continuous	Agricultural suitability	Index of land suitability for agriculture; a geographic X feature.	0-1	sim_resource_curse	Simulation
`conflict`	dummy	Conflict incidence (1=yes)	Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post).	0/1	sim_resource_curse	Simulation
`country_id`	identifier	Country identifier	Country to which the district belongs; used as a first-stage control (W).	integer ID	sim_resource_curse	Simulation
`distance_capital`	continuous	Distance to capital (m)	Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance).	meters	sim_resource_curse	Simulation
`district_id`	identifier	District identifier	Synthetic district ID; the panel unit observed across years.	integer ID	sim_resource_curse	Simulation
`elevation`	continuous	Elevation (m)	District mean elevation above sea level; a geographic X feature.	meters	sim_resource_curse	Simulation
`ethnic_frac`	continuous	Ethnic fractionalization	Index of ethnic/linguistic heterogeneity within the district; a demographic X feature.	0-1	sim_resource_curse	Simulation
`exec_constraints`	identifier	Constraints on the executive (1-6)	Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect.	1-6 scale	sim_resource_curse	Simulation
`gdp_pc`	continuous	GDP per capita	District/country GDP per capita; an X heterogeneity feature.	US$	sim_resource_curse	Simulation
`mining`	dummy	Mining active (1=yes)	Binary flag: whether the district-year has active mining (treatment >= 1).	0/1	sim_resource_curse	Simulation
`ntl_log`	continuous	Log nighttime lights (outcome)	Natural log of nighttime light intensity — the headline development outcome.	log NTL	sim_resource_curse	Simulation
`population`	continuous	Population	District population; a demographic X feature.	people	sim_resource_curse	Simulation
`price_index`	continuous	Mineral price index	Continuous mineral-price index that defines the mining treatment level (0 when no mining).	index (>=0)	sim_resource_curse	Simulation
`quality_of_govt`	continuous	Quality of government	Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding.	0-1	sim_resource_curse	Simulation
`ruggedness`	continuous	Terrain ruggedness	Terrain ruggedness index; a geographic X feature.	index (>=0)	sim_resource_curse	Simulation
`temperature`	continuous	Temperature (°C)	District mean temperature; a geographic X feature.	degrees Celsius	sim_resource_curse	Simulation
`treatment`	identifier	Treatment level (0-3)	Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices.	0-3 (category)	sim_resource_curse	Simulation
`year`	year	Calendar year	Annual time index; used as a first-stage control (W).	year	sim_resource_curse	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	sim_resource_curse
`agri_suitability`	●
`conflict`	●
`country_id`	●
`distance_capital`	●
`district_id`	●
`elevation`	●
`ethnic_frac`	●
`exec_constraints`	●
`gdp_pc`	●
`mining`	●
`ntl_log`	●
`population`	●
`price_index`	●
`quality_of_govt`	●
`ruggedness`	●
`temperature`	●
`treatment`	●
`year`	●

Construction & formulas

The estimator is EconML's CausalForestDML, a Double Machine Learning (DML) causal forest applied to the four-level treatment with log nighttime lights as the outcome.

Partially linear model (Robinson 1988; Chernozhukov et al. 2018): Y = τ(X)·T + g₀(X,W) + ε and T = m₀(X,W) + v, where g₀ and m₀ are nuisance conditional means estimated by Gradient Boosting.
Residualization: Ỹ = Y − E[Y|X,W], T̃ = T − m₀(X,W), then Ỹ = τ(X)·T̃ + ε — the Frisch–Waugh–Lovell logic. The causal forest is the second-stage learner that estimates the covariate-dependent slope τ(X).
CATE: τ(x) = E[Y(1) − Y(0) | X = x] — the per-profile treatment effect (here estimated for each pairwise treatment contrast).
GATE: GATE_g = (1/n_g) · Σ_{i∈g} τ̂(Xᵢ) — the CATE averaged over a pre-specified subgroup; SE = sqrt(mean(seᵢ²)/n_g) from the per-observation Bootstrap-of-Little-Bags standard errors.
ATE: E[τ(X)] — the overall average, reported per pairwise contrast with 90% BLB confidence intervals.
Honest causal forest (Wager & Athey 2018): each tree uses one subsample to choose splits and a disjoint subsample to estimate leaf means. The first stage uses 5-fold cross-fitting with GroupKFold grouped on district_id, which keeps all years of a district in the same fold and so prevents within-district leakage.
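The residualization and GATE formulas above can be illustrated in a few lines of NumPy. This is a simplified sketch under loudly stated assumptions, not the post's pipeline: the data are simulated here with made-up coefficients, the nuisance functions are fitted by ordinary least squares instead of Gradient Boosting, there is no cross-fitting, and the effect is a constant slope rather than a forest-estimated τ(X). The helper names `ols_fitted` and `gate` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated stand-in data (all magnitudes are made up for illustration).
W = rng.normal(size=(n, 2))                               # controls
T = 0.8 * W[:, 0] - 0.5 * W[:, 1] + rng.normal(size=n)    # T = m0(W) + v
tau_true = 0.25                                           # constant slope
Y = tau_true * T + 1.5 * W[:, 0] + 0.7 * W[:, 1] + rng.normal(scale=0.25, size=n)


def ols_fitted(features, target):
    """Fitted values from a least-squares regression with an intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Stage 1: residualize outcome and treatment on the controls.
Y_res = Y - ols_fitted(W, Y)
T_res = T - ols_fitted(W, T)

# Stage 2: slope of residualized Y on residualized T (Frisch-Waugh-Lovell).
tau_hat = (T_res @ Y_res) / (T_res @ T_res)
print(f"tau_hat = {tau_hat:.3f} (simulated truth {tau_true})")


def gate(cate, se, mask):
    """Subgroup mean of per-observation CATEs, with SE = sqrt(mean(se_i^2) / n_g)."""
    n_g = mask.sum()
    return cate[mask].mean(), np.sqrt(np.mean(se[mask] ** 2) / n_g)


# Four made-up per-observation estimates; the first two form the subgroup.
cate = np.array([0.20, 0.30, 0.40, 0.10])
se = np.array([0.10, 0.10, 0.20, 0.20])
in_group = np.array([True, True, False, False])
gate_est, gate_se = gate(cate, se, in_group)
print(f"GATE = {gate_est:.3f}, SE = {gate_se:.4f}")
```

With linear nuisance models the second-stage slope coincides with the coefficient on T from a single regression of Y on T and W, which is the Frisch–Waugh–Lovell result the residualization step relies on.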

Synthetic data-generating process (ground truth, from script.py): the log-NTL effect of mining is 0.25 at mean institutions with institutional moderation 0.15; the medium-price premium is 0.05 (small) and the high-price premium is 0.30 (large), yielding ground-truth ATEs of 1-0 = 0.25, 2-0 = 0.30, 3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25. Outcome noise SD is 0.25. A parallel conflict process (mining base 0.70, institutional dampening −0.50, price premia 0.15/0.50, base rate 0.12) generates the conflict dummy. Because every confounder is built in, the Conditional Independence Assumption holds by construction.
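The six ground-truth contrasts follow arithmetically from the three DGP parameters quoted above. The sketch below reproduces that arithmetic; it assumes the institutional-moderation term averages out at mean institutions (so it does not enter the ATEs), and the variable names are illustrative rather than taken from script.py.

```python
from itertools import combinations

base_mining = 0.25                       # effect of mining at low prices (1 vs 0)
premium = {1: 0.00, 2: 0.05, 3: 0.30}    # price premium on top of the base effect

# Average effect of each treatment level relative to no mining (level 0).
level_effect = {0: 0.0}
for level in (1, 2, 3):
    level_effect[level] = base_mining + premium[level]

# Every pairwise contrast "hi-lo" is a difference of level effects.
contrasts = {
    f"{hi}-{lo}": round(level_effect[hi] - level_effect[lo], 2)
    for lo, hi in combinations(range(4), 2)
}
for name, value in contrasts.items():
    print(name, value)
```

The 3-2 contrast, for example, is the difference between the two price premia, 0.30 − 0.05 = 0.25.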

The datasets


district-year 3,000 × 18 · 2003-2012 · 300 districts, 8 countries (balanced)

Panel key: district_id × year · Estimate heterogeneous causal effects of mining and mineral prices on development (CausalForestDML / DML).

Variable dictionary

Variable and type	Label	Definition	Construction	Units	Source	Coverage
`district_id` identifier	District identifier	Synthetic district ID; the panel unit observed across years.	1..300, one per district (each district appears in all 10 years).	integer ID	Simulation	300 districts
`country_id` identifier	Country identifier	Country to which the district belongs; used as a first-stage control (W).	1..8; districts are nested within 8 countries.	integer ID	Simulation	8 countries
`year` year	Calendar year	Annual time index; used as a first-stage control (W).	2003-2012 (balanced; every district observed in each year).	year	Simulation	2003-2012
`treatment` identifier	Treatment level (0-3)	Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices.	0 = no mining; 1/2/3 = mining at low/medium/high prices. Heavily imbalanced (85/5/5/5).	0-3 (category)	Simulation	2,550 / 150 / 150 / 150 rows
`mining` dummy	Mining active (1=yes)	Binary flag: whether the district-year has active mining (treatment >= 1).	1 if treatment in {1,2,3}, else 0.	0/1	Simulation	450 mining district-years
`price_index` continuous	Mineral price index	Continuous mineral-price index that defines the mining treatment level (0 when no mining).	0 for untreated rows; positive and increasing across low/medium/high price levels.	index (>=0)	Simulation	0 for non-mining; >0 for mining
`exec_constraints` identifier	Constraints on the executive (1-6)	Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect.	Discrete 1-6 scale assigned per country/district (X feature).	1-6 scale	Simulation	6 levels
`quality_of_govt` continuous	Quality of government	Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding.	Country-level index in [0.22, 0.70] (X feature).	0-1	Simulation	8 distinct values
`gdp_pc` continuous	GDP per capita	District/country GDP per capita; an X heterogeneity feature.	Country-level value in [500, 5000] (8 distinct values).	US$	Simulation	8 distinct values
`elevation` continuous	Elevation (m)	District mean elevation above sea level; a geographic X feature.	Simulated per district (time-invariant).	meters	Simulation	300 districts
`temperature` continuous	Temperature (°C)	District mean temperature; a geographic X feature.	Simulated per district (time-invariant).	degrees Celsius	Simulation	300 districts
`ruggedness` continuous	Terrain ruggedness	Terrain ruggedness index; a geographic X feature.	Simulated per district (time-invariant).	index (>=0)	Simulation	300 districts
`distance_capital` continuous	Distance to capital (m)	Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance).	Simulated per district (time-invariant).	meters	Simulation	300 districts
`agri_suitability` continuous	Agricultural suitability	Index of land suitability for agriculture; a geographic X feature.	Simulated per district in [0, ~1] (time-invariant).	0-1	Simulation	300 districts
`population` continuous	Population	District population; a demographic X feature.	Simulated per district (time-invariant).	people	Simulation	300 districts
`ethnic_frac` continuous	Ethnic fractionalization	Index of ethnic/linguistic heterogeneity within the district; a demographic X feature.	Simulated per district in [~0.20, ~0.90] (time-invariant).	0-1	Simulation	300 districts
`ntl_log` continuous	Log nighttime lights (outcome)	Natural log of nighttime light intensity — the headline development outcome.	Simulated from the DGP: mining base effect + institutional moderation + price premia + mean-zero noise (SD 0.25).	log NTL	Simulation	all 3,000 rows
`conflict` dummy	Conflict incidence (1=yes)	Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post).	Simulated from a parallel process: mining base 0.70, institutional dampening -0.50, price premia 0.15/0.50, base rate 0.12.	0/1	Simulation	369 conflict district-years
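The Construction column implies a few coding rules that can be checked mechanically: `mining` equals 1 exactly when `treatment` is 1, 2, or 3, and `price_index` is 0 for untreated rows and positive otherwise. The sketch below is one way to test them; the function name `check_treatment_coding` is hypothetical, and the toy rows are made up so the block runs on its own.

```python
import pandas as pd

# Toy rows that follow the documented coding (price values are made up).
toy = pd.DataFrame({
    "treatment": [0, 0, 1, 2, 3],
    "mining": [0, 0, 1, 1, 1],
    "price_index": [0.0, 0.0, 0.4, 0.9, 1.6],
})


def check_treatment_coding(df):
    """True if mining and price_index agree with the treatment level."""
    treated = df["treatment"] >= 1
    mining_ok = (df["mining"] == treated.astype(int)).all()
    price_zero_ok = (df.loc[~treated, "price_index"] == 0).all()
    price_positive_ok = (df.loc[treated, "price_index"] > 0).all()
    return bool(mining_ok and price_zero_ok and price_positive_ok)


print(check_treatment_coding(toy))

# A row that breaks the rule: untreated but flagged as mining.
broken = toy.copy()
broken.loc[0, "mining"] = 1
print(check_treatment_coding(broken))
```

If value labels cause pandas to load `treatment` as a categorical, passing `convert_categoricals=False` to `pd.read_stata` keeps the numeric codes that the comparison needs.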

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`district_id`	100%	3,000	300	—	—	—	—	—
`country_id`	100%	3,000	8	—	—	—	—	—
`year`	100%	3,000	10	2003	2007.5	2007	2012	2.87
`treatment`	100%	3,000	4	—	—	—	—	—
`mining`	100%	3,000	2	0	0.150	0	1.00	0.357
`price_index`	100%	3,000	451	0	0.123	0	1.81	0.324
`exec_constraints`	100%	3,000	6	—	—	—	—	—
`quality_of_govt`	100%	3,000	8	0.220	0.440	0.420	0.700	0.152
`gdp_pc`	100%	3,000	8	500.0	2,198.0	1,800.0	5,000.0	1,469.9
`elevation`	100%	3,000	281	0	499.1	502.2	1,357.2	302.0
`temperature`	100%	3,000	298	13.99	23.91	24.04	35.00	3.92
`ruggedness`	100%	3,000	261	0	24.42	24.12	76.95	17.80
`distance_capital`	100%	3,000	300	10,814	268,100	262,877	497,464	144,040
`agri_suitability`	100%	3,000	290	0	0.395	0.393	0.983	0.197
`population`	100%	3,000	300	4,134.7	82,028	58,886	596,950	85,187
`ethnic_frac`	100%	3,000	300	0.201	0.550	0.537	0.899	0.202
`ntl_log`	100%	3,000	3,000	-2.50	-1.10	-1.10	0.265	0.435
`conflict`	100%	3,000	2	0	0.123	0	1.00	0.328

Known limitations & caveats

Synthetic data. There is no real data behind this tutorial; results are internally consistent with the calibration but are not empirical evidence about real-world mining, prices, or development.
Heavily imbalanced treatment. 85% of rows are untreated (no mining) and only 5% fall in each mining level (150 rows each); within-mining price contrasts (e.g. 3-1) draw on just 300 observations, so price-effect standard errors are large.
No clustered standard errors. EconML's inference=True reports forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent; with this panel (same district across years) true clustered SEs would typically be larger. GroupKFold by district prevents first-stage leakage but does not cluster second-stage variance.
Contemporaneous outcomes. Treatment and outcome are measured in the same year; the structural template (Hodler, Lechner & Raschky 2023) lags the outcome to t+1 to rule out reverse causality.
Identification rests on the CIA. The Conditional Independence Assumption holds here by construction; in real data it is untestable and a causal forest does not relax it.
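The clustering caveat can be made concrete with a small simulation. The sketch below is an illustration of the general point only, not a reproduction of EconML's Bootstrap-of-Little-Bags inference: it compares the naive and cluster-robust standard errors of a simple sample mean on simulated data with the same 300 × 10 panel shape, using made-up variance components.

```python
import numpy as np

rng = np.random.default_rng(42)
n_districts, n_years = 300, 10
district = np.repeat(np.arange(n_districts), n_years)

# Persistent district component plus year-specific noise (made-up magnitudes).
district_effect = rng.normal(scale=1.0, size=n_districts)
y = district_effect[district] + rng.normal(scale=0.5, size=district.size)

# Naive SE of the mean: treats all 3,000 rows as independent.
se_iid = y.std(ddof=1) / np.sqrt(y.size)

# Cluster-robust SE of the mean: sum residuals within each district first,
# so correlated rows from the same district count as one block.
resid = y - y.mean()
cluster_sums = np.bincount(district, weights=resid)
se_cluster = np.sqrt(np.sum(cluster_sums ** 2)) / y.size

print(f"naive SE     = {se_iid:.4f}")
print(f"clustered SE = {se_cluster:.4f}")
```

Because each district contributes ten strongly correlated rows in this simulation, the clustered standard error is well above the naive one, which is the direction of bias the caveat describes.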