Interactive data dictionary

Causal Machine Learning and the Resource Curse

Simulated district panel with known ground-truth treatment effects, analysed with EconML's CausalForestDML (Double Machine Learning).

300 districts · 8 countries · 2003–2012 (annual) · 3,000 district-years

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata file | Source file |
| --- | --- | --- | --- | --- |
| sim_resource_curse | district-year | 3,000 × 18 | sim_resource_curse.dta | sim_resource_curse.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes

Python

# Notebook cell (Colab/Jupyter): the leading ! runs a shell command
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")

# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_EconML/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates heterogeneous causal effects of mining and mineral prices on economic development using EconML's CausalForestDML — a Double Machine Learning causal forest. The dataset is fully synthetic with a known data-generating process, so estimates can be checked against ground truth. Its structure mirrors Hodler, Lechner & Raschky (2023): 300 districts across 8 countries observed annually over 2003–2012 (3,000 district-year rows). Treatment has four levels — no mining (0) and mining at low (1), medium (2), and high (3) mineral prices — and is heavily imbalanced at 85%/5%/5%/5%. The outcome is log nighttime lights. The forest recovers an ATE of 0.240 for the basic mining effect (true 0.250), a non-linear price gradient, and the finding that institutions moderate the mining margin but not the price margin. The entire DGP is open and reproducible.

One file. sim_resource_curse is a balanced annual district panel (one row per district × year) covering 300 districts in 8 countries over 2003–2012. It carries the four-level treatment, the binary mining flag and price index, two outcomes (log nighttime lights and a conflict dummy), and the geographic, institutional, and demographic covariates used as heterogeneity features (X) and first-stage controls (W).

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Synthetic (this study) | All values, simulated via a calibrated data-generating process with known ground-truth ATEs (open & reproducible) | Mendez, C. (2026). See the post's Python script.py for the full DGP and ground-truth parameters. |
| Hodler, Lechner & Raschky (2023) | Structural template the simulation mirrors (district panel, mining treatment, NTL outcome, institutional moderation) | Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968. https://doi.org/10.1371/journal.pone.0284968 |
| Method references | Estimators and concepts | Microsoft EconML (CausalForestDML); Chernozhukov et al. (2018, double/debiased ML); Wager & Athey (2018, causal forests); Robinson (1988); Athey, Tibshirani & Wager (2019, generalized random forests / BLB). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Python EconML [Data set]. https://carlos-mendez.org/post/python_EconML/

Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(6), e0284968.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.

BibTeX

@misc{mendez2026pythoneconml,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning and the Resource Curse with Python EconML},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_EconML/}},
  note         = {Data set}
}

@article{hodler2023institutions,
  author  = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
  title   = {Institutions and the resource curse: New insights from causal machine learning},
  journal = {PLoS ONE},
  volume  = {18}, number = {6}, pages = {e0284968}, year = {2023},
  doi     = {10.1371/journal.pone.0284968}
}
@article{chernozhukov2018dml,
  author  = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther
             and Hansen, Christian and Newey, Whitney and Robins, James},
  title   = {Double/debiased machine learning for treatment and structural parameters},
  journal = {The Econometrics Journal},
  volume  = {21}, number = {1}, pages = {C1--C68}, year = {2018},
  doi     = {10.1111/ectj.12097}
}
@article{wager2018estimation,
  author  = {Wager, Stefan and Athey, Susan},
  title   = {Estimation and inference of heterogeneous treatment effects using random forests},
  journal = {Journal of the American Statistical Association},
  volume  = {113}, number = {523}, pages = {1228--1242}, year = {2018},
  doi     = {10.1080/01621459.2017.1319839}
}

Variable explorer: all 18 variables

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| agri_suitability | continuous | min 0, median 0.393, max 0.983 | Agricultural suitability | Index of land suitability for agriculture; a geographic X feature. | 0-1 | sim_resource_curse | Simulation |
| conflict | dummy | share coded 1 = 0.123 | Conflict incidence (1=yes) | Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post). | 0/1 | sim_resource_curse | Simulation |
| country_id | identifier | | Country identifier | Country to which the district belongs; used as a first-stage control (W). | integer ID | sim_resource_curse | Simulation |
| distance_capital | continuous | min 1.08e+04, median 2.63e+05, max 4.97e+05 | Distance to capital (m) | Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance). | meters | sim_resource_curse | Simulation |
| district_id | identifier | | District identifier | Synthetic district ID; the panel unit observed across years. | integer ID | sim_resource_curse | Simulation |
| elevation | continuous | min 0, median 502, max 1.36e+03 | Elevation (m) | District mean elevation above sea level; a geographic X feature. | meters | sim_resource_curse | Simulation |
| ethnic_frac | continuous | min 0.201, median 0.537, max 0.899 | Ethnic fractionalization | Index of ethnic/linguistic heterogeneity within the district; a demographic X feature. | 0-1 | sim_resource_curse | Simulation |
| exec_constraints | identifier | | Constraints on the executive (1-6) | Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect. | 1-6 scale | sim_resource_curse | Simulation |
| gdp_pc | continuous | min 500, median 1.8e+03, max 5e+03 | GDP per capita | District/country GDP per capita; an X heterogeneity feature. | US$ | sim_resource_curse | Simulation |
| mining | dummy | share coded 1 = 0.150 | Mining active (1=yes) | Binary flag: whether the district-year has active mining (treatment >= 1). | 0/1 | sim_resource_curse | Simulation |
| ntl_log | continuous | min -2.5, median -1.1, max 0.265 | Log nighttime lights (outcome) | Natural log of nighttime light intensity; the headline development outcome. | log NTL | sim_resource_curse | Simulation |
| population | continuous | min 4.13e+03, median 5.89e+04, max 5.97e+05 | Population | District population; a demographic X feature. | people | sim_resource_curse | Simulation |
| price_index | continuous | min 0, median 0, max 1.81 | Mineral price index | Continuous mineral-price index that defines the mining treatment level (0 when no mining). | index (>=0) | sim_resource_curse | Simulation |
| quality_of_govt | continuous | min 0.22, median 0.42, max 0.7 | Quality of government | Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding. | 0-1 | sim_resource_curse | Simulation |
| ruggedness | continuous | min 0, median 24.1, max 77 | Terrain ruggedness | Terrain ruggedness index; a geographic X feature. | index (>=0) | sim_resource_curse | Simulation |
| temperature | continuous | min 14, median 24, max 35 | Temperature (°C) | District mean temperature; a geographic X feature. | degrees Celsius | sim_resource_curse | Simulation |
| treatment | identifier | | Treatment level (0-3) | Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices. | 0-3 (category) | sim_resource_curse | Simulation |
| year | year | | Calendar year | Annual time index; used as a first-stage control (W). | year | sim_resource_curse | Simulation |

Cross-file variable index

All 18 variables appear in the single file, sim_resource_curse.

Construction & formulas

The estimator is EconML's CausalForestDML, a Double Machine Learning (DML) causal forest applied to the four-level treatment with log nighttime lights as the outcome.
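As a rough illustration of how such an estimator could be set up, the sketch below assigns column roles following the X and W labels in the data dictionary and fits a CausalForestDML with discrete treatment. The nuisance models (random forests), the number of trees, the seed, and the helper names `fit_causal_forest` and `mining_ate` are assumptions made for this example; the post's script.py remains the reference for the actual settings.

```python
# Column roles as labelled in the data dictionary.
OUTCOME = "ntl_log"
TREATMENT = "treatment"
X_COLS = [  # heterogeneity features (X)
    "exec_constraints", "quality_of_govt", "gdp_pc", "elevation",
    "temperature", "ruggedness", "distance_capital", "agri_suitability",
    "population", "ethnic_frac",
]
W_COLS = ["country_id", "year"]  # first-stage controls (W)


def fit_causal_forest(df, seed=0):
    """Fit a CausalForestDML on the four-level treatment (illustrative settings)."""
    # Imported lazily so the column lists above are usable without EconML installed.
    from econml.dml import CausalForestDML
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    est = CausalForestDML(
        model_y=RandomForestRegressor(n_estimators=100, random_state=seed),
        model_t=RandomForestClassifier(n_estimators=100, random_state=seed),
        discrete_treatment=True,  # treatment is categorical (0-3)
        n_estimators=200,         # assumed value, not taken from the post
        random_state=seed,
    )
    est.fit(df[OUTCOME], df[TREATMENT], X=df[X_COLS], W=df[W_COLS])
    return est


def mining_ate(est, df):
    """Average effect of mining at low prices versus no mining (contrast 1-0)."""
    return float(est.ate(df[X_COLS], T0=0, T1=1))
```

With `df` loaded as in the Python snippet near the top of this page, `mining_ate(fit_causal_forest(df), df)` would return an estimate to compare with the ground-truth value of 0.25; the exact number depends on the settings and seed chosen here.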

Synthetic data-generating process (ground truth, from script.py): the log-NTL effect of mining is 0.25 at mean institutions with institutional moderation 0.15; the medium-price premium is 0.05 (small) and the high-price premium is 0.30 (large), yielding ground-truth ATEs of 1-0 = 0.25, 2-0 = 0.30, 3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25. Outcome noise SD is 0.25. A parallel conflict process (mining base 0.70, institutional dampening −0.50, price premia 0.15/0.50, base rate 0.12) generates the conflict dummy. Because every confounder is built in, the Conditional Independence Assumption holds by construction.
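The six ground-truth contrasts follow from the three parameters by simple addition. The short sketch below reproduces them, assuming that the low-price level carries no premium over the base mining effect and that the institutional moderation term averages out at mean institutions; the names `MINING_BASE`, `PRICE_PREMIUM`, `level_effect`, and `pairwise_ates` are introduced here for illustration only.

```python
from itertools import combinations

# Ground-truth parameters quoted above (log-NTL scale, at mean institutions).
MINING_BASE = 0.25
# Assumption: low price (level 1) is the baseline mining level with zero premium.
PRICE_PREMIUM = {1: 0.00, 2: 0.05, 3: 0.30}


def level_effect(level):
    """Effect of a treatment level relative to no mining (level 0)."""
    if level == 0:
        return 0.0
    return MINING_BASE + PRICE_PREMIUM[level]


def pairwise_ates():
    """ATE of moving from level a to level b, for every pair a < b."""
    return {
        f"{b}-{a}": round(level_effect(b) - level_effect(a), 2)
        for a, b in combinations(range(4), 2)
    }


ates = pairwise_ates()
for contrast, value in ates.items():
    print(contrast, value)
```

The printed values match the list in the paragraph above: 0.25, 0.30, and 0.55 against no mining, and 0.05, 0.30, and 0.25 between mining levels.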

The datasets

Switch datasets with the tabs. Each shows the full variable dictionary plus a sortable statistics table with mini distributions and data coverage.

district-year · 3,000 × 18 · 2003-2012 · 300 districts, 8 countries (balanced)

Panel key: district_id × year. Purpose: estimate heterogeneous causal effects of mining and mineral prices on development (CausalForestDML / DML).

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- |
| district_id (identifier) | District identifier | Synthetic district ID; the panel unit observed across years. | 1..300, one per district (each district appears in all 10 years). | integer ID | Simulation | 300 districts |
| country_id (identifier) | Country identifier | Country to which the district belongs; used as a first-stage control (W). | 1..8; districts are nested within 8 countries. | integer ID | Simulation | 8 countries |
| year (year) | Calendar year | Annual time index; used as a first-stage control (W). | 2003-2012 (balanced; every district observed in each year). | year | Simulation | 2003-2012 |
| treatment (identifier) | Treatment level (0-3) | Four-level discrete treatment: no mining vs mining at low/medium/high mineral prices. | 0 = no mining; 1/2/3 = mining at low/medium/high prices. Heavily imbalanced (85/5/5/5). | 0-3 (category) | Simulation | 2,550 / 150 / 150 / 150 rows |
| mining (dummy) | Mining active (1=yes) | Binary flag: whether the district-year has active mining (treatment >= 1). | 1 if treatment in {1,2,3}, else 0. | 0/1 | Simulation | 450 mining district-years |
| price_index (continuous) | Mineral price index | Continuous mineral-price index that defines the mining treatment level (0 when no mining). | 0 for untreated rows; positive and increasing across low/medium/high price levels. | index (>=0) | Simulation | 0 for non-mining; >0 for mining |
| exec_constraints (identifier) | Constraints on the executive (1-6) | Institutional-quality measure: strength of constraints on executive power; a hypothesized moderator of the mining effect. | Discrete 1-6 scale assigned per country/district (X feature). | 1-6 scale | Simulation | 6 levels |
| quality_of_govt (continuous) | Quality of government | Continuous institutional-quality index; an alternative moderator used to cross-validate the executive-constraints finding. | Country-level index in [0.22, 0.70] (X feature). | 0-1 | Simulation | 8 distinct values |
| gdp_pc (continuous) | GDP per capita | District/country GDP per capita; an X heterogeneity feature. | Country-level value in [500, 5000] (8 distinct values). | US$ | Simulation | 8 distinct values |
| elevation (continuous) | Elevation (m) | District mean elevation above sea level; a geographic X feature. | Simulated per district (time-invariant). | meters | Simulation | 300 districts |
| temperature (continuous) | Temperature (°C) | District mean temperature; a geographic X feature. | Simulated per district (time-invariant). | degrees Celsius | Simulation | 300 districts |
| ruggedness (continuous) | Terrain ruggedness | Terrain ruggedness index; a geographic X feature. | Simulated per district (time-invariant). | index (>=0) | Simulation | 300 districts |
| distance_capital (continuous) | Distance to capital (m) | Distance from the district to the national capital; a geographic X feature (top heterogeneity splitter by importance). | Simulated per district (time-invariant). | meters | Simulation | 300 districts |
| agri_suitability (continuous) | Agricultural suitability | Index of land suitability for agriculture; a geographic X feature. | Simulated per district in [0, ~1] (time-invariant). | 0-1 | Simulation | 300 districts |
| population (continuous) | Population | District population; a demographic X feature. | Simulated per district (time-invariant). | people | Simulation | 300 districts |
| ethnic_frac (continuous) | Ethnic fractionalization | Index of ethnic/linguistic heterogeneity within the district; a demographic X feature. | Simulated per district in [~0.20, ~0.90] (time-invariant). | 0-1 | Simulation | 300 districts |
| ntl_log (continuous) | Log nighttime lights (outcome) | Natural log of nighttime light intensity; the headline development outcome. | Simulated from the DGP: mining base effect + institutional moderation + price premia + mean-zero noise (SD 0.25). | log NTL | Simulation | all 3,000 rows |
| conflict (dummy) | Conflict incidence (1=yes) | Binary indicator of armed-conflict incidence in the district-year (secondary outcome; not analysed in the post). | Simulated from a parallel process: mining base 0.70, institutional dampening -0.50, price premia 0.15/0.50, base rate 0.12. | 0/1 | Simulation | 369 conflict district-years |
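Several of the construction rules in the dictionary can be checked mechanically after loading the file. The following is a minimal sketch of such checks, assuming pandas is available; `check_construction` is a hypothetical helper, and the four-row `toy` frame is made-up data used only to show the call, not an excerpt of sim_resource_curse.

```python
import pandas as pd


def check_construction(df: pd.DataFrame) -> dict:
    """Return pass/fail flags for rules stated in the variable dictionary."""
    years_per_district = df.groupby("district_id")["year"].nunique()
    untreated = df["treatment"] == 0
    return {
        # district_id x year should identify each row
        "unique_key": not df.duplicated(["district_id", "year"]).any(),
        # balanced: every district is observed in every year
        "balanced": bool((years_per_district == df["year"].nunique()).all()),
        # mining = 1 if treatment in {1, 2, 3}, else 0
        "mining_flag": bool(
            (df["mining"] == (df["treatment"] >= 1).astype(int)).all()
        ),
        # price_index is 0 for untreated rows and positive otherwise
        "price_index": bool(
            (df.loc[untreated, "price_index"] == 0).all()
            and (df.loc[~untreated, "price_index"] > 0).all()
        ),
    }


# Hypothetical toy panel (2 districts x 2 years), not the real data.
toy = pd.DataFrame({
    "district_id": [1, 1, 2, 2],
    "year": [2003, 2004, 2003, 2004],
    "treatment": [0, 0, 1, 3],
    "mining": [0, 0, 1, 1],
    "price_index": [0.0, 0.0, 0.4, 1.5],
})
checks = check_construction(toy)
print(checks)
```

Calling `check_construction(df)` on the loaded dataset should return True for every key if the file matches the documented construction.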

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| district_id | | 100% | 3,000 | 300 | | | | | |
| country_id | | 100% | 3,000 | 8 | | | | | |
| year | | 100% | 3,000 | 10 | 2003 | 2007.5 | 2007 | 2012 | 2.87 |
| treatment | | 100% | 3,000 | 4 | | | | | |
| mining | share coded 1 = 0.150 | 100% | 3,000 | 2 | 0 | 0.150 | 0 | 1.00 | 0.357 |
| price_index | min 0, median 0, max 1.81 | 100% | 3,000 | 451 | 0 | 0.123 | 0 | 1.81 | 0.324 |
| exec_constraints | | 100% | 3,000 | 6 | | | | | |
| quality_of_govt | min 0.22, median 0.42, max 0.7 | 100% | 3,000 | 8 | 0.220 | 0.440 | 0.420 | 0.700 | 0.152 |
| gdp_pc | min 500, median 1.8e+03, max 5e+03 | 100% | 3,000 | 8 | 500.0 | 2,198.0 | 1,800.0 | 5,000.0 | 1,469.9 |
| elevation | min 0, median 502, max 1.36e+03 | 100% | 3,000 | 281 | 0 | 499.1 | 502.2 | 1,357.2 | 302.0 |
| temperature | min 14, median 24, max 35 | 100% | 3,000 | 298 | 13.99 | 23.91 | 24.04 | 35.00 | 3.92 |
| ruggedness | min 0, median 24.1, max 77 | 100% | 3,000 | 261 | 0 | 24.42 | 24.12 | 76.95 | 17.80 |
| distance_capital | min 1.08e+04, median 2.63e+05, max 4.97e+05 | 100% | 3,000 | 300 | 10,814 | 268,100 | 262,877 | 497,464 | 144,040 |
| agri_suitability | min 0, median 0.393, max 0.983 | 100% | 3,000 | 290 | 0 | 0.395 | 0.393 | 0.983 | 0.197 |
| population | min 4.13e+03, median 5.89e+04, max 5.97e+05 | 100% | 3,000 | 300 | 4,134.7 | 82,028 | 58,886 | 596,950 | 85,187 |
| ethnic_frac | min 0.201, median 0.537, max 0.899 | 100% | 3,000 | 300 | 0.201 | 0.550 | 0.537 | 0.899 | 0.202 |
| ntl_log | min -2.5, median -1.1, max 0.265 | 100% | 3,000 | 3,000 | -2.50 | -1.10 | -1.10 | 0.265 | 0.435 |
| conflict | share coded 1 = 0.123 | 100% | 3,000 | 2 | 0 | 0.123 | 0 | 1.00 | 0.328 |
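For the two dummy variables, the mean and SD in the statistics table can be cross-checked against the row counts given in the variable dictionary (450 mining and 369 conflict district-years out of 3,000). The sketch below does this arithmetic, assuming the table reports the sample standard deviation with an n − 1 denominator; `dummy_stats` is a helper named here for illustration.

```python
import math

N_ROWS = 3000  # district-year rows in sim_resource_curse


def dummy_stats(n_ones, n=N_ROWS):
    """Mean and sample SD (n - 1 denominator) of a 0/1 variable with n_ones ones."""
    p = n_ones / n
    # Sum of squared deviations of a 0/1 variable is n * p * (1 - p).
    sd = math.sqrt(p * (1 - p) * n / (n - 1))
    return round(p, 3), round(sd, 3)


mining_mean, mining_sd = dummy_stats(450)      # 450 mining district-years
conflict_mean, conflict_sd = dummy_stats(369)  # 369 conflict district-years
print(mining_mean, mining_sd)
print(conflict_mean, conflict_sd)
```

Both pairs agree with the table to three decimals (0.150 and 0.357 for mining, 0.123 and 0.328 for conflict), which is a quick way to confirm that the counts and the summary statistics describe the same file.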

Known limitations & caveats