Interactive data dictionary

Causal Machine Learning and the Resource Curse

Heterogeneous treatment effects of mining and mineral prices on development, on a fully synthetic district-year panel for Stata 19's cate command.

300 districts · 2003–2012 years · 8 countries · 3,000 rows

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

⇩ Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × variables | Stata | Source |
| --- | --- | --- | --- | --- |
| sim_resource_curse | district-year | 3,000 × 18 | sim_resource_curse.dta | sim_resource_curse.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes

Python

# Colab/Jupyter shell magic; in a terminal run `pip install pyreadstat` instead
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")

# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")

Copy and paste this snippet into a blank Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))

Overview & sources

Companion data for a hands-on Stata 19 tutorial that replicates the three core findings of Hodler, Lechner & Raschky (2023) on a fully synthetic panel with known ground-truth causal effects. The panel has 3,000 district-year observations (300 districts × 10 years, 2003–2012) across 8 fictional countries, with log nighttime lights (ntl_log) and a conflict indicator (conflict) as outcomes, a four-level mining/price treatment, and executive constraints and quality of government as institutional moderators. The post estimates average (ATE), group (GATE), and individualized (IATE) treatment effects for six pairwise binary contrasts via generalized random forests, comparing the Partialing-Out (PO) and doubly robust Augmented IPW (AIPW) estimators with 5-fold cross-fitting, supported by formal heterogeneity tests. Because the data-generating process is known, every estimate can be checked against the truth.

One file. sim_resource_curse is a balanced district-year panel: one row per district × year (300 districts × 10 years = 3,000 rows, 2003–2012) across 8 fictional countries. The treatment distribution is highly imbalanced, mirroring real mining data: about 85% of observations are controls (no mining) and each of the three treated price levels holds about 5%. (The Stata analysis additionally derives an integer exec_con = round(exec_constraints) at run time for GATE grouping; it is not stored in the data files.)
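The structural claims above (unique district-year key, balanced panel, roughly 85/5/5/5 treatment shares, and the run-time exec_con rounding) can be checked with a few lines of pandas. This is a minimal sketch: the helper name `check_panel` and the toy frame are illustrative assumptions, and only the column names come from the variable dictionary. On the real file, pass the loaded `df` instead of the toy frame.

```python
import numpy as np
import pandas as pd


def check_panel(df: pd.DataFrame) -> dict:
    """Summarize panel structure; column names follow the data dictionary."""
    # One row per district-year means the (district_id, year) key is unique.
    key_is_unique = not df.duplicated(["district_id", "year"]).any()
    # Balanced means every district is observed in every year.
    n_districts = df["district_id"].nunique()
    n_years = df["year"].nunique()
    is_balanced = key_is_unique and len(df) == n_districts * n_years
    # Share of rows at each treatment level (0 = no mining, 1/2/3 = price level).
    shares = df["treatment"].value_counts(normalize=True).sort_index()
    return {
        "n_rows": len(df),
        "is_balanced": is_balanced,
        "treatment_shares": shares.round(3).to_dict(),
    }


# Toy stand-in for sim_resource_curse (hypothetical values, NOT the real data):
# 20 districts x 10 years, with treatment fixed within district for simplicity.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(d, y) for d in range(1, 21) for y in range(2003, 2013)],
    columns=["district_id", "year"],
)
# Districts 1-17 are controls; districts 18, 19, 20 get levels 1, 2, 3.
toy["treatment"] = np.where(toy["district_id"] <= 17, 0, toy["district_id"] - 17)
toy["exec_constraints"] = rng.uniform(1, 6, size=len(toy))

# Integer grouping variable, derived at run time rather than stored in the file.
toy["exec_con"] = toy["exec_constraints"].round().astype(int)

report = check_panel(toy)
print(report)
```

A report with `is_balanced` equal to `False` on the real data would point to duplicated or missing district-years.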

Data sources

| Source | Provides | Reference / URL |
| --- | --- | --- |
| Hodler, Lechner & Raschky (2023) | Replicated study; structure, outcomes, moderators, and the three target findings | Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284968 |
| Synthetic (this study) | All values: simulated via a data-generating process with known ground-truth treatment effects (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file analysis.do for the full simulation/DGP. |
| Method references | Estimators and concepts behind Stata 19's cate | Athey, Tibshirani & Wager (2019, GRF); Nie & Wager (2021, PO/R-learner); Knaus (2022) & Kennedy (2023, AIPW); StataCorp (2025, Stata 19 cate). |
| Resource-curse theory | Substantive motivation | Sachs & Warner (1995); Mehlum, Moene & Torvik (2006). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Stata 19 [Data set]. https://carlos-mendez.org/post/stata_cate2/

Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968.

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178.

Nie, X., & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299–319.

BibTeX

@misc{mendez2026statacate2,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning and the Resource Curse with Stata 19},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_cate2/}},
  note         = {Data set}
}

@article{hodler2023institutions,
  author  = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
  title   = {Institutions and the resource curse: New insights from causal machine learning},
  journal = {PLoS ONE},
  volume  = {18}, number = {5}, pages = {e0284968}, year = {2023}
}
@article{athey2019grf,
  author  = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
  title   = {Generalized random forests},
  journal = {Annals of Statistics},
  volume  = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}
@article{nie2021quasi,
  author  = {Nie, Xinkun and Wager, Stefan},
  title   = {Quasi-oracle estimation of heterogeneous treatment effects},
  journal = {Biometrika},
  volume  = {108}, number = {2}, pages = {299--319}, year = {2021}
}

Variable explorer (all 18 variables)

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| agri_suitability | continuous | min 0, median 0.393, max 0.983 | Agricultural suitability (0-1) | Suitability of district land for agriculture (economic/geographic control). | 0-1 | sim_resource_curse | Simulation |
| conflict | dummy | share coded 1 = 0.123 | Conflict event (binary) | 1 if a conflict event occurred in the district-year, else 0; the secondary outcome. | 0/1 | sim_resource_curse | Simulation |
| country_id | identifier | | Country ID (1-8) | Country identifier; districts are nested within 8 fictional countries. | integer | sim_resource_curse | Simulation |
| distance_capital | continuous | min 1.08e+04, median 2.63e+05, max 4.97e+05 | Distance to capital (meters) | Distance from the district to the national capital (geographic control). | meters | sim_resource_curse | Simulation |
| district_id | identifier | | District ID (1-300) | District identifier; the panel unit. | integer | sim_resource_curse | Simulation |
| elevation | continuous | min 0, median 502, max 1.36e+03 | Elevation (meters) | Mean district elevation (geographic control / heterogeneity covariate). | meters | sim_resource_curse | Simulation |
| ethnic_frac | continuous | min 0.201, median 0.537, max 0.899 | Ethnic fractionalization (0-1) | Degree of ethnic heterogeneity in the district (control). | 0-1 | sim_resource_curse | Simulation |
| exec_constraints | continuous | min 1, median 4, max 6 | Constraints on the executive (1-6) | Institutional-quality moderator: strength of constraints on executive power. | 1-6 scale | sim_resource_curse | Simulation |
| gdp_pc | continuous | min 500, median 1.8e+03, max 5e+03 | GDP per capita | District GDP per capita (economic control). | US$ | sim_resource_curse | Simulation |
| mining | dummy | share coded 1 = 0.150 | Mining district (binary) | 1 if the district-year has active mining (treatment > 0), else 0. | 0/1 | sim_resource_curse | Simulation |
| ntl_log | continuous | min -2.5, median -1.1, max 0.265 | Log nighttime lights | Log of nighttime-light intensity; the development-proxy outcome. | log intensity | sim_resource_curse | Simulation |
| population | continuous | min 4.13e+03, median 5.89e+04, max 5.97e+05 | Population | District population (economic control). | persons | sim_resource_curse | Simulation |
| price_index | continuous | min 0, median 0, max 1.81 | Mineral price index | Global mineral-price index applied to the district-year. | index | sim_resource_curse | Simulation |
| quality_of_govt | continuous | min 0.22, median 0.42, max 0.7 | Quality of government (0.22-0.70) | Alternative institutional-quality moderator: overall quality of government. | 0-1 index | sim_resource_curse | Simulation |
| ruggedness | continuous | min 0, median 24.1, max 77 | Terrain ruggedness | Terrain ruggedness index (geographic control). | index | sim_resource_curse | Simulation |
| temperature | continuous | min 14, median 24, max 35 | Mean temperature (Celsius) | Mean district temperature (geographic control). | degrees C | sim_resource_curse | Simulation |
| treatment | identifier | | Treatment group (0=none,1=low,2=med,3=high price) | Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price. | 0-3 | sim_resource_curse | Simulation |
| year | year | | Calendar year (2003-2012) | Annual time index; the panel time dimension. | year | sim_resource_curse | Simulation |

Cross-file variable index

The panel ships as a single dataset, so all 18 variables appear in sim_resource_curse.

Construction & formulas

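The Conditional Average Treatment Effect (CATE) is the expected treatment effect for units with covariate profile x:

τ(x) = E[ y(1) − y(0) | x ]

where y(1) and y(0) denote the potential outcomes with and without treatment.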

Stata 19's cate uses a partial linear model with cross-fitting (here xfolds(5)): y = d·τ(x) + g(x,w) + ε and d = f(x,w) + u, where g/f are machine-learning nuisance functions, x are CATE variables (potential moderators) and w are controls (here i.country_id i.year).
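To make the partial linear model and cross-fitting concrete, the sketch below residualizes y and d on x with 5 folds and then takes a residual-on-residual slope, using simulated data with a constant true effect. It is not Stata's implementation: ordinary least squares stands in for the machine-learning nuisance functions, the final step is a single slope rather than a forest-based τ(x), and the names (`crossfit_predict`, `tau_true`) and the simulated DGP are assumptions made for illustration.

```python
import numpy as np


def crossfit_predict(features, target, n_folds=5, seed=0):
    """Out-of-fold OLS predictions: each row is predicted by a model
    fitted on the other folds only (cross-fitting)."""
    n = len(target)
    design = np.column_stack([np.ones(n), features])
    fold_of_row = np.random.default_rng(seed).permutation(n) % n_folds
    predictions = np.empty(n)
    for k in range(n_folds):
        held_out = fold_of_row == k
        coef, *_ = np.linalg.lstsq(design[~held_out], target[~held_out], rcond=None)
        predictions[held_out] = design[held_out] @ coef
    return predictions


# Simulated data following y = d*tau + g(x) + eps and d = f(x) + u.
rng = np.random.default_rng(42)
n = 20_000
tau_true = 0.25                                    # hypothetical constant effect
x = rng.uniform(0, 1, size=n)
d = rng.binomial(1, 0.3 + 0.4 * x).astype(float)   # f(x) = 0.3 + 0.4x
y = d * tau_true + (1.0 + 2.0 * x) + rng.normal(0, 1, size=n)

# Residualize outcome and treatment on x using 5-fold cross-fitting.
y_resid = y - crossfit_predict(x, y, n_folds=5)
d_resid = d - crossfit_predict(x, d, n_folds=5)

# Residual-on-residual slope: the constant-effect analogue of tau(x).
tau_hat = (d_resid @ y_resid) / (d_resid @ d_resid)
print(f"tau_hat = {tau_hat:.3f} (truth {tau_true})")
```

Because each row's nuisance predictions come from models that never saw that row, overfitting in the nuisance step does not leak into the effect estimate, which is the motivation for cross-fitting.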

Two estimators. PO (Partialing-Out) residualizes both y and d on (x,w) and regresses the residuals via a generalized random forest. AIPW (Augmented IPW) builds doubly robust scores from an outcome model and a propensity model; it stays consistent if either nuisance model is right.
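The doubly robust construction can be illustrated with a small simulation. The sketch below builds cross-fitted AIPW scores and averages them into an ATE; it is an assumption-laden illustration, not Stata's cate: OLS and a clipped linear probability model stand in for the nuisance learners, and the DGP, the clipping bounds, and the helper names (`ols_fit`, `ols_predict`) are hypothetical.

```python
import numpy as np


def ols_fit(features, target):
    """Least-squares coefficients for target ~ 1 + features."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def ols_predict(coef, features):
    """Fitted values for a single regressor plus intercept."""
    return coef[0] + coef[1] * features


rng = np.random.default_rng(7)
n, n_folds = 20_000, 5
ate_true = 0.30                                   # hypothetical true effect
x = rng.uniform(0, 1, size=n)
d = rng.binomial(1, 0.3 + 0.4 * x)                # treatment depends on x
y = d * ate_true + 1.0 + 2.0 * x + rng.normal(0, 1, size=n)

fold_of_row = rng.permutation(n) % n_folds
scores = np.empty(n)
for k in range(n_folds):
    test, train = fold_of_row == k, fold_of_row != k
    treated, control = train & (d == 1), train & (d == 0)
    # Outcome models fitted separately by arm, propensity by a linear
    # probability model; all are fitted on the training folds only.
    mu1 = ols_predict(ols_fit(x[treated], y[treated]), x[test])
    mu0 = ols_predict(ols_fit(x[control], y[control]), x[test])
    e = np.clip(ols_predict(ols_fit(x[train], d[train]), x[test]), 0.05, 0.95)
    # Doubly robust (AIPW) score for each held-out row.
    scores[test] = (
        mu1 - mu0
        + d[test] * (y[test] - mu1) / e
        - (1 - d[test]) * (y[test] - mu0) / (1 - e)
    )

ate_hat = scores.mean()
std_err = scores.std(ddof=1) / np.sqrt(n)
print(f"AIPW ATE = {ate_hat:.3f} (SE {std_err:.3f}, truth {ate_true})")
```

Averaging the same scores within groups of a moderator, rather than over the whole sample, would give group-level effects in the same spirit as the GATEs.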

Multi-valued treatment via pairwise binaries. The 4-level treatment (0=none, 1=low, 2=med, 3=high price) is split into six binary contrasts (1-0, 2-0, 3-0, 2-1, 3-1, 3-2), subsetting to the two relevant groups each time, e.g. treat_1v0 = (treatment == 1).
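The same subsetting logic can be sketched in pandas. The function name `pairwise_contrasts` and the toy frame are hypothetical; the indicator naming follows the treat_1v0 pattern from the text, with 1 marking the higher level of each pair.

```python
from itertools import combinations

import pandas as pd


def pairwise_contrasts(df, treatment_col="treatment", levels=(0, 1, 2, 3)):
    """Return {name: subset} with one binary indicator per contrast.

    For contrast "hi v lo" the subset keeps only rows at those two levels,
    and the indicator is 1 for the higher level and 0 for the lower one.
    """
    contrasts = {}
    for lo, hi in combinations(levels, 2):
        name = f"treat_{hi}v{lo}"
        subset = df[df[treatment_col].isin([lo, hi])].copy()
        subset[name] = (subset[treatment_col] == hi).astype(int)
        contrasts[name] = subset
    return contrasts


# Toy frame (hypothetical rows) just to show the shape of the result.
toy = pd.DataFrame({"treatment": [0, 0, 0, 1, 1, 2, 3, 3]})
contrasts = pairwise_contrasts(toy)
for name, subset in contrasts.items():
    print(name, len(subset), int(subset[name].sum()))
```

Each subset can then be passed to a binary-treatment estimator on its own. Given the roughly 85/5/5/5 split, contrasts between two treated levels would use only about 10% of the rows.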

Ground truth (NTL contrasts). The DGP fixes the true ATEs: 1-0 = 0.25, 2-0 = 0.30, 3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25 — the small 2-1 step vs the large 3-1 step encodes the price non-linearity.
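The six values are internally consistent: each contrast between two treated levels equals the difference of the corresponding contrasts against no mining. The quick arithmetic check below treats the effects as additive in this way, which is an assumption suggested by the numbers rather than something confirmed from the DGP code.

```python
# True ATEs versus no mining, as stated for the NTL contrasts.
ate_vs_none = {1: 0.25, 2: 0.30, 3: 0.55}
# Stated ATEs between treated price levels, keyed as (higher, lower).
stated = {(2, 1): 0.05, (3, 1): 0.30, (3, 2): 0.25}

for (hi, lo), value in stated.items():
    implied = ate_vs_none[hi] - ate_vs_none[lo]
    # Compare with a tolerance because of floating-point rounding.
    assert abs(implied - value) < 1e-9, (hi, lo, implied, value)
    print(f"{hi}-{lo}: implied {implied:.2f}, stated {value:.2f}")

# Effect added by each successive price step: 0 -> 1, 1 -> 2, 2 -> 3.
steps = [ate_vs_none[1], stated[(2, 1)], stated[(3, 2)]]
print("step sizes:", steps)
```

The step sizes show the non-linearity directly: entering mining at a low price and moving from medium to high price each add 0.25, while moving from low to medium adds only 0.05.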

The datasets

Each dataset has a full variable dictionary plus a statistics table with distribution summaries and data coverage.


district-year · 3,000 × 18 · 2003-2012 · 300 districts, 8 countries (balanced)

Panel key: district_id x year · Estimate heterogeneous treatment effects of mining/prices via Stata 19 cate (ATE/GATE/IATE).

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| district_id | identifier | District ID (1-300) | District identifier; the panel unit. | Sequential integer 1..300 in the simulation. | integer | Simulation | all rows |
| country_id | identifier | Country ID (1-8) | Country identifier; districts are nested within 8 fictional countries. | Integer 1..8 assigned in the simulation; used as i.country_id control. | integer | Simulation | all rows |
| year | year | Calendar year (2003-2012) | Annual time index; the panel time dimension. | 2003..2012 for every district (balanced); used as i.year control. | year | Simulation | all rows |
| treatment | identifier | Treatment group (0=none,1=low,2=med,3=high price) | Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price. | Assigned in the DGP; ~85% level 0 and ~5% each of 1,2,3. Split into binary pairwise contrasts for cate. | 0-3 | Simulation | all rows |
| mining | dummy | Mining district (binary) | 1 if the district-year has active mining (treatment > 0), else 0. | Indicator derived from treatment (1 for levels 1-3, 0 for level 0). | 0/1 | Simulation | all rows |
| price_index | continuous | Mineral price index | Global mineral-price index applied to the district-year. | Set by the treatment price level in the DGP (0 for non-mining). | index | Simulation | all rows |
| exec_constraints | continuous | Constraints on the executive (1-6) | Institutional-quality moderator: strength of constraints on executive power. | Drawn per district in the DGP; rounded to exec_con (1-6) in Stata for GATE grouping. | 1-6 scale | Simulation | all rows |
| quality_of_govt | continuous | Quality of government (0.22-0.70) | Alternative institutional-quality moderator: overall quality of government. | Drawn per district in the DGP; quartile-binned (qog_cat) for GATEs in the post. | 0-1 index | Simulation | all rows |
| gdp_pc | continuous | GDP per capita | District GDP per capita (economic control). | Simulated district economic level (range ~500-5,000). | US$ | Simulation | all rows |
| elevation | continuous | Elevation (meters) | Mean district elevation (geographic control / heterogeneity covariate). | Drawn per district in the DGP. | meters | Simulation | all rows |
| temperature | continuous | Mean temperature (Celsius) | Mean district temperature (geographic control). | Drawn per district in the DGP. | degrees C | Simulation | all rows |
| ruggedness | continuous | Terrain ruggedness | Terrain ruggedness index (geographic control). | Drawn per district in the DGP. | index | Simulation | all rows |
| distance_capital | continuous | Distance to capital (meters) | Distance from the district to the national capital (geographic control). | Drawn per district in the DGP. | meters | Simulation | all rows |
| agri_suitability | continuous | Agricultural suitability (0-1) | Suitability of district land for agriculture (economic/geographic control). | Drawn per district in the DGP, scaled to [0,1]. | 0-1 | Simulation | all rows |
| population | continuous | Population | District population (economic control). | Drawn per district in the DGP. | persons | Simulation | all rows |
| ethnic_frac | continuous | Ethnic fractionalization (0-1) | Degree of ethnic heterogeneity in the district (control). | Drawn per district in the DGP, scaled to [0,1]. | 0-1 | Simulation | all rows |
| ntl_log | continuous | Log nighttime lights | Log of nighttime-light intensity; the development-proxy outcome. | Simulated outcome driven by treatment, moderators, controls, and DGP ground-truth effects. | log intensity | Simulation | all rows |
| conflict | dummy | Conflict event (binary) | 1 if a conflict event occurred in the district-year, else 0; the secondary outcome. | Simulated binary outcome; baseline ~10.7% for non-mining, raised by mining in the DGP. | 0/1 | Simulation | all rows |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| district_id | | 100% | 3,000 | 300 | | | | | |
| country_id | | 100% | 3,000 | 8 | | | | | |
| year | | 100% | 3,000 | 10 | 2003 | 2007.5 | 2007 | 2012 | 2.87 |
| treatment | | 100% | 3,000 | 4 | | | | | |
| mining | share coded 1 = 0.150 | 100% | 3,000 | 2 | 0 | 0.150 | 0 | 1.00 | 0.357 |
| price_index | min 0, median 0, max 1.81 | 100% | 3,000 | 451 | 0 | 0.123 | 0 | 1.81 | 0.324 |
| exec_constraints | min 1, median 4, max 6 | 100% | 3,000 | 6 | 1.00 | 3.68 | 4.00 | 6.00 | 1.49 |
| quality_of_govt | min 0.22, median 0.42, max 0.7 | 100% | 3,000 | 8 | 0.220 | 0.440 | 0.420 | 0.700 | 0.152 |
| gdp_pc | min 500, median 1.8e+03, max 5e+03 | 100% | 3,000 | 8 | 500.0 | 2,198.0 | 1,800.0 | 5,000.0 | 1,469.9 |
| elevation | min 0, median 502, max 1.36e+03 | 100% | 3,000 | 281 | 0 | 499.1 | 502.2 | 1,357.2 | 302.0 |
| temperature | min 14, median 24, max 35 | 100% | 3,000 | 298 | 13.99 | 23.91 | 24.04 | 35.00 | 3.92 |
| ruggedness | min 0, median 24.1, max 77 | 100% | 3,000 | 261 | 0 | 24.42 | 24.12 | 76.95 | 17.80 |
| distance_capital | min 1.08e+04, median 2.63e+05, max 4.97e+05 | 100% | 3,000 | 300 | 10,814 | 268,100 | 262,877 | 497,464 | 144,040 |
| agri_suitability | min 0, median 0.393, max 0.983 | 100% | 3,000 | 290 | 0 | 0.395 | 0.393 | 0.983 | 0.197 |
| population | min 4.13e+03, median 5.89e+04, max 5.97e+05 | 100% | 3,000 | 300 | 4,134.7 | 82,028 | 58,886 | 596,950 | 85,187 |
| ethnic_frac | min 0.201, median 0.537, max 0.899 | 100% | 3,000 | 300 | 0.201 | 0.550 | 0.537 | 0.899 | 0.202 |
| ntl_log | min -2.5, median -1.1, max 0.265 | 100% | 3,000 | 3,000 | -2.50 | -1.10 | -1.10 | 0.265 | 0.435 |
| conflict | share coded 1 = 0.123 | 100% | 3,000 | 2 | 0 | 0.123 | 0 | 1.00 | 0.328 |
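Summary columns like those in the statistics table can be recomputed from a loaded frame. The sketch below is one possible way to do it, under assumptions: the function name `describe_variables` is hypothetical, the SD is taken as the sample standard deviation, and moments are computed for every numeric column, whereas the table above leaves them blank for identifiers.

```python
import pandas as pd


def describe_variables(df: pd.DataFrame) -> pd.DataFrame:
    """One row per variable: coverage, N, distinct values, and moments."""
    numeric = df.select_dtypes("number")
    return pd.DataFrame({
        "coverage": df.notna().mean(),        # share of non-missing rows
        "n": df.notna().sum(),                # non-missing count
        "distinct": df.nunique(),             # distinct non-missing values
        "min": numeric.min(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "max": numeric.max(),
        "sd": numeric.std(),                  # sample SD (ddof=1)
    })


# Toy frame with hypothetical values, standing in for the real panel.
toy = pd.DataFrame({
    "year": [2003, 2004, 2005, 2006],
    "mining": [0, 0, 1, 0],
    "ntl_log": [-1.2, -0.9, None, -1.5],
})
stats = describe_variables(toy)
print(stats.round(3))
```

For a 0/1 dummy such as mining or conflict, the mean column is the share coded 1, which is what the distribution column reports.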

Known limitations & caveats