Data dictionary · Causal Machine Learning and the Resource Curse

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`sim_resource_curse`	district-year	3,000 × 18	sim_resource_curse.dta	sim_resource_curse.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes

Python

# Colab/Jupyter only: the leading "!" runs a shell command
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")

# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files only, so download first
import urllib.request
import pyreadstat
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")  # meta holds variable and value labels

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))

Overview & sources

Companion data for a hands-on Stata 19 tutorial that replicates the three core findings of Hodler, Lechner & Raschky (2023) on a fully synthetic panel with known ground-truth causal effects. The panel has 3,000 district-year observations (300 districts × 10 years, 2003–2012) across 8 fictional countries, with log nighttime lights (ntl_log) and a conflict indicator (conflict) as outcomes, a four-level mining/price treatment, and executive constraints and quality of government as institutional moderators. The post estimates average (ATE), group (GATE), and individualized (IATE) treatment effects for six pairwise binary contrasts via generalized random forests, comparing the Partialing-Out (PO) and doubly robust Augmented IPW (AIPW) estimators with 5-fold cross-fitting, supported by formal heterogeneity tests. Because the data-generating process is known, every estimate can be checked against the truth.

One file. sim_resource_curse is a balanced district-year panel: one row per district × year (300 districts × 10 years = 3,000 rows, 2003–2012) across 8 fictional countries. The treatment distribution is highly imbalanced, mirroring real mining data: about 85% of observations are controls (no mining), and each of the three treated price levels holds about 5%. (For GATE grouping, the Stata analysis additionally derives an integer exec_con = round(exec_constraints) at run time; it is not stored in the data files.)
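As a rough sanity check on the structure described above, the sketch below tests panel balance and tabulates treatment shares on a tiny made-up panel. The helper names (`is_balanced`, `treatment_shares`) and the toy rows are illustrative assumptions, not part of the dataset or the post's code. With the real file, the rows could come from the loaded data, for example `df.to_dict("records")` in pandas.

```python
from collections import Counter

def is_balanced(rows, unit="district_id", time="year"):
    """True if every unit is observed exactly once in every period."""
    units = {r[unit] for r in rows}
    periods = {r[time] for r in rows}
    pairs = Counter((r[unit], r[time]) for r in rows)
    # Balanced: all unit-period pairs exist and none is duplicated.
    complete = len(pairs) == len(units) * len(periods)
    return complete and all(c == 1 for c in pairs.values())

def treatment_shares(rows, col="treatment"):
    """Share of rows at each treatment level."""
    counts = Counter(r[col] for r in rows)
    n = len(rows)
    return {level: counts[level] / n for level in sorted(counts)}

# Toy panel (hypothetical values): 4 districts x 2 years.
toy = [
    {"district_id": d, "year": y, "treatment": t}
    for (d, t) in [(1, 0), (2, 0), (3, 0), (4, 1)]
    for y in (2003, 2004)
]

print(is_balanced(toy))
print(treatment_shares(toy))
```

On the full dataset, the same two checks should report a balanced panel and shares close to the 85% / 5% / 5% / 5% split described above.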

Data sources

Source	Provides	Reference / URL
Hodler, Lechner & Raschky (2023)	Replicated study; structure, outcomes, moderators, and the three target findings	Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284968
Synthetic (this study)	All values — simulated via a data-generating process with known ground-truth treatment effects (open & reproducible)	Mendez, C. (2026). See the post's Stata do-file analysis.do for the full simulation/DGP.
Method references	Estimators and concepts behind Stata 19's cate	Athey, Tibshirani & Wager (2019, GRF); Nie & Wager (2021, PO/R-learner); Knaus (2022) & Kennedy (2023, AIPW); StataCorp (2025, Stata 19 cate).
Resource-curse theory	Substantive motivation	Sachs & Warner (1995); Mehlum, Moene & Torvik (2006).

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Stata 19 [Data set]. https://carlos-mendez.org/post/stata_cate2/

Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968.

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178.

Nie, X., & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299–319.

BibTeX

@misc{mendez2026statacate2,
  author       = {Mendez, Carlos},
  title        = {Causal Machine Learning and the Resource Curse with Stata 19},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_cate2/}},
  note         = {Data set}
}

@article{hodler2023institutions,
  author  = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
  title   = {Institutions and the resource curse: New insights from causal machine learning},
  journal = {PLoS ONE},
  volume  = {18}, number = {5}, pages = {e0284968}, year = {2023}
}
@article{athey2019grf,
  author  = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
  title   = {Generalized random forests},
  journal = {Annals of Statistics},
  volume  = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}
@article{nie2021quasi,
  author  = {Nie, Xinkun and Wager, Stefan},
  title   = {Quasi-oracle estimation of heterogeneous treatment effects},
  journal = {Biometrika},
  volume  = {108}, number = {2}, pages = {299--319}, year = {2021}
}

Variable explorer: all 18 variables

Variable	Type	Label	Definition	Units	In files	Source
`agri_suitability`	continuous	Agricultural suitability (0-1)	Suitability of district land for agriculture (economic/geographic control).	0-1	sim_resource_curse	Simulation
`conflict`	dummy	Conflict event (binary)	1 if a conflict event occurred in the district-year, else 0; the secondary outcome.	0/1	sim_resource_curse	Simulation
`country_id`	identifier	Country ID (1-8)	Country identifier; districts are nested within 8 fictional countries.	integer	sim_resource_curse	Simulation
`distance_capital`	continuous	Distance to capital (meters)	Distance from the district to the national capital (geographic control).	meters	sim_resource_curse	Simulation
`district_id`	identifier	District ID (1-300)	District identifier; the panel unit.	integer	sim_resource_curse	Simulation
`elevation`	continuous	Elevation (meters)	Mean district elevation (geographic control / heterogeneity covariate).	meters	sim_resource_curse	Simulation
`ethnic_frac`	continuous	Ethnic fractionalization (0-1)	Degree of ethnic heterogeneity in the district (control).	0-1	sim_resource_curse	Simulation
`exec_constraints`	continuous	Constraints on the executive (1-6)	Institutional-quality moderator: strength of constraints on executive power.	1-6 scale	sim_resource_curse	Simulation
`gdp_pc`	continuous	GDP per capita	District GDP per capita (economic control).	US$	sim_resource_curse	Simulation
`mining`	dummy	Mining district (binary)	1 if the district-year has active mining (treatment > 0), else 0.	0/1	sim_resource_curse	Simulation
`ntl_log`	continuous	Log nighttime lights	Log of nighttime-light intensity; the development-proxy outcome.	log intensity	sim_resource_curse	Simulation
`population`	continuous	Population	District population (economic control).	persons	sim_resource_curse	Simulation
`price_index`	continuous	Mineral price index	Global mineral-price index applied to the district-year.	index	sim_resource_curse	Simulation
`quality_of_govt`	continuous	Quality of government (0.22-0.70)	Alternative institutional-quality moderator: overall quality of government.	0-1 index	sim_resource_curse	Simulation
`ruggedness`	continuous	Terrain ruggedness	Terrain ruggedness index (geographic control).	index	sim_resource_curse	Simulation
`temperature`	continuous	Mean temperature (Celsius)	Mean district temperature (geographic control).	degrees C	sim_resource_curse	Simulation
`treatment`	identifier	Treatment group (0=none,1=low,2=med,3=high price)	Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price.	0-3	sim_resource_curse	Simulation
`year`	year	Calendar year (2003-2012)	Annual time index; the panel time dimension.	year	sim_resource_curse	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	sim_resource_curse
`agri_suitability`	●
`conflict`	●
`country_id`	●
`distance_capital`	●
`district_id`	●
`elevation`	●
`ethnic_frac`	●
`exec_constraints`	●
`gdp_pc`	●
`mining`	●
`ntl_log`	●
`population`	●
`price_index`	●
`quality_of_govt`	●
`ruggedness`	●
`temperature`	●
`treatment`	●
`year`	●

Construction & formulas

The Conditional Average Treatment Effect is the treatment effect for units with covariate profile x:

CATE: τ(x) = E{ y_i(1) − y_i(0) | x_i = x } — a function of x, not a single number.
ATE: E{ τ(X) } — the CATE averaged over the whole sample (the headline number).
GATE: τ(g) = E{ τ(x) | G = g } — the CATE averaged within a pre-specified group (e.g. each executive-constraints level).
IATE: τ(x_i) — one estimated effect per observation.
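To make the relationship between these quantities concrete, here is a minimal sketch with made-up individual effects: averaging the IATEs over everyone gives the ATE, and averaging within groups gives the GATEs. The numbers and the names `iate`, `group`, and `gate` are hypothetical. The sketch mirrors the definitions above and is not a description of how any particular estimator computes them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical IATEs (one per observation) and a grouping variable,
# e.g. a rounded executive-constraints level.
iate = [0.40, 0.30, 0.20, 0.10, 0.00, 0.50]
group = [1, 1, 2, 2, 3, 3]

# ATE: average of the individual effects over the whole sample.
ate = mean(iate)

# GATE: average of the individual effects within each group.
by_group = defaultdict(list)
for tau_i, g in zip(iate, group):
    by_group[g].append(tau_i)
gate = {g: mean(v) for g, v in sorted(by_group.items())}

print(round(ate, 3))
print({g: round(v, 3) for g, v in gate.items()})
```

One useful consequence: the ATE equals the group-size-weighted average of the GATEs, which is a quick consistency check on any reported set of group effects.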

Stata 19's cate uses a partial linear model with cross-fitting (here xfolds(5)): y = d·τ(x) + g(x,w) + ε and d = f(x,w) + u, where g/f are machine-learning nuisance functions, x are CATE variables (potential moderators) and w are controls (here i.country_id i.year).

Two estimators. PO (Partialing-Out) residualizes both y and d on (x,w) and regresses the residuals via a generalized random forest. AIPW (Augmented IPW) builds doubly robust scores from an outcome model and a propensity model; it remains consistent if either the outcome model or the propensity model is correctly specified.
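A minimal sketch of the textbook AIPW score for a binary treatment may help here. The function name `aipw_scores` and the toy inputs are hypothetical, and the nuisance predictions are simply supplied as lists rather than estimated. It illustrates one half of the double-robustness property: when the outcome model is exactly right, the correction term vanishes, so a badly wrong propensity model does no harm.

```python
from statistics import mean

def aipw_scores(y, d, mu0, mu1, e):
    """Doubly robust score per observation for a binary treatment d.

    mu0, mu1: predicted outcomes under control / treatment.
    e: predicted propensity P(d = 1 | x, w), strictly between 0 and 1.
    """
    scores = []
    for yi, di, m0, m1, ei in zip(y, d, mu0, mu1, e):
        # Inverse-probability-weighted residual of the outcome model.
        correction = di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
        scores.append(m1 - m0 + correction)
    return scores

# Hypothetical noise-free example with a true effect of 0.25.
d = [1, 0, 1, 0]
mu0 = [1.0, 2.0, 3.0, 4.0]             # correct outcome model under control
mu1 = [m + 0.25 for m in mu0]          # correct outcome model under treatment
y = [m1 if di else m0 for di, m0, m1 in zip(d, mu0, mu1)]
e_wrong = [0.9, 0.1, 0.2, 0.7]         # deliberately misspecified propensities

ate_hat = mean(aipw_scores(y, d, mu0, mu1, e_wrong))
print(round(ate_hat, 6))
```

The other half of the property (correct propensities, wrong outcome model) holds on average rather than exactly in a small sample, so it is not demonstrated here.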

Multi-valued treatment via pairwise binaries. The 4-level treatment (0=none, 1=low, 2=med, 3=high price) is split into six binary contrasts (1-0, 2-0, 3-0, 2-1, 3-1, 3-2), subsetting to the two relevant groups each time, e.g. treat_1v0 = (treatment == 1).
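The subsetting step can be sketched as follows. The helper `pairwise_contrasts` and the exact group counts are illustrative assumptions. The counts follow the approximate 85% / 5% / 5% / 5% split stated in this document, not the actual file.

```python
from itertools import combinations

def pairwise_contrasts(treatment, levels=(0, 1, 2, 3)):
    """Map 'hi-lo' -> (row indices kept, binary indicator for the higher level)."""
    out = {}
    for lo, hi in combinations(levels, 2):
        # Keep only the rows that belong to the two levels being compared.
        keep = [i for i, t in enumerate(treatment) if t in (lo, hi)]
        out[f"{hi}-{lo}"] = (keep, [int(treatment[i] == hi) for i in keep])
    return out

# Hypothetical treatment vector: 3,000 rows, 85% level 0 and 5% each of 1, 2, 3.
treatment = [0] * 2550 + [1] * 150 + [2] * 150 + [3] * 150
contrasts = pairwise_contrasts(treatment)

for name, (keep, indicator) in contrasts.items():
    print(name, "rows:", len(keep), "treated:", sum(indicator))
```

Under these assumed counts, each contrast against level 0 keeps 2,700 rows, while each within-mining contrast keeps only 300, which is why the latter are estimated much less precisely.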

Ground truth (NTL contrasts). The DGP fixes the true ATEs: 1-0 = 0.25, 2-0 = 0.30, 3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25. The price non-linearity is encoded in the steps: moving from low to medium price adds little (2-1 = 0.05), while moving from low to high adds much more (3-1 = 0.30).
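A quick arithmetic check shows that the six stated values are mutually consistent: each within-mining contrast equals the difference of the two corresponding contrasts against level 0. The dictionary `truth` copies the values above. The helper `implied` is hypothetical, and treating the contrasts as simple differences is an assumption made for this check only.

```python
# Ground-truth ATEs for the NTL outcome, as stated in this document.
truth = {"1-0": 0.25, "2-0": 0.30, "3-0": 0.55,
         "2-1": 0.05, "3-1": 0.30, "3-2": 0.25}

def implied(hi, lo):
    """Contrast hi-lo implied by the two contrasts against level 0."""
    return truth[f"{hi}-0"] - truth[f"{lo}-0"]

for hi, lo in [(2, 1), (3, 1), (3, 2)]:
    stated = truth[f"{hi}-{lo}"]
    print(f"{hi}-{lo}: stated {stated:.2f}, implied {implied(hi, lo):.2f}")
```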

The datasets

The dataset is documented below with its full variable dictionary and a table of summary statistics and data coverage.

district-year 3,000 × 18 · 2003-2012 · 300 districts, 8 countries (balanced)

Panel key: district_id × year · Purpose: estimate heterogeneous treatment effects of mining/prices via Stata 19 cate (ATE/GATE/IATE).

Variable dictionary

Variable	Type	Label	Definition	Construction	Units	Source	Coverage
`district_id`	identifier	District ID (1-300)	District identifier; the panel unit.	Sequential integer 1..300 in the simulation.	integer	Simulation	all rows
`country_id`	identifier	Country ID (1-8)	Country identifier; districts are nested within 8 fictional countries.	Integer 1..8 assigned in the simulation; used as i.country_id control.	integer	Simulation	all rows
`year`	year	Calendar year (2003-2012)	Annual time index; the panel time dimension.	2003..2012 for every district (balanced); used as i.year control.	year	Simulation	all rows
`treatment`	identifier	Treatment group (0=none,1=low,2=med,3=high price)	Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price.	Assigned in the DGP; ~85% level 0 and ~5% each of 1,2,3. Split into binary pairwise contrasts for cate.	0-3	Simulation	all rows
`mining`	dummy	Mining district (binary)	1 if the district-year has active mining (treatment > 0), else 0.	Indicator derived from treatment (1 for levels 1-3, 0 for level 0).	0/1	Simulation	all rows
`price_index`	continuous	Mineral price index	Global mineral-price index applied to the district-year.	Set by the treatment price level in the DGP (0 for non-mining).	index	Simulation	all rows
`exec_constraints`	continuous	Constraints on the executive (1-6)	Institutional-quality moderator: strength of constraints on executive power.	Drawn per district in the DGP; rounded to exec_con (1-6) in Stata for GATE grouping.	1-6 scale	Simulation	all rows
`quality_of_govt`	continuous	Quality of government (0.22-0.70)	Alternative institutional-quality moderator: overall quality of government.	Drawn per district in the DGP; quartile-binned (qog_cat) for GATEs in the post.	0-1 index	Simulation	all rows
`gdp_pc`	continuous	GDP per capita	District GDP per capita (economic control).	Simulated district economic level (range ~500-5,000).	US$	Simulation	all rows
`elevation`	continuous	Elevation (meters)	Mean district elevation (geographic control / heterogeneity covariate).	Drawn per district in the DGP.	meters	Simulation	all rows
`temperature`	continuous	Mean temperature (Celsius)	Mean district temperature (geographic control).	Drawn per district in the DGP.	degrees C	Simulation	all rows
`ruggedness`	continuous	Terrain ruggedness	Terrain ruggedness index (geographic control).	Drawn per district in the DGP.	index	Simulation	all rows
`distance_capital`	continuous	Distance to capital (meters)	Distance from the district to the national capital (geographic control).	Drawn per district in the DGP.	meters	Simulation	all rows
`agri_suitability`	continuous	Agricultural suitability (0-1)	Suitability of district land for agriculture (economic/geographic control).	Drawn per district in the DGP, scaled to [0,1].	0-1	Simulation	all rows
`population`	continuous	Population	District population (economic control).	Drawn per district in the DGP.	persons	Simulation	all rows
`ethnic_frac`	continuous	Ethnic fractionalization (0-1)	Degree of ethnic heterogeneity in the district (control).	Drawn per district in the DGP, scaled to [0,1].	0-1	Simulation	all rows
`ntl_log`	continuous	Log nighttime lights	Log of nighttime-light intensity; the development-proxy outcome.	Simulated outcome driven by treatment, moderators, controls, and DGP ground-truth effects.	log intensity	Simulation	all rows
`conflict`	dummy	Conflict event (binary)	1 if a conflict event occurred in the district-year, else 0; the secondary outcome.	Simulated binary outcome; baseline ~10.7% for non-mining, raised by mining in the DGP.	0/1	Simulation	all rows

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`district_id`	100%	3,000	300	n/a	n/a	n/a	n/a	n/a
`country_id`	100%	3,000	8	n/a	n/a	n/a	n/a	n/a
`year`	100%	3,000	10	2003	2007.5	2007	2012	2.87
`treatment`	100%	3,000	4	n/a	n/a	n/a	n/a	n/a
`mining`	100%	3,000	2	0	0.150	0	1.00	0.357
`price_index`	100%	3,000	451	0	0.123	0	1.81	0.324
`exec_constraints`	100%	3,000	6	1.00	3.68	4.00	6.00	1.49
`quality_of_govt`	100%	3,000	8	0.220	0.440	0.420	0.700	0.152
`gdp_pc`	100%	3,000	8	500.0	2,198.0	1,800.0	5,000.0	1,469.9
`elevation`	100%	3,000	281	0	499.1	502.2	1,357.2	302.0
`temperature`	100%	3,000	298	13.99	23.91	24.04	35.00	3.92
`ruggedness`	100%	3,000	261	0	24.42	24.12	76.95	17.80
`distance_capital`	100%	3,000	300	10,814	268,100	262,877	497,464	144,040
`agri_suitability`	100%	3,000	290	0	0.395	0.393	0.983	0.197
`population`	100%	3,000	300	4,134.7	82,028	58,886	596,950	85,187
`ethnic_frac`	100%	3,000	300	0.201	0.550	0.537	0.899	0.202
`ntl_log`	100%	3,000	3,000	-2.50	-1.10	-1.10	0.265	0.435
`conflict`	100%	3,000	2	0	0.123	0	1.00	0.328

Known limitations & caveats

Synthetic data. There is no real data behind this tutorial; values are simulated with known ground-truth effects, so results validate the method but are not empirical evidence about real-world mining or conflict.
Direction differs from the paper. Hodler et al. (2023) found that stronger institutions amplify mining benefits (upward GATE slope); this DGP produces the opposite sign (weaker institutions, larger mining effect). What the tutorial reproduces is the structural finding that institutions systematically moderate the effect of mining but not of prices; it does not reproduce the sign.
Severe treatment imbalance. ~85% of rows are controls; each treated price level is only ~5% (~150 obs). Within-mining price contrasts (2-1, 3-1, 3-2) use only ~300 observations, so confidence intervals are wide.
Overlap failure (3-2). The high-vs-medium price contrast has propensity scores near 0/1 on its tiny subsample; AIPW fails, and even PO with pstolerance(1e-8) returns an unreliable ATE. This contrast is excluded from the summary.
Finite-sample variability. Several estimates deviate from the ground truth (e.g. 1-0 = 0.149 vs 0.25) due to small treated samples, 5-fold cross-fitting, and the random seed. Directional patterns are robust, but point estimates are seed-dependent.
Conflict ground truths. The DGP does not specify ground-truth ATEs for the conflict outcome; conflict results are interpreted directionally only.
Requires Stata 19+. The cate command does not exist in Stata 18; the do-file aborts on older versions.
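The overlap caveat above can be checked before estimation with a simple diagnostic on estimated propensity scores. The sketch below is a hypothetical helper (`overlap_report`, with an arbitrary default tolerance) applied to made-up scores. It does not reproduce how Stata's pstolerance() option works internally.

```python
def overlap_report(pscores, tol=1e-5):
    """Summarize how many propensity scores fall outside [tol, 1 - tol]."""
    violations = [p for p in pscores if p < tol or p > 1 - tol]
    return {
        "n": len(pscores),
        "n_violations": len(violations),
        "min": min(pscores),
        "max": max(pscores),
    }

# Hypothetical estimated propensities for one pairwise contrast.
pscores = [0.02, 0.50, 0.999999, 0.0000001, 0.97]
report = overlap_report(pscores)
print(report)
```

A non-zero violation count on a small subsample, as in the 3-2 contrast, is a signal to treat the resulting estimate with caution or to drop the contrast.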