Downloads
Each dataset is available as a labeled Stata .dta and its source file.
⇩ Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| sim_resource_curse | district-year | 3,000 × 18 | sim_resource_curse.dta | sim_resource_curse.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
use "${BASE}sim_resource_curse.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df = pd.read_stata(BASE + "sim_resource_curse.dta")
# load every dataset at once
files = ["sim_resource_curse"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_resource_curse.dta", "sim_resource_curse.dta")
df, meta = pyreadstat.read_dta("sim_resource_curse.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_cate2/data/"
df <- read_dta(paste0(BASE, "sim_resource_curse.dta"))
Overview & sources
Companion data for a hands-on Stata 19 tutorial that replicates the three core findings of Hodler, Lechner & Raschky (2023) on a fully synthetic panel with known ground-truth causal effects. The panel has 3,000 district-year observations (300 districts × 10 years, 2003–2012) across 8 fictional countries, with log nighttime lights (ntl_log) and a conflict indicator (conflict) as outcomes, a four-level mining/price treatment, and executive constraints and quality of government as institutional moderators. The post estimates average (ATE), group (GATE), and individualized (IATE) treatment effects for six pairwise binary contrasts via generalized random forests, comparing the Partialing-Out (PO) and doubly robust Augmented IPW (AIPW) estimators with 5-fold cross-fitting, supported by formal heterogeneity tests. Because the data-generating process is known, every estimate can be checked against the truth.
sim_resource_curse is a balanced district-year panel: one row per district × year (300 districts × 10 years = 3,000 rows, 2003–2012) across 8 fictional countries. The treatment distribution is highly imbalanced, mirroring real mining data: about 85% of observations are controls (no mining) and each of the three treated price levels holds about 5%. (The Stata analysis additionally derives an integer exec_con = round(exec_constraints) at run time for GATE grouping; it is not stored in the data files.)
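The panel properties described above (balance, treatment shares, and the derived grouping variable) can be checked with a few lines of pandas. The following is a minimal sketch on a toy stand-in frame rather than the real file; the helper names `panel_summary` and `round_half_up` are hypothetical, and using `floor(x + 0.5)` to mimic the rounding is an assumption about how ties should be handled.

```python
import numpy as np
import pandas as pd

def panel_summary(df, unit="district_id", time="year", treat="treatment"):
    """Return (balanced?, row count, treatment shares) for a unit-time panel."""
    periods_per_unit = df.groupby(unit)[time].nunique()
    n_units, n_periods = periods_per_unit.size, df[time].nunique()
    balanced = bool((periods_per_unit == n_periods).all()) and len(df) == n_units * n_periods
    shares = df[treat].value_counts(normalize=True).sort_index()
    return balanced, len(df), shares

def round_half_up(x):
    """Round positive values to the nearest integer, ties upward (assumed tie rule)."""
    return np.floor(np.asarray(x, dtype=float) + 0.5).astype(int)

# Toy stand-in: 4 districts x 3 years with made-up values (not the real data)
toy = pd.DataFrame({
    "district_id": np.repeat([1, 2, 3, 4], 3),
    "year": np.tile([2003, 2004, 2005], 4),
    "treatment": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3],
    "exec_constraints": np.repeat([1.4, 2.5, 3.5, 5.6], 3),
})

balanced, n_rows, shares = panel_summary(toy)
toy["exec_con"] = round_half_up(toy["exec_constraints"])
print(balanced, n_rows)
print(shares)
```

Applied to the real file, the same function should report a balanced panel of 3,000 rows if the description above holds.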
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Hodler, Lechner & Raschky (2023) | Replicated study; structure, outcomes, moderators, and the three target findings | Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284968 |
| Synthetic (this study) | All values — simulated via a data-generating process with known ground-truth treatment effects (open & reproducible) | Mendez, C. (2026). See the post's Stata do-file analysis.do for the full simulation/DGP. |
| Method references | Estimators and concepts behind Stata 19's cate | Athey, Tibshirani & Wager (2019, GRF); Nie & Wager (2021, PO/R-learner); Knaus (2022) & Kennedy (2023, AIPW); StataCorp (2025, Stata 19 cate). |
| Resource-curse theory | Substantive motivation | Sachs & Warner (1995); Mehlum, Moene & Torvik (2006). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Causal Machine Learning and the Resource Curse with Stata 19 [Data set]. https://carlos-mendez.org/post/stata_cate2/
Hodler, R., Lechner, M., & Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. PLoS ONE, 18(5), e0284968.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148–1178.
Nie, X., & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299–319.
BibTeX
@misc{mendez2026statacate2,
author = {Mendez, Carlos},
title = {Causal Machine Learning and the Resource Curse with Stata 19},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_cate2/}},
note = {Data set}
}
@article{hodler2023institutions,
author = {Hodler, Roland and Lechner, Michael and Raschky, Paul A.},
title = {Institutions and the resource curse: New insights from causal machine learning},
journal = {PLoS ONE},
volume = {18}, number = {5}, pages = {e0284968}, year = {2023}
}
@article{athey2019grf,
author = {Athey, Susan and Tibshirani, Julie and Wager, Stefan},
title = {Generalized random forests},
journal = {Annals of Statistics},
volume = {47}, number = {2}, pages = {1148--1178}, year = {2019}
}
@article{nie2021quasi,
author = {Nie, Xinkun and Wager, Stefan},
title = {Quasi-oracle estimation of heterogeneous treatment effects},
journal = {Biometrika},
volume = {108}, number = {2}, pages = {299--319}, year = {2021}
}
Variable explorer (all 18 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| agri_suitability | continuous | Agricultural suitability (0-1) | Suitability of district land for agriculture (economic/geographic control). | 0-1 | sim_resource_curse | Simulation |
| conflict | dummy | Conflict event (binary) | 1 if a conflict event occurred in the district-year, else 0; the secondary outcome. | 0/1 | sim_resource_curse | Simulation |
| country_id | identifier | Country ID (1-8) | Country identifier; districts are nested within 8 fictional countries. | integer | sim_resource_curse | Simulation |
| distance_capital | continuous | Distance to capital (meters) | Distance from the district to the national capital (geographic control). | meters | sim_resource_curse | Simulation |
| district_id | identifier | District ID (1-300) | District identifier; the panel unit. | integer | sim_resource_curse | Simulation |
| elevation | continuous | Elevation (meters) | Mean district elevation (geographic control / heterogeneity covariate). | meters | sim_resource_curse | Simulation |
| ethnic_frac | continuous | Ethnic fractionalization (0-1) | Degree of ethnic heterogeneity in the district (control). | 0-1 | sim_resource_curse | Simulation |
| exec_constraints | continuous | Constraints on the executive (1-6) | Institutional-quality moderator: strength of constraints on executive power. | 1-6 scale | sim_resource_curse | Simulation |
| gdp_pc | continuous | GDP per capita | District GDP per capita (economic control). | US$ | sim_resource_curse | Simulation |
| mining | dummy | Mining district (binary) | 1 if the district-year has active mining (treatment > 0), else 0. | 0/1 | sim_resource_curse | Simulation |
| ntl_log | continuous | Log nighttime lights | Log of nighttime-light intensity; the development-proxy outcome. | log intensity | sim_resource_curse | Simulation |
| population | continuous | Population | District population (economic control). | persons | sim_resource_curse | Simulation |
| price_index | continuous | Mineral price index | Global mineral-price index applied to the district-year. | index | sim_resource_curse | Simulation |
| quality_of_govt | continuous | Quality of government (0.22-0.70) | Alternative institutional-quality moderator: overall quality of government. | 0-1 index | sim_resource_curse | Simulation |
| ruggedness | continuous | Terrain ruggedness | Terrain ruggedness index (geographic control). | index | sim_resource_curse | Simulation |
| temperature | continuous | Mean temperature (Celsius) | Mean district temperature (geographic control). | degrees C | sim_resource_curse | Simulation |
| treatment | identifier | Treatment group (0=none,1=low,2=med,3=high price) | Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price. | 0-3 | sim_resource_curse | Simulation |
| year | year | Calendar year (2003-2012) | Annual time index; the panel time dimension. | year | sim_resource_curse | Simulation |
Construction & formulas
The Conditional Average Treatment Effect is the treatment effect for units with covariate profile x:
- CATE: τ(x) = E{ y_i(1) − y_i(0) | x_i = x }, a function of x, not a single number.
- ATE: E{ τ(X) }, the CATE averaged over the whole sample (the headline number).
- GATE: τ(g) = E{ τ(x) | G = g }, the CATE averaged within a pre-specified group (e.g. each executive-constraints level).
- IATE: τ(x_i), one estimated effect per observation.
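The relationship between these four quantities can be made concrete with a small numerical sketch. It assumes a made-up CATE function that depends only on a three-level moderator; the variable names and the functional form are illustrative and are not taken from the post's DGP.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(1, 4, size=n)   # hypothetical moderator with levels 1, 2, 3
tau_i = 0.5 - 0.1 * group            # assumed true CATE evaluated at each unit (the IATEs)

# ATE: the CATE averaged over the whole sample
ate = tau_i.mean()

# GATE: the CATE averaged within each pre-specified group
gate = {int(g): tau_i[group == g].mean() for g in np.unique(group)}

# The ATE is the group-share-weighted average of the GATEs
share = {g: float(np.mean(group == g)) for g in gate}
ate_from_gates = sum(share[g] * gate[g] for g in gate)

print(gate)
print(ate, ate_from_gates)
```

In practice the IATEs are estimated rather than known, so ATE and GATE estimates carry sampling error that this sketch leaves out.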
Stata 19's cate uses a partial linear model with cross-fitting (here
xfolds(5)): y = d·τ(x) + g(x,w) + ε and d = f(x,w) + u, where
g/f are machine-learning nuisance functions, x are CATE
variables (potential moderators) and w are controls (here i.country_id i.year).
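To illustrate what cross-fitting does in a partial linear model, here is a minimal sketch in Python. It is not Stata's implementation: the nuisance functions are fit by ordinary least squares instead of machine learning, the effect τ is a constant rather than a function of x, and the simulated data and the helper `crossfit_predict` are hypothetical.

```python
import numpy as np

def crossfit_predict(X, target, n_folds=5, seed=0):
    """Out-of-fold OLS predictions: each fold is predicted by a model fit on the other folds."""
    n = len(target)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    Xc = np.column_stack([np.ones(n), X])  # add an intercept column
    pred = np.empty(n)
    for k in range(n_folds):
        held_out = folds == k
        beta, *_ = np.linalg.lstsq(Xc[~held_out], target[~held_out], rcond=None)
        pred[held_out] = Xc[held_out] @ beta
    return pred

# Simulated data with a known constant effect (illustrative values only)
rng = np.random.default_rng(42)
n, tau_true = 5000, 0.25
x = rng.uniform(-1, 1, n)
d = rng.binomial(1, 0.5 + 0.2 * x).astype(float)   # treatment probability depends on x
y = tau_true * d + 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# Residualize outcome and treatment out of fold, then regress residual on residual
y_res = y - crossfit_predict(x, y)
d_res = d - crossfit_predict(x, d)
tau_hat = (d_res @ y_res) / (d_res @ d_res)
print(round(tau_hat, 3))
```

The point of predicting each fold from models fit on the other folds is that no observation's residual is computed from a model that saw that observation, which limits overfitting bias when the nuisance learners are flexible.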
Two estimators. PO (Partialing-Out) residualizes both
y and d on (x,w) and regresses the residuals via a
generalized random forest. AIPW (Augmented IPW) builds doubly robust scores from an
outcome model and a propensity model; it stays consistent if either nuisance model is right.
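The double-robustness claim can be seen in a small simulation. This is a sketch under stated assumptions: the nuisance functions are plugged in directly (either the true ones or deliberately wrong ones) instead of being estimated, the data are simulated with illustrative values, and `aipw_ate` is a hypothetical helper, not a Stata or library function.

```python
import numpy as np

def aipw_ate(y, d, mu0, mu1, e):
    """Mean of the AIPW score: outcome-model contrast plus inverse-propensity-weighted residuals."""
    score = mu1 - mu0 + d * (y - mu1) / e - (1 - d) * (y - mu0) / (1 - e)
    return score.mean()

rng = np.random.default_rng(7)
n, tau_true = 50000, 0.25
x = rng.uniform(-1, 1, n)
e_true = 0.5 + 0.3 * x                      # true propensity, between 0.2 and 0.8
d = rng.binomial(1, e_true)
y = 1.0 + 2.0 * x + tau_true * d + rng.normal(0, 0.5, n)

mu0_true = 1.0 + 2.0 * x                    # true outcome model under control
mu1_true = mu0_true + tau_true              # true outcome model under treatment
zeros, half = np.zeros(n), np.full(n, 0.5)  # deliberately wrong nuisance models

ate_outcome_right = aipw_ate(y, d, mu0_true, mu1_true, half)   # wrong propensity
ate_propensity_right = aipw_ate(y, d, zeros, zeros, e_true)    # wrong outcome model
ate_both_wrong = aipw_ate(y, d, zeros, zeros, half)

print(ate_outcome_right, ate_propensity_right, ate_both_wrong)
```

The first two estimates should land near the true effect while the third is badly biased, which is the sense in which AIPW needs only one of the two nuisance models to be right.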
Multi-valued treatment via pairwise binaries. The 4-level treatment
(0=none, 1=low, 2=med, 3=high price) is split into six binary contrasts (1-0, 2-0, 3-0, 2-1, 3-1,
3-2), subsetting to the two relevant groups each time, e.g. treat_1v0 = (treatment == 1).
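The pairwise split can be sketched in pandas as follows. The helper `pairwise_contrasts` is hypothetical; the indicator names follow the `treat_1v0` pattern quoted above, and the toy frame stands in for the real data.

```python
from itertools import combinations
import pandas as pd

def pairwise_contrasts(df, treat="treatment", levels=(0, 1, 2, 3)):
    """Build one subset per pair of levels, each with a 0/1 indicator for the higher level."""
    out = {}
    for low, high in combinations(levels, 2):
        sub = df[df[treat].isin([low, high])].copy()        # keep only the two relevant groups
        sub[f"treat_{high}v{low}"] = (sub[treat] == high).astype(int)
        out[f"{high}-{low}"] = sub
    return out

# Toy stand-in with roughly the documented shares: 85% controls, 5% per treated level
toy = pd.DataFrame({"treatment": [0] * 17 + [1, 2, 3]})
contrasts = pairwise_contrasts(toy)

for name, sub in contrasts.items():
    print(name, len(sub), int(sub.iloc[:, -1].sum()))
```

Each subset can then be passed to a binary-treatment estimator on its own, which is why the within-mining contrasts end up with far fewer rows than the contrasts against the control group.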
Ground truth (NTL contrasts). The DGP fixes the true ATEs: 1-0 = 0.25, 2-0 = 0.30, 3-0 = 0.55, 2-1 = 0.05, 3-1 = 0.30, 3-2 = 0.25 — the small 2-1 step vs the large 3-1 step encodes the price non-linearity.
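A quick arithmetic check shows that the six stated values are mutually consistent: each within-mining contrast equals the difference of the two corresponding contrasts against the control group. The sketch below only verifies that arithmetic; treating the ATEs as additive in this way is an observation about the stated numbers, not a claim about how the DGP defines them.

```python
import math

# Stated ground-truth ATEs against the control group (level 0)
vs_control = {1: 0.25, 2: 0.30, 3: 0.55}

# Stated ground-truth ATEs for the within-mining contrasts (high, low)
stated = {(2, 1): 0.05, (3, 1): 0.30, (3, 2): 0.25}

# Differences implied by the contrasts against the control group
implied = {(h, l): vs_control[h] - vs_control[l] for (h, l) in stated}

consistent = all(math.isclose(implied[k], stated[k], abs_tol=1e-9) for k in stated)
print(consistent)
```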
The datasets
Each dataset is documented with a full variable dictionary and a statistics table with data coverage.
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| district_id | identifier | District ID (1-300) | District identifier; the panel unit. | Sequential integer 1..300 in the simulation. | integer | Simulation | all rows |
| country_id | identifier | Country ID (1-8) | Country identifier; districts are nested within 8 fictional countries. | Integer 1..8 assigned in the simulation; used as i.country_id control. | integer | Simulation | all rows |
| year | year | Calendar year (2003-2012) | Annual time index; the panel time dimension. | 2003..2012 for every district (balanced); used as i.year control. | year | Simulation | all rows |
| treatment | identifier | Treatment group (0=none,1=low,2=med,3=high price) | Four-level mining/mineral-price treatment: 0 no mining, 1/2/3 mining at low/medium/high price. | Assigned in the DGP; ~85% level 0 and ~5% each of 1,2,3. Split into binary pairwise contrasts for cate. | 0-3 | Simulation | all rows |
| mining | dummy | Mining district (binary) | 1 if the district-year has active mining (treatment > 0), else 0. | Indicator derived from treatment (1 for levels 1-3, 0 for level 0). | 0/1 | Simulation | all rows |
| price_index | continuous | Mineral price index | Global mineral-price index applied to the district-year. | Set by the treatment price level in the DGP (0 for non-mining). | index | Simulation | all rows |
| exec_constraints | continuous | Constraints on the executive (1-6) | Institutional-quality moderator: strength of constraints on executive power. | Drawn per district in the DGP; rounded to exec_con (1-6) in Stata for GATE grouping. | 1-6 scale | Simulation | all rows |
| quality_of_govt | continuous | Quality of government (0.22-0.70) | Alternative institutional-quality moderator: overall quality of government. | Drawn per district in the DGP; quartile-binned (qog_cat) for GATEs in the post. | 0-1 index | Simulation | all rows |
| gdp_pc | continuous | GDP per capita | District GDP per capita (economic control). | Simulated district economic level (range ~500-5,000). | US$ | Simulation | all rows |
| elevation | continuous | Elevation (meters) | Mean district elevation (geographic control / heterogeneity covariate). | Drawn per district in the DGP. | meters | Simulation | all rows |
| temperature | continuous | Mean temperature (Celsius) | Mean district temperature (geographic control). | Drawn per district in the DGP. | degrees C | Simulation | all rows |
| ruggedness | continuous | Terrain ruggedness | Terrain ruggedness index (geographic control). | Drawn per district in the DGP. | index | Simulation | all rows |
| distance_capital | continuous | Distance to capital (meters) | Distance from the district to the national capital (geographic control). | Drawn per district in the DGP. | meters | Simulation | all rows |
| agri_suitability | continuous | Agricultural suitability (0-1) | Suitability of district land for agriculture (economic/geographic control). | Drawn per district in the DGP, scaled to [0,1]. | 0-1 | Simulation | all rows |
| population | continuous | Population | District population (economic control). | Drawn per district in the DGP. | persons | Simulation | all rows |
| ethnic_frac | continuous | Ethnic fractionalization (0-1) | Degree of ethnic heterogeneity in the district (control). | Drawn per district in the DGP, scaled to [0,1]. | 0-1 | Simulation | all rows |
| ntl_log | continuous | Log nighttime lights | Log of nighttime-light intensity; the development-proxy outcome. | Simulated outcome driven by treatment, moderators, controls, and DGP ground-truth effects. | log intensity | Simulation | all rows |
| conflict | dummy | Conflict event (binary) | 1 if a conflict event occurred in the district-year, else 0; the secondary outcome. | Simulated binary outcome; baseline ~10.7% for non-mining, raised by mining in the DGP. | 0/1 | Simulation | all rows |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| district_id | 100% | 3,000 | 300 | n/a | n/a | n/a | n/a | n/a |
| country_id | 100% | 3,000 | 8 | n/a | n/a | n/a | n/a | n/a |
| year | 100% | 3,000 | 10 | 2003 | 2007.5 | 2007 | 2012 | 2.87 |
| treatment | 100% | 3,000 | 4 | n/a | n/a | n/a | n/a | n/a |
| mining | 100% | 3,000 | 2 | 0 | 0.150 | 0 | 1.00 | 0.357 |
| price_index | 100% | 3,000 | 451 | 0 | 0.123 | 0 | 1.81 | 0.324 |
| exec_constraints | 100% | 3,000 | 6 | 1.00 | 3.68 | 4.00 | 6.00 | 1.49 |
| quality_of_govt | 100% | 3,000 | 8 | 0.220 | 0.440 | 0.420 | 0.700 | 0.152 |
| gdp_pc | 100% | 3,000 | 8 | 500.0 | 2,198.0 | 1,800.0 | 5,000.0 | 1,469.9 |
| elevation | 100% | 3,000 | 281 | 0 | 499.1 | 502.2 | 1,357.2 | 302.0 |
| temperature | 100% | 3,000 | 298 | 13.99 | 23.91 | 24.04 | 35.00 | 3.92 |
| ruggedness | 100% | 3,000 | 261 | 0 | 24.42 | 24.12 | 76.95 | 17.80 |
| distance_capital | 100% | 3,000 | 300 | 10,814 | 268,100 | 262,877 | 497,464 | 144,040 |
| agri_suitability | 100% | 3,000 | 290 | 0 | 0.395 | 0.393 | 0.983 | 0.197 |
| population | 100% | 3,000 | 300 | 4,134.7 | 82,028 | 58,886 | 596,950 | 85,187 |
| ethnic_frac | 100% | 3,000 | 300 | 0.201 | 0.550 | 0.537 | 0.899 | 0.202 |
| ntl_log | 100% | 3,000 | 3,000 | -2.50 | -1.10 | -1.10 | 0.265 | 0.435 |
| conflict | 100% | 3,000 | 2 | 0 | 0.123 | 0 | 1.00 | 0.328 |
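Some entries in the statistics table can be sanity-checked by hand, because the SD of a 0/1 variable is determined by its mean and the SD of year is determined by the balanced design. The sketch below assumes the table reports the sample SD (n − 1 denominator); at this sample size the population SD rounds to the same digits. The helper `binary_sd` is hypothetical.

```python
import math

def binary_sd(p, n):
    """Sample SD (n - 1 denominator) of a 0/1 variable with mean p over n rows."""
    return math.sqrt(p * (1 - p) * n / (n - 1))

n = 3000
mining_sd = binary_sd(0.150, n)     # table reports 0.357
conflict_sd = binary_sd(0.123, n)   # table reports 0.328

# year takes the values 2003..2012 equally often in a balanced panel
years = list(range(2003, 2013))
year_mean = sum(years) / len(years)
year_pop_var = sum((t - year_mean) ** 2 for t in years) / len(years)
year_sd = math.sqrt(year_pop_var * n / (n - 1))   # table reports 2.87

print(round(mining_sd, 3), round(conflict_sd, 3), round(year_sd, 2))
```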
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial; values are simulated with known ground-truth effects, so results validate the method but are not empirical evidence about real-world mining or conflict.
- Direction differs from the paper. Hodler et al. (2023) found that stronger institutions amplify mining benefits (upward GATE slope); this DGP produces the opposite sign (weaker institutions, larger mining effect). What is reproduced is the structural finding that institutions systematically moderate the effect of mining but not the effect of prices; the sign of that moderation is not reproduced.
- Severe treatment imbalance. ~85% of rows are controls; each treated price level is only ~5% (~150 obs). Within-mining price contrasts (2-1, 3-1, 3-2) use only ~300 observations, so confidence intervals are wide.
- Overlap failure (3-2). The high-vs-medium price contrast has propensity scores near 0/1 on its tiny subsample; AIPW fails and even PO with pstolerance(1e-8) returns an unreliable ATE. It is excluded from the summary.
- Finite-sample variability. Several estimates deviate from the ground truth (e.g. 1-0 = 0.149 vs 0.25) due to small treated samples, 5-fold cross-fitting, and the random seed; directional patterns are robust, point estimates are seed-dependent.
- Conflict ground truths. The DGP does not specify ground-truth ATEs for the conflict outcome; conflict results are interpreted directionally only.
- Requires Stata 19+. The cate command does not exist in Stata 18; the do-file aborts on older versions.
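The overlap problem noted above can be screened for before estimation by counting estimated propensity scores that sit at or near 0 or 1. The following is a generic diagnostic sketch; the helper `overlap_report` is hypothetical, the example scores are made up, and the tolerance only borrows the 1e-8 value mentioned above without claiming to reproduce how Stata applies pstolerance().

```python
import numpy as np

def overlap_report(e, tol=1e-8):
    """Count propensity scores outside [tol, 1 - tol] and report their range."""
    e = np.asarray(e, dtype=float)
    outside = (e < tol) | (e > 1 - tol)
    return {
        "n": int(e.size),
        "n_outside": int(outside.sum()),
        "min": float(e.min()),
        "max": float(e.max()),
    }

# Made-up scores: two are numerically indistinguishable from 0 or 1
example_scores = [1e-10, 0.2, 0.5, 1 - 1e-10]
report = overlap_report(example_scores)
print(report)
```

A nonzero `n_outside`, or a minimum and maximum very close to 0 and 1, suggests that inverse-propensity weights will be extreme and that AIPW estimates for that contrast deserve caution.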