Causal Machine Learning and the Resource Curse

Heterogeneous treatment effects with EconML’s CausalForestDML

0.240 DML mining ATE · true 0.250
−0.141 naive bias removed
0.089 GATE range · institutions moderate

Carlos Mendez

Nagoya University (GSID)

June 11, 2026

The Tension

Act I

Is resource wealth a blessing or a curse? The honest answer is: it depends on whom you ask

A district strikes a mine. Nighttime lights climb. But the district next door — same mineral, weaker institutions — barely flickers.

One average effect cannot describe both. The question is not “what is the effect of mining?” but for whom, and how much?

The naive comparison says mining barely helps, and it understates the true effect by 56%

Contrast | Naive | Truth | Bias
Mining vs none (1-0) | 0.109 | 0.250 | −0.141
High vs low price (3-1) | 0.413 | 0.300 | +0.113

Mining districts differ systematically — worse geography, weaker institutions — so a raw difference-in-means confounds selection with effect.
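To see how selection alone can distort a raw difference-in-means, here is a minimal simulation sketch. It is not the deck's data-generating process: the single binary confounder `weak`, the selection probabilities, and the coefficients are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 0.25                      # assumed effect of mining on the outcome

# Hypothetical confounder: True = weak institutions.
weak = rng.random(n) < 0.5
# Selection: mining is more common where institutions are weak.
mining = rng.random(n) < np.where(weak, 0.30, 0.10)
# Weak institutions also lower the outcome directly.
y = true_effect * mining - 0.40 * weak + rng.normal(0.0, 1.0, n)

# Naive estimate: raw difference in means between mining and non-mining rows.
naive = y[mining].mean() - y[~mining].mean()

# Adjusted estimate: difference in means within each stratum of the
# confounder, weighted by the stratum's share of the sample.
adjusted = 0.0
for w in (False, True):
    s = weak == w
    diff = y[s & mining].mean() - y[s & ~mining].mean()
    adjusted += s.mean() * diff

print(f"truth={true_effect:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this toy setup the naive contrast understates the effect because mining rows are disproportionately weak-institution rows; conditioning on the confounder removes that gap.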

We want the conditional average treatment effect (CATE): a function of \(\mathbf{x}\), not a single number

\[\tau(\mathbf{x}) = E\{Y_i(1) - Y_i(0) \mid \mathbf{X}_i = \mathbf{x}\}\]

Where \(\tau(\cdot)\) bends with \(\mathbf{x}\), mining helps some districts more than others — that bend is the whole point.
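Two averages of \(\tau(\cdot)\) recur in the results. As a sketch in this deck's notation, with \(G_i\) a group label introduced here only for illustration (for example, a level of an institutions score):

\[\text{ATE} = E\{\tau(\mathbf{X}_i)\}, \qquad \text{GATE}(g) = E\{\tau(\mathbf{X}_i) \mid G_i = g\}\]

The ATE averages the CATE over all districts; a GATE averages it within one group.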

Where we’re going

  • The lab: a simulated 300-district panel with known ground truth
  • Double Machine Learning — residualize away the confounders
  • The causal forest recovers the ATE and discovers the price non-linearity
  • Group average treatment effects (GATEs) reveal that institutions moderate the mining effect, but not the price effect

The Investigation

Act II

A simulated lab where we know the right answers in advance

  • 3,000 district-years — 300 districts, 8 countries, 2003–2012
  • Outcome — log nighttime lights (a proxy for development)
  • Treatment — four levels: no mining, then low / medium / high mineral prices
  • Ground truth built in — so we can check the method recovers it

Structure mirrors Hodler, Lechner & Raschky (2023); because we built the data, the identifying assumption holds by construction.

The treatment is brutally imbalanced — 85% of district-years never mine

Treatment distribution. 2,550 control observations but only 150 per mining level, so each contrast between two mining levels leans on just 300 rows.

DML residualizes both sides, then lets a forest read the remainder

\[Y_i = \tau(\mathbf{X}_i)\, T_i + g_0(\mathbf{X}_i, \mathbf{W}_i) + \varepsilon_i\]

\[T_i = m_0(\mathbf{X}_i, \mathbf{W}_i) + v_i\]

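Subtract the predictable parts: \(\tilde Y_i = \tau(\mathbf{X}_i)\,\tilde T_i + \varepsilon_i\), where \(\tilde Y_i = Y_i - E[Y_i \mid \mathbf{X}_i, \mathbf{W}_i]\) and \(\tilde T_i = T_i - m_0(\mathbf{X}_i, \mathbf{W}_i)\). The nuisance functions exist only to be subtracted out.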
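The residualize-then-regress idea can be sketched by hand. This is a simplified illustration, not the deck's estimator: it assumes a continuous treatment, a constant effect `tau`, a single covariate, and cross-fitted polynomial least squares standing in for the gradient-boosting nuisance models. The helper names `features` and `cross_fit_residuals` are hypothetical.

```python
import numpy as np

def features(x):
    # Cubic polynomial basis: a simple stand-in nuisance learner.
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def cross_fit_residuals(target, x, folds, n_folds):
    """Out-of-fold residuals: target minus a prediction fitted on the other folds."""
    resid = np.empty_like(target)
    for k in range(n_folds):
        test = folds == k
        coef, *_ = np.linalg.lstsq(features(x[~test]), target[~test], rcond=None)
        resid[test] = target[test] - features(x[test]) @ coef
    return resid

rng = np.random.default_rng(1)
n, n_folds, tau = 5_000, 5, 0.25
x = rng.uniform(-2.0, 2.0, n)
t = 0.5 * x**2 + rng.normal(0.0, 1.0, n)           # T = m0(X) + v
y = tau * t - x**2 + rng.normal(0.0, 1.0, n)       # Y = tau*T + g0(X) + eps

folds = rng.integers(0, n_folds, n)
y_res = cross_fit_residuals(y, x, folds, n_folds)  # Y minus predicted E[Y | X]
t_res = cross_fit_residuals(t, x, folds, n_folds)  # T minus predicted E[T | X]

tau_hat = (t_res @ y_res) / (t_res @ t_res)        # residual-on-residual slope
tau_naive = np.polyfit(t, y, 1)[0]                 # slope of Y on T, no adjustment
print(f"tau={tau:.3f} dml={tau_hat:.3f} naive={tau_naive:.3f}")
```

A causal forest replaces the final one-number slope with a forest that lets the slope vary with \(\mathbf{X}_i\); the residualization step is the same in spirit.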

Why first-stage errors barely matter: Neyman orthogonality

\[\left.\frac{\partial}{\partial \eta} E[\psi(W; \tau, \eta)] \right|_{\eta = \eta_0} = 0\]

At the truth, the estimating equation is flat in the nuisances \(\eta = (g_0, m_0)\) (\(W\) here denotes one full observation, not the controls \(\mathbf{W}_i\)). A 10% error in \(\hat g_0\) enters \(\hat\tau\) at order \((0.10)^2 = 0.01\): second order.
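A numerical sketch of the second-order claim, under assumptions that are mine rather than the deck's: a partially linear model with known nuisances, a deliberate perturbation of size `delta` added to each nuisance, and a non-orthogonal plug-in estimator (`plug_in`, a hypothetical name) as the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n, tau = 500_000, 0.25
x = rng.normal(0.0, 1.0, n)
v = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)

m0 = x                       # true E[T | X]
g0 = np.sin(x)               # true g0(X)
t = m0 + v
y = tau * t + g0 + eps
q0 = tau * m0 + g0           # true E[Y | X]

def orthogonal(delta):
    # Both nuisances are off by delta * x; residual-on-residual estimate.
    y_res = y - (q0 + delta * x)
    t_res = t - (m0 + delta * x)
    return (t_res @ y_res) / (t_res @ t_res)

def plug_in(delta):
    # Non-orthogonal alternative: subtract a perturbed g0, regress on raw T.
    return (t @ (y - (g0 + delta * x))) / (t @ t)

for delta in (0.1, 0.2):
    print(delta, orthogonal(delta) - tau, plug_in(delta) - tau)
```

In this setup the plug-in error grows roughly in proportion to `delta`, while the orthogonal error grows roughly with `delta` squared, so doubling the nuisance error about quadruples a much smaller bias.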

Configure CausalForestDML: honest trees, cross-fitting, grouped by district

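from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Y: log nighttime lights; T: treatment level (0-3);
# X: effect moderators; W: additional controls
est_ntl = CausalForestDML(
    model_y=GradientBoostingRegressor(n_estimators=200, max_depth=4),   # outcome nuisance
    model_t=GradientBoostingClassifier(n_estimators=200, max_depth=4),  # treatment nuisance
    discrete_treatment=True, categories=[0, 1, 2, 3],
    n_estimators=500, min_samples_leaf=10,
    honest=True,       # split-chooser ≠ leaf-estimator → valid CIs
    inference=True,    # Bootstrap-of-Little-Bags standard errors
    cv=5,              # 5-fold cross-fitting
)
est_ntl.fit(Y, T, X=X, W=W,
            groups=df['district_id'].values)  # GroupKFold: no district leakage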
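What "no district leakage" means can be checked with a small sketch. This is not scikit-learn's GroupKFold implementation; `group_folds` is a hypothetical helper that simply assigns whole districts to folds, using the panel shape from the deck (300 districts, 10 years).

```python
import numpy as np

def group_folds(groups, n_folds, seed=0):
    """Assign every row a fold so that all rows of a group share one fold."""
    rng = np.random.default_rng(seed)
    unique = np.unique(groups)
    # Shuffle the groups, then deal them out to folds round-robin.
    fold_of_group = rng.permutation(len(unique)) % n_folds
    return fold_of_group[np.searchsorted(unique, groups)]

# 300 districts observed for 10 years each.
district_id = np.repeat(np.arange(300), 10)
folds = group_folds(district_id, n_folds=5)

for k in range(5):
    train = set(district_id[folds != k])
    test = set(district_id[folds == k])
    assert train.isdisjoint(test)     # no district straddles train and test

print(np.bincount(folds))             # rows per fold
```

Because a district's ten yearly rows are strongly related, splitting them across folds would let the nuisance models see the held-out district during training; grouping prevents that.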

Identification rests on one untestable assumption — not on the algorithm

\[\{Y_i(0), Y_i(1), Y_i(2), Y_i(3)\} \perp T_i \mid (\mathbf{X}_i, \mathbf{W}_i)\]

Conditional Independence: once we know a district’s geography, institutions, country, and year, mining status is as good as random. We built the data, so it holds here — in real data it is untestable.

The Resolution

Act III

The forest recovers a mining ATE of 0.240 — within sampling error of the true 0.250

0.240

mining-vs-none ATE (1-0), SE 0.070, 90% CI [0.124, 0.355] · true value 0.250
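The interval can be checked from the point estimate and standard error alone. This sketch assumes a two-sided normal approximation, which may differ slightly from how the library builds its interval; `normal_ci` is a hypothetical helper.

```python
from statistics import NormalDist

def normal_ci(estimate, se, level=0.90):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se

lo, hi = normal_ci(0.240, 0.070, level=0.90)
print(f"[{lo:.3f}, {hi:.3f}]")
```

From the rounded inputs this lands within about 0.001 of the reported bounds, and the interval contains the true value 0.250.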

DML removes the −0.141 confounding bias the naive estimator carried

−0.141

bias in the naive 1-0 estimate (0.109 vs truth 0.250) · the forest erases almost all of it

The forest discovers a non-linear price gradient — without being told to look

Contrast | ATE | SE | Significant?
Medium vs low (2-1) | 0.029 | 0.101 | no
High vs low (3-1) | 0.220 | 0.101 | 5%
High vs medium (3-2) | 0.191 | 0.109 | 10%

Flat from low to medium, then a jump at high prices — shape discovery with no functional form pre-specified.
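The significance column can be reproduced from the ATE and SE columns. This sketch assumes two-sided tests under a normal approximation, which is an assumption about how the table was built; `two_sided_p` is a hypothetical helper.

```python
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0 under a normal approximation."""
    z = abs(estimate / se)
    return 2.0 * (1.0 - NormalDist().cdf(z))

contrasts = {
    "medium vs low (2-1)": (0.029, 0.101),
    "high vs low (3-1)": (0.220, 0.101),
    "high vs medium (3-2)": (0.191, 0.109),
}
for name, (ate, se) in contrasts.items():
    print(f"{name}: z={ate / se:.2f}, p={two_sided_p(ate, se):.3f}")
```

Under that assumption the first contrast is far from significant, the second clears the 5% level, and the third clears only the 10% level, matching the table.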

Institutions amplify the mining effect — stronger constraints, larger payoff

GATEs for the mining effect (1-0) by executive constraints. The upward slope: stronger institutions amplify the economic benefit of mining.

The price effect is flat across institutions — a non-finding that is the finding

GATEs for the price effect (3-1) by executive constraints. The flat line: institutions do not moderate price effects (range 0.045).
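A GATE and its range can be sketched as group means of unit-level CATE estimates. The inputs below are toy values, not the deck's results: the 1-7 `score` scale and the slope are hypothetical, and the deck's own GATEs may be computed differently.

```python
import numpy as np

def gates(cate, group):
    """Average the unit-level CATE estimates within each group."""
    labels = np.unique(group)
    return {int(g): float(cate[group == g].mean()) for g in labels}

# Toy inputs: CATE estimates that rise with a hypothetical institutions score.
rng = np.random.default_rng(3)
score = rng.integers(1, 8, 3_000)                        # values 1..7
cate_hat = 0.20 + 0.02 * score + rng.normal(0.0, 0.05, 3_000)

by_score = gates(cate_hat, score)
gate_range = max(by_score.values()) - min(by_score.values())
print(by_score, round(gate_range, 3))
```

The range is the gap between the largest and smallest group average: a wide range signals moderation by the grouping variable, a narrow one signals a flat profile.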

A second institutional measure cross-validates the same asymmetry

Mining effect (1-0)

  • rises with quality of government
  • matches the exec.-constraints range of 0.089
  • institutions amplify mining

Price effect (3-1)

  • flat across quality of government
  • matches the exec.-constraints range of 0.045
  • institutions do not touch price

Quality of government tells the same story as executive constraints — robust to the institutional measure.

Beware: the most “important” features are not the moderators

Feature importance for heterogeneity. Geographic variables dominate split frequency, yet institutions are the true moderators.

A depth-2 tree turns the forest’s heterogeneity into a story you can tell aloud

Depth-2 CATE interpreter for the mining effect. Each leaf reports the mean estimated CATE for the subgroup defined by the splits above it.

The strongest objection — and the answer

Objection. A machine that picks controls cannot manufacture causal identification.

Response. Exactly right. \(\tau(\mathbf{x})\) is identified only under the Conditional Independence Assumption — the forest earns honest intervals, but it cannot rule out an unobserved confounder.

Let the data reveal for whom — but never forget the assumption that makes it causal.