<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cross-sectional Data | Carlos Mendez</title><link>https://carlos-mendez.org/category/cross-sectional-data/</link><atom:link href="https://carlos-mendez.org/category/cross-sectional-data/index.xml" rel="self" type="application/rss+xml"/><description>Cross-sectional Data</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sun, 29 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Cross-sectional Data</title><link>https://carlos-mendez.org/category/cross-sectional-data/</link></image><item><title>Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data</title><link>https://carlos-mendez.org/post/stata_bma_dsl/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_bma_dsl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Can countries grow their way out of pollution? The &lt;strong>Environmental Kuznets Curve (EKC)&lt;/strong> hypothesis says yes &amp;mdash; up to a point. As economies develop, pollution first rises with industrialization and then falls as countries grow wealthy enough to afford cleaner technology. But recent research suggests a more complex &lt;strong>inverted-N&lt;/strong> shape: pollution falls at very low incomes, rises through industrialization, and then falls again at high incomes.&lt;/p>
&lt;p>Testing for this shape requires a cubic polynomial in GDP per capita &amp;mdash; and beyond GDP, many other factors might affect CO&lt;sub>2&lt;/sub> emissions. With 12 candidate control variables, there are $2^{12} = 4{,}096$ possible regression models. &lt;strong>Which model should we estimate?&lt;/strong> This is the &lt;strong>model uncertainty problem&lt;/strong>.&lt;/p>
&lt;p>This tutorial introduces two principled solutions:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> estimates thousands of models and averages the results, weighting each by how well it fits the data. Each variable gets a &lt;strong>Posterior Inclusion Probability (PIP)&lt;/strong> &amp;mdash; the fraction of high-quality models that include it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Post-Double-Selection LASSO (DSL)&lt;/strong> uses LASSO to automatically select which controls matter &amp;mdash; once for the outcome, once for each variable of interest &amp;mdash; then runs OLS with the union of all selected controls. This &amp;ldquo;select, then regress&amp;rdquo; approach protects against omitted variable bias.&lt;/p>
&lt;/li>
&lt;/ol>
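&lt;p>The &amp;ldquo;select, then regress&amp;rdquo; logic is easy to see in miniature. The sketch below is a minimal Python/NumPy illustration of the four DSL steps on an invented one-confounder example; a hand-rolled coordinate-descent LASSO stands in for the penalized estimators that &lt;code>dsregress&lt;/code> uses internally, and all variable names are made up for the example:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Minimal LASSO via coordinate descent; assumes standardized columns of X.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]               # partial residual
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)                 # true confounder
x2 = rng.standard_normal(n)                 # pure noise control
d = 0.8 * x1 + rng.standard_normal(n)       # variable of interest
y = 1.0 * d + 1.0 * x1 + rng.standard_normal(n)

X = np.column_stack([x1, x2])
X = (X - X.mean(0)) / X.std(0)              # standardize the candidate controls

sel_y = np.flatnonzero(lasso_cd(X, y - y.mean(), 0.1))  # Step 1: LASSO for outcome
sel_d = np.flatnonzero(lasso_cd(X, d - d.mean(), 0.1))  # Step 2: LASSO for d
union = sorted(set(sel_y) | set(sel_d))                 # Step 3: union of controls

Z = np.column_stack([np.ones(n), d, X[:, union]])       # Step 4: OLS with the union
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
print(union, coef[1])    # confounder kept; estimate of d near its true value 1.0
&lt;/code>&lt;/pre>
&lt;p>Selecting controls only once, from the outcome equation, can drop confounders that matter mainly through the variable of interest; taking the union across both selection steps is what protects the final OLS against omitted variable bias.&lt;/p>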
&lt;p>We use &lt;strong>synthetic panel data&lt;/strong> with a known &amp;ldquo;answer key&amp;rdquo; &amp;mdash; we designed the data so that 5 controls truly affect CO&lt;sub>2&lt;/sub> and 7 are pure noise. This lets us grade each method: does it correctly identify the true predictors? The data is inspired by the panel dataset of Gravina and Lanzafame (2025) but is fully synthetic and not identical to the original.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Companion tutorial.&lt;/strong> For a cross-sectional perspective using R with BMA, LASSO, and WALS, see the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">R tutorial on variable selection&lt;/a>.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the EKC hypothesis and why a cubic polynomial tests for an inverted-N shape&lt;/li>
&lt;li>Recognize model uncertainty as a practical challenge when many controls are available&lt;/li>
&lt;li>Implement BMA with &lt;code>bmaregress&lt;/code> and interpret PIPs and coefficient densities&lt;/li>
&lt;li>Implement post-double-selection LASSO with &lt;code>dsregress&lt;/code> and understand its four-step algorithm: LASSO on outcome, LASSO on each variable of interest, union, then OLS&lt;/li>
&lt;li>Evaluate both methods against a known ground truth to assess their accuracy&lt;/li>
&lt;/ul>
&lt;p>The following diagram summarizes the methodological sequence of this tutorial. We begin with exploratory data analysis to visualize the raw income&amp;ndash;pollution relationship, then estimate baseline fixed effects regressions to expose the model uncertainty problem. Next, we apply BMA and DSL as two alternative solutions, and finally compare both methods against the known answer key.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scatter plot&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Standard panel&amp;lt;br/&amp;gt;regressions&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model&amp;lt;br/&amp;gt;Averaging&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;DSL&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Double-Selection&amp;lt;br/&amp;gt;LASSO&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Check against&amp;lt;br/&amp;gt;answer key&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
style E fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="2-setup-and-synthetic-data">2. Setup and Synthetic Data&lt;/h2>
&lt;h3 id="21-why-synthetic-data">2.1 Why synthetic data?&lt;/h3>
&lt;p>Real-world datasets rarely come with an answer key. We never know which control variables &lt;em>truly&lt;/em> belong in the model. By generating synthetic data with a known data-generating process (DGP), we can verify whether BMA and DSL correctly recover the truth. This is the same &amp;ldquo;answer key&amp;rdquo; approach used in the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion R tutorial&lt;/a>, applied here to panel data.&lt;/p>
&lt;h3 id="22-the-data-generating-process">2.2 The data-generating process&lt;/h3>
&lt;p>The outcome &amp;mdash; log CO&lt;sub>2&lt;/sub> per capita &amp;mdash; follows a cubic EKC with country and year fixed effects:&lt;/p>
&lt;p>$$\ln(\text{CO2})_{it} = \beta_1 \ln(\text{GDP})_{it} + \beta_2 [\ln(\text{GDP})_{it}]^2 + \beta_3 [\ln(\text{GDP})_{it}]^3 + \mathbf{X}_{it}^{\text{true}} \boldsymbol{\gamma} + \alpha_i + \delta_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, log CO&lt;sub>2&lt;/sub> depends on a cubic function of log GDP (producing the inverted-N shape), five true control variables $\mathbf{X}^{\text{true}}$, country fixed effects $\alpha_i$, year fixed effects $\delta_t$, and random noise $\varepsilon_{it}$.&lt;/p>
&lt;p>The &lt;strong>answer key&lt;/strong> &amp;mdash; which variables are true predictors and which are noise:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Group&lt;/th>
&lt;th>In DGP?&lt;/th>
&lt;th>True coef.&lt;/th>
&lt;th>GDP corr.&lt;/th>
&lt;th>Role&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>fossil_fuel&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.015&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More fossil fuels → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>renewable&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More renewables → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>urban&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.007&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More urbanization → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>democracy&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.005&lt;/td>
&lt;td>low&lt;/td>
&lt;td>More democracy → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>industry&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More industry → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>globalization&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pop_density&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>corruption&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>services&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>trade&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>fdi&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>credit&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The &amp;ldquo;GDP corr.&amp;rdquo; column is key to understanding why this problem is non-trivial. Four noise variables (&lt;code>globalization&lt;/code>, &lt;code>services&lt;/code>, &lt;code>trade&lt;/code>, &lt;code>credit&lt;/code>) are deliberately correlated with GDP. A naive regression would find them &amp;ldquo;significant&amp;rdquo; because they piggyback on GDP&amp;rsquo;s true effect. The challenge for BMA and DSL is to see through this correlation and correctly identify that only the 5 true controls belong in the model.&lt;/p>
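&lt;p>The piggyback mechanism is easy to reproduce in miniature: a variable with a true coefficient of zero, but built to track GDP, inherits a strong raw correlation with emissions. The short Python sketch below uses invented numbers (not the tutorial&amp;rsquo;s DGP) purely to illustrate the trap:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n = 1600
z = rng.standard_normal(n)
ln_gdp = 9.6 + 1.3 * z                          # income, roughly as in the panel
ln_co2 = 0.5 * ln_gdp + rng.normal(0, 0.5, n)   # emissions depend on GDP only
noise_ctrl = 0.9 * z + rng.normal(0, 0.4, n)    # zero true effect on emissions

# Raw correlation with emissions is sizable even though the true effect is 0
r = np.corrcoef(ln_co2, noise_ctrl)[0, 1]
print(round(r, 2))                              # roughly 0.7 in this design
&lt;/code>&lt;/pre>
&lt;p>A bivariate regression of emissions on &lt;code>noise_ctrl&lt;/code> would report a strongly &amp;ldquo;significant&amp;rdquo; slope; only conditioning on GDP, or using a method that searches over control sets, reveals that the variable is pure noise.&lt;/p>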
&lt;p>With the DGP and answer key defined, we now load the synthetic data and set up the Stata environment.&lt;/p>
&lt;h3 id="23-load-the-data">2.3 Load the data&lt;/h3>
&lt;p>The synthetic data is hosted on GitHub for reproducibility. It was generated by &lt;code>generate_data.do&lt;/code> (see the link above).&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load synthetic data from GitHub
import delimited &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_bma_dsl/synthetic_ekc_panel.csv&amp;quot;, clear
xtset country_id year, yearly
&lt;/code>&lt;/pre>
&lt;h3 id="24-define-macros">2.4 Define macros&lt;/h3>
&lt;p>We define all variable groups as global macros &amp;mdash; used in every command throughout the tutorial:&lt;/p>
&lt;pre>&lt;code class="language-stata">global outcome &amp;quot;ln_co2&amp;quot;
global gdp_vars &amp;quot;ln_gdp ln_gdp_sq ln_gdp_cb&amp;quot;
global energy &amp;quot;fossil_fuel renewable&amp;quot;
global socio &amp;quot;urban globalization pop_density&amp;quot;
global inst &amp;quot;democracy corruption&amp;quot;
global econ &amp;quot;industry services trade fdi credit&amp;quot;
global controls &amp;quot;$energy $socio $inst $econ&amp;quot;
global fe &amp;quot;i.country_id i.year&amp;quot;
* Ground truth (for evaluation)
global true_vars &amp;quot;fossil_fuel renewable urban democracy industry&amp;quot;
global noise_vars &amp;quot;globalization pop_density corruption services trade fdi credit&amp;quot;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">summarize $outcome $gdp_vars $controls
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ln_co2 | 1,600 -19.0385 .7863276 -21.03685 -16.8315
ln_gdp | 1,600 9.58387 1.329675 6.974263 11.9704
ln_gdp_sq | 1,600 93.6174 25.55106 48.64035 143.2904
ln_gdp_cb | 1,600 931.105 373.829 339.2306 1715.243
fossil_fuel | 1,600 54.7724 19.14168 6.36807 95
renewable | 1,600 29.5413 11.96568 1 64.2207
urban | 1,600 53.6742 14.778 15.95174 91.63234
globalizat~n | 1,600 57.6498 12.71537 26.75758 95
pop_density | 1,600 121.344 210.2646 1 1571.771
democracy | 1,600 2.33346 4.179503 -6.12244 10
corruption | 1,600 52.3523 28.52792 0 100
industry | 1,600 24.6433 6.180478 5.843938 45.32926
services | 1,600 43.5598 9.366089 17.82623 64.07455
trade | 1,600 67.4355 19.36148 10.04306 128.0595
fdi | 1,600 2.98237 4.373857 -11.50437 16.19903
credit | 1,600 53.4402 18.20204 11.32991 123.2399
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 1,600 observations from 80 countries over 20 years (1995&amp;ndash;2014). Log GDP per capita ranges from 6.97 to 11.97, spanning the full income spectrum from about \$1,065 to \$158,000 in synthetic international dollars. Log CO&lt;sub>2&lt;/sub> has a mean of &amp;ndash;19.04 with substantial variation (standard deviation 0.79), reflecting the wide range of development levels in our synthetic panel. With the data loaded, we next visualize the raw income&amp;ndash;pollution relationship.&lt;/p>
&lt;h2 id="3-exploratory-data-analysis">3. Exploratory Data Analysis&lt;/h2>
&lt;p>Before modeling, let us look at the raw relationship between income and emissions.&lt;/p>
&lt;pre>&lt;code class="language-stata">twoway (scatter $outcome ln_gdp, ///
msize(vsmall) mcolor(&amp;quot;106 155 204&amp;quot;%40) msymbol(circle)), ///
ytitle(&amp;quot;Log CO2 per capita&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Synthetic Data: CO2 vs. Income&amp;quot;, size(medium)) ///
subtitle(&amp;quot;80 countries, 1995-2014 (N = 1,600)&amp;quot;, size(small)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig1_scatter.png" alt="Scatter plot of log CO2 per capita versus log GDP per capita for 80 synthetic countries. The cloud of points shows a clear nonlinear pattern consistent with the inverted-N EKC shape.">&lt;/p>
&lt;p>The scatter reveals a distinctly nonlinear pattern. At low income levels, CO&lt;sub>2&lt;/sub> emissions increase steeply with GDP. At higher income levels, the relationship flattens and bends. This curvature motivates the cubic EKC specification. The diagram below shows the two competing EKC shapes &amp;mdash; the classic inverted-U (quadratic) and the more complex inverted-N (cubic) with its three distinct phases:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
EKC[&amp;quot;&amp;lt;b&amp;gt;Environmental Kuznets Curve&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;How does pollution change&amp;lt;br/&amp;gt;as income grows?&amp;quot;]
EKC --&amp;gt; IU[&amp;quot;&amp;lt;b&amp;gt;Inverted-U&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Quadratic: β₁ &amp;gt; 0, β₂ &amp;lt; 0&amp;lt;br/&amp;gt;One turning point&amp;quot;]
EKC --&amp;gt; IN[&amp;quot;&amp;lt;b&amp;gt;Inverted-N&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Cubic: β₁ &amp;lt; 0, β₂ &amp;gt; 0, β₃ &amp;lt; 0&amp;lt;br/&amp;gt;Two turning points&amp;quot;]
IN --&amp;gt; P1[&amp;quot;&amp;lt;b&amp;gt;Phase 1: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Very poor countries&amp;quot;]
IN --&amp;gt; P2[&amp;quot;&amp;lt;b&amp;gt;Phase 2: Rising&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Industrializing countries&amp;quot;]
IN --&amp;gt; P3[&amp;quot;&amp;lt;b&amp;gt;Phase 3: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Wealthy countries&amp;quot;]
style EKC fill:#141413,stroke:#141413,color:#fff
style IU fill:#6a9bcc,stroke:#141413,color:#fff
style IN fill:#d97757,stroke:#141413,color:#fff
style P1 fill:#00d4c8,stroke:#141413,color:#141413
style P2 fill:#d97757,stroke:#141413,color:#fff
style P3 fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>For an inverted-N, we need $\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$. Our synthetic DGP was designed with exactly this sign pattern ($\beta_1 = -7.1$, $\beta_2 = 0.81$, $\beta_3 = -0.03$), so BMA and DSL should recover it &amp;mdash; but can they also correctly identify which of the 12 controls truly matter? Let us start with standard panel regressions to see how sensitive the GDP coefficients are to the choice of controls.&lt;/p>
&lt;h2 id="4-baseline-----standard-fixed-effects">4. Baseline &amp;mdash; Standard Fixed Effects&lt;/h2>
&lt;p>Before reaching for sophisticated methods, let us see what standard panel regressions say. We run two specifications using macros:&lt;/p>
&lt;h3 id="41-sparse-specification">4.1 Sparse specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars, absorb(country_id year) vce(cluster country_id)
estimates store fe_sparse
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9620
Within R-sq. = 0.0354
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.498046 1.623988 -4.62 0.000 -10.73051 -4.26558
ln_gdp_sq | .848967 .1704533 4.98 0.000 .5096881 1.188246
ln_gdp_cb | -.0314993 .005931 -5.31 0.000 -.0433047 -.019694
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The sparse model finds the inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), all significant at the 0.1% level with cluster-robust standard errors (clustered at the country level). The within R² is just 0.035 &amp;mdash; the GDP polynomial alone explains only about 3.5% of within-country CO&lt;sub>2&lt;/sub> variation after absorbing country and year fixed effects. The overall R² of 0.96 is high because the country fixed effects capture most of the variation.&lt;/p>
&lt;h3 id="42-kitchen-sink-specification">4.2 Kitchen-sink specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars $controls, absorb(country_id year) vce(cluster country_id)
estimates store fe_kitchen
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9655
Within R-sq. = 0.1249
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.130693 1.562581 -4.56 0.000 -10.24093 -4.020453
ln_gdp_sq | .8059928 .1647973 4.89 0.000 .477972 1.134014
ln_gdp_cb | -.0298133 .0057365 -5.20 0.000 -.0412314 -.0183951
fossil_fuel | .0138444 .0014853 9.32 0.000 .010888 .0168008
renewable | -.006795 .0019322 -3.52 0.001 -.0106409 -.0029491
urban | .0057534 .0021432 2.68 0.009 .0014875 .0100192
globalizat~n | .0015186 .0012832 1.18 0.240 -.0010357 .0040728
pop_density | .0000794 .0002303 0.34 0.731 -.000379 .0005378
democracy | -.0002971 .007735 -0.04 0.969 -.0156933 .0150991
corruption | .0009812 .0008415 1.17 0.247 -.0006936 .0026561
industry | .0086336 .0017848 4.84 0.000 .0050811 .0121861
services | -.0005642 .0017205 -0.33 0.744 -.0039889 .0028604
trade | -.0002458 .0007695 -0.32 0.750 -.0017774 .0012858
fdi | -.0017599 .0019509 -0.90 0.370 -.005643 .0021232
credit | -.00139 .0007516 -1.85 0.068 -.002886 .0001061
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding all 12 controls raises the within R² from 0.035 to 0.125 &amp;mdash; a meaningful improvement, though the country and year FE still dominate the overall explanatory power (R² = 0.966). The three strongest true predictors (fossil fuel, industry, urban) are clearly significant, while most noise variables are statistically insignificant. Democracy&amp;rsquo;s estimate (&amp;ndash;0.0003, p = 0.97) is far from its true value (&amp;ndash;0.005) and indistinguishable from zero &amp;mdash; illustrating why weak signals are hard to detect even with the correct model.&lt;/p>
&lt;p>The critical question is: which specification should we trust? The next subsection shows that the GDP coefficients &amp;mdash; and hence the EKC shape &amp;mdash; shift depending on which controls we include.&lt;/p>
&lt;h3 id="43-the-model-uncertainty-problem">4.3 The model uncertainty problem&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Coefficient&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Both specifications recover the correct sign pattern, but the magnitudes shift. The kitchen-sink FE estimates (&amp;ndash;7.131, 0.806, &amp;ndash;0.030) are closer to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030) than the sparse FE (&amp;ndash;7.498, 0.849, &amp;ndash;0.031), because the omitted true controls create bias in the sparse model. But which of the 12 controls actually belongs?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Compare coefficients side by side (simplified from analysis.do)
graph twoway ///
(bar value order if spec == &amp;quot;Sparse FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;106 155 204&amp;quot;)) ///
(bar value order if spec == &amp;quot;Kitchen-Sink FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;217 119 87&amp;quot;)), ///
xlabel(1 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 2 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 3 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;' ///
4 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 5 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 6 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;') ///
xline(3.5, lcolor(gs10) lpattern(dash)) ///
ytitle(&amp;quot;Coefficient value&amp;quot;) ///
title(&amp;quot;Coefficient Instability Across Specifications&amp;quot;) ///
legend(order(1 &amp;quot;Sparse FE (no controls)&amp;quot; 2 &amp;quot;Kitchen-Sink FE (all 12 controls)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig2_instability.png" alt="Bar chart comparing GDP polynomial coefficients between sparse and kitchen-sink fixed effects specifications. The coefficients shift between the two models, demonstrating model uncertainty.">&lt;/p>
&lt;p>To understand the practical implications of these coefficient shifts, we compute the income thresholds where emissions change direction. The &lt;strong>turning points&lt;/strong> are found by setting the first derivative of the cubic to zero:&lt;/p>
&lt;p>$$x^* = \frac{-\hat{\beta}_2 \pm \sqrt{\hat{\beta}_2^2 - 3\hat{\beta}_1\hat{\beta}_3}}{3\hat{\beta}_3}, \quad \text{GDP}^* = \exp(x^*)$$&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Turning point&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Minimum (CO&lt;sub>2&lt;/sub> starts rising)&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Maximum (CO&lt;sub>2&lt;/sub> starts falling)&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The turning points shift modestly between specifications &amp;mdash; the minimum stays near \$2,400&amp;ndash;\$2,500 while the maximum moves from \$25,656 to \$27,694 depending on controls. Neither matches the true DGP values perfectly, motivating BMA and DSL as principled alternatives to ad hoc control selection.&lt;/p>
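&lt;p>As a quick arithmetic check, plugging the true DGP coefficients into the turning-point formula above reproduces the last column of the table. A standalone Python sketch:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def turning_points(b1, b2, b3):
    # Roots of the derivative b1 + 2*b2*x + 3*b3*x**2 = 0, mapped to GDP levels
    disc = math.sqrt(b2**2 - 3 * b1 * b3)
    x_lo = (-b2 + disc) / (3 * b3)   # smaller root when b3 is negative
    x_hi = (-b2 - disc) / (3 * b3)
    return math.exp(x_lo), math.exp(x_hi)

gdp_min, gdp_max = turning_points(-7.100, 0.810, -0.030)   # true DGP values
print(round(gdp_min), round(gdp_max))   # about 1,895 and 34,647, as in the table
&lt;/code>&lt;/pre>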
&lt;h2 id="5-bayesian-model-averaging">5. Bayesian Model Averaging&lt;/h2>
&lt;h3 id="51-the-idea">5.1 The idea&lt;/h3>
&lt;p>Think of BMA as betting on a horse race. Instead of putting all your money on one model, BMA spreads bets across the field, wagering more on models with better track records.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Start[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;2¹² = 4,096&amp;lt;br/&amp;gt;possible models&amp;quot;] --&amp;gt; MCMC[&amp;quot;&amp;lt;b&amp;gt;MCMC Sampling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Draw 50,000 models&amp;quot;]
MCMC --&amp;gt; Post[&amp;quot;&amp;lt;b&amp;gt;Posterior Probability&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Weight by fit × parsimony&amp;quot;]
Post --&amp;gt; Avg[&amp;quot;&amp;lt;b&amp;gt;Weighted Average&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Coefficients averaged&amp;lt;br/&amp;gt;across models&amp;quot;]
Post --&amp;gt; PIP[&amp;quot;&amp;lt;b&amp;gt;PIPs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Inclusion probability&amp;lt;br/&amp;gt;for each variable&amp;quot;]
style Start fill:#141413,stroke:#141413,color:#fff
style MCMC fill:#6a9bcc,stroke:#141413,color:#fff
style Post fill:#d97757,stroke:#141413,color:#fff
style Avg fill:#00d4c8,stroke:#141413,color:#141413
style PIP fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>Formally, this betting process follows Bayes' rule, which tells us how to weight models by their fit and complexity.&lt;/p>
&lt;p>&lt;strong>Step 1: Model posterior probabilities.&lt;/strong> The posterior probability of model $M_k$ is:&lt;/p>
&lt;p>$$P(M_k | \text{data}) = \frac{P(\text{data} | M_k) \cdot P(M_k)}{\sum_{l=1}^{K} P(\text{data} | M_l) \cdot P(M_l)}$$&lt;/p>
&lt;p>In words, the probability of model $k$ being correct equals how well it fits the data (the &lt;em>marginal likelihood&lt;/em> $P(\text{data} | M_k)$) times our prior belief ($P(M_k)$), divided by the total across all models. Models that fit the data well &lt;em>and&lt;/em> are parsimonious receive higher posterior weight &amp;mdash; this is BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor.&lt;/p>
&lt;p>The marginal likelihood $P(\text{data} | M_k)$ is not the same as the ordinary likelihood. It integrates over all possible coefficient values, penalizing models with many parameters that &amp;ldquo;waste&amp;rdquo; probability mass on parameter regions the data does not support:&lt;/p>
&lt;p>$$P(\text{data} | M_k) = \int P(\text{data} | \boldsymbol{\beta}_k, M_k) \, P(\boldsymbol{\beta}_k | M_k) \, d\boldsymbol{\beta}_k$$&lt;/p>
&lt;p>In words, the marginal likelihood asks: &amp;ldquo;If we averaged this model&amp;rsquo;s fit across all plausible coefficient values (weighted by the prior $P(\boldsymbol{\beta}_k | M_k)$), how well does it explain the data?&amp;rdquo; This integral is what makes BMA automatically penalize overly complex models &amp;mdash; a model with many parameters spreads its prior probability thinly across a high-dimensional space, and only recovers that probability if the data strongly supports those extra dimensions.&lt;/p>
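&lt;p>A common way to build intuition for this integral is Raftery&amp;rsquo;s BIC approximation, $P(\text{data} | M_k) \approx \exp(-\text{BIC}_k / 2)$, where the $k \ln N$ penalty in BIC plays the role of the thinly spread prior. The toy computation below uses invented log-likelihoods (this approximation is for intuition only, not what &lt;code>bmaregress&lt;/code> reports) and shows a larger model losing despite a better raw fit:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

n = 1600                                  # observations, as in our panel
def bic(loglik, k):
    return -2.0 * loglik + k * math.log(n)

# Invented fits: the big model fits slightly better but uses 12 extra parameters
bic_small = bic(-1000.0, 3)
bic_big = bic(-995.0, 15)

# Approximate posterior model probabilities under a uniform model prior
base = min(bic_small, bic_big)
w_small = math.exp(-0.5 * bic_small + 0.5 * base)
w_big = math.exp(-0.5 * bic_big + 0.5 * base)
p_small = w_small / (w_small + w_big)
print(round(p_small, 3))                  # the parsimonious model dominates
&lt;/code>&lt;/pre>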
&lt;p>&lt;strong>Step 2: Posterior Inclusion Probabilities.&lt;/strong> The &lt;strong>PIP&lt;/strong> for variable $j$ sums the posterior probabilities across all models that include it:&lt;/p>
&lt;p>$$\text{PIP}_j = \sum_{k:\, x_j \in M_k} P(M_k | \text{data})$$&lt;/p>
&lt;p>In words, PIP answers: &amp;ldquo;Across all the models BMA considered, what fraction of the total posterior weight belongs to models that include variable $j$?&amp;rdquo; If fossil fuel appears in every high-probability model, its PIP approaches 1.0. If democracy only appears in low-probability models, its PIP stays near 0.&lt;/p>
&lt;p>&lt;strong>Step 3: BMA posterior mean.&lt;/strong> BMA does not just select variables &amp;mdash; it also produces model-averaged coefficient estimates. The posterior mean of coefficient $\beta_j$ averages across all models, weighted by their posterior probabilities:&lt;/p>
&lt;p>$$\hat{\beta}_j^{\text{BMA}} = \sum_{k=1}^{K} P(M_k | \text{data}) \cdot \hat{\beta}_{j,k}$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the coefficient estimate of variable $j$ in model $M_k$ (set to zero if $j$ is not in $M_k$). In words, the BMA estimate is a weighted average of the coefficient across all models, including models where the variable is absent (contributing zero). This shrinks the coefficient toward zero in proportion to the evidence against inclusion &amp;mdash; a variable with PIP = 0.5 has its BMA coefficient shrunk by roughly half compared to its conditional estimate.&lt;/p>
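&lt;p>Steps 2 and 3 reduce to simple weighted sums over the visited models. A toy posterior over four models (made-up probabilities and coefficients, just to show the arithmetic):&lt;/p>
&lt;pre>&lt;code class="language-python"># Each entry: (posterior model probability, variable included?, coefficient)
models = [
    (0.50, True, 0.015),
    (0.30, True, 0.013),
    (0.15, False, 0.0),   # coefficient is 0 when the variable is excluded
    (0.05, False, 0.0),
]

pip = sum(p for p, inc, _ in models if inc)        # Step 2: PIP = 0.5 + 0.3
beta_bma = sum(p * b for p, _, b in models)        # Step 3: model-averaged mean
beta_cond = beta_bma / pip                         # conditional-on-inclusion mean

print(pip, round(beta_bma, 4), beta_cond)
&lt;/code>&lt;/pre>
&lt;p>Here the PIP is 0.80, and the model-averaged coefficient (0.0114) equals the conditional-on-inclusion estimate (0.01425) scaled by the PIP, exactly the shrinkage-by-evidence behavior described above.&lt;/p>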
&lt;p>Think of PIP as a &lt;strong>democratic vote&lt;/strong> across all candidate models. Each model casts a weighted vote for which variables matter, with better-fitting models getting louder voices. &lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery (1995)&lt;/a> proposed standard interpretation thresholds based on the strength of evidence:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>PIP range&lt;/th>
&lt;th>Evidence&lt;/th>
&lt;th>Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\geq 0.99$&lt;/td>
&lt;td>Decisive&lt;/td>
&lt;td>Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.95 - 0.99$&lt;/td>
&lt;td>Very strong&lt;/td>
&lt;td>Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.80 - 0.95$&lt;/td>
&lt;td>Strong (robust)&lt;/td>
&lt;td>Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.50 - 0.80$&lt;/td>
&lt;td>Borderline&lt;/td>
&lt;td>Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$&amp;lt; 0.50$&lt;/td>
&lt;td>Weak/none (fragile)&lt;/td>
&lt;td>Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our robustness threshold throughout this tutorial &amp;mdash; a variable with PIP above 0.80 appears in the vast majority of the probability-weighted model space, providing &amp;ldquo;strong evidence&amp;rdquo; by Raftery&amp;rsquo;s classification. This is the most widely used cutoff in applied BMA studies.&lt;/p>
&lt;p>A key assumption underlying BMA is that the true data-generating process is well-approximated by a weighted combination of the candidate models (the &amp;ldquo;M-closed&amp;rdquo; assumption). When the candidate set omits important functional forms or interactions, BMA&amp;rsquo;s posterior probabilities may be unreliable.&lt;/p>
&lt;h3 id="52-key-options">5.2 Key options&lt;/h3>
&lt;p>With the conceptual framework in place, we now turn to implementation. Stata 18&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code>&lt;/a> command has three families of options: &lt;strong>priors&lt;/strong> (what you believe before seeing the data), &lt;strong>MCMC controls&lt;/strong> (how the algorithm explores the model space), and &lt;strong>output formatting&lt;/strong> (what gets displayed). The full option list is in the &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Prior specifications&lt;/strong> (see &lt;a href="https://www.stata.com/manuals/bmabmaregresspostestimation.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code> priors&lt;/a> for alternatives):&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>gprior(uip)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; Unit Information Prior: sets the prior precision on coefficients equal to the information in one observation ($g = N$). This is a standard, relatively uninformative choice that lets the data dominate. Alternatives include &lt;code>gprior(bric)&lt;/code> (benchmark risk inflation criterion, $g = \max(N, p^2)$), &lt;code>gprior(zs)&lt;/code> (Zellner-Siow), and &lt;code>gprior(hyper)&lt;/code> (hyper-g prior with data-driven $g$)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>mprior(uniform)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; all $2^{12} = 4{,}096$ models are equally likely a priori; no model is privileged before seeing the data. The alternative &lt;code>mprior(binomial)&lt;/code> applies a beta-binomial prior that penalizes very large or very small models, often producing more conservative PIPs&lt;/li>
&lt;/ul>
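&lt;p>To make the $g$-prior concrete, the shrinkage factor $g/(1+g)$ that later appears in the &lt;code>bmaregress&lt;/code> header can be reproduced by hand. A minimal sketch in plain Python (outside Stata), assuming $N = 1600$ observations and $p = 15$ selection groups as in this tutorial:&lt;/p>

```python
# Posterior shrinkage factor g/(1+g) implied by common g-prior choices.
# Illustrative arithmetic only; bmaregress applies this internally.

def shrinkage(g):
    """Fraction of the OLS estimate retained under Zellner's g-prior."""
    return g / (1 + g)

N, p = 1600, 15            # observations and selection groups in this tutorial

g_uip = N                  # gprior(uip):  unit-information prior, g = N
g_bric = max(N, p ** 2)    # gprior(bric): g = max(N, p^2)

print(f"uip:  g = {g_uip}, shrinkage = {shrinkage(g_uip):.4f}")
print(f"bric: g = {g_bric}, shrinkage = {shrinkage(g_bric):.4f}")
```

&lt;p>With $g = N = 1600$ the factor is $1600/1601 \approx 0.9994$, exactly the &amp;ldquo;Shrinkage, g/(1+g)&amp;rdquo; value reported in the Section 5.3 estimation header: the data almost completely dominate the prior.&lt;/p>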
&lt;p>&lt;strong>MCMC controls:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>mcmcsize(50000)&lt;/code>&lt;/strong> &amp;mdash; draws 50,000 models from the model space using MC$^3$ (Markov chain Monte Carlo model composition) sampling. Larger values improve posterior estimates but increase computation time&lt;/li>
&lt;li>&lt;strong>&lt;code>burnin(5000)&lt;/code>&lt;/strong> &amp;mdash; discards the first 5,000 draws to allow the chain to reach its stationary distribution before collecting samples&lt;/li>
&lt;li>&lt;strong>&lt;code>rseed(9988)&lt;/code>&lt;/strong> &amp;mdash; fixes the random number seed for exact reproducibility. Students running the same command will get identical results&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>groupfv&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; treats all dummies from a single factor variable as one group that enters or exits models together. Without &lt;code>groupfv&lt;/code>, writing &lt;code>i.country_id&lt;/code> would create 80 individual dummy variables, and BMA would consider including or excluding each one independently &amp;mdash; producing an astronomical model space ($2^{80}$ combinations of country dummies alone) that is both computationally infeasible and conceptually meaningless. With &lt;code>groupfv&lt;/code>, the 80 country dummies move as a &lt;em>package&lt;/em>: either all 80 are in the model or none are. Think of it like hiring a sports team &amp;mdash; you recruit the whole roster, not individual players one by one. In the output, this is why you see &amp;ldquo;Groups = 15&amp;rdquo; instead of 113: BMA treats the 80 country dummies as 1 group, the 19 year dummies as 1 group, and each of the 12 candidate controls + 3 GDP terms as their own groups ($1 + 1 + 15 = 17$, minus 2 that are &amp;ldquo;always&amp;rdquo; included = 15 groups subject to selection)&lt;/li>
&lt;li>&lt;strong>&lt;code>($fe, always)&lt;/code>&lt;/strong> &amp;mdash; country and year fixed effects are always included in every model; they are not subject to model selection. This is standard practice in panel data BMA: we want to control for unobserved country and time heterogeneity in &lt;em>every&lt;/em> model, and only let BMA decide about the candidate controls&lt;/li>
&lt;/ul>
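&lt;p>The header counts that &lt;code>bmaregress&lt;/code> reports follow directly from this grouping logic. A quick sanity check of the arithmetic in plain Python (a sketch; the dummy counts assume one base level is omitted per factor):&lt;/p>

```python
# Reproduce the "No. of predictors", "Always", and "Groups" header values
# from the grouping rules described above.
countries, years = 80, 20
country_dummies = countries - 1      # 79: one base level omitted
year_dummies = years - 1             # 19
candidate_controls = 12
gdp_terms = 3                        # ln_gdp, ln_gdp_sq, ln_gdp_cb

predictors = country_dummies + year_dummies + candidate_controls + gdp_terms
always = country_dummies + year_dummies          # ($fe, always)

# Each FE factor is one group; every control and GDP term is its own group.
# The two always-included FE groups are not subject to selection.
groups = (2 + candidate_controls + gdp_terms) - 2

print(predictors, always, groups)    # 113 98 15
```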
&lt;p>&lt;strong>Output formatting:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>pipcutoff(0.8)&lt;/code>&lt;/strong> &amp;mdash; display only variables with PIP above 0.80 in the output table. This is a &lt;em>display&lt;/em> threshold only &amp;mdash; it does not affect the underlying estimation&lt;/li>
&lt;li>&lt;strong>&lt;code>inputorder&lt;/code>&lt;/strong> &amp;mdash; display variables in the order they were specified in the command, rather than sorted by PIP&lt;/li>
&lt;/ul>
&lt;h3 id="53-estimation">5.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress $outcome $gdp_vars $controls ///
($fe, always), ///
mprior(uniform) groupfv gprior(uip) ///
mcmcsize(50000) burnin(5000) rseed(9988) inputorder pipcutoff(0.8)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 113
MC3 sampling Groups = 15
Always = 98
No. of models = 163
Priors: Mean model size = 104.578
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0904
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9997
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -7.13901 1.811093 1 .99401
ln_gdp_sq | .8078437 .1892418 2 .99991
ln_gdp_cb | -.0299182 .0065105 3 .99976
fossil_fuel | .0138139 .001283 4 1
renewable | -.0068332 .0023506 5 .95945
industry | .0085503 .0019766 11 .99867
------------------------------------------------------------------------------
Note: 9 predictors with PIP less than .8 not shown.
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>The Stata output says &amp;ldquo;PIP less than .8&amp;rdquo; because we set &lt;code>pipcutoff(0.8)&lt;/code> as the display threshold &amp;mdash; only variables exceeding this stricter robustness criterion appear in the table. The 9 hidden variables are the two weak true controls (urban, democracy) and all 7 noise variables (services, trade, FDI, credit, population density, corruption, globalization). Figure 3 below shows PIP values for all 15 variables.&lt;/p>
&lt;/blockquote>
&lt;p>The output shows 113 predictors in 15 groups: the 79 country dummies (80 countries, base omitted; grouped as 1 by &lt;code>groupfv&lt;/code>) + 19 year dummies (20 years, base omitted; grouped as 1) + 12 candidate controls (each its own group) + the 3 GDP terms (each its own group) = 15 selection groups total, with 98 variables &amp;ldquo;always&amp;rdquo; included (the country and year FE). BMA sampled 163 distinct models &amp;mdash; a tiny fraction of the $2^{15} = 32{,}768$ possible combinations of the 15 selection groups (4,096 if we count only the 12 candidate controls). This might seem low, but the MC$^3$ algorithm does not need to visit every model &amp;mdash; it concentrates on the high-posterior-probability region. The sampling correlation of 0.9997 (very close to 1.0) confirms that the MC$^3$ chain adequately explored the model space &amp;mdash; the posterior probability is concentrated on a relatively small number of high-quality models. The acceptance rate of 0.09 is below the typical 20&amp;ndash;40% range, but the high sampling correlation provides reassurance that the results are reliable. Six variables have PIP above the 0.80 robustness threshold: the three GDP terms (PIP = 0.994&amp;ndash;1.000) and three of the five true controls &amp;mdash; fossil fuel (PIP = 1.000), industry (PIP = 0.999), and renewable energy (PIP = 0.959). The BMA posterior means (&amp;ndash;7.139, 0.808, &amp;ndash;0.030) are remarkably close to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030), substantially closer than the sparse FE estimates.&lt;/p>
&lt;p>Two true controls &amp;mdash; urban (coefficient 0.007) and democracy (coefficient &amp;ndash;0.005) &amp;mdash; have PIPs well below 0.80. Their true effects are small, making them hard to distinguish from noise. This is a realistic limitation: even a powerful method like BMA struggles with weak signals.&lt;/p>
&lt;h3 id="54-turning-points">5.4 Turning points&lt;/h3>
&lt;p>Using the BMA posterior means, the turning points are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,411 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,269 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
&lt;p>Both turning points are in the right ballpark but not exact. The turning point formula amplifies small differences across all three coefficients &amp;mdash; even though each BMA posterior mean is within 1% of the true DGP value, the compound effect shifts the maximum turning point from \$34,647 (true) to \$27,269 (BMA). The inverted-N shape is clearly recovered.&lt;/p>
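&lt;p>The turning points come from setting the derivative of the cubic in log GDP to zero and exponentiating the two roots. A short Python sketch of that arithmetic, using the BMA posterior means from the Section 5.3 output:&lt;/p>

```python
import math

def turning_points(b1, b2, b3):
    """Turning points of b1*x + b2*x^2 + b3*x^3 where x = ln(GDP),
    returned in GDP-per-capita dollars (exponentiated roots)."""
    # First-order condition: 3*b3*x^2 + 2*b2*x + b1 = 0
    a, b, c = 3 * b3, 2 * b2, b1
    disc = math.sqrt(b * b - 4 * a * c)
    roots = sorted([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
    return [math.exp(r) for r in roots]

# BMA posterior means from the Section 5.3 output
tp_min, tp_max = turning_points(-7.13901, 0.8078437, -0.0299182)
print(f"minimum at ${tp_min:,.0f}, maximum at ${tp_max:,.0f}")
```

&lt;p>The same function applies to any cubic fit in this tutorial; only the three coefficients change.&lt;/p>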
&lt;h3 id="55-posterior-inclusion-probabilities">5.5 Posterior Inclusion Probabilities&lt;/h3>
&lt;p>The PIP chart is BMA&amp;rsquo;s signature output. We extract PIPs from the estimation results, label each variable, and color-code bars by ground truth: steel blue for true predictors, gray for noise.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Extract PIPs and create a horizontal bar chart
matrix pip_mat = e(pip)
* ... (create dataset of variable names and PIPs, add readable labels) ...
* Mark true vs noise predictors
gen is_true = inlist(varname, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;renewable&amp;quot;, &amp;quot;urban&amp;quot;, ///
&amp;quot;democracy&amp;quot;, &amp;quot;industry&amp;quot;, &amp;quot;ln_gdp&amp;quot;, &amp;quot;ln_gdp_sq&amp;quot;, &amp;quot;ln_gdp_cb&amp;quot;)
gsort -pip
graph twoway ///
(bar pip order if is_true == 1, horizontal barwidth(0.6) ///
color(&amp;quot;106 155 204&amp;quot;)) ///
(bar pip order if is_true == 0, horizontal barwidth(0.6) ///
color(gs11)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;) ///
title(&amp;quot;BMA: Which Variables Matter?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor (in DGP)&amp;quot; 2 &amp;quot;Noise variable (not in DGP)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig3_pip.png" alt="Horizontal bar chart showing Posterior Inclusion Probabilities for all 15 variables. True predictors are colored in steel blue, noise variables in gray. A dashed orange line marks the 0.80 robustness threshold.">&lt;/p>
&lt;p>The PIP chart cleanly separates the variables into two groups. At the top (PIP near 1.0): fossil fuel share, GDP terms, industry, and renewable energy &amp;mdash; all true predictors correctly identified. At the bottom (PIP near 0.0): the seven noise variables (globalization, corruption, services, trade, FDI, credit, population density) plus urban population and democracy. BMA correctly assigns zero-like PIPs to all noise variables, and correctly flags 3 of 5 true predictors as robust. The two misses (urban, democracy) have small true coefficients (0.007 and &amp;ndash;0.005), making them genuinely hard to detect.&lt;/p>
&lt;h3 id="56-coefficient-density-plots">5.6 Coefficient density plots&lt;/h3>
&lt;p>The &lt;a href="https://www.stata.com/manuals/bmabmagraphcoefdensity.pdf" target="_blank" rel="noopener">&lt;code>bmagraph coefdensity&lt;/code>&lt;/a> command shows the posterior distribution of each coefficient across all sampled models. We plot all six variables with PIP above 0.80 in a 3x2 grid &amp;mdash; the three GDP polynomial terms (top row) and the three robust controls (bottom row). In each panel, the blue curve shows the density conditional on the variable being included in the model, and the red horizontal line shows the probability of noninclusion (1 &amp;ndash; PIP). When the red line is flat near zero and the blue curve is far from zero, the variable is strongly supported.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Consistent formatting for all panels
local panel_opts `&amp;quot; xtitle(&amp;quot;Coefficient value&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ytitle(&amp;quot;Density&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ylabel(, labsize(vsmall) angle(0)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' xlabel(, labsize(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' legend(off) scheme(s2color) &amp;quot;'
* Generate density for all 6 robust variables (PIP &amp;gt; 0.80)
bmagraph coefdensity ln_gdp, title(&amp;quot;GDP per capita (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp, replace)
bmagraph coefdensity ln_gdp_sq, title(&amp;quot;GDP squared (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_sq, replace)
bmagraph coefdensity ln_gdp_cb, title(&amp;quot;GDP cubed (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_cb, replace)
bmagraph coefdensity fossil_fuel, title(&amp;quot;Fossil fuel share (%)&amp;quot;, size(small)) `panel_opts' name(dens_fossil, replace)
bmagraph coefdensity renewable, title(&amp;quot;Renewable energy (%)&amp;quot;, size(small)) `panel_opts' name(dens_renewable, replace)
bmagraph coefdensity industry, title(&amp;quot;Industry VA (% GDP)&amp;quot;, size(small)) `panel_opts' name(dens_industry, replace)
graph combine dens_gdp dens_gdp_sq dens_gdp_cb ///
dens_fossil dens_renewable dens_industry, ///
cols(3) rows(2) imargin(small) ///
title(&amp;quot;BMA: Posterior Coefficient Densities&amp;quot;, size(medsmall)) ///
subtitle(&amp;quot;All 6 robust variables (PIP &amp;gt; 0.80)&amp;quot;, size(small)) ///
note(&amp;quot;Blue curve = posterior density conditional on inclusion.&amp;quot; ///
&amp;quot;Red line = probability of noninclusion (1 - PIP).&amp;quot; ///
&amp;quot;Near-zero red line + blue curve far from zero = strong evidence.&amp;quot;, size(vsmall)) ///
scheme(s2color) xsize(12) ysize(7)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig4_coefdensity.png" alt="Posterior coefficient density plots for all six robust variables in a 3x2 grid. Top row: GDP linear, squared, and cubic terms. Bottom row: fossil fuel, renewable energy, and industry. All densities are concentrated well away from zero.">&lt;/p>
&lt;p>All six densities are concentrated well away from zero, confirming that every variable with PIP above 0.80 has a genuinely non-zero effect. The three GDP terms (top row) form the inverted-N polynomial: the linear term is centered near &amp;ndash;7.1 (true: &amp;ndash;7.1), the squared term near +0.81 (true: +0.81), and the cubic term near &amp;ndash;0.030 (true: &amp;ndash;0.030). The three controls (bottom row) show tight, unimodal densities: fossil fuel near +0.014 (true: +0.015), renewable energy near &amp;ndash;0.007 (true: &amp;ndash;0.010), and industry near +0.009 (true: +0.010). Renewable energy&amp;rsquo;s posterior mean (&amp;ndash;0.007) is slightly attenuated compared to the true value (&amp;ndash;0.010), reflecting the BMA shrinkage that occurs when a variable&amp;rsquo;s PIP is below 1.0 &amp;mdash; models that exclude it pull the average toward zero.&lt;/p>
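&lt;p>The attenuation of renewable energy&amp;rsquo;s coefficient is a direct consequence of model averaging. Assuming the table&amp;rsquo;s Mean column reports model-averaged (unconditional) means, the posterior mixes the conditional-on-inclusion density with a point mass at zero, so the reported mean equals the conditional mean weighted by the PIP. A quick Python check with the Section 5.3 numbers:&lt;/p>

```python
# BMA posterior mean = PIP * E[beta | included] + (1 - PIP) * 0
pip = 0.95945               # renewable energy PIP, Section 5.3 output
uncond_mean = -0.0068332    # reported posterior mean

cond_mean = uncond_mean / pip    # center of the blue density in Figure 4
print(f"conditional mean: {cond_mean:.5f}")
print(f"model-averaged:   {pip * cond_mean:.5f}")
```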
&lt;h3 id="57-pooled-bma-without-fixed-effects">5.7 Pooled BMA (without fixed effects)&lt;/h3>
&lt;p>To parallel the pooled DSL comparison in Section 6.6, we also run BMA without country or year fixed effects &amp;mdash; treating the panel as a pooled cross-section. This removes the &lt;code>($fe, always)&lt;/code> and &lt;code>groupfv&lt;/code> options, leaving only the 12 candidate controls and 3 GDP terms as predictors (15 total, vs 113 with FE).&lt;/p>
&lt;pre>&lt;code class="language-stata">* BMA without FE -- pooled cross-section
bmaregress ln_co2 ln_gdp ln_gdp_sq ln_gdp_cb ///
fossil_fuel renewable urban industry democracy ///
services trade fdi credit pop_density ///
corruption globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 34
Priors: Mean model size = 11.978
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0733
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9996
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -21.25807 1.641676 1 1
ln_gdp_sq | 2.284729 .1748838 2 1
ln_gdp_cb | -.0813937 .0061308 3 1
fossil_fuel | .0188853 .0010554 4 1
renewable | -.0192089 .0013911 5 1
urban | .0103139 .0012072 6 1
industry | .0138361 .0023478 7 1
services | .0164633 .0016573 9 1
pop_density | -.0004314 .0000567 13 1
credit | .0041017 .0008414 12 .99984
trade | -.0020939 .001084 10 .86009
democracy | .007879 .0042984 8 .84142
------------------------------------------------------------------------------
Note: 3 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The pooled BMA results are striking in two ways. First, the GDP coefficients are severely biased &amp;mdash; the same pattern as pooled DSL: $\beta_1 = -21.26$ (true: &amp;ndash;7.10), $\beta_2 = 2.28$ (true: 0.81), $\beta_3 = -0.081$ (true: &amp;ndash;0.03). Without country fixed effects, the GDP terms absorb persistent cross-country differences in emissions levels, inflating their magnitudes roughly threefold.&lt;/p>
&lt;p>Second, the PIPs tell a completely different story than with FE. Without fixed effects, &lt;strong>12 of 15 variables have PIP above 0.80&lt;/strong> &amp;mdash; including noise variables like services (PIP = 1.000), population density (PIP = 1.000), credit (PIP = 1.000), and trade (PIP = 0.860). With FE, only 6 variables cleared the 0.80 threshold and all 7 noise variables had PIPs near zero. The pooled BMA commits &lt;strong>four false positives&lt;/strong> &amp;mdash; the noise variables services, population density, credit, and trade are incorrectly flagged as robust &amp;mdash; compared to &lt;strong>zero&lt;/strong> false positives with FE. (The weak true controls urban and democracy also clear the threshold here, but their PIPs are inflated by the same bias.) This happens because the noise variables are correlated with omitted country effects &amp;mdash; without FE to absorb those effects, the correlations create spurious associations that BMA interprets as genuine predictive power.&lt;/p>
&lt;p>The turning points (\$5,752 minimum, \$23,298 maximum) are far from the truth, and the 95% credible intervals fail to cover the true values for all three GDP terms &amp;mdash; the same coverage failure seen in pooled DSL. The lesson is clear: &lt;strong>fixed effects are not optional in panel BMA&lt;/strong>. They are essential for correct variable selection, not just coefficient estimation.&lt;/p>
&lt;h2 id="6-post-double-selection-lasso">6. Post-Double-Selection LASSO&lt;/h2>
&lt;h3 id="61-the-idea">6.1 The idea&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> implements the &lt;strong>post-double-selection&lt;/strong> method of Belloni, Chernozhukov, and Hansen (2014). Think of it as a smart research assistant who reads the data twice &amp;mdash; once to find controls that predict the outcome (CO&lt;sub>2&lt;/sub>), and again to find controls that predict the variables of interest (GDP terms) &amp;mdash; then runs a clean OLS regression using only the controls that survived at least one selection.&lt;/p>
&lt;p>The &amp;ldquo;double&amp;rdquo; in double-selection refers to the &lt;strong>union&lt;/strong> of two separate LASSO selections. Why is this union necessary? If a control variable predicts both CO&lt;sub>2&lt;/sub> &lt;em>and&lt;/em> GDP but a single LASSO run on CO&lt;sub>2&lt;/sub> happens to miss it, omitting it from the final regression would bias the GDP coefficient. The second LASSO step (on GDP) catches variables that the first step might miss, and vice versa.&lt;/p>
&lt;p>The algorithm has four steps:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Controls[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;+ country &amp;amp; year FE&amp;quot;]
Controls --&amp;gt; Step1[&amp;quot;&amp;lt;b&amp;gt;Step 1: LASSO on Outcome&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ all controls&amp;lt;br/&amp;gt;→ Selected set X̃y&amp;quot;]
Controls --&amp;gt; Step2[&amp;quot;&amp;lt;b&amp;gt;Step 2: LASSO on Each Variable of Interest&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;GDP ~ all controls → X̃₁&amp;lt;br/&amp;gt;GDP² ~ all controls → X̃₂&amp;lt;br/&amp;gt;GDP³ ~ all controls → X̃₃&amp;quot;]
Step1 --&amp;gt; Union[&amp;quot;&amp;lt;b&amp;gt;Step 3: Take the Union&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;X̂ = X̃y ∪ X̃₁ ∪ X̃₂ ∪ X̃₃&amp;lt;br/&amp;gt;Only controls surviving&amp;lt;br/&amp;gt;at least one selection&amp;quot;]
Step2 --&amp;gt; Union
Union --&amp;gt; OLS[&amp;quot;&amp;lt;b&amp;gt;Step 4: Final OLS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ GDP + GDP² + GDP³ + X̂&amp;lt;br/&amp;gt;Standard OLS with valid&amp;lt;br/&amp;gt;inference on GDP terms&amp;quot;]
style Controls fill:#141413,stroke:#141413,color:#fff
style Step1 fill:#6a9bcc,stroke:#141413,color:#fff
style Step2 fill:#d97757,stroke:#141413,color:#fff
style Union fill:#1a3a8a,stroke:#141413,color:#fff
style OLS fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>At the heart of each LASSO step is a penalized regression that shrinks irrelevant coefficients to exactly zero:&lt;/p>
&lt;p>$$\hat{\boldsymbol{\beta}}^{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2N} \sum_{i=1}^{N}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$&lt;/p>
&lt;p>In words, LASSO minimizes the sum of squared residuals (the usual OLS objective) plus a penalty term $\lambda \sum |\beta_j|$ that charges a cost proportional to the &lt;em>absolute value&lt;/em> of each coefficient. The tuning parameter $\lambda$ controls how harsh this penalty is &amp;mdash; think of it as a &amp;ldquo;strictness dial.&amp;rdquo; When $\lambda = 0$, LASSO is just OLS. As $\lambda$ increases, more coefficients are forced to exactly zero. The L1 (absolute value) penalty is what makes LASSO a variable selector: unlike the L2 (squared) penalty used in Ridge regression, the L1 penalty has sharp corners at zero that drive weak coefficients to exactly zero rather than merely shrinking them.&lt;/p>
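&lt;p>The &amp;ldquo;sharp corners at zero&amp;rdquo; can be made concrete. Under an orthonormal design, the LASSO solution has a closed form: each OLS coefficient is &lt;em>soft-thresholded&lt;/em> &amp;mdash; moved toward zero by $\lambda$, and set to exactly zero when its magnitude is below $\lambda$. A stylized Python illustration (not Stata&amp;rsquo;s implementation):&lt;/p>

```python
def soft_threshold(beta_ols, lam):
    """One-coefficient LASSO solution under an orthonormal design:
    shrink the OLS estimate by lam; snap to exactly zero if |beta| <= lam."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0

lam = 0.1   # the "strictness dial"
for b in (0.9, 0.15, 0.05, -0.02, -0.5):
    print(f"OLS {b:+.2f}  ->  LASSO {soft_threshold(b, lam):+.2f}")
# Coefficients with |b| <= 0.1 are set to exactly zero (selection);
# larger ones are shrunk toward zero by 0.1 but kept.
```

&lt;p>This is why LASSO selects variables while Ridge only shrinks them: the Ridge analogue multiplies each coefficient by a factor strictly between 0 and 1, so nothing ever becomes exactly zero.&lt;/p>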
&lt;p>&lt;strong>Why &amp;ldquo;double&amp;rdquo; selection?&lt;/strong> The key insight of Belloni, Chernozhukov, and Hansen (2014) is that a single LASSO selection can miss important confounders. Consider our panel setting. We want to estimate the effect of GDP terms ($\mathbf{D}$) on CO&lt;sub>2&lt;/sub> ($Y$), controlling for other variables ($\mathbf{W}$). The model is:&lt;/p>
&lt;p>$$Y_i = \mathbf{D}_i' \boldsymbol{\alpha} + \mathbf{W}_i' \boldsymbol{\beta} + \varepsilon_i$$&lt;/p>
&lt;p>A confounder $W_j$ that affects both $Y$ and $\mathbf{D}$ must be included to avoid omitted variable bias. But if $W_j$ has a weak effect on $Y$, the LASSO on $Y$ might miss it. The double-selection strategy solves this by running LASSO twice:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Step 1&lt;/strong> selects controls that predict $Y$: $\quad \hat{S}_Y = \{j : \hat{\beta}_j^{\text{LASSO}(Y)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 2&lt;/strong> selects controls that predict each $D_k$: $\quad \hat{S}_{D_k} = \{j : \hat{\gamma}_{j,k}^{\text{LASSO}(D_k)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 3&lt;/strong> takes the union: $\quad \hat{S} = \hat{S}_Y \cup \hat{S}_{D_1} \cup \hat{S}_{D_2} \cup \hat{S}_{D_3}$&lt;/li>
&lt;li>&lt;strong>Step 4&lt;/strong> runs OLS of $Y$ on $\mathbf{D}$ and $\mathbf{W}_{\hat{S}}$ with standard inference&lt;/li>
&lt;/ul>
&lt;p>The union in Step 3 ensures that a confounder missed by the $Y$-LASSO but caught by the $D$-LASSO is still included. This &amp;ldquo;safety net&amp;rdquo; property is what gives post-double-selection its valid inference guarantees &amp;mdash; the final OLS produces consistent estimates of $\boldsymbol{\alpha}$ even if each individual LASSO makes some selection mistakes.&lt;/p>
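&lt;p>The safety-net logic can be sketched end to end with a toy stand-in for LASSO &amp;mdash; simple correlation screening here, purely for illustration, with hypothetical variable names (&lt;code>w_conf&lt;/code>, &lt;code>w_y&lt;/code>, &lt;code>w_noise&lt;/code>). The confounder is built so that its direct and indirect effects on $Y$ cancel: the outcome screen misses it, but the treatment screen rescues it.&lt;/p>

```python
import random

random.seed(1)
n = 2000

# Toy DGP: w_conf drives the treatment d strongly, but its direct effect
# on y (-1.0) is offset by its indirect effect through d (2 * 0.5),
# so a screen on y alone misses it.
w_conf  = [random.gauss(0, 1) for _ in range(n)]
w_y     = [random.gauss(0, 1) for _ in range(n)]
w_noise = [random.gauss(0, 1) for _ in range(n)]
d = [0.5 * wc + random.gauss(0, 1) for wc in w_conf]
y = [2.0 * di - 1.0 * wc + 0.8 * wy + random.gauss(0, 0.5)
     for di, wc, wy in zip(d, w_conf, w_y)]

def corr(a, b):
    """Pearson correlation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab / (saa * sbb) ** 0.5

controls = {"w_conf": w_conf, "w_y": w_y, "w_noise": w_noise}

def screen(target):
    """Stand-in for a LASSO step: keep controls correlated with target."""
    return {name for name, w in controls.items() if abs(corr(w, target)) > 0.2}

selected_y = screen(y)            # Step 1: predictors of the outcome
selected_d = screen(d)            # Step 2: predictors of the treatment
union = selected_y | selected_d   # Step 3: the safety net
print(sorted(selected_y), sorted(selected_d), sorted(union))
# Step 4 would regress y on d plus the union by OLS.
```

&lt;p>The outcome screen keeps only &lt;code>w_y&lt;/code>, yet the union still contains the confounder, which is exactly what protects the final OLS from omitted variable bias.&lt;/p>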
&lt;p>The &lt;code>dsregress&lt;/code> command uses a &amp;ldquo;plugin&amp;rdquo; method to choose $\lambda$ &amp;mdash; an analytical formula that sets the penalty based on the sample size and noise level, without requiring cross-validation. A key assumption underlying DSL is &lt;em>approximate sparsity&lt;/em>: only a small number of controls truly matter, so LASSO can safely set the rest to zero. When the true model is dense (many small effects rather than a few large ones), LASSO may struggle to select the right variables.&lt;/p>
&lt;p>Before implementing DSL, it helps to see the two methods side by side:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>BMA&lt;/th>
&lt;th>Post-Double-Selection&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Philosophy&lt;/td>
&lt;td>Bayesian (posteriors)&lt;/td>
&lt;td>Frequentist (p-values)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Strategy&lt;/td>
&lt;td>Average across models&lt;/td>
&lt;td>Select controls, then OLS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Output&lt;/td>
&lt;td>PIPs for every variable&lt;/td>
&lt;td>Set of selected controls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Speed&lt;/td>
&lt;td>Minutes (MCMC)&lt;/td>
&lt;td>Seconds (optimization)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Reference&lt;/td>
&lt;td>Raftery et al. (1997)&lt;/td>
&lt;td>Belloni, Chernozhukov, Hansen (2014)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="62-key-options">6.2 Key options&lt;/h3>
&lt;p>With the algorithm clear, let us examine the Stata implementation. The &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> command has a concise syntax, but each element plays a specific role. The full option list is in the &lt;a href="https://www.stata.com/manuals/lasso.pdf" target="_blank" rel="noopener">Stata LASSO manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Syntax structure:&lt;/strong> &lt;code>dsregress depvar varsofinterest, controls(controlvars) [options]&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>$outcome&lt;/code>&lt;/strong> (&lt;code>ln_co2&lt;/code>) &amp;mdash; the dependent variable. DSL will run LASSO on this variable against all controls (Step 1)&lt;/li>
&lt;li>&lt;strong>&lt;code>$gdp_vars&lt;/code>&lt;/strong> (&lt;code>ln_gdp ln_gdp_sq ln_gdp_cb&lt;/code>) &amp;mdash; the &lt;em>variables of interest&lt;/em>. These are never penalized by LASSO; they always appear in the final OLS. DSL runs a separate LASSO for each one against all controls (Steps 2a&amp;ndash;2c)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>controls(($fe) $controls)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; the candidate controls subject to LASSO selection. Parentheses around &lt;code>$fe&lt;/code> tell Stata to treat factor variables (country and year dummies) as always-included in the LASSO penalty but available for selection. The 12 candidate controls are subject to the standard LASSO penalty&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>vce(cluster country_id)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; compute cluster-robust standard errors at the country level in the final OLS (Step 4). This also affects the LASSO penalty through the &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(plugin)&lt;/code>&lt;/a> method, which adjusts $\lambda$ for cluster dependence&lt;/li>
&lt;li>&lt;strong>&lt;code>selection(plugin)&lt;/code>&lt;/strong> (default) &amp;mdash; choose $\lambda$ using a data-driven analytical formula rather than cross-validation. The alternative &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(cv)&lt;/code>&lt;/a> uses cross-validation but is slower&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassoinfo.pdf" target="_blank" rel="noopener">&lt;code>lassoinfo&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; reports the number of selected controls and the $\lambda$ value for each LASSO step&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassocoef.pdf" target="_blank" rel="noopener">&lt;code>lassocoef&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; displays which specific variables were selected or dropped by LASSO&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Related commands.&lt;/strong> Stata also offers &lt;a href="https://www.stata.com/manuals/lassoporegress.pdf" target="_blank" rel="noopener">&lt;code>poregress&lt;/code>&lt;/a> (partialing-out regression), which &lt;em>residualizes&lt;/em> both the outcome and the treatment against all controls instead of selecting then regressing. Both methods provide valid inference. &lt;a href="https://www.stata.com/manuals/lassoxporegress.pdf" target="_blank" rel="noopener">&lt;code>xporegress&lt;/code>&lt;/a> extends this to cross-fit partialing-out for even more robust inference. This tutorial uses &lt;code>dsregress&lt;/code> because its select-then-regress logic is more intuitive for beginners.&lt;/p>
&lt;/blockquote>
&lt;h3 id="63-estimation">6.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress $outcome $gdp_vars, ///
controls(($fe) $controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 112
Number of selected controls = 102
Wald chi2(3) = 53.15
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.433319 1.628321 -4.57 0.000 -10.62477 -4.241868
ln_gdp_sq | .8401567 .1713522 4.90 0.000 .5043126 1.176001
ln_gdp_cb | -.0310764 .005952 -5.22 0.000 -.0427421 -.0194107
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Post-double-selection completed in seconds with cluster-robust standard errors at the country level. Internally, &lt;code>dsregress&lt;/code> ran four separate LASSO regressions (Step 1 on CO&lt;sub>2&lt;/sub>, Steps 2a&amp;ndash;2c on each GDP term), took the union of all selected controls, and then ran a final OLS of CO&lt;sub>2&lt;/sub> on the GDP terms plus that union. All three GDP terms are significant at the 0.1% level. The Wald test strongly rejects the null that GDP terms are jointly zero ($\chi^2 = 53.15$, p &amp;lt; 0.001).&lt;/p>
&lt;h3 id="64-turning-points">6.4 Turning points&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,429 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,672 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
&lt;p>The post-double-selection turning points (\$2,429 and \$27,672) fall between the sparse FE and kitchen-sink estimates, closer to the BMA values. With cluster-robust standard errors, the LASSO steps retained 102 of the 112 controls for the outcome equation and 100 for each GDP term. The clustering adjustment changes the plugin penalty, and hence the union of controls entering the final OLS, producing coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that lie between the sparse and kitchen-sink specifications.&lt;/p>
&lt;h3 id="65-lasso-selection">6.5 LASSO selection&lt;/h3>
&lt;p>To understand which controls LASSO kept and which it dropped, we inspect the selection details:&lt;/p>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 102
ln_gdp | linear plugin .3818852 100
ln_gdp_sq | linear plugin .3818852 100
ln_gdp_cb | linear plugin .3818852 100
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>lassoinfo&lt;/code> output shows each of the four LASSO steps, all using the same plugin lambda. The 112 controls comprise 100 fixed-effect dummies (80 country levels + 20 year levels; Stata&amp;rsquo;s lasso commands keep every factor level rather than dropping a base) plus the 12 candidate variables. Because the FE dummies are specified as always-included, they count as selected in every step: the outcome equation&amp;rsquo;s 102 selected controls are the 100 FE dummies plus 2 candidates, while each GDP equation&amp;rsquo;s 100 selected controls are the FE dummies alone. The union across all four steps (Step 3) yields the final control set for Step 4&amp;rsquo;s OLS. With cluster-robust standard errors, the plugin lambda is larger (0.382 vs 0.090 without clustering), which tightens the candidate selection and produces DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that fall between the sparse and kitchen-sink FE.&lt;/p>
&lt;p>Why does DSL not match BMA&amp;rsquo;s accuracy here? In panel data settings where FE dummies dominate the control set (99 of 112 variables), LASSO retains nearly all FE dummies and has limited room to discriminate among the 12 candidate controls of interest &amp;mdash; it dropped only 10&amp;ndash;12 variables at each step, most of them weak FE dummies rather than noise controls. This &amp;ldquo;almost everything selected&amp;rdquo; outcome means DSL&amp;rsquo;s final OLS is close to the kitchen-sink specification, which explains why its coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) fall between sparse and kitchen-sink FE rather than converging to the true DGP. To see LASSO&amp;rsquo;s selection power unleashed, we next run DSL &lt;em>without&lt;/em> fixed effects.&lt;/p>
&lt;h3 id="66-pooled-dsl-without-fixed-effects">6.6 Pooled DSL (without fixed effects)&lt;/h3>
&lt;p>What happens when LASSO has only 12 candidate controls instead of 112? To answer this, we run DSL on the pooled data &amp;mdash; treating the panel as a cross-sectional dataset without country or year fixed effects. This gives LASSO full room to discriminate among the candidate controls, but at the cost of omitting the unobserved country heterogeneity that fixed effects would absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">* DSL without FE -- pooled cross-section with cluster-robust SEs
dsregress $outcome $gdp_vars, ///
controls($controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 12
Number of selected controls = 7
Wald chi2(3) = 25.05
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -22.03297 5.277295 -4.18 0.000 -32.37628 -11.68966
ln_gdp_sq | 2.366878 .5652276 4.19 0.000 1.259052 3.474703
ln_gdp_cb | -.084224 .0199055 -4.23 0.000 -.1232381 -.04521
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The pooled DSL still finds the correct inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), but the magnitudes are dramatically different from the true DGP. The linear coefficient (&amp;ndash;22.03) is more than &lt;em>three times&lt;/em> the true value (&amp;ndash;7.10), and the other terms are similarly inflated. This is &lt;strong>omitted variable bias&lt;/strong>: without country fixed effects, the GDP terms absorb not only their own effect on CO&lt;sub>2&lt;/sub> but also the persistent cross-country differences in emissions levels that fixed effects would have captured.&lt;/p>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 5
ln_gdp | linear plugin .3818852 7
ln_gdp_sq | linear plugin .3818852 7
ln_gdp_cb | linear plugin .3818852 7
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Now the contrast with the FE-based DSL is stark. The outcome LASSO selected only &lt;strong>5 of 12&lt;/strong> controls (vs 102 of 112 with FE), and each GDP LASSO selected &lt;strong>7 of 12&lt;/strong> (vs 100 of 112). Without FE dummies flooding the candidate set, LASSO can genuinely discriminate &amp;mdash; it zeroed out 5&amp;ndash;7 controls as irrelevant. The implied turning points, however, are \$5,581 (minimum) and \$24,532 (maximum), far from the true values of \$1,895 and \$34,647.&lt;/p>
&lt;p>This comparison illustrates a fundamental tradeoff in panel data econometrics: &lt;strong>fixed effects remove bias but limit LASSO&amp;rsquo;s selection power&lt;/strong>. With FE, the estimates are unbiased but LASSO selects almost everything. Without FE, LASSO selects sharply but the estimates are biased by unobserved heterogeneity. The FE-based DSL from Section 6.3 is the correct specification for this data, even though LASSO&amp;rsquo;s selection looks less impressive.&lt;/p>
&lt;h2 id="7-head-to-head-comparison">7. Head-to-Head Comparison&lt;/h2>
&lt;h3 id="71-coefficient-comparison">7.1 Coefficient comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>BMA (FE)&lt;/th>
&lt;th>DSL (FE)&lt;/th>
&lt;th>BMA (pooled)&lt;/th>
&lt;th>DSL (pooled)&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;21.258&lt;/td>
&lt;td>&amp;ndash;22.033&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>2.285&lt;/td>
&lt;td>2.367&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.081&lt;/td>
&lt;td>&amp;ndash;0.084&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$5,752&lt;/td>
&lt;td>\$5,581&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$23,298&lt;/td>
&lt;td>\$24,532&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The table reveals a sharp divide between FE-based and pooled specifications. The four FE-based methods (columns 2&amp;ndash;5) all produce GDP coefficients within a narrow range of the true values &amp;mdash; BMA (FE) and Kitchen-Sink FE are closest, with estimates within 1% of the truth. The two pooled methods (columns 6&amp;ndash;7) are dramatically biased, with coefficients inflated 2&amp;ndash;3x. Strikingly, BMA (pooled) and DSL (pooled) agree closely with &lt;em>each other&lt;/em> (&amp;ndash;21.26 vs &amp;ndash;22.03 for $\beta_1$), confirming that the bias comes from omitting fixed effects, not from the choice of variable selection method. Both pooled methods produce turning points displaced from the truth (\$5,581&amp;ndash;\$5,752 vs true \$1,895 for the minimum).&lt;/p>
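&lt;p>Turning points in the table follow mechanically from each column&amp;rsquo;s cubic coefficients: set the derivative of $\beta_1 x + \beta_2 x^2 + \beta_3 x^3$ to zero and exponentiate the two roots of the resulting quadratic. The arithmetic is language-agnostic, so it can be cross-checked outside Stata; the Python sketch below reproduces the True DGP and DSL (pooled) rows from the displayed coefficients (rounding in the displayed digits shifts the dollar values by a few units).&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def turning_points(b1, b2, b3):
    # Roots of the first-order condition 3*b3*x^2 + 2*b2*x + b1 = 0,
    # converted from log GDP per capita to dollars via exp()
    disc = (2 * b2) ** 2 - 12 * b3 * b1
    roots = sorted((-2 * b2 + s * math.sqrt(disc)) / (6 * b3) for s in (1, -1))
    return tuple(math.exp(r) for r in roots)

# True DGP row of the table: recovers roughly 1,895 and 34,647
print(turning_points(-7.100, 0.810, -0.030))

# DSL (pooled) row, full-precision dsregress coefficients:
# roughly 5,581 and 24,532
print(turning_points(-22.03297, 2.366878, -0.084224))
&lt;/code>&lt;/pre>
&lt;p>Running this reproduces \$1,895 and \$34,647 for the true DGP and values within a few dollars of the pooled-DSL turning points reported in the table.&lt;/p>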
&lt;h3 id="72-uncertainty-confidence-and-credible-intervals">7.2 Uncertainty: confidence and credible intervals&lt;/h3>
&lt;p>Point estimates tell only half the story. How &lt;em>uncertain&lt;/em> is each method, and does the interval actually contain the truth? The table below shows 95% confidence intervals (for the frequentist methods) and approximate 95% credible intervals (for BMA, computed as posterior mean $\pm$ 2 posterior SD). The last column checks whether the true DGP value falls inside the interval.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>$\beta_1$ (GDP) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_2$ (GDP²) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_3$ (GDP³) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Sparse FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.731, &amp;ndash;4.266]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.510, 1.188]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.020]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Kitchen-Sink FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.241, &amp;ndash;4.021]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.478, 1.134]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.041, &amp;ndash;0.018]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (FE)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;10.761, &amp;ndash;3.517]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.429, 1.186]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.017]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (FE)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.625, &amp;ndash;4.242]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.504, 1.176]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.019]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (pooled)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;24.541, &amp;ndash;17.975]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.935, 2.635]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.094, &amp;ndash;0.069]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (pooled)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;32.376, &amp;ndash;11.690]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.259, 3.475]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.123, &amp;ndash;0.045]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True DGP&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;td>&lt;/td>
&lt;td>0.810&lt;/td>
&lt;td>&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The four FE-based methods all produce intervals that contain the true parameter values &amp;mdash; a reassuring result. Both pooled methods, however, &lt;strong>fail to cover the truth for any of the three coefficients&lt;/strong>. The pooled DSL intervals are wide (the $\beta_1$ interval spans 20.7 units) but centered so far from the truth that even this width cannot compensate. The pooled BMA credible intervals are actually &lt;em>narrower&lt;/em> (spanning 6.6 units for $\beta_1$) but even more precisely wrong &amp;mdash; they are tightly concentrated around the biased estimate. This is the worst-case scenario: &lt;strong>false precision from a misspecified model&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Width reflects uncertainty.&lt;/strong> Among the FE-based methods, BMA produces the widest interval for $\beta_1$ (width = 7.24), followed by Sparse FE (6.47), DSL with FE (6.38), and Kitchen-Sink FE (6.22). BMA&amp;rsquo;s wider intervals reflect its honest accounting of model uncertainty &amp;mdash; it averages across thousands of models, each contributing slightly different coefficient estimates, which inflates the posterior standard deviation. The frequentist methods condition on a single model and therefore understate the total uncertainty.&lt;/p>
&lt;p>&lt;strong>Centering reflects bias.&lt;/strong> Kitchen-Sink FE and BMA center their intervals closest to the true value (&amp;ndash;7.131 and &amp;ndash;7.139 vs. true &amp;ndash;7.100), while Sparse FE (&amp;ndash;7.498) and DSL with FE (&amp;ndash;7.433) are slightly further away. The pooled DSL (&amp;ndash;22.033) is dramatically off-center, illustrating that omitted variable bias overwhelms any precision gained from better variable selection.&lt;/p>
&lt;p>&lt;strong>Coverage requires correct specification.&lt;/strong> The pooled DSL result drives home a critical lesson: a confidence interval is only as good as the model behind it. The 95% label promises that, in repeated sampling, 95% of intervals would contain the truth &amp;mdash; but this guarantee holds only if the model is correctly specified. When country fixed effects are omitted, the model is misspecified, and the intervals fail despite being statistically &amp;ldquo;valid&amp;rdquo; within the pooled framework.&lt;/p>
&lt;p>&lt;strong>Bayesian vs frequentist interpretation.&lt;/strong> BMA&amp;rsquo;s credible intervals have a different interpretation: a 95% BMA credible interval says &amp;ldquo;given the data and priors, there is a 95% posterior probability the true coefficient lies in this range,&amp;rdquo; while a 95% confidence interval says &amp;ldquo;if we repeated this procedure many times, 95% of the intervals would contain the truth.&amp;rdquo; In practice, both require correct model specification to be reliable.&lt;/p>
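&lt;p>The coverage column can be verified mechanically. As a sketch in Python, with the $\beta_1$ intervals transcribed from the table, checking coverage of the true value and the width of each interval:&lt;/p>
&lt;pre>&lt;code class="language-python"># 95% intervals for beta_1, transcribed from the table above
intervals = {
    'Sparse FE':       (-10.731, -4.266),
    'Kitchen-Sink FE': (-10.241, -4.021),
    'BMA (FE)':        (-10.761, -3.517),
    'DSL (FE)':        (-10.625, -4.242),
    'BMA (pooled)':    (-24.541, -17.975),
    'DSL (pooled)':    (-32.376, -11.690),
}
true_b1 = -7.100

# An interval covers the truth when the true value sits between its endpoints
coverage = {m: hi >= true_b1 >= lo for m, (lo, hi) in intervals.items()}
width = {m: hi - lo for m, (lo, hi) in intervals.items()}

for m in intervals:
    print(m, coverage[m], round(width[m], 2))
&lt;/code>&lt;/pre>
&lt;p>The four FE-based intervals cover &amp;ndash;7.100 and the two pooled intervals do not, matching the table; the width calculation also confirms that BMA (FE) produces the widest FE-based interval (7.24 units).&lt;/p>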
&lt;h3 id="73-predicted-ekc-curves">7.3 Predicted EKC curves&lt;/h3>
&lt;p>The curves are normalized to zero at the sample-mean GDP so both methods are directly comparable:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Generate predicted EKC curves for BMA and DSL, normalized at mean GDP
summarize ln_gdp
local xmin = r(min)
local xmax = r(max)
local xmean = r(mean)
clear
set obs 500
gen lngdp = `xmin' + (_n - 1) * (`xmax' - `xmin') / 499
* Cubic component for each method (coefficient and turning-point locals stored earlier)
gen fit_bma = `b1_bma' * lngdp + `b2_bma' * lngdp^2 + `b3_bma' * lngdp^3
gen fit_dsl = `b1_dsl' * lngdp + `b2_dsl' * lngdp^2 + `b3_dsl' * lngdp^3
* Normalize: subtract value at sample-mean GDP
local norm_bma = `b1_bma' * `xmean' + `b2_bma' * `xmean'^2 + `b3_bma' * `xmean'^3
local norm_dsl = `b1_dsl' * `xmean' + `b2_dsl' * `xmean'^2 + `b3_dsl' * `xmean'^3
replace fit_bma = fit_bma - `norm_bma'
replace fit_dsl = fit_dsl - `norm_dsl'
twoway ///
(line fit_bma lngdp, lcolor(&amp;quot;106 155 204&amp;quot;) lwidth(medthick)) ///
(line fit_dsl lngdp, lcolor(&amp;quot;217 119 87&amp;quot;) lwidth(medthick) lpattern(dash)), ///
xline(`lnmin_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
xline(`lnmax_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
ytitle(&amp;quot;Predicted log CO2 (normalized at mean GDP)&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Predicted EKC Shape: BMA vs. DSL&amp;quot;) ///
legend(order(1 &amp;quot;BMA&amp;quot; 2 &amp;quot;DSL&amp;quot;) rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig5_ekc_curves.png" alt="Predicted EKC curves from BMA and DSL, normalized at the sample mean. Both methods trace a clear inverted-N shape with closely aligned turning points.">&lt;/p>
&lt;p>Both curves trace a clear inverted-N: CO&lt;sub>2&lt;/sub> falls at low incomes, rises through industrialization, and falls again at high incomes. The BMA curve (solid blue) and DSL curve (dashed orange) are nearly indistinguishable, with turning points closely aligned. The normalization at mean GDP makes the shape immediately visible &amp;mdash; a major improvement over plotting raw cubic components that would sit at different y-levels.&lt;/p>
&lt;h3 id="74-answer-key-grading-the-methods">7.4 Answer key: grading the methods&lt;/h3>
&lt;p>The ultimate test: do BMA and DSL correctly identify the 5 true control variables and reject the 7 noise variables?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Dot plot: BMA PIPs color-coded by ground truth
* (extract PIPs, label variables, mark true vs noise --- see analysis.do)
graph twoway ///
(scatter order pip if is_true == 1, ///
mcolor(&amp;quot;106 155 204&amp;quot;) msymbol(circle) msize(large)) ///
(scatter order pip if is_true == 0, ///
mcolor(gs9) msymbol(diamond) msize(large)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;BMA Posterior Inclusion Probability&amp;quot;) ///
title(&amp;quot;Answer Key: Do BMA and DSL Recover the Truth?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor&amp;quot; 2 &amp;quot;Noise variable&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig6_answer_key.png" alt="Dot plot showing BMA Posterior Inclusion Probabilities for each variable, color-coded by ground truth. True predictors (circles, blue) cluster above the 0.80 threshold; noise variables (diamonds, gray) cluster below it.">&lt;/p>
&lt;p>&lt;strong>BMA&amp;rsquo;s report card:&lt;/strong> Of the 8 true predictors (3 GDP terms + 5 controls), BMA correctly assigns PIP &amp;gt; 0.80 to 6 &amp;mdash; the three GDP terms, fossil fuel, industry, and renewable energy. It misses urban (PIP ~ 0.27) and democracy (PIP ~ 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). All 7 noise variables receive PIPs well below 0.80. BMA makes &lt;strong>zero false positives&lt;/strong> (no noise variable incorrectly flagged as robust) and &lt;strong>two false negatives&lt;/strong> (two weak true predictors missed).&lt;/p>
&lt;p>&lt;strong>Post-double-selection&amp;rsquo;s report card:&lt;/strong> With cluster-robust SEs, the four LASSO steps each retained 100&amp;ndash;102 of 112 total controls (including FE dummies), so the union passed to the final OLS is nearly the full candidate set. The resulting DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) fall between the sparse and kitchen-sink FE, closer to the true DGP than the sparse specification. The entire procedure runs in seconds rather than minutes.&lt;/p>
&lt;p>&lt;strong>Bottom line:&lt;/strong> Both methods recover the inverted-N EKC shape. BMA provides more granular variable-level inference (PIPs), while DSL provides fast, valid coefficient estimates. The synthetic data &amp;ldquo;answer key&amp;rdquo; confirms that both are doing their job &amp;mdash; with the expected limitation that weak signals are hard to detect.&lt;/p>
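&lt;p>BMA&amp;rsquo;s report card can be tallied mechanically from the PIPs quoted above. A small Python sketch (the three GDP-term PIPs are not printed numerically in this section, so 1.0 is assumed for them, consistent with the text&amp;rsquo;s statement that all three clear the 0.80 bar):&lt;/p>
&lt;pre>&lt;code class="language-python"># PIPs for the 8 true predictors as quoted in the text; the GDP-term
# PIPs are assumed equal to 1.0 for illustration
pips = {
    'ln_gdp': 1.0, 'ln_gdp_sq': 1.0, 'ln_gdp_cb': 1.0,   # assumed
    'fossil_fuel': 1.000, 'industry': 0.999, 'renewable': 0.959,
    'urban': 0.27, 'democracy': 0.02,
}
THRESHOLD = 0.80

flagged = {v for v, p in pips.items() if p > THRESHOLD}
false_negatives = sorted(set(pips) - flagged)
print(len(flagged), false_negatives)   # 6 flagged; urban and democracy missed
&lt;/code>&lt;/pre>
&lt;p>Six variables clear the threshold; urban and democracy are the two false negatives, matching the report card above.&lt;/p>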
&lt;h2 id="8-discussion">8. Discussion&lt;/h2>
&lt;h3 id="81-what-the-results-mean-for-the-ekc">8.1 What the results mean for the EKC&lt;/h3>
&lt;p>Both BMA and DSL identify the &lt;strong>inverted-N&lt;/strong> EKC shape with turning points close to the true DGP values. BMA correctly identifies 6 of 8 true predictors (3 GDP terms + fossil fuel, industry, renewable) with zero false positives among noise variables. The inverted-N shape implies three phases of the income&amp;ndash;pollution relationship:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (below ~\$2,400): Very poor countries where CO&lt;sub>2&lt;/sub> may fall as subsistence agriculture shifts toward slightly cleaner energy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rising phase&lt;/strong> (~\$2,400 to ~\$27,000): Industrializing countries where emissions rise sharply. Most of the world&amp;rsquo;s population lives here.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (above ~\$27,000): Wealthy countries where clean technology and regulation reduce emissions.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The policy implication is important: the inverted-N suggests that the &amp;ldquo;environmental improvement&amp;rdquo; phase is not automatic. Unlike the simpler inverted-U hypothesis, which predicts a single turning point after which pollution monotonically declines, the inverted-N warns that countries at very low income levels may &lt;em>already&lt;/em> be on a declining emissions path that reverses once industrialization begins. This makes the middle-income range &amp;mdash; where emissions rise steeply &amp;mdash; the critical window for environmental policy intervention.&lt;/p>
&lt;p>The three robust control variables identified by BMA reinforce this narrative:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Fossil fuel dependence&lt;/strong> (PIP = 1.000) is the single strongest predictor of CO&lt;sub>2&lt;/sub> emissions, with a coefficient close to the true DGP value.&lt;/li>
&lt;li>&lt;strong>Renewable energy share&lt;/strong> (PIP = 0.959) enters with a negative sign, confirming that energy mix transitions reduce emissions.&lt;/li>
&lt;li>&lt;strong>Industry value-added&lt;/strong> (PIP = 0.999) captures the composition effect &amp;mdash; economies dominated by manufacturing produce more CO&lt;sub>2&lt;/sub> per unit of GDP than service-based economies.&lt;/li>
&lt;/ul>
&lt;h3 id="82-when-to-use-bma-vs-post-double-selection">8.2 When to use BMA vs post-double-selection&lt;/h3>
&lt;p>The two methods answer fundamentally different research questions:&lt;/p>
&lt;p>&lt;strong>Use BMA&lt;/strong> when the question is &lt;em>&amp;ldquo;which variables robustly predict the outcome?&amp;rdquo;&lt;/em> BMA provides PIPs, coefficient densities, and a complete picture of the model space. It excels in exploratory settings where variable importance is the goal. In our simulation, BMA produced the most accurate coefficient estimates (&amp;ndash;7.139 vs true &amp;ndash;7.100) and provided rich diagnostics (PIP chart, density plots) that make the evidence for each variable transparent. The cost is computational: BMA requires MCMC sampling (minutes to hours depending on the model space).&lt;/p>
&lt;p>&lt;strong>Use post-double-selection&lt;/strong> when the question is &lt;em>&amp;ldquo;what is the causal effect of a specific variable of interest, controlling for high-dimensional confounders?&amp;rdquo;&lt;/em> DSL provides fast, valid inference on the coefficients of interest with standard errors and confidence intervals. It is designed for settings where you have a clear treatment variable and many potential controls. In our simulation, DSL completed in seconds and produced valid standard errors, but its coefficient estimates (&amp;ndash;7.433) were less accurate than BMA&amp;rsquo;s because LASSO had limited room to discriminate among controls in the FE-heavy panel setting.&lt;/p>
&lt;p>&lt;strong>Use both together&lt;/strong> (as in this tutorial) when you want the strongest possible evidence. If a Bayesian and a frequentist method agree on the sign, magnitude, and significance of an effect, the finding is unlikely to be an artifact of any single modeling choice. Disagreements between the methods are also informative &amp;mdash; they signal areas where the evidence is sensitive to assumptions.&lt;/p>
&lt;h3 id="83-pooled-vs-fixed-effects-a-cautionary-comparison">8.3 Pooled vs fixed effects: a cautionary comparison&lt;/h3>
&lt;p>The pooled specifications (Sections 5.7 and 6.6) provide a powerful pedagogical contrast. When we strip away fixed effects and run both BMA and DSL on pooled data, three things happen simultaneously:&lt;/p>
&lt;p>&lt;strong>LASSO selection improves but estimates worsen.&lt;/strong> Without 99 FE dummies diluting the candidate set, LASSO in pooled DSL selected only 5&amp;ndash;7 of 12 controls (vs 102 of 112 with FE). This is closer to the &amp;ldquo;textbook&amp;rdquo; LASSO scenario where the method has genuine discriminating power. Yet the resulting coefficient estimates are 2&amp;ndash;3x the true values because omitted country heterogeneity biases everything.&lt;/p>
&lt;p>&lt;strong>BMA PIPs become unreliable.&lt;/strong> With fixed effects, BMA assigned PIP near zero to all 7 noise variables &amp;mdash; zero false positives. Without FE, five variables that the FE-based BMA correctly kept out &amp;mdash; the noise controls services, pop_density, credit, and trade, plus a spuriously inflated democracy &amp;mdash; received PIPs above 0.80. These variables are correlated with the omitted country effects, and BMA interprets the spurious correlations as genuine predictive power. This demonstrates that &lt;strong>PIP thresholds are only meaningful when the model set is correctly specified&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Both methods agree on the bias.&lt;/strong> Pooled BMA and pooled DSL produce remarkably similar biased coefficients ($\beta_1 = -21.26$ vs $-22.03$), confirming that the problem is not the variable selection method but the omitted fixed effects. The agreement between a Bayesian and a frequentist method on the &lt;em>wrong&lt;/em> answer reinforces the lesson: &lt;strong>method agreement is not a substitute for correct model specification&lt;/strong>.&lt;/p>
&lt;p>The practical takeaway for applied researchers: in panel data settings, always include entity fixed effects (or equivalent controls for unobserved heterogeneity) before applying BMA or DSL. Running these methods on pooled data without FE will produce misleading results &amp;mdash; not because the methods fail, but because the models they average over or select from are all misspecified.&lt;/p>
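&lt;p>The mechanism behind this takeaway is easy to reproduce in miniature. The Python sketch below uses a deliberately simple hypothetical DGP (unrelated to the tutorial&amp;rsquo;s &lt;code>generate_data.do&lt;/code>) in which the regressor is correlated with unobserved country effects: the pooled slope is inflated well above the true value, while demeaning by country recovers it.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 80, 20, 1.0                       # 80 countries, 20 years
a = rng.normal(0.0, 2.0, size=N)               # unobserved country effects
x = a[:, None] + rng.normal(0.0, 1.0, size=(N, T))   # x correlated with a_i
y = beta * x + a[:, None] + rng.normal(0.0, 0.5, size=(N, T))

# Pooled OLS slope: the country effects load onto x and bias it upward
xf = x.ravel() - x.mean()
yf = y.ravel() - y.mean()
b_pooled = (xf @ yf) / (xf @ xf)

# Within (FE) slope: demeaning each country's series removes a_i first
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd ** 2).sum()

print(f'pooled: {b_pooled:.3f}  within: {b_within:.3f}  true: {beta}')
&lt;/code>&lt;/pre>
&lt;p>With this setup the pooled slope lands near 1.8 against a true value of 1.0, while the within estimate essentially recovers the truth &amp;mdash; the same qualitative pattern as the pooled vs FE columns above.&lt;/p>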
&lt;h3 id="84-limitations-and-caveats">8.4 Limitations and caveats&lt;/h3>
&lt;p>&lt;strong>Synthetic vs real data.&lt;/strong> This is synthetic data &amp;mdash; the patterns are sharper than real-world data, and we can verify ground truth only because we designed the DGP. With real data, model uncertainty is genuinely unresolvable, and there is no answer key to check against. The separation between true predictors and noise variables is cleaner here than in most applications.&lt;/p>
&lt;p>&lt;strong>Weak signals are hard to detect.&lt;/strong> Both methods missed urban population (PIP = 0.27) and democracy (PIP = 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). This is not a failure of the methods &amp;mdash; it is a fundamental statistical limitation. Detecting a coefficient of 0.005 in the presence of panel-level noise requires either a much larger sample or a stronger signal.&lt;/p>
&lt;p>&lt;strong>Panel FE and LASSO.&lt;/strong> In our panel setting, 99 of 112 candidate controls are FE dummies that LASSO retains almost entirely. This limits DSL&amp;rsquo;s ability to discriminate among the 12 candidate controls. In cross-sectional settings or settings with many genuinely irrelevant variables, DSL would have more room to operate and potentially match BMA&amp;rsquo;s accuracy.&lt;/p>
&lt;p>&lt;strong>Extensions.&lt;/strong> Researchers working with real EKC data should also consider endogeneity (via 2SLS-BMA, as in Gravina and Lanzafame, 2025), alternative pollutants (SO&lt;sub>2&lt;/sub>, PM2.5), spatial dependence across countries, and structural breaks in the income&amp;ndash;pollution relationship.&lt;/p>
&lt;h2 id="9-summary-and-next-steps">9. Summary and Next Steps&lt;/h2>
&lt;h3 id="takeaways">Takeaways&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both methods confirm the inverted-N shape.&lt;/strong> BMA (Bayesian, averaging across models) and post-double-selection (frequentist, LASSO-based) both recover the inverted-N EKC. BMA produces coefficients closest to the true DGP (&amp;ndash;7.139 vs &amp;ndash;7.100 for $\beta_1$). DSL with cluster-robust SEs gives &amp;ndash;7.433, falling between the sparse and kitchen-sink FE. Both methods outperform the naive sparse specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both methods recover the ground truth.&lt;/strong> BMA correctly identifies 6 of 8 true predictors with zero false positives. The three strongest true controls (fossil fuel, industry, renewable energy) all receive PIPs above 0.95. The two misses (urban, democracy) have small true coefficients, illustrating that even good methods have limits with weak signals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model uncertainty is real.&lt;/strong> The GDP linear coefficient shifts from &amp;ndash;7.498 (sparse) to &amp;ndash;7.131 (kitchen-sink) depending on which controls are included. The maximum turning point moves by \$2,000. BMA and DSL provide principled solutions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>BMA and post-double-selection serve different purposes.&lt;/strong> BMA excels at variable selection (PIPs, coefficient densities) and produced the most accurate coefficient estimates in this setting. Post-double-selection is fastest and provides standard frequentist inference with cluster-robust SEs. In panel settings dominated by FE dummies, LASSO has limited room to discriminate among candidate controls; DSL would be more powerful in cross-sectional settings with many irrelevant variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fixed effects are essential, not optional.&lt;/strong> Running either method on pooled data without FE produces coefficients inflated 2&amp;ndash;3x (BMA pooled: &amp;ndash;21.26, DSL pooled: &amp;ndash;22.03, vs true &amp;ndash;7.10 for $\beta_1$). Worse, pooled BMA assigns high PIPs to five variables that the FE-based BMA correctly keeps out. Confidence and credible intervals from pooled models fail to cover the true values for all three coefficients. The lesson: always include fixed effects in panel data before applying variable selection methods.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="exercises">Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity to the g-prior.&lt;/strong> Re-run &lt;code>bmaregress&lt;/code> with &lt;code>gprior(bric)&lt;/code> instead of &lt;code>gprior(uip)&lt;/code>. The BIC prior penalizes model complexity more heavily. Do the PIPs change? Does it still identify fossil fuel, industry, and renewable as robust? (&lt;em>Hint:&lt;/em> BIC priors tend to be more conservative, so borderline variables may drop below the threshold.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test for inverted-U.&lt;/strong> Drop &lt;code>ln_gdp_cb&lt;/code> and re-run with only linear and squared GDP terms. What do BMA and DSL say about the simpler quadratic specification? (&lt;em>Hint:&lt;/em> since the DGP includes a cubic term, the quadratic model is misspecified &amp;mdash; check whether the coefficients absorb the cubic effect or produce a visibly different EKC shape.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Increase noise.&lt;/strong> Re-generate the synthetic data with &lt;code>sigma_eps = 0.30&lt;/code> (double the noise) in &lt;code>generate_data.do&lt;/code> and re-run the full analysis. How does this affect BMA&amp;rsquo;s ability to distinguish true predictors from noise? (&lt;em>Hint:&lt;/em> expect more variables with PIPs in the ambiguous 0.3&amp;ndash;0.7 range, and possibly some noise variables crossing the 0.80 threshold &amp;mdash; false positives become more likely with noisier data.)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="appendix-a-first-differences-analysis">Appendix A: First-Differences Analysis&lt;/h2>
&lt;h3 id="a1-motivation">A.1 Motivation&lt;/h3>
&lt;p>The fixed effects estimator removes time-invariant country heterogeneity by demeaning each variable within country. An alternative approach is &lt;strong>first differencing&lt;/strong>: computing the change between the last and first year for each country ($\Delta x_i = x_{i,2014} - x_{i,1995}$). This also removes time-invariant effects and produces a pure &lt;strong>cross-sectional&lt;/strong> dataset of 80 observations &amp;mdash; one per country. The cross-sectional setting is where LASSO-based methods are most powerful, because there are no FE dummies diluting the candidate set.&lt;/p>
&lt;p>The tradeoff is statistical power: first differencing uses only two data points per country (discarding 18 intermediate years), while the within-estimator uses all 20. We expect noisier estimates but cleaner variable selection.&lt;/p>
&lt;h3 id="a2-constructing-the-first-difference-dataset">A.2 Constructing the first-difference dataset&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Keep only first (1995) and last (2014) years, reshape, compute differences
keep if year == 1995 | year == 2014
reshape wide $outcome $gdp_vars $controls, i(country_id) j(year)
foreach v in $outcome $gdp_vars $controls {
gen d_`v' = `v'2014 - `v'1995
}
&lt;/code>&lt;/pre>
&lt;p>This produces 80 observations, each representing how much a country&amp;rsquo;s variables changed over the 20-year period. For example, &lt;code>d_ln_gdp&lt;/code> measures the log growth in GDP per capita from 1995 to 2014.&lt;/p>
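&lt;p>For readers working outside Stata, the same construction is a pivot-and-subtract with pandas on a long-format frame. A minimal sketch with hypothetical values, mirroring the &lt;code>reshape&lt;/code>/&lt;code>gen&lt;/code> loop above for a single variable:&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

# Toy long-format panel with only the two endpoint years (hypothetical values)
df = pd.DataFrame({
    'country_id': [1, 1, 2, 2],
    'year':       [1995, 2014, 1995, 2014],
    'ln_gdp':     [7.0, 7.8, 8.1, 8.6],
})

# Pivot wide and difference the endpoints, one row per country
wide = df.pivot(index='country_id', columns='year', values='ln_gdp')
d_ln_gdp = wide[2014] - wide[1995]
print(d_ln_gdp.round(3).tolist())   # [0.8, 0.5] -- log growth per country
&lt;/code>&lt;/pre>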
&lt;h3 id="a3-baseline-ols-on-first-differences">A.3 Baseline OLS on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sparse: GDP terms only
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, robust
* Kitchen-sink: all 12 controls
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization, robust
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Sparse OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0009
R-squared = 0.1433
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -10.36189 4.092422 -2.53 0.013 -18.51265 -2.211121
d_ln_gdp_sq | 1.155962 .4223643 2.74 0.008 .3147506 1.997173
d_ln_gdp_cb | -.0414947 .0143721 -2.89 0.005 -.0701192 -.0128702
_cons | -.3036562 .0724366 -4.19 0.000 -.4479262 -.1593861
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Kitchen-sink OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0029
R-squared = 0.3707
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -8.109709 5.031758 -1.61 0.112 -18.1618 1.942382
d_ln_gdp_sq | .9238864 .5213262 1.77 0.081 -.1175823 1.965355
d_ln_gdp_cb | -.0336221 .0179583 -1.87 0.066 -.0694979 .0022536
d_fossil_f~l | .0147108 .0067313 2.19 0.033 .0012635 .0281582
d_renewable | -.0237808 .0110384 -2.15 0.035 -.0458327 -.001729
d_urban | .0002501 .014913 0.02 0.987 -.0295421 .0300424
d_industry | .0309085 .0105974 2.92 0.005 .0097377 .0520793
d_democracy | .019337 .0290345 0.67 0.508 -.038666 .07734
d_services | -.0047239 .0098816 -0.48 0.634 -.0244647 .0150169
d_trade | .006726 .0044062 1.53 0.132 -.0020764 .0155284
d_fdi | .0000124 .0091898 0.00 0.999 -.0183463 .0183712
d_credit | .0028644 .0043456 0.66 0.512 -.0058169 .0115457
d_pop_dens~y | .0006396 .0004991 1.28 0.205 -.0003575 .0016366
d_corruption | -.0036115 .0033497 -1.08 0.285 -.0103033 .0030803
d_globaliz~n | -.0004567 .0082494 -0.06 0.956 -.0169368 .0160235
_cons | -.0085823 .1746184 -0.05 0.961 -.3574226 .340258
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FD sparse OLS finds the inverted-N sign pattern with all three terms significant at the 5% level &amp;mdash; but the coefficients are noisier than the FE estimates (e.g., $\beta_1 = -10.36$ vs &amp;ndash;7.50 for sparse FE). The R² of 0.14 is low, reflecting the loss of within-country time-series variation when collapsing 20 years into a single difference.&lt;/p>
&lt;p>Adding controls in the kitchen-sink raises R² to 0.37 but makes the GDP terms individually insignificant (p = 0.07&amp;ndash;0.11) &amp;mdash; a consequence of having only 80 observations and 15 regressors. Among the controls, fossil fuel (p = 0.033), renewable energy (p = 0.035), and industry (p = 0.005) are significant &amp;mdash; the same three strong predictors identified by BMA with fixed effects.&lt;/p>
&lt;h3 id="a4-bma-on-first-differences">A.4 BMA on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 80
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 2,317
For CPMP &amp;gt;= .9 = 581
Priors: Mean model size = 3.304
Models: Uniform Burn-in = 5,000
Cons.: Noninformative MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.3080
g: Unit-information, g = 80 Shrinkage, g/(1+g) = 0.9877
sigma2: Noninformative Mean sigma2 = 0.051
Sampling correlation = 0.9958
------------------------------------------------------------------------------
d_ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
d_industry | .0364834 .0090778 7 .99823
------------------------------------------------------------------------------
Note: 14 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The FD-BMA result is dramatically different from the FE-based BMA. Only &lt;strong>one variable&lt;/strong> passes the 0.50 PIP display threshold: the change in industry share (PIP = 0.998). The three GDP polynomial terms all have PIPs below 0.30:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>PIP (FD-BMA)&lt;/th>
&lt;th>PIP (FE-BMA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>d_ln_gdp&lt;/td>
&lt;td>0.298&lt;/td>
&lt;td>0.994&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_sq&lt;/td>
&lt;td>0.267&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_cb&lt;/td>
&lt;td>0.271&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_fossil_fuel&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_renewable&lt;/td>
&lt;td>0.350&lt;/td>
&lt;td>0.959&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_urban&lt;/td>
&lt;td>0.096&lt;/td>
&lt;td>0.268&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_industry&lt;/td>
&lt;td>&lt;strong>0.998&lt;/strong>&lt;/td>
&lt;td>0.999&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_democracy&lt;/td>
&lt;td>0.094&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With only 80 cross-sectional observations, BMA&amp;rsquo;s evidence threshold is much harder to clear. The GDP terms &amp;mdash; which are &lt;em>the core of the EKC&lt;/em> &amp;mdash; do not survive because the 20-year differences are noisy and the cubic polynomial requires precise estimation of three correlated terms simultaneously.&lt;/p>
&lt;p>The change in industry share is the only variable with a strong enough signal-to-noise ratio to clear BMA&amp;rsquo;s bar. The FE-based BMA (N = 1,600) has 20 times as many observations to work with, which is why it identifies six robust variables.&lt;/p>
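&lt;p>To see mechanically where a PIP comes from, Raftery&amp;rsquo;s (1995) BIC approximation to Bayesian model averaging can be sketched in a few lines: enumerate every candidate model, weight each by $\exp(-\text{BIC}/2)$, and sum the normalized weights of the models that include a given variable. The toy sketch below (Python, three hypothetical candidates, only the first of which truly matters) illustrates the principle; it is not a replication of &lt;code>bmaregress&lt;/code>, which uses MC3 sampling and a Zellner $g$-prior:&lt;/p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 80
X = rng.standard_normal((n, 3))             # three candidate predictors
y = 1.0 * X[:, 0] + rng.standard_normal(n)  # only the first one matters

def bic(cols):
    # OLS with intercept; BIC = n*log(RSS/n) + k*log(n)
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + Z.shape[1] * np.log(n)

# Enumerate all 2^3 = 8 models and convert BICs to model weights
models = [m for r in range(4) for m in itertools.combinations(range(3), r)]
bics = np.array([bic(m) for m in models])
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()                                # approximate posterior model probs

# PIP of variable j = total weight of the models that contain j
pip = [sum(w[i] for i, m in enumerate(models) if j in m) for j in range(3)]
print([round(p, 3) for p in pip])
```

&lt;p>The genuine predictor ends up with a PIP near 1 while the irrelevant ones stay low; the same logic, run at scale, drives the PIP column in the &lt;code>bmaregress&lt;/code> output.&lt;/p>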
&lt;h3 id="a5-dsl-on-first-differences">A.5 DSL on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, ///
controls(d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization) ///
rseed(9988)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 80
Number of controls = 12
Number of selected controls = 1
Wald chi2(3) = 10.65
Prob &amp;gt; chi2 = 0.0138
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -5.047196 4.558593 -1.11 0.268 -13.98187 3.887483
d_ln_gdp_sq | .5943786 .4700569 1.26 0.206 -.326916 1.515673
d_ln_gdp_cb | -.0220809 .0160386 -1.38 0.169 -.0535159 .0093541
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
d_ln_co2 | linear plugin .3818852 1
d_ln_gdp | linear plugin .3818852 0
d_ln_gdp_sq | linear plugin .3818852 0
d_ln_gdp_cb | linear plugin .3818852 0
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>FD-DSL selected only &lt;strong>1 control&lt;/strong> for the outcome equation (likely d_industry, consistent with BMA) and &lt;strong>zero controls&lt;/strong> for each of the three GDP equations. With such sparse selection, the final OLS is essentially a regression of d_ln_co2 on the three GDP terms plus one control &amp;mdash; and none of the three GDP terms is individually significant (p = 0.17&amp;ndash;0.27). The joint Wald test does reject at the 5% level (p = 0.014), so the GDP terms collectively retain some explanatory power, but the individual estimates are too noisy for inference.&lt;/p>
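&lt;p>The double-selection logic itself is simple to sketch: run a LASSO of the outcome on the candidate controls, a LASSO of each treatment variable on the controls, take the &lt;em>union&lt;/em> of everything selected, and refit OLS with that union. Below is a minimal Python illustration on simulated data, using scikit-learn&amp;rsquo;s &lt;code>Lasso&lt;/code> with a hand-picked penalty rather than the plugin $\lambda$ of &lt;code>dsregress&lt;/code>, and a single treatment instead of three for brevity:&lt;/p>

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))             # candidate controls
d = 0.5 * X[:, 0] + rng.standard_normal(n)  # treatment; control 0 is a confounder
y = 1.0 * d + 2.0 * X[:, 0] + rng.standard_normal(n)  # true effect of d is 1.0

# Step 1: LASSO of the outcome on the controls
sel_y = set(np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_))
# Step 2: LASSO of the treatment on the controls
sel_d = set(np.flatnonzero(Lasso(alpha=0.1).fit(X, d).coef_))
# Step 3: post-selection OLS on the union of selected controls
union = sorted(sel_y | sel_d)
Z = np.column_stack([np.ones(n), d] + [X[:, j] for j in union])
theta = np.linalg.lstsq(Z, y, rcond=None)[0][1]  # coefficient on d
print(union, round(theta, 3))
```

&lt;p>Taking the union is the &amp;ldquo;double&amp;rdquo; in double selection: a control that strongly predicts the treatment but only weakly predicts the outcome (or vice versa) still enters the final regression, which is what protects the inference on the treatment coefficients.&lt;/p>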
&lt;h3 id="a6-comparison-first-differences-vs-fixed-effects">A.6 Comparison: first differences vs fixed effects&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>FD Sparse&lt;/th>
&lt;th>FD Kitchen&lt;/th>
&lt;th>FD BMA&lt;/th>
&lt;th>FD DSL&lt;/th>
&lt;th>FE BMA&lt;/th>
&lt;th>FE DSL&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;10.362&lt;/td>
&lt;td>&amp;ndash;8.110&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;5.047&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>1.156&lt;/td>
&lt;td>0.924&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>0.594&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.041&lt;/td>
&lt;td>&amp;ndash;0.034&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;0.022&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GDP terms robust?&lt;/strong>&lt;/td>
&lt;td>Yes (p &amp;lt; 0.05)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>No&lt;/strong> (PIP &amp;lt; 0.30)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong> (PIP &amp;gt; 0.99)&lt;/td>
&lt;td>Yes (p &amp;lt; 0.001)&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Controls selected&lt;/strong>&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>6 of 12&lt;/td>
&lt;td>102 of 112&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$1,913&lt;/td>
&lt;td>\$1,465&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$987&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$60,817&lt;/td>
&lt;td>\$61,655&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$62,983&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> FD-BMA posterior means for the GDP terms are heavily shrunk toward zero (because their PIPs are ~0.27&amp;ndash;0.30), so we report &amp;ldquo;n/a&amp;rdquo; rather than misleading point estimates.&lt;/p>
&lt;/blockquote>
&lt;p>The comparison reveals a stark trade-off between the two identification strategies:&lt;/p>
&lt;p>&lt;strong>Fixed effects win on accuracy.&lt;/strong> The FE-based estimates are close to the true DGP values, with BMA (FE) achieving the best accuracy ($\beta_1 = -7.139$ vs true &amp;ndash;7.100). The FD estimates are noisier: FD-sparse overshoots ($\beta_1 = -10.36$), while FD-DSL undershoots (&amp;ndash;5.05). The FD turning points are wildly inaccurate &amp;mdash; the maximum turning point is \$61,000&amp;ndash;63,000 in first differences vs \$27,000 with FE (true: \$34,647).&lt;/p>
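&lt;p>The turning points in the table follow mechanically from the cubic coefficients: setting the derivative of $\beta_1 x + \beta_2 x^2 + \beta_3 x^3$ (with $x = \ln \text{GDP}$) to zero gives $\beta_1 + 2\beta_2 x + 3\beta_3 x^2 = 0$, and exponentiating the two roots converts them back to dollar turning points. A quick Python check with the true-DGP coefficients from the table:&lt;/p>

```python
import numpy as np

def turning_points(b1, b2, b3):
    # Solve b1 + 2*b2*x + 3*b3*x**2 = 0, where x = ln(GDP per capita),
    # then map the roots back to dollar levels with exp()
    roots = np.roots([3 * b3, 2 * b2, b1])
    return np.exp(np.sort(roots.real))

# True-DGP cubic coefficients from the comparison table
tp_min, tp_max = turning_points(-7.100, 0.810, -0.030)
print(round(tp_min), round(tp_max))  # approximately 1,895 and 34,647
```

&lt;p>Because the dollar values are exponential in the roots, small perturbations of the three coefficients shift the turning points by thousands of dollars, which is exactly why the noisy FD estimates imply turning points so far from the FE ones.&lt;/p>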
&lt;p>&lt;strong>First differences struggle with the cubic polynomial.&lt;/strong> Estimating a cubic EKC requires precise measurement of three highly correlated terms ($\ln GDP$, $(\ln GDP)^2$, $(\ln GDP)^3$). With only 80 observations (one 20-year change per country), the multicollinearity among differenced GDP terms is severe. Both BMA and DSL respond rationally: BMA gives all three terms PIPs below 0.30, and DSL selects zero controls for the GDP equations. Neither method &amp;ldquo;trusts&amp;rdquo; the cubic specification in this small sample.&lt;/p>
&lt;p>&lt;strong>Industry is the strongest cross-sectional signal.&lt;/strong> Both FD-BMA (PIP = 0.998) and FD-DSL (selected as the sole control) identify the change in industry share as the most important cross-sectional predictor of CO&lt;sub>2&lt;/sub> change. This makes economic sense: countries that industrialized the most over 1995&amp;ndash;2014 also increased their emissions the most, regardless of their income trajectory.&lt;/p>
&lt;p>&lt;strong>Practical implication.&lt;/strong> First differences are appropriate when the research question is about &lt;em>long-run changes&lt;/em> rather than &lt;em>levels&lt;/em>. But for testing the EKC cubic shape, the panel FE approach is far more powerful because it uses all 1,600 observations rather than collapsing to 80. The FD analysis confirms that the inverted-N result in the main body is robust to the identification strategy in spirit (the signs are correct in FD-sparse OLS), but the magnitudes and statistical power are substantially weaker.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1016/j.eneco.2025.108649" target="_blank" rel="noopener">Gravina, A. F. &amp;amp; Lanzafame, M. (2025). What&amp;rsquo;s your shape? Bayesian model averaging and double machine learning for the Environmental Kuznets Curve. &lt;em>Energy Economics&lt;/em>, 108649.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.623" target="_blank" rel="noopener">Fernandez, C., Ley, E., &amp;amp; Steel, M. F. J. (2001). Model uncertainty in cross-country growth regressions. &lt;em>Journal of Applied Econometrics&lt;/em>, 16(5), 563&amp;ndash;576.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdt044" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., &amp;amp; Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1997.10473615" target="_blank" rel="noopener">Raftery, A. E., Madigan, D., &amp;amp; Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. &lt;em>Journal of the American Statistical Association&lt;/em>, 92(437), 179&amp;ndash;191.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery, A. E. (1995). Bayesian model selection in social research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>bmaregress&lt;/code> &amp;mdash; Bayesian Model Averaging regression&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>dsregress&lt;/code> &amp;mdash; Double-Selection LASSO linear regression&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Visualizing Regression with the FWL Theorem in R</title><link>https://carlos-mendez.org/post/r_fwlplot/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_fwlplot/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This is perhaps the most common question in applied regression &amp;mdash; and one of the hardest to answer intuitively. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship that lives in multidimensional space and cannot be directly plotted on a 2D scatter plot. Or can it?&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> provides the answer. It says that the coefficient on any variable in a multiple regression equals the slope from a simple bivariate regression &amp;mdash; after first &amp;ldquo;partialling out&amp;rdquo; the other variables from both the outcome and the variable of interest. Partialling out means regressing a variable on the controls and keeping only the leftover (residual) variation &amp;mdash; the part that the controls cannot explain. This means we &lt;em>can&lt;/em> visualize any regression coefficient as a 2D scatter plot, as long as we first remove the influence of the controls from both axes.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">fwlplot&lt;/a> R package (Butts &amp;amp; McDermott, 2024) turns this into a one-liner. It uses the same formula syntax as &lt;a href="https://lrberge.github.io/fixest/reference/feols.html" target="_blank" rel="noopener">&lt;code>fixest::feols()&lt;/code>&lt;/a> &amp;mdash; including the &lt;code>|&lt;/code> operator for fixed effects &amp;mdash; and produces a scatter plot of the residualized data with the regression line overlaid. The result is a visual answer to &amp;ldquo;what does controlling for X look like?&amp;rdquo;&lt;/p>
&lt;p>This tutorial builds intuition progressively. We start with simulated data where we &lt;em>know&lt;/em> the true effect, show how confounding creates a misleading picture, and use &lt;code>fwl_plot()&lt;/code> to reveal the truth. We then extend to real data with high-dimensional fixed effects &amp;mdash; first flights data (controlling for origin and destination airports) and then panel wage data (controlling for unobserved individual ability).&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>State the FWL theorem and explain its geometric intuition&lt;/li>
&lt;li>Use &lt;code>fwl_plot()&lt;/code> to visualize a bivariate relationship before and after controlling for confounders&lt;/li>
&lt;li>Demonstrate that manual FWL residualization reproduces &lt;code>feols()&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize what fixed effects &amp;ldquo;do&amp;rdquo; to data by comparing raw vs. residualized scatter plots&lt;/li>
&lt;li>Apply &lt;code>fwl_plot()&lt;/code> to real panel data with high-dimensional fixed effects&lt;/li>
&lt;li>Connect FWL to omitted variable bias and Simpson&amp;rsquo;s paradox&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Simulated&amp;lt;br/&amp;gt;Data&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;fwl_plot()&amp;lt;br/&amp;gt;Naive vs. FWL&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights Data&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;ggplot2&amp;lt;br/&amp;gt;&amp;amp; Recipe&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>fwl_plot()&lt;/code> first, then peek under the hood with manual FWL verification. From there we apply the same one-liner to increasingly complex real-world settings.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-and-load-packages">3.1 Install and load packages&lt;/h3>
&lt;pre>&lt;code class="language-r"># Install packages if needed
cran_packages &amp;lt;- c(&amp;quot;fwlplot&amp;quot;, &amp;quot;fixest&amp;quot;, &amp;quot;ggplot2&amp;quot;, &amp;quot;patchwork&amp;quot;,
&amp;quot;nycflights13&amp;quot;, &amp;quot;wooldridge&amp;quot;)
missing &amp;lt;- cran_packages[!sapply(cran_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) install.packages(missing)
library(fwlplot)
library(fixest)
library(ggplot2)
library(patchwork)
library(nycflights13)
library(wooldridge)
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>fwlplot&lt;/code> package provides the &lt;code>fwl_plot()&lt;/code> function for FWL-residualized scatter plots. It is built on &lt;code>fixest&lt;/code>, which handles the residualization computation using fast demeaning algorithms. The &lt;code>patchwork&lt;/code> package lets us combine multiple ggplot2 plots side by side. The &lt;code>nycflights13&lt;/code> and &lt;code>wooldridge&lt;/code> packages provide the real datasets we will use later.&lt;/p>
&lt;h3 id="32-simulated-confounding-data">3.2 Simulated confounding data&lt;/h3>
&lt;p>To build intuition, we simulate a retail scenario where a store manager wants to know whether distributing coupons increases sales. The catch: &lt;strong>income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales. This creates a spurious negative correlation between coupons and sales, even though coupons genuinely boost sales.&lt;/p>
&lt;p>The causal structure looks like this:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income opens a &amp;ldquo;backdoor path&amp;rdquo; from coupons to sales: coupons ← income → sales. Unless we block this path by controlling for income, the naive estimate will be biased. The data generating process is:&lt;/p>
&lt;p>$$\text{income} \sim N(50, 10)$$&lt;/p>
&lt;p>$$\text{coupons} = 60 - 0.5 \times \text{income} + \epsilon_1, \quad \epsilon_1 \sim N(0, 5)$$&lt;/p>
&lt;p>$$\text{sales} = 10 + 0.2 \times \text{coupons} + 0.3 \times \text{income} + 0.5 \times \text{dayofweek} + \epsilon_2, \quad \epsilon_2 \sim N(0, 3)$$&lt;/p>
&lt;p>In words, the true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>: each additional coupon increases sales by 0.2 units. But because income negatively drives coupons ($-0.5$) and positively drives sales ($+0.3$), a naive regression of sales on coupons alone will confound the coupon effect with the income effect, producing a biased estimate. The day-of-week term ($+0.5$) is drawn independently of income and coupons, so it adds noise but no bias; it returns as an extra control in Section 5.4. The noise terms $\epsilon_1$ and $\epsilon_2$ correspond to the &lt;code>rnorm()&lt;/code> calls in the code below.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(42)
n &amp;lt;- 200
income &amp;lt;- rnorm(n, mean = 50, sd = 10)
dayofweek &amp;lt;- sample(1:7, n, replace = TRUE)
coupons &amp;lt;- 60 - 0.5 * income + rnorm(n, 0, 5)
sales &amp;lt;- 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rnorm(n, 0, 3)
store_data &amp;lt;- data.frame(
sales = round(sales, 2),
coupons = round(coupons, 2),
income = round(income, 2),
dayofweek = dayofweek
)
head(store_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income dayofweek
1 40.02 27.79 63.71 4
2 31.37 34.03 44.35 5
3 31.30 28.01 53.63 6
4 34.37 28.68 56.33 4
5 42.62 35.91 54.04 5
6 39.50 33.45 48.94 4
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">round(cor(store_data[, c(&amp;quot;sales&amp;quot;, &amp;quot;coupons&amp;quot;, &amp;quot;income&amp;quot;)]), 3)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> sales coupons income
sales 1.000 -0.166 0.500
coupons -0.166 1.000 -0.709
income 0.500 -0.709 1.000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true causal effect is positive (+0.2). This is because income is strongly negatively correlated with coupons (-0.709) and strongly positively correlated with sales (0.500). A naive analysis would conclude that coupons hurt sales &amp;mdash; a classic instance of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>, where the direction of an association reverses when a confounding variable is accounted for.&lt;/p>
&lt;h2 id="4-fwl_plot-in-action-naive-vs-controlled">4. fwl_plot() in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest way to see why confounding is dangerous: plot the raw relationship with &lt;code>fwl_plot()&lt;/code>. When no controls are specified, &lt;code>fwl_plot()&lt;/code> produces a standard scatter plot with a regression line:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.019$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The store manager who trusts this analysis would cancel the coupon program, losing real revenue.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-line-of-code">4.2 Controlling for income: one line of code&lt;/h3>
&lt;p>Now watch what happens when we add &lt;code>income&lt;/code> as a control &amp;mdash; just add it to the formula:&lt;/p>
&lt;pre>&lt;code class="language-r">fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$) &amp;mdash; close to the true value of +0.2. The &lt;code>fwl_plot()&lt;/code> function residualized both coupons and sales on income behind the scenes, then plotted the residuals. The figure below shows both panels side by side:&lt;/p>
&lt;p>&lt;img src="r_fwlplot_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope; after FWL residualization on income (right), the slope reverses to positive">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales (a downward slope). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes. Once income is partialled out, the true positive effect of coupons emerges clearly. This is what &amp;ldquo;controlling for income&amp;rdquo; looks like geometrically &amp;mdash; and &lt;code>fwl_plot()&lt;/code> produces it in a single line.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>The &lt;code>fixest::feols()&lt;/code> function produces the same coefficient, confirmed by &lt;code>etable()&lt;/code> for side-by-side comparison:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_naive &amp;lt;- feols(sales ~ coupons, data = store_data)
fe_full &amp;lt;- feols(sales ~ coupons + income, data = store_data)
etable(fe_naive, fe_full, headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;Controlled&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full
Naive Controlled
Dependent Var.: sales sales
Constant 36.93*** (1.397) 11.34*** (3.008)
coupons -0.0934* (0.0393) 0.2123*** (0.0467)
income 0.3004*** (0.0325)
_______________ _________________ __________________
S.E. type IID IID
Observations 200 200
R2 0.02768 0.32148
Adj. R2 0.02277 0.31459
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3. Every number in this table corresponds to a visual feature of the &lt;code>fwl_plot()&lt;/code> scatter plots above.&lt;/p>
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be stated as a simple recipe. Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group. Similarly, FWL compares how many more or fewer coupons a store had &lt;em>for its income level&lt;/em> against how much higher or lower its sales were &lt;em>for its income level&lt;/em>.&lt;/p>
&lt;p>The three steps are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Regress sales on income&lt;/strong>, save the residuals (the part of sales that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress coupons on income&lt;/strong>, save the residuals (the part of coupons that income cannot explain)&lt;/li>
&lt;li>&lt;strong>Regress the sales residuals on the coupon residuals&lt;/strong> &amp;mdash; the slope is the coupon coefficient&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-r"># Step 1: Residualize sales on income
resid_y &amp;lt;- resid(lm(sales ~ income, data = store_data))
# Step 2: Residualize coupons on income
resid_x &amp;lt;- resid(lm(coupons ~ income, data = store_data))
# Step 3: Regress residuals on residuals
fwl_manual &amp;lt;- lm(resid_y ~ resid_x)
# Compare coefficients
cat(&amp;quot;feols coefficient: &amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;], 6), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Manual FWL coefficient:&amp;quot;, round(coef(fwl_manual)[&amp;quot;resid_x&amp;quot;], 6), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">feols coefficient: 0.212288
Manual FWL coefficient: 0.212288
&lt;/code>&lt;/pre>
&lt;p>The coefficients match to six decimal places. This is not an approximation &amp;mdash; it is an exact algebraic identity. Every time you run a multiple regression, the software is implicitly performing these three steps for each coefficient.&lt;/p>
&lt;h3 id="52-the-formal-theorem">5.2 The formal theorem&lt;/h3>
&lt;p>For those who want the math, the FWL theorem states that in the regression $Y = X_1 \beta_1 + X_2 \beta_2 + \epsilon$, the coefficient $\hat{\beta}_1$ equals:&lt;/p>
&lt;p>$$\hat{\beta}_1 = (\tilde{X}_1' \tilde{X}_1)^{-1} \tilde{X}_1' \tilde{Y}, \quad \text{where} \quad \tilde{Y} = M_{X_2} Y, \quad \tilde{X}_1 = M_{X_2} X_1$$&lt;/p>
&lt;p>Here $M_{X_2} = I - X_2(X_2'X_2)^{-1}X_2'$ is the &amp;ldquo;residual-maker&amp;rdquo; matrix that projects out the effect of $X_2$. In our example, $Y$ is &lt;code>sales&lt;/code>, $X_1$ is &lt;code>coupons&lt;/code>, and $X_2$ is &lt;code>income&lt;/code>. The tilded variables $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from the &lt;code>resid()&lt;/code> calls above.&lt;/p>
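&lt;p>The residual-maker algebra can be verified numerically. The sketch below (Python with numpy, on freshly simulated data mimicking the store example rather than the R dataset) builds $M_{X_2}$ explicitly and checks that the residual-on-residual slope equals the multiple-regression coefficient:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x2 = rng.normal(50, 10, n)                # "income" (control)
x1 = 60 - 0.5 * x2 + rng.normal(0, 5, n)  # "coupons" (treatment)
y = 10 + 0.2 * x1 + 0.3 * x2 + rng.normal(0, 3, n)

# Full regression: y on [1, x1, x2]
Z = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(Z, y, rcond=None)[0]

# Residual-maker for X2 = [1, x2]: M = I - X2 (X2'X2)^{-1} X2'
X2 = np.column_stack([np.ones(n), x2])
M = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
y_til, x1_til = M @ y, M @ x1             # residualized y and x1

beta_fwl = (x1_til @ y_til) / (x1_til @ x1_til)  # slope of y_til on x1_til
print(beta_full[1], beta_fwl)             # identical by FWL
```

&lt;p>Forming the $n \times n$ matrix explicitly is only for illustration; as noted in Section 3.1, &lt;code>fixest&lt;/code> obtains the same residuals with fast demeaning algorithms instead.&lt;/p>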
&lt;h3 id="53-omitted-variable-bias-predicting-the-error">5.3 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding we saw is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly. When we omit income from the regression, the bias on the coupon coefficient is:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$) multiplied by the relationship between the omitted variable and the treatment ($\hat{\delta}$). Here $\hat{\gamma}$ is the effect of income on sales (in the full model) and $\hat{\delta}$ is the coefficient from the auxiliary regression of the &lt;em>omitted&lt;/em> variable on the &lt;em>included&lt;/em> one &amp;mdash; income on coupons, not the other way around.&lt;/p>
&lt;pre>&lt;code class="language-r">gamma_hat &amp;lt;- coef(fe_full)[&amp;quot;income&amp;quot;] # 0.3004
delta_hat &amp;lt;- coef(lm(income ~ coupons, data = store_data))[&amp;quot;coupons&amp;quot;] # -1.0177
ovb &amp;lt;- gamma_hat * delta_hat # -0.3057
cat(&amp;quot;OVB = gamma * delta:&amp;quot;, round(ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Naive = True + OVB:&amp;quot;, round(coef(fe_full)[&amp;quot;coupons&amp;quot;] + ovb, 4), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Actual naive:&amp;quot;, round(coef(fe_naive)[&amp;quot;coupons&amp;quot;], 4), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = gamma * delta: -0.3057
Naive = True + OVB: -0.0934
Actual naive: -0.0934
&lt;/code>&lt;/pre>
&lt;p>The OVB formula predicts a bias of -0.306: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the steep negative auxiliary slope of income on coupons ($\hat{\delta} = -1.018$) produces a large negative bias. The predicted naive coefficient (true + bias = 0.212 + (-0.306) = -0.093) matches the actual naive coefficient (-0.093) &amp;mdash; with a single included regressor, the OVB decomposition is an exact in-sample identity, not an approximation. The key insight: the bias is &lt;em>predictable&lt;/em>. If you know the direction of the confounder&amp;rsquo;s effects on both the treatment and the outcome, you know which way the naive estimate is biased.&lt;/p>
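&lt;p>The decomposition is pure algebra, so it holds in any language. A minimal Python sketch (synthetic data, hypothetical coefficients) verifies that the short-regression slope equals the long-regression slope plus $\hat{\gamma} \times \hat{\delta}$, with $\hat{\delta}$ taken from the auxiliary regression of the omitted variable on the included one:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
z = rng.normal(0, 1, n)                      # omitted confounder (income-like)
x = -0.5 * z + rng.normal(0, 1, n)           # treatment (coupons-like)
y = 0.2 * x + 0.3 * z + rng.normal(0, 1, n)  # outcome (sales-like)

def slope(a, b):
    """Simple-regression slope of b on a, intercept included."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c) / float(a_c @ a_c)

X = np.column_stack([np.ones(n), x, z])
beta_long = np.linalg.lstsq(X, y, rcond=None)[0]   # [const, beta_x, gamma]

beta_short = slope(x, y)   # naive slope, z omitted
delta = slope(x, z)        # auxiliary regression: omitted z on included x
# Exact in-sample identity: beta_short == beta_x + gamma * delta
print(round(beta_short, 4), round(beta_long[1] + beta_long[2] * delta, 4))
```

&lt;p>Both printed numbers agree to machine precision, whatever the random draw &amp;mdash; the identity does not depend on sample size or on the true coefficients.&lt;/p>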
&lt;h3 id="54-adding-more-controls">5.4 Adding more controls&lt;/h3>
&lt;p>The FWL theorem extends naturally to any number of controls. The &lt;code>fwl_plot()&lt;/code> call handles it automatically:&lt;/p>
&lt;pre>&lt;code class="language-r">fe_full3 &amp;lt;- feols(sales ~ coupons + income + dayofweek, data = store_data)
etable(fe_naive, fe_full, fe_full3,
headers = c(&amp;quot;Naive&amp;quot;, &amp;quot;+ Income&amp;quot;, &amp;quot;+ Income + Day&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_naive fe_full fe_full3
Naive + Income + Income + Day
Dependent Var.: sales sales sales
Constant 36.93*** (1.397) 11.34*** (3.008) 9.640** (2.953)
coupons -0.0934* (0.0393) 0.2123*** (0.0467) 0.2219*** (0.0454)
income 0.3004*** (0.0325) 0.2961*** (0.0316)
dayofweek 0.4029*** (0.1095)
_______________ _________________ __________________ __________________
S.E. type IID IID IID
Observations 200 200 200
R2 0.02768 0.32148 0.36535
Adj. R2 0.02277 0.31459 0.35564
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig2_fwl_verification.png" alt="Three-panel FWL progression: no controls (left), controlling for income (center), controlling for income + day of week (right)">&lt;/p>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared jumps from 0.028 to 0.365 as we add controls. Each &lt;code>fwl_plot()&lt;/code> panel shows a tighter cloud as more variation is absorbed by the controls &amp;mdash; the residualized scatter becomes more focused on the &lt;em>coupon-specific&lt;/em> variation in sales.&lt;/p>
&lt;h2 id="6-visualizing-fixed-effects">6. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="61-what-are-fixed-effects">6.1 What are fixed effects?&lt;/h3>
&lt;p>Fixed effects are a special case of the FWL theorem applied to group dummy variables. When we include airport fixed effects in a regression, we are &amp;ldquo;partialling out&amp;rdquo; airport-specific means &amp;mdash; in other words, &lt;strong>demeaning&lt;/strong>. Demeaning means subtracting each group&amp;rsquo;s average from every observation in that group. The result is that we compare each airport to &lt;em>itself&lt;/em> rather than comparing different airports to each other.&lt;/p>
&lt;p>Think of it like a race handicap. Raw times compare runners who started at different positions. Demeaning each runner&amp;rsquo;s times converts them to &amp;ldquo;how much faster or slower than their personal average,&amp;rdquo; making the comparison fair. The FWL theorem guarantees that this demeaning procedure produces the same coefficients as including a full set of dummy variables in the regression.&lt;/p>
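&lt;p>That guarantee can be made concrete. The following Python sketch (simulated groups with made-up effect sizes) runs the same regression twice &amp;mdash; once with an explicit set of group dummies, once after within-group demeaning &amp;mdash; and recovers the identical slope, as the FWL theorem promises:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, per = 5, 40
g = np.repeat(np.arange(n_groups), per)       # group ("airport") labels
alpha = rng.normal(0, 3, n_groups)[g]         # hypothetical group fixed effects
x = rng.normal(0, 1, n_groups * per) + alpha  # x correlated with the groups
y = 2.0 * x + alpha + rng.normal(0, 1, n_groups * per)  # true slope = 2.0

# Route 1: include a full set of group dummies (they span the intercept)
D = (g[:, None] == np.arange(n_groups)[None, :]).astype(float)
beta_dummy = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# Route 2: subtract each group's mean from x and y, then simple OLS
def demean(v, g, k):
    means = np.bincount(g, weights=v, minlength=k) / np.bincount(g, minlength=k)
    return v - means[g]

x_d, y_d = demean(x, g, n_groups), demean(y, g, n_groups)
beta_within = float(x_d @ y_d) / float(x_d @ x_d)

print(round(beta_dummy, 6), round(beta_within, 6))  # equal, by FWL
```

&lt;p>Route 2 never forms a dummy matrix, which is why demeaning scales to thousands of groups.&lt;/p>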
&lt;h3 id="62-flights-data-progressive-fixed-effects">6.2 Flights data: progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>nycflights13&lt;/code> dataset contains all domestic flights from New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013. We ask: what is the relationship between air time and departure delay?&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;flights&amp;quot;, package = &amp;quot;nycflights13&amp;quot;)
flights_clean &amp;lt;- flights[complete.cases(flights[, c(&amp;quot;dep_delay&amp;quot;, &amp;quot;air_time&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;dest&amp;quot;)]), ]
flights_clean &amp;lt;- flights_clean[flights_clean$dep_delay &amp;lt; 120 &amp;amp; flights_clean$dep_delay &amp;gt; -30, ]
# Remove singleton origin-dest combos for stable FE estimation
od_counts &amp;lt;- table(paste(flights_clean$origin, flights_clean$dest))
flights_clean &amp;lt;- flights_clean[paste(flights_clean$origin, flights_clean$dest) %in%
names(od_counts[od_counts &amp;gt; 1]), ]
cat(&amp;quot;Observations:&amp;quot;, nrow(flights_clean), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 317578
&lt;/code>&lt;/pre>
&lt;p>We sample 5,000 flights for plotting to avoid overplotting (alternatively, keep the full data and pass &lt;code>n_sample&lt;/code> to &lt;code>fwl_plot()&lt;/code>, which samples only the plotted points while fitting the line on all observations; the regression tables in Section 6.3 use the full cleaned data):&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(123)
flights_sample &amp;lt;- flights_clean[sample(nrow(flights_clean), 5000), ]
&lt;/code>&lt;/pre>
&lt;p>Now the power of &lt;code>fwl_plot()&lt;/code> &amp;mdash; three one-liners that progressively add fixed effects. In &lt;code>fixest&lt;/code> syntax, the &lt;code>|&lt;/code> operator separates regular covariates (left) from fixed effects (right), so &lt;code>dep_delay ~ air_time | origin + dest&lt;/code> means &amp;ldquo;regress departure delay on air time, with origin and destination fixed effects&amp;rdquo;:&lt;/p>
&lt;pre>&lt;code class="language-r"># No fixed effects
fwl_plot(dep_delay ~ air_time, data = flights_sample, ggplot = TRUE)
# Origin airport FE
fwl_plot(dep_delay ~ air_time | origin, data = flights_sample, ggplot = TRUE)
# Origin + destination FE
fwl_plot(dep_delay ~ air_time | origin + dest, data = flights_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig3_fixed_effects.png" alt="Progressive FWL plots: no FE (left), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>The visual transformation is striking. Panel A (no FE) shows a vague cloud with a nearly flat slope. Panel B (origin FE) removes the three origin-airport means, tightening the horizontal spread. Panel C (origin + destination FE) removes the 103 destination means as well, collapsing the air-time variation to &lt;em>within-route&lt;/em> deviations.&lt;/p>
&lt;h3 id="63-comparing-regression-tables">6.3 Comparing regression tables&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_none &amp;lt;- feols(dep_delay ~ air_time, data = flights_clean)
fe_origin &amp;lt;- feols(dep_delay ~ air_time | origin, data = flights_clean)
fe_both &amp;lt;- feols(dep_delay ~ air_time | origin + dest, data = flights_clean)
etable(fe_none, fe_origin, fe_both,
headers = c(&amp;quot;No FE&amp;quot;, &amp;quot;Origin FE&amp;quot;, &amp;quot;Origin + Dest FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_none fe_origin fe_both
No FE Origin FE Origin + Dest FE
Dependent Var.: dep_delay dep_delay dep_delay
air_time -0.0031*** (0.0004) -0.0061*** (0.0005) -0.0067. (0.0034)
Fixed-Effects: ------------------- ------------------- -----------------
origin No Yes Yes
dest No No Yes
_______________ ___________________ ___________________ _________________
Observations 317,578 317,578 317,578
R2 0.00016 0.00594 0.01296
Within R2 -- 0.00058 1.19e-5
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.003 (no FE), -0.006 (origin FE), -0.007 (origin + destination FE, significant at the 10% level only &amp;mdash; the &lt;code>.&lt;/code> marker indicates $p &amp;lt; 0.10$). The residualized scatter in Panel C answers a sharper question: &amp;ldquo;For flights on the &lt;em>same route&lt;/em>, does longer-than-usual air time predict higher-than-usual departure delay?&amp;rdquo; The answer is weakly negative &amp;mdash; on a given route, flights with longer-than-usual air times show slightly &lt;em>lower&lt;/em> departure delays, possibly because delayed flights fly faster to make up lost time, shortening their air time.&lt;/p>
&lt;h2 id="7-panel-data-returns-to-experience">7. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="71-the-wage-panel">7.1 The wage panel&lt;/h3>
&lt;p>The &lt;code>wagepan&lt;/code> dataset from the Wooldridge textbook contains panel data on 545 individuals observed over 8 years (1980&amp;ndash;1987). A classic question in labor economics is: what is the return to experience?&lt;/p>
&lt;p>The challenge is &lt;strong>unobserved ability&lt;/strong>. Two people with 5 years of experience may earn very different wages because one is more talented, motivated, or well-connected. These personal traits &amp;mdash; which we cannot directly measure &amp;mdash; are the &amp;ldquo;unobserved ability&amp;rdquo; that creates omitted variable bias. More talented workers earn higher wages &lt;em>and&lt;/em> tend to accumulate experience in higher-paying jobs, so the naive correlation between experience and wages confounds ability with genuine experience effects.&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;wagepan&amp;quot;, package = &amp;quot;wooldridge&amp;quot;)
cat(&amp;quot;Observations:&amp;quot;, nrow(wagepan), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Individuals:&amp;quot;, length(unique(wagepan$nr)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, length(unique(wagepan$year)), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Observations: 4360
Individuals: 545
Years: 8
&lt;/code>&lt;/pre>
&lt;h3 id="72-pooled-ols-vs-individual-fixed-effects">7.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-r">fe_pool &amp;lt;- feols(lwage ~ educ + exper + expersq, data = wagepan)
fe_fe &amp;lt;- feols(lwage ~ exper + expersq | nr, data = wagepan)
fe_twfe &amp;lt;- feols(lwage ~ exper + expersq | nr + year, data = wagepan)
etable(fe_pool, fe_fe, fe_twfe,
headers = c(&amp;quot;Pooled OLS&amp;quot;, &amp;quot;Individual FE&amp;quot;, &amp;quot;Individual + Year FE&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> fe_pool fe_fe fe_twfe
Pooled OLS Individual FE Individual + Year FE
Dependent Var.: lwage lwage lwage
Constant -0.0564 (0.0639)
educ 0.1021*** (0.0047)
exper 0.1050*** (0.0102) 0.1223*** (0.0082)
expersq -0.0036*** (0.0007) -0.0045*** (0.0006) -0.0054*** (0.0007)
Fixed-Effects: ------------------- ------------------- -------------------
nr No Yes Yes
year No No Yes
_______________ ___________________ ___________________ ___________________
Observations 4,360 4,360 4,360
R2 0.14772 0.61727 0.61850
Within R2 -- 0.17270 0.01534
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. First, the &lt;code>educ&lt;/code> coefficient disappears from the individual FE column &amp;mdash; education is time-invariant for most individuals, so it is perfectly collinear with person dummies. Second, the &lt;code>exper&lt;/code> linear term disappears from the two-way FE column &amp;mdash; because experience increments by exactly one year for everyone, it is perfectly collinear with year dummies. Only &lt;code>expersq&lt;/code> (which varies non-linearly across individuals) survives.&lt;/p>
&lt;p>In the individual FE model, the experience coefficient &lt;em>increases&lt;/em> from 0.105 to 0.122. This means the within-person return to experience is larger than the pooled estimate. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation &amp;mdash; most of the &amp;ldquo;action&amp;rdquo; in wages comes from &lt;em>who you are&lt;/em>, not &lt;em>how many years you have worked&lt;/em>.&lt;/p>
&lt;h3 id="73-visualizing-the-within-person-variation">7.3 Visualizing the within-person variation&lt;/h3>
&lt;p>Again, &lt;code>fwl_plot()&lt;/code> produces the before/after comparison in two one-liners. We sample 150 individuals for visual clarity (with 545 individuals the plot would be too dense):&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(456)
sample_ids &amp;lt;- sample(unique(wagepan$nr), 150)
wage_sample &amp;lt;- wagepan[wagepan$nr %in% sample_ids, ]
# Raw bivariate relationship
fwl_plot(lwage ~ exper, data = wage_sample, ggplot = TRUE)
# With individual fixed effects
fwl_plot(lwage ~ exper | nr, data = wage_sample, ggplot = TRUE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig4_panel_data.png" alt="Raw pooled cross-section (left) vs. individual fixed-effects residualized scatter (right) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A plots the raw bivariate relationship with a shallow slope of about 0.03. The wide fan of points reflects unobserved ability differences: individuals at the same experience level have wildly different wages. Panel B (individual FE) strips away each person&amp;rsquo;s average wage and average experience, leaving only the &lt;em>within-person&lt;/em> deviations. The slope steepens to 0.122 &amp;mdash; more than three times larger &amp;mdash; showing that a one-year increase in experience raises wages by about 12.2% &lt;em>within the same individual&lt;/em>. The tighter cloud in Panel B shows that once we account for who each person is, the experience-wage relationship is much more precisely identified.&lt;/p>
&lt;h2 id="8-customization-and-quick-reference">8. Customization and Quick Reference&lt;/h2>
&lt;h3 id="81-ggplot2-integration">8.1 ggplot2 integration&lt;/h3>
&lt;p>The &lt;code>fwl_plot()&lt;/code> function can return a ggplot2 object by setting &lt;code>ggplot = TRUE&lt;/code>, allowing full customization with ggplot2 layers and themes. This is useful for publication-quality figures with consistent styling, faceting, or combining multiple plots with &lt;code>patchwork&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-r">p &amp;lt;- fwl_plot(sales ~ coupons + income, data = store_data, ggplot = TRUE)
fig5 &amp;lt;- p +
labs(title = &amp;quot;FWL Visualization: Coupons Effect on Sales&amp;quot;,
subtitle = &amp;quot;After residualizing on income&amp;quot;) +
theme_minimal(base_size = 13)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_fwlplot_fig5_ggplot_custom.png" alt="FWL scatter plot with ggplot2 customization showing coupons effect on sales after residualizing on income">&lt;/p>
&lt;h3 id="82-quick-reference-fwl_plot-recipes">8.2 Quick reference: fwl_plot() recipes&lt;/h3>
&lt;p>Here are the most common &lt;code>fwl_plot()&lt;/code> patterns you will use:&lt;/p>
&lt;pre>&lt;code class="language-r"># 1. Raw scatter (no controls)
fwl_plot(y ~ x, data = df)
# 2. Control for one or more variables
fwl_plot(y ~ x + control1 + control2, data = df)
# 3. Fixed effects (use | to separate)
fwl_plot(y ~ x | group_fe, data = df)
# 4. Multiple fixed effects
fwl_plot(y ~ x | fe1 + fe2, data = df)
# 5. Return ggplot2 object for customization
fwl_plot(y ~ x + control, data = df, ggplot = TRUE) + theme_minimal()
# 6. Sample points for large datasets (line uses all data)
fwl_plot(y ~ x | fe, data = big_data, n_sample = 5000)
&lt;/code>&lt;/pre>
&lt;h3 id="83-key-arguments">8.3 Key arguments&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Argument&lt;/th>
&lt;th>Purpose&lt;/th>
&lt;th>Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>formula&lt;/code>&lt;/td>
&lt;td>Same as &lt;code>feols()&lt;/code>: &lt;code>y ~ x + controls | FE&lt;/code>&lt;/td>
&lt;td>&lt;code>sales ~ coupons + income&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>data&lt;/code>&lt;/td>
&lt;td>Input data frame&lt;/td>
&lt;td>&lt;code>store_data&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ggplot&lt;/code>&lt;/td>
&lt;td>Return ggplot2 object (default: base R)&lt;/td>
&lt;td>&lt;code>ggplot = TRUE&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n_sample&lt;/code>&lt;/td>
&lt;td>Sample N points for large datasets&lt;/td>
&lt;td>&lt;code>n_sample = 5000&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>vcov&lt;/code>&lt;/td>
&lt;td>Variance-covariance specification&lt;/td>
&lt;td>&lt;code>vcov = &amp;quot;hetero&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For large datasets like the flights data (317K+ observations), the &lt;code>n_sample&lt;/code> argument is essential to avoid overplotting. The regression line is always computed on the full data &amp;mdash; only the &lt;em>plotted points&lt;/em> are sampled, so the slope is unaffected.&lt;/p>
&lt;h2 id="9-discussion">9. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it is the foundation of how modern regression software works. When &lt;code>fixest::feols()&lt;/code> estimates a model with fixed effects, it does not literally create and invert a matrix with thousands of dummy variables. Instead, it uses the FWL logic to demean the data and run OLS on the residuals. This is why &lt;code>fixest&lt;/code> can handle millions of observations with hundreds of thousands of fixed effects: the demeaning step is $O(N)$, while creating the full dummy matrix would be $O(N \times K)$.&lt;/p>
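&lt;p>For two sets of fixed effects, a single pass of demeaning is not enough, because removing one factor&amp;rsquo;s means can reintroduce the other&amp;rsquo;s. A stylized Python sketch of the alternating-demeaning idea follows &amp;mdash; a simplified stand-in for the algorithm &lt;code>fixest&lt;/code> actually uses, with synthetic data and made-up factor sizes:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, k1, k2 = 2000, 30, 50
f1 = rng.integers(0, k1, n)            # first factor (origin-like)
f2 = rng.integers(0, k2, n)            # second factor (destination-like)
a1 = rng.normal(0, 1, k1)[f1]          # hypothetical factor-1 effects
a2 = rng.normal(0, 1, k2)[f2]          # hypothetical factor-2 effects
x = rng.normal(0, 1, n) + a1 - a2      # regressor loaded on both factors
y = -0.5 * x + a1 + a2 + rng.normal(0, 1, n)   # true slope = -0.5

def demean(v, g, k):
    """Subtract group means of v for grouping vector g with k levels."""
    counts = np.bincount(g, minlength=k)
    means = np.bincount(g, weights=v, minlength=k) / np.maximum(counts, 1)
    return v - means[g]

def within(v, iters=100):
    """Alternate demeaning over both factors until the group means vanish."""
    for _ in range(iters):
        v = demean(v, f1, k1)
        v = demean(v, f2, k2)
    return v

x_t, y_t = within(x), within(y)
beta = float(x_t @ y_t) / float(x_t @ x_t)
print(round(beta, 3))   # recovers a slope near -0.5, no dummy matrix ever built
```

&lt;p>Each sweep is $O(N)$, and by the alternating-projections argument the iterates converge to the residuals from projecting out &lt;em>both&lt;/em> dummy sets jointly.&lt;/p>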
&lt;p>As a diagnostic tool, FWL scatter plots reveal problems that regression tables hide. If the residualized scatter shows a curved relationship, your linear specification may be wrong. If it shows outliers, they may be driving the coefficient. If the cloud collapses to a near-vertical line (as in Panel C of the flights figure), the within-group variation may be too small to identify the effect reliably.&lt;/p>
&lt;p>The FWL theorem also connects to more advanced methods. &lt;strong>Double Machine Learning&lt;/strong> (Chernozhukov et al., 2018) generalizes the partialling-out idea by using machine learning models instead of linear regression to residualize the data. The Python FWL tutorial on this site takes that next step. The &lt;code>fwlplot&lt;/code> package does not do DML, but the visual intuition &amp;mdash; &amp;ldquo;look at the residualized scatter to see the conditional relationship&amp;rdquo; &amp;mdash; carries over directly.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic regression, Poisson regression, or other nonlinear models, the partialling-out logic does not hold exactly. The residualized scatter plot for a nonlinear model is at best an approximation of the conditional relationship, not an exact representation.&lt;/p>
&lt;h2 id="10-summary-and-next-steps">10. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> in our simulated data, the naive coupon coefficient was -0.093 (coupons &amp;ldquo;hurt&amp;rdquo; sales), while the true causal effect is +0.2. After controlling for income via &lt;code>fwl_plot()&lt;/code>, the estimate was +0.212, recovering the true effect.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the bias exactly:&lt;/strong> the bias was $0.300 \times (-1.018) = -0.306$, where $-1.018$ is the auxiliary slope from regressing income on coupons; adding it to the true effect, $0.212 + (-0.306) = -0.093$, reproduces the naive estimate exactly.&lt;/li>
&lt;li>&lt;strong>FWL is not an approximation &amp;mdash; it is an exact algebraic identity:&lt;/strong> the coefficient from partialling out controls matches &lt;code>feols()&lt;/code> to six decimal places. Every multiple regression coefficient &lt;em>can&lt;/em> be visualized as a bivariate scatter plot.&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> the flights data showed how adding origin and destination FE progressively transformed the scatter. The air-time coefficient changed from -0.003 (no FE) to -0.007 (origin + destination FE).&lt;/li>
&lt;li>&lt;strong>Panel FE reveal within-person effects:&lt;/strong> the wage data showed that controlling for individual ability via FE steepened the bivariate experience slope from 0.03 (pooled, no controls) to 0.122 (within-person), more than tripling the estimated return to experience.&lt;/li>
&lt;/ul>
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends the partialling-out logic to Double Machine Learning, and the &lt;a href="https://carlos-mendez.org/post/r_did/">R DID tutorial&lt;/a> that uses &lt;code>fixest&lt;/code> for difference-in-differences with staggered treatment adoption.&lt;/p>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Omitted variable direction.&lt;/strong> Use the OVB formula from Section 5.3 to predict what happens if you also omit &lt;code>dayofweek&lt;/code> (in addition to income). Run the naive regression &lt;code>lm(sales ~ coupons)&lt;/code> and compare the bias to $\hat{\gamma}_{income} \times \hat{\delta}_{income} + \hat{\gamma}_{day} \times \hat{\delta}_{day}$. Does the extended OVB formula still predict the direction correctly?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple controls.&lt;/strong> Use &lt;code>fwl_plot()&lt;/code> to visualize the coupon effect after controlling for both income and &lt;code>dayofweek&lt;/code>. Compare this to controlling for income alone. Does the scatter change visually? Does the coefficient change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Your own data.&lt;/strong> Pick a dataset from the &lt;code>wooldridge&lt;/code> package (e.g., &lt;code>hprice1&lt;/code>, &lt;code>wage2&lt;/code>, &lt;code>crime2&lt;/code>) and use &lt;code>fwl_plot()&lt;/code> to visualize a regression relationship before and after adding controls. Does the coefficient change substantially? Can you identify what the confounder is doing?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-datasets">12. Datasets&lt;/h2>
&lt;p>The datasets used in this tutorial are saved as CSV files in the post directory for reuse in other tutorials:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>File&lt;/th>
&lt;th>Rows&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>store_data.csv&lt;/code>&lt;/td>
&lt;td>200&lt;/td>
&lt;td>Simulated retail data (sales, coupons, income, dayofweek)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>flights_sample.csv&lt;/code>&lt;/td>
&lt;td>5,000&lt;/td>
&lt;td>Cleaned NYC flights sample (delays, air time, origin, dest)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>wagepan.csv&lt;/code>&lt;/td>
&lt;td>4,360&lt;/td>
&lt;td>Wooldridge wage panel (545 individuals, 8 years)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://cran.r-project.org/package=fwlplot" target="_blank" rel="noopener">Butts, K. &amp;amp; McDermott, G. (2024). fwlplot: Scatter Plot After Residualizing. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/package=fixest" target="_blank" rel="noopener">Berge, L. (2018). fixest: Fast Fixed-Effects Estimations in R. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Visualizing Regression with the FWL Theorem in Stata</title><link>https://carlos-mendez.org/post/stata_fwl/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_fwl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>&amp;ldquo;What does it actually mean to &lt;em>control for&lt;/em> a variable?&amp;rdquo; This question appears in every applied regression course, and the answer is surprisingly hard to visualize. When we say &amp;ldquo;the effect of coupons on sales, controlling for income,&amp;rdquo; we are describing a relationship in multidimensional space. This relationship cannot be directly plotted on a two-dimensional scatter. The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> changes this: it shows that the coefficient from a multiple regression equals the slope of a simple bivariate regression &amp;mdash; after first &lt;em>residualizing&lt;/em> (partialling out) the control variables from both the outcome and the variable of interest.&lt;/p>
&lt;p>The &lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">scatterfit&lt;/a> Stata package (Ahrens, 2024) makes this visual in one command. It takes a dependent variable, an independent variable, and optional controls or fixed effects, then produces a scatter plot of the residualized data with a fitted regression line. Built on &lt;code>reghdfe&lt;/code>, it handles high-dimensional fixed effects efficiently. It also offers features beyond what R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code> or Python&amp;rsquo;s manual FWL can do: &lt;strong>binned scatter plots&lt;/strong> for large datasets, &lt;strong>regression parameters printed directly on the plot&lt;/strong>, and &lt;strong>multiple fit types&lt;/strong> (linear, quadratic, lowess).&lt;/p>
&lt;p>This tutorial is the third in a trilogy &amp;mdash; see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R tutorial&lt;/a> and &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python tutorial&lt;/a> &amp;mdash; and uses the &lt;strong>same datasets&lt;/strong> for cross-language comparability. All data are loaded from GitHub URLs so the analysis is fully reproducible.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Use &lt;code>scatterfit&lt;/code> to visualize bivariate relationships with and without controls&lt;/li>
&lt;li>Demonstrate FWL residualization with &lt;code>controls()&lt;/code> and &lt;code>fcontrols()&lt;/code>&lt;/li>
&lt;li>Verify manually that FWL reproduces &lt;code>reghdfe&lt;/code> coefficients exactly&lt;/li>
&lt;li>Visualize fixed effects using &lt;code>fcontrols()&lt;/code> on flights data&lt;/li>
&lt;li>Use binned scatter plots to summarize patterns in large datasets&lt;/li>
&lt;li>Show regression parameters directly on plots with &lt;code>regparameters()&lt;/code>&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The Modeling Pipeline&lt;/h2>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;Load Data&amp;lt;br/&amp;gt;from GitHub&amp;lt;br/&amp;gt;(Section 3)&amp;quot;] --&amp;gt; B[&amp;quot;Naive vs.&amp;lt;br/&amp;gt;FWL Scatter&amp;lt;br/&amp;gt;(Section 4)&amp;quot;]
B --&amp;gt; C[&amp;quot;Manual FWL&amp;lt;br/&amp;gt;Verification&amp;lt;br/&amp;gt;(Section 5)&amp;quot;]
C --&amp;gt; D[&amp;quot;Binned&amp;lt;br/&amp;gt;Scatter&amp;lt;br/&amp;gt;(Section 6)&amp;quot;]
D --&amp;gt; E[&amp;quot;Fixed Effects&amp;lt;br/&amp;gt;Flights&amp;lt;br/&amp;gt;(Section 7)&amp;quot;]
E --&amp;gt; F[&amp;quot;Panel Data&amp;lt;br/&amp;gt;Wages&amp;lt;br/&amp;gt;(Section 8)&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#6a9bcc,stroke:#141413,color:#fff
style F fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We start where the answer is known (simulated data), see the result with &lt;code>scatterfit&lt;/code>, verify manually, then apply the same tool to real flights data and panel wage data.&lt;/p>
&lt;h2 id="3-setup-and-data">3. Setup and Data&lt;/h2>
&lt;h3 id="31-install-packages">3.1 Install packages&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command requires &lt;code>reghdfe&lt;/code> and &lt;code>ftools&lt;/code> for high-dimensional fixed effects estimation. All packages are installed from SSC or GitHub:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install packages if not already installed
capture ssc install reghdfe, replace
capture ssc install ftools, replace
capture ssc install estout, replace
capture net install scatterfit, ///
from(&amp;quot;https://raw.githubusercontent.com/leojahrens/scatterfit/master&amp;quot;) replace
&lt;/code>&lt;/pre>
&lt;h3 id="32-load-the-simulated-store-data">3.2 Load the simulated store data&lt;/h3>
&lt;p>We load the same simulated retail dataset used in the R and Python FWL tutorials. The data are hosted on GitHub for reproducibility:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/store_data.csv&amp;quot;, clear
&lt;/code>&lt;/pre>
&lt;p>The data simulate a scenario where a store manager wants to know whether distributing coupons increases sales. &lt;strong>Income is a confounder&lt;/strong> &amp;mdash; wealthier neighborhoods receive fewer coupons (the store targets promotions at lower-income areas) but have higher baseline sales:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Income[&amp;quot;Income&amp;lt;br/&amp;gt;(confounder)&amp;quot;]
Coupons[&amp;quot;Coupons&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
Sales[&amp;quot;Sales&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
Income --&amp;gt;|&amp;quot;-0.5&amp;lt;br/&amp;gt;(fewer coupons&amp;lt;br/&amp;gt;to rich areas)&amp;quot;| Coupons
Income --&amp;gt;|&amp;quot;+0.3&amp;lt;br/&amp;gt;(rich areas&amp;lt;br/&amp;gt;buy more)&amp;quot;| Sales
Coupons --&amp;gt;|&amp;quot;+0.2&amp;lt;br/&amp;gt;(true causal&amp;lt;br/&amp;gt;effect)&amp;quot;| Sales
style Income fill:#d97757,stroke:#141413,color:#fff
style Coupons fill:#6a9bcc,stroke:#141413,color:#fff
style Sales fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The arrows in this diagram show causal relationships, and the numbers are the true effect sizes in the data generating process. The true causal effect of coupons on sales is &lt;strong>+0.2&lt;/strong>, but income opens a &lt;strong>backdoor path&lt;/strong> &amp;mdash; an indirect route from coupons to sales that goes &lt;em>through&lt;/em> income (coupons $\leftarrow$ income $\rightarrow$ sales). Unless we block this path by controlling for income, the naive estimate will be biased downward.&lt;/p>
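&lt;p>This data generating process is easy to re-create. A Python sketch (hypothetical coefficients matching the arrows above, fresh random draws rather than the posted CSV) shows the naive estimate biased downward and the controlled estimate recovering the true +0.2:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
income = rng.normal(50, 10, n)                     # confounder
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)  # rich areas get fewer coupons
sales = 10 + 0.2 * coupons + 0.3 * income + rng.normal(0, 2, n)

ones = np.ones(n)
# Naive: regress sales on coupons alone (backdoor path open)
naive = np.linalg.lstsq(np.column_stack([ones, coupons]), sales, rcond=None)[0][1]
# Controlled: add income, blocking the backdoor path
controlled = np.linalg.lstsq(np.column_stack([ones, coupons, income]),
                             sales, rcond=None)[0][1]
print(round(naive, 3), round(controlled, 3))
```

&lt;p>The naive slope comes out well below the true effect (typically negative), while the controlled slope lands near +0.2 up to sampling noise.&lt;/p>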
&lt;pre>&lt;code class="language-stata">summarize sales coupons income dayofweek
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
sales | 200 33.6747 3.811032 24.89 45.23
coupons | 200 34.85685 6.788834 18.72 53.25
income | 200 49.72545 9.745807 20.07 77.02
dayofweek | 200 3.915 1.996926 1 7
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">correlate sales coupons income
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> | sales coupons income
-------------+---------------------------
sales | 1.0000
coupons | -0.1664 1.0000
income | 0.5003 -0.7087 1.0000
&lt;/code>&lt;/pre>
&lt;p>The correlation matrix confirms the confounding structure. Coupons and sales have a &lt;em>negative&lt;/em> raw correlation (-0.166), even though the true effect is positive (+0.2). Income is strongly negatively correlated with coupons (-0.709) and positively correlated with sales (0.500). A naive regression would wrongly conclude that coupons hurt sales.&lt;/p>
&lt;h2 id="4-scatterfit-in-action-naive-vs-controlled">4. scatterfit in Action: Naive vs. Controlled&lt;/h2>
&lt;h3 id="41-the-naive-scatter">4.1 The naive scatter&lt;/h3>
&lt;p>The simplest &lt;code>scatterfit&lt;/code> call plots the raw relationship. The &lt;code>regparameters()&lt;/code> option prints the regression coefficient, p-value, and R-squared directly on the plot &amp;mdash; a feature unique to this Stata package:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, regparameters(coef pval r2) ///
opts(name(naive, replace) title(&amp;quot;A. Naive: No Controls&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope is &lt;strong>-0.093&lt;/strong> ($p = 0.018$, $R^2 = 0.028$): coupons appear to &lt;em>reduce&lt;/em> sales. This is statistically significant but substantively wrong &amp;mdash; the true effect is +0.2. The near-zero R-squared confirms that the naive model explains almost none of the variation in sales.&lt;/p>
&lt;h3 id="42-controlling-for-income-one-option">4.2 Controlling for income: one option&lt;/h3>
&lt;p>Now add income as a control. In &lt;code>scatterfit&lt;/code>, the &lt;code>controls()&lt;/code> option specifies continuous variables to partial out using the FWL procedure. Behind the scenes, &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> to residualize both sales and coupons on income, then plots the residuals:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) regparameters(coef pval r2) ///
opts(name(controlled, replace) title(&amp;quot;B. FWL: Controlling for Income&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>The slope reverses to &lt;strong>+0.212&lt;/strong> ($p &amp;lt; 0.001$, $R^2 = 0.32$) &amp;mdash; close to the true value of +0.2. The R-squared jumps from 0.03 to 0.32, showing that controlling for income explains a large share of the variation. Combining both panels:&lt;/p>
&lt;pre>&lt;code class="language-stata">graph combine naive controlled, ///
title(&amp;quot;What Does 'Controlling for Income' Look Like?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig1_naive_vs_controlled.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig1_naive_vs_controlled.png" alt="Naive scatter (left) shows a negative slope with R2 of 0.028; after FWL residualization with controls(income), the slope reverses to positive with R2 of 0.32">&lt;/p>
&lt;p>The left panel shows the raw relationship: more coupons, lower sales ($R^2 = 0.028$). The right panel shows the &lt;em>same&lt;/em> data after removing the influence of income from both axes via &lt;code>controls(income)&lt;/code>. The true positive effect of coupons emerges clearly, and the $R^2$ rises to 0.32.&lt;/p>
&lt;h3 id="43-the-regression-table-confirms">4.3 The regression table confirms&lt;/h3>
&lt;p>We can compare the naive and controlled regressions side by side using Stata&amp;rsquo;s &lt;code>estimates store&lt;/code> and &lt;code>estimates table&lt;/code> workflow. The &lt;code>estimates store&lt;/code> command saves regression results under a name, and &lt;code>estimates table&lt;/code> displays multiple stored results in columns &amp;mdash; similar to R&amp;rsquo;s &lt;code>etable()&lt;/code> or Python&amp;rsquo;s &lt;code>stargazer&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">regress sales coupons
estimates store naive_ols
regress sales coupons income
estimates store full_ols
estimates table naive_ols full_ols, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------
Variable | naive_ols full_ols
-------------+------------------------
coupons | -0.0934 0.2123
| 0.0393 0.0467
income | 0.3004
| 0.0325
_cons | 36.9301 11.3352
| 1.3969 3.0080
-------------+------------------------
r2 | 0.0277 0.3215
N | 200 200
--------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding income as a control flips the coupon coefficient from -0.093 to +0.212 and increases the R-squared from 0.028 to 0.321. The income coefficient (0.300) is close to the true value of 0.3.&lt;/p>
&lt;h3 id="44-omitted-variable-bias-predicting-the-error">4.4 Omitted variable bias: predicting the error&lt;/h3>
&lt;p>The confounding is not mysterious &amp;mdash; the &lt;strong>omitted variable bias (OVB) formula&lt;/strong> predicts it exactly:&lt;/p>
&lt;p>$$\text{bias} = \hat{\gamma} \times \hat{\delta}$$&lt;/p>
&lt;p>In words, the bias equals the effect of the omitted variable on the outcome ($\hat{\gamma}$) multiplied by the slope from the auxiliary regression of the omitted variable on the treatment ($\hat{\delta}$). Note the direction of the auxiliary regression: the omitted variable is regressed on the included one, not the reverse.&lt;/p>
&lt;pre>&lt;code class="language-stata">* gamma = effect of income on sales (in the full model)
regress sales coupons income
local gamma = _b[income] // 0.3004
* delta = auxiliary regression of the omitted variable (income) on the treatment (coupons)
regress income coupons
local delta = _b[coupons] // -1.0174
* OVB = gamma * delta
display &amp;quot;OVB = &amp;quot; %9.4f `gamma' * `delta'
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">OVB = -0.3057
&lt;/code>&lt;/pre>
&lt;p>The OVB formula predicts a bias of -0.306: income&amp;rsquo;s positive effect on sales ($\hat{\gamma} = 0.300$) times the auxiliary slope of income on coupons ($\hat{\delta} = -1.017$) produces a large negative bias. The predicted naive coefficient (true + bias = 0.212 + (-0.306) = -0.093) matches the actual naive coefficient (-0.093) exactly. In-sample, the OVB decomposition is an algebraic identity, not an approximation.&lt;/p>
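&lt;p>The identity can be verified in a few lines of Python on simulated data (the coefficients below are arbitrary illustrations, not the store dataset):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)                      # omitted variable (think: income)
x = 1.0 - 0.7 * z + rng.normal(size=n)      # treatment (think: coupons)
y = 0.2 * x + 0.3 * z + rng.normal(size=n)  # outcome (think: sales)

def slope(a, b):
    # OLS slope of a on b, with an intercept
    return np.polyfit(b, a, 1)[0]

X = np.column_stack([np.ones(n), x, z])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
beta_full, gamma = coefs[1], coefs[2]   # controlled effect of x; effect of z
delta = slope(z, x)                     # auxiliary slope: omitted on included
beta_naive = slope(y, x)

# Identity: naive coefficient = full coefficient + gamma * delta
print(abs(beta_naive - (beta_full + gamma * delta)))   # zero up to machine precision
&lt;/code>&lt;/pre>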
&lt;h2 id="5-under-the-hood-manual-fwl-verification">5. Under the Hood: Manual FWL Verification&lt;/h2>
&lt;h3 id="51-the-three-step-recipe">5.1 The three-step recipe&lt;/h3>
&lt;p>The FWL theorem can be implemented manually in Stata using &lt;code>regress&lt;/code> and &lt;code>predict&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Step 1: Residualize sales on income
regress sales income
predict resid_sales, residuals
* Step 2: Residualize coupons on income
regress coupons income
predict resid_coupons, residuals
* Step 3: Regress residuals on residuals
regress resid_sales resid_coupons
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">------------------------------------------------------------------------------
resid_sales | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
resid_coup~s | .2122882 .046581 4.56 0.000 .1204297 .3041466
_cons | -2.87e-09 .222537 -0.00 1.000 -.4388468 .4388468
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FWL coefficient on &lt;code>resid_coupons&lt;/code> is &lt;strong>0.212288&lt;/strong> &amp;mdash; exactly the same as the full regression coefficient on &lt;code>coupons&lt;/code> (0.212288). This is not an approximation; it is an algebraic identity. Formally, the FWL theorem says:&lt;/p>
&lt;p>$$\hat{\beta}_1 = \frac{\text{Cov}(\tilde{Y}, \tilde{X}_1)}{\text{Var}(\tilde{X}_1)}$$&lt;/p>
&lt;p>where $\tilde{Y}$ and $\tilde{X}_1$ are the residuals from regressing $Y$ and $X_1$ on the controls $Z$. In our example, $\tilde{Y}$ is &lt;code>resid_sales&lt;/code> (the part of sales that income cannot explain) and $\tilde{X}_1$ is &lt;code>resid_coupons&lt;/code> (the part of coupons that income cannot explain). The ratio of their covariance to the variance of $\tilde{X}_1$ gives the slope we see in the regression above.&lt;/p>
&lt;p>Think of it like measuring height &lt;em>for your age&lt;/em>: instead of comparing raw heights, you compare how much taller or shorter each person is than the average for their age group.&lt;/p>
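&lt;p>The same three steps take only a few lines of Python (simulated data with illustrative coefficients), and make the covariance-over-variance formula explicit:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(2)
n = 200
z = rng.normal(size=n)                      # control
x = 0.5 * z + rng.normal(size=n)            # regressor of interest
y = 0.2 * x + 0.3 * z + rng.normal(size=n)  # outcome

def residualize(v, w):
    # Residuals from OLS of v on w (with an intercept)
    W = np.column_stack([np.ones(len(w)), w])
    return v - W.dot(np.linalg.lstsq(W, v, rcond=None)[0])

ry = residualize(y, z)   # Step 1: residualize y on the control
rx = residualize(x, z)   # Step 2: residualize x on the control
# Step 3: slope of residuals on residuals = Cov(ry, rx) / Var(rx)
beta_fwl = np.cov(ry, rx)[0, 1] / np.var(rx, ddof=1)

# Full multiple regression coefficient on x, for comparison
X = np.column_stack([np.ones(n), x, z])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(beta_fwl - beta_full)   # zero up to machine precision
&lt;/code>&lt;/pre>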
&lt;h3 id="52-adding-more-controls">5.2 Adding more controls&lt;/h3>
&lt;p>The &lt;code>scatterfit&lt;/code> command handles any number of controls automatically:&lt;/p>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, ///
regparameters(coef pval r2) opts(name(panel_a, replace) title(&amp;quot;A. No Controls&amp;quot;))
scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(panel_b, replace) title(&amp;quot;B. + Income&amp;quot;))
scatterfit sales coupons, controls(income dayofweek) ///
regparameters(coef pval r2) opts(name(panel_c, replace) title(&amp;quot;C. + Income + Day&amp;quot;))
graph combine panel_a panel_b panel_c, ///
title(&amp;quot;Progressive Controls: How the Scatter Changes&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig2_three_panels.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig2_three_panels.png" alt="Three-panel progression showing coefficient, p-value, and R2: no controls (left, R2 = 0.028), controlling for income (center, R2 = 0.32), controlling for income and day of week (right, R2 = 0.37)">&lt;/p>
&lt;pre>&lt;code class="language-stata">estimates table m1_naive m2_income m3_full, stats(r2 r2_a N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | m1_naive m2_income m3_full
-------------+------------------------------------
coupons | -0.0934 0.2123 0.2219
| 0.0393 0.0467 0.0454
income | 0.3004 0.2961
| 0.0325 0.0316
dayofweek | 0.4029
| 0.1095
_cons | 36.9301 11.3352 9.6398
| 1.3969 3.0080 2.9527
-------------+------------------------------------
r2 | 0.0277 0.3215 0.3654
r2_a | 0.0228 0.3146 0.3556
N | 200 200 200
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The coupon coefficient progresses from -0.093 (naive, wrong sign), to +0.212 (controlling for income), to +0.222 (adding day of week). The R-squared &amp;mdash; now visible directly on each panel &amp;mdash; jumps from 0.028 to 0.32 to 0.37. Each scatterfit panel shows a tighter cloud as more variation is absorbed by the controls.&lt;/p>
&lt;h2 id="6-binned-scatter-plots">6. Binned Scatter Plots&lt;/h2>
&lt;h3 id="61-why-binned-scatters">6.1 Why binned scatters?&lt;/h3>
&lt;p>With large datasets (thousands or millions of observations), scatter plots become useless &amp;mdash; individual points merge into a solid blob. &lt;strong>Binned scatter plots&lt;/strong> solve this by grouping observations into quantile bins along the x-axis and plotting the bin means. The regression line is still estimated on the full data, so the slope is unaffected. This is one of &lt;code>scatterfit&lt;/code>&amp;rsquo;s key advantages over R&amp;rsquo;s &lt;code>fwl_plot()&lt;/code>.&lt;/p>
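&lt;p>The binning step itself is simple: split the x-axis into quantile bins and average within each bin. A minimal Python sketch (the function name and defaults are hypothetical, not &lt;code>scatterfit&lt;/code> internals):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def binned_means(x, y, nquantiles=20):
    # Assign each observation to a quantile bin of x, then average x and y per bin
    edges = np.quantile(x, np.linspace(0, 1, nquantiles + 1))
    bins = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, nquantiles - 1)
    bx = np.array([x[bins == b].mean() for b in range(nquantiles)])
    by = np.array([y[bins == b].mean() for b in range(nquantiles)])
    return bx, by

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(size=5000)
bx, by = binned_means(x, y)
print(len(bx))   # 20 bin means summarize 5,000 points
&lt;/code>&lt;/pre>
&lt;p>The regression line would still be fit on all 5,000 observations; only the plotted markers are replaced by the bin means.&lt;/p>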
&lt;h3 id="62-unbinned-vs-binned">6.2 Unbinned vs. binned&lt;/h3>
&lt;pre>&lt;code class="language-stata">scatterfit sales coupons, controls(income) ///
regparameters(coef pval r2) opts(name(unbinned, replace) title(&amp;quot;A. Unbinned (all points)&amp;quot;))
scatterfit sales coupons, controls(income) binned ///
regparameters(coef pval r2) opts(name(binned, replace) title(&amp;quot;B. Binned (20 quantiles)&amp;quot;))
graph combine unbinned binned, ///
title(&amp;quot;Binned Scatter: Summarizing Patterns in Large Data&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig3_binned_scatter.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig3_binned_scatter.png" alt="Unbinned scatter (left) vs. binned scatter with 20 quantiles (right), both showing the same FWL-residualized relationship with coefficient, p-value, and R2 annotations">&lt;/p>
&lt;p>Both panels show the same FWL-residualized relationship ($\beta = 0.21$, $R^2 = 0.32$), but the binned version (right) replaces 200 individual points with 20 bin-mean markers. For our small dataset the difference is modest, but for the flights data (5,000+ observations) or production datasets (millions of rows), binning is essential. The &lt;code>nquantiles()&lt;/code> option controls how many bins to use:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Fewer bins = smoother but less detail
scatterfit sales coupons, controls(income) binned nquantiles(10)
* More bins = more detail but noisier
scatterfit sales coupons, controls(income) binned nquantiles(30)
&lt;/code>&lt;/pre>
&lt;h2 id="7-visualizing-fixed-effects">7. Visualizing Fixed Effects&lt;/h2>
&lt;h3 id="71-load-the-flights-data">7.1 Load the flights data&lt;/h3>
&lt;p>We load the NYC flights sample &amp;mdash; 5,000 flights from New York&amp;rsquo;s three airports (EWR, JFK, LGA) in 2013:&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/flights_sample.csv&amp;quot;, clear
summarize dep_delay air_time
tabulate origin
* Encode string variables for fixed effects (needed by scatterfit/reghdfe)
encode origin, gen(origin_fe)
encode dest, gen(dest_fe)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
dep_delay | 5,000 7.3172 22.83736 -20 119
air_time | 5,000 150.3636 93.47726 22 650
&lt;/code>&lt;/pre>
&lt;h3 id="72-progressive-fixed-effects">7.2 Progressive fixed effects&lt;/h3>
&lt;p>The &lt;code>fcontrols()&lt;/code> option specifies categorical variables to absorb as fixed effects. This is analogous to &lt;code>feols(...| FE)&lt;/code> in R&amp;rsquo;s fixest:&lt;/p>
&lt;pre>&lt;code class="language-stata">* No fixed effects
scatterfit dep_delay air_time, regparameters(coef pval r2) ///
opts(name(fe_none, replace) title(&amp;quot;A. No Fixed Effects&amp;quot;))
* Origin airport FE
scatterfit dep_delay air_time, fcontrols(origin_fe) ///
regparameters(coef pval r2) opts(name(fe_origin, replace) title(&amp;quot;B. Origin FE&amp;quot;))
* Origin + destination FE
scatterfit dep_delay air_time, fcontrols(origin_fe dest_fe) ///
regparameters(coef pval r2) opts(name(fe_both, replace) title(&amp;quot;C. Origin + Dest FE&amp;quot;))
graph combine fe_none fe_origin fe_both, ///
title(&amp;quot;What Do Fixed Effects 'Do' to the Data?&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig4_fixed_effects.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig4_fixed_effects.png" alt="Progressive FWL plots with coefficient, p-value, and R2: no FE (left, R2 near 0), origin FE (center), origin + destination FE (right)">&lt;/p>
&lt;p>Panel A shows the raw cloud with a nearly flat slope ($R^2 \approx 0$). Panel B removes the three origin-airport means, tightening the horizontal spread. Panel C removes the destination means as well, collapsing the variation to &lt;em>within-route&lt;/em> deviations and increasing $R^2$ substantially. The &lt;code>fcontrols()&lt;/code> option handles all the demeaning internally using &lt;code>reghdfe&lt;/code>.&lt;/p>
&lt;h3 id="73-regression-table">7.3 Regression table&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress dep_delay air_time
estimates store fe0
reghdfe dep_delay air_time, absorb(origin_fe) vce(robust)
estimates store fe1
reghdfe dep_delay air_time, absorb(origin_fe dest_fe) vce(robust)
estimates store fe2
estimates table fe0 fe1 fe2, stats(r2 N) b(%9.4f) se(%9.4f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | fe0 fe1 fe2
-------------+------------------------------------
air_time | -0.0050 -0.0079 -0.0324
| 0.0035 0.0034 0.0265
_cons | 8.0669 8.5072 12.1416
| 0.6117 0.6449 4.0186
-------------+------------------------------------
r2 | 0.0004 0.0055 0.0310
N | 5000 5000 4994
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The air time coefficient changes as we add fixed effects: -0.005 (no FE), -0.008 (origin FE), -0.032 (origin + destination FE). Note that these are estimated on the 5,000-observation sample, so the coefficients differ somewhat from the full-data estimates in the R tutorial. The key pattern is the same: adding fixed effects absorbs between-group variation and changes both the magnitude and precision of the coefficient. With origin + destination FE, 6 singleton observations are dropped (N = 4,994) &amp;mdash; singletons are routes with only one flight in the sample, where within-group variation cannot be estimated.&lt;/p>
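&lt;p>The &amp;ldquo;within&amp;rdquo; transform that &lt;code>reghdfe&lt;/code> applies for a single fixed-effect dimension is just group demeaning. This Python sketch (simulated groups; the 0.5 and 2.0 coefficients are illustrative) shows the pooled slope being pulled away from the within-group slope by the group effects:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(4)
n_groups, per = 50, 20
g = np.repeat(np.arange(n_groups), per)    # group id (think: origin airport)
alpha = rng.normal(size=n_groups)[g]       # group fixed effect
x = alpha + rng.normal(size=g.size)        # regressor correlated with the FE
y = 0.5 * x + 2.0 * alpha + rng.normal(size=g.size)

def demean(v, g):
    # Subtract each group's mean: the one-way "within" transform
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

beta_pooled = np.polyfit(x, y, 1)[0]                        # biased by group effects
beta_within = np.polyfit(demean(x, g), demean(y, g), 1)[0]  # close to the true 0.5
print(round(beta_pooled, 2), round(beta_within, 2))
&lt;/code>&lt;/pre>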
&lt;h2 id="8-panel-data-returns-to-experience">8. Panel Data: Returns to Experience&lt;/h2>
&lt;h3 id="81-load-the-wage-panel">8.1 Load the wage panel&lt;/h3>
&lt;p>The wage panel contains 545 individuals observed over 8 years (1980&amp;ndash;1987). The classic question: what is the return to experience? The challenge is &lt;strong>unobserved ability&lt;/strong> &amp;mdash; two people with the same experience may earn very different wages because one is more talented, motivated, or well-connected. These unmeasured personal traits are the &amp;ldquo;unobserved ability&amp;rdquo; that individual fixed effects absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">import delimited &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_fwlplot/wagepan.csv&amp;quot;, clear
xtset nr year
summarize lwage exper expersq educ
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lwage | 4,360 1.649147 .5326094 -3.579079 4.05186
exper | 4,360 6.514679 2.825873 0 18
expersq | 4,360 50.42477 40.78199 0 324
educ | 4,360 11.76697 1.746181 3 16
&lt;/code>&lt;/pre>
&lt;h3 id="82-pooled-ols-vs-individual-fixed-effects">8.2 Pooled OLS vs. individual fixed effects&lt;/h3>
&lt;pre>&lt;code class="language-stata">regress lwage educ exper expersq
estimates store pool
reghdfe lwage exper expersq, absorb(nr)
estimates store fe_ind
reghdfe lwage exper expersq, absorb(nr year)
estimates store fe_twfe
estimates table pool fe_ind fe_twfe, stats(r2 N)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--------------------------------------------------
Variable | pool fe_ind fe_twfe
-------------+------------------------------------
educ | 0.1021
| 0.0047
exper | 0.1050 0.1223 (omitted)
| 0.0102 0.0082
expersq | -0.0036 -0.0045 -0.0054
| 0.0007 0.0006 0.0007
_cons | -0.0564 1.0807 1.9223
| 0.0639 0.0263 0.0359
-------------+------------------------------------
r2 | 0.1477 0.6173 0.6185
N | 4360 4360 4360
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Several things change as we add fixed effects. The &lt;code>educ&lt;/code> coefficient disappears from the individual FE column: education is time-invariant in this panel (it does not change over the 8 years for any individual), so it is perfectly collinear with the person dummies. Stata marks &lt;code>exper&lt;/code> as &lt;code>(omitted)&lt;/code> in the two-way FE column: because experience increments by exactly one year for everyone, it is perfectly collinear with the person and year dummies combined. Only &lt;code>expersq&lt;/code> (which varies non-linearly) survives both sets of fixed effects. The R-squared jumps from 0.148 to 0.617, showing that individual fixed effects explain the majority of wage variation.&lt;/p>
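&lt;p>The collinearity is mechanical: the within transform maps any time-invariant variable to a column of zeros, so there is nothing left for OLS to estimate. A tiny Python check on a hypothetical 3-person, 4-year panel:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

pid = np.repeat(np.arange(3), 4)           # 3 individuals, 4 years each
educ = np.repeat([12.0, 16.0, 10.0], 4)    # time-invariant schooling

# Demean within person: each value equals its person mean, so residuals vanish
means = np.bincount(pid, weights=educ) / np.bincount(pid)
print(educ - means[pid])   # all zeros: no within-person variation in educ
&lt;/code>&lt;/pre>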
&lt;h3 id="83-scatterfit-with-individual-fe">8.3 scatterfit with individual FE&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sample 150 individuals for visual clarity
preserve
set seed 456
bysort nr: gen first = (_n == 1)
gen rand = runiform() if first
bysort nr (rand): replace rand = rand[1]
sort rand nr year
egen rank = group(rand) if first
bysort nr (rank): replace rank = rank[1]
keep if rank &amp;lt;= 150
scatterfit lwage exper, regparameters(coef pval r2) ///
opts(name(wage_raw, replace) title(&amp;quot;A. Raw: Pooled Cross-Section&amp;quot;))
scatterfit lwage exper, fcontrols(nr) regparameters(coef pval r2) ///
opts(name(wage_fe, replace) title(&amp;quot;B. FWL: Individual Fixed Effects&amp;quot;))
graph combine wage_raw wage_fe, ///
title(&amp;quot;Controlling for Unobserved Ability&amp;quot;) rows(1)
graph export &amp;quot;stata_fwl_fig5_panel_data.png&amp;quot;, replace
restore
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig5_panel_data.png" alt="Raw pooled cross-section (left, R2 = 0.043) vs. individual fixed-effects residualized scatter (right, R2 = 0.59) for log wage vs. experience">&lt;/p>
&lt;p>The visual difference is dramatic. Panel A shows a wide fan with a shallow slope ($R^2 = 0.043$) &amp;mdash; individuals at the same experience level have wildly different wages, reflecting unobserved ability. Panel B applies &lt;code>fcontrols(nr)&lt;/code> to strip away each person&amp;rsquo;s average wage and experience, leaving only &lt;em>within-person&lt;/em> deviations. The $R^2$ jumps from 0.04 to 0.59, showing that individual fixed effects explain most of the wage variation. The slope steepens sharply: the within-person return to experience is about 0.07 log points per year (roughly 7%), and the relationship is much more precisely identified once we control for who each person is.&lt;/p>
&lt;h2 id="9-advanced-fit-types-and-regression-parameters">9. Advanced: Fit Types and Regression Parameters&lt;/h2>
&lt;h3 id="91-multiple-fit-types">9.1 Multiple fit types&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays the coefficient, standard error, p-value, R-squared, and sample size directly on the plot. The &lt;code>scatterfit&lt;/code> command also supports fit types beyond linear &amp;mdash; quadratic and lowess &amp;mdash; as diagnostics for nonlinearity:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Linear fit with all regression parameters displayed on the plot
scatterfit sales coupons, controls(income) ///
regparameters(coef se pval r2 n)
graph export &amp;quot;stata_fwl_fig6_advanced.png&amp;quot;, replace
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_fwl_fig6_advanced.png" alt="Linear FWL fit with regression parameters (coefficient, SE, p-value, R-squared, N) displayed directly on the plot">&lt;/p>
&lt;pre>&lt;code class="language-stata">* Lowess fit: nonparametric check (note: lowess does not support controls())
scatterfit sales coupons, fit(lowess)
&lt;/code>&lt;/pre>
&lt;p>Nonlinear fits serve as specification diagnostics: if the relationship looks curved in the residualized scatter, your linear specification may be misspecified. Note that &lt;code>fit(lowess)&lt;/code> and &lt;code>fit(lpoly)&lt;/code> do not support &lt;code>controls()&lt;/code> in the current version of &lt;code>scatterfit&lt;/code> &amp;mdash; use them on raw or manually residualized data. For our simulated data (which is truly linear), a &lt;code>fit(quadratic)&lt;/code> curve closely follows the linear fit, confirming that the specification is appropriate.&lt;/p>
&lt;h3 id="92-regression-parameters-on-the-plot">9.2 Regression parameters on the plot&lt;/h3>
&lt;p>The &lt;code>regparameters()&lt;/code> option displays statistical information directly on the scatter plot. Available parameters:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Display&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>coef&lt;/code>&lt;/td>
&lt;td>Slope coefficient&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>se&lt;/code>&lt;/td>
&lt;td>Standard error&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pval&lt;/code>&lt;/td>
&lt;td>P-value&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>r2&lt;/code>&lt;/td>
&lt;td>R-squared&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n&lt;/code>&lt;/td>
&lt;td>Sample size&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-stata">* Show everything
scatterfit sales coupons, controls(income) regparameters(coef se pval r2 n)
&lt;/code>&lt;/pre>
&lt;p>This is especially useful for presentations and papers where you want to communicate both the visual pattern and the statistical evidence in a single figure.&lt;/p>
&lt;h3 id="93-quick-reference-scatterfit-recipes">9.3 Quick reference: scatterfit recipes&lt;/h3>
&lt;pre>&lt;code class="language-stata">* 1. Raw scatter (no controls)
scatterfit y x
* 2. Control for continuous variables (FWL)
scatterfit y x, controls(z1 z2)
* 3. Control for fixed effects (categorical)
scatterfit y x, fcontrols(group_fe)
* 4. Both continuous controls and fixed effects
scatterfit y x, controls(z1) fcontrols(group_fe)
* 5. Binned scatter (for large datasets)
scatterfit y x, controls(z1) binned nquantiles(20)
* 6. Show regression parameters on the plot
scatterfit y x, controls(z1) regparameters(coef pval r2)
* 7. Quadratic fit (works with controls)
scatterfit y x, controls(z1) fit(quadratic)
* 8. Lowess fit (does NOT support controls — use on raw data)
scatterfit y x, fit(lowess)
&lt;/code>&lt;/pre>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;p>The FWL theorem is not just a pedagogical tool &amp;mdash; it is the computational engine behind Stata&amp;rsquo;s &lt;code>reghdfe&lt;/code> command. When &lt;code>reghdfe&lt;/code> estimates a model with fixed effects, it does not create a matrix with thousands of dummy variables. Instead, it uses an iterative demeaning algorithm (a generalization of FWL) to absorb the fixed effects, then runs OLS on the residuals. This is why &lt;code>reghdfe&lt;/code> can handle millions of observations with tens of thousands of fixed effects.&lt;/p>
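&lt;p>The iterative idea can be sketched in Python: alternately sweep out the group means of each fixed-effect dimension until the adjustments become negligible, then run OLS on the residuals. This toy version (a simplification of &lt;code>reghdfe&lt;/code>&amp;rsquo;s actual implementation; all names and data are illustrative) reproduces the coefficient from an explicit dummy-variable regression:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def demean_two_way(v, g1, g2, iters=200, tol=1e-10):
    # Alternating projections: remove g1 means, then g2 means, and repeat
    v = v - v.mean()
    for _ in range(iters):
        m1 = np.bincount(g1, weights=v) / np.bincount(g1)
        v = v - m1[g1]
        m2 = np.bincount(g2, weights=v) / np.bincount(g2)
        v = v - m2[g2]
        if tol > max(np.abs(m1).max(), np.abs(m2).max()):
            break
    return v

rng = np.random.default_rng(5)
n = 400
g1 = rng.integers(0, 20, n)    # first FE dimension (think: origin)
g2 = rng.integers(0, 30, n)    # second FE dimension (think: destination)
x = rng.normal(size=n)
y = 0.7 * x + 0.1 * g1 + 0.05 * g2 + rng.normal(size=n)

beta_ap = np.polyfit(demean_two_way(x, g1, g2),
                     demean_two_way(y, g1, g2), 1)[0]

# Brute force: OLS with one dummy per group in each dimension
D1 = (g1[:, None] == np.arange(20)).astype(float)
D2 = (g2[:, None] == np.arange(30)).astype(float)
beta_dum = np.linalg.lstsq(np.column_stack([x, D1, D2]), y, rcond=None)[0][0]
print(round(beta_ap, 4), round(beta_dum, 4))   # agree to printed precision
&lt;/code>&lt;/pre>
&lt;p>The demeaning version never builds the 50-column dummy matrix, which is why the same approach scales to tens of thousands of fixed effects.&lt;/p>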
&lt;p>The &lt;code>scatterfit&lt;/code> package offers three advantages over the R and Python implementations of FWL visualization. First, &lt;strong>binned scatter plots&lt;/strong> (Section 6) are essential for large datasets where individual points merge into an unreadable blob. Second, &lt;strong>regression parameters on the plot&lt;/strong> (&lt;code>regparameters()&lt;/code>) combine the visual and statistical evidence in a single figure, reducing the back-and-forth between plots and tables. Third, &lt;strong>multiple fit types&lt;/strong> (&lt;code>fit(quadratic)&lt;/code>, &lt;code>fit(lowess)&lt;/code>) serve as built-in diagnostics for linearity.&lt;/p>
&lt;p>Across the three tutorials (Python, R, Stata), the key numbers are the same because we use the same datasets: the naive coupon coefficient is -0.093, the estimate after controlling for income is +0.212 (true effect +0.2), and the gap of -0.306 between them is exactly the omitted variable bias. The FWL theorem is the same in every language &amp;mdash; only the syntax changes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task&lt;/th>
&lt;th>Python&lt;/th>
&lt;th>R&lt;/th>
&lt;th>Stata&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Raw scatter&lt;/td>
&lt;td>&lt;code>plt.scatter(x, y)&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Control for Z&lt;/td>
&lt;td>manual &lt;code>resid()&lt;/code>&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x + z)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, controls(z)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fixed effects&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>fwl_plot(y ~ x | fe)&lt;/code>&lt;/td>
&lt;td>&lt;code>scatterfit y x, fcontrols(fe)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Binned scatter&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>scatterfit y x, binned&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stats on plot&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>not supported&lt;/td>
&lt;td>&lt;code>regparameters(coef pval r2)&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Students who learn FWL in one language can immediately apply it in another.&lt;/p>
&lt;p>One limitation: the FWL theorem applies only to linear regression. For logistic, Poisson, or other nonlinear models, the partialling-out logic does not hold exactly. Stata&amp;rsquo;s &lt;code>scatterfit&lt;/code> does support &lt;code>fitmodel(logit)&lt;/code> and &lt;code>fitmodel(poisson)&lt;/code>, but these are direct fits, not FWL residualizations.&lt;/p>
&lt;h2 id="11-summary-and-next-steps">11. Summary and Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Confounding produces misleading regressions:&lt;/strong> the naive coupon coefficient was -0.093 (wrong sign), while the true causal effect is +0.2. After FWL residualization with &lt;code>controls(income)&lt;/code>, the estimate was +0.212.&lt;/li>
&lt;li>&lt;strong>The OVB formula predicts the bias exactly:&lt;/strong> $0.300 \times (-1.017) = -0.306$, the exact gap between the naive (-0.093) and controlled (+0.212) coefficients &amp;mdash; an in-sample algebraic identity.&lt;/li>
&lt;li>&lt;strong>FWL is an exact identity:&lt;/strong> the manual three-step procedure in Stata (&lt;code>regress&lt;/code> + &lt;code>predict resid&lt;/code> + &lt;code>regress&lt;/code>) matches the full regression to six decimal places (0.212288).&lt;/li>
&lt;li>&lt;strong>Fixed effects are FWL applied to group dummies:&lt;/strong> &lt;code>fcontrols()&lt;/code> in &lt;code>scatterfit&lt;/code> calls &lt;code>reghdfe&lt;/code> internally to demean the data, equivalent to &lt;code>feols(... | FE)&lt;/code> in R.&lt;/li>
&lt;li>&lt;strong>Binned scatter plots and on-plot statistics are Stata&amp;rsquo;s advantage:&lt;/strong> the &lt;code>binned&lt;/code> and &lt;code>regparameters()&lt;/code> options provide capabilities that the R and Python FWL tools lack.&lt;/li>
&lt;/ul>
&lt;p>For further study, see the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> using &lt;code>fwl_plot()&lt;/code> and the &lt;a href="https://carlos-mendez.org/post/python_fwl/">Python FWL tutorial&lt;/a> that extends FWL to Double Machine Learning.&lt;/p>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>OVB direction.&lt;/strong> In our simulation, predict the direction of the OVB if you also omit &lt;code>dayofweek&lt;/code>. Compute $\hat{\gamma}_{day} \times \hat{\delta}_{day}$ and add it to the income OVB. Does the total bias match the difference between the naive and the fully controlled coefficient?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Binned scatter with different bins.&lt;/strong> Re-run &lt;code>scatterfit sales coupons, controls(income) binned nquantiles(k)&lt;/code> for $k = 5, 10, 20, 50$. How does the visual change? At what point do you lose meaningful information?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>slopefit: heterogeneous effects.&lt;/strong> Use the &lt;code>slopefit&lt;/code> command: &lt;code>slopefit sales coupons income&lt;/code>. This shows how the coupon-sales slope varies across income levels. Do coupons work better in low-income or high-income neighborhoods?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="13-references">13. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://github.com/leojahrens/scatterfit" target="_blank" rel="noopener">Ahrens, L. (2024). scatterfit: Scatter Plots with Fit Lines and Regression Results. GitHub.&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">Correia, S. (2016). reghdfe: Linear Models with Many Levels of Fixed Effects. Stata Journal.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1907330" target="_blank" rel="noopener">Frisch, R. &amp;amp; Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>JASA&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics" target="_blank" rel="noopener">Angrist, J. D. &amp;amp; Pischke, J.-S. (2009). &lt;em>Mostly Harmless Econometrics.&lt;/em> Princeton University Press.&lt;/a>&lt;/li>
&lt;li>Datasets: simulated store data, NYC flights sample, and Wooldridge wage panel from the companion &lt;a href="https://carlos-mendez.org/post/r_fwlplot/">R FWL tutorial&lt;/a> on this site.&lt;/li>
&lt;/ol></description></item><item><title>Three Methods for Robust Variable Selection: BMA, LASSO, and WALS</title><link>https://carlos-mendez.org/post/r_bma_lasso_wals/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_bma_lasso_wals/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you are an economist advising a government on climate policy. Your team has collected cross-country data on a dozen potential drivers of CO&lt;sub>2&lt;/sub> emissions: GDP per capita, fossil fuel dependence, urbanization, industrial output, democratic governance, trade networks, agricultural activity, trade openness, foreign direct investment, corruption, tourism, and domestic credit. The government has a limited budget and wants to know: &lt;strong>which of these factors truly drive CO&lt;sub>2&lt;/sub> emissions, and which are red herrings?&lt;/strong>&lt;/p>
&lt;p>This is the &lt;strong>variable selection&lt;/strong> problem, and it is harder than it sounds. With 12 candidate variables, each either included or excluded from a regression, there are $2^{12} = 4,096$ possible models you could estimate. Run one model and report it as &amp;ldquo;the answer,&amp;rdquo; and you have implicitly assumed the other 4,095 models are wrong. That is a very strong assumption &amp;mdash; and almost certainly unjustified.&lt;/p>
&lt;p>In practice, researchers handle this by &lt;em>specification searching&lt;/em>: they try many models, drop insignificant variables, and report whichever specification &amp;ldquo;works best.&amp;rdquo; This process inflates false discoveries. A noise variable that happens to look significant in one specification gets reported, while the many failed specifications are hidden in the researcher&amp;rsquo;s desk drawer. This is sometimes called the &lt;strong>file drawer problem&lt;/strong> or &lt;strong>pretesting bias&lt;/strong>.&lt;/p>
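&lt;p>The inflation of false discoveries is easy to demonstrate by simulation. The sketch below is illustrative and uses hypothetical pure-noise data, not the tutorial&amp;rsquo;s dataset: it repeatedly generates an outcome unrelated to ten candidate regressors, &amp;ldquo;searches&amp;rdquo; for the most significant one, and counts how often that best pick clears the 5% threshold.&lt;/p>
&lt;pre>&lt;code class="language-r"># Illustrative simulation: specification searching over pure noise
set.seed(123)
n_reps &amp;lt;- 500 # number of simulated research projects
n &amp;lt;- 100 # observations per project
k &amp;lt;- 10 # candidate regressors, all pure noise
false_hits &amp;lt;- 0
for (r in 1:n_reps) {
X &amp;lt;- matrix(rnorm(n * k), n, k) # noise candidates
y &amp;lt;- rnorm(n) # outcome unrelated to all of them
# &amp;quot;Search&amp;quot;: report only the single most significant regressor
pvals &amp;lt;- apply(X, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4])
if (min(pvals) &amp;lt; 0.05) false_hits &amp;lt;- false_hits + 1
}
cat(&amp;quot;Best-of-10 discovery rate:&amp;quot;, false_hits / n_reps, &amp;quot;\n&amp;quot;)
# With 10 independent tests, about 1 - 0.95^10 = 40% of projects
# report a significant variable even though none has a true effect.
&lt;/code>&lt;/pre>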
&lt;p>This tutorial introduces three principled approaches to the variable selection problem:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Q[&amp;quot;&amp;lt;b&amp;gt;Variable Selection&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Which of 12 variables&amp;lt;br/&amp;gt;truly matter?&amp;quot;] --&amp;gt; BMA
Q --&amp;gt; LASSO
Q --&amp;gt; WALS
BMA[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model Averaging&amp;lt;br/&amp;gt;PIPs from 4,096 models&amp;quot;] --&amp;gt; R[&amp;quot;&amp;lt;b&amp;gt;Convergence&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Variables identified&amp;lt;br/&amp;gt;by all 3 methods&amp;quot;]
LASSO[&amp;quot;&amp;lt;b&amp;gt;LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;L1 penalized regression&amp;lt;br/&amp;gt;Automatic selection&amp;quot;] --&amp;gt; R
WALS[&amp;quot;&amp;lt;b&amp;gt;WALS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Frequentist averaging&amp;lt;br/&amp;gt;t-statistics&amp;quot;] --&amp;gt; R
style Q fill:#141413,stroke:#141413,color:#fff
style BMA fill:#6a9bcc,stroke:#141413,color:#fff
style LASSO fill:#d97757,stroke:#141413,color:#fff
style WALS fill:#00d4c8,stroke:#141413,color:#fff
style R fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong>: Average across all 4,096 models, weighting each by how well it fits the data. Variables that appear important across many models earn a high &amp;ldquo;inclusion probability.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO (Least Absolute Shrinkage and Selection Operator)&lt;/strong>: Add a penalty to the regression that forces the coefficients of irrelevant variables to be &lt;em>exactly zero&lt;/em>, performing automatic selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weighted Average Least Squares (WALS)&lt;/strong>: A fast frequentist model-averaging method that transforms the problem so each variable can be evaluated independently.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>We use &lt;strong>synthetic data&lt;/strong> throughout this tutorial. This means we &lt;em>know the true data-generating process&lt;/em> &amp;mdash; which variables truly matter and which do not. This &amp;ldquo;answer key&amp;rdquo; lets us verify whether each method correctly recovers the truth. By the end, you will understand not just &lt;em>how&lt;/em> to run each method, but &lt;em>why&lt;/em> it works and &lt;em>when&lt;/em> to prefer one over the others.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the variable selection problem and why running a single model is insufficient when model uncertainty is large&lt;/li>
&lt;li>Implement Bayesian Model Averaging in R and interpret Posterior Inclusion Probabilities (PIPs)&lt;/li>
&lt;li>Apply LASSO with cross-validation to perform automatic variable selection and use Post-LASSO for unbiased estimation&lt;/li>
&lt;li>Run WALS as a fast frequentist model-averaging alternative and interpret its t-statistics&lt;/li>
&lt;li>Compare results across all three methods to identify truly robust determinants via methodological triangulation&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Content outline.&lt;/strong> Section 2 sets up the R environment. Section 3 introduces the synthetic dataset and its built-in &amp;ldquo;answer key&amp;rdquo; &amp;mdash; 7 true predictors and 5 noise variables with realistic multicollinearity. Section 4 runs naive OLS to illustrate the spurious significance problem. Sections 5&amp;ndash;8 cover BMA: Bayes' rule foundations, the PIP framework, a toy example, and full implementation. Sections 9&amp;ndash;12 cover LASSO: the bias-variance tradeoff, L1/L2 geometry, cross-validated implementation, and Post-LASSO. Sections 13&amp;ndash;16 cover WALS: frequentist model averaging, the semi-orthogonal transformation, the Laplace prior, and implementation. Section 17 brings all three methods together for a grand comparison. Section 18 summarizes key takeaways and provides further reading.&lt;/p>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;p>Before running the analysis, install the required packages if needed. The following code checks for missing packages and installs them automatically.&lt;/p>
&lt;pre>&lt;code class="language-r"># List all packages needed for this tutorial
required_packages &amp;lt;- c(
&amp;quot;tidyverse&amp;quot;, # data manipulation and ggplot2 visualization
&amp;quot;BMS&amp;quot;, # Bayesian Model Averaging via the bms() function
&amp;quot;glmnet&amp;quot;, # LASSO and Ridge regression via coordinate descent
&amp;quot;WALS&amp;quot;, # Weighted Average Least Squares estimation
&amp;quot;scales&amp;quot;, # nice axis formatting in plots
&amp;quot;patchwork&amp;quot;, # combine multiple ggplot panels
&amp;quot;ggrepel&amp;quot;, # non-overlapping text labels on plots
&amp;quot;corrplot&amp;quot;, # correlation matrix heatmaps
&amp;quot;broom&amp;quot; # tidy model summaries
)
# Install any packages not yet available
missing &amp;lt;- required_packages[!sapply(required_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) {
install.packages(missing, repos = &amp;quot;https://cloud.r-project.org&amp;quot;)
}
# Load libraries
library(tidyverse)
library(BMS)
library(glmnet)
library(WALS)
library(scales)
library(patchwork)
library(ggrepel)
library(corrplot)
library(broom)
&lt;/code>&lt;/pre>
&lt;h2 id="3-the-synthetic-dataset">3. The Synthetic Dataset&lt;/h2>
&lt;h3 id="31-the-data-generating-process-our-answer-key">3.1 The data-generating process (our &amp;ldquo;answer key&amp;rdquo;)&lt;/h3>
&lt;p>We use a cross-sectional dataset of 120 fictional countries. The key design choices:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>7 variables have true nonzero effects&lt;/strong> on CO&lt;sub>2&lt;/sub> emissions&lt;/li>
&lt;li>&lt;strong>5 variables are pure noise&lt;/strong> (their true coefficients are exactly zero)&lt;/li>
&lt;li>The noise variables are &lt;strong>correlated with GDP and other true predictors&lt;/strong>, creating realistic multicollinearity. This makes variable selection genuinely challenging &amp;mdash; in any given sample, naive OLS can attribute part of the true predictors&amp;rsquo; signal to the noise variables and produce spurious &amp;ldquo;significant&amp;rdquo; results.&lt;/li>
&lt;/ul>
&lt;p>Think of this as setting up a controlled experiment. We know the answer before we begin, so we can grade each method&amp;rsquo;s performance.&lt;/p>
&lt;p>The data-generating process below shows exactly how the synthetic dataset was built. The CSV file &lt;code>synthetic-co2-cross-section.csv&lt;/code> was generated with &lt;code>set.seed(2017)&lt;/code> and can be loaded directly from GitHub for full reproducibility.&lt;/p>
&lt;pre>&lt;code class="language-r"># --- DATA-GENERATING PROCESS (reference) ---
set.seed(2017)
n &amp;lt;- 120 # number of &amp;quot;countries&amp;quot;
# GDP drives many other variables (realistic: richer countries
# have higher urbanization, more industry, etc.)
log_gdp &amp;lt;- rnorm(n, mean = 8.5, sd = 1.5)
# --- TRUE PREDICTORS (correlated with GDP) ---
fossil_fuel &amp;lt;- 30 + 3 * log_gdp + rnorm(n, 0, 10) # higher in richer countries
urban_pop &amp;lt;- 20 + 5 * log_gdp + rnorm(n, 0, 12) # increases with income
industry &amp;lt;- 15 + 1.5 * log_gdp + rnorm(n, 0, 6) # industry share
democracy &amp;lt;- 5 + 2 * log_gdp + rnorm(n, 0, 8) # democracy index
trade_network &amp;lt;- 0.2 + 0.05 * log_gdp + rnorm(n, 0, 0.15) # trade centrality
agriculture &amp;lt;- 40 - 3 * log_gdp + rnorm(n, 0, 8) # negatively correlated with GDP
# --- NOISE VARIABLES (correlated with GDP but NO true effect) ---
log_trade &amp;lt;- 3.5 + 0.1 * log_gdp + rnorm(n, 0, 0.5)
fdi &amp;lt;- 2 + rnorm(n, 0, 4) # pure noise, independent of GDP
corruption &amp;lt;- 0.8 - 0.05 * log_gdp + rnorm(n, 0, 0.15)
log_tourism &amp;lt;- 12 + 0.3 * log_gdp + rnorm(n, 0, 1.2)
log_credit &amp;lt;- 2.5 + 0.15 * log_gdp + rnorm(n, 0, 0.6)
# --- TRUE DATA-GENERATING PROCESS ---
log_co2 &amp;lt;- 2.0 + # intercept
1.200 * log_gdp + # GDP: strong positive (elasticity)
0.008 * industry + # industry: positive
0.012 * fossil_fuel + # fossil fuel: positive
0.010 * urban_pop + # urbanization: positive
0.004 * democracy + # democracy: small positive
0.500 * trade_network + # trade network: moderate positive
0.005 * agriculture + # agriculture: weak positive
# NOISE VARIABLES HAVE ZERO TRUE EFFECT
rnorm(n, 0, 0.3) # random noise (sigma = 0.3)
&lt;/code>&lt;/pre>
&lt;p>The true coefficients serve as our &amp;ldquo;answer key&amp;rdquo;:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Variable&lt;/th>
&lt;th style="text-align:left">True $\beta$&lt;/th>
&lt;th style="text-align:left">Role&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">log_gdp&lt;/td>
&lt;td style="text-align:left">1.200&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">trade_network&lt;/td>
&lt;td style="text-align:left">0.500&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Moderate positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fossil_fuel&lt;/td>
&lt;td style="text-align:left">0.012&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">urban_pop&lt;/td>
&lt;td style="text-align:left">0.010&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more urbanization $\to$ 1.0% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">industry&lt;/td>
&lt;td style="text-align:left">0.008&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Positive composition effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">agriculture&lt;/td>
&lt;td style="text-align:left">0.005&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Weak positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">democracy&lt;/td>
&lt;td style="text-align:left">0.004&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Small positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_trade&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fdi&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">corruption&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_tourism&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_credit&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Now let us load the pre-generated dataset:&lt;/p>
&lt;pre>&lt;code class="language-r"># Load the synthetic dataset directly from GitHub
DATA_URL &amp;lt;- &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/synthetic-co2-cross-section.csv&amp;quot;
synth_data &amp;lt;- read.csv(DATA_URL)
cat(&amp;quot;Dataset:&amp;quot;, nrow(synth_data), &amp;quot;countries,&amp;quot;, ncol(synth_data), &amp;quot;variables\n&amp;quot;)
head(synth_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset: 120 countries, 14 variables
country log_co2 log_gdp industry fossil_fuel urban_pop democracy trade_network
1 Country_001 13.27 9.47 29.25 66.94 67.97 25.67 0.77
2 Country_002 12.18 8.44 24.97 51.43 66.14 20.51 0.85
3 Country_003 13.50 10.16 28.19 50.62 73.91 29.08 0.73
...
&lt;/code>&lt;/pre>
&lt;h3 id="32-descriptive-statistics">3.2 Descriptive statistics&lt;/h3>
&lt;p>The following summary statistics give us a first look at the data structure. Note the wide range of scales: GDP is in log units (mean around 8.5), while percentage variables like fossil fuel share and urbanization range from single digits to near 100.&lt;/p>
&lt;pre>&lt;code class="language-r"># Descriptive statistics for all 13 numeric variables
synth_data |&amp;gt;
select(-country) |&amp;gt;
pivot_longer(everything(), names_to = &amp;quot;variable&amp;quot;, values_to = &amp;quot;value&amp;quot;) |&amp;gt;
summarise(
n = n(),
mean = round(mean(value), 2),
sd = round(sd(value), 2),
min = round(min(value), 2),
max = round(max(value), 2),
.by = variable
)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable n mean sd min max
log_co2 120 14.22 2.11 8.76 20.36
log_gdp 120 8.53 1.57 4.61 13.21
industry 120 27.87 6.21 8.32 44.98
fossil_fuel 120 55.49 9.62 24.72 81.22
urban_pop 120 62.52 13.25 29.81 97.62
democracy 120 22.94 8.32 3.10 45.00
trade_network 120 0.64 0.17 0.18 1.04
agriculture 120 13.87 8.11 1.00 37.11
log_trade 120 4.43 0.46 3.45 5.84
fdi 120 2.23 4.19 -5.00 13.62
corruption 120 0.37 0.16 0.05 0.71
log_tourism 120 14.61 1.32 11.54 19.63
log_credit 120 3.83 0.65 2.30 5.50
&lt;/code>&lt;/pre>
&lt;p>The dataset has 120 observations and 14 variables (1 dependent, 12 candidate regressors, 1 country identifier). The dependent variable &lt;code>log_co2&lt;/code> has a mean of 14.22 with a standard deviation of 2.11 log points, reflecting substantial cross-country variation in emissions. The candidate regressors span very different scales &amp;mdash; trade_network ranges from 0.18 to 1.04, while urban_pop ranges from 29.8 to 97.6 &amp;mdash; which is why BMA, LASSO, and WALS each handle scaling internally.&lt;/p>
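&lt;p>Because the regressors live on such different scales, a common penalty or prior only makes sense after standardization. As a point of reference, &lt;code>glmnet&lt;/code> standardizes internally by default (its &lt;code>standardize = TRUE&lt;/code> argument). The minimal sketch below shows what that standardization amounts to; it assumes &lt;code>synth_data&lt;/code> has been loaded as in Section 3.1.&lt;/p>
&lt;pre>&lt;code class="language-r"># What internal standardization amounts to: each regressor
# is rescaled to mean 0 and standard deviation 1
X_raw &amp;lt;- synth_data |&amp;gt; select(-country, -log_co2)
X_std &amp;lt;- scale(X_raw) # subtract column means, divide by column sds
round(colMeans(X_std), 10) # all 0
round(apply(X_std, 2, sd), 10) # all 1
&lt;/code>&lt;/pre>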
&lt;h3 id="33-correlation-structure">3.3 Correlation structure&lt;/h3>
&lt;p>A key feature of our synthetic data is that the noise variables are correlated with the true predictors &amp;mdash; especially with GDP. This correlation is what makes variable selection difficult: in a standard OLS regression, the noise variables will &amp;ldquo;borrow&amp;rdquo; explanatory power from the true predictors.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute correlation matrix for all 12 candidate regressors
cor_matrix &amp;lt;- synth_data |&amp;gt;
select(-country, -log_co2) |&amp;gt;
cor()
# Draw the heatmap
corrplot(cor_matrix, method = &amp;quot;color&amp;quot;, type = &amp;quot;lower&amp;quot;,
addCoef.col = &amp;quot;black&amp;quot;, number.cex = 0.7,
col = colorRampPalette(c(&amp;quot;#d97757&amp;quot;, &amp;quot;white&amp;quot;, &amp;quot;#6a9bcc&amp;quot;))(200),
diag = FALSE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_01_correlation.png" alt="Correlation matrix heatmap showing that noise variables like trade openness, tourism, and credit are correlated with GDP and other true predictors, creating the multicollinearity that makes variable selection challenging.">&lt;/p>
&lt;p>The correlation heatmap reveals the realistic structure we built into the data. GDP is positively correlated with fossil fuel use, urbanization, industry, and the trade network &amp;mdash; but also with the noise variables like trade openness, tourism, and credit. This multicollinearity is precisely what makes a naive &amp;ldquo;throw everything into OLS&amp;rdquo; approach unreliable. For example, log_tourism has a correlation of approximately 0.3 with log_gdp, which means it can pick up GDP&amp;rsquo;s signal even though its true effect is zero.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> We created a synthetic dataset where we &lt;em>know&lt;/em> which 7 variables truly affect CO&lt;sub>2&lt;/sub> emissions and which 5 are noise. The noise variables are deliberately correlated with the true predictors, mimicking the multicollinearity found in real cross-country data.&lt;/p>
&lt;/blockquote>
&lt;h2 id="4-the-general-model">4. The General Model&lt;/h2>
&lt;p>Our goal is to estimate the following linear model:&lt;/p>
&lt;p>$$
\log(\text{CO}_{2,i}) = \beta_0 + \sum_{j=1}^{12} \beta_j x_{j,i} + \varepsilon_i
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\log(\text{CO}_{2,i})$ is the log of CO&lt;sub>2&lt;/sub> emissions for country $i$&lt;/li>
&lt;li>$\beta_0$ is the &lt;strong>intercept&lt;/strong> (the predicted log CO&lt;sub>2&lt;/sub> when all regressors are zero)&lt;/li>
&lt;li>$\beta_j$ is the &lt;strong>coefficient&lt;/strong> on the $j$-th regressor: the change in log CO&lt;sub>2&lt;/sub> associated with a one-unit increase in $x_j$, holding all other variables constant&lt;/li>
&lt;li>$\varepsilon_i$ is the &lt;strong>error term&lt;/strong>: everything that affects CO&lt;sub>2&lt;/sub> emissions but is not captured by the 12 regressors&lt;/li>
&lt;/ul>
&lt;p>Because the dependent variable is in logs, the interpretation of each coefficient depends on whether the regressor is also in logs:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Regressor type&lt;/th>
&lt;th style="text-align:left">Interpretation of $\beta_j$&lt;/th>
&lt;th style="text-align:left">Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Log-log (e.g., log GDP)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Elasticity&lt;/strong>: a 1% increase in GDP is associated with a $\beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 1.2$ means 1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Level-log (e.g., fossil fuel %)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Semi-elasticity&lt;/strong>: a 1-unit increase in the regressor is associated with a $100 \times \beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 0.012$ means 1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
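&lt;p>For small coefficients, the &amp;ldquo;$100 \times \beta_j$%&amp;rdquo; reading is an approximation: the exact percentage change implied by a log-level model is $100 \times (e^{\beta_j} - 1)$. A quick check in R, using the answer-key value for fossil fuel:&lt;/p>
&lt;pre>&lt;code class="language-r"># Exact vs approximate semi-elasticity in a log-level model
beta_fossil &amp;lt;- 0.012
exact &amp;lt;- 100 * (exp(beta_fossil) - 1) # exact % change in CO2 per 1 pp
approx &amp;lt;- 100 * beta_fossil # the usual approximation
c(exact = round(exact, 3), approx = round(approx, 3)) # 1.207 vs 1.2
# The gap is negligible for small coefficients but grows quickly:
# beta = 0.5 implies a 64.9% increase, not 50%.
&lt;/code>&lt;/pre>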
&lt;p>We want to determine &lt;strong>which $\beta_j$ are truly nonzero&lt;/strong>. We know the answer (we designed the data), but let us first see what happens if we just run OLS with all 12 variables.&lt;/p>
&lt;pre>&lt;code class="language-r"># Run OLS with all 12 candidate regressors
ols_full &amp;lt;- lm(log_co2 ~ log_gdp + industry + fossil_fuel + urban_pop +
democracy + trade_network + agriculture +
log_trade + fdi + corruption + log_tourism + log_credit,
data = synth_data)
# Display summary
summary(ols_full)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
(Intercept) 2.283773 0.494736 4.616 1.06e-05 ***
log_gdp 1.163669 0.032747 35.537 &amp;lt; 2e-16 ***
industry 0.017577 0.005004 3.513 0.000661 ***
fossil_fuel 0.011988 0.003240 3.698 0.000349 ***
urban_pop 0.008221 0.002689 3.057 0.002794 **
democracy 0.010497 0.003975 2.640 0.009549 **
trade_network 0.912828 0.203681 4.482 1.94e-05 ***
agriculture -0.000629 0.004242 -0.148 0.882568
log_trade -0.055738 0.064829 -0.860 0.391509
fdi 0.000789 0.007045 0.112 0.910964
corruption 0.010767 0.201954 0.053 0.957573
log_tourism -0.028025 0.024415 -1.148 0.253610
log_credit 0.045689 0.049690 0.919 0.360252
---
Multiple R-squared: 0.9801, Adjusted R-squared: 0.9779
&lt;/code>&lt;/pre>
&lt;p>Look carefully at the noise variables. For example, log_trade has a t-statistic of $-0.86$ (p = 0.392) and corruption has a t-statistic of $0.05$ (p = 0.958). None reach conventional significance in this sample, but in a different random sample some noise variables could easily cross the 5% threshold. This is the risk of &lt;strong>spurious significance&lt;/strong>, caused by the correlation between the noise variables and the true predictors. The flip side also appears here: agriculture is a true predictor ($\beta = 0.005$), yet OLS estimates a coefficient of $-0.0006$ (p = 0.883), so multicollinearity can hide genuine effects just as easily as it can manufacture spurious ones. It is precisely this problem that motivates the three methods we study next.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Warning.&lt;/strong> With 12 correlated regressors and only 120 observations, OLS can produce misleading significance levels. A variable with a true coefficient of zero may appear significant simply because it is correlated with a genuinely important predictor. This is why we need principled variable selection methods.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #6a9bcc 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 1: Bayesian Model Averaging
&lt;/div>
&lt;h2 id="5-bayes-rule-----the-foundation">5. Bayes' Rule &amp;mdash; The Foundation&lt;/h2>
&lt;p>Before we can understand Bayesian Model Averaging, we need to understand &lt;strong>Bayes' rule&lt;/strong> &amp;mdash; the mathematical machinery that powers the entire framework.&lt;/p>
&lt;h3 id="51-a-coin-flip-example">5.1 A coin-flip example&lt;/h3>
&lt;p>Suppose a friend gives you a coin. You want to know: &lt;strong>is this coin fair&lt;/strong> (probability of heads = 0.5), or is it &lt;strong>biased&lt;/strong> (probability of heads = 0.7)?&lt;/p>
&lt;p>Before flipping, you have no strong opinion. You assign equal &lt;strong>prior probabilities&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>$P(\text{fair}) = 0.5$ (50% chance the coin is fair)&lt;/li>
&lt;li>$P(\text{biased}) = 0.5$ (50% chance the coin is biased)&lt;/li>
&lt;/ul>
&lt;p>Now you flip the coin 10 times and observe &lt;strong>7 heads&lt;/strong>. How should you update your beliefs?&lt;/p>
&lt;p>The &lt;strong>likelihood&lt;/strong> of seeing 7 heads in 10 flips is:&lt;/p>
&lt;ul>
&lt;li>If the coin is fair ($p = 0.5$): $P(\text{7 heads} | \text{fair}) = \binom{10}{7} (0.5)^{10} = 0.1172$&lt;/li>
&lt;li>If the coin is biased ($p = 0.7$): $P(\text{7 heads} | \text{biased}) = \binom{10}{7} (0.7)^7 (0.3)^3 = 0.2668$&lt;/li>
&lt;/ul>
&lt;p>The biased coin makes the data more likely. Bayes' rule combines the prior and the likelihood:&lt;/p>
&lt;p>$$
P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$P(H|D)$ = &lt;strong>posterior probability&lt;/strong> (what we believe &lt;em>after&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D|H)$ = &lt;strong>likelihood&lt;/strong> (how probable the data is under hypothesis $H$)&lt;/li>
&lt;li>$P(H)$ = &lt;strong>prior probability&lt;/strong> (what we believed &lt;em>before&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D)$ = &lt;strong>marginal likelihood&lt;/strong> (a normalizing constant that ensures probabilities sum to 1)&lt;/li>
&lt;/ul>
&lt;p>For our coin:&lt;/p>
&lt;p>$$
P(\text{fair}|\text{7H}) = \frac{0.1172 \times 0.5}{0.1172 \times 0.5 + 0.2668 \times 0.5} = \frac{0.0586}{0.1920} = 0.305
$$&lt;/p>
&lt;p>$$
P(\text{biased}|\text{7H}) = \frac{0.2668 \times 0.5}{0.1920} = 0.695
$$&lt;/p>
&lt;p>After seeing 7 heads, we update from 50&amp;ndash;50 to roughly 30&amp;ndash;70 in favor of the biased coin. &lt;strong>The data shifted our beliefs, but did not erase the prior entirely.&lt;/strong>&lt;/p>
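&lt;p>The arithmetic above takes only a few lines of R, with &lt;code>dbinom()&lt;/code> supplying the binomial likelihoods:&lt;/p>
&lt;pre>&lt;code class="language-r"># Bayes' rule for the coin example: prior times likelihood, normalized
prior &amp;lt;- c(fair = 0.5, biased = 0.5)
lik &amp;lt;- c(fair = dbinom(7, size = 10, prob = 0.5),
biased = dbinom(7, size = 10, prob = 0.7))
posterior &amp;lt;- prior * lik / sum(prior * lik)
round(lik, 4) # 0.1172 0.2668
round(posterior, 3) # 0.305 0.695
&lt;/code>&lt;/pre>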
&lt;h3 id="52-the-bridge-to-model-averaging">5.2 The bridge to model averaging&lt;/h3>
&lt;p>Now replace &amp;ldquo;fair coin&amp;rdquo; and &amp;ldquo;biased coin&amp;rdquo; with &lt;em>regression models&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Hypothesis = &amp;ldquo;Which variables belong in the model?&amp;rdquo;&lt;/li>
&lt;li>Prior = &amp;ldquo;Before seeing data, any combination of variables is equally plausible&amp;rdquo;&lt;/li>
&lt;li>Likelihood = &amp;ldquo;How well does each model fit the data?&amp;rdquo;&lt;/li>
&lt;li>Posterior = &amp;ldquo;After seeing data, which models are most credible?&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>This is exactly what BMA does. Instead of two coin hypotheses, we have 4,096 model hypotheses &amp;mdash; but the logic of Bayes' rule is identical.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> Bayes' rule updates prior beliefs using data. The posterior probability of any hypothesis is proportional to its prior probability times its likelihood. BMA applies this same logic to regression models instead of coin flips.&lt;/p>
&lt;/blockquote>
&lt;h2 id="6-the-bma-framework">6. The BMA Framework&lt;/h2>
&lt;h3 id="61-posterior-model-probability">6.1 Posterior model probability&lt;/h3>
&lt;p>With 12 candidate variables, there are $K = 12$ regressors and $2^K = 4,096$ possible models. Denote the $k$-th model as $M_k$. BMA assigns each model a &lt;strong>posterior probability&lt;/strong>:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{P(y | M_k) \cdot P(M_k)}{\sum_{l=1}^{2^K} P(y | M_l) \cdot P(M_l)}
$$&lt;/p>
&lt;p>This is just Bayes' rule applied to models. Let us unpack each piece:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>$P(y | M_k)$&lt;/strong> is the &lt;strong>marginal likelihood&lt;/strong> of model $M_k$. It measures how well the model fits the data, &lt;em>automatically penalizing complexity&lt;/em>. A model with many parameters can fit the data closely, but the marginal likelihood integrates over all possible parameter values, spreading the probability thin. This acts as a built-in &lt;strong>Occam&amp;rsquo;s razor&lt;/strong>: simpler models that fit the data well receive higher marginal likelihoods than complex models that fit only slightly better.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>$P(M_k)$&lt;/strong> is the &lt;strong>prior model probability&lt;/strong>. With no prior information, we use a &lt;strong>uniform prior&lt;/strong>: every model is equally likely, so $P(M_k) = 1/4,096$ for all $k$. With a uniform prior, the posterior ranking of models is driven entirely by their marginal likelihoods &amp;mdash; that is, by how well each model fits the data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The &lt;strong>denominator&lt;/strong> is a normalizing constant that ensures all posterior model probabilities sum to 1.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-posterior-inclusion-probability-pip">6.2 Posterior Inclusion Probability (PIP)&lt;/h3>
&lt;p>We do not really care about individual models &amp;mdash; we care about individual &lt;em>variables&lt;/em>. The &lt;strong>Posterior Inclusion Probability&lt;/strong> of variable $j$ is the sum of the posterior probabilities of all models that include variable $j$:&lt;/p>
&lt;p>$$
\text{PIP}_j = \sum_{k:\, j \in M_k} P(M_k | y)
$$&lt;/p>
&lt;p>Think of it as a &lt;strong>democratic vote&lt;/strong>. Each of the 4,096 models casts a vote for which variables matter. But the votes are &lt;em>weighted&lt;/em>: models that fit the data well get louder voices. If variable $j$ appears in most of the high-probability models, it earns a high PIP.&lt;/p>
&lt;p>The standard interpretation thresholds (Raftery, 1995):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">PIP range&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;th style="text-align:left">Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$\geq 0.99$&lt;/td>
&lt;td style="text-align:left">Decisive evidence&lt;/td>
&lt;td style="text-align:left">Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.95 - 0.99$&lt;/td>
&lt;td style="text-align:left">Very strong evidence&lt;/td>
&lt;td style="text-align:left">Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.80 - 0.95$&lt;/td>
&lt;td style="text-align:left">Strong evidence (robust)&lt;/td>
&lt;td style="text-align:left">Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.50 - 0.80$&lt;/td>
&lt;td style="text-align:left">Borderline evidence&lt;/td>
&lt;td style="text-align:left">Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$&amp;lt; 0.50$&lt;/td>
&lt;td style="text-align:left">Weak/no evidence (fragile)&lt;/td>
&lt;td style="text-align:left">Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We will use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our threshold for &amp;ldquo;robust&amp;rdquo; throughout this tutorial.&lt;/p>
&lt;h3 id="63-posterior-mean">6.3 Posterior mean&lt;/h3>
&lt;p>Once we know which variables matter, we want to know &lt;em>how much&lt;/em> they matter. The &lt;strong>posterior mean&lt;/strong> of coefficient $j$ is:&lt;/p>
&lt;p>$$
E[\beta_j | y] = \sum_{k=1}^{2^K} \hat{\beta}_{j,k} \cdot P(M_k | y)
$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the estimated coefficient of variable $j$ in model $k$ (and zero if $j$ is not in model $k$). This is a weighted average of the coefficient across all models. Variables with high PIPs get posterior means close to their &amp;ldquo;full model&amp;rdquo; estimates; variables with low PIPs get posterior means shrunk toward zero.&lt;/p>
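&lt;p>Both quantities are simple weighted sums once the posterior model probabilities are known. The numbers below are hypothetical, chosen only to show the mechanics for one variable across four models:&lt;/p>
&lt;pre>&lt;code class="language-r"># Hypothetical posterior model probabilities for four models
post_prob &amp;lt;- c(M1 = 0.50, M2 = 0.30, M3 = 0.15, M4 = 0.05)
in_model &amp;lt;- c(TRUE, TRUE, FALSE, FALSE) # does variable j appear?
beta_hat &amp;lt;- c(1.10, 1.25, 0, 0) # estimate of beta_j (0 when excluded)
# PIP: total posterior probability of models containing variable j
pip_j &amp;lt;- sum(post_prob[in_model]) # 0.50 + 0.30 = 0.8
# Posterior mean: probability-weighted average across all models
post_mean_j &amp;lt;- sum(beta_hat * post_prob) # 0.55 + 0.375 = 0.925
c(PIP = pip_j, post_mean = post_mean_j)
&lt;/code>&lt;/pre>
&lt;p>Because 20% of the posterior mass sits on models that exclude the variable, the posterior mean (0.925) is pulled toward zero relative to the within-model estimates of 1.10 and 1.25 &amp;mdash; exactly the shrinkage described above.&lt;/p>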
&lt;h2 id="7-toy-example-----bma-on-3-variables">7. Toy Example &amp;mdash; BMA on 3 Variables&lt;/h2>
&lt;p>Before running BMA on all 12 variables, let us work through a small example by hand. We pick just 3 variables: &lt;strong>log_gdp&lt;/strong> and &lt;strong>fossil_fuel&lt;/strong> (true predictors) and &lt;strong>log_trade&lt;/strong> (noise). With 3 variables, each can be either IN or OUT of the model, giving us $2^3 = 8$ possible models &amp;mdash; small enough to examine every single one.&lt;/p>
&lt;p>Here are all 8 models written out explicitly:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Model&lt;/th>
&lt;th style="text-align:left">Formula&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$M_1$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ 1 (intercept only)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_2$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_3$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_4$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_5$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_6$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_7$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_8$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="71-step-1-----fit-every-model-and-compute-bic">7.1 Step 1 &amp;mdash; Fit every model and compute BIC&lt;/h3>
&lt;p>We fit each of the 8 models using OLS and compute its BIC score. Remember: &lt;strong>lower BIC = better&lt;/strong> (the model explains the data well without unnecessary complexity).&lt;/p>
&lt;pre>&lt;code class="language-r"># Select our 3 variables
toy_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, fossil_fuel, log_trade)
# Write out all 8 model formulas explicitly
model_formulas &amp;lt;- c(
&amp;quot;log_co2 ~ 1&amp;quot;, # M1: intercept only
&amp;quot;log_co2 ~ log_gdp&amp;quot;, # M2
&amp;quot;log_co2 ~ fossil_fuel&amp;quot;, # M3
&amp;quot;log_co2 ~ log_trade&amp;quot;, # M4
&amp;quot;log_co2 ~ log_gdp + fossil_fuel&amp;quot;, # M5
&amp;quot;log_co2 ~ log_gdp + log_trade&amp;quot;, # M6
&amp;quot;log_co2 ~ fossil_fuel + log_trade&amp;quot;, # M7
&amp;quot;log_co2 ~ log_gdp + fossil_fuel + log_trade&amp;quot; # M8
)
# Fit each model and extract its BIC
bic_values &amp;lt;- sapply(model_formulas, function(f) {
BIC(lm(as.formula(f), data = toy_data))
})
# Organize results in a table
toy_results &amp;lt;- tibble(
model = paste0(&amp;quot;M&amp;quot;, 1:8),
formula = model_formulas,
bic = round(bic_values, 1)
) |&amp;gt;
arrange(bic)
print(toy_results)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model formula bic
M5 log_co2 ~ log_gdp + fossil_fuel 114.1
M8 log_co2 ~ log_gdp + fossil_fuel + log_trade 118.5
M2 log_co2 ~ log_gdp 120.7
M6 log_co2 ~ log_gdp + log_trade 125.4
M3 log_co2 ~ fossil_fuel 514.4
M7 log_co2 ~ fossil_fuel + log_trade 519.0
M1 log_co2 ~ 1 528.3
M4 log_co2 ~ log_trade 533.0
&lt;/code>&lt;/pre>
&lt;p>The winner is $M_5$ (log_gdp + fossil_fuel) with BIC = 114.1 &amp;mdash; exactly the two true predictors, no noise. The runner-up $M_8$ adds log_trade but its BIC is worse (118.5), meaning the extra variable does not improve the fit enough to justify the added complexity. Models without GDP ($M_1$, $M_3$, $M_4$, $M_7$) have dramatically worse BIC scores, confirming GDP&amp;rsquo;s dominant role.&lt;/p>
&lt;h3 id="72-step-2-----convert-bic-to-posterior-probabilities">7.2 Step 2 &amp;mdash; Convert BIC to posterior probabilities&lt;/h3>
&lt;p>Now we turn each BIC into a posterior model probability. The formula is:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp(-0.5 \cdot \text{BIC}_k)}{\sum_{l=1}^{8} \exp(-0.5 \cdot \text{BIC}_l)}
$$&lt;/p>
&lt;p>Because the BIC values can be very large, we work with &lt;strong>differences from the best model&lt;/strong> to avoid numerical overflow. Subtracting the minimum BIC from all values does not change the probabilities:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp\bigl(-0.5 \cdot (\text{BIC}_k - \text{BIC}_{\min})\bigr)}{\sum_{l=1}^{8} \exp\bigl(-0.5 \cdot (\text{BIC}_l - \text{BIC}_{\min})\bigr)}
$$&lt;/p>
&lt;p>Let us plug in the numbers. The best model ($M_5$) has BIC = 114.1, so $\Delta_5 = 0$. The runner-up ($M_8$) has $\Delta_8 = 118.5 - 114.1 = 4.4$:&lt;/p>
&lt;p>$$
w_5 = \exp(-0.5 \times 0) = 1.000, \quad w_8 = \exp(-0.5 \times 4.4) = 0.111
$$&lt;/p>
&lt;p>The remaining models have much larger $\Delta$ values, so their weights are essentially zero. After normalizing by the sum of all weights ($1.000 + 0.111 + 0.037 + \ldots \approx 1.151$):&lt;/p>
&lt;p>$$
P(M_5 | y) = \frac{1.000}{1.151} = 0.869, \quad P(M_8 | y) = \frac{0.111}{1.151} = 0.096
$$&lt;/p>
&lt;pre>&lt;code class="language-r"># Convert BIC to posterior probabilities using the delta-BIC trick
toy_results &amp;lt;- toy_results |&amp;gt;
mutate(
delta_bic = bic - min(bic), # difference from best
weight = exp(-0.5 * delta_bic), # unnormalized weight
post_prob = round(weight / sum(weight), 4) # normalize to sum to 1
)
toy_results |&amp;gt; select(model, bic, delta_bic, weight, post_prob)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model bic delta_bic weight post_prob
M5 114.1 0.0 1.0000 0.8687
M8 118.5 4.4 0.1108 0.0962
M2 120.7 6.6 0.0369 0.0320
M6 125.4 11.3 0.0035 0.0031
M3 514.4 400.3 0.0000 0.0000
M7 519.0 404.9 0.0000 0.0000
M1 528.3 414.2 0.0000 0.0000
M4 533.0 418.9 0.0000 0.0000
&lt;/code>&lt;/pre>
&lt;p>One model dominates: $M_5$ captures 86.9% of the posterior probability &amp;mdash; exactly the two true predictors. The runner-up $M_8$ (adding log_trade) gets only 9.6%, and $M_2$ (GDP alone) gets 3.2%. The remaining 5 models share less than 0.4% of the total weight. BMA&amp;rsquo;s Occam&amp;rsquo;s razor is at work: adding log_trade to the model ($M_8$) does not improve the fit enough to overcome the complexity penalty, so the simpler model ($M_5$) wins decisively.&lt;/p>
&lt;h3 id="73-step-3-----compute-posterior-inclusion-probabilities">7.3 Step 3 &amp;mdash; Compute Posterior Inclusion Probabilities&lt;/h3>
&lt;p>Finally, we compute the PIP of each variable by summing the posterior probabilities of all models that include it. For example, log_trade appears in models $M_4$, $M_6$, $M_7$, and $M_8$, so:&lt;/p>
&lt;p>$$
\text{PIP}_{\text{log_trade}} = P(M_4 | y) + P(M_6 | y) + P(M_7 | y) + P(M_8 | y) = 0.000 + 0.003 + 0.000 + 0.096 = 0.099
$$&lt;/p>
&lt;p>That is well below the 0.50 threshold &amp;mdash; fragile evidence, exactly what we expect for a noise variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute PIPs: for each variable, sum P(M|y) across models that include it
pip_toy &amp;lt;- tibble(
variable = c(&amp;quot;log_gdp&amp;quot;, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;log_trade&amp;quot;),
true_effect = c(&amp;quot;True&amp;quot;, &amp;quot;True&amp;quot;, &amp;quot;Noise&amp;quot;),
pip = c(
# log_gdp appears in M2, M5, M6, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M2&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M8&amp;quot;)]),
# fossil_fuel appears in M3, M5, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M3&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)]),
# log_trade appears in M4, M6, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M4&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)])
)
)
print(pip_toy)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_effect pip
log_gdp True 1.000
fossil_fuel True 0.965
log_trade Noise 0.099
&lt;/code>&lt;/pre>
&lt;p>Even with this simple 3-variable example, BMA correctly identifies the two true predictors. GDP has a PIP of 1.000 (decisive evidence) and fossil_fuel has a PIP of 0.965 (robust) &amp;mdash; they appear in every high-probability model. Log_trade has a PIP of only 0.099 (fragile) &amp;mdash; well below the 0.50 threshold. BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor penalizes models that include noise variables without substantially improving the fit.&lt;/p>
&lt;h2 id="8-bma-on-all-12-variables">8. BMA on All 12 Variables&lt;/h2>
&lt;h3 id="81-running-bma">8.1 Running BMA&lt;/h3>
&lt;p>Now we apply BMA to the full dataset with all 12 candidate regressors using the &lt;code>BMS&lt;/code> package. With only $2^{12} = 4{,}096$ candidate models, the space is small enough for the MCMC sampler to explore it essentially exhaustively.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for MCMC sampling
# Prepare the data matrix: DV in first column, regressors follow
bma_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, industry, fossil_fuel, urban_pop,
democracy, trade_network, agriculture,
log_trade, fdi, corruption, log_tourism, log_credit) |&amp;gt;
as.data.frame()
# Run BMA
bma_fit &amp;lt;- bms(
X.data = bma_data, # data with DV in column 1
burn = 50000, # burn-in iterations
iter = 200000, # post-burn-in iterations
g = &amp;quot;BRIC&amp;quot;, # BRIC g-prior (robust default)
mprior = &amp;quot;uniform&amp;quot;, # uniform model prior
nmodel = 2000, # store top 2000 models
mcmc = &amp;quot;bd&amp;quot;, # birth-death MCMC sampler
user.int = FALSE # suppress interactive output
)
&lt;/code>&lt;/pre>
&lt;p>The key parameters deserve explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>burn = 50,000&lt;/strong>: the first 50,000 MCMC draws are discarded as &amp;ldquo;burn-in&amp;rdquo; to ensure the sampler has converged to the posterior distribution&lt;/li>
&lt;li>&lt;strong>iter = 200,000&lt;/strong>: the next 200,000 draws are used for inference&lt;/li>
&lt;li>&lt;strong>g = &amp;ldquo;BRIC&amp;rdquo;&lt;/strong>: the Benchmark Risk Inflation Criterion prior on the regression coefficients, a robust default choice&lt;/li>
&lt;li>&lt;strong>mprior = &amp;ldquo;uniform&amp;rdquo;&lt;/strong>: every model is equally likely a priori, so the posterior is driven entirely by the data&lt;/li>
&lt;/ul>
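&lt;p>To build intuition for what the sampler is doing, here is a minimal base-R sketch of the birth-death idea on toy data: at each step, propose flipping one variable in or out of the model, and accept the move with a probability driven by how much the model evidence changes. This is an illustration only &amp;mdash; it uses raw BIC as an approximate log marginal likelihood, whereas &lt;code>BMS&lt;/code> uses g-prior likelihoods &amp;mdash; and all variables below are simulated for this sketch.&lt;/p>
&lt;pre>&lt;code class="language-r"># Toy birth-death MCMC over models, using BIC as an approximate
# log marginal likelihood (illustration only; BMS uses g-priors)
set.seed(1)
n &amp;lt;- 200
x1 &amp;lt;- rnorm(n); x2 &amp;lt;- rnorm(n); x3 &amp;lt;- rnorm(n) # x3 is pure noise
y &amp;lt;- 1 + 0.8 * x1 + 0.5 * x2 + rnorm(n)
X &amp;lt;- cbind(x1 = x1, x2 = x2, x3 = x3)
# BIC of the model defined by a logical inclusion vector
bic_of &amp;lt;- function(inc) {
if (!any(inc)) return(BIC(lm(y ~ 1)))
BIC(lm(y ~ ., data = data.frame(y = y, X[, inc, drop = FALSE])))
}
inc &amp;lt;- c(TRUE, TRUE, TRUE) # start from the full model
tally &amp;lt;- c(x1 = 0, x2 = 0, x3 = 0)
burn &amp;lt;- 500; iter &amp;lt;- 2000
for (s in seq_len(burn + iter)) {
j &amp;lt;- sample(3, 1) # propose flipping one variable in/out
prop &amp;lt;- inc; prop[j] &amp;lt;- !prop[j]
# accept with probability min(1, exp(-0.5 * delta-BIC))
if (runif(1) &amp;lt; exp(-0.5 * (bic_of(prop) - bic_of(inc)))) inc &amp;lt;- prop
if (s &amp;gt; burn) tally &amp;lt;- tally + inc # count inclusions after burn-in
}
round(tally / iter, 2) # MCMC-estimated PIPs: high for x1, x2; low for x3
&lt;/code>&lt;/pre>
&lt;p>After discarding the burn-in draws, the fraction of retained iterations in which a variable appears is its MCMC estimate of the PIP &amp;mdash; exactly the quantity &lt;code>bms()&lt;/code> reports for the full 12-variable problem.&lt;/p>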
&lt;h3 id="82-pip-bar-chart">8.2 PIP bar chart&lt;/h3>
&lt;p>The PIP bar chart classifies each variable as robust (PIP $\geq$ 0.80), borderline (0.50&amp;ndash;0.80), or fragile (PIP $&amp;lt;$ 0.50). This visualization makes it easy to see which variables earn strong support across the model space and which are effectively irrelevant.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract PIPs and posterior means
bma_coefs &amp;lt;- coef(bma_fit)
bma_df &amp;lt;- as.data.frame(bma_coefs) |&amp;gt;
rownames_to_column(&amp;quot;variable&amp;quot;) |&amp;gt;
as_tibble() |&amp;gt;
rename(pip = PIP, post_mean = `Post Mean`, post_sd = `Post SD`) |&amp;gt;
select(variable, pip, post_mean, post_sd) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
robustness = case_when(
pip &amp;gt;= 0.80 ~ &amp;quot;Robust (PIP &amp;gt;= 0.80)&amp;quot;,
pip &amp;gt;= 0.50 ~ &amp;quot;Borderline&amp;quot;,
TRUE ~ &amp;quot;Fragile (PIP &amp;lt; 0.50)&amp;quot;
),
ci_low = post_mean - 2 * post_sd,
ci_high = post_mean + 2 * post_sd
)
# Plot PIPs
ggplot(bma_df, aes(x = reorder(variable, pip), y = pip, fill = robustness)) +
geom_col(width = 0.65) +
geom_hline(yintercept = 0.80, linetype = &amp;quot;dashed&amp;quot;) +
coord_flip() +
labs(x = NULL, y = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_04_bma_pip.png" alt="BMA Posterior Inclusion Probabilities. Green bars indicate robust variables with PIP greater than or equal to 0.80; teal bars indicate borderline variables; orange bars indicate fragile variables with PIP less than 0.50.">&lt;/p>
&lt;p>The PIP bar chart reveals a clear separation between signal and noise. GDP dominates with a PIP of 1.00, followed by trade_network (0.986), fossil_fuel (0.948), and industry (0.841) &amp;mdash; all with PIPs above the 0.80 robustness threshold. The noise variables (log_trade, fdi, corruption, log_tourism, log_credit) all have PIPs well below 0.15, confirming that BMA correctly classifies them as fragile. Urban_pop ($\beta = 0.010$, PIP = 0.648) and democracy ($\beta = 0.004$, PIP = 0.607) land in the borderline range &amp;mdash; true predictors whose effects are moderate enough that BMA hedges between including and excluding them. Agriculture ($\beta = 0.005$, PIP = 0.087) is classified as fragile, an honest reflection of the sample&amp;rsquo;s limited power to detect its very small effect.&lt;/p>
&lt;h3 id="83-posterior-coefficient-plot">8.3 Posterior coefficient plot&lt;/h3>
&lt;p>Beyond knowing &lt;em>which&lt;/em> variables matter, we want to know &lt;em>how much&lt;/em> they matter and how precisely they are estimated. The posterior coefficient plot displays the BMA-estimated effect size for each variable along with approximate 95% credible intervals (posterior mean $\pm$ 2 posterior standard deviations).&lt;/p>
&lt;pre>&lt;code class="language-r"># Coefficient plot with 95% credible intervals
ggplot(bma_df, aes(x = reorder(variable, pip), y = post_mean, color = robustness)) +
geom_pointrange(aes(ymin = ci_low, ymax = ci_high)) +
geom_hline(yintercept = 0, linetype = &amp;quot;solid&amp;quot;, color = &amp;quot;gray50&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_05_bma_coefs.png" alt="BMA posterior mean coefficients with approximate 95 percent credible intervals. Variables ordered by PIP. Robust variables have intervals that do not cross zero.">&lt;/p>
&lt;p>The posterior coefficient plot shows the BMA-estimated effect sizes with uncertainty bands. GDP&amp;rsquo;s posterior mean of approximately 1.19 closely recovers the true value of 1.200, and its 95% credible interval is narrow, reflecting high precision. Trade_network has a posterior mean of 0.87, overshooting its true value of 0.500 &amp;mdash; but its wide credible interval honestly reflects substantial estimation uncertainty. The noise variables and low-PIP variables like agriculture have posterior means shrunk very close to zero &amp;mdash; this is BMA&amp;rsquo;s shrinkage at work. Variables with low PIPs appear in few high-probability models, so their posterior means are averaged with many models where the coefficient is zero, pulling the estimate toward zero.&lt;/p>
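&lt;p>The shrinkage mechanism reduces to one line of arithmetic: the unconditional posterior mean is the PIP-weighted mixture of the model-specific estimates, with weight $(1 - \text{PIP})$ placed on zero. A quick check with hypothetical numbers (not taken from the fit above):&lt;/p>
&lt;pre>&lt;code class="language-r"># Unconditional posterior mean = PIP x (mean conditional on inclusion),
# with hypothetical numbers for a fragile variable
pip &amp;lt;- 0.10 # included in few high-probability models
cond_mean &amp;lt;- 0.05 # average coefficient when the variable is included
uncond_mean &amp;lt;- pip * cond_mean + (1 - pip) * 0 # zero when excluded
uncond_mean # 0.005: shrunk far below the conditional estimate
&lt;/code>&lt;/pre>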
&lt;h3 id="84-variable-inclusion-map">8.4 Variable-inclusion map&lt;/h3>
&lt;p>The variable-inclusion map shows &lt;em>which&lt;/em> variables appear in the highest-probability models and whether their coefficients are positive or negative. Unlike a simple heatmap, the &lt;strong>width of each column is proportional to the model&amp;rsquo;s posterior probability&lt;/strong> &amp;mdash; so wide columns represent models that the data strongly supports. The x-axis shows cumulative posterior model probability: if the first model has PMP = 0.15, it occupies the region from 0 to 0.15; the second model fills from 0.15 to 0.15 + its PMP, and so on. A solid band of color stretching across most of the x-axis means the variable appears in virtually every high-probability model.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract top 100 models and their coefficient estimates
top_coefs &amp;lt;- topmodels.bma(bma_fit)
n_top &amp;lt;- min(100, ncol(top_coefs))
top_coefs &amp;lt;- top_coefs[, 1:n_top]
# Extract posterior model probabilities (column 1 = exact PMPs;
# column 2 would hold the MCMC visit frequencies)
model_pmps &amp;lt;- pmp.bma(bma_fit)[1:n_top, 1]
# Cumulative x positions: each model's width = its PMP
cum_pmp &amp;lt;- c(0, cumsum(model_pmps))
# Order variables by PIP (highest at top)
var_order &amp;lt;- bma_df |&amp;gt; arrange(desc(pip)) |&amp;gt; pull(variable)
# Build rectangle data for every variable × model combination
rect_data &amp;lt;- expand.grid(
var_idx = seq_len(nrow(top_coefs)),
model_idx = seq_len(n_top)
) |&amp;gt;
mutate(
variable = rownames(top_coefs)[var_idx],
coef_value = mapply(function(v, m) top_coefs[v, m], var_idx, model_idx),
sign = case_when(
coef_value &amp;gt; 0 ~ &amp;quot;Positive&amp;quot;,
coef_value &amp;lt; 0 ~ &amp;quot;Negative&amp;quot;,
TRUE ~ &amp;quot;Not included&amp;quot;
),
xmin = cum_pmp[model_idx],
xmax = cum_pmp[model_idx + 1],
variable = factor(variable, levels = rev(var_order))
)
# Plot the variable-inclusion map
ggplot(rect_data, aes(xmin = xmin, xmax = xmax,
ymin = as.numeric(variable) - 0.45,
ymax = as.numeric(variable) + 0.45,
fill = sign)) +
geom_rect() +
scale_fill_manual(
name = &amp;quot;Coefficient&amp;quot;,
values = c(&amp;quot;Positive&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Negative&amp;quot; = &amp;quot;#d97757&amp;quot;,
&amp;quot;Not included&amp;quot; = &amp;quot;#d0cdc8&amp;quot;)
) +
scale_x_continuous(expand = c(0, 0),
labels = scales::label_number(accuracy = 0.1)) +
scale_y_continuous(breaks = seq_along(var_order),
labels = rev(var_order),
expand = c(0, 0)) +
labs(title = &amp;quot;Variable-Inclusion Map&amp;quot;,
subtitle = paste0(&amp;quot;Top &amp;quot;, n_top, &amp;quot; models shown out of &amp;quot;,
nrow(pmp.bma(bma_fit)), &amp;quot; visited&amp;quot;),
x = &amp;quot;Cumulative posterior model probability&amp;quot;,
y = NULL)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_06_bma_inclusion.png" alt="Variable-inclusion map showing the top 100 BMA models. The x-axis is cumulative posterior model probability, so wider columns represent more probable models. Blue indicates a positive coefficient, orange indicates a negative coefficient, and gray indicates the variable is not included. Variables are ordered by PIP from top to bottom.">&lt;/p>
&lt;p>The variable-inclusion map reveals clear structure. The top variables &amp;mdash; log_gdp, trade_network, fossil_fuel, and industry &amp;mdash; form solid blue bands stretching across nearly the entire x-axis, meaning they appear with positive coefficients in virtually every high-probability model. Urban_pop and democracy also show substantial inclusion, consistent with their borderline PIPs. In contrast, the noise variables (log_trade, fdi, corruption, log_tourism, log_credit) appear as mostly gray with occasional patches of blue or orange, indicating they enter and exit models sporadically and sometimes with the wrong sign. The fact that noise variables occasionally appear with negative coefficients (orange patches) is another sign of fragility &amp;mdash; their coefficient estimates are unstable because they have no true effect.&lt;/p>
&lt;h3 id="85-bma-results-vs-known-truth">8.5 BMA results vs. known truth&lt;/h3>
&lt;pre>&lt;code class="language-r"># Compare BMA results with the true DGP
bma_summary &amp;lt;- bma_df |&amp;gt;
mutate(
bma_robust = pip &amp;gt;= 0.80,
true_nonzero = true_beta != 0,
correct = bma_robust == true_nonzero
) |&amp;gt;
select(variable, true_beta, pip, post_mean, bma_robust, true_nonzero, correct)
print(bma_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta pip post_mean bma_robust true_nonzero correct
log_gdp 1.200 1.000 1.1854 TRUE TRUE TRUE
trade_network 0.500 0.986 0.8727 TRUE TRUE TRUE
fossil_fuel 0.012 0.948 0.0117 TRUE TRUE TRUE
industry 0.008 0.841 0.0142 TRUE TRUE TRUE
urban_pop 0.010 0.648 0.0049 FALSE TRUE FALSE
democracy 0.004 0.607 0.0066 FALSE TRUE FALSE
log_tourism 0.000 0.130 -0.0039 FALSE FALSE TRUE
log_credit 0.000 0.104 0.0051 FALSE FALSE TRUE
agriculture 0.005 0.087 -0.0002 FALSE TRUE FALSE
log_trade 0.000 0.084 -0.0037 FALSE FALSE TRUE
corruption 0.000 0.078 0.0026 FALSE FALSE TRUE
fdi 0.000 0.077 -0.0000 FALSE FALSE TRUE
&lt;/code>&lt;/pre>
&lt;p>BMA correctly classifies 9 of 12 variables. The four strongest true predictors (GDP, trade_network, fossil_fuel, industry) all receive PIPs above 0.80 &amp;mdash; these are the &amp;ldquo;robust&amp;rdquo; determinants. All five noise variables receive PIPs below 0.15 &amp;mdash; correctly identified as fragile. Urban_pop (PIP = 0.648) and democracy (PIP = 0.607) fall in the borderline range &amp;mdash; they are true predictors, but BMA&amp;rsquo;s conservative Occam&amp;rsquo;s razor hedges because their effects are moderate. Agriculture ($\beta = 0.005$, PIP = 0.087) is missed entirely. This reveals an important nuance: BMA prioritizes precision over sensitivity. It would rather miss a small true effect than falsely include a noise variable.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> BMA on all 12 variables correctly gives high PIPs to the strong true predictors (GDP, trade network, fossil fuel, industry) and low PIPs to the noise variables. Variables with moderate or small true effects may land in the borderline zone. The variable-inclusion map shows that the top models consistently include the core predictors.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #d97757 0%, #d97757 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 2: LASSO
&lt;/div>
&lt;h2 id="9-regularization-----adding-a-penalty">9. Regularization &amp;mdash; Adding a Penalty&lt;/h2>
&lt;h3 id="91-the-bias-variance-tradeoff">9.1 The bias-variance tradeoff&lt;/h3>
&lt;p>OLS is an &lt;strong>unbiased&lt;/strong> estimator &amp;mdash; on average, it gets the coefficients right. But with many correlated regressors, OLS coefficients have &lt;strong>high variance&lt;/strong>: they bounce around from sample to sample. Adding or removing a single variable can drastically change the estimates.&lt;/p>
&lt;p>The key insight of regularization is that a &lt;strong>little bias can buy a lot of variance reduction&lt;/strong>, lowering the overall prediction error. The &lt;strong>total error&lt;/strong> of a prediction decomposes as:&lt;/p>
&lt;p>$$
\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}
$$&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_02_bias_variance.png" alt="The bias-variance tradeoff. As model complexity increases (more variables, less regularization), bias decreases but variance increases. The optimal point is a compromise between the two, minimizing total MSE.">&lt;/p>
&lt;p>The figure illustrates the fundamental tradeoff. At low complexity (strong regularization), bias is high but variance is low. At high complexity (weak or no regularization, like OLS), bias is near zero but variance explodes. The optimal point lies in between &amp;mdash; this is exactly where regularized methods like LASSO operate. Think of the penalty as a &amp;ldquo;budget constraint&amp;rdquo; on coefficient sizes: variables that do not contribute enough to prediction are not worth the cost, so their coefficients are set to zero.&lt;/p>
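&lt;p>A small simulation makes the decomposition concrete. Consider estimating a normal mean $\mu$ with the shrunk estimator $c \cdot \bar{x}$ for several values of $c$: shrinking ($c &amp;lt; 1$) introduces bias but reduces variance, and for a weak signal the shrunk estimators beat the unbiased one on total MSE. The numbers below are illustrative, not from the EKC data.&lt;/p>
&lt;pre>&lt;code class="language-r"># MSE = Bias^2 + Variance for the shrinkage estimator c * mean(x)
set.seed(42)
mu &amp;lt;- 0.5; n &amp;lt;- 5; reps &amp;lt;- 20000
mse_of &amp;lt;- function(c_shrink) {
e &amp;lt;- replicate(reps, c_shrink * mean(rnorm(n, mean = mu)))
bias2 &amp;lt;- (mean(e) - mu)^2 # squared bias of the estimator
variance &amp;lt;- var(e) # sampling variance of the estimator
c(bias2 = bias2, variance = variance, mse = bias2 + variance)
}
res &amp;lt;- sapply(c(c1 = 1, c07 = 0.7, c05 = 0.5), mse_of)
round(res, 4) # c = 0.7 and c = 0.5 beat the unbiased c = 1 on total MSE
&lt;/code>&lt;/pre>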
&lt;h2 id="10-l1-vs-l2-geometry">10. L1 vs. L2 Geometry&lt;/h2>
&lt;h3 id="101-the-lasso-l1-penalty">10.1 The LASSO (L1) penalty&lt;/h3>
&lt;p>The LASSO solves the following optimization problem:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{LASSO}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\frac{1}{2n}\|y - X\beta\|^2$ is the &lt;strong>sum of squared residuals&lt;/strong> (the usual OLS loss, scaled)&lt;/li>
&lt;li>$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the &lt;strong>L1 norm&lt;/strong> (sum of absolute values)&lt;/li>
&lt;li>$\lambda \geq 0$ is the &lt;strong>regularization parameter&lt;/strong>: it controls how much we penalize large coefficients. When $\lambda = 0$, LASSO reduces to OLS. As $\lambda \to \infty$, all coefficients are shrunk to zero.&lt;/li>
&lt;/ul>
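&lt;p>In the special case of orthonormal predictors ($X'X/n = I$), this optimization has a closed-form solution: soft-thresholding of the OLS coefficients, $\hat{\beta}_j = \text{sign}(b_j)\max(|b_j| - \lambda, 0)$. A quick sketch with hypothetical OLS estimates:&lt;/p>
&lt;pre>&lt;code class="language-r"># Soft-thresholding: the LASSO solution under an orthonormal design
soft_threshold &amp;lt;- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
b_ols &amp;lt;- c(1.20, 0.50, 0.03, -0.02) # hypothetical OLS estimates
soft_threshold(b_ols, lambda = 0.05) # 1.15 0.45 0.00 0.00
&lt;/code>&lt;/pre>
&lt;p>Large coefficients are shifted toward zero by $\lambda$; coefficients smaller than $\lambda$ in absolute value are set &lt;em>exactly&lt;/em> to zero &amp;mdash; this is the selection property in miniature.&lt;/p>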
&lt;h3 id="102-the-ridge-l2-penalty">10.2 The Ridge (L2) penalty&lt;/h3>
&lt;p>For comparison, &lt;strong>Ridge regression&lt;/strong> uses the L2 norm instead:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_2^2
$$&lt;/p>
&lt;p>where $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the sum of squared coefficients.&lt;/p>
&lt;h3 id="103-why-lasso-selects-variables-and-ridge-does-not">10.3 Why LASSO selects variables and Ridge does not&lt;/h3>
&lt;p>The geometric explanation is one of the most elegant ideas in modern statistics. The constraint region for LASSO (L1) is a &lt;strong>diamond&lt;/strong>, while the constraint region for Ridge (L2) is a &lt;strong>circle&lt;/strong>. When the elliptical OLS contours meet the diamond, they typically hit a &lt;strong>corner&lt;/strong>, where one or more coefficients are exactly zero. When they meet the circle, they hit a smooth curve &amp;mdash; coefficients are shrunk but never exactly zero.&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_03_l1_l2_geometry.png" alt="Side-by-side comparison of L1 and L2 constraint geometry. Left panel shows the LASSO diamond where OLS contours hit a corner, setting beta-1 to exactly zero. Right panel shows the Ridge circle where contours hit a smooth boundary, producing no exact zeros.">&lt;/p>
&lt;p>The key insight: &lt;strong>the L1 diamond has corners where coefficients are exactly zero &amp;mdash; this is why LASSO selects variables.&lt;/strong> The L2 circle has no corners, so Ridge shrinks coefficients toward zero but never reaches it. LASSO performs &lt;em>simultaneous estimation and variable selection&lt;/em>; Ridge only estimates.&lt;/p>
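&lt;p>The same contrast shows up in a one-dimensional version of the two problems, which can be minimized numerically. With a single coefficient and a penalty larger than the OLS estimate, the L1 solution lands (numerically) at exactly zero, while the L2 solution is merely shrunk. The numbers below are illustrative:&lt;/p>
&lt;pre>&lt;code class="language-r"># One-dimensional LASSO vs. Ridge: minimize (b - b_ols)^2 / 2 + penalty
b_ols &amp;lt;- 0.3; lambda &amp;lt;- 0.5
lasso_obj &amp;lt;- function(b) (b - b_ols)^2 / 2 + lambda * abs(b)
ridge_obj &amp;lt;- function(b) (b - b_ols)^2 / 2 + lambda * b^2
b_lasso &amp;lt;- optimize(lasso_obj, c(-2, 2))$minimum
b_ridge &amp;lt;- optimize(ridge_obj, c(-2, 2))$minimum
round(c(lasso = b_lasso, ridge = b_ridge), 3)
# LASSO hits zero because |b_ols| &amp;lt; lambda; Ridge gives b_ols / (1 + 2 * lambda)
&lt;/code>&lt;/pre>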
&lt;h2 id="11-lasso-on-all-12-variables">11. LASSO on All 12 Variables&lt;/h2>
&lt;h3 id="111-running-lasso-with-cross-validation">11.1 Running LASSO with cross-validation&lt;/h3>
&lt;p>The LASSO has one tuning parameter: $\lambda$, which controls the strength of the penalty. Too small and we include noise; too large and we exclude true predictors. We choose $\lambda$ using &lt;strong>10-fold cross-validation&lt;/strong>: split the data into 10 folds, train on 9, predict the 10th, and repeat. The $\lambda$ that minimizes the average prediction error across folds is called &lt;strong>lambda.min&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for cross-validation folds
# Prepare the design matrix X and response vector y
X &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y &amp;lt;- synth_data$log_co2
# Run LASSO (alpha = 1) with 10-fold cross-validation
lasso_cv &amp;lt;- cv.glmnet(
x = X,
y = y,
alpha = 1, # alpha=1 is LASSO (alpha=0 is Ridge)
nfolds = 10,
standardize = TRUE # standardize predictors internally
)
&lt;/code>&lt;/pre>
&lt;h3 id="112-regularization-path">11.2 Regularization path&lt;/h3>
&lt;pre>&lt;code class="language-r"># Fit the full LASSO path
lasso_full &amp;lt;- glmnet(X, y, alpha = 1, standardize = TRUE)
# Reshape the coefficient matrix (one row per lambda) into long format
path_df &amp;lt;- as.data.frame(t(as.matrix(coef(lasso_full))[-1, ])) |&amp;gt;
mutate(log_lambda = log(lasso_full$lambda)) |&amp;gt;
pivot_longer(-log_lambda, names_to = &amp;quot;variable&amp;quot;, values_to = &amp;quot;coefficient&amp;quot;)
# Plot coefficient paths
ggplot(path_df, aes(x = log_lambda, y = coefficient, color = variable)) +
geom_line() +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.1se), linetype = &amp;quot;dotted&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_07_lasso_path.png" alt="LASSO regularization path showing how each variable&amp;rsquo;s coefficient changes as the penalty lambda increases from left to right. Steel blue lines represent true predictors, orange lines represent noise variables. GDP (the strongest predictor) is the last to be shrunk to zero.">&lt;/p>
&lt;p>The regularization path reveals the story of LASSO variable selection. Reading from left to right (increasing penalty), the noise variables (orange lines) are the first to be driven to zero &amp;mdash; they provide too little predictive value to justify their &amp;ldquo;cost&amp;rdquo; under the penalty. GDP (the strongest predictor with $\beta = 1.200$) persists the longest, requiring the largest penalty to be eliminated. The vertical lines mark lambda.min (minimum CV error) and lambda.1se (most parsimonious model within 1 SE of the minimum). The gap between them represents the tension between fitting the data well and keeping the model simple.&lt;/p>
&lt;h3 id="113-cross-validation-curve">11.3 Cross-validation curve&lt;/h3>
&lt;pre>&lt;code class="language-r"># Build the CV data frame from the cv.glmnet object
cv_df &amp;lt;- tibble(
log_lambda = log(lasso_cv$lambda),
mse = lasso_cv$cvm, # mean CV error
mse_lo = lasso_cv$cvlo, # mean - 1 SE
mse_hi = lasso_cv$cvup # mean + 1 SE
)
# Plot the CV curve
ggplot(cv_df, aes(x = log_lambda, y = mse)) +
geom_ribbon(aes(ymin = mse_lo, ymax = mse_hi), fill = &amp;quot;gray85&amp;quot;, alpha = 0.5) +
geom_line(color = &amp;quot;#6a9bcc&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.1se), linetype = &amp;quot;dotted&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_08_lasso_cv.png" alt="Ten-fold cross-validation curve for LASSO. The left dashed line marks lambda.min (minimum CV error); the right dotted line marks lambda.1se (most parsimonious model within 1 standard error of the minimum). The shaded band shows plus or minus 1 standard error.">&lt;/p>
&lt;p>The cross-validation curve shows how prediction error varies with the penalty strength. The curve has a characteristic U-shape: too little penalty (left) allows overfitting (high error from variance), while too much penalty (right) underfits (high error from bias). The &amp;ldquo;1 standard error rule&amp;rdquo; is a common default: since CV error estimates are noisy, any model within 1 SE of the best is statistically indistinguishable from the best. We prefer the simpler one (lambda.1se).&lt;/p>
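&lt;p>The selection logic of the 1-SE rule is easy to replicate by hand: find the minimum-error row, add one standard error to get a cutoff, and pick the largest penalty whose error is still under the cutoff. A sketch on a hypothetical CV table (not the fit above), with rows ordered from small to large penalty:&lt;/p>
&lt;pre>&lt;code class="language-r"># The 1-SE rule on a hypothetical CV table (rows ordered by increasing lambda)
cv_tab &amp;lt;- data.frame(
log_lambda = c(-6, -5, -4, -3, -2),
mse = c(0.210, 0.205, 0.204, 0.209, 0.230),
se = c(0.006, 0.006, 0.006, 0.006, 0.007)
)
i_min &amp;lt;- which.min(cv_tab$mse) # lambda.min analogue
cutoff &amp;lt;- cv_tab$mse[i_min] + cv_tab$se[i_min] # minimum error + 1 SE
i_1se &amp;lt;- max(which(cv_tab$mse &amp;lt;= cutoff)) # sparsest model under the cutoff
cv_tab$log_lambda[c(i_min, i_1se)] # -4 (lambda.min) and -3 (lambda.1se)
&lt;/code>&lt;/pre>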
&lt;h3 id="114-selected-variables">11.4 Selected variables&lt;/h3>
&lt;pre>&lt;code class="language-r"># Extract LASSO coefficients at lambda.1se
lasso_coefs_1se &amp;lt;- coef(lasso_cv, s = &amp;quot;lambda.1se&amp;quot;)
lasso_df &amp;lt;- tibble(
variable = rownames(lasso_coefs_1se)[-1],
lasso_coef = as.numeric(lasso_coefs_1se)[-1]
) |&amp;gt;
mutate(
selected = lasso_coef != 0,
true_beta = true_beta_lookup[variable],
is_noise = true_beta == 0,
bar_color = case_when(
!selected ~ &amp;quot;Not selected&amp;quot;,
is_noise ~ &amp;quot;Noise (false positive)&amp;quot;,
TRUE ~ &amp;quot;True predictor (correct)&amp;quot;
)
)
# Plot selected variables
ggplot(lasso_df, aes(x = reorder(variable, abs(lasso_coef)), y = lasso_coef, fill = bar_color)) +
geom_col(width = 0.6) + coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_09_lasso_selected.png" alt="LASSO-selected variables at lambda.1se. Steel blue bars indicate true predictors correctly retained; orange bars indicate noise variables falsely included (if any). Gray bars show variables not selected.">&lt;/p>
&lt;p>At lambda.1se, LASSO selects a sparse subset of the 12 candidate variables. The selected variables are shown with colored bars: steel blue for true predictors correctly retained, orange for any noise variables falsely included. Variables with zero coefficients (gray) have been excluded by the LASSO penalty. The key question is: did LASSO keep the right variables and drop the right ones?&lt;/p>
&lt;h2 id="12-post-lasso">12. Post-LASSO&lt;/h2>
&lt;p>LASSO coefficients are &lt;strong>biased&lt;/strong> because the L1 penalty shrinks them toward zero. The selected variables are correct (we hope), but the coefficient values are too small. This is by design &amp;mdash; the penalty trades bias for variance reduction &amp;mdash; but for &lt;em>interpretation&lt;/em> we want unbiased estimates.&lt;/p>
&lt;p>The fix is simple: &lt;strong>Post-LASSO&lt;/strong> (Belloni and Chernozhukov, 2013). Run OLS using only the variables that LASSO selected. The LASSO does the selection; OLS does the estimation.&lt;/p>
&lt;pre>&lt;code class="language-r"># Identify which variables LASSO selected at lambda.1se
selected_vars &amp;lt;- lasso_df |&amp;gt; filter(selected) |&amp;gt; pull(variable)
# Build the Post-LASSO formula
post_lasso_formula &amp;lt;- as.formula(
paste(&amp;quot;log_co2 ~&amp;quot;, paste(selected_vars, collapse = &amp;quot; + &amp;quot;))
)
# Run OLS on the selected variables only
post_lasso_fit &amp;lt;- lm(post_lasso_formula, data = synth_data)
# Compare: LASSO vs Post-LASSO vs True coefficients
post_lasso_summary &amp;lt;- broom::tidy(post_lasso_fit) |&amp;gt;
filter(term != &amp;quot;(Intercept)&amp;quot;) |&amp;gt;
rename(variable = term, post_lasso_coef = estimate) |&amp;gt;
select(variable, post_lasso_coef) |&amp;gt;
left_join(lasso_df |&amp;gt; select(variable, lasso_coef, true_beta), by = &amp;quot;variable&amp;quot;)
print(post_lasso_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable lasso_coef post_lasso_coef true_beta
log_gdp 1.1899 1.1646 1.200
industry 0.0090 0.0176 0.008
fossil_fuel 0.0072 0.0118 0.012
urban_pop 0.0041 0.0078 0.010
democracy 0.0046 0.0113 0.004
trade_network 0.6309 0.8978 0.500
&lt;/code>&lt;/pre>
&lt;p>Notice how the Post-LASSO coefficients are closer to the true values than the raw LASSO coefficients. For example, fossil_fuel&amp;rsquo;s LASSO coefficient is 0.007 (shrunk from the true 0.012), but the Post-LASSO estimate is 0.012 &amp;mdash; recovering the truth almost exactly. Similarly, urban_pop recovers from 0.004 (LASSO) to 0.008 (Post-LASSO), closer to the true value of 0.010. Trade_network&amp;rsquo;s Post-LASSO estimate (0.898) overshoots the true value (0.500), reflecting the difficulty of precisely estimating a coefficient on a low-variance variable. The LASSO selected the right variables; Post-LASSO recovered unbiased magnitudes.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> LASSO coefficients are shrunk toward zero by design. Post-LASSO runs OLS on only the LASSO-selected variables, producing unbiased coefficient estimates while retaining the variable selection from LASSO.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #00d4c8 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #141413; font-size: 1.3em; font-weight: 600;">
PART 3: Weighted Average Least Squares (WALS)
&lt;/div>
&lt;h2 id="13-frequentist-model-averaging">13. Frequentist Model Averaging&lt;/h2>
&lt;p>WALS (Weighted Average Least Squares) is a &lt;strong>frequentist&lt;/strong> approach to model averaging. Like BMA, it averages over models instead of selecting just one. But unlike BMA, it does not require MCMC sampling or the specification of a full Bayesian prior.&lt;/p>
&lt;p>The key structural assumption is that regressors are split into two groups:&lt;/p>
&lt;p>$$
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$X_1$ are &lt;strong>focus regressors&lt;/strong>: variables you are certain belong in the model. In a cross-sectional setting, this is typically just the &lt;strong>intercept&lt;/strong>.&lt;/li>
&lt;li>$X_2$ are &lt;strong>auxiliary regressors&lt;/strong>: the 12 candidate variables whose inclusion is uncertain.&lt;/li>
&lt;li>$\beta_1$ are always estimated; $\beta_2$ are the coefficients we are uncertain about.&lt;/li>
&lt;/ul>
&lt;p>WALS was introduced by Magnus, Powell, and Prufer (2010) and offers a compelling advantage over BMA: &lt;strong>it is extremely fast&lt;/strong>. While BMA explores thousands or millions of models via MCMC, WALS uses a mathematical trick to reduce the problem to $K$ independent averaging problems &amp;mdash; one per auxiliary variable.&lt;/p>
&lt;h2 id="14-the-semi-orthogonal-transformation">14. The Semi-Orthogonal Transformation&lt;/h2>
&lt;h3 id="why-correlated-variables-make-averaging-hard">Why correlated variables make averaging hard&lt;/h3>
&lt;p>In our synthetic data, GDP is correlated with fossil fuel use, urbanization, and even with the noise variables. This means that the decision to include one variable affects the importance of another. If GDP is in the model, fossil fuel&amp;rsquo;s coefficient is partially &amp;ldquo;absorbed&amp;rdquo; by GDP.&lt;/p>
&lt;p>In BMA, this problem is handled by averaging over all model combinations &amp;mdash; but at a high computational cost ($2^{12} = 4,096$ models). WALS uses a different strategy: &lt;strong>transform the auxiliary variables so they become orthogonal&lt;/strong> (uncorrelated with each other). Once orthogonal, each variable can be averaged independently.&lt;/p>
&lt;h3 id="the-mathematical-trick">The mathematical trick&lt;/h3>
&lt;p>The semi-orthogonal transformation works as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Remove the influence of focus regressors&lt;/strong>: project out $X_1$ from both $y$ and $X_2$, obtaining residuals $\tilde{y}$ and $\tilde{X}_2$.&lt;/li>
&lt;li>&lt;strong>Orthogonalize the auxiliaries&lt;/strong>: apply a rotation matrix $P$ (from the eigendecomposition of $\tilde{X}_2'\tilde{X}_2$) to create $Z = \tilde{X}_2 P$, where $Z'Z$ is diagonal.&lt;/li>
&lt;li>&lt;strong>Average independently&lt;/strong>: because the columns of $Z$ are orthogonal, the model-averaging problem decomposes into $K$ independent problems. Each transformed variable is averaged separately.&lt;/li>
&lt;/ol>
&lt;p>The computational savings grow dramatically: with 12 variables, we solve &lt;strong>12 independent problems&lt;/strong> instead of enumerating 4,096 models. Think of it as untangling a web of correlated strings until each hangs independently &amp;mdash; once separated, you can measure each string&amp;rsquo;s pull without interference from the others.&lt;/p>
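&lt;p>The transformation can be verified numerically. The sketch below is a language-agnostic illustration in Python with NumPy (not part of this tutorial&amp;rsquo;s R workflow, and the data it generates are hypothetical): it builds correlated auxiliary regressors, projects out an intercept-only focus block by demeaning, rotates by the eigenvectors of $\tilde{X}_2'\tilde{X}_2$, and confirms that $Z'Z$ is diagonal.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical correlated auxiliary regressors (stand-ins for GDP, fossil
# fuel use, urbanization, ...): every column loads on one common factor.
n, k = 200, 4
common = rng.normal(size=(n, 1))
X2 = common + 0.5 * rng.normal(size=(n, k))

# Step 1: project out the focus regressors. With an intercept-only X1,
# this reduces to demeaning each column.
X2_tilde = X2 - X2.mean(axis=0)

# Step 2: rotate by the eigenvectors P of X2_tilde' X2_tilde.
eigvals, P = np.linalg.eigh(X2_tilde.T @ X2_tilde)
Z = X2_tilde @ P

# Step 3's premise: Z'Z is diagonal, so each transformed variable can be
# averaged independently of the others.
ZtZ = Z.T @ Z
off_diag = np.max(np.abs(ZtZ - np.diag(np.diag(ZtZ))))
print(off_diag)  # zero up to floating-point error
```

&lt;p>Because the rotation is just a change of basis, no information is lost: the original coefficients are recovered by rotating the averaged estimates back with $P$.&lt;/p>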
&lt;h2 id="15-the-laplace-prior">15. The Laplace Prior&lt;/h2>
&lt;p>WALS requires a prior distribution for the transformed coefficients. The default and recommended choice is the &lt;strong>Laplace (double-exponential) prior&lt;/strong>:&lt;/p>
&lt;p>$$
p(\gamma_j) \propto \exp(-|\gamma_j| / \tau)
$$&lt;/p>
&lt;p>where $\gamma_j$ is the transformed coefficient and $\tau$ controls the spread. The Laplace prior has two key features:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Peaked at zero&lt;/strong>: it encodes &lt;em>skepticism&lt;/em> &amp;mdash; the prior believes most variables probably have small effects&lt;/li>
&lt;li>&lt;strong>Heavy tails&lt;/strong>: it allows large effects if the data strongly supports them &amp;mdash; variables with strong signal can &amp;ldquo;break through&amp;rdquo; the prior&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="bma_lasso_wals_11_priors.png" alt="Three prior distributions used in model averaging. The Laplace prior (used by WALS) is peaked at zero with heavy tails. The Normal prior (used by BMA g-prior) is also centered at zero but has thinner tails. The Uniform prior assigns equal weight everywhere.">&lt;/p>
&lt;h3 id="the-deep-connection-to-lasso">The deep connection to LASSO&lt;/h3>
&lt;p>Here is a remarkable fact: &lt;strong>the LASSO&amp;rsquo;s L1 penalty is the negative log of a Laplace prior&lt;/strong>. The MAP (maximum a posteriori) estimate under a Laplace prior is:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \frac{\sigma^2}{\tau} \sum_{j=1}^{p}|\beta_j|
$$&lt;/p>
&lt;p>This is identical to the LASSO objective with $\lambda = \sigma^2 / \tau$. The LASSO penalty and the Laplace prior are two sides of the same coin.&lt;/p>
&lt;p>This means &lt;strong>LASSO and WALS encode the same prior belief&lt;/strong> &amp;mdash; that most coefficients are probably zero or small &amp;mdash; but they use it differently:&lt;/p>
&lt;ul>
&lt;li>LASSO uses the Laplace prior for &lt;strong>selection&lt;/strong>: it finds the single most probable model (the MAP estimate), which sets some coefficients to exactly zero&lt;/li>
&lt;li>WALS uses the Laplace prior for &lt;strong>averaging&lt;/strong>: it averages over all models, weighted by the Laplace prior, producing continuous (nonzero) coefficient estimates with uncertainty measures&lt;/li>
&lt;/ul>
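&lt;p>The MAP&amp;ndash;LASSO equivalence can be checked numerically in a setting simple enough to admit a closed form. The sketch below is an illustration in Python with NumPy (not part of the R workflow; the design matrix, penalty level, and coefficients are all hypothetical): under an orthonormal design with $X'X = nI$, the minimizer of the penalized objective is soft-thresholded OLS, and the code verifies the KKT optimality conditions directly.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6
lam = 0.1  # hypothetical penalty level; plays the role of sigma^2 / tau

# Orthonormal design scaled so that X'X = n * I (assumed only to obtain
# a closed-form solution)
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q

beta_true = np.array([1.2, 0.5, 0.3, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.25, size=n)

# With X'X = n*I, the LASSO / Laplace-MAP estimate is soft-thresholded OLS
b_ols = X.T @ y / n
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

# KKT check: on active coordinates the gradient of the smooth part equals
# lam * sign(beta); on inactive ones it stays inside [-lam, lam]
grad = X.T @ (y - X @ b_lasso) / n
active = b_lasso != 0
assert np.allclose(grad[active], lam * np.sign(b_lasso[active]))
assert np.all(np.abs(grad[~active]) <= lam + 1e-8)
print(b_lasso.round(3))  # true zeros dropped exactly; nonzeros shrunk by lam
```

&lt;p>The nonzero estimates are shrunk toward zero by exactly $\lambda$ &amp;mdash; the same shrinkage that motivates the Post-LASSO refit discussed earlier.&lt;/p>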
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> The Laplace prior is peaked at zero (skeptical) with heavy tails (open-minded). It is the same prior that underlies LASSO&amp;rsquo;s L1 penalty. LASSO uses it for hard selection (zeros vs. nonzeros); WALS uses it for soft averaging (continuous weights).&lt;/p>
&lt;/blockquote>
&lt;h2 id="16-wals-on-all-12-variables">16. WALS on All 12 Variables&lt;/h2>
&lt;h3 id="161-running-wals">16.1 Running WALS&lt;/h3>
&lt;pre>&lt;code class="language-r"># WALS splits regressors into two groups:
# X1 = focus regressors (always included): just the intercept
# X2 = auxiliary regressors (uncertain): our 12 candidate variables
# Prepare the focus regressor matrix (intercept only)
X1_wals &amp;lt;- matrix(1, nrow = nrow(synth_data), ncol = 1)
colnames(X1_wals) &amp;lt;- &amp;quot;(Intercept)&amp;quot;
# Prepare the auxiliary regressor matrix (all 12 candidates)
X2_wals &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y_wals &amp;lt;- synth_data$log_co2
# Fit WALS with the Laplace prior (the recommended default)
wals_fit &amp;lt;- wals(
x = X1_wals, # focus regressors (intercept)
x2 = X2_wals, # auxiliary regressors (12 candidates)
y = y_wals, # response variable
prior = laplace() # Laplace prior for auxiliaries
)
wals_summary &amp;lt;- summary(wals_fit)
&lt;/code>&lt;/pre>
&lt;p>The WALS function call is remarkably concise. Unlike BMA, there is no MCMC sampling, no burn-in period, and no convergence diagnostics to worry about. The computation is essentially instantaneous.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract results
aux_coefs &amp;lt;- wals_summary$auxCoefs
wals_df &amp;lt;- tibble(
variable = rownames(aux_coefs),
estimate = aux_coefs[, &amp;quot;Estimate&amp;quot;],
se = aux_coefs[, &amp;quot;Std. Error&amp;quot;],
t_stat = estimate / se
) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
abs_t = abs(t_stat),
wals_robust = abs_t &amp;gt;= 2
)
print(wals_df |&amp;gt; arrange(desc(abs_t)) |&amp;gt; select(variable, estimate, t_stat, true_beta))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable estimate t_stat true_beta
log_gdp 1.1333 34.62 1.200
trade_network 0.8458 4.39 0.500
industry 0.0187 4.01 0.008
fossil_fuel 0.0099 3.26 0.012
urban_pop 0.0082 3.11 0.010
democracy 0.0097 2.58 0.004
log_credit 0.0659 1.43 0.000
agriculture -0.0046 -1.13 0.005
log_tourism -0.0148 -0.64 0.000
log_trade 0.0196 0.31 0.000
fdi -0.0011 -0.17 0.000
corruption -0.0165 -0.09 0.000
&lt;/code>&lt;/pre>
&lt;p>WALS produces familiar t-statistics for each auxiliary variable. Using the $|t| \geq 2$ threshold as our robustness criterion (analogous to BMA&amp;rsquo;s PIP $\geq$ 0.80), we can classify each variable as robust or fragile.&lt;/p>
&lt;h3 id="162-t-statistic-bar-chart">16.2 t-statistic bar chart&lt;/h3>
&lt;p>The t-statistic bar chart provides a visual summary of the WALS robustness classification: variables with $|t| \geq 2$ pass the threshold, while those below it are considered fragile.&lt;/p>
&lt;pre>&lt;code class="language-r"># Classify each variable for the bar chart
wals_df &amp;lt;- wals_df |&amp;gt;
mutate(
true_nonzero = true_beta != 0, # flag variables with a nonzero true effect
bar_color = case_when(
wals_robust &amp;amp; true_nonzero ~ &amp;quot;True positive&amp;quot;,
wals_robust &amp;amp; !true_nonzero ~ &amp;quot;False positive&amp;quot;,
!wals_robust &amp;amp; true_nonzero ~ &amp;quot;False negative&amp;quot;,
TRUE ~ &amp;quot;True negative&amp;quot;
)
)
ggplot(wals_df, aes(x = reorder(variable, abs_t), y = t_stat, fill = bar_color)) +
geom_col(width = 0.6) +
geom_hline(yintercept = c(-2, 2), linetype = &amp;quot;dashed&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_10_wals_tstat.png" alt="WALS t-statistics for all 12 variables. The dashed lines mark the t equals 2 robustness threshold. Variables with absolute t-statistic greater than or equal to 2 are considered robust.">&lt;/p>
&lt;p>The t-statistic bar chart shows a clear separation. GDP towers above all others with $|t| = 34.62$, followed by trade_network ($|t| = 4.39$), industry ($|t| = 4.01$), fossil_fuel ($|t| = 3.26$), urban_pop ($|t| = 3.11$), and democracy ($|t| = 2.58$). These six variables pass the $|t| \geq 2$ threshold. The noise variables all have $|t| &amp;lt; 1.5$, confirming they are not robust determinants. Agriculture ($|t| = 1.13$) falls below the robustness threshold &amp;mdash; its true effect ($\beta = 0.005$) is simply too small to detect reliably with this sample size.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> WALS produces t-statistics for each auxiliary variable. Using the $|t| \geq 2$ threshold, we can classify variables as robust or fragile. WALS is extremely fast (no MCMC) and provides a frequentist complement to BMA&amp;rsquo;s Bayesian PIPs.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #1a3a8a 0%, #141413 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 4: Grand Comparison
&lt;/div>
&lt;h2 id="17-three-methods-same-question-same-data">17. Three Methods, Same Question, Same Data&lt;/h2>
&lt;p>We have now applied all three methods to the same synthetic dataset. Time for the moment of truth: &lt;strong>which variables do all three methods agree on?&lt;/strong>&lt;/p>
&lt;h3 id="171-comprehensive-comparison-table">17.1 Comprehensive comparison table&lt;/h3>
&lt;pre>&lt;code class="language-r"># Merge all results
grand_table &amp;lt;- bma_compare |&amp;gt;
left_join(lasso_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
left_join(wals_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
bma_robust = bma_pip &amp;gt;= 0.80,
n_methods = bma_robust + lasso_selected + wals_robust,
triple_robust = n_methods == 3,
true_nonzero = true_beta != 0
)
print(grand_table |&amp;gt;
select(variable, true_beta, bma_pip, bma_robust, lasso_selected, wals_t, wals_robust, n_methods) |&amp;gt;
arrange(desc(n_methods)))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta bma_pip bma_robust lasso_selected wals_t wals_robust n_methods
log_gdp 1.200 1.000 TRUE TRUE 34.62 TRUE 3
trade_network 0.500 0.986 TRUE TRUE 4.39 TRUE 3
fossil_fuel 0.012 0.948 TRUE TRUE 3.26 TRUE 3
industry 0.008 0.841 TRUE TRUE 4.01 TRUE 3
urban_pop 0.010 0.648 FALSE TRUE 3.11 TRUE 2
democracy 0.004 0.607 FALSE TRUE 2.58 TRUE 2
log_tourism 0.000 0.130 FALSE FALSE -0.64 FALSE 0
log_credit 0.000 0.104 FALSE FALSE 1.43 FALSE 0
agriculture 0.005 0.087 FALSE FALSE -1.13 FALSE 0
log_trade 0.000 0.084 FALSE FALSE 0.31 FALSE 0
corruption 0.000 0.078 FALSE FALSE -0.09 FALSE 0
fdi 0.000 0.077 FALSE FALSE -0.17 FALSE 0
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Four variables are &lt;strong>triple-robust&lt;/strong> &amp;mdash; identified by all three methods: log_gdp, trade_network, fossil_fuel, and industry. Two more variables &amp;mdash; urban_pop and democracy &amp;mdash; are &lt;strong>double-robust&lt;/strong>, selected by LASSO and WALS but landing in BMA&amp;rsquo;s borderline zone (PIPs of 0.648 and 0.607). All five noise variables are correctly excluded by all three methods. Agriculture ($\beta = 0.005$) is the only true predictor missed by all methods &amp;mdash; its effect is simply too small to detect.&lt;/p>
&lt;h3 id="172-method-agreement-heatmap">17.2 Method agreement heatmap&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_12_heatmap.png" alt="Method agreement heatmap showing 12 variables by 3 methods. Steel blue indicates the variable was identified as robust; orange indicates it was not. True predictors are in the top rows, noise variables in the bottom rows.">&lt;/p>
&lt;p>The heatmap provides a visual summary of agreement. The top four rows (GDP, trade_network, fossil_fuel, industry) are solid steel blue across all three columns &amp;mdash; unanimous agreement that these variables matter. Urban_pop and democracy show steel blue for LASSO and WALS but orange for BMA, visualizing BMA&amp;rsquo;s greater conservatism. The bottom five rows (noise) are solid orange &amp;mdash; unanimous agreement that they do not matter. Agriculture is also orange throughout, reflecting all methods' consensus that its tiny effect ($\beta = 0.005$) cannot be reliably distinguished from zero.&lt;/p>
&lt;h3 id="173-bma-pip-vs-wals-t-statistic">17.3 BMA PIP vs. WALS |t-statistic|&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_13_pip_vs_t.png" alt="BMA PIP plotted against WALS absolute t-statistic. Point color indicates true status (steel blue for true predictors, orange for noise). Point shape indicates LASSO selection (triangle for selected, cross for not selected). The upper-right quadrant contains variables robust by both BMA and WALS.">&lt;/p>
&lt;p>The scatter plot reveals a strong positive relationship between BMA PIP and WALS $|t|$. Variables in the upper-right quadrant are robust by both methods &amp;mdash; GDP, trade_network, fossil_fuel, and industry. Urban_pop and democracy sit in an interesting middle zone: high WALS $|t|$ (above 2) but moderate BMA PIP (below 0.80), illustrating BMA&amp;rsquo;s more conservative threshold. The noise variables cluster in the lower-left corner (low PIP, low $|t|$). LASSO selection (triangle markers) aligns with the WALS threshold, selecting the same six variables that pass $|t| \geq 2$.&lt;/p>
&lt;h3 id="174-coefficient-comparison">17.4 Coefficient comparison&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_14_coef_comparison.png" alt="Coefficient estimates from the three methods compared to the true values in a three-panel faceted scatter plot. Points close to the dashed 45-degree line indicate accurate coefficient recovery.">&lt;/p>
&lt;p>The coefficient comparison plot shows how well each method recovers the true effect sizes. Points on the dashed 45-degree line represent perfect recovery. GDP ($\beta = 1.200$) is recovered almost exactly by all three methods. The smaller coefficients (fossil_fuel at 0.012, urban_pop at 0.010) are also well-estimated. Trade_network&amp;rsquo;s coefficient is overestimated by all methods (true 0.500, estimates around 0.85&amp;ndash;0.90), reflecting the difficulty of precisely estimating an effect on a low-variance variable. BMA&amp;rsquo;s posterior means are slightly attenuated for variables with PIPs below 1.0 (the averaging shrinks them toward zero).&lt;/p>
&lt;h3 id="175-agreement-summary">17.5 Agreement summary&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_15_agreement.png" alt="Bar chart showing how many methods (out of 3) identified each variable as robust. Steel blue bars are true predictors, orange bars are noise variables. Four variables achieve triple-robust status and two achieve double-robust status.">&lt;/p>
&lt;p>The agreement bar chart tells a nuanced story: four variables are triple-robust (identified by all three methods), two are double-robust (identified by LASSO and WALS but not BMA), and six are identified by none. The &amp;ldquo;split votes&amp;rdquo; on urban_pop and democracy reveal a genuine methodological difference: LASSO and WALS are more liberal in including moderate-effect variables, while BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor demands stronger evidence. This pattern &amp;mdash; where methods &lt;em>mostly&lt;/em> agree but diverge on borderline cases &amp;mdash; is what makes methodological triangulation valuable.&lt;/p>
&lt;h3 id="176-method-performance">17.6 Method performance&lt;/h3>
&lt;pre>&lt;code class="language-r"># Sensitivity, specificity, and accuracy for each method
results_by_method &amp;lt;- tibble(
method = c(&amp;quot;BMA&amp;quot;, &amp;quot;LASSO&amp;quot;, &amp;quot;WALS&amp;quot;),
true_pos = c(4, 6, 6), # true predictors correctly identified
false_pos = c(0, 0, 0), # noise variables falsely identified
false_neg = c(3, 1, 1), # true predictors missed
true_neg = c(5, 5, 5), # noise variables correctly excluded
sensitivity = true_pos / 7,
specificity = true_neg / 5,
accuracy = (true_pos + true_neg) / 12
)
print(results_by_method)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> method true_pos false_pos false_neg true_neg sensitivity specificity accuracy
BMA 4 0 3 5 0.571 1.000 0.750
LASSO 6 0 1 5 0.857 1.000 0.917
WALS 6 0 1 5 0.857 1.000 0.917
&lt;/code>&lt;/pre>
&lt;p>All three methods achieve &lt;strong>perfect specificity&lt;/strong> (zero false positives) &amp;mdash; none mistakenly identifies a noise variable as robust. The key difference is in &lt;strong>sensitivity&lt;/strong>: LASSO and WALS each detect 6 of 7 true predictors (85.7%), while BMA detects only 4 (57.1%). BMA&amp;rsquo;s lower sensitivity reflects its conservative Bayesian Occam&amp;rsquo;s razor: it places urban_pop and democracy in the &amp;ldquo;borderline&amp;rdquo; zone rather than committing to their inclusion. The one variable missed by all methods &amp;mdash; agriculture ($\beta = 0.005$) &amp;mdash; has an effect so small that it is indistinguishable from noise given our sample size.&lt;/p>
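&lt;p>These metrics are simple ratios of the confusion-matrix counts. The helper below is a minimal sketch in Python rather than R (the counts are taken directly from the table above) and reproduces the reported numbers from their definitions: sensitivity $= TP/(TP+FN)$, specificity $= TN/(TN+FP)$, accuracy $= (TP+TN)/(TP+FP+FN+TN)$.&lt;/p>

```python
# Confusion-matrix metrics behind the performance table (counts from the text)
def selection_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy for a variable-selection method."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

counts = {"BMA": (4, 0, 3, 5), "LASSO": (6, 0, 1, 5), "WALS": (6, 0, 1, 5)}
for name, c in counts.items():
    print(name, {k: round(v, 3) for k, v in selection_metrics(*c).items()})
```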
&lt;h3 id="177-when-to-use-which-method">17.7 When to use which method&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Method&lt;/th>
&lt;th style="text-align:left">Best for&lt;/th>
&lt;th style="text-align:left">Strengths&lt;/th>
&lt;th style="text-align:left">Limitations&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">BMA&lt;/td>
&lt;td style="text-align:left">Full uncertainty quantification&lt;/td>
&lt;td style="text-align:left">Probabilistic (PIPs), handles model uncertainty formally, coefficient intervals&lt;/td>
&lt;td style="text-align:left">Slower (MCMC), requires prior specification&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">LASSO&lt;/td>
&lt;td style="text-align:left">Prediction, sparse models&lt;/td>
&lt;td style="text-align:left">Fast, automatic selection, works with many variables&lt;/td>
&lt;td style="text-align:left">Binary (in/out), biased coefficients (use Post-LASSO)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">WALS&lt;/td>
&lt;td style="text-align:left">Speed, frequentist inference&lt;/td>
&lt;td style="text-align:left">Very fast, produces t-statistics, no MCMC&lt;/td>
&lt;td style="text-align:left">Less common, limited software support&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The strongest recommendation: &lt;strong>use all three&lt;/strong>. When they converge on the same variables (as with our four triple-robust predictors), you have the strongest possible evidence. When they disagree (as with urban_pop and democracy, where LASSO and WALS say &amp;ldquo;yes&amp;rdquo; but BMA hedges), the disagreement itself is informative &amp;mdash; it tells you the evidence is real but not overwhelming. In real-world data, complications such as nonlinearity, heteroskedasticity, and endogeneity may affect method performance and should be addressed before applying these techniques.&lt;/p>
&lt;h2 id="18-conclusion">18. Conclusion&lt;/h2>
&lt;h3 id="181-summary">18.1 Summary&lt;/h3>
&lt;p>This tutorial introduced three principled approaches to the variable selection problem:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> averages over all possible models, weighting each by its posterior probability. It produces Posterior Inclusion Probabilities (PIPs) that quantify how robust each variable is across the entire model space. Variables with PIP $\geq$ 0.80 are considered robust.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO&lt;/strong> adds an L1 penalty to the OLS objective, forcing irrelevant coefficients to exactly zero. Cross-validation selects the penalty strength. Post-LASSO recovers unbiased coefficient estimates for the selected variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>WALS&lt;/strong> uses a semi-orthogonal transformation to decompose the model-averaging problem into independent subproblems &amp;mdash; one per variable. It is extremely fast and produces familiar t-statistics for robustness assessment.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="182-key-takeaways">18.2 Key takeaways&lt;/h3>
&lt;p>&lt;strong>The methods mostly converge &amp;mdash; and their disagreements are informative.&lt;/strong> Four variables are identified by all three methods (triple-robust), and all methods achieve perfect specificity (zero false positives). LASSO and WALS are more sensitive (detecting 6 of 7 true predictors), while BMA is more conservative (detecting 4). The two variables where they disagree &amp;mdash; urban_pop and democracy &amp;mdash; have moderate effects that BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor treats as borderline. This pattern illustrates the value of methodological triangulation across fundamentally different statistical paradigms.&lt;/p>
&lt;p>&lt;strong>Model uncertainty is real but addressable.&lt;/strong> With 12 candidate variables, there are 4,096 possible models. Rather than pretending one of them is &amp;ldquo;the&amp;rdquo; model, these methods account for the uncertainty explicitly. The result is more honest inference.&lt;/p>
&lt;p>&lt;strong>Synthetic data lets us verify.&lt;/strong> Because we designed the data-generating process, we could check each method&amp;rsquo;s performance against the known truth. In practice, the truth is unknown &amp;mdash; which is precisely why using multiple methods is so valuable.&lt;/p>
&lt;h3 id="183-applying-this-to-your-own-research">18.3 Applying this to your own research&lt;/h3>
&lt;p>The code in this tutorial is designed to be modular. To apply these methods to your own data:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Replace the CSV&lt;/strong>: load your own cross-sectional dataset instead of the synthetic one&lt;/li>
&lt;li>&lt;strong>Define the variable list&lt;/strong>: specify which variables are candidates for selection&lt;/li>
&lt;li>&lt;strong>Run the three methods&lt;/strong>: use the same &lt;code>bms()&lt;/code>, &lt;code>cv.glmnet()&lt;/code>, and &lt;code>wals()&lt;/code> function calls&lt;/li>
&lt;li>&lt;strong>Compare results&lt;/strong>: build the same comparison table and heatmap&lt;/li>
&lt;/ol>
&lt;p>The interpretation framework &amp;mdash; PIPs for BMA, selection for LASSO, t-statistics for WALS &amp;mdash; applies regardless of the specific dataset.&lt;/p>
&lt;h3 id="184-further-reading">18.4 Further reading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>BMA&lt;/strong>: Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). &amp;ldquo;Bayesian Model Averaging: A Tutorial.&amp;rdquo; &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>&lt;strong>LASSO&lt;/strong>: Tibshirani, R. (1996). &amp;ldquo;Regression Shrinkage and Selection via the Lasso.&amp;rdquo; &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>&lt;strong>WALS&lt;/strong>: Magnus, J.R., Powell, O., and Prufer, P. (2010). &amp;ldquo;A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics.&amp;rdquo; &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>&lt;strong>Application&lt;/strong>: Aller, C., Ductor, L., and Grechyna, D. (2021). &amp;ldquo;Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions.&amp;rdquo; &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>&lt;strong>Post-LASSO&lt;/strong>: Belloni, A. and Chernozhukov, V. (2013). &amp;ldquo;Least Squares After Model Selection in High-Dimensional Sparse Models.&amp;rdquo; &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;li>&lt;strong>R Packages&lt;/strong>: &lt;a href="https://cran.r-project.org/web/packages/BMS/vignettes/bms.pdf" target="_blank" rel="noopener">BMS vignette&lt;/a>, &lt;a href="https://glmnet.stanford.edu/articles/glmnet.html" target="_blank" rel="noopener">glmnet vignette&lt;/a>, &lt;a href="https://cran.r-project.org/package=WALS" target="_blank" rel="noopener">WALS package&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>Magnus, J.R., Powell, O., and Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>Raftery, A.E. (1995). Bayesian Model Selection in Social Research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/li>
&lt;li>Aller, C., Ductor, L., and Grechyna, D. (2021). Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions. &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>Belloni, A. and Chernozhukov, V. (2013). Least Squares After Model Selection in High-Dimensional Sparse Models. &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;/ol></description></item><item><title>Exploratory Spatial Data Analysis: Spatial Clusters and Dynamics of Human Development in South America</title><link>https://carlos-mendez.org/post/python_esda2/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_esda2/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>When we look at a map of human development across South America, a pattern immediately stands out: prosperous regions tend to cluster together, and so do lagging regions. But is this clustering statistically significant, or could it arise by chance? And how have these spatial clusters evolved over time?&lt;/p>
&lt;p>&lt;strong>Exploratory Spatial Data Analysis (ESDA)&lt;/strong> provides the tools to answer these questions. ESDA is a set of techniques for visualizing spatial distributions, identifying patterns of spatial clustering, and detecting spatial outliers. Unlike standard exploratory data analysis, which treats observations as independent, ESDA explicitly accounts for the geographic location of each observation and the relationships between neighbors.&lt;/p>
&lt;p>This tutorial uses the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Subnational Human Development Index&lt;/a> (SHDI) from &lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits and Permanyer (2019)&lt;/a> for &lt;strong>153 sub-national regions across 12 South American countries&lt;/strong> in 2013 and 2019 &amp;mdash; the same dataset from the &lt;a href="https://carlos-mendez.org/post/python_pca2/">Pooled PCA tutorial&lt;/a>. We progress from simple scatter plots and choropleth maps to formal tests of spatial dependence (Moran&amp;rsquo;s I), local cluster identification (LISA maps), and space-time dynamics. By the end, you will be able to answer: &lt;strong>do nearby regions in South America share similar development levels, and how have these spatial clusters evolved between 2013 and 2019?&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the concept of spatial autocorrelation and why it matters for regional analysis&lt;/li>
&lt;li>Create choropleth maps and scatter plots to visualize spatial distributions&lt;/li>
&lt;li>Build and interpret a spatial weights matrix using Queen contiguity&lt;/li>
&lt;li>Compute and interpret global Moran&amp;rsquo;s I for spatial dependence testing&lt;/li>
&lt;li>Identify local spatial clusters (HH, LL) and outliers (HL, LH) using LISA statistics&lt;/li>
&lt;li>Explore space-time dynamics of spatial clusters using directional Moran scatter plots&lt;/li>
&lt;li>Compare country-level development trajectories within the spatial framework&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-esda-pipeline">2. The ESDA pipeline&lt;/h2>
&lt;p>The analysis follows a natural progression from visualization to formal testing. Each step builds on the previous one, moving from &amp;ldquo;what does the data look like?&amp;rdquo; to &amp;ldquo;is the spatial pattern statistically significant?&amp;rdquo; to &amp;ldquo;where exactly are the clusters?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Load &amp;amp;&amp;lt;br/&amp;gt;Explore&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Visualize&amp;lt;br/&amp;gt;Maps&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Spatial&amp;lt;br/&amp;gt;Weights&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Global&amp;lt;br/&amp;gt;Moran's I&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Local&amp;lt;br/&amp;gt;LISA&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Space-Time&amp;lt;br/&amp;gt;Dynamics&amp;quot;]
style A fill:#141413,stroke:#6a9bcc,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Steps 1&amp;ndash;2 are purely visual &amp;mdash; they build intuition about where high and low values are concentrated. Step 3 formalizes the notion of &amp;ldquo;neighbors&amp;rdquo; through a spatial weights matrix. Steps 4&amp;ndash;5 use that matrix to compute statistics that quantify spatial clustering, first globally (one number for the whole map) and then locally (one number per region). Step 6 connects the spatial and temporal dimensions by tracking how regions move through the Moran scatter plot between periods.&lt;/p>
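&lt;p>Before introducing the PySAL tooling, it helps to see Steps 3&amp;ndash;4 from first principles. The sketch below is a self-contained toy example in pure NumPy (a 4&amp;times;4 grid with hypothetical values, not the SHDI data): it builds a rook-contiguity weights matrix and evaluates global Moran&amp;rsquo;s I from its textbook definition, $I = \frac{n}{S_0}\frac{z'Wz}{z'z}$, where $z$ holds deviations from the mean and $S_0$ is the sum of all weights.&lt;/p>

```python
import numpy as np

# Rook-contiguity weights on a 4x4 grid: cells are neighbors if they
# share an edge (the tutorial itself uses Queen contiguity via libpysal)
side = 4
n = side * side
W = np.zeros((n, n))
for i in range(side):
    for j in range(side):
        k = i * side + j
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < side and 0 <= nj < side:
                W[k, ni * side + nj] = 1.0

def morans_i(x, W):
    """Global Moran's I: (n / S0) * (z'Wz) / (z'z)."""
    z = x - x.mean()
    return len(x) / W.sum() * (z @ W @ z) / (z @ z)

# Clustered pattern: high values on the left half, low on the right
clustered = np.array([[1.0, 1.0, 0.0, 0.0]] * side).ravel()
# Checkerboard pattern: every neighbor differs (perfect dispersion)
checker = (np.indices((side, side)).sum(axis=0).ravel() % 2).astype(float)

print(morans_i(clustered, W))  # positive: neighbors are similar
print(morans_i(checker, W))    # negative: neighbors are dissimilar
```

&lt;p>A positive value signals spatial clustering of similar values, a negative value signals dispersion, and values near the expectation $-1/(n-1)$ indicate spatial randomness &amp;mdash; which is exactly what the permutation test behind global Moran&amp;rsquo;s I evaluates.&lt;/p>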
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis uses &lt;a href="https://geopandas.org/" target="_blank" rel="noopener">GeoPandas&lt;/a> for spatial data handling, &lt;a href="https://pysal.org/" target="_blank" rel="noopener">PySAL&lt;/a> for spatial statistics, and &lt;a href="https://splot.readthedocs.io/" target="_blank" rel="noopener">splot&lt;/a> for specialized spatial visualizations.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from libpysal.weights import Queen
from libpysal.weights import lag_spatial
from esda.moran import Moran, Moran_Local
from splot.esda import moran_scatterplot, lisa_cluster
from splot.libpysal import plot_spatial_weights
from adjustText import adjust_text
import mapclassify
# Reproducibility: seed NumPy's global RNG (used by esda's permutation tests)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False,
&amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False,
&amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True,
&amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6,
&amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;xtick.major.size&amp;quot;: 0,
&amp;quot;ytick.major.size&amp;quot;: 0,
&amp;quot;text.color&amp;quot;: WHITE_TEXT,
&amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False,
&amp;quot;legend.fontsize&amp;quot;: 11,
&amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-data-loading-and-exploration">4. Data loading and exploration&lt;/h2>
&lt;p>The dataset is a GeoJSON file containing polygon geometries and development indicators for 153 sub-national regions across South America. It is a spatial version of the data from the &lt;a href="https://carlos-mendez.org/post/python_pca2/">Pooled PCA tutorial&lt;/a>, sourced from the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Global Data Lab&lt;/a> (&lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits and Permanyer, 2019&lt;/a>). Each region has the Subnational Human Development Index (SHDI) and its three component indices &amp;mdash; Health, Education, and Income &amp;mdash; for 2013 and 2019.&lt;/p>
&lt;pre>&lt;code class="language-python">DATA_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_esda2/data.geojson&amp;quot;
gdf = gpd.read_file(DATA_URL)
print(f&amp;quot;Loaded: {gdf.shape[0]} rows, {gdf.shape[1]} columns&amp;quot;)
print(f&amp;quot;Countries: {gdf['country'].nunique()}&amp;quot;)
print(f&amp;quot;CRS: {gdf.crs}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 153 rows, 25 columns
Countries: 12
CRS: EPSG:4326
&lt;/code>&lt;/pre>
&lt;p>Before computing change columns, we prepare the data for labeling. Some region names in the raw data are very long (e.g., &amp;ldquo;Chubut, Neuquen, Rio Negro, Santa Cruz, Tierra del Fuego&amp;rdquo;), so we simplify them. We also create a &lt;code>region_country&lt;/code> column that appends the ISO country code to each region name &amp;mdash; this makes labels immediately informative when regions from different countries appear on the same plot.&lt;/p>
&lt;pre>&lt;code class="language-python"># Country name → ISO 3166-1 alpha-3 code
COUNTRY_ISO = {
&amp;quot;Argentina&amp;quot;: &amp;quot;ARG&amp;quot;, &amp;quot;Bolivia&amp;quot;: &amp;quot;BOL&amp;quot;, &amp;quot;Brazil&amp;quot;: &amp;quot;BRA&amp;quot;,
&amp;quot;Chili&amp;quot;: &amp;quot;CHL&amp;quot;, &amp;quot;Colombia&amp;quot;: &amp;quot;COL&amp;quot;, &amp;quot;Ecuador&amp;quot;: &amp;quot;ECU&amp;quot;,
&amp;quot;Guyana&amp;quot;: &amp;quot;GUY&amp;quot;, &amp;quot;Paraguay&amp;quot;: &amp;quot;PRY&amp;quot;, &amp;quot;Peru&amp;quot;: &amp;quot;PER&amp;quot;,
&amp;quot;Suriname&amp;quot;: &amp;quot;SUR&amp;quot;, &amp;quot;Uruguay&amp;quot;: &amp;quot;URY&amp;quot;, &amp;quot;Venezuela&amp;quot;: &amp;quot;VEN&amp;quot;,
}
gdf[&amp;quot;country_iso&amp;quot;] = gdf[&amp;quot;country&amp;quot;].map(COUNTRY_ISO)
# Simplify long region names
RENAME = {
&amp;quot;Catamarca, La Rioja, San Juan&amp;quot;: &amp;quot;Catamarca-La Rioja&amp;quot;,
&amp;quot;Corrientes, Entre Rios, Misiones&amp;quot;: &amp;quot;Corrientes-Misiones&amp;quot;,
&amp;quot;Chubut, Neuquen, Rio Negro, Santa Cruz, Tierra del Fuego&amp;quot;: &amp;quot;Patagonia&amp;quot;,
&amp;quot;La Pampa, San Luis, Mendoza&amp;quot;: &amp;quot;La Pampa-Mendoza&amp;quot;,
&amp;quot;Santiago del Estero, Tucuman&amp;quot;: &amp;quot;Tucuman-Sgo Estero&amp;quot;,
&amp;quot;Tarapaca (incl Arica and Parinacota)&amp;quot;: &amp;quot;Tarapaca&amp;quot;,
&amp;quot;Valparaiso (former Aconcagua)&amp;quot;: &amp;quot;Valparaiso&amp;quot;,
&amp;quot;Los Lagos (incl Los Rios)&amp;quot;: &amp;quot;Los Lagos&amp;quot;,
&amp;quot;Magallanes and La Antartica Chilena&amp;quot;: &amp;quot;Magallanes&amp;quot;,
&amp;quot;Antioquia (incl Medellin)&amp;quot;: &amp;quot;Antioquia&amp;quot;,
&amp;quot;Atlantico (incl Barranquilla)&amp;quot;: &amp;quot;Atlantico&amp;quot;,
&amp;quot;Bolivar (Sur and Norte)&amp;quot;: &amp;quot;Bolivar&amp;quot;,
&amp;quot;Essequibo Islands-West Demerara&amp;quot;: &amp;quot;Essequibo-W Demerara&amp;quot;,
&amp;quot;East Berbice-Corentyne&amp;quot;: &amp;quot;E Berbice-Corentyne&amp;quot;,
&amp;quot;Upper Takutu-Upper Essequibo&amp;quot;: &amp;quot;Upper Takutu-Essequibo&amp;quot;,
&amp;quot;Upper Demerara-Berbice&amp;quot;: &amp;quot;Upper Demerara&amp;quot;,
&amp;quot;Cuyuni-Mazaruni-Upper Essequibo&amp;quot;: &amp;quot;Cuyuni-Mazaruni&amp;quot;,
&amp;quot;Region Metropolitana&amp;quot;: &amp;quot;R. Metropolitana&amp;quot;,
&amp;quot;Federal District&amp;quot;: &amp;quot;Federal Dist.&amp;quot;,
&amp;quot;City of Buenos Aires&amp;quot;: &amp;quot;C. Buenos Aires&amp;quot;,
&amp;quot;Brokopondo and Sipaliwini&amp;quot;: &amp;quot;Brokopondo-Sipaliwini&amp;quot;,
&amp;quot;Montevideo and Metropolitan area&amp;quot;: &amp;quot;Montevideo&amp;quot;,
}
gdf[&amp;quot;region&amp;quot;] = gdf[&amp;quot;region&amp;quot;].replace(RENAME)
# Create region_country label column
gdf[&amp;quot;region_country&amp;quot;] = gdf[&amp;quot;region&amp;quot;] + &amp;quot; (&amp;quot; + gdf[&amp;quot;country_iso&amp;quot;] + &amp;quot;)&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>We then compute the change in SHDI and its components between the two periods.&lt;/p>
&lt;pre>&lt;code class="language-python">gdf[&amp;quot;shdi_change&amp;quot;] = gdf[&amp;quot;shdi2019&amp;quot;] - gdf[&amp;quot;shdi2013&amp;quot;]
gdf[&amp;quot;health_change&amp;quot;] = gdf[&amp;quot;healthindex2019&amp;quot;] - gdf[&amp;quot;healthindex2013&amp;quot;]
gdf[&amp;quot;educ_change&amp;quot;] = gdf[&amp;quot;edindex2019&amp;quot;] - gdf[&amp;quot;edindex2013&amp;quot;]
gdf[&amp;quot;income_change&amp;quot;] = gdf[&amp;quot;incindex2019&amp;quot;] - gdf[&amp;quot;incindex2013&amp;quot;]
print(gdf[[&amp;quot;shdi2013&amp;quot;, &amp;quot;shdi2019&amp;quot;, &amp;quot;shdi_change&amp;quot;]].describe().round(4).to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> shdi2013 shdi2019 shdi_change
count 153.0000 153.0000 153.0000
mean 0.7424 0.7477 0.0053
std 0.0594 0.0613 0.0319
min 0.5540 0.5580 -0.0670
25% 0.7070 0.7150 0.0090
50% 0.7430 0.7440 0.0150
75% 0.7740 0.7840 0.0250
max 0.8780 0.8830 0.0450
&lt;/code>&lt;/pre>
&lt;p>The dataset covers 153 regions across 12 South American countries. Mean SHDI increased modestly from 0.7424 in 2013 to 0.7477 in 2019 (+0.0053), but the change varied widely: from a maximum decline of -0.0670 to a maximum improvement of +0.0450. The standard deviation of SHDI also increased slightly (0.0594 to 0.0613), hinting that regional disparities may have widened.&lt;/p>
&lt;h2 id="5-exploratory-scatter-plots">5. Exploratory scatter plots&lt;/h2>
&lt;h3 id="51-hdi-scatter-2013-vs-2019">5.1 HDI scatter: 2013 vs 2019&lt;/h3>
&lt;p>A scatter plot of SHDI in 2013 against SHDI in 2019 provides a quick overview of temporal dynamics. Points above the 45-degree line represent regions that improved; points below represent regions that declined.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 7))
ax.scatter(gdf[&amp;quot;shdi2013&amp;quot;], gdf[&amp;quot;shdi2019&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=45, alpha=0.75, zorder=3)
lims = [min(gdf[&amp;quot;shdi2013&amp;quot;].min(), gdf[&amp;quot;shdi2019&amp;quot;].min()) - 0.01,
max(gdf[&amp;quot;shdi2013&amp;quot;].max(), gdf[&amp;quot;shdi2019&amp;quot;].max()) + 0.01]
ax.plot(lims, lims, color=WARM_ORANGE, linewidth=1.5, linestyle=&amp;quot;--&amp;quot;,
label=&amp;quot;45° line (no change)&amp;quot;, zorder=2)
ax.set_xlabel(&amp;quot;SHDI 2013&amp;quot;)
ax.set_ylabel(&amp;quot;SHDI 2019&amp;quot;)
ax.set_title(&amp;quot;Subnational HDI: 2013 vs 2019&amp;quot;)
ax.legend()
# Label extreme regions (biggest gains, biggest losses, highest, lowest)
residual = gdf[&amp;quot;shdi2019&amp;quot;] - gdf[&amp;quot;shdi2013&amp;quot;]
extremes = set()
extremes.update(residual.nlargest(3).index.tolist())
extremes.update(residual.nsmallest(3).index.tolist())
extremes.update(gdf[&amp;quot;shdi2019&amp;quot;].nlargest(2).index.tolist())
extremes.update(gdf[&amp;quot;shdi2019&amp;quot;].nsmallest(2).index.tolist())
texts = []
for i in extremes:
texts.append(ax.text(gdf.loc[i, &amp;quot;shdi2013&amp;quot;], gdf.loc[i, &amp;quot;shdi2019&amp;quot;],
gdf.loc[i, &amp;quot;region_country&amp;quot;], fontsize=8, color=LIGHT_TEXT))
adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle=&amp;quot;-&amp;quot;, color=LIGHT_TEXT,
alpha=0.5, lw=0.5))
plt.savefig(&amp;quot;esda2_scatter_hdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_scatter_hdi.png" alt="Scatter plot of SHDI 2013 vs SHDI 2019 with 45-degree reference line and labeled extreme regions.">&lt;/p>
&lt;p>Of 153 regions, &lt;strong>126 improved&lt;/strong> their SHDI between 2013 and 2019, while &lt;strong>27 declined&lt;/strong>. The labels identify key cases: at the top, &lt;strong>C. Buenos Aires (ARG)&lt;/strong> and &lt;strong>R. Metropolitana (CHL)&lt;/strong> lead with SHDI above 0.88. At the bottom, &lt;strong>Potaro-Siparuni (GUY)&lt;/strong> and &lt;strong>Barima-Waini (GUY)&lt;/strong> remain the least developed. The biggest decliners &amp;mdash; &lt;strong>Federal Dist. (VEN)&lt;/strong>, &lt;strong>Carabobo (VEN)&lt;/strong>, and &lt;strong>Aragua (VEN)&lt;/strong> &amp;mdash; are all Venezuelan states, falling well below the 45-degree line. The biggest improvers &amp;mdash; &lt;strong>Meta (COL)&lt;/strong>, &lt;strong>Vichada (COL)&lt;/strong>, and &lt;strong>Brokopondo-Sipaliwini (SUR)&lt;/strong> &amp;mdash; rose above the line, with gains up to +0.045 points.&lt;/p>
&lt;h3 id="52-component-scatter-plots">5.2 Component scatter plots&lt;/h3>
&lt;p>The SHDI is a composite of three sub-indices: Health, Education, and Income. Breaking down the change by component reveals which dimensions drove the aggregate patterns.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 3, figsize=(18, 5.5))
components = [
(&amp;quot;healthindex2013&amp;quot;, &amp;quot;healthindex2019&amp;quot;, &amp;quot;Health Index&amp;quot;),
(&amp;quot;edindex2013&amp;quot;, &amp;quot;edindex2019&amp;quot;, &amp;quot;Education Index&amp;quot;),
(&amp;quot;incindex2013&amp;quot;, &amp;quot;incindex2019&amp;quot;, &amp;quot;Income Index&amp;quot;),
]
for ax, (col13, col19, label) in zip(axes, components):
ax.scatter(gdf[col13], gdf[col19],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, alpha=0.7, zorder=3)
lims = [min(gdf[col13].min(), gdf[col19].min()) - 0.02,
max(gdf[col13].max(), gdf[col19].max()) + 0.02]
ax.plot(lims, lims, color=WARM_ORANGE, linewidth=1.5, linestyle=&amp;quot;--&amp;quot;, zorder=2)
ax.set_xlabel(f&amp;quot;{label} 2013&amp;quot;)
ax.set_ylabel(f&amp;quot;{label} 2019&amp;quot;)
ax.set_title(label)
# Label extreme regions per component
comp_residual = gdf[col19] - gdf[col13]
comp_extremes = set()
comp_extremes.update(comp_residual.nlargest(2).index.tolist())
comp_extremes.update(comp_residual.nsmallest(2).index.tolist())
texts = []
for i in comp_extremes:
texts.append(ax.text(gdf.loc[i, col13], gdf.loc[i, col19],
gdf.loc[i, &amp;quot;region_country&amp;quot;], fontsize=7, color=LIGHT_TEXT))
adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle=&amp;quot;-&amp;quot;, color=LIGHT_TEXT,
alpha=0.5, lw=0.5))
fig.suptitle(&amp;quot;HDI components: 2013 vs 2019&amp;quot;, fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;esda2_scatter_components.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_scatter_components.png" alt="Three-panel scatter plot comparing Health, Education, and Income indices between 2013 and 2019.">&lt;/p>
&lt;p>The three components tell very different stories. Health and Education improved almost universally &amp;mdash; the vast majority of points lie above the 45-degree line. Income moved in the opposite direction for many regions: &lt;strong>71 of 153 regions (46.4%) experienced a decline&lt;/strong> in their income index between 2013 and 2019. This mixed signal &amp;mdash; education and health gains partially offset by income losses &amp;mdash; explains why the aggregate SHDI improvement was so modest (+0.005 on average). The income panel also shows wider scatter, indicating greater heterogeneity in economic trajectories across the continent.&lt;/p>
&lt;h2 id="6-choropleth-maps">6. Choropleth maps&lt;/h2>
&lt;h3 id="61-hdi-levels-across-south-america">6.1 HDI levels across South America&lt;/h3>
&lt;p>The scatter plots tell us &lt;em>what&lt;/em> changed, but not &lt;em>where&lt;/em>. Choropleth maps add the geographic dimension by coloring each region according to its SHDI value. To make the two years directly comparable, we use &lt;a href="https://pysal.org/mapclassify/generated/mapclassify.FisherJenks.html" target="_blank" rel="noopener">Fisher-Jenks natural breaks&lt;/a> computed from 2013 and held constant for 2019. Fisher-Jenks is a classification method that finds natural groupings in data by minimizing within-class variance &amp;mdash; it places break points where the data naturally separates into clusters. This way, a color change between maps reflects a genuine shift in development class, not a shifting classification scheme. The legend shows the number of regions in each class, making it easy to see how the distribution shifted.&lt;/p>
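&lt;p>The fixed-breaks idea can be sketched in plain NumPy before running the full pipeline. The values and break points below are toy numbers, not the SHDI data; &lt;code>np.digitize&lt;/code> with &lt;code>right=True&lt;/code> treats the breaks as inclusive upper bounds, roughly mimicking &lt;code>mapclassify.UserDefined&lt;/code>. Because both years share one set of bounds, a class change reflects a genuine change in the data, not in the scheme.&lt;/p>

```python
import numpy as np

# Toy sketch: classify two years of values with ONE set of upper bounds.
breaks = np.array([0.62, 0.69, 0.73, 0.79, 0.89])  # hypothetical breaks
y2013 = np.array([0.60, 0.70, 0.75, 0.80])          # hypothetical values
y2019 = np.array([0.64, 0.70, 0.78, 0.86])

# right=True: each value gets the index of the first break >= the value
c2013 = np.digitize(y2013, breaks, right=True)
c2019 = np.digitize(y2019, breaks, right=True)

print(c2013.tolist(), c2019.tolist())   # [0, 2, 3, 4] [1, 2, 3, 4]
print("moved up:", int((c2019 > c2013).sum()))  # only the first region
```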
&lt;pre>&lt;code class="language-python">import mapclassify
from matplotlib.patches import Patch
# Fisher-Jenks breaks from 2013 (5 classes)
fj = mapclassify.FisherJenks(gdf[&amp;quot;shdi2013&amp;quot;].values, k=5)
breaks = fj.bins.tolist()
# Extend upper break to cover 2019 max
max_val = max(gdf[&amp;quot;shdi2013&amp;quot;].max(), gdf[&amp;quot;shdi2019&amp;quot;].max())
if max_val &amp;gt; breaks[-1]:
breaks[-1] = float(round(max_val + 0.001, 3))
# Apply same breaks to 2019
fj_2019 = mapclassify.UserDefined(gdf[&amp;quot;shdi2019&amp;quot;].values, bins=breaks)
# Class transitions
classes_2013 = fj.yb
classes_2019 = fj_2019.yb
improved = (classes_2019 &amp;gt; classes_2013).sum()
stayed = (classes_2019 == classes_2013).sum()
declined = (classes_2019 &amp;lt; classes_2013).sum()
print(f&amp;quot;Breaks (from 2013): {[round(b, 3) for b in breaks]}&amp;quot;)
print(f&amp;quot; Improved (moved up): {improved}&amp;quot;)
print(f&amp;quot; Stayed same: {stayed}&amp;quot;)
print(f&amp;quot; Declined (moved down): {declined}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Breaks (from 2013): [0.622, 0.693, 0.734, 0.789, 0.884]
Improved (moved up): 43
Stayed same: 86
Declined (moved down): 24
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Class labels
class_labels = []
lower = round(gdf[&amp;quot;shdi2013&amp;quot;].min(), 2)
for b in breaks:
class_labels.append(f&amp;quot;{lower:.2f} – {b:.2f}&amp;quot;)
lower = round(b, 2)
fig, axes = plt.subplots(1, 2, figsize=(16, 12))
cmap = plt.cm.coolwarm
norm = plt.Normalize(vmin=0, vmax=len(breaks) - 1)
for ax, year_col, title, year_fj in [
(axes[0], &amp;quot;shdi2013&amp;quot;, &amp;quot;SHDI 2013&amp;quot;, fj),
(axes[1], &amp;quot;shdi2019&amp;quot;, &amp;quot;SHDI 2019&amp;quot;, fj_2019),
]:
colors = [cmap(norm(c)) for c in year_fj.yb]
gdf.plot(ax=ax, color=colors, edgecolor=GRID_LINE, linewidth=0.3)
ax.set_title(title, fontsize=14, pad=10)
ax.set_axis_off()
# Legend with region counts per class
counts = np.bincount(year_fj.yb, minlength=len(breaks))
handles = [Patch(facecolor=cmap(norm(i)), edgecolor=GRID_LINE,
label=f&amp;quot;{cl} (n={c})&amp;quot;)
for i, (cl, c) in enumerate(zip(class_labels, counts))]
ax.legend(handles=handles, title=&amp;quot;SHDI Class&amp;quot;, loc=&amp;quot;lower right&amp;quot;,
fontsize=10, title_fontsize=11)
# Label extreme regions on both maps
map_extremes = gdf[&amp;quot;shdi2019&amp;quot;].nlargest(3).index.tolist() + \
gdf[&amp;quot;shdi2019&amp;quot;].nsmallest(3).index.tolist()
for ax_map in axes:
texts = []
for i in map_extremes:
centroid = gdf.geometry.iloc[i].centroid
texts.append(ax_map.text(centroid.x, centroid.y,
gdf.loc[i, &amp;quot;region_country&amp;quot;],
fontsize=7, color=WHITE_TEXT, weight=&amp;quot;bold&amp;quot;))
adjust_text(texts, ax=ax_map, arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;,
color=LIGHT_TEXT, alpha=0.9, lw=1.2, mutation_scale=8))
plt.savefig(&amp;quot;esda2_choropleth_hdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_choropleth_hdi.png" alt="Side-by-side choropleth maps of SHDI in 2013 and 2019 with Fisher-Jenks classification, showing region counts per class.">&lt;/p>
&lt;p>The Fisher-Jenks classification reveals both persistence and change in South America&amp;rsquo;s development geography. Using the same 2013 breaks for both maps, &lt;strong>43 regions moved up&lt;/strong> at least one class between 2013 and 2019, &lt;strong>86 stayed&lt;/strong> in the same class, and &lt;strong>24 declined&lt;/strong>. The legend counts make the shifts visible: the lowest class shrank from n=6 to n=4, while the middle classes absorbed most of the movement. The Southern Cone and southern Brazil consistently occupy the highest class (red tones), while the Amazon basin, Guyana, and parts of Venezuela anchor the lowest class (blue tones). This visual clustering is precisely what spatial autocorrelation statistics will later quantify &amp;mdash; high values are surrounded by high values, and low values are surrounded by low values.&lt;/p>
&lt;h3 id="62-mapping-hdi-change">6.2 Mapping HDI change&lt;/h3>
&lt;p>A map of SHDI change (2019 minus 2013) reveals the geographic distribution of gains and losses, using a diverging color scale centered at zero.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(1, 1, figsize=(10, 10))
abs_max = max(abs(gdf[&amp;quot;shdi_change&amp;quot;].min()), abs(gdf[&amp;quot;shdi_change&amp;quot;].max()))
gdf.plot(column=&amp;quot;shdi_change&amp;quot;, cmap=&amp;quot;RdYlGn&amp;quot;, ax=ax, legend=False,
edgecolor=DARK_NAVY, linewidth=0.3, vmin=-abs_max, vmax=abs_max)
ax.set_title(&amp;quot;Change in SHDI (2019 - 2013)&amp;quot;, fontsize=14, pad=10)
ax.set_axis_off()
# Label biggest gainers and losers
change_top = gdf[&amp;quot;shdi_change&amp;quot;].nlargest(3).index.tolist()
change_bot = gdf[&amp;quot;shdi_change&amp;quot;].nsmallest(3).index.tolist()
texts = []
for i in change_top + change_bot:
centroid = gdf.geometry.iloc[i].centroid
texts.append(ax.text(centroid.x, centroid.y, gdf.loc[i, &amp;quot;region&amp;quot;],
fontsize=7, color=WHITE_TEXT, weight=&amp;quot;bold&amp;quot;))
adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;,
color=LIGHT_TEXT, alpha=0.9, lw=1.2,
mutation_scale=8))
sm = plt.cm.ScalarMappable(cmap=&amp;quot;RdYlGn&amp;quot;,
norm=plt.Normalize(vmin=-abs_max, vmax=abs_max))
cbar = fig.colorbar(sm, ax=ax, orientation=&amp;quot;horizontal&amp;quot;,
fraction=0.03, pad=0.02, aspect=40)
cbar.set_label(&amp;quot;SHDI change (2019 - 2013)&amp;quot;)
plt.savefig(&amp;quot;esda2_choropleth_change.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_choropleth_change.png" alt="Choropleth map of SHDI change between 2013 and 2019, with diverging red-green color scale and labeled extremes.">&lt;/p>
&lt;p>The change map reveals that &lt;strong>development losses are geographically concentrated&lt;/strong>, not randomly scattered. The labels pinpoint the extremes: &lt;strong>Federal Dist. (VEN)&lt;/strong>, &lt;strong>Carabobo (VEN)&lt;/strong>, and &lt;strong>Aragua (VEN)&lt;/strong> show the deepest red (declines of up to -0.067 points), while &lt;strong>Vichada (COL)&lt;/strong>, &lt;strong>Meta (COL)&lt;/strong>, and &lt;strong>Brokopondo-Sipaliwini (SUR)&lt;/strong> show the brightest green (improvements of up to +0.045). The geographic concentration of gains and losses suggests that spatial proximity plays a role in development trajectories &amp;mdash; a hypothesis that we formalize in the next sections.&lt;/p>
&lt;h2 id="7-spatial-weights">7. Spatial weights&lt;/h2>
&lt;h3 id="71-what-is-a-spatial-weights-matrix">7.1 What is a spatial weights matrix?&lt;/h3>
&lt;p>To test for spatial clustering formally, we first need to define what &amp;ldquo;neighbor&amp;rdquo; means. A &lt;strong>spatial weights matrix&lt;/strong> $W$ is an $n \times n$ matrix where each entry $w_{ij}$ encodes the spatial relationship between regions $i$ and $j$. If two regions are neighbors, $w_{ij} &amp;gt; 0$; if not, $w_{ij} = 0$.&lt;/p>
&lt;p>The most common approach for polygon data is &lt;strong>contiguity-based weights&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Queen contiguity:&lt;/strong> Two regions are neighbors if they share any boundary point (even a single corner). Named after the queen in chess, which can move in any direction.&lt;/li>
&lt;li>&lt;strong>Rook contiguity:&lt;/strong> Two regions are neighbors only if they share an edge (not just a corner). More restrictive than Queen.&lt;/li>
&lt;/ul>
&lt;p>We use Queen contiguity because it captures the broadest definition of adjacency, which is appropriate for irregular administrative boundaries.&lt;/p>
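&lt;p>The difference between the two criteria is easiest to see on a regular grid. The helper below is a hypothetical illustration (not part of PySAL): on a 3&amp;times;3 grid of cells indexed row-major, the center cell has 8 queen neighbors (corners count) but only 4 rook neighbors (shared edges only).&lt;/p>

```python
import itertools

# Toy 3x3 grid of cells indexed 0..8 (row-major), built with pure Python.
def grid_neighbors(idx, n=3, queen=True):
    r, c = divmod(idx, n)
    nbrs = []
    for dr, dc in itertools.product((-1, 0, 1), repeat=2):
        if (dr, dc) == (0, 0):
            continue
        if not queen and abs(dr) + abs(dc) != 1:  # rook: shared edge only
            continue
        rr, cc = r + dr, c + dc
        if 0 <= rr < n and 0 <= cc < n:
            nbrs.append(rr * n + cc)
    return sorted(nbrs)

print(grid_neighbors(4, queen=True))   # [0, 1, 2, 3, 5, 6, 7, 8] — 8 neighbors
print(grid_neighbors(4, queen=False))  # [1, 3, 5, 7] — 4 neighbors
```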
&lt;h3 id="72-building-queen-contiguity-weights">7.2 Building Queen contiguity weights&lt;/h3>
&lt;p>PySAL&amp;rsquo;s &lt;a href="https://pysal.org/libpysal/generated/libpysal.weights.contiguity.Queen.html" target="_blank" rel="noopener">&lt;code>Queen.from_dataframe()&lt;/code>&lt;/a> builds the weights matrix directly from a GeoDataFrame. After construction, we &lt;strong>row-standardize&lt;/strong> the matrix so that each region&amp;rsquo;s neighbor weights sum to 1. This makes the spatial lag (the weighted average of neighbors' values) directly interpretable as the mean neighbor value.&lt;/p>
&lt;pre>&lt;code class="language-python">from libpysal.weights import Queen
W = Queen.from_dataframe(gdf)
W.transform = &amp;quot;r&amp;quot; # Row-standardize
print(f&amp;quot;Number of regions: {W.n}&amp;quot;)
print(f&amp;quot;Min neighbors: {W.min_neighbors}&amp;quot;)
print(f&amp;quot;Max neighbors: {W.max_neighbors}&amp;quot;)
print(f&amp;quot;Mean neighbors: {W.mean_neighbors:.2f}&amp;quot;)
print(f&amp;quot;Islands: {W.islands}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Number of regions: 153
Min neighbors: 0
Max neighbors: 11
Mean neighbors: 4.93
Islands: [87, 145]
&lt;/code>&lt;/pre>
&lt;p>The Queen contiguity matrix connects 153 regions with an average of 4.93 neighbors each (minimum 0, maximum 11). Two regions have &lt;strong>no neighbors&lt;/strong> (islands): &lt;strong>San Andres (COL)&lt;/strong> (index 87) and &lt;strong>Nueva Esparta (VEN)&lt;/strong> (index 145) &amp;mdash; both are island territories separated from the mainland by water. PySAL excludes these isolates from spatial autocorrelation calculations, as they have no defined spatial relationship with other regions. Row-standardization ensures that each region&amp;rsquo;s spatial lag is the simple average of its neighbors' values, regardless of how many neighbors it has.&lt;/p>
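&lt;p>The effect of row-standardization can be verified with a minimal NumPy sketch (toy 3-region matrix, hypothetical values): after dividing each row of a binary contiguity matrix by its row sum, the spatial lag $Wx$ is exactly the mean of each region&amp;rsquo;s neighbors.&lt;/p>

```python
import numpy as np

# Minimal sketch: region 0 borders regions 1 and 2; 1 and 2 border only 0.
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)

# Row-standardize: each row sums to 1
W_row = W / W.sum(axis=1, keepdims=True)

x = np.array([10.0, 20.0, 40.0])
lag = W_row @ x
print(lag)  # region 0: mean(20, 40) = 30; regions 1 and 2: region 0's value
```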
&lt;h3 id="73-visualizing-the-connectivity-structure">7.3 Visualizing the connectivity structure&lt;/h3>
&lt;p>The &lt;a href="https://splot.readthedocs.io/en/latest/generated/splot.libpysal.plot_spatial_weights.html" target="_blank" rel="noopener">&lt;code>plot_spatial_weights()&lt;/code>&lt;/a> function from splot overlays the weights network on the map, drawing lines between each region&amp;rsquo;s centroid and its neighbors' centroids.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, facecolor=&amp;quot;none&amp;quot;, edgecolor=GRID_LINE, linewidth=0.5)
plot_spatial_weights(W, gdf, ax=ax)
ax.set_title(&amp;quot;Queen contiguity weights&amp;quot;, fontsize=14, pad=10)
ax.set_axis_off()
plt.savefig(&amp;quot;esda2_spatial_weights.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_spatial_weights.png" alt="Map of South America with Queen contiguity network overlaid, showing lines connecting neighboring region centroids.">&lt;/p>
&lt;p>The network visualization shows the connectivity structure underlying all spatial statistics in this tutorial. Denser networks appear in areas with many small regions (e.g., southern Brazil, northern Argentina), while sparser connections appear in areas with large administrative units (e.g., the Amazon basin). The two island territories (San Andres and Nueva Esparta) appear as isolated dots with no connecting lines. This network is the foundation for computing spatial lags &amp;mdash; the weighted average of neighbors' values &amp;mdash; which is the building block of Moran&amp;rsquo;s I.&lt;/p>
&lt;h2 id="8-global-spatial-autocorrelation">8. Global spatial autocorrelation&lt;/h2>
&lt;h3 id="81-morans-i-concept-and-intuition">8.1 Moran&amp;rsquo;s I: concept and intuition&lt;/h3>
&lt;p>&lt;strong>Moran&amp;rsquo;s I&lt;/strong> is the most widely used measure of global spatial autocorrelation. It answers a simple question: &lt;strong>do similar values tend to cluster together more than expected by chance?&lt;/strong> Think of it like temperature on a weather map &amp;mdash; if it is hot in one city, nearby cities are likely hot too. Moran&amp;rsquo;s I measures how strongly this &amp;ldquo;neighbor similarity&amp;rdquo; holds for development levels across South American regions.&lt;/p>
&lt;p>The statistic is defined as:&lt;/p>
&lt;p>$$I = \frac{n}{\sum_{i} \sum_{j} w_{ij}} \cdot \frac{\sum_{i} \sum_{j} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i} (x_i - \bar{x})^2}$$&lt;/p>
&lt;p>where $n$ is the number of regions, $w_{ij}$ are the spatial weights, $x_i$ is the value at region $i$, and $\bar{x}$ is the overall mean. In plain language: Moran&amp;rsquo;s I averages, across all pairs of neighbors, the product of their deviations from the mean. If high-value regions tend to be next to high-value regions (and low next to low), these products are positive, and $I$ is positive.&lt;/p>
&lt;ul>
&lt;li>$I \approx +1$: strong positive spatial autocorrelation (clustering of similar values)&lt;/li>
&lt;li>$I \approx 0$: no spatial pattern (random arrangement)&lt;/li>
&lt;li>$I \approx -1$: strong negative spatial autocorrelation (checkerboard pattern)&lt;/li>
&lt;/ul>
&lt;p>The expected value under spatial randomness is $E(I) = -1/(n-1)$, which approaches zero for large $n$.&lt;/p>
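&lt;p>The formula can be implemented directly in NumPy to build intuition. The &lt;code>morans_i&lt;/code> helper and the four-region data below are toy illustrations, not esda&amp;rsquo;s implementation: four regions on a line, with binary weights, show that clustered values give a positive $I$ and perfectly alternating values give $I = -1$.&lt;/p>

```python
import numpy as np

# Moran's I from the formula above, in plain numpy (binary weights).
def morans_i(x, W):
    n = len(x)
    d = x - x.mean()
    cross = (W * np.outer(d, d)).sum()   # sum_ij w_ij (x_i - xbar)(x_j - xbar)
    return (n / W.sum()) * cross / (d ** 2).sum()

# Four regions on a line: 0-1-2-3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(morans_i(np.array([1., 1., 5., 5.]), W))  # clustered -> positive (1/3)
print(morans_i(np.array([1., 5., 1., 5.]), W))  # alternating -> -1.0
```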
&lt;h3 id="82-morans-i-for-hdi-2013-and-2019">8.2 Moran&amp;rsquo;s I for HDI (2013 and 2019)&lt;/h3>
&lt;p>We compute Moran&amp;rsquo;s I with 999 random permutations to generate a reference distribution and assess statistical significance. A &lt;strong>permutation test&lt;/strong> works by randomly shuffling all the SHDI values across the map 999 times &amp;mdash; like dealing cards to random seats. If the real Moran&amp;rsquo;s I is more extreme than almost all the shuffled values, we can be confident the spatial pattern is real, not coincidence.&lt;/p>
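&lt;p>The shuffling logic can be sketched in a few lines of NumPy. Everything below is a toy illustration (six regions on a line, hypothetical values, a hand-rolled &lt;code>morans_i&lt;/code>), not esda&amp;rsquo;s internals, but the pseudo p-value follows the same recipe: the share of permutations with a statistic at least as extreme as the observed one, with the observed value counted among them.&lt;/p>

```python
import numpy as np

# Sketch of the permutation test: shuffle values across the map many times
# and compare the observed Moran's I against the shuffled distribution.
rng = np.random.default_rng(42)

def morans_i(x, W):
    d = x - x.mean()
    return (len(x) / W.sum()) * (W * np.outer(d, d)).sum() / (d ** 2).sum()

# Binary weights for six regions on a line, with clustered values
W = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
x = np.array([1., 1., 1., 5., 5., 5.])

observed = morans_i(x, W)                                   # 0.600
sims = np.array([morans_i(rng.permutation(x), W) for _ in range(999)])
p_sim = ((sims >= observed).sum() + 1) / (999 + 1)          # pseudo p-value
print(f"I = {observed:.3f}, p = {p_sim:.3f}")
```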
&lt;pre>&lt;code class="language-python">from esda.moran import Moran
moran_2013 = Moran(gdf[&amp;quot;shdi2013&amp;quot;], W, permutations=999)
moran_2019 = Moran(gdf[&amp;quot;shdi2019&amp;quot;], W, permutations=999)
print(f&amp;quot;SHDI 2013: I = {moran_2013.I:.4f}, p-value = {moran_2013.p_sim:.4f}, &amp;quot;
f&amp;quot;z-score = {moran_2013.z_sim:.4f}&amp;quot;)
print(f&amp;quot;SHDI 2019: I = {moran_2019.I:.4f}, p-value = {moran_2019.p_sim:.4f}, &amp;quot;
f&amp;quot;z-score = {moran_2019.z_sim:.4f}&amp;quot;)
print(f&amp;quot;Expected I (random): {moran_2013.EI:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">SHDI 2013: I = 0.5680, p-value = 0.0010, z-score = 10.7661
SHDI 2019: I = 0.6320, p-value = 0.0010, z-score = 11.9890
Expected I (random): -0.0066
&lt;/code>&lt;/pre>
&lt;p>Moran&amp;rsquo;s I for SHDI is &lt;strong>strongly positive and highly significant&lt;/strong> in both years. In 2013, $I = 0.5680$ (p = 0.001, z = 10.77), and in 2019, $I = 0.6320$ (p = 0.001, z = 11.99). Both values are far above the expected value under spatial randomness ($E(I) = -0.0066$), confirming that regions with similar development levels are spatially clustered. Notably, &lt;strong>spatial autocorrelation strengthened&lt;/strong> from 2013 to 2019 ($I$ increased from 0.568 to 0.632), suggesting that development clusters became more pronounced over the period &amp;mdash; the spatial divide deepened.&lt;/p>
&lt;h3 id="83-moran-scatter-plot">8.3 Moran scatter plot&lt;/h3>
&lt;p>The &lt;strong>Moran scatter plot&lt;/strong> visualizes the spatial relationship by plotting each region&amp;rsquo;s standardized value ($z_i$) against the spatial lag of its neighbors ($Wz_i$). The slope of the regression line through the scatter equals Moran&amp;rsquo;s I. The four quadrants identify the type of spatial association for each region:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>HH (top-right):&lt;/strong> High values surrounded by high neighbors&lt;/li>
&lt;li>&lt;strong>LL (bottom-left):&lt;/strong> Low values surrounded by low neighbors&lt;/li>
&lt;li>&lt;strong>LH (top-left):&lt;/strong> Low values surrounded by high neighbors (spatial outlier)&lt;/li>
&lt;li>&lt;strong>HL (bottom-right):&lt;/strong> High values surrounded by low neighbors (spatial outlier)&lt;/li>
&lt;/ul>
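&lt;p>The quadrant assignment is just a sign comparison between each region&amp;rsquo;s standardized value $z_i$ and its spatial lag $Wz_i$. A minimal sketch with hypothetical values for three regions:&lt;/p>

```python
import numpy as np

# Assign Moran quadrants from standardized values z and their spatial lag wz.
z  = np.array([ 1.2, -0.8,  0.5])   # standardized SHDI (hypothetical)
wz = np.array([ 0.9, -0.4, -0.6])   # spatial lag of z (hypothetical)

quad = np.where((z >= 0) & (wz >= 0), "HH",     # high value, high neighbors
       np.where((z < 0) & (wz < 0), "LL",       # low value, low neighbors
       np.where((z < 0) & (wz >= 0), "LH",      # spatial outlier
                "HL")))                          # spatial outlier

print(quad.tolist())  # ['HH', 'LL', 'HL']
```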
&lt;pre>&lt;code class="language-python">from scipy import stats as scipy_stats
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, moran_obj, year in [
(axes[0], moran_2013, &amp;quot;2013&amp;quot;),
(axes[1], moran_2019, &amp;quot;2019&amp;quot;),
]:
# Standardize values and compute spatial lag
y = gdf[f&amp;quot;shdi{year}&amp;quot;].values
z = (y - y.mean()) / y.std()
wz = lag_spatial(W, z)
ax.scatter(z, wz, color=STEEL_BLUE, s=35, alpha=0.7,
edgecolors=GRID_LINE, linewidths=0.3, zorder=3)
# Regression line (slope = Moran's I)
slope, intercept, _, _, _ = scipy_stats.linregress(z, wz)
x_range = np.array([z.min(), z.max()])
ax.plot(x_range, intercept + slope * x_range, color=WARM_ORANGE,
linewidth=1.5, zorder=2)
# Quadrant dividers at origin
ax.axhline(0, color=LIGHT_TEXT, linewidth=0.8, alpha=0.5, zorder=1)
ax.axvline(0, color=LIGHT_TEXT, linewidth=0.8, alpha=0.5, zorder=1)
# Quadrant labels
xlim, ylim = ax.get_xlim(), ax.get_ylim()
pad_x = (xlim[1] - xlim[0]) * 0.05
pad_y = (ylim[1] - ylim[0]) * 0.05
ax.text(xlim[1] - pad_x, ylim[1] - pad_y, &amp;quot;HH&amp;quot;, fontsize=13,
ha=&amp;quot;right&amp;quot;, va=&amp;quot;top&amp;quot;, color=LIGHT_TEXT, alpha=0.5)
ax.text(xlim[0] + pad_x, ylim[1] - pad_y, &amp;quot;LH&amp;quot;, fontsize=13,
ha=&amp;quot;left&amp;quot;, va=&amp;quot;top&amp;quot;, color=LIGHT_TEXT, alpha=0.5)
ax.text(xlim[0] + pad_x, ylim[0] + pad_y, &amp;quot;LL&amp;quot;, fontsize=13,
ha=&amp;quot;left&amp;quot;, va=&amp;quot;bottom&amp;quot;, color=LIGHT_TEXT, alpha=0.5)
ax.text(xlim[1] - pad_x, ylim[0] + pad_y, &amp;quot;HL&amp;quot;, fontsize=13,
ha=&amp;quot;right&amp;quot;, va=&amp;quot;bottom&amp;quot;, color=LIGHT_TEXT, alpha=0.5)
ax.set_xlabel(f&amp;quot;SHDI {year} (standardized)&amp;quot;)
ax.set_ylabel(f&amp;quot;Spatial lag of SHDI {year}&amp;quot;)
ax.set_title(f&amp;quot;({'a' if year == '2013' else 'b'}) Moran scatter plot &amp;quot;
f&amp;quot;— {year} (I = {moran_obj.I:.4f})&amp;quot;)
plt.tight_layout()
plt.savefig(&amp;quot;esda2_moran_global.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_moran_global.png" alt="Two-panel Moran scatter plot for SHDI in 2013 and 2019.">&lt;/p>
&lt;p>Both Moran scatter plots show a clear positive slope, with the majority of regions falling in the &lt;strong>HH and LL quadrants&lt;/strong> (positive spatial autocorrelation). The steeper slope in the 2019 panel visually confirms the increase in Moran&amp;rsquo;s I from 0.5680 to 0.6320. Regions in the HH quadrant (top-right) represent the Southern Cone prosperity cluster, while regions in the LL quadrant (bottom-left) represent the Amazon/Guyana deprivation cluster. The relatively few points in the LH and HL quadrants are spatial outliers &amp;mdash; regions whose development level diverges sharply from their neighbors.&lt;/p>
&lt;h2 id="9-local-spatial-autocorrelation-lisa">9. Local spatial autocorrelation (LISA)&lt;/h2>
&lt;h3 id="91-from-global-to-local-why-lisa-matters">9.1 From global to local: why LISA matters&lt;/h3>
&lt;p>Global Moran&amp;rsquo;s I gives us &lt;strong>one number&lt;/strong> for the entire map, confirming that spatial clustering exists. But it does not tell us &lt;strong>where&lt;/strong> the clusters are located. &lt;strong>Local Indicators of Spatial Association (LISA)&lt;/strong> decompose the global statistic into a contribution from each individual region (&lt;a href="https://doi.org/10.1111/j.1538-4632.1995.tb00338.x" target="_blank" rel="noopener">Anselin, 1995&lt;/a>).&lt;/p>
&lt;p>The local Moran statistic for region $i$ is:&lt;/p>
&lt;p>$$I_i = z_i \sum_{j} w_{ij} z_j$$&lt;/p>
&lt;p>where $z_i = (x_i - \bar{x}) / s$ is the standardized value at region $i$ and $\sum_{j} w_{ij} z_j$ is its spatial lag (the weighted average of neighbors' standardized values). In plain language: each region&amp;rsquo;s local statistic is the product of its own deviation from the mean and the average deviation of its neighbors. In the code, $x_i$ corresponds to &lt;code>gdf[&amp;quot;shdi2019&amp;quot;]&lt;/code> and $w_{ij}$ to the row-standardized Queen weights &lt;code>W&lt;/code>.&lt;/p>
&lt;p>Each region receives a local Moran&amp;rsquo;s I statistic and is classified into one of four types based on its quadrant in the Moran scatter plot:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>HH (High-High):&lt;/strong> A high-value region surrounded by high-value neighbors &amp;mdash; a &amp;ldquo;hot spot&amp;rdquo; or prosperity cluster&lt;/li>
&lt;li>&lt;strong>LL (Low-Low):&lt;/strong> A low-value region surrounded by low-value neighbors &amp;mdash; a &amp;ldquo;cold spot&amp;rdquo; or deprivation trap&lt;/li>
&lt;li>&lt;strong>HL (High-Low):&lt;/strong> A high-value region surrounded by low-value neighbors &amp;mdash; a positive spatial outlier&lt;/li>
&lt;li>&lt;strong>LH (Low-High):&lt;/strong> A low-value region surrounded by high-value neighbors &amp;mdash; a negative spatial outlier&lt;/li>
&lt;/ul>
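&lt;p>A useful property of this decomposition: with row-standardized weights and values standardized by the population standard deviation (so that $\sum_i z_i^2 = n$), the average of the local statistics equals global Moran&amp;rsquo;s I. A minimal NumPy check on hypothetical data (not the tutorial&amp;rsquo;s regions; library implementations such as &lt;code>esda.Moran_Local&lt;/code> may scale $I_i$ slightly differently):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Hypothetical ring of 6 regions: each borders its two adjacent regions
Wb = np.zeros((n, n))
for i in range(n):
    Wb[i, (i - 1) % n] = Wb[i, (i + 1) % n] = 1.0
W = Wb / Wb.sum(axis=1, keepdims=True)  # row-standardize

x = rng.random(n)
z = (x - x.mean()) / x.std()  # population std, so z @ z == n

I_local = z * (W @ z)             # I_i = z_i * sum_j w_ij z_j
I_global = (z @ W @ z) / (z @ z)  # global I (S0 = n cancels with row-standardized W)
print(np.isclose(I_local.mean(), I_global))  # → True
```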
&lt;p>Statistical significance is assessed via permutation tests. Only regions with p-values below a chosen threshold (here, $p &amp;lt; 0.10$) are classified as belonging to a cluster.&lt;/p>
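&lt;p>The conditional permutation logic behind the pseudo p-values can be sketched for a single region: hold its own value fixed, repeatedly draw pseudo-neighbors at random from the remaining values, recompute $I_i$ each time, and locate the observed statistic in that reference distribution. The toy example below is a simplified one-sided illustration with hypothetical values and neighbor ids (&lt;code>esda&lt;/code> vectorizes this over all regions and reports the smaller tail):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(12345)
z = rng.standard_normal(30)            # standardized values (hypothetical)
i = 0                                  # region under test
neighbors_of_i = np.array([3, 7, 12])  # its neighbor ids (hypothetical)
w_row = np.full(len(neighbors_of_i), 1.0 / len(neighbors_of_i))  # row-standardized

I_obs = z[i] * (w_row @ z[neighbors_of_i])

# Conditional permutation: keep z[i] fixed, draw its "neighbors" at random
# from the other n-1 values, and rebuild the statistic each time.
others = np.delete(z, i)
perms = 999
I_perm = np.empty(perms)
for k in range(perms):
    fake_neighbors = rng.choice(others, size=len(neighbors_of_i), replace=False)
    I_perm[k] = z[i] * (w_row @ fake_neighbors)

# Pseudo p-value: share of permutations at least as extreme (one-sided here)
p_sim = ((I_perm >= I_obs).sum() + 1) / (perms + 1)
print(0 < p_sim <= 1)  # → True
```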
&lt;h3 id="92-lisa-for-hdi-2019">9.2 LISA for HDI 2019&lt;/h3>
&lt;p>We compute the local Moran&amp;rsquo;s I for SHDI in 2019 and visualize the results as a Moran scatter plot with significant regions colored by quadrant (left panel) and a cluster map (right panel).&lt;/p>
&lt;pre>&lt;code class="language-python">localMoran_2019 = Moran_Local(gdf[&amp;quot;shdi2019&amp;quot;], W, permutations=999, seed=12345)
wlag_2019 = lag_spatial(W, gdf[&amp;quot;shdi2019&amp;quot;].values)
sig_2019 = localMoran_2019.p_sim &amp;lt; 0.10
q_labels = {1: &amp;quot;HH&amp;quot;, 2: &amp;quot;LH&amp;quot;, 3: &amp;quot;LL&amp;quot;, 4: &amp;quot;HL&amp;quot;}
for q_val, q_name in q_labels.items():
count = ((localMoran_2019.q == q_val) &amp;amp; sig_2019).sum()
print(f&amp;quot; {q_name}: {count}&amp;quot;)
print(f&amp;quot; Not significant: {(~sig_2019).sum()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> HH: 30
LH: 1
LL: 37
HL: 5
Not significant: 80
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">LISA_COLORS = {1: &amp;quot;#d7191c&amp;quot;, 2: &amp;quot;#89cff0&amp;quot;, 3: &amp;quot;#2c7bb6&amp;quot;, 4: &amp;quot;#fdae61&amp;quot;}
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
# (a) LISA scatter plot with colored quadrants
ax = axes[0]
slope, intercept, _, _, _ = scipy_stats.linregress(gdf[&amp;quot;shdi2019&amp;quot;].values, wlag_2019)
# Non-significant points (grey)
ns_mask = ~sig_2019
ax.scatter(gdf.loc[ns_mask, &amp;quot;shdi2019&amp;quot;], wlag_2019[ns_mask],
color=&amp;quot;#bababa&amp;quot;, s=30, alpha=0.4, edgecolors=GRID_LINE,
linewidths=0.3, label=&amp;quot;ns&amp;quot;, zorder=2)
# Significant points colored by quadrant
for q_val, q_name in q_labels.items():
mask = (localMoran_2019.q == q_val) &amp;amp; sig_2019
if mask.any():
ax.scatter(gdf.loc[mask, &amp;quot;shdi2019&amp;quot;], wlag_2019[mask],
color=LISA_COLORS[q_val], s=40, alpha=0.8,
edgecolors=GRID_LINE, linewidths=0.3,
label=q_name, zorder=3)
# Regression line
x_range = np.array([gdf[&amp;quot;shdi2019&amp;quot;].min(), gdf[&amp;quot;shdi2019&amp;quot;].max()])
ax.plot(x_range, intercept + slope * x_range, color=WARM_ORANGE,
linewidth=1.2, zorder=1)
# Crosshairs at mean
ax.axhline(wlag_2019.mean(), color=GRID_LINE, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, zorder=0)
ax.axvline(gdf[&amp;quot;shdi2019&amp;quot;].mean(), color=GRID_LINE, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, zorder=0)
ax.set_xlabel(&amp;quot;SHDI 2019&amp;quot;)
ax.set_ylabel(&amp;quot;Spatial lag of SHDI 2019&amp;quot;)
ax.set_title(f&amp;quot;(a) Moran scatter plot (I = {moran_2019.I:.4f})&amp;quot;)
# (b) LISA cluster map
lisa_cluster(localMoran_2019, gdf, p=0.10,
legend_kwds={&amp;quot;bbox_to_anchor&amp;quot;: (0.02, 0.90)}, ax=axes[1])
axes[1].set_facecolor(DARK_NAVY)
axes[1].set_title(&amp;quot;(b) LISA clusters (p &amp;lt; 0.10)&amp;quot;)
# Label extreme LISA regions on both panels
label_idx = []
hh_mask = (localMoran_2019.q == 1) &amp;amp; sig_2019
if hh_mask.any():
label_idx += gdf.loc[hh_mask, &amp;quot;shdi2019&amp;quot;].nlargest(3).index.tolist()
ll_mask = (localMoran_2019.q == 3) &amp;amp; sig_2019
if ll_mask.any():
label_idx += gdf.loc[ll_mask, &amp;quot;shdi2019&amp;quot;].nsmallest(3).index.tolist()
hl_mask = (localMoran_2019.q == 4) &amp;amp; sig_2019
if hl_mask.any():
label_idx.append(gdf.loc[hl_mask, &amp;quot;shdi2019&amp;quot;].idxmax())
lh_mask = (localMoran_2019.q == 2) &amp;amp; sig_2019
if lh_mask.any():
label_idx.append(gdf.loc[lh_mask, &amp;quot;shdi2019&amp;quot;].idxmin())
# Scatter labels
texts = [axes[0].text(gdf.loc[i, &amp;quot;shdi2019&amp;quot;], wlag_2019[i], gdf.loc[i, &amp;quot;region&amp;quot;],
fontsize=7, color=LIGHT_TEXT) for i in label_idx]
adjust_text(texts, ax=axes[0], arrowprops=dict(arrowstyle=&amp;quot;-&amp;quot;, color=LIGHT_TEXT,
alpha=0.5, lw=0.5))
# Map labels
texts = [axes[1].text(gdf.geometry.iloc[i].centroid.x, gdf.geometry.iloc[i].centroid.y,
gdf.loc[i, &amp;quot;region_country&amp;quot;], fontsize=7, color=WHITE_TEXT, weight=&amp;quot;bold&amp;quot;)
for i in label_idx]
adjust_text(texts, ax=axes[1], arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=LIGHT_TEXT,
alpha=0.9, lw=1.2, mutation_scale=8))
plt.tight_layout()
plt.savefig(&amp;quot;esda2_lisa_2019.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_lisa_2019.png" alt="Two-panel LISA analysis for SHDI 2019: Moran scatter plot with labeled extreme regions (left) and LISA cluster map (right).">&lt;/p>
&lt;p>At the 10% significance level, the 2019 LISA analysis identifies &lt;strong>30 HH regions&lt;/strong>, &lt;strong>37 LL regions&lt;/strong>, &lt;strong>5 HL outliers&lt;/strong>, &lt;strong>1 LH outlier&lt;/strong>, and &lt;strong>80 non-significant regions&lt;/strong>. The labels highlight the extremes of each cluster type. The &lt;strong>three highest HH regions&lt;/strong> &amp;mdash; R. Metropolitana (CHL, SHDI = 0.883), C. Buenos Aires (ARG, 0.882), and Antofagasta (CHL, 0.875) &amp;mdash; anchor the Southern Cone prosperity core. The &lt;strong>three lowest LL regions&lt;/strong> &amp;mdash; Potaro-Siparuni (GUY, 0.558), Barima-Waini (GUY, 0.592), and Upper Takutu-Essequibo (GUY, 0.601) &amp;mdash; anchor the deprivation cluster in northern South America. &lt;strong>San Andres (COL)&lt;/strong> (0.789) appears as an HL outlier: a high-development island surrounded by lower-development mainland neighbors. &lt;strong>Potosi (BOL)&lt;/strong> (0.631) is the lone LH outlier: a lagging region surrounded by better-performing neighbors.&lt;/p>
&lt;h3 id="93-lisa-for-hdi-2013">9.3 LISA for HDI 2013&lt;/h3>
&lt;p>Repeating the analysis for 2013 allows us to compare how clusters have evolved over time.&lt;/p>
&lt;pre>&lt;code class="language-python">localMoran_2013 = Moran_Local(gdf[&amp;quot;shdi2013&amp;quot;], W, permutations=999, seed=12345)
wlag_2013 = lag_spatial(W, gdf[&amp;quot;shdi2013&amp;quot;].values)
sig_2013 = localMoran_2013.p_sim &amp;lt; 0.10
for q_val, q_name in q_labels.items():
count = ((localMoran_2013.q == q_val) &amp;amp; sig_2013).sum()
print(f&amp;quot; {q_name}: {count}&amp;quot;)
print(f&amp;quot; Not significant: {(~sig_2013).sum()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> HH: 31
LH: 0
LL: 29
HL: 5
Not significant: 88
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
# (a) LISA scatter plot with colored quadrants
ax = axes[0]
slope, intercept, _, _, _ = scipy_stats.linregress(gdf[&amp;quot;shdi2013&amp;quot;].values, wlag_2013)
ns_mask = ~sig_2013
ax.scatter(gdf.loc[ns_mask, &amp;quot;shdi2013&amp;quot;], wlag_2013[ns_mask],
color=&amp;quot;#bababa&amp;quot;, s=30, alpha=0.4, edgecolors=GRID_LINE,
linewidths=0.3, label=&amp;quot;ns&amp;quot;, zorder=2)
for q_val, q_name in q_labels.items():
mask = (localMoran_2013.q == q_val) &amp;amp; sig_2013
if mask.any():
ax.scatter(gdf.loc[mask, &amp;quot;shdi2013&amp;quot;], wlag_2013[mask],
color=LISA_COLORS[q_val], s=40, alpha=0.8,
edgecolors=GRID_LINE, linewidths=0.3,
label=q_name, zorder=3)
x_range = np.array([gdf[&amp;quot;shdi2013&amp;quot;].min(), gdf[&amp;quot;shdi2013&amp;quot;].max()])
ax.plot(x_range, intercept + slope * x_range, color=WARM_ORANGE,
linewidth=1.2, zorder=1)
ax.axhline(wlag_2013.mean(), color=GRID_LINE, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, zorder=0)
ax.axvline(gdf[&amp;quot;shdi2013&amp;quot;].mean(), color=GRID_LINE, linewidth=0.8, linestyle=&amp;quot;--&amp;quot;, zorder=0)
ax.set_xlabel(&amp;quot;SHDI 2013&amp;quot;)
ax.set_ylabel(&amp;quot;Spatial lag of SHDI 2013&amp;quot;)
ax.set_title(f&amp;quot;(a) Moran scatter plot (I = {moran_2013.I:.4f})&amp;quot;)
# (b) LISA cluster map
lisa_cluster(localMoran_2013, gdf, p=0.10,
legend_kwds={&amp;quot;bbox_to_anchor&amp;quot;: (0.02, 0.90)}, ax=axes[1])
axes[1].set_facecolor(DARK_NAVY)
axes[1].set_title(&amp;quot;(b) LISA clusters (p &amp;lt; 0.10)&amp;quot;)
# Label extreme LISA regions (3 HH, 3 LL, 1 HL; no LH in 2013)
label_idx = []
hh_mask = (localMoran_2013.q == 1) &amp;amp; sig_2013
if hh_mask.any():
label_idx += gdf.loc[hh_mask, &amp;quot;shdi2013&amp;quot;].nlargest(3).index.tolist()
ll_mask = (localMoran_2013.q == 3) &amp;amp; sig_2013
if ll_mask.any():
label_idx += gdf.loc[ll_mask, &amp;quot;shdi2013&amp;quot;].nsmallest(3).index.tolist()
hl_mask = (localMoran_2013.q == 4) &amp;amp; sig_2013
if hl_mask.any():
label_idx.append(gdf.loc[hl_mask, &amp;quot;shdi2013&amp;quot;].idxmax())
lh_mask = (localMoran_2013.q == 2) &amp;amp; sig_2013
if lh_mask.any():
label_idx.append(gdf.loc[lh_mask, &amp;quot;shdi2013&amp;quot;].idxmin())
texts = [axes[0].text(gdf.loc[i, &amp;quot;shdi2013&amp;quot;], wlag_2013[i], gdf.loc[i, &amp;quot;region&amp;quot;],
fontsize=7, color=LIGHT_TEXT) for i in label_idx]
adjust_text(texts, ax=axes[0], arrowprops=dict(arrowstyle=&amp;quot;-&amp;quot;, color=LIGHT_TEXT,
alpha=0.5, lw=0.5))
texts = [axes[1].text(gdf.geometry.iloc[i].centroid.x, gdf.geometry.iloc[i].centroid.y,
gdf.loc[i, &amp;quot;region_country&amp;quot;], fontsize=7, color=WHITE_TEXT, weight=&amp;quot;bold&amp;quot;)
for i in label_idx]
adjust_text(texts, ax=axes[1], arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=LIGHT_TEXT,
alpha=0.9, lw=1.2, mutation_scale=8))
plt.tight_layout()
plt.savefig(&amp;quot;esda2_lisa_2013.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_lisa_2013.png" alt="Two-panel LISA analysis for SHDI 2013: Moran scatter plot with labeled regions (left) and LISA cluster map (right).">&lt;/p>
&lt;p>The 2013 LISA analysis identifies &lt;strong>31 HH regions&lt;/strong>, &lt;strong>29 LL regions&lt;/strong>, &lt;strong>5 HL outliers&lt;/strong>, &lt;strong>0 LH outliers&lt;/strong>, and &lt;strong>88 non-significant regions&lt;/strong>. The same three HH leaders appear: C. Buenos Aires (ARG, 0.878), R. Metropolitana (CHL, 0.857), and Antofagasta (CHL, 0.852). The same three LL anchors persist: Potaro-Siparuni (GUY, 0.554), Barima-Waini (GUY, 0.577), and Upper Takutu-Essequibo (GUY, 0.585). The HL outlier in 2013 is &lt;strong>Nueva Esparta (VEN)&lt;/strong> (0.797) &amp;mdash; an island state that performed well despite its mainland neighbors. Comparing with 2019, the most striking change is the &lt;strong>expansion of the LL cluster&lt;/strong> from 29 to 37 regions, while the HH cluster remained roughly stable (31 to 30). This asymmetric evolution is consistent with the income decline concentrated in Venezuela, which pulled more regions into the deprivation cluster.&lt;/p>
&lt;h3 id="94-comparing-lisa-clusters-across-time">9.4 Comparing LISA clusters across time&lt;/h3>
&lt;p>A transition table reveals how regions moved between LISA categories from 2013 to 2019.&lt;/p>
&lt;pre>&lt;code class="language-python">sig_2013 = localMoran_2013.p_sim &amp;lt; 0.10
sig_2019 = localMoran_2019.p_sim &amp;lt; 0.10
q_labels = {1: &amp;quot;HH&amp;quot;, 2: &amp;quot;LH&amp;quot;, 3: &amp;quot;LL&amp;quot;, 4: &amp;quot;HL&amp;quot;}
labels_2013 = [&amp;quot;ns&amp;quot; if not sig_2013[i] else q_labels[localMoran_2013.q[i]]
for i in range(len(gdf))]
labels_2019 = [&amp;quot;ns&amp;quot; if not sig_2019[i] else q_labels[localMoran_2019.q[i]]
for i in range(len(gdf))]
transition_df = pd.crosstab(
pd.Series(labels_2013, name=&amp;quot;2013&amp;quot;),
pd.Series(labels_2019, name=&amp;quot;2019&amp;quot;)
)
print(transition_df.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">2019 HH HL LH LL ns
2013
HH 27 0 0 0 4
HL 0 2 0 2 1
LL 0 2 0 18 9
ns 3 1 1 17 66
&lt;/code>&lt;/pre>
&lt;p>The transition table reveals strong &lt;strong>cluster persistence&lt;/strong>. Of the 31 regions in the HH cluster in 2013, &lt;strong>27 remained HH&lt;/strong> in 2019 (87% persistence), while only 4 became non-significant. Of the 29 LL regions in 2013, &lt;strong>18 remained LL&lt;/strong> (62% persistence). The most notable transition is from non-significant to LL: &lt;strong>17 regions&lt;/strong> that were not part of any significant cluster in 2013 joined the low-development cluster by 2019. This expansion of the LL cluster, combined with the high persistence of HH, paints a picture of entrenched spatial inequality &amp;mdash; prosperity clusters are stable, and deprivation clusters are growing.&lt;/p>
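&lt;p>The persistence rates quoted above come straight from the diagonal of the cross-tabulation. A small pandas sketch, re-entering the printed counts by hand:&lt;/p>

```python
import pandas as pd

# Transition counts copied from the printed crosstab (2013 rows, 2019 columns)
table = pd.DataFrame(
    {"HH": [27, 0, 0, 3], "HL": [0, 2, 2, 1], "LH": [0, 0, 0, 1],
     "LL": [0, 2, 18, 17], "ns": [4, 1, 9, 66]},
    index=["HH", "HL", "LL", "ns"],
)
# Persistence = diagonal count / 2013 row total
persistence = {c: table.loc[c, c] / table.loc[c].sum() for c in ["HH", "LL"]}
print(f"HH persistence: {persistence['HH']:.0%}")  # → 87%
print(f"LL persistence: {persistence['LL']:.0%}")  # → 62%
```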
&lt;h2 id="10-space-time-dynamics">10. Space-time dynamics&lt;/h2>
&lt;h3 id="101-directional-moran-scatter-plot">10.1 Directional Moran scatter plot&lt;/h3>
&lt;p>The LISA transition table tracks changes in statistical significance, but regions can also move &lt;em>within&lt;/em> the Moran scatter plot even without crossing significance thresholds. A &lt;strong>directional Moran scatter plot&lt;/strong> shows the movement vector for each region from its 2013 position to its 2019 position in the (standardized value, spatial lag) space. The arrows reveal the direction and magnitude of change in both a region&amp;rsquo;s own development and its neighbors' development.&lt;/p>
&lt;p>To make the two periods comparable, we standardize both years using the &lt;strong>pooled mean and standard deviation&lt;/strong> (across both periods combined), following the same logic as the &lt;a href="https://carlos-mendez.org/post/python_pca2/">Pooled PCA tutorial&lt;/a>.&lt;/p>
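&lt;p>A toy example (hypothetical numbers) shows why the pooling matters: if every region improves by the same amount, year-by-year standardization erases the shift completely, while pooled standardization preserves it as a uniform rightward movement:&lt;/p>

```python
import numpy as np

x13 = np.array([0.60, 0.70, 0.80])  # hypothetical 2013 values
x19 = x13 + 0.05                    # every region improves by 0.05

# Per-year standardization: the improvement vanishes
z13 = (x13 - x13.mean()) / x13.std()
z19 = (x19 - x19.mean()) / x19.std()
print(np.allclose(z13, z19))  # → True: identical positions, shift lost

# Pooled standardization: the rightward shift survives
pooled = np.concatenate([x13, x19])
m, s = pooled.mean(), pooled.std()
print(np.allclose((x19 - m) / s - (x13 - m) / s, 0.05 / s))  # → True
```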
&lt;pre>&lt;code class="language-python">from libpysal.weights import lag_spatial
# Standardize using pooled parameters
mean_all = np.mean(np.concatenate([gdf[&amp;quot;shdi2013&amp;quot;].values, gdf[&amp;quot;shdi2019&amp;quot;].values]))
std_all = np.std(np.concatenate([gdf[&amp;quot;shdi2013&amp;quot;].values, gdf[&amp;quot;shdi2019&amp;quot;].values]))
z_2013 = (gdf[&amp;quot;shdi2013&amp;quot;].values - mean_all) / std_all
z_2019 = (gdf[&amp;quot;shdi2019&amp;quot;].values - mean_all) / std_all
# Spatial lags
wz_2013 = lag_spatial(W, z_2013)
wz_2019 = lag_spatial(W, z_2019)
fig, ax = plt.subplots(figsize=(9, 8))
for i in range(len(gdf)):
ax.annotate(&amp;quot;&amp;quot;, xy=(z_2019[i], wz_2019[i]),
xytext=(z_2013[i], wz_2013[i]),
arrowprops=dict(arrowstyle=&amp;quot;-&amp;gt;&amp;quot;, color=STEEL_BLUE,
alpha=0.5, lw=0.8))
ax.scatter(z_2013, wz_2013, color=WARM_ORANGE, s=20, alpha=0.6,
label=&amp;quot;2013&amp;quot;, zorder=4)
ax.scatter(z_2019, wz_2019, color=TEAL, s=20, alpha=0.6,
label=&amp;quot;2019&amp;quot;, zorder=4)
ax.axhline(0, color=GRID_LINE, linewidth=1)
ax.axvline(0, color=GRID_LINE, linewidth=1)
ax.set_xlabel(&amp;quot;SHDI (standardized)&amp;quot;)
ax.set_ylabel(&amp;quot;Spatial lag of SHDI&amp;quot;)
ax.set_title(&amp;quot;Directional Moran scatter plot: movements from 2013 to 2019&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;esda2_directional_moran.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_directional_moran.png" alt="Directional Moran scatter plot showing movement vectors from 2013 to 2019 positions for each region.">&lt;/p>
&lt;pre>&lt;code class="language-python"># Classify quadrant transitions
q_2013 = np.where((z_2013 &amp;gt;= 0) &amp;amp; (wz_2013 &amp;gt;= 0), &amp;quot;HH&amp;quot;,
np.where((z_2013 &amp;lt; 0) &amp;amp; (wz_2013 &amp;gt;= 0), &amp;quot;LH&amp;quot;,
np.where((z_2013 &amp;lt; 0) &amp;amp; (wz_2013 &amp;lt; 0), &amp;quot;LL&amp;quot;, &amp;quot;HL&amp;quot;)))
q_2019 = np.where((z_2019 &amp;gt;= 0) &amp;amp; (wz_2019 &amp;gt;= 0), &amp;quot;HH&amp;quot;,
np.where((z_2019 &amp;lt; 0) &amp;amp; (wz_2019 &amp;gt;= 0), &amp;quot;LH&amp;quot;,
np.where((z_2019 &amp;lt; 0) &amp;amp; (wz_2019 &amp;lt; 0), &amp;quot;LL&amp;quot;, &amp;quot;HL&amp;quot;)))
transition_moran = pd.crosstab(
pd.Series(q_2013, name=&amp;quot;2013&amp;quot;),
pd.Series(q_2019, name=&amp;quot;2019&amp;quot;)
)
print(transition_moran.to_string())
stayed = (q_2013 == q_2019).sum()
moved = (q_2013 != q_2019).sum()
print(f&amp;quot;\nStayed in same quadrant: {stayed} ({stayed/len(gdf)*100:.1f}%)&amp;quot;)
print(f&amp;quot;Moved to different quadrant: {moved} ({moved/len(gdf)*100:.1f}%)&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">2019 HH HL LH LL
2013
HH 41 1 2 10
HL 9 6 0 5
LH 0 0 2 3
LL 7 10 11 46
Stayed in same quadrant: 95 (62.1%)
Moved to different quadrant: 58 (37.9%)
&lt;/code>&lt;/pre>
&lt;p>The directional Moran scatter plot reveals the space-time dynamics of South American development. &lt;strong>95 regions (62.1%)&lt;/strong> remained in the same Moran scatter plot quadrant between 2013 and 2019, while &lt;strong>58 (37.9%)&lt;/strong> crossed quadrant boundaries. The most stable quadrants are HH (41 of 54 stayed, 76%) and LL (46 of 74 stayed, 62%), confirming that both prosperity and deprivation clusters are persistent. The most common transitions are LL to LH (11 regions) and HL to HH (9 regions), suggesting some upward mobility at the boundary of the prosperity cluster. However, the 10 HH-to-LL transitions highlight that the Venezuelan crisis pulled previously well-performing regions into the low-development quadrant &amp;mdash; a dramatic downward trajectory that affected both the regions themselves and their neighbors.&lt;/p>
&lt;h3 id="102-country-focus-venezuela-vs-bolivia">10.2 Country focus: Venezuela vs Bolivia&lt;/h3>
&lt;p>Venezuela and Bolivia offer a stark contrast in subnational development trajectories. In 2013, Venezuela&amp;rsquo;s regions were spread across the upper half of the Moran scatter plot &amp;mdash; 13 of 24 regions sat in the HH quadrant, reflecting relatively high development levels and high-development neighbors. Bolivia&amp;rsquo;s 9 regions, by contrast, were concentrated in the lower-left corner (8 in LL, 1 in LH). By 2019, these two countries had moved in opposite directions. We isolate them in the directional Moran scatter plot to compare their movement vectors.&lt;/p>
&lt;pre>&lt;code class="language-python"># Filter Venezuela and Bolivia regions
ven_mask = gdf[&amp;quot;country&amp;quot;] == &amp;quot;Venezuela&amp;quot;
bol_mask = gdf[&amp;quot;country&amp;quot;] == &amp;quot;Bolivia&amp;quot;
# Shared axis limits (from the full dataset, for comparability)
all_z = np.concatenate([z_2013, z_2019])
all_wz = np.concatenate([wz_2013, wz_2019])
pad = 0.3
shared_xlim = (all_z.min() - pad, all_z.max() + pad)
shared_ylim = (all_wz.min() - pad, all_wz.max() + pad)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 7))
for ax, mask, title in [
(axes[0], bol_mask, &amp;quot;(a) Bolivia&amp;quot;),
(axes[1], ven_mask, &amp;quot;(b) Venezuela&amp;quot;),
]:
# Background: all regions (grey, faded)
for i in range(len(gdf)):
ax.annotate(&amp;quot;&amp;quot;, xy=(z_2019[i], wz_2019[i]),
xytext=(z_2013[i], wz_2013[i]),
arrowprops=dict(arrowstyle=&amp;quot;-&amp;gt;&amp;quot;, color=GRID_LINE,
alpha=0.15, lw=0.5))
ax.scatter(z_2013, wz_2013, color=GRID_LINE, s=10, alpha=0.15, zorder=2)
ax.scatter(z_2019, wz_2019, color=GRID_LINE, s=10, alpha=0.15, zorder=2)
# Highlighted country
for i in gdf.index[mask]:
ax.annotate(&amp;quot;&amp;quot;, xy=(z_2019[i], wz_2019[i]),
xytext=(z_2013[i], wz_2013[i]),
arrowprops=dict(arrowstyle=&amp;quot;-&amp;gt;&amp;quot;, color=STEEL_BLUE,
alpha=0.7, lw=1.0))
ax.scatter(z_2013[mask], wz_2013[mask], color=WARM_ORANGE, s=30,
alpha=0.8, edgecolors=GRID_LINE, linewidths=0.3,
label=&amp;quot;2013&amp;quot;, zorder=5)
ax.scatter(z_2019[mask], wz_2019[mask], color=TEAL, s=30,
alpha=0.8, edgecolors=GRID_LINE, linewidths=0.3,
label=&amp;quot;2019&amp;quot;, zorder=5)
# Labels at 2019 positions
texts = []
for i in gdf.index[mask]:
texts.append(ax.text(z_2019[i], wz_2019[i], gdf.loc[i, &amp;quot;region&amp;quot;],
fontsize=7, color=LIGHT_TEXT))
adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle=&amp;quot;-&amp;quot;, color=LIGHT_TEXT,
alpha=0.5, lw=0.5))
# Quadrant lines and labels
ax.axhline(0, color=GRID_LINE, linewidth=1, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=1, zorder=1)
ax.set_xlim(shared_xlim)
ax.set_ylim(shared_ylim)
ox = (shared_xlim[1] - shared_xlim[0]) * 0.05
oy = (shared_ylim[1] - shared_ylim[0]) * 0.05
for lbl, ha, va, x, y in [
(&amp;quot;HH&amp;quot;, &amp;quot;right&amp;quot;, &amp;quot;top&amp;quot;, shared_xlim[1] - ox, shared_ylim[1] - oy),
(&amp;quot;LH&amp;quot;, &amp;quot;left&amp;quot;, &amp;quot;top&amp;quot;, shared_xlim[0] + ox, shared_ylim[1] - oy),
(&amp;quot;LL&amp;quot;, &amp;quot;left&amp;quot;, &amp;quot;bottom&amp;quot;, shared_xlim[0] + ox, shared_ylim[0] + oy),
(&amp;quot;HL&amp;quot;, &amp;quot;right&amp;quot;, &amp;quot;bottom&amp;quot;, shared_xlim[1] - ox, shared_ylim[0] + oy),
]:
ax.text(x, y, lbl, fontsize=14, ha=ha, va=va,
color=LIGHT_TEXT, alpha=0.6)
ax.set_xlabel(&amp;quot;SHDI (standardized)&amp;quot;)
ax.set_ylabel(&amp;quot;Spatial lag of SHDI&amp;quot;)
ax.set_title(title)
ax.legend(fontsize=8)
plt.tight_layout()
plt.savefig(&amp;quot;esda2_directional_ven_bol.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="esda2_directional_ven_bol.png" alt="Side-by-side directional Moran scatter plots for Bolivia (left) and Venezuela (right), showing movement vectors from 2013 to 2019.">&lt;/p>
&lt;pre>&lt;code class="language-python"># Summary statistics for Venezuela and Bolivia
for country, mask in [(&amp;quot;Venezuela&amp;quot;, ven_mask), (&amp;quot;Bolivia&amp;quot;, bol_mask)]:
n = mask.sum()
mean_change = gdf.loc[mask, &amp;quot;shdi_change&amp;quot;].mean()
min_change = gdf.loc[mask, &amp;quot;shdi_change&amp;quot;].min()
max_change = gdf.loc[mask, &amp;quot;shdi_change&amp;quot;].max()
# Quadrant transitions
q13 = q_2013[mask]
q19 = q_2019[mask]
stayed = (q13 == q19).sum()
moved = (q13 != q19).sum()
print(f&amp;quot;\n{country} ({n} regions):&amp;quot;)
print(f&amp;quot; Mean SHDI change: {mean_change:+.4f}&amp;quot;)
print(f&amp;quot; Range: [{min_change:+.4f}, {max_change:+.4f}]&amp;quot;)
print(f&amp;quot; Quadrant stability: {stayed} stayed, {moved} moved&amp;quot;)
print(f&amp;quot; 2013 quadrants: {', '.join(f'{q}={c}' for q, c in zip(*np.unique(q13, return_counts=True)))}&amp;quot;)
print(f&amp;quot; 2019 quadrants: {', '.join(f'{q}={c}' for q, c in zip(*np.unique(q19, return_counts=True)))}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Venezuela (24 regions):
Mean SHDI change: -0.0653
Range: [-0.0670, -0.0640]
Quadrant stability: 3 stayed, 21 moved
2013 quadrants: HH=13, HL=5, LH=3, LL=3
2019 quadrants: HL=1, LH=2, LL=21
Bolivia (9 regions):
Mean SHDI change: +0.0333
Range: [+0.0300, +0.0350]
Quadrant stability: 7 stayed, 2 moved
2013 quadrants: LH=1, LL=8
2019 quadrants: HL=1, LH=2, LL=6
&lt;/code>&lt;/pre>
&lt;p>Panel (a) shows Bolivia&amp;rsquo;s modest but consistent rightward movement. All 9 regions started in the lower-left portion of the plot (8 in LL, 1 in LH) and shifted rightward by 2019, reflecting genuine improvement in own-region development. The mean SHDI change was &lt;strong>+0.033&lt;/strong>, with a remarkably tight range ([+0.030, +0.035]) indicating that the gains were broad-based across all Bolivian regions. &lt;strong>Seven of 9 regions (78%) remained in the same quadrant&lt;/strong>, with 2 moving out of LL &amp;mdash; one to LH and one to HL. The arrows are short and point consistently to the right, meaning Bolivia improved its own development levels without substantially changing the spatial lag (its neighbors' conditions remained similar). This pattern suggests steady, internally driven progress that has not yet been large enough to escape the low-development spatial cluster.&lt;/p>
&lt;p>Panel (b) tells the opposite story. &lt;strong>Venezuela&amp;rsquo;s 24 regions experienced the most dramatic downward shift&lt;/strong> in the entire dataset, with a mean SHDI change of &lt;strong>-0.065&lt;/strong>. In 2013, Venezuelan regions were spread across the upper portion of the plot &amp;mdash; 13 in HH, 5 in HL, 3 in LH, and only 3 in LL. By 2019, the picture had completely inverted: &lt;strong>21 of 24 regions (88%) crossed quadrant boundaries&lt;/strong>, with 21 ending in the LL quadrant. The arrows sweep uniformly downward and to the left, reflecting both the collapse of each region&amp;rsquo;s own development level and the negative spillover onto its neighbors' spatial lags. The narrow range of change ([-0.067, -0.064]) reveals that the crisis was not localized to a few regions &amp;mdash; it was a near-uniform national collapse that dragged every Venezuelan region, regardless of its 2013 starting point, into the low-development quadrant.&lt;/p>
&lt;p>The juxtaposition is instructive. Bolivia&amp;rsquo;s arrows are short, rightward, and clustered &amp;mdash; a country making incremental gains within a stable spatial structure. Venezuela&amp;rsquo;s arrows are long, southwest-pointing, and tightly bundled &amp;mdash; a country experiencing systemic collapse that erased decades of development advantage in just six years. The contrast highlights how economic crises can propagate spatially: Venezuela&amp;rsquo;s decline did not just reduce its own regions' development, it also pulled down the spatial lags of neighboring Colombian and Brazilian border regions, contributing to the expansion of the LL cluster documented in Section 9.&lt;/p>
&lt;h2 id="11-discussion">11. Discussion&lt;/h2>
&lt;p>&lt;strong>Spatial autocorrelation in South American human development is strong and persistent.&lt;/strong> Global Moran&amp;rsquo;s I increased from 0.568 in 2013 to 0.632 in 2019 (both p = 0.001), indicating that the spatial clustering of development levels strengthened over the period. This means the development gap between prosperous and lagging regions is not only large but spatially structured &amp;mdash; high-development regions form a contiguous band across the Southern Cone, while low-development regions form an equally contiguous band across the Amazon basin and northern South America.&lt;/p>
&lt;p>The LISA analysis pinpoints these clusters with precision. In 2019, 30 regions form a significant HH cluster (high development surrounded by high-development neighbors) and 37 regions form a significant LL cluster (low development surrounded by low-development neighbors). The LL cluster expanded from 29 to 37 regions between 2013 and 2019, driven primarily by Venezuela&amp;rsquo;s economic crisis and its spillover effects on neighboring regions. The HH cluster remained stable (31 to 30), with 87% persistence &amp;mdash; a sign that prosperity corridors in the Southern Cone are structurally entrenched.&lt;/p>
&lt;p>The space-time analysis reveals that 62% of regions stayed in the same Moran scatter plot quadrant, but the 38% that moved tell an important story. The most concerning transitions are the 10 regions that moved from HH to LL and the 17 previously non-significant regions that joined the LL LISA cluster. These movements are concentrated in Venezuela and its neighbors, illustrating how economic shocks can propagate spatially.&lt;/p>
&lt;p>The &lt;strong>Venezuela&amp;ndash;Bolivia comparison&lt;/strong> crystallizes the two forces shaping South America&amp;rsquo;s spatial development landscape. Venezuela&amp;rsquo;s 24 regions collapsed nearly uniformly (mean SHDI change of -0.065, with 88% crossing quadrant boundaries), transforming a country that was largely in the HH quadrant in 2013 into one almost entirely in the LL quadrant by 2019. Bolivia&amp;rsquo;s 9 regions, starting from a much lower base, improved steadily (+0.033) with 78% quadrant stability. These divergent trajectories illustrate that spatial clusters are not static: they can expand rapidly through crisis-driven contagion (Venezuela pulling its neighbors downward) or contract slowly through sustained internal improvement (Bolivia gradually lifting its regions rightward in the Moran scatter plot). The fact that Venezuela&amp;rsquo;s decline was spatially contagious &amp;mdash; dragging down the spatial lags of neighboring Colombian and Brazilian border regions &amp;mdash; while Bolivia&amp;rsquo;s improvement remained spatially contained underscores an asymmetry: negative shocks propagate faster and farther across borders than positive ones.&lt;/p>
&lt;p>For policy, these findings suggest that &lt;strong>spatially targeted interventions&lt;/strong> may be more effective than uniform national programs. The persistent LL clusters represent development traps where a region&amp;rsquo;s own conditions are reinforced by the equally poor conditions of its neighbors. Breaking these traps may require coordinated cross-regional or cross-border programs that address the spatial dimension of underdevelopment. Bolivia&amp;rsquo;s experience suggests that broad-based national improvement can lift all regions, but escaping the low-development spatial cluster may require the additional step of improving neighbors' conditions simultaneously &amp;mdash; a challenge that calls for cross-border cooperation.&lt;/p>
&lt;h2 id="12-summary-and-next-steps">12. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> ESDA reveals spatial patterns invisible in aspatial analysis. The same dataset that shows a modest aggregate improvement (+0.005 SHDI) conceals a deepening spatial divide &amp;mdash; Moran&amp;rsquo;s I increased from 0.568 to 0.632, meaning spatial clustering strengthened between 2013 and 2019.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> 30 HH and 37 LL regions form statistically significant clusters at the 10% level. The LL cluster expanded by 8 regions (from 29 to 37), while the HH cluster remained stable. Cluster persistence is high: 87% for HH and 62% for LL, indicating entrenched spatial inequality.&lt;/li>
&lt;li>&lt;strong>Country insight:&lt;/strong> Venezuela and Bolivia illustrate contrasting development dynamics. Venezuela&amp;rsquo;s 24 regions collapsed nearly uniformly (mean -0.065), with 88% crossing quadrant boundaries from the upper to the lower portion of the Moran scatter plot. Bolivia&amp;rsquo;s 9 regions improved steadily (+0.033) with 78% quadrant stability, showing broad-based gains that have not yet been large enough to escape the LL spatial cluster.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> Queen contiguity assumes shared borders, which excludes island territories (San Andres, Nueva Esparta) and may not capture cross-water economic linkages. With only two time periods (2013 and 2019), we cannot distinguish permanent structural clusters from temporary effects of the Venezuelan crisis. The p = 0.10 significance threshold is relatively permissive.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend the analysis with spatial regression models (spatial lag and spatial error models) to test whether a region&amp;rsquo;s development is directly influenced by its neighbors' development, or whether the clustering is driven by shared underlying factors. Bivariate LISA could reveal whether income clusters coincide with education clusters. Adding more time periods (2000&amp;ndash;2019) from the full Global Data Lab series would enable Spatial Markov chain analysis of cluster transition probabilities.&lt;/li>
&lt;/ul>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Income clusters.&lt;/strong> Repeat the LISA analysis for the income index (&lt;code>incindex2019&lt;/code>) instead of SHDI. Are income clusters in the same locations as HDI clusters? How many regions belong to both an income LL and an HDI LL cluster?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Alternative weights.&lt;/strong> Build k-nearest neighbors weights (&lt;code>KNN&lt;/code> from &lt;code>libpysal.weights&lt;/code>) with $k = 5$ and Rook contiguity (&lt;code>Rook&lt;/code> from &lt;code>libpysal.weights&lt;/code>) instead of Queen contiguity. How does Moran&amp;rsquo;s I change under each specification? Does the KNN approach resolve the island problem?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bivariate Moran.&lt;/strong> Use &lt;a href="https://pysal.org/esda/generated/esda.Moran_BV.html" target="_blank" rel="noopener">&lt;code>Moran_BV&lt;/code>&lt;/a> from esda to compute the bivariate Moran&amp;rsquo;s I between education and income indices. Are regions with high education surrounded by regions with high income, or are the two dimensions spatially independent?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Spatial autocorrelation of change.&lt;/strong> Compute Moran&amp;rsquo;s I for &lt;code>shdi_change&lt;/code> instead of the level variables. Is the &lt;em>change&lt;/em> in SHDI between 2013 and 2019 itself spatially clustered? Compare the result with the change choropleth from Section 6.2. Hint: &lt;code>Moran(gdf[&amp;quot;shdi_change&amp;quot;], W, permutations=999)&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Component-level Moran&amp;rsquo;s I.&lt;/strong> Compute Moran&amp;rsquo;s I for the health, education, and income indices separately in both 2013 and 2019. Which component shows the strongest spatial autocorrelation? Does the income index &amp;mdash; which declined in 46% of regions &amp;mdash; show a different spatial pattern than health or education?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multiple testing sensitivity.&lt;/strong> Re-run the 2019 LISA analysis at $p &amp;lt; 0.05$ instead of $p &amp;lt; 0.10$. How many HH and LL regions survive the stricter threshold? Research the Bonferroni correction ($0.05 / 153 \approx 0.0003$) and the False Discovery Rate (FDR) procedure &amp;mdash; how would these affect the cluster counts?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Neighbor count distribution.&lt;/strong> Plot a histogram of the number of neighbors per region from the Queen weights matrix (use &lt;code>W.cardinalities&lt;/code>). What is the shape of the distribution? Which regions have the most and fewest neighbors, and why?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the Moran&amp;rsquo;s I increase significant?&lt;/strong> Moran&amp;rsquo;s I rose from 0.568 to 0.632 between 2013 and 2019. But does this difference pass a significance test? Try a bootstrap approach: pool the 2013 and 2019 SHDI values, randomly assign them to the two periods 999 times, and compute the difference in Moran&amp;rsquo;s I each time. Where does the observed difference (0.064) fall in the bootstrap distribution?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Moran&amp;rsquo;s I excluding Venezuela.&lt;/strong> Recompute Moran&amp;rsquo;s I for 2013 and 2019 after dropping Venezuela&amp;rsquo;s 24 regions (rebuild the Queen weights on the subset GeoDataFrame). Does the increase in spatial autocorrelation survive? If not, the &amp;ldquo;deepening spatial divide&amp;rdquo; may be driven by a single country&amp;rsquo;s crisis rather than a continent-wide trend.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LISA significance map.&lt;/strong> Create a choropleth map coloring each region by its LISA p-value (&lt;code>localMoran_2019.p_sim&lt;/code>) using a sequential colormap. How many regions have $p &amp;lt; 0.01$ vs $p &amp;lt; 0.05$ vs $p &amp;lt; 0.10$? Are the deeply significant regions ($p &amp;lt; 0.01$) concentrated in the same locations as the cluster map from Section 9.2?&lt;/p>
&lt;/li>
&lt;/ol>
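&lt;p>As a starting point for Exercise 8, the sketch below runs the pooled-permutation logic on a toy chain of 10 regions using plain NumPy, standing in for the Queen weights and SHDI data used above. The weights, values, and seed are synthetic, so only the mechanics carry over.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def morans_i(x, W):
    # Moran's I with a row-standardized weights matrix W (n x n array)
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(42)

# Toy example: a chain of 10 regions with binary contiguity weights,
# standing in for the Queen weights used in the tutorial
n = 10
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)  # row-standardize

x2013 = np.sort(rng.normal(size=n))  # ordered values: strong clustering
x2019 = rng.normal(size=n)           # random values: weak clustering
observed_diff = morans_i(x2013, W) - morans_i(x2019, W)

# Pool both periods and reassign values to the periods at random
pooled = np.concatenate([x2013, x2019])
diffs = np.empty(999)
for b in range(999):
    perm = rng.permutation(pooled)
    diffs[b] = morans_i(perm[:n], W) - morans_i(perm[n:], W)

# Pseudo p-value: share of permuted differences at least as extreme
p_boot = (np.greater_equal(diffs, observed_diff).sum() + 1) / (999 + 1)
print(round(observed_diff, 3), round(p_boot, 3))
&lt;/code>&lt;/pre>
&lt;p>To adapt it, substitute the dense Queen matrix (&lt;code>W.full()[0]&lt;/code> in libpysal) and the 2013/2019 SHDI columns for the synthetic inputs.&lt;/p>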
&lt;h2 id="14-references">14. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1111/j.1538-4632.1995.tb00338.x" target="_blank" rel="noopener">Anselin, L. (1995). Local Indicators of Spatial Association &amp;mdash; LISA. &lt;em>Geographical Analysis&lt;/em>, 27(2), 93&amp;ndash;115.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits, J. and Permanyer, I. (2019). The Subnational Human Development Database. &lt;em>Scientific Data&lt;/em>, 6, 190038.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://rrs.scholasticahq.com/article/8285" target="_blank" rel="noopener">Rey, S. J. and Anselin, L. (2007). PySAL: A Python Library of Spatial Analytical Methods. &lt;em>Review of Regional Studies&lt;/em>, 37(1), 5&amp;ndash;27.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Global Data Lab &amp;mdash; Subnational Human Development Index&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pysal.org/esda/" target="_blank" rel="noopener">PySAL ESDA documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://splot.readthedocs.io/" target="_blank" rel="noopener">splot documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/post/python_pca2/">Mendez, C. (2026). Pooled PCA for Building Development Indicators Across Time.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez, C. and Gonzales, E. (2021). Human Capital Constraints, Spatial Dependence, and Regionalization in Bolivia. &lt;em>Economia&lt;/em>, 44(87).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/post/python_monitor_regional_development/">Mendez, C. (2026). Monitoring Regional Development with Python.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Multiscale Geographically Weighted Regression: Spatially Varying Economic Convergence in Indonesia</title><link>https://carlos-mendez.org/post/python_mgwr/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_mgwr/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>When we ask &amp;ldquo;do poorer regions catch up to richer ones?&amp;rdquo;, the standard approach is to run a single regression across all regions and report one coefficient. But what if the answer depends on &lt;em>where&lt;/em> you look? A negative coefficient in Sumatra does not mean the same process is at work in Papua. A global regression forces every district onto the same line &amp;mdash; and in doing so, it may hide the most interesting part of the story.&lt;/p>
&lt;p>&lt;strong>Multiscale Geographically Weighted Regression (MGWR)&lt;/strong> addresses this by estimating a separate set of coefficients at every location, weighted by proximity. Its key innovation over standard GWR is that each variable is allowed to operate at its own spatial scale. The intercept (representing baseline growth conditions) might vary smoothly across large regions, while the convergence coefficient might shift sharply between neighboring districts. MGWR discovers these scales from the data rather than imposing a single bandwidth on all variables.&lt;/p>
&lt;p>This tutorial applies MGWR to &lt;strong>514 Indonesian districts&lt;/strong> to answer: &lt;strong>does economic catching-up happen at the same pace everywhere in Indonesia, or does geography shape how fast poorer districts close the gap?&lt;/strong> We progress from a global regression baseline through MGWR estimation and coefficient mapping, revealing that the global R² of 0.214 jumps to 0.762 once we allow the relationship to vary across space.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why a single regression coefficient may hide important spatial variation&lt;/li>
&lt;li>Estimate location-specific relationships with spatially varying coefficients&lt;/li>
&lt;li>Apply MGWR to allow each variable to operate at its own spatial scale&lt;/li>
&lt;li>Map and interpret spatially varying coefficients across Indonesia&lt;/li>
&lt;li>Compare global OLS vs MGWR model fit and diagnostics&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-modeling-pipeline">2. The modeling pipeline&lt;/h2>
&lt;p>The analysis follows a natural progression: start with a simple global model, visualize the spatial patterns it cannot capture, then let MGWR reveal the local structure.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Load &amp;amp;&amp;lt;br/&amp;gt;Explore&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Map&amp;lt;br/&amp;gt;Variables&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Global&amp;lt;br/&amp;gt;OLS&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;MGWR&amp;lt;br/&amp;gt;Estimation&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Map&amp;lt;br/&amp;gt;Coefficients&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Significance&amp;lt;br/&amp;gt;&amp;amp; Compare&amp;quot;]
style A fill:#141413,stroke:#6a9bcc,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis uses &lt;a href="https://mgwr.readthedocs.io/" target="_blank" rel="noopener">mgwr&lt;/a> for multiscale regression, &lt;a href="https://geopandas.org/" target="_blank" rel="noopener">GeoPandas&lt;/a> for spatial data, and &lt;a href="https://pysal.org/mapclassify/" target="_blank" rel="noopener">mapclassify&lt;/a> for choropleth classification.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import mapclassify
from scipy import stats
from mgwr.gwr import MGWR
from mgwr.sel_bw import Sel_BW
import warnings
warnings.filterwarnings(&amp;quot;ignore&amp;quot;)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python">DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False,
&amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False,
&amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True,
&amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6,
&amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;xtick.major.size&amp;quot;: 0,
&amp;quot;ytick.major.size&amp;quot;: 0,
&amp;quot;text.color&amp;quot;: WHITE_TEXT,
&amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False,
&amp;quot;legend.fontsize&amp;quot;: 11,
&amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-data-loading-and-exploration">4. Data loading and exploration&lt;/h2>
&lt;p>The dataset covers &lt;strong>514 Indonesian districts&lt;/strong> with GDP per capita in 2010 and the subsequent growth rate through 2018. Indonesia is an ideal setting for studying spatial heterogeneity: it spans over 17,000 islands across 5,000 km of ocean, with enormous variation in economic structure, geography, and institutional capacity.&lt;/p>
&lt;p>The core idea behind convergence is straightforward: if poorer districts tend to grow faster than richer ones, the income gap narrows over time. In a regression framework, this means we expect a &lt;strong>negative relationship&lt;/strong> between initial income (log GDP per capita in 2010) and subsequent growth. The question is whether that negative relationship holds uniformly across the archipelago &amp;mdash; or whether it is stronger in some places and weaker (or even reversed) in others.&lt;/p>
&lt;pre>&lt;code class="language-python">CSV_URL = (&amp;quot;https://github.com/quarcs-lab/data-quarcs/raw/refs/heads/&amp;quot;
&amp;quot;master/indonesia514/dataBeta.csv&amp;quot;)
GEO_URL = (&amp;quot;https://github.com/quarcs-lab/data-quarcs/raw/refs/heads/&amp;quot;
&amp;quot;master/indonesia514/mapIdonesia514-opt.geojson&amp;quot;)
df = pd.read_csv(CSV_URL)
geo = gpd.read_file(GEO_URL)
gdf = geo.merge(df, on=&amp;quot;districtID&amp;quot;, how=&amp;quot;left&amp;quot;)
print(f&amp;quot;Loaded: {gdf.shape[0]} districts, {gdf.shape[1]} columns&amp;quot;)
print(gdf[[&amp;quot;ln_gdppc2010&amp;quot;, &amp;quot;g&amp;quot;]].describe().round(4).to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 514 districts, 16 columns
ln_gdppc2010 g
count 514.0000 514.0000
mean 9.8371 0.3860
std 0.7603 0.3205
min 7.1657 -2.0452
25% 9.3983 0.2583
50% 9.7626 0.3453
75% 10.1739 0.4158
max 13.4438 2.0563
&lt;/code>&lt;/pre>
&lt;p>The 514 districts span a wide range of initial income: log GDP per capita ranges from 7.17 (the poorest district, roughly \$1,300 per capita) to 13.44 (the richest, roughly \$690,000 &amp;mdash; likely a resource-extraction enclave). Growth rates also vary enormously, from -2.05 (severe contraction) to +2.06 (rapid expansion), with a mean of 0.39. This high variance in both variables suggests that a single regression line will struggle to capture the full picture.&lt;/p>
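&lt;p>These dollar figures follow directly from exponentiating the summary statistics; a quick back-of-envelope check with the standard library:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

# Undo the log transform to recover GDP per capita in levels
print(round(math.exp(7.1657)))   # poorest district: about 1,300
print(round(math.exp(13.4438)))  # richest district: about 690,000
&lt;/code>&lt;/pre>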
&lt;h2 id="5-exploratory-maps">5. Exploratory maps&lt;/h2>
&lt;p>Before fitting any model, we map the two key variables to see whether spatial patterns are visible to the naked eye. If initial income and growth are geographically clustered, that is already a hint that spatial models will outperform global ones.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(2, 1, figsize=(14, 14))
for ax, col, title in [
(axes[0], &amp;quot;ln_gdppc2010&amp;quot;, &amp;quot;(a) Log GDP per capita, 2010&amp;quot;),
(axes[1], &amp;quot;g&amp;quot;, &amp;quot;(b) GDP growth rate, 2010–2018&amp;quot;),
]:
fj = mapclassify.FisherJenks(gdf[col].dropna().values, k=5)
classified = mapclassify.UserDefined(gdf[col].values, bins=fj.bins.tolist())
cmap = plt.cm.coolwarm
norm = plt.Normalize(vmin=0, vmax=4)
colors = [cmap(norm(c)) for c in classified.yb]
gdf.plot(ax=ax, color=colors, edgecolor=GRID_LINE, linewidth=0.2)
ax.set_title(title, fontsize=14, pad=10)
ax.set_axis_off()
plt.tight_layout()
plt.savefig(&amp;quot;mgwr_map_xy.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="mgwr_map_xy.png" alt="Two-panel choropleth map of Indonesia showing log GDP per capita in 2010 and GDP growth rate 2010-2018.">&lt;/p>
&lt;p>The maps reveal clear spatial structure. Initial income (panel a) is highest in Jakarta and resource-rich districts in Kalimantan and Papua (warm red), while the lowest-income districts cluster in eastern Nusa Tenggara and parts of Maluku (cool blue). Growth rates (panel b) show a different pattern: some of the poorest districts in Papua and Sulawesi experienced rapid growth (suggesting catching-up), while several high-income resource districts saw contraction. The fact that these patterns are geographically organized &amp;mdash; not randomly scattered &amp;mdash; motivates the use of spatially varying models.&lt;/p>
&lt;h2 id="6-global-regression-baseline">6. Global regression baseline&lt;/h2>
&lt;p>The simplest test for economic convergence fits a single regression line through all 514 districts. If the slope is negative, poorer districts (low initial income) tend to grow faster than richer ones.&lt;/p>
&lt;p>$$g_i = \alpha + \beta \cdot \ln(y_{i,2010}) + \varepsilon_i$$&lt;/p>
&lt;p>where $g_i$ is the growth rate, $\ln(y_{i,2010})$ is log initial income, and $\beta &amp;lt; 0$ indicates convergence. In the code, $g_i$ corresponds to the column &lt;code>g&lt;/code> and $\ln(y_{i,2010})$ to &lt;code>ln_gdppc2010&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">slope, intercept, r_value, p_value, std_err = stats.linregress(
gdf[&amp;quot;ln_gdppc2010&amp;quot;], gdf[&amp;quot;g&amp;quot;]
)
print(f&amp;quot;Slope (convergence coefficient): {slope:.4f}&amp;quot;)
print(f&amp;quot;R-squared: {r_value**2:.4f}&amp;quot;)
print(f&amp;quot;p-value: {p_value:.6f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Slope (convergence coefficient): -0.1948
R-squared: 0.2135
p-value: 0.000000
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(gdf[&amp;quot;ln_gdppc2010&amp;quot;], gdf[&amp;quot;g&amp;quot;],
color=STEEL_BLUE, edgecolors=GRID_LINE, s=35, alpha=0.6, zorder=3)
x_range = np.linspace(gdf[&amp;quot;ln_gdppc2010&amp;quot;].min(), gdf[&amp;quot;ln_gdppc2010&amp;quot;].max(), 100)
ax.plot(x_range, intercept + slope * x_range, color=WARM_ORANGE,
linewidth=2, zorder=2)
ax.set_xlabel(&amp;quot;Log GDP per capita (2010)&amp;quot;)
ax.set_ylabel(&amp;quot;GDP growth rate (2010–2018)&amp;quot;)
ax.set_title(&amp;quot;Global convergence regression&amp;quot;)
plt.savefig(&amp;quot;mgwr_scatter_global.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="mgwr_scatter_global.png" alt="Scatter plot of log GDP per capita 2010 vs growth rate with OLS regression line.">&lt;/p>
&lt;p>The global regression confirms that convergence exists &lt;strong>on average&lt;/strong>: the slope is $-0.195$ (p &amp;lt; 0.001), meaning a one-unit increase in log initial income is associated with a 0.195 lower cumulative growth rate (the growth variable is measured in log units, so about 19.5 log points over 2010&amp;ndash;2018). However, the R² of only 0.214 means this single line explains just 21% of the variation in growth rates. The scatter plot shows enormous dispersion around the regression line &amp;mdash; many districts with similar initial income experienced vastly different growth trajectories. This low explanatory power is the motivation for MGWR: perhaps the relationship is not weak everywhere, but rather strong in some regions and absent in others, and a single coefficient is simply averaging over this heterogeneity.&lt;/p>
&lt;h2 id="7-from-global-to-local-why-mgwr">7. From global to local: why MGWR?&lt;/h2>
&lt;h3 id="71-the-limitation-of-a-single-coefficient">7.1 The limitation of a single coefficient&lt;/h3>
&lt;p>The global regression tells us that $\beta = -0.195$ on average across Indonesia. But consider two districts with the same initial income &amp;mdash; one in Java, where infrastructure and market access are strong, and one in Papua, where remoteness and institutional challenges dominate. There is no reason to expect the same convergence dynamic in both places. A single coefficient forces them onto the same line.&lt;/p>
&lt;p>&lt;strong>Geographically Weighted Regression (GWR)&lt;/strong> addresses this by estimating a separate regression at each location, using a kernel function &amp;mdash; a distance-decay weighting scheme (typically Gaussian or bisquare) that gives more weight to nearby observations and less to distant ones. The result is a set of &lt;strong>location-specific coefficients&lt;/strong> &amp;mdash; each district gets its own slope and intercept:&lt;/p>
&lt;p>$$g_i = \alpha(u_i, v_i) + \beta(u_i, v_i) \cdot \ln(y_{i,2010}) + \varepsilon_i$$&lt;/p>
&lt;p>where $(u_i, v_i)$ are the geographic coordinates of district $i$, and both $\alpha$ and $\beta$ are now functions of location rather than fixed constants. In the code, $(u_i, v_i)$ correspond to &lt;code>COORD_X&lt;/code> and &lt;code>COORD_Y&lt;/code>. The &lt;strong>bandwidth&lt;/strong> parameter $h$ controls how many neighbors contribute to each local regression &amp;mdash; a small bandwidth means only very close districts matter (highly local), while a large bandwidth approaches the global model.&lt;/p>
&lt;p>However, standard GWR uses a single bandwidth for all variables, which means the intercept and the convergence coefficient are forced to vary at the same spatial scale.&lt;/p>
&lt;p>&lt;strong>MGWR&lt;/strong> removes this constraint. It allows each variable to find its own optimal bandwidth through an iterative back-fitting procedure &amp;mdash; a process that cycles through each variable, optimizing its bandwidth while holding the others fixed, until all bandwidths converge. If baseline growth conditions vary smoothly across large regions (large bandwidth), while the convergence speed varies sharply between neighboring districts (small bandwidth), MGWR will discover this from the data. This makes MGWR a more flexible and realistic model for processes that operate at multiple spatial scales. The key assumption is that spatial relationships are &lt;strong>locally stationary&lt;/strong> within each kernel window &amp;mdash; the relationship between income and growth is approximately constant among the nearest $h$ districts, even if it differs across the full map.&lt;/p>
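&lt;p>To make the kernel idea concrete, here is a minimal sketch of an adaptive bisquare kernel, the weighting scheme the fitted model reports in its summary; the distances and the neighbor count are made up for illustration.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def adaptive_bisquare(distances, k):
    # Adaptive bandwidth: h is the distance to the k-th nearest point,
    # so the k-th neighbor marks the edge of the local window
    h = np.sort(distances)[k - 1]
    u = distances / h
    return np.clip(1 - u**2, 0, None)**2  # weight is zero beyond h

# Hypothetical distances (km) from one district to its candidates
d = np.array([0.0, 10.0, 25.0, 60.0, 120.0])
w = adaptive_bisquare(d, k=4)
print(w.round(3))  # only points inside the bandwidth get positive weight
&lt;/code>&lt;/pre>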
&lt;h3 id="72-mgwr-estimation">7.2 MGWR estimation&lt;/h3>
&lt;p>The &lt;code>mgwr&lt;/code> package requires variables to be &lt;strong>standardized&lt;/strong> (zero mean, unit variance) before multiscale bandwidth selection. This ensures that the bandwidths are comparable across variables measured in different units. The &lt;code>spherical=True&lt;/code> flag tells the algorithm to compute great-circle distances rather than Euclidean distances, which is essential when working with geographic coordinates spanning a large area like Indonesia.&lt;/p>
&lt;pre>&lt;code class="language-python"># Prepare variables
y = gdf[&amp;quot;g&amp;quot;].values.reshape((-1, 1))
X = gdf[[&amp;quot;ln_gdppc2010&amp;quot;]].values
coords = list(zip(gdf[&amp;quot;COORD_X&amp;quot;], gdf[&amp;quot;COORD_Y&amp;quot;]))
# Standardize (required for MGWR)
Zy = (y - y.mean(axis=0)) / y.std(axis=0)
ZX = (X - X.mean(axis=0)) / X.std(axis=0)
# Bandwidth selection and model fitting
mgwr_selector = Sel_BW(coords, Zy, ZX, multi=True, spherical=True)
mgwr_bw = mgwr_selector.search()
mgwr_results = MGWR(coords, Zy, ZX, mgwr_selector, spherical=True).fit()
mgwr_results.summary()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">===========================================================================
Model type Gaussian
Number of observations: 514
Number of covariates: 2
Global Regression Results
---------------------------------------------------------------------------
R2: 0.214
Adj. R2: 0.212
Multi-Scale Geographically Weighted Regression (MGWR) Results
---------------------------------------------------------------------------
Spatial kernel: Adaptive bisquare
MGWR bandwidths
---------------------------------------------------------------------------
Variable Bandwidth ENP_j Adj t-val(95%) Adj alpha(95%)
X0 44.000 26.805 3.127 0.002
X1 44.000 25.271 3.109 0.002
Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares: 122.081
Effective number of parameters (trace(S)): 52.076
Sigma estimate: 0.514
R2 0.762
Adjusted R2 0.736
AICc: 838.405
===========================================================================
&lt;/code>&lt;/pre>
&lt;p>The MGWR results are striking. &lt;strong>R² jumps from 0.214 (global) to 0.762 (MGWR)&lt;/strong> &amp;mdash; the spatially varying model explains more than three times as much variation as the global regression. Both the intercept and the convergence coefficient receive a bandwidth of 44, meaning each local regression draws on the 44 nearest districts. This is a relatively local scale (44 out of 514 districts, or about 8.6% of the sample), confirming that the convergence relationship varies substantially across the archipelago. The effective number of parameters is 52.1, reflecting the cost of estimating location-specific coefficients instead of two global ones.&lt;/p>
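&lt;p>Where do the adjusted t-values of roughly 3.1 come from? Under the multiple-testing correction used by &lt;code>mgwr&lt;/code>, the nominal significance level is divided by the variable-specific effective number of parameters. The sketch below reconstructs the X0 row of the bandwidth table approximately; the exact degrees of freedom used internally may differ slightly.&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy import stats

alpha = 0.05
enp_j = 26.805          # ENP_j for X0, from the summary table
df = 514 - 52.076       # observations minus trace(S)

adj_alpha = alpha / enp_j
t_crit = stats.t.ppf(1 - adj_alpha / 2, df)
print(round(adj_alpha, 4), round(t_crit, 2))  # about 0.0019 and 3.13
&lt;/code>&lt;/pre>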
&lt;h3 id="73-mapping-mgwr-coefficients">7.3 Mapping MGWR coefficients&lt;/h3>
&lt;p>The power of MGWR lies in the coefficient maps. Instead of a single number for the whole country, we can now visualize how the convergence relationship changes from district to district. Because MGWR is estimated on standardized variables, the mapped coefficients are in &lt;strong>standard-deviation units&lt;/strong>: a coefficient of $-1.0$ means that a one-standard-deviation increase in log initial income is associated with a one-standard-deviation decrease in growth at that location.&lt;/p>
&lt;pre>&lt;code class="language-python">gdf[&amp;quot;mgwr_intercept&amp;quot;] = mgwr_results.params[:, 0]
gdf[&amp;quot;mgwr_slope&amp;quot;] = mgwr_results.params[:, 1]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Intercept map&lt;/strong> &amp;mdash; the intercept captures baseline growth conditions after accounting for initial income. Positive values indicate districts that grew faster than expected given their income level; negative values indicate underperformance.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(14, 8))
# Fisher-Jenks classification with Patch legend (see script.py for details)
gdf.plot(ax=ax, column=&amp;quot;mgwr_intercept&amp;quot;, scheme=&amp;quot;FisherJenks&amp;quot;, k=5,
cmap=&amp;quot;coolwarm&amp;quot;, edgecolor=GRID_LINE, linewidth=0.2, legend=True)
ax.set_title(f&amp;quot;MGWR intercept (bandwidth = {int(mgwr_bw[0])})&amp;quot;)
ax.set_axis_off()
plt.savefig(&amp;quot;mgwr_mgwr_intercept.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="mgwr_mgwr_intercept.png" alt="MGWR intercept map across Indonesia&amp;rsquo;s 514 districts.">&lt;/p>
&lt;p>The intercept map reveals a clear east&amp;ndash;west gradient. Districts in &lt;strong>western Indonesia&lt;/strong> (Sumatra and Java) tend to have negative intercepts &amp;mdash; they grew &lt;strong>less&lt;/strong> than the convergence model would predict based on their initial income alone. Districts in &lt;strong>eastern Indonesia&lt;/strong> (Papua, Maluku, Nusa Tenggara) show positive intercepts, indicating growth that &lt;strong>exceeded&lt;/strong> what initial income would predict. This pattern may reflect the role of resource extraction, infrastructure investment, and fiscal transfers that disproportionately boosted growth in less-developed eastern regions during the 2010&amp;ndash;2018 period.&lt;/p>
&lt;p>&lt;strong>Convergence coefficient map&lt;/strong> &amp;mdash; the slope captures how strongly initial income predicts subsequent growth at each location. Large negative values indicate rapid catching-up; values near zero or positive indicate no convergence or divergence.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(14, 8))
gdf.plot(ax=ax, column=&amp;quot;mgwr_slope&amp;quot;, scheme=&amp;quot;FisherJenks&amp;quot;, k=5,
cmap=&amp;quot;coolwarm&amp;quot;, edgecolor=GRID_LINE, linewidth=0.2, legend=True)
ax.set_title(f&amp;quot;MGWR convergence coefficient (bandwidth = {int(mgwr_bw[1])})&amp;quot;)
ax.set_axis_off()
plt.savefig(&amp;quot;mgwr_mgwr_slope.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="mgwr_mgwr_slope.png" alt="MGWR convergence coefficient map across Indonesia.">&lt;/p>
&lt;p>The convergence coefficient map is the central finding of this analysis. The global regression reported a single $\beta = -0.195$, but MGWR reveals that this average hides enormous spatial variation. The &lt;strong>strongest catching-up&lt;/strong> (deepest blue, coefficients as negative as $-1.74$) concentrates in &lt;strong>western Sumatra and parts of Kalimantan&lt;/strong> &amp;mdash; districts where poorer areas grew much faster than richer neighbors. In contrast, most of &lt;strong>Java, eastern Indonesia, and the Maluku islands&lt;/strong> show coefficients near zero (light pink), indicating that the convergence relationship is essentially absent in these areas. A handful of districts show weakly positive coefficients (up to 0.42), suggesting localized divergence where richer districts pulled further ahead. The coefficient ranges from $-1.74$ to $+0.42$, with a median of $-0.085$ and a standard deviation of 0.553 &amp;mdash; far from the single value of $-0.195$ reported by the global model.&lt;/p>
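&lt;p>A practical footnote on units: since the model was fitted on standardized variables, a local slope can be translated back into the raw units of the global regression by multiplying by sd(y)/sd(x), using the summary statistics from Section 4. As a sanity check, the standardized global slope ($-\sqrt{0.2135} \approx -0.462$) maps back onto the raw global slope:&lt;/p>
&lt;pre>&lt;code class="language-python"># Conversion factor from standardized to raw units
sd_y, sd_x = 0.3205, 0.7603   # std of g and ln_gdppc2010 (Section 4)
factor = sd_y / sd_x           # about 0.42

# In a simple regression the standardized slope equals -sqrt(R2)
print(round(-(0.2135 ** 0.5) * factor, 3))  # -0.195, the global OLS slope
&lt;/code>&lt;/pre>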
&lt;h3 id="74-statistical-significance">7.4 Statistical significance&lt;/h3>
&lt;p>Not all local coefficients are statistically distinguishable from zero. MGWR provides t-values corrected for multiple testing, which we use to classify each district&amp;rsquo;s convergence coefficient as significantly negative (catching-up), not significant, or significantly positive (diverging).&lt;/p>
&lt;pre>&lt;code class="language-python">mgwr_filtered_t = mgwr_results.filter_tvals()
t_sig = mgwr_filtered_t[:, 1] # Slope t-values
sig_cats = np.where(t_sig &amp;lt; 0, &amp;quot;Negative (catching-up)&amp;quot;,
                    np.where(t_sig &amp;gt; 0, &amp;quot;Positive (diverging)&amp;quot;, &amp;quot;Not significant&amp;quot;))
print(f&amp;quot;Negative (catching-up): {(sig_cats == 'Negative (catching-up)').sum()}&amp;quot;)
print(f&amp;quot;Not significant: {(sig_cats == 'Not significant').sum()}&amp;quot;)
print(f&amp;quot;Positive (diverging): {(sig_cats == 'Positive (diverging)').sum()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Negative (catching-up): 149
Not significant: 365
Positive (diverging): 0
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(14, 8))
cat_colors = {
    &amp;quot;Negative (catching-up)&amp;quot;: &amp;quot;#2c7bb6&amp;quot;,
    &amp;quot;Not significant&amp;quot;: GRID_LINE,
    &amp;quot;Positive (diverging)&amp;quot;: &amp;quot;#d7191c&amp;quot;,
}
colors_sig = [cat_colors[c] for c in sig_cats]
gdf.plot(ax=ax, color=colors_sig, edgecolor=GRID_LINE, linewidth=0.2)
ax.set_title(&amp;quot;MGWR convergence coefficient: statistical significance&amp;quot;)
ax.set_axis_off()
plt.savefig(&amp;quot;mgwr_mgwr_significance.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="mgwr_mgwr_significance.png" alt="Significance map showing districts with statistically significant catching-up.">&lt;/p>
&lt;p>Of 514 districts, &lt;strong>149 (29%)&lt;/strong> show statistically significant convergence at the corrected 5% level &amp;mdash; concentrated in &lt;strong>Sumatra, western Kalimantan, and Sulawesi&lt;/strong>. The remaining &lt;strong>365 districts (71%)&lt;/strong> have convergence coefficients that are not distinguishable from zero after correcting for multiple comparisons. &lt;strong>No district&lt;/strong> shows significant divergence. This means that while the global regression detects convergence on average, it is actually driven by a minority of districts &amp;mdash; primarily in western Indonesia &amp;mdash; while the majority of the archipelago shows no significant relationship between initial income and growth.&lt;/p>
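&lt;p>The shares quoted above follow directly from the category counts:&lt;/p>

```python
# Reproduce the percentage shares from the raw category counts reported above
counts = {
    "Negative (catching-up)": 149,
    "Not significant": 365,
    "Positive (diverging)": 0,
}
total = sum(counts.values())  # 514 districts
for cat, n in counts.items():
    print(f"{cat:<25} {n:>4} ({n / total:.0%})")
```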
&lt;h2 id="8-model-comparison">8. Model comparison&lt;/h2>
&lt;p>The table below summarizes how much explanatory power the spatially varying model adds over the global baseline.&lt;/p>
&lt;pre>&lt;code class="language-python">print(f&amp;quot;{'Metric':&amp;lt;25} {'Global OLS':&amp;gt;12} {'MGWR':&amp;gt;12}&amp;quot;)
print(f&amp;quot;{'R²':&amp;lt;25} {0.2135:&amp;gt;12.4f} {0.7625:&amp;gt;12.4f}&amp;quot;)
print(f&amp;quot;{'Adj. R²':&amp;lt;25} {0.2120:&amp;gt;12.4f} {0.7357:&amp;gt;12.4f}&amp;quot;)
print(f&amp;quot;{'AICc':&amp;lt;25} {1341.25:&amp;gt;12.2f} {838.41:&amp;gt;12.2f}&amp;quot;)
print(f&amp;quot;{'Bandwidth (intercept)':&amp;lt;25} {'all (514)':&amp;gt;12} {'44':&amp;gt;12}&amp;quot;)
print(f&amp;quot;{'Bandwidth (slope)':&amp;lt;25} {'all (514)':&amp;gt;12} {'44':&amp;gt;12}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Metric                      Global OLS         MGWR
R²                              0.2135       0.7625
Adj. R²                         0.2120       0.7357
AICc                           1341.25       838.41
Bandwidth (intercept)        all (514)           44
Bandwidth (slope)            all (514)           44
&lt;/code>&lt;/pre>
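&lt;p>For reference, the AICc reported by (M)GWR implementations follows the corrected criterion of Fotheringham et al. (2002), which charges the effective number of parameters $\mathrm{tr}(S)$ against the fit. A minimal sketch of that formula &amp;mdash; the inputs below are illustrative, not the quantities behind this table:&lt;/p>

```python
import numpy as np

def gwr_aicc(n, sigma2, trace_S):
    """Corrected AIC for a (M)GWR fit (Fotheringham et al., 2002).

    n       -- number of observations
    sigma2  -- estimated residual variance (RSS / n)
    trace_S -- trace of the hat matrix (effective number of parameters)
    """
    return (n * np.log(sigma2) + n * np.log(2 * np.pi)
            + n * (n + trace_S) / (n - 2 - trace_S))

# Illustrative: same residual variance, more effective parameters -> higher AICc
print(gwr_aicc(n=514, sigma2=0.25, trace_S=2))
print(gwr_aicc(n=514, sigma2=0.25, trace_S=52))
```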
&lt;p>MGWR more than triples the explained variance ($R^2$: 0.214 to 0.762) and dramatically reduces the AICc from 1341 to 838, confirming that the improvement in fit is not merely due to additional flexibility. The bandwidth of 44 for both variables means each local regression uses the nearest 44 districts (about 8.6% of the sample), confirming that the convergence process is highly localized. The adjusted $R^2$ of 0.736 accounts for the additional complexity (52 effective parameters vs 2 in OLS) and still shows a massive improvement, indicating that the spatial variation in coefficients is genuine and not overfitting.&lt;/p>
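&lt;p>The adjusted $R^2$ can be reproduced from the numbers already reported, assuming the standard degrees-of-freedom correction with the effective number of parameters in place of the OLS parameter count:&lt;/p>

```python
# Values taken from the comparison table above
n, r2, enp = 514, 0.7625, 52
# Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - ENP - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - enp - 1)
print(round(adj_r2, 4))  # 0.7357
```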
&lt;h2 id="9-discussion">9. Discussion&lt;/h2>
&lt;p>&lt;strong>Economic catching-up in Indonesia is not uniform &amp;mdash; it is concentrated in western Sumatra and parts of Kalimantan, while most of the archipelago shows no significant convergence.&lt;/strong> The global regression&amp;rsquo;s $\beta = -0.195$ suggests a moderate convergence tendency, but MGWR reveals that this average is driven by a subset of 149 districts (29%) with strong catching-up dynamics. The remaining 365 districts have convergence coefficients indistinguishable from zero.&lt;/p>
&lt;p>The intercept map adds another dimension: eastern Indonesian districts tend to have positive intercepts (above-expected growth), while western districts have negative intercepts (below-expected growth). This east&amp;ndash;west gradient likely reflects the impact of fiscal transfers, resource booms, and infrastructure programs that targeted less-developed regions during the 2010&amp;ndash;2018 period. Combined with the convergence coefficient map, the picture is nuanced: eastern Indonesia grew faster than expected (high intercept), but not because of convergence dynamics (near-zero slope) &amp;mdash; rather, because of other factors captured by the intercept.&lt;/p>
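&lt;p>The mechanics of this decomposition can be illustrated with toy numbers (hypothetical coefficients, not estimates from this model; all variables are standardized as in the MGWR fit):&lt;/p>

```python
# Predicted (standardized) growth = local intercept + local slope * initial income
# Hypothetical eastern district: high intercept, slope near zero -- above-expected
# growth that does NOT come from the convergence mechanism
east_growth = 0.8 + (-0.05) * (-0.6)
# Hypothetical western district: low intercept, strong convergence slope -- the
# below-average initial income (-0.6) is what lifts predicted growth
west_growth = -0.3 + (-1.2) * (-0.6)
print(east_growth, west_growth)
```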
&lt;p>For policy, these findings challenge the assumption that national-level convergence statistics reflect what is happening locally. A policymaker looking at $\beta = -0.195$ might conclude that Indonesia&amp;rsquo;s development strategy is successfully closing regional gaps. MGWR reveals that catching-up is geographically selective, and the majority of districts are not on a convergence path at all. Spatially targeted interventions &amp;mdash; rather than uniform national programs &amp;mdash; may be needed to address this uneven landscape.&lt;/p>
&lt;h2 id="10-summary-and-next-steps">10. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> MGWR reveals spatial heterogeneity invisible to global regression. R² improves from 0.214 to 0.762 by allowing location-specific coefficients. Both variables operate at a bandwidth of 44 districts (~8.6% of the sample), indicating highly localized economic dynamics. Variable standardization is essential before MGWR estimation.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> Only 149 of 514 Indonesian districts (29%) show statistically significant convergence, concentrated in Sumatra and Kalimantan. The convergence coefficient ranges from $-1.74$ to $+0.42$, far from the global average of $-0.195$. Eastern Indonesia grows faster than expected (positive intercepts) but not through convergence &amp;mdash; the catching-up mechanism is absent there.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> The bivariate model (one independent variable) is intentionally simple for pedagogical purposes. Real convergence analysis would include controls for human capital, infrastructure, institutional quality, and sectoral composition. The bandwidth of 44 applies to both variables in this case, but with additional covariates, MGWR&amp;rsquo;s ability to assign different bandwidths per variable would be more visible.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend the model with additional covariates (education, investment, fiscal transfers) to disentangle the sources of spatial heterogeneity. Apply MGWR to panel data with multiple time periods. Compare MGWR results with the spatial clusters identified in the &lt;a href="https://carlos-mendez.org/post/python_esda2/">ESDA tutorial&lt;/a> to see whether convergence hotspots align with LISA clusters.&lt;/li>
&lt;/ul>
&lt;h2 id="11-exercises">11. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Add a second variable.&lt;/strong> Include an education indicator (e.g., years of schooling) as a second independent variable and re-run MGWR. Do the two covariates receive different bandwidths? What does that tell you about the spatial scale at which education affects growth?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Map the t-values.&lt;/strong> Instead of mapping the raw coefficients, map the local t-statistics from &lt;code>mgwr_results.tvalues[:, 1]&lt;/code>. How does this map compare to the significance map based on corrected t-values?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare with ESDA.&lt;/strong> Run a Moran&amp;rsquo;s I test on the MGWR residuals. Is there remaining spatial autocorrelation? If not, MGWR has successfully captured the spatial structure. If yes, what might be missing?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="12-references">12. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1080/24694452.2017.1352480" target="_blank" rel="noopener">Fotheringham, A. S., Yang, W., and Kang, W. (2017). Multiscale Geographically Weighted Regression (MGWR). &lt;em>Annals of the American Association of Geographers&lt;/em>, 107(6), 1247&amp;ndash;1265.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.21105/joss.01750" target="_blank" rel="noopener">Oshan, T. M., Li, Z., Kang, W., Wolf, L. J., and Fotheringham, A. S. (2019). mgwr: A Python Implementation of Multiscale Geographically Weighted Regression. &lt;em>JOSS&lt;/em>, 4(42), 1750.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/j.1538-4632.1996.tb00936.x" target="_blank" rel="noopener">Brunsdon, C., Fotheringham, A. S., and Charlton, M. E. (1996). Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity. &lt;em>Geographical Analysis&lt;/em>, 28(4), 281&amp;ndash;298.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.wiley.com/en-us/Geographically&amp;#43;Weighted&amp;#43;Regression-p-9780471496168" target="_blank" rel="noopener">Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2002). &lt;em>Geographically Weighted Regression: The Analysis of Spatially Varying Relationships&lt;/em>. Wiley.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20241219-ae/" target="_blank" rel="noopener">Mendez, C. and Jiang, Q. (2024). Spatial Heterogeneity Modeling for Regional Economic Analysis: A Computational Approach Using Python and Cloud Computing. Working Paper, Nagoya University.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mgwr.readthedocs.io/" target="_blank" rel="noopener">mgwr documentation&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to PCA Analysis for Building Development Indicators</title><link>https://carlos-mendez.org/post/python_pca/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pca/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>In development economics, we rarely measure progress with just one number. To understand a country&amp;rsquo;s health system, you might look at life expectancy, infant mortality, hospital beds per capita, and disease prevalence. But how do you rank 50 countries when you have multiple metrics measured in different units &amp;mdash; years, rates, and raw counts? You cannot simply add them together. You need a single, elegant &amp;ldquo;Development Index.&amp;rdquo;&lt;/p>
&lt;p>&lt;strong>Principal Component Analysis (PCA)&lt;/strong> is a statistical technique for dimensionality reduction. It takes a dataset with many correlated variables and condenses them into a few uncorrelated components &amp;mdash; here, a single composite index &amp;mdash; while retaining as much of the original information as possible. Think of PCA as finding the hallway in a building that gives you the longest unobstructed view &amp;mdash; the direction where the data is most spread out, and therefore most informative. For visual introductions to the core idea, see &lt;a href="https://youtu.be/_6UjscCJrYE" target="_blank" rel="noopener">Principal Component Analysis (PCA) Explained Simply&lt;/a> and &lt;a href="https://youtu.be/nEvKduLXFvk" target="_blank" rel="noopener">Visualizing Principal Component Analysis (PCA)&lt;/a>. For a hands-on interactive demonstration, try the &lt;a href="https://numiqo.com/lab/pca" target="_blank" rel="noopener">Numiqo PCA Lab&lt;/a>.&lt;/p>
&lt;p>This tutorial builds a simplified Health Index using only two indicators &amp;mdash; Life Expectancy (years) and Infant Mortality (deaths per 1,000 live births) &amp;mdash; for 50 simulated countries. By using simulated data with a known structure, we can verify that PCA recovers the true underlying pattern. The same six-step pipeline scales naturally to 10, 20, or 100 indicators.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why polarity adjustment and standardization are prerequisites for PCA&lt;/li>
&lt;li>Compute the covariance matrix and interpret its entries as variable overlap&lt;/li>
&lt;li>Perform eigen-decomposition to extract principal component weights and variance proportions&lt;/li>
&lt;li>Construct a composite index by projecting standardized data onto the first principal component&lt;/li>
&lt;li>Verify manual PCA results against scikit-learn&amp;rsquo;s PCA implementation&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-pca-pipeline">2. The PCA pipeline&lt;/h2>
&lt;p>Before diving into the math, it helps to see the full pipeline at a glance. Each of the six steps builds on the previous one and cannot be skipped.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Polarity&amp;lt;br/&amp;gt;Adjustment&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Standardization&amp;lt;br/&amp;gt;(Z-scores)&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Covariance&amp;lt;br/&amp;gt;Matrix&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Eigen-&amp;lt;br/&amp;gt;Decomposition&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scoring&amp;lt;br/&amp;gt;(PC1)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Normalization&amp;lt;br/&amp;gt;(0-1)&amp;quot;]
style A fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The pipeline transforms raw indicators into a single number that captures the dominant pattern of variation. We start by aligning indicator directions (Step 1), removing unit differences (Step 2), measuring variable overlap (Step 3), finding the optimal weights (Step 4), computing scores (Step 5), and finally rescaling for human readability (Step 6).&lt;/p>
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis relies on &lt;a href="https://numpy.org/" target="_blank" rel="noopener">NumPy&lt;/a> for linear algebra, &lt;a href="https://pandas.pydata.org/" target="_blank" rel="noopener">pandas&lt;/a> for data management, &lt;a href="https://matplotlib.org/" target="_blank" rel="noopener">matplotlib&lt;/a> for visualization, and &lt;a href="https://scikit-learn.org/" target="_blank" rel="noopener">scikit-learn&lt;/a> for verification. The &lt;code>RANDOM_SEED&lt;/code> ensures every reader gets identical results.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Reproducibility
RANDOM_SEED = 42
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-simulating-health-data">4. Simulating health data&lt;/h2>
&lt;p>We generate data for 50 countries driven by a single latent factor &amp;mdash; &lt;code>base_health&lt;/code> &amp;mdash; drawn from a uniform distribution. This factor drives both life expectancy (positively) and infant mortality (negatively), mimicking the real-world pattern where healthier countries perform well across multiple indicators simultaneously. Using simulated data lets us verify that PCA recovers this known single-factor structure.&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_health_data(n=50, seed=42):
    &amp;quot;&amp;quot;&amp;quot;Simulate health indicators for n countries.

    True DGP:
        base_health ~ Uniform(0, 1)                   -- latent health capacity
        life_exp    = 55 + 30 * base_health + N(0, 2) -- range ~55-85
        infant_mort = 60 - 55 * base_health + N(0, 3) -- range ~2-60
    &amp;quot;&amp;quot;&amp;quot;
    rng = np.random.default_rng(seed)
    base_health = rng.uniform(0, 1, n)
    life_exp = 55 + 30 * base_health + rng.normal(0, 2, n)
    infant_mort = 60 - 55 * base_health + rng.normal(0, 3, n)
    countries = [f&amp;quot;Country_{i+1:02d}&amp;quot; for i in range(n)]
    return pd.DataFrame({
        &amp;quot;country&amp;quot;: countries,
        &amp;quot;life_exp&amp;quot;: np.round(life_exp, 1),
        &amp;quot;infant_mort&amp;quot;: np.round(infant_mort, 1),
    })

df = simulate_health_data(n=50, seed=RANDOM_SEED)
# Save raw data to CSV (used later in the scikit-learn pipeline)
df.to_csv(&amp;quot;health_data.csv&amp;quot;, index=False)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;\nFirst 5 rows:&amp;quot;)
print(df.head().to_string(index=False))
print(f&amp;quot;\nDescriptive statistics:&amp;quot;)
print(df[[&amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;]].describe().round(2).to_string())
print(&amp;quot;\nSaved: health_data.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset shape: (50, 3)

First 5 rows:
   country  life_exp  infant_mort
Country_01      79.6         18.6
Country_02      68.3         33.1
Country_03      81.3         11.6
Country_04      77.2         25.5
Country_05      54.9         53.8

Descriptive statistics:
       life_exp  infant_mort
count     50.00        50.00
mean      70.72        30.30
std        8.62        15.57
min       54.90         3.50
25%       63.45        17.28
50%       71.25        30.25
75%       78.90        42.05
max       84.70        58.70

Saved: health_data.csv
&lt;/code>&lt;/pre>
&lt;p>All 50 countries loaded with two health indicators. Life expectancy ranges from 54.9 to 84.7 years with a mean of 70.72, while infant mortality ranges from 3.5 to 58.7 per 1,000 live births with a mean of 30.30. Notice the directional conflict: life expectancy is a &amp;ldquo;positive&amp;rdquo; indicator (higher means better health), while infant mortality is a &amp;ldquo;negative&amp;rdquo; indicator (higher means worse health). This conflict is precisely what Step 1 will resolve.&lt;/p>
&lt;h2 id="5-exploring-the-raw-data">5. Exploring the raw data&lt;/h2>
&lt;p>Before transforming the data, let us visualize the raw relationship between the two indicators. The Pearson correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables, ranging from $-1$ (perfect negative) to $+1$ (perfect positive). If the two indicators are strongly correlated, PCA will be able to compress them effectively into a single index.&lt;/p>
&lt;pre>&lt;code class="language-python">raw_corr = df[&amp;quot;life_exp&amp;quot;].corr(df[&amp;quot;infant_mort&amp;quot;])
print(f&amp;quot;Pearson correlation (LE vs IM): {raw_corr:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pearson correlation (LE vs IM): -0.9595
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
fig.patch.set_linewidth(0)
ax.scatter(df[&amp;quot;life_exp&amp;quot;], df[&amp;quot;infant_mort&amp;quot;],
           color=STEEL_BLUE, edgecolors=DARK_NAVY, s=60, zorder=3)
# Label extreme countries
sorted_df = df.sort_values(&amp;quot;life_exp&amp;quot;)
label_idx = list(sorted_df.head(5).index) + list(sorted_df.tail(5).index)
for i in label_idx:
    ax.annotate(df.loc[i, &amp;quot;country&amp;quot;],
                (df.loc[i, &amp;quot;life_exp&amp;quot;], df.loc[i, &amp;quot;infant_mort&amp;quot;]),
                fontsize=7, color=LIGHT_TEXT, xytext=(5, 5),
                textcoords=&amp;quot;offset points&amp;quot;)
ax.set_xlabel(&amp;quot;Life Expectancy (years)&amp;quot;)
ax.set_ylabel(&amp;quot;Infant Mortality (per 1,000 live births)&amp;quot;)
ax.set_title(&amp;quot;Raw health indicators: Life Expectancy vs. Infant Mortality&amp;quot;)
ax.annotate(f&amp;quot;r = {raw_corr:.2f}&amp;quot;, xy=(0.95, 0.95), xycoords=&amp;quot;axes fraction&amp;quot;,
            fontsize=12, color=WARM_ORANGE, fontweight=&amp;quot;bold&amp;quot;,
            va=&amp;quot;top&amp;quot;, ha=&amp;quot;right&amp;quot;)
plt.savefig(&amp;quot;pca_raw_scatter.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
            facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_raw_scatter.png" alt="Raw health indicators: Life Expectancy vs. Infant Mortality for 50 simulated countries.">&lt;/p>
&lt;p>The Pearson correlation is $r = -0.96$, confirming a very strong negative relationship. Countries with high life expectancy almost always have low infant mortality, and vice versa. This means the two indicators are telling essentially the same story about health &amp;mdash; just in opposite directions. This high redundancy is exactly what PCA will exploit to compress two dimensions into one.&lt;/p>
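&lt;p>For two standardized variables this redundancy can be quantified in advance: the correlation matrix $\begin{pmatrix} 1 &amp;amp; r \\ r &amp;amp; 1 \end{pmatrix}$ has eigenvalues $1+r$ and $1-r$, so the first principal component will capture a share $(1+|r|)/2$ of the total variance. A quick preview using the correlation above:&lt;/p>

```python
import numpy as np

r = -0.9595  # raw correlation reported above
# Eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r; the larger one is 1 + |r|
pc1_share = (1 + abs(r)) / 2
eigvals = np.linalg.eigvalsh(np.array([[1.0, r], [r, 1.0]]))
print(f"PC1 variance share: {pc1_share:.4f}")           # roughly 98%
print(f"Check via eigenvalues: {eigvals.max() / eigvals.sum():.4f}")
```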
&lt;h2 id="6-step-1-polarity-adjustment-----aligning-the-health-goals">6. Step 1: Polarity adjustment &amp;mdash; aligning the health goals&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> Before any math is applied, we must ensure our indicators share the same logical direction. We mathematically invert indicators where &amp;ldquo;higher&amp;rdquo; means &amp;ldquo;worse&amp;rdquo; so that all variables move in the same positive direction. For our negative indicator (Infant Mortality, or $IM$), we calculate an adjusted value:&lt;/p>
&lt;p>$$IM_i^{*} = -1 \times IM_i$$&lt;/p>
&lt;p>In words, this says: for each country $i$, multiply its infant mortality rate by negative one. After this transformation, a higher (less negative) value of $IM_i^{*}$ means lower infant mortality &amp;mdash; a good outcome. Here $IM_i$ corresponds to the &lt;code>infant_mort&lt;/code> column, and $IM_i^{*}$ will be stored as &lt;code>infant_mort_adj&lt;/code>.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Country_01 has an infant mortality rate of 18.6 deaths per 1,000 live births. Applying the formula: $IM^{*} = -1 \times 18.6 = -18.6$. The raw value of 18.6 becomes $-18.6$ after polarity adjustment. The negative sign encodes &amp;ldquo;18.6 units of infant survival&amp;rdquo; &amp;mdash; a positive health signal that can now be combined with Life Expectancy because both variables point in the same direction.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Life Expectancy ($LE$) is a &amp;ldquo;positive&amp;rdquo; indicator: higher numbers mean better health. Infant Mortality is a &amp;ldquo;negative&amp;rdquo; indicator: higher numbers mean worse health. If we feed these into an index as they are, the final score will be contradictory. Imagine comparing exam scores where one professor grades 0&amp;ndash;100 (higher is better) and another grades on demerits 0&amp;ndash;100 (lower is better). Before averaging, you must flip the demerit scale.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We must flip the negative indicator so that &amp;ldquo;up&amp;rdquo; always means &amp;ldquo;better.&amp;rdquo; By multiplying by $-1$, instead of measuring &amp;ldquo;Infant Mortality,&amp;rdquo; we are effectively measuring &amp;ldquo;Infant Survival.&amp;rdquo; Now, for both variables, a higher number universally indicates a stronger health system.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;infant_mort_adj&amp;quot;] = -1 * df[&amp;quot;infant_mort&amp;quot;]
adj_corr = df[&amp;quot;life_exp&amp;quot;].corr(df[&amp;quot;infant_mort_adj&amp;quot;])
print(f&amp;quot;Correlation after polarity adjustment (LE vs -IM): {adj_corr:.4f}&amp;quot;)
print(f&amp;quot;\nFirst 5 rows with adjusted IM:&amp;quot;)
print(df[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;infant_mort_adj&amp;quot;]].head().to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Correlation after polarity adjustment (LE vs -IM): 0.9595

First 5 rows with adjusted IM:
   country  life_exp  infant_mort  infant_mort_adj
Country_01      79.6         18.6            -18.6
Country_02      68.3         33.1            -33.1
Country_03      81.3         11.6            -11.6
Country_04      77.2         25.5            -25.5
Country_05      54.9         53.8            -53.8
&lt;/code>&lt;/pre>
&lt;p>The correlation has flipped from $-0.96$ to $+0.96$. Both indicators now point in the same direction: higher values mean better health. The magnitude of the correlation is unchanged &amp;mdash; the relationship is identical, just properly aligned. With this alignment in place, we can proceed to standardize the variables.&lt;/p>
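&lt;p>That the sign flips while the magnitude is preserved follows from the bilinearity of covariance: $\operatorname{corr}(X, -Y) = -\operatorname{corr}(X, Y)$. A quick check on synthetic data (arbitrary variables, not the health indicators):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -2 * x + rng.normal(size=200)   # negatively related to x by construction
r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x, -y)[0, 1]  # flip the "negative" variable
print(r_before, r_after)            # same magnitude, opposite sign
```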
&lt;h2 id="7-step-2-standardization-----comparing-apples-to-apples">7. Step 2: Standardization &amp;mdash; comparing apples to apples&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We transform our raw data into Z-scores. For each value, we subtract the sample mean ($\mu$) and divide by the standard deviation ($\sigma$):&lt;/p>
&lt;p>$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j}$$&lt;/p>
&lt;p>In words, this says: for country $i$ and variable $j$, subtract the variable&amp;rsquo;s mean $\bar{X}_j$ and divide by its standard deviation $\sigma_j$. The result is a unitless score that tells us how many standard deviations above or below average the country is. Here $X_{ij}$ is the raw value (e.g., &lt;code>life_exp&lt;/code> or &lt;code>infant_mort_adj&lt;/code>), $\bar{X}_j$ is computed by &lt;code>np.mean()&lt;/code>, and $\sigma_j$ is computed by &lt;code>np.std(ddof=0)&lt;/code>.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Country_01 has Life Expectancy = 79.6 and adjusted Infant Mortality = $-18.6$. Applying the formula: $Z_{LE} = (79.6 - 70.72) / 8.53 = 8.88 / 8.53 = 1.0402$ and $Z_{IM} = (-18.6 - (-30.30)) / 15.42 = 11.70 / 15.42 = 0.7587$. Country_01 is 1.04 standard deviations above average in life expectancy and 0.76 standard deviations above average in infant survival. Both positive Z-scores confirm it is a healthier-than-average country on both indicators. These two numbers are now directly comparable &amp;mdash; 1.04 and 0.76 are both measured in the same unit (standard deviations), even though the original variables were in years and rates.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Life Expectancy is measured in years (range 54.9&amp;ndash;84.7). Infant Mortality is measured as a rate per 1,000 (range 3.5&amp;ndash;58.7). If we mix these directly, the index will naturally be dominated by Infant Mortality simply because its values have a wider physical spread.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We standardize both variables to have a mean of $0$ and a standard deviation of $1$. We are no longer looking at &amp;ldquo;years&amp;rdquo; or &amp;ldquo;rates.&amp;rdquo; We are looking at &amp;ldquo;standard deviations from the global average.&amp;rdquo; Both indicators now have equal footing.&lt;/p>
&lt;pre>&lt;code class="language-python"># Manual standardization
le_mean = df[&amp;quot;life_exp&amp;quot;].mean()
le_std = df[&amp;quot;life_exp&amp;quot;].std(ddof=0)
im_mean = df[&amp;quot;infant_mort_adj&amp;quot;].mean()
im_std = df[&amp;quot;infant_mort_adj&amp;quot;].std(ddof=0)
df[&amp;quot;z_le&amp;quot;] = (df[&amp;quot;life_exp&amp;quot;] - le_mean) / le_std
df[&amp;quot;z_im&amp;quot;] = (df[&amp;quot;infant_mort_adj&amp;quot;] - im_mean) / im_std
print(f&amp;quot;Life Expectancy -- mean: {le_mean:.2f}, std: {le_std:.2f}&amp;quot;)
print(f&amp;quot;Infant Mort (adj) -- mean: {im_mean:.2f}, std: {im_std:.2f}&amp;quot;)
print(f&amp;quot;\nZ-score statistics:&amp;quot;)
print(f&amp;quot; z_le mean: {df['z_le'].mean():.6f}, std: {df['z_le'].std(ddof=0):.6f}&amp;quot;)
print(f&amp;quot; z_im mean: {df['z_im'].mean():.6f}, std: {df['z_im'].std(ddof=0):.6f}&amp;quot;)
# Verify with sklearn
scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(df[[&amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort_adj&amp;quot;]])
max_diff = np.max(np.abs(Z_sklearn - df[[&amp;quot;z_le&amp;quot;, &amp;quot;z_im&amp;quot;]].values))
print(f&amp;quot;\nMax difference from sklearn StandardScaler: {max_diff:.2e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Life Expectancy -- mean: 70.72, std: 8.53
Infant Mort (adj) -- mean: -30.30, std: 15.42

Z-score statistics:
  z_le mean: 0.000000, std: 1.000000
  z_im mean: 0.000000, std: 1.000000

Max difference from sklearn StandardScaler: 0.00e+00
&lt;/code>&lt;/pre>
&lt;p>Both Z-scores now have a mean of exactly 0 and a standard deviation of exactly 1, confirmed by the zero-difference check against &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">StandardScaler()&lt;/a>. Note that &lt;code>describe()&lt;/code> in Section 4 reported &lt;code>std = 8.62&lt;/code> for life expectancy, while here we get &lt;code>8.53&lt;/code>. The difference is the denominator: pandas' &lt;code>describe()&lt;/code> divides by $n - 1$ (sample standard deviation, &lt;code>ddof=1&lt;/code>), while &lt;code>StandardScaler&lt;/code> and our manual formula divide by $n$ (population standard deviation, &lt;code>ddof=0&lt;/code>). We use &lt;code>ddof=0&lt;/code> because PCA treats the dataset as the full population being analyzed, not a sample from a larger population. A country that is 2 standard deviations above average in life expectancy is now directly comparable to one that is 2 standard deviations above average in (adjusted) infant mortality. The unit problem is solved, and we can now measure how the two variables move together.&lt;/p>
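&lt;p>The $n$ versus $n-1$ distinction is easy to verify on a small example (arbitrary numbers, unrelated to the simulated dataset):&lt;/p>

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.std())               # ddof=1: pandas default, what describe() reports
print(x.std(ddof=0))         # ddof=0: population sd, what StandardScaler uses
print(np.std(x.to_numpy()))  # NumPy defaults to ddof=0, matching StandardScaler
```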
&lt;h2 id="8-step-3-the-covariance-matrix-----mapping-the-overlap">8. Step 3: The covariance matrix &amp;mdash; mapping the overlap&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We calculate the covariance matrix to measure how the two standardized variables move together. For two variables, this forms a $2 \times 2$ matrix ($\Sigma$). Because our data is standardized, the covariance between them is simply their correlation ($r$):&lt;/p>
&lt;p>$$\Sigma = \frac{1}{n} Z^T Z = \begin{pmatrix} 1 &amp;amp; r \\ r &amp;amp; 1 \end{pmatrix}$$&lt;/p>
&lt;p>In words, this says: the covariance matrix $\Sigma$ of standardized data has 1s on the diagonal (each variable has unit variance after standardization) and the correlation $r$ on the off-diagonal. With two variables, PCA only needs to decompose this single $2 \times 2$ matrix.&lt;/p>
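&lt;p>The identity $\frac{1}{n} Z^T Z$ can be checked numerically on any standardized matrix; the sketch below uses synthetic data rather than the post&amp;rsquo;s dataset:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: two correlated columns
x = rng.normal(size=(50, 2))
x[:, 1] = 0.8 * x[:, 0] + 0.6 * x[:, 1]

# Standardize with ddof=0, as in the post
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

# (1/n) Z^T Z reproduces np.cov(..., ddof=0)
n = z.shape[0]
manual_cov = z.T @ z / n
assert np.allclose(manual_cov, np.cov(z.T, ddof=0))
assert np.allclose(np.diag(manual_cov), 1.0)  # unit variances on the diagonal
print(manual_cov.round(4))
```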
&lt;p>&lt;strong>The application:&lt;/strong> Plugging in our standardized data, the resulting matrix has diagonal entries of 1.0000 (guaranteed by standardization &amp;mdash; each variable has unit variance) and an off-diagonal of 0.9595 (the correlation $r$). This off-diagonal value means that when a country&amp;rsquo;s standardized life expectancy increases by 1 standard deviation, its standardized infant survival tends to increase by 0.96 standard deviations as well. The two variables move almost in lockstep, confirming heavy redundancy that PCA can exploit.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> In the real world, these two indicators are heavily correlated. A country with high life expectancy almost certainly has high infant survival. They are essentially telling us the same story about the country&amp;rsquo;s healthcare system.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> The covariance matrix measures exactly how strong this overlap is. It tells the PCA algorithm mathematically, &amp;ldquo;These two variables share a high amount of redundant information. You can safely compress them into one variable without losing the big picture.&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-python">Z = df[[&amp;quot;z_le&amp;quot;, &amp;quot;z_im&amp;quot;]].values
cov_matrix = np.cov(Z.T, ddof=0)
print(f&amp;quot;Covariance matrix (2x2):&amp;quot;)
print(f&amp;quot; [{cov_matrix[0, 0]:.4f} {cov_matrix[0, 1]:.4f}]&amp;quot;)
print(f&amp;quot; [{cov_matrix[1, 0]:.4f} {cov_matrix[1, 1]:.4f}]&amp;quot;)
print(f&amp;quot;\nOff-diagonal (correlation): {cov_matrix[0, 1]:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Covariance matrix (2x2):
[1.0000 0.9595]
[0.9595 1.0000]
Off-diagonal (correlation): 0.9595
&lt;/code>&lt;/pre>
&lt;p>The diagonal entries are exactly 1.0 (unit variance, as expected after standardization) and the off-diagonal is 0.9595 &amp;mdash; the same correlation we computed earlier. In other words, a country that sits 1 standard deviation above average on one indicator tends to sit 0.96 standard deviations above average on the other. The covariance matrix has now quantified the overlap, and eigen-decomposition will use this information to find the optimal compression axis.&lt;/p>
&lt;h2 id="9-step-4-eigen-decomposition-----finding-the-optimal-direction">9. Step 4: Eigen-decomposition &amp;mdash; finding the optimal direction&lt;/h2>
&lt;p>This is the mathematical core, where we find our new, compressed index. It introduces two new concepts &amp;mdash; &lt;strong>eigenvectors&lt;/strong> and &lt;strong>eigenvalues&lt;/strong> &amp;mdash; that are central to how PCA works.&lt;/p>
&lt;p>&lt;strong>What it is:&lt;/strong> We decompose the covariance matrix $\Sigma$ into two outputs by solving the equation $\Sigma \mathbf{v} = \lambda \mathbf{v}$:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>eigenvector&lt;/strong> ($\mathbf{v}$) is a direction in the data space. For our two health indicators, each eigenvector is a pair of numbers $[w_1, w_2]$ that defines a direction &amp;mdash; like a compass heading through the scatter plot of countries. PCA finds the direction along which the data is most spread out. The components of this eigenvector become the &lt;strong>weights&lt;/strong> for combining our indicators into a single index. A $2 \times 2$ matrix always produces exactly two eigenvectors, perpendicular to each other.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An &lt;strong>eigenvalue&lt;/strong> ($\lambda$) is a number that tells us how much variance &amp;mdash; how much &amp;ldquo;spread&amp;rdquo; &amp;mdash; the data has along its corresponding eigenvector direction. A large eigenvalue means the countries are widely dispersed in that direction (lots of information), while a small eigenvalue means they are tightly clustered (little information). The eigenvalues always sum to the total variance in the data; for standardized variables, that total equals the number of variables (in our case, 2).&lt;/p>
&lt;/li>
&lt;/ul>
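&lt;p>The defining equation is easy to see in action. The sketch below builds a $2 \times 2$ correlation matrix with the post&amp;rsquo;s $r = 0.9595$ and confirms that each eigenpair satisfies $\Sigma \mathbf{v} = \lambda \mathbf{v}$:&lt;/p>

```python
import numpy as np

# 2x2 correlation matrix with the correlation from this post
r = 0.9595
sigma = np.array([[1.0, r],
                  [r, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# eigh returns eigenvectors as columns; iterate over them
for lam, v in zip(eigenvalues, eigenvectors.T):
    # Multiplying by sigma only rescales an eigenvector by its eigenvalue
    assert np.allclose(sigma @ v, lam * v)

print("Both eigenpairs satisfy Sigma v = lambda v")
```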
&lt;p>&lt;strong>Why are they useful for PCA?&lt;/strong> Together, eigenvectors and eigenvalues answer two questions at once. The eigenvector with the &lt;strong>largest&lt;/strong> eigenvalue identifies the single best direction to project our data &amp;mdash; the direction that captures the most variation across countries. Its components tell us exactly how much weight to give each indicator in our composite index. The ratio of the largest eigenvalue to the total tells us what percentage of the original information our single-number index retains.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> For a $2 \times 2$ correlation matrix, the eigenvalues have an elegant closed-form solution: $\lambda_1 = 1 + r$ and $\lambda_2 = 1 - r$. With our correlation of $r = 0.9595$: $\lambda_1 = 1 + 0.9595 = 1.9595$ and $\lambda_2 = 1 - 0.9595 = 0.0405$. The variance explained by PC1 is $1.9595 / 2.0000 = 97.97\%$. This reveals a direct link between correlation strength and PCA compression power. Because $r = 0.96$, the first eigenvalue absorbs nearly all the variance, leaving only $\lambda_2 = 0.04$ for PC2. The higher the correlation between our health indicators, the more variance PC1 captures. At the extreme, if the correlation were zero, both eigenvalues would equal 1.0 and PCA would offer no compression advantage at all.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Imagine plotting all 50 countries on a 2D graph &amp;mdash; Standardized Life Expectancy on the X-axis, Standardized Infant Survival on the Y-axis. The countries form a narrow, elongated cloud stretching diagonally from the lower-left (unhealthy countries) to the upper-right (healthy countries). The first eigenvector is a straight line drawn through the long axis of this cloud &amp;mdash; the direction where countries differ the most. Its eigenvalue measures how stretched the cloud is along that line. If you stood at the center of the cloud and looked down the first eigenvector, you would see maximum separation between countries. Look down the second eigenvector (perpendicular), and the countries would appear tightly bunched &amp;mdash; almost no useful information in that direction. This is why we keep the first eigenvector and discard the second: it captures nearly all the meaningful variation.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> Eigen-decomposition removes human bias. Instead of randomly guessing how much weight to give Life Expectancy versus Infant Mortality, the math calculates the absolute optimal weights to capture the maximum amount of overlapping information. The algorithm finds the single direction that best summarizes 50 countries' health performance &amp;mdash; no subjective judgment required.&lt;/p>
&lt;pre>&lt;code class="language-python">eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort in descending order (eigh returns ascending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Sign convention: first weight positive
if eigenvectors[0, 0] &amp;lt; 0:
    eigenvectors[:, 0] *= -1
var_explained = eigenvalues / eigenvalues.sum() * 100
print(f&amp;quot;Eigenvalues: [{eigenvalues[0]:.4f}, {eigenvalues[1]:.4f}]&amp;quot;)
print(f&amp;quot;Sum of eigenvalues: {eigenvalues.sum():.4f}&amp;quot;)
print(f&amp;quot;\nEigenvector (PC1): [{eigenvectors[0, 0]:.4f}, {eigenvectors[1, 0]:.4f}]&amp;quot;)
print(f&amp;quot;Eigenvector (PC2): [{eigenvectors[0, 1]:.4f}, {eigenvectors[1, 1]:.4f}]&amp;quot;)
print(f&amp;quot;\nVariance explained:&amp;quot;)
print(f&amp;quot; PC1: {var_explained[0]:.2f}%&amp;quot;)
print(f&amp;quot; PC2: {var_explained[1]:.2f}%&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Eigenvalues: [1.9595, 0.0405]
Sum of eigenvalues: 2.0000
Eigenvector (PC1): [0.7071, 0.7071]
Eigenvector (PC2): [0.7071, -0.7071]
Variance explained:
PC1: 97.97%
PC2: 2.03%
&lt;/code>&lt;/pre>
&lt;p>The first eigenvalue is 1.9595 and the second is just 0.0405, summing to 2.0 (the number of variables). PC1 captures 97.97% of all variance in the data &amp;mdash; nearly everything. The eigenvector weights are both 0.7071 ($\approx 1/\sqrt{2}$), meaning both variables contribute equally to PC1. This equal weighting is not a coincidence &amp;mdash; it is a mathematical certainty whenever PCA is applied to exactly two standardized variables. A $2 \times 2$ correlation matrix always has eigenvectors $[1/\sqrt{2}, \; 1/\sqrt{2}]$ and $[1/\sqrt{2}, \; -1/\sqrt{2}]$, for any nonzero correlation. With three or more variables, the weights would generally differ, giving more influence to variables that contribute unique information.&lt;/p>
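&lt;p>The closed-form result $\lambda_1 = 1 + r$, $\lambda_2 = 1 - r$ can be verified numerically for any correlation value:&lt;/p>

```python
import numpy as np

# Eigenvalues of a 2x2 correlation matrix are always 1 + r and 1 - r
for r in [0.0, 0.3, 0.9595]:
    sigma = np.array([[1.0, r], [r, 1.0]])
    eigenvalues = np.sort(np.linalg.eigvalsh(sigma))[::-1]  # descending
    assert np.allclose(eigenvalues, [1 + r, 1 - r])
    print(f"r = {r:.4f}: eigenvalues = {eigenvalues.round(4)}")
```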
&lt;h3 id="91-visualizing-the-principal-components">9.1 Visualizing the principal components&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 8))
fig.patch.set_linewidth(0)
ax.scatter(Z[:, 0], Z[:, 1], color=STEEL_BLUE, edgecolors=DARK_NAVY,
s=60, zorder=3, alpha=0.8)
# Eigenvector arrows scaled by sqrt(eigenvalue) so length reflects variance
vis = 1.5 # visibility multiplier
scale_pc1 = np.sqrt(eigenvalues[0]) * vis
scale_pc2 = np.sqrt(eigenvalues[1]) * vis
ax.annotate(&amp;quot;&amp;quot;, xy=(eigenvectors[0, 0] * scale_pc1, eigenvectors[1, 0] * scale_pc1),
xytext=(0, 0),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=WARM_ORANGE, lw=2.5))
ax.annotate(&amp;quot;&amp;quot;, xy=(eigenvectors[0, 1] * scale_pc2, eigenvectors[1, 1] * scale_pc2),
xytext=(0, 0),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=TEAL, lw=2.0))
ax.text(eigenvectors[0, 0] * scale_pc1 + 0.15, eigenvectors[1, 0] * scale_pc1 + 0.15,
f&amp;quot;PC1 ({var_explained[0]:.1f}%)&amp;quot;, color=WARM_ORANGE, fontsize=12,
fontweight=&amp;quot;bold&amp;quot;)
ax.text(eigenvectors[0, 1] * scale_pc2 + 0.15, eigenvectors[1, 1] * scale_pc2 - 0.15,
f&amp;quot;PC2 ({var_explained[1]:.1f}%)&amp;quot;, color=TEAL, fontsize=12,
fontweight=&amp;quot;bold&amp;quot;)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Standardized Life Expectancy (Z-score)&amp;quot;)
ax.set_ylabel(&amp;quot;Standardized Infant Survival (Z-score)&amp;quot;)
ax.set_title(&amp;quot;Standardized data with principal component directions&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca_standardized_eigenvectors.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_standardized_eigenvectors.png" alt="Standardized data with PC1 and PC2 eigenvector arrows overlaid.">&lt;/p>
&lt;p>The orange PC1 arrow points along the diagonal &amp;mdash; the direction of maximum spread through the narrow, elongated data cloud. Because both weights are positive and equal (0.7071 each), PC1 essentially averages the two standardized indicators. The teal PC2 arrow is perpendicular and captures only the small residual variation (2.03%) not explained by PC1.&lt;/p>
&lt;h3 id="92-variance-explained">9.2 Variance explained&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(6, 4))
fig.patch.set_linewidth(0)
bars = ax.bar([&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;], var_explained,
color=[WARM_ORANGE, STEEL_BLUE],
edgecolor=DARK_NAVY, width=0.5)
for bar, val in zip(bars, var_explained):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
            f&amp;quot;{val:.1f}%&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;, fontsize=13,
            fontweight=&amp;quot;bold&amp;quot;, color=WHITE_TEXT)
ax.set_ylabel(&amp;quot;Variance Explained (%)&amp;quot;)
ax.set_title(&amp;quot;Variance explained by each principal component&amp;quot;)
ax.set_ylim(0, 110)
plt.savefig(&amp;quot;pca_variance_explained.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_variance_explained.png" alt="Bar chart showing PC1 captures 98.0% and PC2 captures 2.0% of total variance.">&lt;/p>
&lt;p>PC1 alone captures 98.0% of all variation, meaning a single number retains almost all the information in the original two variables. The remaining 2.0% on PC2 is mostly noise from the simulation. This extreme dominance confirms that the two health indicators are largely redundant &amp;mdash; an ideal scenario for PCA compression. With the optimal weights in hand, we can now compute each country&amp;rsquo;s score.&lt;/p>
&lt;h2 id="10-step-5-scoring-----building-the-index">10. Step 5: Scoring &amp;mdash; building the index&lt;/h2>
&lt;p>Step 4 produced the eigenvector $[0.7071, \; 0.7071]$. These two numbers are the weights $w_1$ and $w_2$ &amp;mdash; the recipe for building our index. The first component ($w_1 = 0.7071$) multiplies standardized Life Expectancy, and the second ($w_2 = 0.7071$) multiplies standardized Infant Survival. The eigenvector itself IS the formula for combining the variables.&lt;/p>
&lt;p>&lt;strong>What it is:&lt;/strong> We multiply each country&amp;rsquo;s standardized data by the weights from our eigenvector to calculate their final Principal Component 1 ($PC1$) score:&lt;/p>
&lt;p>$$PC1_i = (w_1 \times Z_{i,LE}) + (w_2 \times Z_{i,IM})$$&lt;/p>
&lt;p>In words, this says: for country $i$, the PC1 score is $w_1$ times its standardized life expectancy plus $w_2$ times its standardized (adjusted) infant mortality. Here $w_1$ and $w_2$ are the eigenvector components from Step 4, $Z_{i,LE}$ is &lt;code>z_le&lt;/code>, and $Z_{i,IM}$ is &lt;code>z_im&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Why are the weights equal?&lt;/strong> As explained in Step 4, equal weights ($w_1 = w_2 = 1/\sqrt{2}$) are a mathematical certainty in the two-variable standardized case &amp;mdash; they hold for any correlation value $r$. Because both weights are identical, the PC1 score is equivalent to a simple average of the two Z-scores (scaled by $\sqrt{2}$). In practice, this means PCA adds no weighting advantage over a naive average when you have exactly two standardized indicators. The real power of PCA emerges with three or more variables, where the algorithm discovers unequal weights that reflect each variable&amp;rsquo;s unique contribution.&lt;/p>
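&lt;p>This equivalence is easy to demonstrate on synthetic data: with equal weights, PC1 is just the simple average of the two Z-scores rescaled by $\sqrt{2}$, so the resulting country rankings are identical:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables, then standardized (synthetic illustration)
a = rng.normal(size=200)
b = 0.9 * a + 0.3 * rng.normal(size=200)
z1 = (a - a.mean()) / a.std(ddof=0)
z2 = (b - b.mean()) / b.std(ddof=0)

# PC1 with the fixed two-variable weights 1/sqrt(2)
pc1 = (z1 + z2) / np.sqrt(2)

# Simple average of the two Z-scores
avg = (z1 + z2) / 2

# PC1 equals the average rescaled by sqrt(2): identical ordering
assert np.allclose(pc1, avg * np.sqrt(2))
assert np.array_equal(np.argsort(pc1), np.argsort(avg))
```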
&lt;p>&lt;strong>The application:&lt;/strong> Country_01&amp;rsquo;s Z-scores from Step 2 were $Z_{LE} = 1.0402$ and $Z_{IM} = 0.7587$. Applying the formula: $PC1 = 0.7071 \times 1.0402 + 0.7071 \times 0.7587 = 0.7355 + 0.5365 = 1.2720$. Country_01&amp;rsquo;s PC1 score of 1.27 is positive and well above the mean of 0, placing it in the healthier half of the sample. The contribution from life expectancy (0.7355) is slightly larger than from infant survival (0.5365), reflecting the fact that Country_01 is further above average in life expectancy ($Z = 1.04$) than in infant survival ($Z = 0.76$).&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> We take every country&amp;rsquo;s dot on our 2D graph and project it squarely onto that single diagonal line. The position of the country along that line is its new, single Health Score.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We have successfully collapsed a 2D matrix into a 1D number line. Two variables have officially become one composite index.&lt;/p>
&lt;pre>&lt;code class="language-python">w1 = eigenvectors[0, 0]
w2 = eigenvectors[1, 0]
df[&amp;quot;pc1&amp;quot;] = w1 * df[&amp;quot;z_le&amp;quot;] + w2 * df[&amp;quot;z_im&amp;quot;]
print(f&amp;quot;Eigenvector weights: w1 = {w1:.4f}, w2 = {w2:.4f}&amp;quot;)
print(f&amp;quot;\nPC1 score statistics:&amp;quot;)
print(f&amp;quot; Mean: {df['pc1'].mean():.4f}&amp;quot;)
print(f&amp;quot; Std: {df['pc1'].std(ddof=0):.4f}&amp;quot;)
print(f&amp;quot; Min: {df['pc1'].min():.4f}&amp;quot;)
print(f&amp;quot; Max: {df['pc1'].max():.4f}&amp;quot;)
print(f&amp;quot;\nTop 5 countries (highest PC1):&amp;quot;)
print(df.nlargest(5, &amp;quot;pc1&amp;quot;)[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;pc1&amp;quot;]]
.to_string(index=False))
print(f&amp;quot;\nBottom 5 countries (lowest PC1):&amp;quot;)
print(df.nsmallest(5, &amp;quot;pc1&amp;quot;)[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;pc1&amp;quot;]]
.to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Eigenvector weights: w1 = 0.7071, w2 = 0.7071
PC1 score statistics:
Mean: 0.0000
Std: 1.3998
Min: -2.3892
Max: 2.3734
Top 5 countries (highest PC1):
country life_exp infant_mort pc1
Country_12 84.7 3.8 2.373421
Country_32 83.4 4.0 2.256521
Country_23 81.6 3.5 2.130292
Country_06 83.6 8.6 2.062127
Country_03 81.3 11.6 1.733944
Bottom 5 countries (lowest PC1):
country life_exp infant_mort pc1
Country_05 54.9 53.8 -2.389155
Country_28 57.7 58.7 -2.381854
Country_29 58.8 55.9 -2.162285
Country_18 58.5 54.9 -2.141282
Country_50 57.2 51.9 -2.111422
&lt;/code>&lt;/pre>
&lt;p>PC1 scores range from $-2.39$ (Country_05) to $+2.37$ (Country_12). The top-scoring countries combine high life expectancy (81&amp;ndash;85 years) with very low infant mortality (3.5&amp;ndash;8.6 per 1,000), while the bottom-scoring countries show the opposite pattern (54.9&amp;ndash;58.8 years, 51.9&amp;ndash;58.7 per 1,000). The mean of 0.0 confirms that PC1 is centered, as expected from standardized inputs.&lt;/p>
&lt;pre>&lt;code class="language-python">df_sorted = df.sort_values(&amp;quot;pc1&amp;quot;, ascending=True)
fig, ax = plt.subplots(figsize=(10, 14))
fig.patch.set_linewidth(0)
colors = [TEAL if v &amp;gt;= 0 else WARM_ORANGE for v in df_sorted[&amp;quot;pc1&amp;quot;]]
ax.barh(range(len(df_sorted)), df_sorted[&amp;quot;pc1&amp;quot;], color=colors,
edgecolor=DARK_NAVY, height=0.7)
ax.set_yticks(range(len(df_sorted)))
ax.set_yticklabels(df_sorted[&amp;quot;country&amp;quot;], fontsize=8)
ax.axvline(0, color=LIGHT_TEXT, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;PC1 Score&amp;quot;)
ax.set_title(&amp;quot;PC1 scores: countries ranked by health performance&amp;quot;)
plt.savefig(&amp;quot;pca_pc1_scores.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_pc1_scores.png" alt="Horizontal bar chart of 50 countries ranked by PC1 score.">&lt;/p>
&lt;p>The bar chart reveals a roughly symmetric distribution of PC1 scores around zero, with the healthiest countries (teal bars) on the right and the least healthy (orange bars) on the left. However, the raw PC1 scores include negative numbers &amp;mdash; a format that is hard to communicate in policy reports. The next step normalizes these scores to a 0&amp;ndash;1 scale.&lt;/p>
&lt;h2 id="11-step-6-normalization-----making-it-human-readable">11. Step 6: Normalization &amp;mdash; making it human-readable&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We apply Min-Max scaling to compress the $PC1$ scores into a range between 0 and 1. Let $\min(PC1)$ be the lowest score in the sample, and $\max(PC1)$ be the highest:&lt;/p>
&lt;p>$$HI_i = \frac{PC1_i - PC1_{min}}{PC1_{max} - PC1_{min}}$$&lt;/p>
&lt;p>In words, this says: subtract the minimum PC1 score and divide by the range. The country with the lowest PC1 score gets $HI = 0$ and the country with the highest gets $HI = 1$.&lt;/p>
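&lt;p>A minimal sketch of the formula (with made-up scores, not the post&amp;rsquo;s actual values) shows that Min-Max scaling pins the endpoints at 0 and 1 while preserving the ranking:&lt;/p>

```python
import numpy as np

# Illustrative PC1 scores (not the post's actual values)
pc1 = np.array([-2.39, -1.10, 0.00, 1.27, 2.37])

# Min-Max scaling: worst score maps to 0, best maps to 1
hi = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
print(hi.round(4))

assert hi.min() == 0.0 and hi.max() == 1.0
# Monotone transformation: the ordering of countries is unchanged
assert np.array_equal(np.argsort(hi), np.argsort(pc1))
```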
&lt;p>&lt;strong>The application:&lt;/strong> Country_01&amp;rsquo;s PC1 score from Step 5 is 1.2720, while the sample minimum is $-2.3892$ and the maximum is $2.3734$. Applying the formula: $HI = (1.2720 - (-2.3892)) / (2.3734 - (-2.3892)) = 3.6612 / 4.7626 = 0.7687$. Country_01&amp;rsquo;s Health Index of 0.77 places it 77% of the way up the scale anchored by the worst-performing country (0.00) and the best-performing country (1.00). A policymaker can immediately understand this number without knowing anything about Z-scores or eigenvectors.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Because we standardized the data earlier, our $PC1$ scores have negative numbers (e.g., $-2.39$). You cannot easily publish a report saying a country has a health score of negative 2.39 &amp;mdash; it confuses policymakers and the public.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> Normalization forces the absolute lowest scoring country to equal $0$, and the highest scoring country to equal $1$. Everyone else scales proportionally in between. The result is a highly rigorous, purely data-driven index that is instantly understandable.&lt;/p>
&lt;pre>&lt;code class="language-python">pc1_min = df[&amp;quot;pc1&amp;quot;].min()
pc1_max = df[&amp;quot;pc1&amp;quot;].max()
df[&amp;quot;health_index&amp;quot;] = (df[&amp;quot;pc1&amp;quot;] - pc1_min) / (pc1_max - pc1_min)
print(f&amp;quot;PC1 range: [{pc1_min:.4f}, {pc1_max:.4f}]&amp;quot;)
print(f&amp;quot;\nHealth Index statistics:&amp;quot;)
print(f&amp;quot; Mean: {df['health_index'].mean():.4f}&amp;quot;)
print(f&amp;quot; Median: {df['health_index'].median():.4f}&amp;quot;)
print(f&amp;quot; Std: {df['health_index'].std(ddof=0):.4f}&amp;quot;)
print(f&amp;quot;\nTop 10 countries:&amp;quot;)
print(df.nlargest(10, &amp;quot;health_index&amp;quot;)[
[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;health_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nBottom 10 countries:&amp;quot;)
print(df.nsmallest(10, &amp;quot;health_index&amp;quot;)[
[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;health_index&amp;quot;]
].to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">PC1 range: [-2.3892, 2.3734]
Health Index statistics:
Mean: 0.5017
Median: 0.5182
Std: 0.2939
Top 10 countries:
country life_exp infant_mort health_index
Country_12 84.7 3.8 1.000000
Country_32 83.4 4.0 0.975454
Country_23 81.6 3.5 0.948950
Country_06 83.6 8.6 0.934637
Country_03 81.3 11.6 0.865729
Country_45 79.1 7.8 0.864043
Country_24 79.5 11.4 0.836335
Country_19 79.1 15.2 0.792782
Country_14 79.0 15.5 0.788153
Country_46 79.0 16.5 0.778523
Bottom 10 countries:
country life_exp infant_mort health_index
Country_05 54.9 53.8 0.000000
Country_28 57.7 58.7 0.001533
Country_29 58.8 55.9 0.047636
Country_18 58.5 54.9 0.052046
Country_50 57.2 51.9 0.058316
Country_37 56.5 48.4 0.079840
Country_09 58.3 50.1 0.094789
Country_26 61.8 53.4 0.123910
Country_36 59.9 48.9 0.134184
Country_39 60.9 48.5 0.155436
&lt;/code>&lt;/pre>
&lt;p>The Health Index has a mean of 0.50 and a median of 0.52, indicating a roughly symmetric distribution. Country_12 leads with a perfect score of 1.00 (life expectancy 84.7 years, infant mortality 3.8 per 1,000), while Country_05 anchors the bottom at 0.00 (54.9 years, 53.8 per 1,000). The gap between the top 10 (all above 0.78) and the bottom 10 (all below 0.16) reveals a stark divide in health outcomes across the sample.&lt;/p>
&lt;pre>&lt;code class="language-python">df_sorted_hi = df.sort_values(&amp;quot;health_index&amp;quot;, ascending=True)
fig, ax = plt.subplots(figsize=(10, 14))
fig.patch.set_linewidth(0)
# Gradient from warm orange (low) to teal (high)
cmap_colors = []
for val in df_sorted_hi[&amp;quot;health_index&amp;quot;]:
    r = int(0xd9 + val * (0x00 - 0xd9))
    g = int(0x77 + val * (0xd4 - 0x77))
    b = int(0x57 + val * (0xc8 - 0x57))
    cmap_colors.append(f&amp;quot;#{r:02x}{g:02x}{b:02x}&amp;quot;)
ax.barh(range(len(df_sorted_hi)), df_sorted_hi[&amp;quot;health_index&amp;quot;],
color=cmap_colors, edgecolor=DARK_NAVY, height=0.7)
ax.set_yticks(range(len(df_sorted_hi)))
ax.set_yticklabels(df_sorted_hi[&amp;quot;country&amp;quot;], fontsize=8)
ax.set_xlabel(&amp;quot;Health Index (0 = worst, 1 = best)&amp;quot;)
ax.set_title(&amp;quot;Health Index: countries ranked from 0 (worst) to 1 (best)&amp;quot;)
ax.set_xlim(0, 1.05)
plt.savefig(&amp;quot;pca_health_index.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_health_index.png" alt="Health Index bar chart for 50 countries with gradient coloring from orange to teal.">&lt;/p>
&lt;p>The gradient-colored bar chart makes the health divide visually immediate. Countries cluster into three rough groups: a high-performing cluster above 0.75 (warm teal), a middle group between 0.25 and 0.75, and a struggling cluster below 0.25 (warm orange). The bottom 10 countries all have Health Index values below 0.16, suggesting systemic health challenges that span both longevity and infant survival. Country_05 and Country_28 appear to have no bar at all &amp;mdash; this is not missing data. Country_05 has a Health Index of exactly 0.00 because it is the worst performer in the sample and Min-Max normalization maps the minimum to zero by definition. Country_28 has an index of just 0.0015, so close to zero that its bar is invisible at this scale. Both countries have low life expectancy (54.9 and 57.7 years) combined with high infant mortality (53.8 and 58.7 per 1,000), placing them at the extreme low end of the health spectrum.&lt;/p>
&lt;h2 id="12-replicating-the-analysis-with-scikit-learn">12. Replicating the analysis with scikit-learn&lt;/h2>
&lt;p>Now that we understand every step, scikit-learn can do the entire pipeline &amp;mdash; from raw CSV to final Health Index &amp;mdash; in a single, compact script. This section presents the automated pipeline and then compares its results against our manual implementation.&lt;/p>
&lt;h3 id="121-a-pca-pipeline-with-scikit-learn">12.1 A PCA pipeline with scikit-learn&lt;/h3>
&lt;p>The code block below is designed to be &lt;strong>reusable&lt;/strong>: by changing only the CSV file path, the column names, and the list of negative indicators, you can apply this same pipeline to any dataset.&lt;/p>
&lt;pre>&lt;code class="language-python"># ── Full PCA pipeline with scikit-learn ──────────────────────────
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# ── Configuration (change these for your own dataset) ────────────
CSV_FILE = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/health_data.csv&amp;quot;
ID_COL = &amp;quot;country&amp;quot; # Row identifier column
POSITIVE_COLS = [&amp;quot;life_exp&amp;quot;] # Higher = better
NEGATIVE_COLS = [&amp;quot;infant_mort&amp;quot;] # Higher = worse (will be flipped)
# Step 1: Load raw data from CSV
df_sk = pd.read_csv(CSV_FILE)
print(f&amp;quot;Loaded: {df_sk.shape[0]} rows, {df_sk.shape[1]} columns&amp;quot;)
# Step 2: Polarity adjustment — flip negative indicators
# Multiplying by -1 so that &amp;quot;higher = better&amp;quot; for all variables
for col in NEGATIVE_COLS:
    df_sk[col + &amp;quot;_adj&amp;quot;] = -1 * df_sk[col]
adj_cols = POSITIVE_COLS + [col + &amp;quot;_adj&amp;quot; for col in NEGATIVE_COLS]
# Step 3: Standardization — Z-scores (mean=0, std=1)
# StandardScaler centers and scales each column independently
scaler = StandardScaler()
Z_sk = scaler.fit_transform(df_sk[adj_cols])
# Step 4: PCA — fit to find eigenvectors and eigenvalues
# n_components=1 because we want a single composite index
pca_sk = PCA(n_components=1)
pca_sk.fit(Z_sk)
# Step 5: Transform — project data onto the first principal component
df_sk[&amp;quot;pc1&amp;quot;] = pca_sk.transform(Z_sk)[:, 0]
# Step 6: Normalization — Min-Max scaling to 0-1
df_sk[&amp;quot;pc1_index&amp;quot;] = (
(df_sk[&amp;quot;pc1&amp;quot;] - df_sk[&amp;quot;pc1&amp;quot;].min())
/ (df_sk[&amp;quot;pc1&amp;quot;].max() - df_sk[&amp;quot;pc1&amp;quot;].min())
)
# Export results
df_sk.to_csv(&amp;quot;pc1_index_results.csv&amp;quot;, index=False)
# Summary
indicator_cols = POSITIVE_COLS + NEGATIVE_COLS
print(f&amp;quot;\nPC1 weights: {pca_sk.components_[0].round(4)}&amp;quot;)
print(f&amp;quot;Variance explained: {pca_sk.explained_variance_ratio_.round(4)}&amp;quot;)
print(f&amp;quot;\nTop 5:&amp;quot;)
print(df_sk.nlargest(5, &amp;quot;pc1_index&amp;quot;)[
[ID_COL] + indicator_cols + [&amp;quot;pc1_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nBottom 5:&amp;quot;)
print(df_sk.nsmallest(5, &amp;quot;pc1_index&amp;quot;)[
[ID_COL] + indicator_cols + [&amp;quot;pc1_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nSaved: pc1_index_results.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 50 rows, 3 columns
PC1 weights: [0.7071 0.7071]
Variance explained: [0.9797]
Top 5:
country life_exp infant_mort pc1_index
Country_12 84.7 3.8 1.000000
Country_32 83.4 4.0 0.975454
Country_23 81.6 3.5 0.948950
Country_06 83.6 8.6 0.934637
Country_03 81.3 11.6 0.865729
Bottom 5:
country life_exp infant_mort pc1_index
Country_05 54.9 53.8 0.000000
Country_28 57.7 58.7 0.001533
Country_29 58.8 55.9 0.047636
Country_18 58.5 54.9 0.052046
Country_50 57.2 51.9 0.058316
Saved: pc1_index_results.csv
&lt;/code>&lt;/pre>
&lt;p>The entire six-step manual pipeline collapses into roughly 15 lines of sklearn code. The configuration block at the top (&lt;code>CSV_FILE&lt;/code>, &lt;code>POSITIVE_COLS&lt;/code>, &lt;code>NEGATIVE_COLS&lt;/code>) makes the script reusable: to build a different composite index, simply point it to a new CSV and specify which columns are positive and which are negative. The rankings match our manual results exactly &amp;mdash; Country_12 leads at 1.00 and Country_05 anchors the bottom at 0.00.&lt;/p>
&lt;h3 id="122-manual-vs-scikit-learn-comparison">12.2 Manual vs. scikit-learn comparison&lt;/h3>
&lt;p>Now that we have both sets of PC1 scores &amp;mdash; one from our six manual steps, one from the sklearn pipeline &amp;mdash; we can compare them directly. One subtlety: eigenvectors are defined up to a sign flip, so sklearn may return scores with the opposite sign. We check for this and flip if needed.&lt;/p>
&lt;pre>&lt;code class="language-python">sklearn_pc1 = df_sk[&amp;quot;pc1&amp;quot;].values
# Handle sign ambiguity: eigenvectors can point in either direction
sign_corr = np.corrcoef(df[&amp;quot;pc1&amp;quot;], sklearn_pc1)[0, 1]
if sign_corr &amp;lt; 0:
    sklearn_pc1 = -sklearn_pc1
    print(&amp;quot;Note: sklearn returned opposite sign (normal). Flipped for comparison.&amp;quot;)
max_diff = np.max(np.abs(sklearn_pc1 - df[&amp;quot;pc1&amp;quot;].values))
corr_val = np.corrcoef(df[&amp;quot;pc1&amp;quot;], sklearn_pc1)[0, 1]
print(f&amp;quot;Max absolute difference in PC1 scores: {max_diff:.2e}&amp;quot;)
print(f&amp;quot;Correlation between manual and sklearn: {corr_val:.6f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Max absolute difference in PC1 scores: 1.33e-15
Correlation between manual and sklearn: 1.000000
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(6, 6))
fig.patch.set_linewidth(0)
ax.scatter(df[&amp;quot;pc1&amp;quot;], sklearn_pc1, color=STEEL_BLUE, edgecolors=DARK_NAVY,
s=60, zorder=3)
lim_min = min(df[&amp;quot;pc1&amp;quot;].min(), sklearn_pc1.min()) - 0.2
lim_max = max(df[&amp;quot;pc1&amp;quot;].max(), sklearn_pc1.max()) + 0.2
ax.plot([lim_min, lim_max], [lim_min, lim_max], color=WARM_ORANGE,
linewidth=2, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;Perfect agreement&amp;quot;, zorder=2)
ax.set_xlabel(&amp;quot;Manual PC1 Score&amp;quot;)
ax.set_ylabel(&amp;quot;scikit-learn PC1 Score&amp;quot;)
ax.set_title(&amp;quot;Manual vs. scikit-learn PCA: verification&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca_sklearn_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_sklearn_comparison.png" alt="Scatter plot showing perfect agreement between manual and sklearn PCA scores.">&lt;/p>
&lt;p>The maximum absolute difference between manual and sklearn PC1 scores is $1.33 \times 10^{-15}$ &amp;mdash; essentially machine-precision zero. The correlation is 1.000000, confirming perfect agreement. All 50 points fall exactly on the dashed 45-degree line. This validates that our step-by-step manual implementation produces identical results to the optimized library.&lt;/p>
&lt;h2 id="13-summary-results">13. Summary results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Step&lt;/th>
&lt;th>Input&lt;/th>
&lt;th>Output&lt;/th>
&lt;th>Key Result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Polarity&lt;/td>
&lt;td>IM (raw)&lt;/td>
&lt;td>IM* = -IM&lt;/td>
&lt;td>Correlation: -0.96 to +0.96&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Standardization&lt;/td>
&lt;td>LE, IM*&lt;/td>
&lt;td>Z_LE, Z_IM&lt;/td>
&lt;td>Mean=0, SD=1 for both&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Covariance&lt;/td>
&lt;td>Z matrix&lt;/td>
&lt;td>2x2 matrix&lt;/td>
&lt;td>Off-diagonal r = 0.96&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Eigen-decomposition&lt;/td>
&lt;td>Cov matrix&lt;/td>
&lt;td>eigenvalues, eigenvectors&lt;/td>
&lt;td>PC1 captures 98.0%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Scoring&lt;/td>
&lt;td>Z * eigvec&lt;/td>
&lt;td>PC1 scores&lt;/td>
&lt;td>Range: [-2.39, 2.37]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Normalization&lt;/td>
&lt;td>PC1&lt;/td>
&lt;td>Health Index&lt;/td>
&lt;td>Range: [0.00, 1.00]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="14-discussion">14. Discussion&lt;/h2>
&lt;p>The answer to our opening question is clear: &lt;strong>yes, PCA successfully reduces two correlated health indicators into a single Health Index.&lt;/strong> With a correlation of 0.96 between life expectancy and adjusted infant mortality, PC1 captures 97.97% of all variation &amp;mdash; meaning our single-number index retains virtually all the information from both original variables.&lt;/p>
&lt;p>The approximately equal eigenvector weights ($w_1 = w_2 = 0.707$) reveal that PCA produced an index nearly identical to a simple average of Z-scores. This is not always the case. With less correlated indicators or more than two variables, the weights would diverge, giving more influence to the indicators that contribute unique information. In high-dimensional settings with 15 or 20 indicators, PCA&amp;rsquo;s ability to discover these unequal weights becomes far more valuable than any manual weighting scheme. For an applied example, &lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez and Gonzales (2021)&lt;/a> use PCA to classify 339 Bolivian municipalities according to human capital constraints &amp;mdash; combining malnutrition, language barriers, dropout rates, and education inequality into composite indices that reveal distinct geographic clusters of deprivation.&lt;/p>
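&lt;p>A quick numerical sketch makes this equivalence concrete. With equal unit-length weights $w = (0.707, 0.707)$, the PC1 score is exactly $\sqrt{2}$ times the simple Z-score average, so the two produce identical rankings. The toy data below is assumed for illustration and is not the tutorial&amp;rsquo;s dataset.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Toy sketch (assumed data): equal eigenvector weights make PC1 a
# scaled version of the simple Z-score average.
rng = np.random.default_rng(42)
toy_z = rng.standard_normal((50, 2))       # pretend standardized indicators
toy_w = np.array([1.0, 1.0]) / np.sqrt(2)  # equal weights, unit length
toy_pc1 = toy_z @ toy_w                    # PCA-style score
toy_avg = toy_z.mean(axis=1)               # simple Z-score average
# toy_pc1 equals sqrt(2) * toy_avg, so the rankings coincide
print(np.allclose(toy_pc1, np.sqrt(2) * toy_avg))  # True
&lt;/code>&lt;/pre>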
&lt;p>A policymaker looking at these results could immediately identify that the bottom 10 countries (Health Index below 0.16) suffer from both low life expectancy and high infant mortality, indicating systemic health system weaknesses rather than isolated problems. These countries would be natural candidates for comprehensive health investment packages rather than single-issue interventions.&lt;/p>
&lt;p>It is crucial to understand that this index is a measure of &lt;strong>relative performance&lt;/strong> within the specific sample. A score of 1.0 does not mean a country has achieved perfect health &amp;mdash; it simply means that country is the best performer among the 50 analyzed. Adding or removing countries from the sample changes every score because both the standardization parameters (mean and standard deviation) and the Min-Max bounds depend on which countries are included. If a new country with extremely high life expectancy joins the sample, every existing country&amp;rsquo;s Z-scores shift downward, altering all PC1 scores and the final index.&lt;/p>
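&lt;p>A tiny sketch (with assumed toy numbers, not the tutorial&amp;rsquo;s data) illustrates this sample dependence: adding one high-performing country lowers every existing country&amp;rsquo;s Z-score even though their raw values are unchanged.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical example: three countries, then a new top performer joins
life_exp = np.array([60.0, 70.0, 80.0])
z_before = (life_exp - life_exp.mean()) / life_exp.std()
extended = np.append(life_exp, 95.0)   # new country with very high value
z_after = (extended - extended.mean()) / extended.std()
print('Z-scores before:', np.round(z_before, 2))
print('Z-scores after :', np.round(z_after[:3], 2))
# every original country's Z-score shifts downward
&lt;/code>&lt;/pre>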
&lt;p>Using a PCA-based health index to compare against a PCA-based education index is also problematic. A health index score of 0.77 and an education index score of 0.77 may look equivalent, but they are not directly comparable. Each index has its own eigenvectors, eigenvalues, and standardization parameters derived from entirely different variables with different correlation structures. The numbers live on different scales &amp;mdash; 0.77 in health means &amp;ldquo;77% of the way between the worst and best health performers,&amp;rdquo; while 0.77 in education means the same relative position but within a completely different set of indicators. Combining or averaging PCA indices across domains requires additional methodological choices (such as those used in the UNDP Human Development Index).&lt;/p>
&lt;p>Using our PCA-based health index to study changes over time introduces further challenges. If you compute the index separately for each year, both the eigenvector weights and the Z-score parameters (means, standard deviations) can shift from year to year, making scores from different periods non-comparable. A country&amp;rsquo;s index could improve not because its health system got better, but because the sample&amp;rsquo;s average got worse. One potential solution is a &lt;strong>pooled PCA approach&lt;/strong> &amp;mdash; standardizing across all years simultaneously and computing a single set of eigenvectors from the pooled covariance matrix. However, this requires assuming that the correlation structure between indicators remains constant over time, which may not hold if the relationship between life expectancy and infant mortality evolves as countries develop. For an example of PCA applied to social progress indicators across countries and multiple years, see &lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, Picazo-Tadeo, and Rios (2023)&lt;/a>.&lt;/p>
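&lt;p>The comparability problem can be sketched in a few lines (toy numbers, assumed for illustration): a country whose value does not change can still appear to improve under per-year standardization simply because the rest of the sample deteriorated, while pooled standardization keeps its score fixed.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Country A keeps the same value; the others get worse in year 2
year1 = np.array([70.0, 75.0, 80.0])     # A is the first entry
year2 = np.array([70.0, 70.5, 72.0])
z_y1 = (year1[0] - year1.mean()) / year1.std()
z_y2 = (year2[0] - year2.mean()) / year2.std()
pooled = np.concatenate([year1, year2])
zp_y1 = (year1[0] - pooled.mean()) / pooled.std()
zp_y2 = (year2[0] - pooled.mean()) / pooled.std()
print(f'Per-year Z for A: {z_y1:.2f} then {z_y2:.2f}')  # appears to improve
print(f'Pooled Z for A  : {zp_y1:.2f} then {zp_y2:.2f}')  # unchanged
&lt;/code>&lt;/pre>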
&lt;h2 id="15-summary-and-next-steps">15. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> PCA is most effective when indicators are highly correlated. With $r = 0.96$, PC1 captured 98.0% of variance. With weakly correlated indicators, PCA would require multiple components, reducing the simplicity advantage. Always check the correlation structure before choosing PCA for index construction.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> The equal eigenvector weights (both 0.707) mean PCA produced an index nearly identical to a simple Z-score average in this two-variable case. The real power of PCA emerges when variables contribute unequally and you need the algorithm to discover the optimal weighting.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> With only two variables, PCA offers modest dimensionality reduction (2 to 1). The technique&amp;rsquo;s full value emerges with many indicators (e.g., 15 SDG variables reduced to 3&amp;ndash;4 components). Also, the Health Index is relative &amp;mdash; adding or removing countries changes every score because of the Min-Max normalization.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend to multi-variable PCA with real data (e.g., SDG indicators covering education, income, and health). Explore how many components to retain using the scree plot and cumulative variance threshold (commonly 80&amp;ndash;90%), and consider factor analysis for latent variable interpretation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations of this analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data is simulated. Real WHO data would include outliers, missing values, and non-normal distributions that require additional preprocessing.&lt;/li>
&lt;li>Two-variable PCA is a pedagogical simplification. Real composite indices (like the UNDP Human Development Index) use more indicators and often apply domain-specific weighting decisions alongside statistical methods.&lt;/li>
&lt;li>Min-Max normalization is sensitive to outliers. A single extreme country can compress the range for everyone else. Robust alternatives include percentile ranking or winsorization.&lt;/li>
&lt;/ul>
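&lt;p>The Min-Max sensitivity in the last point is easy to demonstrate with a toy example (the scores below are assumed): one extreme value compresses all other countries into a narrow band near zero, while a percentile rank keeps them evenly spread.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

# Assumed toy scores; the last entry is an extreme outlier
scores = np.array([0.50, 0.55, 0.60, 0.65, 3.00])
minmax = (scores - scores.min()) / (scores.max() - scores.min())
pct_rank = pd.Series(scores).rank(pct=True).values
print('Min-Max  :', np.round(minmax, 3))    # first four squeezed near 0
print('Pct rank :', np.round(pct_rank, 3))  # evenly spread 0.2 ... 1.0
&lt;/code>&lt;/pre>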
&lt;h2 id="16-exercises">16. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Add a third indicator.&lt;/strong> Extend the data generating process with a third variable (e.g., &lt;code>healthcare_spending = 200 + 800 * base_health + noise&lt;/code>). Run the same pipeline with three variables. How does the variance explained by PC1 change? Do the eigenvector weights shift from equal?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test outlier sensitivity.&lt;/strong> Modify one country to have extreme values (e.g., &lt;code>life_exp = 40&lt;/code>, &lt;code>infant_mort = 100&lt;/code>). How does Min-Max normalization affect the rankings of other countries? Try replacing Min-Max with percentile-based normalization and compare.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Apply to real data.&lt;/strong> Download Life Expectancy and Infant Mortality data from the &lt;a href="https://www.who.int/data/gho" target="_blank" rel="noopener">WHO Global Health Observatory&lt;/a>. Apply the six-step pipeline to real countries. Compare your PCA-based Health Index ranking with the UNDP Human Development Index and discuss any discrepancies.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="17-references">17. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1098/rsta.2015.0202" target="_blank" rel="noopener">Jolliffe, I. T. and Cadima, J. (2016). Principal Component Analysis: A Review and Recent Developments. &lt;em>Philosophical Transactions of the Royal Society A&lt;/em>, 374(2065).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/14786440109462720" target="_blank" rel="noopener">Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. &lt;em>The London, Edinburgh, and Dublin Philosophical Magazine&lt;/em>, 2(11), 559&amp;ndash;572.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; PCA Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; StandardScaler Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hdr.undp.org/data-center/human-development-index" target="_blank" rel="noopener">UNDP (2024). Human Development Index &amp;ndash; Technical Notes.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.who.int/data/gho" target="_blank" rel="noopener">WHO &amp;ndash; Global Health Observatory Data Repository&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez, C. and Gonzales, E. (2021). Human Capital Constraints, Spatial Dependence, and Regionalization in Bolivia. &lt;em>Economia&lt;/em>, 44(87).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/_6UjscCJrYE" target="_blank" rel="noopener">Principal Component Analysis (PCA) Explained Simply (YouTube)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/nEvKduLXFvk" target="_blank" rel="noopener">Visualizing Principal Component Analysis (PCA) (YouTube)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://numiqo.com/lab/pca" target="_blank" rel="noopener">Numiqo &amp;ndash; PCA Interactive Lab&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, J., Picazo-Tadeo, A. J., and Rios, V. (2023). Social Progress around the World: Trends and Convergence. &lt;em>Oxford Economic Papers&lt;/em>, 75(2), 281&amp;ndash;306.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Pooled PCA for Building Development Indicators Across Time</title><link>https://carlos-mendez.org/post/python_pca2/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pca2/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/python_pca/">Introduction to PCA Analysis for Building Development Indicators&lt;/a>, we built a Health Index from two indicators using a six-step pipeline. That tutorial&amp;rsquo;s Discussion section raised a critical warning:&lt;/p>
&lt;blockquote>
&lt;p>If you compute the index separately for each year, both the eigenvector weights and the Z-score parameters (means, standard deviations) can shift from year to year, making scores from different periods non-comparable. A country&amp;rsquo;s index could improve not because its health system got better, but because the sample&amp;rsquo;s average got worse.&lt;/p>
&lt;/blockquote>
&lt;p>This sequel addresses that problem head-on with real data. We use the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Subnational Human Development Index&lt;/a> from the Global Data Lab, which provides Education, Health, and Income sub-indices for 153 sub-national regions across 12 South American countries in 2013 and 2019. When we track development over time, we need the yardstick to remain fixed. If the ruler itself changes between measurements, we cannot tell whether the object grew or the ruler shrank. &lt;strong>Pooled PCA&lt;/strong> solves this by standardizing and computing weights from all periods simultaneously, producing a single fixed yardstick that makes scores directly comparable across time.&lt;/p>
&lt;p>The real data reveals a nuanced story: education and health improved on average across South America between 2013 and 2019, but income &lt;strong>declined&lt;/strong>. This mixed signal makes the choice between pooled and per-period PCA consequential &amp;mdash; the two methods disagree on the direction of change for 16 out of 153 regions.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why per-period PCA produces non-comparable scores across time&lt;/li>
&lt;li>Implement pooled standardization using cross-period means and standard deviations&lt;/li>
&lt;li>Compute pooled eigenvectors from stacked data to obtain stable weights&lt;/li>
&lt;li>Apply pooled normalization with cross-period min/max bounds&lt;/li>
&lt;li>Contrast pooled vs per-period PCA using rank stability and direction-of-change analysis&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-pooled-pca-pipeline">2. The pooled PCA pipeline&lt;/h2>
&lt;p>The pooled pipeline extends the &lt;a href="https://carlos-mendez.org/post/python_pca/#2-the-pca-pipeline">six-step pipeline from the previous tutorial&lt;/a> by adding a stacking step at the beginning and replacing per-period parameters with pooled parameters at the standardization, covariance, and normalization steps.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
S[&amp;quot;&amp;lt;b&amp;gt;Step 0&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Stack&amp;lt;br/&amp;gt;Periods&amp;quot;] --&amp;gt; A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Polarity&amp;lt;br/&amp;gt;Adjustment&amp;quot;]
A --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Standardization&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Covariance&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled Eigen-&amp;lt;br/&amp;gt;Decomposition&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scoring&amp;lt;br/&amp;gt;(PC1)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Normalization&amp;quot;]
style S fill:#141413,stroke:#6a9bcc,color:#fff
style A fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The key insight is that Steps 2, 3, and 6 &amp;mdash; labeled &amp;ldquo;Pooled&amp;rdquo; &amp;mdash; compute their parameters from the stacked data (all periods combined) rather than from each period separately. This single change ensures that a region&amp;rsquo;s Z-score in 2013 is measured against the same baseline as its Z-score in 2019, that the eigenvector weights are fixed across time, and that the 0&amp;ndash;1 normalization uses a common scale.&lt;/p>
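&lt;p>The pooled-parameter idea, in a nutshell: every parameter (means, SDs, PC1 weights, min/max bounds) is computed once from both periods combined. The stacked toy data in this sketch is assumed for illustration; only the column names follow the tutorial.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

# Minimal sketch of pooled Steps 2, 3, and 6 (toy stacked data, assumed)
stacked = pd.DataFrame({
    'period':    ['Y2013'] * 3 + ['Y2019'] * 3,
    'education': [0.60, 0.70, 0.80, 0.65, 0.72, 0.78],
    'health':    [0.80, 0.85, 0.90, 0.82, 0.86, 0.91],
    'income':    [0.70, 0.75, 0.80, 0.66, 0.72, 0.77],
})
X = stacked[['education', 'health', 'income']].values
# Step 2: one mean/SD per indicator, computed across BOTH periods
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
# Steps 3-4: one covariance matrix and one set of PC1 weights for all years
cov = np.cov(Z.T, ddof=0)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
w = eigenvectors[:, np.argmax(eigenvalues)]
w = w * np.sign(w[0])            # fix the sign convention
pc1 = Z @ w
# Step 6: min/max bounds taken over the stacked scores, so 0-1 is one scale
index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
print('Fixed PC1 weights:', np.round(w, 3))
&lt;/code>&lt;/pre>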
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis uses the same libraries as the &lt;a href="https://carlos-mendez.org/post/python_pca/">previous tutorial&lt;/a>: NumPy for linear algebra, pandas for data management, matplotlib for visualization, and scikit-learn for verification.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Reproducibility
RANDOM_SEED = 42
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False,
&amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False,
&amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True,
&amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6,
&amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;xtick.major.size&amp;quot;: 0,
&amp;quot;ytick.major.size&amp;quot;: 0,
&amp;quot;text.color&amp;quot;: WHITE_TEXT,
&amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False,
&amp;quot;legend.fontsize&amp;quot;: 11,
&amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-loading-the-subnational-hdi-data">4. Loading the Subnational HDI data&lt;/h2>
&lt;p>The dataset is a subsample from the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Subnational Human Development Database&lt;/a> constructed by &lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits and Permanyer (2019)&lt;/a>, which provides sub-national development indicators for countries worldwide. We use the South American subset with three HDI component indices for 2013 and 2019. The original data is in wide format (one row per region, with year-specific columns), so we reshape it into a long panel format suitable for pooled PCA.&lt;/p>
&lt;pre>&lt;code class="language-python">DATA_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data.csv&amp;quot;
raw = pd.read_csv(DATA_URL)
print(f&amp;quot;Raw dataset: {raw.shape[0]} regions, {raw.shape[1]} columns&amp;quot;)
print(f&amp;quot;Countries: {raw['country'].nunique()}&amp;quot;)
# Reshape wide → long
rows = []
for _, r in raw.iterrows():
    for year in [2013, 2019]:
        rows.append({
            &amp;quot;GDLcode&amp;quot;: r[&amp;quot;GDLcode&amp;quot;],
            &amp;quot;region&amp;quot;: r[&amp;quot;region&amp;quot;],
            &amp;quot;country&amp;quot;: r[&amp;quot;country&amp;quot;],
            &amp;quot;period&amp;quot;: f&amp;quot;Y{year}&amp;quot;,
            &amp;quot;education&amp;quot;: round(r[f&amp;quot;edindex{year}&amp;quot;], 4),
            &amp;quot;health&amp;quot;: round(r[f&amp;quot;healthindex{year}&amp;quot;], 4),
            &amp;quot;income&amp;quot;: round(r[f&amp;quot;incindex{year}&amp;quot;], 4),
            &amp;quot;shdi_official&amp;quot;: round(r[f&amp;quot;shdi{year}&amp;quot;], 4),
            &amp;quot;pop&amp;quot;: round(r[f&amp;quot;pop{year}&amp;quot;], 1),
        })
df = pd.DataFrame(rows)
&lt;/code>&lt;/pre>
&lt;p>To make regions instantly identifiable in figures and tables, we create a &lt;code>region_country&lt;/code> label that combines a shortened region name with a three-letter country abbreviation. This avoids ambiguity &amp;mdash; for example, &amp;ldquo;Cordoba&amp;rdquo; exists in both Argentina and Colombia.&lt;/p>
&lt;pre>&lt;code class="language-python"># Create informative label: shortened region + country abbreviation
COUNTRY_ABBR = {
&amp;quot;Argentina&amp;quot;: &amp;quot;ARG&amp;quot;, &amp;quot;Bolivia&amp;quot;: &amp;quot;BOL&amp;quot;, &amp;quot;Brazil&amp;quot;: &amp;quot;BRA&amp;quot;,
&amp;quot;Chile&amp;quot;: &amp;quot;CHL&amp;quot;, &amp;quot;Colombia&amp;quot;: &amp;quot;COL&amp;quot;, &amp;quot;Ecuador&amp;quot;: &amp;quot;ECU&amp;quot;,
&amp;quot;Guyana&amp;quot;: &amp;quot;GUY&amp;quot;, &amp;quot;Paraguay&amp;quot;: &amp;quot;PRY&amp;quot;, &amp;quot;Peru&amp;quot;: &amp;quot;PER&amp;quot;,
&amp;quot;Suriname&amp;quot;: &amp;quot;SUR&amp;quot;, &amp;quot;Uruguay&amp;quot;: &amp;quot;URY&amp;quot;, &amp;quot;Venezuela&amp;quot;: &amp;quot;VEN&amp;quot;,
}
def make_label(region, country, max_len=25):
    &amp;quot;&amp;quot;&amp;quot;Shorten region name and append country abbreviation.&amp;quot;&amp;quot;&amp;quot;
    abbr = COUNTRY_ABBR.get(country, country[:3].upper())
    short = region[:max_len].rstrip(&amp;quot;, &amp;quot;) if len(region) &amp;gt; max_len else region
    return f&amp;quot;{short} ({abbr})&amp;quot;
df[&amp;quot;region_country&amp;quot;] = df.apply(
    lambda r: make_label(r[&amp;quot;region&amp;quot;], r[&amp;quot;country&amp;quot;]), axis=1
)
df.to_csv(&amp;quot;data_long.csv&amp;quot;, index=False)
INDICATORS = [&amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]
print(f&amp;quot;\nPanel dataset: {df.shape[0]} rows (= {raw.shape[0]} regions x 2 periods)&amp;quot;)
print(f&amp;quot;\nFirst 6 rows:&amp;quot;)
print(df[[&amp;quot;region_country&amp;quot;, &amp;quot;period&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]].head(6).to_string(index=False))
print(f&amp;quot;\nRegions per country:&amp;quot;)
print(raw[&amp;quot;country&amp;quot;].value_counts().sort_index().to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel dataset: 306 rows (= 153 regions x 2 periods)
First 6 rows:
region_country period education health income
City of Buenos Aires (ARG) Y2013 0.926 0.858 0.850
City of Buenos Aires (ARG) Y2019 0.946 0.872 0.832
Rest of Buenos Aires (ARG) Y2013 0.797 0.858 0.820
Rest of Buenos Aires (ARG) Y2019 0.830 0.872 0.802
Catamarca, La Rioja, San (ARG) Y2013 0.822 0.858 0.828
Catamarca, La Rioja, San (ARG) Y2019 0.856 0.872 0.810
Regions per country:
country
Argentina 11
Bolivia 9
Brazil 27
Chile 13
Colombia 33
Ecuador 3
Guyana 10
Paraguay 5
Peru 6
Suriname 5
Uruguay 7
Venezuela 24
&lt;/code>&lt;/pre>
&lt;p>The panel contains 306 rows (153 regions $\times$ 2 periods) covering 12 South American countries. Colombia contributes the most regions (33), followed by Brazil (27) and Venezuela (24), while Ecuador has only 3. The &lt;code>region_country&lt;/code> label &amp;mdash; such as &amp;ldquo;City of Buenos Aires (ARG)&amp;rdquo; or &amp;ldquo;Potosi (BOL)&amp;rdquo; &amp;mdash; will make every region immediately identifiable in the analysis that follows.&lt;/p>
&lt;pre>&lt;code class="language-python">print(f&amp;quot;Period means:&amp;quot;)
print(df.groupby(&amp;quot;period&amp;quot;)[INDICATORS].mean().round(4).to_string())
p1_means = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;][INDICATORS].mean()
p2_means = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;][INDICATORS].mean()
changes = p2_means - p1_means
print(f&amp;quot;\nMean changes (2019 - 2013):&amp;quot;)
print(f&amp;quot; Education: {changes['education']:+.4f}&amp;quot;)
print(f&amp;quot; Health: {changes['health']:+.4f}&amp;quot;)
print(f&amp;quot; Income: {changes['income']:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Period means:
education health income
period
Y2013 0.6674 0.8370 0.7355
Y2019 0.6899 0.8504 0.7153
Mean changes (2019 - 2013):
Education: +0.0225
Health: +0.0134
Income: -0.0202
&lt;/code>&lt;/pre>
&lt;p>The period means reveal a mixed development story: education rose from 0.667 to 0.690 (+0.023) and health from 0.837 to 0.850 (+0.013), but income &lt;strong>declined&lt;/strong> from 0.736 to 0.715 ($-0.020$). This income decline across much of South America between 2013 and 2019 &amp;mdash; driven by commodity price drops and economic slowdowns &amp;mdash; is a real signal that our PCA-based index must capture correctly. Note that all three indicators are positive-direction (higher means better), so no polarity adjustment is needed.&lt;/p>
&lt;h2 id="5-exploring-the-raw-data">5. Exploring the raw data&lt;/h2>
&lt;p>Before running any PCA, let us examine the country-level patterns, the correlation structure, and the period-to-period shift.&lt;/p>
&lt;pre>&lt;code class="language-python"># Country-level means by period
print(f&amp;quot;Country-level means by period:&amp;quot;)
country_means = (df.groupby([&amp;quot;country&amp;quot;, &amp;quot;period&amp;quot;])[INDICATORS]
.mean().round(3).unstack(&amp;quot;period&amp;quot;))
country_means.columns = [f&amp;quot;{col[0]}_{col[1]}&amp;quot; for col in country_means.columns]
print(country_means.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Country-level means by period:
education_Y2013 education_Y2019 health_Y2013 health_Y2019 income_Y2013 income_Y2019
country
Argentina 0.823 0.852 0.858 0.872 0.827 0.809
Bolivia 0.652 0.689 0.778 0.809 0.633 0.665
Brazil 0.659 0.684 0.838 0.859 0.745 0.732
Chile 0.732 0.781 0.911 0.925 0.806 0.814
Colombia 0.615 0.654 0.858 0.876 0.707 0.720
Ecuador 0.688 0.691 0.857 0.877 0.707 0.699
Guyana 0.568 0.574 0.747 0.764 0.599 0.622
Paraguay 0.612 0.624 0.820 0.835 0.698 0.717
Peru 0.671 0.713 0.845 0.866 0.688 0.703
Suriname 0.584 0.627 0.791 0.804 0.757 0.725
Uruguay 0.694 0.722 0.878 0.891 0.790 0.799
Venezuela 0.708 0.682 0.813 0.801 0.782 0.630
&lt;/code>&lt;/pre>
&lt;p>The country-level means reveal stark development gaps across South America. Chile leads in health (0.911&amp;ndash;0.925) and is strong across all dimensions. Argentina leads in education (0.823&amp;ndash;0.852) with high income. At the other end, Guyana has the lowest education (0.568&amp;ndash;0.574) and Bolivia the lowest health (0.778&amp;ndash;0.809). Most countries improved on all three indicators between 2013 and 2019, but two stand out for &lt;strong>income decline&lt;/strong>: Venezuela&amp;rsquo;s income collapsed from 0.782 to 0.630 ($-0.152$), reflecting its severe economic crisis, and Argentina&amp;rsquo;s income also fell from 0.827 to 0.809. These divergent trajectories across countries are precisely why a fixed yardstick (pooled PCA) is essential for temporal comparison.&lt;/p>
&lt;pre>&lt;code class="language-python">corr_matrix = df[INDICATORS].corr().round(4)
print(f&amp;quot;Pooled correlation matrix:&amp;quot;)
print(corr_matrix.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled correlation matrix:
education health income
education 1.0000 0.4392 0.6808
health 0.4392 1.0000 0.6303
income 0.6808 0.6303 1.0000
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_correlation_heatmap.png" alt="Pooled correlation heatmap of the three HDI sub-indices.">&lt;/p>
&lt;p>The correlations are moderate to strong but far from the near-perfect values we saw in the &lt;a href="https://carlos-mendez.org/post/python_pca/">previous tutorial&amp;rsquo;s&lt;/a> simulated data ($r &amp;gt; 0.93$). Education and Income show the strongest correlation (0.68), followed by Health and Income (0.63), with Education and Health the weakest (0.44). These lower correlations mean PCA will capture less variance in PC1 &amp;mdash; the three indicators carry more independent information than in the simulated case, reflecting the genuine complexity of human development. The weak Education-Health link (0.44) suggests that a region can have high literacy but mediocre life expectancy (or vice versa) &amp;mdash; education and health are partly independent dimensions of development.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
fig.patch.set_linewidth(0)
p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;]
p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;]
ax.scatter(p1[&amp;quot;education&amp;quot;], p1[&amp;quot;income&amp;quot;], color=STEEL_BLUE,
edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7, label=&amp;quot;2013&amp;quot;)
ax.scatter(p2[&amp;quot;education&amp;quot;], p2[&amp;quot;income&amp;quot;], color=WARM_ORANGE,
edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7, label=&amp;quot;2019&amp;quot;)
# Centroid arrows
c1_edu, c1_inc = p1[&amp;quot;education&amp;quot;].mean(), p1[&amp;quot;income&amp;quot;].mean()
c2_edu, c2_inc = p2[&amp;quot;education&amp;quot;].mean(), p2[&amp;quot;income&amp;quot;].mean()
ax.annotate(&amp;quot;&amp;quot;, xy=(c2_edu, c2_inc), xytext=(c1_edu, c1_inc),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=TEAL, lw=2.5))
ax.set_xlabel(&amp;quot;Education Index&amp;quot;)
ax.set_ylabel(&amp;quot;Income Index&amp;quot;)
ax.set_title(&amp;quot;Education vs. Income by period (153 South American regions)&amp;quot;)
ax.legend(loc=&amp;quot;lower right&amp;quot;)
plt.savefig(&amp;quot;pca2_period_shift_scatter.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_period_shift_scatter.png" alt="Scatter plot of Education vs Income colored by period, showing education rising but income declining.">&lt;/p>
&lt;p>The scatter plot reveals a striking pattern: between 2013 (steel blue) and 2019 (orange), the cloud shifted &lt;strong>right&lt;/strong> (education improved) but &lt;strong>downward&lt;/strong> (income declined). The teal arrow connecting the two period centroids captures this asymmetric shift. This is a real-world complication that simple simulated data would not produce &amp;mdash; per-period PCA will handle this mixed signal differently from pooled PCA.&lt;/p>
&lt;h2 id="6-the-problem-per-period-pca">6. The problem: per-period PCA&lt;/h2>
&lt;p>To understand why pooled PCA is necessary, let us first see what goes wrong with the naive approach. We run the full six-step pipeline separately for each period &amp;mdash; standardizing with period-specific means, computing period-specific eigenvectors, and normalizing with period-specific bounds.&lt;/p>
&lt;p>&lt;strong>Per-period standardization&lt;/strong> uses different baselines for each period:&lt;/p>
&lt;p>$$Z_{ij}^{(t)} = \frac{X_{ij}^{(t)} - \bar{X}_j^{(t)}}{\sigma_j^{(t)}}$$&lt;/p>
&lt;p>In words, this says: standardize using only the data from period $t$. The mean and standard deviation change between periods, so the yardstick shifts.&lt;/p>
&lt;p>&lt;strong>Per-period normalization&lt;/strong> uses different bounds for each period:&lt;/p>
&lt;p>$$HDI_i^{(t)} = \frac{PC1_i^{(t)} - PC1_{min}^{(t)}}{PC1_{max}^{(t)} - PC1_{min}^{(t)}}$$&lt;/p>
&lt;p>In words, this says: the worst region in each period gets 0 and the best gets 1, but the scale resets every period.&lt;/p>
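&lt;p>A minimal sketch with toy PC1 scores (the numbers are illustrative, not from the dataset) shows how this per-period reset can turn an absolute improvement into an apparent decline:&lt;/p>

```python
import numpy as np

def minmax(x):
    """Min-Max normalize to [0, 1] using the bounds of x itself."""
    return (x - x.min()) / (x.max() - x.min())

# Toy PC1 scores: every region improves, but the leader improves fastest
pc1_2013 = np.array([1.0, 2.0, 3.0])
pc1_2019 = np.array([1.2, 2.3, 4.0])

# Per-period bounds: the scale resets each period
per_period_2013 = minmax(pc1_2013)
per_period_2019 = minmax(pc1_2019)
print(per_period_2019 - per_period_2013)  # middle region improved, yet its score falls

# Pooled bounds: one scale across both periods
pooled = minmax(np.concatenate([pc1_2013, pc1_2019]))
print(pooled[3:] - pooled[:3])            # all three regions gain
```

&lt;p>The middle region improves from 2.0 to 2.3, yet its per-period score drops because the leader improved faster and stretched the period-specific scale. Pooled bounds avoid this distortion.&lt;/p>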
&lt;pre>&lt;code class="language-python">def run_single_period_pca(df_period, indicators):
&amp;quot;&amp;quot;&amp;quot;Run the full PCA pipeline on a single-period DataFrame.&amp;quot;&amp;quot;&amp;quot;
X = df_period[indicators].values
means = X.mean(axis=0)
stds = X.std(axis=0, ddof=0)
Z = (X - means) / stds
cov = np.cov(Z.T, ddof=0)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
if eigenvectors[0, 0] &amp;lt; 0:
eigenvectors[:, 0] *= -1
pc1 = Z @ eigenvectors[:, 0]
hdi = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
return {&amp;quot;pc1&amp;quot;: pc1, &amp;quot;hdi&amp;quot;: hdi, &amp;quot;weights&amp;quot;: eigenvectors[:, 0],
&amp;quot;eigenvalues&amp;quot;: eigenvalues,
&amp;quot;var_explained&amp;quot;: eigenvalues / eigenvalues.sum() * 100,
&amp;quot;means&amp;quot;: means, &amp;quot;stds&amp;quot;: stds}
pp_p1 = run_single_period_pca(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;], INDICATORS)
pp_p2 = run_single_period_pca(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;], INDICATORS)
print(f&amp;quot;Per-period eigenvector weights (PC1):&amp;quot;)
print(f&amp;quot; 2013: [{pp_p1['weights'][0]:.4f}, {pp_p1['weights'][1]:.4f}, {pp_p1['weights'][2]:.4f}]&amp;quot;)
print(f&amp;quot; 2019: [{pp_p2['weights'][0]:.4f}, {pp_p2['weights'][1]:.4f}, {pp_p2['weights'][2]:.4f}]&amp;quot;)
print(f&amp;quot; Shift: [{pp_p2['weights'][0] - pp_p1['weights'][0]:+.4f}, &amp;quot;
f&amp;quot;{pp_p2['weights'][1] - pp_p1['weights'][1]:+.4f}, &amp;quot;
f&amp;quot;{pp_p2['weights'][2] - pp_p1['weights'][2]:+.4f}]&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Per-period eigenvector weights (PC1):
2013: [0.5832, 0.5100, 0.6322]
2019: [0.5405, 0.5657, 0.6228]
Shift: [-0.0427, +0.0556, -0.0095]
&lt;/code>&lt;/pre>
&lt;p>The eigenvector weights shift substantially between periods. Education&amp;rsquo;s weight drops from 0.583 to 0.541 ($-0.043$), while Health&amp;rsquo;s weight jumps from 0.510 to 0.566 ($+0.056$). This means the index formula itself changes &amp;mdash; a region&amp;rsquo;s 2013 HDI and 2019 HDI are computed with different recipes, making temporal comparison unreliable. Under per-period PCA, &lt;strong>43 out of 153 regions appear to decline&lt;/strong> in HDI despite the overall improvement in education and health. The per-period approach erases the mixed global signal by re-centering every period to a mean of zero.&lt;/p>
&lt;p>&lt;img src="pca2_perperiod_weights.png" alt="Grouped bar chart showing per-period eigenvector weights shifting between 2013 and 2019.">&lt;/p>
&lt;p>To visualize how individual regions shift in rank under per-period PCA, we store each period&amp;rsquo;s HDI scores and compute ranks.&lt;/p>
&lt;pre>&lt;code class="language-python">df_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].copy()
df_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].copy()
df_p1[&amp;quot;pp_hdi&amp;quot;] = pp_p1[&amp;quot;hdi&amp;quot;]
df_p2[&amp;quot;pp_hdi&amp;quot;] = pp_p2[&amp;quot;hdi&amp;quot;]
df_p1[&amp;quot;pp_rank&amp;quot;] = df_p1[&amp;quot;pp_hdi&amp;quot;].rank(ascending=False).astype(int)
df_p2[&amp;quot;pp_rank&amp;quot;] = df_p2[&amp;quot;pp_hdi&amp;quot;].rank(ascending=False).astype(int)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 10))
fig.patch.set_linewidth(0)
rank_change = df_p2[&amp;quot;pp_rank&amp;quot;].values - df_p1[&amp;quot;pp_rank&amp;quot;].values
abs_change = np.abs(rank_change)
top_changers_idx = np.argsort(abs_change)[-10:]
for i in top_changers_idx:
r1 = df_p1.iloc[i][&amp;quot;pp_rank&amp;quot;]
r2 = df_p2.iloc[i][&amp;quot;pp_rank&amp;quot;]
label = df_p1.iloc[i][&amp;quot;region_country&amp;quot;]
color = TEAL if r2 &amp;lt; r1 else WARM_ORANGE
ax.plot([0, 1], [r1, r2], color=color, linewidth=2, alpha=0.8)
ax.text(-0.05, r1, f&amp;quot;{label} (#{int(r1)})&amp;quot;, ha=&amp;quot;right&amp;quot;, va=&amp;quot;center&amp;quot;,
fontsize=7, color=LIGHT_TEXT)
ax.text(1.05, r2, f&amp;quot;{label} (#{int(r2)})&amp;quot;, ha=&amp;quot;left&amp;quot;, va=&amp;quot;center&amp;quot;,
fontsize=7, color=LIGHT_TEXT)
ax.set_xlim(-0.6, 1.6)
ax.set_ylim(160, -5)
ax.set_xticks([0, 1])
ax.set_xticklabels([&amp;quot;2013 Rank&amp;quot;, &amp;quot;2019 Rank&amp;quot;], fontsize=13)
ax.set_ylabel(&amp;quot;Rank (1 = best)&amp;quot;)
ax.set_title(&amp;quot;Per-period PCA: rank shifts for 10 regions\n(teal = improved, orange = declined)&amp;quot;)
plt.savefig(&amp;quot;pca2_perperiod_rank_shift.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_perperiod_rank_shift.png" alt="Slopegraph showing 10 regions with the largest rank shifts under per-period PCA.">&lt;/p>
&lt;p>&lt;strong>The running example:&lt;/strong> City of Buenos Aires &amp;mdash; Argentina&amp;rsquo;s capital and one of the most developed regions in South America &amp;mdash; has a per-period HDI of 1.000 in 2013 (ranked #1) and 0.960 in 2019 &amp;mdash; a &lt;strong>decline of 0.04&lt;/strong>. But we know Buenos Aires improved in education (0.926 $\to$ 0.946) and health (0.858 $\to$ 0.872), with only a modest income decline (0.850 $\to$ 0.832). Is Buenos Aires really declining, or is the shifting yardstick hiding a more nuanced story?&lt;/p>
&lt;h2 id="7-pooled-step-1-stacking-the-data">7. Pooled Step 1: Stacking the data&lt;/h2>
&lt;p>The first step of pooled PCA is to stack all periods into a single dataset. From PCA&amp;rsquo;s perspective, we have 306 observations (153 regions $\times$ 2 periods), not two separate groups. The &lt;code>period&lt;/code> column is metadata that we carry through for analysis, but it does not enter the PCA computation.&lt;/p>
&lt;pre>&lt;code class="language-python">print(f&amp;quot;Stacked dataset: {df.shape[0]} rows, {df.shape[1]} columns&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Stacked dataset: 306 rows, 9 columns
&lt;/code>&lt;/pre>
&lt;p>The stacked dataset has 306 rows. PCA will treat each row equally regardless of which period it belongs to, producing a single set of standardization parameters and a single set of eigenvector weights.&lt;/p>
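&lt;p>In this tutorial the data already arrive stacked, but if each period lived in its own table, the stacking step would be a simple concatenation. A sketch with hypothetical two-region tables:&lt;/p>

```python
import pandas as pd

# Hypothetical per-period tables with identical columns
df_2013 = pd.DataFrame({"region": ["A", "B"], "education": [0.60, 0.70]})
df_2019 = pd.DataFrame({"region": ["A", "B"], "education": [0.65, 0.72]})

# Tag each table with its period, then stack into one long dataset;
# the period column is carried as metadata and never enters the PCA
stacked = pd.concat(
    [df_2013.assign(period="Y2013"), df_2019.assign(period="Y2019")],
    ignore_index=True,
)
print(stacked.shape)  # (4, 3): 2 regions x 2 periods
```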
&lt;h2 id="8-pooled-step-2-pooled-standardization">8. Pooled Step 2: Pooled standardization&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We compute the mean and standard deviation from the entire stacked dataset (all 306 rows) and use these pooled parameters to standardize every observation:&lt;/p>
&lt;p>$$Z_{ij,t}^{pooled} = \frac{X_{ij,t} - \bar{X}_j^{pooled}}{\sigma_j^{pooled}}$$&lt;/p>
&lt;p>In words, this says: for region $i$, indicator $j$, at time $t$, subtract the pooled mean $\bar{X}_j^{pooled}$ (computed across all regions and all periods) and divide by the pooled standard deviation $\sigma_j^{pooled}$.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> City of Buenos Aires has education = 0.926 in 2013 and 0.946 in 2019. The pooled mean for education is 0.679 and the pooled standard deviation is 0.081. Under per-period standardization, 2013 uses mean = 0.667 and 2019 uses mean = 0.690 &amp;mdash; a shifting baseline. Under pooled standardization, both periods use the same mean = 0.679. The increase from 0.926 to 0.946 maps to a genuine increase in pooled Z-score.&lt;/p>
&lt;p>&lt;strong>The intuition:&lt;/strong> Imagine measuring children&amp;rsquo;s heights at age 5 and age 10. Per-period standardization compares each child only to their same-age peers: a tall 5-year-old gets a high Z-score, and a tall 10-year-old gets a high Z-score, but you cannot tell how much each child grew because the reference group changed. Pooled standardization measures everyone against the same ruler &amp;mdash; the combined height distribution &amp;mdash; so the Z-score increase from age 5 to age 10 directly reflects actual growth.&lt;/p>
&lt;p>&lt;strong>The necessity:&lt;/strong> Without pooled standardization, the income decline (from 0.736 to 0.715 on average) would be hidden. Per-period Z-scores re-center income to zero each period, erasing the decline. Pooled Z-scores preserve it: the 2019 income Z-scores average slightly below zero, correctly reflecting the real economic setback.&lt;/p>
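&lt;p>This contrast can be verified with a toy example (illustrative numbers, not from the dataset): when every unit improves by the same amount, per-period Z-scores are completely unchanged, while pooled Z-scores shift upward:&lt;/p>

```python
import numpy as np

# Toy data: one indicator, three units, two periods.
# Every unit improves by exactly +0.10 between periods.
x_t1 = np.array([0.50, 0.60, 0.70])
x_t2 = x_t1 + 0.10

# Per-period Z-scores: each period re-centered on its own mean,
# so the uniform improvement vanishes entirely
z_t1 = (x_t1 - x_t1.mean()) / x_t1.std()
z_t2 = (x_t2 - x_t2.mean()) / x_t2.std()
print(z_t2.mean() - z_t1.mean())  # 0.0

# Pooled Z-scores: one mean and sd from the stacked data,
# so the improvement survives as a positive shift
pooled = np.concatenate([x_t1, x_t2])
zp = (pooled - pooled.mean()) / pooled.std()
print(zp[3:].mean() - zp[:3].mean())  # > 0
```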
&lt;pre>&lt;code class="language-python">X_all = df[INDICATORS].values # 306 rows
pooled_means = X_all.mean(axis=0)
pooled_stds = X_all.std(axis=0, ddof=0)
Z_pooled = (X_all - pooled_means) / pooled_stds
print(f&amp;quot;Pooled standardization parameters:&amp;quot;)
print(f&amp;quot; Means: [{pooled_means[0]:.4f}, {pooled_means[1]:.4f}, {pooled_means[2]:.4f}]&amp;quot;)
print(f&amp;quot; Stds: [{pooled_stds[0]:.4f}, {pooled_stds[1]:.4f}, {pooled_stds[2]:.4f}]&amp;quot;)
scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(X_all)
max_diff = np.max(np.abs(Z_sklearn - Z_pooled))
print(f&amp;quot;\nMax difference from sklearn StandardScaler: {max_diff:.2e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled standardization parameters:
Means: [0.6786, 0.8437, 0.7254]
Stds: [0.0814, 0.0472, 0.0749]
Max difference from sklearn StandardScaler: 0.00e+00
&lt;/code>&lt;/pre>
&lt;p>The pooled means sit between the period-specific means (e.g., education: 0.667 in 2013, 0.690 in 2019, 0.679 pooled). The standard deviations are similar across periods because the within-period spread is much larger than the between-period level shift. The zero-difference check against &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">StandardScaler()&lt;/a> confirms our manual computation is correct.&lt;/p>
&lt;h2 id="9-pooled-step-3-covariance-matrix">9. Pooled Step 3: Covariance matrix&lt;/h2>
&lt;p>We compute the $3 \times 3$ covariance matrix from the pooled standardized data (all 306 rows):&lt;/p>
&lt;p>$$\Sigma^{pooled} = \frac{1}{nT} (Z^{pooled})^{T} Z^{pooled}$$&lt;/p>
&lt;p>In words, this says: the pooled covariance matrix measures how the three standardized indicators co-move across all region-period observations.&lt;/p>
&lt;pre>&lt;code class="language-python">cov_pooled = np.cov(Z_pooled.T, ddof=0)
print(f&amp;quot;Pooled covariance matrix (3x3):&amp;quot;)
for i in range(3):
row = &amp;quot; [&amp;quot; + &amp;quot; &amp;quot;.join(f&amp;quot;{cov_pooled[i, j]:.4f}&amp;quot; for j in range(3)) + &amp;quot;]&amp;quot;
print(row)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled covariance matrix (3x3):
[1.0000 0.4392 0.6808]
[0.4392 1.0000 0.6303]
[0.6808 0.6303 1.0000]
&lt;/code>&lt;/pre>
&lt;p>The off-diagonals range from 0.44 (Education-Health) to 0.68 (Education-Income). These are substantially lower than the 0.93&amp;ndash;0.95 values in the &lt;a href="https://carlos-mendez.org/post/python_pca/#8-step-3-the-covariance-matrix----mapping-the-overlap">simulated data from the previous tutorial&lt;/a>, reflecting the genuine complexity of human development. Education and Health are only moderately correlated because they measure different dimensions &amp;mdash; a region can have high literacy but mediocre life expectancy (or vice versa). This means PC1 will capture less total variance, and the eigenvector weights will be more unequal.&lt;/p>
&lt;h2 id="10-pooled-step-4-eigen-decomposition">10. Pooled Step 4: Eigen-decomposition&lt;/h2>
&lt;p>We decompose the pooled covariance matrix to find the direction of maximum spread:&lt;/p>
&lt;p>$$\Sigma^{pooled} \mathbf{v}_k = \lambda_k \mathbf{v}_k$$&lt;/p>
&lt;p>The PC1 score for each region-period is:&lt;/p>
&lt;p>$$PC1_{i,t} = w_1 \, Z_{i,edu,t}^{pooled} + w_2 \, Z_{i,health,t}^{pooled} + w_3 \, Z_{i,income,t}^{pooled}$$&lt;/p>
&lt;p>In words, this says: each region&amp;rsquo;s PC1 score is a weighted sum of its three pooled-standardized indicators, using the single set of pooled weights $[w_1, w_2, w_3]$.&lt;/p>
&lt;pre>&lt;code class="language-python">eigenvalues, eigenvectors = np.linalg.eigh(cov_pooled)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
if eigenvectors[0, 0] &amp;lt; 0:
eigenvectors[:, 0] *= -1
var_explained = eigenvalues / eigenvalues.sum() * 100
print(f&amp;quot;Pooled eigenvalues: [{eigenvalues[0]:.4f}, {eigenvalues[1]:.4f}, {eigenvalues[2]:.4f}]&amp;quot;)
print(f&amp;quot;\nPooled eigenvector (PC1): [{eigenvectors[0, 0]:.4f}, {eigenvectors[1, 0]:.4f}, {eigenvectors[2, 0]:.4f}]&amp;quot;)
print(f&amp;quot;\nVariance explained:&amp;quot;)
print(f&amp;quot; PC1: {var_explained[0]:.2f}%&amp;quot;)
print(f&amp;quot; PC2: {var_explained[1]:.2f}%&amp;quot;)
print(f&amp;quot; PC3: {var_explained[2]:.2f}%&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled eigenvalues: [2.1726, 0.5631, 0.2643]
Pooled eigenvector (PC1): [0.5642, 0.5448, 0.6204]
Variance explained:
PC1: 72.42%
PC2: 18.77%
PC3: 8.81%
&lt;/code>&lt;/pre>
&lt;p>PC1 captures 72.42% of all variance &amp;mdash; substantially less than the 96% in the simulated tutorial, but still a strong majority. The eigenvector weights are $[0.5642, 0.5448, 0.6204]$, revealing that &lt;strong>Income carries the highest weight&lt;/strong> (0.620), followed by Education (0.564), with Health contributing least (0.545). This unequal weighting reflects the real-world correlation structure: Income is more strongly correlated with the other two indicators, so it aligns most closely with the common development factor that PC1 extracts. Unlike the two-variable case from the &lt;a href="https://carlos-mendez.org/post/python_pca/#9-step-4-eigen-decomposition----finding-the-optimal-direction">previous tutorial&lt;/a> where equal weights were a mathematical certainty, three variables allow PCA to discover data-driven weights. Crucially, these weights are &lt;strong>fixed&lt;/strong> &amp;mdash; the same weights apply to 2013 and 2019 because they were computed from the pooled data.&lt;/p>
&lt;p>&lt;img src="pca2_pooled_variance_explained.png" alt="Bar chart showing PC1 captures 72.4%, PC2 captures 18.8%, and PC3 captures 8.8%.">&lt;/p>
&lt;p>The variance explained chart shows PC1 dominating but with meaningful contributions from PC2 (18.8%) and PC3 (8.8%). The fact that PC2 and PC3 are not negligible means some development dimensions are not captured by a single index. In this data, PC2 separates regions with high education but comparatively low health from those with the opposite pattern. For this tutorial, we focus on PC1 as the composite HDI, but researchers working with this data should consider whether retaining PC2 adds meaningful insight.&lt;/p>
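&lt;p>Which contrast does PC2 encode here? Decomposing the pooled covariance matrix printed above shows that PC2 loads Education and Health with opposite signs, while Income stays close to zero:&lt;/p>

```python
import numpy as np

# Pooled covariance matrix reported above (education, health, income)
cov = np.array([
    [1.0000, 0.4392, 0.6808],
    [0.4392, 1.0000, 0.6303],
    [0.6808, 0.6303, 1.0000],
])

eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort descending
pc2 = eigenvectors[:, order[1]]            # second principal direction
if pc2[0] < 0:                             # fix the arbitrary sign convention
    pc2 = -pc2
print(np.round(pc2, 3))  # education and health get opposite signs
```

&lt;p>A region scoring high on PC2 therefore has strong education relative to its health outcomes; Income barely enters this second dimension.&lt;/p>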
&lt;h2 id="11-pooled-step-5-scoring">11. Pooled Step 5: Scoring&lt;/h2>
&lt;p>We project all 306 rows onto PC1 using the fixed pooled weights.&lt;/p>
&lt;pre>&lt;code class="language-python">w = eigenvectors[:, 0]
df[&amp;quot;pc1&amp;quot;] = Z_pooled @ w
pc1_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;][&amp;quot;pc1&amp;quot;]
pc1_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;][&amp;quot;pc1&amp;quot;]
print(f&amp;quot;Pooled PC1 score statistics:&amp;quot;)
print(f&amp;quot; 2013 mean: {pc1_p1.mean():.4f}&amp;quot;)
print(f&amp;quot; 2019 mean: {pc1_p2.mean():.4f}&amp;quot;)
print(f&amp;quot; Shift: {pc1_p2.mean() - pc1_p1.mean():+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PC1 score statistics:
2013 mean: -0.0720
2019 mean: 0.0720
Shift: +0.1439
&lt;/code>&lt;/pre>
&lt;p>The 2013 mean PC1 score is $-0.072$ (below the grand mean) and the 2019 mean is $+0.072$ (above the grand mean). The shift of $+0.144$ represents pooled PCA&amp;rsquo;s measure of net development progress across South America. This is a modest positive shift, reflecting the trade-off between education/health gains and income decline. Under per-period PCA, this shift would be exactly zero by construction &amp;mdash; the net progress would be invisible.&lt;/p>
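&lt;p>Why exactly zero? Each column of per-period Z-scores has mean zero by construction, and PC1 is a fixed linear combination of those columns, so its within-period mean must also be zero. A quick numerical check with arbitrary data:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Any single period of data, standardized within that period
X = rng.normal(size=(50, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# A per-period PC1 is a weighted sum of zero-mean columns,
# so it averages to zero no matter what the weights are
w = np.array([0.50, 0.60, 0.62])  # arbitrary weight vector
pc1 = Z @ w
print(abs(pc1.mean()))  # ~0 up to floating-point error
```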
&lt;h2 id="12-pooled-step-6-normalization">12. Pooled Step 6: Normalization&lt;/h2>
&lt;p>We apply Min-Max normalization using the pooled bounds &amp;mdash; the minimum and maximum PC1 scores across all 306 observations:&lt;/p>
&lt;p>$$HDI_{i,t} = \frac{PC1_{i,t} - PC1_{min}^{pooled}}{PC1_{max}^{pooled} - PC1_{min}^{pooled}}$$&lt;/p>
&lt;pre>&lt;code class="language-python">pc1_min = df[&amp;quot;pc1&amp;quot;].min()
pc1_max = df[&amp;quot;pc1&amp;quot;].max()
df[&amp;quot;hdi&amp;quot;] = (df[&amp;quot;pc1&amp;quot;] - pc1_min) / (pc1_max - pc1_min)
print(f&amp;quot;\nPooled HDI — 2019 top 5:&amp;quot;)
print(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].nlargest(5, &amp;quot;hdi&amp;quot;)[
[&amp;quot;region_country&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;, &amp;quot;hdi&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nPooled HDI — 2013 bottom 5:&amp;quot;)
print(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].nsmallest(5, &amp;quot;hdi&amp;quot;)[
[&amp;quot;region_country&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;, &amp;quot;hdi&amp;quot;]
].to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled HDI — 2019 top 5:
region_country education health income hdi
Region Metropolitana (CHL) 0.877 0.929 0.844 1.000000
Tarapaca (incl Arica and (CHL) 0.888 0.937 0.823 0.999348
City of Buenos Aires (ARG) 0.946 0.872 0.832 0.965232
Antofagasta (CHL) 0.894 0.896 0.838 0.961010
Valparaiso (former Aconca (CHL) 0.842 0.931 0.831 0.959202
Pooled HDI — 2013 bottom 5:
region_country education health income hdi
Potaro-Siparuni (GUY) 0.522 0.735 0.443 0.000000
Barima-Waini (GUY) 0.483 0.745 0.534 0.074601
Potosi (BOL) 0.564 0.666 0.578 0.076345
Upper Takutu-Upper Essequ (GUY) 0.567 0.751 0.470 0.089799
Brokopondo and Sipaliwini (SUR) 0.382 0.774 0.602 0.099207
&lt;/code>&lt;/pre>
&lt;p>The top 5 in 2019 are dominated by Chilean regions (Region Metropolitana, Tarapaca, Antofagasta, Valparaiso) plus Buenos Aires. Chile&amp;rsquo;s strong performance across all three indicators &amp;mdash; particularly Health (0.90&amp;ndash;0.94) &amp;mdash; places its regions at the top. The bottom 5 in 2013 are remote regions of Guyana (Potaro-Siparuni, Barima-Waini), Bolivia (Potosi), and Suriname (Brokopondo), characterized by low education and income despite moderate health outcomes. The Potaro-Siparuni region of Guyana anchors the bottom at HDI = 0.00 (education 0.522, health 0.735, income 0.443).&lt;/p>
&lt;p>&lt;strong>City of Buenos Aires&lt;/strong> has pooled HDI of 0.946 in 2013 and 0.965 in 2019 &amp;mdash; an improvement of $+0.019$. Under per-period PCA, the same region showed a decline of $-0.040$. Pooled PCA correctly reveals that Buenos Aires improved modestly while being overtaken by Chilean regions that improved faster.&lt;/p>
&lt;p>&lt;img src="pca2_pooled_hdi_bars.png" alt="Paired horizontal bar chart showing top and bottom 15 regions with 2013 and 2019 HDI.">&lt;/p>
&lt;p>The paired bar chart shows the pooled HDI for the top and bottom 15 regions. In the top group, orange (2019) bars consistently extend further than steel blue (2013) bars, reflecting genuine improvement. In the bottom group, the pattern is more mixed &amp;mdash; some of the least developed regions in 2013 made substantial gains by 2019, while others barely moved. The dashed separator line divides the bottom 15 (below) from the top 15 (above).&lt;/p>
&lt;h2 id="13-the-contrast-pooled-vs-per-period-pca">13. The contrast: pooled vs per-period PCA&lt;/h2>
&lt;p>We now have two sets of HDI values for every region-period: one from per-period PCA and one from pooled PCA. To compare them, we build a wide table with each region&amp;rsquo;s pooled and per-period HDI change side by side.&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy.stats import spearmanr
# Separate pooled HDI by period
df_pooled_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].copy()
df_pooled_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].copy()
# Build comparison table: pooled vs per-period changes
compare = df_pooled_p1[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;region_country&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(
columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_p1&amp;quot;}
).merge(
df_pooled_p2[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_p2&amp;quot;}),
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;]
)
compare[&amp;quot;hdi_change&amp;quot;] = compare[&amp;quot;hdi_p2&amp;quot;] - compare[&amp;quot;hdi_p1&amp;quot;]
compare[&amp;quot;pp_change&amp;quot;] = df_p2[&amp;quot;pp_hdi&amp;quot;].values - df_p1[&amp;quot;pp_hdi&amp;quot;].values
compare[&amp;quot;method_diff&amp;quot;] = compare[&amp;quot;hdi_change&amp;quot;] - compare[&amp;quot;pp_change&amp;quot;]
# Direction disagreement
disagree = ((compare[&amp;quot;hdi_change&amp;quot;] &amp;gt; 0) &amp;amp; (compare[&amp;quot;pp_change&amp;quot;] &amp;lt; 0)) | \
((compare[&amp;quot;hdi_change&amp;quot;] &amp;lt; 0) &amp;amp; (compare[&amp;quot;pp_change&amp;quot;] &amp;gt; 0))
# Spearman rank correlation
rho_change, _ = spearmanr(compare[&amp;quot;hdi_change&amp;quot;], compare[&amp;quot;pp_change&amp;quot;])
# Running example: City of Buenos Aires
ba = compare[compare[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)].iloc[0]
ba_pp_p1 = df_p1[df_p1[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)][&amp;quot;pp_hdi&amp;quot;].values[0]
ba_pp_p2 = df_p2[df_p2[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)][&amp;quot;pp_hdi&amp;quot;].values[0]
print(f&amp;quot;City of Buenos Aires:&amp;quot;)
print(f&amp;quot; Per-period: 2013={ba_pp_p1:.4f}, 2019={ba_pp_p2:.4f}, Change={ba_pp_p2 - ba_pp_p1:+.4f}&amp;quot;)
print(f&amp;quot; Pooled: 2013={ba['hdi_p1']:.4f}, 2019={ba['hdi_p2']:.4f}, Change={ba['hdi_change']:+.4f}&amp;quot;)
print(f&amp;quot;\nRegions where methods disagree on direction: {disagree.sum()} / {len(compare)}&amp;quot;)
print(f&amp;quot;\nSpearman rank correlation (HDI change): rho = {rho_change:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">City of Buenos Aires:
Per-period: 2013=1.0000, 2019=0.9604, Change=-0.0396
Pooled: 2013=0.9464, 2019=0.9652, Change=+0.0189
Regions where methods disagree on direction: 16 / 153
Spearman rank correlation (HDI change): rho = 0.9818
&lt;/code>&lt;/pre>
&lt;p>For City of Buenos Aires, per-period PCA shows a decline of $-0.04$ while pooled PCA shows an improvement of $+0.02$. The two methods disagree on the direction of change for &lt;strong>16 out of 153 regions&lt;/strong> &amp;mdash; about 10% of the sample. The Spearman rank correlation for improvement rankings is 0.982, meaning the two methods largely agree on who improved most, but the direction disagreements for specific regions could lead to different policy conclusions.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(7, 7))
fig.patch.set_linewidth(0)
ax.scatter(compare[&amp;quot;hdi_change&amp;quot;], compare[&amp;quot;pp_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7)
lim_min = min(compare[&amp;quot;hdi_change&amp;quot;].min(), compare[&amp;quot;pp_change&amp;quot;].min()) - 0.02
lim_max = max(compare[&amp;quot;hdi_change&amp;quot;].max(), compare[&amp;quot;pp_change&amp;quot;].max()) + 0.02
ax.plot([lim_min, lim_max], [lim_min, lim_max], color=WARM_ORANGE,
linewidth=2, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;Perfect agreement&amp;quot;, zorder=2)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
# Label extreme outliers
top_outliers = compare.nlargest(3, &amp;quot;method_diff&amp;quot;)
bot_outliers = compare.nsmallest(3, &amp;quot;method_diff&amp;quot;)
for _, row in pd.concat([top_outliers, bot_outliers]).iterrows():
ax.annotate(row[&amp;quot;region_country&amp;quot;], (row[&amp;quot;hdi_change&amp;quot;], row[&amp;quot;pp_change&amp;quot;]),
fontsize=6, color=TEAL, xytext=(5, 5),
textcoords=&amp;quot;offset points&amp;quot;)
ax.set_xlabel(&amp;quot;Pooled HDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period HDI change (2019 - 2013)&amp;quot;)
ax.set_title(&amp;quot;Pooled vs. per-period PCA: HDI change comparison&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca2_pooled_vs_perperiod_change.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_pooled_vs_perperiod_change.png" alt="Scatter plot of pooled vs per-period HDI change with 45-degree agreement line.">&lt;/p>
&lt;p>The scatter plot places pooled HDI change on the horizontal axis and per-period HDI change on the vertical axis. If both methods agreed perfectly, all points would fall on the dashed 45-degree line. The cloud sits systematically below the line for most regions &amp;mdash; per-period PCA tends to understate improvement (or overstate decline) relative to pooled PCA, because per-period standardization erases the net positive shift in education and health.&lt;/p>
&lt;pre>&lt;code class="language-python">compare[&amp;quot;pooled_change_rank&amp;quot;] = compare[&amp;quot;hdi_change&amp;quot;].rank(ascending=False).astype(int)
compare[&amp;quot;pp_change_rank&amp;quot;] = compare[&amp;quot;pp_change&amp;quot;].rank(ascending=False).astype(int)
compare[&amp;quot;change_rank_diff&amp;quot;] = np.abs(compare[&amp;quot;pooled_change_rank&amp;quot;] - compare[&amp;quot;pp_change_rank&amp;quot;])
fig, ax = plt.subplots(figsize=(8, 10))
fig.patch.set_linewidth(0)
top_change_rank_diff = compare.nlargest(10, &amp;quot;change_rank_diff&amp;quot;)
for _, row in top_change_rank_diff.iterrows():
r_pooled = row[&amp;quot;pooled_change_rank&amp;quot;]
r_pp = row[&amp;quot;pp_change_rank&amp;quot;]
label = row[&amp;quot;region_country&amp;quot;]
color = TEAL if r_pooled &amp;lt; r_pp else WARM_ORANGE
ax.plot([0, 1], [r_pooled, r_pp], color=color, linewidth=2, alpha=0.8)
ax.text(-0.05, r_pooled, f&amp;quot;{label} (#{int(r_pooled)})&amp;quot;, ha=&amp;quot;right&amp;quot;,
va=&amp;quot;center&amp;quot;, fontsize=7, color=LIGHT_TEXT)
ax.text(1.05, r_pp, f&amp;quot;{label} (#{int(r_pp)})&amp;quot;, ha=&amp;quot;left&amp;quot;,
va=&amp;quot;center&amp;quot;, fontsize=7, color=LIGHT_TEXT)
ax.set_xlim(-0.6, 1.6)
ax.set_ylim(160, -5)
ax.set_xticks([0, 1])
ax.set_xticklabels([&amp;quot;Pooled Improvement Rank&amp;quot;, &amp;quot;Per-period Improvement Rank&amp;quot;], fontsize=11)
ax.set_ylabel(&amp;quot;Rank (1 = most improved)&amp;quot;)
ax.set_title(&amp;quot;Who improved the most? Pooled vs. per-period rankings\n(teal = ranked higher by pooled, orange = ranked lower)&amp;quot;)
plt.savefig(&amp;quot;pca2_rank_comparison_bump.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_rank_comparison_bump.png" alt="Bump chart comparing improvement rankings under pooled vs per-period PCA.">&lt;/p>
&lt;p>The bump chart compares who improved the most under each method. The crossing lines show where the two methods re-order regions&amp;rsquo; improvement rankings. Regions that pooled PCA ranks as top improvers may be ranked lower by per-period PCA if their gains were partly masked by the shifting baseline.&lt;/p>
&lt;h2 id="14-validation-against-the-official-shdi">14. Validation against the official SHDI&lt;/h2>
&lt;p>The Global Data Lab computes an official Subnational HDI (SHDI) using a geometric mean methodology similar to the UNDP&amp;rsquo;s approach. We can validate our PCA-based index by comparing both the pooled and per-period approaches against this official benchmark. If pooled PCA better tracks the established methodology, it provides further evidence that the pooled approach is superior for temporal analysis.&lt;/p>
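&lt;p>For reference, an HDI-style composite of this kind takes the geometric mean of the three dimension indices. A sketch with illustrative values (not taken from the SHDI data):&lt;/p>

```python
# Geometric-mean composite of three dimension indices, HDI-style.
# The input values are illustrative, not from the dataset.
education, health, income = 0.926, 0.858, 0.850
hdi_geometric = (education * health * income) ** (1 / 3)
print(round(hdi_geometric, 4))
```

&lt;p>Unlike an arithmetic mean, the geometric mean penalizes imbalance across dimensions, so it sits slightly below the arithmetic mean of the same three indices.&lt;/p>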
&lt;pre>&lt;code class="language-python"># Add per-period HDI to main DataFrame for comparison
df[&amp;quot;pp_hdi&amp;quot;] = pd.concat([df_p1[&amp;quot;pp_hdi&amp;quot;], df_p2[&amp;quot;pp_hdi&amp;quot;]]).sort_index().values
# Pooled PCA vs official SHDI
corr_pooled = df[&amp;quot;hdi&amp;quot;].corr(df[&amp;quot;shdi_official&amp;quot;])
r2_pooled = corr_pooled ** 2
# Per-period PCA vs official SHDI
corr_pp = df[&amp;quot;pp_hdi&amp;quot;].corr(df[&amp;quot;shdi_official&amp;quot;])
r2_pp = corr_pp ** 2
print(f&amp;quot;Pooled PCA vs official SHDI:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pooled:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pooled:.4f}&amp;quot;)
print(f&amp;quot;\nPer-period PCA vs official SHDI:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pp:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pp:.4f}&amp;quot;)
print(f&amp;quot;\nR-squared difference (pooled - per-period): {r2_pooled - r2_pp:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PCA vs official SHDI:
Pearson r: 0.9911
R-squared: 0.9823
Per-period PCA vs official SHDI:
Pearson r: 0.9874
R-squared: 0.9750
R-squared difference (pooled - per-period): +0.0073
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_linewidth(0)
p1_mask = df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;
p2_mask = df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;
# Panel A: Pooled PCA vs SHDI
ax = axes[0]
ax.scatter(df.loc[p1_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p1_mask, &amp;quot;hdi&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2013&amp;quot;)
ax.scatter(df.loc[p2_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p2_mask, &amp;quot;hdi&amp;quot;],
color=WARM_ORANGE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2019&amp;quot;)
ax.set_xlabel(&amp;quot;Official SHDI&amp;quot;)
ax.set_ylabel(&amp;quot;Pooled PCA HDI&amp;quot;)
ax.set_title(f&amp;quot;Pooled PCA (R² = {r2_pooled:.4f})&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;, fontsize=9)
# Panel B: Per-period PCA vs SHDI
ax = axes[1]
ax.scatter(df.loc[p1_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p1_mask, &amp;quot;pp_hdi&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2013&amp;quot;)
ax.scatter(df.loc[p2_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p2_mask, &amp;quot;pp_hdi&amp;quot;],
color=WARM_ORANGE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2019&amp;quot;)
ax.set_xlabel(&amp;quot;Official SHDI&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period PCA HDI&amp;quot;)
ax.set_title(f&amp;quot;Per-period PCA (R² = {r2_pp:.4f})&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;, fontsize=9)
fig.suptitle(&amp;quot;Validation: which PCA method tracks the official SHDI better?&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_validation_vs_shdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_validation_vs_shdi.png" alt="Side-by-side scatter plots comparing pooled PCA and per-period PCA against the official SHDI.">&lt;/p>
&lt;p>&lt;strong>Pooled PCA achieves $R^2 = 0.9823$, outperforming per-period PCA at $R^2 = 0.9750$.&lt;/strong> The difference of +0.0073 may seem small in absolute terms, but it is consistent and meaningful: pooled PCA explains 0.73 percentage points more of the variance in the official SHDI. The left panel shows pooled PCA points tightly clustered along the fit line with both periods intermixed seamlessly &amp;mdash; exactly what we want for a temporally comparable index. The right panel shows per-period PCA with a slightly wider scatter, reflecting the distortion introduced by re-centering each period to its own baseline. The fact that the official SHDI (which uses a fixed geometric mean formula across years) correlates more strongly with pooled PCA than with per-period PCA validates the pooled approach: when the goal is temporal comparability, fitting on stacked data is the right choice.&lt;/p>
&lt;h3 id="validating-the-dynamics-changes-over-time">Validating the dynamics: changes over time&lt;/h3>
&lt;p>The level comparison above tests cross-sectional fit &amp;mdash; do the PCA-based indices rank regions correctly at a point in time? But the core promise of pooled PCA is capturing &lt;strong>dynamics&lt;/strong> &amp;mdash; changes over time. We now test whether the change in PCA-based HDI tracks the change in official SHDI.&lt;/p>
&lt;pre>&lt;code class="language-python"># Compute official SHDI change per region
shdi_wide = (df.loc[p1_mask, [&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_official&amp;quot;]]
.rename(columns={&amp;quot;shdi_official&amp;quot;: &amp;quot;shdi_p1&amp;quot;}))
shdi_wide = shdi_wide.merge(
df.loc[p2_mask, [&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_official&amp;quot;]]
.rename(columns={&amp;quot;shdi_official&amp;quot;: &amp;quot;shdi_p2&amp;quot;}),
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;]
)
shdi_wide[&amp;quot;shdi_change&amp;quot;] = shdi_wide[&amp;quot;shdi_p2&amp;quot;] - shdi_wide[&amp;quot;shdi_p1&amp;quot;]
# Merge with comparison table
compare_val = compare.merge(shdi_wide[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_change&amp;quot;]],
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;])
# R² for changes
corr_pooled_change = compare_val[&amp;quot;hdi_change&amp;quot;].corr(compare_val[&amp;quot;shdi_change&amp;quot;])
r2_pooled_change = corr_pooled_change ** 2
corr_pp_change = compare_val[&amp;quot;pp_change&amp;quot;].corr(compare_val[&amp;quot;shdi_change&amp;quot;])
r2_pp_change = corr_pp_change ** 2
print(f&amp;quot;Pooled PCA change vs official SHDI change:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pooled_change:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pooled_change:.4f}&amp;quot;)
print(f&amp;quot;\nPer-period PCA change vs official SHDI change:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pp_change:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pp_change:.4f}&amp;quot;)
print(f&amp;quot;\nR-squared difference (pooled - per-period): {r2_pooled_change - r2_pp_change:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PCA change vs official SHDI change:
Pearson r: 0.9982
R-squared: 0.9964
Per-period PCA change vs official SHDI change:
Pearson r: 0.9957
R-squared: 0.9913
R-squared difference (pooled - per-period): +0.0051
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_linewidth(0)
# Panel A: Pooled PCA change vs SHDI change
ax = axes[0]
ax.scatter(compare_val[&amp;quot;shdi_change&amp;quot;], compare_val[&amp;quot;hdi_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, alpha=0.7, zorder=3)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Official SHDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Pooled PCA HDI change&amp;quot;)
ax.set_title(f&amp;quot;Pooled PCA (R² = {r2_pooled_change:.4f})&amp;quot;)
# Panel B: Per-period PCA change vs SHDI change
ax = axes[1]
ax.scatter(compare_val[&amp;quot;shdi_change&amp;quot;], compare_val[&amp;quot;pp_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, alpha=0.7, zorder=3)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Official SHDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period PCA HDI change&amp;quot;)
ax.set_title(f&amp;quot;Per-period PCA (R² = {r2_pp_change:.4f})&amp;quot;)
fig.suptitle(&amp;quot;Validation: which PCA method better captures development dynamics?&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_validation_changes.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_validation_changes.png" alt="Side-by-side scatter plots comparing pooled and per-period PCA HDI changes against official SHDI changes.">&lt;/p>
&lt;p>The change validation is even more compelling than the level validation. &lt;strong>Pooled PCA change achieves $R^2 = 0.9964$, outperforming per-period PCA change at $R^2 = 0.9913$.&lt;/strong> Both methods track the official SHDI dynamics remarkably well ($r &amp;gt; 0.99$), but pooled PCA is the tighter fit. The left panel shows pooled PCA changes falling almost exactly on the regression line, with virtually no scatter. The right panel shows per-period PCA changes with slightly more dispersion, reflecting the noise introduced by re-centering each period&amp;rsquo;s baseline. Taken together, the level validation ($R^2$: 0.9823 vs 0.9750) and the change validation ($R^2$: 0.9964 vs 0.9913) consistently favor pooled PCA &amp;mdash; it better reproduces both the cross-sectional rankings and the temporal dynamics of the official Subnational Human Development Index.&lt;/p>
&lt;h2 id="15-replicating-with-scikit-learn">15. Replicating with scikit-learn&lt;/h2>
&lt;p>The pooled PCA pipeline with scikit-learn is nearly identical to the &lt;a href="https://carlos-mendez.org/post/python_pca/#12-replicating-the-analysis-with-scikit-learn">single-period pipeline from the previous tutorial&lt;/a>. The key insight is that sklearn&amp;rsquo;s &lt;code>fit_transform&lt;/code> on the stacked data IS pooled PCA &amp;mdash; no special panel-data library is needed.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# ── Configuration (change these for your own dataset) ────────────
CSV_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data_long.csv&amp;quot;
ID_COL = &amp;quot;region&amp;quot;
PERIOD_COL = &amp;quot;period&amp;quot;
POSITIVE_COLS = [&amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]
NEGATIVE_COLS = []
# Step 0: Load long-format panel data
df_sk = pd.read_csv(CSV_URL)
print(f&amp;quot;Loaded: {df_sk.shape[0]} rows, {df_sk.shape[1]} columns&amp;quot;)
# Step 1: Polarity adjustment
for col in NEGATIVE_COLS:
df_sk[col + &amp;quot;_adj&amp;quot;] = -1 * df_sk[col]
adj_cols = POSITIVE_COLS + [col + &amp;quot;_adj&amp;quot; for col in NEGATIVE_COLS]
# Step 2: POOLED standardization (fit on ALL periods)
scaler = StandardScaler()
Z_sk = scaler.fit_transform(df_sk[adj_cols])
# Step 3-4: POOLED PCA (fit on ALL periods)
pca_sk = PCA(n_components=1)
df_sk[&amp;quot;pc1&amp;quot;] = pca_sk.fit_transform(Z_sk)[:, 0]
# Step 5-6: POOLED normalization (min/max across ALL periods)
df_sk[&amp;quot;pc1_index&amp;quot;] = (
(df_sk[&amp;quot;pc1&amp;quot;] - df_sk[&amp;quot;pc1&amp;quot;].min())
/ (df_sk[&amp;quot;pc1&amp;quot;].max() - df_sk[&amp;quot;pc1&amp;quot;].min())
)
df_sk.to_csv(&amp;quot;pc1_index_results.csv&amp;quot;, index=False)
print(f&amp;quot;\nPC1 weights: {pca_sk.components_[0].round(4)}&amp;quot;)
print(f&amp;quot;Variance explained: {pca_sk.explained_variance_ratio_.round(4)}&amp;quot;)
print(f&amp;quot;\nSaved: pc1_index_results.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 306 rows, 10 columns
PC1 weights: [0.5642 0.5448 0.6204]
Variance explained: [0.7242]
Saved: pc1_index_results.csv
&lt;/code>&lt;/pre>
&lt;p>The sklearn pipeline produces identical weights ($[0.5642, 0.5448, 0.6204]$) and variance explained (72.42%), with a maximum absolute difference of $2.00 \times 10^{-15}$ from our manual implementation.&lt;/p>
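&lt;p>The equivalence claim is easy to verify yourself. The sketch below (synthetic data, not the tutorial&amp;rsquo;s panel) compares sklearn&amp;rsquo;s PC1 scores with a manual eigendecomposition of the covariance matrix. Because the sign of an eigenvector is arbitrary, the two score vectors must be sign-aligned before comparing:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.decomposition import PCA
# Synthetic centered data standing in for the standardized panel
rng = np.random.default_rng(42)
Z = rng.normal(size=(306, 3))
Z = Z - Z.mean(axis=0)                  # center, as StandardScaler would
# Manual route: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
v1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
manual_scores = Z @ v1
# sklearn route
sk_scores = PCA(n_components=1).fit_transform(Z)[:, 0]
# Align the arbitrary eigenvector sign, then compare elementwise
manual_scores = manual_scores * np.sign(np.dot(manual_scores, sk_scores))
max_diff = np.abs(manual_scores - sk_scores).max()
print(max_diff)
&lt;/code>&lt;/pre>
&lt;p>The maximum absolute difference lands at floating-point noise, which is exactly the magnitude reported above for the tutorial&amp;rsquo;s data.&lt;/p>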
&lt;h2 id="16-application-space-time-analyses">16. Application: Space-time analyses&lt;/h2>
&lt;p>With a temporally comparable pooled PCA index in hand, we can now analyze development dynamics across South America. This section demonstrates two types of space-time analysis: mapping how the spatial distribution of development shifted between 2013 and 2019, and measuring how spatial inequality changed over the same period.&lt;/p>
&lt;h3 id="spatial-distribution-dynamics">Spatial distribution dynamics&lt;/h3>
&lt;p>Choropleth maps provide an intuitive way to visualize where development improved, stagnated, or declined. The key methodological choice is to compute the color breaks from the &lt;strong>initial period&lt;/strong> (2013) using the &lt;a href="https://pysal.org/mapclassify/generated/mapclassify.FisherJenks.html" target="_blank" rel="noopener">Fisher-Jenks natural breaks algorithm&lt;/a> and hold those breaks &lt;strong>constant&lt;/strong> in the 2019 map. This ensures that a color change between maps reflects a genuine shift in HDI, not a shifting classification scheme. If we re-computed breaks for each period, regions could change color simply because the overall distribution shifted, not because they individually improved.&lt;/p>
&lt;pre>&lt;code class="language-python">import geopandas as gpd
import mapclassify
import contextily as cx
# Load GeoJSON boundaries and merge pooled HDI using GDLcode
GEO_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data.geojson&amp;quot;
gdf = gpd.read_file(GEO_URL)
hdi_2013 = df_pooled_p1[[&amp;quot;GDLcode&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_2013&amp;quot;})
hdi_2019 = df_pooled_p2[[&amp;quot;GDLcode&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_2019&amp;quot;})
gdf = gdf.merge(hdi_2013, on=&amp;quot;GDLcode&amp;quot;)
gdf = gdf.merge(hdi_2019, on=&amp;quot;GDLcode&amp;quot;)
# Reproject to Web Mercator for basemap
gdf_3857 = gdf.to_crs(epsg=3857)
# Fisher-Jenks breaks from 2013 (5 classes)
fj = mapclassify.FisherJenks(gdf_3857[&amp;quot;hdi_2013&amp;quot;].values, k=5)
breaks = fj.bins.tolist()
# Extend upper break to cover 2019 max
max_val = max(gdf_3857[&amp;quot;hdi_2013&amp;quot;].max(), gdf_3857[&amp;quot;hdi_2019&amp;quot;].max())
if max_val &amp;gt; breaks[-1]:
breaks[-1] = float(round(max_val + 0.001, 3))
# Apply adjusted breaks to 2019 (must come AFTER break extension)
fj_2019 = mapclassify.UserDefined(gdf_3857[&amp;quot;hdi_2019&amp;quot;].values, bins=breaks)
# Class transitions
classes_2013 = fj.yb
classes_2019 = fj_2019.yb
improved = (classes_2019 &amp;gt; classes_2013).sum()
stayed = (classes_2019 == classes_2013).sum()
declined = (classes_2019 &amp;lt; classes_2013).sum()
print(f&amp;quot;Fisher-Jenks breaks (from 2013): {[round(b, 3) for b in breaks]}&amp;quot;)
print(f&amp;quot;\nClass transitions (2013 → 2019):&amp;quot;)
print(f&amp;quot; Improved (moved up): {improved}&amp;quot;)
print(f&amp;quot; Stayed same: {stayed}&amp;quot;)
print(f&amp;quot; Declined (moved down): {declined}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Fisher-Jenks breaks (from 2013): [0.167, 0.449, 0.581, 0.73, 1.001]
Class transitions (2013 → 2019):
Improved (moved up): 40
Stayed same: 88
Declined (moved down): 25
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Class labels
class_labels = []
lower = 0.0
for b in breaks:
class_labels.append(f&amp;quot;{lower:.2f} – {b:.2f}&amp;quot;)
lower = b
fig, axes = plt.subplots(1, 2, figsize=(16, 12))
fig.patch.set_facecolor(DARK_NAVY)
fig.patch.set_linewidth(0)
from matplotlib.patches import Patch
cmap = plt.cm.coolwarm
norm = plt.Normalize(vmin=0, vmax=len(breaks) - 1)
for ax, year_col, title, year_fj in [
(axes[0], &amp;quot;hdi_2013&amp;quot;, &amp;quot;Pooled PCA HDI — 2013&amp;quot;, fj),
(axes[1], &amp;quot;hdi_2019&amp;quot;, &amp;quot;Pooled PCA HDI — 2019&amp;quot;, fj_2019),
]:
# Classify and assign colors manually
year_classes = year_fj.yb
colors = [cmap(norm(c)) for c in year_classes]
gdf_3857.plot(
ax=ax, color=colors,
edgecolor=DARK_NAVY, linewidth=0.3,
)
cx.add_basemap(ax, source=cx.providers.CartoDB.DarkMatter, zoom=4, attribution=&amp;quot;&amp;quot;)
ax.set_title(title, fontsize=14, color=WHITE_TEXT, pad=10)
ax.set_axis_off()
# Build legend manually with correct counts
counts = np.bincount(year_fj.yb, minlength=len(breaks))
handles = []
for i, (cl, c) in enumerate(zip(class_labels, counts)):
handles.append(Patch(facecolor=cmap(norm(i)), edgecolor=DARK_NAVY,
label=f&amp;quot;{cl} (n={c})&amp;quot;))
leg = ax.legend(handles=handles, title=&amp;quot;HDI Class&amp;quot;, loc=&amp;quot;lower right&amp;quot;,
fontsize=16, title_fontsize=17)
leg.set_frame_on(True)
leg.get_frame().set_facecolor(&amp;quot;#1a1a2e&amp;quot;)
leg.get_frame().set_edgecolor(LIGHT_TEXT)
leg.get_frame().set_alpha(0.9)
leg.get_frame().set_linewidth(1.5)
for text in leg.get_texts():
text.set_color(WHITE_TEXT)
leg.get_title().set_color(WHITE_TEXT)
fig.suptitle(&amp;quot;Spatial distribution dynamics: Pooled PCA HDI\n&amp;quot;
&amp;quot;(Fisher-Jenks breaks from 2013 held constant)&amp;quot;,
fontsize=15, color=WHITE_TEXT, y=0.95)
plt.tight_layout(rect=[0, 0, 1, 0.93])
plt.savefig(&amp;quot;pca2_choropleth_hdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_choropleth_hdi.png" alt="Side-by-side choropleth maps of pooled PCA HDI for 2013 and 2019 with fixed Fisher-Jenks breaks.">&lt;/p>
&lt;p>The choropleth maps reveal clear geographic patterns in South American development. The Southern Cone (Chile, Argentina, Uruguay) and southern Brazil appear in the highest HDI classes (teal tones), while the Amazon basin, interior Guyana, and parts of Bolivia occupy the lowest classes (orange tones). Between 2013 and 2019, &lt;strong>40 regions moved up&lt;/strong> at least one Fisher-Jenks class, &lt;strong>88 stayed in the same class&lt;/strong>, and &lt;strong>25 declined&lt;/strong>. The upward mobility is concentrated in the Andean countries (Peru, Bolivia, Colombia) where education gains shifted regions from the second to the third class. The declines are predominantly in Venezuelan states, visible as regions shifting from mid-range blues to warmer colors &amp;mdash; a direct cartographic reflection of Venezuela&amp;rsquo;s economic crisis. The fact that both maps use the same classification breaks makes these color changes directly interpretable: any region that changed color genuinely crossed a development threshold.&lt;/p>
&lt;h3 id="spatial-inequality-dynamics">Spatial inequality dynamics&lt;/h3>
&lt;p>The &lt;strong>Gini index&lt;/strong> measures inequality in the distribution of a variable across a population, ranging from 0 (perfect equality &amp;mdash; every region has the same value) to 1 (perfect inequality &amp;mdash; all development concentrated in a single region). Think of it as a single number that summarizes how unevenly a resource or outcome is distributed. By computing the Gini index for each indicator in each period, we can track whether development is converging (Gini falling &amp;mdash; regions becoming more similar) or diverging (Gini rising &amp;mdash; gaps widening).&lt;/p>
&lt;p>We use the &lt;a href="https://pysal.org/inequality/generated/inequality.gini.Gini.html" target="_blank" rel="noopener">Gini&lt;/a> class from PySAL&amp;rsquo;s &lt;a href="https://pysal.org/inequality/" target="_blank" rel="noopener">inequality&lt;/a> library, which provides a robust implementation of the Gini coefficient. The &lt;code>Gini(values).g&lt;/code> attribute returns the computed coefficient.&lt;/p>
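&lt;p>To make the number concrete before calling PySAL: the Gini coefficient can be written as the mean absolute difference between all pairs of regions, divided by twice the mean. A minimal sketch of this textbook formulation (PySAL&amp;rsquo;s implementation should agree with it up to numerical detail):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
# Gini as the relative mean absolute difference:
# G = (sum over all pairs of abs differences) / (2 * n^2 * mean)
def gini_mad(x):
    x = np.asarray(x, dtype=float)
    pairwise = np.abs(x[:, None] - x[None, :]).sum()
    return pairwise / (2 * len(x) ** 2 * x.mean())
equal = np.full(5, 0.6)                        # identical regions: G = 0
unequal = np.array([0.1, 0.3, 0.6, 0.8, 1.0])  # dispersed regions
print(gini_mad(equal), round(gini_mad(unequal), 4))
&lt;/code>&lt;/pre>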
&lt;pre>&lt;code class="language-python">from inequality.gini import Gini
# Compute Gini for each indicator and pooled HDI, per period
gini_rows = []
for period_label in [&amp;quot;Y2013&amp;quot;, &amp;quot;Y2019&amp;quot;]:
mask = df[&amp;quot;period&amp;quot;] == period_label
row = {&amp;quot;period&amp;quot;: period_label}
for col in INDICATORS + [&amp;quot;hdi&amp;quot;]:
row[col] = round(Gini(df.loc[mask, col].values).g, 4)
gini_rows.append(row)
gini_df = pd.DataFrame(gini_rows).set_index(&amp;quot;period&amp;quot;)
# Add change row
change_row = gini_df.loc[&amp;quot;Y2019&amp;quot;] - gini_df.loc[&amp;quot;Y2013&amp;quot;]
change_row.name = &amp;quot;Change&amp;quot;
gini_df = pd.concat([gini_df, change_row.to_frame().T])
print(f&amp;quot;Gini index by indicator and period:&amp;quot;)
print(gini_df.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Gini index by indicator and period:
education health income hdi
Y2013 0.0655 0.0295 0.0549 0.1712
Y2019 0.0639 0.0318 0.0585 0.1795
Change -0.0016 0.0023 0.0036 0.0083
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
fig.patch.set_linewidth(0)
labels = [&amp;quot;Education&amp;quot;, &amp;quot;Health&amp;quot;, &amp;quot;Income&amp;quot;, &amp;quot;Pooled HDI&amp;quot;]
cols = INDICATORS + [&amp;quot;hdi&amp;quot;]
vals_2013 = [gini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
vals_2019 = [gini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
x = np.arange(len(labels))
width = 0.3
bars1 = ax.bar(x - width/2, vals_2013, width, color=STEEL_BLUE,
edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
bars2 = ax.bar(x + width/2, vals_2019, width, color=WARM_ORANGE,
edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for bar in bars1:
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
f&amp;quot;{bar.get_height():.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=9, color=LIGHT_TEXT)
for bar in bars2:
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
f&amp;quot;{bar.get_height():.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=9, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=12)
ax.set_ylabel(&amp;quot;Gini Index&amp;quot;)
ax.set_title(&amp;quot;Spatial inequality dynamics: Gini index by indicator (2013 vs 2019)&amp;quot;)
ax.legend()
ax.set_ylim(0, ax.get_ylim()[1] * 1.15)
plt.savefig(&amp;quot;pca2_gini_dynamics.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_gini_dynamics.png" alt="Grouped bar chart showing Gini index for each indicator in 2013 and 2019.">&lt;/p>
&lt;p>The Gini analysis reveals a nuanced inequality story across South America&amp;rsquo;s sub-national regions. &lt;strong>Education is the only dimension that converged&lt;/strong> between 2013 and 2019 &amp;mdash; its Gini fell from 0.0655 to 0.0639 ($-0.0016$), meaning regions became slightly more equal in educational attainment. Health and income both &lt;strong>diverged&lt;/strong>: health inequality rose from 0.0295 to 0.0318 ($+0.0023$) and income inequality from 0.0549 to 0.0585 ($+0.0036$). The composite pooled PCA HDI shows an overall increase in inequality from 0.1712 to 0.1795 ($+0.0083$), driven primarily by the income and health dimensions. This tells a policy-relevant story: while South America made progress in reducing educational gaps across regions, the income decline was unevenly distributed &amp;mdash; some regions (particularly Venezuelan states) experienced far steeper economic setbacks than others, widening the income gap. The fact that overall HDI inequality increased despite educational convergence underscores that development progress is not uniform across dimensions, and a composite index like the pooled PCA HDI captures these cross-cutting dynamics in a single measure.&lt;/p>
&lt;h3 id="population-weighted-inequality">Population-weighted inequality&lt;/h3>
&lt;p>The unweighted Gini treats every region equally &amp;mdash; Potaro-Siparuni (population 10,000) carries the same weight as São Paulo (population 44 million). For policy analysis, we often care more about how many &lt;em>people&lt;/em> experience inequality, not how many &lt;em>regions&lt;/em>. A population-weighted Gini accounts for this by giving larger regions proportionally more influence. Since PySAL&amp;rsquo;s &lt;code>Gini&lt;/code> class does not support population weights, we implement the weighted Gini using the trapezoidal Lorenz curve approach.&lt;/p>
&lt;pre>&lt;code class="language-python">def weighted_gini(values, weights):
&amp;quot;&amp;quot;&amp;quot;Compute the population-weighted Gini index using the Lorenz curve.
Parameters
----------
values : array-like — indicator values (e.g., HDI per region)
weights : array-like — population weights (e.g., region population)
Returns
-------
float — weighted Gini coefficient in [0, 1]
&amp;quot;&amp;quot;&amp;quot;
v = np.asarray(values, dtype=float)
w = np.asarray(weights, dtype=float)
order = np.argsort(v)
v, w = v[order], w[order]
# Cumulative population and value shares
cum_w = np.cumsum(w) / np.sum(w)
cum_vw = np.cumsum(v * w) / np.sum(v * w)
# Prepend zero for trapezoidal integration
cum_w = np.concatenate(([0], cum_w))
cum_vw = np.concatenate(([0], cum_vw))
# Area under Lorenz curve
B = np.sum((cum_w[1:] - cum_w[:-1]) * (cum_vw[1:] + cum_vw[:-1]) / 2)
return 1 - 2 * B
# Compute population-weighted Gini
wgini_rows = []
for period_label in [&amp;quot;Y2013&amp;quot;, &amp;quot;Y2019&amp;quot;]:
mask = df[&amp;quot;period&amp;quot;] == period_label
row = {&amp;quot;period&amp;quot;: period_label}
for col in INDICATORS + [&amp;quot;hdi&amp;quot;]:
row[col] = round(weighted_gini(
df.loc[mask, col].values, df.loc[mask, &amp;quot;pop&amp;quot;].values
), 4)
wgini_rows.append(row)
wgini_df = pd.DataFrame(wgini_rows).set_index(&amp;quot;period&amp;quot;)
wchange_row = wgini_df.loc[&amp;quot;Y2019&amp;quot;] - wgini_df.loc[&amp;quot;Y2013&amp;quot;]
wchange_row.name = &amp;quot;Change&amp;quot;
wgini_df = pd.concat([wgini_df, wchange_row.to_frame().T])
print(f&amp;quot;Population-weighted Gini index:&amp;quot;)
print(wgini_df.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Population-weighted Gini index:
education health income hdi
Y2013 0.0525 0.0174 0.0359 0.1113
Y2019 0.0521 0.0186 0.0387 0.1156
Change -0.0004 0.0012 0.0028 0.0043
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
fig.patch.set_linewidth(0)
labels = [&amp;quot;Education&amp;quot;, &amp;quot;Health&amp;quot;, &amp;quot;Income&amp;quot;, &amp;quot;Pooled HDI&amp;quot;]
cols = INDICATORS + [&amp;quot;hdi&amp;quot;]
x = np.arange(len(labels))
width = 0.3
# Panel A: Unweighted
ax = axes[0]
uw_13 = [gini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
uw_19 = [gini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
ax.bar(x - width/2, uw_13, width, color=STEEL_BLUE, edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
ax.bar(x + width/2, uw_19, width, color=WARM_ORANGE, edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for i, (v13, v19) in enumerate(zip(uw_13, uw_19)):
ax.text(i - width/2, v13 + 0.002, f&amp;quot;{v13:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=8, color=LIGHT_TEXT)
ax.text(i + width/2, v19 + 0.002, f&amp;quot;{v19:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=8, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=11)
ax.set_ylabel(&amp;quot;Gini Index&amp;quot;)
ax.set_title(&amp;quot;Unweighted Gini&amp;quot;)
ax.legend(fontsize=9)
# Panel B: Population-weighted
ax = axes[1]
pw_13 = [wgini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
pw_19 = [wgini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
ax.bar(x - width/2, pw_13, width, color=STEEL_BLUE, edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
ax.bar(x + width/2, pw_19, width, color=WARM_ORANGE, edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for i, (v13, v19) in enumerate(zip(pw_13, pw_19)):
ax.text(i - width/2, v13 + 0.002, f&amp;quot;{v13:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=8, color=LIGHT_TEXT)
ax.text(i + width/2, v19 + 0.002, f&amp;quot;{v19:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
fontsize=8, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=11)
ax.set_title(&amp;quot;Population-weighted Gini&amp;quot;)
ax.legend(fontsize=9)
fig.suptitle(&amp;quot;Spatial inequality: unweighted vs. population-weighted Gini&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_gini_weighted_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_gini_weighted_comparison.png" alt="Side-by-side comparison of unweighted and population-weighted Gini indices.">&lt;/p>
&lt;p>The population-weighted Gini values are &lt;strong>substantially lower&lt;/strong> than their unweighted counterparts across all indicators and both periods. For example, the pooled HDI Gini drops from 0.1712 (unweighted) to 0.1113 (weighted) in 2013 &amp;mdash; a 35% reduction. This gap means that large-population regions (São Paulo, Buenos Aires, Bogota, Santiago) tend to cluster near the middle of the development distribution, while the extreme values (both high and low) are found in smaller regions. When we weight by population, the outlier regions matter less, and inequality appears lower because most South Americans live in moderately developed areas. The direction of change, however, is consistent: both weighted and unweighted Gini show education converging ($-0.0004$ weighted vs $-0.0016$ unweighted) while income ($+0.0028$ vs $+0.0036$) and overall HDI ($+0.0043$ vs $+0.0083$) diverge. The divergence is smaller in population-weighted terms, suggesting that the widening gaps are driven more by sparsely populated peripheral regions than by the major urban centers where most people live.&lt;/p>
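&lt;p>The mechanics behind this compression are easy to see on toy numbers: when the extreme values sit in regions with tiny populations, the weighted Lorenz curve hugs the diagonal. The sketch below mirrors the trapezoidal logic of &lt;code>weighted_gini&lt;/code> with made-up values (not the tutorial&amp;rsquo;s data):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
def lorenz_gini(values, weights):
    # Trapezoidal Lorenz-curve Gini, same approach as weighted_gini above
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cw = np.concatenate(([0.0], np.cumsum(w) / w.sum()))
    cv = np.concatenate(([0.0], np.cumsum(v * w) / (v * w).sum()))
    return 1 - np.sum((cw[1:] - cw[:-1]) * (cv[1:] + cv[:-1]))
hdi = np.array([0.20, 0.60, 0.65, 0.70, 0.95])  # toy HDI values
pop = np.array([0.01, 10.0, 12.0, 9.0, 0.01])   # extremes live in tiny regions
g_unweighted = lorenz_gini(hdi, np.ones_like(hdi))
g_weighted = lorenz_gini(hdi, pop)
print(round(g_unweighted, 4), round(g_weighted, 4))
&lt;/code>&lt;/pre>
&lt;p>Weighting collapses the measured inequality because the two extreme regions carry almost no population, just as the small peripheral regions do in the South American panel.&lt;/p>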
&lt;h2 id="17-summary-results">17. Summary results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Step&lt;/th>
&lt;th>Input&lt;/th>
&lt;th>Output&lt;/th>
&lt;th>Key Result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Stack&lt;/td>
&lt;td>2 periods $\times$ 153 regions&lt;/td>
&lt;td>306-row DataFrame&lt;/td>
&lt;td>Panel format ready&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Polarity&lt;/td>
&lt;td>Raw indicators&lt;/td>
&lt;td>Aligned indicators&lt;/td>
&lt;td>All positive (no flip needed)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Standardization&lt;/td>
&lt;td>306 rows&lt;/td>
&lt;td>Z-scores (pooled)&lt;/td>
&lt;td>Fixed baseline across periods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Covariance&lt;/td>
&lt;td>Z matrix&lt;/td>
&lt;td>3$\times$3 matrix&lt;/td>
&lt;td>Off-diagonals 0.44&amp;ndash;0.68&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Eigen-decomposition&lt;/td>
&lt;td>Cov matrix&lt;/td>
&lt;td>eigenvalues, eigenvectors&lt;/td>
&lt;td>PC1 captures 72.4%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Scoring&lt;/td>
&lt;td>Z $\times$ eigvec&lt;/td>
&lt;td>PC1 scores&lt;/td>
&lt;td>2019 mean &amp;gt; 2013 mean (+0.14)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Normalization&lt;/td>
&lt;td>PC1&lt;/td>
&lt;td>HDI (0&amp;ndash;1)&lt;/td>
&lt;td>Comparable across periods&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="18-discussion">18. Discussion&lt;/h2>
&lt;p>&lt;strong>Pooled PCA successfully builds a composite development index that is directly comparable across time periods.&lt;/strong> By standardizing with pooled means, computing a single set of eigenvector weights from stacked data, and normalizing with pooled min/max bounds, the index preserves genuine temporal dynamics. The net development shift of +0.14 PC1 units (reflecting education and health gains partially offset by income decline) is captured by pooled PCA but would be invisible under per-period PCA.&lt;/p>
&lt;p>The real South American data revealed that Income carries the highest eigenvector weight (0.620), meaning PCA gives Income more influence than Education (0.564) or Health (0.545) in the composite index. This data-driven weighting differs from the UNDP&amp;rsquo;s equal-weight geometric mean approach, yet the two methods agree closely ($r = 0.991$). The similarity arises because all three indicators are positively correlated and driven by the same broad development processes. The differences emerge in regions with unbalanced profiles &amp;mdash; for example, regions with very high health but low education may rank differently under PCA versus the geometric mean.&lt;/p>
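&lt;p>That last point can be made concrete with two hypothetical profiles. This is illustrative only: the rounded PC1 weights are applied to raw $[0, 1]$ values here rather than z-scores, so the numbers are not actual index scores:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
w = np.array([0.564, 0.545, 0.620])      # rounded PC1 weights (edu, health, income)
region_a = np.array([0.95, 0.95, 0.20])  # unbalanced profile: weak income
region_b = np.array([0.60, 0.60, 0.60])  # balanced profile
linear_a, linear_b = w @ region_a, w @ region_b   # PCA-style weighted sum
geo_a = region_a.prod() ** (1 / 3)                # UNDP-style geometric mean
geo_b = region_b.prod() ** (1 / 3)
# The weighted sum ranks A above B; the geometric mean, which penalizes
# unbalanced profiles, flips the ranking.
print(round(linear_a, 3), round(linear_b, 3), round(geo_a, 3), round(geo_b, 3))
&lt;/code>&lt;/pre>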
&lt;p>The per-period approach disagrees with pooled PCA on the direction of change for 16 regions (10% of the sample). In each of these 16 cases, per-period PCA shows a decline while pooled PCA shows an improvement &amp;mdash; the shifting baseline erases genuine but modest gains. A policymaker using per-period PCA might conclude these regions are &amp;ldquo;falling behind&amp;rdquo; when in reality they made progress, just less than the shifting average.&lt;/p>
&lt;p>The income decline across South America between 2013 and 2019 makes the pooled approach particularly important. Per-period standardization would hide this real economic setback by re-centering income to zero each period. Pooled standardization preserves it, allowing researchers to see that income genuinely declined while education and health improved. This mixed signal is precisely the kind of nuance that development analysis must capture.&lt;/p>
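&lt;p>The re-centering argument is easy to demonstrate numerically. In the sketch below (a synthetic income panel, not the tutorial&amp;rsquo;s data), period 2 genuinely declines; pooled standardization preserves the decline, while per-period standardization erases it by construction:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
# Synthetic income panel: 150 regions, period 2 declines by 1.5 units
rng = np.random.default_rng(1)
p1 = rng.normal(10.0, 2.0, 150)
p2 = p1 - 1.5 + rng.normal(0.0, 0.3, 150)
# Pooled standardization: one mean/std fitted on the stacked series
stacked = np.concatenate([p1, p2])
pooled = (stacked - stacked.mean()) / stacked.std()
pooled_shift = pooled[150:].mean() - pooled[:150].mean()   # stays negative
# Per-period standardization: each period re-centered to its own mean
per1 = (p1 - p1.mean()) / p1.std()
per2 = (p2 - p2.mean()) / p2.std()
per_period_shift = per2.mean() - per1.mean()               # zero by construction
print(round(pooled_shift, 3), round(per_period_shift, 10))
&lt;/code>&lt;/pre>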
&lt;h2 id="19-summary-and-next-steps">19. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> Pooled PCA produces temporally comparable composite indices by fitting standardization and eigen-decomposition on stacked data. Pooled and per-period PCA disagree on the direction of HDI change for 16 of the 153 South American regions. The Spearman rank correlation between their improvement rankings is 0.982 &amp;mdash; high but not perfect, with consequential differences for specific regions.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> Income carries the highest PC1 weight (0.620) despite education having a wider range. PC1 captures 72.4% of variance &amp;mdash; lower than the 96% in simulated data, reflecting the genuine complexity of real development indicators. The PCA-based HDI correlates at $r = 0.991$ with the official SHDI, validating the approach.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> PC1 captures only 72% of variance, meaning 28% of development variation is lost in the compression. PC2 (19%) might capture meaningful patterns (e.g., health vs income trade-offs). Also, the pooled approach assumes a stable correlation structure between 2013 and 2019 &amp;mdash; a strong assumption over a 6-year period that included significant economic volatility in the region.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend the analysis to more time periods (2000&amp;ndash;2019) using the full Global Data Lab time series. Explore PC2 interpretation for policy-relevant sub-dimensions. Consider factor analysis for more flexible loading structures, and compare results across different world regions.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations of this analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data covers only South America. Development patterns in Sub-Saharan Africa or South Asia may produce different correlation structures and eigenvector weights.&lt;/li>
&lt;li>Two periods (2013 and 2019) is the minimum for temporal analysis. More periods would strengthen the pooled estimates and allow testing the constant-correlation assumption.&lt;/li>
&lt;li>The PCA-based index is relative to this specific sample. Adding or removing regions changes every score.&lt;/li>
&lt;li>Min-Max normalization is sensitive to outliers. The Potaro-Siparuni region of Guyana anchors the bottom and compresses the range for everyone else.&lt;/li>
&lt;/ul>
&lt;h2 id="20-exercises">20. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Explore PC2.&lt;/strong> The second principal component captures 18.8% of variance. Compute PC2 scores and plot them against PC1. What development pattern does PC2 capture? Which regions score high on PC1 but low on PC2 (or vice versa)?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test the constant-correlation assumption.&lt;/strong> Compute the correlation matrices separately for 2013 and 2019. How much do they differ? If the Income-Education correlation changed substantially, what would that imply for the validity of pooled PCA?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare with the UNDP methodology.&lt;/strong> The official SHDI uses a geometric mean: $\text{SHDI} = (\text{Education} \times \text{Health} \times \text{Income})^{1/3}$. Compute this for all regions and compare the ranking with your PCA-based ranking. Where do the two methods disagree most, and why?&lt;/p>
&lt;/li>
&lt;/ol>
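&lt;p>As a starting point for Exercise 3, the geometric mean can be sketched with hypothetical component indices (the numbers below are invented; the real values come from the Global Data Lab):&lt;/p>

```python
import numpy as np

# Hypothetical (education, health, income) indices for three regions
regions = {
    "Region A": (0.80, 0.75, 0.70),  # mildly unbalanced profile
    "Region B": (0.95, 0.60, 0.70),  # strong education, weak health
    "Region C": (0.75, 0.75, 0.75),  # perfectly balanced
}

for name, components in regions.items():
    geo = np.prod(components) ** (1 / 3)   # UNDP-style geometric mean
    arith = np.mean(components)            # simple arithmetic mean
    # The geometric mean never exceeds the arithmetic mean, and the gap
    # widens as the three components become more unbalanced
    print(f"{name}: geometric = {geo:.3f}, arithmetic = {arith:.3f}")
```

&lt;p>All three hypothetical regions share the same arithmetic mean (0.75), yet their geometric means differ: the UNDP formula penalizes unbalanced development across dimensions.&lt;/p>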
&lt;h2 id="21-references">21. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://carlos-mendez.org/post/python_pca/">Mendez, C. (2026). Introduction to PCA Analysis for Building Development Indicators.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits, J. and Permanyer, I. (2019). The Subnational Human Development Database. &lt;em>Scientific Data&lt;/em>, 6, 190038.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Global Data Lab &amp;ndash; Subnational Human Development Index&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1098/rsta.2015.0202" target="_blank" rel="noopener">Jolliffe, I. T. and Cadima, J. (2016). Principal Component Analysis: A Review and Recent Developments. &lt;em>Philosophical Transactions of the Royal Society A&lt;/em>, 374(2065).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, J., Picazo-Tadeo, A. J., and Rios, V. (2023). Social Progress around the World: Trends and Convergence. &lt;em>Oxford Economic Papers&lt;/em>, 75(2), 281&amp;ndash;306.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hdr.undp.org/data-center/human-development-index" target="_blank" rel="noopener">UNDP (2024). Human Development Index &amp;ndash; Technical Notes.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; PCA Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; StandardScaler Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez, C. and Gonzales, E. (2021). Human Capital Constraints, Spatial Dependence, and Regionalization in Bolivia. &lt;em>Economia&lt;/em>, 44(87).&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>The FWL Theorem: Making Multivariate Regressions Intuitive</title><link>https://carlos-mendez.org/post/python_fwl/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_fwl/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Including multiple variables in a regression raises a natural question: what does it actually mean to &amp;ldquo;control for&amp;rdquo; a confounder? The output is a coefficient, but a multivariate regression cannot be plotted on a simple two-dimensional scatter plot. This makes it hard to build intuition about what the regression is doing behind the scenes.&lt;/p>
&lt;p>The &lt;strong>Frisch-Waugh-Lovell (FWL) theorem&lt;/strong> answers this question. It shows that any coefficient from a multivariate regression can be recovered from a simple univariate regression &amp;mdash; after removing the influence of all other variables through a procedure called &lt;em>partialling-out&lt;/em> (also known as &lt;em>residualization&lt;/em> or &lt;em>orthogonalization&lt;/em>). Think of it as stripping away the noise from other variables so that only the signal of interest remains.&lt;/p>
&lt;p>This tutorial is inspired by &lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud (2022)&lt;/a>, and applies the FWL theorem to a simulated retail scenario. A chain of stores distributes discount coupons and wants to know whether the coupons increase sales. The catch: neighborhood income affects both coupon usage and sales, creating a confounding relationship that makes the naive analysis misleading. The analysis uses FWL to untangle these effects, verifies the theorem step by step, and visualizes the conditional relationship that multivariate regression captures but hides from view.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the Frisch-Waugh-Lovell theorem and why it matters for causal inference&lt;/li>
&lt;li>Implement the partialling-out procedure using OLS residuals&lt;/li>
&lt;li>Visualize conditional relationships that multivariate regressions capture but cannot directly plot&lt;/li>
&lt;li>Compare naive and conditional estimates to see how omitted variable bias distorts results&lt;/li>
&lt;li>Connect FWL to modern applications such as Double Machine Learning&lt;/li>
&lt;/ul>
&lt;h2 id="the-causal-structure">The causal structure&lt;/h2>
&lt;p>Before looking at data, it helps to understand the causal relationships among the variables. A &lt;strong>Directed Acyclic Graph (DAG)&lt;/strong> &amp;mdash; a diagram where arrows indicate direct causal effects &amp;mdash; makes these assumptions explicit.&lt;/p>
&lt;p>In this retail scenario, three variables interact:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
I[&amp;quot;&amp;lt;b&amp;gt;Income&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(confounder)&amp;quot;] --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ fewer coupons&amp;quot;| C[&amp;quot;&amp;lt;b&amp;gt;Coupons&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(treatment)&amp;quot;]
I --&amp;gt;|&amp;quot;Higher income&amp;lt;br/&amp;gt;→ more spending&amp;quot;| S[&amp;quot;&amp;lt;b&amp;gt;Sales&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(outcome)&amp;quot;]
C --&amp;gt;|&amp;quot;True causal&amp;lt;br/&amp;gt;effect: +0.2&amp;quot;| S
style I fill:#d97757,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style S fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Income acts as a &lt;strong>confounder&lt;/strong> &amp;mdash; a variable that influences both the treatment (coupon usage) and the outcome (sales). Wealthier neighborhoods use fewer coupons but spend more, creating a &lt;em>backdoor path&lt;/em> from coupons to sales through income. Ignoring income allows this backdoor path to generate a spurious negative association between coupons and sales, masking the true positive effect.&lt;/p>
&lt;p>To recover the genuine causal effect, the analysis must &lt;strong>block&lt;/strong> this backdoor path by conditioning on income. The FWL theorem provides an elegant way to do this and to visualize the result.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>The following code loads all necessary libraries. The analysis relies on &lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels&lt;/a> for OLS regression, &lt;a href="https://seaborn.pydata.org/" target="_blank" rel="noopener">seaborn&lt;/a> for regression plots, and &lt;a href="https://matplotlib.org/" target="_blank" rel="noopener">matplotlib&lt;/a> for figure customization. The &lt;code>RANDOM_SEED&lt;/code> ensures that every reader gets identical results.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note on figure styling:&lt;/strong> The figures in this post use a dark theme for visual consistency with the site. The companion &lt;code>script.py&lt;/code> includes the full styling code. To reproduce the dark-themed figures, add the following to your setup:&lt;/p>
&lt;details>&lt;summary>Dark theme settings (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python">DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
plt.rcParams.update({
&amp;quot;figure.facecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
&amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY, &amp;quot;axes.linewidth&amp;quot;: 0,
&amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT, &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
&amp;quot;axes.spines.top&amp;quot;: False, &amp;quot;axes.spines.right&amp;quot;: False,
&amp;quot;axes.spines.left&amp;quot;: False, &amp;quot;axes.spines.bottom&amp;quot;: False,
&amp;quot;axes.grid&amp;quot;: True, &amp;quot;grid.color&amp;quot;: GRID_LINE,
&amp;quot;grid.linewidth&amp;quot;: 0.6, &amp;quot;grid.alpha&amp;quot;: 0.8,
&amp;quot;xtick.color&amp;quot;: LIGHT_TEXT, &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
&amp;quot;text.color&amp;quot;: WHITE_TEXT, &amp;quot;font.size&amp;quot;: 12,
&amp;quot;legend.frameon&amp;quot;: False, &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
&amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY, &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;/blockquote>
&lt;h2 id="data-simulation">Data simulation&lt;/h2>
&lt;p>Rather than importing data from an external source, this section builds a transparent data generating process (DGP) so that the &lt;strong>true causal effect&lt;/strong> is known in advance and the methods can be verified against it. Think of it as running a controlled experiment in a computer: set the rules, generate the data, and then check whether the statistical tools find the right answer.&lt;/p>
&lt;p>The DGP encodes the causal structure from the DAG above:&lt;/p>
&lt;ul>
&lt;li>&lt;code>income&lt;/code> is drawn from a normal distribution centered at \$50K&lt;/li>
&lt;li>&lt;code>coupons&lt;/code> depends negatively on income (wealthier customers use fewer coupons) plus random noise&lt;/li>
&lt;li>&lt;code>sales&lt;/code> depends positively on both coupons (+0.2) and income (+0.3), plus a day-of-week effect and random noise&lt;/li>
&lt;/ul>
&lt;p>The true causal effect of coupons on sales is &lt;strong>exactly +0.2&lt;/strong> &amp;mdash; this is the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>, the average impact of coupons on sales across all stores. In concrete terms, every 1 percentage point increase in coupon usage causes a \$200 increase in daily sales (measured in thousands).&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_store_data(n=50, seed=42):
&amp;quot;&amp;quot;&amp;quot;Simulate retail store data with confounding by income.&amp;quot;&amp;quot;&amp;quot;
rng = np.random.default_rng(seed)
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = (10 + 0.2 * coupons + 0.3 * income
+ 0.5 * dayofweek + rng.normal(0, 3, n))
return pd.DataFrame({
&amp;quot;sales&amp;quot;: np.round(sales, 2),
&amp;quot;coupons&amp;quot;: np.round(coupons, 2),
&amp;quot;income&amp;quot;: np.round(income, 2),
&amp;quot;dayofweek&amp;quot;: dayofweek,
})
N = 50
df = simulate_store_data(n=N, seed=RANDOM_SEED)
print(&amp;quot;Dataset shape:&amp;quot;, df.shape)
print(df.head())
print(df.describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (50, 4)
sales coupons income dayofweek
0 37.37 36.93 53.05 6
1 36.88 38.06 39.60 6
2 33.09 32.04 57.50 6
3 35.09 33.43 59.41 5
4 27.01 43.21 30.49 4
sales coupons income dayofweek
count 50.00 50.00 50.00 50.00
mean 33.61 33.84 50.91 3.92
std 3.96 4.89 7.68 1.88
min 25.76 23.26 30.49 1.00
25% 31.30 31.53 45.78 2.00
50% 33.24 33.25 51.74 4.00
75% 36.00 36.89 56.42 5.75
max 44.38 43.79 71.42 7.00
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 50 stores with average daily sales of \$33,610, average coupon usage of 33.84%, and average neighborhood income of \$50,910. Sales range from \$25,760 to \$44,380, reflecting meaningful variation across stores. Coupon usage spans from 23% to 44%, and income ranges from \$30,490 to \$71,420. This variation provides enough signal to estimate the relationships of interest.&lt;/p>
&lt;h2 id="the-naive-relationship">The naive relationship&lt;/h2>
&lt;p>The simplest approach is to regress sales directly on coupon usage, ignoring income entirely. This is what a rushed analyst might do &amp;mdash; just look at whether stores with more coupon usage have higher or lower sales.&lt;/p>
&lt;pre>&lt;code class="language-python">sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False,
scatter_kws={&amp;quot;color&amp;quot;: STEEL_BLUE, &amp;quot;alpha&amp;quot;: 0.7, &amp;quot;edgecolors&amp;quot;: &amp;quot;white&amp;quot;, &amp;quot;s&amp;quot;: 60},
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;})
plt.legend()
plt.xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
plt.ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
plt.title(&amp;quot;Naive relationship: Sales vs. coupon usage&amp;quot;)
plt.savefig(&amp;quot;fwl_naive_regression.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_naive_regression.png" alt="Scatter plot showing a negative relationship between coupon usage and sales, with a downward-sloping regression line.">
&lt;em>Naive regression: the downward slope suggests coupons reduce sales, but this is driven by confounding from income.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-python">naive_model = smf.ols(&amp;quot;sales ~ coupons&amp;quot;, df).fit()
print(naive_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 37.1906 3.960 9.390 0.000 29.228 45.154
coupons -0.1059 0.116 -0.914 0.365 -0.339 0.127
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>The naive regression suggests that coupons have a &lt;strong>negative&lt;/strong> effect on sales: each additional percentage point of coupon usage is associated with \$106 less in daily sales. However, this coefficient is not statistically significant (p = 0.365), and the 95% confidence interval [-0.339, 0.127] spans both negative and positive values. More importantly, the true effect is +0.2, so this estimate is not just imprecise &amp;mdash; it points in the wrong direction. The confounder (income) is pulling the estimate downward because wealthier neighborhoods use fewer coupons but spend more.&lt;/p>
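&lt;p>The size and direction of this distortion follow a precise rule, the &lt;strong>omitted variable bias&lt;/strong> identity: the naive slope equals the conditional coupon coefficient plus the income coefficient times the slope from regressing income on coupons. The sketch below re-creates the data generating process from the data simulation section (same seed and call order, but without the 2-decimal rounding, so values differ slightly from the tables) and verifies the identity in plain NumPy:&lt;/p>

```python
import numpy as np

# Re-create the DGP from the data simulation section (same seed, same call order)
rng = np.random.default_rng(42)
n = 50
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rng.normal(0, 3, n)

def ols(y, *xs):
    """OLS with intercept; returns the coefficient vector (intercept first)."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_naive = ols(sales, coupons)[1]         # slope from sales ~ coupons
_, b1, b2 = ols(sales, coupons, income)  # full model: coupons and income
delta = ols(income, coupons)[1]          # slope from income ~ coupons

# Omitted variable bias identity: b_naive = b1 + b2 * delta (exact in-sample)
print(f"naive slope:     {b_naive:+.4f}")
print(f"b1 + b2 * delta: {b1 + b2 * delta:+.4f}")
```

&lt;p>Because income is negatively related to coupons ($\delta &lt; 0$) and positively related to sales ($\beta_2 > 0$), the bias term $\beta_2 \delta$ is negative, which is exactly why the naive slope flips sign.&lt;/p>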
&lt;h2 id="controlling-for-income">Controlling for income&lt;/h2>
&lt;p>To block the backdoor path through income, the next step includes it as a control variable in the regression. This is the standard approach in applied work: add the confounder to the right-hand side of the regression equation.&lt;/p>
&lt;pre>&lt;code class="language-python">full_model = smf.ols(&amp;quot;sales ~ coupons + income&amp;quot;, df).fit()
print(full_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.0278 7.181 0.700 0.487 -9.418 19.474
coupons 0.2673 0.120 2.222 0.031 0.025 0.509
income 0.3836 0.076 5.015 0.000 0.230 0.537
==============================================================================
&lt;/code>&lt;/pre>
&lt;p>Controlling for income reverses the picture entirely. The coefficient on coupons is now &lt;strong>+0.2673&lt;/strong> (p = 0.031), indicating that each additional percentage point of coupon usage increases daily sales by about \$267. This is close to the true effect of +0.2, and the 95% confidence interval [0.025, 0.509] no longer includes zero. Income itself has a strong positive effect of +0.3836 (p &amp;lt; 0.001), confirming that wealthier neighborhoods spend more. By conditioning on income, the backdoor path is blocked and the estimate moves much closer to the true causal effect.&lt;/p>
&lt;p>But what is the regression actually &lt;em>doing&lt;/em> when it &amp;ldquo;controls for&amp;rdquo; income? This is where the FWL theorem provides a clear answer.&lt;/p>
&lt;h2 id="the-fwl-theorem">The FWL theorem&lt;/h2>
&lt;p>The Frisch-Waugh-Lovell theorem, first published by Ragnar Frisch and Frederick Waugh in 1933 and later given an elegant proof by Michael Lovell in 1963, provides a precise algebraic decomposition of what multivariate regression does under the hood.&lt;/p>
&lt;p>Consider a linear model with two sets of regressors:&lt;/p>
&lt;p>$$y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i$$&lt;/p>
&lt;p>In words, this equation says that the outcome $y$ (sales) equals the effect $\beta_1$ of the variable of interest $x_1$ (coupons), plus the effect $\beta_2$ of the control variable $x_2$ (income), plus an error term $\varepsilon$. An intercept is omitted here for notational simplicity; it can be folded into the controls $x_2$ without changing the result. In this analysis, $y$ corresponds to the &lt;code>sales&lt;/code> column, $x_1$ to &lt;code>coupons&lt;/code>, and $x_2$ to &lt;code>income&lt;/code>.&lt;/p>
&lt;p>The FWL theorem states that the &lt;strong>Ordinary Least Squares (OLS)&lt;/strong> estimator &amp;mdash; the standard method for fitting a regression line by minimizing squared prediction errors &amp;mdash; $\hat{\beta}_1$ from this multivariate regression is &lt;strong>identical&lt;/strong> to the estimator obtained from a simpler procedure:&lt;/p>
&lt;p>$$\hat{\beta}_1^{FWL} = \frac{\text{Cov}(\tilde{y}, \, \tilde{x}_1)}{\text{Var}(\tilde{x}_1)}$$&lt;/p>
&lt;p>where $\tilde{x}_1$ is the residual from regressing $x_1$ on $x_2$, and $\tilde{y}$ is the residual from regressing $y$ on $x_2$.&lt;/p>
&lt;p>In words, this says: to estimate the effect of coupons while controlling for income, we can (1) remove income&amp;rsquo;s influence from coupons, (2) remove income&amp;rsquo;s influence from sales, and (3) regress the cleaned sales on the cleaned coupons. The resulting coefficient is &lt;strong>exactly&lt;/strong> the same as the one from the full multivariate regression.&lt;/p>
&lt;p>This procedure is called &lt;strong>partialling-out&lt;/strong> because it removes the variation explained by the control variables, keeping only the residual variation &amp;mdash; the part that is &lt;em>orthogonal&lt;/em> to (independent of) income. The three equivalent estimators are:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Full OLS:&lt;/strong> Regress $y$ on $x_1$ and $x_2$ jointly&lt;/li>
&lt;li>&lt;strong>Partial FWL:&lt;/strong> Regress $y$ on $\tilde{x}_1$ (residuals of $x_1$ on $x_2$)&lt;/li>
&lt;li>&lt;strong>Full FWL:&lt;/strong> Regress $\tilde{y}$ on $\tilde{x}_1$ (residuals of both variables on $x_2$)&lt;/li>
&lt;/ol>
&lt;p>All three produce the same $\hat{\beta}_1$. The full FWL (option 3) also recovers essentially correct standard errors, up to a small degrees-of-freedom adjustment discussed below; the partial FWL (option 2) does not.&lt;/p>
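&lt;p>The covariance formula can be verified directly. The sketch below re-creates the data generating process from the data simulation section (same seed and call order, no rounding) and computes $\hat{\beta}_1$ twice in plain NumPy, once as the FWL covariance ratio and once from the full multivariate regression:&lt;/p>

```python
import numpy as np

# Re-create the DGP from the data simulation section (same seed, same call order)
rng = np.random.default_rng(42)
n = 50
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rng.normal(0, 3, n)

def resid(y, x):
    """Residuals from an OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

x1_tilde = resid(coupons, income)  # coupons purged of income
y_tilde = resid(sales, income)     # sales purged of income

# (1) FWL ratio: Cov(y~, x1~) / Var(x1~)
beta_fwl = np.cov(y_tilde, x1_tilde)[0, 1] / np.var(x1_tilde, ddof=1)

# (2) Coupon coefficient from the full multivariate regression
X_full = np.column_stack([np.ones(n), coupons, income])
beta_full = np.linalg.lstsq(X_full, sales, rcond=None)[0][1]

print(beta_fwl, beta_full)  # the two estimates agree to machine precision
```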
&lt;h2 id="verifying-fwl-step-by-step">Verifying FWL step by step&lt;/h2>
&lt;p>Let us verify each step of the theorem using the simulated data.&lt;/p>
&lt;h3 id="step-1-residualize-coupons-only">Step 1: Residualize coupons only&lt;/h3>
&lt;p>First, we regress coupons on income and extract the residuals $\tilde{x}_1$. These residuals represent the variation in coupon usage that &lt;strong>cannot&lt;/strong> be explained by income &amp;mdash; the &amp;ldquo;purified&amp;rdquo; coupon signal. Then we regress sales on these residuals. Because residuals always average to zero by construction (they are &lt;em>mean-zero&lt;/em>), we drop the intercept from this regression.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize coupons with respect to income
df[&amp;quot;coupons_tilde&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().resid
# Regress sales on residualized coupons (no intercept)
fwl_step1 = smf.ols(&amp;quot;sales ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step1.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
---------------------------------------------------------------------------------
coupons_tilde 0.2673 1.271 0.210 0.834 -2.288 2.822
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient is &lt;strong>exactly 0.2673&lt;/strong> &amp;mdash; identical to the full regression. However, the standard error has exploded from 0.120 to 1.271, making the estimate appear insignificant (p = 0.834). This happens because income was only partialled out from coupons but not from sales. The remaining variation in sales due to income inflates the residual variance of the regression, producing artificially large standard errors.&lt;/p>
&lt;h3 id="step-2-residualize-both-variables">Step 2: Residualize both variables&lt;/h3>
&lt;p>To fix the standard errors, we also residualize sales with respect to income. Now both variables have had income&amp;rsquo;s influence removed.&lt;/p>
&lt;pre>&lt;code class="language-python"># Residualize sales with respect to income
df[&amp;quot;sales_tilde&amp;quot;] = smf.ols(&amp;quot;sales ~ income&amp;quot;, df).fit().resid
# Regress residualized sales on residualized coupons (no intercept)
fwl_step2 = smf.ols(&amp;quot;sales_tilde ~ coupons_tilde - 1&amp;quot;, df).fit()
print(fwl_step2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>=================================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
---------------------------------------------------------------------------------
coupons_tilde 0.2673 0.118 2.269 0.028 0.031 0.504
=================================================================================
&lt;/code>&lt;/pre>
&lt;p>The coefficient remains &lt;strong>exactly 0.2673&lt;/strong>, and now the standard error (0.118) and p-value (0.028) are nearly identical to the full regression (SE = 0.120, p = 0.031). The slight difference in standard errors comes from a degrees-of-freedom adjustment &amp;mdash; the full regression uses up extra degrees of freedom to estimate the intercept and the income coefficient (leaving fewer data points for estimating uncertainty), while this univariate residual regression estimates only one parameter. The substantive conclusion is the same: coupons have a significant positive effect on sales after partialling out income.&lt;/p>
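&lt;p>The degrees-of-freedom adjustment can be made concrete. By the theorem, the full model and the residual-on-residual regression share the same residual sum of squares, but the full model divides it by $n - 3$ (intercept plus two slopes) while the FWL regression divides it by $n - 1$ (a single slope), so the standard errors differ by a factor of $\sqrt{(n-1)/(n-3)} \approx 1.02$ at $n = 50$. A NumPy sketch, again re-creating the data generating process from the data simulation section (no rounding, so values differ slightly from the tables):&lt;/p>

```python
import numpy as np

# Re-create the DGP from the data simulation section (same seed, same call order)
rng = np.random.default_rng(42)
n = 50
income = rng.normal(50, 10, n)
dayofweek = rng.integers(1, 8, n)
coupons = 60 - 0.5 * income + rng.normal(0, 5, n)
sales = 10 + 0.2 * coupons + 0.3 * income + 0.5 * dayofweek + rng.normal(0, 3, n)

def resid(y, x):
    """Residuals from an OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

x_t = resid(coupons, income)
y_t = resid(sales, income)

# Both the full model and the FWL regression share this residual sum of squares
beta = (x_t @ y_t) / (x_t @ x_t)
rss = np.sum((y_t - beta * x_t) ** 2)

# They divide it by different residual degrees of freedom:
se_fwl = np.sqrt(rss / (n - 1) / (x_t @ x_t))   # 1 estimated parameter
se_full = np.sqrt(rss / (n - 3) / (x_t @ x_t))  # intercept + 2 slopes

print(se_fwl, se_full, se_full / se_fwl)  # the ratio equals sqrt(49/47)
```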
&lt;h2 id="visualizing-partialling-out">Visualizing partialling-out&lt;/h2>
&lt;p>What does partialling-out actually look like? Regressing coupons on income produces fitted values that form a line through the data. The &lt;strong>residuals&lt;/strong> &amp;mdash; the vertical distances between each point and this line &amp;mdash; represent the coupon variation that income cannot explain.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_hat&amp;quot;] = smf.ols(&amp;quot;coupons ~ income&amp;quot;, df).fit().predict()
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;income&amp;quot;], df[&amp;quot;coupons&amp;quot;], color=STEEL_BLUE, alpha=0.7,
edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores&amp;quot;)
sns.regplot(x=&amp;quot;income&amp;quot;, y=&amp;quot;coupons&amp;quot;, data=df, ci=False, scatter=False,
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.vlines(df[&amp;quot;income&amp;quot;],
np.minimum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
np.maximum(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;coupons_hat&amp;quot;]),
linestyle=&amp;quot;--&amp;quot;, color=NEAR_BLACK, alpha=0.5, linewidth=1,
label=&amp;quot;Residuals&amp;quot;)
ax.set_xlabel(&amp;quot;Neighborhood income (thousands $)&amp;quot;)
ax.set_ylabel(&amp;quot;Coupon usage (%)&amp;quot;)
ax.set_title(&amp;quot;Partialling-out: removing income's effect on coupons&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_residuals_income.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_residuals_income.png" alt="Scatter plot of coupon usage versus income with a downward-sloping fitted line and vertical dashed lines showing residuals for each store.">
&lt;em>Partialling-out: the dashed lines are the residuals &amp;mdash; the coupon variation that income cannot explain.&lt;/em>&lt;/p>
&lt;p>The downward-sloping fitted line confirms that higher-income neighborhoods use fewer coupons. The vertical dashed lines are the residuals &amp;mdash; the part of coupon usage that income does not predict. Some stores use more coupons than their neighborhood income would suggest (positive residuals), and others use fewer (negative residuals). Partialling out income keeps only these residuals, effectively asking: &amp;ldquo;Among stores with similar income levels, which ones have unusually high or low coupon usage?&amp;rdquo;&lt;/p>
&lt;h2 id="the-conditional-relationship-revealed">The conditional relationship revealed&lt;/h2>
&lt;p>It is now possible to plot the relationship that the multivariate regression captures but cannot directly display: residualized sales against residualized coupons. Both variables have had income&amp;rsquo;s influence removed, so any remaining relationship is the &lt;strong>conditional&lt;/strong> effect of coupons on sales &amp;mdash; the effect after accounting for income differences.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde&amp;quot;], df[&amp;quot;sales_tilde&amp;quot;], color=STEEL_BLUE,
alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60, label=&amp;quot;Stores (residualized)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde&amp;quot;, y=&amp;quot;sales_tilde&amp;quot;, data=df, ci=False, scatter=False,
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Residual coupon usage&amp;quot;)
ax.set_ylabel(&amp;quot;Residual sales&amp;quot;)
ax.set_title(&amp;quot;Conditional relationship after partialling-out income&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_partialled_out.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_partialled_out.png" alt="Scatter plot showing a positive relationship between residualized coupon usage and residualized sales, with an upward-sloping regression line.">
&lt;em>After removing income&amp;rsquo;s influence from both variables, the true positive effect of coupons on sales emerges.&lt;/em>&lt;/p>
&lt;p>The positive slope is now clearly visible. Stripping away the confounding influence of income reveals that stores where coupon usage is higher than expected (given their neighborhood income) tend to also have sales that are higher than expected. The slope of this line is exactly 0.2673 &amp;mdash; the same coefficient produced by the full multivariate regression.&lt;/p>
&lt;h2 id="scaling-for-interpretability">Scaling for interpretability&lt;/h2>
&lt;p>One drawback of the partialled-out plot is that both axes show residuals centered around zero, which makes the magnitudes hard to interpret. A negative coupon value of -5 does not mean the store has -5% coupon usage &amp;mdash; it means coupon usage is 5 percentage points below what income alone would predict.&lt;/p>
&lt;p>Adding the sample mean back to each residualized variable fixes this. The shift moves the axes without changing the slope.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;coupons_tilde_scaled&amp;quot;] = df[&amp;quot;coupons_tilde&amp;quot;] + df[&amp;quot;coupons&amp;quot;].mean()
df[&amp;quot;sales_tilde_scaled&amp;quot;] = df[&amp;quot;sales_tilde&amp;quot;] + df[&amp;quot;sales&amp;quot;].mean()
# Verify the coefficient is unchanged
scaled_model = smf.ols(&amp;quot;sales_tilde_scaled ~ coupons_tilde_scaled&amp;quot;, df).fit()
print(scaled_model.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>========================================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 24.5585 4.053 6.059 0.000 16.409 32.708
coupons_tilde_scaled 0.2673 0.119 2.246 0.029 0.028 0.507
========================================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
color=STEEL_BLUE, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60,
label=&amp;quot;Stores (residualized + scaled)&amp;quot;)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
ci=False, scatter=False,
line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2, &amp;quot;label&amp;quot;: &amp;quot;Linear fit&amp;quot;}, ax=ax)
ax.set_xlabel(&amp;quot;Coupon usage (%, residualized + mean)&amp;quot;)
ax.set_ylabel(&amp;quot;Daily sales (thousands $, residualized + mean)&amp;quot;)
ax.set_title(&amp;quot;Scaled residuals: interpretable magnitudes&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;fwl_scaled_residuals.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_scaled_residuals.png" alt="Scatter plot of scaled residualized sales versus scaled residualized coupon usage, with axes now showing values in the original units centered around their means.">
&lt;em>Adding the sample means back to the residuals restores interpretable units without changing the slope.&lt;/em>&lt;/p>
&lt;p>The coefficient remains exactly 0.2673 (p = 0.029), confirming that adding the means back does not alter the estimated relationship. Now the axes are in interpretable units: coupon usage around 34% and daily sales around \$33,600. This scaled version is ideal for presentations and reports where the audience needs to understand both the direction and the magnitude of the conditional relationship at a glance.&lt;/p>
&lt;h2 id="extending-to-multiple-controls">Extending to multiple controls&lt;/h2>
&lt;p>The FWL theorem works with &lt;strong>any number&lt;/strong> of control variables, not just one. To demonstrate, the next step adds &lt;code>dayofweek&lt;/code> as a second control alongside income. The theorem says both controls can be partialled out simultaneously and the same coefficient on coupons will emerge.&lt;/p>
&lt;pre>&lt;code class="language-python"># Full regression with both controls
full_model_2 = smf.ols(&amp;quot;sales ~ coupons + income + dayofweek&amp;quot;, df).fit()
print(full_model_2.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>==============================================================================
coef std err t P&amp;gt;|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.9825 7.172 0.555 0.581 -10.454 18.419
coupons 0.2706 0.119 2.266 0.028 0.030 0.511
income 0.3774 0.076 4.961 0.000 0.224 0.531
dayofweek 0.3195 0.245 1.306 0.198 -0.173 0.812
==============================================================================
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># FWL: partial out both income and dayofweek
df[&amp;quot;coupons_tilde_2&amp;quot;] = smf.ols(&amp;quot;coupons ~ income + dayofweek&amp;quot;, df).fit().resid
df[&amp;quot;sales_tilde_2&amp;quot;] = smf.ols(&amp;quot;sales ~ income + dayofweek&amp;quot;, df).fit().resid
fwl_multi = smf.ols(&amp;quot;sales_tilde_2 ~ coupons_tilde_2 - 1&amp;quot;, df).fit()
print(fwl_multi.summary().tables[1])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>===================================================================================
                      coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
coupons_tilde_2     0.2706      0.116      2.338      0.023       0.038       0.503
===================================================================================
&lt;/code>&lt;/pre>
&lt;p>With both controls, the full regression gives a coupon coefficient of 0.2706 (p = 0.028). The FWL procedure &amp;mdash; partialling out income and day of week from both sales and coupons &amp;mdash; yields the &lt;strong>identical&lt;/strong> coefficient of 0.2706 (p = 0.023). The day-of-week effect itself (0.3195, p = 0.198) is not statistically significant in this sample, but including it shifts the coupon estimate slightly, from 0.2673 to 0.2706, by absorbing additional residual variance. This confirms that FWL scales to any number of controls.&lt;/p>
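&lt;p>The same equivalence can be checked in pure linear algebra with the residual-maker (&amp;ldquo;annihilator&amp;rdquo;) matrix $M = I - W(W&amp;rsquo;W)^{-1}W&amp;rsquo;$, where $W$ holds the controls. A self-contained sketch on hypothetical data (not the tutorial&amp;rsquo;s &lt;code>df&lt;/code>):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50

# Hypothetical data mirroring the tutorial's structure
income = rng.normal(60, 10, n)
day = rng.integers(0, 7, n).astype(float)
coupons = 35 - 0.2 * income + rng.normal(0, 5, n)
sales = 5 + 0.27 * coupons + 0.4 * income + 0.3 * day + rng.normal(0, 3, n)

# Full design: intercept + coupons + both controls
X_full = np.column_stack([np.ones(n), coupons, income, day])
beta_full = np.linalg.lstsq(X_full, sales, rcond=None)[0]

# FWL: residualize coupons and sales on the controls (including the intercept)
W = np.column_stack([np.ones(n), income, day])
M = np.eye(n) - W @ np.linalg.pinv(W)  # residual-maker matrix
b_fwl = (M @ coupons) @ (M @ sales) / ((M @ coupons) @ (M @ coupons))

print(np.isclose(beta_full[1], b_fwl))  # True: identical coupon coefficient
```

&lt;p>Because $M$ is the same projection regardless of how many columns $W$ has, the identity extends mechanically to any number of controls.&lt;/p>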
&lt;h2 id="naive-vs-conditional-the-full-picture">Naive vs. conditional: the full picture&lt;/h2>
&lt;p>To appreciate how much the FWL procedure changes the conclusions, the next figure places the naive and conditional relationships side by side. The left panel shows the raw data; the right panel shows the same data after partialling out income.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: naive relationship
axes[0].scatter(df[&amp;quot;coupons&amp;quot;], df[&amp;quot;sales&amp;quot;], color=STEEL_BLUE, alpha=0.7,
                edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons&amp;quot;, y=&amp;quot;sales&amp;quot;, data=df, ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[0])
axes[0].set_xlabel(&amp;quot;Coupon usage (%)&amp;quot;)
axes[0].set_ylabel(&amp;quot;Daily sales (thousands $)&amp;quot;)
axes[0].set_title(&amp;quot;Naive (no controls)&amp;quot;)
# Right: after partialling-out income
axes[1].scatter(df[&amp;quot;coupons_tilde_scaled&amp;quot;], df[&amp;quot;sales_tilde_scaled&amp;quot;],
                color=TEAL, alpha=0.7, edgecolors=&amp;quot;white&amp;quot;, s=60)
sns.regplot(x=&amp;quot;coupons_tilde_scaled&amp;quot;, y=&amp;quot;sales_tilde_scaled&amp;quot;, data=df,
            ci=False, scatter=False,
            line_kws={&amp;quot;color&amp;quot;: WARM_ORANGE, &amp;quot;linewidth&amp;quot;: 2}, ax=axes[1])
axes[1].set_xlabel(&amp;quot;Coupon usage (%, after partialling-out)&amp;quot;)
axes[1].set_ylabel(&amp;quot;Daily sales (thousands $, after partialling-out)&amp;quot;)
axes[1].set_title(&amp;quot;After partialling-out income (FWL)&amp;quot;)
plt.suptitle(&amp;quot;The FWL theorem reveals the true relationship&amp;quot;,
             fontsize=14, fontweight=&amp;quot;bold&amp;quot;, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;fwl_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="fwl_comparison.png" alt="Two-panel figure comparing the naive negative relationship between sales and coupons on the left with the positive conditional relationship after partialling-out income on the right.">
&lt;em>Simpson&amp;rsquo;s paradox resolved: the naive negative slope (left) reverses to a positive slope (right) after partialling out income.&lt;/em>&lt;/p>
&lt;p>The contrast is striking. On the left, the naive analysis suggests a negative relationship (slope = -0.106) &amp;mdash; coupons appear to hurt sales. On the right, after removing income&amp;rsquo;s confounding influence, the true positive relationship emerges (slope = +0.267). This is a textbook example of &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong>: a trend that appears in aggregate data reverses when the data is properly conditioned on a relevant variable.&lt;/p>
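&lt;p>The size of this reversal is governed by the omitted-variable-bias identity: the naive slope equals the controlled slope plus (income&amp;rsquo;s coefficient in the full model) times (the slope from regressing income on coupons). This identity holds exactly in-sample for OLS. A sketch with hypothetical numbers (true effect 0.2, income both raising sales and suppressing coupon use):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical DGP: income raises sales but suppresses coupon use
income = rng.normal(60, 10, n)
coupons = 50 - 0.5 * income + rng.normal(0, 3, n)
sales = 10 + 0.2 * coupons + 0.4 * income + rng.normal(0, 2, n)

def coefs(y, *xs):
    """OLS coefficients (excluding the intercept) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

naive = coefs(sales, coupons)[0]             # biased slope, no controls
beta, gamma = coefs(sales, coupons, income)  # controlled slope + income effect
delta = coefs(income, coupons)[0]            # auxiliary regression of income on coupons

# Omitted-variable-bias identity: naive = beta + gamma * delta (exact in-sample)
print(np.isclose(naive, beta + gamma * delta))  # True
print(naive < 0 < beta)                         # True: the sign reverses
```

&lt;p>The bias term $\gamma \delta$ is negative here because income helps sales ($\gamma &gt; 0$) but is negatively associated with coupons ($\delta &lt; 0$), which is exactly the mechanism behind the reversal in the figure.&lt;/p>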
&lt;h2 id="summary-of-results">Summary of results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Coupons coefficient&lt;/th>
&lt;th>Std. error&lt;/th>
&lt;th>p-value&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive OLS (no controls)&lt;/td>
&lt;td>-0.1059&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.365&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.120&lt;/td>
&lt;td>0.031&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 1 (residualize X only)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>1.271&lt;/td>
&lt;td>0.834&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL Step 2 (residualize both)&lt;/td>
&lt;td>+0.2673&lt;/td>
&lt;td>0.118&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full OLS (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.119&lt;/td>
&lt;td>0.028&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FWL (+ income + day)&lt;/td>
&lt;td>+0.2706&lt;/td>
&lt;td>0.116&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>All FWL variants produce the same coefficient as the corresponding full regression, confirming the theorem. The coefficient of +0.267 is close to the true DGP value of +0.200, with the difference attributable to finite-sample noise in 50 observations.&lt;/p>
&lt;h2 id="applications-of-the-fwl-theorem">Applications of the FWL theorem&lt;/h2>
&lt;p>The FWL theorem is not just a mathematical curiosity &amp;mdash; it has practical applications across several domains.&lt;/p>
&lt;h3 id="data-visualization">Data visualization&lt;/h3>
&lt;p>As shown above, FWL makes it possible to plot the conditional relationship between two variables after controlling for confounders. This is invaluable when presenting regression results to non-technical audiences who understand scatter plots but not regression tables with multiple coefficients.&lt;/p>
&lt;h3 id="computational-efficiency">Computational efficiency&lt;/h3>
&lt;p>When a regression includes &lt;strong>high-dimensional fixed effects&lt;/strong> &amp;mdash; for example, year, industry, and country dummies that could add hundreds of columns &amp;mdash; computing the full regression becomes expensive. The FWL theorem allows software to partial out these fixed effects first, reducing the problem to a much smaller regression. Widely used packages that exploit this strategy include:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://scorreia.com/software/reghdfe/" target="_blank" rel="noopener">reghdfe&lt;/a> in Stata&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/web/packages/fixest/index.html" target="_blank" rel="noopener">fixest&lt;/a> in R&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest&lt;/a> in Python &amp;mdash; a fast, user-friendly package for fixed-effects regression (including multi-way clustering and interaction effects), inspired by fixest&amp;rsquo;s R API&lt;/li>
&lt;li>&lt;a href="https://pyhdfe.readthedocs.io/en/stable/index.html" target="_blank" rel="noopener">pyhdfe&lt;/a> in Python&lt;/li>
&lt;/ul>
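&lt;p>The trick these packages exploit is pure FWL: partialling out a full set of group dummies is equivalent to subtracting group means (the &amp;ldquo;within&amp;rdquo; transformation), which costs far less than inverting a design matrix with hundreds of dummy columns. A minimal numpy sketch on hypothetical data:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n, G = 300, 30
g = rng.integers(0, G, n)           # hypothetical group (fixed-effect) labels
x = rng.normal(size=n) + 0.05 * g   # regressor correlated with the groups
y = 2.0 * x + 0.1 * g + rng.normal(size=n)

def demean(v, g, G):
    """Subtract each observation's group mean (the within transformation)."""
    means = np.bincount(g, weights=v, minlength=G) / np.bincount(g, minlength=G)
    return v - means[g]

# FWL route: demean both variables, then run one univariate regression
x_t, y_t = demean(x, g, G), demean(y, g, G)
b_within = (x_t @ y_t) / (x_t @ x_t)

# Brute-force route: regress y on x plus all G dummy columns
X_full = np.column_stack([x, np.eye(G)[g]])
b_dummies = np.linalg.lstsq(X_full, y, rcond=None)[0][0]
print(np.isclose(b_within, b_dummies))  # True
```

&lt;p>With several fixed effects at once, the packages above iterate this demeaning until convergence rather than building the dummy matrix at all.&lt;/p>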
&lt;h3 id="machine-learning-and-causal-inference">Machine learning and causal inference&lt;/h3>
&lt;p>Perhaps the most impactful modern application is &lt;strong>Double Machine Learning (DML)&lt;/strong>, developed by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). DML extends the FWL logic by replacing the OLS regressions in the partialling-out step with &lt;strong>flexible machine learning models&lt;/strong> (random forests, lasso, neural networks). This allows the control variables to have complex, nonlinear effects on both the treatment and the outcome &amp;mdash; while still recovering a valid causal estimate of the treatment effect.&lt;/p>
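&lt;p>A bare-bones sketch of that idea &amp;mdash; not the full DML estimator with its inference machinery, just FWL with machine learners and cross-fitting in the partialling-out step &amp;mdash; using scikit-learn random forests on hypothetical data with a true treatment effect of 0.5:&lt;/p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
w = rng.normal(size=(n, 3))                    # observed controls
d = np.sin(w[:, 0]) + rng.normal(0, 0.5, n)    # treatment, nonlinear in w
y = 0.5 * d + np.cos(w[:, 1]) + rng.normal(0, 0.5, n)

# FWL-style partialling out, with ML + cross-fitting in place of OLS
rf = lambda: RandomForestRegressor(n_estimators=100, random_state=0)
d_res = d - cross_val_predict(rf(), w, d, cv=5)
y_res = y - cross_val_predict(rf(), w, y, cv=5)

# Final stage: residual-on-residual slope recovers the treatment effect
theta = (d_res @ y_res) / (d_res @ d_res)
print(f"estimated effect: {theta:.2f} (true 0.5)")
```

&lt;p>A plain OLS partialling-out would misspecify the nonlinear &lt;code>sin&lt;/code> and &lt;code>cos&lt;/code> relationships here; the forests learn them from the data, and cross-fitting prevents each observation&amp;rsquo;s own fit from contaminating its residual.&lt;/p>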
&lt;p>If you want to see DML in action, check out the companion tutorial on &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Introduction to Causal Inference: Double Machine Learning&lt;/a>, which applies the partialling-out estimator to a real randomized experiment.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>This tutorial set out to answer a simple question: what does it mean to &amp;ldquo;control for&amp;rdquo; a variable in regression, and how can the result be visualized? The FWL theorem provides a definitive answer. Controlling for income in a regression of sales on coupons is equivalent to removing income&amp;rsquo;s influence from both variables and then regressing the residuals.&lt;/p>
&lt;p>In the simulated retail scenario, failing to control for income produced a misleading negative coefficient of -0.106, suggesting coupons reduce sales. After partialling out income, the coefficient reversed to +0.267 (p = 0.031), revealing that coupons genuinely increase sales by about \$267 per percentage point. This estimate is close to the true data-generating parameter of +0.200, with the gap attributable to sampling variability in just 50 stores.&lt;/p>
&lt;p>For a practitioner &amp;mdash; say, the marketing director of the retail chain &amp;mdash; the takeaway is clear. An analysis that ignored neighborhood income would conclude the coupon program was counterproductive. The FWL-based analysis shows it works, and provides a plot that makes this case visually compelling. The theorem bridges the gap between the numbers in a regression table and the intuitive two-variable scatter plot.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sign reversal.&lt;/strong> The naive coupon coefficient was -0.106 (negative, not significant). After controlling for income, it became +0.267 (positive, p = 0.031). Ignoring confounders can reverse not just the magnitude but the direction of an estimated effect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exact equivalence.&lt;/strong> The FWL procedure produced a coefficient of 0.2673 &amp;mdash; identical to the full multivariate regression down to four decimal places &amp;mdash; whether partialling out one control (income) or two (income + day of week). The theorem is not an approximation; it is an algebraic identity.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Visualization power.&lt;/strong> FWL reduces any multivariate regression to a univariate one, enabling scatter plots that display conditional relationships. This is especially valuable for communicating results to non-technical stakeholders.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Foundation for DML.&lt;/strong> FWL underpins modern causal inference methods like Double Machine Learning, where flexible ML learners replace OLS in the partialling-out step. Understanding FWL is a prerequisite for understanding DML.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Linearity assumption matters.&lt;/strong> The FWL procedure relies on OLS residualization, which assumes linear relationships between the controls and both the treatment and outcome. If income affects coupons or sales nonlinearly, OLS residuals will not fully remove the confounding &amp;mdash; motivating methods like DML that replace OLS with flexible learners.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data is simulated with a known linear DGP. In real data, the DGP is unknown and may be nonlinear, requiring methods like DML rather than plain OLS.&lt;/li>
&lt;li>The FWL theorem assumes a correctly specified linear model. If the relationship between income and coupons (or sales) is nonlinear, OLS residualization will not fully remove the confounding.&lt;/li>
&lt;li>With only 50 observations, the estimates have wide confidence intervals. Larger samples would sharpen the estimates.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Double Machine Learning&lt;/a> to learn how FWL extends to nonlinear settings.&lt;/li>
&lt;li>See &lt;a href="https://carlos-mendez.org/post/python_dowhy/">Introduction to Causal Inference: The DoWhy Approach&lt;/a> for a full causal inference workflow with real data.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sample size sensitivity.&lt;/strong> Change &lt;code>N&lt;/code> from 50 to 500 in the &lt;code>simulate_store_data()&lt;/code> function. How do the naive and FWL coefficients change? How do the standard errors shrink? Is the naive estimate still misleading with a larger sample?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Nonlinear confounding.&lt;/strong> Modify the DGP so that income affects coupons nonlinearly: &lt;code>coupons = 60 - 0.01 * income**2 + noise&lt;/code>. Does the FWL procedure (with linear OLS residualization) still recover the true coefficient? Why or why not?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Real data application.&lt;/strong> Pick a dataset with a known confounder (e.g., the wage-education-ability relationship) and apply the FWL procedure. Visualize the naive and conditional relationships side by side.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://towardsdatascience.com/the-fwl-theorem-or-how-to-make-all-regressions-intuitive-59f801eb3299/" target="_blank" rel="noopener">Courthoud, M. (2022). Understanding the Frisch-Waugh-Lovell Theorem. &lt;em>Towards Data Science&lt;/em>.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1907330" target="_blank" rel="noopener">Frisch, R. and Waugh, F. V. (1933). Partial Time Regressions as Compared with Individual Trends. &lt;em>Econometrica&lt;/em>, 1(4), 387&amp;ndash;401.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10480682" target="_blank" rel="noopener">Lovell, M. C. (1963). Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis. &lt;em>Journal of the American Statistical Association&lt;/em>, 58(304), 993&amp;ndash;1010.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/ectj/article/21/1/C1/5056401" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://academic.oup.com/restud/article-abstract/81/2/608/1523757" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pyfixest.org/pyfixest.html" target="_blank" rel="noopener">pyfixest &amp;mdash; Fast Estimation of Fixed-Effects Models in Python&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.statsmodels.org/stable/index.html" target="_blank" rel="noopener">statsmodels &amp;mdash; Statistical Modeling in Python&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to Partial Identification: Bounding Causal Effects Under Unmeasured Confounding</title><link>https://carlos-mendez.org/post/python_partial_identification/</link><pubDate>Fri, 13 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_partial_identification/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Does a job training program actually help workers find jobs, or could an unmeasured factor &amp;ndash; like prior work experience &amp;ndash; explain the entire observed association? In standard causal inference with methods like Double Machine Learning or DoWhy, we assume that all confounders are observed. But what if that assumption fails? Rather than abandoning causal analysis entirely, &lt;strong>partial identification&lt;/strong> offers an honest alternative: instead of estimating a single number, we compute &lt;em>bounds&lt;/em> &amp;ndash; a range of values that the true causal effect must lie within, given only minimal assumptions.&lt;/p>
&lt;p>Think of it this way. If someone tells you that $x + y = 10$ and $y = 6$, you know $x = 4$ exactly &amp;ndash; that is &lt;strong>point identification&lt;/strong>. But if they only tell you that $y$ is somewhere between 4 and 7, you can still say $x$ is between 3 and 6. You have not pinned down $x$ exactly, but you have ruled out many values. That is &lt;strong>partial identification&lt;/strong>: credible uncertainty over incredible certainty.&lt;/p>
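&lt;p>The toy example above is exactly how bounds propagate through an equation. A two-line sketch:&lt;/p>

```python
# Toy partial identification: x + y = 10, and all we know is 4 <= y <= 7
y_lo, y_hi = 4, 7
x_lo, x_hi = 10 - y_hi, 10 - y_lo  # bounds on x come from inverting bounds on y
print(x_lo, x_hi)  # 3 6
```

&lt;p>Manski bounds apply the same logic, with the unobserved counterfactual probabilities playing the role of $y$ and its logical limits of 0 and 1 playing the role of the interval $[4, 7]$.&lt;/p>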
&lt;p>In this tutorial we simulate an observational study where an unmeasured confounder biases the naive estimate, then compute &lt;strong>Manski bounds&lt;/strong> (the widest possible bounds under minimal assumptions), &lt;strong>entropy-based bounds&lt;/strong> (tighter bounds using information-theoretic constraints), and &lt;strong>Tian-Pearl bounds&lt;/strong> for the Probability of Necessity and Sufficiency. We use the &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">CausalBoundingEngine&lt;/a> Python package, which provides a unified framework for multiple bounding methods.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why unmeasured confounders invalidate point identification and when partial identification is the appropriate response&lt;/li>
&lt;li>Implement Manski (worst-case) bounds for the Average Treatment Effect using the algebra of observable probabilities&lt;/li>
&lt;li>Compute Tian-Pearl bounds for the Probability of Necessity and Sufficiency (PNS)&lt;/li>
&lt;li>Compare multiple bounding methods to see how additional assumptions tighten bounds&lt;/li>
&lt;li>Assess whether bounds are informative enough for practical decision-making&lt;/li>
&lt;/ul>
&lt;h2 id="the-identification-problem">The Identification Problem&lt;/h2>
&lt;h3 id="point-identification-vs-partial-identification">Point identification vs. partial identification&lt;/h3>
&lt;p>Most causal inference methods produce a single estimate of the treatment effect &amp;ndash; a &lt;strong>point estimate&lt;/strong>. This requires strong assumptions. For example, regression adjustment assumes that all variables affecting both treatment and outcome are included in the model. Double Machine Learning assumes &lt;em>conditional ignorability&lt;/em> &amp;ndash; that treatment is as good as randomly assigned once we condition on observed covariates. These assumptions are untestable: we can never verify from the data alone that no important variable was left out.&lt;/p>
&lt;p>&lt;strong>Partial identification&lt;/strong> relaxes these assumptions. Instead of requiring &amp;ldquo;no unmeasured confounders,&amp;rdquo; it asks: &amp;ldquo;What can we learn about the causal effect using only the data we observe, without assuming confounders away?&amp;rdquo; The answer is a range of values &amp;ndash; called the &lt;strong>identified set&lt;/strong> or &lt;strong>bounds&lt;/strong> &amp;ndash; consistent with the data and the weaker assumptions. Any value outside this range can be rejected; any value inside it remains plausible.&lt;/p>
&lt;p>The key estimand we target is the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>:&lt;/p>
&lt;p>$$\text{ATE} = E[Y(1)] - E[Y(0)]$$&lt;/p>
&lt;p>In words, the ATE is the difference between the average outcome if everyone were treated and the average outcome if no one were treated. Here $Y(1)$ is the potential outcome under treatment (getting a job if trained) and $Y(0)$ is the potential outcome without treatment (getting a job without training). We never observe both potential outcomes for the same person &amp;ndash; this is the &lt;strong>fundamental problem of causal inference&lt;/strong> &amp;ndash; so we must rely on assumptions to bridge the gap between what we observe and what we want to know.&lt;/p>
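&lt;p>The fundamental problem can be made concrete in a few lines of simulation. Here both potential outcomes are generated for every worker &amp;mdash; something only a simulation can do &amp;mdash; with hypothetical job probabilities of 0.59 under treatment and 0.32 without (chosen to echo the DGP used later):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical potential outcomes, generated for every unit
y1 = rng.binomial(1, 0.59, n)     # outcome if trained
y0 = rng.binomial(1, 0.32, n)     # outcome if not trained
true_ate = y1.mean() - y0.mean()  # knowable only because we simulated both

# In any real dataset, treatment reveals one potential outcome and hides the other
x = rng.binomial(1, 0.4, n)       # randomized here, so no confounding
y_obs = np.where(x == 1, y1, y0)
diff_in_means = y_obs[x == 1].mean() - y_obs[x == 0].mean()
print(round(true_ate, 3), round(diff_in_means, 3))
```

&lt;p>Because treatment is randomized in this sketch, the difference in means recovers the ATE; the confounded assignment simulated below is precisely what breaks that equality.&lt;/p>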
&lt;h3 id="the-confounded-scenario">The confounded scenario&lt;/h3>
&lt;p>In our case study, a job training program ($X$) may cause workers to find jobs ($Y$), but prior work experience ($U$) also affects both who enrolls in training and who gets hired. The causal diagram below shows these relationships &amp;ndash; each arrow represents a direct causal influence from one variable to another. Because $U$ is unmeasured, we cannot block the &lt;strong>backdoor path&lt;/strong> $X \leftarrow U \rightarrow Y$ &amp;ndash; an indirect route from treatment to outcome through a common cause that creates a spurious association. The &lt;strong>backdoor criterion&lt;/strong> says that if we could condition on all variables along such paths, we could identify the causal effect. Since $U$ is unobserved, the criterion fails and standard causal methods will produce biased estimates.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
U[&amp;quot;U&amp;lt;br/&amp;gt;(Prior Experience)&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Unmeasured&amp;lt;/i&amp;gt;&amp;quot;] --&amp;gt;|&amp;quot;affects enrollment&amp;quot;| X[&amp;quot;X&amp;lt;br/&amp;gt;(Job Training)&amp;quot;]
U --&amp;gt;|&amp;quot;affects hiring&amp;quot;| Y[&amp;quot;Y&amp;lt;br/&amp;gt;(Got a Job)&amp;quot;]
X --&amp;gt;|&amp;quot;causal effect&amp;lt;br/&amp;gt;(what we want)&amp;quot;| Y
style U fill:#999999,stroke:#141413,color:#fff,stroke-dasharray: 5 5
style X fill:#6a9bcc,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The dashed border on $U$ signals it is unmeasured. Because we cannot condition on $U$, the backdoor criterion fails and &lt;strong>point identification is impossible&lt;/strong>. This is precisely when partial identification becomes valuable: we can still bound the causal effect using only the observable joint distribution of $X$ and $Y$. The next section sets up our simulated data so we can see exactly how this works.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and Imports&lt;/h2>
&lt;p>We use &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">CausalBoundingEngine&lt;/a>, a Python package that provides a unified interface for applying and comparing multiple causal bounding methods. Install it with &lt;code>pip install causalboundingengine&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import matplotlib.pyplot as plt
import time
from causalboundingengine.scenarios import BinaryConf
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Configuration
N = 1000 # Number of simulated workers
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
HEADING_BLUE = &amp;quot;#1a3a8a&amp;quot;
&lt;/code>&lt;/pre>
&lt;h2 id="data-simulation">Data Simulation&lt;/h2>
&lt;p>We simulate an observational study where 1,000 workers either receive job training ($X = 1$) or not ($X = 0$), and we observe whether they get a job within six months ($Y = 1$) or not ($Y = 0$). An unmeasured confounder &amp;ndash; prior work experience ($U$) &amp;ndash; affects both who enrolls in training and who gets hired, creating genuine confounding. The data-generating process has two parts. First, treatment assignment depends on the confounder:&lt;/p>
&lt;p>$$P(X_i = 1) = 0.3 + 0.4 \, U_i$$&lt;/p>
&lt;p>Workers with prior experience ($U = 1$) have a 70% chance of enrolling in training, while inexperienced workers ($U = 0$) have only a 30% chance. This creates confounding: the treated group is enriched with experienced workers who would have found jobs regardless. Second, the outcome depends on training, experience, and their interaction:&lt;/p>
&lt;p>$$P(Y_i = 1) = \text{clip}\big(0.2 + 0.3 \, X_i + 0.4 \, U_i - 0.1 \, X_i U_i, \; 0, \; 1\big)$$&lt;/p>
&lt;p>In words, the probability of getting a job depends on training (a positive effect of 0.3), prior experience (a positive effect of 0.4), and a small negative interaction (workers with prior experience benefit slightly less from training). We reveal these equations so we can compute the &lt;em>true&lt;/em> ATE and verify that our bounds contain it.&lt;/p>
&lt;pre>&lt;code class="language-python"># Unmeasured confounder: prior work experience (30% prevalence)
U = np.random.binomial(1, 0.3, N)
# Treatment: enrollment depends on experience (confounded assignment)
X_prob = 0.3 + 0.4 * U # P(X=1|U=0)=0.3, P(X=1|U=1)=0.7
X = np.random.binomial(1, X_prob, N)
# Outcome probability depends on X, U, and their interaction
Y_prob = np.clip(0.2 + 0.3 * X + 0.4 * U - 0.1 * X * U, 0, 1)
Y = np.random.binomial(1, Y_prob) # Outcome: got a job
# Summary statistics
print(f&amp;quot;Dataset: {N} simulated workers&amp;quot;)
print(f&amp;quot;Treatment (X): {X.sum()} trained ({X.mean():.1%})&amp;quot;)
print(f&amp;quot;Outcome (Y): {Y.sum()} got a job ({Y.mean():.1%})&amp;quot;)
# Contingency table
n_00 = ((X == 0) &amp;amp; (Y == 0)).sum()
n_01 = ((X == 0) &amp;amp; (Y == 1)).sum()
n_10 = ((X == 1) &amp;amp; (Y == 0)).sum()
n_11 = ((X == 1) &amp;amp; (Y == 1)).sum()
print(f&amp;quot;\nContingency Table:&amp;quot;)
print(f&amp;quot;{'':&amp;gt;15} {'Y=0':&amp;gt;8} {'Y=1':&amp;gt;8} {'Total':&amp;gt;8}&amp;quot;)
print(f&amp;quot;{'X=0 (Control)':&amp;gt;15} {n_00:&amp;gt;8} {n_01:&amp;gt;8} {n_00+n_01:&amp;gt;8}&amp;quot;)
print(f&amp;quot;{'X=1 (Trained)':&amp;gt;15} {n_10:&amp;gt;8} {n_11:&amp;gt;8} {n_10+n_11:&amp;gt;8}&amp;quot;)
print(f&amp;quot;{'Total':&amp;gt;15} {n_00+n_10:&amp;gt;8} {n_01+n_11:&amp;gt;8} {N:&amp;gt;8}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset: 1000 simulated workers
Treatment (X): 401 trained (40.1%)
Outcome (Y): 407 got a job (40.7%)
Contingency Table:
                     Y=0      Y=1    Total
  X=0 (Control)      447      152      599
  X=1 (Trained)      146      255      401
          Total      593      407     1000
&lt;/code>&lt;/pre>
&lt;p>Our simulated dataset has 1,000 workers with an uneven treatment split: 401 received training while 599 did not. The split reflects the confounded assignment mechanism &amp;ndash; experienced workers (who are more likely to get hired anyway) disproportionately enroll in training. Overall, 40.7% of workers found jobs. The contingency table reveals that 255 of the 401 trained workers got jobs (63.6%) compared to 152 of the 599 untrained workers (25.4%). This raw difference of 38.2 percentage points overstates the true causal effect because the treated group is enriched with experienced workers.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory Data Analysis&lt;/h2>
&lt;p>Before computing bounds, we visualize the observed conditional probabilities &amp;ndash; the job rates for trained and untrained workers. This is what we can directly observe in the data.&lt;/p>
&lt;pre>&lt;code class="language-python">P_Y1_X1 = Y[X == 1].mean() # P(Y=1 | X=1)
P_Y1_X0 = Y[X == 0].mean() # P(Y=1 | X=0)
naive_ate = P_Y1_X1 - P_Y1_X0
print(f&amp;quot;P(Y=1 | X=1) = {P_Y1_X1:.4f} (trained workers who got jobs)&amp;quot;)
print(f&amp;quot;P(Y=1 | X=0) = {P_Y1_X0:.4f} (untrained workers who got jobs)&amp;quot;)
fig, ax = plt.subplots(figsize=(7, 5))
groups = [&amp;quot;No Training\n(X = 0)&amp;quot;, &amp;quot;Training\n(X = 1)&amp;quot;]
probs = [P_Y1_X0, P_Y1_X1]
colors = [STEEL_BLUE, WARM_ORANGE]
bars = ax.bar(groups, probs, color=colors, width=0.5,
              edgecolor=NEAR_BLACK, linewidth=0.8)
for bar, prob in zip(bars, probs):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f&amp;quot;{prob:.1%}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;, fontsize=13,
            fontweight=&amp;quot;bold&amp;quot;, color=NEAR_BLACK)
# Annotate the naive ATE gap between bars
ax.annotate(&amp;quot;&amp;quot;, xy=(1, P_Y1_X1), xytext=(0, P_Y1_X0),
            arrowprops=dict(arrowstyle=&amp;quot;&amp;lt;-&amp;gt;&amp;quot;, color=NEAR_BLACK, lw=1.5))
ax.text(0.5, (P_Y1_X1 + P_Y1_X0) / 2, f&amp;quot;Naive ATE = {naive_ate:.2%}&amp;quot;,
        ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;, fontsize=11, color=NEAR_BLACK,
        bbox=dict(boxstyle=&amp;quot;round,pad=0.3&amp;quot;, facecolor=&amp;quot;white&amp;quot;,
                  edgecolor=NEAR_BLACK, alpha=0.8))
ax.set_ylabel(&amp;quot;P(Got a Job | Treatment)&amp;quot;, fontsize=12)
ax.set_title(&amp;quot;Observed Job Rates by Training Status&amp;quot;, fontsize=14, color=HEADING_BLUE)
ax.set_ylim(0, 0.75)
ax.spines[&amp;quot;top&amp;quot;].set_visible(False)
ax.spines[&amp;quot;right&amp;quot;].set_visible(False)
plt.savefig(&amp;quot;partial_id_observed_probs.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>P(Y=1 | X=1) = 0.6359 (trained workers who got jobs)
P(Y=1 | X=0) = 0.2538 (untrained workers who got jobs)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="partial_id_observed_probs.png" alt="Observed job rates by training status showing 25.4% for untrained and 63.6% for trained workers.">&lt;/p>
&lt;p>Trained workers find jobs at more than twice the rate of untrained workers: 63.6% versus 25.4%, a gap of 38.2 percentage points. However, this raw comparison confounds the causal effect of training with the influence of prior experience. Because experienced workers are more likely to both enroll in training (70% vs. 30% enrollment rate) and get hired, the treated group is systematically different from the control group. To separate causation from confounding, we need to go beyond this naive comparison.&lt;/p>
&lt;h2 id="baseline----the-naive-estimate">Baseline &amp;ndash; The Naive Estimate&lt;/h2>
&lt;p>The simplest estimate of the causal effect is the &lt;strong>naive difference in means&lt;/strong>: we subtract the job rate of untrained workers from the job rate of trained workers. If there were no confounders, this would equal the true ATE. With confounders, it is biased.&lt;/p>
&lt;pre>&lt;code class="language-python"># True ATE from known DGP (since we simulated the data)
# E[Y(1)] = P(U=0) * P(Y=1|X=1,U=0) + P(U=1) * P(Y=1|X=1,U=1)
# = 0.7 * 0.5 + 0.3 * 0.8 = 0.59
# E[Y(0)] = P(U=0) * P(Y=1|X=0,U=0) + P(U=1) * P(Y=1|X=0,U=1)
# = 0.7 * 0.2 + 0.3 * 0.6 = 0.32
E_Y1_true = 0.7 * 0.5 + 0.3 * 0.8 # = 0.59
E_Y0_true = 0.7 * 0.2 + 0.3 * 0.6 # = 0.32
true_ate = E_Y1_true - E_Y0_true # = 0.27
print(f&amp;quot;Naive ATE (difference in means): {naive_ate:.4f}&amp;quot;)
print(f&amp;quot;True ATE (from known DGP): {true_ate:.4f}&amp;quot;)
print(f&amp;quot;Bias (Naive - True): {naive_ate - true_ate:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Naive ATE (difference in means): 0.3822
True ATE (from known DGP): 0.2700
Bias (Naive - True): +0.1122
&lt;/code>&lt;/pre>
&lt;p>The naive estimate of 0.3822 overshoots the true ATE of 0.27 by 11.2 percentage points &amp;ndash; a substantial upward bias. This happens because experienced workers ($U = 1$) are more likely to both enroll in training and find jobs, inflating the apparent benefit of training. Without observing $U$, we have no way to know the magnitude or even the direction of this bias from the data alone. This motivates partial identification &amp;ndash; we can at least bound the true effect.&lt;/p>
&lt;h2 id="manski-bounds">Manski Bounds&lt;/h2>
&lt;h3 id="what-are-manski-bounds">What are Manski bounds?&lt;/h3>
&lt;p>&lt;strong>Manski bounds&lt;/strong> (also called &amp;ldquo;no-assumptions bounds&amp;rdquo;) are the widest possible bounds on the ATE that use only the observed data and no additional assumptions beyond the &lt;strong>law of total probability&lt;/strong> (the rule that the probability of an event equals the sum of its probabilities across all subgroups, weighted by subgroup size). The idea is simple: for the group we do not observe under a given treatment, we consider the worst-case scenario. What if all untreated workers would have gotten jobs if trained? What if none would have?&lt;/p>
&lt;p>Think of Manski bounds like a courtroom verdict based only on eyewitness testimony. The witnesses tell you what they saw &amp;ndash; the outcomes for treated and untreated groups. But for the people not in the courtroom (the counterfactual outcomes we never observe), you assume the worst and the best to bracket the truth.&lt;/p>
&lt;p>Formally, the law of total probability gives us:&lt;/p>
&lt;p>$$E[Y(1)] = E[Y|X=1] \cdot P(X=1) + E[Y(1)|X=0] \cdot P(X=0)$$&lt;/p>
&lt;p>We observe $E[Y|X=1]$ and $P(X=1)$, but $E[Y(1)|X=0]$ &amp;ndash; the average outcome of untrained workers &lt;em>had they been trained&lt;/em> &amp;ndash; is unobservable. Since $Y$ is binary, this unknown quantity lies between 0 and 1. The same logic applies to $E[Y(0)]$. Substituting worst-case and best-case values:&lt;/p>
&lt;p>$$E[Y(1)] \in \big[E[Y|X=1] \cdot P(X=1), \; E[Y|X=1] \cdot P(X=1) + P(X=0)\big]$$&lt;/p>
&lt;p>$$E[Y(0)] \in \big[E[Y|X=0] \cdot P(X=0), \; E[Y|X=0] \cdot P(X=0) + P(X=1)\big]$$&lt;/p>
&lt;p>The ATE bounds are then the lowest possible $E[Y(1)]$ minus the highest possible $E[Y(0)]$ (lower bound) and vice versa (upper bound).&lt;/p>
&lt;h3 id="manual-computation">Manual computation&lt;/h3>
&lt;p>We walk through the Manski bounds computation step by step using the observed probabilities, so the reader can see exactly how each number arises.&lt;/p>
&lt;pre>&lt;code class="language-python">P_X1 = X.mean()
P_X0 = 1 - P_X1
# Bound E[Y(1)]: observed part + worst/best case for unobserved
E_Y1_lower = P_Y1_X1 * P_X1 + 0 * P_X0 # worst case: no untrained would benefit
E_Y1_upper = P_Y1_X1 * P_X1 + 1 * P_X0 # best case: all untrained would benefit
# Bound E[Y(0)]: observed part + worst/best case for unobserved
E_Y0_lower = P_Y1_X0 * P_X0 + 0 * P_X1 # worst case
E_Y0_upper = P_Y1_X0 * P_X0 + 1 * P_X1 # best case
# ATE bounds: min difference vs max difference
ATE_lower = E_Y1_lower - E_Y0_upper
ATE_upper = E_Y1_upper - E_Y0_lower
print(f&amp;quot;Step 1: Observed probabilities&amp;quot;)
print(f&amp;quot; P(Y=1|X=1) = {P_Y1_X1:.4f}&amp;quot;)
print(f&amp;quot; P(Y=1|X=0) = {P_Y1_X0:.4f}&amp;quot;)
print(f&amp;quot; P(X=1) = {P_X1:.4f}, P(X=0) = {P_X0:.4f}&amp;quot;)
print(f&amp;quot;\nStep 2: Bound potential outcome means&amp;quot;)
print(f&amp;quot; E[Y(1)] in [{E_Y1_lower:.4f}, {E_Y1_upper:.4f}]&amp;quot;)
print(f&amp;quot; E[Y(0)] in [{E_Y0_lower:.4f}, {E_Y0_upper:.4f}]&amp;quot;)
print(f&amp;quot;\nStep 3: Compute ATE bounds&amp;quot;)
print(f&amp;quot; ATE_lower = {E_Y1_lower:.4f} - {E_Y0_upper:.4f} = {ATE_lower:.4f}&amp;quot;)
print(f&amp;quot; ATE_upper = {E_Y1_upper:.4f} - {E_Y0_lower:.4f} = {ATE_upper:.4f}&amp;quot;)
print(f&amp;quot;\n Manski Bounds: [{ATE_lower:.4f}, {ATE_upper:.4f}]&amp;quot;)
print(f&amp;quot; Width: {ATE_upper - ATE_lower:.4f}&amp;quot;)
print(f&amp;quot; Contains true ATE ({true_ate})? {ATE_lower &amp;lt;= true_ate &amp;lt;= ATE_upper}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Step 1: Observed probabilities
P(Y=1|X=1) = 0.6359
P(Y=1|X=0) = 0.2538
P(X=1) = 0.4010, P(X=0) = 0.5990
Step 2: Bound potential outcome means
E[Y(1)] in [0.2550, 0.8540]
E[Y(0)] in [0.1520, 0.5530]
Step 3: Compute ATE bounds
ATE_lower = 0.2550 - 0.5530 = -0.2980
ATE_upper = 0.8540 - 0.1520 = 0.7020
Manski Bounds: [-0.2980, 0.7020]
Width: 1.0000
Contains true ATE (0.27)? True
&lt;/code>&lt;/pre>
&lt;p>The Manski bounds place the true ATE between -0.298 and 0.702 &amp;ndash; a width of exactly 1.0. This means we cannot even determine the &lt;em>sign&lt;/em> of the causal effect under no assumptions: the bounds span zero, so the training program might help, hurt, or have no effect at all. While this seems discouraging, these bounds are guaranteed to contain the true effect (0.27) and are &lt;strong>sharp&lt;/strong> (meaning no tighter bounds exist under these assumptions). They establish the baseline that any tighter method must improve upon.&lt;/p>
&lt;h3 id="verification-with-causalboundingengine">Verification with CausalBoundingEngine&lt;/h3>
&lt;p>We use &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">BinaryConf()&lt;/a> to initialize the confounded scenario. This class takes the observed treatment and outcome arrays and provides methods for computing various bounds. The &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">manski()&lt;/a> method computes the classical no-assumptions bounds for the ATE.&lt;/p>
&lt;pre>&lt;code class="language-python"># Initialize the confounded binary scenario
scenario = BinaryConf(X, Y)
# Compute Manski bounds using the package
start_time = time.time()
manski_bounds = scenario.ATE.manski()
manski_time = time.time() - start_time
print(f&amp;quot;Manski Bounds (ATE): [{manski_bounds[0]:.4f}, {manski_bounds[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {manski_bounds[1] - manski_bounds[0]:.4f}&amp;quot;)
print(f&amp;quot;Contains true ATE? {manski_bounds[0] &amp;lt;= true_ate &amp;lt;= manski_bounds[1]}&amp;quot;)
print(f&amp;quot;Computation Time: {manski_time:.6f} seconds&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Manski Bounds (ATE): [-0.2980, 0.7020]
Width: 1.0000
Contains true ATE? True
Computation Time: 0.000112 seconds
&lt;/code>&lt;/pre>
&lt;p>The package confirms our manual computation exactly: [-0.2980, 0.7020] with a width of 1.0. The computation takes less than a millisecond because Manski bounds have a closed-form solution &amp;ndash; no optimization is needed. This verification gives us confidence in both our understanding of the math and the package implementation. But can we do better with stronger assumptions? The next section explores methods that trade stronger assumptions for tighter bounds.&lt;/p>
&lt;h2 id="beyond-manski----tighter-bounds">Beyond Manski &amp;ndash; Tighter Bounds&lt;/h2>
&lt;p>Manski bounds assume nothing beyond the data. But additional structural assumptions &amp;ndash; even mild ones &amp;ndash; can dramatically narrow the identified set. CausalBoundingEngine provides several methods that leverage different assumptions to tighten bounds.&lt;/p>
&lt;h3 id="autobound-linear-programming">Autobound (linear programming)&lt;/h3>
&lt;p>The &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">autobound()&lt;/a> method uses &lt;strong>linear programming&lt;/strong> (an optimization technique that finds the best value within constraints defined by linear equations) to compute the tightest possible bounds given the constraints implied by the observed distribution. Think of it as an optimization problem: find the narrowest interval that is consistent with every probability constraint the data imposes.&lt;/p>
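&lt;p>To make the idea concrete, here is a minimal standalone sketch of the kind of linear program involved &amp;ndash; written with &lt;code>scipy.optimize.linprog&lt;/code>, not the package internals, and plugging in the rounded observed probabilities from this tutorial&amp;rsquo;s output. We parametrize the joint distribution of $(X, Y(0), Y(1))$ by eight cell probabilities, constrain them to reproduce the four observed probabilities $P(X=x, Y=y)$, and minimize and maximize the ATE.&lt;/p>

```python
import numpy as np
from scipy.optimize import linprog

# Observed joint distribution P(X=x, Y=y), rounded values from this tutorial
p_obs = {(1, 1): 0.2550, (1, 0): 0.1460, (0, 1): 0.1520, (0, 0): 0.4470}

# Latent cells: joint probabilities q over (X, Y(0), Y(1))
cells = [(x, y0, y1) for x in (0, 1) for y0 in (0, 1) for y1 in (0, 1)]

# Objective: ATE = E[Y(1) - Y(0)] = sum of q * (y1 - y0)
c = np.array([y1 - y0 for (_, y0, y1) in cells], dtype=float)

# Each observed P(X=x, Y=y) pins down the total mass of its consistent
# cells (the observed Y equals Y(1) when X=1 and Y(0) when X=0)
A_eq, b_eq = [], []
for (x, y), p in p_obs.items():
    A_eq.append([1.0 if cx == x and (cy1 if cx == 1 else cy0) == y else 0.0
                 for (cx, cy0, cy1) in cells])
    b_eq.append(p)

# Minimize and maximize the ATE over all consistent latent distributions
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
print(f"LP bounds on the ATE: [{lo:.4f}, {hi:.4f}]")
```

&lt;p>With these inputs the LP recovers [-0.2980, 0.7020] &amp;ndash; the same interval as the closed-form Manski formula, which is exactly what sharpness means.&lt;/p>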
&lt;pre>&lt;code class="language-python">start_time = time.time()
autobound_ate = scenario.ATE.autobound()
autobound_time = time.time() - start_time
print(f&amp;quot;Autobound (ATE): [{autobound_ate[0]:.4f}, {autobound_ate[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {autobound_ate[1] - autobound_ate[0]:.4f}&amp;quot;)
print(f&amp;quot;Contains true? {autobound_ate[0] &amp;lt;= true_ate &amp;lt;= autobound_ate[1]}&amp;quot;)
print(f&amp;quot;Computation Time: {autobound_time:.6f} seconds&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Autobound (ATE): [-0.2980, 0.7020]
Width: 1.0000
Contains true? True
Computation Time: 0.301253 seconds
&lt;/code>&lt;/pre>
&lt;p>The autobound method returns the same bounds as Manski: [-0.2980, 0.7020]. This is an important result, not a failure. It confirms that the Manski bounds are already sharp &amp;ndash; they are the tightest possible bounds for the ATE in a binary confounded scenario without additional assumptions. No linear programming trick can improve upon them because the worst-case distributions that achieve the extreme bounds are actually valid probability distributions. The autobound takes longer (0.30 seconds vs. about a tenth of a millisecond) because it solves an optimization problem to arrive at the same answer.&lt;/p>
&lt;h3 id="entropy-bounds">Entropy bounds&lt;/h3>
&lt;p>The &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">entropybounds()&lt;/a> method adds an &lt;strong>information-theoretic constraint&lt;/strong>: it limits how much the unmeasured confounder can distort the joint distribution by bounding the entropy of the latent variable. Formally, the constraint requires that the conditional entropy of the unmeasured variable given the observed data is bounded:&lt;/p>
&lt;p>$$H(U | X, Y) \leq \theta$$&lt;/p>
&lt;p>where $H$ denotes Shannon entropy &amp;ndash; a measure of uncertainty, like the unpredictability of a coin flip. A fair coin has maximum entropy because each flip is maximally surprising; a two-headed coin has zero entropy because the outcome is certain. The key parameter &lt;code>theta&lt;/code> caps how &amp;ldquo;surprising&amp;rdquo; the hidden confounder is allowed to be. Smaller values impose stricter constraints and produce tighter bounds: low theta means the confounder can only redistribute probability mass in limited ways.&lt;/p>
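&lt;p>To build intuition for the scale of &lt;code>theta&lt;/code>, the small helper below (illustrative only, not part of the package) computes the Shannon entropy of a Bernoulli variable, here measured in bits (the package&amp;rsquo;s unit convention may differ). A budget of about 0.1 bits admits only quite lopsided distributions &amp;ndash; roughly $p \leq 0.013$ or $p \geq 0.987$.&lt;/p>

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# A fair coin is maximally surprising; a near-deterministic one is not
print(f"H(0.50) = {bernoulli_entropy(0.50):.3f} bits")  # 1.000
print(f"H(0.99) = {bernoulli_entropy(0.99):.3f} bits")  # 0.081
print(f"H(1.00) = {bernoulli_entropy(1.00):.3f} bits")  # 0.000
```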
&lt;pre>&lt;code class="language-python">start_time = time.time()
entropy_ate = scenario.ATE.entropybounds(theta=0.1)
entropy_time = time.time() - start_time
print(f&amp;quot;Entropy Bounds (ATE, theta=0.1): [{entropy_ate[0]:.4f}, {entropy_ate[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {entropy_ate[1] - entropy_ate[0]:.4f}&amp;quot;)
print(f&amp;quot;Contains true? {entropy_ate[0] &amp;lt;= true_ate &amp;lt;= entropy_ate[1]}&amp;quot;)
print(f&amp;quot;Computation Time: {entropy_time:.6f} seconds&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Entropy Bounds (ATE, theta=0.1): [-0.2279, 0.4540]
Width: 0.6819
Contains true? True
Computation Time: 0.027686 seconds
&lt;/code>&lt;/pre>
&lt;p>With &lt;code>theta = 0.1&lt;/code>, the entropy bounds narrow the ATE to [-0.2279, 0.4540] &amp;ndash; a width of 0.68, which is 32% narrower than the Manski bounds. The bounds still cross zero, so we cannot definitively conclude the sign of the effect, but the range is considerably more informative. The entropy constraint says: &amp;ldquo;the unmeasured confounder is allowed to create some distortion, but not unlimited distortion.&amp;rdquo; This is a middle ground between no assumptions (Manski) and full identification (assuming no confounders at all).&lt;/p>
&lt;h2 id="probability-of-necessity-and-sufficiency-pns">Probability of Necessity and Sufficiency (PNS)&lt;/h2>
&lt;p>Beyond the ATE, partial identification can address a deeper causal question: for how many workers did training &lt;strong>both&lt;/strong> cause them to get a job &lt;strong>and&lt;/strong> was essential for getting that job? This is the &lt;strong>Probability of Necessity and Sufficiency (PNS)&lt;/strong>.&lt;/p>
&lt;p>$$\text{PNS} = P(Y_{X=1} = 1 \, \cap \, Y_{X=0} = 0)$$&lt;/p>
&lt;p>In words, PNS is the probability that a worker would get a job if trained ($Y_{X=1} = 1$) and would &lt;em>not&lt;/em> get a job if untrained ($Y_{X=0} = 0$). Unlike the ATE, which averages over the population, the PNS captures individual-level causation. It matters for legal and medical decisions: a court might ask whether a specific intervention was &lt;em>necessary&lt;/em> for the outcome, not just whether it helps on average.&lt;/p>
&lt;p>We compute PNS bounds using &lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">tianpearl()&lt;/a>, which implements the sharp closed-form bounds from Tian and Pearl (2000). These bounds use observational data to constrain the three probabilities of causation: necessity (PN), sufficiency (PS), and their conjunction (PNS).&lt;/p>
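&lt;p>Before calling the package, we can sanity-check the upper bound by hand. A worker for whom training is both necessary and sufficient shows $Y=1$ when trained and $Y=0$ when untrained, so the total mass of such workers cannot exceed $P(X=1, Y=1) + P(X=0, Y=0)$. The sketch below plugs in the rounded observed probabilities reported earlier, so the result is approximate.&lt;/p>

```python
# Observed probabilities from earlier in the tutorial (rounded)
P_Y1_given_X1, P_Y1_given_X0, P_X1 = 0.6359, 0.2538, 0.4010
P_X0 = 1 - P_X1

# Joint probabilities consistent with the "necessary and sufficient" type
P_x1_y1 = P_Y1_given_X1 * P_X1        # trained workers who found jobs
P_x0_y0 = (1 - P_Y1_given_X0) * P_X0  # untrained workers who did not

pns_upper = P_x1_y1 + P_x0_y0  # no-assumption upper bound on the PNS
print(f"Manual PNS upper bound: {pns_upper:.4f}")  # ~0.7020
```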
&lt;pre>&lt;code class="language-python"># Tian-Pearl bounds for PNS
start_time = time.time()
tianpearl_pns = scenario.PNS.tianpearl()
tianpearl_time = time.time() - start_time
print(f&amp;quot;Tian-Pearl Bounds (PNS): [{tianpearl_pns[0]:.4f}, {tianpearl_pns[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {tianpearl_pns[1] - tianpearl_pns[0]:.4f}&amp;quot;)
print(f&amp;quot;Computation Time: {tianpearl_time:.6f} seconds&amp;quot;)
# Compare with autobound and entropy
autobound_pns = scenario.PNS.autobound()
entropy_pns = scenario.PNS.entropybounds(theta=0.1)
print(f&amp;quot;\nAutobound (PNS): [{autobound_pns[0]:.4f}, {autobound_pns[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {autobound_pns[1] - autobound_pns[0]:.4f}&amp;quot;)
print(f&amp;quot;\nEntropy Bounds (PNS): [{entropy_pns[0]:.4f}, {entropy_pns[1]:.4f}]&amp;quot;)
print(f&amp;quot;Width: {entropy_pns[1] - entropy_pns[0]:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Tian-Pearl Bounds (PNS): [0.0000, 0.7020]
Width: 0.7020
Computation Time: 0.000057 seconds
Autobound (PNS): [-0.0000, 0.7020]
Width: 0.7020
Entropy Bounds (PNS): [0.0000, 0.8394]
Width: 0.8394
&lt;/code>&lt;/pre>
&lt;p>The Tian-Pearl bounds place the PNS between 0.00 and 0.702. The lower bound of zero means we cannot rule out the possibility that training is &lt;em>never&lt;/em> individually necessary and sufficient. Some workers might always get jobs regardless, others might never get jobs regardless, and the observed difference could arise from group-level patterns rather than individual-level causation. The upper bound of 0.702 means at most 70.2% of workers experienced training as both necessary and sufficient for employment. The autobound confirms these are already sharp (0.7020 width). Interestingly, the entropy bounds are &lt;em>wider&lt;/em> for PNS (0.8394) than Tian-Pearl &amp;ndash; the entropy constraint is less effective for &lt;strong>counterfactual&lt;/strong> queries (questions about what &lt;em>would have happened&lt;/em> under a different treatment) than for the ATE.&lt;/p>
&lt;h2 id="comparing-all-bounds">Comparing All Bounds&lt;/h2>
&lt;h3 id="when-does-it-help-to-identify-or-decide">When does it help to identify or decide?&lt;/h3>
&lt;p>A decision-maker can draw conclusions from partial identification in two ways. If the entire bound interval is positive, we conclude the treatment helps (on average) even without observing the confounder. If the interval spans zero, we cannot determine the sign &amp;ndash; honesty about this uncertainty is a strength, not a weakness.&lt;/p>
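&lt;p>This decision rule is simple enough to encode directly. The helper below (illustrative only, not part of CausalBoundingEngine) maps a bound interval to the three possible conclusions.&lt;/p>

```python
def interpret_bounds(lower: float, upper: float) -> str:
    """Translate an ATE bound interval into a qualitative conclusion."""
    if lower > 0:
        return "treatment helps on average (entire interval is positive)"
    if upper < 0:
        return "treatment hurts on average (entire interval is negative)"
    return "sign not identified (interval spans zero)"

print(interpret_bounds(-0.298, 0.702))  # sign not identified (interval spans zero)
print(interpret_bounds(0.050, 0.400))   # treatment helps on average ...
```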
&lt;p>The following flowchart summarizes when to use partial identification versus point identification methods:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Q[&amp;quot;Are all confounders&amp;lt;br/&amp;gt;observed?&amp;quot;] --&amp;gt;|&amp;quot;Yes&amp;quot;| PI[&amp;quot;&amp;lt;b&amp;gt;Point Identification&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;DoWhy, DoubleML&amp;quot;]
Q --&amp;gt;|&amp;quot;No&amp;quot;| IV[&amp;quot;Is there an&amp;lt;br/&amp;gt;instrument?&amp;quot;]
IV --&amp;gt;|&amp;quot;Yes&amp;quot;| IVPI[&amp;quot;&amp;lt;b&amp;gt;Point Identification&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;via Instrumental Variables&amp;lt;br/&amp;gt;(IV / 2SLS)&amp;quot;]
IV --&amp;gt;|&amp;quot;No&amp;quot;| PART[&amp;quot;&amp;lt;b&amp;gt;Partial Identification&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Compute bounds&amp;quot;]
style Q fill:#141413,stroke:#141413,color:#fff
style PI fill:#6a9bcc,stroke:#141413,color:#fff
style IV fill:#141413,stroke:#141413,color:#fff
style IVPI fill:#6a9bcc,stroke:#141413,color:#fff
style PART fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="ate-bounds-comparison">ATE bounds comparison&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 5))
methods_ate = [
(&amp;quot;Entropy (theta=0.1)&amp;quot;, entropy_ate, TEAL),
(&amp;quot;Autobound (LP)&amp;quot;, autobound_ate, WARM_ORANGE),
(&amp;quot;Manski (No Assumptions)&amp;quot;, manski_bounds, STEEL_BLUE),
]
for i, (label, bounds, color) in enumerate(methods_ate):
width = bounds[1] - bounds[0]
ax.barh(i, width, left=bounds[0], height=0.5, color=color,
edgecolor=NEAR_BLACK, linewidth=0.8, alpha=0.85)
ax.text(bounds[1] + 0.01, i, f&amp;quot;[{bounds[0]:.3f}, {bounds[1]:.3f}]&amp;quot;,
va=&amp;quot;center&amp;quot;, fontsize=9, color=NEAR_BLACK)
ax.axvline(x=true_ate, color=NEAR_BLACK, linestyle=&amp;quot;--&amp;quot;, linewidth=2,
label=f&amp;quot;True ATE = {true_ate:.2f}&amp;quot;)
ax.axvline(x=naive_ate, color=&amp;quot;#999999&amp;quot;, linestyle=&amp;quot;:&amp;quot;, linewidth=1.5,
label=f&amp;quot;Naive estimate = {naive_ate:.4f}&amp;quot;)
ax.set_yticks([0, 1, 2])
ax.set_yticklabels([&amp;quot;Entropy (theta=0.1)&amp;quot;, &amp;quot;Autobound (LP)&amp;quot;,
&amp;quot;Manski (No Assumptions)&amp;quot;], fontsize=11)
ax.set_xlabel(&amp;quot;Average Treatment Effect (ATE)&amp;quot;, fontsize=12)
ax.set_title(&amp;quot;Comparing Causal Bounds on the ATE&amp;quot;, fontsize=14, color=HEADING_BLUE)
ax.legend(loc=&amp;quot;upper center&amp;quot;, bbox_to_anchor=(0.5, -0.12), fontsize=10, ncol=2)
ax.spines[&amp;quot;top&amp;quot;].set_visible(False)
ax.spines[&amp;quot;right&amp;quot;].set_visible(False)
plt.savefig(&amp;quot;partial_id_bounds_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="partial_id_bounds_comparison.png" alt="Horizontal interval chart comparing Manski, Autobound, and Entropy bounds on the ATE, with the true ATE marked as a dashed vertical line.">&lt;/p>
&lt;p>All three methods contain the true ATE (0.27), but they differ in width. Manski and Autobound both produce identical bounds of [-0.298, 0.702] with width 1.0, confirming the Manski bounds are already sharp. The entropy bounds with theta = 0.1 narrow the interval to [-0.228, 0.454] (width 0.68), a 32% improvement. The naive estimate (0.3822, gray dotted line) lies noticeably to the right of the true ATE (0.27, black dashed line), illustrating the upward bias caused by confounding &amp;ndash; experienced workers disproportionately enroll in training and find jobs.&lt;/p>
&lt;h3 id="summary-table">Summary table&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Lower&lt;/th>
&lt;th>Upper&lt;/th>
&lt;th>Width&lt;/th>
&lt;th>Contains True ATE?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Manski&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>-0.2980&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>1.0000&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Autobound (LP)&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>-0.2980&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>1.0000&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Entropy (theta = 0.1)&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>-0.2279&lt;/td>
&lt;td>0.4540&lt;/td>
&lt;td>0.6819&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Tian-Pearl&lt;/td>
&lt;td>PNS&lt;/td>
&lt;td>0.0000&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Autobound (LP)&lt;/td>
&lt;td>PNS&lt;/td>
&lt;td>0.0000&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>0.7020&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Entropy (theta = 0.1)&lt;/td>
&lt;td>PNS&lt;/td>
&lt;td>0.0000&lt;/td>
&lt;td>0.8394&lt;/td>
&lt;td>0.8394&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="pns-bounds-comparison">PNS bounds comparison&lt;/h3>
&lt;p>We now visualize the PNS bounds from all three methods side by side, just as we did for the ATE above. This comparison reveals which bounding approach is most effective for counterfactual queries about individual-level causation.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 4.5))
methods_pns = [
(&amp;quot;Entropy (theta=0.1)&amp;quot;, entropy_pns, TEAL),
(&amp;quot;Autobound (LP)&amp;quot;, autobound_pns, WARM_ORANGE),
(&amp;quot;Tian-Pearl (Closed Form)&amp;quot;, tianpearl_pns, STEEL_BLUE),
]
for i, (label, bounds, color) in enumerate(methods_pns):
width = bounds[1] - bounds[0]
ax.barh(i, width, left=bounds[0], height=0.5, color=color,
edgecolor=NEAR_BLACK, linewidth=0.8, alpha=0.85)
ax.text(bounds[1] + 0.01, i, f&amp;quot;[{bounds[0]:.3f}, {bounds[1]:.3f}]&amp;quot;,
va=&amp;quot;center&amp;quot;, fontsize=9, color=NEAR_BLACK)
ax.set_yticks([0, 1, 2])
ax.set_yticklabels([&amp;quot;Entropy (theta=0.1)&amp;quot;, &amp;quot;Autobound (LP)&amp;quot;,
&amp;quot;Tian-Pearl (Closed Form)&amp;quot;], fontsize=11)
ax.set_xlabel(&amp;quot;Probability of Necessity &amp;amp; Sufficiency (PNS)&amp;quot;, fontsize=12)
ax.set_title(&amp;quot;Comparing Causal Bounds on the PNS&amp;quot;, fontsize=14, color=HEADING_BLUE)
ax.spines[&amp;quot;top&amp;quot;].set_visible(False)
ax.spines[&amp;quot;right&amp;quot;].set_visible(False)
plt.savefig(&amp;quot;partial_id_pns_bounds.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="partial_id_pns_bounds.png" alt="Horizontal interval chart comparing Tian-Pearl, Autobound, and Entropy bounds on the PNS.">&lt;/p>
&lt;p>For the PNS, the Tian-Pearl and Autobound methods produce identical sharp bounds of [0.000, 0.702]. The entropy method yields wider bounds [0.000, 0.839] &amp;ndash; the entropy constraint is less effective here because PNS is a counterfactual quantity that depends on the joint distribution of potential outcomes, which is harder to constrain with information-theoretic tools. All methods agree that the lower bound is zero, meaning we cannot rule out that training is never individually necessary and sufficient.&lt;/p>
&lt;h2 id="validation----coverage-simulation">Validation &amp;ndash; Coverage Simulation&lt;/h2>
&lt;p>A critical property of valid bounds is &lt;strong>coverage&lt;/strong>: they must contain the true parameter value. Since we control the data-generating process, we can verify this by repeating the simulation 100 times with different random seeds and checking whether each set of bounds contains the true ATE of 0.27.&lt;/p>
&lt;pre>&lt;code class="language-python">n_sims = 100
coverage = {&amp;quot;Manski&amp;quot;: 0, &amp;quot;Autobound&amp;quot;: 0, &amp;quot;Entropy&amp;quot;: 0}
for sim in range(n_sims):
np.random.seed(sim)
U_s = np.random.binomial(1, 0.3, N)
X_prob_s = 0.3 + 0.4 * U_s
X_s = np.random.binomial(1, X_prob_s, N)
Y_prob_s = np.clip(0.2 + 0.3 * X_s + 0.4 * U_s - 0.1 * X_s * U_s, 0, 1)
Y_s = np.random.binomial(1, Y_prob_s)
sc = BinaryConf(X_s, Y_s)
m = sc.ATE.manski()
a = sc.ATE.autobound()
e = sc.ATE.entropybounds(theta=0.1)
if m[0] &amp;lt;= true_ate &amp;lt;= m[1]: coverage[&amp;quot;Manski&amp;quot;] += 1
if a[0] &amp;lt;= true_ate &amp;lt;= a[1]: coverage[&amp;quot;Autobound&amp;quot;] += 1
if e[0] &amp;lt;= true_ate &amp;lt;= e[1]: coverage[&amp;quot;Entropy&amp;quot;] += 1
for method, count in coverage.items():
print(f&amp;quot; {method} coverage: {count}/{n_sims} ({count/n_sims:.0%})&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> Manski coverage: 100/100 (100%)
Autobound coverage: 100/100 (100%)
Entropy coverage: 100/100 (100%)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(7, 4.5))
methods = [&amp;quot;Manski&amp;quot;, &amp;quot;Autobound&amp;quot;, &amp;quot;Entropy\n(\u03b8 = 0.1)&amp;quot;]
coverages = [coverage[k] / n_sims * 100 for k in coverage]
colors = [STEEL_BLUE, WARM_ORANGE, TEAL]
bars = ax.bar(methods, coverages, color=colors, width=0.5,
edgecolor=NEAR_BLACK, linewidth=0.8)
for bar, cov in zip(bars, coverages):
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
f&amp;quot;{cov:.0f}%&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;, fontsize=13,
fontweight=&amp;quot;bold&amp;quot;, color=NEAR_BLACK)
ax.axhline(y=100, color=NEAR_BLACK, linestyle=&amp;quot;--&amp;quot;, linewidth=1, alpha=0.5)
ax.set_ylabel(&amp;quot;Coverage Rate (%)&amp;quot;, fontsize=12)
ax.set_title(&amp;quot;Do Bounds Contain the True ATE?\n(100 Simulations)&amp;quot;, fontsize=14, color=HEADING_BLUE)
ax.set_ylim(0, 110)
ax.spines[&amp;quot;top&amp;quot;].set_visible(False)
ax.spines[&amp;quot;right&amp;quot;].set_visible(False)
plt.savefig(&amp;quot;partial_id_coverage.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="partial_id_coverage.png" alt="Bar chart showing 100% coverage rates for all three bounding methods across 100 simulations.">&lt;/p>
&lt;p>All three methods achieve 100% coverage across 100 simulations &amp;ndash; the true ATE of 0.27 falls within the computed bounds in every single draw. This is not surprising for Manski and Autobound, which make no assumptions beyond the data. For entropy bounds, the 100% coverage suggests that &lt;code>theta = 0.1&lt;/code> is a conservative enough constraint that it does not exclude the true value. In practice, choosing theta requires domain knowledge: too small and the bounds may not cover the truth; too large and the bounds approach the uninformative Manski width.&lt;/p>
&lt;h2 id="sensitivity----how-sample-size-affects-bounds">Sensitivity &amp;ndash; How Sample Size Affects Bounds&lt;/h2>
&lt;p>A common misconception is that collecting more data will narrow partial identification bounds. This is generally &lt;strong>not true&lt;/strong>. Unlike confidence intervals, which shrink with more observations, identification bounds reflect fundamental uncertainty about what we do not observe &amp;ndash; the unmeasured confounder. More data gives us more precise estimates of the observed probabilities, but does not reduce the range of possible confounding.&lt;/p>
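&lt;p>A quick algebraic check makes the point before we run any simulation: for a binary outcome, the Manski interval&amp;rsquo;s width is the length $P(X=0)$ of the unknown part of $E[Y(1)]$ plus the length $P(X=1)$ of the unknown part of $E[Y(0)]$, and those always sum to 1 no matter what the data say. The tiny sketch below just evaluates that identity; note that sample size never enters the formula.&lt;/p>

```python
def manski_width(p_x1: float) -> float:
    """Width of the no-assumption ATE bounds for a binary outcome."""
    p_x0 = 1 - p_x1
    # Unknown part of E[Y(1)] spans P(X=0); unknown part of E[Y(0)] spans P(X=1)
    return p_x0 + p_x1

# The width is 1 for any treatment share -- no N appears anywhere
for p in (0.1, 0.401, 0.9):
    print(f"P(X=1) = {p:.3f} -> Manski width = {manski_width(p):.1f}")
```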
&lt;pre>&lt;code class="language-python">sample_sizes = [100, 250, 500, 1000, 2500, 5000]
n_reps = 30
manski_widths = {n: [] for n in sample_sizes}
entropy_widths = {n: [] for n in sample_sizes}
for n in sample_sizes:
for rep in range(n_reps):
np.random.seed(rep + 1000)
U_s = np.random.binomial(1, 0.3, n)
X_prob_s = 0.3 + 0.4 * U_s
X_s = np.random.binomial(1, X_prob_s, n)
Y_prob_s = np.clip(0.2 + 0.3 * X_s + 0.4 * U_s - 0.1 * X_s * U_s, 0, 1)
Y_s = np.random.binomial(1, Y_prob_s)
sc = BinaryConf(X_s, Y_s)
m = sc.ATE.manski()
e = sc.ATE.entropybounds(theta=0.1)
manski_widths[n].append(m[1] - m[0])
entropy_widths[n].append(e[1] - e[0])
for n in sample_sizes:
print(f&amp;quot;N={n:&amp;gt;5}: Manski width = {np.mean(manski_widths[n]):.4f} &amp;quot;
f&amp;quot;(+/- {np.std(manski_widths[n]):.4f}), &amp;quot;
f&amp;quot;Entropy width = {np.mean(entropy_widths[n]):.4f} &amp;quot;
f&amp;quot;(+/- {np.std(entropy_widths[n]):.4f})&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>N= 100: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6733 (+/- 0.0139)
N= 250: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6733 (+/- 0.0100)
N= 500: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6753 (+/- 0.0084)
N= 1000: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6772 (+/- 0.0055)
N= 2500: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6751 (+/- 0.0032)
N= 5000: Manski width = 1.0000 (+/- 0.0000), Entropy width = 0.6753 (+/- 0.0027)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
manski_means = [np.mean(manski_widths[n]) for n in sample_sizes]
manski_stds = [np.std(manski_widths[n]) for n in sample_sizes]
entropy_means = [np.mean(entropy_widths[n]) for n in sample_sizes]
entropy_stds = [np.std(entropy_widths[n]) for n in sample_sizes]
ax.plot(sample_sizes, manski_means, &amp;quot;o-&amp;quot;, color=STEEL_BLUE, linewidth=2,
markersize=7, label=&amp;quot;Manski Bounds&amp;quot;, zorder=3)
ax.fill_between(sample_sizes,
[m - s for m, s in zip(manski_means, manski_stds)],
[m + s for m, s in zip(manski_means, manski_stds)],
color=STEEL_BLUE, alpha=0.15)
ax.plot(sample_sizes, entropy_means, &amp;quot;s-&amp;quot;, color=TEAL, linewidth=2,
markersize=7, label=&amp;quot;Entropy Bounds (\u03b8 = 0.1)&amp;quot;, zorder=3)
ax.fill_between(sample_sizes,
[m - s for m, s in zip(entropy_means, entropy_stds)],
[m + s for m, s in zip(entropy_means, entropy_stds)],
color=TEAL, alpha=0.15)
ax.set_xlabel(&amp;quot;Sample Size (N)&amp;quot;, fontsize=12)
ax.set_ylabel(&amp;quot;Bound Width (Upper - Lower)&amp;quot;, fontsize=12)
ax.set_title(&amp;quot;Bound Width vs. Sample Size&amp;quot;, fontsize=14, color=HEADING_BLUE)
ax.legend(loc=&amp;quot;center right&amp;quot;, fontsize=11)
ax.spines[&amp;quot;top&amp;quot;].set_visible(False)
ax.spines[&amp;quot;right&amp;quot;].set_visible(False)
ax.set_xscale(&amp;quot;log&amp;quot;)
ax.set_xticks(sample_sizes)
ax.set_xticklabels([str(n) for n in sample_sizes])
plt.savefig(&amp;quot;partial_id_sample_size.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="partial_id_sample_size.png" alt="Line plot showing Manski bound width stays constant at 1.0 across sample sizes, while Entropy bound width stays near 0.68 with decreasing variance.">&lt;/p>
&lt;p>The Manski bound width remains exactly 1.0 regardless of sample size &amp;ndash; from N = 100 to N = 5,000, the width does not budge. This is because Manski bounds are &lt;strong>identification bounds&lt;/strong>, not statistical estimates: they reflect what we fundamentally cannot learn without observing the confounder, not sampling noise. The entropy bounds similarly stabilize around 0.68 across all sample sizes, with only their variance decreasing (from +/-0.014 at N = 100 to +/-0.003 at N = 5,000). The practical implication is clear: to narrow these bounds, you need stronger assumptions or additional data &lt;em>about the confounder&lt;/em> &amp;ndash; not just more observations of the same variables.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>We began by asking whether a job training program helps workers find jobs when a key confounder &amp;ndash; prior work experience &amp;ndash; is unmeasured. The naive difference in means (0.3822) suggests training increases job probability by about 38 percentage points, but this estimate is upward biased by 11.2 percentage points because experienced workers disproportionately enroll in training.&lt;/p>
&lt;p>Partial identification provides an honest answer. The Manski bounds place the true ATE between -0.298 and 0.702: training might reduce job probability by as much as 30 percentage points or increase it by as much as 70 percentage points. This interval is wide enough to span zero, so we cannot conclude even the direction of the effect under minimal assumptions. The entropy bounds (theta = 0.1) narrow this to [-0.228, 0.454], a 32% reduction, but still include zero.&lt;/p>
&lt;p>&lt;strong>So what does this mean for a policymaker?&lt;/strong> If you are deciding whether to fund the training program, the Manski bounds alone are not informative enough. You need either additional data (an instrument, panel data, or direct measurement of the confounder) or stronger assumptions to narrow the bounds. However, the bounds are valuable for ruling out extreme claims: the ATE cannot exceed 0.702, so any claim of a 75-percentage-point benefit is inconsistent with the data. Partial identification does not give you the answer, but it tells you honestly what the data can and cannot say.&lt;/p>
&lt;p>This framework complements the point-identification methods covered in previous tutorials. &lt;a href="https://carlos-mendez.org/post/python_doubleml/">Double Machine Learning&lt;/a> and &lt;a href="https://carlos-mendez.org/post/python_dowhy/">DoWhy&lt;/a> assume all confounders are observed and produce precise estimates. Partial identification drops that assumption and produces bounds instead. The choice depends on whether the &amp;ldquo;no unmeasured confounders&amp;rdquo; assumption is credible in your application.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and Next Steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Method insight:&lt;/strong> Manski bounds require only observational data and the law of total probability &amp;ndash; no parametric assumptions, no exclusion restrictions. The price is width: a full 1.0 on the probability scale for binary outcomes. These bounds are already sharp (autobound confirms this), establishing the fundamental limit of what data alone can tell us.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data insight:&lt;/strong> The naive estimate (0.3822) overshoots the true ATE (0.27) by 11.2 percentage points because experienced workers disproportionately enroll in training. This upward bias illustrates why raw comparisons are misleading in observational studies. The Manski bounds honestly bracket this uncertainty by admitting the effect could range from -0.298 to 0.702.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Limitation:&lt;/strong> When bounds span zero &amp;ndash; as ours do for both Manski and entropy methods &amp;ndash; we cannot determine even the sign of the treatment effect. This is an honest result, not a failure. It means the data, without additional structure, genuinely cannot distinguish a helpful program from a harmful one.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Next step:&lt;/strong> To narrow bounds, add structural information. An instrumental variable (use the &lt;code>BinaryIV&lt;/code> scenario in CausalBoundingEngine) can dramatically tighten bounds. Monotonicity assumptions (treatment can only help, never hurt) halve the Manski width. Alternatively, sensitivity analysis methods like Cinelli and Hazlett&amp;rsquo;s partial $R^2$ approach let you ask: &amp;ldquo;How strong would the confounder need to be to explain away the observed effect?&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Limitations:&lt;/strong> This tutorial uses binary variables only &amp;ndash; real applications often involve continuous outcomes and treatments, which require different bounding approaches. The simulated data lets us verify coverage but does not capture the messy complexities of real observational studies. The entropy bounds require choosing a $\theta$ parameter, and we have not provided guidance on calibrating this choice from domain knowledge.&lt;/p>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Increase confounder strength:&lt;/strong> Change the confounder&amp;rsquo;s effect on the outcome from 0.4 to 0.8 in the data-generating process. How do the Manski bounds change? Does the naive estimate become more biased?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add an instrumental variable:&lt;/strong> Create a variable $Z$ that affects $X$ but not $Y$ directly (e.g., $Z$ is a randomly mailed training invitation). Use the &lt;code>BinaryIV&lt;/code> scenario in CausalBoundingEngine. Do the IV-based bounds tighten compared to Manski?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Real-world application:&lt;/strong> Find a published observational study in your field (economics, epidemiology, or social science). Identify what the unmeasured confounders might be. How wide would the Manski bounds be given the observed treatment and outcome rates? Would the study&amp;rsquo;s conclusions survive under partial identification?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.jstor.org/stable/2006592" target="_blank" rel="noopener">Manski, C. F. (1990). Nonparametric Bounds on Treatment Effects. American Economic Review Papers and Proceedings, 80(2), 319-323.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1023/A:1018912507879" target="_blank" rel="noopener">Tian, J. &amp;amp; Pearl, J. (2000). Probabilities of Causation: Bounds and Identification. Annals of Mathematics and Artificial Intelligence, 28, 287-313.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2508.13607" target="_blank" rel="noopener">Maringgele, T. (2025). Bounding Causal Effects and Counterfactuals. arXiv:2508.13607.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pypi.org/project/causalboundingengine/" target="_blank" rel="noopener">CausalBoundingEngine &amp;ndash; Python Package for Causal Bounding Methods.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://theeffectbook.net/ch-PartialIdentification.html" target="_blank" rel="noopener">Huntington-Klein, N. (2021). The Effect: An Introduction to Research Design and Causality, Chapter 21: Partial Identification.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1007/b97478" target="_blank" rel="noopener">Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset</title><link>https://carlos-mendez.org/post/python_dowhy/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_dowhy/</guid><description>&lt;div style="background:#0e1545; border-radius:12px; padding:8px;">
&lt;iframe style="border-radius:8px" src="https://open.spotify.com/embed/episode/7h6S9YzEroATdQabvJxi1W?utm_source=generator&amp;theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy">&lt;/iframe>
&lt;/div>
&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Does a job training program actually cause participants to earn more, or do people who enroll in training simply differ from those who do not? This is the central challenge of &lt;strong>causal inference&lt;/strong>: distinguishing genuine treatment effects from confounding differences between groups. A simple comparison of average earnings between participants and non-participants can be misleading if the two groups differ in age, education, or prior employment history.&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/" target="_blank" rel="noopener">DoWhy&lt;/a>&lt;/strong> is a Python library that provides a principled, end-to-end framework for causal inference. It organizes the analysis into four explicit steps &amp;mdash; &lt;strong>Model, Identify, Estimate, Refute&lt;/strong> &amp;mdash; each of which forces the analyst to state and test causal assumptions rather than hiding them inside a black-box estimator. In this tutorial, we apply DoWhy to the &lt;strong>&lt;a href="https://www.jstor.org/stable/1806062" target="_blank" rel="noopener">Lalonde dataset&lt;/a>&lt;/strong>, a classic dataset from the National Supported Work (NSW) Demonstration program, to estimate how much the job training program increased participants' earnings in 1978.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand DoWhy&amp;rsquo;s four-step causal inference workflow (Model, Identify, Estimate, Refute)&lt;/li>
&lt;li>Define a causal graph that encodes domain knowledge about confounders&lt;/li>
&lt;li>Identify causal estimands from the graph using the backdoor criterion&lt;/li>
&lt;li>Estimate causal effects using multiple methods (regression adjustment, IPW, doubly robust, propensity score stratification, propensity score matching)&lt;/li>
&lt;li>Assess robustness of estimates using refutation tests&lt;/li>
&lt;/ul>
&lt;h2 id="dowhys-four-step-framework">DoWhy&amp;rsquo;s four-step framework&lt;/h2>
&lt;p>Most statistical software lets you jump straight from data to estimates, skipping the hard work of stating assumptions and testing whether the results are trustworthy. DoWhy takes a different approach: it organizes every causal analysis into four explicit steps, each answering a distinct question.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;1. Model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Define causal&amp;lt;br/&amp;gt;assumptions&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;2. Identify&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Find the right&amp;lt;br/&amp;gt;formula&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;3. Estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Compute the&amp;lt;br/&amp;gt;causal effect&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;4. Refute&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Stress-test&amp;lt;br/&amp;gt;the result&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#fff
style D fill:#8b5cf6,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Each step answers a specific question and builds on the previous one:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Model&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What are the causal relationships?&amp;rdquo;&lt;/em> Encode your domain knowledge as a causal graph (a DAG). This is where you declare which variables cause which, making your assumptions explicit and debatable rather than hidden inside a regression.&lt;/li>
&lt;li>&lt;strong>Identify&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;Can we estimate the effect from data?&amp;rdquo;&lt;/em> Given the graph, DoWhy uses graph theory to determine whether the causal effect is identifiable &amp;mdash; meaning it can be computed from observed data alone &amp;mdash; and returns the mathematical formula (the &lt;em>estimand&lt;/em>) needed to do so.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What is the causal effect?&amp;rdquo;&lt;/em> Apply one or more statistical methods to compute the actual numeric estimate. DoWhy supports multiple estimators so you can check whether different methods agree.&lt;/li>
&lt;li>&lt;strong>Refute&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;Should we trust the estimate?&amp;quot;&lt;/em> Run automated falsification tests that probe whether the result could be a statistical artifact, whether it is sensitive to unobserved confounders, and whether it is stable across subsamples.&lt;/li>
&lt;/ul>
&lt;p>The ordering is deliberate. You cannot estimate a causal effect without first identifying the correct formula, and you cannot identify the formula without first specifying your causal assumptions. This sequential discipline is DoWhy&amp;rsquo;s key contribution: it prevents the common mistake of running a regression and calling the coefficient &amp;ldquo;causal&amp;rdquo; without ever checking whether the adjustment set is correct or whether the result survives basic robustness checks.&lt;/p>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install dowhy # https://pypi.org/project/dowhy/
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets configuration variables. We define the outcome, treatment, and covariate columns that will be used throughout the analysis.&lt;/p>
&lt;pre>&lt;code class="language-python">import warnings
warnings.filterwarnings(&amp;quot;ignore&amp;quot;)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression as SklearnLR
from dowhy import CausalModel
from dowhy.datasets import lalonde_dataset
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Configuration
OUTCOME = &amp;quot;re78&amp;quot;
OUTCOME_LABEL = &amp;quot;Earnings in 1978 (USD)&amp;quot;
TREATMENT = &amp;quot;treat&amp;quot;
TREATMENT_LABEL = &amp;quot;Job Training (treat)&amp;quot;
COVARIATES = [&amp;quot;age&amp;quot;, &amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;nodegr&amp;quot;, &amp;quot;re74&amp;quot;, &amp;quot;re75&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h2 id="data-loading-the-lalonde-dataset">Data loading: The Lalonde Dataset&lt;/h2>
&lt;p>The Lalonde dataset comes from the &lt;strong>National Supported Work (NSW) Demonstration&lt;/strong>, a randomized employment program conducted in the 1970s in the United States. Eligible applicants &amp;mdash; mostly disadvantaged workers with limited employment histories &amp;mdash; were randomly assigned to receive job training (treatment) or not (control). The dataset records each participant&amp;rsquo;s demographics, prior earnings, and post-program earnings in 1978. It has become a benchmark for testing causal inference methods because the random assignment provides a credible ground truth against which observational estimators can be compared.&lt;/p>
&lt;p>DoWhy includes the Lalonde dataset directly, so we can load it with the &lt;a href="https://www.pywhy.org/dowhy/v0.14/example_notebooks/lalonde_pandas_api.html" target="_blank" rel="noopener">&lt;code>lalonde_dataset()&lt;/code>&lt;/a> function.&lt;/p>
&lt;pre>&lt;code class="language-python">df = lalonde_dataset()
# Convert boolean treatment to integer for DoWhy compatibility
df[TREATMENT] = df[TREATMENT].astype(int)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;\nTreatment groups:&amp;quot;)
print(df[TREATMENT].value_counts().sort_index().rename({0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Training&amp;quot;}))
print(f&amp;quot;\nOutcome ({OUTCOME}) summary:&amp;quot;)
print(df[OUTCOME].describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (445, 12)
Treatment groups:
treat
Control 260
Training 185
Name: count, dtype: int64
Outcome (re78) summary:
count 445.00
mean 5300.76
std 6631.49
min 0.00
25% 0.00
50% 3701.81
75% 8124.72
max 60307.93
Name: re78, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 445 participants with 12 variables. The treatment is split into 185 individuals who received job training and 260 controls who did not. The outcome variable, real earnings in 1978 (&lt;code>re78&lt;/code>), has a mean of \$5,301 but enormous variation (standard deviation of \$6,631), ranging from \$0 to \$60,308. The median (\$3,702) is well below the mean, indicating a right-skewed distribution &amp;mdash; many participants earned little or nothing while a few earned substantially more.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory data analysis&lt;/h2>
&lt;h3 id="outcome-distribution-by-treatment-group">Outcome distribution by treatment group&lt;/h3>
&lt;p>Before any causal modeling, we compare the raw earnings distributions between training and control groups. If the training program had an effect, we expect to see higher average earnings in the training group &amp;mdash; but we cannot yet tell whether any difference is truly caused by the program or driven by pre-existing differences between the groups.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
for group, label, color in [(0, &amp;quot;Control&amp;quot;, &amp;quot;#6a9bcc&amp;quot;), (1, &amp;quot;Training&amp;quot;, &amp;quot;#d97757&amp;quot;)]:
subset = df[df[TREATMENT] == group][OUTCOME]
ax.hist(subset, bins=30, alpha=0.6, label=f&amp;quot;{label} (mean=${subset.mean():,.0f})&amp;quot;,
color=color, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(OUTCOME_LABEL)
ax.set_ylabel(&amp;quot;Count&amp;quot;)
ax.set_title(f&amp;quot;Distribution of {OUTCOME_LABEL} by Treatment Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;dowhy_outcome_by_treatment.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_outcome_by_treatment.png" alt="Distribution of 1978 earnings by treatment group. The training group shows a higher mean.">&lt;/p>
&lt;p>Both distributions are heavily right-skewed, with a large spike near zero reflecting participants who had no earnings. The training group has a higher mean (\$6,349) compared to the control group (\$4,555), a raw difference of about \$1,794. However, both distributions overlap substantially, and the spike at zero is present in both groups, indicating that many participants struggled to find employment regardless of training.&lt;/p>
&lt;h3 id="covariate-balance">Covariate balance&lt;/h3>
&lt;p>In a randomized experiment, we expect the covariates to be balanced across treatment and control groups. Under randomization, the naive difference-in-means is &lt;strong>unbiased&lt;/strong> for the ATE in expectation &amp;mdash; but with a finite sample of 445 observations, chance imbalances can still arise and reduce the precision of the estimate. Checking covariate balance helps us assess whether such imbalances exist and whether covariate adjustment could improve efficiency. We first examine the categorical covariates as proportions, then use Standardized Mean Differences to assess balance across all covariates on a common scale.&lt;/p>
&lt;h4 id="categorical-covariates">Categorical covariates&lt;/h4>
&lt;p>The four binary covariates &amp;mdash; &lt;code>black&lt;/code>, &lt;code>hisp&lt;/code>, &lt;code>married&lt;/code>, and &lt;code>nodegr&lt;/code> (no high school degree) &amp;mdash; indicate demographic group membership. Comparing their proportions across treatment and control groups reveals whether random assignment produced balanced groups on these characteristics.&lt;/p>
&lt;pre>&lt;code class="language-python">categorical_vars = [&amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;nodegr&amp;quot;]
cat_means = df.groupby(TREATMENT)[categorical_vars].mean()
fig, ax = plt.subplots(figsize=(8, 5))
x = np.arange(len(categorical_vars))
width = 0.35
ax.bar(x - width / 2, cat_means.loc[0], width, label=&amp;quot;Control&amp;quot;,
color=&amp;quot;#6a9bcc&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.bar(x + width / 2, cat_means.loc[1], width, label=&amp;quot;Training&amp;quot;,
color=&amp;quot;#d97757&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xticks(x)
ax.set_xticklabels(categorical_vars, rotation=45, ha=&amp;quot;right&amp;quot;)
ax.set_ylabel(&amp;quot;Proportion&amp;quot;)
ax.set_ylim(0, 1)
ax.set_title(&amp;quot;Covariate Balance: Categorical Variables&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;dowhy_covariate_balance_categorical.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_covariate_balance_categorical.png" alt="Proportions of categorical covariates for control and training groups. Both groups show similar demographic composition.">&lt;/p>
&lt;p>The categorical covariates are well balanced across treatment and control groups, consistent with random assignment. The sample is predominantly Black (83%), and 78% of participants lack a high school diploma, reflecting the disadvantaged population targeted by the NSW program. Hispanic and married proportions are low in both groups (roughly 6% and 16%, respectively), with no meaningful differences between treatment arms.&lt;/p>
&lt;h4 id="covariate-balance-standardized-mean-differences">Covariate balance: Standardized Mean Differences&lt;/h4>
&lt;p>Comparing raw group means can be misleading when covariates are measured on different scales. Suppose the control group earns \$500 more in prior earnings (&lt;code>re74&lt;/code>) than the training group, and is also 1 year older on average. Which imbalance is larger? The raw numbers cannot answer this question &amp;mdash; \$500 sounds like a lot, but prior earnings vary by thousands of dollars across individuals, so a \$500 gap may be trivial relative to the spread. A 1-year age difference sounds small, but if most participants are clustered around age 25, that gap may represent a meaningful shift in the distribution.&lt;/p>
&lt;p>The &lt;strong>Standardized Mean Difference (SMD)&lt;/strong> resolves this by asking: &lt;em>how many standard deviations apart are the treatment and control groups on each covariate?&lt;/em> For each variable, we compute the difference in group means and divide by the pooled standard deviation. This converts every covariate &amp;mdash; whether binary, measured in years, or measured in dollars &amp;mdash; to the same unitless scale, making imbalances directly comparable:&lt;/p>
&lt;p>$$\text{SMD} = \frac{\bar{X}_{treated} - \bar{X}_{control}}{\sqrt{(s^2_{treated} + s^2_{control}) \,/\, 2}}$$&lt;/p>
&lt;p>An absolute SMD below 0.1 is the conventional threshold for &amp;ldquo;good balance&amp;rdquo; (&lt;a href="https://doi.org/10.1002/sim.3697" target="_blank" rel="noopener">Austin, 2011&lt;/a>). Values above 0.1 signal that the groups differ by more than one-tenth of a standard deviation on that variable &amp;mdash; enough to potentially confound the treatment effect estimate. A &lt;a href="https://doi.org/10.1002/sim.3697" target="_blank" rel="noopener">&lt;strong>Love plot&lt;/strong>&lt;/a> displays the absolute SMD for all covariates as horizontal bars, with a dashed line at the 0.1 threshold. Bars in steel blue fall below the threshold (balanced), while bars in warm orange exceed it (imbalanced).&lt;/p>
&lt;pre>&lt;code class="language-python"># Standardized Mean Difference (SMD) for all covariates
treated = df[df[TREATMENT] == 1]
control = df[df[TREATMENT] == 0]
smd_values = {}
for var in COVARIATES:
diff = treated[var].mean() - control[var].mean()
pooled_sd = np.sqrt((treated[var].std()**2 + control[var].std()**2) / 2)
smd_values[var] = diff / pooled_sd
smd_df = pd.DataFrame({&amp;quot;variable&amp;quot;: list(smd_values.keys()),
&amp;quot;smd&amp;quot;: list(smd_values.values())})
smd_df[&amp;quot;abs_smd&amp;quot;] = smd_df[&amp;quot;smd&amp;quot;].abs()
smd_df = smd_df.sort_values(&amp;quot;abs_smd&amp;quot;)
fig, ax = plt.subplots(figsize=(8, 5))
colors = [&amp;quot;#6a9bcc&amp;quot; if v &amp;lt; 0.1 else &amp;quot;#d97757&amp;quot; for v in smd_df[&amp;quot;abs_smd&amp;quot;]]
ax.barh(smd_df[&amp;quot;variable&amp;quot;], smd_df[&amp;quot;abs_smd&amp;quot;], color=colors,
edgecolor=&amp;quot;white&amp;quot;, height=0.6)
ax.axvline(0.1, color=&amp;quot;#141413&amp;quot;, linewidth=1, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;SMD = 0.1 threshold&amp;quot;)
ax.set_xlabel(&amp;quot;Absolute Standardized Mean Difference&amp;quot;)
ax.set_title(&amp;quot;Covariate Balance: Love Plot (All Covariates)&amp;quot;)
ax.legend(loc=&amp;quot;lower right&amp;quot;)
plt.savefig(&amp;quot;dowhy_covariate_balance_smd.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_covariate_balance_smd.png" alt="Love plot showing standardized mean differences for all eight covariates. Most fall below the 0.1 threshold, indicating good balance.">&lt;/p>
&lt;p>The Love plot reveals a more nuanced picture than raw mean comparisons would suggest. Prior earnings (&lt;code>re74&lt;/code> and &lt;code>re75&lt;/code>) &amp;mdash; which appeared imbalanced when comparing raw means in the thousands &amp;mdash; are actually well balanced on the standardized scale (SMD &amp;lt; 0.1), because their large variances absorb the mean differences. In contrast, &lt;code>nodegr&lt;/code> shows the largest imbalance (SMD ~0.31), followed by &lt;code>hisp&lt;/code> (~0.18) and &lt;code>educ&lt;/code> (~0.14). These imbalances, despite random assignment, reflect the small sample size and the disadvantaged population targeted by NSW. Although the naive difference-in-means remains unbiased under randomization, adjusting for these chance imbalances can &lt;strong>improve the precision&lt;/strong> of the treatment effect estimate &amp;mdash; a well-known result in the experimental design literature (&lt;a href="https://doi.org/10.1214/12-AOAS583" target="_blank" rel="noopener">Lin, 2013&lt;/a>; &lt;a href="https://doi.org/10.1214/08-AOAS171" target="_blank" rel="noopener">Freedman, 2008&lt;/a>).&lt;/p>
&lt;h2 id="the-causal-inference-problem">The causal inference problem&lt;/h2>
&lt;h3 id="ate-vs-att-two-different-causal-questions">ATE vs ATT: Two different causal questions&lt;/h3>
&lt;p>Before estimating the treatment effect, we need to be precise about &lt;em>which&lt;/em> causal question we are asking. There are two distinct estimands, each answering a different policy-relevant question:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Average Treatment Effect (ATE)&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What would happen if we assigned treatment to a random person from the entire population?&amp;rdquo;&lt;/em> The ATE averages the treatment effect over &lt;strong>everyone&lt;/strong> &amp;mdash; both the treated and the untreated:&lt;/li>
&lt;/ul>
&lt;p>$$\text{ATE} = E[Y(1) - Y(0)]$$&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What was the effect of treatment for those who actually received it?&amp;quot;&lt;/em> The ATT averages the treatment effect only over the &lt;strong>treated&lt;/strong> subpopulation:&lt;/li>
&lt;/ul>
&lt;p>$$\text{ATT} = E[Y(1) - Y(0) \mid T = 1]$$&lt;/p>
&lt;p>The distinction matters because the people who receive treatment may differ systematically from those who do not. If the training program helps disadvantaged workers the most, and disadvantaged workers are more likely to enroll, then the ATT (the effect on those who enrolled) will be larger than the ATE (the effect if we enrolled everyone at random). Conversely, if the program is most effective for workers who are &lt;em>least&lt;/em> likely to enroll, the ATE could exceed the ATT.&lt;/p>
&lt;p>&lt;strong>In this tutorial, we estimate the ATE&lt;/strong> &amp;mdash; the average effect of the NSW job training program across the entire study population. This is the natural estimand for a randomized experiment where we want to evaluate the program&amp;rsquo;s overall impact. Four of our five estimation methods (regression adjustment, IPW, AIPW, and propensity score stratification) target the ATE directly. The exception is &lt;strong>propensity score matching&lt;/strong>, which discards unmatched control units and therefore shifts the estimand toward the ATT &amp;mdash; we flag this distinction when we discuss the matching results.&lt;/p>
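&lt;p>A quick toy simulation (synthetic data, not the Lalonde sample) makes the gap concrete: when the individual effect is largest for the workers most likely to enroll, the ATT exceeds the ATE:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 100_000
disadvantage = rng.normal(size=n)           # latent worker characteristic
tau = 0.5 + 0.4 * disadvantage              # individual effect: larger if disadvantaged
p_enroll = 1 / (1 + np.exp(-disadvantage))  # the disadvantaged also enroll more often
t = rng.binomial(1, p_enroll)               # realized treatment

ate = tau.mean()          # averaged over everyone: about 0.50
att = tau[t == 1].mean()  # averaged over enrollees: about 0.67
&lt;/code>&lt;/pre>
&lt;p>Reversing the sign of the correlation between effect size and enrollment probability produces the opposite ordering.&lt;/p>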
&lt;h3 id="why-simple-comparisons-can-mislead">Why simple comparisons can mislead&lt;/h3>
&lt;p>A naive approach to estimating the treatment effect is to compute the difference in mean outcomes between the training and control groups. This gives us the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>:&lt;/p>
&lt;p>$$\text{ATE}_{naive} = \bar{Y}_{treated} - \bar{Y}_{control}$$&lt;/p>
&lt;p>While this is a natural starting point and is &lt;strong>unbiased in expectation&lt;/strong> under randomization, it can be imprecise when finite-sample covariate imbalances exist. Adjusting for covariates that predict the outcome can sharpen the estimate. In observational studies, the problem is more severe &amp;mdash; without adjustment, the naive estimator can be genuinely biased by confounding.&lt;/p>
&lt;pre>&lt;code class="language-python">mean_treated = df[df[TREATMENT] == 1][OUTCOME].mean()
mean_control = df[df[TREATMENT] == 0][OUTCOME].mean()
naive_ate = mean_treated - mean_control
print(f&amp;quot;Mean earnings (Training): ${mean_treated:,.2f}&amp;quot;)
print(f&amp;quot;Mean earnings (Control): ${mean_control:,.2f}&amp;quot;)
print(f&amp;quot;Naive ATE (difference): ${naive_ate:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Mean earnings (Training): $6,349.14
Mean earnings (Control): $4,554.80
Naive ATE (difference): $1,794.34
&lt;/code>&lt;/pre>
&lt;p>The naive estimate suggests that training increases earnings by \$1,794 on average. Under randomization, this estimate is unbiased in expectation, but the finite-sample covariate imbalances we observed earlier (particularly in &lt;code>nodegr&lt;/code>, &lt;code>hisp&lt;/code>, and &lt;code>educ&lt;/code>) mean that covariate adjustment can sharpen the estimate and account for chance differences between groups. This is where DoWhy&amp;rsquo;s structured framework helps &amp;mdash; it forces us to explicitly model our causal assumptions, identify the correct estimand, apply rigorous estimation methods, and test whether the results hold up under scrutiny.&lt;/p>
&lt;h2 id="step-1-model-----define-the-causal-graph">Step 1: Model &amp;mdash; Define the causal graph&lt;/h2>
&lt;p>The first step in DoWhy&amp;rsquo;s framework is to encode our &lt;strong>domain knowledge&lt;/strong> as a causal graph &amp;mdash; a Directed Acyclic Graph (DAG) that specifies which variables cause which. In our case, the covariates (age, education, race, prior earnings, etc.) are &lt;strong>common causes&lt;/strong> of both treatment assignment and the outcome. Even in a randomized experiment, these covariates predict the outcome and adjusting for them improves precision, so we include them in the model. This also makes the tutorial directly applicable to observational settings where these variables are genuine confounders.&lt;/p>
&lt;h3 id="what-is-a-dag">What is a DAG?&lt;/h3>
&lt;p>A &lt;strong>Directed Acyclic Graph&lt;/strong> is the formal language of causal inference. Each word in the name carries meaning:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Directed&lt;/strong> &amp;mdash; every edge is an arrow pointing from cause to effect. If age affects earnings, we draw an arrow from &lt;code>age&lt;/code> to &lt;code>re78&lt;/code>, never the reverse.&lt;/li>
&lt;li>&lt;strong>Acyclic&lt;/strong> &amp;mdash; there are no feedback loops. You cannot follow the arrows and return to where you started. This rules out simultaneous causation (e.g., &amp;ldquo;A causes B and B causes A at the same time&amp;rdquo;), which requires more advanced models.&lt;/li>
&lt;li>&lt;strong>Graph&lt;/strong> &amp;mdash; variables are &lt;strong>nodes&lt;/strong> (circles or squares) and causal relationships are &lt;strong>edges&lt;/strong> (arrows). The full picture is a map of which variables drive which.&lt;/li>
&lt;/ul>
&lt;p>The DAG is not a statistical model &amp;mdash; it encodes &lt;em>qualitative&lt;/em> assumptions about the data-generating process before we look at a single number. Its power lies in what it tells us about which variables to adjust for and which to leave alone.&lt;/p>
&lt;h3 id="types-of-variables-in-a-causal-graph">Types of variables in a causal graph&lt;/h3>
&lt;p>Not all variables play the same role. Understanding the three fundamental types is essential for deciding what to control for:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
C[&amp;quot;&amp;lt;b&amp;gt;Confounder&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., prior earnings)&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| T[&amp;quot;Treatment&amp;quot;]
C --&amp;gt;|&amp;quot;affects&amp;quot;| Y[&amp;quot;Outcome&amp;quot;]
T -.-&amp;gt;|&amp;quot;causal effect&amp;quot;| Y
style C fill:#00d4c8,stroke:#141413,color:#fff
style T fill:#6a9bcc,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Confounders&lt;/strong> (common causes) &amp;mdash; A variable that affects &lt;em>both&lt;/em> the treatment and the outcome. For example, prior earnings (&lt;code>re74&lt;/code>) may influence whether someone enrolls in training &lt;em>and&lt;/em> how much they earn later. Confounders create a spurious association between treatment and outcome. &lt;strong>You must adjust for confounders&lt;/strong> to isolate the causal effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph LR
T[&amp;quot;Treatment&amp;quot;] --&amp;gt;|&amp;quot;causes&amp;quot;| M[&amp;quot;&amp;lt;b&amp;gt;Mediator&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., skills)&amp;quot;]
M --&amp;gt;|&amp;quot;causes&amp;quot;| Y[&amp;quot;Outcome&amp;quot;]
style T fill:#6a9bcc,stroke:#141413,color:#fff
style M fill:#00d4c8,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Mediators&lt;/strong> &amp;mdash; A variable that lies &lt;em>on&lt;/em> the causal path from treatment to outcome. For example, if job training increases skills, and skills increase earnings, then &lt;code>skills&lt;/code> is a mediator. &lt;strong>You should NOT adjust for mediators&lt;/strong> &amp;mdash; doing so would block the very causal pathway you are trying to measure, attenuating or eliminating the estimated effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD
T[&amp;quot;Treatment&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| Col[&amp;quot;&amp;lt;b&amp;gt;Collider&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., in_survey)&amp;quot;]
Y[&amp;quot;Outcome&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| Col
T -.-&amp;gt;|&amp;quot;causal effect&amp;quot;| Y
style T fill:#6a9bcc,stroke:#141413,color:#fff
style Col fill:#00d4c8,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Collider&lt;/strong> &amp;mdash; A variable that is &lt;em>caused by&lt;/em> both the treatment and the outcome (or by variables on both sides). For example, if both training and high earnings make someone likely to appear in a follow-up survey, then &lt;code>in_survey&lt;/code> is a collider. &lt;strong>You should NOT condition on colliders&lt;/strong> &amp;mdash; doing so can create a spurious association between treatment and outcome even where none exists (a phenomenon called &lt;em>collider bias&lt;/em> or &lt;em>selection bias&lt;/em>).&lt;/li>
&lt;/ul>
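&lt;p>The collider warning is easy to verify in a few lines. The following simulation (synthetic data, not part of the original analysis) randomizes a treatment that has &lt;em>no&lt;/em> effect on the outcome, then conditions on a survey-inclusion variable caused by both:&lt;/p>

```python
# Simulation: treatment is randomized and has NO effect on the outcome,
# yet conditioning on a collider manufactures an association
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
treat = rng.binomial(1, 0.5, n)              # randomized treatment
outcome = rng.normal(0.0, 1.0, n)            # outcome unaffected by treatment
# the collider (e.g., appearing in a follow-up survey) is caused by BOTH
in_survey = (treat + outcome + rng.normal(0.0, 1.0, n)) > 1.0

# full sample: mean difference is near zero, as it should be
full_diff = outcome[treat == 1].mean() - outcome[treat == 0].mean()

# conditioning on the collider induces a spurious negative association
t_sel, y_sel = treat[in_survey], outcome[in_survey]
sel_diff = y_sel[t_sel == 1].mean() - y_sel[t_sel == 0].mean()
print(f'full sample: {full_diff:.3f}   within survey: {sel_diff:.3f}')
```

&lt;p>In the full sample the difference is essentially zero; within the selected subsample a clear negative association appears purely from conditioning on the collider.&lt;/p>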
&lt;p>In the Lalonde dataset, all eight covariates (age, education, two race indicators, marital status, degree status, and two years of prior earnings) are measured &lt;em>before&lt;/em> treatment assignment, so they can only be confounders &amp;mdash; they cannot be mediators or colliders. This makes the graph straightforward: every covariate points to both &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>.&lt;/p>
&lt;p>The causal structure we assume is:&lt;/p>
&lt;ul>
&lt;li>Each covariate (age, educ, black, hisp, married, nodegr, re74, re75) affects both treatment assignment and earnings&lt;/li>
&lt;li>Treatment (&lt;code>treat&lt;/code>) affects the outcome (&lt;code>re78&lt;/code>)&lt;/li>
&lt;li>No covariate is itself caused by the treatment (pre-treatment variables)&lt;/li>
&lt;/ul>
&lt;p>We now create the &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel" target="_blank" rel="noopener">&lt;code>CausalModel&lt;/code>&lt;/a> in DoWhy, specifying the treatment, outcome, and common causes. The model object stores the data, the causal graph, and metadata that DoWhy will use in subsequent steps to determine the correct adjustment strategy.&lt;/p>
&lt;pre>&lt;code class="language-python">model = CausalModel(
data=df,
treatment=TREATMENT,
outcome=OUTCOME,
common_causes=COVARIATES,
)
print(&amp;quot;CausalModel created successfully.&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>CausalModel created successfully.
&lt;/code>&lt;/pre>
&lt;p>DoWhy can visualize the causal graph it constructed using the &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.view_model" target="_blank" rel="noopener">&lt;code>view_model()&lt;/code>&lt;/a> method, which uses Graphviz to render the DAG automatically from the model&amp;rsquo;s internal graph representation:&lt;/p>
&lt;pre>&lt;code class="language-python"># Visualize the causal graph using DoWhy's built-in method
model.view_model(layout=&amp;quot;dot&amp;quot;)
from IPython.display import Image, display
display(Image(filename=&amp;quot;causal_model.png&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_causal_graph.png" alt="Causal graph generated by DoWhy showing confounders as common causes of both treatment and outcome.">&lt;/p>
&lt;p>The DAG makes our assumptions explicit: the eight covariates are common causes that affect both treatment assignment (&lt;code>treat&lt;/code>) and earnings (&lt;code>re78&lt;/code>). The arrows encode the direction of causation &amp;mdash; each confounder points to both &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>, and &lt;code>treat&lt;/code> points to &lt;code>re78&lt;/code> (the causal effect we want to estimate). By stating these assumptions as a graph, DoWhy can automatically determine which variables need to be adjusted for and which estimation strategies are valid.&lt;/p>
&lt;h2 id="step-2-identify-----find-the-causal-estimand">Step 2: Identify &amp;mdash; Find the causal estimand&lt;/h2>
&lt;p>With the causal graph defined, DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.identify_effect" target="_blank" rel="noopener">&lt;code>identify_effect()&lt;/code>&lt;/a> method uses graph theory to &lt;strong>identify&lt;/strong> the causal estimand &amp;mdash; the mathematical expression that, if computed correctly, equals the true causal effect. This step determines &lt;em>whether&lt;/em> the effect is identifiable from the data given our assumptions, and &lt;em>what&lt;/em> variables we need to condition on.&lt;/p>
&lt;h3 id="what-does-identification-mean">What does &amp;ldquo;identification&amp;rdquo; mean?&lt;/h3>
&lt;p>In causal inference, &lt;strong>identification&lt;/strong> answers a deceptively simple question: &lt;em>can we compute the causal effect from the data we have, without running a new experiment?&lt;/em> The answer is not always yes. Consider a scenario where an unmeasured variable (say, &amp;ldquo;motivation&amp;rdquo;) affects both whether someone enrolls in training and how much they earn afterward. No amount of data on age, education, and prior earnings can untangle the causal effect of training from the confounding effect of motivation &amp;mdash; the causal effect is &lt;strong>not identified&lt;/strong> without observing motivation.&lt;/p>
&lt;p>Identification is the bridge between &lt;em>causal assumptions&lt;/em> (encoded in the graph) and &lt;em>statistical computation&lt;/em> (what we can actually calculate from data). If the effect is identified, the identification step produces an &lt;strong>estimand&lt;/strong> &amp;mdash; a precise mathematical formula that tells us exactly which conditional expectations or reweightings to compute. If the effect is not identified, no estimation method can produce a credible causal estimate, no matter how sophisticated.&lt;/p>
&lt;h3 id="identification-strategies">Identification strategies&lt;/h3>
&lt;p>DoWhy checks three main strategies, each applicable in different causal structures:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">Backdoor criterion&lt;/a>&lt;/strong> &amp;mdash; The most common strategy. It applies when we can observe all confounders between treatment and outcome. By conditioning on these confounders, we &amp;ldquo;block&amp;rdquo; all backdoor paths &amp;mdash; non-causal pathways that create spurious associations. In the Lalonde example, conditioning on the eight covariates satisfies the backdoor criterion because they are the only common causes of &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_natural_experiments.html" target="_blank" rel="noopener">Instrumental variables (IV)&lt;/a>&lt;/strong> &amp;mdash; Useful when some confounders are &lt;em>unobserved&lt;/em>. An instrument is a variable that affects treatment but has &lt;em>no direct effect&lt;/em> on the outcome except through the treatment itself. For example, draft lottery numbers have been used as instruments for military service: the lottery affects whether someone serves (treatment) but has no direct effect on later earnings (outcome) except through the service itself. IV estimation requires strong assumptions but can identify causal effects when backdoor adjustment is impossible.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/index.html" target="_blank" rel="noopener">Front-door criterion&lt;/a>&lt;/strong> &amp;mdash; Applies when there is a &lt;strong>mediator&lt;/strong> that fully transmits the treatment effect and is itself unconfounded with the outcome. This strategy is rare in practice but theoretically important: it can identify causal effects even in the presence of unmeasured confounders between treatment and outcome, as long as the mediator pathway is clean.&lt;/li>
&lt;/ul>
&lt;p>A key advantage of DoWhy is that &lt;strong>it automates the identification step&lt;/strong>. Given the causal graph, DoWhy algorithmically checks which strategies are valid and returns the correct estimand. This prevents a common and dangerous mistake in applied work: manually choosing which variables to &amp;ldquo;control for&amp;rdquo; without formally checking whether the chosen adjustment set actually satisfies the conditions for causal identification.&lt;/p>
&lt;pre>&lt;code class="language-python">identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────(E[re78|educ,black,age,hisp,re75,married,re74,nodegr])
d[treat]
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78
then P(re78|treat,educ,black,age,hisp,re75,married,re74,nodegr,U)
= P(re78|treat,educ,black,age,hisp,re75,married,re74,nodegr)
&lt;/code>&lt;/pre>
&lt;p>DoWhy identifies the &lt;strong>backdoor estimand&lt;/strong> as the primary identification strategy, expressing the causal effect as the derivative of the conditional expectation of earnings with respect to treatment, conditioning on all eight covariates. The critical assumption is &lt;strong>unconfoundedness&lt;/strong> &amp;mdash; there are no unmeasured confounders beyond the ones we specified. DoWhy also checks for instrumental variable and front-door estimands but finds none applicable, which is expected given our graph structure.&lt;/p>
&lt;h2 id="step-3-estimate-----compute-the-causal-effect">Step 3: Estimate &amp;mdash; Compute the causal effect&lt;/h2>
&lt;p>With the estimand identified, we now use &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.estimate_effect" target="_blank" rel="noopener">&lt;code>estimate_effect()&lt;/code>&lt;/a> to compute the actual causal effect estimate. DoWhy supports multiple estimation methods, each with different assumptions and properties. We compare five approaches to see how robust the estimate is across methods.&lt;/p>
&lt;p>Causal estimation methods fall into &lt;strong>three broad paradigms&lt;/strong>, distinguished by what they model:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Outcome modeling&lt;/strong> (Regression Adjustment) &amp;mdash; directly models the relationship $E[Y \mid X, T]$ between covariates, treatment, and outcome. Its validity depends on correctly specifying this outcome model.&lt;/li>
&lt;li>&lt;strong>Treatment modeling&lt;/strong> (IPW, PS Stratification, PS Matching) &amp;mdash; models the treatment assignment mechanism $P(T \mid X)$ (the propensity score) and uses it to remove confounding. All three methods rely exclusively on the propensity score &amp;mdash; they differ in &lt;em>how&lt;/em> they use it (reweighting, grouping, or pairing observations) but none of them model the outcome. Their validity depends on correctly specifying the propensity score model.&lt;/li>
&lt;li>&lt;strong>Doubly robust&lt;/strong> (AIPW) &amp;mdash; the only true hybrid. It explicitly combines an outcome model $E[Y \mid X, T]$ with a propensity score model $P(T \mid X)$, and is consistent if &lt;em>either&lt;/em> model is correctly specified. This &amp;ldquo;double protection&amp;rdquo; is why it is called doubly robust.&lt;/li>
&lt;/ol>
&lt;p>The following diagram shows how these paradigms relate to the five methods we will apply:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Root[&amp;quot;&amp;lt;b&amp;gt;Estimation Methods&amp;lt;/b&amp;gt;&amp;quot;] --&amp;gt; OM[&amp;quot;&amp;lt;b&amp;gt;Outcome Modeling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models E[Y | X, T]&amp;lt;/i&amp;gt;&amp;quot;]
Root --&amp;gt; TM[&amp;quot;&amp;lt;b&amp;gt;Treatment Modeling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models P(T | X)&amp;lt;/i&amp;gt;&amp;quot;]
Root --&amp;gt; DR_cat[&amp;quot;&amp;lt;b&amp;gt;Doubly Robust&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models both E[Y | X, T]&amp;lt;br/&amp;gt;and P(T | X)&amp;lt;/i&amp;gt;&amp;quot;]
OM --&amp;gt; RA[&amp;quot;Regression&amp;lt;br/&amp;gt;Adjustment&amp;quot;]
TM --&amp;gt; IPW[&amp;quot;Inverse Probability&amp;lt;br/&amp;gt;Weighting&amp;quot;]
TM --&amp;gt; PSS[&amp;quot;PS&amp;lt;br/&amp;gt;Stratification&amp;quot;]
TM --&amp;gt; PSM[&amp;quot;PS&amp;lt;br/&amp;gt;Matching&amp;quot;]
DR_cat --&amp;gt; DR[&amp;quot;AIPW&amp;quot;]
style Root fill:#141413,stroke:#141413,color:#fff
style OM fill:#6a9bcc,stroke:#141413,color:#fff
style TM fill:#d97757,stroke:#141413,color:#fff
style DR_cat fill:#00d4c8,stroke:#141413,color:#fff
style RA fill:#6a9bcc,stroke:#141413,color:#fff
style IPW fill:#d97757,stroke:#141413,color:#fff
style PSS fill:#d97757,stroke:#141413,color:#fff
style PSM fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Understanding these paradigms helps clarify why different methods can give somewhat different estimates and why comparing across paradigms is a powerful robustness check. The key trade-offs are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What each paradigm models&lt;/strong>: Outcome modeling specifies how covariates relate to earnings ($E[Y \mid X, T]$). Treatment modeling specifies how covariates relate to treatment assignment ($P(T \mid X)$) &amp;mdash; all three PS methods use this same propensity score but differ in how they apply it. Doubly robust specifies both models simultaneously.&lt;/li>
&lt;li>&lt;strong>What each paradigm assumes&lt;/strong>: Regression adjustment requires the outcome model to be correctly specified. All three propensity score methods (IPW, stratification, matching) require the propensity score model to be correctly specified. Doubly robust only requires &lt;em>one&lt;/em> of the two to be correct.&lt;/li>
&lt;li>&lt;strong>Bias-variance characteristics&lt;/strong>: Regression adjustment tends to be low-variance but can be biased if the outcome-covariate relationship is nonlinear. IPW can have high variance when propensity scores are extreme (near 0 or 1). Stratification and matching use the propensity score more conservatively &amp;mdash; by grouping or pairing rather than directly reweighting &amp;mdash; which can reduce variance relative to IPW. Doubly robust balances both concerns but is more complex to implement.&lt;/li>
&lt;/ul>
&lt;p>The three treatment modeling methods differ in &lt;em>how&lt;/em> they use the propensity score to create balanced comparisons:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IPW&lt;/strong> reweights every observation by the inverse of its propensity score, creating a pseudo-population where treatment is independent of covariates. It uses the full sample but can be unstable when propensity scores are near 0 or 1.&lt;/li>
&lt;li>&lt;strong>PS Stratification&lt;/strong> divides observations into groups (strata) with similar propensity scores, then computes simple mean differences within each stratum. By comparing treated and control units within the same stratum, it approximates a block-randomized experiment.&lt;/li>
&lt;li>&lt;strong>PS Matching&lt;/strong> pairs each treated unit with the control unit that has the most similar propensity score, then computes mean differences within matched pairs. It discards unmatched observations, focusing on the closest comparisons at the cost of reduced sample size.&lt;/li>
&lt;/ul>
&lt;p>None of these methods model the outcome &amp;mdash; they all achieve confounding adjustment purely through the propensity score. If outcome modeling and treatment modeling agree, we can be more confident that neither model is badly misspecified.&lt;/p>
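&lt;p>Because all three treatment-modeling methods share the same first stage, a quick overlap check on the estimated propensity scores is a useful diagnostic before choosing among them. Here is a minimal sketch on synthetic stand-in data (the variables and coefficients are illustrative, not the NSW sample):&lt;/p>

```python
# Sketch: the shared first stage of IPW, stratification, and matching is a
# propensity score model; always check its overlap (synthetic data)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2_000
X = rng.normal(size=(n, 3))                  # stand-ins for the covariates
logit = 0.5 * X[:, 0] - 0.3 * X[:, 1]        # true assignment mechanism
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# extreme scores (near 0 or 1) make the inverse weights explode
print(f'propensity range: [{ps.min():.3f}, {ps.max():.3f}]')
print(f'share below 0.05 or above 0.95: {np.mean(np.abs(ps - 0.5) > 0.45):.3f}')
```

&lt;p>When a non-trivial share of scores falls outside roughly [0.05, 0.95], IPW becomes unstable and trimming, stratification, or matching become more attractive.&lt;/p>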
&lt;h3 id="method-1-regression-adjustment">Method 1: Regression Adjustment&lt;/h3>
&lt;p>Regression adjustment is grounded in the &lt;strong>potential outcomes framework&lt;/strong>: each individual has two potential outcomes &amp;mdash; $Y(1)$ if treated and $Y(0)$ if not &amp;mdash; and the causal effect is their difference. Since we only observe one outcome per person, regression adjustment estimates both potential outcomes by modeling $E[Y \mid X, T]$, the conditional expectation of the outcome given covariates and treatment status. The treatment effect is the coefficient on the treatment indicator, which captures the difference in expected outcomes between treated and control units &lt;strong>at the same covariate values&lt;/strong> &amp;mdash; effectively comparing apples to apples.&lt;/p>
&lt;p>The key assumption is that the outcome model must be &lt;strong>correctly specified&lt;/strong>. If the true relationship between covariates and the outcome is nonlinear or includes interactions, a simple linear model will produce biased estimates. In econometrics, this approach is closely related to the &lt;strong>&lt;a href="https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem" target="_blank" rel="noopener">Frisch-Waugh-Lovell theorem&lt;/a>&lt;/strong>, which shows that the treatment coefficient in a multiple regression is identical to what you would get by first partialing out the covariates from both the treatment and the outcome, then regressing the residuals on each other. This makes regression adjustment the simplest and most transparent baseline estimator.&lt;/p>
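&lt;p>The Frisch-Waugh-Lovell equivalence is easy to verify numerically. In this sketch (synthetic data with a known effect of 2.0), the treatment coefficient from the full multiple regression matches the coefficient from regressing the residualized outcome on the residualized treatment:&lt;/p>

```python
# Numerical check of Frisch-Waugh-Lovell on synthetic data (true effect = 2.0)
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))                  # covariates
T = 0.5 * X[:, 0] + rng.normal(size=n)       # treatment depends on X
Y = 2.0 * T + X @ np.array([1.0, -1.0]) + rng.normal(size=n)

# (a) coefficient on T in the full multiple regression of Y on [1, T, X]
Z = np.column_stack([np.ones(n), T, X])
tau_multiple = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# (b) FWL: partial [1, X] out of both T and Y, then regress residual on residual
W = np.column_stack([np.ones(n), X])
resid = lambda v: v - W @ np.linalg.lstsq(W, v, rcond=None)[0]
rT, rY = resid(T), resid(Y)
tau_fwl = np.dot(rT, rY) / np.dot(rT, rT)

print(f'multiple: {tau_multiple:.6f}   FWL: {tau_fwl:.6f}')
```

&lt;p>The two numbers agree to machine precision, which is exactly what the theorem guarantees.&lt;/p>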
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.linear_regression&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ra = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.linear_regression&amp;quot;,
confidence_intervals=True,
)
print(f&amp;quot;Estimated ATE (Regression Adjustment): ${estimate_ra.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (Regression Adjustment): $1,676.34
&lt;/code>&lt;/pre>
&lt;p>The regression adjustment estimate is \$1,676, slightly lower than the naive difference of \$1,794. The reduction from \$1,794 to \$1,676 reflects the covariate adjustment &amp;mdash; by accounting for finite-sample imbalances in age, education, race, and prior earnings, the estimated treatment effect shrinks by about \$118. In this randomized setting, the adjustment primarily improves precision rather than removing bias, but the same technique is essential in observational studies where confounding is a genuine concern.&lt;/p>
&lt;h3 id="method-2-inverse-probability-weighting-ipw">Method 2: Inverse Probability Weighting (IPW)&lt;/h3>
&lt;p>IPW takes a fundamentally different approach from regression adjustment. Instead of modeling the outcome, it models the &lt;strong>treatment assignment mechanism&lt;/strong>. The central concept is the &lt;strong>propensity score&lt;/strong>, $e(X) = P(T = 1 \mid X)$ &amp;mdash; the probability that a unit receives treatment given its observed covariates. A person with a propensity score of 0.8 has an 80% chance of being treated based on their characteristics; a person with a score of 0.2 has only a 20% chance.&lt;/p>
&lt;p>The key intuition behind inverse weighting is that &lt;strong>units who are unlikely to receive the treatment they actually received carry more information&lt;/strong> about the causal effect. Consider a treated individual with a low propensity score (say 0.1) &amp;mdash; this person was unlikely to be treated, yet was treated. Their outcome is especially informative because they are &amp;ldquo;similar&amp;rdquo; to the control group in all observable respects. IPW upweights such surprising cases by assigning them a weight of $1/e(X) = 10$, while a treated person with $e(X) = 0.9$ receives a weight of only $1/0.9 \approx 1.1$. This reweighting creates a &amp;ldquo;pseudo-population&amp;rdquo; in which treatment assignment is independent of the observed confounders, mimicking what a randomized experiment would look like.&lt;/p>
&lt;p>A critical contrast with regression adjustment: IPW makes &lt;strong>no assumptions about how covariates relate to the outcome&lt;/strong> &amp;mdash; it only requires that the propensity score model is correctly specified. However, IPW has a key vulnerability: when propensity scores are extreme (near 0 or 1), the inverse weights become very large, producing &lt;strong>unstable estimates with high variance&lt;/strong>. This is why practitioners often use weight trimming or stabilized weights in practice.&lt;/p>
&lt;p>The IPW estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{T_i Y_i}{\hat{e}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{e}(X_i)} \right]$$&lt;/p>
&lt;p>where $\hat{e}(X_i)$ is the estimated propensity score for individual $i$.&lt;/p>
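&lt;p>Before turning to DoWhy, the formula above can be implemented directly. This sketch uses synthetic data with a single confounder and a known effect of \$1,500, so the naive difference is visibly biased while the IPW estimate recovers the truth (all names and numbers are illustrative, not the NSW sample):&lt;/p>

```python
# Illustrative Horvitz-Thompson IPW on synthetic data (true ATE = 1500)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)                         # a single confounder
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))  # treatment depends on x
Y = 1500.0 * T + 1000.0 * x + rng.normal(0.0, 500.0, n)

naive = Y[T == 1].mean() - Y[T == 0].mean()    # confounded comparison

X1 = x.reshape(-1, 1)
ps = LogisticRegression(max_iter=1000).fit(X1, T).predict_proba(X1)[:, 1]
tau_ipw = np.mean(T * Y / ps - (1 - T) * Y / (1 - ps))

print(f'naive: {naive:,.0f}   IPW: {tau_ipw:,.0f}')
```

&lt;p>The naive difference overstates the effect because high-$x$ individuals are both more likely to be treated and higher earners; the inverse weights undo that imbalance.&lt;/p>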
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_weighting&lt;/code>&lt;/a> method, which implements the &lt;a href="https://doi.org/10.1080/01621459.1952.10483446" target="_blank" rel="noopener">Horvitz-Thompson&lt;/a> inverse probability estimator:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ipw = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.propensity_score_weighting&amp;quot;,
method_params={&amp;quot;weighting_scheme&amp;quot;: &amp;quot;ips_weight&amp;quot;},
)
print(f&amp;quot;Estimated ATE (IPW): ${estimate_ipw.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (IPW): $1,559.47
&lt;/code>&lt;/pre>
&lt;p>The IPW estimate of \$1,559 is the lowest among all methods. IPW is sensitive to extreme propensity scores &amp;mdash; when some individuals have very high or very low probabilities of treatment, their weights become large and can dominate the estimate. In this dataset, the estimated propensity scores are reasonably well-behaved (the NSW was a randomized experiment), so the IPW estimate remains in the plausible range. The difference from the regression adjustment (\$1,676 vs \$1,559) reflects the fact that IPW makes no assumptions about the outcome model, relying entirely on correct specification of the propensity score model.&lt;/p>
&lt;h3 id="method-3-doubly-robust-aipw">Method 3: Doubly Robust (AIPW)&lt;/h3>
&lt;p>The &lt;strong>doubly robust&lt;/strong> estimator &amp;mdash; also called &lt;strong>Augmented Inverse Probability Weighting (AIPW)&lt;/strong> &amp;mdash; combines both regression adjustment and IPW into a single estimator. The key advantage is that the estimate is consistent if &lt;em>either&lt;/em> the outcome model &lt;em>or&lt;/em> the propensity score model is correctly specified (hence &amp;ldquo;doubly robust&amp;rdquo;). This provides an important safeguard against model misspecification.&lt;/p>
&lt;p>The intuition is straightforward: AIPW starts with the regression adjustment estimate ($\hat{\mu}_1(X) - \hat{\mu}_0(X)$, the difference in predicted outcomes under treatment and control) and then &lt;strong>adds a correction term&lt;/strong> based on the IPW-weighted prediction errors. If the outcome model is perfectly specified, the prediction errors $Y - \hat{\mu}(X)$ are pure noise and the correction averages to zero &amp;mdash; the regression adjustment alone does the work. If the outcome model is misspecified but the propensity score model is correct, the IPW-weighted correction term exactly compensates for the bias in the outcome predictions. This is why the estimator only needs &lt;strong>one&lt;/strong> of the two models to be correct &amp;mdash; whichever model is right &amp;ldquo;rescues&amp;rdquo; the other.&lt;/p>
&lt;p>Beyond its robustness property, AIPW achieves the &lt;strong>semiparametric efficiency bound&lt;/strong> when both models are correctly specified, meaning no other estimator that makes the same assumptions can have lower variance. This makes it a natural default choice in modern causal inference.&lt;/p>
&lt;p>The AIPW estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$$&lt;/p>
&lt;p>where $\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ are the predicted outcomes under treatment and control, and $\hat{e}(X_i)$ is the propensity score.&lt;/p>
&lt;p>We implement the AIPW estimator manually rather than using DoWhy&amp;rsquo;s built-in &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/index.html" target="_blank" rel="noopener">&lt;code>backdoor.doubly_robust&lt;/code>&lt;/a> method, which has a known compatibility issue with recent scikit-learn versions. The manual implementation uses &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" rel="noopener">&lt;code>LogisticRegression&lt;/code>&lt;/a> for the propensity score model and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" target="_blank" rel="noopener">&lt;code>LinearRegression&lt;/code>&lt;/a> for the outcome model, making the estimator&amp;rsquo;s two-component structure fully transparent.&lt;/p>
&lt;pre>&lt;code class="language-python"># Doubly Robust (AIPW) — manual implementation
ps_model = LogisticRegression(max_iter=1000, random_state=42)
ps_model.fit(df[COVARIATES], df[TREATMENT])
ps = ps_model.predict_proba(df[COVARIATES])[:, 1]
outcome_model_1 = SklearnLR().fit(df[df[TREATMENT] == 1][COVARIATES], df[df[TREATMENT] == 1][OUTCOME])
outcome_model_0 = SklearnLR().fit(df[df[TREATMENT] == 0][COVARIATES], df[df[TREATMENT] == 0][OUTCOME])
mu1 = outcome_model_1.predict(df[COVARIATES])
mu0 = outcome_model_0.predict(df[COVARIATES])
T = df[TREATMENT].values
Y = df[OUTCOME].values
dr_ate = np.mean(
(mu1 - mu0)
+ T * (Y - mu1) / ps
- (1 - T) * (Y - mu0) / (1 - ps)
)
print(f&amp;quot;Estimated ATE (Doubly Robust): ${dr_ate:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (Doubly Robust): $1,620.04
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimate of \$1,620 falls between the regression adjustment (\$1,676) and IPW (\$1,559) estimates. This reflects how the AIPW estimator works: it uses the outcome model as its primary estimate and adds an IPW-weighted correction based on the prediction residuals. The fact that it is close to both individual estimates suggests that neither model is severely misspecified. In practice, the doubly robust estimator is often the preferred choice because it provides insurance against misspecification of either component model.&lt;/p>
&lt;h3 id="method-4-propensity-score-stratification">Method 4: Propensity Score Stratification&lt;/h3>
&lt;p>Propensity score stratification builds on a powerful result from &lt;a href="https://doi.org/10.1093/biomet/70.1.41" target="_blank" rel="noopener">Rosenbaum and Rubin (1983)&lt;/a>: &lt;strong>conditioning on the scalar propensity score is sufficient to remove all confounding from observed covariates&lt;/strong>, even though the score compresses multiple covariates into a single number. This means that within a group of individuals who all have similar propensity scores, treatment assignment is effectively random with respect to the observed confounders &amp;mdash; just as in a randomized experiment.&lt;/p>
&lt;p>Stratification is a &lt;strong>discrete approximation&lt;/strong> to this idea. Instead of conditioning on the exact propensity score (which would require infinite data), we bin observations into a small number of strata &amp;mdash; typically 5 quintiles. Within each stratum, treated and control individuals have similar propensity scores and are therefore more comparable, so the within-stratum treatment effect is less confounded. The overall ATE is a weighted average of these stratum-specific effects. A classic result from &lt;a href="https://doi.org/10.2307/2528036" target="_blank" rel="noopener">Cochran (1968)&lt;/a> shows that &lt;strong>5 strata typically remove over 90% of the bias&lt;/strong> from observed confounders, making this a surprisingly effective yet simple approach.&lt;/p>
&lt;p>There is a practical trade-off in choosing the number of strata: more strata produce finer covariate balance within each group, reducing bias, but also leave fewer observations per stratum, increasing variance. Five strata is the conventional choice, balancing these considerations well.&lt;/p>
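&lt;p>The stratification recipe is simple enough to hand-roll. This sketch (synthetic data, true effect \$1,500, illustrative names) bins estimated propensity scores into quintiles with &lt;code>pd.qcut&lt;/code>, takes within-stratum mean differences, and averages them by stratum size:&lt;/p>

```python
# Illustrative propensity score stratification (synthetic data, true ATE = 1500)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
Y = 1500.0 * T + 1000.0 * x + rng.normal(0.0, 500.0, n)

X1 = x.reshape(-1, 1)
ps = LogisticRegression(max_iter=1000).fit(X1, T).predict_proba(X1)[:, 1]

df = pd.DataFrame({'T': T, 'Y': Y, 'stratum': pd.qcut(ps, q=5, labels=False)})
m1 = df[df['T'] == 1].groupby('stratum')['Y'].mean()   # treated means by stratum
m0 = df[df['T'] == 0].groupby('stratum')['Y'].mean()   # control means by stratum
weights = df['stratum'].value_counts(normalize=True).sort_index()
tau_strat = float(((m1 - m0) * weights).sum())
print(f'stratified ATE: {tau_strat:,.0f}')
```

&lt;p>Even with only five bins, most of the confounding from $x$ is removed, in line with Cochran&amp;rsquo;s result.&lt;/p>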
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_stratification&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ps_strat = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.propensity_score_stratification&amp;quot;,
method_params={&amp;quot;num_strata&amp;quot;: 5, &amp;quot;clipping_threshold&amp;quot;: 5},
)
print(f&amp;quot;Estimated ATE (PS Stratification): ${estimate_ps_strat.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (PS Stratification): $1,617.07
&lt;/code>&lt;/pre>
&lt;p>Propensity score stratification with 5 strata estimates the ATE at \$1,617, very close to the doubly robust estimate (\$1,620). The stratification approach is more flexible than regression adjustment because it does not impose a functional form on the outcome-covariate relationship. The estimate is in the same ballpark as the other adjusted results, which is reassuring &amp;mdash; multiple methods agree that the training effect is in the \$1,550&amp;ndash;\$1,700 range.&lt;/p>
&lt;h3 id="method-5-propensity-score-matching">Method 5: Propensity Score Matching&lt;/h3>
&lt;p>Propensity score matching constructs a comparison group by finding, for each treated individual, the control individual(s) with the most similar propensity score. The treatment effect is then estimated by comparing outcomes within these matched pairs. This is conceptually the most intuitive approach &amp;mdash; it directly mimics what we would see if we could compare individuals who are identical except for their treatment status.&lt;/p>
&lt;p>An important subtlety is that matching typically &lt;strong>discards unmatched control units&lt;/strong> &amp;mdash; those with no treated counterpart nearby in propensity score space. This means the estimand shifts from the &lt;strong>Average Treatment Effect (ATE)&lt;/strong> toward the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong>, which answers a slightly different question: &amp;ldquo;What was the effect of treatment for those who were actually treated?&amp;rdquo; rather than &amp;ldquo;What would the effect be if we treated everyone?&amp;rdquo;&lt;/p>
&lt;p>Several practical choices affect matching quality. &lt;strong>With-replacement&lt;/strong> matching allows each control to be matched to multiple treated units, reducing bias but increasing variance. &lt;strong>1:k matching&lt;/strong> uses $k$ nearest controls per treated unit, averaging out noise but potentially introducing worse matches. &lt;strong>Caliper restrictions&lt;/strong> discard matches where the propensity score difference exceeds a threshold, preventing poor matches at the cost of losing some treated observations. These choices create a fundamental &lt;strong>bias-variance trade-off&lt;/strong>: tighter matching criteria reduce bias from imperfect comparisons but may discard many observations, increasing the variance of the estimate.&lt;/p>
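&lt;p>A bare-bones version of 1:1 nearest-neighbor matching with replacement can be written with scikit-learn alone. The sketch below (synthetic data, true effect \$1,500, illustrative names) matches each treated unit to the control with the closest estimated propensity score and averages the within-pair differences:&lt;/p>

```python
# Illustrative 1:1 propensity score matching with replacement (true effect = 1500)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
n = 4_000
x = rng.normal(size=n)
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
Y = 1500.0 * T + 1000.0 * x + rng.normal(0.0, 500.0, n)

X1 = x.reshape(-1, 1)
ps = LogisticRegression(max_iter=1000).fit(X1, T).predict_proba(X1)[:, 1]

treated = np.where(T == 1)[0]
control = np.where(T == 0)[0]
# for each treated unit, find the control with the closest propensity score
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
match = nn.kneighbors(ps[treated].reshape(-1, 1), return_distance=False).ravel()

tau_att = (Y[treated] - Y[control[match]]).mean()   # effect on the treated
print(f'matched ATT: {tau_att:,.0f}')
```

&lt;p>Because the averaging runs over treated units only, this estimator targets the ATT; a caliper or 1:k variant would change the bias-variance balance described above.&lt;/p>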
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_matching&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ps_match = model.estimate_effect(
    identified_estimand,
    method_name=&amp;quot;backdoor.propensity_score_matching&amp;quot;,
)
print(f&amp;quot;Estimated ATE (PS Matching): ${estimate_ps_match.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (PS Matching): $1,735.69
&lt;/code>&lt;/pre>
&lt;p>Propensity score matching estimates the effect at \$1,736, the highest of the five adjusted estimates and closest to the naive difference. Matching tends to give slightly different results because it uses only the closest comparisons rather than the full sample. As noted above, this estimate is closer to the &lt;strong>ATT&lt;/strong> than the ATE, so it answers a slightly different question than the other four methods &amp;mdash; readers should keep this distinction in mind when comparing across estimators. The fact that all five methods produce estimates between \$1,559 and \$1,736 provides strong evidence that the treatment effect is real and robust to the choice of estimation method.&lt;/p>
&lt;h2 id="step-4-refute-----test-robustness">Step 4: Refute &amp;mdash; Test robustness&lt;/h2>
&lt;p>The final and perhaps most valuable step in DoWhy&amp;rsquo;s framework is &lt;strong>refutation&lt;/strong> &amp;mdash; systematically testing whether the estimated causal effect is robust to violations of our assumptions. DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.refute_estimate" target="_blank" rel="noopener">&lt;code>refute_estimate()&lt;/code>&lt;/a> method provides several built-in refutation tests, each probing a different potential weakness.&lt;/p>
&lt;h3 id="why-refutation-matters">Why refutation matters&lt;/h3>
&lt;p>Most causal inference workflows stop after estimation: you run a regression, get a coefficient, and report it as the causal effect. DoWhy&amp;rsquo;s refutation step is its key innovation &amp;mdash; it provides &lt;strong>automated falsification tests&lt;/strong> that probe whether the estimate could be an artifact of the model, the data, or violated assumptions. This is the causal inference equivalent of &amp;ldquo;stress testing&amp;rdquo;: if the estimate survives multiple attempts to break it, we can be more confident that it reflects a genuine causal relationship.&lt;/p>
&lt;p>DoWhy&amp;rsquo;s refutation tests fall into three categories, each targeting a different potential weakness:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Placebo tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If the treatment doesn&amp;rsquo;t matter, does the effect disappear?&amp;rdquo;&lt;/em> These tests replace the real treatment with a fake (randomly permuted) treatment. If the estimated effect drops to near zero, the original result is tied to the actual treatment rather than being a statistical artifact of the model or data structure.&lt;/li>
&lt;li>&lt;strong>Sensitivity tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If we missed a confounder, does the estimate change?&amp;rdquo;&lt;/em> These tests add a randomly generated variable as an additional confounder. If the estimate barely changes, it suggests the result is not fragile &amp;mdash; adding one more covariate does not destabilize it. This provides indirect evidence (though not proof) that unobserved confounders may not be a major concern.&lt;/li>
&lt;li>&lt;strong>Stability tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If we use different data, does the estimate hold?&amp;rdquo;&lt;/em> These tests re-estimate the effect on random subsets of the data. If the estimate fluctuates wildly, it may depend on a few influential observations rather than reflecting a stable population-level effect.&lt;/li>
&lt;/ul>
&lt;p>An important caveat: &lt;strong>passing all refutation tests does not prove causation&lt;/strong>. The tests can only detect certain types of problems &amp;mdash; they cannot rule out every possible source of bias. However, &lt;strong>failing any test is a strong signal that something is wrong&lt;/strong> and warrants further investigation before drawing causal conclusions.&lt;/p>
&lt;h3 id="placebo-treatment-test">Placebo Treatment Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/placebo_treatment.html" target="_blank" rel="noopener">placebo test&lt;/a> replaces the actual treatment with a randomly permuted version. If our estimate is truly capturing a causal effect, this fake treatment should produce an effect near zero. A large p-value indicates that the placebo effect is not significantly different from zero, confirming that the real treatment drives the original estimate.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_placebo = model.refute_estimate(
    identified_estimand,
    estimate_ra,
    method_name=&amp;quot;placebo_treatment_refuter&amp;quot;,
    placebo_type=&amp;quot;permute&amp;quot;,
    num_simulations=100,
)
print(refute_placebo)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Use a Placebo Treatment
Estimated effect:1676.3426437675835
New effect:61.821946542496946
p value:0.92
&lt;/code>&lt;/pre>
&lt;p>The placebo treatment test produces a new effect of approximately \$62, which is close to zero and dramatically smaller than the original estimate of \$1,676. The high p-value (0.92) indicates that the placebo effect is not statistically distinguishable from zero &amp;mdash; exactly what we expect if the original estimate is driven by the real treatment. This is strong evidence that the estimated effect is not an artifact of the model or data structure.&lt;/p>
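&lt;p>The logic of the placebo refuter can be replicated by hand: permute the treatment column and re-estimate. The sketch below does this on synthetic data with a built-in effect of 1,500 (illustrative only; DoWhy&amp;rsquo;s refuter wraps the same idea and adds a formal p-value across simulations):&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 445                                   # same order as the NSW sample
x = rng.normal(size=(n, 3))
t = rng.binomial(1, 0.5, size=n)
y = x @ np.array([200.0, -100.0, 50.0]) + 1500.0 * t + rng.normal(scale=500.0, size=n)

def treatment_coef(treat):
    # OLS of y on [treatment, covariates]; return the treatment coefficient
    return LinearRegression().fit(np.column_stack([treat, x]), y).coef_[0]

real = treatment_coef(t)
placebo = np.array([treatment_coef(rng.permutation(t)) for _ in range(200)])
print(f'real effect: {real:,.0f}, mean placebo effect: {placebo.mean():,.0f}')
```

&lt;p>The real coefficient recovers the built-in effect, while the permuted-treatment coefficients scatter around zero.&lt;/p>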
&lt;h3 id="random-common-cause-test">Random Common Cause Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/random_common_cause.html" target="_blank" rel="noopener">random common cause test&lt;/a> adds a randomly generated confounder to the model and checks whether the estimate changes. If our model is correctly specified and the estimate is robust, adding a random variable should not significantly alter the result.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_random = model.refute_estimate(
    identified_estimand,
    estimate_ra,
    method_name=&amp;quot;random_common_cause&amp;quot;,
    num_simulations=100,
)
print(refute_random)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Add a random common cause
Estimated effect:1676.3426437675835
New effect:1675.606781672203
p value:0.9
&lt;/code>&lt;/pre>
&lt;p>Adding a random common cause barely changes the estimate: from \$1,676 to \$1,676 &amp;mdash; a difference of less than \$1. The high p-value (0.90) confirms that the original estimate is stable when an additional (irrelevant) confounder is introduced. This suggests that the model is not overly sensitive to the specific set of confounders included.&lt;/p>
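&lt;p>The same check can be hand-rolled for any regression estimator: append a pure-noise column and compare the treatment coefficients. A sketch on synthetic data (not DoWhy&amp;rsquo;s internals):&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.5, size=n)
y = 300.0 * x[:, 0] + 1000.0 * t + rng.normal(scale=200.0, size=n)

base = LinearRegression().fit(np.column_stack([t, x]), y).coef_[0]
noise = rng.normal(size=(n, 1))            # the 'random common cause'
aug = LinearRegression().fit(np.column_stack([t, x, noise]), y).coef_[0]
print(f'without noise column: {base:.1f}, with noise column: {aug:.1f}')
```

&lt;p>Because the added column is independent of both treatment and outcome, a stable estimator should barely move.&lt;/p>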
&lt;h3 id="data-subset-test">Data Subset Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/data_subsample.html" target="_blank" rel="noopener">data subset test&lt;/a> re-estimates the effect on random 80% subsamples of the data. If the estimate is robust, it should remain similar across different subsets. Large fluctuations would suggest that the result depends on a few influential observations.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_subset = model.refute_estimate(
    identified_estimand,
    estimate_ra,
    method_name=&amp;quot;data_subset_refuter&amp;quot;,
    subset_fraction=0.8,
    num_simulations=100,
)
print(refute_subset)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Use a subset of data
Estimated effect:1676.3426437675835
New effect:1727.583871150809
p value:0.8
&lt;/code>&lt;/pre>
&lt;p>The data subset refuter produces a mean effect of \$1,728 across 100 random subsamples, close to the full-sample estimate of \$1,676. The high p-value (0.80) indicates that the estimate is stable across subsets and does not depend on a handful of outlier observations. The slight increase in the subsample estimate (\$1,728 vs \$1,676) reflects normal sampling variability.&lt;/p>
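&lt;p>A manual version of this refuter is a short loop: re-fit on random 80% subsamples and inspect the spread of the treatment coefficient (synthetic data, illustrative only):&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 2000
x = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.5, size=n)
y = 250.0 * x[:, 0] + 800.0 * t + rng.normal(scale=300.0, size=n)
X = np.column_stack([t, x])

full = LinearRegression().fit(X, y).coef_[0]
subs = []
for _ in range(100):
    idx = rng.choice(n, size=1600, replace=False)   # random 80% subsample
    subs.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])
print(f'full: {full:.1f}, subsample mean: {np.mean(subs):.1f}, sd: {np.std(subs):.1f}')
```

&lt;p>A small subsample standard deviation relative to the estimate itself is the signature of a stable, population-level effect.&lt;/p>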
&lt;h2 id="comparing-all-estimates">Comparing all estimates&lt;/h2>
&lt;p>To visualize how all estimation approaches compare, we plot the ATE estimates side by side. Consistent estimates across different methods strengthen confidence in the causal conclusion.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 6))
methods = [&amp;quot;Naive\n(Diff. in Means)&amp;quot;, &amp;quot;Regression\nAdjustment&amp;quot;, &amp;quot;IPW&amp;quot;,
           &amp;quot;Doubly Robust\n(AIPW)&amp;quot;, &amp;quot;PS\nStratification&amp;quot;, &amp;quot;PS\nMatching&amp;quot;]
estimates = [naive_ate, estimate_ra.value, estimate_ipw.value,
             dr_ate, estimate_ps_strat.value, estimate_ps_match.value]
colors = [&amp;quot;#999999&amp;quot;, &amp;quot;#6a9bcc&amp;quot;, &amp;quot;#d97757&amp;quot;, &amp;quot;#00d4c8&amp;quot;, &amp;quot;#e8956a&amp;quot;, &amp;quot;#c4623d&amp;quot;]
bars = ax.barh(methods, estimates, color=colors, edgecolor=&amp;quot;white&amp;quot;, height=0.6)
for bar, val in zip(bars, estimates):
    ax.text(val + 50, bar.get_y() + bar.get_height() / 2,
            f&amp;quot;${val:,.0f}&amp;quot;, va=&amp;quot;center&amp;quot;, fontsize=10, color=&amp;quot;#141413&amp;quot;)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_xlabel(&amp;quot;Estimated Average Treatment Effect (USD)&amp;quot;)
ax.set_title(&amp;quot;Causal Effect Estimates: NSW Job Training on 1978 Earnings&amp;quot;)
plt.savefig(&amp;quot;dowhy_estimate_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_estimate_comparison.png" alt="Comparison of ATE estimates across six methods.">&lt;/p>
&lt;p>All six methods produce positive estimates between \$1,559 and \$1,794, indicating that the NSW job training program increased participants' 1978 earnings by roughly \$1,550&amp;ndash;\$1,800. The five adjusted methods cluster between \$1,559 and \$1,736, suggesting that about \$58&amp;ndash;\$235 of the naive estimate was due to finite-sample covariate imbalances rather than the treatment. The convergence across fundamentally different estimation strategies &amp;mdash; outcome modeling (regression adjustment), treatment modeling (IPW, stratification, matching), and doubly robust (AIPW) &amp;mdash; is strong evidence that the effect is real.&lt;/p>
&lt;h2 id="summary-table">Summary table&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Estimated ATE&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive (Difference in Means)&lt;/td>
&lt;td>\$1,794&lt;/td>
&lt;td>No covariate adjustment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>\$1,676&lt;/td>
&lt;td>Models outcome, assumes linearity&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>\$1,559&lt;/td>
&lt;td>Models treatment assignment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Doubly Robust (AIPW)&lt;/td>
&lt;td>\$1,620&lt;/td>
&lt;td>Models both outcome and treatment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Propensity Score Stratification&lt;/td>
&lt;td>\$1,617&lt;/td>
&lt;td>5 strata, flexible&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Propensity Score Matching&lt;/td>
&lt;td>\$1,736&lt;/td>
&lt;td>Nearest-neighbor matching (closer to ATT)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Refutation Test&lt;/th>
&lt;th>New Effect&lt;/th>
&lt;th>p-value&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Placebo Treatment&lt;/td>
&lt;td>\$62&lt;/td>
&lt;td>0.92&lt;/td>
&lt;td>Effect vanishes with fake treatment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Random Common Cause&lt;/td>
&lt;td>\$1,676&lt;/td>
&lt;td>0.90&lt;/td>
&lt;td>Stable with added confounder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Data Subset (80%)&lt;/td>
&lt;td>\$1,728&lt;/td>
&lt;td>0.80&lt;/td>
&lt;td>Stable across subsamples&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The summary confirms a consistent causal effect across methods: the NSW job training program increased 1978 earnings by approximately \$1,550&amp;ndash;\$1,800. All five adjusted methods and all three refutation tests support the validity of the estimate. The placebo test is particularly convincing &amp;mdash; when the real treatment is replaced by random noise, the effect drops from \$1,676 to just \$62, confirming that the observed effect is tied to the actual treatment and not a statistical artifact. The doubly robust estimate (\$1,620) provides the most credible point estimate because it is consistent under misspecification of either the outcome model or the propensity score model.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>The Lalonde dataset provides a compelling case study for DoWhy&amp;rsquo;s four-step framework. Each step serves a distinct purpose: the &lt;strong>Model&lt;/strong> step forces us to articulate our causal assumptions as a graph, the &lt;strong>Identify&lt;/strong> step uses graph theory to determine the correct adjustment formula, the &lt;strong>Estimate&lt;/strong> step applies statistical methods to compute the effect, and the &lt;strong>Refute&lt;/strong> step probes whether the result withstands scrutiny.&lt;/p>
&lt;p>The estimated ATE ranges from \$1,559 (IPW) to \$1,736 (PS matching), with the doubly robust estimate at \$1,620 providing a credible middle ground. On a base of \$4,555 for the control group, this represents roughly a 34&amp;ndash;38% increase in earnings &amp;mdash; a substantial effect for a disadvantaged population with very low baseline earnings. The three estimation paradigms &amp;mdash; outcome modeling (regression adjustment), treatment modeling (IPW, stratification, matching), and doubly robust (AIPW) &amp;mdash; each bring different strengths, and their convergence strengthens the causal conclusion.&lt;/p>
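&lt;p>The percentage range follows directly from dividing each estimate by the control-group mean of \$4,555; a quick check:&lt;/p>

```python
control_mean = 4555.0
for label, est in [('IPW', 1559.0), ('AIPW', 1620.0), ('PS matching', 1736.0)]:
    # share of the control-group mean represented by each estimate
    print(f'{label}: {est / control_mean:.1%} of the control mean')
```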
&lt;p>The key strength of DoWhy over ad-hoc statistical approaches is transparency. The causal graph makes assumptions visible and debatable. The identification step formally checks whether the effect is estimable. Multiple estimation methods let us assess robustness. And refutation tests provide automated sanity checks that would otherwise require expert judgment.&lt;/p>
&lt;h2 id="limitations-and-next-steps">Limitations and next steps&lt;/h2>
&lt;p>This analysis demonstrates DoWhy&amp;rsquo;s workflow on a well-understood dataset, but several limitations apply:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Small sample size&lt;/strong>: With only 445 observations, estimates have high variance and the propensity score methods may suffer from poor overlap in some regions of the covariate space&lt;/li>
&lt;li>&lt;strong>Unconfoundedness assumption&lt;/strong>: The backdoor criterion requires that all confounders are observed. If there are unmeasured factors affecting both training enrollment and earnings, our estimates would be biased&lt;/li>
&lt;li>&lt;strong>Linear outcome model&lt;/strong>: The regression adjustment and doubly robust estimates assume a linear relationship between covariates and earnings, which may be too restrictive for the highly skewed outcome distribution&lt;/li>
&lt;li>&lt;strong>Experimental data&lt;/strong>: The NSW was a randomized experiment, making it the easiest setting for causal inference. DoWhy&amp;rsquo;s advantages are more pronounced in observational studies where confounding is more severe&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps&lt;/strong> could include:&lt;/p>
&lt;ul>
&lt;li>Apply DoWhy to an observational version of the Lalonde dataset (e.g., the PSID or CPS comparison groups) where confounding is much stronger&lt;/li>
&lt;li>Explore DoWhy&amp;rsquo;s instrumental variable and front-door estimators for settings where the backdoor criterion fails&lt;/li>
&lt;li>Investigate heterogeneous treatment effects &amp;mdash; does training help some subgroups more than others?&lt;/li>
&lt;li>Use nonparametric outcome models (e.g., random forests) in the doubly robust estimator for more flexible modeling&lt;/li>
&lt;li>Compare DoWhy&amp;rsquo;s estimates with Double Machine Learning (DoubleML) for a side-by-side comparison of frameworks&lt;/li>
&lt;/ul>
&lt;h2 id="takeaways">Takeaways&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>DoWhy&amp;rsquo;s four-step workflow&lt;/strong> (Model, Identify, Estimate, Refute) makes causal assumptions explicit and testable, rather than hiding them inside a black-box estimator.&lt;/li>
&lt;li>&lt;strong>The NSW job training program increased 1978 earnings by approximately \$1,550&amp;ndash;\$1,800&lt;/strong>, a 34&amp;ndash;38% gain over the control group mean of \$4,555.&lt;/li>
&lt;li>&lt;strong>Five estimation methods&lt;/strong> &amp;mdash; regression adjustment, IPW, doubly robust, PS stratification, and PS matching &amp;mdash; all produce positive, consistent estimates, strengthening confidence in the causal conclusion.&lt;/li>
&lt;li>&lt;strong>The doubly robust (AIPW) estimator&lt;/strong> (\$1,620) is the most credible single estimate because it remains consistent if either the outcome model or the propensity score model is misspecified.&lt;/li>
&lt;li>&lt;strong>IPW and regression adjustment represent two complementary paradigms&lt;/strong>: modeling treatment assignment (\$1,559) vs. modeling the outcome (\$1,676). Their divergence quantifies sensitivity to modeling choices.&lt;/li>
&lt;li>&lt;strong>Refutation tests confirm robustness&lt;/strong> &amp;mdash; the placebo test reduced the effect from \$1,676 to just \$62, ruling out statistical artifacts.&lt;/li>
&lt;li>&lt;strong>Causal graphs encode domain knowledge as testable assumptions&lt;/strong>; the backdoor criterion then determines which variables must be conditioned on for valid causal estimation.&lt;/li>
&lt;li>&lt;strong>Next step&lt;/strong>: apply DoWhy to an observational comparison group (e.g., PSID or CPS) where confounding is stronger and the choice of estimator matters more.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the number of strata.&lt;/strong> Re-run the propensity score stratification with &lt;code>num_strata=10&lt;/code> and &lt;code>num_strata=20&lt;/code>. How does the ATE estimate change? What are the tradeoffs of using more vs. fewer strata with a sample of only 445 observations?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add an additional refutation test.&lt;/strong> DoWhy supports a &lt;code>bootstrap_refuter&lt;/code> that re-estimates the effect on bootstrap samples. Implement this refuter and compare its results to the data subset refuter. Are the conclusions similar?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Estimate effects for subgroups.&lt;/strong> Split the dataset by &lt;code>black&lt;/code> (race indicator) and estimate the ATE separately for each subgroup using DoWhy. Does the job training program have a different effect for Black vs. non-Black participants? What might explain any differences you observe?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.pywhy.org/dowhy/" target="_blank" rel="noopener">DoWhy &amp;mdash; Python Library for Causal Inference (PyWhy)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1806062" target="_blank" rel="noopener">LaLonde, R. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review, 76(4), 604&amp;ndash;620.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1999.10473858" target="_blank" rel="noopener">Dehejia, R. &amp;amp; Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. JASA, 94(448), 1053&amp;ndash;1062.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2011.04216" target="_blank" rel="noopener">Sharma, A. &amp;amp; Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1952.10483446" target="_blank" rel="noopener">Horvitz, D. G. &amp;amp; Thompson, D. J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. JASA, 47(260), 663&amp;ndash;685.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1994.10476818" target="_blank" rel="noopener">Robins, J. M., Rotnitzky, A. &amp;amp; Zhao, L. P. (1994). Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. JASA, 89(427), 846&amp;ndash;866.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/biomet/70.1.41" target="_blank" rel="noopener">Rosenbaum, P. R. &amp;amp; Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1), 41&amp;ndash;55.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/2528036" target="_blank" rel="noopener">Cochran, W. G. (1968). The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies. Biometrics, 24(2), 295&amp;ndash;313.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://medium.com/@chrisjames.nita/causal-inference-with-python-introduction-to-dowhy-ff5799e48985" target="_blank" rel="noopener">Nita, C. J. Causal Inference with Python &amp;mdash; Introduction to DoWhy. Medium.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to Causal Inference: Double Machine Learning</title><link>https://carlos-mendez.org/post/python_doubleml/</link><pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_doubleml/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Does a cash bonus actually cause unemployed workers to find jobs faster, or do the workers who receive bonuses simply differ from those who do not? This is the core challenge of &lt;strong>causal inference&lt;/strong>: separating a genuine treatment effect from the influence of &lt;em>confounders&lt;/em> — variables that affect both the treatment and the outcome, creating spurious associations. Standard regression can adjust for these confounders, but when their relationship with the outcome is complex and nonlinear, linear models may fail to fully remove bias.&lt;/p>
&lt;p>&lt;strong>Double Machine Learning (DML)&lt;/strong> addresses this problem by using flexible machine learning models to partial out the confounding variation, then estimating the causal effect on the cleaned residuals. In this tutorial we apply DML to the Pennsylvania Bonus Experiment, a real randomized study where some unemployment insurance claimants received a cash bonus for finding employment quickly. We estimate how much the bonus reduced unemployment duration, and we compare DML estimates against naive and covariate-adjusted OLS to see how debiasing changes the results.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why prediction and causal inference require different approaches&lt;/li>
&lt;li>Learn the Partially Linear Regression (PLR) model and the partialling-out estimator&lt;/li>
&lt;li>Implement Double Machine Learning with cross-fitting using the &lt;code>doubleml&lt;/code> package&lt;/li>
&lt;li>Interpret causal effect estimates, standard errors, and confidence intervals&lt;/li>
&lt;li>Assess robustness by comparing results across different ML learners&lt;/li>
&lt;/ul>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install doubleml
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets the configuration variables for our analysis. We use &lt;code>RANDOM_SEED = 42&lt;/code> throughout to ensure reproducibility, and define the outcome, treatment, and covariate columns that will be used in all subsequent steps.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from doubleml import DoubleMLData, DoubleMLPLR
from doubleml.datasets import fetch_bonus
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Configuration
OUTCOME = &amp;quot;inuidur1&amp;quot;
OUTCOME_LABEL = &amp;quot;Log Unemployment Duration&amp;quot;
TREATMENT = &amp;quot;tg&amp;quot;
COVARIATES = [
    &amp;quot;female&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;othrace&amp;quot;, &amp;quot;dep1&amp;quot;, &amp;quot;dep2&amp;quot;,
    &amp;quot;q2&amp;quot;, &amp;quot;q3&amp;quot;, &amp;quot;q4&amp;quot;, &amp;quot;q5&amp;quot;, &amp;quot;q6&amp;quot;,
    &amp;quot;agelt35&amp;quot;, &amp;quot;agegt54&amp;quot;, &amp;quot;durable&amp;quot;, &amp;quot;lusd&amp;quot;, &amp;quot;husd&amp;quot;,
]
&lt;/code>&lt;/pre>
&lt;h2 id="data-loading-the-pennsylvania-bonus-experiment">Data loading: The Pennsylvania Bonus Experiment&lt;/h2>
&lt;p>The Pennsylvania Bonus Experiment is a well-known dataset in labor economics. In this study, a random subset of unemployment insurance claimants was offered a cash bonus if they found a new job within a qualifying period. The dataset records whether each claimant received the bonus offer (treatment) and how long they remained unemployed (outcome), along with demographic and labor market covariates.&lt;/p>
&lt;pre>&lt;code class="language-python">df = fetch_bonus(&amp;quot;DataFrame&amp;quot;)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;Observations: {len(df)}&amp;quot;)
print(f&amp;quot;\nTreatment groups:&amp;quot;)
print(df[TREATMENT].value_counts().rename({0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Bonus&amp;quot;}))
print(f&amp;quot;\nOutcome ({OUTCOME}) summary:&amp;quot;)
print(df[OUTCOME].describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (5099, 26)
Observations: 5099
Treatment groups:
tg
Control 3354
Bonus 1745
Name: count, dtype: int64
Outcome (inuidur1) summary:
count 5099.000
mean 2.028
std 1.215
min 0.000
25% 1.099
50% 2.398
75% 3.219
max 3.951
Name: inuidur1, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 5,099 unemployment insurance claimants with 26 variables. The treatment is unevenly split: 1,745 claimants received the bonus offer while 3,354 served as controls. The outcome variable, log unemployment duration (&lt;code>inuidur1&lt;/code>), ranges from 0.0 to 3.95 with a mean of 2.028 and standard deviation of 1.215, indicating substantial variation in how long claimants remained unemployed. The median (2.398) exceeds the mean, suggesting a left-skewed distribution where some claimants found jobs very quickly. The interquartile range spans from 1.099 to 3.219, meaning the middle 50% of claimants had log durations in this band.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory data analysis&lt;/h2>
&lt;h3 id="outcome-distribution-by-treatment-group">Outcome distribution by treatment group&lt;/h3>
&lt;p>Before modeling, we examine whether the outcome distributions differ visibly between treated and control groups. While a randomized experiment should produce balanced groups on average, visualizing the raw data helps us understand the structure of the outcome and spot any obvious patterns.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
for group, label, color in [(0, &amp;quot;Control&amp;quot;, &amp;quot;#6a9bcc&amp;quot;), (1, &amp;quot;Bonus&amp;quot;, &amp;quot;#d97757&amp;quot;)]:
    subset = df[df[TREATMENT] == group][OUTCOME]
    ax.hist(subset, bins=30, alpha=0.6, label=f&amp;quot;{label} (mean={subset.mean():.3f})&amp;quot;,
            color=color, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(OUTCOME_LABEL)
ax.set_ylabel(&amp;quot;Count&amp;quot;)
ax.set_title(f&amp;quot;Distribution of {OUTCOME_LABEL} by Treatment Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;doubleml_outcome_by_treatment.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_outcome_by_treatment.png" alt="Distribution of log unemployment duration by treatment group.">&lt;/p>
&lt;p>The histogram reveals that both groups share a similar shape, with a concentration of claimants at higher log durations (around 3.0&amp;ndash;3.5) and a spread of shorter durations below 2.0. The bonus group shows a slightly lower mean (1.971) compared to the control group (2.057), a difference of about 0.09 log points. This raw gap hints at a potential treatment effect, but we cannot yet attribute it to the bonus because confounders may also differ between groups.&lt;/p>
&lt;h3 id="covariate-balance">Covariate balance&lt;/h3>
&lt;p>In a well-designed randomized experiment, the distribution of covariates should be roughly similar across treatment and control groups. We check this balance to verify that randomization worked as expected and to understand which characteristics might confound the treatment-outcome relationship if balance is imperfect.&lt;/p>
&lt;pre>&lt;code class="language-python">covariate_means = df.groupby(TREATMENT)[COVARIATES].mean()
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(COVARIATES))
width = 0.35
ax.bar(x - width / 2, covariate_means.loc[0], width, label=&amp;quot;Control&amp;quot;,
       color=&amp;quot;#6a9bcc&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.bar(x + width / 2, covariate_means.loc[1], width, label=&amp;quot;Bonus&amp;quot;,
       color=&amp;quot;#d97757&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xticks(x)
ax.set_xticklabels(COVARIATES, rotation=45, ha=&amp;quot;right&amp;quot;)
ax.set_ylabel(&amp;quot;Mean Value&amp;quot;)
ax.set_title(&amp;quot;Covariate Balance: Control vs Bonus Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;doubleml_covariate_balance.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_covariate_balance.png" alt="Covariate balance between control and bonus groups.">&lt;/p>
&lt;p>The covariate means are nearly identical across treatment and control groups for all 15 covariates, confirming that randomization produced well-balanced groups. Demographic variables like &lt;code>female&lt;/code>, &lt;code>black&lt;/code>, and age indicators show negligible differences, as do the economic indicators (&lt;code>durable&lt;/code>, &lt;code>lusd&lt;/code>, &lt;code>husd&lt;/code>). This balance is reassuring: it means that any difference in unemployment duration between groups is unlikely to be driven by observable confounders. Nevertheless, DML provides a principled way to adjust for these covariates and improve precision.&lt;/p>
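&lt;p>A more scale-free balance diagnostic is the &lt;strong>standardized mean difference&lt;/strong> (SMD): the difference in group means divided by the pooled standard deviation, with |SMD| below 0.1 commonly read as good balance. A sketch (the &lt;code>smd&lt;/code> helper is our own, shown on synthetic stand-in data rather than the bonus dataset):&lt;/p>

```python
import numpy as np
import pandas as pd

def smd(df, treatment, covariates):
    '''Standardized mean difference for each covariate:
    (treated mean - control mean) / pooled standard deviation.'''
    g1 = df[df[treatment] == 1]
    g0 = df[df[treatment] == 0]
    out = {}
    for col in covariates:
        pooled = np.sqrt(0.5 * (g1[col].var() + g0[col].var()))
        out[col] = (g1[col].mean() - g0[col].mean()) / pooled
    return pd.Series(out)

# Synthetic stand-in: assignment independent of covariates, as in an RCT,
# so all SMDs should land near zero
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'tg': rng.binomial(1, 0.35, size=5000),
    'female': rng.binomial(1, 0.4, size=5000),
    'durable': rng.binomial(1, 0.2, size=5000),
})
print(smd(demo, 'tg', ['female', 'durable']).round(3))
```

&lt;p>The same function can be run on the bonus data by passing &lt;code>df&lt;/code>, &lt;code>TREATMENT&lt;/code>, and &lt;code>COVARIATES&lt;/code> from the setup above.&lt;/p>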
&lt;h2 id="why-adjust-for-covariates">Why adjust for covariates?&lt;/h2>
&lt;p>Because the Pennsylvania Bonus Experiment is a randomized trial, treatment assignment is independent of covariates by design — there is no confounding bias to remove. However, adjusting for covariates can still improve the &lt;em>precision&lt;/em> of the causal estimate by absorbing residual variation in the outcome. In observational studies, covariate adjustment is essential to avoid confounding bias, but even in an RCT, it sharpens inference. The question is &lt;em>how&lt;/em> to adjust. Standard OLS assumes a linear relationship between covariates and the outcome, which may miss complex nonlinear patterns. The naive OLS model regresses the outcome $Y$ directly on the treatment $D$:&lt;/p>
&lt;p>$$Y_i = \alpha + \beta \, D_i + \epsilon_i \quad \text{(naive, no covariates)}$$&lt;/p>
&lt;p>Adding covariates $X$ linearly gives:&lt;/p>
&lt;p>$$Y_i = \alpha + \beta \, D_i + X_i' \gamma + \epsilon_i \quad \text{(with covariates)}$$&lt;/p>
&lt;p>In our data, $Y_i$ is &lt;code>inuidur1&lt;/code> (log unemployment duration), $D_i$ is &lt;code>tg&lt;/code> (the bonus indicator), and $X_i$ contains the 15 demographic and labor market covariates. In both cases, $\beta$ is the estimated treatment effect. But if the true relationship between $X$ and $Y$ is nonlinear, the linear specification may leave residual confounding in $\hat{\beta}$.&lt;/p>
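&lt;p>The &amp;ldquo;partial out, then regress residuals&amp;rdquo; recipe behind DML generalizes the classical Frisch&amp;ndash;Waugh&amp;ndash;Lovell (FWL) theorem. With OLS as the residualizer, the two routes agree exactly, as the sketch below on synthetic data shows; DML simply swaps the OLS residualizers for flexible ML learners and adds cross-fitting:&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 5))
d = X @ np.array([0.3, -0.2, 0.1, 0.0, 0.4]) + rng.normal(size=n)
y = 0.5 * d + X @ np.array([1.0, 0.5, -1.0, 0.2, 0.8]) + rng.normal(size=n)

# Route 1: full OLS of y on [d, X]
beta_full = LinearRegression().fit(np.column_stack([d, X]), y).coef_[0]

# Route 2 (FWL): residualize y and d on X, then regress residual on residual
y_res = y - LinearRegression().fit(X, y).predict(X)
d_res = d - LinearRegression().fit(X, d).predict(X)
beta_fwl = LinearRegression(fit_intercept=False).fit(d_res.reshape(-1, 1), y_res).coef_[0]
print(f'full OLS: {beta_full:.6f}, partialled-out: {beta_fwl:.6f}')
```

&lt;p>Both routes recover the same $\beta$ up to floating-point error, which is why the partialling-out estimator in the PLR model targets the same coefficient as a linear regression would &amp;mdash; while allowing the covariate adjustment itself to be nonlinear.&lt;/p>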
&lt;h3 id="naive-ols-baseline">Naive OLS baseline&lt;/h3>
&lt;p>We start with two simple OLS regressions to establish baseline estimates: one with no covariates (naive), and one that linearly adjusts for all 15 covariates. These provide a reference point for evaluating how much DML&amp;rsquo;s flexible adjustment changes the estimated treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-python"># Naive OLS: no covariates
ols = LinearRegression()
ols.fit(df[[TREATMENT]], df[OUTCOME])
naive_coef = ols.coef_[0]
# OLS with covariates
ols_full = LinearRegression()
ols_full.fit(df[[TREATMENT] + COVARIATES], df[OUTCOME])
ols_full_coef = ols_full.coef_[0]
print(f&amp;quot;Naive OLS coefficient (no covariates): {naive_coef:.4f}&amp;quot;)
print(f&amp;quot;OLS with covariates coefficient: {ols_full_coef:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Naive OLS coefficient (no covariates): -0.0855
OLS with covariates coefficient: -0.0717
&lt;/code>&lt;/pre>
&lt;p>The naive OLS estimate is -0.0855, a reduction of 0.0855 log points, or roughly an 8.6% decrease in unemployment duration. Adding covariates shifts the estimate to -0.0717 (about a 7.2% reduction). In a randomized experiment, this shift reflects precision improvement from absorbing residual variation — not confounding bias removal. Even so, linear adjustment may not capture complex nonlinear relationships between covariates and the outcome. Double Machine Learning will use flexible ML models to more thoroughly partial out covariate effects and further sharpen the estimate.&lt;/p>
&lt;h2 id="what-is-double-machine-learning">What is Double Machine Learning?&lt;/h2>
&lt;h3 id="the-partially-linear-regression-plr-model">The Partially Linear Regression (PLR) model&lt;/h3>
&lt;p>Double Machine Learning operates within the &lt;strong>Partially Linear Regression&lt;/strong> framework. The key idea is that the outcome $Y$ depends on the treatment $D$ through a linear coefficient (the causal effect we want) plus a potentially complex, nonlinear function of covariates $X$. The PLR model consists of two structural equations:&lt;/p>
&lt;p>$$Y = D \, \theta_0 + g_0(X) + \varepsilon, \quad E[\varepsilon \mid D, X] = 0$$&lt;/p>
&lt;p>$$D = m_0(X) + V, \quad E[V \mid X] = 0$$&lt;/p>
&lt;p>Here, $\theta_0$ is the causal parameter of interest — the &lt;strong>Average Treatment Effect (ATE)&lt;/strong> of the bonus on unemployment duration. The function $g_0(\cdot)$ is a &lt;em>nuisance function&lt;/em>, meaning it is not our target but something we must estimate along the way; it captures how covariates affect the outcome. Similarly, $m_0(\cdot)$ models how covariates predict treatment assignment. Think of nuisance functions as scaffolding: essential during construction but not part of the final result. The error terms $\varepsilon$ and $V$ are orthogonal to the covariates by construction. In our data, $Y$ = &lt;code>inuidur1&lt;/code>, $D$ = &lt;code>tg&lt;/code>, and $X$ includes the 15 covariate columns in &lt;code>COVARIATES&lt;/code>. The challenge is that both $g_0$ and $m_0$ can be arbitrarily complex — DML uses machine learning to estimate them flexibly.&lt;/p>
&lt;h3 id="the-partialling-out-estimator">The partialling-out estimator&lt;/h3>
&lt;p>The DML algorithm works in two stages. First, it uses ML models to predict the outcome from covariates alone (estimating $E[Y \mid X]$) and to predict the treatment from covariates alone (estimating $E[D \mid X]$). Then it computes residuals from both predictions — the part of $Y$ not explained by $X$, and the part of $D$ not explained by $X$:&lt;/p>
&lt;p>$$\tilde{Y} = Y - \hat{g}_0(X) = Y - \hat{E}[Y \mid X]$$&lt;/p>
&lt;p>$$\tilde{D} = D - \hat{m}_0(X) = D - \hat{E}[D \mid X]$$&lt;/p>
&lt;p>Finally, it regresses the outcome residuals on the treatment residuals to obtain the causal estimate:&lt;/p>
&lt;p>$$\hat{\theta}_0 = \left( \frac{1}{N} \sum_{i=1}^{N} \tilde{D}_i^2 \right)^{-1} \frac{1}{N} \sum_{i=1}^{N} \tilde{D}_i \, \tilde{Y}_i$$&lt;/p>
&lt;p>Think of this like noise-canceling headphones: the ML models learn the &amp;ldquo;noise&amp;rdquo; pattern (how covariates influence both $Y$ and $D$), and we subtract it away so that only the &amp;ldquo;signal&amp;rdquo; — the causal relationship between $D$ and $Y$ — remains.&lt;/p>
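&lt;p>The two-stage recipe can be sketched in a few lines of NumPy on simulated data. This is an illustration only, not the tutorial&amp;rsquo;s dataset: least-squares fits on a hand-picked basis stand in for the ML learners, and the simulated effect $\theta_0 = -0.07$ is an arbitrary choice.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
# Simulate the PLR model: D depends on X, Y depends on D and nonlinearly on X
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
theta_true = -0.07
Y = theta_true * D + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

# Nuisance fits: least squares on a basis that spans the true nuisance functions
B = np.column_stack([X, np.sin(X[:, 0]), X[:, 1] ** 2, np.ones(n)])
Y_res = Y - B @ np.linalg.lstsq(B, Y, rcond=None)[0]  # Y - E-hat[Y|X]
D_res = D - B @ np.linalg.lstsq(B, D, rcond=None)[0]  # D - E-hat[D|X]

# Partialling-out estimator: regress outcome residuals on treatment residuals
theta_hat = (D_res @ Y_res) / (D_res @ D_res)
print(round(theta_hat, 3))  # close to theta_true = -0.07
```

&lt;p>With the nuisance functions well approximated, the ratio of residual cross-products recovers the simulated $\theta_0$ up to sampling noise.&lt;/p>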
&lt;h3 id="cross-fitting-why-it-matters">Cross-fitting: why it matters&lt;/h3>
&lt;p>A naive implementation of partialling-out would use the same data to fit the ML models and compute residuals. This introduces &lt;strong>regularization bias&lt;/strong> — a distortion that occurs because the ML model&amp;rsquo;s complexity penalty contaminates the causal estimate. DML solves this with &lt;strong>cross-fitting&lt;/strong>: the data is split into $K$ folds, and each fold&amp;rsquo;s residuals are computed using ML models trained on the other $K-1$ folds. Think of it like grading exams: to avoid bias, we split the class into groups where each group&amp;rsquo;s predictions are made by a model that never saw their data. The cross-fitted estimator is:&lt;/p>
&lt;p>$$\hat{\theta}_0^{CF} = \left( \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \left(\tilde{D}_i^{(k)}\right)^2 \right)^{-1} \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \tilde{D}_i^{(k)} \, \tilde{Y}_i^{(k)}$$&lt;/p>
&lt;p>where $\tilde{Y}_i^{(k)}$ and $\tilde{D}_i^{(k)}$ are residuals for observation $i$ in fold $k$, computed using models trained on all folds except $k$. In words, we average the treatment effect estimates across all folds, where each fold&amp;rsquo;s estimate uses residuals computed from models that never saw that fold&amp;rsquo;s data. This ensures that the residuals are computed out-of-sample, eliminating overfitting bias and preserving valid statistical inference (standard errors, p-values, confidence intervals).&lt;/p>
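&lt;p>A minimal NumPy sketch of cross-fitting on simulated data (illustrative assumptions: simple least-squares nuisance fits in place of ML learners, and an arbitrary true effect of $-0.07$):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 4000, 5
X = rng.normal(size=(n, 3))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
theta_true = -0.07
Y = theta_true * D + X[:, 0] ** 2 + rng.normal(size=n)

def basis(Z):
    # Simple polynomial basis standing in for a flexible ML learner
    return np.column_stack([Z, Z ** 2, np.ones(len(Z))])

# Cross-fitting: residualize each fold with nuisance models fit on the others
folds = np.array_split(rng.permutation(n), K)
Y_res, D_res = np.empty(n), np.empty(n)
for k in range(K):
    held_out = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    Bt, Bh = basis(X[train]), basis(X[held_out])
    Y_res[held_out] = Y[held_out] - Bh @ np.linalg.lstsq(Bt, Y[train], rcond=None)[0]
    D_res[held_out] = D[held_out] - Bh @ np.linalg.lstsq(Bt, D[train], rcond=None)[0]

# Final stage: pooled residual-on-residual regression
theta_cf = (D_res @ Y_res) / (D_res @ D_res)
print(round(theta_cf, 3))  # close to theta_true = -0.07
```

&lt;p>Every observation&amp;rsquo;s residuals come from models that never saw it, which is exactly what keeps the final-stage inference valid.&lt;/p>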
&lt;h2 id="setting-up-doubleml">Setting up DoubleML&lt;/h2>
&lt;p>The &lt;code>doubleml&lt;/code> package provides a clean interface for implementing DML. We first wrap our data into a &lt;code>DoubleMLData&lt;/code> object that specifies the outcome, treatment, and covariate columns. Then we configure the ML learners: Random Forest regressors for both the outcome model &lt;code>ml_l&lt;/code> (estimating $\hat{g}_0$) and the treatment model &lt;code>ml_m&lt;/code> (estimating $\hat{m}_0$).&lt;/p>
&lt;pre>&lt;code class="language-python"># Prepare data for DoubleML
dml_data = DoubleMLData(df, y_col=OUTCOME, d_cols=TREATMENT, x_cols=COVARIATES)
print(dml_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>DoubleMLData&lt;/code> object confirms our setup: &lt;code>inuidur1&lt;/code> as the outcome, &lt;code>tg&lt;/code> as the treatment, and all 15 covariates registered. The object reports 5,099 observations and no instrumental variables, which is correct for the PLR model. Separating the data into these three roles is fundamental to DML: the covariates $X$ will be partialled out from both $Y$ and $D$, while the treatment-outcome relationship $\theta_0$ is the sole target of inference.&lt;/p>
&lt;p>Now we configure the ML learners. We use Random Forest with 500 trees, max depth of 5, and &lt;code>sqrt&lt;/code> feature sampling — a moderate configuration that balances flexibility with regularization.&lt;/p>
&lt;pre>&lt;code class="language-python"># Configure ML learners
learner = RandomForestRegressor(n_estimators=500, max_features=&amp;quot;sqrt&amp;quot;,
                                max_depth=5, random_state=RANDOM_SEED)
ml_l_rf = clone(learner) # Learner for E[Y|X]
ml_m_rf = clone(learner) # Learner for E[D|X]
print(f&amp;quot;ml_l (outcome model): {type(ml_l_rf).__name__}&amp;quot;)
print(f&amp;quot;ml_m (treatment model): {type(ml_m_rf).__name__}&amp;quot;)
print(f&amp;quot; n_estimators={learner.n_estimators}, max_depth={learner.max_depth}, max_features='{learner.max_features}'&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>ml_l (outcome model): RandomForestRegressor
ml_m (treatment model): RandomForestRegressor
n_estimators=500, max_depth=5, max_features='sqrt'
&lt;/code>&lt;/pre>
&lt;p>Both the outcome and treatment models use &lt;code>RandomForestRegressor&lt;/code> with 500 estimators and max depth 5. The &lt;code>clone()&lt;/code> function creates independent copies so that each model is trained separately during the DML fitting process. The &lt;code>max_features='sqrt'&lt;/code> setting means each split considers only the square root of 15 covariates (about 4 features), adding randomness that reduces overfitting. Capping tree depth at 5 prevents overfitting to individual observations while still capturing nonlinear interactions among covariates — a balance that matters because overly complex nuisance models can destabilize the cross-fitted residuals.&lt;/p>
&lt;h2 id="fitting-the-plr-model">Fitting the PLR model&lt;/h2>
&lt;p>With data and learners configured, we fit the Partially Linear Regression model using 5-fold cross-fitting. The &lt;code>DoubleMLPLR&lt;/code> class handles the full DML pipeline: splitting data into folds, fitting ML models on training folds, computing out-of-sample residuals, and estimating the causal coefficient with valid standard errors.&lt;/p>
&lt;pre>&lt;code class="language-python">np.random.seed(RANDOM_SEED)
dml_plr_rf = DoubleMLPLR(dml_data, ml_l_rf, ml_m_rf, n_folds=5)
dml_plr_rf.fit()
print(dml_plr_rf.summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> coef std err t P&amp;gt;|t| 2.5 % 97.5 %
tg -0.0736 0.0354 -2.077 0.0378 -0.1430 -0.0041
&lt;/code>&lt;/pre>
&lt;p>The DML estimate with Random Forest learners yields a treatment coefficient of -0.0736 with a standard error of 0.0354. The t-statistic is -2.077, producing a p-value of 0.0378, which is statistically significant at the 5% level. The 95% confidence interval is [-0.1430, -0.0041], meaning we are 95% confident that the true causal effect of the bonus lies between roughly a 14.3% and a 0.4% reduction in unemployment duration.&lt;/p>
&lt;h2 id="interpreting-the-results">Interpreting the results&lt;/h2>
&lt;p>Let us extract and interpret the key quantities from the fitted model to understand both the statistical and practical significance of the estimated treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-python">rf_coef = dml_plr_rf.coef[0]
rf_se = dml_plr_rf.se[0]
rf_pval = dml_plr_rf.pval[0]
rf_ci = dml_plr_rf.confint().values[0]
print(f&amp;quot;Coefficient (theta_0): {rf_coef:.4f}&amp;quot;)
print(f&amp;quot;Standard Error: {rf_se:.4f}&amp;quot;)
print(f&amp;quot;p-value: {rf_pval:.4f}&amp;quot;)
print(f&amp;quot;95% CI: [{rf_ci[0]:.4f}, {rf_ci[1]:.4f}]&amp;quot;)
print(f&amp;quot;\nInterpretation:&amp;quot;)
print(f&amp;quot; The bonus reduces log unemployment duration by {abs(rf_coef):.4f}.&amp;quot;)
print(f&amp;quot; This corresponds to approximately a {abs(rf_coef)*100:.1f}% reduction.&amp;quot;)
print(f&amp;quot; We are 95% confident the true effect lies between&amp;quot;)
print(f&amp;quot; {abs(rf_ci[1])*100:.1f}% and {abs(rf_ci[0])*100:.1f}% reduction.&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Coefficient (theta_0): -0.0736
Standard Error: 0.0354
p-value: 0.0378
95% CI: [-0.1430, -0.0041]
Interpretation:
The bonus reduces log unemployment duration by 0.0736.
This corresponds to approximately a 7.4% reduction.
We are 95% confident the true effect lies between
0.4% and 14.3% reduction.
&lt;/code>&lt;/pre>
&lt;p>The estimated causal effect is $\hat{\theta}_0 = -0.0736$, meaning the cash bonus reduces log unemployment duration by approximately 7.4%. Since the outcome is in log scale, this translates to roughly a 7.1% proportional reduction in actual unemployment duration (using $e^{-0.0736} - 1 \approx -0.071$). The effect is statistically significant ($p = 0.0378$), and the 95% confidence interval is constructed as:&lt;/p>
&lt;p>$$\text{CI}_{95\%} = \hat{\theta}_0 \pm 1.96 \times \text{SE}(\hat{\theta}_0) = -0.0736 \pm 1.96 \times 0.0354 = [-0.1430, \; -0.0041]$$&lt;/p>
&lt;p>The interval excludes zero, confirming that the bonus has a genuine causal impact. However, the wide interval — spanning from a 0.4% to 14.3% reduction — reflects meaningful uncertainty about the exact magnitude.&lt;/p>
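&lt;p>These numbers can be checked by hand from the coefficient and standard error alone (small rounding differences against the package output are expected):&lt;/p>

```python
import math

coef, se = -0.0736, 0.0354
# Normal-approximation 95% confidence interval
lo, hi = coef - 1.96 * se, coef + 1.96 * se
# Exact proportional change implied by a log-point coefficient
prop = math.exp(coef) - 1
print(f"CI = [{lo:.4f}, {hi:.4f}]")          # close to [-0.1430, -0.0041]
print(f"proportional change = {prop:.4f}")   # about -0.071, i.e. a 7.1% reduction
```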
&lt;h2 id="sensitivity-does-the-choice-of-ml-learner-matter">Sensitivity: does the choice of ML learner matter?&lt;/h2>
&lt;p>A key advantage of DML is that it is &lt;em>agnostic&lt;/em> to the choice of ML learner, as long as the learner is flexible enough to approximate the true confounding function. To verify that our results are not driven by the specific choice of Random Forest, we re-estimate the model using Lasso, a fundamentally different class of learner. Lasso is a linear regression with L1 regularization, meaning it adds a penalty proportional to the absolute size of each coefficient, which drives some coefficients to exactly zero and effectively performs variable selection.&lt;/p>
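&lt;p>Lasso&amp;rsquo;s variable selection comes from the L1 penalty&amp;rsquo;s soft-thresholding effect. The following one-coordinate illustration is hypothetical and not part of the tutorial&amp;rsquo;s pipeline; it shows why small coefficients are driven to exactly zero:&lt;/p>

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form lasso solution for one standardized coordinate:
    the minimizer of 0.5 * (z - b)**2 + lam * abs(b)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.0, 0.3, -0.8, 0.05])  # OLS-style coordinate estimates
print(soft_threshold(z, lam=0.5))     # coefficients inside [-lam, lam] become exactly 0.0
```

&lt;p>Coefficients smaller in magnitude than the penalty are zeroed out; the rest are shrunk toward zero by the penalty amount.&lt;/p>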
&lt;pre>&lt;code class="language-python">ml_l_lasso = LassoCV()
ml_m_lasso = LassoCV()
np.random.seed(RANDOM_SEED)
dml_plr_lasso = DoubleMLPLR(dml_data, ml_l_lasso, ml_m_lasso, n_folds=5)
dml_plr_lasso.fit()
print(dml_plr_lasso.summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> coef std err t P&amp;gt;|t| 2.5 % 97.5 %
tg -0.0712 0.0354 -2.009 0.0445 -0.1406 -0.0018
&lt;/code>&lt;/pre>
&lt;p>The Lasso-based DML estimate is -0.0712 with a standard error of 0.0354 and p-value of 0.0445. This is remarkably close to the Random Forest estimate of -0.0736, with a difference of only 0.0024 — less than 7% of the standard error. The 95% confidence interval is [-0.1406, -0.0018], which also excludes zero. The near-identical results across two fundamentally different learners strongly support the robustness of the estimated treatment effect.&lt;/p>
&lt;h2 id="comparing-all-estimates">Comparing all estimates&lt;/h2>
&lt;p>To see how different estimation strategies affect the results, we visualize all four coefficient estimates side by side: naive OLS, OLS with covariates, DML with Random Forest, and DML with Lasso. The DML estimates include confidence intervals derived from valid statistical inference.&lt;/p>
&lt;pre>&lt;code class="language-python">lasso_coef = dml_plr_lasso.coef[0]
lasso_se = dml_plr_lasso.se[0]
lasso_ci = dml_plr_lasso.confint().values[0]
fig, ax = plt.subplots(figsize=(8, 5))
methods = [&amp;quot;Naive OLS&amp;quot;, &amp;quot;OLS + Covariates&amp;quot;, &amp;quot;DoubleML (RF)&amp;quot;, &amp;quot;DoubleML (Lasso)&amp;quot;]
coefs = [naive_coef, ols_full_coef, rf_coef, lasso_coef]
colors = [&amp;quot;#999999&amp;quot;, &amp;quot;#666666&amp;quot;, &amp;quot;#6a9bcc&amp;quot;, &amp;quot;#d97757&amp;quot;]
ax.barh(methods, coefs, color=colors, edgecolor=&amp;quot;white&amp;quot;, height=0.6)
ax.errorbar(rf_coef, 2, xerr=[[rf_coef - rf_ci[0]], [rf_ci[1] - rf_coef]],
            fmt=&amp;quot;none&amp;quot;, color=&amp;quot;#141413&amp;quot;, capsize=5, linewidth=2)
ax.errorbar(lasso_coef, 3, xerr=[[lasso_coef - lasso_ci[0]], [lasso_ci[1] - lasso_coef]],
            fmt=&amp;quot;none&amp;quot;, color=&amp;quot;#141413&amp;quot;, capsize=5, linewidth=2)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_xlabel(&amp;quot;Estimated Coefficient (Effect on Log Unemployment Duration)&amp;quot;)
ax.set_title(&amp;quot;Naive OLS vs Double Machine Learning Estimates&amp;quot;)
plt.savefig(&amp;quot;doubleml_coefficient_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_coefficient_comparison.png" alt="Coefficient comparison across all estimation methods.">&lt;/p>
&lt;p>All four methods agree on the direction and approximate magnitude of the treatment effect: the bonus reduces unemployment duration. The naive OLS estimate (-0.0855) is the largest in absolute terms, while covariate adjustment and DML both shrink it toward -0.07. The DML estimates with Random Forest (-0.0736) and Lasso (-0.0712) cluster closely together and fall between the two OLS benchmarks. Crucially, only the DML estimates come with valid confidence intervals, both of which exclude zero, providing statistical evidence that the effect is real.&lt;/p>
&lt;h2 id="confidence-intervals">Confidence intervals&lt;/h2>
&lt;p>To better visualize the uncertainty around the DML estimates, we plot the 95% confidence intervals for both the Random Forest and Lasso specifications. If both intervals are similar and exclude zero, this strengthens our confidence in the causal conclusion.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 4))
y_pos = [0, 1]
labels = [&amp;quot;DoubleML (Random Forest)&amp;quot;, &amp;quot;DoubleML (Lasso)&amp;quot;]
point_estimates = [rf_coef, lasso_coef]
ci_low = [rf_ci[0], lasso_ci[0]]
ci_high = [rf_ci[1], lasso_ci[1]]
for i, (est, lo, hi, label) in enumerate(zip(point_estimates, ci_low, ci_high, labels)):
    ax.plot([lo, hi], [i, i], color=&amp;quot;#6a9bcc&amp;quot; if i == 0 else &amp;quot;#d97757&amp;quot;, linewidth=3)
    ax.plot(est, i, &amp;quot;o&amp;quot;, color=&amp;quot;#141413&amp;quot;, markersize=8, zorder=5)
    ax.text(hi + 0.005, i, f&amp;quot;{est:.4f} [{lo:.4f}, {hi:.4f}]&amp;quot;, va=&amp;quot;center&amp;quot;, fontsize=9)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel(&amp;quot;Treatment Effect Estimate (95% CI)&amp;quot;)
ax.set_title(&amp;quot;Confidence Intervals: DoubleML Estimates&amp;quot;)
plt.savefig(&amp;quot;doubleml_confint.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_confint.png" alt="95% confidence intervals for DoubleML estimates.">&lt;/p>
&lt;p>Both confidence intervals are nearly identical in width and position, spanning from roughly -0.14 to near zero. The Random Forest interval [-0.1430, -0.0041] and Lasso interval [-0.1406, -0.0018] both exclude zero, but just barely — the upper bounds are very close to zero (0.4% and 0.2% reduction, respectively). This tells us that while the bonus has a statistically significant negative effect on unemployment duration, the effect size is modest and estimated with considerable uncertainty.&lt;/p>
&lt;h2 id="summary-table">Summary table&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Coefficient&lt;/th>
&lt;th>Std Error&lt;/th>
&lt;th>p-value&lt;/th>
&lt;th>95% CI&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive OLS&lt;/td>
&lt;td>-0.0855&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OLS + Covariates&lt;/td>
&lt;td>-0.0717&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DoubleML (RF)&lt;/td>
&lt;td>-0.0736&lt;/td>
&lt;td>0.0354&lt;/td>
&lt;td>0.0378&lt;/td>
&lt;td>[-0.1430, -0.0041]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DoubleML (Lasso)&lt;/td>
&lt;td>-0.0712&lt;/td>
&lt;td>0.0354&lt;/td>
&lt;td>0.0445&lt;/td>
&lt;td>[-0.1406, -0.0018]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The summary table confirms a consistent pattern across all four estimation methods. The naive OLS gives the largest estimate at -0.0855; adjusting for covariates improves precision and shifts the estimate toward -0.07. The two DML specifications produce very similar estimates of -0.0736 and -0.0712. Both DML p-values are below 0.05, providing statistically significant evidence of a causal effect. The standard errors agree to four decimal places (0.0354), which is unsurprising: both specifications use the same sample and cross-fitting structure, and their residualized treatments are nearly identical.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>The Pennsylvania Bonus Experiment provides a clear demonstration of Double Machine Learning in action. Because the experiment was randomized, the DML estimates are close to the OLS estimates: randomization makes the propensity function $m_0(X)$ essentially constant, so partialling the covariates out of $D$ removes little systematic variation. This is actually reassuring: in a well-designed experiment, flexible covariate adjustment should not dramatically change the results, and indeed the DML estimates ($\hat{\theta}_0 = -0.0736, -0.0712$) are close to the covariate-adjusted OLS (-0.0717).&lt;/p>
&lt;p>The key finding is that the cash bonus reduces log unemployment duration by approximately 7.4%, and this effect is statistically significant (p &amp;lt; 0.05). In practical terms, this means the bonus incentive helped claimants find new jobs somewhat faster. However, the wide confidence intervals suggest that the true effect could be as small as 0.4% or as large as 14.3%, so policymakers should be cautious about the precise magnitude.&lt;/p>
&lt;p>The robustness across learners (Random Forest vs. Lasso) is a strength of DML. Both learners capture similar confounding structure, and the near-identical estimates provide evidence that the result is not an artifact of a particular modeling choice.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;p>This tutorial demonstrated Double Machine Learning for causal inference using the Pennsylvania Bonus Experiment. The key takeaways are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method:&lt;/strong> DML&amp;rsquo;s main advantage over OLS is not the point estimate (both give ~7% here) but the &lt;em>infrastructure&lt;/em> — valid standard errors, confidence intervals, and robustness to nonlinear confounding. On this RCT the estimates are similar; on observational data where $g_0(X)$ is complex, linear OLS can be badly biased while DML remains valid&lt;/li>
&lt;li>&lt;strong>Data:&lt;/strong> The cash bonus reduces unemployment duration by 7.4% ($p = 0.038$, 95% CI: [-14.3%, -0.4%]). The wide CI means the true effect could be anywhere from negligible to substantial — policymakers should not over-interpret the point estimate&lt;/li>
&lt;li>&lt;strong>Robustness:&lt;/strong> Random Forest and Lasso produce nearly identical estimates (-0.0736 vs -0.0712), differing by less than 7% of the standard error. This learner-agnosticism is a core strength of the DML framework&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> The PLR model assumes a constant treatment effect ($\theta_0$ is the same for everyone). If the bonus helps some subgroups more than others (e.g., younger vs. older workers), PLR will average over this heterogeneity — use the Interactive Regression Model (IRM) to detect it&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The Pennsylvania Bonus Experiment is a randomized trial, which is the easiest setting for causal inference. DML&amp;rsquo;s advantages are more pronounced in observational studies where confounding is severe&lt;/li>
&lt;li>We used the PLR model, which assumes a constant treatment effect ($\theta_0$ is the same for all individuals). More complex treatment heterogeneity could be explored with the Interactive Regression Model (IRM)&lt;/li>
&lt;li>The confidence intervals are wide, reflecting limited sample size and moderate signal strength&lt;/li>
&lt;li>We did not explore heterogeneous treatment effects — situations where the bonus might help some subgroups (e.g., younger workers, women) more than others&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Apply DML to an observational dataset where confounding is more severe&lt;/li>
&lt;li>Explore the Interactive Regression Model for binary treatments&lt;/li>
&lt;li>Investigate treatment effect heterogeneity using DoubleML&amp;rsquo;s &lt;code>cate()&lt;/code> functionality&lt;/li>
&lt;li>Compare additional ML learners (gradient boosting, neural networks)&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the number of folds.&lt;/strong> Re-run the DML analysis with &lt;code>n_folds=3&lt;/code> and &lt;code>n_folds=10&lt;/code>. How do the estimates and standard errors change? What are the tradeoffs of using more vs. fewer folds?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Try a different ML learner.&lt;/strong> Replace the Random Forest with &lt;code>GradientBoostingRegressor&lt;/code> from scikit-learn. Does the estimated treatment effect change? Compare the result to the RF and Lasso estimates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Investigate heterogeneous effects.&lt;/strong> Split the sample by gender (&lt;code>female&lt;/code>) and estimate the DML treatment effect separately for men and women. Is the bonus more effective for one group? What might explain any differences?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;p>&lt;strong>Academic references:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1814176" target="_blank" rel="noopener">Woodbury, S. A., &amp;amp; Spiegelman, R. G. (1987). Bonuses to Workers and Employers to Reduce Unemployment: Randomized Trials in Illinois. American Economic Review, 77(4), 513&amp;ndash;530.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/datasets.html#doubleml.datasets.fetch_bonus" target="_blank" rel="noopener">Pennsylvania Bonus Experiment Dataset &amp;ndash; DoubleML&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Package and API documentation:&lt;/strong>&lt;/p>
&lt;ol start="4">
&lt;li>&lt;a href="https://docs.doubleml.org/stable/intro/intro.html" target="_blank" rel="noopener">DoubleML &amp;ndash; Python Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLPLR.html" target="_blank" rel="noopener">DoubleMLPLR &amp;ndash; API Reference&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLData.html" target="_blank" rel="noopener">DoubleMLData &amp;ndash; API Reference&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; RandomForestRegressor&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; LassoCV&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; LinearRegression&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://numpy.org/doc/stable/" target="_blank" rel="noopener">NumPy &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pandas.pydata.org/docs/" target="_blank" rel="noopener">pandas &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://matplotlib.org/stable/" target="_blank" rel="noopener">Matplotlib &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Introduction to Machine Learning: Random Forest Regression</title><link>https://carlos-mendez.org/post/python_ml_random_forest/</link><pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_ml_random_forest/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Can satellite imagery predict how well a municipality is developing? This notebook explores that question by applying Random Forest regression to predict Bolivia&amp;rsquo;s Municipal Sustainable Development Index (IMDS) from satellite image embeddings. IMDS is a composite index (0&amp;ndash;100 scale) that captures how well each of Bolivia&amp;rsquo;s 339 municipalities is progressing toward sustainable development goals. Satellite embeddings are 64-dimensional feature vectors extracted from 2017 satellite imagery &amp;mdash; they compress visual information about land use, urbanization, and terrain into numbers a model can learn from.&lt;/p>
&lt;p>The Random Forest algorithm is a natural starting point for this kind of tabular prediction task: it handles non-linear relationships, requires minimal preprocessing, and provides built-in measures of feature importance. By the end of this tutorial, we will know how much development-related signal satellite imagery actually contains &amp;mdash; and where its predictive power falls short.&lt;/p>
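&lt;p>As a warm-up, here is a minimal sketch of that workflow on synthetic data (illustrative only; the IMDS data is loaded later). It shows the two properties claimed above: a Random Forest fits a nonlinear relationship with no preprocessing, and its built-in importances identify the informative feature.&lt;/p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
# Only feature 0 carries signal, and it does so nonlinearly
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)
# Impurity-based importances sum to 1; the informative feature dominates
print(rf.feature_importances_.round(2))
```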
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the Random Forest algorithm and why it works well for tabular data&lt;/li>
&lt;li>Follow ML best practices: train/test split, cross-validation, hyperparameter tuning&lt;/li>
&lt;li>Interpret model performance metrics (R², RMSE, MAE)&lt;/li>
&lt;li>Analyze feature importance and partial dependence plots&lt;/li>
&lt;li>Build intuition for when ML adds value over simpler approaches&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">import sys
if &amp;quot;google.colab&amp;quot; in sys.modules:
    !git clone --depth 1 https://github.com/cmg777/claude4data.git /content/claude4data 2&amp;gt;/dev/null || true
    %cd /content/claude4data/notebooks
sys.path.insert(0, &amp;quot;..&amp;quot;)
from config import set_seeds, RANDOM_SEED, IMAGES_DIR, TABLES_DIR, DATA_DIR
set_seeds()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import randint
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
# Configuration
TARGET = &amp;quot;imds&amp;quot;
TARGET_LABEL = &amp;quot;IMDS (Municipal Sustainable Development Index)&amp;quot;
FEATURE_COLS = [f&amp;quot;A{i:02d}&amp;quot; for i in range(64)]
DS4BOLIVIA_BASE = &amp;quot;https://raw.githubusercontent.com/quarcs-lab/ds4bolivia/master&amp;quot;
CACHE_PATH = DATA_DIR / &amp;quot;rawData&amp;quot; / &amp;quot;ds4bolivia_merged.csv&amp;quot;
&lt;/code>&lt;/pre>
&lt;h2 id="data-loading">Data Loading&lt;/h2>
&lt;p>The data comes from the &lt;a href="https://github.com/quarcs-lab/ds4bolivia" target="_blank" rel="noopener">DS4Bolivia&lt;/a> repository, which provides standardized datasets for studying Bolivian development. We merge three tables on &lt;code>asdf_id&lt;/code> &amp;mdash; the unique identifier for each municipality: SDG indices (our target variables), satellite embeddings (our features), and region names (for context).&lt;/p>
&lt;pre>&lt;code class="language-python">if CACHE_PATH.exists():
    print(f&amp;quot;Loading cached data from {CACHE_PATH}&amp;quot;)
    df = pd.read_csv(CACHE_PATH)
else:
    print(&amp;quot;Downloading data from DS4Bolivia...&amp;quot;)
    sdg = pd.read_csv(f&amp;quot;{DS4BOLIVIA_BASE}/sdg/sdg.csv&amp;quot;)
    embeddings = pd.read_csv(
        f&amp;quot;{DS4BOLIVIA_BASE}/satelliteEmbeddings/satelliteEmbeddings2017.csv&amp;quot;
    )
    regions = pd.read_csv(f&amp;quot;{DS4BOLIVIA_BASE}/regionNames/regionNames.csv&amp;quot;)
    df = sdg.merge(embeddings, on=&amp;quot;asdf_id&amp;quot;).merge(regions, on=&amp;quot;asdf_id&amp;quot;)
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(CACHE_PATH, index=False)
    print(f&amp;quot;Cached merged data to {CACHE_PATH}&amp;quot;)
X = df[FEATURE_COLS]
y = df[TARGET]
mask = X.notna().all(axis=1) &amp;amp; y.notna()
X = X[mask]
y = y[mask]
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;Observations after dropping missing: {len(y)}&amp;quot;)
print(f&amp;quot;\nTarget variable ({TARGET}) summary:&amp;quot;)
print(y.describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Downloading data from DS4Bolivia...
Cached merged data to data/rawData/ds4bolivia_merged.csv
Dataset shape: (339, 88)
Observations after dropping missing: 339
Target variable (imds) summary:
count 339.00
mean 51.05
std 6.77
min 35.70
25% 47.00
50% 50.50
75% 54.85
max 80.20
Name: imds, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>All 339 Bolivian municipalities loaded successfully with no missing values &amp;mdash; the dataset provides complete national coverage. The merged data has 88 columns: the 64 satellite embedding features, SDG indices, and region identifiers. IMDS scores range from 35.70 to 80.20 with a mean of 51.05 and standard deviation of 6.77, meaning most municipalities cluster within about 7 points of the national average on the 0&amp;ndash;100 scale.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory Data Analysis&lt;/h2>
&lt;p>Before building any model, we explore the data to understand its structure. EDA helps us spot issues &amp;mdash; skewed distributions, outliers, or weak feature correlations &amp;mdash; that could affect model performance. It also builds intuition about what patterns the model might find.&lt;/p>
&lt;h3 id="target-distribution">Target Distribution&lt;/h3>
&lt;p>The histogram below shows how IMDS values are distributed across municipalities. The shape of this distribution matters: a highly skewed target can bias predictions toward the majority range.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(y, bins=30, edgecolor=&amp;quot;white&amp;quot;, alpha=0.8, color=&amp;quot;#6a9bcc&amp;quot;)
ax.axvline(y.mean(), color=&amp;quot;#d97757&amp;quot;, linestyle=&amp;quot;--&amp;quot;, linewidth=2, label=f&amp;quot;Mean = {y.mean():.1f}&amp;quot;)
ax.axvline(y.median(), color=&amp;quot;#141413&amp;quot;, linestyle=&amp;quot;:&amp;quot;, linewidth=2, label=f&amp;quot;Median = {y.median():.1f}&amp;quot;)
ax.set_xlabel(TARGET_LABEL)
ax.set_ylabel(&amp;quot;Count&amp;quot;)
ax.set_title(f&amp;quot;Distribution of {TARGET_LABEL}&amp;quot;)
ax.legend()
plt.savefig(IMAGES_DIR / &amp;quot;ml_target_distribution.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_target_distribution.png" alt="Distribution of IMDS scores across Bolivia&amp;rsquo;s municipalities. The dashed line marks the mean, the dotted line marks the median.">&lt;/p>
&lt;p>The distribution is roughly bell-shaped with a slight right skew &amp;mdash; the mean (51.1) sits just above the median (50.5), indicating a small tail of higher-performing municipalities. Most scores fall between 47 and 55, meaning the majority of Bolivia&amp;rsquo;s municipalities have similar mid-range development levels. The handful of outliers above 70 likely correspond to larger urban centers like La Paz, Santa Cruz, and Cochabamba, which concentrate substantially more infrastructure and services.&lt;/p>
&lt;h3 id="embedding-correlations">Embedding Correlations&lt;/h3>
&lt;p>Next we examine which satellite embedding dimensions are most correlated with the target. Strong correlations suggest the model has useful signal to learn from; weak correlations across the board would be a warning sign.&lt;/p>
&lt;pre>&lt;code class="language-python">correlations = X.corrwith(y).abs().sort_values(ascending=False)
top10_features = correlations.head(10).index.tolist()
corr_matrix = df[top10_features + [TARGET]].corr()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=&amp;quot;.2f&amp;quot;, cmap=&amp;quot;RdBu_r&amp;quot;, center=0,
            square=True, ax=ax, vmin=-1, vmax=1)
ax.set_title(f&amp;quot;Correlations: Top-10 Embeddings &amp;amp; {TARGET_LABEL}&amp;quot;)
plt.savefig(IMAGES_DIR / &amp;quot;ml_embedding_correlations.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_embedding_correlations.png" alt="Correlation matrix of the top-10 most correlated satellite embedding dimensions with IMDS.">&lt;/p>
&lt;p>The heatmap reveals that the strongest individual correlations between embedding dimensions and IMDS are moderate (in the 0.25&amp;ndash;0.40 range), which is typical for satellite-derived features predicting complex socioeconomic outcomes. Several embedding dimensions are also correlated with each other, suggesting they capture overlapping spatial patterns. The Random Forest handles this &lt;em>multicollinearity&lt;/em> (features carrying overlapping information) well because it considers only a random subset of features at each split. With these moderate correlations, the model has real signal to work with, so let&amp;rsquo;s proceed to building it.&lt;/p>
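&lt;p>A quick way to see this robustness is to duplicate a feature: the forest&amp;rsquo;s importances are simply shared between the identical columns rather than destabilizing the model. The snippet below is a standalone sketch on synthetic data (not the Bolivia dataset); all names are illustrative.&lt;/p>
&lt;pre>&lt;code class="language-python"># Standalone sketch: a perfectly collinear feature pair does not break the forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_one = rng.normal(size=(300, 1))
y_sim = X_one[:, 0] + 0.2 * rng.normal(size=300)
X_dup = np.hstack([X_one, X_one])   # duplicate the single informative column

imp = RandomForestRegressor(random_state=1).fit(X_dup, y_sim).feature_importances_
print(imp.sum().round(2))           # importances always sum to 1.0
print(imp.round(2))                 # the twin columns split that total roughly evenly
&lt;/code>&lt;/pre>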
&lt;h2 id="traintest-split">Train/Test Split&lt;/h2>
&lt;p>Now that we understand the data&amp;rsquo;s structure, we can prepare it for modeling. We split the data into training (80%) and test (20%) sets &lt;em>before&lt;/em> any model fitting. This is a fundamental ML practice: if the model ever &amp;ldquo;sees&amp;rdquo; the test data during training or tuning, our performance estimate will be overly optimistic &amp;mdash; a problem called &lt;strong>data leakage&lt;/strong>. The &lt;code>random_state&lt;/code> ensures the same split every time we run the notebook, making results reproducible.&lt;/p>
&lt;pre>&lt;code class="language-python">X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=RANDOM_SEED
)
print(f&amp;quot;Training set: {len(X_train)} municipalities&amp;quot;)
print(f&amp;quot;Test set: {len(X_test)} municipalities&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Training set: 271 municipalities
Test set: 68 municipalities
&lt;/code>&lt;/pre>
&lt;p>The split gives us 271 municipalities for training and 68 for testing. With only 339 total observations, this is a relatively small dataset for ML &amp;mdash; each of the 68 test municipalities carries about 1.5% of the weight in the test score. This makes cross-validation especially important for getting reliable performance estimates, since a single 68-sample test set could be unrepresentative by chance.&lt;/p>
&lt;h2 id="baseline-model">Baseline Model&lt;/h2>
&lt;p>Before tuning anything, we establish a baseline using a Random Forest with default hyperparameters. &lt;strong>Random Forest&lt;/strong> works by building many decision trees on random subsets of the data and features, then averaging their predictions. This &amp;ldquo;wisdom of crowds&amp;rdquo; approach reduces overfitting compared to a single decision tree. Formally, the prediction is:&lt;/p>
&lt;p>$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$$&lt;/p>
&lt;p>In words, the predicted value $\hat{y}$ is the average of predictions from all $B$ individual trees. Each tree $T_b$ sees a different random subset of training rows and features, so the trees make different errors &amp;mdash; averaging cancels out much of the noise. Here $B$ corresponds to the &lt;code>n_estimators&lt;/code> parameter (100 in our baseline, 500 after tuning) and $\mathbf{x}$ is the 64-dimensional satellite embedding vector for a given municipality.&lt;/p>
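&lt;p>As a sanity check on this formula, the standalone sketch below (synthetic data and illustrative names, not part of the analysis) fits a small forest and confirms that its prediction is the average of its individual trees&amp;rsquo; predictions:&lt;/p>
&lt;pre>&lt;code class="language-python"># Standalone sketch: verify that the forest prediction averages its trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))                  # synthetic stand-in features
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)
x_new = X_demo[:1]

forest_pred = forest.predict(x_new)[0]
tree_mean = np.mean([tree.predict(x_new)[0] for tree in forest.estimators_])
print(np.isclose(forest_pred, tree_mean))           # True: the forest is the mean of its trees
&lt;/code>&lt;/pre>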
&lt;h3 id="cross-validation">Cross-Validation&lt;/h3>
&lt;p>Think of cross-validation as a rotating exam: the model takes turns training on different subsets and testing on the remainder, so no single lucky split determines the score. We evaluate the baseline with 5-fold cross-validation on the training set. Instead of a single train/validation split, k-fold CV rotates through 5 different validation sets and averages the scores. This gives a more reliable and stable performance estimate, especially important with smaller datasets like ours.&lt;/p>
&lt;pre>&lt;code class="language-python">baseline_rf = RandomForestRegressor(n_estimators=100, random_state=RANDOM_SEED)
cv_scores = cross_val_score(baseline_rf, X_train, y_train, cv=5, scoring=&amp;quot;r2&amp;quot;)
print(f&amp;quot;5-Fold CV R² scores: {cv_scores.round(4)}&amp;quot;)
print(f&amp;quot;Mean CV R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>5-Fold CV R² scores: [0.152 0.1867 0.2704 0.3084 0.3454]
Mean CV R²: 0.2526 (+/- 0.0728)
&lt;/code>&lt;/pre>
&lt;p>The 5-fold CV R² scores range from 0.152 to 0.345, with a mean of 0.2526 (+/- 0.0728). This means the baseline model explains about 25% of the variation in IMDS on average, but the high variability across folds (standard deviation of 0.07) reflects the small dataset &amp;mdash; different subsets of 271 municipalities can look quite different from each other. An R² around 0.25 is a reasonable starting point for predicting a complex social outcome from satellite imagery alone.&lt;/p>
&lt;h3 id="test-evaluation">Test Evaluation&lt;/h3>
&lt;p>We now fit the baseline on the full training set and evaluate on the held-out test data. This gives our first concrete performance estimate &amp;mdash; a reference point that any tuning should improve upon.&lt;/p>
&lt;pre>&lt;code class="language-python">baseline_rf.fit(X_train, y_train)
baseline_pred = baseline_rf.predict(X_test)
baseline_r2 = r2_score(y_test, baseline_pred)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
baseline_mae = mean_absolute_error(y_test, baseline_pred)
print(f&amp;quot;Baseline Test R²: {baseline_r2:.4f}&amp;quot;)
print(f&amp;quot;Baseline Test RMSE: {baseline_rmse:.2f}&amp;quot;)
print(f&amp;quot;Baseline Test MAE: {baseline_mae:.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Baseline Test R²: 0.2307
Baseline Test RMSE: 6.52
Baseline Test MAE: 4.68
&lt;/code>&lt;/pre>
&lt;p>On the held-out test set, the baseline achieves R² = 0.2307, RMSE = 6.52, and MAE = 4.68. In practical terms, the model&amp;rsquo;s predictions are typically off by about 4.7 IMDS points (MAE) on a scale where most values fall between 47 and 55. The RMSE of 6.52 is higher than the MAE, indicating some larger errors are pulling it up. This baseline gives us a concrete reference &amp;mdash; any improvement from tuning should beat these numbers.&lt;/p>
&lt;h2 id="hyperparameter-tuning">Hyperparameter Tuning&lt;/h2>
&lt;p>The baseline model uses scikit-learn&amp;rsquo;s defaults, but we can often do better by searching for optimal hyperparameters. &lt;strong>RandomizedSearchCV&lt;/strong> is more efficient than exhaustive grid search &amp;mdash; it samples random combinations and evaluates each with cross-validation. Here&amp;rsquo;s what each hyperparameter controls:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>n_estimators&lt;/strong>: Number of trees in the forest (more trees = more stable but slower)&lt;/li>
&lt;li>&lt;strong>max_depth&lt;/strong>: How deep each tree can grow (deeper = more complex patterns but risk overfitting)&lt;/li>
&lt;li>&lt;strong>min_samples_split&lt;/strong>: Minimum samples needed to split a node (higher = more regularization)&lt;/li>
&lt;li>&lt;strong>min_samples_leaf&lt;/strong>: Minimum samples in a leaf node (higher = smoother predictions)&lt;/li>
&lt;li>&lt;strong>max_features&lt;/strong>: How many features each tree considers per split (fewer = more diverse trees)&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">param_distributions = {
&amp;quot;n_estimators&amp;quot;: [100, 200, 300, 500],
&amp;quot;max_depth&amp;quot;: [None, 10, 20, 30],
&amp;quot;min_samples_split&amp;quot;: randint(2, 11),
&amp;quot;min_samples_leaf&amp;quot;: randint(1, 5),
&amp;quot;max_features&amp;quot;: [&amp;quot;sqrt&amp;quot;, &amp;quot;log2&amp;quot;, None],
}
search = RandomizedSearchCV(
RandomForestRegressor(random_state=RANDOM_SEED),
param_distributions=param_distributions,
n_iter=50,
cv=5,
scoring=&amp;quot;r2&amp;quot;,
random_state=RANDOM_SEED,
n_jobs=-1,
)
search.fit(X_train, y_train)
print(f&amp;quot;Best CV R²: {search.best_score_:.4f}&amp;quot;)
print(f&amp;quot;\nBest parameters:&amp;quot;)
for param, value in search.best_params_.items():
print(f&amp;quot; {param}: {value}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Best CV R²: 0.2721
Best parameters:
max_depth: 30
max_features: sqrt
min_samples_leaf: 1
min_samples_split: 4
n_estimators: 500
&lt;/code>&lt;/pre>
&lt;p>The best configuration found uses 500 trees with max_depth=30, max_features=sqrt, min_samples_leaf=1, and min_samples_split=4. The best CV R² of 0.2721 is modestly higher than the baseline&amp;rsquo;s 0.2526 &amp;mdash; about a 2 percentage point improvement in explained variance. The tuning left tree depth essentially unconstrained (a cap of 30 rarely binds with only 271 training rows, so it behaves much like the default of unlimited depth) while restricting feature subsampling to sqrt(64) = 8 features per split, which encourages tree diversity.&lt;/p>
&lt;h2 id="model-evaluation">Model Evaluation&lt;/h2>
&lt;p>Now we evaluate the tuned model on the held-out test set &amp;mdash; data the model has never seen during training or tuning. Three complementary metrics tell us different things:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>R²&lt;/strong> (coefficient of determination): What fraction of the target&amp;rsquo;s variance the model explains. R² = 1.0 is perfect; R² = 0 means the model is no better than predicting the mean.&lt;/li>
&lt;li>&lt;strong>RMSE&lt;/strong> (Root Mean Squared Error): Average prediction error in the same units as the target. Penalizes large errors more heavily.&lt;/li>
&lt;li>&lt;strong>MAE&lt;/strong> (Mean Absolute Error): Average absolute error. More robust to outliers than RMSE.&lt;/li>
&lt;/ul>
&lt;p>$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$&lt;/p>
&lt;p>$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$&lt;/p>
&lt;p>In these formulas, $y_i$ is the actual IMDS value for municipality $i$, $\hat{y}_i$ is the model&amp;rsquo;s prediction, and $\bar{y}$ is the mean IMDS across all test municipalities. R² compares total prediction error to the naive baseline of always guessing the mean &amp;mdash; higher is better. RMSE and MAE both measure average error in IMDS points, but RMSE penalizes large misses more heavily because it squares the errors before averaging. In code, $y_i$ is &lt;code>y_test&lt;/code>, $\hat{y}_i$ is &lt;code>tuned_pred&lt;/code>, and $n$ is 68 (the test set size).&lt;/p>
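&lt;p>To make these formulas concrete, the standalone snippet below computes all three metrics directly from their definitions on a small toy vector and checks them against scikit-learn&amp;rsquo;s implementations (the numbers are illustrative, not model output):&lt;/p>
&lt;pre>&lt;code class="language-python"># Standalone sketch: the three metrics computed directly from the formulas.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([50.0, 47.5, 55.0, 61.0, 44.0])   # toy actuals
y_hat = np.array([51.0, 49.0, 52.0, 57.0, 46.5])    # toy predictions

resid = y_true - y_hat
r2_manual = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
rmse_manual = np.sqrt(np.mean(resid ** 2))
mae_manual = np.mean(np.abs(resid))

print(np.isclose(r2_manual, r2_score(y_true, y_hat)))                        # True
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))   # True
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))            # True
&lt;/code>&lt;/pre>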
&lt;pre>&lt;code class="language-python">best_rf = search.best_estimator_
tuned_pred = best_rf.predict(X_test)
tuned_r2 = r2_score(y_test, tuned_pred)
tuned_rmse = np.sqrt(mean_squared_error(y_test, tuned_pred))
tuned_mae = mean_absolute_error(y_test, tuned_pred)
print(f&amp;quot;Tuned Test R²: {tuned_r2:.4f}&amp;quot;)
print(f&amp;quot;Tuned Test RMSE: {tuned_rmse:.2f}&amp;quot;)
print(f&amp;quot;Tuned Test MAE: {tuned_mae:.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Tuned Test R²: 0.2297
Tuned Test RMSE: 6.52
Tuned Test MAE: 4.72
&lt;/code>&lt;/pre>
&lt;p>The tuned model achieves R² = 0.2297, RMSE = 6.52, and MAE = 4.72 on the test set &amp;mdash; essentially identical to the baseline (R² = 0.2307, RMSE = 6.52, MAE = 4.68). This is a common finding with small datasets: the tuning improved CV performance slightly but the gains didn&amp;rsquo;t transfer to the specific test set. The model explains about 23% of IMDS variation, meaning satellite embeddings capture real but limited predictive signal for municipal development.&lt;/p>
&lt;h3 id="actual-vs-predicted">Actual vs Predicted&lt;/h3>
&lt;p>This scatter plot shows how well the model&amp;rsquo;s predictions match reality. Points falling exactly on the dashed 45-degree line would indicate perfect predictions; scatter around the line shows prediction error.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(y_test, tuned_pred, alpha=0.6, edgecolors=&amp;quot;white&amp;quot;, linewidth=0.5, color=&amp;quot;#6a9bcc&amp;quot;)
lims = [min(y_test.min(), tuned_pred.min()) - 2, max(y_test.max(), tuned_pred.max()) + 2]
ax.plot(lims, lims, &amp;quot;--&amp;quot;, color=&amp;quot;#d97757&amp;quot;, linewidth=2, label=&amp;quot;Perfect prediction&amp;quot;)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel(f&amp;quot;Actual {TARGET_LABEL}&amp;quot;)
ax.set_ylabel(f&amp;quot;Predicted {TARGET_LABEL}&amp;quot;)
ax.set_title(f&amp;quot;Actual vs Predicted {TARGET_LABEL}&amp;quot;)
ax.legend()
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(IMAGES_DIR / &amp;quot;ml_actual_vs_predicted.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_actual_vs_predicted.png" alt="Actual vs predicted IMDS scores on the test set. The dashed line represents perfect prediction.">&lt;/p>
&lt;p>The scatter shows moderate agreement between actual and predicted IMDS values, with noticeable spread around the 45-degree line. Predictions tend to cluster in the 47&amp;ndash;55 range (near the training mean), with the model struggling to predict extreme values &amp;mdash; municipalities with very high or low IMDS scores are pulled toward the center. This &amp;ldquo;regression to the mean&amp;rdquo; effect is typical when the model has limited predictive power.&lt;/p>
&lt;h3 id="residual-analysis">Residual Analysis&lt;/h3>
&lt;p>Residuals (actual minus predicted) should ideally be randomly scattered around zero with no obvious pattern. Patterns in residuals can reveal systematic biases &amp;mdash; for example, if the model consistently underpredicts high-IMDS municipalities, it suggests the features miss something important about well-developed areas.&lt;/p>
&lt;pre>&lt;code class="language-python">residuals = y_test - tuned_pred
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(tuned_pred, residuals, alpha=0.6, edgecolors=&amp;quot;white&amp;quot;, linewidth=0.5, color=&amp;quot;#6a9bcc&amp;quot;)
ax.axhline(0, color=&amp;quot;#d97757&amp;quot;, linestyle=&amp;quot;--&amp;quot;, linewidth=2)
ax.set_xlabel(f&amp;quot;Predicted {TARGET_LABEL}&amp;quot;)
ax.set_ylabel(&amp;quot;Residuals&amp;quot;)
ax.set_title(&amp;quot;Residuals vs Predicted Values&amp;quot;)
plt.savefig(IMAGES_DIR / &amp;quot;ml_residuals.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_residuals.png" alt="Residuals (actual minus predicted) vs predicted IMDS values. Random scatter around zero indicates no systematic bias.">&lt;/p>
&lt;p>The residuals appear roughly randomly scattered around zero, which is encouraging &amp;mdash; there&amp;rsquo;s no strong systematic bias. However, the spread is wider at the extremes, suggesting the model&amp;rsquo;s errors are larger for municipalities with unusually high or low predicted IMDS. This changing error spread across the prediction range, known as &lt;em>heteroscedasticity&lt;/em>, is consistent with the regression-to-the-mean effect seen in the scatter plot above.&lt;/p>
&lt;h2 id="feature-importance">Feature Importance&lt;/h2>
&lt;p>Which satellite embedding dimensions matter most for predicting IMDS? We compare two methods that answer this question differently:&lt;/p>
&lt;h3 id="mean-decrease-in-impurity-mdi">Mean Decrease in Impurity (MDI)&lt;/h3>
&lt;p>MDI measures how much each feature reduces prediction error across all splits in all trees. It&amp;rsquo;s fast to compute (built into the trained model) but can be biased toward &lt;em>high-cardinality&lt;/em> features &amp;mdash; those with many distinct values, like continuous numbers &amp;mdash; or correlated features.&lt;/p>
&lt;pre>&lt;code class="language-python">mdi_importance = pd.Series(best_rf.feature_importances_, index=FEATURE_COLS)
top20_mdi = mdi_importance.sort_values(ascending=False).head(20)
fig, ax = plt.subplots(figsize=(10, 6))
top20_mdi.sort_values().plot.barh(ax=ax, color=&amp;quot;#6a9bcc&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(&amp;quot;Mean Decrease in Impurity&amp;quot;)
ax.set_title(f&amp;quot;Top-20 Feature Importance (MDI) for {TARGET_LABEL}&amp;quot;)
plt.savefig(IMAGES_DIR / &amp;quot;ml_feature_importance_mdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_feature_importance_mdi.png" alt="Top-20 satellite embedding features ranked by Mean Decrease in Impurity.">&lt;/p>
&lt;p>The MDI plot shows that A30 and A59 rank highest, but importance is distributed across many embedding dimensions rather than concentrated in just a few. This suggests the satellite imagery captures multiple independent visual patterns relevant to development &amp;mdash; no single dimension dominates. However, MDI can be inflated for continuous features, so we&amp;rsquo;ll cross-check with permutation importance next.&lt;/p>
&lt;h3 id="permutation-importance">Permutation Importance&lt;/h3>
&lt;p>Permutation importance is more reliable. Imagine scrambling all the values in one column of a spreadsheet &amp;mdash; if the model&amp;rsquo;s accuracy barely changes, that column wasn&amp;rsquo;t contributing much. That&amp;rsquo;s exactly what permutation importance does: it randomly shuffles each feature and measures how much the model&amp;rsquo;s R² drops. Unlike MDI, permutation importance is evaluated on the test set and is not biased by feature scale or cardinality.&lt;/p>
&lt;pre>&lt;code class="language-python">perm_result = permutation_importance(
best_rf, X_test, y_test, n_repeats=10, random_state=RANDOM_SEED, n_jobs=-1
)
perm_importance = pd.Series(perm_result.importances_mean, index=FEATURE_COLS)
top20_perm = perm_importance.sort_values(ascending=False).head(20)
fig, ax = plt.subplots(figsize=(10, 6))
top20_perm.sort_values().plot.barh(ax=ax, color=&amp;quot;#d97757&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(&amp;quot;Mean Decrease in R² (Permutation)&amp;quot;)
ax.set_title(f&amp;quot;Top-20 Feature Importance (Permutation) for {TARGET_LABEL}&amp;quot;)
plt.savefig(IMAGES_DIR / &amp;quot;ml_feature_importance_permutation.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_feature_importance_permutation.png" alt="Top-20 satellite embedding features ranked by permutation importance (mean decrease in R² when feature is shuffled).">&lt;/p>
&lt;p>Permutation importance gives a more trustworthy picture. A59 emerges as the clear top feature under both methods, with A42 and A26 also ranking highly. The ranking differs somewhat from MDI (A30 drops considerably), which is expected &amp;mdash; permutation importance is less biased and directly measures predictive contribution on the test set. These top features are the embedding dimensions that genuinely help the model distinguish between municipalities with different IMDS levels. Let&amp;rsquo;s now visualize how these features affect predictions.&lt;/p>
&lt;h2 id="partial-dependence-plots">Partial Dependence Plots&lt;/h2>
&lt;p>Partial dependence plots show the marginal effect of a single feature on predictions, averaging over all other features. They reveal non-linear relationships that a simple correlation coefficient can&amp;rsquo;t capture &amp;mdash; for example, a feature might have no effect below a threshold but a strong effect above it. We plot the top-6 most important features (by permutation importance).&lt;/p>
&lt;pre>&lt;code class="language-python">top6_features = perm_importance.sort_values(ascending=False).head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
PartialDependenceDisplay.from_estimator(
    best_rf, X_train, top6_features, ax=axes.ravel(),
    grid_resolution=50, n_jobs=-1
)
fig.suptitle(f&amp;quot;Partial Dependence Plots — Top-6 Features for {TARGET_LABEL}&amp;quot;, fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.savefig(IMAGES_DIR / &amp;quot;ml_partial_dependence.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="ml_partial_dependence.png" alt="Partial dependence plots for the top-6 most important satellite embedding features, showing how each feature&amp;rsquo;s value affects the predicted IMDS score.">&lt;/p>
&lt;p>The partial dependence plots reveal non-linear relationships between the top features and predicted IMDS. Some dimensions show threshold effects &amp;mdash; the predicted IMDS changes sharply at certain embedding values then levels off. These non-linearities justify using Random Forest over a linear model, as a linear regression would miss these step-like patterns. The embedding dimensions likely correspond to visual landscape features (urbanization, vegetation cover, infrastructure density) that change abruptly between rural and urban municipalities.&lt;/p>
&lt;h2 id="summary-and-results">Summary and Results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Baseline&lt;/th>
&lt;th>Tuned&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>R²&lt;/td>
&lt;td>0.2307&lt;/td>
&lt;td>0.2297&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RMSE&lt;/td>
&lt;td>6.52&lt;/td>
&lt;td>6.52&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>MAE&lt;/td>
&lt;td>4.68&lt;/td>
&lt;td>4.72&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The summary table confirms that tuning provided negligible improvement over the baseline for this dataset: both models achieve R² around 0.23, RMSE of 6.52, and MAE near 4.7. Key takeaways from this analysis:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> Random Forest with default hyperparameters performed just as well as the tuned model (R² = 0.2307 vs 0.2297), suggesting the performance ceiling comes from the features themselves, not model configuration. When the signal in the data is limited, sophisticated tuning adds little.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> Satellite embeddings explain roughly a quarter of IMDS variation &amp;mdash; a meaningful signal showing that remote sensing captures real development-related patterns. Feature importance is broadly distributed across embedding dimensions (A59, A42, A26 rank highest), meaning IMDS prediction relies on many visual patterns rather than a single dominant signal.&lt;/li>
&lt;li>&lt;strong>Practical limitation:&lt;/strong> The model&amp;rsquo;s regression-to-the-mean behavior (predictions cluster in the 47&amp;ndash;55 range) means it cannot reliably identify the highest- or lowest-performing municipalities individually. A policymaker using these predictions to target aid would miss the most extreme cases.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> The 77% of unexplained variance likely comes from factors invisible to satellites &amp;mdash; governance quality, migration patterns, informal economies. Combining satellite embeddings with administrative or survey data would be the natural next experiment to boost predictive power.&lt;/li>
&lt;/ul>
&lt;h2 id="limitations-and-next-steps">Limitations and Next Steps&lt;/h2>
&lt;p>This analysis demonstrates that satellite embeddings contain real predictive signal for municipal development outcomes, but several limitations apply:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Moderate R²&lt;/strong>: The model captures meaningful patterns but leaves much variation unexplained &amp;mdash; development is driven by many factors invisible from space (governance, migration, informal economy).&lt;/li>
&lt;li>&lt;strong>Temporal mismatch&lt;/strong>: We use 2017 satellite imagery with SDG indices from a potentially different period.&lt;/li>
&lt;li>&lt;strong>Feature interpretability&lt;/strong>: Embedding dimensions (A00&amp;ndash;A63) are abstract; connecting them to physical landscape features requires further analysis.&lt;/li>
&lt;li>&lt;strong>Small sample&lt;/strong>: With only 339 municipalities, complex models risk overfitting despite cross-validation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps&lt;/strong> could include: trying other algorithms (gradient boosting, regularized regression), incorporating additional features (geographic, demographic), or using explainability tools like SHAP values for richer interpretation.&lt;/p>
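&lt;p>As a minimal sketch of the first of these next steps, the standalone snippet below compares Random Forest and Gradient Boosting with 5-fold cross-validation on synthetic data shaped like the embeddings; to run the real comparison, substitute the &lt;code>X_train&lt;/code> and &lt;code>y_train&lt;/code> defined earlier in this post.&lt;/p>
&lt;pre>&lt;code class="language-python"># Standalone sketch: compare two tree ensembles with 5-fold CV on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 64))   # stand-in for the 64 embedding dimensions
y_syn = X_syn[:, 0] - 0.5 * X_syn[:, 1] + rng.normal(size=300)

for model in (RandomForestRegressor(random_state=42),
              GradientBoostingRegressor(random_state=42)):
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring='r2')
    print(type(model).__name__, scores.mean().round(3))
&lt;/code>&lt;/pre>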
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Try a different algorithm.&lt;/strong> Replace &lt;code>RandomForestRegressor&lt;/code> with &lt;code>GradientBoostingRegressor&lt;/code> from scikit-learn. Does the R² improve? How do the feature importance rankings change compared to Random Forest?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Predict a different SDG index.&lt;/strong> The DS4Bolivia dataset contains 15 individual SDG indices (&lt;code>sdg1&lt;/code> through &lt;code>sdg15&lt;/code>) alongside the composite IMDS. Pick one SDG index as the target and re-run the full pipeline. Which SDG dimensions are most predictable from satellite imagery, and which are hardest?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add geographic features.&lt;/strong> Merge the region names data and create dummy variables for Bolivia&amp;rsquo;s nine departments. Does combining satellite embeddings with administrative region information improve model performance? What does this tell you about spatial patterns in development?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" target="_blank" rel="noopener">scikit-learn &amp;mdash; RandomForestRegressor&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html" target="_blank" rel="noopener">scikit-learn &amp;mdash; RandomizedSearchCV&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/permutation_importance.html" target="_blank" rel="noopener">scikit-learn &amp;mdash; Permutation Importance&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/partial_dependence.html" target="_blank" rel="noopener">scikit-learn &amp;mdash; Partial Dependence Plots&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/quarcs-lab/ds4bolivia" target="_blank" rel="noopener">QUARCS Lab. DS4Bolivia &amp;mdash; Open Data for Bolivian Development.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1023/A:1010933404324" target="_blank" rel="noopener">Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5&amp;ndash;32.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Exploratory Spatial Data Analysis (ESDA)</title><link>https://carlos-mendez.org/post/python_esda/</link><pubDate>Fri, 01 Mar 2024 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_esda/</guid><description>&lt;h1 id="exploratory-spatial-data-analysis-esda-of-regional-development">Exploratory Spatial Data Analysis (ESDA) of Regional Development&lt;/h1>
&lt;p>This &lt;a href="https://esda101-bolivia339.streamlit.app/" target="_blank" rel="noopener">interactive application&lt;/a> enables users to explore municipal development indicators across Bolivia. In particular, it offers:&lt;/p>
&lt;ul>
&lt;li>🗺️ Geographical data visualizations&lt;/li>
&lt;li>📈 Distribution and comparative analysis tools&lt;/li>
&lt;li>💾 Downloadable datasets&lt;/li>
&lt;li>🧮 Access to a cloud-based computational notebook on &lt;a href="https://colab.research.google.com/drive/1JHf8wPxSxBdKKhXaKQZUzhEpVznKGiep?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>&lt;/li>
&lt;/ul>
&lt;iframe
src="https://cmg777.github.io/open-results/files/mapBolivia339imds.html"
width="100%"
height="576"
frameborder="0"
loading="lazy"
style="border:none;">
&lt;/iframe>
&lt;blockquote>
&lt;p>⚠️ This application is open source and still work in progress. Source code is available at: &lt;a href="https://github.com/cmg777/streamlit_esda101" target="_blank" rel="noopener">github.com/cmg777/streamlit_esda101&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="-data-sources-and-credits">📚 Data Sources and Credits&lt;/h2>
&lt;ul>
&lt;li>Primary data source: &lt;a href="https://sdsnbolivia.org/Atlas/" target="_blank" rel="noopener">Municipal Atlas of the SDGs in Bolivia 2020.&lt;/a>&lt;/li>
&lt;li>Additional indicators for multiple years were sourced from the &lt;a href="https://www.aiddata.org/geoquery" target="_blank" rel="noopener">GeoQuery project.&lt;/a>&lt;/li>
&lt;li>Administrative boundaries from the &lt;a href="https://www.geoboundaries.org/" target="_blank" rel="noopener">GeoBoundaries database&lt;/a>&lt;/li>
&lt;li>Streamlit web app and computational notebook by &lt;a href="https://carlos-mendez.org" target="_blank" rel="noopener">Carlos Mendez.&lt;/a>&lt;/li>
&lt;li>Erick Gonzales and Pedro Leoni also collaborated in organizing the data and creating the initial geospatial database.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Citation&lt;/strong>:&lt;br>
Mendez, C. (2025, March 24). &lt;em>Regional Development Indicators of Bolivia: A Dashboard for Exploratory Analysis&lt;/em> (Version 0.0.2) [Computer software]. Zenodo. &lt;a href="https://doi.org/10.5281/zenodo.15074864" target="_blank" rel="noopener">https://doi.org/10.5281/zenodo.15074864&lt;/a>&lt;/p>
&lt;hr>
&lt;h2 id="-context-and-motivation">🌐 Context and Motivation&lt;/h2>
&lt;p>Adopted in 2015, the &lt;strong>2030 Agenda for Sustainable Development&lt;/strong> established 17 Sustainable Development Goals. While global metrics offer useful benchmarks, they often overlook subnational disparities—particularly in heterogeneous countries such as Bolivia.&lt;/p>
&lt;ul>
&lt;li>🇧🇴 Bolivia ranks &lt;strong>79/166&lt;/strong> on the 2020 SDG Index (score: 69.3)&lt;/li>
&lt;li>🏘️ The &lt;em>&lt;a href="http://atlas.sdsnbolivia.org" target="_blank" rel="noopener">Municipal Atlas of the SDGs in Bolivia 2020&lt;/a>&lt;/em> reveals &lt;strong>intra-national disparities&lt;/strong> comparable to &lt;strong>global inter-country variation&lt;/strong>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="-development-index-índice-municipal-de-desarrollo-sostenible-imds">📊 Development Index: Índice Municipal de Desarrollo Sostenible (IMDS)&lt;/h2>
&lt;p>The &lt;strong>Municipal Sustainable Development Index (IMDS)&lt;/strong> summarizes municipal performance using 62 indicators across 15 Sustainable Development Goals. However, systematic and reliable information on Goals 12 and 14 was not available at the municipal level.&lt;/p>
&lt;h3 id="-methodological-criteria">🎯 Methodological Criteria&lt;/h3>
&lt;ul>
&lt;li>✅ Relevance to local Sustainable Development Goal targets&lt;/li>
&lt;li>📥 Data availability from official or trusted sources&lt;/li>
&lt;li>🌐 Full municipal coverage (339 municipalities)&lt;/li>
&lt;li>🕒 Data mostly from 2012–2019&lt;/li>
&lt;li>🧮 Low redundancy between indicators&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="-indicators-by-sustainable-development-goal">🗃️ Indicators by Sustainable Development Goal&lt;/h2>
&lt;h3 id="-goal-1-no-poverty">🧱 Goal 1: No Poverty&lt;/h3>
&lt;ul>
&lt;li>Energy poverty rate (2012, INE)&lt;/li>
&lt;li>Multidimensional Poverty Index (2013, UDAPE)&lt;/li>
&lt;li>Unmet Basic Needs (2012, INE)&lt;/li>
&lt;li>Access to basic services: water, sanitation, electricity (2012, INE)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-2-zero-hunger">🌾 Goal 2: Zero Hunger&lt;/h3>
&lt;ul>
&lt;li>Chronic malnutrition in children under five (2016, Ministry of Health)&lt;/li>
&lt;li>Obesity prevalence in women (2016, Ministry of Health)&lt;/li>
&lt;li>Average agricultural unit size (2013, Agricultural Census)&lt;/li>
&lt;li>Tractor density per 1,000 farms (2013, Agricultural Census)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-3-good-health-and-well-being">🏥 Goal 3: Good Health and Well-being&lt;/h3>
&lt;ul>
&lt;li>Infant and under-five mortality rates (2016, Ministry of Health)&lt;/li>
&lt;li>Institutional birth coverage (2016, Ministry of Health)&lt;/li>
&lt;li>Incidence of Chagas, HIV, malaria, tuberculosis, dengue (2016, Ministry of Health)&lt;/li>
&lt;li>Adolescent fertility rate (2016, Ministry of Health)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-4-quality-education">📚 Goal 4: Quality Education&lt;/h3>
&lt;ul>
&lt;li>Secondary school dropout rates, by gender (2016, Ministry of Education)&lt;/li>
&lt;li>Adult literacy rate (2012, INE)&lt;/li>
&lt;li>Share of population with higher education (2012, INE)&lt;/li>
&lt;li>Share of qualified teachers, initial and secondary levels (2016, Ministry of Education)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-5-gender-equality">⚖️ Goal 5: Gender Equality&lt;/h3>
&lt;ul>
&lt;li>Gender parity in education, labor participation, and poverty (2012–2016, INE and UDAPE)&lt;/li>
&lt;li>&lt;em>Note: Data on gender-based violence not available at municipal level&lt;/em>&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-6-clean-water-and-sanitation">💧 Goal 6: Clean Water and Sanitation&lt;/h3>
&lt;ul>
&lt;li>Access to potable water (2012, INE)&lt;/li>
&lt;li>Access to sanitation services (2012, INE)&lt;/li>
&lt;li>Proportion of treated wastewater (2015, Ministry of Environment)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-7-affordable-and-clean-energy">⚡ Goal 7: Affordable and Clean Energy&lt;/h3>
&lt;ul>
&lt;li>Electricity coverage (2012, INE)&lt;/li>
&lt;li>Per capita electricity consumption (2015, Ministry of Energy)&lt;/li>
&lt;li>Use of clean cooking energy (2015, Ministry of Hydrocarbons)&lt;/li>
&lt;li>CO₂ emissions per capita, energy-related (2015, international satellite data)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-8-decent-work-and-economic-growth">💼 Goal 8: Decent Work and Economic Growth&lt;/h3>
&lt;ul>
&lt;li>Share of non-functioning electricity meters (proxy for informality/unemployment) (2015, Ministry of Energy)&lt;/li>
&lt;li>Labor force participation rate (2012, INE)&lt;/li>
&lt;li>Youth not in education, employment, or training (NEET rate) (2015, Ministry of Labor)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-9-industry-innovation-and-infrastructure">🏗️ Goal 9: Industry, Innovation, and Infrastructure&lt;/h3>
&lt;ul>
&lt;li>Internet access in households (2012, INE)&lt;/li>
&lt;li>Mobile signal coverage (2015, telecommunications data)&lt;/li>
&lt;li>Availability of urban infrastructure (2015, Ministry of Public Works)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-10-reduced-inequality">⚖️ Goal 10: Reduced Inequality&lt;/h3>
&lt;ul>
&lt;li>Proxy measures: municipal differences in poverty and participation rates (2012–2016, INE and UDAPE)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-11-sustainable-cities-and-communities">🏘️ Goal 11: Sustainable Cities and Communities&lt;/h3>
&lt;ul>
&lt;li>Urban housing adequacy (2012, INE)&lt;/li>
&lt;li>Access to collective transportation (2015, Ministry of Transport)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-13-climate-action">🌍 Goal 13: Climate Action&lt;/h3>
&lt;ul>
&lt;li>Natural disaster resilience index (2015, Ministry of Environment)&lt;/li>
&lt;li>CO₂ emissions and forest degradation (2015, satellite data)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-15-life-on-land">🌳 Goal 15: Life on Land&lt;/h3>
&lt;ul>
&lt;li>Deforestation rates (2015, satellite data)&lt;/li>
&lt;li>Biodiversity loss indicators (2015, Ministry of Environment)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-16-peace-justice-and-strong-institutions">🕊️ Goal 16: Peace, Justice, and Strong Institutions&lt;/h3>
&lt;ul>
&lt;li>Birth registration coverage (2012, INE)&lt;/li>
&lt;li>Crime and homicide rates (2015, Ministry of Government)&lt;/li>
&lt;li>Corruption perceptions (2015, civil society organizations)&lt;/li>
&lt;/ul>
&lt;h3 id="-goal-17-partnerships-for-the-goals">🤝 Goal 17: Partnerships for the Goals&lt;/h3>
&lt;ul>
&lt;li>Municipal fiscal capacity (2015, Ministry of Economy)&lt;/li>
&lt;li>Public investment per capita (2015, Ministry of Economy)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="-limitations-and-future-work">⚠️ Limitations and Future Work&lt;/h2>
&lt;ul>
&lt;li>No disaggregated data for Indigenous Territories (TIOC)&lt;/li>
&lt;li>Many indicators based on 2012 Census; updates pending&lt;/li>
&lt;li>Limited information for Goals 12 and 14 at municipal level&lt;/li>
&lt;li>No indicators for educational quality (due to lack of standardized testing)&lt;/li>
&lt;li>Gender violence data unavailable at municipal scale&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="-access">🔗 Access&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Original website&lt;/strong>: &lt;a href="http://atlas.sdsnbolivia.org" target="_blank" rel="noopener">atlas.sdsnbolivia.org&lt;/a>&lt;/li>
&lt;li>&lt;strong>Original Publication&lt;/strong>: &lt;a href="http://www.sdsnbolivia.org/Atlas" target="_blank" rel="noopener">sdsnbolivia.org/Atlas&lt;/a>&lt;/li>
&lt;li>&lt;strong>Source Code of the Web App&lt;/strong>: &lt;a href="https://github.com/cmg777/streamlit_esda101" target="_blank" rel="noopener">github.com/cmg777/streamlit_esda101&lt;/a>&lt;/li>
&lt;li>&lt;strong>Computational Notebook&lt;/strong>: &lt;a href="https://colab.research.google.com/drive/1JHf8wPxSxBdKKhXaKQZUzhEpVznKGiep?usp=sharing" target="_blank" rel="noopener">Google Colab&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Studying spatial heterogeneity</title><link>https://carlos-mendez.org/post/python_gwr_mgwr/</link><pubDate>Sat, 23 Dec 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_gwr_mgwr/</guid><description>&lt;h1 id="a-geocomputational-notebook-to-compute-gwr-and-mgwr">&lt;strong>A geocomputational notebook to compute GWR and MGWR&lt;/strong>&lt;/h1>
&lt;p>.&lt;/p></description></item><item><title>Construct and export spatial connectivity structures (W)</title><link>https://carlos-mendez.org/post/python_how_to_build_w/</link><pubDate>Sat, 02 Dec 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_how_to_build_w/</guid><description>&lt;p>.&lt;/p></description></item><item><title>Cross-Sectional Spatial Regression in Stata: Crime in Columbus Neighborhoods</title><link>https://carlos-mendez.org/post/stata_sp_regression_cross_section/</link><pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_sp_regression_cross_section/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Crime does not stop at neighborhood boundaries. A neighborhood&amp;rsquo;s crime rate may depend not only on its own socioeconomic conditions but also on conditions in adjacent areas &amp;mdash; through spatial displacement (criminals move to easier targets nearby), diffusion (criminal networks operate across borders), and shared exposure to common risk factors. Standard regression models that treat each neighborhood as an independent observation miss these &lt;strong>spatial spillovers&lt;/strong>, potentially producing biased estimates of how income and housing values affect crime.&lt;/p>
&lt;p>This tutorial introduces the &lt;strong>complete taxonomy of cross-sectional spatial regression models&lt;/strong> &amp;mdash; from a simple OLS baseline through the most general GNS (General Nesting Spatial) specification. Using the classic Columbus crime dataset, we progressively estimate eight models: OLS, SAR, SEM, SLX, SDM, SDEM, SAC, and GNS. Each model captures spatial dependence through a different combination of three channels: the spatial lag of the dependent variable ($\rho Wy$), the spatial lag of the explanatory variables ($WX\theta$), and the spatial lag of the error term ($\lambda Wu$). We use &lt;strong>specification tests&lt;/strong> from the SDM to determine which simpler model the data supports, and compare all models using log-likelihoods and direct/indirect effect decompositions, following Elhorst (2014, Chapter 2).&lt;/p>
&lt;p>The Columbus crime dataset contains 49 neighborhoods in Columbus, Ohio, with data on residential burglaries and vehicle thefts per 1,000 households (CRIME), household income in \$1,000 (INC), and housing value in \$1,000 (HOVAL). The spatial weight matrix is a Queen contiguity matrix &amp;mdash; two neighborhoods are neighbors if they share a common border or vertex &amp;mdash; row-standardized so that the spatial lag of a variable equals the weighted average among a neighborhood&amp;rsquo;s neighbors. All estimation uses Stata&amp;rsquo;s official &lt;code>spregress&lt;/code> command (available since Stata 15), which implements maximum likelihood estimation for the full family of cross-sectional spatial models.&lt;/p>
&lt;blockquote>
&lt;p>Mendez, C. (2021). &lt;em>Spatial econometrics for cross-sectional data in Stata.&lt;/em> DOI: &lt;a href="https://doi.org/10.5281/zenodo.5151076" target="_blank" rel="noopener">10.5281/zenodo.5151076&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Construct and load a Queen contiguity spatial weight matrix in Stata using &lt;code>spmatrix fromdata&lt;/code>&lt;/li>
&lt;li>Compute spatial lags of explanatory variables ($WX$) manually using Mata&lt;/li>
&lt;li>Test for spatial autocorrelation using Moran&amp;rsquo;s I and LM tests&lt;/li>
&lt;li>Estimate the full taxonomy of spatial models (SAR, SEM, SLX, SDM, SDEM, SAC, GNS) using &lt;code>spregress&lt;/code>&lt;/li>
&lt;li>Decompose coefficient estimates into direct, indirect (spillover), and total effects using &lt;code>estat impact&lt;/code>&lt;/li>
&lt;li>Use specification tests to determine whether the SDM simplifies to SAR, SLX, or SEM&lt;/li>
&lt;li>Compare models and identify the SDM and SDEM as preferred specifications following Elhorst (2014)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-the-spatial-model-taxonomy">2. The spatial model taxonomy&lt;/h2>
&lt;p>The eight models in this tutorial form a nested hierarchy. At the top sits the &lt;strong>GNS&lt;/strong> (General Nesting Spatial) model, which includes all three spatial channels simultaneously. Each intermediate model imposes one or more restrictions, and OLS sits at the bottom with no spatial terms at all. Understanding this nesting structure is essential for model selection &amp;mdash; we estimate from the general to the specific, using statistical tests to determine whether restrictions are warranted.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
GNS[&amp;quot;&amp;lt;b&amp;gt;GNS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + WXθ + u&amp;lt;br/&amp;gt;u = λWu + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Most general&amp;lt;/i&amp;gt;&amp;quot;]
SDM[&amp;quot;&amp;lt;b&amp;gt;SDM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + WXθ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;λ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SDEM[&amp;quot;&amp;lt;b&amp;gt;SDEM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + WXθ + u&amp;lt;br/&amp;gt;u = λWu + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;ρ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SAC[&amp;quot;&amp;lt;b&amp;gt;SAC&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + u&amp;lt;br/&amp;gt;u = λWu + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;θ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SAR[&amp;quot;&amp;lt;b&amp;gt;SAR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = ρWy + Xβ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;λ = 0, θ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SEM[&amp;quot;&amp;lt;b&amp;gt;SEM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + u&amp;lt;br/&amp;gt;u = λWu + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;ρ = 0, θ = 0&amp;lt;/i&amp;gt;&amp;quot;]
SLX[&amp;quot;&amp;lt;b&amp;gt;SLX&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + WXθ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;ρ = 0, λ = 0&amp;lt;/i&amp;gt;&amp;quot;]
OLS[&amp;quot;&amp;lt;b&amp;gt;OLS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;y = Xβ + ε&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;ρ = 0, θ = 0, λ = 0&amp;lt;/i&amp;gt;&amp;quot;]
GNS --&amp;gt; SDM
GNS --&amp;gt; SDEM
GNS --&amp;gt; SAC
SDM --&amp;gt; SAR
SDM --&amp;gt; SLX
SDEM --&amp;gt; SLX
SDEM --&amp;gt; SEM
SAC --&amp;gt; SAR
SAC --&amp;gt; SEM
SAR --&amp;gt; OLS
SEM --&amp;gt; OLS
SLX --&amp;gt; OLS
style GNS fill:#141413,stroke:#d97757,color:#fff
style SDM fill:#00d4c8,stroke:#141413,color:#141413
style SDEM fill:#6a9bcc,stroke:#141413,color:#fff
style SAC fill:#6a9bcc,stroke:#141413,color:#fff
style SAR fill:#d97757,stroke:#141413,color:#fff
style SEM fill:#d97757,stroke:#141413,color:#fff
style SLX fill:#d97757,stroke:#141413,color:#fff
style OLS fill:#141413,stroke:#6a9bcc,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The diagram shows three spatial channels and their corresponding parameters: $\rho$ (spatial lag of $y$), $\theta$ (spatial lag of $X$), and $\lambda$ (spatial lag of the error). Setting any of these to zero yields a nested model. The SDM is often the starting point for model selection because it nests the three most common models &amp;mdash; SAR, SLX, and SEM &amp;mdash; and the restrictions can be tested with standard Wald tests.&lt;/p>
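The nesting relationships in the diagram can be expressed compactly: each model is identified by which of the three parameters (rho, theta, lambda) it leaves unrestricted, and a general model nests a restricted one when the restricted model's active parameters form a subset of the general model's. A minimal Python sketch of this logic (illustrative only, not part of the original tutorial):

```python
# Each model maps to the set of spatial parameters it activates
# (rho: spatial lag of y, theta: spatial lag of X, lambda: spatial error)
models = {
    "GNS":  {"rho", "theta", "lambda"},   # all three channels
    "SDM":  {"rho", "theta"},             # lambda = 0
    "SDEM": {"theta", "lambda"},          # rho = 0
    "SAC":  {"rho", "lambda"},            # theta = 0
    "SAR":  {"rho"},
    "SLX":  {"theta"},
    "SEM":  {"lambda"},
    "OLS":  set(),
}

def nests(general, restricted):
    """A general model nests a restricted one if the restricted model's
    active parameters are a subset of the general model's."""
    return models[restricted].issubset(models[general])

print(nests("SDM", "SAR"), nests("SDM", "SLX"), nests("SAR", "SLX"))
# prints: True True False  (SAR and SLX are non-nested alternatives)
```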
&lt;hr>
&lt;h2 id="3-setup-and-data-loading">3. Setup and data loading&lt;/h2>
&lt;p>Before running any spatial models, we need the &lt;code>estout&lt;/code> package for table output and the &lt;code>spatwmat&lt;/code>/&lt;code>spatdiag&lt;/code> packages for LM diagnostic tests. If you have not installed them, uncomment the &lt;code>ssc install&lt;/code> and &lt;code>net install&lt;/code> lines below.&lt;/p>
&lt;pre>&lt;code class="language-stata">clear all
macro drop _all
set more off
* Install packages (uncomment if needed)
*ssc install estout, replace
*net install st0085_2, from(http://www.stata-journal.com/software/sj14-2)
&lt;/code>&lt;/pre>
&lt;h3 id="31-spatial-weight-matrix">3.1 Spatial weight matrix&lt;/h3>
&lt;p>The spatial weight matrix &lt;strong>W&lt;/strong> defines the neighborhood structure among the 49 Columbus neighborhoods. We use a Queen contiguity matrix where two neighborhoods are neighbors if they share a common border or vertex. The matrix is stored in a &lt;code>.dta&lt;/code> file and converted to an &lt;code>spmatrix&lt;/code> object with row-standardization &amp;mdash; meaning that each row sums to one, so the spatial lag of a variable equals the &lt;strong>weighted average&lt;/strong> among a neighborhood&amp;rsquo;s neighbors.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load Queen contiguity W matrix
use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/Columbus/columbus/Wqueen_fromStata_spmat.dta&amp;quot;, clear
gen id = _n
order id, first
spset id
spmatrix fromdata W = v*, normalize(row) replace
spmatrix summarize W
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial-weighting matrix W
Dimensions: 49 x 49
Stored type: dense
Normalization: row
Summary statistics
-------------------------------------------
Min Mean Max N
-------------------------------------------
Nonzero .0625 .2049 .5000 236
All .0000 .0042 .5000 2401
-------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>spmatrix fromdata&lt;/code> command reads the columns of the loaded dataset and stores them as a spatial weight matrix object named &lt;code>W&lt;/code>. The &lt;code>normalize(row)&lt;/code> option applies row-standardization, and &lt;code>replace&lt;/code> overwrites any existing matrix with the same name. The matrix has 236 nonzero entries out of 2,401 total cells, meaning the average neighborhood has approximately $236 / 49 \approx 4.8$ neighbors.&lt;/p>
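The effect of row-standardization is easy to see on a toy matrix. A minimal Python sketch using a hypothetical 4-unit contiguity structure (not the Columbus matrix):

```python
import numpy as np

# Toy binary contiguity matrix for 4 hypothetical neighborhoods
# (1 = neighbors, 0 = otherwise; zero diagonal)
C = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# Row-standardize, as normalize(row) does: every row then sums to one
W = C / C.sum(axis=1, keepdims=True)

# With a row-standardized W, the spatial lag W @ x is the weighted
# average of x among each unit's neighbors
x = np.array([10.0, 20.0, 30.0, 40.0])
print(W @ x)   # unit 0 has neighbors 1 and 2, so its lag is (20 + 30) / 2 = 25
```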
&lt;blockquote>
&lt;p>&lt;strong>Note:&lt;/strong> The companion &lt;code>analysis.do&lt;/code> file uses the longer name &lt;code>WqueenS_fromStata15&lt;/code> for the spatial weight matrix to match the original Colab notebook. In this tutorial, we use the shorter name &lt;code>W&lt;/code> for readability. Both names are interchangeable &amp;mdash; only the name passed to &lt;code>spmatrix fromdata&lt;/code> matters.&lt;/p>
&lt;/blockquote>
&lt;h3 id="32-generating-spatial-lags-of-x">3.2 Generating spatial lags of X&lt;/h3>
&lt;p>Before loading the crime data, we pre-compute the spatial lags of the explanatory variables ($W \cdot INC$ and $W \cdot HOVAL$) using Mata. These spatial lags represent each neighborhood&amp;rsquo;s &lt;strong>neighbors' average&lt;/strong> income and housing value, and will be used as explicit regressors in the SLX, SDM, SDEM, and GNS models.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load data and generate spatial lags of X manually
use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/Columbus/columbus/columbusDbase.dta&amp;quot;, clear
spset id
label var CRIME &amp;quot;Crime&amp;quot;
label var INC &amp;quot;Income&amp;quot;
label var HOVAL &amp;quot;House value&amp;quot;
* Compute W*X using Mata (bypasses spregress ivarlag)
mata: spmatrix_matafromsp(W_mata, id_vec, &amp;quot;W&amp;quot;)
mata: st_view(inc=., ., &amp;quot;INC&amp;quot;)
mata: st_view(hoval=., ., &amp;quot;HOVAL&amp;quot;)
gen double W_INC = .
gen double W_HOVAL = .
mata: st_store(., &amp;quot;W_INC&amp;quot;, W_mata * inc)
mata: st_store(., &amp;quot;W_HOVAL&amp;quot;, W_mata * hoval)
label var W_INC &amp;quot;W * Income&amp;quot;
label var W_HOVAL &amp;quot;W * House value&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Why compute W*X manually?&lt;/strong> Stata&amp;rsquo;s &lt;code>spregress&lt;/code> command provides the &lt;code>ivarlag()&lt;/code> option to include spatial lags of explanatory variables. However, this option may produce incorrect coefficient signs in some Stata versions. Computing $WX$ explicitly using Mata and including the result as a regular regressor is more transparent and produces results consistent with Elhorst (2014) and PySAL&amp;rsquo;s &lt;code>spreg&lt;/code> package.&lt;/p>
&lt;/blockquote>
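The Mata step above has a direct linear-algebra analogue: multiply the row-standardized W by each column of X and append the products as regressors. A hedged Python sketch with synthetic data (the variable names mirror the tutorial, but the numbers and the ring-lattice W are made up; the real values live in columbusDbase.dta):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 49

# Synthetic stand-ins for INC and HOVAL, drawn over the observed ranges
INC = rng.uniform(3.7, 27.9, n)
HOVAL = rng.uniform(5.0, 96.4, n)

# Synthetic row-standardized W: a ring lattice where each unit has two neighbors
C = np.zeros((n, n))
idx = np.arange(n)
C[idx, (idx + 1) % n] = 1.0
C[(idx + 1) % n, idx] = 1.0
W = C / C.sum(axis=1, keepdims=True)

# Analogue of the Mata step: W_INC = W * INC, W_HOVAL = W * HOVAL
W_INC = W @ INC
W_HOVAL = W @ HOVAL

# Each lagged value is a neighbors' average, so it stays within the
# range of the original variable
assert W_INC.min() >= INC.min() and INC.max() >= W_INC.max()
```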
&lt;h3 id="33-summary-statistics">3.3 Summary statistics&lt;/h3>
&lt;pre>&lt;code class="language-stata">summarize CRIME INC HOVAL
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
CRIME | 49 35.1288 16.5647 .1783 68.8920
INC | 49 14.3765 5.7575 3.7240 27.8966
HOVAL | 49 38.4362 18.4661 5.0000 96.4000
&lt;/code>&lt;/pre>
&lt;h3 id="34-variables">3.4 Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Mean&lt;/th>
&lt;th>Std. Dev.&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>CRIME&lt;/code>&lt;/td>
&lt;td>Residential burglaries and vehicle thefts per 1,000 households&lt;/td>
&lt;td>35.13&lt;/td>
&lt;td>16.56&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>INC&lt;/code>&lt;/td>
&lt;td>Household income (\$1,000)&lt;/td>
&lt;td>14.38&lt;/td>
&lt;td>5.76&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>HOVAL&lt;/code>&lt;/td>
&lt;td>Housing value (\$1,000)&lt;/td>
&lt;td>38.44&lt;/td>
&lt;td>18.47&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mean crime is 35.13 incidents per 1,000 households, with substantial variation across neighborhoods (standard deviation of 16.56, ranging from near zero to 68.89). Mean household income is \$14,380 and mean housing value is \$38,440. The wide range of both income (\$3,724 to \$27,897) and housing value (\$5,000 to \$96,400) reflects the considerable socioeconomic heterogeneity across Columbus neighborhoods, providing sufficient variation to estimate the effects of these variables on crime.&lt;/p>
&lt;hr>
&lt;h2 id="4-ols-baseline-and-spatial-diagnostics">4. OLS baseline and spatial diagnostics&lt;/h2>
&lt;h3 id="41-ols-regression">4.1 OLS regression&lt;/h3>
&lt;p>Before introducing any spatial structure, we estimate a standard OLS regression of crime on income and housing value. This provides a non-spatial benchmark against which all subsequent models will be compared.&lt;/p>
&lt;pre>&lt;code class="language-stata">regress CRIME INC HOVAL
eststo OLS
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Source | SS df MS Number of obs = 49
-------------+---------------------------------- F(2, 46) = 28.39
Model | 5765.1588 2 2882.5794 Prob &amp;gt; F = 0.0000
Residual | 4670.9753 46 101.5429 R-squared = 0.5524
-------------+---------------------------------- Adj R-squared = 0.5330
Total | 10436.1341 48 217.4194 Root MSE = 10.0769
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
INC | -1.5973 .3341 -4.78 0.000 -2.2699 -.9247
HOVAL | -0.2739 .1032 -2.65 0.011 -0.4817 -.0661
_cons | 68.6190 4.7355 14.49 0.000 59.0876 78.1504
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>OLS estimates that each additional \$1,000 in household income is associated with a reduction of &lt;strong>1.60 crimes&lt;/strong> per 1,000 households, and each additional \$1,000 in housing value is associated with a reduction of &lt;strong>0.27 crimes&lt;/strong>. Both coefficients are statistically significant, and the model explains about &lt;strong>55%&lt;/strong> of the variation in crime rates across neighborhoods (R-squared = 0.552). The intercept of 68.62 represents the predicted crime rate for a hypothetical neighborhood with zero income and zero housing value. However, OLS assumes that crime in one neighborhood is independent of conditions in adjacent neighborhoods &amp;mdash; an assumption we now test directly.&lt;/p>
&lt;h3 id="42-morans-i-test">4.2 Moran&amp;rsquo;s I test&lt;/h3>
&lt;p>Moran&amp;rsquo;s I is the most widely used test for spatial autocorrelation. Applied to OLS residuals, it tests whether the residuals in nearby neighborhoods are more similar (positive spatial autocorrelation) or more dissimilar (negative spatial autocorrelation) than expected under spatial independence. The test statistic is:&lt;/p>
&lt;p>$$I = \frac{N}{S_0} \cdot \frac{e' W e}{e' e}$$&lt;/p>
&lt;p>where $e$ is the vector of OLS residuals, $W$ is the row-standardized spatial weight matrix, $N$ is the number of observations, and $S_0$ is the sum of all elements of $W$. Under the null hypothesis of no spatial autocorrelation, $I$ follows an approximately standard normal distribution after standardization.&lt;/p>
&lt;pre>&lt;code class="language-stata">regress CRIME INC HOVAL
estat moran, errorlag(W)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Moran test for spatial autocorrelation in the error
H0: Error is i.i.d.
I = 0.2222
E(I) = -0.0208
Mean = -0.0208
Sd(I) = 0.0856
z = 2.8391
p-value = 0.0045
&lt;/code>&lt;/pre>
&lt;p>Moran&amp;rsquo;s I is &lt;strong>0.222&lt;/strong> with a z-statistic of &lt;strong>2.84&lt;/strong> (p = 0.005), providing strong evidence of &lt;strong>positive spatial autocorrelation&lt;/strong> in the OLS residuals. Neighborhoods with high unexplained crime tend to cluster near other neighborhoods with high unexplained crime, and vice versa. This violates the OLS assumption of independent errors and motivates the use of spatial regression models. The positive sign of Moran&amp;rsquo;s I is consistent with crime diffusion &amp;mdash; criminal activity in one neighborhood spills over into adjacent areas.&lt;/p>
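The formula for Moran's I translates directly into code. A minimal Python sketch on synthetic residuals (illustrative values only; Stata's estat moran additionally standardizes I by its analytical mean and variance to obtain the z-statistic reported above):

```python
import numpy as np

def morans_i(e, W):
    # I = (N / S0) * (e' W e) / (e' e), with S0 the sum of all weights
    N = len(e)
    S0 = W.sum()          # S0 = N when W is row-standardized
    return (N / S0) * (e @ W @ e) / (e @ e)

# Row-standardized ring-lattice W for 49 synthetic units
n = 49
C = np.zeros((n, n))
idx = np.arange(n)
C[idx, (idx + 1) % n] = 1.0
C[(idx + 1) % n, idx] = 1.0
W = C / C.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
noise = rng.normal(size=n)                    # i.i.d. residuals: I near E(I) = -1/(N-1)
smooth = noise + W @ noise + W @ (W @ noise)  # neighbor averaging induces positive I
print(morans_i(noise, W), morans_i(smooth, W))
```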
&lt;h3 id="43-lm-tests-for-spatial-specification">4.3 LM tests for spatial specification&lt;/h3>
&lt;p>While Moran&amp;rsquo;s I confirms the presence of spatial autocorrelation, it does not indicate the &lt;strong>form&lt;/strong> of the spatial dependence. The Lagrange Multiplier (LM) tests proposed by Anselin (1988) test separately for the spatial lag ($\rho Wy$) and spatial error ($\lambda Wu$) specifications. The robust versions of these tests remain valid even when the alternative specification is also present.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Create compatible W matrix for spatdiag
spatwmat using &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/Columbus/columbus/Wqueen_fromStata_spmat.dta&amp;quot;, ///
name(Wcompat) eigenval(eWcompat) standardize
quietly regress CRIME INC HOVAL
spatdiag, weights(Wcompat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial error:
Moran's I = 0.2055 Prob = 0.0068
Lagrange multiplier = 5.3282 Prob = 0.0210
Robust LM = 2.1901 Prob = 0.1389
Spatial lag:
Lagrange multiplier = 3.3954 Prob = 0.0654
Robust LM = 0.2572 Prob = 0.6121
&lt;/code>&lt;/pre>
&lt;p>The standard LM test for the spatial error ($\lambda$) is significant at the 5% level (LM = &lt;strong>5.33&lt;/strong>, p = 0.021), while the standard LM test for the spatial lag ($\rho$) is marginally significant at the 10% level (LM = &lt;strong>3.40&lt;/strong>, p = 0.065). The robust tests provide further guidance: the robust LM-error is &lt;strong>2.19&lt;/strong> (p = 0.139) and the robust LM-lag is only &lt;strong>0.26&lt;/strong> (p = 0.612).&lt;/p>
&lt;p>Following the Anselin (2005) decision rule &amp;mdash; compare the standard LM tests first, then use the robust tests to break ties &amp;mdash; the evidence favors the &lt;strong>SEM&lt;/strong> specification. The standard LM-error is larger and more significant than the standard LM-lag, and the robust LM-error remains larger than the robust LM-lag. The decision tree below summarizes this logic. However, as we will see, the full model taxonomy reveals a more nuanced picture.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
MI[&amp;quot;&amp;lt;b&amp;gt;Moran's I&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;I = 0.222, p = 0.005&amp;lt;br/&amp;gt;Significant&amp;quot;]
LM[&amp;quot;&amp;lt;b&amp;gt;Standard LM Tests&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;LM-error = 5.33 (p = 0.021)&amp;lt;br/&amp;gt;LM-lag = 3.40 (p = 0.065)&amp;quot;]
RLM[&amp;quot;&amp;lt;b&amp;gt;Robust LM Tests&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Robust LM-error = 2.19&amp;lt;br/&amp;gt;Robust LM-lag = 0.26&amp;quot;]
SEM_d[&amp;quot;&amp;lt;b&amp;gt;SEM Preferred&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Error specification&amp;lt;br/&amp;gt;dominates&amp;quot;]
MI --&amp;gt;|&amp;quot;Spatial dependence?&amp;quot;| LM
LM --&amp;gt;|&amp;quot;Both significant?&amp;quot;| RLM
RLM --&amp;gt;|&amp;quot;Error &amp;gt; Lag&amp;quot;| SEM_d
style MI fill:#6a9bcc,stroke:#141413,color:#fff
style LM fill:#d97757,stroke:#141413,color:#fff
style RLM fill:#00d4c8,stroke:#141413,color:#141413
style SEM_d fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
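The tie-breaking logic in the decision tree can be written out explicitly. A hedged Python sketch of the decision rule as described above (the function name and the 5% threshold are illustrative choices):

```python
def anselin_decision(lm_err_p, lm_lag_p, rlm_err_p, rlm_lag_p, alpha=0.05):
    """Sketch of the Anselin decision rule: compare the standard LM tests
    first, then use the robust tests to break ties."""
    err_sig = alpha > lm_err_p
    lag_sig = alpha > lm_lag_p
    if not err_sig and not lag_sig:
        return "OLS"          # no spatial dependence detected
    if err_sig and not lag_sig:
        return "SEM"
    if lag_sig and not err_sig:
        return "SAR"
    # Both standard tests significant: let the robust tests break the tie
    return "SEM" if rlm_lag_p > rlm_err_p else "SAR"

# p-values from the spatdiag output above
print(anselin_decision(0.0210, 0.0654, 0.1389, 0.6121))   # prints: SEM
```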
&lt;hr>
&lt;h2 id="5-first-generation-spatial-models">5. First-generation spatial models&lt;/h2>
&lt;h3 id="51-sar-spatial-autoregressive--spatial-lag">5.1 SAR (Spatial Autoregressive / Spatial Lag)&lt;/h3>
&lt;p>The SAR model adds a spatial lag of the dependent variable to the OLS specification. It assumes that crime in a neighborhood depends directly on the crime rate in adjacent neighborhoods &amp;mdash; a &amp;ldquo;contagion&amp;rdquo; or &amp;ldquo;diffusion&amp;rdquo; channel where high crime in one area breeds crime in neighboring areas.&lt;/p>
&lt;p>$$y = \rho W y + X \beta + \varepsilon$$&lt;/p>
&lt;p>The parameter $\rho$ measures the strength of this spatial feedback. Because $Wy$ is endogenous (it depends on $y$, which depends on $\varepsilon$), OLS estimation would be inconsistent. We use maximum likelihood estimation via &lt;code>spregress&lt;/code>.&lt;/p>
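The endogeneity of $Wy$ is easiest to see through the reduced form $y = (I - \rho W)^{-1}(X\beta + \varepsilon)$: the inverse spreads every error term across all units, so $Wy$ is correlated with $\varepsilon$. A minimal Python simulation of this data-generating process (synthetic ring-lattice W and made-up parameter values, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Row-standardized ring-lattice W (synthetic stand-in for the Columbus matrix)
C = np.zeros((n, n))
idx = np.arange(n)
C[idx, (idx + 1) % n] = 1.0
C[(idx + 1) % n, idx] = 1.0
W = C / 2.0

rho, beta = 0.5, 1.0      # hypothetical parameter values
X = rng.normal(size=n)
eps = rng.normal(size=n)

# Reduced form of the SAR model: y = (I - rho W)^{-1} (X beta + eps),
# which shows that Wy depends on eps and is therefore endogenous
y = np.linalg.solve(np.eye(n) - rho * W, X * beta + eps)

# Sanity check: y satisfies the structural equation y = rho Wy + X beta + eps
assert np.allclose(y, rho * (W @ y) + X * beta + eps)
```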
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL, ml dvarlag(W)
eststo SAR
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial autoregressive model Number of obs = 49
Maximum likelihood estimates Wald chi2(2) = 54.83
Prob &amp;gt; chi2 = 0.0000
Log-likelihood = -184.926 Pseudo R2 = 0.5830
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -1.0312 .3359 -3.07 0.002 -1.6897 -.3728
HOVAL | -0.2654 .0922 -2.88 0.004 -0.4461 -.0847
_cons | 45.0719 7.8406 5.75 0.000 29.7046 60.4392
-------------+----------------------------------------------------------------
W |
CRIME | 0.4283 .1228 3.49 0.000 0.1875 0.6690
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatial autoregressive parameter $\rho$ is &lt;strong>0.428&lt;/strong> (z = 3.49, p &amp;lt; 0.001), indicating substantial positive spatial dependence. After accounting for the spatial lag, the own income coefficient drops to &lt;strong>-1.03&lt;/strong> (from -1.60 in OLS), while the housing value coefficient remains similar at &lt;strong>-0.27&lt;/strong>. The reduction in the income coefficient suggests that part of what OLS attributed to income was actually capturing spatial spillover effects that are now absorbed by $\rho$.&lt;/p>
&lt;p>However, the raw coefficients in the SAR model do not have the same interpretation as OLS coefficients because the spatial lag creates a &lt;strong>feedback loop&lt;/strong>: a change in income in one neighborhood affects its crime, which affects its neighbors' crime, which feeds back to the original neighborhood. The proper interpretation requires decomposing effects into direct, indirect, and total components.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat impact
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Coefficient Std. err. z P&amp;gt;|z|
-------------------------------------------------------------------
INC
Direct | -1.1024 .3486 -3.16 0.002
Indirect | -0.7594 .3712 -2.05 0.041
Total | -1.8618 .5803 -3.21 0.001
-------------------------------------------------------------------
HOVAL
Direct | -0.2838 .0983 -2.89 0.004
Indirect | -0.1954 .1123 -1.74 0.082
Total | -0.4792 .1722 -2.78 0.005
-------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The &lt;strong>direct effect&lt;/strong> of income is -1.10, meaning that a \$1,000 increase in a neighborhood&amp;rsquo;s own income reduces its crime by 1.10 incidents per 1,000 households. The &lt;strong>indirect (spillover) effect&lt;/strong> is -0.76 and statistically significant (p = 0.041), meaning that when all neighboring neighborhoods experience a \$1,000 income increase, the focal neighborhood&amp;rsquo;s crime drops by an additional 0.76 incidents through the spatial feedback channel. The &lt;strong>total effect&lt;/strong> of income is -1.86, larger than the OLS estimate of -1.60, revealing that OLS understates the total impact of income on crime. However, a key limitation of the SAR is that the ratio between the indirect and direct effect is the same for every variable (here $\approx 0.69$ for both INC and HOVAL, close to $\rho/(1-\rho) \approx 0.75$), which may be overly restrictive.&lt;/p>
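&lt;p>The decomposition that &lt;code>estat impact&lt;/code> reports can be reproduced by hand from the reduced form $\partial y / \partial x_k' = (I - \rho W)^{-1} \beta_k$: the average diagonal of this matrix is the direct effect and the average off-diagonal row sum is the indirect effect. A sketch in illustrative Python (with a hypothetical 4-region $W$; reproducing the numbers above would require the 49-region Columbus matrix) also makes the common-ratio restriction visible:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical row-standardized contiguity matrix for 4 regions
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

rho = 0.428  # SAR estimate reported above
M = np.linalg.inv(np.eye(4) - rho * W)  # spatial multiplier

def decompose(beta):
    S = M * beta                      # matrix of partial effects
    direct = np.diag(S).mean()        # average own-region effect
    indirect = S.sum() / 4 - direct   # average spillover effect
    return direct, indirect

d_inc, i_inc = decompose(-1.0312)   # INC coefficient
d_hov, i_hov = decompose(-0.2654)   # HOVAL coefficient

# The indirect/direct ratio is identical for both variables by construction
print(i_inc / d_inc, i_hov / d_hov)
&lt;/code>&lt;/pre>
&lt;p>Because $\beta_k$ enters only as a scalar multiple of the same multiplier matrix, every regressor inherits the same indirect-to-direct ratio, which is exactly the restriction noted above.&lt;/p>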
&lt;h3 id="52-sem-spatial-error-model">5.2 SEM (Spatial Error Model)&lt;/h3>
&lt;p>The SEM assumes that spatial dependence operates through the error term rather than through a direct contagion channel. Spatially correlated unobservable factors &amp;mdash; such as local policing strategies, community organizations, or land use patterns &amp;mdash; generate correlated residuals across adjacent neighborhoods.&lt;/p>
&lt;p>$$y = X \beta + u, \quad u = \lambda W u + \varepsilon$$&lt;/p>
&lt;p>The parameter $\lambda$ measures the degree of spatial autocorrelation in the error term. Unlike the SAR, the SEM does not produce indirect (spillover) effects &amp;mdash; the spatial dependence is treated as a nuisance rather than a substantive economic channel.&lt;/p>
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL, ml errorlag(W)
eststo SEM
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial error model Number of obs = 49
Maximum likelihood estimates Wald chi2(2) = 50.51
Prob &amp;gt; chi2 = 0.0000
Log-likelihood = -184.379 Pseudo R2 = 0.5877
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -0.9376 .3393 -2.76 0.006 -1.6027 -.2726
HOVAL | -0.3023 .0909 -3.32 0.001 -0.4805 -.1241
_cons | 59.6228 5.4722 10.90 0.000 48.8975 70.3481
-------------+----------------------------------------------------------------
W |
lambda | 0.5623 .1330 4.23 0.000 0.3017 0.8230
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatial error parameter $\lambda$ is &lt;strong>0.562&lt;/strong> (z = 4.23, p &amp;lt; 0.001), confirming substantial spatial autocorrelation in the unobservables. The income coefficient is &lt;strong>-0.94&lt;/strong>, further attenuated from the OLS estimate, and the housing value coefficient is &lt;strong>-0.30&lt;/strong>, slightly larger in magnitude than OLS. The log-likelihood of -184.38 is higher than OLS (-187.38), confirming the spatial error structure improves fit.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat impact
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Coefficient Std. err. z P&amp;gt;|z|
-------------------------------------------------------------------
INC
Direct | -0.9376 .3393 -2.76 0.006
Indirect | 0.0000 . . .
Total | -0.9376 .3393 -2.76 0.006
-------------------------------------------------------------------
HOVAL
Direct | -0.3023 .0909 -3.32 0.001
Indirect | 0.0000 . . .
Total | -0.3023 .0909 -3.32 0.001
-------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>As expected, the SEM produces &lt;strong>zero indirect effects&lt;/strong> by construction. In the SEM, spatial dependence is a nuisance in the error term, not a substantive spillover channel. The direct and total effects are identical. If one believes that crime spillovers are substantively important &amp;mdash; for example, through displacement or diffusion &amp;mdash; the SEM&amp;rsquo;s assumption that all spatial dependence is in the errors is overly restrictive. As we will see in Sections 6 and 8, models that include $WX\theta$ terms reveal a significant negative spillover of neighbors' income on crime, which the SEM cannot detect.&lt;/p>
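&lt;p>The zero spillover follows directly from the SEM&amp;rsquo;s reduced form: the spatial filter applies only to the disturbance, so it drops out of the conditional mean.&lt;/p>
&lt;p>$$y = X\beta + (I - \lambda W)^{-1}\varepsilon \quad \Rightarrow \quad \frac{\partial E[y]}{\partial x_k'} = I\beta_k$$&lt;/p>
&lt;p>The effect matrix $I\beta_k$ is diagonal, so the average direct effect is $\beta_k$ and all indirect effects are exactly zero, matching the &lt;code>estat impact&lt;/code> output above.&lt;/p>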
&lt;hr>
&lt;h2 id="6-models-with-spatial-lags-of-x">6. Models with spatial lags of X&lt;/h2>
&lt;h3 id="61-slx-spatial-lag-of-x">6.1 SLX (Spatial Lag of X)&lt;/h3>
&lt;p>The SLX model includes spatial lags of the explanatory variables but no spatial lag of $y$ and no spatial error. It captures &lt;strong>local spillovers&lt;/strong> &amp;mdash; the idea that a neighborhood&amp;rsquo;s crime depends on its neighbors' income and housing values &amp;mdash; without the global feedback mechanism of the SAR.&lt;/p>
&lt;p>$$y = X \beta + W X \theta + \varepsilon$$&lt;/p>
&lt;p>The $\theta$ coefficients measure the direct impact of neighbors' characteristics on the focal neighborhood&amp;rsquo;s crime. Unlike the SAR, the SLX does not generate a spatial multiplier &amp;mdash; the spillover effects are localized to immediate neighbors. Since the SLX has no spatial autoregressive or error component, it can be estimated by OLS with the pre-computed $W \cdot INC$ and $W \cdot HOVAL$ variables as additional regressors.&lt;/p>
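&lt;p>The pre-computed spatial lags are just weighted averages of neighbors' values, one matrix-vector product per variable. A minimal sketch in illustrative Python (hypothetical 4-region matrix and made-up incomes; the post&amp;rsquo;s &lt;code>W_INC&lt;/code> and &lt;code>W_HOVAL&lt;/code> come from the 49-region Columbus matrix):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical row-standardized weight matrix for 4 regions
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

inc = np.array([10.0, 20.0, 30.0, 50.0])  # made-up income values

# Spatial lag: each element is the average income of that region's neighbors
W_inc = W.dot(inc)
print(W_inc)  # prints [25. 30. 30. 25.]
&lt;/code>&lt;/pre>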
&lt;pre>&lt;code class="language-stata">regress CRIME INC HOVAL W_INC W_HOVAL
eststo SLX
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Source | SS df MS Number of obs = 49
-------------+---------------------------------- F(4, 44) = 17.24
Model | 6373.4060 4 1593.35150 Prob &amp;gt; F = 0.0000
Residual | 4062.7281 44 92.33473 R-squared = 0.6105
-------------+---------------------------------- Adj R-squared = 0.5751
Total | 10436.1341 48 217.4194 Root MSE = 9.6090
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
INC | -1.0974 .3738 -2.94 0.005 -1.8509 -.3438
HOVAL | -0.2944 .1017 -2.90 0.006 -0.4993 -.0895
W_INC | -1.3987 .5601 -2.50 0.016 -2.5275 -.2700
W_HOVAL | 0.2148 .2079 1.03 0.307 -0.2045 0.6342
_cons | 74.5534 6.7156 11.10 0.000 61.0167 88.0901
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatial lag of income ($W \cdot INC$) is &lt;strong>-1.40&lt;/strong> and statistically significant (t = -2.50, p = 0.016), meaning that higher average income among a neighborhood&amp;rsquo;s neighbors is associated with &lt;strong>lower&lt;/strong> crime in the focal neighborhood. This is economically intuitive: neighborhoods surrounded by wealthier areas benefit from reduced crime, possibly through better public services, lower criminal opportunity, or social spillovers. The spatial lag of housing value ($W \cdot HOVAL$) is &lt;strong>+0.21&lt;/strong> but statistically insignificant (p = 0.307). The own-variable coefficients are INC at &lt;strong>-1.10&lt;/strong> and HOVAL at &lt;strong>-0.29&lt;/strong>, both highly significant. The log-likelihood of -184.0 is higher than OLS (-187.4), and the LR test of the SLX against OLS is 6.8 with 2 df (5% critical value 5.99), so OLS is rejected in favor of the SLX.&lt;/p>
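&lt;p>The LR statistic can be verified directly from the reported log-likelihoods (a quick check in illustrative Python; values copied from the outputs above):&lt;/p>
&lt;pre>&lt;code class="language-python"># Log-likelihoods reported above
ll_ols = -187.4   # restricted model (no spatial lags of X)
ll_slx = -184.0   # unrestricted model

# LR statistic: -2 times (restricted minus unrestricted)
lr = -2 * (ll_ols - ll_slx)
print(round(lr, 1))  # 6.8, which exceeds the 5% chi2 critical value of 5.99 (2 df)
&lt;/code>&lt;/pre>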
&lt;p>The direct and indirect effects in the SLX correspond directly to $\beta$ and $\theta$ because there is no spatial multiplier:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Direct&lt;/th>
&lt;th>Indirect&lt;/th>
&lt;th>Total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>INC&lt;/strong>&lt;/td>
&lt;td>-1.10***&lt;/td>
&lt;td>-1.40**&lt;/td>
&lt;td>-2.50***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>HOVAL&lt;/strong>&lt;/td>
&lt;td>-0.29***&lt;/td>
&lt;td>+0.21&lt;/td>
&lt;td>-0.08&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The total effect of income is &lt;strong>-2.50&lt;/strong>, much larger than the OLS estimate of -1.60, revealing that a substantial portion of the income effect operates through the neighbors' income channel. For housing value, the positive but insignificant indirect effect partially offsets the negative direct effect, suggesting that the crime-reducing effect of housing value is primarily a within-neighborhood phenomenon.&lt;/p>
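&lt;p>These averages follow immediately from the SLX effect matrix $\partial y / \partial x_k' = I\beta_k + W\theta_k$: with a row-standardized $W$, the average direct effect is exactly $\beta_k$ and the average indirect effect is exactly $\theta_k$. A quick check in illustrative Python (hypothetical small $W$; the identity holds for any row-standardized matrix):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical row-standardized weight matrix
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

beta_inc, theta_inc = -1.0974, -1.3987  # SLX estimates reported above

S = np.eye(4) * beta_inc + W * theta_inc  # effect matrix for INC
direct = np.diag(S).mean()        # equals beta_inc (W has a zero diagonal)
indirect = S.sum() / 4 - direct   # equals theta_inc (rows of W sum to one)
print(direct, indirect, direct + indirect)  # total is about -2.50
&lt;/code>&lt;/pre>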
&lt;h3 id="62-sdm-spatial-durbin-model">6.2 SDM (Spatial Durbin Model)&lt;/h3>
&lt;p>The SDM combines the spatial lag of $y$ from the SAR with the spatial lags of $X$ from the SLX. It is the most popular &amp;ldquo;general purpose&amp;rdquo; spatial model because it nests SAR, SLX, and SEM as special cases, enabling formal specification testing.&lt;/p>
&lt;p>$$y = \rho W y + X \beta + W X \theta + \varepsilon$$&lt;/p>
&lt;p>The SDM captures spillovers through two channels: a &lt;strong>global feedback&lt;/strong> channel ($\rho Wy$, where shocks propagate through the entire network) and a &lt;strong>local&lt;/strong> channel ($WX\theta$, where neighbors' characteristics directly affect local outcomes). We include $W \cdot INC$ and $W \cdot HOVAL$ as regular regressors alongside the spatial lag of crime.&lt;/p>
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL W_INC W_HOVAL, ml dvarlag(W)
eststo SDM
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial Durbin model Number of obs = 49
Maximum likelihood estimates Wald chi2(4) = 56.79
Prob &amp;gt; chi2 = 0.0000
Log-likelihood = -181.639 Pseudo R2 = 0.6037
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -0.9199 .3347 -2.75 0.006 -1.5758 -.2639
HOVAL | -0.2971 .0904 -3.29 0.001 -0.4742 -.1200
W_INC | -0.5839 .5742 -1.02 0.309 -1.7094 0.5415
W_HOVAL | 0.2577 .1872 1.38 0.169 -0.1092 0.6247
-------------+----------------------------------------------------------------
W |
CRIME | 0.4035 .1613 2.50 0.012 0.0873 0.7197
_cons | 44.3200 13.0455 3.40 0.001 18.7512 69.8888
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatial autoregressive parameter $\rho$ is &lt;strong>0.404&lt;/strong> (z = 2.50, p = 0.012), close to the SAR estimate. The own income coefficient is &lt;strong>-0.92&lt;/strong> and housing value is &lt;strong>-0.30&lt;/strong>. The spatial lag of income ($W \cdot INC = -0.58$) is negative but individually insignificant (p = 0.309), while the spatial lag of housing value ($W \cdot HOVAL = +0.26$) is positive and also insignificant (p = 0.169). Although the $\theta$ terms are individually insignificant, their joint significance is tested formally via the specification tests in Section 7.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat impact
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Coefficient Std. err. z P&amp;gt;|z|
-------------------------------------------------------------------
INC
Direct | -1.0250 .3350 -3.06 0.002
Indirect | -1.4959 .8060 -1.86 0.064
Total | -2.5209 .8820 -2.86 0.004
-------------------------------------------------------------------
HOVAL
Direct | -0.2820 .0900 -3.13 0.002
Indirect | 0.2158 .2990 0.72 0.470
Total | -0.0661 .3050 -0.22 0.828
-------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The direct effect of income is &lt;strong>-1.03&lt;/strong>, similar to the SAR. The indirect (spillover) effect of income is &lt;strong>-1.50&lt;/strong> and marginally significant (p = 0.064), much larger than in the SAR (-0.76), because the SDM accounts for both the spatial feedback channel ($\rho$) and the direct effect of neighbors' income ($\theta_{INC}$). The total effect of income is &lt;strong>-2.52&lt;/strong>, substantially larger than the SAR&amp;rsquo;s -1.86. For housing value, the indirect effect is &lt;strong>+0.22&lt;/strong> (insignificant), suggesting that neighbors' housing values do not generate meaningful crime spillovers once the global feedback is accounted for.&lt;/p>
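&lt;p>The SDM decomposition generalizes the SAR formula to $\partial y / \partial x_k' = (I - \rho W)^{-1}(I\beta_k + W\theta_k)$. A sketch in illustrative Python (hypothetical 4-region $W$, so the direct and indirect averages differ from the &lt;code>estat impact&lt;/code> values, which use the Columbus matrix; the average total effect, $(\beta_k + \theta_k)/(1 - \rho)$, does not depend on the particular row-standardized $W$):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical row-standardized weight matrix
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

rho = 0.4035  # SDM estimate reported above
M = np.linalg.inv(np.eye(4) - rho * W)

def decompose(beta, theta):
    S = M.dot(np.eye(4) * beta + W * theta)  # effect matrix
    direct = np.diag(S).mean()
    indirect = S.sum() / 4 - direct
    return direct, indirect

d_inc, i_inc = decompose(-0.9199, -0.5839)   # INC, W_INC estimates
d_hov, i_hov = decompose(-0.2971, 0.2577)    # HOVAL, W_HOVAL estimates

# Unlike the SAR, the indirect/direct ratio now differs across variables,
# while the average total effects match the estat impact output above
print(i_inc / d_inc, i_hov / d_hov)
print(d_inc + i_inc, d_hov + i_hov)  # about -2.52 and -0.07
&lt;/code>&lt;/pre>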
&lt;hr>
&lt;h2 id="7-specification-tests-from-sdm">7. Specification tests from SDM&lt;/h2>
&lt;p>The SDM nests SAR, SLX, and SEM as special cases. Before accepting the full SDM, we test whether the data supports simplifying to one of these more parsimonious specifications. We re-estimate the SDM and apply three tests. We use both &lt;strong>Wald tests&lt;/strong> (from the Stata estimation) and &lt;strong>LR tests&lt;/strong> (comparing log-likelihoods across models), following Elhorst (2014, Section 2.9).&lt;/p>
&lt;pre>&lt;code class="language-stata">quietly spregress CRIME INC HOVAL W_INC W_HOVAL, ml dvarlag(W)
&lt;/code>&lt;/pre>
&lt;h3 id="71-reduce-to-slx-test-rho--0">7.1 Reduce to SLX? (test $\rho = 0$)&lt;/h3>
&lt;p>The SLX model restricts $\rho = 0$ &amp;mdash; there is no spatial autoregressive feedback. Under SLX, neighbors' characteristics affect local crime directly, but there is no contagion through the spatial lag of crime itself.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SLX? (NO if p &amp;lt; 0.05)
test ([W]CRIME = 0)
&lt;/code>&lt;/pre>
&lt;p>The test &lt;strong>rejects&lt;/strong> the SLX restriction at the 1% level. The spatial autoregressive parameter $\rho$ is significantly different from zero, meaning that the global feedback channel is an important feature of the data. The LR test confirms this: $-2(\text{LogL}_{SLX} - \text{LogL}_{SDM}) \approx 7.4$ with 1 df (critical value 3.84). Dropping $\rho$ would misspecify the model.&lt;/p>
&lt;h3 id="72-reduce-to-sar-test-theta--0">7.2 Reduce to SAR? (test $\theta = 0$)&lt;/h3>
&lt;p>The SAR model restricts $\theta = 0$ &amp;mdash; the spatial lags of the explanatory variables are zero. Under SAR, only neighbors' crime levels matter, not their incomes or housing values directly.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SAR? (NO if p &amp;lt; 0.05)
test ([CRIME]W_INC = 0) ([CRIME]W_HOVAL = 0)
&lt;/code>&lt;/pre>
&lt;p>The test &lt;strong>fails to reject&lt;/strong> the SAR restriction. The spatial lags of income and housing value are jointly insignificant, suggesting that the SAR specification may be adequate. The LR test also fails to reject: $-2(\text{LogL}_{SAR} - \text{LogL}_{SDM}) \approx 2.0$ with 2 df (critical value 5.99). However, this does not mean the $\theta$ terms are unimportant &amp;mdash; it may simply reflect insufficient power with only 49 observations.&lt;/p>
&lt;h3 id="73-reduce-to-sem-common-factor-restriction">7.3 Reduce to SEM? (common factor restriction)&lt;/h3>
&lt;p>The SEM imposes the common factor restriction $\theta + \rho \beta = 0$. Under this restriction, the apparent spatial lag effects are entirely attributable to spatially correlated errors rather than substantive spillovers.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Wald test: Reduce to SEM? (NO if p &amp;lt; 0.05)
testnl ([CRIME]W_INC = -[W]CRIME * [CRIME]INC) ([CRIME]W_HOVAL = -[W]CRIME * [CRIME]HOVAL)
&lt;/code>&lt;/pre>
&lt;p>The test &lt;strong>fails to reject&lt;/strong> the SEM common factor restriction. The LR test yields $-2(\text{LogL}_{SEM} - \text{LogL}_{SDM}) \approx 4.0$ with 2 df (critical value 5.99), confirming the SEM is not rejected. This means that the spatial dependence in the Columbus data could be interpreted as arising from spatially correlated unobservables rather than substantive crime spillovers.&lt;/p>
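&lt;p>The name &amp;ldquo;common factor&amp;rdquo; comes from the algebra: substituting $\theta = -\rho\beta$ into the SDM lets the spatial filter $(I - \rho W)$ factor out of both sides.&lt;/p>
&lt;p>$$y = \rho W y + X\beta - \rho W X\beta + \varepsilon \;\Rightarrow\; (I - \rho W)y = (I - \rho W)X\beta + \varepsilon \;\Rightarrow\; y = X\beta + (I - \rho W)^{-1}\varepsilon$$&lt;/p>
&lt;p>The last expression is exactly the SEM with $\lambda = \rho$, which is why failing to reject the restriction supports the error-model interpretation.&lt;/p>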
&lt;h3 id="74-sdm-vs-slx-the-key-comparison">7.4 SDM vs. SLX: the key comparison&lt;/h3>
&lt;p>The SDM clearly outperforms the SLX. The SLX is estimated by OLS (no spatial lag of $y$), while the SDM adds $\rho Wy$, which is statistically significant ($\rho = 0.40$, z = 2.50). This spatial feedback term substantially improves the fit. The SLX alone, despite its significant $W \cdot INC$ coefficient, fails to capture the global spatial feedback that the $\rho$ parameter provides.&lt;/p>
&lt;h3 id="75-summary-of-specification-tests">7.5 Summary of specification tests&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">graph TD
SDM[&amp;quot;&amp;lt;b&amp;gt;Spatial Durbin Model (SDM)&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Starting point&amp;quot;]
SLX[&amp;quot;&amp;lt;b&amp;gt;SLX&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ρ = 0&amp;lt;br/&amp;gt;Rejected&amp;quot;]
SAR[&amp;quot;&amp;lt;b&amp;gt;SAR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;θ = 0&amp;lt;br/&amp;gt;Not rejected&amp;quot;]
SEM[&amp;quot;&amp;lt;b&amp;gt;SEM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;θ + ρβ = 0&amp;lt;br/&amp;gt;Not rejected&amp;quot;]
SDM --&amp;gt;|&amp;quot;LR ≈ 7.4, 1 df&amp;quot;| SLX
SDM --&amp;gt;|&amp;quot;LR ≈ 2.0, 2 df&amp;quot;| SAR
SDM --&amp;gt;|&amp;quot;LR ≈ 4.0, 2 df&amp;quot;| SEM
style SDM fill:#00d4c8,stroke:#141413,color:#141413
style SLX fill:#d97757,stroke:#141413,color:#fff
style SAR fill:#6a9bcc,stroke:#141413,color:#fff
style SEM fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The specification tests tell a nuanced story. Neither the SAR restriction ($\theta = 0$) nor the SEM common factor restriction ($\theta + \rho\beta = 0$) can be rejected at the 5% level. Only the SLX restriction ($\rho = 0$) is rejected, confirming that the spatial autoregressive parameter $\rho$ is essential. This leaves both SAR and SEM as statistically adequate simplifications. However, as Elhorst (2014) points out, the SAR&amp;rsquo;s constraint that the ratio between the indirect and direct effect is the same for every variable is economically restrictive. An alternative path is to consider the &lt;strong>SDEM&lt;/strong>, which also nests SLX and SEM (see Section 8.1).&lt;/p>
&lt;hr>
&lt;h2 id="8-extended-spatial-models">8. Extended spatial models&lt;/h2>
&lt;h3 id="81-sdem-spatial-durbin-error-model">8.1 SDEM (Spatial Durbin Error Model)&lt;/h3>
&lt;p>The SDEM combines the spatial lags of X from the SLX with the spatial error structure of the SEM. It captures &lt;strong>local spillovers&lt;/strong> through $WX\theta$ and &lt;strong>spatially correlated unobservables&lt;/strong> through $\lambda Wu$, but does not include the global feedback mechanism of $\rho Wy$.&lt;/p>
&lt;p>$$y = X \beta + W X \theta + u, \quad u = \lambda W u + \varepsilon$$&lt;/p>
&lt;p>The SDEM is sometimes preferred over the SDM when one believes that spillovers are local (limited to immediate neighbors) rather than global (propagating through the entire network). Like the SDM, the SDEM nests both the SLX ($\lambda = 0$) and the SEM ($\theta = 0$).&lt;/p>
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL W_INC W_HOVAL, ml errorlag(W)
eststo SDEM
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Spatial Durbin error model Number of obs = 49
Maximum likelihood estimates Wald chi2(4) = 66.92
Prob &amp;gt; chi2 = 0.0000
Log-likelihood = -181.779 Pseudo R2 = 0.5988
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -1.0523 .3213 -3.28 0.001 -1.6821 -.4225
HOVAL | -0.2782 .0911 -3.05 0.002 -0.4568 -.0996
W_INC | -1.2049 .5736 -2.10 0.036 -2.3292 -.0806
W_HOVAL | 0.1312 .2072 0.63 0.527 -0.2749 0.5374
-------------+----------------------------------------------------------------
W |
lambda | 0.4036 .1635 2.47 0.014 0.0832 0.7241
_cons | 73.6451 8.7239 8.44 0.000 56.5465 90.7437
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The spatial error parameter $\lambda$ is &lt;strong>0.404&lt;/strong> (z = 2.47, p = 0.014), confirming that spatially correlated unobservables are important. Crucially, the spatial lag of income $W \cdot INC$ is &lt;strong>-1.20&lt;/strong> and statistically significant (z = -2.10, p = 0.036). This is a key result: even after controlling for spatially correlated errors, neighbors' average income significantly reduces a neighborhood&amp;rsquo;s crime rate. The spatial lag of housing value ($W \cdot HOVAL = +0.13$) remains insignificant (p = 0.527).&lt;/p>
&lt;p>In the SDEM, the indirect effects correspond directly to the $\theta$ coefficients because there is no spatial multiplier (no $\rho Wy$ term):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Direct&lt;/th>
&lt;th>Indirect&lt;/th>
&lt;th>Total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>INC&lt;/strong>&lt;/td>
&lt;td>-1.05***&lt;/td>
&lt;td>-1.20**&lt;/td>
&lt;td>-2.26***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>HOVAL&lt;/strong>&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>+0.13&lt;/td>
&lt;td>-0.15&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The indirect effect of income is &lt;strong>-1.20&lt;/strong> (significant at 5%), indicating that a \$1,000 increase in neighbors' average income reduces crime in the focal neighborhood by 1.20 incidents per 1,000 households. This is a substantively important local spillover: neighborhoods benefit from having wealthier neighbors through reduced crime. The total effect of income is &lt;strong>-2.26&lt;/strong>, even larger than the OLS estimate of -1.60, because OLS ignores the neighbors' income channel entirely.&lt;/p>
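&lt;p>As in the SEM, the error filter drops out of the conditional mean, so the SDEM&amp;rsquo;s effect matrix coincides with the SLX&amp;rsquo;s:&lt;/p>
&lt;p>$$y = X\beta + WX\theta + (I - \lambda W)^{-1}\varepsilon \quad \Rightarrow \quad \frac{\partial E[y]}{\partial x_k'} = I\beta_k + W\theta_k$$&lt;/p>
&lt;p>With a row-standardized $W$, the average direct effect is $\beta_k$ and the average indirect effect is $\theta_k$, which is why the table above reads the coefficients directly as effects.&lt;/p>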
&lt;h3 id="82-sac--sarar">8.2 SAC / SARAR&lt;/h3>
&lt;p>The SAC (also called SARAR) model includes both a spatial lag of the dependent variable and a spatial error term, but no spatial lags of $X$. It separates two forms of spatial dependence: substantive spillovers through $\rho Wy$ and nuisance dependence through $\lambda Wu$.&lt;/p>
&lt;p>$$y = \rho W y + X \beta + u, \quad u = \lambda W u + \varepsilon$$&lt;/p>
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL, ml dvarlag(W) errorlag(W)
eststo SAC
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">SAC model Number of obs = 49
Wald chi2(2) = 54.77
Log-likelihood = -182.581 Prob &amp;gt; chi2 = 0.0000
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -1.0260 .3268 -3.14 0.002 -1.6666 -.3854
HOVAL | -0.2820 .0900 -3.13 0.002 -0.4584 -.1056
_cons | 47.8000 9.8900 4.83 0.000 28.4159 67.1841
-------------+----------------------------------------------------------------
W |
CRIME | 0.4780 .1622 2.95 0.003 0.1601 0.7959
lambda | 0.1660 .2969 0.56 0.576 -0.4158 0.7478
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>In the SAC model, $\rho$ is &lt;strong>0.478&lt;/strong> (z = 2.95, p = 0.003) and $\lambda$ is &lt;strong>0.166&lt;/strong> (z = 0.56, p = 0.576). When both are included, $\rho$ remains significant but $\lambda$ becomes insignificant, suggesting that the spatial lag model (SAR) dominates the spatial error structure. The coefficient of $\rho$ in the SAC (0.478) is close to the SAR value (0.428), and $\lambda$ in the SAC (0.166) is much smaller than in the SEM (0.562). The LR test of SAC versus SAR is approximately 0.3 with 1 df, and SAC versus SEM is approximately 2.3 with 1 df &amp;mdash; neither reaches the 5% critical value of 3.84, making it difficult to choose among these three models. However, since $\rho$ is significant while $\lambda$ is not, the SAR is the more parsimonious choice.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat impact
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Coefficient Std. err. z P&amp;gt;|z|
-------------------------------------------------------------------
INC
Direct | -1.0630 .3250 -3.27 0.001
Indirect | -0.5600 .3390 -1.65 0.099
Total | -1.6230 .5500 -2.95 0.003
-------------------------------------------------------------------
HOVAL
Direct | -0.2920 .0910 -3.21 0.001
Indirect | -0.1540 .0980 -1.57 0.116
Total | -0.4460 .1580 -2.82 0.005
-------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The SAC&amp;rsquo;s effect decomposition falls between the SAR and SEM. The direct effect of income (-1.06) is similar to the SAR (-1.10), and the indirect effects are somewhat attenuated because the spatial error term absorbs a portion of the spatial dependence. One key limitation of the SAC (shared with the SAR) is that the ratio between the indirect and direct effect is the same for every explanatory variable, because spillovers operate only through the spatial multiplier $(I - \rho W)^{-1}$. This constraint is economically restrictive &amp;mdash; there is no reason to expect that income and housing value should have proportionally equal spillover intensities.&lt;/p>
&lt;h3 id="83-gns-general-nesting-spatial">8.3 GNS (General Nesting Spatial)&lt;/h3>
&lt;p>The GNS model includes all three spatial channels simultaneously: the spatial lag of $y$, the spatial lags of $X$, and the spatial error. It is the most general specification in the taxonomy.&lt;/p>
&lt;p>$$y = \rho W y + X \beta + W X \theta + u, \quad u = \lambda W u + \varepsilon$$&lt;/p>
&lt;pre>&lt;code class="language-stata">spregress CRIME INC HOVAL W_INC W_HOVAL, ml dvarlag(W) errorlag(W)
eststo GNS
estat ic
mat s = r(S)
quietly estadd scalar AIC = s[1,5]
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">General nesting spatial model Number of obs = 49
Wald chi2(4) = 55.64
Log-likelihood = -179.689 Prob &amp;gt; chi2 = 0.0000
------------------------------------------------------------------------------
CRIME | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIME |
INC | -0.9510 .4397 -2.16 0.031 -1.8129 -.0891
HOVAL | -0.2860 .0997 -2.87 0.004 -0.4813 -.0907
W_INC | -0.6930 1.6896 -0.41 0.682 -4.0046 2.6186
W_HOVAL | 0.2080 .2849 0.73 0.465 -0.3504 0.7664
-------------+----------------------------------------------------------------
W |
CRIME | 0.3150 .9553 0.33 0.742 -1.5574 2.1874
lambda | 0.1540 1.0267 0.15 0.881 -1.8583 2.1663
_cons | 50.9000 14.2800 3.56 0.000 22.9115 78.8885
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>In the GNS model, $\rho$ is &lt;strong>0.315&lt;/strong> (p = 0.742), $\lambda$ is &lt;strong>0.154&lt;/strong> (p = 0.881), and the spatial lags of income and housing value are both insignificant. With four spatial parameters ($\rho$, $\lambda$, and the two $\theta$ terms) plus the regression coefficients competing to explain the same 49 observations, the model is &lt;strong>overparameterized&lt;/strong>. As Gibbons and Overman (2012) explain, spatial interaction in the dependent variable and spatial interaction in the error term are only &lt;strong>weakly identified&lt;/strong> separately. Combining both, as the GNS does, compounds this problem: the significance of all spatial terms tends to collapse. The log-likelihood barely improves over the SDM or SDEM, and the AIC is higher, confirming that the additional complexity does not improve fit.&lt;/p>
&lt;p>The GNS&amp;rsquo;s effect decomposition is correspondingly imprecise:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Direct&lt;/th>
&lt;th>Indirect&lt;/th>
&lt;th>Total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>INC&lt;/strong>&lt;/td>
&lt;td>-1.03***&lt;/td>
&lt;td>-1.37&lt;/td>
&lt;td>-2.40&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>HOVAL&lt;/strong>&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>+0.16&lt;/td>
&lt;td>-0.11&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The direct effects remain significant and stable (consistent with all other models), but the indirect effects have very large standard errors. The GNS confirms what the specification tests already suggested &amp;mdash; the data does not support the most general specification, and a more parsimonious model is needed.&lt;/p>
&lt;hr>
&lt;h2 id="9-model-comparison">9. Model comparison&lt;/h2>
&lt;h3 id="91-coefficient-comparison">9.1 Coefficient comparison&lt;/h3>
&lt;p>We compare all eight models side by side, focusing on the key coefficients and model fit. Estimates are by maximum likelihood (OLS and SLX by least squares); significance: * p &amp;lt; 0.10, ** p &amp;lt; 0.05, *** p &amp;lt; 0.01.&lt;/p>
&lt;pre>&lt;code class="language-stata">esttab OLS SAR SEM SLX SDM SDEM SAC GNS, ///
label stats(AIC) mtitle(&amp;quot;OLS&amp;quot; &amp;quot;SAR&amp;quot; &amp;quot;SEM&amp;quot; &amp;quot;SLX&amp;quot; &amp;quot;SDM&amp;quot; &amp;quot;SDEM&amp;quot; &amp;quot;SAC&amp;quot; &amp;quot;GNS&amp;quot;)
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>OLS&lt;/th>
&lt;th>SAR&lt;/th>
&lt;th>SEM&lt;/th>
&lt;th>SLX&lt;/th>
&lt;th>SDM&lt;/th>
&lt;th>SDEM&lt;/th>
&lt;th>SAC&lt;/th>
&lt;th>GNS&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>INC&lt;/td>
&lt;td>-1.60***&lt;/td>
&lt;td>-1.03***&lt;/td>
&lt;td>-0.94***&lt;/td>
&lt;td>-1.10***&lt;/td>
&lt;td>-0.92***&lt;/td>
&lt;td>-1.05***&lt;/td>
&lt;td>-1.03***&lt;/td>
&lt;td>-0.95**&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>HOVAL&lt;/td>
&lt;td>-0.27***&lt;/td>
&lt;td>-0.27***&lt;/td>
&lt;td>-0.30***&lt;/td>
&lt;td>-0.29***&lt;/td>
&lt;td>-0.30***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>-0.29***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\rho$ (W*y)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.43***&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.40**&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.48***&lt;/td>
&lt;td>0.32&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\lambda$ (W*e)&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.56***&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>0.40**&lt;/td>
&lt;td>0.17&lt;/td>
&lt;td>0.15&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*INC&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>-1.40**&lt;/td>
&lt;td>-0.58&lt;/td>
&lt;td>-1.20**&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>-0.69&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>W*HOVAL&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>+0.21&lt;/td>
&lt;td>+0.26&lt;/td>
&lt;td>+0.13&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>+0.21&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several patterns emerge. First, the income coefficient is &lt;strong>consistently negative&lt;/strong> across all models, ranging from -0.92 (SDM) to -1.60 (OLS). The spatial models generally produce smaller income coefficients than OLS, suggesting that part of the OLS income effect was capturing omitted spatial structure. Second, the housing value coefficient is &lt;strong>remarkably stable&lt;/strong> across all models, ranging from -0.27 to -0.30 &amp;mdash; this variable is insensitive to the spatial specification choice. Third, and crucially, the spatial lag of income ($W \cdot INC$) is &lt;strong>negative and significant&lt;/strong> in the SLX (-1.40, t = -2.50) and the SDEM (-1.20, z = -2.10), meaning that neighbors' income is a substantive predictor of crime. The SLX, SDM, SDEM, and GNS models all agree that $W \cdot INC$ is negative and $W \cdot HOVAL$ is positive, producing a consistent pattern of spatial spillover estimates regardless of which other spatial channels are included.&lt;/p>
&lt;h3 id="92-direct-and-indirect-effects-comparison">9.2 Direct and indirect effects comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>OLS&lt;/th>
&lt;th>SAR&lt;/th>
&lt;th>SEM&lt;/th>
&lt;th>SLX&lt;/th>
&lt;th>SDM&lt;/th>
&lt;th>SDEM&lt;/th>
&lt;th>SAC&lt;/th>
&lt;th>GNS&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>INC&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Direct&lt;/td>
&lt;td>-1.60***&lt;/td>
&lt;td>-1.10***&lt;/td>
&lt;td>-0.94***&lt;/td>
&lt;td>-1.10***&lt;/td>
&lt;td>-1.03***&lt;/td>
&lt;td>-1.05***&lt;/td>
&lt;td>-1.06***&lt;/td>
&lt;td>-1.03***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Indirect&lt;/td>
&lt;td>0&lt;/td>
&lt;td>-0.76**&lt;/td>
&lt;td>0&lt;/td>
&lt;td>-1.40**&lt;/td>
&lt;td>-1.50*&lt;/td>
&lt;td>-1.20**&lt;/td>
&lt;td>-0.56&lt;/td>
&lt;td>-1.37&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Total&lt;/td>
&lt;td>-1.60***&lt;/td>
&lt;td>-1.86***&lt;/td>
&lt;td>-0.94***&lt;/td>
&lt;td>-2.50***&lt;/td>
&lt;td>-2.52***&lt;/td>
&lt;td>-2.26***&lt;/td>
&lt;td>-1.62***&lt;/td>
&lt;td>-2.40&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>HOVAL&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Direct&lt;/td>
&lt;td>-0.27***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>-0.30***&lt;/td>
&lt;td>-0.29***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;td>-0.29***&lt;/td>
&lt;td>-0.28***&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Indirect&lt;/td>
&lt;td>0&lt;/td>
&lt;td>-0.20*&lt;/td>
&lt;td>0&lt;/td>
&lt;td>+0.21&lt;/td>
&lt;td>+0.22&lt;/td>
&lt;td>+0.13&lt;/td>
&lt;td>-0.15&lt;/td>
&lt;td>+0.16&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Total&lt;/td>
&lt;td>-0.27***&lt;/td>
&lt;td>-0.48***&lt;/td>
&lt;td>-0.30***&lt;/td>
&lt;td>-0.08&lt;/td>
&lt;td>-0.07&lt;/td>
&lt;td>-0.15&lt;/td>
&lt;td>-0.45***&lt;/td>
&lt;td>-0.11&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The direct effects of income and housing value are broadly consistent across models: approximately -0.94 to -1.60 for income and -0.27 to -0.30 for housing value. The &lt;strong>indirect effects&lt;/strong> reveal the most important differences:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>The OLS, SEM, and SAR models produce either no spillovers or misleading ones.&lt;/strong> Both OLS and the SEM restrict spillovers to zero by construction. The SAR constrains the ratio of indirect to direct effects to be identical for every explanatory variable, which forces the housing value spillover to be negative (-0.20) even though the SLX, SDM, SDEM, and GNS models all suggest it is positive.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The SLX, SDM, SDEM, and GNS models agree on the pattern&lt;/strong>: income spillovers are large and negative (-1.20 to -1.50), while housing value spillovers are small and positive (+0.13 to +0.22) and insignificant. This consistency across different model specifications strengthens the case that the income spillover is a robust finding.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The total effect of income is substantially larger in models with $\theta$ terms&lt;/strong> (-2.26 to -2.52) than in models without them (-0.94 to -1.86). This reveals that the standard SAR/SEM models substantially underestimate the full impact of income on crime by ignoring the local spillover channel.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="10-discussion">10. Discussion&lt;/h2>
&lt;p>The Columbus crime dataset illustrates a recurring challenge in spatial econometrics: choosing among models that capture spatial dependence through different channels. Following Elhorst (2014, Section 2.9), the evidence points toward the &lt;strong>SDM&lt;/strong> and &lt;strong>SDEM&lt;/strong> as the preferred specifications, though neither the SAR nor SEM can be formally rejected.&lt;/p>
&lt;p>&lt;strong>Why not SAR, SEM, or SAC?&lt;/strong> The specification tests fail to reject both the SAR restriction ($\theta = 0$) and the SEM common factor restriction ($\theta + \rho\beta = 0$), which might suggest these simpler models are adequate. However, as Elhorst (2014) emphasizes, these models have structural limitations. The SAR and SAC constrain the ratio between the indirect and direct effect to be &lt;strong>the same for every explanatory variable&lt;/strong> &amp;mdash; a consequence of spillovers operating solely through the spatial multiplier $(I - \rho W)^{-1}\beta_k$. In the Columbus data, this forces the housing value spillover to be negative (proportional to the direct effect), even though the SLX, SDM, SDEM, and GNS models all estimate it as positive. The SEM, on the other hand, produces &lt;strong>zero spillover effects&lt;/strong> by construction, which may be too restrictive if one believes that crime is genuinely affected by conditions in neighboring areas.&lt;/p>
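&lt;p>This equal-ratio restriction is easy to verify numerically. The sketch below (Python with numpy, using a small hypothetical row-standardized weight matrix rather than the Columbus data) computes the LeSage and Pace (2009) summary effects for two illustrative coefficients and shows that a SAR model forces their indirect-to-direct ratios to coincide.&lt;/p>

```python
import numpy as np

# Hypothetical 4-unit chain contiguity matrix, row-standardized (not the Columbus W)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

rho = 0.4                                  # illustrative spatial autoregressive parameter
M = np.linalg.inv(np.eye(4) - rho * W)     # spatial multiplier, inv(I - rho W)

def sar_effects(beta_k):
    """LeSage-Pace summary effects for one variable in a SAR model."""
    S = M * beta_k                         # effects matrix S_k(W) = inv(I - rho W) beta_k
    direct = np.mean(np.diag(S))           # average diagonal element
    total = np.mean(S.sum(axis=1))         # average row sum
    return direct, total - direct          # (direct, indirect)

d_inc, i_inc = sar_effects(-1.10)          # income-style coefficient (illustrative)
d_hov, i_hov = sar_effects(-0.28)          # housing-value-style coefficient (illustrative)

# The indirect/direct ratio is the same for every variable in a SAR:
print(i_inc / d_inc, i_hov / d_hov)
```

&lt;p>Because the effects matrix is the same multiplier $M$ scaled by each coefficient, the scaling cancels in the ratio; this is exactly why the SAR cannot deliver a positive housing spillover alongside a negative income spillover.&lt;/p>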
&lt;p>&lt;strong>Why SDM and SDEM?&lt;/strong> Both models allow the indirect effect to differ freely across explanatory variables. In both, the spillover effect of income is negative and significant (SDM: -1.50, marginally significant; SDEM: -1.20, significant at 5%), while the spillover effect of housing value is positive but insignificant. This flexibility produces economically sensible results: neighborhoods surrounded by higher-income areas experience less crime (consistent with crime displacement and opportunity theory), but neighbors' housing values have no significant independent effect on crime.&lt;/p>
&lt;p>&lt;strong>The SDM-SDEM dilemma.&lt;/strong> Whether the SDM or the SDEM better describes the data is difficult to say: the two models are &lt;strong>non-nested&lt;/strong> (the SDM has $\rho$ but no $\lambda$; the SDEM has $\lambda$ but no $\rho$), they produce comparable spillover effects in both magnitude and significance, and the GNS, which nests both, is overparameterized and yields insignificant estimates for all spatial parameters. As Elhorst (2014) notes, this indeterminacy is worrying because the two models have &lt;strong>different interpretations&lt;/strong>: the SDM implies that crime spillovers propagate globally through the network, while the SDEM implies they are local (limited to immediate neighbors) with the remaining spatial pattern driven by unobserved common factors.&lt;/p>
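&lt;p>The global-versus-local distinction can be made concrete. In the sketch below (Python with numpy, on a hypothetical 5-unit chain graph rather than the Columbus map, with illustrative parameter values), the SDM effects matrix $(I - \rho W)^{-1}(\beta I + \theta W)$ transmits a shock from one end of the chain to the other, while the SDEM effects matrix $\beta I + \theta W$ is nonzero only for immediate neighbors.&lt;/p>

```python
import numpy as np

# Hypothetical 5-unit chain: each unit borders only its immediate neighbors
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)       # row-standardize

rho, beta, theta = 0.4, -1.0, -1.2         # illustrative values, not estimates

# SDM: spillovers propagate globally through the spatial multiplier
S_sdm = np.linalg.inv(np.eye(n) - rho * W) @ (beta * np.eye(n) + theta * W)
# SDEM: spillovers stop at first-order neighbors
S_sdem = beta * np.eye(n) + theta * W

# Effect of a shock in unit 0 on the far end of the chain (unit 4):
print("SDM reaches unit 4:", abs(S_sdm[4, 0]) > 1e-12)    # True: global spillovers
print("SDEM reaches unit 4:", S_sdem[4, 0] != 0.0)        # False: local spillovers
```

&lt;p>Units 0 and 4 are not neighbors, yet the SDM assigns a nonzero cross-effect between them via higher-order terms $\rho W$, $\rho^2 W^2$, and so on; the SDEM assigns exactly zero.&lt;/p>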
&lt;p>&lt;strong>Policy implications.&lt;/strong> A \$1,000 increase in household income reduces crime by approximately 1.0 incident per 1,000 households directly and an additional 1.2&amp;ndash;1.5 incidents indirectly through the spatial spillover channel, for a total effect of 2.3&amp;ndash;2.5. This means that policies to increase income in the poorest neighborhoods generate positive externalities for neighboring areas that are even larger than the within-neighborhood effect. The total income effect in the SDM/SDEM (-2.3 to -2.5) is &lt;strong>40&amp;ndash;55% larger&lt;/strong> than the OLS estimate (-1.60), revealing the magnitude of the bias from ignoring spatial spillovers.&lt;/p>
&lt;p>This tutorial complements the companion post on &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_panel/">spatial panel regression&lt;/a>, which demonstrates the same model taxonomy in a panel data setting using cigarette demand across US states. The panel setting offers additional advantages &amp;mdash; fixed effects to control for unobserved heterogeneity and dynamic extensions to separate temporal from spatial dynamics &amp;mdash; but requires repeated observations over time. The cross-sectional framework presented here is appropriate when only a single snapshot of spatial data is available, which is common in urban economics, criminology, and regional science.&lt;/p>
&lt;hr>
&lt;h2 id="11-summary-and-next-steps">11. Summary and next steps&lt;/h2>
&lt;p>This tutorial covered the complete taxonomy of cross-sectional spatial regression models in Stata &amp;mdash; from OLS diagnostics through the most general GNS specification. The key takeaways are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Spatial autocorrelation is significant.&lt;/strong> Moran&amp;rsquo;s I of 0.222 (p = 0.005) confirms that OLS residuals are positively spatially autocorrelated, and the LM tests favor the spatial error specification.&lt;/li>
&lt;li>&lt;strong>The SDM and SDEM are the preferred models.&lt;/strong> Both models allow the indirect effects to differ across explanatory variables, and both identify a significant negative spillover effect of income. The SAR, SEM, and SLX restrictions from the SDM cannot be formally rejected, but the SAR and SAC impose an economically restrictive constraint (equal spillover-to-direct ratios for all variables), while the SEM produces zero spillovers by construction.&lt;/li>
&lt;li>&lt;strong>Direct effects are robust to spatial specification.&lt;/strong> The direct effect of income ranges from -1.03 to -1.10 across the four models with $\theta$ terms (SLX, SDM, SDEM, GNS), and the direct effect of housing value ranges from -0.28 to -0.29 &amp;mdash; substantially more stable than the indirect effects.&lt;/li>
&lt;li>&lt;strong>Neighbors' income significantly reduces crime.&lt;/strong> The indirect effect of income is -1.20 (SDEM) to -1.50 (SDM), comparable to or larger than the direct effect. The total income effect in the SDM/SDEM (-2.3 to -2.5) is &lt;strong>40&amp;ndash;55% larger&lt;/strong> than the OLS estimate (-1.60), revealing substantial bias from ignoring spatial spillovers.&lt;/li>
&lt;li>&lt;strong>The GNS is overparameterized.&lt;/strong> When all three spatial channels ($\rho$, $\theta$, $\lambda$) are included simultaneously, all become insignificant. The difficulty of separately identifying endogenous interaction effects and error interaction effects is a fundamental limitation of the cross-sectional setting.&lt;/li>
&lt;/ul>
&lt;p>For further study, consider the companion tutorial on &lt;a href="https://carlos-mendez.org/post/stata_sp_regression_panel/">spatial panel regression&lt;/a>, which extends these methods to panel data with fixed effects and dynamic specifications. For Python implementations, the PySAL &lt;code>spreg&lt;/code> package provides analogous spatial regression tools.&lt;/p>
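&lt;p>As a taste of the Python route, Moran's I itself is simple to compute by hand. The minimal numpy sketch below uses simulated residuals and a hypothetical weight matrix (not the Columbus data or its Queen contiguity matrix).&lt;/p>

```python
import numpy as np

def morans_i(e, W):
    """Moran's I for a residual vector e and spatial weight matrix W:
    I = (n / S0) * (e' W e) / (e' e), where S0 is the sum of all weights."""
    e = e - e.mean()                       # work with deviations from the mean
    n = len(e)
    S0 = W.sum()
    return (n / S0) * (e @ W @ e) / (e @ e)

# Hypothetical 4-unit row-standardized contiguity matrix
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
e = rng.normal(size=4)                     # simulated "residuals"
print(morans_i(e, W))
```

&lt;p>A value near zero indicates no spatial autocorrelation in the residuals; significantly positive values, like the 0.222 reported above for the OLS residuals, indicate spatial clustering that the model has not absorbed.&lt;/p>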
&lt;hr>
&lt;h2 id="12-exercises">12. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Alternative weight matrix.&lt;/strong> Replace the Queen contiguity matrix with a k-nearest neighbors matrix (e.g., $k = 4$ or $k = 6$). Re-estimate the SAR and SEM models and compare the spatial parameter estimates ($\rho$ and $\lambda$). Does the choice of weight matrix change the substantive conclusions about spatial dependence in crime?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Single explanatory variable.&lt;/strong> Re-estimate all eight models using only INC (dropping HOVAL). How do the spatial parameter estimates and the AIC rankings change? Does the Wald test from the SDM still fail to reject the SAR and SEM restrictions?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rook vs. Queen contiguity.&lt;/strong> Construct a Rook contiguity matrix (neighbors share a common edge, not just a vertex) and re-estimate the SDM. Compare the Wald specification test results to those obtained with Queen contiguity. Are the conclusions about which spatial model is appropriate sensitive to the contiguity definition?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1007/978-94-015-7799-1" target="_blank" rel="noopener">Anselin, L. (1988). &lt;em>Spatial Econometrics: Methods and Models&lt;/em>. Kluwer Academic Publishers.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1201/9781420064254" target="_blank" rel="noopener">LeSage, J. P. &amp;amp; Pace, R. K. (2009). &lt;em>Introduction to Spatial Econometrics&lt;/em>. Chapman &amp;amp; Hall/CRC.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://link.springer.com/book/10.1007/978-3-642-40340-8" target="_blank" rel="noopener">Elhorst, J. P. (2014). &lt;em>Spatial Econometrics: From Cross-Sectional Data to Spatial Panels&lt;/em>. Springer.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://geodacenter.github.io/documentation.html" target="_blank" rel="noopener">Anselin, L. (2005). &lt;em>Exploring Spatial Data with GeoDa: A Workbook&lt;/em>. Center for Spatially Integrated Social Science.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.5281/zenodo.5151076" target="_blank" rel="noopener">Mendez, C. (2021). &lt;em>Spatial econometrics for cross-sectional data in Stata&lt;/em>. DOI: 10.5281/zenodo.5151076.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://geodacenter.github.io/data-and-lab/columbus/" target="_blank" rel="noopener">Columbus crime dataset &amp;mdash; GeoDa Center Data and Lab.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Spatial inequality dynamics</title><link>https://carlos-mendez.org/post/python_gds_spatial_inequality/</link><pubDate>Sun, 27 Aug 2023 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_gds_spatial_inequality/</guid><description/></item><item><title>Introduction to spatial data science</title><link>https://carlos-mendez.org/post/python_intro_spatial_data_science/</link><pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_intro_spatial_data_science/</guid><description>&lt;p>Introduction to spatial data science with Python&lt;/p></description></item><item><title>Use marginal predictions</title><link>https://carlos-mendez.org/post/stata_marginal_predictions/</link><pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_marginal_predictions/</guid><description>&lt;p>TBA&lt;/p></description></item></channel></rss>