<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Principal Component Analysis (PCA) | Carlos Mendez</title><link>https://carlos-mendez.org/category/principal-component-analysis-pca/</link><atom:link href="https://carlos-mendez.org/category/principal-component-analysis-pca/index.xml" rel="self" type="application/rss+xml"/><description>Principal Component Analysis (PCA)</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sat, 21 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Principal Component Analysis (PCA)</title><link>https://carlos-mendez.org/category/principal-component-analysis-pca/</link></image><item><title>Introduction to PCA Analysis for Building Development Indicators</title><link>https://carlos-mendez.org/post/python_pca/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pca/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>In development economics, we rarely measure progress with just one number. To understand a country&amp;rsquo;s health system, you might look at life expectancy, infant mortality, hospital beds per capita, and disease prevalence. But how do you rank 50 countries when you have multiple metrics measured in different units &amp;mdash; years, rates, and raw counts? You cannot simply add them together. You need a single, elegant &amp;ldquo;Development Index.&amp;rdquo;&lt;/p>
&lt;p>&lt;strong>Principal Component Analysis (PCA)&lt;/strong> is a statistical technique used for data compression. It takes a dataset with many correlated variables and condenses it into a single composite index while retaining as much of the original information as possible. Think of PCA as finding the hallway in a building that gives you the longest unobstructed view &amp;mdash; the direction where the data is most spread out, and therefore most informative. For visual introductions to the core idea, see &lt;a href="https://youtu.be/_6UjscCJrYE" target="_blank" rel="noopener">Principal Component Analysis (PCA) Explained Simply&lt;/a> and &lt;a href="https://youtu.be/nEvKduLXFvk" target="_blank" rel="noopener">Visualizing Principal Component Analysis (PCA)&lt;/a>. For a hands-on interactive demonstration, try the &lt;a href="https://numiqo.com/lab/pca" target="_blank" rel="noopener">Numiqo PCA Lab&lt;/a>.&lt;/p>
&lt;p>This tutorial builds a simplified Health Index using only two indicators &amp;mdash; Life Expectancy (years) and Infant Mortality (deaths per 1,000 live births) &amp;mdash; for 50 simulated countries. By using simulated data with a known structure, we can verify that PCA recovers the true underlying pattern. The same six-step pipeline scales naturally to 10, 20, or 100 indicators.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why polarity adjustment and standardization are prerequisites for PCA&lt;/li>
&lt;li>Compute the covariance matrix and interpret its entries as variable overlap&lt;/li>
&lt;li>Perform eigen-decomposition to extract principal component weights and variance proportions&lt;/li>
&lt;li>Construct a composite index by projecting standardized data onto the first principal component&lt;/li>
&lt;li>Verify manual PCA results against scikit-learn&amp;rsquo;s PCA implementation&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-pca-pipeline">2. The PCA pipeline&lt;/h2>
&lt;p>Before diving into the math, it helps to see the full pipeline at a glance. Each of the six steps builds on the previous one and cannot be skipped.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Polarity&amp;lt;br/&amp;gt;Adjustment&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Standardization&amp;lt;br/&amp;gt;(Z-scores)&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Covariance&amp;lt;br/&amp;gt;Matrix&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Eigen-&amp;lt;br/&amp;gt;Decomposition&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scoring&amp;lt;br/&amp;gt;(PC1)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Normalization&amp;lt;br/&amp;gt;(0-1)&amp;quot;]
style A fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The pipeline transforms raw indicators into a single number that captures the dominant pattern of variation. We start by aligning indicator directions (Step 1), removing unit differences (Step 2), measuring variable overlap (Step 3), finding the optimal weights (Step 4), computing scores (Step 5), and finally rescaling for human readability (Step 6).&lt;/p>
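&lt;p>The whole pipeline is small enough to condense into one function. Below is a minimal sketch on a tiny made-up dataset (the function name and the four toy rows are illustrative, not part of the tutorial&amp;rsquo;s data); the rest of the tutorial builds and verifies each line separately.&lt;/p>

```python
import numpy as np

def composite_index(X, polarity):
    """Toy six-step PCA index: polarity -> z-scores -> covariance
    -> eigen-decomposition -> PC1 scores -> 0-1 normalization."""
    X = np.asarray(X, dtype=float) * np.asarray(polarity)  # Step 1: align directions
    Z = (X - X.mean(axis=0)) / X.std(axis=0)               # Step 2: standardize (ddof=0)
    S = np.cov(Z.T, ddof=0)                                # Step 3: covariance matrix
    vals, vecs = np.linalg.eigh(S)                         # Step 4: eigen-decomposition
    w = vecs[:, np.argmax(vals)]                           # PC1 weights
    if w[0] < 0:                                           # sign convention
        w = -w
    scores = Z @ w                                         # Step 5: project onto PC1
    return (scores - scores.min()) / (scores.max() - scores.min())  # Step 6: rescale

# Four made-up countries: life expectancy (positive) and infant mortality (negative)
X = [[80.0, 10.0], [70.0, 30.0], [60.0, 50.0], [75.0, 20.0]]
idx = composite_index(X, polarity=[1, -1])
print(idx.round(3))  # healthiest row scores 1.0, least healthy scores 0.0
```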
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis relies on &lt;a href="https://numpy.org/" target="_blank" rel="noopener">NumPy&lt;/a> for linear algebra, &lt;a href="https://pandas.pydata.org/" target="_blank" rel="noopener">pandas&lt;/a> for data management, &lt;a href="https://matplotlib.org/" target="_blank" rel="noopener">matplotlib&lt;/a> for visualization, and &lt;a href="https://scikit-learn.org/" target="_blank" rel="noopener">scikit-learn&lt;/a> for verification. The &lt;code>RANDOM_SEED&lt;/code> ensures every reader gets identical results.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Reproducibility
RANDOM_SEED = 42
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-simulating-health-data">4. Simulating health data&lt;/h2>
&lt;p>We generate data for 50 countries driven by a single latent factor &amp;mdash; &lt;code>base_health&lt;/code> &amp;mdash; drawn from a uniform distribution. This factor drives both life expectancy (positively) and infant mortality (negatively), mimicking the real-world pattern where healthier countries perform well across multiple indicators simultaneously. Using simulated data lets us verify that PCA recovers this known single-factor structure.&lt;/p>
&lt;pre>&lt;code class="language-python">def simulate_health_data(n=50, seed=42):
&amp;quot;&amp;quot;&amp;quot;Simulate health indicators for n countries.
True DGP:
base_health ~ Uniform(0, 1) -- latent health capacity
life_exp = 55 + 30 * base_health + N(0, 2) -- range ~55-85
infant_mort = 60 - 55 * base_health + N(0, 3) -- range ~2-60
&amp;quot;&amp;quot;&amp;quot;
rng = np.random.default_rng(seed)
base_health = rng.uniform(0, 1, n)
life_exp = 55 + 30 * base_health + rng.normal(0, 2, n)
infant_mort = 60 - 55 * base_health + rng.normal(0, 3, n)
countries = [f&amp;quot;Country_{i+1:02d}&amp;quot; for i in range(n)]
return pd.DataFrame({
&amp;quot;country&amp;quot;: countries,
&amp;quot;life_exp&amp;quot;: np.round(life_exp, 1),
&amp;quot;infant_mort&amp;quot;: np.round(infant_mort, 1),
})
df = simulate_health_data(n=50, seed=RANDOM_SEED)
# Save raw data to CSV (used later in the scikit-learn pipeline)
df.to_csv(&amp;quot;health_data.csv&amp;quot;, index=False)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;\nFirst 5 rows:&amp;quot;)
print(df.head().to_string(index=False))
print(f&amp;quot;\nDescriptive statistics:&amp;quot;)
print(df[[&amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;]].describe().round(2).to_string())
print(&amp;quot;\nSaved: health_data.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset shape: (50, 3)
First 5 rows:
country life_exp infant_mort
Country_01 79.6 18.6
Country_02 68.3 33.1
Country_03 81.3 11.6
Country_04 77.2 25.5
Country_05 54.9 53.8
Descriptive statistics:
life_exp infant_mort
count 50.00 50.00
mean 70.72 30.30
std 8.62 15.57
min 54.90 3.50
25% 63.45 17.28
50% 71.25 30.25
75% 78.90 42.05
max 84.70 58.70
Saved: health_data.csv
&lt;/code>&lt;/pre>
&lt;p>All 50 countries loaded with two health indicators. Life expectancy ranges from 54.9 to 84.7 years with a mean of 70.72, while infant mortality ranges from 3.5 to 58.7 per 1,000 live births with a mean of 30.30. Notice the directional conflict: life expectancy is a &amp;ldquo;positive&amp;rdquo; indicator (higher means better health), while infant mortality is a &amp;ldquo;negative&amp;rdquo; indicator (higher means worse health). This conflict is precisely what Step 1 will resolve.&lt;/p>
&lt;h2 id="5-exploring-the-raw-data">5. Exploring the raw data&lt;/h2>
&lt;p>Before transforming the data, let us visualize the raw relationship between the two indicators. The Pearson correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables, ranging from $-1$ (perfect negative) to $+1$ (perfect positive). If the two indicators are strongly correlated, PCA will be able to compress them effectively into a single index.&lt;/p>
&lt;pre>&lt;code class="language-python">raw_corr = df[&amp;quot;life_exp&amp;quot;].corr(df[&amp;quot;infant_mort&amp;quot;])
print(f&amp;quot;Pearson correlation (LE vs IM): {raw_corr:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pearson correlation (LE vs IM): -0.9595
&lt;/code>&lt;/pre>
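&lt;p>The value pandas reports can be reproduced from the definition of the Pearson coefficient, $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$. A quick sketch using only the five countries printed in the previous output, so the exact value here differs from the full-sample $-0.9595$:&lt;/p>

```python
import numpy as np

# The five countries shown in the head() output above
x = np.array([79.6, 68.3, 81.3, 77.2, 54.9])   # life_exp
y = np.array([18.6, 33.1, 11.6, 25.5, 53.8])   # infant_mort

# Pearson r = covariance / (std_x * std_y); ddof must match throughout
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std(ddof=0) * y.std(ddof=0))
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))   # identical, strongly negative
```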
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
fig.patch.set_linewidth(0)
ax.scatter(df[&amp;quot;life_exp&amp;quot;], df[&amp;quot;infant_mort&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=60, zorder=3)
# Label extreme countries
sorted_df = df.sort_values(&amp;quot;life_exp&amp;quot;)
label_idx = list(sorted_df.head(5).index) + list(sorted_df.tail(5).index)
for i in label_idx:
    ax.annotate(df.loc[i, &amp;quot;country&amp;quot;],
                (df.loc[i, &amp;quot;life_exp&amp;quot;], df.loc[i, &amp;quot;infant_mort&amp;quot;]),
                fontsize=7, color=LIGHT_TEXT, xytext=(5, 5),
                textcoords=&amp;quot;offset points&amp;quot;)
ax.set_xlabel(&amp;quot;Life Expectancy (years)&amp;quot;)
ax.set_ylabel(&amp;quot;Infant Mortality (per 1,000 live births)&amp;quot;)
ax.set_title(&amp;quot;Raw health indicators: Life Expectancy vs. Infant Mortality&amp;quot;)
ax.annotate(f&amp;quot;r = {raw_corr:.2f}&amp;quot;, xy=(0.95, 0.95), xycoords=&amp;quot;axes fraction&amp;quot;,
fontsize=12, color=WARM_ORANGE, fontweight=&amp;quot;bold&amp;quot;,
va=&amp;quot;top&amp;quot;, ha=&amp;quot;right&amp;quot;)
plt.savefig(&amp;quot;pca_raw_scatter.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_raw_scatter.png" alt="Raw health indicators: Life Expectancy vs. Infant Mortality for 50 simulated countries.">&lt;/p>
&lt;p>The Pearson correlation is $r = -0.96$, confirming a very strong negative relationship. Countries with high life expectancy almost always have low infant mortality, and vice versa. This means the two indicators are telling essentially the same story about health &amp;mdash; just in opposite directions. This high redundancy is exactly what PCA will exploit to compress two dimensions into one.&lt;/p>
&lt;h2 id="6-step-1-polarity-adjustment-----aligning-the-health-goals">6. Step 1: Polarity adjustment &amp;mdash; aligning the health goals&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> Before any math is applied, we must ensure our indicators share the same logical direction. We mathematically invert indicators where &amp;ldquo;higher&amp;rdquo; means &amp;ldquo;worse&amp;rdquo; so that all variables move in the same positive direction. For our negative indicator (Infant Mortality, or $IM$), we calculate an adjusted value:&lt;/p>
&lt;p>$$IM_i^{*} = -1 \times IM_i$$&lt;/p>
&lt;p>In words, this says: for each country $i$, multiply its infant mortality rate by negative one. After this transformation, a large positive value of $IM_i^{*}$ means low infant mortality &amp;mdash; a good outcome. Here $IM_i$ corresponds to the &lt;code>infant_mort&lt;/code> column, and $IM_i^{*}$ will be stored as &lt;code>infant_mort_adj&lt;/code>.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Country_01 has an infant mortality rate of 18.6 deaths per 1,000 live births. Applying the formula: $IM^{*} = -1 \times 18.6 = -18.6$. The raw value of 18.6 becomes $-18.6$ after polarity adjustment. The negative sign encodes &amp;ldquo;18.6 units of infant survival&amp;rdquo; &amp;mdash; a positive health signal that can now be combined with Life Expectancy because both variables point in the same direction.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Life Expectancy ($LE$) is a &amp;ldquo;positive&amp;rdquo; indicator: higher numbers mean better health. Infant Mortality is a &amp;ldquo;negative&amp;rdquo; indicator: higher numbers mean worse health. If we feed these into an index as they are, the final score will be contradictory. Imagine comparing exam scores where one professor grades 0&amp;ndash;100 (higher is better) and another grades on demerits 0&amp;ndash;100 (lower is better). Before averaging, you must flip the demerit scale.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We must flip the negative indicator so that &amp;ldquo;up&amp;rdquo; always means &amp;ldquo;better.&amp;rdquo; By multiplying by $-1$, instead of measuring &amp;ldquo;Infant Mortality,&amp;rdquo; we are effectively measuring &amp;ldquo;Infant Survival.&amp;rdquo; Now, for both variables, a higher number universally indicates a stronger health system.&lt;/p>
&lt;pre>&lt;code class="language-python">df[&amp;quot;infant_mort_adj&amp;quot;] = -1 * df[&amp;quot;infant_mort&amp;quot;]
adj_corr = df[&amp;quot;life_exp&amp;quot;].corr(df[&amp;quot;infant_mort_adj&amp;quot;])
print(f&amp;quot;Correlation after polarity adjustment (LE vs -IM): {adj_corr:.4f}&amp;quot;)
print(f&amp;quot;\nFirst 5 rows with adjusted IM:&amp;quot;)
print(df[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;infant_mort_adj&amp;quot;]].head().to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Correlation after polarity adjustment (LE vs -IM): 0.9595

First 5 rows with adjusted IM:
   country  life_exp  infant_mort  infant_mort_adj
Country_01      79.6         18.6            -18.6
Country_02      68.3         33.1            -33.1
Country_03      81.3         11.6            -11.6
Country_04      77.2         25.5            -25.5
Country_05      54.9         53.8            -53.8
&lt;/code>&lt;/pre>
&lt;p>The correlation has flipped from $-0.96$ to $+0.96$. Both indicators now point in the same direction: higher values mean better health. The magnitude of the correlation is unchanged &amp;mdash; the relationship is identical, just properly aligned. With this alignment in place, we can proceed to standardize the variables.&lt;/p>
&lt;h2 id="7-step-2-standardization-----comparing-apples-to-apples">7. Step 2: Standardization &amp;mdash; comparing apples to apples&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We transform our raw data into Z-scores. For each value, we subtract the variable&amp;rsquo;s mean ($\bar{X}_j$) and divide by its standard deviation ($\sigma_j$):&lt;/p>
&lt;p>$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j}$$&lt;/p>
&lt;p>In words, this says: for country $i$ and variable $j$, subtract the variable&amp;rsquo;s mean $\bar{X}_j$ and divide by its standard deviation $\sigma_j$. The result is a unitless score that tells us how many standard deviations above or below average the country is. Here $X_{ij}$ is the raw value (e.g., &lt;code>life_exp&lt;/code> or &lt;code>infant_mort_adj&lt;/code>), $\bar{X}_j$ is computed by &lt;code>np.mean()&lt;/code>, and $\sigma_j$ is computed by &lt;code>np.std(ddof=0)&lt;/code>.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Country_01 has Life Expectancy = 79.6 and adjusted Infant Mortality = $-18.6$. Applying the formula: $Z_{LE} = (79.6 - 70.72) / 8.53 = 8.88 / 8.53 \approx 1.04$ and $Z_{IM} = (-18.6 - (-30.30)) / 15.42 = 11.70 / 15.42 \approx 0.76$ (the third decimal depends on whether the mean and standard deviation are rounded before dividing). Country_01 is 1.04 standard deviations above average in life expectancy and 0.76 standard deviations above average in infant survival. Both positive Z-scores confirm it is a healthier-than-average country on both indicators. These two numbers are now directly comparable &amp;mdash; 1.04 and 0.76 are both measured in the same unit (standard deviations), even though the original variables were in years and rates.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Life Expectancy is measured in years (range 54.9&amp;ndash;84.7). Infant Mortality is measured as a rate per 1,000 (range 3.5&amp;ndash;58.7). If we mix these directly, the index will naturally be dominated by Infant Mortality simply because its values have a wider physical spread.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We standardize both variables to have a mean of $0$ and a standard deviation of $1$. We are no longer looking at &amp;ldquo;years&amp;rdquo; or &amp;ldquo;rates.&amp;rdquo; We are looking at &amp;ldquo;standard deviations from the global average.&amp;rdquo; Both indicators now have equal footing.&lt;/p>
&lt;pre>&lt;code class="language-python"># Manual standardization
le_mean = df[&amp;quot;life_exp&amp;quot;].mean()
le_std = df[&amp;quot;life_exp&amp;quot;].std(ddof=0)
im_mean = df[&amp;quot;infant_mort_adj&amp;quot;].mean()
im_std = df[&amp;quot;infant_mort_adj&amp;quot;].std(ddof=0)
df[&amp;quot;z_le&amp;quot;] = (df[&amp;quot;life_exp&amp;quot;] - le_mean) / le_std
df[&amp;quot;z_im&amp;quot;] = (df[&amp;quot;infant_mort_adj&amp;quot;] - im_mean) / im_std
print(f&amp;quot;Life Expectancy -- mean: {le_mean:.2f}, std: {le_std:.2f}&amp;quot;)
print(f&amp;quot;Infant Mort (adj) -- mean: {im_mean:.2f}, std: {im_std:.2f}&amp;quot;)
print(f&amp;quot;\nZ-score statistics:&amp;quot;)
print(f&amp;quot; z_le mean: {df['z_le'].mean():.6f}, std: {df['z_le'].std(ddof=0):.6f}&amp;quot;)
print(f&amp;quot; z_im mean: {df['z_im'].mean():.6f}, std: {df['z_im'].std(ddof=0):.6f}&amp;quot;)
# Verify with sklearn
scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(df[[&amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort_adj&amp;quot;]])
max_diff = np.max(np.abs(Z_sklearn - df[[&amp;quot;z_le&amp;quot;, &amp;quot;z_im&amp;quot;]].values))
print(f&amp;quot;\nMax difference from sklearn StandardScaler: {max_diff:.2e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Life Expectancy -- mean: 70.72, std: 8.53
Infant Mort (adj) -- mean: -30.30, std: 15.42
Z-score statistics:
z_le mean: 0.000000, std: 1.000000
z_im mean: 0.000000, std: 1.000000
Max difference from sklearn StandardScaler: 0.00e+00
&lt;/code>&lt;/pre>
&lt;p>Both Z-scores now have a mean of exactly 0 and a standard deviation of exactly 1, confirmed by the zero-difference check against &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">StandardScaler()&lt;/a>. Note that &lt;code>describe()&lt;/code> in Section 4 reported &lt;code>std = 8.62&lt;/code> for life expectancy, while here we get &lt;code>8.53&lt;/code>. The difference is the denominator: pandas' &lt;code>describe()&lt;/code> divides by $n - 1$ (sample standard deviation, &lt;code>ddof=1&lt;/code>), while &lt;code>StandardScaler&lt;/code> and our manual formula divide by $n$ (population standard deviation, &lt;code>ddof=0&lt;/code>). We use &lt;code>ddof=0&lt;/code> because PCA treats the dataset as the full population being analyzed, not a sample from a larger population. A country that is 2 standard deviations above average in life expectancy is now directly comparable to one that is 2 standard deviations above average in (adjusted) infant mortality. The unit problem is solved, and we can now measure how the two variables move together.&lt;/p>
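&lt;p>The &lt;code>ddof&lt;/code> distinction is easy to check directly. A minimal sketch (the five values are illustrative, not the full dataset):&lt;/p>

```python
import numpy as np
import pandas as pd

x = pd.Series([54.9, 68.3, 77.2, 79.6, 81.3])  # illustrative values

sample_std = x.std(ddof=1)      # pandas describe() default: divide by n - 1
population_std = x.std(ddof=0)  # StandardScaler / np.std default: divide by n

# The two estimates differ by a factor of sqrt((n - 1) / n)
n = len(x)
print(round(sample_std, 4), round(population_std, 4))
print(bool(np.isclose(population_std, sample_std * np.sqrt((n - 1) / n))))  # True
```

With $n = 50$, that factor is $\sqrt{49/50} \approx 0.990$, which is exactly the gap between the 8.62 reported by `describe()` and the 8.53 used here.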
&lt;h2 id="8-step-3-the-covariance-matrix-----mapping-the-overlap">8. Step 3: The covariance matrix &amp;mdash; mapping the overlap&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We calculate the covariance matrix to measure how the two standardized variables move together. For two variables, this forms a $2 \times 2$ matrix ($\Sigma$). Because our data is standardized, the covariance between them is simply their correlation ($r$):&lt;/p>
&lt;p>$$\Sigma = \frac{1}{n} Z^T Z = \begin{pmatrix} 1 &amp;amp; r \\ r &amp;amp; 1 \end{pmatrix}$$&lt;/p>
&lt;p>In words, this says: the covariance matrix $\Sigma$ of standardized data has 1s on the diagonal (each variable has unit variance after standardization) and the correlation $r$ on the off-diagonal. With two variables, PCA only needs to decompose this single $2 \times 2$ matrix.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Plugging in our standardized data, the resulting matrix has diagonal entries of 1.0000 (guaranteed by standardization &amp;mdash; each variable has unit variance) and an off-diagonal of 0.9595 (the correlation $r$). This off-diagonal value means that when a country&amp;rsquo;s standardized life expectancy increases by 1 standard deviation, its standardized infant survival tends to increase by 0.96 standard deviations as well. The two variables move almost in lockstep, confirming heavy redundancy that PCA can exploit.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> In the real world, these two indicators are heavily correlated. A country with high life expectancy almost certainly has high infant survival. They are essentially telling us the same story about the country&amp;rsquo;s healthcare system.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> The covariance matrix measures exactly how strong this overlap is. It tells the PCA algorithm mathematically, &amp;ldquo;These two variables share a high amount of redundant information. You can safely compress them into one variable without losing the big picture.&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-python">Z = df[[&amp;quot;z_le&amp;quot;, &amp;quot;z_im&amp;quot;]].values
cov_matrix = np.cov(Z.T, ddof=0)
print(f&amp;quot;Covariance matrix (2x2):&amp;quot;)
print(f&amp;quot; [{cov_matrix[0, 0]:.4f} {cov_matrix[0, 1]:.4f}]&amp;quot;)
print(f&amp;quot; [{cov_matrix[1, 0]:.4f} {cov_matrix[1, 1]:.4f}]&amp;quot;)
print(f&amp;quot;\nOff-diagonal (correlation): {cov_matrix[0, 1]:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Covariance matrix (2x2):
[1.0000 0.9595]
[0.9595 1.0000]
Off-diagonal (correlation): 0.9595
&lt;/code>&lt;/pre>
&lt;p>The diagonal entries are exactly 1.0 (unit variance, as expected after standardization) and the off-diagonal is 0.9595 &amp;mdash; the same correlation we computed earlier. A correlation this strong means the two variables share about 92% of their variance ($r^2 \approx 0.92$): nearly all the movement in one is mirrored by the other. The covariance matrix has now quantified the overlap, and eigen-decomposition will use this information to find the optimal compression axis.&lt;/p>
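&lt;p>The matrix formula $\frac{1}{n} Z^T Z$ can be checked against &lt;code>np.cov&lt;/code> on any standardized array. A minimal sketch with stand-in correlated data (random, not the tutorial&amp;rsquo;s countries):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.9], [0.0, 0.5]])  # correlated columns
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize with ddof=0

n = Z.shape[0]
manual_cov = Z.T @ Z / n                   # the formula from the text
numpy_cov = np.cov(Z.T, ddof=0)            # library equivalent

print(bool(np.allclose(manual_cov, numpy_cov)))     # True
print(bool(np.allclose(np.diag(manual_cov), 1.0)))  # unit diagonal -> True
```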
&lt;h2 id="9-step-4-eigen-decomposition-----finding-the-optimal-direction">9. Step 4: Eigen-decomposition &amp;mdash; finding the optimal direction&lt;/h2>
&lt;p>This is the mathematical core, where we find our new, compressed index. It introduces two new concepts &amp;mdash; &lt;strong>eigenvectors&lt;/strong> and &lt;strong>eigenvalues&lt;/strong> &amp;mdash; that are central to how PCA works.&lt;/p>
&lt;p>&lt;strong>What it is:&lt;/strong> We decompose the covariance matrix $\Sigma$ into two outputs by solving the equation $\Sigma \mathbf{v} = \lambda \mathbf{v}$:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An &lt;strong>eigenvector&lt;/strong> ($\mathbf{v}$) is a direction in the data space. For our two health indicators, each eigenvector is a pair of numbers $[w_1, w_2]$ that defines a direction &amp;mdash; like a compass heading through the scatter plot of countries. PCA finds the direction along which the data is most spread out. The components of this eigenvector become the &lt;strong>weights&lt;/strong> for combining our indicators into a single index. A symmetric $2 \times 2$ matrix such as our covariance matrix always has exactly two eigenvectors, perpendicular to each other.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An &lt;strong>eigenvalue&lt;/strong> ($\lambda$) is a number that tells us how much variance &amp;mdash; how much &amp;ldquo;spread&amp;rdquo; &amp;mdash; the data has along its corresponding eigenvector direction. A large eigenvalue means the countries are widely dispersed in that direction (lots of information), while a small eigenvalue means they are tightly clustered (little information). The eigenvalues always sum to the total number of variables (in our case, 2).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Why are they useful for PCA?&lt;/strong> Together, eigenvectors and eigenvalues answer two questions at once. The eigenvector with the &lt;strong>largest&lt;/strong> eigenvalue identifies the single best direction to project our data &amp;mdash; the direction that captures the most variation across countries. Its components tell us exactly how much weight to give each indicator in our composite index. The ratio of the largest eigenvalue to the total tells us what percentage of the original information our single-number index retains.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> For a $2 \times 2$ correlation matrix, the eigenvalues have an elegant closed-form solution: $\lambda_1 = 1 + r$ and $\lambda_2 = 1 - r$. With our correlation of $r = 0.9595$: $\lambda_1 = 1 + 0.9595 = 1.9595$ and $\lambda_2 = 1 - 0.9595 = 0.0405$. The variance explained by PC1 is $1.9595 / 2.0000 = 97.97\%$. This reveals a direct link between correlation strength and PCA compression power. Because $r = 0.96$, the first eigenvalue absorbs nearly all the variance, leaving only $\lambda_2 = 0.04$ for PC2. The higher the correlation between our health indicators, the more variance PC1 captures. At the extreme, if the correlation were zero, both eigenvalues would equal 1.0 and PCA would offer no compression advantage at all.&lt;/p>
&lt;p>&lt;strong>The Intuition:&lt;/strong> Imagine plotting all 50 countries on a 2D graph &amp;mdash; Standardized Life Expectancy on the X-axis, Standardized Infant Survival on the Y-axis. The countries form a narrow, elongated cloud stretching diagonally from the lower-left (unhealthy countries) to the upper-right (healthy countries). The first eigenvector is a straight line drawn through the long axis of this cloud &amp;mdash; the direction where countries differ the most. Its eigenvalue measures how stretched the cloud is along that line. If you stood at the center of the cloud and looked down the first eigenvector, you would see maximum separation between countries. Look down the second eigenvector (perpendicular), and the countries would appear tightly bunched &amp;mdash; almost no useful information in that direction. This is why we keep the first eigenvector and discard the second: it captures nearly all the meaningful variation.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> Eigen-decomposition removes human bias. Instead of randomly guessing how much weight to give Life Expectancy versus Infant Mortality, the math calculates the absolute optimal weights to capture the maximum amount of overlapping information. The algorithm finds the single direction that best summarizes 50 countries' health performance &amp;mdash; no subjective judgment required.&lt;/p>
&lt;pre>&lt;code class="language-python">eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort in descending order (eigh returns ascending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Sign convention: first weight positive
if eigenvectors[0, 0] &amp;lt; 0:
    eigenvectors[:, 0] *= -1
var_explained = eigenvalues / eigenvalues.sum() * 100
print(f&amp;quot;Eigenvalues: [{eigenvalues[0]:.4f}, {eigenvalues[1]:.4f}]&amp;quot;)
print(f&amp;quot;Sum of eigenvalues: {eigenvalues.sum():.4f}&amp;quot;)
print(f&amp;quot;\nEigenvector (PC1): [{eigenvectors[0, 0]:.4f}, {eigenvectors[1, 0]:.4f}]&amp;quot;)
print(f&amp;quot;Eigenvector (PC2): [{eigenvectors[0, 1]:.4f}, {eigenvectors[1, 1]:.4f}]&amp;quot;)
print(f&amp;quot;\nVariance explained:&amp;quot;)
print(f&amp;quot; PC1: {var_explained[0]:.2f}%&amp;quot;)
print(f&amp;quot; PC2: {var_explained[1]:.2f}%&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Eigenvalues: [1.9595, 0.0405]
Sum of eigenvalues: 2.0000
Eigenvector (PC1): [0.7071, 0.7071]
Eigenvector (PC2): [0.7071, -0.7071]
Variance explained:
PC1: 97.97%
PC2: 2.03%
&lt;/code>&lt;/pre>
&lt;p>The first eigenvalue is 1.9595 and the second is just 0.0405, summing to 2.0 (the number of variables). PC1 captures 97.97% of all variance in the data &amp;mdash; nearly everything. The eigenvector weights are both 0.7071 ($\approx 1/\sqrt{2}$), meaning both variables contribute equally to PC1. This equal weighting is not a coincidence &amp;mdash; it is a mathematical certainty whenever PCA is applied to exactly two standardized variables. A $2 \times 2$ correlation matrix always has eigenvectors $[1/\sqrt{2}, \; 1/\sqrt{2}]$ and $[1/\sqrt{2}, \; -1/\sqrt{2}]$, regardless of how strong the correlation is. With three or more variables, the weights would generally differ, giving more influence to variables that contribute unique information.&lt;/p>
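&lt;p>Both closed-form facts &amp;mdash; $\lambda = 1 \pm r$ and the fixed $\pm 1/\sqrt{2}$ weights &amp;mdash; can be confirmed numerically for any positive correlation. A short check (the $r$ values below are arbitrary):&lt;/p>

```python
import numpy as np

for r in [0.2, 0.5, 0.9595]:
    S = np.array([[1.0, r], [r, 1.0]])        # 2x2 correlation matrix
    vals, vecs = np.linalg.eigh(S)            # eigh returns ascending order
    assert np.allclose(vals, [1 - r, 1 + r])  # closed-form eigenvalues
    pc1 = vecs[:, 1]                          # eigenvector of the largest eigenvalue
    if pc1[0] < 0:                            # apply the sign convention
        pc1 = -pc1
    assert np.allclose(pc1, 1 / np.sqrt(2))   # both weights, independent of r

print("closed form holds for every r tested")
```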
&lt;h3 id="91-visualizing-the-principal-components">9.1 Visualizing the principal components&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 8))
fig.patch.set_linewidth(0)
ax.scatter(Z[:, 0], Z[:, 1], color=STEEL_BLUE, edgecolors=DARK_NAVY,
s=60, zorder=3, alpha=0.8)
# Eigenvector arrows scaled by sqrt(eigenvalue) so length reflects variance
vis = 1.5 # visibility multiplier
scale_pc1 = np.sqrt(eigenvalues[0]) * vis
scale_pc2 = np.sqrt(eigenvalues[1]) * vis
ax.annotate(&amp;quot;&amp;quot;, xy=(eigenvectors[0, 0] * scale_pc1, eigenvectors[1, 0] * scale_pc1),
xytext=(0, 0),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=WARM_ORANGE, lw=2.5))
ax.annotate(&amp;quot;&amp;quot;, xy=(eigenvectors[0, 1] * scale_pc2, eigenvectors[1, 1] * scale_pc2),
xytext=(0, 0),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=TEAL, lw=2.0))
ax.text(eigenvectors[0, 0] * scale_pc1 + 0.15, eigenvectors[1, 0] * scale_pc1 + 0.15,
f&amp;quot;PC1 ({var_explained[0]:.1f}%)&amp;quot;, color=WARM_ORANGE, fontsize=12,
fontweight=&amp;quot;bold&amp;quot;)
ax.text(eigenvectors[0, 1] * scale_pc2 + 0.15, eigenvectors[1, 1] * scale_pc2 - 0.15,
f&amp;quot;PC2 ({var_explained[1]:.1f}%)&amp;quot;, color=TEAL, fontsize=12,
fontweight=&amp;quot;bold&amp;quot;)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Standardized Life Expectancy (Z-score)&amp;quot;)
ax.set_ylabel(&amp;quot;Standardized Infant Survival (Z-score)&amp;quot;)
ax.set_title(&amp;quot;Standardized data with principal component directions&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca_standardized_eigenvectors.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_standardized_eigenvectors.png" alt="Standardized data with PC1 and PC2 eigenvector arrows overlaid.">&lt;/p>
&lt;p>The orange PC1 arrow points along the diagonal &amp;mdash; the direction of maximum spread through the narrow, elongated data cloud. Because both weights are positive and equal (0.7071 each), PC1 essentially averages the two standardized indicators. The teal PC2 arrow is perpendicular and captures only the small residual variation (2.03%) not explained by PC1.&lt;/p>
&lt;h3 id="92-variance-explained">9.2 Variance explained&lt;/h3>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(6, 4))
fig.patch.set_linewidth(0)
bars = ax.bar([&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;], var_explained,
color=[WARM_ORANGE, STEEL_BLUE],
edgecolor=DARK_NAVY, width=0.5)
for bar, val in zip(bars, var_explained):
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
f&amp;quot;{val:.1f}%&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;, fontsize=13,
fontweight=&amp;quot;bold&amp;quot;, color=WHITE_TEXT)
ax.set_ylabel(&amp;quot;Variance Explained (%)&amp;quot;)
ax.set_title(&amp;quot;Variance explained by each principal component&amp;quot;)
ax.set_ylim(0, 110)
plt.savefig(&amp;quot;pca_variance_explained.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_variance_explained.png" alt="Bar chart showing PC1 captures 98.0% and PC2 captures 2.0% of total variance.">&lt;/p>
&lt;p>PC1 alone captures 98.0% of all variation, meaning a single number retains almost all the information in the original two variables. The remaining 2.0% on PC2 is mostly noise from the simulation. This extreme dominance confirms that the two health indicators are largely redundant &amp;mdash; an ideal scenario for PCA compression. With the optimal weights in hand, we can now compute each country&amp;rsquo;s score.&lt;/p>
&lt;h2 id="10-step-5-scoring-----building-the-index">10. Step 5: Scoring &amp;mdash; building the index&lt;/h2>
&lt;p>Step 4 produced the eigenvector $[0.7071, \; 0.7071]$. These two numbers are the weights $w_1$ and $w_2$ &amp;mdash; the recipe for building our index. The first entry ($w_1 = 0.7071$) multiplies standardized Life Expectancy, and the second ($w_2 = 0.7071$) multiplies standardized Infant Survival. The eigenvector itself &lt;em>is&lt;/em> the formula for combining the variables.&lt;/p>
&lt;p>&lt;strong>What it is:&lt;/strong> We multiply each country&amp;rsquo;s standardized data by the weights from our eigenvector to calculate their final Principal Component 1 ($PC1$) score:&lt;/p>
&lt;p>$$PC1_i = (w_1 \times Z_{i,LE}) + (w_2 \times Z_{i,IM})$$&lt;/p>
&lt;p>In words, this says: for country $i$, the PC1 score is $w_1$ times its standardized life expectancy plus $w_2$ times its standardized (adjusted) infant mortality. Here $w_1$ and $w_2$ are the eigenvector components from Step 4, $Z_{i,LE}$ is &lt;code>z_le&lt;/code>, and $Z_{i,IM}$ is &lt;code>z_im&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Why are the weights equal?&lt;/strong> As explained in Step 4, equal weights ($w_1 = w_2 = 1/\sqrt{2}$) are a mathematical certainty in the two-variable standardized case &amp;mdash; they hold for any correlation value $r$. Because both weights are identical, the PC1 score is equivalent to a simple average of the two Z-scores (scaled by $\sqrt{2}$). In practice, this means PCA adds no weighting advantage over a naive average when you have exactly two standardized indicators. The real power of PCA emerges with three or more variables, where the algorithm discovers unequal weights that reflect each variable&amp;rsquo;s unique contribution.&lt;/p>
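&lt;p>The equivalence to a simple average is easy to confirm numerically. In the sketch below (synthetic scores, assuming NumPy), the PC1 score $w_1 z_1 + w_2 z_2$ with $w_1 = w_2 = 1/\sqrt{2}$ equals exactly $\sqrt{2}$ times the mean of the two scores, so the ranking is identical to that of a naive average.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
rng = np.random.default_rng(0)
z1 = rng.standard_normal(50)               # any pair of scores
z2 = 0.9 * z1 + 0.2 * rng.standard_normal(50)
w = 1 / np.sqrt(2)                         # equal PC1 weights, two variables
pc1 = w * z1 + w * z2                      # PC1 score
avg = (z1 + z2) / 2                        # naive average of the two scores
print(np.allclose(pc1, np.sqrt(2) * avg))  # True: same index up to scaling
&lt;/code>&lt;/pre>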
&lt;p>&lt;strong>The application:&lt;/strong> Country_01&amp;rsquo;s Z-scores from Step 2 were $Z_{LE} = 1.0402$ and $Z_{IM} = 0.7587$. Applying the formula: $PC1 = 0.7071 \times 1.0402 + 0.7071 \times 0.7587 = 0.7355 + 0.5365 = 1.2720$. Country_01&amp;rsquo;s PC1 score of 1.27 is positive and well above the mean of 0, placing it in the healthier half of the sample. The contribution from life expectancy (0.7355) is slightly larger than from infant survival (0.5365), reflecting the fact that Country_01 is further above average in life expectancy ($Z = 1.04$) than in infant survival ($Z = 0.76$).&lt;/p>
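&lt;p>The arithmetic can be reproduced in a few lines (the Z-scores are Country_01&amp;rsquo;s values from Step 2):&lt;/p>
&lt;pre>&lt;code class="language-python">w1 = w2 = 0.7071                     # eigenvector weights from Step 4
z_le, z_im = 1.0402, 0.7587          # Country_01 Z-scores from Step 2
pc1_score = w1 * z_le + w2 * z_im
print(round(w1 * z_le, 4), round(w2 * z_im, 4), round(pc1_score, 4))
# 0.7355 0.5365 1.272
&lt;/code>&lt;/pre>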
&lt;p>&lt;strong>The Intuition:&lt;/strong> We take every country&amp;rsquo;s dot on our 2D graph and project it squarely onto that single diagonal line. The position of the country along that line is its new, single Health Score.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> We have successfully collapsed a 2D matrix into a 1D number line. Two variables have officially become one composite index.&lt;/p>
&lt;pre>&lt;code class="language-python">w1 = eigenvectors[0, 0]
w2 = eigenvectors[1, 0]
df[&amp;quot;pc1&amp;quot;] = w1 * df[&amp;quot;z_le&amp;quot;] + w2 * df[&amp;quot;z_im&amp;quot;]
print(f&amp;quot;Eigenvector weights: w1 = {w1:.4f}, w2 = {w2:.4f}&amp;quot;)
print(f&amp;quot;\nPC1 score statistics:&amp;quot;)
print(f&amp;quot; Mean: {df['pc1'].mean():.4f}&amp;quot;)
print(f&amp;quot; Std: {df['pc1'].std(ddof=0):.4f}&amp;quot;)
print(f&amp;quot; Min: {df['pc1'].min():.4f}&amp;quot;)
print(f&amp;quot; Max: {df['pc1'].max():.4f}&amp;quot;)
print(f&amp;quot;\nTop 5 countries (highest PC1):&amp;quot;)
print(df.nlargest(5, &amp;quot;pc1&amp;quot;)[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;pc1&amp;quot;]]
.to_string(index=False))
print(f&amp;quot;\nBottom 5 countries (lowest PC1):&amp;quot;)
print(df.nsmallest(5, &amp;quot;pc1&amp;quot;)[[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;pc1&amp;quot;]]
.to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Eigenvector weights: w1 = 0.7071, w2 = 0.7071
PC1 score statistics:
Mean: 0.0000
Std: 1.3998
Min: -2.3892
Max: 2.3734
Top 5 countries (highest PC1):
country life_exp infant_mort pc1
Country_12 84.7 3.8 2.373421
Country_32 83.4 4.0 2.256521
Country_23 81.6 3.5 2.130292
Country_06 83.6 8.6 2.062127
Country_03 81.3 11.6 1.733944
Bottom 5 countries (lowest PC1):
country life_exp infant_mort pc1
Country_05 54.9 53.8 -2.389155
Country_28 57.7 58.7 -2.381854
Country_29 58.8 55.9 -2.162285
Country_18 58.5 54.9 -2.141282
Country_50 57.2 51.9 -2.111422
&lt;/code>&lt;/pre>
&lt;p>PC1 scores range from $-2.39$ (Country_05) to $+2.37$ (Country_12). The top-scoring countries combine high life expectancy (81&amp;ndash;85 years) with very low infant mortality (3.5&amp;ndash;8.6 per 1,000), while the bottom-scoring countries show the opposite pattern (54.9&amp;ndash;58.8 years, 52&amp;ndash;59 per 1,000). The mean of 0.0 confirms that PC1 is centered, as expected from standardized inputs.&lt;/p>
&lt;pre>&lt;code class="language-python">df_sorted = df.sort_values(&amp;quot;pc1&amp;quot;, ascending=True)
fig, ax = plt.subplots(figsize=(10, 14))
fig.patch.set_linewidth(0)
colors = [TEAL if v &amp;gt;= 0 else WARM_ORANGE for v in df_sorted[&amp;quot;pc1&amp;quot;]]
ax.barh(range(len(df_sorted)), df_sorted[&amp;quot;pc1&amp;quot;], color=colors,
edgecolor=DARK_NAVY, height=0.7)
ax.set_yticks(range(len(df_sorted)))
ax.set_yticklabels(df_sorted[&amp;quot;country&amp;quot;], fontsize=8)
ax.axvline(0, color=LIGHT_TEXT, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;PC1 Score&amp;quot;)
ax.set_title(&amp;quot;PC1 scores: countries ranked by health performance&amp;quot;)
plt.savefig(&amp;quot;pca_pc1_scores.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_pc1_scores.png" alt="Horizontal bar chart of 50 countries ranked by PC1 score.">&lt;/p>
&lt;p>The bar chart reveals a roughly symmetric distribution of PC1 scores around zero, with the healthiest countries (teal bars) on the right and the least healthy (orange bars) on the left. However, the raw PC1 scores include negative numbers &amp;mdash; a format that is hard to communicate in policy reports. The next step normalizes these scores to a 0&amp;ndash;1 scale.&lt;/p>
&lt;h2 id="11-step-6-normalization-----making-it-human-readable">11. Step 6: Normalization &amp;mdash; making it human-readable&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We apply Min-Max scaling to compress the $PC1$ scores into a range between 0 and 1. Let $\min(PC1)$ be the lowest score in the sample, and $\max(PC1)$ be the highest:&lt;/p>
&lt;p>$$HI_i = \frac{PC1_i - PC1_{min}}{PC1_{max} - PC1_{min}}$$&lt;/p>
&lt;p>In words, this says: subtract the minimum PC1 score and divide by the range. The country with the lowest PC1 score gets $HI = 0$ and the country with the highest gets $HI = 1$.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> Country_01&amp;rsquo;s PC1 score from Step 5 is 1.2720, while the sample minimum is $-2.3892$ and the maximum is $2.3734$. Applying the formula: $HI = (1.2720 - (-2.3892)) / (2.3734 - (-2.3892)) = 3.6612 / 4.7626 = 0.7687$. Country_01&amp;rsquo;s Health Index of 0.77 means it sits 77% of the way along the scale anchored by the worst-performing country (0.00) and the best-performing country (1.00). A policymaker can immediately understand this number without knowing anything about Z-scores or eigenvectors.&lt;/p>
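&lt;p>A quick numerical check of this computation:&lt;/p>
&lt;pre>&lt;code class="language-python">pc1_01 = 1.2720                       # Country_01 score from Step 5
pc1_min, pc1_max = -2.3892, 2.3734    # sample bounds
hi_01 = (pc1_01 - pc1_min) / (pc1_max - pc1_min)
print(round(hi_01, 4))                # 0.7687
&lt;/code>&lt;/pre>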
&lt;p>&lt;strong>The Intuition:&lt;/strong> Because we standardized the data earlier, our $PC1$ scores have negative numbers (e.g., $-2.39$). You cannot easily publish a report saying a country has a health score of negative 2.39 &amp;mdash; it confuses policymakers and the public.&lt;/p>
&lt;p>&lt;strong>The Necessity:&lt;/strong> Normalization forces the absolute lowest scoring country to equal $0$, and the highest scoring country to equal $1$. Everyone else scales proportionally in between. The result is a highly rigorous, purely data-driven index that is instantly understandable.&lt;/p>
&lt;pre>&lt;code class="language-python">pc1_min = df[&amp;quot;pc1&amp;quot;].min()
pc1_max = df[&amp;quot;pc1&amp;quot;].max()
df[&amp;quot;health_index&amp;quot;] = (df[&amp;quot;pc1&amp;quot;] - pc1_min) / (pc1_max - pc1_min)
print(f&amp;quot;PC1 range: [{pc1_min:.4f}, {pc1_max:.4f}]&amp;quot;)
print(f&amp;quot;\nHealth Index statistics:&amp;quot;)
print(f&amp;quot; Mean: {df['health_index'].mean():.4f}&amp;quot;)
print(f&amp;quot; Median: {df['health_index'].median():.4f}&amp;quot;)
print(f&amp;quot; Std: {df['health_index'].std(ddof=0):.4f}&amp;quot;)
print(f&amp;quot;\nTop 10 countries:&amp;quot;)
print(df.nlargest(10, &amp;quot;health_index&amp;quot;)[
[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;health_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nBottom 10 countries:&amp;quot;)
print(df.nsmallest(10, &amp;quot;health_index&amp;quot;)[
[&amp;quot;country&amp;quot;, &amp;quot;life_exp&amp;quot;, &amp;quot;infant_mort&amp;quot;, &amp;quot;health_index&amp;quot;]
].to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">PC1 range: [-2.3892, 2.3734]
Health Index statistics:
Mean: 0.5017
Median: 0.5182
Std: 0.2939
Top 10 countries:
country life_exp infant_mort health_index
Country_12 84.7 3.8 1.000000
Country_32 83.4 4.0 0.975454
Country_23 81.6 3.5 0.948950
Country_06 83.6 8.6 0.934637
Country_03 81.3 11.6 0.865729
Country_45 79.1 7.8 0.864043
Country_24 79.5 11.4 0.836335
Country_19 79.1 15.2 0.792782
Country_14 79.0 15.5 0.788153
Country_46 79.0 16.5 0.778523
Bottom 10 countries:
country life_exp infant_mort health_index
Country_05 54.9 53.8 0.000000
Country_28 57.7 58.7 0.001533
Country_29 58.8 55.9 0.047636
Country_18 58.5 54.9 0.052046
Country_50 57.2 51.9 0.058316
Country_37 56.5 48.4 0.079840
Country_09 58.3 50.1 0.094789
Country_26 61.8 53.4 0.123910
Country_36 59.9 48.9 0.134184
Country_39 60.9 48.5 0.155436
&lt;/code>&lt;/pre>
&lt;p>The Health Index has a mean of 0.50 and a median of 0.52, indicating a roughly symmetric distribution. Country_12 leads with a perfect score of 1.00 (life expectancy 84.7 years, infant mortality 3.8), while Country_05 anchors the bottom at 0.00 (54.9 years, 53.8 per 1,000). The gap between the top 10 (all above 0.78) and the bottom 10 (all below 0.16) reveals a stark divide in health outcomes across the sample.&lt;/p>
&lt;pre>&lt;code class="language-python">df_sorted_hi = df.sort_values(&amp;quot;health_index&amp;quot;, ascending=True)
fig, ax = plt.subplots(figsize=(10, 14))
fig.patch.set_linewidth(0)
# Gradient from warm orange (low) to teal (high)
cmap_colors = []
for val in df_sorted_hi[&amp;quot;health_index&amp;quot;]:
r = int(0xd9 + val * (0x00 - 0xd9))
g = int(0x77 + val * (0xd4 - 0x77))
b = int(0x57 + val * (0xc8 - 0x57))
cmap_colors.append(f&amp;quot;#{r:02x}{g:02x}{b:02x}&amp;quot;)
ax.barh(range(len(df_sorted_hi)), df_sorted_hi[&amp;quot;health_index&amp;quot;],
color=cmap_colors, edgecolor=DARK_NAVY, height=0.7)
ax.set_yticks(range(len(df_sorted_hi)))
ax.set_yticklabels(df_sorted_hi[&amp;quot;country&amp;quot;], fontsize=8)
ax.set_xlabel(&amp;quot;Health Index (0 = worst, 1 = best)&amp;quot;)
ax.set_title(&amp;quot;Health Index: countries ranked from 0 (worst) to 1 (best)&amp;quot;)
ax.set_xlim(0, 1.05)
plt.savefig(&amp;quot;pca_health_index.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_health_index.png" alt="Health Index bar chart for 50 countries with gradient coloring from orange to teal.">&lt;/p>
&lt;p>The gradient-colored bar chart makes the health divide visually immediate. Countries cluster into three rough groups: a high-performing cluster above 0.75 (teal), a middle group between 0.25 and 0.75, and a struggling cluster below 0.25 (warm orange). The bottom 10 countries all have Health Index values below 0.16, suggesting systemic health challenges that span both longevity and infant survival. Country_05 and Country_28 appear to have no bar at all &amp;mdash; this is not missing data. Country_05 has a Health Index of exactly 0.00 because it is the worst performer in the sample and Min-Max normalization maps the minimum to zero by definition. Country_28 has an index of just 0.0015, so close to zero that its bar is invisible at this scale. Both countries have low life expectancy (54.9 and 57.7 years) combined with high infant mortality (53.8 and 58.7 per 1,000), placing them at the extreme low end of the health spectrum.&lt;/p>
&lt;h2 id="12-replicating-the-analysis-with-scikit-learn">12. Replicating the analysis with scikit-learn&lt;/h2>
&lt;p>Now that we understand every step, scikit-learn can do the entire pipeline &amp;mdash; from raw CSV to final Health Index &amp;mdash; in a single, compact script. This section presents the automated pipeline and then compares its results against our manual implementation.&lt;/p>
&lt;h3 id="121-a-pca-pipeline-with-scikit-learn">12.1 A PCA pipeline with scikit-learn&lt;/h3>
&lt;p>The code block below is designed to be &lt;strong>reusable&lt;/strong>: by changing only the CSV file path, the column names, and the list of negative indicators, you can apply this same pipeline to any dataset.&lt;/p>
&lt;pre>&lt;code class="language-python"># ── Full PCA pipeline with scikit-learn ──────────────────────────
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# ── Configuration (change these for your own dataset) ────────────
CSV_FILE = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca/health_data.csv&amp;quot;
ID_COL = &amp;quot;country&amp;quot; # Row identifier column
POSITIVE_COLS = [&amp;quot;life_exp&amp;quot;] # Higher = better
NEGATIVE_COLS = [&amp;quot;infant_mort&amp;quot;] # Higher = worse (will be flipped)
# Step 1: Load raw data from CSV
df_sk = pd.read_csv(CSV_FILE)
print(f&amp;quot;Loaded: {df_sk.shape[0]} rows, {df_sk.shape[1]} columns&amp;quot;)
# Step 2: Polarity adjustment — flip negative indicators
# Multiplying by -1 so that &amp;quot;higher = better&amp;quot; for all variables
for col in NEGATIVE_COLS:
df_sk[col + &amp;quot;_adj&amp;quot;] = -1 * df_sk[col]
adj_cols = POSITIVE_COLS + [col + &amp;quot;_adj&amp;quot; for col in NEGATIVE_COLS]
# Step 3: Standardization — Z-scores (mean=0, std=1)
# StandardScaler centers and scales each column independently
scaler = StandardScaler()
Z_sk = scaler.fit_transform(df_sk[adj_cols])
# Step 4: PCA — fit to find eigenvectors and eigenvalues
# n_components=1 because we want a single composite index
pca_sk = PCA(n_components=1)
pca_sk.fit(Z_sk)
# Step 5: Transform — project data onto the first principal component
df_sk[&amp;quot;pc1&amp;quot;] = pca_sk.transform(Z_sk)[:, 0]
# Step 6: Normalization — Min-Max scaling to 0-1
df_sk[&amp;quot;pc1_index&amp;quot;] = (
(df_sk[&amp;quot;pc1&amp;quot;] - df_sk[&amp;quot;pc1&amp;quot;].min())
/ (df_sk[&amp;quot;pc1&amp;quot;].max() - df_sk[&amp;quot;pc1&amp;quot;].min())
)
# Export results
df_sk.to_csv(&amp;quot;pc1_index_results.csv&amp;quot;, index=False)
# Summary
indicator_cols = POSITIVE_COLS + NEGATIVE_COLS
print(f&amp;quot;\nPC1 weights: {pca_sk.components_[0].round(4)}&amp;quot;)
print(f&amp;quot;Variance explained: {pca_sk.explained_variance_ratio_.round(4)}&amp;quot;)
print(f&amp;quot;\nTop 5:&amp;quot;)
print(df_sk.nlargest(5, &amp;quot;pc1_index&amp;quot;)[
[ID_COL] + indicator_cols + [&amp;quot;pc1_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nBottom 5:&amp;quot;)
print(df_sk.nsmallest(5, &amp;quot;pc1_index&amp;quot;)[
[ID_COL] + indicator_cols + [&amp;quot;pc1_index&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nSaved: pc1_index_results.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 50 rows, 3 columns
PC1 weights: [0.7071 0.7071]
Variance explained: [0.9797]
Top 5:
country life_exp infant_mort pc1_index
Country_12 84.7 3.8 1.000000
Country_32 83.4 4.0 0.975454
Country_23 81.6 3.5 0.948950
Country_06 83.6 8.6 0.934637
Country_03 81.3 11.6 0.865729
Bottom 5:
country life_exp infant_mort pc1_index
Country_05 54.9 53.8 0.000000
Country_28 57.7 58.7 0.001533
Country_29 58.8 55.9 0.047636
Country_18 58.5 54.9 0.052046
Country_50 57.2 51.9 0.058316
Saved: pc1_index_results.csv
&lt;/code>&lt;/pre>
&lt;p>The entire six-step manual pipeline collapses into roughly 15 lines of sklearn code. The configuration block at the top (&lt;code>CSV_FILE&lt;/code>, &lt;code>POSITIVE_COLS&lt;/code>, &lt;code>NEGATIVE_COLS&lt;/code>) makes the script reusable: to build a different composite index, simply point it to a new CSV and specify which columns are positive and which are negative. The rankings match our manual results exactly &amp;mdash; Country_12 leads at 1.00 and Country_05 anchors the bottom at 0.00.&lt;/p>
&lt;h3 id="122-manual-vs-scikit-learn-comparison">12.2 Manual vs. scikit-learn comparison&lt;/h3>
&lt;p>Now that we have both sets of PC1 scores &amp;mdash; one from our six manual steps, one from the sklearn pipeline &amp;mdash; we can compare them directly. One subtlety: eigenvectors are defined up to a sign flip, so sklearn may return scores with the opposite sign. We check for this and flip if needed.&lt;/p>
&lt;pre>&lt;code class="language-python">sklearn_pc1 = df_sk[&amp;quot;pc1&amp;quot;].values
# Handle sign ambiguity: eigenvectors can point in either direction
sign_corr = np.corrcoef(df[&amp;quot;pc1&amp;quot;], sklearn_pc1)[0, 1]
if sign_corr &amp;lt; 0:
sklearn_pc1 = -sklearn_pc1
print(&amp;quot;Note: sklearn returned opposite sign (normal). Flipped for comparison.&amp;quot;)
max_diff = np.max(np.abs(sklearn_pc1 - df[&amp;quot;pc1&amp;quot;].values))
corr_val = np.corrcoef(df[&amp;quot;pc1&amp;quot;], sklearn_pc1)[0, 1]
print(f&amp;quot;Max absolute difference in PC1 scores: {max_diff:.2e}&amp;quot;)
print(f&amp;quot;Correlation between manual and sklearn: {corr_val:.6f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Max absolute difference in PC1 scores: 1.33e-15
Correlation between manual and sklearn: 1.000000
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(6, 6))
fig.patch.set_linewidth(0)
ax.scatter(df[&amp;quot;pc1&amp;quot;], sklearn_pc1, color=STEEL_BLUE, edgecolors=DARK_NAVY,
s=60, zorder=3)
lim_min = min(df[&amp;quot;pc1&amp;quot;].min(), sklearn_pc1.min()) - 0.2
lim_max = max(df[&amp;quot;pc1&amp;quot;].max(), sklearn_pc1.max()) + 0.2
ax.plot([lim_min, lim_max], [lim_min, lim_max], color=WARM_ORANGE,
linewidth=2, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;Perfect agreement&amp;quot;, zorder=2)
ax.set_xlabel(&amp;quot;Manual PC1 Score&amp;quot;)
ax.set_ylabel(&amp;quot;scikit-learn PC1 Score&amp;quot;)
ax.set_title(&amp;quot;Manual vs. scikit-learn PCA: verification&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca_sklearn_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca_sklearn_comparison.png" alt="Scatter plot showing perfect agreement between manual and sklearn PCA scores.">&lt;/p>
&lt;p>The maximum absolute difference between manual and sklearn PC1 scores is $1.33 \times 10^{-15}$ &amp;mdash; essentially machine-precision zero. The correlation is 1.000000, confirming perfect agreement. All 50 points fall exactly on the dashed 45-degree line. This validates that our step-by-step manual implementation produces identical results to the optimized library.&lt;/p>
&lt;h2 id="13-summary-results">13. Summary results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Step&lt;/th>
&lt;th>Input&lt;/th>
&lt;th>Output&lt;/th>
&lt;th>Key Result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Polarity&lt;/td>
&lt;td>IM (raw)&lt;/td>
&lt;td>IM* = -IM&lt;/td>
&lt;td>Correlation: -0.96 to +0.96&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Standardization&lt;/td>
&lt;td>LE, IM*&lt;/td>
&lt;td>Z_LE, Z_IM&lt;/td>
&lt;td>Mean=0, SD=1 for both&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Covariance&lt;/td>
&lt;td>Z matrix&lt;/td>
&lt;td>2x2 matrix&lt;/td>
&lt;td>Off-diagonal r = 0.96&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Eigen-decomposition&lt;/td>
&lt;td>Cov matrix&lt;/td>
&lt;td>eigenvalues, eigenvectors&lt;/td>
&lt;td>PC1 captures 98.0%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Scoring&lt;/td>
&lt;td>Z * eigvec&lt;/td>
&lt;td>PC1 scores&lt;/td>
&lt;td>Range: [-2.39, 2.37]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Normalization&lt;/td>
&lt;td>PC1&lt;/td>
&lt;td>Health Index&lt;/td>
&lt;td>Range: [0.00, 1.00]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="14-discussion">14. Discussion&lt;/h2>
&lt;p>The answer to our opening question is clear: &lt;strong>yes, PCA successfully reduces two correlated health indicators into a single Health Index.&lt;/strong> With a correlation of 0.96 between life expectancy and adjusted infant mortality, PC1 captures 97.97% of all variation &amp;mdash; meaning our single-number index retains virtually all the information from both original variables.&lt;/p>
&lt;p>The approximately equal eigenvector weights ($w_1 = w_2 = 0.707$) reveal that PCA produced an index nearly identical to a simple average of Z-scores. This is not always the case. With less correlated indicators or more than two variables, the weights would diverge, giving more influence to the indicators that contribute unique information. In high-dimensional settings with 15 or 20 indicators, PCA&amp;rsquo;s ability to discover these unequal weights becomes far more valuable than any manual weighting scheme. For an applied example, &lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez and Gonzales (2021)&lt;/a> use PCA to classify 339 Bolivian municipalities according to human capital constraints &amp;mdash; combining malnutrition, language barriers, dropout rates, and education inequality into composite indices that reveal distinct geographic clusters of deprivation.&lt;/p>
&lt;p>A policymaker looking at these results could immediately identify that the bottom 10 countries (Health Index below 0.16) suffer from both low life expectancy and high infant mortality, indicating systemic health system weaknesses rather than isolated problems. These countries would be natural candidates for comprehensive health investment packages rather than single-issue interventions.&lt;/p>
&lt;p>It is crucial to understand that this index is a measure of &lt;strong>relative performance&lt;/strong> within the specific sample. A score of 1.0 does not mean a country has achieved perfect health &amp;mdash; it simply means that country is the best performer among the 50 analyzed. Adding or removing countries from the sample changes every score because both the standardization parameters (mean and standard deviation) and the Min-Max bounds depend on which countries are included. If a new country with extremely high life expectancy joins the sample, every existing country&amp;rsquo;s Z-scores shift downward, altering all PC1 scores and the final index.&lt;/p>
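&lt;p>This sample dependence can be demonstrated with a small simulation (the synthetic data and the &lt;code>health_index&lt;/code> helper below are illustrative sketches, not this post&amp;rsquo;s pipeline objects). Appending a single extreme performer and re-running the full pipeline changes the score of every original observation:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
def health_index(raw):
    # PC1-based Min-Max index for an (n, 2) matrix, mirroring Steps 2-6
    z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(z, rowvar=False))
    w = vecs[:, -1]                   # leading eigenvector
    if w.sum() &amp;lt; 0:
        w = -w                        # fix the arbitrary sign
    pc1 = z @ w
    return (pc1 - pc1.min()) / (pc1.max() - pc1.min())
rng = np.random.default_rng(42)
base = rng.standard_normal(10)
data = np.column_stack([base, 0.95 * base + 0.3 * rng.standard_normal(10)])
idx_before = health_index(data)
# Append one extreme high performer and recompute on the larger sample
idx_after = health_index(np.vstack([data, [4.0, 4.0]]))[:10]
print(np.abs(idx_before - idx_after).max())   # strictly positive: all scores moved
&lt;/code>&lt;/pre>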
&lt;p>Using a PCA-based health index to compare against a PCA-based education index is also problematic. A health index score of 0.77 and an education index score of 0.77 may look equivalent, but they are not directly comparable. Each index has its own eigenvectors, eigenvalues, and standardization parameters derived from entirely different variables with different correlation structures. The numbers live on different scales &amp;mdash; 0.77 in health means &amp;ldquo;77% of the way between the worst and best health performers,&amp;rdquo; while 0.77 in education means the same relative position but within a completely different set of indicators. Combining or averaging PCA indices across domains requires additional methodological choices (such as those used in the UNDP Human Development Index).&lt;/p>
&lt;p>Using our PCA-based health index to study changes over time introduces further challenges. If you compute the index separately for each year, both the eigenvector weights and the Z-score parameters (means, standard deviations) can shift from year to year, making scores from different periods non-comparable. A country&amp;rsquo;s index could improve not because its health system got better, but because the sample&amp;rsquo;s average got worse. One potential solution is a &lt;strong>pooled PCA approach&lt;/strong> &amp;mdash; standardizing across all years simultaneously and computing a single set of eigenvectors from the pooled covariance matrix. However, this requires assuming that the correlation structure between indicators remains constant over time, which may not hold if the relationship between life expectancy and infant mortality evolves as countries develop. For an example of PCA applied to social progress indicators across countries and multiple years, see &lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, Picazo-Tadeo, and Rios (2023)&lt;/a>.&lt;/p>
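&lt;p>A pooled pipeline can be sketched in a few lines (synthetic two-year panel; the variable names and the simulated &amp;ldquo;improvement&amp;rdquo; are illustrative assumptions). The key is that standardization and the eigen-decomposition each happen once, on the stacked data, so both years share one set of weights and one yardstick:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
rng = np.random.default_rng(7)
n = 50
base = rng.standard_normal(n)
year1 = np.column_stack([base, 0.9 * base + 0.3 * rng.standard_normal(n)])
year2 = year1 + 0.1 + 0.05 * rng.standard_normal((n, 2))  # mild improvement
pool = np.vstack([year1, year2])           # stack all years into one matrix
z = (pool - pool.mean(axis=0)) / pool.std(axis=0)         # standardize ONCE
vals, vecs = np.linalg.eigh(np.cov(z, rowvar=False))
w = vecs[:, -1]
if w.sum() &amp;lt; 0:
    w = -w                                 # fix the arbitrary eigenvector sign
pc1 = z @ w
# Year means are now directly comparable on a common scale
print(round(pc1[:n].mean(), 3), round(pc1[n:].mean(), 3))
&lt;/code>&lt;/pre>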
&lt;h2 id="15-summary-and-next-steps">15. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> PCA is most effective when indicators are highly correlated. With $r = 0.96$, PC1 captured 98.0% of variance. With weakly correlated indicators, PCA would require multiple components, reducing the simplicity advantage. Always check the correlation structure before choosing PCA for index construction.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> The equal eigenvector weights (both 0.707) mean PCA produced an index nearly identical to a simple Z-score average in this two-variable case. The real power of PCA emerges when variables contribute unequally and you need the algorithm to discover the optimal weighting.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> With only two variables, PCA offers modest dimensionality reduction (2 to 1). The technique&amp;rsquo;s full value emerges with many indicators (e.g., 15 SDG variables reduced to 3&amp;ndash;4 components). Also, the Health Index is relative &amp;mdash; adding or removing countries changes every score because of the Min-Max normalization.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend to multi-variable PCA with real data (e.g., SDG indicators covering education, income, and health). Explore how many components to retain using the scree plot and cumulative variance threshold (commonly 80&amp;ndash;90%), and consider factor analysis for latent variable interpretation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations of this analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data is simulated. Real WHO data would include outliers, missing values, and non-normal distributions that require additional preprocessing.&lt;/li>
&lt;li>Two-variable PCA is a pedagogical simplification. Real composite indices (like the UNDP Human Development Index) use more indicators and often apply domain-specific weighting decisions alongside statistical methods.&lt;/li>
&lt;li>Min-Max normalization is sensitive to outliers. A single extreme country can compress the range for everyone else. Robust alternatives include percentile ranking or winsorization.&lt;/li>
&lt;/ul>
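&lt;p>The outlier sensitivity of Min-Max scaling, and a percentile-rank alternative, can be illustrated with toy scores in plain NumPy (the &lt;code>pct_rank&lt;/code> helper below is an illustrative sketch that assumes no tied scores):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
scores = np.array([-2.4, -1.0, 0.0, 1.2, 2.4])
with_outlier = np.append(scores, 20.0)   # one extreme country joins the sample
def minmax(x):
    return (x - x.min()) / (x.max() - x.min())
def pct_rank(x):
    return x.argsort().argsort() / (len(x) - 1)   # 0 = worst, 1 = best
print(minmax(with_outlier)[:5].round(3))   # all squeezed below 0.22
print(pct_rank(with_outlier)[:5])          # evenly spaced: 0, 0.2, ..., 0.8
&lt;/code>&lt;/pre>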
&lt;h2 id="16-exercises">16. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Add a third indicator.&lt;/strong> Extend the data generating process with a third variable (e.g., &lt;code>healthcare_spending = 200 + 800 * base_health + noise&lt;/code>). Run the same pipeline with three variables. How does the variance explained by PC1 change? Do the eigenvector weights shift from equal?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test outlier sensitivity.&lt;/strong> Modify one country to have extreme values (e.g., &lt;code>life_exp = 40&lt;/code>, &lt;code>infant_mort = 100&lt;/code>). How does Min-Max normalization affect the rankings of other countries? Try replacing Min-Max with percentile-based normalization and compare.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Apply to real data.&lt;/strong> Download Life Expectancy and Infant Mortality data from the &lt;a href="https://www.who.int/data/gho" target="_blank" rel="noopener">WHO Global Health Observatory&lt;/a>. Apply the six-step pipeline to real countries. Compare your PCA-based Health Index ranking with the UNDP Human Development Index and discuss any discrepancies.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="17-references">17. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1098/rsta.2015.0202" target="_blank" rel="noopener">Jolliffe, I. T. and Cadima, J. (2016). Principal Component Analysis: A Review and Recent Developments. &lt;em>Philosophical Transactions of the Royal Society A&lt;/em>, 374(2065).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/14786440109462720" target="_blank" rel="noopener">Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. &lt;em>The London, Edinburgh, and Dublin Philosophical Magazine&lt;/em>, 2(11), 559&amp;ndash;572.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; PCA Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; StandardScaler Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hdr.undp.org/data-center/human-development-index" target="_blank" rel="noopener">UNDP (2024). Human Development Index &amp;ndash; Technical Notes.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.who.int/data/gho" target="_blank" rel="noopener">WHO &amp;ndash; Global Health Observatory Data Repository&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez, C. and Gonzales, E. (2021). Human Capital Constraints, Spatial Dependence, and Regionalization in Bolivia. &lt;em>Economia&lt;/em>, 44(87).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/_6UjscCJrYE" target="_blank" rel="noopener">Principal Component Analysis (PCA) Explained Simply (YouTube)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/nEvKduLXFvk" target="_blank" rel="noopener">Visualizing Principal Component Analysis (PCA) (YouTube)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://numiqo.com/lab/pca" target="_blank" rel="noopener">Numiqo &amp;ndash; PCA Interactive Lab&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, J., Picazo-Tadeo, A. J., and Rios, V. (2023). Social Progress around the World: Trends and Convergence. &lt;em>Oxford Economic Papers&lt;/em>, 75(2), 281&amp;ndash;306.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, this post may still contain errors, so caution is needed before applying its contents to real research projects.&lt;/p></description></item><item><title>Pooled PCA for Building Development Indicators Across Time</title><link>https://carlos-mendez.org/post/python_pca2/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_pca2/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/python_pca/">Introduction to PCA Analysis for Building Development Indicators&lt;/a>, we built a Health Index from two indicators using a six-step pipeline. That tutorial&amp;rsquo;s Discussion section raised a critical warning:&lt;/p>
&lt;blockquote>
&lt;p>If you compute the index separately for each year, both the eigenvector weights and the Z-score parameters (means, standard deviations) can shift from year to year, making scores from different periods non-comparable. A country&amp;rsquo;s index could improve not because its health system got better, but because the sample&amp;rsquo;s average got worse.&lt;/p>
&lt;/blockquote>
&lt;p>This sequel addresses that problem head-on with real data. We use the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Subnational Human Development Index&lt;/a> from the Global Data Lab, which provides Education, Health, and Income sub-indices for 153 sub-national regions across 12 South American countries in 2013 and 2019. When we track development over time, we need the yardstick to remain fixed. If the ruler itself changes between measurements, we cannot tell whether the object grew or the ruler shrank. &lt;strong>Pooled PCA&lt;/strong> solves this by standardizing and computing weights from all periods simultaneously, producing a single fixed yardstick that makes scores directly comparable across time.&lt;/p>
&lt;p>The real data reveals a nuanced story: education and health improved on average across South America between 2013 and 2019, but income &lt;strong>declined&lt;/strong>. This mixed signal makes the choice between pooled and per-period PCA consequential &amp;mdash; the two methods disagree on the direction of change for 16 out of 153 regions.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why per-period PCA produces non-comparable scores across time&lt;/li>
&lt;li>Implement pooled standardization using cross-period means and standard deviations&lt;/li>
&lt;li>Compute pooled eigenvectors from stacked data to obtain stable weights&lt;/li>
&lt;li>Apply pooled normalization with cross-period min/max bounds&lt;/li>
&lt;li>Contrast pooled vs per-period PCA using rank stability and direction-of-change analysis&lt;/li>
&lt;/ul>
&lt;h2 id="2-the-pooled-pca-pipeline">2. The pooled PCA pipeline&lt;/h2>
&lt;p>The pooled pipeline extends the &lt;a href="https://carlos-mendez.org/post/python_pca/#2-the-pca-pipeline">six-step pipeline from the previous tutorial&lt;/a> by adding a stacking step at the beginning and replacing per-period parameters with pooled parameters at the standardization, covariance, and normalization steps.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
S[&amp;quot;&amp;lt;b&amp;gt;Step 0&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Stack&amp;lt;br/&amp;gt;Periods&amp;quot;] --&amp;gt; A[&amp;quot;&amp;lt;b&amp;gt;Step 1&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Polarity&amp;lt;br/&amp;gt;Adjustment&amp;quot;]
A --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Step 2&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Standardization&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;Step 3&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Covariance&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;Step 4&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled Eigen-&amp;lt;br/&amp;gt;Decomposition&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Step 5&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scoring&amp;lt;br/&amp;gt;(PC1)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Step 6&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled&amp;lt;br/&amp;gt;Normalization&amp;quot;]
style S fill:#141413,stroke:#6a9bcc,color:#fff
style A fill:#d97757,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#6a9bcc,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#fff
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The key insight is that Steps 2, 3, and 6 &amp;mdash; labeled &amp;ldquo;Pooled&amp;rdquo; &amp;mdash; compute their parameters from the stacked data (all periods combined) rather than from each period separately. This single change ensures that a region&amp;rsquo;s Z-score in 2013 is measured against the same baseline as its Z-score in 2019, that the eigenvector weights are fixed across time, and that the 0&amp;ndash;1 normalization uses a common scale.&lt;/p>
&lt;h2 id="3-setup-and-imports">3. Setup and imports&lt;/h2>
&lt;p>The analysis uses the same libraries as the &lt;a href="https://carlos-mendez.org/post/python_pca/">previous tutorial&lt;/a>: NumPy for linear algebra, pandas for data management, matplotlib for visualization, and scikit-learn for verification.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Reproducibility
RANDOM_SEED = 42
# Site color palette
STEEL_BLUE = &amp;quot;#6a9bcc&amp;quot;
WARM_ORANGE = &amp;quot;#d97757&amp;quot;
NEAR_BLACK = &amp;quot;#141413&amp;quot;
TEAL = &amp;quot;#00d4c8&amp;quot;
&lt;/code>&lt;/pre>
&lt;details>
&lt;summary>Dark theme figure styling (click to expand)&lt;/summary>
&lt;pre>&lt;code class="language-python"># Dark theme palette (consistent with site navbar/dark sections)
DARK_NAVY = &amp;quot;#0f1729&amp;quot;
GRID_LINE = &amp;quot;#1f2b5e&amp;quot;
LIGHT_TEXT = &amp;quot;#c8d0e0&amp;quot;
WHITE_TEXT = &amp;quot;#e8ecf2&amp;quot;
# Plot defaults — minimal, spine-free, dark background
plt.rcParams.update({
    &amp;quot;figure.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;axes.linewidth&amp;quot;: 0,
    &amp;quot;axes.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;axes.titlecolor&amp;quot;: WHITE_TEXT,
    &amp;quot;axes.spines.top&amp;quot;: False,
    &amp;quot;axes.spines.right&amp;quot;: False,
    &amp;quot;axes.spines.left&amp;quot;: False,
    &amp;quot;axes.spines.bottom&amp;quot;: False,
    &amp;quot;axes.grid&amp;quot;: True,
    &amp;quot;grid.color&amp;quot;: GRID_LINE,
    &amp;quot;grid.linewidth&amp;quot;: 0.6,
    &amp;quot;grid.alpha&amp;quot;: 0.8,
    &amp;quot;xtick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;ytick.color&amp;quot;: LIGHT_TEXT,
    &amp;quot;xtick.major.size&amp;quot;: 0,
    &amp;quot;ytick.major.size&amp;quot;: 0,
    &amp;quot;text.color&amp;quot;: WHITE_TEXT,
    &amp;quot;font.size&amp;quot;: 12,
    &amp;quot;legend.frameon&amp;quot;: False,
    &amp;quot;legend.fontsize&amp;quot;: 11,
    &amp;quot;legend.labelcolor&amp;quot;: LIGHT_TEXT,
    &amp;quot;figure.edgecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.facecolor&amp;quot;: DARK_NAVY,
    &amp;quot;savefig.edgecolor&amp;quot;: DARK_NAVY,
})
&lt;/code>&lt;/pre>
&lt;/details>
&lt;h2 id="4-loading-the-subnational-hdi-data">4. Loading the Subnational HDI data&lt;/h2>
&lt;p>The dataset is a subsample from the &lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Subnational Human Development Database&lt;/a> constructed by &lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits and Permanyer (2019)&lt;/a>, which provides sub-national development indicators for countries worldwide. We use the South American subset with three HDI component indices for 2013 and 2019. The original data is in wide format (one row per region, with year-specific columns), so we reshape it into a long panel format suitable for pooled PCA.&lt;/p>
&lt;pre>&lt;code class="language-python">DATA_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data.csv&amp;quot;
raw = pd.read_csv(DATA_URL)
print(f&amp;quot;Raw dataset: {raw.shape[0]} regions, {raw.shape[1]} columns&amp;quot;)
print(f&amp;quot;Countries: {raw['country'].nunique()}&amp;quot;)
# Reshape wide → long
rows = []
for _, r in raw.iterrows():
    for year in [2013, 2019]:
        rows.append({
            &amp;quot;GDLcode&amp;quot;: r[&amp;quot;GDLcode&amp;quot;],
            &amp;quot;region&amp;quot;: r[&amp;quot;region&amp;quot;],
            &amp;quot;country&amp;quot;: r[&amp;quot;country&amp;quot;],
            &amp;quot;period&amp;quot;: f&amp;quot;Y{year}&amp;quot;,
            &amp;quot;education&amp;quot;: round(r[f&amp;quot;edindex{year}&amp;quot;], 4),
            &amp;quot;health&amp;quot;: round(r[f&amp;quot;healthindex{year}&amp;quot;], 4),
            &amp;quot;income&amp;quot;: round(r[f&amp;quot;incindex{year}&amp;quot;], 4),
            &amp;quot;shdi_official&amp;quot;: round(r[f&amp;quot;shdi{year}&amp;quot;], 4),
            &amp;quot;pop&amp;quot;: round(r[f&amp;quot;pop{year}&amp;quot;], 1),
        })
df = pd.DataFrame(rows)
&lt;/code>&lt;/pre>
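&lt;p>The explicit loop keeps every column mapping visible. For larger panels, the same wide-to-long reshape can be done in one call with pandas&amp;rsquo; &lt;code>wide_to_long&lt;/code>; the sketch below uses a tiny hypothetical frame that mimics the GDL column pattern (&lt;code>edindex2013&lt;/code>, &lt;code>edindex2019&lt;/code>, and so on):&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

# Hypothetical wide frame with year-suffixed columns, as in the GDL download
wide = pd.DataFrame({
    &amp;quot;GDLcode&amp;quot;: [&amp;quot;ARGr101&amp;quot;],
    &amp;quot;edindex2013&amp;quot;: [0.926], &amp;quot;edindex2019&amp;quot;: [0.946],
    &amp;quot;healthindex2013&amp;quot;: [0.858], &amp;quot;healthindex2019&amp;quot;: [0.872],
})

# Stub names are the column prefixes; the numeric suffix becomes the 'year' column
long = (pd.wide_to_long(wide, stubnames=[&amp;quot;edindex&amp;quot;, &amp;quot;healthindex&amp;quot;],
                        i=&amp;quot;GDLcode&amp;quot;, j=&amp;quot;year&amp;quot;)
        .reset_index()
        .rename(columns={&amp;quot;edindex&amp;quot;: &amp;quot;education&amp;quot;, &amp;quot;healthindex&amp;quot;: &amp;quot;health&amp;quot;}))
print(long)  # one row per (region, year) pair
&lt;/code>&lt;/pre>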
&lt;p>To make regions instantly identifiable in figures and tables, we create a &lt;code>region_country&lt;/code> label that combines a shortened region name with a three-letter country abbreviation. This avoids ambiguity &amp;mdash; for example, &amp;ldquo;Cordoba&amp;rdquo; exists in both Argentina and Colombia.&lt;/p>
&lt;pre>&lt;code class="language-python"># Create informative label: shortened region + country abbreviation
COUNTRY_ABBR = {
&amp;quot;Argentina&amp;quot;: &amp;quot;ARG&amp;quot;, &amp;quot;Bolivia&amp;quot;: &amp;quot;BOL&amp;quot;, &amp;quot;Brazil&amp;quot;: &amp;quot;BRA&amp;quot;,
&amp;quot;Chile&amp;quot;: &amp;quot;CHL&amp;quot;, &amp;quot;Colombia&amp;quot;: &amp;quot;COL&amp;quot;, &amp;quot;Ecuador&amp;quot;: &amp;quot;ECU&amp;quot;,
&amp;quot;Guyana&amp;quot;: &amp;quot;GUY&amp;quot;, &amp;quot;Paraguay&amp;quot;: &amp;quot;PRY&amp;quot;, &amp;quot;Peru&amp;quot;: &amp;quot;PER&amp;quot;,
&amp;quot;Suriname&amp;quot;: &amp;quot;SUR&amp;quot;, &amp;quot;Uruguay&amp;quot;: &amp;quot;URY&amp;quot;, &amp;quot;Venezuela&amp;quot;: &amp;quot;VEN&amp;quot;,
}
def make_label(region, country, max_len=25):
    &amp;quot;&amp;quot;&amp;quot;Shorten region name and append country abbreviation.&amp;quot;&amp;quot;&amp;quot;
    abbr = COUNTRY_ABBR.get(country, country[:3].upper())
    short = region[:max_len].rstrip(&amp;quot;, &amp;quot;) if len(region) &amp;gt; max_len else region
    return f&amp;quot;{short} ({abbr})&amp;quot;
df[&amp;quot;region_country&amp;quot;] = df.apply(
    lambda r: make_label(r[&amp;quot;region&amp;quot;], r[&amp;quot;country&amp;quot;]), axis=1
)
df.to_csv(&amp;quot;data_long.csv&amp;quot;, index=False)
INDICATORS = [&amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]
print(f&amp;quot;\nPanel dataset: {df.shape[0]} rows (= {raw.shape[0]} regions x 2 periods)&amp;quot;)
print(f&amp;quot;\nFirst 6 rows:&amp;quot;)
print(df[[&amp;quot;region_country&amp;quot;, &amp;quot;period&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]].head(6).to_string(index=False))
print(f&amp;quot;\nRegions per country:&amp;quot;)
print(raw[&amp;quot;country&amp;quot;].value_counts().sort_index().to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel dataset: 306 rows (= 153 regions x 2 periods)
First 6 rows:
region_country period education health income
City of Buenos Aires (ARG) Y2013 0.926 0.858 0.850
City of Buenos Aires (ARG) Y2019 0.946 0.872 0.832
Rest of Buenos Aires (ARG) Y2013 0.797 0.858 0.820
Rest of Buenos Aires (ARG) Y2019 0.830 0.872 0.802
Catamarca, La Rioja, San (ARG) Y2013 0.822 0.858 0.828
Catamarca, La Rioja, San (ARG) Y2019 0.856 0.872 0.810
Regions per country:
country
Argentina 11
Bolivia 9
Brazil 27
Chile 13
Colombia 33
Ecuador 3
Guyana 10
Paraguay 5
Peru 6
Suriname 5
Uruguay 7
Venezuela 24
&lt;/code>&lt;/pre>
&lt;p>The panel contains 306 rows (153 regions $\times$ 2 periods) covering 12 South American countries. Colombia contributes the most regions (33), followed by Brazil (27) and Venezuela (24), while Ecuador has only 3. The &lt;code>region_country&lt;/code> label &amp;mdash; such as &amp;ldquo;City of Buenos Aires (ARG)&amp;rdquo; or &amp;ldquo;Potosi (BOL)&amp;rdquo; &amp;mdash; will make every region immediately identifiable in the analysis that follows.&lt;/p>
&lt;pre>&lt;code class="language-python">print(f&amp;quot;Period means:&amp;quot;)
print(df.groupby(&amp;quot;period&amp;quot;)[INDICATORS].mean().round(4).to_string())
p1_means = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;][INDICATORS].mean()
p2_means = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;][INDICATORS].mean()
changes = p2_means - p1_means
print(f&amp;quot;\nMean changes (2019 - 2013):&amp;quot;)
print(f&amp;quot; Education: {changes['education']:+.4f}&amp;quot;)
print(f&amp;quot; Health: {changes['health']:+.4f}&amp;quot;)
print(f&amp;quot; Income: {changes['income']:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Period means:
education health income
period
Y2013 0.6674 0.8370 0.7355
Y2019 0.6899 0.8504 0.7153
Mean changes (2019 - 2013):
Education: +0.0225
Health: +0.0134
Income: -0.0202
&lt;/code>&lt;/pre>
&lt;p>The period means reveal a mixed development story: education rose from 0.667 to 0.690 (+0.023) and health from 0.837 to 0.850 (+0.013), but income &lt;strong>declined&lt;/strong> from 0.736 to 0.715 ($-0.020$). This income decline across much of South America between 2013 and 2019 &amp;mdash; driven by commodity price drops and economic slowdowns &amp;mdash; is a real signal that our PCA-based index must capture correctly. Note that all three indicators have positive polarity (higher values indicate better outcomes), so no polarity adjustment is needed.&lt;/p>
&lt;h2 id="5-exploring-the-raw-data">5. Exploring the raw data&lt;/h2>
&lt;p>Before running any PCA, let us examine the country-level patterns, the correlation structure, and the period-to-period shift.&lt;/p>
&lt;pre>&lt;code class="language-python"># Country-level means by period
print(f&amp;quot;Country-level means by period:&amp;quot;)
country_means = (df.groupby([&amp;quot;country&amp;quot;, &amp;quot;period&amp;quot;])[INDICATORS]
.mean().round(3).unstack(&amp;quot;period&amp;quot;))
country_means.columns = [f&amp;quot;{col[0]}_{col[1]}&amp;quot; for col in country_means.columns]
print(country_means.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Country-level means by period:
education_Y2013 education_Y2019 health_Y2013 health_Y2019 income_Y2013 income_Y2019
country
Argentina 0.823 0.852 0.858 0.872 0.827 0.809
Bolivia 0.652 0.689 0.778 0.809 0.633 0.665
Brazil 0.659 0.684 0.838 0.859 0.745 0.732
Chile 0.732 0.781 0.911 0.925 0.806 0.814
Colombia 0.615 0.654 0.858 0.876 0.707 0.720
Ecuador 0.688 0.691 0.857 0.877 0.707 0.699
Guyana 0.568 0.574 0.747 0.764 0.599 0.622
Paraguay 0.612 0.624 0.820 0.835 0.698 0.717
Peru 0.671 0.713 0.845 0.866 0.688 0.703
Suriname 0.584 0.627 0.791 0.804 0.757 0.725
Uruguay 0.694 0.722 0.878 0.891 0.790 0.799
Venezuela 0.708 0.682 0.813 0.801 0.782 0.630
&lt;/code>&lt;/pre>
&lt;p>The country-level means reveal stark development gaps across South America. Chile leads in health (0.911&amp;ndash;0.925) and is strong across all dimensions. Argentina leads in education (0.823&amp;ndash;0.852) with high income. At the other end, Guyana has the lowest education (0.568&amp;ndash;0.574) and Bolivia the lowest health (0.778&amp;ndash;0.809). Education and health improved almost everywhere between 2013 and 2019, but several countries saw income fall, and two stand out: Venezuela&amp;rsquo;s income collapsed from 0.782 to 0.630 ($-0.152$), reflecting its severe economic crisis, and Argentina&amp;rsquo;s income also fell from 0.827 to 0.809. These divergent trajectories across countries are precisely why a fixed yardstick (pooled PCA) is essential for temporal comparison.&lt;/p>
&lt;pre>&lt;code class="language-python">corr_matrix = df[INDICATORS].corr().round(4)
print(f&amp;quot;Pooled correlation matrix:&amp;quot;)
print(corr_matrix.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled correlation matrix:
education health income
education 1.0000 0.4392 0.6808
health 0.4392 1.0000 0.6303
income 0.6808 0.6303 1.0000
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_correlation_heatmap.png" alt="Pooled correlation heatmap of the three HDI sub-indices.">&lt;/p>
&lt;p>The correlations are moderate to strong but far from the near-perfect values we saw in the &lt;a href="https://carlos-mendez.org/post/python_pca/">previous tutorial&amp;rsquo;s&lt;/a> simulated data ($r &amp;gt; 0.93$). Education and Income show the strongest correlation (0.68), followed by Health and Income (0.63), with Education and Health the weakest (0.44). These lower correlations mean PCA will capture less variance in PC1 &amp;mdash; the three indicators carry more independent information than in the simulated case, reflecting the genuine complexity of human development. The weak Education-Health link (0.44) suggests that a region can have high literacy but mediocre life expectancy (or vice versa) &amp;mdash; education and health are partly independent dimensions of development.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 6))
fig.patch.set_linewidth(0)
p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;]
p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;]
ax.scatter(p1[&amp;quot;education&amp;quot;], p1[&amp;quot;income&amp;quot;], color=STEEL_BLUE,
edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7, label=&amp;quot;2013&amp;quot;)
ax.scatter(p2[&amp;quot;education&amp;quot;], p2[&amp;quot;income&amp;quot;], color=WARM_ORANGE,
edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7, label=&amp;quot;2019&amp;quot;)
# Centroid arrows
c1_edu, c1_inc = p1[&amp;quot;education&amp;quot;].mean(), p1[&amp;quot;income&amp;quot;].mean()
c2_edu, c2_inc = p2[&amp;quot;education&amp;quot;].mean(), p2[&amp;quot;income&amp;quot;].mean()
ax.annotate(&amp;quot;&amp;quot;, xy=(c2_edu, c2_inc), xytext=(c1_edu, c1_inc),
arrowprops=dict(arrowstyle=&amp;quot;-|&amp;gt;&amp;quot;, color=TEAL, lw=2.5))
ax.set_xlabel(&amp;quot;Education Index&amp;quot;)
ax.set_ylabel(&amp;quot;Income Index&amp;quot;)
ax.set_title(&amp;quot;Education vs. Income by period (153 South American regions)&amp;quot;)
ax.legend(loc=&amp;quot;lower right&amp;quot;)
plt.savefig(&amp;quot;pca2_period_shift_scatter.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_period_shift_scatter.png" alt="Scatter plot of Education vs Income colored by period, showing education rising but income declining.">&lt;/p>
&lt;p>The scatter plot reveals a striking pattern: between 2013 (steel blue) and 2019 (orange), the cloud shifted &lt;strong>right&lt;/strong> (education improved) but &lt;strong>downward&lt;/strong> (income declined). The teal arrow connecting the two period centroids captures this asymmetric shift. This is a real-world complication that simple simulated data would not produce &amp;mdash; per-period PCA will handle this mixed signal differently from pooled PCA.&lt;/p>
&lt;h2 id="6-the-problem-per-period-pca">6. The problem: per-period PCA&lt;/h2>
&lt;p>To understand why pooled PCA is necessary, let us first see what goes wrong with the naive approach. We run the full six-step pipeline separately for each period &amp;mdash; standardizing with period-specific means, computing period-specific eigenvectors, and normalizing with period-specific bounds.&lt;/p>
&lt;p>&lt;strong>Per-period standardization&lt;/strong> uses different baselines for each period:&lt;/p>
&lt;p>$$Z_{ij}^{(t)} = \frac{X_{ij,t} - \bar{X}_j^{(t)}}{\sigma_j^{(t)}}$$&lt;/p>
&lt;p>In words, this says: standardize using only the data from period $t$. The mean and standard deviation change between periods, so the yardstick shifts.&lt;/p>
&lt;p>&lt;strong>Per-period normalization&lt;/strong> uses different bounds for each period:&lt;/p>
&lt;p>$$HDI_i^{(t)} = \frac{PC1_i^{(t)} - PC1_{min}^{(t)}}{PC1_{max}^{(t)} - PC1_{min}^{(t)}}$$&lt;/p>
&lt;p>In words, this says: the worst region in each period gets 0 and the best gets 1, but the scale resets every period.&lt;/p>
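&lt;p>A toy example makes the resetting concrete. In the sketch below (hypothetical PC1 scores, not the SHDI data), every region improves by the same amount, yet per-period Min-Max returns identical score vectors in both periods; only pooled bounds, previewing Step 6, preserve the improvement:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical PC1 scores: every region improves by +0.5 between periods
pc1_2013 = np.array([-1.0, 0.0, 1.0])
pc1_2019 = pc1_2013 + 0.5

def minmax(v, lo, hi):
    return (v - lo) / (hi - lo)

# Per-period bounds: the uniform improvement vanishes entirely
per_2013 = minmax(pc1_2013, pc1_2013.min(), pc1_2013.max())
per_2019 = minmax(pc1_2019, pc1_2019.min(), pc1_2019.max())
print(per_2013, per_2019)  # identical [0, 0.5, 1] vectors in both periods

# Pooled bounds (min/max across both periods) keep the improvement visible
lo = min(pc1_2013.min(), pc1_2019.min())
hi = max(pc1_2013.max(), pc1_2019.max())
print(minmax(pc1_2013, lo, hi))  # [0, 0.4, 0.8]
print(minmax(pc1_2019, lo, hi))  # [0.2, 0.6, 1.0]
&lt;/code>&lt;/pre>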
&lt;pre>&lt;code class="language-python">def run_single_period_pca(df_period, indicators):
    &amp;quot;&amp;quot;&amp;quot;Run the full PCA pipeline on a single-period DataFrame.&amp;quot;&amp;quot;&amp;quot;
    X = df_period[indicators].values
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=0)
    Z = (X - means) / stds
    cov = np.cov(Z.T, ddof=0)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    if eigenvectors[0, 0] &amp;lt; 0:
        eigenvectors[:, 0] *= -1
    pc1 = Z @ eigenvectors[:, 0]
    hdi = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
    return {&amp;quot;pc1&amp;quot;: pc1, &amp;quot;hdi&amp;quot;: hdi, &amp;quot;weights&amp;quot;: eigenvectors[:, 0],
            &amp;quot;eigenvalues&amp;quot;: eigenvalues,
            &amp;quot;var_explained&amp;quot;: eigenvalues / eigenvalues.sum() * 100,
            &amp;quot;means&amp;quot;: means, &amp;quot;stds&amp;quot;: stds}
pp_p1 = run_single_period_pca(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;], INDICATORS)
pp_p2 = run_single_period_pca(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;], INDICATORS)
print(f&amp;quot;Per-period eigenvector weights (PC1):&amp;quot;)
print(f&amp;quot; 2013: [{pp_p1['weights'][0]:.4f}, {pp_p1['weights'][1]:.4f}, {pp_p1['weights'][2]:.4f}]&amp;quot;)
print(f&amp;quot; 2019: [{pp_p2['weights'][0]:.4f}, {pp_p2['weights'][1]:.4f}, {pp_p2['weights'][2]:.4f}]&amp;quot;)
print(f&amp;quot; Shift: [{pp_p2['weights'][0] - pp_p1['weights'][0]:+.4f}, &amp;quot;
f&amp;quot;{pp_p2['weights'][1] - pp_p1['weights'][1]:+.4f}, &amp;quot;
f&amp;quot;{pp_p2['weights'][2] - pp_p1['weights'][2]:+.4f}]&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Per-period eigenvector weights (PC1):
2013: [0.5832, 0.5100, 0.6322]
2019: [0.5405, 0.5657, 0.6228]
Shift: [-0.0427, +0.0556, -0.0095]
&lt;/code>&lt;/pre>
&lt;p>The eigenvector weights shift substantially between periods. Education&amp;rsquo;s weight drops from 0.583 to 0.541 ($-0.043$), while Health&amp;rsquo;s weight jumps from 0.510 to 0.566 ($+0.056$). This means the index formula itself changes &amp;mdash; a region&amp;rsquo;s 2013 HDI and 2019 HDI are computed with different recipes, making temporal comparison unreliable. Under per-period PCA, &lt;strong>43 out of 153 regions appear to decline&lt;/strong> in HDI despite the overall improvement in education and health. The per-period approach erases the mixed global signal by re-centering every period to a mean of zero.&lt;/p>
&lt;p>&lt;img src="pca2_perperiod_weights.png" alt="Grouped bar chart showing per-period eigenvector weights shifting between 2013 and 2019.">&lt;/p>
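&lt;p>One way to quantify how much the index recipe changed is the angle between the two weight vectors. Since eigenvectors have unit length, the cosine of this angle is simply their dot product; a quick sketch using the weights printed above:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Per-period PC1 weight vectors from the output above
w_2013 = np.array([0.5832, 0.5100, 0.6322])
w_2019 = np.array([0.5405, 0.5657, 0.6228])

cos_sim = w_2013 @ w_2019 / (np.linalg.norm(w_2013) * np.linalg.norm(w_2019))
angle_deg = np.degrees(np.arccos(cos_sim))
print(f&amp;quot;cosine similarity: {cos_sim:.4f}, angle: {angle_deg:.1f} degrees&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>A rotation of roughly four degrees may look small, but near the top of the ranking it is enough to reorder regions whose scores are close together.&lt;/p>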
&lt;p>To visualize how individual regions shift in rank under per-period PCA, we store each period&amp;rsquo;s HDI scores and compute ranks.&lt;/p>
&lt;pre>&lt;code class="language-python">df_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].copy()
df_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].copy()
df_p1[&amp;quot;pp_hdi&amp;quot;] = pp_p1[&amp;quot;hdi&amp;quot;]
df_p2[&amp;quot;pp_hdi&amp;quot;] = pp_p2[&amp;quot;hdi&amp;quot;]
df_p1[&amp;quot;pp_rank&amp;quot;] = df_p1[&amp;quot;pp_hdi&amp;quot;].rank(ascending=False).astype(int)
df_p2[&amp;quot;pp_rank&amp;quot;] = df_p2[&amp;quot;pp_hdi&amp;quot;].rank(ascending=False).astype(int)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 10))
fig.patch.set_linewidth(0)
rank_change = df_p2[&amp;quot;pp_rank&amp;quot;].values - df_p1[&amp;quot;pp_rank&amp;quot;].values
abs_change = np.abs(rank_change)
top_changers_idx = np.argsort(abs_change)[-10:]
for i in top_changers_idx:
    r1 = df_p1.iloc[i][&amp;quot;pp_rank&amp;quot;]
    r2 = df_p2.iloc[i][&amp;quot;pp_rank&amp;quot;]
    label = df_p1.iloc[i][&amp;quot;region_country&amp;quot;]
    color = TEAL if r2 &amp;lt; r1 else WARM_ORANGE
    ax.plot([0, 1], [r1, r2], color=color, linewidth=2, alpha=0.8)
    ax.text(-0.05, r1, f&amp;quot;{label} (#{int(r1)})&amp;quot;, ha=&amp;quot;right&amp;quot;, va=&amp;quot;center&amp;quot;,
            fontsize=7, color=LIGHT_TEXT)
    ax.text(1.05, r2, f&amp;quot;{label} (#{int(r2)})&amp;quot;, ha=&amp;quot;left&amp;quot;, va=&amp;quot;center&amp;quot;,
            fontsize=7, color=LIGHT_TEXT)
ax.set_xlim(-0.6, 1.6)
ax.set_ylim(160, -5)
ax.set_xticks([0, 1])
ax.set_xticklabels([&amp;quot;2013 Rank&amp;quot;, &amp;quot;2019 Rank&amp;quot;], fontsize=13)
ax.set_ylabel(&amp;quot;Rank (1 = best)&amp;quot;)
ax.set_title(&amp;quot;Per-period PCA: rank shifts for 10 regions\n(teal = improved, orange = declined)&amp;quot;)
plt.savefig(&amp;quot;pca2_perperiod_rank_shift.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_perperiod_rank_shift.png" alt="Slopegraph showing 10 regions with the largest rank shifts under per-period PCA.">&lt;/p>
&lt;p>&lt;strong>The running example:&lt;/strong> City of Buenos Aires &amp;mdash; Argentina&amp;rsquo;s capital and one of the most developed regions in South America &amp;mdash; has a per-period HDI of 1.000 in 2013 (ranked #1) and 0.960 in 2019: an apparent &lt;strong>decline of 0.040&lt;/strong>. But we know Buenos Aires improved in education (0.926 $\to$ 0.946) and health (0.858 $\to$ 0.872), with only a modest income decline (0.850 $\to$ 0.832). Is Buenos Aires really declining, or is the shifting yardstick hiding a more nuanced story?&lt;/p>
&lt;h2 id="7-pooled-step-1-stacking-the-data">7. Pooled Step 1: Stacking the data&lt;/h2>
&lt;p>The first step of pooled PCA is to stack all periods into a single dataset. From PCA&amp;rsquo;s perspective, we have 306 observations (153 regions $\times$ 2 periods), not two separate groups. The &lt;code>period&lt;/code> column is metadata that we carry through for analysis, but it does not enter the PCA computation.&lt;/p>
&lt;pre>&lt;code class="language-python">print(f&amp;quot;Stacked dataset: {df.shape[0]} rows, {df.shape[1]} columns&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Stacked dataset: 306 rows, 9 columns
&lt;/code>&lt;/pre>
&lt;p>The stacked dataset has 306 rows. PCA will treat each row equally regardless of which period it belongs to, producing a single set of standardization parameters and a single set of eigenvector weights.&lt;/p>
&lt;h2 id="8-pooled-step-2-pooled-standardization">8. Pooled Step 2: Pooled standardization&lt;/h2>
&lt;p>&lt;strong>What it is:&lt;/strong> We compute the mean and standard deviation from the entire stacked dataset (all 306 rows) and use these pooled parameters to standardize every observation:&lt;/p>
&lt;p>$$Z_{ij,t}^{pooled} = \frac{X_{ij,t} - \bar{X}_j^{pooled}}{\sigma_j^{pooled}}$$&lt;/p>
&lt;p>In words, this says: for region $i$, indicator $j$, at time $t$, subtract the pooled mean $\bar{X}_j^{pooled}$ (computed across all regions and all periods) and divide by the pooled standard deviation $\sigma_j^{pooled}$.&lt;/p>
&lt;p>&lt;strong>The application:&lt;/strong> City of Buenos Aires has education = 0.926 in 2013 and 0.946 in 2019. The pooled mean for education is 0.679 and the pooled standard deviation is 0.081. Under per-period standardization, 2013 uses mean = 0.667 and 2019 uses mean = 0.690 &amp;mdash; a shifting baseline. Under pooled standardization, both periods use the same mean = 0.679. The increase from 0.926 to 0.946 maps to a genuine increase in pooled Z-score.&lt;/p>
&lt;p>&lt;strong>The intuition:&lt;/strong> Imagine measuring children&amp;rsquo;s heights at age 5 and age 10. Per-period standardization compares each child only to their same-age peers: a tall 5-year-old gets a high Z-score, and a tall 10-year-old gets a high Z-score, but you cannot tell how much each child grew because the reference group changed. Pooled standardization measures everyone against the same ruler &amp;mdash; the combined height distribution &amp;mdash; so the Z-score increase from age 5 to age 10 directly reflects actual growth.&lt;/p>
&lt;p>&lt;strong>The necessity:&lt;/strong> Without pooled standardization, the income decline (from 0.736 to 0.715 on average) would be hidden. Per-period Z-scores re-center income to zero each period, erasing the decline. Pooled Z-scores preserve it: the 2019 income Z-scores average slightly below zero, correctly reflecting the real economic setback.&lt;/p>
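&lt;p>The erasure is mechanical, not subtle. In the sketch below (hypothetical income values, not the SHDI data), a uniform decline disappears under per-period Z-scores but survives under pooled Z-scores:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Hypothetical income values: the whole sample declines by 0.02
inc_2013 = np.array([0.70, 0.74, 0.78])
inc_2019 = inc_2013 - 0.02

# Per-period Z-scores re-center each period to mean zero: the decline is erased
z_pp_2013 = (inc_2013 - inc_2013.mean()) / inc_2013.std()
z_pp_2019 = (inc_2019 - inc_2019.mean()) / inc_2019.std()
print(z_pp_2013.mean(), z_pp_2019.mean())  # both zero (up to rounding)

# Pooled Z-scores share one baseline: 2019 sits below zero on average
stacked = np.concatenate([inc_2013, inc_2019])
z_pool_2013 = (inc_2013 - stacked.mean()) / stacked.std()
z_pool_2019 = (inc_2019 - stacked.mean()) / stacked.std()
print(z_pool_2013.mean(), z_pool_2019.mean())  # roughly +0.29 and -0.29
&lt;/code>&lt;/pre>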
&lt;pre>&lt;code class="language-python">X_all = df[INDICATORS].values # 306 rows
pooled_means = X_all.mean(axis=0)
pooled_stds = X_all.std(axis=0, ddof=0)
Z_pooled = (X_all - pooled_means) / pooled_stds
print(f&amp;quot;Pooled standardization parameters:&amp;quot;)
print(f&amp;quot; Means: [{pooled_means[0]:.4f}, {pooled_means[1]:.4f}, {pooled_means[2]:.4f}]&amp;quot;)
print(f&amp;quot; Stds: [{pooled_stds[0]:.4f}, {pooled_stds[1]:.4f}, {pooled_stds[2]:.4f}]&amp;quot;)
scaler = StandardScaler()
Z_sklearn = scaler.fit_transform(X_all)
max_diff = np.max(np.abs(Z_sklearn - Z_pooled))
print(f&amp;quot;\nMax difference from sklearn StandardScaler: {max_diff:.2e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled standardization parameters:
Means: [0.6786, 0.8437, 0.7254]
Stds: [0.0814, 0.0472, 0.0749]
Max difference from sklearn StandardScaler: 0.00e+00
&lt;/code>&lt;/pre>
&lt;p>The pooled means sit between the period-specific means (e.g., education: 0.667 in 2013, 0.690 in 2019, 0.679 pooled). The standard deviations are similar across periods because the within-period spread is much larger than the between-period level shift. The zero-difference check against &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">StandardScaler()&lt;/a> confirms our manual computation is correct.&lt;/p>
&lt;h2 id="9-pooled-step-3-covariance-matrix">9. Pooled Step 3: Covariance matrix&lt;/h2>
&lt;p>We compute the $3 \times 3$ covariance matrix from the pooled standardized data (all 306 rows):&lt;/p>
&lt;p>$$\Sigma^{pooled} = \frac{1}{nT} \left(Z^{pooled}\right)^{T} Z^{pooled}$$&lt;/p>
&lt;p>In words, this says: the pooled covariance matrix measures how the three standardized indicators co-move across all region-period observations.&lt;/p>
&lt;pre>&lt;code class="language-python">cov_pooled = np.cov(Z_pooled.T, ddof=0)
print(f&amp;quot;Pooled covariance matrix (3x3):&amp;quot;)
for i in range(3):
    row = &amp;quot; [&amp;quot; + &amp;quot; &amp;quot;.join(f&amp;quot;{cov_pooled[i, j]:.4f}&amp;quot; for j in range(3)) + &amp;quot;]&amp;quot;
    print(row)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled covariance matrix (3x3):
[1.0000 0.4392 0.6808]
[0.4392 1.0000 0.6303]
[0.6808 0.6303 1.0000]
&lt;/code>&lt;/pre>
&lt;p>The off-diagonals range from 0.44 (Education-Health) to 0.68 (Education-Income). These are substantially lower than the 0.93&amp;ndash;0.95 values in the &lt;a href="https://carlos-mendez.org/post/python_pca/#8-step-3-the-covariance-matrix----mapping-the-overlap">simulated data from the previous tutorial&lt;/a>, reflecting the genuine complexity of human development. Education and Health are only moderately correlated because they measure different dimensions &amp;mdash; a region can have high literacy but mediocre life expectancy (or vice versa). This means PC1 will capture less total variance, and the eigenvector weights will be more unequal.&lt;/p>
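&lt;p>A useful side fact: because the inputs are pooled-standardized, this covariance matrix is also the correlation matrix of the raw indicators. A small self-contained check of that identity on synthetic data:&lt;/p>

```python
import numpy as np

# After standardization with the population std (ddof=0), the covariance
# matrix of Z equals the correlation matrix of the raw data X.
rng = np.random.default_rng(0)
X = rng.normal(size=(306, 3))
X[:, 2] += 0.5 * X[:, 0]  # induce some correlation

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
cov_Z = np.cov(Z.T, ddof=0)
corr_X = np.corrcoef(X.T)
print(np.allclose(cov_Z, corr_X))  # True
```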
&lt;h2 id="10-pooled-step-4-eigen-decomposition">10. Pooled Step 4: Eigen-decomposition&lt;/h2>
&lt;p>We decompose the pooled covariance matrix to find the direction of maximum spread:&lt;/p>
&lt;p>$$\Sigma^{pooled} \mathbf{v}_k = \lambda_k \mathbf{v}_k$$&lt;/p>
&lt;p>The PC1 score for each region-period is:&lt;/p>
&lt;p>$$PC1_{i,t} = w_1 \, Z_{i,edu,t}^{pooled} + w_2 \, Z_{i,health,t}^{pooled} + w_3 \, Z_{i,income,t}^{pooled}$$&lt;/p>
&lt;p>In words, this says: each region&amp;rsquo;s PC1 score is a weighted sum of its three pooled-standardized indicators, using the single set of pooled weights $[w_1, w_2, w_3]$.&lt;/p>
&lt;pre>&lt;code class="language-python">eigenvalues, eigenvectors = np.linalg.eigh(cov_pooled)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Sign convention: make PC1's first (education) weight positive
if eigenvectors[0, 0] &amp;lt; 0:
    eigenvectors[:, 0] *= -1
var_explained = eigenvalues / eigenvalues.sum() * 100
print(f&amp;quot;Pooled eigenvalues: [{eigenvalues[0]:.4f}, {eigenvalues[1]:.4f}, {eigenvalues[2]:.4f}]&amp;quot;)
print(f&amp;quot;\nPooled eigenvector (PC1): [{eigenvectors[0, 0]:.4f}, {eigenvectors[1, 0]:.4f}, {eigenvectors[2, 0]:.4f}]&amp;quot;)
print(f&amp;quot;\nVariance explained:&amp;quot;)
print(f&amp;quot; PC1: {var_explained[0]:.2f}%&amp;quot;)
print(f&amp;quot; PC2: {var_explained[1]:.2f}%&amp;quot;)
print(f&amp;quot; PC3: {var_explained[2]:.2f}%&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled eigenvalues: [2.1726, 0.5631, 0.2643]
Pooled eigenvector (PC1): [0.5642, 0.5448, 0.6204]
Variance explained:
PC1: 72.42%
PC2: 18.77%
PC3: 8.81%
&lt;/code>&lt;/pre>
&lt;p>PC1 captures 72.42% of all variance &amp;mdash; substantially less than the 96% in the simulated tutorial, but still a strong majority. The eigenvector weights are $[0.5642, 0.5448, 0.6204]$, revealing that &lt;strong>Income carries the highest weight&lt;/strong> (0.620), followed by Education (0.564), with Health contributing least (0.545). This unequal weighting reflects the real-world correlation structure: Income is more strongly correlated with the other two indicators, so it aligns most closely with the common development dimension that PC1 captures. Unlike the two-variable case from the &lt;a href="https://carlos-mendez.org/post/python_pca/#9-step-4-eigen-decomposition----finding-the-optimal-direction">previous tutorial&lt;/a> where equal weights were a mathematical certainty, three variables allow PCA to discover data-driven weights. Crucially, these weights are &lt;strong>fixed&lt;/strong> &amp;mdash; the same weights apply to 2013 and 2019 because they were computed from the pooled data.&lt;/p>
&lt;p>&lt;img src="pca2_pooled_variance_explained.png" alt="Bar chart showing PC1 captures 72.4%, PC2 captures 18.8%, and PC3 captures 8.8%.">&lt;/p>
&lt;p>The variance explained chart shows PC1 dominating but with meaningful contributions from PC2 (18.8%) and PC3 (8.8%). The fact that PC2 and PC3 are not negligible means some development dimensions are not captured by a single index. For instance, PC2 might separate regions with high education but low income from those with the opposite pattern. For this tutorial, we focus on PC1 as the composite HDI, but researchers working with this data should consider whether retaining PC2 adds meaningful insight.&lt;/p>
&lt;h2 id="11-pooled-step-5-scoring">11. Pooled Step 5: Scoring&lt;/h2>
&lt;p>We project all 306 rows onto PC1 using the fixed pooled weights.&lt;/p>
&lt;pre>&lt;code class="language-python">w = eigenvectors[:, 0]
df[&amp;quot;pc1&amp;quot;] = Z_pooled @ w
pc1_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;][&amp;quot;pc1&amp;quot;]
pc1_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;][&amp;quot;pc1&amp;quot;]
print(f&amp;quot;Pooled PC1 score statistics:&amp;quot;)
print(f&amp;quot; 2013 mean: {pc1_p1.mean():.4f}&amp;quot;)
print(f&amp;quot; 2019 mean: {pc1_p2.mean():.4f}&amp;quot;)
print(f&amp;quot; Shift: {pc1_p2.mean() - pc1_p1.mean():+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PC1 score statistics:
2013 mean: -0.0720
2019 mean: 0.0720
Shift: +0.1439
&lt;/code>&lt;/pre>
&lt;p>The 2013 mean PC1 score is $-0.072$ (below the grand mean) and the 2019 mean is $+0.072$ (above the grand mean). The shift of $+0.144$ represents pooled PCA&amp;rsquo;s measure of net development progress across South America. This is a modest positive shift, reflecting the trade-off between education/health gains and income decline. Under per-period PCA, this shift would be exactly zero by construction &amp;mdash; the net progress would be invisible.&lt;/p>
&lt;h2 id="12-pooled-step-6-normalization">12. Pooled Step 6: Normalization&lt;/h2>
&lt;p>We apply Min-Max normalization using the pooled bounds &amp;mdash; the minimum and maximum PC1 scores across all 306 observations:&lt;/p>
&lt;p>$$HDI_{i,t} = \frac{PC1_{i,t} - PC1_{min}^{pooled}}{PC1_{max}^{pooled} - PC1_{min}^{pooled}}$$&lt;/p>
&lt;pre>&lt;code class="language-python">pc1_min = df[&amp;quot;pc1&amp;quot;].min()
pc1_max = df[&amp;quot;pc1&amp;quot;].max()
df[&amp;quot;hdi&amp;quot;] = (df[&amp;quot;pc1&amp;quot;] - pc1_min) / (pc1_max - pc1_min)
print(f&amp;quot;\nPooled HDI — 2019 top 5:&amp;quot;)
print(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].nlargest(5, &amp;quot;hdi&amp;quot;)[
[&amp;quot;region_country&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;, &amp;quot;hdi&amp;quot;]
].to_string(index=False))
print(f&amp;quot;\nPooled HDI — 2013 bottom 5:&amp;quot;)
print(df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].nsmallest(5, &amp;quot;hdi&amp;quot;)[
[&amp;quot;region_country&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;, &amp;quot;hdi&amp;quot;]
].to_string(index=False))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled HDI — 2019 top 5:
region_country education health income hdi
Region Metropolitana (CHL) 0.877 0.929 0.844 1.000000
Tarapaca (incl Arica and (CHL) 0.888 0.937 0.823 0.999348
City of Buenos Aires (ARG) 0.946 0.872 0.832 0.965232
Antofagasta (CHL) 0.894 0.896 0.838 0.961010
Valparaiso (former Aconca (CHL) 0.842 0.931 0.831 0.959202
Pooled HDI — 2013 bottom 5:
region_country education health income hdi
Potaro-Siparuni (GUY) 0.522 0.735 0.443 0.000000
Barima-Waini (GUY) 0.483 0.745 0.534 0.074601
Potosi (BOL) 0.564 0.666 0.578 0.076345
Upper Takutu-Upper Essequ (GUY) 0.567 0.751 0.470 0.089799
Brokopondo and Sipaliwini (SUR) 0.382 0.774 0.602 0.099207
&lt;/code>&lt;/pre>
&lt;p>The top 5 in 2019 are dominated by Chilean regions (Region Metropolitana, Tarapaca, Antofagasta, Valparaiso) plus Buenos Aires. Chile&amp;rsquo;s strong performance across all three indicators &amp;mdash; particularly Health (0.90&amp;ndash;0.94) &amp;mdash; places its regions at the top. The bottom 5 in 2013 are remote regions of Guyana (Potaro-Siparuni, Barima-Waini), Bolivia (Potosi), and Suriname (Brokopondo), characterized by low education and income despite moderate health outcomes. The Potaro-Siparuni region of Guyana anchors the bottom at HDI = 0.00 (education 0.522, health 0.735, income 0.443).&lt;/p>
&lt;p>&lt;strong>City of Buenos Aires&lt;/strong> has pooled HDI of 0.946 in 2013 and 0.965 in 2019 &amp;mdash; an improvement of $+0.019$. Under per-period PCA, the same region showed a decline of $-0.040$. Pooled PCA correctly reveals that Buenos Aires improved modestly while being overtaken by Chilean regions that improved faster.&lt;/p>
&lt;p>&lt;img src="pca2_pooled_hdi_bars.png" alt="Paired horizontal bar chart showing top and bottom 15 regions with 2013 and 2019 HDI.">&lt;/p>
&lt;p>The paired bar chart shows the pooled HDI for the top and bottom 15 regions. In the top group, orange (2019) bars consistently extend further than steel blue (2013) bars, reflecting genuine improvement. In the bottom group, the pattern is more mixed &amp;mdash; some of the least developed regions in 2013 made substantial gains by 2019, while others barely moved. The dashed separator line divides the bottom 15 (below) from the top 15 (above).&lt;/p>
&lt;h2 id="13-the-contrast-pooled-vs-per-period-pca">13. The contrast: pooled vs per-period PCA&lt;/h2>
&lt;p>We now have two sets of HDI values for every region-period: one from per-period PCA and one from pooled PCA. To compare them, we build a wide table with each region&amp;rsquo;s pooled and per-period HDI change side by side.&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy.stats import spearmanr
# Separate pooled HDI by period
df_pooled_p1 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;].copy()
df_pooled_p2 = df[df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;].copy()
# Build comparison table: pooled vs per-period changes
compare = df_pooled_p1[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;region_country&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(
columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_p1&amp;quot;}
).merge(
df_pooled_p2[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_p2&amp;quot;}),
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;]
)
compare[&amp;quot;hdi_change&amp;quot;] = compare[&amp;quot;hdi_p2&amp;quot;] - compare[&amp;quot;hdi_p1&amp;quot;]
compare[&amp;quot;pp_change&amp;quot;] = df_p2[&amp;quot;pp_hdi&amp;quot;].values - df_p1[&amp;quot;pp_hdi&amp;quot;].values
compare[&amp;quot;method_diff&amp;quot;] = compare[&amp;quot;hdi_change&amp;quot;] - compare[&amp;quot;pp_change&amp;quot;]
# Direction disagreement
disagree = ((compare[&amp;quot;hdi_change&amp;quot;] &amp;gt; 0) &amp;amp; (compare[&amp;quot;pp_change&amp;quot;] &amp;lt; 0)) | \
((compare[&amp;quot;hdi_change&amp;quot;] &amp;lt; 0) &amp;amp; (compare[&amp;quot;pp_change&amp;quot;] &amp;gt; 0))
# Spearman rank correlation
rho_change, _ = spearmanr(compare[&amp;quot;hdi_change&amp;quot;], compare[&amp;quot;pp_change&amp;quot;])
# Running example: City of Buenos Aires
ba = compare[compare[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)].iloc[0]
ba_pp_p1 = df_p1[df_p1[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)][&amp;quot;pp_hdi&amp;quot;].values[0]
ba_pp_p2 = df_p2[df_p2[&amp;quot;region_country&amp;quot;].str.contains(&amp;quot;Buenos Aires&amp;quot;)][&amp;quot;pp_hdi&amp;quot;].values[0]
print(f&amp;quot;City of Buenos Aires:&amp;quot;)
print(f&amp;quot; Per-period: 2013={ba_pp_p1:.4f}, 2019={ba_pp_p2:.4f}, Change={ba_pp_p2 - ba_pp_p1:+.4f}&amp;quot;)
print(f&amp;quot; Pooled: 2013={ba['hdi_p1']:.4f}, 2019={ba['hdi_p2']:.4f}, Change={ba['hdi_change']:+.4f}&amp;quot;)
print(f&amp;quot;\nRegions where methods disagree on direction: {disagree.sum()} / {len(compare)}&amp;quot;)
print(f&amp;quot;\nSpearman rank correlation (HDI change): rho = {rho_change:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">City of Buenos Aires:
Per-period: 2013=1.0000, 2019=0.9604, Change=-0.0396
Pooled: 2013=0.9464, 2019=0.9652, Change=+0.0189
Regions where methods disagree on direction: 16 / 153
Spearman rank correlation (HDI change): rho = 0.9818
&lt;/code>&lt;/pre>
&lt;p>For City of Buenos Aires, per-period PCA shows a decline of $-0.04$ while pooled PCA shows an improvement of $+0.02$. The two methods disagree on the direction of change for &lt;strong>16 out of 153 regions&lt;/strong> &amp;mdash; about 10% of the sample. The Spearman rank correlation for improvement rankings is 0.982, meaning the two methods largely agree on who improved most, but the direction disagreements for specific regions could lead to different policy conclusions.&lt;/p>
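&lt;p>The two-clause disagreement mask above can also be written compactly with signs: the methods disagree exactly when the two changes have opposite nonzero signs. A minimal sketch with made-up numbers (not the actual results):&lt;/p>

```python
import numpy as np
import pandas as pd

# Toy comparison table (illustrative values only)
toy = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "hdi_change": [+0.02, -0.01, +0.03, -0.02],  # pooled change
    "pp_change":  [-0.04, -0.02, +0.01, +0.01],  # per-period change
})

# Disagreement: the product of the signs is negative; zero changes
# count as agreement, matching the strict inequalities used above
disagree = np.sign(toy["hdi_change"]) * np.sign(toy["pp_change"]) < 0
print(toy.loc[disagree, "region"].tolist())  # ['A', 'D']
```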
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(7, 7))
fig.patch.set_linewidth(0)
ax.scatter(compare[&amp;quot;hdi_change&amp;quot;], compare[&amp;quot;pp_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, zorder=3, alpha=0.7)
lim_min = min(compare[&amp;quot;hdi_change&amp;quot;].min(), compare[&amp;quot;pp_change&amp;quot;].min()) - 0.02
lim_max = max(compare[&amp;quot;hdi_change&amp;quot;].max(), compare[&amp;quot;pp_change&amp;quot;].max()) + 0.02
ax.plot([lim_min, lim_max], [lim_min, lim_max], color=WARM_ORANGE,
linewidth=2, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;Perfect agreement&amp;quot;, zorder=2)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
# Label extreme outliers
top_outliers = compare.nlargest(3, &amp;quot;method_diff&amp;quot;)
bot_outliers = compare.nsmallest(3, &amp;quot;method_diff&amp;quot;)
for _, row in pd.concat([top_outliers, bot_outliers]).iterrows():
    ax.annotate(row[&amp;quot;region_country&amp;quot;], (row[&amp;quot;hdi_change&amp;quot;], row[&amp;quot;pp_change&amp;quot;]),
                fontsize=6, color=TEAL, xytext=(5, 5),
                textcoords=&amp;quot;offset points&amp;quot;)
ax.set_xlabel(&amp;quot;Pooled HDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period HDI change (2019 - 2013)&amp;quot;)
ax.set_title(&amp;quot;Pooled vs. per-period PCA: HDI change comparison&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;)
ax.set_aspect(&amp;quot;equal&amp;quot;)
plt.savefig(&amp;quot;pca2_pooled_vs_perperiod_change.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_pooled_vs_perperiod_change.png" alt="Scatter plot of pooled vs per-period HDI change with 45-degree agreement line.">&lt;/p>
&lt;p>The scatter plot places pooled HDI change on the horizontal axis and per-period HDI change on the vertical axis. If both methods agreed perfectly, all points would fall on the dashed 45-degree line. The cloud sits systematically below the line for most regions &amp;mdash; per-period PCA tends to understate improvement (or overstate decline) relative to pooled PCA, because per-period standardization erases the net positive shift in education and health.&lt;/p>
&lt;pre>&lt;code class="language-python">compare[&amp;quot;pooled_change_rank&amp;quot;] = compare[&amp;quot;hdi_change&amp;quot;].rank(ascending=False).astype(int)
compare[&amp;quot;pp_change_rank&amp;quot;] = compare[&amp;quot;pp_change&amp;quot;].rank(ascending=False).astype(int)
compare[&amp;quot;change_rank_diff&amp;quot;] = np.abs(compare[&amp;quot;pooled_change_rank&amp;quot;] - compare[&amp;quot;pp_change_rank&amp;quot;])
fig, ax = plt.subplots(figsize=(8, 10))
fig.patch.set_linewidth(0)
top_change_rank_diff = compare.nlargest(10, &amp;quot;change_rank_diff&amp;quot;)
for _, row in top_change_rank_diff.iterrows():
    r_pooled = row[&amp;quot;pooled_change_rank&amp;quot;]
    r_pp = row[&amp;quot;pp_change_rank&amp;quot;]
    label = row[&amp;quot;region_country&amp;quot;]
    color = TEAL if r_pooled &amp;lt; r_pp else WARM_ORANGE
    ax.plot([0, 1], [r_pooled, r_pp], color=color, linewidth=2, alpha=0.8)
    ax.text(-0.05, r_pooled, f&amp;quot;{label} (#{int(r_pooled)})&amp;quot;, ha=&amp;quot;right&amp;quot;,
            va=&amp;quot;center&amp;quot;, fontsize=7, color=LIGHT_TEXT)
    ax.text(1.05, r_pp, f&amp;quot;{label} (#{int(r_pp)})&amp;quot;, ha=&amp;quot;left&amp;quot;,
            va=&amp;quot;center&amp;quot;, fontsize=7, color=LIGHT_TEXT)
ax.set_xlim(-0.6, 1.6)
ax.set_ylim(160, -5)
ax.set_xticks([0, 1])
ax.set_xticklabels([&amp;quot;Pooled Improvement Rank&amp;quot;, &amp;quot;Per-period Improvement Rank&amp;quot;], fontsize=11)
ax.set_ylabel(&amp;quot;Rank (1 = most improved)&amp;quot;)
ax.set_title(&amp;quot;Who improved the most? Pooled vs. per-period rankings\n(teal = ranked higher by pooled, orange = ranked lower)&amp;quot;)
plt.savefig(&amp;quot;pca2_rank_comparison_bump.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_rank_comparison_bump.png" alt="Bump chart comparing improvement rankings under pooled vs per-period PCA.">&lt;/p>
&lt;p>The bump chart compares who improved the most under each method. The crossing lines show where the two methods re-order regions&amp;rsquo; improvement rankings. Regions that pooled PCA ranks as top improvers may be ranked lower by per-period PCA if their gains were partly masked by the shifting baseline.&lt;/p>
&lt;h2 id="14-validation-against-the-official-shdi">14. Validation against the official SHDI&lt;/h2>
&lt;p>The Global Data Lab computes an official Subnational HDI (SHDI) using a geometric mean methodology similar to the UNDP&amp;rsquo;s approach. We can validate our PCA-based index by comparing both the pooled and per-period approaches against this official benchmark. If pooled PCA better tracks the established methodology, it provides further evidence that the pooled approach is superior for temporal analysis.&lt;/p>
&lt;pre>&lt;code class="language-python"># Add per-period HDI to main DataFrame for comparison
df[&amp;quot;pp_hdi&amp;quot;] = pd.concat([df_p1[&amp;quot;pp_hdi&amp;quot;], df_p2[&amp;quot;pp_hdi&amp;quot;]]).sort_index().values
# Pooled PCA vs official SHDI
corr_pooled = df[&amp;quot;hdi&amp;quot;].corr(df[&amp;quot;shdi_official&amp;quot;])
r2_pooled = corr_pooled ** 2
# Per-period PCA vs official SHDI
corr_pp = df[&amp;quot;pp_hdi&amp;quot;].corr(df[&amp;quot;shdi_official&amp;quot;])
r2_pp = corr_pp ** 2
print(f&amp;quot;Pooled PCA vs official SHDI:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pooled:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pooled:.4f}&amp;quot;)
print(f&amp;quot;\nPer-period PCA vs official SHDI:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pp:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pp:.4f}&amp;quot;)
print(f&amp;quot;\nR-squared difference (pooled - per-period): {r2_pooled - r2_pp:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PCA vs official SHDI:
Pearson r: 0.9911
R-squared: 0.9823
Per-period PCA vs official SHDI:
Pearson r: 0.9874
R-squared: 0.9750
R-squared difference (pooled - per-period): +0.0073
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_linewidth(0)
p1_mask = df[&amp;quot;period&amp;quot;] == &amp;quot;Y2013&amp;quot;
p2_mask = df[&amp;quot;period&amp;quot;] == &amp;quot;Y2019&amp;quot;
# Panel A: Pooled PCA vs SHDI
ax = axes[0]
ax.scatter(df.loc[p1_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p1_mask, &amp;quot;hdi&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2013&amp;quot;)
ax.scatter(df.loc[p2_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p2_mask, &amp;quot;hdi&amp;quot;],
color=WARM_ORANGE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2019&amp;quot;)
ax.set_xlabel(&amp;quot;Official SHDI&amp;quot;)
ax.set_ylabel(&amp;quot;Pooled PCA HDI&amp;quot;)
ax.set_title(f&amp;quot;Pooled PCA (R² = {r2_pooled:.4f})&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;, fontsize=9)
# Panel B: Per-period PCA vs SHDI
ax = axes[1]
ax.scatter(df.loc[p1_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p1_mask, &amp;quot;pp_hdi&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2013&amp;quot;)
ax.scatter(df.loc[p2_mask, &amp;quot;shdi_official&amp;quot;], df.loc[p2_mask, &amp;quot;pp_hdi&amp;quot;],
color=WARM_ORANGE, edgecolors=DARK_NAVY, s=30, alpha=0.7, zorder=3, label=&amp;quot;2019&amp;quot;)
ax.set_xlabel(&amp;quot;Official SHDI&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period PCA HDI&amp;quot;)
ax.set_title(f&amp;quot;Per-period PCA (R² = {r2_pp:.4f})&amp;quot;)
ax.legend(loc=&amp;quot;upper left&amp;quot;, fontsize=9)
fig.suptitle(&amp;quot;Validation: which PCA method tracks the official SHDI better?&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_validation_vs_shdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_validation_vs_shdi.png" alt="Side-by-side scatter plots comparing pooled PCA and per-period PCA against the official SHDI.">&lt;/p>
&lt;p>&lt;strong>Pooled PCA achieves $R^2 = 0.9823$, outperforming per-period PCA at $R^2 = 0.9750$.&lt;/strong> The difference of +0.0073 may seem small in absolute terms, but it is consistent and meaningful: pooled PCA explains 0.73 percentage points more of the variance in the official SHDI. The left panel shows pooled PCA points tightly clustered along a single straight line with both periods intermixed seamlessly &amp;mdash; exactly what we want for a temporally comparable index. The right panel shows per-period PCA with a slightly wider scatter, reflecting the distortion introduced by re-centering each period to its own baseline. The fact that the official SHDI (which uses a fixed geometric mean formula across years) correlates more strongly with pooled PCA than with per-period PCA validates the pooled approach: when the goal is temporal comparability, fitting on stacked data is the right choice.&lt;/p>
&lt;h3 id="validating-the-dynamics-changes-over-time">Validating the dynamics: changes over time&lt;/h3>
&lt;p>The level comparison above tests cross-sectional fit &amp;mdash; do the PCA-based indices rank regions correctly at a point in time? But the core promise of pooled PCA is capturing &lt;strong>dynamics&lt;/strong> &amp;mdash; changes over time. We now test whether the change in PCA-based HDI tracks the change in official SHDI.&lt;/p>
&lt;pre>&lt;code class="language-python"># Compute official SHDI change per region
shdi_wide = (df.loc[p1_mask, [&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_official&amp;quot;]]
.rename(columns={&amp;quot;shdi_official&amp;quot;: &amp;quot;shdi_p1&amp;quot;}))
shdi_wide = shdi_wide.merge(
df.loc[p2_mask, [&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_official&amp;quot;]]
.rename(columns={&amp;quot;shdi_official&amp;quot;: &amp;quot;shdi_p2&amp;quot;}),
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;]
)
shdi_wide[&amp;quot;shdi_change&amp;quot;] = shdi_wide[&amp;quot;shdi_p2&amp;quot;] - shdi_wide[&amp;quot;shdi_p1&amp;quot;]
# Merge with comparison table
compare_val = compare.merge(shdi_wide[[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;, &amp;quot;shdi_change&amp;quot;]],
on=[&amp;quot;region&amp;quot;, &amp;quot;country&amp;quot;])
# R² for changes
corr_pooled_change = compare_val[&amp;quot;hdi_change&amp;quot;].corr(compare_val[&amp;quot;shdi_change&amp;quot;])
r2_pooled_change = corr_pooled_change ** 2
corr_pp_change = compare_val[&amp;quot;pp_change&amp;quot;].corr(compare_val[&amp;quot;shdi_change&amp;quot;])
r2_pp_change = corr_pp_change ** 2
print(f&amp;quot;Pooled PCA change vs official SHDI change:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pooled_change:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pooled_change:.4f}&amp;quot;)
print(f&amp;quot;\nPer-period PCA change vs official SHDI change:&amp;quot;)
print(f&amp;quot; Pearson r: {corr_pp_change:.4f}&amp;quot;)
print(f&amp;quot; R-squared: {r2_pp_change:.4f}&amp;quot;)
print(f&amp;quot;\nR-squared difference (pooled - per-period): {r2_pooled_change - r2_pp_change:+.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Pooled PCA change vs official SHDI change:
Pearson r: 0.9982
R-squared: 0.9964
Per-period PCA change vs official SHDI change:
Pearson r: 0.9957
R-squared: 0.9913
R-squared difference (pooled - per-period): +0.0051
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_linewidth(0)
# Panel A: Pooled PCA change vs SHDI change
ax = axes[0]
ax.scatter(compare_val[&amp;quot;shdi_change&amp;quot;], compare_val[&amp;quot;hdi_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, alpha=0.7, zorder=3)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Official SHDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Pooled PCA HDI change&amp;quot;)
ax.set_title(f&amp;quot;Pooled PCA (R² = {r2_pooled_change:.4f})&amp;quot;)
# Panel B: Per-period PCA change vs SHDI change
ax = axes[1]
ax.scatter(compare_val[&amp;quot;shdi_change&amp;quot;], compare_val[&amp;quot;pp_change&amp;quot;],
color=STEEL_BLUE, edgecolors=DARK_NAVY, s=40, alpha=0.7, zorder=3)
ax.axhline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.axvline(0, color=GRID_LINE, linewidth=0.8, zorder=1)
ax.set_xlabel(&amp;quot;Official SHDI change (2019 - 2013)&amp;quot;)
ax.set_ylabel(&amp;quot;Per-period PCA HDI change&amp;quot;)
ax.set_title(f&amp;quot;Per-period PCA (R² = {r2_pp_change:.4f})&amp;quot;)
fig.suptitle(&amp;quot;Validation: which PCA method better captures development dynamics?&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_validation_changes.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_validation_changes.png" alt="Side-by-side scatter plots comparing pooled and per-period PCA HDI changes against official SHDI changes.">&lt;/p>
&lt;p>The change validation is even more compelling than the level validation. &lt;strong>Pooled PCA change achieves $R^2 = 0.9964$, outperforming per-period PCA change at $R^2 = 0.9913$.&lt;/strong> Both methods track the official SHDI dynamics remarkably well ($r &amp;gt; 0.99$), but pooled PCA is the tighter fit. The left panel shows pooled PCA changes falling almost exactly on a single straight line, with virtually no scatter. The right panel shows per-period PCA changes with slightly more dispersion, reflecting the noise introduced by re-centering each period&amp;rsquo;s baseline. Taken together, the level validation ($R^2$: 0.9823 vs 0.9750) and the change validation ($R^2$: 0.9964 vs 0.9913) consistently favor pooled PCA &amp;mdash; it better reproduces both the cross-sectional rankings and the temporal dynamics of the official Subnational Human Development Index.&lt;/p>
&lt;h2 id="15-replicating-with-scikit-learn">15. Replicating with scikit-learn&lt;/h2>
&lt;p>The pooled PCA pipeline with scikit-learn is nearly identical to the &lt;a href="https://carlos-mendez.org/post/python_pca/#12-replicating-the-analysis-with-scikit-learn">single-period pipeline from the previous tutorial&lt;/a>. The key insight is that sklearn&amp;rsquo;s &lt;code>fit_transform&lt;/code> on the stacked data IS pooled PCA &amp;mdash; no special panel-data library is needed.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# ── Configuration (change these for your own dataset) ────────────
CSV_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data_long.csv&amp;quot;
ID_COL = &amp;quot;region&amp;quot;
PERIOD_COL = &amp;quot;period&amp;quot;
POSITIVE_COLS = [&amp;quot;education&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;income&amp;quot;]
NEGATIVE_COLS = []
# Step 0: Load long-format panel data
df_sk = pd.read_csv(CSV_URL)
print(f&amp;quot;Loaded: {df_sk.shape[0]} rows, {df_sk.shape[1]} columns&amp;quot;)
# Step 1: Polarity adjustment
for col in NEGATIVE_COLS:
    df_sk[col + &amp;quot;_adj&amp;quot;] = -1 * df_sk[col]
adj_cols = POSITIVE_COLS + [col + &amp;quot;_adj&amp;quot; for col in NEGATIVE_COLS]
# Step 2: POOLED standardization (fit on ALL periods)
scaler = StandardScaler()
Z_sk = scaler.fit_transform(df_sk[adj_cols])
# Step 3-4: POOLED PCA (fit on ALL periods)
pca_sk = PCA(n_components=1)
df_sk[&amp;quot;pc1&amp;quot;] = pca_sk.fit_transform(Z_sk)[:, 0]
# Step 5-6: POOLED normalization (min/max across ALL periods)
df_sk[&amp;quot;pc1_index&amp;quot;] = (
(df_sk[&amp;quot;pc1&amp;quot;] - df_sk[&amp;quot;pc1&amp;quot;].min())
/ (df_sk[&amp;quot;pc1&amp;quot;].max() - df_sk[&amp;quot;pc1&amp;quot;].min())
)
df_sk.to_csv(&amp;quot;pc1_index_results.csv&amp;quot;, index=False)
print(f&amp;quot;\nPC1 weights: {pca_sk.components_[0].round(4)}&amp;quot;)
print(f&amp;quot;Variance explained: {pca_sk.explained_variance_ratio_.round(4)}&amp;quot;)
print(f&amp;quot;\nSaved: pc1_index_results.csv&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Loaded: 306 rows, 10 columns
PC1 weights: [0.5642 0.5448 0.6204]
Variance explained: [0.7242]
Saved: pc1_index_results.csv
&lt;/code>&lt;/pre>
&lt;p>The sklearn pipeline produces identical weights ($[0.5642, 0.5448, 0.6204]$) and variance explained (72.42%), with a maximum absolute difference of $2.00 \times 10^{-15}$ from our manual implementation.&lt;/p>
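&lt;p>One subtlety behind that agreement: eigenvectors are defined only up to sign, so sklearn&amp;rsquo;s PC1 can come out flipped relative to a manual eigen-decomposition. A self-contained sketch of the sign-alignment check, using simulated data rather than the SHDI panel:&lt;/p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Simulated 306 x 3 data: a common factor plus idiosyncratic noise
rng = np.random.default_rng(42)
common = rng.normal(size=(306, 1))
X = common + 0.8 * rng.normal(size=(306, 3))

Z = StandardScaler().fit_transform(X)

# Manual route: eigen-decomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Z.T, ddof=0))
w_manual = evecs[:, np.argmax(evals)]

# sklearn route: SVD under the hood, same principal direction
w_sk = PCA(n_components=1).fit(Z).components_[0]

# Align signs before comparing (flip if the vectors point opposite ways)
if np.dot(w_manual, w_sk) < 0:
    w_manual = -w_manual
print(np.allclose(w_manual, w_sk, atol=1e-8))  # True
```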
&lt;h2 id="16-application-space-time-analyses">16. Application: Space-time analyses&lt;/h2>
&lt;p>With a temporally comparable pooled PCA index in hand, we can now analyze development dynamics across South America. This section demonstrates two types of space-time analysis: mapping how the spatial distribution of development shifted between 2013 and 2019, and measuring how spatial inequality changed over the same period.&lt;/p>
&lt;h3 id="spatial-distribution-dynamics">Spatial distribution dynamics&lt;/h3>
&lt;p>Choropleth maps provide an intuitive way to visualize where development improved, stagnated, or declined. The key methodological choice is to compute the color breaks from the &lt;strong>initial period&lt;/strong> (2013) using the &lt;a href="https://pysal.org/mapclassify/generated/mapclassify.FisherJenks.html" target="_blank" rel="noopener">Fisher-Jenks natural breaks algorithm&lt;/a> and hold those breaks &lt;strong>constant&lt;/strong> in the 2019 map. This ensures that a color change between maps reflects a genuine shift in HDI, not a shifting classification scheme. If we re-computed breaks for each period, regions could change color simply because the overall distribution shifted, not because they individually improved.&lt;/p>
&lt;pre>&lt;code class="language-python">import geopandas as gpd
import mapclassify
import contextily as cx
# Load GeoJSON boundaries and merge pooled HDI using GDLcode
GEO_URL = &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_pca2/data.geojson&amp;quot;
gdf = gpd.read_file(GEO_URL)
hdi_2013 = df_pooled_p1[[&amp;quot;GDLcode&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_2013&amp;quot;})
hdi_2019 = df_pooled_p2[[&amp;quot;GDLcode&amp;quot;, &amp;quot;hdi&amp;quot;]].rename(columns={&amp;quot;hdi&amp;quot;: &amp;quot;hdi_2019&amp;quot;})
gdf = gdf.merge(hdi_2013, on=&amp;quot;GDLcode&amp;quot;)
gdf = gdf.merge(hdi_2019, on=&amp;quot;GDLcode&amp;quot;)
# Reproject to Web Mercator for basemap
gdf_3857 = gdf.to_crs(epsg=3857)
# Fisher-Jenks breaks from 2013 (5 classes)
fj = mapclassify.FisherJenks(gdf_3857[&amp;quot;hdi_2013&amp;quot;].values, k=5)
breaks = fj.bins.tolist()
# Extend upper break to cover 2019 max
max_val = max(gdf_3857[&amp;quot;hdi_2013&amp;quot;].max(), gdf_3857[&amp;quot;hdi_2019&amp;quot;].max())
if max_val &amp;gt; breaks[-1]:
    breaks[-1] = float(round(max_val + 0.001, 3))
# Apply adjusted breaks to 2019 (must come AFTER break extension)
fj_2019 = mapclassify.UserDefined(gdf_3857[&amp;quot;hdi_2019&amp;quot;].values, bins=breaks)
# Class transitions
classes_2013 = fj.yb
classes_2019 = fj_2019.yb
improved = (classes_2019 &amp;gt; classes_2013).sum()
stayed = (classes_2019 == classes_2013).sum()
declined = (classes_2019 &amp;lt; classes_2013).sum()
print(f&amp;quot;Fisher-Jenks breaks (from 2013): {[round(b, 3) for b in breaks]}&amp;quot;)
print(f&amp;quot;\nClass transitions (2013 → 2019):&amp;quot;)
print(f&amp;quot; Improved (moved up): {improved}&amp;quot;)
print(f&amp;quot; Stayed same: {stayed}&amp;quot;)
print(f&amp;quot; Declined (moved down): {declined}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Fisher-Jenks breaks (from 2013): [0.167, 0.449, 0.581, 0.73, 1.001]
Class transitions (2013 → 2019):
Improved (moved up): 40
Stayed same: 88
Declined (moved down): 25
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python"># Class labels
class_labels = []
lower = 0.0
for b in breaks:
    class_labels.append(f&amp;quot;{lower:.2f} – {b:.2f}&amp;quot;)
    lower = b
fig, axes = plt.subplots(1, 2, figsize=(16, 12))
fig.patch.set_facecolor(DARK_NAVY)
fig.patch.set_linewidth(0)
from matplotlib.patches import Patch
cmap = plt.cm.coolwarm
norm = plt.Normalize(vmin=0, vmax=len(breaks) - 1)
for ax, year_col, title, year_fj in [
    (axes[0], &amp;quot;hdi_2013&amp;quot;, &amp;quot;Pooled PCA HDI — 2013&amp;quot;, fj),
    (axes[1], &amp;quot;hdi_2019&amp;quot;, &amp;quot;Pooled PCA HDI — 2019&amp;quot;, fj_2019),
]:
    # Classify and assign colors manually
    year_classes = year_fj.yb
    colors = [cmap(norm(c)) for c in year_classes]
    gdf_3857.plot(
        ax=ax, color=colors,
        edgecolor=DARK_NAVY, linewidth=0.3,
    )
    cx.add_basemap(ax, source=cx.providers.CartoDB.DarkMatter, zoom=4, attribution=&amp;quot;&amp;quot;)
    ax.set_title(title, fontsize=14, color=WHITE_TEXT, pad=10)
    ax.set_axis_off()
    # Build legend manually with correct counts
    counts = np.bincount(year_fj.yb, minlength=len(breaks))
    handles = []
    for i, (cl, c) in enumerate(zip(class_labels, counts)):
        handles.append(Patch(facecolor=cmap(norm(i)), edgecolor=DARK_NAVY,
                             label=f&amp;quot;{cl} (n={c})&amp;quot;))
    leg = ax.legend(handles=handles, title=&amp;quot;HDI Class&amp;quot;, loc=&amp;quot;lower right&amp;quot;,
                    fontsize=16, title_fontsize=17)
    leg.set_frame_on(True)
    leg.get_frame().set_facecolor(&amp;quot;#1a1a2e&amp;quot;)
    leg.get_frame().set_edgecolor(LIGHT_TEXT)
    leg.get_frame().set_alpha(0.9)
    leg.get_frame().set_linewidth(1.5)
    for text in leg.get_texts():
        text.set_color(WHITE_TEXT)
    leg.get_title().set_color(WHITE_TEXT)
fig.suptitle(&amp;quot;Spatial distribution dynamics: Pooled PCA HDI\n&amp;quot;
&amp;quot;(Fisher-Jenks breaks from 2013 held constant)&amp;quot;,
fontsize=15, color=WHITE_TEXT, y=0.95)
plt.tight_layout(rect=[0, 0, 1, 0.93])
plt.savefig(&amp;quot;pca2_choropleth_hdi.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_choropleth_hdi.png" alt="Side-by-side choropleth maps of pooled PCA HDI for 2013 and 2019 with fixed Fisher-Jenks breaks.">&lt;/p>
&lt;p>The choropleth maps reveal clear geographic patterns in South American development. The Southern Cone (Chile, Argentina, Uruguay) and southern Brazil appear in the highest HDI classes (warm red tones on the &lt;code>coolwarm&lt;/code> scale), while the Amazon basin, interior Guyana, and parts of Bolivia occupy the lowest classes (cool blue tones). Between 2013 and 2019, &lt;strong>40 regions moved up&lt;/strong> at least one Fisher-Jenks class, &lt;strong>88 stayed in the same class&lt;/strong>, and &lt;strong>25 declined&lt;/strong>. The upward mobility is concentrated in the Andean countries (Peru, Bolivia, Colombia), where education gains shifted regions from the second to the third class. The declines are predominantly in Venezuelan states, visible as regions shifting from warmer mid-range tones toward cooler blues &amp;mdash; a direct cartographic reflection of Venezuela&amp;rsquo;s economic crisis. Because both maps use the same classification breaks, these color changes are directly interpretable: any region that changed color genuinely crossed a development threshold.&lt;/p>
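&lt;p>The fixed-break logic itself needs nothing beyond NumPy. A small sketch with &lt;code>np.digitize&lt;/code> and purely illustrative HDI values (not the breaks estimated above) mimics the upper-bound-inclusive classification that &lt;code>mapclassify&lt;/code> applies, and shows why holding breaks constant makes class transitions meaningful:&lt;/p>

```python
import numpy as np

# Upper class bounds held constant across both periods
# (illustrative values, not the Fisher-Jenks breaks from the real data)
breaks = np.array([0.45, 0.58, 0.73, 1.001])

hdi_2013 = np.array([0.40, 0.50, 0.60, 0.70, 0.90])
hdi_2019 = np.array([0.47, 0.49, 0.72, 0.66, 0.95])

# right=True assigns each value the index of the first break that is
# greater than or equal to it (upper-bound-inclusive classes)
c13 = np.digitize(hdi_2013, breaks, right=True)
c19 = np.digitize(hdi_2019, breaks, right=True)

improved = int((c19 > c13).sum())
stayed = int((c19 == c13).sum())
declined = int(len(c13) - improved - stayed)
print(improved, stayed, declined)  # → 1 4 0
```

&lt;p>Only the first region crosses a threshold (0.40 to 0.47 passes the 0.45 bound); the others improve within their class, so their color would not change.&lt;/p>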
&lt;h3 id="spatial-inequality-dynamics">Spatial inequality dynamics&lt;/h3>
&lt;p>The &lt;strong>Gini index&lt;/strong> measures inequality in the distribution of a variable across a population, ranging from 0 (perfect equality &amp;mdash; every region has the same value) to 1 (perfect inequality &amp;mdash; all development concentrated in a single region). Think of it as a single number that summarizes how unevenly a resource or outcome is distributed. By computing the Gini index for each indicator in each period, we can track whether development is converging (Gini falling &amp;mdash; regions becoming more similar) or diverging (Gini rising &amp;mdash; gaps widening).&lt;/p>
&lt;p>We use the &lt;a href="https://pysal.org/inequality/generated/inequality.gini.Gini.html" target="_blank" rel="noopener">Gini&lt;/a> class from PySAL&amp;rsquo;s &lt;a href="https://pysal.org/inequality/" target="_blank" rel="noopener">inequality&lt;/a> library, which provides a robust implementation of the Gini coefficient. The &lt;code>Gini(values).g&lt;/code> attribute returns the computed coefficient.&lt;/p>
&lt;pre>&lt;code class="language-python">from inequality.gini import Gini
# Compute Gini for each indicator and pooled HDI, per period
gini_rows = []
for period_label in [&amp;quot;Y2013&amp;quot;, &amp;quot;Y2019&amp;quot;]:
    mask = df[&amp;quot;period&amp;quot;] == period_label
    row = {&amp;quot;period&amp;quot;: period_label}
    for col in INDICATORS + [&amp;quot;hdi&amp;quot;]:
        row[col] = round(Gini(df.loc[mask, col].values).g, 4)
    gini_rows.append(row)
gini_df = pd.DataFrame(gini_rows).set_index(&amp;quot;period&amp;quot;)
# Add change row
change_row = gini_df.loc[&amp;quot;Y2019&amp;quot;] - gini_df.loc[&amp;quot;Y2013&amp;quot;]
change_row.name = &amp;quot;Change&amp;quot;
gini_df = pd.concat([gini_df, change_row.to_frame().T])
print(f&amp;quot;Gini index by indicator and period:&amp;quot;)
print(gini_df.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Gini index by indicator and period:
education health income hdi
Y2013 0.0655 0.0295 0.0549 0.1712
Y2019 0.0639 0.0318 0.0585 0.1795
Change -0.0016 0.0023 0.0036 0.0083
&lt;/code>&lt;/pre>
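&lt;p>For intuition, the Gini coefficient can be checked against its textbook definition as the relative mean absolute difference, $G = \sum_{i,j} |x_i - x_j| / (2 n^2 \bar{x})$, which is the formula PySAL&amp;rsquo;s implementation is based on. A self-contained check on hypothetical values (not the SHDI data):&lt;/p>

```python
import numpy as np

def gini_pairwise(x):
    """Gini as the relative mean absolute difference:
    G = sum_ij |x_i - x_j| / (2 * n**2 * mean(x))."""
    x = np.asarray(x, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * len(x) ** 2 * x.mean())

# Sanity checks on the definition (hypothetical values)
equal = np.array([0.5, 0.5, 0.5, 0.5])         # every region identical
concentrated = np.array([0.0, 0.0, 0.0, 1.0])  # one region holds everything
print(gini_pairwise(equal))         # 0.0: perfect equality
print(gini_pairwise(concentrated))  # 0.75, i.e. (n-1)/n with one holder
```

&lt;p>With a finite sample the maximum attainable Gini is $(n-1)/n$ rather than exactly 1, which is why the concentrated case yields 0.75 for $n = 4$.&lt;/p>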
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
fig.patch.set_linewidth(0)
labels = [&amp;quot;Education&amp;quot;, &amp;quot;Health&amp;quot;, &amp;quot;Income&amp;quot;, &amp;quot;Pooled HDI&amp;quot;]
cols = INDICATORS + [&amp;quot;hdi&amp;quot;]
vals_2013 = [gini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
vals_2019 = [gini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
x = np.arange(len(labels))
width = 0.3
bars1 = ax.bar(x - width/2, vals_2013, width, color=STEEL_BLUE,
edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
bars2 = ax.bar(x + width/2, vals_2019, width, color=WARM_ORANGE,
edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
            f&amp;quot;{bar.get_height():.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=9, color=LIGHT_TEXT)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
            f&amp;quot;{bar.get_height():.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=9, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=12)
ax.set_ylabel(&amp;quot;Gini Index&amp;quot;)
ax.set_title(&amp;quot;Spatial inequality dynamics: Gini index by indicator (2013 vs 2019)&amp;quot;)
ax.legend()
ax.set_ylim(0, ax.get_ylim()[1] * 1.15)
plt.savefig(&amp;quot;pca2_gini_dynamics.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_gini_dynamics.png" alt="Grouped bar chart showing Gini index for each indicator in 2013 and 2019.">&lt;/p>
&lt;p>The Gini analysis reveals a nuanced inequality story across South America&amp;rsquo;s sub-national regions. &lt;strong>Education is the only dimension that converged&lt;/strong> between 2013 and 2019 &amp;mdash; its Gini fell from 0.0655 to 0.0639 ($-0.0016$), meaning regions became slightly more equal in educational attainment. Health and income both &lt;strong>diverged&lt;/strong>: health inequality rose from 0.0295 to 0.0318 ($+0.0023$) and income inequality from 0.0549 to 0.0585 ($+0.0036$). The composite pooled PCA HDI shows an overall increase in inequality from 0.1712 to 0.1795 ($+0.0083$), driven primarily by the income and health dimensions. This tells a policy-relevant story: while South America made progress in reducing educational gaps across regions, the income decline was unevenly distributed &amp;mdash; some regions (particularly Venezuelan states) experienced far steeper economic setbacks than others, widening the income gap. The fact that overall HDI inequality increased despite educational convergence underscores that development progress is not uniform across dimensions, and a composite index like the pooled PCA HDI captures these cross-cutting dynamics in a single measure.&lt;/p>
&lt;h3 id="population-weighted-inequality">Population-weighted inequality&lt;/h3>
&lt;p>The unweighted Gini treats every region equally &amp;mdash; Potaro-Siparuni (population 10,000) carries the same weight as São Paulo (population 44 million). For policy analysis, we often care more about how many &lt;em>people&lt;/em> experience inequality, not how many &lt;em>regions&lt;/em>. A population-weighted Gini accounts for this by giving larger regions proportionally more influence. Since PySAL&amp;rsquo;s &lt;code>Gini&lt;/code> class does not support population weights, we implement the weighted Gini using the trapezoidal Lorenz curve approach.&lt;/p>
&lt;pre>&lt;code class="language-python">def weighted_gini(values, weights):
    &amp;quot;&amp;quot;&amp;quot;Compute the population-weighted Gini index using the Lorenz curve.
    Parameters
    ----------
    values : array-like — indicator values (e.g., HDI per region)
    weights : array-like — population weights (e.g., region population)
    Returns
    -------
    float — weighted Gini coefficient in [0, 1]
    &amp;quot;&amp;quot;&amp;quot;
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    # Cumulative population and value shares
    cum_w = np.cumsum(w) / np.sum(w)
    cum_vw = np.cumsum(v * w) / np.sum(v * w)
    # Prepend zero for trapezoidal integration
    cum_w = np.concatenate(([0], cum_w))
    cum_vw = np.concatenate(([0], cum_vw))
    # Area under Lorenz curve
    B = np.sum((cum_w[1:] - cum_w[:-1]) * (cum_vw[1:] + cum_vw[:-1]) / 2)
    return 1 - 2 * B
# Compute population-weighted Gini
wgini_rows = []
for period_label in [&amp;quot;Y2013&amp;quot;, &amp;quot;Y2019&amp;quot;]:
    mask = df[&amp;quot;period&amp;quot;] == period_label
    row = {&amp;quot;period&amp;quot;: period_label}
    for col in INDICATORS + [&amp;quot;hdi&amp;quot;]:
        row[col] = round(weighted_gini(
            df.loc[mask, col].values, df.loc[mask, &amp;quot;pop&amp;quot;].values
        ), 4)
    wgini_rows.append(row)
wgini_df = pd.DataFrame(wgini_rows).set_index(&amp;quot;period&amp;quot;)
wchange_row = wgini_df.loc[&amp;quot;Y2019&amp;quot;] - wgini_df.loc[&amp;quot;Y2013&amp;quot;]
wchange_row.name = &amp;quot;Change&amp;quot;
wgini_df = pd.concat([wgini_df, wchange_row.to_frame().T])
print(f&amp;quot;Population-weighted Gini index:&amp;quot;)
print(wgini_df.to_string())
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Population-weighted Gini index:
education health income hdi
Y2013 0.0525 0.0174 0.0359 0.1113
Y2019 0.0521 0.0186 0.0387 0.1156
Change -0.0004 0.0012 0.0028 0.0043
&lt;/code>&lt;/pre>
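&lt;p>A useful sanity check on the Lorenz-curve implementation: with equal weights, the trapezoidal formula reduces exactly to the pairwise (relative mean absolute difference) Gini, and shifting population weight toward mid-range regions lowers measured inequality. A self-contained sketch on hypothetical values:&lt;/p>

```python
import numpy as np

def weighted_gini(values, weights):
    # Trapezoidal Lorenz-curve Gini, same logic as the function above
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cum_w = np.concatenate(([0], np.cumsum(w) / w.sum()))
    cum_vw = np.concatenate(([0], np.cumsum(v * w) / (v * w).sum()))
    B = np.sum((cum_w[1:] - cum_w[:-1]) * (cum_vw[1:] + cum_vw[:-1]) / 2)
    return 1 - 2 * B

def gini_pairwise(x):
    # Unweighted Gini: relative mean absolute difference
    x = np.asarray(x, dtype=float)
    return np.abs(x[:, None] - x[None, :]).sum() / (2 * len(x) ** 2 * x.mean())

x = np.array([0.42, 0.55, 0.61, 0.78, 0.90])  # hypothetical HDI values
ones = np.ones_like(x)

# Equal weights: the Lorenz-curve Gini matches the pairwise Gini
print(np.isclose(weighted_gini(x, ones), gini_pairwise(x)))  # True

# Concentrating population in the mid-range regions lowers the index
pop = np.array([1.0, 10.0, 10.0, 10.0, 1.0])
print(weighted_gini(x, ones) > weighted_gini(x, pop))  # True
```

&lt;p>The second check mirrors the pattern in the table above: when most people live in moderately developed regions, the extremes carry less weight and measured inequality falls.&lt;/p>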
&lt;pre>&lt;code class="language-python">fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
fig.patch.set_linewidth(0)
labels = [&amp;quot;Education&amp;quot;, &amp;quot;Health&amp;quot;, &amp;quot;Income&amp;quot;, &amp;quot;Pooled HDI&amp;quot;]
cols = INDICATORS + [&amp;quot;hdi&amp;quot;]
x = np.arange(len(labels))
width = 0.3
# Panel A: Unweighted
ax = axes[0]
uw_13 = [gini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
uw_19 = [gini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
ax.bar(x - width/2, uw_13, width, color=STEEL_BLUE, edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
ax.bar(x + width/2, uw_19, width, color=WARM_ORANGE, edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for i, (v13, v19) in enumerate(zip(uw_13, uw_19)):
    ax.text(i - width/2, v13 + 0.002, f&amp;quot;{v13:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=8, color=LIGHT_TEXT)
    ax.text(i + width/2, v19 + 0.002, f&amp;quot;{v19:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=8, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=11)
ax.set_ylabel(&amp;quot;Gini Index&amp;quot;)
ax.set_title(&amp;quot;Unweighted Gini&amp;quot;)
ax.legend(fontsize=9)
# Panel B: Population-weighted
ax = axes[1]
pw_13 = [wgini_df.loc[&amp;quot;Y2013&amp;quot;, c] for c in cols]
pw_19 = [wgini_df.loc[&amp;quot;Y2019&amp;quot;, c] for c in cols]
ax.bar(x - width/2, pw_13, width, color=STEEL_BLUE, edgecolor=DARK_NAVY, label=&amp;quot;2013&amp;quot;)
ax.bar(x + width/2, pw_19, width, color=WARM_ORANGE, edgecolor=DARK_NAVY, label=&amp;quot;2019&amp;quot;)
for i, (v13, v19) in enumerate(zip(pw_13, pw_19)):
    ax.text(i - width/2, v13 + 0.002, f&amp;quot;{v13:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=8, color=LIGHT_TEXT)
    ax.text(i + width/2, v19 + 0.002, f&amp;quot;{v19:.4f}&amp;quot;, ha=&amp;quot;center&amp;quot;, va=&amp;quot;bottom&amp;quot;,
            fontsize=8, color=LIGHT_TEXT)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=11)
ax.set_title(&amp;quot;Population-weighted Gini&amp;quot;)
ax.legend(fontsize=9)
fig.suptitle(&amp;quot;Spatial inequality: unweighted vs. population-weighted Gini&amp;quot;,
fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(&amp;quot;pca2_gini_weighted_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;,
facecolor=DARK_NAVY, edgecolor=DARK_NAVY, pad_inches=0)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pca2_gini_weighted_comparison.png" alt="Side-by-side comparison of unweighted and population-weighted Gini indices.">&lt;/p>
&lt;p>The population-weighted Gini values are &lt;strong>substantially lower&lt;/strong> than their unweighted counterparts across all indicators and both periods. For example, the pooled HDI Gini drops from 0.1712 (unweighted) to 0.1113 (weighted) in 2013 &amp;mdash; a 35% reduction. This gap means that large-population regions (São Paulo, Buenos Aires, Bogota, Santiago) tend to cluster near the middle of the development distribution, while the extreme values (both high and low) are found in smaller regions. When we weight by population, the outlier regions matter less, and inequality appears lower because most South Americans live in moderately developed areas. The direction of change, however, is consistent: both weighted and unweighted Gini show education converging ($-0.0004$ weighted vs $-0.0016$ unweighted) while income ($+0.0028$ vs $+0.0036$) and overall HDI ($+0.0043$ vs $+0.0083$) diverge. The divergence is smaller in population-weighted terms, suggesting that the widening gaps are driven more by sparsely populated peripheral regions than by the major urban centers where most people live.&lt;/p>
&lt;h2 id="17-summary-results">17. Summary results&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Step&lt;/th>
&lt;th>Input&lt;/th>
&lt;th>Output&lt;/th>
&lt;th>Key Result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Stack&lt;/td>
&lt;td>2 periods $\times$ 153 regions&lt;/td>
&lt;td>306-row DataFrame&lt;/td>
&lt;td>Panel format ready&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Polarity&lt;/td>
&lt;td>Raw indicators&lt;/td>
&lt;td>Aligned indicators&lt;/td>
&lt;td>All positive (no flip needed)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Standardization&lt;/td>
&lt;td>306 rows&lt;/td>
&lt;td>Z-scores (pooled)&lt;/td>
&lt;td>Fixed baseline across periods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Covariance&lt;/td>
&lt;td>Z matrix&lt;/td>
&lt;td>3$\times$3 matrix&lt;/td>
&lt;td>Off-diagonals 0.44&amp;ndash;0.68&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Eigen-decomposition&lt;/td>
&lt;td>Cov matrix&lt;/td>
&lt;td>eigenvalues, eigenvectors&lt;/td>
&lt;td>PC1 captures 72.4%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Scoring&lt;/td>
&lt;td>Z $\times$ eigvec&lt;/td>
&lt;td>PC1 scores&lt;/td>
&lt;td>2019 mean &amp;gt; 2013 mean (+0.14)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Pooled Normalization&lt;/td>
&lt;td>PC1&lt;/td>
&lt;td>HDI (0&amp;ndash;1)&lt;/td>
&lt;td>Comparable across periods&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="18-discussion">18. Discussion&lt;/h2>
&lt;p>&lt;strong>Pooled PCA successfully builds a composite development index that is directly comparable across time periods.&lt;/strong> By standardizing with pooled means, computing a single set of eigenvector weights from stacked data, and normalizing with pooled min/max bounds, the index preserves genuine temporal dynamics. The net development shift of +0.14 PC1 units (reflecting education and health gains partially offset by income decline) is captured by pooled PCA but would be invisible under per-period PCA.&lt;/p>
&lt;p>The real South American data revealed that Income carries the highest eigenvector weight (0.620), meaning PCA gives Income more influence than Education (0.564) or Health (0.545) in the composite index. This data-driven weighting differs from the UNDP&amp;rsquo;s equal-weight geometric mean approach, yet the two methods agree closely ($r = 0.991$). The similarity arises because all three indicators are positively correlated and driven by the same broad development processes. The differences emerge in regions with unbalanced profiles &amp;mdash; for example, regions with very high health but low education may rank differently under PCA versus the geometric mean.&lt;/p>
&lt;p>The per-period approach disagrees with pooled PCA on the direction of change for 16 regions (10% of the sample). In each of these 16 cases, per-period PCA shows a decline while pooled PCA shows an improvement &amp;mdash; the shifting baseline erases genuine but modest gains. A policymaker using per-period PCA might conclude these regions are &amp;ldquo;falling behind&amp;rdquo; when in reality they made progress, just less than the shifting average.&lt;/p>
&lt;p>The income decline across South America between 2013 and 2019 makes the pooled approach particularly important. Per-period standardization would hide this real economic setback by re-centering income to zero each period. Pooled standardization preserves it, allowing researchers to see that income genuinely declined while education and health improved. This mixed signal is precisely the kind of nuance that development analysis must capture.&lt;/p>
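&lt;p>The re-centering problem is easy to demonstrate with simulated values (not the SHDI data): a uniform +0.05 gain across all regions vanishes under per-period standardization but survives pooling.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x_2013 = rng.normal(loc=0.60, scale=0.10, size=n)  # simulated indicator
x_2019 = x_2013 + 0.05                              # every region improves

# Per-period standardization: each period is re-centered to mean zero,
# so the common +0.05 gain disappears entirely
z13 = (x_2013 - x_2013.mean()) / x_2013.std()
z19 = (x_2019 - x_2019.mean()) / x_2019.std()
print(np.isclose(z19.mean() - z13.mean(), 0.0))  # True: gain erased

# Pooled standardization: one mean and std for the stacked data,
# so the 2019 scores sit visibly above the 2013 scores
pooled = np.concatenate([x_2013, x_2019])
mu, sd = pooled.mean(), pooled.std()
zp13 = (x_2013 - mu) / sd
zp19 = (x_2019 - mu) / sd
print(round(zp19.mean() - zp13.mean(), 4))  # roughly 0.05 / sd, clearly positive
```

&lt;p>This is the mechanism behind the 16 sign disagreements: per-period scores are always relative to that period&amp;rsquo;s own average, so region-level gains smaller than the average shift appear as declines.&lt;/p>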
&lt;h2 id="19-summary-and-next-steps">19. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Key takeaways:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method insight:&lt;/strong> Pooled PCA produces temporally comparable composite indices by fitting standardization and eigen-decomposition on stacked data. Pooled and per-period PCA disagree on the direction of HDI change for 16 of the 153 South American regions. The Spearman rank correlation between their improvement rankings is 0.982 &amp;mdash; high but not perfect, with consequential differences for specific regions.&lt;/li>
&lt;li>&lt;strong>Data insight:&lt;/strong> Income carries the highest PC1 weight (0.620) despite education having a wider range. PC1 captures 72.4% of variance &amp;mdash; lower than the 96% in simulated data, reflecting the genuine complexity of real development indicators. The PCA-based HDI correlates at $r = 0.991$ with the official SHDI, validating the approach.&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> PC1 captures only 72% of variance, meaning 28% of development variation is lost in the compression. PC2 (19%) might capture meaningful patterns (e.g., health vs income trade-offs). Also, the pooled approach assumes a stable correlation structure between 2013 and 2019 &amp;mdash; a strong assumption over a 6-year period that included significant economic volatility in the region.&lt;/li>
&lt;li>&lt;strong>Next step:&lt;/strong> Extend the analysis to more time periods (2000&amp;ndash;2019) using the full Global Data Lab time series. Explore PC2 interpretation for policy-relevant sub-dimensions. Consider factor analysis for more flexible loading structures, and compare results across different world regions.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations of this analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The data covers only South America. Development patterns in Sub-Saharan Africa or South Asia may produce different correlation structures and eigenvector weights.&lt;/li>
&lt;li>Two periods (2013 and 2019) is the minimum for temporal analysis. More periods would strengthen the pooled estimates and allow testing the constant-correlation assumption.&lt;/li>
&lt;li>The PCA-based index is relative to this specific sample. Adding or removing regions changes every score.&lt;/li>
&lt;li>Min-Max normalization is sensitive to outliers. The Potaro-Siparuni region of Guyana anchors the bottom and compresses the range for everyone else.&lt;/li>
&lt;/ul>
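&lt;p>The min-max sensitivity in the last bullet is straightforward to verify numerically with hypothetical PC1 scores: one severe low outlier compresses every other region into the top of the $[0, 1]$ range.&lt;/p>

```python
import numpy as np

def minmax(x):
    # Min-max normalization to [0, 1]
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical PC1 scores with one severe outlier at the bottom
scores = np.array([-6.0, 0.1, 0.3, 0.5, 0.8, 1.0])
idx = minmax(scores)
print(idx.round(3))  # five of six regions squeezed above 0.87

# Without the outlier, the same five regions spread over the full range
idx_no_outlier = minmax(scores[1:])
print(idx_no_outlier.round(3))
```

&lt;p>This is the Potaro-Siparuni effect in miniature: the outlier anchors the zero of the index, and all remaining variation is compressed into a narrow band near 1.&lt;/p>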
&lt;h2 id="20-exercises">20. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Explore PC2.&lt;/strong> The second principal component captures 18.8% of variance. Compute PC2 scores and plot them against PC1. What development pattern does PC2 capture? Which regions score high on PC1 but low on PC2 (or vice versa)?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test the constant-correlation assumption.&lt;/strong> Compute the correlation matrices separately for 2013 and 2019. How much do they differ? If the Income-Education correlation changed substantially, what would that imply for the validity of pooled PCA?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare with the UNDP methodology.&lt;/strong> The official SHDI uses a geometric mean: $SHDI = (Education \times Health \times Income)^{1/3}$. Compute this for all regions and compare the ranking with your PCA-based ranking. Where do the two methods disagree most, and why?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="21-references">21. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://carlos-mendez.org/post/python_pca/">Mendez, C. (2026). Introduction to PCA Analysis for Building Development Indicators.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1038/sdata.2019.38" target="_blank" rel="noopener">Smits, J. and Permanyer, I. (2019). The Subnational Human Development Database. &lt;em>Scientific Data&lt;/em>, 6, 190038.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://globaldatalab.org/shdi/" target="_blank" rel="noopener">Global Data Lab &amp;ndash; Subnational Human Development Index&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1098/rsta.2015.0202" target="_blank" rel="noopener">Jolliffe, I. T. and Cadima, J. (2016). Principal Component Analysis: A Review and Recent Developments. &lt;em>Philosophical Transactions of the Royal Society A&lt;/em>, 374(2065).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/oep/gpac022" target="_blank" rel="noopener">Peiro-Palomino, J., Picazo-Tadeo, A. J., and Rios, V. (2023). Social Progress around the World: Trends and Convergence. &lt;em>Oxford Economic Papers&lt;/em>, 75(2), 281&amp;ndash;306.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hdr.undp.org/data-center/human-development-index" target="_blank" rel="noopener">UNDP (2024). Human Development Index &amp;ndash; Technical Notes.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; PCA Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; StandardScaler Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://carlos-mendez.org/publication/20210318-economia/" target="_blank" rel="noopener">Mendez, C. and Gonzales, E. (2021). Human Capital Constraints, Spatial Dependence, and Regionalization in Bolivia. &lt;em>Economia&lt;/em>, 44(87).&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make this post more accessible to students. Nevertheless, the content may still contain errors; caution is needed before applying it to real research projects.&lt;/p></description></item></channel></rss>