Institutions, Settler Mortality, and IV — Interactive Lab

A pedagogical companion to the post "Do Institutions Cause Prosperity? An IV Tutorial in Python".

Why instrument institutions with settler mortality?

Cross-country plots show that rich countries have better property-rights institutions, but the slope cannot prove that institutions cause prosperity: reverse causality, omitted variables, and measurement error all bias naive OLS. Acemoglu, Johnson and Robinson (2001) propose a famous instrument: the mortality rate of European settlers circa 1500–1900. Mortality shaped which colonies became extractive and which became settler colonies, and thus what kind of institutions those countries inherited, but it is assumed to have no effect on 1995 GDP except through institutions. That untestable assumption is the exclusion restriction.

On the AJR 64-country sample, IV gives β̂ = 0.944, which is 81% larger than the OLS slope of 0.522. The Wu-Hausman test rejects OLS at p < 0.0001. The first-stage F is 16.85: borderline-strong, just above the Stock-Yogo 10% threshold of 16.38. This app lets you turn the dials. In four tabs you will inspect the identification DAG; slide instrument strength and watch IV move from precise to wild; race OLS against IV under simulated confounding; and toggle the post's 14 specifications side-by-side on a single forest plot.

The IV identification strategy at a glance

The diagram below is the heart of every IV paper. The instrument Z (settler mortality) is allowed to affect the endogenous regressor X (institutions); X is allowed to affect the outcome Y (log GDP); but Z must not have a direct arrow into Y. Unobserved confounders U can freely contaminate X and Y — that is the whole reason OLS is biased. The orange dot is a particle traveling the allowed Z → X → Y path. The dashed red arrow shows the path forbidden by the exclusion restriction.
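As a rough illustration of the diagram, the sketch below simulates a linear version of it and checks the two properties the arrows encode. The function name `simulate_dag`, the linear functional form, the standard-normal shocks, and the parameter defaults are all assumptions made for illustration; they are not taken from the post or from this app's internals.

```python
import numpy as np

def simulate_dag(n=64, pi=0.6, gamma=1.0, beta=0.94, seed=0):
    """Draw one dataset from a hypothetical linear DAG: Z -> X -> Y, with U -> X and U -> Y."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                          # instrument
    u = rng.standard_normal(n)                          # unobserved confounder
    x = pi * z + gamma * u + rng.standard_normal(n)     # Z and U both move X
    y = beta * x + gamma * u + rng.standard_normal(n)   # no Z term: exclusion restriction holds
    return z, x, y, u

# A large n makes the population correlations visible through the noise.
z, x, y, u = simulate_dag(n=100_000)
err = y - 0.94 * x                        # structural error: gamma * U plus noise
corr_z = np.corrcoef(z, err)[0, 1]        # close to zero: Z is a valid instrument here
corr_x = np.corrcoef(x, err)[0, 1]        # clearly positive: X is endogenous
print(f"corr(Z, error) = {corr_z:.3f}, corr(X, error) = {corr_x:.3f}")
```

In real data the structural error is never observed, which is why the exclusion restriction cannot be checked this way outside a simulation.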

Tab 2

Instrument Strength

Drag the first-stage slope of X on Z from weak to strong. Watch the IV slope's confidence band collapse, and notice when OLS bias still dominates.

Tab 3

OLS vs IV Simulator

Confound treatment and outcome with shared unobservables. Run 100 simulations to compare the OLS and IV sampling distributions side-by-side.

Tab 4

Forest Plot — 14 specifications

OLS baseline, IV with logem4, IV with colonial/legal/religious/geographic/health controls, and IV with alternative instruments — all on one axis with 95% CIs.

Glossary (open a card if a term is unfamiliar)

Endogeneity
A regressor is endogenous when it is correlated with the error term. In our context, $\textit{avexpr}$ is endogenous: it is jointly determined with GDP, shares unobserved confounders with GDP, and is measured imperfectly. The Wu-Hausman test rejects OLS consistency at $p < 0.0001$ ($F = 24.22$).
Instrumental variable (Z)
A variable that affects the outcome $Y$ only through the endogenous regressor $X$. Three conditions: relevance (Z and X correlated), exclusion (Z has no direct arrow into Y), exogeneity (Z is uncorrelated with the error term). Coin-flip eligibility for a drug trial: the flip influences recovery only through whether the patient took the drug.
Two-Stage Least Squares (2SLS)
Stage 1: regress endogenous $X$ on instrument $Z$ to get $\hat X$. Stage 2: regress $Y$ on $\hat X$ to get the IV coefficient. Filtering muddy water through a sieve — the sieve catches the confounding; what passes through is the clean signal.
First stage and reduced form
The first stage regresses $X$ on $Z$. The reduced form regresses $Y$ directly on $Z$. With one instrument, the 2SLS coefficient equals the ratio $\hat\beta_{IV} = \hat\beta_{RF} / \hat\beta_{FS}$. In the post: $(-0.573) / (-0.607) \approx 0.944$.
Weak instrument
An instrument that only weakly predicts the endogenous regressor. Conventional rule (Staiger-Stock 1997): first-stage $F > 10$. Stock-Yogo (2005) tighten this to $F > 16.38$ for 10% maximal IV size distortion. Weak instruments produce IV estimates with huge SEs and substantial finite-sample bias — a radio antenna picking up mostly static.
LATE vs ATE
Under heterogeneous effects, 2SLS does not identify the population ATE. Imbens-Angrist (1994): 2SLS identifies the Local Average Treatment Effect — the effect for "compliers", units whose treatment would change with the instrument. Our 0.944 applies to countries whose institutions would have been different had settler mortality been different — not to never-colonized countries.
Exclusion restriction
The untestable heart of every IV: the instrument $Z$ affects $Y$ only through $X$. If settler mortality also directly affects modern GDP via, say, a malaria channel, the exclusion restriction fails. The dashed red arrow in the DAG above.
Hansen J / Sargan test
With more instruments than endogenous regressors, the joint exogeneity of the instrument set is partially testable. If two instruments disagree on the causal effect, the test rejects. In Panel C of Table 8, Hansen J p-values of 0.18–0.79 across five alternative instrument pairs uniformly fail to reject, which is modest support for AJR's exclusion restriction.
Wu-Hausman endogeneity test
A formal test of whether OLS is consistent, taking the instrument's validity as given. It compares the OLS and IV estimates: if the gap is large relative to the standard errors, OLS is rejected as biased. $F = 24.22$, $p < 0.0001$ in the AJR main spec: the data say OLS is biased, so IV is empirically warranted.
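The ratio identity from the "First stage and reduced form" card can be checked numerically. The sketch below uses simulated data (the data-generating process, the seed, and the helper name `slope` are illustrative assumptions, not the post's code) and confirms that the reduced-form/first-stage ratio matches the two-step 2SLS slope; the last line repeats the post's own arithmetic.

```python
import numpy as np

def slope(a, b):
    """OLS slope from regressing b on a (with an intercept)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

rng = np.random.default_rng(1)
n = 5_000
z = rng.standard_normal(n)
u = rng.standard_normal(n)                      # shared unobservable
x = -0.6 * z + u + rng.standard_normal(n)       # negative first stage, as in AJR
y = 0.94 * x + u + rng.standard_normal(n)

first_stage = slope(z, x)                       # X on Z
reduced_form = slope(z, y)                      # Y on Z
wald = reduced_form / first_stage               # ratio estimator

# Two-step version: fitted X from stage 1, then Y on fitted X.
x_hat = x.mean() + first_stage * (z - z.mean())
two_sls = slope(x_hat, y)

post_ratio = -0.573 / -0.607                    # coefficients quoted in the post
print(f"ratio = {wald:.4f}, 2SLS = {two_sls:.4f}, post ratio = {post_ratio:.3f}")
```

The two-step route gives the right point estimate but the wrong standard errors if the second-stage OLS errors are used naively, so a dedicated IV routine is preferable for inference.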

Instrument strength — when IV works and when it breaks

The engine of IV is the first-stage relationship $X = \pi Z + v$. If $\pi$ is large (strong instrument), Z extracts plenty of clean variation in X and the IV estimator is precise. If $\pi$ is small (weak instrument), the same noise gets amplified by the $1/\pi$ rescaling and, in finite samples, the IV estimate can land further from the truth than OLS. Drag the instrument strength slider and watch the teal IV line move from tightly tracking the true β to wildly swinging. The orange OLS line, by contrast, stays stubbornly biased in the same place: instrument strength does not help OLS.
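A short derivation makes the $1/\pi$ rescaling explicit. It assumes a single instrument and a structural equation $Y = \beta X + \varepsilon$ with intercepts suppressed; the symbol $\varepsilon$ for the structural error is introduced here and is not part of the app's notation.

$$
\hat\beta_{IV} = \frac{\widehat{\mathrm{Cov}}(Z, Y)}{\widehat{\mathrm{Cov}}(Z, X)} = \beta + \frac{\widehat{\mathrm{Cov}}(Z, \varepsilon)}{\widehat{\mathrm{Cov}}(Z, X)}, \qquad \widehat{\mathrm{Cov}}(Z, X) \approx \pi \, \widehat{\mathrm{Var}}(Z)
$$

$$
\hat\beta_{IV} - \beta \approx \frac{1}{\pi} \cdot \frac{\widehat{\mathrm{Cov}}(Z, \varepsilon)}{\widehat{\mathrm{Var}}(Z)}
$$

The sampling noise in $\widehat{\mathrm{Cov}}(Z, \varepsilon)$ does not depend on $\pi$, so halving $\pi$ roughly doubles the estimation error.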

  • Sample size n: more countries → tighter OLS and tighter IV. The AJR sample has n = 64.
  • Instrument strength π: the first-stage slope of $X$ on $Z$. AJR's $\hat\pi = -0.607$, so about 0.6 in absolute value.
  • Confounding strength γ: how much the unobserved $U$ pushes both $X$ and $Y$. γ = 0 → OLS is unbiased; γ > 0 → OLS is biased.
  • True β: the causal effect of $X$ on $Y$ we are trying to recover. Default = 0.94 (AJR's IV estimate).
Readouts: OLS β̂ with SE; IV β̂ with SE; first-stage F (Stock-Yogo 10% threshold = 16.38); true β = 0.94, held fixed for comparison.

What to look for

  • OLS is biased whenever γ > 0. Slide γ from 0 to 1.5: the orange OLS line drifts well above the dashed true-β line, while the teal IV line stays roughly centred. The IV-OLS gap is the diagnostic that something is contaminating OLS.
  • Strong instruments give precise IV. With $\pi = 1.5$ (very strong) the IV slope hugs the truth tightly. With $\pi = 0.1$ (very weak), the IV slope swings wildly and the first-stage F drops below 10 — read the SE, not the point estimate.
  • OLS is unaffected by π. The OLS line ignores the instrument entirely. Improving your instrument doesn't help OLS — only switching to IV does.
  • The "bias-precision trade-off" of IV. When γ is large but π is weak, IV may have more total error (bias² + variance) than OLS. This is the AJR caveat: in the Table 7 health-channel specs, first-stage F drops below 5 and the IV CIs widen to uselessness.
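The slider experiment can be approximated offline with a few lines of NumPy. This is a minimal sketch under an assumed data-generating process (linear, standard-normal shocks, γ entering both equations); the function name `fit_once` and all defaults are hypothetical and may differ from what the app actually simulates.

```python
import numpy as np

def fit_once(n=64, pi=0.6, gamma=1.0, beta=0.94, seed=0):
    """Simulate one dataset and return (OLS slope, IV slope, first-stage F)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    u = rng.standard_normal(n)
    x = pi * z + gamma * u + rng.standard_normal(n)
    y = beta * x + gamma * u + rng.standard_normal(n)

    # Demean so that slopes are simple ratios of cross-products.
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    ols = (xc @ yc) / (xc @ xc)
    pi_hat = (zc @ xc) / (zc @ zc)
    iv = (zc @ yc) / (zc @ xc)

    # With one instrument, the first-stage F is the squared t-statistic of pi_hat.
    resid = xc - pi_hat * zc
    se_pi = np.sqrt((resid @ resid) / (n - 2) / (zc @ zc))
    f_stat = (pi_hat / se_pi) ** 2
    return ols, iv, f_stat

for pi in (0.1, 0.6, 1.5):
    ols, iv, f_stat = fit_once(pi=pi)
    print(f"pi={pi:.1f}  OLS={ols:.3f}  IV={iv:.3f}  first-stage F={f_stat:.1f}")
```

Changing `seed` while holding `pi` at 0.1 shows how unstable a single weak-instrument estimate is, which is the point of reading the SE rather than the point estimate.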

OLS vs IV — bias and variance over many simulations

A single estimate is noisy. The deeper question is: on average, which method finds the truth? Draw the same DGP 100 times with fresh random shocks, estimate β by both OLS and IV each time, and compare the two sampling distributions. The orange OLS histogram clusters around a biased centre (its mean drifts from the truth by γ). The teal IV histogram clusters near the true β — wider, but right on target. Click "Run 100 simulations" below.

  • Sample size n: capped at 300 so that 100 simulations finish in under a second.
  • Instrument strength π: higher π → tighter IV histogram.
  • Confounding strength γ: higher γ → bigger OLS bias.
  • True β: default = 0.94, the AJR main IV estimate.

OLS

Regress $Y$ on $X$ directly. Ignores the instrument.

Readouts: single β̂; mean β̂ over 100 sims; sd(β̂); bias.

IV (2SLS)

Use $\hat X = \hat\pi Z$ from the first stage as a clean stand-in for $X$.

Readouts: single β̂; mean β̂ over 100 sims; sd(β̂); bias.

Run the experiment

100 fresh datasets with the parameters above. Each gives one OLS β̂ and one IV β̂. The histogram below stacks them.

What to look for

  • OLS is biased on average. Its histogram centre drifts above the dashed true-β line by roughly γ. The bias does not vanish with more sims — it is systematic.
  • IV is approximately unbiased. Its histogram centres on the truth (subject to a small finite-sample bias that shrinks with $n \cdot \pi^2$).
  • IV pays for unbiasedness with variance. The teal histogram is wider than the orange one. When π is small, the variance penalty can be huge — the histogram spreads across the chart.
  • The post's AJR result lives at the right edge of this trade-off. With $n = 64$ and $\pi \approx 0.6$, IV is roughly unbiased but the SE is large (0.176 vs OLS's 0.05). The 81% gap between OLS (0.522) and IV (0.944) is interpretable only because the instrument is borderline strong.
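The repeated-sampling experiment described in this section can be sketched as follows. The data-generating process, the function name `monte_carlo`, and the defaults are assumptions chosen to resemble the app's sliders, not a copy of its code. The median is printed alongside the mean because a just-identified IV estimator has heavy tails, so a single extreme draw can dominate the mean.

```python
import numpy as np

def monte_carlo(n_sims=100, n=64, pi=0.6, gamma=1.0, beta=0.94, seed=0):
    """Return arrays of OLS and IV slope estimates over n_sims fresh datasets."""
    rng = np.random.default_rng(seed)
    ols, iv = np.empty(n_sims), np.empty(n_sims)
    for s in range(n_sims):
        z = rng.standard_normal(n)
        u = rng.standard_normal(n)                          # shared unobservable
        x = pi * z + gamma * u + rng.standard_normal(n)
        y = beta * x + gamma * u + rng.standard_normal(n)
        zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
        ols[s] = (xc @ yc) / (xc @ xc)                      # ignores the instrument
        iv[s] = (zc @ yc) / (zc @ xc)                       # ratio of covariances
    return ols, iv

ols, iv = monte_carlo()
for name, est in (("OLS", ols), ("IV", iv)):
    print(f"{name}: mean={est.mean():.3f}  median={np.median(est):.3f}  "
          f"sd={est.std(ddof=1):.3f}  bias={est.mean() - 0.94:.3f}")
```

Raising `n_sims` tightens the estimate of each histogram's centre but does not move the OLS centre toward the truth, which is what "systematic" bias means here.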

The post's headline estimates — 14 specifications interactively

These numbers come straight from the post's tables for the AJR 64-country base sample. Toggle specifications to compare. Hover a point for SE, 95% CI, and first-stage F. Orange = OLS; steel blue = IV with logem4; light teal = IV in weak-instrument territory; pale grey-blue = IV with alternative instruments.

What to look for

  • OLS sits at ≈ 0.52 with a tight CI. Every IV variant with $\log\textit{em4}$ sits at 0.63–1.08 with overlapping CIs. The 81% IV-over-OLS gap is the post's headline.
  • Robustness controls leave the IV coefficient in the 0.7–1.1 band. Colonial-identity, legal-origin, religion, climate, and ethnic-fragmentation controls do not eliminate the institutions effect.
  • Health-channel specs are the place to retain doubt. IV + malaria (β̂ = 0.69, F = 3.98) and IV + life expectancy (β̂ = 0.63, F = 4.23) drop the first-stage F well below 5 — the CIs balloon, and the data alone cannot adjudicate whether health is a "bad control" or a real exclusion violation.
  • Alternative instruments agree. Replacing $\log\textit{em4}$ with $\textit{euro1900}$, $\textit{cons00a}$, or $\textit{democ00a}$ gives 0.71–0.87 — the post's Hansen J p-values 0.18–0.79 fail to reject joint exogeneity, modestly supporting AJR.
  • Stock-Yogo 10% threshold is 16.38. Tooltips show first-stage F for each IV spec. Filter by F-strength via the method toggles to compare strong vs weak specs.
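The intervals and the headline gap on the forest plot follow from simple arithmetic on the reported estimates. The sketch below reproduces them for the OLS baseline (β̂ = 0.522, SE = 0.050) and the main IV spec (β̂ = 0.944, SE = 0.176); the normal critical value 1.96 and the helper name `ci95` are assumptions, chosen because they match the IV interval the post reports.

```python
ols_beta, ols_se = 0.522, 0.050     # OLS baseline, n = 64
iv_beta, iv_se = 0.944, 0.176       # main IV spec with logem4, n = 64

def ci95(beta, se, crit=1.96):
    """Normal-approximation 95% confidence interval."""
    return beta - crit * se, beta + crit * se

ols_lo, ols_hi = ci95(ols_beta, ols_se)
iv_lo, iv_hi = ci95(iv_beta, iv_se)

iv_over_ols = iv_beta / ols_beta - 1      # how much larger IV is than OLS
ols_below_iv = 1 - ols_beta / iv_beta     # how far OLS falls short of IV

print(f"OLS 95% CI: [{ols_lo:.3f}, {ols_hi:.3f}]")
print(f"IV  95% CI: [{iv_lo:.3f}, {iv_hi:.3f}]")
print(f"IV exceeds OLS by {iv_over_ols:.0%}; OLS is {ols_below_iv:.0%} below IV")
```

The two percentages describe the same gap from different baselines, so it matters which one a sentence is quoting.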

Outcomes

Methods (uncheck to declutter)

Connecting back to Tab 2 & Tab 3

The OLS vs IV gap you experimented with on simulated data is the same machinery the post applies to the 64 ex-colonies — just with real data instead of a controlled DGP:

  • OLS (base, n=64): β̂ = 0.522, SE = 0.050 — narrow CI, but biased downward by measurement error.
  • IV (main, logem4, n=64): β̂ = 0.944, SE = 0.176, 95% CI = [0.599, 1.289], first-stage F = 16.85 — wider CI, approximately unbiased.
  • IV + euro1900 (n=63): β̂ = 0.870, F = 44.03 — different instrument, similar message.
  • Wu-Hausman test: $F = 24.22$, $p < 0.0001$ — empirically, OLS is biased.
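One common way to compute a Wu-Hausman statistic is the regression-based (control-function) form: add the first-stage residual to the outcome regression and test its coefficient. The sketch below applies that form to simulated data; the helper names `ols_fit` and `wu_hausman_f`, the data-generating process, and the use of classical standard errors are assumptions, and the post may compute its $F = 24.22$ differently.

```python
import numpy as np

def ols_fit(design, y):
    """Return coefficients and classical standard errors for y ~ design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef, np.sqrt(np.diag(cov))

def wu_hausman_f(y, x, z):
    """Regression-based endogeneity test with one regressor and one instrument."""
    ones = np.ones(len(y))
    fs_design = np.column_stack([ones, z])
    fs_coef, _ = ols_fit(fs_design, x)
    v_hat = x - fs_design @ fs_coef                 # first-stage residual
    coef, se = ols_fit(np.column_stack([ones, x, v_hat]), y)
    return (coef[2] / se[2]) ** 2                   # F = t^2 on the residual term

rng = np.random.default_rng(2)
n = 2_000
z = rng.standard_normal(n)
u = rng.standard_normal(n)
x = 0.6 * z + u + rng.standard_normal(n)
y_endog = 0.94 * x + u + rng.standard_normal(n)     # U enters Y: X is endogenous
y_exog = 0.94 * x + rng.standard_normal(n)          # U absent from Y: OLS is fine

f_endog = wu_hausman_f(y_endog, x, z)
f_exog = wu_hausman_f(y_exog, x, z)
print(f"F with confounding = {f_endog:.1f}, F without = {f_exog:.2f}")
```

A large statistic says OLS and IV disagree by more than sampling noise would explain; it does not by itself show that the instrument is valid.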

The takeaway: in this dataset, the IV estimate is 81% larger than the OLS estimate; equivalently, OLS understates the IV slope by about 45%. That gap is consistent with measurement error in the institutional-quality index attenuating OLS, and de-noising it via IV reveals a steeper causal slope, not a shallower one.

Why are the weak-IV specs still on the chart?

Two specs (IV + malaria, IV + life exp.) have first-stage F < 5, well below any weak-IV threshold. Honest reporting includes them — they are not artifacts. The lesson: read the CI, not the point estimate. Both weak-IV CIs overlap the strong-IV band, so they do not contradict the headline, but their point estimates carry low information. The post discusses this in §9 under "the trickiest case — health channels".