Why instrument institutions with settler mortality?
Cross-country plots show that rich countries have better property-rights institutions, but the slope alone cannot prove that institutions cause prosperity: reverse causality, omitted variables, and measurement error all bias naive OLS. Acemoglu, Johnson and Robinson (2001) propose a famous instrument: the mortality rate of European settlers circa 1500–1900. Mortality shaped which colonies became extractive states and which became settler colonies, and thus what kind of institutions those countries inherited, but it is assumed to affect 1995 GDP only through institutions. That untestable assumption is the exclusion restriction.
On the AJR 64-country sample, IV gives β̂ = 0.944, which is 81% larger than the OLS slope of 0.522. The Wu-Hausman test rejects the exogeneity of institutions, and with it the OLS estimate, at p < 0.0001. The first-stage F is 16.85: borderline-strong, just above the Stock-Yogo 10% threshold of 16.38. This app lets you turn the dials. In four tabs you will inspect the identification DAG; slide instrument strength and watch IV move from precise to wild; race OLS against IV under simulated confounding; and toggle the post's 14 specifications side-by-side on a single forest plot.
The IV identification strategy at a glance
The diagram below is the heart of every IV paper. The instrument Z (settler mortality) is allowed to affect the endogenous regressor X (institutions); X is allowed to affect the outcome Y (log GDP); but Z must not have a direct arrow into Y. Unobserved confounders U can freely contaminate X and Y — that is the whole reason OLS is biased. The orange dot is a particle traveling the allowed Z → X → Y path. The dashed red arrow shows the path forbidden by the exclusion restriction.
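In equations, the diagram corresponds to a model along the following lines. This is a sketch under simplifying assumptions: the linear form and the error symbols $v$ and $\varepsilon$ (both of which may contain the confounder U) are chosen here for illustration and are not taken from the post.

$$
\begin{aligned}
X &= \pi Z + v, \\
Y &= \beta X + \varepsilon, \qquad \operatorname{Cov}(Z,\varepsilon) = 0, \quad \operatorname{Cov}(v,\varepsilon) \neq 0, \\
\beta_{\text{OLS}} &= \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \beta + \frac{\operatorname{Cov}(X,\varepsilon)}{\operatorname{Var}(X)} \neq \beta, \\
\beta_{\text{IV}} &= \frac{\operatorname{Cov}(Z,Y)}{\operatorname{Cov}(Z,X)} = \beta + \frac{\operatorname{Cov}(Z,\varepsilon)}{\operatorname{Cov}(Z,X)} = \beta .
\end{aligned}
$$

The condition $\operatorname{Cov}(Z,\varepsilon) = 0$ is the missing Z → Y arrow, and $\operatorname{Cov}(v,\varepsilon) \neq 0$ is the contamination by U. The last line also requires $\pi \neq 0$, which is instrument relevance.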
Instrument Strength
Drag the first-stage slope of Z on X from weak to strong. Watch the IV slope's confidence band collapse, and notice when OLS bias still dominates.
OLS vs IV Simulator
Confound treatment and outcome with shared unobservables. Run 100 simulations to compare the OLS and IV sampling distributions side-by-side.
Forest Plot — 14 specifications
OLS baseline, IV with logem4, IV with colonial/legal/religious/geographic/health controls, and IV with alternative instruments — all on one axis with 95% CIs.
Glossary (open a card if a term is unfamiliar)
Endogeneity
Instrumental variable (Z)
Two-Stage Least Squares (2SLS)
First stage and reduced form
Weak instrument
LATE vs ATE
Exclusion restriction
Hansen J / Sargan test
Wu-Hausman endogeneity test
Instrument strength — when IV works and when it breaks
The engine of IV is the first-stage relationship $X = \pi Z + v$. If $\pi$ is large (strong instrument), Z extracts plenty of clean variation in X and the IV estimator is precise. If $\pi$ is small (weak instrument), the same noise gets amplified by the $1/\pi$ rescaling; in finite samples the IV estimator becomes erratic and is pulled toward OLS, so it can end up less reliable than OLS. Drag the instrument strength slider and watch the teal IV line move from tightly tracking the true β to wildly swinging. The orange OLS line, by contrast, stays stubbornly biased in the same place: instrument strength does not help OLS.
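The $1/\pi$ rescaling can be made explicit in the single-instrument case. The sketch below assumes an outcome equation $Y = \beta X + \varepsilon$ alongside the first stage; the symbol $\varepsilon$ and the hats denoting sample moments are notation introduced here for illustration.

$$
\hat\beta_{\text{IV}}
= \frac{\widehat{\operatorname{Cov}}(Z,Y)}{\widehat{\operatorname{Cov}}(Z,X)}
= \beta + \frac{\widehat{\operatorname{Cov}}(Z,\varepsilon)}{\pi\,\widehat{\operatorname{Var}}(Z) + \widehat{\operatorname{Cov}}(Z,v)}
\approx \beta + \frac{1}{\pi}\cdot\frac{\widehat{\operatorname{Cov}}(Z,\varepsilon)}{\widehat{\operatorname{Var}}(Z)} .
$$

The numerator is pure sampling noise under the exclusion restriction. Dividing it by a small $\pi$ inflates it, and once $\pi\,\widehat{\operatorname{Var}}(Z)$ is comparable to the noise term $\widehat{\operatorname{Cov}}(Z,v)$ the approximation itself breaks down.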
What to look for
- OLS is biased whenever γ > 0, where γ is the strength of the unobserved confounding. Slide γ from 0 to 1.5: the orange OLS line drifts well above the dashed true-β line, while the teal IV line stays roughly centred. The gap between IV and OLS is the diagnostic that something is contaminating OLS.
- Strong instruments give precise IV. With $\pi = 1.5$ (very strong) the IV slope hugs the truth tightly. With $\pi = 0.1$ (very weak), the IV slope swings wildly and the first-stage F drops below 10 — read the SE, not the point estimate.
- OLS is unaffected by π. The OLS line ignores the instrument entirely. Improving your instrument doesn't help OLS — only switching to IV does.
- The "bias-precision trade-off" of IV. When γ is large but π is weak, IV may have more total error (bias² + variance) than OLS. This is the AJR caveat: in the Table 7 health-channel specs, first-stage F drops below 5 and the IV CIs widen to uselessness.
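The points above can be reproduced offline with a small simulation. This is a minimal sketch, not the app's own code: the data-generating process (`X = pi*Z + U`, `Y = beta*X + gamma*U + e`), the function names, and the default parameter values are all assumptions made for illustration.

```python
import random
import statistics

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def simulate(n, pi, gamma, beta, rng):
    """Hypothetical DGP: X = pi*Z + U and Y = beta*X + gamma*U + e."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
    x = [pi * zi + ui for zi, ui in zip(z, u)]
    y = [beta * xi + gamma * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return z, x, y

def first_stage_f(z, x):
    """First-stage F for a single instrument (the squared t statistic)."""
    n = len(z)
    mz, mx = statistics.fmean(z), statistics.fmean(x)
    pi_hat = cov(z, x) / cov(z, z)
    resid_ss = sum(((xi - mx) - pi_hat * (zi - mz)) ** 2 for zi, xi in zip(z, x))
    var_pi_hat = (resid_ss / (n - 2)) / (cov(z, z) * (n - 1))
    return pi_hat ** 2 / var_pi_hat

def estimate(n, pi, gamma=1.0, beta=1.0, seed=0):
    """Return (OLS slope, IV slope, first-stage F) for one simulated sample."""
    rng = random.Random(seed)
    z, x, y = simulate(n, pi, gamma, beta, rng)
    ols = cov(x, y) / cov(x, x)   # ignores the instrument
    iv = cov(z, y) / cov(z, x)    # reduced form divided by first stage
    return ols, iv, first_stage_f(z, x)

for pi in (0.1, 0.6, 1.5):
    ols, iv, f = estimate(n=500, pi=pi)
    print(f"pi={pi:>4}: OLS={ols:6.3f}  IV={iv:7.3f}  first-stage F={f:8.1f}")
```

In this assumed DGP the OLS bias is $\gamma / (\pi^2 + 1)$, so it shrinks somewhat as π grows but never vanishes while γ > 0. Changing `seed` gives a feel for how much a single weak-instrument draw can move.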
OLS vs IV — bias and variance over many simulations
A single estimate is noisy. The deeper question is: on average, which method finds the truth? Draw the same DGP 100 times with fresh random shocks, estimate β by both OLS and IV each time, and compare the two sampling distributions. The orange OLS histogram clusters around a biased centre (its mean drifts from the truth by roughly γ). The teal IV histogram clusters near the true β: wider, but right on target. Click "Run 100 simulations" below.
OLS
Regress $Y$ on $X$ directly. Ignores the instrument.
IV (2SLS)
Use $\hat X = \hat\pi Z$ from the first stage as a clean stand-in for $X$.
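The two stages can be written out by hand. The following sketch uses simulated data from an assumed DGP (the parameter values and the helper `fit_line` are hypothetical, not from the post) and, unlike the shorthand $\hat X = \hat\pi Z$, includes an intercept in each regression. It also checks that, with a single instrument, 2SLS coincides with the ratio of the reduced-form slope to the first-stage slope.

```python
import random
import statistics

def fit_line(a, b):
    """OLS of b on a with an intercept; returns (intercept, slope)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    slope = (sum((x - ma) * (y - mb) for x, y in zip(a, b))
             / sum((x - ma) ** 2 for x in a))
    return mb - slope * ma, slope

# Simulated data; these parameter values are illustrative assumptions.
rng = random.Random(42)
n, pi, gamma, beta = 200, 0.6, 1.0, 1.0
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
x = [pi * zi + ui for zi, ui in zip(z, u)]
y = [beta * xi + gamma * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Stage 1: regress X on Z and keep the fitted values.
a1, pi_hat = fit_line(z, x)
x_hat = [a1 + pi_hat * zi for zi in z]

# Stage 2: regress Y on the fitted values instead of X itself.
_, beta_2sls = fit_line(x_hat, y)

# Single-instrument shortcut: reduced-form slope over first-stage slope.
_, reduced_form = fit_line(z, y)
beta_wald = reduced_form / pi_hat

# Naive OLS for comparison.
_, beta_ols = fit_line(x, y)
print(f"OLS={beta_ols:.3f}  2SLS={beta_2sls:.3f}  ratio={beta_wald:.3f}")
```

The manual second stage gives the right point estimate but not the right standard errors, because its residuals are computed from $\hat X$ rather than $X$. A dedicated IV routine corrects for this.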
Run the experiment
100 fresh datasets with the parameters above. Each gives one OLS β̂ and one IV β̂. The histogram below stacks them.
What to look for
- OLS is biased on average. Its histogram centre drifts above the dashed true-β line by roughly γ. The bias does not vanish with more sims — it is systematic.
- IV is approximately unbiased. Its histogram centres on the truth (subject to a small finite-sample bias that shrinks as $n \cdot \pi^2$ grows).
- IV pays for unbiasedness with variance. The teal histogram is wider than the orange one. When π is small, the variance penalty can be huge — the histogram spreads across the chart.
- The post's AJR result lives at the right edge of this trade-off. With $n = 64$ and $\pi \approx 0.6$, IV is roughly unbiased but the SE is large (0.176 vs OLS's 0.05). The 81% gap between OLS (0.522) and IV (0.944) is interpretable only because the instrument is borderline strong.
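The repeated-sampling experiment can be sketched in a few lines. The DGP below (`X = pi*Z + U`, `Y = beta*X + gamma*U + e`), the function names, and the default γ and β are assumptions for illustration; only $n = 64$, $\pi \approx 0.6$ and the 100 replications echo the text. Medians and interquartile ranges are used because a single-instrument IV estimator has heavy tails, so means and standard deviations over only 100 draws can be dominated by one outlier.

```python
import random
import statistics

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def one_draw(n, pi, gamma, beta, rng):
    """Simulate one dataset and return (OLS slope, IV slope)."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
    x = [pi * zi + ui for zi, ui in zip(z, u)]
    y = [beta * xi + gamma * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return cov(x, y) / cov(x, x), cov(z, y) / cov(z, x)

def monte_carlo(reps=100, n=64, pi=0.6, gamma=1.0, beta=1.0, seed=1):
    """Run `reps` independent draws; returns two lists of estimates."""
    rng = random.Random(seed)
    draws = [one_draw(n, pi, gamma, beta, rng) for _ in range(reps)]
    return [d[0] for d in draws], [d[1] for d in draws]

def iqr(values):
    """Interquartile range, a spread measure robust to outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

ols_est, iv_est = monte_carlo()
for name, est in (("OLS", ols_est), ("IV", iv_est)):
    print(f"{name:>3}: median={statistics.median(est):.3f}  IQR={iqr(est):.3f}")
```

With a true β of 1, the OLS median should sit well above 1 while the IV median stays close to it with a visibly larger IQR. In this assumed DGP the OLS drift is $\gamma / (\pi^2 + 1)$, which approaches γ only when π is small.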
The post's headline estimates — 14 specifications interactively
These numbers come straight from the post's tables for the AJR 64-country base sample. Toggle specifications to compare. Hover a point for SE, 95% CI, and first-stage F. Orange = OLS; steel blue = IV with logem4; light teal = IV in weak-instrument territory; pale grey-blue = IV with alternative instruments.
What to look for
- OLS sits at ≈ 0.52 with a tight CI. Every IV variant with $\log\textit{em4}$ sits at 0.63–1.08 with overlapping CIs. The 81% IV-over-OLS gap is the post's headline.
- Robustness controls leave the IV coefficient in the 0.7–1.1 band. Colonial-identity, legal-origin, religion, climate, and ethnic-fragmentation controls do not eliminate the institutions effect.
- Health-channel specs are the place to retain doubt. IV + malaria (β̂ = 0.69, F = 3.98) and IV + life expectancy (β̂ = 0.63, F = 4.23) drop the first-stage F well below 5 — the CIs balloon, and the data alone cannot adjudicate whether health is a "bad control" or a real exclusion violation.
- Alternative instruments agree. Replacing $\log\textit{em4}$ with $\textit{euro1900}$, $\textit{cons00a}$, or $\textit{democ00a}$ gives 0.71–0.87 — the post's Hansen J p-values 0.18–0.79 fail to reject joint exogeneity, modestly supporting AJR.
- Stock-Yogo 10% threshold is 16.38. Tooltips show first-stage F for each IV spec. Filter by F-strength via the method toggles to compare strong vs weak specs.
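Sorting specifications by instrument strength amounts to comparing each first-stage F with a cutoff. The sketch below uses only the coefficients and F statistics quoted on this page; the three-way labelling and the function name `strength` are assumptions introduced here, not the app's own categories.

```python
# Cutoffs quoted in the text: the Stock-Yogo 10% threshold and the F < 10 rule.
STOCK_YOGO_10PCT = 16.38
RULE_OF_THUMB = 10.0

# (label, IV coefficient, first-stage F) as quoted on this page.
specs = [
    ("IV (main, logem4)", 0.944, 16.85),
    ("IV + malaria", 0.69, 3.98),
    ("IV + life expectancy", 0.63, 4.23),
    ("IV + euro1900", 0.870, 44.03),
]

def strength(f_stat):
    """Classify a first-stage F statistic (hypothetical three-way labels)."""
    if f_stat >= STOCK_YOGO_10PCT:
        return "passes Stock-Yogo 10%"
    if f_stat >= RULE_OF_THUMB:
        return "between 10 and 16.38"
    return "weak"

for label, beta_hat, f_stat in specs:
    print(f"{label:<22} beta={beta_hat:.3f}  F={f_stat:6.2f}  {strength(f_stat)}")
```

The main specification clears the Stock-Yogo cutoff by less than half a point, which is why the text calls it borderline-strong rather than simply strong.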
Outcomes
Methods (uncheck to declutter)
Connecting back to Tab 2 & Tab 3
The OLS vs IV gap you experimented with on simulated data is the same machinery the post applies to the 64 ex-colonies — just with real data instead of a controlled DGP:
- OLS (base, n=64): β̂ = 0.522, SE = 0.050 — narrow CI, but biased downward by measurement error.
- IV (main, logem4, n=64): β̂ = 0.944, SE = 0.176, 95% CI = [0.599, 1.289], first-stage F = 16.85 — wider CI, approximately unbiased.
- IV + euro1900 (n=63): β̂ = 0.870, F = 44.03 — different instrument, similar message.
- Wu-Hausman test: $F = 24.22$, $p < 0.0001$; the test rejects exogeneity of institutions, so the OLS and IV estimates differ by more than sampling noise would explain.
The takeaway: in this dataset, the IV estimate is 81% larger than the OLS estimate; equivalently, OLS recovers only about 55% of the IV slope (0.522 / 0.944). That gap is consistent with measurement error in the institutional-quality index attenuating OLS, and de-noising it via IV reveals a steeper causal slope, not a shallower one.
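The arithmetic behind the headline figures can be checked from the estimates and standard errors listed above. This is a small sketch assuming normal-approximation intervals with a critical value of 1.96; the variable and function names are chosen here for illustration.

```python
# Point estimates and standard errors quoted above (n = 64 base sample).
ols_beta, ols_se = 0.522, 0.050
iv_beta, iv_se = 0.944, 0.176

def ci95(beta, se, crit=1.96):
    """Normal-approximation 95% confidence interval."""
    return beta - crit * se, beta + crit * se

iv_lo, iv_hi = ci95(iv_beta, iv_se)
ols_lo, ols_hi = ci95(ols_beta, ols_se)

# Two ways to express the same gap, with different denominators.
iv_over_ols = iv_beta / ols_beta - 1   # how much larger IV is than OLS
ols_below_iv = 1 - ols_beta / iv_beta  # how far OLS falls short of IV

print(f"IV  95% CI: [{iv_lo:.3f}, {iv_hi:.3f}]")
print(f"OLS 95% CI: [{ols_lo:.3f}, {ols_hi:.3f}]")
print(f"IV is {iv_over_ols:.0%} larger than OLS")
print(f"OLS is {ols_below_iv:.0%} smaller than IV")
```

The IV interval reproduces the [0.599, 1.289] quoted in the list. The two percentages differ because one is measured relative to OLS and the other relative to IV, so it matters which denominator a statement about the "gap" uses.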
Why are the weak-IV specs still on the chart?
Two specs (IV + malaria, IV + life exp.) have first-stage F < 5, well below any weak-IV threshold. Honest reporting includes them — they are not artifacts. The lesson: read the CI, not the point estimate. Both weak-IV CIs overlap the strong-IV band, so they do not contradict the headline, but their point estimates carry low information. The post discusses this in §9 under "the trickiest case — health channels".