<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LASSO | Carlos Mendez</title><link>https://carlos-mendez.org/category/lasso/</link><atom:link href="https://carlos-mendez.org/category/lasso/index.xml" rel="self" type="application/rss+xml"/><description>LASSO</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>LASSO</title><link>https://carlos-mendez.org/category/lasso/</link></image><item><title>Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata</title><link>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</link><pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Do all countries respond the same way to inflation? To interest rates? To democratic transitions? Most panel data models assume yes. They force every country to share the same slope coefficients. That is a strong assumption &amp;mdash; and often a wrong one.&lt;/p>
&lt;p>Here is a preview of what we will discover. When we estimate the effect of inflation on savings across 56 countries, the pooled model says: &amp;ldquo;no significant effect.&amp;rdquo; But that average is a lie. One group of countries saves &lt;em>less&lt;/em> when inflation rises. Another group saves &lt;em>more&lt;/em>. The pooled estimate averages a negative and a positive effect, producing a misleading zero.&lt;/p>
&lt;p>The &lt;strong>Classifier-LASSO&lt;/strong> (C-LASSO) method solves this problem. Developed by Su, Shi, and Phillips (2016), it discovers &lt;strong>latent groups&lt;/strong> in your panel data. Countries within each group share the same coefficients. Countries across groups can differ. Think of it like a sorting hat: rather than treating all countries as identical or all as unique, C-LASSO sorts them into a small number of groups with shared behavioral patterns.&lt;/p>
&lt;p>This tutorial demonstrates the &lt;code>classifylasso&lt;/code> Stata command (Huang, Wang, and Zhou 2024) with two applications:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Savings behavior&lt;/strong> across 56 countries (1995&amp;ndash;2010) &amp;mdash; where inflation affects savings in &lt;em>opposite directions&lt;/em> depending on the country group&lt;/li>
&lt;li>&lt;strong>Democracy and economic growth&lt;/strong> across 98 countries (1970&amp;ndash;2010) &amp;mdash; where the pooled estimate of +1.05 masks a split of +2.15 in one group and -0.94 in another&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why assuming homogeneous slopes can be misleading in panel data&lt;/li>
&lt;li>Learn the Classifier-LASSO method for identifying latent group structures&lt;/li>
&lt;li>Implement &lt;code>classifylasso&lt;/code> in Stata with both static and dynamic specifications&lt;/li>
&lt;li>Use postestimation commands (&lt;code>classogroup&lt;/code>, &lt;code>classocoef&lt;/code>, &lt;code>predict gid&lt;/code>) to visualize and interpret results&lt;/li>
&lt;li>Compare pooled fixed-effects estimates with group-specific C-LASSO estimates&lt;/li>
&lt;/ul>
&lt;p>The diagram below maps the tutorial&amp;rsquo;s progression. We start simple and build complexity step by step.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Savings data&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled &amp;amp;&amp;lt;br/&amp;gt;fixed effects&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Static model&amp;lt;br/&amp;gt;(no lagged DV)&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Dynamic model&amp;lt;br/&amp;gt;(jackknife)&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Democracy&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Application&amp;lt;br/&amp;gt;(two-way FE)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled vs&amp;lt;br/&amp;gt;group-specific&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#d97757,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#141413
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="2-the-problem-homogeneous-vs-heterogeneous-slopes">2. The Problem: Homogeneous vs Heterogeneous Slopes&lt;/h2>
&lt;h3 id="21-three-approaches-to-slope-heterogeneity">2.1 Three approaches to slope heterogeneity&lt;/h3>
&lt;p>Imagine 56 students taking the same exam. &lt;strong>Approach 1&lt;/strong> assumes they all studied the same way &amp;mdash; one average study strategy explains everyone&amp;rsquo;s score. &lt;strong>Approach 2&lt;/strong> gives each student a unique strategy &amp;mdash; but with only a few data points per student, the estimates are noisy. &lt;strong>Approach 3&lt;/strong> (C-LASSO) discovers that students naturally fall into 2&amp;ndash;3 study groups. Students within a group share the same strategy. Students across groups differ.&lt;/p>
&lt;p>The same logic applies to panel data. The standard fixed-effects model is:&lt;/p>
&lt;p>$$y_{it} = \mu_i + \boldsymbol{\beta}' \mathbf{x}_{it} + u_{it}$$&lt;/p>
&lt;p>Here, $y_{it}$ is the outcome for country $i$ at time $t$. The term $\mu_i$ captures country-specific intercepts (fixed effects). The slope vector $\boldsymbol{\beta}$ links the regressors $\mathbf{x}_{it}$ to the outcome. The critical assumption: $\boldsymbol{\beta}$ is the &lt;strong>same for all countries&lt;/strong>. Japan and Nigeria get the same coefficient on inflation. That may be wrong.&lt;/p>
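&lt;p>To see what the fixed-effects assumption does mechanically, here is a minimal simulation sketch (Python/NumPy, with hypothetical data sizes matching the savings panel): demeaning every series unit by unit wipes out the intercepts $\mu_i$, and OLS on the demeaned data recovers the single common slope.&lt;/p>

```python
import numpy as np

# Hypothetical panel: 56 units, 15 periods, one regressor, common slope 0.5.
rng = np.random.default_rng(0)
N, T, beta = 56, 15, 0.5
mu = rng.normal(0, 2, size=N)                 # unit fixed effects
x = rng.normal(size=(N, T))
y = mu[:, None] + beta * x + rng.normal(0, 0.1, size=(N, T))

# Within transformation: subtracting each unit's time mean removes mu_i.
y_w = y - y.mean(axis=1, keepdims=True)
x_w = x - x.mean(axis=1, keepdims=True)

beta_hat = float((x_w * y_w).sum() / (x_w ** 2).sum())
print(round(beta_hat, 3))  # close to the true slope, 0.5
```

&lt;p>The catch, as the text notes, is that this recovers one slope for everyone. If the units actually split into groups with different slopes, this estimator returns a blend of them.&lt;/p>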
&lt;p>At the other extreme, we could run separate regressions for each country. But with only $T = 15$ time periods per country, individual estimates are noisy. We lose statistical power.&lt;/p>
&lt;p>C-LASSO introduces a middle ground. It assumes countries belong to $K$ latent groups:&lt;/p>
&lt;p>$$\boldsymbol{\beta}_i = \boldsymbol{\alpha}_k \quad \text{if} \quad i \in G_k, \quad k = 1, \ldots, K$$&lt;/p>
&lt;p>In words, country $i$ gets the slope coefficients of its group $G_k$. The method estimates three things simultaneously: the number of groups $K$, which countries belong to which group, and each group&amp;rsquo;s coefficients $\boldsymbol{\alpha}_k$. You do not need to specify the groups in advance. The data reveals them.&lt;/p>
&lt;h3 id="22-why-not-just-use-k-means">2.2 Why not just use K-means?&lt;/h3>
&lt;p>A natural question: why not run individual regressions first and then cluster the coefficients with K-means? C-LASSO has two advantages. First, it estimates group membership and coefficients &lt;strong>jointly&lt;/strong>. A two-step approach (estimate, then cluster) propagates first-stage errors into the grouping. Second, C-LASSO&amp;rsquo;s penalty structure naturally pulls similar countries toward the same group. It is a statistically principled sorting mechanism, not an ad-hoc post-processing step.&lt;/p>
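&lt;p>The first-stage noise problem is easy to quantify. A stylized sketch (Python, with made-up numbers loosely echoing the savings application): with only $T = 15$ periods, each unit-by-unit slope estimate carries a sampling error of roughly $1/\sqrt{T} \approx 0.26$, which is large relative to the gap between group slopes, so clustering those noisy estimates blurs the true boundaries.&lt;/p>

```python
import numpy as np

# Stylized two-group panel: true slopes -0.18 and 0.48, noise sd 1, T = 15.
rng = np.random.default_rng(1)
T = 15
slopes = []
for group_slope in (-0.18, 0.48):
    for _ in range(28):                       # 28 hypothetical units per group
        x = rng.normal(size=T)
        y = group_slope * x + rng.normal(0, 1.0, size=T)
        slopes.append((x @ y) / (x @ x))      # unit-by-unit OLS slope
slopes = np.array(slopes)

# Per-unit sampling sd is about 1/sqrt(15), roughly 0.26, comparable to the
# 0.66 gap between the group slopes: the two clouds of estimates overlap,
# so a second-stage K-means would misclassify units near the boundary.
print(round(slopes.std(), 2))
```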
&lt;hr>
&lt;h2 id="3-the-classifier-lasso-method">3. The Classifier-LASSO Method&lt;/h2>
&lt;h3 id="31-the-c-lasso-objective-function">3.1 The C-LASSO objective function&lt;/h3>
&lt;p>C-LASSO minimizes a penalized least-squares objective:&lt;/p>
&lt;p>$$Q_{NT,\lambda}^{(K)} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \boldsymbol{\beta}_i' \mathbf{x}_{it})^2 + \frac{\lambda_{NT}}{N} \sum_{i=1}^{N} \prod_{k=1}^{K} \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$$&lt;/p>
&lt;p>The first term is the standard sum of squared residuals, computed after the fixed effects $\mu_i$ have been concentrated out (demeaned away). It measures how well the model fits the data. The second term is the &lt;strong>penalty&lt;/strong>. It encourages each country&amp;rsquo;s coefficients $\boldsymbol{\beta}_i$ to be close to one of the group centers $\boldsymbol{\alpha}_k$.&lt;/p>
&lt;p>Think of each group center as a &lt;strong>planet with gravitational pull&lt;/strong>. If a country&amp;rsquo;s coefficients are close to &lt;em>any&lt;/em> planet, the product $\prod_k \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$ shrinks toward zero. The penalty becomes small. The country gets pulled into that group. If the coefficients are far from all planets, the penalty stays large. The tuning parameter $\lambda_{NT} = c_\lambda T^{-1/3}$ controls how strong this gravitational pull is.&lt;/p>
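&lt;p>The gravitational-pull mechanics are easy to verify numerically. A toy sketch (Python, with a hypothetical one-dimensional coefficient and two made-up group centers):&lt;/p>

```python
import math

# Toy 1-D version of the C-LASSO penalty term for a single unit:
# the product of its distances to each group center.
alphas = [-0.18, 0.48]                        # hypothetical group centers

def penalty(beta_i, alphas):
    return math.prod(abs(beta_i - a) for a in alphas)

# Close to one center: one factor is tiny, so the whole product is tiny.
print(round(penalty(-0.17, alphas), 4))       # 0.0065
# Far from every center: every factor is large, so the penalty is large.
print(round(penalty(1.50, alphas), 4))        # 1.7136
```

&lt;p>With $\lambda_{NT}$ scaling this product, units sitting near a center pay almost nothing while stragglers pay heavily, which is what pulls each country toward its nearest group.&lt;/p>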
&lt;h3 id="32-three-step-estimation-procedure">3.2 Three-step estimation procedure&lt;/h3>
&lt;p>The &lt;code>classifylasso&lt;/code> command works in three steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sort countries into groups.&lt;/strong> For each candidate number of groups $K$, the algorithm iteratively updates group centers and reassigns countries until convergence. Starting values come from unit-by-unit regressions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Re-estimate within groups (postlasso).&lt;/strong> The LASSO penalty biases the coefficient estimates. So after sorting, we discard the penalized estimates and re-run plain OLS within each group. Think of it like a talent show: LASSO is the audition that selects who is in which group, but the final performance (the coefficient estimates) is unpenalized. This postlasso step gives us valid standard errors and confidence intervals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pick the best $K$ (information criterion).&lt;/strong> How many groups are there? The command tests $K = 1, 2, \ldots, K_{\max}$ and picks the $K$ that minimizes an information criterion. The IC acts like a &lt;strong>referee&lt;/strong> balancing two concerns: fit (more groups fit better) and complexity (more groups risk overfitting). It works like AIC or BIC. The tuning parameter $\rho_{NT} = c_\rho (NT)^{-1/2}$ controls how harshly the referee penalizes extra groups.&lt;/p>
&lt;/li>
&lt;/ol>
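&lt;p>Step 3 can be mimicked with a few lines of arithmetic. A stylized sketch (Python; the error variances below are made up, and the exact IC formula inside &lt;code>classifylasso&lt;/code> may differ in its constants): fit improves sharply up to the true $K$ and only marginally after, while the penalty grows linearly in $K$.&lt;/p>

```python
import math

# Stylized BIC-type information criterion: log fit plus a penalty on K.
N, T, p = 56, 15, 3                           # units, periods, regressors
rho = 2.0 / math.sqrt(N * T)                  # rho_NT = c_rho * (NT)^(-1/2), c_rho = 2 for illustration
sigma2 = {1: 0.90, 2: 0.62, 3: 0.61, 4: 0.60, 5: 0.60}  # hypothetical fits

ic = {K: math.log(s2) + rho * p * K for K, s2 in sigma2.items()}
best = min(ic, key=ic.get)
print(best)  # 2: the big fit gain at K=2 beats the penalty; later gains do not
```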
&lt;h3 id="33-dynamic-panels-and-nickell-bias">3.3 Dynamic panels and Nickell bias&lt;/h3>
&lt;p>What if your model includes a lagged dependent variable, like $y_{i,t-1}$? This creates a problem called &lt;strong>Nickell bias&lt;/strong>. When you demean the data to remove fixed effects, the demeaned lagged outcome becomes correlated with the demeaned error. The result: biased coefficients.&lt;/p>
&lt;p>The &lt;code>classifylasso&lt;/code> command offers a &lt;code>dynamic&lt;/code> option to fix this. It uses the &lt;strong>half-panel jackknife&lt;/strong> (Dhaene and Jochmans 2015). The idea is simple: split the time series in half. Estimate the model on each half. Combine the two estimates in a way that cancels the bias. Problem solved.&lt;/p>
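&lt;p>The bias-cancelling arithmetic is worth seeing once. A stylized sketch (Python, with a made-up slope and bias constant; in practice the two half-panel estimates differ, but the leading $O(1/T)$ bias term cancels the same way):&lt;/p>

```python
# Half-panel jackknife combination: 2 * full estimate minus mean of halves.
# The fixed-effects estimate carries an O(1/T) bias, so each half-panel
# (length T/2) carries roughly twice the full-panel bias.
theta_true, B, T = 0.70, -1.5, 14             # hypothetical slope, bias constant

theta_full = theta_true + B / T
theta_half1 = theta_true + B / (T / 2)
theta_half2 = theta_true + B / (T / 2)

theta_jk = 2 * theta_full - 0.5 * (theta_half1 + theta_half2)
print(round(theta_jk, 6))  # 0.7: the leading bias term cancels exactly
```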
&lt;p>Now that we understand the method, let&amp;rsquo;s apply it to real data.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-exploration-savings">4. Data Exploration: Savings&lt;/h2>
&lt;h3 id="41-load-and-describe-the-data">4.1 Load and describe the data&lt;/h3>
&lt;p>Our first application uses a panel of 56 countries over 15 years, from Su, Shi, and Phillips (2016). The outcome is the savings-to-GDP ratio. The regressors are lagged savings, CPI inflation, real interest rates, and GDP growth.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
summarize savings lagsavings cpi interest gdp
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
savings | 840 -2.87e-08 1.000596 -2.495871 2.893858
lagsavings | 840 5.81e-08 1.000596 -2.832278 2.91508
cpi | 840 3.56e-09 1.000596 -2.773791 3.548945
interest | 840 -7.17e-09 1.000596 -3.600348 3.277582
gdp | 840 1.06e-08 1.000596 -3.554419 2.461317
&lt;/code>&lt;/pre>
&lt;p>The panel is strongly balanced: 56 countries $\times$ 15 years = 840 observations. All variables are standardized to mean zero and standard deviation one. This means coefficients are in standard-deviation units. A coefficient of 0.18 means &amp;ldquo;a one-SD increase in CPI is associated with a 0.18-SD change in savings.&amp;rdquo; The balanced structure matters: C-LASSO requires all countries to be observed in all time periods.&lt;/p>
&lt;h3 id="42-visualize-cross-country-heterogeneity">4.2 Visualize cross-country heterogeneity&lt;/h3>
&lt;p>Before running any regressions, it helps to visualize how savings trajectories differ across countries. The &lt;code>xtline&lt;/code> command overlays all 56 country lines on a single plot:&lt;/p>
&lt;pre>&lt;code class="language-stata">xtline savings, overlay ///
title(&amp;quot;Savings-to-GDP Ratio Across 56 Countries&amp;quot;, size(medium)) ///
subtitle(&amp;quot;Each line represents one country&amp;quot;, size(small)) ///
ytitle(&amp;quot;Savings / GDP&amp;quot;) xtitle(&amp;quot;Year&amp;quot;) legend(off)
graph export &amp;quot;stata_panel_lasso_cluster_fig1_savings_scatter.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig1_savings_scatter.png" alt="Spaghetti plot of savings-to-GDP ratio across 56 countries, showing wide dispersion in trajectories.">
&lt;em>Figure 1: Savings-to-GDP ratio across 56 countries (1995&amp;ndash;2010). Each line represents one country, revealing substantial heterogeneity in savings dynamics.&lt;/em>&lt;/p>
&lt;p>The spaghetti plot tells a clear story: countries do not move in lockstep. Some maintain positive savings ratios throughout. Others swing below zero. The lines diverge, cross, and cluster &amp;mdash; suggesting that different countries follow fundamentally different savings dynamics. This is exactly the kind of heterogeneity that C-LASSO is designed to detect. Perhaps subsets of countries share similar responses, even if the full panel does not.&lt;/p>
&lt;p>But first, let&amp;rsquo;s see what the standard models say.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-pooled-and-fixed-effects-regressions">5. Baseline: Pooled and Fixed Effects Regressions&lt;/h2>
&lt;p>Before applying C-LASSO, we establish a benchmark by estimating the standard pooled OLS and fixed-effects models. These models assume that all 56 countries share the same slope coefficients.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Pooled OLS
regress savings lagsavings cpi interest gdp
* Standard Fixed Effects
xtreg savings lagsavings cpi interest gdp, fe
* Robust Fixed Effects (reghdfe)
reghdfe savings lagsavings cpi interest gdp, absorb(code) vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Pooled OLS FE (robust)
lagsavings 0.6051 0.6051
cpi 0.0301 0.0301
interest 0.0059 0.0059
gdp 0.1882 0.1882
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS and fixed-effects estimates are virtually identical. R-squared is 0.438. Lagged savings dominates (coefficient 0.605, $p &amp;lt; 0.001$). GDP growth matters too (0.188, $p &amp;lt; 0.001$).&lt;/p>
&lt;p>Now look at the two remaining variables. CPI: 0.030. Interest rate: 0.006. Both statistically insignificant. A textbook conclusion would be: &amp;ldquo;Inflation and interest rates do not affect savings.&amp;rdquo;&lt;/p>
&lt;p>But what if the average is lying? Imagine a city where half the neighborhoods warm up by 5 degrees and the other half cool down by 5 degrees. The citywide average temperature change is zero. A meteorologist reporting &amp;ldquo;no change&amp;rdquo; would be wrong &amp;mdash; there &lt;em>are&lt;/em> changes, just in opposite directions. This is exactly what we will discover with C-LASSO.&lt;/p>
&lt;hr>
&lt;h2 id="6-classifier-lasso-savings-static-model">6. Classifier-LASSO: Savings, Static Model&lt;/h2>
&lt;h3 id="61-estimation">6.1 Estimation&lt;/h3>
&lt;p>We start with the simplest C-LASSO specification: a static model without the lagged dependent variable. This lets us focus on the core mechanics before adding complexity.&lt;/p>
&lt;pre>&lt;code class="language-stata">classifylasso savings cpi interest gdp, grouplist(1/5) tolerance(1e-4)
&lt;/code>&lt;/pre>
&lt;p>The command searches over $K = 1$ to $K = 5$ groups and reports the information criterion (IC) for each:&lt;/p>
&lt;pre>&lt;code class="language-text">Estimation 1: Group Number = 1; IC = 0.054
Estimation 2: Group Number = 2; IC = -0.028 ← minimum
Estimation 3: Group Number = 3; IC = 0.059
Estimation 4: Group Number = 4; IC = 0.131
Estimation 5: Group Number = 5; IC = 0.213
* Selected Group Number: 2
&lt;/code>&lt;/pre>
&lt;p>The IC is minimized at $K = 2$, with values rising monotonically for every larger $K$. This clear U-shape provides strong evidence for exactly two latent groups in the data.&lt;/p>
&lt;h3 id="62-group-specific-coefficients">6.2 Group-specific coefficients&lt;/h3>
&lt;pre>&lt;code class="language-stata">classoselect, postselection
predict gid_static, gid
tabulate gid_static
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Group 1 (34 countries, 510 obs): Within R-sq. = 0.2019
cpi | -0.1813 (z = -4.29, p &amp;lt; 0.001)
interest | -0.1966 (z = -4.64, p &amp;lt; 0.001)
gdp | 0.3346 (z = 7.98, p &amp;lt; 0.001)
Group 2 (22 countries, 330 obs): Within R-sq. = 0.2369
cpi | 0.4781 (z = 9.10, p &amp;lt; 0.001)
interest | 0.2631 (z = 5.01, p &amp;lt; 0.001)
gdp | 0.1117 (z = 2.23, p = 0.026)
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Look at CPI.&lt;/p>
&lt;p>In &lt;strong>Group 1&lt;/strong> (34 countries), higher inflation &lt;em>reduces&lt;/em> savings: coefficient $-0.181$ ($p &amp;lt; 0.001$). In &lt;strong>Group 2&lt;/strong> (22 countries), higher inflation &lt;em>increases&lt;/em> savings: coefficient $+0.478$ ($p &amp;lt; 0.001$). The sign flips completely.&lt;/p>
&lt;p>The same reversal appears for the interest rate: $-0.197$ in Group 1 versus $+0.263$ in Group 2.&lt;/p>
&lt;p>Now the pooled CPI coefficient of $+0.030$ makes sense. It was averaging $-0.181$ and $+0.478$ &amp;mdash; a negative and a positive effect canceling each other out. The &amp;ldquo;insignificant&amp;rdquo; result was not evidence of no effect. It was evidence of &lt;strong>two opposing effects&lt;/strong> hidden inside the average.&lt;/p>
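&lt;p>A back-of-envelope check makes the cancellation concrete (Python; a simple membership-weighted average, not the exact fixed-effects decomposition, which also weights by within-group regressor variation):&lt;/p>

```python
# Membership-weighted average of the two group CPI effects (static model).
n1, b1 = 34, -0.1813    # Group 1: countries, CPI coefficient
n2, b2 = 22, 0.4781     # Group 2: countries, CPI coefficient

pooled_approx = (n1 * b1 + n2 * b2) / (n1 + n2)
print(round(pooled_approx, 3))  # about 0.08: small and near zero
```

&lt;p>The blend lands nowhere near either group effect, which is exactly why the pooled CPI coefficient looked negligible.&lt;/p>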
&lt;p>Why the reversal? In Group 1, higher inflation erodes the real value of savings, discouraging people from saving. In Group 2, higher inflation may trigger &lt;strong>precautionary savings&lt;/strong> &amp;mdash; households save &lt;em>more&lt;/em> precisely because the economic environment feels uncertain. Same macroeconomic shock, opposite behavioral response.&lt;/p>
&lt;h3 id="63-group-selection-plot">6.3 Group selection plot&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig2_group_selection_static.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig2_group_selection_static.png" alt="Information criterion and iteration count by number of groups for the static savings model. IC is minimized at K=2.">
&lt;em>Figure 2: Group selection for the static savings model. The information criterion (left axis) is minimized at K=2, with a clear U-shape from K=3 onward.&lt;/em>&lt;/p>
&lt;p>The triangle marks the IC minimum at $K = 2$. The left axis shows IC values; the right axis shows iterations to convergence. Notice: $K = 2$ converged quickly (about 3 iterations). Models with $K \geq 3$ hit the maximum 20 iterations. When the algorithm struggles to converge, it is a sign of overparameterization &amp;mdash; too many groups for the data to support.&lt;/p>
&lt;p>So far, we have found two groups with a static model. But we omitted lagged savings. Let&amp;rsquo;s add it back.&lt;/p>
&lt;hr>
&lt;h2 id="7-classifier-lasso-savings-dynamic-model">7. Classifier-LASSO: Savings, Dynamic Model&lt;/h2>
&lt;h3 id="71-adding-the-lagged-dependent-variable">7.1 Adding the lagged dependent variable&lt;/h3>
&lt;p>Savings are highly persistent. The pooled coefficient on &lt;code>lagsavings&lt;/code> was 0.605 &amp;mdash; a country&amp;rsquo;s savings this year strongly predicts its savings next year. Omitting this variable may bias everything else. We now add it back and replicate Su, Shi, and Phillips (2016). The &lt;code>dynamic&lt;/code> option activates the half-panel jackknife to correct Nickell bias.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
classifylasso savings lagsavings cpi interest gdp, ///
grouplist(1/5) lambda(1.5485) tolerance(1e-4) dynamic
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 9min57s.
Group 1 (31 countries, 465 obs): Within R-sq. = 0.4988
lagsavings | 0.6952 (z = 18.15, p &amp;lt; 0.001)
cpi | -0.1602 (z = -4.09, p &amp;lt; 0.001)
interest | -0.1490 (z = -4.04, p &amp;lt; 0.001)
gdp | 0.2892 (z = 7.62, p &amp;lt; 0.001)
Group 2 (25 countries, 375 obs): Within R-sq. = 0.4372
lagsavings | 0.6939 (z = 19.45, p &amp;lt; 0.001)
cpi | 0.1967 (z = 4.93, p &amp;lt; 0.001)
interest | 0.1225 (z = 2.98, p = 0.003)
gdp | 0.1127 (z = 2.38, p = 0.018)
&lt;/code>&lt;/pre>
&lt;p>Again, C-LASSO selects $K = 2$ groups. The sign reversal on CPI survives: $-0.160$ in Group 1 versus $+0.197$ in Group 2. Same for the interest rate: $-0.149$ versus $+0.123$.&lt;/p>
&lt;p>Here is what is interesting about the &lt;code>lagsavings&lt;/code> coefficient. Both groups show nearly identical persistence: 0.695 in Group 1 and 0.694 in Group 2. Think of it like a speedometer. Both groups of countries cruise at the same speed (savings persistence). But they swerve in opposite directions when they hit a pothole (an inflation or interest rate shock). The heterogeneity is about &lt;em>reactions to shocks&lt;/em>, not about baseline behavior.&lt;/p>
&lt;p>Adding lagged savings also improved the fit. Within R-squared jumped from 0.20&amp;ndash;0.24 (static) to 0.44&amp;ndash;0.50 (dynamic). The lagged variable clearly matters.&lt;/p>
&lt;h3 id="72-coefficient-plots">7.2 Coefficient plots&lt;/h3>
&lt;p>The &lt;code>classocoef&lt;/code> postestimation command visualizes group-specific coefficients with 95% confidence bands:&lt;/p>
&lt;pre>&lt;code class="language-stata">classocoef cpi
graph export &amp;quot;stata_panel_lasso_cluster_fig3_coef_cpi.png&amp;quot;, replace width(2400)
classocoef interest
graph export &amp;quot;stata_panel_lasso_cluster_fig4_coef_interest.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig3_coef_cpi.png" alt="CPI coefficient estimates and 95% confidence bands by group, showing a clear sign reversal with non-overlapping confidence intervals.">
&lt;em>Figure 3: Heterogeneous effects of CPI on savings. Group 1 (31 countries) shows a negative effect; Group 2 (25 countries) shows a positive effect. Confidence bands do not overlap.&lt;/em>&lt;/p>
&lt;p>This is the &amp;ldquo;smoking gun&amp;rdquo; figure. The two horizontal lines are the group-specific coefficients. The dashed lines show 95% confidence bands. The bands do not overlap. This is not a marginal difference. It is a robust sign reversal.&lt;/p>
&lt;p>For 31 countries (Group 1), higher inflation reduces savings ($-0.160$, $p &amp;lt; 0.001$). For 25 countries (Group 2), higher inflation increases savings ($+0.197$, $p &amp;lt; 0.001$). A pooled model averages these opposing forces and finds CPI &amp;ldquo;insignificant.&amp;rdquo; That is aggregation bias at work.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig4_coef_interest.png" alt="Interest rate coefficient estimates and 95% confidence bands by group, showing the same sign reversal pattern as CPI.">
&lt;em>Figure 4: Heterogeneous effects of the interest rate on savings. The same sign reversal pattern as CPI: negative in Group 1, positive in Group 2.&lt;/em>&lt;/p>
&lt;p>The interest rate tells the same story. Group 1 countries save &lt;em>less&lt;/em> when rates rise ($-0.149$). Group 2 countries save &lt;em>more&lt;/em> ($+0.123$).&lt;/p>
&lt;p>Why? One interpretation: in Group 1 (more developed financial markets), higher returns let households reach their wealth targets while saving less, so the &lt;strong>income effect&lt;/strong> dominates and savings fall. In Group 2 (limited financial access), higher returns raise the price of consuming today rather than tomorrow, so the &lt;strong>substitution effect&lt;/strong> dominates and savings rise.&lt;/p>
&lt;p>We have now established that latent groups exist in savings data. The next question: does the same pattern appear in a completely different economic context?&lt;/p>
&lt;hr>
&lt;h2 id="8-democracy-application-does-democracy-cause-growth">8. Democracy Application: Does Democracy Cause Growth?&lt;/h2>
&lt;h3 id="81-the-acemoglu-et-al-2019-question">8.1 The Acemoglu et al. (2019) question&lt;/h3>
&lt;p>&amp;ldquo;Democracy does cause growth.&amp;rdquo; That is the title of a famous 2019 paper by Acemoglu, Naidu, Restrepo, and Robinson in the &lt;em>Journal of Political Economy&lt;/em>. Their evidence: a pooled two-way fixed-effects model with lagged GDP finds a positive, significant effect.&lt;/p>
&lt;p>But we have learned to be skeptical of pooled estimates. Does this average apply to all 98 countries? Or does it mask the same kind of sign reversal we found in savings?&lt;/p>
&lt;h3 id="82-data-exploration">8.2 Data exploration&lt;/h3>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/democracy.dta&amp;quot;, clear
xtset country year
summarize lnPGDP Democracy ly1
tabulate Democracy
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnPGDP | 4,018 758.5558 162.9137 405.6728 1094.003
Democracy | 4,018 .5450473 .4980286 0 1
ly1 | 3,920 757.7754 162.6702 405.6728 1094.003
Democracy | Freq. Percent
------------+-----------------------------------
0 | 1,828 45.50
1 | 2,190 54.50
&lt;/code>&lt;/pre>
&lt;p>The panel covers 98 countries from 1970 to 2010 &amp;mdash; 4,018 observations. The binary &lt;code>Democracy&lt;/code> indicator is 1 for democratic country-years and 0 otherwise. About 55% of observations are democratic, reflecting the global wave of democratization. The dependent variable &lt;code>lnPGDP&lt;/code> (log per-capita GDP, scaled) ranges from 406 to 1,094 &amp;mdash; the full spectrum from low-income to high-income countries.&lt;/p>
&lt;h3 id="83-pooled-fixed-effects-benchmark">8.3 Pooled fixed-effects benchmark&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe lnPGDP Democracy ly1, absorb(country year) cluster(country)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 3,920
R-squared = 0.9991
Within R-sq. = 0.9607
(Std. err. adjusted for 98 clusters in country)
lnPGDP | Coefficient Robust std. err. t P&amp;gt;|t|
Democracy | 1.054992 .369806 2.85 0.005
ly1 | .970495 .0059964 161.85 0.000
&lt;/code>&lt;/pre>
&lt;p>Democracy is associated with a 1.055-unit increase in log per-capita GDP ($p = 0.005$, clustered SE = 0.370). Lagged GDP has a coefficient of 0.970 &amp;mdash; strong persistence. This replicates Acemoglu et al. (2019): on average, democracy promotes growth.&lt;/p>
&lt;p>On average. But we already know what &amp;ldquo;on average&amp;rdquo; can hide. Let&amp;rsquo;s run C-LASSO.&lt;/p>
&lt;h3 id="84-c-lasso-revealing-the-heterogeneity">8.4 C-LASSO: revealing the heterogeneity&lt;/h3>
&lt;pre>&lt;code class="language-stata">classifylasso lnPGDP Democracy ly1, ///
grouplist(1/5) rho(0.2) absorb(country year) ///
cluster(country) dynamic optmaxiter(300)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 2h33min41s.
Group 1 (57 countries, 2,280 obs): Within R-sq. = 0.9609
Democracy | 2.151397 (z = 3.94, p &amp;lt; 0.001)
ly1 | 1.032752 (z = 149.97, p &amp;lt; 0.001)
Group 2 (41 countries, 1,640 obs): Within R-sq. = 0.9538
Democracy | -0.935589 (z = -2.69, p = 0.007)
ly1 | 0.979327 (z = 95.73, p &amp;lt; 0.001)
&lt;/code>&lt;/pre>
&lt;p>This is the tutorial&amp;rsquo;s most striking finding.&lt;/p>
&lt;p>The pooled coefficient of $+1.055$ is &lt;strong>not representative of any actual country group&lt;/strong>. It is a weighted average of two fundamentally different effects:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Group 1&lt;/strong> (57 countries): democracy effect = $+2.151$ ($p &amp;lt; 0.001$). More than twice the pooled estimate.&lt;/li>
&lt;li>&lt;strong>Group 2&lt;/strong> (41 countries): democracy effect = $-0.936$ ($p = 0.007$). Negative and significant.&lt;/li>
&lt;/ul>
&lt;p>The coefficient literally changes sign. For 58% of countries, democratic transitions are associated with GDP gains. For the remaining 42%, they are associated with GDP declines. The pooled model sees one number. C-LASSO sees two stories.&lt;/p>
&lt;p>Note: these are conditional associations within the panel model. A causal interpretation requires the same identifying assumptions as Acemoglu et al. (2019).&lt;/p>
&lt;h3 id="85-visualizing-the-democracy-growth-split">8.5 Visualizing the democracy-growth split&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig5_democracy_selection.png&amp;quot;, replace width(2400)
classocoef Democracy
graph export &amp;quot;stata_panel_lasso_cluster_fig6_democracy_coef.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig5_democracy_selection.png" alt="Information criterion and iteration count for the democracy model. IC is minimized at K=2, though values are close across specifications.">
&lt;em>Figure 5: Group selection for the democracy-growth model. IC is minimized at K=2, though values are close across all K (range 3.267&amp;ndash;3.280).&lt;/em>&lt;/p>
&lt;p>The IC selects $K = 2$. But look closely: the IC values range from 3.267 to 3.280 &amp;mdash; a span of just 0.013. The 2-group structure is optimal but not overwhelmingly so. This is a useful reminder: always check sensitivity to the tuning parameter $\rho$.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig6_democracy_coef.png" alt="Democracy coefficient polarization across two groups: Group 1 (57 countries) shows a positive effect around +2.2, Group 2 (41 countries) shows a negative effect around -1.0.">
&lt;em>Figure 6: Heterogeneous effects of democracy on economic growth. Group 1 (57 countries) shows a positive effect (+2.15); Group 2 (41 countries) shows a negative effect (-0.94). The pooled estimate of +1.05 describes neither group.&lt;/em>&lt;/p>
&lt;p>This is the key figure of the tutorial. Each dot is one country&amp;rsquo;s individual coefficient estimate. The horizontal lines show group-specific postlasso estimates with 95% confidence bands.&lt;/p>
&lt;p>The polarization is unmistakable. Group 1 (left cluster): strongly positive. Group 2 (right cluster): negative. Neither group&amp;rsquo;s confidence band crosses zero. Both effects are statistically significant.&lt;/p>
&lt;p>This is not &amp;ldquo;some countries benefit, others see no effect.&amp;rdquo; It is a genuine sign reversal. Democracy is associated with growth in one group and with decline in another.&lt;/p>
&lt;hr>
&lt;h2 id="9-comparison-what-the-pooled-model-misses">9. Comparison: What the Pooled Model Misses&lt;/h2>
&lt;h3 id="91-summary-table">9.1 Summary table&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Pooled FE&lt;/th>
&lt;th>C-LASSO Group 1&lt;/th>
&lt;th>C-LASSO Group 2&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Democracy coefficient&lt;/strong>&lt;/td>
&lt;td>+1.055&lt;/td>
&lt;td>+2.151&lt;/td>
&lt;td>-0.936&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Standard error&lt;/strong>&lt;/td>
&lt;td>0.370&lt;/td>
&lt;td>0.546&lt;/td>
&lt;td>0.348&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>p-value&lt;/strong>&lt;/td>
&lt;td>0.005&lt;/td>
&lt;td>&amp;lt; 0.001&lt;/td>
&lt;td>0.007&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Lagged GDP&lt;/strong>&lt;/td>
&lt;td>0.970&lt;/td>
&lt;td>1.033&lt;/td>
&lt;td>0.979&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Countries&lt;/strong>&lt;/td>
&lt;td>98&lt;/td>
&lt;td>57&lt;/td>
&lt;td>41&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Observations&lt;/strong>&lt;/td>
&lt;td>3,920&lt;/td>
&lt;td>2,280&lt;/td>
&lt;td>1,640&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="92-simpsons-paradox-in-panel-data">9.2 Simpson&amp;rsquo;s paradox in panel data&lt;/h3>
&lt;p>This is &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong> &amp;mdash; the phenomenon where a trend that appears in aggregated data reverses when you look at subgroups.&lt;/p>
&lt;p>Here is a concrete analogy. A hospital treats two types of patients: mild cases and severe cases. For mild cases, Treatment A has a higher survival rate. For severe cases, Treatment A also has a higher survival rate. But when you pool all patients together, Treatment B appears better &amp;mdash; because it treats a disproportionate number of mild (easy) cases. The aggregate reverses the subgroup trend.&lt;/p>
&lt;p>The same thing happened here. The pooled democracy estimate of $+1.055$ sits between $+2.151$ and $-0.936$. It describes neither group accurately. A policymaker relying on the pooled result would conclude that democracy universally promotes growth. They would miss that for 41 countries (42% of the sample), the relationship runs in the opposite direction.&lt;/p>
&lt;p>The savings model showed the same pattern. The insignificant pooled CPI coefficient ($+0.030$) masked significant effects of $-0.160$ and $+0.197$. When effects have opposite signs, pooling does not just underestimate the magnitude. It produces a qualitatively wrong conclusion.&lt;/p>
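&lt;p>As a back-of-the-envelope check (in Python for illustration; a pooled fixed-effects estimate is not literally a size-weighted average of the group estimates), here is how the opposite-sign savings-model group effects nearly cancel:&lt;/p>

```python
# Back-of-the-envelope only: pooled FE is not exactly a weighted average of
# group estimates, but the arithmetic shows how opposite-sign effects can
# cancel toward a misleading zero.
# Group sizes and post-LASSO CPI coefficients from the static savings model.
n1, b1 = 34, -0.160   # Group 1: savings fall when inflation rises
n2, b2 = 22, 0.197    # Group 2: savings rise when inflation rises

weighted_avg = (n1 * b1 + n2 * b2) / (n1 + n2)
print(f"size-weighted average: {weighted_avg:.3f}")   # near zero
```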
&lt;h3 id="93-robustness-of-the-group-structure">9.3 Robustness of the group structure&lt;/h3>
&lt;p>Across all three C-LASSO specifications &amp;mdash; static savings, dynamic savings, and democracy &amp;mdash; the IC consistently selected $K = 2$ groups. The CPI sign reversal survived the switch from static to dynamic, despite a shift in group composition (34/22 to 31/25). This consistency suggests the latent groups are real structural features of the data, not artifacts of a particular specification.&lt;/p>
&lt;hr>
&lt;h2 id="10-summary-and-takeaways">10. Summary and Takeaways&lt;/h2>
&lt;h3 id="101-what-we-learned">10.1 What we learned&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Pooled estimates can be misleading.&lt;/strong> The insignificant pooled CPI coefficient ($+0.030$) in the savings model masked opposing effects of $-0.160$ and $+0.197$ in two latent groups. The pooled democracy coefficient ($+1.055$) masked a split of $+2.151$ versus $-0.936$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>C-LASSO finds latent groups.&lt;/strong> In all three specifications, the information criterion selected $K = 2$ groups, revealing binary latent structures in both datasets. The &lt;code>classifylasso&lt;/code> command handles the full workflow: estimation, group selection, and postestimation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The &lt;code>dynamic&lt;/code> option corrects Nickell bias.&lt;/strong> When lagged dependent variables are included, the half-panel jackknife bias correction preserves the group structure while improving within-group R-squared (from 0.20&amp;ndash;0.24 in the static model to 0.44&amp;ndash;0.50 in the dynamic model).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Postestimation tools aid interpretation.&lt;/strong> The &lt;code>classogroup&lt;/code> command visualizes the information criterion, &lt;code>classocoef&lt;/code> plots group-specific coefficients with confidence bands, and &lt;code>predict gid&lt;/code> assigns countries to groups.&lt;/p>
&lt;/li>
&lt;/ul>
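&lt;p>To make the half-panel jackknife concrete, the sketch below (Python with NumPy, illustrative numbers only) simulates an AR(1) panel with fixed effects and applies the Dhaene&amp;ndash;Jochmans correction $\tilde\theta = 2\hat\theta_{\text{full}} - \tfrac{1}{2}(\hat\theta_{1} + \hat\theta_{2})$, where the two sub-estimates use the first and second halves of the time span. This is only the idea behind the &lt;code>dynamic&lt;/code> option, not the &lt;code>classifylasso&lt;/code> implementation itself, which combines the correction with group estimation.&lt;/p>

```python
import numpy as np

# Simulate an AR(1) panel with unit fixed effects. The within estimator of
# rho suffers Nickell bias of order -(1 + rho) / (T - 1); the half-panel
# jackknife removes the leading bias term.
rng = np.random.default_rng(42)
N, T, rho = 2000, 20, 0.5

alpha = rng.normal(size=N)                    # unit fixed effects
y = np.empty((N, T + 1))
y[:, 0] = alpha / (1 - rho) + rng.normal(size=N) / np.sqrt(1 - rho**2)
for t in range(1, T + 1):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)

def within_ar1(panel):
    """Within (fixed-effects) estimate of rho on one (sub)panel."""
    x = panel[:, :-1] - panel[:, :-1].mean(axis=1, keepdims=True)
    z = panel[:, 1:] - panel[:, 1:].mean(axis=1, keepdims=True)
    return float((x * z).sum() / (x * x).sum())

full = within_ar1(y)
h1 = within_ar1(y[:, : T // 2 + 1])           # first half of the time span
h2 = within_ar1(y[:, T // 2 :])               # second half
jackknife = 2 * full - 0.5 * (h1 + h2)        # half-panel jackknife correction
print(f"within: {full:.3f}  jackknife: {jackknife:.3f}  true: {rho}")
```

&lt;p>The within estimate is biased downward; the jackknifed estimate lands much closer to the true autoregressive parameter.&lt;/p>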
&lt;h3 id="102-limitations">10.2 Limitations&lt;/h3>
&lt;p>Three caveats. First, the IC values in the democracy model were very close across $K = 1$ through $K = 5$ (range 3.267&amp;ndash;3.280). The 2-group structure is optimal but not dominant. Second, the datasets use numeric country codes, not names. We cannot easily identify which countries are in which group. Third, C-LASSO is computationally intensive. The democracy model took over 2.5 hours. Plan accordingly.&lt;/p>
&lt;h3 id="103-exercises">10.3 Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity analysis.&lt;/strong> Re-run the democracy model with &lt;code>rho(0.5)&lt;/code> and &lt;code>rho(1.0)&lt;/code> instead of &lt;code>rho(0.2)&lt;/code>. Does the selected number of groups change? How sensitive are the group assignments to this tuning parameter?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Extended lag structure.&lt;/strong> Following the reference &lt;code>empirical.do&lt;/code>, estimate the democracy model with 2, 3, and 4 lags of GDP (&lt;code>ly1-ly2&lt;/code>, &lt;code>ly1-ly3&lt;/code>, &lt;code>ly1-ly4&lt;/code>). Do the group-specific democracy coefficients remain stable?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Static vs dynamic comparison.&lt;/strong> Run &lt;code>classifylasso savings cpi interest gdp&lt;/code> (without &lt;code>dynamic&lt;/code>) on the savings data and compare group assignments with the dynamic model using &lt;code>tabulate gid_static gid_dynamic&lt;/code>. How many countries switch groups?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Su, L., Shi, Z., and Phillips, P. C. B. (2016). &lt;a href="https://doi.org/10.3982/ECTA12560" target="_blank" rel="noopener">Identifying latent structures in panel data&lt;/a>. &lt;em>Econometrica&lt;/em>, 84(6), 2215&amp;ndash;2264.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Huang, W., Wang, Y., and Zhou, L. (2024). &lt;a href="https://doi.org/10.1177/1536867X241233664" target="_blank" rel="noopener">Identify latent group structures in panel data: The classifylasso command&lt;/a>. &lt;em>Stata Journal&lt;/em>, 24(1), 173&amp;ndash;203.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Acemoglu, D., Naidu, S., Restrepo, P., and Robinson, J. A. (2019). &lt;a href="https://doi.org/10.1086/700936" target="_blank" rel="noopener">Democracy does cause growth&lt;/a>. &lt;em>Journal of Political Economy&lt;/em>, 127(1), 47&amp;ndash;100.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dhaene, G. and Jochmans, K. (2015). &lt;a href="https://doi.org/10.1093/restud/rdv007" target="_blank" rel="noopener">Split-panel jackknife estimation of fixed-effect models&lt;/a>. &lt;em>Review of Economic Studies&lt;/em>, 82(3), 991&amp;ndash;1030.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. The content may nevertheless contain errors, so apply it to real research projects with caution.&lt;/p></description></item><item><title>Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data</title><link>https://carlos-mendez.org/post/stata_bma_dsl/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_bma_dsl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Can countries grow their way out of pollution? The &lt;strong>Environmental Kuznets Curve (EKC)&lt;/strong> hypothesis says yes &amp;mdash; up to a point. As economies develop, pollution first rises with industrialization and then falls as countries grow wealthy enough to afford cleaner technology. But recent research suggests a more complex &lt;strong>inverted-N&lt;/strong> shape: pollution falls at very low incomes, rises through industrialization, and then falls again at high incomes.&lt;/p>
&lt;p>Testing for this shape requires a cubic polynomial in GDP per capita &amp;mdash; and beyond GDP, many other factors might affect CO&lt;sub>2&lt;/sub> emissions. With 12 candidate control variables, there are $2^{12} = 4{,}096$ possible regression models. &lt;strong>Which model should we estimate?&lt;/strong> This is the &lt;strong>model uncertainty problem&lt;/strong>.&lt;/p>
&lt;p>This tutorial introduces two principled solutions:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> estimates thousands of models and averages the results, weighting each by how well it fits the data. Each variable gets a &lt;strong>Posterior Inclusion Probability (PIP)&lt;/strong> &amp;mdash; the fraction of high-quality models that include it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Post-Double-Selection LASSO (DSL)&lt;/strong> uses LASSO to automatically select which controls matter &amp;mdash; once for the outcome, once for each variable of interest &amp;mdash; then runs OLS with the union of all selected controls. This &amp;ldquo;select, then regress&amp;rdquo; approach protects against omitted variable bias.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>We use &lt;strong>synthetic panel data&lt;/strong> with a known &amp;ldquo;answer key&amp;rdquo; &amp;mdash; we designed the data so that 5 controls truly affect CO&lt;sub>2&lt;/sub> and 7 are pure noise. This lets us grade each method: does it correctly identify the true predictors? The data is inspired by the panel dataset of Gravina and Lanzafame (2025) but is fully synthetic and not identical to the original.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Companion tutorial.&lt;/strong> For a cross-sectional perspective using R with BMA, LASSO, and WALS, see the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">R tutorial on variable selection&lt;/a>.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the EKC hypothesis and why a cubic polynomial tests for an inverted-N shape&lt;/li>
&lt;li>Recognize model uncertainty as a practical challenge when many controls are available&lt;/li>
&lt;li>Implement BMA with &lt;code>bmaregress&lt;/code> and interpret PIPs and coefficient densities&lt;/li>
&lt;li>Implement post-double-selection LASSO with &lt;code>dsregress&lt;/code> and understand its four-step algorithm: LASSO on outcome, LASSO on each variable of interest, union, then OLS&lt;/li>
&lt;li>Evaluate both methods against a known ground truth to assess their accuracy&lt;/li>
&lt;/ul>
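&lt;p>To see the four-step double-selection logic in miniature, here is a self-contained sketch in Python with NumPy (not &lt;code>dsregress&lt;/code> itself; the data, variable names, and fixed penalty are illustrative, and the LASSO is a hand-rolled coordinate descent):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 5))                    # candidate controls
d = X[:, 0] + X[:, 2] + rng.normal(size=n)     # "treatment", confounded by x0, x2
y = 0.5 * d + X[:, 0] + X[:, 1] + rng.normal(size=n)   # true effect of d is 0.5

def lasso(X, y, lam, sweeps=200):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]     # partial residual
            g = X[:, j] @ r / len(y)
            b[j] = np.sign(g) * max(abs(g) - lam, 0) / (X[:, j] ** 2).mean()
    return b

# Steps 1-2: LASSO of the outcome on the controls, then of d on the controls
s_y = np.flatnonzero(np.abs(lasso(X, y, lam=0.1)) > 1e-8)
s_d = np.flatnonzero(np.abs(lasso(X, d, lam=0.1)) > 1e-8)
# Steps 3-4: union of the two selected sets, then plain OLS
keep = sorted(set(s_y) | set(s_d))
Z = np.column_stack([np.ones(n), d, X[:, keep]])
alpha_dsl = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Short regression of y on d alone, badly biased by the confounders
W = np.column_stack([np.ones(n), d])
alpha_naive = np.linalg.lstsq(W, y, rcond=None)[0][1]
print(f"naive: {alpha_naive:.2f}  double-selection: {alpha_dsl:.2f}  true: 0.50")
```

&lt;p>The short regression of the outcome on the treatment alone is badly biased by the confounding controls; the union-then-OLS step recovers an estimate close to the true effect because any control that predicts &lt;em>either&lt;/em> the outcome or the treatment is retained.&lt;/p>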
&lt;p>The following diagram summarizes the methodological sequence of this tutorial. We begin with exploratory data analysis to visualize the raw income&amp;ndash;pollution relationship, then estimate baseline fixed effects regressions to expose the model uncertainty problem. Next, we apply BMA and DSL as two alternative solutions, and finally compare both methods against the known answer key.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scatter plot&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Standard panel&amp;lt;br/&amp;gt;regressions&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model&amp;lt;br/&amp;gt;Averaging&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;DSL&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Double-Selection&amp;lt;br/&amp;gt;LASSO&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Check against&amp;lt;br/&amp;gt;answer key&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
style E fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="2-setup-and-synthetic-data">2. Setup and Synthetic Data&lt;/h2>
&lt;h3 id="21-why-synthetic-data">2.1 Why synthetic data?&lt;/h3>
&lt;p>Real-world datasets rarely come with an answer key. We never know which control variables &lt;em>truly&lt;/em> belong in the model. By generating synthetic data with a known data-generating process (DGP), we can verify whether BMA and DSL correctly recover the truth. This is the same &amp;ldquo;answer key&amp;rdquo; approach used in the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion R tutorial&lt;/a>, applied here to panel data.&lt;/p>
&lt;h3 id="22-the-data-generating-process">2.2 The data-generating process&lt;/h3>
&lt;p>The outcome &amp;mdash; log CO&lt;sub>2&lt;/sub> per capita &amp;mdash; follows a cubic EKC with country and year fixed effects:&lt;/p>
&lt;p>$$\ln(\text{CO2})_{it} = \beta_1 \ln(\text{GDP})_{it} + \beta_2 [\ln(\text{GDP})_{it}]^2 + \beta_3 [\ln(\text{GDP})_{it}]^3 + \mathbf{X}_{it}^{\text{true}} \boldsymbol{\gamma} + \alpha_i + \delta_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, log CO&lt;sub>2&lt;/sub> depends on a cubic function of log GDP (producing the inverted-N shape), five true control variables $\mathbf{X}^{\text{true}}$, country fixed effects $\alpha_i$, year fixed effects $\delta_t$, and random noise $\varepsilon_{it}$.&lt;/p>
&lt;p>The &lt;strong>answer key&lt;/strong> &amp;mdash; which variables are true predictors and which are noise:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Group&lt;/th>
&lt;th>In DGP?&lt;/th>
&lt;th>True coef.&lt;/th>
&lt;th>GDP corr.&lt;/th>
&lt;th>Role&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>fossil_fuel&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.015&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More fossil fuels → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>renewable&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More renewables → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>urban&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.007&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More urbanization → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>democracy&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.005&lt;/td>
&lt;td>low&lt;/td>
&lt;td>More democracy → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>industry&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More industry → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>globalization&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pop_density&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>corruption&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>services&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>trade&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>fdi&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>credit&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The &amp;ldquo;GDP corr.&amp;rdquo; column is key to understanding why this problem is non-trivial. Four noise variables (&lt;code>globalization&lt;/code>, &lt;code>services&lt;/code>, &lt;code>trade&lt;/code>, &lt;code>credit&lt;/code>) are deliberately correlated with GDP. A naive regression would find them &amp;ldquo;significant&amp;rdquo; because they piggyback on GDP&amp;rsquo;s true effect. The challenge for BMA and DSL is to see through this correlation and correctly identify that only the 5 true controls belong in the model.&lt;/p>
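&lt;p>The piggybacking mechanism is easy to reproduce. In the illustrative Python simulation below (hypothetical variables, not our synthetic panel), a pure-noise control correlated with GDP looks strongly &amp;ldquo;significant&amp;rdquo; in a short regression and collapses toward zero once GDP is controlled for:&lt;/p>

```python
import numpy as np

# A noise control correlated with GDP inherits GDP's effect in a short
# regression, and loses it once GDP enters the model. Purely illustrative.
rng = np.random.default_rng(1)
n = 2000
gdp = rng.normal(size=n)
noise_ctrl = 0.8 * gdp + 0.6 * rng.normal(size=n)   # zero true effect on y
y = 1.0 * gdp + rng.normal(size=n)                  # y depends on GDP only

def ols_coefs(y, *cols):
    Z = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

short = ols_coefs(y, noise_ctrl)[1]        # piggybacks on GDP: far from 0
long = ols_coefs(y, gdp, noise_ctrl)[2]    # with GDP included: close to 0
print(f"short-regression slope: {short:.2f}  with GDP included: {long:.2f}")
```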
&lt;p>With the DGP and answer key defined, we now load the synthetic data and set up the Stata environment.&lt;/p>
&lt;h3 id="23-load-the-data">2.3 Load the data&lt;/h3>
&lt;p>The synthetic data is hosted on GitHub for reproducibility. It was generated by &lt;code>generate_data.do&lt;/code> (see the link above).&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load synthetic data from GitHub
import delimited &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_bma_dsl/synthetic_ekc_panel.csv&amp;quot;, clear
xtset country_id year, yearly
&lt;/code>&lt;/pre>
&lt;h3 id="24-define-macros">2.4 Define macros&lt;/h3>
&lt;p>We define all variable groups as global macros &amp;mdash; used in every command throughout the tutorial:&lt;/p>
&lt;pre>&lt;code class="language-stata">global outcome &amp;quot;ln_co2&amp;quot;
global gdp_vars &amp;quot;ln_gdp ln_gdp_sq ln_gdp_cb&amp;quot;
global energy &amp;quot;fossil_fuel renewable&amp;quot;
global socio &amp;quot;urban globalization pop_density&amp;quot;
global inst &amp;quot;democracy corruption&amp;quot;
global econ &amp;quot;industry services trade fdi credit&amp;quot;
global controls &amp;quot;$energy $socio $inst $econ&amp;quot;
global fe &amp;quot;i.country_id i.year&amp;quot;
* Ground truth (for evaluation)
global true_vars &amp;quot;fossil_fuel renewable urban democracy industry&amp;quot;
global noise_vars &amp;quot;globalization pop_density corruption services trade fdi credit&amp;quot;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">summarize $outcome $gdp_vars $controls
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ln_co2 | 1,600 -19.0385 .7863276 -21.03685 -16.8315
ln_gdp | 1,600 9.58387 1.329675 6.974263 11.9704
ln_gdp_sq | 1,600 93.6174 25.55106 48.64035 143.2904
ln_gdp_cb | 1,600 931.105 373.829 339.2306 1715.243
fossil_fuel | 1,600 54.7724 19.14168 6.36807 95
renewable | 1,600 29.5413 11.96568 1 64.2207
urban | 1,600 53.6742 14.778 15.95174 91.63234
globalizat~n | 1,600 57.6498 12.71537 26.75758 95
pop_density | 1,600 121.344 210.2646 1 1571.771
democracy | 1,600 2.33346 4.179503 -6.12244 10
corruption | 1,600 52.3523 28.52792 0 100
industry | 1,600 24.6433 6.180478 5.843938 45.32926
services | 1,600 43.5598 9.366089 17.82623 64.07455
trade | 1,600 67.4355 19.36148 10.04306 128.0595
fdi | 1,600 2.98237 4.373857 -11.50437 16.19903
credit | 1,600 53.4402 18.20204 11.32991 123.2399
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 1,600 observations from 80 countries over 20 years (1995&amp;ndash;2014). Log GDP per capita ranges from 6.97 to 11.97, spanning the full income spectrum from about \$1,065 to \$158,000 in synthetic international dollars. Log CO&lt;sub>2&lt;/sub> has a mean of &amp;ndash;19.04 with substantial variation (standard deviation 0.79), reflecting the wide range of development levels in our synthetic panel. With the data loaded, we next visualize the raw income&amp;ndash;pollution relationship.&lt;/p>
&lt;h2 id="3-exploratory-data-analysis">3. Exploratory Data Analysis&lt;/h2>
&lt;p>Before modeling, let us look at the raw relationship between income and emissions.&lt;/p>
&lt;pre>&lt;code class="language-stata">twoway (scatter $outcome ln_gdp, ///
msize(vsmall) mcolor(&amp;quot;106 155 204&amp;quot;%40) msymbol(circle)), ///
ytitle(&amp;quot;Log CO2 per capita&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Synthetic Data: CO2 vs. Income&amp;quot;, size(medium)) ///
subtitle(&amp;quot;80 countries, 1995-2014 (N = 1,600)&amp;quot;, size(small)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig1_scatter.png" alt="Scatter plot of log CO2 per capita versus log GDP per capita for 80 synthetic countries. The cloud of points shows a clear nonlinear pattern consistent with the inverted-N EKC shape.">&lt;/p>
&lt;p>The scatter reveals a distinctly nonlinear pattern. At low income levels, CO&lt;sub>2&lt;/sub> emissions increase steeply with GDP. At higher income levels, the relationship flattens and bends. This curvature motivates the cubic EKC specification. The diagram below shows the two competing EKC shapes &amp;mdash; the classic inverted-U (quadratic) and the more complex inverted-N (cubic) with its three distinct phases:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
EKC[&amp;quot;&amp;lt;b&amp;gt;Environmental Kuznets Curve&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;How does pollution change&amp;lt;br/&amp;gt;as income grows?&amp;quot;]
EKC --&amp;gt; IU[&amp;quot;&amp;lt;b&amp;gt;Inverted-U&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Quadratic: β₁ &amp;gt; 0, β₂ &amp;lt; 0&amp;lt;br/&amp;gt;One turning point&amp;quot;]
EKC --&amp;gt; IN[&amp;quot;&amp;lt;b&amp;gt;Inverted-N&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Cubic: β₁ &amp;lt; 0, β₂ &amp;gt; 0, β₃ &amp;lt; 0&amp;lt;br/&amp;gt;Two turning points&amp;quot;]
IN --&amp;gt; P1[&amp;quot;&amp;lt;b&amp;gt;Phase 1: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Very poor countries&amp;quot;]
IN --&amp;gt; P2[&amp;quot;&amp;lt;b&amp;gt;Phase 2: Rising&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Industrializing countries&amp;quot;]
IN --&amp;gt; P3[&amp;quot;&amp;lt;b&amp;gt;Phase 3: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Wealthy countries&amp;quot;]
style EKC fill:#141413,stroke:#141413,color:#fff
style IU fill:#6a9bcc,stroke:#141413,color:#fff
style IN fill:#d97757,stroke:#141413,color:#fff
style P1 fill:#00d4c8,stroke:#141413,color:#141413
style P2 fill:#d97757,stroke:#141413,color:#fff
style P3 fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>For an inverted-N, we need $\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$. Our synthetic DGP was designed with exactly this sign pattern ($\beta_1 = -7.1$, $\beta_2 = 0.81$, $\beta_3 = -0.03$), so BMA and DSL should recover it &amp;mdash; but can they also correctly identify which of the 12 controls truly matter? Let us start with standard panel regressions to see how sensitive the GDP coefficients are to the choice of controls.&lt;/p>
&lt;h2 id="4-baseline-----standard-fixed-effects">4. Baseline &amp;mdash; Standard Fixed Effects&lt;/h2>
&lt;p>Before reaching for sophisticated methods, let us see what standard panel regressions say. We run two specifications using macros:&lt;/p>
&lt;h3 id="41-sparse-specification">4.1 Sparse specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars, absorb(country_id year) vce(cluster country_id)
estimates store fe_sparse
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9620
Within R-sq. = 0.0354
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.498046 1.623988 -4.62 0.000 -10.73051 -4.26558
ln_gdp_sq | .848967 .1704533 4.98 0.000 .5096881 1.188246
ln_gdp_cb | -.0314993 .005931 -5.31 0.000 -.0433047 -.019694
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The sparse model finds the inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), all significant at the 0.1% level with cluster-robust standard errors (clustered at the country level). The within R² is just 0.035 &amp;mdash; the GDP polynomial alone explains only about 3.5% of within-country CO&lt;sub>2&lt;/sub> variation after absorbing country and year fixed effects. The overall R² of 0.96 is high because the country fixed effects capture most of the variation.&lt;/p>
&lt;h3 id="42-kitchen-sink-specification">4.2 Kitchen-sink specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars $controls, absorb(country_id year) vce(cluster country_id)
estimates store fe_kitchen
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9655
Within R-sq. = 0.1249
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.130693 1.562581 -4.56 0.000 -10.24093 -4.020453
ln_gdp_sq | .8059928 .1647973 4.89 0.000 .477972 1.134014
ln_gdp_cb | -.0298133 .0057365 -5.20 0.000 -.0412314 -.0183951
fossil_fuel | .0138444 .0014853 9.32 0.000 .010888 .0168008
renewable | -.006795 .0019322 -3.52 0.001 -.0106409 -.0029491
urban | .0057534 .0021432 2.68 0.009 .0014875 .0100192
globalizat~n | .0015186 .0012832 1.18 0.240 -.0010357 .0040728
pop_density | .0000794 .0002303 0.34 0.731 -.000379 .0005378
democracy | -.0002971 .007735 -0.04 0.969 -.0156933 .0150991
corruption | .0009812 .0008415 1.17 0.247 -.0006936 .0026561
industry | .0086336 .0017848 4.84 0.000 .0050811 .0121861
services | -.0005642 .0017205 -0.33 0.744 -.0039889 .0028604
trade | -.0002458 .0007695 -0.32 0.750 -.0017774 .0012858
fdi | -.0017599 .0019509 -0.90 0.370 -.005643 .0021232
credit | -.00139 .0007516 -1.85 0.068 -.002886 .0001061
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding all 12 controls raises the within R² from 0.035 to 0.125 &amp;mdash; a meaningful improvement, though the country and year FE still dominate the overall explanatory power (R² = 0.966). The three strongest true predictors (fossil fuel, industry, urban) are clearly significant, while most noise variables are statistically insignificant. Democracy&amp;rsquo;s estimate (&amp;ndash;0.0003, p = 0.97) is far from its true value (&amp;ndash;0.005) and indistinguishable from zero &amp;mdash; illustrating why weak signals are hard to detect even with the correct model.&lt;/p>
&lt;p>The critical question is: which specification should we trust? The next subsection shows that the GDP coefficients &amp;mdash; and hence the EKC shape &amp;mdash; shift depending on which controls we include.&lt;/p>
&lt;h3 id="43-the-model-uncertainty-problem">4.3 The model uncertainty problem&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Coefficient&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Both specifications recover the correct sign pattern, but the magnitudes shift. The kitchen-sink FE estimates (&amp;ndash;7.131, 0.806, &amp;ndash;0.030) are closer to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030) than the sparse FE (&amp;ndash;7.498, 0.849, &amp;ndash;0.031), because the omitted true controls create bias in the sparse model. But which of the 12 controls actually belongs?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Compare coefficients side by side (simplified from analysis.do)
graph twoway ///
(bar value order if spec == &amp;quot;Sparse FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;106 155 204&amp;quot;)) ///
(bar value order if spec == &amp;quot;Kitchen-Sink FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;217 119 87&amp;quot;)), ///
xlabel(1 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 2 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 3 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;' ///
4 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 5 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 6 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;') ///
xline(3.5, lcolor(gs10) lpattern(dash)) ///
ytitle(&amp;quot;Coefficient value&amp;quot;) ///
title(&amp;quot;Coefficient Instability Across Specifications&amp;quot;) ///
legend(order(1 &amp;quot;Sparse FE (no controls)&amp;quot; 2 &amp;quot;Kitchen-Sink FE (all 12 controls)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig2_instability.png" alt="Bar chart comparing GDP polynomial coefficients between sparse and kitchen-sink fixed effects specifications. The coefficients shift between the two models, demonstrating model uncertainty.">&lt;/p>
&lt;p>To understand the practical implications of these coefficient shifts, we compute the income thresholds where emissions change direction. The &lt;strong>turning points&lt;/strong> are found by setting the first derivative of the cubic to zero:&lt;/p>
&lt;p>$$x^* = \frac{-\hat{\beta}_2 \pm \sqrt{\hat{\beta}_2^2 - 3\hat{\beta}_1\hat{\beta}_3}}{3\hat{\beta}_3}, \quad \text{GDP}^* = \exp(x^*)$$&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Turning point&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Minimum (CO&lt;sub>2&lt;/sub> starts rising)&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Maximum (CO&lt;sub>2&lt;/sub> starts falling)&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The turning points shift modestly between specifications &amp;mdash; the minimum stays near \$2,400&amp;ndash;\$2,500 while the maximum moves from \$25,656 to \$27,694 depending on controls. Neither matches the true DGP values perfectly, motivating BMA and DSL as principled alternatives to ad hoc control selection.&lt;/p>
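&lt;p>As a quick check, the turning-point formula can be evaluated directly. The Python snippet below plugs in the true DGP coefficients and reproduces the &amp;ldquo;True DGP&amp;rdquo; column of the table:&lt;/p>

```python
import math

# Turning points of the cubic EKC, evaluated at the true DGP coefficients
# (beta1 = -7.1, beta2 = 0.81, beta3 = -0.03).
b1, b2, b3 = -7.1, 0.81, -0.03
disc = math.sqrt(b2**2 - 3 * b1 * b3)
roots = sorted([(-b2 + disc) / (3 * b3), (-b2 - disc) / (3 * b3)])
gdp_min, gdp_max = math.exp(roots[0]), math.exp(roots[1])
# minimum: $1,895  maximum: $34,647 (the True DGP column of the table)
print(f"minimum: ${gdp_min:,.0f}  maximum: ${gdp_max:,.0f}")
```

&lt;p>The same calculation with the sparse or kitchen-sink coefficients reproduces the other two columns.&lt;/p>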
&lt;h2 id="5-bayesian-model-averaging">5. Bayesian Model Averaging&lt;/h2>
&lt;h3 id="51-the-idea">5.1 The idea&lt;/h3>
&lt;p>Think of BMA as betting on a horse race. Instead of putting all your money on one model, BMA spreads bets across the field, wagering more on models with better track records.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Start[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;2¹² = 4,096&amp;lt;br/&amp;gt;possible models&amp;quot;] --&amp;gt; MCMC[&amp;quot;&amp;lt;b&amp;gt;MCMC Sampling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Draw 50,000 models&amp;quot;]
MCMC --&amp;gt; Post[&amp;quot;&amp;lt;b&amp;gt;Posterior Probability&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Weight by fit × parsimony&amp;quot;]
Post --&amp;gt; Avg[&amp;quot;&amp;lt;b&amp;gt;Weighted Average&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Coefficients averaged&amp;lt;br/&amp;gt;across models&amp;quot;]
Post --&amp;gt; PIP[&amp;quot;&amp;lt;b&amp;gt;PIPs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Inclusion probability&amp;lt;br/&amp;gt;for each variable&amp;quot;]
style Start fill:#141413,stroke:#141413,color:#fff
style MCMC fill:#6a9bcc,stroke:#141413,color:#fff
style Post fill:#d97757,stroke:#141413,color:#fff
style Avg fill:#00d4c8,stroke:#141413,color:#141413
style PIP fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>Formally, this betting process follows Bayes' rule, which tells us how to weight models by their fit and complexity.&lt;/p>
&lt;p>&lt;strong>Step 1: Model posterior probabilities.&lt;/strong> The posterior probability of model $M_k$ is:&lt;/p>
&lt;p>$$P(M_k | \text{data}) = \frac{P(\text{data} | M_k) \cdot P(M_k)}{\sum_{l=1}^{K} P(\text{data} | M_l) \cdot P(M_l)}$$&lt;/p>
&lt;p>In words, the probability of model $k$ being correct equals how well it fits the data (the &lt;em>marginal likelihood&lt;/em> $P(\text{data} | M_k)$) times our prior belief ($P(M_k)$), divided by the total across all models. Models that fit the data well &lt;em>and&lt;/em> are parsimonious receive higher posterior weight &amp;mdash; this is BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor.&lt;/p>
&lt;p>The marginal likelihood $P(\text{data} | M_k)$ is not the same as the ordinary likelihood. It integrates over all possible coefficient values, penalizing models with many parameters that &amp;ldquo;waste&amp;rdquo; probability mass on parameter regions the data does not support:&lt;/p>
&lt;p>$$P(\text{data} | M_k) = \int P(\text{data} | \boldsymbol{\beta}_k, M_k) \, P(\boldsymbol{\beta}_k | M_k) \, d\boldsymbol{\beta}_k$$&lt;/p>
&lt;p>In words, the marginal likelihood asks: &amp;ldquo;If we averaged this model&amp;rsquo;s fit across all plausible coefficient values (weighted by the prior $P(\boldsymbol{\beta}_k | M_k)$), how well does it explain the data?&amp;rdquo; This integral is what makes BMA automatically penalize overly complex models &amp;mdash; a model with many parameters spreads its prior probability thinly across a high-dimensional space, and only recovers that probability if the data strongly supports those extra dimensions.&lt;/p>
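&lt;p>To see this Occam penalty in action, here is a toy Python sketch (illustrative numbers, not part of the Stata workflow) with a single observation $y$ and two hypothetical models: $M_0$ pins the coefficient at zero, while $M_1$ gives it a $N(0, 1)$ prior. In this conjugate case the integral above has the closed form $N(y \mid 0, 1 + \tau^2)$, which we verify with a brute-force Riemann sum:&lt;/p>

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_likelihood(y, prior_var, steps=20000, width=10.0):
    # Midpoint Riemann sum for: integral of P(y | beta) * P(beta) d(beta)
    h = 2 * width / steps
    total = 0.0
    for i in range(steps):
        beta = -width + (i + 0.5) * h
        total += normal_pdf(y, beta, 1.0) * normal_pdf(beta, 0.0, prior_var) * h
    return total

y = 0.3                                     # data that barely need the extra parameter
m0 = normal_pdf(y, 0.0, 1.0)                # simple model: coefficient pinned at zero
m1 = marginal_likelihood(y, prior_var=1.0)  # complex model: free coefficient, N(0, 1) prior
closed = normal_pdf(y, 0.0, 2.0)            # conjugate closed form: N(y | 0, 1 + prior_var)
print(round(m1, 4), round(closed, 4))       # both 0.2758: the Riemann sum matches
```

&lt;p>With $y = 0.3$ the simple model has higher evidence (about 0.38 vs 0.28): the free parameter &amp;ldquo;wastes&amp;rdquo; prior mass on values the data do not need. Rerunning with $y = 3.0$ flips the ordering, because then the extra parameter earns its keep.&lt;/p>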
&lt;p>&lt;strong>Step 2: Posterior Inclusion Probabilities.&lt;/strong> The &lt;strong>PIP&lt;/strong> for variable $j$ sums the posterior probabilities across all models that include it:&lt;/p>
&lt;p>$$\text{PIP}_j = \sum_{k:\, x_j \in M_k} P(M_k | \text{data})$$&lt;/p>
&lt;p>In words, PIP answers: &amp;ldquo;Across all the models BMA considered, what fraction of the total posterior weight belongs to models that include variable $j$?&amp;rdquo; If fossil fuel appears in every high-probability model, its PIP approaches 1.0. If democracy only appears in low-probability models, its PIP stays near 0.&lt;/p>
&lt;p>&lt;strong>Step 3: BMA posterior mean.&lt;/strong> BMA does not just select variables &amp;mdash; it also produces model-averaged coefficient estimates. The posterior mean of coefficient $\beta_j$ averages across all models, weighted by their posterior probabilities:&lt;/p>
&lt;p>$$\hat{\beta}_j^{\text{BMA}} = \sum_{k=1}^{K} P(M_k | \text{data}) \cdot \hat{\beta}_{j,k}$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the coefficient estimate of variable $j$ in model $M_k$ (set to zero if $j$ is not in $M_k$). In words, the BMA estimate is a weighted average of the coefficient across all models, including models where the variable is absent (contributing zero). This shrinks the coefficient toward zero in proportion to the evidence against inclusion &amp;mdash; a variable with PIP = 0.5 has its BMA coefficient shrunk by roughly half compared to its conditional estimate.&lt;/p>
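&lt;p>Steps 1&amp;ndash;3 fit in a few lines. The following Python sketch uses made-up marginal likelihoods and within-model estimates for a toy model space over two candidate controls (the numbers are illustrative, not from this tutorial's data):&lt;/p>

```python
# Toy model space over two candidate controls; the marginal likelihoods are made up
marg_lik = {('x1',): 0.30, ('x1', 'x2'): 0.06, ('x2',): 0.01}   # P(data | M_k), assumed
beta_x1 = {('x1',): 1.10, ('x1', 'x2'): 1.05}    # estimate of beta_1 within each model

prior = 1.0 / len(marg_lik)                      # uniform prior over models
evidence = sum(ml * prior for ml in marg_lik.values())
post = {m: ml * prior / evidence for m, ml in marg_lik.items()}   # Step 1: Bayes' rule

pip_x1 = sum(p for m, p in post.items() if 'x1' in m)             # Step 2: PIP
bma_x1 = sum(post[m] * beta_x1.get(m, 0.0) for m in post)         # Step 3: absent model adds zero

print(round(pip_x1, 3), round(bma_x1, 3))   # 0.973 1.062
```

&lt;p>Note that the BMA mean equals the PIP times the conditional-on-inclusion mean, so the 2.7% of posterior mass on the model excluding x1 shrinks the estimate from about 1.092 to 1.062.&lt;/p>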
&lt;p>Think of PIP as a &lt;strong>democratic vote&lt;/strong> across all candidate models. Each model casts a weighted vote for which variables matter, with better-fitting models getting louder voices. &lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery (1995)&lt;/a> proposed standard interpretation thresholds based on the strength of evidence:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>PIP range&lt;/th>
&lt;th>Evidence&lt;/th>
&lt;th>Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\geq 0.99$&lt;/td>
&lt;td>Decisive&lt;/td>
&lt;td>Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.95 - 0.99$&lt;/td>
&lt;td>Very strong&lt;/td>
&lt;td>Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.80 - 0.95$&lt;/td>
&lt;td>Strong (robust)&lt;/td>
&lt;td>Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.50 - 0.80$&lt;/td>
&lt;td>Borderline&lt;/td>
&lt;td>Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$&amp;lt; 0.50$&lt;/td>
&lt;td>Weak/none (fragile)&lt;/td>
&lt;td>Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our robustness threshold throughout this tutorial &amp;mdash; a variable with PIP above 0.80 appears in the vast majority of the probability-weighted model space, providing &amp;ldquo;strong evidence&amp;rdquo; by Raftery&amp;rsquo;s classification. This is the most widely used cutoff in applied BMA studies.&lt;/p>
&lt;p>A key assumption underlying BMA is that the true data-generating process is well-approximated by a weighted combination of the candidate models (the &amp;ldquo;M-closed&amp;rdquo; assumption). When the candidate set omits important functional forms or interactions, BMA&amp;rsquo;s posterior probabilities may be unreliable.&lt;/p>
&lt;h3 id="52-key-options">5.2 Key options&lt;/h3>
&lt;p>With the conceptual framework in place, we now turn to implementation. Stata 18&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code>&lt;/a> command has three families of options: &lt;strong>priors&lt;/strong> (what you believe before seeing the data), &lt;strong>MCMC controls&lt;/strong> (how the algorithm explores the model space), and &lt;strong>output formatting&lt;/strong> (what gets displayed). The full option list is in the &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Prior specifications&lt;/strong> (see &lt;a href="https://www.stata.com/manuals/bmabmaregresspostestimation.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code> priors&lt;/a> for alternatives):&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>gprior(uip)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; Unit Information Prior: sets the prior precision on coefficients equal to the information in one observation ($g = N$). This is a standard, relatively uninformative choice that lets the data dominate. Alternatives include &lt;code>gprior(bric)&lt;/code> (benchmark risk inflation criterion, $g = \max(N, p^2)$), &lt;code>gprior(zs)&lt;/code> (Zellner-Siow), and &lt;code>gprior(hyper)&lt;/code> (hyper-g prior with data-driven $g$)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>mprior(uniform)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; all $2^{12} = 4{,}096$ models are equally likely a priori; no model is privileged before seeing the data. The alternative &lt;code>mprior(binomial)&lt;/code> applies a beta-binomial prior that penalizes very large or very small models, often producing more conservative PIPs&lt;/li>
&lt;/ul>
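&lt;p>A one-line check of what &lt;code>gprior(uip)&lt;/code> implies for this tutorial's sample (a Python sketch, purely illustrative): under Zellner's $g$-prior the conditional posterior mean shrinks the OLS estimate by the factor $g/(1+g)$, and with $g = N = 1{,}600$ that factor is essentially 1, so the data dominate the prior:&lt;/p>

```python
N = 1600                  # observations in this tutorial's panel
g = N                     # gprior(uip) sets g equal to the sample size
shrinkage = g / (1 + g)   # multiplier applied to the OLS estimate under Zellner's g-prior
print(round(shrinkage, 4))   # 0.9994
```

&lt;p>This is exactly the &amp;ldquo;Shrinkage, g/(1+g) = 0.9994&amp;rdquo; line reported in the &lt;code>bmaregress&lt;/code> output header below.&lt;/p>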
&lt;p>&lt;strong>MCMC controls:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>mcmcsize(50000)&lt;/code>&lt;/strong> &amp;mdash; draws 50,000 models from the model space using MC$^3$ (Markov chain Monte Carlo model composition) sampling. Larger values improve posterior estimates but increase computation time&lt;/li>
&lt;li>&lt;strong>&lt;code>burnin(5000)&lt;/code>&lt;/strong> &amp;mdash; discards the first 5,000 draws to allow the chain to reach its stationary distribution before collecting samples&lt;/li>
&lt;li>&lt;strong>&lt;code>rseed(9988)&lt;/code>&lt;/strong> &amp;mdash; fixes the random number seed for exact reproducibility. Students running the same command will get identical results&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>groupfv&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; treats all dummies from a single factor variable as one group that enters or exits models together. Without &lt;code>groupfv&lt;/code>, writing &lt;code>i.country_id&lt;/code> would create 80 individual dummy variables, and BMA would consider including or excluding each one independently &amp;mdash; producing an astronomical model space ($2^{80}$ combinations of country dummies alone) that is both computationally infeasible and conceptually meaningless. With &lt;code>groupfv&lt;/code>, the 80 country dummies move as a &lt;em>package&lt;/em>: either all 80 are in the model or none are. Think of it like hiring a sports team &amp;mdash; you recruit the whole roster, not individual players one by one. In the output, this is why you see &amp;ldquo;Groups = 15&amp;rdquo; instead of 113: BMA treats the 80 country dummies as 1 group, the 19 year dummies as 1 group, and each of the 12 candidate controls + 3 GDP terms as their own groups ($1 + 1 + 15 = 17$, minus 2 that are &amp;ldquo;always&amp;rdquo; included = 15 groups subject to selection)&lt;/li>
&lt;li>&lt;strong>&lt;code>($fe, always)&lt;/code>&lt;/strong> &amp;mdash; country and year fixed effects are always included in every model; they are not subject to model selection. This is standard practice in panel data BMA: we want to control for unobserved country and time heterogeneity in &lt;em>every&lt;/em> model, and only let BMA decide about the candidate controls&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Output formatting:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>pipcutoff(0.8)&lt;/code>&lt;/strong> &amp;mdash; display only variables with PIP above 0.80 in the output table. This is a &lt;em>display&lt;/em> threshold only &amp;mdash; it does not affect the underlying estimation&lt;/li>
&lt;li>&lt;strong>&lt;code>inputorder&lt;/code>&lt;/strong> &amp;mdash; display variables in the order they were specified in the command, rather than sorted by PIP&lt;/li>
&lt;/ul>
&lt;h3 id="53-estimation">5.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress $outcome $gdp_vars $controls ///
($fe, always), ///
mprior(uniform) groupfv gprior(uip) ///
mcmcsize(50000) rseed(9988) inputorder pipcutoff(0.8)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 113
MC3 sampling Groups = 15
Always = 98
No. of models = 163
Priors: Mean model size = 104.578
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0904
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9997
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -7.13901 1.811093 1 .99401
ln_gdp_sq | .8078437 .1892418 2 .99991
ln_gdp_cb | -.0299182 .0065105 3 .99976
fossil_fuel | .0138139 .001283 4 1
renewable | -.0068332 .0023506 5 .95945
industry | .0085503 .0019766 11 .99867
------------------------------------------------------------------------------
Note: 9 predictors with PIP less than .8 not shown.
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>The Stata output says &amp;ldquo;PIP less than .8&amp;rdquo; because we set &lt;code>pipcutoff(0.8)&lt;/code> as the display threshold &amp;mdash; only variables exceeding this stricter robustness criterion appear in the table. The 9 hidden variables are the two weak true controls (urban, democracy) and all 7 noise variables (services, trade, FDI, credit, population density, corruption, globalization). Figure 3 below shows PIP values for all 15 variables.&lt;/p>
&lt;/blockquote>
&lt;p>The output shows 113 predictors in 15 groups: the 80 country dummies (grouped as 1 by &lt;code>groupfv&lt;/code>) + 19 year dummies (grouped as 1) + 12 candidate controls (each its own group) + the 3 GDP terms (each its own group) = 15 selection groups total, with 98 variables &amp;ldquo;always&amp;rdquo; included (the country and year FE). BMA sampled 163 distinct models out of 4,096 possible. This might seem low, but the MC$^3$ algorithm does not need to visit every model &amp;mdash; it concentrates on the high-posterior-probability region. The sampling correlation of 0.9997 (very close to 1.0) confirms that the MC$^3$ chain adequately explored the model space &amp;mdash; the posterior probability is concentrated on a relatively small number of high-quality models. The acceptance rate of 0.09 is below the typical 20&amp;ndash;40% range, but the high sampling correlation provides reassurance that the results are reliable. Six variables have PIP above the 0.80 robustness threshold: the three GDP terms (PIP = 0.994&amp;ndash;1.000) and three of the five true controls &amp;mdash; fossil fuel (PIP = 1.000), industry (PIP = 0.999), and renewable energy (PIP = 0.959). The BMA posterior means (&amp;ndash;7.139, 0.808, &amp;ndash;0.030) are remarkably close to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030), substantially closer than the sparse FE estimates.&lt;/p>
&lt;p>Two true controls &amp;mdash; urban (coefficient 0.007) and democracy (coefficient &amp;ndash;0.005) &amp;mdash; have PIPs well below 0.80. Their true effects are small, making them hard to distinguish from noise. This is a realistic limitation: even a powerful method like BMA struggles with weak signals.&lt;/p>
&lt;h3 id="54-turning-points">5.4 Turning points&lt;/h3>
&lt;p>Using the BMA posterior means, the turning points are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,411 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,269 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
&lt;p>Both turning points are in the right ballpark but not exact. The turning point formula amplifies small differences across all three coefficients &amp;mdash; even though each BMA posterior mean is within 1% of the true DGP value, the compound effect shifts the maximum turning point from \$34,647 (true) to \$27,269 (BMA). The inverted-N shape is clearly recovered.&lt;/p>
&lt;h3 id="55-posterior-inclusion-probabilities">5.5 Posterior Inclusion Probabilities&lt;/h3>
&lt;p>The PIP chart is BMA&amp;rsquo;s signature output. We extract PIPs from the estimation results, label each variable, and color-code bars by ground truth: steel blue for true predictors, gray for noise.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Extract PIPs and create a horizontal bar chart
matrix pip_mat = e(pip)
* ... (create dataset of variable names and PIPs, add readable labels) ...
* Mark true vs noise predictors
gen is_true = inlist(varname, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;renewable&amp;quot;, &amp;quot;urban&amp;quot;, ///
&amp;quot;democracy&amp;quot;, &amp;quot;industry&amp;quot;, &amp;quot;ln_gdp&amp;quot;, &amp;quot;ln_gdp_sq&amp;quot;, &amp;quot;ln_gdp_cb&amp;quot;)
gsort -pip
graph twoway ///
(bar pip order if is_true == 1, horizontal barwidth(0.6) ///
color(&amp;quot;106 155 204&amp;quot;)) ///
(bar pip order if is_true == 0, horizontal barwidth(0.6) ///
color(gs11)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;) ///
title(&amp;quot;BMA: Which Variables Matter?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor (in DGP)&amp;quot; 2 &amp;quot;Noise variable (not in DGP)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig3_pip.png" alt="Horizontal bar chart showing Posterior Inclusion Probabilities for all 15 variables. True predictors are colored in steel blue, noise variables in gray. A dashed orange line marks the 0.80 robustness threshold.">&lt;/p>
&lt;p>The PIP chart cleanly separates the variables into two groups. At the top (PIP near 1.0): fossil fuel share, GDP terms, industry, and renewable energy &amp;mdash; all true predictors correctly identified. At the bottom (PIP near 0.0): the seven noise variables (globalization, corruption, services, trade, FDI, credit, population density) plus urban population and democracy. BMA correctly assigns zero-like PIPs to all noise variables, and correctly flags 3 of 5 true predictors as robust. The two misses (urban, democracy) have small true coefficients (0.007 and &amp;ndash;0.005), making them genuinely hard to detect.&lt;/p>
&lt;h3 id="56-coefficient-density-plots">5.6 Coefficient density plots&lt;/h3>
&lt;p>The &lt;a href="https://www.stata.com/manuals/bmabmagraphcoefdensity.pdf" target="_blank" rel="noopener">&lt;code>bmagraph coefdensity&lt;/code>&lt;/a> command shows the posterior distribution of each coefficient across all sampled models. We plot all six variables with PIP above 0.80 in a 3x2 grid &amp;mdash; the three GDP polynomial terms (top row) and the three robust controls (bottom row). In each panel, the blue curve shows the density conditional on the variable being included in the model, and the red horizontal line shows the probability of noninclusion (1 &amp;ndash; PIP). When the red line is flat near zero and the blue curve is far from zero, the variable is strongly supported.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Consistent formatting for all panels
local panel_opts `&amp;quot; xtitle(&amp;quot;Coefficient value&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ytitle(&amp;quot;Density&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ylabel(, labsize(vsmall) angle(0)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' xlabel(, labsize(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' legend(off) scheme(s2color) &amp;quot;'
* Generate density for all 6 robust variables (PIP &amp;gt; 0.80)
bmagraph coefdensity ln_gdp, title(&amp;quot;GDP per capita (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp, replace)
bmagraph coefdensity ln_gdp_sq, title(&amp;quot;GDP squared (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_sq, replace)
bmagraph coefdensity ln_gdp_cb, title(&amp;quot;GDP cubed (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_cb, replace)
bmagraph coefdensity fossil_fuel, title(&amp;quot;Fossil fuel share (%)&amp;quot;, size(small)) `panel_opts' name(dens_fossil, replace)
bmagraph coefdensity renewable, title(&amp;quot;Renewable energy (%)&amp;quot;, size(small)) `panel_opts' name(dens_renewable, replace)
bmagraph coefdensity industry, title(&amp;quot;Industry VA (% GDP)&amp;quot;, size(small)) `panel_opts' name(dens_industry, replace)
graph combine dens_gdp dens_gdp_sq dens_gdp_cb ///
dens_fossil dens_renewable dens_industry, ///
cols(3) rows(2) imargin(small) ///
title(&amp;quot;BMA: Posterior Coefficient Densities&amp;quot;, size(medsmall)) ///
subtitle(&amp;quot;All 6 robust variables (PIP &amp;gt; 0.80)&amp;quot;, size(small)) ///
note(&amp;quot;Blue curve = posterior density conditional on inclusion.&amp;quot; ///
&amp;quot;Red line = probability of noninclusion (1 - PIP).&amp;quot; ///
&amp;quot;Near-zero red line + blue curve far from zero = strong evidence.&amp;quot;, size(vsmall)) ///
scheme(s2color) xsize(12) ysize(7)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig4_coefdensity.png" alt="Posterior coefficient density plots for all six robust variables in a 3x2 grid. Top row: GDP linear, squared, and cubic terms. Bottom row: fossil fuel, renewable energy, and industry. All densities are concentrated well away from zero.">&lt;/p>
&lt;p>All six densities are concentrated well away from zero, confirming that every variable with PIP above 0.80 has a genuinely non-zero effect. The three GDP terms (top row) form the inverted-N polynomial: the linear term is centered near &amp;ndash;7.1 (true: &amp;ndash;7.1), the squared term near +0.81 (true: +0.81), and the cubic term near &amp;ndash;0.030 (true: &amp;ndash;0.030). The three controls (bottom row) show tight, unimodal densities: fossil fuel near +0.014 (true: +0.015), renewable energy near &amp;ndash;0.007 (true: &amp;ndash;0.010), and industry near +0.009 (true: +0.010). Renewable energy&amp;rsquo;s posterior mean (&amp;ndash;0.007) is slightly attenuated compared to the true value (&amp;ndash;0.010), reflecting the BMA shrinkage that occurs when a variable&amp;rsquo;s PIP is below 1.0 &amp;mdash; models that exclude it pull the average toward zero.&lt;/p>
&lt;h3 id="57-pooled-bma-without-fixed-effects">5.7 Pooled BMA (without fixed effects)&lt;/h3>
&lt;p>To parallel the pooled DSL comparison in Section 6.6, we also run BMA without country or year fixed effects &amp;mdash; treating the panel as a pooled cross-section. This removes the &lt;code>($fe, always)&lt;/code> and &lt;code>groupfv&lt;/code> options, leaving only the 12 candidate controls and 3 GDP terms as predictors (15 total, vs 113 with FE).&lt;/p>
&lt;pre>&lt;code class="language-stata">* BMA without FE -- pooled cross-section
bmaregress ln_co2 ln_gdp ln_gdp_sq ln_gdp_cb ///
fossil_fuel renewable urban industry democracy ///
services trade fdi credit pop_density ///
corruption globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 34
Priors: Mean model size = 11.978
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0733
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9996
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -21.25807 1.641676 1 1
ln_gdp_sq | 2.284729 .1748838 2 1
ln_gdp_cb | -.0813937 .0061308 3 1
fossil_fuel | .0188853 .0010554 4 1
renewable | -.0192089 .0013911 5 1
urban | .0103139 .0012072 6 1
industry | .0138361 .0023478 7 1
services | .0164633 .0016573 9 1
pop_density | -.0004314 .0000567 13 1
credit | .0041017 .0008414 12 .99984
trade | -.0020939 .001084 10 .86009
democracy | .007879 .0042984 8 .84142
------------------------------------------------------------------------------
Note: 3 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The pooled BMA results are striking in two ways. First, the GDP coefficients are severely biased &amp;mdash; the same pattern as pooled DSL: $\beta_1 = -21.26$ (true: &amp;ndash;7.10), $\beta_2 = 2.28$ (true: 0.81), $\beta_3 = -0.081$ (true: &amp;ndash;0.03). Without country fixed effects, the GDP terms absorb persistent cross-country differences in emissions levels, inflating all three coefficients by a factor of roughly three.&lt;/p>
&lt;p>Second, the PIPs tell a completely different story than with FE. Without fixed effects, &lt;strong>12 of 15 variables have PIP above 0.80&lt;/strong> &amp;mdash; including noise variables like services (PIP = 1.000), population density (PIP = 1.000), credit (PIP = 1.000), and trade (PIP = 0.860). With FE, only 6 variables cleared the 0.80 threshold and all 7 noise variables had PIPs near zero. The pooled BMA commits &lt;strong>four false positives&lt;/strong> (the noise variables services, pop_density, credit, and trade are incorrectly flagged as robust) compared to &lt;strong>zero&lt;/strong> false positives with FE; it also inflates the PIP of the weak true control democracy to 0.84. This happens because the noise variables are correlated with omitted country effects &amp;mdash; without FE to absorb those effects, the correlations create spurious associations that BMA interprets as genuine predictive power.&lt;/p>
&lt;p>The turning points (\$5,752 minimum, \$23,298 maximum) are far from the truth, and the 95% credible intervals fail to cover the true values for all three GDP terms &amp;mdash; the same coverage failure seen in pooled DSL. The lesson is clear: &lt;strong>fixed effects are not optional in panel BMA&lt;/strong>. They are essential for correct variable selection, not just coefficient estimation.&lt;/p>
&lt;h2 id="6-post-double-selection-lasso">6. Post-Double-Selection LASSO&lt;/h2>
&lt;h3 id="61-the-idea">6.1 The idea&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> implements the &lt;strong>post-double-selection&lt;/strong> method of Belloni, Chernozhukov, and Hansen (2014). Think of it as a smart research assistant who reads the data twice &amp;mdash; once to find controls that predict the outcome (CO&lt;sub>2&lt;/sub>), and again to find controls that predict the variables of interest (GDP terms) &amp;mdash; then runs a clean OLS regression using only the controls that survived at least one selection.&lt;/p>
&lt;p>The &amp;ldquo;double&amp;rdquo; in double-selection refers to the &lt;strong>union&lt;/strong> of two separate LASSO selections. Why is this union necessary? If a control variable predicts both CO&lt;sub>2&lt;/sub> &lt;em>and&lt;/em> GDP but a single LASSO run on CO&lt;sub>2&lt;/sub> happens to miss it, omitting it from the final regression would bias the GDP coefficient. The second LASSO step (on GDP) catches variables that the first step might miss, and vice versa.&lt;/p>
&lt;p>The algorithm has four steps:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Controls[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;+ country &amp;amp; year FE&amp;quot;]
Controls --&amp;gt; Step1[&amp;quot;&amp;lt;b&amp;gt;Step 1: LASSO on Outcome&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ all controls&amp;lt;br/&amp;gt;→ Selected set X̃y&amp;quot;]
Controls --&amp;gt; Step2[&amp;quot;&amp;lt;b&amp;gt;Step 2: LASSO on Each Variable of Interest&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;GDP ~ all controls → X̃₁&amp;lt;br/&amp;gt;GDP² ~ all controls → X̃₂&amp;lt;br/&amp;gt;GDP³ ~ all controls → X̃₃&amp;quot;]
Step1 --&amp;gt; Union[&amp;quot;&amp;lt;b&amp;gt;Step 3: Take the Union&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;X̂ = X̃y ∪ X̃₁ ∪ X̃₂ ∪ X̃₃&amp;lt;br/&amp;gt;Only controls surviving&amp;lt;br/&amp;gt;at least one selection&amp;quot;]
Step2 --&amp;gt; Union
Union --&amp;gt; OLS[&amp;quot;&amp;lt;b&amp;gt;Step 4: Final OLS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ GDP + GDP² + GDP³ + X̂&amp;lt;br/&amp;gt;Standard OLS with valid&amp;lt;br/&amp;gt;inference on GDP terms&amp;quot;]
style Controls fill:#141413,stroke:#141413,color:#fff
style Step1 fill:#6a9bcc,stroke:#141413,color:#fff
style Step2 fill:#d97757,stroke:#141413,color:#fff
style Union fill:#1a3a8a,stroke:#141413,color:#fff
style OLS fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>At the heart of each LASSO step is a penalized regression that shrinks irrelevant coefficients to exactly zero:&lt;/p>
&lt;p>$$\hat{\boldsymbol{\beta}}^{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2N} \sum_{i=1}^{N}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$&lt;/p>
&lt;p>In words, LASSO minimizes the sum of squared residuals (the usual OLS objective) plus a penalty term $\lambda \sum |\beta_j|$ that charges a cost proportional to the &lt;em>absolute value&lt;/em> of each coefficient. The tuning parameter $\lambda$ controls how harsh this penalty is &amp;mdash; think of it as a &amp;ldquo;strictness dial.&amp;rdquo; When $\lambda = 0$, LASSO is just OLS. As $\lambda$ increases, more coefficients are forced to exactly zero. The L1 (absolute value) penalty is what makes LASSO a variable selector: unlike the L2 (squared) penalty used in Ridge regression, the L1 penalty has sharp corners at zero that drive weak coefficients to exactly zero rather than merely shrinking them.&lt;/p>
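&lt;p>To make the selection mechanics concrete, here is a bare-bones coordinate-descent implementation of this objective in Python (an illustrative sketch, not the plugin algorithm &lt;code>dsregress&lt;/code> uses; &lt;code>lasso_cd&lt;/code> and &lt;code>lam&lt;/code> are made-up names). The data are synthetic: one real predictor and one pure-noise predictor.&lt;/p>

```python
import math
import random

def lasso_cd(X, y, lam, sweeps=200):
    # Coordinate descent for (1 / (2 * N)) * SSE + lam * sum of abs(beta_j)
    N, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the partial residual (feature j left out)
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(N)
            ) / N
            z = sum(X[i][j] ** 2 for i in range(N)) / N
            # Soft-thresholding: coefficients inside the penalty band become exactly 0
            beta[j] = math.copysign(max(abs(rho) - lam, 0.0), rho) / z
    return beta

random.seed(1)
N = 200
x1 = [random.gauss(0, 1) for _ in range(N)]
x2 = [random.gauss(0, 1) for _ in range(N)]        # pure noise
y = [2.0 * a + random.gauss(0, 0.1) for a in x1]   # only x1 matters
beta = lasso_cd([[a, b] for a, b in zip(x1, x2)], y, lam=0.5)
print(beta[1] == 0.0)   # True: the noise coefficient is driven to an exact zero
```

&lt;p>With this penalty the noise coefficient lands at exactly zero via the soft-thresholding step, while the real coefficient survives but is shrunk below its true value of 2 (to roughly 1.5), since LASSO penalizes surviving coefficients too. Setting &lt;code>lam&lt;/code> to 0 recovers OLS, and turning the dial up zeroes out more coefficients.&lt;/p>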
&lt;p>&lt;strong>Why &amp;ldquo;double&amp;rdquo; selection?&lt;/strong> The key insight of Belloni, Chernozhukov, and Hansen (2014) is that a single LASSO selection can miss important confounders. Consider our panel setting. We want to estimate the effect of GDP terms ($\mathbf{D}$) on CO&lt;sub>2&lt;/sub> ($Y$), controlling for other variables ($\mathbf{W}$). The model is:&lt;/p>
&lt;p>$$Y_i = \mathbf{D}_i' \boldsymbol{\alpha} + \mathbf{W}_i' \boldsymbol{\beta} + \varepsilon_i$$&lt;/p>
&lt;p>A confounder $W_j$ that affects both $Y$ and $\mathbf{D}$ must be included to avoid omitted variable bias. But if $W_j$ has a weak effect on $Y$, the LASSO on $Y$ might miss it. The double-selection strategy solves this by running LASSO twice:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Step 1&lt;/strong> selects controls that predict $Y$: $\quad \hat{S}_Y = \{j : \hat{\beta}_j^{\text{LASSO}(Y)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 2&lt;/strong> selects controls that predict each $D_k$: $\quad \hat{S}_{D_k} = \{j : \hat{\gamma}_{j,k}^{\text{LASSO}(D_k)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 3&lt;/strong> takes the union: $\quad \hat{S} = \hat{S}_Y \cup \hat{S}_{D_1} \cup \hat{S}_{D_2} \cup \hat{S}_{D_3}$&lt;/li>
&lt;li>&lt;strong>Step 4&lt;/strong> runs OLS of $Y$ on $\mathbf{D}$ and $\mathbf{W}_{\hat{S}}$ with standard inference&lt;/li>
&lt;/ul>
&lt;p>The union in Step 3 ensures that a confounder missed by the $Y$-LASSO but caught by the $D$-LASSO is still included. This &amp;ldquo;safety net&amp;rdquo; property is what gives post-double-selection its valid inference guarantees &amp;mdash; the final OLS produces consistent estimates of $\boldsymbol{\alpha}$ even if each individual LASSO makes some selection mistakes.&lt;/p>
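&lt;p>The safety-net logic can be simulated in a few lines of Python (a toy illustration with made-up coefficients, not the tutorial's data). The confounder $W$ strongly predicts the treatment $D$ but only weakly predicts $Y$, which is exactly the case a single $Y$-only selection tends to miss. Dropping $W$ biases the estimate of $\alpha$; including it, as the union guarantees, recovers the truth:&lt;/p>

```python
import random

random.seed(7)
N = 5000
alpha = 1.0   # true effect of D on Y (the target parameter)
beta = 0.3    # weak confounder effect on Y, easy for a Y-only selection to miss
W = [random.gauss(0, 1) for _ in range(N)]
D = [w + random.gauss(0, 1) for w in W]   # W strongly predicts D
Y = [alpha * D[i] + beta * W[i] + random.gauss(0, 0.5) for i in range(N)]

# Short regression of Y on D alone (confounder omitted; all variables are mean-zero)
short = sum(d * yi for d, yi in zip(D, Y)) / sum(d * d for d in D)

# Long regression of Y on D and W via the 2x2 normal equations
SDD = sum(d * d for d in D)
SWW = sum(w * w for w in W)
SDW = sum(d * w for d, w in zip(D, W))
SDY = sum(d * yi for d, yi in zip(D, Y))
SWY = sum(w * yi for w, yi in zip(W, Y))
det = SDD * SWW - SDW ** 2
long_alpha = (SWW * SDY - SDW * SWY) / det

# Omitted-variable bias: short converges to alpha + beta * cov(W, D) / var(D) = 1.15
print(round(short, 2), round(long_alpha, 2))
```

&lt;p>The short regression is biased upward (toward 1.15), while the long regression recovers the true $\alpha = 1.0$ &amp;mdash; the same reason the union in Step 3 matters.&lt;/p>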
&lt;p>The &lt;code>dsregress&lt;/code> command uses a &amp;ldquo;plugin&amp;rdquo; method to choose $\lambda$ &amp;mdash; an analytical formula that sets the penalty based on the sample size and noise level, without requiring cross-validation. A key assumption underlying DSL is &lt;em>approximate sparsity&lt;/em>: only a small number of controls truly matter, so LASSO can safely set the rest to zero. When the true model is dense (many small effects rather than a few large ones), LASSO may struggle to select the right variables.&lt;/p>
&lt;p>Before implementing DSL, it helps to see the two methods side by side:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>BMA&lt;/th>
&lt;th>Post-Double-Selection&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Philosophy&lt;/td>
&lt;td>Bayesian (posteriors)&lt;/td>
&lt;td>Frequentist (p-values)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Strategy&lt;/td>
&lt;td>Average across models&lt;/td>
&lt;td>Select controls, then OLS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Output&lt;/td>
&lt;td>PIPs for every variable&lt;/td>
&lt;td>Set of selected controls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Speed&lt;/td>
&lt;td>Minutes (MCMC)&lt;/td>
&lt;td>Seconds (optimization)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Reference&lt;/td>
&lt;td>Raftery et al. (1997)&lt;/td>
&lt;td>Belloni, Chernozhukov, Hansen (2014)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="62-key-options">6.2 Key options&lt;/h3>
&lt;p>With the algorithm clear, let us examine the Stata implementation. The &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> command has a concise syntax, but each element plays a specific role. The full option list is in the &lt;a href="https://www.stata.com/manuals/lasso.pdf" target="_blank" rel="noopener">Stata LASSO manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Syntax structure:&lt;/strong> &lt;code>dsregress depvar varsofinterest, controls(controlvars) [options]&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>$outcome&lt;/code>&lt;/strong> (&lt;code>ln_co2&lt;/code>) &amp;mdash; the dependent variable. DSL will run LASSO on this variable against all controls (Step 1)&lt;/li>
&lt;li>&lt;strong>&lt;code>$gdp_vars&lt;/code>&lt;/strong> (&lt;code>ln_gdp ln_gdp_sq ln_gdp_cb&lt;/code>) &amp;mdash; the &lt;em>variables of interest&lt;/em>. These are never penalized by LASSO; they always appear in the final OLS. DSL runs a separate LASSO for each one against all controls (Steps 2a&amp;ndash;2c)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>controls(($fe) $controls)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; the candidate controls subject to LASSO selection. Parentheses around &lt;code>$fe&lt;/code> tell Stata to treat factor variables (country and year dummies) as always-included in the LASSO penalty but available for selection. The 12 candidate controls are subject to the standard LASSO penalty&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>vce(cluster country_id)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; compute cluster-robust standard errors at the country level in the final OLS (Step 4). This also affects the LASSO penalty through the &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(plugin)&lt;/code>&lt;/a> method, which adjusts $\lambda$ for cluster dependence&lt;/li>
&lt;li>&lt;strong>&lt;code>selection(plugin)&lt;/code>&lt;/strong> (default) &amp;mdash; choose $\lambda$ using a data-driven analytical formula rather than cross-validation. The alternative &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(cv)&lt;/code>&lt;/a> uses cross-validation but is slower&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassoinfo.pdf" target="_blank" rel="noopener">&lt;code>lassoinfo&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; reports the number of selected controls and the $\lambda$ value for each LASSO step&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassocoef.pdf" target="_blank" rel="noopener">&lt;code>lassocoef&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; displays which specific variables were selected or dropped by LASSO&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Related commands.&lt;/strong> Stata also offers &lt;a href="https://www.stata.com/manuals/lassoporegress.pdf" target="_blank" rel="noopener">&lt;code>poregress&lt;/code>&lt;/a> (partialing-out regression), which &lt;em>residualizes&lt;/em> both the outcome and the treatment against all controls instead of selecting then regressing. Both methods provide valid inference. &lt;a href="https://www.stata.com/manuals/lassoxporegress.pdf" target="_blank" rel="noopener">&lt;code>xporegress&lt;/code>&lt;/a> extends this to cross-fit partialing-out for even more robust inference. This tutorial uses &lt;code>dsregress&lt;/code> because its select-then-regress logic is more intuitive for beginners.&lt;/p>
&lt;/blockquote>
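&lt;p>The partialing-out logic behind &lt;code>poregress&lt;/code> rests on the Frisch&amp;ndash;Waugh&amp;ndash;Lovell theorem: residualize the outcome and the treatment on the controls, then regress residuals on residuals. A minimal pure-Python sketch with a single control and invented toy data (all names and numbers are hypothetical) shows the residual-on-residual slope matching the multiple-regression coefficient:&lt;/p>

```python
# Partialing-out (Frisch-Waugh-Lovell) in miniature: with one control z,
# the slope of (y residualized on z) on (d residualized on z) equals the
# multiple-regression coefficient on d. Toy deterministic data.

def slope(xs, ys):
    """Univariate OLS slope of ys on xs (intercept included)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def residualize(xs, zs):
    """Residuals from regressing xs on zs (intercept included)."""
    b = slope(zs, xs)
    a = sum(xs) / len(xs) - b * sum(zs) / len(zs)
    return [xi - (a + b * zi) for xi, zi in zip(xs, zs)]

z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # control
d = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]     # treatment, correlated with z
e = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]  # small disturbances
y = [2.0 * di + 1.0 * zi + ei for di, zi, ei in zip(d, z, e)]

def beta_d(ys, ds, zs):
    """Coefficient on d in the two-regressor OLS of y on (d, z)."""
    n = len(ys)
    md, mz, my = sum(ds) / n, sum(zs) / n, sum(ys) / n
    sdd = sum((a - md) ** 2 for a in ds)
    szz = sum((a - mz) ** 2 for a in zs)
    sdz = sum((a - md) * (b - mz) for a, b in zip(ds, zs))
    sdy = sum((a - md) * (b - my) for a, b in zip(ds, ys))
    szy = sum((a - mz) * (b - my) for a, b in zip(zs, ys))
    return (szz * sdy - sdz * szy) / (sdd * szz - sdz ** 2)

fwl = slope(residualize(d, z), residualize(y, z))
print(round(fwl, 6), round(beta_d(y, d, z), 6))  # identical, by FWL
```

&lt;p>The point of the residualization is that the coefficient of interest survives intact while the controls are swept out, which is why penalized first-stage regressions can be combined with an unpenalized final step.&lt;/p>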
&lt;h3 id="63-estimation">6.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress $outcome $gdp_vars, ///
controls(($fe) $controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 112
Number of selected controls = 102
Wald chi2(3) = 53.15
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.433319 1.628321 -4.57 0.000 -10.62477 -4.241868
ln_gdp_sq | .8401567 .1713522 4.90 0.000 .5043126 1.176001
ln_gdp_cb | -.0310764 .005952 -5.22 0.000 -.0427421 -.0194107
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Post-double-selection completed in seconds with cluster-robust standard errors at the country level. Internally, &lt;code>dsregress&lt;/code> ran four separate LASSO regressions (Step 1 on CO&lt;sub>2&lt;/sub>, Steps 2a&amp;ndash;2c on each GDP term), took the union of all selected controls, and then ran a final OLS of CO&lt;sub>2&lt;/sub> on the GDP terms plus that union. All three GDP terms are significant at the 0.1% level. The Wald test strongly rejects the null that GDP terms are jointly zero ($\chi^2 = 53.15$, p &amp;lt; 0.001).&lt;/p>
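&lt;p>Step 3 of the algorithm is just a set union. The sketch below, in Python with hypothetical selections (not the tutorial&amp;rsquo;s actual output), shows how the four selected sets combine into the final control set:&lt;/p>

```python
# Step 3 of post-double-selection: union the controls chosen by the four
# LASSO steps. The selections below are hypothetical, for illustration only.
selected = {
    "ln_co2":    {"fossil_fuel", "industry", "renewable"},  # Step 1 (outcome)
    "ln_gdp":    {"fossil_fuel", "urban"},                  # Step 2a
    "ln_gdp_sq": {"fossil_fuel", "urban"},                  # Step 2b
    "ln_gdp_cb": {"industry"},                              # Step 2c
}
union = set().union(*selected.values())
print(sorted(union))  # the control set used by the Step 4 OLS
```

&lt;p>Taking the union is what makes the procedure robust: a confounder missed by the outcome LASSO still enters the final OLS as long as any of the treatment LASSOs picks it up.&lt;/p>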
&lt;h3 id="64-turning-points">6.4 Turning points&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,429 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,672 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
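&lt;p>These turning points follow directly from the fitted cubic: set the derivative $\beta_1 + 2\beta_2 x + 3\beta_3 x^2$ to zero and exponentiate the two roots. A quick Python check using the &lt;code>dsregress&lt;/code> coefficients reported above reproduces the dollar values to within rounding:&lt;/p>

```python
import math

# Turning points of the fitted cubic in log GDP: solve the quadratic
# b1 + 2*b2*x + 3*b3*x^2 = 0 and exponentiate the roots.
# Coefficients are the dsregress (FE) estimates reported above.
b1, b2, b3 = -7.433319, 0.8401567, -0.0310764

A, B, C = 3 * b3, 2 * b2, b1
disc = B * B - 4 * A * C
roots = sorted([(-B + math.sqrt(disc)) / (2 * A), (-B - math.sqrt(disc)) / (2 * A)])

min_tp, max_tp = (math.exp(r) for r in roots)
print(round(min_tp), round(max_tp))  # approximately $2,429 and $27,672
```

&lt;p>The smaller root is the minimum and the larger the maximum because the cubic coefficient is negative (the inverted-N shape).&lt;/p>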
&lt;p>The post-double-selection turning points (\$2,429 and \$27,672) fall between the sparse FE and kitchen-sink estimates, closer to the BMA values. With cluster-robust standard errors, the LASSO selection retained 102 of 112 controls for the outcome equation and 100 for each GDP term. The union of selected controls in Step 3 includes a few more candidate variables than without clustering, producing coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that lie between the sparse and kitchen-sink specifications.&lt;/p>
&lt;h3 id="65-lasso-selection">6.5 LASSO selection&lt;/h3>
&lt;p>To understand which controls LASSO kept and which it dropped, we inspect the selection details:&lt;/p>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 102
ln_gdp | linear plugin .3818852 100
ln_gdp_sq | linear plugin .3818852 100
ln_gdp_cb | linear plugin .3818852 100
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>lassoinfo&lt;/code> output summarizes each of the four LASSO steps. The outcome equation reports 102 of 112 controls selected, while each GDP equation reports 100. The 112 controls comprise 80 country dummies + 19 year dummies = 99 FE dummies, plus the 12 candidate variables and the constant. Because the FE dummies are specified as always-included, they appear in every selected set by construction; the penalized selection operates only on the 12 candidate controls, of which LASSO dropped 10&amp;ndash;12 at each step. The union across all four steps (Step 3) yields the final control set for Step 4&amp;rsquo;s OLS. With cluster-robust standard errors, the lambda is larger (0.382 vs 0.090 without clustering), leading to slightly different selection and producing DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that fall between the sparse and kitchen-sink FE.&lt;/p>
&lt;p>Why does DSL not match BMA&amp;rsquo;s accuracy here? In panel data settings where FE dummies dominate the control set (99 of 112 variables), the always-included dummies leave LASSO only the 12 candidate controls to discriminate among, and the final OLS must carry all 99 FE dummies regardless of what it chooses. This &amp;ldquo;almost everything selected&amp;rdquo; outcome means DSL&amp;rsquo;s final OLS is close to the kitchen-sink specification, which explains why its coefficients fall between the sparse and kitchen-sink FE rather than converging to the true DGP. To see LASSO&amp;rsquo;s selection power unleashed, we next run DSL &lt;em>without&lt;/em> fixed effects.&lt;/p>
&lt;h3 id="66-pooled-dsl-without-fixed-effects">6.6 Pooled DSL (without fixed effects)&lt;/h3>
&lt;p>What happens when LASSO has only 12 candidate controls instead of 112? To answer this, we run DSL on the pooled data &amp;mdash; treating the panel as a cross-sectional dataset without country or year fixed effects. This gives LASSO full room to discriminate among the candidate controls, but at the cost of omitting the unobserved country heterogeneity that fixed effects would absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">* DSL without FE -- pooled cross-section with cluster-robust SEs
dsregress $outcome $gdp_vars, ///
controls($controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 12
Number of selected controls = 7
Wald chi2(3) = 25.05
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -22.03297 5.277295 -4.18 0.000 -32.37628 -11.68966
ln_gdp_sq | 2.366878 .5652276 4.19 0.000 1.259052 3.474703
ln_gdp_cb | -.084224 .0199055 -4.23 0.000 -.1232381 -.04521
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The pooled DSL still finds the correct inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), but the magnitudes are dramatically different from the true DGP. The linear coefficient (&amp;ndash;22.03) is more than &lt;em>three times&lt;/em> the true value (&amp;ndash;7.10), and the other terms are similarly inflated. This is &lt;strong>omitted variable bias&lt;/strong>: without country fixed effects, the GDP terms absorb not only their own effect on CO&lt;sub>2&lt;/sub> but also the persistent cross-country differences in emissions levels that fixed effects would have captured.&lt;/p>
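&lt;p>The mechanics of this bias are easy to reproduce. The toy Python simulation below (its DGP is invented for illustration and is much simpler than the tutorial&amp;rsquo;s) gives each country a fixed effect correlated with its income level; pooled OLS then inflates a true slope of 1.0, while the within (demeaned) estimator recovers it:&lt;/p>

```python
import random
from collections import defaultdict

# Omitted-variable bias in miniature: each "country" has a fixed effect
# correlated with its income level. Pooled OLS folds that heterogeneity
# into the slope; demeaning within countries (the FE estimator) removes it.
# Invented toy DGP; the true slope is 1.0.
random.seed(42)
N, T = 80, 20
x, y, country = [], [], []
for i in range(N):
    mu = random.gauss(0, 1)    # country's average income level
    alpha = 2.0 * mu           # fixed effect correlated with income
    for t in range(T):
        xi = mu + random.gauss(0, 1)
        x.append(xi)
        y.append(1.0 * xi + alpha + random.gauss(0, 0.5))
        country.append(i)

def ols_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Within transformation: demean x and y by country
stats = defaultdict(lambda: [0.0, 0.0, 0])
for xi, yi, c in zip(x, y, country):
    s = stats[c]
    s[0] += xi; s[1] += yi; s[2] += 1
xw = [xi - stats[c][0] / stats[c][2] for xi, c in zip(x, country)]
yw = [yi - stats[c][1] / stats[c][2] for yi, c in zip(y, country)]

pooled, within = ols_slope(x, y), ols_slope(xw, yw)
print(round(pooled, 2), round(within, 2))  # pooled is biased upward; within is near 1.0
```

&lt;p>In this setup the pooled slope is roughly double the truth in expectation, because the country effect loads onto the regressor; demeaning wipes out anything constant within a country, so the within estimator is unbiased.&lt;/p>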
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 5
ln_gdp | linear plugin .3818852 7
ln_gdp_sq | linear plugin .3818852 7
ln_gdp_cb | linear plugin .3818852 7
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Now the contrast with the FE-based DSL is stark. The outcome LASSO selected only &lt;strong>5 of 12&lt;/strong> controls (vs 102 of 112 with FE), and the GDP LASSOs selected &lt;strong>7 of 12&lt;/strong> (vs 100 of 112). With no block of always-included FE dummies padding the selected set, LASSO can genuinely discriminate &amp;mdash; it zeroed out 5&amp;ndash;7 of the 12 candidates as irrelevant. The turning points are \$5,581 (minimum) and \$24,532 (maximum), far from the true values of \$1,895 and \$34,647.&lt;/p>
&lt;p>This comparison illustrates a fundamental tradeoff in panel data econometrics: &lt;strong>fixed effects remove bias but limit LASSO&amp;rsquo;s selection power&lt;/strong>. With FE, the estimates are unbiased but LASSO selects almost everything. Without FE, LASSO selects sharply but the estimates are biased by unobserved heterogeneity. The FE-based DSL from Section 6.3 is the correct specification for this data, even though LASSO&amp;rsquo;s selection looks less impressive.&lt;/p>
&lt;h2 id="7-head-to-head-comparison">7. Head-to-Head Comparison&lt;/h2>
&lt;h3 id="71-coefficient-comparison">7.1 Coefficient comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>BMA (FE)&lt;/th>
&lt;th>DSL (FE)&lt;/th>
&lt;th>BMA (pooled)&lt;/th>
&lt;th>DSL (pooled)&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;21.258&lt;/td>
&lt;td>&amp;ndash;22.033&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>2.285&lt;/td>
&lt;td>2.367&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.081&lt;/td>
&lt;td>&amp;ndash;0.084&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$5,752&lt;/td>
&lt;td>\$5,581&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$23,298&lt;/td>
&lt;td>\$24,532&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The table reveals a sharp divide between FE-based and pooled specifications. The four FE-based methods (columns 2&amp;ndash;5) all produce GDP coefficients within a narrow range of the true values &amp;mdash; BMA (FE) and Kitchen-Sink FE are closest, with estimates within 1% of the truth. The two pooled methods (columns 6&amp;ndash;7) are dramatically biased, with coefficients inflated 2&amp;ndash;3x. Strikingly, BMA (pooled) and DSL (pooled) agree closely with &lt;em>each other&lt;/em> (&amp;ndash;21.26 vs &amp;ndash;22.03 for $\beta_1$), confirming that the bias comes from omitting fixed effects, not from the choice of variable selection method. Both pooled methods produce turning points displaced from the truth (\$5,600&amp;ndash;5,800 vs true \$1,895 for the minimum).&lt;/p>
&lt;h3 id="72-uncertainty-confidence-and-credible-intervals">7.2 Uncertainty: confidence and credible intervals&lt;/h3>
&lt;p>Point estimates tell only half the story. How &lt;em>uncertain&lt;/em> is each method, and does the interval actually contain the truth? The table below shows 95% confidence intervals (for the frequentist methods) and approximate 95% credible intervals (for BMA, computed as posterior mean $\pm$ 2 posterior SD). The last column checks whether the true DGP value falls inside the interval.&lt;/p>
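&lt;p>As a concrete example of the $\pm$ 2 SD construction, the short Python sketch below rebuilds the BMA (FE) $\beta_1$ credible interval; the posterior SD used here (1.811) is backed out from the interval width in the table, so treat it as illustrative:&lt;/p>

```python
# Approximate 95% credible interval as posterior mean +/- 2 posterior SD,
# using the BMA (FE) beta_1 posterior. The SD (1.811) is backed out from
# the interval width in the table; illustrative, not BMA output.
post_mean, post_sd = -7.139, 1.811
true_beta1 = -7.100

lo, hi = post_mean - 2 * post_sd, post_mean + 2 * post_sd
covers = lo <= true_beta1 <= hi
print(round(lo, 3), round(hi, 3), covers)
```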
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>$\beta_1$ (GDP) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_2$ (GDP²) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_3$ (GDP³) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Sparse FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.731, &amp;ndash;4.266]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.510, 1.188]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.020]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Kitchen-Sink FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.241, &amp;ndash;4.021]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.478, 1.134]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.041, &amp;ndash;0.018]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (FE)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;10.761, &amp;ndash;3.517]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.429, 1.186]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.017]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (FE)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.625, &amp;ndash;4.242]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.504, 1.176]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.019]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (pooled)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;24.541, &amp;ndash;17.975]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.935, 2.635]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.094, &amp;ndash;0.069]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (pooled)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;32.376, &amp;ndash;11.690]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.259, 3.475]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.123, &amp;ndash;0.045]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True DGP&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;td>&lt;/td>
&lt;td>0.810&lt;/td>
&lt;td>&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The four FE-based methods all produce intervals that contain the true parameter values &amp;mdash; a reassuring result. Both pooled methods, however, &lt;strong>fail to cover the truth for any of the three coefficients&lt;/strong>. The pooled DSL intervals are wide (the $\beta_1$ interval spans 20.7 units) but centered so far from the truth that even this width cannot compensate. The pooled BMA credible intervals are actually &lt;em>narrower&lt;/em> (spanning 6.6 units for $\beta_1$) but even more precisely wrong &amp;mdash; they are tightly concentrated around the biased estimate. This is the worst-case scenario: &lt;strong>false precision from a misspecified model&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Width reflects uncertainty.&lt;/strong> Among the FE-based methods, BMA produces the widest interval for $\beta_1$ (width = 7.24), followed by Sparse FE (6.47), DSL with FE (6.38), and Kitchen-Sink FE (6.22). BMA&amp;rsquo;s wider intervals reflect its honest accounting of model uncertainty &amp;mdash; it averages across thousands of models, each contributing slightly different coefficient estimates, which inflates the posterior standard deviation. The frequentist methods condition on a single model and therefore understate the total uncertainty.&lt;/p>
&lt;p>&lt;strong>Centering reflects bias.&lt;/strong> Kitchen-Sink FE and BMA center their intervals closest to the true value (&amp;ndash;7.131 and &amp;ndash;7.139 vs. true &amp;ndash;7.100), while Sparse FE (&amp;ndash;7.498) and DSL with FE (&amp;ndash;7.433) are slightly further away. The pooled DSL (&amp;ndash;22.033) is dramatically off-center, illustrating that omitted variable bias overwhelms any precision gained from better variable selection.&lt;/p>
&lt;p>&lt;strong>Coverage requires correct specification.&lt;/strong> The pooled DSL result drives home a critical lesson: a confidence interval is only as good as the model behind it. The 95% label promises that, in repeated sampling, 95% of intervals would contain the truth &amp;mdash; but this guarantee holds only if the model is correctly specified. When country fixed effects are omitted, the model is misspecified, and the intervals fail despite being statistically &amp;ldquo;valid&amp;rdquo; within the pooled framework.&lt;/p>
&lt;p>&lt;strong>Bayesian vs frequentist interpretation.&lt;/strong> BMA&amp;rsquo;s credible intervals have a different interpretation: a 95% BMA credible interval says &amp;ldquo;given the data and priors, there is a 95% posterior probability the true coefficient lies in this range,&amp;rdquo; while a 95% confidence interval says &amp;ldquo;if we repeated this procedure many times, 95% of the intervals would contain the truth.&amp;rdquo; In practice, both require correct model specification to be reliable.&lt;/p>
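&lt;p>The width and coverage claims above can be checked mechanically from the $\beta_1$ intervals in the table:&lt;/p>

```python
# Beta_1 intervals for the FE-based methods, copied from the table above.
intervals = {
    "Sparse FE":       (-10.731, -4.266),
    "Kitchen-Sink FE": (-10.241, -4.021),
    "BMA (FE)":        (-10.761, -3.517),
    "DSL (FE)":        (-10.625, -4.242),
}
true_b1 = -7.100
widths = {m: round(hi - lo, 3) for m, (lo, hi) in intervals.items()}
coverage = {m: lo <= true_b1 <= hi for m, (lo, hi) in intervals.items()}
print(widths)                  # BMA widest, Kitchen-Sink narrowest
print(all(coverage.values()))  # every FE-based interval covers the truth
```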
&lt;h3 id="73-predicted-ekc-curves">7.3 Predicted EKC curves&lt;/h3>
&lt;p>The curves are normalized to zero at the sample-mean GDP so both methods are directly comparable:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Generate predicted EKC curves for BMA and DSL, normalized at mean GDP
summarize ln_gdp
local xmin = r(min)
local xmax = r(max)
local xmean = r(mean)
clear
set obs 500
gen lngdp = `xmin' + (_n - 1) * (`xmax' - `xmin') / 499
* Cubic component for each method (using stored coefficients)
gen fit_bma = `b1_bma' * lngdp + `b2_bma' * lngdp^2 + `b3_bma' * lngdp^3
gen fit_dsl = `b1_dsl' * lngdp + `b2_dsl' * lngdp^2 + `b3_dsl' * lngdp^3
* Normalize: subtract value at sample-mean GDP
local norm_bma = `b1_bma' * `xmean' + `b2_bma' * `xmean'^2 + `b3_bma' * `xmean'^3
local norm_dsl = `b1_dsl' * `xmean' + `b2_dsl' * `xmean'^2 + `b3_dsl' * `xmean'^3
replace fit_bma = fit_bma - `norm_bma'
replace fit_dsl = fit_dsl - `norm_dsl'
twoway ///
(line fit_bma lngdp, lcolor(&amp;quot;106 155 204&amp;quot;) lwidth(medthick)) ///
(line fit_dsl lngdp, lcolor(&amp;quot;217 119 87&amp;quot;) lwidth(medthick) lpattern(dash)), ///
xline(`lnmin_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
xline(`lnmax_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
ytitle(&amp;quot;Predicted log CO2 (normalized at mean GDP)&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Predicted EKC Shape: BMA vs. DSL&amp;quot;) ///
legend(order(1 &amp;quot;BMA&amp;quot; 2 &amp;quot;DSL&amp;quot;) rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig5_ekc_curves.png" alt="Predicted EKC curves from BMA and DSL, normalized at the sample mean. Both methods trace a clear inverted-N shape with closely aligned turning points.">&lt;/p>
&lt;p>Both curves trace a clear inverted-N: CO&lt;sub>2&lt;/sub> falls at low incomes, rises through industrialization, and falls again at high incomes. The BMA curve (solid blue) and DSL curve (dashed orange) are nearly indistinguishable, with turning points closely aligned. The normalization at mean GDP makes the shape immediately visible &amp;mdash; a major improvement over plotting raw cubic components that would sit at different y-levels.&lt;/p>
&lt;h3 id="74-answer-key-grading-the-methods">7.4 Answer key: grading the methods&lt;/h3>
&lt;p>The ultimate test: do BMA and DSL correctly identify the 5 true predictors among the 12 candidate controls and reject the 7 noise variables?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Dot plot: BMA PIPs color-coded by ground truth
* (extract PIPs, label variables, mark true vs noise --- see analysis.do)
graph twoway ///
(scatter order pip if is_true == 1, ///
mcolor(&amp;quot;106 155 204&amp;quot;) msymbol(circle) msize(large)) ///
(scatter order pip if is_true == 0, ///
mcolor(gs9) msymbol(diamond) msize(large)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;BMA Posterior Inclusion Probability&amp;quot;) ///
title(&amp;quot;Answer Key: Do BMA and DSL Recover the Truth?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor&amp;quot; 2 &amp;quot;Noise variable&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig6_answer_key.png" alt="Dot plot showing BMA Posterior Inclusion Probabilities for each variable, color-coded by ground truth. True predictors (circles, blue) cluster above the 0.80 threshold; noise variables (diamonds, gray) cluster below it.">&lt;/p>
&lt;p>&lt;strong>BMA&amp;rsquo;s report card:&lt;/strong> Of the 8 true predictors (3 GDP terms + 5 controls), BMA correctly assigns PIP &amp;gt; 0.80 to 6 &amp;mdash; the three GDP terms, fossil fuel, industry, and renewable energy. It misses urban (PIP ~ 0.27) and democracy (PIP ~ 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). All 7 noise variables receive PIPs well below 0.80. BMA makes &lt;strong>zero false positives&lt;/strong> (no noise variable incorrectly flagged as robust) and &lt;strong>two false negatives&lt;/strong> (two weak true predictors missed).&lt;/p>
&lt;p>&lt;strong>Post-double-selection&amp;rsquo;s report card:&lt;/strong> With cluster-robust SEs, the union of all four LASSO steps selected 102 of 112 total controls (including FE dummies). The resulting DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) fall between the sparse and kitchen-sink FE, closer to the true DGP than the sparse specification. The entire procedure runs in seconds rather than minutes.&lt;/p>
&lt;p>&lt;strong>Bottom line:&lt;/strong> Both methods recover the inverted-N EKC shape. BMA provides more granular variable-level inference (PIPs), while DSL provides fast, valid coefficient estimates. The synthetic data &amp;ldquo;answer key&amp;rdquo; confirms that both are doing their job &amp;mdash; with the expected limitation that weak signals are hard to detect.&lt;/p>
&lt;h2 id="8-discussion">8. Discussion&lt;/h2>
&lt;h3 id="81-what-the-results-mean-for-the-ekc">8.1 What the results mean for the EKC&lt;/h3>
&lt;p>Both BMA and DSL identify the &lt;strong>inverted-N&lt;/strong> EKC shape with turning points close to the true DGP values. BMA correctly identifies 6 of 8 true predictors (3 GDP terms + fossil fuel, industry, renewable) with zero false positives among noise variables. The inverted-N shape implies three phases of the income&amp;ndash;pollution relationship:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (below ~\$2,400): Very poor countries where CO&lt;sub>2&lt;/sub> may fall as subsistence agriculture shifts toward slightly cleaner energy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rising phase&lt;/strong> (~\$2,400 to ~\$27,000): Industrializing countries where emissions rise sharply. Most of the world&amp;rsquo;s population lives here.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (above ~\$27,000): Wealthy countries where clean technology and regulation reduce emissions.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The policy implication is important: the inverted-N suggests that the &amp;ldquo;environmental improvement&amp;rdquo; phase is not automatic. Unlike the simpler inverted-U hypothesis, which predicts a single turning point after which pollution monotonically declines, the inverted-N warns that countries at very low income levels may &lt;em>already&lt;/em> be on a declining emissions path that reverses once industrialization begins. This makes the middle-income range &amp;mdash; where emissions rise steeply &amp;mdash; the critical window for environmental policy intervention.&lt;/p>
&lt;p>The three robust control variables identified by BMA reinforce this narrative:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Fossil fuel dependence&lt;/strong> (PIP = 1.000) is the single strongest predictor of CO&lt;sub>2&lt;/sub> emissions, with a coefficient close to the true DGP value.&lt;/li>
&lt;li>&lt;strong>Renewable energy share&lt;/strong> (PIP = 0.959) enters with a negative sign, confirming that energy mix transitions reduce emissions.&lt;/li>
&lt;li>&lt;strong>Industry value-added&lt;/strong> (PIP = 0.999) captures the composition effect &amp;mdash; economies dominated by manufacturing produce more CO&lt;sub>2&lt;/sub> per unit of GDP than service-based economies.&lt;/li>
&lt;/ul>
&lt;h3 id="82-when-to-use-bma-vs-post-double-selection">8.2 When to use BMA vs post-double-selection&lt;/h3>
&lt;p>The two methods answer fundamentally different research questions:&lt;/p>
&lt;p>&lt;strong>Use BMA&lt;/strong> when the question is &lt;em>&amp;ldquo;which variables robustly predict the outcome?&amp;rdquo;&lt;/em> BMA provides PIPs, coefficient densities, and a complete picture of the model space. It excels in exploratory settings where variable importance is the goal. In our simulation, BMA produced the most accurate coefficient estimates (&amp;ndash;7.139 vs true &amp;ndash;7.100) and provided rich diagnostics (PIP chart, density plots) that make the evidence for each variable transparent. The cost is computational: BMA requires MCMC sampling (minutes to hours depending on the model space).&lt;/p>
&lt;p>&lt;strong>Use post-double-selection&lt;/strong> when the question is &lt;em>&amp;ldquo;what is the causal effect of a specific variable of interest, controlling for high-dimensional confounders?&amp;rdquo;&lt;/em> DSL provides fast, valid inference on the coefficients of interest with standard errors and confidence intervals. It is designed for settings where you have a clear treatment variable and many potential controls. In our simulation, DSL completed in seconds and produced valid standard errors, but its coefficient estimates (&amp;ndash;7.433) were less accurate than BMA&amp;rsquo;s because LASSO had limited room to discriminate among controls in the FE-heavy panel setting.&lt;/p>
&lt;p>&lt;strong>Use both together&lt;/strong> (as in this tutorial) when you want the strongest possible evidence. If a Bayesian and a frequentist method agree on the sign, magnitude, and significance of an effect, the finding is unlikely to be an artifact of any single modeling choice. Disagreements between the methods are also informative &amp;mdash; they signal areas where the evidence is sensitive to assumptions.&lt;/p>
&lt;h3 id="83-pooled-vs-fixed-effects-a-cautionary-comparison">8.3 Pooled vs fixed effects: a cautionary comparison&lt;/h3>
&lt;p>The pooled specifications (Sections 5.7 and 6.6) provide a powerful pedagogical contrast. When we strip away fixed effects and run both BMA and DSL on pooled data, three things happen simultaneously:&lt;/p>
&lt;p>&lt;strong>LASSO selection improves but estimates worsen.&lt;/strong> Without the 99 always-included FE dummies dominating the control set, pooled DSL selected only 5&amp;ndash;7 of the 12 candidate controls (vs 102 of 112 reported with FE). This is closer to the &amp;ldquo;textbook&amp;rdquo; LASSO scenario, where the method has genuine discriminating power. Yet the resulting coefficient estimates are 2&amp;ndash;3x the true values because omitted country heterogeneity biases everything.&lt;/p>
&lt;p>&lt;strong>BMA PIPs become unreliable.&lt;/strong> With fixed effects, BMA assigned PIPs near zero to all 7 noise variables &amp;mdash; zero false positives. Without FE, five variables jumped above the 0.80 threshold: the noise variables services, pop_density, credit, and trade, plus democracy, whose tiny true coefficient cannot justify so high a PIP. These variables are correlated with the omitted country effects, and BMA interprets the spurious correlations as genuine predictive power. This demonstrates that &lt;strong>PIP thresholds are only meaningful when the model set is correctly specified&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Both methods agree on the bias.&lt;/strong> Pooled BMA and pooled DSL produce remarkably similar biased coefficients ($\beta_1 = -21.26$ vs $-22.03$), confirming that the problem is not the variable selection method but the omitted fixed effects. The agreement between a Bayesian and a frequentist method on the &lt;em>wrong&lt;/em> answer reinforces the lesson: &lt;strong>method agreement is not a substitute for correct model specification&lt;/strong>.&lt;/p>
&lt;p>The practical takeaway for applied researchers: in panel data settings, always include entity fixed effects (or equivalent controls for unobserved heterogeneity) before applying BMA or DSL. Running these methods on pooled data without FE will produce misleading results &amp;mdash; not because the methods fail, but because the models they average over or select from are all misspecified.&lt;/p>
&lt;h3 id="84-limitations-and-caveats">8.4 Limitations and caveats&lt;/h3>
&lt;p>&lt;strong>Synthetic vs real data.&lt;/strong> This is synthetic data &amp;mdash; the patterns are sharper than real-world data, and we can verify ground truth only because we designed the DGP. With real data, model uncertainty is genuinely unresolvable, and there is no answer key to check against. The separation between true predictors and noise variables is cleaner here than in most applications.&lt;/p>
&lt;p>&lt;strong>Weak signals are hard to detect.&lt;/strong> Both methods missed urban population (PIP = 0.27) and democracy (PIP = 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). This is not a failure of the methods &amp;mdash; it is a fundamental statistical limitation. Detecting a coefficient of 0.005 in the presence of panel-level noise requires either a much larger sample or a stronger signal.&lt;/p>
&lt;p>&lt;strong>Panel FE and LASSO.&lt;/strong> In our panel setting, 99 of 112 candidate controls are FE dummies that LASSO retains almost entirely. This limits DSL&amp;rsquo;s ability to discriminate among the 12 candidate controls. In cross-sectional settings or settings with many genuinely irrelevant variables, DSL would have more room to operate and potentially match BMA&amp;rsquo;s accuracy.&lt;/p>
&lt;p>&lt;strong>Extensions.&lt;/strong> Researchers working with real EKC data should also consider endogeneity (via 2SLS-BMA, as in Gravina and Lanzafame, 2025), alternative pollutants (SO&lt;sub>2&lt;/sub>, PM2.5), spatial dependence across countries, and structural breaks in the income&amp;ndash;pollution relationship.&lt;/p>
&lt;h2 id="9-summary-and-next-steps">9. Summary and Next Steps&lt;/h2>
&lt;h3 id="takeaways">Takeaways&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both methods confirm the inverted-N shape.&lt;/strong> BMA (Bayesian, averaging across models) and post-double-selection (frequentist, LASSO-based) both recover the inverted-N EKC. BMA produces coefficients closest to the true DGP (&amp;ndash;7.139 vs &amp;ndash;7.100 for $\beta_1$). DSL with cluster-robust SEs gives &amp;ndash;7.433, falling between the sparse and kitchen-sink FE estimates. Both methods outperform the naive sparse specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both methods recover the ground truth.&lt;/strong> BMA correctly identifies 6 of 8 true predictors with zero false positives. The three strongest true controls (fossil fuel, industry, renewable energy) all receive PIPs above 0.95. The two misses (urban, democracy) have small true coefficients, illustrating that even good methods have limits with weak signals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model uncertainty is real.&lt;/strong> The GDP linear coefficient shifts from &amp;ndash;7.498 (sparse) to &amp;ndash;7.131 (kitchen-sink) depending on which controls are included. The maximum turning point moves by \$2,000. BMA and DSL provide principled solutions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>BMA and post-double-selection serve different purposes.&lt;/strong> BMA excels at variable selection (PIPs, coefficient densities) and produced the most accurate coefficient estimates in this setting. Post-double-selection is fastest and provides standard frequentist inference with cluster-robust SEs. In panel settings dominated by FE dummies, LASSO has limited room to discriminate among candidate controls; DSL would be more powerful in cross-sectional settings with many irrelevant variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fixed effects are essential, not optional.&lt;/strong> Running either method on pooled data without FE produces coefficients inflated 2&amp;ndash;3x (BMA pooled: &amp;ndash;21.26, DSL pooled: &amp;ndash;22.03, vs true &amp;ndash;7.10 for $\beta_1$). Worse, pooled BMA assigns high PIPs to 5 noise variables that the FE-based BMA correctly rejects. Confidence and credible intervals from pooled models fail to cover the true values for all three coefficients. The lesson: always include fixed effects in panel data before applying variable selection methods.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="exercises">Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity to the g-prior.&lt;/strong> Re-run &lt;code>bmaregress&lt;/code> with &lt;code>gprior(bric)&lt;/code> instead of &lt;code>gprior(uip)&lt;/code>. The BIC prior penalizes model complexity more heavily. Do the PIPs change? Does it still identify fossil fuel, industry, and renewable as robust? (&lt;em>Hint:&lt;/em> BIC priors tend to be more conservative, so borderline variables may drop below the threshold.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test for inverted-U.&lt;/strong> Drop &lt;code>ln_gdp_cb&lt;/code> and re-run with only linear and squared GDP terms. What do BMA and DSL say about the simpler quadratic specification? (&lt;em>Hint:&lt;/em> since the DGP includes a cubic term, the quadratic model is misspecified &amp;mdash; check whether the coefficients absorb the cubic effect or produce a visibly different EKC shape.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Increase noise.&lt;/strong> Re-generate the synthetic data with &lt;code>sigma_eps = 0.30&lt;/code> (double the noise) in &lt;code>generate_data.do&lt;/code> and re-run the full analysis. How does this affect BMA&amp;rsquo;s ability to distinguish true predictors from noise? (&lt;em>Hint:&lt;/em> expect more variables with PIPs in the ambiguous 0.3&amp;ndash;0.7 range, and possibly some noise variables crossing the 0.80 threshold &amp;mdash; false positives become more likely with noisier data.)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="appendix-a-first-differences-analysis">Appendix A: First-Differences Analysis&lt;/h2>
&lt;h3 id="a1-motivation">A.1 Motivation&lt;/h3>
&lt;p>The fixed effects estimator removes time-invariant country heterogeneity by demeaning each variable within country. An alternative approach is &lt;strong>first differencing&lt;/strong>: computing the change between the last and first year for each country ($\Delta x_i = x_{i,2014} - x_{i,1995}$). This also removes time-invariant effects and produces a pure &lt;strong>cross-sectional&lt;/strong> dataset of 80 observations &amp;mdash; one per country. The cross-sectional setting is where LASSO-based methods are most powerful, because there are no FE dummies diluting the candidate set.&lt;/p>
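&lt;p>In symbols: if the panel model is $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$, differencing the two endpoint years gives $\Delta y_i = \Delta x_i'\beta + \Delta \varepsilon_i$: the country-specific effect $\alpha_i$ cancels in the difference, just as demeaning removes it in the within estimator.&lt;/p>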
&lt;p>The tradeoff is statistical power: first differencing uses only two data points per country (discarding 18 intermediate years), while the within-estimator uses all 20. We expect noisier estimates but cleaner variable selection.&lt;/p>
&lt;h3 id="a2-constructing-the-first-difference-dataset">A.2 Constructing the first-difference dataset&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Keep only first (1995) and last (2014) years, reshape, compute differences
keep if year == 1995 | year == 2014
reshape wide $outcome $gdp_vars $controls, i(country_id) j(year)
foreach v in $outcome $gdp_vars $controls {
    gen d_`v' = `v'2014 - `v'1995
}
&lt;/code>&lt;/pre>
&lt;p>This produces 80 observations, each representing how much a country&amp;rsquo;s variables changed over the 20-year period. For example, &lt;code>d_ln_gdp&lt;/code> measures the log growth in GDP per capita from 1995 to 2014.&lt;/p>
&lt;h3 id="a3-baseline-ols-on-first-differences">A.3 Baseline OLS on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sparse: GDP terms only
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, robust
* Kitchen-sink: all 12 controls
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
    d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
    d_services d_trade d_fdi d_credit d_pop_density ///
    d_corruption d_globalization, robust
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Sparse OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0009
R-squared = 0.1433
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -10.36189 4.092422 -2.53 0.013 -18.51265 -2.211121
d_ln_gdp_sq | 1.155962 .4223643 2.74 0.008 .3147506 1.997173
d_ln_gdp_cb | -.0414947 .0143721 -2.89 0.005 -.0701192 -.0128702
_cons | -.3036562 .0724366 -4.19 0.000 -.4479262 -.1593861
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Kitchen-sink OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0029
R-squared = 0.3707
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -8.109709 5.031758 -1.61 0.112 -18.1618 1.942382
d_ln_gdp_sq | .9238864 .5213262 1.77 0.081 -.1175823 1.965355
d_ln_gdp_cb | -.0336221 .0179583 -1.87 0.066 -.0694979 .0022536
d_fossil_f~l | .0147108 .0067313 2.19 0.033 .0012635 .0281582
d_renewable | -.0237808 .0110384 -2.15 0.035 -.0458327 -.001729
d_urban | .0002501 .014913 0.02 0.987 -.0295421 .0300424
d_industry | .0309085 .0105974 2.92 0.005 .0097377 .0520793
d_democracy | .019337 .0290345 0.67 0.508 -.038666 .07734
d_services | -.0047239 .0098816 -0.48 0.634 -.0244647 .0150169
d_trade | .006726 .0044062 1.53 0.132 -.0020764 .0155284
d_fdi | .0000124 .0091898 0.00 0.999 -.0183463 .0183712
d_credit | .0028644 .0043456 0.66 0.512 -.0058169 .0115457
d_pop_dens~y | .0006396 .0004991 1.28 0.205 -.0003575 .0016366
d_corruption | -.0036115 .0033497 -1.08 0.285 -.0103033 .0030803
d_globaliz~n | -.0004567 .0082494 -0.06 0.956 -.0169368 .0160235
_cons | -.0085823 .1746184 -0.05 0.961 -.3574226 .340258
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FD sparse OLS finds the inverted-N sign pattern with all three terms significant at the 5% level &amp;mdash; but the coefficients are noisier than the FE estimates (e.g., $\beta_1 = -10.36$ vs &amp;ndash;7.50 for sparse FE). The R² of 0.14 is low, reflecting the loss of within-country time-series variation when collapsing 20 years into a single difference.&lt;/p>
&lt;p>Adding the full set of controls in the kitchen-sink specification raises R² to 0.37 but makes the GDP terms individually insignificant (p = 0.07&amp;ndash;0.11) &amp;mdash; a consequence of having only 80 observations and 15 regressors. Among the controls, fossil fuel (p = 0.033), renewable energy (p = 0.035), and industry (p = 0.005) are significant &amp;mdash; the same three strong predictors identified by BMA with fixed effects.&lt;/p>
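&lt;p>The dollar-denominated turning points follow directly from the cubic coefficients: set the derivative $\beta_1 + 2\beta_2 x + 3\beta_3 x^2 = 0$ (where $x = \ln GDP$) and exponentiate the two roots. A quick cross-check of the FD sparse estimates above (plain Python, purely illustrative, outside the Stata workflow):&lt;/p>

```python
import math

# FD sparse OLS coefficients from the regression output above
b1, b2, b3 = -10.36189, 1.155962, -0.0414947

# Solve d(ln CO2)/dx = b1 + 2*b2*x + 3*b3*x**2 = 0 via the quadratic formula
a, b, c = 3 * b3, 2 * b2, b1
disc = math.sqrt(b * b - 4 * a * c)
x_lo, x_hi = sorted([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])

# Exponentiate to get GDP per capita at the minimum and maximum of the curve
tp_min, tp_max = math.exp(x_lo), math.exp(x_hi)
print(round(tp_min), round(tp_max))  # approx. 1913 and 60817
```

These match the Min TP and Max TP values reported for FD Sparse in the comparison table of Section A.6.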
&lt;h3 id="a4-bma-on-first-differences">A.4 BMA on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 80
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 2,317
For CPMP &amp;gt;= .9 = 581
Priors: Mean model size = 3.304
Models: Uniform Burn-in = 5,000
Cons.: Noninformative MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.3080
g: Unit-information, g = 80 Shrinkage, g/(1+g) = 0.9877
sigma2: Noninformative Mean sigma2 = 0.051
Sampling correlation = 0.9958
------------------------------------------------------------------------------
d_ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
d_industry | .0364834 .0090778 7 .99823
------------------------------------------------------------------------------
Note: 14 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The FD-BMA result is dramatically different from the FE-based BMA. Only &lt;strong>one variable&lt;/strong> passes the 0.50 PIP display threshold: the change in industry share (PIP = 0.998). The three GDP polynomial terms all have PIPs below 0.30:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>PIP (FD-BMA)&lt;/th>
&lt;th>PIP (FE-BMA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>d_ln_gdp&lt;/td>
&lt;td>0.298&lt;/td>
&lt;td>0.994&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_sq&lt;/td>
&lt;td>0.267&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_cb&lt;/td>
&lt;td>0.271&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_fossil_fuel&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_renewable&lt;/td>
&lt;td>0.350&lt;/td>
&lt;td>0.959&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_urban&lt;/td>
&lt;td>0.096&lt;/td>
&lt;td>0.268&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_industry&lt;/td>
&lt;td>&lt;strong>0.998&lt;/strong>&lt;/td>
&lt;td>0.999&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_democracy&lt;/td>
&lt;td>0.094&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With only 80 cross-sectional observations, BMA&amp;rsquo;s evidence threshold is much harder to clear. The GDP terms &amp;mdash; which are &lt;em>the core of the EKC&lt;/em> &amp;mdash; do not survive because the 20-year differences are noisy and the cubic polynomial requires precise estimation of three correlated terms simultaneously.&lt;/p>
&lt;p>The change in industry share is the only variable with a strong enough signal-to-noise ratio to clear BMA&amp;rsquo;s bar. The FE-based BMA (N = 1,600) has 20x more observations to work with, which is why it identifies 6 robust variables.&lt;/p>
&lt;h3 id="a5-dsl-on-first-differences">A.5 DSL on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, ///
controls(d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization) ///
rseed(9988)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 80
Number of controls = 12
Number of selected controls = 1
Wald chi2(3) = 10.65
Prob &amp;gt; chi2 = 0.0138
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -5.047196 4.558593 -1.11 0.268 -13.98187 3.887483
d_ln_gdp_sq | .5943786 .4700569 1.26 0.206 -.326916 1.515673
d_ln_gdp_cb | -.0220809 .0160386 -1.38 0.169 -.0535159 .0093541
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
d_ln_co2 | linear plugin .3818852 1
d_ln_gdp | linear plugin .3818852 0
d_ln_gdp_sq | linear plugin .3818852 0
d_ln_gdp_cb | linear plugin .3818852 0
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>FD-DSL selected only &lt;strong>1 control&lt;/strong> for the outcome equation (likely d_industry, consistent with BMA) and &lt;strong>zero controls&lt;/strong> for each of the three GDP equations. With such sparse selection, the final OLS is essentially a regression of d_ln_co2 on the three GDP terms plus one control &amp;mdash; and none of the three GDP terms are individually significant (p = 0.17&amp;ndash;0.27). The Wald test for joint significance is borderline (p = 0.014), suggesting the GDP terms collectively have some explanatory power, but the individual estimates are too noisy for inference.&lt;/p>
&lt;h3 id="a6-comparison-first-differences-vs-fixed-effects">A.6 Comparison: first differences vs fixed effects&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>FD Sparse&lt;/th>
&lt;th>FD Kitchen&lt;/th>
&lt;th>FD BMA&lt;/th>
&lt;th>FD DSL&lt;/th>
&lt;th>FE BMA&lt;/th>
&lt;th>FE DSL&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;10.362&lt;/td>
&lt;td>&amp;ndash;8.110&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;5.047&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>1.156&lt;/td>
&lt;td>0.924&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>0.594&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.041&lt;/td>
&lt;td>&amp;ndash;0.034&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;0.022&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GDP terms robust?&lt;/strong>&lt;/td>
&lt;td>Yes (p &amp;lt; 0.05)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>No&lt;/strong> (PIP &amp;lt; 0.30)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong> (PIP &amp;gt; 0.99)&lt;/td>
&lt;td>Yes (p &amp;lt; 0.001)&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Controls selected&lt;/strong>&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>6 of 12&lt;/td>
&lt;td>102 of 112&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$1,913&lt;/td>
&lt;td>\$1,465&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$987&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$60,817&lt;/td>
&lt;td>\$61,655&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$62,983&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> FD-BMA posterior means for the GDP terms are heavily shrunk toward zero (because their PIPs are ~0.27&amp;ndash;0.30), so we report &amp;ldquo;n/a&amp;rdquo; rather than misleading point estimates.&lt;/p>
&lt;/blockquote>
&lt;p>The comparison reveals a stark trade-off between the two identification strategies:&lt;/p>
&lt;p>&lt;strong>Fixed effects win on accuracy.&lt;/strong> The FE-based estimates are close to the true DGP values, with BMA (FE) achieving the best accuracy ($\beta_1 = -7.139$ vs true &amp;ndash;7.100). The FD estimates are noisier: FD-sparse overshoots ($\beta_1 = -10.36$), while FD-DSL undershoots (&amp;ndash;5.05). The FD turning points are wildly inaccurate &amp;mdash; the maximum turning point is \$61,000&amp;ndash;63,000 in first differences vs \$27,000 with FE (true: \$34,647).&lt;/p>
&lt;p>&lt;strong>First differences struggle with the cubic polynomial.&lt;/strong> Estimating a cubic EKC requires precise measurement of three highly correlated terms ($\ln GDP$, $(\ln GDP)^2$, $(\ln GDP)^3$). With only 80 observations (one 20-year change per country), the multicollinearity among differenced GDP terms is severe. Both BMA and DSL respond rationally: BMA gives all three terms PIPs below 0.30, and DSL selects zero controls for the GDP equations. Neither method &amp;ldquo;trusts&amp;rdquo; the cubic specification in this small sample.&lt;/p>
&lt;p>&lt;strong>Industry is the strongest cross-sectional signal.&lt;/strong> Both FD-BMA (PIP = 0.998) and FD-DSL (selected as the sole control) identify the change in industry share as the most important cross-sectional predictor of CO&lt;sub>2&lt;/sub> change. This makes economic sense: countries that industrialized the most over 1995&amp;ndash;2014 also increased their emissions the most, regardless of their income trajectory.&lt;/p>
&lt;p>&lt;strong>Practical implication.&lt;/strong> First differences are appropriate when the research question is about &lt;em>long-run changes&lt;/em> rather than &lt;em>levels&lt;/em>. But for testing the EKC cubic shape, the panel FE approach is far more powerful because it uses all 1,600 observations rather than collapsing to 80. The FD analysis confirms that the inverted-N result in the main body is robust in spirit to the identification strategy (FD-sparse OLS recovers the correct signs), but the magnitudes and statistical power are substantially weaker.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1016/j.eneco.2025.108649" target="_blank" rel="noopener">Gravina, A. F. &amp;amp; Lanzafame, M. (2025). What&amp;rsquo;s your shape? Bayesian model averaging and double machine learning for the Environmental Kuznets Curve. &lt;em>Energy Economics&lt;/em>, 108649.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.623" target="_blank" rel="noopener">Fernandez, C., Ley, E., &amp;amp; Steel, M. F. J. (2001). Model uncertainty in cross-country growth regressions. &lt;em>Journal of Applied Econometrics&lt;/em>, 16(5), 563&amp;ndash;576.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdt044" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., &amp;amp; Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1997.10473615" target="_blank" rel="noopener">Raftery, A. E., Madigan, D., &amp;amp; Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. &lt;em>Journal of the American Statistical Association&lt;/em>, 92(437), 179&amp;ndash;191.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery, A. E. (1995). Bayesian model selection in social research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>bmaregress&lt;/code> &amp;mdash; Bayesian Model Averaging regression&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>dsregress&lt;/code> &amp;mdash; Double-Selection LASSO linear regression&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, this post may still contain errors, so exercise caution before applying its contents to real research projects.&lt;/p></description></item><item><title>Three Methods for Robust Variable Selection: BMA, LASSO, and WALS</title><link>https://carlos-mendez.org/post/r_bma_lasso_wals/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_bma_lasso_wals/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you are an economist advising a government on climate policy. Your team has collected cross-country data on a dozen potential drivers of CO&lt;sub>2&lt;/sub> emissions: GDP per capita, fossil fuel dependence, urbanization, industrial output, democratic governance, trade networks, agricultural activity, trade openness, foreign direct investment, corruption, tourism, and domestic credit. The government has a limited budget and wants to know: &lt;strong>which of these factors truly drive CO&lt;sub>2&lt;/sub> emissions, and which are red herrings?&lt;/strong>&lt;/p>
&lt;p>This is the &lt;strong>variable selection&lt;/strong> problem, and it is harder than it sounds. With 12 candidate variables, each either included or excluded from a regression, there are $2^{12} = 4,096$ possible models you could estimate. Run one model and report it as &amp;ldquo;the answer,&amp;rdquo; and you have implicitly assumed the other 4,095 models are wrong. That is a very strong assumption &amp;mdash; and almost certainly unjustified.&lt;/p>
&lt;p>In practice, researchers handle this by &lt;em>specification searching&lt;/em>: they try many models, drop insignificant variables, and report whichever specification &amp;ldquo;works best.&amp;rdquo; This process inflates false discoveries. A noise variable that happens to look significant in one specification gets reported, while the many failed specifications are hidden in the researcher&amp;rsquo;s desk drawer. This is sometimes called the &lt;strong>file drawer problem&lt;/strong> or &lt;strong>pretesting bias&lt;/strong>.&lt;/p>
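&lt;p>The arithmetic behind this inflation is simple. If each of 12 noise variables has a 5% chance of looking significant by luck alone, the chance that &lt;em>at least one&lt;/em> of them does is far larger. A back-of-the-envelope check (plain Python, purely illustrative, assuming independent tests):&lt;/p>

```python
# Chance that at least one of 12 independent noise variables
# clears the 5 percent significance bar purely by chance
p_at_least_one = 1 - 0.95 ** 12
print(round(p_at_least_one, 2))  # 0.46
```

So even when every candidate is pure noise, a specification search will turn up a significant-looking predictor nearly half the time. (Independence is a simplification; correlated regressors change the exact number, not the qualitative point.)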
&lt;p>This tutorial introduces three principled approaches to the variable selection problem:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Q[&amp;quot;&amp;lt;b&amp;gt;Variable Selection&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Which of 12 variables&amp;lt;br/&amp;gt;truly matter?&amp;quot;] --&amp;gt; BMA
Q --&amp;gt; LASSO
Q --&amp;gt; WALS
BMA[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model Averaging&amp;lt;br/&amp;gt;PIPs from 4,096 models&amp;quot;] --&amp;gt; R[&amp;quot;&amp;lt;b&amp;gt;Convergence&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Variables identified&amp;lt;br/&amp;gt;by all 3 methods&amp;quot;]
LASSO[&amp;quot;&amp;lt;b&amp;gt;LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;L1 penalized regression&amp;lt;br/&amp;gt;Automatic selection&amp;quot;] --&amp;gt; R
WALS[&amp;quot;&amp;lt;b&amp;gt;WALS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Frequentist averaging&amp;lt;br/&amp;gt;t-statistics&amp;quot;] --&amp;gt; R
style Q fill:#141413,stroke:#141413,color:#fff
style BMA fill:#6a9bcc,stroke:#141413,color:#fff
style LASSO fill:#d97757,stroke:#141413,color:#fff
style WALS fill:#00d4c8,stroke:#141413,color:#fff
style R fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong>: Average across all 4,096 models, weighting each by how well it fits the data. Variables that appear important across many models earn a high &amp;ldquo;inclusion probability.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO (Least Absolute Shrinkage and Selection Operator)&lt;/strong>: Add a penalty to the regression that forces the coefficients of irrelevant variables to be &lt;em>exactly zero&lt;/em>, performing automatic selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weighted Average Least Squares (WALS)&lt;/strong>: A fast frequentist model-averaging method that transforms the problem so each variable can be evaluated independently.&lt;/p>
&lt;/li>
&lt;/ol>
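&lt;p>To make LASSO&amp;rsquo;s &amp;ldquo;exactly zero&amp;rdquo; behavior concrete: under an orthonormal design, the LASSO estimate of each coefficient is simply the OLS estimate passed through a soft-thresholding operator. A minimal sketch (plain Python, for intuition only; the tutorial itself uses R&amp;rsquo;s &lt;code>glmnet&lt;/code>):&lt;/p>

```python
import math

def soft_threshold(z, lam):
    # Shrink the OLS estimate z toward zero by lam,
    # snapping anything smaller than lam in magnitude to exactly 0.
    return math.copysign(max(abs(z) - lam, 0.0), z)

print(soft_threshold(0.3, 0.5))  # 0.0  -- small effect is dropped
print(soft_threshold(1.2, 0.5))  # 0.7  -- large effect is kept but shrunk
```

This is why LASSO performs selection while Ridge (an L2 penalty) only shrinks: the L1 penalty's kink at zero maps a whole interval of small estimates to exactly zero.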
&lt;p>We use &lt;strong>synthetic data&lt;/strong> throughout this tutorial. This means we &lt;em>know the true data-generating process&lt;/em> &amp;mdash; which variables truly matter and which do not. This &amp;ldquo;answer key&amp;rdquo; lets us verify whether each method correctly recovers the truth. By the end, you will understand not just &lt;em>how&lt;/em> to run each method, but &lt;em>why&lt;/em> it works and &lt;em>when&lt;/em> to prefer one over the others.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the variable selection problem and why running a single model is insufficient when model uncertainty is large&lt;/li>
&lt;li>Implement Bayesian Model Averaging in R and interpret Posterior Inclusion Probabilities (PIPs)&lt;/li>
&lt;li>Apply LASSO with cross-validation to perform automatic variable selection and use Post-LASSO for unbiased estimation&lt;/li>
&lt;li>Run WALS as a fast frequentist model-averaging alternative and interpret its t-statistics&lt;/li>
&lt;li>Compare results across all three methods to identify truly robust determinants via methodological triangulation&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Content outline.&lt;/strong> Section 2 sets up the R environment. Section 3 introduces the synthetic dataset and its built-in &amp;ldquo;answer key&amp;rdquo; &amp;mdash; 7 true predictors and 5 noise variables with realistic multicollinearity. Section 4 runs naive OLS to illustrate the spurious significance problem. Sections 5&amp;ndash;8 cover BMA: Bayes' rule foundations, the PIP framework, a toy example, and full implementation. Sections 9&amp;ndash;12 cover LASSO: the bias-variance tradeoff, L1/L2 geometry, cross-validated implementation, and Post-LASSO. Sections 13&amp;ndash;16 cover WALS: frequentist model averaging, the semi-orthogonal transformation, the Laplace prior, and implementation. Section 17 brings all three methods together for a grand comparison. Section 18 summarizes key takeaways and provides further reading.&lt;/p>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;p>Before running the analysis, install the required packages if needed. The following code checks for missing packages and installs them automatically.&lt;/p>
&lt;pre>&lt;code class="language-r"># List all packages needed for this tutorial
required_packages &amp;lt;- c(
  &amp;quot;tidyverse&amp;quot;, # data manipulation and ggplot2 visualization
  &amp;quot;BMS&amp;quot;, # Bayesian Model Averaging via the bms() function
  &amp;quot;glmnet&amp;quot;, # LASSO and Ridge regression via coordinate descent
  &amp;quot;WALS&amp;quot;, # Weighted Average Least Squares estimation
  &amp;quot;scales&amp;quot;, # nice axis formatting in plots
  &amp;quot;patchwork&amp;quot;, # combine multiple ggplot panels
  &amp;quot;ggrepel&amp;quot;, # non-overlapping text labels on plots
  &amp;quot;corrplot&amp;quot;, # correlation matrix heatmaps
  &amp;quot;broom&amp;quot; # tidy model summaries
)
# Install any packages not yet available
missing &amp;lt;- required_packages[!sapply(required_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) {
  install.packages(missing, repos = &amp;quot;https://cloud.r-project.org&amp;quot;)
}
# Load libraries
library(tidyverse)
library(BMS)
library(glmnet)
library(WALS)
library(scales)
library(patchwork)
library(ggrepel)
library(corrplot)
library(broom)
&lt;/code>&lt;/pre>
&lt;h2 id="3-the-synthetic-dataset">3. The Synthetic Dataset&lt;/h2>
&lt;h3 id="31-the-data-generating-process-our-answer-key">3.1 The data-generating process (our &amp;ldquo;answer key&amp;rdquo;)&lt;/h3>
&lt;p>We use a cross-sectional dataset of 120 fictional countries. The key design choices:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>7 variables have true nonzero effects&lt;/strong> on CO&lt;sub>2&lt;/sub> emissions&lt;/li>
&lt;li>&lt;strong>5 variables are pure noise&lt;/strong> (their true coefficients are exactly zero)&lt;/li>
&lt;li>The noise variables are &lt;strong>correlated with GDP and other true predictors&lt;/strong>, creating realistic multicollinearity. This makes variable selection genuinely challenging &amp;mdash; naive OLS will find spurious &amp;ldquo;significant&amp;rdquo; results for noise variables.&lt;/li>
&lt;/ul>
&lt;p>Think of this as setting up a controlled experiment. We know the answer before we begin, so we can grade each method&amp;rsquo;s performance.&lt;/p>
&lt;p>The data-generating process below shows exactly how the synthetic dataset was built. The CSV file &lt;code>synthetic-co2-cross-section.csv&lt;/code> was generated with &lt;code>set.seed(2017)&lt;/code> and can be loaded directly from GitHub for full reproducibility.&lt;/p>
&lt;pre>&lt;code class="language-r"># --- DATA-GENERATING PROCESS (reference) ---
set.seed(2017)
n &amp;lt;- 120 # number of &amp;quot;countries&amp;quot;
# GDP drives many other variables (realistic: richer countries
# have higher urbanization, more industry, etc.)
log_gdp &amp;lt;- rnorm(n, mean = 8.5, sd = 1.5)
# --- TRUE PREDICTORS (correlated with GDP) ---
fossil_fuel &amp;lt;- 30 + 3 * log_gdp + rnorm(n, 0, 10) # higher in richer countries
urban_pop &amp;lt;- 20 + 5 * log_gdp + rnorm(n, 0, 12) # increases with income
industry &amp;lt;- 15 + 1.5 * log_gdp + rnorm(n, 0, 6) # industry share
democracy &amp;lt;- 5 + 2 * log_gdp + rnorm(n, 0, 8) # democracy index
trade_network &amp;lt;- 0.2 + 0.05 * log_gdp + rnorm(n, 0, 0.15) # trade centrality
agriculture &amp;lt;- 40 - 3 * log_gdp + rnorm(n, 0, 8) # negatively correlated with GDP
# --- NOISE VARIABLES (correlated with GDP but NO true effect) ---
log_trade &amp;lt;- 3.5 + 0.1 * log_gdp + rnorm(n, 0, 0.5)
fdi &amp;lt;- 2 + rnorm(n, 0, 4)
corruption &amp;lt;- 0.8 - 0.05 * log_gdp + rnorm(n, 0, 0.15)
log_tourism &amp;lt;- 12 + 0.3 * log_gdp + rnorm(n, 0, 1.2)
log_credit &amp;lt;- 2.5 + 0.15 * log_gdp + rnorm(n, 0, 0.6)
# --- TRUE DATA-GENERATING PROCESS ---
log_co2 &amp;lt;- 2.0 + # intercept
  1.200 * log_gdp + # GDP: strong positive (elasticity)
  0.008 * industry + # industry: positive
  0.012 * fossil_fuel + # fossil fuel: positive
  0.010 * urban_pop + # urbanization: positive
  0.004 * democracy + # democracy: small positive
  0.500 * trade_network + # trade network: moderate positive
  0.005 * agriculture + # agriculture: weak positive
  # NOISE VARIABLES HAVE ZERO TRUE EFFECT
  rnorm(n, 0, 0.3) # random noise (sigma = 0.3)
&lt;/code>&lt;/pre>
&lt;p>The true coefficients serve as our &amp;ldquo;answer key&amp;rdquo;:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Variable&lt;/th>
&lt;th style="text-align:left">True $\beta$&lt;/th>
&lt;th style="text-align:left">Role&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">log_gdp&lt;/td>
&lt;td style="text-align:left">1.200&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">trade_network&lt;/td>
&lt;td style="text-align:left">0.500&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Moderate positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fossil_fuel&lt;/td>
&lt;td style="text-align:left">0.012&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">urban_pop&lt;/td>
&lt;td style="text-align:left">0.010&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more urbanization $\to$ 1.0% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">industry&lt;/td>
&lt;td style="text-align:left">0.008&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Positive composition effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">agriculture&lt;/td>
&lt;td style="text-align:left">0.005&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Weak positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">democracy&lt;/td>
&lt;td style="text-align:left">0.004&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Small positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_trade&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fdi&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">corruption&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_tourism&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_credit&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Now let us load the pre-generated dataset:&lt;/p>
&lt;pre>&lt;code class="language-r"># Load the synthetic dataset directly from GitHub
DATA_URL &amp;lt;- &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/synthetic-co2-cross-section.csv&amp;quot;
synth_data &amp;lt;- read.csv(DATA_URL)
cat(&amp;quot;Dataset:&amp;quot;, nrow(synth_data), &amp;quot;countries,&amp;quot;, ncol(synth_data), &amp;quot;variables\n&amp;quot;)
head(synth_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset: 120 countries, 14 variables
country log_co2 log_gdp industry fossil_fuel urban_pop democracy trade_network
1 Country_001 13.27 9.47 29.25 66.94 67.97 25.67 0.77
2 Country_002 12.18 8.44 24.97 51.43 66.14 20.51 0.85
3 Country_003 13.50 10.16 28.19 50.62 73.91 29.08 0.73
...
&lt;/code>&lt;/pre>
&lt;h3 id="32-descriptive-statistics">3.2 Descriptive statistics&lt;/h3>
&lt;p>The following summary statistics give us a first look at the data structure. Note the wide range of scales: GDP is in log units (mean around 8.5), while percentage variables like fossil fuel share and urbanization range from single digits to near 100.&lt;/p>
&lt;pre>&lt;code class="language-r"># Descriptive statistics for all 13 numeric variables
synth_data |&amp;gt;
select(-country) |&amp;gt;
pivot_longer(everything(), names_to = &amp;quot;variable&amp;quot;, values_to = &amp;quot;value&amp;quot;) |&amp;gt;
summarise(
n = n(),
mean = round(mean(value), 2),
sd = round(sd(value), 2),
min = round(min(value), 2),
max = round(max(value), 2),
.by = variable
)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable n mean sd min max
log_co2 120 14.22 2.11 8.76 20.36
log_gdp 120 8.53 1.57 4.61 13.21
industry 120 27.87 6.21 8.32 44.98
fossil_fuel 120 55.49 9.62 24.72 81.22
urban_pop 120 62.52 13.25 29.81 97.62
democracy 120 22.94 8.32 3.10 45.00
trade_network 120 0.64 0.17 0.18 1.04
agriculture 120 13.87 8.11 1.00 37.11
log_trade 120 4.43 0.46 3.45 5.84
fdi 120 2.23 4.19 -5.00 13.62
corruption 120 0.37 0.16 0.05 0.71
log_tourism 120 14.61 1.32 11.54 19.63
log_credit 120 3.83 0.65 2.30 5.50
&lt;/code>&lt;/pre>
&lt;p>The dataset has 120 observations and 14 variables (1 dependent, 12 candidate regressors, 1 country identifier). The dependent variable &lt;code>log_co2&lt;/code> has a mean of 14.22 with a standard deviation of 2.11 log points, reflecting substantial cross-country variation in emissions. The candidate regressors span very different scales &amp;mdash; trade_network ranges from 0.18 to 1.04, while urban_pop ranges from 29.8 to 97.6 &amp;mdash; which is why BMA, LASSO, and WALS each handle scaling internally.&lt;/p>
&lt;h3 id="33-correlation-structure">3.3 Correlation structure&lt;/h3>
&lt;p>A key feature of our synthetic data is that the noise variables are correlated with the true predictors &amp;mdash; especially with GDP. This correlation is what makes variable selection difficult: in a standard OLS regression, the noise variables will &amp;ldquo;borrow&amp;rdquo; explanatory power from the true predictors.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute correlation matrix for all 12 candidate regressors
cor_matrix &amp;lt;- synth_data |&amp;gt;
select(-country, -log_co2) |&amp;gt;
cor()
# Draw the heatmap
corrplot(cor_matrix, method = &amp;quot;color&amp;quot;, type = &amp;quot;lower&amp;quot;,
addCoef.col = &amp;quot;black&amp;quot;, number.cex = 0.7,
col = colorRampPalette(c(&amp;quot;#d97757&amp;quot;, &amp;quot;white&amp;quot;, &amp;quot;#6a9bcc&amp;quot;))(200),
diag = FALSE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_01_correlation.png" alt="Correlation matrix heatmap showing that noise variables like trade openness, tourism, and credit are correlated with GDP and other true predictors, creating the multicollinearity that makes variable selection challenging.">&lt;/p>
&lt;p>The correlation heatmap reveals the realistic structure we built into the data. GDP is positively correlated with fossil fuel use, urbanization, industry, and the trade network &amp;mdash; but also with the noise variables like trade openness, tourism, and credit. This multicollinearity is precisely what makes a naive &amp;ldquo;throw everything into OLS&amp;rdquo; approach unreliable. For example, log_tourism has a correlation of approximately 0.3 with log_gdp, which means it can pick up GDP&amp;rsquo;s signal even though its true effect is zero.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> We created a synthetic dataset where we &lt;em>know&lt;/em> which 7 variables truly affect CO&lt;sub>2&lt;/sub> emissions and which 5 are noise. The noise variables are deliberately correlated with the true predictors, mimicking the multicollinearity found in real cross-country data.&lt;/p>
&lt;/blockquote>
&lt;h2 id="4-the-general-model">4. The General Model&lt;/h2>
&lt;p>Our goal is to estimate the following linear model:&lt;/p>
&lt;p>$$
\log(\text{CO}_{2,i}) = \beta_0 + \sum_{j=1}^{12} \beta_j x_{j,i} + \varepsilon_i
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\log(\text{CO}_{2,i})$ is the log of CO&lt;sub>2&lt;/sub> emissions for country $i$&lt;/li>
&lt;li>$\beta_0$ is the &lt;strong>intercept&lt;/strong> (the predicted log CO&lt;sub>2&lt;/sub> when all regressors are zero)&lt;/li>
&lt;li>$\beta_j$ is the &lt;strong>coefficient&lt;/strong> on the $j$-th regressor: the change in log CO&lt;sub>2&lt;/sub> associated with a one-unit increase in $x_j$, holding all other variables constant&lt;/li>
&lt;li>$\varepsilon_i$ is the &lt;strong>error term&lt;/strong>: everything that affects CO&lt;sub>2&lt;/sub> emissions but is not captured by the 12 regressors&lt;/li>
&lt;/ul>
&lt;p>Because the dependent variable is in logs, the interpretation of each coefficient depends on whether the regressor is also in logs:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Regressor type&lt;/th>
&lt;th style="text-align:left">Interpretation of $\beta_j$&lt;/th>
&lt;th style="text-align:left">Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Log-log (e.g., log GDP)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Elasticity&lt;/strong>: a 1% increase in GDP is associated with a $\beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 1.2$ means 1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Level-log (e.g., fossil fuel %)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Semi-elasticity&lt;/strong>: a 1-unit increase in the regressor is associated with a $100 \times \beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 0.012$ means 1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
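&lt;p>For small coefficients, the &amp;ldquo;$100 \times \beta_j$%&amp;rdquo; reading is an approximation to the exact percent change $100 \times (e^{\beta_j} - 1)$ implied by the log specification. A quick check for the fossil-fuel coefficient:&lt;/p>
&lt;pre>&lt;code class="language-r"># Level-log interpretation: percent change in CO2 per 1-unit increase
beta &amp;lt;- 0.012 # true fossil_fuel coefficient
approx_pct &amp;lt;- 100 * beta # linear approximation
exact_pct &amp;lt;- 100 * (exp(beta) - 1) # exact implied percent change
c(approx = approx_pct, exact = round(exact_pct, 3)) # 1.2 vs 1.207
&lt;/code>&lt;/pre>
&lt;p>The two readings agree to two decimal places at these magnitudes, which is why the simpler &amp;ldquo;1 pp $\to$ 1.2%&amp;rdquo; shorthand is harmless here.&lt;/p>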
&lt;p>We want to determine &lt;strong>which $\beta_j$ are truly nonzero&lt;/strong>. We know the answer (we designed the data), but let us first see what happens if we just run OLS with all 12 variables.&lt;/p>
&lt;pre>&lt;code class="language-r"># Run OLS with all 12 candidate regressors
ols_full &amp;lt;- lm(log_co2 ~ log_gdp + industry + fossil_fuel + urban_pop +
democracy + trade_network + agriculture +
log_trade + fdi + corruption + log_tourism + log_credit,
data = synth_data)
# Display summary
summary(ols_full)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
(Intercept) 2.283773 0.494736 4.616 1.06e-05 ***
log_gdp 1.163669 0.032747 35.537 &amp;lt; 2e-16 ***
industry 0.017577 0.005004 3.513 0.000661 ***
fossil_fuel 0.011988 0.003240 3.698 0.000349 ***
urban_pop 0.008221 0.002689 3.057 0.002794 **
democracy 0.010497 0.003975 2.640 0.009549 **
trade_network 0.912828 0.203681 4.482 1.94e-05 ***
agriculture -0.000629 0.004242 -0.148 0.882568
log_trade -0.055738 0.064829 -0.860 0.391509
fdi 0.000789 0.007045 0.112 0.910964
corruption 0.010767 0.201954 0.053 0.957573
log_tourism -0.028025 0.024415 -1.148 0.253610
log_credit 0.045689 0.049690 0.919 0.360252
---
Multiple R-squared: 0.9801, Adjusted R-squared: 0.9779
&lt;/code>&lt;/pre>
&lt;p>Look carefully at the noise variables. For example, log_trade has a t-statistic of $-0.86$ (p = 0.392) and corruption has a t-statistic of $0.05$ (p = 0.958). None reach conventional significance in this sample. However, their estimated coefficients can be non-negligible in magnitude &amp;mdash; and in a different random sample, some noise variables could easily cross the 5% threshold. This is the risk of &lt;strong>spurious significance&lt;/strong>, caused by the correlation between noise variables and the true predictors. It is precisely this problem that motivates the three methods we study next.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Warning.&lt;/strong> With 12 correlated regressors and only 120 observations, OLS can produce misleading significance levels. A variable with a true coefficient of zero may appear significant simply because it is correlated with a genuinely important predictor. This is why we need principled variable selection methods.&lt;/p>
&lt;/blockquote>
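&lt;p>To make the warning concrete, here is a minimal simulation sketch (hypothetical seed and parameters, separate from the post&amp;rsquo;s dataset): generate an outcome driven by one true predictor, add five zero-effect regressors correlated with it, and count how often at least one of them crosses the 5% threshold.&lt;/p>
&lt;pre>&lt;code class="language-r"># Monte Carlo: how often does at least one zero-effect regressor
# look &amp;quot;significant&amp;quot; at the 5% level? (illustrative parameters)
set.seed(1)
n &amp;lt;- 120; sims &amp;lt;- 500
any_sig &amp;lt;- replicate(sims, {
  x_true &amp;lt;- rnorm(n)
  # five noise regressors, each correlated with the true predictor
  noise &amp;lt;- sapply(1:5, function(j) 0.5 * x_true + rnorm(n))
  y &amp;lt;- 1.2 * x_true + rnorm(n) # noise variables have zero true effect
  pvals &amp;lt;- summary(lm(y ~ x_true + noise))$coefficients[-(1:2), 4]
  any(pvals &amp;lt; 0.05)
})
mean(any_sig) # typically around 0.2 with five tests per sample
&lt;/code>&lt;/pre>
&lt;p>Even though every noise coefficient is truly zero, testing five of them at the 5% level flags at least one &amp;ldquo;significant&amp;rdquo; effect in roughly a fifth of the samples.&lt;/p>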
&lt;div style="background: linear-gradient(135deg, #6a9bcc 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 1: Bayesian Model Averaging
&lt;/div>
&lt;h2 id="5-bayes-rule-----the-foundation">5. Bayes' Rule &amp;mdash; The Foundation&lt;/h2>
&lt;p>Before we can understand Bayesian Model Averaging, we need to understand &lt;strong>Bayes' rule&lt;/strong> &amp;mdash; the mathematical machinery that powers the entire framework.&lt;/p>
&lt;h3 id="51-a-coin-flip-example">5.1 A coin-flip example&lt;/h3>
&lt;p>Suppose a friend gives you a coin. You want to know: &lt;strong>is this coin fair&lt;/strong> (probability of heads = 0.5), or is it &lt;strong>biased&lt;/strong> (probability of heads = 0.7)?&lt;/p>
&lt;p>Before flipping, you have no strong opinion. You assign equal &lt;strong>prior probabilities&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>$P(\text{fair}) = 0.5$ (50% chance the coin is fair)&lt;/li>
&lt;li>$P(\text{biased}) = 0.5$ (50% chance the coin is biased)&lt;/li>
&lt;/ul>
&lt;p>Now you flip the coin 10 times and observe &lt;strong>7 heads&lt;/strong>. How should you update your beliefs?&lt;/p>
&lt;p>The &lt;strong>likelihood&lt;/strong> of seeing 7 heads in 10 flips is:&lt;/p>
&lt;ul>
&lt;li>If the coin is fair ($p = 0.5$): $P(\text{7 heads} | \text{fair}) = \binom{10}{7} (0.5)^{10} = 0.1172$&lt;/li>
&lt;li>If the coin is biased ($p = 0.7$): $P(\text{7 heads} | \text{biased}) = \binom{10}{7} (0.7)^7 (0.3)^3 = 0.2668$&lt;/li>
&lt;/ul>
&lt;p>The biased coin makes the data more likely. Bayes' rule combines the prior and the likelihood:&lt;/p>
&lt;p>$$
P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$P(H|D)$ = &lt;strong>posterior probability&lt;/strong> (what we believe &lt;em>after&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D|H)$ = &lt;strong>likelihood&lt;/strong> (how probable the data is under hypothesis $H$)&lt;/li>
&lt;li>$P(H)$ = &lt;strong>prior probability&lt;/strong> (what we believed &lt;em>before&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D)$ = &lt;strong>marginal likelihood&lt;/strong> (a normalizing constant that ensures probabilities sum to 1)&lt;/li>
&lt;/ul>
&lt;p>For our coin:&lt;/p>
&lt;p>$$
P(\text{fair}|\text{7H}) = \frac{0.1172 \times 0.5}{0.1172 \times 0.5 + 0.2668 \times 0.5} = \frac{0.0586}{0.1920} = 0.305
$$&lt;/p>
&lt;p>$$
P(\text{biased}|\text{7H}) = \frac{0.2668 \times 0.5}{0.1920} = 0.695
$$&lt;/p>
&lt;p>After seeing 7 heads, we update from 50&amp;ndash;50 to roughly 30&amp;ndash;70 in favor of the biased coin. &lt;strong>The data shifted our beliefs, but did not erase the prior entirely.&lt;/strong>&lt;/p>
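&lt;p>The same arithmetic takes a few lines of R, using &lt;code>dbinom()&lt;/code> for the binomial likelihoods:&lt;/p>
&lt;pre>&lt;code class="language-r"># Prior: 50-50 between fair (p = 0.5) and biased (p = 0.7)
prior &amp;lt;- c(fair = 0.5, biased = 0.5)
# Likelihood of 7 heads in 10 flips under each hypothesis
lik &amp;lt;- c(fair = dbinom(7, size = 10, prob = 0.5),
         biased = dbinom(7, size = 10, prob = 0.7))
# Bayes' rule: posterior is proportional to prior times likelihood
posterior &amp;lt;- prior * lik / sum(prior * lik)
round(lik, 4) # 0.1172 0.2668
round(posterior, 3) # 0.305 0.695
&lt;/code>&lt;/pre>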
&lt;h3 id="52-the-bridge-to-model-averaging">5.2 The bridge to model averaging&lt;/h3>
&lt;p>Now replace &amp;ldquo;fair coin&amp;rdquo; and &amp;ldquo;biased coin&amp;rdquo; with &lt;em>regression models&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Hypothesis = &amp;ldquo;Which variables belong in the model?&amp;rdquo;&lt;/li>
&lt;li>Prior = &amp;ldquo;Before seeing data, any combination of variables is equally plausible&amp;rdquo;&lt;/li>
&lt;li>Likelihood = &amp;ldquo;How well does each model fit the data?&amp;rdquo;&lt;/li>
&lt;li>Posterior = &amp;ldquo;After seeing data, which models are most credible?&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>This is exactly what BMA does. Instead of two coin hypotheses, we have 4,096 model hypotheses &amp;mdash; but the logic of Bayes' rule is identical.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> Bayes' rule updates prior beliefs using data. The posterior probability of any hypothesis is proportional to its prior probability times its likelihood. BMA applies this same logic to regression models instead of coin flips.&lt;/p>
&lt;/blockquote>
&lt;h2 id="6-the-bma-framework">6. The BMA Framework&lt;/h2>
&lt;h3 id="61-posterior-model-probability">6.1 Posterior model probability&lt;/h3>
&lt;p>With 12 candidate variables, there are $K = 12$ regressors and $2^K = 4,096$ possible models. Denote the $k$-th model as $M_k$. BMA assigns each model a &lt;strong>posterior probability&lt;/strong>:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{P(y | M_k) \cdot P(M_k)}{\sum_{l=1}^{2^K} P(y | M_l) \cdot P(M_l)}
$$&lt;/p>
&lt;p>This is just Bayes' rule applied to models. Let us unpack each piece:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>$P(y | M_k)$&lt;/strong> is the &lt;strong>marginal likelihood&lt;/strong> of model $M_k$. It measures how well the model fits the data, &lt;em>automatically penalizing complexity&lt;/em>. A model with many parameters can fit the data closely, but the marginal likelihood integrates over all possible parameter values, spreading the probability thin. This acts as a built-in &lt;strong>Occam&amp;rsquo;s razor&lt;/strong>: simpler models that fit the data well receive higher marginal likelihoods than complex models that fit only slightly better.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>$P(M_k)$&lt;/strong> is the &lt;strong>prior model probability&lt;/strong>. With no prior information, we use a &lt;strong>uniform prior&lt;/strong>: every model is equally likely, so $P(M_k) = 1/4,096$ for all $k$. This means the posterior is driven entirely by the data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The &lt;strong>denominator&lt;/strong> is a normalizing constant that ensures all posterior model probabilities sum to 1.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-posterior-inclusion-probability-pip">6.2 Posterior Inclusion Probability (PIP)&lt;/h3>
&lt;p>We do not really care about individual models &amp;mdash; we care about individual &lt;em>variables&lt;/em>. The &lt;strong>Posterior Inclusion Probability&lt;/strong> of variable $j$ is the sum of the posterior probabilities of all models that include variable $j$:&lt;/p>
&lt;p>$$
\text{PIP}_j = \sum_{k:\, j \in M_k} P(M_k | y)
$$&lt;/p>
&lt;p>Think of it as a &lt;strong>democratic vote&lt;/strong>. Each of the 4,096 models casts a vote for which variables matter. But the votes are &lt;em>weighted&lt;/em>: models that fit the data well get louder voices. If variable $j$ appears in most of the high-probability models, it earns a high PIP.&lt;/p>
&lt;p>The standard interpretation thresholds (Raftery, 1995):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">PIP range&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;th style="text-align:left">Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$\geq 0.99$&lt;/td>
&lt;td style="text-align:left">Decisive evidence&lt;/td>
&lt;td style="text-align:left">Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.95 - 0.99$&lt;/td>
&lt;td style="text-align:left">Very strong evidence&lt;/td>
&lt;td style="text-align:left">Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.80 - 0.95$&lt;/td>
&lt;td style="text-align:left">Strong evidence (robust)&lt;/td>
&lt;td style="text-align:left">Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.50 - 0.80$&lt;/td>
&lt;td style="text-align:left">Borderline evidence&lt;/td>
&lt;td style="text-align:left">Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$&amp;lt; 0.50$&lt;/td>
&lt;td style="text-align:left">Weak/no evidence (fragile)&lt;/td>
&lt;td style="text-align:left">Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We will use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our threshold for &amp;ldquo;robust&amp;rdquo; throughout this tutorial.&lt;/p>
&lt;h3 id="63-posterior-mean">6.3 Posterior mean&lt;/h3>
&lt;p>Once we know which variables matter, we want to know &lt;em>how much&lt;/em> they matter. The &lt;strong>posterior mean&lt;/strong> of coefficient $j$ is:&lt;/p>
&lt;p>$$
E[\beta_j | y] = \sum_{k=1}^{2^K} \hat{\beta}_{j,k} \cdot P(M_k | y)
$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the estimated coefficient of variable $j$ in model $k$ (and zero if $j$ is not in model $k$). This is a weighted average of the coefficient across all models. Variables with high PIPs get posterior means close to their &amp;ldquo;full model&amp;rdquo; estimates; variables with low PIPs get posterior means shrunk toward zero.&lt;/p>
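&lt;p>A tiny numerical illustration (hypothetical probabilities and estimates, not taken from our data) shows how the weighting works:&lt;/p>
&lt;pre>&lt;code class="language-r"># Suppose variable j appears in two model groups and is excluded elsewhere
post_prob &amp;lt;- c(M_a = 0.85, M_b = 0.10, rest = 0.05) # P(M | y), sums to 1
beta_j &amp;lt;- c(M_a = 0.90, M_b = 1.10, rest = 0.00) # 0 when j is excluded
sum(post_prob * beta_j) # posterior mean: 0.875
sum(post_prob[beta_j != 0]) # PIP of variable j: 0.95
&lt;/code>&lt;/pre>
&lt;p>Because excluded models contribute zeros to the average, a low-PIP variable is automatically shrunk toward zero.&lt;/p>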
&lt;h2 id="7-toy-example-----bma-on-3-variables">7. Toy Example &amp;mdash; BMA on 3 Variables&lt;/h2>
&lt;p>Before running BMA on all 12 variables, let us work through a small example by hand. We pick just 3 variables: &lt;strong>log_gdp&lt;/strong> and &lt;strong>fossil_fuel&lt;/strong> (true predictors) and &lt;strong>log_trade&lt;/strong> (noise). With 3 variables, each can be either IN or OUT of the model, giving us $2^3 = 8$ possible models &amp;mdash; small enough to examine every single one.&lt;/p>
&lt;p>Here are all 8 models written out explicitly:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Model&lt;/th>
&lt;th style="text-align:left">Formula&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$M_1$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ 1 (intercept only)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_2$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_3$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_4$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_5$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_6$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_7$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_8$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="71-step-1-----fit-every-model-and-compute-bic">7.1 Step 1 &amp;mdash; Fit every model and compute BIC&lt;/h3>
&lt;p>We fit each of the 8 models using OLS and compute its BIC score. Remember: &lt;strong>lower BIC = better&lt;/strong> (the model explains the data well without unnecessary complexity).&lt;/p>
&lt;pre>&lt;code class="language-r"># Select our 3 variables
toy_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, fossil_fuel, log_trade)
# Write out all 8 model formulas explicitly
model_formulas &amp;lt;- c(
&amp;quot;log_co2 ~ 1&amp;quot;, # M1: intercept only
&amp;quot;log_co2 ~ log_gdp&amp;quot;, # M2
&amp;quot;log_co2 ~ fossil_fuel&amp;quot;, # M3
&amp;quot;log_co2 ~ log_trade&amp;quot;, # M4
&amp;quot;log_co2 ~ log_gdp + fossil_fuel&amp;quot;, # M5
&amp;quot;log_co2 ~ log_gdp + log_trade&amp;quot;, # M6
&amp;quot;log_co2 ~ fossil_fuel + log_trade&amp;quot;, # M7
&amp;quot;log_co2 ~ log_gdp + fossil_fuel + log_trade&amp;quot; # M8
)
# Fit each model and extract its BIC
bic_values &amp;lt;- sapply(model_formulas, function(f) {
BIC(lm(as.formula(f), data = toy_data))
})
# Organize results in a table
toy_results &amp;lt;- tibble(
model = paste0(&amp;quot;M&amp;quot;, 1:8),
formula = model_formulas,
bic = round(bic_values, 1)
) |&amp;gt;
arrange(bic)
print(toy_results)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model formula bic
M5 log_co2 ~ log_gdp + fossil_fuel 114.1
M8 log_co2 ~ log_gdp + fossil_fuel + log_trade 118.5
M2 log_co2 ~ log_gdp 120.7
M6 log_co2 ~ log_gdp + log_trade 125.4
M3 log_co2 ~ fossil_fuel 514.4
M7 log_co2 ~ fossil_fuel + log_trade 519.0
M1 log_co2 ~ 1 528.3
M4 log_co2 ~ log_trade 533.0
&lt;/code>&lt;/pre>
&lt;p>The winner is $M_5$ (log_gdp + fossil_fuel) with BIC = 114.1 &amp;mdash; exactly the two true predictors, no noise. The runner-up $M_8$ adds log_trade but its BIC is worse (118.5), meaning the extra variable does not improve the fit enough to justify the added complexity. Models without GDP ($M_1$, $M_3$, $M_4$, $M_7$) have dramatically worse BIC scores, confirming GDP&amp;rsquo;s dominant role.&lt;/p>
&lt;h3 id="72-step-2-----convert-bic-to-posterior-probabilities">7.2 Step 2 &amp;mdash; Convert BIC to posterior probabilities&lt;/h3>
&lt;p>Now we turn each BIC into a posterior model probability. The formula is:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp(-0.5 \cdot \text{BIC}_k)}{\sum_{l=1}^{8} \exp(-0.5 \cdot \text{BIC}_l)}
$$&lt;/p>
&lt;p>Because the BIC values can be very large, we work with &lt;strong>differences from the best model&lt;/strong> to avoid numerical overflow. Subtracting the minimum BIC from all values does not change the probabilities:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp\bigl(-0.5 \cdot (\text{BIC}_k - \text{BIC}_{\min})\bigr)}{\sum_{l=1}^{8} \exp\bigl(-0.5 \cdot (\text{BIC}_l - \text{BIC}_{\min})\bigr)}
$$&lt;/p>
&lt;p>Let us plug in the numbers. The best model ($M_5$) has BIC = 114.1, so $\Delta_5 = 0$. The runner-up ($M_8$) has $\Delta_8 = 118.5 - 114.1 = 4.4$:&lt;/p>
&lt;p>$$
w_5 = \exp(-0.5 \times 0) = 1.000, \quad w_8 = \exp(-0.5 \times 4.4) = 0.111
$$&lt;/p>
&lt;p>The remaining models have much larger $\Delta$ values, so their weights are essentially zero. After normalizing by the sum of all weights ($1.000 + 0.111 + 0.037 + \ldots \approx 1.151$):&lt;/p>
&lt;p>$$
P(M_5 | y) = \frac{1.000}{1.151} = 0.869, \quad P(M_8 | y) = \frac{0.111}{1.151} = 0.096
$$&lt;/p>
&lt;pre>&lt;code class="language-r"># Convert BIC to posterior probabilities using the delta-BIC trick
toy_results &amp;lt;- toy_results |&amp;gt;
mutate(
delta_bic = bic - min(bic), # difference from best
weight = exp(-0.5 * delta_bic), # unnormalized weight
post_prob = round(weight / sum(weight), 4) # normalize to sum to 1
)
toy_results |&amp;gt; select(model, bic, delta_bic, weight, post_prob)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model bic delta_bic weight post_prob
M5 114.1 0.0 1.0000 0.8687
M8 118.5 4.4 0.1108 0.0962
M2 120.7 6.6 0.0369 0.0320
M6 125.4 11.3 0.0035 0.0031
M3 514.4 400.3 0.0000 0.0000
M7 519.0 404.9 0.0000 0.0000
M1 528.3 414.2 0.0000 0.0000
M4 533.0 418.9 0.0000 0.0000
&lt;/code>&lt;/pre>
&lt;p>One model dominates: $M_5$ captures 86.9% of the posterior probability &amp;mdash; exactly the two true predictors. The runner-up $M_8$ (adding log_trade) gets only 9.6%, and $M_2$ (GDP alone) gets 3.2%. The remaining 5 models share less than 0.4% of the total weight. BMA&amp;rsquo;s Occam&amp;rsquo;s razor is at work: adding log_trade to the model ($M_8$) does not improve the fit enough to overcome the complexity penalty, so the simpler model ($M_5$) wins decisively.&lt;/p>
&lt;h3 id="73-step-3-----compute-posterior-inclusion-probabilities">7.3 Step 3 &amp;mdash; Compute Posterior Inclusion Probabilities&lt;/h3>
&lt;p>Finally, we compute the PIP of each variable by summing the posterior probabilities of all models that include it. For example, log_trade appears in models $M_4$, $M_6$, $M_7$, and $M_8$, so:&lt;/p>
&lt;p>$$
\text{PIP}_{\text{log_trade}} = P(M_4 | y) + P(M_6 | y) + P(M_7 | y) + P(M_8 | y) = 0.000 + 0.003 + 0.000 + 0.096 = 0.099
$$&lt;/p>
&lt;p>That is well below the 0.50 threshold &amp;mdash; fragile evidence, exactly what we expect for a noise variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute PIPs: for each variable, sum P(M|y) across models that include it
pip_toy &amp;lt;- tibble(
variable = c(&amp;quot;log_gdp&amp;quot;, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;log_trade&amp;quot;),
true_effect = c(&amp;quot;True&amp;quot;, &amp;quot;True&amp;quot;, &amp;quot;Noise&amp;quot;),
pip = c(
# log_gdp appears in M2, M5, M6, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M2&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M8&amp;quot;)]),
# fossil_fuel appears in M3, M5, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M3&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)]),
# log_trade appears in M4, M6, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M4&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)])
)
)
print(pip_toy)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_effect pip
log_gdp True 1.000
fossil_fuel True 0.965
log_trade Noise 0.099
&lt;/code>&lt;/pre>
&lt;p>Even with this simple 3-variable example, BMA correctly identifies the two true predictors. GDP has a PIP of 1.000 (decisive evidence) and fossil_fuel has a PIP of 0.965 (robust) &amp;mdash; they appear in every high-probability model. Log_trade has a PIP of only 0.099 (fragile) &amp;mdash; well below the 0.50 threshold. BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor penalizes models that include noise variables without substantially improving the fit.&lt;/p>
&lt;h2 id="8-bma-on-all-12-variables">8. BMA on All 12 Variables&lt;/h2>
&lt;h3 id="81-running-bma">8.1 Running BMA&lt;/h3>
&lt;p>Now we apply BMA to the full dataset with all 12 candidate regressors using the &lt;code>BMS&lt;/code> package. With 12 regressors the model space contains only $2^{12} = 4,096$ candidates &amp;mdash; small enough that the MCMC sampler can explore essentially the entire space.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for MCMC sampling
# Prepare the data matrix: DV in first column, regressors follow
bma_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, industry, fossil_fuel, urban_pop,
democracy, trade_network, agriculture,
log_trade, fdi, corruption, log_tourism, log_credit) |&amp;gt;
as.data.frame()
# Run BMA
bma_fit &amp;lt;- bms(
X.data = bma_data, # data with DV in column 1
burn = 50000, # burn-in iterations
iter = 200000, # post-burn-in iterations
g = &amp;quot;BRIC&amp;quot;, # BRIC g-prior (robust default)
mprior = &amp;quot;uniform&amp;quot;, # uniform model prior
nmodel = 2000, # store top 2000 models
mcmc = &amp;quot;bd&amp;quot;, # birth-death MCMC sampler
user.int = FALSE # suppress interactive output
)
&lt;/code>&lt;/pre>
&lt;p>The key parameters deserve explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>burn = 50,000&lt;/strong>: the first 50,000 MCMC draws are discarded as &amp;ldquo;burn-in&amp;rdquo; to ensure the sampler has converged to the posterior distribution&lt;/li>
&lt;li>&lt;strong>iter = 200,000&lt;/strong>: the next 200,000 draws are used for inference&lt;/li>
&lt;li>&lt;strong>g = &amp;ldquo;BRIC&amp;rdquo;&lt;/strong>: the Benchmark Risk Inflation Criterion prior on the regression coefficients, a robust default choice&lt;/li>
&lt;li>&lt;strong>mprior = &amp;ldquo;uniform&amp;rdquo;&lt;/strong>: every model is equally likely a priori, so the posterior is driven entirely by the data&lt;/li>
&lt;/ul>
&lt;h3 id="82-pip-bar-chart">8.2 PIP bar chart&lt;/h3>
&lt;p>The PIP bar chart classifies each variable as robust (PIP $\geq$ 0.80), borderline (0.50&amp;ndash;0.80), or fragile (PIP $&amp;lt;$ 0.50). This visualization makes it easy to see which variables earn strong support across the model space and which are effectively irrelevant.&lt;/p>
&lt;pre>&lt;code class="language-r"># Answer key from the table in Section 3, used to annotate the BMA output
true_beta_lookup &amp;lt;- c(log_gdp = 1.200, trade_network = 0.500,
                      fossil_fuel = 0.012, urban_pop = 0.010,
                      industry = 0.008, agriculture = 0.005,
                      democracy = 0.004, log_trade = 0, fdi = 0,
                      corruption = 0, log_tourism = 0, log_credit = 0)
# Extract PIPs and posterior means
bma_coefs &amp;lt;- coef(bma_fit)
bma_df &amp;lt;- as.data.frame(bma_coefs) |&amp;gt;
rownames_to_column(&amp;quot;variable&amp;quot;) |&amp;gt;
as_tibble() |&amp;gt;
rename(pip = PIP, post_mean = `Post Mean`, post_sd = `Post SD`) |&amp;gt;
select(variable, pip, post_mean, post_sd) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
robustness = case_when(
pip &amp;gt;= 0.80 ~ &amp;quot;Robust (PIP &amp;gt;= 0.80)&amp;quot;,
pip &amp;gt;= 0.50 ~ &amp;quot;Borderline&amp;quot;,
TRUE ~ &amp;quot;Fragile (PIP &amp;lt; 0.50)&amp;quot;
),
ci_low = post_mean - 2 * post_sd,
ci_high = post_mean + 2 * post_sd
)
# Plot PIPs
ggplot(bma_df, aes(x = reorder(variable, pip), y = pip, fill = robustness)) +
geom_col(width = 0.65) +
geom_hline(yintercept = 0.80, linetype = &amp;quot;dashed&amp;quot;) +
coord_flip() +
labs(x = NULL, y = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_04_bma_pip.png" alt="BMA Posterior Inclusion Probabilities. Green bars indicate robust variables with PIP greater than or equal to 0.80; teal bars indicate borderline variables; orange bars indicate fragile variables with PIP less than 0.50.">&lt;/p>
&lt;p>The PIP bar chart reveals a clear separation between signal and noise. GDP dominates with a PIP of 1.00, followed by trade_network (0.986), fossil_fuel (0.948), and industry (0.841) &amp;mdash; all with PIPs above the 0.80 robustness threshold. The noise variables (log_trade, fdi, corruption, log_tourism, log_credit) all have PIPs well below 0.15, confirming that BMA correctly classifies them as fragile. Urban_pop ($\beta = 0.010$, PIP = 0.648) and democracy ($\beta = 0.004$, PIP = 0.607) land in the borderline range &amp;mdash; true predictors whose effects are moderate enough that BMA hedges between including and excluding them. Agriculture ($\beta = 0.005$, PIP = 0.087) is classified as fragile, an honest reflection of the sample&amp;rsquo;s limited power to detect its very small effect.&lt;/p>
&lt;h3 id="83-posterior-coefficient-plot">8.3 Posterior coefficient plot&lt;/h3>
&lt;p>Beyond knowing &lt;em>which&lt;/em> variables matter, we want to know &lt;em>how much&lt;/em> they matter and how precisely they are estimated. The posterior coefficient plot displays the BMA-estimated effect size for each variable along with approximate 95% credible intervals (posterior mean $\pm$ 2 posterior standard deviations).&lt;/p>
&lt;pre>&lt;code class="language-r"># Coefficient plot with 95% credible intervals
ggplot(bma_df, aes(x = reorder(variable, pip), y = post_mean, color = robustness)) +
geom_pointrange(aes(ymin = ci_low, ymax = ci_high)) +
geom_hline(yintercept = 0, linetype = &amp;quot;solid&amp;quot;, color = &amp;quot;gray50&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_05_bma_coefs.png" alt="BMA posterior mean coefficients with approximate 95 percent credible intervals. Variables ordered by PIP. Robust variables have intervals that do not cross zero.">&lt;/p>
&lt;p>The posterior coefficient plot shows the BMA-estimated effect sizes with uncertainty bands. GDP&amp;rsquo;s posterior mean of approximately 1.19 closely recovers the true value of 1.200, and its 95% credible interval is narrow, reflecting high precision. Trade_network has a posterior mean of 0.87, overshooting its true value of 0.500 &amp;mdash; but its wide credible interval honestly reflects substantial estimation uncertainty. The noise variables and low-PIP variables like agriculture have posterior means shrunk very close to zero &amp;mdash; this is BMA&amp;rsquo;s shrinkage at work. Variables with low PIPs appear in few high-probability models, so their posterior means are averaged with many models where the coefficient is zero, pulling the estimate toward zero.&lt;/p>
&lt;h3 id="84-variable-inclusion-map">8.4 Variable-inclusion map&lt;/h3>
&lt;p>The variable-inclusion map shows &lt;em>which&lt;/em> variables appear in the highest-probability models and whether their coefficients are positive or negative. Unlike a simple heatmap, the &lt;strong>width of each column is proportional to the model&amp;rsquo;s posterior probability&lt;/strong> &amp;mdash; so wide columns represent models that the data strongly support. The x-axis shows cumulative posterior model probability: if the first model has PMP = 0.15, it occupies the region from 0 to 0.15; the second model fills from 0.15 to 0.15 + its PMP, and so on. A solid band of color stretching across most of the x-axis means the variable appears in virtually every high-probability model.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract top 100 models and their coefficient estimates
top_coefs &amp;lt;- topmodels.bma(bma_fit)
n_top &amp;lt;- min(100, ncol(top_coefs))
top_coefs &amp;lt;- top_coefs[, 1:n_top]
# Extract posterior model probabilities (MCMC-based)
model_pmps &amp;lt;- pmp.bma(bma_fit)[1:n_top, 1]
# Cumulative x positions: each model's width = its PMP
cum_pmp &amp;lt;- c(0, cumsum(model_pmps))
# Order variables by PIP (highest at top)
var_order &amp;lt;- bma_df |&amp;gt; arrange(desc(pip)) |&amp;gt; pull(variable)
# Build rectangle data for every variable × model combination
rect_data &amp;lt;- expand.grid(
var_idx = seq_len(nrow(top_coefs)),
model_idx = seq_len(n_top)
) |&amp;gt;
mutate(
variable = rownames(top_coefs)[var_idx],
coef_value = mapply(function(v, m) top_coefs[v, m], var_idx, model_idx),
sign = case_when(
coef_value &amp;gt; 0 ~ &amp;quot;Positive&amp;quot;,
coef_value &amp;lt; 0 ~ &amp;quot;Negative&amp;quot;,
TRUE ~ &amp;quot;Not included&amp;quot;
),
xmin = cum_pmp[model_idx],
xmax = cum_pmp[model_idx + 1],
variable = factor(variable, levels = rev(var_order))
)
# Plot the variable-inclusion map
ggplot(rect_data, aes(xmin = xmin, xmax = xmax,
ymin = as.numeric(variable) - 0.45,
ymax = as.numeric(variable) + 0.45,
fill = sign)) +
geom_rect() +
scale_fill_manual(
name = &amp;quot;Coefficient&amp;quot;,
values = c(&amp;quot;Positive&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Negative&amp;quot; = &amp;quot;#d97757&amp;quot;,
&amp;quot;Not included&amp;quot; = &amp;quot;#d0cdc8&amp;quot;)
) +
scale_x_continuous(expand = c(0, 0),
labels = scales::label_number(accuracy = 0.1)) +
scale_y_continuous(breaks = seq_along(var_order),
labels = rev(var_order),
expand = c(0, 0)) +
labs(title = &amp;quot;Variable-Inclusion Map&amp;quot;,
subtitle = paste0(&amp;quot;Top &amp;quot;, n_top, &amp;quot; models shown out of &amp;quot;,
nrow(pmp.bma(bma_fit)), &amp;quot; visited&amp;quot;),
x = &amp;quot;Cumulative posterior model probability&amp;quot;,
y = NULL)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_06_bma_inclusion.png" alt="Variable-inclusion map showing the top 100 BMA models. The x-axis is cumulative posterior model probability, so wider columns represent more probable models. Blue indicates a positive coefficient, orange indicates a negative coefficient, and gray indicates the variable is not included. Variables are ordered by PIP from top to bottom.">&lt;/p>
&lt;p>The variable-inclusion map reveals clear structure. The top variables &amp;mdash; log_gdp, trade_network, fossil_fuel, and industry &amp;mdash; form solid blue bands stretching across nearly the entire x-axis, meaning they appear with positive coefficients in virtually every high-probability model. Urban_pop and democracy also show substantial inclusion, consistent with their borderline PIPs. In contrast, the noise variables (log_trade, fdi, corruption, log_tourism, log_credit) appear as mostly gray with occasional patches of blue or orange, indicating they enter and exit models sporadically and sometimes with the wrong sign. The fact that noise variables occasionally appear with negative coefficients (orange patches) is another sign of fragility &amp;mdash; their coefficient estimates are unstable because they have no true effect.&lt;/p>
&lt;h3 id="85-bma-results-vs-known-truth">8.5 BMA results vs. known truth&lt;/h3>
&lt;pre>&lt;code class="language-r"># Compare BMA results with the true DGP
bma_summary &amp;lt;- bma_df |&amp;gt;
mutate(
bma_robust = pip &amp;gt;= 0.80,
true_nonzero = true_beta != 0,
correct = bma_robust == true_nonzero
) |&amp;gt;
select(variable, true_beta, pip, post_mean, bma_robust, true_nonzero, correct)
print(bma_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta pip post_mean bma_robust true_nonzero correct
log_gdp 1.200 1.000 1.1854 TRUE TRUE TRUE
trade_network 0.500 0.986 0.8727 TRUE TRUE TRUE
fossil_fuel 0.012 0.948 0.0117 TRUE TRUE TRUE
industry 0.008 0.841 0.0142 TRUE TRUE TRUE
urban_pop 0.010 0.648 0.0049 FALSE TRUE FALSE
democracy 0.004 0.607 0.0066 FALSE TRUE FALSE
log_tourism 0.000 0.130 -0.0039 FALSE FALSE TRUE
log_credit 0.000 0.104 0.0051 FALSE FALSE TRUE
agriculture 0.005 0.087 -0.0002 FALSE TRUE FALSE
log_trade 0.000 0.084 -0.0037 FALSE FALSE TRUE
corruption 0.000 0.078 0.0026 FALSE FALSE TRUE
fdi 0.000 0.077 -0.0000 FALSE FALSE TRUE
&lt;/code>&lt;/pre>
&lt;p>BMA correctly classifies 9 of 12 variables. The four strongest true predictors (GDP, trade_network, fossil_fuel, industry) all receive PIPs above 0.80 &amp;mdash; these are the &amp;ldquo;robust&amp;rdquo; determinants. All five noise variables receive PIPs below 0.15 &amp;mdash; correctly identified as fragile. Urban_pop (PIP = 0.648) and democracy (PIP = 0.607) fall in the borderline range &amp;mdash; they are true predictors, but BMA&amp;rsquo;s conservative Occam&amp;rsquo;s razor hedges because their effects are moderate. Agriculture ($\beta = 0.005$, PIP = 0.087) is missed entirely. This reveals an important nuance: BMA prioritizes precision over sensitivity. It would rather miss a small true effect than falsely include a noise variable.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> BMA on all 12 variables correctly gives high PIPs to the strong true predictors (GDP, trade network, fossil fuel, industry) and low PIPs to the noise variables. Variables with moderate or small true effects may land in the borderline zone. The variable-inclusion map shows that the top models consistently include the core predictors.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #d97757 0%, #d97757 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 2: LASSO
&lt;/div>
&lt;h2 id="9-regularization-----adding-a-penalty">9. Regularization &amp;mdash; Adding a Penalty&lt;/h2>
&lt;h3 id="91-the-bias-variance-tradeoff">9.1 The bias-variance tradeoff&lt;/h3>
&lt;p>OLS is an &lt;strong>unbiased&lt;/strong> estimator &amp;mdash; on average, it gets the coefficients right. But with many correlated regressors, OLS coefficients have &lt;strong>high variance&lt;/strong>: they bounce around from sample to sample. Adding or removing a single variable can drastically change the estimates.&lt;/p>
&lt;p>The key insight of regularization is that a &lt;strong>little bias can buy a lot of variance reduction&lt;/strong>, lowering the overall prediction error. The &lt;strong>total error&lt;/strong> of a prediction decomposes as:&lt;/p>
&lt;p>$$
\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}
$$&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_02_bias_variance.png" alt="The bias-variance tradeoff. As model complexity increases (more variables, less regularization), bias decreases but variance increases. The optimal point is a compromise between the two, minimizing total MSE.">&lt;/p>
&lt;p>The figure illustrates the fundamental tradeoff. At low complexity (strong regularization), bias is high but variance is low. At high complexity (weak or no regularization, like OLS), bias is near zero but variance explodes. The optimal point lies in between &amp;mdash; this is exactly where regularized methods like LASSO operate. Think of the penalty as a &amp;ldquo;budget constraint&amp;rdquo; on coefficient sizes: variables that do not contribute enough to prediction are not worth the cost, so their coefficients are set to zero.&lt;/p>
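The decomposition can be seen directly in a small base-R simulation. This is an illustrative sketch, not part of the original analysis: we estimate a single slope many times, once by plain OLS and once after multiplying by a hypothetical shrinkage factor (0.85 is an arbitrary choice), and compare mean squared errors across samples.

```r
set.seed(42)
beta_true <- 1       # true slope
n <- 30              # observations per sample
reps <- 5000         # number of simulated samples
shrink <- 0.85       # hypothetical shrinkage factor (adds bias, cuts variance)

ols <- numeric(reps)
shrunk <- numeric(reps)
for (r in seq_len(reps)) {
  x <- rnorm(n)
  y <- beta_true * x + rnorm(n, sd = 2)
  b <- sum(x * y) / sum(x^2)   # OLS slope (no-intercept regression)
  ols[r] <- b
  shrunk[r] <- shrink * b      # biased toward zero, but lower variance
}

mse <- function(est) mean((est - beta_true)^2)
# The shrunk estimator trades a little squared bias for a larger
# reduction in variance, lowering total MSE relative to OLS
c(ols = mse(ols), shrunk = mse(shrunk))
```

The shrunk estimator wins on MSE despite being biased, which is precisely the mechanism LASSO exploits on a larger scale.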
&lt;h2 id="10-l1-vs-l2-geometry">10. L1 vs. L2 Geometry&lt;/h2>
&lt;h3 id="101-the-lasso-l1-penalty">10.1 The LASSO (L1) penalty&lt;/h3>
&lt;p>The LASSO solves the following optimization problem:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{LASSO}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\frac{1}{2n}\|y - X\beta\|^2$ is the &lt;strong>sum of squared residuals&lt;/strong> (the usual OLS loss, scaled)&lt;/li>
&lt;li>$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the &lt;strong>L1 norm&lt;/strong> (sum of absolute values)&lt;/li>
&lt;li>$\lambda \geq 0$ is the &lt;strong>regularization parameter&lt;/strong>: it controls how much we penalize large coefficients. When $\lambda = 0$, LASSO reduces to OLS. As $\lambda \to \infty$, all coefficients are shrunk to zero.&lt;/li>
&lt;/ul>
&lt;h3 id="102-the-ridge-l2-penalty">10.2 The Ridge (L2) penalty&lt;/h3>
&lt;p>For comparison, &lt;strong>Ridge regression&lt;/strong> uses the L2 norm instead:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_2^2
$$&lt;/p>
&lt;p>where $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the sum of squared coefficients.&lt;/p>
&lt;h3 id="103-why-lasso-selects-variables-and-ridge-does-not">10.3 Why LASSO selects variables and Ridge does not&lt;/h3>
&lt;p>The geometric explanation is one of the most elegant ideas in modern statistics. The constraint region for LASSO (L1) is a &lt;strong>diamond&lt;/strong>, while the constraint region for Ridge (L2) is a &lt;strong>circle&lt;/strong>. When the elliptical OLS contours meet the diamond, they typically hit a &lt;strong>corner&lt;/strong>, where one or more coefficients are exactly zero. When they meet the circle, they hit a smooth curve &amp;mdash; coefficients are shrunk but never exactly zero.&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_03_l1_l2_geometry.png" alt="Side-by-side comparison of L1 and L2 constraint geometry. Left panel shows the LASSO diamond where OLS contours hit a corner, setting beta-1 to exactly zero. Right panel shows the Ridge circle where contours hit a smooth boundary, producing no exact zeros.">&lt;/p>
&lt;p>The key insight: &lt;strong>the L1 diamond has corners where coefficients are exactly zero &amp;mdash; this is why LASSO selects variables.&lt;/strong> The L2 circle has no corners, so Ridge shrinks coefficients toward zero but never reaches it. LASSO performs &lt;em>simultaneous estimation and variable selection&lt;/em>; Ridge only estimates.&lt;/p>
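The corner-versus-circle picture has an exact algebraic counterpart in the single-coefficient case with a standardized regressor. As a minimal sketch of that simplified setting (the helper names are mine), the LASSO solution is the &lt;em>soft-threshold&lt;/em> of the OLS estimate, which lands exactly on zero for small coefficients, while the Ridge solution merely rescales the OLS estimate and never reaches zero:

```r
# Closed-form solutions for one coefficient with a standardized regressor:
#   LASSO: soft-threshold the OLS estimate by lambda
#   Ridge: rescale the OLS estimate by 1 / (1 + lambda)
soft_threshold <- function(b_ols, lambda) sign(b_ols) * pmax(abs(b_ols) - lambda, 0)
ridge_shrink   <- function(b_ols, lambda) b_ols / (1 + lambda)

b_ols  <- c(1.20, 0.50, 0.01, -0.02)  # hypothetical OLS estimates
lambda <- 0.10

soft_threshold(b_ols, lambda)  # small coefficients are set to exactly zero
ridge_shrink(b_ols, lambda)    # every coefficient shrinks, none reaches zero
```

The two small OLS estimates fall inside the threshold and are zeroed out by the soft-threshold operator, while Ridge leaves all four coefficients nonzero.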
&lt;h2 id="11-lasso-on-all-12-variables">11. LASSO on All 12 Variables&lt;/h2>
&lt;h3 id="111-running-lasso-with-cross-validation">11.1 Running LASSO with cross-validation&lt;/h3>
&lt;p>The LASSO has one tuning parameter: $\lambda$, which controls the strength of the penalty. Too small and we include noise; too large and we exclude true predictors. We choose $\lambda$ using &lt;strong>10-fold cross-validation&lt;/strong>: split the data into 10 folds, train on 9, predict the 10th, and repeat. The $\lambda$ that minimizes the average prediction error across folds is called &lt;strong>lambda.min&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for cross-validation folds
# Prepare the design matrix X and response vector y
X &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y &amp;lt;- synth_data$log_co2
# Run LASSO (alpha = 1) with 10-fold cross-validation
lasso_cv &amp;lt;- cv.glmnet(
x = X,
y = y,
alpha = 1, # alpha=1 is LASSO (alpha=0 is Ridge)
nfolds = 10,
standardize = TRUE # standardize predictors internally
)
&lt;/code>&lt;/pre>
&lt;h3 id="112-regularization-path">11.2 Regularization path&lt;/h3>
&lt;pre>&lt;code class="language-r"># Fit the full LASSO path
lasso_full &amp;lt;- glmnet(X, y, alpha = 1, standardize = TRUE)
# Plot coefficient paths
ggplot(path_df, aes(x = log_lambda, y = coefficient, color = variable)) +
geom_line() +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.1se), linetype = &amp;quot;dotted&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_07_lasso_path.png" alt="LASSO regularization path showing how each variable&amp;rsquo;s coefficient changes as the penalty lambda increases from left to right. Steel blue lines represent true predictors, orange lines represent noise variables. GDP (the strongest predictor) is the last to be shrunk to zero.">&lt;/p>
&lt;p>The regularization path reveals the story of LASSO variable selection. Reading from left to right (increasing penalty), the noise variables (orange lines) are the first to be driven to zero &amp;mdash; they provide too little predictive value to justify their &amp;ldquo;cost&amp;rdquo; under the penalty. GDP (the strongest predictor with $\beta = 1.200$) persists the longest, requiring the largest penalty to be eliminated. The vertical lines mark lambda.min (minimum CV error) and lambda.1se (most parsimonious model within 1 SE of the minimum). The gap between them represents the tension between fitting the data well and keeping the model simple.&lt;/p>
&lt;h3 id="113-cross-validation-curve">11.3 Cross-validation curve&lt;/h3>
&lt;pre>&lt;code class="language-r"># Plot the CV curve
ggplot(cv_df, aes(x = log_lambda, y = mse)) +
geom_ribbon(aes(ymin = mse_lo, ymax = mse_hi), fill = &amp;quot;gray85&amp;quot;, alpha = 0.5) +
geom_line(color = &amp;quot;#6a9bcc&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_08_lasso_cv.png" alt="Ten-fold cross-validation curve for LASSO. The left dashed line marks lambda.min (minimum CV error); the right dotted line marks lambda.1se (most parsimonious model within 1 standard error of the minimum). The shaded band shows plus or minus 1 standard error.">&lt;/p>
&lt;p>The cross-validation curve shows how prediction error varies with the penalty strength. The curve has a characteristic U-shape: too little penalty (left) allows overfitting (high error from variance), while too much penalty (right) underfits (high error from bias). The &amp;ldquo;1 standard error rule&amp;rdquo; is a common default: since CV error estimates are noisy, any model within 1 SE of the best is statistically indistinguishable from the best. We prefer the simpler one (lambda.1se).&lt;/p>
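The 1-SE rule itself takes only a few lines. The sketch below uses hypothetical CV error values rather than the `lasso_cv` object so that the logic is explicit; `cv.glmnet` performs the equivalent computation internally when it reports `lambda.1se`.

```r
# Minimal sketch of the 1-SE rule with hypothetical CV results:
# lambda values (decreasing), mean CV error, and its standard error
lambda <- c(0.50, 0.20, 0.10, 0.05, 0.02, 0.01)
cvm    <- c(0.90, 0.60, 0.45, 0.42, 0.43, 0.44)
cvsd   <- c(0.05, 0.04, 0.03, 0.03, 0.03, 0.03)

i_min      <- which.min(cvm)                  # index of lambda.min
threshold  <- cvm[i_min] + cvsd[i_min]        # "within 1 SE of the best"
lambda_1se <- max(lambda[cvm <= threshold])   # simplest model that qualifies

c(lambda_min = lambda[i_min], lambda_1se = lambda_1se)
# lambda_1se (0.10) applies a stronger penalty than lambda.min (0.05),
# yielding a sparser model whose CV error is statistically indistinguishable
```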
&lt;h3 id="114-selected-variables">11.4 Selected variables&lt;/h3>
&lt;pre>&lt;code class="language-r"># Extract LASSO coefficients at lambda.1se
lasso_coefs_1se &amp;lt;- coef(lasso_cv, s = &amp;quot;lambda.1se&amp;quot;)
lasso_df &amp;lt;- tibble(
variable = rownames(lasso_coefs_1se)[-1],
lasso_coef = as.numeric(lasso_coefs_1se)[-1]
) |&amp;gt;
mutate(
selected = lasso_coef != 0,
true_beta = true_beta_lookup[variable],
is_noise = true_beta == 0,
bar_color = case_when(
!selected ~ &amp;quot;Not selected&amp;quot;,
is_noise ~ &amp;quot;Noise (false positive)&amp;quot;,
TRUE ~ &amp;quot;True predictor (correct)&amp;quot;
)
)
# Plot selected variables
ggplot(lasso_df, aes(x = reorder(variable, abs(lasso_coef)), y = lasso_coef, fill = bar_color)) +
geom_col(width = 0.6) + coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_09_lasso_selected.png" alt="LASSO-selected variables at lambda.1se. Steel blue bars indicate true predictors correctly retained; orange bars indicate noise variables falsely included (if any). Gray bars show variables not selected.">&lt;/p>
&lt;p>At lambda.1se, LASSO selects a sparse subset of the 12 candidate variables. The selected variables are shown with colored bars: steel blue for true predictors correctly retained, orange for any noise variables falsely included. Variables with zero coefficients (gray) have been excluded by the LASSO penalty. The key question is: did LASSO keep the right variables and drop the right ones?&lt;/p>
&lt;h2 id="12-post-lasso">12. Post-LASSO&lt;/h2>
&lt;p>LASSO coefficients are &lt;strong>biased&lt;/strong> because the L1 penalty shrinks them toward zero. The selected variables are correct (we hope), but the coefficient values are too small. This is by design &amp;mdash; the penalty trades bias for variance reduction &amp;mdash; but for &lt;em>interpretation&lt;/em> we want unbiased estimates.&lt;/p>
&lt;p>The fix is simple: &lt;strong>Post-LASSO&lt;/strong> (Belloni and Chernozhukov, 2013). Run OLS using only the variables that LASSO selected. The LASSO does the selection; OLS does the estimation.&lt;/p>
&lt;pre>&lt;code class="language-r"># Identify which variables LASSO selected at lambda.1se
selected_vars &amp;lt;- lasso_df |&amp;gt; filter(selected) |&amp;gt; pull(variable)
# Build the Post-LASSO formula
post_lasso_formula &amp;lt;- as.formula(
paste(&amp;quot;log_co2 ~&amp;quot;, paste(selected_vars, collapse = &amp;quot; + &amp;quot;))
)
# Run OLS on the selected variables only
post_lasso_fit &amp;lt;- lm(post_lasso_formula, data = synth_data)
# Compare: LASSO vs Post-LASSO vs True coefficients
post_lasso_summary &amp;lt;- broom::tidy(post_lasso_fit) |&amp;gt;
filter(term != &amp;quot;(Intercept)&amp;quot;) |&amp;gt;
rename(variable = term, post_lasso_coef = estimate) |&amp;gt;
select(variable, post_lasso_coef) |&amp;gt;
left_join(lasso_df |&amp;gt; select(variable, lasso_coef, true_beta), by = &amp;quot;variable&amp;quot;)
print(post_lasso_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable lasso_coef post_lasso_coef true_beta
log_gdp 1.1899 1.1646 1.200
industry 0.0090 0.0176 0.008
fossil_fuel 0.0072 0.0118 0.012
urban_pop 0.0041 0.0078 0.010
democracy 0.0046 0.0113 0.004
trade_network 0.6309 0.8978 0.500
&lt;/code>&lt;/pre>
&lt;p>Notice how the Post-LASSO coefficients are closer to the true values than the raw LASSO coefficients. For example, fossil_fuel&amp;rsquo;s LASSO coefficient is 0.007 (shrunk from the true 0.012), but the Post-LASSO estimate is 0.012 &amp;mdash; recovering the truth almost exactly. Similarly, urban_pop recovers from 0.004 (LASSO) to 0.008 (Post-LASSO), closer to the true value of 0.010. Trade_network&amp;rsquo;s Post-LASSO estimate (0.898) overshoots the true value (0.500), reflecting the difficulty of precisely estimating a coefficient on a low-variance variable. The LASSO selected the right variables; Post-LASSO recovered unbiased magnitudes.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> LASSO coefficients are shrunk toward zero by design. Post-LASSO runs OLS on only the LASSO-selected variables, producing unbiased coefficient estimates while retaining the variable selection from LASSO.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #00d4c8 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #141413; font-size: 1.3em; font-weight: 600;">
PART 3: Weighted Average Least Squares (WALS)
&lt;/div>
&lt;h2 id="13-frequentist-model-averaging">13. Frequentist Model Averaging&lt;/h2>
&lt;p>WALS (Weighted Average Least Squares) is a &lt;strong>frequentist&lt;/strong> approach to model averaging. Like BMA, it averages over models instead of selecting just one. But unlike BMA, it does not require MCMC sampling or the specification of a full Bayesian prior.&lt;/p>
&lt;p>The key structural assumption is that regressors are split into two groups:&lt;/p>
&lt;p>$$
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$X_1$ are &lt;strong>focus regressors&lt;/strong>: variables you are certain belong in the model. In a cross-sectional setting, this is typically just the &lt;strong>intercept&lt;/strong>.&lt;/li>
&lt;li>$X_2$ are &lt;strong>auxiliary regressors&lt;/strong>: the 12 candidate variables whose inclusion is uncertain.&lt;/li>
&lt;li>$\beta_1$ are always estimated; $\beta_2$ are the coefficients we are uncertain about.&lt;/li>
&lt;/ul>
&lt;p>WALS was introduced by Magnus, Powell, and Prufer (2010) and offers a compelling advantage over BMA: &lt;strong>it is extremely fast&lt;/strong>. While BMA explores thousands or millions of models via MCMC, WALS uses a mathematical trick to reduce the problem to $K$ independent averaging problems &amp;mdash; one per auxiliary variable.&lt;/p>
&lt;h2 id="14-the-semi-orthogonal-transformation">14. The Semi-Orthogonal Transformation&lt;/h2>
&lt;h3 id="why-correlated-variables-make-averaging-hard">Why correlated variables make averaging hard&lt;/h3>
&lt;p>In our synthetic data, GDP is correlated with fossil fuel use, urbanization, and even with the noise variables. This means that the decision to include one variable affects the importance of another. If GDP is in the model, fossil fuel&amp;rsquo;s coefficient is partially &amp;ldquo;absorbed&amp;rdquo; by GDP.&lt;/p>
&lt;p>In BMA, this problem is handled by averaging over all model combinations &amp;mdash; but at a high computational cost ($2^{12} = 4,096$ models). WALS uses a different strategy: &lt;strong>transform the auxiliary variables so they become orthogonal&lt;/strong> (uncorrelated with each other). Once orthogonal, each variable can be averaged independently.&lt;/p>
&lt;h3 id="the-mathematical-trick">The mathematical trick&lt;/h3>
&lt;p>The semi-orthogonal transformation works as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Remove the influence of focus regressors&lt;/strong>: project out $X_1$ from both $y$ and $X_2$, obtaining residuals $\tilde{y}$ and $\tilde{X}_2$.&lt;/li>
&lt;li>&lt;strong>Orthogonalize the auxiliaries&lt;/strong>: apply a rotation matrix $P$ (from the eigendecomposition of $\tilde{X}_2'\tilde{X}_2$) to create $Z = \tilde{X}_2 P$, where $Z'Z$ is diagonal.&lt;/li>
&lt;li>&lt;strong>Average independently&lt;/strong>: because the columns of $Z$ are orthogonal, the model-averaging problem decomposes into $K$ independent problems. Each transformed variable is averaged separately.&lt;/li>
&lt;/ol>
&lt;p>The computational savings grow dramatically: with 12 variables, we solve &lt;strong>12 independent problems&lt;/strong> instead of enumerating 4,096 models. Think of it as untangling a web of correlated strings until each hangs independently &amp;mdash; once separated, you can measure each string&amp;rsquo;s pull without interference from the others.&lt;/p>
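The three steps can be sketched in base R on a toy matrix. This illustrates the idea only, not the internals of any WALS implementation: project out the focus regressors, rotate by the eigenvectors of the auxiliary cross-product matrix, and confirm that the transformed columns are orthogonal.

```r
set.seed(1)
n <- 100
X1 <- matrix(1, n, 1)                # focus regressors: intercept only
X2 <- matrix(rnorm(n * 3), n, 3)     # three auxiliary regressors
X2[, 2] <- X2[, 2] + 0.8 * X2[, 1]   # induce correlation among auxiliaries

# Step 1: project out the focus regressors (demeaning, since X1 is the intercept)
M1 <- diag(n) - X1 %*% solve(crossprod(X1)) %*% t(X1)
X2_tilde <- M1 %*% X2

# Step 2: rotate with the eigenvectors of X2'X2 so the columns become orthogonal
P <- eigen(crossprod(X2_tilde))$vectors
Z <- X2_tilde %*% P

# Step 3: check that Z'Z is (numerically) diagonal
ZtZ <- crossprod(Z)
max(abs(ZtZ[upper.tri(ZtZ)]))  # off-diagonal elements are essentially zero
```

Once `Z` is in hand, each column can be weighted and averaged on its own, which is what collapses 4,096 candidate models into 12 one-dimensional problems.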
&lt;h2 id="15-the-laplace-prior">15. The Laplace Prior&lt;/h2>
&lt;p>WALS requires a prior distribution for the transformed coefficients. The default and recommended choice is the &lt;strong>Laplace (double-exponential) prior&lt;/strong>:&lt;/p>
&lt;p>$$
p(\gamma_j) \propto \exp(-|\gamma_j| / \tau)
$$&lt;/p>
&lt;p>where $\gamma_j$ is the transformed coefficient and $\tau$ controls the spread. The Laplace prior has two key features:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Peaked at zero&lt;/strong>: it encodes &lt;em>skepticism&lt;/em> &amp;mdash; the prior believes most variables probably have small effects&lt;/li>
&lt;li>&lt;strong>Heavy tails&lt;/strong>: it allows large effects if the data strongly supports them &amp;mdash; variables with strong signal can &amp;ldquo;break through&amp;rdquo; the prior&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="bma_lasso_wals_11_priors.png" alt="Three prior distributions used in model averaging. The Laplace prior (used by WALS) is peaked at zero with heavy tails. The Normal prior (used by BMA g-prior) is also centered at zero but has thinner tails. The Uniform prior assigns equal weight everywhere.">&lt;/p>
&lt;h3 id="the-deep-connection-to-lasso">The deep connection to LASSO&lt;/h3>
&lt;p>Here is a remarkable fact: &lt;strong>the LASSO&amp;rsquo;s L1 penalty is the negative log of a Laplace prior&lt;/strong>. The MAP (maximum a posteriori) estimate under a Laplace prior is:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \frac{\sigma^2}{n\tau} \sum_{j=1}^{p}|\beta_j|
$$&lt;/p>
&lt;p>This is identical to the LASSO objective with $\lambda = \sigma^2 / (n\tau)$. The LASSO penalty and the Laplace prior are two sides of the same coin.&lt;/p>
&lt;p>This means &lt;strong>LASSO and WALS encode the same prior belief&lt;/strong> &amp;mdash; that most coefficients are probably zero or small &amp;mdash; but they use it differently:&lt;/p>
&lt;ul>
&lt;li>LASSO uses the Laplace prior for &lt;strong>selection&lt;/strong>: it finds the single most probable model (the MAP estimate), which sets some coefficients to exactly zero&lt;/li>
&lt;li>WALS uses the Laplace prior for &lt;strong>averaging&lt;/strong>: it averages over all models, weighted by the Laplace prior, producing continuous (nonzero) coefficient estimates with uncertainty measures&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> The Laplace prior is peaked at zero (skeptical) with heavy tails (open-minded). It is the same prior that underlies LASSO&amp;rsquo;s L1 penalty. LASSO uses it for hard selection (zeros vs. nonzeros); WALS uses it for soft averaging (continuous weights).&lt;/p>
&lt;/blockquote>
&lt;h2 id="16-wals-on-all-12-variables">16. WALS on All 12 Variables&lt;/h2>
&lt;h3 id="161-running-wals">16.1 Running WALS&lt;/h3>
&lt;pre>&lt;code class="language-r"># WALS splits regressors into two groups:
# X1 = focus regressors (always included): just the intercept
# X2 = auxiliary regressors (uncertain): our 12 candidate variables
# Prepare the focus regressor matrix (intercept only)
X1_wals &amp;lt;- matrix(1, nrow = nrow(synth_data), ncol = 1)
colnames(X1_wals) &amp;lt;- &amp;quot;(Intercept)&amp;quot;
# Prepare the auxiliary regressor matrix (all 12 candidates)
X2_wals &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y_wals &amp;lt;- synth_data$log_co2
# Fit WALS with the Laplace prior (the recommended default)
wals_fit &amp;lt;- wals(
x = X1_wals, # focus regressors (intercept)
x2 = X2_wals, # auxiliary regressors (12 candidates)
y = y_wals, # response variable
prior = laplace() # Laplace prior for auxiliaries
)
wals_summary &amp;lt;- summary(wals_fit)
&lt;/code>&lt;/pre>
&lt;p>The WALS function call is remarkably concise. Unlike BMA, there is no MCMC sampling, no burn-in period, and no convergence diagnostics to worry about. The computation is essentially instantaneous.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract results
aux_coefs &amp;lt;- wals_summary$auxCoefs
wals_df &amp;lt;- tibble(
variable = rownames(aux_coefs),
estimate = aux_coefs[, &amp;quot;Estimate&amp;quot;],
se = aux_coefs[, &amp;quot;Std. Error&amp;quot;],
t_stat = estimate / se
) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
abs_t = abs(t_stat),
wals_robust = abs_t &amp;gt;= 2
)
print(wals_df |&amp;gt; arrange(desc(abs_t)) |&amp;gt; select(variable, estimate, t_stat, true_beta))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable estimate t_stat true_beta
log_gdp 1.1333 34.62 1.200
trade_network 0.8458 4.39 0.500
industry 0.0187 4.01 0.008
fossil_fuel 0.0099 3.26 0.012
urban_pop 0.0082 3.11 0.010
democracy 0.0097 2.58 0.004
log_credit 0.0659 1.43 0.000
agriculture -0.0046 -1.13 0.005
log_tourism -0.0148 -0.64 0.000
log_trade 0.0196 0.31 0.000
fdi -0.0011 -0.17 0.000
corruption -0.0165 -0.09 0.000
&lt;/code>&lt;/pre>
&lt;p>WALS produces familiar t-statistics for each auxiliary variable. Using the $|t| \geq 2$ threshold as our robustness criterion (analogous to BMA&amp;rsquo;s PIP $\geq$ 0.80), we can classify each variable as robust or fragile.&lt;/p>
&lt;h3 id="162-t-statistic-bar-chart">16.2 t-statistic bar chart&lt;/h3>
&lt;p>The t-statistic bar chart provides a visual summary of WALS robustness classification. Variables with $|t| \geq 2$ pass the robustness threshold (analogous to BMA&amp;rsquo;s PIP $\geq$ 0.80), while those below the threshold are considered fragile.&lt;/p>
&lt;pre>&lt;code class="language-r"># Classify each variable for the bar chart
wals_df &amp;lt;- wals_df |&amp;gt;
mutate(
true_nonzero = true_beta != 0,
bar_color = case_when(
wals_robust &amp;amp; true_nonzero ~ &amp;quot;True positive&amp;quot;,
wals_robust &amp;amp; !true_nonzero ~ &amp;quot;False positive&amp;quot;,
!wals_robust &amp;amp; true_nonzero ~ &amp;quot;False negative&amp;quot;,
TRUE ~ &amp;quot;True negative&amp;quot;
)
)
ggplot(wals_df, aes(x = reorder(variable, abs_t), y = t_stat, fill = bar_color)) +
geom_col(width = 0.6) +
geom_hline(yintercept = c(-2, 2), linetype = &amp;quot;dashed&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_10_wals_tstat.png" alt="WALS t-statistics for all 12 variables. The dashed lines mark the t equals 2 robustness threshold. Variables with absolute t-statistic greater than or equal to 2 are considered robust.">&lt;/p>
&lt;p>The t-statistic bar chart shows a clear separation. GDP towers above all others with $|t| = 34.62$, followed by trade_network ($|t| = 4.39$), industry ($|t| = 4.01$), fossil_fuel ($|t| = 3.26$), urban_pop ($|t| = 3.11$), and democracy ($|t| = 2.58$). These six variables pass the $|t| \geq 2$ threshold. The noise variables all have $|t| &lt; 1.5$, confirming they are not robust determinants. Agriculture ($|t| = 1.13$) falls well below the robustness threshold &amp;mdash; its true effect ($\beta = 0.005$) is simply too small to detect reliably with this sample size.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> WALS is extremely fast (no MCMC), and its familiar t-statistics provide a frequentist complement to BMA&amp;rsquo;s Bayesian PIPs.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #1a3a8a 0%, #141413 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 4: Grand Comparison
&lt;/div>
&lt;h2 id="17-three-methods-same-question-same-data">17. Three Methods, Same Question, Same Data&lt;/h2>
&lt;p>We have now applied all three methods to the same synthetic dataset. Time for the moment of truth: &lt;strong>which variables do all three methods agree on?&lt;/strong>&lt;/p>
&lt;h3 id="171-comprehensive-comparison-table">17.1 Comprehensive comparison table&lt;/h3>
&lt;pre>&lt;code class="language-r"># Merge all results
grand_table &amp;lt;- bma_compare |&amp;gt;
left_join(lasso_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
left_join(wals_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
bma_robust = bma_pip &amp;gt;= 0.80,
n_methods = bma_robust + lasso_selected + wals_robust,
triple_robust = n_methods == 3,
true_nonzero = true_beta != 0
)
print(grand_table |&amp;gt;
select(variable, true_beta, bma_pip, bma_robust, lasso_selected, wals_t, wals_robust, n_methods) |&amp;gt;
arrange(desc(n_methods)))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta bma_pip bma_robust lasso_selected wals_t wals_robust n_methods
log_gdp 1.200 1.000 TRUE TRUE 34.62 TRUE 3
trade_network 0.500 0.986 TRUE TRUE 4.39 TRUE 3
fossil_fuel 0.012 0.948 TRUE TRUE 3.26 TRUE 3
industry 0.008 0.841 TRUE TRUE 4.01 TRUE 3
urban_pop 0.010 0.648 FALSE TRUE 3.11 TRUE 2
democracy 0.004 0.607 FALSE TRUE 2.58 TRUE 2
log_tourism 0.000 0.130 FALSE FALSE -0.64 FALSE 0
log_credit 0.000 0.104 FALSE FALSE 1.43 FALSE 0
agriculture 0.005 0.087 FALSE FALSE -1.13 FALSE 0
log_trade 0.000 0.084 FALSE FALSE 0.31 FALSE 0
corruption 0.000 0.078 FALSE FALSE -0.09 FALSE 0
fdi 0.000 0.077 FALSE FALSE -0.17 FALSE 0
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Four variables are &lt;strong>triple-robust&lt;/strong> &amp;mdash; identified by all three methods: log_gdp, trade_network, fossil_fuel, and industry. Two more variables &amp;mdash; urban_pop and democracy &amp;mdash; are &lt;strong>double-robust&lt;/strong>, selected by LASSO and WALS but landing in BMA&amp;rsquo;s borderline zone (PIPs of 0.648 and 0.607). All five noise variables are correctly excluded by all three methods. Agriculture ($\beta = 0.005$) is the only true predictor missed by all methods &amp;mdash; its effect is simply too small to detect.&lt;/p>
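The n_methods column is just a row-wise sum of three boolean flags. As a minimal sketch of the same logic (shown in Python/pandas for illustration rather than the tutorial's R code; the flag values are copied from four rows of the table above):

```python
import pandas as pd

# Robustness flags for four variables, copied from the comparison table
flags = pd.DataFrame({
    "variable": ["log_gdp", "urban_pop", "agriculture", "fdi"],
    "bma_robust": [True, False, False, False],
    "lasso_selected": [True, True, False, False],
    "wals_robust": [True, True, False, False],
})

# Booleans sum as 0/1, so a row-wise sum counts the agreeing methods
flags["n_methods"] = flags[["bma_robust", "lasso_selected", "wals_robust"]].sum(axis=1)
flags["triple_robust"] = flags["n_methods"] == 3
print(flags[["variable", "n_methods", "triple_robust"]])
```

The same pattern scales to any number of methods: add another boolean column and the count updates automatically.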
&lt;h3 id="172-method-agreement-heatmap">17.2 Method agreement heatmap&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_12_heatmap.png" alt="Method agreement heatmap showing 12 variables by 3 methods. Steel blue indicates the variable was identified as robust; orange indicates it was not. True predictors are in the top rows, noise variables in the bottom rows.">&lt;/p>
&lt;p>The heatmap provides a visual summary of agreement. The top four rows (GDP, trade_network, fossil_fuel, industry) are solid steel blue across all three columns &amp;mdash; unanimous agreement that these variables matter. Urban_pop and democracy show steel blue for LASSO and WALS but orange for BMA, visualizing BMA&amp;rsquo;s greater conservatism. The bottom five rows (noise) are solid orange &amp;mdash; unanimous agreement that they do not matter. Agriculture is also orange throughout, reflecting all methods&amp;rsquo; consensus that its tiny effect ($\beta = 0.005$) cannot be reliably distinguished from zero.&lt;/p>
&lt;h3 id="173-bma-pip-vs-wals-t-statistic">17.3 BMA PIP vs. WALS |t-statistic|&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_13_pip_vs_t.png" alt="BMA PIP plotted against WALS absolute t-statistic. Point color indicates true status (steel blue for true predictors, orange for noise). Point shape indicates LASSO selection (triangle for selected, cross for not selected). The upper-right quadrant contains variables robust by both BMA and WALS.">&lt;/p>
&lt;p>The scatter plot reveals a strong positive relationship between BMA PIP and WALS $|t|$. Variables in the upper-right quadrant are robust by both methods &amp;mdash; GDP, trade_network, fossil_fuel, and industry. Urban_pop and democracy sit in an interesting middle zone: high WALS $|t|$ (above 2) but moderate BMA PIP (below 0.80), illustrating BMA&amp;rsquo;s more conservative threshold. The noise variables cluster in the lower-left corner (low PIP, low $|t|$). LASSO selection (triangle markers) aligns with the WALS threshold, selecting the same six variables that pass $|t| \geq 2$.&lt;/p>
&lt;h3 id="174-coefficient-comparison">17.4 Coefficient comparison&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_14_coef_comparison.png" alt="Coefficient estimates from the three methods compared to the true values in a three-panel faceted scatter plot. Points close to the dashed 45-degree line indicate accurate coefficient recovery.">&lt;/p>
&lt;p>The coefficient comparison plot shows how well each method recovers the true effect sizes. Points on the dashed 45-degree line represent perfect recovery. GDP ($\beta = 1.200$) is recovered almost exactly by all three methods. The smaller coefficients (fossil_fuel at 0.012, urban_pop at 0.010) are also well-estimated. Trade_network&amp;rsquo;s coefficient is overestimated by all methods (true 0.500, estimates around 0.85&amp;ndash;0.90), reflecting the difficulty of precisely estimating an effect on a low-variance variable. BMA&amp;rsquo;s posterior means are slightly attenuated for variables with PIPs below 1.0 (the averaging shrinks them toward zero).&lt;/p>
&lt;h3 id="175-agreement-summary">17.5 Agreement summary&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_15_agreement.png" alt="Bar chart showing how many methods (out of 3) identified each variable as robust. Steel blue bars are true predictors, orange bars are noise variables. Four variables achieve triple-robust status and two achieve double-robust status.">&lt;/p>
&lt;p>The agreement bar chart tells a nuanced story: four variables are triple-robust (identified by all three methods), two are double-robust (identified by LASSO and WALS but not BMA), and six are identified by none. The &amp;ldquo;split votes&amp;rdquo; on urban_pop and democracy reveal a genuine methodological difference: LASSO and WALS are more liberal in including moderate-effect variables, while BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor demands stronger evidence. This pattern &amp;mdash; where methods &lt;em>mostly&lt;/em> agree but diverge on borderline cases &amp;mdash; is what makes methodological triangulation valuable.&lt;/p>
&lt;h3 id="176-method-performance">17.6 Method performance&lt;/h3>
&lt;pre>&lt;code class="language-r"># Sensitivity, specificity, and accuracy for each method
results_by_method &amp;lt;- tibble(
method = c(&amp;quot;BMA&amp;quot;, &amp;quot;LASSO&amp;quot;, &amp;quot;WALS&amp;quot;),
true_pos = c(4, 6, 6), # true predictors correctly identified
false_pos = c(0, 0, 0), # noise variables falsely identified
false_neg = c(3, 1, 1), # true predictors missed
true_neg = c(5, 5, 5), # noise variables correctly excluded
sensitivity = true_pos / 7,
specificity = true_neg / 5,
accuracy = (true_pos + true_neg) / 12
)
print(results_by_method)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> method true_pos false_pos false_neg true_neg sensitivity specificity accuracy
BMA 4 0 3 5 0.571 1.000 0.750
LASSO 6 0 1 5 0.857 1.000 0.917
WALS 6 0 1 5 0.857 1.000 0.917
&lt;/code>&lt;/pre>
&lt;p>All three methods achieve &lt;strong>perfect specificity&lt;/strong> (zero false positives) &amp;mdash; none mistakenly identifies a noise variable as robust. The key difference is in &lt;strong>sensitivity&lt;/strong>: LASSO and WALS each detect 6 of 7 true predictors (85.7%), while BMA detects only 4 (57.1%). BMA&amp;rsquo;s lower sensitivity reflects its conservative Bayesian Occam&amp;rsquo;s razor: it places urban_pop and democracy in the &amp;ldquo;borderline&amp;rdquo; zone rather than committing to their inclusion. The one variable missed by all methods &amp;mdash; agriculture ($\beta = 0.005$) &amp;mdash; has an effect so small that it is indistinguishable from noise given our sample size.&lt;/p>
&lt;h3 id="177-when-to-use-which-method">17.7 When to use which method&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Method&lt;/th>
&lt;th style="text-align:left">Best for&lt;/th>
&lt;th style="text-align:left">Strengths&lt;/th>
&lt;th style="text-align:left">Limitations&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">BMA&lt;/td>
&lt;td style="text-align:left">Full uncertainty quantification&lt;/td>
&lt;td style="text-align:left">Probabilistic (PIPs), handles model uncertainty formally, coefficient intervals&lt;/td>
&lt;td style="text-align:left">Slower (MCMC), requires prior specification&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">LASSO&lt;/td>
&lt;td style="text-align:left">Prediction, sparse models&lt;/td>
&lt;td style="text-align:left">Fast, automatic selection, works with many variables&lt;/td>
&lt;td style="text-align:left">Binary (in/out), biased coefficients (use Post-LASSO)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">WALS&lt;/td>
&lt;td style="text-align:left">Speed, frequentist inference&lt;/td>
&lt;td style="text-align:left">Very fast, produces t-statistics, no MCMC&lt;/td>
&lt;td style="text-align:left">Less common, limited software support&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The strongest recommendation: &lt;strong>use all three&lt;/strong>. When they converge on the same variables (as with our four triple-robust predictors), you have the strongest possible evidence. When they disagree (as with urban_pop and democracy, where LASSO and WALS say &amp;ldquo;yes&amp;rdquo; but BMA hedges), the disagreement itself is informative &amp;mdash; it tells you the evidence is real but not overwhelming. In real-world data, complications such as nonlinearity, heteroskedasticity, and endogeneity may affect method performance and should be addressed before applying these techniques.&lt;/p>
&lt;h2 id="18-conclusion">18. Conclusion&lt;/h2>
&lt;h3 id="181-summary">18.1 Summary&lt;/h3>
&lt;p>This tutorial introduced three principled approaches to the variable selection problem:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> averages over all possible models, weighting each by its posterior probability. It produces Posterior Inclusion Probabilities (PIPs) that quantify how robust each variable is across the entire model space. Variables with PIP $\geq$ 0.80 are considered robust.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO&lt;/strong> adds an L1 penalty to the OLS objective, forcing irrelevant coefficients to exactly zero. Cross-validation selects the penalty strength. Post-LASSO recovers unbiased coefficient estimates for the selected variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>WALS&lt;/strong> uses a semi-orthogonal transformation to decompose the model-averaging problem into independent subproblems &amp;mdash; one per variable. It is extremely fast and produces familiar t-statistics for robustness assessment.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="182-key-takeaways">18.2 Key takeaways&lt;/h3>
&lt;p>&lt;strong>The methods mostly converge &amp;mdash; and their disagreements are informative.&lt;/strong> Four variables are identified by all three methods (triple-robust), and all methods achieve perfect specificity (zero false positives). LASSO and WALS are more sensitive (detecting 6 of 7 true predictors), while BMA is more conservative (detecting 4). The two variables where they disagree &amp;mdash; urban_pop and democracy &amp;mdash; have moderate effects that BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor treats as borderline. This pattern illustrates the value of methodological triangulation across fundamentally different statistical paradigms.&lt;/p>
&lt;p>&lt;strong>Model uncertainty is real but addressable.&lt;/strong> With 12 candidate variables, there are 4,096 possible models. Rather than pretending one of them is &amp;ldquo;the&amp;rdquo; model, these methods account for the uncertainty explicitly. The result is more honest inference.&lt;/p>
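The 4,096 figure comes from a simple power rule: each of the $k$ candidate variables is either included or excluded, so the model space has $2^k$ members. A quick back-of-the-envelope illustration (a Python one-off, separate from the tutorial's R code):

```python
# Each of k candidate variables is in or out, so the model space has 2**k members
for k in (5, 12, 20, 30):
    print(f"k = {k:2d} candidates -> {2 ** k:,} possible models")
```

At k = 12 the space is already too large to inspect model by model, and it roughly doubles with every additional candidate variable.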
&lt;p>&lt;strong>Synthetic data lets us verify.&lt;/strong> Because we designed the data-generating process, we could check each method&amp;rsquo;s performance against the known truth. In practice, the truth is unknown &amp;mdash; which is precisely why using multiple methods is so valuable.&lt;/p>
&lt;h3 id="183-applying-this-to-your-own-research">18.3 Applying this to your own research&lt;/h3>
&lt;p>The code in this tutorial is designed to be modular. To apply these methods to your own data:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Replace the CSV&lt;/strong>: load your own cross-sectional dataset instead of the synthetic one&lt;/li>
&lt;li>&lt;strong>Define the variable list&lt;/strong>: specify which variables are candidates for selection&lt;/li>
&lt;li>&lt;strong>Run the three methods&lt;/strong>: use the same &lt;code>bms()&lt;/code>, &lt;code>cv.glmnet()&lt;/code>, and &lt;code>wals()&lt;/code> function calls&lt;/li>
&lt;li>&lt;strong>Compare results&lt;/strong>: build the same comparison table and heatmap&lt;/li>
&lt;/ol>
&lt;p>The interpretation framework &amp;mdash; PIPs for BMA, selection for LASSO, t-statistics for WALS &amp;mdash; applies regardless of the specific dataset.&lt;/p>
&lt;h3 id="184-further-reading">18.4 Further reading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>BMA&lt;/strong>: Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). &amp;ldquo;Bayesian Model Averaging: A Tutorial.&amp;rdquo; &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>&lt;strong>LASSO&lt;/strong>: Tibshirani, R. (1996). &amp;ldquo;Regression Shrinkage and Selection via the Lasso.&amp;rdquo; &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>&lt;strong>WALS&lt;/strong>: Magnus, J.R., Powell, O., and Prufer, P. (2010). &amp;ldquo;A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics.&amp;rdquo; &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>&lt;strong>Application&lt;/strong>: Aller, C., Ductor, L., and Grechyna, D. (2021). &amp;ldquo;Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions.&amp;rdquo; &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>&lt;strong>Post-LASSO&lt;/strong>: Belloni, A. and Chernozhukov, V. (2013). &amp;ldquo;Least Squares After Model Selection in High-Dimensional Sparse Models.&amp;rdquo; &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;li>&lt;strong>R Packages&lt;/strong>: &lt;a href="https://cran.r-project.org/web/packages/BMS/vignettes/bms.pdf" target="_blank" rel="noopener">BMS vignette&lt;/a>, &lt;a href="https://glmnet.stanford.edu/articles/glmnet.html" target="_blank" rel="noopener">glmnet vignette&lt;/a>, &lt;a href="https://cran.r-project.org/package=WALS" target="_blank" rel="noopener">WALS package&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>Magnus, J.R., Powell, O., and Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>Raftery, A.E. (1995). Bayesian Model Selection in Social Research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/li>
&lt;li>Aller, C., Ductor, L., and Grechyna, D. (2021). Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions. &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>Belloni, A. and Chernozhukov, V. (2013). Least Squares After Model Selection in High-Dimensional Sparse Models. &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content may still contain errors, so apply it to real research projects with caution.&lt;/p></description></item><item><title>Introduction to Causal Inference: Double Machine Learning</title><link>https://carlos-mendez.org/post/python_doubleml/</link><pubDate>Tue, 10 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_doubleml/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Does a cash bonus actually cause unemployed workers to find jobs faster, or do the workers who receive bonuses simply differ from those who do not? This is the core challenge of &lt;strong>causal inference&lt;/strong>: separating a genuine treatment effect from the influence of &lt;em>confounders&lt;/em> — variables that affect both the treatment and the outcome, creating spurious associations. Standard regression can adjust for these confounders, but when their relationship with the outcome is complex and nonlinear, linear models may fail to fully remove bias.&lt;/p>
&lt;p>&lt;strong>Double Machine Learning (DML)&lt;/strong> addresses this problem by using flexible machine learning models to partial out the confounding variation, then estimating the causal effect on the cleaned residuals. In this tutorial we apply DML to the Pennsylvania Bonus Experiment, a real randomized study where some unemployment insurance claimants received a cash bonus for finding employment quickly. We estimate how much the bonus reduced unemployment duration, and we compare DML estimates against naive and covariate-adjusted OLS to see how debiasing changes the results.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why prediction and causal inference require different approaches&lt;/li>
&lt;li>Learn the Partially Linear Regression (PLR) model and the partialling-out estimator&lt;/li>
&lt;li>Implement Double Machine Learning with cross-fitting using the &lt;code>doubleml&lt;/code> package&lt;/li>
&lt;li>Interpret causal effect estimates, standard errors, and confidence intervals&lt;/li>
&lt;li>Assess robustness by comparing results across different ML learners&lt;/li>
&lt;/ul>
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install doubleml
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets the configuration variables for our analysis. We use &lt;code>RANDOM_SEED = 42&lt;/code> throughout to ensure reproducibility, and define the outcome, treatment, and covariate columns that will be used in all subsequent steps.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from doubleml import DoubleMLData, DoubleMLPLR
from doubleml.datasets import fetch_bonus
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Configuration
OUTCOME = &amp;quot;inuidur1&amp;quot;
OUTCOME_LABEL = &amp;quot;Log Unemployment Duration&amp;quot;
TREATMENT = &amp;quot;tg&amp;quot;
COVARIATES = [
&amp;quot;female&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;othrace&amp;quot;, &amp;quot;dep1&amp;quot;, &amp;quot;dep2&amp;quot;,
&amp;quot;q2&amp;quot;, &amp;quot;q3&amp;quot;, &amp;quot;q4&amp;quot;, &amp;quot;q5&amp;quot;, &amp;quot;q6&amp;quot;,
&amp;quot;agelt35&amp;quot;, &amp;quot;agegt54&amp;quot;, &amp;quot;durable&amp;quot;, &amp;quot;lusd&amp;quot;, &amp;quot;husd&amp;quot;,
]
&lt;/code>&lt;/pre>
&lt;h2 id="data-loading-the-pennsylvania-bonus-experiment">Data loading: The Pennsylvania Bonus Experiment&lt;/h2>
&lt;p>The Pennsylvania Bonus Experiment is a well-known dataset in labor economics. In this study, a random subset of unemployment insurance claimants was offered a cash bonus if they found a new job within a qualifying period. The dataset records whether each claimant received the bonus offer (treatment) and how long they remained unemployed (outcome), along with demographic and labor market covariates.&lt;/p>
&lt;pre>&lt;code class="language-python">df = fetch_bonus(&amp;quot;DataFrame&amp;quot;)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;Observations: {len(df)}&amp;quot;)
print(f&amp;quot;\nTreatment groups:&amp;quot;)
print(df[TREATMENT].value_counts().rename({0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Bonus&amp;quot;}))
print(f&amp;quot;\nOutcome ({OUTCOME}) summary:&amp;quot;)
print(df[OUTCOME].describe().round(3))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (5099, 26)
Observations: 5099
Treatment groups:
tg
Control 3354
Bonus 1745
Name: count, dtype: int64
Outcome (inuidur1) summary:
count 5099.000
mean 2.028
std 1.215
min 0.000
25% 1.099
50% 2.398
75% 3.219
max 3.951
Name: inuidur1, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 5,099 unemployment insurance claimants with 26 variables. The treatment is unevenly split: 1,745 claimants received the bonus offer while 3,354 served as controls. The outcome variable, log unemployment duration (&lt;code>inuidur1&lt;/code>), ranges from 0.0 to 3.95 with a mean of 2.028 and standard deviation of 1.215, indicating substantial variation in how long claimants remained unemployed. The median (2.398) exceeds the mean, suggesting a left-skewed distribution where some claimants found jobs very quickly. The interquartile range spans from 1.099 to 3.219, meaning the middle 50% of claimants had log durations in this band.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory data analysis&lt;/h2>
&lt;h3 id="outcome-distribution-by-treatment-group">Outcome distribution by treatment group&lt;/h3>
&lt;p>Before modeling, we examine whether the outcome distributions differ visibly between treated and control groups. While a randomized experiment should produce balanced groups on average, visualizing the raw data helps us understand the structure of the outcome and spot any obvious patterns.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
for group, label, color in [(0, &amp;quot;Control&amp;quot;, &amp;quot;#6a9bcc&amp;quot;), (1, &amp;quot;Bonus&amp;quot;, &amp;quot;#d97757&amp;quot;)]:
subset = df[df[TREATMENT] == group][OUTCOME]
ax.hist(subset, bins=30, alpha=0.6, label=f&amp;quot;{label} (mean={subset.mean():.3f})&amp;quot;,
color=color, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(OUTCOME_LABEL)
ax.set_ylabel(&amp;quot;Count&amp;quot;)
ax.set_title(f&amp;quot;Distribution of {OUTCOME_LABEL} by Treatment Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;doubleml_outcome_by_treatment.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_outcome_by_treatment.png" alt="Distribution of log unemployment duration by treatment group.">&lt;/p>
&lt;p>The histogram reveals that both groups share a similar shape, with a concentration of claimants at higher log durations (around 3.0&amp;ndash;3.5) and a spread of shorter durations below 2.0. The bonus group shows a slightly lower mean (1.971) compared to the control group (2.057), a difference of about 0.09 log points. This raw gap hints at a potential treatment effect, but we cannot yet attribute it to the bonus because confounders may also differ between groups.&lt;/p>
&lt;h3 id="covariate-balance">Covariate balance&lt;/h3>
&lt;p>In a well-designed randomized experiment, the distribution of covariates should be roughly similar across treatment and control groups. We check this balance to verify that randomization worked as expected and to understand which characteristics might confound the treatment-outcome relationship if balance is imperfect.&lt;/p>
&lt;pre>&lt;code class="language-python">covariate_means = df.groupby(TREATMENT)[COVARIATES].mean()
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(COVARIATES))
width = 0.35
ax.bar(x - width / 2, covariate_means.loc[0], width, label=&amp;quot;Control&amp;quot;,
color=&amp;quot;#6a9bcc&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.bar(x + width / 2, covariate_means.loc[1], width, label=&amp;quot;Bonus&amp;quot;,
color=&amp;quot;#d97757&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xticks(x)
ax.set_xticklabels(COVARIATES, rotation=45, ha=&amp;quot;right&amp;quot;)
ax.set_ylabel(&amp;quot;Mean Value&amp;quot;)
ax.set_title(&amp;quot;Covariate Balance: Control vs Bonus Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;doubleml_covariate_balance.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_covariate_balance.png" alt="Covariate balance between control and bonus groups.">&lt;/p>
&lt;p>The covariate means are nearly identical across treatment and control groups for all 15 covariates, confirming that randomization produced well-balanced groups. Demographic variables like &lt;code>female&lt;/code>, &lt;code>black&lt;/code>, and age indicators show negligible differences, as do the economic indicators (&lt;code>durable&lt;/code>, &lt;code>lusd&lt;/code>, &lt;code>husd&lt;/code>). This balance is reassuring: it means that any difference in unemployment duration between groups is unlikely to be driven by observable confounders. Nevertheless, DML provides a principled way to adjust for these covariates and improve precision.&lt;/p>
&lt;h2 id="why-adjust-for-covariates">Why adjust for covariates?&lt;/h2>
&lt;p>Because the Pennsylvania Bonus Experiment is a randomized trial, treatment assignment is independent of covariates by design — there is no confounding bias to remove. However, adjusting for covariates can still improve the &lt;em>precision&lt;/em> of the causal estimate by absorbing residual variation in the outcome. In observational studies, covariate adjustment is essential to avoid confounding bias, but even in an RCT, it sharpens inference. The question is &lt;em>how&lt;/em> to adjust. Standard OLS assumes a linear relationship between covariates and the outcome, which may miss complex nonlinear patterns. The naive OLS model regresses the outcome $Y$ directly on the treatment $D$:&lt;/p>
&lt;p>$$Y_i = \alpha + \beta \, D_i + \epsilon_i \quad \text{(naive, no covariates)}$$&lt;/p>
&lt;p>Adding covariates $X$ linearly gives:&lt;/p>
&lt;p>$$Y_i = \alpha + \beta \, D_i + X_i' \gamma + \epsilon_i \quad \text{(with covariates)}$$&lt;/p>
&lt;p>In our data, $Y_i$ is &lt;code>inuidur1&lt;/code> (log unemployment duration), $D_i$ is &lt;code>tg&lt;/code> (the bonus indicator), and $X_i$ contains the 15 demographic and labor market covariates. In both cases, $\beta$ is the estimated treatment effect. But if the true relationship between $X$ and $Y$ is nonlinear, the linear specification absorbs less outcome variation than a flexible one could, costing precision; in observational settings it can also leave residual confounding in $\hat{\beta}$.&lt;/p>
&lt;h3 id="naive-ols-baseline">Naive OLS baseline&lt;/h3>
&lt;p>We start with two simple OLS regressions to establish baseline estimates: one with no covariates (naive), and one that linearly adjusts for all 15 covariates. These provide a reference point for evaluating how much DML&amp;rsquo;s flexible adjustment changes the estimated treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-python"># Naive OLS: no covariates
ols = LinearRegression()
ols.fit(df[[TREATMENT]], df[OUTCOME])
naive_coef = ols.coef_[0]
# OLS with covariates
ols_full = LinearRegression()
ols_full.fit(df[[TREATMENT] + COVARIATES], df[OUTCOME])
ols_full_coef = ols_full.coef_[0]
print(f&amp;quot;Naive OLS coefficient (no covariates): {naive_coef:.4f}&amp;quot;)
print(f&amp;quot;OLS with covariates coefficient: {ols_full_coef:.4f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Naive OLS coefficient (no covariates): -0.0855
OLS with covariates coefficient: -0.0717
&lt;/code>&lt;/pre>
&lt;p>The naive OLS estimate is -0.0855: the bonus is associated with a reduction of about 0.086 in log unemployment duration, roughly an 8% shorter spell. Adding covariates shifts the estimate to -0.0717 (about a 7% reduction). In a randomized experiment, this shift reflects chance covariate imbalance and a precision improvement from absorbing residual variation, not the removal of confounding bias. Even so, linear adjustment may not capture complex nonlinear relationships between covariates and the outcome. Double Machine Learning will use flexible ML models to more thoroughly partial out covariate effects and further sharpen the estimate.&lt;/p>
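A note on interpretation: with a log outcome, a coefficient $\hat{\beta}$ implies an approximate $100\hat{\beta}$ percent change in duration, while the exact change is $100(e^{\hat{\beta}} - 1)$ percent. Converting the two estimates above:

```python
import numpy as np

# Exact percent change in duration implied by a log-outcome coefficient:
# 100 * (exp(beta) - 1); for small beta this is close to 100 * beta
for label, beta in [("naive OLS", -0.0855), ("OLS + covariates", -0.0717)]:
    exact = 100 * np.expm1(beta)
    print(f"{label}: beta = {beta:+.4f} -> {exact:+.1f}% duration change")
```

For coefficients this small the approximation and the exact conversion differ by only a few tenths of a percentage point.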
&lt;h2 id="what-is-double-machine-learning">What is Double Machine Learning?&lt;/h2>
&lt;h3 id="the-partially-linear-regression-plr-model">The Partially Linear Regression (PLR) model&lt;/h3>
&lt;p>Double Machine Learning operates within the &lt;strong>Partially Linear Regression&lt;/strong> framework. The key idea is that the outcome $Y$ depends on the treatment $D$ through a linear coefficient (the causal effect we want) plus a potentially complex, nonlinear function of covariates $X$. The PLR model consists of two structural equations:&lt;/p>
&lt;p>$$Y = D \, \theta_0 + g_0(X) + \varepsilon, \quad E[\varepsilon \mid D, X] = 0$$&lt;/p>
&lt;p>$$D = m_0(X) + V, \quad E[V \mid X] = 0$$&lt;/p>
&lt;p>Here, $\theta_0$ is the causal parameter of interest — the &lt;strong>Average Treatment Effect (ATE)&lt;/strong> of the bonus on unemployment duration. The function $g_0(\cdot)$ is a &lt;em>nuisance function&lt;/em>, meaning it is not our target but something we must estimate along the way; it captures how covariates affect the outcome. Similarly, $m_0(\cdot)$ models how covariates predict treatment assignment. Think of nuisance functions as scaffolding: essential during construction but not part of the final result. The error terms $\varepsilon$ and $V$ are orthogonal to the covariates by construction. In our data, $Y$ = &lt;code>inuidur1&lt;/code>, $D$ = &lt;code>tg&lt;/code>, and $X$ includes the 15 covariate columns in &lt;code>COVARIATES&lt;/code>. The challenge is that both $g_0$ and $m_0$ can be arbitrarily complex — DML uses machine learning to estimate them flexibly.&lt;/p>
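To make the PLR structure concrete, the sketch below simulates data from a PLR model with a known $\theta_0$ and nonlinear nuisance functions, then shows that a naive regression of $Y$ on $D$ alone is badly biased. All names and parameter values here are illustrative, not taken from the bonus data:

```python
import numpy as np

rng = np.random.default_rng(42)
n, theta0 = 5000, -0.08                      # true causal effect (illustrative)

X = rng.normal(size=(n, 3))
g0 = np.sin(X[:, 2]) + X[:, 0] ** 2          # nonlinear outcome nuisance g0(X)
m0 = 0.5 * X[:, 2]                           # treatment nuisance m0(X)

D = m0 + rng.normal(size=n)                  # D = m0(X) + V
Y = D * theta0 + g0 + rng.normal(size=n)     # Y = D*theta0 + g0(X) + eps

# Naive slope of Y on D: confounded, because X[:, 2] drives both D and Y
naive = np.cov(D, Y)[0, 1] / np.var(D, ddof=1)
print(f"true theta0 = {theta0}, naive slope = {naive:.3f}")
```

Although the true effect is negative, the naive slope comes out positive: the shared dependence on the third covariate dominates the causal signal. That is exactly the gap DML is designed to close.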
&lt;h3 id="the-partialling-out-estimator">The partialling-out estimator&lt;/h3>
&lt;p>The DML algorithm works in two stages. First, it uses ML models to predict the outcome from covariates alone (estimating $E[Y \mid X]$) and to predict the treatment from covariates alone (estimating $E[D \mid X]$). Then it computes residuals from both predictions — the part of $Y$ not explained by $X$, and the part of $D$ not explained by $X$:&lt;/p>
&lt;p>$$\tilde{Y} = Y - \hat{g}_0(X) = Y - \hat{E}[Y \mid X]$$&lt;/p>
&lt;p>$$\tilde{D} = D - \hat{m}_0(X) = D - \hat{E}[D \mid X]$$&lt;/p>
&lt;p>Finally, it regresses the outcome residuals on the treatment residuals to obtain the causal estimate:&lt;/p>
&lt;p>$$\hat{\theta}_0 = \left( \frac{1}{N} \sum_{i=1}^{N} \tilde{D}_i^2 \right)^{-1} \frac{1}{N} \sum_{i=1}^{N} \tilde{D}_i \, \tilde{Y}_i$$&lt;/p>
&lt;p>Think of this like noise-canceling headphones: the ML models learn the &amp;ldquo;noise&amp;rdquo; pattern (how covariates influence both $Y$ and $D$), and we subtract it away so that only the &amp;ldquo;signal&amp;rdquo; — the causal relationship between $D$ and $Y$ — remains.&lt;/p>
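&lt;p>The final stage is nothing more than a no-intercept regression of $\tilde{Y}$ on $\tilde{D}$. Here is a minimal NumPy sketch with simulated stand-in residuals (not the tutorial&amp;rsquo;s data; the -0.07 slope is chosen arbitrarily for the demo):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
D_res = rng.normal(size=1000)                  # stand-in treatment residuals
Y_res = -0.07 * D_res + rng.normal(size=1000)  # stand-in outcome residuals

# theta_hat = (mean of D_res^2)^(-1) * (mean of D_res * Y_res)
theta_hat = (D_res @ Y_res) / (D_res @ D_res)
print(round(theta_hat, 2))  # roughly recovers the -0.07 used in the simulation
&lt;/code>&lt;/pre>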
&lt;h3 id="cross-fitting-why-it-matters">Cross-fitting: why it matters&lt;/h3>
&lt;p>A naive implementation of partialling-out would use the same data to fit the ML models and compute residuals. This introduces &lt;strong>overfitting bias&lt;/strong>: residuals computed on the same observations a model was trained on are systematically too small, and that distortion leaks into the causal estimate. (The other threat, regularization bias, is neutralized by the partialling-out construction itself.) DML solves the overfitting problem with &lt;strong>cross-fitting&lt;/strong>: the data is split into $K$ folds, and each fold&amp;rsquo;s residuals are computed using ML models trained on the other $K-1$ folds. Think of it like grading exams: to avoid bias, we split the class into groups where each group&amp;rsquo;s predictions are made by a model that never saw their data. The cross-fitted estimator is:&lt;/p>
&lt;p>$$\hat{\theta}_0^{CF} = \left( \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \left(\tilde{D}_i^{(k)}\right)^2 \right)^{-1} \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in I_k} \tilde{D}_i^{(k)} \, \tilde{Y}_i^{(k)}$$&lt;/p>
&lt;p>where $\tilde{Y}_i^{(k)}$ and $\tilde{D}_i^{(k)}$ are residuals for observation $i$ in fold $k$, computed using models trained on all folds except $k$. In words, we average the treatment effect estimates across all folds, where each fold&amp;rsquo;s estimate uses residuals computed from models that never saw that fold&amp;rsquo;s data. This ensures that the residuals are computed out-of-sample, eliminating overfitting bias and preserving valid statistical inference (standard errors, p-values, confidence intervals).&lt;/p>
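&lt;p>To make the mechanics concrete, here is a hand-rolled cross-fitted estimator on simulated data. This is an illustrative sketch only (the actual analysis below uses the &lt;code>doubleml&lt;/code> package); scikit-learn&amp;rsquo;s &lt;code>cross_val_predict&lt;/code> returns exactly the out-of-fold predictions the formula requires:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n, theta = 4000, 0.5                                  # true effect, chosen for the demo
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + 0.5 * rng.normal(size=n)        # D = m_0(X) + V
Y = theta * D + np.cos(X[:, 1]) + rng.normal(size=n)  # Y = D*theta + g_0(X) + eps

rf = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
Y_hat = cross_val_predict(rf, X, Y, cv=5)  # each point predicted by a model that never saw it
D_hat = cross_val_predict(rf, X, D, cv=5)
Y_res, D_res = Y - Y_hat, D - D_hat

theta_hat = (D_res @ Y_res) / (D_res @ D_res)
print(round(theta_hat, 2))  # roughly recovers the theta used in the simulation
&lt;/code>&lt;/pre>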
&lt;h2 id="setting-up-doubleml">Setting up DoubleML&lt;/h2>
&lt;p>The &lt;code>doubleml&lt;/code> package provides a clean interface for implementing DML. We first wrap our data into a &lt;code>DoubleMLData&lt;/code> object that specifies the outcome, treatment, and covariate columns. Then we configure the ML learners: Random Forest regressors for both the outcome model &lt;code>ml_l&lt;/code> (estimating $\hat{g}_0$) and the treatment model &lt;code>ml_m&lt;/code> (estimating $\hat{m}_0$).&lt;/p>
&lt;pre>&lt;code class="language-python"># Prepare data for DoubleML
dml_data = DoubleMLData(df, y_col=OUTCOME, d_cols=TREATMENT, x_cols=COVARIATES)
print(dml_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>================== DoubleMLData Object ==================
------------------ Data summary ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['female', 'black', 'othrace', 'dep1', 'dep2', 'q2', 'q3', 'q4', 'q5', 'q6', 'agelt35', 'agegt54', 'durable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>DoubleMLData&lt;/code> object confirms our setup: &lt;code>inuidur1&lt;/code> as the outcome, &lt;code>tg&lt;/code> as the treatment, and all 15 covariates registered. The object reports 5,099 observations and no instrumental variables, which is correct for the PLR model. Separating the data into these three roles is fundamental to DML: the covariates $X$ will be partialled out from both $Y$ and $D$, while the treatment-outcome relationship $\theta_0$ is the sole target of inference.&lt;/p>
&lt;p>Now we configure the ML learners. We use Random Forest with 500 trees, max depth of 5, and &lt;code>sqrt&lt;/code> feature sampling — a moderate configuration that balances flexibility with regularization.&lt;/p>
&lt;pre>&lt;code class="language-python"># Configure ML learners
learner = RandomForestRegressor(n_estimators=500, max_features=&amp;quot;sqrt&amp;quot;,
                                max_depth=5, random_state=RANDOM_SEED)
ml_l_rf = clone(learner)  # Learner for E[Y|X]
ml_m_rf = clone(learner)  # Learner for E[D|X]
print(f&amp;quot;ml_l (outcome model): {type(ml_l_rf).__name__}&amp;quot;)
print(f&amp;quot;ml_m (treatment model): {type(ml_m_rf).__name__}&amp;quot;)
print(f&amp;quot;  n_estimators={learner.n_estimators}, max_depth={learner.max_depth}, max_features='{learner.max_features}'&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>ml_l (outcome model): RandomForestRegressor
ml_m (treatment model): RandomForestRegressor
n_estimators=500, max_depth=5, max_features='sqrt'
&lt;/code>&lt;/pre>
&lt;p>Both the outcome and treatment models use &lt;code>RandomForestRegressor&lt;/code> with 500 estimators and max depth 5. The &lt;code>clone()&lt;/code> function creates independent copies so that each model is trained separately during the DML fitting process. The &lt;code>max_features='sqrt'&lt;/code> setting means each split considers only about $\sqrt{15} \approx 3.9$ of the 15 covariates (scikit-learn truncates this to 3), adding randomness that reduces overfitting. Capping tree depth at 5 prevents overfitting to individual observations while still capturing nonlinear interactions among covariates — a balance that matters because overly complex nuisance models can destabilize the cross-fitted residuals.&lt;/p>
&lt;h2 id="fitting-the-plr-model">Fitting the PLR model&lt;/h2>
&lt;p>With data and learners configured, we fit the Partially Linear Regression model using 5-fold cross-fitting. The &lt;code>DoubleMLPLR&lt;/code> class handles the full DML pipeline: splitting data into folds, fitting ML models on training folds, computing out-of-sample residuals, and estimating the causal coefficient with valid standard errors.&lt;/p>
&lt;pre>&lt;code class="language-python">np.random.seed(RANDOM_SEED)
dml_plr_rf = DoubleMLPLR(dml_data, ml_l_rf, ml_m_rf, n_folds=5)
dml_plr_rf.fit()
print(dml_plr_rf.summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> coef std err t P&amp;gt;|t| 2.5 % 97.5 %
tg -0.0736 0.0354 -2.077 0.0378 -0.1430 -0.0041
&lt;/code>&lt;/pre>
&lt;p>The DML estimate with Random Forest learners yields a treatment coefficient of -0.0736 with a standard error of 0.0354. The t-statistic is -2.077, producing a p-value of 0.0378, which is statistically significant at the 5% level. The 95% confidence interval is [-0.1430, -0.0041]; since the outcome is in logs, this corresponds to roughly a 0.4% to 14.3% reduction in unemployment duration.&lt;/p>
&lt;h2 id="interpreting-the-results">Interpreting the results&lt;/h2>
&lt;p>Let us extract and interpret the key quantities from the fitted model to understand both the statistical and practical significance of the estimated treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-python">rf_coef = dml_plr_rf.coef[0]
rf_se = dml_plr_rf.se[0]
rf_pval = dml_plr_rf.pval[0]
rf_ci = dml_plr_rf.confint().values[0]
print(f&amp;quot;Coefficient (theta_0): {rf_coef:.4f}&amp;quot;)
print(f&amp;quot;Standard Error: {rf_se:.4f}&amp;quot;)
print(f&amp;quot;p-value: {rf_pval:.4f}&amp;quot;)
print(f&amp;quot;95% CI: [{rf_ci[0]:.4f}, {rf_ci[1]:.4f}]&amp;quot;)
print(f&amp;quot;\nInterpretation:&amp;quot;)
print(f&amp;quot; The bonus reduces log unemployment duration by {abs(rf_coef):.4f}.&amp;quot;)
print(f&amp;quot; This corresponds to approximately a {abs(rf_coef)*100:.1f}% reduction.&amp;quot;)
print(f&amp;quot; We are 95% confident the true effect lies between&amp;quot;)
print(f&amp;quot; {abs(rf_ci[1])*100:.1f}% and {abs(rf_ci[0])*100:.1f}% reduction.&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Coefficient (theta_0): -0.0736
Standard Error: 0.0354
p-value: 0.0378
95% CI: [-0.1430, -0.0041]
Interpretation:
The bonus reduces log unemployment duration by 0.0736.
This corresponds to approximately a 7.4% reduction.
We are 95% confident the true effect lies between
0.4% and 14.3% reduction.
&lt;/code>&lt;/pre>
&lt;p>The estimated causal effect is $\hat{\theta}_0 = -0.0736$, meaning the cash bonus lowers log unemployment duration by 0.0736 log points, approximately a 7.4% reduction. Since the outcome is in log scale, the exact proportional reduction in actual unemployment duration is about 7.1% (using $e^{-0.0736} - 1 \approx -0.071$). The effect is statistically significant ($p = 0.0378$), and the 95% confidence interval is constructed as:&lt;/p>
&lt;p>$$\text{CI}_{95\%} = \hat{\theta}_0 \pm 1.96 \times \text{SE}(\hat{\theta}_0) = -0.0736 \pm 1.96 \times 0.0354 = [-0.1430, \; -0.0041]$$&lt;/p>
&lt;p>The interval excludes zero, confirming that the bonus has a genuine causal impact. However, the wide interval — spanning from a 0.4% to 14.3% reduction — reflects meaningful uncertainty about the exact magnitude.&lt;/p>
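&lt;p>The interval arithmetic is easy to verify by hand from the reported coefficient and standard error (the tiny discrepancy in the upper bound, -0.0042 vs. the package&amp;rsquo;s -0.0041, comes from rounding the inputs to four decimals):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

theta_hat, se = -0.0736, 0.0354
lo, hi = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(f&amp;quot;95% CI: [{lo:.4f}, {hi:.4f}]&amp;quot;)  # 95% CI: [-0.1430, -0.0042]

# A log-outcome coefficient implies a proportional change of exp(theta) - 1
print(f&amp;quot;Proportional change: {np.exp(theta_hat) - 1:.3f}&amp;quot;)  # Proportional change: -0.071
&lt;/code>&lt;/pre>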
&lt;h2 id="sensitivity-does-the-choice-of-ml-learner-matter">Sensitivity: does the choice of ML learner matter?&lt;/h2>
&lt;p>A key advantage of DML is that it is &lt;em>agnostic&lt;/em> to the choice of ML learner, as long as the learner is flexible enough to approximate the true nuisance functions $g_0$ and $m_0$. To verify that our results are not driven by the specific choice of Random Forest, we re-estimate the model using Lasso, a fundamentally different class of learner. Lasso is a linear regression with L1 regularization: it adds a penalty proportional to the absolute size of each coefficient, which drives some coefficients to exactly zero and effectively performs variable selection. We use scikit-learn&amp;rsquo;s &lt;code>LassoCV&lt;/code>, which chooses the penalty strength by cross-validation.&lt;/p>
&lt;pre>&lt;code class="language-python">ml_l_lasso = LassoCV()
ml_m_lasso = LassoCV()
np.random.seed(RANDOM_SEED)
dml_plr_lasso = DoubleMLPLR(dml_data, ml_l_lasso, ml_m_lasso, n_folds=5)
dml_plr_lasso.fit()
print(dml_plr_lasso.summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code> coef std err t P&amp;gt;|t| 2.5 % 97.5 %
tg -0.0712 0.0354 -2.009 0.0445 -0.1406 -0.0018
&lt;/code>&lt;/pre>
&lt;p>The Lasso-based DML estimate is -0.0712 with a standard error of 0.0354 and p-value of 0.0445. This is remarkably close to the Random Forest estimate of -0.0736, with a difference of only 0.0024 — less than 7% of the standard error. The 95% confidence interval is [-0.1406, -0.0018], which also excludes zero. The near-identical results across two fundamentally different learners strongly support the robustness of the estimated treatment effect.&lt;/p>
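&lt;p>The &amp;ldquo;less than 7% of the standard error&amp;rdquo; comparison is a one-line check using the reported estimates:&lt;/p>
&lt;pre>&lt;code class="language-python">rf_est, lasso_est, se = -0.0736, -0.0712, 0.0354
diff = abs(rf_est - lasso_est)
print(round(diff, 4), round(diff / se, 3))  # 0.0024 0.068
&lt;/code>&lt;/pre>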
&lt;h2 id="comparing-all-estimates">Comparing all estimates&lt;/h2>
&lt;p>To see how different estimation strategies affect the results, we visualize all four coefficient estimates side by side: naive OLS, OLS with covariates, DML with Random Forest, and DML with Lasso. The DML estimates include confidence intervals derived from valid statistical inference.&lt;/p>
&lt;pre>&lt;code class="language-python">lasso_coef = dml_plr_lasso.coef[0]
lasso_se = dml_plr_lasso.se[0]
lasso_ci = dml_plr_lasso.confint().values[0]
fig, ax = plt.subplots(figsize=(8, 5))
methods = [&amp;quot;Naive OLS&amp;quot;, &amp;quot;OLS + Covariates&amp;quot;, &amp;quot;DoubleML (RF)&amp;quot;, &amp;quot;DoubleML (Lasso)&amp;quot;]
coefs = [naive_coef, ols_full_coef, rf_coef, lasso_coef]
colors = [&amp;quot;#999999&amp;quot;, &amp;quot;#666666&amp;quot;, &amp;quot;#6a9bcc&amp;quot;, &amp;quot;#d97757&amp;quot;]
ax.barh(methods, coefs, color=colors, edgecolor=&amp;quot;white&amp;quot;, height=0.6)
ax.errorbar(rf_coef, 2, xerr=[[rf_coef - rf_ci[0]], [rf_ci[1] - rf_coef]],
            fmt=&amp;quot;none&amp;quot;, color=&amp;quot;#141413&amp;quot;, capsize=5, linewidth=2)
ax.errorbar(lasso_coef, 3, xerr=[[lasso_coef - lasso_ci[0]], [lasso_ci[1] - lasso_coef]],
            fmt=&amp;quot;none&amp;quot;, color=&amp;quot;#141413&amp;quot;, capsize=5, linewidth=2)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_xlabel(&amp;quot;Estimated Coefficient (Effect on Log Unemployment Duration)&amp;quot;)
ax.set_title(&amp;quot;Naive OLS vs Double Machine Learning Estimates&amp;quot;)
plt.savefig(&amp;quot;doubleml_coefficient_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_coefficient_comparison.png" alt="Coefficient comparison across all estimation methods.">&lt;/p>
&lt;p>All four methods agree on the direction and approximate magnitude of the treatment effect: the bonus reduces unemployment duration. The naive OLS estimate (-0.0855) is the largest in absolute terms, while covariate adjustment and DML both shrink it toward -0.07. The DML estimates with Random Forest (-0.0736) and Lasso (-0.0712) cluster closely together and fall between the two OLS benchmarks. Crucially, only the DML estimates come with valid confidence intervals, both of which exclude zero, providing statistical evidence that the effect is real.&lt;/p>
&lt;h2 id="confidence-intervals">Confidence intervals&lt;/h2>
&lt;p>To better visualize the uncertainty around the DML estimates, we plot the 95% confidence intervals for both the Random Forest and Lasso specifications. If both intervals are similar and exclude zero, this strengthens our confidence in the causal conclusion.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 4))
y_pos = [0, 1]
labels = [&amp;quot;DoubleML (Random Forest)&amp;quot;, &amp;quot;DoubleML (Lasso)&amp;quot;]
point_estimates = [rf_coef, lasso_coef]
ci_low = [rf_ci[0], lasso_ci[0]]
ci_high = [rf_ci[1], lasso_ci[1]]
for i, (est, lo, hi, label) in enumerate(zip(point_estimates, ci_low, ci_high, labels)):
    ax.plot([lo, hi], [i, i], color=&amp;quot;#6a9bcc&amp;quot; if i == 0 else &amp;quot;#d97757&amp;quot;, linewidth=3)
    ax.plot(est, i, &amp;quot;o&amp;quot;, color=&amp;quot;#141413&amp;quot;, markersize=8, zorder=5)
    ax.text(hi + 0.005, i, f&amp;quot;{est:.4f} [{lo:.4f}, {hi:.4f}]&amp;quot;, va=&amp;quot;center&amp;quot;, fontsize=9)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel(&amp;quot;Treatment Effect Estimate (95% CI)&amp;quot;)
ax.set_title(&amp;quot;Confidence Intervals: DoubleML Estimates&amp;quot;)
plt.savefig(&amp;quot;doubleml_confint.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="doubleml_confint.png" alt="95% confidence intervals for DoubleML estimates.">&lt;/p>
&lt;p>Both confidence intervals are nearly identical in width and position, spanning from roughly -0.14 to near zero. The Random Forest interval [-0.1430, -0.0041] and Lasso interval [-0.1406, -0.0018] both exclude zero, but just barely — the upper bounds are very close to zero (0.4% and 0.2% reduction, respectively). This tells us that while the bonus has a statistically significant negative effect on unemployment duration, the effect size is modest and estimated with considerable uncertainty.&lt;/p>
&lt;h2 id="summary-table">Summary table&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Coefficient&lt;/th>
&lt;th>Std Error&lt;/th>
&lt;th>p-value&lt;/th>
&lt;th>95% CI&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive OLS&lt;/td>
&lt;td>-0.0855&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OLS + Covariates&lt;/td>
&lt;td>-0.0717&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;td>&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DoubleML (RF)&lt;/td>
&lt;td>-0.0736&lt;/td>
&lt;td>0.0354&lt;/td>
&lt;td>0.0378&lt;/td>
&lt;td>[-0.1430, -0.0041]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DoubleML (Lasso)&lt;/td>
&lt;td>-0.0712&lt;/td>
&lt;td>0.0354&lt;/td>
&lt;td>0.0445&lt;/td>
&lt;td>[-0.1406, -0.0018]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The summary table confirms a consistent pattern across all four estimation methods. The naive OLS gives the largest estimate at -0.0855; adjusting for covariates improves precision and shifts the estimate toward -0.07. The two DML specifications produce very similar estimates of -0.0736 and -0.0712. Both DML p-values are below 0.05, providing statistically significant evidence of a causal effect. The standard errors agree to four decimal places (0.0354); given the same sample, the same cross-fitting structure, and similarly predictive learners, this is unsurprising, though the two need not coincide exactly.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>The Pennsylvania Bonus Experiment provides a clear demonstration of Double Machine Learning in action. Because the experiment was randomized, the DML estimates are close to the OLS estimates: treatment assignment is independent of covariates, so the propensity function $m_0(X)$ is essentially flat and covariate adjustment mainly improves precision rather than removing bias. This is actually reassuring: in a well-designed experiment, flexible covariate adjustment should not dramatically change the results, and indeed the DML estimates ($\hat{\theta}_0 = -0.0736, -0.0712$) are close to the covariate-adjusted OLS (-0.0717).&lt;/p>
&lt;p>The key finding is that the cash bonus shortens unemployment duration by roughly 7%, and this effect is statistically significant (p &amp;lt; 0.05). In practical terms, this means the bonus incentive helped claimants find new jobs somewhat faster. However, the wide confidence intervals suggest that the true effect could be as small as 0.4% or as large as 14.3%, so policymakers should be cautious about the precise magnitude.&lt;/p>
&lt;p>The robustness across learners (Random Forest vs. Lasso) is a strength of DML. Both learners capture similar confounding structure, and the near-identical estimates provide evidence that the result is not an artifact of a particular modeling choice.&lt;/p>
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;p>This tutorial demonstrated Double Machine Learning for causal inference using the Pennsylvania Bonus Experiment. The key takeaways are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Method:&lt;/strong> DML&amp;rsquo;s main advantage over OLS is not the point estimate (both give ~7% here) but the &lt;em>infrastructure&lt;/em> — valid standard errors, confidence intervals, and robustness to nonlinear confounding. On this RCT the estimates are similar; on observational data where $g_0(X)$ is complex, a linear OLS adjustment can be badly biased while DML remains valid&lt;/li>
&lt;li>&lt;strong>Data:&lt;/strong> The cash bonus reduces unemployment duration by 7.4% ($p = 0.038$, 95% CI: [-14.3%, -0.4%]). The wide CI means the true effect could be anywhere from negligible to substantial — policymakers should not over-interpret the point estimate&lt;/li>
&lt;li>&lt;strong>Robustness:&lt;/strong> Random Forest and Lasso produce nearly identical estimates (-0.0736 vs -0.0712), differing by less than 7% of the standard error. This learner-agnosticism is a core strength of the DML framework&lt;/li>
&lt;li>&lt;strong>Limitation:&lt;/strong> The PLR model assumes a constant treatment effect ($\theta_0$ is the same for everyone). If the bonus helps some subgroups more than others (e.g., younger vs. older workers), PLR will average over this heterogeneity — use the Interactive Regression Model (IRM) to detect it&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The Pennsylvania Bonus Experiment is a randomized trial, which is the easiest setting for causal inference. DML&amp;rsquo;s advantages are more pronounced in observational studies where confounding is severe&lt;/li>
&lt;li>We used the PLR model, which assumes a linear treatment effect ($\theta_0$ is constant). More complex treatment heterogeneity could be explored with the Interactive Regression Model (IRM)&lt;/li>
&lt;li>The confidence intervals are wide, reflecting limited sample size and moderate signal strength&lt;/li>
&lt;li>We did not explore heterogeneous treatment effects — situations where the bonus might help some subgroups (e.g., younger workers, women) more than others&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Apply DML to an observational dataset where confounding is more severe&lt;/li>
&lt;li>Explore the Interactive Regression Model for binary treatments&lt;/li>
&lt;li>Investigate treatment effect heterogeneity using DoubleML&amp;rsquo;s &lt;code>cate()&lt;/code> functionality&lt;/li>
&lt;li>Compare additional ML learners (gradient boosting, neural networks)&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the number of folds.&lt;/strong> Re-run the DML analysis with &lt;code>n_folds=3&lt;/code> and &lt;code>n_folds=10&lt;/code>. How do the estimates and standard errors change? What are the tradeoffs of using more vs. fewer folds?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Try a different ML learner.&lt;/strong> Replace the Random Forest with &lt;code>GradientBoostingRegressor&lt;/code> from scikit-learn. Does the estimated treatment effect change? Compare the result to the RF and Lasso estimates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Investigate heterogeneous effects.&lt;/strong> Split the sample by gender (&lt;code>female&lt;/code>) and estimate the DML treatment effect separately for men and women. Is the bonus more effective for one group? What might explain any differences?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;p>&lt;strong>Academic references:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1814176" target="_blank" rel="noopener">Woodbury, S. A., &amp;amp; Spiegelman, R. G. (1987). Bonuses to Workers and Employers to Reduce Unemployment: Randomized Trials in Illinois. American Economic Review, 77(4), 513&amp;ndash;530.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/datasets.html#doubleml.datasets.fetch_bonus" target="_blank" rel="noopener">Pennsylvania Bonus Experiment Dataset &amp;ndash; DoubleML&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Package and API documentation:&lt;/strong>&lt;/p>
&lt;ol start="4">
&lt;li>&lt;a href="https://docs.doubleml.org/stable/intro/intro.html" target="_blank" rel="noopener">DoubleML &amp;ndash; Python Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLPLR.html" target="_blank" rel="noopener">DoubleMLPLR &amp;ndash; API Reference&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLData.html" target="_blank" rel="noopener">DoubleMLData &amp;ndash; API Reference&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; RandomForestRegressor&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; LassoCV&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" target="_blank" rel="noopener">scikit-learn &amp;ndash; LinearRegression&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://numpy.org/doc/stable/" target="_blank" rel="noopener">NumPy &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pandas.pydata.org/docs/" target="_blank" rel="noopener">pandas &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://matplotlib.org/stable/" target="_blank" rel="noopener">Matplotlib &amp;ndash; Documentation&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the content in this post may still have errors. Caution is needed when applying the contents of this post to true research projects.&lt;/p></description></item></channel></rss>