<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Variable Selection | Carlos Mendez</title><link>https://carlos-mendez.org/category/variable-selection/</link><atom:link href="https://carlos-mendez.org/category/variable-selection/index.xml" rel="self" type="application/rss+xml"/><description>Variable Selection</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>Variable Selection</title><link>https://carlos-mendez.org/category/variable-selection/</link></image><item><title>Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata</title><link>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</link><pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_panel_lasso_cluster/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Do all countries respond the same way to inflation? To interest rates? To democratic transitions? Most panel data models assume yes. They force every country to share the same slope coefficients. That is a strong assumption &amp;mdash; and often a wrong one.&lt;/p>
&lt;p>Here is a preview of what we will discover. When we estimate the effect of inflation on savings across 56 countries, the pooled model says: &amp;ldquo;no significant effect.&amp;rdquo; But that average is a lie. One group of countries saves &lt;em>less&lt;/em> when inflation rises. Another group saves &lt;em>more&lt;/em>. The pooled estimate averages a negative and a positive effect, producing a misleading zero.&lt;/p>
&lt;p>The &lt;strong>Classifier-LASSO&lt;/strong> (C-LASSO) method solves this problem. Developed by Su, Shi, and Phillips (2016), it discovers &lt;strong>latent groups&lt;/strong> in your panel data. Countries within each group share the same coefficients. Countries across groups can differ. Think of it like a sorting hat: rather than treating all countries as identical or all as unique, C-LASSO sorts them into a small number of groups with shared behavioral patterns.&lt;/p>
&lt;p>This tutorial demonstrates the &lt;code>classifylasso&lt;/code> Stata command (Huang, Wang, and Zhou 2024) with two applications:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Savings behavior&lt;/strong> across 56 countries (1995&amp;ndash;2010) &amp;mdash; where inflation affects savings in &lt;em>opposite directions&lt;/em> depending on the country group&lt;/li>
&lt;li>&lt;strong>Democracy and economic growth&lt;/strong> across 98 countries (1970&amp;ndash;2010) &amp;mdash; where the pooled estimate of +1.05 masks a split of +2.15 in one group and -0.94 in another&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why assuming homogeneous slopes can be misleading in panel data&lt;/li>
&lt;li>Learn the Classifier-LASSO method for identifying latent group structures&lt;/li>
&lt;li>Implement &lt;code>classifylasso&lt;/code> in Stata with both static and dynamic specifications&lt;/li>
&lt;li>Use postestimation commands (&lt;code>classogroup&lt;/code>, &lt;code>classocoef&lt;/code>, &lt;code>predict gid&lt;/code>) to visualize and interpret results&lt;/li>
&lt;li>Compare pooled fixed-effects estimates with group-specific C-LASSO estimates&lt;/li>
&lt;/ul>
&lt;p>The diagram below maps the tutorial&amp;rsquo;s progression. We start simple and build complexity step by step.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Savings data&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled &amp;amp;&amp;lt;br/&amp;gt;fixed effects&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Static model&amp;lt;br/&amp;gt;(no lagged DV)&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;C-LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Dynamic model&amp;lt;br/&amp;gt;(jackknife)&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Democracy&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Application&amp;lt;br/&amp;gt;(two-way FE)&amp;quot;]
E --&amp;gt; F[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pooled vs&amp;lt;br/&amp;gt;group-specific&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#d97757,stroke:#141413,color:#fff
style E fill:#00d4c8,stroke:#141413,color:#141413
style F fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="2-the-problem-homogeneous-vs-heterogeneous-slopes">2. The Problem: Homogeneous vs Heterogeneous Slopes&lt;/h2>
&lt;h3 id="21-three-approaches-to-slope-heterogeneity">2.1 Three approaches to slope heterogeneity&lt;/h3>
&lt;p>Imagine 56 students taking the same exam. &lt;strong>Approach 1&lt;/strong> assumes they all studied the same way &amp;mdash; one average study strategy explains everyone&amp;rsquo;s score. &lt;strong>Approach 2&lt;/strong> gives each student a unique strategy &amp;mdash; but with only a few data points per student, the estimates are noisy. &lt;strong>Approach 3&lt;/strong> (C-LASSO) discovers that students naturally fall into 2&amp;ndash;3 study groups. Students within a group share the same strategy. Students across groups differ.&lt;/p>
&lt;p>The same logic applies to panel data. The standard fixed-effects model is:&lt;/p>
&lt;p>$$y_{it} = \mu_i + \boldsymbol{\beta}' \mathbf{x}_{it} + u_{it}$$&lt;/p>
&lt;p>Here, $y_{it}$ is the outcome for country $i$ at time $t$. The term $\mu_i$ captures country-specific intercepts (fixed effects). The slope vector $\boldsymbol{\beta}$ links the regressors $\mathbf{x}_{it}$ to the outcome. The critical assumption: $\boldsymbol{\beta}$ is the &lt;strong>same for all countries&lt;/strong>. Japan and Nigeria get the same coefficient on inflation. That may be wrong.&lt;/p>
&lt;p>At the other extreme, we could run separate regressions for each country. But with only $T = 15$ time periods per country, individual estimates are noisy. We lose statistical power.&lt;/p>
&lt;p>C-LASSO introduces a middle ground. It assumes countries belong to $K$ latent groups:&lt;/p>
&lt;p>$$\boldsymbol{\beta}_i = \boldsymbol{\alpha}_k \quad \text{if} \quad i \in G_k, \quad k = 1, \ldots, K$$&lt;/p>
&lt;p>In words, country $i$ gets the slope coefficients of its group $G_k$. The method estimates three things simultaneously: the number of groups $K$, which countries belong to which group, and each group&amp;rsquo;s coefficients $\boldsymbol{\alpha}_k$. You do not need to specify the groups in advance. The data reveals them.&lt;/p>
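&lt;p>This middle ground can be illustrated with a quick simulation (in Python rather than Stata, with made-up group sizes and slopes that loosely echo the savings application). A single pooled slope forced on a two-group panel lands between the two true slopes and matches neither:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
N, T = 56, 15
beta = np.where(np.arange(N) &amp;lt; 34, -0.2, 0.5)   # true slopes: group 1 vs group 2

x = rng.normal(size=(N, T))
y = beta[:, None] * x + 0.1 * rng.normal(size=(N, T))

# Pooled OLS: one slope forced on every unit
pooled = (x * y).sum() / (x * x).sum()

# Group-specific OLS recovers each group's true slope
g1 = (x[:34] * y[:34]).sum() / (x[:34] ** 2).sum()
g2 = (x[34:] * y[34:]).sum() / (x[34:] ** 2).sum()
&lt;/code>&lt;/pre>
&lt;p>The pooled slope is a variance-weighted blend of $-0.2$ and $+0.5$ &amp;mdash; a number that describes no single unit. This is the aggregation problem C-LASSO is built to avoid.&lt;/p>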
&lt;h3 id="22-why-not-just-use-k-means">2.2 Why not just use K-means?&lt;/h3>
&lt;p>A natural question: why not run individual regressions first and then cluster the coefficients with K-means? C-LASSO has two advantages. First, it estimates group membership and coefficients &lt;strong>jointly&lt;/strong>. A two-step approach (estimate, then cluster) propagates first-stage errors into the grouping. Second, C-LASSO&amp;rsquo;s penalty structure naturally pulls similar countries toward the same group. It is a statistically principled sorting mechanism, not an ad-hoc post-processing step.&lt;/p>
&lt;hr>
&lt;h2 id="3-the-classifier-lasso-method">3. The Classifier-LASSO Method&lt;/h2>
&lt;h3 id="31-the-c-lasso-objective-function">3.1 The C-LASSO objective function&lt;/h3>
&lt;p>C-LASSO minimizes a penalized least-squares objective:&lt;/p>
&lt;p>$$Q_{NT,\lambda}^{(K)} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \boldsymbol{\beta}_i' \mathbf{x}_{it})^2 + \frac{\lambda_{NT}}{N} \sum_{i=1}^{N} \prod_{k=1}^{K} \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$$&lt;/p>
&lt;p>The first term is the standard sum of squared residuals. It measures how well the model fits the data. The second term is the &lt;strong>penalty&lt;/strong>. It encourages each country&amp;rsquo;s coefficients $\boldsymbol{\beta}_i$ to be close to one of the group centers $\boldsymbol{\alpha}_k$.&lt;/p>
&lt;p>Think of each group center as a &lt;strong>planet with gravitational pull&lt;/strong>. If a country&amp;rsquo;s coefficients are close to &lt;em>any&lt;/em> planet, the product $\prod_k \lVert \boldsymbol{\beta}_i - \boldsymbol{\alpha}_k \rVert$ shrinks toward zero. The penalty becomes small. The country gets pulled into that group. If the coefficients are far from all planets, the penalty stays large. The tuning parameter $\lambda_{NT} = c_\lambda T^{-1/3}$ controls how strong this gravitational pull is.&lt;/p>
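&lt;p>The behavior of the multiplicative penalty is easy to check numerically. The toy calculation below (in Python, with scalar coefficients and invented group centers) evaluates the product of distances for a unit close to one center and a unit far from both:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

centers = [np.array([-0.2]), np.array([0.5])]   # two invented group centers

def penalty(beta_i):
    # product of Euclidean distances to every group center
    out = 1.0
    for alpha in centers:
        out *= np.linalg.norm(beta_i - alpha)
    return out

near = penalty(np.array([-0.19]))   # close to the first center
far = penalty(np.array([2.0]))      # far from both centers
&lt;/code>&lt;/pre>
&lt;p>Being close to &lt;em>any&lt;/em> one center is enough to make the whole product collapse toward zero &amp;mdash; which is why this penalty sorts units into groups rather than shrinking everyone toward a single common value.&lt;/p>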
&lt;h3 id="32-three-step-estimation-procedure">3.2 Three-step estimation procedure&lt;/h3>
&lt;p>The &lt;code>classifylasso&lt;/code> command works in three steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sort countries into groups.&lt;/strong> For each candidate number of groups $K$, the algorithm iteratively updates group centers and reassigns countries until convergence. Starting values come from unit-by-unit regressions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Re-estimate within groups (postlasso).&lt;/strong> The LASSO penalty biases the coefficient estimates. So after sorting, we discard the penalized estimates and re-run plain OLS within each group. Think of it like a talent show: LASSO is the audition that selects who is in which group, but the final performance (the coefficient estimates) is unpenalized. This postlasso step gives us valid standard errors and confidence intervals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pick the best $K$ (information criterion).&lt;/strong> How many groups are there? The command tests $K = 1, 2, \ldots, K_{\max}$ and picks the $K$ that minimizes an information criterion. The IC acts like a &lt;strong>referee&lt;/strong> balancing two concerns: fit (more groups fit better) and complexity (more groups risk overfitting). It works like AIC or BIC. The tuning parameter $\rho_{NT} = c_\rho (NT)^{-1/2}$ controls how harshly the referee penalizes extra groups.&lt;/p>
&lt;/li>
&lt;/ol>
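&lt;p>Step 1 is in the spirit of a K-means-style loop over coefficient vectors: assign each unit to its nearest center, update the centers, repeat until stable. The Python sketch below is a deliberately simplified stand-in (scalar slopes, hard assignments, no penalty weighting), not the actual &lt;code>classifylasso&lt;/code> algorithm:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def sort_into_groups(unit_betas, K, iters=50):
    # initialize centers from spread-out quantiles of the unit-level estimates
    centers = np.quantile(unit_betas, np.linspace(0.1, 0.9, K))
    for _ in range(iters):
        # assign each unit to its nearest center...
        gid = np.argmin(np.abs(unit_betas[:, None] - centers[None, :]), axis=1)
        # ...then move each center to the mean of its members
        centers = np.array([unit_betas[gid == k].mean() for k in range(K)])
    return gid, centers

# invented unit-level slope estimates: 34 units near -0.2, 22 units near 0.5
rng = np.random.default_rng(3)
unit_betas = np.concatenate([rng.normal(-0.2, 0.03, 34), rng.normal(0.5, 0.03, 22)])
gid, centers = sort_into_groups(unit_betas, K=2)
&lt;/code>&lt;/pre>
&lt;p>The real algorithm differs in a key way: it updates penalized coefficient estimates and group centers jointly rather than clustering fixed first-stage estimates, and step 3 then compares the information criterion across candidate values of $K$.&lt;/p>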
&lt;h3 id="33-dynamic-panels-and-nickell-bias">3.3 Dynamic panels and Nickell bias&lt;/h3>
&lt;p>What if your model includes a lagged dependent variable, like $y_{i,t-1}$? This creates a problem called &lt;strong>Nickell bias&lt;/strong>. When you demean the data to remove fixed effects, the demeaned lagged outcome becomes correlated with the demeaned error. The result: biased coefficients.&lt;/p>
&lt;p>The &lt;code>classifylasso&lt;/code> command offers a &lt;code>dynamic&lt;/code> option to fix this. It uses the &lt;strong>half-panel jackknife&lt;/strong> (Dhaene and Jochmans 2015). The idea is simple: split the time series in half. Estimate the model on each half. Combine the two estimates in a way that cancels the bias. Problem solved.&lt;/p>
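&lt;p>The bias cancellation is simple enough to verify by hand. Suppose an estimator has first-order bias $b/T$. The full panel then has bias $b/T$, each half-panel has bias $2b/T$, and the combination $2\hat{\theta}_{\text{full}} - \tfrac{1}{2}(\hat{\theta}_1 + \hat{\theta}_2)$ removes the $1/T$ term exactly. A toy check in Python, with invented numbers:&lt;/p>
&lt;pre>&lt;code class="language-python"># invented values: true parameter, bias constant, number of time periods
theta, b, T = 0.7, -1.5, 30

est_full = theta + b / T            # biased full-panel estimate
est_half1 = theta + b / (T / 2)     # each half-panel has twice the bias
est_half2 = theta + b / (T / 2)

jackknife = 2 * est_full - 0.5 * (est_half1 + est_half2)
# 2*(theta + b/T) - (theta + 2*b/T) = theta: the 1/T bias term cancels
&lt;/code>&lt;/pre>
&lt;p>Only the $1/T$ term cancels; higher-order bias terms remain, which is why the correction is most reliable when $T$ is not too small.&lt;/p>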
&lt;p>Now that we understand the method, let&amp;rsquo;s apply it to real data.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-exploration-savings">4. Data Exploration: Savings&lt;/h2>
&lt;h3 id="41-load-and-describe-the-data">4.1 Load and describe the data&lt;/h3>
&lt;p>Our first application uses a panel of 56 countries over 15 years, from Su, Shi, and Phillips (2016). The outcome is the savings-to-GDP ratio. The regressors are lagged savings, CPI inflation, real interest rates, and GDP growth.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
summarize savings lagsavings cpi interest gdp
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
savings | 840 -2.87e-08 1.000596 -2.495871 2.893858
lagsavings | 840 5.81e-08 1.000596 -2.832278 2.91508
cpi | 840 3.56e-09 1.000596 -2.773791 3.548945
interest | 840 -7.17e-09 1.000596 -3.600348 3.277582
gdp | 840 1.06e-08 1.000596 -3.554419 2.461317
&lt;/code>&lt;/pre>
&lt;p>The panel is strongly balanced: 56 countries $\times$ 15 years = 840 observations. All variables are standardized to mean zero and standard deviation one. This means coefficients are in standard-deviation units. A coefficient of 0.18 means &amp;ldquo;a one-SD increase in CPI is associated with a 0.18-SD change in savings.&amp;rdquo; The balanced structure matters: C-LASSO requires all countries to be observed in all time periods.&lt;/p>
&lt;h3 id="42-visualize-cross-country-heterogeneity">4.2 Visualize cross-country heterogeneity&lt;/h3>
&lt;p>Before running any regressions, it helps to visualize how savings trajectories differ across countries. The &lt;code>xtline&lt;/code> command overlays all 56 country lines on a single plot:&lt;/p>
&lt;pre>&lt;code class="language-stata">xtline savings, overlay ///
title(&amp;quot;Savings-to-GDP Ratio Across 56 Countries&amp;quot;, size(medium)) ///
subtitle(&amp;quot;Each line represents one country&amp;quot;, size(small)) ///
ytitle(&amp;quot;Savings / GDP&amp;quot;) xtitle(&amp;quot;Year&amp;quot;) legend(off)
graph export &amp;quot;stata_panel_lasso_cluster_fig1_savings_scatter.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig1_savings_scatter.png" alt="Spaghetti plot of savings-to-GDP ratio across 56 countries, showing wide dispersion in trajectories.">
&lt;em>Figure 1: Savings-to-GDP ratio across 56 countries (1995&amp;ndash;2010). Each line represents one country, revealing substantial heterogeneity in savings dynamics.&lt;/em>&lt;/p>
&lt;p>The spaghetti plot tells a clear story: countries do not move in lockstep. Some maintain positive savings ratios throughout. Others swing below zero. The lines diverge, cross, and cluster &amp;mdash; suggesting that different countries follow fundamentally different savings dynamics. This is exactly the kind of heterogeneity that C-LASSO is designed to detect. Perhaps subsets of countries share similar responses, even if the full panel does not.&lt;/p>
&lt;p>But first, let&amp;rsquo;s see what the standard models say.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-pooled-and-fixed-effects-regressions">5. Baseline: Pooled and Fixed Effects Regressions&lt;/h2>
&lt;p>Before applying C-LASSO, we establish a benchmark by estimating the standard pooled OLS and fixed-effects models. These models assume that all 56 countries share the same slope coefficients.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Pooled OLS
regress savings lagsavings cpi interest gdp
* Standard Fixed Effects
xtreg savings lagsavings cpi interest gdp, fe
* Robust Fixed Effects (reghdfe)
reghdfe savings lagsavings cpi interest gdp, absorb(code) vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Pooled OLS FE (robust)
lagsavings 0.6051 0.6051
cpi 0.0301 0.0301
interest 0.0059 0.0059
gdp 0.1882 0.1882
&lt;/code>&lt;/pre>
&lt;p>The pooled OLS and fixed-effects estimates are virtually identical. R-squared is 0.438. Lagged savings dominates (coefficient 0.605, $p &amp;lt; 0.001$). GDP growth matters too (0.188, $p &amp;lt; 0.001$).&lt;/p>
&lt;p>Now look at the two remaining variables. CPI: 0.030. Interest rate: 0.006. Both statistically insignificant. A textbook conclusion would be: &amp;ldquo;Inflation and interest rates do not affect savings.&amp;rdquo;&lt;/p>
&lt;p>But what if the average is lying? Imagine a city where half the neighborhoods warm up by 5 degrees and the other half cool down by 5 degrees. The citywide average temperature change is zero. A meteorologist reporting &amp;ldquo;no change&amp;rdquo; would be wrong &amp;mdash; there &lt;em>are&lt;/em> changes, just in opposite directions. This is exactly what we will discover with C-LASSO.&lt;/p>
&lt;hr>
&lt;h2 id="6-classifier-lasso-savings-static-model">6. Classifier-LASSO: Savings, Static Model&lt;/h2>
&lt;h3 id="61-estimation">6.1 Estimation&lt;/h3>
&lt;p>We start with the simplest C-LASSO specification: a static model without the lagged dependent variable. This lets us focus on the core mechanics before adding complexity.&lt;/p>
&lt;pre>&lt;code class="language-stata">classifylasso savings cpi interest gdp, grouplist(1/5) tolerance(1e-4)
&lt;/code>&lt;/pre>
&lt;p>The command searches over $K = 1$ to $K = 5$ groups and reports the information criterion (IC) for each:&lt;/p>
&lt;pre>&lt;code class="language-text">Estimation 1: Group Number = 1; IC = 0.054
Estimation 2: Group Number = 2; IC = -0.028 ← minimum
Estimation 3: Group Number = 3; IC = 0.059
Estimation 4: Group Number = 4; IC = 0.131
Estimation 5: Group Number = 5; IC = 0.213
* Selected Group Number: 2
&lt;/code>&lt;/pre>
&lt;p>The IC is minimized at $K = 2$ and rises monotonically at every $K$ beyond 2. This clear U-shape provides strong evidence for exactly two latent groups in the data.&lt;/p>
&lt;h3 id="62-group-specific-coefficients">6.2 Group-specific coefficients&lt;/h3>
&lt;pre>&lt;code class="language-stata">classoselect, postselection
predict gid_static, gid
tabulate gid_static
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Group 1 (34 countries, 510 obs): Within R-sq. = 0.2019
cpi | -0.1813 (z = -4.29, p &amp;lt; 0.001)
interest | -0.1966 (z = -4.64, p &amp;lt; 0.001)
gdp | 0.3346 (z = 7.98, p &amp;lt; 0.001)
Group 2 (22 countries, 330 obs): Within R-sq. = 0.2369
cpi | 0.4781 (z = 9.10, p &amp;lt; 0.001)
interest | 0.2631 (z = 5.01, p &amp;lt; 0.001)
gdp | 0.1117 (z = 2.23, p = 0.026)
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Look at CPI.&lt;/p>
&lt;p>In &lt;strong>Group 1&lt;/strong> (34 countries), higher inflation &lt;em>reduces&lt;/em> savings: coefficient $-0.181$ ($p &amp;lt; 0.001$). In &lt;strong>Group 2&lt;/strong> (22 countries), higher inflation &lt;em>increases&lt;/em> savings: coefficient $+0.478$ ($p &amp;lt; 0.001$). The sign flips completely.&lt;/p>
&lt;p>The same reversal appears for the interest rate: $-0.197$ in Group 1 versus $+0.263$ in Group 2.&lt;/p>
&lt;p>Now the pooled CPI coefficient of $+0.030$ makes sense. It was averaging $-0.181$ and $+0.478$ &amp;mdash; a negative and a positive effect canceling each other out. The &amp;ldquo;insignificant&amp;rdquo; result was not evidence of no effect. It was evidence of &lt;strong>two opposing effects&lt;/strong> hidden inside the average.&lt;/p>
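&lt;p>Back-of-the-envelope arithmetic makes the cancellation visible. Pooled OLS is not literally a country-weighted mean of the group coefficients, but even the crude observation-weighted mean (computed below in Python) collapses two large opposing effects into a small number:&lt;/p>
&lt;pre>&lt;code class="language-python"># group sizes and CPI coefficients from the static C-LASSO fit above
n1, coef1 = 34, -0.1813
n2, coef2 = 22, 0.4781

weighted_mean = (n1 * coef1 + n2 * coef2) / (n1 + n2)
# small and positive: two sizable opposing effects nearly cancel
&lt;/code>&lt;/pre>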
&lt;p>Why the reversal? In Group 1, higher inflation erodes the real value of savings, discouraging people from saving. In Group 2, higher inflation may trigger &lt;strong>precautionary savings&lt;/strong> &amp;mdash; households save &lt;em>more&lt;/em> precisely because the economic environment feels uncertain. Same macroeconomic shock, opposite behavioral response.&lt;/p>
&lt;h3 id="63-group-selection-plot">6.3 Group selection plot&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig2_group_selection_static.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig2_group_selection_static.png" alt="Information criterion and iteration count by number of groups for the static savings model. IC is minimized at K=2.">
&lt;em>Figure 2: Group selection for the static savings model. The information criterion (left axis) is minimized at K=2, with a clear U-shape from K=3 onward.&lt;/em>&lt;/p>
&lt;p>The triangle marks the IC minimum at $K = 2$. The left axis shows IC values; the right axis shows iterations to convergence. Notice: $K = 2$ converged quickly (about 3 iterations). Models with $K \geq 3$ hit the maximum 20 iterations. When the algorithm struggles to converge, it is a sign of overparameterization &amp;mdash; too many groups for the data to support.&lt;/p>
&lt;p>So far, we have found two groups with a static model. But we omitted lagged savings. Let&amp;rsquo;s add it back.&lt;/p>
&lt;hr>
&lt;h2 id="7-classifier-lasso-savings-dynamic-model">7. Classifier-LASSO: Savings, Dynamic Model&lt;/h2>
&lt;h3 id="71-adding-the-lagged-dependent-variable">7.1 Adding the lagged dependent variable&lt;/h3>
&lt;p>Savings are highly persistent. The pooled coefficient on &lt;code>lagsavings&lt;/code> was 0.605 &amp;mdash; a country&amp;rsquo;s savings this year strongly predicts its savings next year. Omitting this variable may bias everything else. We now add it back and replicate Su, Shi, and Phillips (2016). The &lt;code>dynamic&lt;/code> option activates the half-panel jackknife to correct Nickell bias.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/saving.dta&amp;quot;, clear
xtset code year
classifylasso savings lagsavings cpi interest gdp, ///
grouplist(1/5) lambda(1.5485) tolerance(1e-4) dynamic
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 9min57s.
Group 1 (31 countries, 465 obs): Within R-sq. = 0.4988
lagsavings | 0.6952 (z = 18.15, p &amp;lt; 0.001)
cpi | -0.1602 (z = -4.09, p &amp;lt; 0.001)
interest | -0.1490 (z = -4.04, p &amp;lt; 0.001)
gdp | 0.2892 (z = 7.62, p &amp;lt; 0.001)
Group 2 (25 countries, 375 obs): Within R-sq. = 0.4372
lagsavings | 0.6939 (z = 19.45, p &amp;lt; 0.001)
cpi | 0.1967 (z = 4.93, p &amp;lt; 0.001)
interest | 0.1225 (z = 2.98, p = 0.003)
gdp | 0.1127 (z = 2.38, p = 0.018)
&lt;/code>&lt;/pre>
&lt;p>Again, C-LASSO selects $K = 2$ groups. The sign reversal on CPI survives: $-0.160$ in Group 1 versus $+0.197$ in Group 2. Same for the interest rate: $-0.149$ versus $+0.123$.&lt;/p>
&lt;p>Here is what is interesting about the &lt;code>lagsavings&lt;/code> coefficient. Both groups show nearly identical persistence: 0.695 in Group 1 and 0.694 in Group 2. Think of it like a speedometer. Both groups of countries cruise at the same speed (savings persistence). But they swerve in opposite directions when they hit a pothole (an inflation or interest rate shock). The heterogeneity is about &lt;em>reactions to shocks&lt;/em>, not about baseline behavior.&lt;/p>
&lt;p>Adding lagged savings also improved the fit. Within R-squared jumped from 0.20&amp;ndash;0.24 (static) to 0.44&amp;ndash;0.50 (dynamic). The lagged variable clearly matters.&lt;/p>
&lt;h3 id="72-coefficient-plots">7.2 Coefficient plots&lt;/h3>
&lt;p>The &lt;code>classocoef&lt;/code> postestimation command visualizes group-specific coefficients with 95% confidence bands:&lt;/p>
&lt;pre>&lt;code class="language-stata">classocoef cpi
graph export &amp;quot;stata_panel_lasso_cluster_fig3_coef_cpi.png&amp;quot;, replace width(2400)
classocoef interest
graph export &amp;quot;stata_panel_lasso_cluster_fig4_coef_interest.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig3_coef_cpi.png" alt="CPI coefficient estimates and 95% confidence bands by group, showing a clear sign reversal with non-overlapping confidence intervals.">
&lt;em>Figure 3: Heterogeneous effects of CPI on savings. Group 1 (31 countries) shows a negative effect; Group 2 (25 countries) shows a positive effect. Confidence bands do not overlap.&lt;/em>&lt;/p>
&lt;p>This is the &amp;ldquo;smoking gun&amp;rdquo; figure. The two horizontal lines are the group-specific coefficients. The dashed lines show 95% confidence bands. The bands do not overlap. This is not a marginal difference. It is a robust sign reversal.&lt;/p>
&lt;p>For 31 countries (Group 1), higher inflation reduces savings ($-0.160$, $p &amp;lt; 0.001$). For 25 countries (Group 2), higher inflation increases savings ($+0.197$, $p &amp;lt; 0.001$). A pooled model averages these opposing forces and finds CPI &amp;ldquo;insignificant.&amp;rdquo; That is aggregation bias at work.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig4_coef_interest.png" alt="Interest rate coefficient estimates and 95% confidence bands by group, showing the same sign reversal pattern as CPI.">
&lt;em>Figure 4: Heterogeneous effects of the interest rate on savings. The same sign reversal pattern as CPI: negative in Group 1, positive in Group 2.&lt;/em>&lt;/p>
&lt;p>The interest rate tells the same story. Group 1 countries save &lt;em>less&lt;/em> when rates rise ($-0.149$). Group 2 countries save &lt;em>more&lt;/em> ($+0.123$).&lt;/p>
&lt;p>Why? One interpretation: in Group 1 (more developed financial markets), higher returns make consumption more attractive &amp;mdash; the &lt;strong>substitution effect&lt;/strong> dominates. In Group 2 (limited financial access), higher returns make saving more rewarding &amp;mdash; the &lt;strong>income effect&lt;/strong> dominates.&lt;/p>
&lt;p>We have now established that latent groups exist in savings data. The next question: does the same pattern appear in a completely different economic context?&lt;/p>
&lt;hr>
&lt;h2 id="8-democracy-application-does-democracy-cause-growth">8. Democracy Application: Does Democracy Cause Growth?&lt;/h2>
&lt;h3 id="81-the-acemoglu-et-al-2019-question">8.1 The Acemoglu et al. (2019) question&lt;/h3>
&lt;p>&amp;ldquo;Democracy does cause growth.&amp;rdquo; That is the title of a famous 2019 paper by Acemoglu, Naidu, Restrepo, and Robinson in the &lt;em>Journal of Political Economy&lt;/em>. Their evidence: a pooled two-way fixed-effects model with lagged GDP finds a positive, significant effect.&lt;/p>
&lt;p>But we have learned to be skeptical of pooled estimates. Does this average apply to all 98 countries? Or does it mask the same kind of sign reversal we found in savings?&lt;/p>
&lt;h3 id="82-data-exploration">8.2 Data exploration&lt;/h3>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_panel_lasso_cluster/refMaterials/democracy.dta&amp;quot;, clear
xtset country year
summarize lnPGDP Democracy ly1
tabulate Democracy
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnPGDP | 4,018 758.5558 162.9137 405.6728 1094.003
Democracy | 4,018 .5450473 .4980286 0 1
ly1 | 3,920 757.7754 162.6702 405.6728 1094.003
Democracy | Freq. Percent
------------+-----------------------------------
0 | 1,828 45.50
1 | 2,190 54.50
&lt;/code>&lt;/pre>
&lt;p>The panel covers 98 countries from 1970 to 2010 &amp;mdash; 4,018 observations. The binary &lt;code>Democracy&lt;/code> indicator is 1 for democratic country-years and 0 otherwise. About 55% of observations are democratic, reflecting the global wave of democratization. The dependent variable &lt;code>lnPGDP&lt;/code> (log per-capita GDP, scaled) ranges from 406 to 1,094 &amp;mdash; the full spectrum from low-income to high-income countries.&lt;/p>
&lt;h3 id="83-pooled-fixed-effects-benchmark">8.3 Pooled fixed-effects benchmark&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe lnPGDP Democracy ly1, absorb(country year) cluster(country)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 3,920
R-squared = 0.9991
Within R-sq. = 0.9607
(Std. err. adjusted for 98 clusters in country)
lnPGDP | Coefficient Robust std. err. t P&amp;gt;|t|
Democracy | 1.054992 .369806 2.85 0.005
ly1 | .970495 .0059964 161.85 0.000
&lt;/code>&lt;/pre>
&lt;p>Democracy is associated with a 1.055-unit increase in log per-capita GDP ($p = 0.005$, clustered SE = 0.370). Lagged GDP has a coefficient of 0.970 &amp;mdash; strong persistence. This replicates Acemoglu et al. (2019): on average, democracy promotes growth.&lt;/p>
&lt;p>On average. But we already know what &amp;ldquo;on average&amp;rdquo; can hide. Let&amp;rsquo;s run C-LASSO.&lt;/p>
&lt;h3 id="84-c-lasso-revealing-the-heterogeneity">8.4 C-LASSO: revealing the heterogeneity&lt;/h3>
&lt;pre>&lt;code class="language-stata">classifylasso lnPGDP Democracy ly1, ///
grouplist(1/5) rho(0.2) absorb(country year) ///
cluster(country) dynamic optmaxiter(300)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">* Selected Group Number: 2
The algorithm takes 2h33min41s.
Group 1 (57 countries, 2,280 obs): Within R-sq. = 0.9609
Democracy | 2.151397 (z = 3.94, p &amp;lt; 0.001)
ly1 | 1.032752 (z = 149.97, p &amp;lt; 0.001)
Group 2 (41 countries, 1,640 obs): Within R-sq. = 0.9538
Democracy | -0.935589 (z = -2.69, p = 0.007)
ly1 | 0.979327 (z = 95.73, p &amp;lt; 0.001)
&lt;/code>&lt;/pre>
&lt;p>This is the tutorial&amp;rsquo;s most striking finding.&lt;/p>
&lt;p>The pooled coefficient of $+1.055$ is &lt;strong>not representative of any actual country group&lt;/strong>. It is a weighted average of two fundamentally different effects:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Group 1&lt;/strong> (57 countries): democracy effect = $+2.151$ ($p &amp;lt; 0.001$). More than twice the pooled estimate.&lt;/li>
&lt;li>&lt;strong>Group 2&lt;/strong> (41 countries): democracy effect = $-0.936$ ($p = 0.007$). Negative and significant.&lt;/li>
&lt;/ul>
&lt;p>The coefficient literally changes sign. For 58% of countries, democratic transitions are associated with GDP gains. For the remaining 42%, they are associated with GDP declines. The pooled model sees one number. C-LASSO sees two stories.&lt;/p>
&lt;p>Note: these are conditional associations within the panel model. A causal interpretation requires the same identifying assumptions as Acemoglu et al. (2019).&lt;/p>
&lt;h3 id="85-visualizing-the-democracy-growth-split">8.5 Visualizing the democracy-growth split&lt;/h3>
&lt;pre>&lt;code class="language-stata">classogroup
graph export &amp;quot;stata_panel_lasso_cluster_fig5_democracy_selection.png&amp;quot;, replace width(2400)
classocoef Democracy
graph export &amp;quot;stata_panel_lasso_cluster_fig6_democracy_coef.png&amp;quot;, replace width(2400)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig5_democracy_selection.png" alt="Information criterion and iteration count for the democracy model. IC is minimized at K=2, though values are close across specifications.">
&lt;em>Figure 5: Group selection for the democracy-growth model. IC is minimized at K=2, though values are close across all K (range 3.267&amp;ndash;3.280).&lt;/em>&lt;/p>
&lt;p>The IC selects $K = 2$. But look closely: the IC values range from 3.267 to 3.280 &amp;mdash; a span of just 0.013. The 2-group structure is optimal but not overwhelmingly so. This is a useful reminder: always check sensitivity to the tuning parameter $\rho$.&lt;/p>
&lt;p>&lt;img src="stata_panel_lasso_cluster_fig6_democracy_coef.png" alt="Democracy coefficient polarization across two groups: Group 1 (57 countries) shows a positive effect around +2.2, Group 2 (41 countries) shows a negative effect around -1.0.">
&lt;em>Figure 6: Heterogeneous effects of democracy on economic growth. Group 1 (57 countries) shows a positive effect (+2.15); Group 2 (41 countries) shows a negative effect (-0.94). The pooled estimate of +1.05 describes neither group.&lt;/em>&lt;/p>
&lt;p>This is the key figure of the tutorial. Each dot is one country&amp;rsquo;s individual coefficient estimate. The horizontal lines show group-specific postlasso estimates with 95% confidence bands.&lt;/p>
&lt;p>The polarization is unmistakable. Group 1 (left cluster): strongly positive. Group 2 (right cluster): negative. Neither group&amp;rsquo;s confidence band crosses zero. Both effects are statistically significant.&lt;/p>
&lt;p>This is not &amp;ldquo;some countries benefit, others see no effect.&amp;rdquo; It is a genuine sign reversal. Democracy is associated with growth in one group and with decline in another.&lt;/p>
&lt;hr>
&lt;h2 id="9-comparison-what-the-pooled-model-misses">9. Comparison: What the Pooled Model Misses&lt;/h2>
&lt;h3 id="91-summary-table">9.1 Summary table&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Pooled FE&lt;/th>
&lt;th>C-LASSO Group 1&lt;/th>
&lt;th>C-LASSO Group 2&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Democracy coefficient&lt;/strong>&lt;/td>
&lt;td>+1.055&lt;/td>
&lt;td>+2.151&lt;/td>
&lt;td>-0.936&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Standard error&lt;/strong>&lt;/td>
&lt;td>0.370&lt;/td>
&lt;td>0.546&lt;/td>
&lt;td>0.348&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>p-value&lt;/strong>&lt;/td>
&lt;td>0.005&lt;/td>
&lt;td>&amp;lt; 0.001&lt;/td>
&lt;td>0.007&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Lagged GDP&lt;/strong>&lt;/td>
&lt;td>0.970&lt;/td>
&lt;td>1.033&lt;/td>
&lt;td>0.979&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Countries&lt;/strong>&lt;/td>
&lt;td>98&lt;/td>
&lt;td>57&lt;/td>
&lt;td>41&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Observations&lt;/strong>&lt;/td>
&lt;td>3,920&lt;/td>
&lt;td>2,280&lt;/td>
&lt;td>1,640&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
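&lt;p>A quick arithmetic check confirms the table is internally consistent: the pooled row implies the panel length per country, and the two group rows must add up to the pooled total. The sketch below uses only numbers taken from the table:&lt;/p>

```python
# Panel length implied by the pooled row: observations / countries
T = 3920 // 98            # -> 40 periods per country
# Group rows should equal group size times the same panel length
print(T, 57 * T, 41 * T)  # -> 40 2280 1640
```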
&lt;h3 id="92-simpsons-paradox-in-panel-data">9.2 Simpson&amp;rsquo;s paradox in panel data&lt;/h3>
&lt;p>This is &lt;strong>Simpson&amp;rsquo;s paradox&lt;/strong> &amp;mdash; the phenomenon where a trend that holds within every subgroup disappears or reverses when the subgroups are pooled together.&lt;/p>
&lt;p>Here is a concrete analogy. A hospital treats two types of patients: mild cases and severe cases. For mild cases, Treatment A has a higher survival rate. For severe cases, Treatment A also has a higher survival rate. But when you pool all patients together, Treatment B appears better &amp;mdash; because it treats a disproportionate number of mild (easy) cases. The aggregate reverses the subgroup trend.&lt;/p>
&lt;p>The same thing happened here. The pooled democracy estimate of $+1.055$ sits between $+2.151$ and $-0.936$. It describes neither group accurately. A policymaker relying on the pooled result would conclude that democracy universally promotes growth. They would miss that for 41 countries (42% of the sample), the relationship runs in the opposite direction.&lt;/p>
&lt;p>The savings model showed the same pattern. The insignificant pooled CPI coefficient ($+0.030$) masked significant effects of $-0.160$ and $+0.197$. When effects have opposite signs, pooling does not just underestimate the magnitude. It produces a qualitatively wrong conclusion.&lt;/p>
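&lt;p>The sign-cancellation mechanism is easy to reproduce. The sketch below uses simulated data (illustrative numbers only, echoing the $-0.160$/$+0.197$ CPI split &amp;mdash; not the tutorial&amp;rsquo;s dataset) and fits a slope within each group and then on the pooled sample:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # observations per group

# Two groups with opposite true slopes, as in the savings example
x = rng.normal(size=2 * n)
y = np.concatenate([-0.16 * x[:n], 0.20 * x[n:]]) + rng.normal(scale=0.1, size=2 * n)

slope_g1 = np.polyfit(x[:n], y[:n], 1)[0]   # close to -0.16
slope_g2 = np.polyfit(x[n:], y[n:], 1)[0]   # close to +0.20
slope_pooled = np.polyfit(x, y, 1)[0]       # near zero: the signs cancel
print(slope_g1, slope_g2, slope_pooled)
```

The pooled slope is a weighted average of two significant, opposite-signed effects, so it lands near zero &amp;mdash; the qualitative failure described above.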
&lt;h3 id="93-robustness-of-the-group-structure">9.3 Robustness of the group structure&lt;/h3>
&lt;p>Across all three C-LASSO specifications &amp;mdash; static savings, dynamic savings, and democracy &amp;mdash; the IC consistently selected $K = 2$ groups. The CPI sign reversal survived the switch from static to dynamic, despite a shift in group composition (34/22 to 31/25). This consistency suggests the latent groups are real structural features of the data, not artifacts of a particular specification.&lt;/p>
&lt;hr>
&lt;h2 id="10-summary-and-takeaways">10. Summary and Takeaways&lt;/h2>
&lt;h3 id="101-what-we-learned">10.1 What we learned&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Pooled estimates can be misleading.&lt;/strong> The insignificant pooled CPI coefficient ($+0.030$) in the savings model masked opposing effects of $-0.160$ and $+0.197$ in two latent groups. The pooled democracy coefficient ($+1.055$) masked a split of $+2.151$ versus $-0.936$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>C-LASSO finds latent groups.&lt;/strong> In all three specifications, the information criterion selected $K = 2$ groups, revealing binary latent structures in both datasets. The &lt;code>classifylasso&lt;/code> command handles the full workflow: estimation, group selection, and postestimation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The &lt;code>dynamic&lt;/code> option corrects Nickell bias.&lt;/strong> When lagged dependent variables are included, the half-panel jackknife bias correction preserves the group structure while improving within-group R-squared (from 0.20&amp;ndash;0.24 in the static model to 0.44&amp;ndash;0.50 in the dynamic model).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Postestimation tools aid interpretation.&lt;/strong> The &lt;code>classogroup&lt;/code> command visualizes the information criterion, &lt;code>classocoef&lt;/code> plots group-specific coefficients with confidence bands, and &lt;code>predict gid&lt;/code> assigns countries to groups.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="102-limitations">10.2 Limitations&lt;/h3>
&lt;p>Three caveats. First, the IC values in the democracy model were very close across $K = 1$ through $K = 5$ (range 3.267&amp;ndash;3.280). The 2-group structure is optimal but not dominant. Second, the datasets use numeric country codes, not names. We cannot easily identify which countries are in which group. Third, C-LASSO is computationally intensive. The democracy model took over 2.5 hours. Plan accordingly.&lt;/p>
&lt;h3 id="103-exercises">10.3 Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity analysis.&lt;/strong> Re-run the democracy model with &lt;code>rho(0.5)&lt;/code> and &lt;code>rho(1.0)&lt;/code> instead of &lt;code>rho(0.2)&lt;/code>. Does the selected number of groups change? How sensitive are the group assignments to this tuning parameter?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Extended lag structure.&lt;/strong> Following the reference &lt;code>empirical.do&lt;/code>, estimate the democracy model with 2, 3, and 4 lags of GDP (&lt;code>ly1-ly2&lt;/code>, &lt;code>ly1-ly3&lt;/code>, &lt;code>ly1-ly4&lt;/code>). Do the group-specific democracy coefficients remain stable?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Static vs dynamic comparison.&lt;/strong> Run &lt;code>classifylasso savings cpi interest gdp&lt;/code> (without &lt;code>dynamic&lt;/code>) on the savings data and compare group assignments with the dynamic model using &lt;code>tabulate gid_static gid_dynamic&lt;/code>. How many countries switch groups?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Su, L., Shi, Z., and Phillips, P. C. B. (2016). &lt;a href="https://doi.org/10.3982/ECTA12560" target="_blank" rel="noopener">Identifying latent structures in panel data&lt;/a>. &lt;em>Econometrica&lt;/em>, 84(6), 2215&amp;ndash;2264.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Huang, W., Wang, Y., and Zhou, L. (2024). &lt;a href="https://doi.org/10.1177/1536867X241233664" target="_blank" rel="noopener">Identify latent group structures in panel data: The classifylasso command&lt;/a>. &lt;em>Stata Journal&lt;/em>, 24(1), 173&amp;ndash;203.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Acemoglu, D., Naidu, S., Restrepo, P., and Robinson, J. A. (2019). &lt;a href="https://doi.org/10.1086/700936" target="_blank" rel="noopener">Democracy does cause growth&lt;/a>. &lt;em>Journal of Political Economy&lt;/em>, 127(1), 47&amp;ndash;100.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dhaene, G. and Jochmans, K. (2015). &lt;a href="https://doi.org/10.1093/restud/rdv007" target="_blank" rel="noopener">Split-panel jackknife estimation of fixed-effect models&lt;/a>. &lt;em>Review of Economic Studies&lt;/em>, 82(3), 991&amp;ndash;1030.&lt;/p>
&lt;/li>
&lt;/ol></description></item><item><title>Dynamic Panel BMA: Which Factors Truly Drive Economic Growth?</title><link>https://carlos-mendez.org/post/r_dynamic_bma/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_dynamic_bma/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you are advising a government on how to accelerate long-run economic growth. Your team has compiled a panel dataset covering 73 countries across four decades, with nine candidate drivers: investment, education, population growth, trade openness, government spending, life expectancy, democracy, investment prices, and population size. The natural question is: &lt;strong>which of these factors truly drive economic growth &amp;mdash; and can we trust our answers when today&amp;rsquo;s GDP might itself be shaped by those same factors?&lt;/strong>&lt;/p>
&lt;p>What is BMA? Imagine trying to predict salaries using education, experience, age, and industry. You could build one model with all four variables, or drop industry, or use only experience and education. With just 4 candidates, there are $2^4 = 16$ possible models. Which is correct? &lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> does not pick one &amp;mdash; it averages predictions from all 16, giving more weight to models that fit the data well. This avoids betting everything on one specification that might be wrong.&lt;/p>
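&lt;p>The averaging step can be sketched in a few lines. The example below enumerates all $2^4 = 16$ salary models from the analogy above, scores each with BIC (a common large-sample approximation to the marginal likelihood &amp;mdash; &lt;em>not&lt;/em> the likelihood the Bayesian DSM package optimizes), and turns the scores into model weights and inclusion probabilities. All data are simulated; the true model uses only education and experience:&lt;/p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, names = 200, ["educ", "exper", "age", "industry"]
X = rng.normal(size=(n, 4))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # salary: educ + exper only

def bic(subset):
    """BIC of an OLS fit on the given predictor subset (intercept always included)."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return n * np.log(resid @ resid / n) + Z.shape[1] * np.log(n)

# All 2^4 = 16 subsets of the four candidate predictors
models = [s for r in range(5) for s in itertools.combinations(range(4), r)]
scores = np.array([bic(m) for m in models])
weights = np.exp(-0.5 * (scores - scores.min()))
weights /= weights.sum()  # posterior model weights (BIC approximation)

# Posterior inclusion probability: total weight of models containing each variable
pip = {names[j]: sum(w for w, m in zip(weights, models) if j in m) for j in range(4)}
print(len(models), {k: round(v, 2) for k, v in pip.items()})
```

The two true drivers end up with inclusion probabilities near 1, while the irrelevant candidates receive little weight &amp;mdash; no single model was ever "chosen."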
&lt;p>The closing concern in that question is &lt;em>reverse causality&lt;/em> &amp;mdash; the possibility that GDP growth causes higher investment rather than the other way around. Cross-sectional BMA handles model uncertainty by averaging in exactly this way, but it assumes regressors are strictly exogenous. When that assumption fails, BMA can confidently point to the wrong variables.&lt;/p>
&lt;p>This tutorial introduces the &lt;a href="https://cran.r-project.org/web/packages/bdsm/index.html" target="_blank" rel="noopener">Bayesian Dynamic Systems Modeling&lt;/a> R package &amp;mdash; which extends BMA to dynamic panel data with weakly exogenous regressors. Built on the methodology of Moral-Benito (2012, 2013, 2016), it simultaneously addresses model uncertainty and reverse causality by incorporating a lagged dependent variable, entity fixed effects, and time fixed effects into the BMA framework.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Companion tutorial.&lt;/strong> For a cross-sectional perspective using BMA, LASSO, and WALS on synthetic data, see the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">R tutorial on variable selection&lt;/a>. The current tutorial builds on those foundations by moving from cross-sectional to panel data and from strict to weak exogeneity.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand why cross-sectional BMA can be misleading when regressors are endogenous, and how dynamic panel BMA addresses this&lt;/li>
&lt;li>Prepare panel data for the Bayesian DSM package using &lt;code>join_lagged_col()&lt;/code> and &lt;code>feature_standardization()&lt;/code>&lt;/li>
&lt;li>Run Bayesian Model Averaging with &lt;code>bma()&lt;/code> and interpret Posterior Inclusion Probabilities (PIPs &amp;mdash; how often a variable appears in the best-fitting models), posterior means, and model probabilities&lt;/li>
&lt;li>Assess the sensitivity of results to prior specification by varying the expected model size (how many variables the prior expects to matter) and applying dilution priors (which adjust for correlated variables)&lt;/li>
&lt;li>Analyze jointness (which variables tend to appear in models together) to discover which growth determinants are complements versus substitutes&lt;/li>
&lt;/ul>
&lt;p>The package also includes a smaller 3-regressor example (&lt;code>small_model_space&lt;/code>) for practice &amp;mdash; see the companion R script for details.&lt;/p>
&lt;p>&lt;strong>Data Prep&lt;/strong> (lag DV, demean, standardize) &lt;strong>→ Model Space&lt;/strong> (estimate all 2&lt;sup>K&lt;/sup> models)
&lt;strong>→ BMA&lt;/strong> (PIPs, posterior means) &lt;strong>→ Sensitivity&lt;/strong> (vary priors, EMS, dilution) &lt;strong>→ Jointness&lt;/strong> (complements vs. substitutes) &lt;strong>→ Findings&lt;/strong> (robust growth determinants)&lt;/p>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;p>We need the Bayesian Dynamic Systems Modeling package for dynamic panel BMA and &lt;code>tidyverse&lt;/code> for data manipulation. The &lt;code>parallel&lt;/code> package (included with base R) enables parallel computing for the model space estimation step.&lt;/p>
&lt;pre>&lt;code class="language-r"># Install bdsm if needed
if (!requireNamespace(&amp;quot;bdsm&amp;quot;, quietly = TRUE)) {
install.packages(&amp;quot;bdsm&amp;quot;)
}
# Load packages
library(bdsm)
library(tidyverse)
library(parallel)
set.seed(42)
&lt;/code>&lt;/pre>
&lt;h2 id="3-why-dynamic-panel-bma">3. Why Dynamic Panel BMA?&lt;/h2>
&lt;h3 id="31-the-endogeneity-problem">3.1 The endogeneity problem&lt;/h3>
&lt;p>Standard BMA assumes that all regressors are &lt;em>strictly exogenous&lt;/em> &amp;mdash; meaning they are determined outside the model and are uncorrelated with the error term at any point in time. In growth economics, this assumption almost never holds.&lt;/p>
&lt;p>Think of it this way: imagine judging a runner&amp;rsquo;s training program by their final race time, but faster runners also &lt;em>chose&lt;/em> better programs. You cannot tell whether the program caused the speed or the speed attracted the program. This is &lt;strong>reverse causality&lt;/strong>, and it contaminates cross-sectional regressions. Countries that grow faster invest more, trade more, urbanize faster, and attract more education spending &amp;mdash; not just the other way around.&lt;/p>
&lt;p>When BMA is applied to cross-sectional data with endogenous regressors, it can confidently assign high inclusion probabilities to variables that appear important only because they are &lt;em>consequences&lt;/em> of growth rather than &lt;em>causes&lt;/em> of it. The model averaging machinery works perfectly &amp;mdash; but the individual models it averages over are biased.&lt;/p>
&lt;p>The solution is to include &lt;em>last period&amp;rsquo;s GDP&lt;/em> as a regressor. By controlling for where a country &lt;em>was&lt;/em>, we isolate which new factors push it forward &amp;mdash; breaking the feedback loop. The next section shows why this dynamic structure arises naturally from economic growth theory.&lt;/p>
&lt;h3 id="32-from-the-solow-model-to-a-dynamic-equation">3.2 From the Solow model to a dynamic equation&lt;/h3>
&lt;p>Why does a dynamic equation &amp;mdash; one with lagged GDP on the right-hand side &amp;mdash; arise naturally in growth economics? The answer comes from the &lt;strong>Solow growth model&lt;/strong> and its convergence prediction. The Solow model predicts that poorer countries should grow faster than richer ones, conditional on their structural characteristics (&lt;strong>beta convergence&lt;/strong>). Through a series of algebraic steps &amp;mdash; defining a persistence parameter, substituting observable country characteristics for the unobserved steady state, and adding fixed effects &amp;mdash; the convergence equation yields the following dynamic panel model:&lt;/p>
&lt;p>$$\ln y_{it} = \alpha \ln y_{i,t-1} + \beta' x_{it} + \eta_i + \zeta_t + v_{it}$$&lt;/p>
&lt;p>This is the &lt;strong>dynamic panel model&lt;/strong> that the Bayesian DSM package estimates. The coefficient $\alpha$ has a direct economic interpretation: it measures the &lt;strong>persistence of GDP&lt;/strong> across periods. A value of $\alpha$ close to 1 means slow convergence &amp;mdash; countries stay near their current income level for a long time. A value close to 0 means fast convergence &amp;mdash; countries quickly reach their steady state. Our BMA results will reveal $\alpha \approx 0.92$, indicating very slow convergence: after a decade, countries have closed only about 8% of the gap between their current GDP and their steady state.&lt;/p>
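&lt;p>A quick back-of-the-envelope translation of $\alpha \approx 0.92$ (one period = one decade in this panel):&lt;/p>

```python
import math

alpha = 0.92  # decadal persistence of log GDP from the BMA results

# Fraction of the gap to the steady state closed after one decade
closed_per_decade = 1 - alpha
# Decades until half of the initial gap is gone: alpha**d = 0.5
half_life = math.log(0.5) / math.log(alpha)
print(round(closed_per_decade, 2), round(half_life, 1))  # -> 0.08 8.3
```

So with $\alpha = 0.92$, countries close about 8% of the gap per decade, and it takes roughly eight decades to close even half of it &amp;mdash; very slow convergence indeed.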
&lt;p>The key insight is that the lagged dependent variable is not an ad hoc addition &amp;mdash; it arises directly from the Solow model&amp;rsquo;s convergence prediction. Any study of growth determinants that omits lagged GDP is implicitly assuming $\alpha = 0$, which means assuming &lt;em>instantaneous convergence&lt;/em> &amp;mdash; a prediction strongly rejected by the data. For the full step-by-step derivation from the Solow convergence equation, see Appendix B.&lt;/p>
&lt;h3 id="33-weak-exogeneity-and-the-role-of-each-component">3.3 Weak exogeneity and the role of each component&lt;/h3>
&lt;p>Each component of the dynamic panel equation plays a distinct role:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Lagged dependent variable&lt;/strong> ($y_{it-1}$): Think of this as a student&amp;rsquo;s previous exam score &amp;mdash; it captures all the accumulated history that got a country to its current level. After controlling for where a country &lt;em>was&lt;/em>, we can ask: among countries at the same starting point, which factors predict who grows faster?&lt;/li>
&lt;li>&lt;strong>Entity fixed effects&lt;/strong> ($\eta_i$): Like grading on a curve within each classroom &amp;mdash; these absorb time-invariant country traits such as geography, colonial history, and institutional heritage. We compare each country to its own average, not to other countries.&lt;/li>
&lt;li>&lt;strong>Time fixed effects&lt;/strong> ($\zeta_t$): These remove global shocks that affect all countries simultaneously, such as oil crises or the Asian financial crisis.&lt;/li>
&lt;/ul>
&lt;p>The key assumption is &lt;strong>weak exogeneity&lt;/strong>: current regressors can be correlated with &lt;em>past&lt;/em> shocks but not with the &lt;em>current&lt;/em> shock $v_{it}$. This is much weaker than strict exogeneity &amp;mdash; it allows past GDP growth to influence current investment (feedback effects) while requiring only that the current unexpected shock to GDP does not simultaneously cause changes in investment. In practical terms, weak exogeneity permits the realistic feedback loops that plague growth regressions while still allowing consistent estimation.&lt;/p>
&lt;p>To make this concrete, suppose an oil price shock in 1985 affects both GDP and trade openness simultaneously. Weak exogeneity allows this kind of contemporaneous correlation between regressors and the fixed effects. What it rules out is that the &lt;em>unexplained&lt;/em> part of today&amp;rsquo;s GDP shock &amp;mdash; the idiosyncratic error $v_{it}$ &amp;mdash; directly causes today&amp;rsquo;s investment to change within the same period.&lt;/p>
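&lt;p>Stated formally &amp;mdash; this is a standard way to write the assumption in the Moral-Benito methodology the package implements, with $x_i^t$ collecting the regressors up to period $t$ &amp;mdash; weak exogeneity is the conditional-mean restriction&lt;/p>
&lt;p>$$E\left(v_{it} \mid y_{i,t-1}, \ldots, y_{i0},\; x_i^{t},\; \eta_i\right) = 0$$&lt;/p>
&lt;p>The current shock is mean-independent of current and past regressors, while future regressors $x_{i,t+1}, x_{i,t+2}, \ldots$ remain free to respond to it &amp;mdash; exactly the feedback that strict exogeneity rules out.&lt;/p>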
&lt;h3 id="34-from-cross-sectional-to-dynamic-panel-bma">3.4 From cross-sectional to dynamic panel BMA&lt;/h3>
&lt;p>&lt;strong>Cross-sectional BMA&lt;/strong> uses a single time snapshot, assumes strict exogeneity, includes no lagged dependent variable, and has no fixed effects. &lt;strong>Dynamic panel BMA&lt;/strong> uses multiple time periods, requires only weak exogeneity, includes a lagged dependent variable, and controls for entity and time fixed effects. Both approaches address model uncertainty by averaging across all possible model specifications.&lt;/p>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion cross-sectional tutorial&lt;/a>, we averaged across 4,096 models of CO&lt;sub>2&lt;/sub> emissions using synthetic data. Here we apply the same BMA principle &amp;mdash; weighting models by how well they fit the data &amp;mdash; but to a panel of 73 countries over four decades, using the methodology that handles the endogeneity that cross-sectional BMA cannot.&lt;/p>
&lt;h2 id="4-the-dataset">4. The Dataset&lt;/h2>
&lt;h3 id="41-loading-the-data">4.1 Loading the data&lt;/h3>
&lt;p>The package includes two versions of the Moral-Benito (2016) economic growth dataset. The &lt;code>economic_growth&lt;/code> version has the lagged dependent variable already merged into the panel structure (with NAs in the initial period), while &lt;code>original_economic_growth&lt;/code> keeps it as a separate column.&lt;/p>
&lt;pre>&lt;code class="language-r">data(&amp;quot;economic_growth&amp;quot;)
data(&amp;quot;original_economic_growth&amp;quot;)
cat(&amp;quot;economic_growth:&amp;quot;, dim(economic_growth), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Countries:&amp;quot;, length(unique(economic_growth$country)), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Years:&amp;quot;, sort(unique(economic_growth$year)), &amp;quot;\n&amp;quot;)
head(economic_growth, 5)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">economic_growth: 365 12
Countries: 73
Years: 1960 1970 1980 1990 2000
# A tibble: 5 x 12
year country gdp ish sed pgrw pop ipr opem gsh lnlex polity
&amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 1960 1 8.25 NA NA NA NA NA NA NA NA NA
2 1970 1 8.37 0.122 0.139 0.0235 10.9 61.1 1.08 0.191 3.88 0.15
3 1980 1 8.54 0.207 0.141 0.0300 13.9 92.3 1.06 0.203 4.00 0.15
4 1990 1 8.63 0.203 0.28 0.0303 18.9 100. 0.898 0.232 4.10 0.15
5 2000 1 8.66 0.115 0.774 0.0215 25.3 81.2 0.636 0.219 4.21 0.575
&lt;/code>&lt;/pre>
&lt;p>The panel covers 73 countries observed at 10-year intervals from 1960 to 2000, yielding 5 periods per country (365 total rows, including the initial 1960 observation). The 1960 row for each country contains only the initial GDP level &amp;mdash; all regressors are NA because there is no &amp;ldquo;previous decade&amp;rdquo; to compute changes from. The four subsequent decades (1970&amp;ndash;2000) contain the 292 usable observations.&lt;/p>
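&lt;p>The row accounting is simple arithmetic:&lt;/p>

```python
countries, waves = 73, 5             # waves: 1960, 1970, 1980, 1990, 2000
total_rows = countries * waves       # includes the regressor-free 1960 rows
usable_obs = countries * (waves - 1) # 1970-2000 observations
print(total_rows, usable_obs)        # -> 365 292
```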
&lt;h3 id="42-variable-descriptions">4.2 Variable descriptions&lt;/h3>
&lt;p>The dataset contains the dependent variable (log GDP per capita) and 9 candidate growth determinants:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th style="text-align:center">Expected sign&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>gdp&lt;/code>&lt;/td>
&lt;td>Log real GDP per capita (dependent variable)&lt;/td>
&lt;td style="text-align:center">&amp;mdash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ish&lt;/code>&lt;/td>
&lt;td>Investment share of GDP&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>sed&lt;/code>&lt;/td>
&lt;td>Secondary school enrollment rate&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pgrw&lt;/code>&lt;/td>
&lt;td>Population growth rate&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pop&lt;/code>&lt;/td>
&lt;td>Population (millions)&lt;/td>
&lt;td style="text-align:center">?&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ipr&lt;/code>&lt;/td>
&lt;td>Investment price (relative to US)&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>opem&lt;/code>&lt;/td>
&lt;td>Trade openness (imports + exports / GDP)&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>gsh&lt;/code>&lt;/td>
&lt;td>Government consumption share of GDP&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>lnlex&lt;/code>&lt;/td>
&lt;td>Log life expectancy at birth&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>polity&lt;/code>&lt;/td>
&lt;td>Democracy index (0 = autocracy, 1 = democracy)&lt;/td>
&lt;td style="text-align:center">?&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>These variables are standard in the empirical growth literature, following Sala-i-Martin, Doppelhofer, and Miller (2004). Investment share and education are expected to have positive effects on growth, while population growth and government consumption are typically associated with slower growth. The signs for population and democracy are theoretically ambiguous.&lt;/p>
&lt;p>The 292 usable observations span 73 countries over four decades. Log GDP per capita ranges from 6.02 to 10.45, reflecting substantial income inequality &amp;mdash; the richest country is roughly 80 times wealthier than the poorest in per capita terms. Investment share averages 16.9% of GDP but ranges from 1.2% to 65.3%, indicating enormous variation in capital accumulation across countries and decades. Population growth averages 1.9% per year, with one country experiencing slight population decline (&amp;ndash;0.6%).&lt;/p>
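&lt;p>The &amp;ldquo;roughly 80 times&amp;rdquo; figure follows directly from the log scale: a gap in log points converts to a level ratio by exponentiating.&lt;/p>

```python
import math

# Gap between richest and poorest country in log GDP per capita
gap = 10.45 - 6.02
ratio = math.exp(gap)  # log-point gap -> level ratio
print(round(ratio))    # -> 84
```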
&lt;h2 id="5-data-preparation">5. Data Preparation&lt;/h2>
&lt;p>The Bayesian DSM package requires two data preprocessing steps before estimation: standardization (scaling) and demeaning (removing entity and time fixed effects). These steps ensure numerical stability and allow the model to focus on within-country, within-period variation.&lt;/p>
&lt;h3 id="51-understanding-the-data-structure">5.1 Understanding the data structure&lt;/h3>
&lt;p>If your data has the lagged dependent variable as a separate column (like &lt;code>original_economic_growth&lt;/code>), you first need to merge it into the panel structure using &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>join_lagged_col()&lt;/code>&lt;/a>. This function creates the initial period row with NAs:&lt;/p>
&lt;pre>&lt;code class="language-r"># Demonstration: converting original format to package format
eg_joined &amp;lt;- join_lagged_col(
df = original_economic_growth,
col = gdp,
col_lagged = lag_gdp,
timestamp_col = year,
entity_col = country,
timestep = 10 # 10-year intervals
)
cat(&amp;quot;Result:&amp;quot;, dim(eg_joined), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Result: 365 12
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>economic_growth&lt;/code> dataset already has this structure, so we can use it directly.&lt;/p>
&lt;h3 id="52-standardization-and-demeaning">5.2 Standardization and demeaning&lt;/h3>
&lt;p>Data preparation involves two calls to &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>feature_standardization()&lt;/code>&lt;/a>. The first call &lt;em>standardizes&lt;/em> all regressors to have mean zero and unit variance &amp;mdash; this puts all variables on the same scale so that the BMA coefficients are directly comparable. The second call &lt;em>demeans&lt;/em> by time period to remove time fixed effects.&lt;/p>
&lt;p>Think of demeaning by time as subtracting the global average for each decade. If every country&amp;rsquo;s GDP grew in the 1990s due to the tech boom, demeaning removes that common trend. What remains is each country&amp;rsquo;s deviation from the global pattern &amp;mdash; the variation that country-specific factors must explain.&lt;/p>
&lt;pre>&lt;code class="language-r"># Step 1: Standardize all regressors (mean=0, sd=1)
# Makes variables comparable: GDP and population are on vastly different scales
data_std &amp;lt;- feature_standardization(
df = economic_growth,
excluded_cols = c(country, year, gdp)
)
# Step 2: Demean by time period (remove time fixed effects)
# Subtracts each decade's global average, isolating country-specific variation
data_prepared &amp;lt;- feature_standardization(
df = data_std,
group_by_col = year,
excluded_cols = country,
scale = FALSE
)
head(data_prepared, 5)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"># A tibble: 5 x 12
year country gdp ish sed pgrw pop ipr opem gsh lnlex polity
&amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 1960 1 0.292 NA NA NA NA NA NA NA NA NA
2 1970 1 0.121 -0.493 -0.534 0.163 -0.151 -0.271 1.13 0.0496 -0.549 -0.578
3 1980 1 0.0573 0.241 -0.697 0.942 -0.181 0.0635 1.07 -0.0226 0.0167 -0.578
4 1990 1 0.0724 0.456 -0.932 1.09 -0.203 0.208 0.724 0.101 -0.0655 -0.578
5 2000 1 -0.0823 -0.505 -0.778 0.465 -0.218 -0.0620 -0.120 0.120 -0.107 0.112
&lt;/code>&lt;/pre>
&lt;p>After preparation, all regressor values are centered around zero. Country 1&amp;rsquo;s investment share (&lt;code>ish&lt;/code>) was 0.49 standard deviations below the global average in 1970 but 0.46 standard deviations above average in 1990, showing meaningful within-country variation over time. The GDP column retains its original scale because it is the dependent variable.&lt;/p>
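&lt;p>The time-demeaning step can be reproduced in miniature. In this toy sketch (simulated numbers, not the tutorial&amp;rsquo;s data), a common shock is added to every country in a given decade and then removed by subtracting each decade&amp;rsquo;s cross-country average:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
countries, decades = 73, 4

# Country-specific variation plus a global shock shared within each decade
decade_shock = np.array([0.0, 0.1, -0.2, 0.3])
gdp = rng.normal(size=(countries, decades)) + decade_shock  # broadcast over rows

# Demean by time period: subtract each decade's cross-country average
demeaned = gdp - gdp.mean(axis=0)

# Every decade now averages exactly zero -- the common shock is gone
print(np.allclose(demeaned.mean(axis=0), 0.0))  # -> True
```

This is the same operation the second `feature_standardization()` call performs (grouped by `year`, with `scale = FALSE`), just written out by hand.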
&lt;h2 id="6-estimating-the-full-model-space">6. Estimating the Full Model Space&lt;/h2>
&lt;p>With 9 candidate regressors, there are $2^9 = 512$ possible regression models. The package estimates every single one via numerical optimization of the &lt;em>marginal likelihood&lt;/em> &amp;mdash; the probability of observing the data given a particular model, after integrating out all parameter uncertainty. Think of this as a cooking competition with 512 recipes &amp;mdash; each uses a different combination of 9 ingredients, and the marginal likelihood scores each recipe by balancing flavor (fit) against unnecessary complexity (overfitting).&lt;/p>
&lt;p>To be concrete: model 1 might include only investment and education. Model 2 adds trade openness. Model 3 uses education and democracy but drops investment. Each of the 512 combinations gets its own likelihood estimated separately, and BMA weights them by how well they fit the data.&lt;/p>
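&lt;p>Counting the model space also makes a useful baseline clear: under a uniform prior over models, every variable appears in exactly half of them, so each PIP starts from a prior value of 0.5 before the data speak. A quick enumeration (index &lt;code>0&lt;/code> stands in for any one regressor, say investment):&lt;/p>

```python
from itertools import combinations

K = 9  # candidate regressors
# Enumerate every subset of the K regressors: 2^9 = 512 models
models = [m for r in range(K + 1) for m in combinations(range(K), r)]
containing_first = sum(1 for m in models if 0 in m)
print(len(models), containing_first)  # -> 512 256
```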
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>optim_model_space()&lt;/code>&lt;/a> function handles this computation. For the full 9-regressor case, this is the most computationally intensive step &amp;mdash; it can take several minutes depending on the machine. The package helpfully includes a precomputed &lt;code>full_model_space&lt;/code> object so we can skip the wait:&lt;/p>
&lt;pre>&lt;code class="language-r"># Load precomputed model space (or compute from scratch)
data(&amp;quot;full_model_space&amp;quot;)
# To compute from scratch (takes several minutes):
# full_model_space &amp;lt;- optim_model_space(
# df = data_prepared,
# dep_var_col = gdp,
# timestamp_col = year,
# entity_col = country,
# init_value = 0.5
# )
cat(&amp;quot;Parameters matrix:&amp;quot;, dim(full_model_space$params), &amp;quot;\n&amp;quot;)
cat(&amp;quot;Statistics matrix:&amp;quot;, dim(full_model_space$stats), &amp;quot;\n&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Parameters matrix: 106 512
Statistics matrix: 22 512
&lt;/code>&lt;/pre>
&lt;p>The result is a list with two elements. The &lt;code>$params&lt;/code> matrix contains 106 estimated parameters for each of the 512 models &amp;mdash; these include the structural parameters ($\alpha$, $\beta$), reduced-form parameters, and variance components. The &lt;code>$stats&lt;/code> matrix stores 22 statistics per model, including the log-likelihood, BIC, regular standard errors, and robust (heteroskedasticity-consistent) standard errors.&lt;/p>
&lt;p>Why use marginal likelihood instead of R-squared? Unlike R-squared, which always improves when you add variables, the marginal likelihood penalizes complexity. It accounts for the fact that more parameters make it easier to fit noise. A model with 9 regressors that barely improves fit over a 5-regressor model will receive a &lt;em>lower&lt;/em> marginal likelihood score &amp;mdash; the extra parameters were not worth the complexity cost.&lt;/p>
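&lt;p>The contrast is easy to demonstrate on simulated data. Below, four pure-noise regressors are appended to a one-variable model: R-squared can only go up, while a BIC-style penalized score (used here as a simple stand-in for the marginal likelihood, not the package&amp;rsquo;s actual computation) gets worse:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 292  # same sample size as the growth panel
x = rng.normal(size=(n, 1))
y = 0.5 * x[:, 0] + rng.normal(size=n)

def r2_and_bic(X):
    """R-squared and BIC of an OLS fit with intercept."""
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - rss / tss, n * np.log(rss / n) + Z.shape[1] * np.log(n)

r2_small, bic_small = r2_and_bic(x)
junk = rng.normal(size=(n, 4))  # four irrelevant regressors
r2_big, bic_big = r2_and_bic(np.column_stack([x, junk]))

# R-squared never decreases; the penalized score worsens (higher BIC = worse)
print(r2_big >= r2_small, bic_big > bic_small)
```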
&lt;p>Before jumping into BMA, let us first establish a benchmark using a standard regression approach &amp;mdash; this will help us appreciate what BMA adds.&lt;/p>
&lt;h2 id="7-benchmark-kitchen-sink-fixed-effects">7. Benchmark: Kitchen-Sink Fixed Effects&lt;/h2>
&lt;p>Before running BMA, it is useful to establish a benchmark. What happens if we simply throw all 9 regressors into a single fixed effects regression? This &amp;ldquo;kitchen-sink&amp;rdquo; approach is the default in applied work &amp;mdash; but it commits to one model specification and ignores the uncertainty about which variables belong.&lt;/p>
&lt;pre>&lt;code class="language-r"># Kitchen-sink FE regression with all 9 regressors
fe_full &amp;lt;- lm(gdp ~ lag_gdp + ish + sed + pgrw + pop + ipr +
opem + gsh + lnlex + polity +
factor(country) + factor(year),
data = original_economic_growth)
summary(fe_full)  # output below is abridged to the coefficient table
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">FE regression coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
lag_gdp 0.6188 0.0501 12.3521 0.0000
ish 0.4646 0.2331 1.9934 0.0475
sed 0.0162 0.0337 0.4798 0.6319
pgrw -2.3352 2.1409 -1.0907 0.2767
pop 0.0016 0.0004 4.5092 0.0000
ipr -0.0003 0.0003 -1.0817 0.2806
opem 0.1199 0.0379 3.1652 0.0018
gsh -0.7448 0.2700 -2.7585 0.0063
lnlex 0.1153 0.2440 0.4727 0.6369
polity -0.1656 0.0570 -2.9065 0.0041
Significant at 5%: lag_gdp, ish, pop, opem, gsh, polity
R-squared: 0.988
N observations: 292
&lt;/code>&lt;/pre>
&lt;p>The kitchen-sink model finds 6 of 10 variables significant at the 5% level: lagged GDP, investment share, population, trade openness, government share, and democracy. Education, population growth, investment price, and life expectancy are insignificant. But this result depends entirely on this particular specification &amp;mdash; drop one variable or add another, and the significance pattern may change. This is the model uncertainty problem that BMA is designed to solve.&lt;/p>
&lt;p>The lagged GDP coefficient of 0.619 is notably lower than the BMA posterior mean (0.919), suggesting that the kitchen-sink model&amp;rsquo;s coefficient estimates are pulled by multicollinearity among the 9 regressors. BMA handles this by averaging over specifications that include different subsets.&lt;/p>
&lt;p>Notice how the FE model forces a binary judgment: education is &amp;lsquo;insignificant&amp;rsquo; (p = 0.63) and trade is &amp;lsquo;significant&amp;rsquo; (p = 0.002). BMA replaces this all-or-nothing verdict with a nuanced probability scale: education has PIP = 0.72 (moderate evidence) and trade has PIP = 0.77 (positive evidence). The difference between &amp;lsquo;insignificant&amp;rsquo; and &amp;lsquo;moderate evidence&amp;rsquo; matters for policy &amp;mdash; a policymaker who ignores education entirely because of a p-value threshold may be discarding useful information.&lt;/p>
&lt;p>The kitchen-sink model commits to one specification and produces one set of p-values. But we saw that which variables look &amp;lsquo;significant&amp;rsquo; depends entirely on which others are in the model. Drop one variable, and the significance pattern reshuffles. BMA solves this by never committing to a single specification &amp;mdash; it averages over all 512, letting the data decide which matter most.&lt;/p>
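&lt;p>The specification-dependence problem is easy to reproduce with simulated data (a sketch, not the growth panel):&lt;/p>

```r
# Two near-duplicate regressors: each looks precise alone, but entering
# both inflates the standard errors, reshuffling the significance pattern.
set.seed(123)
n  = 200
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.1)     # near-copy of x1
y  = 1 + x1 + rnorm(n)

se_alone = summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_joint = summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
vif_x1   = 1 / (1 - cor(x1, x2)^2)   # variance inflation factor

c(se_alone = se_alone, se_joint = se_joint, vif = vif_x1)
```

&lt;p>With the near-duplicate present, the standard error on &lt;code>x1&lt;/code> grows by roughly the square root of the VIF, so a variable that is clearly significant in one specification can look insignificant in another. BMA sidesteps the problem by averaging over both specifications.&lt;/p>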
&lt;h2 id="8-bayesian-model-averaging">8. Bayesian Model Averaging&lt;/h2>
&lt;h3 id="81-running-bma">8.1 Running BMA&lt;/h3>
&lt;p>Now we can perform Bayesian Model Averaging across all 512 models. The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>bma()&lt;/code>&lt;/a> function takes the precomputed model space and the prepared data, weights each model by its posterior probability, and computes weighted averages of the coefficients:&lt;/p>
&lt;p>&lt;em>Focus on two columns: &lt;strong>PIP&lt;/strong> (how important is this variable?) and &lt;strong>%(+)&lt;/strong> (is its effect consistently positive or negative?).&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-r">bma_results &amp;lt;- bma(full_model_space, df = data_prepared, round = 3)
# Binomial prior results
print(bma_results[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.919 0.077 0.109 0.919 0.077 0.109 100.000
ish 0.773 0.063 0.045 0.062 0.082 0.034 0.059 100.000
sed 0.717 0.030 0.057 0.074 0.042 0.064 0.084 69.922
pgrw 0.714 0.018 0.030 0.052 0.025 0.033 0.060 99.609
pop 0.990 0.119 0.065 0.082 0.121 0.064 0.081 100.000
ipr 0.656 -0.034 0.033 0.044 -0.051 0.027 0.046 0.000
opem 0.766 0.034 0.030 0.033 0.044 0.026 0.031 100.000
gsh 0.751 -0.015 0.041 0.091 -0.020 0.046 0.104 30.859
lnlex 0.864 0.088 0.075 0.098 0.102 0.071 0.099 100.000
polity 0.678 -0.057 0.046 0.053 -0.084 0.030 0.044 0.000
&lt;/code>&lt;/pre>
&lt;p>The binomial prior results reveal a clear hierarchy among the 9 candidate regressors. Population size (&lt;code>pop&lt;/code>) dominates with PIP = 0.990 &amp;mdash; appearing in virtually every high-quality model &amp;mdash; followed by life expectancy (&lt;code>lnlex&lt;/code>) at 0.864 and investment share (&lt;code>ish&lt;/code>) at 0.773. At the other end, investment price (&lt;code>ipr&lt;/code>) at 0.656 and democracy (&lt;code>polity&lt;/code>) at 0.678 show the weakest evidence, though even these exceed 0.5. The lagged GDP coefficient of 0.919 confirms strong persistence: a country&amp;rsquo;s current GDP is heavily determined by its past GDP.&lt;/p>
&lt;h3 id="82-understanding-the-bma-statistics">8.2 Understanding the BMA statistics&lt;/h3>
&lt;p>Each column in the BMA output captures a different aspect of the evidence:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Beginner tip:&lt;/strong> For a first reading, focus on three columns: &lt;strong>PIP&lt;/strong> (does this variable matter?), &lt;strong>PM&lt;/strong> (what is its average effect?), and &lt;strong>%(+)&lt;/strong> (is the effect consistently positive or negative?). The remaining columns (PSDR, PMcon, PSDcon, PSDRcon) are useful for advanced robustness checks but can be skipped on a first pass.&lt;/p>
&lt;/blockquote>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Statistic&lt;/th>
&lt;th>Full name&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>PIP&lt;/strong>&lt;/td>
&lt;td>Posterior Inclusion Probability&lt;/td>
&lt;td>Fraction of posterior probability mass in models that include this variable. Think of it as a &lt;strong>batting average&lt;/strong>: PIP = 0.99 means the variable appeared in 99% of high-scoring models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PM&lt;/strong>&lt;/td>
&lt;td>Posterior Mean&lt;/td>
&lt;td>Weighted average of the coefficient across all models (including zeros from models that exclude the variable)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSD&lt;/strong>&lt;/td>
&lt;td>Posterior Standard Deviation&lt;/td>
&lt;td>Uncertainty around PM, incorporating both within-model and across-model variation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSDR&lt;/strong>&lt;/td>
&lt;td>Robust Posterior SD&lt;/td>
&lt;td>Counterpart of PSD computed from the robust (heteroskedasticity-consistent) standard errors&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PMcon&lt;/strong>&lt;/td>
&lt;td>Conditional Posterior Mean&lt;/td>
&lt;td>Average coefficient only across models that &lt;em>include&lt;/em> the variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PSDcon&lt;/strong>&lt;/td>
&lt;td>Conditional PSD&lt;/td>
&lt;td>Uncertainty conditional on inclusion&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>%(+)&lt;/strong>&lt;/td>
&lt;td>Positive sign share&lt;/td>
&lt;td>Percentage of models where the coefficient is positive. Values near 0% or 100% indicate stable sign&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The central quantity driving all these statistics is the &lt;strong>posterior model probability&lt;/strong> (PMP). Each model $M_j$ receives a weight proportional to its marginal likelihood times its prior probability:&lt;/p>
&lt;p>$$\mathbb{P}(M_j | \text{data}) = \frac{\exp(-\frac{1}{2} BIC_j) \cdot \mathbb{P}(M_j)}{\sum_{i=1}^{2^K} \exp(-\frac{1}{2} BIC_i) \cdot \mathbb{P}(M_i)}$$&lt;/p>
&lt;p>In words, this equation says that each model&amp;rsquo;s posterior probability is its prior probability times a data-fit term (approximated by the BIC), divided by the sum across all $2^K$ models to ensure the probabilities add to 1. Models that fit the data well without too many parameters receive higher posterior probability. The PIP for a variable is then the sum of PMPs across all models that include it.&lt;/p>
&lt;p>To make this concrete: if model A has BIC = &amp;ndash;800 and model B has BIC = &amp;ndash;798, model A fits the data better. After exponentiating and normalizing, model A receives about 73% of the posterior probability while model B gets 27%. The PIP of a variable included only in model A would then be at least 0.73.&lt;/p>
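&lt;p>The normalization is numerically delicate because exp(&amp;ndash;BIC/2) overflows for large |BIC|; working on a shifted log scale avoids this. A sketch of the two-model arithmetic (a BIC gap of 2 points translates into roughly 73/27 posterior odds):&lt;/p>

```r
# Posterior model probabilities from BICs, computed on a shifted log scale
bic  = c(A = -800, B = -798)
logw = -bic / 2                  # log of exp(-BIC/2)
pmp  = exp(logw - max(logw))    # shift by the max to avoid overflow
pmp  = pmp / sum(pmp)           # normalize so probabilities sum to 1
round(pmp, 3)                   # A: 0.731, B: 0.269
```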
&lt;p>The ratio &lt;strong>|PM/PSD|&lt;/strong> is a key robustness criterion. Raftery (1995) considers a variable &lt;em>robust&lt;/em> when |PM/PSD| &amp;gt; 1. More stringent thresholds include |PM/PSD| &amp;gt; 1.3 (Masanjala and Papageorgiou, 2008) and |PM/PSD| &amp;gt; 2 (Sala-i-Martin et al., 2004).&lt;/p>
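&lt;p>Applying the |PM/PSD| rule to a few posterior means and standard deviations from the binomial-prior table above:&lt;/p>

```r
# |PM/PSD| robustness ratios (values copied from the BMA output)
pm    = c(gdp_lag = 0.919, pop = 0.119, lnlex = 0.088, polity = -0.057)
psd   = c(gdp_lag = 0.077, pop = 0.065, lnlex = 0.075, polity = 0.046)
ratio = abs(pm / psd)
round(ratio, 2)   # gdp_lag 11.94, pop 1.83, lnlex 1.17, polity 1.24
```

&lt;p>All four clear Raftery&amp;rsquo;s threshold of 1, but only lagged GDP clears the stricter Sala-i-Martin et al. cutoff of 2.&lt;/p>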
&lt;h3 id="83-interpreting-pips-with-rafterys-classification">8.3 Interpreting PIPs with Raftery&amp;rsquo;s classification&lt;/h3>
&lt;p>Raftery (1995) provides a standard classification for the strength of evidence based on PIP values:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>PIP range&lt;/th>
&lt;th>Evidence&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&amp;gt; 0.99&lt;/td>
&lt;td>Very strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.95 &amp;ndash; 0.99&lt;/td>
&lt;td>Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.75 &amp;ndash; 0.95&lt;/td>
&lt;td>Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>0.50 &amp;ndash; 0.75&lt;/td>
&lt;td>Weak&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Under the binomial prior, &lt;code>pop&lt;/code> (PIP = 0.990) reaches &lt;em>strong&lt;/em> evidence &amp;mdash; just short of the &amp;ldquo;very strong&amp;rdquo; threshold at 0.99. Life expectancy (&lt;code>lnlex&lt;/code> at 0.864), investment share (&lt;code>ish&lt;/code> at 0.773), trade openness (&lt;code>opem&lt;/code> at 0.766), and government share (&lt;code>gsh&lt;/code> at 0.751) fall in the &lt;em>positive&lt;/em> evidence range. The remaining four variables &amp;mdash; education, population growth, investment price, and democracy &amp;mdash; show &lt;em>weak&lt;/em> evidence (0.65&amp;ndash;0.72). No variable has PIP below 0.5, suggesting the data supports relatively large models.&lt;/p>
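&lt;p>A small helper (not part of &lt;code>bdsm&lt;/code>; PIPs below 0.5, which Raftery&amp;rsquo;s table omits, are labeled &amp;ldquo;None&amp;rdquo; here for completeness) turns the classification into code:&lt;/p>

```r
# Map a PIP to Raftery's (1995) evidence category
classify_pip = function(pip) {
  cut(pip,
      breaks = c(0, 0.50, 0.75, 0.95, 0.99, 1),
      labels = c("None", "Weak", "Positive", "Strong", "Very strong"),
      include.lowest = TRUE)
}
classify_pip(c(0.990, 0.864, 0.773, 0.656))
# Strong, Positive, Positive, Weak -- matching the discussion above
```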
&lt;p>The &lt;strong>sign stability&lt;/strong> column (%(+)) provides an additional robustness check. Seven of the nine regressors have essentially stable signs: investment share, population, trade openness, and life expectancy are always positive (100%), population growth is positive in 99.6% of models, and investment price and democracy are always negative (0%). Government share has %(+) = 30.9%, meaning its sign is negative in about 70% of models &amp;mdash; moderately unstable. Education has %(+) = 69.9%, with a positive coefficient in about 70% of models but negative in 30%.&lt;/p>
&lt;p>The following chart visualizes the PIPs with color-coded evidence tiers. We first define a dark-theme palette and extract the BMA statistics into a data frame, then build the plot:&lt;/p>
&lt;pre>&lt;code class="language-r"># Dark theme palette (matching site navbar/footer)
DARK_BG &amp;lt;- &amp;quot;#0f1729&amp;quot;
LIGHT_TEXT &amp;lt;- &amp;quot;#c8d0e0&amp;quot;
LIGHTER_TEXT &amp;lt;- &amp;quot;#e8ecf2&amp;quot;
# Extract BMA statistics into a data frame
bma_tab &amp;lt;- bma_results[[1]]
pip_df &amp;lt;- data.frame(
variable = rownames(bma_tab)[-1],
pip = bma_tab[-1, &amp;quot;PIP&amp;quot;],
pm = bma_tab[-1, &amp;quot;PM&amp;quot;],
psd = bma_tab[-1, &amp;quot;PSD&amp;quot;],
sign_pos = bma_tab[-1, &amp;quot;%(+)&amp;quot;]
)
# Readable labels and robustness classification
var_labels &amp;lt;- c(ish = &amp;quot;Investment share&amp;quot;, sed = &amp;quot;Education&amp;quot;,
pgrw = &amp;quot;Population growth&amp;quot;, pop = &amp;quot;Population&amp;quot;,
ipr = &amp;quot;Investment price&amp;quot;, opem = &amp;quot;Trade openness&amp;quot;,
gsh = &amp;quot;Government share&amp;quot;, lnlex = &amp;quot;Life expectancy&amp;quot;,
polity = &amp;quot;Democracy&amp;quot;)
pip_df$label &amp;lt;- var_labels[pip_df$variable]
pip_df$robustness &amp;lt;- cut(pip_df$pip,
breaks = c(0, 0.50, 0.75, 1),
labels = c(&amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot;, &amp;quot;Moderate (0.50-0.75)&amp;quot;,
&amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot;),
include.lowest = TRUE)
# PIP bar chart
ggplot(pip_df, aes(x = reorder(label, pip), y = pip,
fill = robustness)) +
geom_col(width = 0.65) +
geom_hline(yintercept = 0.75, linetype = &amp;quot;dashed&amp;quot;,
color = LIGHT_TEXT) +
geom_hline(yintercept = 0.50, linetype = &amp;quot;dotted&amp;quot;,
color = LIGHT_TEXT, alpha = 0.6) +
coord_flip() +
scale_fill_manual(values = c(
&amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Moderate (0.50-0.75)&amp;quot; = &amp;quot;#00d4c8&amp;quot;,
&amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot; = &amp;quot;#d97757&amp;quot;)) +
labs(x = NULL, y = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;,
fill = &amp;quot;Evidence strength&amp;quot;,
title = &amp;quot;BMA: Posterior Inclusion Probabilities&amp;quot;,
subtitle = &amp;quot;Binomial prior (EMS = 4.5), 512 models averaged&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_pip.png" alt="Posterior Inclusion Probabilities for all 9 regressors, sorted by PIP with threshold lines.">&lt;/p>
&lt;p>Population dominates the chart at PIP = 0.990, followed by life expectancy at 0.864. Five variables clear the 0.75 &amp;ldquo;positive evidence&amp;rdquo; threshold, while the remaining four &amp;mdash; democracy, education, population growth, and investment price &amp;mdash; fall in the &amp;ldquo;moderate&amp;rdquo; zone between 0.50 and 0.75. Compared to the kitchen-sink benchmark where 6 of 10 variables were significant at 5%, BMA paints a more nuanced picture: it grades each variable on a continuous scale of importance rather than imposing a binary significant/insignificant cutoff.&lt;/p>
&lt;h2 id="9-visualizing-model-probabilities">9. Visualizing Model Probabilities&lt;/h2>
&lt;h3 id="91-prior-versus-posterior-model-probabilities">9.1 Prior versus posterior model probabilities&lt;/h3>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>model_pmp()&lt;/code>&lt;/a> function visualizes how the data transforms our prior beliefs about which models are best. The prior assigns probability to each of the 512 models, and the data concentrates posterior mass on the models that fit best:&lt;/p>
&lt;pre>&lt;code class="language-r">pmp_plots &amp;lt;- model_pmp(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_03_model_pmp_combined.png" alt="Prior and posterior model probabilities across all 512 models.">&lt;/p>
&lt;p>The prior (dashed line) is relatively flat, reflecting the uniform prior assumption. The posterior (solid line) concentrates dramatically: a handful of models capture the bulk of the posterior mass, while most models receive negligible probability. This concentration is the signature of informative data &amp;mdash; the 73-country, 4-decade panel provides enough information to strongly favor certain model specifications.&lt;/p>
&lt;h3 id="92-model-sizes">9.2 Model sizes&lt;/h3>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>model_sizes()&lt;/code>&lt;/a> function shows the distribution of prior and posterior probabilities across model sizes (number of included regressors, excluding the lagged dependent variable):&lt;/p>
&lt;pre>&lt;code class="language-r">size_plots &amp;lt;- model_sizes(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_05_model_sizes.png" alt="Prior and posterior distribution over model sizes.">&lt;/p>
&lt;p>The expected model sizes confirm this visually:&lt;/p>
&lt;pre>&lt;code class="language-r">print(bma_results[[16]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Prior models size Posterior model size
Binomial 4.5 6.908
Binomial-beta 4.5 8.556
&lt;/code>&lt;/pre>
&lt;p>The posterior strongly favors larger models. While the binomial prior centers mass on models with 4&amp;ndash;5 regressors (EMS = 4.5), the posterior shifts toward 7 regressors (6.908). Under the binomial-beta prior, the shift is even more dramatic: the posterior expected model size reaches 8.556, meaning the data wants to include nearly all 9 candidate regressors. This is consistent with the finding that all variables have PIP above 0.65 &amp;mdash; the data sees signal in most candidates.&lt;/p>
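&lt;p>The posterior expected model size has a simple connection to the PIPs: since each PIP is the probability that one regressor is included, the expected number of included regressors is just their sum. Cross-checking with the rounded binomial-prior PIPs from Section 8:&lt;/p>

```r
# Posterior expected model size = sum of posterior inclusion probabilities
pips = c(ish = 0.773, sed = 0.717, pgrw = 0.714, pop = 0.990, ipr = 0.656,
         opem = 0.766, gsh = 0.751, lnlex = 0.864, polity = 0.678)
sum(pips)   # about 6.91, matching the reported 6.908
```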
&lt;h2 id="10-examining-top-models">10. Examining Top Models&lt;/h2>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>best_models()&lt;/code>&lt;/a> function lets us inspect the specific variable combinations and coefficient estimates in the top-ranked models:&lt;/p>
&lt;pre>&lt;code class="language-r">best8 &amp;lt;- best_models(bma_results, criterion = 1, best = 8)
print(best8[[1]]) # Inclusion matrix
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Reading the inclusion matrix: each column is a model (ranked by fit), each row is a variable. A value of 1 means the variable is included in that model. Look for variables that appear in every top model &amp;mdash; those are the most robust.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-text"> 'No. 1' 'No. 2' 'No. 3' 'No. 4' 'No. 5' 'No. 6' 'No. 7' 'No. 8'
gdp_lag 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
ish 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
sed 1.000 1.000 1.000 0.000 1.000 1.000 1.000 1.000
pgrw 1.000 1.000 1.000 1.000 0.000 1.000 1.000 1.000
pop 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
ipr 1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000
opem 1.000 1.000 1.000 1.000 1.000 1.000 0.000 1.000
gsh 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000
lnlex 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
polity 1.000 1.000 0.000 1.000 1.000 1.000 1.000 1.000
PMP 0.089 0.044 0.042 0.036 0.035 0.029 0.026 0.025
&lt;/code>&lt;/pre>
&lt;p>A striking pattern emerges: the top model includes &lt;em>all 9 regressors&lt;/em> (PMP = 8.9%), and the next 7 best models are each formed by dropping exactly one variable from the full set. This &amp;ldquo;kitchen sink minus one&amp;rdquo; pattern confirms that the data supports large models.&lt;/p>
&lt;p>Two variables are never dropped across the top 8 models: &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> &amp;mdash; they appear in all 8, consistent with their high PIPs of 0.990 and 0.864. The variables dropped in models 2&amp;ndash;8 are &lt;code>ipr&lt;/code>, &lt;code>polity&lt;/code>, &lt;code>sed&lt;/code>, &lt;code>pgrw&lt;/code>, &lt;code>gsh&lt;/code>, &lt;code>opem&lt;/code>, and &lt;code>ish&lt;/code> &amp;mdash; precisely the variables with the lowest PIPs.&lt;/p>
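&lt;p>Because a PIP is the sum of PMPs over the models containing a variable, the inclusion matrix yields quick lower bounds &lt;/em>&lt;em>&lt;/em>&amp;mdash;&lt;em>&lt;/em> loose ones, since the remaining 504 models hold roughly two thirds of the posterior mass:&lt;/p>

```r
# PMPs of the top 8 models, read off the inclusion matrix above
pmp = c(0.089, 0.044, 0.042, 0.036, 0.035, 0.029, 0.026, 0.025)
top8_mass  = sum(pmp)       # about 0.326 of posterior mass in 8 of 512 models
pip_pop_lb = sum(pmp)       # pop is in all 8, so PIP(pop) is at least 0.326
pip_ish_lb = sum(pmp[-8])   # ish is absent only from No. 8: at least 0.301
c(top8_mass, pip_pop_lb, pip_ish_lb)
```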
&lt;p>We can also examine the coefficient estimates in the best model using the knitr-formatted output:&lt;/p>
&lt;pre>&lt;code class="language-r"># Estimation results for the best model (knitr format)
print(best8[[5]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Best model (No. 1) estimates:
gdp_lag 0.954 (0.076)*** pop 0.065 (0.056)
ish 0.079 (0.032)** ipr -0.056 (0.027)**
sed 0.034 (0.065) opem 0.043 (0.025)*
pgrw 0.025 (0.033) gsh -0.043 (0.050)
lnlex 0.151 (0.060)** polity -0.092 (0.032)***
&lt;/code>&lt;/pre>
&lt;p>In the best model (No. 1), the lagged GDP coefficient is 0.954 (SE = 0.076, significant at 1%), confirming the very slow convergence we derived from the Solow model. Investment share has a positive and significant coefficient of 0.079, while democracy has a negative and highly significant coefficient of &amp;ndash;0.092. Life expectancy is positive and significant at 0.151. Education, despite being included in 7 of the top 8 models, has a coefficient of 0.034 with a standard error nearly twice as large (0.065) &amp;mdash; explaining its moderate PIP despite frequent inclusion.&lt;/p>
&lt;p>This combination &amp;mdash; high inclusion rate but imprecise coefficient &amp;mdash; happens when most models agree that education &lt;em>belongs&lt;/em> in the model but disagree about its magnitude. Some specifications estimate a positive effect of about +0.08, others a negative one of about &amp;ndash;0.02. The variable is probably relevant, but the data does not pin down its direction.&lt;/p>
&lt;p>Beyond these top models, how do the coefficients distribute across all 512 specifications? The next section examines the full posterior distributions.&lt;/p>
&lt;h2 id="11-coefficient-distributions">11. Coefficient Distributions&lt;/h2>
&lt;p>Before examining individual coefficient distributions, it is helpful to see all posterior means and their uncertainty at a glance. We compute approximate 95% credible intervals as the posterior mean plus or minus two posterior standard deviations:&lt;/p>
&lt;pre>&lt;code class="language-r"># Approximate 95% credible intervals
pip_df$ci_low &amp;lt;- pip_df$pm - 2 * pip_df$psd
pip_df$ci_high &amp;lt;- pip_df$pm + 2 * pip_df$psd
# Coefficient point-range plot
ggplot(pip_df, aes(x = reorder(label, pip), y = pm,
color = robustness)) +
geom_hline(yintercept = 0, linetype = &amp;quot;solid&amp;quot;,
color = LIGHT_TEXT, alpha = 0.4) +
geom_pointrange(aes(ymin = ci_low, ymax = ci_high),
size = 0.6, linewidth = 0.8) +
coord_flip() +
scale_color_manual(values = c(
&amp;quot;Positive (PIP &amp;gt;= 0.75)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Moderate (0.50-0.75)&amp;quot; = &amp;quot;#00d4c8&amp;quot;,
&amp;quot;Weak (PIP &amp;lt; 0.50)&amp;quot; = &amp;quot;#d97757&amp;quot;)) +
labs(x = NULL, y = &amp;quot;Posterior Mean Coefficient&amp;quot;,
color = &amp;quot;Evidence strength&amp;quot;,
title = &amp;quot;BMA: Posterior Coefficient Estimates&amp;quot;,
subtitle = &amp;quot;Points = posterior mean, bars = PM +/- 2*PSD&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_coef.png" alt="Posterior coefficient estimates with approximate 95% credible intervals for all 9 regressors.">&lt;/p>
&lt;p>Population and life expectancy have the largest positive posterior means, with credible intervals that do not cross zero &amp;mdash; consistent with their high PIPs. Democracy (polity) has a clearly negative effect, also with an interval that excludes zero. Investment price is negative but with a wider interval. Education and government share have credible intervals that straddle zero, reflecting sign instability. Compared to the kitchen-sink FE model, BMA produces posterior means that account for model uncertainty: the intervals are wider than standard confidence intervals because they incorporate variation &lt;em>across&lt;/em> model specifications, not just &lt;em>within&lt;/em> a single specification.&lt;/p>
&lt;p>The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>coef_hist()&lt;/code>&lt;/a> function provides more detailed views of the full posterior distribution of each coefficient across all 512 models, weighted by posterior model probability:&lt;/p>
&lt;pre>&lt;code class="language-r">coef_plots &amp;lt;- coef_hist(bma_results)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Population&lt;/strong> &amp;mdash; the most robust determinant:&lt;/p>
&lt;pre>&lt;code class="language-r">print(coef_plots[[5]])
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_09_coef_hist_pop.png" alt="Posterior coefficient distribution for population.">&lt;/p>
&lt;p>Population has a tight, entirely positive distribution centered around 0.12, confirming strong and stable evidence for a positive effect on growth.&lt;/p>
&lt;p>These results hold under the default binomial prior. But how sensitive are they to our choice of prior? The next section stress-tests the findings.&lt;/p>
&lt;h2 id="12-sensitivity-to-prior-specification">12. Sensitivity to Prior Specification&lt;/h2>
&lt;p>A critical step in any BMA analysis is checking whether the results change when we alter our prior beliefs. If a variable&amp;rsquo;s PIP is high under one prior but low under another, we should be cautious about declaring it a robust determinant. The following chart compares PIPs across three prior specifications at a glance:&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract PIPs from three prior specifications
bma_tab_bb &amp;lt;- bma_results[[2]] # Binomial-beta
bma_tab_ems2 &amp;lt;- bma_ems2[[1]] # Skeptical (EMS = 2), estimated separately (Section 12.2)
sens_df &amp;lt;- data.frame(
label = pip_df$label,
Binomial = pip_df$pip,
BinBeta = bma_tab_bb[-1, &amp;quot;PIP&amp;quot;],
EMS2 = bma_tab_ems2[-1, &amp;quot;PIP&amp;quot;])
# Pivot to long format for ggplot
sens_long &amp;lt;- sens_df %&amp;gt;%
pivot_longer(cols = c(Binomial, BinBeta, EMS2),
names_to = &amp;quot;prior&amp;quot;, values_to = &amp;quot;pip&amp;quot;) %&amp;gt;%
mutate(prior = factor(prior,
levels = c(&amp;quot;EMS2&amp;quot;, &amp;quot;Binomial&amp;quot;, &amp;quot;BinBeta&amp;quot;),
labels = c(&amp;quot;Skeptical (EMS=2)&amp;quot;, &amp;quot;Binomial (EMS=4.5)&amp;quot;,
&amp;quot;Binomial-Beta&amp;quot;)))
# Connecting segments showing the range across priors
seg_df &amp;lt;- sens_df %&amp;gt;%
mutate(pip_min = pmin(Binomial, BinBeta, EMS2),
pip_max = pmax(Binomial, BinBeta, EMS2))
# Dumbbell chart
ggplot() +
geom_vline(xintercept = 0.75, linetype = &amp;quot;dashed&amp;quot;,
color = LIGHT_TEXT) +
geom_vline(xintercept = 0.50, linetype = &amp;quot;dotted&amp;quot;,
color = LIGHT_TEXT, alpha = 0.6) +
geom_segment(data = seg_df,
aes(x = pip_min, xend = pip_max,
y = reorder(label, Binomial),
yend = reorder(label, Binomial)),
color = LIGHT_TEXT, alpha = 0.3, linewidth = 1.5) +
geom_point(data = sens_long,
aes(x = pip, y = reorder(label, pip), color = prior),
size = 3.5) +
scale_color_manual(values = c(
&amp;quot;Skeptical (EMS=2)&amp;quot; = &amp;quot;#d97757&amp;quot;,
&amp;quot;Binomial (EMS=4.5)&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Binomial-Beta&amp;quot; = &amp;quot;#00d4c8&amp;quot;)) +
labs(x = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;, y = NULL,
color = &amp;quot;Model prior&amp;quot;,
title = &amp;quot;Prior Sensitivity: How Robust Are the PIPs?&amp;quot;,
subtitle = &amp;quot;Same data, three different prior specifications&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_dynamic_bma_sensitivity.png" alt="Prior sensitivity: PIPs under three different prior specifications.">&lt;/p>
&lt;p>The width of each horizontal segment shows how much a variable&amp;rsquo;s PIP changes across priors. Population is rock-solid: its PIP barely moves (0.964&amp;ndash;0.998) regardless of the prior. Life expectancy shows moderate sensitivity (0.637&amp;ndash;0.974). The bottom four variables &amp;mdash; democracy, education, population growth, and investment price &amp;mdash; are the most sensitive, with PIPs ranging from 0.34 to 0.94 depending on the prior. This visual makes the key message immediately clear: &lt;strong>only population and life expectancy are robust across all prior specifications&lt;/strong>.&lt;/p>
&lt;h3 id="121-binomial-versus-binomial-beta-prior">12.1 Binomial versus binomial-beta prior&lt;/h3>
&lt;p>The default analysis already computes both priors. The &lt;strong>binomial prior&lt;/strong> assigns each variable an independent probability of inclusion equal to EMS/K (where EMS is the expected model size and K is the number of regressors). The &lt;strong>binomial-beta prior&lt;/strong> is more flexible &amp;mdash; it places a prior on the inclusion probability itself, allowing the data to determine how many variables should be included.&lt;/p>
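&lt;p>A sketch of the binomial prior&amp;rsquo;s arithmetic (the helper function is illustrative, not a &lt;code>bdsm&lt;/code> function). With the default EMS = 4.5 and K = 9, each regressor enters with probability 0.5, so all 512 models receive identical prior probability &amp;mdash; which is why the prior line in the PMP plot is flat:&lt;/p>

```r
# Prior probability of a model with k regressors under the binomial prior:
# each of the K candidates enters independently with probability ems/K.
prior_model_prob = function(k, K = 9, ems = 4.5) {
  theta = ems / K
  theta^k * (1 - theta)^(K - k)
}
prior_model_prob(0)   # 1/512: the empty model
prior_model_prob(9)   # 1/512: the full model, so the prior is flat
prior_model_prob(2, ems = 2) / prior_model_prob(9, ems = 2)
# a skeptical EMS = 2 prior instead heavily favors small models
```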
&lt;p>Under the binomial-beta prior, all PIPs increase substantially. Population reaches 0.998, life expectancy reaches 0.974, and even the lowest-ranked variable (investment price) reaches 0.924. The posterior expected model size jumps to 8.556 &amp;mdash; the binomial-beta prior allows the data to express its preference for large models even more strongly than the binomial prior.&lt;/p>
&lt;p>Comparing PIPs across the two priors:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">PIP (Binomial)&lt;/th>
&lt;th style="text-align:center">PIP (Binomial-Beta)&lt;/th>
&lt;th style="text-align:center">Sign&lt;/th>
&lt;th style="text-align:center">Evidence strength&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>pop&lt;/td>
&lt;td style="text-align:center">0.990&lt;/td>
&lt;td style="text-align:center">0.998&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Very strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>lnlex&lt;/td>
&lt;td style="text-align:center">0.864&lt;/td>
&lt;td style="text-align:center">0.974&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ish&lt;/td>
&lt;td style="text-align:center">0.773&lt;/td>
&lt;td style="text-align:center">0.954&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>opem&lt;/td>
&lt;td style="text-align:center">0.766&lt;/td>
&lt;td style="text-align:center">0.952&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>gsh&lt;/td>
&lt;td style="text-align:center">0.751&lt;/td>
&lt;td style="text-align:center">0.948&lt;/td>
&lt;td style="text-align:center">&amp;ndash;/+&lt;/td>
&lt;td style="text-align:center">Positive → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>sed&lt;/td>
&lt;td style="text-align:center">0.717&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">+/&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>pgrw&lt;/td>
&lt;td style="text-align:center">0.714&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>polity&lt;/td>
&lt;td style="text-align:center">0.678&lt;/td>
&lt;td style="text-align:center">0.929&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ipr&lt;/td>
&lt;td style="text-align:center">0.656&lt;/td>
&lt;td style="text-align:center">0.924&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Weak → Strong&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The ranking is stable across priors &amp;mdash; &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> remain the top two, and &lt;code>ipr&lt;/code> and &lt;code>polity&lt;/code> remain the bottom two. However, the absolute PIP values depend heavily on the prior, with the binomial-beta prior being far more inclusive. This is expected: the binomial-beta prior concentrates mass on larger models when the data supports them.&lt;/p>
&lt;h3 id="122-varying-expected-model-size">12.2 Varying expected model size&lt;/h3>
&lt;p>The expected model size (EMS) controls how many regressors the prior expects to be relevant. The default EMS = K/2 = 4.5. Let us see what happens with a skeptical prior (EMS = 2, expecting only 2 of 9 regressors to matter) and a generous prior (EMS = 8):&lt;/p>
&lt;p>With the skeptical EMS = 2 prior, only &lt;code>pop&lt;/code> (PIP = 0.964) and &lt;code>lnlex&lt;/code> (PIP = 0.637) remain above 0.5 under the binomial prior. Investment share drops to 0.483 and democracy falls to 0.372. This tells us that population and life expectancy are the most robust determinants &amp;mdash; they survive even when the prior is heavily biased toward sparse models.&lt;/p>
&lt;p>With EMS = 8, all PIPs exceed 0.94 &amp;mdash; nearly identical to the binomial-beta results, confirming that the data&amp;rsquo;s preference for large models is consistent across prior specifications.&lt;/p>
&lt;p>Full output tables for each prior specification are in Appendix C.&lt;/p>
&lt;h3 id="123-dilution-prior">12.3 Dilution prior&lt;/h3>
&lt;p>Imagine two variables that measure almost the same thing &amp;mdash; say, &amp;lsquo;years of schooling&amp;rsquo; and &amp;lsquo;literacy rate.&amp;rsquo; Including both in a model is redundant, and any model that includes both gets an inflated likelihood simply because it has two ways to capture the same variation.&lt;/p>
&lt;p>When regressors are correlated with each other, standard priors can overcount evidence by giving high probability to models that include near-duplicate variables. The &lt;strong>dilution prior&lt;/strong> (George, 2010) penalizes models whose regressors are highly correlated, adjusting the model prior by the determinant of the correlation matrix:&lt;/p>
&lt;p>$$\mathbb{P}_D(M_j) \propto \mathbb{P}(M_j) \cdot |COR_j|^{\omega}$$&lt;/p>
&lt;p>In words, this formula says that the diluted prior for model $j$ equals the standard prior multiplied by a penalty term. The penalty is the determinant of the correlation matrix among model $j$&amp;rsquo;s regressors, raised to the power $\omega$. When regressors are highly correlated, this determinant is close to zero, pushing the diluted prior toward zero. The parameter $\omega$ controls the strength of the penalty (default = 0.5).&lt;/p>
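&lt;p>A minimal numeric sketch of the penalty (in Python, with made-up correlation matrices) shows how quickly near-duplicate regressors are penalized:&lt;/p>
&lt;pre>&lt;code class="language-python"># Dilution penalty |COR_j|^omega for two hypothetical two-regressor models
import numpy as np

def dilution_penalty(cor, omega=0.5):
    # determinant of the correlation matrix of the model regressors, to the omega
    return np.linalg.det(np.asarray(cor)) ** omega

near_duplicates = [[1.0, 0.95], [0.95, 1.0]]   # e.g. schooling and literacy
unrelated       = [[1.0, 0.10], [0.10, 1.0]]

print(round(dilution_penalty(near_duplicates), 3))   # 0.312, a heavy penalty
print(round(dilution_penalty(unrelated), 3))         # 0.995, almost no penalty
&lt;/code>&lt;/pre>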
&lt;pre>&lt;code class="language-r"># Dilution prior with default omega = 0.5
bma_dil &amp;lt;- bma(full_model_space, df = data_prepared,
               round = 3, dilution = 1)
print(bma_dil[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.919 0.077 0.107 0.919 0.077 0.107 100.000
ish 0.718 0.058 0.046 0.062 0.081 0.034 0.059 100.000
sed 0.640 0.026 0.055 0.070 0.041 0.064 0.084 69.922
pgrw 0.653 0.017 0.030 0.050 0.026 0.034 0.060 99.609
pop 0.989 0.125 0.065 0.082 0.126 0.064 0.081 100.000
ipr 0.638 -0.033 0.033 0.044 -0.052 0.027 0.045 0.000
opem 0.743 0.034 0.030 0.033 0.046 0.026 0.031 100.000
gsh 0.740 -0.013 0.040 0.090 -0.017 0.046 0.104 30.859
lnlex 0.808 0.081 0.075 0.098 0.100 0.071 0.099 100.000
polity 0.598 -0.049 0.047 0.053 -0.083 0.030 0.044 0.000
&lt;/code>&lt;/pre>
&lt;p>The dilution prior modestly reduces PIPs compared to the standard binomial prior &amp;mdash; for example, &lt;code>ish&lt;/code> drops from 0.773 to 0.718, and &lt;code>polity&lt;/code> drops from 0.678 to 0.598. The posterior expected model size decreases from 6.91 to 6.53. Importantly, the ranking remains unchanged: &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> stay at the top, and the sign stability is unaffected. The dilution prior provides a useful robustness check against multicollinearity inflation.&lt;/p>
&lt;pre>&lt;code class="language-r">sizes_dil &amp;lt;- model_sizes(bma_dil)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="r_bdsm_16_sizes_dilution.png" alt="Model sizes under the dilution prior.">&lt;/p>
&lt;p>Having examined the evidence from every angle &amp;mdash; PIPs, coefficients, and sensitivity &amp;mdash; let us now synthesize the findings.&lt;/p>
&lt;h2 id="13-summary-of-findings">13. Summary of Findings&lt;/h2>
&lt;h3 id="131-the-robust-determinants">13.1 The robust determinants&lt;/h3>
&lt;p>Combining evidence across all prior specifications, we can classify each regressor by its robustness:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:center">PIP (Bin.)&lt;/th>
&lt;th style="text-align:center">PIP (Bin-Beta)&lt;/th>
&lt;th style="text-align:center">PIP (EMS=2)&lt;/th>
&lt;th style="text-align:center">Sign&lt;/th>
&lt;th style="text-align:center">Verdict&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>pop&lt;/td>
&lt;td style="text-align:center">0.990&lt;/td>
&lt;td style="text-align:center">0.998&lt;/td>
&lt;td style="text-align:center">0.964&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">&lt;strong>Robust&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>lnlex&lt;/td>
&lt;td style="text-align:center">0.864&lt;/td>
&lt;td style="text-align:center">0.974&lt;/td>
&lt;td style="text-align:center">0.637&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">&lt;strong>Robust&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ish&lt;/td>
&lt;td style="text-align:center">0.773&lt;/td>
&lt;td style="text-align:center">0.954&lt;/td>
&lt;td style="text-align:center">0.483&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>opem&lt;/td>
&lt;td style="text-align:center">0.766&lt;/td>
&lt;td style="text-align:center">0.952&lt;/td>
&lt;td style="text-align:center">0.468&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>gsh&lt;/td>
&lt;td style="text-align:center">0.751&lt;/td>
&lt;td style="text-align:center">0.948&lt;/td>
&lt;td style="text-align:center">0.459&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Positive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>sed&lt;/td>
&lt;td style="text-align:center">0.717&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">0.420&lt;/td>
&lt;td style="text-align:center">+/&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>pgrw&lt;/td>
&lt;td style="text-align:center">0.714&lt;/td>
&lt;td style="text-align:center">0.938&lt;/td>
&lt;td style="text-align:center">0.414&lt;/td>
&lt;td style="text-align:center">+&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>polity&lt;/td>
&lt;td style="text-align:center">0.678&lt;/td>
&lt;td style="text-align:center">0.929&lt;/td>
&lt;td style="text-align:center">0.372&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ipr&lt;/td>
&lt;td style="text-align:center">0.656&lt;/td>
&lt;td style="text-align:center">0.924&lt;/td>
&lt;td style="text-align:center">0.344&lt;/td>
&lt;td style="text-align:center">&amp;ndash;&lt;/td>
&lt;td style="text-align:center">Sensitive&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Bottom line:&lt;/strong> If you are advising a government on growth policy, population dynamics and public health (life expectancy) are the two levers with the strongest evidence across all modeling assumptions. Investment and trade openness show promise under the default prior but become ambiguous under skeptical specifications. Education and democracy &amp;mdash; despite their intuitive appeal &amp;mdash; are fragile in this framework.&lt;/p>
&lt;/blockquote>
&lt;p>Only two variables &amp;mdash; &lt;strong>population&lt;/strong> and &lt;strong>life expectancy&lt;/strong> &amp;mdash; survive as robust determinants across all prior specifications, maintaining PIP above 0.5 even under the most skeptical prior (EMS = 2). Both have stable positive signs and their coefficients are precisely estimated. Investment share and trade openness show positive evidence under the default prior but become ambiguous under the skeptical prior.&lt;/p>
&lt;h3 id="132-connecting-to-cross-sectional-results">13.2 Connecting to cross-sectional results&lt;/h3>
&lt;p>In the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion cross-sectional tutorial&lt;/a>, we found that BMA, LASSO, and WALS converged on the same set of robust variables for CO&lt;sub>2&lt;/sub> emissions in synthetic data. The dynamic panel BMA analysis here reveals an important nuance: &lt;strong>controlling for reverse causality through the lagged dependent variable and fixed effects changes the landscape of robust determinants&lt;/strong>. The strong persistence of GDP (lagged coefficient = 0.92) absorbs much of the cross-sectional variation, leaving fewer variables with strong independent explanatory power. This is exactly the kind of insight that cross-sectional BMA misses.&lt;/p>
&lt;h2 id="14-conclusion">14. Conclusion&lt;/h2>
&lt;h3 id="141-key-takeaways">14.1 Key takeaways&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Method insight:&lt;/strong> Dynamic panel BMA handles endogeneity that cross-sectional BMA cannot. By including a lagged dependent variable ($\alpha$ = 0.92) and entity/time fixed effects, the &lt;code>bdsm&lt;/code> package allows BMA to work with weakly exogenous regressors, avoiding the bias that plagues standard growth regressions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data insight:&lt;/strong> Of 9 candidate growth determinants, only population size (PIP = 0.990) and life expectancy (PIP = 0.864) are robust across all prior specifications. This confirms the &amp;ldquo;fragility&amp;rdquo; of growth determinants documented by Sala-i-Martin et al. (2004) &amp;mdash; most variables that appear important in one specification become ambiguous under different priors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sensitivity insight:&lt;/strong> Results are moderately sensitive to prior choice. Under the skeptical EMS = 2 prior, only &lt;code>pop&lt;/code> (PIP = 0.964) remains very strong, while even &lt;code>lnlex&lt;/code> drops to 0.637. The binomial-beta prior pushes all variables above PIP = 0.92, reflecting the data&amp;rsquo;s preference for large models (posterior EMS = 8.6).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Jointness insight:&lt;/strong> All regressor pairs are complements (HCGHM &amp;gt; 0), with the strongest complementarity between population and life expectancy (0.71). No substitution effects were detected, suggesting these growth determinants capture distinct dimensions of the development process. See Appendix A for the full jointness analysis.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="142-limitations-and-next-steps">14.2 Limitations and next steps&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Computation cost:&lt;/strong> The &lt;code>optim_model_space()&lt;/code> step estimates all $2^K$ models via numerical optimization. With 9 regressors (512 models), this is feasible. With 15+ regressors ($2^{15}$ = 32,768 models), computation time grows exponentially. For larger variable sets, Markov Chain Monte Carlo (MCMC) sampling over the model space may be necessary.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weak exogeneity assumption:&lt;/strong> While weaker than strict exogeneity, the weak exogeneity assumption still requires that current regressors are uncorrelated with current shocks. If contemporaneous feedback is strong (e.g., a GDP shock immediately changes investment in the same period), the estimates may still be biased.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Extensions:&lt;/strong> The package offers additional features not covered here, including parallel computing for faster model space estimation (&lt;code>cl&lt;/code> parameter in &lt;code>optim_model_space()&lt;/code>), robust standard errors for heteroskedasticity, and the full suite of reduced-form parameters for understanding the dynamic feedback structure.&lt;/p>
&lt;/li>
&lt;/ul>
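&lt;p>The first limitation is easy to quantify. This back-of-envelope sketch (in Python; the 0.1 seconds per model is a hypothetical figure, not a benchmark of the package) shows how exhaustive enumeration scales with $K$:&lt;/p>
&lt;pre>&lt;code class="language-python"># Exhaustive model-space size and hypothetical runtime as K grows
per_model_seconds = 0.1   # assumed cost of one numerical optimization

for K in (9, 12, 15, 18):
    models = 2 ** K
    minutes = round(models * per_model_seconds / 60, 1)
    print(K, models, minutes)
&lt;/code>&lt;/pre>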
&lt;h3 id="143-exercises">14.3 Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Vary the dilution parameter.&lt;/strong> Run &lt;code>bma()&lt;/code> with &lt;code>dilution = 1&lt;/code> and &lt;code>dil.Par = 2&lt;/code> (stronger dilution). How do the PIPs change compared to &lt;code>dil.Par = 0.5&lt;/code>? Which variables are most affected by multicollinearity adjustment?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Examine the small model space.&lt;/strong> Use &lt;code>small_model_space&lt;/code> with only &lt;code>ish&lt;/code>, &lt;code>sed&lt;/code>, and &lt;code>pgrw&lt;/code>. Run the full BMA workflow (including &lt;code>model_pmp()&lt;/code>, &lt;code>model_sizes()&lt;/code>, &lt;code>best_models()&lt;/code>, and &lt;code>jointness()&lt;/code>). Do the PIP rankings change when the competition among regressors is limited to 3?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compare standard and robust standard errors.&lt;/strong> Run &lt;code>best_models()&lt;/code> with &lt;code>robust = TRUE&lt;/code> and compare the coefficient significance to the default (regular SE). Are there variables that lose or gain significance under robust inference?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="appendix-a-jointness-analysis">Appendix A: Jointness Analysis&lt;/h2>
&lt;h3 id="what-is-jointness">What is jointness?&lt;/h3>
&lt;p>So far we have examined each regressor individually. But growth determinants do not work in isolation &amp;mdash; they interact. &lt;strong>Jointness&lt;/strong> measures whether two regressors tend to appear in models &lt;em>together&lt;/em> (complements) or &lt;em>separately&lt;/em> (substitutes).&lt;/p>
&lt;p>Think of peanut butter and jelly: each is fine alone, but they show up together so often that their inclusion is correlated. In growth regressions, investment and trade openness might be complements &amp;mdash; countries that invest heavily also trade more, and models that capture one effect benefit from including the other. Conversely, two measures of education (enrollment and literacy) might be substitutes &amp;mdash; including one makes the other redundant.&lt;/p>
&lt;h3 id="three-jointness-measures">Three jointness measures&lt;/h3>
&lt;p>The package implements three jointness measures. The &lt;a href="https://cran.r-project.org/web/packages/bdsm/vignettes/bdsm_vignette.Rnw" target="_blank" rel="noopener">&lt;code>jointness()&lt;/code>&lt;/a> function computes pairwise relationships between all regressors:&lt;/p>
&lt;p>&lt;strong>Hofmarcher et al. (HCGHM)&lt;/strong> ranges from &amp;ndash;1 (perfect substitutes) to +1 (perfect complements), with 0 indicating independence. This is the recommended default measure.&lt;/p>
&lt;p>&lt;strong>Ley-Steel (LS)&lt;/strong> ranges from 0 to infinity, where higher values indicate stronger complementarity.&lt;/p>
&lt;p>&lt;strong>Doppelhofer-Weeks (DW)&lt;/strong> classifies relationships as: below &amp;ndash;2 (strong substitutes), &amp;ndash;2 to &amp;ndash;1 (significant substitutes), &amp;ndash;1 to 1 (unrelated), 1 to 2 (significant complements), above 2 (strong complements).&lt;/p>
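&lt;p>The DW classification can be written as a small lookup. A Python sketch using the thresholds above:&lt;/p>
&lt;pre>&lt;code class="language-python"># Map a Doppelhofer-Weeks statistic to its qualitative category
from bisect import bisect

DW_LABELS = ['strong substitutes', 'significant substitutes', 'unrelated',
             'significant complements', 'strong complements']

def dw_category(dw):
    # thresholds at -2, -1, 1, 2 split the real line into five bands
    return DW_LABELS[bisect([-2, -1, 1, 2], dw)]

print(dw_category(0.153))   # 'unrelated'
print(dw_category(2.7))     # 'strong complements'
&lt;/code>&lt;/pre>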
&lt;h3 id="jointness-matrices">Jointness matrices&lt;/h3>
&lt;p>The HCGHM jointness matrix (above diagonal = binomial prior, below diagonal = binomial-beta prior):&lt;/p>
&lt;pre>&lt;code class="language-r">jointness(bma_results, measure = &amp;quot;HCGHM&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> ish sed pgrw pop ipr opem gsh lnlex polity
ish NA 0.216 0.207 0.530 0.150 0.262 0.243 0.366 0.181
sed 0.805 NA 0.154 0.421 0.115 0.199 0.189 0.288 0.125
pgrw 0.805 0.778 NA 0.416 0.124 0.198 0.186 0.283 0.131
pop 0.905 0.874 0.874 NA 0.304 0.517 0.489 0.711 0.346
ipr 0.781 0.756 0.758 0.845 NA 0.153 0.138 0.209 0.102
opem 0.829 0.801 0.802 0.902 0.780 NA 0.241 0.372 0.169
gsh 0.821 0.794 0.794 0.893 0.772 0.819 NA 0.340 0.154
lnlex 0.864 0.835 0.835 0.944 0.810 0.863 0.853 NA 0.227
polity 0.790 0.763 0.764 0.855 0.744 0.787 0.779 0.817 NA
&lt;/code>&lt;/pre>
&lt;p>All HCGHM values are positive, meaning every pair of regressors acts as complements rather than substitutes. The strongest complementarity under the binomial prior (above diagonal) is between &lt;code>pop&lt;/code> and &lt;code>lnlex&lt;/code> at 0.711 &amp;mdash; population size and life expectancy tend to appear in the best models together. The &lt;code>pop&lt;/code>-&lt;code>ish&lt;/code> pair (0.530) and &lt;code>pop&lt;/code>-&lt;code>opem&lt;/code> pair (0.517) are also moderately complementary. Investment price (&lt;code>ipr&lt;/code>) shows the weakest complementarity with other variables, consistent with its lowest PIP.&lt;/p>
&lt;p>Under the binomial-beta prior (below diagonal), all jointness values increase substantially &amp;mdash; reaching 0.944 for the &lt;code>pop&lt;/code>-&lt;code>lnlex&lt;/code> pair. This is because the binomial-beta prior favors larger models, making it more likely that any two variables appear together.&lt;/p>
&lt;p>The Doppelhofer-Weeks measure confirms these patterns: all pairwise DW values fall between &amp;ndash;1 and +1, with the strongest relationship again between population and life expectancy (DW = 0.153).&lt;/p>
&lt;h2 id="appendix-b-solow-convergence-derivation">Appendix B: Solow Convergence Derivation&lt;/h2>
&lt;p>The Solow model predicts that poorer countries should grow faster than richer ones, conditional on their structural characteristics. This is called &lt;strong>beta convergence&lt;/strong>. Mathematically, the model implies that around the steady state, log GDP per capita evolves according to (Barro and Sala-i-Martin, 2004):&lt;/p>
&lt;p>$$\ln y_{it} = (1 - e^{-\lambda \tau}) \ln y^*_i + e^{-\lambda \tau} \ln y_{i,t-1}$$&lt;/p>
&lt;p>In words, a country&amp;rsquo;s current GDP ($\ln y_{it}$) is a weighted average of two forces: its long-run steady-state level ($\ln y^*_i$), determined by fundamentals like savings and technology, and its GDP in the previous period ($\ln y_{i,t-1}$), which captures where the country currently stands. The parameter $\lambda$ is the &lt;strong>speed of convergence&lt;/strong> &amp;mdash; how fast countries close the gap to their steady state &amp;mdash; and $\tau$ is the time between observations (10 years in our data).&lt;/p>
&lt;p>Now define $\alpha = e^{-\lambda \tau}$. The convergence equation becomes:&lt;/p>
&lt;p>$$\ln y_{it} = \alpha \ln y_{i,t-1} + (1 - \alpha) \ln y^*_i$$&lt;/p>
&lt;p>This is already a dynamic equation &amp;mdash; current GDP depends on lagged GDP. The next step is to recognize that the steady state $\ln y^*_i$ is not observed directly. Instead, it depends on country characteristics such as investment rates, education, trade openness, and institutional quality. Writing these as $\beta' x_{it}$, and adding country fixed effects ($\eta_i$) for unobserved fundamentals, time effects ($\zeta_t$) for global shocks, and an error term ($v_{it}$), we arrive at the dynamic panel equation presented in Section 3.2.&lt;/p>
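&lt;p>With $\alpha$ in hand, the convergence speed follows directly from the definition $\alpha = e^{-\lambda \tau}$, i.e. $\lambda = -\ln(\alpha)/\tau$. A quick numeric check (in Python) using the posterior mean $\alpha \approx 0.92$ reported in this tutorial:&lt;/p>
&lt;pre>&lt;code class="language-python"># Implied convergence speed and half-life from the estimated alpha
from math import log

alpha = 0.92   # posterior mean of the lagged-GDP coefficient
tau = 10       # years between observations

lam = -log(alpha) / tau
half_life = log(2) / lam   # years to close half the gap to the steady state

print(round(lam, 4))         # about 0.0083, i.e. 0.83 percent per year
print(round(half_life, 1))   # about 83 years
&lt;/code>&lt;/pre>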
&lt;h2 id="appendix-c-full-sensitivity-output">Appendix C: Full Sensitivity Output&lt;/h2>
&lt;h3 id="binomial-beta-prior">Binomial-beta prior&lt;/h3>
&lt;pre>&lt;code class="language-r"># Binomial-beta results (already computed)
print(bma_results[[2]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.943 0.078 0.130 0.943 0.078 0.130 100.000
ish 0.954 0.076 0.036 0.066 0.079 0.032 0.065 100.000
sed 0.938 0.035 0.063 0.094 0.037 0.064 0.097 69.922
pgrw 0.938 0.024 0.033 0.059 0.026 0.033 0.061 99.609
pop 0.998 0.080 0.062 0.083 0.080 0.062 0.083 100.000
ipr 0.924 -0.050 0.030 0.052 -0.054 0.027 0.052 0.000
opem 0.952 0.041 0.026 0.034 0.043 0.025 0.034 100.000
gsh 0.948 -0.034 0.049 0.120 -0.036 0.049 0.123 30.859
lnlex 0.974 0.134 0.069 0.105 0.138 0.066 0.104 100.000
polity 0.929 -0.084 0.038 0.053 -0.090 0.031 0.049 0.000
&lt;/code>&lt;/pre>
&lt;h3 id="skeptical-prior-ems--2">Skeptical prior (EMS = 2)&lt;/h3>
&lt;pre>&lt;code class="language-r"># Skeptical prior: EMS = 2
bma_ems2 &amp;lt;- bma(full_model_space, df = data_prepared, round = 3, EMS = 2)
print(bma_ems2[[1]])
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> PIP PM PSD PSDR PMcon PSDcon PSDRcon %(+)
gdp_lag NA 0.922 0.081 0.102 0.922 0.081 0.102 100.000
ish 0.483 0.042 0.050 0.059 0.088 0.034 0.057 100.000
sed 0.420 0.015 0.046 0.057 0.036 0.065 0.084 69.922
pgrw 0.414 0.009 0.025 0.040 0.023 0.034 0.061 99.609
pop 0.964 0.144 0.066 0.082 0.149 0.061 0.079 100.000
ipr 0.344 -0.019 0.031 0.037 -0.055 0.028 0.045 0.000
opem 0.468 0.024 0.032 0.033 0.052 0.026 0.030 100.000
gsh 0.459 -0.003 0.032 0.071 -0.007 0.047 0.105 30.859
lnlex 0.637 0.051 0.068 0.087 0.081 0.069 0.097 100.000
polity 0.372 -0.029 0.042 0.046 -0.079 0.031 0.043 0.000
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1162/REST_a_00154" target="_blank" rel="noopener">Moral-Benito, E. (2012). Determinants of Economic Growth: A Bayesian Panel Data Approach. &lt;em>Review of Economics and Statistics&lt;/em>, 94(2), 566&amp;ndash;579.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/07350015.2013.818003" target="_blank" rel="noopener">Moral-Benito, E. (2013). Likelihood-Based Estimation of Dynamic Panels with Predetermined Regressors. &lt;em>Journal of Business and Economic Statistics&lt;/em>, 31(4), 451&amp;ndash;472.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.2429" target="_blank" rel="noopener">Moral-Benito, E. (2016). Growth Empirics in Panel Data Under Model Uncertainty and Weak Exogeneity. &lt;em>Journal of Applied Econometrics&lt;/em>, 31(3), 582&amp;ndash;602.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://cran.r-project.org/web/packages/bdsm/index.html" target="_blank" rel="noopener">Wyszynski, M., Beck, K., and Dubel, M. (2025). Bayesian Dynamic Systems Modeling. R package version 0.3.0. CRAN.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1257/0002828042002570" target="_blank" rel="noopener">Sala-i-Martin, X., Doppelhofer, G., and Miller, R.I. (2004). Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach. &lt;em>American Economic Review&lt;/em>, 94(4), 813&amp;ndash;835.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.623" target="_blank" rel="noopener">Fernandez, C., Ley, E., and Steel, M.F.J. (2001). Model Uncertainty in Cross-Country Growth Regressions. &lt;em>Journal of Applied Econometrics&lt;/em>, 16(5), 563&amp;ndash;576.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.1046" target="_blank" rel="noopener">Doppelhofer, G. and Weeks, M. (2009). Jointness of Growth Determinants. &lt;em>Journal of Applied Econometrics&lt;/em>, 24(2), 209&amp;ndash;244.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.1057" target="_blank" rel="noopener">Ley, E. and Steel, M.F.J. (2009). On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regression. &lt;em>Journal of Applied Econometrics&lt;/em>, 24(4), 651&amp;ndash;674.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery, A.E. (1995). Bayesian Model Selection in Social Research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Taming Model Uncertainty in the Environmental Kuznets Curve: BMA and Double-Selection LASSO with Panel Data</title><link>https://carlos-mendez.org/post/stata_bma_dsl/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_bma_dsl/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Can countries grow their way out of pollution? The &lt;strong>Environmental Kuznets Curve (EKC)&lt;/strong> hypothesis says yes &amp;mdash; up to a point. As economies develop, pollution first rises with industrialization and then falls as countries grow wealthy enough to afford cleaner technology. But recent research suggests a more complex &lt;strong>inverted-N&lt;/strong> shape: pollution falls at very low incomes, rises through industrialization, and then falls again at high incomes.&lt;/p>
&lt;p>Testing for this shape requires a cubic polynomial in GDP per capita &amp;mdash; and beyond GDP, many other factors might affect CO&lt;sub>2&lt;/sub> emissions. With 12 candidate control variables, there are $2^{12} = 4{,}096$ possible regression models. &lt;strong>Which model should we estimate?&lt;/strong> This is the &lt;strong>model uncertainty problem&lt;/strong>.&lt;/p>
&lt;p>This tutorial introduces two principled solutions:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> estimates thousands of models and averages the results, weighting each by how well it fits the data. Each variable gets a &lt;strong>Posterior Inclusion Probability (PIP)&lt;/strong> &amp;mdash; the fraction of high-quality models that include it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Post-Double-Selection LASSO (DSL)&lt;/strong> uses LASSO to automatically select which controls matter &amp;mdash; once for the outcome, once for each variable of interest &amp;mdash; then runs OLS with the union of all selected controls. This &amp;ldquo;select, then regress&amp;rdquo; approach protects against omitted variable bias.&lt;/p>
&lt;/li>
&lt;/ol>
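&lt;p>The four-step logic of DSL can be sketched in a few lines of Python (the tutorial itself uses Stata&amp;rsquo;s &lt;code>dsregress&lt;/code>; this sketch assumes scikit-learn is available and uses simulated data unrelated to the tutorial&amp;rsquo;s dataset):&lt;/p>
&lt;pre>&lt;code class="language-python"># Post-double-selection LASSO: select, union, then OLS
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, k = 400, 12
X = rng.normal(size=(n, k))        # candidate controls
d = X[:, 0] + rng.normal(size=n)   # variable of interest, driven by control 0
y = 2.0 * d + X[:, 0] + X[:, 1] + rng.normal(size=n)   # true effect of d is 2

sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)  # step 1: LASSO on outcome
sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)  # step 2: LASSO on d
keep = sorted(set(sel_y) | set(sel_d))                 # step 3: union of selections
Z = np.column_stack([np.ones(n), d, X[:, keep]])       # step 4: OLS with the union
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(round(beta[1], 2))   # estimate of the effect of d, close to 2
&lt;/code>&lt;/pre>
&lt;p>Selecting controls twice &amp;mdash; once for the outcome and once for the variable of interest &amp;mdash; is what protects the final OLS step against omitted variable bias.&lt;/p>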
&lt;p>We use &lt;strong>synthetic panel data&lt;/strong> with a known &amp;ldquo;answer key&amp;rdquo; &amp;mdash; we designed the data so that 5 controls truly affect CO&lt;sub>2&lt;/sub> and 7 are pure noise. This lets us grade each method: does it correctly identify the true predictors? The data is inspired by the panel dataset of Gravina and Lanzafame (2025) but is fully synthetic and not identical to the original.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Companion tutorial.&lt;/strong> For a cross-sectional perspective using R with BMA, LASSO, and WALS, see the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">R tutorial on variable selection&lt;/a>.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the EKC hypothesis and why a cubic polynomial tests for an inverted-N shape&lt;/li>
&lt;li>Recognize model uncertainty as a practical challenge when many controls are available&lt;/li>
&lt;li>Implement BMA with &lt;code>bmaregress&lt;/code> and interpret PIPs and coefficient densities&lt;/li>
&lt;li>Implement post-double-selection LASSO with &lt;code>dsregress&lt;/code> and understand its four-step algorithm: LASSO on outcome, LASSO on each variable of interest, union, then OLS&lt;/li>
&lt;li>Evaluate both methods against a known ground truth to assess their accuracy&lt;/li>
&lt;/ul>
&lt;p>The following diagram summarizes the methodological sequence of this tutorial. We begin with exploratory data analysis to visualize the raw income&amp;ndash;pollution relationship, then estimate baseline fixed effects regressions to expose the model uncertainty problem. Next, we apply BMA and DSL as two alternative solutions, and finally compare both methods against the known answer key.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;EDA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Scatter plot&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;Baseline FE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Standard panel&amp;lt;br/&amp;gt;regressions&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model&amp;lt;br/&amp;gt;Averaging&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;DSL&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Double-Selection&amp;lt;br/&amp;gt;LASSO&amp;quot;]
D --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;Comparison&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Check against&amp;lt;br/&amp;gt;answer key&amp;quot;]
style A fill:#141413,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#00d4c8,stroke:#141413,color:#141413
style E fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="2-setup-and-synthetic-data">2. Setup and Synthetic Data&lt;/h2>
&lt;h3 id="21-why-synthetic-data">2.1 Why synthetic data?&lt;/h3>
&lt;p>Real-world datasets rarely come with an answer key. We never know which control variables &lt;em>truly&lt;/em> belong in the model. By generating synthetic data with a known data-generating process (DGP), we can verify whether BMA and DSL correctly recover the truth. This is the same &amp;ldquo;answer key&amp;rdquo; approach used in the &lt;a href="https://carlos-mendez.org/post/r_bma_lasso_wals/">companion R tutorial&lt;/a>, applied here to panel data.&lt;/p>
&lt;h3 id="22-the-data-generating-process">2.2 The data-generating process&lt;/h3>
&lt;p>The outcome &amp;mdash; log CO&lt;sub>2&lt;/sub> per capita &amp;mdash; follows a cubic EKC with country and year fixed effects:&lt;/p>
&lt;p>$$\ln(\text{CO2})_{it} = \beta_1 \ln(\text{GDP})_{it} + \beta_2 [\ln(\text{GDP})_{it}]^2 + \beta_3 [\ln(\text{GDP})_{it}]^3 + \mathbf{X}_{it}^{\text{true}} \boldsymbol{\gamma} + \alpha_i + \delta_t + \varepsilon_{it}$$&lt;/p>
&lt;p>In words, log CO&lt;sub>2&lt;/sub> depends on a cubic function of log GDP (producing the inverted-N shape), five true control variables $\mathbf{X}^{\text{true}}$, country fixed effects $\alpha_i$, year fixed effects $\delta_t$, and random noise $\varepsilon_{it}$.&lt;/p>
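&lt;p>To see why a cubic can produce an inverted-N, consider illustrative coefficients (made up for this sketch, not the actual DGP values) whose derivative changes sign twice over the observed income range:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sign of the slope of b1*x + b2*x**2 + b3*x**3 at three income levels
import numpy as np

b1, b2, b3 = -2.64, 0.285, -0.01   # chosen so the turning points sit at x = 8 and x = 11
x = np.array([7.0, 9.5, 12.0])     # low-income, middle-income, high-income
slope_at = 3 * b3 * x**2 + 2 * b2 * x + b1
print(np.sign(slope_at))           # falling, then rising, then falling
&lt;/code>&lt;/pre>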
&lt;p>The &lt;strong>answer key&lt;/strong> &amp;mdash; which variables are true predictors and which are noise:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Group&lt;/th>
&lt;th>In DGP?&lt;/th>
&lt;th>True coef.&lt;/th>
&lt;th>GDP corr.&lt;/th>
&lt;th>Role&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>fossil_fuel&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.015&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More fossil fuels → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>renewable&lt;/code>&lt;/td>
&lt;td>Energy&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More renewables → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>urban&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.007&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More urbanization → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>democracy&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;0.005&lt;/td>
&lt;td>low&lt;/td>
&lt;td>More democracy → less CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>industry&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong>&lt;/td>
&lt;td>+0.010&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>More industry → more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>globalization&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>pop_density&lt;/code>&lt;/td>
&lt;td>Socio&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>corruption&lt;/code>&lt;/td>
&lt;td>Institutional&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>services&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>&lt;strong>high&lt;/strong>&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>trade&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>fdi&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>low&lt;/td>
&lt;td>Noise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>credit&lt;/code>&lt;/td>
&lt;td>Economic&lt;/td>
&lt;td>No&lt;/td>
&lt;td>0&lt;/td>
&lt;td>moderate&lt;/td>
&lt;td>Noise &amp;mdash; tricky (correlated with GDP)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The &amp;ldquo;GDP corr.&amp;rdquo; column is key to understanding why this problem is non-trivial. Four noise variables (&lt;code>globalization&lt;/code>, &lt;code>services&lt;/code>, &lt;code>trade&lt;/code>, &lt;code>credit&lt;/code>) are deliberately correlated with GDP. A naive regression would find them &amp;ldquo;significant&amp;rdquo; because they piggyback on GDP&amp;rsquo;s true effect. The challenge for BMA and DSL is to see through this correlation and correctly identify that only the 5 true controls belong in the model.&lt;/p>
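&lt;p>The piggyback mechanism is easy to simulate. This Python sketch (illustrative only, not the tutorial&amp;rsquo;s DGP) shows a noise variable that looks important until GDP is partialled out:&lt;/p>
&lt;pre>&lt;code class="language-python"># A GDP-correlated noise variable picks up a spurious slope
import numpy as np

rng = np.random.default_rng(1)
n = 1600
gdp = rng.normal(size=n)
noise_var = 0.9 * gdp + 0.3 * rng.normal(size=n)   # plays the role of globalization
y = 1.0 * gdp + rng.normal(size=n)                 # y truly depends on GDP only

def slope(x, target):
    # simple bivariate regression slope
    return np.cov(x, target)[0, 1] / np.var(x)

print(round(slope(noise_var, y), 2))      # large: it piggybacks on GDP
resid = y - slope(gdp, y) * gdp           # partial GDP out of y first
print(round(slope(noise_var, resid), 2))  # near zero: no independent effect
&lt;/code>&lt;/pre>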
&lt;p>With the DGP and answer key defined, we now load the synthetic data and set up the Stata environment.&lt;/p>
&lt;h3 id="23-load-the-data">2.3 Load the data&lt;/h3>
&lt;p>The synthetic data is hosted on GitHub for reproducibility. It was generated by &lt;code>generate_data.do&lt;/code> (see the link above).&lt;/p>
&lt;pre>&lt;code class="language-stata">* Load synthetic data from GitHub
import delimited &amp;quot;https://github.com/cmg777/starter-academic-v501/raw/master/content/post/stata_bma_dsl/synthetic_ekc_panel.csv&amp;quot;, clear
xtset country_id year, yearly
&lt;/code>&lt;/pre>
&lt;h3 id="24-define-macros">2.4 Define macros&lt;/h3>
&lt;p>We define all variable groups as global macros &amp;mdash; used in every command throughout the tutorial:&lt;/p>
&lt;pre>&lt;code class="language-stata">global outcome &amp;quot;ln_co2&amp;quot;
global gdp_vars &amp;quot;ln_gdp ln_gdp_sq ln_gdp_cb&amp;quot;
global energy &amp;quot;fossil_fuel renewable&amp;quot;
global socio &amp;quot;urban globalization pop_density&amp;quot;
global inst &amp;quot;democracy corruption&amp;quot;
global econ &amp;quot;industry services trade fdi credit&amp;quot;
global controls &amp;quot;$energy $socio $inst $econ&amp;quot;
global fe &amp;quot;i.country_id i.year&amp;quot;
* Ground truth (for evaluation)
global true_vars &amp;quot;fossil_fuel renewable urban democracy industry&amp;quot;
global noise_vars &amp;quot;globalization pop_density corruption services trade fdi credit&amp;quot;
&lt;/code>&lt;/pre>
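&lt;p>Because every later command depends on these macros, a quick expansion check can catch typos early. This small verification is our addition, assuming the macros defined above:&lt;/p>
&lt;pre>&lt;code class="language-stata">* The 12 candidate controls should split into 5 true + 7 noise variables
display wordcount(&amp;quot;$controls&amp;quot;)       // expect 12
assert wordcount(&amp;quot;$true_vars&amp;quot;)  == 5
assert wordcount(&amp;quot;$noise_vars&amp;quot;) == 7
&lt;/code>&lt;/pre>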
&lt;pre>&lt;code class="language-stata">summarize $outcome $gdp_vars $controls
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ln_co2 | 1,600 -19.0385 .7863276 -21.03685 -16.8315
ln_gdp | 1,600 9.58387 1.329675 6.974263 11.9704
ln_gdp_sq | 1,600 93.6174 25.55106 48.64035 143.2904
ln_gdp_cb | 1,600 931.105 373.829 339.2306 1715.243
fossil_fuel | 1,600 54.7724 19.14168 6.36807 95
renewable | 1,600 29.5413 11.96568 1 64.2207
urban | 1,600 53.6742 14.778 15.95174 91.63234
globalizat~n | 1,600 57.6498 12.71537 26.75758 95
pop_density | 1,600 121.344 210.2646 1 1571.771
democracy | 1,600 2.33346 4.179503 -6.12244 10
corruption | 1,600 52.3523 28.52792 0 100
industry | 1,600 24.6433 6.180478 5.843938 45.32926
services | 1,600 43.5598 9.366089 17.82623 64.07455
trade | 1,600 67.4355 19.36148 10.04306 128.0595
fdi | 1,600 2.98237 4.373857 -11.50437 16.19903
credit | 1,600 53.4402 18.20204 11.32991 123.2399
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 1,600 observations from 80 countries over 20 years (1995&amp;ndash;2014). Log GDP per capita ranges from 6.97 to 11.97, spanning the full income spectrum from about \$1,065 to \$158,000 in synthetic international dollars. Log CO&lt;sub>2&lt;/sub> has a mean of &amp;ndash;19.04 with substantial variation (standard deviation 0.79), reflecting the wide range of development levels in our synthetic panel. With the data loaded, we next visualize the raw income&amp;ndash;pollution relationship.&lt;/p>
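&lt;p>The dollar figures quoted above are simply the exponentiated endpoints of the log-GDP range. A one-line check (our addition):&lt;/p>
&lt;pre>&lt;code class="language-stata">* Convert the log GDP range back to (synthetic) international dollars
quietly summarize ln_gdp
display &amp;quot;Min GDP = &amp;quot; %9.0fc exp(r(min)) &amp;quot;   Max GDP = &amp;quot; %9.0fc exp(r(max))
&lt;/code>&lt;/pre>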
&lt;h2 id="3-exploratory-data-analysis">3. Exploratory Data Analysis&lt;/h2>
&lt;p>Before modeling, let us look at the raw relationship between income and emissions.&lt;/p>
&lt;pre>&lt;code class="language-stata">twoway (scatter $outcome ln_gdp, ///
msize(vsmall) mcolor(&amp;quot;106 155 204&amp;quot;%40) msymbol(circle)), ///
ytitle(&amp;quot;Log CO2 per capita&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Synthetic Data: CO2 vs. Income&amp;quot;, size(medium)) ///
subtitle(&amp;quot;80 countries, 1995-2014 (N = 1,600)&amp;quot;, size(small)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig1_scatter.png" alt="Scatter plot of log CO2 per capita versus log GDP per capita for 80 synthetic countries. The cloud of points shows a clear nonlinear pattern consistent with the inverted-N EKC shape.">&lt;/p>
&lt;p>The scatter reveals a distinctly nonlinear pattern. At low income levels, CO&lt;sub>2&lt;/sub> emissions increase steeply with GDP. At higher income levels, the relationship flattens and bends. This curvature motivates the cubic EKC specification. The diagram below shows the two competing EKC shapes &amp;mdash; the classic inverted-U (quadratic) and the more complex inverted-N (cubic) with its three distinct phases:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
EKC[&amp;quot;&amp;lt;b&amp;gt;Environmental Kuznets Curve&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;How does pollution change&amp;lt;br/&amp;gt;as income grows?&amp;quot;]
EKC --&amp;gt; IU[&amp;quot;&amp;lt;b&amp;gt;Inverted-U&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Quadratic: β₁ &amp;gt; 0, β₂ &amp;lt; 0&amp;lt;br/&amp;gt;One turning point&amp;quot;]
EKC --&amp;gt; IN[&amp;quot;&amp;lt;b&amp;gt;Inverted-N&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Cubic: β₁ &amp;lt; 0, β₂ &amp;gt; 0, β₃ &amp;lt; 0&amp;lt;br/&amp;gt;Two turning points&amp;quot;]
IN --&amp;gt; P1[&amp;quot;&amp;lt;b&amp;gt;Phase 1: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Very poor countries&amp;quot;]
IN --&amp;gt; P2[&amp;quot;&amp;lt;b&amp;gt;Phase 2: Rising&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Industrializing countries&amp;quot;]
IN --&amp;gt; P3[&amp;quot;&amp;lt;b&amp;gt;Phase 3: Declining&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Wealthy countries&amp;quot;]
style EKC fill:#141413,stroke:#141413,color:#fff
style IU fill:#6a9bcc,stroke:#141413,color:#fff
style IN fill:#d97757,stroke:#141413,color:#fff
style P1 fill:#00d4c8,stroke:#141413,color:#141413
style P2 fill:#d97757,stroke:#141413,color:#fff
style P3 fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>For an inverted-N, we need $\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$. Our synthetic DGP was designed with exactly this sign pattern ($\beta_1 = -7.1$, $\beta_2 = 0.81$, $\beta_3 = -0.03$), so BMA and DSL should recover it &amp;mdash; but can they also correctly identify which of the 12 controls truly matter? Let us start with standard panel regressions to see how sensitive the GDP coefficients are to the choice of controls.&lt;/p>
&lt;h2 id="4-baseline-----standard-fixed-effects">4. Baseline &amp;mdash; Standard Fixed Effects&lt;/h2>
&lt;p>Before reaching for sophisticated methods, let us see what standard panel regressions say. We run two specifications using the macros defined above:&lt;/p>
&lt;h3 id="41-sparse-specification">4.1 Sparse specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars, absorb(country_id year) vce(cluster country_id)
estimates store fe_sparse
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9620
Within R-sq. = 0.0354
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.498046 1.623988 -4.62 0.000 -10.73051 -4.26558
ln_gdp_sq | .848967 .1704533 4.98 0.000 .5096881 1.188246
ln_gdp_cb | -.0314993 .005931 -5.31 0.000 -.0433047 -.019694
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The sparse model finds the inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), all significant at the 0.1% level with cluster-robust standard errors (clustered at the country level). The within R² is just 0.035 &amp;mdash; the GDP polynomial alone explains only about 3.5% of within-country CO&lt;sub>2&lt;/sub> variation after absorbing country and year fixed effects. The overall R² of 0.96 is high because the country fixed effects capture most of the variation.&lt;/p>
&lt;h3 id="42-kitchen-sink-specification">4.2 Kitchen-sink specification&lt;/h3>
&lt;pre>&lt;code class="language-stata">reghdfe $outcome $gdp_vars $controls, absorb(country_id year) vce(cluster country_id)
estimates store fe_kitchen
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">HDFE Linear regression Number of obs = 1,600
R-squared = 0.9655
Within R-sq. = 0.1249
Number of clusters (country_id) = 80
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.130693 1.562581 -4.56 0.000 -10.24093 -4.020453
ln_gdp_sq | .8059928 .1647973 4.89 0.000 .477972 1.134014
ln_gdp_cb | -.0298133 .0057365 -5.20 0.000 -.0412314 -.0183951
fossil_fuel | .0138444 .0014853 9.32 0.000 .010888 .0168008
renewable | -.006795 .0019322 -3.52 0.001 -.0106409 -.0029491
urban | .0057534 .0021432 2.68 0.009 .0014875 .0100192
globalizat~n | .0015186 .0012832 1.18 0.240 -.0010357 .0040728
pop_density | .0000794 .0002303 0.34 0.731 -.000379 .0005378
democracy | -.0002971 .007735 -0.04 0.969 -.0156933 .0150991
corruption | .0009812 .0008415 1.17 0.247 -.0006936 .0026561
industry | .0086336 .0017848 4.84 0.000 .0050811 .0121861
services | -.0005642 .0017205 -0.33 0.744 -.0039889 .0028604
trade | -.0002458 .0007695 -0.32 0.750 -.0017774 .0012858
fdi | -.0017599 .0019509 -0.90 0.370 -.005643 .0021232
credit | -.00139 .0007516 -1.85 0.068 -.002886 .0001061
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Adding all 12 controls raises the within R² from 0.035 to 0.125 &amp;mdash; a meaningful improvement, though the country and year FE still dominate the overall explanatory power (R² = 0.966). The three strongest true predictors (fossil fuel, industry, urban) are clearly significant, while most noise variables are statistically insignificant. Democracy&amp;rsquo;s estimate (&amp;ndash;0.0003, p = 0.97) is far from its true value (&amp;ndash;0.005) and indistinguishable from zero &amp;mdash; illustrating why weak signals are hard to detect even with the correct model.&lt;/p>
&lt;p>The critical question is: which specification should we trust? The next subsection shows that the GDP coefficients &amp;mdash; and hence the EKC shape &amp;mdash; shift depending on which controls we include.&lt;/p>
&lt;h3 id="43-the-model-uncertainty-problem">4.3 The model uncertainty problem&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Coefficient&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Both specifications recover the correct sign pattern, but the magnitudes shift. The kitchen-sink FE estimates (&amp;ndash;7.131, 0.806, &amp;ndash;0.030) are closer to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030) than the sparse FE (&amp;ndash;7.498, 0.849, &amp;ndash;0.031), because the omitted true controls create bias in the sparse model. But which of the 12 controls actually belongs?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Compare coefficients side by side (simplified from analysis.do)
graph twoway ///
(bar value order if spec == &amp;quot;Sparse FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;106 155 204&amp;quot;)) ///
(bar value order if spec == &amp;quot;Kitchen-Sink FE&amp;quot;, ///
barwidth(0.35) color(&amp;quot;217 119 87&amp;quot;)), ///
xlabel(1 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 2 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 3 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;' ///
4 `&amp;quot;&amp;quot;b1&amp;quot; &amp;quot;(GDP)&amp;quot;&amp;quot;' 5 `&amp;quot;&amp;quot;b2&amp;quot; &amp;quot;(GDP sq)&amp;quot;&amp;quot;' 6 `&amp;quot;&amp;quot;b3&amp;quot; &amp;quot;(GDP cb)&amp;quot;&amp;quot;') ///
xline(3.5, lcolor(gs10) lpattern(dash)) ///
ytitle(&amp;quot;Coefficient value&amp;quot;) ///
title(&amp;quot;Coefficient Instability Across Specifications&amp;quot;) ///
legend(order(1 &amp;quot;Sparse FE (no controls)&amp;quot; 2 &amp;quot;Kitchen-Sink FE (all 12 controls)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig2_instability.png" alt="Bar chart comparing GDP polynomial coefficients between sparse and kitchen-sink fixed effects specifications. The coefficients shift between the two models, demonstrating model uncertainty.">&lt;/p>
&lt;p>To understand the practical implications of these coefficient shifts, we compute the income thresholds where emissions change direction. The &lt;strong>turning points&lt;/strong> are found by setting the first derivative of the cubic to zero:&lt;/p>
&lt;p>$$x^* = \frac{-\hat{\beta}_2 \pm \sqrt{\hat{\beta}_2^2 - 3\hat{\beta}_1\hat{\beta}_3}}{3\hat{\beta}_3}, \quad \text{GDP}^* = \exp(x^*)$$&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Turning point&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Minimum (CO&lt;sub>2&lt;/sub> starts rising)&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Maximum (CO&lt;sub>2&lt;/sub> starts falling)&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The turning points shift modestly between specifications &amp;mdash; the minimum stays near \$2,400&amp;ndash;\$2,500 while the maximum moves from \$25,656 to \$27,694 depending on controls. Neither matches the true DGP values perfectly, motivating BMA and DSL as principled alternatives to ad hoc control selection.&lt;/p>
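&lt;p>The turning-point formula is easy to evaluate directly from the stored estimates. The sketch below is our own addition (it assumes &lt;code>fe_kitchen&lt;/code> was stored by &lt;code>estimates store&lt;/code> earlier); with the coefficients reported above it reproduces the \$2,426 and \$27,694 thresholds in the table:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Turning points from the kitchen-sink FE coefficients
estimates restore fe_kitchen
scalar b1 = _b[ln_gdp]
scalar b2 = _b[ln_gdp_sq]
scalar b3 = _b[ln_gdp_cb]
scalar disc  = sqrt(b2^2 - 3*b1*b3)
scalar x_min = (-b2 + disc) / (3*b3)   // smaller root: emissions start rising
scalar x_max = (-b2 - disc) / (3*b3)   // larger root: emissions start falling
display &amp;quot;Minimum at GDP = &amp;quot; %9.0fc exp(x_min)
display &amp;quot;Maximum at GDP = &amp;quot; %9.0fc exp(x_max)
&lt;/code>&lt;/pre>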
&lt;h2 id="5-bayesian-model-averaging">5. Bayesian Model Averaging&lt;/h2>
&lt;h3 id="51-the-idea">5.1 The idea&lt;/h3>
&lt;p>Think of BMA as betting on a horse race. Instead of putting all your money on one model, BMA spreads bets across the field, wagering more on models with better track records.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Start[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;2¹² = 4,096&amp;lt;br/&amp;gt;possible models&amp;quot;] --&amp;gt; MCMC[&amp;quot;&amp;lt;b&amp;gt;MCMC Sampling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Draw 50,000 models&amp;quot;]
MCMC --&amp;gt; Post[&amp;quot;&amp;lt;b&amp;gt;Posterior Probability&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Weight by fit × parsimony&amp;quot;]
Post --&amp;gt; Avg[&amp;quot;&amp;lt;b&amp;gt;Weighted Average&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Coefficients averaged&amp;lt;br/&amp;gt;across models&amp;quot;]
Post --&amp;gt; PIP[&amp;quot;&amp;lt;b&amp;gt;PIPs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Inclusion probability&amp;lt;br/&amp;gt;for each variable&amp;quot;]
style Start fill:#141413,stroke:#141413,color:#fff
style MCMC fill:#6a9bcc,stroke:#141413,color:#fff
style Post fill:#d97757,stroke:#141413,color:#fff
style Avg fill:#00d4c8,stroke:#141413,color:#141413
style PIP fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>Formally, this betting process follows Bayes' rule, which tells us how to weight models by their fit and complexity.&lt;/p>
&lt;p>&lt;strong>Step 1: Model posterior probabilities.&lt;/strong> The posterior probability of model $M_k$ is:&lt;/p>
&lt;p>$$P(M_k | \text{data}) = \frac{P(\text{data} | M_k) \cdot P(M_k)}{\sum_{l=1}^{K} P(\text{data} | M_l) \cdot P(M_l)}$$&lt;/p>
&lt;p>In words, the probability of model $k$ being correct equals how well it fits the data (the &lt;em>marginal likelihood&lt;/em> $P(\text{data} | M_k)$) times our prior belief ($P(M_k)$), divided by the total across all models. Models that fit the data well &lt;em>and&lt;/em> are parsimonious receive higher posterior weight &amp;mdash; this is BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor.&lt;/p>
&lt;p>The marginal likelihood $P(\text{data} | M_k)$ is not the same as the ordinary likelihood. It integrates over all possible coefficient values, penalizing models with many parameters that &amp;ldquo;waste&amp;rdquo; probability mass on parameter regions the data does not support:&lt;/p>
&lt;p>$$P(\text{data} | M_k) = \int P(\text{data} | \boldsymbol{\beta}_k, M_k) \, P(\boldsymbol{\beta}_k | M_k) \, d\boldsymbol{\beta}_k$$&lt;/p>
&lt;p>In words, the marginal likelihood asks: &amp;ldquo;If we averaged this model&amp;rsquo;s fit across all plausible coefficient values (weighted by the prior $P(\boldsymbol{\beta}_k | M_k)$), how well does it explain the data?&amp;rdquo; This integral is what makes BMA automatically penalize overly complex models &amp;mdash; a model with many parameters spreads its prior probability thinly across a high-dimensional space, and only recovers that probability if the data strongly supports those extra dimensions.&lt;/p>
&lt;p>&lt;strong>Step 2: Posterior Inclusion Probabilities.&lt;/strong> The &lt;strong>PIP&lt;/strong> for variable $j$ sums the posterior probabilities across all models that include it:&lt;/p>
&lt;p>$$\text{PIP}_j = \sum_{k:\, x_j \in M_k} P(M_k | \text{data})$$&lt;/p>
&lt;p>In words, PIP answers: &amp;ldquo;Across all the models BMA considered, what fraction of the total posterior weight belongs to models that include variable $j$?&amp;rdquo; If fossil fuel appears in every high-probability model, its PIP approaches 1.0. If democracy only appears in low-probability models, its PIP stays near 0.&lt;/p>
&lt;p>&lt;strong>Step 3: BMA posterior mean.&lt;/strong> BMA does not just select variables &amp;mdash; it also produces model-averaged coefficient estimates. The posterior mean of coefficient $\beta_j$ averages across all models, weighted by their posterior probabilities:&lt;/p>
&lt;p>$$\hat{\beta}_j^{\text{BMA}} = \sum_{k=1}^{K} P(M_k | \text{data}) \cdot \hat{\beta}_{j,k}$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the coefficient estimate of variable $j$ in model $M_k$ (set to zero if $j$ is not in $M_k$). In words, the BMA estimate is a weighted average of the coefficient across all models, including models where the variable is absent (contributing zero). This shrinks the coefficient toward zero in proportion to the evidence against inclusion &amp;mdash; a variable with PIP = 0.5 has its BMA coefficient shrunk by roughly half compared to its conditional estimate.&lt;/p>
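&lt;p>A toy numerical illustration of this shrinkage (our addition, with made-up numbers): suppose only two models carry posterior weight, one including $x_j$ and one excluding it.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Hypothetical two-model example of BMA shrinkage
scalar p_in   = 0.6      // posterior prob. of the model that includes x_j
scalar p_out  = 0.4      // posterior prob. of the model that excludes x_j
scalar b_cond = 0.010    // estimate of beta_j conditional on inclusion
* PIP = 0.6; BMA mean = 0.6*0.010 + 0.4*0 = 0.006
display &amp;quot;PIP = &amp;quot; p_in &amp;quot;   BMA posterior mean = &amp;quot; p_in*b_cond + p_out*0
&lt;/code>&lt;/pre>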
&lt;p>Think of PIP as a &lt;strong>democratic vote&lt;/strong> across all candidate models. Each model casts a weighted vote for which variables matter, with better-fitting models getting louder voices. &lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery (1995)&lt;/a> proposed standard interpretation thresholds based on the strength of evidence:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>PIP range&lt;/th>
&lt;th>Evidence&lt;/th>
&lt;th>Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\geq 0.99$&lt;/td>
&lt;td>Decisive&lt;/td>
&lt;td>Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.95 - 0.99$&lt;/td>
&lt;td>Very strong&lt;/td>
&lt;td>Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.80 - 0.95$&lt;/td>
&lt;td>Strong (robust)&lt;/td>
&lt;td>Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$0.50 - 0.80$&lt;/td>
&lt;td>Borderline&lt;/td>
&lt;td>Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$&amp;lt; 0.50$&lt;/td>
&lt;td>Weak/none (fragile)&lt;/td>
&lt;td>Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our robustness threshold throughout this tutorial &amp;mdash; a variable with PIP above 0.80 appears in the vast majority of the probability-weighted model space, providing &amp;ldquo;strong evidence&amp;rdquo; by Raftery&amp;rsquo;s classification. This is the most widely used cutoff in applied BMA studies.&lt;/p>
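&lt;p>If the PIPs are later stored in a variable (as when building the PIP chart in Section 5.5), Raftery&amp;rsquo;s bands can be attached as text labels with a single &lt;code>cond()&lt;/code> chain. This is a small convenience snippet of our own, assuming a numeric variable &lt;code>pip&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Map PIPs to Raftery's evidence categories (assumes a variable `pip')
gen str12 evidence = cond(pip &amp;gt;= 0.99, &amp;quot;Decisive&amp;quot;, ///
    cond(pip &amp;gt;= 0.95, &amp;quot;Very strong&amp;quot;, ///
    cond(pip &amp;gt;= 0.80, &amp;quot;Strong&amp;quot;, ///
    cond(pip &amp;gt;= 0.50, &amp;quot;Borderline&amp;quot;, &amp;quot;Weak/none&amp;quot;))))
&lt;/code>&lt;/pre>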
&lt;p>A key assumption underlying BMA is that the true data-generating process is well-approximated by a weighted combination of the candidate models (the &amp;ldquo;M-closed&amp;rdquo; assumption). When the candidate set omits important functional forms or interactions, BMA&amp;rsquo;s posterior probabilities may be unreliable.&lt;/p>
&lt;h3 id="52-key-options">5.2 Key options&lt;/h3>
&lt;p>With the conceptual framework in place, we now turn to implementation. Stata 18&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code>&lt;/a> command has three families of options: &lt;strong>priors&lt;/strong> (what you believe before seeing the data), &lt;strong>MCMC controls&lt;/strong> (how the algorithm explores the model space), and &lt;strong>output formatting&lt;/strong> (what gets displayed). The full option list is in the &lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Prior specifications&lt;/strong> (see &lt;a href="https://www.stata.com/manuals/bmabmaregresspostestimation.pdf" target="_blank" rel="noopener">&lt;code>bmaregress&lt;/code> priors&lt;/a> for alternatives):&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>gprior(uip)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; Unit Information Prior: sets the prior precision on coefficients equal to the information in one observation ($g = N$). This is a standard, relatively uninformative choice that lets the data dominate. Alternatives include &lt;code>gprior(bric)&lt;/code> (benchmark risk inflation criterion, $g = \max(N, p^2)$), &lt;code>gprior(zs)&lt;/code> (Zellner-Siow), and &lt;code>gprior(hyper)&lt;/code> (hyper-g prior with data-driven $g$)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>mprior(uniform)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; all $2^{12} = 4{,}096$ models are equally likely a priori; no model is privileged before seeing the data. The alternative &lt;code>mprior(binomial)&lt;/code> applies a beta-binomial prior that penalizes very large or very small models, often producing more conservative PIPs&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>MCMC controls:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>mcmcsize(50000)&lt;/code>&lt;/strong> &amp;mdash; draws 50,000 models from the model space using MC$^3$ (Markov chain Monte Carlo model composition) sampling. Larger values improve posterior estimates but increase computation time&lt;/li>
&lt;li>&lt;strong>&lt;code>burnin(5000)&lt;/code>&lt;/strong> &amp;mdash; discards the first 5,000 draws to allow the chain to reach its stationary distribution before collecting samples&lt;/li>
&lt;li>&lt;strong>&lt;code>rseed(9988)&lt;/code>&lt;/strong> &amp;mdash; fixes the random number seed for exact reproducibility. Students running the same command will get identical results&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">&lt;code>groupfv&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; treats all dummies from a single factor variable as one group that enters or exits models together. Without &lt;code>groupfv&lt;/code>, writing &lt;code>i.country_id&lt;/code> would create 80 individual dummy variables, and BMA would consider including or excluding each one independently &amp;mdash; producing an astronomical model space ($2^{80}$ combinations of country dummies alone) that is both computationally infeasible and conceptually meaningless. With &lt;code>groupfv&lt;/code>, the 80 country dummies move as a &lt;em>package&lt;/em>: either all 80 are in the model or none are. Think of it like hiring a sports team &amp;mdash; you recruit the whole roster, not individual players one by one. In the output, this is why you see &amp;ldquo;Groups = 15&amp;rdquo; instead of 113: BMA treats the 80 country dummies as 1 group, the 19 year dummies as 1 group, and each of the 12 candidate controls + 3 GDP terms as their own groups ($1 + 1 + 15 = 17$, minus 2 that are &amp;ldquo;always&amp;rdquo; included = 15 groups subject to selection)&lt;/li>
&lt;li>&lt;strong>&lt;code>($fe, always)&lt;/code>&lt;/strong> &amp;mdash; country and year fixed effects are always included in every model; they are not subject to model selection. This is standard practice in panel data BMA: we want to control for unobserved country and time heterogeneity in &lt;em>every&lt;/em> model, and only let BMA decide about the candidate controls&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Output formatting:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>pipcutoff(0.8)&lt;/code>&lt;/strong> &amp;mdash; display only variables with PIP above 0.80 in the output table. This is a &lt;em>display&lt;/em> threshold only &amp;mdash; it does not affect the underlying estimation&lt;/li>
&lt;li>&lt;strong>&lt;code>inputorder&lt;/code>&lt;/strong> &amp;mdash; display variables in the order they were specified in the command, rather than sorted by PIP&lt;/li>
&lt;/ul>
&lt;h3 id="53-estimation">5.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress $outcome $gdp_vars $controls ///
($fe, always), ///
mprior(uniform) groupfv gprior(uip) ///
mcmcsize(50000) rseed(9988) inputorder pipcutoff(0.8)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 113
MC3 sampling Groups = 15
Always = 98
No. of models = 163
Priors: Mean model size = 104.578
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0904
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9997
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -7.13901 1.811093 1 .99401
ln_gdp_sq | .8078437 .1892418 2 .99991
ln_gdp_cb | -.0299182 .0065105 3 .99976
fossil_fuel | .0138139 .001283 4 1
renewable | -.0068332 .0023506 5 .95945
industry | .0085503 .0019766 11 .99867
------------------------------------------------------------------------------
Note: 9 predictors with PIP less than .8 not shown.
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>The Stata output says &amp;ldquo;PIP less than .8&amp;rdquo; because we set &lt;code>pipcutoff(0.8)&lt;/code> as the display threshold &amp;mdash; only variables exceeding this stricter robustness criterion appear in the table. The 9 hidden variables are the two weak true controls (urban, democracy) and all 7 noise variables (services, trade, FDI, credit, population density, corruption, globalization). Figure 3 below shows PIP values for all 15 variables.&lt;/p>
&lt;/blockquote>
&lt;p>The output shows 113 predictors in 15 selection groups: the 79 country dummies (grouped as 1 by &lt;code>groupfv&lt;/code>; one country is the omitted base) + 19 year dummies (grouped as 1; one year is the base) + 12 candidate controls (each its own group) + the 3 GDP terms (each its own group), with the 98 fixed-effects dummies (79 + 19) &amp;ldquo;always&amp;rdquo; included. BMA sampled 163 distinct models out of 4,096 possible. This might seem low, but the MC$^3$ algorithm does not need to visit every model &amp;mdash; it concentrates on the high-posterior-probability region. The sampling correlation of 0.9997 (very close to 1.0) confirms that the MC$^3$ chain adequately explored the model space &amp;mdash; the posterior probability is concentrated on a relatively small number of high-quality models. The acceptance rate of 0.09 is below the typical 20&amp;ndash;40% range, but the high sampling correlation provides reassurance that the results are reliable. Six variables have PIP above the 0.80 robustness threshold: the three GDP terms (PIP = 0.994&amp;ndash;1.000) and three of the five true controls &amp;mdash; fossil fuel (PIP = 1.000), industry (PIP = 0.999), and renewable energy (PIP = 0.959). The BMA posterior means (&amp;ndash;7.139, 0.808, &amp;ndash;0.030) are remarkably close to the true DGP values (&amp;ndash;7.100, 0.810, &amp;ndash;0.030), substantially closer than the sparse FE estimates.&lt;/p>
&lt;p>Two true controls &amp;mdash; urban (coefficient 0.007) and democracy (coefficient &amp;ndash;0.005) &amp;mdash; have PIPs well below 0.80. Their true effects are small, making them hard to distinguish from noise. This is a realistic limitation: even a powerful method like BMA struggles with weak signals.&lt;/p>
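&lt;p>Beyond the main table, Stata&amp;rsquo;s BMA suite offers postestimation commands for inspecting the visited model space. Two examples (our addition; see the BMA postestimation manual for the full set):&lt;/p>
&lt;pre>&lt;code class="language-stata">* List the highest-posterior-probability models
bmastats models
* Plot cumulative posterior model probability across sampled models
bmagraph pmp
&lt;/code>&lt;/pre>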
&lt;h3 id="54-turning-points">5.4 Turning points&lt;/h3>
&lt;p>Using the BMA posterior means, the turning points are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,411 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,269 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
&lt;p>Both turning points are in the right ballpark but not exact. The turning point formula amplifies small differences across all three coefficients &amp;mdash; even though each BMA posterior mean is within 1% of the true DGP value, the compound effect shifts the maximum turning point from \$34,647 (true) to \$27,269 (BMA). The inverted-N shape is clearly recovered.&lt;/p>
&lt;h3 id="55-posterior-inclusion-probabilities">5.5 Posterior Inclusion Probabilities&lt;/h3>
&lt;p>The PIP chart is BMA&amp;rsquo;s signature output. We extract PIPs from the estimation results, label each variable, and color-code bars by ground truth: steel blue for true predictors, gray for noise.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Extract PIPs and create a horizontal bar chart
matrix pip_mat = e(pip)
* ... (create dataset of variable names and PIPs, add readable labels) ...
* Mark true vs noise predictors
gen is_true = inlist(varname, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;renewable&amp;quot;, &amp;quot;urban&amp;quot;, ///
&amp;quot;democracy&amp;quot;, &amp;quot;industry&amp;quot;, &amp;quot;ln_gdp&amp;quot;, &amp;quot;ln_gdp_sq&amp;quot;, &amp;quot;ln_gdp_cb&amp;quot;)
gsort -pip
graph twoway ///
(bar pip order if is_true == 1, horizontal barwidth(0.6) ///
color(&amp;quot;106 155 204&amp;quot;)) ///
(bar pip order if is_true == 0, horizontal barwidth(0.6) ///
color(gs11)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;) ///
title(&amp;quot;BMA: Which Variables Matter?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor (in DGP)&amp;quot; 2 &amp;quot;Noise variable (not in DGP)&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig3_pip.png" alt="Horizontal bar chart showing Posterior Inclusion Probabilities for all 15 variables. True predictors are colored in steel blue, noise variables in gray. A dashed orange line marks the 0.80 robustness threshold.">&lt;/p>
&lt;p>The PIP chart cleanly separates the variables into two groups. At the top (PIP near 1.0): fossil fuel share, GDP terms, industry, and renewable energy &amp;mdash; all true predictors correctly identified. At the bottom (PIP near 0.0): the seven noise variables (globalization, corruption, services, trade, FDI, credit, population density) plus urban population and democracy. BMA correctly assigns near-zero PIPs to all noise variables, and correctly flags 3 of 5 true predictors as robust. The two misses (urban, democracy) have small true coefficients (0.007 and &amp;ndash;0.005), making them genuinely hard to detect.&lt;/p>
&lt;h3 id="56-coefficient-density-plots">5.6 Coefficient density plots&lt;/h3>
&lt;p>The &lt;a href="https://www.stata.com/manuals/bmabmagraphcoefdensity.pdf" target="_blank" rel="noopener">&lt;code>bmagraph coefdensity&lt;/code>&lt;/a> command shows the posterior distribution of each coefficient across all sampled models. We plot all six variables with PIP above 0.80 in a 3x2 grid &amp;mdash; the three GDP polynomial terms (top row) and the three robust controls (bottom row). In each panel, the blue curve shows the density conditional on the variable being included in the model, and the red horizontal line shows the probability of noninclusion (1 &amp;ndash; PIP). When the red line is flat near zero and the blue curve is far from zero, the variable is strongly supported.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Consistent formatting for all panels
local panel_opts `&amp;quot; xtitle(&amp;quot;Coefficient value&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ytitle(&amp;quot;Density&amp;quot;, size(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' ylabel(, labsize(vsmall) angle(0)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' xlabel(, labsize(vsmall)) &amp;quot;'
local panel_opts `&amp;quot; `panel_opts' legend(off) scheme(s2color) &amp;quot;'
* Generate density for all 6 robust variables (PIP &amp;gt; 0.80)
bmagraph coefdensity ln_gdp, title(&amp;quot;GDP per capita (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp, replace)
bmagraph coefdensity ln_gdp_sq, title(&amp;quot;GDP squared (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_sq, replace)
bmagraph coefdensity ln_gdp_cb, title(&amp;quot;GDP cubed (log)&amp;quot;, size(small)) `panel_opts' name(dens_gdp_cb, replace)
bmagraph coefdensity fossil_fuel, title(&amp;quot;Fossil fuel share (%)&amp;quot;, size(small)) `panel_opts' name(dens_fossil, replace)
bmagraph coefdensity renewable, title(&amp;quot;Renewable energy (%)&amp;quot;, size(small)) `panel_opts' name(dens_renewable, replace)
bmagraph coefdensity industry, title(&amp;quot;Industry VA (% GDP)&amp;quot;, size(small)) `panel_opts' name(dens_industry, replace)
graph combine dens_gdp dens_gdp_sq dens_gdp_cb ///
dens_fossil dens_renewable dens_industry, ///
cols(3) rows(2) imargin(small) ///
title(&amp;quot;BMA: Posterior Coefficient Densities&amp;quot;, size(medsmall)) ///
subtitle(&amp;quot;All 6 robust variables (PIP &amp;gt; 0.80)&amp;quot;, size(small)) ///
note(&amp;quot;Blue curve = posterior density conditional on inclusion.&amp;quot; ///
&amp;quot;Red line = probability of noninclusion (1 - PIP).&amp;quot; ///
&amp;quot;Near-zero red line + blue curve far from zero = strong evidence.&amp;quot;, size(vsmall)) ///
scheme(s2color) xsize(12) ysize(7)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig4_coefdensity.png" alt="Posterior coefficient density plots for all six robust variables in a 3x2 grid. Top row: GDP linear, squared, and cubic terms. Bottom row: fossil fuel, renewable energy, and industry. All densities are concentrated well away from zero.">&lt;/p>
&lt;p>All six densities are concentrated well away from zero, confirming that every variable with PIP above 0.80 has a genuinely non-zero effect. The three GDP terms (top row) form the inverted-N polynomial: the linear term is centered near &amp;ndash;7.1 (true: &amp;ndash;7.1), the squared term near +0.81 (true: +0.81), and the cubic term near &amp;ndash;0.030 (true: &amp;ndash;0.030). The three controls (bottom row) show tight, unimodal densities: fossil fuel near +0.014 (true: +0.015), renewable energy near &amp;ndash;0.007 (true: &amp;ndash;0.010), and industry near +0.009 (true: +0.010). Renewable energy&amp;rsquo;s posterior mean (&amp;ndash;0.007) is slightly attenuated compared to the true value (&amp;ndash;0.010), reflecting the BMA shrinkage that occurs when a variable&amp;rsquo;s PIP is below 1.0 &amp;mdash; models that exclude it pull the average toward zero.&lt;/p>
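&lt;p>The attenuation arithmetic is simple enough to sketch directly. The Python snippet below is illustrative only: the helper name and the PIP value are assumptions, not taken from the Stata output. It computes the model-averaged mean as a PIP-weighted average of the conditional-on-inclusion mean and zero:&lt;/p>

```python
# BMA's unconditional posterior mean averages over inclusion:
# E[beta] = PIP * E[beta | included] + (1 - PIP) * 0
def bma_posterior_mean(pip, cond_mean):
    """PIP-weighted average of the conditional-on-inclusion mean and zero."""
    return pip * cond_mean + (1.0 - pip) * 0.0

# Illustrative numbers (not from the output above): a conditional mean of
# -0.010 with PIP = 0.85 is pulled toward zero by the models that exclude it.
print(round(bma_posterior_mean(0.85, -0.010), 4))  # -0.0085
```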
&lt;h3 id="57-pooled-bma-without-fixed-effects">5.7 Pooled BMA (without fixed effects)&lt;/h3>
&lt;p>To parallel the pooled DSL comparison in Section 6.6, we also run BMA without country or year fixed effects &amp;mdash; treating the panel as a pooled cross-section. This removes the &lt;code>($fe, always)&lt;/code> and &lt;code>groupfv&lt;/code> options, leaving only the 12 candidate controls and 3 GDP terms as predictors (15 total, vs 113 with FE).&lt;/p>
&lt;pre>&lt;code class="language-stata">* BMA without FE -- pooled cross-section
bmaregress ln_co2 ln_gdp ln_gdp_sq ln_gdp_cb ///
fossil_fuel renewable urban industry democracy ///
services trade fdi credit pop_density ///
corruption globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 1,600
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 34
Priors: Mean model size = 11.978
Models: Uniform MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.0733
g: Unit-information, g = 1,600 Shrinkage, g/(1+g) = 0.9994
Sampling correlation = 0.9996
------------------------------------------------------------------------------
ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
ln_gdp | -21.25807 1.641676 1 1
ln_gdp_sq | 2.284729 .1748838 2 1
ln_gdp_cb | -.0813937 .0061308 3 1
fossil_fuel | .0188853 .0010554 4 1
renewable | -.0192089 .0013911 5 1
urban | .0103139 .0012072 6 1
industry | .0138361 .0023478 7 1
services | .0164633 .0016573 9 1
pop_density | -.0004314 .0000567 13 1
credit | .0041017 .0008414 12 .99984
trade | -.0020939 .001084 10 .86009
democracy | .007879 .0042984 8 .84142
------------------------------------------------------------------------------
Note: 3 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The pooled BMA results are striking in two ways. First, the GDP coefficients are severely biased &amp;mdash; the same pattern as pooled DSL: $\beta_1 = -21.26$ (true: &amp;ndash;7.10), $\beta_2 = 2.28$ (true: 0.81), $\beta_3 = -0.081$ (true: &amp;ndash;0.030). Without country fixed effects, the GDP terms absorb persistent cross-country differences in emissions levels, inflating the coefficients by a factor of 2&amp;ndash;3.&lt;/p>
&lt;p>Second, the PIPs tell a completely different story than with FE. Without fixed effects, &lt;strong>12 of 15 variables have PIP above 0.80&lt;/strong> &amp;mdash; including noise variables like services (PIP = 1.000), population density (PIP = 1.000), credit (PIP = 1.000), and trade (PIP = 0.860). With FE, only 6 variables cleared the 0.80 threshold and all 7 noise variables had PIPs near zero. The pooled BMA thus flags four noise variables as robust (services, pop_density, credit, and trade), four false positives compared with &lt;strong>zero&lt;/strong> under FE, and it inflates democracy&amp;rsquo;s PIP as well. This happens because the noise variables are correlated with omitted country effects &amp;mdash; without FE to absorb those effects, the correlations create spurious associations that BMA interprets as genuine predictive power.&lt;/p>
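&lt;p>The spurious-association mechanism can be reproduced in a few lines. The following Python simulation is a minimal sketch with made-up parameters, not the tutorial&amp;rsquo;s DGP: it builds a noise variable that loads on a persistent country effect, and shows that its pooled correlation with the outcome is sizable while the within-country (FE-demeaned) correlation is near zero:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries, T = 80, 20
a = rng.normal(0, 1, n_countries)               # persistent country effect
country = np.repeat(np.arange(n_countries), T)  # country id for each obs

# The outcome depends only on the country effect; the "noise" control has
# no causal role but also loads on the same country effect.
y = a[country] + rng.normal(0, 1, n_countries * T)
w = a[country] + rng.normal(0, 1, n_countries * T)

pooled_corr = np.corrcoef(w, y)[0, 1]

def demean(v):
    """Subtract country means: the variation country FE would absorb."""
    means = np.bincount(country, weights=v) / T
    return v - means[country]

within_corr = np.corrcoef(demean(w), demean(y))[0, 1]
print(f"pooled correlation: {pooled_corr:.2f}")   # sizable, purely spurious
print(f"within correlation: {within_corr:.2f}")   # near zero
```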
&lt;p>The turning points (\$5,752 minimum, \$23,298 maximum) are far from the truth, and the 95% credible intervals fail to cover the true values for all three GDP terms &amp;mdash; the same coverage failure seen in pooled DSL. The lesson is clear: &lt;strong>fixed effects are not optional in panel BMA&lt;/strong>. They are essential for correct variable selection, not just coefficient estimation.&lt;/p>
&lt;h2 id="6-post-double-selection-lasso">6. Post-Double-Selection LASSO&lt;/h2>
&lt;h3 id="61-the-idea">6.1 The idea&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> implements the &lt;strong>post-double-selection&lt;/strong> method of Belloni, Chernozhukov, and Hansen (2014). Think of it as a smart research assistant who reads the data twice &amp;mdash; once to find controls that predict the outcome (CO&lt;sub>2&lt;/sub>), and again to find controls that predict the variables of interest (GDP terms) &amp;mdash; then runs a clean OLS regression using only the controls that survived at least one selection.&lt;/p>
&lt;p>The &amp;ldquo;double&amp;rdquo; in double-selection refers to the &lt;strong>union&lt;/strong> of two separate LASSO selections. Why is this union necessary? If a control variable predicts both CO&lt;sub>2&lt;/sub> &lt;em>and&lt;/em> GDP but a single LASSO run on CO&lt;sub>2&lt;/sub> happens to miss it, omitting it from the final regression would bias the GDP coefficient. The second LASSO step (on GDP) catches variables that the first step might miss, and vice versa.&lt;/p>
&lt;p>The algorithm has four steps:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Controls[&amp;quot;&amp;lt;b&amp;gt;12 Candidate Controls&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;+ country &amp;amp; year FE&amp;quot;]
Controls --&amp;gt; Step1[&amp;quot;&amp;lt;b&amp;gt;Step 1: LASSO on Outcome&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ all controls&amp;lt;br/&amp;gt;→ Selected set X̃y&amp;quot;]
Controls --&amp;gt; Step2[&amp;quot;&amp;lt;b&amp;gt;Step 2: LASSO on Each Variable of Interest&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;GDP ~ all controls → X̃₁&amp;lt;br/&amp;gt;GDP² ~ all controls → X̃₂&amp;lt;br/&amp;gt;GDP³ ~ all controls → X̃₃&amp;quot;]
Step1 --&amp;gt; Union[&amp;quot;&amp;lt;b&amp;gt;Step 3: Take the Union&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;X̂ = X̃y ∪ X̃₁ ∪ X̃₂ ∪ X̃₃&amp;lt;br/&amp;gt;Only controls surviving&amp;lt;br/&amp;gt;at least one selection&amp;quot;]
Step2 --&amp;gt; Union
Union --&amp;gt; OLS[&amp;quot;&amp;lt;b&amp;gt;Step 4: Final OLS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;CO2 ~ GDP + GDP² + GDP³ + X̂&amp;lt;br/&amp;gt;Standard OLS with valid&amp;lt;br/&amp;gt;inference on GDP terms&amp;quot;]
style Controls fill:#141413,stroke:#141413,color:#fff
style Step1 fill:#6a9bcc,stroke:#141413,color:#fff
style Step2 fill:#d97757,stroke:#141413,color:#fff
style Union fill:#1a3a8a,stroke:#141413,color:#fff
style OLS fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>At the heart of each LASSO step is a penalized regression that shrinks irrelevant coefficients to exactly zero:&lt;/p>
&lt;p>$$\hat{\boldsymbol{\beta}}^{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2N} \sum_{i=1}^{N}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$&lt;/p>
&lt;p>In words, LASSO minimizes the sum of squared residuals (the usual OLS objective) plus a penalty term $\lambda \sum |\beta_j|$ that charges a cost proportional to the &lt;em>absolute value&lt;/em> of each coefficient. The tuning parameter $\lambda$ controls how harsh this penalty is &amp;mdash; think of it as a &amp;ldquo;strictness dial.&amp;rdquo; When $\lambda = 0$, LASSO is just OLS. As $\lambda$ increases, more coefficients are forced to exactly zero. The L1 (absolute value) penalty is what makes LASSO a variable selector: unlike the L2 (squared) penalty used in Ridge regression, the L1 penalty has sharp corners at zero that drive weak coefficients to exactly zero rather than merely shrinking them.&lt;/p>
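&lt;p>The sharp-corner intuition can be made concrete in the orthonormal-design special case, where the LASSO solution for each coefficient is the soft-thresholded OLS estimate. The Python sketch below illustrates the textbook formulas (not Stata&amp;rsquo;s implementation) and contrasts them with Ridge&amp;rsquo;s proportional shrinkage:&lt;/p>

```python
import numpy as np

def soft_threshold(b_ols, lam):
    """LASSO solution per coefficient under an orthonormal design: shrink
    by lam and snap to exactly zero once |b_ols| <= lam (the L1 corner)."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

def ridge_shrink(b_ols, lam):
    """Ridge solution in the same setting: proportional shrinkage, never zero."""
    return b_ols / (1.0 + lam)

b = np.array([2.0, 0.5, -0.1, 0.03])   # OLS estimates, strong to weak
print(soft_threshold(b, 0.2))   # weak coefficients become exactly 0
print(ridge_shrink(b, 0.2))     # everything shrinks, nothing is exactly 0
```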
&lt;p>&lt;strong>Why &amp;ldquo;double&amp;rdquo; selection?&lt;/strong> The key insight of Belloni, Chernozhukov, and Hansen (2014) is that a single LASSO selection can miss important confounders. Consider our panel setting. We want to estimate the effect of GDP terms ($\mathbf{D}$) on CO&lt;sub>2&lt;/sub> ($Y$), controlling for other variables ($\mathbf{W}$). The model is:&lt;/p>
&lt;p>$$Y_i = \mathbf{D}_i' \boldsymbol{\alpha} + \mathbf{W}_i' \boldsymbol{\beta} + \varepsilon_i$$&lt;/p>
&lt;p>A confounder $W_j$ that affects both $Y$ and $\mathbf{D}$ must be included to avoid omitted variable bias. But if $W_j$ has a weak effect on $Y$, the LASSO on $Y$ might miss it. The double-selection strategy solves this by running LASSO twice:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Step 1&lt;/strong> selects controls that predict $Y$: $\quad \hat{S}_Y = \{j : \hat{\beta}_j^{\text{LASSO}(Y)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 2&lt;/strong> selects controls that predict each $D_k$: $\quad \hat{S}_{D_k} = \{j : \hat{\gamma}_{j,k}^{\text{LASSO}(D_k)} \neq 0\}$&lt;/li>
&lt;li>&lt;strong>Step 3&lt;/strong> takes the union: $\quad \hat{S} = \hat{S}_Y \cup \hat{S}_{D_1} \cup \hat{S}_{D_2} \cup \hat{S}_{D_3}$&lt;/li>
&lt;li>&lt;strong>Step 4&lt;/strong> runs OLS of $Y$ on $\mathbf{D}$ and $\mathbf{W}_{\hat{S}}$ with standard inference&lt;/li>
&lt;/ul>
&lt;p>The union in Step 3 ensures that a confounder missed by the $Y$-LASSO but caught by the $D$-LASSO is still included. This &amp;ldquo;safety net&amp;rdquo; property is what gives post-double-selection its valid inference guarantees &amp;mdash; the final OLS produces consistent estimates of $\boldsymbol{\alpha}$ even if each individual LASSO makes some selection mistakes.&lt;/p>
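&lt;p>The safety-net logic can be illustrated with a toy simulation. In the Python sketch below, a simple marginal-correlation threshold stands in for LASSO (an assumption made for brevity, not the actual estimator): a confounder that drives $D$ strongly but $Y$ only weakly is missed by the $Y$-selection yet caught by the $D$-selection, so the union still includes it:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 500, 10

W = rng.normal(size=(n, p))                  # candidate controls
D = 1.0 * W[:, 0] + rng.normal(size=n)       # confounder 0 drives the treatment
Y = 0.5 * D + 0.05 * W[:, 0] + rng.normal(size=n)  # ...but barely affects Y directly

def selected(target, X, thresh=0.15):
    """Toy selector standing in for LASSO: keep controls whose absolute
    marginal correlation with the target exceeds a threshold."""
    corr = np.abs([np.corrcoef(X[:, j], target)[0, 1] for j in range(X.shape[1])])
    return {j for j in range(X.shape[1]) if corr[j] > thresh}

S_D = selected(D, W)                        # Step 2: confounder 0 predicts D strongly
Y_resid = Y - np.polyfit(D, Y, 1)[0] * D    # isolate the controls' direct effect on Y
S_Y = selected(Y_resid, W)                  # Step 1: direct effect is weak, likely missed

union = S_Y | S_D                           # Step 3: the safety net
print("selected for D:", S_D)
print("selected for Y:", S_Y)
print("union keeps the confounder:", 0 in union)
```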
&lt;p>The &lt;code>dsregress&lt;/code> command uses a &amp;ldquo;plugin&amp;rdquo; method to choose $\lambda$ &amp;mdash; an analytical formula that sets the penalty based on the sample size and noise level, without requiring cross-validation. A key assumption underlying DSL is &lt;em>approximate sparsity&lt;/em>: only a small number of controls truly matter, so LASSO can safely set the rest to zero. When the true model is dense (many small effects rather than a few large ones), LASSO may struggle to select the right variables.&lt;/p>
&lt;p>Before implementing DSL, it helps to see the two methods side by side:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>BMA&lt;/th>
&lt;th>Post-Double-Selection&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Philosophy&lt;/td>
&lt;td>Bayesian (posteriors)&lt;/td>
&lt;td>Frequentist (p-values)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Strategy&lt;/td>
&lt;td>Average across models&lt;/td>
&lt;td>Select controls, then OLS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Output&lt;/td>
&lt;td>PIPs for every variable&lt;/td>
&lt;td>Set of selected controls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Speed&lt;/td>
&lt;td>Minutes (MCMC)&lt;/td>
&lt;td>Seconds (optimization)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Reference&lt;/td>
&lt;td>Raftery et al. (1997)&lt;/td>
&lt;td>Belloni, Chernozhukov, Hansen (2014)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="62-key-options">6.2 Key options&lt;/h3>
&lt;p>With the algorithm clear, let us examine the Stata implementation. The &lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>dsregress&lt;/code>&lt;/a> command has a concise syntax, but each element plays a specific role. The full option list is in the &lt;a href="https://www.stata.com/manuals/lasso.pdf" target="_blank" rel="noopener">Stata LASSO manual&lt;/a>; here we explain the ones used in this tutorial:&lt;/p>
&lt;p>&lt;strong>Syntax structure:&lt;/strong> &lt;code>dsregress depvar varsofinterest, controls(controlvars) [options]&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>$outcome&lt;/code>&lt;/strong> (&lt;code>ln_co2&lt;/code>) &amp;mdash; the dependent variable. DSL will run LASSO on this variable against all controls (Step 1)&lt;/li>
&lt;li>&lt;strong>&lt;code>$gdp_vars&lt;/code>&lt;/strong> (&lt;code>ln_gdp ln_gdp_sq ln_gdp_cb&lt;/code>) &amp;mdash; the &lt;em>variables of interest&lt;/em>. These are never penalized by LASSO; they always appear in the final OLS. DSL runs a separate LASSO for each one against all controls (Steps 2a&amp;ndash;2c)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>controls(($fe) $controls)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; the candidate controls subject to LASSO selection. Parentheses around &lt;code>$fe&lt;/code> tell Stata to treat factor variables (country and year dummies) as always-included in the LASSO penalty but available for selection. The 12 candidate controls are subject to the standard LASSO penalty&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">&lt;code>vce(cluster country_id)&lt;/code>&lt;/a>&lt;/strong> &amp;mdash; compute cluster-robust standard errors at the country level in the final OLS (Step 4). This also affects the LASSO penalty through the &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(plugin)&lt;/code>&lt;/a> method, which adjusts $\lambda$ for cluster dependence&lt;/li>
&lt;li>&lt;strong>&lt;code>selection(plugin)&lt;/code>&lt;/strong> (default) &amp;mdash; choose $\lambda$ using a data-driven analytical formula rather than cross-validation. The alternative &lt;a href="https://www.stata.com/manuals/lassolasso.pdf" target="_blank" rel="noopener">&lt;code>selection(cv)&lt;/code>&lt;/a> uses cross-validation but is slower&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassoinfo.pdf" target="_blank" rel="noopener">&lt;code>lassoinfo&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; reports the number of selected controls and the $\lambda$ value for each LASSO step&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.stata.com/manuals/lassolassocoef.pdf" target="_blank" rel="noopener">&lt;code>lassocoef&lt;/code>&lt;/a>&lt;/strong> (post-estimation) &amp;mdash; displays which specific variables were selected or dropped by LASSO&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Related commands.&lt;/strong> Stata also offers &lt;a href="https://www.stata.com/manuals/lassoporegress.pdf" target="_blank" rel="noopener">&lt;code>poregress&lt;/code>&lt;/a> (partialing-out regression), which &lt;em>residualizes&lt;/em> both the outcome and the treatment against all controls instead of selecting then regressing. Both methods provide valid inference. &lt;a href="https://www.stata.com/manuals/lassoxporegress.pdf" target="_blank" rel="noopener">&lt;code>xporegress&lt;/code>&lt;/a> extends this to cross-fit partialing-out for even more robust inference. This tutorial uses &lt;code>dsregress&lt;/code> because its select-then-regress logic is more intuitive for beginners.&lt;/p>
&lt;/blockquote>
&lt;h3 id="63-estimation">6.3 Estimation&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress $outcome $gdp_vars, ///
controls(($fe) $controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 112
Number of selected controls = 102
Wald chi2(3) = 53.15
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -7.433319 1.628321 -4.57 0.000 -10.62477 -4.241868
ln_gdp_sq | .8401567 .1713522 4.90 0.000 .5043126 1.176001
ln_gdp_cb | -.0310764 .005952 -5.22 0.000 -.0427421 -.0194107
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Post-double-selection completed in seconds with cluster-robust standard errors at the country level. Internally, &lt;code>dsregress&lt;/code> ran four separate LASSO regressions (Step 1 on CO&lt;sub>2&lt;/sub>, Steps 2a&amp;ndash;2c on each GDP term), took the union of all selected controls, and then ran a final OLS of CO&lt;sub>2&lt;/sub> on the GDP terms plus that union. All three GDP terms are significant at the 0.1% level. The Wald test strongly rejects the null that GDP terms are jointly zero ($\chi^2 = 53.15$, p &amp;lt; 0.001).&lt;/p>
&lt;h3 id="64-turning-points">6.4 Turning points&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Minimum:&lt;/strong> \$2,429 GDP per capita (true: \$1,895)&lt;/li>
&lt;li>&lt;strong>Maximum:&lt;/strong> \$27,672 GDP per capita (true: \$34,647)&lt;/li>
&lt;/ul>
&lt;p>The post-double-selection turning points (\$2,429 and \$27,672) fall between the sparse FE and kitchen-sink estimates, closer to the BMA values. With cluster-robust standard errors, the LASSO selection retained 102 of 112 controls for the outcome equation and 100 for each GDP term. The union of selected controls in Step 3 includes a few more candidate variables than without clustering, producing coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that lie between the sparse and kitchen-sink specifications.&lt;/p>
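&lt;p>These turning points follow from setting the derivative of the fitted cubic to zero, $\beta_1 + 2\beta_2 x + 3\beta_3 x^2 = 0$ with $x = \ln(\text{GDP})$, and exponentiating the roots. A short Python check, using the &lt;code>dsregress&lt;/code> coefficients reported above (small differences from the quoted dollar figures reflect rounding of the displayed coefficients):&lt;/p>

```python
import numpy as np

def turning_points(b1, b2, b3):
    """GDP levels where d(ln CO2)/d(ln GDP) = b1 + 2*b2*x + 3*b3*x**2 = 0,
    with x = ln(GDP). With b3 < 0 the smaller root is the minimum."""
    roots = np.roots([3 * b3, 2 * b2, b1])  # quadratic in x = ln(GDP)
    lo, hi = sorted(np.exp(np.real(roots)))
    return lo, hi

# Coefficients from the dsregress (FE) output above
tp_min, tp_max = turning_points(-7.433319, 0.8401567, -0.0310764)
print(f"minimum: ${tp_min:,.0f}   maximum: ${tp_max:,.0f}")
```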
&lt;h3 id="65-lasso-selection">6.5 LASSO selection&lt;/h3>
&lt;p>To understand which controls LASSO kept and which it dropped, we inspect the selection details:&lt;/p>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 102
ln_gdp | linear plugin .3818852 100
ln_gdp_sq | linear plugin .3818852 100
ln_gdp_cb | linear plugin .3818852 100
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>lassoinfo&lt;/code> output shows each of the four LASSO steps. The outcome equation selected 102 of 112 controls, while each GDP equation selected 100. The 112 candidates include 80 country dummies + 19 year dummies = 99 FE dummies, plus the 12 candidate variables and the constant. LASSO retains nearly all informative FE dummies and drops about 10&amp;ndash;12 of the weakest candidates at each step. The union across all four steps (Step 3) yields the final control set for Step 4&amp;rsquo;s OLS. With cluster-robust standard errors, the lambda is larger (0.382 vs 0.090 without clustering), leading to slightly different selection and producing DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) that fall between the sparse and kitchen-sink FE.&lt;/p>
&lt;p>Why does DSL not match BMA&amp;rsquo;s accuracy here? In panel data settings where FE dummies dominate the control set (99 of 112 variables), LASSO keeps nearly all of them and has limited room to discriminate among the 12 candidate controls of interest &amp;mdash; only 10&amp;ndash;12 variables were dropped at each step. This &amp;ldquo;almost everything selected&amp;rdquo; outcome means DSL&amp;rsquo;s final OLS is close to the kitchen-sink specification, which explains why its coefficients land between the sparse and kitchen-sink FE estimates rather than converging to the true DGP values. To see LASSO&amp;rsquo;s selection power unleashed, we next run DSL &lt;em>without&lt;/em> fixed effects.&lt;/p>
&lt;h3 id="66-pooled-dsl-without-fixed-effects">6.6 Pooled DSL (without fixed effects)&lt;/h3>
&lt;p>What happens when LASSO has only 12 candidate controls instead of 112? To answer this, we run DSL on the pooled data &amp;mdash; treating the panel as a cross-sectional dataset without country or year fixed effects. This gives LASSO full room to discriminate among the candidate controls, but at the cost of omitting the unobserved country heterogeneity that fixed effects would absorb.&lt;/p>
&lt;pre>&lt;code class="language-stata">* DSL without FE -- pooled cross-section with cluster-robust SEs
dsregress $outcome $gdp_vars, ///
controls($controls) ///
vce(cluster country_id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 1,600
Number of controls = 12
Number of selected controls = 7
Wald chi2(3) = 25.05
Prob &amp;gt; chi2 = 0.0000
(Std. err. adjusted for 80 clusters in country_id)
------------------------------------------------------------------------------
| Robust
ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp | -22.03297 5.277295 -4.18 0.000 -32.37628 -11.68966
ln_gdp_sq | 2.366878 .5652276 4.19 0.000 1.259052 3.474703
ln_gdp_cb | -.084224 .0199055 -4.23 0.000 -.1232381 -.04521
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The pooled DSL still finds the correct inverted-N sign pattern ($\beta_1 &amp;lt; 0$, $\beta_2 &amp;gt; 0$, $\beta_3 &amp;lt; 0$), but the magnitudes are dramatically different from the true DGP. The linear coefficient (&amp;ndash;22.03) is more than &lt;em>three times&lt;/em> the true value (&amp;ndash;7.10), and the other terms are similarly inflated. This is &lt;strong>omitted variable bias&lt;/strong>: without country fixed effects, the GDP terms absorb not only their own effect on CO&lt;sub>2&lt;/sub> but also the persistent cross-country differences in emissions levels that fixed effects would have captured.&lt;/p>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
ln_co2 | linear plugin .3818852 5
ln_gdp | linear plugin .3818852 7
ln_gdp_sq | linear plugin .3818852 7
ln_gdp_cb | linear plugin .3818852 7
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Now the contrast with the FE-based DSL is stark. The outcome LASSO selected only &lt;strong>5 of 12&lt;/strong> controls (vs 102 of 112 with FE), and the GDP LASSOs selected &lt;strong>7 of 12&lt;/strong> (vs 100 of 112). Without FE dummies flooding the candidate set, LASSO can genuinely discriminate &amp;mdash; it zeroed out 5&amp;ndash;7 of the 12 controls as irrelevant. The turning points are \$5,581 (minimum) and \$24,532 (maximum), far from the true values.&lt;/p>
&lt;p>This comparison illustrates a fundamental tradeoff in panel data econometrics: &lt;strong>fixed effects remove bias but limit LASSO&amp;rsquo;s selection power&lt;/strong>. With FE, the estimates are unbiased but LASSO selects almost everything. Without FE, LASSO selects sharply but the estimates are biased by unobserved heterogeneity. The FE-based DSL from Section 6.3 is the correct specification for this data, even though LASSO&amp;rsquo;s selection looks less impressive.&lt;/p>
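&lt;p>The bias mechanism behind the pooled estimates is easy to reproduce. The Python simulation below is a minimal sketch with arbitrary parameters, not the tutorial&amp;rsquo;s DGP: it generates a regressor correlated with a persistent country effect, so that pooled OLS overstates the slope while the within (FE) transformation recovers it:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_countries, T, beta = 80, 20, 1.0
a = rng.normal(0, 2, n_countries)               # persistent country effect
country = np.repeat(np.arange(n_countries), T)

# Regressor correlated with the country effect (as GDP is with emissions levels)
x = 0.8 * a[country] + rng.normal(0, 1, n_countries * T)
y = beta * x + a[country] + rng.normal(0, 1, n_countries * T)

def ols_slope(u, v):
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)

def within(v):
    """Subtract country means, as country fixed effects would."""
    means = np.bincount(country, weights=v) / T
    return v - means[country]

print(f"true slope:  {beta:.2f}")
print(f"pooled OLS:  {ols_slope(x, y):.2f}")                    # biased upward
print(f"within (FE): {ols_slope(within(x), within(y)):.2f}")    # near the truth
```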
&lt;h2 id="7-head-to-head-comparison">7. Head-to-Head Comparison&lt;/h2>
&lt;h3 id="71-coefficient-comparison">7.1 Coefficient comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Sparse FE&lt;/th>
&lt;th>Kitchen-Sink FE&lt;/th>
&lt;th>BMA (FE)&lt;/th>
&lt;th>DSL (FE)&lt;/th>
&lt;th>BMA (pooled)&lt;/th>
&lt;th>DSL (pooled)&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;7.498&lt;/td>
&lt;td>&amp;ndash;7.131&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;21.258&lt;/td>
&lt;td>&amp;ndash;22.033&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>0.849&lt;/td>
&lt;td>0.806&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>2.285&lt;/td>
&lt;td>2.367&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.081&lt;/td>
&lt;td>&amp;ndash;0.084&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$2,478&lt;/td>
&lt;td>\$2,426&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$5,752&lt;/td>
&lt;td>\$5,581&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$25,656&lt;/td>
&lt;td>\$27,694&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$23,298&lt;/td>
&lt;td>\$24,532&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The table reveals a sharp divide between FE-based and pooled specifications. The four FE-based methods (columns 2&amp;ndash;5) all produce GDP coefficients within a narrow range of the true values &amp;mdash; BMA (FE) and Kitchen-Sink FE are closest, with estimates within 1% of the truth. The two pooled methods (columns 6&amp;ndash;7) are dramatically biased, with coefficients inflated 2&amp;ndash;3x. Strikingly, BMA (pooled) and DSL (pooled) agree closely with &lt;em>each other&lt;/em> (&amp;ndash;21.26 vs &amp;ndash;22.03 for $\beta_1$), confirming that the bias comes from omitting fixed effects, not from the choice of variable selection method. Both pooled methods produce turning points displaced from the truth (\$5,600&amp;ndash;5,800 vs true \$1,895 for the minimum).&lt;/p>
&lt;h3 id="72-uncertainty-confidence-and-credible-intervals">7.2 Uncertainty: confidence and credible intervals&lt;/h3>
&lt;p>Point estimates tell only half the story. How &lt;em>uncertain&lt;/em> is each method, and does the interval actually contain the truth? The table below shows 95% confidence intervals (for the frequentist methods) and approximate 95% credible intervals (for BMA, computed as posterior mean $\pm$ 2 posterior SD). The last column checks whether the true DGP value falls inside the interval.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>$\beta_1$ (GDP) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_2$ (GDP²) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;th>$\beta_3$ (GDP³) interval&lt;/th>
&lt;th>Covers true?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Sparse FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.731, &amp;ndash;4.266]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.510, 1.188]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.020]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Kitchen-Sink FE&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.241, &amp;ndash;4.021]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.478, 1.134]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.041, &amp;ndash;0.018]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (FE)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;10.761, &amp;ndash;3.517]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.429, 1.186]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.017]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (FE)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;10.625, &amp;ndash;4.242]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[0.504, 1.176]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>[&amp;ndash;0.043, &amp;ndash;0.019]&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>BMA (pooled)&lt;/strong> (credible)&lt;/td>
&lt;td>[&amp;ndash;24.541, &amp;ndash;17.975]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.935, 2.635]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.094, &amp;ndash;0.069]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DSL (pooled)&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;32.376, &amp;ndash;11.690]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[1.259, 3.475]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;td>[&amp;ndash;0.123, &amp;ndash;0.045]&lt;/td>
&lt;td>&lt;strong>No&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True DGP&lt;/strong>&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;td>&lt;/td>
&lt;td>0.810&lt;/td>
&lt;td>&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The four FE-based methods all produce intervals that contain the true parameter values &amp;mdash; a reassuring result. Both pooled methods, however, &lt;strong>fail to cover the truth for any of the three coefficients&lt;/strong>. The pooled DSL intervals are wide (the $\beta_1$ interval spans 20.7 units) but centered so far from the truth that even this width cannot compensate. The pooled BMA credible intervals are actually &lt;em>narrower&lt;/em> (spanning 6.6 units for $\beta_1$) but even more precisely wrong &amp;mdash; they are tightly concentrated around the biased estimate. This is the worst-case scenario: &lt;strong>false precision from a misspecified model&lt;/strong>.&lt;/p>
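&lt;p>The coverage verdicts for $\beta_1$ can be checked mechanically from the interval endpoints in the table. A short Python sketch:&lt;/p>

```python
# 95% intervals for beta_1 (GDP) from the comparison table above
intervals = {
    "Sparse FE":       (-10.731, -4.266),
    "Kitchen-Sink FE": (-10.241, -4.021),
    "BMA (FE)":        (-10.761, -3.517),
    "DSL (FE)":        (-10.625, -4.242),
    "BMA (pooled)":    (-24.541, -17.975),
    "DSL (pooled)":    (-32.376, -11.690),
}
true_b1 = -7.100

for name, (lo, hi) in intervals.items():
    covers = lo <= true_b1 <= hi
    print(f"{name:16s} width = {hi - lo:6.2f}   covers truth: {covers}")
```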
&lt;p>&lt;strong>Width reflects uncertainty.&lt;/strong> Among the FE-based methods, BMA produces the widest interval for $\beta_1$ (width = 7.24), followed by Sparse FE (6.47), DSL with FE (6.38), and Kitchen-Sink FE (6.22). BMA&amp;rsquo;s wider intervals reflect its honest accounting of model uncertainty &amp;mdash; it averages across thousands of models, each contributing slightly different coefficient estimates, which inflates the posterior standard deviation. The frequentist methods condition on a single model and therefore understate the total uncertainty.&lt;/p>
&lt;p>&lt;strong>Centering reflects bias.&lt;/strong> Kitchen-Sink FE and BMA center their intervals closest to the true value (&amp;ndash;7.131 and &amp;ndash;7.139 vs. true &amp;ndash;7.100), while Sparse FE (&amp;ndash;7.498) and DSL with FE (&amp;ndash;7.433) are slightly further away. The pooled DSL (&amp;ndash;22.033) is dramatically off-center, illustrating that omitted variable bias overwhelms any precision gained from better variable selection.&lt;/p>
&lt;p>&lt;strong>Coverage requires correct specification.&lt;/strong> The pooled DSL result drives home a critical lesson: a confidence interval is only as good as the model behind it. The 95% label promises that, in repeated sampling, 95% of intervals would contain the truth &amp;mdash; but this guarantee holds only if the model is correctly specified. When country fixed effects are omitted, the model is misspecified, and the intervals fail despite being statistically &amp;ldquo;valid&amp;rdquo; within the pooled framework.&lt;/p>
&lt;p>&lt;strong>Bayesian vs frequentist interpretation.&lt;/strong> BMA&amp;rsquo;s credible intervals have a different interpretation: a 95% BMA credible interval says &amp;ldquo;given the data and priors, there is a 95% posterior probability the true coefficient lies in this range,&amp;rdquo; while a 95% confidence interval says &amp;ldquo;if we repeated this procedure many times, 95% of the intervals would contain the truth.&amp;rdquo; In practice, both require correct model specification to be reliable.&lt;/p>
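&lt;p>The coverage logic in the table can be checked mechanically. The sketch below (Python; the pooled-DSL intervals and true DGP values are taken from the table above, the variable names are illustrative) tests whether each interval contains the truth and computes the $\beta_1$ interval width:&lt;/p>

```python
# Pooled-DSL 95% confidence intervals and true DGP values from the table above
intervals = {
    "beta1": (-32.376, -11.690),
    "beta2": (1.259, 3.475),
    "beta3": (-0.123, -0.045),
}
truth = {"beta1": -7.100, "beta2": 0.810, "beta3": -0.030}

def covers(interval, value):
    """True if the closed interval [lo, hi] contains value."""
    lo, hi = interval
    return lo <= value <= hi

coverage = {k: covers(intervals[k], truth[k]) for k in intervals}
width_b1 = intervals["beta1"][1] - intervals["beta1"][0]
# No pooled-DSL interval covers the truth, even though the beta1
# interval is about 20.7 units wide.
```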
&lt;h3 id="73-predicted-ekc-curves">7.3 Predicted EKC curves&lt;/h3>
&lt;p>The curves are normalized to zero at the sample-mean GDP so both methods are directly comparable:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Generate predicted EKC curves for BMA and DSL, normalized at mean GDP
summarize ln_gdp
local xmin = r(min)
local xmax = r(max)
local xmean = r(mean)
clear
set obs 500
gen lngdp = `xmin' + (_n - 1) * (`xmax' - `xmin') / 499
* Cubic component for each method (using stored coefficients)
gen fit_bma = `b1_bma' * lngdp + `b2_bma' * lngdp^2 + `b3_bma' * lngdp^3
gen fit_dsl = `b1_dsl' * lngdp + `b2_dsl' * lngdp^2 + `b3_dsl' * lngdp^3
* Normalize: subtract value at sample-mean GDP
local norm_bma = `b1_bma' * `xmean' + `b2_bma' * `xmean'^2 + `b3_bma' * `xmean'^3
local norm_dsl = `b1_dsl' * `xmean' + `b2_dsl' * `xmean'^2 + `b3_dsl' * `xmean'^3
replace fit_bma = fit_bma - `norm_bma'
replace fit_dsl = fit_dsl - `norm_dsl'
twoway ///
(line fit_bma lngdp, lcolor(&amp;quot;106 155 204&amp;quot;) lwidth(medthick)) ///
(line fit_dsl lngdp, lcolor(&amp;quot;217 119 87&amp;quot;) lwidth(medthick) lpattern(dash)), ///
xline(`lnmin_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
xline(`lnmax_bma', lcolor(&amp;quot;106 155 204&amp;quot;%50) lpattern(shortdash)) ///
ytitle(&amp;quot;Predicted log CO2 (normalized at mean GDP)&amp;quot;) ///
xtitle(&amp;quot;Log GDP per capita&amp;quot;) ///
title(&amp;quot;Predicted EKC Shape: BMA vs. DSL&amp;quot;) ///
legend(order(1 &amp;quot;BMA&amp;quot; 2 &amp;quot;DSL&amp;quot;) rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig5_ekc_curves.png" alt="Predicted EKC curves from BMA and DSL, normalized at the sample mean. Both methods trace a clear inverted-N shape with closely aligned turning points.">&lt;/p>
&lt;p>Both curves trace a clear inverted-N: CO&lt;sub>2&lt;/sub> falls at low incomes, rises through industrialization, and falls again at high incomes. The BMA curve (solid blue) and DSL curve (dashed orange) are nearly indistinguishable, with turning points closely aligned. The normalization at mean GDP makes the shape immediately visible &amp;mdash; a major improvement over plotting raw cubic components that would sit at different y-levels.&lt;/p>
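&lt;p>How close are the two curves numerically? A quick check (Python) evaluates both normalized cubics over the log-GDP range using the rounded coefficients reported in this tutorial; the sample-mean log GDP of 8.8 and the evaluation range are assumptions for illustration:&lt;/p>

```python
import numpy as np

# Rounded coefficients from the tutorial (BMA posterior means, DSL estimates)
b_bma = (-7.139, 0.808, -0.030)
b_dsl = (-7.433, 0.840, -0.031)
x_mean = 8.8  # assumed sample-mean log GDP (the tutorial normalizes here)

def ekc(b, x):
    """Cubic EKC component: b1*x + b2*x^2 + b3*x^3."""
    return b[0] * x + b[1] * x**2 + b[2] * x**3

x = np.linspace(6.2, 11.3, 500)  # roughly the observed log-GDP range
fit_bma = ekc(b_bma, x) - ekc(b_bma, x_mean)
fit_dsl = ekc(b_dsl, x) - ekc(b_dsl, x_mean)
max_gap = np.max(np.abs(fit_bma - fit_dsl))
# The normalized curves stay within roughly 0.1 log points of each other,
# consistent with the "nearly indistinguishable" description.
```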
&lt;h3 id="74-answer-key-grading-the-methods">7.4 Answer key: grading the methods&lt;/h3>
&lt;p>The ultimate test: do BMA and DSL correctly identify the 5 true control variables (alongside the 3 GDP terms) and reject the 7 noise variables?&lt;/p>
&lt;pre>&lt;code class="language-stata">* Dot plot: BMA PIPs color-coded by ground truth
* (extract PIPs, label variables, mark true vs noise --- see analysis.do)
graph twoway ///
(scatter order pip if is_true == 1, ///
mcolor(&amp;quot;106 155 204&amp;quot;) msymbol(circle) msize(large)) ///
(scatter order pip if is_true == 0, ///
mcolor(gs9) msymbol(diamond) msize(large)), ///
xline(0.8, lcolor(&amp;quot;217 119 87&amp;quot;) lpattern(dash) lwidth(medium)) ///
ylabel(1(1)15, valuelabel angle(0) labsize(small)) ///
xlabel(0(0.2)1, format(%3.1f)) ///
xtitle(&amp;quot;BMA Posterior Inclusion Probability&amp;quot;) ///
title(&amp;quot;Answer Key: Do BMA and DSL Recover the Truth?&amp;quot;) ///
legend(order(1 &amp;quot;True predictor&amp;quot; 2 &amp;quot;Noise variable&amp;quot;) ///
rows(1) position(6)) ///
scheme(s2color)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_bma_dsl_fig6_answer_key.png" alt="Dot plot showing BMA Posterior Inclusion Probabilities for each variable, color-coded by ground truth. True predictors (circles, blue) cluster above the 0.80 threshold; noise variables (diamonds, gray) cluster below it.">&lt;/p>
&lt;p>&lt;strong>BMA&amp;rsquo;s report card:&lt;/strong> Of the 8 true predictors (3 GDP terms + 5 controls), BMA correctly assigns PIP &amp;gt; 0.80 to 6 &amp;mdash; the three GDP terms, fossil fuel, industry, and renewable energy. It misses urban (PIP ~ 0.27) and democracy (PIP ~ 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). All 7 noise variables receive PIPs well below 0.80. BMA makes &lt;strong>zero false positives&lt;/strong> (no noise variable incorrectly flagged as robust) and &lt;strong>two false negatives&lt;/strong> (two weak true predictors missed).&lt;/p>
&lt;p>&lt;strong>Post-double-selection&amp;rsquo;s report card:&lt;/strong> With cluster-robust SEs, the union of all four LASSO steps selected 102 of 112 total controls (including FE dummies). The resulting DSL coefficients (&amp;ndash;7.433, 0.840, &amp;ndash;0.031) fall between the sparse and kitchen-sink FE, closer to the true DGP than the sparse specification. The entire procedure runs in seconds rather than minutes.&lt;/p>
&lt;p>&lt;strong>Bottom line:&lt;/strong> Both methods recover the inverted-N EKC shape. BMA provides more granular variable-level inference (PIPs), while DSL provides fast, valid coefficient estimates. The synthetic data &amp;ldquo;answer key&amp;rdquo; confirms that both are doing their job &amp;mdash; with the expected limitation that weak signals are hard to detect.&lt;/p>
&lt;h2 id="8-discussion">8. Discussion&lt;/h2>
&lt;h3 id="81-what-the-results-mean-for-the-ekc">8.1 What the results mean for the EKC&lt;/h3>
&lt;p>Both BMA and DSL identify the &lt;strong>inverted-N&lt;/strong> EKC shape with turning points close to the true DGP values. BMA correctly identifies 6 of 8 true predictors (3 GDP terms + fossil fuel, industry, renewable) with zero false positives among noise variables. The inverted-N shape implies three phases of the income&amp;ndash;pollution relationship:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (below ~\$2,400): Very poor countries where CO&lt;sub>2&lt;/sub> may fall as subsistence agriculture shifts toward slightly cleaner energy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Rising phase&lt;/strong> (~\$2,400 to ~\$27,000): Industrializing countries where emissions rise sharply. Most of the world&amp;rsquo;s population lives here.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Declining phase&lt;/strong> (above ~\$27,000): Wealthy countries where clean technology and regulation reduce emissions.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The policy implication is important: the inverted-N suggests that the &amp;ldquo;environmental improvement&amp;rdquo; phase is not automatic. Unlike the simpler inverted-U hypothesis, which predicts a single turning point after which pollution monotonically declines, the inverted-N warns that countries at very low income levels may &lt;em>already&lt;/em> be on a declining emissions path that reverses once industrialization begins. This makes the middle-income range &amp;mdash; where emissions rise steeply &amp;mdash; the critical window for environmental policy intervention.&lt;/p>
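&lt;p>The turning points quoted above follow directly from the cubic: set the derivative $\beta_1 + 2\beta_2 x + 3\beta_3 x^2$ to zero and exponentiate the roots. A Python sketch using the rounded BMA posterior means (rounding is why the results land near, not exactly on, the reported values):&lt;/p>

```python
import numpy as np

# Rounded BMA (FE) posterior means for ln GDP, (ln GDP)^2, (ln GDP)^3
b1, b2, b3 = -7.139, 0.808, -0.030

# First-order condition: b1 + 2*b2*x + 3*b3*x^2 = 0, with x = ln(GDP)
roots = np.roots([3 * b3, 2 * b2, b1])
x_lo, x_hi = sorted(roots.real)

# Second derivative 2*b2 + 6*b3*x classifies each root
is_min = 2 * b2 + 6 * b3 * x_lo > 0  # local minimum at the lower root
tp_min, tp_max = np.exp(x_lo), np.exp(x_hi)
# With these rounded coefficients the minimum lands around $2,600 and the
# maximum around $24,500, close to the tutorial's $2,411 and $27,269.
```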
&lt;p>The three robust control variables identified by BMA reinforce this narrative:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Fossil fuel dependence&lt;/strong> (PIP = 1.000) is the single strongest predictor of CO&lt;sub>2&lt;/sub> emissions, with a coefficient close to the true DGP value.&lt;/li>
&lt;li>&lt;strong>Renewable energy share&lt;/strong> (PIP = 0.959) enters with a negative sign, confirming that energy mix transitions reduce emissions.&lt;/li>
&lt;li>&lt;strong>Industry value-added&lt;/strong> (PIP = 0.999) captures the composition effect &amp;mdash; economies dominated by manufacturing produce more CO&lt;sub>2&lt;/sub> per unit of GDP than service-based economies.&lt;/li>
&lt;/ul>
&lt;h3 id="82-when-to-use-bma-vs-post-double-selection">8.2 When to use BMA vs post-double-selection&lt;/h3>
&lt;p>The two methods answer fundamentally different research questions:&lt;/p>
&lt;p>&lt;strong>Use BMA&lt;/strong> when the question is &lt;em>&amp;ldquo;which variables robustly predict the outcome?&amp;rdquo;&lt;/em> BMA provides PIPs, coefficient densities, and a complete picture of the model space. It excels in exploratory settings where variable importance is the goal. In our simulation, BMA produced the most accurate coefficient estimates (&amp;ndash;7.139 vs true &amp;ndash;7.100) and provided rich diagnostics (PIP chart, density plots) that make the evidence for each variable transparent. The cost is computational: BMA requires MCMC sampling (minutes to hours depending on the model space).&lt;/p>
&lt;p>&lt;strong>Use post-double-selection&lt;/strong> when the question is &lt;em>&amp;ldquo;what is the causal effect of a specific variable of interest, controlling for high-dimensional confounders?&amp;rdquo;&lt;/em> DSL provides fast, valid inference on the coefficients of interest with standard errors and confidence intervals. It is designed for settings where you have a clear treatment variable and many potential controls. In our simulation, DSL completed in seconds and produced valid standard errors, but its coefficient estimates (&amp;ndash;7.433) were less accurate than BMA&amp;rsquo;s because LASSO had limited room to discriminate among controls in the FE-heavy panel setting.&lt;/p>
&lt;p>&lt;strong>Use both together&lt;/strong> (as in this tutorial) when you want the strongest possible evidence. If a Bayesian and a frequentist method agree on the sign, magnitude, and significance of an effect, the finding is unlikely to be an artifact of any single modeling choice. Disagreements between the methods are also informative &amp;mdash; they signal areas where the evidence is sensitive to assumptions.&lt;/p>
&lt;h3 id="83-pooled-vs-fixed-effects-a-cautionary-comparison">8.3 Pooled vs fixed effects: a cautionary comparison&lt;/h3>
&lt;p>The pooled specifications (Sections 5.7 and 6.6) provide a powerful pedagogical contrast. When we strip away fixed effects and run both BMA and DSL on pooled data, three things happen simultaneously:&lt;/p>
&lt;p>&lt;strong>LASSO selection improves but estimates worsen.&lt;/strong> Without 99 FE dummies diluting the candidate set, LASSO in pooled DSL selected only 5&amp;ndash;7 of 12 controls (vs 102 of 112 with FE). This is closer to the &amp;ldquo;textbook&amp;rdquo; LASSO scenario where the method has genuine discriminating power. Yet the resulting coefficient estimates are 2&amp;ndash;3x the true values because omitted country heterogeneity biases everything.&lt;/p>
&lt;p>&lt;strong>BMA PIPs become unreliable.&lt;/strong> With fixed effects, BMA assigned PIP near zero to all 7 noise variables &amp;mdash; zero false positives. Without FE, five variables received PIPs above 0.80: four noise variables (services, pop_density, credit, and trade) plus democracy, whose tiny true coefficient cannot justify such confidence. These variables are correlated with the omitted country effects, and BMA interprets the spurious correlations as genuine predictive power. This demonstrates that &lt;strong>PIP thresholds are only meaningful when the model set is correctly specified&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Both methods agree on the bias.&lt;/strong> Pooled BMA and pooled DSL produce remarkably similar biased coefficients ($\beta_1 = -21.26$ vs $-22.03$), confirming that the problem is not the variable selection method but the omitted fixed effects. The agreement between a Bayesian and a frequentist method on the &lt;em>wrong&lt;/em> answer reinforces the lesson: &lt;strong>method agreement is not a substitute for correct model specification&lt;/strong>.&lt;/p>
&lt;p>The practical takeaway for applied researchers: in panel data settings, always include entity fixed effects (or equivalent controls for unobserved heterogeneity) before applying BMA or DSL. Running these methods on pooled data without FE will produce misleading results &amp;mdash; not because the methods fail, but because the models they average over or select from are all misspecified.&lt;/p>
&lt;h3 id="84-limitations-and-caveats">8.4 Limitations and caveats&lt;/h3>
&lt;p>&lt;strong>Synthetic vs real data.&lt;/strong> This is synthetic data &amp;mdash; the patterns are sharper than real-world data, and we can verify ground truth only because we designed the DGP. With real data, model uncertainty is genuinely unresolvable, and there is no answer key to check against. The separation between true predictors and noise variables is cleaner here than in most applications.&lt;/p>
&lt;p>&lt;strong>Weak signals are hard to detect.&lt;/strong> Both methods missed urban population (PIP = 0.27) and democracy (PIP = 0.02), whose true coefficients are small (0.007 and &amp;ndash;0.005). This is not a failure of the methods &amp;mdash; it is a fundamental statistical limitation. Detecting a coefficient of 0.005 in the presence of panel-level noise requires either a much larger sample or a stronger signal.&lt;/p>
&lt;p>&lt;strong>Panel FE and LASSO.&lt;/strong> In our panel setting, 99 of 112 candidate controls are FE dummies that LASSO retains almost entirely. This limits DSL&amp;rsquo;s ability to discriminate among the 12 candidate controls. In cross-sectional settings or settings with many genuinely irrelevant variables, DSL would have more room to operate and potentially match BMA&amp;rsquo;s accuracy.&lt;/p>
&lt;p>&lt;strong>Extensions.&lt;/strong> Researchers working with real EKC data should also consider endogeneity (via 2SLS-BMA, as in Gravina and Lanzafame, 2025), alternative pollutants (SO&lt;sub>2&lt;/sub>, PM2.5), spatial dependence across countries, and structural breaks in the income&amp;ndash;pollution relationship.&lt;/p>
&lt;h2 id="9-summary-and-next-steps">9. Summary and Next Steps&lt;/h2>
&lt;h3 id="takeaways">Takeaways&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Both methods confirm the inverted-N shape.&lt;/strong> BMA (Bayesian, averaging across models) and post-double-selection (frequentist, LASSO-based) both recover the inverted-N EKC. BMA produces coefficients closest to the true DGP (&amp;ndash;7.139 vs &amp;ndash;7.100 for $\beta_1$). DSL with cluster-robust SEs gives &amp;ndash;7.433, falling between the sparse and kitchen-sink FE. Both methods outperform the naive sparse specification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Both methods recover the ground truth.&lt;/strong> BMA correctly identifies 6 of 8 true predictors with zero false positives. The three strongest true controls (fossil fuel, industry, renewable energy) all receive PIPs above 0.95. The two misses (urban, democracy) have small true coefficients, illustrating that even good methods have limits with weak signals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model uncertainty is real.&lt;/strong> The GDP linear coefficient shifts from &amp;ndash;7.498 (sparse) to &amp;ndash;7.131 (kitchen-sink) depending on which controls are included. The maximum turning point moves by \$2,000. BMA and DSL provide principled solutions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>BMA and post-double-selection serve different purposes.&lt;/strong> BMA excels at variable selection (PIPs, coefficient densities) and produced the most accurate coefficient estimates in this setting. Post-double-selection is fastest and provides standard frequentist inference with cluster-robust SEs. In panel settings dominated by FE dummies, LASSO has limited room to discriminate among candidate controls; DSL would be more powerful in cross-sectional settings with many irrelevant variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fixed effects are essential, not optional.&lt;/strong> Running either method on pooled data without FE produces coefficients inflated 2&amp;ndash;3x (BMA pooled: &amp;ndash;21.26, DSL pooled: &amp;ndash;22.03, vs true &amp;ndash;7.10 for $\beta_1$). Worse, pooled BMA assigns high PIPs to 5 noise variables that the FE-based BMA correctly rejects. Confidence and credible intervals from pooled models fail to cover the true values for all three coefficients. The lesson: always include fixed effects in panel data before applying variable selection methods.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="exercises">Exercises&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sensitivity to the g-prior.&lt;/strong> Re-run &lt;code>bmaregress&lt;/code> with &lt;code>gprior(bric)&lt;/code> instead of &lt;code>gprior(uip)&lt;/code>. The BIC prior penalizes model complexity more heavily. Do the PIPs change? Does it still identify fossil fuel, industry, and renewable as robust? (&lt;em>Hint:&lt;/em> BIC priors tend to be more conservative, so borderline variables may drop below the threshold.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test for inverted-U.&lt;/strong> Drop &lt;code>ln_gdp_cb&lt;/code> and re-run with only linear and squared GDP terms. What do BMA and DSL say about the simpler quadratic specification? (&lt;em>Hint:&lt;/em> since the DGP includes a cubic term, the quadratic model is misspecified &amp;mdash; check whether the coefficients absorb the cubic effect or produce a visibly different EKC shape.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Increase noise.&lt;/strong> Re-generate the synthetic data with &lt;code>sigma_eps = 0.30&lt;/code> (double the noise) in &lt;code>generate_data.do&lt;/code> and re-run the full analysis. How does this affect BMA&amp;rsquo;s ability to distinguish true predictors from noise? (&lt;em>Hint:&lt;/em> expect more variables with PIPs in the ambiguous 0.3&amp;ndash;0.7 range, and possibly some noise variables crossing the 0.80 threshold &amp;mdash; false positives become more likely with noisier data.)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="appendix-a-first-differences-analysis">Appendix A: First-Differences Analysis&lt;/h2>
&lt;h3 id="a1-motivation">A.1 Motivation&lt;/h3>
&lt;p>The fixed effects estimator removes time-invariant country heterogeneity by demeaning each variable within country. An alternative approach is &lt;strong>first differencing&lt;/strong>: computing the change between the last and first year for each country ($\Delta x_i = x_{i,2014} - x_{i,1995}$). This also removes time-invariant effects and produces a pure &lt;strong>cross-sectional&lt;/strong> dataset of 80 observations &amp;mdash; one per country. The cross-sectional setting is where LASSO-based methods are most powerful, because there are no FE dummies diluting the candidate set.&lt;/p>
&lt;p>The tradeoff is statistical power: first differencing uses only two data points per country (discarding 18 intermediate years), while the within-estimator uses all 20. We expect noisier estimates but cleaner variable selection.&lt;/p>
&lt;h3 id="a2-constructing-the-first-difference-dataset">A.2 Constructing the first-difference dataset&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Keep only first (1995) and last (2014) years, reshape, compute differences
keep if year == 1995 | year == 2014
reshape wide $outcome $gdp_vars $controls, i(country_id) j(year)
foreach v in $outcome $gdp_vars $controls {
gen d_`v' = `v'2014 - `v'1995
}
&lt;/code>&lt;/pre>
&lt;p>This produces 80 observations, each representing how much a country&amp;rsquo;s variables changed over the 20-year period. For example, &lt;code>d_ln_gdp&lt;/code> measures the log growth in GDP per capita from 1995 to 2014.&lt;/p>
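&lt;p>The reshape-and-difference step can also be prototyped outside Stata. A minimal pandas equivalent, using toy numbers rather than the tutorial&amp;rsquo;s dataset:&lt;/p>

```python
import pandas as pd

# Toy long-format panel: two countries observed in the first and last year
panel = pd.DataFrame({
    "country_id": [1, 1, 2, 2],
    "year":       [1995, 2014, 1995, 2014],
    "ln_co2":     [1.00, 1.40, 2.00, 1.90],
})

# Equivalent of Stata's `reshape wide ..., i(country_id) j(year)`
wide = panel.pivot(index="country_id", columns="year", values="ln_co2")

# Equivalent of `gen d_ln_co2 = ln_co2_2014 - ln_co2_1995`
wide["d_ln_co2"] = wide[2014] - wide[1995]
```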
&lt;h3 id="a3-baseline-ols-on-first-differences">A.3 Baseline OLS on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Sparse: GDP terms only
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, robust
* Kitchen-sink: all 12 controls
regress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization, robust
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Sparse OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0009
R-squared = 0.1433
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -10.36189 4.092422 -2.53 0.013 -18.51265 -2.211121
d_ln_gdp_sq | 1.155962 .4223643 2.74 0.008 .3147506 1.997173
d_ln_gdp_cb | -.0414947 .0143721 -2.89 0.005 -.0701192 -.0128702
_cons | -.3036562 .0724366 -4.19 0.000 -.4479262 -.1593861
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FD Kitchen-sink OLS:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 80
Prob &amp;gt; F = 0.0029
R-squared = 0.3707
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -8.109709 5.031758 -1.61 0.112 -18.1618 1.942382
d_ln_gdp_sq | .9238864 .5213262 1.77 0.081 -.1175823 1.965355
d_ln_gdp_cb | -.0336221 .0179583 -1.87 0.066 -.0694979 .0022536
d_fossil_f~l | .0147108 .0067313 2.19 0.033 .0012635 .0281582
d_renewable | -.0237808 .0110384 -2.15 0.035 -.0458327 -.001729
d_urban | .0002501 .014913 0.02 0.987 -.0295421 .0300424
d_industry | .0309085 .0105974 2.92 0.005 .0097377 .0520793
d_democracy | .019337 .0290345 0.67 0.508 -.038666 .07734
d_services | -.0047239 .0098816 -0.48 0.634 -.0244647 .0150169
d_trade | .006726 .0044062 1.53 0.132 -.0020764 .0155284
d_fdi | .0000124 .0091898 0.00 0.999 -.0183463 .0183712
d_credit | .0028644 .0043456 0.66 0.512 -.0058169 .0115457
d_pop_dens~y | .0006396 .0004991 1.28 0.205 -.0003575 .0016366
d_corruption | -.0036115 .0033497 -1.08 0.285 -.0103033 .0030803
d_globaliz~n | -.0004567 .0082494 -0.06 0.956 -.0169368 .0160235
_cons | -.0085823 .1746184 -0.05 0.961 -.3574226 .340258
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The FD sparse OLS finds the inverted-N sign pattern with all three terms significant at the 5% level &amp;mdash; but the coefficients are noisier than the FE estimates (e.g., $\beta_1 = -10.36$ vs &amp;ndash;7.50 for sparse FE). The R² of 0.14 is low, reflecting the loss of within-country time-series variation when collapsing 20 years into a single difference.&lt;/p>
&lt;p>Adding controls in the kitchen-sink raises R² to 0.37 but makes the GDP terms individually insignificant (p = 0.07&amp;ndash;0.11) &amp;mdash; a consequence of having only 80 observations and 15 regressors. Among the controls, fossil fuel (p = 0.033), renewable energy (p = 0.035), and industry (p = 0.005) are significant &amp;mdash; the same three strong predictors identified by BMA with fixed effects.&lt;/p>
&lt;h3 id="a4-bma-on-first-differences">A.4 BMA on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">bmaregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb ///
d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization, ///
mprior(uniform) gprior(uip) ///
mcmcsize(50000) rseed(9988) pipcutoff(0.5) burnin(5000)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Bayesian model averaging No. of obs = 80
Linear regression No. of predictors = 15
MC3 sampling Groups = 15
Always = 0
No. of models = 2,317
For CPMP &amp;gt;= .9 = 581
Priors: Mean model size = 3.304
Models: Uniform Burn-in = 5,000
Cons.: Noninformative MCMC sample size = 50,000
Coef.: Zellner's g Acceptance rate = 0.3080
g: Unit-information, g = 80 Shrinkage, g/(1+g) = 0.9877
sigma2: Noninformative Mean sigma2 = 0.051
Sampling correlation = 0.9958
------------------------------------------------------------------------------
d_ln_co2 | Mean Std. dev. Group PIP
-------------+----------------------------------------------------------------
d_industry | .0364834 .0090778 7 .99823
------------------------------------------------------------------------------
Note: 14 predictors with PIP less than .5 not shown.
&lt;/code>&lt;/pre>
&lt;p>The FD-BMA result is dramatically different from the FE-based BMA. Only &lt;strong>one variable&lt;/strong> passes the 0.50 PIP display threshold: the change in industry share (PIP = 0.998). The three GDP polynomial terms all have PIPs below 0.30:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>PIP (FD-BMA)&lt;/th>
&lt;th>PIP (FE-BMA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>d_ln_gdp&lt;/td>
&lt;td>0.298&lt;/td>
&lt;td>0.994&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_sq&lt;/td>
&lt;td>0.267&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_ln_gdp_cb&lt;/td>
&lt;td>0.271&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_fossil_fuel&lt;/td>
&lt;td>0.183&lt;/td>
&lt;td>1.000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_renewable&lt;/td>
&lt;td>0.350&lt;/td>
&lt;td>0.959&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_urban&lt;/td>
&lt;td>0.096&lt;/td>
&lt;td>0.268&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_industry&lt;/td>
&lt;td>&lt;strong>0.998&lt;/strong>&lt;/td>
&lt;td>0.999&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>d_democracy&lt;/td>
&lt;td>0.094&lt;/td>
&lt;td>0.023&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With only 80 cross-sectional observations, BMA&amp;rsquo;s evidence threshold is much harder to clear. The GDP terms &amp;mdash; which are &lt;em>the core of the EKC&lt;/em> &amp;mdash; do not survive because the 20-year differences are noisy and the cubic polynomial requires precise estimation of three correlated terms simultaneously.&lt;/p>
&lt;p>The change in industry share is the only variable with a strong enough signal-to-noise ratio to clear BMA&amp;rsquo;s bar. The FE-based BMA (N = 1,600) has 20x more observations to work with, which is why it identifies 6 robust variables.&lt;/p>
&lt;h3 id="a5-dsl-on-first-differences">A.5 DSL on first differences&lt;/h3>
&lt;pre>&lt;code class="language-stata">dsregress d_ln_co2 d_ln_gdp d_ln_gdp_sq d_ln_gdp_cb, ///
controls(d_fossil_fuel d_renewable d_urban d_industry d_democracy ///
d_services d_trade d_fdi d_credit d_pop_density ///
d_corruption d_globalization) ///
rseed(9988)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Double-selection linear model Number of obs = 80
Number of controls = 12
Number of selected controls = 1
Wald chi2(3) = 10.65
Prob &amp;gt; chi2 = 0.0138
------------------------------------------------------------------------------
| Robust
d_ln_co2 | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
d_ln_gdp | -5.047196 4.558593 -1.11 0.268 -13.98187 3.887483
d_ln_gdp_sq | .5943786 .4700569 1.26 0.206 -.326916 1.515673
d_ln_gdp_cb | -.0220809 .0160386 -1.38 0.169 -.0535159 .0093541
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">lassoinfo
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Estimate: active
Command: dsregress
------------------------------------------------------
| No. of
| Selection selected
Variable | Model method lambda variables
------------+-----------------------------------------
d_ln_co2 | linear plugin .3818852 1
d_ln_gdp | linear plugin .3818852 0
d_ln_gdp_sq | linear plugin .3818852 0
d_ln_gdp_cb | linear plugin .3818852 0
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>FD-DSL selected only &lt;strong>1 control&lt;/strong> for the outcome equation (likely d_industry, consistent with BMA) and &lt;strong>zero controls&lt;/strong> for each of the three GDP equations. With such sparse selection, the final OLS is essentially a regression of d_ln_co2 on the three GDP terms plus one control &amp;mdash; and none of the three GDP terms are individually significant (p = 0.17&amp;ndash;0.27). The Wald test does reject joint insignificance at the 5% level (p = 0.014), so the GDP terms collectively retain some explanatory power, but the individual estimates are too noisy for inference.&lt;/p>
&lt;h3 id="a6-comparison-first-differences-vs-fixed-effects">A.6 Comparison: first differences vs fixed effects&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>FD Sparse&lt;/th>
&lt;th>FD Kitchen&lt;/th>
&lt;th>FD BMA&lt;/th>
&lt;th>FD DSL&lt;/th>
&lt;th>FE BMA&lt;/th>
&lt;th>FE DSL&lt;/th>
&lt;th>True DGP&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>$\beta_1$ (GDP)&lt;/td>
&lt;td>&amp;ndash;10.362&lt;/td>
&lt;td>&amp;ndash;8.110&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;5.047&lt;/td>
&lt;td>&amp;ndash;7.139&lt;/td>
&lt;td>&amp;ndash;7.433&lt;/td>
&lt;td>&amp;ndash;7.100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_2$ (GDP²)&lt;/td>
&lt;td>1.156&lt;/td>
&lt;td>0.924&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>0.594&lt;/td>
&lt;td>0.808&lt;/td>
&lt;td>0.840&lt;/td>
&lt;td>0.810&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>$\beta_3$ (GDP³)&lt;/td>
&lt;td>&amp;ndash;0.041&lt;/td>
&lt;td>&amp;ndash;0.034&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>&amp;ndash;0.022&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;td>&amp;ndash;0.031&lt;/td>
&lt;td>&amp;ndash;0.030&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GDP terms robust?&lt;/strong>&lt;/td>
&lt;td>Yes (p &amp;lt; 0.05)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>No&lt;/strong> (PIP &amp;lt; 0.30)&lt;/td>
&lt;td>No (p &amp;gt; 0.05)&lt;/td>
&lt;td>&lt;strong>Yes&lt;/strong> (PIP &amp;gt; 0.99)&lt;/td>
&lt;td>Yes (p &amp;lt; 0.001)&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Controls selected&lt;/strong>&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>1 of 12&lt;/td>
&lt;td>6 of 12&lt;/td>
&lt;td>102 of 112&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Min TP&lt;/strong>&lt;/td>
&lt;td>\$1,913&lt;/td>
&lt;td>\$1,465&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$987&lt;/td>
&lt;td>\$2,411&lt;/td>
&lt;td>\$2,429&lt;/td>
&lt;td>\$1,895&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Max TP&lt;/strong>&lt;/td>
&lt;td>\$60,817&lt;/td>
&lt;td>\$61,655&lt;/td>
&lt;td>n/a&lt;/td>
&lt;td>\$62,983&lt;/td>
&lt;td>\$27,269&lt;/td>
&lt;td>\$27,672&lt;/td>
&lt;td>\$34,647&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> FD-BMA posterior means for the GDP terms are heavily shrunk toward zero (because their PIPs are ~0.27&amp;ndash;0.30), so we report &amp;ldquo;n/a&amp;rdquo; rather than misleading point estimates.&lt;/p>
&lt;/blockquote>
&lt;p>The comparison reveals a stark trade-off between the two identification strategies:&lt;/p>
&lt;p>&lt;strong>Fixed effects win on accuracy.&lt;/strong> The FE-based estimates are close to the true DGP values, with BMA (FE) achieving the best accuracy ($\beta_1 = -7.139$ vs true &amp;ndash;7.100). The FD estimates are noisier: FD-sparse overshoots ($\beta_1 = -10.36$), while FD-DSL undershoots (&amp;ndash;5.05). The FD turning points are wildly inaccurate &amp;mdash; the maximum turning point is \$61,000&amp;ndash;63,000 in first differences vs \$27,000 with FE (true: \$34,647).&lt;/p>
&lt;p>&lt;strong>First differences struggle with the cubic polynomial.&lt;/strong> Estimating a cubic EKC requires precise measurement of three highly correlated terms ($\ln GDP$, $(\ln GDP)^2$, $(\ln GDP)^3$). With only 80 observations (one 20-year change per country), the multicollinearity among differenced GDP terms is severe. Both BMA and DSL respond rationally: BMA gives all three terms PIPs below 0.30, and DSL selects zero controls for the GDP equations. Neither method &amp;ldquo;trusts&amp;rdquo; the cubic specification in this small sample.&lt;/p>
&lt;p>&lt;strong>Industry is the strongest cross-sectional signal.&lt;/strong> Both FD-BMA (PIP = 0.998) and FD-DSL (selected as the sole control) identify the change in industry share as the most important cross-sectional predictor of CO&lt;sub>2&lt;/sub> change. This makes economic sense: countries that industrialized the most over 1995&amp;ndash;2014 also increased their emissions the most, regardless of their income trajectory.&lt;/p>
&lt;p>&lt;strong>Practical implication.&lt;/strong> First differences are appropriate when the research question is about &lt;em>long-run changes&lt;/em> rather than &lt;em>levels&lt;/em>. But for testing the EKC cubic shape, the panel FE approach is far more powerful because it uses all 1,600 observations rather than collapsing to 80. The FD analysis confirms that the inverted-N result in the main body is robust to the identification strategy in spirit (the signs are correct in FD-sparse OLS), but the magnitudes and statistical power are substantially weaker.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1016/j.eneco.2025.108649" target="_blank" rel="noopener">Gravina, A. F. &amp;amp; Lanzafame, M. (2025). What&amp;rsquo;s your shape? Bayesian model averaging and double machine learning for the Environmental Kuznets Curve. &lt;em>Energy Economics&lt;/em>, 108649.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1002/jae.623" target="_blank" rel="noopener">Fernandez, C., Ley, E., &amp;amp; Steel, M. F. J. (2001). Model uncertainty in cross-country growth regressions. &lt;em>Journal of Applied Econometrics&lt;/em>, 16(5), 563&amp;ndash;576.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/restud/rdt044" target="_blank" rel="noopener">Belloni, A., Chernozhukov, V., &amp;amp; Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. &lt;em>Review of Economic Studies&lt;/em>, 81(2), 608&amp;ndash;650.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1997.10473615" target="_blank" rel="noopener">Raftery, A. E., Madigan, D., &amp;amp; Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. &lt;em>Journal of the American Statistical Association&lt;/em>, 92(437), 179&amp;ndash;191.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/271063" target="_blank" rel="noopener">Raftery, A. E. (1995). Bayesian model selection in social research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/bmabmaregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>bmaregress&lt;/code> &amp;mdash; Bayesian Model Averaging regression&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/lassodsregress.pdf" target="_blank" rel="noopener">Stata 18 Manual: &lt;code>dsregress&lt;/code> &amp;mdash; Double-Selection LASSO linear regression&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Three Methods for Robust Variable Selection: BMA, LASSO, and WALS</title><link>https://carlos-mendez.org/post/r_bma_lasso_wals/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/r_bma_lasso_wals/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine you are an economist advising a government on climate policy. Your team has collected cross-country data on a dozen potential drivers of CO&lt;sub>2&lt;/sub> emissions: GDP per capita, fossil fuel dependence, urbanization, industrial output, democratic governance, trade networks, agricultural activity, trade openness, foreign direct investment, corruption, tourism, and domestic credit. The government has a limited budget and wants to know: &lt;strong>which of these factors truly drive CO&lt;sub>2&lt;/sub> emissions, and which are red herrings?&lt;/strong>&lt;/p>
&lt;p>This is the &lt;strong>variable selection&lt;/strong> problem, and it is harder than it sounds. With 12 candidate variables, each either included or excluded from a regression, there are $2^{12} = 4,096$ possible models you could estimate. Run one model and report it as &amp;ldquo;the answer,&amp;rdquo; and you have implicitly assumed the other 4,095 models are wrong. That is a very strong assumption &amp;mdash; and almost certainly unjustified.&lt;/p>
&lt;p>In practice, researchers handle this by &lt;em>specification searching&lt;/em>: they try many models, drop insignificant variables, and report whichever specification &amp;ldquo;works best.&amp;rdquo; This process inflates false discoveries. A noise variable that happens to look significant in one specification gets reported, while the many failed specifications are hidden in the researcher&amp;rsquo;s desk drawer. This is sometimes called the &lt;strong>file drawer problem&lt;/strong> or &lt;strong>pretesting bias&lt;/strong>.&lt;/p>
&lt;p>This tutorial introduces three principled approaches to the variable selection problem:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Q[&amp;quot;&amp;lt;b&amp;gt;Variable Selection&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Which of 12 variables&amp;lt;br/&amp;gt;truly matter?&amp;quot;] --&amp;gt; BMA
Q --&amp;gt; LASSO
Q --&amp;gt; WALS
BMA[&amp;quot;&amp;lt;b&amp;gt;BMA&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Bayesian Model Averaging&amp;lt;br/&amp;gt;PIPs from 4,096 models&amp;quot;] --&amp;gt; R[&amp;quot;&amp;lt;b&amp;gt;Convergence&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Variables identified&amp;lt;br/&amp;gt;by all 3 methods&amp;quot;]
LASSO[&amp;quot;&amp;lt;b&amp;gt;LASSO&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;L1 penalized regression&amp;lt;br/&amp;gt;Automatic selection&amp;quot;] --&amp;gt; R
WALS[&amp;quot;&amp;lt;b&amp;gt;WALS&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Frequentist averaging&amp;lt;br/&amp;gt;t-statistics&amp;quot;] --&amp;gt; R
style Q fill:#141413,stroke:#141413,color:#fff
style BMA fill:#6a9bcc,stroke:#141413,color:#fff
style LASSO fill:#d97757,stroke:#141413,color:#fff
style WALS fill:#00d4c8,stroke:#141413,color:#fff
style R fill:#1a3a8a,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong>: Average across all 4,096 models, weighting each by how well it fits the data. Variables that appear important across many models earn a high &amp;ldquo;inclusion probability.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO (Least Absolute Shrinkage and Selection Operator)&lt;/strong>: Add a penalty to the regression that forces the coefficients of irrelevant variables to be &lt;em>exactly zero&lt;/em>, performing automatic selection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Weighted Average Least Squares (WALS)&lt;/strong>: A fast frequentist model-averaging method that transforms the problem so each variable can be evaluated independently.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>We use &lt;strong>synthetic data&lt;/strong> throughout this tutorial. This means we &lt;em>know the true data-generating process&lt;/em> &amp;mdash; which variables truly matter and which do not. This &amp;ldquo;answer key&amp;rdquo; lets us verify whether each method correctly recovers the truth. By the end, you will understand not just &lt;em>how&lt;/em> to run each method, but &lt;em>why&lt;/em> it works and &lt;em>when&lt;/em> to prefer one over the others.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the variable selection problem and why running a single model is insufficient when model uncertainty is large&lt;/li>
&lt;li>Implement Bayesian Model Averaging in R and interpret Posterior Inclusion Probabilities (PIPs)&lt;/li>
&lt;li>Apply LASSO with cross-validation to perform automatic variable selection and use Post-LASSO for unbiased estimation&lt;/li>
&lt;li>Run WALS as a fast frequentist model-averaging alternative and interpret its t-statistics&lt;/li>
&lt;li>Compare results across all three methods to identify truly robust determinants via methodological triangulation&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Content outline.&lt;/strong> Section 2 sets up the R environment. Section 3 introduces the synthetic dataset and its built-in &amp;ldquo;answer key&amp;rdquo; &amp;mdash; 7 true predictors and 5 noise variables with realistic multicollinearity. Section 4 runs naive OLS to illustrate the spurious significance problem. Sections 5&amp;ndash;8 cover BMA: Bayes' rule foundations, the PIP framework, a toy example, and full implementation. Sections 9&amp;ndash;12 cover LASSO: the bias-variance tradeoff, L1/L2 geometry, cross-validated implementation, and Post-LASSO. Sections 13&amp;ndash;16 cover WALS: frequentist model averaging, the semi-orthogonal transformation, the Laplace prior, and implementation. Section 17 brings all three methods together for a grand comparison. Section 18 summarizes key takeaways and provides further reading.&lt;/p>
&lt;h2 id="2-setup">2. Setup&lt;/h2>
&lt;p>Before running the analysis, install the required packages if needed. The following code checks for missing packages and installs them automatically.&lt;/p>
&lt;pre>&lt;code class="language-r"># List all packages needed for this tutorial
required_packages &amp;lt;- c(
&amp;quot;tidyverse&amp;quot;, # data manipulation and ggplot2 visualization
&amp;quot;BMS&amp;quot;, # Bayesian Model Averaging via the bms() function
&amp;quot;glmnet&amp;quot;, # LASSO and Ridge regression via coordinate descent
&amp;quot;WALS&amp;quot;, # Weighted Average Least Squares estimation
&amp;quot;scales&amp;quot;, # nice axis formatting in plots
&amp;quot;patchwork&amp;quot;, # combine multiple ggplot panels
&amp;quot;ggrepel&amp;quot;, # non-overlapping text labels on plots
&amp;quot;corrplot&amp;quot;, # correlation matrix heatmaps
&amp;quot;broom&amp;quot; # tidy model summaries
)
# Install any packages not yet available
missing &amp;lt;- required_packages[!sapply(required_packages, requireNamespace, quietly = TRUE)]
if (length(missing) &amp;gt; 0) {
install.packages(missing, repos = &amp;quot;https://cloud.r-project.org&amp;quot;)
}
# Load libraries
library(tidyverse)
library(BMS)
library(glmnet)
library(WALS)
library(scales)
library(patchwork)
library(ggrepel)
library(corrplot)
library(broom)
&lt;/code>&lt;/pre>
&lt;h2 id="3-the-synthetic-dataset">3. The Synthetic Dataset&lt;/h2>
&lt;h3 id="31-the-data-generating-process-our-answer-key">3.1 The data-generating process (our &amp;ldquo;answer key&amp;rdquo;)&lt;/h3>
&lt;p>We use a cross-sectional dataset of 120 fictional countries. The key design choices:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>7 variables have true nonzero effects&lt;/strong> on CO&lt;sub>2&lt;/sub> emissions&lt;/li>
&lt;li>&lt;strong>5 variables are pure noise&lt;/strong> (their true coefficients are exactly zero)&lt;/li>
&lt;li>The noise variables are &lt;strong>correlated with GDP and other true predictors&lt;/strong>, creating realistic multicollinearity. This makes variable selection genuinely challenging &amp;mdash; naive OLS will find spurious &amp;ldquo;significant&amp;rdquo; results for noise variables.&lt;/li>
&lt;/ul>
&lt;p>Think of this as setting up a controlled experiment. We know the answer before we begin, so we can grade each method&amp;rsquo;s performance.&lt;/p>
&lt;p>The data-generating process below shows exactly how the synthetic dataset was built. The CSV file &lt;code>synthetic-co2-cross-section.csv&lt;/code> was generated with &lt;code>set.seed(2017)&lt;/code> and can be loaded directly from GitHub for full reproducibility.&lt;/p>
&lt;pre>&lt;code class="language-r"># --- DATA-GENERATING PROCESS (reference) ---
set.seed(2017)
n &amp;lt;- 120 # number of &amp;quot;countries&amp;quot;
# GDP drives many other variables (realistic: richer countries
# have higher urbanization, more industry, etc.)
log_gdp &amp;lt;- rnorm(n, mean = 8.5, sd = 1.5)
# --- TRUE PREDICTORS (correlated with GDP) ---
fossil_fuel &amp;lt;- 30 + 3 * log_gdp + rnorm(n, 0, 10) # higher in richer countries
urban_pop &amp;lt;- 20 + 5 * log_gdp + rnorm(n, 0, 12) # increases with income
industry &amp;lt;- 15 + 1.5 * log_gdp + rnorm(n, 0, 6) # industry share
democracy &amp;lt;- 5 + 2 * log_gdp + rnorm(n, 0, 8) # democracy index
trade_network &amp;lt;- 0.2 + 0.05 * log_gdp + rnorm(n, 0, 0.15) # trade centrality
agriculture &amp;lt;- 40 - 3 * log_gdp + rnorm(n, 0, 8) # negatively correlated with GDP
# --- NOISE VARIABLES (correlated with GDP but NO true effect) ---
log_trade &amp;lt;- 3.5 + 0.1 * log_gdp + rnorm(n, 0, 0.5)
fdi &amp;lt;- 2 + rnorm(n, 0, 4) # pure noise, uncorrelated with GDP
corruption &amp;lt;- 0.8 - 0.05 * log_gdp + rnorm(n, 0, 0.15)
log_tourism &amp;lt;- 12 + 0.3 * log_gdp + rnorm(n, 0, 1.2)
log_credit &amp;lt;- 2.5 + 0.15 * log_gdp + rnorm(n, 0, 0.6)
# --- TRUE DATA-GENERATING PROCESS ---
log_co2 &amp;lt;- 2.0 + # intercept
1.200 * log_gdp + # GDP: strong positive (elasticity)
0.008 * industry + # industry: positive
0.012 * fossil_fuel + # fossil fuel: positive
0.010 * urban_pop + # urbanization: positive
0.004 * democracy + # democracy: small positive
0.500 * trade_network + # trade network: moderate positive
0.005 * agriculture + # agriculture: weak positive
# NOISE VARIABLES HAVE ZERO TRUE EFFECT
rnorm(n, 0, 0.3) # random noise (sigma = 0.3)
&lt;/code>&lt;/pre>
&lt;p>The true coefficients serve as our &amp;ldquo;answer key&amp;rdquo;:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Variable&lt;/th>
&lt;th style="text-align:left">True $\beta$&lt;/th>
&lt;th style="text-align:left">Role&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">log_gdp&lt;/td>
&lt;td style="text-align:left">1.200&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">trade_network&lt;/td>
&lt;td style="text-align:left">0.500&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Moderate positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fossil_fuel&lt;/td>
&lt;td style="text-align:left">0.012&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">urban_pop&lt;/td>
&lt;td style="text-align:left">0.010&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">1 pp more urbanization $\to$ 1.0% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">industry&lt;/td>
&lt;td style="text-align:left">0.008&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Positive composition effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">agriculture&lt;/td>
&lt;td style="text-align:left">0.005&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Weak positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">democracy&lt;/td>
&lt;td style="text-align:left">0.004&lt;/td>
&lt;td style="text-align:left">True predictor&lt;/td>
&lt;td style="text-align:left">Small positive effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_trade&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">fdi&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">corruption&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_tourism&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">log_credit&lt;/td>
&lt;td style="text-align:left">0&lt;/td>
&lt;td style="text-align:left">Noise&lt;/td>
&lt;td style="text-align:left">No true effect&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Now let us load the pre-generated dataset:&lt;/p>
&lt;pre>&lt;code class="language-r"># Load the synthetic dataset directly from GitHub
DATA_URL &amp;lt;- &amp;quot;https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_bma_lasso_wals/synthetic-co2-cross-section.csv&amp;quot;
synth_data &amp;lt;- read.csv(DATA_URL)
cat(&amp;quot;Dataset:&amp;quot;, nrow(synth_data), &amp;quot;countries,&amp;quot;, ncol(synth_data), &amp;quot;variables\n&amp;quot;)
head(synth_data)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset: 120 countries, 14 variables
country log_co2 log_gdp industry fossil_fuel urban_pop democracy trade_network
1 Country_001 13.27 9.47 29.25 66.94 67.97 25.67 0.77
2 Country_002 12.18 8.44 24.97 51.43 66.14 20.51 0.85
3 Country_003 13.50 10.16 28.19 50.62 73.91 29.08 0.73
...
&lt;/code>&lt;/pre>
&lt;h3 id="32-descriptive-statistics">3.2 Descriptive statistics&lt;/h3>
&lt;p>The following summary statistics give us a first look at the data structure. Note the wide range of scales: GDP is in log units (mean around 8.5), while percentage variables like fossil fuel share and urbanization range from single digits to near 100.&lt;/p>
&lt;pre>&lt;code class="language-r"># Descriptive statistics for all 13 numeric variables
synth_data |&amp;gt;
select(-country) |&amp;gt;
pivot_longer(everything(), names_to = &amp;quot;variable&amp;quot;, values_to = &amp;quot;value&amp;quot;) |&amp;gt;
summarise(
n = n(),
mean = round(mean(value), 2),
sd = round(sd(value), 2),
min = round(min(value), 2),
max = round(max(value), 2),
.by = variable
)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable n mean sd min max
log_co2 120 14.22 2.11 8.76 20.36
log_gdp 120 8.53 1.57 4.61 13.21
industry 120 27.87 6.21 8.32 44.98
fossil_fuel 120 55.49 9.62 24.72 81.22
urban_pop 120 62.52 13.25 29.81 97.62
democracy 120 22.94 8.32 3.10 45.00
trade_network 120 0.64 0.17 0.18 1.04
agriculture 120 13.87 8.11 1.00 37.11
log_trade 120 4.43 0.46 3.45 5.84
fdi 120 2.23 4.19 -5.00 13.62
corruption 120 0.37 0.16 0.05 0.71
log_tourism 120 14.61 1.32 11.54 19.63
log_credit 120 3.83 0.65 2.30 5.50
&lt;/code>&lt;/pre>
&lt;p>The dataset has 120 observations and 14 variables (1 dependent, 12 candidate regressors, 1 country identifier). The dependent variable &lt;code>log_co2&lt;/code> has a mean of 14.22 with a standard deviation of 2.11 log points, reflecting substantial cross-country variation in emissions. The candidate regressors span very different scales &amp;mdash; trade_network ranges from 0.18 to 1.04, while urban_pop ranges from 29.8 to 97.6 &amp;mdash; which is why BMA, LASSO, and WALS each handle scaling internally.&lt;/p>
&lt;h3 id="33-correlation-structure">3.3 Correlation structure&lt;/h3>
&lt;p>A key feature of our synthetic data is that the noise variables are correlated with the true predictors &amp;mdash; especially with GDP. This correlation is what makes variable selection difficult: in a standard OLS regression, the noise variables will &amp;ldquo;borrow&amp;rdquo; explanatory power from the true predictors.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute correlation matrix for all 12 candidate regressors
cor_matrix &amp;lt;- synth_data |&amp;gt;
select(-country, -log_co2) |&amp;gt;
cor()
# Draw the heatmap
corrplot(cor_matrix, method = &amp;quot;color&amp;quot;, type = &amp;quot;lower&amp;quot;,
addCoef.col = &amp;quot;black&amp;quot;, number.cex = 0.7,
col = colorRampPalette(c(&amp;quot;#d97757&amp;quot;, &amp;quot;white&amp;quot;, &amp;quot;#6a9bcc&amp;quot;))(200),
diag = FALSE)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_01_correlation.png" alt="Correlation matrix heatmap showing that noise variables like trade openness, tourism, and credit are correlated with GDP and other true predictors, creating the multicollinearity that makes variable selection challenging.">&lt;/p>
&lt;p>The correlation heatmap reveals the realistic structure we built into the data. GDP is positively correlated with fossil fuel use, urbanization, industry, and the trade network &amp;mdash; but also with the noise variables like trade openness, tourism, and credit. This multicollinearity is precisely what makes a naive &amp;ldquo;throw everything into OLS&amp;rdquo; approach unreliable. For example, log_tourism has a correlation of approximately 0.3 with log_gdp, which means it can pick up GDP&amp;rsquo;s signal even though its true effect is zero.&lt;/p>
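&lt;p>The sample correlations in the heatmap can be checked against the values implied by the data-generating process. As a standalone numeric sketch (in Python rather than R, purely for illustration), the DGP parameters for &lt;code>log_tourism&lt;/code> from Section 3.1 &amp;mdash; slope 0.3 on &lt;code>log_gdp&lt;/code>, noise sd 1.2, and sd of &lt;code>log_gdp&lt;/code> equal to 1.5 &amp;mdash; imply a population correlation of about 0.35:&lt;/p>

```python
from math import sqrt

# DGP: log_tourism = 12 + 0.3 * log_gdp + e, with sd(log_gdp) = 1.5, sd(e) = 1.2
slope, sd_gdp, sd_noise = 0.3, 1.5, 1.2

sd_tourism = sqrt((slope * sd_gdp)**2 + sd_noise**2)  # implied sd of log_tourism
corr = slope * sd_gdp / sd_tourism                    # Corr(log_gdp, log_tourism)
print(round(corr, 2))  # 0.35
```

&lt;p>So the modest correlation of roughly 0.3&amp;ndash;0.35 in the sample is built into the design: enough for &lt;code>log_tourism&lt;/code> to free-ride on GDP&amp;rsquo;s signal, even though its own true effect is zero.&lt;/p>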
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> We created a synthetic dataset where we &lt;em>know&lt;/em> which 7 variables truly affect CO&lt;sub>2&lt;/sub> emissions and which 5 are noise. The noise variables are deliberately correlated with the true predictors, mimicking the multicollinearity found in real cross-country data.&lt;/p>
&lt;/blockquote>
&lt;h2 id="4-the-general-model">4. The General Model&lt;/h2>
&lt;p>Our goal is to estimate the following linear model:&lt;/p>
&lt;p>$$
\log(\text{CO}_{2,i}) = \beta_0 + \sum_{j=1}^{12} \beta_j x_{j,i} + \varepsilon_i
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\log(\text{CO}_{2,i})$ is the log of CO&lt;sub>2&lt;/sub> emissions for country $i$&lt;/li>
&lt;li>$\beta_0$ is the &lt;strong>intercept&lt;/strong> (the predicted log CO&lt;sub>2&lt;/sub> when all regressors are zero)&lt;/li>
&lt;li>$\beta_j$ is the &lt;strong>coefficient&lt;/strong> on the $j$-th regressor: the change in log CO&lt;sub>2&lt;/sub> associated with a one-unit increase in $x_j$, holding all other variables constant&lt;/li>
&lt;li>$\varepsilon_i$ is the &lt;strong>error term&lt;/strong>: everything that affects CO&lt;sub>2&lt;/sub> emissions but is not captured by the 12 regressors&lt;/li>
&lt;/ul>
&lt;p>Because the dependent variable is in logs, the interpretation of each coefficient depends on whether the regressor is also in logs:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Regressor type&lt;/th>
&lt;th style="text-align:left">Interpretation of $\beta_j$&lt;/th>
&lt;th style="text-align:left">Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">Log-log (e.g., log GDP)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Elasticity&lt;/strong>: a 1% increase in GDP is associated with a $\beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 1.2$ means 1% more GDP $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">Level-log (e.g., fossil fuel %)&lt;/td>
&lt;td style="text-align:left">&lt;strong>Semi-elasticity&lt;/strong>: a 1-unit increase in the regressor is associated with a $100 \times \beta_j$% change in CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;td style="text-align:left">$\beta = 0.012$ means 1 pp more fossil fuel $\to$ 1.2% more CO&lt;sub>2&lt;/sub>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
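&lt;p>For small coefficients, the &amp;ldquo;$100 \times \beta_j$%&amp;rdquo; reading is an approximation: the exact proportional change implied by a log-dependent model is $e^{\beta_j} - 1$. A quick numeric check (in Python for convenience, using the fossil-fuel coefficient from the answer key) shows how close the two are at this scale:&lt;/p>

```python
from math import exp

# Log-level case: log(CO2) depends on fossil_fuel with beta = 0.012.
# A 1 pp rise in fossil_fuel changes CO2 by exp(beta) - 1, approximately beta.
beta = 0.012
exact_pct = (exp(beta) - 1) * 100   # exact percent change
approx_pct = beta * 100             # rule-of-thumb percent change
print(round(exact_pct, 3), round(approx_pct, 3))  # 1.207 1.2
```

&lt;p>At $\beta = 0.012$ the two agree to the second decimal, so the rule of thumb is safe here; for much larger coefficients the gap becomes material and the exact formula should be used.&lt;/p>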
&lt;p>We want to determine &lt;strong>which $\beta_j$ are truly nonzero&lt;/strong>. We know the answer (we designed the data), but let us first see what happens if we just run OLS with all 12 variables.&lt;/p>
&lt;pre>&lt;code class="language-r"># Run OLS with all 12 candidate regressors
ols_full &amp;lt;- lm(log_co2 ~ log_gdp + industry + fossil_fuel + urban_pop +
democracy + trade_network + agriculture +
log_trade + fdi + corruption + log_tourism + log_credit,
data = synth_data)
# Display summary
summary(ols_full)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Coefficients:
Estimate Std. Error t value Pr(&amp;gt;|t|)
(Intercept) 2.283773 0.494736 4.616 1.06e-05 ***
log_gdp 1.163669 0.032747 35.537 &amp;lt; 2e-16 ***
industry 0.017577 0.005004 3.513 0.000661 ***
fossil_fuel 0.011988 0.003240 3.698 0.000349 ***
urban_pop 0.008221 0.002689 3.057 0.002794 **
democracy 0.010497 0.003975 2.640 0.009549 **
trade_network 0.912828 0.203681 4.482 1.94e-05 ***
agriculture -0.000629 0.004242 -0.148 0.882568
log_trade -0.055738 0.064829 -0.860 0.391509
fdi 0.000789 0.007045 0.112 0.910964
corruption 0.010767 0.201954 0.053 0.957573
log_tourism -0.028025 0.024415 -1.148 0.253610
log_credit 0.045689 0.049690 0.919 0.360252
---
Multiple R-squared: 0.9801, Adjusted R-squared: 0.9779
&lt;/code>&lt;/pre>
&lt;p>Look carefully at the noise variables. For example, log_trade has a t-statistic of $-0.86$ (p = 0.392) and corruption has a t-statistic of $0.05$ (p = 0.958). None reach conventional significance in this sample. However, their estimated coefficients can be non-negligible in magnitude &amp;mdash; and in a different random sample, some noise variables could easily cross the 5% threshold. This is the risk of &lt;strong>spurious significance&lt;/strong>, caused by the correlation between noise variables and the true predictors. It is precisely this problem that motivates the three methods we study next.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Warning.&lt;/strong> With 12 correlated regressors and only 120 observations, OLS can produce misleading significance levels. A variable with a true coefficient of zero may appear significant simply because it is correlated with a genuinely important predictor. This is why we need principled variable selection methods.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #6a9bcc 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 1: Bayesian Model Averaging
&lt;/div>
&lt;h2 id="5-bayes-rule-----the-foundation">5. Bayes' Rule &amp;mdash; The Foundation&lt;/h2>
&lt;p>Before we can understand Bayesian Model Averaging, we need to understand &lt;strong>Bayes' rule&lt;/strong> &amp;mdash; the mathematical machinery that powers the entire framework.&lt;/p>
&lt;h3 id="51-a-coin-flip-example">5.1 A coin-flip example&lt;/h3>
&lt;p>Suppose a friend gives you a coin. You want to know: &lt;strong>is this coin fair&lt;/strong> (probability of heads = 0.5), or is it &lt;strong>biased&lt;/strong> (probability of heads = 0.7)?&lt;/p>
&lt;p>Before flipping, you have no strong opinion. You assign equal &lt;strong>prior probabilities&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>$P(\text{fair}) = 0.5$ (50% chance the coin is fair)&lt;/li>
&lt;li>$P(\text{biased}) = 0.5$ (50% chance the coin is biased)&lt;/li>
&lt;/ul>
&lt;p>Now you flip the coin 10 times and observe &lt;strong>7 heads&lt;/strong>. How should you update your beliefs?&lt;/p>
&lt;p>The &lt;strong>likelihood&lt;/strong> of seeing 7 heads in 10 flips is:&lt;/p>
&lt;ul>
&lt;li>If the coin is fair ($p = 0.5$): $P(\text{7 heads} | \text{fair}) = \binom{10}{7} (0.5)^{10} = 0.1172$&lt;/li>
&lt;li>If the coin is biased ($p = 0.7$): $P(\text{7 heads} | \text{biased}) = \binom{10}{7} (0.7)^7 (0.3)^3 = 0.2668$&lt;/li>
&lt;/ul>
&lt;p>The biased coin makes the data more likely. Bayes' rule combines the prior and the likelihood:&lt;/p>
&lt;p>$$
P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$P(H|D)$ = &lt;strong>posterior probability&lt;/strong> (what we believe &lt;em>after&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D|H)$ = &lt;strong>likelihood&lt;/strong> (how probable the data is under hypothesis $H$)&lt;/li>
&lt;li>$P(H)$ = &lt;strong>prior probability&lt;/strong> (what we believed &lt;em>before&lt;/em> seeing the data)&lt;/li>
&lt;li>$P(D)$ = &lt;strong>marginal likelihood&lt;/strong> (a normalizing constant that ensures probabilities sum to 1)&lt;/li>
&lt;/ul>
&lt;p>For our coin:&lt;/p>
&lt;p>$$
P(\text{fair}|\text{7H}) = \frac{0.1172 \times 0.5}{0.1172 \times 0.5 + 0.2668 \times 0.5} = \frac{0.0586}{0.1920} = 0.305
$$&lt;/p>
&lt;p>$$
P(\text{biased}|\text{7H}) = \frac{0.2668 \times 0.5}{0.1920} = 0.695
$$&lt;/p>
&lt;p>After seeing 7 heads, we update from 50&amp;ndash;50 to roughly 30&amp;ndash;70 in favor of the biased coin. &lt;strong>The data shifted our beliefs, but did not erase the prior entirely.&lt;/strong>&lt;/p>
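&lt;p>The arithmetic above is easy to verify directly. The sketch below (in Python rather than R, since it needs only basic arithmetic) recomputes the two likelihoods and the posterior update:&lt;/p>

```python
from math import comb

# Likelihood of observing 7 heads in 10 flips under each hypothesis
lik_fair = comb(10, 7) * 0.5**10            # about 0.1172
lik_biased = comb(10, 7) * 0.7**7 * 0.3**3  # about 0.2668

# Equal priors of 0.5 on each hypothesis
prior = 0.5
evidence = lik_fair * prior + lik_biased * prior  # P(D), the normalizer

post_fair = lik_fair * prior / evidence
post_biased = lik_biased * prior / evidence
print(round(post_fair, 3), round(post_biased, 3))  # 0.305 0.695
```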
&lt;h3 id="52-the-bridge-to-model-averaging">5.2 The bridge to model averaging&lt;/h3>
&lt;p>Now replace &amp;ldquo;fair coin&amp;rdquo; and &amp;ldquo;biased coin&amp;rdquo; with &lt;em>regression models&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Hypothesis = &amp;ldquo;Which variables belong in the model?&amp;rdquo;&lt;/li>
&lt;li>Prior = &amp;ldquo;Before seeing data, any combination of variables is equally plausible&amp;rdquo;&lt;/li>
&lt;li>Likelihood = &amp;ldquo;How well does each model fit the data?&amp;rdquo;&lt;/li>
&lt;li>Posterior = &amp;ldquo;After seeing data, which models are most credible?&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>This is exactly what BMA does. Instead of two coin hypotheses, we have 4,096 model hypotheses &amp;mdash; but the logic of Bayes' rule is identical.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> Bayes' rule updates prior beliefs using data. The posterior probability of any hypothesis is proportional to its prior probability times its likelihood. BMA applies this same logic to regression models instead of coin flips.&lt;/p>
&lt;/blockquote>
&lt;h2 id="6-the-bma-framework">6. The BMA Framework&lt;/h2>
&lt;h3 id="61-posterior-model-probability">6.1 Posterior model probability&lt;/h3>
&lt;p>With $K = 12$ candidate regressors, there are $2^K = 4,096$ possible models. Denote the $k$-th model as $M_k$. BMA assigns each model a &lt;strong>posterior probability&lt;/strong>:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{P(y | M_k) \cdot P(M_k)}{\sum_{l=1}^{2^K} P(y | M_l) \cdot P(M_l)}
$$&lt;/p>
&lt;p>This is just Bayes' rule applied to models. Let us unpack each piece:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>$P(y | M_k)$&lt;/strong> is the &lt;strong>marginal likelihood&lt;/strong> of model $M_k$. It measures how well the model fits the data, &lt;em>automatically penalizing complexity&lt;/em>. A model with many parameters can fit the data closely, but the marginal likelihood integrates over all possible parameter values, spreading the probability thin. This acts as a built-in &lt;strong>Occam&amp;rsquo;s razor&lt;/strong>: simpler models that fit the data well receive higher marginal likelihoods than complex models that fit only slightly better.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>$P(M_k)$&lt;/strong> is the &lt;strong>prior model probability&lt;/strong>. With no prior information, we use a &lt;strong>uniform prior&lt;/strong>: every model is equally likely, so $P(M_k) = 1/4,096$ for all $k$. This means the posterior is driven entirely by the data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The &lt;strong>denominator&lt;/strong> is a normalizing constant that ensures all posterior model probabilities sum to 1.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-posterior-inclusion-probability-pip">6.2 Posterior Inclusion Probability (PIP)&lt;/h3>
&lt;p>We do not really care about individual models &amp;mdash; we care about individual &lt;em>variables&lt;/em>. The &lt;strong>Posterior Inclusion Probability&lt;/strong> of variable $j$ is the sum of the posterior probabilities of all models that include variable $j$:&lt;/p>
&lt;p>$$
\text{PIP}_j = \sum_{k:\, j \in M_k} P(M_k | y)
$$&lt;/p>
&lt;p>Think of it as a &lt;strong>democratic vote&lt;/strong>. Each of the 4,096 models casts a vote for which variables matter. But the votes are &lt;em>weighted&lt;/em>: models that fit the data well get louder voices. If variable $j$ appears in most of the high-probability models, it earns a high PIP.&lt;/p>
&lt;p>The standard interpretation thresholds follow Raftery (1995):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">PIP range&lt;/th>
&lt;th style="text-align:left">Interpretation&lt;/th>
&lt;th style="text-align:left">Analogy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$\geq 0.99$&lt;/td>
&lt;td style="text-align:left">Decisive evidence&lt;/td>
&lt;td style="text-align:left">Beyond reasonable doubt&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.95 - 0.99$&lt;/td>
&lt;td style="text-align:left">Very strong evidence&lt;/td>
&lt;td style="text-align:left">Strong consensus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.80 - 0.95$&lt;/td>
&lt;td style="text-align:left">Strong evidence (robust)&lt;/td>
&lt;td style="text-align:left">Clear majority&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$0.50 - 0.80$&lt;/td>
&lt;td style="text-align:left">Borderline evidence&lt;/td>
&lt;td style="text-align:left">Split vote&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$&amp;lt; 0.50$&lt;/td>
&lt;td style="text-align:left">Weak/no evidence (fragile)&lt;/td>
&lt;td style="text-align:left">Minority opinion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We will use &lt;strong>PIP $\geq$ 0.80&lt;/strong> as our threshold for &amp;ldquo;robust&amp;rdquo; throughout this tutorial.&lt;/p>
&lt;h3 id="63-posterior-mean">6.3 Posterior mean&lt;/h3>
&lt;p>Once we know which variables matter, we want to know &lt;em>how much&lt;/em> they matter. The &lt;strong>posterior mean&lt;/strong> of coefficient $j$ is:&lt;/p>
&lt;p>$$
E[\beta_j | y] = \sum_{k=1}^{2^K} \hat{\beta}_{j,k} \cdot P(M_k | y)
$$&lt;/p>
&lt;p>where $\hat{\beta}_{j,k}$ is the estimated coefficient of variable $j$ in model $k$ (and zero if $j$ is not in model $k$). This is a weighted average of the coefficient across all models. Variables with high PIPs get posterior means close to their &amp;ldquo;full model&amp;rdquo; estimates; variables with low PIPs get posterior means shrunk toward zero.&lt;/p>
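&lt;p>As a quick numeric illustration of this weighted average, here is a minimal sketch with made-up posterior model probabilities and made-up per-model coefficient estimates:&lt;/p>
&lt;pre>&lt;code class="language-r"># BMA posterior mean of one coefficient (illustrative numbers)
post_prob &amp;lt;- c(0.60, 0.30, 0.10) # P(M_k | y) for three hypothetical models
beta_j &amp;lt;- c(1.10, 1.25, 0.00) # beta_j in each model (0 when j is excluded)
sum(beta_j * post_prob) # 0.66 + 0.375 + 0 = 1.035
&lt;/code>&lt;/pre>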
&lt;h2 id="7-toy-example-----bma-on-3-variables">7. Toy Example &amp;mdash; BMA on 3 Variables&lt;/h2>
&lt;p>Before running BMA on all 12 variables, let us work through a small example by hand. We pick just 3 variables: &lt;strong>log_gdp&lt;/strong> and &lt;strong>fossil_fuel&lt;/strong> (true predictors) and &lt;strong>log_trade&lt;/strong> (noise). With 3 variables, each can be either IN or OUT of the model, giving us $2^3 = 8$ possible models &amp;mdash; small enough to examine every single one.&lt;/p>
&lt;p>Here are all 8 models written out explicitly:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Model&lt;/th>
&lt;th style="text-align:left">Formula&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">$M_1$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ 1 (intercept only)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_2$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_3$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_4$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_5$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_6$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_7$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">$M_8$&lt;/td>
&lt;td style="text-align:left">log_co2 $\sim$ log_gdp + fossil_fuel + log_trade&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="71-step-1-----fit-every-model-and-compute-bic">7.1 Step 1 &amp;mdash; Fit every model and compute BIC&lt;/h3>
&lt;p>We fit each of the 8 models using OLS and compute its BIC score. Remember: &lt;strong>lower BIC = better&lt;/strong> (the model explains the data well without unnecessary complexity).&lt;/p>
&lt;pre>&lt;code class="language-r"># Select our 3 variables
toy_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, fossil_fuel, log_trade)
# Write out all 8 model formulas explicitly
model_formulas &amp;lt;- c(
&amp;quot;log_co2 ~ 1&amp;quot;, # M1: intercept only
&amp;quot;log_co2 ~ log_gdp&amp;quot;, # M2
&amp;quot;log_co2 ~ fossil_fuel&amp;quot;, # M3
&amp;quot;log_co2 ~ log_trade&amp;quot;, # M4
&amp;quot;log_co2 ~ log_gdp + fossil_fuel&amp;quot;, # M5
&amp;quot;log_co2 ~ log_gdp + log_trade&amp;quot;, # M6
&amp;quot;log_co2 ~ fossil_fuel + log_trade&amp;quot;, # M7
&amp;quot;log_co2 ~ log_gdp + fossil_fuel + log_trade&amp;quot; # M8
)
# Fit each model and extract its BIC
bic_values &amp;lt;- sapply(model_formulas, function(f) {
BIC(lm(as.formula(f), data = toy_data))
})
# Organize results in a table
toy_results &amp;lt;- tibble(
model = paste0(&amp;quot;M&amp;quot;, 1:8),
formula = model_formulas,
bic = round(bic_values, 1)
) |&amp;gt;
arrange(bic)
print(toy_results)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model formula bic
M5 log_co2 ~ log_gdp + fossil_fuel 114.1
M8 log_co2 ~ log_gdp + fossil_fuel + log_trade 118.5
M2 log_co2 ~ log_gdp 120.7
M6 log_co2 ~ log_gdp + log_trade 125.4
M3 log_co2 ~ fossil_fuel 514.4
M7 log_co2 ~ fossil_fuel + log_trade 519.0
M1 log_co2 ~ 1 528.3
M4 log_co2 ~ log_trade 533.0
&lt;/code>&lt;/pre>
&lt;p>The winner is $M_5$ (log_gdp + fossil_fuel) with BIC = 114.1 &amp;mdash; exactly the two true predictors, no noise. The runner-up $M_8$ adds log_trade but its BIC is worse (118.5), meaning the extra variable does not improve the fit enough to justify the added complexity. Models without GDP ($M_1$, $M_3$, $M_4$, $M_7$) have dramatically worse BIC scores, confirming GDP&amp;rsquo;s dominant role.&lt;/p>
&lt;h3 id="72-step-2-----convert-bic-to-posterior-probabilities">7.2 Step 2 &amp;mdash; Convert BIC to posterior probabilities&lt;/h3>
&lt;p>Now we turn each BIC into a posterior model probability. The formula is:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp(-0.5 \cdot \text{BIC}_k)}{\sum_{l=1}^{8} \exp(-0.5 \cdot \text{BIC}_l)}
$$&lt;/p>
&lt;p>Because the BIC values can be very large, we work with &lt;strong>differences from the best model&lt;/strong> to avoid numerical overflow. Subtracting the minimum BIC from all values does not change the probabilities:&lt;/p>
&lt;p>$$
P(M_k | y) = \frac{\exp\bigl(-0.5 \cdot (\text{BIC}_k - \text{BIC}_{\min})\bigr)}{\sum_{l=1}^{8} \exp\bigl(-0.5 \cdot (\text{BIC}_l - \text{BIC}_{\min})\bigr)}
$$&lt;/p>
&lt;p>Let us plug in the numbers. The best model ($M_5$) has BIC = 114.1, so $\Delta_5 = 0$. The runner-up ($M_8$) has $\Delta_8 = 118.5 - 114.1 = 4.4$:&lt;/p>
&lt;p>$$
w_5 = \exp(-0.5 \times 0) = 1.000, \quad w_8 = \exp(-0.5 \times 4.4) = 0.111
$$&lt;/p>
&lt;p>The remaining models have much larger $\Delta$ values, so their weights are essentially zero. After normalizing by the sum of all weights ($1.000 + 0.111 + 0.037 + \ldots \approx 1.151$):&lt;/p>
&lt;p>$$
P(M_5 | y) = \frac{1.000}{1.151} = 0.869, \quad P(M_8 | y) = \frac{0.111}{1.151} = 0.096
$$&lt;/p>
&lt;pre>&lt;code class="language-r"># Convert BIC to posterior probabilities using the delta-BIC trick
toy_results &amp;lt;- toy_results |&amp;gt;
mutate(
delta_bic = bic - min(bic), # difference from best
weight = exp(-0.5 * delta_bic), # unnormalized weight
post_prob = round(weight / sum(weight), 4) # normalize to sum to 1
)
toy_results |&amp;gt; select(model, bic, delta_bic, weight, post_prob)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> model bic delta_bic weight post_prob
M5 114.1 0.0 1.0000 0.8687
M8 118.5 4.4 0.1108 0.0962
M2 120.7 6.6 0.0369 0.0320
M6 125.4 11.3 0.0035 0.0031
M3 514.4 400.3 0.0000 0.0000
M7 519.0 404.9 0.0000 0.0000
M1 528.3 414.2 0.0000 0.0000
M4 533.0 418.9 0.0000 0.0000
&lt;/code>&lt;/pre>
&lt;p>One model dominates: $M_5$ captures 86.9% of the posterior probability &amp;mdash; exactly the two true predictors. The runner-up $M_8$ (adding log_trade) gets only 9.6%, and $M_2$ (GDP alone) gets 3.2%. The remaining 5 models share less than 0.4% of the total weight. BMA&amp;rsquo;s Occam&amp;rsquo;s razor is at work: adding log_trade to the model ($M_8$) does not improve the fit enough to overcome the complexity penalty, so the simpler model ($M_5$) wins decisively.&lt;/p>
&lt;h3 id="73-step-3-----compute-posterior-inclusion-probabilities">7.3 Step 3 &amp;mdash; Compute Posterior Inclusion Probabilities&lt;/h3>
&lt;p>Finally, we compute the PIP of each variable by summing the posterior probabilities of all models that include it. For example, log_trade appears in models $M_4$, $M_6$, $M_7$, and $M_8$, so:&lt;/p>
&lt;p>$$
\text{PIP}_{\text{log\_trade}} = P(M_4 | y) + P(M_6 | y) + P(M_7 | y) + P(M_8 | y) = 0.000 + 0.003 + 0.000 + 0.096 = 0.099
$$&lt;/p>
&lt;p>That is well below the 0.50 threshold &amp;mdash; fragile evidence, exactly what we expect for a noise variable.&lt;/p>
&lt;pre>&lt;code class="language-r"># Compute PIPs: for each variable, sum P(M|y) across models that include it
pip_toy &amp;lt;- tibble(
variable = c(&amp;quot;log_gdp&amp;quot;, &amp;quot;fossil_fuel&amp;quot;, &amp;quot;log_trade&amp;quot;),
true_effect = c(&amp;quot;True&amp;quot;, &amp;quot;True&amp;quot;, &amp;quot;Noise&amp;quot;),
pip = c(
# log_gdp appears in M2, M5, M6, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M2&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M8&amp;quot;)]),
# fossil_fuel appears in M3, M5, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M3&amp;quot;,&amp;quot;M5&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)]),
# log_trade appears in M4, M6, M7, M8
sum(toy_results$post_prob[toy_results$model %in% c(&amp;quot;M4&amp;quot;,&amp;quot;M6&amp;quot;,&amp;quot;M7&amp;quot;,&amp;quot;M8&amp;quot;)])
)
)
print(pip_toy)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_effect pip
log_gdp True 1.000
fossil_fuel True 0.965
log_trade Noise 0.099
&lt;/code>&lt;/pre>
&lt;p>Even with this simple 3-variable example, BMA correctly identifies the two true predictors. GDP has a PIP of 1.000 (decisive evidence) and fossil_fuel has a PIP of 0.965 (robust) &amp;mdash; they appear in every high-probability model. Log_trade has a PIP of only 0.099 (fragile) &amp;mdash; well below the 0.50 threshold. BMA&amp;rsquo;s built-in Occam&amp;rsquo;s razor penalizes models that include noise variables without substantially improving the fit.&lt;/p>
&lt;h2 id="8-bma-on-all-12-variables">8. BMA on All 12 Variables&lt;/h2>
&lt;h3 id="81-running-bma">8.1 Running BMA&lt;/h3>
&lt;p>Now we apply BMA to the full dataset with all 12 candidate regressors using the &lt;code>BMS&lt;/code> package. With only $2^{12} = 4,096$ candidate models, the model space is small enough for the MCMC sampler to explore thoroughly.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for MCMC sampling
# Prepare the data matrix: DV in first column, regressors follow
bma_data &amp;lt;- synth_data |&amp;gt;
select(log_co2, log_gdp, industry, fossil_fuel, urban_pop,
democracy, trade_network, agriculture,
log_trade, fdi, corruption, log_tourism, log_credit) |&amp;gt;
as.data.frame()
# Run BMA
bma_fit &amp;lt;- bms(
X.data = bma_data, # data with DV in column 1
burn = 50000, # burn-in iterations
iter = 200000, # post-burn-in iterations
g = &amp;quot;BRIC&amp;quot;, # BRIC g-prior (robust default)
mprior = &amp;quot;uniform&amp;quot;, # uniform model prior
nmodel = 2000, # store top 2000 models
mcmc = &amp;quot;bd&amp;quot;, # birth-death MCMC sampler
user.int = FALSE # suppress interactive output
)
&lt;/code>&lt;/pre>
&lt;p>The key parameters deserve explanation:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>burn = 50,000&lt;/strong>: the first 50,000 MCMC draws are discarded as &amp;ldquo;burn-in&amp;rdquo; to ensure the sampler has converged to the posterior distribution&lt;/li>
&lt;li>&lt;strong>iter = 200,000&lt;/strong>: the next 200,000 draws are used for inference&lt;/li>
&lt;li>&lt;strong>g = &amp;ldquo;BRIC&amp;rdquo;&lt;/strong>: the Benchmark Risk Inflation Criterion prior on the regression coefficients, a robust default choice&lt;/li>
&lt;li>&lt;strong>mprior = &amp;ldquo;uniform&amp;rdquo;&lt;/strong>: every model is equally likely a priori, so the posterior is driven entirely by the data&lt;/li>
&lt;/ul>
&lt;h3 id="82-pip-bar-chart">8.2 PIP bar chart&lt;/h3>
&lt;p>The PIP bar chart classifies each variable as robust (PIP $\geq$ 0.80), borderline (0.50&amp;ndash;0.80), or fragile (PIP $&amp;lt;$ 0.50). This visualization makes it easy to see which variables earn strong support across the model space and which are effectively irrelevant.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract PIPs and posterior means
bma_coefs &amp;lt;- coef(bma_fit)
bma_df &amp;lt;- as.data.frame(bma_coefs) |&amp;gt;
rownames_to_column(&amp;quot;variable&amp;quot;) |&amp;gt;
as_tibble() |&amp;gt;
rename(pip = PIP, post_mean = `Post Mean`, post_sd = `Post SD`) |&amp;gt;
select(variable, pip, post_mean, post_sd) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
robustness = case_when(
pip &amp;gt;= 0.80 ~ &amp;quot;Robust (PIP &amp;gt;= 0.80)&amp;quot;,
pip &amp;gt;= 0.50 ~ &amp;quot;Borderline&amp;quot;,
TRUE ~ &amp;quot;Fragile (PIP &amp;lt; 0.50)&amp;quot;
),
ci_low = post_mean - 2 * post_sd,
ci_high = post_mean + 2 * post_sd
)
# Plot PIPs
ggplot(bma_df, aes(x = reorder(variable, pip), y = pip, fill = robustness)) +
geom_col(width = 0.65) +
geom_hline(yintercept = 0.80, linetype = &amp;quot;dashed&amp;quot;) +
coord_flip() +
labs(x = NULL, y = &amp;quot;Posterior Inclusion Probability (PIP)&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_04_bma_pip.png" alt="BMA Posterior Inclusion Probabilities. Green bars indicate robust variables with PIP greater than or equal to 0.80; teal bars indicate borderline variables; orange bars indicate fragile variables with PIP less than 0.50.">&lt;/p>
&lt;p>The PIP bar chart reveals a clear separation between signal and noise. GDP dominates with a PIP of 1.00, followed by trade_network (0.986), fossil_fuel (0.948), and industry (0.841) &amp;mdash; all with PIPs above the 0.80 robustness threshold. The noise variables (log_trade, fdi, corruption, log_tourism, log_credit) all have PIPs well below 0.15, confirming that BMA correctly classifies them as fragile. Urban_pop ($\beta = 0.010$, PIP = 0.648) and democracy ($\beta = 0.004$, PIP = 0.607) land in the borderline range &amp;mdash; true predictors whose effects are moderate enough that BMA hedges between including and excluding them. Agriculture ($\beta = 0.005$, PIP = 0.087) is classified as fragile, an honest reflection of the sample&amp;rsquo;s limited power to detect its very small effect.&lt;/p>
&lt;h3 id="83-posterior-coefficient-plot">8.3 Posterior coefficient plot&lt;/h3>
&lt;p>Beyond knowing &lt;em>which&lt;/em> variables matter, we want to know &lt;em>how much&lt;/em> they matter and how precisely they are estimated. The posterior coefficient plot displays the BMA-estimated effect size for each variable along with approximate 95% credible intervals (posterior mean $\pm$ 2 posterior standard deviations).&lt;/p>
&lt;pre>&lt;code class="language-r"># Coefficient plot with 95% credible intervals
ggplot(bma_df, aes(x = reorder(variable, pip), y = post_mean, color = robustness)) +
geom_pointrange(aes(ymin = ci_low, ymax = ci_high)) +
geom_hline(yintercept = 0, linetype = &amp;quot;solid&amp;quot;, color = &amp;quot;gray50&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_05_bma_coefs.png" alt="BMA posterior mean coefficients with approximate 95 percent credible intervals. Variables ordered by PIP. Robust variables have intervals that do not cross zero.">&lt;/p>
&lt;p>The posterior coefficient plot shows the BMA-estimated effect sizes with uncertainty bands. GDP&amp;rsquo;s posterior mean of approximately 1.19 closely recovers the true value of 1.200, and its 95% credible interval is narrow, reflecting high precision. Trade_network has a posterior mean of 0.87, overshooting its true value of 0.500 &amp;mdash; but its wide credible interval honestly reflects substantial estimation uncertainty. The noise variables and low-PIP variables like agriculture have posterior means shrunk very close to zero &amp;mdash; this is BMA&amp;rsquo;s shrinkage at work. Variables with low PIPs appear in few high-probability models, so their posterior means are averaged with many models where the coefficient is zero, pulling the estimate toward zero.&lt;/p>
&lt;h3 id="84-variable-inclusion-map">8.4 Variable-inclusion map&lt;/h3>
&lt;p>The variable-inclusion map shows &lt;em>which&lt;/em> variables appear in the highest-probability models and whether their coefficients are positive or negative. Unlike a simple heatmap, the &lt;strong>width of each column is proportional to the model&amp;rsquo;s posterior probability&lt;/strong> &amp;mdash; so wide columns represent models that the data strongly supports. The x-axis shows cumulative posterior model probability: if the first model has PMP = 0.15, it occupies the region from 0 to 0.15; the second model fills from 0.15 to 0.15 + its PMP, and so on. A solid band of color stretching across most of the x-axis means the variable appears in virtually every high-probability model.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract top 100 models and their coefficient estimates
top_coefs &amp;lt;- topmodels.bma(bma_fit)
n_top &amp;lt;- min(100, ncol(top_coefs))
top_coefs &amp;lt;- top_coefs[, 1:n_top]
# Extract posterior model probabilities (MCMC-based)
model_pmps &amp;lt;- pmp.bma(bma_fit)[1:n_top, 1]
# Cumulative x positions: each model's width = its PMP
cum_pmp &amp;lt;- c(0, cumsum(model_pmps))
# Order variables by PIP (highest at top)
var_order &amp;lt;- bma_df |&amp;gt; arrange(desc(pip)) |&amp;gt; pull(variable)
# Build rectangle data for every variable × model combination
rect_data &amp;lt;- expand.grid(
var_idx = seq_len(nrow(top_coefs)),
model_idx = seq_len(n_top)
) |&amp;gt;
mutate(
variable = rownames(top_coefs)[var_idx],
coef_value = mapply(function(v, m) top_coefs[v, m], var_idx, model_idx),
sign = case_when(
coef_value &amp;gt; 0 ~ &amp;quot;Positive&amp;quot;,
coef_value &amp;lt; 0 ~ &amp;quot;Negative&amp;quot;,
TRUE ~ &amp;quot;Not included&amp;quot;
),
xmin = cum_pmp[model_idx],
xmax = cum_pmp[model_idx + 1],
variable = factor(variable, levels = rev(var_order))
)
# Plot the variable-inclusion map
ggplot(rect_data, aes(xmin = xmin, xmax = xmax,
ymin = as.numeric(variable) - 0.45,
ymax = as.numeric(variable) + 0.45,
fill = sign)) +
geom_rect() +
scale_fill_manual(
name = &amp;quot;Coefficient&amp;quot;,
values = c(&amp;quot;Positive&amp;quot; = &amp;quot;#6a9bcc&amp;quot;,
&amp;quot;Negative&amp;quot; = &amp;quot;#d97757&amp;quot;,
&amp;quot;Not included&amp;quot; = &amp;quot;#d0cdc8&amp;quot;)
) +
scale_x_continuous(expand = c(0, 0),
labels = scales::label_number(accuracy = 0.1)) +
scale_y_continuous(breaks = seq_along(var_order),
labels = rev(var_order),
expand = c(0, 0)) +
labs(title = &amp;quot;Variable-Inclusion Map&amp;quot;,
subtitle = paste0(&amp;quot;Top &amp;quot;, n_top, &amp;quot; models shown out of &amp;quot;,
nrow(pmp.bma(bma_fit)), &amp;quot; visited&amp;quot;),
x = &amp;quot;Cumulative posterior model probability&amp;quot;,
y = NULL)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_06_bma_inclusion.png" alt="Variable-inclusion map showing the top 100 BMA models. The x-axis is cumulative posterior model probability, so wider columns represent more probable models. Blue indicates a positive coefficient, orange indicates a negative coefficient, and gray indicates the variable is not included. Variables are ordered by PIP from top to bottom.">&lt;/p>
&lt;p>The variable-inclusion map reveals clear structure. The top variables &amp;mdash; log_gdp, trade_network, fossil_fuel, and industry &amp;mdash; form solid blue bands stretching across nearly the entire x-axis, meaning they appear with positive coefficients in virtually every high-probability model. Urban_pop and democracy also show substantial inclusion, consistent with their borderline PIPs. In contrast, the noise variables (log_trade, fdi, corruption, log_tourism, log_credit) appear as mostly gray with occasional patches of blue or orange, indicating they enter and exit models sporadically and sometimes with the wrong sign. The fact that noise variables occasionally appear with negative coefficients (orange patches) is another sign of fragility &amp;mdash; their coefficient estimates are unstable because they have no true effect.&lt;/p>
&lt;h3 id="85-bma-results-vs-known-truth">8.5 BMA results vs. known truth&lt;/h3>
&lt;pre>&lt;code class="language-r"># Compare BMA results with the true DGP
bma_summary &amp;lt;- bma_df |&amp;gt;
mutate(
bma_robust = pip &amp;gt;= 0.80,
true_nonzero = true_beta != 0,
correct = bma_robust == true_nonzero
) |&amp;gt;
select(variable, true_beta, pip, post_mean, bma_robust, true_nonzero, correct)
print(bma_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta pip post_mean bma_robust true_nonzero correct
log_gdp 1.200 1.000 1.1854 TRUE TRUE TRUE
trade_network 0.500 0.986 0.8727 TRUE TRUE TRUE
fossil_fuel 0.012 0.948 0.0117 TRUE TRUE TRUE
industry 0.008 0.841 0.0142 TRUE TRUE TRUE
urban_pop 0.010 0.648 0.0049 FALSE TRUE FALSE
democracy 0.004 0.607 0.0066 FALSE TRUE FALSE
log_tourism 0.000 0.130 -0.0039 FALSE FALSE TRUE
log_credit 0.000 0.104 0.0051 FALSE FALSE TRUE
agriculture 0.005 0.087 -0.0002 FALSE TRUE FALSE
log_trade 0.000 0.084 -0.0037 FALSE FALSE TRUE
corruption 0.000 0.078 0.0026 FALSE FALSE TRUE
fdi 0.000 0.077 -0.0000 FALSE FALSE TRUE
&lt;/code>&lt;/pre>
&lt;p>BMA correctly classifies 9 of 12 variables. The four strongest true predictors (GDP, trade_network, fossil_fuel, industry) all receive PIPs above 0.80 &amp;mdash; these are the &amp;ldquo;robust&amp;rdquo; determinants. All five noise variables receive PIPs below 0.15 &amp;mdash; correctly identified as fragile. Urban_pop (PIP = 0.648) and democracy (PIP = 0.607) fall in the borderline range &amp;mdash; they are true predictors, but BMA&amp;rsquo;s conservative Occam&amp;rsquo;s razor hedges because their effects are moderate. Agriculture ($\beta = 0.005$, PIP = 0.087) is missed entirely. This reveals an important nuance: BMA prioritizes precision over sensitivity. It would rather miss a small true effect than falsely include a noise variable.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> BMA on all 12 variables correctly gives high PIPs to the strong true predictors (GDP, trade network, fossil fuel, industry) and low PIPs to the noise variables. Variables with moderate or small true effects may land in the borderline zone. The variable-inclusion map shows that the top models consistently include the core predictors.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #d97757 0%, #d97757 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 2: LASSO
&lt;/div>
&lt;h2 id="9-regularization-----adding-a-penalty">9. Regularization &amp;mdash; Adding a Penalty&lt;/h2>
&lt;h3 id="91-the-bias-variance-tradeoff">9.1 The bias-variance tradeoff&lt;/h3>
&lt;p>OLS is an &lt;strong>unbiased&lt;/strong> estimator &amp;mdash; on average, it gets the coefficients right. But with many correlated regressors, OLS coefficients have &lt;strong>high variance&lt;/strong>: they bounce around from sample to sample. Adding or removing a single variable can drastically change the estimates.&lt;/p>
&lt;p>The key insight of regularization is that a &lt;strong>little bias can buy a lot of variance reduction&lt;/strong>, lowering the overall prediction error. The &lt;strong>total error&lt;/strong> of a prediction decomposes as:&lt;/p>
&lt;p>$$
\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}
$$&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_02_bias_variance.png" alt="The bias-variance tradeoff. As model complexity increases (more variables, less regularization), bias decreases but variance increases. The optimal point is a compromise between the two, minimizing total MSE.">&lt;/p>
&lt;p>The figure illustrates the fundamental tradeoff. At low complexity (strong regularization), bias is high but variance is low. At high complexity (weak or no regularization, like OLS), bias is near zero but variance explodes. The optimal point lies in between &amp;mdash; this is exactly where regularized methods like LASSO operate. Think of the penalty as a &amp;ldquo;budget constraint&amp;rdquo; on coefficient sizes: variables that do not contribute enough to prediction are not worth the cost, so their coefficients are set to zero.&lt;/p>
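&lt;p>A toy simulation makes the tradeoff tangible. The sketch below uses a simple mean-estimation setting (not the regression model of this tutorial) with made-up values: an unbiased estimator versus one deliberately shrunk toward zero.&lt;/p>
&lt;pre>&lt;code class="language-r"># A little bias can buy a lot of variance reduction (toy setting)
set.seed(2021)
mu &amp;lt;- 0.5; n &amp;lt;- 4 # small signal, very noisy samples
draws &amp;lt;- replicate(20000, mean(rnorm(n, mean = mu)))
mse_unbiased &amp;lt;- mean((draws - mu)^2) # variance only, about 1/n = 0.25
mse_shrunk &amp;lt;- mean((0.5 * draws - mu)^2) # biased toward 0, lower variance
c(unbiased = mse_unbiased, shrunk = mse_shrunk) # shrunk MSE is roughly half
&lt;/code>&lt;/pre>
&lt;p>The shrunk estimator accepts a squared bias of about 0.06 in exchange for cutting the variance by three quarters, so its total MSE is lower.&lt;/p>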
&lt;h2 id="10-l1-vs-l2-geometry">10. L1 vs. L2 Geometry&lt;/h2>
&lt;h3 id="101-the-lasso-l1-penalty">10.1 The LASSO (L1) penalty&lt;/h3>
&lt;p>The LASSO solves the following optimization problem:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{LASSO}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$\frac{1}{2n}\|y - X\beta\|^2$ is the &lt;strong>sum of squared residuals&lt;/strong> (the usual OLS loss, scaled)&lt;/li>
&lt;li>$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the &lt;strong>L1 norm&lt;/strong> (sum of absolute values)&lt;/li>
&lt;li>$\lambda \geq 0$ is the &lt;strong>regularization parameter&lt;/strong>: it controls how much we penalize large coefficients. When $\lambda = 0$, LASSO reduces to OLS. As $\lambda \to \infty$, all coefficients are shrunk to zero.&lt;/li>
&lt;/ul>
&lt;h3 id="102-the-ridge-l2-penalty">10.2 The Ridge (L2) penalty&lt;/h3>
&lt;p>For comparison, &lt;strong>Ridge regression&lt;/strong> uses the L2 norm instead:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_2^2
$$&lt;/p>
&lt;p>where $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the sum of squared coefficients.&lt;/p>
&lt;h3 id="103-why-lasso-selects-variables-and-ridge-does-not">10.3 Why LASSO selects variables and Ridge does not&lt;/h3>
&lt;p>The geometric explanation is one of the most elegant ideas in modern statistics. The constraint region for LASSO (L1) is a &lt;strong>diamond&lt;/strong>, while the constraint region for Ridge (L2) is a &lt;strong>circle&lt;/strong>. When the elliptical OLS contours meet the diamond, they typically hit a &lt;strong>corner&lt;/strong>, where one or more coefficients are exactly zero. When they meet the circle, they hit a smooth curve &amp;mdash; coefficients are shrunk but never exactly zero.&lt;/p>
&lt;p>&lt;img src="bma_lasso_wals_03_l1_l2_geometry.png" alt="Side-by-side comparison of L1 and L2 constraint geometry. Left panel shows the LASSO diamond where OLS contours hit a corner, setting beta-1 to exactly zero. Right panel shows the Ridge circle where contours hit a smooth boundary, producing no exact zeros.">&lt;/p>
&lt;p>The key insight: &lt;strong>the L1 diamond has corners where coefficients are exactly zero &amp;mdash; this is why LASSO selects variables.&lt;/strong> The L2 circle has no corners, so Ridge shrinks coefficients toward zero but never reaches it. LASSO performs &lt;em>simultaneous estimation and variable selection&lt;/em>; Ridge only estimates.&lt;/p>
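&lt;p>The same contrast can be seen algebraically. In the simplest case (a single coefficient, which we assume here for illustration), the LASSO solution is &lt;em>soft-thresholding&lt;/em> of the OLS estimate, while the Ridge solution is proportional shrinkage. The sketch below applies both to hypothetical OLS coefficients:&lt;/p>
&lt;pre>&lt;code class="language-r"># One-dimensional solutions (assumed single-coefficient case)
soft_threshold &amp;lt;- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
ridge_shrink &amp;lt;- function(b, lambda) b / (1 + 2 * lambda) # argmin of 0.5*(b - beta)^2 + lambda*beta^2
b &amp;lt;- c(-2, -0.3, 0.1, 1.5) # hypothetical OLS coefficients
soft_threshold(b, lambda = 0.5) # -1.5 0.00 0.00 1.00 (exact zeros)
ridge_shrink(b, lambda = 0.5) # -1.00 -0.15 0.05 0.75 (never exactly zero)
&lt;/code>&lt;/pre>
&lt;p>Small coefficients fall inside the threshold and are set exactly to zero by the LASSO; Ridge merely scales every coefficient down.&lt;/p>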
&lt;h2 id="11-lasso-on-all-12-variables">11. LASSO on All 12 Variables&lt;/h2>
&lt;h3 id="111-running-lasso-with-cross-validation">11.1 Running LASSO with cross-validation&lt;/h3>
&lt;p>The LASSO has one tuning parameter: $\lambda$, which controls the strength of the penalty. Too small and we include noise; too large and we exclude true predictors. We choose $\lambda$ using &lt;strong>10-fold cross-validation&lt;/strong>: split the data into 10 folds, train on 9, predict the 10th, and repeat. The $\lambda$ that minimizes the average prediction error across folds is called &lt;strong>lambda.min&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r">set.seed(2021) # reproducibility for cross-validation folds
# Prepare the design matrix X and response vector y
X &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y &amp;lt;- synth_data$log_co2
# Run LASSO (alpha = 1) with 10-fold cross-validation
lasso_cv &amp;lt;- cv.glmnet(
x = X,
y = y,
alpha = 1, # alpha=1 is LASSO (alpha=0 is Ridge)
nfolds = 10,
standardize = TRUE # standardize predictors internally
)
&lt;/code>&lt;/pre>
&lt;h3 id="112-regularization-path">11.2 Regularization path&lt;/h3>
&lt;pre>&lt;code class="language-r"># Fit the full LASSO path
lasso_full &amp;lt;- glmnet(X, y, alpha = 1, standardize = TRUE)
# Plot coefficient paths
ggplot(path_df, aes(x = log_lambda, y = coefficient, color = variable)) +
geom_line() +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.1se), linetype = &amp;quot;dotted&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_07_lasso_path.png" alt="LASSO regularization path showing how each variable&amp;rsquo;s coefficient changes as the penalty lambda increases from left to right. Steel blue lines represent true predictors, orange lines represent noise variables. GDP (the strongest predictor) is the last to be shrunk to zero.">&lt;/p>
&lt;p>The regularization path reveals the story of LASSO variable selection. Reading from left to right (increasing penalty), the noise variables (orange lines) are the first to be driven to zero &amp;mdash; they provide too little predictive value to justify their &amp;ldquo;cost&amp;rdquo; under the penalty. GDP (the strongest predictor with $\beta = 1.200$) persists the longest, requiring the largest penalty to be eliminated. The vertical lines mark lambda.min (minimum CV error) and lambda.1se (most parsimonious model within 1 SE of the minimum). The gap between them represents the tension between fitting the data well and keeping the model simple.&lt;/p>
&lt;h3 id="113-cross-validation-curve">11.3 Cross-validation curve&lt;/h3>
&lt;pre>&lt;code class="language-r"># Plot the CV curve
ggplot(cv_df, aes(x = log_lambda, y = mse)) +
geom_ribbon(aes(ymin = mse_lo, ymax = mse_hi), fill = &amp;quot;gray85&amp;quot;, alpha = 0.5) +
geom_line(color = &amp;quot;#6a9bcc&amp;quot;) +
geom_vline(xintercept = log(lasso_cv$lambda.min), linetype = &amp;quot;dashed&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_08_lasso_cv.png" alt="Ten-fold cross-validation curve for LASSO. The left dashed line marks lambda.min (minimum CV error); the right dotted line marks lambda.1se (most parsimonious model within 1 standard error of the minimum). The shaded band shows plus or minus 1 standard error.">&lt;/p>
&lt;p>The cross-validation curve shows how prediction error varies with the penalty strength. The curve has a characteristic U-shape: too little penalty (left) allows overfitting (high error from variance), while too much penalty (right) underfits (high error from bias). The &amp;ldquo;1 standard error rule&amp;rdquo; is a common default: since CV error estimates are noisy, any model within 1 SE of the best is statistically indistinguishable from the best. We prefer the simpler one (lambda.1se).&lt;/p>
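&lt;p>To make the 1-SE rule concrete, here is a minimal sketch that applies it to made-up CV results (in practice &lt;code>cv.glmnet&lt;/code> computes &lt;code>lambda.1se&lt;/code> for you):&lt;/p>
&lt;pre>&lt;code class="language-r"># The 1-SE rule on made-up cross-validation results
cv &amp;lt;- data.frame(
lambda = c(0.50, 0.20, 0.10, 0.05, 0.02),
mse = c(0.90, 0.60, 0.50, 0.48, 0.52),
se = c(0.05, 0.05, 0.05, 0.05, 0.05)
)
best &amp;lt;- which.min(cv$mse) # lambda.min = 0.05
cutoff &amp;lt;- cv$mse[best] + cv$se[best] # 0.48 + 0.05 = 0.53
lambda_1se &amp;lt;- max(cv$lambda[cv$mse &amp;lt;= cutoff]) # largest lambda within 1 SE
lambda_1se # 0.10: more regularized (simpler) than lambda.min
&lt;/code>&lt;/pre>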
&lt;h3 id="114-selected-variables">11.4 Selected variables&lt;/h3>
&lt;pre>&lt;code class="language-r"># Extract LASSO coefficients at lambda.1se
lasso_coefs_1se &amp;lt;- coef(lasso_cv, s = &amp;quot;lambda.1se&amp;quot;)
lasso_df &amp;lt;- tibble(
variable = rownames(lasso_coefs_1se)[-1],
lasso_coef = as.numeric(lasso_coefs_1se)[-1]
) |&amp;gt;
mutate(
selected = lasso_coef != 0,
true_beta = true_beta_lookup[variable],
is_noise = true_beta == 0,
bar_color = case_when(
!selected ~ &amp;quot;Not selected&amp;quot;,
is_noise ~ &amp;quot;Noise (false positive)&amp;quot;,
TRUE ~ &amp;quot;True predictor (correct)&amp;quot;
)
)
# Plot selected variables
ggplot(lasso_df, aes(x = reorder(variable, abs(lasso_coef)), y = lasso_coef, fill = bar_color)) +
geom_col(width = 0.6) + coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_09_lasso_selected.png" alt="LASSO-selected variables at lambda.1se. Steel blue bars indicate true predictors correctly retained; orange bars indicate noise variables falsely included (if any). Gray bars show variables not selected.">&lt;/p>
&lt;p>At lambda.1se, LASSO selects a sparse subset of the 12 candidate variables. The selected variables are shown with colored bars: steel blue for true predictors correctly retained, orange for any noise variables falsely included. Variables with zero coefficients (gray) have been excluded by the LASSO penalty. The key question is: did LASSO keep the right variables and drop the right ones?&lt;/p>
&lt;h2 id="12-post-lasso">12. Post-LASSO&lt;/h2>
&lt;p>LASSO coefficients are &lt;strong>biased&lt;/strong> because the L1 penalty shrinks them toward zero. The selected variables are correct (we hope), but the coefficient values are too small. This is by design &amp;mdash; the penalty trades bias for variance reduction &amp;mdash; but for &lt;em>interpretation&lt;/em> we want unbiased estimates.&lt;/p>
&lt;p>The fix is simple: &lt;strong>Post-LASSO&lt;/strong> (Belloni and Chernozhukov, 2013). Run OLS using only the variables that LASSO selected. The LASSO does the selection; OLS does the estimation.&lt;/p>
&lt;pre>&lt;code class="language-r"># Identify which variables LASSO selected at lambda.1se
selected_vars &amp;lt;- lasso_df |&amp;gt; filter(selected) |&amp;gt; pull(variable)
# Build the Post-LASSO formula
post_lasso_formula &amp;lt;- as.formula(
paste(&amp;quot;log_co2 ~&amp;quot;, paste(selected_vars, collapse = &amp;quot; + &amp;quot;))
)
# Run OLS on the selected variables only
post_lasso_fit &amp;lt;- lm(post_lasso_formula, data = synth_data)
# Compare: LASSO vs Post-LASSO vs True coefficients
post_lasso_summary &amp;lt;- broom::tidy(post_lasso_fit) |&amp;gt;
filter(term != &amp;quot;(Intercept)&amp;quot;) |&amp;gt;
rename(variable = term, post_lasso_coef = estimate) |&amp;gt;
select(variable, post_lasso_coef) |&amp;gt;
left_join(lasso_df |&amp;gt; select(variable, lasso_coef, true_beta), by = &amp;quot;variable&amp;quot;)
print(post_lasso_summary)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable lasso_coef post_lasso_coef true_beta
log_gdp 1.1899 1.1646 1.200
industry 0.0090 0.0176 0.008
fossil_fuel 0.0072 0.0118 0.012
urban_pop 0.0041 0.0078 0.010
democracy 0.0046 0.0113 0.004
trade_network 0.6309 0.8978 0.500
&lt;/code>&lt;/pre>
&lt;p>Notice how the Post-LASSO coefficients are closer to the true values than the raw LASSO coefficients. For example, fossil_fuel&amp;rsquo;s LASSO coefficient is 0.007 (shrunk from the true 0.012), but the Post-LASSO estimate is 0.012 &amp;mdash; recovering the truth almost exactly. Similarly, urban_pop recovers from 0.004 (LASSO) to 0.008 (Post-LASSO), closer to the true value of 0.010. Trade_network&amp;rsquo;s Post-LASSO estimate (0.898) overshoots the true value (0.500), reflecting the difficulty of precisely estimating a coefficient on a low-variance variable. In short, LASSO selected the right variables, and Post-LASSO moved the coefficient magnitudes substantially closer to the truth.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> LASSO coefficients are shrunk toward zero by design. Post-LASSO runs OLS on only the LASSO-selected variables, producing unbiased coefficient estimates while retaining the variable selection from LASSO.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #00d4c8 0%, #00d4c8 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #141413; font-size: 1.3em; font-weight: 600;">
PART 3: Weighted Average Least Squares (WALS)
&lt;/div>
&lt;h2 id="13-frequentist-model-averaging">13. Frequentist Model Averaging&lt;/h2>
&lt;p>WALS (Weighted Average Least Squares) is a &lt;strong>frequentist&lt;/strong> approach to model averaging. Like BMA, it averages over models instead of selecting just one. But unlike BMA, it does not require MCMC sampling or the specification of a full Bayesian prior.&lt;/p>
&lt;p>The key structural assumption is that regressors are split into two groups:&lt;/p>
&lt;p>$$
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon
$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$X_1$ are &lt;strong>focus regressors&lt;/strong>: variables you are certain belong in the model. In a cross-sectional setting, this is typically just the &lt;strong>intercept&lt;/strong>.&lt;/li>
&lt;li>$X_2$ are &lt;strong>auxiliary regressors&lt;/strong>: the 12 candidate variables whose inclusion is uncertain.&lt;/li>
&lt;li>$\beta_1$ are always estimated; $\beta_2$ are the coefficients we are uncertain about.&lt;/li>
&lt;/ul>
&lt;p>WALS was introduced by Magnus, Powell, and Prufer (2010) and offers a compelling advantage over BMA: &lt;strong>it is extremely fast&lt;/strong>. While BMA explores thousands or millions of models via MCMC, WALS uses a mathematical trick to reduce the problem to $K$ independent averaging problems &amp;mdash; one per auxiliary variable.&lt;/p>
&lt;h2 id="14-the-semi-orthogonal-transformation">14. The Semi-Orthogonal Transformation&lt;/h2>
&lt;h3 id="why-correlated-variables-make-averaging-hard">Why correlated variables make averaging hard&lt;/h3>
&lt;p>In our synthetic data, GDP is correlated with fossil fuel use, urbanization, and even with the noise variables. This means that the decision to include one variable affects the importance of another. If GDP is in the model, fossil fuel&amp;rsquo;s coefficient is partially &amp;ldquo;absorbed&amp;rdquo; by GDP.&lt;/p>
&lt;p>In BMA, this problem is handled by averaging over all model combinations &amp;mdash; but at a high computational cost ($2^{12} = 4,096$ models). WALS uses a different strategy: &lt;strong>transform the auxiliary variables so they become orthogonal&lt;/strong> (uncorrelated with each other). Once orthogonal, each variable can be averaged independently.&lt;/p>
&lt;h3 id="the-mathematical-trick">The mathematical trick&lt;/h3>
&lt;p>The semi-orthogonal transformation works as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Remove the influence of focus regressors&lt;/strong>: project out $X_1$ from both $y$ and $X_2$, obtaining residuals $\tilde{y}$ and $\tilde{X}_2$.&lt;/li>
&lt;li>&lt;strong>Orthogonalize the auxiliaries&lt;/strong>: apply a rotation matrix $P$ (from the eigendecomposition of $\tilde{X}_2'\tilde{X}_2$) to create $Z = \tilde{X}_2 P$, where $Z'Z$ is diagonal.&lt;/li>
&lt;li>&lt;strong>Average independently&lt;/strong>: because the columns of $Z$ are orthogonal, the model-averaging problem decomposes into $K$ independent problems. Each transformed variable is averaged separately.&lt;/li>
&lt;/ol>
&lt;p>The computational savings grow dramatically: with 12 variables, we solve &lt;strong>12 independent problems&lt;/strong> instead of enumerating 4,096 models. Think of it as untangling a web of correlated strings until each hangs independently &amp;mdash; once separated, you can measure each string&amp;rsquo;s pull without interference from the others.&lt;/p>
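&lt;p>The three steps can be sketched in base R on toy data. This is only an illustration of why $Z'Z$ becomes diagonal; the &lt;code>WALS&lt;/code> package performs the transformation internally:&lt;/p>
&lt;pre>&lt;code class="language-r"># Semi-orthogonal transformation on toy data (base R sketch)
set.seed(1)
n &amp;lt;- 100
X1 &amp;lt;- matrix(1, n, 1) # focus regressors: intercept only
X2 &amp;lt;- matrix(rnorm(n * 3), n, 3) # three auxiliary regressors
X2[, 2] &amp;lt;- X2[, 1] + 0.5 * X2[, 2] # make them correlated
# Step 1: project out the focus regressors
M1 &amp;lt;- diag(n) - X1 %*% solve(crossprod(X1)) %*% t(X1)
X2t &amp;lt;- M1 %*% X2
# Step 2: rotate with the eigenvectors of X2t'X2t
P &amp;lt;- eigen(crossprod(X2t), symmetric = TRUE)$vectors
Z &amp;lt;- X2t %*% P
# Step 3: Z'Z is diagonal, so each column can be averaged independently
ZtZ &amp;lt;- crossprod(Z)
max(abs(ZtZ[upper.tri(ZtZ)])) # numerically zero
&lt;/code>&lt;/pre>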
&lt;h2 id="15-the-laplace-prior">15. The Laplace Prior&lt;/h2>
&lt;p>WALS requires a prior distribution for the transformed coefficients. The default and recommended choice is the &lt;strong>Laplace (double-exponential) prior&lt;/strong>:&lt;/p>
&lt;p>$$
p(\gamma_j) \propto \exp(-|\gamma_j| / \tau)
$$&lt;/p>
&lt;p>where $\gamma_j$ is the transformed coefficient and $\tau$ controls the spread. The Laplace prior has two key features:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Peaked at zero&lt;/strong>: it encodes &lt;em>skepticism&lt;/em> &amp;mdash; the prior believes most variables probably have small effects&lt;/li>
&lt;li>&lt;strong>Heavy tails&lt;/strong>: it allows large effects if the data strongly supports them &amp;mdash; variables with strong signal can &amp;ldquo;break through&amp;rdquo; the prior&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="bma_lasso_wals_11_priors.png" alt="Three prior distributions used in model averaging. The Laplace prior (used by WALS) is peaked at zero with heavy tails. The Normal prior (used by BMA g-prior) is also centered at zero but has thinner tails. The Uniform prior assigns equal weight everywhere.">&lt;/p>
&lt;h3 id="the-deep-connection-to-lasso">The deep connection to LASSO&lt;/h3>
&lt;p>Here is a remarkable fact: &lt;strong>the LASSO&amp;rsquo;s L1 penalty is the negative log of a Laplace prior&lt;/strong>. The MAP (maximum a posteriori) estimate under a Laplace prior is:&lt;/p>
&lt;p>$$
\hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \frac{1}{2n}\|y - X\beta\|^2 + \frac{\sigma^2}{\tau} \sum_{j=1}^{p}|\beta_j|
$$&lt;/p>
&lt;p>This is identical to the LASSO objective with $\lambda = \sigma^2 / \tau$. The LASSO penalty and the Laplace prior are two sides of the same coin.&lt;/p>
&lt;p>This means &lt;strong>LASSO and WALS encode the same prior belief&lt;/strong> &amp;mdash; that most coefficients are probably zero or small &amp;mdash; but they use it differently:&lt;/p>
&lt;ul>
&lt;li>LASSO uses the Laplace prior for &lt;strong>selection&lt;/strong>: it finds the single most probable model (the MAP estimate), which sets some coefficients to exactly zero&lt;/li>
&lt;li>WALS uses the Laplace prior for &lt;strong>averaging&lt;/strong>: it averages over all models, weighted by the Laplace prior, producing continuous (nonzero) coefficient estimates with uncertainty measures&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> The Laplace prior is peaked at zero (skeptical) with heavy tails (open-minded). It is the same prior that underlies LASSO&amp;rsquo;s L1 penalty. LASSO uses it for hard selection (zeros vs. nonzeros); WALS uses it for soft averaging (continuous weights).&lt;/p>
&lt;/blockquote>
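&lt;p>The equivalence can be verified numerically in one dimension: with an orthonormal design, the MAP estimate under a Laplace prior is the soft-thresholded OLS estimate, which is exactly the scalar LASSO solution. A minimal sketch (the values of &lt;code>b_ols&lt;/code> and &lt;code>lambda&lt;/code> are made up for illustration):&lt;/p>
&lt;pre>&lt;code class="language-r"># Scalar LASSO solution = MAP estimate under a Laplace prior
soft_threshold &amp;lt;- function(b, lam) sign(b) * pmax(abs(b) - lam, 0)
b_ols &amp;lt;- 0.8 # unpenalized (OLS) estimate
lambda &amp;lt;- 0.3 # penalty strength, sigma^2 / tau in the MAP form
# Minimize the scalar objective 0.5 * (b - b_ols)^2 + lambda * |b|
# by brute force over a fine grid, then compare with the closed form
grid &amp;lt;- seq(-2, 2, by = 1e-4)
obj &amp;lt;- 0.5 * (grid - b_ols)^2 + lambda * abs(grid)
b_map &amp;lt;- grid[which.min(obj)]
soft_threshold(b_ols, lambda) # 0.5
b_map # agrees with 0.5 up to grid resolution
&lt;/code>&lt;/pre>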
&lt;h2 id="16-wals-on-all-12-variables">16. WALS on All 12 Variables&lt;/h2>
&lt;h3 id="161-running-wals">16.1 Running WALS&lt;/h3>
&lt;pre>&lt;code class="language-r"># WALS splits regressors into two groups:
# X1 = focus regressors (always included): just the intercept
# X2 = auxiliary regressors (uncertain): our 12 candidate variables
# Prepare the focus regressor matrix (intercept only)
X1_wals &amp;lt;- matrix(1, nrow = nrow(synth_data), ncol = 1)
colnames(X1_wals) &amp;lt;- &amp;quot;(Intercept)&amp;quot;
# Prepare the auxiliary regressor matrix (all 12 candidates)
X2_wals &amp;lt;- synth_data |&amp;gt;
select(log_gdp, industry, fossil_fuel, urban_pop, democracy,
trade_network, agriculture, log_trade, fdi, corruption,
log_tourism, log_credit) |&amp;gt;
as.matrix()
y_wals &amp;lt;- synth_data$log_co2
# Fit WALS with the Laplace prior (the recommended default)
wals_fit &amp;lt;- wals(
x = X1_wals, # focus regressors (intercept)
x2 = X2_wals, # auxiliary regressors (12 candidates)
y = y_wals, # response variable
prior = laplace() # Laplace prior for auxiliaries
)
wals_summary &amp;lt;- summary(wals_fit)
&lt;/code>&lt;/pre>
&lt;p>The WALS function call is remarkably concise. Unlike BMA, there is no MCMC sampling, no burn-in period, and no convergence diagnostics to worry about. The computation is essentially instantaneous.&lt;/p>
&lt;pre>&lt;code class="language-r"># Extract results
aux_coefs &amp;lt;- wals_summary$auxCoefs
wals_df &amp;lt;- tibble(
variable = rownames(aux_coefs),
estimate = aux_coefs[, &amp;quot;Estimate&amp;quot;],
se = aux_coefs[, &amp;quot;Std. Error&amp;quot;],
t_stat = estimate / se
) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
abs_t = abs(t_stat),
wals_robust = abs_t &amp;gt;= 2
)
print(wals_df |&amp;gt; arrange(desc(abs_t)) |&amp;gt; select(variable, estimate, t_stat, true_beta))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable estimate t_stat true_beta
log_gdp 1.1333 34.62 1.200
trade_network 0.8458 4.39 0.500
industry 0.0187 4.01 0.008
fossil_fuel 0.0099 3.26 0.012
urban_pop 0.0082 3.11 0.010
democracy 0.0097 2.58 0.004
log_credit 0.0659 1.43 0.000
agriculture -0.0046 -1.13 0.005
log_tourism -0.0148 -0.64 0.000
log_trade 0.0196 0.31 0.000
fdi -0.0011 -0.17 0.000
corruption -0.0165 -0.09 0.000
&lt;/code>&lt;/pre>
&lt;p>WALS produces familiar t-statistics for each auxiliary variable. Using the $|t| \geq 2$ threshold as our robustness criterion (analogous to BMA&amp;rsquo;s PIP $\geq$ 0.80), we can classify each variable as robust or fragile.&lt;/p>
&lt;h3 id="162-t-statistic-bar-chart">16.2 t-statistic bar chart&lt;/h3>
&lt;p>The t-statistic bar chart provides a visual summary of this classification: variables with $|t| \geq 2$ pass the robustness threshold, while those below it are considered fragile.&lt;/p>
&lt;pre>&lt;code class="language-r"># Classify each variable for the bar chart
wals_df &amp;lt;- wals_df |&amp;gt;
mutate(
true_nonzero = true_beta != 0, # flag the seven true predictors
bar_color = case_when(
wals_robust &amp;amp; true_nonzero ~ &amp;quot;True positive&amp;quot;,
wals_robust &amp;amp; !true_nonzero ~ &amp;quot;False positive&amp;quot;,
!wals_robust &amp;amp; true_nonzero ~ &amp;quot;False negative&amp;quot;,
TRUE ~ &amp;quot;True negative&amp;quot;
)
)
ggplot(wals_df, aes(x = reorder(variable, abs_t), y = t_stat, fill = bar_color)) +
geom_col(width = 0.6) +
geom_hline(yintercept = c(-2, 2), linetype = &amp;quot;dashed&amp;quot;) +
coord_flip()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="bma_lasso_wals_10_wals_tstat.png" alt="WALS t-statistics for all 12 variables. The dashed lines mark the t equals 2 robustness threshold. Variables with absolute t-statistic greater than or equal to 2 are considered robust.">&lt;/p>
&lt;p>The t-statistic bar chart shows a clear separation. GDP towers above all others with $|t| = 34.62$, followed by trade_network ($|t| = 4.39$), industry ($|t| = 4.01$), fossil_fuel ($|t| = 3.26$), urban_pop ($|t| = 3.11$), and democracy ($|t| = 2.58$). These six variables pass the $|t| \geq 2$ threshold. The noise variables all have $|t| &amp;lt; 1.5$, confirming they are not robust determinants. Agriculture ($|t| = 1.13$) falls just below the robustness threshold &amp;mdash; its true effect ($\beta = 0.005$) is simply too small to detect reliably with this sample size.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note.&lt;/strong> WALS produces t-statistics for each auxiliary variable. Using the $|t| \geq 2$ threshold, we can classify variables as robust or fragile. WALS is extremely fast (no MCMC) and provides a frequentist complement to BMA&amp;rsquo;s Bayesian PIPs.&lt;/p>
&lt;/blockquote>
&lt;div style="background: linear-gradient(135deg, #1a3a8a 0%, #141413 100%); padding: 1.5em 2em; border-radius: 8px; margin: 2em 0; color: #fff; font-size: 1.3em; font-weight: 600;">
PART 4: Grand Comparison
&lt;/div>
&lt;h2 id="17-three-methods-same-question-same-data">17. Three Methods, Same Question, Same Data&lt;/h2>
&lt;p>We have now applied all three methods to the same synthetic dataset. Time for the moment of truth: &lt;strong>which variables do all three methods agree on?&lt;/strong>&lt;/p>
&lt;h3 id="171-comprehensive-comparison-table">17.1 Comprehensive comparison table&lt;/h3>
&lt;pre>&lt;code class="language-r"># Merge all results
grand_table &amp;lt;- bma_compare |&amp;gt;
left_join(lasso_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
left_join(wals_compare, by = &amp;quot;variable&amp;quot;) |&amp;gt;
mutate(
true_beta = true_beta_lookup[variable],
bma_robust = bma_pip &amp;gt;= 0.80,
n_methods = bma_robust + lasso_selected + wals_robust,
triple_robust = n_methods == 3,
true_nonzero = true_beta != 0
)
print(grand_table |&amp;gt;
select(variable, true_beta, bma_pip, bma_robust, lasso_selected, wals_t, wals_robust, n_methods) |&amp;gt;
arrange(desc(n_methods)))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> variable true_beta bma_pip bma_robust lasso_selected wals_t wals_robust n_methods
log_gdp 1.200 1.000 TRUE TRUE 34.62 TRUE 3
trade_network 0.500 0.986 TRUE TRUE 4.39 TRUE 3
fossil_fuel 0.012 0.948 TRUE TRUE 3.26 TRUE 3
industry 0.008 0.841 TRUE TRUE 4.01 TRUE 3
urban_pop 0.010 0.648 FALSE TRUE 3.11 TRUE 2
democracy 0.004 0.607 FALSE TRUE 2.58 TRUE 2
log_tourism 0.000 0.130 FALSE FALSE -0.64 FALSE 0
log_credit 0.000 0.104 FALSE FALSE 1.43 FALSE 0
agriculture 0.005 0.087 FALSE FALSE -1.13 FALSE 0
log_trade 0.000 0.084 FALSE FALSE 0.31 FALSE 0
corruption 0.000 0.078 FALSE FALSE -0.09 FALSE 0
fdi 0.000 0.077 FALSE FALSE -0.17 FALSE 0
&lt;/code>&lt;/pre>
&lt;p>The results are striking. Four variables are &lt;strong>triple-robust&lt;/strong> &amp;mdash; identified by all three methods: log_gdp, trade_network, fossil_fuel, and industry. Two more variables &amp;mdash; urban_pop and democracy &amp;mdash; are &lt;strong>double-robust&lt;/strong>, selected by LASSO and WALS but landing in BMA&amp;rsquo;s borderline zone (PIPs of 0.648 and 0.607). All five noise variables are correctly excluded by all three methods. Agriculture ($\beta = 0.005$) is the only true predictor missed by all methods &amp;mdash; its effect is simply too small to detect.&lt;/p>
&lt;h3 id="172-method-agreement-heatmap">17.2 Method agreement heatmap&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_12_heatmap.png" alt="Method agreement heatmap showing 12 variables by 3 methods. Steel blue indicates the variable was identified as robust; orange indicates it was not. True predictors are in the top rows, noise variables in the bottom rows.">&lt;/p>
&lt;p>The heatmap provides a visual summary of agreement. The top four rows (GDP, trade_network, fossil_fuel, industry) are solid steel blue across all three columns &amp;mdash; unanimous agreement that these variables matter. Urban_pop and democracy show steel blue for LASSO and WALS but orange for BMA, visualizing BMA&amp;rsquo;s greater conservatism. The bottom five rows (noise) are solid orange &amp;mdash; unanimous agreement that they do not matter. Agriculture is also orange throughout, reflecting all methods&amp;rsquo; consensus that its tiny effect ($\beta = 0.005$) cannot be reliably distinguished from zero.&lt;/p>
&lt;h3 id="173-bma-pip-vs-wals-t-statistic">17.3 BMA PIP vs. WALS |t-statistic|&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_13_pip_vs_t.png" alt="BMA PIP plotted against WALS absolute t-statistic. Point color indicates true status (steel blue for true predictors, orange for noise). Point shape indicates LASSO selection (triangle for selected, cross for not selected). The upper-right quadrant contains variables robust by both BMA and WALS.">&lt;/p>
&lt;p>The scatter plot reveals a strong positive relationship between BMA PIP and WALS $|t|$. Variables in the upper-right quadrant are robust by both methods &amp;mdash; GDP, trade_network, fossil_fuel, and industry. Urban_pop and democracy sit in an interesting middle zone: high WALS $|t|$ (above 2) but moderate BMA PIP (below 0.80), illustrating BMA&amp;rsquo;s more conservative threshold. The noise variables cluster in the lower-left corner (low PIP, low $|t|$). LASSO selection (triangle markers) aligns with the WALS threshold, selecting the same six variables that pass $|t| \geq 2$.&lt;/p>
&lt;h3 id="174-coefficient-comparison">17.4 Coefficient comparison&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_14_coef_comparison.png" alt="Coefficient estimates from the three methods compared to the true values in a three-panel faceted scatter plot. Points close to the dashed 45-degree line indicate accurate coefficient recovery.">&lt;/p>
&lt;p>The coefficient comparison plot shows how well each method recovers the true effect sizes. Points on the dashed 45-degree line represent perfect recovery. GDP ($\beta = 1.200$) is recovered almost exactly by all three methods. The smaller coefficients (fossil_fuel at 0.012, urban_pop at 0.010) are also well-estimated. Trade_network&amp;rsquo;s coefficient is overestimated by all methods (true 0.500, estimates around 0.85&amp;ndash;0.90), reflecting the difficulty of precisely estimating an effect on a low-variance variable. BMA&amp;rsquo;s posterior means are slightly attenuated for variables with PIPs below 1.0 (the averaging shrinks them toward zero).&lt;/p>
&lt;h3 id="175-agreement-summary">17.5 Agreement summary&lt;/h3>
&lt;p>&lt;img src="bma_lasso_wals_15_agreement.png" alt="Bar chart showing how many methods (out of 3) identified each variable as robust. Steel blue bars are true predictors, orange bars are noise variables. Four variables achieve triple-robust status and two achieve double-robust status.">&lt;/p>
&lt;p>The agreement bar chart tells a nuanced story: four variables are triple-robust (identified by all three methods), two are double-robust (identified by LASSO and WALS but not BMA), and six are identified by none. The &amp;ldquo;split votes&amp;rdquo; on urban_pop and democracy reveal a genuine methodological difference: LASSO and WALS are more liberal in including moderate-effect variables, while BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor demands stronger evidence. This pattern &amp;mdash; where methods &lt;em>mostly&lt;/em> agree but diverge on borderline cases &amp;mdash; is what makes methodological triangulation valuable.&lt;/p>
&lt;h3 id="176-method-performance">17.6 Method performance&lt;/h3>
&lt;pre>&lt;code class="language-r"># Sensitivity, specificity, and accuracy for each method
results_by_method &amp;lt;- tibble(
method = c(&amp;quot;BMA&amp;quot;, &amp;quot;LASSO&amp;quot;, &amp;quot;WALS&amp;quot;),
true_pos = c(4, 6, 6), # true predictors correctly identified
false_pos = c(0, 0, 0), # noise variables falsely identified
false_neg = c(3, 1, 1), # true predictors missed
true_neg = c(5, 5, 5), # noise variables correctly excluded
sensitivity = true_pos / 7,
specificity = true_neg / 5,
accuracy = (true_pos + true_neg) / 12
)
print(results_by_method)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> method true_pos false_pos false_neg true_neg sensitivity specificity accuracy
BMA 4 0 3 5 0.571 1.000 0.750
LASSO 6 0 1 5 0.857 1.000 0.917
WALS 6 0 1 5 0.857 1.000 0.917
&lt;/code>&lt;/pre>
&lt;p>All three methods achieve &lt;strong>perfect specificity&lt;/strong> (zero false positives) &amp;mdash; none mistakenly identifies a noise variable as robust. The key difference is in &lt;strong>sensitivity&lt;/strong>: LASSO and WALS each detect 6 of 7 true predictors (85.7%), while BMA detects only 4 (57.1%). BMA&amp;rsquo;s lower sensitivity reflects its conservative Bayesian Occam&amp;rsquo;s razor: it places urban_pop and democracy in the &amp;ldquo;borderline&amp;rdquo; zone rather than committing to their inclusion. The one variable missed by all methods &amp;mdash; agriculture ($\beta = 0.005$) &amp;mdash; has an effect so small that it is indistinguishable from noise given our sample size.&lt;/p>
&lt;h3 id="177-when-to-use-which-method">17.7 When to use which method&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Method&lt;/th>
&lt;th style="text-align:left">Best for&lt;/th>
&lt;th style="text-align:left">Strengths&lt;/th>
&lt;th style="text-align:left">Limitations&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">BMA&lt;/td>
&lt;td style="text-align:left">Full uncertainty quantification&lt;/td>
&lt;td style="text-align:left">Probabilistic (PIPs), handles model uncertainty formally, coefficient intervals&lt;/td>
&lt;td style="text-align:left">Slower (MCMC), requires prior specification&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">LASSO&lt;/td>
&lt;td style="text-align:left">Prediction, sparse models&lt;/td>
&lt;td style="text-align:left">Fast, automatic selection, works with many variables&lt;/td>
&lt;td style="text-align:left">Binary (in/out), biased coefficients (use Post-LASSO)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">WALS&lt;/td>
&lt;td style="text-align:left">Speed, frequentist inference&lt;/td>
&lt;td style="text-align:left">Very fast, produces t-statistics, no MCMC&lt;/td>
&lt;td style="text-align:left">Less common, limited software support&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The strongest recommendation: &lt;strong>use all three&lt;/strong>. When they converge on the same variables (as with our four triple-robust predictors), you have the strongest possible evidence. When they disagree (as with urban_pop and democracy, where LASSO and WALS say &amp;ldquo;yes&amp;rdquo; but BMA hedges), the disagreement itself is informative &amp;mdash; it tells you the evidence is real but not overwhelming. In real-world data, complications such as nonlinearity, heteroskedasticity, and endogeneity may affect method performance and should be addressed before applying these techniques.&lt;/p>
&lt;h2 id="18-conclusion">18. Conclusion&lt;/h2>
&lt;h3 id="181-summary">18.1 Summary&lt;/h3>
&lt;p>This tutorial introduced three principled approaches to the variable selection problem:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Bayesian Model Averaging (BMA)&lt;/strong> averages over all possible models, weighting each by its posterior probability. It produces Posterior Inclusion Probabilities (PIPs) that quantify how robust each variable is across the entire model space. Variables with PIP $\geq$ 0.80 are considered robust.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LASSO&lt;/strong> adds an L1 penalty to the OLS objective, forcing irrelevant coefficients to exactly zero. Cross-validation selects the penalty strength. Post-LASSO recovers unbiased coefficient estimates for the selected variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>WALS&lt;/strong> uses a semi-orthogonal transformation to decompose the model-averaging problem into independent subproblems &amp;mdash; one per variable. It is extremely fast and produces familiar t-statistics for robustness assessment.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="182-key-takeaways">18.2 Key takeaways&lt;/h3>
&lt;p>&lt;strong>The methods mostly converge &amp;mdash; and their disagreements are informative.&lt;/strong> Four variables are identified by all three methods (triple-robust), and all methods achieve perfect specificity (zero false positives). LASSO and WALS are more sensitive (detecting 6 of 7 true predictors), while BMA is more conservative (detecting 4). The two variables where they disagree &amp;mdash; urban_pop and democracy &amp;mdash; have moderate effects that BMA&amp;rsquo;s Bayesian Occam&amp;rsquo;s razor treats as borderline. This pattern illustrates the value of methodological triangulation across fundamentally different statistical paradigms.&lt;/p>
&lt;p>&lt;strong>Model uncertainty is real but addressable.&lt;/strong> With 12 candidate variables, there are 4,096 possible models. Rather than pretending one of them is &amp;ldquo;the&amp;rdquo; model, these methods account for the uncertainty explicitly. The result is more honest inference.&lt;/p>
&lt;p>&lt;strong>Synthetic data lets us verify.&lt;/strong> Because we designed the data-generating process, we could check each method&amp;rsquo;s performance against the known truth. In practice, the truth is unknown &amp;mdash; which is precisely why using multiple methods is so valuable.&lt;/p>
&lt;h3 id="183-applying-this-to-your-own-research">18.3 Applying this to your own research&lt;/h3>
&lt;p>The code in this tutorial is designed to be modular. To apply these methods to your own data:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Replace the CSV&lt;/strong>: load your own cross-sectional dataset instead of the synthetic one&lt;/li>
&lt;li>&lt;strong>Define the variable list&lt;/strong>: specify which variables are candidates for selection&lt;/li>
&lt;li>&lt;strong>Run the three methods&lt;/strong>: use the same &lt;code>bms()&lt;/code>, &lt;code>cv.glmnet()&lt;/code>, and &lt;code>wals()&lt;/code> function calls&lt;/li>
&lt;li>&lt;strong>Compare results&lt;/strong>: build the same comparison table and heatmap&lt;/li>
&lt;/ol>
&lt;p>The interpretation framework &amp;mdash; PIPs for BMA, selection for LASSO, t-statistics for WALS &amp;mdash; applies regardless of the specific dataset.&lt;/p>
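&lt;p>Step 4 can be reduced to a small helper that applies the three robustness rules and counts the votes. The PIPs, selections, and t-statistics below are hypothetical placeholders for real &lt;code>bms()&lt;/code>, &lt;code>cv.glmnet()&lt;/code>, and &lt;code>wals()&lt;/code> output:&lt;/p>
&lt;pre>&lt;code class="language-r"># Cross-method robustness classification (toy inputs, base R only)
robustness_table &amp;lt;- function(variable, bma_pip, lasso_selected, wals_t,
pip_cut = 0.80, t_cut = 2) {
data.frame(
variable = variable,
bma_robust = bma_pip &amp;gt;= pip_cut, # BMA rule: PIP threshold
lasso_robust = lasso_selected, # LASSO rule: selected or not
wals_robust = abs(wals_t) &amp;gt;= t_cut, # WALS rule: t threshold
n_methods = (bma_pip &amp;gt;= pip_cut) + lasso_selected +
(abs(wals_t) &amp;gt;= t_cut) # votes out of 3
)
}
# Hypothetical results for three candidate variables
tab &amp;lt;- robustness_table(
variable = c(&amp;quot;x1&amp;quot;, &amp;quot;x2&amp;quot;, &amp;quot;x3&amp;quot;),
bma_pip = c(0.99, 0.65, 0.10),
lasso_selected = c(TRUE, TRUE, FALSE),
wals_t = c(5.1, 2.4, 0.3)
)
tab$n_methods # 3 2 0
&lt;/code>&lt;/pre>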
&lt;h3 id="184-further-reading">18.4 Further reading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>BMA&lt;/strong>: Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). &amp;ldquo;Bayesian Model Averaging: A Tutorial.&amp;rdquo; &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>&lt;strong>LASSO&lt;/strong>: Tibshirani, R. (1996). &amp;ldquo;Regression Shrinkage and Selection via the Lasso.&amp;rdquo; &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>&lt;strong>WALS&lt;/strong>: Magnus, J.R., Powell, O., and Prufer, P. (2010). &amp;ldquo;A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics.&amp;rdquo; &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>&lt;strong>Application&lt;/strong>: Aller, C., Ductor, L., and Grechyna, D. (2021). &amp;ldquo;Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions.&amp;rdquo; &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>&lt;strong>Post-LASSO&lt;/strong>: Belloni, A. and Chernozhukov, V. (2013). &amp;ldquo;Least Squares After Model Selection in High-Dimensional Sparse Models.&amp;rdquo; &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;li>&lt;strong>R Packages&lt;/strong>: &lt;a href="https://cran.r-project.org/web/packages/BMS/vignettes/bms.pdf" target="_blank" rel="noopener">BMS vignette&lt;/a>, &lt;a href="https://glmnet.stanford.edu/articles/glmnet.html" target="_blank" rel="noopener">glmnet vignette&lt;/a>, &lt;a href="https://cran.r-project.org/package=WALS" target="_blank" rel="noopener">WALS package&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. &lt;em>Statistical Science&lt;/em>, 14(4), 382&amp;ndash;417.&lt;/li>
&lt;li>Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. &lt;em>Journal of the Royal Statistical Society, Series B&lt;/em>, 58(1), 267&amp;ndash;288.&lt;/li>
&lt;li>Magnus, J.R., Powell, O., and Prufer, P. (2010). A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics. &lt;em>Journal of Econometrics&lt;/em>, 154(2), 139&amp;ndash;153.&lt;/li>
&lt;li>Raftery, A.E. (1995). Bayesian Model Selection in Social Research. &lt;em>Sociological Methodology&lt;/em>, 25, 111&amp;ndash;163.&lt;/li>
&lt;li>Aller, C., Ductor, L., and Grechyna, D. (2021). Robust Determinants of CO&lt;sub>2&lt;/sub> Emissions. &lt;em>Energy Economics&lt;/em>, 96, 105154.&lt;/li>
&lt;li>Belloni, A. and Chernozhukov, V. (2013). Least Squares After Model Selection in High-Dimensional Sparse Models. &lt;em>Bernoulli&lt;/em>, 19(2), 521&amp;ndash;547.&lt;/li>
&lt;/ol></description></item></channel></rss>