<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>IPW and Doubly Robust | Carlos Mendez</title><link>https://carlos-mendez.org/category/ipw-and-doubly-robust/</link><atom:link href="https://carlos-mendez.org/category/ipw-and-doubly-robust/index.xml" rel="self" type="application/rss+xml"/><description>IPW and Doubly Robust</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Tue, 24 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>IPW and Doubly Robust</title><link>https://carlos-mendez.org/category/ipw-and-doubly-robust/</link></image><item><title>Evaluating a Cash Transfer Program (RCT) with Panel Data in Stata</title><link>https://carlos-mendez.org/post/stata_rct/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_rct/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Cash transfer programs are among the most common development interventions worldwide. Governments and international organizations spend billions of dollars each year providing direct cash transfers to low-income households. But how do we rigorously evaluate whether these programs actually work? This tutorial walks through the complete workflow of analyzing a &lt;strong>randomized controlled trial (RCT)&lt;/strong> with &lt;strong>panel data&lt;/strong> in Stata &amp;mdash; from verifying that randomization succeeded, to estimating treatment effects using increasingly sophisticated methods, to comparing results across all approaches.&lt;/p>
&lt;p>We use simulated data from a hypothetical cash transfer program targeting 2,000 households in a developing country. The key advantage of simulated data is that we know the &lt;strong>true treatment effect&lt;/strong> before we begin: the program increases household consumption by &lt;strong>12%&lt;/strong> (0.12 log points). This known ground truth gives us a perfect benchmark to evaluate how well each econometric method recovers the correct answer.&lt;/p>
&lt;p>The tutorial progresses from simple to sophisticated. We start with basic balance checks, then estimate treatment effects three different ways using only endline data &amp;mdash; regression adjustment (RA), inverse probability weighting (IPW), and doubly robust (DR) methods. Next, we unlock the full power of panel data with difference-in-differences (DiD) and its doubly robust extension (DRDID). Finally, we address the real-world complication of imperfect compliance.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;ul>
&lt;li>Verify baseline balance using t-tests, standardized mean differences, and balance plots&lt;/li>
&lt;li>Distinguish between ATE and ATT and identify which estimand each method targets&lt;/li>
&lt;li>Understand three estimation strategies &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust &amp;mdash; and when to use each&lt;/li>
&lt;li>Estimate treatment effects using all three approaches and compare their results&lt;/li>
&lt;li>Leverage panel data structure with difference-in-differences and understand why DiD estimates ATT&lt;/li>
&lt;li>Apply doubly robust difference-in-differences (DRDID) for modern panel data analysis&lt;/li>
&lt;li>Separate the effect of treatment offer from treatment receipt under imperfect compliance&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-study-design">2. Study design&lt;/h2>
&lt;p>This RCT evaluates a cash transfer program designed to boost household consumption. The study tracks 2,000 households across two survey waves &amp;mdash; a &lt;strong>baseline&lt;/strong> in 2021 (before the program) and an &lt;strong>endline&lt;/strong> in 2024 (after the program was implemented). The diagram below summarizes the experimental design.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
POP[&amp;quot;&amp;lt;b&amp;gt;2,000 Households&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Balanced panel&amp;lt;br/&amp;gt;(observed in 2021 and 2024)&amp;quot;]
STRAT[&amp;quot;&amp;lt;b&amp;gt;Stratified Randomization&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Within poverty strata&amp;quot;]
TRT[&amp;quot;&amp;lt;b&amp;gt;Treatment Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;Offered cash transfer&amp;quot;]
CTL[&amp;quot;&amp;lt;b&amp;gt;Control Group&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(~1,000 households)&amp;lt;br/&amp;gt;No offer&amp;quot;]
COMP1[&amp;quot;85% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP2[&amp;quot;15% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
COMP3[&amp;quot;5% receive&amp;lt;br/&amp;gt;the transfer&amp;quot;]
COMP4[&amp;quot;95% do not&amp;lt;br/&amp;gt;receive&amp;quot;]
BASE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Pre-treatment survey&amp;quot;]
END[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Post-treatment survey&amp;quot;]
POP --&amp;gt; BASE
BASE --&amp;gt; STRAT
STRAT --&amp;gt; TRT
STRAT --&amp;gt; CTL
TRT --&amp;gt; COMP1
TRT --&amp;gt; COMP2
CTL --&amp;gt; COMP3
CTL --&amp;gt; COMP4
COMP1 --&amp;gt; END
COMP2 --&amp;gt; END
COMP3 --&amp;gt; END
COMP4 --&amp;gt; END
style POP fill:#6a9bcc,stroke:#141413,color:#fff
style STRAT fill:#d97757,stroke:#141413,color:#fff
style TRT fill:#00d4c8,stroke:#141413,color:#141413
style CTL fill:#6a9bcc,stroke:#141413,color:#fff
style BASE fill:#6a9bcc,stroke:#141413,color:#fff
style END fill:#d97757,stroke:#141413,color:#fff
style COMP1 fill:#00d4c8,stroke:#141413,color:#141413
style COMP2 fill:#141413,stroke:#d97757,color:#fff
style COMP3 fill:#d97757,stroke:#141413,color:#fff
style COMP4 fill:#141413,stroke:#6a9bcc,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The randomization was &lt;strong>stratified by poverty status&lt;/strong> (block randomization), ensuring that treatment and control groups started with similar proportions of poor and non-poor households. A critical real-world feature of this study is &lt;strong>imperfect compliance&lt;/strong> &amp;mdash; only 85% of households offered the treatment actually received the cash transfer, while 5% of control households received it through other channels.&lt;/p>
&lt;h3 id="variables">Variables&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Type&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>id&lt;/code>&lt;/td>
&lt;td>Household identifier&lt;/td>
&lt;td>Panel ID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>year&lt;/code>&lt;/td>
&lt;td>Survey year (2021 or 2024)&lt;/td>
&lt;td>Time variable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>post&lt;/code>&lt;/td>
&lt;td>Endline indicator (1 = 2024)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>treat&lt;/code>&lt;/td>
&lt;td>Random assignment to offer (intent-to-treat)&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>D&lt;/code>&lt;/td>
&lt;td>Actual receipt of cash transfer&lt;/td>
&lt;td>Binary (endogenous)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y&lt;/code>&lt;/td>
&lt;td>Log monthly consumption&lt;/td>
&lt;td>Continuous (outcome)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>age&lt;/code>&lt;/td>
&lt;td>Age of household head&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>female&lt;/code>&lt;/td>
&lt;td>Female-headed household&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>poverty&lt;/code>&lt;/td>
&lt;td>Poverty status at baseline&lt;/td>
&lt;td>Binary&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>edu&lt;/code>&lt;/td>
&lt;td>Years of education&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>y0&lt;/code>&lt;/td>
&lt;td>Log monthly consumption at baseline (pre-treatment)&lt;/td>
&lt;td>Continuous&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Offer vs. receipt&lt;/strong> &amp;mdash; The variable &lt;code>treat&lt;/code> captures random assignment to the program offer. It is exogenous (determined by randomization) and unrelated to household characteristics. The variable &lt;code>D&lt;/code> captures actual receipt of the cash transfer. It is &lt;strong>endogenous&lt;/strong> &amp;mdash; households that chose to take up the program may differ systematically from those that did not. Most methods in this tutorial estimate the effect of the &lt;strong>offer&lt;/strong> (intent-to-treat). Section 10 addresses the effect of &lt;strong>receipt&lt;/strong>.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="3-analytical-roadmap">3. Analytical roadmap&lt;/h2>
&lt;p>The diagram below shows the progression of methods we will use. Each stage builds on the previous one, adding complexity and robustness.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;Balance&amp;lt;br/&amp;gt;Checks&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 5&amp;lt;/i&amp;gt;&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Cross-sectional&amp;lt;br/&amp;gt;RA / IPW / DR&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Sections 7--8&amp;lt;/i&amp;gt;&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;br/&amp;gt;DiD / DR-DiD&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 9&amp;lt;/i&amp;gt;&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Endogenous&amp;lt;br/&amp;gt;Treatment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 10&amp;lt;/i&amp;gt;&amp;quot;]
A --&amp;gt; B
B --&amp;gt; C
C --&amp;gt; D
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;p>We first establish that randomization worked (balance checks). Then we estimate treatment effects three ways using only endline data &amp;mdash; regression adjustment, inverse probability weighting, and doubly robust methods. Next, we leverage the full panel structure with difference-in-differences. Finally, we address imperfect compliance by separating the effect of the offer from the effect of receipt.&lt;/p>
&lt;hr>
&lt;h2 id="4-data-loading-and-exploration">4. Data loading and exploration&lt;/h2>
&lt;p>We begin by loading the simulated dataset from a public GitHub repository and examining its structure.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
des y age edu female poverty treat D
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Contains data
Observations: 4,000
Variables: 10
Variable Storage Display Value
name type format label Variable label
─────────────────────────────────────────────────────────────
y float %9.0g Log monthly consumption
age float %9.0g
edu float %9.0g
female float %9.0g
poverty float %9.0g
treat float %9.0g Assignment to offer (Z)
D float %9.0g Receipt of cash transfer
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 4,000 observations &amp;mdash; 2,000 households observed at two time points (baseline 2021 and endline 2024). The outcome variable &lt;code>y&lt;/code> is log monthly consumption, &lt;code>treat&lt;/code> is the random assignment indicator, and &lt;code>D&lt;/code> is the actual receipt indicator.&lt;/p>
&lt;p>Now let us examine summary statistics at baseline and endline separately.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==0
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.0154 .4348886 8.454445 11.48253
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 0 0 0 0
&lt;/code>&lt;/pre>
&lt;p>At baseline, mean log consumption is approximately 10.02, the average household head is 35 years old with 12 years of education, about 51% of households are female-headed, and 31% are in poverty. Treatment assignment (&lt;code>treat&lt;/code>) averages 0.518 &amp;mdash; close to the 50/50 split expected from randomization. Crucially, the receipt variable &lt;code>D&lt;/code> is zero for all households at baseline &amp;mdash; the program had not yet been implemented.&lt;/p>
&lt;pre>&lt;code class="language-stata">sum y age edu female poverty treat D if post==1
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Obs Mean Std. dev. Min Max
─────────────+─────────────────────────────────────────────────────────
y | 2,000 10.1137 .4382183 8.638689 11.55002
age | 2,000 35.126 9.650839 18 68
edu | 2,000 12.0275 1.9889 6 18
female | 2,000 .5085 .5000528 0 1
poverty | 2,000 .3125 .4636283 0 1
treat | 2,000 .518 .4998009 0 1
D | 2,000 .4615 .4986402 0 1
&lt;/code>&lt;/pre>
&lt;p>At endline, mean consumption has risen to approximately 10.11, reflecting both the natural time trend and the treatment effect. The receipt variable &lt;code>D&lt;/code> is now non-zero &amp;mdash; about 46% of all households received the cash transfer (combining treated households who took up the program and control households who received it through other channels).&lt;/p>
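&lt;p>As a quick back-of-the-envelope check (outside the Stata workflow), the receipt rate implied by the assignment share and the stated compliance rates is consistent with the observed mean of &lt;code>D&lt;/code>:&lt;/p>

```python
# Share of households assigned to the treatment offer (from the summary above)
share_treated = 0.518

# Stated compliance rates: 85% take-up among offered, 5% receipt among controls
take_up_treated = 0.85
take_up_control = 0.05

# Implied overall receipt rate at endline
implied_receipt = share_treated * take_up_treated + (1 - share_treated) * take_up_control
print(round(implied_receipt, 4))  # 0.4644, close to the observed mean of D (0.4615)
```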
&lt;p>Finally, we declare the panel structure so Stata knows we have repeated observations.&lt;/p>
&lt;pre>&lt;code class="language-stata">xtset id year
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Panel variable: id (strongly balanced)
Time variable: year, 2021 to 2024, but with gaps
Delta: 1 unit
&lt;/code>&lt;/pre>
&lt;p>The panel is &lt;strong>strongly balanced&lt;/strong> &amp;mdash; all 2,000 households appear in both survey waves, with no attrition. This is an ideal scenario that simplifies our analysis.&lt;/p>
&lt;hr>
&lt;h2 id="5-baseline-balance-checks">5. Baseline balance checks&lt;/h2>
&lt;p>Before estimating any treatment effects, we must verify that randomization produced comparable treatment and control groups at baseline. This is the most fundamental quality check in any RCT.&lt;/p>
&lt;h3 id="51-t-tests-and-proportion-tests">5.1 T-tests and proportion tests&lt;/h3>
&lt;p>We compare the treatment and control groups on all baseline characteristics using two-sample t-tests for continuous variables and proportion tests for binary variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">ttest y if post==0, by(treat)
ttest age if post==0, by(treat)
ttest edu if post==0, by(treat)
prtest female if post==0, by(treat)
prtest poverty if post==0, by(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Variable | Control Mean Treat Mean Diff p-value
────────────+──────────────────────────────────────────────
y | 10.025 10.006 0.019 0.330
age | 35.335 34.931 0.404 0.350
edu | 11.974 12.077 -0.103 0.247
female | 0.484 0.531 -0.046 0.038 **
poverty | 0.307 0.318 -0.011 0.612
&lt;/code>&lt;/pre>
&lt;p>Most variables show no statistically significant differences between the treatment and control groups. However, the variable &lt;code>female&lt;/code> has a p-value of 0.038 &amp;mdash; a statistically significant imbalance. The treatment group has about 4.6 percentage points more female-headed households than the control group. This imbalance occurred purely by chance but must be addressed in our estimation.&lt;/p>
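&lt;p>For intuition, the proportion test behind the &lt;code>female&lt;/code> row can be reproduced by hand. The sketch below uses the rounded group shares from the table above and the group sizes reported by the balance table in Section 5.2; the small discrepancy from the tabulated p-value of 0.038 comes from rounding the inputs:&lt;/p>

```python
import math

# Rounded inputs from the balance results above
p_control, n_control = 0.484, 964    # share female-headed, control group
p_treated, n_treated = 0.531, 1036   # share female-headed, treatment group

# Pooled proportion under H0 of no difference between groups
p_pool = (p_control * n_control + p_treated * n_treated) / (n_control + n_treated)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treated))

# Two-sided z test for equal proportions
z = (p_treated - p_control) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 3))  # z around 2.1, p-value around 0.036
```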
&lt;h3 id="52-balance-table-with-standardized-mean-differences">5.2 Balance table with standardized mean differences&lt;/h3>
&lt;p>P-values are sensitive to sample size &amp;mdash; a large sample can make tiny differences &amp;ldquo;significant.&amp;rdquo; Standardized mean differences (SMDs) provide a scale-free measure of imbalance that is more informative. The SMD is computed as the difference in group means divided by the pooled standard deviation &amp;mdash; this puts all variables on the same scale regardless of their units. The common rule of thumb is that SMDs below 10% indicate adequate balance.&lt;/p>
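&lt;p>To make the formula concrete, here is a minimal SMD computation for the binary variable &lt;code>female&lt;/code>, using the rounded group shares from Section 5.1 (a sketch &amp;mdash; packages differ slightly in how they pool the variances):&lt;/p>

```python
import math

# Shares of female-headed households by group (rounded, from Section 5.1)
p_control, p_treated = 0.484, 0.531

# For a binary variable, the group variance is p(1 - p)
var_control = p_control * (1 - p_control)
var_treated = p_treated * (1 - p_treated)

# Standardized mean difference: mean difference over the pooled SD
pooled_sd = math.sqrt((var_control + var_treated) / 2)
smd = (p_treated - p_control) / pooled_sd
print(round(smd, 3))  # about 0.094, just below the 10% threshold
```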
&lt;pre>&lt;code class="language-stata">capture ssc install ietoolkit, replace
iebaltab y age edu female poverty if post==0, grpvar(treat)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">                    (1)          (2)        (1)-(2)
Control Treatment Difference
y 10.025 10.006 0.019
(0.014) (0.014) (0.019)
age 35.335 34.931 0.404
(0.316) (0.295) (0.432)
edu 11.974 12.077 -0.103
(0.063) (0.063) (0.089)
female 0.484 0.531 -0.046**
(0.016) (0.016) (0.022)
poverty 0.307 0.318 -0.011
(0.015) (0.014) (0.021)
N 964 1,036
&lt;/code>&lt;/pre>
&lt;p>The balance table confirms our t-test findings. With 964 control and 1,036 treatment households, all variables are well balanced except &lt;code>female&lt;/code>, which shows a statistically significant difference (marked with **). The outcome variable &lt;code>y&lt;/code> has a negligible difference of 0.019 at baseline &amp;mdash; the groups started with essentially identical consumption levels.&lt;/p>
&lt;h3 id="53-visual-balance-plot">5.3 Visual balance plot&lt;/h3>
&lt;p>A balance plot provides a visual overview of all SMDs at once, making it easy to spot problematic variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">net install balanceplot, from(&amp;quot;https://tdmize.github.io/data&amp;quot;) replace
balanceplot y age edu i.female i.poverty, group(treat) table nodropdv
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_balance_plot.png" alt="Balance plot showing standardized mean differences for all covariates. All variables fall within the 10% threshold, with female closest at approximately 9.3%.">&lt;/p>
&lt;p>The balance plot shows that all SMDs fall below the 10% threshold (indicated by the dashed vertical lines). The variable &lt;code>female&lt;/code> has the largest SMD at approximately 9.3% &amp;mdash; close to but still below the conventional threshold. The remaining variables &amp;mdash; consumption, age, education, and poverty &amp;mdash; all have SMDs well below 5%. Overall, randomization was successful, but we should control for &lt;code>female&lt;/code> (and other covariates) in our estimation to improve precision.&lt;/p>
&lt;h3 id="54-aipw-as-a-formal-balance-test">5.4 AIPW as a formal balance test&lt;/h3>
&lt;p>As a final and more formal balance check, we can use the Augmented Inverse Probability Weighting (AIPW) estimator on &lt;strong>baseline data only&lt;/strong>. If randomization was successful, the estimated &amp;ldquo;treatment effect&amp;rdquo; at baseline should be zero &amp;mdash; since the program had not yet been implemented, there should be no difference between groups.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve
keep if post==0
teffects aipw (y age edu i.female i.poverty) (treat age edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Tip:&lt;/strong> The &lt;code>preserve&lt;/code> command saves a snapshot of the current data. After the balance analysis, use &lt;code>restore&lt;/code> to return to the full dataset. The companion do-file handles this automatically.&lt;/p>
&lt;/blockquote>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | -.0244086 .018861 -1.29 0.196 -.0613754 .0125582
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.02792 .0138363 724.75 0.000 10.0008 10.05504
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW-estimated &amp;ldquo;ATE&amp;rdquo; at baseline is -0.024 with a p-value of 0.196 &amp;mdash; not statistically significant. This confirms that there is no detectable pre-treatment difference between the groups after adjusting for covariates. The treatment and control groups were statistically comparable before the program began.&lt;/p>
&lt;p>Now we run the diagnostic checks for the AIPW model.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance overid
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Overidentification test for covariate balance
H0: Covariates are balanced
chi2(5) = 3.216
Prob &amp;gt; chi2 = 0.6670
&lt;/code>&lt;/pre>
&lt;p>The overidentification test fails to reject the null hypothesis of covariate balance (p = 0.667). There is no statistical evidence of residual imbalance after weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance summarize
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> |Standardized differences Variance ratio
| Raw Weighted Raw Weighted
----------------+------------------------------------------------
age | -.0417918 .0002505 .9318894 .9446877
edu | .0519015 -6.96e-06 1.071677 1.078214
female |
1 | .0929611 6.51e-06 .9970775 .9999996
poverty |
1 | .0226764 .0002864 1.018475 1.000233
&lt;/code>&lt;/pre>
&lt;p>The balance summary reveals that the raw standardized differences (before weighting) show the &lt;code>female&lt;/code> imbalance at 0.093, consistent with our earlier findings. After weighting, all standardized differences shrink to near zero (all below 0.001) &amp;mdash; excellent balance. The variance ratios are all close to 1.0, indicating similar spread across groups.&lt;/p>
&lt;pre>&lt;code class="language-stata">tebalance density y
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_density_y.png" alt="Density plot showing the distribution of log consumption for treatment and control groups, before and after AIPW weighting. The weighted distributions overlap almost perfectly.">&lt;/p>
&lt;p>The density plot confirms that after AIPW weighting, the distributions of log consumption in the treatment and control groups overlap almost perfectly. Any small pre-existing differences in the outcome variable have been eliminated by the weighting scheme.&lt;/p>
&lt;pre>&lt;code class="language-stata">teffects overlap
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_rct_overlap_baseline.png" alt="Overlap plot showing kernel densities of estimated propensity scores for treatment and control groups. Both distributions span approximately 0.43 to 0.55 with substantial overlap.">&lt;/p>
&lt;p>The overlap plot shows that propensity scores for both groups are concentrated between approximately 0.43 and 0.55 &amp;mdash; well within the range where matching and weighting are feasible. There are no extreme propensity scores near 0 or 1, confirming that the common support condition is satisfied. This is expected in a well-designed RCT where treatment probability is approximately 0.50 for all households.&lt;/p>
&lt;pre>&lt;code class="language-stata">restore
&lt;/code>&lt;/pre>
&lt;p>This AIPW-based balance analysis also serves a pedagogical purpose: it introduces the concept of &lt;strong>doubly robust&lt;/strong> estimation before we use it for treatment effect estimation in Section 8.&lt;/p>
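&lt;p>For intuition about what AIPW computes, the sketch below implements the estimator on a toy simulated dataset with a single binary covariate, using cell means as the outcome model and cell take-up rates as the treatment model (illustrative only &amp;mdash; not the &lt;code>teffects aipw&lt;/code> implementation):&lt;/p>

```python
import random
random.seed(1)

TRUE_EFFECT = 0.12
n = 20000

# Simulate one binary covariate, a mildly imbalanced treatment, and an outcome
X = [int(random.random() + 0.5) for _ in range(n)]
T = [int(random.random() + (0.55 if X[i] else 0.45)) for i in range(n)]
Y = [10 + TRUE_EFFECT * T[i] + 0.15 * X[i] + random.gauss(0, 0.1) for i in range(n)]

# Outcome model: cell means mu[t, x]
def cell_mean(t, x):
    vals = [Y[i] for i in range(n) if T[i] == t and X[i] == x]
    return sum(vals) / len(vals)

mu = {(t, x): cell_mean(t, x) for t in (0, 1) for x in (0, 1)}

# Treatment model: take-up rate within each covariate cell
p = {x: sum(T[i] for i in range(n) if X[i] == x) /
        sum(1 for i in range(n) if X[i] == x) for x in (0, 1)}

# AIPW: RA prediction plus an inverse-probability-weighted residual correction
aipw = sum(mu[1, X[i]] - mu[0, X[i]]
           + T[i] * (Y[i] - mu[1, X[i]]) / p[X[i]]
           - (1 - T[i]) * (Y[i] - mu[0, X[i]]) / (1 - p[X[i]])
           for i in range(n)) / n
print(round(aipw, 3))  # recovers an effect close to the true 0.12
```

&lt;p>The defining property &amp;mdash; not demonstrated here &amp;mdash; is that the estimate stays consistent if either the outcome model or the treatment model is misspecified, as long as the other is correct.&lt;/p>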
&lt;hr>
&lt;h2 id="6-what-are-we-estimating-ate-vs-att">6. What are we estimating? ATE vs. ATT&lt;/h2>
&lt;p>Before diving into estimation, we need to be precise about &lt;strong>what&lt;/strong> we are trying to estimate. There are two fundamental causal quantities in program evaluation.&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect (ATE)&lt;/strong> answers the policymaker&amp;rsquo;s question: &lt;em>&amp;ldquo;What would happen if we scaled this program to the entire population?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATE = E[Y(1) - Y(0)]$$&lt;/p>
&lt;p>where $Y(1)$ is the potential outcome under treatment and $Y(0)$ is the potential outcome under control, averaged over the &lt;strong>entire population&lt;/strong> (both treated and untreated).&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> answers the evaluator&amp;rsquo;s question: &lt;em>&amp;ldquo;Did the program benefit those who were assigned to it?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>$$ATT = E[Y(1) - Y(0) \mid T = 1]$$&lt;/p>
&lt;p>This averages the treatment effect only over the &lt;strong>treated group&lt;/strong> &amp;mdash; the households that were assigned to receive the cash transfer.&lt;/p>
&lt;p>In a well-designed RCT with &lt;strong>homogeneous treatment effects&lt;/strong> (the program affects everyone equally), ATE and ATT are the same. But when treatment effects are &lt;strong>heterogeneous&lt;/strong> (the program benefits some households more than others), they can differ. For example, if poorer households benefit more from cash transfers and the treatment group has a higher share of poor households, the ATT could be larger than the ATE.&lt;/p>
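&lt;p>A tiny numerical example (with hypothetical effect sizes chosen only to illustrate heterogeneity) makes the distinction concrete:&lt;/p>

```python
# Hypothetical heterogeneous effects: poorer households benefit more
effect_poor, effect_nonpoor = 0.20, 0.08

# Suppose the treated group happens to contain a higher share of poor households
share_poor_population = 0.30
share_poor_treated = 0.40

# ATE averages over everyone; ATT averages over the treated only
ate = share_poor_population * effect_poor + (1 - share_poor_population) * effect_nonpoor
att = share_poor_treated * effect_poor + (1 - share_poor_treated) * effect_nonpoor
print(round(ate, 3), round(att, 3))  # 0.116 0.128: the ATT exceeds the ATE
```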
&lt;p>Understanding this distinction is critical because different methods target different estimands. Cross-sectional methods (RA, IPW, DR) can estimate &lt;strong>either&lt;/strong> ATE or ATT. Difference-in-differences inherently estimates the &lt;strong>ATT only&lt;/strong>. We will return to this point in Section 9.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note on RCTs&lt;/strong> &amp;mdash; In a randomized experiment, treatment assignment is independent of potential outcomes. This means that simple comparisons between treatment and control groups are already unbiased estimates of the ATE. When we add covariates (regression adjustment, IPW, doubly robust), we are not removing bias &amp;mdash; we are &lt;strong>improving precision&lt;/strong> by accounting for residual variation. This is different from observational studies, where covariate adjustment is needed to address confounding.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="7-three-strategies-for-causal-estimation">7. Three strategies for causal estimation&lt;/h2>
&lt;p>We now understand &lt;em>what&lt;/em> we want to estimate (ATE and ATT from Section 6). The question becomes &lt;em>how&lt;/em> to estimate it. Three families of methods exist, each taking a fundamentally different approach to solving the missing-data problem at the heart of causal inference. Each method models a different part of the data-generating process, and understanding these differences is essential for interpreting results and choosing the right tool.&lt;/p>
&lt;h3 id="71-regression-adjustment-ra-----modeling-the-outcome">7.1 Regression Adjustment (RA) &amp;mdash; modeling the outcome&lt;/h3>
&lt;p>Regression adjustment solves the missing-data problem by &lt;strong>predicting the unobserved potential outcomes&lt;/strong>. It fits separate regression models for treated and untreated groups. For each household, it uses these models to predict two potential outcomes: what consumption would be if treated, $\hat{\mu}_1(X_i)$, and what consumption would be if untreated, $\hat{\mu}_0(X_i)$. Since we only observe one of these for each household, the model fills in the missing counterfactual. The treatment effect for each household is the difference between the two predictions, and the ATE is the average across all households.&lt;/p>
&lt;p>The Stata documentation describes this succinctly: &lt;em>&amp;ldquo;RA estimators use means of predicted outcomes for each treatment level to estimate each POM. ATEs and ATETs are differences in estimated POMs.&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; predicting exam scores.&lt;/strong> Imagine two study methods (A and B) being tested on students. You observe each student using only one method. RA fits a model predicting test scores based on student characteristics (prior GPA, hours studied) separately for method-A and method-B users. Then, for &lt;em>every&lt;/em> student, it predicts what their score would have been under &lt;em>both&lt;/em> methods &amp;mdash; even the one they did not use. The average difference in predicted scores is the treatment effect.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Each household observed&amp;lt;br/&amp;gt;under ONE treatment only&amp;quot;]
M0[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using control group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
M1[&amp;quot;&amp;lt;b&amp;gt;Fit outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;using treated group&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Y = f(age, edu, female, poverty)&amp;lt;/i&amp;gt;&amp;quot;]
P0[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₀&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
P1[&amp;quot;Predict &amp;lt;b&amp;gt;Ŷ₁&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;for ALL households&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Average of&amp;lt;br/&amp;gt;(Ŷ₁ − Ŷ₀)&amp;quot;]
DATA --&amp;gt; M0
DATA --&amp;gt; M1
M0 --&amp;gt; P0
M1 --&amp;gt; P1
P0 --&amp;gt; ATE
P1 --&amp;gt; ATE
style DATA fill:#141413,stroke:#6a9bcc,color:#fff
style M0 fill:#6a9bcc,stroke:#141413,color:#fff
style M1 fill:#6a9bcc,stroke:#141413,color:#fff
style P0 fill:#6a9bcc,stroke:#141413,color:#fff
style P1 fill:#6a9bcc,stroke:#141413,color:#fff
style ATE fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The RA estimator.&lt;/strong> Formally, the ATE under regression adjustment is:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $\hat{\mu}_1(X)$ is the predicted outcome under treatment (fitted from treated observations) and $\hat{\mu}_0(X)$ is the predicted outcome under control (fitted from untreated observations), both evaluated at each household&amp;rsquo;s covariates $X_i$. In plain language: for each household, the model predicts what their consumption would be if they received the cash transfer and what it would be if they did not. The difference is the household&amp;rsquo;s estimated treatment effect. Averaging these across all $N$ households gives the ATE.&lt;/p>
&lt;p>For the ATT, we restrict the average to treated units only:&lt;/p>
&lt;p>$$\hat{\tau}_{RA}^{ATT} = \frac{1}{N_1} \sum_{i: T_i = 1} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]$$&lt;/p>
&lt;p>where $N_1$ is the number of treated households.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Consider Household A: a 40-year-old female in poverty with 10 years of education. The treated outcome model predicts her consumption at 10.17 log points. The untreated outcome model predicts 10.05. Her estimated individual treatment effect is $10.17 - 10.05 = 0.12$. Averaging such predictions over all 2,000 endline households gives the ATE.&lt;/p>
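&lt;p>The two-model logic can be sketched in a few lines on simulated data (illustrative only &amp;mdash; &lt;code>teffects ra&lt;/code> additionally provides proper standard errors):&lt;/p>

```python
import random
random.seed(0)

TRUE_EFFECT = 0.12
n = 4000

# Simulate endline data: education shifts consumption, treatment adds 0.12
edu = [random.uniform(6, 18) for _ in range(n)]
T = [int(random.random() + 0.5) for _ in range(n)]
y = [10 + TRUE_EFFECT * T[i] + 0.02 * edu[i] + random.gauss(0, 0.1) for i in range(n)]

# One-covariate OLS, returning (intercept, slope)
def fit_ols(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sum((xi - mx) ** 2 for xi in xs)
    return my - b * mx, b

# Fit separate outcome models on the treated and the control subsamples
a1, b1 = fit_ols([edu[i] for i in range(n) if T[i]], [y[i] for i in range(n) if T[i]])
a0, b0 = fit_ols([edu[i] for i in range(n) if not T[i]], [y[i] for i in range(n) if not T[i]])

# Predict BOTH potential outcomes for every household; average the difference
ate_ra = sum((a1 + b1 * e) - (a0 + b0 * e) for e in edu) / n
print(round(ate_ra, 3))  # close to the true effect of 0.12
```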
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ra&lt;/code> command fits linear outcome models by default. The first parenthesis specifies the outcome model (outcome variable + covariates), and the second specifies the treatment variable: &lt;code>teffects ra (y c.age c.edu i.female i.poverty) (treat), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; model misspecification.&lt;/strong> RA&amp;rsquo;s Achilles&amp;rsquo; heel is that it relies entirely on the outcome model being correctly specified. If consumption depends on age nonlinearly (for example, a U-shaped relationship) but we assume a linear model, the predictions $\hat{\mu}_1$ and $\hat{\mu}_0$ will be systematically wrong, biasing the ATE. RA gives the right answer &lt;strong>only if the outcome model is correct&lt;/strong>; if it is wrong, the estimate remains biased even with infinite data. As the Stata manual notes, &amp;ldquo;relying on a correctly specified outcome model with little data is extremely risky.&amp;rdquo;&lt;/p>
&lt;p>What if we are unsure about the functional form of the outcome model? Is there an approach that avoids modeling the outcome entirely?&lt;/p>
&lt;h3 id="72-inverse-probability-weighting-ipw-----modeling-the-treatment-assignment">7.2 Inverse Probability Weighting (IPW) &amp;mdash; modeling the treatment assignment&lt;/h3>
&lt;p>IPW takes the opposite approach. Instead of modeling consumption, it models the probability of being assigned to treatment &amp;mdash; the &lt;strong>propensity score&lt;/strong>, defined as $p(X) = \Pr(T = 1 \mid X)$. It then reweights observations so that the treatment and control groups become comparable. The Stata documentation explains: &lt;em>&amp;ldquo;IPW estimators use weighted averages of the observed outcome variable to estimate means of the potential outcomes. The weights account for the missing data inherent in the potential-outcome framework.&amp;quot;&lt;/em>&lt;/p>
&lt;p>The logic is elegant: in a perfectly randomized experiment, every household has the same 50% chance of treatment, and a simple comparison of means is unbiased. When chance imbalances arise (like our 9.3% gender SMD), the estimated propensity scores deviate slightly from 0.50. IPW corrects for these imbalances by making the reweighted sample look as if randomization had been perfect &amp;mdash; without ever modeling the outcome.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; opinion polling.&lt;/strong> Election pollsters know their survey overrepresents some demographics. If 60% of respondents are college graduates but only 35% of voters are, pollsters give lower weight to each college graduate&amp;rsquo;s response and higher weight to non-graduates. IPW does the same thing for treatment groups &amp;mdash; it reweights households so the treated and control groups have the same covariate distribution.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Treatment and control groups&amp;lt;br/&amp;gt;may have imbalances&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Estimate propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;p(X) = Pr(T=1 | X)&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;via logistic regression&amp;lt;/i&amp;gt;&amp;quot;]
WT[&amp;quot;&amp;lt;b&amp;gt;Compute weights&amp;lt;/b&amp;gt;&amp;quot;]
WTR[&amp;quot;Treated: weight = 1/p(X)&amp;quot;]
WCT[&amp;quot;Control: weight = 1/(1−p(X))&amp;quot;]
ATE[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt; = Weighted mean(treated)&amp;lt;br/&amp;gt;− Weighted mean(control)&amp;quot;]
DATA --&amp;gt; PS
PS --&amp;gt; WT
WT --&amp;gt; WTR
WT --&amp;gt; WCT
WTR --&amp;gt; ATE
WCT --&amp;gt; ATE
style DATA fill:#141413,stroke:#d97757,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style WT fill:#d97757,stroke:#141413,color:#fff
style WTR fill:#d97757,stroke:#141413,color:#fff
style WCT fill:#d97757,stroke:#141413,color:#fff
style ATE fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The propensity score.&lt;/strong> The propensity score is estimated via logistic regression:&lt;/p>
&lt;p>$$\hat{p}(X_i) = \Pr(T_i = 1 \mid X_i) = \text{logit}^{-1}(\hat{\alpha} + \hat{\beta}' X_i)$$&lt;/p>
&lt;p>In plain language: we fit a logistic model predicting whether each household was assigned to treatment, based on their covariates (age, education, gender, poverty status). The predicted probability is their propensity score.&lt;/p>
&lt;p>&lt;strong>The IPW estimator.&lt;/strong> The ATE under IPW is:&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i \cdot Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>Each treated household&amp;rsquo;s outcome is divided by its probability of being treated &amp;mdash; this upweights treated households that &amp;ldquo;look like&amp;rdquo; control households (the Stata manual calls this placing &amp;ldquo;a larger weight on those observations for which $y_{1i}$ is observed even though its observation was not likely&amp;rdquo;). Each control household&amp;rsquo;s outcome is divided by its probability of being in the control group. The reweighting creates a pseudo-population where treatment assignment is independent of covariates.&lt;/p>
&lt;p>For the ATT, only the control group needs reweighting (because the treated group is already the reference population):&lt;/p>
&lt;p>$$\hat{\tau}_{IPW}^{ATT} = \frac{1}{N_1} \sum_{i=1}^{N} \left[ T_i \cdot Y_i - \frac{(1 - T_i) \cdot \hat{p}(X_i) \cdot Y_i}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> In our RCT, a female household in poverty might have $\hat{p}(X) = 0.52$ (slightly more likely to be treated due to the gender imbalance). If treated, her weight is $1/0.52 = 1.92$. If in the control group, her weight is $1/(1 - 0.52) = 2.08$. A male non-poor household might have $\hat{p}(X) = 0.49$, giving weights close to 2.0 in either group. These mild adjustments rebalance the groups to remove the chance gender imbalance.&lt;/p>
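&lt;p>The weight arithmetic above, and the way IPW collapses to a simple difference in means when every propensity score equals 0.5, can be verified with a short Python sketch (toy numbers, not fitted from our data):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Weights implied by the two hypothetical propensity scores in the text
p_fp, p_mn = 0.52, 0.49
w_t_52 = 1 / p_fp        # treated weight at p = 0.52, about 1.92
w_c_52 = 1 / (1 - p_fp)  # control weight at p = 0.52, about 2.08
w_t_49 = 1 / p_mn        # treated weight at p = 0.49, about 2.04
w_c_49 = 1 / (1 - p_mn)  # control weight at p = 0.49, about 1.96

# The IPW ATE formula on a toy sample; with p(X) = 0.5 for everyone
# (a perfect RCT), it reduces to the simple difference in means.
y = np.array([10.17, 10.16, 10.05, 10.04])
t = np.array([1, 1, 0, 0])
p = np.full(4, 0.5)
ate = np.mean(t * y / p - (1 - t) * y / (1 - p))  # 0.12, the raw mean difference
&lt;/code>&lt;/pre>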
&lt;p>&lt;strong>Why IPW matters even in RCTs.&lt;/strong> In a perfect RCT, the true propensity score is exactly 0.50 for everyone, and IPW does nothing. But finite samples produce chance imbalances. IPW uses the estimated propensity scores (which deviate slightly from 0.50) to correct for these imbalances without making any assumptions about how covariates affect the outcome.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> The &lt;code>teffects ipw&lt;/code> command fits a logistic treatment model by default. Note that the first parenthesis specifies only the outcome variable (no covariates &amp;mdash; IPW does not model the outcome), and the second specifies the treatment model: &lt;code>teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong &amp;mdash; extreme weights.&lt;/strong> IPW&amp;rsquo;s vulnerability is extreme propensity scores. If $\hat{p}(X) = 0.01$ for some household, the weight becomes $1/0.01 = 100$ &amp;mdash; that single household dominates the ATE estimate, causing high variance and instability. The Stata manual warns: &lt;em>&amp;ldquo;When propensity scores are extreme (near 0 or 1), the inverse weights become very large, producing unstable estimates.&amp;quot;&lt;/em> This happens when the treatment and control groups have poor &lt;strong>overlap&lt;/strong> &amp;mdash; some covariate combinations appear only in one group. In our well-designed RCT, all propensity scores are between 0.43 and 0.55 (we verified this in Section 5.4), so extreme weights are not a concern.&lt;/p>
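&lt;p>A tiny numerical sketch (hypothetical values) shows how a single extreme propensity score destabilizes a weighted mean:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Three well-behaved units plus one with p(X) = 0.01
y = np.array([10.1, 10.2, 10.0, 12.0])
p = np.array([0.50, 0.50, 0.50, 0.01])
w = 1 / p  # weights: 2, 2, 2, 100

weighted_mean = np.sum(w * y) / np.sum(w)  # pulled far toward the w = 100 unit
plain_mean = np.mean(y)                    # 10.575
&lt;/code>&lt;/pre>
&lt;p>The unit with weight 100 contributes more to the weighted mean than the other three combined, which is exactly the instability the manual warns about.&lt;/p>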
&lt;p>RA works well if the outcome model is correct but can be biased if it is wrong. IPW works well if the propensity score model is correct but can be unstable if it is wrong. Is there a method that protects us against both types of misspecification?&lt;/p>
&lt;h3 id="73-doubly-robust-dr-----modeling-both">7.3 Doubly Robust (DR) &amp;mdash; modeling both&lt;/h3>
&lt;p>Doubly robust methods combine RA and IPW into a single estimator. They fit an outcome model &lt;strong>and&lt;/strong> estimate a propensity score. The key property &amp;mdash; the reason they are called &amp;ldquo;doubly robust&amp;rdquo; &amp;mdash; is that the estimator is consistent (converges to the true treatment effect with enough data) if &lt;strong>either&lt;/strong> the outcome model &lt;strong>or&lt;/strong> the propensity score model is correctly specified. You do not need both to be right &amp;mdash; just one.&lt;/p>
&lt;p>The Stata manual describes this property: &lt;em>&amp;ldquo;AIPW estimators model both the outcome and the treatment probability. A surprising fact is that only one of the two models must be correctly specified to consistently estimate the treatment effects.&amp;quot;&lt;/em>&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; backup power.&lt;/strong> Think of a house with two independent power sources: the electrical grid (the outcome model) and a solar panel system (the propensity score model). If the grid goes down (outcome model is misspecified), solar power keeps the lights on. If clouds block the solar panels (propensity score model is wrong), the grid still works. As long as at least one power source is functioning, the house stays lit. That is doubly robust estimation &amp;mdash; as long as at least one model is correct, the estimator gives the right answer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
DATA[&amp;quot;&amp;lt;b&amp;gt;Observed Data&amp;lt;/b&amp;gt;&amp;quot;]
RA_C[&amp;quot;&amp;lt;b&amp;gt;RA component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict Ŷ₁ and Ŷ₀&amp;lt;br/&amp;gt;for each household&amp;quot;]
IPW_C[&amp;quot;&amp;lt;b&amp;gt;IPW component&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate propensity&amp;lt;br/&amp;gt;score p(X)&amp;quot;]
RESID[&amp;quot;&amp;lt;b&amp;gt;Prediction errors&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Y − Ŷ for each&amp;lt;br/&amp;gt;household&amp;quot;]
CORRECT[&amp;quot;&amp;lt;b&amp;gt;Bias-correction term&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;IPW-weighted residuals&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;DR estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;= RA prediction&amp;lt;br/&amp;gt;+ Bias correction&amp;quot;]
DATA --&amp;gt; RA_C
DATA --&amp;gt; IPW_C
RA_C --&amp;gt; RESID
IPW_C --&amp;gt; CORRECT
RESID --&amp;gt; CORRECT
RA_C --&amp;gt; DR
CORRECT --&amp;gt; DR
style DATA fill:#141413,stroke:#00d4c8,color:#fff
style RA_C fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_C fill:#d97757,stroke:#141413,color:#fff
style RESID fill:#6a9bcc,stroke:#141413,color:#fff
style CORRECT fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>The AIPW estimator.&lt;/strong> The most common doubly robust form is Augmented Inverse Probability Weighting (AIPW):&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)} \right]$$&lt;/p>
&lt;p>This equation has two clearly interpretable components:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RA component&lt;/strong> (first two terms): $\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$ &amp;mdash; the regression adjustment prediction, exactly as in Section 7.1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bias-correction component&lt;/strong> (last two terms): IPW-weighted residuals $(Y_i - \hat{\mu})$ &amp;mdash; the difference between actual and predicted outcomes, weighted by inverse propensity scores&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In plain language: start with the RA prediction of each household&amp;rsquo;s treatment effect. Then ask: how far off was that prediction from reality? Weight those prediction errors by the propensity score. If RA was already right, the errors average to zero and you just get RA. If RA was wrong but IPW is right, the weighted errors exactly cancel the RA bias.&lt;/p>
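&lt;p>A minimal Python sketch of the AIPW formula (all inputs are hypothetical toy values; in practice $\hat{\mu}_1$, $\hat{\mu}_0$, and $\hat{p}$ come from fitted outcome and treatment models):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Toy inputs: observed y, treatment t, propensity p, and
# outcome-model predictions mu1, mu0 for every unit
y   = np.array([10.18, 10.15, 10.06, 10.03])
t   = np.array([1, 1, 0, 0])
p   = np.array([0.52, 0.49, 0.52, 0.49])
mu1 = np.array([10.17, 10.16, 10.17, 10.16])
mu0 = np.array([10.05, 10.04, 10.05, 10.04])

# AIPW = RA prediction + IPW-weighted prediction errors
ra_part  = mu1 - mu0                      # 0.12 for every unit
correct1 = t * (y - mu1) / p              # treated residuals, upweighted
correct0 = (1 - t) * (y - mu0) / (1 - p)  # control residuals, upweighted
ate = np.mean(ra_part + correct1 - correct0)  # stays close to 0.12
&lt;/code>&lt;/pre>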
&lt;p>&lt;strong>Why the magic works &amp;mdash; four scenarios.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Outcome model correct, propensity model wrong:&lt;/strong> The residuals $(Y_i - \hat{\mu})$ are zero on average, so the correction terms vanish. DR reduces to RA. Correct answer.&lt;/li>
&lt;li>&lt;strong>Propensity model correct, outcome model wrong:&lt;/strong> The IPW reweighting is valid, so the correction terms fix the RA bias. Correct answer.&lt;/li>
&lt;li>&lt;strong>Both models correct:&lt;/strong> Both components work together, producing the most efficient estimate.&lt;/li>
&lt;li>&lt;strong>Both models wrong:&lt;/strong> Neither safety net catches the error. The estimate can be biased. DR provides insurance, not invincibility.&lt;/li>
&lt;/ol>
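&lt;p>Scenario 1 can be checked numerically: if the outcome model is exactly right, every residual is zero, the correction terms vanish, and AIPW collapses to RA no matter how wrong the propensity scores are (toy values below):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

y   = np.array([10.17, 10.16, 10.05, 10.04])
t   = np.array([1, 1, 0, 0])
mu1 = np.array([10.17, 10.16, 10.17, 10.16])  # matches treated outcomes exactly
mu0 = np.array([10.05, 10.04, 10.05, 10.04])  # matches control outcomes exactly
p_wrong = np.array([0.9, 0.1, 0.8, 0.2])      # badly misspecified propensities

aipw = np.mean(mu1 - mu0
               + t * (y - mu1) / p_wrong
               - (1 - t) * (y - mu0) / (1 - p_wrong))
ra = np.mean(mu1 - mu0)  # aipw equals ra: the correction terms are all zero
&lt;/code>&lt;/pre>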
&lt;p>&lt;strong>AIPW vs. IPWRA in Stata.&lt;/strong> Stata offers two doubly robust commands. &lt;code>teffects aipw&lt;/code> augments the IPW estimator with an outcome-model correction (the equation above). &lt;code>teffects ipwra&lt;/code> applies propensity score weights to the regression adjustment &amp;mdash; arriving at the same property from the other direction. Both are doubly robust and produce nearly identical results in practice.&lt;/p>
&lt;p>&lt;strong>Stata implementation.&lt;/strong> Both commands require specifying the outcome model in the first parenthesis and the treatment model in the second: &lt;code>teffects ipwra (y c.age c.edu i.female i.poverty) (treat c.age c.edu i.female i.poverty), vce(robust)&lt;/code>.&lt;/p>
&lt;p>&lt;strong>What can go wrong.&lt;/strong> DR fails only when &lt;strong>both&lt;/strong> models are wrong. This is much less likely than either single model being wrong &amp;mdash; getting at least one model approximately right is much easier than getting both perfectly right. However, the Stata manual notes: &lt;em>&amp;ldquo;When both the outcome and the treatment model are misspecified, which estimator is more robust is a matter of debate.&amp;quot;&lt;/em> Using flexible specifications (polynomials, interactions) reduces the risk of both models failing simultaneously.&lt;/p>
&lt;h3 id="comparison-of-the-three-approaches">Comparison of the three approaches&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>RA&lt;/th>
&lt;th>IPW&lt;/th>
&lt;th>DR (AIPW/IPWRA)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Models the outcome?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Models the treatment?&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Key equation&lt;/td>
&lt;td>$\hat{\mu}_1(X) - \hat{\mu}_0(X)$&lt;/td>
&lt;td>$T \cdot Y / \hat{p}(X)$&lt;/td>
&lt;td>RA + IPW-weighted residuals&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if outcome model correct?&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Consistent if treatment model correct?&lt;/td>
&lt;td>&amp;mdash;&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Main vulnerability&lt;/td>
&lt;td>Outcome misspecification&lt;/td>
&lt;td>Extreme weights&lt;/td>
&lt;td>Both models wrong&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Stata command&lt;/td>
&lt;td>&lt;code>teffects ra&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipw&lt;/code>&lt;/td>
&lt;td>&lt;code>teffects ipwra&lt;/code> / &lt;code>teffects aipw&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-mermaid">graph LR
RA[&amp;quot;&amp;lt;b&amp;gt;Regression Adjustment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the outcome&amp;quot;]
IPW[&amp;quot;&amp;lt;b&amp;gt;Inverse Probability&amp;lt;br/&amp;gt;Weighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models the treatment&amp;quot;]
DR[&amp;quot;&amp;lt;b&amp;gt;Doubly Robust&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Models both&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Consistent if either&amp;lt;br/&amp;gt;model is correct&amp;lt;/i&amp;gt;&amp;quot;]
RA --&amp;gt; DR
IPW --&amp;gt; DR
style RA fill:#6a9bcc,stroke:#141413,color:#fff
style IPW fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimator combines the strengths of both RA and IPW. It is the &lt;strong>standard recommendation in modern causal inference&lt;/strong> because it provides an extra layer of protection against model misspecification. Now that we understand what each method does, what it assumes, and what can go wrong, let us apply all three to our cash transfer data and compare their results.&lt;/p>
&lt;hr>
&lt;h2 id="8-cross-sectional-estimation-at-endline-----ra-ipw-and-dr">8. Cross-sectional estimation at endline &amp;mdash; RA, IPW, and DR&lt;/h2>
&lt;p>We now estimate treatment effects using only endline data. For each method, we compute both the &lt;strong>ATE&lt;/strong> (the policymaker&amp;rsquo;s quantity) and the &lt;strong>ATT&lt;/strong> (the evaluator&amp;rsquo;s quantity).&lt;/p>
&lt;h3 id="81-simple-difference-in-means">8.1 Simple difference in means&lt;/h3>
&lt;p>The simplest approach is to compare mean outcomes between treated and control groups at endline.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
reg y treat, robust
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression Number of obs = 2,000
F(1, 1998) = 35.43
Prob &amp;gt; F = 0.0000
R-squared = 0.0174
Root MSE = .43449
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
treat | .1157465 .0194443 5.95 0.000 .0776132 .1538798
_cons | 10.05374 .014001 718.07 0.000 10.02628 10.0812
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The simple difference in means yields an estimate of 0.116 (SE = 0.019, p &amp;lt; 0.001, 95% CI [0.078, 0.154]). Because the outcome is in logs, this means being offered the cash transfer increased household consumption by approximately 11.6%. This estimate is close to the true effect of 12% and is our benchmark for comparison. However, it does not adjust for the gender imbalance we discovered at baseline.&lt;/p>
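&lt;p>A note on interpretation: for a log outcome, the coefficient is a change in log points, and the exact implied percentage change is $e^{\beta} - 1$. For small effects the two are close, which is why 0.116 log points reads as approximately 11.6%:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

b = 0.1157465                # coefficient on treat from the regression above
exact_pct = math.exp(b) - 1  # exact implied change: about 0.123, i.e. 12.3%
&lt;/code>&lt;/pre>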
&lt;h3 id="82-regression-adjustment-----ate-and-att">8.2 Regression Adjustment &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>Regression adjustment models the outcome as a function of treatment and covariates, then computes predicted outcomes under treatment and control for each observation.&lt;/p>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect (ATE)
teffects ra (y c.age c.edu i.female i.poverty) (treat), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1125431 .0190927 5.89 0.000 .0751221 .1499641
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05503 .0138703 724.93 0.000 10.02785 10.08222
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* RA: Average Treatment Effect on the Treated (ATT)
teffects ra (y c.age c.edu i.female i.poverty) (treat), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : regression adjustment
Outcome model : linear
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1132537 .0191498 5.91 0.000 .0757208 .1507865
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05623 .0140082 717.88 0.000 10.02878 10.08369
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The RA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). The ATE and ATT are nearly identical, which confirms that treatment effects are approximately &lt;strong>homogeneous&lt;/strong> across households. The RA approach models the outcome with covariates (age, education, gender, poverty), which adjusts for the baseline gender imbalance and can improve precision.&lt;/p>
&lt;h3 id="83-inverse-probability-weighting-----ate-and-att">8.3 Inverse Probability Weighting &amp;mdash; ATE and ATT&lt;/h3>
&lt;p>IPW reweights observations based on their estimated probability of treatment, without modeling the outcome.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect (ATE)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), ate
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126713 .0190886 5.90 0.000 .0752583 .1500844
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05495 .0138651 725.20 0.000 10.02778 10.08213
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPW: Average Treatment Effect on the Treated (ATT)
teffects ipw (y) (treat c.age c.edu i.female i.poverty), atet
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1134031 .0191397 5.93 0.000 .0758899 .1509162
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05608 .0140004 718.27 0.000 10.02864 10.08352
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The IPW estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA results, which is expected in a well-designed RCT where propensity scores are near 0.50 for all households. Notice that IPW does &lt;strong>not&lt;/strong> model the outcome &amp;mdash; it only models the treatment assignment process using the propensity score. The close agreement between RA and IPW gives us confidence that both the outcome model and the treatment model are approximately correct.&lt;/p>
&lt;h3 id="84-doubly-robust-----ate-and-att-ipwra">8.4 Doubly Robust &amp;mdash; ATE and ATT (IPWRA)&lt;/h3>
&lt;p>The doubly robust IPWRA estimator combines outcome modeling and propensity score weighting.&lt;/p>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect (ATE)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .112639 .0190901 5.90 0.000 .0752231 .1500549
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .0138677 725.07 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* IPWRA: Average Treatment Effect on the Treated (ATT)
teffects ipwra (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty), atet vce(robust)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat |
(1 vs 0) | .1133162 .0191469 5.92 0.000 .0757889 .1508435
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.05617 .0140019 718.20 0.000 10.02873 10.08361
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust IPWRA estimates are ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) and ATT = 0.113 (SE = 0.019, 95% CI [0.076, 0.151]). These are very close to the RA and IPW estimates, confirming that all three approaches converge in this well-designed RCT. The DR method provides the most reliable cross-sectional estimate because it is protected against misspecification of either the outcome or treatment model.&lt;/p>
&lt;h3 id="85-doubly-robust-----aipw-alternative">8.5 Doubly Robust &amp;mdash; AIPW alternative&lt;/h3>
&lt;p>As a robustness check, we can also compute the doubly robust estimate using the AIPW formulation instead of IPWRA.&lt;/p>
&lt;pre>&lt;code class="language-stata">* AIPW: Average Treatment Effect (ATE)
teffects aipw (y c.age c.edu i.female i.poverty) ///
(treat c.age c.edu i.female i.poverty)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : augmented IPW
Outcome model : linear by ML
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
treat |
(1 vs 0) | .1126412 .0190903 5.90 0.000 .075225 .1500574
─────────────+────────────────────────────────────────────────────────────────
POmean |
treat |
0 | 10.055 .013868 725.05 0.000 10.02782 10.08218
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The AIPW estimate of ATE = 0.113 (SE = 0.019, 95% CI [0.075, 0.150]) is virtually identical to the IPWRA result (0.113). Both are doubly robust &amp;mdash; the difference lies in the computational approach (AIPW augments the IPW estimator with a bias-correction term, while IPWRA applies IPW weights to the regression adjustment), but they share the same theoretical properties and, here, produce essentially the same estimate.&lt;/p>
&lt;h3 id="86-cross-sectional-comparison">8.6 Cross-sectional comparison&lt;/h3>
&lt;p>The table below summarizes all cross-sectional estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several patterns emerge from this comparison. First, &lt;strong>ATE and ATT are nearly identical&lt;/strong> for every method, confirming that treatment effects are homogeneous across households. Second, &lt;strong>RA, IPW, and DR all give remarkably similar results&lt;/strong> (all approximately 0.113) because, in this well-designed RCT, randomization ensures that both the outcome model and the propensity score model are approximately correct. Third, the simple difference in means (0.116) is slightly higher than the covariate-adjusted estimates (0.113), a small shift produced by the covariate adjustment, which corrects the chance gender imbalance and can also improve precision. Finally, all confidence intervals contain the true effect of 0.12 &amp;mdash; every method successfully recovers the correct answer.&lt;/p>
&lt;p>The real value of doubly robust methods becomes apparent in less ideal settings. When one model might be misspecified &amp;mdash; a common situation in practice &amp;mdash; DR methods provide insurance that RA or IPW alone cannot offer.&lt;/p>
&lt;hr>
&lt;h2 id="9-leveraging-panel-data-----difference-in-differences">9. Leveraging panel data &amp;mdash; Difference-in-Differences&lt;/h2>
&lt;p>All estimates in Section 8 used only endline data. But we have panel data &amp;mdash; the same 2,000 households observed before and after the intervention. Can we do better?&lt;/p>
&lt;h3 id="91-why-use-panel-data">9.1 Why use panel data?&lt;/h3>
&lt;p>Cross-sectional methods (RA, IPW, DR) compare treated and control groups at a single point in time &amp;mdash; the endline. They control for &lt;strong>observable&lt;/strong> covariates like age, education, and gender. But there may be &lt;strong>unobservable&lt;/strong> characteristics &amp;mdash; household motivation, geographic advantages, cultural factors &amp;mdash; that differ between groups and affect consumption. No amount of cross-sectional covariate adjustment can control for these, because we simply do not observe them.&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; comparing students across schools.&lt;/strong> Imagine comparing test scores between students at a charter school (treatment) and a traditional school (control). You can adjust for observable differences like family income and prior grades. But what about unmeasured factors &amp;mdash; parental involvement, neighborhood quality, student ambition? A cross-sectional comparison cannot disentangle the school effect from these hidden differences. Now suppose you observe the &lt;em>same students&lt;/em> before and after they switch schools. By comparing each student&amp;rsquo;s score change, you automatically cancel out all fixed student characteristics &amp;mdash; because they are the same at both time points. That is the power of panel data.&lt;/p>
&lt;p>Panel data methods like difference-in-differences (DiD) solve this problem by comparing each household &lt;strong>to itself&lt;/strong> over time. By looking at how each household&amp;rsquo;s consumption changed from baseline to endline, we effectively control for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> (household fixed effects). This is a powerful advantage that cross-sectional methods cannot replicate.&lt;/p>
&lt;h4 id="the-did-estimator">The DiD estimator&lt;/h4>
&lt;p>The DiD estimator computes a simple but powerful quantity &amp;mdash; a &amp;ldquo;difference of differences&amp;rdquo;:&lt;/p>
&lt;p>$$\hat{\tau}_{DiD} = \underbrace{(\bar{Y}_{treat,post} - \bar{Y}_{treat,pre})}_{\text{Change for treated}} - \underbrace{(\bar{Y}_{control,post} - \bar{Y}_{control,pre})}_{\text{Change for control}}$$&lt;/p>
&lt;p>The first difference ($\bar{Y}_{treat,post} - \bar{Y}_{treat,pre}$) captures the treatment group&amp;rsquo;s change over time &amp;mdash; the treatment effect &lt;strong>plus&lt;/strong> any common time trend (e.g., economic growth that affects all households). The second difference ($\bar{Y}_{control,post} - \bar{Y}_{control,pre}$) captures the control group&amp;rsquo;s change &amp;mdash; the common time trend &lt;strong>only&lt;/strong>, since they did not receive treatment. Subtracting the second from the first removes the time trend, isolating the treatment effect.&lt;/p>
&lt;p>&lt;strong>Mini example from our data.&lt;/strong> Suppose the treated group&amp;rsquo;s average log consumption went from 10.01 at baseline to 10.17 at endline (change = +0.16). The control group went from 10.03 to 10.06 (change = +0.03). The DiD estimate is $0.16 - 0.03 = 0.13$ &amp;mdash; close to the true effect of 0.12. The control group&amp;rsquo;s +0.03 change captures the natural time trend that would have affected everyone, and subtracting it isolates the treatment effect.&lt;/p>
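&lt;p>The mini example&amp;rsquo;s arithmetic can be written out as a quick check (Python here just for the calculation; the group means are the ones quoted above):&lt;/p>

```python
# "Difference of differences" from the mini example's group means
treat_pre, treat_post = 10.01, 10.17      # treated group's average log consumption
ctrl_pre, ctrl_post = 10.03, 10.06        # control group's average log consumption

change_treated = treat_post - treat_pre   # treatment effect + common time trend
change_control = ctrl_post - ctrl_pre     # common time trend only

did = change_treated - change_control     # trend cancels, effect remains
print(round(did, 2))                      # 0.13
```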
&lt;h4 id="the-parallel-trends-assumption">The parallel trends assumption&lt;/h4>
&lt;p>The key identifying assumption of DiD is the &lt;strong>parallel trends assumption (PTA)&lt;/strong>: absent the treatment, the treatment and control groups would have followed the same time trend. Formally:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Notation note&lt;/strong> &amp;mdash; In the DiD literature and in the Sant&amp;rsquo;Anna and Zhao (2020) paper, $D$ denotes treatment group assignment (equivalent to our &lt;code>treat&lt;/code> variable). This differs from our data dictionary where &lt;code>D&lt;/code> is the receipt indicator. In this section and Section 9.4, we follow the paper&amp;rsquo;s convention: $D = 1$ means assigned to treatment, $D = 0$ means assigned to control.&lt;/p>
&lt;/blockquote>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1] = E[Y_1(0) - Y_0(0) \mid D = 0]$$&lt;/p>
&lt;p>This says that the average change in &lt;em>untreated&lt;/em> potential outcomes is the same for the treated and control groups. Note that this does &lt;strong>not&lt;/strong> require the two groups to have the same &lt;em>level&lt;/em> of consumption &amp;mdash; only the same &lt;em>trend&lt;/em>. The treated group can start higher or lower, as long as their consumption would have evolved at the same rate as the control group in the absence of the program.&lt;/p>
&lt;p>In an RCT, the parallel trends assumption is very plausible because randomization ensures the groups were similar at baseline. Any pre-existing differences between groups occurred by chance and are unlikely to produce different time trends. This makes DiD a strong estimator in our setting.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
subgraph &amp;quot;Parallel Trends Assumption&amp;quot;
PRE[&amp;quot;&amp;lt;b&amp;gt;Baseline 2021&amp;lt;/b&amp;gt;&amp;quot;]
POST[&amp;quot;&amp;lt;b&amp;gt;Endline 2024&amp;lt;/b&amp;gt;&amp;quot;]
end
PRE --&amp;gt;|&amp;quot;Treated group&amp;lt;br/&amp;gt;change = effect + trend&amp;quot;| POST
PRE --&amp;gt;|&amp;quot;Control group&amp;lt;br/&amp;gt;change = trend only&amp;quot;| POST
style PRE fill:#6a9bcc,stroke:#141413,color:#fff
style POST fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="92-why-does-did-estimate-att-and-not-ate">9.2 Why does DiD estimate ATT and not ATE?&lt;/h3>
&lt;p>This is a point that many beginners miss, so it is worth explaining carefully.&lt;/p>
&lt;p>Recall from Section 6 that the ATT is $E[Y_1(1) - Y_1(0) \mid D = 1]$ &amp;mdash; the effect on those who were treated. Sant&amp;rsquo;Anna and Zhao (2020) make this explicit: the main challenge is computing $E[Y_1(0) \mid D = 1]$ &amp;mdash; what would the treated group&amp;rsquo;s consumption have been at endline &lt;em>without&lt;/em> the program?&lt;/p>
&lt;p>DiD solves this by using the control group&amp;rsquo;s time trend as a stand-in. Specifically, it constructs the counterfactual for the treated group as:&lt;/p>
&lt;p>$$\underbrace{E[Y_1(0) \mid D = 1]}_{\text{Counterfactual}} = \underbrace{E[Y_0 \mid D = 1]}_{\text{Treated at baseline}} + \underbrace{(E[Y_1 \mid D = 0] - E[Y_0 \mid D = 0])}_{\text{Control group&amp;rsquo;s time trend}}$$&lt;/p>
&lt;p>This counterfactual is &lt;strong>specific to the treated group&lt;/strong> &amp;mdash; it starts from their baseline level and adds the control group&amp;rsquo;s trend. DiD therefore estimates what happened to the treated group relative to this counterfactual. This is precisely the ATT.&lt;/p>
&lt;p>&lt;strong>Why not the ATE?&lt;/strong> To estimate the ATE, we would also need the treatment effect for the untreated &amp;mdash; what would happen if we gave the program to those who did not receive it. DiD does not provide this, because the counterfactual it constructs runs in only one direction (control trend applied to treated baseline, not treated trend applied to control baseline).&lt;/p>
&lt;p>&lt;strong>In our RCT context&lt;/strong>, since treatment was randomly assigned, ATE and ATT are likely very similar (as we saw in Section 8). But in observational studies with heterogeneous treatment effects, this distinction matters greatly. A job-training program might have a larger effect on those who voluntarily enrolled (ATT) than it would have on randomly selected workers (ATE).&lt;/p>
&lt;h3 id="93-basic-did-with-panel-fixed-effects">9.3 Basic DiD with panel fixed effects&lt;/h3>
&lt;p>We now implement the basic DiD estimator using Stata&amp;rsquo;s &lt;code>xtdidregress&lt;/code> command, which handles the panel structure and computes clustered standard errors.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
* Create the treatment-post interaction
gen treat_post = treat * post
label var treat_post &amp;quot;Treated x Post (1 only for treated in 2024)&amp;quot;
* Declare panel structure
xtset id year
* Basic DiD with individual fixed effects
xtdidregress (y) (treat_post), group(id) time(year) vce(cluster id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Number of obs = 4,000
Number of groups = 2,000
Outcome model : linear
Treatment model: none
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET |
treat_post | .1347161 .0272737 4.94 0.000 .0812282 .188204
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The basic DiD estimate of the ATT is 0.135 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.081, 0.188]). This is slightly higher than the cross-sectional estimates (0.113&amp;ndash;0.116) but still contains the true effect of 0.12 within its confidence interval. The wider standard error (0.027 vs. 0.019) reflects the additional variability introduced by differencing within households. Standard errors are clustered at the household level to account for serial correlation within panels.&lt;/p>
&lt;p>The key advantage of this DiD estimate is that it controls for all &lt;strong>time-invariant unobservable characteristics&lt;/strong> of each household. In an RCT, randomization already handles confounding, so the cross-sectional and panel estimates are similar. But in observational settings, DiD&amp;rsquo;s ability to absorb household fixed effects can correct biases that cross-sectional methods cannot.&lt;/p>
&lt;h3 id="94-from-cross-sectional-dr-to-panel-dr-----doubly-robust-did-drdid">9.4 From cross-sectional DR to panel DR &amp;mdash; Doubly Robust DiD (DRDID)&lt;/h3>
&lt;p>In Section 7, we saw that doubly robust methods combine outcome modeling and propensity score modeling for cross-sectional data. &lt;strong>DRDID extends this logic to the panel setting.&lt;/strong> It combines the DiD framework (using pre/post variation) with doubly robust covariate adjustment.&lt;/p>
&lt;p>This approach was introduced by Sant&amp;rsquo;Anna and Zhao (2020) in a landmark paper published in the &lt;em>Journal of Econometrics&lt;/em>. They proposed estimators that are &amp;ldquo;consistent if either (but not necessarily both) a propensity score or outcome regression working models are correctly specified&amp;rdquo; &amp;mdash; bringing the doubly robust property from the cross-sectional world into the DiD framework.&lt;/p>
&lt;h4 id="why-do-we-need-drdid">Why do we need DRDID?&lt;/h4>
&lt;p>Recall from Section 9.2 that basic DiD relies on the &lt;strong>parallel trends assumption&lt;/strong> &amp;mdash; absent treatment, the treated and control groups would have followed the same time trend. But what if parallel trends holds only &lt;strong>conditional on covariates&lt;/strong>? For example, what if consumption trends differ between poor and non-poor households, but within each poverty group the trends are parallel?&lt;/p>
&lt;p>In this case, we need a &lt;strong>conditional&lt;/strong> parallel trends assumption:&lt;/p>
&lt;p>$$E[Y_1(0) - Y_0(0) \mid D = 1, X] = E[Y_1(0) - Y_0(0) \mid D = 0, X]$$&lt;/p>
&lt;p>This says that the average change in untreated potential outcomes is the same for treated and control groups &lt;em>who share the same covariates&lt;/em> $X$. Note that this allows for covariate-specific time trends (e.g., different consumption growth rates for poor and non-poor households) while still identifying the ATT.&lt;/p>
&lt;p>Under this conditional parallel trends assumption, there are two ways to estimate the ATT:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Outcome regression (OR) approach&lt;/strong> &amp;mdash; model how the outcome evolves over time for the control group, and use that model to predict the counterfactual evolution for the treated group&lt;/li>
&lt;li>&lt;strong>IPW approach&lt;/strong> &amp;mdash; reweight the control group so its covariate distribution matches the treated group, then compute the standard DiD&lt;/li>
&lt;/ul>
&lt;p>The problem is the same as in the cross-sectional case: OR requires a correctly specified outcome model, and IPW requires a correctly specified propensity score model. Sant&amp;rsquo;Anna and Zhao&amp;rsquo;s insight was that &lt;strong>you can combine both into a single estimator that works if either model is correct&lt;/strong>.&lt;/p>
&lt;h4 id="the-drdid-estimator-for-panel-data">The DRDID estimator for panel data&lt;/h4>
&lt;p>When panel data are available (as in our case &amp;mdash; same households observed at baseline and endline), the DRDID estimator takes a particularly clean form. Let $\Delta Y_i = Y_{i,post} - Y_{i,pre}$ denote each household&amp;rsquo;s change in consumption. The DR DID estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{DR}^{DiD} = \frac{1}{N} \sum_{i=1}^{N} \left[ w_1(D_i) - w_0(D_i, X_i) \right] \left[ \Delta Y_i - \hat{\mu}_{0,\Delta}(X_i) \right]$$&lt;/p>
&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>$w_1(D_i) = D_i / \bar{D}$ assigns equal weight to each treated unit and zero weight to controls, where $\bar{D}$ is the fraction of units treated&lt;/li>
&lt;li>$w_0(D_i, X_i)$ reweights control units using the propensity score $\hat{p}(X)$, so they resemble the treated group&lt;/li>
&lt;li>$\hat{\mu}_{0,\Delta}(X_i) = \hat{\mu}_{0,post}(X_i) - \hat{\mu}_{0,pre}(X_i)$ is the predicted change in consumption for the control group, fitted from control-group data&lt;/li>
&lt;/ul>
&lt;p>In plain language: for each household, compute the change in consumption over time ($\Delta Y$) and subtract the model-predicted change for the control group ($\hat{\mu}_{0,\Delta}$). This residual captures the treatment effect plus any prediction error. Then reweight these residuals using IPW so that the control group matches the treated group&amp;rsquo;s covariate profile.&lt;/p>
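&lt;p>A minimal sketch of these DR-DiD mechanics on simulated data may make the moving parts concrete. This is pure Python (the tutorial itself uses Stata), and every name and number in it &amp;mdash; the covariate, the propensity score, the trend &amp;mdash; is an illustrative assumption, not the tutorial&amp;rsquo;s dataset. The outcome and propensity score models are plugged in at their true values to mimic correct specification:&lt;/p>

```python
# Hypothetical sketch of the panel DR-DiD estimator on made-up data.
import random

random.seed(42)
n, true_att = 4000, 0.12
rows = []
for _ in range(n):
    x = random.random()                       # one covariate in [0, 1]
    p = 0.3 + 0.4 * x                         # true propensity score Pr(D=1|x)
    d = 1 if random.random() < p else 0
    # Observed change: covariate-specific trend + effect for treated + noise
    dy = 0.5 * x + true_att * d + random.gauss(0, 0.1)
    rows.append((x, d, dy))

pscore = lambda x: 0.3 + 0.4 * x              # "correctly specified" PS model
mu0 = lambda x: 0.5 * x                       # control group's predicted change

d_bar = sum(d for _, d, _ in rows) / n
raw_w0 = [pscore(x) * (1 - d) / (1 - pscore(x)) for x, d, _ in rows]
w0_bar = sum(raw_w0) / n                      # normalizes the control weights

# Weighted average of the residuals dY - mu0(X), treated minus reweighted controls
att_hat = sum(
    (d / d_bar - w / w0_bar) * (dy - mu0(x))
    for (x, d, dy), w in zip(rows, raw_w0)
) / n
print(round(att_hat, 2))                      # recovers roughly 0.12
```

&lt;p>Because the control residuals average to zero under the correct outcome model, the reweighted control term vanishes and the treated term recovers the ATT &amp;mdash; exactly the first case of the doubly robust logic described below.&lt;/p>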
&lt;h4 id="why-is-this-doubly-robust">Why is this doubly robust?&lt;/h4>
&lt;p>The doubly robust property works through the same logic as in the cross-sectional case (Section 7.3), but applied to &lt;strong>changes&lt;/strong> rather than levels:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>If the outcome model is correct&lt;/strong> ($\hat{\mu}_{0,\Delta}(X) = E[\Delta Y \mid D=0, X]$), then the residuals $\Delta Y_i - \hat{\mu}_{0,\Delta}(X_i)$ average to zero for the control group, regardless of the propensity score weights. The estimator reduces to an outcome-regression DiD. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If the propensity score model is correct&lt;/strong> ($\hat{p}(X) = \Pr(D=1 \mid X)$), the IPW reweighting makes the control group comparable to the treated group, regardless of the outcome model. The correction term fixes any bias from a misspecified outcome model. Correct answer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are correct&lt;/strong>, the estimator achieves the &lt;strong>semiparametric efficiency bound&lt;/strong> &amp;mdash; it is the most precise estimator possible given the assumptions. Sant&amp;rsquo;Anna and Zhao proved this formally.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>If both are wrong&lt;/strong>, the estimator can be biased &amp;mdash; double robustness provides one layer of insurance, not two.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
DY[&amp;quot;&amp;lt;b&amp;gt;Panel data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY = Y_post − Y_pre&amp;lt;br/&amp;gt;for each household&amp;quot;]
OR[&amp;quot;&amp;lt;b&amp;gt;Outcome model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Predict control group's&amp;lt;br/&amp;gt;consumption change&amp;lt;br/&amp;gt;μ̂₀,Δ(X)&amp;quot;]
PS[&amp;quot;&amp;lt;b&amp;gt;Propensity score&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Estimate p(X)&amp;lt;br/&amp;gt;= Pr(D=1 | X)&amp;quot;]
RES[&amp;quot;&amp;lt;b&amp;gt;Residuals&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ΔY − μ̂₀,Δ(X)&amp;quot;]
IPW_W[&amp;quot;&amp;lt;b&amp;gt;IPW reweighting&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Make controls look&amp;lt;br/&amp;gt;like treated group&amp;quot;]
DRDID[&amp;quot;&amp;lt;b&amp;gt;DR-DiD estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;ATT = weighted average&amp;lt;br/&amp;gt;of residuals&amp;quot;]
DY --&amp;gt; RES
OR --&amp;gt; RES
PS --&amp;gt; IPW_W
RES --&amp;gt; DRDID
IPW_W --&amp;gt; DRDID
style DY fill:#141413,stroke:#00d4c8,color:#fff
style OR fill:#6a9bcc,stroke:#141413,color:#fff
style PS fill:#d97757,stroke:#141413,color:#fff
style RES fill:#6a9bcc,stroke:#141413,color:#fff
style IPW_W fill:#d97757,stroke:#141413,color:#fff
style DRDID fill:#00d4c8,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;h4 id="what-drdid-adds-over-basic-did-and-twfe">What DRDID adds over basic DiD and TWFE&lt;/h4>
&lt;p>Sant&amp;rsquo;Anna and Zhao (2020) also showed that the standard two-way fixed effects (TWFE) estimator &amp;mdash; the workhorse of applied economics &amp;mdash; can produce misleading results when treatment effects are heterogeneous across covariates. Specifically, the TWFE estimator implicitly assumes (i) that treatment effects are the same for all covariate values, and (ii) that there are no covariate-specific time trends. When these assumptions fail, &amp;ldquo;the estimand is, in general, different from the ATT, and policy evaluation based on it may be misleading.&amp;rdquo; DRDID avoids both of these pitfalls by allowing for flexible outcome models and covariate-specific trends.&lt;/p>
&lt;h4 id="stata-implementation">Stata implementation&lt;/h4>
&lt;p>The &lt;code>drdid&lt;/code> package (Rios-Avila, Sant&amp;rsquo;Anna, and Callaway) implements the estimators from the paper.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Install the drdid package (only needed once)
ssc install drdid, replace
* Doubly Robust DiD with DRIPW estimator
drdid y c.age c.edu i.female i.poverty, ivar(id) time(year) treatment(treat) dripw
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Doubly robust difference-in-differences estimator
Outcome model : least squares
Treatment model: inverse probability
──────────────────────────────────────────────────────────────────────────────
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATET | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The DRDID estimate of the ATT is 0.137 (SE = 0.027, p &amp;lt; 0.001, 95% CI [0.084, 0.191]). The &lt;code>dripw&lt;/code> option specifies the Doubly Robust Inverse Probability Weighting estimator, which uses a linear least squares model for the outcome evolution of the control group and a logistic model for the propensity score. The result is slightly higher than basic DiD (0.135) and close to the true effect of 0.12.&lt;/p>
&lt;p>&lt;strong>Alternative: Stata 17+ built-in command.&lt;/strong> Stata 17 and later versions include a built-in doubly robust DiD estimator that does not require installing external packages.&lt;/p>
&lt;pre>&lt;code class="language-stata">xthdidregress aipw (y c.age c.edu i.female i.poverty) ///
(treat_post c.age c.edu i.female i.poverty), group(id)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Heterogeneous-treatment-effects regression Number of obs = 4,000
Number of panels = 2,000
Estimator: Augmented IPW
Panel variable: id
Treatment level: id
Control group: Never treated
(Std. err. adjusted for 2,000 clusters in id)
──────────────────────────────────────────────────────────────────────────────
| Robust
Cohort | ATET std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
year |
2024 | .1374784 .027387 5.02 0.000 .0838008 .191156
──────────────────────────────────────────────────────────────────────────────
Note: ATET computed using covariates.
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>xthdidregress aipw&lt;/code> command produces the same ATT estimate of 0.137 (SE = 0.027, 95% CI [0.084, 0.191]) as the &lt;code>drdid&lt;/code> package &amp;mdash; confirming that both implement the same doubly robust DiD methodology. The output labels the result as &amp;ldquo;Cohort year 2024&amp;rdquo; because &lt;code>xthdidregress&lt;/code> is designed for settings with staggered treatment adoption across multiple cohorts; in our two-period design, there is only one treatment cohort (households treated in 2024). As the Stata manual explains, &amp;ldquo;AIPW models both treatment and outcome. If at least one of the models is correctly specified, it provides consistent estimates, a property called double robustness.&amp;rdquo;&lt;/p>
&lt;p>The agreement between &lt;code>drdid&lt;/code> (community package) and &lt;code>xthdidregress aipw&lt;/code> (built-in) provides a useful robustness check &amp;mdash; researchers can verify their results using both implementations.&lt;/p>
&lt;h4 id="panel-data-vs-repeated-cross-sections">Panel data vs. repeated cross-sections&lt;/h4>
&lt;p>An important result from Sant&amp;rsquo;Anna and Zhao (2020) is that panel data are &lt;strong>strictly more efficient&lt;/strong> than repeated cross-sections for estimating the ATT under the DiD framework. The intuition is straightforward: with panel data, we observe each household&amp;rsquo;s individual change over time ($\Delta Y_i$), which eliminates household-level variation. With repeated cross-sections, we can only compare group averages at different time points, which introduces additional noise. The efficiency gain is larger when the sample sizes in the pre and post periods are more imbalanced.&lt;/p>
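&lt;p>A minimal sketch of the intuition (standard sampling-variance algebra, not the paper&amp;rsquo;s formal proof): with a balanced panel of $n$ households, each group&amp;rsquo;s change is the mean of within-household differences, whereas with independent cross-sections of sizes $n_{pre}$ and $n_{post}$ it is a difference of two independent means:&lt;/p>
&lt;p>$$\text{Var}_{panel} = \frac{\sigma_{pre}^2 + \sigma_{post}^2 - 2\rho\,\sigma_{pre}\sigma_{post}}{n}, \qquad \text{Var}_{RCS} = \frac{\sigma_{pre}^2}{n_{pre}} + \frac{\sigma_{post}^2}{n_{post}}$$&lt;/p>
&lt;p>When consumption is positively correlated within households over time ($\rho &amp;gt; 0$), the panel variance is smaller even when $n_{pre} = n_{post} = n$, and the repeated cross-section variance grows further as the two waves become more imbalanced.&lt;/p>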
&lt;p>In our study, we have a balanced panel (same 2,000 households at baseline and endline), so we benefit from this efficiency advantage.&lt;/p>
&lt;h3 id="95-cross-sectional-vs-panel-comparison">9.5 Cross-sectional vs. panel comparison&lt;/h3>
&lt;p>The table below compares our best cross-sectional estimates with the panel-based DiD estimates.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data Used&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RA&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR (IPWRA)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE&lt;/td>
&lt;td>Endline only&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>&lt;strong>ATT&lt;/strong>&lt;/td>
&lt;td>&lt;strong>Both waves&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Several important patterns emerge from this comparison. Cross-sectional methods estimate &lt;strong>ATE&lt;/strong> using only endline data, while DiD methods estimate &lt;strong>ATT&lt;/strong> using both survey waves. The two DR-DiD implementations (&lt;code>drdid&lt;/code> and &lt;code>xthdidregress aipw&lt;/code>) produce identical results, confirming methodological consistency. The DiD estimates (0.135&amp;ndash;0.137) are slightly higher than the cross-sectional estimates (0.113), but &lt;strong>all confidence intervals contain the true effect of 0.12&lt;/strong>. DiD&amp;rsquo;s wider standard errors (0.027 vs. 0.019) reflect the additional variability from differencing within households.&lt;/p>
&lt;p>The key value of DiD is &lt;strong>not&lt;/strong> tighter standard errors &amp;mdash; it is &lt;strong>robustness to time-invariant unobservables.&lt;/strong> In observational settings where randomization does not hold, DiD can correct biases that cross-sectional methods cannot address. In this RCT, randomization already handles confounding, so the estimates are similar. DRDID adds doubly robust protection on top of DiD, making it the most robust panel method available.&lt;/p>
&lt;hr>
&lt;h2 id="10-offer-vs-receipt-----endogenous-treatment-advanced">10. Offer vs. receipt &amp;mdash; endogenous treatment (advanced)&lt;/h2>
&lt;blockquote>
&lt;p>&lt;strong>Note:&lt;/strong> This section addresses the advanced topic of imperfect compliance and endogenous treatment. Readers new to causal inference may wish to skip this section on a first reading and return to it later.&lt;/p>
&lt;/blockquote>
&lt;h3 id="101-the-compliance-problem">10.1 The compliance problem&lt;/h3>
&lt;p>All estimates in Sections 8 and 9 measure the effect of &lt;strong>being offered&lt;/strong> the cash transfer (&lt;code>treat&lt;/code>), not the effect of &lt;strong>actually receiving&lt;/strong> it (&lt;code>D&lt;/code>). This is the intent-to-treat (ITT) approach &amp;mdash; it captures the policy-relevant effect of the offer, regardless of whether households complied.&lt;/p>
&lt;p>But what about the effect of actual receipt? This is more complex because compliance is &lt;strong>not random&lt;/strong>. Only 85% of treated households received the transfer, and 5% of control households received it through other channels. The households that chose to take up the program may differ systematically from those that did not &amp;mdash; they may be more motivated, more financially constrained, or better connected. Naively comparing receivers to non-receivers would introduce &lt;strong>selection bias&lt;/strong>.&lt;/p>
&lt;p>The solution is to use the random assignment (&lt;code>treat&lt;/code>) as an &lt;strong>instrumental variable&lt;/strong> for actual receipt (&lt;code>D&lt;/code>). Because &lt;code>treat&lt;/code> was randomly assigned, it is independent of household characteristics and satisfies the requirements for a valid instrument. This allows us to isolate the causal effect of receipt, at least for the subset of households whose receipt was determined by the offer (the &amp;ldquo;compliers&amp;rdquo;).&lt;/p>
&lt;p>&lt;strong>Analogy &amp;mdash; prescriptions and pills.&lt;/strong> Imagine a doctor randomly prescribes a medication to some patients, but not all patients fill their prescription. We cannot simply compare those who took the pill to those who did not, because pill-takers may be more health-conscious. Instead, we use the random prescription (the &amp;ldquo;offer&amp;rdquo;) as a nudge &amp;mdash; it strongly predicts whether you take the pill but does not directly affect your health except through the pill. That is the instrumental variable approach: using the random offer to estimate the causal effect of actual receipt.&lt;/p>
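&lt;p>The scaling logic can be previewed with a back-of-the-envelope Wald (IV) calculation using the compliance rates quoted above. This is a deliberate simplification, not the full &lt;code>etregress&lt;/code> model estimated below:&lt;/p>

```python
# Back-of-the-envelope Wald (IV) estimate: scale the offer effect (ITT) by the
# difference in take-up rates. Numbers quoted in the text; purely illustrative.
itt = 0.116                  # effect of being offered the transfer
take_up_treated = 0.85       # share of treated households that received it
take_up_control = 0.05       # share of control households that received it

first_stage = take_up_treated - take_up_control   # net share of compliers
receipt_effect = itt / first_stage                # effect of actual receipt
print(round(receipt_effect, 3))                   # 0.145
```

&lt;p>Dividing the offer effect by the 80-percentage-point gap in take-up yields about 0.145, close to the receipt effect that the full maximum-likelihood model reports below &amp;mdash; the per-recipient effect is mechanically larger than the per-offer effect.&lt;/p>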
&lt;h3 id="102-endogenous-treatment-regression">10.2 Endogenous treatment regression&lt;/h3>
&lt;p>Stata&amp;rsquo;s &lt;code>etregress&lt;/code> command estimates the effect of an endogenous treatment variable, using the random assignment as an excluded instrument.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Endogenous treatment regression
etregress y c.age i.female i.poverty c.edu, ///
treat(D = treat c.age i.female i.poverty c.edu) vce(robust)
* Mark estimation sample
gen byte esample = e(sample)
* ATE of receipt
margins r.D if esample==1
* ATT of receipt
margins, predict(cte) subpop(if D==1 &amp;amp; esample==1)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Linear regression with endogenous treatment Number of obs = 2,000
Estimator: Maximum likelihood Wald chi2(5) = 92.23
Log pseudolikelihood = -1797.6297 Prob &amp;gt; chi2 = 0.0000
──────────────────────────────────────────────────────────────────────────────
| Robust
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
y |
age | .003187 .0010016 3.18 0.001 .001224 .0051501
1.female | .0801465 .0189552 4.23 0.000 .042995 .117298
1.poverty | -.1030302 .0205984 -5.00 0.000 -.1434023 -.062658
edu | .0182634 .0045243 4.04 0.000 .0093959 .0271308
1.D | .1471 .0246775 5.96 0.000 .0987329 .1954671
_cons | 9.705642 .0694641 139.72 0.000 9.569495 9.841789
─────────────+────────────────────────────────────────────────────────────────
D |
treat | 2.55806 .0802103 31.89 0.000 2.40085 2.715269
_cons | -1.844408 .2847883 -6.48 0.000 -2.402582 -1.286233
─────────────+────────────────────────────────────────────────────────────────
/athrho | -.0060068 .0481062 -0.12 0.901 -.1002933 .0882796
sigma | .4245195 .0066426 .411698 .4377404
──────────────────────────────────────────────────────────────────────────────
Wald test of indep. eqns. (rho = 0): chi2(1) = 0.02 Prob &amp;gt; chi2 = 0.9006
ATE of receipt (margins r.D):
──────────────────────────────────────────────────────────────────────────────
D | Contrast std. err. [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
(1 vs 0) | .1471 .0246775 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
ATT of receipt (margins, predict(cte)):
──────────────────────────────────────────────────────────────────────────────
_cons | Margin std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
| .1471 .0246775 5.96 0.000 .0987329 .1954671
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>etregress&lt;/code> output reveals several important findings. The coefficient on &lt;code>D&lt;/code> (receipt) is 0.147 (SE = 0.025, p &amp;lt; 0.001, 95% CI [0.099, 0.195]), which is the estimated effect of actually receiving the cash transfer. This is larger than the offer-based estimates (0.113&amp;ndash;0.116) because not everyone who was offered the program received it &amp;mdash; the per-recipient effect is naturally larger than the per-offer effect. The Wald test of independent equations (rho = 0) has p = 0.901, indicating no evidence of endogeneity &amp;mdash; consistent with a well-designed RCT where unobservable factors do not drive both treatment receipt and consumption. The &lt;code>margins&lt;/code> commands confirm that both the ATE and ATT of receipt are 0.147 (identical in this case because the model assumes a constant treatment effect).&lt;/p>
&lt;h3 id="103-doubly-robust-estimation-of-receipt-effect">10.3 Doubly robust estimation of receipt effect&lt;/h3>
&lt;p>We can also estimate the receipt effect using a doubly robust approach, incorporating the baseline outcome &lt;code>y0&lt;/code> as an additional control variable (an ANCOVA-style adjustment) and including &lt;code>treat&lt;/code> (the random assignment) as a covariate in the treatment model for &lt;code>D&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-stata">use &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/ametrics/dataSIM4RCT.dta&amp;quot;, clear
keep if post==1
* Doubly robust ATE of receipt, controlling for baseline outcome
teffects ipwra (y y0 c.age i.female i.poverty c.edu) ///
(D c.age i.female i.poverty c.edu treat), vce(robust)
* Diagnostic checks
tebalance summarize age edu i.female i.poverty
tebalance summarize, baseline
tebalance density y0
tebalance density age
teffects overlap
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 2,000
Estimator : IPW regression adjustment
Outcome model : linear
Treatment model: logit
──────────────────────────────────────────────────────────────────────────────
| Robust
y | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
─────────────+────────────────────────────────────────────────────────────────
ATE |
D |
(1 vs 0) | .1172686 .0322495 3.64 0.000 .0540608 .1804764
─────────────+────────────────────────────────────────────────────────────────
POmean |
D |
0 | 10.03361 .0171459 585.19 0.000 10 10.06722
──────────────────────────────────────────────────────────────────────────────
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimate of the ATE of receipt is 0.117 (SE = 0.032, 95% CI [0.054, 0.180]). This is slightly lower than the &lt;code>etregress&lt;/code> estimate (0.147) and closer to the true effect of 0.12. The larger standard error (0.032 vs. 0.025) reflects the additional flexibility of the doubly robust approach. This specification includes &lt;code>y0&lt;/code> (the baseline outcome) in the outcome model, which controls for pre-treatment differences in consumption levels. The variable &lt;code>treat&lt;/code> appears in the treatment model for &lt;code>D&lt;/code> because random assignment is the strongest predictor of receipt.&lt;/p>
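&lt;p>For intuition, the doubly robust logic behind &lt;code>teffects ipwra&lt;/code> can be sketched outside Stata. The following Python sketch uses simulated data (not the tutorial&amp;rsquo;s dataset) and the augmented-IPW combination of a treatment model and two outcome models; it illustrates the general idea rather than replicating Stata&amp;rsquo;s exact estimator.&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
# Treatment depends on covariates, so a naive comparison would be confounded
p = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.3 * x[:, 1])))
d = rng.binomial(1, p)
true_ate = 0.12
y = 10 + 0.4 * x[:, 0] + 0.2 * x[:, 1] + true_ate * d + rng.normal(0, 0.5, n)

# Step 1: treatment model (propensity scores)
ps = LogisticRegression().fit(x, d).predict_proba(x)[:, 1]
# Step 2: outcome models, fit separately in each treatment arm
m1 = LinearRegression().fit(x[d == 1], y[d == 1]).predict(x)
m0 = LinearRegression().fit(x[d == 0], y[d == 0]).predict(x)
# Step 3: augmented-IPW combination, consistent if either model is right
aipw = m1 - m0 + d * (y - m1) / ps - (1 - d) * (y - m0) / (1 - ps)
print(f"AIPW ATE estimate: {aipw.mean():.3f}")  # close to the true 0.12
```

&lt;p>The key property is in step 3: the correction terms vanish in expectation whenever either the propensity model or the outcome models are correctly specified, which is what makes the estimator &amp;ldquo;doubly robust.&amp;rdquo;&lt;/p>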
&lt;p>The diagnostic graphs below verify adequate covariate balance and propensity score overlap for the receipt model.&lt;/p>
&lt;p>&lt;img src="stata_rct_density_y0_receipt.png" alt="Density plot of baseline consumption (y0) for receivers and non-receivers, before and after IPWRA weighting.">&lt;/p>
&lt;p>&lt;img src="stata_rct_overlap_receipt.png" alt="Overlap plot showing propensity score distributions for receivers and non-receivers of the cash transfer.">&lt;/p>
&lt;p>The density and overlap plots confirm that the IPWRA weighting achieves good balance between receivers and non-receivers. After weighting, the effective sample sizes are approximately 999 treated and 1,001 control (rebalanced from the raw 923 receivers and 1,077 non-receivers). The weighted covariate means are closely aligned &amp;mdash; for example, the weighted mean age is 35.0 for receivers versus 35.2 for non-receivers, and the weighted poverty rate is 31.1% versus 31.4%. The propensity scores show sufficient overlap for reliable estimation.&lt;/p>
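&lt;p>As a side check on weighted counts like these, Kish&amp;rsquo;s effective sample size &amp;mdash; a standard diagnostic for how much unequal weights inflate variance &amp;mdash; is easy to compute by hand. The weights below are hypothetical, not the ones produced by &lt;code>teffects&lt;/code>:&lt;/p>

```python
import numpy as np

def kish_ess(w):
    """Kish effective sample size: (sum of w) squared over sum of w squared."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: the effective sample size equals the raw sample size
print(kish_ess(np.ones(923)))  # 923.0
# Unequal (hypothetical) inverse-propensity weights shrink it
rng = np.random.default_rng(1)
w = 1 / rng.uniform(0.3, 0.9, size=923)
print(round(kish_ess(w), 1))
```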
&lt;hr>
&lt;h2 id="11-comparing-all-estimates-----the-big-picture">11. Comparing all estimates &amp;mdash; the big picture&lt;/h2>
&lt;p>The table below brings together all estimates from the tutorial, providing a comprehensive overview of how different methods, estimands, and data structures relate to each other.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>Method&lt;/th>
&lt;th>Approach&lt;/th>
&lt;th>Estimand&lt;/th>
&lt;th>Data&lt;/th>
&lt;th style="text-align:center">Estimate&lt;/th>
&lt;th style="text-align:center">SE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Contains 0.12?&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>Simple regression&lt;/td>
&lt;td>None&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.116&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.078, 0.154]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>Outcome model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>Inverse Prob. Weighting&lt;/td>
&lt;td>Treatment model&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.075, 0.150]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>IPWRA (Doubly Robust)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.113&lt;/td>
&lt;td style="text-align:center">0.019&lt;/td>
&lt;td style="text-align:center">[0.076, 0.151]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>Basic DiD&lt;/td>
&lt;td>Panel FE&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.135&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.081, 0.188]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>9&lt;/td>
&lt;td>DR-DiD (&lt;code>drdid&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10&lt;/td>
&lt;td>DR-DiD (&lt;code>xthdidregress&lt;/code>)&lt;/td>
&lt;td>Both + Panel&lt;/td>
&lt;td>ATT (offer)&lt;/td>
&lt;td>Panel&lt;/td>
&lt;td style="text-align:center">0.137&lt;/td>
&lt;td style="text-align:center">0.027&lt;/td>
&lt;td style="text-align:center">[0.084, 0.191]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>11&lt;/td>
&lt;td>Endogenous treatment (&lt;code>etregress&lt;/code>)&lt;/td>
&lt;td>IV&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.147&lt;/td>
&lt;td style="text-align:center">0.025&lt;/td>
&lt;td style="text-align:center">[0.099, 0.195]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>12&lt;/td>
&lt;td>DR receipt (&lt;code>teffects ipwra&lt;/code>)&lt;/td>
&lt;td>Both models&lt;/td>
&lt;td>ATE (receipt)&lt;/td>
&lt;td>Endline&lt;/td>
&lt;td style="text-align:center">0.117&lt;/td>
&lt;td style="text-align:center">0.032&lt;/td>
&lt;td style="text-align:center">[0.054, 0.180]&lt;/td>
&lt;td style="text-align:center">Yes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;strong>True effect&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.12&lt;/strong>&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;td style="text-align:center">&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="four-key-takeaways">Four key takeaways&lt;/h3>
&lt;p>&lt;strong>1. RA vs. IPW vs. DR.&lt;/strong> In this well-designed RCT, all three cross-sectional approaches give remarkably similar results (0.113&amp;ndash;0.116). This convergence occurs because randomization ensures that both the outcome model and the propensity score model are approximately correct. The differences are small &amp;mdash; but in observational studies, where one model might be misspecified, the choice of method matters much more. Doubly robust methods are the safest bet because they remain consistent if either model is correct.&lt;/p>
&lt;p>&lt;strong>2. ATE vs. ATT.&lt;/strong> For all cross-sectional methods, ATE and ATT are nearly identical (0.113&amp;ndash;0.116). This confirms that treatment effects are roughly homogeneous across households in this simulation. When treatment effects are heterogeneous &amp;mdash; for example, if the program benefits poorer households more &amp;mdash; ATE and ATT can diverge. The researcher must choose the estimand that matches their policy question: ATE for scaling decisions, ATT for program evaluation.&lt;/p>
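&lt;p>A tiny simulation (hypothetical numbers, unrelated to the tutorial&amp;rsquo;s data) makes the divergence concrete: when the households most helped by the program are also the most likely to participate, the ATT exceeds the ATE.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
poverty = rng.binomial(1, 0.3, n)         # 30% of households are poor
tau = np.where(poverty == 1, 0.20, 0.08)  # the program helps poor households more
# Suppose take-up is also higher among poor households
d = rng.binomial(1, np.where(poverty == 1, 0.8, 0.4))

ate = tau.mean()          # effect for a randomly chosen household
att = tau[d == 1].mean()  # effect for the households that actually participate
print(f"ATE = {ate:.3f}, ATT = {att:.3f}")  # ATT exceeds ATE here
```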
&lt;p>&lt;strong>3. Cross-sectional vs. DiD.&lt;/strong> DiD estimates (0.135&amp;ndash;0.137) are slightly higher than cross-sectional estimates (0.113&amp;ndash;0.116), but all confidence intervals contain the true effect of 0.12. DiD&amp;rsquo;s main advantage is controlling for &lt;strong>time-invariant unobservable&lt;/strong> household characteristics &amp;mdash; less important in an RCT (where randomization handles confounding) but critical in quasi-experimental settings. DRDID extends the doubly robust logic to the panel setting, providing the most robust estimator in our toolkit. DiD inherently estimates the &lt;strong>ATT&lt;/strong> because its counterfactual is constructed specifically for the treated group.&lt;/p>
&lt;p>&lt;strong>4. Offer vs. receipt.&lt;/strong> The effect of actually receiving the cash transfer (0.117&amp;ndash;0.147) is larger than the effect of being offered it (0.113&amp;ndash;0.116), because imperfect compliance dilutes the offer-based estimates. The doubly robust receipt estimate (0.117) is closest to the true effect of 0.12, while the endogenous treatment model (0.147) is slightly higher. All confidence intervals contain 0.12.&lt;/p>
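&lt;p>The dilution can be checked with back-of-the-envelope arithmetic: under one-sided noncompliance, the per-offer (intention-to-treat) effect is roughly the compliance rate times the per-recipient effect. Assuming about 1,000 offered households behind the 923 receivers reported above:&lt;/p>

```python
# Back-of-the-envelope Wald/ITT relationship under one-sided noncompliance:
# per-offer (ITT) effect is roughly compliance_rate times per-recipient effect
true_receipt_effect = 0.12
compliance = 923 / 1000  # 923 receivers among an assumed 1,000 offered households
itt = compliance * true_receipt_effect
print(f"Implied per-offer effect: {itt:.3f}")  # 0.111, near the 0.113-0.116 estimates

# Inverting the relationship scales an offer estimate up to a receipt effect
print(f"Implied receipt effect: {0.113 / compliance:.3f}")  # 0.122
```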
&lt;hr>
&lt;h2 id="12-summary-and-key-takeaways">12. Summary and key takeaways&lt;/h2>
&lt;p>The cash transfer program increased household consumption by approximately &lt;strong>11&amp;ndash;14%&lt;/strong> across all estimation methods, close to the true effect of &lt;strong>12%&lt;/strong>. Every confidence interval contained the true value, demonstrating that all methods successfully recovered the correct answer.&lt;/p>
&lt;h3 id="seven-methodological-lessons">Seven methodological lessons&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Always verify baseline balance&lt;/strong> before estimating treatment effects. Even with randomization, chance imbalances can occur &amp;mdash; as we saw with the gender variable (SMD = 9.3%).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Be explicit about your estimand.&lt;/strong> ATE answers the policymaker&amp;rsquo;s question (&amp;ldquo;What if we scale this up?&amp;rdquo;), while ATT answers the evaluator&amp;rsquo;s question (&amp;ldquo;Did it help the participants?&amp;rdquo;). Different methods target different estimands.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Regression adjustment models the outcome; IPW models treatment assignment; doubly robust does both.&lt;/strong> These three approaches represent fundamentally different strategies for causal estimation. Understanding what each models &amp;mdash; and what can go wrong &amp;mdash; is essential for choosing the right method.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>In a well-designed RCT, all three approaches converge.&lt;/strong> But doubly robust methods provide insurance against model misspecification, making them the standard recommendation in modern causal inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Panel data controls for time-invariant unobservables&lt;/strong> that cross-sectional methods cannot address. By comparing each household to itself over time, DiD absorbs household fixed effects &amp;mdash; motivation, geography, family culture &amp;mdash; that are invisible to cross-sectional approaches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>DiD inherently estimates the ATT&lt;/strong> because its counterfactual is specific to the treated group. The control group&amp;rsquo;s time trend provides a counterfactual for what the treated group would have experienced without the program &amp;mdash; but it does not tell us what would happen if the program were given to the untreated.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Doubly robust DiD (DRDID)&lt;/strong> extends the DR logic to the panel setting. It combines the power of DiD (controlling for household fixed effects) with the robustness of doubly robust estimation (protection against model misspecification), making it the most robust panel estimator available.&lt;/p>
&lt;/li>
&lt;/ol>
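&lt;p>Lessons 5 and 6 can be seen in a minimal simulation: with a group-level difference baked into the household fixed effects, a raw endline comparison would be biased, but the 2&amp;times;2 DiD contrast recovers the true effect (illustrative Python, not the tutorial&amp;rsquo;s Stata workflow):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
treat = rng.binomial(1, 0.5, n)
# Household fixed effects include a level difference between groups,
# which would bias a raw endline comparison
alpha = rng.normal(0, 1, n) + 0.5 * treat
trend, tau = 0.3, 0.12  # common time trend, true treatment effect
y_pre = 10 + alpha + rng.normal(0, 0.3, n)
y_post = 10 + alpha + trend + tau * treat + rng.normal(0, 0.3, n)

# 2x2 DiD: difference the within-group changes; alpha cancels exactly
did = ((y_post[treat == 1].mean() - y_pre[treat == 1].mean())
       - (y_post[treat == 0].mean() - y_pre[treat == 0].mean()))
print(f"DiD estimate of the ATT: {did:.3f}")  # close to 0.12
```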
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>This tutorial uses &lt;strong>simulated data&lt;/strong> with known parameters. Real-world data may exhibit more complex compliance patterns, heterogeneous effects, and missing data.&lt;/li>
&lt;li>The panel has only &lt;strong>two periods&lt;/strong> (baseline and endline), limiting our ability to test for pre-treatment trends or estimate dynamic treatment effects.&lt;/li>
&lt;li>Treatment effects are &lt;strong>homogeneous&lt;/strong> by construction. In practice, researchers should explore heterogeneity across subgroups.&lt;/li>
&lt;/ul>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;ul>
&lt;li>Apply these methods to &lt;strong>real-world RCT data&lt;/strong> from actual cash transfer programs&lt;/li>
&lt;li>Explore &lt;strong>heterogeneous treatment effects&lt;/strong> by gender, poverty status, or education level&lt;/li>
&lt;li>Extend to &lt;strong>multi-period panels&lt;/strong> with staggered treatment adoption, using modern DiD methods (Callaway and Sant&amp;rsquo;Anna, 2021)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="13-exercises">13. Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Heterogeneous effects by gender.&lt;/strong> Estimate treatment effects separately for male-headed and female-headed households using IPWRA. Are the effects different? Does ATE still equal ATT when you restrict to subgroups?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model misspecification.&lt;/strong> Compare the RA, IPW, and DR estimates when you deliberately misspecify the outcome model by omitting &lt;code>edu&lt;/code> and &lt;code>age&lt;/code> from the covariate list. Which method is most robust to this misspecification? What does this tell you about the value of doubly robust estimation?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Basic DiD vs. doubly robust DiD.&lt;/strong> Re-run the DiD analysis using the basic &lt;code>xtdidregress&lt;/code> command (no covariates) and compare it with the &lt;code>drdid&lt;/code> results (with covariates). How much do the estimates differ? What does this tell you about the role of covariate adjustment in DiD?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.stata.com/manuals/teteffects.pdf" target="_blank" rel="noopener">Stata &lt;code>teffects&lt;/code> documentation &amp;mdash; Treatment-effects estimation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.jeconom.2020.06.003" target="_blank" rel="noopener">Sant&amp;rsquo;Anna, P.H.C. &amp;amp; Zhao, J. (2020). Doubly Robust Difference-in-Differences Estimators. &lt;em>Journal of Econometrics&lt;/em>, 219(1), 101&amp;ndash;122&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1017/CBO9781139025751" target="_blank" rel="noopener">Imbens, G. &amp;amp; Rubin, D. (2015). &lt;em>Causal Inference for Statistics, Social, and Biomedical Sciences&lt;/em>. Cambridge University Press&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://friosavila.github.io/stpackages/drdid.html" target="_blank" rel="noopener">Rios-Avila, F., Sant&amp;rsquo;Anna, P.H.C., &amp;amp; Callaway, B. &lt;code>drdid&lt;/code> &amp;mdash; Doubly Robust DID estimators for Stata&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dimewiki.worldbank.org/iebaltab" target="_blank" rel="noopener">World Bank &lt;code>ietoolkit&lt;/code> / &lt;code>iebaltab&lt;/code> documentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://tdmize.github.io/data/" target="_blank" rel="noopener">Mize, T. &lt;code>balanceplot&lt;/code> &amp;mdash; Stata module for covariate balance visualization&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://youtu.be/Gr_fu5deDMk" target="_blank" rel="noopener">RCT Analysis: Cash Transfers, Panel Data, and Doubly Robust Estimation (YouTube)&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the post may still contain errors, so apply its contents to actual research projects with caution.&lt;/p></description></item><item><title>Introduction to Causal Inference: The DoWhy Approach with the Lalonde Dataset</title><link>https://carlos-mendez.org/post/python_dowhy/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_dowhy/</guid><description>&lt;div style="background:#0e1545; border-radius:12px; padding:8px;">
&lt;iframe style="border-radius:8px" src="https://open.spotify.com/embed/episode/7h6S9YzEroATdQabvJxi1W?utm_source=generator&amp;theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy">&lt;/iframe>
&lt;/div>
&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Does a job training program actually cause participants to earn more, or do people who enroll in training simply differ from those who do not? This is the central challenge of &lt;strong>causal inference&lt;/strong>: distinguishing genuine treatment effects from confounding differences between groups. A simple comparison of average earnings between participants and non-participants can be misleading if the two groups differ in age, education, or prior employment history.&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/" target="_blank" rel="noopener">DoWhy&lt;/a>&lt;/strong> is a Python library that provides a principled, end-to-end framework for causal inference. It organizes the analysis into four explicit steps &amp;mdash; &lt;strong>Model, Identify, Estimate, Refute&lt;/strong> &amp;mdash; each of which forces the analyst to state and test causal assumptions rather than hiding them inside a black-box estimator. In this tutorial, we apply DoWhy to the &lt;strong>&lt;a href="https://www.jstor.org/stable/1806062" target="_blank" rel="noopener">Lalonde dataset&lt;/a>&lt;/strong>, a classic dataset from the National Supported Work (NSW) Demonstration program, to estimate how much the job training program increased participants' earnings in 1978.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand DoWhy&amp;rsquo;s four-step causal inference workflow (Model, Identify, Estimate, Refute)&lt;/li>
&lt;li>Define a causal graph that encodes domain knowledge about confounders&lt;/li>
&lt;li>Identify causal estimands from the graph using the backdoor criterion&lt;/li>
&lt;li>Estimate causal effects using multiple methods (regression adjustment, IPW, doubly robust, propensity score stratification, propensity score matching)&lt;/li>
&lt;li>Assess robustness of estimates using refutation tests&lt;/li>
&lt;/ul>
&lt;h2 id="dowhys-four-step-framework">DoWhy&amp;rsquo;s four-step framework&lt;/h2>
&lt;p>Most statistical software lets you jump straight from data to estimates, skipping the hard work of stating assumptions and testing whether the results are trustworthy. DoWhy takes a different approach: it organizes every causal analysis into four explicit steps, each answering a distinct question.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;1. Model&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Define causal&amp;lt;br/&amp;gt;assumptions&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;2. Identify&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Find the right&amp;lt;br/&amp;gt;formula&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;3. Estimate&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Compute the&amp;lt;br/&amp;gt;causal effect&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;4. Refute&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Stress-test&amp;lt;br/&amp;gt;the result&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#fff
style D fill:#8b5cf6,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Each step answers a specific question and builds on the previous one:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Model&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What are the causal relationships?&amp;rdquo;&lt;/em> Encode your domain knowledge as a causal graph (a DAG). This is where you declare which variables cause which, making your assumptions explicit and debatable rather than hidden inside a regression.&lt;/li>
&lt;li>&lt;strong>Identify&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;Can we estimate the effect from data?&amp;rdquo;&lt;/em> Given the graph, DoWhy uses graph theory to determine whether the causal effect is identifiable &amp;mdash; meaning it can be computed from observed data alone &amp;mdash; and returns the mathematical formula (the &lt;em>estimand&lt;/em>) needed to do so.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What is the causal effect?&amp;rdquo;&lt;/em> Apply one or more statistical methods to compute the actual numeric estimate. DoWhy supports multiple estimators so you can check whether different methods agree.&lt;/li>
&lt;li>&lt;strong>Refute&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;Should we trust the estimate?&amp;rdquo;&lt;/em> Run automated falsification tests that probe whether the result could be a statistical artifact, whether it is sensitive to unobserved confounders, and whether it is stable across subsamples.&lt;/li>
&lt;/ul>
&lt;p>The ordering is deliberate. You cannot estimate a causal effect without first identifying the correct formula, and you cannot identify the formula without first specifying your causal assumptions. This sequential discipline is DoWhy&amp;rsquo;s key contribution: it prevents the common mistake of running a regression and calling the coefficient &amp;ldquo;causal&amp;rdquo; without ever checking whether the adjustment set is correct or whether the result survives basic robustness checks.&lt;/p>
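&lt;p>To make the Identify step concrete: with a single discrete confounder, the backdoor estimand reduces to averaging within-stratum treatment contrasts over the confounder&amp;rsquo;s distribution. A toy example (simulated data, not the Lalonde dataset):&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
x = rng.binomial(1, 0.5, n)                      # a single binary confounder
t = rng.binomial(1, np.where(x == 1, 0.7, 0.3))  # x raises treatment probability
y = 2.0 * x + 1.0 * t + rng.normal(0, 1, n)      # true effect of t is 1.0
df = pd.DataFrame({"x": x, "t": t, "y": y})

# The naive difference in means is confounded by x
naive = df.loc[df.t == 1, "y"].mean() - df.loc[df.t == 0, "y"].mean()
# Backdoor adjustment: within-stratum contrasts averaged over P(x)
m = df.groupby(["x", "t"])["y"].mean().unstack()  # rows: x, columns: t
strata = m[1] - m[0]
adjusted = (strata * df["x"].value_counts(normalize=True).sort_index()).sum()
print(f"naive = {naive:.2f}, adjusted = {adjusted:.2f}")  # adjusted is near 1.0
```

&lt;p>The naive contrast is inflated because treated units are disproportionately drawn from the high-&lt;code>x</code>, high-outcome stratum; stratifying on &lt;code>x&lt;/code> closes that backdoor path.&lt;/p>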
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running the analysis, install the required package if needed:&lt;/p>
&lt;pre>&lt;code class="language-python">pip install dowhy # https://pypi.org/project/dowhy/
&lt;/code>&lt;/pre>
&lt;p>The following code imports all necessary libraries and sets configuration variables. We define the outcome, treatment, and covariate columns that will be used throughout the analysis.&lt;/p>
&lt;pre>&lt;code class="language-python">import warnings
warnings.filterwarnings(&amp;quot;ignore&amp;quot;)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression as SklearnLR
from dowhy import CausalModel
from dowhy.datasets import lalonde_dataset
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Configuration
OUTCOME = &amp;quot;re78&amp;quot;
OUTCOME_LABEL = &amp;quot;Earnings in 1978 (USD)&amp;quot;
TREATMENT = &amp;quot;treat&amp;quot;
TREATMENT_LABEL = &amp;quot;Job Training (treat)&amp;quot;
COVARIATES = [&amp;quot;age&amp;quot;, &amp;quot;educ&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;nodegr&amp;quot;, &amp;quot;re74&amp;quot;, &amp;quot;re75&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h2 id="data-loading-the-lalonde-dataset">Data loading: The Lalonde Dataset&lt;/h2>
&lt;p>The Lalonde dataset comes from the &lt;strong>National Supported Work (NSW) Demonstration&lt;/strong>, a randomized employment program conducted in the 1970s in the United States. Eligible applicants &amp;mdash; mostly disadvantaged workers with limited employment histories &amp;mdash; were randomly assigned to receive job training (treatment) or not (control). The dataset records each participant&amp;rsquo;s demographics, prior earnings, and post-program earnings in 1978. It has become a benchmark for testing causal inference methods because the random assignment provides a credible ground truth against which observational estimators can be compared.&lt;/p>
&lt;p>DoWhy includes the Lalonde dataset directly, so we can load it with the &lt;a href="https://www.pywhy.org/dowhy/v0.14/example_notebooks/lalonde_pandas_api.html" target="_blank" rel="noopener">&lt;code>lalonde_dataset()&lt;/code>&lt;/a> function.&lt;/p>
&lt;pre>&lt;code class="language-python">df = lalonde_dataset()
# Convert boolean treatment to integer for DoWhy compatibility
df[TREATMENT] = df[TREATMENT].astype(int)
print(f&amp;quot;Dataset shape: {df.shape}&amp;quot;)
print(f&amp;quot;\nTreatment groups:&amp;quot;)
print(df[TREATMENT].value_counts().sort_index().rename({0: &amp;quot;Control&amp;quot;, 1: &amp;quot;Training&amp;quot;}))
print(f&amp;quot;\nOutcome ({OUTCOME}) summary:&amp;quot;)
print(df[OUTCOME].describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Dataset shape: (445, 12)
Treatment groups:
treat
Control 260
Training 185
Name: count, dtype: int64
Outcome (re78) summary:
count 445.00
mean 5300.76
std 6631.49
min 0.00
25% 0.00
50% 3701.81
75% 8124.72
max 60307.93
Name: re78, dtype: float64
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 445 participants with 12 variables. The treatment is split into 185 individuals who received job training and 260 controls who did not. The outcome variable, real earnings in 1978 (&lt;code>re78&lt;/code>), has a mean of \$5,301 but enormous variation (standard deviation of \$6,631), ranging from \$0 to \$60,308. The median (\$3,702) is well below the mean, indicating a right-skewed distribution &amp;mdash; many participants earned little or nothing while a few earned substantially more.&lt;/p>
&lt;h2 id="exploratory-data-analysis">Exploratory data analysis&lt;/h2>
&lt;h3 id="outcome-distribution-by-treatment-group">Outcome distribution by treatment group&lt;/h3>
&lt;p>Before any causal modeling, we compare the raw earnings distributions between training and control groups. If the training program had an effect, we expect to see higher average earnings in the training group &amp;mdash; but we cannot yet tell whether any difference is truly caused by the program or driven by pre-existing differences between the groups.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(8, 5))
for group, label, color in [(0, &amp;quot;Control&amp;quot;, &amp;quot;#6a9bcc&amp;quot;), (1, &amp;quot;Training&amp;quot;, &amp;quot;#d97757&amp;quot;)]:
subset = df[df[TREATMENT] == group][OUTCOME]
ax.hist(subset, bins=30, alpha=0.6, label=f&amp;quot;{label} (mean=${subset.mean():,.0f})&amp;quot;,
color=color, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(OUTCOME_LABEL)
ax.set_ylabel(&amp;quot;Count&amp;quot;)
ax.set_title(f&amp;quot;Distribution of {OUTCOME_LABEL} by Treatment Group&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;dowhy_outcome_by_treatment.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_outcome_by_treatment.png" alt="Distribution of 1978 earnings by treatment group. The training group shows a higher mean.">&lt;/p>
&lt;p>Both distributions are heavily right-skewed, with a large spike near zero reflecting participants who had no earnings. The training group has a higher mean (\$6,349) compared to the control group (\$4,555), a raw difference of about \$1,794. However, both distributions overlap substantially, and the spike at zero is present in both groups, indicating that many participants struggled to find employment regardless of training.&lt;/p>
&lt;h3 id="covariate-balance">Covariate balance&lt;/h3>
&lt;p>In a randomized experiment, we expect the covariates to be balanced across treatment and control groups. Under randomization, the naive difference-in-means is &lt;strong>unbiased&lt;/strong> for the ATE in expectation &amp;mdash; but with a finite sample of 445 observations, chance imbalances can still arise and reduce the precision of the estimate. Checking covariate balance helps us assess whether such imbalances exist and whether covariate adjustment could improve efficiency. We first examine the categorical covariates as proportions, then use Standardized Mean Differences to assess balance across all covariates on a common scale.&lt;/p>
&lt;h4 id="categorical-covariates">Categorical covariates&lt;/h4>
&lt;p>The four binary covariates &amp;mdash; &lt;code>black&lt;/code>, &lt;code>hisp&lt;/code>, &lt;code>married&lt;/code>, and &lt;code>nodegr&lt;/code> (no high school degree) &amp;mdash; indicate demographic group membership. Comparing their proportions across treatment and control groups reveals whether random assignment produced balanced groups on these characteristics.&lt;/p>
&lt;pre>&lt;code class="language-python">categorical_vars = [&amp;quot;black&amp;quot;, &amp;quot;hisp&amp;quot;, &amp;quot;married&amp;quot;, &amp;quot;nodegr&amp;quot;]
cat_means = df.groupby(TREATMENT)[categorical_vars].mean()
fig, ax = plt.subplots(figsize=(8, 5))
x = np.arange(len(categorical_vars))
width = 0.35
ax.bar(x - width / 2, cat_means.loc[0], width, label=&amp;quot;Control&amp;quot;,
color=&amp;quot;#6a9bcc&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.bar(x + width / 2, cat_means.loc[1], width, label=&amp;quot;Training&amp;quot;,
color=&amp;quot;#d97757&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xticks(x)
ax.set_xticklabels(categorical_vars, rotation=45, ha=&amp;quot;right&amp;quot;)
ax.set_ylabel(&amp;quot;Proportion&amp;quot;)
ax.set_ylim(0, 1)
ax.set_title(&amp;quot;Covariate Balance: Categorical Variables&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;dowhy_covariate_balance_categorical.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_covariate_balance_categorical.png" alt="Proportions of categorical covariates for control and training groups. Both groups show similar demographic composition.">&lt;/p>
&lt;p>The categorical covariates are well balanced across treatment and control groups, consistent with random assignment. The sample is predominantly Black (83%) and has a high rate of lacking a high school diploma (78%), reflecting the disadvantaged population targeted by the NSW program. Hispanic and married proportions are low in both groups (roughly 6% and 16%, respectively), with no meaningful differences between treatment arms.&lt;/p>
&lt;h4 id="covariate-balance-standardized-mean-differences">Covariate balance: Standardized Mean Differences&lt;/h4>
&lt;p>Comparing raw group means can be misleading when covariates are measured on different scales. Suppose the control group earns \$500 more in prior earnings (&lt;code>re74&lt;/code>) than the training group, and is also 1 year older on average. Which imbalance is larger? The raw numbers cannot answer this question &amp;mdash; \$500 sounds like a lot, but prior earnings vary by thousands of dollars across individuals, so a \$500 gap may be trivial relative to the spread. A 1-year age difference sounds small, but if most participants are clustered around age 25, that gap may represent a meaningful shift in the distribution.&lt;/p>
&lt;p>The &lt;strong>Standardized Mean Difference (SMD)&lt;/strong> resolves this by asking: &lt;em>how many standard deviations apart are the treatment and control groups on each covariate?&lt;/em> For each variable, we compute the difference in group means and divide by the pooled standard deviation. This converts every covariate &amp;mdash; whether binary, measured in years, or measured in dollars &amp;mdash; to the same unitless scale, making imbalances directly comparable:&lt;/p>
&lt;p>$$\text{SMD} = \frac{\bar{X}_{treated} - \bar{X}_{control}}{\sqrt{(s^2_{treated} + s^2_{control}) \,/\, 2}}$$&lt;/p>
&lt;p>An absolute SMD below 0.1 is the conventional threshold for &amp;ldquo;good balance&amp;rdquo; (&lt;a href="https://doi.org/10.1002/sim.3697" target="_blank" rel="noopener">Austin, 2011&lt;/a>). Values above 0.1 signal that the groups differ by more than one-tenth of a standard deviation on that variable &amp;mdash; enough to potentially confound the treatment effect estimate. A &lt;a href="https://doi.org/10.1002/sim.3697" target="_blank" rel="noopener">&lt;strong>Love plot&lt;/strong>&lt;/a> displays the absolute SMD for all covariates as horizontal bars, with a dashed line at the 0.1 threshold. Bars in steel blue fall below the threshold (balanced), while bars in warm orange exceed it (imbalanced).&lt;/p>
&lt;pre>&lt;code class="language-python"># Standardized Mean Difference (SMD) for all covariates
treated = df[df[TREATMENT] == 1]
control = df[df[TREATMENT] == 0]
smd_values = {}
for var in COVARIATES:
    diff = treated[var].mean() - control[var].mean()
    pooled_sd = np.sqrt((treated[var].std()**2 + control[var].std()**2) / 2)
    smd_values[var] = diff / pooled_sd
smd_df = pd.DataFrame({&amp;quot;variable&amp;quot;: list(smd_values.keys()),
                       &amp;quot;smd&amp;quot;: list(smd_values.values())})
smd_df[&amp;quot;abs_smd&amp;quot;] = smd_df[&amp;quot;smd&amp;quot;].abs()
smd_df = smd_df.sort_values(&amp;quot;abs_smd&amp;quot;)
fig, ax = plt.subplots(figsize=(8, 5))
colors = [&amp;quot;#6a9bcc&amp;quot; if v &amp;lt; 0.1 else &amp;quot;#d97757&amp;quot; for v in smd_df[&amp;quot;abs_smd&amp;quot;]]
ax.barh(smd_df[&amp;quot;variable&amp;quot;], smd_df[&amp;quot;abs_smd&amp;quot;], color=colors,
        edgecolor=&amp;quot;white&amp;quot;, height=0.6)
ax.axvline(0.1, color=&amp;quot;#141413&amp;quot;, linewidth=1, linestyle=&amp;quot;--&amp;quot;, label=&amp;quot;SMD = 0.1 threshold&amp;quot;)
ax.set_xlabel(&amp;quot;Absolute Standardized Mean Difference&amp;quot;)
ax.set_title(&amp;quot;Covariate Balance: Love Plot (All Covariates)&amp;quot;)
ax.legend(loc=&amp;quot;lower right&amp;quot;)
plt.savefig(&amp;quot;dowhy_covariate_balance_smd.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_covariate_balance_smd.png" alt="Love plot showing standardized mean differences for all eight covariates. Most fall below the 0.1 threshold, indicating good balance.">&lt;/p>
&lt;p>The Love plot reveals a more nuanced picture than raw mean comparisons would suggest. Prior earnings (&lt;code>re74&lt;/code> and &lt;code>re75&lt;/code>) &amp;mdash; which appeared imbalanced when comparing raw means in the thousands &amp;mdash; are actually well balanced on the standardized scale (SMD &amp;lt; 0.1), because their large variances absorb the mean differences. In contrast, &lt;code>nodegr&lt;/code> shows the largest imbalance (SMD ~0.31), followed by &lt;code>hisp&lt;/code> (~0.18) and &lt;code>educ&lt;/code> (~0.14). These imbalances, despite random assignment, reflect the small sample size and the disadvantaged population targeted by NSW. Although the naive difference-in-means remains unbiased under randomization, adjusting for these chance imbalances can &lt;strong>improve the precision&lt;/strong> of the treatment effect estimate &amp;mdash; a well-known result in the experimental design literature (&lt;a href="https://doi.org/10.1214/12-AOAS583" target="_blank" rel="noopener">Lin, 2013&lt;/a>; &lt;a href="https://doi.org/10.1214/08-AOAS171" target="_blank" rel="noopener">Freedman, 2008&lt;/a>).&lt;/p>
&lt;h2 id="the-causal-inference-problem">The causal inference problem&lt;/h2>
&lt;h3 id="ate-vs-att-two-different-causal-questions">ATE vs ATT: Two different causal questions&lt;/h3>
&lt;p>Before estimating the treatment effect, we need to be precise about &lt;em>which&lt;/em> causal question we are asking. There are two distinct estimands, each answering a different policy-relevant question:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Average Treatment Effect (ATE)&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What would happen if we assigned treatment to a random person from the entire population?&amp;rdquo;&lt;/em> The ATE averages the treatment effect over &lt;strong>everyone&lt;/strong> &amp;mdash; both the treated and the untreated:&lt;/li>
&lt;/ul>
&lt;p>$$\text{ATE} = E[Y(1) - Y(0)]$$&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;What was the effect of treatment for those who actually received it?&amp;rdquo;&lt;/em> The ATT averages the treatment effect only over the &lt;strong>treated&lt;/strong> subpopulation:&lt;/li>
&lt;/ul>
&lt;p>$$\text{ATT} = E[Y(1) - Y(0) \mid T = 1]$$&lt;/p>
&lt;p>The distinction matters because the people who receive treatment may differ systematically from those who do not. If the training program helps disadvantaged workers the most, and disadvantaged workers are more likely to enroll, then the ATT (the effect on those who enrolled) will be larger than the ATE (the effect if we enrolled everyone at random). Conversely, if the program is most effective for workers who are &lt;em>least&lt;/em> likely to enroll, the ATE could exceed the ATT.&lt;/p>
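&lt;p>The gap between the two estimands is easiest to see in a small simulation where, unlike in real data, both potential outcomes are visible. The sketch below is illustrative and not part of the NSW analysis: a hypothetical &lt;code>disadvantage&lt;/code> variable drives both enrollment and the size of the effect, so the ATT comes out larger than the ATE.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent "disadvantage" drives both enrollment and the size of the effect
disadvantage = rng.uniform(0, 1, n)
t = (rng.uniform(0, 1, n) < disadvantage).astype(int)  # more disadvantaged -> more likely treated

# Both potential outcomes are visible only because this is a simulation
y0 = 10 - 5 * disadvantage + rng.normal(0, 1, n)
y1 = y0 + 2 * disadvantage          # individual effect grows with disadvantage

ate = (y1 - y0).mean()              # averaged over everyone: ~ E[2U] = 1.0
att = (y1 - y0)[t == 1].mean()      # averaged over the treated only: larger

print(f"ATE = {ate:.3f}")
print(f"ATT = {att:.3f}")
```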
&lt;p>&lt;strong>In this tutorial, we estimate the ATE&lt;/strong> &amp;mdash; the average effect of the NSW job training program across the entire study population. This is the natural estimand for a randomized experiment where we want to evaluate the program&amp;rsquo;s overall impact. Four of our five estimation methods (regression adjustment, IPW, AIPW, and propensity score stratification) target the ATE directly. The exception is &lt;strong>propensity score matching&lt;/strong>, which discards unmatched control units and therefore shifts the estimand toward the ATT &amp;mdash; we flag this distinction when we discuss the matching results.&lt;/p>
&lt;h3 id="why-simple-comparisons-can-mislead">Why simple comparisons can mislead&lt;/h3>
&lt;p>A naive approach to estimating the treatment effect is to compute the difference in mean outcomes between the training and control groups. This gives us the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>:&lt;/p>
&lt;p>$$\text{ATE}_{naive} = \bar{Y}_{treated} - \bar{Y}_{control}$$&lt;/p>
&lt;p>While this is a natural starting point and is &lt;strong>unbiased in expectation&lt;/strong> under randomization, it can be imprecise when finite-sample covariate imbalances exist. Adjusting for covariates that predict the outcome can sharpen the estimate. In observational studies, the problem is more severe &amp;mdash; without adjustment, the naive estimator can be genuinely biased by confounding.&lt;/p>
&lt;pre>&lt;code class="language-python">mean_treated = df[df[TREATMENT] == 1][OUTCOME].mean()
mean_control = df[df[TREATMENT] == 0][OUTCOME].mean()
naive_ate = mean_treated - mean_control
print(f&amp;quot;Mean earnings (Training): ${mean_treated:,.2f}&amp;quot;)
print(f&amp;quot;Mean earnings (Control): ${mean_control:,.2f}&amp;quot;)
print(f&amp;quot;Naive ATE (difference): ${naive_ate:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Mean earnings (Training): $6,349.14
Mean earnings (Control): $4,554.80
Naive ATE (difference): $1,794.34
&lt;/code>&lt;/pre>
&lt;p>The naive estimate suggests that training increases earnings by \$1,794 on average. Under randomization, this estimate is unbiased in expectation, but the finite-sample covariate imbalances we observed earlier (particularly in &lt;code>nodegr&lt;/code>, &lt;code>hisp&lt;/code>, and &lt;code>educ&lt;/code>) mean that covariate adjustment can sharpen the estimate and account for chance differences between groups. This is where DoWhy&amp;rsquo;s structured framework helps &amp;mdash; it forces us to explicitly model our causal assumptions, identify the correct estimand, apply rigorous estimation methods, and test whether the results hold up under scrutiny.&lt;/p>
&lt;h2 id="step-1-model-----define-the-causal-graph">Step 1: Model &amp;mdash; Define the causal graph&lt;/h2>
&lt;p>The first step in DoWhy&amp;rsquo;s framework is to encode our &lt;strong>domain knowledge&lt;/strong> as a causal graph &amp;mdash; a Directed Acyclic Graph (DAG) that specifies which variables cause which. In our case, the covariates (age, education, race, prior earnings, etc.) are &lt;strong>common causes&lt;/strong> of both treatment assignment and the outcome. Even in a randomized experiment, these covariates predict the outcome and adjusting for them improves precision, so we include them in the model. This also makes the tutorial directly applicable to observational settings where these variables are genuine confounders.&lt;/p>
&lt;h3 id="what-is-a-dag">What is a DAG?&lt;/h3>
&lt;p>A &lt;strong>Directed Acyclic Graph&lt;/strong> is the formal language of causal inference. Each word in the name carries meaning:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Directed&lt;/strong> &amp;mdash; every edge is an arrow pointing from cause to effect. If age affects earnings, we draw an arrow from &lt;code>age&lt;/code> to &lt;code>re78&lt;/code>, never the reverse.&lt;/li>
&lt;li>&lt;strong>Acyclic&lt;/strong> &amp;mdash; there are no feedback loops. You cannot follow the arrows and return to where you started. This rules out simultaneous causation (e.g., &amp;ldquo;A causes B and B causes A at the same time&amp;rdquo;), which requires more advanced models.&lt;/li>
&lt;li>&lt;strong>Graph&lt;/strong> &amp;mdash; variables are &lt;strong>nodes&lt;/strong> (circles or squares) and causal relationships are &lt;strong>edges&lt;/strong> (arrows). The full picture is a map of which variables drive which.&lt;/li>
&lt;/ul>
&lt;p>The DAG is not a statistical model &amp;mdash; it encodes &lt;em>qualitative&lt;/em> assumptions about the data-generating process before we look at a single number. Its power lies in what it tells us about which variables to adjust for and which to leave alone.&lt;/p>
&lt;h3 id="types-of-variables-in-a-causal-graph">Types of variables in a causal graph&lt;/h3>
&lt;p>Not all variables play the same role. Understanding the three fundamental types is essential for deciding what to control for:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
C[&amp;quot;&amp;lt;b&amp;gt;Confounder&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., prior earnings)&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| T[&amp;quot;Treatment&amp;quot;]
C --&amp;gt;|&amp;quot;affects&amp;quot;| Y[&amp;quot;Outcome&amp;quot;]
T -.-&amp;gt;|&amp;quot;causal effect&amp;quot;| Y
style C fill:#00d4c8,stroke:#141413,color:#fff
style T fill:#6a9bcc,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Confounders&lt;/strong> (common causes) &amp;mdash; A variable that affects &lt;em>both&lt;/em> the treatment and the outcome. For example, prior earnings (&lt;code>re74&lt;/code>) may influence whether someone enrolls in training &lt;em>and&lt;/em> how much they earn later. Confounders create a spurious association between treatment and outcome. &lt;strong>You must adjust for confounders&lt;/strong> to isolate the causal effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph LR
T[&amp;quot;Treatment&amp;quot;] --&amp;gt;|&amp;quot;causes&amp;quot;| M[&amp;quot;&amp;lt;b&amp;gt;Mediator&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., skills)&amp;quot;]
M --&amp;gt;|&amp;quot;causes&amp;quot;| Y[&amp;quot;Outcome&amp;quot;]
style T fill:#6a9bcc,stroke:#141413,color:#fff
style M fill:#00d4c8,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Mediators&lt;/strong> &amp;mdash; A variable that lies &lt;em>on&lt;/em> the causal path from treatment to outcome. For example, if job training increases skills, and skills increase earnings, then &lt;code>skills&lt;/code> is a mediator. &lt;strong>You should NOT adjust for mediators&lt;/strong> &amp;mdash; doing so would block the very causal pathway you are trying to measure, attenuating or eliminating the estimated effect.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD
T[&amp;quot;Treatment&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| Col[&amp;quot;&amp;lt;b&amp;gt;Collider&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(e.g., in_survey)&amp;quot;]
Y[&amp;quot;Outcome&amp;quot;] --&amp;gt;|&amp;quot;affects&amp;quot;| Col
T -.-&amp;gt;|&amp;quot;causal effect&amp;quot;| Y
style T fill:#6a9bcc,stroke:#141413,color:#fff
style Col fill:#00d4c8,stroke:#141413,color:#fff
style Y fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Colliders&lt;/strong> &amp;mdash; A variable that is &lt;em>caused by&lt;/em> both the treatment and the outcome (or by variables on both sides). For example, if both training and high earnings make someone likely to appear in a follow-up survey, then &lt;code>in_survey&lt;/code> is a collider. &lt;strong>You should NOT condition on colliders&lt;/strong> &amp;mdash; doing so can create a spurious association between treatment and outcome even where none exists (a phenomenon called &lt;em>collider bias&lt;/em> or &lt;em>selection bias&lt;/em>).&lt;/li>
&lt;/ul>
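&lt;p>Collider bias is easy to demonstrate numerically. In the sketch below (synthetic data, not the Lalonde sample), the treatment has &lt;em>no&lt;/em> effect on the outcome at all, yet conditioning on a hypothetical &lt;code>in_survey&lt;/code> indicator manufactures a spurious association:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Treatment has NO effect on the outcome in this simulation
t = rng.binomial(1, 0.5, n)
y = rng.normal(0, 1, n)

# Collider: both treatment and a high outcome raise the chance of being surveyed
in_survey = rng.uniform(0, 1, n) < 1 / (1 + np.exp(-(t + y)))

naive = y[t == 1].mean() - y[t == 0].mean()  # full sample: ~0, as it should be
selected = (y[(t == 1) & in_survey].mean()
            - y[(t == 0) & in_survey].mean())  # within survey: spuriously negative

print(f"Full-sample difference:   {naive:+.3f}")
print(f"Within-survey difference: {selected:+.3f}")
```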
&lt;p>In the Lalonde dataset, all eight covariates (age, education, race, marital status, degree status, and prior earnings) are measured &lt;em>before&lt;/em> treatment assignment, so they can only be confounders &amp;mdash; they cannot be mediators or colliders. This makes the graph straightforward: every covariate points to both &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>.&lt;/p>
&lt;p>The causal structure we assume is:&lt;/p>
&lt;ul>
&lt;li>Each covariate (age, educ, black, hisp, married, nodegr, re74, re75) affects both treatment assignment and earnings&lt;/li>
&lt;li>Treatment (&lt;code>treat&lt;/code>) affects the outcome (&lt;code>re78&lt;/code>)&lt;/li>
&lt;li>No covariate is itself caused by the treatment (pre-treatment variables)&lt;/li>
&lt;/ul>
&lt;p>We now create the &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel" target="_blank" rel="noopener">&lt;code>CausalModel&lt;/code>&lt;/a> in DoWhy, specifying the treatment, outcome, and common causes. The model object stores the data, the causal graph, and metadata that DoWhy will use in subsequent steps to determine the correct adjustment strategy.&lt;/p>
&lt;pre>&lt;code class="language-python">model = CausalModel(
data=df,
treatment=TREATMENT,
outcome=OUTCOME,
common_causes=COVARIATES,
)
print(&amp;quot;CausalModel created successfully.&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>CausalModel created successfully.
&lt;/code>&lt;/pre>
&lt;p>DoWhy can visualize the causal graph it constructed using the &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.view_model" target="_blank" rel="noopener">&lt;code>view_model()&lt;/code>&lt;/a> method, which uses Graphviz to render the DAG automatically from the model&amp;rsquo;s internal graph representation:&lt;/p>
&lt;pre>&lt;code class="language-python"># Visualize the causal graph using DoWhy's built-in method
model.view_model(layout=&amp;quot;dot&amp;quot;)
from IPython.display import Image, display
display(Image(filename=&amp;quot;causal_model.png&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_causal_graph.png" alt="Causal graph generated by DoWhy showing confounders as common causes of both treatment and outcome.">&lt;/p>
&lt;p>The DAG makes our assumptions explicit: the eight covariates are common causes that affect both treatment assignment (&lt;code>treat&lt;/code>) and earnings (&lt;code>re78&lt;/code>). The arrows encode the direction of causation &amp;mdash; each confounder points to both &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>, and &lt;code>treat&lt;/code> points to &lt;code>re78&lt;/code> (the causal effect we want to estimate). By stating these assumptions as a graph, DoWhy can automatically determine which variables need to be adjusted for and which estimation strategies are valid.&lt;/p>
&lt;h2 id="step-2-identify-----find-the-causal-estimand">Step 2: Identify &amp;mdash; Find the causal estimand&lt;/h2>
&lt;p>With the causal graph defined, DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.identify_effect" target="_blank" rel="noopener">&lt;code>identify_effect()&lt;/code>&lt;/a> method uses graph theory to &lt;strong>identify&lt;/strong> the causal estimand &amp;mdash; the mathematical expression that, if computed correctly, equals the true causal effect. This step determines &lt;em>whether&lt;/em> the effect is identifiable from the data given our assumptions, and &lt;em>what&lt;/em> variables we need to condition on.&lt;/p>
&lt;h3 id="what-does-identification-mean">What does &amp;ldquo;identification&amp;rdquo; mean?&lt;/h3>
&lt;p>In causal inference, &lt;strong>identification&lt;/strong> answers a deceptively simple question: &lt;em>can we compute the causal effect from the data we have, without running a new experiment?&lt;/em> The answer is not always yes. Consider a scenario where an unmeasured variable (say, &amp;ldquo;motivation&amp;rdquo;) affects both whether someone enrolls in training and how much they earn afterward. No amount of data on age, education, and prior earnings can untangle the causal effect of training from the confounding effect of motivation &amp;mdash; the causal effect is &lt;strong>not identified&lt;/strong> without observing motivation.&lt;/p>
&lt;p>Identification is the bridge between &lt;em>causal assumptions&lt;/em> (encoded in the graph) and &lt;em>statistical computation&lt;/em> (what we can actually calculate from data). If the effect is identified, the identification step produces an &lt;strong>estimand&lt;/strong> &amp;mdash; a precise mathematical formula that tells us exactly which conditional expectations or reweightings to compute. If the effect is not identified, no estimation method can produce a credible causal estimate, no matter how sophisticated.&lt;/p>
&lt;h3 id="identification-strategies">Identification strategies&lt;/h3>
&lt;p>DoWhy checks three main strategies, each applicable in different causal structures:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">Backdoor criterion&lt;/a>&lt;/strong> &amp;mdash; The most common strategy. It applies when we can observe all confounders between treatment and outcome. By conditioning on these confounders, we &amp;ldquo;block&amp;rdquo; all backdoor paths &amp;mdash; non-causal pathways that create spurious associations. In the Lalonde example, conditioning on the eight covariates satisfies the backdoor criterion because they are the only common causes of &lt;code>treat&lt;/code> and &lt;code>re78&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_natural_experiments.html" target="_blank" rel="noopener">Instrumental variables (IV)&lt;/a>&lt;/strong> &amp;mdash; Useful when some confounders are &lt;em>unobserved&lt;/em>. An instrument is a variable that affects treatment but has &lt;em>no direct effect&lt;/em> on the outcome except through the treatment itself. For example, draft lottery numbers have been used as instruments for military service: the lottery affects whether someone serves (treatment) but has no direct effect on later earnings (outcome) except through the service itself. IV estimation requires strong assumptions but can identify causal effects when backdoor adjustment is impossible.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/index.html" target="_blank" rel="noopener">Front-door criterion&lt;/a>&lt;/strong> &amp;mdash; Applies when there is a &lt;strong>mediator&lt;/strong> that fully transmits the treatment effect and is itself unconfounded with the outcome. This strategy is rare in practice but theoretically important: it can identify causal effects even in the presence of unmeasured confounders between treatment and outcome, as long as the mediator pathway is clean.&lt;/li>
&lt;/ul>
&lt;p>A key advantage of DoWhy is that &lt;strong>it automates the identification step&lt;/strong>. Given the causal graph, DoWhy algorithmically checks which strategies are valid and returns the correct estimand. This prevents a common and dangerous mistake in applied work: manually choosing which variables to &amp;ldquo;control for&amp;rdquo; without formally checking whether the chosen adjustment set actually satisfies the conditions for causal identification.&lt;/p>
&lt;pre>&lt;code class="language-python">identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────(E[re78|educ,black,age,hisp,re75,married,re74,nodegr])
d[treat]
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78
then P(re78|treat,educ,black,age,hisp,re75,married,re74,nodegr,U)
= P(re78|treat,educ,black,age,hisp,re75,married,re74,nodegr)
&lt;/code>&lt;/pre>
&lt;p>DoWhy identifies the &lt;strong>backdoor estimand&lt;/strong> as the primary identification strategy, expressing the causal effect as the derivative of the conditional expectation of earnings with respect to treatment, conditioning on all eight covariates. The critical assumption is &lt;strong>unconfoundedness&lt;/strong> &amp;mdash; there are no unmeasured confounders beyond the ones we specified. DoWhy also checks for instrumental variable and front-door estimands but finds none applicable, which is expected given our graph structure.&lt;/p>
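&lt;p>The backdoor estimand can be computed by hand when the adjustment set is a single discrete variable. The following sketch uses a toy data-generating process (not the Lalonde data) with one binary confounder: averaging the within-&lt;code>x&lt;/code> treated-vs-control contrasts over the distribution of &lt;code>x&lt;/code> recovers the true effect that the raw difference in means overstates.&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000

# One binary confounder x raises both the chance of treatment and the outcome
x = rng.binomial(1, 0.5, n)
t = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
y = 2.0 * t + 3.0 * x + rng.normal(0, 1, n)  # true effect of T on Y is 2.0

df = pd.DataFrame({"x": x, "t": t, "y": y})

# Raw difference in means is confounded by x
raw = df.loc[df.t == 1, "y"].mean() - df.loc[df.t == 0, "y"].mean()

# Backdoor adjustment: contrast T within each level of x, then average over P(x)
means = df.groupby(["x", "t"])["y"].mean().unstack()  # rows: x, columns: t
effects = means[1] - means[0]
weights = df["x"].value_counts(normalize=True).sort_index()
adjusted = (effects * weights).sum()

print(f"Raw difference:    {raw:.3f}")      # biased upward, ~3.8
print(f"Backdoor adjusted: {adjusted:.3f}") # ~2.0, the true effect
```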
&lt;h2 id="step-3-estimate-----compute-the-causal-effect">Step 3: Estimate &amp;mdash; Compute the causal effect&lt;/h2>
&lt;p>With the estimand identified, we now use &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.estimate_effect" target="_blank" rel="noopener">&lt;code>estimate_effect()&lt;/code>&lt;/a> to compute the actual causal effect estimate. DoWhy supports multiple estimation methods, each with different assumptions and properties. We compare five approaches to see how robust the estimate is across methods.&lt;/p>
&lt;p>Causal estimation methods fall into &lt;strong>three broad paradigms&lt;/strong>, distinguished by what they model:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Outcome modeling&lt;/strong> (Regression Adjustment) &amp;mdash; directly models the relationship $E[Y \mid X, T]$ between covariates, treatment, and outcome. Its validity depends on correctly specifying this outcome model.&lt;/li>
&lt;li>&lt;strong>Treatment modeling&lt;/strong> (IPW, PS Stratification, PS Matching) &amp;mdash; models the treatment assignment mechanism $P(T \mid X)$ (the propensity score) and uses it to remove confounding. All three methods rely exclusively on the propensity score &amp;mdash; they differ in &lt;em>how&lt;/em> they use it (reweighting, grouping, or pairing observations) but none of them model the outcome. Their validity depends on correctly specifying the propensity score model.&lt;/li>
&lt;li>&lt;strong>Doubly robust&lt;/strong> (AIPW) &amp;mdash; the only true hybrid. It explicitly combines an outcome model $E[Y \mid X, T]$ with a propensity score model $P(T \mid X)$, and is consistent if &lt;em>either&lt;/em> model is correctly specified. This &amp;ldquo;double protection&amp;rdquo; is why it is called doubly robust.&lt;/li>
&lt;/ol>
&lt;p>The following diagram shows how these paradigms relate to the five methods we will apply:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
Root[&amp;quot;&amp;lt;b&amp;gt;Estimation Methods&amp;lt;/b&amp;gt;&amp;quot;] --&amp;gt; OM[&amp;quot;&amp;lt;b&amp;gt;Outcome Modeling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models E[Y | X, T]&amp;lt;/i&amp;gt;&amp;quot;]
Root --&amp;gt; TM[&amp;quot;&amp;lt;b&amp;gt;Treatment Modeling&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models P(T | X)&amp;lt;/i&amp;gt;&amp;quot;]
Root --&amp;gt; DR_cat[&amp;quot;&amp;lt;b&amp;gt;Doubly Robust&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Models both E[Y | X, T]&amp;lt;br/&amp;gt;and P(T | X)&amp;lt;/i&amp;gt;&amp;quot;]
OM --&amp;gt; RA[&amp;quot;Regression&amp;lt;br/&amp;gt;Adjustment&amp;quot;]
TM --&amp;gt; IPW[&amp;quot;Inverse Probability&amp;lt;br/&amp;gt;Weighting&amp;quot;]
TM --&amp;gt; PSS[&amp;quot;PS&amp;lt;br/&amp;gt;Stratification&amp;quot;]
TM --&amp;gt; PSM[&amp;quot;PS&amp;lt;br/&amp;gt;Matching&amp;quot;]
DR_cat --&amp;gt; DR[&amp;quot;AIPW&amp;quot;]
style Root fill:#141413,stroke:#141413,color:#fff
style OM fill:#6a9bcc,stroke:#141413,color:#fff
style TM fill:#d97757,stroke:#141413,color:#fff
style DR_cat fill:#00d4c8,stroke:#141413,color:#fff
style RA fill:#6a9bcc,stroke:#141413,color:#fff
style IPW fill:#d97757,stroke:#141413,color:#fff
style PSS fill:#d97757,stroke:#141413,color:#fff
style PSM fill:#d97757,stroke:#141413,color:#fff
style DR fill:#00d4c8,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>Understanding these paradigms helps clarify why different methods can give somewhat different estimates and why comparing across paradigms is a powerful robustness check. The key trade-offs are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What each paradigm models&lt;/strong>: Outcome modeling specifies how covariates relate to earnings ($E[Y \mid X, T]$). Treatment modeling specifies how covariates relate to treatment assignment ($P(T \mid X)$) &amp;mdash; all three PS methods use this same propensity score but differ in how they apply it. Doubly robust specifies both models simultaneously.&lt;/li>
&lt;li>&lt;strong>What each paradigm assumes&lt;/strong>: Regression adjustment requires the outcome model to be correctly specified. All three propensity score methods (IPW, stratification, matching) require the propensity score model to be correctly specified. Doubly robust only requires &lt;em>one&lt;/em> of the two to be correct.&lt;/li>
&lt;li>&lt;strong>Bias-variance characteristics&lt;/strong>: Regression adjustment tends to be low-variance but can be biased if the outcome-covariate relationship is nonlinear. IPW can have high variance when propensity scores are extreme (near 0 or 1). Stratification and matching use the propensity score more conservatively &amp;mdash; by grouping or pairing rather than directly reweighting &amp;mdash; which can reduce variance relative to IPW. Doubly robust balances both concerns but is more complex to implement.&lt;/li>
&lt;/ul>
&lt;p>The three treatment modeling methods differ in &lt;em>how&lt;/em> they use the propensity score to create balanced comparisons:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IPW&lt;/strong> reweights every observation by the inverse of its propensity score, creating a pseudo-population where treatment is independent of covariates. It uses the full sample but can be unstable when propensity scores are near 0 or 1.&lt;/li>
&lt;li>&lt;strong>PS Stratification&lt;/strong> divides observations into groups (strata) with similar propensity scores, then computes simple mean differences within each stratum. By comparing treated and control units within the same stratum, it approximates a block-randomized experiment.&lt;/li>
&lt;li>&lt;strong>PS Matching&lt;/strong> pairs each treated unit with the control unit that has the most similar propensity score, then computes mean differences within matched pairs. It discards unmatched observations, focusing on the closest comparisons at the cost of reduced sample size.&lt;/li>
&lt;/ul>
&lt;p>None of these methods model the outcome &amp;mdash; they all achieve confounding adjustment purely through the propensity score. If outcome modeling and treatment modeling agree, we can be more confident that neither model is badly misspecified.&lt;/p>
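&lt;p>To make stratification concrete, here is a minimal sketch on synthetic data where the true propensity score is known (in real applications it must be estimated, e.g. with a logistic regression). Grouping observations into propensity score quintiles and averaging the within-stratum mean differences removes most, though not all, of the confounding:&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 50_000

# Confounder x, known propensity score, treatment, and outcome (true ATE = 1.5)
x = rng.normal(0, 1, n)
e = 1 / (1 + np.exp(-x))          # true P(T=1 | X); estimated in real data
t = rng.binomial(1, e)
y = 1.5 * t + 1.0 * x + rng.normal(0, 1, n)

df = pd.DataFrame({"e": e, "t": t, "y": y})
df["stratum"] = pd.qcut(df["e"], q=5, labels=False)  # quintiles of the PS

# Compare treated and control means within each stratum, then average
means = df.groupby(["stratum", "t"])["y"].mean().unstack()
effects = means[1] - means[0]
sizes = df["stratum"].value_counts(normalize=True).sort_index()
ps_strat = (effects * sizes).sum()

raw = df.loc[df.t == 1, "y"].mean() - df.loc[df.t == 0, "y"].mean()
print(f"Raw difference:        {raw:.3f}")       # confounded
print(f"PS stratification ATE: {ps_strat:.3f}")  # much closer to 1.5
```

&lt;p>With only five strata, a little residual within-stratum confounding remains; more strata, or the reweighting and pairing variants, trade this bias against variance in different ways.&lt;/p>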
&lt;h3 id="method-1-regression-adjustment">Method 1: Regression Adjustment&lt;/h3>
&lt;p>Regression adjustment is grounded in the &lt;strong>potential outcomes framework&lt;/strong>: each individual has two potential outcomes &amp;mdash; $Y(1)$ if treated and $Y(0)$ if not &amp;mdash; and the causal effect is their difference. Since we only observe one outcome per person, regression adjustment estimates both potential outcomes by modeling $E[Y \mid X, T]$, the conditional expectation of the outcome given covariates and treatment status. The treatment effect is the coefficient on the treatment indicator, which captures the difference in expected outcomes between treated and control units &lt;strong>at the same covariate values&lt;/strong> &amp;mdash; effectively comparing apples to apples.&lt;/p>
&lt;p>The key assumption is that the outcome model must be &lt;strong>correctly specified&lt;/strong>. If the true relationship between covariates and the outcome is nonlinear or includes interactions, a simple linear model will produce biased estimates. In econometrics, this approach is closely related to the &lt;strong>&lt;a href="https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem" target="_blank" rel="noopener">Frisch-Waugh-Lovell theorem&lt;/a>&lt;/strong>, which shows that the treatment coefficient in a multiple regression is identical to what you would get by first partialing out the covariates from both the treatment and the outcome, then regressing the residuals on each other. This makes regression adjustment the simplest and most transparent baseline estimator.&lt;/p>
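&lt;p>The Frisch-Waugh-Lovell equivalence can be verified numerically. In this sketch (synthetic data, plain NumPy least squares rather than DoWhy), the treatment coefficient from the full multiple regression matches the slope from regressing the residualized outcome on the residualized treatment:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Synthetic covariates, treatment correlated with them, linear outcome
X = rng.normal(0, 1, (n, 3))
t = (X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 1, n) > 0).astype(float)
y = 2.0 * t + X @ np.array([1.0, 0.5, -1.0]) + rng.normal(0, 1, n)

def ols(A, b):
    """Least-squares coefficients of b on A (with intercept prepended)."""
    A1 = np.column_stack([np.ones(len(A)), A])
    return np.linalg.lstsq(A1, b, rcond=None)[0]

# (1) Full multiple regression: coefficient on t (index 0 is the intercept)
beta_full = ols(np.column_stack([t, X]), y)[1]

# (2) FWL: partial X out of both t and y, then regress residual on residual
t_res = t - np.column_stack([np.ones(n), X]) @ ols(X, t)
y_res = y - np.column_stack([np.ones(n), X]) @ ols(X, y)
beta_fwl = (t_res @ y_res) / (t_res @ t_res)

print(f"Full regression: {beta_full:.6f}")
print(f"FWL two-step:    {beta_fwl:.6f}")  # identical up to rounding
```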
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.linear_regression&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ra = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.linear_regression&amp;quot;,
confidence_intervals=True,
)
print(f&amp;quot;Estimated ATE (Regression Adjustment): ${estimate_ra.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (Regression Adjustment): $1,676.34
&lt;/code>&lt;/pre>
&lt;p>The regression adjustment estimate is \$1,676, slightly lower than the naive difference of \$1,794. The reduction from \$1,794 to \$1,676 reflects the covariate adjustment &amp;mdash; by accounting for finite-sample imbalances in age, education, race, and prior earnings, the estimated treatment effect shrinks by about \$118. In this randomized setting, the adjustment primarily improves precision rather than removing bias, but the same technique is essential in observational studies where confounding is a genuine concern.&lt;/p>
&lt;h3 id="method-2-inverse-probability-weighting-ipw">Method 2: Inverse Probability Weighting (IPW)&lt;/h3>
&lt;p>IPW takes a fundamentally different approach from regression adjustment. Instead of modeling the outcome, it models the &lt;strong>treatment assignment mechanism&lt;/strong>. The central concept is the &lt;strong>propensity score&lt;/strong>, $e(X) = P(T = 1 \mid X)$ &amp;mdash; the probability that a unit receives treatment given its observed covariates. A person with a propensity score of 0.8 has an 80% chance of being treated based on their characteristics; a person with a score of 0.2 has only a 20% chance.&lt;/p>
&lt;p>The key intuition behind inverse weighting is that &lt;strong>units who are unlikely to receive the treatment they actually received carry more information&lt;/strong> about the causal effect. Consider a treated individual with a low propensity score (say 0.1) &amp;mdash; this person was unlikely to be treated, yet was treated. Their outcome is especially informative because they are &amp;ldquo;similar&amp;rdquo; to the control group in all observable respects. IPW upweights such surprising cases by assigning them a weight of $1/e(X) = 10$, while a treated person with $e(X) = 0.9$ receives a weight of only $1/0.9 \approx 1.1$. This reweighting creates a &amp;ldquo;pseudo-population&amp;rdquo; in which treatment assignment is independent of the observed confounders, mimicking what a randomized experiment would look like.&lt;/p>
&lt;p>A critical contrast with regression adjustment: IPW makes &lt;strong>no assumptions about how covariates relate to the outcome&lt;/strong> &amp;mdash; it only requires that the propensity score model is correctly specified. However, IPW has a key vulnerability: when propensity scores are extreme (near 0 or 1), the inverse weights become very large, producing &lt;strong>unstable estimates with high variance&lt;/strong>. This is why practitioners often use weight trimming or stabilized weights in practice.&lt;/p>
&lt;p>The IPW estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{T_i Y_i}{\hat{e}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{e}(X_i)} \right]$$&lt;/p>
&lt;p>where $\hat{e}(X_i)$ is the estimated propensity score for individual $i$.&lt;/p>
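&lt;p>Before handing the computation to DoWhy, it is worth seeing how little code this estimator actually requires. The sketch below is a hand-rolled version on synthetic data (the variables and the true effect of 1.5 are invented for illustration): fit a logistic propensity model, clip the estimated scores away from 0 and 1 as a crude form of trimming, and average the inverse-weighted outcomes:&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first covariate drives both treatment and outcome (true ATE = 1.5)
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.5 * T + X[:, 0] + rng.normal(size=n)

# Fit the propensity score model and clip extreme scores (simple trimming)
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)

# Horvitz-Thompson inverse-probability estimator
tau_ipw = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(tau_ipw)  # close to the true effect of 1.5
```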
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_weighting&lt;/code>&lt;/a> method, which implements the &lt;a href="https://doi.org/10.1080/01621459.1952.10483446" target="_blank" rel="noopener">Horvitz-Thompson&lt;/a> inverse probability estimator:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ipw = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.propensity_score_weighting&amp;quot;,
method_params={&amp;quot;weighting_scheme&amp;quot;: &amp;quot;ips_weight&amp;quot;},
)
print(f&amp;quot;Estimated ATE (IPW): ${estimate_ipw.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (IPW): $1,559.47
&lt;/code>&lt;/pre>
&lt;p>The IPW estimate of \$1,559 is the lowest among all methods. IPW is sensitive to extreme propensity scores &amp;mdash; when some individuals have very high or very low probabilities of treatment, their weights become large and can dominate the estimate. In this dataset, the estimated propensity scores are reasonably well-behaved (the NSW was a randomized experiment), so the IPW estimate remains in the plausible range. The difference from the regression adjustment (\$1,676 vs \$1,559) reflects the fact that IPW makes no assumptions about the outcome model, relying entirely on correct specification of the propensity score model.&lt;/p>
&lt;h3 id="method-3-doubly-robust-aipw">Method 3: Doubly Robust (AIPW)&lt;/h3>
&lt;p>The &lt;strong>doubly robust&lt;/strong> estimator &amp;mdash; also called &lt;strong>Augmented Inverse Probability Weighting (AIPW)&lt;/strong> &amp;mdash; combines both regression adjustment and IPW into a single estimator. The key advantage is that the estimate is consistent if &lt;em>either&lt;/em> the outcome model &lt;em>or&lt;/em> the propensity score model is correctly specified (hence &amp;ldquo;doubly robust&amp;rdquo;). This provides an important safeguard against model misspecification.&lt;/p>
&lt;p>The intuition is straightforward: AIPW starts with the regression adjustment estimate ($\hat{\mu}_1(X) - \hat{\mu}_0(X)$, the difference in predicted outcomes under treatment and control) and then &lt;strong>adds a correction term&lt;/strong> based on the IPW-weighted prediction errors. If the outcome model is perfectly specified, the prediction errors $Y - \hat{\mu}(X)$ are pure noise and the correction averages to zero &amp;mdash; the regression adjustment alone does the work. If the outcome model is misspecified but the propensity score model is correct, the IPW-weighted correction term exactly compensates for the bias in the outcome predictions. This is why the estimator only needs &lt;strong>one&lt;/strong> of the two models to be correct &amp;mdash; whichever model is right &amp;ldquo;rescues&amp;rdquo; the other.&lt;/p>
&lt;p>Beyond its robustness property, AIPW achieves the &lt;strong>semiparametric efficiency bound&lt;/strong> when both models are correctly specified, meaning no other estimator that makes the same assumptions can have lower variance. This makes it a natural default choice in modern causal inference.&lt;/p>
&lt;p>The AIPW estimator is:&lt;/p>
&lt;p>$$\hat{\tau}_{DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$$&lt;/p>
&lt;p>where $\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ are the predicted outcomes under treatment and control, and $\hat{e}(X_i)$ is the propensity score.&lt;/p>
&lt;p>We implement the AIPW estimator manually rather than using DoWhy&amp;rsquo;s built-in &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/index.html" target="_blank" rel="noopener">&lt;code>backdoor.doubly_robust&lt;/code>&lt;/a> method, which has a known compatibility issue with recent scikit-learn versions. The manual implementation uses &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" rel="noopener">&lt;code>LogisticRegression&lt;/code>&lt;/a> for the propensity score model and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" target="_blank" rel="noopener">&lt;code>LinearRegression&lt;/code>&lt;/a> for the outcome model, making the estimator&amp;rsquo;s two-component structure fully transparent.&lt;/p>
&lt;pre>&lt;code class="language-python"># Doubly Robust (AIPW) — manual implementation
ps_model = LogisticRegression(max_iter=1000, random_state=42)
ps_model.fit(df[COVARIATES], df[TREATMENT])
ps = ps_model.predict_proba(df[COVARIATES])[:, 1]
outcome_model_1 = SklearnLR().fit(df[df[TREATMENT] == 1][COVARIATES], df[df[TREATMENT] == 1][OUTCOME])
outcome_model_0 = SklearnLR().fit(df[df[TREATMENT] == 0][COVARIATES], df[df[TREATMENT] == 0][OUTCOME])
mu1 = outcome_model_1.predict(df[COVARIATES])
mu0 = outcome_model_0.predict(df[COVARIATES])
T = df[TREATMENT].values
Y = df[OUTCOME].values
dr_ate = np.mean(
(mu1 - mu0)
+ T * (Y - mu1) / ps
- (1 - T) * (Y - mu0) / (1 - ps)
)
print(f&amp;quot;Estimated ATE (Doubly Robust): ${dr_ate:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (Doubly Robust): $1,620.04
&lt;/code>&lt;/pre>
&lt;p>The doubly robust estimate of \$1,620 falls between the regression adjustment (\$1,676) and IPW (\$1,559) estimates. This reflects how the AIPW estimator works: it uses the outcome model as its primary estimate and adds an IPW-weighted correction based on the prediction residuals. The fact that it is close to both individual estimates suggests that neither model is severely misspecified. In practice, the doubly robust estimator is often the preferred choice because it provides insurance against misspecification of either component model.&lt;/p>
&lt;h3 id="method-4-propensity-score-stratification">Method 4: Propensity Score Stratification&lt;/h3>
&lt;p>Propensity score stratification builds on a powerful result from &lt;a href="https://doi.org/10.1093/biomet/70.1.41" target="_blank" rel="noopener">Rosenbaum and Rubin (1983)&lt;/a>: &lt;strong>conditioning on the scalar propensity score is sufficient to remove all confounding from observed covariates&lt;/strong>, even though the score compresses multiple covariates into a single number. This means that within a group of individuals who all have similar propensity scores, treatment assignment is effectively random with respect to the observed confounders &amp;mdash; just as in a randomized experiment.&lt;/p>
&lt;p>Stratification is a &lt;strong>discrete approximation&lt;/strong> to this idea. Instead of conditioning on the exact propensity score (which would require infinite data), we bin observations into a small number of strata &amp;mdash; typically 5 quintiles. Within each stratum, treated and control individuals have similar propensity scores and are therefore more comparable, so the within-stratum treatment effect is less confounded. The overall ATE is a weighted average of these stratum-specific effects. A classic result from &lt;a href="https://doi.org/10.2307/2528036" target="_blank" rel="noopener">Cochran (1968)&lt;/a> shows that &lt;strong>5 strata typically remove over 90% of the bias&lt;/strong> from observed confounders, making this a surprisingly effective yet simple approach.&lt;/p>
&lt;p>There is a practical trade-off in choosing the number of strata: more strata produce finer covariate balance within each group, reducing bias, but also leave fewer observations per stratum, increasing variance. Five strata is the conventional choice, balancing these considerations well.&lt;/p>
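&lt;p>A minimal hand-rolled version of the stratification estimator makes the mechanics concrete. The data below are synthetic, with an invented true effect of 1.0 rather than the NSW sample: bin units into propensity-score quintiles with &lt;code>pd.qcut&lt;/code>, take the treated-minus-control difference in means within each stratum, and average the differences weighted by stratum size:&lt;/p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: one confounder x drives both treatment and outcome (true effect = 1.0)
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 1.0 * t + x + rng.normal(size=n)

ps = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]
df = pd.DataFrame({"t": t, "y": y, "stratum": pd.qcut(ps, q=5, labels=False)})

# Weighted average of within-stratum differences in means
tau_strat = 0.0
for _, g in df.groupby("stratum"):
    diff = g.loc[g["t"] == 1, "y"].mean() - g.loc[g["t"] == 0, "y"].mean()
    tau_strat += diff * len(g) / len(df)

naive = y[t == 1].mean() - y[t == 0].mean()
print(naive, tau_strat)  # stratification removes most of the confounding bias
```

&lt;p>In this simulation the naive difference in means is badly biased upward, while the five-stratum estimate lands close to the true effect &amp;mdash; a small-scale illustration of Cochran&amp;rsquo;s result.&lt;/p>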
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_stratification&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ps_strat = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.propensity_score_stratification&amp;quot;,
method_params={&amp;quot;num_strata&amp;quot;: 5, &amp;quot;clipping_threshold&amp;quot;: 5},
)
print(f&amp;quot;Estimated ATE (PS Stratification): ${estimate_ps_strat.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (PS Stratification): $1,617.07
&lt;/code>&lt;/pre>
&lt;p>Propensity score stratification with 5 strata estimates the ATE at \$1,617, very close to the doubly robust estimate (\$1,620). The stratification approach is more flexible than regression adjustment because it does not impose a functional form on the outcome-covariate relationship. The estimate is in the same ballpark as the other adjusted results, which is reassuring &amp;mdash; multiple methods agree that the training effect is in the \$1,550&amp;ndash;\$1,700 range.&lt;/p>
&lt;h3 id="method-5-propensity-score-matching">Method 5: Propensity Score Matching&lt;/h3>
&lt;p>Propensity score matching constructs a comparison group by finding, for each treated individual, the control individual(s) with the most similar propensity score. The treatment effect is then estimated by comparing outcomes within these matched pairs. This is conceptually the most intuitive approach &amp;mdash; it directly mimics what we would see if we could compare individuals who are identical except for their treatment status.&lt;/p>
&lt;p>An important subtlety is that matching typically &lt;strong>discards unmatched control units&lt;/strong> &amp;mdash; those with no treated counterpart nearby in propensity score space. This means the estimand shifts from the &lt;strong>Average Treatment Effect (ATE)&lt;/strong> toward the &lt;strong>Average Treatment Effect on the Treated (ATT)&lt;/strong>, which answers a slightly different question: &amp;ldquo;What was the effect of treatment for those who were actually treated?&amp;rdquo; rather than &amp;ldquo;What would the effect be if we treated everyone?&amp;rdquo;&lt;/p>
&lt;p>Several practical choices affect matching quality. &lt;strong>With-replacement&lt;/strong> matching allows each control to be matched to multiple treated units, reducing bias but increasing variance. &lt;strong>1:k matching&lt;/strong> uses $k$ nearest controls per treated unit, averaging out noise but potentially introducing worse matches. &lt;strong>Caliper restrictions&lt;/strong> discard matches where the propensity score difference exceeds a threshold, preventing poor matches at the cost of losing some treated observations. These choices create a fundamental &lt;strong>bias-variance trade-off&lt;/strong>: tighter matching criteria reduce bias from imperfect comparisons but may discard many observations, increasing the variance of the estimate.&lt;/p>
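&lt;p>A bare-bones sketch of 1:1 nearest-neighbor matching with replacement and a caliper, again on synthetic data with an invented true effect of 1.0 (the caliper of 0.05 is an arbitrary choice for the example), looks like this:&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one confounder x drives both treatment and outcome (true effect = 1.0)
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 1.0 * t + x + rng.normal(size=n)

ps = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]
treated, controls = np.where(t == 1)[0], np.where(t == 0)[0]

caliper = 0.05  # maximum allowed propensity-score distance
diffs = []
for i in treated:
    # nearest control by propensity score (matching with replacement)
    j = controls[np.argmin(np.abs(ps[controls] - ps[i]))]
    if abs(ps[j] - ps[i]) <= caliper:
        diffs.append(y[i] - y[j])

att = np.mean(diffs)
print(f"ATT estimate: {att:.2f} from {len(diffs)} matched pairs")
```

&lt;p>Note that the estimand here is the ATT: unmatched controls never enter the average, and treated units outside the caliper are dropped.&lt;/p>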
&lt;p>We use DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor.html" target="_blank" rel="noopener">&lt;code>backdoor.propensity_score_matching&lt;/code>&lt;/a> method:&lt;/p>
&lt;pre>&lt;code class="language-python">estimate_ps_match = model.estimate_effect(
identified_estimand,
method_name=&amp;quot;backdoor.propensity_score_matching&amp;quot;,
)
print(f&amp;quot;Estimated ATE (PS Matching): ${estimate_ps_match.value:,.2f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Estimated ATE (PS Matching): $1,735.69
&lt;/code>&lt;/pre>
&lt;p>Propensity score matching estimates the effect at \$1,736, the highest of the five adjusted estimates and closest to the naive difference. Matching tends to give slightly different results because it uses only the closest comparisons rather than the full sample. As noted above, this estimate is closer to the &lt;strong>ATT&lt;/strong> than the ATE, so it answers a slightly different question than the other four methods &amp;mdash; readers should keep this distinction in mind when comparing across estimators. The fact that all five methods produce estimates between \$1,559 and \$1,736 provides strong evidence that the treatment effect is real and robust to the choice of estimation method.&lt;/p>
&lt;h2 id="step-4-refute-----test-robustness">Step 4: Refute &amp;mdash; Test robustness&lt;/h2>
&lt;p>The final and perhaps most valuable step in DoWhy&amp;rsquo;s framework is &lt;strong>refutation&lt;/strong> &amp;mdash; systematically testing whether the estimated causal effect is robust to violations of our assumptions. DoWhy&amp;rsquo;s &lt;a href="https://www.pywhy.org/dowhy/v0.11.1/dowhy.html#dowhy.causal_model.CausalModel.refute_estimate" target="_blank" rel="noopener">&lt;code>refute_estimate()&lt;/code>&lt;/a> method provides several built-in refutation tests, each probing a different potential weakness.&lt;/p>
&lt;h3 id="why-refutation-matters">Why refutation matters&lt;/h3>
&lt;p>Most causal inference workflows stop after estimation: you run a regression, get a coefficient, and report it as the causal effect. DoWhy&amp;rsquo;s refutation step is its key innovation &amp;mdash; it provides &lt;strong>automated falsification tests&lt;/strong> that probe whether the estimate could be an artifact of the model, the data, or violated assumptions. This is the causal inference equivalent of &amp;ldquo;stress testing&amp;rdquo;: if the estimate survives multiple attempts to break it, we can be more confident that it reflects a genuine causal relationship.&lt;/p>
&lt;p>DoWhy&amp;rsquo;s refutation tests fall into three categories, each targeting a different potential weakness:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Placebo tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If the treatment doesn&amp;rsquo;t matter, does the effect disappear?&amp;rdquo;&lt;/em> These tests replace the real treatment with a fake (randomly permuted) treatment. If the estimated effect drops to near zero, the original result is tied to the actual treatment rather than being a statistical artifact of the model or data structure.&lt;/li>
&lt;li>&lt;strong>Sensitivity tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If we missed a confounder, does the estimate change?&amp;rdquo;&lt;/em> These tests add a randomly generated variable as an additional confounder. If the estimate barely changes, it suggests the result is not fragile &amp;mdash; adding one more covariate does not destabilize it. This provides indirect evidence (though not proof) that unobserved confounders may not be a major concern.&lt;/li>
&lt;li>&lt;strong>Stability tests&lt;/strong> &amp;mdash; &lt;em>&amp;ldquo;If we use different data, does the estimate hold?&amp;rdquo;&lt;/em> These tests re-estimate the effect on random subsets of the data. If the estimate fluctuates wildly, it may depend on a few influential observations rather than reflecting a stable population-level effect.&lt;/li>
&lt;/ul>
&lt;p>An important caveat: &lt;strong>passing all refutation tests does not prove causation&lt;/strong>. The tests can only detect certain types of problems &amp;mdash; they cannot rule out every possible source of bias. However, &lt;strong>failing any test is a strong signal that something is wrong&lt;/strong> and warrants further investigation before drawing causal conclusions.&lt;/p>
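&lt;p>The logic of a placebo test can be sketched in a few lines without DoWhy: estimate the effect with the real treatment, then repeat with randomly permuted treatment labels and check that the fake effect collapses toward zero. The data below are synthetic, with an invented true effect of 1.0:&lt;/p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one confounder x, binary treatment t, true effect = 1.0
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 1.0 * t + x + rng.normal(size=n)

def ate_regression(t, y, x):
    """Treatment coefficient from a linear regression of y on (t, x)."""
    return LinearRegression().fit(np.column_stack([t, x]), y).coef_[0]

real = ate_regression(t, y, x)
# Placebo: permute the treatment labels and re-estimate, averaged over 100 draws
placebo = np.mean([ate_regression(rng.permutation(t), y, x) for _ in range(100)])
print(real, placebo)  # real is near 1.0; the placebo effect is near zero
```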
&lt;h3 id="placebo-treatment-test">Placebo Treatment Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/placebo_treatment.html" target="_blank" rel="noopener">placebo test&lt;/a> replaces the actual treatment with a randomly permuted version. If our estimate is truly capturing a causal effect, this fake treatment should produce an effect near zero. A large p-value indicates that the placebo effect is not significantly different from zero, confirming that the real treatment drives the original estimate.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_placebo = model.refute_estimate(
identified_estimand,
estimate_ra,
method_name=&amp;quot;placebo_treatment_refuter&amp;quot;,
placebo_type=&amp;quot;permute&amp;quot;,
num_simulations=100,
)
print(refute_placebo)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Use a Placebo Treatment
Estimated effect:1676.3426437675835
New effect:61.821946542496946
p value:0.92
&lt;/code>&lt;/pre>
&lt;p>The placebo treatment test produces a new effect of approximately \$62, which is close to zero and dramatically smaller than the original estimate of \$1,676. The high p-value (0.92) indicates that the placebo effect is not statistically distinguishable from zero, which is exactly what we should see if the original estimate reflects the actual treatment. This is strong evidence that the estimated effect is not an artifact of the model or data structure.&lt;/p>
&lt;h3 id="random-common-cause-test">Random Common Cause Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/random_common_cause.html" target="_blank" rel="noopener">random common cause test&lt;/a> adds a randomly generated confounder to the model and checks whether the estimate changes. If our model is correctly specified and the estimate is robust, adding a random variable should not significantly alter the result.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_random = model.refute_estimate(
identified_estimand,
estimate_ra,
method_name=&amp;quot;random_common_cause&amp;quot;,
num_simulations=100,
)
print(refute_random)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Add a random common cause
Estimated effect:1676.3426437675835
New effect:1675.606781672203
p value:0.9
&lt;/code>&lt;/pre>
&lt;p>Adding a random common cause barely changes the estimate: from \$1,676.34 to \$1,675.61, a difference of less than \$1. The high p-value (0.90) confirms that the original estimate is stable when an additional (irrelevant) confounder is introduced. This suggests that the model is not overly sensitive to the specific set of confounders included.&lt;/p>
&lt;h3 id="data-subset-test">Data Subset Test&lt;/h3>
&lt;p>The &lt;a href="https://www.pywhy.org/dowhy/v0.14/user_guide/refuting_causal_estimates/refuting_effect_estimates/data_subsample.html" target="_blank" rel="noopener">data subset test&lt;/a> re-estimates the effect on random 80% subsamples of the data. If the estimate is robust, it should remain similar across different subsets. Large fluctuations would suggest that the result depends on a few influential observations.&lt;/p>
&lt;pre>&lt;code class="language-python">refute_subset = model.refute_estimate(
identified_estimand,
estimate_ra,
method_name=&amp;quot;data_subset_refuter&amp;quot;,
subset_fraction=0.8,
num_simulations=100,
)
print(refute_subset)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>Refute: Use a subset of data
Estimated effect:1676.3426437675835
New effect:1727.583871150809
p value:0.8
&lt;/code>&lt;/pre>
&lt;p>The data subset refuter produces a mean effect of \$1,728 across 100 random subsamples, close to the full-sample estimate of \$1,676. The high p-value (0.80) indicates that the estimate is stable across subsets and does not depend on a handful of outlier observations. The slight increase in the subsample estimate (\$1,728 vs \$1,676) reflects normal sampling variability.&lt;/p>
&lt;h2 id="comparing-all-estimates">Comparing all estimates&lt;/h2>
&lt;p>To visualize how all estimation approaches compare, we plot the ATE estimates side by side. Consistent estimates across different methods strengthen confidence in the causal conclusion.&lt;/p>
&lt;pre>&lt;code class="language-python">fig, ax = plt.subplots(figsize=(9, 6))
methods = [&amp;quot;Naive\n(Diff. in Means)&amp;quot;, &amp;quot;Regression\nAdjustment&amp;quot;, &amp;quot;IPW&amp;quot;,
&amp;quot;Doubly Robust\n(AIPW)&amp;quot;, &amp;quot;PS\nStratification&amp;quot;, &amp;quot;PS\nMatching&amp;quot;]
estimates = [naive_ate, estimate_ra.value, estimate_ipw.value,
dr_ate, estimate_ps_strat.value, estimate_ps_match.value]
colors = [&amp;quot;#999999&amp;quot;, &amp;quot;#6a9bcc&amp;quot;, &amp;quot;#d97757&amp;quot;, &amp;quot;#00d4c8&amp;quot;, &amp;quot;#e8956a&amp;quot;, &amp;quot;#c4623d&amp;quot;]
bars = ax.barh(methods, estimates, color=colors, edgecolor=&amp;quot;white&amp;quot;, height=0.6)
for bar, val in zip(bars, estimates):
ax.text(val + 50, bar.get_y() + bar.get_height() / 2,
f&amp;quot;${val:,.0f}&amp;quot;, va=&amp;quot;center&amp;quot;, fontsize=10, color=&amp;quot;#141413&amp;quot;)
ax.axvline(0, color=&amp;quot;black&amp;quot;, linewidth=0.5, linestyle=&amp;quot;--&amp;quot;)
ax.set_xlabel(&amp;quot;Estimated Average Treatment Effect (USD)&amp;quot;)
ax.set_title(&amp;quot;Causal Effect Estimates: NSW Job Training on 1978 Earnings&amp;quot;)
plt.savefig(&amp;quot;dowhy_estimate_comparison.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dowhy_estimate_comparison.png" alt="Comparison of ATE estimates across six methods.">&lt;/p>
&lt;p>All six methods produce positive estimates between \$1,559 and \$1,794, indicating that the NSW job training program increased participants&amp;rsquo; 1978 earnings by roughly \$1,550&amp;ndash;\$1,800. The five adjusted methods cluster between \$1,559 and \$1,736, suggesting that about \$58&amp;ndash;\$235 of the naive estimate was due to finite-sample covariate imbalances rather than the treatment. The convergence across fundamentally different estimation strategies &amp;mdash; outcome modeling (regression adjustment), treatment modeling (IPW, stratification, matching), and doubly robust (AIPW) &amp;mdash; is strong evidence that the effect is real.&lt;/p>
&lt;h2 id="summary-table">Summary table&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Estimated ATE&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive (Difference in Means)&lt;/td>
&lt;td>\$1,794&lt;/td>
&lt;td>No covariate adjustment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Regression Adjustment&lt;/td>
&lt;td>\$1,676&lt;/td>
&lt;td>Models outcome, assumes linearity&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IPW&lt;/td>
&lt;td>\$1,559&lt;/td>
&lt;td>Models treatment assignment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Doubly Robust (AIPW)&lt;/td>
&lt;td>\$1,620&lt;/td>
&lt;td>Models both outcome and treatment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Propensity Score Stratification&lt;/td>
&lt;td>\$1,617&lt;/td>
&lt;td>5 strata, flexible&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Propensity Score Matching&lt;/td>
&lt;td>\$1,736&lt;/td>
&lt;td>Nearest-neighbor matching (closer to ATT)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Refutation Test&lt;/th>
&lt;th>New Effect&lt;/th>
&lt;th>p-value&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Placebo Treatment&lt;/td>
&lt;td>\$62&lt;/td>
&lt;td>0.92&lt;/td>
&lt;td>Effect vanishes with fake treatment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Random Common Cause&lt;/td>
&lt;td>\$1,676&lt;/td>
&lt;td>0.90&lt;/td>
&lt;td>Stable with added confounder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Data Subset (80%)&lt;/td>
&lt;td>\$1,728&lt;/td>
&lt;td>0.80&lt;/td>
&lt;td>Stable across subsamples&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The summary confirms a consistent causal effect across methods: the NSW job training program increased 1978 earnings by approximately \$1,550&amp;ndash;\$1,800. All five adjusted methods and all three refutation tests support the validity of the estimate. The placebo test is particularly convincing &amp;mdash; when the real treatment is replaced by random noise, the effect drops from \$1,676 to just \$62, confirming that the observed effect is tied to the actual treatment and not a statistical artifact. The doubly robust estimate (\$1,620) provides the most credible point estimate because it is consistent under misspecification of either the outcome model or the propensity score model.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>The Lalonde dataset provides a compelling case study for DoWhy&amp;rsquo;s four-step framework. Each step serves a distinct purpose: the &lt;strong>Model&lt;/strong> step forces us to articulate our causal assumptions as a graph, the &lt;strong>Identify&lt;/strong> step uses graph theory to determine the correct adjustment formula, the &lt;strong>Estimate&lt;/strong> step applies statistical methods to compute the effect, and the &lt;strong>Refute&lt;/strong> step probes whether the result withstands scrutiny.&lt;/p>
&lt;p>The estimated ATE ranges from \$1,559 (IPW) to \$1,736 (PS matching), with the doubly robust estimate at \$1,620 providing a credible middle ground. On a base of \$4,555 for the control group, this represents roughly a 34&amp;ndash;38% increase in earnings &amp;mdash; a substantial effect for a disadvantaged population with very low baseline earnings. The three estimation paradigms &amp;mdash; outcome modeling (regression adjustment), treatment modeling (IPW, stratification, matching), and doubly robust (AIPW) &amp;mdash; each bring different strengths, and their convergence strengthens the causal conclusion.&lt;/p>
&lt;p>The key strength of DoWhy over ad-hoc statistical approaches is transparency. The causal graph makes assumptions visible and debatable. The identification step formally checks whether the effect is estimable. Multiple estimation methods let us assess robustness. And refutation tests provide automated sanity checks that would otherwise require expert judgment.&lt;/p>
&lt;h2 id="limitations-and-next-steps">Limitations and next steps&lt;/h2>
&lt;p>This analysis demonstrates DoWhy&amp;rsquo;s workflow on a well-understood dataset, but several limitations apply:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Small sample size&lt;/strong>: With only 445 observations, estimates have high variance and the propensity score methods may suffer from poor overlap in some regions of the covariate space&lt;/li>
&lt;li>&lt;strong>Unconfoundedness assumption&lt;/strong>: The backdoor criterion requires that all confounders are observed. If there are unmeasured factors affecting both training enrollment and earnings, our estimates would be biased&lt;/li>
&lt;li>&lt;strong>Linear outcome model&lt;/strong>: The regression adjustment and doubly robust estimates assume a linear relationship between covariates and earnings, which may be too restrictive for the highly skewed outcome distribution&lt;/li>
&lt;li>&lt;strong>Experimental data&lt;/strong>: The NSW was a randomized experiment, making it the easiest setting for causal inference. DoWhy&amp;rsquo;s advantages are more pronounced in observational studies where confounding is more severe&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Next steps&lt;/strong> could include:&lt;/p>
&lt;ul>
&lt;li>Apply DoWhy to an observational version of the Lalonde dataset (e.g., the PSID or CPS comparison groups) where confounding is much stronger&lt;/li>
&lt;li>Explore DoWhy&amp;rsquo;s instrumental variable and front-door estimators for settings where the backdoor criterion fails&lt;/li>
&lt;li>Investigate heterogeneous treatment effects &amp;mdash; does training help some subgroups more than others?&lt;/li>
&lt;li>Use nonparametric outcome models (e.g., random forests) in the doubly robust estimator for more flexible modeling&lt;/li>
&lt;li>Compare DoWhy&amp;rsquo;s estimates with Double Machine Learning (DoubleML) for a side-by-side comparison of frameworks&lt;/li>
&lt;/ul>
&lt;h2 id="takeaways">Takeaways&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>DoWhy&amp;rsquo;s four-step workflow&lt;/strong> (Model, Identify, Estimate, Refute) makes causal assumptions explicit and testable, rather than hiding them inside a black-box estimator.&lt;/li>
&lt;li>&lt;strong>The NSW job training program increased 1978 earnings by approximately \$1,550&amp;ndash;\$1,800&lt;/strong>, a 34&amp;ndash;38% gain over the control group mean of \$4,555.&lt;/li>
&lt;li>&lt;strong>Five estimation methods&lt;/strong> &amp;mdash; regression adjustment, IPW, doubly robust, PS stratification, and PS matching &amp;mdash; all produce positive, consistent estimates, strengthening confidence in the causal conclusion.&lt;/li>
&lt;li>&lt;strong>The doubly robust (AIPW) estimator&lt;/strong> (\$1,620) is the most credible single estimate because it remains consistent if either the outcome model or the propensity score model is misspecified.&lt;/li>
&lt;li>&lt;strong>IPW and regression adjustment represent two complementary paradigms&lt;/strong>: modeling treatment assignment (\$1,559) vs. modeling the outcome (\$1,676). Their divergence quantifies sensitivity to modeling choices.&lt;/li>
&lt;li>&lt;strong>Refutation tests confirm robustness&lt;/strong> &amp;mdash; the placebo test reduced the effect from \$1,676 to just \$62, ruling out statistical artifacts.&lt;/li>
&lt;li>&lt;strong>Causal graphs encode domain knowledge as testable assumptions&lt;/strong>; the backdoor criterion then determines which variables must be conditioned on for valid causal estimation.&lt;/li>
&lt;li>&lt;strong>Next step&lt;/strong>: apply DoWhy to an observational comparison group (e.g., PSID or CPS) where confounding is stronger and the choice of estimator matters more.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the number of strata.&lt;/strong> Re-run the propensity score stratification with &lt;code>num_strata=10&lt;/code> and &lt;code>num_strata=20&lt;/code>. How does the ATE estimate change? What are the tradeoffs of using more vs. fewer strata with a sample of only 445 observations?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add an additional refutation test.&lt;/strong> DoWhy supports a &lt;code>bootstrap_refuter&lt;/code> that re-estimates the effect on bootstrap samples. Implement this refuter and compare its results to the data subset refuter. Are the conclusions similar?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Estimate effects for subgroups.&lt;/strong> Split the dataset by &lt;code>black&lt;/code> (race indicator) and estimate the ATE separately for each subgroup using DoWhy. Does the job training program have a different effect for Black vs. non-Black participants? What might explain any differences you observe?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://www.pywhy.org/dowhy/" target="_blank" rel="noopener">DoWhy &amp;mdash; Python Library for Causal Inference (PyWhy)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jstor.org/stable/1806062" target="_blank" rel="noopener">LaLonde, R. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review, 76(4), 604&amp;ndash;620.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1999.10473858" target="_blank" rel="noopener">Dehejia, R. &amp;amp; Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. JASA, 94(448), 1053&amp;ndash;1062.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2011.04216" target="_blank" rel="noopener">Sharma, A. &amp;amp; Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. arXiv:2011.04216.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1952.10483446" target="_blank" rel="noopener">Horvitz, D. G. &amp;amp; Thompson, D. J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. JASA, 47(260), 663&amp;ndash;685.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.1994.10476818" target="_blank" rel="noopener">Robins, J. M., Rotnitzky, A. &amp;amp; Zhao, L. P. (1994). Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. JASA, 89(427), 846&amp;ndash;866.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/biomet/70.1.41" target="_blank" rel="noopener">Rosenbaum, P. R. &amp;amp; Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1), 41&amp;ndash;55.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/2528036" target="_blank" rel="noopener">Cochran, W. G. (1968). The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies. Biometrics, 24(2), 295&amp;ndash;313.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://medium.com/@chrisjames.nita/causal-inference-with-python-introduction-to-dowhy-ff5799e48985" target="_blank" rel="noopener">Nita, C. J. Causal Inference with Python &amp;mdash; Introduction to DoWhy. Medium.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. The content may nevertheless contain errors, so exercise caution before applying it to actual research projects.&lt;/p></description></item></channel></rss>