The Causal Review

Introduction

The staggered difference-in-differences literature has produced a rich set of estimators designed to avoid the negative-weight pathologies of two-way fixed effects (TWFE) regressions when treatment timing varies across units [Callaway and Sant'Anna, 2021; Sun and Abraham, 2021; de Chaisemartin and D'Haultfœuille, 2020; Goodman-Bacon, 2021]. A common thread in this literature is that TWFE can assign negative weights to some unit-time treatment effects, so the resulting estimand is not a convex combination of the underlying treatment effects.

Borusyak et al. [2024] take a different approach to the same problem. Rather than starting from the TWFE estimator and diagnosing its failures, they ask which estimator of average treatment effects is most efficient under parallel trends in staggered settings. Their answer is an imputation estimator that imputes the untreated potential outcomes of treated observations from a model fitted under the parallel-trends restriction, then averages the implied treatment effects using any desired weighting scheme. Under parallel trends, no anticipation, and homoskedastic, serially uncorrelated errors, it is the efficient linear unbiased estimator of the chosen weighted average of treatment effects.

1 The Setup and the Parallel Trends Restriction

1.1 Notation

Consider a balanced panel with N units and T periods. Unit i is treated from period Gᵢ onwards (Gᵢ = ∞ for never-treated units). Let Dᵢₜ = 1[t ≥ Gᵢ] denote the treatment indicator. Potential outcomes are Yᵢₜ(0) (untreated) and Yᵢₜ(1) (treated), and the observed outcome is Yᵢₜ = Yᵢₜ(0) + Dᵢₜτᵢₜ, where τᵢₜ = Yᵢₜ(1) − Yᵢₜ(0) is the treatment effect for unit i in period t.

The parallel trends assumption states that, in the absence of treatment:

$$Y_{it}(0) = \alpha_i + \lambda_t + \varepsilon_{it}, \qquad \mathbb{E}[\varepsilon_{it} \mid \alpha_i, G_i] = 0, \tag{1}$$

where αᵢ are unit fixed effects and λₜ are time fixed effects. This is precisely the two-way fixed effects structure.
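To make the notation concrete, the following R sketch builds a small hypothetical panel. The number of units, the adoption dates, the fixed effects, and the effect sizes are all illustrative assumptions, not values from Borusyak et al. [2024].

```r
# Toy staggered-adoption panel (all numbers are illustrative assumptions)
set.seed(1)
panel <- expand.grid(t = 1:5, i = 1:6)        # 6 units observed for 5 periods
G_by_unit <- c(3, 3, 4, 4, Inf, Inf)          # adoption period G_i; Inf = never treated
panel$G <- G_by_unit[panel$i]
panel$D <- as.integer(panel$t >= panel$G)     # D_it = 1[t >= G_i]

# Untreated potential outcome follows model (1): alpha_i + lambda_t + eps_it
alpha  <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)     # unit fixed effects
lambda <- 0.5 * (1:5)                         # time fixed effects
panel$Y0 <- alpha[panel$i] + lambda[panel$t] + rnorm(nrow(panel), sd = 0.1)

# Heterogeneous effects tau_it that grow with time since adoption
panel$tau <- ifelse(panel$D == 1, 1 + 0.25 * (panel$t - panel$G), 0)
panel$Y   <- panel$Y0 + panel$D * panel$tau   # observed outcome

# Rows are units, columns are periods, entries are D_it
print(xtabs(D ~ i + t, data = panel))
```

The printed table shows the staggered pattern directly: each treated unit switches from 0 to 1 at its own Gᵢ and stays treated, while never-treated units remain at 0 throughout.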

1.2 Parallel Trends as a Restriction on Residuals

The insight of Borusyak et al. [2024] is to express parallel trends as a restriction on the residuals of the TWFE model estimated on untreated observations only. Across observations of never-treated units and pre-treatment observations of eventually treated units, the errors εᵢₜ have expectation zero, so αᵢ and λₜ can be estimated without contamination from treatment effects. The fitted model is then used to impute Yᵢₜ(0) for treated observations.

2 The Imputation Estimator

2.1 Step 1: Estimate the Untreated Potential Outcome Model

Using only the untreated observations {(i,t) : Dᵢₜ = 0}, estimate unit and time fixed effects by OLS:

$$\hat{Y}_{it}(0) = \hat{\alpha}_i + \hat{\lambda}_t, \tag{2}$$

where α̂ᵢ and λ̂ₜ are estimated from the sample of untreated observations.

2.2 Step 2: Compute Treatment Effect Residuals

For each treated observation (i,t) with Dᵢₜ = 1, compute the imputed treatment effect:

$$\hat{\tau}_{it} = Y_{it} - \hat{Y}_{it}(0) = Y_{it} - \hat{\alpha}_i - \hat{\lambda}_t. \tag{3}$$

These are the "imputed" individual-level treatment effects: the difference between the observed (treated) outcome and the model prediction of what the outcome would have been absent treatment.
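Steps 1 and 2 can be sketched in base R as follows. This is a minimal illustration on simulated data, not the authors' implementation: the cohort dates, the constant effect of 2, and the use of lm() with dummy variables (rather than a dedicated fixed-effects routine, which would scale better) are assumptions made for the example.

```r
# Simulated panel; cohort dates and the effect size of 2 are assumptions
set.seed(42)
N <- 60; n_periods <- 8
panel <- expand.grid(t = 1:n_periods, i = 1:N)
G_by_unit <- rep(c(4, 6, Inf), length.out = N)   # 20 units per cohort
panel$G <- G_by_unit[panel$i]
panel$D <- as.integer(panel$t >= panel$G)
panel$Y <- 0.1 * panel$i + 0.3 * panel$t + 2 * panel$D +
  rnorm(nrow(panel), sd = 0.5)

# Step 1: fit unit and time fixed effects on untreated observations only
fit0 <- lm(Y ~ factor(i) + factor(t), data = subset(panel, D == 0))

# Step 2: impute Y_it(0) for every row, then difference for treated rows
panel$Y0_hat  <- predict(fit0, newdata = panel)
panel$tau_hat <- ifelse(panel$D == 1, panel$Y - panel$Y0_hat, NA)

# Distribution of imputed effects; the simulated true effect is 2
summary(panel$tau_hat[panel$D == 1])
```

Prediction works for every treated row here because each treated unit has at least one untreated period and the never-treated units cover all periods, so every unit and time effect is identified from the untreated sample.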

2.3 Step 3: Average with Desired Weights

The estimator aggregates these imputed effects using researcher-specified weights wᵢₜ:

$$\hat{\tau} = \sum_{(i,t):\, D_{it} = 1} w_{it}\, \hat{\tau}_{it}, \tag{4}$$

where Σ₍ᵢ,ₜ₎:Dᵢₜ₌₁ wᵢₜ = 1. The weights wᵢₜ can implement any target estimand: the simple ATT (equal weights to all treated observations), a horizon-weighted event study (weighting by time since treatment), a cohort-weighted estimator (equal weights to cohorts as in Callaway and Sant'Anna [2021]), or any other policy-relevant average.
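The role of the weights can be illustrated with a short base R sketch. The simulated effects grow with time since adoption, so different weighting schemes target different numbers. The data-generating process and variable names are assumptions for illustration only.

```r
# Simulated panel with effects that grow with time since adoption (assumed DGP)
set.seed(7)
panel <- expand.grid(t = 1:8, i = 1:60)
panel$G <- rep(c(4, 6, Inf), length.out = 60)[panel$i]
panel$D <- as.integer(panel$t >= panel$G)
true_tau <- ifelse(panel$D == 1, 1 + 0.5 * (panel$t - panel$G), 0)
panel$Y <- 0.1 * panel$i + 0.3 * panel$t + true_tau + rnorm(nrow(panel), sd = 0.5)

# Steps 1-2: imputed effects for treated observations
fit0 <- lm(Y ~ factor(i) + factor(t), data = subset(panel, D == 0))
treated <- subset(panel, D == 1)
treated$tau_hat <- treated$Y - predict(fit0, newdata = treated)
treated$horizon <- treated$t - treated$G

# (a) Simple ATT: w_it = 1 / (number of treated observations)
w_equal <- rep(1 / nrow(treated), nrow(treated))
att_simple <- sum(w_equal * treated$tau_hat)

# (b) Event study: one equal-weighted average per horizon
att_by_horizon <- tapply(treated$tau_hat, treated$horizon, mean)

# (c) Cohort-balanced: each cohort receives the same total weight
n_in_cohort <- table(treated$G)[as.character(treated$G)]
w_cohort <- as.numeric(1 / (length(unique(treated$G)) * n_in_cohort))
att_cohort <- sum(w_cohort * treated$tau_hat)

round(c(simple = att_simple, cohort_balanced = att_cohort), 3)
round(att_by_horizon, 3)
```

In this design the early cohort contributes more treated observations and longer horizons, so the simple ATT leans toward its larger effects, while the cohort-balanced average gives the two cohorts equal say. The imputed effects are computed once; only the weights change.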

2.4 Variance Estimation

Borusyak et al. [2024] provide analytical standard errors that account for the estimation error in α̂ᵢ and λ̂ₜ from the first stage. This matters: naive inference that treats the imputed Ŷᵢₜ(0) as known can be anti-conservative. The proposed standard errors are cluster-robust (typically clustered by unit) and asymptotically conservative when treatment effects are heterogeneous.

3 Efficiency

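The efficiency result of Borusyak et al. [2024] is a Gauss-Markov-type theorem: under the parallel-trends restriction (1), no anticipation, and homoskedastic, serially uncorrelated errors, the imputation estimator has the smallest variance among linear unbiased estimators of the chosen weighted average of treatment effects. When errors are heteroskedastic or serially correlated, the estimator remains unbiased but is no longer guaranteed to be the most efficient.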

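The efficiency gain over Callaway and Sant'Anna [2021] can be substantial in practice. As a point of reference, Roth and Sant'Anna [2023] show that their efficient staggered estimator, which relies on (quasi-)random treatment timing rather than on parallel trends alone, can yield 20-40% smaller standard errors than equal-weighted cohort estimators in typical panel datasets; that comparison concerns a different estimator, so it indicates the scale of possible gains rather than measuring the BJS advantage directly. For the imputation estimator, the gain arises because all untreated observations, including every pre-treatment period, are used to estimate the unit and time fixed effects, whereas Callaway and Sant'Anna compare each cohort against a single baseline period and its own set of "clean" control units.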

4 Comparison with Other Staggered DiD Estimators

The key distinction from Callaway and Sant'Anna [2021] is the amount of untreated data used. CS builds each cohort-time effect from a comparison with never-treated or not-yet-treated units, measured against a single pre-treatment baseline period (typically g − 1). The BJS imputation uses all untreated observations (across all units and all pre-treatment periods) to pin down α̂ᵢ and λ̂ₜ, which extracts more information when errors are not strongly serially correlated.
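A worked special case may help locate the source of the gain. Assume, purely for illustration, a balanced panel with a single treated cohort of N_g units adopting in period g, a never-treated comparison group of N_C units, no covariates, and errors that are i.i.d. with variance σ². Write Ȳ for group means and let the "pre" subscript denote the average over the g − 1 pre-treatment periods. In that case the two estimators of the effect in a post-treatment period t reduce to:

```latex
\begin{align*}
\widehat{ATT}^{CS}(g,t)
  &= \bigl(\bar{Y}_{g,t} - \bar{Y}_{g,g-1}\bigr) - \bigl(\bar{Y}_{C,t} - \bar{Y}_{C,g-1}\bigr), \\
\hat{\tau}^{BJS}_{t}
  &= \bigl(\bar{Y}_{g,t} - \bar{Y}_{g,\mathrm{pre}}\bigr) - \bigl(\bar{Y}_{C,t} - \bar{Y}_{C,\mathrm{pre}}\bigr),
  \qquad \bar{Y}_{\cdot,\mathrm{pre}} = \frac{1}{g-1}\sum_{s<g} \bar{Y}_{\cdot,s}, \\[4pt]
\operatorname{Var}\bigl(\widehat{ATT}^{CS}(g,t)\bigr)
  &= 2\,\sigma^2\Bigl(\tfrac{1}{N_g} + \tfrac{1}{N_C}\Bigr), \\
\operatorname{Var}\bigl(\hat{\tau}^{BJS}_{t}\bigr)
  &= \Bigl(1 + \tfrac{1}{g-1}\Bigr)\sigma^2\Bigl(\tfrac{1}{N_g} + \tfrac{1}{N_C}\Bigr).
\end{align*}
```

The only difference is the baseline: a single period g − 1 versus the average of all pre-treatment periods. With four pre-treatment periods the variance ratio is 1.25/2 = 0.625, which corresponds to a standard error roughly 21% smaller. The comparison depends on the i.i.d. error assumption; if errors are strongly serially correlated (a random walk being the extreme case), the most recent pre-treatment period is the more informative baseline and the ranking can change.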

The key distinction from TWFE is that BJS imputation explicitly constructs the counterfactual and then computes treatment effects, whereas TWFE conflates the imputation and aggregation steps into a single regression that can produce negative weights.

Estimator	No negative weights	Efficient	Event study	Software (R)
TWFE	No	No	Via τ̂_k	fixest, lfe
Callaway-Sant'Anna	Yes	No	Yes	did
Sun-Abraham	Yes	No	Yes	fixest sunab()
de Chaisemartin-D'Haultfœuille	Yes	No	Yes	DIDmultiplegt
BJS imputation	Yes	Yes (homoskedastic, serially uncorrelated errors)	Yes	didimputation, did2s

Table 1: Comparison of Staggered DiD Estimators

5 Implementation in did2s

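The did2s R package [Gardner, 2022] implements Gardner's two-stage DiD estimator, which is closely related to BJS imputation: the first stage is the same regression on untreated observations, and the second stage regresses the residualized outcome on treatment indicators, with standard errors adjusted for the first-stage estimation. A basic event-study call looks like this:

library(did2s)
library(fixest)
library(data.table)

# Simulate a staggered-adoption panel: 100 units observed 2001-2010
set.seed(123)
dt <- data.table(
  unit = rep(1:100, each = 10),
  year = rep(2001:2010, 100),
  # Adoption year per unit; Inf marks never-treated units
  gvar = rep(sample(c(2005, 2007, 2009, Inf), 100, replace = TRUE), each = 10)
)
dt[, treated := as.integer(year >= gvar)]
# Event time; never-treated units get Inf so they fall into the reference group
dt[, rel_year := ifelse(is.finite(gvar), year - gvar, Inf)]
dt[, Y := unit + year + treated * 0.5 + rnorm(.N)]

# Two-stage DiD; the first stage is fitted on untreated observations only
est <- did2s(
  data = dt,
  yname = "Y",
  first_stage = ~ 0 | unit + year,                 # unit and year fixed effects
  second_stage = ~ i(rel_year, ref = c(-1, Inf)),  # event-study coefficients
  treatment = "treated",
  cluster_var = "unit"
)

iplot(est, main = "Two-Stage DiD Event Study", xlab = "Years since adoption")

The first_stage argument specifies the model for untreated potential outcomes (here, unit and year fixed effects). The second_stage argument specifies the treatment terms whose coefficients are reported (here, indicators for each year relative to adoption, with the year before adoption and the never-treated group as reference categories). Standard errors are clustered at the unit level.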

6 Diagnostic: Testing Parallel Trends

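One advantage of the imputation framework is that it extends naturally to pre-trend checks. For observations with Dᵢₜ = 0 and t < Gᵢ (pre-treatment periods of eventually treated units), the residuals Yᵢₜ − α̂ᵢ − λ̂ₜ should be near zero under parallel trends, and plotting them by event time gives a direct visual check. Because these observations are also used to fit the first stage, their residuals are mechanically pulled toward zero; for a formal test, Borusyak et al. [2024] recommend a separate regression on untreated observations only, adding indicators for the periods just before treatment and testing that their coefficients are jointly zero.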
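A common way to formalize the pre-trend check is a regression on untreated observations only, with indicators for the last few periods before adoption added to the unit and time fixed effects. The base R sketch below uses two leads and a classical F-test. The simulated data, the choice of two leads, and the non-clustered test are simplifying assumptions; applied work would normally use cluster-robust inference.

```r
# Simulated panel that satisfies parallel trends by construction (assumed DGP)
set.seed(11)
panel <- expand.grid(t = 1:8, i = 1:60)
panel$G <- rep(c(4, 6, Inf), length.out = 60)[panel$i]
panel$D <- as.integer(panel$t >= panel$G)
panel$Y <- 0.1 * panel$i + 0.3 * panel$t + 2 * panel$D +
  rnorm(nrow(panel), sd = 0.5)

# Keep untreated observations only and flag the last two pre-treatment periods
untreated <- subset(panel, D == 0)
untreated$lead1 <- as.integer(untreated$t - untreated$G == -1)
untreated$lead2 <- as.integer(untreated$t - untreated$G == -2)

fit_null  <- lm(Y ~ factor(i) + factor(t), data = untreated)
fit_leads <- lm(Y ~ factor(i) + factor(t) + lead1 + lead2, data = untreated)

# Joint F-test that both lead coefficients are zero
pretrend_test <- anova(fit_null, fit_leads)
p_value <- pretrend_test[["Pr(>F)"]][2]

coef(fit_leads)[c("lead1", "lead2")]
p_value
```

Treated observations never enter this regression, so the test is not contaminated by treatment-effect heterogeneity. A small p-value would indicate that outcomes of soon-to-be-treated units were already diverging before adoption.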

Rambachan and Roth [2023] go a step further with the HonestDiD framework: rather than simply testing whether pre-trends are zero, they construct sensitivity analyses that ask how large a violation of parallel trends (for example, relative to the largest pre-treatment violation, or as a bounded departure from a linear differential trend) would be needed to overturn the post-treatment conclusions.

7 Limitations and Practical Considerations

(1) Panel balance: The imputation estimator does not strictly require a balanced panel, but unbalanced data need care: a unit's fixed effect is estimated only from its untreated observations, so units with few of them are imputed imprecisely and units with none cannot be imputed at all.

(2) Anticipation effects: If units anticipate treatment and change behaviour before treatment begins, the pre-treatment observations are contaminated and the imputation will be biased. A common remedy is to treat the K periods before adoption as already treated, so that they are excluded from the first-stage sample.

(3) Functional form: The efficiency result relies on the additive two-way fixed effects specification (1). If the true parallel-trends model has additional covariates or non-linear trends, the imputation model should be augmented accordingly.

(4) Many treated units: When most observations are treated (high treatment saturation), there are few untreated observations from which to estimate α̂ᵢ and λ̂ₜ, degrading the first-stage imputation. In this regime, the efficiency advantage of BJS diminishes.
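Points (2) and (4) can be checked before estimation. The base R sketch below shifts the effective adoption date by an assumed anticipation window K, then counts the untreated observations available for each unit and period. The adoption dates are hypothetical and chosen so that some units and periods cannot be imputed.

```r
# Hypothetical adoption dates with no never-treated units
panel <- expand.grid(t = 1:6, i = 1:5)
G_by_unit <- c(1, 2, 3, 4, 5)            # unit 1 is treated in every period
panel$G <- G_by_unit[panel$i]

K <- 1                                   # assumed anticipation window, in periods
# Count the K periods before adoption as treated so they leave the first stage
panel$D <- as.integer(panel$t >= panel$G - K)

# A unit fixed effect needs at least one untreated observation of that unit
n_untreated_unit <- tapply(1 - panel$D, panel$i, sum)
units_not_imputable <- names(n_untreated_unit)[as.vector(n_untreated_unit == 0)]

# A time fixed effect needs at least one untreated observation in that period
n_untreated_period <- tapply(1 - panel$D, panel$t, sum)
periods_not_imputable <- names(n_untreated_period)[as.vector(n_untreated_period == 0)]

share_treated <- mean(panel$D)           # treatment saturation

units_not_imputable
periods_not_imputable
share_treated
```

Treated observations of the flagged units, and all treated observations in the flagged periods, have no imputed counterfactual and would have to be dropped from the target estimand. A high share_treated is a warning that the first stage rests on few observations.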

8 Conclusion

The BJS imputation estimator provides a principled answer to the question of how to aggregate treatment effects efficiently in staggered adoption designs. By separating the imputation of counterfactual outcomes from the aggregation of treatment effects, it avoids TWFE's negative-weight problem and, under homoskedastic and serially uncorrelated errors, is the efficient linear unbiased estimator. In staggered DiD applications where efficiency matters (small samples, long panels, or high heterogeneity), the BJS estimator should be preferred or at minimum reported alongside Callaway-Sant'Anna estimates as a robustness check.

References

Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, 91(6), 3253-3285.
Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
de Chaisemartin, C. and D'Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964-2996.
Gardner, J. (2022). Two-stage differences in differences. arXiv preprint arXiv:2207.05943.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555-2591.
Roth, J. and Sant'Anna, P. H. C. (2023). Efficient estimation for staggered rollout designs. Journal of Political Economy Microeconomics, 1(4), 669-709.
Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.

The Borusyak-Jaravel-Spiess Imputation Estimator: Efficient DiD for Staggered Adoption Settings
