The Causal Review

The Problem with TWFE Under Staggered Adoption

Let Y_it denote the outcome for unit i at time t, and let D_it ∈ {0,1} denote treatment status. In many applications, units adopt treatment at different calendar times: some in period 2, some in period 5, some never. The classical approach is to estimate the two-way fixed effects (TWFE) regression:

$$ Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + \varepsilon_{it} $$

where α_i are unit fixed effects and λ_t are time fixed effects. The coefficient δ is often interpreted as the average treatment effect.

Goodman-Bacon (2021) showed that $\delta$ is not a simple average of underlying treatment effects. Instead, it is a weighted average of all possible "2x2" DiD estimates, each comparing a group whose treatment status changes with a group whose status stays fixed over the same window. The weights are non-negative and depend on the sizes of the two groups and on the variance of the treatment indicator within each comparison. Crucially, some of these comparisons use an already-treated (early) group as the control for a later-treated group. If the early group's treatment effect changes over time, that change is differenced out as if it were a trend in untreated outcomes, so some underlying treatment effects enter the TWFE estimate with negative weight (de Chaisemartin and D'Haultfoeuille, 2020). Under sufficiently heterogeneous treatment effects, the TWFE estimate can even have the opposite sign to every underlying effect.
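The sign problem can be reproduced in a few lines. The sketch below is a hypothetical, noise-free panel, not an example from Goodman-Bacon (2021): two equally sized cohorts adopt in periods 2 and 5, there is no never-treated group, and the effect is assumed to grow as 1 + (t − g). The names `twfe_coefficient`, `n_per_cohort` and the data-generating values are illustrative assumptions.

```python
import numpy as np

def twfe_coefficient(y, d):
    """TWFE slope on d for a balanced panel (units x periods) via double demeaning."""
    def demean(a):
        # Remove unit means and period means, add back the grand mean.
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    y_dm, d_dm = demean(y), demean(d)
    return float((d_dm * y_dm).sum() / (d_dm ** 2).sum())

rng = np.random.default_rng(0)
n_per_cohort, T = 50, 8                          # assumed panel dimensions
periods = np.arange(1, T + 1)
cohort = np.repeat([2, 5], n_per_cohort)         # adoption period of each unit

d = (periods[None, :] >= cohort[:, None]).astype(float)   # D_it
exposure = periods[None, :] - cohort[:, None]              # t - g
effect = np.where(d == 1, 1.0 + exposure, 0.0)             # assumed effect path, always >= 1

alpha = rng.normal(size=(cohort.size, 1))        # unit fixed effects
lam = 0.5 * periods[None, :]                     # common time trend
y = alpha + lam + effect                         # no idiosyncratic noise

delta_twfe = twfe_coefficient(y, d)
true_att = effect[d == 1].mean()                 # average effect over treated unit-periods
print(f"TWFE delta: {delta_twfe:.3f}, average effect among treated: {true_att:.3f}")
```

In this design every treated unit-period has an effect of at least 1, yet the TWFE coefficient is negative, because the early cohort's growing effect is subtracted when it serves as the control for the late cohort.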

Setup and Notation

Following Callaway and Sant’Anna (2021), let G_i denote the period in which unit i is first treated (the “cohort”), with G_i = ∞ for never-treated units. Let G be the set of cohorts. Let Y_it(g) denote the potential outcome for unit i at time t if first treated in period g, and Y_it(0) the potential outcome under no treatment. The observed outcome is:

$$ Y_{it} = Y_{it}(0) + \sum_{g \in \mathcal{G}} \mathbf{1}\{G_i = g\} (Y_{it}(g) - Y_{it}(0)) \cdot \mathbf{1}\{t \ge g\} $$

Identification Assumptions

Assumption 1 (Irreversibility of Treatment). Once treated, a unit remains treated: D_it = 1 for all t ≥ G_i.

Assumption 2 (Conditional Parallel Trends — Never-Treated). For each cohort g ∈ G and each period t ≥ 2:

$$ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, G = g] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, C = 1] $$

where C = 1 denotes the never-treated group and X is a vector of pre-treatment covariates.

This assumption says that, conditional on pre-treatment covariates, the trend in untreated potential outcomes for cohort $g$ equals the trend for never-treated units. The unconditional version drops X; the conditional version allows for selection on observables.

Assumption 3 (No Anticipation). For all t < g: Y_it(g) = Y_it(0) for units in cohort g.

This assumption rules out pre-treatment changes in behaviour in anticipation of future treatment. It can be relaxed to allow anticipation effects up to a fixed number of periods before treatment.

Assumption 4 (Overlap). For each g ∈ G and t: p_g(X) := Pr(G = g | G ∈ {g, ∞}, X) ∈ (0,1) almost surely.

Group-Time Average Treatment Effects

The primary estimand in the CS framework is the group-time average treatment effect:

$$ ATT(g,t) = \mathbb{E}[Y_t(g) - Y_t(0) \mid G = g], \quad g \le t $$

This is the average effect of treatment at calendar time t for units in cohort g, that is, units first treated in period g. It conditions on belonging to cohort g and asks what the average effect of treatment is t − g periods after adoption.

Under Assumptions 1–4 (using the never-treated comparison group and conditional parallel trends), ATT(g,t) is identified by:

$$ ATT(g,t) = \mathbb{E} \left[ \left( \frac{\mathbf{1}\{G=g\}}{\Pr(G=g)} - \frac{\frac{p_g(X)\mathbf{1}\{C=1\}}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)\mathbf{1}\{C=1\}}{1-p_g(X)}\right]} \right) (Y_t - Y_{g-1}) \right] $$

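This is the inverse probability weighting (IPW) representation. The doubly robust (DR) representation replaces Y_t − Y_{g−1} with its deviation from the comparison-group outcome regression E[Y_t − Y_{g−1} | X, C = 1]. An estimator built on the DR representation is consistent if either the propensity score model p_g(X) or the outcome regression model is correctly specified.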
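When there are no covariates, p_g(X) is constant and the weights collapse, so the identification result reduces to a difference of mean long differences: E[Y_t − Y_{g−1} | G = g] − E[Y_t − Y_{g−1} | C = 1]. The following is a minimal sketch of that special case on a hypothetical noise-free panel. The function name `att_gt_unconditional`, the array layout (column j holds period j + 1) and the assumed effect path 1 + (t − g) are illustrative choices, not part of the CS paper.

```python
import numpy as np

def att_gt_unconditional(y, cohort, g, t):
    """Unconditional ATT(g,t) with the never-treated group as comparison.

    y      : (n_units, T) outcomes, column j holds period j + 1
    cohort : first-treated period per unit, np.inf for never-treated units
    """
    long_diff = y[:, t - 1] - y[:, g - 2]        # Y_t - Y_{g-1}
    treated = cohort == g
    never = np.isinf(cohort)
    return float(long_diff[treated].mean() - long_diff[never].mean())

rng = np.random.default_rng(0)
T, n_per_cohort = 8, 30                          # assumed panel dimensions
periods = np.arange(1, T + 1)
cohort = np.repeat([2.0, 5.0, np.inf], n_per_cohort)

d = periods[None, :] >= cohort[:, None]
effect = np.where(d, 1.0 + (periods[None, :] - cohort[:, None]), 0.0)
y = rng.normal(size=(cohort.size, 1)) + 0.5 * periods[None, :] + effect

att_2_2 = att_gt_unconditional(y, cohort, g=2, t=2)   # on impact, cohort 2
att_5_8 = att_gt_unconditional(y, cohort, g=5, t=8)   # three periods after adoption, cohort 5
placebo = att_gt_unconditional(y, cohort, g=5, t=3)   # pre-treatment period, cohort 5
print(att_2_2, att_5_8, placebo)
```

Because the simulated data contain no noise and satisfy parallel trends exactly, the estimates recover the assumed effects and the pre-treatment placebo is zero up to floating-point error.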

Doubly Robust Estimation

Callaway and Sant'Anna (2021) recommend a doubly robust estimator that combines inverse probability weighting with outcome regression. Specifically, let $\hat{p}_g(x)$ be an estimated propensity score and $\hat{\mu}^C_{g,t}(x)$ be an estimated conditional mean of Y_t − Y_{g−1} given X = x for the comparison group. The DR estimator is:

$$ \widehat{ATT}(g,t) = \mathbb{E}_n \left[ \begin{aligned} & \frac{\mathbf{1}\{G = g\}}{\mathbb{E}_n[\mathbf{1}\{G = g\}]} (Y_t - Y_{g-1} - \hat{\mu}_{g,t}^C(X)) \\ & - \frac{\frac{\hat{p}_g(X)\mathbf{1}\{C=1\}}{1-\hat{p}_g(X)}}{\mathbb{E}_n\left[\frac{\hat{p}_g(X)\mathbf{1}\{C=1\}}{1-\hat{p}_g(X)}\right]} (Y_t - Y_{g-1} - \hat{\mu}_{g,t}^C(X)) \end{aligned} \right] \tag{1} $$
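A minimal NumPy sketch of the DR estimator for a single (g,t) cell follows. It assumes a logistic propensity score fitted by Newton-Raphson and a linear outcome regression fitted by least squares on the comparison group. These working models, the helper names `fit_logit` and `dr_att_gt`, and the simulated data are illustrative assumptions, not the implementation in the did package.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic regression coefficients by Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

def dr_att_gt(delta_y, is_g, is_c, X):
    """Doubly robust ATT(g,t) from long differences delta_y = Y_t - Y_{g-1}.

    is_g, is_c : boolean masks for cohort g and the never-treated group
    X          : (n, k) pre-treatment covariates without an intercept
    """
    # Restrict to cohort g and the comparison group. Because both weights are
    # normalised by their sample means, this matches averaging over the full sample.
    keep = is_g | is_c
    dy, g, c = delta_y[keep], is_g[keep].astype(float), is_c[keep].astype(float)
    Xc = np.column_stack([np.ones(keep.sum()), X[keep]])

    p_hat = 1.0 / (1.0 + np.exp(-Xc @ fit_logit(Xc, g)))              # propensity score
    coef, *_ = np.linalg.lstsq(Xc[c == 1], dy[c == 1], rcond=None)    # outcome regression
    resid = dy - Xc @ coef                                            # delta_y - mu_hat(X)

    w_treat = g / g.mean()
    odds = p_hat * c / (1.0 - p_hat)
    w_comp = odds / odds.mean()
    return float(np.mean((w_treat - w_comp) * resid))

# Simulated cell with an assumed true ATT of 2 and selection on x.
rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=(n, 1))
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 0.5 * x[:, 0])))
is_g = rng.random(n) < p_true
is_c = ~is_g
delta_y = 1.0 + 0.5 * x[:, 0] + 2.0 * is_g + rng.normal(size=n)

att_hat = dr_att_gt(delta_y, is_g, is_c, x)
print(f"DR estimate of ATT(g,t): {att_hat:.3f}")
```

In practice the long difference, the cohort mask and the comparison mask would be rebuilt for every (g,t) pair, and the working models can be swapped for more flexible ones.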

Comparison Groups

CS allow for two choices of comparison group: never-treated units or not-yet-treated units (units in cohort g' > t). The not-yet-treated comparison is useful when the never-treated group is small or unrepresentative, but requires an additional assumption that not-yet-treated units are on parallel trends with the target cohort.
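The two choices differ only in which units are admitted as controls for a given period t. A small sketch, assuming cohorts are stored as first-treatment periods with `np.inf` for never-treated units (the function name `comparison_mask` and the `kind` labels are hypothetical):

```python
import numpy as np

def comparison_mask(cohort, t, kind="never"):
    """Boolean mask of units eligible as controls at calendar time t."""
    cohort = np.asarray(cohort, dtype=float)
    if kind == "never":
        return np.isinf(cohort)
    if kind == "notyet":
        # Cohorts with g' > t; never-treated units qualify because inf > t.
        return cohort > t
    raise ValueError("kind must be 'never' or 'notyet'")

cohort = np.array([2, 2, 5, 5, np.inf])
never_t3 = comparison_mask(cohort, t=3, kind="never")
notyet_t3 = comparison_mask(cohort, t=3, kind="notyet")
notyet_t5 = comparison_mask(cohort, t=5, kind="notyet")
```

The not-yet-treated pool shrinks as t increases: at t = 3 the period-5 cohort still serves as a control, while at t = 5 only the never-treated unit remains.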

Aggregation

The collection of ATT(g,t) estimates for all (g,t) pairs with t ≥ g constitutes the full set of group-time treatment effects. These can be visualised as a two-dimensional array indexed by cohort and time. CS propose three main aggregation strategies:

Simple aggregation. Weight each post-treatment ATT(g,t) by the relative size of cohort g:

$$ \theta^{simple} = \sum_{g \in \mathcal{G}} \sum_{t=g}^T w_{g,t} \cdot ATT(g,t) $$

where w_{g,t} ∝ Pr(G = g), normalised so that the weights sum to one over all (g,t) pairs with t ≥ g.

Event-study (dynamic) aggregation. For each relative time ℓ = t − g, average ATT(g, g + ℓ) across cohorts:

$$ \theta^{dynamic}(\ell) = \sum_g \mathbf{1}\{g + \ell \le T\} \cdot \Pr(G = g \mid G + \ell \le T) \cdot ATT(g, g + \ell) $$

This produces an event-study plot of average effects as a function of time since treatment. For ℓ < 0, this tests pre-trends: the no-anticipation and parallel trends assumptions imply θ^dynamic(ℓ) = 0 for ℓ < 0.

Calendar-time aggregation. Average ATT(g,t) for each calendar period t, weighting by cohort shares among the treated at time t.
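The three schemes are all cohort-share-weighted averages over different subsets of the (g,t) array. A minimal sketch on hypothetical inputs follows. The dictionary layout, the function names and the toy numbers (cohort 5 is assumed to have twice the effect of cohort 2 at each exposure length) are illustrative assumptions.

```python
import numpy as np

def weighted_mean(values, weights):
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights / weights.sum(), values))

def aggregate_att(att, share):
    """att: {(g, t): ATT(g,t)}; share: {g: Pr(G = g)} for the treated cohorts."""
    post = [(g, t) for (g, t) in att if t >= g]

    # Simple: all post-treatment cells, weights proportional to cohort share.
    simple = weighted_mean([att[k] for k in post], [share[g] for g, _ in post])

    # Dynamic: for each relative time ell, average over cohorts observed at g + ell.
    dynamic = {}
    for ell in sorted({t - g for g, t in att}):
        cells = [(g, t) for (g, t) in att if t - g == ell]
        dynamic[ell] = weighted_mean([att[c] for c in cells], [share[g] for g, _ in cells])

    # Calendar: for each period, average over cohorts already treated by then.
    calendar = {}
    for period in sorted({t for _, t in post}):
        cells = [(g, t) for (g, t) in post if t == period]
        calendar[period] = weighted_mean([att[c] for c in cells], [share[g] for g, _ in cells])

    return simple, dynamic, calendar

T = 8
att = {(2, t): 1.0 + (t - 2) for t in range(2, T + 1)}
att.update({(5, t): 2.0 * (1.0 + (t - 5)) for t in range(5, T + 1)})
share = {2: 0.25, 5: 0.5}

simple, dynamic, calendar = aggregate_att(att, share)
```

Here `dynamic[0]` mixes both cohorts on impact, whereas `dynamic[4]` is based on cohort 2 alone because cohort 5 is not observed four periods after adoption. Changes in composition of this kind are worth checking before reading an event-study plot as a single effect path.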

Inference

Callaway and Sant'Anna (2021) derive the asymptotic distribution of ATT(g,t) under standard regularity conditions. The estimator is asymptotically normal and √n-consistent. Because the aggregated estimands are linear functions of the ATT(g,t), the delta method yields standard errors for the aggregated quantities.
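For a linear aggregate θ = w'ATT, the delta-method calculation amounts to taking the variance of the weighted influence functions. The sketch below assumes an n × K matrix `psi` of estimated influence-function values, one column per (g,t) cell, and treats the weights `w` as known. That is a simplification, since estimated cohort shares contribute their own sampling variation. The function name `aggregated_se` and the toy inputs are hypothetical.

```python
import numpy as np

def aggregated_se(psi, w):
    """Standard error of w' ATT given influence-function values psi (n x K)."""
    n = psi.shape[0]
    V = psi.T @ psi / n                  # estimated asymptotic covariance of sqrt(n) * (ATT_hat - ATT)
    return float(np.sqrt(w @ V @ w / n))

psi = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
w = np.array([0.5, 0.5])
se = aggregated_se(psi, w)

# Equivalent shortcut: aggregate the influence functions first, then take their variance.
se_shortcut = float(np.sqrt(np.mean((psi @ w) ** 2) / psi.shape[0]))
```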

For clustered settings (e.g., states as clusters), the authors recommend cluster-robust inference. In settings with a small number of groups, the bootstrap or permutation tests may be preferable.

Comparison to Sun–Abraham and de Chaisemartin–D'Haultfoeuille

Sun and Abraham (2021) propose an interaction-weighted (IW) estimator that is numerically identical to the CS estimator under certain aggregation choices. The IW estimator works within the TWFE regression framework by including interactions between cohort dummies and relative-time dummies, and then aggregating using cohort shares as weights.

de Chaisemartin and D'Haultfoeuille (2020) propose the $\hat{\delta}^{DID_M}$ estimator, which focuses on the first period of treatment and identifies the average treatment effect for newly treated units in the first treatment period. This is a cleaner but more restrictive estimand.

The CS framework is the most general in the sense that it identifies the full collection of group-time ATTs and allows arbitrary aggregation. The choice between these estimators should be guided by the policy question.

Conclusion

The Callaway–Sant'Anna estimator is now the standard approach for DiD with staggered adoption. Its key contributions are: (1) replacing the conflated TWFE estimate with a transparent collection of group-time ATTs; (2) providing a doubly robust estimator that is valid under selection on observables; and (3) allowing flexible aggregation that aligns the estimand with the policy question. The did package in R provides a ready implementation; see the companion toolbox article in this issue.

References

Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
de Chaisemartin, C. and D'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964--2996.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254--277.
Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555--2591.
Roth, J., Sant'Anna, P. H. C., Bilinski, A., and Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2):2218--2244.
Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175--199.
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.

Callaway–Sant’Anna DiD: Staggered Adoption and Group-Time ATTs
