The Problem with TWFE Under Staggered Adoption
Let \(Y_{it}\) denote the outcome for unit \(i\) at time \(t\), and let \(D_{it} \in \{0,1\}\) denote treatment status. In many applications, units adopt treatment at different calendar times: some in period 2, some in period 5, some never. The classical approach is to estimate the two-way fixed effects (TWFE) regression: \[ Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + \varepsilon_{it} \] where \(\alpha_i\) are unit fixed effects and \(\lambda_t\) are time fixed effects. The coefficient \(\delta\) is often interpreted as the average treatment effect.
Goodman-Bacon (2021) showed that \(\delta\) is not a simple average of underlying treatment effects. Instead, it is a weighted average of all possible "2x2" DiD estimates, each comparing a group whose treatment status changes with a group whose status does not, over a pair of time windows. The weights are positive, sum to one, and depend on group sizes and on the variance of the treatment indicator within each pair. The difficulty lies in the comparisons that use an already-treated group as the control for a later-treated group: if the early group's treatment effect changes over time, that change is subtracted from the estimate as if it were a trend. Equivalently, some underlying treatment effects enter \(\delta\) with negative weights (de Chaisemartin and D'Haultfoeuille, 2020). Under sufficiently heterogeneous treatment effects, the TWFE estimate can therefore have the opposite sign to every underlying effect.
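The sign problem can be seen in a three-period toy panel. The sketch below is an illustration under assumed numbers, not a result from the papers cited: two cohorts adopt in periods 2 and 3, there are no never-treated units, and the true effects are set to 1, 4 and 1. It computes the TWFE coefficient by regressing the outcome on the two-way demeaned treatment indicator, which on a balanced panel is equivalent to the fixed effects regression by the Frisch-Waugh-Lovell theorem. The function names and all data values are hypothetical.

```python
# Toy balanced panel: one unit per cohort, periods 1..3 (all values hypothetical).
COHORT = {"early": 2, "late": 3}          # first treated period G_i
PERIODS = [1, 2, 3]
# True effects by (unit, period): the early cohort's effect grows from 1 to 4.
EFFECT = {("early", 2): 1.0, ("early", 3): 4.0, ("late", 3): 1.0}
ALPHA = {"early": 10.0, "late": 20.0}     # unit fixed effects
LAMBDA = {1: 0.0, 2: 0.5, 3: 1.5}         # time fixed effects (common trend)


def two_way_demean(x, units, periods):
    """Subtract unit means and period means, then add back the grand mean."""
    grand = sum(x.values()) / len(x)
    unit_mean = {i: sum(x[i, t] for t in periods) / len(periods) for i in units}
    time_mean = {t: sum(x[i, t] for i in units) / len(units) for t in periods}
    return {(i, t): x[i, t] - unit_mean[i] - time_mean[t] + grand
            for i in units for t in periods}


def twfe_coefficient(y, d, units, periods):
    """OLS slope on D in Y = alpha_i + lambda_t + delta * D + e (balanced panel)."""
    d_res = two_way_demean(d, units, periods)
    numerator = sum(d_res[k] * y[k] for k in d_res)
    denominator = sum(v * v for v in d_res.values())
    return numerator / denominator


units = list(COHORT)
d = {(i, t): float(t >= COHORT[i]) for i in units for t in PERIODS}
y = {(i, t): ALPHA[i] + LAMBDA[t] + EFFECT.get((i, t), 0.0)
     for i in units for t in PERIODS}

delta_twfe = twfe_coefficient(y, d, units, PERIODS)
true_average_att = sum(EFFECT.values()) / len(EFFECT)
print(delta_twfe, true_average_att)
```

In this example every true effect is positive and their average is 2, yet the TWFE coefficient is approximately -0.5. The implied weights on the three effects are 1, -0.5 and 0.5: the early cohort's period-3 effect enters negatively because that cohort acts as the control for the late cohort.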
Setup and Notation
Following Callaway and Sant'Anna (2021, henceforth CS), let \(G_i\) denote the period in which unit \(i\) is first treated (the "cohort"), with \(G_i = \infty\) for never-treated units. Let \(\mathcal{G}\) be the set of cohorts. Let \(Y_{it}(g)\) denote the potential outcome for unit \(i\) at time \(t\) if first treated in period \(g\), and \(Y_{it}(0)\) the potential outcome under no treatment. The observed outcome is: \[ Y_{it} = Y_{it}(0) + \sum_{g \in \mathcal{G}} \mathbf{1}\{G_i = g\} (Y_{it}(g) - Y_{it}(0)) \cdot \mathbf{1}\{t \geq g\} \]
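As a concrete reading of this expression, consider a hypothetical three-period panel. For a unit with \(G_i = 2\), only the \(g = 2\) term of the sum is active, and only from period 2 onwards: \[ Y_{i1} = Y_{i1}(0), \qquad Y_{i2} = Y_{i2}(2), \qquad Y_{i3} = Y_{i3}(2) \] For a never-treated unit no term is active, so \(Y_{it} = Y_{it}(0)\) in every period.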
Identification Assumptions
[Irreversibility of Treatment] Once treated, a unit remains treated: \(D_{it} = 1\) for all \(t \geq G_i\).
[Conditional Parallel Trends — Never-Treated] For each cohort \(g \in \mathcal{G}\) and each period \(t \geq 2\): \[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, G = g] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, C = 1] \] where \(C = 1\) denotes the never-treated group and \(X\) is a vector of pre-treatment covariates.
This assumption says that, conditional on pre-treatment covariates, the trend in untreated potential outcomes for cohort \(g\) equals the trend for never-treated units. The unconditional version drops \(X\); the conditional version allows for selection on observables.
[No Anticipation] For all \(t < g\): \(Y_{it}(g) = Y_{it}(0)\) for units in cohort \(g\).
This assumption rules out pre-treatment changes in behaviour in anticipation of future treatment. It can be relaxed to allow anticipation effects up to a fixed number of periods before treatment.
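A sketch of the relaxed version, writing \(a \geq 0\) for the assumed number of anticipation periods (a symbol introduced here for illustration): \[ Y_{it}(g) = Y_{it}(0) \quad \text{for all } t < g - a \] Setting \(a = 0\) recovers the assumption as stated. With \(a > 0\), the last period guaranteed to be unaffected by treatment is \(g - a - 1\) rather than \(g - 1\), so comparisons are made relative to \(Y_{g-a-1}\) instead of \(Y_{g-1}\).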
[Overlap] For each \(g \in \mathcal{G}\): \(p_g(X) := \Pr(G = g \mid G \in \{g, \infty\}, X) \in (0,1)\) almost surely.
Group-Time Average Treatment Effects
The primary estimand in the CS framework is the group-time average treatment effect: \[ ATT(g, t) = \mathbb{E}\left[Y_t(g) - Y_t(0) \mid G = g\right], \quad g \leq t \] This is the average effect of treatment at calendar time \(t\) for units first treated in cohort \(g\). It conditions on belonging to cohort \(g\) and asks what the average effect of treatment is \(t - g\) periods after adoption.
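Under Assumptions 1–4 (using the never-treated comparison group and conditional parallel trends), \(ATT(g,t)\) is identified by: \[ ATT(g, t) = \mathbb{E}\left[\left(\frac{\mathbf{1}\{G = g\}}{\Pr(G = g)} - \frac{\frac{p_g(X)\mathbf{1}\{C=1\}}{1-p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)\mathbf{1}\{C=1\}}{1-p_g(X)}\right]}\right)(Y_t - Y_{g-1})\right] \] This is the inverse probability weighting (IPW) representation: never-treated units are reweighted by the odds \(p_g(X)/(1-p_g(X))\) so that their covariate distribution matches that of cohort \(g\). An estimator based on it alone is consistent only if the propensity score model \(p_g(X)\) is correctly specified. The doubly robust (DR) version replaces \(Y_t - Y_{g-1}\) with its deviation from an outcome regression fitted on the comparison group, and is consistent if either the propensity score model or the outcome regression model is correctly specified.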
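Without covariates the propensity score is constant, and the weighting expression above reduces to a difference in mean long differences \(Y_t - Y_{g-1}\) between cohort \(g\) and the never-treated group. The following is a minimal sketch of that special case on a hypothetical five-unit panel; the data layout and the function name att_gt are assumptions made for the example.

```python
from statistics import mean

NEVER = float("inf")  # G_i = infinity marks never-treated units

# Hypothetical wide-format panel: unit -> (cohort G_i, {period: outcome}).
# Built from a common trend (0, 0.5, 1.5) plus effects of 1 and 4 for
# cohort 2 in periods 2 and 3, and an effect of 1 for cohort 3 in period 3.
panel = {
    "a": (2, {1: 10.0, 2: 11.5, 3: 15.5}),
    "b": (2, {1: 12.0, 2: 13.5, 3: 17.5}),
    "c": (3, {1: 20.0, 2: 20.5, 3: 22.5}),
    "d": (NEVER, {1: 5.0, 2: 5.5, 3: 6.5}),
    "e": (NEVER, {1: 7.0, 2: 7.5, 3: 8.5}),
}


def att_gt(panel, g, t):
    """Unconditional ATT(g, t): mean of Y_t - Y_{g-1} in cohort g minus the
    same mean among never-treated units."""
    def long_diffs(cohort):
        return [y[t] - y[g - 1] for (G, y) in panel.values() if G == cohort]
    return mean(long_diffs(g)) - mean(long_diffs(NEVER))


estimates = {(g, t): att_gt(panel, g, t)
             for g in (2, 3) for t in (2, 3) if t >= g}
print(estimates)
```

Because the toy data satisfy parallel trends exactly, the three estimates equal the effects used to build the panel: 1 for \((g,t) = (2,2)\), 4 for \((2,3)\) and 1 for \((3,3)\).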
Doubly Robust Estimation
Callaway and Sant'Anna (2021) recommend a doubly robust estimator that combines inverse probability weighting with outcome regression. Specifically, let \(\hat{p}_g(x)\) be an estimated propensity score and \(\hat{\mu}_{g,t}^{C}(x)\) be an estimated conditional mean of \(Y_t - Y_{g-1}\) for the comparison group. The DR estimator is: \[ \begin{aligned} \widehat{ATT}(g,t) = \mathbb{E}_n\Bigg[&\frac{\mathbf{1}\{G = g\}}{\mathbb{E}_n[\mathbf{1}\{G = g\}]}\left(Y_t - Y_{g-1} - \hat{\mu}_{g,t}^{C}(X)\right) \\ &- \frac{\frac{\hat{p}_g(X)\mathbf{1}\{C=1\}}{1-\hat{p}_g(X)}}{\mathbb{E}_n\left[\frac{\hat{p}_g(X)\mathbf{1}\{C=1\}}{1-\hat{p}_g(X)}\right]}\left(Y_t - Y_{g-1} - \hat{\mu}_{g,t}^{C}(X)\right)\Bigg] \end{aligned} \] where \(\mathbb{E}_n[\cdot]\) denotes the sample mean. This estimator is doubly robust: it remains consistent if either the propensity score model or the outcome regression model is correctly specified. Semiparametric efficiency is a local property that requires both working models to be correctly specified.
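Double robustness can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: it handles a single \((g,t)\) pair, uses one binary covariate, and stands in saturated cell means for the logit and linear working models that would typically be used. The data are hypothetical, with untreated trends equal to \(1 + 2x\) and a true ATT of 3.

```python
from statistics import mean

# Hypothetical sample for one (g, t) pair: (in_cohort_g, x, delta_y), where
# delta_y = Y_t - Y_{g-1} and units not in cohort g are never-treated.
sample = [
    (True, 1, 6.0), (True, 1, 6.0), (True, 1, 6.0), (True, 0, 4.0),
    (False, 1, 3.0), (False, 0, 0.5), (False, 0, 1.0), (False, 0, 1.5),
]


def cell_propensity(sample):
    """Saturated estimate of p_g(x): share of cohort-g units in each x cell."""
    cells = {x for _, x, _ in sample}
    return {c: mean(float(tr) for tr, x, _ in sample if x == c) for c in cells}


def cell_outcome_regression(sample):
    """Saturated estimate of mu(x): mean delta_y among never-treated units."""
    cells = {x for _, x, _ in sample}
    return {c: mean(dy for tr, x, dy in sample if x == c and not tr)
            for c in cells}


def dr_att(sample, p_hat, mu_hat):
    """Doubly robust ATT(g, t) for given propensity and outcome fits."""
    n = len(sample)
    treated_share = sum(tr for tr, _, _ in sample) / n
    # Odds weights p/(1-p) for comparison units, zero for cohort-g units.
    odds = [0.0 if tr else p_hat[x] / (1.0 - p_hat[x]) for tr, x, _ in sample]
    odds_mean = sum(odds) / n
    total = 0.0
    for (tr, x, dy), w_comp in zip(sample, odds):
        w_treat = float(tr) / treated_share
        total += (w_treat - w_comp / odds_mean) * (dy - mu_hat[x])
    return total / n


p_ok, mu_ok = cell_propensity(sample), cell_outcome_regression(sample)
p_bad = {0: 0.5, 1: 0.5}    # deliberately misspecified: ignores x
mu_bad = {0: 0.0, 1: 0.0}   # deliberately misspecified: no outcome model

att_ps_only = dr_att(sample, p_ok, mu_bad)    # propensity right, outcome wrong
att_or_only = dr_att(sample, p_bad, mu_ok)    # propensity wrong, outcome right
att_neither = dr_att(sample, p_bad, mu_bad)   # both wrong
print(att_ps_only, att_or_only, att_neither)
```

With either working model correct the estimate is 3 (up to floating-point error). With both misspecified the estimator collapses to a raw difference in means and returns 4, because cohort \(g\) has more \(x = 1\) units, whose untreated trend is steeper.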
Comparison Groups
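CS allow two choices of comparison group for \(ATT(g,t)\): never-treated units, or not-yet-treated units (those with \(G > t\), which includes the never-treated). The not-yet-treated comparison is useful when the never-treated group is small or unrepresentative, but it relies on a different parallel trends assumption: the untreated potential outcomes of cohort \(g\) must evolve in parallel with those of every group not yet treated by time \(t\). The composition of the comparison group then changes with \(t\).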
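The two comparison groups differ only in which units are admitted as controls for a given \((g,t)\). A minimal sketch of that selection rule follows; the function name, the option strings and the cohort assignments are assumptions made for the example.

```python
NEVER = float("inf")  # G_i = infinity marks never-treated units


def comparison_units(cohorts, g, t, control_group="nevertreated"):
    """Return the ids of units admitted as controls for ATT(g, t).

    cohorts maps unit id -> first treated period (NEVER if never treated).
    """
    if control_group == "nevertreated":
        return sorted(i for i, G in cohorts.items() if G == NEVER)
    if control_group == "notyettreated":
        # Units still untreated at time t (never-treated included).
        # Cohort g itself is excluded so it cannot serve as its own control.
        return sorted(i for i, G in cohorts.items() if G > t and G != g)
    raise ValueError(f"unknown control_group: {control_group!r}")


cohorts = {"a": 2, "b": 3, "c": 4, "d": NEVER}
print(comparison_units(cohorts, g=2, t=2, control_group="notyettreated"))
print(comparison_units(cohorts, g=2, t=3, control_group="notyettreated"))
print(comparison_units(cohorts, g=2, t=3, control_group="nevertreated"))
```

For cohort 2 the not-yet-treated pool is units b, c and d at \(t = 2\) but only c and d at \(t = 3\), since b has been treated by then. The never-treated pool is unit d throughout.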
Aggregation
The collection of \(ATT(g,t)\) estimates for all \((g,t)\) pairs with \(t \geq g\) constitutes the full set of group-time treatment effects. These can be visualised as a two-dimensional array indexed by cohort and time. CS propose three main aggregation strategies:
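Simple aggregation. Average all post-treatment \(ATT(g,t)\) with weights proportional to cohort size: \[ \theta^{simple} = \sum_{g \in \mathcal{G}} \sum_{t=g}^{T} w_{g,t} \cdot ATT(g,t) \] where \(w_{g,t} \propto \Pr(G = g \mid G \leq T)\) and the weights are normalised to sum to one over all \((g,t)\) with \(t \geq g\). Because early cohorts contribute more post-treatment periods, they receive more total weight.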
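Event-study (dynamic) aggregation. For each relative time \(\ell = t - g\), average \(ATT(g, g+\ell)\) across the cohorts observed at that relative time: \[ \theta^{dynamic}(\ell) = \sum_{g} \mathbf{1}\{g + \ell \leq T\} \cdot \Pr(G = g \mid G + \ell \leq T) \cdot ATT(g, g+\ell) \] This produces an event-study plot of average effects as a function of time since treatment. Because the set of contributing cohorts changes with \(\ell\), differences across \(\ell\) can reflect cohort composition as well as treatment dynamics. For \(\ell < 0\), the same aggregation applied to pre-treatment periods serves as a pre-trends check: no anticipation and parallel trends in the pre-treatment periods imply \(\theta^{dynamic}(\ell) = 0\) for \(\ell < 0\).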
Calendar-time aggregation. Average \(ATT(g,t)\) for each calendar period \(t\), weighting by cohort shares among the treated at time \(t\).
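The three aggregations differ only in which \((g,t)\) cells are pooled. In the sketch below, each one is a cohort-size-weighted average over a different subset of cells. The group-time effects and cohort sizes are hypothetical, and the function names are chosen for the example.

```python
# Hypothetical group-time effects ATT(g, t) for t >= g, and cohort sizes.
att = {(2, 2): 1.0, (2, 3): 4.0, (3, 3): 1.0}
cohort_size = {2: 2, 3: 1}


def weighted_average(cells):
    """Average ATT(g, t) over cells with weights proportional to cohort size."""
    total = sum(cohort_size[g] for g, _ in cells)
    return sum(cohort_size[g] * att[g, t] for g, t in cells) / total


def simple_aggregation():
    """Pool every post-treatment cell."""
    return weighted_average(list(att))


def dynamic_aggregation(ell):
    """Pool cells at the same time since treatment, ell = t - g."""
    return weighted_average([(g, t) for g, t in att if t - g == ell])


def calendar_aggregation(period):
    """Pool cells observed in the same calendar period."""
    return weighted_average([(g, t) for g, t in att if t == period])


print(simple_aggregation())
print(dynamic_aggregation(0), dynamic_aggregation(1))
print(calendar_aggregation(2), calendar_aggregation(3))
```

With these inputs the simple aggregate is about 2.2, the event-study aggregates are 1 on impact and 4 one period later, and the calendar-time aggregates are 1 in period 2 and 3 in period 3. The same set of group-time effects therefore supports quite different summary numbers, depending on the question asked.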
Inference
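Callaway and Sant'Anna (2021) derive the asymptotic distribution of \(\widehat{ATT}(g,t)\) under standard regularity conditions. The estimator is \(\sqrt{n}\)-consistent and asymptotically normal, jointly across \((g,t)\). Because the aggregated estimands are weighted averages of the \(ATT(g,t)\), with weights that are smooth functions of cohort shares, the delta method yields standard errors for the aggregated quantities. The authors also propose a multiplier bootstrap, which can be used to construct simultaneous confidence bands across group-time pairs or event times.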
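In its simplest form the delta-method step is a quadratic form in the aggregation weights. The sketch below makes a simplifying assumption that the text does not: it treats the weights as known constants, so it ignores the sampling variability of estimated cohort shares. The estimates, weights and covariance matrix are hypothetical.

```python
from math import sqrt


def aggregate_with_se(estimates, covariance, weights):
    """Point estimate w'theta and standard error sqrt(w'Vw) for fixed weights.

    covariance is the covariance matrix of the estimates themselves
    (that is, already scaled by the sample size).
    """
    k = len(estimates)
    point = sum(weights[i] * estimates[i] for i in range(k))
    variance = sum(weights[i] * covariance[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    return point, sqrt(variance)


# Hypothetical ATT(2,2), ATT(2,3), ATT(3,3) with their covariance matrix.
theta_hat = [1.0, 4.0, 1.0]
v_hat = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.00],
    [0.00, 0.00, 0.16],
]
w = [0.4, 0.4, 0.2]

point, se = aggregate_with_se(theta_hat, v_hat, w)
print(point, se)
```

Here the aggregate is about 2.2 with a variance of 0.0304. The off-diagonal term matters: estimates for the same cohort in different periods share units, so they are generally correlated.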
For clustered settings (e.g., units nested within states), the authors recommend cluster-robust inference. With a small number of clusters, asymptotic approximations can be unreliable, and bootstrap or permutation procedures designed for few clusters may be preferable.
Comparison to Sun–Abraham and de Chaisemartin–D'Haultfoeuille
Sun and Abraham (2021) propose an interaction-weighted (IW) estimator that is closely related to the CS estimator: it coincides with it when there are no covariates and the same comparison group and reference period are used. The IW estimator works within the TWFE regression framework by including interactions between cohort dummies and relative-time dummies, and then aggregating using cohort shares as weights.
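In the notation of this article, the IW regression can be sketched as follows, where the coefficient symbol \(\beta_{g,\ell}\) and the choice of \(\ell = -1\) as the omitted reference period are conventions assumed here for illustration: \[ Y_{it} = \alpha_i + \lambda_t + \sum_{g \neq \infty} \sum_{\ell \neq -1} \beta_{g,\ell} \cdot \mathbf{1}\{G_i = g\} \cdot \mathbf{1}\{t - G_i = \ell\} + \varepsilon_{it} \] Each \(\beta_{g,\ell}\) plays the role of \(ATT(g, g+\ell)\). The event-study coefficient at relative time \(\ell\) is then the average of the \(\hat{\beta}_{g,\ell}\) over cohorts, weighted by each cohort's sample share among the cohorts observed at that relative time.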
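de Chaisemartin and D'Haultfoeuille (2020) propose the \(\hat{\delta}^{DID_M}\) estimator, which compares units whose treatment status switches between consecutive periods with units whose status is unchanged, and identifies the average treatment effect for switchers in the period in which they switch. This is a narrower estimand than the collection of \(ATT(g,t)\): it captures the instantaneous effect of treatment and not its dynamics. In exchange, the estimator remains applicable when treatment can switch off as well as on.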
Among these approaches, the CS framework is the most flexible in the sense that it identifies the full collection of group-time ATTs, accommodates covariates, and allows arbitrary aggregation. The choice between these estimators should be guided by the policy question.
Conclusion
The Callaway–Sant'Anna estimator has become a widely used approach for DiD with staggered adoption. Its key contributions are: (1) replacing the conflated TWFE estimate with a transparent collection of group-time ATTs; (2) providing a doubly robust estimator that is valid under selection on observables; and (3) allowing flexible aggregation that aligns the estimand with the policy question. The did package in R provides a ready implementation; see the companion toolbox article in this issue.
References
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- de Chaisemartin, C. and D'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964--2996.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254--277.
- Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555--2591.
- Roth, J., Sant'Anna, P. H. C., Bilinski, A., and Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2):2218--2244.
- Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175--199.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.