1 Introduction
For decades, the two-way fixed effects (TWFE) estimator has been the workhorse of panel data econometrics. Regress your outcome on a treatment indicator, unit fixed effects, and time fixed effects, and the coefficient gives you (or so the story went) a reasonable estimate of the average treatment effect. Legions of papers were published on this basis.
The story turned out to be more complicated. Over the past several years, a wave of methodological work has demonstrated that in staggered adoption settings, where different units adopt treatment at different times, the TWFE estimator is a weighted average of many two-by-two difference-in-differences (DiD) comparisons. Some of those comparisons use already-treated units as controls, so some underlying treatment effects can enter the average with negative weight. A treatment effect that is uniformly positive can, in principle, yield a negative TWFE coefficient.
Goodman-Bacon [2021] provided the clearest decomposition of exactly what TWFE estimates. This article explains the Goodman-Bacon decomposition in depth: the mathematics, the intuition, the implications for applied work, and what researchers should do about it.
2 The Standard TWFE Setup
Consider a balanced panel with units i = 1,...,N and time periods t = 1,...,T. Some units are treated, and treatment is absorbing (once treated, always treated). Let Dᵢₜ = 1 if unit i is treated at time t and Dᵢₜ = 0 otherwise, and define gᵢ as the period in which unit i first receives treatment (gᵢ = ∞ for never-treated units).
The standard TWFE regression is:
yᵢₜ = αᵢ + λₜ + βᵀᵂᶠᴱ Dᵢₜ + εᵢₜ
where yᵢₜ is the outcome, αᵢ are unit fixed effects, λₜ are time fixed effects, εᵢₜ is the error term, and βᵀᵂᶠᴱ is the coefficient of interest.
In a canonical 2x2 DiD with one treated group and one clean control group, βᵀᵂᶠᴱ recovers the average treatment effect on the treated (ATT) under parallel trends. The trouble arises when there are multiple treatment cohorts.
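As a minimal sketch of the regression above, the coefficient can be computed on a balanced panel without a regression library by double-demeaning the treatment indicator (the Frisch-Waugh-Lovell route). The function name `twfe_coefficient`, the nested-list layout and the toy numbers are illustrative assumptions, not part of any package.

```python
def twfe_coefficient(y, d):
    """Coefficient on d from a regression of y on d plus unit and time
    fixed effects, for a balanced panel stored as y[unit][period]."""
    n_units, n_periods = len(d), len(d[0])
    unit_mean = [sum(row) / n_periods for row in d]
    time_mean = [sum(d[i][t] for i in range(n_units)) / n_units
                 for t in range(n_periods)]
    grand_mean = sum(unit_mean) / n_units
    # Double-demeaned treatment: the residual of d on both sets of fixed effects.
    d_tilde = [[d[i][t] - unit_mean[i] - time_mean[t] + grand_mean
                for t in range(n_periods)] for i in range(n_units)]
    numerator = sum(d_tilde[i][t] * y[i][t]
                    for i in range(n_units) for t in range(n_periods))
    denominator = sum(v * v for row in d_tilde for v in row)
    return numerator / denominator


# Canonical 2x2: one treated unit, one control, two periods, true effect 5.
d = [[0, 1], [0, 0]]
y = [[10.0, 16.0],   # treated: level 10, common trend +1, effect +5
     [7.0, 8.0]]     # control: level 7, common trend +1
beta = twfe_coefficient(y, d)
print(beta)  # 5.0
```

In this two-group, two-period case the coefficient equals the simple DiD, which is the benchmark the staggered case departs from.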
3 The Decomposition Theorem
Goodman-Bacon [2021] shows that with staggered treatment timing, the OLS estimator βᵀᵂᶠᴱ can be written as a weighted average of all possible 2x2 DiD estimators:
βᵀᵂᶠᴱ = Σ_{k≠ℓ} Ŝₖₗ β̂ᴰⁱᴰₖₗ
where the sum runs over all pairs of groups (k, ℓ), β̂ᴰⁱᴰₖₗ is the 2x2 DiD comparing group k to group ℓ using a specific pre/post window, and Ŝₖₗ are scalar weights that are non-negative and sum to one. What can turn negative is the weight that some of these 2x2 terms implicitly place on the underlying treatment effects.
More concretely, there are three types of 2x2 comparisons embedded in TWFE:
- Early vs. never-treated: A cohort that adopts treatment early, compared against units that never adopt. This is a clean comparison.
- Late vs. never-treated: A cohort that adopts treatment late, compared against never-treated units. Also clean.
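- Early vs. late (timing variation): Two adopting cohorts compared with each other. Before the late cohort adopts, it serves as a not-yet-treated control for the early cohort, which is still a clean comparison. After the late cohort adopts, the already-treated early cohort serves as its control. This second case is the problematic comparison.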
The third type is pernicious. When an early-adopting cohort is used as the control for a late adopter, it is already treated during the comparison window. If treatment effects change over time (effect heterogeneity), the change in the early adopter's own treatment effect is differenced out as if it were a common trend, biasing the comparison. In extreme cases, some underlying treatment effects enter βᵀᵂᶠᴱ with negative weight, meaning TWFE actually subtracts a positive treatment effect.
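The source of the bias can be written out in one line. The following is a sketch under parallel trends, using notation that is an assumption of this illustration: ATT_k(pre) and ATT_k(post) are the early cohort's average effects in the two halves of the comparison window (both after its own adoption), and ATT_ℓ(post) is the late cohort's average effect after it adopts.

```latex
\begin{aligned}
\hat\beta^{\mathrm{DiD}}_{k\ell}
  &= \bigl(\bar y_{\ell}^{\,\mathrm{post}} - \bar y_{\ell}^{\,\mathrm{pre}}\bigr)
   - \bigl(\bar y_{k}^{\,\mathrm{post}} - \bar y_{k}^{\,\mathrm{pre}}\bigr), \\
\mathbb{E}\bigl[\hat\beta^{\mathrm{DiD}}_{k\ell}\bigr]
  &= \mathrm{ATT}_{\ell}(\mathrm{post})
   - \bigl[\mathrm{ATT}_{k}(\mathrm{post}) - \mathrm{ATT}_{k}(\mathrm{pre})\bigr].
\end{aligned}
```

The bracketed term is the change in the early cohort's own effect. It vanishes when effects are constant over time. Otherwise it is subtracted, which is how a positive treatment effect can enter the TWFE average with a negative sign.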
4 When Do Negative Weights Arise?
The weights in the Goodman-Bacon decomposition depend on the relative sizes of the groups and the timing of treatment. Goodman-Bacon [2021] shows that the weight on a comparison between group k (early) and group ℓ (late) using group k as the control can be simplified to:
Ŝₖₗ = nₖ nₗ D̄ₗ (D̄ₖ − D̄ₗ) / V̂ᴰ
where nₖ is the sample share of group k, D̄ₖ is its mean treatment rate (the share of periods it spends treated, so D̄ₖ > D̄ₗ for the earlier adopter), and V̂ᴰ is the variance of the treatment indicator after removing unit and time fixed effects. These weights are always non-negative in isolation, but the sign of the overall contribution depends on whether the DiD estimate itself is positive or negative. Note that Ŝₖₗ here represents the Goodman-Bacon group-pair decomposition weight (the share of the overall TWFE estimate attributable to the (k, ℓ) comparison) and should not be confused with the raw FWL (Frisch-Waugh-Lovell) residual-based weight wᵢₜ = D̃ᵢₜ / 𝔼[D̃ᵢₜ²] used in the de Chaisemartin-D'Haultfœuille (2020) characterisation. The two representations are equivalent but derived from different decomposition approaches.
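The group-pair weights can be sketched in a few lines. The closed forms below are simplified versions of the expressions in Goodman-Bacon [2021] and should be treated as an assumption to check against the paper: nₖ nᵤ D̄ₖ(1 − D̄ₖ) for treated vs. never-treated, nₖ nₗ (D̄ₖ − D̄ₗ)(1 − D̄ₖ) for an early cohort against a not-yet-treated late cohort, and nₖ nₗ D̄ₗ (D̄ₖ − D̄ₗ) for a late cohort against an already-treated early cohort. The function name `bacon_weights` and the three-cohort design are hypothetical, and the code normalises by the sum of the numerators, which is what the sum-to-one property implies the variance term must equal.

```python
def bacon_weights(shares, dbar):
    """Group-pair weights for every 2x2 comparison.

    shares: dict cohort -> sample share (sums to one)
    dbar:   dict cohort -> share of periods spent treated (0 if never treated)
    Returns dict (treated_cohort, control_cohort) -> weight.
    """
    raw = {}
    for k in shares:
        if dbar[k] == 0:
            continue                      # never-treated is never the treated side
        for l in shares:
            if l == k:
                continue
            pair = shares[k] * shares[l]
            if dbar[l] == 0:              # treated vs. never-treated
                raw[(k, l)] = pair * dbar[k] * (1 - dbar[k])
            elif dbar[k] > dbar[l]:       # early k vs. not-yet-treated late l
                raw[(k, l)] = pair * (dbar[k] - dbar[l]) * (1 - dbar[k])
            else:                         # late k vs. already-treated early l
                raw[(k, l)] = pair * dbar[k] * (dbar[l] - dbar[k])
    total = sum(raw.values())             # normalising constant (assumed to equal the variance term)
    return {pair: value / total for pair, value in raw.items()}


# Hypothetical design: three equal-sized cohorts observed for four periods.
shares = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
dbar = {"A": 0.5, "B": 0.25, "C": 0.0}    # A treated 2 of 4 periods, B 1 of 4
weights = bacon_weights(shares, dbar)
for pair, w in weights.items():
    print(pair, round(w, 3))
```

In this design the comparison that uses the already-treated cohort A as the control for B, keyed `("B", "A")`, carries about a tenth of the total weight, and every weight is non-negative.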
Negative weights emerge when the treatment effect is heterogeneous across time (dynamic effects). If early adopters experience rising treatment effects over their post-treatment periods, using them as controls for later adopters produces a downward-biased comparison: the control group is trending upward due to its own treatment, making the treatment look less effective than it is.
Crucially, negative weights are more likely when:
- Treatment is concentrated among a few large cohorts, which then dominate the timing comparisons.
- Treatment effects grow substantially over the post-treatment period.
- The share of "already-treated" units used as controls is large.
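To see a negative weight appear, the residual-based weights wᵢₜ can be computed directly for a small design. This is a sketch assuming a balanced panel. The two-cohort, three-period layout and the effect sizes are hypothetical, and `treated_cell_weights` is an illustrative name rather than a library function.

```python
def treated_cell_weights(d):
    """Weight TWFE places on each treated cell's effect: the double-demeaned
    treatment indicator divided by its sum of squares, for d[unit][period]."""
    n_units, n_periods = len(d), len(d[0])
    unit_mean = [sum(row) / n_periods for row in d]
    time_mean = [sum(d[i][t] for i in range(n_units)) / n_units
                 for t in range(n_periods)]
    grand_mean = sum(unit_mean) / n_units
    resid = {(i, t): d[i][t] - unit_mean[i] - time_mean[t] + grand_mean
             for i in range(n_units) for t in range(n_periods)}
    scale = sum(v * v for v in resid.values())
    return {cell: v / scale for cell, v in resid.items()
            if d[cell[0]][cell[1]] == 1}


d = [[0, 1, 1],    # early cohort: treated from the second period
     [0, 0, 1]]    # late cohort: treated from the third period
weights = treated_cell_weights(d)

# Hypothetical effects, all positive, growing for the early cohort.
# Keys are zero-based (unit, period) indices.
effects = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 1.0}
beta = sum(weights[cell] * effects[cell] for cell in effects)
print({cell: round(w, 3) for cell, w in weights.items()})
print(round(beta, 3))
```

With no never-treated group, the early cohort's last treated period receives a negative weight. Because that is also where its effect is largest, the implied coefficient is negative even though every individual effect is positive.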
5 A Numerical Illustration
To fix ideas, consider a simple example with three groups:
- Group A (n=100): treated from period 3 onward; ATT=2 in period 3, ATT=4 in period 4.
- Group B (n=100): treated from period 4 onward; ATT=3 in period 4.
- Group C (n=100): never treated.
The clean comparisons (A vs. C and B vs. C) yield positive, meaningful DiD estimates. But TWFE also uses Group A as a control for Group B in period 4. Since Group A is already treated and its effect grew from period 3 to 4, Group A's outcome in period 4 is elevated relative to Group C. This makes Group B's treatment look smaller than it really is: the contaminated comparison pulls βᵀᵂᶠᴱ downward.
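The example can be checked numerically. The sketch below assumes four periods, no noise, and one representative unit per group (the groups are equal-sized). The group levels and the common trend are hypothetical values that difference out, and the four weights are worked out by hand from the Goodman-Bacon weight formulas for this design rather than taken from the original text.

```python
periods = [1, 2, 3, 4]
effect = {"A": {3: 2.0, 4: 4.0}, "B": {4: 3.0}, "C": {}}   # ATTs from the example
level = {"A": 5.0, "B": 3.0, "C": 1.0}       # hypothetical group fixed effects
trend = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}     # hypothetical common time effects
y = {g: {t: level[g] + trend[t] + effect[g].get(t, 0.0) for t in periods}
     for g in effect}


def mean_y(group, window):
    return sum(y[group][t] for t in window) / len(window)


def did(treated, control, pre, post):
    """2x2 DiD: change for the treated group minus change for the control."""
    return ((mean_y(treated, post) - mean_y(treated, pre))
            - (mean_y(control, post) - mean_y(control, pre)))


# (2x2 estimate, group-pair weight worked out by hand for this design)
comparisons = {
    "A vs never-treated C": (did("A", "C", [1, 2], [3, 4]), 0.4),
    "B vs never-treated C": (did("B", "C", [1, 2, 3], [4]), 0.3),
    "A vs not-yet-treated B": (did("A", "B", [1, 2], [3]), 0.2),
    "B vs already-treated A": (did("B", "A", [3], [4]), 0.1),
}
weighted_average = sum(b * w for b, w in comparisons.values())

# Independent check: TWFE via the double-demeaned treatment indicator.
groups = list(effect)
d = {g: {t: float(t in effect[g]) for t in periods} for g in groups}
d_unit = {g: sum(d[g].values()) / len(periods) for g in groups}
d_time = {t: sum(d[g][t] for g in groups) / len(groups) for t in periods}
d_all = sum(d_unit.values()) / len(groups)
resid = {(g, t): d[g][t] - d_unit[g] - d_time[t] + d_all
         for g in groups for t in periods}
beta_twfe = (sum(r * y[g][t] for (g, t), r in resid.items())
             / sum(r * r for r in resid.values()))

for name, (estimate, weight) in comparisons.items():
    print(f"{name}: estimate={estimate:.1f}, weight={weight}")
print(f"weighted average={weighted_average:.2f}, TWFE={beta_twfe:.2f}")
```

Under these assumptions the comparison of B against the already-treated A returns 1 rather than B's true effect of 3, and the TWFE coefficient falls below the simple average of the three ATTs (2, 4 and 3).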
The figure below illustrates the three 2x2 comparisons schematically.
Figure 1: Stylised parallel trends and post-treatment paths for three groups. Group A is already treated when Group B enters treatment, biasing the A-vs-B comparison.
6 Implications for Applied Researchers
6.1 Diagnose before you estimate
The R package bacondecomp implements the Goodman-Bacon decomposition directly. Running it on your data before interpreting TWFE is now a standard best practice. The decomposition reports each 2x2 DiD estimate and its weight, allowing you to see whether contaminated comparisons are driving your results.
6.2 Is TWFE always biased?
No. If treatment effects are homogeneous (the same for all cohorts and in all post-treatment periods), TWFE recovers the true ATT. In the homogeneous case every 2x2 comparison in the Goodman-Bacon decomposition estimates the same effect, so their weighted average does too. The problem is specific to heterogeneous and dynamic treatment effects, which are the norm rather than the exception in most applied settings.
6.3 What to do instead
Several estimators have been proposed that avoid the negative-weights problem:
- Callaway and Sant'Anna [2021]: Estimate group-time ATT(g,t) separately for each cohort-period pair and aggregate.
- Sun and Abraham [2021]: Interact treatment timing indicators with leads/lags; fully saturate the event-study model.
- de Chaisemartin and D'Haultfœuille [2020]: A "chained" DiD estimator that only uses clean comparisons.
- Roth et al. [2023]: A survey and synthesis of the staggered DiD literature with guidance on choosing estimators.
These estimators do not use already-treated units as controls, thereby avoiding the contaminated comparisons that generate negative weights.
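The idea behind group-time effects can be sketched on group means. This is a deliberately simplified illustration in the spirit of Callaway and Sant'Anna [2021], not their estimator as implemented in any package. It assumes a never-treated control group, uses the period just before adoption as the baseline, and ignores covariates and inference. The function name `group_time_att` and the data are hypothetical.

```python
def group_time_att(y, first_treated, never_treated):
    """ATT(g, t) for each cohort and each period t >= g, using the
    never-treated group as control and period g - 1 as the baseline."""
    att = {}
    control = y[never_treated]
    for cohort, g in first_treated.items():
        for t in sorted(y[cohort]):
            if t >= g:
                att[(cohort, t)] = ((y[cohort][t] - y[cohort][g - 1])
                                    - (control[t] - control[g - 1]))
    return att


# Group means from a hypothetical three-group, four-period design.
y = {
    "A": {1: 5.0, 2: 6.0, 3: 9.0, 4: 12.0},   # first treated in period 3
    "B": {1: 3.0, 2: 4.0, 3: 5.0, 4: 9.0},    # first treated in period 4
    "C": {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0},    # never treated
}
att = group_time_att(y, first_treated={"A": 3, "B": 4}, never_treated="C")
overall = sum(att.values()) / len(att)        # simple average over treated cells
print(att)
print(overall)
```

Because the already-treated cohort never serves as a control, each cohort-period effect is recovered separately. The aggregation step then becomes an explicit choice instead of a by-product of the regression.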
7 How Widespread Is the Problem?
Baker et al. [2022] surveyed a large sample of published papers using TWFE in staggered settings and found that the majority did not check for negative weights or effect heterogeneity. In simulations calibrated to typical applied settings, TWFE estimates were often substantially biased when treatment effects were dynamic.
The problem is not merely theoretical. Callaway and Sant'Anna [2021] reanalyse several canonical empirical applications (including studies of minimum wage effects on employment) and find that cohort-specific ATTs differ substantially from the TWFE estimate, sometimes changing the sign of the estimated effect.
8 The Positive Takeaway
The Goodman-Bacon decomposition is not just a critique; it is a diagnostic tool. By decomposing TWFE into its constituent 2x2 comparisons, researchers can:
- Identify which comparisons are "clean" (treated vs. never-treated) and which are potentially contaminated.
- Assess the share of the TWFE estimate driven by contaminated comparisons.
- Decide whether to proceed with TWFE (if contaminated comparisons are a small share) or adopt a heterogeneity-robust estimator.
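The checklist above amounts to a short calculation once a decomposition is in hand. The sketch below assumes the decomposition has already been computed elsewhere and is available as a list of (estimate, weight, contaminated) rows. The row values and the helper `contaminated_share` are hypothetical.

```python
def contaminated_share(rows):
    """rows: list of (estimate, weight, control_already_treated) tuples,
    one per 2x2 comparison. Returns (weight share, contribution to TWFE)."""
    share = sum(w for _, w, bad in rows if bad)
    contribution = sum(b * w for b, w, bad in rows if bad)
    return share, contribution


# Hypothetical decomposition output: (2x2 estimate, weight, contaminated?)
rows = [
    (3.0, 0.4, False),   # early cohort vs. never-treated
    (3.0, 0.3, False),   # late cohort vs. never-treated
    (2.0, 0.2, False),   # early cohort vs. not-yet-treated late cohort
    (1.0, 0.1, True),    # late cohort vs. already-treated early cohort
]
share, contribution = contaminated_share(rows)
twfe = sum(b * w for b, w, _ in rows)
clean_only = (twfe - contribution) / (1 - share)   # reweighted clean comparisons
print(round(share, 3), round(twfe, 3), round(clean_only, 3))
```

Comparing the full weighted average with the average over clean comparisons only gives a rough sense of how much the contaminated terms move the estimate. How large a share is tolerable remains a judgment call.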
The decomposition has done the profession a service: it made the implicit structure of TWFE transparent, and in doing so, it catalysed a richer set of tools for staggered DiD that are now widely available.
9 Conclusion
Two-way fixed effects is not broken, but it does something more complicated than was widely understood. The Goodman-Bacon decomposition reveals that TWFE aggregates all possible 2x2 DiD comparisons, including comparisons that use already-treated units as controls. When treatment effects are heterogeneous or dynamic, this aggregation can produce misleading estimates.
The practical implication is straightforward: before interpreting a TWFE estimate from staggered data, run the Bacon decomposition, inspect the weights, and consider whether the cohort-specific estimators of Callaway and Sant'Anna [2021] or Sun and Abraham [2021] are more appropriate. The tools are available; the question is whether researchers use them.
References
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254-277.
- Callaway, B. and Sant'Anna, P.H.C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200-230.
- Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175-199.
- de Chaisemartin, C. and D'Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964-2996.
- Roth, J., Sant'Anna, P.H.C., Bilinski, A., and Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2):2218-2244.
- Baker, A.C., Larcker, D.F., and Wang, C.C.Y. (2022). How much should we trust staggered difference-in-differences estimates? Journal of Financial Economics, 144(2):370-395.