Introduction
Treatment effects are almost surely heterogeneous. The effect of a minimum wage increase on employment differs across industries, firm sizes, and local labour market conditions. The returns to education differ across individuals with different abilities, networks, and opportunities. The effect of a medical intervention differs by patient age, comorbidities, and genetic background.
Yet for decades, the dominant empirical strategy in economics has been to estimate a single number—the average treatment effect (ATE), or its cousins ATT and LATE—and report it as the policy-relevant parameter. As causal machine learning has made it easier to estimate heterogeneous treatment effects, a debate has emerged: is heterogeneity a feature that enriches policy analysis, or a bug that complicates interpretation and introduces new risks of data mining and selective reporting?
This article steelmans both sides.
1 The Case That Heterogeneity Is a Feature
1.1 Average Effects Can Mislead Policy
The most powerful argument for heterogeneity is that average effects can mislead policy in settings where the policy targets or affects specific subgroups. Consider a job training programme that on average raises earnings by $500 per year. If this average conceals +$3,000 for the long-term unemployed and -$500 for short-term job losers (for whom the training delays job search), then a policy that assigns everyone to training is welfare-reducing for the second group. Reporting only the average effect hides this.
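The arithmetic behind this example can be made explicit. The sketch below assumes the population consists of only the two groups named above and backs out the share of long-term unemployed implied by the stated figures. The function name and the per-capita framing are illustrative assumptions, not part of the original example.

```python
from fractions import Fraction

def implied_share(average, effect_a, effect_b):
    """Share s of group A solving s*effect_a + (1 - s)*effect_b = average."""
    return Fraction(average - effect_b, effect_a - effect_b)

# Figures from the text: +$3,000 (long-term unemployed), -$500 (short-term), $500 average.
share_long_term = implied_share(500, 3000, -500)

# Per-capita earnings gain if everyone is trained (reproduces the average).
universal_gain = share_long_term * 3000 + (1 - share_long_term) * (-500)

# Per-capita gain if only the long-term unemployed are trained.
targeted_gain = share_long_term * 3000

print(share_long_term, float(universal_gain), round(float(targeted_gain), 2))
```

Under these assumptions the long-term unemployed make up 2/7 of the population, and restricting training to them raises the per-capita gain from $500 to roughly $857.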
Imbens [2015] makes the point precisely: the policy-relevant parameter depends on the policy being evaluated. For a universal programme (everyone gets treatment), ATE is the right target. For a targeted programme (select the units most likely to benefit), the conditional average treatment effect (CATE) is what matters. Optimally targeted programmes can have much larger welfare gains than average-effect estimates suggest.
1.2 Causal Forests Provide Honest Heterogeneity Estimates
The development of causal forests by Wager and Athey [2018] addressed a key technical problem: how to estimate CATEs without overfitting or cherry-picking. Causal forests are built from honest trees, in which one subsample determines the splits and a separate subsample estimates the effects within each leaf. Under regularity conditions, the resulting estimates of τ(x) = E[Y(1) − Y(0) | X = x] are pointwise consistent and asymptotically normal, so confidence intervals can be constructed at a given covariate value x.
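The role of honesty can be illustrated with a deliberately small simulation. This is a sketch of sample splitting for a single split, not the causal forest algorithm itself. The data-generating process (a true effect of zero for every unit), the balanced design, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean

def effect(rows):
    """Difference in mean outcomes, treated minus control; rows are (x, w, y)."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return mean(treated) - mean(control)

def leaf_gap(rows, cut):
    """Estimated effect in the leaf x > cut minus the leaf x <= cut."""
    low = [r for r in rows if r[0] <= cut]
    high = [r for r in rows if r[0] > cut]
    return effect(high) - effect(low)

def draw_sample(rng, n=200):
    """Balanced design: x in 0..9, alternating treatment blocks, pure-noise outcome.
    The true treatment effect is zero for every unit."""
    return [(i % 10, (i // 10) % 2, rng.gauss(0, 1)) for i in range(n)]

def one_replication(rng):
    build, estimate = draw_sample(rng), draw_sample(rng)
    # Choose the split that looks most heterogeneous on the building sample.
    best_cut = max(range(9), key=lambda c: abs(leaf_gap(build, c)))
    adaptive = leaf_gap(build, best_cut)    # same data chose the split and estimated the gap
    honest = leaf_gap(estimate, best_cut)   # fresh data estimates the gap
    return abs(adaptive), abs(honest)

rng = random.Random(0)
results = [one_replication(rng) for _ in range(300)]
mean_adaptive = mean([a for a, _ in results])
mean_honest = mean([h for _, h in results])
print(f"adaptive |gap|: {mean_adaptive:.2f}, honest |gap|: {mean_honest:.2f}")
```

Because the split is chosen to maximise the apparent gap, the estimate taken from the same data tends to overstate heterogeneity that does not exist. The estimate taken from the separate sample does not inherit that selection.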
Double machine learning [Chernozhukov et al., 2018] provides a complementary framework. By combining cross-fitting with orthogonalised moment conditions, it allows flexible machine learning estimators of nuisance functions while retaining standard confidence intervals for a low-dimensional target parameter, and the same logic extends to parametric or semi-parametric models of the CATE. The required conditions are much weaker than those of traditional parametric inference. These tools make rigorous heterogeneity analysis far more feasible than it used to be. The argument is that, now that credible tools exist, we should use them.
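A minimal sketch of cross-fitting and orthogonalisation follows, for the constant-effect partially linear model Y = θD + g(X) + ε rather than a full CATE model. The binned-mean nuisance learner stands in for whatever machine learning method would be used in practice. The simulated data and all function names are assumptions for illustration only.

```python
import random

def fit_bin_means(xs, targets, bins=10):
    """Crude nuisance learner: mean of the target within equal-width bins of x in [0, 1)."""
    sums, counts = [0.0] * bins, [0] * bins
    for x, t in zip(xs, targets):
        b = min(int(x * bins), bins - 1)
        sums[b] += t
        counts[b] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * bins), bins - 1)]

def dml_partial_linear(x, d, y, folds=2):
    """Cross-fitted residual-on-residual estimate of theta in Y = theta*D + g(X) + e."""
    n = len(x)
    num = den = 0.0
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        test = [i for i in range(n) if i % folds == k]
        # Nuisance functions are fitted on the other fold only (cross-fitting).
        m_hat = fit_bin_means([x[i] for i in train], [d[i] for i in train])
        l_hat = fit_bin_means([x[i] for i in train], [y[i] for i in train])
        for i in test:
            d_res = d[i] - m_hat(x[i])   # treatment residual
            y_res = y[i] - l_hat(x[i])   # outcome residual
            num += d_res * y_res
            den += d_res * d_res
    return num / den

rng = random.Random(1)
n, theta = 4000, 1.0
x = [rng.random() for _ in range(n)]
d = [3 * xi + rng.gauss(0, 1) for xi in x]                            # D depends on X
y = [theta * di + 3 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]   # so does Y

# Naive slope of Y on D ignores the confounder X.
d_bar, y_bar = sum(d) / n, sum(y) / n
naive = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
         / sum((di - d_bar) ** 2 for di in d))
theta_hat = dml_partial_linear(x, d, y)
print(f"naive: {naive:.2f}, cross-fitted: {theta_hat:.2f}, truth: {theta}")
```

The naive slope absorbs the confounding through X, while the residual-on-residual estimate should land close to the true θ in this simulated setting.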
1.3 Welfare Analysis and Policy Learning
Athey and Wager [2021] formalise the case for heterogeneity from a welfare perspective. Given a budget constraint on how many units can be treated, the optimal targeting rule selects units where τ(x) > c for some threshold c determined by the constraint. This "policy learning" framework transforms CATE estimation into a direct input for decision-making, not just a descriptive exercise.
The potential efficiency gains are large. In a setting where the average effect is zero but the top quartile of the CATE distribution has a positive effect, moving from universal to targeted treatment can convert a failed programme into an effective one.
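The targeting rule and the zero-average case can be sketched together. The example below treats the unit-level effects as known, whereas in practice they would be estimates with their own uncertainty. The effect values and the function name are hypothetical.

```python
def targeting_rule(cate, budget_share):
    """Treat the units with the largest effects, up to the budget.
    Returns the implied threshold c and the indices of treated units."""
    n_treat = int(len(cate) * budget_share)
    ranked = sorted(range(len(cate)), key=lambda i: cate[i], reverse=True)
    treated = sorted(ranked[:n_treat])
    # c is the largest effect among units left untreated (ties would need a tie-break rule).
    threshold = cate[ranked[n_treat]] if n_treat < len(cate) else float("-inf")
    return threshold, treated

# Hypothetical unit-level effects with mean zero.
cate = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 3.0]

universal_gain = sum(cate) / len(cate)                       # treat everyone
threshold, treated = targeting_rule(cate, budget_share=0.25)
targeted_gain = sum(cate[i] for i in treated) / len(cate)    # per-capita gain

print(universal_gain, threshold, treated, targeted_gain)
```

With these hypothetical values, universal treatment yields a per-capita gain of 0.0, while treating only the top quartile (units with τ above c = 0.0) yields 0.5.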
2 The Case That Heterogeneity Is a Bug
2.1 Heterogeneity Multiplies the Multiple Testing Problem
Every subgroup analysis is a potential fishing expedition. Researchers who report heterogeneous effects for women, minorities, young workers, high-income areas, and so on face a severe multiple testing problem: with enough subgroups, some will show large effects by chance, and these are the ones that get reported.
Feller and Gelman [2015] show that even pre-specified subgroup analyses in randomised trials can mislead when the number of subgroups is large relative to sample size. The reported heterogeneous effects are typically unreliable: the "most striking" heterogeneity estimates are usually the largest random deviations from the truth. This problem is compounded in observational studies where the analyst can choose among many possible covariates X to define subgroups. Without pre-specification, reported heterogeneity is almost certain to overstate the true variation in treatment effects.
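The scale of the multiple testing problem is easy to quantify under simplifying assumptions. The sketch below assumes independent subgroup tests, a true effect of zero in every subgroup, and a 5% significance level. The function names are illustrative.

```python
import random
from statistics import mean

def familywise_error(n_subgroups, alpha=0.05):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_subgroups

def mean_largest_z(n_subgroups, reps=2000, seed=0):
    """Average of the largest z-statistic when every true subgroup effect is zero."""
    rng = random.Random(seed)
    draws = [max(rng.gauss(0, 1) for _ in range(n_subgroups)) for _ in range(reps)]
    return mean(draws)

for k in (1, 5, 20):
    print(k, round(familywise_error(k), 3), round(mean_largest_z(k), 2))
```

Under these assumptions the chance of at least one spurious "significant" subgroup rises from 5% with one test to about 23% with five and about 64% with twenty. The second function illustrates the related point that the most striking estimate is, on average, well above the truth even when no subgroup has any effect.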
2.2 Heterogeneity in TWFE DiD: A Mechanical Artefact
A specific form of "heterogeneity" that has dominated the DiD literature since 2018 concerns treatment timing: different cohorts receive treatment at different calendar times. Goodman-Bacon [2021] showed that, under staggered timing, the TWFE estimator is a weighted average of all two-group, two-period comparisons, including comparisons that use earlier-treated units as controls for later-treated units. When effects change with time since treatment, those comparisons contaminate the estimate and can effectively place negative weights on some effects. On this view, the problem arises less from substantively interesting heterogeneity than from the mechanics of the estimator.
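A stylised three-period example shows how using earlier-treated units as controls can go wrong. The numbers are hypothetical: parallel trends hold by construction, the early cohort's effect is assumed to grow over time, and the late cohort's true effect is +1.

```python
def did(treated_before, treated_after, control_before, control_after):
    """Canonical 2x2 difference-in-differences."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical setup: common untreated trend of +1 per period (parallel trends hold).
untreated = {0: 10.0, 1: 11.0, 2: 12.0}

# Early cohort is treated from period 1; its effect grows from 1 to 3.
early_effect = {0: 0.0, 1: 1.0, 2: 3.0}
# Late cohort is treated from period 2 with a true effect of 1.
late_effect = {0: 0.0, 1: 0.0, 2: 1.0}

early = {t: untreated[t] + early_effect[t] for t in untreated}
late = {t: untreated[t] + late_effect[t] for t in untreated}

# Clean comparison: early cohort vs not-yet-treated late cohort, periods 0 -> 1.
clean = did(early[0], early[1], late[0], late[1])

# Contaminated comparison: late cohort vs already-treated early cohort, periods 1 -> 2.
contaminated = did(late[1], late[2], early[1], early[2])

print(clean, contaminated)
```

The clean comparison recovers the early cohort's first-period effect of 1.0. The contaminated comparison returns -1.0 for a late cohort whose true effect is +1, because the growth in the early cohort's effect is subtracted as if it were part of the counterfactual trend.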
The literature's response, in the estimators of Callaway and Sant’Anna, Sun and Abraham, and Borusyak, Jaravel and Spiess (BJS), has been to "fix" this by estimating effects separately by cohort and event time. But critics argue that this proliferation of cohort-specific estimates, while technically correct, can be unwieldy and hard to summarise for policy purposes. A table of 20 cohort-specific ATTs is not obviously more useful than a single TWFE number, especially if the cohort-specific estimates are noisy and the heterogeneity across cohorts is small.
2.3 The Risk of Selective Reporting and P-Hacking in Subgroups
Ding et al. [2019] provide a formal decomposition of treatment effect variation into a systematic component explained by observed covariates (between-group heterogeneity) and an idiosyncratic component (within-group heterogeneity). They show that the between-group component, the source of heterogeneity that actually changes the policy conclusion, is typically small relative to within-group variance. This suggests that much reported heterogeneity reflects within-cell noise rather than genuine variation in treatment effects across subgroups.
More broadly, reported heterogeneity is subject to the same publication bias pathology as main effects [Brodeur et al., 2016]: surprising, large subgroup effects are more likely to be published than null heterogeneity findings. The literature's record of heterogeneous effects is almost certainly inflated by selection.
2.4 Structural Models and the External Validity of Heterogeneity
Even if heterogeneity estimates within a study are unbiased, they may not generalise to other populations. Heckman [1997] argues that the subgroup where a given LATE or CATE is estimated—often defined by compliance with a specific instrument—may bear no resemblance to the subgroup that a future policy would target. Heterogeneity estimates are as subject to external validity concerns as average effects, and potentially more so.
3 What Would Resolve the Debate?
Several developments would help adjudicate between these positions:
- Prospective registration of heterogeneity analyses. If pre-analysis plans specify which subgroups will be analysed before data are collected, selective reporting can be controlled. The cost is the loss of exploratory flexibility.
- Hold-out validation. Heterogeneity estimates from one sample can be validated on a hold-out sample. If the subgroups identified as high-effect in the training sample do not replicate in the hold-out, the heterogeneity is likely spurious.
- More replications. As the evidence base for specific interventions grows, meta-analyses can assess whether heterogeneity patterns are consistent across studies and settings.
- Better power calculations. Most studies are powered for average effects; subgroup analyses are severely underpowered. Honest power calculations for heterogeneity analyses would deter low-power fishing expeditions.
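The power point in the last item can be made concrete with a standard minimum detectable effect calculation. The sketch assumes a two-arm trial with equal arms, two equal-sized subgroups, a common outcome standard deviation, and a normal approximation. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detected with the given power in a two-sided test."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se

def se_average_effect(n, sigma=1.0):
    """SE of a difference in means with n units split equally across two arms."""
    return 2 * sigma / sqrt(n)

def se_subgroup_difference(n, sigma=1.0):
    """SE of the difference between the effects in two equal-sized subgroups."""
    se_each = se_average_effect(n / 2, sigma)   # each subgroup has n/2 units
    return sqrt(2) * se_each

n = 1000
mde_average = minimum_detectable_effect(se_average_effect(n))
mde_interaction = minimum_detectable_effect(se_subgroup_difference(n))
print(round(mde_average, 3), round(mde_interaction, 3))
```

Under these assumptions the smallest detectable difference between subgroups is twice the smallest detectable average effect, so detecting a subgroup difference as large as the average effect requires roughly four times the sample.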
4 A Synthesis
The tension between these positions is real but not irresolvable. Heterogeneity is clearly a feature in the sense that it contains genuine information about policy effectiveness that average effects discard. But it is a bug when it is estimated without discipline, reported selectively, or used to justify conclusions that would not survive pre-registration.
The productive path forward is to treat heterogeneity analysis with the same design-based rigour that the credibility revolution brought to average effect estimation: pre-specify the subgroups, use honest estimation methods like causal forests that control overfitting, validate on hold-outs, and apply multiple testing corrections. When these standards are met, heterogeneous treatment effect estimation is a powerful tool. When they are not, it is a sophisticated form of data mining.
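As one example of a multiple testing correction, the Holm step-down procedure controls the familywise error rate across a set of pre-specified subgroup tests. It is only one of several available corrections. The p-values below are hypothetical and the function name is illustrative.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break   # once one test fails, all larger p-values are retained
    return reject

# Hypothetical p-values from four pre-specified subgroup tests.
p_values = [0.010, 0.040, 0.030, 0.005]
unadjusted = [p <= 0.05 for p in p_values]
adjusted = holm_reject(p_values)
print(unadjusted, adjusted)
```

All four hypothetical subgroup effects pass an unadjusted 5% test, but only the two with the smallest p-values survive the correction.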
References
- Athey, S. and Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
- Brodeur, A., Lé, M., Sangnier, M., and Zylberberg, Y. (2016). Star wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1-32.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1), C1-C68.
- Ding, P., Feller, A., and Miratrix, L. (2019). Decomposing treatment effect variation. Journal of the American Statistical Association, 114(525), 304-317.
- Feller, A. and Gelman, A. (2015). Hierarchical models for causal effects. In Emerging Trends in the Social and Behavioral Sciences, pp. 1-16. Wiley.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
- Heckman, J. J. (1997). Instrumental variables: A study of implicit behavioral assumptions used in making program evaluations. Journal of Human Resources, 32(3), 441-462.
- Imbens, G. W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2), 373-419.
- Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.