Introduction
The average treatment effect is a remarkable conceptual achievement. It reduces a complex distribution of individual causal effects to a single summary that, under appropriate conditions, can be identified from observed data. But the ATE is also, in many policy contexts, the wrong quantity. A job training programme may raise earnings sharply for the most motivated participants while doing nothing, or worse, for others. A drug that reduces blood pressure on average might harm patients with certain genetic profiles. Policy makers who act on the ATE may be systematically targeting the wrong people.
This is not a new insight. The distinction between the ATE and the average treatment effect on the treated (ATT) was built into the potential outcomes framework from the start (Rubin (1974)). Imbens and Rubin (2015) provide a comprehensive treatment of the many estimands that arise in causal inference. What is new, in the last decade, is the practical ability to estimate the full conditional distribution of treatment effects — the CATE function \(\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]\) — using flexible machine-learning methods, and to characterise heterogeneity in panel designs in a way that is robust to the pathologies of the classical TWFE estimator.
A Taxonomy of Treatment Effect Estimands
Let \(Y_i(1)\) and \(Y_i(0)\) denote the potential outcomes for unit \(i\) under treatment and control, respectively, and let \(D_i \in \{0,1\}\) denote treatment status. The individual treatment effect is \(\tau_i = Y_i(1) - Y_i(0)\). Since at most one potential outcome is observed, \(\tau_i\) is never identified without additional structure.
The principal population-level summaries are:
\[\begin{align} \text{ATE} &= \mathbb{E}[\tau_i] = \mathbb{E}[Y_i(1) - Y_i(0)] \\ \text{ATT} &= \mathbb{E}[\tau_i \mid D_i = 1] = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1] \\ \text{LATE} &= \mathbb{E}[\tau_i \mid \text{complier}] \quad \text{(instrumental variables setting)} \\ \text{CATE}(x) &= \mathbb{E}[\tau_i \mid X_i = x] \end{align}\]

Each estimand is appropriate for different questions. The ATE answers: what is the average effect of treating a randomly drawn unit from the population? The ATT answers: among those who actually received treatment, what was the average effect? The LATE answers: among those who were induced into treatment by the instrument, what was the average effect? The CATE answers: for a unit with observed characteristics \(X_i = x\), what is the expected effect?
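The gap between these estimands is easy to see in a toy simulation (a hypothetical setup, not from any study): when units with the largest gains are more likely to take up treatment, the ATT exceeds the ATE, and the two CATE values bracket both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Single binary covariate: X = 1 "high benefit" group, X = 0 "low benefit" group.
X = rng.binomial(1, 0.5, n)
tau = np.where(X == 1, 2.0, 0.0)          # CATE(1) = 2, CATE(0) = 0
p = np.where(X == 1, 0.8, 0.2)            # high-benefit units select into treatment
D = rng.binomial(1, p)

ate = tau.mean()                           # E[tau_i]           -> 1.0
att = tau[D == 1].mean()                   # E[tau_i | D_i = 1] -> ~1.6
cate0, cate1 = tau[X == 0].mean(), tau[X == 1].mean()

print(f"ATE = {ate:.2f}, ATT = {att:.2f}")
print(f"CATE(0) = {cate0:.2f}, CATE(1) = {cate1:.2f}")
```

Because 80% of the treated are high-benefit units, the ATT (about 1.6) is well above the ATE (1.0), even though no individual effect changed.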
Angrist et al. (1996) established the IV-LATE theorem: under monotonicity and the exclusion restriction, the IV estimand equals the LATE. This result is simultaneously a clarification and a caution. IV identifies a well-defined quantity, but that quantity is local to the complier population, which may differ substantially from the population of always-takers or never-takers.
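A minimal simulation of this result (hypothetical strata proportions and effects): the Wald estimator, the reduced form divided by the first stage, recovers the complier mean effect rather than the population average, which here is pulled up by the always-takers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Principal strata: 50% compliers, 25% always-takers, 25% never-takers.
strata = rng.choice(["c", "a", "n"], size=n, p=[0.5, 0.25, 0.25])
Z = rng.binomial(1, 0.5, n)                       # randomised binary instrument
D = np.where(strata == "a", 1,
             np.where(strata == "n", 0, Z))       # monotonicity holds by construction

# Effects differ by stratum: compliers gain 1, always-takers gain 3.
tau = np.where(strata == "c", 1.0, np.where(strata == "a", 3.0, 0.0))
Y = rng.normal(0, 1, n) + D * tau

# Wald estimator: reduced form / first stage.
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
late = tau[strata == "c"].mean()
print(f"Wald = {wald:.2f}, complier mean effect = {late:.2f}")
```

Both numbers come out near 1.0, while the unconditional mean effect is 1.25: IV says nothing directly about the always-takers' larger gains.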
Why Heterogeneity Matters: The TWFE Pathology
The importance of heterogeneity was thrown into relief by the recent literature on two-way fixed effects estimators in staggered DiD settings. Goodman-Bacon (2021) showed that the TWFE estimator — the workhorse of applied panel econometrics for three decades — is a weighted average of all possible "2x2" DiD comparisons, where the weights depend on the variance of the treatment indicator within each comparison. Crucially, these weights can be negative for some comparisons. In particular, when an early-treated group serves as the control for a later-treated group, its outcomes are measured while it is already under treatment, so any evolution in its own treatment effect is differenced out as if it were a trend, and the resulting 2x2 DiD subtracts that effect rather than adding it.
de Chaisemartin and D'Haultfœuille (2020) formalise the bias explicitly. Let \(\delta_{g,t}\) denote the treatment effect for group \(g\) at time \(t\). Under heterogeneous effects, the TWFE estimator converges to a weighted sum \(\sum_{g,t} w_{g,t} \delta_{g,t}\) where some weights \(w_{g,t}\) are negative. The estimator can have the opposite sign to the true average effect.
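The sign reversal can be reproduced in a few lines. The sketch below (a stylised example, not the de Chaisemartin–D'Haultfœuille estimator) builds a noiseless staggered panel in which every group-time effect is positive, yet the early cohort's effect grows with time since treatment; the TWFE coefficient comes out negative.

```python
import numpy as np

# Balanced panel, no never-treated group, T = 3 periods.
# Early cohort treated from t = 2; late cohort treated from t = 3 only.
# Every treatment effect is positive, but the early cohort's effect
# GROWS with time since treatment (1 at event time 0, then 4).
units, T = 40, 3
early = np.arange(units) < units // 2
rows = []
for i in range(units):
    g = 2 if early[i] else 3                     # first treated period
    for t in (1, 2, 3):
        d = int(t >= g)
        tau = (4.0 if (early[i] and t == 3) else 1.0) if d else 0.0
        rows.append((i, t, d, tau))              # outcome = effect only, no noise
rows = np.array(rows)

# TWFE regression: Y on D plus unit and time dummies, via least squares.
i_idx, t_idx = rows[:, 0].astype(int), rows[:, 1].astype(int)
D, Y = rows[:, 2], rows[:, 3]
Xmat = np.column_stack([D,
                        np.eye(units)[i_idx],    # unit fixed effects
                        np.eye(T)[t_idx - 1]])   # time fixed effects
beta, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
twfe = beta[0]

true_att = Y[D == 1].mean()                      # average effect over treated cells
print(f"TWFE = {twfe:.2f}, true average effect = {true_att:.2f}")
```

With these numbers the TWFE coefficient is -0.5 while the true average effect over treated cells is +2.0: the early cohort's growing effect enters some 2x2 comparisons with a negative weight.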
The solution proposed by Callaway and Sant'Anna (2021) is to define estimands at the level of group-time pairs: \(ATT(g,t) = \mathbb{E}[Y_t(g) - Y_t(0) \mid G_i = g]\), where \(G_i = g\) means unit \(i\) was first treated in period \(g\). These group-time ATTs can be aggregated in different ways depending on the policy question: simple averages across groups and times, event-study plots showing how effects evolve with time since treatment, or calendar-time averages. The key point is that heterogeneity in treatment effects is no longer an embarrassment to be smoothed over — it is a source of information to be structured.
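A sketch of the group-time building block (an illustrative simulation, not the full Callaway–Sant'Anna estimator, which also handles covariates and not-yet-treated comparisons): under parallel trends, each \(ATT(g,t)\) is a simple 2x2 DiD of cohort \(g\) against the never-treated, anchored at the pre-treatment period \(g-1\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Cohorts first treated in period 2 or 3, plus a never-treated group (key 0).
# Outcomes: unit effects + common time trend + cohort-time effects delta[g][t].
cohorts = {2: 300, 3: 300, 0: 300}
delta = {2: {2: 1.0, 3: 2.0}, 3: {3: 0.5}}     # true ATT(g, t)
periods = [1, 2, 3]

Y = {}   # Y[(g, t)] = outcomes for cohort g at time t (same units across t)
for g, n in cohorts.items():
    alpha = rng.normal(0, 1, n)                # unit fixed effects
    for t in periods:
        tau = delta.get(g, {}).get(t, 0.0)
        Y[(g, t)] = alpha + 0.3 * t + tau + rng.normal(0, 0.1, n)

def att_gt(g, t):
    """ATT(g, t): DiD of cohort g vs. the never-treated, from period g-1 to t."""
    return ((Y[(g, t)].mean() - Y[(g, g - 1)].mean())
            - (Y[(0, t)].mean() - Y[(0, g - 1)].mean()))

for g, t in [(2, 2), (2, 3), (3, 3)]:
    print(f"ATT({g},{t}) = {att_gt(g, t):.2f}  (true {delta[g][t]})")
```

Because the comparison group is never treated, no cell is ever differenced while under treatment, so the negative-weight pathology cannot arise.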
Conditional Average Treatment Effects and Machine Learning
The CATE function \(\tau(x)\) is in general a high-dimensional nonparametric object. Estimating it directly poses two challenges: the curse of dimensionality and the bias-variance trade-off. The literature on CATE estimation with machine learning has developed several approaches to address these challenges.
Causal Forests
Wager and Athey (2018) proposed the causal forest estimator, an adaptation of random forests to the causal inference setting. The key idea is to modify the splitting criterion of the random forest so that it directly maximises heterogeneity in treatment effects across leaves, rather than minimising prediction error for the outcome. The result is a consistent, pointwise asymptotically normal estimator of \(\tau(x)\), with honest confidence intervals.
The honesty property (Athey and Imbens (2016)) is important: the subsample used to choose the splits is disjoint from the subsample used to estimate the effects within the leaves. Without honesty, standard confidence intervals are invalid because the tree structure itself was chosen to maximise apparent heterogeneity.
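Honesty can be illustrated with a single split rather than a full forest (an illustrative sketch, not the causal forest algorithm): one half of the sample chooses the cut point that maximises effect heterogeneity, and the other half estimates the within-leaf effects, so the chosen split cannot overfit the estimation sample's noise.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
X = rng.uniform(-1, 1, n)
D = rng.binomial(1, 0.5, n)
tau = np.where(X > 0.3, 2.0, 0.0)             # true CATE: a single step at x = 0.3
Y = D * tau + rng.normal(0, 1, n)

# Honest split: one half picks the cut, the other half estimates leaf effects.
half = n // 2
Xs, Ds, Ys = X[:half], D[:half], Y[:half]     # splitting sample
Xe, De, Ye = X[half:], D[half:], Y[half:]     # estimation sample

def leaf_effect(mask, d, y):
    """Difference in mean outcomes, treated minus control, within a leaf."""
    return y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()

# Choose the cut on the splitting sample: maximise squared effect difference.
grid = np.linspace(-0.9, 0.9, 37)
scores = [(leaf_effect(Xs <= c, Ds, Ys) - leaf_effect(Xs > c, Ds, Ys)) ** 2
          for c in grid]
cut = grid[int(np.argmax(scores))]

# Estimate leaf effects on the held-out estimation sample.
left = leaf_effect(Xe <= cut, De, Ye)
right = leaf_effect(Xe > cut, De, Ye)
print(f"split at {cut:.2f}; effect left = {left:.2f}, right = {right:.2f}")
```

Because `left` and `right` come from data the cut never saw, their standard errors have the usual interpretation; estimating them on the splitting sample would bias the apparent heterogeneity upward.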
Double Machine Learning for HTE
Chernozhukov et al. (2018) provide a general framework for estimating treatment effects in the presence of high-dimensional controls. The key innovation is the use of Neyman-orthogonal moment conditions, which render the estimator insensitive to first-order errors in estimating nuisance functions (the conditional mean of the outcome and of treatment). Combined with cross-fitting, the DML estimator achieves \(\sqrt{n}\)-consistency for the parameter of interest even when the nuisance functions are estimated at slower rates.
For CATE estimation, the R-learner of Nie and Wager (2021) adapts the DML insight: after partialling out the main effects of controls using a machine learner, the residualised treatment indicator and outcome residuals are used to estimate \(\tau(x)\) as the solution to a weighted regression problem. This allows any flexible learner to be plugged in for the final CATE step.
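A bare-bones R-learner sketch under simplifying assumptions (randomised treatment, polynomial nuisance learners, and a final stage linear in \(x\); all names below are illustrative): cross-fit the nuisances, then regress outcome residuals on treatment residuals interacted with a basis for \(\tau(x)\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
X = rng.uniform(-1, 1, n)
D = rng.binomial(1, 0.5, n)                # randomised, so e(x) = 0.5
tau = 1.0 + X                              # true CATE, linear in x
Y = X**2 + D * tau + rng.normal(0, 1, n)

def poly(x, deg=3):
    """Polynomial basis [1, x, ..., x^deg]."""
    return np.column_stack([x**k for k in range(deg + 1)])

# Cross-fitting: nuisances for each half are fit on the other half.
m_hat, e_hat = np.empty(n), np.empty(n)
fold = np.arange(n) < n // 2
for fit, pred in [(fold, ~fold), (~fold, fold)]:
    bm, *_ = np.linalg.lstsq(poly(X[fit]), Y[fit], rcond=None)   # m(x) = E[Y|X]
    be, *_ = np.linalg.lstsq(poly(X[fit]), D[fit], rcond=None)   # e(x) = E[D|X]
    m_hat[pred] = poly(X[pred]) @ bm
    e_hat[pred] = poly(X[pred]) @ be

# R-learner final stage: (Y - m_hat) ~ tau(x) * (D - e_hat),
# with tau(x) = a + b*x, solved by ordinary least squares.
Z = (D - e_hat)[:, None] * poly(X, deg=1)
coef, *_ = np.linalg.lstsq(Z, Y - m_hat, rcond=None)
print(f"estimated tau(x) = {coef[0]:.2f} + {coef[1]:.2f} x  (true: 1 + x)")
```

The quadratic baseline \(x^2\) never contaminates the CATE estimate because it is absorbed by \(\hat{m}(x)\) before the final stage; in practice the polynomial learners would be replaced by any flexible method.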
Aggregation and Inference
Estimating a CATE function is only the first step. Policy makers often need to know whether heterogeneity is statistically meaningful, and to characterise it in interpretable ways. Chernozhukov, Demirer, Duflo, and Fernández-Val (2018) propose a test for whether the CATE is genuinely heterogeneous (i.e., not constant) using the "best linear predictor" (BLP) of the true CATE given a machine-learning proxy \(\hat{\tau}(X_i)\) computed on a held-out sample. If the BLP slope coefficient on the demeaned proxy \(\hat{\tau}(X_i)\) is significantly different from zero, there is evidence of genuine heterogeneity.
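A stylised version of the BLP regression in a randomised experiment (hypothetical data; the proxy `S` stands in for a CATE estimate from a held-out split): the slope on the interaction of the demeaned treatment with the demeaned proxy recovers \(\mathrm{Cov}(\tau, S)/\mathrm{Var}(S)\), and its t-statistic tests for heterogeneity.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
X = rng.uniform(-1, 1, n)
D = rng.binomial(1, 0.5, n)                  # randomised experiment, p = 0.5
tau = 1.0 + 2.0 * X                          # genuinely heterogeneous CATE
Y = D * tau + rng.normal(0, 1, n)

# Stand-in for a held-out ML proxy of the CATE (correlated with tau, plus noise).
S = 0.9 * X + rng.normal(0, 0.2, n)

# BLP regression: Y on (D - p) and (D - p)*(S - mean(S)); the interaction
# slope is the best-linear-predictor loading of tau on the proxy.
Z = np.column_stack([np.ones(n), D - 0.5, (D - 0.5) * (S - S.mean())])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
resid = Y - Z @ beta
se = np.sqrt(np.diag(np.linalg.inv(Z.T @ Z)) * resid.var())
print(f"heterogeneity slope = {beta[2]:.2f} (t = {beta[2] / se[2]:.1f})")
```

If the CATE were constant, the interaction slope would converge to zero regardless of how good the proxy is; a significant slope is evidence both of heterogeneity and that the proxy tracks it.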
Heterogeneity in Policy Contexts
The shift toward HTE estimation has been driven partly by substantive policy questions that the ATE simply cannot answer.
Targeting. If a social programme has heterogeneous effects, policymakers can improve welfare by targeting it at individuals with high CATE values. This requires not just estimating the CATE but ranking individuals by predicted treatment benefit, and evaluating whether that ranking has policy value (Kitagawa and Tetenov (2018)).
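A toy illustration of why the ranking matters (hypothetical effects and a noisy CATE estimate, not the Kitagawa–Tetenov procedure): when effects are negative for some units, treating only the top of the predicted-benefit ranking dominates treating everyone.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
X = rng.uniform(-1, 1, n)
tau = 2.0 * X                              # units with X < 0 are harmed by treatment
tau_hat = tau + rng.normal(0, 0.3, n)      # noisy CATE estimates from some learner

# Treat the top 50% by predicted benefit vs. treating everyone.
budget = n // 2
top = np.argsort(tau_hat)[::-1][:budget]
gain_targeted = tau[top].sum() / n          # per-capita welfare gain, targeted
gain_everyone = tau.sum() / n               # per-capita welfare gain, treat all
print(f"targeted: {gain_targeted:.2f}, treat-everyone: {gain_everyone:.2f}")
```

Treating everyone nets roughly zero because gains and harms cancel, while targeting by the (imperfect) CATE ranking captures most of the available welfare gain.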
External validity. The ATE in a given study population may differ from the ATE in the target population to which a policy is to be applied (Angrist and Pischke (2009)). Understanding the CATE and how it relates to individual characteristics allows researchers to assess and, under assumptions, correct for this discrepancy.
Mechanism analysis. Heterogeneity along observable dimensions can provide evidence for or against proposed mechanisms. If an education intervention raises test scores more for children from disadvantaged backgrounds, this is consistent with mechanisms based on access constraints and less consistent with mechanisms based on genetic potential.
The DiD Setting Revisited
In the staggered DiD setting, the group-time ATTs estimated by Callaway and Sant'Anna (2021) can themselves be heterogeneous across groups and time periods. An event-study plot — which plots \(ATT(g, g+k)\) as a function of the number of periods \(k\) since treatment — reveals the dynamic structure of treatment effects: do they grow over time, decay, or remain constant? This dynamic pattern is informative about mechanisms and about the plausibility of long-run policy extrapolation.
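Aggregating group-time ATTs into an event-study profile is mechanical once the \(ATT(g,t)\) are in hand. A minimal sketch (the ATT values and cohort sizes below are made up for illustration):

```python
import numpy as np

# Hypothetical pre-estimated group-time ATTs for cohorts first treated in
# periods 2 and 4, observed through period 6.
att = {(2, 2): 0.5, (2, 3): 1.0, (2, 4): 1.5, (2, 5): 1.8, (2, 6): 2.0,
       (4, 4): 0.4, (4, 5): 0.9, (4, 6): 1.3}
size = {2: 600, 4: 400}                      # cohort sizes, used as weights

def event_study(k):
    """Average ATT at event time k, weighted by cohort size."""
    cells = [(g, g + k) for g in size if (g, g + k) in att]
    w = np.array([size[g] for g, _ in cells], dtype=float)
    vals = np.array([att[c] for c in cells])
    return (w / w.sum()) @ vals

for k in range(3):
    print(f"event time {k}: ATT = {event_study(k):.2f}")
```

Here the profile rises with event time, the kind of dynamic pattern that a single pooled coefficient would obscure; other aggregations (overall, calendar-time) are just different weightings of the same `att` dictionary.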
Roth et al. (2023) provide a comprehensive review of modern DiD methods, emphasising that the choice of estimand — which aggregation of group-time ATTs to report — is a substantive decision that should be guided by the policy question, not by statistical convention.
Conclusion
The move from average treatment effects to heterogeneous treatment effects is one of the most significant methodological developments in empirical economics since the credibility revolution. It is driven by three forces: the recognition that the TWFE estimator conflates effects in ways that can produce misleading averages; the practical availability of machine-learning tools that can estimate the CATE flexibly; and the policy demand for targeted interventions. The conceptual shift is equally important: it asks researchers to think carefully about for whom a treatment works, not just whether it works on average. This is a harder question, but it is closer to the question that policy actually requires.
References
- Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444--455.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Athey, S. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353--7360.
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1--C68.
- Chernozhukov, V., Demirer, M., Duflo, E., and Fernández-Val, I. (2018). Generic machine learning inference on heterogeneous treatment effects in randomized experiments. NBER Working Paper No. 24678.
- de Chaisemartin, C. and D'Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964--2996.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254--277.
- Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge.
- Kitagawa, T. and Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591--616.
- Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299--319.
- Roth, J., Sant'Anna, P. H. C., Bilinski, A., and Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2):2218--2244.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688--701.
- Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175--199.
- Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228--1242.