Debates & Controversies

Pooling vs. Stratification: Should Policy Evaluations Report Subgroup Effects?

1 The Tension

Every policy evaluation faces a design choice: report a single average treatment effect (ATE) for the whole sample, or disaggregate into subgroup effects by gender, age, income, geography, or other characteristics? The answer seems obvious—policymakers want to know who benefits, not just the average. But the statistical case is more complicated.

Subgroup analyses multiply the number of hypothesis tests, increasing the chance that at least one subgroup effect will appear significant by chance. They reduce the effective sample size for each subgroup, increasing estimation uncertainty. And when subgroup effects are not pre-specified, the choice of which subgroups to report can itself be a form of specification searching. The debate between pooling and stratification is really a debate between two types of error: failing to detect genuine heterogeneity (by pooling) versus mistaking noise for signal (by over-stratifying).

This article steelmans both sides.

2 The Case for Reporting Subgroup Effects

2.1 Treatment Effects Are Almost Always Heterogeneous

The dominant view in modern causal inference is that homogeneous treatment effects are the exception, not the rule. From the potential outcomes perspective, the ATE is E[τᵢ] where τᵢ = Yᵢ(1) – Yᵢ(0) is the individual treatment effect. There is no reason to expect τᵢ to be the same for a poor rural farmer and a wealthy urban professional, for a teenager and a retiree, for a man and a woman.

The development of causal machine learning methods—causal forests [Wager and Athey, 2018], double ML for heterogeneous effects [Chernozhukov et al., 2018]—is explicitly motivated by the view that the ATE is an incomplete summary. A policy that averages to a zero ATE may still be welfare-improving for some subgroups and welfare-reducing for others: ignoring this heterogeneity leads to misguided uniform policies.
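A small simulation can make the zero-ATE point concrete. The sketch below is purely illustrative: the two subgroups, their assumed true effects of +1 and -1, the sample size, and the function names are all hypothetical choices made for the example, not taken from any study cited here.

```python
import random
from statistics import mean

def diff_in_means(rows):
    """Difference in mean outcome between treated and control units."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return mean(treated) - mean(control)

def simulate_offsetting_effects(n_per_group=4000, seed=0):
    """Simulate a randomised trial with two equally sized subgroups whose
    true effects (+1 and -1, assumed for illustration) cancel on average."""
    rng = random.Random(seed)
    true_effect = {"A": 1.0, "B": -1.0}
    rows = []
    for group, tau in true_effect.items():
        for _ in range(n_per_group):
            d = 1 if rng.random() < 0.5 else 0   # coin-flip assignment
            y = tau * d + rng.gauss(0.0, 1.0)    # outcome with unit-variance noise
            rows.append((group, d, y))
    pooled = diff_in_means(rows)
    by_group = {g: diff_in_means([r for r in rows if r[0] == g])
                for g in true_effect}
    return pooled, by_group

pooled_ate, subgroup_ates = simulate_offsetting_effects()
print(f"pooled ATE estimate: {pooled_ate:.2f}")
for g, est in subgroup_ates.items():
    print(f"subgroup {g} estimate: {est:.2f}")
```

The pooled estimate sits close to zero while each subgroup estimate sits close to its own true effect, which is the situation in which a uniform policy based on the ATE alone would mislead.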

2.2 Policymakers Cannot Use the ATE Directly

A policymaker deciding whether to expand a job-training programme to a specific target population—say, unemployed workers aged 25-40 in manufacturing-dependent counties—cannot directly apply an ATE estimated on a broader population. They need an effect estimate for the relevant subgroup. Providing only the ATE forces the policymaker to implicitly assume homogeneity in exactly the dimension that matters for their decision.

Imai and Ratkovic [2013] formalise this point: the relevant policy parameter is not the ATE but the "conditional average treatment effect" (CATE) E[τᵢ | Xᵢ = x] for the target population's characteristics x. Pooled estimates identify this only under homogeneity, which is typically not credible.

2.3 Subgroup Effects Can Reveal Mechanisms

Beyond direct policy relevance, heterogeneity by observed characteristics can shed light on mechanisms. If a training programme has a large effect for workers with less than a high school education but zero effect for those with more education, this pattern is consistent with a skill-complementarity mechanism and inconsistent with a pure signalling story. Pooling obscures this mechanistic insight.

Ding et al. [2019] show that the variance of τᵢ can be bounded using observed subgroup ATTs, providing a way to quantify how much treatment effect heterogeneity exists and which observed covariates explain it. This decomposition is only possible by stratifying.

3 The Case for Caution About Subgroup Analyses

3.1 Multiple Testing and False Discovery

The core statistical problem with subgroup analysis is multiplicity. If a trial pre-specifies K subgroup comparisons and all K null hypotheses are true, the probability of at least one false positive at level α is 1 – (1 – α)ᴷ when the tests are independent. For K = 10 and α = 0.05, this is just over 40%.
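The family-wise error rate is simple to tabulate. The following minimal sketch assumes independent tests with every null hypothesis true; the function name `familywise_error_rate` is introduced here for illustration only.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k independent
    tests at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k

# Tabulate how quickly the error rate grows with the number of subgroups.
for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
```

With correlated subgroup tests, as when subgroups overlap, the true rate is typically lower than this formula suggests, so the figure is best read as a benchmark under independence.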

The problem is compounded when subgroup definitions are chosen after looking at the data. The practice is common enough that medical regulators have issued guidance treating unplanned subgroup analyses as exploratory rather than confirmatory. In economics, where pre-analysis plans are less universal, the scope for post-hoc subgroup discovery is wide.

Feller and Gelman [2015] document that in published economic evaluations, reported subgroup effects are typically larger than overall average effects, a pattern consistent with selective reporting of the most striking heterogeneity. On its own the pattern is ambiguous: if effects were genuinely large in some subgroups and small in others, the overall average would also mask the large effects. But the recurring sequence of "we found a big effect in women", no significant effect in men, and a null overall average suggests the subgroup finding may reflect noise.

3.2 Power Collapse in Subgroups

When the total sample is powered to detect an effect of size δ with N observations, a 50/50 stratification on a binary characteristic halves the sample to N/2 in each subgroup. If the subgroup effect is the same as the overall effect (δ_subgroup = δ), power drops from the designed level (say 80%) to approximately:

\[
\text{Power}_{\text{subgroup}} = \Phi\left( \frac{\delta}{\sigma\sqrt{4/(N/2)}} - z_{0.975} \right) \tag{1}
\]

which for N = 200 and δ = 0.3 SD gives power ≈ 32%, against roughly 56% for the full sample under the same formula. Both figures are well below adequate levels. A non-significant subgroup effect is then nearly uninformative: it is consistent with both no heterogeneity and genuine subgroup differences that the study was not powered to detect.
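Equation (1) is easy to evaluate numerically. The following is a minimal sketch using only the Python standard library; the function name `approx_power` and the choice σ = 1 (so that δ is measured in SD units) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided test of a difference in means with
    n units split equally between arms: Phi(delta / SE - z_{1 - alpha/2}),
    where SE = sigma * sqrt(4 / n). The far rejection tail is ignored."""
    std_normal = NormalDist()
    se = sigma * sqrt(4.0 / n)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(delta / se - z_crit)

N = 200
full_sample = approx_power(delta=0.3, sigma=1.0, n=N)
half_sample = approx_power(delta=0.3, sigma=1.0, n=N // 2)  # one 50/50 subgroup
print(f"full sample (N={N}): {full_sample:.2f}")
print(f"one subgroup (N/2={N // 2}): {half_sample:.2f}")
```

Passing `n=N // 2` reproduces the N/2 term in the denominator of equation (1), so the two printed values show directly how much power is lost by analysing a single subgroup.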

3.3 The Multiple Comparisons Problem in Practice

List et al. [2019] conduct a meta-analysis of economic field experiments and find that subgroup effects are frequently reported but rarely corrected for multiple comparisons. They document that the "hit rate" for subgroup claims—the fraction that survive correction or external validation—is substantially lower than for the primary outcome. This is the empirical footprint of the multiple testing problem.

4 A Framework for Navigating the Tension

The debate does not have a winner; both concerns are legitimate. A principled approach balances them through several complementary strategies:

4.1 Pre-specify Subgroups

The most important protection against false discovery in subgroup analysis is pre-specification: the subgroups to be analysed, the hypotheses to be tested, and the multiple-comparisons correction to be applied should all be declared before the analysis begins. Pre-specified subgroup analyses should be reported and interpreted differently from exploratory post-hoc analyses: the latter are hypothesis-generating, not confirmatory.

4.2 Apply Multiple Testing Corrections

For a set of pre-specified subgroup comparisons, apply corrections that control the family-wise error rate (FWER) or false discovery rate (FDR). The Bonferroni correction (α/K) controls FWER conservatively; the Benjamini-Hochberg procedure controls FDR at a specified level when the test statistics are independent or positively dependent. Romano and Wolf [2005] provide stepwise procedures with better power than Bonferroni while still controlling FWER.
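The two simplest corrections can be sketched in a few lines. The code below is an illustrative implementation of the Bonferroni rule and the Benjamini-Hochberg step-up procedure, not a substitute for a tested library routine; the function names and the five p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_k when p_k <= alpha / K (controls the FWER)."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank i with
    p_(i) <= i * q / K and reject the i smallest p-values."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / k:
            cutoff_rank = rank
    reject = [False] * k
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Hypothetical p-values for five pre-specified subgroup contrasts.
p_vals = [0.004, 0.019, 0.028, 0.041, 0.60]
bonf = bonferroni_reject(p_vals)
bh = benjamini_hochberg_reject(p_vals)
print("Bonferroni:", bonf)
print("Benjamini-Hochberg:", bh)
```

On these made-up p-values Bonferroni rejects only the smallest, while Benjamini-Hochberg rejects the three smallest, which illustrates the trade-off between strict FWER control and the more permissive FDR criterion.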

4.3 Use Omnibus Tests Before Subgroup Dives

Before examining individual subgroup effects, test the joint null hypothesis of treatment effect homogeneity: H₀: τ₁ = τ₂ = ... = τₖ. Ding et al. [2019] provide a formal test of homogeneity based on the variance of subgroup ATTs. If homogeneity is not rejected, subgroup effects should be interpreted cautiously, bearing in mind that a non-rejection may also reflect low power. If it is rejected, there is evidence of heterogeneity worth characterising.

4.4 Use Causal Machine Learning for Data-Driven Heterogeneity

Causal forests [Wager and Athey, 2018] and the generalised random forest (GRF) framework [Athey and Wager, 2019] provide data-driven tools for discovering treatment effect heterogeneity without pre-specifying which covariates drive it. These methods use sample splitting to avoid overfitting: in an "honest" forest, one part of each subsample is used to choose the splits and a separate part to estimate effects within the leaves, and summaries such as the best linear projection (BLP) of CATEs are built from out-of-bag predictions rather than in-sample fits.

Importantly, the GRF framework provides valid inference for heterogeneity tests: test_calibration() from the grf package tests the null of no heterogeneity against the data-driven alternative identified by the forest. This replaces the ad hoc process of trying multiple subgroup cuts.

4.5 Distinguish Confirmatory from Exploratory

The field of medicine has developed clear norms for distinguishing primary, secondary, and exploratory analyses in clinical trial reports. Economics has been slower to adopt similar norms. A practical recommendation: in papers reporting an RCT or a natural experiment with a primary estimand, clearly label which subgroup analyses were pre-specified and which were exploratory. Exploratory analyses are valuable because they generate hypotheses for future confirmatory studies, but they should not be presented with the same inferential weight as pre-specified analyses.

5 A Middle Ground: The Targeted Maximum Likelihood (TMLE) Approach

For researchers who want to report both a pooled estimate and meaningful subgroup estimates within a single framework, targeted maximum likelihood estimation (TMLE) combined with super learning offers a principled approach. TMLE provides doubly robust estimates of the ATE that, under regularity conditions, attain the semiparametric efficiency bound. Subgroup TMLEs use the same framework, with honest uncertainty quantification that appropriately reflects the reduced sample size.

Luedtke and van der Laan [2016] develop optimal individualised treatment rules within the TMLE framework, providing subgroup-level recommendations that maximise expected outcomes for identified groups—a direct response to the policymaker's need for targeted guidance without the multiple testing pitfalls of conventional subgroup analysis.

6 Conclusion

The tension between pooling and stratification is not resolved by a single recommendation. Subgroup analyses are scientifically important and policy-relevant, but are vulnerable to false discovery and misinterpretation when not handled carefully.

The resolution lies in rigorous pre-specification, appropriate multiple comparisons corrections, omnibus tests of homogeneity, and the use of causal machine learning methods that combine heterogeneity discovery with valid inference. Studies that discipline their subgroup analyses through these tools can provide both a credible primary estimate and credible statements about who benefits—the combination that policymakers actually need.

References

  1. Athey, S. and Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational Studies, 5(2):37-51.
  2. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1):C1-C68.
  3. Ding, P., Feller, A., and Miratrix, L. (2019). Decomposing treatment effect variation. Journal of the American Statistical Association, 114(525):304-317.
  4. Feller, A. and Gelman, A. (2015). Hierarchical models for causal effects. In Emerging Trends in the Social and Behavioral Sciences, pp. 1-16.
  5. Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. Annals of Applied Statistics, 7(1):443-470.
  6. List, J. A., Shaikh, A. M., and Xu, Y. (2019). Multiple hypothesis testing in experimental economics. Experimental Economics, 22(4):773-793.
  7. Luedtke, A. R. and van der Laan, M. J. (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713-742.
  8. Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237-1282.
  9. Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228-1242.
