The Causal Review

1 The Question

One of the most important and contested questions in applied economics is whether short-run interventions have lasting effects on individuals and communities. A job training programme lasts six months; does it raise earnings fifteen years later? A class of kindergarteners is taught by a high-quality teacher for one year; do those students earn more at age 30? A one-time cash transfer reaches households during a drought; do children in those households have better health and cognitive outcomes as adults?

The policy stakes are enormous. If short-run interventions have lasting effects, their cost-effectiveness improves dramatically once long-run benefits are discounted back to the present and counted against programme costs. If effects fade, the case for intensive early interventions is weaker than often claimed. Both sides of this debate can point to credible evidence. The question is not whether long-run effects ever exist (they clearly do in some cases) but whether they can be systematically trusted, what mechanisms sustain them, and whether the causal identification behind long-run estimates is as strong as the identification behind short-run estimates.
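To make the discounting point concrete, here is a minimal sketch of the present-value arithmetic. All numbers (programme cost, annual earnings gain, discount rate, timing) are hypothetical and chosen only for illustration; `present_value` is a helper defined here, not a function from any library.

```python
def present_value(annual_gain, start_year, end_year, discount_rate):
    """Present value at year 0 of a constant annual gain received in
    years start_year..end_year (inclusive)."""
    return sum(
        annual_gain / (1.0 + discount_rate) ** t
        for t in range(start_year, end_year + 1)
    )


# Hypothetical inputs, for illustration only.
cost_per_child = 1_000.0   # one-off programme cost paid at year 0
annual_gain = 300.0        # assumed earnings gain per year while it lasts
rate = 0.03                # assumed real discount rate

# Scenario A: the gain persists over a 40-year career that starts 15 years out.
pv_persistent = present_value(annual_gain, 15, 54, rate)
# Scenario B: the gain is only present in the first 3 years after the programme.
pv_fading = present_value(annual_gain, 1, 3, rate)

for label, pv in (("persistent", pv_persistent), ("fading", pv_fading)):
    print(f"{label:>10}: PV = {pv:,.0f}, benefit/cost = {pv / cost_per_child:.2f}")
```

Under these assumed inputs the same programme passes a benefit-cost test only if the gain persists, which is why the persistence question carries so much policy weight.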

2 The Case For: Long-Run Effects Are Real and Large

2.1 The STAR Experiment: Fade-Out in Scores, Persistence in Earnings

The Tennessee Project STAR experiment randomly assigned students to smaller or larger kindergarten classes. Krueger [1999] found that smaller classes raised test scores during the treatment years. However, test score gains faded substantially after students returned to normal class sizes, a pattern often cited as evidence that the intervention had limited lasting value.

Chetty et al. [2011] revisited STAR using IRS administrative data, tracking students 20 years later. They found that kindergarten class quality, as measured by peer group and teacher quality, significantly predicted earnings at age 27, college attendance, and neighbourhood quality in adulthood, even after controlling for test score fade-out. The mechanism they propose is non-cognitive skills: smaller classes may develop persistence, social skills, and habits that fade from test score measures but persist in adult outcomes. This finding has been influential in arguing that fade-out in test scores does not imply fade-out in long-run outcomes. The two may be measuring different things.

2.2 Teacher Value-Added and Long-Run Earnings

Chetty et al. [2014] similarly find that high-value-added teachers raise earnings and college attendance at age 28, even as test score effects fade. The transmission mechanism is again hypothesised to involve non-cognitive development.

2.3 Deworming and Human Capital

Bleakley [2007] finds that hookworm eradication in the early-twentieth-century American South raised school attendance and adult occupational incomes substantially. Similarly, long-run follow-ups of the Kenya primary school deworming experiment of Miguel and Kremer [2004] report large effects on labour market outcomes for dewormed children, with treated individuals earning 13% more and working more hours nearly 20 years after the original intervention.

3 The Case Against: Reasons for Scepticism

3.1 Fade-Out Is the Norm in Education

The fade-out of short-run test score gains is not an exception; it is the rule. Bailey et al. [2020] conduct a comprehensive meta-analysis of early childhood programmes and find systematic fade-out: gains in cognitive measures typically halve within two to four years after programme exit. The few programmes that show persistent test score effects have unusually intensive dosage or target unusually disadvantaged populations. If non-cognitive skills were reliably transmitted, we would expect some proxy evidence of their persistence (in grades, attendance, or disciplinary incidents) in the years immediately following the intervention. The evidence is thin.
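As a rough illustration of what halving within two to four years implies further out, the sketch below assumes the gain decays exponentially with a constant half-life. That functional form and the 0.20 SD starting gain are assumptions made here for the arithmetic, not results from the cited literature.

```python
def remaining_share(years_since_exit, half_life_years):
    """Share of the initial gain left after years_since_exit, assuming
    exponential decay with the given half-life (an illustrative assumption)."""
    return 0.5 ** (years_since_exit / half_life_years)


initial_gain_sd = 0.20   # hypothetical end-of-programme gain, in SD units

for half_life in (2.0, 4.0):
    left = remaining_share(10.0, half_life)
    print(
        f"half-life {half_life:.0f} years: {left:.1%} of the gain "
        f"({initial_gain_sd * left:.3f} SD) remains after 10 years"
    )
```

If decay really were this regular, very little of a test score gain would be measurable a decade later, so any long-run effect on adult outcomes would have to run through a channel that test scores do not capture.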

3.2 Confounding from Secular Trends

Long-run estimates come from cohorts observed decades after the intervention. The identifying variation (class assignment in kindergarten, deworming in Kenya in 1998, teacher assignment in New York City in 1990) is clean for the short-run comparison. But by the time outcomes are measured at ages 25-30, dozens of life events have intervened: further schooling, family formation, migration, macroeconomic fluctuations. Even if the original random assignment was valid, the long-run estimate captures a reduced-form effect that operates through many of these intermediate events rather than the effect of the original treatment alone. This is the fundamental identification problem for long-run estimates: the assignment that identifies the short-run effect may not "carry through" cleanly to the long-run outcome if intermediate mediators are also affected by other, unrelated factors.

3.3 Mortality Selection in Health Interventions

For health interventions, long-run estimates face a specific hazard: differential mortality. If the treated group is healthier, a larger fraction survives to the age at which outcomes are measured. The surviving treated group is not a random subsample of the original treated group; it is a positively selected group. Because survival rates differ between arms, the degree of selection differs too, so survivors in the two arms are no longer comparable. Comparing long-run earnings of survivors in the treatment versus control group thus conflates the causal effect of the treatment with a selection effect. Deaton [2010] and related critics argue that this survivorship selection is often unaddressed in long-run health studies. The problem is sharpest for early-life health interventions (malaria, nutrition) where the treated group has substantially lower mortality, which are exactly the interventions that produce the most dramatic long-run estimates.
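A small simulation can show how a survivor comparison goes wrong even under perfect randomisation. This is a sketch under stated assumptions: the data-generating process, the survival thresholds, and the function name `simulate_survivor_gap` are all hypothetical. Treatment is assumed to have no effect on earnings at all, and to work only by letting frailer individuals survive.

```python
import random
import statistics


def simulate_survivor_gap(n=100_000, seed=0):
    """Return (survival rate, mean survivor earnings) per arm (0=control, 1=treated)."""
    rng = random.Random(seed)
    assigned = {0: 0, 1: 0}
    survivor_earnings = {0: [], 1: []}
    for _ in range(n):
        arm = 1 if rng.random() < 0.5 else 0     # random assignment
        health = rng.gauss(0.0, 1.0)             # latent robustness
        # Assumption: earnings depend on latent health but NOT on treatment,
        # so the true causal effect on earnings is exactly zero.
        earnings = 10.0 + 2.0 * health + rng.gauss(0.0, 1.0)
        # Assumption: treatment lowers the health level needed to survive.
        threshold = -1.5 if arm == 1 else -0.5
        assigned[arm] += 1
        if health > threshold:
            survivor_earnings[arm].append(earnings)
    survival = {a: len(survivor_earnings[a]) / assigned[a] for a in (0, 1)}
    means = {a: statistics.fmean(survivor_earnings[a]) for a in (0, 1)}
    return survival, means


survival, means = simulate_survivor_gap()
naive_gap = means[1] - means[0]
print(f"survival: control {survival[0]:.2f}, treated {survival[1]:.2f}")
print(f"naive survivor earnings gap: {naive_gap:.2f} (true effect is 0)")
```

In this particular setup the naive gap is negative, because the treated survivors include frailer people who would not have survived in the control arm. Other assumptions about who is saved, or about a direct effect of health on earnings, could produce a bias in the opposite direction; the point is only that the survivor comparison no longer isolates the causal effect.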

3.4 Statistical Power and Multiple Testing

Long-run follow-up studies typically have smaller samples than the original study due to attrition, administrative linkage failures, or resource constraints. With smaller samples, any given estimate has lower power. In this context, statistically significant long-run effects are subject to severe publication bias: only the large, positive estimates are published, while null or negative results remain in file drawers. Deaton [2010] and others have noted that the ratio of standard errors to estimated effects often looks suspicious in long-run studies: effects are "just" statistically significant with the available sample, consistent with selective reporting.
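The interaction between low power and selective publication can be sketched with a simulation. The true effect, the standard error, and the publication rule below are hypothetical values picked to represent an underpowered follow-up; they are not taken from any study discussed here.

```python
import random
import statistics

TRUE_EFFECT = 0.10    # hypothetical true long-run effect
STD_ERROR = 0.10      # hypothetical standard error of each follow-up study
N_STUDIES = 20_000    # number of simulated follow-up studies

rng = random.Random(1)
estimates = [rng.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(N_STUDIES)]

# Assumed publication rule: only positive estimates with t > 1.96 appear in print.
published = [b for b in estimates if b / STD_ERROR > 1.96]

share_published = len(published) / N_STUDIES
mean_all = statistics.fmean(estimates)
mean_published = statistics.fmean(published)

print(f"share of studies published: {share_published:.2f}")
print(f"mean estimate, all studies: {mean_all:.3f}")
print(f"mean estimate, published:   {mean_published:.3f}")
print(f"exaggeration ratio:         {mean_published / TRUE_EFFECT:.1f}x")
```

Every published estimate in this sketch must exceed 1.96 standard errors, so with a standard error as large as the true effect the published literature overstates the effect by construction, and the published t-statistics cluster just above the significance threshold.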

4 The Worm Wars: A Case Study in the Debate

The deworming literature illustrates the controversy vividly. Miguel and Kremer [2004] reported that a Kenyan school deworming programme improved health and school participation, and later follow-ups of the same experiment found substantial long-run effects on labour market outcomes. A re-analysis by Aiken et al. [2015] and associated Cochrane reviews found no robust short-run effects on cognition or school participation. Defenders of the original study [Hicks et al., 2015] argued the re-analysis made coding errors and misspecified the model. The debate highlighted that long-run causal inference is particularly sensitive to:

The quality of the original assignment mechanism (was randomisation truly maintained over time?).

Attrition and administrative linkage of the follow-up sample.

Pre-specification of the outcome variables (were the significant outcomes among many tested?).
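The pre-specification concern in the last item can be quantified with a short sketch. It assumes the outcome tests are independent, which real outcomes rarely are, and uses Holm's step-down procedure as one standard correction; the p-values are made up for illustration.

```python
def family_wise_error(n_outcomes, alpha=0.05):
    """Chance of at least one false positive across n_outcomes independent
    tests when every true effect is zero."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns one reject/keep flag per p-value,
    in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject


for k in (1, 5, 20):
    print(f"{k:>2} outcomes: P(at least one false positive) = {family_wise_error(k):.2f}")

p_values = [0.001, 0.04, 0.03, 0.2]   # hypothetical p-values for four outcomes
print("naive  :", [p <= 0.05 for p in p_values])
print("Holm   :", holm_reject(p_values))
```

With four hypothetical outcomes, three clear the conventional 5% threshold when tested one at a time, but only the strongest survives the correction. A reader who is not told how many outcomes were examined cannot make this adjustment, which is why pre-specification matters.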

5 What Would Help Resolve the Debate?

The debate is partially empirical and partially methodological. Several developments would advance it:

Mechanistic evidence. Long-run effects are more credible if there is positive evidence of the mechanism: measured non-cognitive skill persistence, educational attainment, or health status at intermediate time points. Studies that only measure the beginning and the end are harder to trust.

Attrition analysis. Long-run studies should report the share of the original sample successfully followed up, characterise the observable differences between followed and non-followed individuals, and report bounds under plausible assumptions about the missing observations.

Pre-registration of long-run outcomes. Pre-specifying which long-run outcomes will be the primary endpoints, before data collection begins, prevents selective reporting.

Replication. Where multiple evaluations of similar programmes exist (e.g. multiple deworming experiments, multiple early childhood education programmes), systematic meta-analysis with rigorous pooling methods provides a more robust evidence base than any single study.
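The bounds mentioned under attrition analysis can be sketched in their simplest form: worst-case bounds in the spirit of Manski, which fill in every missing outcome with the lowest or highest possible value. The example assumes a binary outcome (say, employed at follow-up) and uses hypothetical follow-up rates and means; tighter bounds are available under stronger assumptions.

```python
def mean_bounds(observed_mean, follow_up_share, y_min=0.0, y_max=1.0):
    """Worst-case bounds on the full-sample mean of an outcome bounded in
    [y_min, y_max], when only follow_up_share of the sample is observed."""
    missing = 1.0 - follow_up_share
    lower = observed_mean * follow_up_share + y_min * missing
    upper = observed_mean * follow_up_share + y_max * missing
    return lower, upper


def effect_bounds(treated, control):
    """Bounds on the treatment-control difference in means. Each argument is
    an (observed_mean, follow_up_share) pair."""
    t_lo, t_hi = mean_bounds(*treated)
    c_lo, c_hi = mean_bounds(*control)
    return t_lo - c_hi, t_hi - c_lo


# Hypothetical follow-up data: (share employed among those found, share found).
treated = (0.60, 0.85)
control = (0.50, 0.80)

lo, hi = effect_bounds(treated, control)
print(f"naive difference among those followed: {treated[0] - control[0]:+.2f}")
print(f"worst-case bounds on the effect: [{lo:+.2f}, {hi:+.2f}]")
```

The width of the interval equals the sum of the two attrition rates, so even moderate attrition can leave the sign of the effect undetermined unless further assumptions are defended. That is the reason for asking long-run studies to report follow-up shares alongside their estimates.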

6 Conclusion

The debate over long-run effects of short-run interventions is not a debate about whether effects ever persist: the evidence from STAR and its follow-ups, and from the quasi-experimental historical literature, suggests they sometimes do. The debate is about the conditions under which persistence is credible, the mechanisms that sustain it, and whether the identification strategies that work for short-run outcomes carry forward cleanly to outcomes measured decades later.

Both sceptics and believers have legitimate points. Secular trends, mortality selection, attrition, and publication bias are real threats to long-run causal estimates. So are premature dismissals of non-cognitive mechanisms and excessive reliance on test scores as the only valid educational outcome. The most credible long-run studies (those with pre-registered outcomes, near-complete follow-up, mechanistic intermediate evidence, and replicated designs) provide a more reliable guide to policy than the headline estimates from single studies.

References

Aiken, A. M., Davey, C., Hargreaves, J. R., and Hayes, R. J. (2015). Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: A pure replication. International Journal of Epidemiology, 44(5):1572-1580.
Bailey, M. J., Sun, S., and Timpe, B. (2021). Prep school for poor kids: The long-run impacts of Head Start on human capital and economic self-sufficiency. American Economic Review, 111(12):3963-4001.
Bleakley, H. (2007). Disease and development: Evidence from hookworm eradication in the American South. Quarterly Journal of Economics, 122(1):73-117.
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., and Yagan, D. (2011). How does your kindergarten classroom affect your earnings? Evidence from Project STAR. Quarterly Journal of Economics, 126(4):1593-1660.
Chetty, R., Friedman, J. N., and Rockoff, J. E. (2014). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. American Economic Review, 104(9):2633-2679.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2):424-455.
Hicks, J. H., Kremer, M., and Miguel, E. (2015). Commentary: Deworming externalities and schooling impacts in Kenya: A comment on Aiken et al. (2015) and Davey et al. (2015). International Journal of Epidemiology, 44(5):1593-1596.
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Krueger, A. B. (1999). Experimental estimates of education production functions. Quarterly Journal of Economics, 114(2):497-532.
Miguel, E. and Kremer, M. (2004). Worms: Identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72(1):159-217.

Should Long-Run Effects of Short-Run Interventions Be Trusted?
