Debates & Controversies

External Validity: Are Local Average Treatment Effects Policy-Relevant?

1 The Debate in Brief

The credibility revolution in econometrics has produced a generation of studies with strong internal validity: researchers can credibly claim that their estimates reflect a causal effect for some well-defined population. But internal validity is not the same as policy relevance. The LATE (Local Average Treatment Effect) identified by an instrumental variable, or the treatment effect at the threshold in a regression discontinuity, applies to a specific and often narrow sub-population. Whether such estimates help us design, scale, or evaluate policies is the question at the heart of the external validity debate.

This debate pits two camps against each other. On one side: methodologists who argue that a credibly identified LATE, even if local, is far more useful than a precisely estimated OLS that conflates causal effects with selection bias. On the other: structural economists and development researchers who argue that the obsession with internal validity has come at the cost of policy generalisability, and that without a structural model, we cannot say anything about what would happen if we scaled the policy, changed the population, or altered the margin of intervention.

2 The Case for LATE: Better Local than Nothing

The most prominent defence of LATE comes from Imbens [2010], whose "Better LATE than Nothing" response to Deaton [2010] and Heckman and Urzua [2010] lays out the argument clearly. Clean identification is valuable. The key virtue of LATE is that it is identified under minimal, transparent assumptions: randomisation (or as-good-as-random assignment), exclusion, and monotonicity. Unlike structural identification, these assumptions are transparent, partially testable, and require no parametric functional form restrictions. A credible LATE is better than a biased estimate of the ATE.

The complier population is often policy-relevant. In many contexts, the complier population is precisely the group that policymakers want to target. Consider an IV estimate of the returns to job training, where the instrument is random assignment to a training offer. Compliers are people who participate in training when offered but would not otherwise: exactly the group that a voluntary programme would reach. The LATE is the effect of training for precisely those whose take-up responds to the offer.
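The logic of the training-offer example can be sketched in a short simulation. The numbers below (complier share, stratum-specific gains) are hypothetical, chosen only to illustrate how the Wald/IV estimator recovers the compliers' average effect while a naive treated-versus-untreated comparison is contaminated by selection:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Latent compliance types: always-takers, compliers, never-takers.
type_ = rng.choice(["always", "complier", "never"], size=n, p=[0.2, 0.3, 0.5])

z = rng.integers(0, 2, size=n)  # randomised training offer
d = (type_ == "always") | ((type_ == "complier") & (z == 1))  # take-up

# Heterogeneous gains: always-takers benefit most (and self-select into D).
gain = np.where(type_ == "always", 3.0, np.where(type_ == "complier", 1.0, 0.5))
y0 = rng.normal(0, 1, size=n) + (type_ == "always") * 2.0  # better baseline too
y = y0 + gain * d

# Wald / IV estimate: (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]).
late = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

# Naive comparison of treated vs untreated conflates effect with selection.
naive = y[d].mean() - y[~d].mean()

print(f"true complier effect: 1.0, IV/Wald: {late:.2f}, naive: {naive:.2f}")
```

The IV estimate lands near the compliers' true effect of 1.0; the naive comparison is far larger, because always-takers have both higher baseline outcomes and bigger gains. Note the IV estimate says nothing about the always-takers' or never-takers' effects.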

Marginal effects matter for marginal policy decisions. When policymakers are deciding whether to expand a programme slightly at the margin (enrol a few more participants, extend a benefit to a slightly different group), the LATE at the margin is precisely what matters. The ATE (the effect for the whole population) may overstate or understate the effect of marginal expansion [Carneiro et al., 2011].

Extrapolation from structural models is not free. Critics of LATE often argue that structural models can extrapolate to different contexts. But structural models require strong functional form assumptions that are themselves untestable and potentially more wrong than the LATE assumptions. Both approaches require extrapolation; the question is which assumptions are less likely to be violated.

3 The Case Against: LATEs Cannot Inform Policy at Scale

The critique of LATE, articulated most forcefully by Deaton [2010] and Heckman and Urzua [2010], rests on several distinct arguments.

The complier population is often small and unusual. In many applications, the share of compliers is small (perhaps 10-20% of the sample), and they are systematically different from the rest. Consider quarter-of-birth instruments for education: compliers are students who would have dropped out earlier but stayed in school due to compulsory schooling laws. Their returns to schooling need not be representative of returns for the broader population or for people who would be induced into more schooling by a new policy.

External validity to other contexts. Even if the LATE is a credible estimate for the compliers in one setting, it may not generalise to different populations, time periods, or geographic contexts. The effects of a job training programme estimated in 1990s urban Texas may say little about the effects of the same programme in 2020s rural France. No IV estimate can resolve this without additional modelling.

General equilibrium effects. Small-scale natural experiments do not capture general equilibrium effects. If a policy is scaled up, it changes wages, prices, and behaviour throughout the economy in ways that a local experiment cannot capture. The Mariel boatlift increased Miami's labour supply by 7%; a national immigration policy might increase the US labour supply by much more, triggering very different responses. Heckman and Urzua [2010] argue that LATE estimates from micro-studies are systematically misleading for macroeconomic policy.

The wrong margin. LATEs identify effects at the margin of the instrument: the people on the edge of treatment. But policies often target the average or the deeply untreated, not the marginal individual. Knowing the returns to education for students who are just about to drop out tells us nothing about the returns for those who would never attend regardless of policy.

4 Bridging the Gap: Extrapolation and MTE

The debate has generated constructive methodological responses. Heckman et al. [2006] introduce the marginal treatment effect (MTE), which traces out treatment effects as a function of the propensity to receive treatment. The LATE is a particular weighted average of the MTE, with weights that depend on the instrument; the ATE is another weighted average. By estimating the full MTE curve, one can recover different policy-relevant treatment effect parameters without committing to a specific instrument's complier population.
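The relationship between the MTE, the LATE, and the ATE can be made concrete numerically. The linear MTE curve below is a purely illustrative functional form (not taken from the cited papers): the ATE averages the MTE over the full distribution of the latent resistance to treatment, while an instrument only identifies the average over the slice of resistance values its compliers occupy.

```python
import numpy as np

# Stylised MTE curve: the treatment effect declines in the latent
# resistance to treatment u (illustrative functional form only).
def mte(u):
    return 2.0 - 2.0 * u

# ATE: average the MTE over the full resistance distribution, u ~ U[0, 1].
grid = np.linspace(0.0, 1.0, 100_001)
ate = mte(grid).mean()  # = 1.0 for this linear curve

# An instrument shifting the propensity score from p0 to p1 identifies
# only the average MTE over [p0, p1]: the compliers' slice of the curve.
p0, p1 = 0.3, 0.6
late = mte(np.linspace(p0, p1, 100_001)).mean()  # = 1.1 here

print(f"ATE = {ate:.3f}, LATE for compliers in [{p0}, {p1}] = {late:.3f}")
```

A different instrument, shifting a different segment of the propensity score, would identify a different LATE from the same MTE curve, which is exactly why estimating the whole curve is more informative than any single IV estimate.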

Angrist and Fernández-Val [2013] and Carneiro et al. [2011] show how to extrapolate from a LATE to effects for other sub-populations, under assumptions about effect heterogeneity. These methods require some structural modelling, but less than a full structural model.

More recently, Mogstad et al. [2018] develop a framework for partial identification of policy-relevant treatment effects from a known LATE, allowing researchers to bound the ATE using the estimated LATE as an input.
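The flavour of such bounding can be conveyed with a back-of-envelope decomposition of the ATE over compliance strata, where the unobserved strata effects are bounded by the outcome's support. This is a crude Manski-style sketch, not the Mogstad et al. [2018] procedure (which exploits the full MTE machinery), and all numbers are hypothetical:

```python
# ATE = pi_c * LATE + pi_at * E[effect | always-taker]
#                   + pi_nt * E[effect | never-taker].
# The first term is identified; the other two are not, but with a binary
# outcome each stratum's unknown average effect must lie in [-1, 1].
pi_c = 0.3                 # complier share (the first stage)
pi_at, pi_nt = 0.2, 0.5    # always-taker and never-taker shares
late = 0.15                # estimated effect for compliers

lo = pi_c * late - (pi_at + pi_nt)  # worst case for the unknown strata
hi = pi_c * late + (pi_at + pi_nt)  # best case

print(f"ATE bounds: [{lo:.3f}, {hi:.3f}]")
```

With a complier share of only 0.3, the bounds are wide, which illustrates why a small complier population limits what a LATE alone can say about the ATE, and why the sharper bounds in Mogstad et al. [2018] require additional structure.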

5 What Evidence Would Resolve the Debate?

Replication across contexts. If LATEs estimated from different instruments in different settings converge on similar estimates, external validity becomes more plausible. Divergence across instruments suggests that the complier populations are genuinely different.

Structural-versus-reduced-form horse races. Direct comparisons between structural model predictions and reduced-form estimates, particularly in out-of-sample contexts, would help assess whether structural extrapolation outperforms reduced-form estimates.

Scale-up studies. Evidence on what happens when programmes are scaled from pilot to full rollout, comparing LATE predictions to observed effects at scale, directly tests external validity.

6 The Verdict

The external validity debate does not have a clear winner, and that is probably the right conclusion. LATEs are credibly identified and often exactly right for marginal policy questions about who is on the margin of treatment. They are less useful for extrapolating to new populations, scaling policies, or studying general equilibrium effects. The appropriate response is not to abandon IV but to be honest about what it identifies and to supplement it with structural or other approaches when the policy question requires extrapolation.

In the words of Imbens [2010]: "better LATE than nothing", but better still to understand exactly when and why the LATE answers the question you are asking.

References

  1. Angrist, J. D. and Fernández-Val, I. (2013). ExtrapoLATE-ing: External validity and overidentification in the LATE framework. In Acemoglu, D., Arellano, M., and Dekel, E., editors, Advances in Economics and Econometrics, vol. 3. Cambridge University Press, Cambridge.
  2. Carneiro, P., Heckman, J. J., and Vytlacil, E. J. (2011). Estimating marginal returns to education. American Economic Review, 101(6):2754-2781.
  3. Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2):424-455.
  4. Heckman, J. J. and Urzua, S. (2010). Comparing IV with structural models: What simple IV can and cannot identify. Journal of Econometrics, 156(1):27-37.
  5. Heckman, J. J., Urzua, S., and Vytlacil, E. (2006). Understanding instrumental variables in models with essential heterogeneity. Review of Economics and Statistics, 88(3):389-432.
  6. Imbens, G. W. (2010). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature, 48(2):399-423.
  7. Mogstad, M., Santos, A., and Torgovitsky, A. (2018). Using instrumental variables for inference about policy relevant treatment parameters. Econometrica, 86(5):1589-1619.
