The Causal Review

1 The Appeal and the Risk

The term "natural experiment" carries a powerful rhetorical appeal. It promises the causal credibility of a randomised trial, achieved not through deliberate randomisation but through the fortunate accidents of policy, history, or geography. The Vietnam draft lottery, the quarter-of-birth instrument, the Mariel boatlift, the rainfall instruments for agricultural income, the distance-to-coast instrument for trade—these are the foundational examples of a research tradition that has dominated applied economics for three decades.

But the label "natural experiment" is not a guarantee of valid causal inference. It is a claim—and a claim that can be wrong. In this article, we lay out the strongest version of both sides of the debate: the case that natural experiments, when carefully executed, are the most credible source of causal evidence in economics; and the case that the profession has been too credulous about the instruments it classifies as "natural."

2 The Three Conditions for a Valid Instrument

Recall that a valid instrument Z for the causal effect of D on Y requires three conditions:

1. Relevance: Z is correlated with D (testable).
2. Independence/Exogeneity: Z is independent of potential outcomes (Y(0),Y(1)) conditional on covariates (untestable directly).
3. Exclusion Restriction: Z affects Y only through D, not through any other channel (untestable directly).

Additionally, for the LATE interpretation:

4. Monotonicity: Z moves D in the same direction for all units—no defiers.

The credibility of a natural experiment hinges entirely on conditions 2-4, none of which are directly testable. The researcher's job is to construct an argument that these conditions hold in their specific setting. How convincing that argument is varies enormously across the literature.
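
For the simplest case of a binary instrument and a binary treatment, the four conditions can be tied to a single formula. The sketch below uses standard potential-outcomes notation; D(z), the treatment status a unit would take under instrument value z, is notation assumed here and not introduced above. When all four conditions hold, the ratio of the reduced form to the first stage (the Wald estimand) equals the average effect among compliers:

```latex
\beta_{\mathrm{IV}}
  = \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}
         {\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]}
  = \mathbb{E}\bigl[\, Y(1) - Y(0) \mid D(1) > D(0) \,\bigr]
```

Relevance keeps the denominator away from zero, independence and exclusion give the numerator a causal reading, and monotonicity ensures that the units with D(1) > D(0) are the only ones contributing to the ratio.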

3 The Case for Natural Experiments

3.1 The "Natural Natural Experiments" Standard

The strongest natural experiments share a common feature: the instrument's value is determined by a process that is, by design or by accident, orthogonal to the economic variables of interest. The Vietnam draft lottery is the canonical example: draft eligibility was determined by lottery numbers randomly assigned to birth dates, a process that is plausibly unrelated to potential earnings in the absence of military service. No story about how a lottery number could directly affect earnings other than through military service is immediately compelling [Angrist, 1990].

Angrist and Krueger [1991] argue that the quality of a natural experiment should be judged by how convincing the no-direct-effect story is. A researcher who uses a geographic instrument for trade should ask: are there any other channels through which geographic proximity to a trading partner could affect incomes, holding trade constant? If the answer is "not many," the exclusion restriction is plausible. If the answer is "many" (e.g. colonial history, migration, technology diffusion), the instrument is suspect.

3.2 Empirical Testing

Even though the exclusion restriction is not directly testable, researchers can build credibility through indirect tests:

Falsification tests: Test whether the instrument affects outcomes in sub-samples where it has no first-stage effect.
Overidentification tests: When multiple instruments are available, the Sargan-Hansen test checks whether they identify the same parameter (consistent with exclusion; note this only tests whether instruments agree, not whether any of them satisfies exclusion).
Placebo outcomes: Test whether the instrument affects outcomes that should not be affected by the treatment.
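
A placebo-outcome check of the kind listed above can be sketched in a few lines. This is a simulation under stated assumptions, not an analysis of any real study: the data-generating process, the variable names, and the helper `diff_in_means` are all hypothetical. The placebo outcome depends only on an unobserved confounder `u`, so a well-behaved instrument should show no association with it, while an instrument correlated with `u` should.

```python
import random
import statistics

def diff_in_means(outcome, z):
    """Return (difference in means, Welch t-statistic) of outcome by binary z."""
    treated = [y for y, zi in zip(outcome, z) if zi == 1]
    control = [y for y, zi in zip(outcome, z) if zi == 0]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return diff, diff / se

rng = random.Random(0)
n = 5000
u = [rng.gauss(0, 1) for _ in range(n)]         # unobserved confounder
placebo = [ui + rng.gauss(0, 1) for ui in u]    # outcome the treatment cannot affect

# Instrument assigned independently of u (as good as random).
z_clean = [rng.randint(0, 1) for _ in range(n)]
# Instrument that is correlated with u (independence fails).
z_dirty = [1 if ui + rng.gauss(0, 1) > 0 else 0 for ui in u]

_, t_clean = diff_in_means(placebo, z_clean)
_, t_dirty = diff_in_means(placebo, z_dirty)
print(f"placebo t-stat, clean instrument: {t_clean:.2f}")
print(f"placebo t-stat, dirty instrument: {t_dirty:.2f}")
```

A small t-statistic for the clean instrument does not prove validity; it only fails to reject it. A large one, as for the contaminated instrument, is direct evidence against the design.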

4 The Case for Scepticism

4.1 Deaton's Challenge

Deaton [2010] issued the most sustained critique of the instrumental variables approach in economics. His argument has several components:

LATE is not always policy-relevant. The LATE identifies the average treatment effect for a specific group of compliers—individuals who change treatment status in response to the instrument. This group may be a small and non-representative slice of the population. For policymakers who want to know what would happen if they scaled up an intervention to the whole population, the LATE for a specific instrument may provide little guidance.
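
The gap between the LATE and the population average effect is easy to see in a simulation. The sketch below assumes a hypothetical population in which compliers make up 20% of units and have a much larger treatment effect than always-takers and never-takers; the shares and effect sizes are invented for illustration.

```python
import random
import statistics

rng = random.Random(1)
n = 50_000
# Hypothetical population: compliers respond strongly, everyone else weakly.
SHARES = {"complier": 0.2, "always_taker": 0.3, "never_taker": 0.5}
EFFECT = {"complier": 2.0, "always_taker": 0.5, "never_taker": 0.5}

rows = []
for _ in range(n):
    kind = rng.choices(list(SHARES), weights=list(SHARES.values()))[0]
    z = rng.randint(0, 1)                       # randomly assigned instrument
    # Monotonicity holds: the instrument never pushes anyone out of treatment.
    d = 1 if kind == "always_taker" or (kind == "complier" and z == 1) else 0
    y = rng.gauss(0, 1) + EFFECT[kind] * d      # exclusion holds: z enters only via d
    rows.append((kind, z, d, y))

def mean_by_z(index, z_value):
    """Mean of column `index` among rows with instrument value z_value."""
    return statistics.mean(r[index] for r in rows if r[1] == z_value)

first_stage = mean_by_z(2, 1) - mean_by_z(2, 0)     # estimates the complier share
reduced_form = mean_by_z(3, 1) - mean_by_z(3, 0)
wald = reduced_form / first_stage                   # estimates the LATE
ate = statistics.mean(EFFECT[r[0]] for r in rows)   # population average effect

print(f"first stage {first_stage:.3f}, Wald/LATE {wald:.2f}, ATE {ate:.2f}")
```

In this design the Wald estimate recovers the complier effect of about 2.0, while the effect averaged over the whole population is about 0.8. Both numbers are correct answers to different questions, which is the heart of the policy-relevance objection.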

The exclusion restriction is rarely justified. Deaton argues that in many celebrated natural experiments, the exclusion restriction is simply assumed rather than defended. A plausible-sounding economic story is not the same as evidence. The history of economics is littered with "instruments" that have been argued to violate exclusion: colonial-era settler mortality affecting modern incomes through channels other than institutions; rainfall affecting income through channels beyond agricultural output; distance instruments reflecting geography that also affects health, conflict, and institutional quality directly.

The credibility revolution may have traded internal for external validity. By focusing on clean local identification, the credibility revolution has produced estimates that are internally valid but difficult to generalise. Structural models, despite their stronger assumptions, at least attempt to specify a model that can be used for policy counterfactuals at scale.

4.2 Rosenzweig and Wolpin's Taxonomy

Rosenzweig and Wolpin [2000] provide a careful taxonomy of natural experiments and their pitfalls. They distinguish between "natural natural experiments"—settings where the instrument's exogeneity is literally mechanical (draft lotteries, random assignment of judges)—and "quasi-natural experiments," where the researcher argues for exogeneity based on the economic plausibility of the story.

Their concern is that quasi-natural experiments impose strong assumptions that are rarely tested. The shift-share (Bartik) instrument is a prominent example: Goldsmith-Pinkham et al. [2020] show that its validity requires the industry shares used to construct the instrument to be exogenous, which in turn requires that those shares are uncorrelated with local demand shocks. This is a substantive condition that depends on the local economic history and is not automatically satisfied.
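
The mechanics of the shift-share instrument make clear why the shares carry so much weight. The following minimal sketch builds the instrument as local industry shares multiplied by national industry growth rates; the regions, industries, and numbers are all hypothetical, and the function name `bartik` is chosen here for illustration.

```python
# Hypothetical numbers throughout: two regions, three industries.
shares = {
    "region_a": {"manufacturing": 0.5, "services": 0.4, "mining": 0.1},
    "region_b": {"manufacturing": 0.1, "services": 0.8, "mining": 0.1},
}
national_growth = {"manufacturing": -0.02, "services": 0.03, "mining": 0.01}

def bartik(local_shares, growth):
    """Shift-share instrument: sum over industries of local share times national growth."""
    return sum(share * growth[industry] for industry, share in local_shares.items())

instrument = {region: bartik(s, national_growth) for region, s in shares.items()}
for region, value in instrument.items():
    print(f"{region}: {value:+.4f}")
```

Because every region faces the same national growth rates, all of the cross-regional variation in the instrument comes from differences in the shares.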

4.3 Weak Instruments and the Bias Problem

Beyond the exclusion restriction, the relevance condition has attracted increasing scrutiny. Staiger and Stock [1997] show that when the first-stage F-statistic is below about 10, IV estimates can be severely biased toward OLS, and standard inference is unreliable. Lee et al. [2022] show that even F > 10 does not deliver correct size for the usual t-test: in the single-instrument case, a nominal 5% test based on the conventional 1.96 critical value requires a first-stage F above 104.7, and many published instruments are weaker than this.

A systematic review by Brodeur et al. [2022] of hundreds of IV papers finds that a non-negligible share of studies report first-stage F-statistics well below 10. Estimates from such studies are exposed to weak-instrument bias regardless of whether the exclusion restriction holds.
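
The pull of a weak instrument toward OLS can be illustrated with a small Monte Carlo. The sketch assumes a hypothetical linear model D = pi*Z + v, Y = beta*D + u with a true beta of 1 and corr(u, v) = 0.8, so that OLS converges to roughly 1.8 when pi is near zero. All parameter values are chosen for illustration. Medians are reported because the just-identified IV estimator has very heavy tails.

```python
import random
import statistics

def simulate_once(rng, n, pi, beta=1.0, rho=0.8):
    """One sample from D = pi*Z + v, Y = beta*D + u, with corr(u, v) = rho."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    v = [rng.gauss(0, 1) for _ in range(n)]
    u = [rho * vi + (1 - rho**2) ** 0.5 * rng.gauss(0, 1) for vi in v]
    d = [pi * zi + vi for zi, vi in zip(z, v)]
    y = [beta * di + ui for di, ui in zip(d, u)]
    return z, d, y

def demean(x):
    m = statistics.fmean(x)
    return [xi - m for xi in x]

def iv_and_first_stage_f(z, d, y):
    """Just-identified IV estimate and first-stage F (squared t-ratio on Z)."""
    z, d, y = demean(z), demean(d), demean(y)
    szz = sum(a * a for a in z)
    szd = sum(a * b for a, b in zip(z, d))
    szy = sum(a * b for a, b in zip(z, y))
    pi_hat = szd / szz
    ssr = sum((b - pi_hat * a) ** 2 for a, b in zip(z, d))
    f_stat = pi_hat**2 * szz / (ssr / (len(z) - 2))
    return szy / szd, f_stat

def monte_carlo(pi, reps=400, n=200, seed=0):
    """Median IV estimate and median first-stage F across simulated samples."""
    rng = random.Random(seed)
    draws = [iv_and_first_stage_f(*simulate_once(rng, n, pi)) for _ in range(reps)]
    return (statistics.median(b for b, _ in draws),
            statistics.median(f for _, f in draws))

strong_beta, strong_f = monte_carlo(pi=0.5)
weak_beta, weak_f = monte_carlo(pi=0.02)
print(f"strong: median F {strong_f:.1f}, median IV estimate {strong_beta:.2f}")
print(f"weak:   median F {weak_f:.1f}, median IV estimate {weak_beta:.2f}")
```

With the strong first stage the median estimate sits close to the true value of 1. With the weak first stage it drifts toward the OLS limit, even though exclusion holds exactly in the simulated data.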

5 Finding Middle Ground

The debate between sceptics such as Deaton and the natural experiment enthusiasts is not, at bottom, a methodological disagreement but a disagreement about research priorities. Both sides agree that:

Valid instruments produce credible causal estimates.
The three IV conditions are not automatically satisfied.
Structural and reduced-form approaches answer different questions.

The disagreement is about how demanding the validity conditions should be in practice. The natural experiment proponents argue that near-clean instruments provide the most reliable causal evidence available, even if they are local. The sceptics argue that "near-clean" is too generous, and that the profession should apply more rigorous scrutiny to instrument validity.

A reasonable synthesis is the following standard: a natural experiment is credible when (1) the instrument is close to mechanically random (draft lottery, judge assignment, birth timing), (2) the first stage is strong (F > 104.7 if the conventional 1.96 critical value is to deliver a 5% test, or inference adjusted with the tF procedure of Lee et al. [2022] when F is lower), and (3) the exclusion restriction is defended with falsification tests and economic reasoning, not just asserted.

6 What Would Settle the Debate?

Progress on this debate would come from:

Within-study comparisons that validate or invalidate specific instruments against experimental benchmarks.
Better tools for assessing how sensitive estimates are to violations of the exclusion restriction, such as the Conley et al. [2012] approach for inference under "plausibly exogenous" instruments.
A catalogue of instrument failures: systematic documentation of cases where a published instrument was later shown to violate exclusion.
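
The idea behind inference with plausibly exogenous instruments can be sketched as a union of confidence intervals. Suppose the instrument is allowed a direct effect gamma on the outcome, and the researcher is willing to bound gamma in a range. For each candidate gamma, subtract gamma*Z from Y, run IV, and take the union of the resulting intervals. The code below is a simplified illustration of that idea under homoskedastic errors and a single instrument, not a reproduction of the authors' implementation. The simulated data, the assumed range [0, 0.2], and the function names are hypothetical.

```python
import random
import statistics

def demean(x):
    m = statistics.fmean(x)
    return [xi - m for xi in x]

def iv_with_ci(z, d, y, crit=1.96):
    """Just-identified IV estimate (with intercept) and homoskedastic 95% CI."""
    z, d, y = demean(z), demean(d), demean(y)
    szz = sum(a * a for a in z)
    szd = sum(a * b for a, b in zip(z, d))
    beta = sum(a * b for a, b in zip(z, y)) / szd
    sigma2 = sum((yi - beta * di) ** 2 for yi, di in zip(y, d)) / (len(z) - 2)
    se = (sigma2 * szz) ** 0.5 / abs(szd)
    return beta, beta - crit * se, beta + crit * se

def union_ci(z, d, y, gammas):
    """Hull of the IV confidence intervals over assumed direct effects gamma of Z on Y."""
    bounds = []
    for g in gammas:
        y_adj = [yi - g * zi for yi, zi in zip(y, z)]   # strip the assumed direct effect
        _, lo, hi = iv_with_ci(z, d, y_adj)
        bounds.append((lo, hi))
    return min(lo for lo, _ in bounds), max(hi for _, hi in bounds)

# Simulated data: true beta = 1.0, with a direct effect gamma = 0.1 violating exclusion.
rng = random.Random(2)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
u = [0.5 * vi + rng.gauss(0, 1) for vi in v]            # endogeneity: u correlated with v
d = [0.5 * zi + vi for zi, vi in zip(z, v)]
y = [1.0 * di + 0.1 * zi + ui for di, zi, ui in zip(d, z, u)]

beta_naive, naive_lo, naive_hi = iv_with_ci(z, d, y)    # assumes gamma = 0
gammas = [i * 0.02 for i in range(11)]                  # assumed support: gamma in [0, 0.2]
union_lo, union_hi = union_ci(z, d, y, gammas)
print(f"naive IV CI:  [{naive_lo:.2f}, {naive_hi:.2f}]")
print(f"union of CIs: [{union_lo:.2f}, {union_hi:.2f}]")
```

The naive interval is tight but centred on the wrong value, because it attributes the direct effect of Z to the treatment. The union is wider and covers the true coefficient as long as the assumed range for gamma contains the truth, which makes the cost of relaxing exclusion explicit.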

7 Conclusion

Natural experiments are powerful, but they are not magic. The label "natural experiment" describes a research design aspiration, not a guarantee. Every such study rests on unverifiable assumptions, and the burden of defending those assumptions through economic reasoning, falsification tests, and transparency about the complier population lies with the researcher. The credibility revolution was right to demand that researchers make their identifying assumptions explicit. The next step is to demand that researchers defend those assumptions more rigorously than has been the norm.

References

Angrist, J. D. Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security administrative records. American Economic Review, 80(3):313-336, 1990.
Angrist, J. D. and Krueger, A. B. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, 106(4):979-1014, 1991.
Brodeur, A., Cook, N., Hartley, J., and Heyes, A. Do pre-registration and pre-analysis plans reduce p-hacking and publication bias? Journal of Human Resources, 57(S):S151-S186, 2022.
Conley, T. G., Hansen, C. B., and Rossi, P. E. Plausibly exogenous. Review of Economics and Statistics, 94(1):260-272, 2012.
Deaton, A. Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2):424-455, 2010.
Goldsmith-Pinkham, P., Sorkin, I., and Swift, H. Bartik instruments: What, when, why, and how. American Economic Review, 110(8):2586-2624, 2020.
Imbens, G. W. Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature, 48(2):399-423, 2010.
Lee, D. S., McCrary, J., Moreira, M. J., and Porter, J. Valid t-ratio inference for IV. American Economic Review, 112(10):3260-3290, 2022.
Rosenzweig, M. R. and Wolpin, K. I. Natural "natural experiments" in economics. Journal of Economic Literature, 38(4):827-874, 2000.
Staiger, D. and Stock, J. H. Instrumental variables regression with weak instruments. Econometrica, 65(3):557-586, 1997.

Is Every "Natural Experiment" Really Natural? Scrutinising the Exclusion Restriction
