The Case for RCTs as the Gold Standard
Elimination of Selection Bias
The fundamental advantage of randomisation is that it eliminates selection bias by construction. When treatment is randomly assigned, the potential outcomes (Rubin, 1974) are independent of assignment, \(D_i \perp (Y_i(0), Y_i(1))\), and the difference in means between treated and control groups is an unbiased estimate of the average treatment effect: \[ \mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \mathbb{E}[Y_i(1) - Y_i(0)] = \text{ATE} \] where the first equality uses independence of assignment and potential outcomes. No assumption about the distribution of covariates or the functional form of the outcome equation is required. The estimator is nonparametric and robust to model misspecification.
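A minimal simulation makes the point concrete. The data-generating process below is purely illustrative (the effect size, noise scales, and sample size are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes with heterogeneous effects; true ATE = 2.0
y0 = rng.normal(loc=1.0, scale=2.0, size=n)
y1 = y0 + rng.normal(loc=2.0, scale=1.0, size=n)

# Random assignment: D is independent of (Y(0), Y(1)) by construction
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, y1, y0)  # observed outcome

ate_hat = y[d == 1].mean() - y[d == 0].mean()
print(f"difference in means: {ate_hat:.3f}  (true ATE = 2.000)")
```

Because assignment is independent of the potential outcomes, the raw difference in means recovers the true ATE up to sampling noise, with no outcome model specified at any point.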
Transparency of Assumptions
In a well-conducted RCT, the identifying assumption — that treatment was randomly assigned — holds by design rather than by argument. Researchers can check balance of pre-treatment covariates as a diagnostic for whether randomisation was implemented correctly. By contrast, observational identification strategies rest on untestable assumptions (parallel trends, exclusion restrictions, continuity of potential outcomes at the cutoff in regression discontinuity designs) whose plausibility is a matter of judgment.
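A sketch of such a balance diagnostic follows; the covariate names and distributions are illustrative, not drawn from any particular study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical pre-treatment covariates and a random assignment
age = rng.normal(40, 10, size=n)
income = rng.lognormal(mean=10, sigma=0.5, size=n)
d = rng.binomial(1, 0.5, size=n)

# Balance check: under correct randomisation, any covariate imbalance
# between arms is due to chance alone, so t-tests reject at roughly the 5% rate
for name, x in [("age", age), ("income", income)]:
    t, p = stats.ttest_ind(x[d == 1], x[d == 0])
    diff = x[d == 1].mean() - x[d == 0].mean()
    print(f"{name:>7}: diff = {diff:+.3f}, t = {t:+.2f}, p = {p:.3f}")
```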
The Empirical Record
A large literature documents cases where observational estimates diverge substantially from RCT estimates of the same treatment. LaLonde (1986) compared OLS and matching estimates of the effect of a job training programme to an RCT benchmark and found that the observational methods produced estimates that varied widely across specifications and frequently missed the experimental answer. This remains one of the most influential demonstrations of the vulnerability of observational methods to unobserved confounding.
Design-Based Inference
Under the "design-based" or "Fisherian" approach to inference (Imbens and Rubin, 2015), the randomisation distribution of the test statistic — rather than an assumed data-generating process — is the basis for p-values. Under the sharp null hypothesis of no effect for any unit, the only source of randomness is the assignment itself, so no distributional assumptions about the outcomes are needed. The resulting inference is exact and does not depend on large-sample approximations.
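A minimal randomisation test might look as follows (the outcome values are toy numbers, and the reference distribution is approximated by Monte Carlo re-randomisation rather than enumerated exactly):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy experiment: observed outcomes and the realised assignment
y = np.array([3.1, 4.5, 2.8, 5.0, 3.9, 4.2, 2.5, 4.8])
d = np.array([0, 1, 0, 1, 0, 1, 0, 1])

def diff_in_means(y, d):
    return y[d == 1].mean() - y[d == 0].mean()

t_obs = diff_in_means(y, d)

# Under the sharp null Y_i(1) = Y_i(0) for every i, outcomes are fixed and
# only the assignment is random; re-randomising D traces out the reference
# distribution of the statistic
t_perm = np.array([diff_in_means(y, rng.permutation(d)) for _ in range(20_000)])
p_value = (np.abs(t_perm) >= abs(t_obs)).mean()
print(f"observed diff = {t_obs:.3f}, randomisation p = {p_value:.4f}")
```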
The Case Against the Gold Standard Framing
Feasibility and Ethics
Many important causal questions cannot be answered by an RCT. It is not ethical to randomly assign people to smoke, to be incarcerated, or to receive a substandard education. It is not feasible to randomly assign countries to adopt different trade policies or constitutional arrangements. The gold standard framing excludes a large fraction of the most important questions in social science from the domain of credible causal inference — a clearly unacceptable conclusion.
LATE vs.\ ATE: Who Are the Compliers?
Even when an RCT is feasible, it may not identify the treatment effect for the population of interest. Imperfect compliance is common: some assigned-to-treatment units refuse treatment; some assigned-to-control units access treatment on their own. The IV estimate from the RCT (using random assignment as an instrument for actual treatment) identifies the LATE — the effect for compliers (Angrist, Imbens, and Rubin, 1996). But compliers may be systematically different from the general population in ways that limit policy relevance.
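A simulation sketch shows why the Wald (IV) estimator recovers the complier effect even when other types respond differently; the compliance shares and effect sizes below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical compliance types: 60% compliers, 30% never-takers, 10% always-takers
types = rng.choice(["complier", "never", "always"], size=n, p=[0.6, 0.3, 0.1])
z = rng.binomial(1, 0.5, size=n)  # random assignment = instrument

# Treatment actually taken: always-takers take it regardless, never-takers
# never do, compliers follow their assignment
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Treatment effects differ by type: 2.0 for compliers, 0.5 for always-takers
effect = np.where(types == "complier", 2.0, np.where(types == "always", 0.5, 0.0))
y = rng.normal(size=n) + d * effect

# Wald estimator: ITT effect on outcomes divided by ITT effect on take-up
itt_y = y[z == 1].mean() - y[z == 0].mean()
itt_d = d[z == 1].mean() - d[z == 0].mean()
print(f"Wald/IV estimate: {itt_y / itt_d:.3f}  (complier effect = 2.000)")
```

The ratio recovers the complier effect rather than a population-wide average; the always-takers' different response cancels out of the intent-to-treat contrasts because they take treatment in both arms.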
External Validity and Generalisation
RCTs are conducted in specific populations, at specific times, and in specific institutional settings. The average treatment effect estimated in a particular RCT may not generalise to other populations or settings — the problem of external validity (Angrist and Pischke, 2009). A well-designed quasi-experimental study that exploits variation in a nationally representative sample may have better external validity than an RCT conducted on a selected volunteer sample.
Hawthorne Effects and Artificiality
The act of conducting an experiment can itself change behaviour. Participants who know they are being studied may change their actions in ways that distort the estimated treatment effect. This "Hawthorne effect" is especially problematic in social interventions where participants are aware of their treatment assignment.
Replication and Credibility of Observational Studies
The critique of observational methods based on LaLonde (1986) has been revisited and qualified. Imbens and Rubin (2015) and others have argued that the failure of observational methods in that study was partly due to poor implementation (inappropriate comparison groups, misspecified models) rather than an inherent limitation of the approach. Modern quasi-experimental methods — difference-in-differences with credible parallel trends, regression discontinuity designs, instrumental variables with defensible exclusion restrictions — have a strong track record of producing findings consistent with experimental estimates.
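As a sketch of the difference-in-differences logic, the simulation below builds parallel trends in by construction and shows the estimator removing a selection bias that wrecks the naive cross-sectional comparison (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Hypothetical two-period setting with selection into treatment: the treated
# group sits at a different outcome level, but trends are parallel
g = rng.binomial(1, 0.5, size=n)                        # treated-group indicator
level = 2.0 + 3.0 * g                                   # group-specific levels
y_pre = level + rng.normal(0, 1, size=n)                # period 0
y_post = level + 0.7 + 1.5 * g + rng.normal(0, 1, n)    # common trend 0.7, effect 1.5

# The naive post-period comparison is contaminated by the level difference;
# differencing over time and then across groups removes it
naive = y_post[g == 1].mean() - y_post[g == 0].mean()
did = (y_post[g == 1].mean() - y_pre[g == 1].mean()) - \
      (y_post[g == 0].mean() - y_pre[g == 0].mean())
print(f"naive cross-section: {naive:.3f},  DiD: {did:.3f}  (true effect = 1.500)")
```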
The Diamond Standard
Cartwright (2007) argues for a "diamond standard" that evaluates evidence quality along multiple dimensions: internal validity, external validity, ecological validity, and theoretical coherence. An RCT may score high on internal validity but low on external validity, while a well-designed observational study may do better on the latter. The relevant question is not which study design is generically superior, but which study, in a given context, provides the most credible and relevant evidence.
A Framework for Comparing Designs
Rather than ranking study designs in a fixed hierarchy, it is more productive to evaluate them along several dimensions:
- Internal validity: Is the identifying assumption credible? For RCTs: was randomisation actually implemented correctly? For observational studies: is the exclusion restriction or parallel trends assumption defensible?
- External validity: Does the estimated effect generalise to the target population and context? This depends on who is in the study sample and how comparable they are to the target.
- Statistical power: Is the study large enough to detect effects of policy-relevant magnitude? (A back-of-envelope calculation follows this list.)
- Estimand relevance: Does the estimated quantity — ATE, ATT, LATE — correspond to the policy question?
- Ethical and practical feasibility: Could the study be conducted at all?
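The statistical power dimension lends itself to a quick check. Below is a minimum-detectable-effect calculation using the standard normal-approximation formula; the sample size, significance level, and outcome scale are illustrative choices:

```python
import numpy as np
from scipy import stats

def mde(n_per_arm, sigma, alpha=0.05, power=0.80):
    """Minimum detectable effect for a two-arm trial with equal-sized arms
    and an outcome with standard deviation `sigma` (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    se = sigma * np.sqrt(2 / n_per_arm)  # std. error of the difference in means
    return (z_alpha + z_power) * se

# Illustrative: with sd = 1 and 500 units per arm, effects smaller than
# roughly 0.18 sd are likely to go undetected at conventional thresholds
print(f"MDE at n = 500 per arm: {mde(500, sigma=1.0):.3f} sd units")
```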
Different study designs score differently on these dimensions, and the appropriate design depends on the question. For questions about whether a treatment works at all in a defined population, an RCT is hard to beat when feasible. For questions about how a policy works in a specific national context at a specific time, quasi-experimental evidence from that context may be more relevant than a carefully controlled experiment in a different setting.
The Convergent Evidence Standard
Perhaps the most important lesson from the debate is that causal conclusions should ideally rest on convergent evidence from multiple study designs. When an RCT, a regression discontinuity design, a difference-in-differences study, and an instrumental variables study all point in the same direction, the causal conclusion is much more credible than if any one of these designs were available alone. The history of evidence on the effects of education on earnings provides an example: IV estimates using compulsory schooling laws, proximity to college, and quarter-of-birth instruments, together with institutional quasi-experiments, all consistently point to substantial returns to education (Angrist and Pischke, 2009).
Conclusion
The RCT is a powerful and often indispensable tool for causal inference, but it is not the only tool, and the "gold standard" framing can be misleading. A randomised trial that is poorly conducted, that enrols an unrepresentative sample, or that estimates a LATE of no policy relevance may be less informative than a well-designed observational study. The appropriate standard for causal evidence is not the design per se, but the credibility of the identifying assumption, the relevance of the estimand, and the generalisability of the findings. The credibility revolution in economics showed that observational methods can be made credible through clean research design; the task is to apply that discipline rigorously across all study types.
References
- Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444--455.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Cartwright, N. (2007). Hunting Causes and Using Them. Cambridge University Press, Cambridge.
- Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge.
- LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604--620.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688--701.
- Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J. P., Allen, H., Baicker, K., and the Oregon Health Study Group (2012). The Oregon health insurance experiment: Evidence from the first year. Quarterly Journal of Economics, 127(3):1057--1106.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press, Cambridge.