The Case for RCTs as the Gold Standard
Elimination of Selection Bias
The fundamental advantage of randomisation is that it eliminates selection bias by construction. When treatment is randomly assigned, the potential outcomes (Rubin, 1974) are independent of assignment, \(D_i \perp (Y_i(0), Y_i(1))\), and the difference in means between treated and control groups is an unbiased estimate of the average treatment effect: \[ \mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \mathbb{E}[Y_i(1) - Y_i(0)] = \text{ATE} \] where the first equality uses independence of assignment and potential outcomes. No assumption about the distribution of covariates or the functional form of the outcome equation is required. The estimator is nonparametric and robust to model misspecification.
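A minimal simulation makes the point concrete. The data-generating process below is purely illustrative (the effect size, noise scales, and sample size are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes with heterogeneous effects; true ATE = 2.0
y0 = rng.normal(loc=1.0, scale=2.0, size=n)
y1 = y0 + rng.normal(loc=2.0, scale=1.0, size=n)

# Random assignment: D is independent of (Y(0), Y(1)) by construction
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, y1, y0)  # observed outcome

ate_hat = y[d == 1].mean() - y[d == 0].mean()
print(f"difference in means: {ate_hat:.3f}  (true ATE = 2.000)")
```

Because assignment is independent of the potential outcomes, the raw difference in means recovers the true ATE up to sampling noise, with no outcome model specified at any point.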
Transparency of Assumptions
In a well-conducted RCT, the identifying assumption — that treatment was randomly assigned — holds by design rather than by argument. Researchers can check balance of pre-treatment covariates as a diagnostic for whether randomisation was implemented correctly. By contrast, observational identification strategies rest on untestable assumptions (parallel trends, exclusion restrictions, continuity of potential outcomes at the cutoff in regression discontinuity designs) whose plausibility is a matter of judgment.
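A sketch of such a balance diagnostic follows; the covariate names and distributions are illustrative, not drawn from any particular study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical pre-treatment covariates and a random assignment
age = rng.normal(40, 10, size=n)
income = rng.lognormal(mean=10, sigma=0.5, size=n)
d = rng.binomial(1, 0.5, size=n)

# Balance check: under correct randomisation, any covariate imbalance
# between arms is due to chance alone, so t-tests reject at roughly the 5% rate
for name, x in [("age", age), ("income", income)]:
    t, p = stats.ttest_ind(x[d == 1], x[d == 0])
    diff = x[d == 1].mean() - x[d == 0].mean()
    print(f"{name:>7}: diff = {diff:+.3f}, t = {t:+.2f}, p = {p:.3f}")
```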
The Empirical Record
A large literature documents cases where observational estimates diverge substantially from RCT estimates of the same treatment. LaLonde (1986) compared OLS and matching estimates of the effect of a job training programme to an RCT benchmark and found that the observational methods produced estimates that varied widely across specifications and frequently missed the experimental answer. This remains one of the most influential demonstrations of the vulnerability of observational methods to unobserved confounding.
Design-Based Inference
Under the "design-based" or "Fisherian" approach to inference (Imbens and Rubin, 2015), the randomisation distribution of the test statistic — rather than an assumed data-generating process — is the basis for p-values. Under the sharp null hypothesis of no effect for any unit, the only source of randomness is the assignment itself, so no distributional assumptions about the outcomes are needed. The resulting inference is exact and does not depend on large-sample approximations.
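A minimal randomisation test might look as follows (the outcome values are toy numbers, and the reference distribution is approximated by Monte Carlo re-randomisation rather than enumerated exactly):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy experiment: observed outcomes and the realised assignment
y = np.array([3.1, 4.5, 2.8, 5.0, 3.9, 4.2, 2.5, 4.8])
d = np.array([0, 1, 0, 1, 0, 1, 0, 1])

def diff_in_means(y, d):
    return y[d == 1].mean() - y[d == 0].mean()

t_obs = diff_in_means(y, d)

# Under the sharp null Y_i(1) = Y_i(0) for every i, outcomes are fixed and
# only the assignment is random; re-randomising D traces out the reference
# distribution of the statistic
t_perm = np.array([diff_in_means(y, rng.permutation(d)) for _ in range(20_000)])
p_value = (np.abs(t_perm) >= abs(t_obs)).mean()
print(f"observed diff = {t_obs:.3f}, randomisation p = {p_value:.4f}")
```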
The Case Against the Gold Standard Framing
Feasibility and Ethics
Many important causal questions cannot be answered by an RCT. It is not ethical to randomly assign people to smoke, to be incarcerated, or to receive a substandard education. It is not feasible to randomly assign countries to adopt different trade policies or constitutional arrangements. The gold standard framing excludes a large fraction of the most important questions in social science from the domain of credible causal inference — a clearly unacceptable conclusion.
LATE vs.\ ATE: Who Are the Compliers?
Even when an RCT is feasible, it may not identify the treatment effect for the population of interest. Imperfect compliance is common: some assigned-to-treatment units refuse treatment; some assigned-to-control units access treatment on their own. The IV estimate from the RCT (using random assignment as an instrument for actual treatment) identifies the LATE — the effect for compliers (Angrist, Imbens, and Rubin, 1996). But compliers may be systematically different from the general population in ways that limit policy relevance.
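A simulation sketch shows why the Wald (IV) estimator recovers the complier effect even when other types respond differently; the compliance shares and effect sizes below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical compliance types: 60% compliers, 30% never-takers, 10% always-takers
types = rng.choice(["complier", "never", "always"], size=n, p=[0.6, 0.3, 0.1])
z = rng.binomial(1, 0.5, size=n)  # random assignment = instrument

# Treatment actually taken: always-takers take it regardless, never-takers
# never do, compliers follow their assignment
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Treatment effects differ by type: 2.0 for compliers, 0.5 for always-takers
effect = np.where(types == "complier", 2.0, np.where(types == "always", 0.5, 0.0))
y = rng.normal(size=n) + d * effect

# Wald estimator: ITT effect on outcomes divided by ITT effect on take-up
itt_y = y[z == 1].mean() - y[z == 0].mean()
itt_d = d[z == 1].mean() - d[z == 0].mean()
print(f"Wald/IV estimate: {itt_y / itt_d:.3f}  (complier effect = 2.000)")
```

The ratio recovers the complier effect rather than a population-wide average; the always-takers' different response cancels out of the intent-to-treat contrasts because they take treatment in both arms.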
External Validity and Generalisation
RCTs are conducted in specific populations, at specific times, and in specific institutional settings. The average treatment effect estimated in a particular RCT may not generalise to other populations or settings — the problem of external validity (Angrist and Pischke, 2009). A well-designed quasi-experimental study that exploits variation in a nationally representative sample may have better external validity than an RCT conducted on a selected volunteer sample.
Hawthorne Effects and Artificiality
The act of conducting an experiment can itself change behaviour. Participants who know they are being studied may change their actions in ways that distort the estimated treatment effect. This "Hawthorne effect" is especially problematic in social interventions where participants are aware of their treatment assignment.
Replication and Credibility of Observational Studies
The critique of observational methods based on LaLonde (1986) has been revisited and qualified. Imbens and Rubin (2015) and others have argued that the failure of observational methods in that study was partly due to poor implementation (inappropriate comparison groups, misspecified models) rather than an inherent limitation of the approach. Modern quasi-experimental methods — difference-in-differences with credible parallel trends, regression discontinuity designs, instrumental variables with defensible exclusion restrictions — have a strong track record of producing findings consistent with experimental estimates.
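As a sketch of the difference-in-differences logic, the simulation below builds parallel trends in by construction and shows the estimator removing a selection bias that wrecks the naive cross-sectional comparison (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Hypothetical two-period setting with selection into treatment: the treated
# group sits at a different outcome level, but trends are parallel
g = rng.binomial(1, 0.5, size=n)                        # treated-group indicator
level = 2.0 + 3.0 * g                                   # group-specific levels
y_pre = level + rng.normal(0, 1, size=n)                # period 0
y_post = level + 0.7 + 1.5 * g + rng.normal(0, 1, n)    # common trend 0.7, effect 1.5

# The naive post-period comparison is contaminated by the level difference;
# differencing over time and then across groups removes it
naive = y_post[g == 1].mean() - y_post[g == 0].mean()
did = (y_post[g == 1].mean() - y_pre[g == 1].mean()) - \
      (y_post[g == 0].mean() - y_pre[g == 0].mean())
print(f"naive cross-section: {naive:.3f},  DiD: {did:.3f}  (true effect = 1.500)")
```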
The Diamond Standard
Cartwright (2007) argues for a "diamond standard" that evaluates evidence quality along multiple dimensions: internal validity, external validity, ecological validity, and theoretical coherence. An RCT may score high on internal validity but low on external validity, while a well-designed observational study may do better on the latter. The relevant question is not which study design is generically superior, but which study, in a given context, provides the most credible and relevant evidence.
A Framework for Comparing Designs
Rather than ranking study designs in a fixed hierarchy, it is more productive to evaluate them along several dimensions:
- Internal validity: Is the identifying assumption credible? For RCTs: was randomisation actually implemented correctly? For observational studies: is the exclusion restriction or parallel trends assumption defensible?
- External validity: Does the estimated effect generalise to the target population and context? This depends on who is in the study sample and how comparable they are to the target.
- Statistical power: Is the study large enough to detect effects of policy-relevant magnitude? (A back-of-envelope calculation follows this list.)
- Estimand relevance: Does the estimated quantity — ATE, ATT, LATE — correspond to the policy question?
- Ethical and practical feasibility: Could the study be conducted at all?
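The statistical power dimension lends itself to a quick check. Below is a minimum-detectable-effect calculation using the standard normal-approximation formula; the sample size, significance level, and outcome scale are illustrative choices:

```python
import numpy as np
from scipy import stats

def mde(n_per_arm, sigma, alpha=0.05, power=0.80):
    """Minimum detectable effect for a two-arm trial with equal-sized arms
    and an outcome with standard deviation `sigma` (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    se = sigma * np.sqrt(2 / n_per_arm)  # std. error of the difference in means
    return (z_alpha + z_power) * se

# Illustrative: with sd = 1 and 500 units per arm, effects smaller than
# roughly 0.18 sd are likely to go undetected at conventional thresholds
print(f"MDE at n = 500 per arm: {mde(500, sigma=1.0):.3f} sd units")
```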
Different study designs score differently on these dimensions, and the appropriate design depends on the question. For questions about whether a treatment works at all in a defined population, an RCT is hard to beat when feasible. For questions about how a policy works in a specific national context at a specific time, quasi-experimental evidence from that context may be more relevant than a carefully controlled experiment in a different setting.
The Convergent Evidence Standard
Perhaps the most important lesson from the debate is that causal conclusions should ideally rest on convergent evidence from multiple study designs. When an RCT, a regression discontinuity design, a difference-in-differences study, and an instrumental variables study all point in the same direction, the causal conclusion is much more credible than if any one of these designs were available alone. The history of evidence on the effects of education on earnings provides an example: IV estimates using compulsory schooling laws, proximity to college, and quarter-of-birth instruments, together with institutional quasi-experiments, all consistently point to substantial returns to education (Angrist and Pischke, 2009).
Conclusion
The RCT is a powerful and often indispensable tool for causal inference, but it is not the only tool, and the "gold standard" framing can be misleading. A randomised trial that is poorly conducted, that enrols an unrepresentative sample, or that estimates a LATE of no policy relevance may be less informative than a well-designed observational study. The appropriate standard for causal evidence is not the design per se, but the credibility of the identifying assumption, the relevance of the estimand, and the generalisability of the findings. The credibility revolution in economics showed that observational methods can be made credible through clean research design; the task is to apply that discipline rigorously across all study types.
References
- Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444--455.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Cartwright, N. (2007). Hunting Causes and Using Them. Cambridge University Press, Cambridge.
- Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge.
- LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604--620.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688--701.
- Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J. P., Allen, H., Baicker, K., and the Oregon Health Study Group (2012). The Oregon health insurance experiment: Evidence from the first year. Quarterly Journal of Economics, 127(3):1057--1106.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press, Cambridge.