1 A Crisis or a Reckoning?
In 2016, a team of economists led by Colin Camerer published a landmark study in Science: they attempted to replicate 18 laboratory experiments published in the American Economic Review and the Quarterly Journal of Economics, and succeeded in fewer than two-thirds of the cases [Camerer et al., 2016]. A 2018 follow-up that applied the same approach to social science experiments published in Nature and Science found similar results [Camerer et al., 2018]. Meanwhile, meta-analyses suggested that effect sizes in published economics research were systematically too large, a signature of publication selection combined with small-sample, low-powered estimation [Ioannidis et al., 2017].
The term "replication crisis" spread from psychology, where estimated non-replication rates were even higher [Open Science Collaboration, 2015]. But economics had its own version of the problem, with its own flavour. In psychology the primary concern was underpowered lab experiments on student convenience samples. Economics faced a more varied set of challenges: observational studies whose key identifying assumption is unverifiable, archival datasets shared reluctantly if at all, and a culture in which demonstrating robustness often meant reporting a "preferred" estimate that had itself emerged from a specification search. This article reviews the evidence on replication in economics, identifies the distinct channels through which published results can fail to replicate, and assesses the reforms that have emerged in response. It goes beyond the simple prescription of pre-registration, which is necessary but not sufficient.
2 What Does It Mean to Replicate?
Before assessing the crisis, precision about terminology matters. Following Christensen and Miguel [2018], three distinct concepts are often conflated:
- Computational reproducibility. Given the same data and code, do you recover the same numbers as in the published paper? This is a purely mechanical check.
- Replication. Using the same methodology on new data from a similar population, do you recover a similar effect?
- Robustness. Using different reasonable analytical choices (covariates, sample restrictions, functional forms) on the same data, does the result hold?
These three concepts have different implications. Computational irreproducibility signals error or opacity; non-replication may reflect genuine sampling variability, population heterogeneity, or a false positive; non-robustness suggests that the result depends on specific choices that may not be disclosed or defended.
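The first of these checks is mechanical enough to sketch in code. The snippet below is a minimal illustration, not a standard tool: the function name `check_reproduction`, the coefficient names, and all numbers are hypothetical. It compares estimates produced by rerunning a replication package against the values printed in a paper, allowing only for rounding to the reported number of decimals.

```python
import math


def check_reproduction(published, reproduced, decimals=3):
    """Label each published estimate as 'match', 'mismatch', or 'missing'.

    `published` and `reproduced` map coefficient names to values. A value
    counts as a match when it agrees with the published one up to rounding
    at `decimals` decimal places (an assumed reporting convention).
    """
    tolerance = 0.5 * 10 ** (-decimals)
    report = {}
    for name, published_value in published.items():
        if name not in reproduced:
            report[name] = "missing"
        elif math.isclose(reproduced[name], published_value,
                          rel_tol=0.0, abs_tol=tolerance):
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


# Hypothetical numbers, for illustration only.
published = {"treatment": 0.142, "age": -0.031, "income": 1.250}
reproduced = {"treatment": 0.14237, "age": -0.0279}
print(check_reproduction(published, reproduced))
```

A "match" here says nothing about replication or robustness in the senses defined above; it only confirms that the code and data regenerate the printed numbers.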
3 The Evidence
3.1 Laboratory Experiments: Direct Replication Evidence
The Camerer et al. [2016] and Camerer et al. [2018] studies provide the clearest evidence because laboratory experiments are the most directly replicable: the same protocol can be run on a new sample. Of the 18 economics laboratory experiments replicated in Camerer et al. [2016], 11 (61%) yielded a statistically significant effect in the same direction as the original. Replication effect sizes averaged about two-thirds (66%) of the original effect sizes, consistent with systematic publication bias inflating the original estimates.
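Because the headline rate rests on only 18 studies, it helps to see how much sampling uncertainty surrounds it. The sketch below computes a 95% Wilson score interval for 11 successes out of 18. The interval is an illustration added here, not a figure reported in the original study, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)
    ) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(11, 18)
print(f"replication rate {11 / 18:.1%}, 95% Wilson interval [{low:.1%}, {high:.1%}]")
```

The interval runs from roughly 39% to 80%, so the point estimate of 61% should be read as a rough indication rather than a precise replication rate.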
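Effect-size shrinkage is expected under publication bias. When only statistically significant results are published, the estimates that make it into print are drawn disproportionately from the tail of the sampling distribution furthest from zero and so overstate the true effect, a selection effect often called the winner's curse. The inflation is most severe when power is low, because an underpowered study can reach significance only by overestimating the effect. Replications, which face no selection pressure toward significance, are on average unbiased for the true (smaller) effect.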
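The selection mechanism can be illustrated with a short simulation. This is a stylised sketch under stated assumptions, not a model of any particular study: each "study" is a two-arm experiment with unit-variance outcomes, the true effect (0.2 standard deviations) and the sample size (50 per arm) are arbitrary illustrative values, and a study is "published" only if its estimate is significant at the 5% level.

```python
import numpy as np


def simulate_winners_curse(true_effect=0.2, n_per_arm=50,
                           n_studies=20_000, seed=0):
    """Compare all estimates with those that survive a significance filter."""
    rng = np.random.default_rng(seed)
    # Standard error of a difference in means with unit-variance outcomes.
    se = np.sqrt(2.0 / n_per_arm)
    # Draw each study's estimate directly from its sampling distribution.
    estimates = rng.normal(loc=true_effect, scale=se, size=n_studies)
    significant = np.abs(estimates / se) > 1.96
    published = estimates[significant]
    return {
        "power": float(significant.mean()),
        "mean_all": float(estimates.mean()),
        "mean_published": float(published.mean()),
        # An unbiased replication recovers the true effect on average, so this
        # is the expected replication-to-original effect-size ratio.
        "replication_ratio": float(true_effect / published.mean()),
    }


for key, value in simulate_winners_curse().items():
    print(f"{key}: {value:.3f}")
```

With these parameters power is low, the average of all estimates sits near the true effect, and the average "published" estimate is well above it. Raising `n_per_arm` increases power and shrinks the gap, which is why shrinkage in replications is informative about both selection and power.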
3.2 Observational and Natural-Experiment Studies
Direct replication of observational studies is harder: there is no single "same protocol" that can be applied to new data, because identification often depends on a unique institutional setting (a specific policy change, a specific natural disaster). Instead, researchers assess:
Computational reproducibility. Surveys of reproducibility in economics journals find non-trivial failure rates even after mandatory data sharing policies. Vilhuber [2019] documents that even with journal-mandated reproducibility packages, 40-50% of papers fail to reproduce from the provided materials within a reasonable time frame, due to undisclosed software dependencies, missing data, or undocumented code.
Within-paper robustness. Brodeur et al. [2020] apply caliper tests, among other methods, to more than 21,000 hypothesis tests published in 25 leading economics journals. They find a deficit of test statistics just short of conventional significance thresholds and a surplus just beyond them (that is, too few p-values just above 0.05 and too many just below), the signature of p-hacking or selective reporting. The excess mass varies by method: it is most pronounced for instrumental variables and, to a lesser extent, difference-in-differences, and smaller for randomised controlled trials and regression discontinuity designs, consistent with researchers having more degrees of freedom when identification does not come from randomised or sharply discontinuous variation.
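The logic of a caliper test can be sketched in a few lines. The version below is a simplified illustration, not the implementation used by Brodeur et al. [2020]: it works with absolute z-statistics around 1.96 (equivalent to p-values around 0.05), uses an arbitrary window width of 0.10, and applies a one-sided exact binomial test of whether more than half of the statistics inside the window fall on the significant side. The function name and inputs are hypothetical.

```python
from math import comb


def caliper_test(z_stats, threshold=1.96, width=0.10):
    """Count |z| just below and just above `threshold`, then test for excess.

    Returns (below, above, p_value), where p_value is the one-sided exact
    binomial probability of seeing at least `above` statistics on the
    significant side if each side were equally likely.
    """
    below = sum(1 for z in z_stats if threshold - width <= abs(z) < threshold)
    above = sum(1 for z in z_stats if threshold <= abs(z) < threshold + width)
    n = below + above
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return below, above, p_value


# Hypothetical z-statistics: 10 just below the threshold, 30 just above.
z_stats = [1.90] * 10 + [1.97] * 30 + [0.50, 3.00]
below, above, p_value = caliper_test(z_stats)
print(below, above, round(p_value, 4))
```

The test assumes that, absent manipulation, the density of test statistics is approximately flat inside a narrow window, so the choice of `width` trades off bias against the number of observations available.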
Cross-study replication. For prominent empirical questions (minimum wage effects on employment, effects of immigration on native wages, returns to education), multiple natural experiments conducted over many decades often produce markedly different estimates. Part of this dispersion reflects genuine treatment effect heterogeneity (different populations, different time periods, different margins of response); part may reflect specification searching.
4 The Distinct Channels of Non-Replication in Economics
Economics differs from experimental psychology in having several replication challenges that are largely absent from the lab:
- Identification assumption non-transparency. In an observational study using, say, a difference-in-differences design, the parallel trends assumption is maintained rather than tested: pre-treatment trends can be compared, but the counterfactual post-treatment path of the treated group is never observed, so the assumption's plausibility remains a matter of judgment. Two researchers looking at the same setting can reach different conclusions about whether the assumption is plausible, and neither can be definitively proven right.
- Researcher degrees of freedom in observational identification. The choice of control group, the sample period, the definition of treatment, the set of covariates, and whether to include unit or time fixed effects can all affect the estimate. Unlike a clinical trial where the protocol is fixed at registration, observational economists face hundreds of forking paths, many of which are theoretically defensible.
- Data availability and provenance. Many empirical papers in economics use proprietary or restricted-access administrative data that cannot be shared with replicators. Without access to the underlying data, even computational reproducibility is impossible. Journals have responded with data availability statements and mandatory replication packages, but enforcement is uneven.
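- Weak identification and inflated effect sizes. In IV settings, a weak first stage makes the two-stage least squares estimate imprecise and biased toward the OLS estimate, even when the instrument is valid. Because the IV estimate is the reduced-form effect divided by the first-stage coefficient, a small first stage also magnifies sampling noise and any minor violation of the exclusion restriction. Papers published on the strength of a "large" and "significant" IV estimate may therefore be reporting a draw inflated by a weak first stage and by selection on significance rather than the true treatment effect. Post-publication re-examination with stronger instruments or more data often produces smaller estimates.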
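The cost of researcher degrees of freedom can be made concrete with a small simulation. The sketch below is a stylised illustration under assumed settings, not an estimate for any real literature: the treatment has no effect by construction, and the "researcher" can choose among 12 specifications (four nested covariate sets crossed with three sample definitions). It compares the false-positive rate of a single pre-specified regression with the rate obtained when any significant specification can be reported. All names and parameter values are hypothetical.

```python
import math

import numpy as np


def t_stat_on_treatment(y, X):
    """OLS t-statistic for column 1 of X (the treatment), homoskedastic SEs."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / math.sqrt(cov[1, 1])


def simulate_forking_paths(n_sims=1000, n=200, seed=0):
    """Return (pre-specified, best-of-12) false-positive rates under a null effect."""
    rng = np.random.default_rng(seed)
    first_hits = 0
    any_hits = 0
    for _ in range(n_sims):
        treat = rng.integers(0, 2, size=n).astype(float)
        covs = rng.normal(size=(n, 3))
        # The outcome depends on the covariates only: the true effect is zero.
        y = 0.5 * covs.sum(axis=1) + rng.normal(size=n)
        samples = [np.ones(n, dtype=bool), covs[:, 2] > 0, covs[:, 2] <= 0]
        significant = []
        for k in range(4):  # include the first k covariates
            X = np.column_stack([np.ones(n), treat, covs[:, :k]])
            for keep in samples:  # full sample or one of two subgroups
                t = t_stat_on_treatment(y[keep], X[keep])
                significant.append(bool(abs(t) > 1.96))
        first_hits += significant[0]   # no covariates, full sample
        any_hits += any(significant)   # report whichever specification "works"
    return first_hits / n_sims, any_hits / n_sims


pre_specified, searched = simulate_forking_paths()
print(f"pre-specified: {pre_specified:.3f}, best of 12: {searched:.3f}")
```

Each of the 12 specifications is individually defensible, which is the point: the inflation in false positives comes from the freedom to choose among them after seeing the results, not from any single indefensible choice.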
5 What Distinguishes Studies That Replicate?
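Evidence from Camerer et al. [2016] and Camerer et al. [2018], together with the patterns documented in Section 3, points to several features associated with replication success. The first two come directly from the replication projects; the last two are inferences from the broader literature and apply mainly to observational work: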
- Larger original effect sizes (relative to the standard error) predict replication.
- Lower original p-values predict replication—but this partly reflects better power rather than better design.
- Studies using stronger identification strategies (randomisation, clear discontinuities, strong instruments) replicate better than studies relying on more questionable identification.
- Simple, transparent designs replicate better than complex, specification-heavy designs.
Prediction markets embedded in Camerer et al. [2016] and Camerer et al. [2018] asked peer scientists to bet on whether each original finding would replicate. Market prices were positively related to replication outcomes, most clearly in the 2018 study, suggesting that the research community holds latent information about study reliability that the published record does not convey.
6 Reforms and Their Limits
6.1 Pre-registration and Pre-Analysis Plans
Pre-registration—committing to a hypothesis, estimand, and analysis plan before seeing the data—is the most widely discussed reform. Casey et al. [2012] and Olken [2015] discuss pre-analysis plans (PAPs) for development economics RCTs and find that PAP adherence improves credibility but imposes real costs (reduced flexibility to respond to unexpected patterns in the data).
For observational studies, strict pre-registration is often impractical: the "data" may be the full panel of available years from a public dataset that is already familiar to the researcher, and the "analysis plan" would need to specify which years of data to use, a choice that is itself informative. Christensen and Miguel [2018] argue that pre-registration is most valuable when combined with theory-driven predictions, not just "I will estimate the average treatment effect on the treated (ATT)."
6.2 Data and Code Sharing
Mandatory data-sharing policies, now adopted by most top economics journals, are necessary for computational reproducibility. But they are insufficient for replication in the broader sense: a replicator who runs the provided code on the provided data and gets the same number has not tested whether the result is robust or generalisable.
6.3 Registered Reports
The registered report format—peer review of the research design before data collection, with provisional acceptance conditional on the pre-specified analysis—is increasingly used in experimental economics. It eliminates publication bias for registered studies by committing to publish based on the quality of the design, not the statistical significance of the results.
6.4 Adversarial Collaboration
Some economists have experimented with adversarial collaboration: researchers who hold opposing views on a question jointly design and conduct a study, with both agreeing in advance on the analysis plan and interpretation criteria. This format directly addresses researcher degrees of freedom by forcing the adversaries to agree on what would constitute evidence either way.
6.5 Meta-Science Infrastructure
The Experimental Economics Replication Project and the Social Sciences Replication Project, which produced the Camerer et al. [2016] and Camerer et al. [2018] studies, have created systematic infrastructure for replication studies. The Journal of Applied Econometrics and several development journals now have sections specifically for replication studies. The Open Science Framework hosts a registry for pre-registrations and pre-analysis plans alongside a preprint service.
7 A Balanced Assessment
The pessimistic view holds that economics has a deep replication problem rooted in publication incentives that reward novel, large effects over accurate, small ones, and that pre-registration and data sharing are marginal reforms that leave these incentives unchanged. The optimistic view counters that the "crisis" is concentrated in particular kinds of study (lab experiments, certain observational designs) and that the credibility revolution itself, the shift toward natural experiments and quasi-experimental designs, has already substantially improved the average quality of causal identification in economics over the past thirty years.
Camerer et al. [2016] found a 61% replication rate for economics lab experiments; it is plausible, though untested, that well-powered natural-experiment studies with strong identification would fare better. The truth probably lies somewhere between the two views: progress on replication in economics is real but incomplete. The shift toward design-based inference has reduced some forms of specification searching; the new generation of estimators for staggered difference-in-differences designs addresses known biases in two-way fixed effects (TWFE) regressions; data-sharing mandates have improved computational reproducibility. But weak instrument problems remain common, observational researchers still face many forking paths, and the incentive to produce novel significant results has not fundamentally changed.
8 Conclusion
The replication crisis in economics is real but heterogeneous. Laboratory experiments show substantial non-replication, with roughly four in ten original findings failing to replicate in Camerer et al. [2016]; observational studies face the different problems of identification non-transparency and researcher degrees of freedom. The reforms adopted so far (pre-registration, data sharing, and registered reports) are valuable but insufficient on their own. A deeper reform would require changing publication incentives to reward precision and transparency over novelty and significance. Whether that change will come from journals, funders, or a generational shift in norms among researchers remains an open question.
References
- Brodeur, A., Cook, N., and Heyes, A. (2020). Methods matter: p-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11):3634-3660.
- Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280):1433-1436.
- Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science. Nature Human Behaviour, 2(9):637-644.
- Casey, K., Glennerster, R., and Miguel, E. (2012). Reshaping institutions: Evidence on aid impacts using a preanalysis plan. Quarterly Journal of Economics, 127(4):1755-1812.
- Christensen, G. and Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature, 56(3):920-980.
- Ioannidis, J. P. A., Stanley, T. D., and Doucouliagos, H. (2017). The power of bias in economics research. Economic Journal, 127(605):F236-F265.
- Olken, B. A. (2015). Promises and perils of pre-analysis plans. Journal of Economic Perspectives, 29(3):61-80.
- Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251):aac4716.
- Vilhuber, L. (2019). Reproducibility and replicability in economics. Harvard Data Science Review, 1(2).