Debates & Controversies

Publication Bias and the File Drawer Problem: Are Published Effect Sizes Too Big?

1 The Core Concern

Scientific journals preferentially publish results that are statistically significant. A study with a p-value of 0.03 is far more likely to appear in print than one reporting p = 0.42, even if both are equally well designed. This creates publication bias: the distribution of published results is not representative of the distribution of all results. The studies that found nothing, the "file drawer" of unpublished null results, are systematically missing from the literature.

If published results are a selective sample of all research, then naive meta-analyses of the literature will overstate effect sizes, and policy recommendations based on published work will be systematically too optimistic. The question debated in this article is: how large is this problem in economics, and what should be done about it?
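
The inflation mechanism can be illustrated with a small simulation. The sketch below is not drawn from any of the cited papers: the true effect, the standard error, and the 20% publication rate for insignificant results are all assumed values chosen for illustration. It draws many noisy estimates of a single true effect, always "publishes" the significant ones, and publishes the rest only occasionally.

```python
import random
import statistics

def simulate_file_drawer(true_effect=0.1, se=0.1, n_studies=20000,
                         p_publish_insignificant=0.2, seed=0):
    """Simulate many studies of one true effect with selective publication.

    All parameter values are illustrative assumptions, not estimates.
    Returns (mean of all estimates, mean of published estimates).
    """
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)  # one study's noisy estimate
        t_stat = estimate / se
        all_estimates.append(estimate)
        significant = abs(t_stat) > 1.96
        # Significant results are always published; the rest only sometimes.
        if significant or rng.random() < p_publish_insignificant:
            published.append(estimate)
    return statistics.mean(all_estimates), statistics.mean(published)

mean_all, mean_published = simulate_file_drawer()
print(f"mean of all estimates:       {mean_all:.3f}")
print(f"mean of published estimates: {mean_published:.3f}")
```

Under these assumptions the mean of all estimates stays close to the true effect, while the mean of the published subset is noticeably larger. The size of the gap depends heavily on statistical power and on how strong the selection is, which is what the empirical debate is about.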

2 Side A: Publication Bias is a Serious and Systematic Problem

Evidence from the distribution of test statistics

Brodeur et al. [2016] analysed about 50,000 test statistics from papers published in the American Economic Review, Journal of Political Economy, and Quarterly Journal of Economics between 2005 and 2011. They found an anomalous pattern: the distribution of test statistics is two-humped, with a shortage of marginally insignificant results (p-values between roughly 0.10 and 0.25) and an excess of results just above the conventional 5% threshold of 1.96. This is the pattern predicted by selective reporting: results that are marginally insignificant are more likely to be re-specified until they cross the threshold, or not submitted at all.

Formally, under a null of no selective reporting, the distribution of t-statistics should be smooth through 1.96. A "caliper test" checks for a discontinuity in the density at the significance threshold [Brodeur et al., 2016]. The caliper test strongly rejects the null of no selection for economics journals.
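
One simple version of a caliper test can be sketched as a binomial test. This is a minimal illustration, not the procedure used in any particular paper: the function name `caliper_test`, the window width of 0.2, and the example t-statistics are assumptions. The idea is that if the density is smooth through 1.96, a test statistic falling in a narrow window around the threshold should be roughly equally likely to land on either side. This is only an approximation, since a sloping density would shift the split away from 50/50.

```python
from math import comb

def caliper_test(t_stats, threshold=1.96, width=0.2):
    """One-sided binomial test for excess mass just above the threshold.

    Counts |t| values in [threshold - width, threshold) and in
    [threshold, threshold + width), then computes P(X >= above) for
    X ~ Binomial(below + above, 0.5).
    """
    below = sum(threshold - width <= abs(t) < threshold for t in t_stats)
    above = sum(threshold <= abs(t) < threshold + width for t in t_stats)
    n = below + above
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return below, above, p_value

# Made-up t-statistics, for illustration only.
example_t_stats = [1.80, 1.85, 1.90, 1.97, 1.99, 2.01,
                   2.03, 2.05, 2.08, 2.10, 2.12, 2.14]
below, above, p_value = caliper_test(example_t_stats)
print(f"below: {below}, above: {above}, one-sided p-value: {p_value:.4f}")
```

The choice of window width involves a trade-off: a narrower caliper makes the equal-probability approximation more credible but leaves fewer observations to test with.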

The Andrews-Kasy model

Andrews and Kasy [2019] develop a structural model of publication bias in which the probability of publication depends on the p-value. They estimate the model on economics working paper and publication records and find that a result with p=0.05 is roughly three times more likely to be published than a result with p=0.5. Their selection-corrected estimates of average effect sizes across studies are substantially smaller than the raw published estimates, consistent with inflation.
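
The logic of a selection correction can be sketched with inverse-probability weighting. This is a simplified illustration, not the Andrews and Kasy estimator: they estimate the publication probability from data, whereas the sketch below assumes it is known. The step-function rule, the value `beta = 1/3` (loosely echoing the factor of three quoted above), and the simulated literature are all hypothetical. If insignificant results are published with relative probability `beta`, weighting each published insignificant result by `1 / beta` restores its share in the underlying population of studies.

```python
import random

def relative_publication_prob(t_stat, beta=1/3):
    """Hypothetical step-function selection rule (assumed, not estimated)."""
    return 1.0 if abs(t_stat) > 1.96 else beta

def naive_and_reweighted_means(estimates, std_errors, beta=1/3):
    """Return the plain mean and the inverse-probability-weighted mean."""
    weights = [1.0 / relative_publication_prob(e / s, beta)
               for e, s in zip(estimates, std_errors)]
    naive = sum(estimates) / len(estimates)
    reweighted = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return naive, reweighted

# Simulate a published literature under the assumed selection rule.
rng = random.Random(1)
true_effect, se, beta = 0.1, 0.1, 1/3
published = []
for _ in range(20000):
    estimate = rng.gauss(true_effect, se)
    if rng.random() < relative_publication_prob(estimate / se, beta):
        published.append(estimate)

naive, reweighted = naive_and_reweighted_means(
    published, [se] * len(published), beta)
print(f"true effect:     {true_effect:.3f}")
print(f"naive mean:      {naive:.3f}")
print(f"reweighted mean: {reweighted:.3f}")
```

The correction works here only because the simulation and the weights use the same selection rule. In practice the rule is unknown and has to be estimated, which is where the modelling assumptions discussed in this article matter.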

Replication evidence

In a registered replication initiative, Camerer et al. [2016] attempted to replicate 18 laboratory experiments published in top economics journals. Of these, 11 (61%) replicated, in the sense that the replication found a significant effect in the same direction as the original. Effect sizes in replications were on average 66% of the original, consistent with upward bias from publication selection. A follow-up study of social science results found similar patterns [Camerer et al., 2018].

3 Side B: The Problem is Overstated and the Proposed Remedies are Costly

Economics is not psychology

Critics note that economics is not the social science hit hardest by the replication crisis. Unlike social psychology, where lab experiments with student samples under stylised conditions have proven fragile, many flagship economics results (minimum wage effects, labour supply elasticities, returns to education) have been replicated and confirmed across many countries and methods. The "caliper" discontinuity in t-statistics, while real, captures only one type of publication bias and may reflect researchers' legitimate decisions to present results more cleanly rather than outright suppression of findings.

The caliper test has limitations

Elliott et al. [2022] show that the caliper test conflates publication bias with other features of the research process, including legitimate specification searching and the use of one-sided tests. Under reasonable models of how researchers choose specifications, a discontinuity in t-statistics at 1.96 can arise even without any suppression of results. The caliper test may over-diagnose publication bias.

Pre-registration is not a panacea

The leading proposed remedy for publication bias is pre-registration: researchers publicly commit to a specific analysis plan before seeing the data, then report the results regardless of significance. Pre-registration is now common in development economics, public health RCTs, and political science experiments [Casey et al., 2012].

However, Olken [2015] and others note that pre-registration has costs. It discourages researchers from following up on unexpected findings that may be scientifically important. Pre-specified analyses may also turn out to be the wrong ones if the researcher chose an inappropriate test or outcome in advance. There is also a risk of a two-track system in which pre-registered null results are published ceremonially but receive less attention and fewer citations than surprising non-pre-registered findings.

Selective publication of significant results is partly efficient

Some economists argue that publication bias performs a useful filtering function. Journals face finite space and reader attention. A study finding no effect is less informative in many settings than one finding an effect, at least as a first contribution. An analogy: a medical trial for a new drug that finds no effect is less actionable than one that finds a beneficial effect, even though both provide information. The issue is not publication bias per se but whether the null results that are not published would materially change the weight of evidence.

4 What Would Help Resolve the Debate?

Several developments would sharpen the empirical and normative debate:

  1. Registry of working papers and results. If all economics working papers were registered and their ultimate disposition (published, abandoned, in revision) tracked, the true distribution of all research could be compared to the published distribution. The AEA RCT Registry provides this for randomised trials; no equivalent exists for non-experimental work.
  2. Better structural models of selective reporting. The Andrews and Kasy [2019] model is a start, but it assumes a simple monotone relationship between p-values and publication probability. More realistic models would allow for discipline-specific norms, author reputation effects, and the distinction between first-mover novelty and replication.
  3. Direct replication studies. More large-scale, well-powered replication attempts, like the Social Science Replication Project [Camerer et al., 2018], would provide direct evidence on the reproducibility of published economics results.
  4. Transparency in specification searching. Even without pre-registration, requiring authors to disclose the full set of specifications examined ("robustness table" transparency) and to correct for multiple testing in secondary analyses would reduce the impact of specification searching without eliminating researcher flexibility.
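
As an illustration of the multiple-testing correction mentioned in item 4, the sketch below applies the Holm step-down adjustment to a set of p-values. Holm is only one of several possible corrections and is chosen here as an example; the function name `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order.

    The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and the adjusted values are forced to be non-decreasing
    in the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Made-up p-values for four secondary outcomes, for illustration only.
secondary_p_values = [0.01, 0.04, 0.03, 0.20]
adjusted_p_values = holm_adjust(secondary_p_values)
for raw, adj in zip(secondary_p_values, adjusted_p_values):
    print(f"raw p = {raw:.2f}, Holm-adjusted p = {adj:.2f}")
```

A correction of this kind only works if the set of tests being corrected is itself fully disclosed, which is why it is paired with specification transparency.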

5 An Honest Assessment

Both sides contain important truths. There is credible evidence that economics journals have a preference for statistically significant results and that this preference inflates published effect sizes: the caliper test and replication evidence are difficult to explain away entirely. At the same time, the problem is more complex than a simple suppression story, the magnitude of bias varies enormously across subfields, and pre-registration alone cannot solve it.

The most productive response is probably not a single institutional fix but a portfolio of practices: pre-registration for primary outcomes in prospective studies; transparency norms for specification reporting in retrospective studies; structural meta-analyses that model publication selection; and direct replication of important results. None of these remedies are costless, but together they would improve the calibration between what is published and what is true.

References

  1. Andrews, I. and Kasy, M. (2019). Identification of and correction for publication bias. American Economic Review, 109(8):2766-2794.
  2. Brodeur, A., Le, M., Sangnier, M., and Zylberberg, Y. (2016). Star wars: the empirics strike back. American Economic Journal: Applied Economics, 8(1):1-32.
  3. Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., and Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280):1433-1436.
  4. Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., Isaksson, S., Manfredi, D., Rose, J., Wagenmakers, E.-J., and Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9):637-644.
  5. Casey, K., Glennerster, R., and Miguel, E. (2012). Reshaping institutions: evidence on aid impacts using a preanalysis plan. Quarterly Journal of Economics, 127(4):1755-1812.
  6. Elliott, G., Kudrin, N., and Wüthrich, K. (2022). Detecting p-hacking. Econometrica, 90(6):2663-2687.
  7. Olken, B. A. (2015). Promises and perils of pre-analysis plans. Journal of Economic Perspectives, 29(3):61-80.
