I thank the editor for selecting these letters for discussion; together, they typify the most common objections to raising standards for statistical evidence.
Gelman and Robert’s (1) letter characterizes subjective Bayesian objections to the use of more stringent statistical standards, arguing that significance levels and evidence thresholds should be based on “costs, benefits, and probabilities of all outcomes.” In principle, this is a wonderful goal, but in practice, it is impossible to achieve. In most hypothesis tests, unique and well-defined loss functions and prior densities do not exist; instead, there is a plethora of vaguely defined loss functions and prior densities. In the context of declaring a scientific discovery, distinct loss functions and prior beliefs are held by the study investigators, by journal editors and reviewers, by other scientists, by funding agencies, and by the general public. Thousands of scientific manuscripts are written each year, and eliciting these distinct loss functions and priors on a case-by-case basis, and determining how to combine them, is simply not feasible. Worse still, the authors of each manuscript must select which of many hypothesis tests to report. This is why hypothesis tests are usually performed using commonly accepted standards for statistical significance.
With regard to loss functions, it is important to note that the greatest loss that the scientific community now faces is the loss of public confidence. Under a prior assumption of equipoise, the declaration of new discoveries based on P values near 0.05 guarantees that ∼20% of these discoveries will be false. Even higher rates of nonreproducibility were found in the empirical studies cited in ref. 2, suggesting that the assumption of equipoise is also overly optimistic. Such rates of nonreproducibility are clearly too high to maintain public confidence in science.
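To see where the ∼20% figure comes from, consider the following back-of-the-envelope calculation (a Python sketch; it uses the one-sided z-test bound from refs. 2 and 7 and assumes prior equipoise):

```python
# Back-of-the-envelope sketch: implied false discovery proportion when
# discoveries are declared at P ~ 0.05 under prior equipoise.  The Bayes
# factor bound is the one-sided z-test relation gamma = exp(z_alpha^2 / 2)
# from refs. 2 and 7.
import numpy as np
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha)              # ~1.645
max_bayes_factor = np.exp(z_alpha**2 / 2)  # ~3.87:1 odds against the null

# Under equipoise (prior odds 1:1), posterior odds equal the Bayes factor,
# so the posterior probability that the null is true is at least:
p_null = 1 / (1 + max_bayes_factor)
print(f"Bayes factor at one-sided P = 0.05: {max_bayes_factor:.2f}")
print(f"Implied false discovery proportion: {p_null:.2f}")   # ~0.21
```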
Gaudart et al. (3) describe a second general objection to raising the bar for declaring a scientific discovery; namely, that doing so will unnecessarily increase the costs of conducting experiments and will delay the pace of science. They add weight to their argument by posing it in the context of clinical trials, insisting that delays in declaring the success of new treatments harm patients.
Their claim that patients would be harmed by requiring more evidence in favor of investigational treatments is, however, not supported by actual trial experience. In fact, exactly the opposite appears to be true. In a comprehensive study of the success rate of more than 2,500 investigational new drugs, DiMasi et al. (4) found that more than one-third of the investigational drugs that passed phase II trials subsequently failed in larger phase III trials. That is, one-third of the investigational new drugs that exceeded the current threshold for statistical significance in phase II failed to demonstrate a benefit in phase III. In another survey of clinical trials, Gan et al. (5) reported that 62% of phase III trials of new cancer treatments failed; they ascribe these failures to overestimation of the intended treatment benefit from earlier stage trials. Overall, the failure rate of investigational drugs reported in ref. 4 was 81%.
As these surveys demonstrate, the high failure rate of phase III clinical trials stems largely from lax standards of evidence in phase II trials. By declaring success based on 5:1 odds or less in favor of an investigational treatment (i.e., a 5% significance threshold) and by ignoring the historical rate of failure of investigational drugs in early stage clinical trials, too many ineffective drugs are subjected to phase III testing.
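A rough calculation illustrates the point (a sketch under stated assumptions: the 81% overall failure rate from ref. 4 is used as a stand-in prior success rate of 19%, and the phase II evidence is taken to be exactly 5:1 odds):

```python
# Rough illustration: combining 5:1 odds in favor of a new treatment from a
# phase II trial with the historical success rate of investigational drugs.
# The 19% prior is an assumption derived from the 81% failure rate in ref. 4.
bayes_factor = 5.0                # "5:1 odds or less" at the 5% threshold
prior_success = 0.19              # assumed historical success rate

prior_odds = prior_success / (1 - prior_success)   # ~0.23:1
posterior_odds = bayes_factor * prior_odds         # ~1.17:1
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"Posterior probability of a real benefit: {posterior_prob:.2f}")  # ~0.54
```

On these assumptions, a treatment that just clears the 5% bar in phase II has roughly even odds of providing a real benefit, which is consistent with the phase III failure rates reported in refs. 4 and 5.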
What are the costs associated with the 5% significance threshold (or as suggested in ref. 3, a 10% threshold!) in early stage phase II trials? Phase II clinical trials typically involve between 30 and 200 patients. Phase III trials typically involve between 300 and 3,000 patients. Thus, the consequence of declaring premature success in an early stage clinical trial is to expose many more patients to inferior treatments in larger confirmatory trials. Not only do these larger phase III trials waste enormous human and financial resources, they also divert resources away from the conduct of much less expensive early stage trials of other (potentially more effective) therapies.
It is also not clear that patients enrolled in smaller phase II trials would suffer from more rigorous standards for statistical evidence. Phase II trials can be conducted either as single arm trials in which all patients receive the trial agent or as randomized trials in which patients are randomly assigned to two or more treatment groups. In single arm trials, nothing is lost by assigning additional patients to treatment arms that appear to be more successful than a well-established standard of care. In randomized phase II trials, adaptive randomization schemes can be used to ensure that the probability that a patient is assigned to a treatment increases with the current estimate of the probability that the given treatment is most effective. As the evidence in favor of a treatment increases, a higher proportion of patients are assigned to that treatment arm. Indeed, simply assigning additional patients to the most promising treatment arm after completion of the randomization phase of a trial would provide one mechanism for confirming the results of many trials at a higher evidence threshold.
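One simple version of such a scheme, sketched below for two arms with binary outcomes, assigns each patient to an arm with probability equal to the current posterior probability that the arm is most effective (an illustrative Beta–binomial construction, not a prescription; the response rates and trial size are assumed for the example):

```python
# Sketch of Bayesian adaptive randomization for a two-arm phase II trial with
# binary outcomes: assignment probabilities track the current posterior
# probability that each arm is the most effective (all inputs are assumptions).
import numpy as np

rng = np.random.default_rng(1)
true_response = [0.25, 0.40]   # assumed true response rates, unknown to the trial
successes = np.zeros(2)
failures = np.zeros(2)

for patient in range(200):
    # Each arm's response rate has a Beta(1 + successes, 1 + failures) posterior.
    # Estimate P(arm is best) by Monte Carlo sampling from the posteriors.
    draws = rng.beta(1 + successes, 1 + failures, size=(4000, 2))
    p_best = np.bincount(draws.argmax(axis=1), minlength=2) / 4000

    arm = rng.choice(2, p=p_best)           # assignment probability = P(arm is best)
    outcome = int(rng.random() < true_response[arm])
    successes[arm] += outcome
    failures[arm] += 1 - outcome

n_per_arm = (successes + failures).astype(int)
print("Patients per arm:      ", n_per_arm)
print("Observed response rate:", np.round(successes / n_per_arm, 2))
```

As evidence accumulates in favor of the better arm, the scheme automatically assigns a growing share of patients to it, which is the property appealed to in the text.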
The lessons learned from clinical trials apply also to other scientific experiments. False discoveries inevitably lead to additional confirmatory and follow-on experiments that waste scientific resources. In some cases, they may even alter the research direction of an entire discipline.
Pericchi et al. (6) propose an alternative solution to the problem of nonreproducibility of scientific findings. Their solution represents an interesting mix of objective Bayesian methodology and classical type II error control. Unfortunately, their proposal relies critically on the specification of an alternative hypothesis under which the type II error is calculated, and in some cases, can result in even less stringent control of type I error. A re-examination of their example illustrates these problems.
Pericchi et al. describe an experiment in which 104,490,000 Bernoulli observations are collected to test whether a success probability π is equal to 0.5. Rejection of this null hypothesis purportedly confirms the theory of psychokinesis, suggesting to the reader that there is a high probability that the null hypothesis is true and that the P value against it is misleading. The authors then pose an alternative hypothesis that π differs from 0.5 by at least 0.01 and propose to control the type II error to be less than 0.05.
In actual scientific practice, however, it is hard to imagine a situation in which a scientist would design an experiment to collect 104,490,000 Bernoulli observations to test whether a success probability differed by 0.01 from 0.5 in a 1% test. This number of observations would normally be collected only either to detect a very small difference from 0.5 or to gain overwhelming evidence against the null hypothesis. In the latter case, the size of the test would be much smaller than 1%. In the former, it is important to identify the deviation that is really being tested. With 104,490,000 observations, a 1% test of a difference of roughly 0.0002 (the difference detectable with 95% power at this sample size) seems much more plausible than a test of a difference of 0.01. If that smaller difference is instead used to adjust the type I error according to the formula in ref. 6, then the adjusted type I error becomes 0.018. That is, the type I error control is diminished, and the false discovery rate is increased.
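The arithmetic behind this point can be checked with a standard normal-approximation power calculation (a sketch assuming a two-sided 1% test with 95% power, matching the type II error bound proposed in ref. 6):

```python
# Sketch: what deviation from 0.5 does a 1% test with 95% power actually
# target when n = 104,490,000 Bernoulli observations are collected?
# Standard normal-approximation sample-size formula for a binomial proportion.
import numpy as np
from scipy.stats import norm

n = 104_490_000
alpha, beta = 0.01, 0.05
z_a = norm.ppf(1 - alpha / 2)      # two-sided 1% test
z_b = norm.ppf(1 - beta)           # 95% power

# n ~ (z_a + z_b)^2 * 0.25 / delta^2  =>  solve for delta
delta = (z_a + z_b) * np.sqrt(0.25 / n)
print(f"Detectable difference from 0.5: {delta:.5f}")   # ~0.0002
```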
In addition to not addressing the problem of nonreproducibility of scientific studies, the practical implementation of the approach presented in ref. 6 depends critically on the subjective specification of an alternative hypothesis, which, as noted in my response to ref. 1, severely limits the application of this methodology in routine hypothesis tests.
A few technical comments concerning issues raised in ref. 1 follow.
The characterization of uniformly most powerful Bayesian tests (UMPBTs) as minimax procedures is inaccurate. Minimax procedures are defined by minimizing the maximum loss that a decision maker can suffer. In contrast, UMPBTs are defined to maximize the probability that the Bayes factor in favor of the alternative hypothesis exceeds a specified threshold. The evidence threshold represents the consensus value required to declare a discovery. Like significance tests, the definition of UMPBTs involves no direct consideration of loss functions.
The objection to UMPBTs because they make it difficult to declare a scientific discovery is similarly misguided. Given an evidence threshold, the corresponding UMPBT is exactly the test that maximizes the probability that a discovery can be declared.
For z-tests, the exact relationship between the evidence threshold γ and significance level α is γ = exp(z_α²/2), where z_α denotes the upper α quantile of the standard normal distribution (2, 7).
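A few lines of code make the correspondence concrete (a sketch that simply evaluates the relation above at several significance levels):

```python
# Sketch: evidence thresholds gamma implied by common one-sided significance
# levels alpha under the z-test relation gamma = exp(z_alpha^2 / 2) (refs. 2, 7).
import numpy as np
from scipy.stats import norm

for alpha in [0.05, 0.005, 0.001]:
    z_alpha = norm.ppf(1 - alpha)
    gamma = np.exp(z_alpha**2 / 2)
    print(f"alpha = {alpha:<6} ->  gamma ~ {gamma:7.1f}")
```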
Because evidence against false null hypotheses can be accumulated exponentially fast (8–10), it is likely that any decision-theoretic analysis of optimal evidence thresholds would lead to thresholds substantially greater than 5. Thus, even if it were possible to routinely conduct the analyses suggested in ref. 1, higher thresholds than those proposed in ref. 2 would likely be adopted. I note that P value thresholds on the order of 3 × 10⁻⁷ (the 5σ rule) are now standard in particle physics.
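The exponential accumulation of evidence can be illustrated with a small simulation (a sketch for a simple-versus-simple normal test, with both hypotheses and the data-generating mean assumed for the example; the slope of the log Bayes factor is the Kullback–Leibler divergence, in the spirit of refs. 8–10):

```python
# Sketch: when the null is false, the log Bayes factor for a simple-vs-simple
# normal test grows roughly linearly in n (i.e., the Bayes factor grows
# exponentially), with slope equal to the Kullback-Leibler divergence.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, mu1, sigma = 0.0, 0.5, 1.0         # assumed hypotheses: H0: mu=0, H1: mu=0.5
x = rng.normal(mu1, sigma, size=2000)   # data generated under the alternative

log_bf = np.cumsum(norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma))
kl = (mu1 - mu0) ** 2 / (2 * sigma**2)  # KL(H1 || H0) = 0.125 per observation

print(f"log Bayes factor after n = 2000: {log_bf[-1]:.1f}")
print(f"expected value (n * KL):         {2000 * kl:.1f}")
```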
Finally, if one chooses to rewrite history so that Fisher had picked 0.005 instead of 0.05 as the evidential standard, then one might also ask whether ref. 2 would have been written. Noting that nearly all of the false discoveries identified in the references of ref. 2 were based on P values between 0.02 and 0.05, I suspect the answer to that question is no.
Footnotes
The author declares no conflict of interest.
References
- 1. Gelman A, Robert CP. Revised evidence for statistical standards. Proc Natl Acad Sci USA. 2014;111:E1933. doi: 10.1073/pnas.1322995111.
- 2. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci USA. 2013;110(48):19313–19317. doi: 10.1073/pnas.1313476110.
- 3. Gaudart J, Huiart L, Milligan PJ, Thiebaut R, Giorgi R. Reproducibility issues in science, is P value really the only answer? Proc Natl Acad Sci USA. 2014;111:E1934. doi: 10.1073/pnas.1323051111.
- 4. DiMasi JA, Feldman L, Seckler A, Wilson A. Trends in risks associated with new drug development: success rates for investigational drugs. Clin Pharmacol Ther. 2010;87(3):272–277. doi: 10.1038/clpt.2009.295.
- 5. Gan HK, You B, Pond GR, Chen EX. Assumptions of expected benefits in randomized phase III trials evaluating systemic treatments for cancer. J Natl Cancer Inst. 2012;104(8):590–598. doi: 10.1093/jnci/djs141.
- 6. Pericchi L, Pereira CAB, Pérez M-E. Adaptive revised standards for statistical evidence. Proc Natl Acad Sci USA. 2014;111:E1935. doi: 10.1073/pnas.1322191111.
- 7. Johnson VE. Uniformly most powerful Bayesian tests. Ann Stat. 2013;41(4):1716–1741. doi: 10.1214/13-AOS1123.
- 8. Bahadur RR. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, June 21–July 18, 1965 and December 27, 1965–January 7, 1966, eds Le Cam L, Neyman J (Cambridge Univ Press, London); 1967. pp 13–26.
- 9. Andrews DWK. The large sample correspondence between classical hypothesis tests and Bayesian posterior odds tests. Econometrica. 1992;62(5):1207–1232.
- 10. Johnson VE. On the use of non-local prior densities in Bayesian hypothesis tests. J R Stat Soc Series B Stat Methodol. 2010;72(2):143–170.
