1. INTRODUCTION
As readers of this journal are undoubtedly aware, publication bias, ie, the phenomenon whereby a study's (positive) outcome influences its likelihood of publication, remains a significant issue in the research literature. Results with P values less than the “magical” .05 level and/or confidence intervals (CIs) that do not contain the null value of interest (and are hence deemed statistically significant) are far more prevalent in the published research literature than are studies without such findings. Studies in the field of hypertension are no exception to this bias. A relatively small cadre of researchers argue that, for a variety of reasons, results that are not statistically significant also need to be published. However, equally important is trusting that findings deemed statistically significant are truly important and not artifacts of statistical chicanery related to study design or analysis. This intellectual honesty remains a fundamental underpinning of the scientific method as it is known and applied today. Unfortunately, as a result of pressure (real or perceived) to achieve statistically significant results to increase the chances of publication, a variety of forms of ignorance (or worse, deception) persist. In this article, we highlight one such issue about which we feel it is important to raise awareness among readers of this journal.
Joining the chorus of recommendations urging caution in the utilization and interpretation of P values,1, 2, 3, 4 we recently recommended the use of CIs in addition to P values.5 The ability of the researcher (or reader) to gauge the magnitude of, and variability associated with, the point estimate of the parameter under study remains a major added value of a CI and cannot be determined from a P value alone. However, for any such test statistic-CI pairing, the two concepts are constructed from the same theoretical foundation and hence share many properties, both strengths and weaknesses. One major weakness is that, holding everything else constant, a statistically significant result becomes more likely as the sample size increases, regardless of whether one is computing a test statistic with which to determine a P value or constructing a CI. This is an important issue: it allows a researcher to simply increase a study's size to gain (or, more cynically, “buy”) statistical significance. This editorial explains the origin of this issue, provides an illuminating example, and offers some suggestions as to how to avoid generating results whose significance may be attributable primarily to study size.
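A minimal sketch of this point, written in Python with SciPy and assuming a fixed standardized effect size of 0.29 (an illustrative value, roughly the standardized effect implied by the example in Section 3), shows the one-sided P value falling steadily as only the per-group sample size n increases:

```python
from scipy import stats

d = 0.29                              # assumed standardized mean difference (illustrative)
for n in (25, 50, 100, 200, 400):     # participants per treatment group
    t = d * (n / 2) ** 0.5            # two-sample t statistic: t = d / sqrt(2/n)
    p = stats.t.sf(t, df=2 * n - 2)   # one-sided P value
    print(f"n per group = {n:3d}   t = {t:4.2f}   one-sided P = {p:.4f}")
```

Nothing about the underlying effect changes from one row to the next; only n does, yet the P value eventually drops below any conventional threshold once n is large enough.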
2. BACKGROUND
In the 1920s and 1930s, Sir Ronald Fisher and the duo of Jerzy Neyman and Egon Pearson sought to develop a method of statistical inference that did not require specifying a prior probability that an idea was true. Such a prior-based approach had already been established by the Reverend Thomas Bayes, through his well-known Bayes' Theorem, in the mid-1700s.6 Fisher and Neyman-Pearson felt that the scientific meaning of such a probability was questionable. Thus, each attempted, with a different approach, to solve the same problem, ie, starting with a given hypothesis about a state of nature and determining what should be observed if the hypothesis were true. Fisher proposed an index he called a P value to measure the agreement between the observed data and the proposed hypothesis, while Neyman and Pearson proposed a method for choosing between null and alternative hypotheses, which they called a hypothesis test.7, 8 The term statistical testing is employed in this article to denote the standard methodology for testing statistical hypotheses that is in widespread use today, and to distinguish it from Fisher's and Neyman-Pearson's original methods.
Fisher's method, referred to as “significance testing,” stated only a single hypothesis, as opposed to the statement of both null and alternative hypotheses that is familiar today. His P value (as today, a probability ranging between zero and one) was thus a measure of the discrepancy between the data and the stated hypothesis: the smaller the value, the greater the strength of evidence against the stated hypothesis, and the larger the value, the weaker the evidence against it. Importantly, no formal inferential decision was initially part of this method. That is, the method did not use a chosen cutoff to judge whether the collected data were representative (or not) of the population from which they were drawn. However, it is worth noting that Fisher did eventually come to suggest ranges within which a P value might fall that would allow researchers to “suspect the hypothesis tested” or not.7 In contrast, Neyman and Pearson's method, referred to as “hypothesis testing,” stated both null and alternative hypotheses but permitted only one of three conclusions: reject the null hypothesis, accept the null hypothesis, or remain in doubt should the evidence not be conclusive.8 Fisher's P value was not part of this method. In the intervening years, these separately derived approaches have become intertwined, resulting in the familiar paradigm used ubiquitously throughout science today.9, 10, 11 Taking a measure of the strength of the evidence against the stated hypothesis (Fisher's P value) and assigning it to a “bin” described by one of Neyman and Pearson's three hypothesis test outcomes has turned two separate, valid statistical procedures into a specious combination.
3. EXAMPLE
We will expand on an example that we presented in our previous paper in this journal.5 Imagine a Phase III (therapeutic confirmatory) clinical trial testing a new antihypertensive drug against a placebo. The current, familiar statistical testing and CI construction approach is employed. A research question, a null hypothesis, and an alternative (research) hypothesis are created. The research question is: Does the test drug reduce systolic blood pressure (SBP) to a statistically significantly greater extent than the placebo? The research hypothesis is: The test drug reduces SBP to a statistically significantly greater extent than the placebo. The null hypothesis is: The test drug does not statistically significantly reduce mean SBP as compared with the placebo.
One hundred trial participants are randomized, 50 into each of the two treatment groups. SBP is measured at the beginning of the trial (baseline) and after 12 weeks of receiving one of the two treatments. The treatment effect, defined as the mean SBP change among participants in the drug treatment group minus the mean SBP change in the placebo treatment group, is calculated. On this occasion, we created a hypothetical but fully specified data set that permitted this calculation. The treatment effect is found to be 1.78 mm Hg. A formal statistical test is conducted. A test statistic (t=1.45) and corresponding one-sided P value (.0749) are computed, and a 95% CI for the difference is constructed (−0.65 to 4.21). Following the current statistical testing paradigm, at the one-sided 2.5% α level, the null hypothesis is not rejected in favor of the research hypothesis (a conclusion consistent with the use of the computed CI as a surrogate for the hypothesis test), and the following statement can then be made: On the basis of this single trial, there is a lack of evidence that the new drug statistically significantly reduces mean SBP as compared with the placebo.
Now suppose that the same trial was conducted but with 200 participants (100 randomized into each of the two treatment groups). Further suppose that the data from the second 50 participants in each treatment group were identical to the data from the first 50 participants (an incredibly unlikely occurrence in reality, but the point being made here still stands), and that everything else remained unchanged. The only change is a doubling of the sample size. Redoing the above statistical analysis, the test statistic would now be computed as t=2.06 and the corresponding P value would be .0202. Further, the 95% CI for the difference in mean SBP change between the two groups is now (0.08 to 3.48), a tighter interval that provides greater precision. This time, at the one-sided 2.5% α level, the null hypothesis is rejected in favor of the research hypothesis (again, a conclusion consistent with the use of the computed CI as a surrogate for the hypothesis test), and the following statement can then be made: On the basis of this single trial, there is evidence that the new drug statistically significantly reduces mean SBP as compared with the placebo. Doubling the size of the trial has allowed us to attain a statistically significant result, with the P value now <.025.
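The figures quoted in both scenarios can be reproduced from summary statistics alone. In the sketch below (Python with NumPy and SciPy), the common standard deviation of the SBP changes, roughly 6.14 mm Hg, is not reported in the text and is back-calculated from the stated t statistic, so it should be read as an assumption.

```python
import numpy as np
from scipy import stats

def summary_t_test(diff, sd, n):
    """Two-sample t test and 95% CI (equal group sizes, common SD) from summary statistics."""
    se = sd * np.sqrt(2.0 / n)              # standard error of the mean difference
    t = diff / se
    df = 2 * n - 2
    p = stats.t.sf(t, df)                   # one-sided P value
    half = stats.t.ppf(0.975, df) * se      # half-width of the 95% CI
    return t, p, (diff - half, diff + half)

diff, sd = 1.78, 6.14                       # mm Hg; SD back-calculated (assumed)

# 50 per group: roughly t = 1.45, one-sided P = .075, 95% CI (-0.66, 4.22)
print(summary_t_test(diff, sd, n=50))

# Each participant's data duplicated (100 per group): the sample SD shrinks only slightly
# because the divisor changes from n-1 to 2n-1, and roughly t = 2.06, P = .020,
# 95% CI (0.08, 3.48)
print(summary_t_test(diff, sd * np.sqrt(98 / 99), n=100))
```

The observed treatment effect and its variability are essentially unchanged; only the sample size differs, yet the conclusion under the combined paradigm flips.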
In both of these scenarios, the conclusions are not in alignment with either Fisher's or Neyman-Pearson's originally derived methods. Fisher would have taken the computed P value and considered it along its possible spectrum (zero to one) as a measure of the weight of evidence against the stated (null) hypothesis. In contrast, Neyman and Pearson would have determined in which of the three regions (the accept-the-null region, the reject-the-null region, or the remain-in-doubt region) the computed test statistic fell, and simply drawn the corresponding conclusion (accept the null hypothesis, reject the null hypothesis, or “not sure”). No associated probability would have been computed with their method. Instead, what is now done in the combined paradigm is that Fisher's P value (a descriptive statistic) corresponding to the computed test statistic is compared with the value (the α level) that Neyman and Pearson would have used to mark the borders of their regions. This convolution has left us with a situation in which significance can, in essence, be purchased.
4. IMPLICATIONS
Several considerations can facilitate the use of statistical testing (either through formal statistical tests or utilizing CIs as a surrogate, as we recently recommended5) in a manner that retains at least some validity and statistical cohesion. The first key consideration is to ensure researchers are aware of the independent derivations of Fisher's significance testing and Neyman‐Pearson's hypothesis testing. The manner in which the current, ubiquitously utilized “combined” test procedure is implemented and interpreted is not in alignment with either of the originally derived methods, which results in considerable misunderstanding, confusion, and even abuse.12, 13, 14, 15 We hope this paper helps in this regard.
Further, in conjunction with utilizing CIs (when possible) to gauge the magnitude of effect, we propose a return to the use of the P value along the lines of the manner in which Fisher initially proposed it, ie, as a continuous measure on a zero-to-one scale. Declaring a result statistically significant or not based on comparison with an arbitrarily chosen α level has already been discussed as nonsensical and as misaligned with both the Fisher and Neyman-Pearson approaches. Additional issues, such as multiple comparisons and primary and secondary end points across which to split or adjust α levels, only make things even more convoluted. By simply providing the computed P values, readers can independently gauge the weight of evidence for or against the null hypothesis, knowing the scale on which probabilities can occur (zero to one) and the number of tests being performed. This would align correctly with Fisher's significance testing approach. Further, because statistical significance would not be formally declared, inconsistencies between statistical and clinical significance would be eliminated.
An immediate concern with returning to Fisher's originally derived significance testing paradigm is how to judge whether a result supports or contradicts a null hypothesis in the absence of any formal declaration of statistical significance. The elimination of a significance threshold would force researchers to return to the long-known but often neglected ideas of replication of results and preponderance of evidence. No single study definitively proves a hypothesis. Only the replication of a result, across time and populations, provides evidence sufficient to declare an effect real with a high degree of confidence. The more confirmatory results there are, the stronger that conviction can be. In this age of pressure over time to market (or publication) and/or profit margins, these inconvenient truths are often disregarded or ignored.
While we certainly believe that more data are better, collecting additional information solely as a means to achieve statistical significance can lead to declaring differences important when, in reality, none exist. A return to a statistical testing paradigm more aligned with Fisher's significance testing method would reduce the incentive to “tack on” additional participants at the end of a study in hopes of crossing an arbitrary (and invalid) α threshold. Under the Fisherian paradigm, this reduced incentive to add unnecessary participants would be balanced by the desire to ensure that a study is not so small as to make detection of the specified difference of interest unrealistic. Instead of determining a study size optimized to yield a statistically significant result (one that may be attributable to sample size alone), incentives would exist to size a study that optimally balances the aforementioned objectives. This would allow increased confidence in the observed results.
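One conventional way to strike that balance, offered here only as a hedged illustration rather than something the text prescribes, is a power-based sample-size calculation using the normal approximation for a one-sided, two-sample comparison of means; the clinically relevant difference (5 mm Hg), standard deviation (10 mm Hg), α, and power below are all assumptions chosen for illustration.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.025, power=0.90):
    """Approximate participants per group needed to detect `delta` with the stated power."""
    z_alpha = norm.ppf(1 - alpha)      # one-sided significance level (assumed 2.5%)
    z_beta = norm.ppf(power)           # assumed 90% power
    return ceil(2 * ((sd * (z_alpha + z_beta)) / delta) ** 2)

# Illustrative inputs: detect a 5 mm Hg difference with SD 10 mm Hg -> about 85 per group
print(n_per_group(delta=5.0, sd=10.0))
```

Sizing a study this way targets the difference judged clinically meaningful, rather than whatever difference a sufficiently large sample could render statistically significant.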
Finally, Bayesian methodology has also been offered as a way to avoid the shortcomings of the currently entangled Fisher and Neyman-Pearson methods. However, reliance on the assignment of a prior probability to the truth of an idea has its own issues,16 and, ironically, avoiding such priors was the genesis of both the Fisher and Neyman-Pearson methods in the first place.
While no new methods or paradigms have been presented in this paper, the importance of being aware of the issues discussed and of the suggested solutions presented cannot be overstated. Adherence to these solutions would allow a better understanding of, and greater belief in, the P values presented when reporting results. Such adherence could be the genesis of a statistical, or even scientific, revolution, achieved simply by returning to the use of statistical testing as originally intended.
ACKNOWLEDGMENT
The authors wish to acknowledge Dr Matthew Hayat, Associate Professor of Epidemiology and Biostatistics, School of Public Health, Georgia State University, for his thoughtful review of an earlier draft of this manuscript.
REFERENCES
1. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2:e124.
2. Nuzzo R. Scientific method, statistical errors. P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. February 12, 2014. http://www.nature.com/news/scientific-method-statistical-errors-1.14700. Accessed February 16, 2017.
3. Woolston C. Psychology journal bans P values. February 26, 2015. http://www.nature.com/news/psychology-journal-bans-p-values-1.17001. Accessed February 16, 2017.
4. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70:129-133.
5. Jiroutek MJ, Turner JR. In praise of confidence intervals: much more informative than P values alone. J Clin Hypertens (Greenwich). 2016;18:955-957.
6. Stigler SM. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press; 1986.
7. Fisher R. Statistical Methods for Research Workers. 13th ed. New York, NY: Hafner; 1958.
8. Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Phil Trans R Soc Ser A. 1933;231:289-337.
9. Royall R. Statistical Evidence: A Likelihood Primer. Monographs on Statistics and Applied Probability #71. London, England: Chapman and Hall; 1997.
10. Goodman SN. P-values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993;137:485-496.
11. Gigerenzer G, Swijtink Z, Porter T, et al. The Empire of Chance. Cambridge, England: Cambridge University Press; 1989.
12. Browner W, Newman T. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. JAMA. 1987;257:2459-2463.
13. Diamond GA, Forrester JS. Clinical trials and statistical verdicts: probable grounds for appeal. Ann Intern Med. 1983;98:385-394.
14. Lilford RJ, Braunholtz D. For debate: the statistical basis of public policy: a paradigm shift is overdue. BMJ. 1996;313:603-607.
15. Freeman PR. The role of p-values in analysing trial results. Stat Med. 1993;12:1443-1452.
16. Goodman SN. Towards evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med. 1999;130:995-1004.
