Without certainty science is nothing more than seemingly sophisticated guesswork.
Sir Francis Bacon
Statistical analyses, and consequently statistical inferences, are increasingly important components (but not the only components) of inferential processes. For example, consider a transplant study designed to determine whether a new drug decreases the likelihood^a of developing acute graft-versus-host disease (GvHD) compared with a placebo. You do the study, collect the results and perform a statistical test, typically a test of a statistical model, often the null hypothesis. Colleagues and reviewers expect you to generate a p-value from these analyses. Usually, statistical significance in this context is defined as a pre-set p-value <0.05, so a p-value of 0.055 is considered not statistically significant. Does a p-value of 0.055 mean: (A) the new drug was ineffective? (B) The results can be accounted for by chance? (C) The null hypothesis is true? (D) All of the above? The correct answer is (E), none of the above. However, try to publish this study in BONE MARROW TRANSPLANTATION, and the Editors, Hillard Lazarus and Mohamad Mohty, are likely to send you a rapid rejection e-mail.
Although most scientific researchers think they know what the p-value is and what it implies for correct interpretation of their data, this is often not so. When we informally sampled a cohort of scientific researchers, including junior and senior scientists and clinicians, most told us the p-value tests whether the null hypothesis is true or whether a factor is significantly associated with an outcome, notions we will see are wrong.
Interestingly, statisticians are no more certain what a p-value is than are scientific researchers. To understand why, we need to consider the complex history of the p-value. R.A. Fisher introduced the p-value into scientific research as a measure of statistical inference [1]. He defined it as the probability of the observed result, plus more extreme results, if the null hypothesis were true. Fisher suggested the p-value could be used as a measure of statistical inference, a component, but not the only component, of the more complex process of causal inference. Several assumptions underlie correct use of Fisher's p-value. Unfortunately, many of these assumptions, such as no misclassification or confounding, are difficult to satisfy in complex clinical trials such as those in bone marrow transplantation. Consider the trial we mentioned of a new drug to prevent acute GvHD. Analyses of a survival endpoint have to consider confounding between reducing the incidence of acute GvHD at the expense of increasing leukaemia relapse. Researchers should study, compare and report cumulative incidence rates of developing acute GvHD, relapse without acute GvHD, and death without acute GvHD or relapse simultaneously because these outcomes are not mutually independent.
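To make Fisher's definition concrete, here is a minimal sketch in Python (using scipy, with purely illustrative GvHD counts that are not from any real trial). It computes the probability of observing a 2x2 table as extreme as, or more extreme than, the one observed, under the null model of no association between treatment arm and acute GvHD:

```python
# Minimal sketch of Fisher's p-value definition; all counts are hypothetical.
from scipy.stats import fisher_exact

# Rows: treatment arm; columns: [developed acute GvHD, did not develop acute GvHD]
table = [[12, 38],   # new drug: 12 of 50 patients with acute GvHD
         [22, 28]]   # placebo:  22 of 50 patients with acute GvHD

# Under the null model of no association, the p-value is the probability of a
# table as extreme as, or more extreme than, the observed one.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```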
Neyman and Pearson came next in the p-value's history. In contrast to Fisher, they postulated a formal alternative hypothesis, mutually exclusive with the null hypothesis, and a preselected p-value threshold for rejecting the null hypothesis. This subtle but important difference involves decision-making: the Neyman-Pearson approach results in a decision regarding causal inference, whereas the Fisher approach does not. It is important to realize the Fisher and Neyman-Pearson approaches are frequentist^b, ignoring a third approach of Bayesian inductive reasoning (see below).
The reader may wonder why we are discussing such a seemingly simple-minded question of what the p-value really means at this late hour. However, we are not alone in our concern regarding widespread misunderstanding of what a p-value is and what it means. Recently, the American Statistical Association published a report on the p-value: what it is, what it means and how p-values should be interpreted [2]. To be clear, this is not a consensus statement; there was considerable disagreement amongst expert statisticians on this question, so readers need not feel perplexed if they are confused. The Association panel report pointed out the p-value is commonly misused and/or misinterpreted. The report defines a p-value as the probability, under a specified statistical model, that a statistical summary of the data would be equal to or more extreme than the observed value(s). We emphasize under a specified statistical model. When we calculate a p-value we are not testing whether the difference between groups or cohorts occurred by chance but rather the consistency of the data with a proposed statistical model. As we discussed, often the statistical model being tested in clinical trials is the null hypothesis. Consequently, when we consider the p-value we need to understand it addresses neither whether the null hypothesis is true nor whether the statistical analysis of the results can be accounted for by chance (common misconceptions in our survey).
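The same idea can be illustrated by simulation. The sketch below (reusing the same hypothetical per-patient counts as above, and assuming numpy) takes the specified null model literally as "arm labels are exchangeable" and counts how often a between-arm difference as extreme as the observed one arises when the labels are shuffled:

```python
# Permutation sketch of "probability, under a specified statistical model, of a
# summary as extreme or more extreme than observed"; data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-patient indicators of acute GvHD (1 = yes), 50 per arm
drug = np.array([1] * 12 + [0] * 38)
placebo = np.array([1] * 22 + [0] * 28)
observed_diff = placebo.mean() - drug.mean()

# Null model: arm labels are exchangeable, so shuffle them many times
pooled = np.concatenate([drug, placebo])
more_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[50:].mean() - pooled[:50].mean()
    if abs(diff) >= abs(observed_diff):
        more_extreme += 1

print(f"permutation p-value ≈ {more_extreme / n_perm:.3f}")
```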
In clinical trials it is also important to consider that the p-value does not reflect effect size. For example, a 5 percent decrease in the incidence of acute GvHD could be associated with a p-value <0.05 when the sample size is very large, whereas a true 50 percent decrease might be associated with a p-value >0.05 when the sample size is small. Estimated clinically important effects, with confidence intervals/bands and p-values, should be transparently reported. Some biomedical studies cherry-pick results with p-values <0.05 based on multiple subgroup analyses, disregarding the small sample sizes in these subgroups. Researchers should report all statistical analyses done and all hypotheses tested so the reader can account for false discovery rates, which arise when multiple comparisons are done.
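A rough numerical illustration of the first point, with entirely hypothetical trial sizes and event counts (statsmodels assumed), is sketched below; the small absolute effect in the large trial produces the smaller p-value:

```python
# Hypothetical illustration that the p-value tracks sample size, not effect size.
from statsmodels.stats.proportion import proportions_ztest

# Large trial: 35% vs 40% acute GvHD incidence (5-point absolute decrease), n = 2000 per arm
count_large = [700, 800]            # events in drug vs placebo arms
nobs_large = [2000, 2000]
_, p_large = proportions_ztest(count_large, nobs_large)

# Small trial: 20% vs 40% incidence (a halving of risk), n = 20 per arm
count_small = [4, 8]
nobs_small = [20, 20]
_, p_small = proportions_ztest(count_small, nobs_small)

print(f"large trial, small effect: p = {p_large:.4f}")   # typically < 0.05
print(f"small trial, large effect: p = {p_small:.4f}")   # typically > 0.05
```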
The Association panel makes another important point for readers of BONE MARROW TRANSPLANTATION, namely that it is inadvisable to focus on an arbitrary point such as p<0.05 to claim statistical inference. There are two issues here. First, considering p=0.05 as a cut-off for deciding on statistical significance is arbitrary and without a sensible mathematical underpinning. Second, other factors need consideration in deciding whether an outcome is statistically significant, including study design, measurement accuracy, evidence external to the study and validity of assumptions underlying the data analyses. For example, a survival endpoint will usually be more valid than a leukaemia-relapse endpoint. To quote the report: "[The] widespread use of statistical significance (generally accepted as p<0.05) as a license for making a claim of scientific finding (or implied truth) leads to a considerable distortion of the scientific process." This should be the take-home message from our typescript.
Up to this point we have discussed considerations in the realm of frequentist statistics. Although a discussion of using Bayesian inductive reasoning with a spectrum of probabilities (such as credibility limits) to express causal inference is beyond the scope of our discussion, this approach is increasingly considered, especially when there is uncertainty in the accuracy of measurements (such as who really has acute GvHD versus a virus infection or a drug-induced rash). A recent review by Kyriacou [3] discusses the use and limitations of a Bayesian induction approach. Scientific inferences based on frequentist and Bayesian methods are not mutually exclusive and are often complementary, and we urge readers to consider both.
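As a flavour of the Bayesian alternative, the sketch below (illustrative counts, flat Beta(1, 1) priors, scipy assumed) summarizes uncertainty about each arm's acute GvHD rate with a posterior credible interval rather than a single p-value:

```python
# Minimal Bayesian sketch with hypothetical counts and flat Beta(1, 1) priors.
from scipy.stats import beta

events_drug, n_drug = 12, 50      # illustrative counts, new drug arm
events_plac, n_plac = 22, 50      # illustrative counts, placebo arm

# With a Beta(1, 1) prior, the posterior for an event rate is
# Beta(events + 1, non-events + 1).
post_drug = beta(events_drug + 1, n_drug - events_drug + 1)
post_plac = beta(events_plac + 1, n_plac - events_plac + 1)

for label, post in [("drug", post_drug), ("placebo", post_plac)]:
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"{label}: 95% credible interval for acute GvHD rate = ({lo:.2f}, {hi:.2f})")
```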
Another issue is that researchers often conduct multiple analyses of their data but may present only analyses with a statistically significant p-value. This does not allow the reader to fairly evaluate the validity of the researchers' claims and conclusions or to consider potential biases. This p-hacking is unfair, inappropriate and should be avoided.^c The bottom line is the p-value in isolation cannot be relied on to determine whether the null hypothesis is correct or not. There are several other important considerations regarding the p-value not covered in the Association's report, and we refer interested readers to other reviews [3, 4].
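One practical safeguard against such selective reporting is to report every test performed and adjust for multiplicity. A brief sketch (illustrative p-values, statsmodels assumed) of the Benjamini-Hochberg false-discovery-rate adjustment is shown below:

```python
# Sketch of a false-discovery-rate adjustment for multiple subgroup analyses;
# the input p-values are hypothetical.
from statsmodels.stats.multitest import multipletests

raw_p = [0.003, 0.021, 0.049, 0.12, 0.31, 0.47]   # six hypothetical subgroup tests

reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for p, q, r in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f}  ->  FDR-adjusted p = {q:.3f}  significant: {r}")
```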
The Editors tell us they plan no immediate change in the statistical review process for BONE MARROW TRANSPLANTATION. However, it is important that researchers submitting typescripts follow best statistical practices and acknowledge in their analyses and discussions the limitations of the p-value in establishing causal inference. For transplant scientists and physicians seeking more details, there will be a session on p-values and their correct interpretation sponsored by the Center for International Blood and Marrow Transplant Research (CIBMTR) at the next Tandem Meetings.
Acknowledgments
RPG acknowledges support from the National Institute for Health Research (NIHR) Biomedical Research Centre funding scheme. Parts of this typescript were published in Leukemia _______. Prof. Hillard Lazarus kindly reviewed the typescript.
Footnotes
a. In statistics there is a distinction between likelihood and probability. The number that is the probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the observed outcomes.
b. Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency (or proportion) of the data.
c. This data-dredging or fishing expedition is not unlike multiple non-pre-specified subgroup analyses, which should be considered hypothesis-generating, require confirmation and require statistical adjustment for multiple comparisons.
Conflict of Interest: None
References
1. Fisher RA. Statistical Methods for Research Workers. Edinburgh, United Kingdom: Oliver & Boyd; 1925.
2. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process and purpose. The American Statistician; 2016.
3. Kyriacou DN. The enduring evolution of the p value. JAMA. 2016;315:1113–1115. doi: 10.1001/jama.2016.2152.
4. Vickers A. What is a p-value anyway? Boston: Addison Wesley Longman; 2009. p. 212.