Editor—In their critique of P values Sterne and Davey Smith omit two crucial reasons why P values do not adequately reflect evidence.1
Firstly, their statement (borrowed from Fisher) that “P values measure the strength of the evidence against the null hypothesis” does not stand up to scrutiny. A small P value means that what we observe is possible but not very likely under the null hypothesis. But then life is made up of unlikely events. P values cannot deliver evidence against a hypothesis, no matter how low the cut-off point for saying that a result is significant. Short of P=0, there is no such thing as evidence against a hypothesis.
Secondly, if evidence is what the data say then P values fail to qualify. P values are based on factors other than the observed data, notably on results “more extreme than these.” The P value is literally the sum of probabilities of events that might have happened but did not. Furthermore, to compute a P value you must know what distribution to apply to those unobserved results.
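To see how much of a P value rests on outcomes that never occurred, consider a minimal sketch (in Python with scipy; the counts are purely hypothetical): a one sided P value is literally a sum of null probabilities over the observed count and every more extreme count.

```python
from scipy.stats import binom

# Purely hypothetical data: 8 of 10 matched pairs favour the treatment.
n, k = 10, 8
p_null = 0.5  # null hypothesis: either member of a pair is equally likely to fare better

# The one sided P value adds up the null probabilities of the observed count
# and of every more extreme count (9 and 10 successes), outcomes that did not occur.
p_value = sum(binom.pmf(x, n, p_null) for x in range(k, n + 1))
print(round(p_value, 4))  # approximately 0.0547
```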
Imagine a trial of vitamin C versus placebo in matched pairs of patients with the common cold; the number of pairs in which the patient taking vitamin C fares better is the outcome of interest. If the total number of pairs was fixed in advance, the P value is computed with the binomial distribution; if instead the design was to keep recruiting until the less successful treatment had won a predetermined number of pairs, the negative binomial distribution applies.2 The same trial result could therefore lead to the null hypothesis being rejected or accepted depending on what you were told about the study design—that is, not on the data alone. Other extraneous considerations that influence P values include the decision to use a one sided or a two sided test, and any adjustments made for multiple comparisons.
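The point can be sketched with hypothetical numbers: suppose 14 of 20 pairs favour vitamin C. With the total of 20 pairs fixed in advance the binomial distribution applies; with a design that stopped once placebo had won a predetermined 6 pairs the negative binomial applies. Neither the trial nor the figures below come from any real study; they merely illustrate, for a one sided test at the conventional 0.05 level, how the verdict can flip.

```python
from scipy.stats import binom, nbinom

k = 14  # pairs in which vitamin C fared better (hypothetical)
r = 6   # pairs in which placebo fared better (hypothetical)

# Design 1: the total number of pairs (k + r = 20) was fixed in advance.
p_fixed_n = binom.sf(k - 1, k + r, 0.5)   # P(at least 14 vitamin C wins in 20 pairs)

# Design 2: recruitment continued until placebo had won r = 6 pairs.
# scipy's nbinom counts the "failures" (vitamin C wins) before the r-th "success".
p_stop_at_r = nbinom.sf(k - 1, r, 0.5)    # P(at least 14 vitamin C wins before the 6th placebo win)

print(round(p_fixed_n, 3))    # about 0.058: not significant at the 0.05 level
print(round(p_stop_at_r, 3))  # about 0.032: significant at the 0.05 level
```

The observed data are identical, 14 wins against 6; only the probabilities assigned to unobserved outcomes differ between the two designs.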
Several statisticians have proposed a solution: data cannot determine the absolute worth of one hypothesis taken in isolation, but they can provide evidence about the relative merit of two hypotheses specified a priori.3–5 The data support hypothesis A over hypothesis B if the likelihood of the data is greater under A than under B; the strength of the evidence favouring A over B is the likelihood ratio. Not only is this approach compatible with logic, but it considers only the observed data.
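For the same hypothetical 14 wins in 20 pairs, a likelihood ratio might be computed as follows; the alternative hypothesis of a 75% success rate is chosen purely for illustration.

```python
from scipy.stats import binom

k, n = 14, 20  # hypothetical: 14 of 20 pairs favour vitamin C

likelihood_A = binom.pmf(k, n, 0.75)  # hypothesis A: vitamin C fares better in 75% of pairs
likelihood_B = binom.pmf(k, n, 0.50)  # hypothesis B (the null): no difference within a pair

print(round(likelihood_A / likelihood_B, 1))  # about 4.6: the data favour A over B
```

Because the combinatorial constant cancels in the ratio, the answer is the same whichever stopping rule generated the data; the evidence depends only on the 14 wins and 6 losses actually observed.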
References
- 1. Sterne JAC, Davey Smith G. Sifting the evidence—what's wrong with significance tests? BMJ. 2001;322:226–231. (27 January.) doi:10.1136/bmj.322.7280.226
- 2. Berger JO, Berry DA. Statistical analysis and the illusion of objectivity. Am Scientist. 1988;76:159–165.
- 3. Birnbaum A. On the foundations of statistical inference [with discussion]. J Am Stat Assoc. 1962;57:269–326.
- 4. Hacking I. Logic of statistical inference. New York: Cambridge University Press; 1965.
- 5. Royall RM. Statistical evidence: a likelihood paradigm. London: Chapman and Hall; 1997.
