BMJ. 2001 May 12;322(7295):1184.

Sifting the evidence

Likelihood ratios are alternatives to P values

Thomas V Perneger 1

Editor—In their critique of P values Sterne and Davey Smith omit two crucial reasons why P values do not adequately reflect evidence.1

Firstly, their statement (borrowed from Fisher) that “P values measure the strength of the evidence against the null hypothesis” does not stand up to scrutiny. A small P value means that what we observe is possible but not very likely under the null hypothesis. But then life is made up of unlikely events. P values cannot deliver evidence against a hypothesis, no matter how low the cut-off point for saying that a result is significant. Short of P=0, there is no such thing as evidence against a hypothesis.

Secondly, if evidence is what the data say, then P values fail to qualify. P values are based on factors other than the observed data, notably on results “more extreme than these.” The P value is literally the sum of probabilities of events that might have happened but did not. Furthermore, to compute a P value you must know what distribution to apply to those unobserved results.

Imagine a trial of vitamin C versus placebo in matched pairs of patients with the common cold; the number of pairs in which the patient taking vitamin C fares better is the outcome of interest. If the total number of observations was predetermined, the P value is computed with the binomial distribution; if it was the smallest number of successes per group, the negative binomial distribution applies.2 The same trial result could lead to the null hypothesis being rejected or accepted depending on what you were told about the study design—that is, not on data alone. Other extraneous considerations that influence P values include the decision to use a one sided or a two sided test, and any adjustments made for multiple comparisons.
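
The following minimal sketch (in Python, assuming the scipy library) makes this concrete. The counts are purely illustrative and do not come from the letter: suppose 9 of 12 matched pairs favour vitamin C, and the one sided alternative is that vitamin C is better.

    # Illustrative counts (an assumption for this sketch, not from the letter):
    # 9 of 12 matched pairs favour vitamin C; under the null hypothesis each
    # pair favours vitamin C with probability 0.5.
    from scipy.stats import binom, nbinom

    k, n = 9, 12      # pairs favouring vitamin C, total pairs observed
    r = n - k         # pairs favouring placebo

    # Design 1: total number of pairs fixed in advance -> binomial tail area.
    p_fixed_n = binom.sf(k - 1, n, 0.5)       # P(X >= 9 | n = 12, p = 0.5)

    # Design 2: recruit until 3 pairs favour placebo -> negative binomial tail.
    # scipy's nbinom counts events before the r-th stopping event, so here the
    # pairs favouring placebo play the role of the awaited stopping events.
    p_stop_rule = nbinom.sf(k - 1, r, 0.5)    # P(X >= 9 before the 3rd stop)

    print(f"fixed sample size design:   P = {p_fixed_n:.3f}")    # ~0.073
    print(f"stop-after-3-failures rule: P = {p_stop_rule:.3f}")  # ~0.033

With these illustrative numbers the fixed sample size design gives a P value of about 0.07 and the stopping rule design about 0.03, so the same observed data sit on opposite sides of the conventional 0.05 threshold.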

Several statisticians have proposed a solution: data cannot determine the absolute worth of one hypothesis taken in isolation but can provide evidence about the relative merit of two hypotheses specified a priori.3-5 The data support hypothesis A over hypothesis B if the likelihood of the data is greater under A than B; the strength of evidence favouring A over B is the likelihood ratio. Not only is this approach compatible with logic but it considers only the observed data.
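
A short sketch of the likelihood ratio calculation, reusing the illustrative 9-of-12 result above and two hypotheses chosen purely for illustration (p = 0.5, no effect, versus p = 0.75, vitamin C better in three quarters of pairs):

    # Same illustrative data as above; the two hypotheses are assumptions made
    # for this sketch, not figures from the letter.
    from scipy.stats import binom

    k, n = 9, 12
    lik_A = binom.pmf(k, n, 0.75)   # likelihood of the observed data under A: p = 0.75
    lik_B = binom.pmf(k, n, 0.50)   # likelihood of the observed data under B: p = 0.50

    print(f"likelihood ratio A vs B: {lik_A / lik_B:.2f}")   # ~4.8

The ratio of roughly 4.8 favours p = 0.75 over p = 0.5, and, unlike the P value, it does not depend on the stopping rule: the binomial and negative binomial likelihoods differ only by a constant factor that cancels in the ratio.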

References

  • 1. Sterne JAC, Davey Smith G. Sifting the evidence—what's wrong with significance tests? BMJ 2001;322:226–231. doi: 10.1136/bmj.322.7280.226. (27 January.)
  • 2. Berger JO, Berry DA. Statistical analysis and the illusion of objectivity. Am Scientist 1988;76:159–165.
  • 3. Birnbaum A. On the foundations of statistical inference [with discussion]. J Am Stat Assoc 1962;57:269–306.
  • 4. Hacking I. Logic of statistical inference. New York: Cambridge University Press; 1965.
  • 5. Royall RM. Statistical evidence: a likelihood paradigm. London: Chapman and Hall; 1997.
BMJ. 2001 May 12;322(7295):1184.

Statistics must not be confused with science

Jonathan Rees 1

Editor—I agree with much of what Sterne and Davey Smith say,1-1 but the problems should be viewed more combatively. The reason why the editor of the BMJ thinks that doctors are deficient in statistics is that he confuses statistics with science.1-2 This is why the BMJ is of more use to NHS managers than to researchers.

The confusion between a statistical hypothesis and a scientific one is widespread. A scientific theory does not have a distribution in the sense of probability theory. Newtonian physics is either right or wrong; there is not an infinite array of newtonian theories merging with those of Einstein. The two theories are qualitatively different, each ordering reality in a discontinuous way. They are not summaries of reality, nor do they have errors in a statistical sense. You don't do a systematic review of the Ptolemists and Copernicus and then do a Cochrane plot. The planets either move in a certain way or they don't.

This view of a scientific theory as something that brings coherence to nature, as a revealer of “hidden likenesses,” has little to do with probability theory. The philosopher David Hume described the fatal weakness of inductivism: in a clinical context, the idea that you can take, say, 10 000 people with a stroke and then randomise them to either of two treatments and expect to get sense at the end is naive.

The argument about the importance of statistical power is minor. Sadly, the erroneous belief that small studies are unethical is now institutionalised by ethics committees. Should you study one hypothesis on 100 patients or test more hypotheses with smaller sample sizes? How to choose? Well, certainly not by performing power calculations. P values are not markers of truth. If they were we would have invented an epistemological engine. We haven't, nor can we, for reasons that Popper laid out formally but really are obvious. Or do we imagine that those busy systematic souls who go around adding up other people's P values will now do systematic reviews of quantum mechanics, linguistics, etc, and reveal the structure of nature? No, of course not.

The idea that averaging the data is a way to understand nature would be laughable if it didn't do so much harm to genuine clinical discovery.

References

  • 1-1. Sterne JAC, Davey Smith G. Sifting the evidence—what's wrong with significance tests? BMJ 2001;322:226–231. doi: 10.1136/bmj.322.7280.226. (27 January.)
  • 1-2. Editor's choice. Some gentle statistics. BMJ 2001;322(7280). (27 January.)
BMJ. 2001 May 12;322(7295):1184.

Perfect understanding seldom happens

Michael Apple 1

Editor—The 27 January issue of the BMJ is particularly thought provoking (see, for a start, Editor's choice2-1). Sterne and Davey Smith illustrate the fallibility of tests of significance.2-2 Socially responsible people are shocked that the government promotes a dogma and then finds evidence to back it.2-3 It seems that a guideline may be considered good, bad, irrelevant, wrong, premature, or tardy depending on who is speaking.2-4

Colleagues: let us mistrust everything. Whereas once I thought this a cynical cop-out, I now realise that it is an intellectually respectable stance, meeting Sterne and Davey Smith's explanation of a bayesian position on statistical truth. If I understand them correctly, this means: “This is what I think I know. Now let's see if you can shake my view.”

Given that the validity and statistical robustness of evidence are so fragile, what else are practitioners to do? Certainly we should not put our trust in consensus statements, such as those emanating from the National Institute for Clinical Excellence. Although these statements may be an improvement on “tendentious opinions selectively heralded” (TOSH), they may prove to be nothing more than “current right advice . . . probably” (CRAP).

Such edicts are relics of the time when it was a defence against complaint to appeal to an agreed body of professional opinion. With evidence as contentious as that highlighted by articles in the journal, the test of “best current opinion” becomes illusory. There will always be another expert opinion a standard deviation away. Ironically, as evidence becomes devalued and more relativistic, arriving at a considered judgment becomes more important. Such judgment used to be called professionalism; it predated guidelines until it was undermined by the joint efforts of the General Medical Council and the Department of Health.

In my view, a new institute is needed to support those of us hoping to preserve professional medical practice, in which we try to do our best for our patients, taking into account their individual circumstances and wishes while drawing on our experience of practising medicine. The institute will not deny the need for research or the importance of the P value. It will embrace controlled trials yet not dismiss n=1 studies. It will accept guidelines, but only as aide-memoires, not as gospel. Everyone else has an institute with a heartwarming title, so I propose that this one be called the institute for “perfect understanding seldom happens; opinion flirts with facts.” PUSH OFF will do nicely as an acronym.

References

  • 2-1. Editor's choice. Some gentle statistics. BMJ 2001;322(7280). (27 January.)
  • 2-2. Sterne JAC, Davey Smith G. Sifting the evidence—what's wrong with significance tests? BMJ 2001;322:226–231. doi: 10.1136/bmj.322.7280.226. (27 January.)
  • 2-3. Macintyre S, Chalmers I, Horton R, Smith R. Using evidence to inform health policy. BMJ 2001;322:222–225. doi: 10.1136/bmj.322.7280.222. (27 January.)
  • 2-4. Lux AL, Edwards SW, Osborne JP, Hancock E, Johnson AL, Kennedy CR, et al. Revised guideline for prescribing vigabatrin in children. BMJ 2001;322:236–237. (27 January.)
