ABSTRACT
The derivation and interpretation of P values obtained from inferential testing remain vague in the minds of some researchers, editors, reviewers, and readers. The British polymath Fisher famously averred: “the value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.” This sometimes leads to an almost reductio ad absurdum mindset in which studies with P > 0.05 are automatically discarded. It must be remembered that results may be negatively affected by myriad factors that may be beyond the researcher’s control, such as small sample sizes, small effects, bias, and random error. This paper briefly reviews the historical events leading to the acceptance of P ≤ 0.05 for statistical significance, the rationale behind the null hypothesis (H0), the meaning of P (and the potential for Type 1 and Type 2 errors), α, β, the possibility of using non-0.05 cut-offs when studies are “trending toward statistical significance,” and the importance of including confidence intervals (CIs) in results. P values are vital but must be tempered by judicious consideration of CIs and study design. P is a probability spectrum and not simply a binary significant/non-significant statistical metric.
Keywords: 95% confidence interval, biostatistics, P value
MeSH: Bibliometrics, biomedical research, humans, journal impact factor, periodicals as topic, publishing, statistics and numerical data
Historical introduction
The very foundation of statistical analysis rests on an almost existential epistemological debate as to which fundamental methodology to use: in the second quarter of the twentieth century, two competing models of statistical testing were developed, Fisher’s “significance testing” and Neyman–Pearson’s “hypothesis testing.”[1-4] Incidentally, Sir Ronald Aylmer Fisher (1890–1962) was a British polymath, “a genius who almost single-handedly created the foundations for modern statistical science.”[5]
Fisher argued that scientific experimental results should be sought without any prior expectations, i.e., a threshold to adopt or reject a hypothesis is established, and data are collected and analyzed to yield a result that is ultimately a probability, since datasets are typically based on samples and not on entire populations. If the results exceed the pre-established threshold, the presence of a significant difference is accepted. Statistical significance in this sense is therefore a measure of probability and does not actually prove anything, but only provides evidence for or against a hypothesis. The test typically involves one hypothesis and one variable, and classically, the outcome was binary: accept or reject the hypothesis. However, with modern computers performing statistical tests, the probability obtained and the confidence intervals (CIs) derived are more in keeping with the outcome of a probability test than they were in Fisher’s era.[1,2] Neyman and Pearson collaborated on a different approach, in which one hypothesis is selected from several on the basis of experimental evidence.[3,4] The pros and cons of these two methodologies were widely debated, and nowadays a hybrid of the two has been adopted.[6] This is in accordance with the philosopher Karl Popper’s contention that hypothesis generation precedes observation, experimentation, and statistical testing.[7]
Descriptive vs inferential statistics
The four pillars of good research are as follows: 1. The hypothesis or the research question, 2. The null hypothesis, which avers the opposite of the research question (see below), 3. Sample size calculation (via software or online calculators), and 4. Statistical analysis (likewise via software or online calculators). Statistical analysis is a crucial component of research and is classified as descriptive or inferential. Descriptive statistics describe the characteristics and properties of a dataset using summary statistics and graphs. This permits data visualization and insights about a dataset. Inferential statistics, on the other hand, draw conclusions about a larger population from a sample of it, thereby permitting direct comparison with other datasets. For this reason, results are given in the form of a probability (P). The P value has been used by researchers and clinicians since the 1950s, when it was first posited by Fisher to show the statistical significance of relationships between two groups for specified variables.[8,9]
The null hypothesis
Testing is done using the so-called null hypothesis (H0), which states a priori that there is no difference between two groups for a particular variable. H0 is conventionally accepted (more precisely, not rejected) if no statistically significant difference is found, and rejected if a statistically significant difference is found. H0 should be simple, specific, and pre-defined at the proposal level.[10]
A simple H0 contains one predictor and one outcome variable. A specific H0 contains no ambiguity about the subjects, variables, or statistical tests to be used. This keeps the research focused on the primary objective of the study and creates a stronger basis for interpreting the ensuing results than a hypothesis that emerges from post-hoc inspection of the data, a practice known as data dredging that is frowned upon.[10]
P value
P is the outcome of a statistical test: the probability, assuming that H0 is true, of obtaining a result at least as extreme as the one observed. It is used to decide whether to reject or fail to reject H0. In the former case, the test gives a value that implies that there is a difference between the groups, while the latter implies that the test did not find any difference between the two groups.[11] The smaller the P value, the stronger the evidence for rejecting H0.
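As an illustration of how a test statistic maps to P, the following minimal sketch converts a z statistic into a two-sided P value under a standard normal H0. The function name and the example z values are hypothetical and are not taken from the cited sources; a real analysis would choose the test appropriate to the data.

```python
from statistics import NormalDist


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a z statistic, assuming H0 gives z ~ N(0, 1)."""
    # Probability of a value at least as extreme as |z| in one tail ...
    one_tail = 1.0 - NormalDist().cdf(abs(z))
    # ... doubled, because deviations in either direction count as extreme.
    return 2.0 * one_tail


if __name__ == "__main__":
    # Illustrative z values only; larger |z| gives smaller P.
    for z in (0.5, 1.0, 1.96, 3.0):
        print(f"z = {z:4.2f} -> P = {two_sided_p_from_z(z):.4f}")
```

A z of 1.96 returns a P of approximately 0.05, which is the correspondence Fisher describes in the passage quoted in this paper.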
α level
Naturally, since the test outcome is a probability, a cut-off must be set to determine when to reject H0. This cut-off is called the α level: the accepted probability of rejecting H0 when it is true, thereby reporting a falsely positive result, i.e., statistical significance where there is none. This is also known as a Type 1 error. The level of α should be pre-defined before initiating a study, at the proposal stage. In biomedical research, α is typically set at 0.05, as Fisher argued thus: “The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”[12] This naturally implies that with an α of 0.05, a test will, by chance alone, incorrectly reject a true H0 on average once in every 20 times.[13]
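The "once in 20" interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes a one-sample z test with a known standard deviation of 1, and the number of trials, sample size, and seed are arbitrary choices. Every simulated study is generated with H0 true, so any rejection is a Type 1 error.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
# Two-sided critical value; for alpha = 0.05 this is about 1.96.
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)


def false_positive_rate(trials: int = 10_000, n: int = 25, seed: int = 1) -> float:
    """Fraction of simulated studies that reject H0 although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # H0 is true: every observation comes from N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic for the sample mean with known sigma = 1.
        z = (sum(sample) / n) * n ** 0.5
        if abs(z) > Z_CRIT:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print(f"critical z = {Z_CRIT:.3f}")
    print(f"observed Type 1 error rate = {false_positive_rate():.3f}")
```

The observed rejection rate should lie close to α, with some run-to-run variation that depends on the seed and the number of trials.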
β level
The probability of making a Type 2 error, a falsely negative result, i.e., accepting H0 when it is false, is called the β level. The quantity (1 - β) is called power and is defined as the probability of detecting an effect when one truly exists. β is conventionally set at 0.20 (i.e., a power of 0.80).[13]
Values of α and β are required in software that calculates sample size, the minimum number of samples necessary to meet the input statistical constraints. While ideally α and β are set at zero, this is typically only possible by using unrealistically large (and expensive) sample sizes, so investigators must compromise with acceptable levels of α and β as above.[13]
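To make the relationship between β and power concrete, the following sketch approximates the power of a two-sided comparison of two group means. It rests on assumptions not stated in the text: equal group sizes, a normal approximation, and a standardized effect size (the difference in means divided by the common standard deviation). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(n_per_group: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) of a two-sided two-group z test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is effect_size.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic falls beyond either critical value.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


if __name__ == "__main__":
    for n in (20, 40, 63, 100):
        p = power_two_groups(n, effect_size=0.5)
        print(f"n per group = {n:3d}: power = {p:.2f}, beta = {1 - p:.2f}")
```

Under these assumptions, power rises with sample size, and when the true effect is zero the "power" collapses to α, i.e., every rejection is a Type 1 error.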
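The kind of calculation such software performs can be sketched with the standard normal-approximation formula for comparing two means, n per group = 2 × ((z for 1 - α/2) + (z for 1 - β))² / d², where d is the standardized effect size. This is a simplified illustration rather than the method of any particular package; dedicated software typically applies a t-distribution correction and may return slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, beta: float = 0.20) -> int:
    """Minimum participants per group for a two-sided two-group comparison.

    Normal approximation; effect_size is the difference in means divided by
    the common standard deviation. alpha and beta must lie strictly between
    0 and 1 (inv_cdf is undefined at exactly 0 or 1).
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(1 - beta)        # about 0.84 for beta = 0.20
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"effect size {d}: {n_per_group(d)} per group")
    # Pushing alpha and beta toward zero inflates the required sample size.
    print("alpha = beta = 0.001:", n_per_group(0.5, alpha=0.001, beta=0.001))
```

With the conventional α = 0.05 and β = 0.20, a moderate standardized effect of 0.5 requires 63 participants per group under this approximation, and the requirement grows rapidly as α and β approach zero.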
Non-0.05 cut-offs
There may be situations wherein a higher α is desirable and pre-agreed, for example, when comparing a new and potentially harmful treatment against an extant known and safe treatment.[14] α may also be adjusted if an effect modification by a third variable is suspected.[14] α may also need to be more stringent to account for multiple comparisons within the same study.[15] However, in biomedical work, α level for primary analyses is almost invariably 0.05.
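The need for a more stringent α under multiple comparisons can be illustrated with the Bonferroni adjustment, which divides α by the number of comparisons. The sketch below is a minimal example; the familywise error calculation assumes that the comparisons are independent, which is a simplification, and the function names are hypothetical.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha after a Bonferroni adjustment."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Chance of at least one Type 1 error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    m = 10
    adjusted = bonferroni_alpha(0.05, m)
    print(f"unadjusted familywise error: {familywise_error_rate(0.05, m):.3f}")
    print(f"Bonferroni alpha per test:   {adjusted:.4f}")
    print(f"adjusted familywise error:   {familywise_error_rate(adjusted, m):.3f}")
```

With ten independent comparisons each tested at 0.05, the chance of at least one false positive is roughly 40%; testing each at 0.005 brings the familywise rate back below 0.05.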
Stringency
Convention tends to adhere to the almost mythical value of P at 0.05 for statistical significance, but the likelihood of reaching this value is influenced by several factors, including sample size; in the biomedical sciences, participant selection and the measures of treatment and outcome used also come into play. We should constantly remind ourselves that 0.05 is an arbitrary cut-off point and that the strength of evidence lies on a continuum. For example, P = 0.051 is nearly the same as P = 0.049, but a study resulting in P = 0.051 would traditionally be statistically disregarded while potentially being clinically important. For this reason, studies with P values just above the cut-off (e.g., P < 0.10) should not be disregarded, as they are “trending toward statistical significance” and may be clinically relevant, especially when the study is small.[14]
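The near-equivalence of P = 0.049 and P = 0.051 becomes visible when the corresponding 95% CIs are compared. The sketch below is purely illustrative: it assumes a normally distributed estimate and uses a hypothetical effect estimate of 10 units, back-calculating the standard error that would produce each P value.

```python
from statistics import NormalDist

ND = NormalDist()
Z_95 = ND.inv_cdf(0.975)  # about 1.96, multiplier for a 95% CI


def ci_for_p(estimate: float, p_value: float) -> tuple[float, float]:
    """95% CI implied by an estimate and its two-sided P value (normal model)."""
    # z statistic that corresponds to the reported two-sided P value.
    z = ND.inv_cdf(1 - p_value / 2)
    # Standard error back-calculated from z = estimate / SE.
    se = abs(estimate) / z
    return (estimate - Z_95 * se, estimate + Z_95 * se)


if __name__ == "__main__":
    # Hypothetical estimate of 10 units, chosen only for illustration.
    for p in (0.049, 0.051):
        low, high = ci_for_p(10.0, p)
        print(f"P = {p}: 95% CI {low:.2f} to {high:.2f}")
```

The two intervals are almost identical; one barely excludes zero and the other barely includes it, which is why the CI conveys the strength and precision of the evidence better than a binary verdict.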
Conclusion
P values are vital to interpret research but must be tempered by judicious consideration of CIs and study design. Crucially, P should be considered a spectrum and not simply a binary significant/non-significant statistical metric.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
- 1. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1925.
- 2. Fisher RA. The Design of Experiments. Edinburgh: Oliver and Boyd; 1935.
- 3. Neyman J, Pearson ES. IX. On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc London Ser A, Contain Pap a Math or Phys Character. 1933;231:289–337.
- 4. Neyman J, Pearson ES. Joint Statistical Papers. Univ of California Press; 1967.
- 5. Hald A. A History of Probability and Statistics and their Applications before 1750. John Wiley & Sons; 2005.
- 6. Louçã F. Should the widest cleft in statistics: How and why Fisher opposed Neyman and Pearson. Dept. of Economics, School of Economics and Management, Technical University of Lisbon; 2008.
- 7. Popper KR. Unended Quest: An Intellectual Autobiography. Glasgow: Collins; 1976.
- 8. Lang JM, Rothman KJ, Cann CI. That confounded P value. Epidemiology. 1998;9:7–8.
- 9. Fisher RA. Statistical Methods for Research Workers. 11th ed. Biological Monographs and Manuals, No. V; 1950.
- 10. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Getting ready to estimate sample size: Hypothesis and underlying principles. Des Clin Res. 2001;3:55–9.
- 11. Boos DD, Stefanski LA. P value precision and reproducibility. Am Stat. 2011;65:213–21. doi: 10.1198/tas.2011.10129.
- 12. Fisher RA. Statistical Methods for Research Workers. New York: Hafner; 1973. chap. 1, p. 44.
- 13. Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S. Hypothesis testing, type I and type II errors. Ind Psychiatry J. 2009;18:127–31. doi: 10.4103/0972-6748.62274.
- 14. Thiese MS, Ronna B, Ott U. P value interpretations and considerations. J Thorac Dis. 2016;8:E928–31. doi: 10.21037/jtd.2016.08.16.
- 15. Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Seeber; 1936. (Pubblicazioni del R. Istituto superiore di scienze economiche e commerciali di Firenze)
