Published in final edited form as: Otolaryngol Head Neck Surg. 2016 Nov 14;156(6):978–980. doi: 10.1177/0194599816677735

The P Value Problem in Otolaryngology: Shifting to Effect Sizes and Confidence Intervals

Peter M Vila 1, Melanie Elizabeth Townsend 1, Neel K Bhatt 1, W Katherine Kao 1, Parul Sinha 1, J Gail Neely 1

Abstract

Effect sizes and confidence intervals are underreported in the current biomedical literature. The objective of this paper is to discuss the recent paradigm shift encouraging the reporting of effect sizes and confidence intervals. Whereas p-values tell us only whether an observed effect is unlikely to be due to chance, effect sizes tell us about the magnitude of the effect (clinical significance), and confidence intervals tell us about the range of plausible estimates of the population mean (precision). Reporting effect sizes and confidence intervals is a necessary addition to the biomedical literature, and these concepts are reviewed in this manuscript.

Keywords: Reproducibility of results, Data interpretation, Statistics, Confidence intervals, Research design

INTRODUCTION

The use of p-values as a surrogate for clinical importance in the biomedical literature is an ongoing problem.1,2 Some have gone so far as to call for a complete overhaul of the current paradigm of formulating a null hypothesis in order to accept or reject it.3 A p-value can show whether an effect exists that is not likely due to chance, but it does not reveal the magnitude of the effect. When faced with the unfortunate reality that a majority of published scientific results cannot be replicated,4 it becomes clear that there is a problem with current reporting methodology.

If one is to accurately report scientific findings, it is important not to confuse statistical significance with a meaningful or clinically significant difference. With a very large sample size, a clinically meaningless difference may be reported as a “statistically significant difference” through sole reliance on a p-value. Furthermore, calculating a p-value without a prior power calculation to determine the appropriate sample size may yield spurious conclusions, as a difference will eventually reach statistical significance if the experiment is repeated enough times.5 Effect sizes and confidence intervals allow one to describe the reliability, magnitude, direction, and precision of results. Simply put, the result of the research (e.g., the difference between groups, the baseline risk) is the effect. The magnitude of that effect is the effect size. Confidence intervals make use of the inherent variability in study results to give a measure of how precise the estimate is. In other words, if the study were repeated many times, 95% of the resulting intervals would contain the true population mean, which is what makes generalization from the sample plausible.6
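
To make these distinctions concrete, the following minimal Python sketch (our own illustration; the group data, sample sizes, and the use of NumPy and SciPy are assumptions, not drawn from any study) reports the effect as a mean difference, its precision as a 95% confidence interval, and the p-value alongside them.

    import numpy as np
    from scipy import stats

    # Hypothetical example (illustrative values only): symptom-score
    # improvement in a treatment group and a control group.
    rng = np.random.default_rng(42)
    treatment = rng.normal(loc=12.0, scale=5.0, size=40)
    control = rng.normal(loc=9.0, scale=5.0, size=40)

    # The "effect" is the result itself: here, the mean difference between groups.
    mean_diff = treatment.mean() - control.mean()

    # A 95% confidence interval for that difference describes its precision:
    # the range of plausible values for the true (population) difference.
    n1, n2 = len(treatment), len(control)
    sp2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    se_diff = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(0.975, n1 + n2 - 2)
    ci_low, ci_high = mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff

    # The p-value, in contrast, only addresses whether the difference is
    # plausibly due to chance; it says nothing about its magnitude.
    t_stat, p_value = stats.ttest_ind(treatment, control)

    print(f"Mean difference (effect): {mean_diff:.2f}")
    print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f})")
    print(f"p-value: {p_value:.4f}")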

Integrating effect sizes and confidence intervals into routine practice has proven challenging.1,2 Perhaps because of habit, training, or the efficiency of a quick glance at a single number (the p-value), researchers have been slow to abandon reliance on p-values. Our objective is to convey to the reader the disadvantages of relying solely on the p-value and to encourage instead the use of confidence intervals and effect sizes when reporting results.

PROBLEMS WITH NULL HYPOTHESIS SIGNIFICANCE TESTING

The gold standard for reporting results of scientific experiments is null hypothesis significance testing (NHST), which assesses whether the results of the study are attributable to the exposure or intervention rather than to chance alone. Unfortunately, this method is outdated, overused, and counterproductive for modern clinical research. NHST gives a binary yes/no answer as to whether the results cross a threshold defined a priori, such as p < 0.05. Under the NHST paradigm, if the p-value is below that threshold (e.g., p = 0.013), convention suggests that one can be reasonably sure the results are reliable and not due to chance. With a large enough sample size, however, a clinically meaningless difference may still be reported as a “statistically significant difference.” Furthermore, in many studies no null hypothesis is explicitly stated, and when a p-value < 0.05 is obtained, the finding is reported without mention of a power calculation, a description of the size of the difference (i.e., the effect size), or consideration of the clinical significance of the finding.
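
A small simulation makes the large-sample pitfall concrete. The sketch below is hypothetical (the 0.2-point true difference, the 0-100 scale, and the sample size of 50,000 per group are our own assumptions for illustration): the t-test returns p < 0.05 even though the standardized effect size is negligible.

    import numpy as np
    from scipy import stats

    # Illustrative simulation: two groups whose true means differ by only
    # 0.2 points on a 0-100 scale, far below any plausible clinically
    # important difference.
    rng = np.random.default_rng(1)
    n = 50_000  # very large sample per group
    group_a = rng.normal(loc=50.0, scale=10.0, size=n)
    group_b = rng.normal(loc=50.2, scale=10.0, size=n)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    mean_diff = group_b.mean() - group_a.mean()
    sd_pooled = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    cohens_d = mean_diff / sd_pooled

    # With 50,000 subjects per group, p is typically well below 0.05 even
    # though d is only about 0.02, a difference with no clinical meaning.
    print(f"mean difference = {mean_diff:.3f}, Cohen's d = {cohens_d:.3f}, p = {p_value:.2e}")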

INTERPRETING EFFECT SIZE

Examples of effect sizes are the mean, the median, the mean difference between two groups, odds ratios (OR), and risk ratios (RR). These are unstandardized effect sizes. Standardized effect sizes, such as Cohen’s d, Glass’s delta (Δ), and Hedges’ g, are unitless measures that allow results from different studies to be compared with one another, as is commonly done in meta-analyses (Table 1).3,7 Because these measures lack units and are therefore harder for the layperson to interpret, some use the rather arbitrary categories of small (0.20), medium (0.50), and large (0.80) effect sizes originally described by Cohen.7,8 Although somewhat useful for interpretation, these categories are controversial. Considering each situation individually matters more than applying such arbitrary cutoffs, and the appropriate interpretation depends on the context of the comparison.7

TABLE 1.

Standardized effect sizes for continuous (scale) variables

Cohen’s d
  Formula: d = (M1 − M2) / SDpooled
  Pooled SD: SDpooled = √[(SD1² + SD2²) / 2] (the square root of the average of the two variances)
  Assumption: similar SDs and sample sizes in both groups

Glass’s delta (Δ)
  Formula: Δ = (M1 − M2) / SDcontrol
  SD used: the SD of the control group
  Assumption: different SDs but the same sample size in each group

Hedges’ g
  Formula: g = (M1 − M2) / SD*pooled
  Pooled SD: SD*pooled = √{[(nA − 1)SDA² + (nB − 1)SDB²] / (nA + nB − 2)} (a sample-size-weighted pooled SD)
  Assumption: different sample sizes and SDs in each group

Key: SD = standard deviation; subscripts A, B, 1, 2 designate groups; n = number in group; M = mean.

Reference: Ellis 2010.7
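
For readers who prefer code to formulas, the Python functions below transcribe the three standardized effect sizes in Table 1; the function names and the brief usage example are ours, added for illustration.

    import numpy as np

    def cohens_d(group1, group2):
        """Cohen's d: assumes similar SDs and sample sizes in both groups."""
        sd_pooled = np.sqrt((np.var(group1, ddof=1) + np.var(group2, ddof=1)) / 2)
        return (np.mean(group1) - np.mean(group2)) / sd_pooled

    def glass_delta(treatment, control):
        """Glass's delta: standardizes by the control group's SD
        (different SDs but the same sample size in each group)."""
        return (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)

    def hedges_g(group_a, group_b):
        """Hedges' g: uses a sample-size-weighted pooled SD
        (different sample sizes and SDs in each group)."""
        n_a, n_b = len(group_a), len(group_b)
        sd_pooled = np.sqrt(((n_a - 1) * np.var(group_a, ddof=1)
                             + (n_b - 1) * np.var(group_b, ddof=1)) / (n_a + n_b - 2))
        return (np.mean(group_a) - np.mean(group_b)) / sd_pooled

    # Hypothetical usage: two small groups of symptom scores (illustrative only).
    a = np.array([22.0, 25.5, 19.8, 27.1, 24.3])
    b = np.array([18.2, 21.0, 17.5, 20.4, 19.9, 22.3])
    print(cohens_d(a, b), glass_delta(a, b), hedges_g(a, b))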

IMPORTANCE OF EFFECT-SIZE FOCUSED RESEARCH DESIGN

Ideally, a study is designed around the alternative hypothesis, rather than the null hypothesis, using the same effect size estimates that will later be reported with the results. Otherwise, interpreting a study on the basis of statistical significance from p-values, without any consideration of a power calculation, is a flawed approach.3,7,9 It is always wise to consult an experienced statistician while the study is being designed. If the study is preliminary and no previously published estimates of the expected effect exist, researchers should use an a priori estimate based on pilot data. These pilot data are then used to determine the minimal clinically important difference (MCID).10 The study should be sufficiently powered to detect at least this difference. Confidence intervals are used to judge whether the result is due to chance alone or reflects a true difference, based on whether the interval includes zero (for an absolute difference between groups) or one (for a ratio between groups). If a study is performed without a power calculation (and without confidence intervals), a p-value less than 0.05 means little on its own: the study may be over-powered, yielding p < 0.05 for a small, clinically meaningless difference, or it may be under-powered, in which case a p-value less than 0.05 may be due entirely to chance and would likely exceed 0.05 if the study were repeated.
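
As one hedged illustration of such an a priori calculation (the MCID of 5 points, the expected SD of 12 points, and the use of statsmodels are our own assumptions, not taken from the article), the MCID can be converted to a standardized effect size and then to a required sample size per group.

    from statsmodels.stats.power import TTestIndPower

    # Hypothetical planning numbers (illustrative only): minimal clinically
    # important difference of 5 points, expected SD of 12 points.
    mcid = 5.0
    expected_sd = 12.0
    standardized_effect = mcid / expected_sd  # roughly 0.42

    # Sample size per group needed to detect at least the MCID with 80% power
    # at a two-sided alpha of 0.05.
    n_per_group = TTestIndPower().solve_power(effect_size=standardized_effect,
                                              alpha=0.05, power=0.80,
                                              alternative='two-sided')
    print(f"Required sample size per group: {n_per_group:.0f}")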

CONCLUSION

The aim of a biomedical research study is often to determine the size of the effect of an intervention or exposure. The summary of that result (e.g., the difference between groups) is the effect. The magnitude of this effect is the effect size, and it should be accompanied by confidence intervals attesting to the reliability and precision of the results. Effect size estimates should be used in a priori sample size calculations, as well as in the interpretation of the study, to justify clinically meaningful results. The use of effect sizes and confidence intervals allows the reader to reach an objective conclusion based on these numbers.

Funding:

This study was funded by the Department of Otolaryngology-Head and Neck Surgery at the Washington University School of Medicine in St. Louis, St. Louis, MO.

REFERENCES

1. Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315(11):1141–1148.
2. Kyriacou DN. The enduring evolution of the P value. JAMA. 2016;315(11):1113–1115.
3. Cumming G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, NY: Routledge; 2012.
4. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
5. Cumming G. The new statistics: why and how. Psychol Sci. 2014;25(1):7–29.
6. Feinstein AR. Clinical Epidemiology: The Architecture of Clinical Research. Philadelphia, PA: Saunders; 1985.
7. Ellis PD. The Essential Guide to Effect Sizes. Cambridge, UK: Cambridge University Press; 2010.
8. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates; 1988.
9. Grissom RJ, Kim JJ. Effect Sizes for Research: Univariate and Multivariate Applications. 2nd ed. New York, NY: Routledge; 2012.
10. Wells G, Beaton D, Shea B, et al. Minimal clinically important differences: review of methods. J Rheumatol. 2001;28(2):406–412.
