Phys Ther. 2022 Sep 7;102(11):pzac118. doi: 10.1093/ptj/pzac118

In Defense of Hypothesis Testing: A Response to the Joint Editorial From the International Society of Physiotherapy Journal Editors on Statistical Inference Through Estimation

Keith Lohse

Background

In June 2022, Elkins et al1 published an editorial in PTJ on behalf of the International Society of Physiotherapy Journal Editors (ISPJE) (hereafter referred to as “the editorial”), proposing that ISPJE journals require researchers to stop using null hypothesis significance tests and adopt “estimation methods” in which the 95% CI is compared with the smallest effect of interest. In response, I posted a critical commentary as a preprint, which has since been peer-reviewed and published.2 However, I then learned that the original editorial had actually been published in the Journal of Physiotherapy in January 2022,3 and a number of other methodologists had already written critiques.4,5 Lakens’s critique4 had already been published in the Journal of Physiotherapy by the time of my preprint, with a subsequent response from Elkins et al6 (hereafter referred to as “the response”). Thus, in order to comment on the editorial in PTJ, I also need to acknowledge past critiques and the response to Lakens.

Point of View

I commend my more attuned colleagues for being aware of the original editorial in January, and I will save readers some time by saying that we have a lot of overlap in our concerns.2,4,5 Primarily, I would argue that the editorial (1) does not adequately deal with the fact that P values and CIs are based on the same mathematics, such that any weaknesses of P values are shared with CIs; (2) presents several misleading statements about P values and hypothesis testing; and (3) ultimately proposes an alternative that is a hypothesis test in itself. Specifically, the editorial proposes comparing the 95% CI with the smallest effect size of interest (hereafter δ). Although the editorial proposes this comparison informally, it is mathematically equivalent to a 1-sided null hypothesis test in which H0 ≤ δ (Figure, values A and B)2,4 and is referred to as a minimal or minimum effects test.7,8
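To make this equivalence concrete, the short Python sketch below is my own illustration and does not appear in the editorial or the response; the value δ = 1, the sample size, and the data-generating parameters are arbitrary assumptions. Across simulated data sets, the informal rule "the lower limit of the 2-sided 95% CI lies above δ" reaches exactly the same decision as a formal 1-sided minimal effects test of H0 ≤ δ at α = .025.

```python
# Minimal sketch (illustrative only): the decision "lower bound of the two-sided
# 95% CI exceeds delta" agrees with the decision "one-sided minimal effects test
# of H0: mu <= delta gives P < .025" in every simulated data set.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
delta = 1.0                                # assumed smallest effect size of interest
agree = True
for _ in range(2000):
    x = rng.normal(loc=1.2, scale=2.0, size=40)        # one arbitrary "study"
    n, mean, sd = len(x), x.mean(), x.std(ddof=1)
    se = sd / np.sqrt(n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)   # 95% CI
    p_met = stats.ttest_1samp(x, popmean=delta, alternative="greater").pvalue
    agree &= (lo > delta) == (p_met < 0.025)
print("CI rule and 1-sided minimal effects test always agree:", bool(agree))
```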

Figure.

95% CIs and corresponding P values for testing H0 = 0 (null hypothesis significance test [NHST]) and H0 ≤ 1 (1-sided minimal effects test [MET]). Open circles indicate when the NHST is statistically significant, P < .05.

Arguments in the response highlight the importance of CIs and the value of leaning into the uncertainty in our data. However, the arguments do not change the fundamental problem with the editorial: after proposing to ban hypothesis tests, the editorial ultimately proposes a minimal effects test against the smallest effect size of interest, but without explicit formulation.2,4 Thus, the central logic of the call to ban hypothesis tests is questionable, and I want to further draw attention to 3 issues in the response.

CIs Bounce Around Just Like P Values

The response argues that “experimenters who obtain a significant test finding cannot expect that, if an exact replication of their study were possible, it too would obtain a significant finding.” The response further quotes Amrhein et al,9 saying that “it would not be very surprising for one to obtain P < .01 and the other P > .30” even with 80% power to detect a genuine effect.

First, if you are studying a genuine effect with 80% statistical power in each replication, then by definition 80% of P values in the long run would fall below .05, with 20% of P values being >.05. As such, although experimenters can indeed expect the occasional P = .30 in a follow-up study, they can also reasonably expect “replication” 80% of the time (if we define replication as P < .05). Second, it is important to note that Amrhein et al9 were arguing against “statistical significance” as a dichotomy to rightly prevent human errors in judgment, not arguing against hypothesis testing in general because of some mathematical flaw inherent to P values. Finally, if the P value is “bad” because it bounces around due to sampling variability, then so too is the CI, because it is equally subject to sampling variability. Compare the Figure’s values C with D (for the null hypothesis significance test) and E with F (for the minimal effects test). The response would argue that the P value here is unstable because the result goes from nonsignificant to significant in both cases. However, the 95% CI similarly goes from including a null value to excluding a null value (0 in C and D, or 1 in E and F). Thus, the editorial’s proposed alternative is no more stable than the P value.
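For readers who want to see this numerically, the simulation below is my own illustration, not taken from the editorial, the response, or Amrhein et al; the one-sample design, effect size, and sample size are assumptions chosen to give roughly 80% power. About 80% of P values fall below .05, and the 95% CI excludes 0 in exactly those same replications, so the two summaries are equally (un)stable.

```python
# Illustrative simulation: under ~80% power, P values and 95% CIs "bounce" together.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect = 34, 0.5                       # one-sample design, d = 0.5, ~80% power
p_vals, ci_excludes_zero = [], []
for _ in range(10_000):
    x = rng.normal(loc=effect, scale=1.0, size=n)
    t, p = stats.ttest_1samp(x, popmean=0.0)
    se = x.std(ddof=1) / np.sqrt(n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=se)
    p_vals.append(p)
    ci_excludes_zero.append(lo > 0 or hi < 0)

p_vals = np.array(p_vals)
ci_excludes_zero = np.array(ci_excludes_zero)
print("Proportion of P < .05:        ", (p_vals < 0.05).mean())    # roughly 0.80
print("Proportion of CIs excluding 0:", ci_excludes_zero.mean())   # same value
print("Decisions always agree:", np.all((p_vals < 0.05) == ci_excludes_zero))
```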

We Do Not Need Perfect Precision for the Null to Be Useful

The response argues that the null is self-evidently false because no tested effect will ever be perfectly zero: “the only reason (the null is) not always found to be false is that almost all studies lack the precision to detect tiny effects.” If we accept the response’s insistence on hyperprecision (ie, that the null is false if the true effect is .001 rather than .0), then the same logic can be applied to the alternative: no effect will ever perfectly equal the smallest effect size of interest either, yet the editorial was willing to assume the δ value as a point estimate in the proposed alternative.

Everything has some degree of measurement error. It makes much more practical sense to think about H0 = 0 as a model, not some infinitesimally precise prediction. To the extent that we can measure an effect, a lot of tested effects will be 0, and it should not be hard to show that a useful intervention is discernibly different from 0! If we want to set an even higher standard, we could set H0 = δ, where δ > 0, but then we need to define what δ should be in a given population5,10 and determine an adequate study design. Further, the response’s argument for hyperprecision applies only to the point null of a 2-sided hypothesis test, H0 = 0. The true effect does not need to be perfectly 0 for a 1-sided null to be true, H0 ≤ 0 or H0 ≥ 0, which raises the question: would the ISPJE allow P values from 1-sided tests?

Furthermore, testing the null hypothesis is very helpful in situations when we have multiple degrees of freedom being tested simultaneously. Consider an omnibus F-test for the group × time interaction in a trial with 3 arms and 3 time points. What would the editorial propose in this situation? One could draw many CIs comparing many groups at many times, but I would argue it is still useful to obtain a P value for the interaction first to see if there is any evidence that the effect of time depends on group before delving into specific post-hoc comparisons.
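As a hedged sketch of that workflow (my own illustration, not a procedure endorsed by the editorial), the code below simulates a 3-arm, 3-time-point data set and obtains a single omnibus P value for the group × time interaction from a likelihood ratio test between nested mixed-effects models before any post hoc comparisons; the simulated data, the statsmodels mixed-model specification, and the random-intercept structure are all assumptions chosen for illustration.

```python
# Sketch: one omnibus test for a group x time interaction (3 arms x 3 times)
# via a likelihood ratio test between nested mixed-effects models.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(90), 3)                 # 90 participants x 3 times
group = np.repeat(rng.integers(0, 3, size=90), 3)      # 3 arms, assigned per subject
time = np.tile(np.arange(3), 90)
subj_int = np.repeat(rng.normal(scale=0.5, size=90), 3)  # subject-level intercepts
y = 0.3 * time + 0.4 * (group == 2) * time + subj_int + rng.normal(size=270)

df = pd.DataFrame({"subject": subjects, "group": group, "time": time, "y": y})

# Full model (with interaction) vs reduced model (main effects only), both fit
# by maximum likelihood so the likelihood ratio test between them is valid.
full = smf.mixedlm("y ~ C(group) * C(time)", df, groups=df["subject"]).fit(reml=False)
reduced = smf.mixedlm("y ~ C(group) + C(time)", df, groups=df["subject"]).fit(reml=False)

lr = 2 * (full.llf - reduced.llf)                      # chi-square statistic
p_interaction = stats.chi2.sf(lr, df=4)                # (3 - 1) * (3 - 1) = 4 df
print(f"LR = {lr:.2f}, omnibus interaction P = {p_interaction:.4f}")
```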

We Need Better Statistical Education in Our Field

The response states, “In the absence of a well-established threshold for interpretation, authors can still interpret a confidence interval by describing the practical implications of all values inside the confidence interval.” I think the second part of this statement is a good idea, and thus it is very important to report effect sizes and CIs, or better yet, to share your data for maximum usability.11,12 However, the CI provides the range of population parameters that are compatible with the data that we observed to a given level of uncertainty.13,14 That is, any parameter value within the 95% CI would yield P > .05 if we tested the probability of getting data as or more extreme than our current sample, assuming the null were true (ie, setting H0 equal to that parameter value), and the limits of the CI are where P would turn to <.05. Consider the Figure, values E vs F: when formalized as a minimal effects test, only F should lead to the decision that δ is incompatible with the effect we observed in the data. If authors are allowed to informally interpret all values within the CI, as the response argues, they might focus on the upper tail of panel E, downplaying the fact that E is an “equivocal” finding and the data are compatible with many values below δ.
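The compatibility interpretation is easy to verify directly. The snippet below is an added illustration with arbitrary simulated data: values inside the 95% CI give P > .05 when treated as the null, and the CI limits themselves give P of approximately .05.

```python
# Illustration: the limits of a two-sided 95% CI are the null values at which
# the two-sided P value crosses .05; values inside the CI give P > .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=0.8, scale=2.0, size=50)            # one simulated sample
se = x.std(ddof=1) / np.sqrt(len(x))
lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=x.mean(), scale=se)

print("P vs lower CI limit:", stats.ttest_1samp(x, popmean=lo).pvalue)        # ~.05
print("P vs upper CI limit:", stats.ttest_1samp(x, popmean=hi).pvalue)        # ~.05
print("P vs the CI midpoint:", stats.ttest_1samp(x, popmean=x.mean()).pvalue) # 1.0
```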

History shows us that authors are overly optimistic about their data when statistical benchmarks are removed.15–17 As such, I think the response assigns far too much credit to authors and readers while deflecting responsibility away from editors and reviewers. If we cannot trust the average author with P values, should we trust them with a qualitative interpretation of CIs? The solution is to ban neither P values nor CIs. We need more statistical training focused on (correctly) interpreting P values, avoiding Bayesian misinterpretations of frequentist statistics, using minimal effects and equivalence tests, and so on. We also need more training on how to avoid questionable research practices and pitfalls. P values get a disproportionate amount of attention in popular conversations about methodology,18,19 but underpowered subgroup analyses, surrogate outcomes, P-hacking, hypothesizing after results are known, and selective reporting all pose far greater threats to statistical and scientific integrity.20–24 Rather than discouraging valid tools of statistical inference, I would prefer to see the ISPJE promote things like data sharing, registered reports, results-blind peer review, or even “data papers” that allow authors to archive data and receive academic credit without attempting to draw inferences from limited samples.

Conclusion

Ultimately, I agree with the editorial that measures of effect size and CIs should be reported and interpreted in the context of the research question. I also agree we should encourage authors to lean into the uncertainty of their data and avoid misinterpretations of P values. However, estimation cannot truly be separated from hypothesis testing (or vice versa), and it is illogical for the editorial to propose banning hypothesis tests and then offer a minimal effects test as the alternative. I would encourage the ISPJE editors to invest in making published research more transparent, reproducible, and methodologically rigorous, regardless of the method of inference.

Acknowledgments

Many thanks to Dr Emma Johnson for providing critical feedback and edits on early drafts of this Point of View.

Funding

There are no funders to report for this work.

Disclosure

The author completed the ICMJE Form for Disclosure of Potential Conflicts of Interest and reported no conflicts of interest.

References

1. Elkins MR, Pinto RZ, Verhagen A, et al. Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors. Phys Ther. 2022;102:pzac066. doi: 10.1093/ptj/pzac066.
2. Lohse K. No estimation without inference: a response to the International Society of Physiotherapy Journal Editors. Communications in Kinesiology. 2022;9:1.
3. Elkins MR, Pinto RZ, Verhagen A, et al. Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors. J Physiother. 2022;68:1–4. doi: 10.1016/j.jphys.2021.12.001.
4. Lakens D. Correspondence: reward, but do not yet require, interval hypothesis tests. J Physiother. 2022;68:213–214.
5. Tenan M, Caldwell A. Confidence intervals and smallest worthwhile change are not a panacea: a response to the International Society of Physiotherapy Journal Editors. Communications in Kinesiology. 2022;9:1.
6. Elkins MR, Pinto RZ, Verhagen A, et al. Correspondence: response to Lakens. J Physiother. 2022;68:214. doi: 10.1016/j.jphys.2022.06.003.
7. Lakens D. The practical alternative to the P value is the correctly used P value. Perspect Psychol Sci. 2021;16:639–648. doi: 10.1177/1745691620958012.
8. Murphy KR, Myors B. Testing the hypothesis that treatments have negligible effects: minimum-effect tests in the general linear model. J Appl Psychol. 1999;84:234–248.
9. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567:305–307.
10. Scheel AM, Tiokhin L, Isager PM, Lakens D. Why hypothesis testers should spend less time testing hypotheses. Perspect Psychol Sci. 2021;16:744–755.
11. Wallis JC, Rolando E, Borgman CL. If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One. 2013;8:e67332. doi: 10.1371/journal.pone.0067332.
12. Borg DN, Bon JJ, Sainani K, Baguley B, Tierney NJ, Drovandi CC. Sharing data and code: a comment on the call for the adoption of more transparent research practices in sport and exercise science. SportRxiv. 2020. doi: 10.31236/osf.io/ftdgj.
13. Greenland S. Invited commentary: the need for cognitive science in methodology. Am J Epidemiol. 2017;186:639–645.
14. Rafi Z, Greenland S. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med Res Methodol. 2020;20:244. doi: 10.1186/s12874-020-01105-9.
15. Fricker RD Jr, Burke K, Han X, Woodall WH. Assessing the statistical analyses used in basic and applied social psychology after their p-value ban. Am Stat. 2019;73:374–384.
16. Sainani KL, Lohse KR, Jones PR, Vickers A. Magnitude-based inference is not Bayesian and is not a valid method of inference. Scand J Med Sci Sports. 2019;29:1428–1436.
17. Lohse KR, Sainani KL, Taylor JA, Butson ML, Knight EJ, Vickers AJ. Systematic review of the use of “magnitude-based inference” in sports science and medicine. PLoS One. 2020;15:e0235318. doi: 10.1371/journal.pone.0235318.
18. Leek JT, Peng RD. Statistics: P values are just the tip of the iceberg. Nature. 2015;520:612.
19. Borg DN, Lohse KR, Sainani KL. Ten common statistical errors from all phases of research, and their fixes. PM&R. 2020;12:610–614.
20. Simmons JP, Nelson LD, Simonsohn U. Life after P-hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA; 2013. doi: 10.2139/ssrn.2205186.
21. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22:1359–1366.
22. Sun X, Briel M, Busse JW, et al. Credibility of claims of subgroup effects in randomised controlled trials: systematic review. BMJ. 2012;344:e1553. doi: 10.1136/bmj.e1553.
23. Kerr NL. HARKing: hypothesizing after the results are known. Personal Soc Psychol Rev. 1998;2:196–217.
24. Rosenthal R. The file drawer problem and tolerance for null results. Psychol Bull. 1979;86:638–641.
