Author manuscript; available in PMC: 2023 Oct 1.
Published in final edited form as: Genet Epidemiol. 2022 Jun 1;46(7):390–394. doi: 10.1002/gepi.22464

Post hoc Power is Not Informative

Lacey W Heinsberg 1, Daniel E Weeks 1,2
PMCID: PMC9452450  NIHMSID: NIHMS1809262  PMID: 35642557

Abstract

Post hoc power estimates are often requested by reviewers and/or performed by researchers after a study has been conducted. The purpose of this commentary is to provide a heuristic explanation of why post hoc power should not be used. To illustrate our point, we provide a detailed simulation study of two essentially identical research experiments hypothetically conducted in parallel at two separate universities. The simulation demonstrates that post hoc power calculations are misleading and simply not informative for data interpretation. As such, we encourage authors and peer-reviewers to avoid using or requesting post hoc power calculations.

Keywords: Observed power, Retrospective power, Achieved power, Post-experiment power, Exploratory data analysis

INTRODUCTION

Predicted (i.e., pre-study) statistical power is critical in research study design and sample size determination. Akin to screening or diagnostic sensitivity, pre-study statistical power can be interpreted as the probability of observing a statistically significant result (i.e., rejecting the null hypothesis) if a true association indeed exists (Nuzzo, 2021). Well-powered studies are essential to (1) support detection of clinically meaningful differences, (2) limit false negative findings that can lead to a perceived lack of clinically meaningful differences, and (3) support responsible stewardship of resources such as money and time of the research team, study participants, and community partners (Zhang et al., 2019).

Unfortunately, post hoc (i.e., post-study) power estimates are often requested by reviewers and/or performed by researchers after a study has been conducted. Most often, post hoc power is calculated using a mathematical formula for predicted statistical power, while replacing a hypothetical and clinically meaningful effect size with the observed effect size from a given study (Bababekov et al., 2018; Zheleznyakova et al., 2016). Generally, post hoc power calculations are requested/used to explain why key study results are not statistically significant (Nuzzo, 2021).

Several issues make post hoc power uninformative and even potentially harmful. First, the approach is both conceptually and mathematically incorrect, as detailed by Zumbo & Hubley (1998). Second, post hoc power is misleading because it assumes that the observed effect size is similar to the true effect size, and the true effect size can only be known perfectly when data are simulated. Finally, post hoc power is redundant with the observed findings because there is a one-to-one relationship between the p-value and post hoc power. Nuzzo (2021) illustrates this nicely: for a two-sided z-test, for example, p-values >0.05 will always translate to post hoc power values <50%, regardless of the sample size. This is particularly problematic when post hoc power is used to explain a nonsignificant finding because, even when a study is carefully designed and well-powered in reality, large p-values will always translate to low observed power (Althouse & Chow, 2019). Because of this redundancy, post hoc power cannot separate (a) studies that had “negative findings” because they were underpowered from (b) studies that had “negative findings” because there was no true meaningful effect to discover. Taken together, these issues indicate that post hoc power is not only incorrect and uninformative, but also potentially harmful, as it can easily be misinterpreted, distract researchers from topics of greater importance, and lead to poor use of time and resources.
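The one-to-one relationship can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the cited papers: it assumes a two-sided z-test, and the function name `observed_power_from_p` is hypothetical. It recovers the observed |z| from the p-value and plugs it back into the power formula as if it were the true standardized effect, so the sample size never enters the calculation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def observed_power_from_p(p_value: float, alpha: float = 0.05) -> float:
    """Post hoc power of a two-sided z-test, computed from the p-value alone.

    Assumption: the test statistic is standard normal under the null.
    p_value must lie strictly between 0 and 1 (or equal 1).
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # two-sided critical value
    # Treat the observed |z| as if it were the true standardized effect and
    # add up the rejection probability in both tails.
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


# Illustrative p-values chosen for this sketch, not taken from the paper.
for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:<5}  post hoc power = {observed_power_from_p(p):.3f}")
```

Because the only input is the p-value, the result carries no information beyond the p-value itself. Under this z-test setup it sits at essentially 50% when p = 0.05 and falls as p increases.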

While the misconceptions surrounding post hoc power have been documented in more detail elsewhere (Althouse, 2021; Althouse & Chow, 2019; Dziak et al., 2020; Hoenig & Heisey, 2001; Levine & Ensom, 2001; Nuzzo, 2021; Zumbo & Hubley, 1998), post hoc power is still seen in the literature (Bababekov et al., 2018; Zheleznyakova et al., 2016) and debated (LeMaire, 2021). Therefore, the purpose of this commentary is to provide a novel approach to heuristically explaining why post hoc power should not be used. To that end, consider the following scenario.

THOUGHT EXPERIMENT

Suppose two scientific groups, one at University A and one at University B, are serendipitously and simultaneously working on essentially identical research experiments, using genetically identical mice (n=100) being fed the same diet. The purpose of their experiments is to determine the association between a two-allele genetic marker G (minor allele frequency=0.5) and a quantitative trait Y. Both groups use simple linear regression (i.e., Y ~ G) and a significance threshold of 0.05. Based on the power calculations they submitted with their grant applications to secure funding for the studies, a sample size of 100 mice and significance threshold of 0.05 gives them 80% predicted statistical power to detect an effect size of 0.368 or greater (Figure 1).

Figure 1. Predicted statistical power curve for experiments at Universities A and B.

The black solid line represents analytically-derived estimates of predicted pre-study power as a function of assumed effect size while the black dotted line indicates the minimum effect size (0.368) the study has 80% predicted power to detect.

The two groups carry out their experiments and arrive at remarkably similar statistical results (Table 1). Each group writes up their results, submits their papers for publication, and each of their reviewers requests a post hoc power estimate.

Table 1.

Results of simple linear regression examining the association between a two-allele genetic marker, G, and a quantitative trait, Y, for experiments at Universities A and B.

University   Estimate (95% CI)         Test statistic   P-value
A            0.290 (−0.002 to 0.582)   1.970            0.0517
B            0.275 (−0.002 to 0.553)   1.968            0.0519

While the results in Table 1 are very similar, each with a p-value of approximately 0.052 (i.e., “teetering on the brink of significance” according to the humorous blog post “Still Not Significant” (Hankins, 2013)), they came from distinct realities with different true effect sizes, which were set a priori as part of the simulation of data for this study. While we would expect the true effect sizes to be similar for two studies of such homogeneous design, interference from environmental or unmeasured effects is a common limitation of research studies. Unfortunately, nothing in the statistical results distinguishes the realities from which the two sets of results came. In the case of University A and University B described here, the observed effect sizes were 0.290 and 0.275, but the true effect sizes were quite different at 0.402 and 0.201, respectively (Table 2). We want to believe there is an important effect in these data given the trending p-values observed, so we feel compelled to ask: “Perhaps the sample size just wasn’t large enough to detect an effect?” Sure enough, use of the observed effect sizes suggests that the post hoc power was low for each experiment (<0.60).

Table 2.

Observed effect size/power and true effect size/power for experiments at Universities A and B.

University   Observed effect size   Post hoc power   True effect size   True power
A            0.290                  0.584            0.402              0.841
B            0.275                  0.537            0.201              0.322

NOTE: Observed effect sizes are consistent with those in Table 1. Post hoc power was computed at the observed effect size. True effect sizes were set a priori as part of the simulation of data for this study. True power was computed using simulation. See full simulation of experiment at GitHub: https://github.com/lwheinsberg/PostHocPower.

If these post hoc power calculations were presented in the published papers summarizing the results from Universities A and B, readers would arrive at similar conclusions about power for both experiments, but these conclusions would be wrong in different ways. Specifically, readers would conclude that experiments A and B were both underpowered (post hoc power A=0.584, post hoc power B=0.537). In reality, however, experiment A was well-powered (simulation-derived true power=0.841) while experiment B had very low power (simulation-derived true power=0.322).

Full code with complete details for the simulated experiments and calculation of observed/true power can be found on GitHub at https://github.com/lwheinsberg/PostHocPower.
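For readers who want a feel for how simulation-derived power is obtained, the following is a minimal Python sketch. It is not the authors' code, which is in the repository above. It assumes an additive 0/1/2 genotype coding drawn under Hardy-Weinberg proportions, a residual standard deviation of 1, and a normal approximation to the t critical value; the function name `simulate_power` and these settings are illustrative assumptions, so its output should not be expected to reproduce Table 2 exactly.

```python
import math
import random
from statistics import NormalDist


def simulate_power(beta, n=100, maf=0.5, sigma=1.0, alpha=0.05,
                   n_sims=2000, seed=2022):
    """Estimate power for the slope in Y ~ G by repeated simulation.

    Assumptions (hypothetical, for illustration): G is 0/1/2 under
    Hardy-Weinberg proportions, residuals are Normal(0, sigma), and the
    critical value uses a normal approximation to the t distribution.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Genotype = number of minor alleles across two independent draws.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        y = [beta * gi + rng.gauss(0.0, sigma) for gi in g]
        g_bar, y_bar = sum(g) / n, sum(y) / n
        sxx = sum((gi - g_bar) ** 2 for gi in g)
        sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
        syy = sum((yi - y_bar) ** 2 for yi in y)
        if sxx == 0:          # no genotype variation; slope undefined
            continue
        slope = sxy / sxx
        resid_var = (syy - slope * sxy) / (n - 2)
        std_err = math.sqrt(resid_var / sxx)
        if abs(slope / std_err) > crit:
            rejections += 1
    return rejections / n_sims


# True effect sizes from Table 2; everything else is an assumption of this sketch.
power_a = simulate_power(0.402)
power_b = simulate_power(0.201)
print(f"Simulated power, true effect 0.402: {power_a:.3f}")
print(f"Simulated power, true effect 0.201: {power_b:.3f}")
```

The true effect size is an input to this function, which is only possible because the data are simulated. In a real study it is unknown, which is why plugging in the observed estimate does not recover true power.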

We acknowledge that these observations are surprising because, a priori, we would have assumed that the true underlying effect sizes should be very similar for both experiments. However, differences in the true effect size caused by environmental or other unmeasured effects are a common and real limitation of research studies. Regardless of reality, after the experiments are done and the results are in hand, post hoc power estimates do not add any information for interpreting the results because they are a direct function of the p-value and parameter estimates. As mentioned above, true effect sizes can only be known perfectly if the data are simulated. In this case, the results of both experiments were nearly the same even though the true effect sizes were different, so looking at the results and associated post hoc power does not aid in interpretation. While conducting post hoc power estimates using a range of clinically or experimentally meaningful effect sizes is more appropriate than using the observed effect sizes, these hypothetical power values still do not change the interpretation of the actual results that were obtained (Figure 2).

Figure 2. Predicted statistical power curve relative to observed and true effect sizes of experiments at Universities A and B.

The curved black line indicates analytically-derived estimates of predicted pre-study power as a function of assumed effect size. The vertical black dotted line marks the minimum effect size (0.368) that the study has 80% predicted power to detect. The red and blue dots indicate the observed effect sizes for experiments A and B of 0.290 and 0.275, respectively. The vertical red and blue lines indicate the true effect sizes set a priori as part of the simulation of data for experiments A and B of 0.402 and 0.201, respectively.

DISCUSSION AND CONCLUSION

The scenario above demonstrates that post hoc power calculations are misleading and do not assist in data interpretation. Rather than using post hoc power estimates, authors should interpret their findings with careful regard for the design and limitations of the study in question, as well as the observations of other researchers. In doing so, authors should focus on the use of confidence intervals in data reporting as they better inform readers of the possibility of low power or inappropriate sample sizes than post hoc power calculations, and are more easily understood than p-values and power by a range of audiences (Hoenig & Heisey, 2001).

Further, because post hoc power calculations are most often requested/performed for statistically nonsignificant findings, we would like to remind researchers of several reasons, first summarized by Levine & Ensom (2001), that negative results can arise: (1) there actually is no true association; (2) there is a true association, but the effect is smaller than hypothesized so it could not be detected with the given sample size; (3) there is a true association with an effect as large as or larger than hypothesized, but it was not detected by chance; (4) the variance of the sampled data is greater than expected or observed previously, causing excessive noise in the data that prevented detection of a true association; or (5) there were confounding factors that were unaccounted for, making the results appear as if there was no association or a smaller effect than in reality. Items 2–5 are all cases of type II error that highlight the critical need for careful study design and accurate pre-study power calculations.

Finally, throughout this commentary, we have generally been referring to “post hoc” in the context of using the observed effect sizes from one’s own study to compute observed power. However, researchers commonly use effect sizes from the literature to compute pre-study statistical power. This alternative “post hoc approach” to computing pre-study statistical power is quite problematic, particularly in fields where power tends to be low. Specifically, as detailed by Vasishth et al. (2018), in studies with low power, simulated estimates fluctuate substantially around the true effect size and can even have the opposite direction of effect. Therefore, when an effect is significant, its size is typically an overestimate. In direct contrast, simulated estimates in well-powered studies are closer to the true effect size because the standard error is smaller (Vasishth et al., 2018). This is an important concept to understand because, particularly in fields where power tends to be low, overestimates of effect size are common in the literature due to publication bias. If pre-study power calculations are based on “post hoc” effect sizes from low-powered studies, then the estimated power will be exaggerated. This will result in an even greater number of underpowered studies in the literature, creating a vicious cycle. Therefore, rather than using observed effect sizes from the literature, it is important that researchers use hypothetical clinically or scientifically meaningful effect sizes to compute pre-study statistical power.
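The overestimation described above can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the analysis of Vasishth et al. (2018): it treats each study's estimate as a normal draw around the true effect, keeps only the estimates that pass a two-sided z-test, and compares their average magnitude to the truth. The function name `exaggeration_ratio` and the chosen true effect and standard errors are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def exaggeration_ratio(true_effect, std_error, alpha=0.05,
                       n_sims=20000, seed=1):
    """Return (power, mean |significant estimate| / |true effect|).

    Assumption: each study's estimate is Normal(true_effect, std_error)
    and significance is judged with a two-sided z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate / std_error) > z_crit:
            # Keep the magnitude; a few significant estimates may even
            # point in the wrong direction when power is low.
            significant.append(abs(estimate))
    power = len(significant) / n_sims
    if not significant:
        return power, float("nan")
    return power, fmean(significant) / abs(true_effect)


# Same hypothetical true effect, two hypothetical standard errors.
low_power, low_ratio = exaggeration_ratio(true_effect=0.2, std_error=0.14)
high_power, high_ratio = exaggeration_ratio(true_effect=0.2, std_error=0.05)
print(f"Low-power design:  power = {low_power:.2f}, exaggeration = {low_ratio:.2f}x")
print(f"High-power design: power = {high_power:.2f}, exaggeration = {high_ratio:.2f}x")
```

Under these assumptions, the significant estimates from the low-power design overstate the true effect considerably, while those from the well-powered design stay close to it. Feeding such a filtered estimate into a pre-study power calculation would therefore overstate the planned study's power.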

In sum, in advance of the study, researchers should use predicted statistical power for study planning but, after the study is done, researchers should not use post hoc power for explanation or interpretation of observed statistical results. We encourage authors and peer-reviewers to avoid using or requesting post hoc power calculations as they are misleading, they do not add scientific value, and they are simply not informative.

Acknowledgements:

The authors would like to thank the many anonymous reviewers who inspired this paper by requesting post hoc power estimates, with an extra special thank you to the Genetic Epidemiology reviewers who took the time to thoughtfully evaluate this paper, as their feedback improved the quality and clarity of our work.

Funding:

Research reported in this publication was partially supported by the National Institutes of Health under award numbers TL1TR001858, R01HL1333040, and R01HL093093.

Footnotes

Conflict of Interest: The funders had no role in the design of the study; simulation, analyses, or interpretation of data; writing of the manuscript; or decision to publish the results. As such, the authors declare no conflict of interest.

REFERENCES

  1. Althouse AD (2021, March). Post Hoc Power: Not Empowering, Just Misleading. The Journal of Surgical Research. United States. 10.1016/j.jss.2019.10.049
  2. Althouse AD, & Chow ZR (2019, December). Comment on “Post-hoc Power: If You Must, At Least Try to Understand”. Annals of Surgery. United States. 10.1097/SLA.0000000000003296
  3. Bababekov YJ, Stapleton SM, Mueller JL, Fong ZV, & Chang DC (2018). A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science. Annals of Surgery, 267(4), 621–622. 10.1097/SLA.0000000000002547
  4. Dziak JJ, Dierker LC, & Abar B (2020). The Interpretation of Statistical Power after the Data have been Gathered. Current Psychology (New Brunswick, N.J.), 39(3), 870–877. 10.1007/s12144-018-0018-1
  5. Hankins M (2013). Still Not Significant. Retrieved October 27, 2021, from https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/
  6. Hoenig JM, & Heisey DM (2001). The Abuse of Power. The American Statistician, 55(1), 19–24. 10.1198/000313001300339897
  7. LeMaire SA (2021, March). A Post Hoc Discussion About Post Hoc Power: Divergent Viewpoints on Controversial Methodology. The Journal of Surgical Research. United States. 10.1016/j.jss.2021.01.002
  8. Levine M, & Ensom MH (2001). Post hoc power analysis: an idea whose time has passed? Pharmacotherapy, 21(4), 405–409. 10.1592/phco.21.5.405.34503
  9. Nuzzo RL (2021). Post hoc Power. PM & R: The Journal of Injury, Function, and Rehabilitation, 13(4), 422–424. 10.1002/pmrj.12476
  10. Vasishth S, Mertzen D, Jäger LA, & Gelman A (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151–175. 10.1016/j.jml.2018.07.004
  11. Zhang Y, Hedo R, Rivera A, Rull R, Richardson S, & Tu XM (2019). Post hoc power analysis: is it an informative and meaningful analysis? General Psychiatry, 32(4), e100069. 10.1136/gpsych-2019-100069
  12. Zheleznyakova GY, Cao H, & Schiöth HB (2016). BDNF DNA methylation changes as a biomarker of psychiatric disorders: literature review and open access database analysis. Behavioral and Brain Functions: BBF, 12(1), 17. 10.1186/s12993-016-0101-4
  13. Zumbo BD, & Hubley AM (1998). A Note on Misconceptions Concerning Prospective and Retrospective Power. Journal of the Royal Statistical Society: Series D (The Statistician), 47(2), 385–388. 10.1111/1467-9884.00139
