Proceedings of the National Academy of Sciences of the United States of America
Letter
2019 Oct 29;116(47):23382–23383. doi: 10.1073/pnas.1909339116

The harmonic mean p-value: Strong versus weak control, and the assumption of independence

Jelle J Goeman, Jonathan D Rosenblatt, Thomas E Nichols
PMCID: PMC6876242  PMID: 31662466

Wilson (1) proposes a multiple testing procedure based on the harmonic mean p-value (HMP). While this is a potentially useful method, he makes several claims that are not supported by the theory. Herein we identify 4 errors, for clarity described in terms of the version with equal weights $1/L$, so that $w_R = |R|/L$.
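
For reference, the quantities used in the points below can be written out for the equal-weight case. This is a restatement of the HMP of ref. 1 in the notation of this letter ($p_i$ for the individual p-values, $p_R$ for the HMP of a set $R$, $U_R$ for the intersection hypothesis on $R$, $\alpha_{|R|}$ for the adjusted level from table 1 in ref. 1); it is a sketch for orientation, not a formula quoted from the letter.

```latex
\[
  w_R = \sum_{i \in R} \frac{1}{L} = \frac{|R|}{L},
  \qquad
  p_R = \frac{\sum_{i \in R} 1/L}{\sum_{i \in R} (1/L)/p_i}
      = \frac{|R|}{\sum_{i \in R} 1/p_i},
\]
\[
  \text{HMP rejects } U_R \quad \text{when} \quad p_R \le \alpha_{|R|}\, w_R .
\]
```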

  • First, Wilson claims strong familywise error rate (FWER) control for HMP using a closed testing argument. Indeed, if HMP rejects an intersection hypothesis $U_R$ for some $R$, then it also rejects $U_{R^*}$ for every $R^* \supseteq R$, implying that HMP is a closed testing procedure (2). However, from this argument Wilson may only claim that, for each set $R$ where HMP is significant, at least one of the genetic variants in $R$ has signal. Strong FWER control, that is, the claim that all such genetic variants have signal, may be claimed only for the elementary hypotheses rejected by a closed testing procedure. Rejecting nonsingleton sets, as Wilson does (ref. 1, p. 1198), gives no more than weak FWER control on these sets.

  • Second, Wilson claims, without proof, that HMP is valid without any dependence assumptions. However, Vovk and Wang (3) showed that the critical value for HMP under general dependence is smaller than $\alpha/\log(|R|)$, much smaller than $\alpha_{|R|}$ from table 1 in ref. 1, proving that Wilson’s HMP loses error guarantees under general dependence. A simulation shows lack of control even under moderate positive dependence. With $|R| = L = 10^5$, simulating standard normals with common correlation $\rho = 0.2$, and testing with 1-sided Z-tests, we obtain a type I error of 0.164 for $\alpha = 0.05$, and still an excessive rate of 0.091 with $\alpha_{|R|} = 0.029$ from table 1 in ref. 1.

  • Third, Wilson claims (ref. 1, p. 1197) that HMP is more powerful than Bonferroni: When Bonferroni rejects we have $p_i \le \alpha/L$ for some $i$, so also HMP has $p_R \le \alpha w_R$ when $i \in R$. However, this argument is not fair. Using table 1 in ref. 1, rejection with HMP requires $p_R \le \alpha_{|R|} w_R < \alpha$, so HMP cannot be claimed to be more powerful than Bonferroni. Wilson’s argument even reverses for proper strong FWER control: HMP must reject some elementary $U_i$, which requires $p_i \le \alpha_L/L$, with $\alpha_L$ from table 1 in ref. 1. Because $\alpha_L < \alpha$, after leveling the playing field Bonferroni is more powerful than HMP, and, unlike HMP, robust to dependence.

  • Fourth, Wilson claims (ref. 1, p. 1197) that HMP “produces significant results whenever the Simes-based BH [Benjamini–Hochberg] procedure does, although BH only controls the less stringent FDR [false discovery rate].” Indeed, HMP is smaller than the Simes/BH p-value. However, the proper critical value for Simes/BH is $\alpha$, while being $\alpha_{|R|} w_R < \alpha$ for HMP, so the comparison is not fair. If all p-values are in $(\alpha_L, \alpha]$, BH rejects all hypotheses, and HMP none. Moreover, BH in fact controls a more stringent criterion than HMP: Both methods control FWER weakly, but BH additionally controls FDR. Finally, unlike HMP, Simes/BH is robust to positively dependent p-values (4).
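
The simulation in the second point and the comparison with BH in the fourth point can be sketched in a few lines. The code below is an illustrative reconstruction, not the code behind the numbers reported above: the function names are hypothetical, the equicorrelated normals are generated through a shared factor (one of several equivalent constructions), and only the global test on the full set ($w_R = 1$) is examined. The values $\alpha = 0.05$ and $\alpha_L = 0.029$ are those quoted above for $L = 10^5$; Monte Carlo estimates from a small number of replicates will be noisy.

```python
import numpy as np
from scipy.stats import norm


def harmonic_mean_p(p):
    """Equal-weight harmonic mean p-value of the p-values in `p`."""
    p = np.asarray(p, dtype=float)
    return p.size / np.sum(1.0 / p)


def hmp_type1_rate(L, rho, threshold, n_rep=200, seed=0):
    """Monte Carlo rejection rate of the HMP global test under the null.

    Z_i = sqrt(rho) * Z_0 + sqrt(1 - rho) * E_i yields standard normals
    with common correlation rho; p-values are from 1-sided Z-tests.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        shared = rng.standard_normal()
        z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal(L)
        p = norm.sf(z)  # upper-tail p-values
        # R is the full set, so w_R = 1 and the threshold applies to p_R directly.
        rejections += int(harmonic_mean_p(p) <= threshold)
    return rejections / n_rep


def bh_num_rejections(p, alpha):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p_sorted = np.sort(np.asarray(p, dtype=float))
    k = np.arange(1, p_sorted.size + 1)
    passing = np.nonzero(p_sorted <= alpha * k / p_sorted.size)[0]
    return 0 if passing.size == 0 else int(passing[-1]) + 1


if __name__ == "__main__":
    L, alpha, alpha_L = 10**5, 0.05, 0.029  # values quoted in the letter
    for threshold in (alpha, alpha_L):
        rate = hmp_type1_rate(L, rho=0.2, threshold=threshold)
        print(f"threshold {threshold}: estimated type I error {rate:.3f}")

    # All p-values inside (alpha_L, alpha]: BH rejects everything, while the
    # HMP of the full set stays above alpha_L.
    p = np.full(L, 0.04)
    print("BH rejections:", bh_num_rejections(p, alpha))
    print("HMP of full set exceeds alpha_L:", harmonic_mean_p(p) > alpha_L)
```

Because both thresholds are evaluated with the same seed, the two estimated rates come from identical simulated data, so the rate at $\alpha_L$ can never exceed the rate at $\alpha$. The BH check covers only the full set; the thresholds $\alpha_{|R|}$ for smaller sets come from table 1 in ref. 1 and are not reproduced here.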

Despite these concerns, we acknowledge that HMP is of interest. If HMP finds multiple significant regions, HMP’s weak control is simultaneous over these regions. HMP could be extended to control false-discovery proportions post hoc (5, 6). It is thus worth exploring further, but only with a realistic assessment of its properties.

Footnotes

The authors declare no competing interest.

References

  • 1. Wilson D. J., The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. U.S.A. 116, 1195–1200 (2019).
  • 2. Marcus R., Peritz E., Gabriel K., Closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976).
  • 3. Vovk V., Wang R., Combining p-values via averaging. arXiv:1212.4966 (20 December 2012).
  • 4. Benjamini Y., Yekutieli D., The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
  • 5. Goeman J. J., Solari A., Multiple testing for exploratory research. Stat. Sci. 26, 584–597 (2011).
  • 6. Rosenblatt J. D., Finos L., Weeda W. D., Solari A., Goeman J. J., All-resolutions inference for brain imaging. Neuroimage 181, 786–796 (2018).
