There were 2 errors in Wilson (1), which I have announced, but I do not accept the 4 claimed by Goeman et al. (2); I rebut them point by point on a figshare page (https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests_Appendix_/9699740). However, their letter highlights 2 limitations of the harmonic mean p-value (HMP) procedure, which I discuss below with possible countermeasures.
First, they report a model of p-value dependence (herein called “GRN”) with parameter . Unlike the dependence simulated in figure S4 of ref. 1, GRN dependence makes the asymptotically exact HMP anticonservative, producing a type I error rate of 0.09 when all null hypotheses are true, above the theoretical target of . This limitation is important, but it does not, as claimed, imply an error. The paper states that “the assumptions of equal weights, independence, and identical degrees of freedom can be relaxed.” A fair criticism would be that the paper did not qualify that statement sufficiently.
Equation 2.7 of Davis and Resnick (3) implies that the result that as (equation 5 of ref. 1) holds despite dependence when
| [1] |
for all p-values . This condition appears to be satisfied by GRN (Fig. 1A). Simulations confirm convergence of the asymptotically exact test (equation 4 of ref. 1) as becomes small (Fig. 1B). Thus, Eq. 1 formalizes robustness of the HMP to dependence as , but not necessarily at .
Fig. 1.
Properties of the GRN model. (A) GRN satisfies Davis and Resnick’s (3) condition. (B) For small , the asymptotically exact HMP procedure converges to the correct type I error rate ( simulations, ). (C) Simes and Bonferroni are robust to GRN dependence ( simulations, ). (D) Error rates for multilevel Bonferroni, Simes, and HMP procedures as a function of the number of p-values being combined. The normal random variables have means −2.0 and 0.0 under HA and H0, respectively, in proportion 100:900 ( simulations). R code for figures is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests/9699743.
Second, Goeman et al. (2) mention that the significance threshold at which the HMP rejects an individual null hypothesis should be more stringent than the Bonferroni threshold, contrary to the paper. This is a special case of the error in which the criterion for declaring set significant should be , rather than , where is the total number of tests and . Thus the power of the HMP to detect significant groups of hypotheses comes at the cost of reduced power to detect individual hypotheses.
One response is to seek a test that shares some benefits of the HMP while avoiding these issues. Multilevel versions of Bonferroni and Simes’ (4) procedures are candidates, as both are robust to GRN dependence (Fig. 1C). Their combined p-values for set approximate the HMP by bounding it from above (SI Appendix, equations 36 and 39 of ref. 1):
| [2] |
(Here , , and ranks within from largest, 1, to smallest, .)
The multilevel Bonferroni and Simes methods can therefore be interpreted as approximating the HMP’s model averaging approach. These multilevel tests control the strong-sense familywise error rate because a superset of any significant subset must also be significant at threshold . This allows the most significant groups of p-values to be identified, so that conclusions are made at the finest resolution permitted by the data, as in the HMP procedure.
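The bounding relationship described above can be checked numerically in the simplest setting. The following is a minimal sketch, assuming equal weights and using the raw harmonic mean of the p-values; it does not implement the weighted forms or the asymptotically exact test of ref. 1. The function names `hmp`, `simes`, and `bonferroni` and the example p-values are hypothetical, chosen only for illustration. All p-values are assumed to lie in (0, 1].

```python
from typing import Sequence


def hmp(pvals: Sequence[float]) -> float:
    """Equal-weight harmonic mean of the p-values (no asymptotic correction)."""
    return len(pvals) / sum(1.0 / p for p in pvals)


def simes(pvals: Sequence[float]) -> float:
    """Simes combined p-value: min over k of n * p_(k) / k, with p_(1) smallest."""
    n = len(pvals)
    return min(n * p / k for k, p in enumerate(sorted(pvals), start=1))


def bonferroni(pvals: Sequence[float]) -> float:
    """Bonferroni combined p-value: n times the smallest p-value, capped at 1."""
    return min(1.0, len(pvals) * min(pvals))


# Illustrative p-values (an assumption, not data from the paper).
example = [0.001, 0.02, 0.03, 0.4, 0.7]
h, s, b = hmp(example), simes(example), bonferroni(example)

# With equal weights, the harmonic mean never exceeds Simes,
# and Simes never exceeds Bonferroni.
assert h <= s <= b
print(f"HMP={h:.5f}  Simes={s:.5f}  Bonferroni={b:.5f}")
```

The ordering follows because the sum of reciprocals is at least k divided by the k-th smallest p-value for every k, and because the Bonferroni value is the k = 1 term of the Simes minimum.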
Multilevel HMP, Simes, and Bonferroni procedures all have lower power (higher type II error rates) for combining small proportions of p-values, the HMP slightly more so. All procedures have higher power for combining large proportions of p-values, the HMP considerably more so (Fig. 1D). Thus, multilevel Bonferroni, Simes, and HMP procedures all offer some benefits of model averaging with different trade-offs in terms of the power of their combined tests and their robustness to dependence.
Acknowledgments
D.J.W. is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (Grant 101237/Z/13/B), and is supported by a Big Data Institute Robertson Fellowship. I thank Jacob Armstrong for comments.
Footnotes
The author declares no competing interest.
Data deposition: A point-by-point rebuttal to Goeman et al. is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests_Appendix_/9699740. R code for figures is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests/9699743.
References
- 1. Wilson D. J., The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. U.S.A. 116, 1195–1200, and correction (2019), 10.1073/pnas.1914128116.
- 2. Goeman J. J., Rosenblatt J. D., Nichols T. E., The harmonic mean p-value: Strong versus weak control, and the assumption of independence. Proc. Natl. Acad. Sci. U.S.A., 10.1073/pnas.1909339116 (2019).
- 3. Davis R. A., Resnick S. I., Limit theory for bilinear processes with heavy-tailed noise. Ann. Appl. Probab. 6, 1191–1210 (1996).
- 4. Simes R. J., An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751–754 (1986).

