Proceedings of the National Academy of Sciences of the United States of America. 2019 Oct 29;116(47):23384–23385. doi: 10.1073/pnas.1910684116

Reply to Goeman et al.: Trade-offs in model averaging using multilevel tests

Daniel J. Wilson
PMCID: PMC6876214  PMID: 31662469

There were 2 errors in Wilson (1) which I have announced, but I do not accept the 4 claimed by Goeman et al. (2); I rebut them point by point on a figshare page (https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests_Appendix_/9699740). However, their letter highlights 2 limitations of the harmonic mean p-value (HMP) procedure that I discuss below with possible countermeasures.

First, they report a model of p-value dependence (herein called “GRN”) with parameter ρ=0.2. Unlike the dependence simulated in figure S4 of ref. 1, GRN dependence makes the asymptotically exact HMP anticonservative, producing a type I error rate of 0.09 when all null hypotheses are true, above the theoretical target of α=0.05. This limitation is important but it does not, as claimed, imply an error. The paper states that “the assumptions of equal weights, independence, and identical degrees of freedom can be relaxed.” A fair criticism would be that the paper did not qualify that statement sufficiently.

Equation 2.7 of Davis and Resnick (3) implies that the result that $p_{\overset{\circ}{p}} \to \overset{\circ}{p}$ as $\overset{\circ}{p} \to 0$ (equation 5 of ref. 1) holds despite dependence when

$$\Pr(p_j < x \mid p_i < x) \to 0 \quad \text{as } x \to 0 \tag{1}$$

for all p-values i ≠ j. This condition appears satisfied by GRN (Fig. 1A). Simulations confirm convergence of the asymptotically exact test (equation 4 of ref. 1) as α becomes small (Fig. 1B). Thus, Eq. 1 formalizes robustness of the HMP to dependence as α → 0, but not necessarily at α=0.05.
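As a rough numerical illustration of Eq. 1, the sketch below evaluates Pr(p_j < x | p_i < x) for one-sided p-values derived from two standard normal test statistics with correlation ρ = 0.2. This bivariate normal model is an assumption chosen for illustration; it is not the GRN model, which is defined by Goeman et al. (2). The function name `conditional_tail_prob` and the integration settings are likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def conditional_tail_prob(x, rho, steps=4000, width=10.0):
    """Pr(p_j < x | p_i < x) for upper-tail p-values of two standard
    normal statistics with correlation rho (illustrative model only)."""
    z = -STD_NORMAL.inv_cdf(x)      # p < x  <=>  Z > z
    s = sqrt(1.0 - rho * rho)       # conditional sd of Z_j given Z_i
    h = width / steps
    total = 0.0
    for k in range(steps + 1):
        u = z + k * h
        # density of Z_i at u, times Pr(Z_j > z | Z_i = u)
        f = STD_NORMAL.pdf(u) * STD_NORMAL.cdf((rho * u - z) / s)
        total += f if 0 < k < steps else 0.5 * f  # trapezoid rule
    joint = total * h               # Pr(Z_i > z and Z_j > z)
    return joint / x                # divide by Pr(p_i < x) = x


for x in (0.05, 1e-3, 1e-5, 1e-7):
    print(x, conditional_tail_prob(x, rho=0.2))
```

Under this assumed model the conditional probability shrinks as x decreases, which is the behavior Eq. 1 requires. With ρ = 0 the function returns approximately x, as expected under independence.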

Fig. 1.

Properties of the GRN model. (A) GRN satisfies Davis and Resnick’s (3) condition. (B) For small α, the asymptotically exact HMP procedure converges to the correct type I error rate ($10^8$ simulations, $L=10^5$). (C) Simes and Bonferroni are robust to GRN dependence ($10^4$ simulations, $L=10^5$). (D) Error rates for multilevel Bonferroni, Simes, and HMP procedures as a function of the number of p-values being combined. The L=1,000 normal random variables have means −2.0 and 0.0 under $H_A$ and $H_0$, respectively, in proportion 100:900 ($10^4$ simulations). R code for figures is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests/9699743.

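Second, Goeman et al. (2) mention that the significance threshold at which the HMP rejects an individual null hypothesis should be more stringent than the Bonferroni threshold, contrary to the paper. This is a special case of the error in which the criterion for declaring set R significant should be $\overset{\circ}{p}_R \le \alpha_L w_R$, rather than $\overset{\circ}{p}_R \le \alpha_{|R|} w_R$, where L is the total number of tests and $\alpha_L \le \alpha_{|R|} < \alpha$. For a single hypothesis, R = {i}, the HMP reduces to $p_i$ and the corrected criterion becomes $p_i \le \alpha_L w_i$, which is stricter than the weighted Bonferroni threshold $\alpha w_i$ because $\alpha_L < \alpha$. Thus the power of the HMP to detect significant groups of hypotheses comes at the cost of reduced power to detect individual hypotheses.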

One response is to seek a test that shares some benefits of the HMP while avoiding these issues. Multilevel versions of Bonferroni and Simes’ (4) procedures are candidates, as both are robust to GRN dependence (Fig. 1C). Their combined p-values for set R approximate the HMP by bounding it from above (SI Appendix, equations 36 and 39 of ref. 1):

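$$p_R^{\mathrm{Bonf}} = \frac{w_R}{\max_{i \in R} w_i/p_i} \;\ge\; p_R^{\mathrm{Simes}} = \frac{w_R}{\max_{i \in R} r_i^R\, w_i/p_i} \;\ge\; \overset{\circ}{p}_R = \frac{w_R}{\sum_{i \in R} w_i/p_i}. \tag{2}$$

(Here $w_R = \sum_{i \in R} w_i$, $\sum_{i=1}^{L} w_i = 1$, and $r_i^R$ ranks $w_i/p_i$ within R from largest, 1, to smallest, |R|.)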
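The three quantities in Eq. 2 can be computed directly from the p-values and weights of a set R. The following is a minimal sketch, assuming plain Python lists as inputs; the function name `combined_p_values` and the example numbers are hypothetical and are not taken from the paper or its R code.

```python
def combined_p_values(p, w):
    """Return the (Bonferroni, Simes, HMP) quantities of Eq. 2 for one set R.

    p: p-values of the hypotheses in R.
    w: their weights; the weights of all L hypotheses are assumed to sum
       to 1, so sum(w) here equals w_R.
    """
    if not p or len(p) != len(w):
        raise ValueError("p and w must be non-empty and of equal length")
    w_R = sum(w)
    # w_i / p_i sorted from largest to smallest, so position + 1 is r_i^R
    ratios = sorted((wi / pi for wi, pi in zip(w, p)), reverse=True)
    bonf = w_R / ratios[0]
    simes = w_R / max(rank * r for rank, r in enumerate(ratios, start=1))
    hmp = w_R / sum(ratios)
    return bonf, simes, hmp


# Hypothetical example: 4 of L = 10 equally weighted hypotheses
bonf, simes, hmp = combined_p_values([0.01, 0.015, 0.02, 0.5], [0.1] * 4)
print(bonf, simes, hmp)
```

With equal weights the first two quantities reduce to the familiar forms |R| min p_i and |R| min p_(r)/r, and the ordering Bonferroni ≥ Simes ≥ HMP in Eq. 2 holds for any input.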

The multilevel Bonferroni and Simes methods can therefore be interpreted as approximating the HMP’s model averaging approach. These multilevel tests control the strong-sense familywise error rate because a superset of any significant subset must also be significant at threshold $\alpha w_R$. This allows the most significant groups of p-values to be identified, so that conclusions are made at the finest resolution permitted by the data, as in the HMP procedure.
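The superset property can be checked by brute force on a small example. The sketch below enumerates every subset R of five hypothetical p-values, declares R significant when its combined p-value is at most α w_R, and verifies that every superset of a significant set is also significant. All names and numbers are assumptions made for illustration, and exhaustive enumeration is only feasible for very small L.

```python
from itertools import combinations


def bonferroni(p, w):
    """Multilevel Bonferroni combined p-value: w_R / max_i (w_i / p_i)."""
    return sum(w) / max(wi / pi for wi, pi in zip(w, p))


def simes(p, w):
    """Multilevel Simes combined p-value: w_R / max_i (r_i * w_i / p_i)."""
    ratios = sorted((wi / pi for wi, pi in zip(w, p)), reverse=True)
    return sum(w) / max(r * x for r, x in enumerate(ratios, start=1))


def all_subsets(n):
    """Every non-empty subset of the indices 0..n-1."""
    return [frozenset(c) for k in range(1, n + 1)
            for c in combinations(range(n), k)]


def significant_sets(p, w, alpha, combine):
    """Subsets R whose combined p-value is at most alpha * w_R."""
    hits = set()
    for R in all_subsets(len(p)):
        idx = sorted(R)
        p_R = [p[i] for i in idx]
        w_R = [w[i] for i in idx]
        if combine(p_R, w_R) <= alpha * sum(w_R):
            hits.add(R)
    return hits


def closed_under_supersets(hits, n):
    """True if every strict superset of a significant set is significant."""
    return all(S in hits for R in hits for S in all_subsets(n) if R < S)


# Hypothetical data: L = 5 hypotheses with equal weights summing to 1
p = [0.001, 0.015, 0.018, 0.6, 0.3]
w = [0.2] * 5
alpha = 0.05
bonf_hits = significant_sets(p, w, alpha, bonferroni)
simes_hits = significant_sets(p, w, alpha, simes)
print(closed_under_supersets(bonf_hits, 5), closed_under_supersets(simes_hits, 5))
```

In this example the pair {1, 2} is significant under Simes but not under Bonferroni, which reflects the ordering of the combined p-values in Eq. 2.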

Multilevel HMP, Simes, and Bonferroni procedures all have lower power (higher type II error rates) for combining small proportions of p-values, the HMP slightly more so. All procedures have higher power for combining large proportions of p-values, the HMP considerably more so (Fig. 1D). Thus, multilevel Bonferroni, Simes, and HMP procedures all offer some benefits of model averaging with different trade-offs in terms of the power of their combined tests and their robustness to dependence.

Acknowledgments

D.J.W. is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (Grant 101237/Z/13/B), and is supported by a Big Data Institute Robertson Fellowship. I thank Jacob Armstrong for comments.

Footnotes

The author declares no competing interest.

References

  • 1. Wilson D. J., The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. U.S.A. 116, 1195–1200, and correction (2019), 10.1073/pnas.1914128116.
  • 2. Goeman J. J., Rosenblatt J. D., Nichols T. E., The harmonic mean p-value: Strong versus weak control, and the assumption of independence. Proc. Natl. Acad. Sci. U.S.A., 10.1073/pnas.1909339116 (2019).
  • 3. Davis R. A., Resnick S. I., Limit theory for bilinear processes with heavy-tailed noise. Ann. Appl. Probab. 6, 1191–1210 (1996).
  • 4. Simes R. J., An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751–754 (1986).
