Reply to Goeman et al.: Trade-offs in model averaging using multilevel tests

Daniel J Wilson

Proc Natl Acad Sci U S A. 2019 Oct 29;116(47):23384–23385. doi: 10.1073/pnas.1910684116

PMCID: PMC6876214 PMID: 31662469

Wilson (1) contained 2 errors, which I have announced, but I do not accept the 4 claimed by Goeman et al. (2); I rebut them point by point on a figshare page (https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests_Appendix_/9699740). However, their letter highlights 2 limitations of the harmonic mean $p$-value (HMP) procedure that I discuss below with possible countermeasures.

First, they report a model of $p$-value dependence (herein called “GRN”) with parameter $\rho = 0.2$. Unlike the dependence simulated in figure S4 of ref. 1, GRN dependence makes the asymptotically exact HMP anticonservative, producing a type I error rate of 0.09 when all null hypotheses are true, above the theoretical target of $\alpha = 0.05$. This limitation is important, but it does not, as claimed, imply an error. The paper states that “the assumptions of equal weights, independence, and identical degrees of freedom can be relaxed.” A fair criticism would be that the paper did not qualify that statement sufficiently.

Equation 2.7 of Davis and Resnick (3) implies that the result that $p_{\overset{\circ}{p}} \to \overset{\circ}{p}$ as $\overset{\circ}{p} \to 0$ (equation 5 of ref. 1) holds despite dependence when

$$\Pr(p_{j} < x \mid p_{i} < x) \to 0 \quad \text{as } x \to 0$$

[1]

for all pairs of $p$-values $i \neq j$. This condition appears satisfied by GRN (Fig. 1A). Simulations confirm convergence of the asymptotically exact test (equation 4 of ref. 1) as $\alpha$ becomes small (Fig. 1B). Thus, Eq. 1 formalizes robustness of the HMP to dependence as $\alpha \to 0$, but not necessarily at $\alpha = 0.05$.
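The condition in Eq. 1 can be explored numerically. The sketch below is not the GRN model, whose definition is not reproduced here. As a stand-in, it assumes two one-sided $p$-values from standard normal test statistics with correlation `rho` (a hypothetical choice made only for illustration) and estimates $\Pr(p_{j} < x \mid p_{i} < x)$ by Monte Carlo. The function name and defaults are likewise illustrative.

```python
import random
from statistics import NormalDist


def conditional_small_p(x, rho, n, seed=1):
    """Monte Carlo estimate of Pr(p_j < x | p_i < x).

    Assumption (not from the source): p_i and p_j are one-sided p-values
    of bivariate standard normal statistics with correlation rho.
    """
    rng = random.Random(seed)
    # A one-sided p-value is below x exactly when z exceeds this quantile.
    z_crit = NormalDist().inv_cdf(1.0 - x)
    scale = (1.0 - rho ** 2) ** 0.5
    n_first = 0  # draws with p_i < x
    n_both = 0   # draws with p_i < x and p_j < x
    for _ in range(n):
        z_i = rng.gauss(0.0, 1.0)
        z_j = rho * z_i + scale * rng.gauss(0.0, 1.0)
        if z_i > z_crit:
            n_first += 1
            if z_j > z_crit:
                n_both += 1
    return n_both / n_first if n_first else float("nan")


if __name__ == "__main__":
    for x in (0.1, 0.03, 0.01):
        print(x, conditional_small_p(x, rho=0.2, n=100_000))
```

Under this assumed dependence the estimate shrinks as `x` decreases, which is the behavior Eq. 1 requires. Whether a given dependence model satisfies the condition has to be checked for that model.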

Second, Goeman et al. (2) mention that the significance threshold at which the HMP rejects an individual null hypothesis should be more stringent than the Bonferroni threshold, contrary to the paper. This is a special case of the error in which the criterion for declaring set $R$ significant should be ${\overset{\circ}{p}}_{R} \leq \alpha_{L} w_{R}$, rather than ${\overset{\circ}{p}}_{R} \leq \alpha_{| R |} w_{R}$, where $L$ is the total number of tests and $\alpha_{L} \leq \alpha_{| R |} < \alpha$. Thus the power of the HMP to detect significant groups of hypotheses comes at the cost of reduced power to detect individual hypotheses.

One response is to seek a test that shares some benefits of the HMP while avoiding these issues. Multilevel versions of Bonferroni and Simes’ (4) procedures are candidates, as both are robust to GRN dependence (Fig. 1C). Their combined $p$-values for set $R$ approximate the HMP by bounding it from above (SI Appendix, equations 36 and 39 of ref. 1):

\begin{align}
p_{R}^{\mathrm{Bonf}} & = \frac{w_{R}}{\max_{i \in R} \{w_{i} / p_{i}\}} \\
p_{R}^{\mathrm{Simes}} & = \frac{w_{R}}{\max_{i \in R} \{r_{i R} w_{i} / p_{i}\}} \\
{\overset{\circ}{p}}_{R} & = \frac{w_{R}}{\sum_{i \in R} w_{i} / p_{i}}
\end{align}

[2]

(Here $w_{R} = \sum_{i \in R} w_{i}$, $\sum_{i = 1}^{L} w_{i} = 1$, and $r_{i R}$ ranks $w_{i} / p_{i}$ within $R$ from largest, 1, to smallest, $| R |$.)
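The three combined $p$-values in Eq. 2 can be computed directly from the definitions. The following is a minimal sketch rather than the published R code. The function name `combined_p_values` and the example inputs are illustrative assumptions.

```python
def combined_p_values(p, w):
    """Return (Bonferroni, Simes, HMP) combined p-values for a set R.

    p: p-values of the tests in R; w: their weights (the weights of all
    L tests are assumed to sum to 1, so w_R = sum(w) may be below 1).
    """
    if len(p) != len(w) or len(p) == 0:
        raise ValueError("p and w must be non-empty and of equal length")
    w_R = sum(w)
    # Ratios w_i / p_i sorted from largest (rank 1) to smallest (rank |R|).
    ratios = sorted((w_i / p_i for w_i, p_i in zip(w, p)), reverse=True)
    p_bonf = w_R / ratios[0]
    p_simes = w_R / max(rank * r for rank, r in enumerate(ratios, start=1))
    p_hmp = w_R / sum(ratios)
    return p_bonf, p_simes, p_hmp


if __name__ == "__main__":
    # Hypothetical example: 4 equally weighted tests forming the whole family.
    print(combined_p_values([0.01, 0.02, 0.5, 0.8], [0.25] * 4))
```

Because the sum of the ratios is at least the largest rank-weighted ratio, which in turn is at least the largest ratio, the outputs always satisfy HMP ≤ Simes ≤ Bonferroni, which is the bounding relationship described above.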

The multilevel Bonferroni and Simes methods can therefore be interpreted as approximating the HMP’s model averaging approach. These multilevel tests control the strong-sense familywise error rate because a superset of any significant subset must also be significant at threshold $\alpha w_{R}$. This allows the most significant groups of $p$-values to be identified, so that conclusions are made at the finest resolution permitted by the data, as in the HMP procedure.
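The superset property can be checked by brute force for a small family. The sketch below assumes the multilevel Simes rule "declare $R$ significant when $p_{R}^{\mathrm{Simes}} \leq \alpha w_{R}$", which is equivalent to $\max_{i \in R} \{r_{i R} w_{i} / p_{i}\} \geq 1/\alpha$. The helper names and the example $p$-values are hypothetical.

```python
from itertools import combinations


def simes_rejects(p, w, subset, alpha):
    """True if the multilevel Simes p-value of `subset` is <= alpha * w_R."""
    ratios = sorted((w[i] / p[i] for i in subset), reverse=True)
    best = max(rank * r for rank, r in enumerate(ratios, start=1))
    return best >= 1.0 / alpha


def significant_subsets(p, w, alpha):
    """All non-empty index subsets (as sorted tuples) that are significant."""
    idx = range(len(p))
    return {
        s
        for k in range(1, len(p) + 1)
        for s in combinations(idx, k)
        if simes_rejects(p, w, s, alpha)
    }


def closed_under_supersets(sig, n):
    """Check that adding any one index to a significant set keeps it significant."""
    return all(
        tuple(sorted(set(s) | {j})) in sig for s in sig for j in range(n)
    )


if __name__ == "__main__":
    p = [0.02, 0.02, 0.5, 0.9]  # hypothetical p-values
    w = [0.25] * 4              # equal weights summing to 1
    sig = significant_subsets(p, w, alpha=0.05)
    print(sorted(sig, key=len))
    print(closed_under_supersets(sig, len(p)))
```

In this example neither of the two small $p$-values is significant alone, but the pair is, and every superset of the pair is significant too, so the pair is the finest-resolution conclusion available.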

Multilevel HMP, Simes, and Bonferroni procedures all have lower power (higher type II error rates) for combining small proportions of $p$-values, the HMP slightly more so. All procedures have higher power for combining large proportions of $p$-values, the HMP considerably more so (Fig. 1D). Thus, multilevel Bonferroni, Simes, and HMP procedures all offer some benefits of model averaging with different trade-offs in terms of the power of their combined tests and their robustness to dependence.

Acknowledgments

D.J.W. is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (Grant 101237/Z/13/B), and is supported by a Big Data Institute Robertson Fellowship. I thank Jacob Armstrong for comments.

Footnotes

The author declares no competing interest.

Data deposition: A point-by-point rebuttal to Goeman et al. is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests_Appendix_/9699740. R code for figures is available at https://figshare.com/articles/Trade-offs_in_model_averaging_using_multilevel_tests/9699743.

References

1. Wilson D. J., The harmonic mean p-value for combining dependent tests. Proc. Natl. Acad. Sci. U.S.A. 116, 1195–1200, and correction (2019), 10.1073/pnas.1914128116.
2. Goeman J. J., Rosenblatt J. D., Nichols T. E., The harmonic mean p-value: Strong versus weak control, and the assumption of independence. Proc. Natl. Acad. Sci. U.S.A., 10.1073/pnas.1909339116 (2019).
3. Davis R. A., Resnick S. I., Limit theory for bilinear processes with heavy-tailed noise. Ann. Appl. Probab. 6, 1191–1210 (1996).
4. Simes R. J., An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751–754 (1986).
