Abstract
Traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including theoretical, permutation-based and empirical ones, have inherent drawbacks. For example, the theoretical null might fail because of improper assumptions about the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach, named the target-decoy procedure, builds only on the ordering of tests by some statistic or score, whose null distribution need not be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the number of false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. The approach is also evaluated on two real datasets: an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
Keywords: multiple testing, false discovery rate, null distribution-free, p-value-free, decoy permutations, knockoff filter
Acknowledgments
The authors would like to thank Xiaoya Sun for her help in data analysis.
Footnotes
This paper is supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014) and the K.C. Wong Education Foundation.
Supplementary Materials
The supplementary material provides the proofs of theorems in the main text.
Software package
The R package for the target-decoy procedure can be downloaded from http://fugroup.amss.ac.cn/software/TDFDR/TDFDR.html.
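For readers who want a feel for the idea before installing the package, the following is a minimal Python sketch of a generic target-decoy competition for case-control data. It is an illustration under stated assumptions, not the authors' implementation and not the interface of the R package: the function name `target_decoy_select`, the absolute mean-difference score, the one-permutation-per-test design, the coin-flip tie rule and the `(decoys + 1) / targets` estimate are all choices made here for illustration.

```python
import numpy as np

def target_decoy_select(X, labels, alpha=0.1, seed=0):
    """Hypothetical sketch of target-decoy competition (not the TDFDR API).

    X      : array of shape (m tests, n samples)
    labels : boolean array of length n, True marks a case sample
    alpha  : nominal FDR level
    Returns the sorted indices of the tests reported as discoveries.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def score(values, is_case):
        # Assumed score: absolute difference of group means, which is
        # unchanged when the two groups are swapped (a symmetric score).
        return abs(values[is_case].mean() - values[~is_case].mean())

    m = X.shape[0]
    win_score = np.empty(m)
    is_target = np.empty(m, dtype=bool)
    for j in range(m):
        t = score(X[j], labels)                   # target: true labels
        d = score(X[j], rng.permutation(labels))  # decoy: permuted labels
        # Competition: the larger score wins; ties are broken by a coin flip.
        is_target[j] = (rng.random() < 0.5) if t == d else (t > d)
        win_score[j] = max(t, d)

    # Rank the winners and estimate the FDR within each top-k prefix
    # from the number of decoy winners it contains.
    order = np.argsort(-win_score, kind="stable")
    targets_cum = np.cumsum(is_target[order])
    decoys_cum = np.cumsum(~is_target[order])
    fdr_hat = (decoys_cum + 1) / np.maximum(targets_cum, 1)

    ok = np.nonzero(fdr_hat <= alpha)[0]
    if ok.size == 0:
        return np.array([], dtype=int)
    top = order[: ok[-1] + 1]          # largest prefix meeting the level
    return np.sort(top[is_target[top]])
```

Calling `target_decoy_select(X, labels, alpha=0.05)` on an expression matrix with one row per gene or protein returns the row indices of the reported discoveries. Only the ranking of scores matters in this sketch, so no p-values or null distributions are computed at any point.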