This is a preprint.

It has not yet been peer reviewed by a journal.

bioRxiv [Preprint]. 2023 Aug 29:2023.08.28.555191. [Version 1] doi: 10.1101/2023.08.28.555191

ClipperQTL: ultrafast and powerful eGene identification method

Heather J Zhou 1, Xinzhou Ge 1,2, Jingyi Jessica Li 1,3,4,5,*
PMCID: PMC10491229  PMID: 37693523

Abstract

A central task in expression quantitative trait locus (eQTL) analysis is to identify cis-eGenes (henceforth “eGenes”), i.e., genes whose expression levels are regulated by at least one local genetic variant. Among the existing eGene identification methods, FastQTL is considered the gold standard but is computationally expensive as it requires thousands of permutations for each gene. Alternative methods such as eigenMT and TreeQTL have lower power than FastQTL. In this work, we propose ClipperQTL, which reduces the number of permutations needed from thousands to 20 for data sets with large sample sizes (> 450) by using the contrastive strategy developed in Clipper; for data sets with smaller sample sizes, it uses the same permutation-based approach as FastQTL. We show that ClipperQTL performs as well as FastQTL and runs about 500 times faster if the contrastive strategy is used and 50 times faster if the conventional permutation-based approach is used. The R package ClipperQTL is available at https://github.com/heatherjzhou/ClipperQTL.

1. Introduction

Molecular quantitative trait locus (molecular QTL, henceforth “QTL”) analysis investigates the relationship between genetic variants and molecular traits, potentially explaining findings in genome-wide association studies [1, 2]. Based on the type of molecular phenotype studied, QTL analyses can be categorized into gene expression QTL (eQTL) analyses [3, 4], alternative splicing QTL (sQTL) analyses [4], three prime untranslated region alternative polyadenylation QTL (3′aQTL) analyses [5], and so on [1, 2]. Among these categories, eQTL analyses, which investigate the association between genetic variants and gene expression levels, are the most common. Therefore, in this work, we focus on eQTL analyses as an example, although everything discussed here is applicable to other types of QTL analyses as well.

A central task in eQTL analysis is to identify cis-eGenes (henceforth “eGenes”), i.e., genes whose expression levels are regulated by at least one local genetic variant. This presents a multiple-testing challenge: not only are there many candidate genes, but each gene can also have up to tens of thousands of local genetic variants, and these variants are often in linkage disequilibrium (i.e., associated) with one another.

Existing eGene identification methods include FastQTL [6], eigenMT [7], and TreeQTL [8]. All three methods share the same two-step approach: first, obtain a gene-level p-value for each gene; second, apply a false discovery rate (FDR) control method on the gene-level p-values to call eGenes. The key difference between the three methods lies in how the gene-level p-values are obtained.
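The second step of this two-step approach can be illustrated with a short Python sketch. The text above does not say which FDR control method each tool applies, so the Benjamini-Hochberg step-up procedure [18] is used here purely as a familiar stand-in; the function name and the toy p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals, target_fdr=0.05):
    """Return a boolean mask of discoveries under the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank k with p_(k) <= (k / m) * target_fdr.
    below = ranked <= (np.arange(1, m + 1) / m) * target_fdr
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        # Every gene ranked at or before k is called a discovery.
        discoveries[order[:k + 1]] = True
    return discoveries

# Step 1 (not shown) would produce one gene-level p-value per gene.
gene_pvals = [0.001, 0.30, 0.012, 0.04, 0.75]  # hypothetical values
# Step 2: call eGenes at the target FDR.
is_egene = benjamini_hochberg(gene_pvals, target_fdr=0.05)
```

In this toy input, only the first and third genes are called; the fourth gene (p = 0.04) is not, because its p-value exceeds its rank-based threshold of 0.03.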

Among the existing eGene identification methods, FastQTL [6] is considered the gold standard and is currently the most popular. It uses permutations to obtain gene-level p-values. There are four main ways to use FastQTL, depending on (1) whether the direct or the adaptive permutation scheme is used and (2) whether proportions or beta approximation is used (Table 1). The default way of using FastQTL is to use the adaptive permutation scheme with beta approximation [4, 6]. The adaptive permutation scheme means the number of permutations is chosen adaptively for each gene (between 1000 and 10,000 by default [4, 6]); the beta approximation helps produce higher-resolution gene-level p-values given the numbers of permutations (Algorithm S1). The main drawback of FastQTL is the lack of computational efficiency as it requires thousands of permutations for each gene. A faster implementation of FastQTL named tensorQTL has been developed [9], but it relies on graphics processing units (GPUs), which are not universally available.
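To make the permutation idea concrete, here is a minimal Python sketch of a gene-level p-value computed as a proportion under a fixed number of permutations (the direct scheme). It is not FastQTL's implementation: the test statistic (largest absolute Pearson correlation across local SNPs), the add-one correction, the absence of covariates, and all names are assumptions made for illustration, and the beta approximation is not shown.

```python
import numpy as np

def max_abs_corr(y, S):
    """Largest absolute Pearson correlation between y and any column of S."""
    yc = y - y.mean()
    Sc = S - S.mean(axis=0)
    num = Sc.T @ yc
    den = np.linalg.norm(Sc, axis=0) * np.linalg.norm(yc)
    return np.max(np.abs(num / den))

def permutation_pvalue(y, S, B=1000, seed=0):
    """Proportion-based gene-level p-value from B permutations of y."""
    rng = np.random.default_rng(seed)
    observed = max_abs_corr(y, S)
    # Permuting expression breaks any SNP-expression association
    # while keeping the correlation structure among SNPs intact.
    null = np.array([max_abs_corr(rng.permutation(y), S) for _ in range(B)])
    return (1 + np.sum(null >= observed)) / (B + 1)

rng = np.random.default_rng(42)
n, q = 200, 30
S = rng.binomial(2, 0.3, size=(n, q)).astype(float)  # toy genotypes
y_null = rng.normal(size=n)                  # gene with no local effect
y_alt = 0.8 * S[:, 0] + rng.normal(size=n)   # gene driven by the first SNP

p_null = permutation_pvalue(y_null, S, B=200)
p_alt = permutation_pvalue(y_alt, S, B=200)
```

With a proportion-based p-value, the smallest attainable value is 1/(B + 1), which is one reason the number of permutations matters for resolution.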

Table 1:

Summary of the 11 eGene identification methods we compare. Details of these methods can be found in Sections 4.2 and S1.

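|    | Method category (A) | Method (B)             | Note (C)               | Method name for speed comparison (D) |
|----|---------------------|------------------------|------------------------|--------------------------------------|
| 1  | Matrix eQTL         | Matrix eQTL            |                        |                                      |
| 2  | FastQTL             | FastQTL_1K-10K_prop    |                        | FastQTL_1K-10K                       |
| 3  |                     | FastQTL_1K-10K_beta    | Default FastQTL method |                                      |
| 4  |                     | FastQTL_1K_prop        |                        | FastQTL_1K                           |
| 5  |                     | FastQTL_1K_beta        |                        |                                      |
| 6  | eigenMT             | eigenMT                |                        | eigenMT                              |
| 7  | TreeQTL             | TreeQTL_BY             | Default TreeQTL method | TreeQTL                              |
| 8  |                     | TreeQTL_Storey         |                        |                                      |
| 9  | ClipperQTL          | ClipperQTL_standard_1K |                        | ClipperQTL_standard_1K               |
| 10 |                     | ClipperQTL_Clipper_20  |                        | ClipperQTL_Clipper_20                |
| 11 |                     | ClipperQTL_Clipper_50  |                        | ClipperQTL_Clipper_50                |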

eigenMT [7] and TreeQTL [8] have been proposed as faster alternatives to FastQTL. Neither method uses permutations. In a nutshell, eigenMT applies a Bonferroni correction to obtain a gene-level p-value for each gene, but instead of using the actual number of local genetic variants, it uses an estimate of the effective number of local genetic variants, obtained (conceptually speaking) by performing a principal component analysis. TreeQTL, on the other hand, uses Simes’ rule [10] to calculate a gene-level p-value for each gene. Our analysis shows that both eigenMT and TreeQTL have lower power than FastQTL (Figures 1 and 3).
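The two permutation-free gene-level p-values described above can be sketched in a few lines of Python. This is only an illustration of the underlying formulas, not the eigenMT or TreeQTL code: the effective number of variants `m_eff` is taken as given (its estimation is not shown), and the function names and SNP-level p-values are hypothetical.

```python
import numpy as np

def bonferroni_effective(snp_pvals, m_eff):
    """Bonferroni-style gene-level p-value using an effective number of tests."""
    return min(1.0, m_eff * float(np.min(snp_pvals)))

def simes(snp_pvals):
    """Simes' rule: min over i of q * p_(i) / i, with p_(i) the sorted p-values."""
    p = np.sort(np.asarray(snp_pvals, dtype=float))
    q = p.size
    return float(np.min(q * p / np.arange(1, q + 1)))

snp_pvals = [0.03, 0.002, 0.5, 0.004]  # hypothetical SNP-level p-values for one gene

p_bonf_all = bonferroni_effective(snp_pvals, m_eff=len(snp_pvals))  # 4 * 0.002
p_bonf_eff = bonferroni_effective(snp_pvals, m_eff=2)               # 2 * 0.002
p_simes = simes(snp_pvals)  # min(0.008, 0.008, 0.04, 0.5)
```

When local variants are correlated, the effective number of tests is smaller than the actual number, so the gene-level p-value is less conservative than under plain Bonferroni correction.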

Figure 1:

Number of eGenes comparison based on GTEx expression data [4] (Table 1; see Section 2.1 for the analysis details). Each dot corresponds to a tissue. The x-axis and y-axis both represent numbers of eGenes identified by different methods. Diagonal lines through the origin are shown to help with visualization. a-c The four variants of FastQTL identify almost the same numbers of eGenes as one another. d-f eigenMT and TreeQTL methods identify fewer eGenes than FastQTL. g-i ClipperQTL methods identify almost the same numbers of eGenes as FastQTL in tissues with the appropriate sample sizes (Section 4.2). We use 465 as the sample size cutoff because the next largest sample size is 396. See Figure S2 for an analysis of the overlap between identified eGenes.

Figure 3:

Power and FDR comparison of all 11 methods based on our simulation study (Table 1; Section 2.2). The target FDR is set at 0.05 (grey shaded area in b). The height of each bar represents the average across simulated data sets. Error bars indicate standard errors. In a, a horizontal line at the height of the bar for FastQTL_1K-10K_beta is shown to help with visualization. All methods except Matrix eQTL can approximately control the FDR. FastQTL and ClipperQTL methods have higher power than eigenMT and TreeQTL methods.

Clipper [11] is a p-value-free FDR control method. Given a large number of features (e.g., genes), a number of measurements under the experimental (e.g., treatment) condition, and a number of measurements under the background (e.g., control) condition, Clipper works as follows: first, obtain a contrast score for each feature based on the experimental and background measurements (for example, the contrast score may be the average experimental measurement minus the average background measurement); second, given a target FDR (e.g., 0.05), obtain a cutoff for the contrast scores; lastly, call the features with contrast scores above the cutoff as discoveries. The idea is that the contrast scores of the uninteresting features (e.g., genes whose expected expression levels are not increased by the treatment) will be roughly symmetrically distributed around zero, and the outlying contrast scores in the right tail likely belong to interesting features. Notably, Clipper produces a q-value for each feature (similar to Storey’s q-values [12]), so that the features can be ranked from the most significant to the least significant.
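The cutoff step can be sketched as follows. The text above only says that a cutoff is obtained for a given target FDR; the specific rule used below, which uses the left tail of the contrast scores to estimate the number of false discoveries in the right tail, is an assumption made for illustration, and the function name and simulated scores are hypothetical.

```python
import numpy as np

def contrast_cutoff(scores, target_fdr=0.05):
    """Smallest t with (1 + #{C <= -t}) / max(#{C >= t}, 1) <= target_fdr."""
    c = np.asarray(scores, dtype=float)
    candidates = np.sort(np.unique(np.abs(c[c != 0])))
    for t in candidates:
        # Scores below -t mirror the uninteresting scores expected above t,
        # because uninteresting scores are roughly symmetric around zero.
        est_fdp = (1 + np.sum(c <= -t)) / max(np.sum(c >= t), 1)
        if est_fdp <= target_fdr:
            return float(t)
    return float("inf")  # no cutoff reaches the target; nothing is called

rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(0.0, 1.0, size=2000),  # uninteresting features, symmetric around 0
    rng.normal(5.0, 1.0, size=300),   # interesting features, shifted to the right
])
cutoff = contrast_cutoff(scores, target_fdr=0.05)
discoveries = scores >= cutoff
```

Because the numerator is at least 1, any finite cutoff at a target FDR of 0.05 implies at least 20 discoveries, which hints at why a contrastive approach suits problems with many features.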

In this work, we propose ClipperQTL for eGene identification, which reduces the number of permutations needed from thousands to 20 for data sets with large sample sizes (> 450) by using the contrastive strategy developed in Clipper; for data sets with smaller sample sizes, it uses the same permutation-based approach as FastQTL. Unlike tensorQTL, our ClipperQTL software does not rely on GPUs. We show that ClipperQTL performs as well as FastQTL and runs about 500 times faster if the contrastive strategy is used and 50 times faster if the conventional permutation-based approach is used (we refer to the two variants of ClipperQTL as the Clipper variant and the standard variant, respectively; Section 4.2).

2. Results

2.1. Real data results

We compare the performance and run time of different variants of FastQTL, eigenMT, TreeQTL, and ClipperQTL (Table 1) on the most recent GTEx expression data [4]. The 49 tissues with sample sizes above 70 are considered [4]. For each gene, we consider single nucleotide polymorphisms (SNPs) within one megabase (Mb) of the transcription start site (TSS) of the gene [4]; we use 0.01 as the threshold for the minor allele frequency (MAF) of a SNP and 10 as the threshold for the number of samples with at least one copy of the minor allele (MA samples) [6]. We include eight known covariates and a number of top expression PCs (principal components) as inferred covariates [13]. The eight known covariates are the top five genotype PCs, WGS sequencing platform (HiSeq 2000 or HiSeq X), WGS library construction protocol (PCR-based or PCR-free), and donor sex [4]. The number of expression PCs is chosen via the Buja and Eyuboglu (BE) algorithm [13, 14] for each tissue. We use the BE algorithm because we find that in our simulated data (Section S2), the BE algorithm can recover the true number of covariates well. The target FDR for eGene identification is set at 0.05. We do not include Matrix eQTL [15] in our real data comparison because both our simulation study (Section 2.2) and Huang et al. [16] show that Matrix eQTL cannot control the FDR in the eGene identification problem.
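The SNP filters described above (MAF threshold of 0.01 and at least 10 MA samples) can be sketched in Python as follows. The genotype coding (alternative allele counts 0, 1, 2 with no missing values), the inclusive thresholds, and the function name are assumptions for illustration rather than details taken from the analysis code.

```python
import numpy as np

def filter_snps(G, maf_threshold=0.01, ma_samples_threshold=10):
    """Return a boolean mask over the columns (SNPs) of the n x q matrix G."""
    G = np.asarray(G)
    alt_freq = G.mean(axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    # Count samples carrying at least one copy of the minor allele,
    # whichever of the two alleles happens to be the minor one.
    alt_is_minor = alt_freq <= 0.5
    carriers = np.where(alt_is_minor, (G > 0).sum(axis=0), (G < 2).sum(axis=0))
    return (maf >= maf_threshold) & (carriers >= ma_samples_threshold)

n = 100
G = np.zeros((n, 4), dtype=int)
G[:12, 0] = 1   # 12 heterozygous samples: MAF 0.06, 12 MA samples
G[:5, 1] = 1    # 5 heterozygous samples: MAF 0.025, only 5 MA samples
G[:, 2] = 2     # alternative allele is the major allele here
G[:15, 2] = 1   # 15 samples carry the minor (reference) allele
# Column 3 stays all zeros: monomorphic SNP

keep = filter_snps(G)
```

The second SNP passes the MAF threshold but fails the MA-sample threshold, which shows why the two filters are not redundant at small sample sizes.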

The results from our real data analysis are summarized in Figures 1, 2, and S2. We find that the four variants of FastQTL produce almost identical results. Specifically, the numbers of eGenes identified by the four methods are almost identical (Figure 1), and the identified eGenes highly overlap (Figure S2). This means the adaptive permutation scheme and the beta approximation of FastQTL (Section S1.2) are not critical to the performance of FastQTL; the simplest variant, FastQTL_1K_prop, is sufficient. Further, we find that eigenMT and TreeQTL methods identify fewer eGenes than FastQTL (Figure 1). In contrast, ClipperQTL methods produce results almost identical to those of FastQTL in tissues with the appropriate sample sizes (Section 4.2; Figures 1 and S2).

Figure 2:

Run time comparison based on GTEx expression data [4] (Table 1; see Section 2.1 for the analysis details). Each dot corresponds to a tissue. FastQTL_1K-10K takes under 500 CPU hours. FastQTL_1K takes under 50 CPU hours. ClipperQTL_standard_1K takes under 10 CPU hours. ClipperQTL_Clipper_20 takes under 1 CPU hour. Run times of ClipperQTL_Clipper_20 and ClipperQTL_Clipper_50 are only shown for tissues with sample sizes ≥ 465 (Figure 1).

In terms of run time (Figure 2), we find that eigenMT has almost no computational advantage over FastQTL, and TreeQTL has no computational advantage over the standard variant of ClipperQTL (which is slower than the Clipper variant of ClipperQTL). Both the standard variant and the Clipper variant of ClipperQTL are orders of magnitude faster than FastQTL. In particular, the standard variant of ClipperQTL is about five times faster than FastQTL_1K_prop, the simplest FastQTL method, even though the algorithms are equivalent (Section 4.2); we attribute this to differences in software implementation. Compared to the default FastQTL method, the standard variant and the Clipper variant of ClipperQTL are about 50 times and 500 times faster, respectively.

2.2. Simulation results

In our simulation study, we roughly follow the data simulation in the second, more realistic simulation design of Zhou et al. [13], which roughly follows the data simulation in Wang et al. [17]. We simulate three data sets in total. Each data set is simulated according to Algorithm S5 with sample size $n=838$, number of genes $p=1000$, number of covariates $\tilde{K}=20$, proportion of variance explained by genotype in eGenes $\mathrm{PVE}_{\mathrm{Genotype}}=0.02$, and proportion of variance explained by covariates $\mathrm{PVE}_{\mathrm{Covariates}}=0.5$. All covariates are assumed to be known covariates.
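To illustrate what the two proportion-of-variance-explained parameters mean, the following Python sketch combines a genotype component, a covariate component, and noise so that they contribute roughly 2%, 50%, and the remaining 48% of the variance of one gene's expression. This is not Algorithm S5; the standardize-and-scale construction, the toy inputs, and all names are assumptions.

```python
import numpy as np

def mix_components(genotype_part, covariate_part, noise,
                   pve_genotype=0.02, pve_covariates=0.5):
    """Scale three components so their variances match the target proportions."""
    def unit(v):
        v = v - v.mean()
        return v / v.std()
    pve_noise = 1.0 - pve_genotype - pve_covariates
    return (np.sqrt(pve_genotype) * unit(genotype_part)
            + np.sqrt(pve_covariates) * unit(covariate_part)
            + np.sqrt(pve_noise) * unit(noise))

rng = np.random.default_rng(0)
n, K = 838, 20
genotype_part = rng.binomial(2, 0.3, size=n).astype(float)   # one causal SNP
covariate_part = rng.normal(size=(n, K)) @ rng.normal(size=K)  # combined covariate effect
noise = rng.normal(size=n)

y = mix_components(genotype_part, covariate_part, noise)
total_var = np.var(y)
```

The realized proportions are only approximate in any one data set because the three components are not exactly uncorrelated in a finite sample.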

The results from our simulation study are summarized in Figure 3. We confirm the finding in Huang et al. [16] that Matrix eQTL cannot control the FDR in the eGene identification problem. All other methods can approximately control the FDR. Further, FastQTL and ClipperQTL methods have higher power than eigenMT and TreeQTL methods, consistent with our real data results (Section 2.1).

3. Discussion

We have shown that ClipperQTL achieves a 500-fold or 50-fold increase in computational efficiency compared to FastQTL (depending on the variant used) without sacrificing power or precision. In contrast, other alternatives to FastQTL such as eigenMT and TreeQTL have lower power than FastQTL.

We propose two main variants of ClipperQTL: the standard variant and the Clipper variant. The standard variant is equivalent to FastQTL with the direct permutation scheme and proportions (Algorithm S1) and is suitable for a wide range of sample sizes. The Clipper variant uses the contrastive strategy developed in Clipper [11] (Algorithm 1) and is only recommended for data sets with large sample sizes (> 450).

Regarding which variant of ClipperQTL should be used when the sample size is large enough (> 450), we believe that if computational efficiency is a priority, then the Clipper variant should be used. However, if the study also contains smaller data sets, then the researcher may choose to use the standard variant on all data sets for consistency.

A possible extension of ClipperQTL lies in trans-eGene identification. Compared to cis-eGenes, trans-eGenes are currently identified in very small numbers [4], possibly due to the lack of power of existing approaches. Since the Clipper variant of ClipperQTL only needs 20 permutations for optimal performance and using only one permutation works almost as well (Section S3), there may be potential for ClipperQTL to be adapted for trans-eGene identification.

The R package ClipperQTL is available at https://github.com/heatherjzhou/ClipperQTL. Our work demonstrates the potential of the contrastive strategy developed in Clipper and provides a simpler and more efficient way of identifying cis-eGenes.

4. Methods

4.1. Problem

Here we describe the eGene identification problem and introduce the notation used in this work.

The input data are as follows. Let $Y$ denote the $n\times p$ fully processed gene expression matrix with $n$ samples and $p$ genes. For gene $j$, $j=1,\dots,p$, the relevant genotype data is stored in $S_j$, the $n\times q_j$ genotype matrix, where each column of $S_j$ corresponds to a local common SNP for gene $j$ (conceptually speaking; in reality, all genotype data may be stored in one file). Let $X$ denote the $n\times K$ covariate matrix with $K$ covariates. Using our analysis of GTEx’s Colon - Transverse expression data [4] (Section 2.1) as an example, we have $n=368$, $p=25{,}379$, $q_j$ typically under 15,000, and $K=37$, including eight known covariates and 29 inferred covariates (the number of inferred covariates is chosen via the BE algorithm [13, 14]; Section 2.1).

The assumption is that for $j=1,\dots,p$, $Y_{\cdot,j}$, the $j$th column of $Y$, is a realization of the following random vector:

$$\underset{n\times 1}{\mathbb{1}}\;\underset{1\times 1}{\beta_{0j}}+\underset{n\times q_j}{S_j}\;\underset{q_j\times 1}{\beta_{1j}}+\underset{n\times\tilde{K}}{\tilde{X}}\;\underset{\tilde{K}\times 1}{\beta_{2j}}+\underset{n\times 1}{\varepsilon_j}, \tag{1}$$

where $\mathbb{1}$ denotes the $n\times 1$ matrix of ones, $S_j$ is defined as above, $\tilde{X}$ is the true covariate matrix (which $X$ tries to capture), all entries of $\beta_{0j}$, $\beta_{1j}$, and $\beta_{2j}$ are fixed but unknown parameters, and $\varepsilon_j$ is the random noise. In particular, it is assumed that at most a small number of entries of $\beta_{1j}$ are nonzero [17]. If all entries of $\beta_{1j}$ are zero, then gene $j$ is not an eGene. On the other hand, if at least one entry of $\beta_{1j}$ is nonzero, then gene $j$ is an eGene. The goal is to identify which of the $p$ genes are eGenes given $Y$, $\{S_j\}_{j=1}^{p}$, and $X$.

4.2. ClipperQTL

We propose two main variants of ClipperQTL: the standard variant and the Clipper variant. The standard variant is equivalent to FastQTL with the direct permutation scheme and proportions (Algorithm S1) and is suitable for a wide range of sample sizes. The Clipper variant uses the contrastive strategy developed in Clipper [11] (Algorithm 1) and is only recommended for data sets with large sample sizes (> 450). The development of ClipperQTL is discussed in Section S3. A key technical difference between the standard variant and the Clipper variant is that in the standard variant, gene expression is permuted first and then residualized, whereas in the Clipper variant, gene expression is residualized first and then permuted.
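The difference in the order of permutation and residualization can be illustrated with a short Python sketch. It covers only the expression side of the computation for a single gene; the least-squares residualization with an intercept, the toy data, and all names are assumptions for illustration and are not taken from the ClipperQTL package.

```python
import numpy as np

def residualize(v, X):
    """Residuals of v after least-squares regression on an intercept and X."""
    design = np.column_stack([np.ones(X.shape[0]), X])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef

rng = np.random.default_rng(0)
n, K = 100, 3
X = rng.normal(size=(n, K))                                   # toy covariates
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)       # toy expression
perm = rng.permutation(n)

# Order described for the standard variant: permute, then residualize.
y_standard = residualize(y[perm], X)

# Order described for the Clipper variant: residualize, then permute.
y_clipper = residualize(y, X)[perm]

# Only the first order leaves the permuted expression exactly
# orthogonal to the (unpermuted) covariates.
orth_standard = X.T @ y_standard
orth_clipper = X.T @ y_clipper
```

Residualizing first means the regression on covariates is done once per gene rather than once per permutation, which is one plausible way to read the practical appeal of that order.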

The main input parameter of ClipperQTL under both variants is B, the number of permutations. For the standard variant, B is set at 1000 by default. For the Clipper variant, we recommend setting B between 20 and 100 (Figures S3 and S4).

4.

Supplementary Material

Supplement 1

Acknowledgments

The authors would like to thank former and current members of Junction of Statistics and Biology at UCLA for their valuable insight and suggestions.

Funding Statement

This work is supported by NSF DGE-1829071 and NIH/NHLBI T32HL139450 to H.J.Z. and NIH/NIGMS R01GM120507 and R35GM140888, NSF DBI-1846216 and DMS-2113754, Johnson & Johnson WiSTEM2D Award, Sloan Research Fellowship, and UCLA David Geffen School of Medicine W.M. Keck Foundation Junior Faculty Award to J.J.L.

Footnotes

Availability of data and materials

The R package ClipperQTL is available at https://github.com/heatherjzhou/ClipperQTL. The code used to generate the results in this work is available at https://doi.org/10.5281/zenodo.8259929. In addition, this work makes use of the following data and software:

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

None.

References

  • [1] Cano-Gamez Eddie and Trynka Gosia. From GWAS to function: Using functional genomics to identify the mechanisms underlying complex diseases. Frontiers in Genetics, 11:424, 2020.
  • [2] Ye Youqiong, Zhang Zhao, Liu Yaoming, Diao Lixia, and Han Leng. A multi-omics perspective of quantitative trait loci in precision medicine. Trends in Genetics, 36(5):318–336, 2020.
  • [3] GTEx Consortium. Genetic effects on gene expression across human tissues. Nature, 550(7675):204–213, 2017.
  • [4] GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020.
  • [5] Li Lei, Huang Kai-Lieh, Gao Yipeng, Cui Ya, Wang Gao, Elrod Nathan D., Li Yumei, Chen Yiling Elaine, Ji Ping, Peng Fanglue, Russell William K., Wagner Eric J., and Li Wei. An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability. Nature Genetics, 53(7):994–1005, 2021.
  • [6] Ongen Halit, Buil Alfonso, Brown Andrew Anand, Dermitzakis Emmanouil T., and Delaneau Olivier. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics, 32(10):1479–1485, 2016.
  • [7] Davis Joe R., Fresard Laure, Knowles David A., Pala Mauro, Bustamante Carlos D., Battle Alexis, and Montgomery Stephen B. An efficient multiple-testing adjustment for eQTL studies that accounts for linkage disequilibrium between variants. The American Journal of Human Genetics, 98(1):216–224, 2016.
  • [8] Peterson C. B., Bogomolov M., Benjamini Y., and Sabatti C. TreeQTL: Hierarchical error control for eQTL findings. Bioinformatics, 32(16):2556–2558, 2016.
  • [9] Taylor-Weiner Amaro, Aguet François, Haradhvala Nicholas J., Gosai Sager, Anand Shankara, Kim Jaegil, Ardlie Kristin, Van Allen Eliezer M., and Getz Gad. Scaling computational genomics to millions of individuals with GPUs. Genome Biology, 20(1):228, 2019.
  • [10] Simes R. J. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754, 1986.
  • [11] Ge Xinzhou, Chen Yiling Elaine, Song Dongyuan, McDermott MeiLu, Woyshner Kyla, Manousopoulou Antigoni, Wang Ning, Li Wei, Wang Leo D., and Li Jingyi Jessica. Clipper: P-value-free FDR control on high-throughput data from two conditions. Genome Biology, 22(1):288, 2021.
  • [12] Storey John D. and Tibshirani Robert. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.
  • [13] Zhou Heather J., Li Lei, Li Yumei, Li Wei, and Li Jingyi Jessica. PCA outperforms popular hidden variable inference methods for molecular QTL mapping. Genome Biology, 23(1):210, 2022. doi: 10.1186/s13059-022-02761-4.
  • [14] Buja Andreas and Eyuboglu Nermin. Remarks on parallel analysis. Multivariate Behavioral Research, 27(4):509–540, 1992.
  • [15] Shabalin Andrey A. Matrix eQTL: Ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28(10):1353–1358, 2012.
  • [16] Huang Qin Qin, Ritchie Scott C, Brozynska Marta, and Inouye Michael. Power, false discovery rate and Winner’s Curse in eQTL studies. Nucleic Acids Research, 46(22):e133–e133, 2018.
  • [17] Wang Gao, Sarkar Abhishek, Carbonetto Peter, and Stephens Matthew. A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(5):1273–1300, 2020.
  • [18] Benjamini Yoav and Hochberg Yosef. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
  • [19] Ledoit Olivier and Wolf Michael. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
  • [20] Benjamini Yoav and Yekutieli Daniel. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints