An independent filter for gene set testing based on spectral enrichment

H Robert Frost; Zhigang Li; Folkert W Asselbergs; Jason H Moore

doi:10.1109/TCBB.2015.2415815

Author manuscript; available in PMC: 2016 Sep 1.

Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2015 Sep-Oct;12(5):1076–1086. doi: 10.1109/TCBB.2015.2415815

PMCID: PMC4666312 NIHMSID: NIHMS730437 PMID: 26451820

Abstract

Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.

Index Terms: Gene set testing, gene set enrichment, screening-testing, principal component analysis, random matrix theory, Tracy-Widom

1 Introduction

Gene set testing has become a critical component in the pipeline used to analyze and interpret high-dimensional genomic data [1], [2]. Gene set testing enables researchers to step back from the single gene level and explore associations between biologically meaningful groups of genes and clinically relevant variables. A test based on the aggregate effect of a set of functionally related genomic variables offers a number of important benefits relative to individual gene tests, including improved statistical power, more intuitive biological interpretation and decreased variability across distinct experimental datasets. The genomic variables of interest typically represent the abundance or variation of nucleic acid molecules associated with specific genes, e.g., expression levels of mRNA molecules, and the variable sets are defined on the basis of common biological function, e.g., all genes whose protein products are active in a specific pathway. Over the past decade, significant progress has been made building and extending gene set collections [3], [4], [5] and developing, testing and refining statistical gene set testing methods [6], [7], [8].

One of the primary motivations for gene set testing is to improve statistical power via a reduction in the number of tested hypotheses relative to single gene analysis. The significant growth in gene set collections, however, can often result in gene set testing being performed with nearly as many (and sometimes even more) gene sets as original genomic variables. For example, a version of the Gene Ontology (GO) [3] loaded on September 16, 2014 into the AmiGO browser [9] has 39,908 non-obsolete terms in the biological process, cellular component and molecular function ontologies, with the biological process ontology alone containing 26,501 terms, numbers of gene sets that exceed the number of genes in any relevant experimental organism. Even the much more aggressively filtered Molecular Signatures Database (MSigDB) [5] grew in size by an order of magnitude between 2005 and 2014, from approximately 1,000 gene sets to over 10,000 with the 4.0 release. The growth in the number of gene sets in these collections is also frequently at the expense of gene set quality, with an increasing level of overlap between gene sets and a large proportion of new annotations generated via fully automated methods without any curatorial review or experimental validation. For example, well over 90% of all GO annotations have the evidence code IEA (inferred from electronic annotation), meaning the annotation was generated by a computational method such as sequence similarity and has not been reviewed by a human curator [10]. Therefore, not only does gene set testing with large collections fail to deliver an improvement in statistical power, but the decline in annotation quality and higher gene set interdependency can also compromise the biological relevance and interpretability of any associations that are discovered.

The typical approach for addressing the problem of gene set collection size is either to use pre-existing collection subsets, e.g., standard GO Slims [11] or the MSigDB C5 collection that filters out GO terms with IEA evidence codes [5], or to create custom collection subsets that match a specific use case, e.g., custom GO Slim generation [9]. Although the use of data-independent subsets addresses the issue of collection size and the subsets may closely align with the domain of investigation, the process of selecting a subset is inherently subjective and thus susceptible to researcher bias. Gene sets not believed to be relevant will not be tested, with the result that novel associations may never be found. For hierarchical gene set collections such as GO, methods have also been developed that reduce the number of tested gene sets by using information theoretic measures [12], [13] or by computing the association for gene sets higher in the hierarchy conditional on the results for child gene sets [14], [15], [16], [17]. Although such GO-specific methods are effective at addressing the significant overlap between GO term annotations, they are specific to hierarchical gene set collections and, for those based on a specific data set, use a filtering criterion that is not independent of the statistic used to test gene set enrichment.

Ideally, the members of a gene set collection subset should be selected based on characteristics of the empirical data under investigation. Such data-driven filtering of hypotheses has been successfully practiced in the context of genomic data analysis at the single gene level [18], [19], [20]. In this type of application, a two-stage procedure is followed where, in the first stage, a filter statistic is computed for each genomic variable, e.g., overall variance, and then, in the second stage, the desired statistical analysis is performed on just the set of dependent variables whose filtering statistic passes a given threshold. As detailed by Bourgon et al., such filtering methods can only be successful at improving power in the second stage if the filter statistic is both independent of the second stage test statistic under the null hypothesis (H₀) and dependent under the alternative hypothesis (H_A). In other words, if the test statistics follow the null hypothesis distribution, they must be statistically independent of the filter statistics and, if the test statistics follow the alternative hypothesis distribution, the test and filter statistics must be associated. Bourgon et al. refer to filtering methods that meet these requirements as independent filters and the filter statistics as marginally independent filter statistics.

Although data-driven filtering of individual genomic variables has been advocated for gene set testing [21] and empirical methods have been developed to filter out specific annotations [22], effective independent filters are not currently available that operate on entire gene sets prior to gene set testing. To address both this shortcoming and the challenge posed by large, interdependent and low quality gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to standard gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the principal components (PCs) of an empirical data set, taking into account the significance of the eigenvalue associated with each PC. Because this filter statistic is independent of standard gene set enrichment test statistics under H₀, which we prove in the Appendix, but dependent under H_A, the proportion of significantly enriched gene sets is increased without impacting the type I error rate. Using simulated gene sets with simulated data and MSigDB collections with microarray gene expression data from leukemia and heart failure studies, we show that the SGSF algorithm can significantly increase gene set enrichment power by accurately filtering gene sets unrelated to the experimental outcome.

2 Methods

2.1 SGSF inputs

The SGSF method operates over the following three data structures:

An n × p data matrix X quantifying p genomic variables under n experimental conditions. The genomic data held in X, e.g., mRNA expression levels, will be modeled as a sample of n independent observations from a p-dimensional random vector x. It is assumed that any desired transformations on X have been performed and that missing values have been imputed or removed. For the purpose of proving the marginal independence of the spectral gene set enrichment filter (see the Appendix), it is assumed that the distribution of x can be approximated by a multivariate normal distribution (MVN (μ, Σ) with correlation matrix P). This distributional assumption is often well justified since sources of genomic data, especially gene expression data, typically follow a multivariate normal distribution after appropriate transformations. A generalization to the exponential family of distributions is planned for future work.
An n × 1 vector y of clinical phenotype values measured at each of the n experimental conditions. The phenotype values held in y, e.g., binary case/control status, will be modeled as known constants. The term “phenotype” should be interpreted quite broadly in this context and simply refers to an experimental variable that is treated as an independent variable in statistical models (see Section 2.2.3). If multiple phenotype variables exist, it is possible to use a matrix Y along with the specification of appropriate parameter contrasts (see, for example, Wu et al. [8]).
An f × p binary annotation matrix A that specifies the annotation of the p genomic variables to f functional categories. The rows of A represent f biological categories, e.g., Kyoto Encyclopedia of Genes and Genomes (KEGG) [4] pathways or GO categories, and the elements a_i,j hold indicator variables whose value depends on whether an annotation exists between the function i and genomic variable j.
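
As a minimal sketch of the third input, the binary annotation matrix A can be built from a mapping of gene set names to member genes. The gene identifiers, set names and the helper `build_annotation_matrix` below are hypothetical and serve only to illustrate the f × p layout (rows are gene sets, columns are genomic variables).

```python
# Hypothetical gene identifiers and gene sets, for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5"]
gene_sets = {
    "set_a": {"g1", "g2"},
    "set_b": {"g2", "g4", "g5"},
}


def build_annotation_matrix(gene_sets, genes):
    """Return (set_names, A) where A[i][j] = 1 if gene j is annotated to set i."""
    set_names = sorted(gene_sets)
    A = [[1 if g in gene_sets[name] else 0 for g in genes] for name in set_names]
    return set_names, A


set_names, A = build_annotation_matrix(gene_sets, genes)
set_sizes = [sum(row) for row in A]  # number of annotated genes per gene set
```

Row sums of A give the gene set sizes, which is convenient when applying minimum and maximum size limits before filtering.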

2.2 SGSF algorithm

The SGSF method identifies a subset of the gene sets defined by A using a non-specific and independent filter based on the statistical significance of the association between each gene set and the spectra of X. Application of the SGSF method in the context of gene set enrichment relative to the variable y involves the following steps, which are illustrated schematically in Figure 1 and explained in greater detail in Sections 2.2.1 through 2.2.3 below.

Use the spectral gene set enrichment (SGSE) method [23] to compute filter statistics, F_i, i = 1, …, f, for each of the f gene sets defined by A.
Use the filter statistics to subset the f gene sets.
Test the association between the gene sets that pass the filter and y.

2.2.1 Computation of filter statistics using SGSE

The SGSE method [23] is used to compute the filter statistics, F_i, for the gene sets defined by A. Specifically, F_i is set to the p-value generated by SGSE for gene set i according to the statistical significance of the association between gene set i and the PCs of X under a competitive null hypothesis. Computation of spectral enrichment p-values by the SGSE method is realized by the following steps as illustrated in Figure 1 (a) (see Frost et al. [23] for complete details on the SGSE method):

Perform PCA on a mean centered and standardized version of X, X̃.
Determine q, the number of PCs used to represent the spectra of X. This can be all PCs with non-zero variance, all PCs that are statistically significant according to the Tracy-Widom test [25] at a specific α level or a fixed number of PCs. For SGSF, the default configuration uses all PCs with non-zero variance. Although computationally more expensive, this option avoids dependence on a subjectively selected α level or specific q value.
For all q PCs, use the principal component gene set enrichment (PCGSE) method [26] to compute the statistical significance of the association between each PC and each of the f gene sets defined by A. The PCGSE method computes a p-value for each gene set via two-stage competitive gene set testing in which the correlation between each gene and each PC is used as a gene-level statistic with flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. For SGSF, the default configuration uses the Fisher-transformed Pearson correlation coefficient between each gene and each PC as the gene-level test statistic and computes the statistical significance of the association between a gene set and a PC using a correlation-adjusted two-sided, two-sample t-test between the gene-level test statistics for genes in the set and the test statistics for genes not in the set. See Frost et al. [26] for complete details on the PCGSE method.
Compute the statistical significance of the association between each of the f gene sets and the spectra of X using the weighted Z-method [27], [28] on the q PCGSE p-values with weights based on the PC variances scaled according to PC statistical significance as quantified by the lower-tailed p-value from the Tracy-Widom test [25]. An important result from the field of random matrix theory (RMT), the Tracy-Widom law of order 1 describes the variation of a scaled and centered version of the largest eigenvalue of the sample covariance matrix for multivariate normal data under the null model of an identity population covariance matrix (a so-called white Wishart distribution). Using the lower-tailed p-value from the Tracy-Widom test as a weight in the weighted Z-method therefore discounts the contribution from all PCs whose eigenvalues are not significantly different from what would be expected under a null model of an identity population covariance matrix. Please see Frost et al. [23] for a more detailed background on the Tracy-Widom distribution and its use in the SGSE method.
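
The final combination step can be sketched as follows. This is an illustrative implementation of the generic weighted Z-method for one-sided p-values, not the SGSE code itself: the function name `weighted_z`, the clamping constant and the example numbers (PCGSE p-values, PC variances and Tracy-Widom lower-tailed p-values) are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1
_EPS = 1e-12                # assumed clamp so that inv_cdf never sees 0 or 1


def weighted_z(p_values, weights):
    """Combine one-sided p-values via the weighted Z-method.

    Each p-value is mapped to z = Phi^{-1}(1 - p); the combined statistic
    sum(w * z) / sqrt(sum(w^2)) is then referred to the standard normal.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    denom = sqrt(sum(w * w for w in weights))
    if denom == 0.0:
        raise ValueError("at least one weight must be non-zero")
    num = 0.0
    for p, w in zip(p_values, weights):
        p = min(max(p, _EPS), 1.0 - _EPS)
        num += w * _STD_NORMAL.inv_cdf(1.0 - p)
    return 1.0 - _STD_NORMAL.cdf(num / denom)


# Hypothetical values for one gene set and three PCs; the third PC plays the
# role of a noise PC whose Tracy-Widom lower-tailed p-value is near zero.
pcgse_p = [0.40, 0.01, 0.001]
pc_var = [3.0, 2.5, 0.2]
tw_lower_p = [1.0, 1.0, 0.0]

p_var_only = weighted_z(pcgse_p, pc_var)
p_tw_scaled = weighted_z(pcgse_p, [v * t for v, t in zip(pc_var, tw_lower_p)])
```

With variance-only weights the noise PC still contributes to the combined p-value, whereas scaling by the Tracy-Widom p-value gives it zero weight, which is the discounting behavior described in the step above.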

Fig. 1 — SGSF workflow. (a) Screening portion of the SGSF workflow. Takes the data matrix X and gene set annotation matrix A as inputs, computes filter statistics, *F_i*, using the SGSE method and then filters A to generate A^*. (b) Testing portion of the SGSF workflow. Based on the gene set testing workflow in Ackermann and Strimmer [24]. Takes the phenotype values y, data matrix X and filtered gene set annotation matrix A^* as inputs and computes the association between each gene set in the filtered collection and the phenotype using a competitive gene set testing method where the gene set test statistics, *S_k*, have a t-distribution under H₀.

2.2.2 Gene set collection filtering

Given filter statistics, F_i, i = 1, …, f, a subset of the gene sets defined by the matrix A of size f^* < f can be identified using the following steps as illustrated in Figure 1 (a):

Order the filter statistics from smallest to largest.
Select the f* gene sets corresponding to the first f* filter statistics in the ordered list. The number f* can be either a fixed number, e.g., f* = 0.1f, or can be set according to a specified filter statistic threshold α, i.e., $f^{*} = \sum_{i = 1}^{f} \mathbf{1}(F_{i} < α)$.
Generate a matrix A^* that contains just the rows of A corresponding to the f^* gene sets that pass the filter.
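
The three filtering steps above can be sketched in a few lines. The function name `filter_gene_sets`, its `alpha` and `proportion` arguments and the example values are assumptions chosen for illustration; the sketch only mirrors the ordering, selection and row-subsetting logic described in the text.

```python
def filter_gene_sets(filter_stats, A, alpha=None, proportion=None):
    """Return (kept_indices, A_star) for the gene sets that pass the filter.

    Exactly one of `alpha` (keep sets with F_i < alpha) or `proportion`
    (keep that fraction of the f sets with the smallest F_i) must be given.
    """
    if (alpha is None) == (proportion is None):
        raise ValueError("specify exactly one of alpha or proportion")
    # Step 1: order the filter statistics from smallest to largest.
    order = sorted(range(len(filter_stats)), key=lambda i: filter_stats[i])
    # Step 2: determine f* and keep the first f* gene sets in the ordering.
    if alpha is not None:
        f_star = sum(1 for F in filter_stats if F < alpha)
    else:
        f_star = int(round(proportion * len(filter_stats)))
    kept = order[:f_star]
    # Step 3: A* holds only the rows of A for the retained gene sets.
    return kept, [A[i] for i in kept]


# Hypothetical filter statistics (SGSE p-values) for four gene sets.
F = [0.40, 0.02, 0.75, 0.09]
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]

kept_by_alpha, A_star = filter_gene_sets(F, A, alpha=0.1)
kept_by_prop, _ = filter_gene_sets(F, A, proportion=0.25)
```

Here the threshold α = 0.1 retains the second and fourth gene sets, while a fixed proportion of 0.25 retains only the set with the smallest filter statistic.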

2.2.3 Gene set testing using filtered gene sets

It is assumed that testing of the association between each of the f* gene sets and the phenotype variable y is performed using a two-stage, competitive gene set testing method, e.g., CAMERA [8], using the following steps as illustrated in Figure 1 (b):

Model the relationship between the genomic variables in x and the phenotype y using a series of p univariate linear models of the form x_i ~ β₀ + β₁y + ε. If multiple phenotype variables exist, a contrast of model coefficients must also be specified. Note: if a non-Gaussian exponential family distribution is assumed for x, then a set of generalized linear models would be used instead; however, the current paper considers only the Gaussian case and linear models.
Compute gene-level test statistics, z_j, j = 1, …, p, from each of the p univariate models. The t-statistic associated with β̂₁ is a typical choice. CAMERA uses a normalized t-statistic.
Use the gene-level test statistics to generate gene set test statistics, S_i, for each of the f* gene sets. The mean difference test statistic, which follows a t-distribution under H₀, is a common choice: $S_{i} = ({\bar{z}}_{i} - {\bar{z}}_{i^{c}}) / (σ_{p} \sqrt{\frac{1}{m_{i}} + \frac{1}{p - m_{i}}})$, where m_i is the number of genomic variables in set i, z̄_i is the mean of the z_j for members of gene set i, z̄_i^c is the mean of the z_j for genes not in set i and σ_p is the pooled standard deviation of the z_j. CAMERA uses a correlation-adjusted version of the mean difference statistic.
Determine the statistical significance of the gene set test statistics under the null hypothesis that the z_j for genomic variables in the gene set are identically distributed to the z_j for genomic variables not in the gene set. CAMERA determines statistical significance using a two-sample t-test on the correlation-adjusted mean difference statistic. Many other two-stage competitive gene set testing methods use permutation of y to calculate a p-value.
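
As a hedged illustration of the gene set statistic in the third step, the sketch below computes an unadjusted mean difference statistic from gene-level statistics. It assumes the standard two-sample form, in which the pooled standard deviation is scaled by the square root of the sum of the reciprocal group sizes, and it omits the correlation adjustment that CAMERA applies. The function name `mean_difference_statistic` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference_statistic(z, in_set):
    """Two-sample mean difference statistic for one gene set.

    z      : gene-level test statistics z_j, j = 1..p
    in_set : 0/1 membership indicators (one row of the annotation matrix)
    """
    z_in = [zj for zj, member in zip(z, in_set) if member]
    z_out = [zj for zj, member in zip(z, in_set) if not member]
    m, rest = len(z_in), len(z_out)
    if m < 2 or rest < 2:
        raise ValueError("need at least two genes inside and outside the set")
    # Pooled variance of the z_j across the two groups (assumed definition).
    pooled_var = ((m - 1) * variance(z_in) + (rest - 1) * variance(z_out)) / (m + rest - 2)
    return (mean(z_in) - mean(z_out)) / (sqrt(pooled_var) * sqrt(1.0 / m + 1.0 / rest))


# Toy gene-level statistics for six genes; the first two belong to the set.
z = [2.0, 3.0, 0.0, 1.0, 0.0, 1.0]
membership = [1, 1, 0, 0, 0, 0]
S = mean_difference_statistic(z, membership)
```

Under H₀ and independence between genes this statistic would be compared against a t-distribution with p − 2 degrees of freedom; inter-gene correlation inflates its variance, which is the motivation for the correlation-adjusted version.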

2.3 SGSF evaluation

2.3.1 Alternative gene set filtering methods

To enable a comparative assessment of our SGSF method, two alternative methods for computing the filter statistics, F_i, were considered:

Set F_i to the p-values generated by the SGSE method when executed with weights for the PCGSE p-values in the weighted Z-method set to the PC variance.
Set F_i to the p-values generated according to a χ² test of independence of gene set membership relative to variable clusters. Specifically, this method generates p-values by:
1. Clustering the p genomic variables in X̃ using k-means clustering with the Hartigan and Wong algorithm [29], 5 restarts and k set according to the global maximum of the gap statistic [30] as computed using the clusGap() function in the cluster R package [31] with the number of bootstrap resamples defaulting to 100.
2. Computing the statistical significance of the association between each of the f gene sets defined in A and the k-means clustering using Pearson’s χ² test of independence on a 2 × k contingency table whose first row holds the counts of gene set members in each of the k clusters and whose second row holds the total size of each of the k clusters.
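
The contingency-table step of this alternative filter can be sketched as below, following the table layout described above (first row: gene set members per cluster; second row: total cluster sizes). Only the Pearson χ² statistic and its degrees of freedom are computed, because the Python standard library has no χ² distribution function. Since every gene set shares the same k, ranking by a larger statistic is equivalent to ranking by a smaller p-value. The cluster labels and helper names are hypothetical.

```python
def membership_table(in_set, cluster_labels, k):
    """Build the 2 x k table: member counts per cluster, then cluster sizes."""
    members = [0] * k
    sizes = [0] * k
    for member, cluster in zip(in_set, cluster_labels):
        sizes[cluster] += 1
        if member:
            members[cluster] += 1
    return [members, sizes]


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical k-means labels for nine genes in k = 3 clusters.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
concentrated = [1, 1, 1, 0, 0, 0, 0, 0, 0]  # all members in cluster 0
spread = [1, 0, 0, 1, 0, 0, 1, 0, 0]        # one member per cluster

stat_conc, df = chi_squared_statistic(membership_table(concentrated, labels, 3))
stat_spread, _ = chi_squared_statistic(membership_table(spread, labels, 3))
```

A gene set concentrated in one cluster yields a large statistic, while a set spread in proportion to the cluster sizes yields a statistic of zero.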

2.3.2 Evaluation using simulated gene sets and simulated data

The standard SGSF method described in Section 2.2 and both alternative filtering methods outlined in Section 2.3.1 were used to filter gene sets defined by a simulated annotation matrix A using 1000 simulated data sets, each comprising a matrix X and vector y generated according to the latent component model outlined in sections 4.1 and 4.2 of Paul et al. [32]. The primary simulation was performed using the following parameter settings:

A was generated as a 60 × 2400 matrix defining 60 disjoint gene sets, each of size 40.
X was generated as a 30 × 2400 matrix via the model $X = \sum_{i = 1}^{4} \sqrt{λ_{i}} v_{i} u_{i}^{T} + σ_{0} E$ where λ = (3, 2.5, 2, 1.5)^T, v_i ~ N₃₀(0, I), $u_{i} = \sqrt{.025} a_{i}$ (a_i is the i^th row of A), σ₀ = .1 and E is a 30 × 2400 matrix with i.i.d. N(0, 1) entries.
y was generated as a 30 × 1 vector via the model $y = \sum_{i = 1}^{4} \sqrt{β_{i}} v_{i} + σ_{1} z$ where β = (0, 1, 0, 0)^T, σ₁ = 2 and z ~ N₃₀(0, I).
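
A minimal sketch of one draw from the primary simulation model is given below, using only the Python standard library. The function name `simulate`, its argument names and the seeding scheme are assumptions; the noise matrix E is taken to have the same 30 × 2400 dimensions as X so that the sum is well defined.

```python
import random
from math import sqrt


def simulate(n=30, n_sets=60, set_size=40, lam=(3.0, 2.5, 2.0, 1.5),
             beta=(0.0, 1.0, 0.0, 0.0), sigma0=0.1, sigma1=2.0,
             loading_sq=0.025, seed=0):
    """Draw (A, X, y) from the latent component model described above."""
    rng = random.Random(seed)
    p = n_sets * set_size
    # A: disjoint gene sets, set i covers genes i*set_size .. (i+1)*set_size - 1.
    A = [[1 if i * set_size <= j < (i + 1) * set_size else 0 for j in range(p)]
         for i in range(n_sets)]
    # Latent factors v_i ~ N_n(0, I), one per entry of lam.
    V = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in lam]
    # Start from the noise term sigma0 * E, then add sqrt(lam_i) * v_i * u_i^T.
    X = [[sigma0 * rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    for i, lam_i in enumerate(lam):
        for s in range(n):
            shift = sqrt(lam_i) * V[i][s] * sqrt(loading_sq)
            # u_i is non-zero only for the genes annotated to gene set i.
            for j in range(i * set_size, (i + 1) * set_size):
                X[s][j] += shift
    # y depends on the latent factors through beta, plus N(0, sigma1^2) noise.
    y = [sum(sqrt(b) * V[i][s] for i, b in enumerate(beta)) + sigma1 * rng.gauss(0.0, 1.0)
         for s in range(n)]
    return A, X, y


A, X, y = simulate(seed=1)
```

With β = (0, 1, 0, 0)^T only the second latent factor enters y, so only the second gene set is truly associated with the phenotype in this sketch.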

To test the sensitivity of the SGSF method to changes in gene set size, error variance (i.e., σ₀) and latent factor weights (i.e., λ), simulations were also performed using the following additional six parameter settings. For each of these additional simulations, all parameters were held constant at the values listed above except the indicated parameter:

A was generated as a 120 × 2400 matrix defining 120 disjoint gene sets, each of size 20.
A was generated as a 40 × 2400 matrix defining 40 disjoint gene sets, each of size 60.
σ₀ = 0.05
σ₀ = 0.2
λ = (2, 1.75, 1.5, 1.25)^T
λ = (5, 4, 3, 2)^T

According to all simulation models, the first four simulated gene sets are associated with each of the four latent factors and, consequently, the first four PCs. Only the second latent factor, and, thus, only the second gene set, is associated with y. For the filtering method based on the χ² test on variable clusters, the number of clusters was fixed at k = 4 rather than estimated using the gap statistic, which should give this method an advantage since k-means will be executed for the exact number of latent factors used to simulate X. The CAMERA method of Wu et al. [8] was used to test the statistical association between X and y for the gene sets in A before and after filtering according to a competitive H₀. The enrichment power for each of the three filtering methods at a range of filter proportions was computed by taking the ratio of the number of truly enriched gene sets to the total number of gene sets with enrichment false discovery rate (FDR) values (as computed using the method of Benjamini and Hochberg [33]) below .2. The average enrichment power for each filter proportion was computed by simply averaging across all 1000 simulated data sets.
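
The effect of filtering on the FDR adjustment can be illustrated with a short sketch of the Benjamini-Hochberg procedure. The p-values below are made up, and `enrichment_power` encodes one reading of the power definition above (the fraction of gene sets called at the FDR cutoff that are truly enriched, taken as 0 when nothing is called); both function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def enrichment_power(q_values, truly_enriched, fdr_cutoff=0.2):
    """Fraction of gene sets called at the cutoff that are truly enriched."""
    called = [i for i, q in enumerate(q_values) if q < fdr_cutoff]
    if not called:
        return 0.0
    return sum(1 for i in called if i in truly_enriched) / len(called)


# Made-up enrichment p-values: gene set 0 is truly enriched, 19 others are null.
p_all = [0.03] + [0.5 + 0.02 * i for i in range(19)]
q_unfiltered = benjamini_hochberg(p_all)

# Suppose an independent filter retains only gene sets 0 and 1.
kept = [0, 1]
q_filtered = benjamini_hochberg([p_all[i] for i in kept])

power_unfiltered = enrichment_power(q_unfiltered, {0})
power_filtered = enrichment_power(q_filtered, {0})
```

The same raw p-value for the enriched gene set is adjusted against 20 hypotheses without filtering and against 2 after filtering, so it falls below the .2 cutoff only in the filtered case.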

2.3.3 Evaluation using Armstrong et al. leukemia gene expression data and MSigDB C2 v4.0 gene sets

The standard SGSF method and both alternative filtering methods were used to filter the MSigDB C2 v4.0 gene sets for the Armstrong et al. [34] leukemia gene expression data used in the 2005 GSEA paper [6]. The MSigDB C2 v4.0 gene sets and collapsed leukemia gene expression data were both downloaded from the MSigDB repository. With a minimum gene set size of 15 and maximum gene set size of 200, 3,076 gene sets out of the original 4,722 were used in the analysis. For SGSF filtering, the SGSE method [23] was executed on the leukemia gene expression data using all PCs with non-zero eigenvalues and default settings as specified in Section 2.2.1. By filtering all gene sets with SGSE-generated p-values greater than .1, the standard SGSF method reduced the original 3,076 gene sets to 83. The two alternative filtering methods were executed using the default settings as outlined in Section 2.3.1 (k=10 was selected as optimal by the gap statistic test). To enable comparison between the three techniques, filtering via the alternative methods was configured to maintain the 83 gene sets with the best filter statistics. Enrichment of the MSigDB C2 gene sets was computed using CAMERA [8] with default settings and gene-wise test statistics calculated via the linear regression of the gene expression value on the acute myeloid leukemia (AML) versus acute lymphoblastic leukemia (ALL) phenotype. FDR values were computed for both unfiltered and filtered subsets of p-values using the method of Benjamini and Hochberg [33].

2.3.4 Evaluation using BiKE carotid plaque gene expression data and MSigDB C2 v4.0 gene sets

The MSigDB C2 v4.0 gene sets were also filtered for the carotid plaque gene expression data used by Folkersen et al. [35]. Folkersen et al. analyzed the microarray gene expression data from 126 carotid plaque samples gathered from patients during the course of carotid endarterectomies and obtained from the Biobank of Karolinska Endarterectomies (BiKE). An ischemic event was experienced by 25 out of the 126 patients (7 myocardial infarctions and 18 ischemic strokes) during a mean follow-up period of 1,333 days. For SGSF filtering, the BiKE carotid plaque gene expression data generated using the Affymetrix Human Genome U133 Plus 2.0 Array was retrieved from the Gene Expression Omnibus (GEO) [36] as GSE21545 using a GEO2R generated script. This script created a single expression value for each gene following the procedure outlined in Folkersen et al. (i.e., the mean of the log2-transformed expression measurements for all probes associated with the same gene symbol). Using a minimum gene set size of 5 and a maximum gene set size of 200, 4,185 MSigDB C2 v4.0 gene sets out of the original 4,722 were used in the analysis. The SGSE method [23] was executed on the plaque gene expression data using all PCs with non-zero eigenvalues and default settings as specified in Section 2.2.1. By filtering all MSigDB C2 gene sets with SGSE-generated p-values greater than .1, the SGSF method reduced the original collection of 4,185 gene sets to just 14. Similar to the Armstrong et al. example, the two alternative filtering methods were also executed using the default settings as outlined in Section 2.3.1 (k=10 was selected as optimal by the gap statistic test). To enable comparison between the three techniques, filtering via the alternative methods was again configured to maintain the same number of gene sets retained by the SGSF method, i.e., the 14 gene sets with the best filter statistics.
Enrichment of the MSigDB C2 gene sets was computed using CAMERA [8] with default settings and gene-wise test statistics calculated via the linear regression of the gene expression value on the binary ischemic event or no ischemic event phenotype. Alternatively, univariate Cox proportional hazard models, as employed in Folkersen et al., could be used to compute gene-wise test statistics for gene set enrichment. Linear regression against the binary ischemic event phenotype was chosen for simplicity and compatibility with CAMERA.

3 Results and Discussion

3.1 Simulation example

Figure 2 illustrates the comparative performance of SGSF filtering and the two alternative filtering methods detailed in Section 2.3.1 for the simulation example outlined in Section 2.3.2. As seen in Figure 2 (a), when no gene sets are filtered (filter proportion of 0), the behavior of all filtering methods is identical to no filtering and all techniques have an average gene set enrichment power, computed as detailed in Section 2.3.2, of approximately 0.4 across the 1000 simulated data sets. As the proportion of filtered gene sets increases, average enrichment power quickly drops to near 0 when filtering is based on the Pearson χ² p-value computed between gene set membership and k-means clusters of the variables. The poor performance of cluster enrichment in this example is due to the inability of k-means clustering to correctly recover the structure of the latent factors. Filtering according to the SGSE p-values computed using PC variance weighting also exhibits a rapid drop in average enrichment performance as the filtering proportion increases. Poor performance in this case is due to the significant impact of lower variance PCs (i.e., PCs unassociated with the four latent factors and representing only noise) on the SGSE computed p-value via the weighted Z-method, as seen in Figure 2 (b). In contrast, filtering according to the standard SGSF method (i.e., filter statistics set to SGSE p-values with weights based on the product of PC variance and the lower-tailed Tracy-Widom p-value for the PC variance) is able to achieve average enrichment power that is greater than or equal to that achieved without filtering at all filtering proportions. This is due to the fact that the Tracy-Widom p-values completely discount contributions from all PCs not associated with the four latent factors, as seen in Figure 2 (b). In this simulation example, the best average enrichment for the SGSF method is obtained when 90% of the simulated gene sets are filtered.

Fig. 2 — **(a)** Estimated enrichment power at different filtering proportions averaged over 1000 simulations of the model detailed in Section 2.3.2 for filtering according to the chi-squared test against k-means computed variable clusters for k=4 (solid line), filtering of the gene sets according to the SGSE p-values computed using PC variance weighting (dashed line) and filtering of the gene sets according to the SGSE p-value computed with the product of variance and the Tracy-Widom p-value as weighting (dotted line). Note that all methods have identical enrichment power when no filtering is performed and that, for this simulation study, filtering according to both the chi-squared p-value and SGSE-based p-value for PC variance weighting generated 0 empirical power for all other filter proportions (the lines for these two methods therefore overlap). **(b)** Average weights used with the SGSE method to combine PCGSE-generated p-values for the first 10 PCs of the simulation example via the weighted Z-method. Weights based on the PC variance are shown via a dashed line and weights based on the product of the PC variance and the lower-tailed Tracy-Widom p-value for the PC variance are shown via a dotted line. Grey error bars in **(a)** and **(b)** represent ±1 SE.

Results for the other six simulated parameter settings are contained in the Supplemental Material file and show a similar pattern of superior performance for the SGSF method compared to the alternative methods. These additional simulations demonstrate the robustness of the SGSF method to the tested variations in gene set size, error variance and latent factor weights.

3.2 Leukemia gene expression example

Figure 3 illustrates the significant improvement in gene set enrichment power that is possible when using variance-based filter statistics. Without any filtering, the distribution of gene set enrichment p-values computed via CAMERA relative to the AML versus ALL phenotype is consistent with the null hypothesis, i.e., the p-values are approximately U(0, 1) distributed. Although both alternative filtering methods improve enrichment power, as evidenced by the increase in the relative number of small p-values, their performance is dominated by the standard SGSF method. The specific impact of filtering on enrichment power can be seen in Table 1, which contains the gene set enrichment FDR q-values for the 25 MSigDB gene sets with the most significant enrichment p-values. Although some of these gene sets, e.g., GOLUB ALL VS AML DN, are clearly related to the phenotype, without filtering all gene sets appear to have no association after multiple hypothesis correction. The alternative filtering methods represent an improvement on the no filtering case and deliver either one or two biologically plausible gene set associations at an FDR cutoff of .2. For this example, the SGSF method is clearly the most successful at improving enrichment power with 10 out of the top 25 gene sets retained at an FDR level below .2.

Fig. 3 — MSigDB C2 filtering for Armstrong et al. leukemia gene expression data. Quantile-quantile plot of U(0, 1) versus the gene set enrichment p-values computed via CAMERA for the unfiltered and filtered MSigDB C2 v4.0 gene sets and the Armstrong et al. leukemia gene expression data using AML versus ALL status as a binary phenotype as detailed in Section 2.3.3.

Table 1.

The 25 MSigDB C2 v4.0 gene sets with the most statistically significant association with AML versus ALL status as computed via CAMERA for the Armstrong et al. leukemia gene expression data.

Gene set	Direction	GSE p-value	Unfiltered q-value	χ² q-value	SGSE var. filtered q-value	SGSE TW-var. filtered q-value

`TONG_INTERACT_WITH_PTTG1`	AML	0.00117	0.914	-	0.0971	0.0576
`GOLUB_ALL_VS_AML_DN`	AML	0.00139	0.914	-	-	0.0576
`HADDAD_B_LYMPHOCYTE_PROGENITOR`	ALL	0.00141	0.914	0.117	-	-
`VERRECCHIA_EARLY_RESPONSE_TO_TGFB1`	AML	0.00234	0.914	-	-	-
`NAKAJIMA_MAST_CELL`	AML	0.00239	0.914	-	-	-
`VERRECCHIA_RESPONSE_TO_TGFB1_C2`	AML	0.00273	0.914	-	-	-
`GUENTHER_GROWTH_SPHERICAL_VS_ADHERENT_DN`	AML	0.00299	0.914	-	-	-
`CHEOK_RESPONSE_TO_HD_MTX_UP`	AML	0.00398	0.914	-	0.165	0.11
`HUPER_BREAST_BASAL_VS_LUMINAL_UP`	AML	0.004	0.914	-	-	-
`ALONSO_METASTASIS_NEURAL_UP`	AML	0.0042	0.914	-	-	-
`GOLUB_ALL_VS_AML_UP`	ALL	0.00492	0.914	-	-	-
`BIOCARTA_DC_PATHWAY`	AML	0.00687	0.914	-	-	-
`LEE_LIVER_CANCER_E2F1_UP`	AML	0.00776	0.914	-	-	0.116
`KIM_ALL_DISORDERS_CALB1_CORR_DN`	AML	0.009	0.914	-	-	-
`TONKS_TARGETS_OF_RUNX1_RUNX1T1_FUSION_HS…`	AML	0.00961	0.914	-	-	0.116
`SABATES_COLORECTAL_ADENOMA_UP`	AML	0.00965	0.914	-	-	-
`REACTOME_CELL_SURFACE_INTERACTIONS_AT_TH…`	AML	0.0102	0.914	-	-	0.116
`WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CA…`	AML	0.0111	0.914	-	-	0.116
`HILLION_HMGA1B_TARGETS`	AML	0.0112	0.914	-	0.228	0.116
`MADAN_DPPA4_TARGETS`	AML	0.012	0.914	-	-	-
`KLEIN_PRIMARY_EFFUSION_LYMPHOMA_DN`	ALL	0.0125	0.914	-	-	-
`REACTOME_REGULATION_OF_INSULIN_LIKE_GROW…`	AML	0.0138	0.914	-	0.228	0.116
`PID_UPA_UPAR_PATHWAY`	AML	0.0139	0.914	-	-	-
`VERHAAK_AML_WITH_NPM1_MUTATED_UP`	AML	0.014	0.914	0.422	-	0.116
`YAO_HOXA10_TARGETS_VIA_PROGESTERONE_UP`	AML	0.0146	0.914	-	-	-

The table columns display the gene set enrichment direction, the phenotype enrichment p-value computed via CAMERA, the FDR q-value computed using all tested MSigDB C2 v4.0 gene sets and the FDR q-value computed using each of the tested filtering methods as detailed in Section 2.3.3. If filtering according to a specific method failed to include a specific gene set, the table includes a "-" in place of a q-value.
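The effect of filtering on the q-values reported in Table 1 can be illustrated with a minimal Python sketch, assuming the q-values are Benjamini-Hochberg adjusted p-values [33]. The p-values, the collection size and the retained gene sets below are hypothetical toy values, not the MSigDB results, and the function name `bh_qvalues` is an illustrative choice. The filter is idealized: it keeps every truly associated set and an evenly spaced subsample of the null sets, so the retained null p-values stay roughly uniform, as independence of the filter under H₀ implies.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical toy collection: 3 associated gene sets among 997 null gene sets
# whose p-values are spread evenly over (0, 1).
signal = [0.001, 0.002, 0.004]
null = [(k + 1) / 998 for k in range(997)]

# No filtering: all 1000 gene sets enter the multiple testing correction.
q_unfiltered = bh_qvalues(signal + null)

# Idealized independent filter (an assumption of this sketch): all signal sets
# are retained together with 20 evenly spaced null sets.
retained_null = null[25::50]
q_filtered = bh_qvalues(signal + retained_null)

print("smallest signal p-value:", signal[0])
print("q-value without filtering:", round(q_unfiltered[0], 3))
print("q-value after filtering:  ", round(q_filtered[0], 3))
```

The enrichment p-values themselves are unchanged by filtering; only the number of hypotheses entering the correction shrinks, which is why the same p-value can move from a q-value near the unfiltered plateau to one below a typical FDR cutoff.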

The SGSF method is effective in this example for two reasons: first, the SGSF filter is independent of the CAMERA gene set enrichment test statistic under H₀, as proved in Appendix A, and, second, the SGSF filter is associated with the AML versus ALL phenotype under H_A. Marginal independence of the filter statistic enables filtering to increase the relative proportion of significant p-values without increasing the type I error rate. The association between the SGSF filter statistic and the AML versus ALL phenotype, which is nicely illustrated in Figure 4 of Frost et al. [26], enables filtering to selectively retain significantly associated gene sets. The broader relevance of this class of gene set filters for improving gene set enrichment power in cancer gene expression studies is supported by the finding of Gorlov et al. [37] that genes with large expression variance among cancer cases are more likely to play an important role in tumorigenesis.

Fig. 4 — MSigDB C2 filtering for BiKE carotid plaque gene expression data. Quantile-quantile plot of U(0, 1) versus the gene set enrichment p-values computed via CAMERA for the unfiltered and filtered MSigDB C2 v4.0 gene sets and the BiKE carotid plaque gene expression data using ischemic event versus no ischemic event as a binary phenotype as detailed in Section 2.3.4.

3.3 Carotid plaque gene expression example

Figure 4 illustrates the impact of filtering on gene set enrichment power for the MSigDB C2 v4.0 gene sets and BiKE carotid plaque gene expression data. In this case, the gene set enrichment p-values computed by CAMERA relative to the ischemic event phenotype are approximately U(0, 1) distributed for no filtering and for each of the alternative filtering methods. Only for SGSF filtering is there a visible improvement in enrichment power. In contrast to the leukemia gene expression results shown in Table 1, it was not feasible to show filtering results for the gene sets with the most significant phenotype enrichment p-values since none of the filtering methods retained any within the top 25. Instead, Table 2 displays the 14 MSigDB gene sets retained by the SGSF method. Although the most significant FDR q-values after SGSF filtering are only slightly below 0.5, these q-values are supportive of further investigation since they indicate that approximately half of the reported associations at this level are likely true. In fact, 7 of the 10 most significant gene sets in Table 2 have reported associations with atherosclerosis. Specifically, for the BHAT_ESR1_TARGETS_VIA_AKT1_DN and BHAT_ESR1_TARGETS_NOT_VIA_AKT1_DN gene sets, an association has been reported between ESR1 genetic variants and the development of atherosclerotic lesions [38]; for KEGG_ADHERENS_JUNCTION, there is evidence that adherens junction molecules are involved in atherosclerotic lesion formation through control of endothelial permeability, leukocyte recruitment and platelet deposition [39], [40]; for ST_INTEGRIN_SIGNALING_PATHWAY, integrin signaling pathways have been implicated in atherosclerotic lesion development via endothelial cell activation [41]; for BIOCARTA_TGFB_PATHWAY, TGF-β plays a key role in the development of atherosclerosis via control of the fibroproliferative response to tissue damage [42]; for PID_ALK2PATHWAY, the Alk2 signaling pathway is involved in endothelial cell activation via interaction with bone morphogenetic proteins [43]; for CARD_MIR302A_TARGETS, miR-302a has been implicated in lipoprotein metabolism and atherosclerosis risk [44].
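The interpretation of q-values near 0.5 can be checked with a rough calculation. Treating the q-value as an estimate of the proportion of false discoveries among the gene sets reported at that level (an approximation made here for illustration, not a result computed in the study), the R = 7 gene sets in Table 2 with an SGSF q-value of 0.488 give

$$\mathrm{E}[\text{false discoveries}] \approx q \times R = 0.488 \times 7 \approx 3.4, \qquad \mathrm{E}[\text{true discoveries}] \approx 7 - 3.4 = 3.6,$$

so between three and four of these seven gene sets would be expected to reflect true associations.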

Table 2.

The 14 MSigDB C2 v4.0 gene sets retained by SGSF filtering for the BiKE carotid plaque gene expression data.

Gene set	Direction	GSE p-value	Unfiltered q-value	χ² q-value	SGSE var. filtered q-value	SGSE TW-var. filtered q-value

`BHAT_ESR1_TARGETS_VIA_AKT1_DN`	no event	0.115	0.958	-	-	0.488
`HEDENFALK_BREAST_CANCER_BRACX_DN`	no event	0.124	0.958	-	-	0.488
`BHAT_ESR1_TARGETS_NOT_VIA_AKT1_DN`	no event	0.135	0.958	-	-	0.488
`KEGG_ADHERENS_JUNCTION`	no event	0.179	0.958	-	-	0.488
`ST_INTEGRIN_SIGNALING_PATHWAY`	no event	0.204	0.958	-	-	0.488
`STARK_PREFRONTAL_CORTEX_22Q11_DELETION_U…`	no event	0.224	0.958	-	-	0.488
`BIOCARTA_TGFB_PATHWAY`	no event	0.244	0.958	-	-	0.488
`PID_ALK2PATHWAY`	no event	0.324	0.958	-	0.785	0.5
`CARD_MIR302A_TARGETS`	no event	0.369	0.958	-	-	0.5
`IVANOVA_HEMATOPOIESIS_STEM_CELL_SHORT_TE…`	no event	0.386	0.958	-	0.785	0.5
`WATANABE_COLON_CANCER_MSI_VS_MSS_DN`	ischemic event	0.393	0.958	-	0.785	0.5
`REACTOME_CREB_PHOSPHORYLATION_THROUGH_TH…`	ischemic event	0.459	0.958	-	0.788	0.535
`DONATO_CELL_CYCLE_TRETINOIN`	no event	0.865	0.98	-	0.932	0.932
`REACTOME_NCAM1_INTERACTIONS`	ischemic event	0.99	0.998	-	-	0.99

The table columns display the gene set enrichment direction, the phenotype enrichment p-value computed via CAMERA, the FDR q-value computed using all tested MSigDB C2 v4.0 gene sets and the FDR q-value computed using each of the tested filtering methods as detailed in Section 2.3.4. If filtering according to a specific method failed to include a specific gene set, the table includes a "-" in place of a q-value.

4 Conclusion

Gene set testing is a powerful analytical tool that can improve statistical power, biological interpretation and experimental replication. Because of the significant growth in gene set collections, however, the potential gains in statistical power are lost unless some form of gene set filtering is employed. Although the use of predefined collection subsets effectively reduces the number of tested hypotheses, this approach is subjective and vulnerable to researcher bias. Ideally, gene set collections should be filtered according to statistics of the data under investigation. For such a data-driven filter to successfully improve power, the filter statistic must be marginally independent, i.e., independent of the test statistic under H₀ and dependent under H_A. Although independent filters have been identified and successfully utilized for univariate genomic analysis, effective independent filters have not been available that operate on gene sets in the context of gene set testing.

To address this gap, we developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic p-values measuring the statistical significance of the association between each gene set and the principal components (PCs) of an empirical data set, taking into account the significance of the eigenvalue associated with each PC. The SGSF method is effective in any experimental context where the variance structure of genomic variables is associated with the experimental outcome of interest under the alternative hypothesis. Because this filter statistic is independent of standard gene set enrichment test statistics under H₀, the proportion of significantly enriched gene sets is increased without impacting the type I error rate. As shown using simulated gene sets with simulated data and MSigDB collections with microarray gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set enrichment power.

Limitations of the SGSF method include the reliance on a multivariate normal distribution for the genomic data to prove the marginal independence of the filter statistic and, importantly, the requirement that, for power to improve, the gene sets enriched within the variance structure of the data, as detected by the SGSE method, are also associated with the clinical outcome under the alternative hypothesis. Although this latter requirement has been found to hold well for cancer gene expression data [37], further testing with different clinical endpoints and different types of genomic data will be essential to determine the generality of the SGSF approach.

Supplementary Material

tcbb-frost-2415815-mm.zip

NIHMS730437-supplement-tcbb-frost-2415815-mm_zip.zip (1MB, zip)

Acknowledgments

Funding: National Institutes of Health R01 grants LM010098, LM011360, EY022300, GM103506 and GM103534.

Appendix A

Proof of marginal independence of SGSF filter statistic and gene set enrichment test statistic

As outlined by Bourgon et al. [19], the statistic used by a filtering method must be marginally independent of the statistic used for tests on the filtered set of hypotheses in order for filtering to increase the proportion of significant hypotheses without elevating the type I error rate. In their paper, Bourgon et al. used the notation $U_{i}^{I}$ to represent the filtering statistic for hypothesis i and the notation $U_{i}^{I I}$ to represent the statistic employed for testing hypothesis i after filtering. The marginal independence requirement stipulates that the unconditional distribution of $U_{i}^{I I}$ under H₀ is the same as the conditional distribution of $U_{i}^{I I}$ given $U_{i}^{I}$ under H₀. In the context of gene set filters and our SGSF method, Bourgon et al.’s independent filtering condition is satisfied if the statistic used to filter the gene set collection, F_i, is marginally independent of the statistic used to test the association between each gene set and the phenotype. This can be proven using Basu’s theorem as follows:

Let $G_{i}^{I}$ be the filter statistic for gene set i. In the context of SGSF, this is the p-value computed for gene set i by the SGSE technique using the eigenvalue decomposition of the sample covariance matrix S.
Let $G_{i}^{I I}$ be the enrichment statistic for gene set i. In the context of SGSF, this is a two-sample t-statistic computed using identically distributed gene-specific z-statistics for genes in gene set i and for genes not in gene set i (see Section 2.2.3).
Assume the genomic data held in the n × p matrix X can be modeled as n observations from a p-dimensional random vector x whose distribution is approximately multivariate normal, MVN (μ, Σ).
As a member of the exponential family, the multivariate normal distribution has complete minimal sufficient statistics for the population mean, μ, and population covariance, Σ, given by the vector of sample averages, x̄, and sample covariance matrix, S (Theorem 6.2.25, Casella and Berger [45]).
Under H₀, $G_{i}^{I I}$ has a t-distribution. Because the t-distribution is only dependent on a degrees of freedom parameter, it is ancillary to both the vector of means, μ, and the covariance matrix, Σ. In practice, $G_{i}^{I I}$ is a correlation-adjusted t-statistic so is only approximately t-distributed under H₀. Note: the fact that t-statistics are ancillary to the variance under H₀ was used in Bourgon et al. [19] to prove the marginal independence of an overall variance filter statistic and t-statistics via Basu’s theorem.
Because a complete minimal sufficient statistic is independent of an ancillary statistic according to Basu’s theorem (Theorem 6.2.24, Casella and Berger [45]), $G_{i}^{I I}$ is independent of the sample covariance, S, and the vector of sample means, x̄, under H₀.
Because the gene set filtering statistic $G_{i}^{I}$ is a function of the sample covariance matrix, S, $G_{i}^{I I}$ is also independent of $G_{i}^{I}$ under H₀.
$G_{i}^{I}$ and $G_{i}^{I I}$ are therefore marginally independent and $G_{i}^{I}$ can be classified as an independent filtering statistic according to the criteria of Bourgon et al. [19].
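The role of Basu's theorem in the steps above can be checked numerically in the simpler univariate setting of Bourgon et al. [19], where the filter statistic is the overall sample variance and the test statistic is a two-sample t-statistic. The following Python sketch is a simulation under stated assumptions, not an implementation of the SGSF statistic: the sample sizes, the range of standard deviations, the seed and the function names are all hypothetical. If the filter is independent of the test statistic under H₀, the rejection rate among the variables retained by the filter should stay close to the nominal 5% level.

```python
import random
import statistics


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5


def simulate_null(n_sims=4000, n_per_group=10, seed=1):
    """Simulate (overall variance, t-statistic) pairs with no group difference."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        # Hypothetical variable-specific standard deviation, so that the
        # variance filter has something to select on.
        sd = rng.uniform(0.5, 2.0)
        x = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        y = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        results.append((statistics.variance(x + y), pooled_t(x, y)))
    return results


T_CRIT = 2.101  # two-sided 5% critical value of the t-distribution with 18 df

sims = simulate_null()
cutoff = statistics.median([v for v, _ in sims])

# Filter: keep only the half of the variables with the largest overall variance.
kept = [t for v, t in sims if v > cutoff]

rate_all = sum(abs(t) > T_CRIT for _, t in sims) / len(sims)
rate_kept = sum(abs(t) > T_CRIT for t in kept) / len(kept)

print("type I error rate, all variables:     ", round(rate_all, 3))
print("type I error rate, variance-filtered: ", round(rate_kept, 3))
```

A filter that is not independent of the test statistic under H₀, for example one that selects on the absolute difference in group means, would be expected to inflate the second rate well above the nominal level.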

Generalization of this proof to cases where the marginal distribution of each member of the random vector x can be modeled by a member of the exponential family will be explored in future work.

Footnotes

Availability

The MSigDB C2 v4.0 gene sets can be downloaded from http://www.broadinstitute.org/gsea/msigdb/collections.jsp. The Armstrong et al. [34] leukemia gene expression data can be downloaded from http://www.broadinstitute.org/gsea/datasets.jsp. The BiKE carotid plaque gene expression data [35] can be downloaded from GEO at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21545. An implementation of the SGSE algorithm used to compute the SGSF filter statistic is available in the PCGSE R package (version ≥ 0.2, http://cran.r-project.org/web/packages/PCGSE/index.html). Due to the dependency on the Bioconductor package safe, it is recommended that PCGSE be installed using the biocLite() function. At the R prompt, enter `source("http://bioconductor.org/biocLite.R")` followed by `biocLite("PCGSE")`.

Conflict of Interest: None declared.

Contributor Information

H. Robert Frost, Email: rob.frost@dartmouth.edu, Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755.

Zhigang Li, Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755.

Folkert W. Asselbergs, Durrer Center for Cardio-genetic Research at the ICIN-Netherlands Heart Institute and the Department of Cardiology, Division of Heart and Lungs at the University Medical Center Utrecht, Utrecht, The Netherlands

Jason H. Moore, Institute for Quantitative Biomedical Sciences, the Section of Biostatistics and Epidemiology in the Department of Community and Family Medicine and the Department of Genetics at the Geisel School of Medicine, Dartmouth College, Hanover, NH 03755

References

1. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012 Feb;8(2):e1002375. doi: 10.1371/journal.pcbi.1002375.
2. Hung J-H, Yang T-H, Hu Z, Weng Z, Delisi C. Gene set enrichment analysis: performance evaluation and usage guidelines. Brief Bioinform. 2012 May;13(3):281–291. doi: 10.1093/bib/bbr049.
3. Gene Ontology Consortium. The gene ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010 Jan;38(Database issue):D331–D335. doi: 10.1093/nar/gkp1018.
4. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000 Jan;28(1):27–30. doi: 10.1093/nar/28.1.27. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/10592173.
5. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (msigdb) 3.0. Bioinformatics. 2011 Jun;27(12):1739–1740. doi: 10.1093/bioinformatics/btr260.
6. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
7. Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics. 2007 Jun;1(1):107–129.
8. Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Research. 2012 Sep;40(17):e133. doi: 10.1093/nar/gks461.
9. Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, Hub AmiGO Web Presence Working Group. Amigo: online access to ontology and annotation data. Bioinformatics. 2009 Jan;25(2):288–289. doi: 10.1093/bioinformatics/btn615.
10. du Plessis L, Skunca N, Dessimoz C. The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. 2011 Nov;12(6):723–735. doi: 10.1093/bib/bbr002.
11. Davis MJ, Sehgal MSB, Ragan MA. Automatic, context-specific generation of gene ontology slims. BMC Bioinformatics. 2010;11:498. doi: 10.1186/1471-2105-11-498.
12. Alterovitz G, Xiang M, Mohan M, Ramoni MF. Go pad: the gene ontology partition database. Nucleic Acids Res. 2007 Jan;35(Database issue):D322–D327. doi: 10.1093/nar/gkl799.
13. Alterovitz G, Xiang M, Hill DP, Lomax J, Liu J, Cherkassky M, Dreyfuss J, Mungall C, Harris MA, Dolan ME, Blake JA, Ramoni MF. Ontology engineering. Nature Biotechnology. 2010 Feb;28(2):128–130. doi: 10.1038/nbt0210-128. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/20139945.
14. Falcon S, Gentleman R. Using GO stats to test gene lists for GO term association. Bioinformatics (Oxford, England) 2007 Jan;23(2):257–258. doi: 10.1093/bioinformatics/btl567. PMID: 17098774. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/17098774.
15. Grossmann S, Bauer S, Robinson PN, Vingron M. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics (Oxford, England) 2007 Nov;23(22):3024–3031. doi: 10.1093/bioinformatics/btm440. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/17848398.
16. Lee Y, Yang X, Huang Y, Fan H, Zhang Q, Wu Y, Li J, Hasina R, Cheng C, Lingen MW, Gerstein MB, Weichselbaum RR, Xing HR, Lussier YA. Network modeling identifies molecular functions targeted by mir-204 to suppress head and neck tumor metastasis. PLoS Comput Biol. 2010 Apr;6(4):e1000730. doi: 10.1371/journal.pcbi.1000730.
17. Yang X, Li J, Lee Y, Lussier YA. Go-module: functional synthesis and improved interpretation of gene ontology patterns. Bioinformatics. 2011 May;27(10):1444–1446. doi: 10.1093/bioinformatics/btr142.
18. Talloen W, Clevert D-A, Hochreiter S, Amaratunga D, Bijnens L, Kass S, Göhlmann HWH. I/ni-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data. Bioinformatics. 2007 Nov;23(21):2897–2902. doi: 10.1093/bioinformatics/btm478.
19. Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci U S A. 2010 May;107(21):9546–9551. doi: 10.1073/pnas.0914005107.
20. Dai JY, Kooperberg C, Leblanc M, Prentice RL. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika. 2012 Dec;99(4):929–944. doi: 10.1093/biomet/ass044.
21. Tripathi S, Glazko GV, Emmert-Streib F. Ensuring the statistical soundness of competitive gene set approaches: gene filtering and genome-scale coverage are essential. Nucleic Acids Res. 2013 Apr;41(7):e82. doi: 10.1093/nar/gkt054.
22. Frost HR, Moore JH. Optimization of gene set annotations via entropy minimization over variable clusters (emvc). Bioinformatics. 2014 Feb. doi: 10.1093/bioinformatics/btu110.
23. Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015. doi: 10.1186/s12859-015-0490-7. In press.
24. Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009 Feb;10:47. doi: 10.1186/1471-2105-10-47.
25. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics. 2001;29(2):295–327. [Online]. Available: http://www.jstor.org/stable/2674106.
26. Frost HR, Li Z, Moore JH. Principal component gene set enrichment (PCGSE). ArXiv e-prints. 2014 Mar. doi: 10.1186/s13040-015-0059-z. arXiv:1403.5148.
27. Whitlock MC. Combining probability from independent tests: the weighted z-method is superior to fisher’s approach. J Evol Biol. 2005 Sep;18(5):1368–1373. doi: 10.1111/j.1420-9101.2005.00917.x.
28. Won S, Morris N, Lu Q, Elston RC. Choosing an optimal method to combine p-values. Stat Med. 2009 May;28(11):1537–1553. doi: 10.1002/sim.3569.
29. Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979;28(1):100–108. [Online]. Available: http://dx.doi.org/10.2307/2346830.
30. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 2001;63(Part 2):411–423.
31. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions. R package version 1.15.2. 2014.
32. Paul D, Bair E, Hastie T, Tibshirani R. “Preconditioning” for feature selection and regression in high-dimensional problems. Annals of Statistics. 2008 Aug;36(4):1595–1618.
33. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 1995:289–300.
34. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ. Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics. 2002 Jan;30(1):41–47. doi: 10.1038/ng765.
35. Folkersen L, Persson J, Ekstrand J, Agardh HE, Hansson GK, Gabrielsen A, Hedin U, Paulsson-Berne G. Prediction of ischemic events on the basis of transcriptomic and genomic profiling in patients undergoing carotid endarterectomy. Mol Med. 2012;18:669–675. doi: 10.2119/molmed.2011.00479.
36. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. Ncbi geo: archive for functional genomics data sets–update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991–D995. doi: 10.1093/nar/gks1193.
37. Gorlov IP, Yang J-Y, Byun J, Logothetis C, Gorlova OY, Do K-A, Amos C. How to get the most from microarray data: advice from reverse genomics. BMC Genomics. 2014;15(1):223. doi: 10.1186/1471-2164-15-223.
38. Lehtimäki T, Kunnas TA, Mattila KM, Perola M, Penttilä A, Koivula T, Karhunen PJ. Coronary artery wall atherosclerosis in relation to the estrogen receptor 1 gene polymorphism: an autopsy study. J Mol Med (Berl) 2002 Mar;80(3):176–180. doi: 10.1007/s00109-001-0311-5.
39. Zernecke A, Liehn EA, Fraemohs L, von Hundelshausen P, Koenen RR, Corada M, Dejana E, Weber C. Importance of junctional adhesion molecule-a for neointimal lesion formation and infiltration in atherosclerosis-prone mice. Arterioscler Thromb Vasc Biol. 2006 Feb;26(2):e10–e13. doi: 10.1161/01.ATV.0000197852.24529.4f.
40. Schulz B, Pruessmeyer J, Maretzky T, Ludwig A, Blobel CP, Saftig P, Reiss K. Adam10 regulates endothelial permeability and t-cell transmigration by proteolysis of vascular endothelial cadherin. Circ Res. 2008 May;102(10):1192–1201. doi: 10.1161/CIRCRESAHA.107.169805.
41. Yurdagul A Jr, Green J, Albert P, McInnis MC, Mazar AP, Orr AW. α5β1 integrin signaling mediates oxidized low-density lipoprotein-induced inflammation and early atherosclerosis. Arterioscler Thromb Vasc Biol. 2014 Jul;34(7):1362–1373. doi: 10.1161/ATVBAHA.114.303863.
42. Toma I, McCaffrey TA. Transforming growth factor-β and atherosclerosis: interwoven atherogenic and atheroprotective aspects. Cell Tissue Res. 2012 Jan;347(1):155–175. doi: 10.1007/s00441-011-1189-3.
43. Hopkins PN. Molecular biology of atherosclerosis. Physiol Rev. 2013 Jul;93(3):1317–1542. doi: 10.1152/physrev.00004.2012.
44. Sacco J, Adeli K. Micrornas: emerging roles in lipid and lipoprotein metabolism. Curr Opin Lipidol. 2012 Jun;23(3):220–225. doi: 10.1097/MOL.0b013e3283534c9f.
45. Casella G, Berger R. Statistical Inference. Duxbury Resource Center; 2001 Jun.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

tcbb-frost-2415815-mm.zip

NIHMS730437-supplement-tcbb-frost-2415815-mm_zip.zip^{(1MB, zip)}

[R1] 1.Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012 Feb;8(2):e1002375. doi: 10.1371/journal.pcbi.1002375. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Hung J-H, Yang T-H, Hu Z, Weng Z, Delisi C. Gene set enrichment analysis: performance evaluation and usage guidelines. Brief Bioinform. 2012 May;13(3):281–291. doi: 10.1093/bib/bbr049. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Gene Ontology Consortium. The gene ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010 Jan;38(Database issue):D331–D335. doi: 10.1093/nar/gkp1018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000 Jan;28(1):27–30. doi: 10.1093/nar/28.1.27. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/10592173. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (msigdb) 3.0. Bioinformatics. 2011 Jun;27(12):1739–1740. doi: 10.1093/bioinformatics/btr260. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct;102(43):15 545–15 550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics. 2007 Jun;1(1):107–129. [Google Scholar]

[R8] 8.Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Research. 2012 Sep;40(17):e133. doi: 10.1093/nar/gks461. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, Hub AmiGO Web Presence Working Group. Amigo: online access to ontology and annotation data. Bioinformatics. 2009 Jan;25(2):288–289. doi: 10.1093/bioinformatics/btn615. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.du Plessis L, Skunca N, Dessimoz C. The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. 2011 Nov;12(6):723–735. doi: 10.1093/bib/bbr002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Davis MJ, Sehgal MSB, Ragan MA. Automatic, context-specific generation of gene ontology slims. BMC Bioinformatics. 2010;11:498. doi: 10.1186/1471-2105-11-498. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Alterovitz G, Xiang M, Mohan M, Ramoni MF. Go pad: the gene ontology partition database. Nucleic Acids Res. 2007 Jan;35(Database issue):D322–D327. doi: 10.1093/nar/gkl799. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Alterovitz G, Xiang M, Hill DP, Lomax J, Liu J, Cherkassky M, Dreyfuss J, Mungall C, Harris MA, Dolan ME, Blake JA, Ramoni MF. Ontology engineering. Nature Biotechnology. 2010 Feb;28(2):128–130. doi: 10.1038/nbt0210-128. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/20139945. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Falcon S, Gentleman R. Using GO stats to test gene lists for GO term association. Bioinformatics (Oxford, England) 2007 Jan;23(2):257–258. doi: 10.1093/bioinformatics/btl567. PMID: 17098774. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/17098774. [DOI] [PubMed] [Google Scholar]

[R15] 15.Grossmann S, Bauer S, Robinson PN, Vingron M. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics (Oxford, England) 2007 Nov;23(22):3024–3031. doi: 10.1093/bioinformatics/btm440. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/17848398. [DOI] [PubMed] [Google Scholar]

[R16] 16.Lee Y, Yang X, Huang Y, Fan H, Zhang Q, Wu Y, Li J, Hasina R, Cheng C, Lingen MW, Gerstein MB, Weichselbaum RR, Xing HR, Lussier YA. Network modeling identifies molecular functions targeted by mir-204 to suppress head and neck tumor metastasis. PLoS Comput Biol. 2010 Apr;6(4):e1000730. doi: 10.1371/journal.pcbi.1000730. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Yang X, Li J, Lee Y, Lussier YA. Go-module: functional synthesis and improved interpretation of gene ontology patterns. Bioinformatics. 2011 May;27(10):1444–1446. doi: 10.1093/bioinformatics/btr142. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Talloen W, Clevert D-A, Hochreiter S, Amaratunga D, Bijnens L, Kass S, Göhlmann HWH. I/ni-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data. Bioinformatics. 2007 Nov;23(21):2897–2902. doi: 10.1093/bioinformatics/btm478. [DOI] [PubMed] [Google Scholar]

[R19] 19. Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci U S A. 2010 May;107(21):9546–9551. doi: 10.1073/pnas.0914005107.

[R20] 20. Dai JY, Kooperberg C, Leblanc M, Prentice RL. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika. 2012 Dec;99(4):929–944. doi: 10.1093/biomet/ass044.

[R21] 21. Tripathi S, Glazko GV, Emmert-Streib F. Ensuring the statistical soundness of competitive gene set approaches: gene filtering and genome-scale coverage are essential. Nucleic Acids Res. 2013 Apr;41(7):e82. doi: 10.1093/nar/gkt054.

[R22] 22. Frost HR, Moore JH. Optimization of gene set annotations via entropy minimization over variable clusters (EMVC). Bioinformatics. 2014 Feb. doi: 10.1093/bioinformatics/btu110.

[R23] 23. Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015. doi: 10.1186/s12859-015-0490-7. In press.

[R24] 24. Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009 Feb;10:47. doi: 10.1186/1471-2105-10-47.

[R25] 25. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics. 2001;29(2):295–327. [Online]. Available: http://www.jstor.org/stable/2674106.

[R26] 26. Frost HR, Li Z, Moore JH. Principal component gene set enrichment (PCGSE). ArXiv e-prints. 2014 Mar. doi: 10.1186/s13040-015-0059-z. arXiv:1403.5148.

[R27] 27. Whitlock MC. Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. J Evol Biol. 2005 Sep;18(5):1368–1373. doi: 10.1111/j.1420-9101.2005.00917.x.

[R28] 28. Won S, Morris N, Lu Q, Elston RC. Choosing an optimal method to combine p-values. Stat Med. 2009 May;28(11):1537–1553. doi: 10.1002/sim.3569.

[R29] 29. Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979;28(1):100–108. [Online]. Available: http://dx.doi.org/10.2307/2346830.

[R30] 30. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 2001;63(Part 2):411–423.

[R31] 31. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions. R package version 1.15.2. For new features, see the ‘Changelog’ file (in the package source). 2014.

[R32] 32. Paul D, Bair E, Hastie T, Tibshirani R. “Preconditioning” for feature selection and regression in high-dimensional problems. Annals of Statistics. 2008 Aug;36(4):1595–1618.

[R33] 33. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 1995;57(1):289–300.

[R34] 34. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics. 2002 Jan;30(1):41–47. doi: 10.1038/ng765.

[R35] 35. Folkersen L, Persson J, Ekstrand J, Agardh HE, Hansson GK, Gabrielsen A, Hedin U, Paulsson-Berne G. Prediction of ischemic events on the basis of transcriptomic and genomic profiling in patients undergoing carotid endarterectomy. Mol Med. 2012;18:669–675. doi: 10.2119/molmed.2011.00479.

[R36] 36. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991–D995. doi: 10.1093/nar/gks1193.

[R37] 37. Gorlov IP, Yang J-Y, Byun J, Logothetis C, Gorlova OY, Do K-A, Amos C. How to get the most from microarray data: advice from reverse genomics. BMC Genomics. 2014;15(1):223. doi: 10.1186/1471-2164-15-223.

[R38] 38. Lehtimäki T, Kunnas TA, Mattila KM, Perola M, Penttilä A, Koivula T, Karhunen PJ. Coronary artery wall atherosclerosis in relation to the estrogen receptor 1 gene polymorphism: an autopsy study. J Mol Med (Berl). 2002 Mar;80(3):176–180. doi: 10.1007/s00109-001-0311-5.

[R39] 39. Zernecke A, Liehn EA, Fraemohs L, von Hundelshausen P, Koenen RR, Corada M, Dejana E, Weber C. Importance of junctional adhesion molecule-A for neointimal lesion formation and infiltration in atherosclerosis-prone mice. Arterioscler Thromb Vasc Biol. 2006 Feb;26(2):e10–e13. doi: 10.1161/01.ATV.0000197852.24529.4f.

[R40] 40. Schulz B, Pruessmeyer J, Maretzky T, Ludwig A, Blobel CP, Saftig P, Reiss K. ADAM10 regulates endothelial permeability and T-cell transmigration by proteolysis of vascular endothelial cadherin. Circ Res. 2008 May;102(10):1192–1201. doi: 10.1161/CIRCRESAHA.107.169805.

[R41] 41. Yurdagul A Jr, Green J, Albert P, McInnis MC, Mazar AP, Orr AW. α5β1 integrin signaling mediates oxidized low-density lipoprotein-induced inflammation and early atherosclerosis. Arterioscler Thromb Vasc Biol. 2014 Jul;34(7):1362–1373. doi: 10.1161/ATVBAHA.114.303863.

[R42] 42. Toma I, McCaffrey TA. Transforming growth factor-β and atherosclerosis: interwoven atherogenic and atheroprotective aspects. Cell Tissue Res. 2012 Jan;347(1):155–175. doi: 10.1007/s00441-011-1189-3.

[R43] 43. Hopkins PN. Molecular biology of atherosclerosis. Physiol Rev. 2013 Jul;93(3):1317–1542. doi: 10.1152/physrev.00004.2012.

[R44] 44. Sacco J, Adeli K. MicroRNAs: emerging roles in lipid and lipoprotein metabolism. Curr Opin Lipidol. 2012 Jun;23(3):220–225. doi: 10.1097/MOL.0b013e3283534c9f.

[R45] 45. Casella G, Berger R. Statistical Inference. Duxbury Resource Center; 2001 Jun.


