Author manuscript; available in PMC: 2015 Sep 24.
Published in final edited form as: Biometrika. 2015 Jun 19;102(3):533–544. doi: 10.1093/biomet/asv013

Covariance-based analyses of biological pathways

P DANAHER 1, D PAUL 2, P WANG 3
PMCID: PMC4581526  NIHMSID: NIHMS685282  PMID: 26412865

Summary

The use of high-throughput data to study the changing behavior of biological pathways has focused mainly on examining the changes in the means of pathway genes. In this paper, we propose instead to test for changes in the co-regulated and unregulated variability of pathway genes. We assume that the eigenvalues of previously defined pathways capture biologically relevant quantities, and we develop a test for biologically meaningful changes in the eigenvalues between classes. This test reflects important and often ignored aspects of pathway behavior and provides a useful complement to traditional pathway analyses.

Keywords: gene expression, pathway analysis, spiked eigenvalue

1. Introduction

A pathway refers to a set of genes or proteins jointly participating in a biological process. It is of great interest to study the behavior of pathways using high-throughput omics data. By treating a pre-defined set of genes with shared biological function as an analytical unit, pathway-level analyses efficiently exploit prior biological knowledge, improve interpretability, and enjoy greater power by combining the signals of individual genes. Existing pathway analysis methods have focused almost exclusively on the marginal behavior of pathway genes. For example, Tomfohr et al. (2005), Lee et al. (2008) and Drier et al. (2013) suggested synthesizing the information in pathway genes into measures of pathway activity, while a large body of work, including Subramanian et al. (2005) and Efron & Tibshirani (2007), aims to identify pathways that are enriched with differentially expressed genes. These marginal analyses, while shedding light on important questions, fail to capture the full complexity of pathway behavior.

In this paper, we propose a test to examine the joint behavior of pathway genes. This test complements the marginal analyses mentioned above and helps to provide a more comprehensive understanding of biological pathways. Specifically, we consider the problem of detecting differences in covariance among a pre-defined set of pathway genes between two classes of samples, for example between two different cancer subtypes. Tests of equality of two covariance matrices are well studied. However, the number of genes in a pathway ranges from tens to hundreds, and quite often exceeds the sample size. Under such regimes, classical tests for equality of covariance matrices no longer apply. A number of authors (Schott, 2007; Srivastava & Yanagihara, 2010; Li & Chen, 2012) have developed tests for equality of covariance matrices in the high dimension, low sample size setting. However, by testing the null hypothesis that two covariance matrices are exactly equal, without accounting for the structure induced by pathway activity, these tests provide inadequate biological insight: their rejection of the null hypothesis allows no conclusions about how pathway gene behavior differs between classes. In contrast, the proposed test is motivated from a biological model of the expression of pathway genes and focused on quantities with natural biological interpretations. The novelty of this test lies in its focus on the joint rather than the marginal behavior of pathway genes and in its consideration of disordered variability orthogonal to the effects of pathway activity.

Our biological model assumes that genes’ associations with pathway activity drive the leading eigenvector of their covariance matrix. This model suggests that the first eigenvector is invariant to changes in biological conditions, while the leading and remaining eigenvalues will vary across data sets in response to within-population variability of pathway activity and variability due to other, unregulated causes, respectively. Under this model, the covariance matrix of the expression levels of pathway genes has a spiked eigen-structure (Johnstone, 2001; Baik & Silverstein, 2006; Paul, 2007), and the leading eigenvalue and the trace of the covariance matrix provide a parsimonious and biologically relevant summary of pathway genes’ joint behavior. Baik & Silverstein (2006) showed that if the dimension-to-sample size ratio converges to a nonzero finite constant, and if the true spiked eigenvalues exceed a threshold, the corresponding sample eigenvalues converge with probability one to limits that depend on the true eigenvalues and the dimension-to-sample size ratio. Paul (2007) proved asymptotic normality of the leading sample eigenvalues under the same framework. We extend the latter asymptotic results to design a χ2 test statistic based on the joint behavior of the leading eigenvalue and the trace of the sample covariance matrices of the two classes. When the proposed test rejects the null, it indicates that specific, biologically-relevant quantities differ between classes. Simulations suggest that if the spiked covariance structure holds even approximately, the proposed test has better power to detect differences in biologically important functions of the eigenvalues than existing tests.

2. A model of co-expression in biological pathways

First, consider data from only one class. Denote the gene expression data for a previously defined pathway with p genes from n observations by the n × p matrix Y, and denote the data vector of the ith observation by $y_i = (y_{i1}, \ldots, y_{ip})$. We assume that pathway activity is the primary driver of pathway gene expression. For example, for a set of genes regulated by a common transcription factor, the primary source of variance in the expression levels of the entire gene set would be changes in the activity level of the transcription factor, and it would be reasonable to specify these relationships through a linear dependence model. We write

$$y_{ik} = \mu_k + h_k a_i + \varepsilon_{ik}, \quad i = 1, \ldots, n, \quad k = 1, \ldots, p, \tag{1}$$

where $\mu_k$ is an intercept specific to gene k that can be ignored for our purposes, $a_i$ is a random variable with mean 0 and variance $\sigma_a^2$, $h = (h_1, \ldots, h_p)$ are gene-specific scaling coefficients, $\varepsilon_i = (\varepsilon_{i,1}, \ldots, \varepsilon_{i,p})$ are independent random variables with mean 0 and variance $\sigma_\varepsilon^2$, and $\varepsilon_1, \ldots, \varepsilon_n, a_1, \ldots, a_n$ are mutually independent. We add the constraint $\|h\|_2^2 = 1$ to make h and $\sigma_a^2$ identifiable. In this model, $a_i$ reflects the level of pathway activity, e.g. the transcription factor level, in the ith sample, $\sigma_a^2$ drives the well-ordered, co-regulated component of total gene variance, and $\sigma_\varepsilon^2$ measures the unordered, noisy component of pathway gene variance. It follows that

$$\mathrm{cov}(y_i) = \mathrm{cov}(a_i h + \varepsilon_i) = \sigma_a^2 h h^{\mathrm{T}} + \sigma_\varepsilon^2 I. \tag{2}$$

The first eigenvalue of $\mathrm{cov}(y_i)$ is $\sigma_a^2 + \sigma_\varepsilon^2$, and the remaining eigenvalues equal $\sigma_\varepsilon^2$. This observation allows biological interpretations to be assigned to the eigenvalues of the pathway’s covariance matrix: the first eigenvalue measures the variability in pathway genes due to changing levels of pathway activity, i.e., the well-ordered, co-regulated component of pathway gene variability, while the sum of the remaining eigenvalues captures the unordered, chaotic component of variability. This interpretation echoes Tomfohr et al. (2005), Bild et al. (2005), Bair et al. (2006) and Chen et al. (2008) in implying that an observation’s projection onto the first eigenvector of the pathway’s covariance matrix measures the observation’s pathway activity level. Moreover, the covariance structure in (2) matches the spiked eigenvalue model of Baik & Silverstein (2006), Paul (2007), Nadler (2008), Onatski (2012) and others. This implication of the model holds nearly universally in pathway data. Thus, one can take advantage of the asymptotic theories under the spiked eigenvalue model to perform inference on $\sigma_a^2$ and $\sigma_\varepsilon^2$.
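The eigenvalue claim following (2) can be checked numerically. The sketch below is illustrative only: the dimension, the direction h and the two variances are made-up values, not quantities from the paper, and the function name is hypothetical.

```python
import numpy as np

def pathway_covariance(h, sigma_a2, sigma_eps2):
    """Population covariance sigma_a^2 h h^T + sigma_eps^2 I implied by model (1)."""
    h = np.asarray(h, dtype=float)
    h = h / np.linalg.norm(h)  # enforce the identifiability constraint ||h||_2^2 = 1
    return sigma_a2 * np.outer(h, h) + sigma_eps2 * np.eye(h.size)

# Illustrative values only (assumptions, not taken from the paper).
p, sigma_a2, sigma_eps2 = 10, 4.0, 1.5
cov = pathway_covariance(np.arange(1, p + 1), sigma_a2, sigma_eps2)

# Eigenvalues sorted from largest to smallest.
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
print(eigenvalues[0], eigenvalues[1:])
```

The leading eigenvalue equals the sum of the two variances and every other eigenvalue equals the noise variance, whatever direction h is chosen.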

In data sets with two classes of samples, we may wish to compare these biologically meaningful quantities between classes. We therefore propose to test the null hypothesis

$$H_0 : (\alpha_{1,1}, T_1) = (\alpha_{2,1}, T_2), \tag{3}$$

where $\alpha_{j,k}$ denotes the kth population eigenvalue of class j and $T_j$ denotes the trace of the covariance matrix of class j. The interpretation of the pair $(\alpha_{j,1}, T_j)$ gives the alternatives to H0 specific and useful biological meaning. For example, when $\alpha_{1,1} > \alpha_{2,1}$ and $T_1 - \alpha_{1,1} \approx T_2 - \alpha_{2,1}$, we might conclude that the pathway activity level is stronger in the first class, or, when $T_1 - \alpha_{1,1} > T_2 - \alpha_{2,1}$, we might conclude that the pathway is dysregulated and subject to greater noise in the first class. By directly testing H0 rather than the stronger null hypothesis of equality of the entire covariance matrices, we maximize power to detect changes in the modes of variability attributable to pathway activity and to noisy, non-co-regulated causes, while detecting other changes in the covariance matrix only insofar as they change our eigenvalue statistics. In particular, we ignore the eigenvectors and redistribution of weights among smaller eigenvalues. The biological model specified in (1) and (2) can also be seen as a factor analysis model with one factor and a specialized covariance structure for the idiosyncratic term $\varepsilon_{ik}$, even though our hypotheses and test procedure differ from the commonly used tests for factorial invariance (Meade & Bauer, 2007).

Model (1) suggests two general features of pathway data: (a) the eigenvalues of the covariance matrix resemble the spiked model; (b) the leading eigenvectors capture the effects of pathway activities on gene expression, or equivalently, the leading spiked eigenvalues capture variability in the data due to changes in pathway activity. Both features apply for a large set of pathways even when model (1) does not hold. The statistical theory behind our test relies on (a), and the biological interpretation of our test is based on (b). The Supplementary Material contains extensive empirical investigations supporting (a) and (b).

In the next sections, we introduce a test for H0 in (3), assuming (a) and (b) hold. Moreover, in many cases, pathway genes are subject to multiple biological processes. When this occurs the covariance matrix has additional spikes, i.e., more eigenvalues become significantly larger than the noise eigenvalues. The proposed test also accommodates these scenarios.

3. A test for differences in the eigenstructure of Σ1 and Σ2

3·1. The single spiked eigenvalue setting

Denote the eigenvalues and trace of the sample covariance matrices by $\hat\alpha_{j,i}$ and $\hat T_j$ (j = 1, 2; i = 1,…,p). To test H0 in (3), a natural choice is to form a test statistic using $\hat\alpha_{1,1} - \hat\alpha_{2,1}$ and $\hat T_1 - \hat T_2$. We use a quadratic form to combine the information in these quantities.

Under H0 in (3), without loss of generality, we assume that σε,1 = σε,2 =1, or equivalently, the unspiked eigenvalues of the common covariance matrix are all equal to 1, αj,2 =···= αj,p = 1, (j = 1, 2). To adhere to this assumption, we normalize the data as follows. We calculate a scale factor equal to the square root of the median eigenvalue of the pooled sample covariance matrix from both classes and divide all the observations by this factor; see the Supplementary Material.
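A minimal sketch of this scaling step follows. The exact pooling rule is given in the Supplementary Material, which is not reproduced here, so the degrees-of-freedom weighting below is an assumption, as are the function name and the toy data.

```python
import numpy as np

def normalize_by_median_eigenvalue(y1, y2):
    """Divide both classes by sqrt(median eigenvalue of a pooled sample covariance)."""
    n1, n2 = y1.shape[0], y2.shape[0]
    s1 = np.cov(y1, rowvar=False)
    s2 = np.cov(y2, rowvar=False)
    # Assumption: pool the class covariances by their degrees of freedom.
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    median_eig = np.median(np.linalg.eigvalsh(pooled))
    if median_eig <= 0:
        # Can happen when p is large relative to n1 + n2 (rank-deficient pooled matrix).
        raise ValueError("median pooled eigenvalue is not positive")
    scale = np.sqrt(median_eig)
    return y1 / scale, y2 / scale, scale

# Toy data: 10 genes with noise standard deviation 3 in both classes.
rng = np.random.default_rng(0)
y1 = 3.0 * rng.standard_normal((60, 10))
y2 = 3.0 * rng.standard_normal((40, 10))
z1, z2, scale = normalize_by_median_eigenvalue(y1, y2)
```

After rescaling, the median eigenvalue of the same pooled matrix is exactly one, which is the sense in which the data adhere to the unit-noise assumption.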

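For notational convenience, in the rest of this subsection we use $\alpha_j$ and $\hat\alpha_j$ to mean $\alpha_{j,1}$ and $\hat\alpha_{j,1}$, respectively. According to Baik & Silverstein (2006), the first sample eigenvalue is a biased estimate of its population counterpart: $\hat\alpha_j \to \alpha_j + \gamma_j\alpha_j/(\alpha_j - 1)$, where $p, n \to \infty$, $p/n_j \to \gamma_j \in (0, \infty)$ and $\alpha_j > 1 + \gamma_j^{1/2}$ (j = 1, 2). Define $b_\alpha = (\gamma_1 - \gamma_2)\alpha_0/(\alpha_0 - 1)$, where $\alpha_0$ is the first eigenvalue shared by both classes under H0 and satisfies $\alpha_0 > 1 + \max(\gamma_1^{1/2}, \gamma_2^{1/2})$. Then under H0, $(\hat\alpha_1 - \hat\alpha_2) \to b_\alpha$ almost surely. This limiting value $b_\alpha \neq 0$ when $\gamma_1 \neq \gamma_2$. To test H0, we focus on the bias-corrected quantity $(\hat\alpha_{1,1} - \hat\alpha_{2,1} - \hat b_\alpha)$ and propose the test statistic $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$, where

$$Q = (Q_\alpha, Q_T)^{\mathrm{T}} = (\hat\alpha_1 - \hat\alpha_2 - \hat b_\alpha, \; \hat T_1 - \hat T_2)^{\mathrm{T}}, \tag{4}$$

and $\hat b_\alpha$ and $\hat\Sigma_Q$ are appropriate consistent estimates for $b_\alpha$ and $\Sigma_Q = \mathrm{cov}(Q)$, respectively.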
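The upward bias of the leading sample eigenvalue is easy to see in a small Monte Carlo draw. The sketch below uses illustrative values of n, p and the spike (assumptions chosen for the example, not settings from the paper) and compares one simulated leading sample eigenvalue with the limit α + γα/(α − 1) quoted from Baik & Silverstein (2006).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 400, 200, 5.0   # illustrative values; gamma = p / n = 0.5
gamma = p / n

# Population covariance diag(alpha, 1, ..., 1): one spike and unit noise eigenvalues.
sd = np.ones(p)
sd[0] = np.sqrt(alpha)
y = rng.standard_normal((n, p)) * sd

# Largest eigenvalue of the sample covariance matrix (eigvalsh sorts ascending).
alpha_hat = np.linalg.eigvalsh(np.cov(y, rowvar=False))[-1]

# Almost-sure limit of the leading sample eigenvalue under the spiked model.
rho = alpha * (1 + gamma / (alpha - 1))
print(f"population {alpha:.3f}, sample {alpha_hat:.3f}, limit {rho:.3f}")
```

With these values the limit is 5.625 rather than 5, and the size of the shift depends on γ. Two classes with different sample sizes therefore have leading sample eigenvalues that differ even under H0, which is why the difference is bias-corrected.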

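We now describe the construction of $\hat b_\alpha$. We first propose the following estimator for $\alpha_0$,

$$\bar\alpha_0 = (w_1 \bar\alpha_1 + w_2 \bar\alpha_2), \quad w_j = n_j/(n_1 + n_2), \tag{5}$$

where $\bar\alpha_j$ is the asymptotic method of moments estimator for $\alpha_j$, namely, $\bar\alpha_j = [1 + \hat\alpha_j - \hat\gamma_j + \{(1 + \hat\alpha_j - \hat\gamma_j)^2 - 4\hat\alpha_j\}^{1/2}]/2$, which is obtained by solving the equation $\hat\alpha_j = \alpha_j\{1 + \hat\gamma_j/(\alpha_j - 1)\}$. Here and henceforth, $\hat\gamma_j = p/n_j$ for j = 1, 2. Substituting $\alpha_0$ with $\bar\alpha_0$ in the expression for $b_\alpha$ yields the estimate

$$\hat b_\alpha = (\hat\gamma_1 - \hat\gamma_2)\,\bar\alpha_0/(\bar\alpha_0 - 1). \tag{6}$$

If $\hat\alpha_j < (1 + \hat\gamma_j^{1/2})^2$, then $\bar\alpha_j$ is complex-valued, which indicates that the population covariance is either unspiked or has small, undetectable, spikes.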
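The estimators in (5) and (6) can be sketched in a few lines. The function names are hypothetical, and returning `None` when the square root has a negative argument is a choice made for this example rather than something the paper prescribes.

```python
import math

def mom_eigenvalue(alpha_hat, gamma_hat):
    """Solve alpha_hat = a * (1 + gamma_hat / (a - 1)) for a, taking the larger root.

    Returns None when the roots are complex, the case the text associates with an
    unspiked covariance or an undetectably small spike.
    """
    b = 1.0 + alpha_hat - gamma_hat
    disc = b * b - 4.0 * alpha_hat
    if disc < 0:
        return None
    return 0.5 * (b + math.sqrt(disc))

def bias_estimate(alpha_hat_1, alpha_hat_2, p, n1, n2):
    """Estimate b_alpha by plugging the pooled estimate (5) into (6)."""
    g1, g2 = p / n1, p / n2
    a1 = mom_eigenvalue(alpha_hat_1, g1)
    a2 = mom_eigenvalue(alpha_hat_2, g2)
    if a1 is None or a2 is None:
        return None
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    a0 = w1 * a1 + w2 * a2
    return (g1 - g2) * a0 / (a0 - 1.0)

# Round trip: a population eigenvalue of 5 with gamma = 0.5 has limit 5.625.
print(mom_eigenvalue(5.625, 0.5))
```

Feeding in the limiting values 5.625 and 6.25, which correspond to a common eigenvalue of 5 with γ equal to 0.5 and 1, recovers the eigenvalue 5 and a bias estimate equal to the difference of the two limits.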

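Define the 2 × 2 symmetric matrix $\Sigma_Q$ with diagonal elements $\tau_{Q\alpha\alpha}$ and $\tau_{QTT}$, respectively, and off-diagonal element $\tau_{Q\alpha T}$. Theorem 2 yields the consistent estimates

$$\hat\tau_{Q\alpha\alpha} = \sum_{j=1}^{2} \left(\frac{2}{n_j}\right) \frac{\bar\alpha_0\,\theta_j^2\,\rho_j}{1 + \bar\alpha_0\hat\gamma_j/\{(\bar\alpha_0 - 1)^2 - \hat\gamma_j\}} \tag{7}$$

and

$$\hat\tau_{Q\alpha T} = \sum_{j=1}^{2} \left(\frac{2}{n_j}\right) \frac{\bar\alpha_0\,\theta_j\,\rho_j}{1 + \bar\alpha_0\hat\gamma_j/\{(\bar\alpha_0 - 1)^2 - \hat\gamma_j\}}. \tag{8}$$

We estimate $\tau_{QTT}$ using equation (10).

After we obtain $\hat\Sigma_Q$, we propose to reject H0 for large values of $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$. According to Theorem 2, under H0, the asymptotic joint normality of $\hat\alpha_1 - \hat\alpha_2 - \hat b_\alpha$ and $\hat T_1 - \hat T_2$ suggests that $Q^{\mathrm{T}} \Sigma_Q^{-1} Q \to \chi_2^2$ in distribution. Then to the extent that $\Sigma_Q^{-1}$ is estimated accurately, our test statistic $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$ may be compared to the quantiles of a $\chi_2^2$ distribution to obtain a p-value. A permutation test may also be employed. Simulations in Section 5 show the proposed test to have accurate Type-1 error at all sample sizes when our assumptions hold, suggesting that accurate estimation of $\Sigma_Q^{-1}$ is not a hurdle for the test’s performance.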
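Given a vector Q and an estimate of its covariance, the comparison with the chi-squared distribution on two degrees of freedom needs no special library, because that distribution has survival function exp(−x/2). The sketch below assumes Q and its covariance estimate have already been computed; the function name and the numbers in the example are hypothetical.

```python
import numpy as np

def chi2_df2_pvalue(q_vec, sigma_q):
    """Return the quadratic form Q^T Sigma^{-1} Q and its chi-square(2) upper-tail probability."""
    q_vec = np.asarray(q_vec, dtype=float)
    sigma_q = np.asarray(sigma_q, dtype=float)
    # Solve a linear system rather than forming the inverse explicitly.
    stat = float(q_vec @ np.linalg.solve(sigma_q, q_vec))
    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return stat, float(np.exp(-stat / 2.0))

# Hypothetical inputs for illustration.
stat, pval = chi2_df2_pvalue([1.0, 1.0], np.eye(2))
print(stat, pval)
```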

3·2. Test robust to the number of spiked eigenvalues

We generally expect that genes in a pathway are jointly associated with not just one but a number of biological processes, which implies the existence of multiple spiked eigenvalues. To accommodate an unspecified number of spiked eigenvalues in the proposed test, we first estimate the number of spiked eigenvalues and then apply a modified expression for $\mathrm{var}(\hat T_j)$.

To estimate $M_j$, the number of spiked eigenvalues in class j, we choose a threshold of $(1 + \hat\gamma_j^{1/2})^2 + \{2\log(n_j)/n_j\}^{1/2}$, and with I denoting the indicator function, define

$$\hat M_j = \sum_{m=1}^{p} I\left[\hat\alpha_{j,m} > (1 + \hat\gamma_j^{1/2})^2 + \{2\log(n_j)/n_j\}^{1/2}\right]. \tag{9}$$

This estimator may have difficulty classifying the eigenvalues near $(1 + \hat\gamma_j^{1/2})^2$. However, the treatment of such small spiked eigenvalues will not appreciably affect our estimates of $\mathrm{var}(\hat T_j)$. We then use independence of $\hat T_1$ and $\hat T_2$ to estimate $\tau_{QTT}$ with

$$\hat\tau_{QTT} = \sum_{j=1}^{2} \frac{2}{n_j}\left(\sum_{m=1}^{\hat M_j} \hat\alpha_{j,m}^2 + p - \hat M_j\right). \tag{10}$$

Some alternative methods for estimating the number of spikes, e.g. the proposal by Kritchman & Nadler (2008), have good power of detection and could be used instead of the estimator (9), but the approach detailed above does not depend on the Gaussian assumption.
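Equations (9) and (10) translate directly into code. The sketch below assumes the sample eigenvalues have already been computed from normalized data; the function names and the toy eigenvalues are hypothetical.

```python
import numpy as np

def count_spikes(eigvals, n):
    """Estimator (9): number of sample eigenvalues above the detection threshold."""
    eigvals = np.asarray(eigvals, dtype=float)
    gamma_hat = eigvals.size / n
    threshold = (1 + np.sqrt(gamma_hat)) ** 2 + np.sqrt(2 * np.log(n) / n)
    return int(np.sum(eigvals > threshold))

def tau_qtt(eigvals_1, n1, eigvals_2, n2):
    """Estimator (10): variance of the difference in sample traces."""
    total = 0.0
    for eigvals, n in ((eigvals_1, n1), (eigvals_2, n2)):
        ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # largest first
        p = ev.size
        m = count_spikes(ev, n)
        # Spiked eigenvalues enter squared; each unspiked one contributes 1.
        total += (2.0 / n) * (np.sum(ev[:m] ** 2) + p - m)
    return total

# Toy eigenvalues: one clear spike among p = 4 eigenvalues, n = 100 per class.
toy = [10.0, 1.2, 1.0, 0.8]
print(count_spikes(toy, 100), tau_qtt(toy, 100, toy, 100))
```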

We outline below the proposed procedure for testing H0 in (3), which is robust to the number of spiked eigenvalues.

  1. Calculate the eigenvalues $\{\hat\alpha_{j,k}\}$ and trace $\hat T_j$ of the sample covariance matrix $\hat\Sigma_j$ (j = 1, 2).

  2. Calculate $\hat b_\alpha = (\hat\gamma_1 - \hat\gamma_2)\,\bar\alpha_0/(\bar\alpha_0 - 1)$, where $\bar\alpha_0$ is defined in equation (5).

  3. Calculate Q according to (4).

  4. Estimate $\Sigma_Q$:

    1. Estimate the number of spiked eigenvalues in each class according to (9), and then calculate $\hat\tau_{QTT}$ according to (10).

    2. Calculate $\theta_j$ and $\rho_j$, j = 1, 2, as defined by Theorem 2. Compute $\hat\tau_{Q\alpha\alpha}$ and $\hat\tau_{Q\alpha T}$ according to equations (7) and (8), respectively.

  5. Compute the test statistic $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$. To obtain a p-value, compare its value to the quantiles of a $\chi_2^2$ distribution. Alternatively, permute the class labels and recompute the test statistic many times, and compare the observed $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$ to the distribution of the permuted statistics.
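The permutation alternative in the last step can be sketched generically. In the code below the statistic is passed in as a function; `trace_difference` is only a stand-in used to make the example runnable and is not the quadratic form of the proposed test. The add-one correction in the p-value is a common convention assumed here, not something stated in the paper.

```python
import numpy as np

def permutation_pvalue(y1, y2, statistic, n_perm=200, seed=0):
    """Permute class labels and compare the observed statistic with the permuted ones."""
    rng = np.random.default_rng(seed)
    observed = statistic(y1, y2)
    stacked = np.vstack([y1, y2])
    n1 = y1.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(stacked.shape[0])
        permuted = statistic(stacked[idx[:n1]], stacked[idx[n1:]])
        exceed += int(permuted >= observed)
    # Add-one correction keeps the p-value strictly positive (assumed convention).
    return (exceed + 1) / (n_perm + 1)

def trace_difference(y1, y2):
    """Stand-in statistic for illustration: squared difference in sample traces."""
    t1 = np.trace(np.cov(y1, rowvar=False))
    t2 = np.trace(np.cov(y2, rowvar=False))
    return (t1 - t2) ** 2

# Toy data: the second class has three times the standard deviation of the first.
rng = np.random.default_rng(1)
y1 = rng.standard_normal((40, 5))
y2 = 3.0 * rng.standard_normal((40, 5))
p_perm = permutation_pvalue(y1, y2, trace_difference)
print(p_perm)
```

In practice the full statistic from steps 1 to 5 would be supplied in place of `trace_difference`, so that normalization, bias correction and variance estimation are all redone within each permutation.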

Sometimes the first eigenvalue might be inadequate to capture variability due to pathway coregulation. For such occasions we could use the top M eigenvalues and test an extended null hypothesis H0M : (α1,1,…,α1,M,T1) = (α2,1,…,α2,M,T2); see the Supplementary Material.

4. Theoretical results

In this section, we outline theoretical results for the asymptotic behavior of $(\hat\alpha_1 - \hat\alpha_2 - \hat b_\alpha)$ and $(\hat T_1 - \hat T_2)$ under the spiked eigenvalue setting implied by our biological model under the null hypothesis and assuming Gaussian data.

We first consider a single class. Denote the population eigenvalues by $\{\alpha_i\}_{i=1}^{p}$ and their sample equivalents by $\{\hat\alpha_i\}_{i=1}^{p}$. We assume $\alpha_1 > \alpha_2 = \cdots = \alpha_p = 1$. Write $\alpha_1 \equiv \alpha$, $\hat\alpha_1 \equiv \hat\alpha$, and let $T = \sum_{k=1}^{p} \alpha_k$, $\hat T = \sum_{k=1}^{p} \hat\alpha_k$. In Theorem 1, we lay the groundwork for our method by specifying the joint asymptotic distribution of $(\hat\alpha, \hat T)$. This result is of interest beyond its application to the proposed test, and to our knowledge it gives the first published expression for the joint asymptotic distribution of $\hat\alpha$ and $\hat T$.

Theorem 1

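Suppose that $p, n \to \infty$ such that $n^{1/2}|p/n - \gamma| \to 0$ where $\gamma \in (0, 1)$. Assume $\alpha > 1 + \gamma^{1/2}$. Let $\rho = \alpha\{1 + \gamma/(\alpha - 1)\}$. Then

$$\Sigma_{\alpha T,n}^{-1/2} \begin{pmatrix} n^{1/2}(\hat\alpha - \rho) \\ \hat T - T \end{pmatrix} \to N(0, I_2) \quad \text{in distribution,} \tag{11}$$

where

$$\Sigma_{\alpha T,n} = \begin{pmatrix} \sigma_{\alpha\alpha,n} & \sigma_{\alpha T,n} \\ \sigma_{\alpha T,n} & \sigma_{TT,n} \end{pmatrix}, \quad \sigma_{\alpha\alpha,n} = \frac{2\alpha\rho}{1 + \dfrac{\alpha\gamma}{(\alpha - 1)^2 - \gamma}}, \quad \sigma_{TT,n} = 2\left(\frac{\alpha^2}{n} + \frac{p - 1}{n}\right),$$

$$\sigma_{\alpha T,n} = n^{-1/2}\, \frac{\alpha\rho\{2 + K(\rho, \gamma)\}}{1 + \dfrac{\alpha\gamma}{(\alpha - 1)^2 - \gamma}},$$

$$K(\rho, \gamma) = \frac{1}{2\pi^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{k_\gamma(x, y)}{(\rho - x)^2}\, dx\, dy. \tag{12}$$

Here $k_\gamma(x, y)$ is a bounded, nonnegative function with support $[(1 - \gamma^{1/2})^2, (1 + \gamma^{1/2})^2] \times [(1 - \gamma^{1/2})^2, (1 + \gamma^{1/2})^2]$.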
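As a consistency check that is not part of the original statement, the first diagonal entry of the covariance matrix in Theorem 1, read here as $2\alpha\rho$ divided by $1 + \alpha\gamma/\{(\alpha - 1)^2 - \gamma\}$, simplifies once $\rho = \alpha(\alpha - 1 + \gamma)/(\alpha - 1)$ is substituted:

$$
\begin{aligned}
1 + \frac{\alpha\gamma}{(\alpha - 1)^2 - \gamma} &= \frac{(\alpha - 1)^2 + \gamma(\alpha - 1)}{(\alpha - 1)^2 - \gamma} = \frac{(\alpha - 1)(\alpha - 1 + \gamma)}{(\alpha - 1)^2 - \gamma}, \\
\sigma_{\alpha\alpha,n} &= 2\alpha \cdot \frac{\alpha(\alpha - 1 + \gamma)}{\alpha - 1} \cdot \frac{(\alpha - 1)^2 - \gamma}{(\alpha - 1)(\alpha - 1 + \gamma)} = 2\alpha^2\left\{1 - \frac{\gamma}{(\alpha - 1)^2}\right\}.
\end{aligned}
$$

The simplified form is positive exactly when $\alpha > 1 + \gamma^{1/2}$, the detectability condition assumed in the theorem, and it reduces to the classical value $2\alpha^2$ as $\gamma \to 0$.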

Corollary 1

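Under the assumptions of Theorem 1, in distribution,

$$\begin{pmatrix} n^{1/2}(\hat\alpha - \rho) \\ \hat T - T \end{pmatrix}^{\mathrm{T}} \Sigma_{\alpha T,n}^{-1} \begin{pmatrix} n^{1/2}(\hat\alpha - \rho) \\ \hat T - T \end{pmatrix} \to \chi_2^2.$$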

Remark 1

The conclusions in Theorem 1 and Corollary 1 remain unchanged even if we replace $\gamma$ by $\hat\gamma = p/n$, and $\alpha$ by $\bar\alpha = (1/2)[1 + \hat\alpha - \hat\gamma + \{(1 + \hat\alpha - \hat\gamma)^2 - 4\hat\alpha\}^{1/2}]$, which is obtained by solving the equation $\hat\alpha = \bar\alpha\{1 + \hat\gamma/(\bar\alpha - 1)\}$.

Remark 2

If $\alpha \gg 1 + \gamma^{1/2}$, the contribution of the term $K(\rho, \gamma)$ in the expression for $\sigma_{\alpha T,n}$ is asymptotically negligible. In this case, $\sigma_{\alpha T,n}$ can be replaced by

$$\sigma_{\alpha T,n} = n^{-1/2}\, 2\alpha\rho\left[1 + \alpha\gamma/\{(\alpha - 1)^2 - \gamma\}\right]^{-1}.$$

We then apply the results of Theorem 1 to the two-class case to calculate the null distribution of our test statistic. Specifically, under H0 in (3), without loss of generality, we assume that the common covariance matrix has eigenvalues $\alpha_0, 1, \ldots, 1$, where $\alpha_0 > 1 + \max\{\gamma_1^{1/2}, \gamma_2^{1/2}\}$, and $\gamma_j = \lim_{n_j \to \infty} p/n_j \in (0, 1)$. Under the alternative, the non-spiked eigenvalues could take values other than 1.

Theorem 2

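Suppose that $p, n_1, n_2 \to \infty$ such that $n_j^{1/2}|p/n_j - \gamma_j| \to 0$ where $\gamma_j \in (0, 1)$, for j = 1, 2. Let $\hat\alpha_{jk}$ denote the k-th largest eigenvalue of the sample covariance matrix of class j, and $\hat T_j = \sum_{k=1}^{p} \hat\alpha_{jk}$. Let $\hat b_\alpha$ be defined by (6). Introduce

$$\Sigma_{Q,n} = \begin{pmatrix} \sigma_{Q\alpha\alpha,n} & \sigma_{Q\alpha T,n} \\ \sigma_{Q\alpha T,n} & \sigma_{QTT,n} \end{pmatrix},$$

with $\sigma_{QTT,n} = 2(1/n_1 + 1/n_2)(\alpha_0^2 + p - 1)$;

$$\sigma_{Q\alpha\alpha,n} = w_2\theta_1^2\left\{\frac{2\alpha_0\rho_1}{1 + \dfrac{\alpha_0\gamma_1}{(\alpha_0 - 1)^2 - \gamma_1}}\right\} + w_1\theta_2^2\left\{\frac{2\alpha_0\rho_2}{1 + \dfrac{\alpha_0\gamma_2}{(\alpha_0 - 1)^2 - \gamma_2}}\right\},$$

where $w_j = n_j/(n_1 + n_2)$, $\rho_j = \alpha_0\{1 + \gamma_j/(\alpha_0 - 1)\}$,

$$\theta_j = 1 + \left(\frac{1}{\gamma_1} + \frac{1}{\gamma_2}\right)^{-1} \frac{(\gamma_1 - \gamma_2)}{(\alpha_0 - 1)^2}\, \frac{\kappa_j}{\gamma_j}, \quad \kappa_j = \frac{1}{2}\left[1 + \frac{\rho_j - 1 - \gamma_j}{\{(1 + \rho_j - \gamma_j)^2 - 4\rho_j\}^{1/2}}\right];$$

and

$$\sigma_{Q\alpha T,n} = n_1^{-1/2} w_2^{1/2}\theta_1\left[\frac{\alpha_0\rho_1\{2 + K(\rho_1, \gamma_1)\}}{1 + \dfrac{\alpha_0\gamma_1}{(\alpha_0 - 1)^2 - \gamma_1}}\right] + n_2^{-1/2} w_1^{1/2}\theta_2\left[\frac{\alpha_0\rho_2\{2 + K(\rho_2, \gamma_2)\}}{1 + \dfrac{\alpha_0\gamma_2}{(\alpha_0 - 1)^2 - \gamma_2}}\right],$$

where $K(\rho, \gamma)$ is as in (12). Then, in distribution,

$$\Sigma_{Q,n}^{-1/2} \begin{pmatrix} \left(\dfrac{n_1 n_2}{n_1 + n_2}\right)^{1/2} (\hat\alpha_{11} - \hat\alpha_{21} - \hat b_\alpha) \\ \hat T_1 - \hat T_2 \end{pmatrix} \to N(0, I_2). \tag{13}$$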
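One way to read the constant κ_j in Theorem 2 is as the derivative, evaluated at ρ_j, of the method of moments map that sends a sample eigenvalue to its bias-corrected value. This is the factor a delta-method argument would produce. The sketch below checks this reading numerically, taking κ_j as one half of 1 + (ρ_j − 1 − γ_j) / {(1 + ρ_j − γ_j)² − 4ρ_j}^{1/2}. That reading of the formula is an assumption of this example, as are the function names and the illustrative values.

```python
import math

def mom_eigenvalue(alpha_hat, gamma):
    """Larger root of alpha_hat = a * (1 + gamma / (a - 1)), assuming a real root exists."""
    b = 1.0 + alpha_hat - gamma
    return 0.5 * (b + math.sqrt(b * b - 4.0 * alpha_hat))

def kappa(rho, gamma):
    """kappa_j as read from Theorem 2, evaluated at (rho_j, gamma_j)."""
    root = math.sqrt((1.0 + rho - gamma) ** 2 - 4.0 * rho)
    return 0.5 * (1.0 + (rho - 1.0 - gamma) / root)

alpha0, gamma = 5.0, 0.5                       # illustrative values
rho = alpha0 * (1.0 + gamma / (alpha0 - 1.0))  # limit of the sample eigenvalue

# Central finite difference of the method of moments map at rho.
step = 1e-6
finite_diff = (mom_eigenvalue(rho + step, gamma) - mom_eigenvalue(rho - step, gamma)) / (2 * step)
print(kappa(rho, gamma), finite_diff)
```

Both numbers agree with 1 / {1 − γ/(α0 − 1)²}, the reciprocal of the derivative of ρ with respect to the population eigenvalue.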

Remark 3

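In Theorem 2, we can replace $\gamma_j$ by $\hat\gamma_j = p/n_j$, and $\alpha_0$ by $\bar\alpha_0$ defined through (5), without altering the conclusions.

Remark 4

The statements of both theorems remain valid even if $\gamma_j \in [1, \infty)$ for j = 1, 2, though the proofs change slightly. Moreover, the conclusions of Theorem 2 continue to hold even when $\gamma_j = 0$, j = 1, 2, with $\rho_j = 0$ and the terms $K(\rho_j, \gamma_j)$ absent from the expressions.

Remark 5

If $\alpha_{j1} \to \infty$, but $\alpha_{j1} = o(p)$, for j = 1, 2, both theorems hold.

Remark 6

If $\alpha_0 \gg 1 + \max(\gamma_1^{1/2}, \gamma_2^{1/2})$, the contribution of the terms $K(\rho_j, \gamma_j)$ (j = 1, 2) in the expression for $\sigma_{Q\alpha T,n}$ is asymptotically negligible. In this case, we replace $\sigma_{Q\alpha T,n}$ by

$$\sigma_{Q\alpha T,n} = n_1^{-1/2} w_2^{1/2}\theta_1\, \frac{2\alpha_0\rho_1}{1 + \dfrac{\alpha_0\gamma_1}{(\alpha_0 - 1)^2 - \gamma_1}} + n_2^{-1/2} w_1^{1/2}\theta_2\, \frac{2\alpha_0\rho_2}{1 + \dfrac{\alpha_0\gamma_2}{(\alpha_0 - 1)^2 - \gamma_2}}.$$

This is the expression used in defining the test statistic $Q^{\mathrm{T}} \hat\Sigma_Q^{-1} Q$.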

Remark 7

Both the theorems can be easily extended to cases with multiple spiked eigenvalues. See the Supplementary Material for details.

The proof of Theorem 1 uses the asymptotic expansions of the leading sample eigenvalues in Paul (2007) and the behavior of linear spectral statistics of sample covariance matrices described in Bai & Silverstein (2010). Theorem 2 follows from this and an application of the delta method.

5. Simulations

In this section, we describe simulations investigating the Type-1 error and power of the proposed test and the tests of Schott (2007) and Srivastava & Yanagihara (2010).

We consider three different sets of covariance structures. For each set, we use the same baseline covariance matrix $\Sigma_1$ and introduce different perturbations to generate $\Sigma_2$. We define $\Sigma_1$ according to the biological model in Section 2. To simulate data with p genes, we set $\Sigma_1 = \sigma_a^2 h h^{\mathrm{T}} + I$, with $\sigma_a^2 = 35p^{-1/2}$ and $h_{p \times 1} = \{-0.5, 1/(p - 1) - 0.5, \ldots, (p - 2)/(p - 1) - 0.5, 0.5\}$. In $\Sigma_1$, $\sigma_a^2 h h^{\mathrm{T}}$ represents the variability due to pathway activities, and I represents the unordered, noisy component of pathway gene variance. In the first perturbation, which we call the added noise setting, we let $\Sigma_2 = \Sigma_1 + 0.2I$, so gene expression is subject to broader disorder in the second class. In the second perturbation, the lost co-regulation setting, we simulate pathway dysregulation by letting $\Sigma_2 = 0.7\Sigma_1 + 0.3\,\mathrm{diag}(\Sigma_1)$ so that overall variability is unchanged but less well-ordered. This perturbation substantially decreases the first eigenvalue while leaving the trace unchanged. In real data, a change like this could arise from deactivation of pathway regulatory elements like transcription factors. In the third perturbation, the additional biological process setting, we let $\Sigma_2 = \Sigma_1 + g g^{\mathrm{T}}$, where $g = \{g_i\}$ is defined as $g_i = 0.75$ for $i \in \{1, \ldots, 0.4p\}$ and $g_i = 0$ otherwise. In this setting, 40% of the genes in the pathway participate in a secondary biological process represented by the $g g^{\mathrm{T}}$ component.
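The covariance matrices of this simulation design can be reconstructed as follows. This is a sketch of the design as described in the text, not the authors' simulation code; the function names are hypothetical, and rounding 0.4p to an integer is an assumption needed to index the vector g.

```python
import numpy as np

def baseline_covariance(p):
    """Sigma_1 = sigma_a^2 h h^T + I with h running from -0.5 to 0.5 in equal steps."""
    h = np.arange(p) / (p - 1) - 0.5
    sigma_a2 = 35 / np.sqrt(p)
    return sigma_a2 * np.outer(h, h) + np.eye(p)

def perturbed_covariances(sigma1):
    """The three Sigma_2 settings described in the text."""
    p = sigma1.shape[0]
    g = np.zeros(p)
    g[: int(round(0.4 * p))] = 0.75  # first 40% of genes join a second process
    return {
        "added noise": sigma1 + 0.2 * np.eye(p),
        "lost co-regulation": 0.7 * sigma1 + 0.3 * np.diag(np.diag(sigma1)),
        "additional biological process": sigma1 + np.outer(g, g),
    }

# Leading eigenvalue of Sigma_1 for each dimension used in the simulations.
first_eigenvalues = {
    p: float(np.linalg.eigvalsh(baseline_covariance(p))[-1]) for p in (20, 50, 100)
}
print(first_eigenvalues)

sigma1 = baseline_covariance(20)
settings = perturbed_covariances(sigma1)
```

The leading eigenvalues computed this way round to 15.4, 22.5 and 30.8, the values the text reports for p = 20, 50 and 100, and the lost co-regulation matrix has the same trace as the baseline but a smaller leading eigenvalue.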

We consider p = 20, 50 and 100. The corresponding first eigenvalues of $\Sigma_1$ under the three dimensions are 15.4, 22.5 and 30.8, respectively. For each p, we consider sample sizes $n_1 \in \{20, 30, 50, 75, 100, 130\}$ and $n_2 = 0.66 n_1$. For each (p, n) and $(\Sigma_1, \Sigma_2)$, we simulate 10,000 pairs of multivariate normal datasets and apply the proposed test as well as the methods of Schott (2007) and Srivastava & Yanagihara (2010) to test the differences between the two covariance matrices. We apply the robust version of the test described in Section 3·2 for the added noise and the lost co-regulation settings, and we apply the multiple-spike version described in the Supplementary Material with M = 2 for the additional biological process setting. Under all three settings, we preprocess the data using the normalization scheme described in the Supplementary Material and derive the p-values according to the theoretical χ2 distributions. Additionally, we examine the tests’ Type-1 error rates in these settings by defining $\Sigma_0 = n_1/(n_1 + n_2)\,\Sigma_1 + n_2/(n_1 + n_2)\,\Sigma_2$, generating datasets of size $n_1$ and $n_2$ from $\Sigma_0$, and running the tests on these null datasets.

Fig. 1 displays the results of these simulations. The first row of plots displays Type-1 error rates of the three methods. The method of Schott (2007) is conservative, the method of Srivastava & Yanagihara (2010) is liberal, and the proposed test has the most accurate levels under all settings. The second row of plots displays powers of the three tests based on theoretical null distributions. The proposed test outperforms the others in the added noise and lost co-regulation settings and is competitive in the additional biological process setting. In the third row of plots, instead of using theoretical approximations to determine each test statistic’s threshold for significance, we compute adjusted power as the percentage of test statistics under the alternative hypothesis exceeding the 0.95 quantile of the empirical null distribution of the test statistics. In this way, the Type-1 errors of all tests are controlled at 0.05, so the power comparison is fairer and more direct. The proposed test easily outperforms the other two in terms of adjusted power under all settings and all n, p combinations.
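The adjusted power calculation can be sketched as below, taking the threshold to be the upper 5% point of the empirical null distribution so that each test rejects for large statistics at a nominal level of 0.05. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def adjusted_power(null_stats, alt_stats, level=0.05):
    """Share of alternative-hypothesis statistics above the empirical null threshold."""
    threshold = np.quantile(np.asarray(null_stats, dtype=float), 1.0 - level)
    return float(np.mean(np.asarray(alt_stats, dtype=float) > threshold))

# Toy inputs: null statistics 0..99, alternative statistics 50..149.
power = adjusted_power(np.arange(100), np.arange(50, 150))
print(power)
```

Because the threshold comes from each test's own simulated null statistics, a conservative or liberal theoretical approximation no longer affects the comparison.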

Fig. 1. Performance of the proposed test (black lines), the method of Schott (2007) (dark grey lines) and the method of Srivastava & Yanagihara (2010) (light grey lines). Solid, dashed and dashed/dotted lines display results under p = 20, 50 and 100, respectively.

The proposed test nearly dominates the methods of Schott (2007) and Srivastava & Yanagihara (2010) in these simulations. In other simulations, we found that the methods of Schott (2007) and Srivastava & Yanagihara (2010) perform well in cases where single elements of the covariance matrix differ substantially between classes. However, changes in the biological quantities we are interested in will most often manifest as widespread, small differences in the covariance matrix, a setting which these earlier methods are not optimized to detect.

We also evaluate the effects of various departures from model (1) on the performance of the proposed test through simulations. Specifically, we consider the effects of variability in the unspiked eigenvalues, unequal error variances, non-normality of the data and multiple spiked eigenvalues. We find that the proposed test is robust to all these departures except for non-normality of the data. Thus we recommend the permutation test in highly kurtotic data.

6. Application to a breast cancer dataset

We apply the proposed test to a breast cancer gene expression dataset (Loi et al., 2007), which has microarray measurements on breast tumor samples from 277 patients treated with tamoxifen and 137 untreated patients. The aim is to identify regulation patterns that differ between patients with and without tamoxifen treatment. We normalized all observations to have equal median and median absolute deviation. Outliers can drive the first eigenvalue of a dataset, destroying its interpretation under our biological model. We therefore truncated each gene’s data in each class at four standard deviations from its mean. This rule truncated 6.4% of the data.
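The truncation rule can be sketched as below, applied separately to each class. The sketch reads "truncated" as clipping values to the boundary rather than discarding them, which is an assumption, and the function name and toy data are hypothetical.

```python
import numpy as np

def truncate_outliers(y, n_sd=4.0):
    """Clip each gene (column) to its mean +/- n_sd standard deviations within one class."""
    mean = y.mean(axis=0)
    sd = y.std(axis=0, ddof=1)
    return np.clip(y, mean - n_sd * sd, mean + n_sd * sd)

# Toy class with 50 samples and 2 genes; one extreme value in the first gene.
y = np.zeros((50, 2))
y[0, 0] = 100.0
clipped = truncate_outliers(y)
print(clipped[0, 0])
```

In the toy data the first gene has mean 2 and sample variance 200, so the outlier is pulled back to 2 + 4 × 200^{1/2}, roughly 58.6.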

Curated databases of gene relationships like KEGG (Kanehisa & Goto, 2000), Reactome (Matthews et al., 2009), and Biocarta (Nishimura, 2001) often build pathways from genes involved in distantly related biological processes. Consequently, these curated pathways tend to be subject to complex co-regulation better described with network estimation tools (Peng et al., 2009; Danaher et al., 2014) than with this paper’s biological model. In lieu of KEGG pathways, we sought sets of genes that could be expected to exhibit the tight co-regulation implied by our model. Cheng et al. (2013a) identified attractor metagenes, sets of genes that tended to cluster together across multiple breast cancer gene expression datasets. We expected that genes clustered together across datasets would often share a biological function, and examination of Cheng et al. (2013a)’s metagenes confirmed this hypothesis. For example, the ID55 metagene contains exclusively histone genes; and the ID88 metagene contains several genes from the cytochrome P450 family, and, intriguingly, ESR1, one of the most-studied genes in breast cancer. The biological relevance of these attractor metagenes was further demonstrated by Cheng et al. (2013b), who used attractor metagenes to inform a successful breast cancer prognostic algorithm. Given their biological meaning and apparent consistency with our biological model, we took these metagenes as the basic units of our analysis, and we ran our method and a traditional gene set analysis (Efron & Tibshirani, 2007) on every metagene with more than 5 genes.

Table 1 displays selected results; the Supplementary Material has complete results. A 2.67GHz laptop took 11 minutes to compute p-values for the 24 metagenes analyzed using 10000 permutations. The proposed biological model and test revealed a rich picture of changes in co-expression far beyond what traditional Gene Set Analysis provided. The ID88 metagene has higher total variance but a lower first eigenvalue under tamoxifen. This pattern of increased noise and decreased variability due to pathway activity strongly suggests pathway dysregulation. The histone metagene, ID55, saw increases in both overall variability and its first eigenvalue under tamoxifen, suggesting more dynamic histone activity levels in the tamoxifen group. Histones are central to cancer proliferation; this result could be explained by patients heterogeneously responding to the drug. The mesenchymal transition attractor metagene followed a similar pattern, with increased variability under tamoxifen almost entirely due to an increased first eigenvalue.

Table 1. Eigenvalue statistics and p-values calculated using the proposed test and Gene Set Analysis. Theoretical and permutation-based p-values from the proposed test are under pχ2 and pperm, respectively, and p-values from Gene Set Analysis are under pGSA.

| Metagene | Size | α0* | α1* | T0* | T1* | pχ2 | pperm | pGSA |
|---|---|---|---|---|---|---|---|---|
| ID88 | 15 | 38.43 | 23.38 | 58.26 | 62.82 | 0.000 | 0.000 | 0.010 |
| ID55 | 18 | 119.67 | 139.97 | 156.72 | 192.28 | 0.000 | 0.000 | 0.240 |
| MTA** | 19 | 127.17 | 167.85 | 157.72 | 205.85 | 0.007 | 0.000 | 0.440 |

* α0 and T0 are for the untreated group, while α1 and T1 are for the TAM treated group.

** MTA stands for Mesenchymal Transition Attractor.

The p-values returned by the χ2 approximation and the permutation test generally tracked each other, with a Spearman correlation between them of 0.88. However, the permutation test returned uniformly higher p-values than the purely theoretical test, and one metagene, ID79, showed a markedly increased p-value under the permutation test. The liberal χ2 p-values appear to be driven by excessively kurtotic data, and they suggest the use of the permutation test over the χ2 approximation in highly kurtotic data.

7. Discussion

The proposed test is a powerful complement to traditional, marginal effects-based analyses like gene set analysis or tests comparing overall pathway activity levels. Given the high dimensionality and complex behavior of biological pathways, it seems appropriate to apply analyses focused on varied aspects of pathway behavior. A complete analysis of a pathway would include a summary of single-gene behavior, a comparison of overall pathway activity levels between disease states (Lee et al., 2008), a test for changes in covariance structure like the method proposed here, and ideally several other analyses yet to be discovered.

While the proposed test is motivated by the biological model (1), it can be applied to the broad class of pathways for which the first eigenvalue is spiked and reflects variability due to heterogeneous pathway activity levels. Nevertheless, not every gene set adheres to these assumptions. For example, many of the larger KEGG pathways contain genes too distally related to show discernible co-regulation. Our test is better applied to gene sets very likely to experience co-regulation, for example more narrowly-defined KEGG pathways and data-derived gene sets like the attractor metagenes of Cheng et al. (2013a) and the cancer signatures of Wolf et al. (2014). These data-derived gene sets are often highly biologically interpretable, and they have been shown to predict patient outcomes (Cheng et al., 2013b; Clarke et al., 2013; Wolf et al., 2014). It is possible to check a gene set’s suitability for analysis with the proposed test by comparing the prominence of its first eigenvalue ($\hat\alpha/\hat T$) to the $\hat\alpha/\hat T$ of random gene sets. Various biological and technical variables will induce eigenstructure in sets of unrelated genes. If a gene set’s first eigenvalue is more prominent than seen in random gene sets, the gene set is likely experiencing co-regulation. When the values of these technical, e.g. reagent lot, or biological, e.g. cancer subtype, variables are known, it is possible to scrub their influence from the data by regressing each gene on these variables and performing the proposed test on the residuals.
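The suitability check described above can be sketched as follows. The data are synthetic, with one planted co-regulated set among otherwise unrelated genes. The function names, the number of random sets and the planted effect size are all assumptions made for illustration.

```python
import numpy as np

def prominence(y):
    """Leading sample eigenvalue divided by the trace of the sample covariance."""
    ev = np.linalg.eigvalsh(np.cov(y, rowvar=False))
    return ev[-1] / ev.sum()

def random_set_prominence(expr, set_size, n_sets=200, seed=0):
    """Prominence of randomly drawn gene sets of the same size."""
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[1]
    return np.array([
        prominence(expr[:, rng.choice(n_genes, size=set_size, replace=False)])
        for _ in range(n_sets)
    ])

# Synthetic expression matrix: 100 samples, 500 genes of pure noise.
rng = np.random.default_rng(1)
expr = rng.standard_normal((100, 500))
# Plant a shared activity signal in the first 15 genes.
activity = rng.standard_normal((100, 1))
expr[:, :15] += 2.0 * activity

observed = prominence(expr[:, :15])
reference = random_set_prominence(expr, set_size=15)
fraction_above = float(np.mean(reference >= observed))
print(observed, fraction_above)
```

A small value of `fraction_above` indicates that the gene set's first eigenvalue is more prominent than in random sets of the same size, the situation in which the text suggests the gene set is likely experiencing co-regulation.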

A useful extension of this work would be the development of tests for differences in more targeted quantities than the somewhat broad $(\hat\alpha, \hat T)$. For example, a test for changes in $(\hat T - \hat\alpha)$ could be considered to directly look for increased dysregulation, or non-co-regulated variability, between classes. The asymptotic normality of $\hat T$ and $\hat\alpha$ would make these tests simple to derive.

An approach to this problem rooted in factor analysis could also be productive, although the factor analysis literature lacks the results for high-dimensional data that enabled our approach.

SETPath, an R package implementing the test, is on CRAN (R Core Team, 2013).

Acknowledgments

Grants from the National Science Foundation and the National Institutes of Health supported this research. PD primarily worked on this method while part of the Department of Biostatistics at the University of Washington. Reviewers provided valuable input.

Footnotes

Supplementary material

Supplementary material available at Biometrika online provides a description of a normalization scheme, outlines of proofs of Theorems 1 and 2, a derivation of our test in the setting without spiked eigenvalues, simulations investigating the consequences of departures from our assumptions, and a table containing the full results of the breast cancer expression data analysis.

Contributor Information

P. DANAHER, Email: pdanaher@nanostring.com, NanoString Technologies, 530 Fairview Ave. N, Seattle, Washington 98109, U.S.A

D. PAUL, Email: debpaul@ucdavis.edu, Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, U.S.A

P. WANG, Email: pei.wang@mssm.edu, Icahn Institute of Genomics and Multiscale Biology, Icahn Medical School at Mount Sinai, 1470 Madison Avenue, S8-102 New York, New York, 10029, U.S.A

References

  1. Bai ZD, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Springer; 2010.
  2. Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis. 2006;97:1382–1408.
  3. Bair E, Hastie TJ, Paul D, Tibshirani RJ. Prediction by supervised principal components. Journal of the American Statistical Association. 2006;101:119–137.
  4. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2005;439:353–357. doi: 10.1038/nature04296.
  5. Chen X, Wang L, Smith JD, Zhang B. Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics. 2008;24:2474–2481. doi: 10.1093/bioinformatics/btn458.
  6. Cheng WY, Yang THO, Anastassiou D. Biomolecular events in cancer revealed by attractor metagenes. PLoS Computational Biology. 2013a;9:e1002920. doi: 10.1371/journal.pcbi.1002920.
  7. Cheng WY, Yang THO, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Science Translational Medicine. 2013b;5:181ra50. doi: 10.1126/scitranslmed.3005974.
  8. Clarke C, Madden SF, Doolan P, Aherne ST, Joyce H, O’Driscoll L, Gallagher WM, Hennessy BT, Moriarty M, Crown J, et al. Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis. 2013;34:2300–2308. doi: 10.1093/carcin/bgt208.
  9. Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76:373–397. doi: 10.1111/rssb.12033.
  10. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proceedings of the National Academy of Sciences. 2013;110:6388–6393. doi: 10.1073/pnas.1219651110.
  11. Efron B, Tibshirani RJ. On testing the significance of sets of genes. The Annals of Applied Statistics. 2007;1:107–129.
  12. Johansson K. Shape fluctuations and random matrices. Communications in Mathematical Physics. 2000;209:437–476.
  13. Johnson D, Graybill F. An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association. 1972;67:862–868.
  14. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics. 2001;29:295–327.
  15. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  16. Kritchman S, Nadler B. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems. 2008;94:19–32.
  17. Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Computational Biology. 2008;4:e1000217. doi: 10.1371/journal.pcbi.1000217.
  18. Li J, Chen S. Two sample tests for high-dimensional covariance matrices. The Annals of Statistics. 2012;40:908–940.
  19. Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, et al. Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade. Journal of Clinical Oncology. 2007;25:1239–1246. doi: 10.1200/JCO.2006.07.1522.
  20. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Research. 2009;37:D617–D622. doi: 10.1093/nar/gkn863.
  21. Meade AW, Bauer DJ. Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling. 2007;14:611–635.
  22. Nadler B. Finite sample approximation results for principal component analysis: a matrix perturbation approach. Annals of Statistics. 2008;36:2791–2817.
  23. Nadler B. On the distribution of the ratio of the largest eigenvalue to the trace of a Wishart matrix. Journal of Multivariate Analysis. 2011;102:363–371.
  24. Nishimura D. Biocarta. Biotech Software and Internet Report. 2001;2:117–120.
  25. Onatski A. Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics. 2012;168:244–258.
  26. Paul D. Asymptotics of sample eigenstructure for a large dimension spiked covariance model. Statistica Sinica. 2007;17:1617–1642.
  27. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression model. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
  28. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2013.
  29. Roy S. On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics. 1953;24:220–238.
  30. Schott J. A test for the equality of covariance matrices when the dimension is large relative to the sample size. Computational Statistics and Data Analysis. 2007;51:6535–6542.
  31. Srivastava M, Yanagihara H. Testing the equality of several covariance matrices with fewer observations than the dimension. Journal of Multivariate Analysis. 2010;101:1319–1329.
  32. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
  33. Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6:225. doi: 10.1186/1471-2105-6-225.
  34. Wolf DM, Lenburg ME, Yau C, Boudreau A, van ’t Veer LJ. Gene co-expression modules as clinically relevant hallmarks of breast cancer diversity. PLoS One. 2014;9:e88309. doi: 10.1371/journal.pone.0088309.
