Covariance-based analyses of biological pathways

P DANAHER; D PAUL; P WANG

doi:10.1093/biomet/asv013

Author manuscript; available in PMC: 2015 Sep 24.

Published in final edited form as: Biometrika. 2015 Jun 19;102(3):533–544. doi: 10.1093/biomet/asv013


PMCID: PMC4581526 NIHMSID: NIHMS685282 PMID: 26412865

Summary

The use of high-throughput data to study the changing behavior of biological pathways has focused mainly on examining the changes in the means of pathway genes. In this paper, we propose instead to test for changes in the co-regulated and unregulated variability of pathway genes. We assume that the eigenvalues of previously defined pathways capture biologically relevant quantities, and we develop a test for biologically meaningful changes in the eigenvalues between classes. This test reflects important and often ignored aspects of pathway behavior and provides a useful complement to traditional pathway analyses.

Keywords: gene expression, pathway analysis, spiked eigenvalue

1. Introduction

A pathway refers to a set of genes or proteins jointly participating in a biological process. It is of great interest to study the behavior of pathways using high-throughput -omics data. By treating a pre-defined set of genes with shared biological function as an analytical unit, pathway-level analyses efficiently exploit prior biological knowledge, improve interpretability, and enjoy greater power by combining the signals of individual genes. Existing pathway analysis methods have focused almost exclusively on the marginal behavior of pathway genes. For example, Tomfohr et al. (2005), Lee et al. (2008) and Drier et al. (2013) suggested synthesizing the information in pathway genes into measures of pathway activity, while there is a large body of work, including Subramanian et al. (2005) and Efron & Tibshirani (2007), aimed at identifying pathways that are enriched with differentially expressed genes. These marginal analyses, while shedding light on important questions, fail to capture the full complexity of pathway behavior.

In this paper, we propose a test to examine the joint behavior of pathway genes. This test complements the marginal analyses mentioned above and helps to provide a more comprehensive understanding of biological pathways. Specifically, we consider the problem of detecting differences in covariance among a pre-defined set of pathway genes between two classes of samples, for example between two different cancer subtypes. Tests of equality of two covariance matrices are well studied. However, the number of genes in a pathway ranges from tens to hundreds, and quite often exceeds the sample size. Under such regimes, classical tests for equality of covariance matrices no longer apply. A number of authors (Schott, 2007; Srivastava & Yanagihara, 2010; Li & Chen, 2012) have developed tests for equality of covariance matrices in the high dimension, low sample size setting. However, by testing the null hypothesis that two covariance matrices are exactly equal, without accounting for the structure induced by pathway activity, these tests provide inadequate biological insight: their rejection of the null hypothesis allows no conclusions about how pathway gene behavior differs between classes. In contrast, the proposed test is motivated from a biological model of the expression of pathway genes and focused on quantities with natural biological interpretations. The novelty of this test lies in its focus on the joint rather than the marginal behavior of pathway genes and in its consideration of disordered variability orthogonal to the effects of pathway activity.

Our biological model assumes that genes’ associations with pathway activity drive the leading eigenvector of their covariance matrix. This model suggests that the first eigenvector is invariant to changes in biological conditions, while the leading and remaining eigenvalues will vary across data sets in response to within-population variability of pathway activity and variability due to other, unregulated causes, respectively. Under this model, the covariance matrix of the expression levels of pathway genes has a spiked eigen-structure (Johnstone, 2001; Baik & Silverstein, 2006; Paul, 2007), and the leading eigenvalue and the trace of the covariance matrix provide a parsimonious and biologically relevant summary of pathway genes’ joint behavior. Baik & Silverstein (2006) showed that if the dimension-to-sample size ratio converges to a nonzero finite constant, and if the true spiked eigenvalues exceed a threshold, the corresponding sample eigenvalues converge with probability one to limits that depend on the true eigenvalues and the dimension-to-sample size ratio. Paul (2007) proved asymptotic normality of the leading sample eigenvalues under the same framework. We extend the latter asymptotic results to design a χ² test statistic based on the joint behavior of the leading eigenvalue and the trace of the sample covariance matrices of the two classes. When the proposed test rejects the null, it indicates that specific, biologically-relevant quantities differ between classes. Simulations suggest that if the spiked covariance structure holds even approximately, the proposed test has better power to detect differences in biologically important functions of the eigenvalues than existing tests.

2. A model of co-expression in biological pathways

First, consider data from only one class. Denote the gene expression data for a previously defined pathway with p genes from n observations by the n × p matrix Y, and denote the data vector of the i-th observation by y_i = (y_{i1},…, y_{ip}). We assume that pathway activity is the primary driver of pathway gene expression. For example, for a set of genes regulated by a common transcription factor, the primary source of variance in the expression levels of the entire gene set would be changes in the activity level of the transcription factor, and it would be reasonable to specify these relationships through a linear dependence model. We write

y_{i k} = μ_{k} + h_{k} a_{i} + ε_{i k}, i = 1, \dots, n, k = 1, \dots, p,

(1)

where μ_k is an intercept specific to gene k that can be ignored for our purposes, a_i is a random variable with mean 0 and variance $σ_{a}^{2}$ , h = (h₁,…, h_p) are gene-specific scaling coefficients, the entries of ε_i = (ε_{i1},…, ε_{ip}) are independent random variables with mean 0 and variance $σ_{ε}^{2}$ , and ε₁,…, ε_n, a₁,…, a_n are mutually independent. We add the constraint $‖ h ‖_{2}^{2} = 1$ to make h and $σ_{a}^{2}$ identifiable. In this model, a_i reflects the level of pathway activity, e.g. the transcription factor level, in the i-th sample, $σ_{a}^{2}$ drives the well-ordered, co-regulated component of total gene variance, and $σ_{ε}^{2}$ measures the unordered, noisy component of pathway gene variance. It follows that

cov (y_{i}) = cov (a_{i} h + ε_{i}) = σ_{a}^{2} h h^{T} + σ_{ε}^{2} I .

(2)

The first eigenvalue of cov(y_i) is $σ_{a}^{2} + σ_{ε}^{2}$ , and the remaining eigenvalues equal $σ_{ε}^{2}$ . This observation allows biological interpretations to be assigned to the eigenvalues of the pathway’s covariance matrix: the first eigenvalue measures the variability in pathway genes due to changing levels of pathway activity, i.e., the well-ordered, co-regulated component of pathway gene variability, while the sum of the remaining eigenvalues captures the unordered, chaotic component of variability. This interpretation echoes Tomfohr et al. (2005), Bild et al. (2005), Bair et al. (2006) and Chen et al. (2008) in implying that an observation’s projection onto the first eigenvector of the pathway’s covariance matrix measures the observation’s pathway activity level. Moreover, the covariance structure in (2) matches the spiked eigenvalue model of Baik & Silverstein (2006), Paul (2007), Nadler (2008), Onatski (2012) and others. This implication of the model holds nearly universally in pathway data. Thus, one can take advantage of the asymptotic theories under the spiked eigenvalue model to perform inference on $σ_{a}^{2}$ and $σ_{ε}^{2}$ .
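The eigenvalue claim above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `spiked_covariance` and the numerical values of $σ_{a}^{2}$, $σ_{ε}^{2}$ and h are hypothetical and chosen only for illustration.

```python
import numpy as np

def spiked_covariance(h, sigma_a2, sigma_e2):
    """Population covariance of equation (2): sigma_a2 * h h^T + sigma_e2 * I."""
    h = np.asarray(h, dtype=float)
    h = h / np.linalg.norm(h)  # impose the identifiability constraint ||h||_2 = 1
    return sigma_a2 * np.outer(h, h) + sigma_e2 * np.eye(h.size), h

# Hypothetical values, chosen only for illustration.
sigma_a2, sigma_e2 = 4.0, 1.5
cov, h = spiked_covariance(np.arange(1.0, 7.0), sigma_a2, sigma_e2)

eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
leading_value = eigvals[-1]             # sigma_a2 + sigma_e2
noise_values = eigvals[:-1]             # every remaining eigenvalue is sigma_e2
leading_vector = eigvecs[:, -1]         # equals h up to sign

print(leading_value, noise_values)
```

In this toy case the leading eigenvector coincides with h up to sign, which is the sense in which projecting onto it recovers the pathway activity level.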

In data sets with two classes of samples, we may wish to compare these biologically meaningful quantities between classes. We therefore propose to test the null hypothesis

H_{0} : (α_{1, 1}, T_{1}) = (α_{2, 1}, T_{2}),

(3)

where α_{j,k} denotes the k-th population eigenvalue of class j and T_j denotes the trace of the covariance matrix of class j. The interpretation of the pair (α_{j,1}, T_j) gives the alternatives to H₀ specific and useful biological meaning. For example, when α_{1,1} > α_{2,1} and T₁ − α_{1,1} ≈ T₂ − α_{2,1}, we might conclude that the pathway activity level is stronger in the first class, or, when T₁ − α_{1,1} > T₂ − α_{2,1}, we might conclude that the pathway is dysregulated and subject to greater noise in the first class. By directly testing H₀ rather than the stronger null hypothesis of equality of the entire covariance matrices, we maximize power to detect changes in the modes of variability attributable to pathway activity and to noisy, non-co-regulated causes, while detecting other changes in the covariance matrix only insofar as they change our eigenvalue statistics. In particular, we ignore the eigenvectors and redistribution of weights among smaller eigenvalues. The biological model specified in (1) and (2) can also be seen as a factor analysis model with one factor and a specialized covariance structure for the idiosyncratic term ε_{ik}, even though our hypotheses and test procedure differ from the commonly used tests for factorial invariance (Meade & Bauer, 2007).

Model (1) suggests two general features of pathway data: (a) the eigenvalues of the covariance matrix resemble the spiked model; (b) the leading eigenvectors capture the effects of pathway activities on gene expression, or equivalently, the leading spiked eigenvalues capture variability in the data due to changes in pathway activity. Both features apply for a large set of pathways even when model (1) does not hold. The statistical theory behind our test relies on (a), and the biological interpretation of our test is based on (b). The Supplementary Material contains extensive empirical investigations supporting (a) and (b).

In the next sections, we introduce a test for H₀ in (3), assuming (a) and (b) hold. Moreover, in many cases, pathway genes are subject to multiple biological processes. When this occurs the covariance matrix has additional spikes, i.e., more eigenvalues become significantly larger than the noise eigenvalues. The proposed test also accommodates these scenarios.

3. A test for differences in the eigenstructure of Σ₁ and Σ₂

3·1. The single spiked eigenvalue setting

Denote the eigenvalues and trace of the sample covariance matrices by α̂_{j,i} and T̂_j (j = 1, 2; i = 1,…, p). To test H₀ in (3), a natural choice is to form a test statistic using α̂_{1,1} − α̂_{2,1} and T̂₁ − T̂₂. We use a quadratic form to combine the information in these quantities.

Under H₀ in (3), without loss of generality, we assume that σ_{ε,1} = σ_{ε,2} = 1, or equivalently, that the unspiked eigenvalues of the common covariance matrix are all equal to 1: α_{j,2} = ··· = α_{j,p} = 1 (j = 1, 2). To adhere to this assumption, we normalize the data as follows. We calculate a scale factor equal to the square root of the median eigenvalue of the pooled sample covariance matrix from both classes and divide all the observations by this factor; see the Supplementary Material.
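The normalization step can be sketched as follows, assuming NumPy. The pooling rule used here (class-wise centering with an n₁ + n₂ − 2 denominator) and the function names are assumptions made for illustration; the Supplementary Material gives the authors' exact scheme. Note that the sample covariance has rank at most n₁ + n₂ − 2, so when p is much larger than the total sample size the median eigenvalue can be zero and this simple version would not apply.

```python
import numpy as np

def pooled_covariance(y1, y2):
    """Pooled sample covariance after centering each class separately (an assumed pooling rule)."""
    c1 = y1 - y1.mean(axis=0)
    c2 = y2 - y2.mean(axis=0)
    return (c1.T @ c1 + c2.T @ c2) / (len(y1) + len(y2) - 2)

def normalize_classes(y1, y2):
    """Divide all observations by the square root of the median pooled eigenvalue."""
    scale = np.sqrt(np.median(np.linalg.eigvalsh(pooled_covariance(y1, y2))))
    return y1 / scale, y2 / scale, scale

# Toy data with noise standard deviation 3, for illustration only.
rng = np.random.default_rng(0)
y1 = 3.0 * rng.standard_normal((60, 10))
y2 = 3.0 * rng.standard_normal((40, 10))
z1, z2, scale = normalize_classes(y1, y2)
```

After rescaling, the median eigenvalue of the pooled covariance of `z1` and `z2` equals one by construction, which is what the unit-noise assumption requires.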

For notational convenience, in the rest of this subsection we use α_j and α̂_j to mean α_{j,1} and α̂_{j,1}, respectively. According to Baik & Silverstein (2006), the first sample eigenvalue is a biased estimate of its population counterpart: α̂_j → α_j + γ_jα_j/(α_j − 1) as p, n_j → ∞ with p/n_j → γ_j ∈ (0, ∞) and $α_{j} > 1 + γ_{j}^{1 / 2}$ (j = 1, 2). Define b_α = (γ₁ − γ₂)α₀/(α₀ − 1), where α₀ is the first eigenvalue shared by both classes under H₀ and satisfies $α_{0} > 1 + \max (γ_{1}^{1 / 2}, γ_{2}^{1 / 2})$ . Then under H₀, (α̂₁ − α̂₂) → b_α almost surely. The limit b_α is nonzero whenever γ₁ ≠ γ₂. To test H₀, we focus on the bias-corrected quantity (α̂₁ − α̂₂ − b̂_α) and propose the test statistic $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ , where

Q = {(Q_{α}, Q_{T})}^{T} = {({\hat{α}}_{1} - {\hat{α}}_{2} - {\hat{b}}_{α}, {\hat{T}}_{1} - {\hat{T}}_{2})}^{T},

(4)

and b̂_α and Σ̂_Q are appropriate consistent estimates for b_α and Σ_Q = cov(Q), respectively.

We now describe the construction of b̂_α. We first propose the following estimator for α₀,

{\bar{α}}_{0} = (w_{1} {\bar{α}}_{1} + w_{2} {\bar{α}}_{2}), w_{j} = n_{j} / (n_{1} + n_{2}),

(5)

where ᾱ_j is the asymptotic method of moments estimator for α_j, namely, ᾱ_j = [1 + α̂_j − γ̂_j + {(1 + α̂_j − γ̂_j)² − 4α̂_j}^{1/2}]/2, which is obtained by solving the equation α̂_j = α_j{1 + γ̂_j/(α_j − 1)} for α_j. Here and henceforth, γ̂_j = p/n_j for j = 1, 2. Substituting α₀ with ᾱ₀ in the expression for b_α yields the estimate

{\hat{b}}_{α} = ({\hat{γ}}_{1} - {\hat{γ}}_{2}) {\bar{α}}_{0} / ({\bar{α}}_{0} - 1) .

(6)

If ${\hat{α}}_{j} < {(1 + {\hat{γ}}_{j}^{1 / 2})}^{2}$ , then ᾱ_j is complex-valued, which indicates that the population covariance is either unspiked or has small, undetectable, spikes.
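The bias correction in (5) and (6) can be sketched as below, assuming NumPy. The function names are hypothetical, and returning `None` when the discriminant is negative is an illustrative way of flagging the complex-valued case just described.

```python
import numpy as np

def limit_of_sample_eigenvalue(alpha, gamma):
    """Almost-sure limit alpha * {1 + gamma / (alpha - 1)} of the leading sample eigenvalue."""
    return alpha * (1.0 + gamma / (alpha - 1.0))

def mom_estimate(alpha_hat, gamma_hat):
    """Method of moments estimate of alpha; None when the root would be complex."""
    disc = (1.0 + alpha_hat - gamma_hat) ** 2 - 4.0 * alpha_hat
    if disc < 0:
        return None  # unspiked or undetectable spike
    return 0.5 * (1.0 + alpha_hat - gamma_hat + np.sqrt(disc))

def bias_estimate(alpha_hat_1, alpha_hat_2, p, n1, n2):
    """Estimate of b_alpha from equations (5) and (6)."""
    g1, g2 = p / n1, p / n2
    a1, a2 = mom_estimate(alpha_hat_1, g1), mom_estimate(alpha_hat_2, g2)
    if a1 is None or a2 is None:
        return None
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    a0 = w1 * a1 + w2 * a2
    return (g1 - g2) * a0 / (a0 - 1.0)

# Round trip: a population eigenvalue of 5 with gamma = 0.5 has limit 5.625,
# and the method of moments estimator maps 5.625 back to 5.
rho = limit_of_sample_eigenvalue(5.0, 0.5)
print(rho, mom_estimate(rho, 0.5))
```

With p = 50, n₁ = 100 and n₂ = 50, a common population eigenvalue of 5 gives limiting sample eigenvalues 5.625 and 6.25, so the uncorrected difference converges to −0.625, which is the value `bias_estimate(5.625, 6.25, 50, 100, 50)` returns.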

Define the 2 × 2 symmetric matrix Σ_Q with diagonal elements τ_{Q_αα} and τ_{Q_TT}, respectively, and off-diagonal element τ_{Q_αT}. Theorem 2 yields the consistent estimates

{\hat{τ}}_{Q_{α α}} = \sum_{j = 1}^{2} \left(\frac{2}{n_{j}}\right) \frac{{\bar{α}}_{0} θ_{j}^{2} ρ_{j}}{1 + {\bar{α}}_{0} {\hat{γ}}_{j} / \{{({\bar{α}}_{0} - 1)}^{2} - {\hat{γ}}_{j}\}}

(7)

and

{\hat{τ}}_{Q_{α T}} = \sum_{j = 1}^{2} \left(\frac{2}{n_{j}}\right) \frac{{\bar{α}}_{0} θ_{j} ρ_{j}}{1 + {\bar{α}}_{0} {\hat{γ}}_{j} / \{{({\bar{α}}_{0} - 1)}^{2} - {\hat{γ}}_{j}\}} .

(8)

We estimate τ_{Q_TT} by τ̂_{Q_TT}, given in equation (10).

After we obtain Σ̂_Q, we propose to reject H₀ for large values of $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ . According to Theorem 2, under H₀, the asymptotic joint normality of α̂₁ − α̂₂ − b̂_α and T̂₁ − T̂₂ suggests that $Q^{T} Σ_{Q}^{- 1} Q \to χ_{2}^{2}$ in distribution. Then, to the extent that $Σ_{Q}^{- 1}$ is estimated accurately, our test statistic $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ may be compared to the quantiles of a $χ_{2}^{2}$ distribution to obtain a p-value. A permutation test may also be employed. Simulations in Section 5 show the proposed test to have accurate Type-1 error at all sample sizes when our assumptions hold, suggesting that accurate estimation of $Σ_{Q}^{- 1}$ is not a hurdle for the test’s performance.

3·2. Test robust to the number of spiked eigenvalues

We generally expect that genes in a pathway are jointly associated with not just one but a number of biological processes, which implies the existence of multiple spiked eigenvalues. To accommodate an unspecified number of spiked eigenvalues in the proposed test, we first estimate the number of spiked eigenvalues and then apply a modified expression for var(T̂_j).

To estimate M_j, the number of spiked eigenvalues in class j, we choose a threshold of ${(1 + {\hat{γ}}_{j}^{1 / 2})}^{2} + \{2 \log (n_{j}) / n_{j}\}^{1 / 2}$ , and with I denoting the indicator function, define

{\hat{M}}_{j} = \sum_{m = 1}^{p} I [{\hat{α}}_{j, m} > {(1 + {\hat{γ}}_{j}^{1 / 2})}^{2} + \{2 \log (n_{j}) / n_{j}\}^{1 / 2}] .

(9)

This estimator may have difficulty classifying the eigenvalues near ${(1 + {\hat{γ}}_{j}^{1 / 2})}^{2}$ . However, the treatment of such small spiked eigenvalues will not appreciably affect our estimates of var(T̂_j). We then use independence of T̂₁ and T̂₂ to estimate τ_{Q_TT} with

{\hat{τ}}_{Q_{T T}} = \sum_{j = 1}^{2} \frac{2}{n_{j}} (\sum_{m = 1}^{{\hat{M}}_{j}} {\hat{α}}_{j, m}^{2} + p - {\hat{M}}_{j}) .

(10)

Some alternative methods for estimating the number of spikes, e.g. the proposal by Kritchman & Nadler (2008), have good power of detection and could be used instead of the estimator (9), but the approach detailed above does not depend on the Gaussian assumption.
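Equations (9) and (10) translate directly into code. The sketch below assumes NumPy and that the data have already been normalized so that the noise eigenvalues are near one; the function names and the toy eigenvalues are hypothetical.

```python
import numpy as np

def count_spikes(eigvals, n):
    """Estimated number of spiked eigenvalues, equation (9)."""
    eigvals = np.asarray(eigvals, dtype=float)
    gamma_hat = eigvals.size / n
    threshold = (1.0 + np.sqrt(gamma_hat)) ** 2 + np.sqrt(2.0 * np.log(n) / n)
    return int(np.sum(eigvals > threshold))

def tau_tt_hat(eigvals_1, n1, eigvals_2, n2):
    """Estimated variance of the trace difference, equation (10)."""
    total = 0.0
    for eigvals, n in ((eigvals_1, n1), (eigvals_2, n2)):
        eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
        m = count_spikes(eigvals, n)
        total += (2.0 / n) * (np.sum(eigvals[:m] ** 2) + eigvals.size - m)
    return total

# Toy sample eigenvalues for a class with p = 5 genes and n = 50 samples.
toy = [10.0, 1.2, 1.0, 0.8, 0.5]
print(count_spikes(toy, 50), tau_tt_hat(toy, 50, toy, 50))
```

For these toy values the threshold is about 2.13, so only the eigenvalue 10 is counted as a spike and each class contributes (2/50)(10² + 5 − 1) = 4.16 to the variance estimate.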

We outline below the proposed procedure for testing H₀ in (3), which is robust to the number of spiked eigenvalues.

Step 1. Calculate the eigenvalues {α̂_{j,k}} and trace T̂_j of the sample covariance matrix Σ̂_j (j = 1, 2).
Step 2. Calculate b̂_α = (γ̂₁ − γ̂₂)ᾱ₀/(ᾱ₀ − 1), where ᾱ₀ is defined in equation (5).
Step 3. Calculate Q according to (4).
Step 4. Estimate Σ_Q:
  1. Estimate the number of spiked eigenvalues in each class according to (9), and then calculate τ̂_{Q_TT} according to (10).
  2. Calculate θ_j and ρ_j, j = 1, 2, as defined in Theorem 2. Compute τ̂_{Q_αα} and τ̂_{Q_αT} according to equations (7) and (8), respectively.
Step 5. Compute the test statistic $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ . To obtain a p-value, compare its value to the quantiles of a $χ_{2}^{2}$ distribution. Alternatively, permute the class labels and recompute the test statistic many times, and compare the observed $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ to the quantiles of the resulting permutation statistics.
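The permutation alternative in the last step can be sketched generically, assuming NumPy. To keep the example short, `leading_eigenvalue_gap` is a stand-in statistic and is not the paper's quadratic form; any function of the two data matrices could be passed instead. The p-value is taken here as the plain proportion of permuted statistics at least as large as the observed one, which is one common convention and an assumption of this sketch.

```python
import numpy as np

def leading_eigenvalue_gap(y1, y2):
    """Stand-in statistic (not the paper's Q^T Sigma^-1 Q): squared gap in leading eigenvalues."""
    l1 = np.linalg.eigvalsh(np.cov(y1, rowvar=False))[-1]
    l2 = np.linalg.eigvalsh(np.cov(y2, rowvar=False))[-1]
    return (l1 - l2) ** 2

def permutation_p_value(y1, y2, statistic, n_perm=200, seed=0):
    """Permute class labels, recompute the statistic, and compare with the observed value."""
    rng = np.random.default_rng(seed)
    observed = statistic(y1, y2)
    stacked = np.vstack([y1, y2])
    n1 = len(y1)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(stacked))
        exceed += int(statistic(stacked[idx[:n1]], stacked[idx[n1:]]) >= observed)
    return exceed / n_perm

# Toy data: the second class has five times the standard deviation of the first.
rng = np.random.default_rng(1)
y1 = rng.standard_normal((40, 5))
y2 = 5.0 * rng.standard_normal((40, 5))
p_value = permutation_p_value(y1, y2, leading_eigenvalue_gap)
```

Because the statistic is recomputed for every permutation, the cost grows linearly in the number of permutations, which is consistent with the running time reported for the real-data analysis.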

Sometimes the first eigenvalue might be inadequate to capture variability due to pathway co-regulation. For such occasions we could use the top M eigenvalues and test an extended null hypothesis H_{0_M} : (α_{1,1},…, α_{1,M}, T₁) = (α_{2,1},…, α_{2,M}, T₂); see the Supplementary Material.

4. Theoretical results

In this section, we outline theoretical results for the asymptotic behavior of (α̂₁ − α̂₂ − b̂_α) and (T̂₁ − T̂₂) under the spiked eigenvalue setting implied by our biological model under the null hypothesis and assuming Gaussian data.

We first consider a single class. Denote the population eigenvalues by $\{α_{i}\}_{i = 1}^{p}$ and their sample equivalents by $\{{\hat{α}}_{i}\}_{i = 1}^{p}$ . We assume α₁ > α₂ = ··· = α_p = 1. Write α₁ ≡ α, α̂₁ ≡ α̂, and let $T = \sum_{k = 1}^{p} α_{k}, \hat{T} = \sum_{k = 1}^{p} {\hat{α}}_{k}$ . In Theorem 1, we lay the groundwork for our method by specifying the joint asymptotic distribution of (α̂, T̂). This result is of interest beyond its application to the proposed test, and to our knowledge it gives the first published expression for the joint asymptotic distribution of α̂ and T̂.

Theorem 1

Suppose that p, n → ∞ such that n^{1/2}|p/n − γ| → 0 where γ ∈ (0, 1). Assume α > 1 + γ^{1/2}. Let ρ = α{1 + γ/(α − 1)}. Then

Σ_{α T, n}^{- 1 / 2} \begin{pmatrix} n^{1 / 2} (\hat{α} - ρ) \\ \hat{T} - T \end{pmatrix} \to N (0, I_{2})

in distribution, where

Σ_{α T, n} = \begin{pmatrix} σ_{α α, n} & σ_{α T, n} \\ σ_{α T, n} & σ_{T T, n} \end{pmatrix},

σ_{α α, n} = \frac{2 α ρ}{1 + \frac{α γ}{{(α - 1)}^{2} - γ}}, \quad σ_{T T, n} = 2 \left(\frac{α^{2}}{n} + \frac{p - 1}{n}\right), \quad σ_{α T, n} = n^{- 1 / 2} α ρ \left\{\frac{2 + K (ρ, γ)}{1 + \frac{α γ}{{(α - 1)}^{2} - γ}}\right\},

(11)

K (ρ, γ) = \frac{1}{2 π^{2}} \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} \frac{k_{γ} (x, y)}{{(ρ - x)}^{2}} dx dy .

(12)

Here k_γ(x, y) is a bounded, nonnegative function with support [(1 − γ^{1/2})², (1 + γ^{1/2})²] × [(1 − γ^{1/2})², (1 + γ^{1/2})²].

Corollary 1

Under the assumptions of Theorem 1, in distribution,

{\begin{pmatrix} n^{1 / 2} (\hat{α} - ρ) \\ \hat{T} - T \end{pmatrix}}^{T} Σ_{α T, n}^{- 1} \begin{pmatrix} n^{1 / 2} (\hat{α} - ρ) \\ \hat{T} - T \end{pmatrix} \to χ_{2}^{2} .

Remark 1

The conclusions in Theorem 1 and Corollary 1 remain unchanged even if we replace γ by γ̂ = p/n, and α by ᾱ = (1/2)[1 + α̂ − γ̂ + {(1 + α̂ − γ̂)² − 4α̂}^{1/2}], which is obtained by solving the equation α̂ = ᾱ{1 + γ̂/(ᾱ − 1)}.

Remark 2

If α ≫ 1 + γ^{1/2}, the contribution of the term K(ρ, γ) in the expression for σ_{αT,n} is asymptotically negligible. In this case, σ_{αT,n} can be replaced by

{\tilde{σ}}_{α T, n} = n^{- 1 / 2} 2 α ρ {[1 + α γ / \{{(α - 1)}^{2} - γ\}]}^{- 1} .

We then apply the results of Theorem 1 to the two-class case to calculate the null distribution of our test statistic. Specifically, under H₀ in (3), without loss of generality, we assume that the common covariance matrix has eigenvalues α₀, 1,…, 1, where $α_{0} > 1 + \max (γ_{1}^{1 / 2}, γ_{2}^{1 / 2})$ , and γ_j = lim_{n_j → ∞} p/n_j ∈ (0, 1). Under the alternative, the non-spiked eigenvalues could take values other than 1.

Theorem 2

Suppose that p, n₁, n₂ → ∞ such that $n_{j}^{1 / 2} | p / n_{j} - γ_{j} | \to 0$ where γ_j ∈ (0, 1), for j = 1, 2. Let α̂_{jk} denote the k-th largest eigenvalue of the sample covariance matrix of class j, and ${\hat{T}}_{j} = \sum_{k = 1}^{p} {\hat{α}}_{j k}$ . Let b̂_α be defined by (6). Introduce

Σ_{Q, n} = \begin{pmatrix} σ_{Q_{α α}, n} & σ_{Q_{α T}, n} \\ σ_{Q_{α T}, n} & σ_{Q_{T T}, n} \end{pmatrix},

with $σ_{Q_{T T}, n} = 2 (1 / n_{1} + 1 / n_{2}) (α_{0}^{2} + p - 1)$ ;

σ_{Q_{α α}, n} = w_{2} θ_{1}^{2} \left\{\frac{2 α_{0} ρ_{1}}{1 + \frac{α_{0} γ_{1}}{{(α_{0} - 1)}^{2} - γ_{1}}}\right\} + w_{1} θ_{2}^{2} \left\{\frac{2 α_{0} ρ_{2}}{1 + \frac{α_{0} γ_{2}}{{(α_{0} - 1)}^{2} - γ_{2}}}\right\},

where w_j = n_j/(n₁ + n₂), ρ_j = α₀{1 + γ_j/(α₀ − 1)},

θ_{j} = 1 + {\left(\frac{1}{γ_{1}} + \frac{1}{γ_{2}}\right)}^{- 1} \frac{(γ_{1} - γ_{2})}{{(α_{0} - 1)}^{2}} \frac{κ_{j}}{γ_{j}}, \quad κ_{j} = \frac{1}{2} \left[1 + \frac{ρ_{j} - 1 - γ_{j}}{\{{(1 + ρ_{j} - γ_{j})}^{2} - 4 ρ_{j}\}^{1 / 2}}\right];

and $σ_{Q_{α T}, n} = n_{1}^{- 1 / 2} w_{2}^{1 / 2} θ_{1} \left[\frac{α_{0} ρ_{1} \{2 + K (ρ_{1}, γ_{1})\}}{1 + \frac{α_{0} γ_{1}}{{(α_{0} - 1)}^{2} - γ_{1}}}\right] + n_{2}^{- 1 / 2} w_{1}^{1 / 2} θ_{2} \left[\frac{α_{0} ρ_{2} \{2 + K (ρ_{2}, γ_{2})\}}{1 + \frac{α_{0} γ_{2}}{{(α_{0} - 1)}^{2} - γ_{2}}}\right]$ ,

where K(ρ, γ) is as in (12). Then, in distribution,

Σ_{Q, n}^{- 1 / 2} \begin{pmatrix} {\left(\frac{n_{1} n_{2}}{n_{1} + n_{2}}\right)}^{1 / 2} ({\hat{α}}_{11} - {\hat{α}}_{21} - {\hat{b}}_{α}) \\ {\hat{T}}_{1} - {\hat{T}}_{2} \end{pmatrix} \to N (0, I_{2}) .

(13)

Remark 3

In Theorem 2, we can replace γ_j by γ̂_j = p/n_j, and α₀ by ᾱ₀ defined through (5), without altering the conclusions.

Remark 4

The statements of both theorems remain valid even if γ_j ∈ [1, ∞) for j = 1, 2, though the proofs change slightly. Moreover, the conclusions of Theorem 2 continue to hold even when γ_j = 0, j = 1, 2, with ρ_j = 0 and the terms K(ρ_j,γ_j) are absent from the expressions.

Remark 5

If α_{j1} → ∞, but α_{j1} = o(p), for j = 1, 2, both theorems hold.

Remark 6

If $α_{0} ≫ 1 + \max (γ_{1}^{1 / 2}, γ_{2}^{1 / 2})$ , the contribution of the terms K(ρ_j, γ_j) (j = 1, 2) in the expression for σ_{Q_{αT},n} is asymptotically negligible. In this case, we replace σ_{Q_{αT},n} by

{\tilde{σ}}_{Q_{α T}, n} = n_{1}^{- 1 / 2} w_{2}^{1 / 2} θ_{1} \frac{2 α_{0} ρ_{1}}{1 + \frac{α_{0} γ_{1}}{{(α_{0} - 1)}^{2} - γ_{1}}} + n_{2}^{- 1 / 2} w_{1}^{1 / 2} θ_{2} \frac{2 α_{0} ρ_{2}}{1 + \frac{α_{0} γ_{2}}{{(α_{0} - 1)}^{2} - γ_{2}}} .

This is the expression used in defining the test statistic $Q^{T} \hat{Σ}_{Q}^{- 1} Q$ .
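The plug-in covariance in (7) and (8) and the resulting quadratic form can be sketched as follows, assuming NumPy. The function names are hypothetical, θ_j, κ_j and ρ_j follow the expressions stated in Theorem 2 with α₀ and γ_j replaced by their estimates, the K terms are dropped as in Remark 6, and the estimate of τ_{Q_TT} from (10) is passed in as an argument. The p-value uses the fact that the upper tail probability of a $χ_{2}^{2}$ variable at x is exp(−x/2).

```python
import numpy as np

def sigma_q_hat(alpha0, n, p, tau_tt):
    """2 x 2 estimate of cov(Q) from equations (7), (8) and a supplied tau_TT estimate."""
    n1, n2 = n
    g = (p / n1, p / n2)
    harmonic = 1.0 / (1.0 / g[0] + 1.0 / g[1])
    tau_aa = tau_at = 0.0
    for nj, gj in zip((n1, n2), g):
        rho = alpha0 * (1.0 + gj / (alpha0 - 1.0))
        kappa = 0.5 * (1.0 + (rho - 1.0 - gj) / np.sqrt((1.0 + rho - gj) ** 2 - 4.0 * rho))
        theta = 1.0 + harmonic * (g[0] - g[1]) / (alpha0 - 1.0) ** 2 * kappa / gj
        denom = 1.0 + alpha0 * gj / ((alpha0 - 1.0) ** 2 - gj)
        tau_aa += (2.0 / nj) * alpha0 * theta ** 2 * rho / denom
        tau_at += (2.0 / nj) * alpha0 * theta * rho / denom
    return np.array([[tau_aa, tau_at], [tau_at, tau_tt]])

def chi2_test(q, sigma_q):
    """Quadratic form Q^T Sigma^-1 Q and its chi-square (2 df) upper tail probability."""
    q = np.asarray(q, dtype=float)
    stat = float(q @ np.linalg.solve(sigma_q, q))
    return stat, float(np.exp(-stat / 2.0))

# Illustrative inputs: equal class sizes, p = 50, plug-in alpha0 = 5.
sigma = sigma_q_hat(5.0, (100, 100), 50, tau_tt=3.225625)
stat, p_value = chi2_test([0.5, 1.0], sigma)
```

With equal class sizes the correction factors θ_j equal one, so the two off-diagonal and leading diagonal entries coincide; unequal class sizes make θ_j depart from one through the γ₁ − γ₂ term.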

Remark 7

Both theorems extend readily to cases with multiple spiked eigenvalues. See the Supplementary Material for details.

The proof of Theorem 1 uses the asymptotic expansions of the leading sample eigenvalues in Paul (2007) and the behavior of linear spectral statistics of sample covariance matrices described in Bai & Silverstein (2010). Theorem 2 follows from this and an application of the delta method.

5. Simulations

In this section, we describe simulations investigating the Type-1 error and power of the proposed test and the tests of Schott (2007) and Srivastava & Yanagihara (2010).

We consider three different sets of covariance structures. For each set, we use the same baseline covariance matrix Σ₁ and introduce different perturbations to generate Σ₂. We define Σ₁ according to the biological model in Section 2. To simulate data with p genes, we set $Σ_{1} = σ_{a}^{2} h h^{T} + I$ , with $σ_{a}^{2} = 35 p^{- 1 / 2}$ and h_{p×1} = {−0.5, 1/(p − 1) − 0.5, …, (p − 2)/(p − 1) − 0.5, 0.5}. In Σ₁, $σ_{a}^{2} h h^{T}$ represents the variability due to pathway activities, and I represents the unordered, noisy component of pathway gene variance. In the first perturbation, which we call the added noise setting, we let Σ₂ = Σ₁ + 0.2I, so gene expression is subject to broader disorder in the second class. In the second perturbation, the lost co-regulation setting, we simulate pathway dysregulation by letting Σ₂ = 0.7Σ₁ + 0.3diag(Σ₁) so that overall variability is unchanged but less well-ordered. This perturbation substantially decreases the first eigenvalue while leaving the trace unchanged. In real data, a change like this could arise from deactivation of pathway regulatory elements like transcription factors. In the third perturbation, the additional biological process setting, we let Σ₂ = Σ₁ + gg^T, where g = {g_i} is defined as g_i = 0.75 for i ∈ {1,…, 0.4p} and g_i = 0 otherwise. In this setting, 40% of the genes in the pathway participate in a secondary biological process represented by the gg^T component.

We consider p = 20, 50 and 100. The corresponding first eigenvalues of Σ₁ under the three dimensions are 15.4, 22.5 and 30.8, respectively. For each p, we consider sample sizes n₁ ∈ {20, 30, 50, 75, 100, 130} and n₂ = 0.66n₁. For each (p, n) and (Σ₁, Σ₂), we simulate 10,000 pairs of multivariate normal datasets and apply the proposed test as well as the methods of Schott (2007) and Srivastava & Yanagihara (2010) to test the differences between the two covariance matrices. We apply the robust version of the test described in Section 3·2 for the added noise and the lost co-regulation settings, and we apply the multiple-spike version described in the Supplementary Material with M = 2 for the additional biological process setting. Under all three settings, we preprocess the data using the normalization scheme described in the Supplementary Material and derive the p-values according to the theoretical χ² distributions. Additionally, we examine the tests’ Type-1 error rates in these settings by defining Σ₀ = n₁/(n₁ + n₂)Σ₁ + n₂/(n₁ + n₂)Σ₂, generating datasets of size n₁ and n₂ from Σ₀, and running the tests on these null datasets.
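The simulation covariance matrices can be reconstructed from their definitions, which also reproduces the quoted first eigenvalues. This is a minimal sketch assuming NumPy; the function names and dictionary keys are hypothetical labels for the three settings.

```python
import numpy as np

def baseline_covariance(p):
    """Sigma_1 = sigma_a^2 h h^T + I with sigma_a^2 = 35 / sqrt(p) and h evenly spaced on [-0.5, 0.5]."""
    h = np.arange(p) / (p - 1) - 0.5
    return (35.0 / np.sqrt(p)) * np.outer(h, h) + np.eye(p)

def perturbed_covariances(sigma1):
    """The three Sigma_2 settings: added noise, lost co-regulation, additional biological process."""
    p = sigma1.shape[0]
    g = np.zeros(p)
    g[: int(round(0.4 * p))] = 0.75  # 40% of genes join a secondary process
    return {
        "added_noise": sigma1 + 0.2 * np.eye(p),
        "lost_coregulation": 0.7 * sigma1 + 0.3 * np.diag(np.diag(sigma1)),
        "additional_process": sigma1 + np.outer(g, g),
    }

# Leading eigenvalue of Sigma_1 for each dimension used in the simulations.
first_eigenvalues = {p: np.linalg.eigvalsh(baseline_covariance(p))[-1] for p in (20, 50, 100)}
print({p: round(float(v), 1) for p, v in first_eigenvalues.items()})

settings = perturbed_covariances(baseline_covariance(20))
```

Here h is used exactly as defined in the simulation setup, without rescaling to unit norm, which is what yields the leading eigenvalues 15.4, 22.5 and 30.8.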

Fig. 1 displays the results of these simulations. The first row of plots displays Type-1 error rates of the three methods. The method of Schott (2007) is conservative, the method of Srivastava & Yanagihara (2010) is liberal, and the proposed test has the most accurate levels under all settings. The second row of plots displays powers of the three tests based on theoretical null distributions. The proposed test outperforms the others in the added noise and lost co-regulation settings and is competitive in the additional biological process setting. In the third row of plots, instead of using theoretical approximations to determine each test statistic’s threshold for significance, we compute adjusted power as the percentage of test statistics under the alternative hypothesis exceeding the upper 0.05 quantile of the empirical null distribution of the test statistics. In this way, the Type-1 errors of all tests are controlled at 0.05, so the power comparison is fairer and more direct. The proposed test easily outperforms the other two in terms of adjusted power under all settings and all n, p combinations.
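The adjusted power calculation amounts to thresholding the alternative statistics at an empirical null quantile. A minimal sketch, assuming NumPy and a hypothetical function name, with toy arrays standing in for simulated test statistics:

```python
import numpy as np

def adjusted_power(null_stats, alt_stats, level=0.05):
    """Share of alternative statistics above the empirical (1 - level) quantile of the null statistics."""
    cutoff = np.quantile(null_stats, 1.0 - level)
    return float(np.mean(np.asarray(alt_stats) > cutoff))

# Toy statistics for illustration only.
null_stats = np.arange(100.0)        # 0, 1, ..., 99
alt_stats = np.arange(100.0) + 50.0  # 50, 51, ..., 149
power = adjusted_power(null_stats, alt_stats)
```

Because the cutoff comes from the empirical null distribution of the same statistic, a conservative or liberal theoretical approximation no longer affects the comparison between methods.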

The proposed test nearly dominates the methods of Schott (2007) and Srivastava & Yanagihara (2010) in these simulations. In other simulations, we found that the methods of Schott (2007) and Srivastava & Yanagihara (2010) perform well in cases where single elements of the covariance matrix differ substantially between classes. However, changes in the biological quantities we are interested in will most often manifest as widespread, small differences in the covariance matrix, a setting which these earlier methods are not optimized to detect.

We also evaluate the effects of various departures from model (1) on the performance of the proposed test through simulations. Specifically, we consider the effects of variability in the unspiked eigenvalues, unequal error variances, non-normality of the data and multiple spiked eigenvalues. We find that the proposed test is robust to all these departures except for non-normality of the data. Thus we recommend the permutation test in highly kurtotic data.

6. Application to a breast cancer dataset

We apply the proposed test to a breast cancer gene expression dataset (Loi et al., 2007), which has microarray measurements on breast tumor samples from 277 patients treated with tamoxifen and 137 untreated patients. The goal is to identify differences in regulation patterns between patients with and without tamoxifen treatment. We normalized all observations to have equal median and median absolute deviation. Outliers can drive the first eigenvalue of a dataset, destroying its interpretation under our biological model. We therefore truncated each gene’s data in each class at four standard deviations from its mean. This rule truncated 6.4% of the data.
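The truncation rule can be sketched as below, assuming NumPy. The function name is hypothetical, the function is meant to be applied to each class separately, and the use of the sample standard deviation with an n − 1 denominator is an assumption of this sketch.

```python
import numpy as np

def truncate_outliers(y, n_sd=4.0):
    """Clip each gene (column) at n_sd standard deviations from its mean."""
    mean = y.mean(axis=0)
    sd = y.std(axis=0, ddof=1)
    return np.clip(y, mean - n_sd * sd, mean + n_sd * sd)

# Toy class with one extreme value in the first gene.
y = np.zeros((100, 2))
y[0, 0] = 100.0
truncated = truncate_outliers(y)
share_truncated = float(np.mean(truncated != y))
```

In this toy example the first gene has mean 1 and standard deviation 10, so the outlier is pulled back to 41 and all other values are left unchanged.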

Curated databases of gene relationships like KEGG (Kanehisa & Goto, 2000), Reactome (Matthews et al., 2009), and Biocarta (Nishimura, 2001) often build pathways from genes involved in distantly related biological processes. Consequently, these curated pathways tend to be subject to complex co-regulation better described with network estimation tools (Peng et al., 2009; Danaher et al., 2014) than with this paper’s biological model. In lieu of KEGG pathways, we sought sets of genes that could be expected to exhibit the tight co-regulation implied by our model. Cheng et al. (2013a) identified attractor metagenes, sets of genes that tended to cluster together across multiple breast cancer gene expression datasets. We expected that genes clustered together across datasets would often share a biological function, and examination of the metagenes of Cheng et al. (2013a) confirmed this hypothesis. For example, the ID55 metagene contains exclusively histone genes, and the ID88 metagene contains several genes from the cytochrome P450 family and, intriguingly, ESR1, one of the most-studied genes in breast cancer. The biological relevance of these attractor metagenes was further demonstrated by Cheng et al. (2013b), who used attractor metagenes to inform a successful breast cancer prognostic algorithm. Given their biological meaning and apparent consistency with our biological model, we took these metagenes as the basic units of our analysis, and we ran our method and a traditional gene set analysis (Efron & Tibshirani, 2007) on every metagene with more than 5 genes.

Table 1 displays selected results; the Supplementary Material has complete results. A 2.67 GHz laptop took 11 minutes to compute p-values for the 24 metagenes analyzed using 10,000 permutations. The proposed biological model and test revealed a rich picture of changes in co-expression far beyond what traditional Gene Set Analysis provided. The ID88 metagene has higher total variance but a lower first eigenvalue under tamoxifen. This pattern of increased noise and decreased variability due to pathway activity strongly suggests pathway dysregulation. The histone metagene, ID55, saw increases in both overall variability and its first eigenvalue under tamoxifen, suggesting more dynamic histone activity levels in the tamoxifen group. Histones are central to cancer proliferation; this result could be explained by patients responding heterogeneously to the drug. The mesenchymal transition attractor metagene followed a similar pattern, with increased variability under tamoxifen almost entirely due to an increased first eigenvalue.

Table 1.

Eigenvalue statistics and p-values calculated using the proposed test and Gene Set Analysis. Theoretical and permutation-based p-values from the proposed test are under p_χ² and p_perm, respectively, and p-values from Gene Set Analysis are under p_GSA.

| Metagene | Size | α₀* | α₁* | T₀* | T₁* | p_χ² | p_perm | p_GSA |
|---|---|---|---|---|---|---|---|---|
| ID88 | | 38.43 | 23.38 | 58.26 | 62.82 | 0.000 | | 0.010 |
| ID55 | | 119.67 | 139.97 | 156.72 | 192.28 | 0.000 | | 0.240 |
| MTA** | | 127.17 | 167.85 | 157.72 | 205.85 | 0.007 | 0.000 | 0.440 |


* α₀ and T₀ are for the untreated group, while α₁ and T₁ are for the tamoxifen-treated group.

** MTA stands for Mesenchymal Transition Attractor.

The p-values returned by the χ² approximation and the permutation test generally tracked each other, with a Spearman correlation between them of 0.88. However, the permutation test returned uniformly higher p-values than the purely theoretical test, and one metagene, ID79, showed a markedly increased p-value under the permutation test. The liberal χ² p-values appear to be driven by excessively kurtotic data, and they suggest the use of the permutation test over the χ² approximation in highly kurtotic data.
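The permutation approach can be sketched generically as follows. This is an illustration under stated assumptions, not the paper's test: the statistic `change_statistic` is a hypothetical stand-in (squared log-ratios of the plug-in first eigenvalue and trace), and centering each class before pooling is a choice made here so that mean differences do not leak into the permuted covariances.

```python
import numpy as np

def alpha_T(x):
    """Plug-in first eigenvalue and trace of the sample covariance."""
    cov = np.cov(x, rowvar=False)
    return np.linalg.eigvalsh(cov)[-1], np.trace(cov)

def change_statistic(x0, x1):
    """Hypothetical stand-in statistic, NOT the statistic defined in the paper."""
    a0, t0 = alpha_T(x0)
    a1, t1 = alpha_T(x1)
    return np.log(a1 / a0) ** 2 + np.log(t1 / t0) ** 2

def permutation_pvalue(x0, x1, n_perm=1000, seed=0):
    """Permutation p-value for a change in (alpha, T) between two classes."""
    rng = np.random.default_rng(seed)
    x0 = x0 - x0.mean(axis=0)   # remove class means before pooling
    x1 = x1 - x1.mean(axis=0)
    pooled = np.vstack([x0, x1])
    n0 = x0.shape[0]
    observed = change_statistic(x0, x1)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])   # shuffle class labels
        if change_statistic(pooled[idx[:n0]], pooled[idx[n0:]]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data: strong co-regulation in class 0, weak in class 1.
rng = np.random.default_rng(1)
x0 = 3.0 * rng.normal(size=(100, 1)) + rng.normal(size=(100, 8))
x1 = 0.5 * rng.normal(size=(100, 1)) + rng.normal(size=(100, 8))
x2 = 3.0 * rng.normal(size=(100, 1)) + rng.normal(size=(100, 8))

p_changed = permutation_pvalue(x0, x1, n_perm=200)
p_null = permutation_pvalue(x0, x2, n_perm=200)
print(p_changed, p_null)
```

Because the reference distribution is built from the data themselves, a procedure of this kind does not rely on the tail behaviour assumed by a χ² approximation, which is one reason a permutation p-value can be preferable for highly kurtotic data.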

7. Discussion

The proposed test is a powerful complement to traditional, marginal effects-based analyses like gene set analysis or tests comparing overall pathway activity levels. Given the high dimensionality and complex behavior of biological pathways, it seems appropriate to apply analyses focused on varied aspects of pathway behavior. A complete analysis of a pathway would include a summary of single-gene behavior, a comparison of overall pathway activity levels between disease states (Lee et al., 2008), a test for changes in covariance structure like the method proposed here, and ideally several other analyses yet to be discovered.

While the proposed test is motivated from the biological model (1), it can be applied to the broad class of pathways for which the first eigenvalue is spiked and reflects variability due to heterogeneous pathway activity levels. Nevertheless, not every gene set adheres to these assumptions. For example, many of the larger KEGG pathways contain genes too distantly related to show discernible co-regulation. Our test is better applied to gene sets very likely to experience co-regulation, for example more narrowly defined KEGG pathways and data-derived gene sets like the attractor metagenes of Cheng et al. (2013a) and the cancer signatures of Wolf et al. (2014). These data-derived gene sets are often highly biologically interpretable, and they have been shown to predict patient outcomes (Cheng et al., 2013b; Clarke et al., 2013; Wolf et al., 2014). It is possible to check a gene set’s suitability for analysis with the proposed test by comparing the prominence of its first eigenvalue (α̂/T̂) to the α̂/T̂ of random gene sets. Various biological and technical variables will induce eigenstructure in sets of unrelated genes. If a gene set’s first eigenvalue is more prominent than seen in random gene sets, the gene set is likely experiencing co-regulation. When the values of these technical variables (e.g. reagent lot) or biological variables (e.g. cancer subtype) are known, it is possible to scrub their influence from the data by regressing each gene on these variables and performing the proposed test on the residuals.
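The suitability check and the covariate adjustment described above might be sketched as follows. This is a rough illustration that uses plug-in values for α̂/T̂ and ordinary least squares for the regression step; the function names, the simulated batch variable, and the number of random gene sets are all assumptions made for the example.

```python
import numpy as np

def prominence(x):
    """Share of total variance carried by the first eigenvalue (plug-in alpha/T)."""
    cov = np.cov(x, rowvar=False)
    return np.linalg.eigvalsh(cov)[-1] / np.trace(cov)

def random_set_prominence(expr, set_size, n_sets=500, seed=0):
    """Prominence of randomly drawn gene sets of the same size."""
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[1]
    return np.array([
        prominence(expr[:, rng.choice(n_genes, size=set_size, replace=False)])
        for _ in range(n_sets)
    ])

def residualize(expr, covariates):
    """Regress every gene on known covariates (plus intercept); return residuals."""
    design = np.column_stack([np.ones(expr.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

# Hypothetical data: 300 genes, a batch effect on all of them,
# and a co-regulated set made up of the first 12 genes.
rng = np.random.default_rng(2)
n, g, k = 150, 300, 12
batch = rng.integers(0, 2, size=n).astype(float)
expr = rng.normal(size=(n, g)) + 0.5 * batch[:, None]
expr[:, :k] += 1.5 * rng.normal(size=(n, 1))

clean = residualize(expr, batch)                 # scrub the known batch variable
observed = prominence(clean[:, :k])
null = random_set_prominence(clean, set_size=k, n_sets=200)
frac_as_prominent = np.mean(null >= observed)    # small value suggests co-regulation
print(observed, null.max(), frac_as_prominent)
```

A gene set whose prominence sits well above the bulk of the random-set values is a plausible candidate for the test; one that is indistinguishable from random sets is not.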

A useful extension of this work would be the development of tests for differences in more targeted quantities than the somewhat broad (α̂, T̂). For example, a test for changes in (T̂ − α̂) could be considered to directly look for increased dysregulation, or non-co-regulated variability, between classes. The asymptotic normality of T̂ and α̂ would make these tests simple to derive.
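As a rough indication of why such a test would be simple to derive, suppose that within each class k the pair (α̂_k, T̂_k) is approximately jointly normal with covariance matrix V_k, and that the classes are independent. Joint normality, the notation V_k, and the class labels 0 and 1 are assumptions made for this sketch; they are not taken from the theorems. A difference statistic then follows from the usual rule for linear combinations of normal variables:

```latex
% Assumed notation: k indexes the class; V_k is the approximate covariance of
% (\hat\alpha_k, \hat T_k), already scaled by sample size.
\begin{align*}
\hat D_k &= \hat T_k - \hat\alpha_k, \\
\operatorname{var}(\hat D_k) &\approx v_k = V_{k,TT} - 2 V_{k,\alpha T} + V_{k,\alpha\alpha}, \\
Z &= \frac{\hat D_1 - \hat D_0}{\sqrt{\hat v_0 + \hat v_1}} \;\approx\; N(0, 1)
\quad \text{under } H_0 : T_0 - \alpha_0 = T_1 - \alpha_1 .
\end{align*}
```

Here v̂_k denotes a plug-in estimate of v_k. A large positive Z would point to more non-co-regulated variability in class 1 than in class 0.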

An approach to this problem rooted in factor analysis could also be productive, although the factor analysis literature lacks the results for high-dimensional data that enabled our approach.

SETPath, an R package implementing the test, is on CRAN (R Core Team, 2013).

Acknowledgments

Grants from the National Science Foundation and the National Institutes of Health supported this research. PD primarily worked on this method while part of the Department of Biostatistics at the University of Washington. Reviewers provided valuable input.

Footnotes

Supplementary material

Supplementary material available at Biometrika online provides a description of a normalization scheme, outlines of proofs of Theorems 1 and 2, a derivation of our test in the setting without spiked eigenvalues, simulations investigating the consequences of departures from our assumptions, and a table containing the full results of the breast cancer expression data analysis.

Contributor Information

P. DANAHER, Email: pdanaher@nanostring.com, NanoString Technologies, 530 Fairview Ave. N, Seattle, Washington 98109, U.S.A

D. PAUL, Email: debpaul@ucdavis.edu, Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, U.S.A

P. WANG, Email: pei.wang@mssm.edu, Icahn Institute of Genomics and Multiscale Biology, Icahn Medical School at Mount Sinai, 1470 Madison Avenue, S8-102 New York, New York, 10029, U.S.A

References

Bai ZD, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Springer; 2010.
Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis. 2006;97:1382–1408.
Bair E, Hastie TJ, Paul D, Tibshirani RJ. Prediction by supervised principal components. Journal of the American Statistical Association. 2006;101:119–137.
Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2005;439:353–357. doi: 10.1038/nature04296.
Chen X, Wang L, Smith JD, Zhang B. Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics. 2008;24:2474–2481. doi: 10.1093/bioinformatics/btn458.
Cheng WY, Yang THO, Anastassiou D. Biomolecular events in cancer revealed by attractor metagenes. PLoS Computational Biology. 2013a;9:e1002920. doi: 10.1371/journal.pcbi.1002920.
Cheng WY, Yang THO, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Science Translational Medicine. 2013b;5:181ra50. doi: 10.1126/scitranslmed.3005974.
Clarke C, Madden SF, Doolan P, Aherne ST, Joyce H, O'Driscoll L, Gallagher WM, Hennessy BT, Moriarty M, Crown J, et al. Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis. 2013;34:2300–2308. doi: 10.1093/carcin/bgt208.
Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76:373–397. doi: 10.1111/rssb.12033.
Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proceedings of the National Academy of Sciences. 2013;110:6388–6393. doi: 10.1073/pnas.1219651110.
Efron B, Tibshirani RJ. On testing the significance of sets of genes. The Annals of Applied Statistics. 2007;1:107–129.
Johansson K. Shape fluctuations and random matrices. Communications in Mathematical Physics. 2000;209:437–476.
Johnson D, Graybill F. An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association. 1972;67:862–868.
Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics. 2001;29:295–327.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
Kritchman S, Nadler B. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems. 2008;94:19–32.
Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Computational Biology. 2008;4:e1000217. doi: 10.1371/journal.pcbi.1000217.
Li J, Chen S. Two sample tests for high-dimensional covariance matrices. The Annals of Statistics. 2012;40:908–940.
Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, et al. Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade. Journal of Clinical Oncology. 2007;25:1239–1246. doi: 10.1200/JCO.2006.07.1522.
Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Research. 2009;37:D617–D622. doi: 10.1093/nar/gkn863.
Meade AW, Bauer DJ. Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling. 2007;14:611–635.
Nadler B. Finite sample approximation results for principal component analysis: a matrix perturbation approach. Annals of Statistics. 2008;36:2791–2817.
Nadler B. On the distribution of the ratio of the largest eigenvalue to the trace of a Wishart matrix. Journal of Multivariate Analysis. 2011;102:363–371.
Nishimura D. Biocarta. Biotech Software and Internet Report. 2001;2:117–120.
Onatski A. Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics. 2012;168:244–258.
Paul D. Asymptotics of sample eigenstructure for a large dimension spiked covariance model. Statistica Sinica. 2007;17:1617–1642.
Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression model. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2013.
Roy S. On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics. 1953;24:220–238.
Schott J. A test for the equality of covariance matrices when the dimension is large relative to the sample size. Computational Statistics and Data Analysis. 2007;51:6535–6542.
Srivastava M, Yanagihara H. Testing the equality of several covariance matrices with fewer observations than the dimension. Journal of Multivariate Analysis. 2010;101:1319–1329.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6:225. doi: 10.1186/1471-2105-6-225.
Wolf DM, Lenburg ME, Yau C, Boudreau A, van 't Veer LJ. Gene co-expression modules as clinically relevant hallmarks of breast cancer diversity. PLoS ONE. 2014;9:e88309. doi: 10.1371/journal.pone.0088309.
