Author manuscript; available in PMC: 2023 May 1.
Published in final edited form as: Comput Stat Data Anal. 2022 Jan 13;169:107418. doi: 10.1016/j.csda.2021.107418

Statistical Inference for High-Dimensional Pathway Analysis with Multiple Responses

Yang Liu a,b, Wei Sun a, Li Hsu a, Qianchuan He a,*
PMCID: PMC8813039  NIHMSID: NIHMS1769523  PMID: 35125572

Abstract

Pathway analysis, i.e., grouping analysis, has important applications in genomic studies. Existing pathway analysis approaches mostly focus on a single response and are not suitable for analyzing complex diseases, which are often related to multiple response variables. Although a handful of approaches have been developed for multiple responses, these methods are mainly designed for pathways with a moderate number of features. A multi-response pathway analysis approach that is able to conduct statistical inference when the dimension is potentially higher than the sample size is introduced. Asymptotic properties of the test statistic are established and a theoretical investigation of the statistical power is conducted. Simulation studies and real data analysis show that the proposed approach performs well in identifying important pathways that influence multiple expression quantitative trait loci (eQTL).

Keywords: Asymptotical distribution, Complex diseases, High dimensional inference, Multivariate responses, Pathway analysis

2010 MSC: 62H15, 62P10

1. Introduction

Pathway analysis, i.e., grouping analysis, interrogates whether a group of features is associated with a response, and has important applications in genomic data analysis. By harnessing prior biological knowledge and accounting for concerted functional mechanisms, pathway analysis is able to examine multiple genomic features in a holistic manner, and has a strong potential to inform on new strategies to diagnose, treat, and prevent complex diseases [1]. With an explosive number of genomic features being typed in population studies, research interest in pathway analysis has been surging in recent years.

A number of approaches have been developed for conducting statistical inference in pathway analysis, and most of them focus on a single response. For example, when the dimension $p$ (i.e., the number of features in a considered set) is moderate, the Mixed effects Score Test (MiST) can be applied to assess the association between a response and a group of features [2]. When the dimension $p$ is high (and possibly higher than the sample size $n$), principal component analysis (PCA) can potentially be used [3], though this approach has little power when the selected PCs fail to capture the association signals. Besides PCA, other methods have been developed for pathway analysis under high dimensions, with various considerations on experimental design and genetic signal structures. To name a few, Goeman et al. [4, 5] proposed score statistics for large but fixed dimensional settings; Zhong and Chen [6] proposed U-statistic based tests for linear regression models with factorial designs; Guo and Chen [7] developed an approach for high-dimensional testing in the context of generalized linear models; Kong et al. [8] introduced a method based on penalized quantile regression for handling skewed or heavy-tailed responses; Zhou [9] published a method that is optimized for transcriptome data, in which potential outliers and skewness patterns often arise and violate parametric assumptions; Liu et al. [10] developed an approach that is able to make statistical inference for high dimensional pathway analysis when $p/n \to \gamma$, for a constant $\gamma \in (0, \infty)$. These methods are primarily designed for the situation where a single response needs to be analyzed.

Multi-response analysis is needed when the studied disease condition involves multiple response variables. The responses may be biological measurements, such as blood pressure and lipids level in the metabolic syndrome, or molecular measurements, such as gene expressions in the analysis of gene expression quantitative trait loci (eQTLs). Multi-response analysis can potentially harness the shared information among the responses to improve statistical power and has played pivotal roles in studying genetic mechanisms of complex diseases [11]. Motivated by these considerations, some approaches have been developed for making statistical inference of multi-response pathway analysis. The multivariate kernel machine (MVKM) regression [12] accounts for the correlations among the responses, and represents one of the pioneering approaches for conducting multivariate pathway analysis. Sun et al. [13] proposed the MURAT approach under the linear mixed model, and assumed that the effects of the genomic features follow a multivariate normal distribution. He et al. [14] introduced the SOMAT method to investigate multiple responses with respect to pathways, and this method adopts a hierarchical modeling to accommodate biological characteristics of the studied genomic features. The MURAT and SOMAT are designed for moderate dimensions and are not suited for analyzing genomic sets that contain a large number of features. The MVKM can accommodate potential interactions among the features, and has also been used to analyze a moderate number of features. Ma et al. [15] proposed a residual sum of squares (RSS) type statistic for testing the effects in multi-response analysis. They considered the dimension of the responses to be high while the number of features was assumed to be fixed. Recently, Qiu et al. [16] also proposed an approach for detecting faint signals where the responses are high-dimensional and the features are low-dimensional. 
However, little work has been done for statistical inference of multi-response pathway analysis when dimension is high.

In this article, we propose a pathway analysis approach for jointly analyzing multiple responses with high-dimensional features. Our approach accounts for the correlations among the responses, and is able to provide valid statistical inference when the dimension $p$ is greater than $n$, i.e., $p = o(n^2)$. We consider the situation where the signals are relatively weak and non-sparse, which is commonly seen in practice. We establish the asymptotic properties of the proposed statistic and further conduct a theoretical investigation of the power of the statistic when both the sample size ($n$) and the dimension ($p$) go to infinity.

The article is organized as follows. Section 2 describes the testing procedures under different scenarios and establishes their asymptotic properties. Section 3 presents simulation studies for a range of settings to evaluate the empirical performance of the proposed approach. Section 4 applies the proposed tests to a genomic dataset to identify genetic pathways that are associated with the expressions of important cancer genes. Section 5 concludes the article with some remarks.

2. Methods

2.1. Model setup and notations

We consider a sample of $n$ independent individuals, each with $K$ continuous response variables. Let $Y_k = (Y_{k1}, \ldots, Y_{kn})^T$ be the vector for the $k$th response, $X = (X_1, \ldots, X_d)$ be an $n \times d$ matrix of adjusting covariates, and $G = (G_1, \ldots, G_p)$ be an $n \times p$ matrix of genomic features. We assume the following linear regression model for the $k$th response, $k = 1, \ldots, K$,

$$Y_k = X\alpha_k + G\beta_k + \varepsilon_k, \tag{1}$$

where $\alpha_k = (\alpha_{k1}, \ldots, \alpha_{kd})^T$ is the coefficient vector of $X$ for the $k$th response (with $\alpha_{k1}$ being the intercept), $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^T$ is the coefficient vector of $G$ for the $k$th response, and $\varepsilon_k = (\varepsilon_{k1}, \ldots, \varepsilon_{kn})^T$ is a vector of independent random errors. We assume that for the $i$th individual, $i = 1, \ldots, n$, the vector of random errors across the $K$ responses, $(\varepsilon_{1i}, \ldots, \varepsilon_{Ki})$, follows a multivariate Gaussian distribution with mean 0 and covariance $\Sigma = \{\sigma_{kl}\}_{k,l=1,\ldots,K}$, where $\Sigma$ is positive definite. That is, the covariance between the $k$th and the $l$th responses is $\sigma_{kl}$. Here, the design matrices $X$ and $G$ are assumed to be fixed throughout this paper. The number of responses $K$ and the number of adjusting covariates $d$ are considered to be finite, while the dimension of the genomic features $p$ is allowed to grow to infinity. We are interested in testing the global null hypothesis $H_0: \beta_1 = \cdots = \beta_K = 0$ against the alternative $H_a$: at least one $\beta_{kj} \neq 0$ for $k = 1, \ldots, K$ and $j = 1, \ldots, p$. That is, our goal is to test whether any of the genomic features is associated with any of the $K$ responses.

Before describing our approach, we introduce some notation. For a vector $a = (a_1, \ldots, a_n)^T$, let $\|a\| = (\sum_{i=1}^{n} a_i^2)^{1/2}$ be the $L_2$-norm of the vector. For any $k \times l$ matrix $A = (a_{ij})_{i=1,\ldots,k;\, j=1,\ldots,l}$, denote the spectral norm by $\|A\| = \sup_{\|x\|=1} \|Ax\|$, and the Frobenius norm by $\|A\|_F = (\sum_{i=1}^{k}\sum_{j=1}^{l} a_{ij}^2)^{1/2}$. When $A$ is a square matrix, we denote its maximal and minimal eigenvalues by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively.

2.2. Testing procedures when Σ is known

We first consider the situation in which the covariance matrix $\Sigma$ is known. (The situation in which $\Sigma$ is unknown will be studied in Section 2.3.) To test the global hypothesis $H_0$, we first evaluate the genetic association for each of the $K$ responses, and then construct a joint test statistic by combining the $K$ responses together.

Consider a single response, say the $k$th response. Instead of directly testing whether $\beta_{k1} = \cdots = \beta_{kp} = 0$, we start with a much simpler model, $Y_k = X\alpha_k + G_j\beta_{kj} + \varepsilon_k$, where $G_j$ is the $j$th genomic feature. Here, only a single feature is fitted in the model, and this model is called the marginal model. To test whether $\beta_{kj} = 0$, one can construct the statistic $z_{kj} = (G_j^T H G_j)^{-1/2} G_j^T H Y_k$, where $H = I_n - P_X$, $I_n$ is the $n \times n$ identity matrix, and $P_X = X(X^TX)^{-1}X^T$ is the projection matrix onto the column space of $X$. To test whether any of the genomic features is associated with the $k$th response, as in [10], one can aggregate the $z_{kj}$'s by the statistic

$$U_k = \sum_{j=1}^{p} z_{kj}^2,$$

and obtain the correlation matrix amongst the $z_{kj}$'s as $V = D^{-1/2}G^THGD^{-1/2}$, where $D$ is the diagonal matrix with entries $G_j^THG_j$, $j = 1, \ldots, p$. It can be shown that $E(U_k) = p\sigma_{kk}$ under $H_0$.
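The marginal statistics can be illustrated with a minimal sketch. It assumes, purely for brevity, that $X$ contains only the intercept, so that applying $H$ amounts to subtracting the sample mean; with further covariates one would project them out instead. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def center(v):
    """Apply H = I - P_X for an intercept-only X, i.e. subtract the mean."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def marginal_stats(G_cols, y):
    """Return (z, U): z[j] = (G_j' H G_j)^(-1/2) G_j' H y and U = sum_j z[j]^2."""
    hy = center(y)
    z = []
    for g in G_cols:
        hg = center(g)
        d_j = sum(a * a for a in hg)  # G_j' H G_j
        z.append(sum(a * b for a, b in zip(hg, hy)) / math.sqrt(d_j))
    return z, sum(v * v for v in z)

def correlation_matrix(G_cols):
    """V = D^(-1/2) G' H G D^(-1/2), again for an intercept-only X."""
    hg = [center(g) for g in G_cols]
    d = [sum(a * a for a in col) for col in hg]
    p = len(hg)
    return [[sum(a * b for a, b in zip(hg[i], hg[j])) / math.sqrt(d[i] * d[j])
             for j in range(p)] for i in range(p)]

# Toy data (made up): n = 4 individuals, p = 2 features stored column-wise.
G_cols = [[0.0, 1.0, 2.0, 1.0], [1.0, 0.0, 1.0, 2.0]]
y = [0.5, 1.5, 2.5, 1.5]

z, U_k = marginal_stats(G_cols, y)
V = correlation_matrix(G_cols)
frob_sq = sum(v * v for row in V for v in row)  # ||V||_F^2
```

In this toy example the two centered features are orthogonal, so $V$ is the identity matrix and $\|V\|_F^2 = p = 2$.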

The above $U_k$ and $V$ can be used for the association test for a single response, but recall that our goal is to jointly test the $K$ responses for the $p$ genomic features. To pursue this goal, it is necessary to characterize the joint distribution of the $U_k$'s. Let $\Omega$ be the element-wise squared matrix of $\Sigma$, i.e., $(\Omega)_{kl} = \sigma_{kl}^2$ for $k, l = 1, \ldots, K$. Interestingly, we show in the following that $U = (U_1, \ldots, U_K)^T$, after centering and scaling, asymptotically follows a multivariate normal distribution as $p$ goes to infinity.

Theorem 1.

Let $n, p \to \infty$. If $\|V\| = o(p^{1/2})$, then under $H_0$,

$$\frac{U - \mu}{\sqrt{2}\,\|V\|_F} \xrightarrow{d} N(0, \Omega),$$

where $\mu = p(\sigma_{11}, \ldots, \sigma_{KK})^T$.

Remark 1.

Here the order of $p$ is not restricted with respect to $n$, as $\Sigma$ is assumed to be known. The condition $\|V\| = o(p^{1/2})$ has also been used in [10]. A sufficient condition for $\|V\| = o(p^{1/2})$ is that the maximum absolute column sum of $V$ is of smaller order than $p^{1/2}$. For example, when the correlations among the $G_j$'s ($j = 1, \ldots, p$) are not massively high, such that $|V_{j_1 j_2}| \le C\delta^{|j_1 - j_2|}$ for some constants $C > 0$ and $0 < \delta < 1$, the maximum absolute column sum of $V$ is bounded and the condition $\|V\| = o(p^{1/2})$ is satisfied. This correlation structure essentially says that the correlation between two variants is weak when they are far apart, which has been widely observed in genetic studies. Some of the commonly used structures, such as the auto-regressive and block-wise correlations, satisfy this condition.
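The sufficient condition in Remark 1 can be checked numerically for the auto-regressive case. The sketch below is illustrative only: the helper names are hypothetical, and it uses the standard fact that the spectral norm of a symmetric matrix is at most its maximum absolute column sum. For $|V_{j_1 j_2}| = \delta^{|j_1 - j_2|}$ each column sum is bounded by the geometric-series limit $(1 + \delta)/(1 - \delta)$, whatever the value of $p$.

```python
import math

def ar_correlation(p, delta):
    """p x p matrix with entries delta^|i - j| (auto-regressive structure)."""
    return [[delta ** abs(i - j) for j in range(p)] for i in range(p)]

def max_abs_column_sum(V):
    """Maximum absolute column sum, an upper bound on ||V|| when V is symmetric."""
    p = len(V)
    return max(sum(abs(V[i][j]) for i in range(p)) for j in range(p))

delta = 0.6
bound = (1 + delta) / (1 - delta)  # 1 + 2 * delta / (1 - delta)

# The column sums stay below the bound while p^(1/2) grows,
# so the ratio (column sum) / p^(1/2) shrinks toward zero.
sums = {p: max_abs_column_sum(ar_correlation(p, delta)) for p in (10, 100, 400)}
ratios = {p: s / math.sqrt(p) for p, s in sums.items()}
```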

Remark 2.

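The covariance matrix $\Omega$ is the Hadamard (entrywise) product of $\Sigma$ with itself, i.e., $\Omega = \Sigma \circ \Sigma$. Since $\Sigma$ is positive definite, the Schur product theorem implies that $\Omega$ is positive definite and thus invertible.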
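As a small numerical illustration of Remark 2, the sketch below forms $\Omega = \Sigma \circ \Sigma$ for an example $\Sigma$ and confirms positive definiteness through a Cholesky factorization, which succeeds only for positive definite matrices. The choice of $\Sigma$ (unit variances, correlation 0.5, $K = 3$) and the helper names are assumptions made for illustration.

```python
import math

def hadamard_square(Sigma):
    """Entrywise square of Sigma, i.e. Omega = Sigma o Sigma."""
    return [[v * v for v in row] for row in Sigma]

def cholesky(A):
    """Lower-triangular L with A = L L'; raises if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

Sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
Omega = hadamard_square(Sigma)
L = cholesky(Omega)  # completes without error, so Omega is positive definite

# The sum of all entries of Omega equals the squared Frobenius norm of Sigma.
sum_omega = sum(v for row in Omega for v in row)
frob_sq_sigma = sum(v * v for row in Sigma for v in row)
```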

Theorem 1 provides a foundation to test the association between the $K$ responses and the genomic features. Based on this joint distribution, we consider two statistics to conduct the joint association test. One is a quadratic type of statistic

$$T_1^* = \frac{(U - \mu)^T \Omega^{-1} (U - \mu)}{2\|V\|_F^2},$$

which asymptotically has a chi-squared distribution with $K$ degrees of freedom. In addition, we can also test for a linear combination of the $U_k$'s. For example, if there is a prior belief that the genetic effects are similar across the $K$ responses, we can follow the lines of [17] to define

$$T_2^* = \frac{J^T (U - \mu)}{\sqrt{2}\,\|V\|_F \sqrt{J^T \Omega J}},$$

which can be simplified as $\sum_{k=1}^{K}(U_k - p\sigma_{kk})/(\sqrt{2}\,\|V\|_F \|\Sigma\|_F)$, where $J = (1, \ldots, 1)^T$ and $J^T\Omega J = \sum_{k,l}\sigma_{kl}^2 = \|\Sigma\|_F^2$. The statistic $T_2^*$ can be shown to be asymptotically standard normal. When the genetic effects are the same across all the responses, this type of statistic has the smallest variance among all the linear combinations of $(U_1, \ldots, U_K)$ and is often considered to be a highly powerful test for this situation (see [17] for more details).
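The two known-$\Sigma$ statistics can be computed as in the following sketch, which takes $U$, $\Sigma$, $p$ and $\|V\|_F^2$ as given. The function names and the toy inputs are hypothetical. The small Gaussian-elimination routine merely avoids forming $\Omega^{-1}$ explicitly, and the two-sided p-value for $T_2^*$ relies on its asymptotic standard normal distribution.

```python
import math
from statistics import NormalDist

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def t1_star(U, Sigma, p, frob_sq):
    """(U - mu)' Omega^{-1} (U - mu) / (2 ||V||_F^2), with Omega = Sigma o Sigma."""
    K = len(U)
    Omega = [[Sigma[k][l] ** 2 for l in range(K)] for k in range(K)]
    dev = [U[k] - p * Sigma[k][k] for k in range(K)]
    return sum(a * b for a, b in zip(dev, solve(Omega, dev))) / (2.0 * frob_sq)

def t2_star(U, Sigma, p, frob_sq):
    """sum_k (U_k - p sigma_kk) / (sqrt(2) ||V||_F ||Sigma||_F)."""
    dev = sum(U[k] - p * Sigma[k][k] for k in range(len(U)))
    sigma_frob = math.sqrt(sum(v * v for row in Sigma for v in row))
    return dev / (math.sqrt(2.0 * frob_sq) * sigma_frob)

# Toy inputs (made up): K = 2 uncorrelated responses, p = 100, V = identity.
Sigma = [[1.0, 0.0], [0.0, 1.0]]
t1 = t1_star([120.0, 120.0], Sigma, p=100, frob_sq=100.0)
t2 = t2_star([120.0, 120.0], Sigma, p=100, frob_sq=100.0)
p_value_t2 = 2.0 * (1.0 - NormalDist().cdf(abs(t2)))
```

In practice $T_1^*$ would be compared with the $(1 - \alpha)$ quantile of the chi-squared distribution with $K$ degrees of freedom, which is not available in the Python standard library and is therefore left out of this sketch.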

2.3. Testing procedures when Σ is unknown

The proposed statistics $T_1^*$ and $T_2^*$ involve the quantities $\mu$, $\Sigma$ and $\Omega$, which are based on the covariance components $\sigma_{kl}$ ($k, l = 1, \ldots, K$). However, in real applications, $\sigma_{kl}$ is mostly unknown. Naturally, one may attempt to use an estimator of $\{\sigma_{kl}\}_{k,l=1,\ldots,K}$ to obtain the plug-in estimators $\hat{\mu}$, $\hat{\Sigma}$ and $\hat{\Omega}$, and then carry out the proposed tests. Our theoretical and numerical results (in Section 1 of Supplementary material) show that such a replacement works well when $p = o(n)$. However, when $p$ grows faster, we show that such a replacement can lead to a breakdown of the proposed tests.

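We focus on $T_1^*$ as an example. Let $\hat{\sigma}_{kl}$ be the estimator of $\sigma_{kl}$ under the null hypothesis, that is, for $k, l = 1, \ldots, K$,

$$\hat{\sigma}_{kl} = \frac{Y_k^T H Y_l}{n - d}.$$

For ease of presentation, let us consider $\Omega$ to be fixed. Replacing $\mu$ by $\hat{\mu}$, we have that

$$\begin{aligned}
&\left| \frac{(U - \hat{\mu})^T \Omega^{-1} (U - \hat{\mu})}{2\|V\|_F^2} - \frac{(U - \mu)^T \Omega^{-1} (U - \mu)}{2\|V\|_F^2} \right| \\
&\quad \le \left| (\hat{\mu} - \mu)^T (\|V\|_F^2\, \Omega)^{-1} (U - \mu) \right| + \left| (\hat{\mu} - \mu)^T (2\|V\|_F^2\, \Omega)^{-1} (\hat{\mu} - \mu) \right| \\
&\quad \le \|\Omega^{-1}\| \, \frac{\|\hat{\mu} - \mu\|}{\|V\|_F} \, \frac{\|U - \mu\|}{\|V\|_F} + \|\Omega^{-1}\| \, \frac{\|\hat{\mu} - \mu\|^2}{2\|V\|_F^2},
\end{aligned} \tag{2}$$

where $\|V\|_F^2 \ge \sum_{j=1}^{p} V_{jj} = p$ (each $V_{jj} = 1$), and $\|U - \mu\|^2/(2\|V\|_F^2) = O_p(1)$ asymptotically follows a mixed chi-square distribution with weights being the eigenvalues of $\Omega$. Since $\|\hat{\mu} - \mu\| = O_p(p\,n^{-1/2})$, it follows that (2) is bounded by $O_p(\sqrt{p/n}) + O_p(p/n)$, which goes to 0 if $p = o(n)$. This indicates that the replacement of $\sigma_{kl}$ by $\hat{\sigma}_{kl}$ may work poorly when $p$ is large. Our simulation results (in Section 4 of Supplementary material) are in line with this analysis.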

When $p$ is not of order $o(n)$, the deviation between $(U - \hat{\mu})$ and $(U - \mu)$ may not be negligible, and consequently the asymptotic normality of $(U - \hat{\mu})$ cannot be derived from Theorem 1. This motivated us to directly investigate the asymptotic behavior of $(U - \hat{\mu})$ rather than that of $(U - \mu)$. The following lemma characterizes the asymptotic normality of $(U - \hat{\mu})$.

Lemma 2.

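Assume that the conditions in Theorem 1 hold. If $p = o(n^2)$ and $p^{-1}\|V\|_F^2 \ge p/(n - d) + \eta$ for some constant $\eta > 0$, then under $H_0$,

$$\frac{U - \hat{\mu}}{\sqrt{2\|V\|_F^2 - 2p^2/(n - d)}} \xrightarrow{d} N(0, \Omega).$$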

Remark 3.

The condition $p^{-1}\|V\|_F^2 \ge p/(n - d) + \eta$ bounds the denominator away from zero. This is a mild condition since we can show that $p^{-1}\|V\|_F^2 \ge p/(n - d)$. When $p/(n - d) \le 1 - \eta$ for some $\eta > 0$, this condition is satisfied because $p^{-1}\|V\|_F^2 \ge 1 \ge p/(n - d) + \eta$. When $p \ge n - d$, the lower bound is achieved when all the nonzero eigenvalues of $V$ are the same, which means that all the variants are virtually uncorrelated. Since variants in real genetic data are usually correlated, $p^{-1}\|V\|_F^2$ is generally larger than the lower bound. Hence, this condition is fairly mild and is easily satisfied in practical situations.
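The lower bound mentioned in Remark 3 can be obtained from the Cauchy-Schwarz inequality. The following is a sketch of one possible argument, not necessarily the one used by the authors. It assumes that $X$ has full column rank, so that $\mathrm{rank}(H) = n - d$, and it introduces $\lambda_1, \ldots, \lambda_r$ for the nonzero eigenvalues of $V$, a notation that does not appear in the main text.

```latex
% Assumptions: rank(H) = n - d, hence r = rank(V) <= n - d because
% V = D^{-1/2} G^T H G D^{-1/2}; also tr(V) = p since every V_{jj} = 1.
\begin{align*}
p = \operatorname{tr}(V) = \sum_{i=1}^{r} \lambda_i ,
\qquad
\|V\|_F^2 = \sum_{i=1}^{r} \lambda_i^2
  \;\ge\; \frac{1}{r}\Big(\sum_{i=1}^{r} \lambda_i\Big)^{2}
  \;=\; \frac{p^2}{r}
  \;\ge\; \frac{p^2}{n-d} .
\end{align*}
% Dividing by p gives p^{-1} \|V\|_F^2 >= p/(n-d). Equality requires
% \lambda_1 = \dots = \lambda_r and r = n - d.
```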

Lemma 2 suggests that we can construct statistics directly based on $(U - \hat{\mu})$ rather than on $(U - \mu)$. The following theorem shows the form of such statistics as well as their asymptotic distributions.

Theorem 3.

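Assume that the conditions in Lemma 2 hold. Then under $H_0$,

$$T_1 = \frac{(U - \hat{\mu})^T \hat{\Omega}^{-1} (U - \hat{\mu})}{2\|V\|_F^2 - 2p^2/(n - d)} \xrightarrow{d} \chi_K^2$$

and

$$T_2 = \frac{\sum_{k=1}^{K} U_k - p\sum_{k=1}^{K} \hat{\sigma}_{kk}}{\|\hat{\Sigma}\|_F \sqrt{2\|V\|_F^2 - 2p^2/(n - d)}} \xrightarrow{d} N(0, 1).$$

Besides $T_1$ and $T_2$, the joint distribution in Lemma 2 provides opportunities to form other test statistics. For example, if one can assign weights to the $K$ responses based on prior information, one may use the statistic

$$\frac{\sum_{k=1}^{K} w_k U_k - p\sum_{k=1}^{K} w_k \hat{\sigma}_{kk}}{\sqrt{w^T \hat{\Omega} w}\, \sqrt{2\|V\|_F^2 - 2p^2/(n - d)}} \xrightarrow{d} N(0, 1),$$

where $w = (w_1, \ldots, w_K)^T$ is a vector of weights for the K responses.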
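The weighted statistic, and $T_2$ as its special case $w = J$, can be sketched as follows. The inputs $U$, $\hat{\Sigma}$ and $\|V\|_F^2$ are assumed to have been computed beforehand, and the function names and numbers are hypothetical. The guard on the variance term reflects the condition in Lemma 2 that keeps the denominator away from zero.

```python
import math

def null_scale(frob_sq, n, d, p):
    """sqrt(2 ||V||_F^2 - 2 p^2 / (n - d)), the scaling used in Theorem 3."""
    var_term = 2.0 * frob_sq - 2.0 * p * p / (n - d)
    if var_term <= 0:
        raise ValueError("need ||V||_F^2 > p^2 / (n - d)")
    return math.sqrt(var_term)

def weighted_stat(U, Sigma_hat, w, frob_sq, n, d, p):
    """Weighted linear statistic; asymptotically N(0, 1) under H0."""
    K = len(U)
    num = sum(w[k] * (U[k] - p * Sigma_hat[k][k]) for k in range(K))
    # w' Omega_hat w, where Omega_hat is the entrywise square of Sigma_hat
    quad = sum(w[k] * w[l] * Sigma_hat[k][l] ** 2
               for k in range(K) for l in range(K))
    return num / (math.sqrt(quad) * null_scale(frob_sq, n, d, p))

def t2(U, Sigma_hat, frob_sq, n, d, p):
    """T2 is the weighted statistic with unit weights (w' Omega w = ||Sigma||_F^2)."""
    return weighted_stat(U, Sigma_hat, [1.0] * len(U), frob_sq, n, d, p)

# Toy inputs (made up): K = 2, n = 200, d = 2, p = 300, ||V||_F^2 = 900.
Sigma_hat = [[1.0, 0.5], [0.5, 1.0]]
stat = t2([330.0, 310.0], Sigma_hat, frob_sq=900.0, n=200, d=2, p=300)
```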

2.4. Analysis of Power

Next, we investigate the asymptotic power of the proposed statistics. Under the alternative hypothesis $H_a$, let $S_k = \{j : \beta_{kj} \neq 0\}$ be the index set for the nonzero coefficients of the $k$th response ($k = 1, \ldots, K$). Define the sub-vector $\beta_{S_k} = \{\beta_{kj} : j \in S_k\}$ and the sub-matrix $G_{S_k} = \{G_j : j \in S_k\}$. Let $D_{S_k}$ be the sub-matrix of $D$, with diagonal elements being $G_j^T H G_j$, $j \in S_k$. Similarly, define $\beta_{S_k^c}$, $G_{S_k^c}$ and $D_{S_k^c}$ for $S_k^c = \{j : \beta_{kj} = 0\}$. We further define $\mu_{\beta,k} = \beta_k^T G^T M G \beta_k/(\sqrt{2}\,\|V\|_F)$ for $k = 1, \ldots, K$, where $M = HGD^{-1}G^TH$. Here, $\mu_{\beta,k}$ is a measure of the strength of signals for the $k$th response. Let $\mu_\beta = (\mu_{\beta,1}, \ldots, \mu_{\beta,K})^T$.

We assume the following conditions for analyzing the power of the proposed tests.

Assumption 1.

There is a constant $C_1 > 0$ such that

$$C_1^{-1} \le \max_{1 \le j \le p}\left(\frac{1}{n} G_j^T H G_j\right) \le C_1.$$

Assumption 2.

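There is a constant $C_2 > 0$ such that for any $k = 1, \ldots, K$,

$$C_2^{-1} \le \lambda_{\min}\left(\frac{1}{n} G_{S_k}^T H G_{S_k}\right) \le \lambda_{\max}\left(\frac{1}{n} G_{S_k}^T H G_{S_k}\right) \le C_2.$$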

Assumption 1 is a mild condition which states that the variations of the genetic variants should be on the same scale. Assumption 2 imposes restrictions on the eigenvalues of the matrix $G_{S_k}^T H G_{S_k}/n$. Define the power of $T_1^*$ as $\rho(T_1^*) = P(T_1^* > \chi^2_{K,1-\alpha})$, where $\chi^2_{K,1-\alpha}$ denotes the $(1 - \alpha)$ quantile of the chi-squared distribution with $K$ degrees of freedom. Similarly, define the power of $T_2^*$ as $\rho(T_2^*) = P(|T_2^*| > \zeta_{1-\alpha/2})$, where $\zeta_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal distribution. The following theorem characterizes the power of $T_1^*$ and $T_2^*$ when the size of the overall signal is of order $o[(n\log p)^{-1/2}\|V\|_F/\|V\|]$.

Theorem 4.

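Suppose that the conditions in Theorem 1 and Assumptions 1 and 2 hold. Under $H_a$, if $\sum_{k=1}^{K}\|\beta_k\| = o[(n\log p)^{-1/2}\|V\|_F/\|V\|]$, then as $n, p \to \infty$,

$$\rho(T_1^*) \to P\left[\chi_K^2(\mu_\beta^T \Omega^{-1} \mu_\beta) > \chi^2_{K,1-\alpha}\right]$$

and

$$\rho(T_2^*) \to \Phi\left(-\zeta_{1-\alpha/2} + \frac{\sum_{k=1}^{K}\mu_{\beta,k}}{\|\Sigma\|_F}\right) + \Phi\left(-\zeta_{1-\alpha/2} - \frac{\sum_{k=1}^{K}\mu_{\beta,k}}{\|\Sigma\|_F}\right),$$

where $\chi_K^2(\mu_\beta^T \Omega^{-1} \mu_\beta)$ is a non-central $\chi_K^2$ random variable with noncentrality parameter $\mu_\beta^T \Omega^{-1} \mu_\beta$, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal variable.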
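The limiting power of $T_2^*$ in Theorem 4 is straightforward to evaluate numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name and the example values of $\mu_\beta$ and $\Sigma$ are hypothetical. The corresponding formula for $T_1^*$ needs a non-central chi-squared distribution, which the standard library does not provide, so it is omitted here.

```python
import math
from statistics import NormalDist

def limiting_power_t2_star(mu_beta, Sigma, alpha=0.05):
    """Phi(-zeta + s) + Phi(-zeta - s) with s = sum_k mu_beta[k] / ||Sigma||_F."""
    std_normal = NormalDist()
    zeta = std_normal.inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2) normal quantile
    sigma_frob = math.sqrt(sum(v * v for row in Sigma for v in row))
    shift = sum(mu_beta) / sigma_frob
    return std_normal.cdf(-zeta + shift) + std_normal.cdf(-zeta - shift)

# Example covariance (an assumption): unit variances, correlation 0.5, K = 3.
Sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
power_null = limiting_power_t2_star([0.0, 0.0, 0.0], Sigma)  # no signal
power_weak = limiting_power_t2_star([1.0, 1.0, 1.0], Sigma)
power_strong = limiting_power_t2_star([3.0, 3.0, 3.0], Sigma)
```

With no signal the expression reduces to the nominal level $\alpha$, and it increases with the total signal strength $\sum_k \mu_{\beta,k}$.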

Remark 4.

One sufficient condition for $\sum_{k=1}^{K}\|\beta_k\| = o[(n\log p)^{-1/2}\|V\|_F/\|V\|]$ is that $\sum_{k=1}^{K}\|\beta_k\| = o[(n\log p)^{-1/2}]$, because $\|V\|_F \ge \|V\|$; this means that the overall effect size should be sufficiently small. This condition is similar to the local alternative condition in Zhong and Chen [6].

Theorem 4 provides an explicit formula to calculate the power when the signal size is moderate. It shows that the power tends to increase as $\|\mu_\beta\|$ becomes larger. When the overall signal size $\sum_{k=1}^{K}\|\beta_k\|$ is not of order $o[(n\log p)^{-1/2}\|V\|_F/\|V\|]$, it is difficult to obtain explicit formulas for the power functions. However, we show in the following theorem that, as long as the overall signal size is sufficiently large, the power of $T_1^*$ and $T_2^*$ will approach one.

Theorem 5.

Suppose that the conditions in Theorem 1 and Assumptions 1 and 2 hold. Under $H_a$, if $\sum_{k=1}^{K}\|\beta_k\| \ge C_0\sqrt{p\log p/n}$ for some constant $C_0 > 0$, then as $n, p \to \infty$, $\rho(T_1^*) \to 1$ and $\rho(T_2^*) \to 1$.

3. Simulation Studies

We conducted simulation studies to evaluate the performance of the proposed tests $T_1$ and $T_2$, and compared them with (1) the Bonferroni test (Bonf.), (2) principal component analysis (PCA), (3) the multivariate score test (mScore), and (4) the multivariate kernel machine test (MVKM) with a linear kernel. For the Bonferroni test, we conducted the univariate score test for each genomic feature with respect to each response, and then applied the Bonferroni correction to the $p \times K$ tests. For the PCA method, we used the five leading PCs of the $G^TG$ matrix to conduct a likelihood ratio test for each response, and then applied the Bonferroni correction to the $K$ tests. The mScore test extends the multivariate score test from a single genetic variant [18] to multiple variants, and was designed for low dimensions.

We considered $K = 3$ responses. Genomic features were generated from a multivariate normal distribution $N(0, \rho)$, where $\rho$ is a block-diagonal covariance matrix with each block being $\rho_0$. Two correlation structures were considered for $\rho_0$: (1) auto-regressive (AR) with $(i, j)$th off-diagonal element $0.6^{|i - j|}$, and (2) compound symmetry (CS) with diagonal elements 1 and off-diagonal elements 0.5. These structures have been considered in other publications [10, 19] to model the correlations among genetic variants. The responses were generated as $Y_{ki} = 1 + x_i + \sum_{j=1}^{p} G_{ij}\beta_{kj} + \varepsilon_{ki}$, where $x_i \sim N(0.1\,G_{i1}, 1)$, and $(\varepsilon_{1i}, \varepsilon_{2i}, \varepsilon_{3i})^T$ follows a multivariate normal distribution with correlation of 0.5.

For the sample size $n$ and the dimension $p$, we considered n = 200 with p = 100,200,300, and n = 400 with p = 300,400,500. For the alternative hypothesis, we set the proportion of nonzero $\beta_{kj}$'s to 5%, 10%, 15%, or 20%. Considering that a genomic feature may not affect all responses, we set two signals, i.e., nonzero $\beta_{kj}$'s, to be overlapping for the $K$ responses and the remaining signals to be non-overlapping. Regarding the signal sizes, two scenarios were considered. In scenario (1), we set the absolute values of the signals for the three responses to be (0.10, 0.06, 0.02) for n = 200, and (0.05, 0.03, 0.01) for n = 400. That is, the magnitudes of the association signals differ across responses. In scenario (2), to examine the performance of $T_2$, we set the absolute values of the signals to be (0.08, 0.08, 0.08) for n = 200, and (0.04, 0.04, 0.04) for n = 400. That is, the magnitudes of the signals are constant for the three responses. For both scenarios, the signs of the signals can be either positive or negative.
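The data-generating mechanism for the AR setting can be sketched as below. Several details are assumptions rather than facts stated in the text: the block size of the block-diagonal matrix (set to 10 here), unit variances for the errors, and the use of a shared-factor construction to obtain equicorrelated errors. Function and parameter names are hypothetical.

```python
import math
import random

def ar1_block(rng, size, rho):
    """One block of features with unit variance and correlation rho^|i - j|."""
    g = [rng.gauss(0.0, 1.0)]
    for _ in range(size - 1):
        # Stationary AR(1) recursion keeps the marginal variance equal to 1.
        g.append(rho * g[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return g

def simulate(n, p, beta, block_size=10, rho=0.6, err_corr=0.5, seed=0):
    """Return (G, x, Y); beta[k][j] is the effect of feature j on response k."""
    rng = random.Random(seed)
    K = len(beta)
    G, x, Y = [], [], [[] for _ in range(K)]
    for _ in range(n):
        row = []
        while len(row) < p:  # independent blocks give a block-diagonal covariance
            row.extend(ar1_block(rng, min(block_size, p - len(row)), rho))
        xi = rng.gauss(0.1 * row[0], 1.0)  # x_i ~ N(0.1 * G_i1, 1)
        shared = rng.gauss(0.0, 1.0)       # common factor inducing error correlation
        for k in range(K):
            eps = (math.sqrt(err_corr) * shared
                   + math.sqrt(1.0 - err_corr) * rng.gauss(0.0, 1.0))
            signal = sum(g * b for g, b in zip(row, beta[k]))
            Y[k].append(1.0 + xi + signal + eps)
        G.append(row)
        x.append(xi)
    return G, x, Y

# Null setting: all coefficients are zero, n = 200, p = 100, K = 3.
beta_null = [[0.0] * 100 for _ in range(3)]
G, x, Y = simulate(n=200, p=100, beta=beta_null, seed=1)
```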

We evaluated the Type I errors of the tests over 10,000 replications, and examined the power over 1,000 replications. Table 1 shows that the Type I errors of the proposed tests $T_1$ and $T_2$ are around 0.05. All the other compared methods also have their Type I errors controlled, though the MVKM tends to be conservative under high dimensions. The lower Type I error of MVKM is likely due to the impact of the increased dimensions. For MVKM, its asymptotic distribution involves the error term $\sigma_{kl}$, whose estimation error is negligible when $p$ is small but can become highly influential when $p$ is large, a phenomenon that has been observed for similar approaches [10]. Our numerical experiments also show that when $p$ is moderate compared with $n$, the Type I error of MVKM is close to the nominal level (see Section 4 of Supplementary material).

For the power performance, Figure 1 shows the power against the proportion of nonzero $\beta_{kj}$'s for scenario (1) under the AR correlation structure. It can be seen that, as the proportion of nonzero $\beta_{kj}$'s becomes larger, the power of all the methods increases. As to relative performance, both $T_1$ and $T_2$ generally have higher power than the other compared methods, especially when the proportion of the signals is large. The Bonferroni test suffers from severe power loss because the signals were generated to be weak in the considered models. The PCA test also has low power, likely because the leading five PCs failed to represent the genetic features that carry association signals. The mScore is a chi-square test with degrees of freedom equal to $p \times K$, which causes a loss of power when $p$ is large. With regard to the comparison between $T_1$ and $T_2$, $T_1$ outperforms $T_2$ because the magnitude of the signals varies across the 3 responses. Next, we examined the power under scenario (2), where the magnitudes of the signals are equal for the 3 responses. Figure 2 shows that $T_1$ and $T_2$ still have good performance, but $T_2$ tends to have higher power than $T_1$. These results indicate that $T_1$ is more powerful when the studied responses have unbalanced signal sizes, while $T_2$ is more favorable when the signal sizes are similar across multiple responses. The power under the CS correlation structure shows a similar pattern and is provided in Section 4 of Supplementary material.

Table 1:

Type I error of the compared tests across different sample sizes, dimensions and correlation structures of the genomic features at α = 0.05

n    p    Corr.  Bonf.  PCA    mScore  MVKM   T1     T2
200  100  AR     0.044  0.051  0.012   0.034  0.054  0.045
200  100  CS     0.047  0.049  0.013   0.033  0.051  0.041
200  200  AR     0.047  0.055  0.002   0.024  0.050  0.048
200  200  CS     0.046  0.054  0.004   0.026  0.050  0.046
200  300  AR     0.047  0.053  0.001   0.016  0.052  0.051
200  300  CS     0.048  0.055  0.002   0.018  0.053  0.052
400  200  AR     0.048  0.050  0.010   0.032  0.048  0.049
400  200  CS     0.046  0.051  0.013   0.034  0.049  0.048
400  400  AR     0.049  0.051  0.002   0.022  0.051  0.049
400  400  CS     0.048  0.051  0.004   0.025  0.050  0.047
400  600  AR     0.042  0.051  0.000   0.012  0.048  0.048
400  600  CS     0.045  0.051  0.000   0.015  0.047  0.048

Figure 1:


Power of the compared tests at α = 0.05 under different sample sizes and dimensions with respect to the proportion of nonzero signals for scenario (1). The AR structure is considered.

Figure 2:


Power of the compared tests at α = 0.05 under different sample sizes and dimensions with respect to the proportion of nonzero signals for scenario (2). The AR structure is considered.

4. Real Data Analysis

We evaluated the performance of the proposed tests using the colorectal cancer data from The Cancer Genome Atlas (TCGA). This dataset contains multiple types of genomic information, such as gene expressions, DNA methylations, and somatic mutations. We aimed to identify DNA methylations and somatic mutations that influence the expressions of important cancer genes.

We obtained 160 samples that have data on gene expressions, DNA methylations and somatic mutations. We considered the expressions of three genes (KRAS, BRAF and TP53) as the responses, because these genes are important for cancer development and have been found to co-alter their expressions to jointly influence the survivorship of colon cancer patients [20]. We mapped the genomic features, i.e., methylations and somatic mutations, into the 50 Hallmark Pathways based on the Broad Institute’s database. Among the 50 pathways, 38 pathways contain more than 160 genomic features; the median and maximal numbers of genomic features in a pathway are 316 and 1275 respectively, both of which are larger than the sample size. We included the following covariates in our analysis: age, gender, plates for batch effects, hyper-mutation status, and principal components accounting for tumor purity and cell type composition [21]. We excluded the P53 pathway as the TP53 gene’s expression is one of the 3 responses being studied.

We applied the proposed $T_1$ and $T_2$ tests along with the other compared methods to each of the 49 pathways. The threshold for statistical significance was set to 0.05/49 ≈ 0.001. Table 2 lists the pathways that were detected by any of the considered methods. It can be seen that the $T_1$ test detected three pathways: the TNFA signaling via NFKB, Unfolded protein response, and TGF beta signaling pathways. The $T_2$ test identified the first pathway and two additional pathways, the Xenobiotic metabolism and Hypoxia pathways. The MVKM detected one pathway, the TNFA signaling via NFKB pathway. The three other compared methods did not yield any significant results. The identified pathways have functional meanings which are supported by existing biological studies. For example, the nuclear factor-kappa B (NFKB) signaling pathway is a regulator of immune response and inflammation, and is associated with both BRAF [22] and TP53 [23].

Table 2:

P-values of the compared tests. P-values lower than .001 (shown in bold in the original) are marked with an asterisk.

Pathway name               p    Bonf.   PCA     mScore  MVKM     T1       T2
TNFA signaling via NFKB    397  .06593  .03044  .01565  .00019*  .00003*  .00007*
Unfolded protein response  104  .99912  .05081  .01777  .00186   .00040*  .00767
TGF beta signaling         184  .15286  .27376  .03770  .00171   .00074*  .00497
Xenobiotic metabolism      384  .33097  .01024  .03027  .00186   .00153   .00056*
Hypoxia                    727  .10970  .03363  .02102  .00160   .00523   .00088*
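The significance calls described above follow directly from applying the Bonferroni threshold 0.05/49 to the p-values in Table 2. The short sketch below reproduces them; the dictionary layout and variable names are illustrative, while the p-values are copied from the table.

```python
threshold = 0.05 / 49  # Bonferroni threshold over the 49 tested pathways

# P-values from Table 2, ordered as (MVKM, T1, T2).
table2 = {
    "TNFA signaling via NFKB":   (0.00019, 0.00003, 0.00007),
    "Unfolded protein response": (0.00186, 0.00040, 0.00767),
    "TGF beta signaling":        (0.00171, 0.00074, 0.00497),
    "Xenobiotic metabolism":     (0.00186, 0.00153, 0.00056),
    "Hypoxia":                   (0.00160, 0.00523, 0.00088),
}

methods = ("MVKM", "T1", "T2")
detected = {
    m: [name for name, pvals in table2.items() if pvals[i] < threshold]
    for i, m in enumerate(methods)
}
```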

To better understand the genetic signals behind the association results, we examined the univariate p-values in the Unfolded protein response pathway, which was detected only by the $T_1$ test. Figure 3 provides these p-values for each of the three responses. It can be seen that each response involves a number of moderate signals; however, none of these p-values reaches the global significance level of $10^{-5}$ for the univariate test. This emphasizes the importance of examining a large number of features jointly across multiple responses. The univariate p-value plots for the other identified pathways can be found in Section 5 of Supplementary material.

Figure 3:


Univariate p-values of genetic features for three responses in unfolded protein response pathway.

5. Discussion

Multi-response pathway analysis has important applications in biomedical research, such as studying pleiotropy in quantitative genetics, as well as conducting phenome-wide association studies. In this article, we developed a multi-response pathway analysis approach that is able to conduct statistical inference when $p \to \infty$ and is potentially larger than $n$, i.e., $p = o(n^2)$. Our approach enables the detection of weak signals aggregated at the pathway level, and provides a powerful tool for deciphering the complexity of genetic mechanisms. Asymptotic normality was established for the proposed statistic, and the asymptotic power of the proposed statistic when both $n$ and $p$ go to infinity was also studied. Besides the proposed $T_1$ and $T_2$, the result in Lemma 2 provides a foundation for forming other potential statistics which can be tailored for practical situations. The power of these statistics varies and is dependent upon the alternative hypothesis, as it is known that there is no uniformly most powerful test for multi-response analysis. We also note that in the approach of [6], there is no restriction on the relative growth rate between $p$ and $n$. It will be interesting to see if it is possible to extend their approach to multivariate outcome analysis. Further research is needed to address this important question.

Our proposed approach tests the global null hypothesis that none of the responses is associated with the genetic pathway. If the global null is rejected, then it would be interesting to investigate which of the response variables are influenced by the considered genetic features. One simple strategy is to test each response one by one post hoc, but the multiple testing burden brought by these marginal tests may overshadow the significance of the individual tests. Another potential strategy is to apply regularized regression methods, such as [24], to pinpoint the responses that are associated with the genetic variants. These approaches, however, often do not provide p-values for the selected results, as statistical inference for regularized regression under high dimensions remains an active research area.

Besides high dimensionality, there exist many other challenges in pathway analysis. For example, when there are potential causal relationships among the studied features [25], accommodating such relationships requires new methodology development. Structural equation modeling and mediation analysis approaches have been used to infer the potential causal relationship between features and outcome, but these methods are primarily focused on a limited number of outcomes and features, and extending them to high dimensions appears to be a nontrivial task. Future research is merited.

Supplementary Material


Acknowledgments

This work is supported by National Institutes of Health R01CA223498 and R01CA189532. We thank the Editor, Associate Editor, and reviewers for their very helpful comments.

Appendix

Proofs of the theorems

Proof of Theorem 1.

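Consider a linear combination of $(U - \mu)/(\sqrt{2}\,\|V\|_F)$, denoted as $T_c = c^T(U - \mu)/(\sqrt{2}\,\|V\|_F)$ for a vector of constants $c = (c_1, \ldots, c_K)^T$. To prove Theorem 1, it suffices to show that under $H_0$, $T_c \xrightarrow{d} N(0, c^T\Omega c)$.

First, we calculate the mean and variance of $T_c$. Under the null, we have

$$T_c = (\sqrt{2}\,\|V\|_F)^{-1} \sum_{k=1}^{K} c_k \Big( \sum_{j=1}^{p} z_{kj}^2 - p\sigma_{kk} \Big) = (\sqrt{2}\,\|V\|_F)^{-1} \sum_{k=1}^{K} c_k \big( \varepsilon_k^T M \varepsilon_k - p\sigma_{kk} \big),$$

where $M = HGD^{-1}G^TH$. Further define

$$U_c = \sum_{k=1}^{K} c_k \varepsilon_k^T M \varepsilon_k = \varepsilon_v^T (D_c \otimes M) \varepsilon_v,$$

where $\varepsilon_v = (\varepsilon_1^T, \ldots, \varepsilon_K^T)^T \sim N(0, \Sigma \otimes I_n)$, and $D_c$ is a diagonal matrix with diagonal elements $c_1, \ldots, c_K$.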

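Recall that $V = D^{-1/2}G^THGD^{-1/2}$ is the correlation matrix of the $z_{kj}$'s with diagonal elements $V_{jj} = 1$, $j = 1, \ldots, p$. Then

$$E(U_c) = \operatorname{tr}[(D_c \otimes M)(\Sigma \otimes I_n)] = \operatorname{tr}(D_c\Sigma)\operatorname{tr}(V) = p\sum_{k=1}^{K} c_k\sigma_{kk},$$

and

$$\operatorname{var}(U_c) = 2\operatorname{tr}[(D_c\Sigma)^2 \otimes M^2] = 2\operatorname{tr}[(D_c\Sigma)^2]\operatorname{tr}[M^2] = 2c^T\Omega c\,\|V\|_F^2,$$

where the last equality holds since

$$\operatorname{tr}[M^2] = \operatorname{tr}(HGD^{-1}G^THHGD^{-1}G^TH) = \operatorname{tr}(D^{-1/2}G^THGD^{-1}G^THGD^{-1/2}) = \|V\|_F^2.$$

It follows that $E(T_c) = 0$ and $\operatorname{var}(T_c) = c^T\Omega c$.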

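It remains to prove the asymptotic normality of $U_c = \sum_{k=1}^K c_k\,\varepsilon_k^T M\varepsilon_k$. Let $e = (\Sigma^{-1/2}\otimes I_n)\varepsilon_v$; then we have $e \sim N(0, I_{nK})$. Subsequently,

$$U_c = e^T(\Sigma^{1/2}\otimes I_n)(D_c\otimes M)(\Sigma^{1/2}\otimes I_n)e = e^T(\Sigma^{1/2}D_c\Sigma^{1/2}\otimes M)e.$$

Let $B = \Sigma^{1/2}D_c\Sigma^{1/2}\otimes M$. By the eigendecomposition of $B$, it suffices to show that $\|B\|/\|B\|_F \to 0$ when $n, p \to \infty$. Note that

$$\|B\| = \|\Sigma^{1/2}D_c\Sigma^{1/2}\|\,\|M\| = \|D_c\Sigma\|\,\|V\| \le \|D_c\|\,\|\Sigma\|\,\|V\| \le \max_k|c_k|\sum_{k=1}^K\sigma_{kk}\,\|V\|$$

and

$$\|B\|_F = \|\Sigma^{1/2}D_c\Sigma^{1/2}\|_F\,\|M\|_F = [\mathrm{tr}(\Sigma^{1/2}D_c\Sigma^{1/2}\Sigma^{1/2}D_c\Sigma^{1/2})]^{1/2}\,\|V\|_F = \|D_c\Sigma\|_F\,\|V\|_F \ge \|D_c\Sigma\|_F\Big(\sum_{j=1}^p V_{jj}^2\Big)^{1/2} \ge \max_k|c_k|\,\sigma_{k'k'}\sqrt{p},$$

where $k' = \arg\max_k|c_k|$. It follows that $\|B\|/\|B\|_F \le \big(\sum_{k=1}^K\sigma_{kk}\|V\|\big)/(\sigma_{k'k'}\sqrt{p}) = o(1)$, which completes the proof. □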
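The reason the ratio of the spectral norm to the Frobenius norm of B controls asymptotic normality can be spelled out with a standard Lyapunov argument. The sketch below is our reading of the step "by the eigendecomposition of B" and is not quoted from the authors; it uses only that B is symmetric and that the fourth central moment of a chi-squared variable with one degree of freedom is 60.

```latex
% Eigendecomposition B = P \Lambda P^T with P orthogonal, so \xi = P^T e \sim N(0, I_{nK}).
U_c = e^T B e = \sum_{i=1}^{nK} \lambda_i \xi_i^2, \qquad
E(U_c) = \sum_{i} \lambda_i = \mathrm{tr}(B), \qquad
\mathrm{var}(U_c) = 2\sum_{i} \lambda_i^2 = 2\|B\|_F^2 .

% Lyapunov condition with fourth moments, using E(\xi_i^2 - 1)^4 = 60.
\frac{\sum_{i} \lambda_i^4 \, E(\xi_i^2 - 1)^4}{\big(2\sum_{i} \lambda_i^2\big)^2}
  = \frac{15 \sum_{i} \lambda_i^4}{\big(\sum_{i} \lambda_i^2\big)^2}
  \le \frac{15 \max_i \lambda_i^2}{\sum_{i} \lambda_i^2}
  = 15\,\frac{\|B\|^2}{\|B\|_F^2} \longrightarrow 0 .
```

When this ratio vanishes, no single eigenvalue dominates the weighted sum of independent centered chi-squared terms, so the standardized quadratic form is asymptotically standard normal.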

Proof of Lemma 2.

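Let $T_{ch} = c^T(U-\hat\mu)/\sqrt{2\|V\|_F^2 - 2p^2/(n-d)}$ for a vector of constants $c = (c_1, \ldots, c_K)^T$. It suffices to show that under $H_0$, $T_{ch} \xrightarrow{d} N(0, c^T\Omega c)$. Under $H_0$, we can write

$$U_{ch} = c^T(U-\hat\mu) = \sum_{k=1}^K c_k\big(\varepsilon_k^T M\varepsilon_k - (n-d)^{-1}p\,\varepsilon_k^T H\varepsilon_k\big) = \varepsilon_v^T\big[D_c\otimes(M - (n-d)^{-1}pH)\big]\varepsilon_v = e^T\big[(\Sigma^{1/2}D_c\Sigma^{1/2})\otimes(M - (n-d)^{-1}pH)\big]e,$$

where $M$, $\varepsilon_v$, $D_c$, and $e$ are defined in the proof of Theorem 1.

Let $M_1 = M - (n-d)^{-1}pH$ and $B_1 = (\Sigma^{1/2}D_c\Sigma^{1/2})\otimes M_1$. Since $e \sim N(0, I_{nK})$, we have

$$E(U_{ch}) = \mathrm{tr}(\Sigma^{1/2}D_c\Sigma^{1/2})\,\mathrm{tr}(M_1) = \mathrm{tr}(\Sigma^{1/2}D_c\Sigma^{1/2})\big[\mathrm{tr}(V) - (n-d)^{-1}p\,\mathrm{tr}(H)\big] = \mathrm{tr}(\Sigma^{1/2}D_c\Sigma^{1/2})\big[p - (n-d)^{-1}p(n-d)\big] = 0$$

and

$$\mathrm{var}(U_{ch}) = 2\,\mathrm{tr}(B_1^2) = 2\,\mathrm{tr}[(D_c\Sigma)^2]\,\mathrm{tr}[M_1^2] = 2c^T\Omega c\;\mathrm{tr}\big[M^2 - 2(n-d)^{-1}pM + (n-d)^{-2}p^2H\big] = 2c^T\Omega c\,\big[\|V\|_F^2 - 2(n-d)^{-1}p\,\mathrm{tr}(V) + (n-d)^{-2}p^2(n-d)\big] = 2c^T\Omega c\,\big[\|V\|_F^2 - p^2/(n-d)\big].$$

It follows that $E(T_{ch}) = 0$ and $\mathrm{var}(T_{ch}) = c^T\Omega c$.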

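For the asymptotic normality of $U_{ch}$, we only need to prove that $\|B_1\|/\|B_1\|_F \to 0$ when $p = o(n^2)$. By Weyl's inequality, we have

$$\lambda_{\max}(M_1) \le \lambda_{\max}(M) - (n-d)^{-1}p\,\lambda_{\min}(H) \le \|V\|$$

and

$$\lambda_{\min}(M_1) \ge \lambda_{\min}(M) - (n-d)^{-1}p\,\lambda_{\max}(H) \ge -p/(n-d).$$

Then, using the conditions $\|V\| = o(p^{1/2})$ and $p = o(n^2)$, we have $\|M_1\| \le \max(\|V\|, p/(n-d)) = o_p(p^{1/2})$. It follows that

$$\|B_1\| = \|\Sigma^{1/2}D_c\Sigma^{1/2}\|\,\|M_1\| \le \max_k|c_k|\sum_{k=1}^K\sigma_{kk}\;o(p^{1/2}).$$

Using the condition $p^{-1}\|V\|_F^2 \ge p/(n-d) + \eta$, we have

$$\|B_1\|_F = \|\Sigma^{1/2}D_c\Sigma^{1/2}\|_F\,\|M_1\|_F = \|D_c\Sigma\|_F\,\big[\|V\|_F^2 - p^2/(n-d)\big]^{1/2} \ge \max_k|c_k|\,\sigma_{k'k'}\,[\eta p]^{1/2},$$

where $k' = \arg\max_k|c_k|$. It follows that $\|B_1\|/\|B_1\|_F = o(1)$ and the lemma is proved. □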

Proof of Theorem 3.

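Define $\tilde T_1 = (U-\hat\mu)^T\Omega^{-1}(U-\hat\mu)/\big(2\|V\|_F^2 - 2p^2/(n-d)\big)$ and $\tilde T_2 = \big(\sum_{k=1}^K Q_k - p\sum_{k=1}^K\hat\sigma_{kk}\big)/\big(\|\Sigma\|_F\sqrt{2\|V\|_F^2 - 2p^2/(n-d)}\big)$; then it follows from Theorem 1 that under $H_0$, $\tilde T_1 \xrightarrow{d} \chi_K^2$ and $\tilde T_2 \xrightarrow{d} N(0, 1)$. To prove Theorem 3, it suffices to show that $T_1$ and $\tilde T_1$ have the same asymptotic distribution under $H_0$, and that $T_2$ and $\tilde T_2$ have the same asymptotic distribution under $H_0$.

Since $|\hat\sigma_{kl} - \sigma_{kl}| = O_p(n^{-1/2})$ for any $k, l = 1, \ldots, K$, it follows that $\|\hat\Sigma - \Sigma\| = O_p(n^{-1/2})$ and $\|\hat\Omega - \Omega\| = O_p(n^{-1/2})$. Using Weyl's inequality, we have

$$\lambda_{\min}(\hat\Omega) \ge \lambda_{\min}(\Omega) + \lambda_{\min}(\hat\Omega - \Omega) \ge \lambda_{\min}(\Omega) - \|\hat\Omega - \Omega\|.$$

Then with probability approaching 1, $\lambda_{\min}(\hat\Omega) \ge 0.5\,\lambda_{\min}(\Omega)$. Subsequently,

$$\|\hat\Omega^{-1} - \Omega^{-1}\| = \|\hat\Omega^{-1}(\Omega - \hat\Omega)\Omega^{-1}\| \le \|\hat\Omega^{-1}\|\,\|\Omega^{-1}\|\,\|\Omega - \hat\Omega\| \le \lambda_{\min}(\hat\Omega)^{-1}\lambda_{\min}(\Omega)^{-1}O_p(n^{-1/2}) \le 2\lambda_{\min}(\Omega)^{-2}O_p(n^{-1/2}) = O_p(n^{-1/2}).$$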
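The first equality in the bound on the difference of the two inverse matrices rests on a one-line matrix identity, and the subsequent steps use the fact that the spectral norm of the inverse of a symmetric positive definite matrix is the reciprocal of its smallest eigenvalue. As a brief worked step, assuming that both Ω and its estimate are symmetric positive definite and restricting to the event on which the smallest eigenvalue of the estimate is at least half that of Ω:

```latex
\hat\Omega^{-1}(\Omega - \hat\Omega)\Omega^{-1}
  = \hat\Omega^{-1}\Omega\,\Omega^{-1} - \hat\Omega^{-1}\hat\Omega\,\Omega^{-1}
  = \hat\Omega^{-1} - \Omega^{-1},

\|\Omega^{-1}\| = \lambda_{\min}(\Omega)^{-1}, \qquad
\|\hat\Omega^{-1}\| = \lambda_{\min}(\hat\Omega)^{-1} \le 2\,\lambda_{\min}(\Omega)^{-1}.
```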

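We now bound $|T_1 - \tilde T_1|$. Under $H_0$, we have

$$|T_1 - \tilde T_1| = \Big|(U-\hat\mu)^T(\hat\Omega^{-1} - \Omega^{-1})(U-\hat\mu)/\big(2\|V\|_F^2 - 2p^2/(n-d)\big)\Big| \le \|\hat\Omega^{-1} - \Omega^{-1}\|\,\frac{\|U-\hat\mu\|^2}{2\|V\|_F^2 - 2p^2/(n-d)} \le O_p(n^{-1/2}),$$

where the last inequality holds since under $H_0$, $\|U-\hat\mu\|^2/\big(2\|V\|_F^2 - 2p^2/(n-d)\big)$ asymptotically follows a mixture of chi-squared distributions with weights being the eigenvalues of $\Omega$, and is therefore $O_p(1)$. Thus $T_1 \xrightarrow{d} \chi_K^2$.

For $T_2$, since $\|\hat\Sigma\|_F = \big(\sum_{k,l=1}^K\hat\sigma_{kl}^2\big)^{1/2}$, it follows that $\|\hat\Sigma\|_F$ converges to $\|\Sigma\|_F$ in probability. By Slutsky's theorem, $T_2$ and $\tilde T_2$ have the same asymptotic distribution under $H_0$. That is, $T_2 \xrightarrow{d} N(0, 1)$. □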

Proof of Theorem 4.

Under Ha, for each k, we can write

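$$U_k - p\sigma_{kk} = Y_k^TMY_k - p\sigma_{kk} = \beta_k^TG^TMG\beta_k + 2\beta_k^TG^TM\varepsilon_k + (\varepsilon_k^TM\varepsilon_k - p\sigma_{kk}).$$

We can bound the second term in the above equation as follows. Using the Hoeffding bound for the Gaussian random variables $\varepsilon_k$, we have

$$P\big\{|\beta_k^TG^TM\varepsilon_k| \ge (\log p)^{1/2}\|MG\beta_k\|\big\} \le 2\exp\Big\{-\frac{\log p\,\|MG\beta_k\|^2}{2\|MG\beta_k\|^2\sigma_{kk}}\Big\} = 2p^{-1/(2\sigma_{kk})}.$$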
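The tail bound can be unpacked in two lines. The worked step below is a sketch that assumes the error vector for response k is N(0, σ_kk I_n) with the design treated as fixed, which is consistent with the joint Gaussian assumption on the stacked error vector used in the proof of Theorem 1.

```latex
Z = \beta_k^T G^T M \varepsilon_k \sim N\big(0,\ \sigma_{kk}\|MG\beta_k\|^2\big),
\qquad
P(|Z| \ge t) \le 2\exp\Big\{-\frac{t^2}{2\sigma_{kk}\|MG\beta_k\|^2}\Big\}
\quad \text{for all } t > 0 .

% Choosing t = (\log p)^{1/2}\|MG\beta_k\| cancels the norm:
P\big(|Z| \ge (\log p)^{1/2}\|MG\beta_k\|\big)
  \le 2\exp\Big\{-\frac{\log p}{2\sigma_{kk}}\Big\}
  = 2p^{-1/(2\sigma_{kk})} \longrightarrow 0 \quad (p \to \infty).
```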

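That is, as $p \to \infty$, with probability approaching 1, we have

$$|\beta_k^TG^TM\varepsilon_k| \le (\log p)^{1/2}\|MG\beta_k\| = (\log p)^{1/2}\|MHG\beta_k\| \le (\log p)^{1/2}\|M\|\,\|HG\beta_k\| \le (\log p)^{1/2}\|V\|\big(\beta_{S_k}^TG_{S_k}^THG_{S_k}\beta_{S_k}\big)^{1/2} \le (\log p)^{1/2}\|V\|\big(\lambda_{\max}(n^{-1}G_{S_k}^THG_{S_k})\,n\|\beta_{S_k}\|^2\big)^{1/2} \le C_2^{1/2}(n\log p)^{1/2}\|V\|\,\|\beta_k\|,$$

where the equality holds since $M = MH$. When $\|\beta_k\| = o[(n\log p)^{-1/2}\|V\|_F/\|V\|]$, we have $2\beta_k^TG^TM\varepsilon_k/(\sqrt{2}\|V\|_F) \to 0$ in probability. Therefore the distribution of $(U_k - p\sigma_{kk})/(\sqrt{2}\|V\|_F)$ is asymptotically equivalent to that of $\mu_{\beta,k} + (\varepsilon_k^TM\varepsilon_k - p\sigma_{kk})/(\sqrt{2}\|V\|_F)$. Similar to the proof of Theorem 1, we can show that as $n, p \to \infty$,

$$\frac{U-\mu}{\sqrt{2}\|V\|_F} \xrightarrow{d} N(\mu_\beta, \Omega).$$

Therefore under $H_a$,

$$T_1^* = \frac{(U-\mu)^T\Omega^{-1}(U-\mu)}{2\|V\|_F^2} \xrightarrow{d} \chi_K^2\big(\mu_\beta^T\Omega^{-1}\mu_\beta\big)$$

and

$$T_2^* = \frac{\sum_{k=1}^K U_k - p\sum_{k=1}^K\sigma_{kk}}{\sqrt{2}\|V\|_F\|\Sigma\|_F} \xrightarrow{d} N\Big(\frac{\sum_{k=1}^K\mu_{\beta,k}}{\|\Sigma\|_F},\ 1\Big).$$

It follows that $\rho(T_1^*) \to P\big[\chi_K^2(\mu_\beta^T\Omega^{-1}\mu_\beta) > \chi_{K,1-\alpha}^2\big]$ and

$$\rho(T_2^*) = P(|T_2^*| > \zeta_{1-\alpha/2}) = P(T_2^* > \zeta_{1-\alpha/2}) + P(T_2^* < -\zeta_{1-\alpha/2}) \to \Phi\Big(-\zeta_{1-\alpha/2} + \frac{\sum_{k=1}^K\mu_{\beta,k}}{\|\Sigma\|_F}\Big) + \Phi\Big(-\zeta_{1-\alpha/2} - \frac{\sum_{k=1}^K\mu_{\beta,k}}{\|\Sigma\|_F}\Big).$$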
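The limiting power of the sum-type statistic is easy to evaluate once the noncentrality components and the error covariance are given. The following is a minimal sketch using only the Python standard library; the function name `power_t2` and the numerical inputs are hypothetical and chosen for illustration, not taken from the paper.

```python
from statistics import NormalDist


def power_t2(mu_beta, sigma, alpha=0.05):
    """Limiting power Phi(-zeta + delta) + Phi(-zeta - delta), where
    delta = sum_k mu_{beta,k} / ||Sigma||_F and zeta is the (1 - alpha/2)
    standard normal quantile.

    mu_beta: list of the K noncentrality components mu_{beta,k}
    sigma:   K x K error covariance matrix as a list of lists
    """
    std_normal = NormalDist()
    zeta = std_normal.inv_cdf(1 - alpha / 2)
    sigma_frob = sum(s * s for row in sigma for s in row) ** 0.5
    delta = sum(mu_beta) / sigma_frob
    return std_normal.cdf(-zeta + delta) + std_normal.cdf(-zeta - delta)


# Hypothetical two-response example with correlated errors.
sigma = [[1.0, 0.5], [0.5, 1.0]]
null_power = power_t2([0.0, 0.0], sigma)   # no signal: power reduces to alpha
alt_power = power_t2([1.5, 2.0], sigma)    # signal present in both responses

print(null_power, alt_power)
```

With all components equal to zero the expression reduces to the nominal level, which gives a quick sanity check on the formula.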

Proof of Theorem 5.

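We first consider the power of $T_1^*$:

$$P\big\{(U-\mu)^T\Omega^{-1}(U-\mu) > 2\|V\|_F^2\,\chi_{K,1-\alpha}^2\big\}.$$

Note that

$$(U-\mu)^T\Omega^{-1}(U-\mu) \ge \lambda_{\min}(\Omega^{-1})\|U-\mu\|^2 = \sum_{k=1}^K\big(Y_k^TMY_k - p\sigma_{kk}\big)^2/\lambda_{\max}(\Omega).$$

We can further write $Y_k^TMY_k - p\sigma_{kk} = \beta_k^TG^TMG\beta_k + 2\beta_k^TG^TM\varepsilon_k + (\varepsilon_k^TM\varepsilon_k - p\sigma_{kk})$. Next we will bound these three terms separately. First,

$$\beta_k^TG^TMG\beta_k = \beta_{S_k}^TG_{S_k}^THGD^{-1}G^THG_{S_k}\beta_{S_k} = \beta_{S_k}^TG_{S_k}^THG_{S_k}D_{S_k}^{-1}G_{S_k}^THG_{S_k}\beta_{S_k} + \beta_{S_k}^TG_{S_k}^THG_{S_k^c}D_{S_k^c}^{-1}G_{S_k^c}^THG_{S_k}\beta_{S_k} \ge \beta_{S_k}^TG_{S_k}^THG_{S_k}D_{S_k}^{-1}G_{S_k}^THG_{S_k}\beta_{S_k} \ge n\big\{\lambda_{\min}(n^{-1/2}D_{S_k}^{-1/2}G_{S_k}^THG_{S_k})\big\}^2\|\beta_{S_k}\|^2 \ge n\big\{\lambda_{\min}[(n^{-1}D_{S_k})^{-1/2}]\,\lambda_{\min}(n^{-1}G_{S_k}^THG_{S_k})\big\}^2\|\beta_{S_k}\|^2 \ge nC_1^{-1}C_2^{-2}\|\beta_k\|^2,$$

where the last inequality holds by Assumptions 1 and 2.

Second, as in the proof of Theorem 4, we have $|\beta_k^TG^TM\varepsilon_k| = O_p[(n\log p)^{1/2}\|V\|\,\|\beta_k\|]$. Third, by Theorem 1, we have $(\varepsilon_k^TM\varepsilon_k - p\sigma_{kk})/(\sqrt{2}\sigma_{kk}\|V\|_F) \xrightarrow{d} N(0, 1)$.

Thus

$$|\varepsilon_k^TM\varepsilon_k - p\sigma_{kk}| = O_p(\|V\|_F) = O_p\big(\sqrt{p\|V\|^2}\big) = o_p(p).$$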

Putting these bounds together, we have

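$$\Big|\sum_{k=1}^K\big(Y_k^TMY_k - p\sigma_{kk}\big)\Big| \ge \sum_{k=1}^K\beta_k^TG^TMG\beta_k - 2\sum_{k=1}^K|\beta_k^TG^TM\varepsilon_k| - \sum_{k=1}^K|\varepsilon_k^TM\varepsilon_k - p\sigma_{kk}| \ge nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2 - 2C_2^{1/2}(n\log p)^{1/2}\|V\|\sum_{k=1}^K\|\beta_k\| - \sum_{k=1}^K|\varepsilon_k^TM\varepsilon_k - p\sigma_{kk}|.$$

We now show that the first quadratic term dominates the remaining two terms. Recalling that $\sum_{k=1}^K\|\beta_k\| \ge C_0\sqrt{p\log p/n}$ for some $C_0 > 0$, we have as $n, p \to \infty$,

$$\frac{2C_2^{1/2}(n\log p)^{1/2}\|V\|\sum_{k=1}^K\|\beta_k\|}{nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2} \le \frac{2C_2^{1/2}(n\log p)^{1/2}\|V\|\sum_{k=1}^K\|\beta_k\|}{nC_1^{-1}C_2^{-2}\big(\sum_{k=1}^K\|\beta_k\|\big)^2/K} \le \frac{2KC_2^{1/2}(n\log p)^{1/2}\|V\|}{nC_1^{-1}C_2^{-2}C_0\sqrt{p\log p/n}} = O_p\Big(\frac{\|V\|}{\sqrt{p}}\Big) = o_p(1)$$

and

$$\frac{\sum_{k=1}^K|\varepsilon_k^TM\varepsilon_k - p\sigma_{kk}|}{nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2} \le \frac{o_p(p)}{nC_1^{-1}C_2^{-2}C_0^2\,p\log p/(Kn)} = o_p(1).$$

Therefore, with probability approaching 1,

$$\Big|\sum_{k=1}^K\big(Y_k^TMY_k - p\sigma_{kk}\big)\Big| > 0.5\,nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2.$$

It follows that as $n, p \to \infty$,

$$P\big\{(U-\mu)^T\Omega^{-1}(U-\mu) > 2\|V\|_F^2\,\chi_{K,1-\alpha}^2\big\} \ge P\Big\{\sum_{k=1}^K\big(Y_k^TMY_k - p\sigma_{kk}\big)^2 \ge 2\lambda_{\max}(\Omega)\|V\|_F^2\,\chi_{K,1-\alpha}^2\Big\} \ge P\Big\{\Big|\sum_{k=1}^K\big(Y_k^TMY_k - p\sigma_{kk}\big)\Big|^2/K \ge 2\lambda_{\max}(\Omega)\|V\|_F^2\,\chi_{K,1-\alpha}^2\Big\} \ge P\Big\{0.5\,nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2 \ge \sqrt{2K\lambda_{\max}(\Omega)\,\chi_{K,1-\alpha}^2}\;\|V\|_F\Big\},$$

which converges to 1 since $\sum_{k=1}^K\|\beta_k\|^2 \ge C_0^2\,p\log p/(Kn)$ and $\|V\|_F = o_p(p)$.

Similarly, for the power of $T_2^*$, we can show that as $n, p \to \infty$,

$$P\Big\{\Big|\sum_{k=1}^K(U_k - p\sigma_{kk})\Big| > \sqrt{2}\|V\|_F\|\Sigma\|_F\,\zeta_{1-\alpha/2}\Big\} \ge P\Big\{0.5\,nC_1^{-1}C_2^{-2}\sum_{k=1}^K\|\beta_k\|^2 \ge \sqrt{2}\|V\|_F\|\Sigma\|_F\,\zeta_{1-\alpha/2}\Big\} \to 1.$$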

Footnotes

Declarations of interest: none.

Supplementary material

Additional results, supplementary tables, and figures referenced in Sections 2–4 can be found online.

Supplementary materials for this work provide additional technical details, simulation studies, and real data results.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • [1] Ramanan VK, Shen L, Moore JH, Saykin AJ, Pathway analysis of genomic data: concepts, methods, and prospects for future development, Trends in Genetics 28 (7) (2012) 323–332.
  • [2] Sun J, Zheng Y, Hsu L, A unified mixed-effects model for rare-variant association in sequencing studies, Genetic Epidemiology 37 (4) (2013) 334–344.
  • [3] Hwang S, Comparison and evaluation of pathway-level aggregation methods of gene expression data, BMC Genomics 13 (7) (2012) S26.
  • [4] Goeman JJ, Van De Geer SA, Van Houwelingen HC, Testing against a high dimensional alternative, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (3) (2006) 477–493.
  • [5] Goeman JJ, Van Houwelingen HC, Finos L, Testing against a high-dimensional alternative in the generalized linear model: asymptotic type I error control, Biometrika (2011) 381–390.
  • [6] Zhong P-S, Chen SX, Tests for high-dimensional regression coefficients with factorial designs, Journal of the American Statistical Association 106 (493) (2011) 260–274.
  • [7] Guo B, Chen SX, Tests for high dimensional generalized linear models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2016) 1079–1102.
  • [8] Kong D, Maity A, Hsu F-C, Tzeng J-Y, Testing and estimation in marker-set association study using semiparametric quantile regression kernel machine, Biometrics 72 (2) (2016) 364–371.
  • [9] Zhou Y-H, Pathway analysis for RNA-seq data using a score-based approach, Biometrics 72 (1) (2016) 165–174.
  • [10] Liu Y, Sun W, Reiner AP, Kooperberg C, He Q, Statistical inference of genetic pathway analysis in high dimensions, Biometrika 106 (3) (2019) 651–651.
  • [11] Avery CL, He Q, North KE, Ambite JL, Boerwinkle E, Fornage M, Hindorff LA, Kooperberg C, Meigs JB, Pankow JS, et al., A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains, PLoS Genetics 7 (10) (2011) e1002322.
  • [12] Maity A, Sullivan PF, Tzeng J.-i., Multivariate phenotype association analysis by marker-set kernel machine regression, Genetic Epidemiology 36 (7) (2012) 686–695.
  • [13] Sun J, Oualkacha K, Forgetta V, Zheng H-F, Richards JB, Ciampi A, Greenwood CM, Consortium U, et al., A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects, European Journal of Human Genetics 24 (9) (2016) 1344.
  • [14] He Q, Liu Y, Peters U, Hsu L, Multivariate association analysis with somatic mutation data, Biometrics 74 (1) (2018) 176–184.
  • [15] Ma Y, Lan W, Wang H, Testing predictor significance with ultra high dimensional multivariate responses, Computational Statistics and Data Analysis 83 (2015) 275–286.
  • [16] Qiu Y, Chen SX, Nettleton D, et al., Detecting rare and faint signals via thresholding maximum likelihood estimators, The Annals of Statistics 46 (2) (2018) 895–923.
  • [17] Wei L-J, Lin DY, Weissfeld L, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions, Journal of the American Statistical Association 84 (408) (1989) 1065–1073.
  • [18] He Q, Avery CL, Lin D-Y, A general framework for association tests with multivariate traits in large-scale genomics studies, Genetic Epidemiology 37 (8) (2013) 759–767.
  • [19] He Q, Zhang HH, Avery CL, Lin D, Sparse meta-analysis with high-dimensional data, Biostatistics 17 (2) (2016) 205–220.
  • [20] Datta J, Smith JJ, Chatila WK, McAuliffe JC, Kandoth C, Vakiani E, Frankel TL, Ganesh K, Wasserman I, Lipsyc-Sharf M, et al., Coaltered RAS/B-RAF and TP53 is associated with extremes of survivorship and distinct patterns of metastasis in patients with metastatic colorectal cancer, Clinical Cancer Research 26 (5) (2020) 1077–1085.
  • [21] Sun W, Bunn P, Jin C, Little P, Zhabotynsky V, Perou CM, Hayes DN, Chen M, Lin D-Y, The association between copy number aberration, DNA methylation and gene expression in tumor samples, Nucleic Acids Research 46 (6) (2018) 3009–3018.
  • [22] Liu J, Kumar KS, Yu D, Molton S, McMahon M, Herlyn M, Thomas-Tikhonenko A, Fuchs S, Oncogenic BRAF regulates β-Trcp expression and NF-κB activity in human melanoma cells, Oncogene 26 (13) (2007) 1954–1958.
  • [23] Cooks T, Pateras IS, Tarcic O, Solomon H, Schetter AJ, Wilder S, Lozano G, Pikarsky E, Forshew T, Rozenfeld N, et al., Mutant p53 prolongs NF-κB activation and promotes chronic inflammation and inflammation-associated colorectal cancer, Cancer Cell 23 (5) (2013) 634–646.
  • [24] Wang X, Qin L, Zhang H, Zhang Y, Hsu L, Wang P, A regularized multivariate regression approach for eQTL analysis, Statistics in Biosciences 7 (1) (2015) 129–146.
  • [25] Ainsworth HF, Cordell HJ, Using gene expression data to identify causal pathways between genotype and phenotype in a complex disease: application to Genetic Analysis Workshop 19, BMC Proceedings 10 (7) (2016) 49.
