Statistical Inference for High-Dimensional Pathway Analysis with Multiple Responses

Yang Liu; Wei Sun; Li Hsu; Qianchuan He

doi:10.1016/j.csda.2021.107418

Author manuscript; available in PMC: 2023 May 1.

Published in final edited form as: Comput Stat Data Anal. 2022 Jan 13;169:107418. doi: 10.1016/j.csda.2021.107418

PMCID: PMC8813039 NIHMSID: NIHMS1769523 PMID: 35125572

Abstract

Pathway analysis, i.e., grouping analysis, has important applications in genomic studies. Existing pathway analysis approaches are mostly focused on a single response and are not suitable for analyzing complex diseases that are often related to multiple response variables. Although a handful of approaches have been developed for multiple responses, these methods are mainly designed for pathways with a moderate number of features. A multi-response pathway analysis approach that is able to conduct statistical inference when the dimension is potentially higher than the sample size is introduced. Asymptotic properties of the test statistic are established and a theoretical investigation of the statistical power is conducted. Simulation studies and real data analysis show that the proposed approach performs well in identifying important pathways that influence multiple expression quantitative trait loci (eQTL).

Keywords: Asymptotical distribution, Complex diseases, High dimensional inference, Multivariate responses, Pathway analysis, 2010 MSC, 62H15, 62P10

1. Introduction

Pathway analysis, i.e., grouping analysis, interrogates whether a group of features is associated with a response, and has important applications in genomic data analysis. By harnessing prior biological knowledge and accounting for concerted functional mechanisms, pathway analysis is able to examine multiple genomic features in a holistic manner, and has a strong potential to inform on new strategies to diagnose, treat, and prevent complex diseases [1]. With an explosive number of genomic features being typed in population studies, research interest in pathway analysis has been surging in recent years.

A number of approaches have been developed for conducting statistical inference of pathway analysis, and most of them focus on a single response. For example, when the dimension p (i.e., the number of features in a considered set) is moderate, the Mixed effects Score Test (MiST) can be applied to assess the association between a response and a group of features [2]. When the dimension p is high (and possibly higher than the sample size n), the principal component analysis (PCA) can potentially be used [3], though this approach has little power when the selected PCs fail to capture the association signals. Besides PCA, other methods have been developed for pathway analysis under high dimensions, with various considerations on experimental design and genetic signal structures. To name a few, Goeman et al. [4, 5] proposed score statistics for large but fixed dimensional settings; Zhong and Chen [6] proposed U-statistic based tests for linear regression models with factorial designs; Guo and Chen [7] developed an approach for high-dimensional testing in the context of generalized linear models; Kong et al. [8] introduced a method based on penalized quantile regression for handling skewed or heavy-tailed responses; Zhou [9] published a method that is optimized for transcriptome data, in which potential outliers and skewness patterns often arise and violate parametric assumptions; Liu et al. [10] developed an approach that is able to make statistical inference for high dimensional pathway analysis when p/n → γ, for a constant γ ∈ (0,∞). These methods are primarily designed for the situation where a single response needs to be analyzed.

Multi-response analysis is needed when the studied disease condition involves multiple response variables. The responses may be biological measurements, such as blood pressure and lipids level in the metabolic syndrome, or molecular measurements, such as gene expressions in the analysis of gene expression quantitative trait loci (eQTLs). Multi-response analysis can potentially harness the shared information among the responses to improve statistical power and has played pivotal roles in studying genetic mechanisms of complex diseases [11]. Motivated by these considerations, some approaches have been developed for making statistical inference of multi-response pathway analysis. The multivariate kernel machine (MVKM) regression [12] accounts for the correlations among the responses, and represents one of the pioneering approaches for conducting multivariate pathway analysis. Sun et al. [13] proposed the MURAT approach under the linear mixed model, and assumed that the effects of the genomic features follow a multivariate normal distribution. He et al. [14] introduced the SOMAT method to investigate multiple responses with respect to pathways, and this method adopts a hierarchical modeling to accommodate biological characteristics of the studied genomic features. The MURAT and SOMAT are designed for moderate dimensions and are not suited for analyzing genomic sets that contain a large number of features. The MVKM can accommodate potential interactions among the features, and has also been used to analyze a moderate number of features. Ma et al. [15] proposed a residual sum of squares (RSS) type statistic for testing the effects in multi-response analysis. They considered the dimension of the responses to be high while the number of features was assumed to be fixed. Recently, Qiu et al. [16] also proposed an approach for detecting faint signals where the responses are high-dimensional and the features are low-dimensional. 
However, little work has been done on statistical inference for multi-response pathway analysis when the dimension of the features is high.

In this article, we propose a pathway analysis approach for jointly analyzing multiple responses with high-dimensional features. Our approach accounts for the correlations among the responses, and is able to provide valid statistical inference when the dimension p is greater than n, as long as p = o(n²). We consider the situation where the signals are relatively weak and non-sparse, as is commonly seen in practice. We establish the asymptotic properties of the proposed statistic and further conduct a theoretical investigation of the power of the statistic when both the sample size (n) and the dimension (p) go to infinity.

The article is organized as follows. Section 2 describes the testing procedures under different scenarios and establishes their asymptotic properties. Section 3 presents simulation studies for a range of settings to evaluate the empirical performance of the proposed approach. Section 4 applies the proposed tests to a genomic dataset to identify genetic pathways that are associated with the expressions of important cancer genes. Section 5 concludes the article with some remarks.

2. Methods

2.1. Model setup and notations

We consider a sample of n independent individuals and each individual has K continuous response variables. Let Y_k = (Y_k1, …, Y_kn)^T be the vector for the kth response, X = (X₁, …, X_d) be an n × d adjusting covariates matrix, and G = (G₁, …, G_p) be an n × p matrix for genomic features. We assume the following linear regression model for the kth response, k = 1, …, K,

Y_{k} = X α_{k} + G β_{k} + ε_{k},

(1)

where α_k = (α_k1, …, α_kd)^T is the coefficient vector of X for the kth response (with α_k1 being the intercept), β_k = (β_k1, …, β_kp)^T is the coefficient vector of G for the kth response, and ε_k = (ε_k1, …, ε_kn)^T is a vector of independent random errors. We assume that for the ith individual, i = 1, …, n, the vector of the random errors across the K responses, (ε_1i, …, ε_Ki), follows a multivariate Gaussian distribution with mean 0 and covariance $Σ = \{σ_{k l}\}_{k, l = 1, \dots, K}$ , where Σ is positive definite. That is, the covariance between the errors of the kth and the lth responses is σ_kl. Here, the design matrices X and G are assumed to be fixed throughout this paper. The number of responses K and the number of adjusting covariates d are considered to be finite, while the dimension of the genomic features p is allowed to grow to infinity. We are interested in testing the global null hypothesis H₀ : β₁ = ··· = β_K = 0 against the alternative H_a : at least one β_kj ≠ 0 for k = 1, …, K and j = 1, …, p. That is, our goal is to test whether any of the genomic features is associated with any of the K responses.

Before describing our approach, we introduce some notations. For a vector a = (a₁, …, a_n)^T, let $‖ a ‖ = {(\sum_{i = 1}^{n} a_{i}^{2})}^{1 / 2}$ be the L₂-norm of the vector. For any k × l matrix $A = {(a_{i j})}_{i = 1, \dots, k; j = 1, \dots, l}$ , denote the spectral norm by $‖ A ‖ = \sup_{‖ x ‖ = 1} ‖ A x ‖$ , and the Frobenius norm by $‖ A ‖_{F} = {(\sum_{i = 1}^{k} \sum_{j = 1}^{l} a_{i j}^{2})}^{1 / 2}$ . When A is a square matrix, we denote its maximum and minimum eigenvalues by λ_max(A) and λ_min(A), respectively.

2.2. Testing procedures when Σ is known

We first consider the situation in which the covariance matrix Σ is known. (The situation in which Σ is unknown will be studied in Section 2.3.) To test the global hypothesis H₀, we first evaluate the genetic association for each of the K responses, and then construct a joint test statistic by combining the K responses together.

Consider a single response, say the kth response. Instead of directly testing if β_k1 = … = β_kp = 0, we start with a much simpler model, Y_k = Xα_k + G_jβ_kj + ε_k, where G_j is the jth genomic feature. Here, only a single feature is fitted in the model, and this model is called the marginal model. To test if β_kj = 0, one can construct a statistic $z_{k j} = {(G_{j}^{T} H G_{j})}^{- 1 / 2} G_{j}^{T} H Y_{k}$ , where H = I_n − P_x, I_n is the identity matrix, and P_x = X(X^TX)⁻¹X^T is the projection matrix. To test if any of the genomic features is associated with the kth response, as in [10], one can aggregate the z_kj’s by the statistic

U_{k} = \sum_{j = 1}^{p} z_{k j}^{2},

and obtain the correlation matrix amongst the z_kj’s as $V = D^{- 1 / 2} G^{T} H G D^{- 1 / 2}$ , where D is the diagonal matrix with entries $G_{j}^{T} H G_{j}$ , j = 1, …, p. It can be shown that E(U_k) = pσ_kk under H₀.
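
As a minimal sketch of how z_kj, U_k and V could be computed, the Python snippet below assumes that X contains only an intercept, so that applying H reduces to centering a vector; with additional covariates one would project out every column of X instead. The helper names (`center`, `dot`, `marginal_stats`) and the simulated data are illustrative assumptions, not part of the original description.

```python
import random

def center(v):
    """Apply H = I - P_x for an intercept-only X: subtract the mean."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def marginal_stats(G_cols, y):
    """Return (z, U, V) for a single response.

    G_cols: list of p feature columns, each of length n.
    y:      response vector of length n.
    """
    Hy = center(y)
    HG = [center(g) for g in G_cols]
    # H is symmetric and idempotent, so G_j^T H G_j = (H G_j)^T (H G_j).
    d = [dot(g, g) for g in HG]
    z = [dot(g, Hy) / dj ** 0.5 for g, dj in zip(HG, d)]
    U = sum(zj ** 2 for zj in z)
    p = len(HG)
    V = [[dot(HG[a], HG[b]) / (d[a] * d[b]) ** 0.5 for b in range(p)]
         for a in range(p)]
    return z, U, V

# Hypothetical toy data: n = 50 individuals, p = 8 features, no signal.
random.seed(0)
n, p = 50, 8
G_cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [random.gauss(0, 1) for _ in range(n)]

z, U, V = marginal_stats(G_cols, y)
frob_sq = sum(v ** 2 for row in V for v in row)  # ||V||_F^2
print(round(U, 3), round(frob_sq, 3))
```

Because V is a correlation matrix, its diagonal entries equal one, which is why the squared Frobenius norm is never smaller than p.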

The above U_k and V can be used for the association test for a single response, but recall that our goal is to jointly test the K responses for the p genomic features. To pursue this goal, it is necessary to characterize the joint distribution of the U_k’s. Let Ω be the element-wise squared matrix of Σ, i.e., ${(Ω)}_{k l} = σ_{k l}^{2}$ for k, l = 1, …, K. Interestingly, we show in the following that U = (U₁, …, U_K) follows a multivariate normal distribution, when p goes to infinity.

Theorem 1.

Let n, p → ∞. If $‖ V ‖ = o (p^{1 / 2})$ , then under H₀,

\frac{U - μ}{\sqrt{2} ‖ V ‖_{F}} \to N (0, Ω),

where μ = p(σ₁₁, …, σ_KK).

Remark 1.

Here the order of p is not restricted with respect to n, as Σ is assumed to be known. The condition $‖ V ‖ = o (p^{1 / 2})$ has also been used in [10]. A sufficient condition for $‖ V ‖ = o (p^{1 / 2})$ is that the maximum absolute column-sum of V is of smaller order than $p^{1 / 2}$ . For example, when the correlations among G_j’s (j = 1, …, p) are not massively high such that $| V_{j_{1} j_{2}} | \leq C δ^{| j_{1} - j_{2} |}$ for some constants C > 0, 0 < δ < 1, the maximum absolute column-sum of V is bounded and then the condition $‖ V ‖ = o (p^{1 / 2})$ is satisfied. This correlation structure essentially says that the correlations of two variants are weak when they are far apart, which has been widely observed in genetic studies. Some of the commonly used structures, such as the auto-regressive and block-wise correlations, satisfy this condition.
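
The sufficient condition in this remark can be checked numerically on a small example. The sketch below builds an auto-regressive correlation matrix with δ = 0.6 (an illustrative choice), computes its maximum absolute column-sum, and approximates the spectral norm by power iteration. The function names and the dimensions p = 25 and p = 100 are assumptions made for illustration only.

```python
def ar_corr(p, delta):
    """AR correlation matrix with (i, j) entry delta ** |i - j|."""
    return [[delta ** abs(i - j) for j in range(p)] for i in range(p)]

def max_abs_col_sum(V):
    p = len(V)
    return max(sum(abs(V[i][j]) for i in range(p)) for j in range(p))

def spectral_norm_sym(V, iters=200):
    """Approximate the largest eigenvalue of a symmetric positive
    semi-definite matrix by power iteration."""
    p = len(V)
    x = [1.0] * p
    lam = 0.0
    for _ in range(iters):
        y = [sum(V[i][j] * x[j] for j in range(p)) for i in range(p)]
        lam = sum(v * v for v in y) ** 0.5
        x = [v / lam for v in y]
    return lam

delta = 0.6
results = []
for p in (25, 100):
    V = ar_corr(p, delta)
    col = max_abs_col_sum(V)
    spec = spectral_norm_sym(V)
    results.append((p, col, spec, spec / p ** 0.5))
    print(p, round(col, 3), round(spec, 3), round(spec / p ** 0.5, 3))
```

For this structure the column sums stay below (1 + δ)/(1 − δ) regardless of p, so the ratio of the spectral norm to p^1/2 shrinks as p grows.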

Remark 2.

The covariance matrix Ω is the Hadamard (entrywise) product of Σ and Σ. By the Schur product theorem, Ω is positive definite and thus invertible.
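
A small numerical illustration of this remark is sketched below. It forms Ω as the entrywise square of a 3 × 3 compound-symmetry Σ with off-diagonal 0.5 (an illustrative choice) and confirms positive definiteness by attempting a Cholesky factorization, which succeeds only for positive definite matrices. The helper names are hypothetical.

```python
def hadamard_square(S):
    """Entrywise square of a matrix, i.e. the Hadamard product S o S."""
    return [[s * s for s in row] for row in S]

def cholesky(A):
    """Return lower-triangular L with A = L L^T.

    Raises ValueError when A is not positive definite.
    """
    k = len(A)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return L

Sigma = [[1.0, 0.5, 0.5],
         [0.5, 1.0, 0.5],
         [0.5, 0.5, 1.0]]
Omega = hadamard_square(Sigma)
L = cholesky(Omega)   # succeeds, so Omega is positive definite
print(Omega)
```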

Theorem 1 provides a foundation to test the association between the K responses and the genomic features. Based on this joint distribution, we consider two statistics to conduct the joint association test. One is a quadratic type of statistic

T_{1}^{*} = \frac{{(U - μ)}^{T} Ω^{- 1} (U - μ)}{2 ‖ V ‖_{F}^{2}},

which asymptotically has a chi-squared distribution with K degrees of freedom. In addition, we can also test for a linear combination of U_k’s. For example, if there is a prior belief that the genetic effects are similar across the K responses, we can follow the lines of [17] to define

T_{2}^{*} = \frac{J^{T} (U - μ)}{\sqrt{2} ‖ V ‖_{F} \sqrt{J^{T} Ω J}},

which can be simplified as $\sum_{k = 1}^{K} (U_{k} - p σ_{k k}) / (\sqrt{2} ‖ V ‖_{F} ‖ Σ ‖_{F})$ , where J = (1, …, 1)^T. The $T_{2}^{*}$ can be shown to be asymptotically standard normal. When the genetic effects are the same across all the responses, this type of statistic has the smallest variance among all the linear combinations of (U₁, …, U_K) and is often considered to be a highly powerful test for this situation (see [17] for more details).
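
The two statistics for known Σ can be sketched in a few lines of standard-library Python. The code below takes the vector U, the matrix Σ, the dimension p and the value of ‖V‖_F² as given inputs, and returns the statistics together with their asymptotic p-values. The function names, the small linear solver, the chi-squared tail routine and the example inputs are all illustrative assumptions rather than the authors' implementation.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

def chi2_sf(x, k):
    """Upper tail probability of a chi-squared variable with k degrees
    of freedom, via the recurrence of the regularized gamma function."""
    h = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < k / 2 - 1e-9:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1
    return q

def known_sigma_tests(U, Sigma, p, frob_sq):
    """Return (T1*, T2*, p-value of T1*, two-sided p-value of T2*)."""
    K = len(U)
    Omega = [[s * s for s in row] for row in Sigma]
    diff = [U[k] - p * Sigma[k][k] for k in range(K)]      # U - mu
    T1 = sum(d * w for d, w in zip(diff, solve(Omega, diff))) / (2 * frob_sq)
    # J^T Omega J equals the squared Frobenius norm of Sigma.
    sigma_frob = math.sqrt(sum(s * s for row in Sigma for s in row))
    T2 = sum(diff) / (math.sqrt(2 * frob_sq) * sigma_frob)
    return T1, T2, chi2_sf(T1, K), math.erfc(abs(T2) / math.sqrt(2))

# Hypothetical inputs: K = 3, p = 100, ||V||_F^2 = 150.
Sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
T1, T2, p1, p2 = known_sigma_tests([130.0, 120.0, 95.0], Sigma, 100, 150.0)
print(round(T1, 4), round(T2, 4), round(p1, 4), round(p2, 4))
```

In this toy input the three deviations U_k − pσ_kk have mixed signs, which partly cancel in the sum used by T₂* but not in the quadratic form used by T₁*.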

2.3. Testing procedures when Σ is unknown

The proposed statistics $T_{1}^{*}$ and $T_{2}^{*}$ involve the quantities μ, Σ and Ω, which are based on the covariance components σ_kl (k, l = 1, …, K). However, in real applications, σ_kl is mostly unknown. Naturally, one may attempt to use an estimator of $\{σ_{k l}\}_{k, l = 1, \dots, K}$ to obtain the plug-in estimators $\hat{μ}$ , $\hat{Σ}$ and $\hat{Ω}$ , and then carry out the proposed tests. Our theoretical and numerical results (in Section 1 of Supplementary material) show that such a replacement works well when p = o(n). However, when p is further increasing, we show that such a replacement can lead to a breakdown of the proposed tests.

We focus on $T_{1}^{*}$ as an example. Let ${\hat{σ}}_{k l}$ be the estimator of σ_kl under the null hypothesis, that is, for k, l = 1, …, K,

{\hat{σ}}_{k l} = \frac{Y_{k}^{T} H Y_{l}}{n - d} .

For ease of presentation, let us consider Ω to be fixed. Replacing μ by $\hat{μ}$ , we have that

| \frac{{(U - \hat{μ})}^{T} Ω^{- 1} (U - \hat{μ})}{2 ‖ V ‖_{F}^{2}} - \frac{{(U - μ)}^{T} Ω^{- 1} (U - μ)}{2 ‖ V ‖_{F}^{2}} | \leq | {(\hat{μ} - μ)}^{T} {(‖ V ‖_{F}^{2} Ω)}^{- 1} (U - μ) | + | {(\hat{μ} - μ)}^{T} {(2 ‖ V ‖_{F}^{2} Ω)}^{- 1} (\hat{μ} - μ) | \leq ‖ Ω^{- 1} ‖ \frac{‖ \hat{μ} - μ ‖}{‖ V ‖_{F}} \frac{‖ U - μ ‖}{‖ V ‖_{F}} + ‖ Ω^{- 1} ‖ \frac{‖ \hat{μ} - μ ‖^{2}}{2 ‖ V ‖_{F}^{2}},

(2)

where $‖ V ‖_{F}^{2} \geq \sum_{j = 1}^{p} V_{j j} = p$ , and $‖ U - μ ‖^{2} / (2 ‖ V ‖_{F}^{2})$ is $O_{p} (1)$ because it asymptotically follows a mixed chi-square distribution with weights being the eigenvalues of Ω. Then it follows that (2) is bounded by $O_{p} (\sqrt{p / n}) + O_{p} (p / n)$ , which goes to 0 if p = o(n). The bound does not vanish when p grows as fast as n or faster, which suggests that the replacement of σ_kl by ${\hat{σ}}_{k l}$ may work poorly when p is large. Our simulation results (in Section 4 of Supplementary material) are in line with this analysis.

When p is not of order o(n), the deviation between $(U - \hat{μ})$ and (U − μ) may not be negligible and, consequently, the asymptotic normality of $(U - \hat{μ})$ cannot be derived from Theorem 1. This motivated us to directly investigate the asymptotic behavior of $(U - \hat{μ})$ rather than that of (U − μ). The following lemma characterizes the asymptotic normality of $(U - \hat{μ})$ .

Lemma 2.

Assume that the conditions in Theorem 1 hold. If p = o(n²) and $p^{- 1} ‖ V ‖_{F}^{2} \geq p / (n - d) + η$ for some constant η > 0, then under H₀,

\frac{U - \hat{μ}}{\sqrt{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)}} \to N (0, Ω) .

Remark 3.

The condition $p^{- 1} ‖ V ‖_{F}^{2} \geq p / (n - d) + η$ is to bound the denominator away from zero. This is a mild condition since we can show that $p^{- 1} ‖ V ‖_{F}^{2} \geq p /(n - d)$ . When p/(n − d) ≤ 1 − η for any η > 0, this condition is satisfied because $p^{- 1} ‖ V ‖_{F}^{2} \geq 1 \geq p / (n - d) + η$ . When p ≥ n − d, the lower bound is achieved when all the nonzero eigenvalues of V are the same, which means that all the variants are virtually uncorrelated. Since variants in real genetic data are usually correlated, this means that $p^{- 1} ‖ V ‖_{F}^{2}$ is generally larger than the lower bound. Hence, this condition is a fairly mild condition and is easily satisfied in practical situations.
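
The lower bound in this remark can be seen from the fact that V has trace p and rank at most n − d, so the Cauchy–Schwarz inequality applied to its nonzero eigenvalues gives ‖V‖_F² ≥ p²/(n − d). The sketch below checks this numerically for simulated, independent features with p > n. It assumes an intercept-only X (so d = 1), and the sizes n = 12 and p = 40 are arbitrary illustrative values.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(1)
n, p, d = 12, 40, 1   # d = 1: intercept-only X (assumption)

# Centered feature columns, i.e. H G_j with H = I - 11^T / n.
HG = []
for _ in range(p):
    g = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(g) / n
    HG.append([x - mean for x in g])

norms = [dot(g, g) ** 0.5 for g in HG]
# ||V||_F^2 is the sum of squared entries of the correlation matrix V.
frob_sq = sum((dot(HG[a], HG[b]) / (norms[a] * norms[b])) ** 2
              for a in range(p) for b in range(p))

lhs = frob_sq / p                       # p^{-1} ||V||_F^2
lower_bound = max(1.0, p / (n - d))     # bound discussed in the remark
denominator = 2 * frob_sq - 2 * p ** 2 / (n - d)
print(round(lhs, 3), round(lower_bound, 3), round(denominator, 3))
```

The quantity `denominator` is the squared normalizing term in Lemma 2; it stays positive here because the sample eigenvalues of V are not all equal.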

Lemma 2 suggests that we can construct statistics directly based on $(U - \hat{μ})$ rather than based on (U − μ). The following theorem shows the form of such statistics as well as their asymptotic distributions.

Theorem 3.

Assume that the conditions in Lemma 2 hold. Then under H₀,

T_{1} = \frac{{(U - \hat{μ})}^{T} {\hat{Ω}}^{- 1} (U - \hat{μ})}{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)} \to χ_{K}^{2}

and

T_{2} = \frac{\sum_{k = 1}^{K} U_{k} - p \sum_{k = 1}^{K} {\hat{σ}}_{k k}}{‖ \hat{Σ} ‖_{F} \sqrt{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)}} \to N (0, 1) .

Besides T₁ and T₂, the joint distribution in Lemma 2 provides opportunities to form other test statistics. For example, if one can assign weights to the K responses based on prior information, one may use the statistic

\frac{\sum_{k = 1}^{K} w_{k} U_{k} - p \sum_{k = 1}^{K} w_{k} {\hat{σ}}_{k k}}{\sqrt{w^{T} \hat{Ω} w} \sqrt{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)}} \to N (0, 1),

where w = (w₁, …, w_K)^T is a vector of weights for the K responses.
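
Putting the pieces of this section together, T₁ and T₂ of Theorem 3 could be computed from raw data roughly as follows. This is a sketch under stated assumptions: X is taken to be an intercept only (d = 1), so that H centers each vector, and the function names and simulated null data are hypothetical. The returned T₁ would be compared with a chi-squared distribution with K degrees of freedom and T₂ with a standard normal distribution.

```python
import math
import random

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

def pathway_tests(Y, G_cols, d=1):
    """Y: list of K response vectors; G_cols: list of p feature columns.
    Returns (T1, T2) with Sigma estimated under the null hypothesis."""
    n, p, K = len(Y[0]), len(G_cols), len(Y)
    HY = [center(y) for y in Y]
    HG = [center(g) for g in G_cols]
    norms = [math.sqrt(dot(g, g)) for g in HG]
    # U_k = sum_j z_kj^2
    U = [sum((dot(g, hy) / nj) ** 2 for g, nj in zip(HG, norms)) for hy in HY]
    # sigma_hat_kl = Y_k^T H Y_l / (n - d)
    S = [[dot(HY[k], HY[l]) / (n - d) for l in range(K)] for k in range(K)]
    Omega = [[s * s for s in row] for row in S]
    frob_sq = sum((dot(HG[a], HG[b]) / (norms[a] * norms[b])) ** 2
                  for a in range(p) for b in range(p))
    denom = 2 * frob_sq - 2 * p ** 2 / (n - d)   # corrected variance scale
    diff = [U[k] - p * S[k][k] for k in range(K)]
    T1 = dot(diff, solve(Omega, diff)) / denom
    S_frob = math.sqrt(sum(s * s for row in S for s in row))
    T2 = sum(diff) / (S_frob * math.sqrt(denom))
    return T1, T2

# Hypothetical null data with p > n: n = 60, p = 90, K = 3.
random.seed(2)
n, p, K = 60, 90, 3
G_cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
Y = [[random.gauss(0, 1) for _ in range(n)] for _ in range(K)]
T1, T2 = pathway_tests(Y, G_cols)
print(round(T1, 3), round(T2, 3))
```

The only differences from the known-Σ statistics are the plug-in estimates and the subtraction of 2p²/(n − d) in the denominator, which accounts for the variability introduced by estimating σ_kl.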

2.4. Analysis of Power

Next, we investigate the asymptotic power of the proposed statistics. Under the alternative hypothesis H_a, let S_k = {j : β_kj ≠ 0} be the index set for the nonzero coefficients of the kth response (k = 1, …, K). Define the sub-vector $β_{S_{k}} = \{β_{k j} : j \in S_{k}\}$ and the sub-matrix $G_{S_{k}} = \{G_{j} : j \in S_{k}\}$ . Let $D_{S_{k}}$ be the sub-matrix of D, with diagonal elements being $G_{j}^{T} H G_{j}$ , $j \in S_{k}$ . Similarly, define $β_{S_{k}^{c}}$ , $G_{S_{k}^{c}}$ and $D_{S_{k}^{c}}$ for $S_{k}^{c} = \{j : β_{k j} = 0\}$ . We further define $μ_{β, k} = β_{k}^{T} G^{T} M G β_{k} / (\sqrt{2} ‖ V ‖_{F})$ for k = 1, …, K, where M = HGD⁻¹G^TH. Here, μ_β,k is a measure of the strength of signals for the kth response. Let μ_β = (μ_β,1, …, μ_β,K)^T.

We assume the following conditions for analyzing the power of the proposed tests.

Assumption 1.

There is a constant C₁ > 0 such that

C_{1}^{- 1} \leq \max_{1 \leq j \leq p} (\frac{1}{n} G_{j}^{T} H G_{j}) \leq C_{1} .

Assumption 2.

There is a constant C₂ > 0 such that for any k = 1, …, K,

C_{2}^{- 1} \leq λ_{\min} (\frac{1}{n} G_{S_{k}}^{T} H G_{S_{k}}) \leq λ_{\max} (\frac{1}{n} G_{S_{k}}^{T} H G_{S_{k}}) \leq C_{2} .

Assumption 1 is a mild condition which states that the variations of the genetic variants should be on the same scale. Assumption 2 imposes restrictions on the eigenvalues of the matrix $G_{S_{k}}^{T} H G_{S_{k}} / n$ . Define the power of $T_{1}^{*}$ as $ρ (T_{1}^{*}) = P (T_{1}^{*} > χ_{K, 1 - α}^{2})$ , where $χ_{K, 1 - α}^{2}$ denotes the (1 − α) quantile of the chi-squared distribution with K degrees of freedom. Similarly, define the power of $T_{2}^{*}$ as $ρ (T_{2}^{*}) = P (| T_{2}^{*} | > ζ_{1 - α / 2})$ , where $ζ_{1 - α / 2}$ denotes the (1 − α/2) quantile of the standard normal distribution. The following theorem characterizes the power of $T_{1}^{*}$ and $T_{2}^{*}$ , when the size of the overall signal is bounded by $o [{(n \log p)}^{- 1 / 2} ‖ V ‖_{F} / ‖ V ‖]$ .

Theorem 4.

Suppose that the conditions in Theorem 1 and Assumptions 1–2 hold. Under H_a, if $\sum_{k = 1}^{K} ‖ β_{k} ‖ = o [{(n \log p)}^{- 1 / 2} ‖ V ‖_{F} / ‖ V ‖]$ , then as n, p → ∞,

ρ (T_{1}^{*}) \to P [χ_{K}^{2} ({μ_{β}}^{T} Ω^{- 1} μ_{β}) > χ_{K, 1 - α}^{2}]

and

ρ (T_{2}^{*}) \to Φ (- ζ_{1 - α / 2} + \sum_{k = 1}^{K} \frac{μ_{β, k}}{‖ Σ ‖_{F}}) + Φ (- ζ_{1 - α / 2} - \sum_{k = 1}^{K} \frac{μ_{β, k}}{‖ Σ ‖_{F}}),

where $χ_{K}^{2} (μ_{β}^{T} Ω^{- 1} μ_{β})$ is a non-central $χ_{K}^{2}$ random variable with noncentrality parameter μ_β^TΩ⁻¹μ_β, and Φ(·) is the cumulative distribution function of the standard normal variable.
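
The limiting power of $T_{2}^{*}$ is simple enough to evaluate directly. The sketch below is a minimal illustration using `statistics.NormalDist` from the Python standard library; the function name and the example values of μ_β,k and Σ are hypothetical.

```python
from statistics import NormalDist

def power_T2_star(mu_beta, Sigma, alpha=0.05):
    """Limiting power of T2* from Theorem 4.

    mu_beta: list of signal strengths mu_{beta,k}, k = 1, ..., K.
    Sigma:   K x K error covariance matrix (list of lists).
    """
    nd = NormalDist()
    zeta = nd.inv_cdf(1 - alpha / 2)
    sigma_frob = sum(s * s for row in Sigma for s in row) ** 0.5
    shift = sum(mu_beta) / sigma_frob
    return nd.cdf(-zeta + shift) + nd.cdf(-zeta - shift)

Sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
for scale in (0.0, 1.0, 2.0, 4.0):
    print(scale, round(power_T2_star([scale, scale, scale], Sigma), 4))
```

With all μ_β,k equal to zero the expression reduces to the nominal level α, and it increases towards one as the summed signal strength grows.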

Remark 4.

One sufficient condition for $\sum_{k = 1}^{K} ‖ β_{k} ‖ = o [{(n \log p)}^{- 1 / 2} ‖ V ‖_{F} / ‖ V ‖]$ is that $\sum_{k = 1}^{K} ‖ β_{k} ‖ = o [{(n \log p)}^{- 1 / 2}]$ , which means that the overall effect size should be sufficiently small. This condition is similar to the local alternative condition in Zhong and Chen [6].

Theorem 4 provides an explicit formula to calculate the power when the signal size is moderate. It shows that the power tends to increase as ||μ_β|| becomes larger. When the overall signal size $\sum_{k = 1}^{K} ‖ β_{k} ‖$ is not of order $o [{(n \log p)}^{- 1 / 2} ‖ V ‖_{F} / ‖ V ‖]$ , it is difficult to obtain explicit formulas for the power functions. However, we show in the following theorem that, as long as the overall signal size is sufficiently large, the power of $T_{1}^{*}$ and $T_{2}^{*}$ approaches one.

Theorem 5.

Suppose that the conditions in Theorem 1 and Assumptions 1–2 hold. Under H_a, if $\sum_{k = 1}^{K} ‖ β_{k} ‖ \geq C_{0} \sqrt{p \log p / n}$ for some constant C₀ > 0, then as n, p → ∞, $ρ (T_{1}^{*}) \to 1$ and $ρ (T_{2}^{*}) \to 1$ .

3. Simulation Studies

We conducted simulation studies to evaluate the performance of the proposed tests T₁ and T₂, and compared them with (1) Bonferroni test (Bonf.), (2) Principal component analysis (PCA), (3) multivariate score test (mScore), and (4) multivariate kernel machine test (MVKM) with a linear kernel. For the Bonferroni test, we conducted the univariate score test for each genomic feature with respect to each response, and then applied the Bonferroni correction to the p × K tests. For the PCA method, we used the five leading PCs of the G^TG matrix to conduct a likelihood ratio test for each response, and then applied the Bonferroni correction to the K tests. The mScore test extends the multivariate score test from a single genetic variant [18] to multiple variants, and was designed for low dimensions.

We considered K = 3 responses. Genomic features were generated from a multivariate normal distribution N(0, ρ), where ρ is a block-diagonal covariance matrix with each block being ρ₀. Two correlation structures were considered for ρ₀: (1) auto-regressive (AR) with (i, j)th off-diagonal element 0.6^|i−j|, and (2) compound symmetry (CS) with diagonal elements 1 and off-diagonal elements 0.5. These structures have been considered in other publications [10, 19] to model the correlations among genetic variants. The responses were generated as $Y_{k i} = 1 + x_{i} + \sum_{j = 1}^{p} G_{i j} β_{k j} + ε_{k i}$ , where x_i ∼ N(0.1G_i1,1), and (ε_1i, ε_2i, ε_3i)^T follows a multivariate normal distribution with correlation of 0.5.
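
The data-generating mechanism described above could be sketched as follows for the AR structure. Several details are assumptions made for illustration because they are not stated in the text: the block size (`block=10`), unit error variances, and the use of a shared random factor to induce the error correlation of 0.5. The coefficient vectors `beta` are supplied by the user.

```python
import random

def ar_block(size, rho, rng):
    """One AR(1) block with unit variances and corr(g_i, g_j) = rho ** |i - j|."""
    g = [rng.gauss(0, 1)]
    for _ in range(size - 1):
        g.append(rho * g[-1] + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1))
    return g

def simulate(n, p, beta, block=10, rho=0.6, err_corr=0.5, seed=0):
    """beta: K lists of length p. Returns (x, G, Y), where G holds n rows
    of p features and Y holds K response vectors of length n."""
    rng = random.Random(seed)
    K = len(beta)
    x, G = [], []
    Y = [[] for _ in range(K)]
    for _ in range(n):
        row = []
        while len(row) < p:   # independent blocks give a block-diagonal covariance
            row.extend(ar_block(min(block, p - len(row)), rho, rng))
        xi = rng.gauss(0.1 * row[0], 1)       # x_i ~ N(0.1 G_i1, 1)
        shared = rng.gauss(0, 1)              # factor shared by the K errors
        for k in range(K):
            eps = err_corr ** 0.5 * shared + (1 - err_corr) ** 0.5 * rng.gauss(0, 1)
            signal = sum(g * b for g, b in zip(row, beta[k]))
            Y[k].append(1 + xi + signal + eps)
        x.append(xi)
        G.append(row)
    return x, G, Y

# Null model: all coefficients are zero.
n, p, K = 200, 100, 3
beta0 = [[0.0] * p for _ in range(K)]
x, G, Y = simulate(n, p, beta0)
print(len(G), len(G[0]), len(Y), len(Y[0]))
```

Each error term is the sum of a shared component with variance 0.5 and an independent component with variance 0.5, which yields unit variance and a pairwise correlation of 0.5 across the three responses.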

For the sample size n and the dimension p, we considered n = 200 with p = 100, 200, 300, and n = 400 with p = 300, 400, 500. For the alternative hypothesis, we set the proportion of nonzero β_kj’s to 5%, 10%, 15%, or 20%. Considering that a genomic feature may not affect all responses, we set two signals, i.e., nonzero β_kj’s, to be overlapping for the K responses and the remaining signals to be non-overlapping. Regarding the signal sizes, two scenarios were considered. In scenario (1), we set the absolute values of the signals for the three responses to be (0.10, 0.06, 0.02) for n = 200, and (0.05, 0.03, 0.01) for n = 400. That is, the magnitudes of the association signals differ across the responses. In scenario (2), to examine the performance of T₂, we set the absolute values of the signals to be (0.08, 0.08, 0.08) for n = 200, and (0.04, 0.04, 0.04) for n = 400. That is, the magnitudes of the signals are constant for the three responses. For both scenarios, the signs of the signals can be either positive or negative.

We evaluated the Type I errors of the tests over 10,000 replications, and examined the power over 1,000 replications. Table 1 shows that the Type I errors of the proposed tests T₁ and T₂ are around 0.05. All the other compared methods also have their Type I errors controlled, though the MVKM tends to be conservative under high dimensions. The lower Type I error of MVKM is likely due to the impact of the increased dimensions. For MVKM, its asymptotic distribution involves the error term σ_kl, whose estimation error is negligible when p is small but can become highly influential when p is large, a phenomenon that has been observed for similar approaches [10]. Our numerical experiments also show that when p is moderate compared with n, the Type I error of MVKM is close to the nominal level (see Section 4 of Supplementary material).

For the power performance, Figure 1 shows the power against the proportion of nonzero β_kj’s for scenario (1) under the AR correlation structure. It can be seen that, as the proportion of nonzero β_kj’s becomes larger, the power of all the methods increases. As to relative performance, both T₁ and T₂ generally have higher power than the other compared methods, especially when the proportion of the signals is large. The Bonferroni test suffers from severe power loss because the signals were generated to be weak in the considered models. The PCA test also has low power, likely because the leading five PCs failed to represent the genetic features that carry association signals. The mScore is a chi-square test with its degrees of freedom equal to p × K, which causes a loss of power when p is large. With regard to the comparison between T₁ and T₂, T₁ outperforms T₂ because the magnitude of the signals varies across the 3 responses. Next, we examined the power under scenario (2), where the magnitudes of the signals are equal for the 3 responses. Figure 2 shows that T₁ and T₂ still have good performance, but T₂ tends to have higher power than T₁. These results indicate that T₁ is more powerful when the studied responses have unbalanced signal sizes, while T₂ is more favorable when the signal sizes are similar across multiple responses. The power under the CS correlation structure shows a similar pattern and is provided in Section 4 of Supplementary material.

Table 1:

Type I error of the compared tests across different sample sizes, dimensions and correlation structures of the genomic features at α = 0.05

n	p	Corr.	Bonf.	PCA	mScore	MVKM	T₁	T₂
200	100	AR	0.044	0.051	0.012	0.034	0.054	0.045
200	100	CS	0.047	0.049	0.013	0.033	0.051	0.041
200	200	AR	0.047	0.055	0.002	0.024	0.050	0.048
200	200	CS	0.046	0.054	0.004	0.026	0.050	0.046
200	300	AR	0.047	0.053	0.001	0.016	0.052	0.051
200	300	CS	0.048	0.055	0.002	0.018	0.053	0.052
400	200	AR	0.048	0.050	0.010	0.032	0.048	0.049
400	200	CS	0.046	0.051	0.013	0.034	0.049	0.048
400	400	AR	0.049	0.051	0.002	0.022	0.051	0.049
400	400	CS	0.048	0.051	0.004	0.025	0.050	0.047
400	600	AR	0.042	0.051	0.000	0.012	0.048	0.048
400	600	CS	0.045	0.051	0.000	0.015	0.047	0.048


Figure 1: Power of the compared tests at α = 0.05 under different sample sizes and dimensions with respect to the proportion of nonzero signals for scenario (1). The AR structure is considered.

Figure 2: Power of the compared tests at α = 0.05 under different sample sizes and dimensions with respect to the proportion of nonzero signals for scenario (2). The AR structure is considered.

4. Real Data Analysis

We evaluated the performance of the proposed tests using the colorectal cancer data from The Cancer Genome Atlas (TCGA). This dataset contains multiple types of genomic information, such as gene expressions, DNA methylations, and somatic mutations. We aimed to identify DNA methylations and somatic mutations that influence the expressions of important cancer genes.

We obtained 160 samples that have data on gene expressions, DNA methylations and somatic mutations. We considered the expressions of three genes (KRAS, BRAF and TP53) as the responses, because these genes are important for cancer development and have been found to co-alter their expressions to jointly influence the survivorship of colon cancer patients [20]. We mapped the genomic features, i.e., methylations and somatic mutations, into the 50 Hallmark Pathways based on the Broad Institute’s database. Among the 50 pathways, 38 pathways contain more than 160 genomic features; the median and maximal numbers of genomic features in a pathway are 316 and 1275 respectively, both of which are larger than the sample size. We included the following covariates in our analysis: age, gender, plates for batch effects, hyper-mutation status, and principal components accounting for tumor purity and cell type composition [21]. We excluded the P53 pathway as the TP53 gene’s expression is one of the 3 responses being studied.

We applied the proposed T₁ and T₂ tests along with the other compared methods to each of the 49 pathways. The threshold for statistical significance was set to 0.05/49 ≈ 0.001. Table 2 lists the pathways that were detected by any of the considered methods. It can be seen that the T₁ test detected three pathways: the TNFA signaling via NFKB, Unfolded protein response, and TGF beta signaling pathways. The T₂ test identified the first pathway and two additional pathways, the Xenobiotic metabolism and Hypoxia pathways. The MVKM detected one pathway, the TNFA signaling via NFKB pathway. The three other compared methods did not yield any significant results. The identified pathways have functional meanings which are supported by existing biological studies. For example, the nuclear factor-kappa B (NFKB) signaling pathway is a regulator of immune response and inflammation, and is associated with both BRAF [22] and TP53 [23].

Table 2:

P-values of the compared tests. P-values lower than .001 are shown in bold.

Pathway name	p	Bonf.	PCA	mScore	MVKM	T₁	T₂
TNFA signaling via NFKB	397	.06593	.03044	.01565	.00019	.00003	.00007
Unfolded protein response	104	.99912	.05081	.01777	.00186	.00040	.00767
TGF beta signaling	184	.15286	.27376	.03770	.00171	.00074	.00497
Xenobiotic metabolism	384	.33097	.01024	.03027	.00186	.00153	.00056
Hypoxia	727	.10970	.03363	.02102	.00160	.00523	.00088
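
The significance calls reported in the text can be reproduced from the MVKM, T₁ and T₂ columns of Table 2 with a short script. The sketch below applies the Bonferroni threshold 0.05/49 to the tabulated p-values; the dictionary layout and variable names are illustrative choices.

```python
ALPHA = 0.05
N_PATHWAYS = 49
threshold = ALPHA / N_PATHWAYS   # about 0.00102

# P-values copied from the MVKM, T1 and T2 columns of Table 2.
pvals = {
    "TNFA signaling via NFKB":   {"MVKM": 0.00019, "T1": 0.00003, "T2": 0.00007},
    "Unfolded protein response": {"MVKM": 0.00186, "T1": 0.00040, "T2": 0.00767},
    "TGF beta signaling":        {"MVKM": 0.00171, "T1": 0.00074, "T2": 0.00497},
    "Xenobiotic metabolism":     {"MVKM": 0.00186, "T1": 0.00153, "T2": 0.00056},
    "Hypoxia":                   {"MVKM": 0.00160, "T1": 0.00523, "T2": 0.00088},
}

significant = {
    method: [name for name, row in pvals.items() if row[method] < threshold]
    for method in ("MVKM", "T1", "T2")
}
for method, names in significant.items():
    print(method, names)
```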


To better understand the genetic signals behind the association results, we examined the univariate p-values in the Unfolded protein response pathway, which was detected only by the T₁ test. Figure 3 provides these p-values for each of the three responses. It can be seen that each response involves a number of moderate signals; however, none of these p-values reaches the global significance level of 10⁻⁵ for the univariate test. This emphasizes the importance of examining a large number of features jointly across multiple responses. The univariate p-value plots for the other identified pathways can be found in Section 5 of Supplementary material.

Figure 3: Univariate p-values of genetic features for three responses in the Unfolded protein response pathway.

5. Discussion

Multi-response pathway analysis has important applications in biomedical research, such as studying pleiotropy in quantitative genetics, as well as conducting phenome-wide association studies. In this article, we developed a multi-response pathway analysis approach that is able to conduct statistical inference when p → ∞ and is potentially larger than n, as long as p = o(n²). Our approach enables the detection of weak signals aggregated at the pathway level, and provides a powerful tool for deciphering the complexity of genetic mechanisms. Asymptotic normality was established for the proposed statistic, and the asymptotic power of the proposed statistic when both n and p go to infinity was also studied. Besides the proposed T₁ and T₂, the result in Lemma 2 provides a foundation for forming other potential statistics which can be tailored for practical situations. The power of these statistics varies and is dependent upon the alternative hypothesis, as it is known that there is no uniformly most powerful test for multi-response analysis. We also note that in the approach of [6], there is no restriction on the relative growth rate between p and n. It will be interesting to see if it is possible to extend their approach to multivariate outcome analysis. Further research is needed to address this important question.

Our proposed approach tests the global null hypothesis that none of the responses is associated with the genetic pathway. If the global null is rejected, then it would be interesting to investigate which of the response variables are influenced by the considered genetic features. One simple strategy is to test each response one by one post hoc, but the multiple testing burden brought by these marginal tests may overshadow the significance of the individual tests. Another potential strategy is to apply regularized regression methods, such as [24], to pinpoint the responses that are associated with the genetic variants. These approaches, however, often do not provide p-values for the selected results, as statistical inference for regularized regression in high dimensions remains an active research area.

Besides high dimensionality, there exist many other challenges in pathway analysis. For example, when there are potential causal relationships among the studied features [25], accommodating such relationships requires new methodology development. Structural equation modeling and mediation analysis approaches have been used to infer the potential causal relationship between features and outcome, but these methods are primarily focused on a limited number of outcomes and features, and how to extend them to high dimensions appears to be a nontrivial task. Future research is merited.

Supplementary Material

NIHMS1769523-supplement-1.pdf (309.3 KB, pdf)

Acknowledgments

This work is supported by National Institutes of Health R01CA223498 and R01CA189532. We thank the Editor, Associate Editor, and reviewers for their very helpful comments.

Appendix

Proofs of the theorems

Proof of Theorem 1.

Consider a linear combination of $(U - μ) / (\sqrt{2} ‖ V ‖_{F})$ , denoted as $T_{c} = c^{T} (U - μ) / (\sqrt{2} ‖ V ‖_{F})$ for a vector of constants c = (c₁, …, c_K)^T. To prove Theorem 1, it suffices to show that under H₀, T_c → N(0, c^TΩc) in distribution.

First, we calculate the mean and variance of T_c. Under the null, we have

T_{c} = {(\sqrt{2} ‖ V ‖_{F})}^{- 1} \sum_{k = 1}^{K} c_{k} (\sum_{j = 1}^{p} z_{k j}^{2} - p σ_{k k}) = {(\sqrt{2} ‖ V ‖_{F})}^{- 1} \sum_{k = 1}^{K} c_{k} (ε_{k}^{T} M ε_{k} - p σ_{k k}),

where M = HGD⁻¹G^TH. Further define

U_{c} = \sum_{k = 1}^{K} c_{k} ε_{k}^{T} M ε_{k} = ε_{v}^{T} (D_{c} \otimes M) ε_{v},

where $ε_{v} = {(ε_{1}^{T}, \dots, ε_{K}^{T})}^{T} ~ N (0, Σ \otimes I_{n})$ , and D_c is a diagonal matrix with diagonal elements c₁, …, c_K.

Recall that V = D^{−1/2}G^THGD^{−1/2} is the correlation matrix of the z_kj’s with diagonal elements V_jj = 1, j = 1, ··· , p. Then

E (U_{c}) = tr [(D_{c} \otimes M) (Σ \otimes I_{n})] = tr (D_{c} Σ) tr (V) = p \sum_{k = 1}^{K} c_{k} σ_{k k},

and

var (U_{c}) = 2 tr [{(D_{c} Σ)}^{2} \otimes M^{2}] = 2 tr [{(D_{c} Σ)}^{2}] \cdot tr [M^{2}] = 2 c^{T} Ω c \cdot ‖ V ‖_{F}^{2},

where the last equality holds since

tr [M^{2}] = tr (H G D^{- 1} G^{T} H H G D^{- 1} G^{T} H) = tr (D^{- 1 / 2} G^{T} H G D^{- 1} G^{T} H G D^{- 1 / 2}) = tr (V^{2}) = ‖ V ‖_{F}^{2} .

It follows that E(T_c) = 0 and var(T_c) = c^TΩc.

It remains to prove the asymptotic normality of $U_{c} = \sum_{k = 1}^{K} c_{k} ε_{k}^{T} M ε_{k}$ . Let e = (Σ^{−1/2} ⊗ I_n)ε_v; then we have e ∼ N(0, I_{nK}). Subsequently,

U_{c} = e^{T} (Σ^{1 / 2} \otimes I_{n}) (D_{c} \otimes M) (Σ^{1 / 2} \otimes I_{n}) e = e^{T} (Σ^{1 / 2} D_{c} Σ^{1 / 2} \otimes M) e .

Let B = Σ^{1/2}D_cΣ^{1/2} ⊗ M. By the eigen decomposition of B, it suffices to show that ||B||/||B||_F → 0 when n, p → ∞. Note that

‖ B ‖ = ‖ Σ^{1 / 2} D_{c} Σ^{1 / 2} ‖ \cdot ‖ M ‖ = ‖ D_{c} Σ ‖ \cdot ‖ V ‖ \leq ‖ D_{c} ‖ \cdot ‖ Σ ‖ \cdot ‖ V ‖ \leq \max_{k} | c_{k} | \cdot \sum_{k = 1}^{K} σ_{k k} \cdot ‖ V ‖

and

\begin{array}{l} ‖ B ‖_{F} = ‖ Σ^{1 / 2} D_{c} Σ^{1 / 2} ‖_{F} \cdot ‖ M ‖_{F} = {[tr (Σ^{1 / 2} D_{c} Σ^{1 / 2} Σ^{1 / 2} D_{c} Σ^{1 / 2})]}^{1 / 2} \cdot ‖ V ‖_{F} \\ = ‖ D_{c} Σ ‖_{F} \cdot {(\sum_{j = 1}^{p} V_{j j}^{2})}^{1 / 2} \geq \max_{k} | c_{k} | \cdot σ_{k^{'} k^{'}} \cdot \sqrt{p}, \end{array}

where k′ = argmax_k |c_k|. It follows that $‖ B ‖ / ‖ B ‖_{F} \leq (\sum_{k = 1}^{K} σ_{k k} \cdot ‖ V ‖) / (σ_{k^{'} k^{'}} \cdot \sqrt{p}) = o (1)$ , which completes the proof. □
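The moment calculations in the proof of Theorem 1 can be checked numerically on a small simulated example. The sketch below is illustrative only and rests on assumptions not stated in this appendix: H is taken to be the projection matrix that removes d covariates (so that HH = H and tr(H) = n − d), D is taken to be the diagonal of G^THG, and Ω is taken to have entries σ_kl², which is what the identity tr[(D_cΣ)²] = c^TΩc implies. The function names, dimensions, and the values of Σ and c are hypothetical.

```python
import numpy as np

def build_matrices(n=40, p=15, d=3, seed=0):
    """Simulate features G and covariates X, then form M and V (assumed definitions)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # n x d covariates
    G = rng.normal(size=(n, p))                                     # n x p features
    H = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)               # projection, tr(H) = n - d
    GHG = G.T @ H @ G
    d_diag = np.diag(GHG)                                           # diagonal entries of D
    M = H @ G @ np.diag(1.0 / d_diag) @ G.T @ H                     # M = H G D^{-1} G^T H
    V = GHG / np.sqrt(np.outer(d_diag, d_diag))                     # V = D^{-1/2} G^T H G D^{-1/2}
    return M, V

def null_moments(M, V, Sigma, c):
    """Mean and variance of U_c: exact Gaussian quadratic-form values vs. closed forms."""
    n, p = M.shape[0], V.shape[0]
    A = np.kron(np.diag(c), M)            # D_c (x) M
    S = np.kron(Sigma, np.eye(n))         # covariance of eps_v
    AS = A @ S
    exact = (np.trace(AS), 2.0 * np.trace(AS @ AS))
    Omega = Sigma * Sigma                 # assumed: Omega_kl = sigma_kl^2
    closed = (p * np.sum(c * np.diag(Sigma)),
              2.0 * (c @ Omega @ c) * np.sum(V ** 2))
    return exact, closed

M, V = build_matrices()
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, -0.4],
                  [0.1, -0.4, 1.5]])      # hypothetical response covariance
c = np.array([0.5, -1.0, 2.0])            # hypothetical constants
exact, closed = null_moments(M, V, Sigma, c)
```

Under these assumptions the two pairs of values agree up to floating-point error, and tr(M²) coincides with ‖V‖_F².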

Proof of Lemma 2.

Let $T_{c}^{h} = c^{T} (U - \hat{μ}) / \sqrt{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)}$ for a vector of constants c = (c₁, …, c_K)^T. It suffices to show that under H₀, $T_{c}^{h} \to N (0, c^{T} Ω c)$ . Under H₀, we can write

U_{c}^{h} = c^{T} (U - \hat{μ}) = \sum_{k = 1}^{K} c_{k} (ε_{k}^{T} M ε_{k} - {(n - d)}^{- 1} p ε_{k}^{T} H ε_{k}) = ε_{v}^{T} [D_{c} \otimes (M - {(n - d)}^{- 1} p H)] ε_{v} = e^{T} [(Σ^{1 / 2} D_{c} Σ^{1 / 2}) \otimes (M - {(n - d)}^{- 1} p H)] e,

where M, ε_v, D_c, and e are defined in the proof of Theorem 1.

Let M₁ = M − (n − d)⁻¹pH and B₁ = (Σ^{1/2}D_cΣ^{1/2}) ⊗ M₁. Since e ∼ N(0, I_{nK}), we have

E (U_{c}^{h}) = tr (Σ^{1 / 2} D_{c} Σ^{1 / 2}) tr (M_{1}) = tr (Σ^{1 / 2} D_{c} Σ^{1 / 2}) [tr (V) - {(n - d)}^{- 1} p \cdot tr (H)] = tr (Σ^{1 / 2} D_{c} Σ^{1 / 2}) [p - {(n - d)}^{- 1} p (n - d)] = 0

and

var (U_{c}^{h}) = 2 tr (B_{1}^{2}) = 2 tr [{(D_{c} Σ)}^{2}] \cdot tr [{M_{1}}^{2}] = 2 c^{T} Ω c \cdot tr [M^{2} - 2 {(n - d)}^{- 1} p M + {(n - d)}^{- 2} p^{2} H] = 2 c^{T} Ω c \cdot [‖ V ‖_{F}^{2} - 2 {(n - d)}^{- 1} p tr (V) + {(n - d)}^{- 2} p^{2} (n - d)] = 2 c^{T} Ω c [‖ V ‖_{F}^{2} - p^{2} / (n - d)] .

It follows that $E (T_{c}^{h}) = 0$ and $var (T_{c}^{h}) = c^{T} Ω c$ .

For the asymptotic normality of $U_{c}^{h}$ , we only need to prove that ||B₁||/||B₁||_F → 0 when p = o(n²). By Weyl’s inequality, we have

λ_{\max} (M_{1}) \leq λ_{\max} (M) - {(n - d)}^{- 1} p λ_{\min} (H) \leq ‖ V ‖

and

λ_{\min} (M_{1}) \geq λ_{\min} (M) - {(n - d)}^{- 1} p λ_{\max} (H) \geq - p / (n - d) .

Then, using the conditions ||V|| = o(p^{1/2}) and p = o(n²), we have ||M₁|| ≤ max(||V||, p/(n − d)) = o_p(p^{1/2}). It follows that

‖ B_{1} ‖ = ‖ Σ^{1 / 2} D_{c} ‖ \cdot ‖ Σ^{1 / 2} ‖ \cdot ‖ M_{1} ‖ \leq \max_{k} | c_{k} | \cdot \sum_{k = 1}^{K} σ_{k k} \cdot o (p^{1 / 2}) .

Using the condition $p^{- 1} ‖ V ‖_{F}^{2} \geq p / (n - d) + η$ , we have

{‖ B_{1} ‖}_{F} = {‖ Σ^{1 / 2} D_{c} Σ^{1 / 2} ‖}_{F} \cdot {‖ M_{1} ‖}_{F} = {‖ D_{c} Σ ‖}_{F} \cdot {[‖ V ‖_{F}^{2} - p^{2} / (n - d)]}^{1 / 2} \geq \max_{k} | c_{k} | \cdot σ_{k^{'} k^{'}} \cdot {[η p]}^{1 / 2},

where k′ = argmax_k |c_k|. It follows that ||B₁||/||B₁||_F = o(1) and the lemma is proved. □
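The trace and eigenvalue facts used for M₁ in the proof of Lemma 2 can also be verified on simulated data. This is a minimal sketch under the same assumptions as before, none of which are spelled out in this appendix: H is a projection matrix with tr(H) = n − d, and D is the diagonal of G^THG. The lower eigenvalue bound checked here is −p/(n − d), which follows from λ_min(M) ≥ 0 and λ_max(H) = 1. The function name and dimensions are hypothetical.

```python
import numpy as np

def lemma2_quantities(n=60, p=20, d=4, seed=1):
    """Return trace and eigenvalue summaries of M1 = M - p/(n-d) * H (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # n x d covariates
    G = rng.normal(size=(n, p))                                     # n x p features
    H = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)               # projection, tr(H) = n - d
    GHG = G.T @ H @ G
    d_diag = np.diag(GHG)
    M = H @ G @ np.diag(1.0 / d_diag) @ G.T @ H
    V = GHG / np.sqrt(np.outer(d_diag, d_diag))
    M1 = M - (p / (n - d)) * H
    eig = np.linalg.eigvalsh((M1 + M1.T) / 2.0)   # symmetrize against rounding error
    return {
        "trace_M1": np.trace(M1),                          # should be 0
        "trace_M1_sq": np.trace(M1 @ M1),                  # tr(M1^2)
        "closed_form": np.sum(V ** 2) - p ** 2 / (n - d),  # ||V||_F^2 - p^2/(n-d)
        "lam_min": eig[0],
        "lam_max": eig[-1],
        "lower": -p / (n - d),                             # Weyl lower bound
        "upper": np.linalg.norm(V, 2),                     # spectral norm of V
    }

q = lemma2_quantities()
```

With these definitions, `trace_M1` is zero up to rounding, `trace_M1_sq` matches `closed_form`, and all eigenvalues of M₁ lie between `lower` and `upper`.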

Proof of Theorem 3.

Define $T_{1}^{'} = {(U - \hat{μ})}^{T} Ω^{- 1} (U - \hat{μ}) / (2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d))$ and $T_{2}^{'} = (\sum_{k = 1}^{K} Q_{k} - p \sum_{k = 1}^{K} {\hat{σ}}_{k k}) / (‖ Σ ‖_{F} \sqrt{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)})$ ; then it follows from Lemma 2 that under $H_{0}, T_{1}^{'} \to χ_{K}^{2}$ and $T_{2}^{'} \to N (0, 1)$ . To prove Theorem 3, it suffices to show that T₁ and $T_{1}^{'}$ have the same asymptotic distribution under H₀, and that T₂ and $T_{2}^{'}$ have the same asymptotic distribution under H₀.

Since $| {\hat{σ}}_{k l} - σ_{k l} | = O_{p} (n^{- 1 / 2})$ for any k, l = 1, …, K, it follows that $‖ \hat{Σ} - Σ ‖ = O_{p} (n^{- 1 / 2})$ and $‖ \hat{Ω} - Ω ‖ = O_{p} (n^{- 1 / 2})$ . Using Weyl’s inequality, we have

λ_{\min} (\hat{Ω}) \geq λ_{\min} (Ω) + λ_{\min} (\hat{Ω} - Ω) \geq λ_{\min} (Ω) - ‖ \hat{Ω} - Ω ‖ .

Then, with probability approaching 1, $λ_{\min} (\hat{Ω}) \geq 0.5 λ_{\min} (Ω)$ . Subsequently,

‖ {\hat{Ω}}^{- 1} - Ω^{- 1} ‖ = ‖ {\hat{Ω}}^{- 1} (Ω - \hat{Ω}) Ω^{- 1} ‖ \leq ‖ {\hat{Ω}}^{- 1} ‖ \cdot ‖ Ω^{- 1} ‖ \cdot ‖ Ω - \hat{Ω} ‖ \leq {λ_{\min} (\hat{Ω})}^{- 1} {λ_{\min} (Ω)}^{- 1} O_{p} (n^{- 1 / 2}) \leq 2 {λ_{\min} (Ω)}^{- 2} O_{p} (n^{- 1 / 2}) = O_{p} (n^{- 1 / 2}) .

We now shall bound $| T_{1} - T_{1}^{'} |$ . Under H₀, we have

| T_{1} - T_{1}^{'} | = | {(U - \hat{μ})}^{T} ({\hat{Ω}}^{- 1} - Ω^{- 1}) (U - \hat{μ}) / (2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)) | \leq ‖ {\hat{Ω}}^{- 1} - Ω^{- 1} ‖ \frac{‖ U - \hat{μ} ‖^{2}}{2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d)} = O_{p} (n^{- 1 / 2}),

where the last step holds since under H₀, $‖ U - \hat{μ} ‖^{2} / (2 ‖ V ‖_{F}^{2} - 2 p^{2} / (n - d))$ asymptotically follows a mixture of chi-squared distributions with weights being the eigenvalues of Ω, and is therefore bounded in probability. Thus $T_{1} \to χ_{K}^{2}$ .

For T₂, since $‖ \hat{Σ} ‖_{F} = {(\sum_{k, l = 1}^{K} {\hat{σ}}_{k l}^{2})}^{1 / 2}$ , it follows that $‖ \hat{Σ} ‖_{F}$ converges to ||Σ||_F in probability. By Slutsky’s theorem, T₂ and $T_{2}^{'}$ have the same asymptotic distribution under H₀. That is, T₂ → N(0,1). □
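The matrix-inverse perturbation step in the proof above relies on the identity Ω̂⁻¹ − Ω⁻¹ = Ω̂⁻¹(Ω − Ω̂)Ω⁻¹ and on ‖A⁻¹‖ = 1/λ_min(A) for a symmetric positive definite A. The following sketch checks both on a small example. The matrices are hypothetical: Ω is built as the elementwise square of an illustrative Σ (an assumption about the form of Ω), and Ω̂ is a small symmetric perturbation of it rather than an estimate from data.

```python
import numpy as np

def inverse_perturbation(Omega, Omega_hat):
    """Compare ||Omega_hat^{-1} - Omega^{-1}|| with the bound used in the proof."""
    Omega_inv = np.linalg.inv(Omega)
    Omega_hat_inv = np.linalg.inv(Omega_hat)
    diff = Omega_hat_inv - Omega_inv
    # Resolvent-type identity: the difference of inverses as a triple product
    identity_rhs = Omega_hat_inv @ (Omega - Omega_hat) @ Omega_inv
    lam_min = np.linalg.eigvalsh(Omega)[0]          # smallest eigenvalue of Omega
    lam_min_hat = np.linalg.eigvalsh(Omega_hat)[0]  # smallest eigenvalue of Omega_hat
    lhs = np.linalg.norm(diff, 2)
    bound = np.linalg.norm(Omega - Omega_hat, 2) / (lam_min_hat * lam_min)
    return diff, identity_rhs, lhs, bound

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, -0.4],
                  [0.1, -0.4, 1.5]])   # hypothetical covariance
Omega = Sigma * Sigma                  # assumed form: Omega_kl = sigma_kl^2
E = 0.01 * rng.normal(size=(3, 3))
Omega_hat = Omega + (E + E.T) / 2.0    # small symmetric perturbation
diff, identity_rhs, lhs, bound = inverse_perturbation(Omega, Omega_hat)
```

The bound scales linearly with ‖Ω − Ω̂‖ as long as λ_min(Ω̂) stays away from zero, which is the role of the Weyl-inequality step.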

Proof of Theorem 4.

Under H_a, for each k, we can write

U_{k} - p σ_{k k} = Y_{k}^{T} {MY}_{k} - p σ_{k k} = β_{k}^{T} G^{T} MG β_{k} + 2 β_{k}^{T} G^{T} M ε_{k} + (ε_{k}^{T} M ε_{k} - p σ_{k k}) .

We can bound the second term in the above equation as follows. Using the Hoeffding bound for Gaussian random variables ε_k, we have

P {| β_{k}^{T} G^{T} M ε_{k} | \geq {(\log p)}^{1 / 2} ‖ M G β_{k} ‖} \leq 2 \exp {- \frac{\log p ‖ M G β_{k} ‖^{2}}{2 ‖ M G β_{k} ‖^{2} σ_{k k}}} = 2 p^{- 1 / (2 σ_{k k})} .

That is, as p → ∞, with probability approaching 1, we have

| β_{k}^{T} G^{T} M ε_{k} | \leq {(\log p)}^{1 / 2} ‖ M G β_{k} ‖ = {(\log p)}^{1 / 2} ‖ M H G β_{k} ‖ \leq {(\log p)}^{1 / 2} ‖ M ‖ \cdot ‖ H G β_{k} ‖ \leq {(\log p)}^{1 / 2} ‖ V ‖ {(β_{S_{k}}^{T} G_{S_{k}}^{T} H G_{S_{k}} β_{S_{k}})}^{1 / 2} \leq {(\log p)}^{1 / 2} ‖ V ‖ {(λ_{\max} (n^{- 1} G_{S_{k}}^{T} H G_{S_{k}}) n {‖ β_{S_{k}} ‖}^{2})}^{1 / 2} \leq C_{2}^{1 / 2} {(n \log p)}^{1 / 2} ‖ V ‖ \cdot ‖ β_{k} ‖,

where the equality holds since M = MH. When ||β_k|| = o[(n log p)^{−1/2} ||V||_F/||V||], we have $2 β_{k}^{T} G^{T} M ε_{k} / (\sqrt{2} ‖ V ‖_{F}) \to 0$ in probability. Therefore the distribution of $(U_{k} - p σ_{k k}) / (\sqrt{2} ‖ V ‖_{F})$ is asymptotically equivalent to that of $μ_{β, k} + (ε_{k}^{T} M ε_{k} - p σ_{k k}) / (\sqrt{2} ‖ V ‖_{F})$ . Similar to the proof of Theorem 1, we can show that as n, p → ∞,

\frac{U - μ}{\sqrt{2} ‖ V ‖_{F}} \to N (μ_{β}, Ω) .

Therefore under H_a,

T_{1}^{*} = \frac{{(U - μ)}^{T} Ω^{- 1} (U - μ)}{2 ‖ V ‖_{F}^{2}} \to χ_{K}^{2} (μ_{β}^{T} Ω^{- 1} μ_{β})

and

T_{2}^{*} = \frac{\sum_{k = 1}^{K} U_{k} - p \sum_{k = 1}^{K} σ_{k k}}{\sqrt{2} ‖ V ‖_{F} ‖ Σ ‖_{F}} \to N (\sum_{k = 1}^{K} \frac{μ_{β, k}}{‖ Σ ‖_{F}}, 1) .

It follows that $ρ (T_{1}^{*}) \to P [χ_{K}^{2} (μ_{β}^{T} Ω^{- 1} μ_{β}) > χ_{K, 1 - α}^{2}]$ and

ρ (T_{2}^{*}) = P (| T_{2}^{*} | > ζ_{1 - α / 2}) = P (T_{2}^{*} > ζ_{1 - α / 2}) + P (T_{2}^{*} < - ζ_{1 - α / 2}) \to Φ (- ζ_{1 - α / 2} + \sum_{k = 1}^{K} \frac{μ_{β, k}}{‖ Σ ‖_{F}}) + Φ (- ζ_{1 - α / 2} - \sum_{k = 1}^{K} \frac{μ_{β, k}}{‖ Σ ‖_{F}}) .

□
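As a small illustration of the limiting power expression for $T_{2}^{*}$ derived above, the sketch below evaluates Φ(−ζ_{1−α/2} + δ) + Φ(−ζ_{1−α/2} − δ), where δ stands for the standardized signal Σ_k μ_{β,k}/‖Σ‖_F. The function name and the chosen values of δ are hypothetical; only the Python standard library is used.

```python
from statistics import NormalDist

def limiting_power_t2(delta, alpha=0.05):
    """Limiting power of the two-sided T2* test for standardized signal delta."""
    std_normal = NormalDist()                     # standard normal distribution
    zeta = std_normal.inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 critical value
    return std_normal.cdf(-zeta + delta) + std_normal.cdf(-zeta - delta)

# Power at a few illustrative signal strengths (delta = 0 corresponds to the null)
powers = [limiting_power_t2(delta) for delta in (0.0, 1.0, 2.0, 3.0)]
```

At δ = 0 the expression reduces to the nominal level α, and it is symmetric and increasing in |δ|, so signals of opposite sign across responses that cancel in the sum Σ_k μ_{β,k} contribute no power to $T_{2}^{*}$.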

Proof of Theorem 5.

We first consider the power of $T_{1}^{*}$ :

P {{(U - μ)}^{T} Ω^{- 1} (U - μ) > 2 ‖ V ‖_{F}^{2} χ_{K, 1 - α}^{2}} .

Note that

{(U - μ)}^{T} Ω^{- 1} (U - μ) \geq λ_{\min} (Ω^{- 1}) ‖ U - μ ‖^{2} = \sum_{k = 1}^{K} {(Y_{k}^{T} M Y_{k} - p σ_{k k})}^{2} / λ_{\max} (Ω) .

We can further write $Y_{k}^{T} M Y_{k} - p σ_{k k} = β_{k}^{T} G^{T} M G β_{k} + 2 β_{k}^{T} G^{T} M ε_{k} + (ε_{k}^{T} M ε_{k} - p σ_{k k})$ . Next we will bound these three terms separately. First,

\begin{array}{l} β_{k}^{T} G^{T} M G β_{k} = β_{S_{k}}^{T} G_{S_{k}}^{T} H G D^{- 1} G^{T} H G_{S_{k}} β_{S_{k}} \\ = β_{S_{k}}^{T} G_{S_{k}}^{T} H G_{S_{k}} D_{S_{k}}^{- 1} G_{S_{k}}^{T} H G_{S_{k}} β_{S_{k}} + β_{S_{k}}^{T} G_{S_{k}}^{T} H G_{S_{k}^{c}} D_{S_{k}^{c}}^{- 1} G_{S_{k}^{c}}^{T} H G_{S_{k}} β_{S_{k}} \\ \geq β_{S_{k}}^{T} G_{S_{k}}^{T} H G_{S_{k}} D_{S_{k}}^{- 1} G_{S_{k}}^{T} H G_{S_{k}} β_{S_{k}} \\ \geq n {λ_{\min} (n^{- 1 / 2} D_{S_{k}}^{- 1 / 2} G_{S_{k}}^{T} H G_{S_{k}})}^{2} {‖ β_{S_{k}} ‖}^{2} \\ \geq n {λ_{\min} [{(n^{- 1} D_{S_{k}})}^{- 1 / 2}] λ_{\min} (n^{- 1} G_{S_{k}}^{T} H G_{S_{k}})}^{2} {‖ β_{S_{k}} ‖}^{2} \\ \geq n C_{1}^{- 1} C_{2}^{- 2} {‖ β_{k} ‖}^{2}, \end{array}

where the last inequality holds by Assumptions 1 and 2.

Second, as in the proof of Theorem 4, we have $| β_{k}^{T} G^{T} M ε_{k} | = O_{p} [{(n \log p)}^{1 / 2} ‖ V ‖ \cdot ‖ β_{k} ‖]$ . Third, by Theorem 1, we have $(ε_{k}^{T} M ε_{k} - p σ_{k k}) / (\sqrt{2} σ_{k k} ‖ V ‖_{F}) \to N (0, 1)$ .

Thus

| ε_{k}^{T} M ε_{k} - p σ_{k k} | = O_{p} (‖ V ‖_{F}) = O_{p} (\sqrt{p ‖ V ‖^{2}}) = o_{p} (p) .

Putting these bounds together, we have

| \sum_{k = 1}^{K} (Y_{k}^{T} M Y_{k} - p σ_{k k}) | \geq \sum_{k = 1}^{K} β_{k}^{T} G^{T} M G β_{k} - 2 \sum_{k = 1}^{K} | β_{k}^{T} G^{T} M ε_{k} | - \sum_{k = 1}^{K} | ε_{k}^{T} M ε_{k} - p σ_{k k} | \geq n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2} - 2 C_{2}^{1 / 2} {(n \log p)}^{1 / 2} ‖ V ‖ \sum_{k = 1}^{K} ‖ β_{k} ‖ - \sum_{k = 1}^{K} | ε_{k}^{T} M ε_{k} - p σ_{k k} | .

We now show that the first quadratic term dominates the remaining two terms. Recall that $\sum_{k = 1}^{K} ‖ β_{k} ‖ \geq C_{0} \sqrt{p \log p / n}$ for some C₀ > 0. Then, as n, p → ∞,

\frac{2 C_{2}^{1 / 2} {(n \log p)}^{1 / 2} ‖ V ‖ \sum_{k = 1}^{K} ‖ β_{k} ‖}{n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2}} \leq \frac{2 C_{2}^{1 / 2} {(n \log p)}^{1 / 2} ‖ V ‖ \sum_{k = 1}^{K} ‖ β_{k} ‖}{n C_{1}^{- 1} C_{2}^{- 2} {(\sum_{k = 1}^{K} ‖ β_{k} ‖)}^{2} / K} \leq \frac{2 K C_{2}^{1 / 2} {(n \log p)}^{1 / 2} ‖ V ‖}{n C_{1}^{- 1} C_{2}^{- 2} C_{0} \sqrt{p \log p / n}} = O_{p} (\frac{‖ V ‖}{\sqrt{p}}) = o_{p} (1)

and

\frac{\sum_{k = 1}^{K} | ε_{k}^{T} M ε_{k} - p σ_{k k} |}{n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2}} \leq \frac{o_{p} (p)}{n C_{1}^{- 1} C_{2}^{- 2} C_{0}^{2} p \log p / (K n)} = o_{p} (1) .

Therefore, with probability approaching 1,

| \sum_{k = 1}^{K} (Y_{k}^{T} M Y_{k} - p σ_{k k}) | > 0.5 n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2} .

It follows that as n, p → ∞,

P {{(U - μ)}^{T} Ω^{- 1} (U - μ) > 2 ‖ V ‖_{F}^{2} χ_{K, 1 - α}^{2}} \geq P {\sum_{k = 1}^{K} {(Y_{k}^{T} M Y_{k} - p σ_{k k})}^{2} \geq 2 λ_{\max} (Ω) ‖ V ‖_{F}^{2} χ_{K, 1 - α}^{2}} \geq P {{| \sum_{k = 1}^{K} (Y_{k}^{T} M Y_{k} - p σ_{k k}) |}^{2} / K \geq 2 λ_{\max} (Ω) ‖ V ‖_{F}^{2} χ_{K, 1 - α}^{2}} \geq P {0.5 n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2} \geq \sqrt{2 K λ_{\max} (Ω) χ_{K, 1 - α}^{2}} ‖ V ‖_{F}},

which converges to 1 since $\sum_{k = 1}^{K} {‖ β_{k} ‖}^{2} \geq C_{0}^{2} p \log p / (K n)$ and ||V||_F = o_p(p).

Similarly, for the power of $T_{2}^{*}$ , we can show that as n, p → ∞,

P {| \sum_{k = 1}^{K} (U_{k} - p σ_{k k}) | > \sqrt{2} ‖ V ‖_{F} ‖ Σ ‖_{F} ζ_{1 - α / 2}} \geq P {0.5 n C_{1}^{- 1} C_{2}^{- 2} \sum_{k = 1}^{K} {‖ β_{k} ‖}^{2} \geq \sqrt{2} ‖ V ‖_{F} ‖ Σ ‖_{F} ζ_{1 - α / 2}} \to 1.

□

Footnotes

Declarations of interest: none.

Supplementary material

Additional results, supplementary tables, and figures referenced in Sections 2–4 can be found online.

^⋆

For this work, there exist supplementary materials providing additional technical details, simulation studies and real data results.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

[1] Ramanan VK, Shen L, Moore JH, Saykin AJ, Pathway analysis of genomic data: concepts, methods, and prospects for future development, Trends in Genetics 28 (7) (2012) 323–332.
[2] Sun J, Zheng Y, Hsu L, A unified mixed-effects model for rare-variant association in sequencing studies, Genetic Epidemiology 37 (4) (2013) 334–344.
[3] Hwang S, Comparison and evaluation of pathway-level aggregation methods of gene expression data, BMC Genomics 13 (7) (2012) S26.
[4] Goeman JJ, Van De Geer SA, Van Houwelingen HC, Testing against a high dimensional alternative, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (3) (2006) 477–493.
[5] Goeman JJ, Van Houwelingen HC, Finos L, Testing against a high-dimensional alternative in the generalized linear model: asymptotic type i error control, Biometrika (2011) 381–390.
[6] Zhong P-S, Chen SX, Tests for high-dimensional regression coefficients with factorial designs, Journal of the American Statistical Association 106 (493) (2011) 260–274.
[7] Guo B, Chen SX, Tests for high dimensional generalized linear models, Journal of the Royal Statistical Society. Series B (Statistical Methodology) (2016) 1079–1102.
[8] Kong D, Maity A, Hsu F-C, Tzeng J-Y, Testing and estimation in marker-set association study using semiparametric quantile regression kernel machine, Biometrics 72 (2) (2016) 364–371.
[9] Zhou Y-H, Pathway analysis for rna-seq data using a score-based approach, Biometrics 72 (1) (2016) 165–174.
[10] Liu Y, Sun W, Reiner AP, Kooperberg C, He Q, Statistical inference of genetic pathway analysis in high dimensions, Biometrika 106 (3) (2019) 651–651.
[11] Avery CL, He Q, North KE, Ambite JL, Boerwinkle E, Fornage M, Hindorff LA, Kooperberg C, Meigs JB, Pankow JS, et al., A phenomics-based strategy identifies loci on apoc1, brap, and plcg1 associated with metabolic syndrome phenotype domains, PLoS Genetics 7 (10) (2011) e1002322.
[12] Maity A, Sullivan PF, Tzeng J.-i., Multivariate phenotype association analysis by marker-set kernel machine regression, Genetic Epidemiology 36 (7) (2012) 686–695.
[13] Sun J, Oualkacha K, Forgetta V, Zheng H-F, Richards JB, Ciampi A, Greenwood CM, Consortium U, et al., A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects, European Journal of Human Genetics 24 (9) (2016) 1344.
[14] He Q, Liu Y, Peters U, Hsu L, Multivariate association analysis with somatic mutation data, Biometrics 74 (1) (2018) 176–184.
[15] Ma Y, Lan W, Wang H, Testing predictor significance with ultra high dimensional multivariate responses, Computational Statistics and Data Analysis 83 (2015) 275–286.
[16] Qiu Y, Chen SX, Nettleton D, et al., Detecting rare and faint signals via thresholding maximum likelihood estimators, The Annals of Statistics 46 (2) (2018) 895–923.
[17] Wei L-J, Lin DY, Weissfeld L, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions, Journal of the American Statistical Association 84 (408) (1989) 1065–1073.
[18] He Q, Avery CL, Lin D-Y, A general framework for association tests with multivariate traits in large-scale genomics studies, Genetic Epidemiology 37 (8) (2013) 759–767.
[19] He Q, Zhang HH, Avery CL, Lin D, Sparse meta-analysis with high-dimensional data, Biostatistics 17 (2) (2016) 205–220.
[20] Datta J, Smith JJ, Chatila WK, McAuliffe JC, Kandoth C, Vakiani E, Frankel TL, Ganesh K, Wasserman I, Lipsyc-Sharf M, et al., Coaltered ras/b-raf and tp53 is associated with extremes of survivorship and distinct patterns of metastasis in patients with metastatic colorectal cancer, Clinical Cancer Research 26 (5) (2020) 1077–1085.
[21] Sun W, Bunn P, Jin C, Little P, Zhabotynsky V, Perou CM, Hayes DN, Chen M, Lin D-Y, The association between copy number aberration, dna methylation and gene expression in tumor samples, Nucleic Acids Research 46 (6) (2018) 3009–3018.
[22] Liu J, Kumar KS, Yu D, Molton S, McMahon M, Herlyn M, Thomas-Tikhonenko A, Fuchs S, Oncogenic braf regulates β-trcp expression and nf-κb activity in human melanoma cells, Oncogene 26 (13) (2007) 1954–1958.
[23] Cooks T, Pateras IS, Tarcic O, Solomon H, Schetter AJ, Wilder S, Lozano G, Pikarsky E, Forshew T, Rozenfeld N, et al., Mutant p53 prolongs nf-κb activation and promotes chronic inflammation and inflammation-associated colorectal cancer, Cancer Cell 23 (5) (2013) 634–646.
[24] Wang X, Qin L, Zhang H, Zhang Y, Hsu L, Wang P, A regularized multivariate regression approach for eqtl analysis, Statistics in Biosciences 7 (1) (2015) 129–146.
[25] Ainsworth HF, Cordell HJ, Using gene expression data to identify causal pathways between genotype and phenotype in a complex disease: application to genetic analysis workshop 19, BMC Proceedings 10 (7) (2016) 49.
