Multiple Testing of Submatrices of a Precision Matrix with Applications to Identification of Between Pathway Interactions

Yin Xia; Tianxi Cai; T Tony Cai

doi:10.1080/01621459.2016.1251930

Author manuscript; available in PMC: 2018 Jun 5.

Published in final edited form as: J Am Stat Assoc. 2017 Sep 26;113(521):328–339. doi: 10.1080/01621459.2016.1251930

PMCID: PMC5988269 NIHMSID: NIHMS851691 PMID: 29881130

Abstract

Making accurate inference for gene regulatory networks, including inferring about pathway by pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix with each submatrix summarizing the dependence structure between two subgroups of variables.

A novel multiple testing procedure is proposed and both theoretical and numerical properties of the procedure are investigated. Asymptotic null distribution of the test statistic for an individual hypothesis is established and the proposed multiple testing procedure is shown to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP) at the pre-specified level under regularity conditions. Simulations show that the procedure works well in controlling the FDR and has good power in detecting the true interactions. The procedure is applied to a breast cancer gene expression study to identify between pathway interactions.

Keywords: Between pathway interactions, conditional dependence, covariance structure, false discovery proportion, false discovery rate, Gaussian graphical model, multiple testing, precision matrix, testing submatrices

1 Introduction

Simultaneous inference for the interactions among a large number of variables is an important problem in statistics with a wide range of applications. Many statistical methods have been proposed to infer pairwise interactions (e.g., Ritchie et al., 2001; Chatterjee et al., 2006; Kooperberg and Ruczinski, 2005; Kooperberg and LeBlanc, 2008; Fan and Lv, 2008; Cai and Zhang, 2014; Cai and Liu, 2015). Most of the existing methods focus on marginal assessments of pairwise interactions without conditioning on the other variables. Such marginal methods may falsely identify interactions because conditional and unconditional effects can differ. When prior knowledge is available to group the variables of interest, it is often of interest to make simultaneous inference for the interactions at the group level. For example, functionally related genes are often grouped into pathways, and inferring between pathway interactions is important because they represent a majority of the genetic interactions (Kelley and Ideker, 2005).

Motivated by applications in genomics, in this paper we propose methods to efficiently identify between group interactions while accounting for the joint effects from all other variables of interest. Under a Gaussian graphical model framework, we translate the problem of detecting between group interactions into the statistical problem of simultaneous testing of a collection of submatrices of a high-dimensional precision matrix. We first discuss the motivating problem of detecting between pathway interactions before presenting the framework for large-scale multiple testing of submatrices of a high-dimensional precision matrix.

1.1 Detection of Between Pathway Interactions

It is well known that genes interact functionally in networks to orchestrate cellular processes. Biological interactions of genes are often inferred based on co-expression networks since coexpressed genes tend to be functionally related or controlled by the same transcriptional regulatory elements (Weirauch, 2011). Throughout, we use the term gene-gene interaction to refer to their biological interaction, quantified by conditional co-expression (given all other genes), rather than statistical interaction unless specified otherwise. Accurately identifying important gene-gene interactions is a difficult task due to the high dimensionality of the feature space spanned by gene pairs. Particularly in genome-wide studies where the sample sizes are typically small compared to the number of interactions of interest, gene level analyses often produce results that are difficult to interpret or replicate.

One approach to improving interpretability and reproducibility is to incorporate prior biological knowledge, such as gene structure or protein-protein interaction network information, to group functionally related genes into pathways and perform the analysis at the pathway level. Throughout, we use the term pathway to refer generically to a gene group under study, whether or not the group indeed represents a metabolic or signaling pathway. A large number of knowledge bases have become available to assemble biologically meaningful gene groups (Xenarios et al., 2002; Rual et al., 2005; Matthews et al., 2009; Craven and Kumlien, 1999; Khatri et al., 2012). The knowledge bases provide prior information on the biological processes, components, or structures in which individual genes and proteins are involved. Analyzing high-throughput molecular measurements at the functional level is very appealing due to its potential for reducing the complexity of the problem and improving power (Subramanian et al., 2005; Glazko and Emmert-Streib, 2009).

Detecting pathway level interactions is also biologically relevant: because biological systems are complex, pathways often need to function in a coordinated fashion to produce appropriate physiological responses to both internal and external factors. In addition, there is accumulating evidence that complex traits are often influenced by multiple groups of functionally related genes through their dynamic interaction and coregulation (Jia et al., 2011). Therefore, knowledge of the pathway crosstalk network is helpful for inferring the function of complex biological systems (Li et al., 2008). A wide range of between pathway crosstalk has been identified as critical for understanding many diseases, including breast cancer, lung cancer, ovarian cancer, major depressive disorder, and Alzheimer's disease (Osborne et al., 2005; Shou et al., 2004; Jia et al., 2011; Liu et al., 2010; Pan, 2012; Puri et al., 2008).

In addition to applications to the identification of between genetic pathway interactions, the proposed procedures are also useful for other settings. Examples include interactions between biological markers when markers are measured at different time points with multiple measurements of each marker representing one group; and interactions between different brain regions when functional MRI measurements are taken over the entire brain with groups indexed by brain regions. We next describe our proposed framework for detecting between pathway interactions based on testing for submatrices of a high-dimensional precision matrix.

1.2 Multiple Testing of Submatrices of A Precision Matrix

Under a Gaussian graphical model framework, we formulate the problem of identifying between group interactions that account for joint effects from all genes of interest as the statistical problem of simultaneous testing of submatrices of a high-dimensional precision matrix. Let {X₁, · · ·, X_n} be a random sample consisting of n independent copies of a p dimensional Gaussian random vector X ~ N_p(μ,Σ). The precision matrix, which is the inverse of Σ, is denoted by Ω = (ω_i,j). It is well-known that the precision matrix is closely connected to the corresponding Gaussian graph G = (V,E), which represents the conditional independence between components of X = (X₁, …, X_p)^⊤. Here V is the vertex set consisting of the p components X₁, …, X_p and E is the edge set consisting of ordered pairs (i, j), where (i, j) ∈ E if there is an edge between X_i and X_j, indicating that X_i and X_j are conditionally dependent given {X_k, k ≠ i, j}. It is a well-known fact that the conditional independence between X_i and X_j given all other variables is equivalent to ω_i,j = 0. See, e.g., Lauritzen (1996).
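The equivalence between ω_i,j = 0 and conditional independence can be illustrated on a toy example. The following minimal sketch assumes a hypothetical three-variable chain X₁ - X₂ - X₃ with covariance entries ρ^|i−j| (the value ρ = 0.5 and the helper `invert` are illustrative choices, not part of the paper). X₁ and X₃ are marginally correlated, yet the corresponding entry of the precision matrix vanishes.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity: [A | I] is reduced to [I | A^{-1}].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


rho = 0.5  # hypothetical correlation between neighbours in the chain
# Covariance of the chain X1 - X2 - X3: Sigma[i][j] = rho ** |i - j|.
sigma = [[rho ** abs(i - j) for j in range(3)] for i in range(3)]
omega = invert(sigma)  # precision matrix

print("sigma[0][2] =", sigma[0][2])              # marginal covariance of X1 and X3
print("omega[0][2] =", round(omega[0][2], 10))   # precision entry for the pair (X1, X3)
```

A marginal analysis would flag the pair (X₁, X₃) because its covariance is nonzero, whereas the precision matrix shows no edge between them once X₂ is conditioned on.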

Let 𝒥₁, …, 𝒥_M ⊂ {1, …, p} be a collection of prespecified non-overlapping sets which index group memberships (e.g., pathway membership). We wish to test simultaneously the hypotheses of conditional independence between any two gene groups given all remaining genes in the collection, with proper asymptotic control of the false discovery rate (FDR) and false discovery proportion (FDP). It follows from the above discussion that this multiple testing problem can be equivalently formulated as testing the hypotheses on the submatrices of the precision matrix Ω,

H_{0, m, h} : Ω_{𝒥_{m} \times 𝒥_{h}} = 0 versus H_{1, m, h} : Ω_{𝒥_{m} \times 𝒥_{h}} \neq 0, 1 \leq m < h \leq M,

(1)

while controlling the FDR and FDP asymptotically. Hereafter, all results related to the FDR and FDP are studied in the asymptotic regime and we use FDR and FDP as simplifications for the expressions of asymptotic FDR and asymptotic FDP.

Simultaneous testing of between group interactions with FDR control is technically challenging, both in constructing a suitable test statistic and establishing its null distribution for testing the interactions between any two given groups, and in developing a multiple testing procedure that accounts for the multiplicity and dependency with FDR control. To the best of our knowledge, there are currently no available methods with theoretical guarantees for inferring interactions between pre-specified gene groups while adjusting for effects from a large number of other genes. Furthermore, no existing methods allow testing for such group level interactions while properly controlling a desired FDR. Liu (2013) proposed a multiple testing procedure with FDR control for the partial correlations under a Gaussian graphical model. Xia et al. (2015) considered the problem of identifying gene-by-gene interactions associated with a binary trait under a two-sample framework and proposed a procedure for testing the differential network by simultaneously testing entry-wise hypotheses with FDR control. These methods, which can identify the locations of individual gene-by-gene interactions, are however unable to detect the presence of interactions between pairs of gene groups while controlling the FDR at the group level.

In this paper, we propose a novel multiple testing procedure for between group interactions that controls the FDR and FDP asymptotically at any pre-specified level 0 < α < 1. The simultaneous testing procedure is developed in two steps. In the first step, we construct a test statistic for testing the conditional independence of a given pair of variable groups 𝒥_m and 𝒥_h, $H_{0, m, h}$: Ω_{𝒥_m×𝒥_h} = 0, with m ≠ h. The test statistic is based on the Frobenius norm of a standardized submatrix estimate with unknown correlation structure. The estimation of this dependency structure is technically challenging, because correlations among the estimates of the entries of Ω_{𝒥_m×𝒥_h} not only depend on the entries within the submatrix, but also largely depend on the entries outside of it. To incorporate this dependency structure, we estimate the eigenvalues of the correlation matrix of the entry estimates of a given submatrix Ω_{𝒥_m×𝒥_h} through a Kronecker product by estimating the eigenvalues of two partial correlation submatrices R_{𝒥_m×𝒥_m} and R_{𝒥_h×𝒥_h} of $R = D^{- 1 / 2} Ω D^{- 1 / 2}$, where D is the diagonal of Ω. It is shown that the test statistic has asymptotically the same limiting null distribution as a mixture of $χ_{1}^{2}$ variables with the estimated correlation structure.

In the second step, we construct a simultaneous testing procedure based on these test statistics. A major difficulty here is that the correlation structures of the entry estimates vary across different submatrices. Consequently the limiting null distributions of the test statistics for different submatrices are different. We introduce a normal quantile transformation for each test statistic, and the transformed test statistics are shown to have asymptotically the same distribution as the absolute value of a standard normal random variable under the null. Based on them, we develop a multiple testing procedure to account for the multiplicity in testing a large number of hypotheses so that the overall FDR and FDP are controlled.

Both the theoretical and numerical properties of the proposed procedure are investigated. The theoretical results show that, under regularity conditions, the proposed procedure asymptotically controls both the overall FDR and FDP at the pre-specified level. As a comparison, it is discussed in Section 4.3 that a direct application of the well-known B-H procedure (Benjamini and Hochberg, 1995) to the individual test statistics is not able to control the FDP when the number of true alternatives is fixed. Simulation studies are carried out to examine the numerical performance of the multiple testing procedure in various settings. The results show that the procedure performs well numerically in terms of both the size and power of the test. We also consider a simulation setting that is similar to the breast cancer gene expression data analyzed in this paper by mimicking the true sizes of the gene groups in the breast cancer study. The result shows that the FDR is well controlled and this new group level based method significantly outperforms the alternative procedures.

Finally, we apply the proposed procedure to assess the between pathway interactions in a breast cancer gene expression study. Many of the identified interactions are consistent with those reported in the literature.

1.3 Structure of the Paper

The rest of the paper is organized as follows. We give a detailed construction of the statistic for testing a specific submatrix of a precision matrix in Section 2. The limiting null distribution of the test statistic and the theoretical properties of the testing procedure are obtained in Section 3. A multiple testing procedure for simultaneously assessing a collection of submatrices is proposed and its theoretical properties are established in Section 4. Simulation results demonstrating the performance of the proposed methods in finite sample are given in Section 5. In Section 6, we apply the new multiple testing procedure to a breast cancer gene expression study to identify between pathway interactions. A discussion on possible extensions is given in Section 7. All proofs are contained in the supplement Xia et al. (2016).

2 Testing A Given Submatrix

We consider in this section testing a given submatrix of the precision matrix Ω,

H_{0} : Ω_{ℐ \times 𝒥} = 0 versus H_{1} : Ω_{ℐ \times 𝒥} \neq 0,

(2)

under the framework of Section 1.2, where ℐ and 𝒥 index two non-overlapping gene groups. A rejection of H₀ means that at least one pair of variables from ℐ and 𝒥 are not conditionally independent from each other given all other variables. As the group information is considered as prior knowledge, performing analysis at the group level is more appealing than the entrywise procedure as discussed in Section 1. We shall construct a test statistic for H₀, corresponding to no interactions between gene groups ℐ and 𝒥 conditional on all other genes. Related works on testing for independence and conditional independence between random vectors can be found in, e.g., Gieser and Randles (1997); Um and Randles (2001); Beran et al. (2007); Su and White (2007, 2008); and Huang et al. (2010).

2.1 Notation and Definitions

Denote by A ⊗ B the Kronecker product of matrices A and B. For a vector $β = (β_{1}, \dots, β_{p})^{⊤} \in ℝ^{p}$, define the ℓ_q norm by $|β|_{q} = (\sum_{i = 1}^{p} |β_{i}|^{q})^{1 / q}$ for 1 ≤ q ≤ ∞. For any vector μ with dimension p × 1, let $μ_{- i}$ denote the (p − 1) × 1 vector obtained by removing the i-th entry from μ. For a symmetric matrix A, let $λ_{\max}(A)$ and $λ_{\min}(A)$ denote the largest and smallest eigenvalues of A. For any p × q matrix A, $A_{i, - j}$ denotes the i-th row of A with its j-th entry removed and $A_{- i, j}$ denotes the j-th column of A with its i-th entry removed. $A_{- i, - j}$ denotes the (p − 1) × (q − 1) submatrix of A with its i-th row and j-th column removed. $A_{r \times c}$ denotes the submatrix of A corresponding to the row vector r and column vector c. For an n × p data matrix $U = (U_{1}, \dots, U_{n})^{⊤}$, denote the n × (p − 1) matrix $U_{\cdot, - i} = (U_{1, - i}^{⊤}, \dots, U_{n, - i}^{⊤})^{⊤}$ . Let $\bar{U}_{\cdot, - i} = \frac{1}{n} \sum_{k = 1}^{n} U_{k, - i}$ with dimension 1 × (p − 1), $U_{(i)} = (U_{1, i}, \dots, U_{n, i})^{⊤}$ with dimension n × 1, $\bar{U}_{(i)} = (\bar{U}_{i}, \dots, \bar{U}_{i})^{⊤}$ with dimension n × 1, where $\bar{U}_{i} = \frac{1}{n} \sum_{k = 1}^{n} U_{k, i}$ , and $\bar{U}_{(\cdot, - i)} = (\bar{U}_{\cdot, - i}^{⊤}, \dots, \bar{U}_{\cdot, - i}^{⊤})^{⊤}$ with dimension n × (p − 1). For a matrix $Ω = (ω_{i, j})_{p \times p}$, the matrix 1-norm is defined by $\|Ω\|_{L_{1}} = \max_{1 \leq j \leq p} \sum_{i = 1}^{p} |ω_{i, j}|$ and the matrix element-wise infinity norm is defined to be $\|Ω\|_{\infty} = \max_{1 \leq i, j \leq p} |ω_{i, j}|$. For a set ℋ, denote by |ℋ| the cardinality of ℋ. For two sequences of real numbers {a_n} and {b_n}, write a_n = O(b_n) if there exists a constant C such that |a_n| ≤ C|b_n| holds for all n, write a_n = o(b_n) if $\lim_{n \to \infty} a_{n} / b_{n} = 0$, and write a_n ≍ b_n if $\lim_{n \to \infty} a_{n} / b_{n} = 1$.

2.2 Testing Procedure

We shall first define a standardized estimate W_i,j for each individual entry of the precision matrix, which is the one-sample version of the estimates proposed in Xia et al. (2015), then propose a novel test statistic S_ℐ×𝒥 based on the sum of all possible $W_{i, j}^{2}$ , for (i, j) ∈ ℐ ×𝒥.

It is well known that in the Gaussian setting, the precision matrix can be described in terms of the regression models, see, e.g., Section 2.5 in Anderson (2003). Specifically, we may write

X_{k, i} = α_{i} + X_{k, - i} β_{i} + ε_{k, i}, 1 \leq k \leq n,

(3)

where $ε_{k, i} \sim N (0, σ_{i, i} - Σ_{i, - i} Σ_{- i, - i}^{- 1} Σ_{- i, i})$ is independent of $X_{k, - i}$, and $α_{i} = μ_{i} - Σ_{i, - i} Σ_{- i, - i}^{- 1} μ_{- i}$ . The regression coefficient vector $β_{i}$ and the error terms $ε_{k, i}$ satisfy

β_{i} = - ω_{i, i}^{- 1} Ω_{- i, i} and r_{i, j} \equiv Cov (ε_{k, i}, ε_{k, j}) = ω_{i, j} / (ω_{i, i} ω_{j, j}) .
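The two identities above can be checked numerically. The sketch below assumes a hypothetical 3 × 3 precision matrix (the entries and the helper names are illustrative only). It verifies that β_i = −Ω_{−i,i}/ω_{i,i} solves the population normal equations Σ_{−i,−i} β_i = Σ_{−i,i}, and that the covariance of the regression errors equals ω_{i,j}/(ω_{i,i} ω_{j,j}), using the representation ε_{k,i} = Ω_{i,·}(X_k − μ)/ω_{i,i} implied by (3).

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical precision matrix (symmetric and positive definite).
omega = [[2.0, -1.0, 0.5],
         [-1.0, 2.0, -1.0],
         [0.5, -1.0, 2.0]]
sigma = invert(omega)
p = len(omega)


def regression_coefficients(i):
    """beta_i = -Omega_{-i,i} / omega_{i,i}, ordered by the remaining indices."""
    return [-omega[j][i] / omega[i][i] for j in range(p) if j != i]


def normal_equation_gap(i):
    """Largest entry of |Sigma_{-i,-i} beta_i - Sigma_{-i,i}|."""
    rest = [j for j in range(p) if j != i]
    beta = regression_coefficients(i)
    return max(abs(sum(sigma[a][b] * beta[k] for k, b in enumerate(rest)) - sigma[a][i])
               for a in rest)


def residual_covariance(i, j):
    """Cov(eps_i, eps_j), using eps_i = Omega_{i,.}(X - mu) / omega_{i,i}."""
    a_i = [omega[i][k] / omega[i][i] for k in range(p)]
    a_j = [omega[j][k] / omega[j][j] for k in range(p)]
    return sum(a_i[k] * sigma[k][l] * a_j[l] for k in range(p) for l in range(p))


max_beta_gap = max(normal_equation_gap(i) for i in range(p))
max_cov_gap = max(abs(residual_covariance(i, j) - omega[i][j] / (omega[i][i] * omega[j][j]))
                  for i in range(p) for j in range(p))
print(max_beta_gap, max_cov_gap)  # both should be at the level of rounding error
```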

As in Xia et al. (2015), we first develop an estimator of ω_i,j and then base the test on its bias corrected standardization. We begin by constructing estimators of r_i,j.

Let $\hat{β}_{i} = (\hat{β}_{1, i}, \dots, \hat{β}_{p - 1, i})^{⊤}$ be estimators of $β_{i}$ satisfying

\max_{1 \leq i \leq p} |\hat{β}_{i} - β_{i}|_{1} = o_{P} \{(\log p)^{- 1}\},

(4)

\max_{1 \leq i \leq p} |\hat{β}_{i} - β_{i}|_{2} = o_{P} \{(n \log p)^{- 1 / 4}\} .

(5)

Such estimators can be obtained easily via standard methods such as the Lasso and the Dantzig selector; see, e.g., Xia et al. (2015), Section 2.3. Specifically, if we use the Lasso estimator (see (18) in Section 5), then equations (4) and (5) are satisfied under condition (C1) in Section 3 and the sparsity condition $\max_{1 \leq i \leq p} |β_{i}|_{0} = o \{n^{1 / 2} / (\log p)^{3 / 2}\}$.

Define the fitted residuals by

{\hat{ε}}_{k, i} = X_{k, i} - {\bar{X}}_{i} - (X_{k, - i} - {\bar{X}}_{- i}) {\hat{β}}_{i},

where ${\bar{X}}_{i} = \frac{1}{n} \sum_{k = 1}^{n} X_{k, i}, {\bar{X}}_{- i} = \frac{1}{n} \sum_{k = 1}^{n} X_{k, - i}$ . A natural estimator of r_i,j is the sample covariance between the residuals

{\tilde{r}}_{i, j} = \frac{1}{n} \sum_{k = 1}^{n} {\hat{ε}}_{k, i} {\hat{ε}}_{k, j} .

(6)

However, when i ≠ j, r̃_i,j tends to be biased due to the correlation induced by the estimated parameters. Xia et al. (2015) proposed a bias corrected estimator of r_i,j as

{\hat{r}}_{i, j} = - ({\tilde{r}}_{i, j} + {\tilde{r}}_{i, i} {\hat{β}}_{i, j} + {\tilde{r}}_{j, j} {\hat{β}}_{j - 1, i}), for 1 \leq i < j \leq p .

For i = j, we let $\hat{r}_{i, i} = \tilde{r}_{i, i}$, which is a nearly unbiased estimator of $r_{i, i}$. For 1 ≤ i < j ≤ p, a natural estimator of $ω_{i, j}$ can then be defined by

T_{i, j} = {\hat{r}}_{i, j} / ({\hat{r}}_{i, i} \cdot {\hat{r}}_{j, j}) .

Since {T_i,j, 1 ≤ i < j ≤ p} are heteroscedastic and can possibly have a wide range of variability, we shall first standardize T_i,j. To estimate its variance, note that

θ_{i, j} \equiv Var (ε_{k, i} ε_{k, j} / (r_{i, i} r_{j, j})) / n = (1 + ρ_{i, j}^{2}) / (n r_{i, i} r_{j, j}),

where $ρ_{i, j}^{2} = β_{i, j}^{2} r_{i, i} / r_{j, j}$ . Then θ_i,j can be estimated by ${\hat{θ}}_{i, j} = (1 + {\hat{β}}_{i, j}^{2} {\hat{r}}_{i, i} / {\hat{r}}_{j, j}) / (n {\hat{r}}_{i, i} {\hat{r}}_{j, j})$ .
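The variance identity above follows from the fourth-moment (Isserlis) formula for jointly Gaussian variables. The worked steps below use only the relations r_{i,i} = 1/ω_{i,i} and r_{i,j} = ω_{i,j}/(ω_{i,i} ω_{j,j}) stated after (3); reading β_{i,j} as the coefficient of X_i in the regression of X_j, so that β_{i,j} = −ω_{i,j}/ω_{j,j}, is our interpretation of the indexing.

```latex
\begin{aligned}
E\bigl(\varepsilon_{k,i}^{2}\,\varepsilon_{k,j}^{2}\bigr)
  &= r_{i,i}\,r_{j,j} + 2\,r_{i,j}^{2}, \\
\operatorname{Var}\bigl(\varepsilon_{k,i}\,\varepsilon_{k,j}\bigr)
  &= E\bigl(\varepsilon_{k,i}^{2}\,\varepsilon_{k,j}^{2}\bigr) - r_{i,j}^{2}
   = r_{i,i}\,r_{j,j} + r_{i,j}^{2}, \\
\theta_{i,j}
  &= \frac{r_{i,i}\,r_{j,j} + r_{i,j}^{2}}{n\,(r_{i,i}\,r_{j,j})^{2}}
   = \frac{1 + \rho_{i,j}^{2}}{n\,r_{i,i}\,r_{j,j}},
   \qquad \rho_{i,j}^{2} = \frac{r_{i,j}^{2}}{r_{i,i}\,r_{j,j}}, \\
\beta_{i,j}^{2}\,\frac{r_{i,i}}{r_{j,j}}
  &= \frac{\omega_{i,j}^{2}}{\omega_{j,j}^{2}} \cdot \frac{\omega_{j,j}}{\omega_{i,i}}
   = \frac{\omega_{i,j}^{2}}{\omega_{i,i}\,\omega_{j,j}}
   = \frac{r_{i,j}^{2}}{r_{i,i}\,r_{j,j}} .
\end{aligned}
```

The last line confirms that the two expressions for ρ²_{i,j} agree, and it shows that ρ²_{i,j} is a squared partial correlation and is therefore at most 1.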

Define the standardized statistics

W_{i, j} = T_{i, j} / {({\hat{θ}}_{i, j})}^{1 / 2}, for 1 \leq i < j \leq p .

(7)

Finally, we propose the following test statistic for testing a given submatrix Ω_ℐ×𝒥,

S_{ℐ \times 𝒥} = \sum_{(i, j) \in ℐ \times 𝒥} W_{i, j}^{2} .

(8)

We detail in Section 3 statistical properties of the proposed test statistic.

3 Theories on Testing A Given Submatrix

In this section, we investigate the theoretical properties including the limiting null distribution and the asymptotic power. We first show that the null distribution of S_ℐ×𝒥 converges to the distribution of a mixture of $χ_{1}^{2}$ variables as (n, p) → ∞ and then demonstrate that the test based on S_ℐ×𝒥 is powerful under a large collection of alternatives.

3.1 Asymptotic Null Distribution

Before studying the null distribution of S_ℐ×𝒥, we first introduce the following condition on the eigenvalues of Ω, which is a common assumption in the high-dimensional setting (Cai et al., 2013; Xia et al., 2015; Liu, 2013).

(C1) Assume that $\log p = o (n^{1 / 5})$, and for some constant C₀ > 0, $C_{0}^{- 1} \leq λ_{\min} (Ω) \leq λ_{\max} (Ω) \leq C_{0}$ . Suppose |𝒥_m| does not depend on n and p for 1 ≤ m ≤ M.

Let D be the diagonal of Ω and let $(η_{i, j}) =: R = D^{- 1 / 2} Ω D^{- 1 / 2}$. Under H₀, for (i₁, j₁), (i₂, j₂) ∈ ℐ × 𝒥, the covariance between the standardized statistics $W_{i_{1}, j_{1}}$ and $W_{i_{2}, j_{2}}$, as defined in (7), is approximately equal to $η_{i_{1}, i_{2}} η_{j_{1}, j_{2}}$, and thus can be estimated by $\tilde{T}_{i_{1}, i_{2}} \tilde{T}_{j_{1}, j_{2}}$, where $\tilde{T} := (\tilde{T}_{i, j})_{p \times p}$ with $\tilde{T}_{i, j} = \hat{r}_{i, j} / \sqrt{\hat{r}_{i, i} \hat{r}_{j, j}}$ . Thus, we shall estimate the covariance matrix of $\{W_{i, j}, (i, j) \in ℐ \times 𝒥\}$ by the Kronecker product of $\tilde{T}_{ℐ \times ℐ}$ and $\tilde{T}_{𝒥 \times 𝒥}$. Let $\hat{Λ}_{ℐ} = (\hat{λ}_{1}^{ℐ}, \dots, \hat{λ}_{|ℐ|}^{ℐ})^{⊤}$ and $\hat{Λ}_{𝒥} = (\hat{λ}_{1}^{𝒥}, \dots, \hat{λ}_{|𝒥|}^{𝒥})^{⊤}$ be the eigenvalues of $\tilde{T}_{ℐ \times ℐ}$ and $\tilde{T}_{𝒥 \times 𝒥}$ respectively. We then estimate the eigenvalues of the covariance matrix of $\{W_{i, j}, (i, j) \in ℐ \times 𝒥\}$ by $\hat{Λ}^{ℐ \times 𝒥} = (\hat{λ}_{1}^{ℐ \times 𝒥}, \dots, \hat{λ}_{K}^{ℐ \times 𝒥})^{⊤}$, which is the vectorized $\hat{Λ}_{ℐ} ⊗ \hat{Λ}_{𝒥}$, where K = |ℐ||𝒥|. The following theorem states the asymptotic null distribution of $S_{ℐ \times 𝒥}$.

Theorem 1

Suppose that (C1), (4) and (5) hold. Then under H₀: Ω_ℐ×𝒥 = 0, for any given t ∈ ℝ, we have

\frac{P (S_{ℐ \times 𝒥} \leq t)}{P (\sum_{l = 1}^{K} \hat{λ}_{l}^{ℐ \times 𝒥} Z_{l}^{2} \leq t)} \to 1,

(9)

as (n, p) → ∞, where $(Z_{1}, \dots, Z_{K}) \sim N (0, I_{K \times K})$.
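The Kronecker step that produces the eigenvalues appearing in Theorem 1 relies on the fact that the eigenvalues of a Kronecker product are the pairwise products of the eigenvalues of its factors. The following minimal sketch checks this on two hypothetical 2 × 2 correlation blocks standing in for the estimated blocks indexed by ℐ and 𝒥 (the numerical values and function names are illustrative assumptions).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def kron_eigenvalues(lam_a, lam_b):
    """Vectorized Kronecker product of two eigenvalue lists: all pairwise products."""
    return [x * y for x in lam_a for y in lam_b]


# Hypothetical correlation blocks for the two groups.
t_i = [[1.0, 0.6], [0.6, 1.0]]
t_j = [[1.0, -0.3], [-0.3, 1.0]]
# For [[1, c], [c, 1]] the eigenpairs are (1 + c, (1, 1)) and (1 - c, (1, -1)).
eig_i = [(1.6, [1.0, 1.0]), (0.4, [1.0, -1.0])]
eig_j = [(0.7, [1.0, 1.0]), (1.3, [1.0, -1.0])]

big = kron(t_i, t_j)  # 4 x 4 matrix, K = |I||J| = 4
lam = kron_eigenvalues([e[0] for e in eig_i], [e[0] for e in eig_j])

# Check (T_I kron T_J)(u kron v) = (lambda * mu)(u kron v) for every pair of eigenpairs.
max_residual = 0.0
for lam_u, u in eig_i:
    for lam_v, v in eig_j:
        w = [x * y for x in u for y in v]
        image = [sum(row[c] * w[c] for c in range(4)) for row in big]
        max_residual = max(max_residual,
                           max(abs(image[r] - lam_u * lam_v * w[r]) for r in range(4)))

print(lam, sum(lam), max_residual)
```

Because both blocks have unit diagonal, the K product eigenvalues sum to K, so the mixture has the same mean as a chi-square with K degrees of freedom while its spread depends on the within-group correlations.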

Remark 1

The difficulty of Theorem 1 comes from the fact that, though Ω_ℐ×𝒥 = 0 under the null, the entries {ε_k,iε_k,j, (i, j) ∈ ℐ×𝒥} can still be highly dependent on each other, and their correlations depend on the entries outside of the submatrix Ω_ℐ×𝒥. Thus, the distribution of S_ℐ×𝒥 cannot simply be approximated by the chi-square distribution. In fact, if the chi-square approximation is used in the FDR control procedure of Section 4, the threshold chosen for each statistic will be miscalibrated and, as a result, the FDR cannot be controlled at the pre-specified level α; that is, the FDR will be much larger than α.

Theorem 1 shows that S_ℐ×𝒥 has a different asymptotic null distribution for each submatrix Ω_ℐ×𝒥. We therefore introduce the normal quantile transformation of S_ℐ×𝒥,

N_{ℐ \times 𝒥} = Φ^{- 1} \{1 - P (\sum_{l = 1}^{K} \hat{λ}_{l}^{ℐ \times 𝒥} Z_{l}^{2} \geq S_{ℐ \times 𝒥}) / 2\},

where Φ(t) = P(N(0, 1) ≤ t) is the standard normal cumulative distribution function (cdf) and S_ℐ×𝒥 is the observed value. Thus, we have $P (|N (0, 1)| \geq N_{ℐ \times 𝒥}) = P (\sum_{l = 1}^{K} \hat{λ}_{l}^{ℐ \times 𝒥} Z_{l}^{2} \geq S_{ℐ \times 𝒥})$ . Since S_ℐ×𝒥 and $\sum_{l = 1}^{K} \hat{λ}_{l}^{ℐ \times 𝒥} Z_{l}^{2}$ asymptotically have the same distribution under the null by Theorem 1, N_ℐ×𝒥 asymptotically has the same distribution as the absolute value of a standard normal random variable. We then define the test $Φ_{α}^{ℐ \times 𝒥}$ by

Φ_{α}^{ℐ \times 𝒥} = I \{N_{ℐ \times 𝒥} \geq Φ^{- 1} (1 - α / 2)\} .

(10)

The hypothesis H₀: Ω_ℐ×𝒥 = 0 is rejected whenever $Φ_{α}^{I \times J} = 1$ .
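The normal quantile transformation can be sketched in a few lines. The example below is only an illustration under stated assumptions: it replaces the analytic approximation used in the paper's numerical studies with a plain Monte Carlo estimate of the mixture tail probability, and the eigenvalues, the standardized entries, the number of draws, and the function names are all hypothetical.

```python
import random
from statistics import NormalDist


def mixture_tail_mc(lams, s, draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_l lams[l] * Z_l**2 >= s) for iid N(0, 1) Z_l."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        q = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lams)
        if q >= s:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive, so Phi^{-1} stays finite.
    return (hits + 1) / (draws + 1)


def normal_quantile_transform(lams, s, **mc_options):
    """Return (N, p) with N = Phi^{-1}(1 - p / 2) and p the mixture tail probability at s."""
    p_value = mixture_tail_mc(lams, s, **mc_options)
    return NormalDist().inv_cdf(1.0 - p_value / 2.0), p_value


# Hypothetical inputs: four standardized entries W_{i,j} and four estimated eigenvalues.
w_values = [1.2, -0.4, 2.1, 0.3]
lams = [1.12, 2.08, 0.28, 0.52]
s_stat = sum(w * w for w in w_values)  # the statistic S in (8)
n_stat, p_value = normal_quantile_transform(lams, s_stat)
print(f"S = {s_stat:.2f}, p = {p_value:.4f}, N = {n_stat:.3f}")
```

As a sanity check of the construction, when K = 1 and the single eigenvalue equals 1 the mixture is a $χ_{1}^{2}$ variable and the transformation returns approximately the square root of the statistic, i.e. the absolute value of the single standardized entry.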

Remark 2

The eigenvalues { ${\hat{λ}}_{l}^{I \times J}$ , l = 1, …, K} are calculated based on T̃_ℐ×ℐ and T̃_𝒥×𝒥 as described earlier. Given the values of { ${\hat{λ}}_{l}^{I \times J}$ , l = 1, …, K}, the distribution of the mixture of $χ_{1}^{2}$ variables $\sum_{l = 1}^{K} {\hat{λ}}_{l}^{I \times J} Z_{l}^{2}$ can be approximated by a non-central chi-squared distribution with the parameters determined by the first four cumulants of the quadratic form, see, e.g., Liu et al. (2009). We will use this approximation in our numerical studies.
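A cumulant-matching approximation in the spirit of Liu et al. (2009) can be sketched as follows. This is not the authors' implementation, and several points are assumptions on our part. Writing c_k = Σ_l λ_l^k, our reading of that method is that the matched noncentrality parameter is zero whenever c₃² ≤ c₂c₄, which always holds for a mixture of central $χ_{1}^{2}$ variables by the Cauchy-Schwarz inequality; the approximation then reduces to a central chi-square with c₂³/c₃² degrees of freedom after matching the mean and variance. The chi-square tail is computed with a simple power series for the incomplete gamma function, which is adequate for moderate arguments only. All function names are hypothetical.

```python
import math


def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0
    n = 0
    while term > 1e-16 * total and n < 100_000:
        n += 1
        term *= x / (a + n)
        total += term
    log_prefactor = -x + a * math.log(x) - math.lgamma(a + 1.0)
    return min(1.0, total * math.exp(log_prefactor))


def mixture_tail_liu(lams, s):
    """Approximate P(sum_l lams[l] * Z_l**2 >= s) by a moment-matched chi-square tail."""
    c1 = sum(lams)
    c2 = sum(lam ** 2 for lam in lams)
    c3 = sum(lam ** 3 for lam in lams)
    if c3 <= 0.0:
        raise ValueError("this sketch assumes eigenvalues with a positive cubic sum")
    mu_q, sigma_q = c1, math.sqrt(2.0 * c2)    # mean and sd of the mixture
    s1 = c3 / c2 ** 1.5                        # standardized third-cumulant term
    df = 1.0 / s1 ** 2                         # matched degrees of freedom, c2^3 / c3^2
    mu_x, sigma_x = df, math.sqrt(2.0 * df)    # mean and sd of chi-square(df)
    x = (s - mu_q) / sigma_q * sigma_x + mu_x  # map s onto the chi-square scale
    if x <= 0.0:
        return 1.0
    return 1.0 - gamma_p(df / 2.0, x / 2.0)


# Hypothetical eigenvalues and observed statistic.
lams = [1.12, 2.08, 0.28, 0.52]
tail = mixture_tail_liu(lams, 6.1)
print(f"approximate tail probability: {tail:.4f}")
```

When all eigenvalues are equal to 1 the matched degrees of freedom equal K and the mapping is the identity, so the sketch reproduces the exact chi-square tail in that special case.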

3.2 Asymptotic Power

We now turn to analyze the power of the test $Φ_{α}^{I \times J}$ given in (10). For a given pair of index sets ℐ and 𝒥, we shall first define the following class of precision matrices

𝒲_{ℐ \times 𝒥} (α, β) = \{Ω : \sum_{(i, j) \in ℐ \times 𝒥} \frac{ω_{i, j}^{2}}{θ_{i, j}} \geq (2 + δ) (Ψ_{1 - α}^{2} + Ψ_{1 - β}^{2})\},

(11)

for any δ > 0, where $Ψ_{1 - α}$ is the 1 − α quantile of $\sum_{l = 1}^{K} \hat{λ}_{l}^{ℐ \times 𝒥} Z_{l}^{2}$ as defined in Theorem 1.

The next theorem shows that the test $Φ_{α}^{I \times J}$ is able to asymptotically distinguish the null parameter set in which Ω_ℐ×𝒥 = 0 from 𝒲_ℐ×𝒥 (α, β) for arbitrarily small constant δ > 0, with β → 0.

Theorem 2

Suppose that (C1), (4) and (5) hold. Then we have, for any constant δ > 0,

\inf_{Ω \in 𝒲_{ℐ \times 𝒥} (α, β)} P (Φ_{α}^{ℐ \times 𝒥} = 1) \geq 1 - β, as n, p \to \infty .

(12)

Since $θ_{i, j}$ is of order 1/n, Theorem 2 shows that the proposed test rejects the null hypothesis H₀: Ω_ℐ×𝒥 = 0 with high probability for a large class of precision matrices satisfying the condition that there exists one entry of the submatrix Ω_ℐ×𝒥 having a magnitude larger than $C / n^{1 / 2}$ for $C = \{2 (2 + δ) C_{0}^{2} (Ψ_{1 - α}^{2} + Ψ_{1 - β}^{2})\}^{1 / 2}$ , where C₀ is given in Condition (C1).
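The constant C can be recovered in a few lines. The steps below are a worked sketch that uses r_{i,i} = 1/ω_{i,i}, the bound ρ²_{i,j} ≤ 1 for a squared partial correlation, and ω_{i,i} ≤ λ_max(Ω) ≤ C₀ from (C1); the index pair (i₀, j₀) denotes a hypothetical entry of the submatrix that exceeds the stated magnitude.

```latex
\begin{aligned}
\theta_{i,j}
  &= \frac{1 + \rho_{i,j}^{2}}{n\, r_{i,i}\, r_{j,j}}
   = \frac{(1 + \rho_{i,j}^{2})\,\omega_{i,i}\,\omega_{j,j}}{n}
   \le \frac{2\,C_{0}^{2}}{n}, \\
\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \frac{\omega_{i,j}^{2}}{\theta_{i,j}}
  &\ge \frac{\omega_{i_{0},j_{0}}^{2}}{\theta_{i_{0},j_{0}}}
   \ge \frac{C^{2}}{n} \cdot \frac{n}{2\,C_{0}^{2}}
   = \frac{C^{2}}{2\,C_{0}^{2}}
   = (2 + \delta)\bigl(\Psi_{1-\alpha}^{2} + \Psi_{1-\beta}^{2}\bigr)
   \quad \text{whenever } |\omega_{i_{0},j_{0}}| \ge C\, n^{-1/2} .
\end{aligned}
```

Hence a single entry of magnitude at least C/n^{1/2} is enough to place Ω in the class defined in (11).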

4 Multiple Testing of Submatrices with FDR Control

In practice, there are typically many pathways under investigation and it is often of significant interest to identify which pairs of the pathways interact with each other. A natural approach to investigate interactions among the M pathways, indexed by {𝒥_m,m = 1, …, M}, is to carry out simultaneous testing of

H_{0, m, h} : Ω_{𝒥_{m} \times 𝒥_{h}} = 0 versus H_{1, m, h} : Ω_{𝒥_{m} \times 𝒥_{h}} \neq 0, for 1 \leq m < h \leq M,

(13)

where 𝒥₁, …, 𝒥_M ⊂ {1, …, p} is a collection of pre-specified non-overlapping index sets. In this section, we introduce a multiple testing procedure with FDR and FDP control for testing the collection of ℳ = M(M − 1)/2 hypotheses, and we shall assume that ℳ is large. Let L_m denote the cardinality of 𝒥_m, which is assumed not to depend on n or p for 1 ≤ m ≤ M. Let ℋ = {(m, h): 1 ≤ m < h ≤ M}, let ℋ₀ = {(m, h): Ω_{𝒥_m×𝒥_h} = 0, 1 ≤ m < h ≤ M} be the set of true nulls, and let ℋ₁ = ℋ\ℋ₀ be the set of true alternatives. We shall assume that |ℋ₁| is relatively small compared to |ℋ|, an assumption that arises frequently in many contemporary applications.

4.1 Multiple Testing Procedure

Recall that the standardization of $T_{i, j}$ is defined by $W_{i, j} = T_{i, j} / (\hat{θ}_{i, j})^{1 / 2}$ as in (7), and the test statistic S_{𝒥_m×𝒥_h} is defined based on $W_{i, j}$ as in (8). Theorem 1 shows that S_{𝒥_m×𝒥_h} has a different asymptotic null distribution for each submatrix Ω_{𝒥_m×𝒥_h}. Thus, as discussed in Section 3.1, the normal quantile transformation of S_{𝒥_m×𝒥_h} is defined by

N_{𝒥_{m} \times 𝒥_{h}} = Φ^{- 1} \{1 - P (\sum_{l = 1}^{L_{m} L_{h}} \hat{λ}_{l}^{𝒥_{m} \times 𝒥_{h}} Z_{l}^{2} \geq S_{𝒥_{m} \times 𝒥_{h}}) / 2\},

and N_{𝒥_m×𝒥_h} approximately has the same distribution as the absolute value of a standard normal random variable under the null $H_{0, m, h}$. Let t be the threshold level such that $H_{0, m, h}$ is rejected if N_{𝒥_m×𝒥_h} ≥ t. For any given t, denote the total number of false positives by

R_{0} (t) = \sum_{(m, h) \in ℋ_{0}} I \{N_{𝒥_{m} \times 𝒥_{h}} \geq t\},

(14)

and the total number of rejections by

R (t) = \sum_{(m, h) \in ℋ} I \{N_{𝒥_{m} \times 𝒥_{h}} \geq t\} .

(15)

The false discovery proportion (FDP) and false discovery rate (FDR) are defined as

FDP (t) = \frac{R_{0} (t)}{R (t) \lor 1} and FDR (t) = E [FDP (t)] .

An ideal choice of t is

t_{0} = \inf \{0 \leq t \leq \sqrt{2 \log ℳ} : \frac{R_{0} (t)}{R (t) \lor 1} \leq α\},

which would reject as many true positives as possible while controlling the FDR at the prespecified level α. However, the total number of false positives, R₀(t), is unknown as the set ℋ₀ is unknown. We propose to estimate R₀(t) by 2(1 − Φ(t))|ℋ₀| and simply estimate |ℋ₀| by ℳ because the number of true alternatives is relatively small. This leads to the following multiple testing procedure with FDR control.

1. Calculate the test statistics N_{𝒥_m×𝒥_h} for all (m, h) ∈ ℋ.
2. For given 0 ≤ α ≤ 1, calculate
$\hat{t} = \inf \{0 \leq t \leq \sqrt{2 \log ℳ - 2 \log \log ℳ} : \frac{2 ℳ (1 - Φ (t))}{R (t) \lor 1} \leq α\} .$ (16)

If $\hat{t}$ in (16) does not exist, then set $\hat{t} = \sqrt{2 \log ℳ}$ .
3. For (m, h) ∈ ℋ, reject $H_{0, m, h}$ if N_{𝒥_m×𝒥_h} ≥ $\hat{t}$.
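The thresholding step can be sketched as follows, taking the transformed statistics as given. The function name and the example values are hypothetical, and the search strategy is our own choice: because the estimated number of false rejections 2ℳ(1 − Φ(t)) is continuous and decreasing in t while R(t) is a non-increasing step function, the infimum in (16), when it exists, is either 0 or a point where 2ℳ(1 − Φ(t)) = αk for some integer k, so it suffices to scan those candidates in increasing order.

```python
import math
from statistics import NormalDist

_PHI = NormalDist()


def fdr_threshold(n_stats, alpha):
    """Threshold t-hat of (16) for transformed statistics n_stats, with the fallback rule."""
    m = len(n_stats)  # number of hypotheses (script M in the text)
    if m < 2 or not 0.0 < alpha <= 1.0:
        raise ValueError("need at least two hypotheses and 0 < alpha <= 1")
    upper = math.sqrt(2.0 * math.log(m) - 2.0 * math.log(math.log(m)))

    def feasible(t):
        rejections = sum(1 for v in n_stats if v >= t)
        est_false = 2.0 * m * (1.0 - _PHI.cdf(t))
        # Small relative slack absorbs rounding when est_false equals alpha * k exactly.
        return est_false <= alpha * max(rejections, 1) * (1.0 + 1e-9)

    # Candidates: 0 and the solutions of 2 m (1 - Phi(t)) = alpha * k, k = 1, ..., m.
    candidates = [0.0] + [_PHI.inv_cdf(1.0 - alpha * k / (2.0 * m)) for k in range(1, m + 1)]
    for t in sorted(candidates):
        if t <= upper and feasible(t):
            return t
    return math.sqrt(2.0 * math.log(m))  # t-hat does not exist in the range


# Hypothetical example: 45 hypotheses, 15 of them with a clearly large statistic.
n_stats = [6.0] * 15 + [0.5] * 30
t_hat = fdr_threshold(n_stats, alpha=0.1)
rejected = [idx for idx, v in enumerate(n_stats) if v >= t_hat]
print(f"t_hat = {t_hat:.3f}, rejections = {len(rejected)}")
```

With only a handful of large statistics, no threshold in the restricted range satisfies the inequality and the function returns the fallback value $\sqrt{2 \log ℳ}$; the large statistics are still rejected as long as they exceed it.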

4.2 Theoretical Properties

We now investigate the theoretical properties of the multiple testing procedure given above. For any 1 ≤ m ≤ M, define

Ξ_{m} (γ) = \{h : 1 \leq h \leq M, h \neq m, \exists i \in 𝒥_{m}, j \in 𝒥_{h} s.t. |ω_{i, j}| \geq (\log ℳ)^{- 2 - γ}\} .

The following theorem shows that, under regularity conditions, the above multiple testing procedure controls the FDR and FDP at the pre-specified level α asymptotically.

Theorem 3

Assume that ℳ₀ =: |ℋ₀| ≍ ℳ, and (4) and (5) hold. Suppose there exists some γ > 0 such that $\max_{1 \leq m \leq M} |Ξ_{m} (γ)| = o (M^{τ})$ for any τ > 0. Then under (C1) with $p \leq c n^{r}$ for some c > 0 and r > 0, we have

\limsup_{(n, ℳ) \to \infty} FDR (\hat{t}) \leq α,

and for any ε > 0,

\lim_{(n, ℳ) \to \infty} P (FDP (\hat{t}) \leq α + ε) = 1.

Remark 3

The technical condition on |Ξ_m(γ)| ensures that most of the submatrices are not highly correlated with each other. In the special case when $\max_{1 \leq m \leq M} |Ξ_{m} (γ)| = 0$, all subgroups are weakly correlated with each other, i.e., $|ω_{i, j}| \leq (\log ℳ)^{- 2 - γ}$ for all i ∈ 𝒥_m, j ∈ 𝒥_h with m ≠ h. Under this setting, it is shown in the supplement Xia et al. (2016) that the proposed multiple testing procedure performs asymptotically the same as in the case when all submatrices are independent of each other. We do not need this strong condition: the weaker condition $\max_{1 \leq m \leq M} |Ξ_{m} (γ)| = o (M^{τ})$ for any τ > 0 assumed in the theorem allows the number of highly correlated submatrices to grow with M.

When $\hat{t}$ is not attained in the range [0, $\sqrt{2 \log ℳ - 2 \log \log ℳ}$ ] as described in equation (16), we shall threshold it at $\sqrt{2 \log ℳ}$ . We state in the following corollary a condition that ensures the existence of $\hat{t}$ in the range, and as a result, the FDR and FDP will converge to the pre-specified level α.

Corollary 1

Let

S_{ρ} = {(m - h) \in H : \exists (i, j) \in J_{m} \times J_{h} such that ∣ ω_{i, j} ∣ / {(θ_{i, j})}^{1 / 2} \geq {(log M)}^{\frac{1}{2} + ρ}} .

Suppose for some ρ > 0 and some δ > 0, $|S_{ρ}| \geq (\frac{1}{\sqrt{π} α} + δ) \sqrt{\log \mathcal{M}}$. Assume that ℳ₀ =: |ℋ₀| ≍ ℳ, and (4) and (5) hold. Suppose there exists some γ > 0 such that max_{1≤m≤M} |Ξ_m(γ)| = o(M^τ) for any τ > 0. Then, under (C1) with p ≤ cn^r for some c > 0 and r > 0, we have

\lim_{(n, \mathcal{M}) \to \infty} \mathrm{FDR}(\hat{t}) = α, \text{ and } \mathrm{FDP}(\hat{t}) / α \to 1

in probability, as (n,ℳ)→∞.

Remark 4

The condition $|S_{ρ}| \geq (\frac{1}{\sqrt{π} α} + δ) \sqrt{\log \mathcal{M}}$ in Corollary 1 is mild, since there are ℳ hypotheses in total and this condition only requires a few submatrices to have one entry with magnitude exceeding (log ℳ)^{1/2+ρ}/n^{1/2} for some constant ρ > 0.
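To give a sense of scale for the bound in Corollary 1, it can be evaluated numerically. The sketch below is illustrative only: the value δ = 0.01 is an arbitrary choice made here (the corollary only requires some δ > 0), the function name is hypothetical, and ℳ = 30876 is borrowed from the breast cancer application in Section 6.

```python
import math

def min_strong_signals(num_hypotheses, alpha, delta=0.01):
    """Evaluate (1 / (sqrt(pi) * alpha) + delta) * sqrt(log M).

    delta is an assumed illustrative value; the corollary only asks for delta > 0.
    """
    return (1.0 / (math.sqrt(math.pi) * alpha) + delta) * math.sqrt(
        math.log(num_hypotheses)
    )

# M = 30876 subgroup pairs, as in the real-data application.
bound = min_strong_signals(30876, alpha=0.1)
print(f"Corollary 1 requires |S_rho| >= {bound:.2f} out of 30876 hypotheses")
```

With these inputs the bound is roughly 18, so under this illustrative δ about 19 of the 30876 pairs would need a sufficiently strong entry.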

4.3 Differences with the B-H Procedure

In this section we first discuss the difference between our method and the Benjamini-Hochberg (B-H) procedure and then explain why, in the multiple testing procedure, it is critical to restrict t to the range $0 \leq t \leq \sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$ in equation (16) and to threshold N_{𝒥_m×𝒥_h} at $\sqrt{2 \log \mathcal{M}}$ when t̂ is not attained in the range.

Once the test statistic N_{𝒥_m×𝒥_h} for a given submatrix is developed, a natural approach to constructing a procedure for simultaneously testing a collection of submatrices is to apply the well-known B-H procedure to the p-values p_{m,h} = 2(1 − Φ(N_{𝒥_m×𝒥_h})), 1 ≤ m < h ≤ M, computed from the transformed statistics N_{𝒥_m×𝒥_h}. Applying the B-H procedure to these p-values is equivalent to rejecting the null hypotheses H_{0,m,h} whenever N_{𝒥_m×𝒥_h} ≥ t̂_{BH}, where

\hat{t}_{BH} = \inf \{t \geq 0 : \frac{2 \mathcal{M} (1 - Φ(t))}{R(t) \lor 1} \leq α\}.

(17)
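As an illustration of the B-H step-up rule applied to the p-values p_{m,h} = 2(1 − Φ(N)), the following minimal sketch uses only the Python standard library. The function name and the toy statistics are hypothetical and chosen only to make the rule concrete; they are not taken from the paper.

```python
from statistics import NormalDist

def bh_reject(stats, alpha):
    """Benjamini-Hochberg step-up applied to p = 2 * (1 - Phi(N)).

    Returns the sorted indices of the rejected hypotheses.
    """
    phi = NormalDist().cdf
    pvals = [2.0 * (1.0 - phi(x)) for x in stats]
    order = sorted(range(len(pvals)), key=lambda k: pvals[k])
    m = len(pvals)
    n_reject = 0
    for rank, k in enumerate(order, start=1):
        if pvals[k] <= alpha * rank / m:
            n_reject = rank  # keep the largest rank that passes
    return sorted(order[:n_reject])

# Toy input: three large statistics followed by seven small ones.
toy_stats = [4.5, 5.0, 6.0, 0.3, 0.8, 1.1, 0.1, 1.5, 0.6, 0.9]
rejected = bh_reject(toy_stats, alpha=0.1)
print(rejected)
```

Rejecting the hypotheses with the smallest p-values is the same as rejecting those whose statistics exceed a data-driven cutoff, which is the form used in (17).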

Note that the difference between our procedure and the B-H procedure lies in the ranges of t in equations (16) and (17).

We first emphasize that the restriction to the range $0 \leq t \leq \sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$ in our proposed procedure as defined in (16) is critical. When $t \geq \sqrt{2 \log \mathcal{M} - \log\log \mathcal{M}}$, 2ℳ(1 − Φ(t)) → 0 is not even a consistent estimate of R₀(t) because |R₀(t)/{2ℳ(1 − Φ(t))} − 1| ↛ 0 in probability as (n, ℳ) → ∞. However, direct application of the B-H procedure to the p-values amounts to using 2ℳ(1 − Φ(t)) as an estimate of R₀(t) for all t ≥ 0, and as a result it may not be able to control the FDP with positive probability. For example, when the number of true alternatives is fixed, it is shown in Proposition 2.1 of Liu and Shao (2014) that the B-H procedure cannot control the FDP with positive probability. Thus, in order to control the FDP, it is crucial to restrict t to the range $0 \leq t \leq \sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$.

When t̂ is not attained in the range, it is also critical to threshold N_{𝒥_m×𝒥_h} at $\sqrt{2 \log \mathcal{M}}$ instead of $\sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$. When t̂ does not exist in the range, thresholding N_{𝒥_m×𝒥_h} at $\sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$ will cause too many false rejections, and consequently the FDR cannot be controlled asymptotically at level α. If the threshold level is increased to $\sqrt{2 \log \mathcal{M}}$, the probability of false rejections can then be controlled asymptotically, as shown in equation (3) of the supplement (Xia et al., 2016).

To summarize, in order to control the FDR and FDP, it is crucial to restrict t to the range $0 \leq t \leq \sqrt{2 \log \mathcal{M} - 2 \log\log \mathcal{M}}$ in equation (16) and, when t̂ is not attained in the range, to threshold N_{𝒥_m×𝒥_h} at $\sqrt{2 \log \mathcal{M}}$.
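The thresholding rule summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that (16) has the same form as (17) with t restricted to the stated range, it approximates the infimum by a grid search with an assumed step size, and the function name and toy inputs are hypothetical.

```python
import math
from statistics import NormalDist

def group_fdr_threshold(stats, alpha, step=1e-3):
    """Grid-search sketch of the restricted-range threshold with fallback.

    Assumes len(stats) >= 2 so that log(log(M)) is defined.
    """
    std_normal = NormalDist()
    m = len(stats)
    upper = math.sqrt(2.0 * math.log(m) - 2.0 * math.log(math.log(m)))
    for k in range(int(upper / step) + 1):
        t = k * step
        n_rej = sum(1 for x in stats if x >= t)          # R(t)
        est_false = 2.0 * m * (1.0 - std_normal.cdf(t))  # estimate of R_0(t)
        if est_false / max(n_rej, 1) <= alpha:
            return t
    # Not attained in the range: fall back to sqrt(2 log M).
    return math.sqrt(2.0 * math.log(m))

# Toy inputs with M = 200 hypotheses.
t_many = group_fdr_threshold([10.0] * 20 + [0.0] * 180, alpha=0.1)
t_few = group_fdr_threshold([10.0] * 10 + [0.0] * 190, alpha=0.1)
print(t_many, t_few)
```

In the second toy input the ratio only drops below α beyond the upper end of the range, so the sketch falls back to the larger cutoff; the ten large statistics are still rejected.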

5 Simulation Studies

We now turn to analyze the numerical performance of the proposed multiple testing procedure through simulation studies. We first investigate the size and power of the proposed method by considering three matrix models with a random selection of the size of submatrices. We then mimic the sizes of the pathways of the breast cancer dataset analyzed in Section 6 and study the numerical performance of the proposed multiple testing procedure in a setting that is similar to the real data application. Our method, which tests for the conditional dependence structure at a group level, is then compared with the entrywise testing method and the B-H procedure. We also compare the new method with the Bonferroni correction procedure and report the results in the supplement.

5.1 Simulation for Different Constructions of Submatrices

Our analysis is divided into two parts: the performance of the new test statistics for testing a given submatrix and the performance of the proposed multiple testing procedure. We first describe the construction of the submatrices. For a given precision matrix Ω, we randomly divide the upper triangular part of Ω into ℳ submatrices, where ℳ = ⌊p/s⌋(⌊p/s⌋ − 1)/2 and s = 2 or 4. Thus the length of the index sets can range from 1 to (p − ⌊p/s⌋ + 1). This is equivalent to grouping the genes into ⌊p/s⌋ pathways and considering all possible conditional dependence structures between different pathways of different sizes.
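The grouping step can be sketched as below. The text does not specify how the random division is carried out, so the use of contiguous index blocks with random cut points is an assumption made here for illustration, and `random_groups` is a hypothetical helper.

```python
import random

def random_groups(p, s, seed=0):
    """Split indices 0..p-1 into floor(p/s) non-empty contiguous groups.

    Contiguity and the seeded generator are assumptions of this sketch.
    """
    rng = random.Random(seed)
    n_groups = p // s
    cuts = sorted(rng.sample(range(1, p), n_groups - 1))
    bounds = [0] + cuts + [p]
    return [list(range(a, b)) for a, b in zip(bounds[:-1], bounds[1:])]

groups = random_groups(p=100, s=2)
n_groups = len(groups)
n_pairs = n_groups * (n_groups - 1) // 2  # number of submatrices tested
print(n_groups, n_pairs, max(len(g) for g in groups))
```

For p = 100 and s = 2 this gives 50 groups and 1225 pairs, which agrees with the totals in Table 3 (for example 72 + 1153).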

The data {X₁, …, X_n} are generated from a multivariate normal distribution with zero mean and precision matrix Ω. Three choices of Ω are considered:

Model 1: $Ω^{*(1)} = (ω_{i,j}^{*(1)})$ where $ω_{i,i}^{*(1)} = 1$, $ω_{i,i+1}^{*(1)} = ω_{i+1,i}^{*(1)} = 0.5$ and $ω_{i,i+2}^{*(1)} = ω_{i+2,i}^{*(1)} = 0.5$. For each of the submatrices constructed above, if it contains one of those entries, we set the first row of the submatrix equal to 0.5. Let $ω_{i,j}^{*(1)} = 0$ otherwise. $Ω^{(1)} = D^{1/2}(Ω^{*(1)} + δI)/(1 + δ)D^{1/2}$ with $δ = |λ_{\min}(Ω^{*(1)})| + 0.05$.
Model 2: $Ω^{*(2)} = (ω_{i,j}^{*(2)})$ where $ω_{i,j}^{*(2)} = ω_{j,i}^{*(2)} = 0.3$ for i = 10(k − 1) + 1 and 10(k − 1) + 2 ≤ j ≤ 10(k − 1) + 10, 1 ≤ k ≤ p/10, and $ω_{i,j}^{*(2)} = 0$ otherwise. For each of the submatrices constructed above, if it contains fewer than three of those entries, we set the submatrix equal to 0. We set the first row of the submatrices closest to the diagonal equal to 0.3. $Ω^{(2)} = D^{1/2}(Ω^{*(2)} + δI)/(1 + δ)D^{1/2}$ with $δ = |λ_{\min}(Ω^{*(2)})| + 0.05$.
Model 3: $Ω^{*(3)} = (ω_{i,j}^{*(3)})$. For each of the two submatrices closest to the diagonal, as constructed above, pick a random row and set its entries equal to 0.3. Let $ω_{j,i}^{*(3)} = ω_{i,j}^{*(3)}$. $Ω^{(3)} = D^{1/2}(Ω^{*(3)} + δI)/(1 + δ)D^{1/2}$ with $δ = |λ_{\min}(Ω^{*(3)})| + 0.05$.

where D = (D_{i,j}) is a diagonal matrix with D_{i,i} = Unif(1, 3) for i = 1, …, p.
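The rescaling shared by the three models, Ω = D^{1/2}(Ω* + δI)/(1 + δ)D^{1/2}, can be sketched with NumPy as follows. The example uses only the banded part of Model 1 as Ω*; the submatrix-specific row fills are omitted, and the dimension p = 20, the seed and the helper name are assumptions of this illustration.

```python
import numpy as np

def rescale_precision(omega_star, rng):
    """Return D^{1/2} (Omega* + delta I) / (1 + delta) D^{1/2} and diag(D)."""
    p = omega_star.shape[0]
    lam_min = np.linalg.eigvalsh(omega_star)[0]  # eigenvalues come back ascending
    delta = abs(lam_min) + 0.05
    base = (omega_star + delta * np.eye(p)) / (1.0 + delta)
    d = rng.uniform(1.0, 3.0, size=p)            # D_ii drawn from Unif(1, 3)
    sqrt_d = np.sqrt(d)
    return sqrt_d[:, None] * base * sqrt_d[None, :], d

# Banded part of Model 1 only (first and second off-diagonals equal to 0.5).
p = 20
omega_star = np.eye(p)
for i in range(p - 1):
    omega_star[i, i + 1] = omega_star[i + 1, i] = 0.5
for i in range(p - 2):
    omega_star[i, i + 2] = omega_star[i + 2, i] = 0.5

omega, d = rescale_precision(omega_star, np.random.default_rng(0))
print(np.linalg.eigvalsh(omega)[0] > 0)
```

Adding δI shifts the smallest eigenvalue of Ω* to at least 0.05, so the resulting Ω is positive definite and can serve as a valid precision matrix.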

For each generated dataset, we use the Lasso to estimate the regression coefficients β_i:

\hat{β}_{i} = D_{i}^{-1/2} \underset{u}{\arg\min} \left\{ \frac{1}{2n} \left\| (X_{-i} - \bar{X}_{-i}) D_{i}^{-1/2} u - (X_{(i)} - \bar{X}_{(i)}) \right\|_{2}^{2} + λ_{n,i} \|u\|_{1} \right\},

(18)

where D_i = diag(Σ̂_{−i,−i}), and $λ_{n,i} = κ \sqrt{\hat{σ}_{i,i} \log p / n}$.

Performance for testing a given submatrix

We start by comparing our test based on the test statistic S_{ℐ×𝒥} with the entrywise testing of a given submatrix, where the null hypothesis H₀: Ω_{ℐ×𝒥} = 0 is rejected whenever max_{(i,j)∈ℐ×𝒥} |W_{i,j}| ≥ Φ⁻¹(1 − α/K). As our target is the FDR control of the multiple comparisons, we focus on the power comparisons of these two methods for a range of significance levels from 0 to α = 0.1/ℳ. For illustration, we compare the performance of these two tests by testing against a randomly selected nonzero submatrix closest to the diagonal for Model 1 with s = 4. For each method, the sample size is taken to be n = 200, while the dimension p varies over the values 100, 200, 500 and 1000. For simplicity of the comparison, the tuning parameters λ_{n,i} in (18) are selected to be $λ_{n,i} = \sqrt{\hat{σ}_{i,i} \log p / n}$ for both methods. The power curves, illustrated in Figure 1, are estimated from 100 replications. We can see from the figure that the power of the new group method is much higher than that of the entrywise method, and the advantage becomes clearer as the dimension of Ω grows.

Figure 1. Power comparisons of the group method (red, solid) and entrywise method (blue, dashed) for testing a given nonzero submatrix, based on 100 replications.

Comparison of the multiple testing procedures

We now compare the proposed group level FDR control procedure (Group) with three other methods: entrywise multiple testing method (Entrywise), B-H procedure (B-H) and Bonferroni correction procedure (Bonferroni).

For the new method, as described in Section 4, we select the tuning parameters λ_{n,i} in (18) adaptively from the data, with the principle of making Σ_{(m,h)∈ℋ₀} I(N_{𝒥_m×𝒥_h} ≥ t) and (2 − 2Φ(t))|ℋ₀| as close as possible. The algorithm is similar to that of Xia et al. (2015) and is summarized as follows.

Let $λ_{n,i} = (b/20) \sqrt{\hat{Σ}_{i,i} \log p / n}$ for b = 1, …, 40. For each b, calculate $\hat{β}_{i}^{(b)}$, i = 1, …, p. Based on the estimated regression coefficients, construct the corresponding standardized transformed statistics $N_{\mathcal{J}_{m} \times \mathcal{J}_{h}}^{(b)}$ for each b.
Choose b̂ as the minimizer of
$\sum_{d=1}^{10} \left( \frac{\sum_{(m,h) \in \mathcal{H}} I(N_{\mathcal{J}_{m} \times \mathcal{J}_{h}}^{(b)} \geq Φ^{-1}(1 - d(1 - Φ(\sqrt{\log \mathcal{M}}))/10))}{d(1 - Φ(\sqrt{\log \mathcal{M}}))/10 \cdot 2\mathcal{M}} - 1 \right)^{2}.$

The tuning parameters λ_n,i are then chosen to be

λ_{n,i} = (\hat{b}/20) \sqrt{\hat{Σ}_{i,i} \log p / n}.
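The selection of b̂ can be sketched as follows, assuming the statistics N^{(b)} have already been computed for each candidate b (the Lasso fits and the construction of the statistics are not shown). The function names are hypothetical, and the two candidate sets in the toy example are synthetic: one is built from half-normal quantiles so that it is well calibrated, the other is an inflated copy.

```python
import math
from statistics import NormalDist

def calibration_criterion(stats, n_hyp):
    """Sum over d = 1..10 of (observed tail count / expected count - 1)^2."""
    nd = NormalDist()
    base_tail = 1.0 - nd.cdf(math.sqrt(math.log(n_hyp)))
    total = 0.0
    for d in range(1, 11):
        q = d * base_tail / 10.0
        cutoff = nd.inv_cdf(1.0 - q)
        observed = sum(1 for x in stats if x >= cutoff)
        expected = q * 2.0 * n_hyp
        total += (observed / expected - 1.0) ** 2
    return total

def select_b(stats_by_b, n_hyp):
    """Return the candidate b whose statistics minimise the criterion."""
    return min(stats_by_b, key=lambda b: calibration_criterion(stats_by_b[b], n_hyp))

# Synthetic toy input: quantiles of |Z| (calibrated) and an inflated copy.
n_hyp = 2000
nd = NormalDist()
calibrated = [nd.inv_cdf(1.0 - (k - 0.5) / (2.0 * n_hyp)) for k in range(1, n_hyp + 1)]
inflated = [1.5 * x for x in calibrated]
b_hat = select_b({10: inflated, 20: calibrated}, n_hyp)
print(b_hat)
```

The criterion compares the number of statistics in the upper tail with the number expected under the standard normal null, so a candidate whose statistics are too large in the tail is penalised.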

We examine the power of the new method based on the average powers for 100 replications,

\frac{1}{100} \sum_{r=1}^{100} \frac{\sum_{(m,h) \in \mathcal{H}_{1}} I\{N_{\mathcal{J}_{m} \times \mathcal{J}_{h}, r} \geq \hat{t}\}}{|\mathcal{H}_{1}|},

(19)

where r denotes the r-th replication.

For the entrywise multiple testing method, we select the tuning parameters λ_{n,i} adaptively using the principle described in Section 5 of Xia et al. (2015). We apply the multiple testing procedure developed in Section 4 of Xia et al. (2015) by restricting t to the range [0, $\sqrt{4 \log p - 2 \log\log p}$] and thresholding |W_{i,j}| at $\sqrt{4 \log p}$ if t̂ is not attained in the range. We then examine the empirical FDR by

\frac{1}{100} \sum_{r=1}^{100} \frac{\sum_{(m,h) \in \mathcal{H}_{0}} I\{\max_{(i,j) \in \mathcal{J}_{m} \times \mathcal{J}_{h}} |W_{i,j}| \geq \hat{t}\}}{\sum_{(m,h)} I\{\max_{(i,j) \in \mathcal{J}_{m} \times \mathcal{J}_{h}} |W_{i,j}| \geq \hat{t}\}},

and the empirical power by

\frac{1}{100} \sum_{r=1}^{100} \frac{\sum_{(m,h) \in \mathcal{H}_{1}} I\{\max_{(i,j) \in \mathcal{J}_{m} \times \mathcal{J}_{h}} |W_{i,j}| \geq \hat{t}\}}{|\mathcal{H}_{1}|}.
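The empirical FDR and power can be computed from the rejection sets of the individual replications as in the following sketch. The function name and the toy rejection sets are hypothetical, and the convention of treating an empty rejection set as a false discovery proportion of zero is an assumption made here.

```python
def empirical_fdr_power(rejections, true_pairs):
    """Average false discovery proportion and power across replications.

    rejections: list with one set of rejected (m, h) pairs per replication.
    true_pairs: set of pairs (m, h) in H_1.
    """
    fdps, powers = [], []
    for rej in rejections:
        false_rej = len(rej - true_pairs)              # rejected pairs in H_0
        fdps.append(false_rej / max(len(rej), 1))      # 0 if nothing is rejected
        powers.append(len(rej & true_pairs) / len(true_pairs))
    n_rep = len(rejections)
    return sum(fdps) / n_rep, sum(powers) / n_rep

# Toy example with two replications.
truth = {(1, 2), (1, 3), (2, 4), (3, 4)}
reps = [
    {(1, 2), (1, 3), (2, 3)},          # one false rejection, two true ones
    {(1, 2), (1, 3), (2, 4), (3, 4)},  # all true pairs found, no false ones
]
fdr, power = empirical_fdr_power(reps, truth)
print(fdr, power)
```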

We apply the Bonferroni correction procedure to the new test statistics and calculate its power based on (19), with t̂ obtained by setting α_B = α/ℳ. The power of the B-H procedure applied to max_{(i,j)∈ℐ×𝒥} |W_{i,j}| is calculated by (19) with no restriction on the range of t̂.

We apply all procedures to these three models with s = 2 and 4. For each method, the sample size is taken to be n = 200, while the dimension p varies over the values 100, 200, 500 and 1000. The FDR level is set at α = 0.1 and α = 0.01 respectively, and the empirical FDRs and powers, summarized in Tables 1 and 2, are estimated from 100 replications. The standard errors of the estimated powers are much smaller than the powers themselves and are thus not reported.

Table 1.

Empirical FDRs (standard errors) (%) with n = 200, α = 0.1 and 0.01 respectively, 100 replications.

p	method	α = 10%						α = 1%

	s	2			4			2			4

	Models	1	2	3	1	2	3	1	2	3	1	2	3
		Empirical FDR (SE) (in %)

100	Group	8.9 (3.4)	9.5 (4.3)	9.0 (2.9)	4.6 (3.3)	8.6 (6.0)	9.7 (5.6)	0.8 (1.2)	1.0 (1.0)	0.6 (0.8)	0.2 (0.5)	0.8 (1.0)	0.4 (1.1)
	Entrywise	24.5 (4.6)	15.7 (5.0)	12.7 (3.7)	36.2 (7.6)	24.1 (8.3)	16.3 (6.1)	2.7 (2.3)	1.3 (1.8)	1.0 (1.2)	4.0 (4.1)	1.8 (3.3)	1.3 (5.5)
	B-H	10.0 (3.6)	12.7 (5.5)	9.8 (4.2)	8.3 (5.0)	12.4 (7.4)	11.7 (6.8)	0.8 (1.3)	1.2 (1.9)	0.8 (1.1)	0.4 (1.5)	1.5 (5.6)	1.1 (3.5)

200	Group	8.8 (2.5)	9.4 (3.5)	8.7 (2.5)	5.8 (3.3)	8.5 (4.5)	8.7 (3.9)	0.6 (0.5)	0.9 (0.5)	0.8 (0.8)	0.3 (0.4)	0.8 (0.2)	0.8 (0.7)
	Entrywise	23.5 (3.5)	14.9 (3.7)	13.6 (3.0)	33.1 (7.0)	23.0 (5.5)	15.1 (5.8)	2.1 (1.4)	1.0 (1.2)	0.9 (1.0)	3.1 (2.8)	1.3 (2.3)	0.7 (3.1)
	B-H	9.8 (2.9)	11.7 (3.8)	9.2 (3.1)	8.8 (4.2)	11.9 (4.6)	12.1 (5.5)	0.8 (0.9)	0.9 (1.9)	0.8 (1.1)	0.8 (1.5)	1.4 (2.5)	1.0 (4.2)

500	Group	7.9 (1.5)	9.9 (2.3)	8.1 (1.5)	5.4 (1.8)	8.7 (2.9)	9.3 (2.7)	0.8 (0.4)	0.9 (0.7)	0.7 (0.5)	0.5 (0.6)	0.7 (0.8)	0.9 (0.8)
	Entrywise	19.4 (2.2)	15.0 (2.3)	12.5 (2.1)	24.6 (4.9)	20.6 (4.1)	16.2 (5.3)	1.6 (1.0)	0.9 (0.9)	0.8 (0.8)	1.6 (1.4)	1.1 (1.6)	1.1 (2.1)
	B-H	8.3 (2.0)	11.3 (2.6)	9.2 (2.1)	8.7 (3.4)	11.9 (4.2)	11.6 (4.9)	0.8 (0.6)	1.1 (1.1)	0.9 (1.1)	0.7 (1.0)	1.2 (2.3)	1.4 (3.5)

1000	Group	7.9 (1.2)	9.8 (1.8)	8.7 (1.4)	6.0 (1.7)	9.0 (2.0)	10.0 (2.1)	0.7 (0.3)	0.9 (0.4)	0.7 (0.4)	0.4 (0.4)	0.8 (0.5)	1.0 (0.7)
	Entrywise	16.7 (2.0)	14.5 (1.8)	13.1 (2.1)	20.9 (3.5)	20.0 (2.9)	16.5 (5.4)	1.3 (0.7)	1.3 (0.9)	1.1 (1.0)	1.7 (1.7)	1.2 (1.6)	0.8 (2.0)
	B-H	7.8 (1.3)	11.8 (2.3)	10.1 (1.9)	9.5 (2.7)	12.0 (2.8)	12.7 (6.2)	0.6 (0.5)	1.1 (1.0)	0.8 (1.1)	0.9 (1.2)	1.3 (2.3)	0.8 (2.1)

Table 2.

Empirical powers (%) with n = 200, α = 0.1 and 0.01 respectively, 100 replications.

p	method	α = 10%						α = 1%

	s	2			4			2			4

	Models	1	2	3	1	2	3	1	2	3	1	2	3
100	Group	93.8	88.5	84.6	95.2	84.4	73.5	87.8	74.2	69.4	92.3	68.7	54.4
	Entrywise	92.8	90.3	85.2	92.8	85.1	67.3	83.4	71.9	66.8	85.9	56.3	28.5
	B-H	92.8	88.3	84.6	92.9	81.1	67.2	82.3	66.3	64.5	84.0	50.9	31.2

200	Group	90.4	83.3	72.5	92.6	77.9	58.8	82.6	66.8	56.1	87.9	62.7	42.6
	Entrywise	87.6	84.8	71.7	87.3	76.0	45.5	93.6	60.8	47.6	73.6	40.8	15.0
	B-H	87.6	81.4	70.9	86.9	71.3	43.3	72.2	54.6	46.7	71.1	33.8	14.1

500	Group	84.3	72.3	56.6	85.9	67.4	41.4	75.1	55.4	40.7	76.0	53.4	27.2
	Entrywise	78.7	70.8	50.8	72.5	60.0	24.2	60.0	43.3	25.6	48.3	25.9	5.0
	B-H	77.8	66.0	50.4	70.0	52.5	20.3	57.5	36.3	23.8	43.3	19.3	3.0

1000	Group	80.1	63.3	46.4	80.7	59.9	31.9	69.9	57.2	31.8	69.2	46.8	21.1
	Entrywise	71.2	59.2	37.1	60.7	48.5	14.5	49.3	31.5	15.5	33.5	18.5	3.1
	B-H	69.8	53.1	35.4	56.6	39.9	9.6	45.8	23.9	12.7	27.3	10.6	1.2

The average numbers of conditionally dependent (“true interaction”) and conditionally independent (“no interaction”) pairs of subgroups with 100 replications are summarized in Table 3. It can be seen that the number of “true interactions” is relatively small compared to the total number of pairs of subgroups in all cases, as we assumed in Section 4.1.

Table 3.

Average numbers of true interactions and no interactions based on 100 replications.

s	2			4			2			4

Models	1	2	3	1	2	3	1	2	3	1	2	3

p	True interactions						No interactions
100	72	54	96	29	28	46	1153	1171	1129	271	272	254
200	146	110	196	60	58	96	4803	4839	4754	1165	1167	1129
500	373	279	496	154	147	246	30752	30846	30629	7596	7603	7504
1000	748	560	996	311	297	496	124002	124190	123754	30814	30828	30629

The results in Table 1 show that the empirical FDRs of the new group level method are well maintained under the target FDR level and are reasonably close to α for almost all settings. The standard errors of the FDP are small in most cases, especially when the dimension grows. They are slightly larger when α = 0.01, mainly because the estimation error of the standard deviation of the FDP is of the order 1/l^{1/2} with l = 100. By comparison, the empirical FDRs of the entrywise method are seriously distorted in most of the scenarios, especially when s = 4, in which case the empirical FDRs can be even larger than 4α. The empirical FDRs of the B-H procedure are well under control in most cases. However, its standard errors are much larger than those of the proposed method in many cases, which agrees with the discussion in Section 4.3. The numerical results also show that the Bonferroni correction procedure is much more conservative than the other methods; the detailed analysis is summarized in the supplement (Xia et al., 2016).

Table 2 shows that the empirical powers of our proposed method are high for all these models under various constructions of submatrices. In nearly all settings it outperforms the entrywise testing method and the B-H procedure, and when the dimension is high its powers are much higher than those of the other methods under all scenarios. Furthermore, the power gain of the new group level testing procedure over the entrywise testing method is substantial when the dimension is high. In particular, for Model 3 with s = 4, the empirical powers of the new procedure are more than twice those of the entrywise testing method. This is because the advantage of group level testing becomes more pronounced when the signals spread across various submatrices, as in Model 3. We can see from the table that the empirical power of the new method decreases as the dimension p grows. This is because we keep the magnitude of ω_{i,j} fixed across the range of dimensions.

5.2 Simulation by Mimicking the Sizes of Gene Groups

We now consider a simulation setting that is similar to the breast cancer data application given in Section 6. The submatrices of the precision matrix Ω are constructed by mimicking the sizes of the 249 gene groups used in the breast cancer application, with parameter values p = 1624, n = 295 and ℳ = 30876. The sizes of the gene groups range from 1 to 110, and the corresponding sizes of the off-diagonal submatrices range from 1×1 to 97×110. For the diagonal submatrices $Ω_{\mathcal{J}_{m} \times \mathcal{J}_{m}}^{*} := (ω_{m,i,j}^{*})$ with sizes L_m × L_m, m = 1, …, 249, which describe the conditional dependency within the pathways, we let $ω_{m,i,i}^{*} = 1$, $ω_{m,i,i+1}^{*} = ω_{m,i+1,i}^{*} = 0.8$ if L_m ≥ 2, $ω_{m,i,i+2}^{*} = ω_{m,i+2,i}^{*} = 0.6$ if L_m ≥ 3, and $ω_{m,j,i}^{*} = ω_{m,i,j}^{*}$. For each of the non-diagonal submatrices $Ω_{\mathcal{J}_{m} \times \mathcal{J}_{m+1}}^{*}$ and $Ω_{\mathcal{J}_{m} \times \mathcal{J}_{m+2}}^{*}$, we randomly pick one row and let min{10, |𝒥_{m+1}|} and min{10, |𝒥_{m+2}|} random entries of $ω_{i,j}^{*}$ in that row equal 0.5, respectively. We then construct the precision matrix as Ω = D^{1/2}(Ω^* + δI)/(1 + δ)D^{1/2}, with δ = λ_min(Ω^*) + 0.05. The FDR level is set at α = 0.1 and α = 0.01 respectively.

Using the mimicked gene group sizes, we apply the proposed method of Section 4.1, the entrywise testing procedure, the B-H procedure and the Bonferroni correction procedure as described in Section 5.1. The empirical FDR and power results are summarized in Table 4, and the performance of the Bonferroni method is reported in the supplement. The empirical FDR of the new method is equal to 0.062 when α = 0.1 and is equal to 0.006 when α = 0.01, so both are below the corresponding pre-specified level. As in Section 5.1, the B-H procedure has larger standard errors than the new procedure, while the entrywise multiple testing procedure has serious FDR distortion. For the empirical powers, Table 4 shows that the new testing procedure is more powerful than all the other methods.

Table 4.

Empirical FDRs (SEs) and powers (%) by mimicking the real data with α = 0.1 and α = 0.01 respectively, based on 100 replications.

p	method	α = 10%		α = 1%

		FDR (SE) (in %)	Power (in %)	FDR (SE) (in %)	Power (in %)
1624	Group	6.2 (1.4)	47.2	0.5 (0.3)	33.6
	Entrywise	26.0 (2.7)	39.9	2.5 (1.5)	26.0
	B-H	11.1 (2.0)	44.9	1.1 (1.0)	19.8

6 Analysis of Breast Cancer Gene Expression Data

In this section, we apply the multiple testing procedure developed in Section 4 to identify between pathway interactions based on a breast cancer gene expression study as described in van’t Veer et al. (2002), to further illustrate the merit of the procedure.

This study consists of 295 subjects with primary breast carcinomas whose gene-expression levels (in log scale) are measured at cancer diagnosis. For illustration, we consider M = 70 breast cancer related pathways, including several major signaling pathways, assembled based on existing literature (e.g., Osborne et al., 2005; Pan, 2012). These pathways consist of p = 1624 unique genes, from the molecular signature database. Examples include the MAPK signaling, WNT signaling, TGF-β signaling, calcium signaling, cell communication, p53 signaling and breast cancer estrogen signaling pathways. Note that many of the pathways have overlapping genes, while our method requires the group indices to be non-overlapping, since two groups with shared genes are obviously dependent on each other. To remove the influence of such trivial dependence, we further partitioned the 70 pathways into 249 non-overlapping gene subgroups whose sizes range from 1 to 110 with an average of 6.5. The algorithm used for such partitioning aims to identify the smallest number of non-overlapping subgroups that can cover all the genes under consideration. The partitioning algorithm begins by creating an M × p index matrix, 𝕀 = [I₁, …, I_p]. For m = 1, …, M and q = 1, …, p, the (m, q)th element of 𝕀 is set to 1 if the qth gene belongs to the mth pathway, and 0 otherwise. Then the subgroups are indexed by the unique values of {I₁, …, I_p}.
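The partitioning step can be sketched as follows: genes are grouped by their membership column I_q, so that two genes fall in the same subgroup exactly when they belong to the same set of pathways. The function name and the toy pathways are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def partition_into_subgroups(pathways):
    """Group genes by their pathway-membership column of the index matrix."""
    names = sorted(pathways)
    genes = sorted(set().union(*pathways.values()))
    by_signature = defaultdict(list)
    for gene in genes:
        # Column I_q: 1 if the gene belongs to the pathway, 0 otherwise.
        signature = tuple(int(gene in pathways[name]) for name in names)
        by_signature[signature].append(gene)
    return list(by_signature.values())

# Toy example: pathways A and B share genes g2 and g3.
toy_pathways = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g5"},
}
subgroups = partition_into_subgroups(toy_pathways)
print(sorted(subgroups))
```

In the toy example the shared genes g2 and g3 form their own subgroup, so the overlap between pathways A and B no longer induces a trivial dependence between two tested groups.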

Applying our proposed method with a target false discovery rate of 0.01, we identified 494 between subgroup interactions out of the 30876 possible subgroup pairs. These between subgroup interactions can be mapped to 311 unique between pathway interactions and 18 within pathway interactions. The pathways with the highest numbers of interactions with other pathways include the MAPK signaling, calcium signaling, glycan structures biosynthesis, WNT signaling, cell communication, TGF-β signaling and breast cancer estrogen pathways. The MAPK signaling pathway has interactions with 92 gene subgroups, which correspond to 31 pathways including the TGF-β, MTOR, P53, WNT, and ERBB signaling pathways. The WNT signaling pathway interacts with 25 other pathways including TGF-β, MTOR, MAPK and breast cancer estrogen signaling. The TGF-β signaling pathway interacts with 21 other pathways including MAPK, p53, WNT and calcium signaling.

Many of these interactions have been previously documented. For example, experimental data suggest that inhibition of mTORC1 leads to MAPK pathway activation (Carracedo et al., 2008). The interaction between TGF-β and WNT pathways has been known for a long time and is probably the most extensively studied. At the organism level, TGF-β interacts with many other pathways at every stage of life from birth to death. During embryonic development, the complex but delicate interactions between the TGF-β, WNT, MAPK, and other pathways are important for a range of processes including body patterning, stem cell maintenance and cell fate determination (Guo and Wang, 2008). Kouzmenko et al. (2004) showed the first direct evidence of interaction between WNT and estrogen signaling pathways via functional interaction between β-catenin and ERα.

To examine whether these 70 breast cancer pathways are enriched with interactions, we randomly selected 50 sets of 70 pathways of similar sizes to the breast cancer pathways from the C2 pathway gene sets curated from various online databases (available from the Broad Institute). For each of the 50 randomly selected sets of pathways, we performed the same analysis as for the breast cancer pathways by first partitioning them into non-overlapping subgroups and then applying our method to identify significant between subgroup interactions. To determine whether the 70 breast cancer pathways are enriched with between subgroup interactions relative to these randomly selected pathways, we calculate the proportion of between subgroup interactions deemed significant at the FDR level of 0.01. Across the 50 randomly selected sets, the average proportion of significant pairs was 0.011 with standard deviation 0.002. The proportion of significant pairs we identified in the breast cancer data is 0.016, which is 2.5 standard deviations higher than the mean of the proportions from those 50 random sets. The results suggest that the selected 70 pathways are indeed enriched with “interaction” pairs.
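The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below only recombines numbers stated in the text (249 subgroups, 494 significant pairs, and the mean and standard deviation over the random sets); the variable names are our own.

```python
n_subgroups = 249
n_pairs = n_subgroups * (n_subgroups - 1) // 2  # all subgroup pairs

n_significant = 494
proportion = n_significant / n_pairs            # share of significant pairs

random_mean, random_sd = 0.011, 0.002           # from the 50 random sets
z_score = (0.016 - random_mean) / random_sd     # standardised difference

print(n_pairs, round(proportion, 3), round(z_score, 2))
```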

7 Discussions

We proposed in this paper a multiple testing procedure under the Gaussian graphical models for detecting between group interactions. The proposed method can potentially be extended in several directions. We discuss in this section two of these possible extensions.

7.1 Extension to Gaussian Copula Graphical Models

In the present paper, the problem of identifying the conditional between group interactions is translated to the problem of multiple testing of submatrices of a high-dimensional precision matrix Ω under the Gaussian graphical model framework. The main reason for the success of this approach is that the conditional independence between two non-overlapping groups of variables is equivalent to the corresponding submatrix of Ω being 0. This approach can be extended to more general settings of the semiparametric Gaussian copula graphical models where the population distribution is non-Gaussian, see Liu et al. (2012) and Xue and Zou (2012). The semiparametric Gaussian copula model assumes that the variables follow a joint normal distribution after a set of unknown marginal monotonic transformations. It would be interesting to develop a multiple testing procedure and investigate its properties under the semiparametric Gaussian copula graphical models. Detailed analysis is involved and is an interesting topic for future research.

7.2 The Two-Sample Case

We have focused on the one-sample case in this paper. It is also of interest to study the two-sample case, where the goal is to discover changes in the conditional dependence between pathways under two different disease settings. In the one-sample case studied in this paper, Ω_{𝒥_m×𝒥_h} = 0 under H_{0,m,h}. Thus the null is simple, but the technical details of deriving the limiting distribution of the test statistic for a given submatrix are still very involved because the correlation structure of {W_{i,j}, (i, j) ∈ 𝒥_m × 𝒥_h} largely depends on the entries outside of the submatrix of interest. In the two-sample case, we wish to test the hypotheses $H_{0,m,h} : Ω_{\mathcal{J}_{m} \times \mathcal{J}_{h}}^{(1)} = Ω_{\mathcal{J}_{m} \times \mathcal{J}_{h}}^{(2)}$. Under the null hypothesis H_{0,m,h}, each submatrix is not necessarily a zero matrix. The null is therefore composite, and consequently the dependence structures of suitable test statistics depend on the entries both inside and outside of the submatrices of direct interest. The two-sample case is technically even more challenging and we leave it as future work.

Acknowledgments

The research of Tony Cai was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.

The research of Yin Xia was supported in part by “The Recruitment Program of Global Experts” Youth Project from China, the startup fund from Fudan University and NSF Grant DMS-1612906.

The research of Tianxi Cai was supported in part by NIH Grants R01 GM079330, P50 MH106933, and U54 HG007963.

References

Anderson TW. An Introduction To Multivariate Statistical Analysis. 3. Wiley-Intersceince; New York: 2003. [Google Scholar]
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57:289–300. [Google Scholar]
Beran R, Bilodeau M, de Micheaux PL. Nonparametric tests of independence between random vectors. Journal of Multivariate Analysis. 2007;98(9):1805–1824. [Google Scholar]
Cai TT, Liu W. Large-scale multiple testing of correlations. Journal of the American Statistical Association. 2015:110. doi: 10.1080/01621459.2014.999157. (to appear) [DOI] [PMC free article] [PubMed] [Google Scholar]
Cai TT, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J Amer Statist Assoc. 2013;108(501):265–277. [Google Scholar]
Cai TT, Zhang A. Inference on high-dimensional differential correlation matrix. 2014 doi: 10.1016/j.jmva.2015.08.019. arXiv preprint arXiv:1408.5907. [DOI] [PMC free article] [PubMed] [Google Scholar]
Carracedo A, Ma L, Teruya-Feldstein J, Rojo F, Salmena L, Alimonti A, Egia A, Sasaki AT, Thomas G, Kozma SC, et al. Inhibition of mtorc1 leads to mapk pathway activation through a pi3k-dependent feedback loop in human cancer. The Journal of clinical investigation. 2008;118(9):3065. doi: 10.1172/JCI34739. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. The American Journal of Human Genetics. 2006;79(6):1002–1016. doi: 10.1086/509704. [DOI] [PMC free article] [PubMed] [Google Scholar]
Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources. ISMB. 1999;1999:77–86. [PubMed] [Google Scholar]
Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space (with discussions) Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gieser PW, Randles RH. A nonparametric test of independence between two vectors. Journal of the American Statistical Association. 1997;92(438):561–567. [Google Scholar]
Glazko GV, Emmert-Streib F. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009;25(18):2348–2354. doi: 10.1093/bioinformatics/btp406. [DOI] [PMC free article] [PubMed] [Google Scholar]
Guo X, Wang XF. Signaling cross-talk between tgf-β/bmp and other pathways. Cell research. 2008;19(1):71–88. doi: 10.1038/cr.2008.302. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huang TM, et al. Testing conditional independence using maximal nonlinear conditional correlation. The Annals of Statistics. 2010;38(4):2047–2091. [Google Scholar]
Jia P, Kao CF, Kuo PH, Zhao Z. A comprehensive network and pathway analysis of candidate genes in major depressive disorder. BMC systems biology. 2011;5(Suppl 3):S12. doi: 10.1186/1752-0509-5-S3-S12.
Kelley R, Ideker T. Systematic interpretation of genetic interactions using protein networks. Nature biotechnology. 2005;23(5):561–566. doi: 10.1038/nbt1096.
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS computational biology. 2012;8(2):e1002375. doi: 10.1371/journal.pcbi.1002375.
Kooperberg C, LeBlanc M. Increasing the power of identifying gene × gene interactions in genome-wide association studies. Genetic epidemiology. 2008;32(3):255–263. doi: 10.1002/gepi.20300.
Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genetic epidemiology. 2005;28(2):157–170. doi: 10.1002/gepi.20042.
Kouzmenko AP, Takeyama K-i, Ito S, Furutani T, Sawatsubashi S, Maki A, Suzuki E, Kawasaki Y, Akiyama T, Tabata T, et al. Wnt/β-catenin and estrogen signaling converge in vivo. Journal of Biological Chemistry. 2004;279(39):40255–40258. doi: 10.1074/jbc.C400331200.
Lauritzen SL. Graphical models. Oxford University Press; 1996.
Li Y, Agarwal P, Rajagopalan D. A global pathway crosstalk network. Bioinformatics. 2008;24(12):1442–1447. doi: 10.1093/bioinformatics/btn200.
Liu H, Han F, Yuan M, Lafferty J, Wasserman L, et al. High-dimensional semiparametric Gaussian copula graphical models. Ann Statist. 2012;40(4):2293–2326.
Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal. 2009;53(4):853–856.
Liu W. Gaussian graphical model estimation with false discovery rate control. Ann Statist. 2013;41(6):2948–2978.
Liu W, Shao QM. Phase transition and regularized bootstrap in large scale t-tests with false discovery rate control. Ann Statist. 2014;42(5):2003–2025.
Liu ZP, Wang Y, Zhang XS, Chen L. Identifying dysfunctional crosstalk of pathways in various regions of Alzheimer’s disease brains. BMC systems biology. 2010;4(Suppl 2):S11. doi: 10.1186/1752-0509-4-S2-S11.
Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, et al. Reactome knowledgebase of human biological pathways and processes. Nucleic acids research. 2009;37(suppl 1):D619–D622. doi: 10.1093/nar/gkn863.
Osborne CK, Shou J, Massarweh S, Schiff R. Crosstalk between estrogen receptor and growth factor receptor pathways as a cause for endocrine therapy resistance in breast cancer. Clinical cancer research. 2005;11(2):865s–870s.
Pan XH. Pathway crosstalk analysis based on protein-protein network analysis in ovarian cancer. Asian Pacific Journal of Cancer Prevention. 2012;13(8):3905–3909. doi: 10.7314/apjcp.2012.13.8.3905.
Puri N, Salgia R, et al. Synergism of EGFR and c-Met pathways, cross-talk and inhibition, in non-small cell lung cancer. Journal of carcinogenesis. 2008;7(1):9. doi: 10.4103/1477-3163.44372.
Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–147. doi: 10.1086/321276.
Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature. 2005;437(7062):1173–1178. doi: 10.1038/nature04209.
Shou J, Massarweh S, Osborne CK, Wakeling AE, Ali S, Weiss H, Schiff R. Mechanisms of tamoxifen resistance: increased estrogen receptor-HER2/neu cross-talk in ER/HER2–positive breast cancer. Journal of the National Cancer Institute. 2004;96(12):926–935. doi: 10.1093/jnci/djh166.
Su L, White H. A consistent characteristic function-based test for conditional independence. Journal of Econometrics. 2007;141(2):807–834.
Su L, White H. A nonparametric Hellinger metric test for conditional independence. Econometric Theory. 2008;24(04):829–864.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
Um Y, Randles RH. A multivariate nonparametric test of independence among many vectors. Journal of Nonparametric Statistics. 2001;13(5):699–708.
van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536. doi: 10.1038/415530a.
Weirauch MT. Gene coexpression networks for the analysis of DNA microarray data. Applied statistics for network biology: methods in systems biology. 2011:215–250.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research. 2002;30(1):303–305. doi: 10.1093/nar/30.1.303.
Xia Y, Cai T, Cai TT. Testing differential networks with applications to the detection of gene-gene interactions. Biometrika. 2015;102:247–266. doi: 10.1093/biomet/asu074.
Xia Y, Cai T, Cai TT. Supplement to “Multiple Testing of Submatrices of a Precision Matrix with Applications to Identification of Between Pathway Interactions”. Technical report. 2016. doi: 10.1080/01621459.2016.1251930.
Xue L, Zou H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann Statist. 2012;40(5):2541–2571.

