Published in final edited form as: Stat Sin. 2023 Oct;33(4):2359–2380. doi: 10.5705/ss.202020.0486

Use of random integration to test equality of high dimensional covariance matrices

Yunlu Jiang, Canhong Wen, Yukang Jiang, Xueqin Wang, Heping Zhang
PMCID: PMC10550010  NIHMSID: NIHMS1774677  PMID: 37799490

Abstract

Testing the equality of two covariance matrices is a fundamental problem in statistics, and it is especially challenging when the data are high-dimensional. Through a novel use of random integration, we can test the equality of high-dimensional covariance matrices without assuming parametric distributions for the two underlying populations, even if the dimension is much larger than the sample size. The asymptotic properties of our test, for an arbitrary number of covariates and arbitrary sample sizes, are studied in depth under a general multivariate model. The finite-sample performance of our test is evaluated through numerical studies. The empirical results demonstrate that our test is highly competitive with existing tests in a wide range of settings. In particular, our proposed test is distinctly powerful in settings where there are a few large or many small diagonal disturbances between the two covariance matrices.

Keywords: High-dimensional data, Covariance matrix, Random integration

1. Introduction

Testing the equality of two covariance matrices arises in many important problems, including classic experimental designs and the analysis of high-throughput omic data. For example, gene expression data have often been explored to classify disease types. Igolkina et al. (2018) pointed out that variance in gene expression is an important characteristic of schizophrenia. Roberts et al. (2018) showed that many genes differ in the variances of their expression between disease states. Comparison between two covariance matrices is therefore essential in analyzing gene expression microarray data from two different groups. This comparison becomes difficult because the number of samples is usually much smaller than the number of genes. In the literature, many methods have been proposed to test the equality of two covariance matrices, but many of them perform poorly in practice, especially when there are a few large or many small diagonal disturbances between the two covariance matrices.

Let X and Y be p-dimensional random vectors with covariance matrices Σ1 and Σ2, respectively. Given independent samples $\mathcal{X}_m = \{X_1, \ldots, X_m\}$ from X and $\mathcal{Y}_n = \{Y_1, \ldots, Y_n\}$ from Y, we want to test:

$$H_0: \Sigma_1 = \Sigma_2 \quad \text{vs.} \quad H_1: \Sigma_1 \neq \Sigma_2. \tag{1.1}$$

In the classic low-dimensional setting, Anderson (2003) proposed a likelihood ratio test (LRT) statistic and showed that, under the multivariate normality assumption and H0 with p fixed, the LRT statistic is asymptotically χ2-distributed with p(p + 1)/2 degrees of freedom.

In recent applications ranging from gene expression (Pan et al., 2018) to neuroimaging (Le Bihan et al., 2001) and risk management (Bollerslev et al., 2019), the dimension p can be much larger than the sample size n, e.g., “large p small n” or “large p large n.” In this setting, the sample covariance matrix does not converge to its population counterpart (Bai et al., 2009), and the aforementioned classical methods for the low-dimensional case are either not applicable or perform poorly. As well summarized by Cai (2017), this problem is so important and difficult that it has attracted a great deal of attention. We briefly review some of the methods below.

Modified LRT-based methods have been considered by Bai et al. (2009), Jiang and Yang (2013), and Jiang and Qi (2015). Because Σ1 = Σ2 is equivalent to the Frobenius norm condition tr[(Σ1 − Σ2)2] = 0, Frobenius norm-based tests have also been proposed (Schott, 2007; Srivastava and Yanagihara, 2010). However, both the modified LRT-based and the Frobenius norm-based methods assume multivariate normality. Li and Chen (2012) removed the normality assumption by using a linear combination of three one-sample U-statistics. Their test can be applied without assuming parametric distributions for the two populations and is very powerful when there are many small differences between the two population covariance matrices, but it was found to lack power against sparse alternatives (Yang and Pan, 2017) or when the two covariance matrices differ slightly only in the diagonal (Wu and Li, 2015). Using random projection, Wu and Li (2015) constructed a test statistic that improves the power when there are many small diagonal disturbances between the two covariance matrices, but again assuming normality. He et al. (2021) introduced an adaptive test that combines finite-order U-statistics, including variants of Frobenius norm-based statistics.

Assuming sparsity of Σ1 − Σ2 under the alternative hypothesis, Cai et al. (2013) introduced an extreme-value statistic, which is robust with respect to the population distributions and is very powerful when there are only a few large disturbances between the two population covariance matrices, and Chang et al. (2017) proposed a computationally fast procedure. Zhu et al. (2017) constructed a sparse-leading-eigenvalue-driven (sLED) test to deal with the scenario in which the signals are both sparse and weak, and proved that sLED achieves full power asymptotically when the sparse signal is strong enough. However, these tests either lack power when there are many small disturbances between Σ1 and Σ2, or their theoretical properties require an explicit relationship between n and p, or they are too complicated for practical use.

The tests mentioned above often target specific situations. To accommodate various situations, weighted combination tests have been proposed. For example, Yang and Pan (2017) proposed a weighted test statistic based on random matrix theory. Zheng et al. (2020) proposed a power-enhanced high-dimensional test. These tests can handle both sparse and non-sparse structures. However, they depend on a proper choice of weights, which is a very challenging task. Furthermore, these procedures also require an explicit relationship between n and p. Yu et al. (2020) proposed a scale-invariant power enhancement test based on Fisher’s method, but it again assumes normality.

In short, although many methods exist, each is limited in its own way. Through a novel use of random integration, we propose a method to test the equality of two high-dimensional covariance matrices. Specifically, our proposed test possesses the following three merits simultaneously:

  1. Our test does not require a distributional assumption.

  2. Our test works well in the “large p” paradigm, even when there exist a few large or many small diagonal disturbances between the two covariance matrices.

  3. The asymptotic theory is established under a general multivariate model with certain moment conditions, without requiring an explicit relationship between p and n.

The rest of the paper is organized as follows. In Section 2, we introduce our test statistic via the random integration technique and establish its asymptotic properties. In Section 3, simulation studies are conducted to evaluate the finite-sample performance of the proposed test. In Section 4, a real data set is analyzed to compare the proposed test with some existing methods. We conclude with some remarks in Section 5; all technical proofs are presented in the Supplementary Material.

2. Methodology and main results

To introduce a proper statistic for testing (1.1), especially when there are many small diagonal disturbances between the two covariance matrices, it would be helpful to strengthen the information on the diagonal disturbances of Σ1 − Σ2. Unlike the random matrix projection method, which needs the normality assumption to strengthen the information along a single direction, we develop a random integration technique that strengthens the information through integration. To this end, we first denote Xc = X − μ1 and Yc = Y − μ2, where μ1 = EX and μ2 = EY. Then, $\Sigma_1 = E[X_cX_c^\top]$ and $\Sigma_2 = E[Y_cY_c^\top]$. Note the following equivalences:

$$\begin{aligned}
\Sigma_1 = \Sigma_2 &\;\Longleftrightarrow\; E[X_cX_c^\top] = E[Y_cY_c^\top]\\
&\;\Longleftrightarrow\; \alpha^\top E[X_cX_c^\top]\alpha = \alpha^\top E[Y_cY_c^\top]\alpha, \ \text{ for any } \alpha \in \mathbb{R}^p\\
&\;\Longleftrightarrow\; E\big[\alpha^\top(X_cX_c^\top - Y_cY_c^\top)\alpha\big] = 0, \ \text{ for any } \alpha \in \mathbb{R}^p.
\end{aligned}$$

Thus, testing whether Σ1 = Σ2 amounts to testing whether

$$RI_w(X,Y) \equiv \int E^{2}\big[\alpha^\top(X_cX_c^\top - Y_cY_c^\top)\alpha\big]\, w(\alpha)\, d\alpha = 0, \tag{2.1}$$

where w(α) is a positive weight function. A critical observation is that RIw(X, Y) may be evaluated easily for certain properly chosen w. The following Theorem 1 enables us to derive an explicit form for (2.1) and obtain our difference measure of two covariances.

Theorem 1. If w(α) is a p-dimensional standard normal density function, then

$$RI(X,Y) \equiv RI_w(X,Y) = [\operatorname{tr}(\Sigma_1 - \Sigma_2)]^{2} + 2\operatorname{tr}\{(\Sigma_1 - \Sigma_2)^{2}\}, \tag{2.2}$$

and RI(X, Y) ≥ 0, with equality if and only if Σ1 = Σ2.
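The identity (2.2) rests on the standard second moment of a Gaussian quadratic form; writing $D = \Sigma_1 - \Sigma_2$, the following short calculation sketches the argument (the formal proof is given in the Supplementary Material):

$$\begin{aligned}
RI_w(X,Y) &= \int E^{2}\big[\alpha^\top (X_cX_c^\top - Y_cY_c^\top)\alpha\big]\, w(\alpha)\, d\alpha
= \int (\alpha^\top D\alpha)^{2}\, w(\alpha)\, d\alpha
= E_{\alpha \sim N(0, I_p)}\big[(\alpha^\top D\alpha)^{2}\big] \\
&= \big[E(\alpha^\top D\alpha)\big]^{2} + \operatorname{Var}(\alpha^\top D\alpha)
= [\operatorname{tr}(D)]^{2} + 2\operatorname{tr}(D^{2}),
\end{aligned}$$

using that, for $\alpha \sim N(0, I_p)$ and a symmetric matrix $D$, $E(\alpha^\top D\alpha) = \operatorname{tr}(D)$ and $\operatorname{Var}(\alpha^\top D\alpha) = 2\operatorname{tr}(D^{2})$.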

Remark 1. The p-dimensional standard normal random vector can be expressed as the product of a uniformly distributed random vector on the unit sphere $S^{p-1}$ and an independent radial random variable. We choose the standard normal density function so that every unit vector is treated equally in the integration. Moreover, the multivariate normal distribution is a special case of the multivariate stable distributions, and multivariate stable distributions have been used as weight functions by Chen et al. (2019). As future work, it may be worthwhile to consider other weight functions, such as the uniform distribution (Zhu et al., 2017; Kim et al., 2020) and the Bernoulli distribution (Qiu et al., 2021).

Note that the second term on the right-hand side of (2.2) is the squared Frobenius norm of the difference between the two covariance matrices, so the test can be designed to be powerful when there are many small disturbances between Σ1 and Σ2. Meanwhile, unlike the test of Li and Chen (2012), the same test can be powerful when the two covariance matrices differ only by a small amount in the diagonal, thanks to the first term on the right-hand side of (2.2), which is the square of the difference between the traces (the summed diagonal elements) of the two covariance matrices. If the nonzero signals among the differences of the diagonal elements of Σ1 and Σ2 are weakly dense with almost the same sign, RI(X, Y) should yield a more powerful test than a variant in which [tr(Σ1 − Σ2)]2 is replaced by $\sum_{k=1}^{p}(\Sigma_{1,kk} - \Sigma_{2,kk})^{2}$. In addition, estimating RI(X, Y) does not require consistent estimates of the covariance matrices, since RI(X, Y) only involves traces of matrices. In the following, we obtain an unbiased estimator of RI(X, Y), which is our desired test statistic for testing (1.1).

Denote

$$A_{m1} = \frac{1}{m(m-1)}\sum_{i\neq j}^{*}(X_i^\top X_i)(X_j^\top X_j) - \frac{2}{m(m-1)(m-2)}\sum_{i,j,k}^{*}X_i^\top X_j\, X_k^\top X_k + \frac{1}{m(m-1)(m-2)(m-3)}\sum_{i,j,k,l}^{*}X_i^\top X_j\, X_k^\top X_l,$$

$$B_{n1} = \frac{1}{n(n-1)}\sum_{i\neq j}^{*}(Y_i^\top Y_i)(Y_j^\top Y_j) - \frac{2}{n(n-1)(n-2)}\sum_{i,j,k}^{*}Y_i^\top Y_j\, Y_k^\top Y_k + \frac{1}{n(n-1)(n-2)(n-3)}\sum_{i,j,k,l}^{*}Y_i^\top Y_j\, Y_k^\top Y_l,$$

$$C_{m,n,1} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}(X_i^\top X_i)(Y_j^\top Y_j) - \frac{1}{nm(m-1)}\sum_{i\neq k}^{*}\sum_{j=1}^{n}Y_j^\top Y_j\, X_i^\top X_k - \frac{1}{mn(n-1)}\sum_{i\neq k}^{*}\sum_{j=1}^{m}X_j^\top X_j\, Y_i^\top Y_k + \frac{1}{m(m-1)n(n-1)}\sum_{i\neq k}^{*}\sum_{j\neq l}^{*}X_i^\top X_k\, Y_j^\top Y_l,$$

$$A_{m2} = \frac{1}{m(m-1)}\sum_{i\neq j}^{*}(X_i^\top X_j)^{2} - \frac{2}{m(m-1)(m-2)}\sum_{i,j,k}^{*}X_i^\top X_j\, X_j^\top X_k + \frac{1}{m(m-1)(m-2)(m-3)}\sum_{i,j,k,l}^{*}X_i^\top X_j\, X_k^\top X_l,$$

$$B_{n2} = \frac{1}{n(n-1)}\sum_{i\neq j}^{*}(Y_i^\top Y_j)^{2} - \frac{2}{n(n-1)(n-2)}\sum_{i,j,k}^{*}Y_i^\top Y_j\, Y_j^\top Y_k + \frac{1}{n(n-1)(n-2)(n-3)}\sum_{i,j,k,l}^{*}Y_i^\top Y_j\, Y_k^\top Y_l,$$

$$C_{m,n,2} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}(X_i^\top Y_j)^{2} - \frac{1}{nm(m-1)}\sum_{i\neq k}^{*}\sum_{j=1}^{n}X_i^\top Y_j\, Y_j^\top X_k - \frac{1}{mn(n-1)}\sum_{i\neq k}^{*}\sum_{j=1}^{m}Y_i^\top X_j\, X_j^\top Y_k + \frac{1}{m(m-1)n(n-1)}\sum_{i\neq k}^{*}\sum_{j\neq l}^{*}X_i^\top Y_j\, X_k^\top Y_l,$$

where $\sum^{*}$ denotes summation over mutually distinct indices. Then the proposed sample test statistic is

$$RI_{m,n} = A_{m1} - 2C_{m,n,1} + B_{n1} + 2\big(A_{m2} - 2C_{m,n,2} + B_{n2}\big), \tag{2.3}$$

which is an unbiased estimator of RI(X, Y).

Remark 2. The computational cost is of order $\max\{pn^{4}, pm^{4}\}$ if $RI_{m,n}$ is computed directly. In fact, the computational burden comes from the last two sums in $A_{m1}$, $B_{n1}$, $A_{m2}$, $B_{n2}$ and the last three in $C_{m,n,1}$, $C_{m,n,2}$. Since $RI_{m,n}$ is invariant under location shifts, we can assume without loss of generality that μ1 = μ2 = 0. Under this assumption, the last two sums in $A_{m1}$, $B_{n1}$, $A_{m2}$, $B_{n2}$ and the last three in $C_{m,n,1}$, $C_{m,n,2}$ are all of smaller order than the first. This indicates that we can first transform the data $X_i$ to $X_i - \bar{X}$ and $Y_j$ to $Y_j - \bar{Y}$, and then compute only the first term in each of $A_{m1}$, $B_{n1}$, $A_{m2}$, $B_{n2}$, $C_{m,n,1}$, $C_{m,n,2}$. This reduces the computational cost to the order of $\max\{pn^{2}, pm^{2}\}$ without affecting the asymptotic properties of our proposed test.
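To make this shortcut concrete, the following sketch (our illustration, not the authors' code; the function name ri_leading_terms is ours) computes the leading term of each of the six quantities from centered data stored as sample-by-feature arrays, at a cost of order $\max\{pm^{2}, pn^{2}\}$.

```python
import numpy as np

def ri_leading_terms(X, Y):
    """Leading-term approximations of A_m1, B_n1, C_mn1, A_m2, B_n2, C_mn2
    computed from centered data, as described in Remark 2.

    X : (m, p) array, sample from the first population
    Y : (n, p) array, sample from the second population
    """
    m, n = X.shape[0], Y.shape[0]
    Xc = X - X.mean(axis=0)                  # X_i -> X_i - X_bar
    Yc = Y - Y.mean(axis=0)                  # Y_j -> Y_j - Y_bar

    sx = np.sum(Xc ** 2, axis=1)             # squared norms ||X_i||^2 after centering
    sy = np.sum(Yc ** 2, axis=1)

    # leading terms of A_m1, B_n1, C_mn1 (targets: [tr S1]^2, [tr S2]^2, tr S1 * tr S2)
    A1 = (sx.sum() ** 2 - np.sum(sx ** 2)) / (m * (m - 1))
    B1 = (sy.sum() ** 2 - np.sum(sy ** 2)) / (n * (n - 1))
    C1 = sx.sum() * sy.sum() / (m * n)

    # leading terms of A_m2, B_n2, C_mn2 (targets: tr(S1^2), tr(S2^2), tr(S1 S2))
    Gxx = Xc @ Xc.T                           # m x m Gram matrix
    Gyy = Yc @ Yc.T
    Gxy = Xc @ Yc.T
    A2 = (np.sum(Gxx ** 2) - np.sum(np.diag(Gxx) ** 2)) / (m * (m - 1))
    B2 = (np.sum(Gyy ** 2) - np.sum(np.diag(Gyy) ** 2)) / (n * (n - 1))
    C2 = np.sum(Gxy ** 2) / (m * n)

    RI = A1 - 2 * C1 + B1 + 2 * (A2 - 2 * C2 + B2)   # approximate RI_{m,n}, cf. (2.3)
    return RI, A2, B2
```

The squared norms and the three Gram matrices $X_cX_c^\top$, $Y_cY_c^\top$, and $X_cY_c^\top$ carry all the information needed for the leading terms, which is exactly the reduction described in Remark 2.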

2.1. Asymptotic properties

To establish the limiting distribution of RIm,n, we assume the following three conditions:

E1. There exist a p × m1 matrix Γ1, a p × m2 matrix Γ2, m1-dimensional random vectors $\{Z_{1j}\}_{j=1}^{m}$, and m2-dimensional random vectors $\{Z_{2j}\}_{j=1}^{n}$, such that $X_j = \mu_1 + \Gamma_1 Z_{1j}$ for j = 1, ⋯ , m, and $Y_j = \mu_2 + \Gamma_2 Z_{2j}$ for j = 1, ⋯ , n. Moreover, $\Gamma_i$, i = 1, 2, and $Z_{ij} = (Z_{ij1}, \ldots, Z_{ijm_i})^\top$ for i = 1, j = 1, ⋯ , m and i = 2, j = 1, ⋯ , n satisfy:

  1. $\Gamma_1\Gamma_1^\top = \Sigma_1$ and $\Gamma_2\Gamma_2^\top = \Sigma_2$ with min{m1, m2} ≥ p.

  2. $\{Z_{1j}\}_{j=1}^{m}$ and $\{Z_{2j}\}_{j=1}^{n}$ are independent and identically distributed, respectively, with $EZ_{1j} = 0$, $\operatorname{Var}(Z_{1j}) = I_{m_1}$, and $EZ_{2j} = 0$, $\operatorname{Var}(Z_{2j}) = I_{m_2}$, where $I_{m_i}$ is the $m_i \times m_i$ identity matrix.

  3. $\sup_{i,k} E|Z_{ijk}|^{8} < \infty$ and $EZ_{ijk}^{4} = 3 + \Delta_i$ for some constant $\Delta_i$. Also,
    $$E\big(Z_{ijl_1}^{\varsigma_1}\cdots Z_{ijl_q}^{\varsigma_q}\big) = E\big(Z_{ijl_1}^{\varsigma_1}\big)\cdots E\big(Z_{ijl_q}^{\varsigma_q}\big) \tag{2.4}$$
    for any positive integers q and $\varsigma_l$ such that $\sum_{l=1}^{q}\varsigma_l \leq 8$, where $l_1, l_2, \cdots , l_q$ are distinct indices.

E2. As min{m, n} → ∞, p → ∞, and for any i, j, k, l ∈ {1, 2}, $\operatorname{tr}(\Sigma_i\Sigma_j) \to \infty$ and

$$\operatorname{tr}\{(\Sigma_i\Sigma_j)(\Sigma_k\Sigma_l)\} = o\{\operatorname{tr}(\Sigma_i\Sigma_j)\operatorname{tr}(\Sigma_k\Sigma_l)\}. \tag{2.5}$$

E3. As min{m, n} → ∞, m/(m + n) → τ ∈ (0, 1).

Remark 3. Condition E1 specifies a general multivariate model for high-dimensional data analysis, which includes commonly used distributions such as the multivariate normal distribution (Chen et al., 2010b; Srivastava and Yanagihara, 2010; Li and Chen, 2012). According to Chen and Qin (2010a), min{m1, m2} ≥ p means that the rank and eigenvalues of Σ1 or Σ2 are not affected by the transformation. According to Chen and Qin (2010a) and He and Chen (2018), (2.4) can be viewed as a pseudo-independence condition on $Z_{ij}$, namely a relaxed independence relation that allows some margin over probabilities (Kim and Lesser, 2008). Obviously, if $Z_{ij}$ has independent components, then (2.4) holds.

In Condition E2, we do not require a direct relationship between p and n. For example, if all the eigenvalues of Σi are bounded away from zero and infinity, then $\operatorname{tr}(\Sigma_i\Sigma_j) \asymp p$ while $\operatorname{tr}\{(\Sigma_i\Sigma_j)(\Sigma_k\Sigma_l)\} = O(p) = o(p^{2})$, so E2 holds without imposing any relation between p and n. Meanwhile, some commonly encountered covariance structures satisfy Condition E2 (Chen et al., 2010b).

Condition E3 is a standard regularity assumption in two-sample problems, which guarantees that m and n go to infinity proportionally.

Theorem 2. Under Conditions E1–E3, as min{m, n} → ∞, we have

$$\frac{RI_{m,n} - RI(X,Y)}{\sigma_{m,n}} \xrightarrow{\;D\;} N(0,1),$$

where $\sigma_{m,n}^{2}$ is defined in (A.2) in the Supplementary Material.

Under H0, we can obtain RI(X, Y) = 0, and

$$\sigma_{0,m,n}^{2} = 24\left(\frac{1}{m} + \frac{1}{n}\right)^{2}\operatorname{tr}^{2}(\Sigma^{2}).$$

Therefore, we obtain the following corollary.

Corollary 1. Under Conditions E1–E3 and H0 : Σ1 = Σ2 = Σ, as min{m, n} → ∞, we have

$$RI_{m,n}/\sigma_{0,m,n} \xrightarrow{\;D\;} N(0,1).$$

To formulate a test procedure, we need to estimate $\sigma_{0,m,n}$. Since $EA_{m2} = \operatorname{tr}(\Sigma_1^{2})$ and $EB_{n2} = \operatorname{tr}(\Sigma_2^{2})$, the following is a consistent estimator $\hat{\sigma}_{0,m,n}$ of $\sigma_{0,m,n}$ under H0:

$$\hat{\sigma}_{0,m,n} = 2\sqrt{6}\left(\frac{1}{m}A_{m2} + \frac{1}{n}B_{n2}\right).$$

Furthermore, the following theorem assures that σ^0,m,n is ratio-consistent to σ0,m,n.

Theorem 3. Under Conditions E1–E3 and H0 : Σ1 = Σ2 = Σ, as min{m, n} → ∞, we have

$$RI_{m,n}/\hat{\sigma}_{0,m,n} \xrightarrow{\;D\;} N(0,1). \tag{2.6}$$

As we shall derive in the Supplementary Material,

$$\frac{A_{m2}}{\operatorname{tr}(\Sigma_1^{2})} \xrightarrow{\;P\;} 1, \qquad \frac{B_{n2}}{\operatorname{tr}(\Sigma_2^{2})} \xrightarrow{\;P\;} 1, \qquad \text{and} \qquad \frac{\hat{\sigma}_{0,m,n}}{\sigma_{0,m,n}} \xrightarrow{\;P\;} 1.$$

Theorem 3 follows from Corollary 1 and Slutsky’s theorem. Therefore, the proposed test with a nominal significance level θ rejects H0 if $RI_{m,n} \geq \hat{\sigma}_{0,m,n} z_\theta$, where $z_\theta$ is the upper-θ quantile of N(0,1). The approximation result in the Supplementary Material indicates that the standard normal distribution can adequately substitute for the null distribution of $RI_{m,n}/\hat{\sigma}_{0,m,n}$.
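As an illustrative sketch of the resulting decision rule (our code, not the authors'; RI, A2, and B2 stand for $RI_{m,n}$, $A_{m2}$, and $B_{n2}$, for example as returned by the ri_leading_terms sketch above):

```python
import numpy as np
from scipy.stats import norm

def ri_decision(RI, A2, B2, m, n, theta=0.05):
    """Normal-approximation test: reject H0 if RI_{m,n} >= sigma0_hat * z_theta."""
    sigma0_hat = 2 * np.sqrt(6) * (A2 / m + B2 / n)  # estimated null standard deviation
    z_theta = norm.ppf(1 - theta)                    # upper-theta quantile of N(0,1)
    stat = RI / sigma0_hat
    p_value = norm.sf(stat)                          # one-sided p-value
    return stat, p_value, stat >= z_theta
```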

Next, we study the power of our proposed test. Let $g_{m,n}(\Sigma_1,\Sigma_2;\theta) = P\big(RI_{m,n} \geq \hat{\sigma}_{0,m,n} z_\theta \mid H_1\big)$ be the power of the proposed test under H1 : Σ1 ≠ Σ2. Let $\mathrm{SNR}_{m,n}(\Sigma_1,\Sigma_2) = RI(X,Y)/\sigma_{m,n}$ and $\gamma_{m,n} = \big[\operatorname{tr}(\Sigma_1^{2})/m + \operatorname{tr}(\Sigma_2^{2})/n\big]/RI(X,Y)$. Then we obtain the following Theorem 4.

Theorem 4. Under Conditions E1–E3 and H1 : Σ1 ≠ Σ2, we have

$$\lim_{m,n\to\infty} g_{m,n}(\Sigma_1,\Sigma_2;\theta) \;\geq\; \lim_{m,n\to\infty} \Phi\big(-2z_\theta + \mathrm{SNR}_{m,n}(\Sigma_1,\Sigma_2)\big),$$

where Φ(·) is the cumulative standard normal distribution function.

Theorem 4 indicates that the power of our proposed test is bounded from below. Meanwhile, the power is mainly determined by $\mathrm{SNR}_{m,n}(\Sigma_1, \Sigma_2)$. From (A.2) in the Supplementary Material, we have

$$\sigma_{m,n}^{2} \;\leq\; 24\left[\frac{\operatorname{tr}(\Sigma_1^{2})}{m} + \frac{\operatorname{tr}(\Sigma_2^{2})}{n}\right]^{2} + 20\max\{2+\Delta_1,\, 2+\Delta_2\}\left[\frac{\operatorname{tr}(\Sigma_1^{2})}{m} + \frac{\operatorname{tr}(\Sigma_2^{2})}{n}\right] RI(X,Y),$$

that is,

$$\mathrm{SNR}_{m,n}(\Sigma_1,\Sigma_2) \;\geq\; \left[24\gamma_{m,n}^{2} + 20\max\{2+\Delta_1,\, 2+\Delta_2\}\gamma_{m,n}\right]^{-1/2}.$$

Therefore, when $\gamma_{m,n} \to 0$ as min{m, n} → ∞, we have $\mathrm{SNR}_{m,n}(\Sigma_1, \Sigma_2) \to \infty$. Thus, we have

$$\lim_{m,n\to\infty} g_{m,n}(\Sigma_1,\Sigma_2;\theta) = 1.$$

In particular, we consider the following three cases.

Case I: Let Σ1 = rIp + AR(0.1) and Σ2 = AR(0.1), where AR(ρ) = (aij)p×p is a covariance matrix with $a_{ij} = \rho^{|i-j|}$ for i, j = 1, ⋯ , p. Then, we have the following Corollary 2.

Corollary 2. Under Conditions E1–E3, if $\lim_{m,n,p\to\infty} \frac{p}{(m+n)r^{2}} = c$ for some 0 < c < ∞, then we have

$$\lim_{m,n\to\infty} g_{m,n}(\Sigma_1,\Sigma_2;\theta) = 1.$$

Corollary 2 shows that our proposed test is very powerful when there are many small diagonal disturbances between the two covariance matrices. In addition, under the conditions of Corollary 2, the signal-to-noise ratio of the method proposed by Li and Chen (2012) diminishes to 0. Therefore, the corresponding lower bound on the power of the test proposed by Li and Chen (2012) is low.
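Heuristically (constants suppressed; the eigenvalues of AR(0.1) are bounded away from zero and infinity), the condition $p/\{(m+n)r^{2}\} \to c$ in Corollary 2 forces $\gamma_{m,n} \to 0$:

$$RI(X,Y) = [\operatorname{tr}(rI_p)]^{2} + 2\operatorname{tr}\{(rI_p)^{2}\} = p^{2}r^{2} + 2pr^{2}, \qquad \operatorname{tr}(\Sigma_1^{2}) \asymp p(1+r)^{2}, \qquad \operatorname{tr}(\Sigma_2^{2}) \asymp p,$$

so that, with $\min\{m,n\} \asymp m+n$ by Condition E3 and $r^{2} \asymp p/(m+n)$,

$$\gamma_{m,n} \;\lesssim\; \frac{p(1+r)^{2}/\min\{m,n\}}{p^{2}r^{2}} \;\asymp\; \frac{(1+r)^{2}}{p^{2}} \;\longrightarrow\; 0,$$

whether $r$ stays bounded or diverges; hence $\mathrm{SNR}_{m,n}(\Sigma_1,\Sigma_2) \to \infty$ and the power tends to one.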

Case II: Let Σ1 = Ip, Σ2 = Ip + H(ϖ0, ϖ1, p0), where H(ϖ0, ϖ1, p0) = (hij)p×p with hij = 0 except hii = ϖ0, i = 1, ⋯ , p0 and hi,i+1 = hi+1,i = ϖ1, i = 1, ⋯ , p0−1. Then, we have the following Corollary 3.

Corollary 3. Under Conditions E1–E3, if $\frac{p}{(m+n)p_0^{2}\varpi_0^{2}} = o(1)$ and $np_0\varpi_0 \to \infty$, then we have

$$\lim_{m,n\to\infty} g_{m,n}(\Sigma_1,\Sigma_2;\theta) = 1.$$

According to Cai et al. (2013), we take $\varpi_0 = O(\sqrt{p/(m+n)})$ and $p_0 = p^{1/4}$. Corollary 3 shows that the proposed test is powerful in this case. Meanwhile, the method proposed by Cai et al. (2013) is also powerful in this case.

Case III: Let Σ1 = Ip, Σ2 = Ip + M, where M is a p × p matrix with Mii = 0 and Mij = ω1 for i ≠ j. Then, we have the following Corollary 4.

Corollary 4. Under Conditions E1–E3, if $(m+n)p\omega_1^{2} \to \infty$ as m, n, p → ∞, then we have

$$\lim_{m,n\to\infty} g_{m,n}(\Sigma_1,\Sigma_2;\theta) = 1.$$

Corollary 4 indicates that our proposed test is also very powerful under some conditions when the diagonals are the same and there are many small non-diagonal disturbances between the two covariance matrices.

3. Simulation Studies

We carry out numerical simulations to investigate the finite-sample performance of our proposed method. We compare our proposed method (RI) with the method proposed by Li and Chen (2012) (LC), the method proposed by Cai et al. (2013) (Cai), the sLED method proposed by Zhu et al. (2017), the method proposed by Chang et al. (2017) ($\Psi_{B,\alpha}$), the $T_2$ method proposed by Zheng et al. (2020), and the scale-invariant power enhancement test based on Fisher’s method proposed by Yu et al. (2020) ($F_{m,n}$). We set the nominal level of significance at 0.05. We choose the sample sizes as m = n = 60, 100, and m = 200, n = 60, and the dimension as p = 300, 500, 800, 1000, 1200, 1500. All empirical sizes and powers are calculated from 1000 replications. For Σ1 and Σ2, we consider the following eight scenarios:

Scenario 1: $\Sigma_1^{(1)} = I_p$, the identity matrix, and $\Sigma_2^{(1)} = I_p + H(0.04, 0.2, 0.3p)$, where H(ϖ0, ϖ1, k) = (hij)p×p with hij = 0 except hii = ϖ0, i = 1, ⋯ , k, and hi,i+1 = hi+1,i = ϖ1, i = 1, ⋯ , k − 1;

Scenario 2: $\Sigma_1^{(2)} = I_p$, $\Sigma_2^{(2)} = I_p + H(0.04, 0.2, p)$;

Scenario 3: $\Sigma_1^{(3)} = I_p$, $\Sigma_2^{(3)} = I_p + \Sigma_*^{(3)}$, where $\Sigma_*^{(3)} = (\sigma_{ij,*}^{(3)})$ is a p × p matrix with $\sigma_{ii,*}^{(3)} = 0$ for i = 1, ⋯ , p and $\sigma_{ij,*}^{(3)} = 1/\sqrt{p}$ for i ≠ j;

Scenario 4: $\Sigma_1^{(4)} = I_p$, $\Sigma_2^{(4)} = I_p + H(4, 0.05, 0.02p)$;

Scenario 5: $\Sigma_1^{(5)} = \Sigma_*^{(5)} + \delta_0 I_p$, $\Sigma_2^{(5)} = \Sigma_*^{(5)} + \delta_0 I_p + U$, where $\Sigma_*^{(5)} = (\sigma_{ij,*}^{(5)}) = D^{1/2} C D^{1/2}$, D = diag(d1, ⋯ , dp) with d1, ⋯ , dp i.i.d. Unif(0.5, 2.5), and C = (cij) with cii = 1, cij = 0.5 for 5(k − 1) + 1 ≤ i ≠ j ≤ 5k, k = 1, ⋯ , [p/5], and cij = 0 otherwise. U is a p × p symmetric matrix with four nonzero entries drawn from $\mathrm{Unif}(0,4)\times\max_{1\leq j\leq p}\sigma_{jj,*}^{(5)}$ randomly located in the upper triangle, and another four located in the lower triangle by symmetry. $\delta_0 = |\min\{\lambda_{\min}(\Sigma_*^{(5)}+U),\, \lambda_{\min}(\Sigma_*^{(5)})\}| + 0.05$, where λmin(A) denotes the minimum eigenvalue of a symmetric matrix A.

Scenario 6: $\Sigma_1^{(6)} = 0.2 \times I_p + AR(0.1)$, $\Sigma_2^{(6)} = AR(0.1)$, where AR(ρ) = (aij)p×p is a covariance matrix with aii = 1 and $a_{ij} = \rho^{|i-j|}$ for i ≠ j;

Scenario 7: $\Sigma_1^{(7)} = 0.1 \times I_p + AR(0.2)$, $\Sigma_2^{(7)} = AR(0.24)$.

Scenario 8: $\Sigma_1^{(8)} = \Sigma_*^{(8)} + \lambda_0 I_p$, $\Sigma_2^{(8)} = \Sigma_*^{(8)} + Q + \lambda_0 I_p$, where $\Sigma_*^{(8)} = (\sigma_{ij,*}^{(8)})_{1\leq i,j\leq p}$ with i.i.d. $\sigma_{ii,*}^{(8)} \sim \mathrm{Unif}(1,2)$ and $\sigma_{ij,*}^{(8)} = \{(|i-j|+1)^{2H} + (|i-j|-1)^{2H} - 2|i-j|^{2H}\}/2$ with H = 0.85 for i ≠ j. The perturbation matrix Q has ⌊0.05p⌋ random non-zero elements on the diagonal and ⌊0.05p⌋ off the diagonal; ⌊0.05p⌋/2 of the off-diagonal non-zero elements are randomly allocated in the upper triangle of Q and the others are placed in its lower triangle by symmetry. The magnitudes of the non-zero elements are randomly generated from Unif(τ/2, 3τ/2) with $\tau = 8\max\{\max_{1\leq i\leq p}\sigma_{ii,*}^{(8)},\, (\log p)^{1/2}\}$, and $\lambda_0 = |\min\{\lambda_{\min}(\Sigma_*^{(8)}+Q),\, \lambda_{\min}(\Sigma_*^{(8)})\}| + 0.05$.

Finally, the data are generated by $X_i = \Sigma_1^{1/2} Z_i$ for i = 1, ⋯ , m and $Y_l = \Sigma_2^{1/2} Z_{m+l}$ for l = 1, ⋯ , n, where {Zi : i = 1, ⋯ , m + n} are independent p-dimensional random vectors with i.i.d. coordinates Zij, j = 1, ⋯ , p. We consider the following four distributions for Zij:

  1. The standard normal N(0,1);

  2. The t-distribution with degrees of freedom 15, i.e., t(15);

  3. The centralized Gamma distribution with a = 16, b = 0.25, i.e., Γ(16,0.25)−4;

  4. The discrete distribution with five possible values −2, −1, 0, 3/2, 4 taken with probabilities 1/12, 4/25, 13/24, 16/75, 1/600, i.e.,
    $$Z_{ij} \sim \tfrac{1}{12}\delta_{-2} + \tfrac{4}{25}\delta_{-1} + \tfrac{13}{24}\delta_{0} + \tfrac{16}{75}\delta_{3/2} + \tfrac{1}{600}\delta_{4}.$$
    This distribution was used in Yang and Pan (2017), who showed that its first four moments are the same as those of N(0,1).

Scenario 1 features reasonably small disturbances between Σ1 and Σ2, and the two covariance matrices are reasonably sparse; this is similar to the case considered by Yang and Pan (2017). In Scenario 2, there are many small disturbances between Σ1 and Σ2; this case was also considered by Yang and Pan (2017) and Li and Chen (2012). In Scenario 3, the two matrices differ only in off-diagonal entries, with many weakly dense signals. Scenario 4 deals with the sparse situation, with ⌊0.02p⌋ features exerting larger signals. Scenario 5 is an extremely sparse situation, with 8 nonzero signals in the off-diagonal entries; it was considered by Cai et al. (2013). In Scenario 6, the two covariance matrices differ only in the diagonal. In Scenario 7, the two covariance matrices differ throughout by a larger amount. Scenarios 6 and 7 were also considered by Wu and Li (2015). Scenario 8 was studied by Chang et al. (2017), except that the perturbation matrix Q adds some non-zero elements to the diagonal.
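For concreteness, the following sketch (ours, not the authors' simulation code) generates one replicate under Scenario 6 with standard normal coordinates, taking the matrix square root via an eigendecomposition; the commented last line indicates how it would feed into the earlier test sketches.

```python
import numpy as np

def ar_cov(p, rho):
    """AR(rho) covariance matrix with entries rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def mat_sqrt(S):
    """Symmetric square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

rng = np.random.default_rng(0)
p, m, n = 300, 60, 60
Sigma1 = 0.2 * np.eye(p) + ar_cov(p, 0.1)   # Scenario 6
Sigma2 = ar_cov(p, 0.1)

Z = rng.standard_normal((m + n, p))         # i.i.d. N(0,1) coordinates Z_ij
X = Z[:m] @ mat_sqrt(Sigma1)                # X_i = Sigma_1^{1/2} Z_i
Y = Z[m:] @ mat_sqrt(Sigma2)                # Y_l = Sigma_2^{1/2} Z_{m+l}
# stat, pval, reject = ri_decision(*ri_leading_terms(X, Y), m, n)  # hypothetical helpers from the Section 2 sketches
```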

In all scenarios, we first calculate the empirical sizes when Σ1 = Σ2 = Σ(i) for i = 1, ⋯ , 8. The corresponding results are given in Tables 1–15 of the Supplementary Material. It can be seen from Tables 1–15 that the empirical sizes of the proposed RI method and the other six methods are controlled fairly well around 0.05 in all cases, except for the method proposed by Cai et al. (2013) under the discrete distribution.

Since the empirical powers for m = n = 100 and m = 200, n = 60 are very similar to those for m = n = 60, we only present the empirical powers with $\Sigma_1 = \Sigma_1^{(i)}$ and $\Sigma_2 = \Sigma_2^{(i)}$, i = 1, ⋯ , 8, for m = n = 60 in Figures 1–4. The empirical powers for m = n = 100 and m = 200, n = 60 are included in the Supplementary Material.

Figure 1: Empirical powers for Scenarios 1–6 with $Z_{ij}$ following N(0,1) and m = n = 60.

Figure 4: Empirical powers for Scenarios 1–6 with $Z_{ij}$ following the discrete distribution $\tfrac{1}{12}\delta_{-2} + \tfrac{4}{25}\delta_{-1} + \tfrac{13}{24}\delta_{0} + \tfrac{16}{75}\delta_{3/2} + \tfrac{1}{600}\delta_{4}$ and m = n = 60.

For Scenarios 1–2, we have the following findings:

  1. Our proposed RI test is considerably more powerful than the other methods. LC is the second most powerful, suggesting its ability to detect covariance differences with many small disturbances, as also observed by Yang and Pan (2017). $T_2$ is the third most powerful, since it is a weighted statistic. The Cai, sLED, and $\Psi_{B,\alpha}$ methods have poor power, below 0.20 in almost all cases.

  2. The powers of the RI, LC, $T_2$, and $F_{m,n}$ methods increase from Scenario 1 to Scenario 2, which is expected because we increase the small disturbances in the difference between the two covariance matrices. Thus RI, LC, $T_2$, and $F_{m,n}$ gain power as the amount of deviation increases, even when the individual deviations are not large. This is expected because they build on the Frobenius norm of the difference between the two covariance matrices. However, the other three methods are little affected as we increase the differences.

  3. It is not surprising but reassuring that the power of our proposed method increases as the sample sizes m and n increase.

In Scenario 3, the two covariance matrices have many weakly dense signals in the off-diagonal entries. The power of RI, LC, sLED, $T_2$, and $F_{m,n}$ is close to 1, whereas the power of the Cai test and $\Psi_{B,\alpha}$ is lower. These results are consistent with Corollary 4.

Scenario 4 contains strong and sparse signals in the main diagonal of the difference between the two covariance matrices. All seven methods perform quite well, especially when the sample size m or n is large. Meanwhile, the power of our proposed RI method is close to 1.

Scenario 5 contains extremely sparse, strong signals in the off-diagonal differences between the two covariance matrices. The power of the Cai test, $T_2$, $\Psi_{B,\alpha}$, and $F_{m,n}$ is high, whereas the power of RI, LC, and sLED is lower.

When the two covariance matrices differ in the main diagonal with sparsely weak signals, as in Scenario 6, our proposed RI method still has power close to one in all cases, while the powers of the other six methods are poor. These results are consistent with Corollary 2.

In Scenario 7, the two covariance matrices are entirely different to a large degree. The power of the RI method is close to 1, whereas the other six methods are much less powerful and not competitive with RI.

For Scenario 8, the two covariance matrices have long-range dependence, following Chang et al. (2017). The power of the RI method is higher than that of the other six methods for large p.

In conclusion, the RI method has considerably higher and more stable power than the six existing methods in a wide range of settings. Not only can it deal with cases with many small deviations in the difference of the covariance matrices, but it can also handle cases with sparsely strong or sparsely weak signals. Thus it is applicable to a broader range of applications in testing the difference between covariance matrices.

4. Real data Analysis

In this section, we apply the proposed method to analyze a gene expression dataset on breast cancer from a study reported by Schmidt et al. (2008). We downloaded the dataset from http://bioconductor.org/packages/release/data/experiment/html/breastCancerMAINZ.html.

The dataset contains gene expression patterns from 200 tumors of patients who were not treated by systemic therapy after surgery, and consists of 22,283 features. There were three groups of different tumor grades: 29 well differentiated tumors (group 1), 136 moderately differentiated tumors (group 2), and 35 poor/undifferentiated tumors (group 3). This dataset has been analyzed in the literature under the assumption that the two covariance matrices are equal; e.g., Teschendorff and Caldas (2008) and Haibe-Kains et al. (2012). The equality of the two covariance matrices is a very important assumption for the validity of the reported findings. We test whether this assumption is valid.

Following Gentleman et al. (2011), for quality control and to reduce the computational burden, we first select the features for which more than 50% of the intensities are greater than 5 and whose coefficients of variation (CV) fall inside the range (0.22, 1.0). The intensities are gene expressions measured by the Affymetrix hgu133a technology, and the coefficient of variation is the standard deviation divided by the absolute value of the mean. We also screen the features by predetermined cutoffs, which remove low-quality features while retaining high-quality features, as previously done (Sherafatian, 2018; Chong et al., 2018; Schiffman et al., 2019). After these selection steps, 1193 features remain in our analysis. Let Σ1, Σ2, and Σ3 be the covariance matrices of these 1193 features within groups 1, 2, and 3, respectively. We standardized each of the 1193 features so that its mean is zero. Then, we apply the LC, Cai, sLED, $T_2$, $\Psi_{B,\alpha}$, $F_{m,n}$, and RI methods to test separately (a) $H_0^{(1,2)}: \Sigma_1 = \Sigma_2$ and (b) $H_0^{(2,3)}: \Sigma_2 = \Sigma_3$. The p-values for all these methods are reported in Table 1. From Table 1, the RI method rejects both $H_0^{(1,2)}$ and $H_0^{(2,3)}$ at the 0.05 significance level, whereas the LC, sLED, $T_2$, and $F_{m,n}$ methods reject only $H_0^{(2,3)}$.
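A minimal sketch of the screening step described above, assuming the expression intensities are stored in a features-by-samples array named expr (the function name and array layout are our assumptions; the thresholds mirror the text):

```python
import numpy as np

def filter_features(expr, intensity_cut=5.0, frac=0.5, cv_low=0.22, cv_high=1.0):
    """Keep features with more than `frac` of intensities above `intensity_cut`
    and coefficient of variation strictly inside (cv_low, cv_high).

    expr : (n_features, n_samples) array of expression intensities
    """
    frac_high = np.mean(expr > intensity_cut, axis=1)
    cv = expr.std(axis=1, ddof=1) / np.abs(expr.mean(axis=1))
    keep = (frac_high > frac) & (cv > cv_low) & (cv < cv_high)
    return np.where(keep)[0]   # indices of retained features
```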

Table 1:

The p-values of the LC, Cai, sLED, $T_2$, $\Psi_{B,\alpha}$, $F_{m,n}$, and RI methods for the gene expression dataset

| Method | LC | Cai | sLED | $T_2$ | $\Psi_{B,\alpha}$ | $F_{m,n}$ | RI |
|---|---|---|---|---|---|---|---|
| $H_0^{(1,2)}$ | 0.105 | 0.104 | 0.330 | 0.555 | 0.056 | 0.061 | $1.110\times 10^{-16}$ |
| $H_0^{(2,3)}$ | $3.050\times 10^{-6}$ | 0.093 | 0.000 | $8.157\times 10^{-9}$ | 0.052 | $4.577\times 10^{-6}$ | $3.030\times 10^{-9}$ |

To visualize the comparison among the different methods, we plot the heat maps of $(\hat{\Sigma}_1 - \hat{\Sigma}_2)$ and $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$ for the top 100 features with the largest absolute values of the two-sample t statistics, where $\hat{\Sigma}_1$, $\hat{\Sigma}_2$, and $\hat{\Sigma}_3$ are the sample covariance matrices based on the selected 100 features. The corresponding results are shown in Figure 5. It is observed from Figure 5 that $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$ has stronger signals than $(\hat{\Sigma}_1 - \hat{\Sigma}_2)$. Meanwhile, many moderate disturbances are present in $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$. The maximum absolute entry of $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$ over the 1193 features is 8.972, which is larger than the maximum absolute entry of $(\hat{\Sigma}_1 - \hat{\Sigma}_2)$, i.e., 4.648. However, the diagonal of $(\hat{\Sigma}_1 - \hat{\Sigma}_2)$ has much stronger signals than that of $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$.

Figure 5: (a) the heat map of $(\hat{\Sigma}_1 - \hat{\Sigma}_2)$ for the selected 100 features; (b) the heat map of $(\hat{\Sigma}_2 - \hat{\Sigma}_3)$ for the selected 100 features.

Both Equation (2.2) and the simulation results of Scenarios 4–5 reveal that our proposed method is more powerful than the other methods when there are stronger signals in the diagonal of the difference between the two covariance matrices. This explains why our proposed method rejects $H_0^{(1,2)}$ whereas the others did not. Consequently, our method is the only one that is able to detect a more subtle but very important difference in this commonly analyzed dataset.

5. Discussion

Conducting inference for high-dimensional covariance matrices is highly challenging. In this paper, we make novel use of a random integration technique to develop a two-sample covariance matrix test statistic. This test can be performed without estimating the covariance matrices, which is known to be extremely difficult for high-dimensional data. We investigate both the theoretical properties and the numerical performance of our method, and find that it is not only competitive but often much more powerful than existing methods, in both simulation studies and a real data analysis, when there are many small diagonal disturbances between the two covariance matrices.

There are several issues that warrant further investigation. First, a general multivariate model is assumed to obtain the asymptotic results. Although this is a common assumption in the literature, it would be useful to investigate the asymptotic properties of our proposed method under weaker conditions, for example, Assumption 2.1 in Han and Wu (2020). Second, we consider a two-sample test for high-dimensional covariance matrices. As in Zheng et al. (2020), it would be interesting to extend our method to testing the equality of more than two covariance matrices. Third, we will extend the proposed method to test high-dimensional correlation matrices, which is a more difficult task than testing covariance matrices (Zheng et al., 2019). Finally, we use the standard multivariate normal density function as the weight function in the construction of our test statistic. This is a common practice, but it is worth investigating other choices that may perform better under various settings.


Figure 2: Empirical powers for Scenarios 1–6 with $Z_{ij}$ following t(15) and m = n = 60.

Figure 3: Empirical powers for Scenarios 1–6 with $Z_{ij}$ following Γ(16, 0.25) − 4 and m = n = 60.

Acknowledgements

Wang and Zhang contributed equally to this article. Jiang’s research is partially supported by the Natural Science Foundation of Guangdong (2018A030313171 and 2019A1515011830). Wen’s research is partially supported by NSFC(11801540). Wang’s research is partially supported by the International Science & Technology cooperation program of Guangdong, China(2016B050502007), the National Key Research and Development Program of China(2018YFC1315400), NSFC(11771462), and the Key Research and Development Program of Guangdong, China(2019B020228001). Zhang’s research is partially supported by U.S. National Institutes of Health (R01HG010171 and R01MH116527) and National Science Foundation (DMS2112711).

Footnotes

Supplementary Materials

The Supplementary Material includes detailed proofs of the theoretical results and additional simulation results.

References

  1. Anderson TW (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: Wiley-Interscience.
  2. Bai Z, Jiang D, Yao J-F and Zheng S (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. The Annals of Statistics 37, 3822–3840.
  3. Bollerslev T, Meddahi N and Nyawa S (2019). High-dimensional multivariate realized volatility estimation. Journal of Econometrics 212, 116–136.
  4. Cai T, Liu W and Xia Y (2013). Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association 108, 265–277.
  5. Cai TT (2017). Global testing and large-scale multiple testing for high-dimensional covariance structures. Annual Review of Statistics and Its Application 4, 423–446.
  6. Chang J, Zhou W, Zhou W-X and Wang L (2017). Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering. Biometrics 73, 31–41.
  7. Chen SX and Qin YL (2010a). A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics 38, 808–835.
  8. Chen SX, Zhang LX and Zhong PS (2010b). Tests for high-dimensional covariance matrices. Journal of the American Statistical Association 105, 810–819.
  9. Chen F, Meintanis SG and Zhu L (2019). On some characterizations and multidimensional criteria for testing homogeneity, symmetry and independence. Journal of Multivariate Analysis 173, 125–144.
  10. Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart D and Xia J (2018). MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Research 46, 486–494.
  11. Gentleman R, Carey V, Huber W and Hahne F (2011). genefilter: Methods for filtering genes from microarray experiments. R package version 1.
  12. Haibe-Kains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J and Sotiriou C (2012). A three-gene model to robustly identify breast cancer molecular subtypes. Journal of the National Cancer Institute 104, 311–325.
  13. He J and Chen SX (2018). High-dimensional two-sample covariance matrix testing via super-diagonals. Statistica Sinica 28, 2671–2696.
  14. Han Y and Wu WB (2020). Test for high dimensional covariance matrices. The Annals of Statistics 48, 3565–3588.
  15. He Y, Xu G, Wu C and Pan W (2021). Asymptotically independent U-statistics in high-dimensional testing. The Annals of Statistics 49, 154–181.
  16. Igolkina A, Armoskus C, Newman J, Evgrafov O, McIntyre L, Nuzhdin S and Samsonova M (2018). Analysis of gene expression variance in schizophrenia using structural equation modeling. Frontiers in Molecular Neuroscience 11, 192.
  17. Jiang T and Qi Y (2015). Likelihood ratio tests for high-dimensional normal distributions. Scandinavian Journal of Statistics 42, 988–1009.
  18. Jiang T and Yang F (2013). Central limit theorems for classical likelihood ratio tests for high-dimensional normal distributions. The Annals of Statistics 41, 2029–2074.
  19. Kim Y and Lesser V (2008). Finding minimum data requirements using pseudo-independence. In 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology 2, 57–64.
  20. Kim I, Balakrishnan S and Wasserman L (2020). Robust multivariate nonparametric tests via projection averaging. The Annals of Statistics 48, 3417–3441.
  21. Le Bihan D, Mangin J-F, Poupon C, Clark CA, Pappata S, Molko N and Chabriat H (2001). Diffusion tensor imaging: concepts and applications. Journal of Magnetic Resonance Imaging 13, 534–546.
  22. Li J and Chen SX (2012). Two sample tests for high-dimensional covariance matrices. The Annals of Statistics 40, 908–940.
  23. Pan W, Tian Y, Wang X and Zhang H (2018). Ball divergence: nonparametric two sample test. The Annals of Statistics 46, 1109–1137.
  24. Qiu T, Xu W and Zhu L (2021). Two-sample test in high dimensions through random selection. Computational Statistics & Data Analysis 160, 107218.
  25. Roberts AG, Catchpoole DR and Kennedy PJ (2018). Variance-based feature selection for classification of cancer subtypes using gene expression data. In 2018 International Joint Conference on Neural Networks (IJCNN). IEEE.
  26. Schiffman C, Petrick L, Perttula K, Yano Y, Carlsson H, Whitehead T, Metayer C, Hayes J, Rappaport S and Dudoit S (2019). Filtering procedures for untargeted LC-MS metabolomics data. BMC Bioinformatics 20, 334.
  27. Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, Lehr H-A, Hengstler JG, Kölbl H and Gehrmann M (2008). The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research 68, 5405–5413.
  28. Schott JR (2007). A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Computational Statistics & Data Analysis 51, 6535–6542.
  29. Sherafatian M (2018). Tree-based machine learning algorithms identified minimal set of miRNA biomarkers for breast cancer diagnosis and molecular subtyping. Gene 677, 111–118.
  30. Srivastava MS and Yanagihara H (2010). Testing the equality of several covariance matrices with fewer observations than the dimension. Journal of Multivariate Analysis 101, 1319–1329.
  31. Teschendorff AE and Caldas C (2008). A robust classifier of high predictive value to identify good prognosis patients in ER-negative breast cancer. Breast Cancer Research and Treatment 4, 1–11.
  32. Wu T-L and Li P (2015). Tests for high-dimensional covariance matrices using random matrix projection. arXiv preprint arXiv:1511.01611.
  33. Yang Q and Pan G (2017). Weighted statistic in detecting faint and sparse alternatives for high-dimensional covariance matrices. Journal of the American Statistical Association 112, 188–200.
  34. Yu X, Li D and Xue L (2020). Fisher’s combined probability test for high-dimensional covariance matrices. arXiv preprint arXiv:2006.00426.
  35. Zheng S, Lin R, Guo J and Yin G (2020). Testing homogeneity of high-dimensional covariance matrices. Statistica Sinica 30, 35–53.
  36. Zhu L, Lei J, Devlin B and Roeder K (2017). Testing high-dimensional covariance matrices, with application to detecting schizophrenia risk genes. The Annals of Applied Statistics 11, 1810–1831.
  37. Zhu L, Xu K, Li R and Zhong W (2017). Projection correlation between two random vectors. Biometrika 104, 829–843.
  38. Zheng S, Cheng G, Guo J and Zhu H (2019). Test for high dimensional correlation matrices. The Annals of Statistics 47, 2887–2921.
