Biostatistics (Oxford, England). 2016 Apr 4;17(4):677–691. doi: 10.1093/biostatistics/kxw013

Hypothesis testing for differentially correlated features

Elisa Sheng 1, Daniela Witten 2, Xiao-Hua Zhou 3,*
PMCID: PMC5031944  PMID: 27044327

Abstract

In a multivariate setting, we consider the task of identifying features whose correlations with the other features differ across conditions. Such correlation shifts may occur independently of mean shifts, or differences in the means of the individual features across conditions. Previous approaches for detecting correlation shifts consider features simultaneously, by computing a correlation-based test statistic for each feature. However, since correlations involve two features, such approaches do not lend themselves to identifying which feature is the culprit. In this article, we instead consider a serial testing approach, by comparing columns of the sample correlation matrix across two conditions, and removing one feature at a time. Our method provides a novel perspective and favorable empirical results compared with competing approaches.

Keywords: Correlation matrix, Differential correlation, Feature selection, Hypothesis testing, Wald test

1. Introduction

In many modern research settings, it is of interest to identify features in a multivariate dataset that differ across conditions. For instance, in functional genetics, researchers look for differences in gene expression profiles between normal and diseased tissues. Using functional MRI data, scientists search for voxels that differ before and after a stimulus.

To study differences across conditions, one typically considers the mean of each feature in each condition. Suppose Inline graphic and Inline graphic are multivariate random vectors of a single set of Inline graphic features under two different conditions. Let Inline graphic and Inline graphic denote the means of Inline graphic and Inline graphic, respectively, and let Inline graphic denote the mean of the Inline graphicth feature of Inline graphic. To detect mean differences, one can consider null hypotheses of the form

graphic file with name M11.gif

However, mean differences may not capture the underlying data characteristics that differ across conditions. For example, in the context of gene expression data, co-expression, or expression of two or more genes together, may change across conditions. To address dependence between two or more genes, methods have been proposed to incorporate correlation structure to test pairs or small sets of genes for mean differences (Lai and others, 2004; Shedden and Taylor, 2004; Xiao and others, 2004; Dettling and others, 2005; Ho and others, 2007). However, differences in correlations may also occur independently of mean differences.

One way to compare correlations across conditions is to test for equality of the covariance or correlation matrices. Suppose Inline graphic and Inline graphic are the population correlation matrices of Inline graphic and Inline graphic, respectively. The null hypothesis of interest is then

graphic file with name M16.gif

In the low-dimensional setting (i.e. Inline graphic, where Inline graphic are the sample sizes of the two conditions), classical approaches for testing this null hypothesis are based on the likelihood ratio (Cole, 1968) or Wald-type test statistics (Kullback, 1967; Jennrich, 1970; Layard, 1972; Larntz and Perlman, 1985; Modarres and Jernigan, 1992; Satorra and Neudecker, 1997). Testing equality of covariance matrices is also a well-studied problem (Wilks, 1932; Bartlett, 1937; Muirhead, 1982; Seber, 1984). Methods have also been proposed for the high-dimensional setting (Inline graphic) (Ledoit and Wolf, 2002; Schott, 2007; Li and Qin, 2014).

In either the low- or high-dimensional setting, the aforementioned references are concerned with testing the global null hypothesis, Inline graphic. In other words, the goal is to determine whether the covariance or correlation matrix of all Inline graphic features differs across conditions. In the present article, we are instead interested in identifying individual features whose correlations with other features differ across two conditions.

Several recent papers have proposed approaches for identifying features whose correlations differ across conditions (e.g. Gill and others, 2010; Amar and others, 2013; Bockmayr and others, 2013). However, these methods do not employ a formal hypothesis testing framework. In the multivariate normal setting, some authors have considered identifying pairs of features whose partial correlations differ across conditions (Guo and others, 2011; Mohan and others, 2012; Danaher and others, 2014); however, these methods focus on pairwise partial correlation differences across conditions, whereas we are interested in comparing relationships of each feature with all other features across conditions.

More closely related to this article are recent proposals by Hu and others (2010) and Cai and others (2013), who consider tests for equality of the columns of Inline graphic and Inline graphic in order to identify correlation differences in individual features. However, neither proposal takes advantage of the particular correlation structure that arises when a small number of features are disrupted across conditions, leading to a situation in which the correlations of those features with many other features differ across conditions. This would be the case in a comparison of gene expression profiles between normal and disease conditions if only one gene were disrupted in some disease, where that disruption affects that gene's correlation with many other genes. In this article, we are interested in characterizing the notion of a disrupted gene in terms of correlation matrices, and developing a corresponding hypothesis testing framework to identify features of interest.

As a motivating example, suppose Inline graphic and Inline graphic represent gene expression profiles under two distinct conditions. Let Inline graphic and Inline graphic denote, respectively, the Inline graphicth columns of Inline graphic and Inline graphic, that is, the correlations between the Inline graphicth gene and the Inline graphic other genes, in the two conditions. Suppose that the first gene is disrupted in one of the two conditions, but that the two conditions are otherwise identical. This would manifest as Inline graphic, and more specifically, Inline graphic. Note that the submatrices of Inline graphic and Inline graphic with the first row and column removed are identical—that is, all of the differences between Inline graphic and Inline graphic can be attributed to the first gene. Figure 1 illustrates this motivating example.

Fig. 1.

In the motivating example of Section 1, the two correlation matrices differ within a single row/column. Our goal is to identify only the first feature as different between the two conditions.

Our goal is to develop a testing procedure that will determine that in the set-up of Figure 1, all differences between the correlation matrices across the two conditions can be attributed to the first feature. To describe the problem more clearly, we define the terms ‘dysregulated’ and ‘minimally dysregulated’—language borrowed from the study of gene expression data—as follows.

Definition 1. —

The Inline graphicth feature is dysregulated across conditions Inline graphic and Inline graphic if Inline graphic.

Let Inline graphic denote a subset of size Inline graphic and Inline graphic denote the subvector of the random Inline graphic-vector Inline graphic corresponding to all but the features in Inline graphic.

Definition 2. —

Consider a set of Inline graphic features, Inline graphic, that satisfies the following conditions:

  1. Inline graphic;

  2. for all strictly smaller subsets Inline graphic, Inline graphic.

Once we remove all of the features in Inline graphic from Inline graphic and Inline graphic, there is no difference in the correlation matrices among the remaining features across conditions—that is, Inline graphic. Using this terminology, we can now see that our goal is to identify Inline graphic (e.g. Inline graphic in Figure 1), as opposed to the set of dysregulated features (e.g. Inline graphic in Figure 1).
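The distinction between the two definitions can be checked directly on population matrices. The sketch below is a minimal brute-force illustration, not the testing procedure of this article: the function names, the 0-based feature indices, and the example values 0.3 and 0.1 are assumptions made here. It lists the dysregulated features (those whose column of correlations differs) and searches subsets in order of increasing size for a set whose removal makes the remaining submatrices equal.

```python
from itertools import combinations

import numpy as np


def dysregulated(r1, r2, tol=1e-12):
    """Features whose column of correlations differs between r1 and r2."""
    p = r1.shape[0]
    return [j for j in range(p) if np.max(np.abs(r1[:, j] - r2[:, j])) > tol]


def minimally_dysregulated(r1, r2, tol=1e-12):
    """Smallest set of features whose removal makes the submatrices equal.

    Subsets are tried in order of increasing size, so the first hit has
    minimal cardinality. The search is exponential in p; illustration only.
    """
    p = r1.shape[0]
    for k in range(p):
        for removed in combinations(range(p), k):
            keep = [j for j in range(p) if j not in removed]
            diff = r1[np.ix_(keep, keep)] - r2[np.ix_(keep, keep)]
            if np.max(np.abs(diff)) <= tol:
                return set(removed)
    return set(range(p))


# Hypothetical example: feature 0 changes its correlation with features 1 and 2.
p = 5
r1 = np.full((p, p), 0.3)
np.fill_diagonal(r1, 1.0)
r2 = r1.copy()
r2[0, 1:3] = 0.1
r2[1:3, 0] = 0.1

print(dysregulated(r1, r2))            # features 0, 1 and 2 have a changed column
print(minimally_dysregulated(r1, r2))  # only feature 0 needs to be removed
```

Note that a minimal set need not be unique: if a single off-diagonal entry differs, removing either of the two features involved equalizes the submatrices, and the search above simply returns the first such set it encounters.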

In Section 2, we develop Wald-type test statistics for testing null hypotheses of the form

graphic file with name M61.gif (1.1)

for Inline graphic. We then show that simultaneous tests of (1.1) are not appropriate, as they lead to rejection for more than just the first feature in Figure 1 (i.e. they lead to a poor estimate of Inline graphic). This motivates the need for a different approach. In Section 3, we develop a method to test the series of null hypotheses of the form

graphic file with name M64.gif (1.2)

for Inline graphic, where Inline graphic denotes the set of all Inline graphic-combinations of Inline graphic. This series of hypothesis tests leads naturally to an estimator of the set Inline graphic, which we show to have excellent empirical performance. In Section 4, we apply our proposal to two gene expression data sets. The discussion is in Section 5.

2. A simultaneous approach

Let Inline graphic and Inline graphic denote Inline graphic-dimensional random vectors corresponding to a single set of Inline graphic features under two distinct conditions. Let Inline graphic and Inline graphic denote the population covariance matrices of Inline graphic and Inline graphic, and let Inline graphic and Inline graphic denote their population correlation matrices, respectively. Without loss of generality, assume that the population mean vectors of Inline graphic and Inline graphic equal zero. Let Inline graphic and Inline graphic be independent and identically distributed samples of Inline graphic and Inline graphic. Denote the empirical covariance matrices by

graphic file with name M86.gif

where Inline graphic and Inline graphic. Denote the empirical correlation matrices by

graphic file with name M89.gif

Let the operator Inline graphic denote the “vectorization” of a Inline graphic matrix Inline graphic, defined by

graphic file with name M93.gif

In other words, the vectorization of Inline graphic stacks the columns of Inline graphic into a Inline graphic column vector. Since our goal is to develop test statistics for each of the Inline graphic features, we are interested in examining columns of the correlation matrices. Since the diagonal of a correlation matrix is always one, we ignore it, and so henceforth let Inline graphic and Inline graphic denote the Inline graphicth column of Inline graphic and Inline graphic with the Inline graphicth element removed.
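As a concrete reading of this notation, the following minimal sketch computes a sample correlation matrix, the column-stacking vectorization, and a column of the correlation matrix with its diagonal element removed. The function names are chosen here for illustration, and `np.corrcoef` centers the data at the sample mean, whereas the text assumes zero population means without loss of generality; the exact scaling of the empirical covariance in the article is not assumed.

```python
import numpy as np


def sample_correlation(x):
    """Sample correlation matrix of an n x p data matrix (rows = observations)."""
    return np.corrcoef(x, rowvar=False)


def vec(a):
    """Stack the columns of a matrix into one long vector."""
    return a.reshape(-1, order="F")


def column_without_diagonal(r, j):
    """Column j of a correlation matrix with the (j, j) entry dropped."""
    return np.delete(r[:, j], j)


a = np.array([[1, 2], [3, 4]])
print(vec(a))  # columns stacked: first column, then second column
```

For a p x p correlation matrix, `column_without_diagonal` returns a vector of length p - 1, which is the object compared across conditions in the tests that follow.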

2.1. A test of Inline graphic

In order to develop a test for Inline graphic, we extend a classical Wald-type approach for testing Inline graphic, which relies on the following result (Neudecker and Wesselman, 1990).

Lemma 1. —

Suppose that Inline graphic has finite fourth moments, that is, Inline graphic for all Inline graphic. Then

graphic file with name M110.gif

where Inline graphic, Inline graphic denotes the gradient with respect to Inline graphic, and Inline graphic. Furthermore, if Inline graphic is multivariate normal,

graphic file with name M116.gif (2.1)

for Inline graphic, where Inline graphic denotes the Inline graphicth element of Inline graphic.

Using Lemma 1, it can be shown that

graphic file with name M121.gif

and

graphic file with name M122.gif

where Inline graphic and Inline graphic are the asymptotic covariance matrices corresponding to Inline graphic and Inline graphic, respectively. Due to the independence of the samples from each condition, it follows that

graphic file with name M127.gif

where Inline graphic, and Inline graphic and Inline graphic are consistent estimators of Inline graphic and Inline graphic, respectively. This result motivates our proposed test statistic for Inline graphic,

graphic file with name M134.gif (2.2)

Proposition 1. —

Suppose that Inline graphic and Inline graphic have finite fourth moments, Inline graphic, Inline graphic and Inline graphic as Inline graphic for some finite constant Inline graphic. Then

graphic file with name M142.gif

The proof of Proposition 1 is given in the supplementary material available at Biostatistics online.

Proposition 1 implies that the type I error rate of the following test of Inline graphic,

graphic file with name M144.gif (2.3)

is controlled at level Inline graphic asymptotically. In a simulation study in the supplementary materials (available at Biostatistics online), we investigate the finite-sample type I error rate control of Inline graphic in (2.3).

Given (2.3), one can simultaneously test Inline graphic, using a multiplicity correction to address the problem of multiple comparisons. However, as we will see in the next section, this approach does not achieve the goal of our paper, as described in Section 1: to identify a minimally dysregulated set of features.
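A Wald-type statistic of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the exact estimator of this article: it assumes multivariate normal data and plugs the sample correlations into the classical normal-theory formula for the asymptotic covariance of two sample correlations, and it replaces the exact chi-squared quantile with the Wilson-Hilferty approximation so that only NumPy and the standard library are needed. All function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def acov(r, i, j, k, l):
    """Normal-theory asymptotic covariance of sqrt(n)*r[i,j] and sqrt(n)*r[k,l]."""
    return (0.5 * r[i, j] * r[k, l]
            * (r[i, k]**2 + r[i, l]**2 + r[j, k]**2 + r[j, l]**2)
            + r[i, k] * r[j, l] + r[i, l] * r[j, k]
            - r[i, j] * (r[i, k] * r[i, l] + r[j, k] * r[j, l])
            - r[k, l] * (r[i, k] * r[j, k] + r[i, l] * r[j, l]))


def column_acov(r, j):
    """Asymptotic covariance matrix of column j of r, diagonal entry dropped."""
    others = [k for k in range(r.shape[0]) if k != j]
    return np.array([[acov(r, j, k, j, l) for l in others] for k in others])


def wald_statistic(x1, x2, j):
    """Wald-type statistic comparing column j of the two sample correlation matrices."""
    r1 = np.corrcoef(x1, rowvar=False)
    r2 = np.corrcoef(x2, rowvar=False)
    diff = np.delete(r1[:, j] - r2[:, j], j)
    # Independent samples: the covariances of the two columns add.
    v = column_acov(r1, j) / len(x1) + column_acov(r2, j) / len(x2)
    return float(diff @ np.linalg.solve(v, diff))


def chi2_quantile_wh(q, df):
    """Wilson-Hilferty approximation to the q-quantile of chi-squared(df)."""
    z = NormalDist().inv_cdf(q)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df)))**3
```

With p features, the null hypothesis for feature j would be rejected at level alpha when `wald_statistic(x1, x2, j)` exceeds `chi2_quantile_wh(1 - alpha, p - 1)`; an exact quantile from a statistics library could be substituted for the approximation.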

2.2. Motivation for a different approach

We now argue that simultaneously testing Inline graphic for Inline graphic can be problematic, motivating the need for a different approach. Consider a simplified version of the motivating example in Figure 1, where the Inline graphic matrices Inline graphic and Inline graphic take the respective forms

graphic file with name kxw013UM1.jpg (2.4)
graphic file with name kxw013UM2.jpg (2.5)

where Inline graphic, Inline graphic, and Inline graphic. Note that Inline graphic and Inline graphic are identical except in the first half of the first row/column.

We considered Inline graphic. For Inline graphic and Inline graphic, we sampled from Inline graphic and Inline graphic, and computed Inline graphic in (2.2) for Inline graphic. This was repeated 10 000 times. The distributions of Inline graphic, Inline graphic and Inline graphic are shown in Figure 2. (Test statistics Inline graphic for other values of Inline graphic are not shown, because they are identical in distribution to either Inline graphic or Inline graphic.) Other values of Inline graphic, Inline graphic, and Inline graphic yielded similar results.

Fig. 2.

The histograms illustrate the distribution of Inline graphic for Inline graphic in the simulation study from Section 2.2. The dashed lines represent a Inline graphic distribution, the asymptotic distribution of Inline graphic under Inline graphic according to Proposition 1. The gray shaded regions indicate the proportion of test statistics that would result in rejecting the null hypothesis Inline graphic under (2.2). (a) Inline graphic, (b) Inline graphic.

The left-hand panels of Figure 2 indicate that Inline graphic almost always rejects Inline graphic, as expected. The right-hand panels of Figure 2 indicate that Inline graphic has approximately a Inline graphic distribution as suggested by Proposition 1, and hence that Inline graphic has well-controlled Type I error rate.

However, the center panels of Figure 2 indicate that Inline graphic does not have a Inline graphic distribution. This becomes particularly pronounced as the sample size increases. Indeed, Inline graphic does not hold, since the Inline graphic elements of Inline graphic and Inline graphic are unequal in (2.4) and (2.5). Consequently, in this simulation study, Inline graphic will tend to be rejected when the sample size is sufficiently large, even though the differences between Inline graphic and Inline graphic can be fully attributed to differences in the first feature's correlations across the two conditions.
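The failure of the null hypothesis for the second feature can be made explicit. The display below uses symbols chosen here for illustration: $a \neq b$ denote the $(1,2)$ entries of the two population correlation matrices, $\rho^{(1)}_2$ and $\rho^{(2)}_2$ denote their second columns with the diagonal entry removed, $\Sigma^{(1)}_2$ and $\Sigma^{(2)}_2$ denote the asymptotic covariance matrices of the corresponding sample columns, $e_1$ is the first standard basis vector, and both conditions are assumed to have the same sample size $n$.

```latex
\rho^{(1)}_2 - \rho^{(2)}_2 = (a - b)\, e_1 \neq 0,
\qquad
\delta_n = n\,(a - b)^2\, e_1^{\top}
           \bigl(\Sigma^{(1)}_2 + \Sigma^{(2)}_2\bigr)^{-1} e_1 .
```

The two columns differ in a single entry, which is enough to violate the null hypothesis for the second feature, and the population analogue $\delta_n$ of the Wald-type statistic grows linearly in $n$. This is consistent with the rejection rate increasing with the sample size.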

Our goal was to identify only the first feature as differing across the two conditions. This simulation study reveals that simultaneous tests of Inline graphic are flawed, essentially because each correlation involves a pair of features, and thus any difference in correlation across two conditions necessarily implicates at least two features. To accomplish our goal, we must develop a set of null hypotheses such that, in this example, only a single null hypothesis corresponding to the first feature is violated. This motivates our proposed approach: a serial testing procedure.

3. A serial approach

We just saw that tests of Inline graphic fail to identify only the first feature as different across conditions in the setting of Figure 1. Therefore, they are not appropriate for the goal laid out in Section 1, which is to identify Inline graphic, the minimal set of features that account for the differences between Inline graphic and Inline graphic.

We now consider a new hypothesis testing framework. Instead of simultaneously testing the Inline graphic null hypotheses, Inline graphic in (1.1) for Inline graphic, we propose to test the series of null hypotheses Inline graphic in (1.2) for Inline graphic. The null hypotheses Inline graphic are closely related to the cardinality of Inline graphic. If Inline graphic do not hold and Inline graphic does hold, then Inline graphic. In Section 3.1, we propose a test of Inline graphic, which we call Inline graphic. Then in Section 3.2, we apply serial tests of Inline graphic in order to estimate Inline graphic. A comparison to competing approaches is in Section 3.3.

3.1. Inline graphic: A test of Inline graphic

Suppose that Inline graphic in (1.2) holds. Then there exists a set Inline graphic of size Inline graphic such that Inline graphic. Hence the test statistics Inline graphic for Inline graphic are asymptotically Inline graphic. Consider the test

graphic file with name M225.gif (3.1)

Proposition 2. —

Suppose the conditions of Proposition 1 and Inline graphic hold. Then

graphic file with name M227.gif

The proof of Proposition 2 is given in the supplementary material (available at Biostatistics online). Proposition 2 implies that, under Inline graphic, the test Inline graphic controls type I error at a level Inline graphic. A simulation study of the performance of Inline graphic under Inline graphic is presented in the supplementary materials (available at Biostatistics online).

3.2. An estimator of Inline graphic

Now suppose that we perform serial tests of Inline graphic for Inline graphic. Suppose we reject Inline graphic, but do not reject Inline graphic. Then

graphic file with name M238.gif (3.2)

is a natural estimator of Inline graphic from Definition 2: our failure to reject Inline graphic supports condition 1 of Definition 2, and our rejection of Inline graphic supports condition 2 of Definition 2.

We will now show that Inline graphic with high probability.

Proposition 3. —

Suppose the conditions of Proposition 1 hold. Then

graphic file with name M243.gif

The proof of Proposition 3 is given in the supplementary material (available at Biostatistics online).
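The serial procedure of Sections 3.1 and 3.2 can be sketched generically. The code below is one plausible reading, not a transcription of (3.1) and (3.2): it assumes that the null hypothesis for k removed features is rejected when, for every subset of size k, some remaining feature still has a statistic above the critical value, and that the estimate is the subset attaining the smallest such maximum at the first k that is not rejected. The callables `stat` and `crit` are hypothetical placeholders for the Wald-type statistic after feature removal and for the (multiplicity-adjusted) critical value.

```python
from itertools import combinations


def serial_estimate(p, stat, crit):
    """Serially test k = 0, 1, ... removed features and return the estimated set.

    stat(removed, j): statistic for feature j after deleting the features in
                      the tuple `removed` (hypothetical callable).
    crit(k):          critical value used when k features are removed.
    """
    for k in range(p - 1):  # keep at least two features so a correlation exists
        best_set, best_val = None, float("inf")
        for removed in combinations(range(p), k):
            worst = max(stat(removed, j) for j in range(p) if j not in removed)
            if worst < best_val:
                best_set, best_val = removed, worst
        if best_val <= crit(k):  # fail to reject: stop here
            return set(best_set)
    return set(range(p))  # every hypothesis rejected


# Toy statistic: everything looks different until feature 0 is removed.
toy_stat = lambda removed, j: 1.0 if 0 in removed else 50.0
print(serial_estimate(4, toy_stat, crit=lambda k: 10.0))
```

Passing the statistic and the critical value as callables keeps the search logic separate from the choice of test statistic, so the same loop applies to the correlation and covariance versions of the proposal.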

3.3. Comparison to competing methods

We compared the accuracy of Inline graphic (3.2) as an estimator of Inline graphic (Definition 2) to the methods proposed in Cai and others (2013) and Hu and others (2010), both of which use a simultaneous (rather than a serial) approach to identify features whose covariances with other features differ across conditions. Since the methods of Cai and others (2013) and Hu and others (2010) are based on covariances rather than correlations, we employ a covariance version of our proposal (see the supplementary material, available at Biostatistics online, for details).

The method of Cai and others (2013) simultaneously tests the null hypotheses Inline graphic for Inline graphic. The test statistic for Inline graphic is the squared, standardized maximum value of Inline graphic. We conclude that a given feature is in the estimate of Inline graphic by Cai and others (2013) if its test statistic exceeds some threshold.

The method of Hu and others (2010) tests for equality of the joint distribution of “covariance distances” for a given feature across the two conditions (ignoring a fixed number of covariance distances, specified by the “trim number”). We conclude that a given feature is in the estimate of Inline graphic by Hu and others (2010) if the resampling-based p-value falls below some threshold. We perform the method of Hu and others (2010) with trim numbers 0 and Inline graphic.

We simulated data where Inline graphic, Inline graphic, and Inline graphic. Each sample was drawn independently from a Inline graphic or a Inline graphic distribution. We fixed Inline graphic to be

graphic file with name M259.gif

and considered one of four possible scenarios for Inline graphic, illustrated in Figure 3.

Fig. 3.

Four scenarios for Inline graphic considered in the simulation study in Section 3.3. Each subfigure represents a Inline graphic matrix. Black entries correspond to values of 0.1, for which Inline graphic differs from Inline graphic. White off-diagonal entries correspond to values of 0.3, for which elements of Inline graphic are equal to Inline graphic. (a) Scenario A, (b) Scenario B, (c) Scenario C, (d) Scenario D.

Our proposed approach, the method of Cai and others (2013) and that of Hu and others (2010) were applied in order to obtain estimates of Inline graphic. For each method, the cardinality of the estimate of Inline graphic was varied (for instance, in our method, this corresponds to the level Inline graphic; in the method of Cai and others (2013), it corresponds to a threshold for the test statistic). The results, averaged over 1000 replications, are shown in Figure 4. The results corroborate Proposition 3, in that our method yields an average value of Inline graphic no greater than 5 for all values of Inline graphic considered.

Fig. 4.

Results from the simulation study in Section 3.3, averaged over 1000 simulated datasets. The Inline graphic-axis displays the cardinality of the estimate of Inline graphic, and the Inline graphic-axis displays the number of true positives, i.e. the number of features in the estimate of Inline graphic that are truly in Inline graphic. For reference, the bounding triangle indicates two methods: the NW boundary is an idealized method that always correctly selects features in Inline graphic, and the SE boundary indicates a method that selects features at random. (a) Scenario A, (b) Scenario B, (c) Scenario C, (d) Scenario D.

While our proposed approach does at least as well as the competitors in all four scenarios displayed in Figure 4, it does particularly well (relative to competitors) in Scenarios A, C, and D. Recall from Figure 3 that in Scenarios A, C, and D, the features in Inline graphic have differential correlations with many other features, including features in Inline graphic. Consequently, the serial approach taken by our proposal is key to identifying the correct set of features in Inline graphic, as described in Section 2.2. In contrast, in Scenario B, the features in Inline graphic have differential correlations only with other features in Inline graphic; consequently, in this scenario, there is no need for a serial approach—a simultaneous approach will perform just as well. In fact, we see that in Scenario B, all approaches have comparable performance.

4. Application to gene expression data

In this section, we examine two gene expression datasets. These datasets are high-dimensional, leading to challenges for our proposal: (1) Propositions 2 and 3 require Inline graphic; and (2) for large Inline graphic, tests of Inline graphic (1.2, 3.1) become computationally intractable unless Inline graphic is quite small. Hence a screening procedure to reduce dimensionality is necessary. We consider three screening approaches in Sections 4.1–4.3, respectively.

4.1. Screening based on scientific knowledge

We first consider a gene expression dataset that consists of 11 861 gene expression measurements from 220 tissue samples taken from patients with one of four subtypes of glioblastoma multiforme (GBM; Verhaak and others, 2010). We compare the two GBM subtypes with the largest sample sizes, Inline graphic (proneural) and Inline graphic (mesenchymal). In order to reduce dimensionality, we restrict attention to 34 genes involved in TCR signaling, as was done in Mohan and others (2014).

At level Inline graphic, we estimate Inline graphic. Interestingly, NFKBIA is a tumor suppressor gene. There is known to be enrichment of single-nucleotide polymorphisms and haplotypes of NFKBIA in Hodgkin's lymphoma, colorectal cancer, melanoma, hepatocellular carcinoma, breast cancer, and multiple myeloma. Furthermore, it has been reported that NFKBIA tends to be deleted in glioblastomas (Bredel and others, 2011).

4.2. Unsupervised screening

We now consider a gene expression dataset that consists of 12 600 gene expression measurements from Inline graphic normal (Inline graphic) and Inline graphic tumor (Inline graphic) prostate gland specimens from patients undergoing radical prostatectomy (Singh and others, 2002). We reduce dimensionality using an unsupervised screening approach: we restrict our analysis to the 20 and the 35 genes with the highest marginal variance. This screening approach has substantial precedent in the gene expression literature, in which it is often assumed that high-variance genes are more scientifically interesting than low-variance genes. Because the screening does not make use of the class labels (normal versus tumor), the resulting test for Inline graphic yields valid statistical inference (Bourgon and others, 2010).
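Variance-based screening of this kind can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that the marginal variance is computed on the pooled observations from both conditions, which is one way to ensure the class labels are not used.

```python
import numpy as np


def top_variance_features(x1, x2, n_keep):
    """Indices of the n_keep features with the highest pooled marginal variance."""
    pooled = np.vstack([x1, x2])            # class labels are ignored on purpose
    variances = pooled.var(axis=0, ddof=1)  # one variance per feature (column)
    return np.argsort(variances)[::-1][:n_keep]


# Tiny hypothetical example with three features.
x1 = np.array([[0.0, 0.0, 0.0], [1.0, 10.0, 5.0]])
x2 = np.array([[0.0, 0.0, 0.0], [1.0, 10.0, 5.0]])
print(top_variance_features(x1, x2, 2))  # feature 1 first, then feature 2
```

The returned indices are ordered from highest to lowest variance, matching the gene ranking used in the figure for this section.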

Figure 5 displays Inline graphic for the 35 highest-variance genes. Estimates of Inline graphic for Inline graphic and Inline graphic are reported in Figure 5. We considered very small Inline graphic in order to avoid estimating Inline graphic to contain all of the features. It is possible that the cancer and normal tissue tend to have substantially different gene expression, but also likely that early microarray studies such as Singh and others (2002) tend to suffer from batch effects.

Fig. 5.

Inline graphic, and the set of genes in Inline graphic, when we restrict analysis to the genes with the highest variance in the prostate cancer dataset. The rows/columns of the matrix are in order of complete linkage clustering, using Euclidean distance. The gene ranking (highest to lowest variance) is indicated on the right-hand side of the matrix. The two tables list the genes in Inline graphic for Inline graphic (left) and Inline graphic (right).

As expected, Inline graphic depends on the set of genes considered. For example, for Inline graphic, the 13th highest-variance gene is in Inline graphic when Inline graphic; however, this gene is not in Inline graphic when we take Inline graphic. By design, as Inline graphic decreases, Inline graphic decreases. As an example of the interpretation of Inline graphic, consider the result when Inline graphic and Inline graphic. We rejected Inline graphic and Inline graphic at level Inline graphic, but failed to reject Inline graphic at level Inline graphic (a conservative estimate of the p-value corresponding to this hypothesis is Inline graphic). Consequently, Inline graphic, and Inline graphic. This suggests that there are (very conservatively) at least two minimally dysregulated genes among the top 20 highest-variance genes.

Calculating Inline graphic is computationally intensive. Consider the case of Inline graphic and Inline graphic, displayed in the last row of the left-hand table in Figure 5. In this example, Inline graphic. Therefore, in order to obtain these results, we performed tests of Inline graphic for Inline graphic. The test of Inline graphic requires computing the test statistic Inline graphic for Inline graphic and Inline graphic. Hence, in order to test Inline graphic for all Inline graphic, we computed Inline graphic test statistics. Had we chosen a much larger value of Inline graphic, then Inline graphic might have contained many more genes, which would have substantially increased the necessary computations. Fortunately, computation of these test statistics can be easily parallelized.
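The growth in the number of test statistics can be made concrete. The sketch below counts one statistic for each pair of a removed subset of size k and a remaining feature, summed over k from 0 up to the largest subset size tested. The function name and the example values (20 features, subsets of size up to 2) are chosen here for illustration and are not the settings reported in Figure 5.

```python
from math import comb


def n_test_statistics(p, k_max):
    """Number of statistics needed to test k = 0, ..., k_max removed features.

    For each k there are comb(p, k) subsets to remove, and each leaves
    p - k features, each with its own statistic.
    """
    return sum(comb(p, k) * (p - k) for k in range(k_max + 1))


print(n_test_statistics(20, 2))  # 20 + 380 + 3420 = 3820
```

Because each statistic depends only on its own subset and feature, the terms in this sum can be evaluated independently, which is what makes the computation easy to parallelize.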

4.3. Supervised screening

We once again consider the prostate cancer gene expression dataset of Section 4.2; this time, however, we take a supervised screening approach. In order to implement this approach, we split the observations into a training set and a test set of equal size. We perform screening on the training set in order to obtain a small set of features, and estimate Inline graphic on the test set using that small feature set. This training/test set approach is needed in order for our estimate of Inline graphic to retain the asymptotic properties established in Section 3.

To perform supervised screening, we compute Inline graphic for all Inline graphic in the training set. We then restrict attention to the 30 genes for which this quantity is largest; these genes are displayed in Figure 6. Next, we estimate Inline graphic on the test set, using only these 30 genes. For Inline graphic, we rejected Inline graphic for Inline graphic, and failed to reject Inline graphic; therefore, Inline graphic, and Inline graphic. This suggests that there are five minimally dysregulated genes among the 30 genes considered.
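The training/test split and the screening step can be sketched as follows. This is an illustration under assumptions made here: the split is a random half-and-half split within each condition, the screening score is the Euclidean norm of the difference between the two sample correlation columns on the training halves, and the function name and `seed` parameter are hypothetical.

```python
import numpy as np


def supervised_screen(x1, x2, n_keep, seed=0):
    """Screen features on training halves; return test halves restricted to them."""
    rng = np.random.default_rng(seed)

    def split(x):
        idx = rng.permutation(len(x))
        half = len(x) // 2
        return x[idx[:half]], x[idx[half:]]

    train1, test1 = split(x1)
    train2, test2 = split(x2)
    diff = np.corrcoef(train1, rowvar=False) - np.corrcoef(train2, rowvar=False)
    score = np.linalg.norm(diff, axis=0)  # diagonal of diff is zero, so it drops out
    keep = np.argsort(score)[::-1][:n_keep]
    return keep, test1[:, keep], test2[:, keep]
```

Only the returned test halves would then be passed to the serial testing procedure, so that the observations used to choose the features are not reused for inference.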

Fig. 6.

Inline graphic for the 30 genes with the largest sample correlation difference across conditions in the prostate cancer dataset. The rows/columns of the matrix are in order of complete linkage clustering, using Euclidean distance. The gene ranking (highest to lowest Inline graphic-norm of the difference in sample correlation vectors) is indicated on the right-hand side of the matrix.

It is not surprising that, for a given value of Inline graphic and Inline graphic, the supervised screening approach results in a higher value of Inline graphic than with the unsupervised screening approach: this makes sense because the supervised screening approach selects genes that are dysregulated in the training set. Note that the actual genes in Inline graphic cannot be directly compared across the supervised and unsupervised screening approaches, as the set of genes used in each approach is largely non-overlapping.

5. Discussion

In this paper, we consider the task of identifying features whose correlations differ across conditions, as opposed to identifying features whose means differ across conditions. Correlation differences may occur independently of mean differences, and selection methods based on correlation differences have been proposed in the statistical literature (Hu and others, 2010; Cai and others, 2013). However, the previously proposed methods test each feature simultaneously, which can be problematic in a scenario where just a small number of features have differential correlations with many other features across the two conditions. Instead, we propose a serial testing approach, which overcomes the problems associated with simultaneous testing.

In this article, we propose the estimator Inline graphic of Inline graphic. Our proposal builds upon classical Wald-type tests of Inline graphic. We control the asymptotic type I error rate, and demonstrate desirable performance in a variety of simulation settings. Specifically, when non-zero values of Inline graphic are concentrated in certain rows/columns, our approach outperforms proposals that are based on simultaneous hypothesis tests (Hu and others, 2010; Cai and others, 2013).

We restrict attention to the low-dimensional setting, in which the numbers of observations in each group, Inline graphic and Inline graphic, exceed the number of features, Inline graphic, for two reasons:

  1. The asymptotic results on type I error control require Inline graphic. In order to move into the high-dimensional setting, we would need to consider an alternative to Inline graphic (2.2) for use in defining Inline graphic and Inline graphic. For example, the proposal of Cai and others (2013) could possibly be extended.

  2. For large Inline graphic, tests of Inline graphic in (1.2) become computationally intractable as Inline graphic increases. We note that if Inline graphic is small, then typically computations are not a concern, as we must only test Inline graphic for Inline graphic increasing until we fail to reject some null hypothesis. Furthermore, for a given value of Inline graphic, the computations involved in testing Inline graphic can be easily parallelized. Future work could involve developing an alternative to considering all Inline graphic combinations in order to test Inline graphic. For instance, to test Inline graphic, we could consider the Inline graphic combinations that result from serially removing the feature Inline graphic that corresponds to the largest Inline graphic.
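The greedy alternative mentioned in the second item can be sketched as follows. It is only an illustration of the idea of serially removing the feature with the largest statistic: the function name is hypothetical, `stat` is a placeholder callable for the statistic of feature j after deleting the features already removed, and no claim is made that this shortcut retains the error control of the exhaustive search.

```python
def greedy_removal_path(p, stat, k_max):
    """Nested candidate sets obtained by repeatedly removing the worst feature.

    stat(removed, j): statistic for feature j after deleting the features in
                      the tuple `removed` (hypothetical callable).
    Returns the list [(), (j1,), (j1, j2), ...] of length k_max + 1.
    """
    removed = []
    path = [tuple(removed)]
    for _ in range(k_max):
        remaining = [j for j in range(p) if j not in removed]
        worst = max(remaining, key=lambda j: stat(tuple(removed), j))
        removed.append(worst)
        path.append(tuple(removed))
    return path


# Toy statistic that decreases with the feature index.
print(greedy_removal_path(4, lambda removed, j: float(10 - j), k_max=2))
```

Each step evaluates at most p statistics, so the cost grows linearly in the number of removed features rather than combinatorially.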

In Section 4, we show how to apply our proposal to high-dimensional data by screening to reduce the number of features.
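As a rough illustration of screening, the sketch below keeps the features with the largest overall sample variance, so that the retained number of features falls below both group sample sizes. The variance criterion and the function name `screen_by_variance` are assumptions made for this example; the screening rule actually used in Section 4 is not restated here.

```python
import numpy as np


def screen_by_variance(x1: np.ndarray, x2: np.ndarray, m: int) -> np.ndarray:
    """Return the sorted indices of the m features with the largest
    overall sample variance, ignoring condition labels.

    Illustrative assumption: a simple variance filter, not necessarily
    the screening rule used in the paper.
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    if m >= min(n1, n2):
        raise ValueError("m must be smaller than both group sample sizes")
    stacked = np.vstack([x1, x2])
    variances = stacked.var(axis=0, ddof=1)
    # Indices of the m largest variances, returned in ascending index order.
    top = np.argsort(variances)[::-1][:m]
    return np.sort(top)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 30 observations per group, 200 features: high-dimensional before screening.
    x1 = rng.standard_normal((30, 200))
    x2 = rng.standard_normal((30, 200))
    keep = screen_by_variance(x1, x2, m=20)
    print(x1[:, keep].shape, x2[:, keep].shape)
```

After screening, the reduced matrices satisfy the low-dimensional requirement, and the serial tests can be applied to the retained features only.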

We leave the challenging task of studying the sampling variability of Inline graphic to future work. In addition, further research could focus on relaxing the conservativeness of Inline graphic that results from the use of the union inequality in the proof of Proposition 2.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

D.W.'s work was supported in part by NIH DP5OD009145, NSF CAREER DMS-1252624, and a Sloan Research Fellowship. X.-H.Z.'s work was supported in part by the U.S. Department of Veterans Affairs, Veterans Affairs Health Administration Research Career Scientist Award (RCS 05-196).

Acknowledgements

We thank Palma London and Su-In Lee for cleaning, annotating, and sharing with us the GBM gene expression dataset studied in Section 4.1.

References

  1. Amar D., Safer H., Shamir R. (2013). Dissection of regulatory networks that are altered in disease via differential co-expression. PLoS Computational Biology 9(3), e1002955.
  2. Bartlett M. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A 160, 268–282.
  3. Bockmayr M., Klauschen F., Gyorffy B., Denkert C., Budczies J. (2013). New network topology approaches reveal differential correlation patterns in breast cancer. BMC Systems Biology 7, 78.
  4. Bourgon R., Gentleman R., Huber W. (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), 9546–9551.
  5. Bredel M., Scholtens D. M., Yadav A. K., Alvarez A. A., Renfrow J. J., Chandler J. P., Yu I. L. Y., Carro M. S., Dai F., Tagge M. J. and others (2011). NFKBIA deletion in glioblastomas. New England Journal of Medicine 364(7), 627–637.
  6. Cai T., Liu W., Xia Y. (2013). Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association 108(501), 265–277.
  7. Cole N. (1968). The likelihood ratio test of the equality of correlation matrices. Technical Report No. 65.
  8. Danaher P., Wang P., Witten D. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B 76(2), 373–397.
  9. Dettling M., Gabrielson E., Parmigiani G. (2005). Searching for differentially expressed gene combinations. Genome Biology 6, R88.
  10. Gill R., Datta S., Datta S. (2010). A statistical framework for differential network analysis. BMC Bioinformatics 11, 95.
  11. Guo J., Levina E., Michailidis G., Zhu J. (2011). Joint estimation of multiple graphical models. Biometrika 1–15.
  12. Ho Y. Y., Cope L., Dettling M., Parmigiani G. (2007). Statistical methods for identifying differentially expressed gene combinations. Methods in Molecular Biology 408, 171–191.
  13. Hu R., Qiu X., Glazko G. (2010). A new gene selection procedure based on the covariance distance. Bioinformatics 26(3), 348–354.
  14. Jennrich R. (1970). An asymptotic χ² test for the equality of two correlation matrices. Journal of the American Statistical Association 65, 904–912.
  15. Kullback S. (1967). On testing correlation matrices. Applied Statistics 16, 80–85.
  16. Lai Y. L., Wu B., Chen L., Zhao H. Y. (2004). Statistical method for identifying differential gene-gene coexpression patterns. Bioinformatics 20(17), 3146–3155.
  17. Larntz K., Perlman M. (1985). A simple test for the equality of correlation matrices. Technical Report, Department of Statistics, University of Washington, 141.
  18. Layard M. (1972). Large sample tests for the equality of two covariance matrices. Annals of Mathematical Statistics 43, 149–151.
  19. Ledoit O., Wolf M. (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. The Annals of Statistics 30, 1081–1102.
  20. Li W., Qin Y. (2014). Hypothesis testing for high-dimensional covariance matrices. Journal of Multivariate Analysis 128, 108–119.
  21. Modarres R., Jernigan R. W. (1992). Testing the equality of correlation matrices. Communications in Statistics - Theory and Methods 21(8), 2107–2125.
  22. Mohan K., Chung M., Han S., Witten D., Lee S. I., Fazel M. (2012). Structured learning of Gaussian graphical models. Advances in Neural Information Processing Systems, 620–628.
  23. Mohan K., London P., Fazel M., Witten D., Lee S.-I. (2014). Node-based learning of multiple Gaussian graphical models. The Journal of Machine Learning Research 15(1), 445–488.
  24. Muirhead R. (1982). Aspects of Multivariate Statistical Theory. New York, NY: Wiley.
  25. Neudecker H., Wesselman A. M. (1990). The asymptotic variance matrix of the sample correlation matrix. Linear Algebra and its Applications 127, 589–599.
  26. Satorra A., Neudecker H. (1997). Compact matrix expressions for generalized Wald tests of equality of moment vectors. Journal of Multivariate Analysis 63, 259–276.
  27. Schott J. R. (2007). A test for the equality of covariance matrices when the dimension is large relative to the sample size. Computational Statistics and Data Analysis 51, 6535–6542.
  28. Seber G. (1984). Multivariate Observations. USA: John Wiley & Sons.
  29. Shedden K., Taylor J. (2004). Differential correlation detects complex association between gene expression and clinical outcomes in lung adenocarcinomas. Methods in Microarray Data Analysis 4, 121–132.
  30. Singh D., Febbo P. G., Ross K., Jackson D. G., Manola J., Ladd C., Tamayo P., Renshaw A. A., D'Amico A. V., Richie J. P. and others (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209.
  31. Verhaak R. G. W., Hoadley K. A., Purdom E., Wang V., Qi Y., Wilkerson M. D., Miller C. R., Ding L., Golub T., Mesirov J. P. and others (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17(1), 98–110.
  32. Wilks S. (1932). Certain generalisations in the analysis of variance. Biometrika 24, 471–494.
  33. Xiao Y. H., Frisina R., Gordon A., Klebanov L., Yakovlev A. (2004). Multivariate search for differentially expressed gene combinations. BMC Bioinformatics 5(164).

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
