Abstract
When a statistical test is repeatedly applied to rows of a data matrix, correlations among data rows will give rise to correlations among corresponding test statistics. We investigate the relationship between test-statistic correlation and data-row correlation and discuss its implications.
Keywords: test-statistic correlation, bivariate normal, two-sample t-test
1. Introduction
Many scientific data sets are organized in matrix form, and statistical inferences—such as hypothesis tests and regression analyses—are often repeatedly applied to individual rows of the data matrix. For example, in gene expression analysis, normalized expression values are often organized in a matrix with rows corresponding to genes and columns corresponding to biological samples (experimental units). In a two-group comparison experiment, a two-sample test will be applied to each row of the data matrix in order to assess differential expression (DE). For more complex experimental designs, regression analysis can be used.
Correlations may exist among the data rows: For example, between-gene correlations are commonly observed in gene expression data [1, 2, 3, 4, 5]. Data-row correlations can give rise to correlations among the test statistic values calculated from the data rows [6, 7, 8]. The dependence among test statistic values has brought methodological challenges to statistical procedures aiming to summarize the collection of test results. For example, some multiple hypothesis testing procedures determine a p-value cutoff by controlling the false discovery rate (FDR) [9] or the q-value [5]. Many FDR-control procedures are valid only when the test statistics satisfy certain independence or positive-dependence conditions [9, 10]. Furthermore, Efron [7] showed in a simulation study that for a nominal FDR of 0.1, the actual false discovery proportions (FDP) in individual experiments can easily vary by a factor of 10 when there are correlations among test statistics.
In a gene-set analysis, one tests for over-abundance of DE genes in a specified gene set (e.g., a molecular pathway or a gene ontology category) [11]. The correlations among DE test statistics, if not addressed appropriately, will undermine the validity of many gene-set tests [2, 8, 12]. A better understanding of the test-statistic correlations is thus of fundamental importance and is a first step towards developing statistical methods that correctly account for correlations.
Without replicating the experiment, we cannot directly estimate the correlation between a pair of test statistic values, because there is only one observed test statistic value for each data row. For this reason, the correlation between the corresponding data rows (after treatment effects are accounted for) is sometimes used as a surrogate—explicitly or implicitly—when one actually needs the test-statistic correlation. It is yet unclear when and to what extent the test-statistic correlation (e.g., as measured by the Pearson correlation coefficient) can be approximated by the corresponding data-row correlation, though some simulation results suggest connections between the two quantities. Efron [7] concluded through simulation that the distribution of z-value correlations (the z-value being the test statistic considered in that paper) can be nearly represented by the distribution of sample correlations from the data rows. Barry et al. [6] showed by Monte Carlo simulation of gene expression data that a nearly linear relationship holds between test-statistic correlations and data-row correlations for several forms of test statistics they examined. These Monte Carlo simulation results were cited by Wu and Smyth [8] as a justification for estimating a variance inflation factor from data-row correlations in order to correct for test-statistic correlations.
In this paper, we derive an analytical formula for the test-statistic correlation as a function of the data-row correlation for a general class of test statistics—including the familiar two-sample t-test as a special case. We use simulation results to confirm our analytical findings. We show that 1) the test-statistic correlation is equal to the data-row correlation when the test statistic is a linear combination of the observed data, but 2) in general, the test-statistic correlation is weaker than and not well approximated by the corresponding data-row correlation. In particular, our analytical formula reveals that 3) the test-statistic correlation depends on whether the test statistic has an expectation of 0 (which often corresponds to whether the null hypothesis is true). These findings urge us to give more thought to correlations when trying to summarize a collection of test results.
2. Methods and Results
Suppose we have a data matrix and have applied a statistical test to individual rows of the data matrix. We will consider pairwise correlations and focus on two rows of the data matrix: X = (X1, …, Xn)T and Y = (Y1, …, Yn)T with mean vectors μX and μY. We will assume that the columns of the data matrix are independent so that (Xj, Yj), j = 1, …, n, are independent bivariate random vectors: this assumption is usually reasonable in a designed experiment for two-group comparison. The mean of (Xj, Yj) may vary across experimental units j = 1, …, n, but we assume that the population variance-covariance structure remains the same across experimental units, that is,
Var(Xj) = σX², Var(Yj) = σY², and Cov(Xj, Yj) = ρσXσY (1)
for all j = 1, …, n. We consider a general class of test statistics of the form

TX = aTX/SX + cX, TY = aTY/SY + cY, (2)
where a is a non-zero n-vector, (cX, cY) are non-random constants, and (aTX, aTY) and (SX, SY) are independent. In particular, the familiar two-sample t-test is of this form with SX and SY estimating σX and σY respectively. So is the t-test for a regression coefficient in a linear regression model.
We want to investigate the connections between the test-statistic correlation ρT = Cor(TX, TY) and the data-row correlation ρ = Cor(Xj, Yj) (common to all units j). First, we present an analytical formula that relates ρT to ρ.
Theorem 1.
For the test statistics TX, TY in (2):
ρT = [ρσXσY c E(SX⁻¹SY⁻¹) + dXdY Cov(SX⁻¹, SY⁻¹)] / √{[cσX² E(SX⁻²) + dX² Var(SX⁻¹)][cσY² E(SY⁻²) + dY² Var(SY⁻¹)]}, (3)

where dX = aTμX, dY = aTμY and c = aTa.
Proof. (cX, cY) do not affect the correlation and can be ignored. For any (UX, UY) that are independent of (SX, SY), direct calculation shows that

Cov(UX/SX, UY/SY) = Cov(UX, UY)E(SX⁻¹SY⁻¹) + E(UX)E(UY)Cov(SX⁻¹, SY⁻¹)

and

Var(UX/SX) = Var(UX)E(SX⁻²) + [E(UX)]²Var(SX⁻¹).

For this theorem, we let UX = aTX, UY = aTY; then E(UX) = aTμX, E(UY) = aTμY, Var(UX) = σX²aTa, Var(UY) = σY²aTa, and Cov(UX, UY) = ρσXσYaTa, since the columns of the data matrix are assumed independent. □
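As a quick numerical sanity check of equation (3) (not part of the original derivation), the sketch below simulates a statistic in class (2): a one-sample mean divided by the sample standard deviation, so that a = (1/n, …, 1/n), c = aTa = 1/n, and (aTX, SX) are independent under normality. It compares the empirical Cor(TX, TY) with the right-hand side of (3) evaluated using Monte Carlo estimates of the S-moments. The parameter values (n = 10, ρ = 0.6, μX = 1, μY = 0.5, σX = σY = 1) are illustrative choices, not from the paper.

```python
import math
import random

random.seed(1)

n, H = 10, 30000
rho, mu_x, mu_y = 0.6, 1.0, 0.5   # sigma_x = sigma_y = 1
c = 1.0 / n                        # a = (1/n, ..., 1/n), so c = a'a = 1/n
d_x, d_y = mu_x, mu_y              # d = a'mu (mean vectors are constant)

def sample_sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

TX, TY, inv_sx, inv_sy = [], [], [], []
for _ in range(H):
    X, Y = [], []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        X.append(mu_x + z1)
        Y.append(mu_y + rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    sx, sy = sample_sd(X), sample_sd(Y)
    TX.append(sum(X) / n / sx)     # T_X = a'X / S_X
    TY.append(sum(Y) / n / sy)
    inv_sx.append(1 / sx)
    inv_sy.append(1 / sy)

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# left-hand side: empirical Cor(TX, TY)
lhs = cov(TX, TY) / math.sqrt(cov(TX, TX) * cov(TY, TY))

# right-hand side of (3), with the S-moments estimated from the same draws
num = rho * c * mean([a * b for a, b in zip(inv_sx, inv_sy)]) + d_x * d_y * cov(inv_sx, inv_sy)
den_x = c * mean([a * a for a in inv_sx]) + d_x ** 2 * cov(inv_sx, inv_sx)
den_y = c * mean([b * b for b in inv_sy]) + d_y ** 2 * cov(inv_sy, inv_sy)
rhs = num / math.sqrt(den_x * den_y)
```

With 30,000 replicates the two sides agree to within Monte Carlo error, and both are noticeably smaller than the data-row correlation ρ = 0.6.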
To apply equation (3) in practice, we need to compute the involved moments of (SX⁻¹, SY⁻¹), namely E(SX⁻²), E(SY⁻²), E(SX⁻¹SY⁻¹), Var(SX⁻¹), Var(SY⁻¹), and Cov(SX⁻¹, SY⁻¹), but equation (3) offers some insights without explicit calculation of those quantities.
Corollary 1. ρT = ρ if SX and SY are constants (i.e., not random).
Proof. When SX and SY are constants, Var(SX⁻¹), Var(SY⁻¹), and Cov(SX⁻¹, SY⁻¹) are all 0, and E(SX⁻¹SY⁻¹) = √{E(SX⁻²)E(SY⁻²)} = (SXSY)⁻¹. Equation (3) then reduces to ρT = ρ. □
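Corollary 1 is easy to verify numerically. The sketch below (illustrative parameters, not from the paper) simulates z-statistics, i.e., sample means scaled by the known constant σ/√n, from correlated bivariate normal rows with ρ = 0.3; the sample correlation of the z-values recovers ρ.

```python
import math
import random

random.seed(7)

n, H, rho = 5, 40000, 0.3
sigma = 1.0
zx, zy = [], []
for _ in range(H):
    X, Y = [], []
    for _ in range(n):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        X.append(u)
        Y.append(rho * u + math.sqrt(1 - rho ** 2) * v)
    # z statistics: the scale sigma/sqrt(n) is a known constant, not an estimate
    zx.append(sum(X) / n / (sigma / math.sqrt(n)))
    zy.append(sum(Y) / n / (sigma / math.sqrt(n)))

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

rho_T = corr(zx, zy)   # should be close to rho = 0.3
```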
This corollary says that for z-tests, the test-statistic correlation is the same as the corresponding data-row correlation (this was also pointed out in [6]), which confirms the simulation results in [7]. Another insight offered by equation (3) is that the relation between ρT and ρ depends on whether one or both of aTμX and aTμY are 0, which often corresponds to whether the corresponding null hypotheses are true. When both aTμX and aTμY are 0, equation (3) has the simpler form

ρT = ρ E(SX⁻¹SY⁻¹) / √{E(SX⁻²)E(SY⁻²)}.

Intuitively, in such cases we can expect ρT ≈ ρ in large samples if SX and SY are “good” estimators of σX and σY: E(SX⁻¹SY⁻¹) and √{E(SX⁻²)E(SY⁻²)} will then both tend to (σXσY)⁻¹.
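To see how quickly such inverse moments converge, note that if νS²/σ² ~ χ²ν, then exactly E(S⁻¹) = σ⁻¹√(ν/2)Γ((ν − 1)/2)/Γ(ν/2). The short sketch below (an illustration, not from the paper) evaluates σE(S⁻¹) for a few degrees of freedom and shows it approaching 1 (the rate is roughly 1 + 3/(4ν)).

```python
import math

def inv_sd_moment(nu):
    """sigma * E(S^-1) when nu * S^2 / sigma^2 ~ chi-square with nu df."""
    return math.sqrt(nu / 2.0) * math.gamma((nu - 1) / 2.0) / math.gamma(nu / 2.0)

# converges to 1 as the degrees of freedom grow
vals = {nu: inv_sd_moment(nu) for nu in (4, 18, 100)}
```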
More generally, though, the test-statistic correlation ρT is not the same as the data-row correlation ρ. Next, using the important special case of two-sample t-test, we will further demonstrate that, in general, ρT is not well approximated by ρ, even in large samples.
For the equal-variance two-sample t-test, suppose that units j = 1, …, n1 belong to group 1 and units j = n1 + 1, …, n belong to group 2. We let

a = (1/n1, …, 1/n1, −1/n2, …, −1/n2)T, cX = cY = 0,

and

SX² = (1/n1 + 1/n2) Sp,X², with Sp,X² = [(n1 − 1)SX,1² + (n2 − 1)SX,2²]/(n1 + n2 − 2),

in (2) (and similarly for SY²), where SX,1² and SX,2² are the sample variances for sample 1 and sample 2, respectively, in data row X, and Sp,X² is the pooled sample variance. With these choices, aTX is the difference of the two group means and TX is the usual equal-variance two-sample t statistic. By Basu’s theorem, (aTX, aTY) are independent of (SX, SY). Typically, the null hypotheses to test are dX = aTμX = 0 and dY = aTμY = 0.
Theorem 2.
For the equal-variance two-sample t-test, when n = n1 + n2 → ∞ and n1/n → r for some r, 0 < r < 1,
ρT → (ρ + βδXδYρ²) / √{(1 + βδX²)(1 + βδY²)}, (4)

where δX = dX/σX, δY = dY/σY, and β = r(1 − r)/2.
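Equation (4) is straightforward to evaluate. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def rho_T_limit(rho, delta_x, delta_y, r=0.5):
    """Limiting test-statistic correlation from equation (4), with n1/n -> r."""
    beta = r * (1.0 - r) / 2.0
    num = rho + beta * delta_x * delta_y * rho ** 2
    den = math.sqrt((1.0 + beta * delta_x ** 2) * (1.0 + beta * delta_y ** 2))
    return num / den
```

The function illustrates the three regimes discussed later in the paper: the limit equals ρ when δX = δY = 0, is attenuated otherwise, and can even have the opposite sign of ρ for extreme (δX, δY).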
Proof. As n = n1 + n2 → ∞ and n1/n → r, we have c = aTa = 1/n1 + 1/n2 → 0 and, with ν = n − 2, c⁻¹/ν → r(1 − r) = 2β. Writing SX² = c Sp,X² and SY² = c Sp,Y², the factors of c cancel in equation (3):

ρT = [ρσXσY E(Sp,X⁻¹Sp,Y⁻¹) + dXdY c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹)] / √{[σX² E(Sp,X⁻²) + dX² c⁻¹Var(Sp,X⁻¹)][σY² E(Sp,Y⁻²) + dY² c⁻¹Var(Sp,Y⁻¹)]}.

The key of the proof is to determine the limits of the moments E(Sp,X⁻²), E(Sp,Y⁻²), E(Sp,X⁻¹Sp,Y⁻¹), c⁻¹Var(Sp,X⁻¹), c⁻¹Var(Sp,Y⁻¹), and c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹). By the consistency of Sp,X² and Sp,Y² and the continuous mapping theorem, Sp,X⁻², Sp,Y⁻², and Sp,X⁻¹Sp,Y⁻¹ converge in probability to σX⁻², σY⁻², and (σXσY)⁻¹, respectively.

For large ν, νSp,X²/σX² ~ χ²ν, so E(σX⁴Sp,X⁻⁴) = ν²/[(ν − 2)(ν − 4)] is bounded. This implies that Sp,X⁻², Sp,Y⁻², and Sp,X⁻¹Sp,Y⁻¹ are all uniformly integrable (note that Sp,X⁻¹Sp,Y⁻¹ ≤ (Sp,X⁻² + Sp,Y⁻²)/2), and thus

E(Sp,X⁻²) → σX⁻², E(Sp,Y⁻²) → σY⁻², and E(Sp,X⁻¹Sp,Y⁻¹) → (σXσY)⁻¹.

In Lemma 1 in the Appendix, we show that

ν Var(Sp,X⁻¹) → 1/(2σX²), ν Var(Sp,Y⁻¹) → 1/(2σY²), and ν Cov(Sp,X⁻¹, Sp,Y⁻¹) → ρ²/(2σXσY). (5)

For these moment limits to hold, the involved moments need to be uniformly integrable (see, e.g., Theorem 6.2 of [13]); it is sufficient to show that the relevant inverse moments are bounded for large ν, which is established in Lemma 2 in the Appendix. Combining (5) with c⁻¹/ν → 2β gives c⁻¹Var(Sp,X⁻¹) → β/σX², c⁻¹Var(Sp,Y⁻¹) → β/σY², and c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹) → βρ²/(σXσY). Plugging these limiting values into equation (3) gives equation (4). □
In the Appendix, we will also explain how to compute ρT in finite samples for the two-sample t-test. It is mainly E(SX⁻¹SY⁻¹) that is difficult to compute.
Theorem 2 (and equation (10) in the Appendix) reaffirms that the relation between ρT and ρ depends on (δX, δY) = (dX/σX, dY/σY). Figure 1 shows the contour plot of the limiting value of ρT when n1 = n2 → ∞ (r = 1/2, β = 1/8) as a function of (δX, δY), for ρ = −0.7, −0.1, 0.1, 0.7. Note that ρT → ρ if dX = dY = 0: typically, this means both null hypotheses are true. One can show that the limiting value in (4) never exceeds ρ in magnitude: |ρT| ≤ |ρ|. That is to say, in general, the test-statistic correlation is weaker than the corresponding data-row correlation.
Figure 1:
Contour plot of the limiting values of ρT as n1 = n2 → ∞. For each ρ, the asymptotic value of ρT is plotted as a function of (δX, δY).
In Figure 2, we plot ρT as a function of ρ when n1 = n2 = 3, 10, or ∞ for a few selected values of (δX, δY). We also added simulated values of ρT (for n1 = n2 = 3, 10) to confirm our analytical findings: For each (δX, δY) value, we let ρ vary from −1 to 1 with a step size of 0.01. For each ρ, we simulated a pair of data rows X, Y with independent bivariate normal columns,

(Xj, Yj)T ~ N2((μXj, μYj)T, ((1, ρ), (ρ, 1))), j = 1, …, n,

with (μXj, μYj) = (δX, δY) for group 1 and (0, 0) for group 2 (so that σX = σY = 1), and computed the two-sample t-test statistics TX and TY. H = 5000 pairs of (TX, TY) were simulated and their sample correlation is shown in Figure 2.
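The simulation just described can be reproduced in a few lines. The sketch below (in Python rather than the authors' R code, for one illustrative setting δX = δY = 2, ρ = 0.5, n1 = n2 = 10) compares the simulated ρT with the limiting value from equation (4).

```python
import math
import random

random.seed(11)

n1 = n2 = 10
delta_x = delta_y = 2.0    # group-1 means; group-2 means are 0 (sigma = 1)
rho = 0.5
H = 4000
cc = 1.0 / n1 + 1.0 / n2

def t_stat(g1, g2):
    # equal-variance two-sample t statistic
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    sp2 = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * cc)

TX, TY = [], []
for _ in range(H):
    X1, Y1, X2, Y2 = [], [], [], []
    for j in range(n1 + n2):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        x = u
        y = rho * u + math.sqrt(1 - rho ** 2) * v
        if j < n1:
            X1.append(x + delta_x)
            Y1.append(y + delta_y)
        else:
            X2.append(x)
            Y2.append(y)
    TX.append(t_stat(X1, X2))
    TY.append(t_stat(Y1, Y2))

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

rho_T_sim = corr(TX, TY)
# limiting value from (4), with beta = 1/8 since n1 = n2
limit = (rho + 0.125 * delta_x * delta_y * rho ** 2) / math.sqrt(
    (1 + 0.125 * delta_x ** 2) * (1 + 0.125 * delta_y ** 2))
```

Here the limit is 0.625/1.5 ≈ 0.417, visibly weaker than the data-row correlation ρ = 0.5, and the simulated ρT at n1 = n2 = 10 is already close to it.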
Figure 2:
Test-statistic correlation ρT versus data-row correlation ρ at different (δX, δY) values, when n1 = n2 = 3, 10, or ∞. The simulated values of ρT are also shown for n1 = n2 = 3 and 10. The solid (smooth) lines represent theoretical values, and the dashed (jagged) lines represent simulated values.
Let ρT∞ denote the limiting value of ρT as n1 = n2 → ∞. We see from Figure 2 that when δX = δY = 0, ρT∞ = ρ; when δX = 0 (and δY ≠ 0), ρT∞ is a linear function of ρ; and when δX and δY are both non-zero, ρT∞ is a quadratic function of ρ. These features are predictable from the analytical formula (4) in Theorem 2, and they hold approximately in finite samples if n is large. In fact, we see that when n1 = n2 = 10, the ρT values are already remarkably close to ρT∞.
In small samples (e.g., n1 = n2 = 3), there is more difference between ρT and ρT∞: ρT is often weaker than ρT∞ (i.e., |ρT| < |ρT∞|), with a couple of exceptions (e.g., when δX = ±5, δY = 5). This is reasonable, since SX and SY are “noisier” in small samples and noise in general reduces correlation. When both δX and δY are non-zero (which typically means both null hypotheses are false), ρ does not approximate ρT well no matter what the sample size is: |ρ| can significantly overestimate |ρT|. In extreme cases when δXδY is big, ρT and ρ can even have opposite signs.
3. Conclusion and discussion
This article discusses the relation between the test-statistic correlation ρT and the corresponding data-row correlation ρ. Our results indicate that ρT can be well approximated by ρ only in limited settings: for example, ρT = ρ for the z-test, and ρT ≈ ρ in large samples if both null hypotheses are true. For the two-sample t-test, the relation between ρT and ρ depends on (δX, δY), the expected mean differences divided by the respective standard deviations of the data rows. When δX and δY are both non-zero, ρT is a quadratic function of ρ, ρT can be much weaker than ρ (|ρT| < |ρ|), and ρT and ρ can sometimes have opposite signs.
Our findings have practical implications for statistical inferences aiming to summarize a collection of test results. For example, our results indicate that it is not reliable to approximate the distribution of test-statistic correlations by the distribution of data-row correlations if we expect the null hypotheses to be false for a significant proportion of the rows—which is often the case in gene expression analysis. If one wants to assess the null distribution of the test-statistic p-values by permuting the columns of the data matrix, then one has to realize that the permutation will also change the correlations among the test-statistic values (since the (δX, δY) values will change after each permutation). In separate ongoing work, we are investigating these and related issues to better understand the impact of test-statistic correlation on gene set enrichment analysis, where one wants to test for over-abundance of DE genes in a pre-specified set ([14] is one such attempt).
Wu and Smyth [8] discussed a variance inflation factor (VIF), which is useful when estimating the variance of the sum or average of m test statistics t1, t2, …, tm when the corresponding genes (data rows) are correlated. In that paper, the VIF is defined as 1 + (m − 1)ρ̄T, where ρ̄T is the average of test-statistic correlations (i.e., ρT’s) over all pairs of data rows in the set. (If all ti’s have the same variance τ², then Var(t1 + ⋯ + tm) = mτ²[1 + (m − 1)ρ̄T] = mτ² · VIF.) It was mentioned that ρ̄T can be estimated by the average of data-row correlations. Our results indicate that replacing test-statistic correlations by data-row correlations will not be accurate if there are mean differences between the two groups among the data rows. For example, consider the two-sample t-test performed on m = 21 data rows in a matrix with correlated data rows (ρ = 0.1 for all pairs, variance σ² = 1 for all rows) and mean differences between the two groups (n1 = n2 = 30) ranging from −3 to 3 (uniformly spaced, i.e., δ = −3, −2.7, −2.4, …, 3 for the 21 rows). The true VIF value computed using test-statistic correlations (which we can compute using asymptotic formula (4) in Theorem 2) is 2.48; the VIF computed using the data-row correlations is 3.00, which overestimates the true VIF by 21%. In practice, we can estimate ρT for each pair of data rows by plugging the corresponding estimated values of ρ, d, and σ into equation (4). In the Appendix, we use a simulation to show that estimating the VIF using estimated ρT values outperforms approximating ρT by the sample data-row correlations.
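The VIF comparison above can be checked directly from the asymptotic formula (4). A sketch (assuming the 21 δ values are equally spaced with step 0.3, and β = 1/8 since n1 = n2):

```python
import math

m = 21
deltas = [-3.0 + 0.3 * i for i in range(m)]   # -3, -2.7, ..., 3
rho = 0.1       # common data-row correlation
beta = 0.125    # r = 1/2 since n1 = n2 = 30

def rho_T_pair(dx, dy):
    # limiting test-statistic correlation from equation (4)
    return (rho + beta * dx * dy * rho ** 2) / math.sqrt(
        (1 + beta * dx ** 2) * (1 + beta * dy ** 2))

pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
avg_rho_T = sum(rho_T_pair(deltas[i], deltas[j]) for i, j in pairs) / len(pairs)

vif_true = 1 + (m - 1) * avg_rho_T   # VIF from test-statistic correlations
vif_naive = 1 + (m - 1) * rho        # VIF from data-row correlations
```

This reproduces the comparison above: roughly 2.48 versus 3.00, a 21% overestimate by the naive data-row version.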
One reviewer asked whether our results apply to the moderated t-test, where the variance estimation is based on a shrinkage method. The short answer is “no”. It is difficult to derive an analytical formula for the correlation between a pair of moderated t-test statistic values, since the shrinkage method uses information from all data rows to estimate the variances. Through a simple simulation in which we applied the moderated t-test in the limma package [15], we observed that the correlations among moderated t-test statistic values also depend on δX and δY, but the relationship between test-statistic correlations and data-row correlations does not follow the analytical formula derived in Theorem 2. In particular, we observed that for the moderated t-test, the test-statistic correlations tend to be greater than the data-row correlations in some cases where δX = δY ≠ 0; for the usual two-sample t-test statistic, we showed earlier that the magnitude of the test-statistic correlation tends to be less than that of the data-row correlation. Details of the simulation settings and results for the moderated t-test are included in the Appendix.
In this paper, we assumed that the columns of the data matrix are independent, and our explicit formulas focused mainly on the two-sample t-test. We believe these are good starting points for discussing the complex issue of test-statistic correlations arising from data-row correlations. In the future, we plan to extend our investigation to more general settings: for example, tests for regression coefficients in a generalized linear model.
The R code for reproducing the results in this paper is available on GitHub: https://github.com/zhuob/CorrelatedTest.
Supplementary Material
Acknowledgement
Research reported in this article was partially supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM104977 (to YD). We thank Sarah Emerson for valuable comments and suggestions in method development and manuscript preparation. This research was part of the doctoral dissertation of BZ under the supervision of YD. We thank the reviewers for their helpful comments.
References
- [1] Efron B, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Vol. 1, Cambridge University Press, 2012.
- [2] Gatti DM, Barry WT, Nobel AB, Rusyn I, Wright FA, Heading down the wrong pathway: on the influence of correlation within gene sets, BMC Genomics 11 (1) (2010) 574.
- [3] Huang Y-T, Lin X, Gene set analysis using variance component tests, BMC Bioinformatics 14 (1) (2013) 210.
- [4] Qiu X, Brooks AI, Klebanov L, Yakovlev A, The effects of normalization on the correlation structure of microarray data, BMC Bioinformatics 6 (1) (2005) 120.
- [5] Storey JD, The positive false discovery rate: a Bayesian interpretation and the q-value, Annals of Statistics 31 (6) (2003) 2013–2035.
- [6] Barry WT, Nobel AB, Wright FA, A statistical framework for testing functional categories in microarray data, The Annals of Applied Statistics 2 (1) (2008) 286–315.
- [7] Efron B, Correlation and large-scale simultaneous significance testing, Journal of the American Statistical Association 102 (477) (2007) 93–103.
- [8] Wu D, Smyth GK, Camera: a competitive gene set test accounting for inter-gene correlation, Nucleic Acids Research 40 (17) (2012) e133.
- [9] Benjamini Y, Hochberg Y, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) 57 (1) (1995) 289–300.
- [10] Benjamini Y, Yekutieli D, The control of the false discovery rate in multiple testing under dependency, Annals of Statistics 29 (4) (2001) 1165–1188.
- [11] Goeman JJ, Bühlmann P, Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics 23 (8) (2007) 980–987.
- [12] Yaari G, Bolen CR, Thakar J, Kleinstein SH, Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations, Nucleic Acids Research 41 (18) (2013) e170.
- [13] DasGupta A, Asymptotic Theory of Statistics and Probability, Springer Science & Business Media, 2008.
- [14] Zhuo B, Jiang D, MEACA: efficient gene-set interpretation of expression data using mixed models, bioRxiv (2017) 106781.
- [15] Smyth GK, Limma: linear models for microarray data, in: Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Springer, 2005, pp. 397–420.
- [16] Tricomi FG, Erdélyi A, The asymptotic expansion of a ratio of gamma functions, Pacific Journal of Mathematics 1 (1) (1951) 133–142.
- [17] Olver FWJ, Asymptotics and Special Functions, AK Peters/CRC Press, 1997.
- [18] Joarder AH, Moments of the product and ratio of two correlated chi-square variables, Statistical Papers 50 (3) (2009) 581–592.