Abstract
When a statistical test is repeatedly applied to rows of a data matrix, correlations among data rows will give rise to correlations among corresponding test statistics. We investigate the relationship between test-statistic correlation and data-row correlation and discuss its implications.
Keywords: test-statistic correlation, bivariate normal, two-sample t-test
1. Introduction
Many scientific data sets are organized in matrix form, and statistical inferences—such as hypothesis tests and regression analyses—are often repeatedly applied to individual rows of the data matrix. For example, in gene expression analysis, normalized expression values are often organized in a matrix with rows corresponding to genes and columns corresponding to biological samples (experimental units). In a two-group comparison experiment, a two-sample test will be applied to each row of the data matrix in order to assess differential expression (DE). For more complex experimental designs, regression analysis can be used.
Correlations may exist among the data rows: For example, between-gene correlations are commonly observed in gene expression data [1, 2, 3, 4, 5]. Data-row correlations can give rise to correlations among the test statistic values calculated from the data rows [6, 7, 8]. The dependence among test statistic values has brought methodological challenges to statistical procedures aiming to summarize the collection of test results. For example, some multiple hypothesis testing procedures determine a p-value cutoff by controlling the false discovery rate (FDR) [9] or the q-value [5]. Many FDR-control procedures are valid only when the test statistics satisfy certain independence or positive-dependence conditions [9, 10]. Furthermore, Efron [7] showed in a simulation study that for a nominal FDR of 0.1, the actual false discovery proportions (FDP) in individual experiments can easily vary by a factor of 10 when there are correlations among test statistics.
In a gene-set analysis, one tests for over-abundance of DE genes in a specified gene set (e.g., a molecular pathway or a gene ontology category) [11]. The correlations among DE test statistics, if not addressed appropriately, will undermine the validity of many gene-set tests [2, 8, 12]. A better understanding of the test-statistic correlations is thus of fundamental importance and is a first step towards developing statistical methods that correctly account for correlations.
Without replicating the experiment, we cannot directly estimate the correlation between a pair of test statistic values, because there is only one observed test statistic value for each data row. For this reason, the correlation between the corresponding data rows (after treatment effects are accounted for) is sometimes used as a surrogate—explicitly or implicitly—when one actually needs the test-statistic correlation. It is yet unclear when and to what extent the test-statistic correlation (e.g., as measured by the Pearson correlation coefficient) can be approximated by the corresponding data-row correlation, though some simulation results suggest connections between the two quantities. Efron [7] concluded through simulation that the distribution of z-value correlations (the z-value being the test statistic considered in that paper) can be nearly represented by the distribution of sample correlations from the data rows. Barry et al. [6] showed by Monte Carlo simulation of gene expression data that a nearly linear relationship holds between test-statistic correlations and data-row correlations for several forms of test statistics they examined. These Monte Carlo simulation results were cited by Wu and Smyth [8] as a justification for estimating a variance inflation factor from data-row correlations in order to correct for test-statistic correlations.
In this paper, we derive an analytical formula for the test-statistic correlation as a function of the data-row correlation for a general class of test statistics—including the familiar two-sample t-test as a special case. We use simulation results to confirm our analytical findings. We show that 1) the test-statistic correlation is equal to the data-row correlation when the test statistic is a linear combination of the observed data, but 2) in general, the test-statistic correlation is weaker than and not well approximated by the corresponding data-row correlation. In particular, our analytical formula reveals that 3) the test-statistic correlation depends on whether the test statistic has an expectation of 0 (which often corresponds to whether the null hypothesis is true). These findings urge us to give more thought to correlations when trying to summarize a collection of test results.
2. Methods and Results
Suppose we have a data matrix and have applied a statistical test to individual rows of the data matrix. We will consider pairwise correlations and focus on two rows of the data matrix: X = (X1, …, Xn)T and Y = (Y1, …, Yn)T with mean vectors μX and μY. We will assume that the columns of the data matrix are independent so that (Xj, Yj), j = 1, …, n, are independent bivariate random vectors: this assumption is usually reasonable in a designed experiment for two-group comparison. The mean of (Xj, Yj) may vary across experimental units j = 1, …, n, but we assume that the population variance-covariance structure remains the same across experimental units, that is,
Var(Xj) = σX², Var(Yj) = σY², and Cov(Xj, Yj) = ρσXσY (1)
for all j = 1, …, n. We consider a general class of test statistics of the form

TX = aTX/SX + cX, TY = aTY/SY + cY, (2)
where a is a non-zero n-vector, (cX, cY) are non-random constants, and (aTX, aTY) and (SX, SY) are independent. In particular, the familiar two-sample t-test is of this form with SX and SY estimating σX and σY respectively. So is the t-test for a regression coefficient in a linear regression model.
We want to investigate the connections between the test-statistic correlation ρT = Cor(TX, TY) and the data-row correlation ρ = Cor(Xj, Yj) (common to all units j). First, we present an analytical formula that relates ρT to ρ.
Theorem 1.
For the test statistics TX, TY in (2):
ρT = [ρσXσY c E(SX⁻¹SY⁻¹) + dXdY Cov(SX⁻¹, SY⁻¹)] / √{[cσX² E(SX⁻²) + dX² Var(SX⁻¹)][cσY² E(SY⁻²) + dY² Var(SY⁻¹)]}, (3)

where dX = aTμX, dY = aTμY and c = aTa.
Proof. (cX, cY) do not affect the correlation and can be ignored. For any (UX, UY) that are independent of (SX, SY), direct calculation shows that

Cov(UX/SX, UY/SY) = Cov(UX, UY)E(SX⁻¹SY⁻¹) + E(UX)E(UY)Cov(SX⁻¹, SY⁻¹)

and

Var(UX/SX) = Var(UX)E(SX⁻²) + [E(UX)]²Var(SX⁻¹).

For this theorem, we let UX = aTX, UY = aTY; then E(UX) = aTμX, E(UY) = aTμY, Var(UX) = σX²aTa, Var(UY) = σY²aTa, and Cov(UX, UY) = ρσXσYaTa, since the columns of the data matrix are assumed independent. □
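As a quick numerical sanity check of equation (3) (not part of the original derivation), the sketch below simulates a statistic in class (2): a one-sample mean divided by the sample standard deviation, so that a = (1/n, …, 1/n), c = aTa = 1/n, and (aTX, SX) are independent under normality. It compares the empirical Cor(TX, TY) with the right-hand side of (3) evaluated using Monte Carlo estimates of the S-moments. The parameter values (n = 10, ρ = 0.6, μX = 1, μY = 0.5, σX = σY = 1) are illustrative choices, not from the paper.

```python
import math
import random

random.seed(1)

n, H = 10, 30000
rho, mu_x, mu_y = 0.6, 1.0, 0.5   # sigma_x = sigma_y = 1
c = 1.0 / n                        # a = (1/n, ..., 1/n), so c = a'a = 1/n
d_x, d_y = mu_x, mu_y              # d = a'mu (mean vectors are constant)

def sample_sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

TX, TY, inv_sx, inv_sy = [], [], [], []
for _ in range(H):
    X, Y = [], []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        X.append(mu_x + z1)
        Y.append(mu_y + rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    sx, sy = sample_sd(X), sample_sd(Y)
    TX.append(sum(X) / n / sx)     # T_X = a'X / S_X
    TY.append(sum(Y) / n / sy)
    inv_sx.append(1 / sx)
    inv_sy.append(1 / sy)

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# left-hand side: empirical Cor(TX, TY)
lhs = cov(TX, TY) / math.sqrt(cov(TX, TX) * cov(TY, TY))

# right-hand side of (3), with the S-moments estimated from the same draws
num = rho * c * mean([a * b for a, b in zip(inv_sx, inv_sy)]) + d_x * d_y * cov(inv_sx, inv_sy)
den_x = c * mean([a * a for a in inv_sx]) + d_x ** 2 * cov(inv_sx, inv_sx)
den_y = c * mean([b * b for b in inv_sy]) + d_y ** 2 * cov(inv_sy, inv_sy)
rhs = num / math.sqrt(den_x * den_y)
```

With 30,000 replicates the two sides agree to within Monte Carlo error, and both are noticeably smaller than the data-row correlation ρ = 0.6.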
To apply equation (3) in practice, we need to compute the involved moments of (SX⁻¹, SY⁻¹), namely E(SX⁻²), E(SY⁻²), E(SX⁻¹SY⁻¹), Var(SX⁻¹), Var(SY⁻¹), and Cov(SX⁻¹, SY⁻¹), but equation (3) offers some insights without explicit calculation of those quantities.
Corollary 1. ρT = ρ if SX and SY are constants (i.e., not random).
Proof. When SX and SY are constants, Var(SX⁻¹), Var(SY⁻¹), and Cov(SX⁻¹, SY⁻¹) are all 0, and E(SX⁻¹SY⁻¹) = √{E(SX⁻²)E(SY⁻²)} = (SXSY)⁻¹. Equation (3) then reduces to ρT = ρ. □
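Corollary 1 is easy to verify numerically. The sketch below (illustrative parameters, not from the paper) simulates z-statistics, i.e., sample means scaled by the known constant σ/√n, from correlated bivariate normal rows with ρ = 0.3; the sample correlation of the z-values recovers ρ.

```python
import math
import random

random.seed(7)

n, H, rho = 5, 40000, 0.3
sigma = 1.0
zx, zy = [], []
for _ in range(H):
    X, Y = [], []
    for _ in range(n):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        X.append(u)
        Y.append(rho * u + math.sqrt(1 - rho ** 2) * v)
    # z statistics: the scale sigma/sqrt(n) is a known constant, not an estimate
    zx.append(sum(X) / n / (sigma / math.sqrt(n)))
    zy.append(sum(Y) / n / (sigma / math.sqrt(n)))

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

rho_T = corr(zx, zy)   # should be close to rho = 0.3
```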
This corollary says that for z-tests, the test-statistic correlation is the same as the corresponding data-row correlation (this was also pointed out in [6]), which confirms the simulation results in [7]. Another insight offered by equation (3) is that the relation between ρT and ρ depends on whether one or both of aTμX and aTμY are 0, which often corresponds to whether the corresponding null hypotheses are true. When both aTμX and aTμY are 0, equation (3) has the simpler form

ρT = ρ E(SX⁻¹SY⁻¹) / √{E(SX⁻²)E(SY⁻²)}.

Intuitively, in such cases we can expect ρT ≈ ρ in large samples if SX and SY are “good” estimators of σX and σY: E(SX⁻¹SY⁻¹) and √{E(SX⁻²)E(SY⁻²)} will then both tend to (σXσY)⁻¹.
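To see how quickly such inverse moments converge, note that if νS²/σ² ~ χ²ν, then exactly E(S⁻¹) = σ⁻¹√(ν/2)Γ((ν − 1)/2)/Γ(ν/2). The short sketch below (an illustration, not from the paper) evaluates σE(S⁻¹) for a few degrees of freedom and shows it approaching 1 (the rate is roughly 1 + 3/(4ν)).

```python
import math

def inv_sd_moment(nu):
    """sigma * E(S^-1) when nu * S^2 / sigma^2 ~ chi-square with nu df."""
    return math.sqrt(nu / 2.0) * math.gamma((nu - 1) / 2.0) / math.gamma(nu / 2.0)

# converges to 1 as the degrees of freedom grow
vals = {nu: inv_sd_moment(nu) for nu in (4, 18, 100)}
```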
More generally, though, the test-statistic correlation ρT is not the same as the data-row correlation ρ. Next, using the important special case of two-sample t-test, we will further demonstrate that, in general, ρT is not well approximated by ρ, even in large samples.
For the equal-variance two-sample t-test, suppose that units j = 1, …, n1 belong to group 1 and units j = n1 + 1, …, n belong to group 2. We let

a = (1/n1, …, 1/n1, −1/n2, …, −1/n2)T, cX = cY = 0,

and

SX² = (1/n1 + 1/n2) Sp,X², with Sp,X² = [(n1 − 1)SX,1² + (n2 − 1)SX,2²]/(n1 + n2 − 2),

in (2) (and similarly for SY²), where SX,1² and SX,2² are the sample variances for sample 1 and sample 2, respectively, in data row X, and Sp,X² is the pooled sample variance. With these choices, aTX is the difference of the two group means and TX is the usual equal-variance two-sample t statistic. By Basu’s theorem, (aTX, aTY) are independent of (SX, SY). Typically, the null hypotheses to test are dX = aTμX = 0 and dY = aTμY = 0.
Theorem 2.
For the equal-variance two-sample t-test, when n = n1 + n2 → ∞ and n1/n → r for some r, 0 < r < 1,
ρT → (ρ + βδXδYρ²) / √{(1 + βδX²)(1 + βδY²)}, (4)

where δX = dX/σX, δY = dY/σY, and β = r(1 − r)/2.
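Equation (4) is straightforward to evaluate. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def rho_T_limit(rho, delta_x, delta_y, r=0.5):
    """Limiting test-statistic correlation from equation (4), with n1/n -> r."""
    beta = r * (1.0 - r) / 2.0
    num = rho + beta * delta_x * delta_y * rho ** 2
    den = math.sqrt((1.0 + beta * delta_x ** 2) * (1.0 + beta * delta_y ** 2))
    return num / den
```

The function illustrates the three regimes discussed later in the paper: the limit equals ρ when δX = δY = 0, is attenuated otherwise, and can even have the opposite sign of ρ for extreme (δX, δY).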
Proof. As n = n1 + n2 → ∞ and n1/n → r, we have c = aTa = 1/n1 + 1/n2 → 0 and, with ν = n − 2, c⁻¹/ν → r(1 − r) = 2β. Writing SX² = c Sp,X² and SY² = c Sp,Y², the factors of c cancel in equation (3):

ρT = [ρσXσY E(Sp,X⁻¹Sp,Y⁻¹) + dXdY c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹)] / √{[σX² E(Sp,X⁻²) + dX² c⁻¹Var(Sp,X⁻¹)][σY² E(Sp,Y⁻²) + dY² c⁻¹Var(Sp,Y⁻¹)]}.

The key of the proof is to determine the limits of the moments E(Sp,X⁻²), E(Sp,Y⁻²), E(Sp,X⁻¹Sp,Y⁻¹), c⁻¹Var(Sp,X⁻¹), c⁻¹Var(Sp,Y⁻¹), and c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹). By the consistency of Sp,X² and Sp,Y² and the continuous mapping theorem, Sp,X⁻², Sp,Y⁻², and Sp,X⁻¹Sp,Y⁻¹ converge in probability to σX⁻², σY⁻², and (σXσY)⁻¹, respectively.

For large ν, νSp,X²/σX² ~ χ²ν, so E(σX⁴Sp,X⁻⁴) = ν²/[(ν − 2)(ν − 4)] is bounded. This implies that Sp,X⁻², Sp,Y⁻², and Sp,X⁻¹Sp,Y⁻¹ are all uniformly integrable (note that Sp,X⁻¹Sp,Y⁻¹ ≤ (Sp,X⁻² + Sp,Y⁻²)/2), and thus

E(Sp,X⁻²) → σX⁻², E(Sp,Y⁻²) → σY⁻², and E(Sp,X⁻¹Sp,Y⁻¹) → (σXσY)⁻¹.

In Lemma 1 in the Appendix, we show that

ν Var(Sp,X⁻¹) → 1/(2σX²), ν Var(Sp,Y⁻¹) → 1/(2σY²), and ν Cov(Sp,X⁻¹, Sp,Y⁻¹) → ρ²/(2σXσY). (5)

For these moment limits to hold, the involved moments need to be uniformly integrable (see, e.g., Theorem 6.2 of [13]); it is sufficient to show that the relevant inverse moments are bounded for large ν, which is established in Lemma 2 in the Appendix. Combining (5) with c⁻¹/ν → 2β gives c⁻¹Var(Sp,X⁻¹) → β/σX², c⁻¹Var(Sp,Y⁻¹) → β/σY², and c⁻¹Cov(Sp,X⁻¹, Sp,Y⁻¹) → βρ²/(σXσY). Plugging these limiting values into equation (3) gives equation (4). □
In the Appendix, we will also explain how to compute ρT in finite samples for the two-sample t-test. It is mainly E(SX⁻¹SY⁻¹) that is difficult to compute.
Theorem 2 (and equation (10) in the Appendix) reaffirms that the relation between ρT and ρ depends on (δX, δY) = (dX/σX, dY/σY). Figure 1 shows the contour plot of the limiting value of ρT when n1 = n2 → ∞ (r = 1/2, β = 1/8) as a function of (δX, δY), for ρ = −0.7, −0.1, 0.1, 0.7. Note that ρT → ρ if dX = dY = 0: typically, this means both null hypotheses are true. One can show that the limiting value in (4) never exceeds ρ in magnitude: |ρT| ≤ |ρ|. That is to say, in general, the test-statistic correlation is weaker than the corresponding data-row correlation.
Figure 1:
Contour plot of the limiting values of ρT as n1 = n2 → ∞. For each ρ, the asymptotic value of ρT is plotted as a function of (δX, δY).
In Figure 2, we plot ρT as a function of ρ when n1 = n2 = 3, 10, or ∞ for a few selected values of (δX, δY). We also added simulated values of ρT (for n1 = n2 = 3, 10) to confirm our analytical findings: For each (δX, δY) value, we let ρ vary from −1 to 1 with a step size of 0.01. For each ρ, we simulated a pair of data rows X, Y with independent bivariate normal columns,

(Xj, Yj)T ~ N2((μXj, μYj)T, ((1, ρ), (ρ, 1))), j = 1, …, n,

with (μXj, μYj) = (δX, δY) for group 1 and (0, 0) for group 2 (so that σX = σY = 1), and computed the two-sample t-test statistics TX and TY. H = 5000 pairs of (TX, TY) were simulated and their sample correlation is shown in Figure 2.
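The simulation just described can be reproduced in a few lines. The sketch below (in Python rather than the authors' R code, for one illustrative setting δX = δY = 2, ρ = 0.5, n1 = n2 = 10) compares the simulated ρT with the limiting value from equation (4).

```python
import math
import random

random.seed(11)

n1 = n2 = 10
delta_x = delta_y = 2.0    # group-1 means; group-2 means are 0 (sigma = 1)
rho = 0.5
H = 4000
cc = 1.0 / n1 + 1.0 / n2

def t_stat(g1, g2):
    # equal-variance two-sample t statistic
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    sp2 = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * cc)

TX, TY = [], []
for _ in range(H):
    X1, Y1, X2, Y2 = [], [], [], []
    for j in range(n1 + n2):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        x = u
        y = rho * u + math.sqrt(1 - rho ** 2) * v
        if j < n1:
            X1.append(x + delta_x)
            Y1.append(y + delta_y)
        else:
            X2.append(x)
            Y2.append(y)
    TX.append(t_stat(X1, X2))
    TY.append(t_stat(Y1, Y2))

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

rho_T_sim = corr(TX, TY)
# limiting value from (4), with beta = 1/8 since n1 = n2
limit = (rho + 0.125 * delta_x * delta_y * rho ** 2) / math.sqrt(
    (1 + 0.125 * delta_x ** 2) * (1 + 0.125 * delta_y ** 2))
```

Here the limit is 0.625/1.5 ≈ 0.417, visibly weaker than the data-row correlation ρ = 0.5, and the simulated ρT at n1 = n2 = 10 is already close to it.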
Figure 2:
Test-statistic correlation ρT versus data-row correlation ρ at different (δX, δY) values, when n1 = n2 = 3, 10, or ∞. The simulated values of ρT are also shown for n1 = n2 = 3 and 10. The solid (smooth) lines represent theoretical values, and the dashed (jagged) lines represent simulated values.
Let ρT∞ denote the limiting value of ρT as n1 = n2 → ∞. We see from Figure 2 that when δX = δY = 0, ρT∞ = ρ; when δX = 0 (and δY ≠ 0), ρT∞ is a linear function of ρ; and when δX and δY are both non-zero, ρT∞ is a quadratic function of ρ. These features are predictable from the analytical formula (4) in Theorem 2, and they hold approximately in finite samples if n is large. In fact, we see that when n1 = n2 = 10, the ρT values are already remarkably close to ρT∞.
In small samples (e.g., n1 = n2 = 3), there is more difference between ρT and ρT∞: ρT is often weaker than ρT∞ (i.e., |ρT| < |ρT∞|), with a couple of exceptions (e.g., when δX = ±5, δY = 5). This is reasonable, since SX and SY are “noisier” in small samples and noise in general reduces correlation. When both δX and δY are non-zero (which typically means both null hypotheses are false), ρ does not approximate ρT well no matter what the sample size is: |ρ| can significantly overestimate |ρT|. In extreme cases when δXδY is big, ρT and ρ can even have opposite signs.
3. Conclusion and discussion
This article discusses the relation between the test-statistic correlation ρT and the corresponding data-row correlation ρ. Our results indicate that ρT can be well approximated by ρ only in limited settings: for example, ρT = ρ for the z-test, and ρT ≈ ρ in large samples if both null hypotheses are true. For the two-sample t-test, the relation between ρT and ρ depends on (δX, δY), the expected mean differences divided by the respective standard deviations of the data rows. When δX and δY are both non-zero, ρT is a quadratic function of ρ, ρT can be much weaker than ρ (|ρT| < |ρ|), and ρT and ρ can sometimes have opposite signs.
Our findings have practical implications for statistical inferences aiming to summarize a collection of test results. For example, our results indicate that it is not reliable to approximate the distribution of test-statistic correlations by the distribution of data-row correlations if we expect the null hypotheses to be false for a significant proportion of the rows—which is often the case in gene expression analysis. If one wants to assess the null distribution of the test-statistic p-values by permuting the columns of the data matrix, then one has to realize that the permutation will also change the correlations among the test-statistic values (since the (δX, δY) values will change after each permutation). In separate ongoing work, we are investigating these and related issues to better understand the impact of test-statistic correlation on gene set enrichment analysis, where one wants to test for over-abundance of DE genes in a pre-specified set ([14] is one such attempt).
Wu and Smyth [8] discussed a variance inflation factor (VIF), which is useful when estimating the variance of the sum or average of m test statistics t1, t2, …, tm when the corresponding genes (data rows) are correlated. In that paper, the VIF is defined as 1 + (m − 1)ρ̄T, where ρ̄T is the average of test-statistic correlations (i.e., ρT’s) over all pairs of data rows in the set. (If all ti’s have the same variance τ², then Var(t1 + ⋯ + tm) = mτ²[1 + (m − 1)ρ̄T] = mτ² · VIF.) It was mentioned that ρ̄T can be estimated by the average of data-row correlations. Our results indicate that replacing test-statistic correlations by data-row correlations will not be accurate if there are mean differences between the two groups among the data rows. For example, consider the two-sample t-test performed on m = 21 data rows in a matrix with correlated data rows (ρ = 0.1 for all pairs, variance σ² = 1 for all rows) and mean differences between the two groups (n1 = n2 = 30) ranging from −3 to 3 (uniformly spaced, i.e., δ = −3, −2.7, −2.4, …, 3 for the 21 rows). The true VIF value computed using test-statistic correlations (which we can compute using asymptotic formula (4) in Theorem 2) is 2.48; the VIF computed using the data-row correlations is 3.00, which overestimates the true VIF by 21%. In practice, we can estimate ρT for each pair of data rows by plugging the corresponding estimated values of ρ, d, and σ into equation (4). In the Appendix, we use a simulation to show that estimating the VIF using estimated ρT values outperforms approximating ρT by the sample data-row correlations.
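The VIF comparison above can be checked directly from the asymptotic formula (4). A sketch (assuming the 21 δ values are equally spaced with step 0.3, and β = 1/8 since n1 = n2):

```python
import math

m = 21
deltas = [-3.0 + 0.3 * i for i in range(m)]   # -3, -2.7, ..., 3
rho = 0.1       # common data-row correlation
beta = 0.125    # r = 1/2 since n1 = n2 = 30

def rho_T_pair(dx, dy):
    # limiting test-statistic correlation from equation (4)
    return (rho + beta * dx * dy * rho ** 2) / math.sqrt(
        (1 + beta * dx ** 2) * (1 + beta * dy ** 2))

pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
avg_rho_T = sum(rho_T_pair(deltas[i], deltas[j]) for i, j in pairs) / len(pairs)

vif_true = 1 + (m - 1) * avg_rho_T   # VIF from test-statistic correlations
vif_naive = 1 + (m - 1) * rho        # VIF from data-row correlations
```

This reproduces the comparison above: roughly 2.48 versus 3.00, a 21% overestimate by the naive data-row version.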
One reviewer asked whether our results apply to the moderated t-test, where the variance estimation is based on a shrinkage method. The short answer is “no”. It is difficult to derive an analytical formula for the correlation between a pair of moderated t-test statistic values, since the shrinkage method uses information from all data rows to estimate the variances. Through a simple simulation in which we applied the moderated t-test in the limma package [15], we observed that the correlations among moderated t-test statistic values also depend on δX and δY, but the relationship between test-statistic correlations and data-row correlations does not follow the analytical formula derived in Theorem 2. In particular, we observed that for the moderated t-test, the test-statistic correlations tend to be greater than the data-row correlations in some cases where δX = δY ≠ 0; for the usual two-sample t-test statistic, we showed earlier that the magnitude of the test-statistic correlation tends to be less than that of the data-row correlation. Details of the simulation settings and results for the moderated t-test are included in the Appendix.
In this paper, we assumed that the columns of the data matrix are independent, and our explicit formulas focused mainly on the two-sample t-test. We believe these are good starting points for discussing the complex issue of test-statistic correlations arising from data-row correlations. In the future, we plan to extend our investigation to more general settings: for example, tests for regression coefficients in a generalized linear model.
The R code for reproducing the results in this paper is available on GitHub: https://github.com/zhuob/CorrelatedTest.
Supplementary Material
Acknowledgement
Research reported in this article was partially supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM104977 (to YD). We thank Sarah Emerson for valuable comments and suggestions in method development and manuscript preparation. This research was part of the doctoral dissertation of BZ under the supervision of YD. We thank the reviewers for their helpful comments.
References
- [1] Efron B, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Vol. 1, Cambridge University Press, 2012.
- [2] Gatti DM, Barry WT, Nobel AB, Rusyn I, Wright FA, Heading down the wrong pathway: on the influence of correlation within gene sets, BMC Genomics 11 (1) (2010) 574.
- [3] Huang Y-T, Lin X, Gene set analysis using variance component tests, BMC Bioinformatics 14 (1) (2013) 210.
- [4] Qiu X, Brooks AI, Klebanov L, Yakovlev A, The effects of normalization on the correlation structure of microarray data, BMC Bioinformatics 6 (1) (2005) 120.
- [5] Storey JD, The positive false discovery rate: a Bayesian interpretation and the q-value, Annals of Statistics 31 (6) (2003) 2013–2035.
- [6] Barry WT, Nobel AB, Wright FA, A statistical framework for testing functional categories in microarray data, The Annals of Applied Statistics 2 (1) (2008) 286–315.
- [7] Efron B, Correlation and large-scale simultaneous significance testing, Journal of the American Statistical Association 102 (477) (2007) 93–103.
- [8] Wu D, Smyth GK, Camera: a competitive gene set test accounting for inter-gene correlation, Nucleic Acids Research 40 (17) (2012) e133.
- [9] Benjamini Y, Hochberg Y, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) 57 (1) (1995) 289–300.
- [10] Benjamini Y, Yekutieli D, The control of the false discovery rate in multiple testing under dependency, Annals of Statistics 29 (4) (2001) 1165–1188.
- [11] Goeman JJ, Bühlmann P, Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics 23 (8) (2007) 980–987.
- [12] Yaari G, Bolen CR, Thakar J, Kleinstein SH, Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations, Nucleic Acids Research 41 (18) (2013) e170.
- [13] DasGupta A, Asymptotic Theory of Statistics and Probability, Springer Science & Business Media, 2008.
- [14] Zhuo B, Jiang D, MEACA: efficient gene-set interpretation of expression data using mixed models, bioRxiv (2017) 106781.
- [15] Smyth GK, Limma: linear models for microarray data, in: Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Springer, 2005, pp. 397–420.
- [16] Tricomi FG, Erdélyi A, The asymptotic expansion of a ratio of gamma functions, Pacific Journal of Mathematics 1 (1) (1951) 133–142.
- [17] Olver FWJ, Asymptotics and Special Functions, AK Peters/CRC Press, 1997.
- [18] Joarder AH, Moments of the product and ratio of two correlated chi-square variables, Statistical Papers 50 (3) (2009) 581–592.