Published in final edited form as: Ann Appl Stat. 2009 Jan 1;3(3):922–942. doi: 10.1214/09-AOAS236

Are a set of microarrays independent of each other?

Bradley Efron

Abstract

Having observed an m × n matrix X whose rows are possibly correlated, we wish to test the hypothesis that the columns are independent of each other. Our motivation comes from microarray studies, where the rows of X record expression levels for m different genes, often highly correlated, while the columns represent n individual microarrays, presumably obtained independently. The presumption of independence underlies all the familiar permutation, cross-validation, and bootstrap methods for microarray analysis, so it is important to know when independence fails. We develop nonparametric and normal-theory testing methods. The row and column correlations of X interact with each other in a way that complicates test procedures, essentially by reducing the accuracy of the relevant estimators.

Keywords: total correlation, effective sample size, permutation tests, matrix normal distribution, row and column correlations

1 Introduction

The formal statistical problem considered here can be stated simply: having observed an m × n data matrix X with possibly correlated rows, test the hypothesis that the columns are independent of each other. Relationships between the row correlations and column correlations of X complicate the problem’s solution.

Why are we interested in column-wise independence? The motivation in this paper comes from microarray studies, where X is a matrix of expression levels for m genes on n microarrays. In the “Cardio” study I will use for illustration there are m = 20426 genes each measured on n = 63 arrays, with the microarrays corresponding to 63 subjects, 44 healthy controls and 19 cardiovascular patients¹. We expect the gene expressions to be correlated, inducing substantial correlations within each column (Owen, 2005; Efron, 2007a; Qiu, Brooks, Klebanov and Yakovlev, 2005a), but most of the standard analysis techniques begin with an assumption of independence across microarrays, that is, across the columns of X. This can be a risky assumption: all of the familiar permutation, cross-validation and bootstrap methods for microarray analysis, such as the popular SAM program of Tusher, Tibshirani and Chu (2001), depend on column-wise independence of X; dependence can invalidate the usual choice of a null hypothesis, as discussed next, leading to flawed assessments of significance.

An immediate purpose of the Cardio study is to identify genes involved in the disease process. For gene i we compute the two-sample t-statistic “ti” comparing sick versus healthy subjects. It will be convenient for discussion to convert these to z-scores,

$z_i = \Phi^{-1}\big(F_{61}(t_i)\big), \qquad i = 1, 2, \dots, m,$ (1.1)

with $\Phi$ and $F_{61}$ the cumulative distribution functions (cdf) of the standard normal and $t_{61}$ distributions; under the usual assumptions, $z_i$ will have a standard N(0, 1) null distribution, called here the “theoretical null.” Unusually large values of $z_i$ or $-z_i$ are used to identify non-null genes, with the meaning of “unusual” depending heavily on column-wise independence.
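In scipy the transformation (1.1) is a one-liner; a minimal sketch, with placeholder t-values standing in for the gene-wise two-sample statistics t_i (not the actual Cardio data):

    import numpy as np
    from scipy.stats import norm, t as t_dist

    # Placeholder t-statistics; the real t_i come from the sick-versus-healthy
    # two-sample comparison, with 63 - 2 = 61 degrees of freedom.
    t_vals = np.random.default_rng(0).standard_t(df=61, size=20426)
    z = norm.ppf(t_dist.cdf(t_vals, df=61))   # z_i = Phi^{-1}(F_61(t_i)), (1.1)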

The left panel of Figure 1 shows the histogram of all 20426 $z_i$ values, which is seen to be much wider than N(0, 1) near its center. An “empirical null” fit to the center as in Efron (2007b) was estimated to be N(.03, 1.57²). Null overdispersion has many possible causes (Efron, 2004, 2007a,b), one of which is positive correlation across the columns of X. Such correlations reduce the effective degrees of freedom for the t-statistic, causing (1.1) to yield overdispersed null $z_i$’s, and of course changing our assessment of significance for outlying values.

Figure 1.

Left panel: histogram of m = 20426 z-values (1.1) for the Cardio study; the center of the histogram is much wider than the N(0, 1) theoretical null. Right panel: scatterplot of microarrays 31 and 32, $(x_{i,31}, x_{i,32})$ for i = 1, 2, …, m, after removal of row-wise gene means; the scattergram seems to indicate substantial correlation between the two arrays.

The right panel of Figure 1 seems to offer a “smoking gun” for correlation: the scattergram of expression levels for microarrays 31 and 32 looks strikingly correlated, with sample correlation coefficient .805. Here X has been standardized by subtraction of its row means, so the effect is not due to so-called ecological correlations. (X is actually “doubly standardized,” as defined in Section 2.) Nevertheless the question of whether or not correlation .805 is significantly positive turns out to be surprisingly close, as discussed in Section 4, because the row-wise correlations in X drastically reduce the degrees of freedom for the scatterplot. Despite the massive appearance of 20426 points, the scattergram’s accuracy is no more than would be given by 17 independent bivariate normal pairs.

Answering the title’s question, that is, testing for column-wise independence in the presence of row-wise dependence, has both easy and difficult aspects. Section 2 introduces a class of simple permutation tests which, in the case of the Cardio data, clearly discredit column-wise independence. However these tests depend on the ordering of the n columns, and can’t be used if the initial order is lost. It is natural and desirable to look for test statistics of column-wise independence that are invariant under permutation of the columns. Classical multivariate analysis, as in Anderson (2003), develops column independence tests in terms of the eigenvalues of an n by n Wishart matrix. However, this theory depends on the assumption of row-wise independence, disqualifying it for use here.

Sections 3 through 5 consider more general classes of independence tests, both from nonparametric and normal-theory points of view. The theorem in Section 3 illustrates a key difficulty: correlation between the rows of X (ruled out in the classic theory) can give a misleading appearance of column-wise dependence. Similarly, row-wise dependence can greatly degrade the accuracy of the usual n × n sample covariance matrix of the columns, as shown by the theorem in Section 4. Various non-permutation normal-theory tests are discussed in Section 5, some promising, but with difficulties seen for all of them. The paper ends in Section 6 with a collection of remarks and details.

2 Permutation Tests of Column-Wise Independence

Simple permutation tests can provide strong evidence against column-wise independence, as we will see for the Cardio data. Our main example concerns the 44 healthy subjects, where X is now an m × n matrix with m = 20426 and n = 44. For convenience we assume that X has been “demeaned” by the subtraction of row and column means, giving

$\sum_i x_{ij} = \sum_j x_{ij} = 0 \qquad \text{for } i = 1, 2, \dots, m \text{ and } j = 1, 2, \dots, n.$ (2.1)

Our numerical results go further and assume “double standardization”: that in addition to (2.1),

$\sum_j x_{ij}^2 = n \quad \text{and} \quad \sum_i x_{ij}^2 = m \qquad \text{for } i = 1, \dots, m \text{ and } j = 1, \dots, n,$ (2.2)

i.e., that each row and column of X has mean 0 and variance 1; see Remark 6.4 in Section 6.
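Double standardization can be computed by alternating row and column standardizations until both (2.1) and (2.2) hold, as in Table 1 of Section 6. A minimal numpy sketch (the iteration count and tolerance are illustrative, not from the paper):

    import numpy as np

    def double_standardize(X, n_iter=10, tol=1e-8):
        """Alternate row and column standardizations until each row and
        column of X has mean 0 and variance 1, conditions (2.1)-(2.2)."""
        X = np.asarray(X, dtype=float).copy()
        for _ in range(n_iter):
            X = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)  # rows
            X = (X - X.mean(0, keepdims=True)) / X.std(0, keepdims=True)  # columns
            if (np.allclose(X.mean(1), 0, atol=tol)
                    and np.allclose(X.std(1), 1, atol=tol)):
                break
        return X

As noted in Remark 6.4, a couple of iterations usually suffice for microarray-sized matrices, though small counterexamples exist where the iteration cannot converge.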

Let Δ^ be the familiar estimate of the n × n covariance matrix between the columns of X,

$\hat{\Delta} = X'X/m.$ (2.3)

Under double standardization, $\hat{\Delta}$ is actually the sample correlation matrix, which we expect to be near the identity matrix $I_n$ under column-wise independence. Also let $v_1$ denote the first eigenvector of $\hat{\Delta}$. The left panel of Figure 2 plots the components of $v_1$ versus array number 1, 2, …, 44. Suppose that the columns of the original expression matrix, before standardization, are independent and identically distributed m-vectors (“i.i.d.”). Then it is easy to see (Remark 6.2 of Section 6) that all orderings of the components of $v_1$ are equally likely. This is not what Figure 2 shows: the components seem to increase from left to right, with a noticeable block of large values for arrays 27–32.

Figure 2.

Left panel: components of the first eigenvector of the column sample correlation matrix $\hat{\Delta}$ for the 44 healthy Cardio subjects, plotted versus array number 1, 2, …, 44; dashes emphasize the block of large components for arrays 27–32. Right panel: first eigenvectors for healthy (solid line) and cancer (dashed) subjects, prostate cancer study, Singh et al. (2002); there was a systematic drift in expression levels as the study progressed.

Let $S(v_1)$ be a statistic that measures structure, for instance a linear regression of $v_1$ versus array index. Comparing $S(v_1)$ with a set of permuted values

$\{S^*_l = S(v^*_l),\ l = 1, 2, \dots, L\},$ (2.4)

$v^*_l$ a random permutation of the components of $v_1$, provides a quick test of the i.i.d. null hypothesis.

Permutation testing was applied to v1 for the Cardio data, using the “block” statistic

$S(v_1) = v_1' B v_1,$ (2.5)

where B is the n × n matrix

$B = \sum_h \beta_h \beta_h'.$ (2.6)

The sum in (2.6) is over all vectors βh of the form

$\beta_h = (0, 0, \dots, 0, 1, 1, \dots, 1, 0, 0, \dots, 0)',$ (2.7)

with the 1s forming blocks of length between 2 and 10 inclusive. A heuristic rationale for block testing appears below; intuitively, microarray experiments are prone to block disturbances because of the way they are developed and read; see Callow et al. (2000). After L = 5000 permutations, only three S* values exceeded the actual value S(v1), p-value .0006, yielding strong evidence against the i.i.d. null hypothesis.
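A sketch of this block permutation test, assuming $v_1$ has already been extracted as the first eigenvector of $\hat{\Delta} = X'X/m$; the block-length range 2 through 10 follows (2.7), and the raw permutation proportion is reported as the p-value:

    import numpy as np

    def block_B(n, min_len=2, max_len=10):
        """B = sum of beta_h beta_h' over all block vectors (2.7)."""
        B = np.zeros((n, n))
        for length in range(min_len, max_len + 1):
            for start in range(n - length + 1):
                beta = np.zeros(n)
                beta[start:start + length] = 1.0
                B += np.outer(beta, beta)
        return B

    def block_permutation_pvalue(v1, L=5000, seed=0):
        """Compare S(v1) = v1' B v1, (2.5), with L random reorderings of v1."""
        rng = np.random.default_rng(seed)
        B = block_B(len(v1))
        S_obs = v1 @ B @ v1
        S_star = np.empty(L)
        for l in range(L):
            v = rng.permutation(v1)
            S_star[l] = v @ B @ v
        return np.mean(S_star >= S_obs)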

The right panel of Figure 2 pertains to a microarray prostate cancer study (Singh et al., 2002) discussed in Efron (2008): m = 6033 genes were measured on each of n = 102 men, 50 healthy controls and 52 prostate cancer patients. The right panel plots first eigenvectors for Δ^, (2.3), computed separately for the healthy controls and the cancer patients (the two matrices being individually doubly standardized). Both vectors increase almost linearly from left to right. Taking S(v1) as the linear regression of v1 versus array number, permutation testing overwhelmingly rejected the i.i.d. null hypothesis, as it also did using the block test. The prostate study appears as a favorable example of microarray technology in Efron (2008). Nevertheless, Figure 2 indicates a systematic drift in the expression level readings as the study progressed. Some genes drift up, others down (the average drift equaling 0 because of standardization), inducing a small amount of column-wise correlation.

Section 5 discusses models for X where the n × n column covariance matrix is of the “single degree of freedom” form

$\Delta = I + \lambda \beta \beta'$ (2.8)

for some known fixed vector β, the null hypothesis of column-wise independence being H0 : λ = 0. An obvious choice of test statistic in this situation is

$S_\beta = \beta'\big(\hat{\Delta} - I\big)\beta,$ (2.9)

a monotone increasing function of $\beta' \hat{\Delta} \beta$. If β is unknown we can replace $S_\beta$ with

$S_B = \sum_{h=1}^H \beta_h' \hat{\Delta} \beta_h = \mathrm{tr}\Big(\hat{\Delta} \sum_h \beta_h \beta_h'\Big) \equiv \mathrm{tr}\big(\hat{\Delta} B\big),$ (2.10)

where $\{\beta_1, \beta_2, \dots, \beta_H\}$ is a catalog of “likely prospects” as in (2.7).

Permutation test statistics such as (2.5) can be motivated from the singular value decomposition (SVD) of X,

$X_{m \times n} = U_{m \times K}\, d_{K \times K}\, V'_{K \times n},$ (2.11)

where K is the rank, d the diagonal matrix of ordered singular values, and U and V orthonormal matrices of sizes m × K and n × K,

$U'U = V'V = I_K,$ (2.12)

IK the K × K identity. The squares of the diagonal elements, say

$e_1 \ge e_2 \ge \dots \ge e_K > 0 \qquad (e_k = d_k^2)$ (2.13)

are the eigenvalues of $X'X = V d^2 V'$.

SB in (2.10) can now be written as

$S_B = \sum_{k=1}^K \frac{e_k}{m}\, \big(v_k' B v_k\big).$ (2.14)

Model (2.8) suggests that most of the information against the null hypothesis $H_0$ of independence lies in the first eigenvector $v_1$, getting us back to the test statistic $S(v_1) = v_1' B v_1$ as in (2.5).
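The equivalence of (2.10) and (2.14) is easy to verify numerically; a small sketch with illustrative dimensions and an arbitrary symmetric B:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 20
    X = rng.standard_normal((m, n))
    W = rng.standard_normal((n, 4))
    B = W @ W.T                                   # any symmetric B will do
    Delta_hat = X.T @ X / m                       # (2.3)
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    e = d**2                                      # eigenvalues (2.13)
    S_direct = np.trace(Delta_hat @ B)            # tr(Delta-hat B), (2.10)
    S_svd = sum(e[k] / m * Vt[k] @ B @ Vt[k] for k in range(len(e)))  # (2.14)
    assert np.isclose(S_direct, S_svd)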

What should the statistician do if column-wise independence is strongly rejected, as in the Cardio example? Use of an empirical null rather than a permutation or theoretical null, N(.03, 1.57²) rather than N(0, 1) in Figure 1, removes the reliance on column-wise independence for hypothesis testing methods such as False Discovery Rates, at the expense of increased variability. Efron (2008) discusses these points.

Two objections can be raised to our permutation tests: (1) they are really testing i.i.d., not independence; (2) non-independence might not manifest itself in the order of $v_1$ (particularly if the order of the microarrays has been shuffled in some unknown way).

Column-wise standardization makes the column distributions more similar, mitigating objection (1). Going further, “quantile standardization” — say replacing each column’s entries by normal scores (Bolstad, Irizarry, Åstrand and Speed, 2003) — makes the marginals exactly the same. The Cardio data was reanalyzed using normal scores, with almost identical results.

Objection (2) is more worrisome from the point of view of statistical power. The order in which the arrays were obtained should be available to the statistician, and should be analyzed to expose possible trends like those in Figure 2². It would be desirable, nevertheless, to have independence tests that do not depend on order — that is, test statistics invariant under column-wise permutations. The remainder of this paper concerns both the possibilities and difficulties in the development of “non-permutation” tests.

3 Row and Column Correlations

There is an interesting relationship between the row and column correlations of the matrix X, which complicates the question of column-wise independence. For the notation of this section define the n × n matrix of sample covariances between the columns of X as

$\widehat{\mathrm{Cov}} = X'X/m,$ (3.1)

called Δ^ in Section 2, and likewise

$\widehat{\mathrm{cov}} = XX'/n$ (3.2)

for the m × m matrix of row-wise sample covariances (having more than 400,000,000 entries in the Cardio example!).

Theorem 1. If X has row and column means 0, (2.1), then the n² entries of $\widehat{\mathrm{Cov}}$ have empirical mean 0 and variance c²,

$c^2 = \sum_{k=1}^K e_k^2 \Big/ (mn)^2,$ (3.3)

with $e_k$ the eigenvalues (2.13), and so do the m² entries of $\widehat{\mathrm{cov}}$.

Proof. The sum of $\widehat{\mathrm{Cov}}$’s entries is

$1_n' X' X 1_n \big/ m = 0,$ (3.4)

according to (2.1), while the mean of squared entries is

$\sum_{j=1}^n \sum_{j'=1}^n \widehat{\mathrm{Cov}}_{jj'}^2 \Big/ n^2 = \mathrm{tr}\big((X'X)^2\big) \big/ m^2 n^2 = \mathrm{tr}\big(V d^4 V'\big) \big/ m^2 n^2 = c^2.$ (3.5)

Replacing X′X with XX′ yields the same results for the row covariances $\widehat{\mathrm{cov}}$.

Under double standardization (2.1)-(2.2), the covariances become sample correlations, say $\widehat{\mathrm{Cor}}$ and $\widehat{\mathrm{cor}}$ for the columns and rows. Theorem 1 has a surprising consequence: whether or not the columns of X are independent, the column sample correlations will have the same mean and variance as the row correlations. In other words, substantial row-wise correlation can induce the appearance of column-wise correlation.
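Theorem 1 is easy to check numerically; a sketch under the demeaning conditions (2.1) alone (dimensions illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 500, 40
    X = rng.standard_normal((m, n))
    X -= X.mean(1, keepdims=True)        # row means 0
    X -= X.mean(0, keepdims=True)        # column means 0 (row means stay 0)
    Cov = X.T @ X / m                    # (3.1)
    cov = X @ X.T / n                    # (3.2)
    e = np.linalg.svd(X, compute_uv=False)**2     # eigenvalues (2.13)
    c2 = np.sum(e**2) / (m * n)**2                # (3.3)
    print(Cov.mean(), cov.mean())                 # both 0
    print((Cov**2).mean(), (cov**2).mean(), c2)   # all three agree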

Figure 3 concerns the 44 healthy subjects in the Cardio study, with X an (m, n) = (20426, 44) doubly standardized matrix. All 44² column correlations are shown by the solid histogram, while the line histogram is a random sample of 10,000 row correlations. Here c² = .283², so according to the theorem both histograms have mean 0 and standard deviation .283.

Figure 3.

Left panel: solid histogram shows the 44² column sample correlations for X the doubly standardized matrix of healthy Cardio subjects; line histogram is a sample of 10,000 of the 20426² row correlations. Right panel: solid histogram shows the column correlations excluding the diagonal 1s; line histogram shows the row correlations corrected for sampling overdispersion.

The 44 diagonal elements of $\widehat{\mathrm{Cor}}$ protrude as a prominent spike at 1. (We can’t see the spike of 20426 diagonal elements for the row correlation matrix $\widehat{\mathrm{cor}}$ because they form such a small fraction of all 20426².) It is easy to remove the diagonal 1s from consideration.

Corollary. In the doubly standardized situation, the off-diagonal elements of the column correlation matrix $\widehat{\mathrm{Cor}}$ have empirical mean and variance

$\hat{\mu} = -\frac{1}{n-1} \qquad \text{and} \qquad \hat{\alpha}^2 = \frac{n}{n-1}\left(c^2 - \frac{1}{n-1}\right).$ (3.6)

For n = 44 and c² = .283² this gives

$(\hat{\mu}, \hat{\alpha}^2) = (-.023,\ .241^2).$ (3.7)

The corresponding diagonal-removing corrections for the row correlations (replacing n by m in (3.6)) are negligible for m = 20426. However c² overestimates the variance of the row correlations for another reason: with only 44 points available to estimate each correlation, estimation error adds a considerable component of variance to the $\widehat{\mathrm{cor}}$ histogram in the left panel, as discussed next.

Suppose now that the columns of X are in fact independent, in which case the substantial column correlations seen in Figure 3 must actually be induced by row correlations, via Theorem 1. Let $\mathrm{cor}_{ii'}$ indicate the true correlation between rows i and i′ (that is, between $X_{ij}$ and $X_{i'j}$), and define the total correlation α to be the root mean square of the $\mathrm{cor}_{ii'}$ values,

$\alpha^2 = \sum_{i < i'} \mathrm{cor}_{ii'}^2 \Big/ \binom{m}{2}.$ (3.8)

Remark 6.5 of Section 6 shows that $\hat{\alpha}^2$ in (3.6) is an approximately unbiased estimate of α², assuming column-wise independence. For the Cardio example $\hat{\alpha} = .241$, similar in size to the microarray correlation estimates in Efron (2007a), Owen (2005), and Qiu et al. (2005a). Section 4 discusses the crucial role of α in determining the accuracy of estimates based on X.
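The chain from the eigenvalues to $\hat{\alpha}$ can be coded directly; a sketch assuming X has already been doubly standardized:

    import numpy as np

    def total_correlation_hat(X):
        """alpha-hat via (3.3) and (3.6): c^2 from the eigenvalues, then
        the diagonal-removing correction for the column correlations."""
        m, n = X.shape
        e = np.linalg.svd(X, compute_uv=False)**2     # eigenvalues (2.13)
        c2 = np.sum(e**2) / (m * n)**2                # (3.3)
        alpha2 = n / (n - 1) * (c2 - 1 / (n - 1))     # (3.6)
        return np.sqrt(max(alpha2, 0.0))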

The right panel of Figure 3 compares the histogram of the column correlations $\widehat{\mathrm{Cor}}_{jj'}$, now excluding the cases j = j′, with the row correlation histogram corrected for sampling overdispersion via the shrinkage factor .241/.283. As predicted by Theorem 1, the similarity is striking. A possible difference lies in the long right tail of the $\widehat{\mathrm{Cor}}$ distribution (including $\widehat{\mathrm{Cor}}_{31,32}$, the case illustrated in Figure 1), whose significance is examined in Section 4.

4 Normal Theory

The results of Sections 2 and 3 were developed nonparametrically. This section concerns multivariate normal theory, afterwards used in Section 5 to draw the connection with classical multivariate independence tests. We consider the matrix normal distribution for X,

$X \sim N_{m,n}\big(0,\ \Sigma_{m \times m} \otimes \Delta_{n \times n}\big),$ (4.1)

where the Kronecker notation indicates covariance structure

$\mathrm{cov}\big(X_{ij}, X_{i'j'}\big) = \Sigma_{ii'}\, \Delta_{jj'}.$ (4.2)

Row $x_i$ of X has covariance matrix proportional to Δ,

$x_i \sim N_n\big(0, \Sigma_{ii} \Delta\big)$ (4.3)

(not independently across rows unless Σ is diagonal), and likewise for column $x_j$, $x_j \sim N_m(0, \Delta_{jj} \Sigma)$. As in (2.1), we take all means equal to 0.

Much of classical multivariate analysis focuses on the situation Σ = I, where the rows $x_i$ are independent replicates³,

$x_i \overset{\text{iid}}{\sim} N_n(0, \Delta), \qquad i = 1, 2, \dots, m,$ (4.4)

in which case the sample covariance matrix $\hat{\Delta} = X'X/m$ has a scaled Wishart distribution,

$\hat{\Delta} \sim \mathrm{Wishart}(m, \Delta)\big/m.$ (4.5)

Distribution (4.5) has first and second moments

$\hat{\Delta}_{n \times n} \sim \Big( \Delta_{n \times n},\ \Delta^{(2)}_{n^2 \times n^2}\big/m \Big) \qquad \text{with} \qquad \Delta^{(2)}_{jk,lh} = \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl}$ (4.6)

for j, k, l, h = 1, 2, … , n; see Mardia, Kent and Bibby (1979, p. 92).

Relation (4.6) says that when Σ = I, that is, when the rows of X are independent, $\hat{\Delta}$ unbiasedly estimates the row covariance matrix Δ with accuracy proportional to $m^{-1/2}$. Correlation between rows reduces the accuracy of $\hat{\Delta}$, as shown next.

Returning to the general situation (4.1)-(4.3), define

$\tilde{\Delta} = X' \sigma^{-2} X \big/ m,$ (4.7)

where σ is the m × m diagonal matrix with diagonal entries $\sigma_{ii} = \Sigma_{ii}^{1/2}$.

Theorem 2. Under model (4.1), $\tilde{\Delta}$ has first and second moments

$\tilde{\Delta} \sim \Big( \Delta,\ \Delta^{(2)}\big/\tilde{m} \Big), \qquad \tilde{m} = \frac{m}{1 + (m-1)\alpha^2},$ (4.8)

where α is the total correlation as in (3.8),

$\alpha^2 = \sum_{i \ne i'} \mathrm{cor}_{ii'}^2 \Big/ m(m-1), \qquad \mathrm{cor}_{ii'} = \Sigma_{ii'} \big/ \big( \Sigma_{ii} \Sigma_{i'i'} \big)^{1/2},$ (4.9)

and $\Delta^{(2)}$ is the Wishart covariance (4.6).

Comparing (4.8) with (4.6), we see that correlation between the rows reduces the “effective sample size” from m to $\tilde{m}$: for α = .241 as in (3.7), the reduction is from m = 20426 to $\tilde{m}$ = 17.2! (Notice that row standardization effectively makes $\tilde{\Delta} \doteq \hat{\Delta}$, (2.3), justifying the comparison.) The total correlation α shows up in other efficiency calculations; see Remark 6.7.
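The effective sample size is a one-line calculation; with the Cardio values it reproduces $\tilde{m}$ = 17.2:

    def m_tilde(m, alpha):
        """Effective sample size m-tilde = m / (1 + (m - 1) alpha^2), (4.8)."""
        return m / (1 + (m - 1) * alpha**2)

    print(m_tilde(20426, 0.241))   # about 17.2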

Proof. The row-standardized matrix $\tilde{X} = \sigma^{-1} X$ has matrix normal distribution

$\tilde{X} \sim N_{m,n}\big(0, \tilde{\Sigma} \otimes \Delta\big),$ (4.10)

where $\tilde{\Sigma} = \sigma^{-1} \Sigma \sigma^{-1}$ has diagonal elements $\tilde{\Sigma}_{ii} = 1$. From (4.2) we see that $\tilde{\Sigma}_{ii'}$ is the correlation between elements $X_{ij}$ and $X_{i'j}$ in the same column of X. $\tilde{\Delta} = \tilde{X}'\tilde{X}/m$ has entries $\tilde{\Delta}_{jk} = \sum_i \tilde{X}_{ij}\tilde{X}_{ik}/m$ and is unbiased for Δ,

$E\{\tilde{\Delta}_{jk}\} = \Delta_{jk},$ (4.11)

using (4.2).

The covariance calculation for $\tilde{\Delta}$ involves the expansion

$\tilde{\Delta}_{jk}\tilde{\Delta}_{lh} = \Big( \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\big/m \Big)\Big( \sum_{i'} \tilde{X}_{i'l}\tilde{X}_{i'h}\big/m \Big)$ (4.12)
$\qquad = \frac{1}{m^2}\Big( \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{il}\tilde{X}_{ih} + \sum_{i \ne i'} \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{i'l}\tilde{X}_{i'h} \Big).$ (4.13)

Using the formula

$E\{Z_1 Z_2 Z_3 Z_4\} = \gamma_{12}\gamma_{34} + \gamma_{13}\gamma_{24} + \gamma_{14}\gamma_{23}$ (4.14)

for a normal vector $(Z_1, Z_2, Z_3, Z_4)$ with means 0 and covariances $\gamma_{ij}$, (4.2) gives

$E\Big\{ \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{il}\tilde{X}_{ih} \Big\} = m\big[ \Delta_{jk}\Delta_{lh} + \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big]$ (4.15)

and

$E\Big\{ \sum_{i \ne i'} \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{i'l}\tilde{X}_{i'h} \Big\} = m(m-1)\Big[ \Delta_{jk}\Delta_{lh} + \alpha^2\big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big) \Big].$ (4.16)

Then (4.13) yields

$E\{\tilde{\Delta}_{jk}\tilde{\Delta}_{lh}\} = \Delta_{jk}\Delta_{lh} + \big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big)\,\frac{1 + (m-1)\alpha^2}{m},$ (4.17)

giving

$\mathrm{cov}\big(\tilde{\Delta}_{jk}, \tilde{\Delta}_{lh}\big) = \big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big)\big/\tilde{m},$ (4.18)

as in (4.8).

A corollary of Theorem 2, used in Section 5, concerns bilinear functions of Δ and $\tilde{\Delta}$,

$\tau^2 = w' \Delta w \qquad \text{and} \qquad \tilde{\tau}^2 = w' \tilde{\Delta} w,$ (4.19)

where w is a given n-vector.

Corollary. Under model (4.1), $\tilde{\tau}^2$ has mean and variance

$\tilde{\tau}^2 \sim \big( \tau^2,\ 2\tau^4\big/\tilde{m} \big).$ (4.20)

The proof follows that for Theorem 2; see Remark 6.9.

If Σ = I in (4.1), then $\tilde{\Delta} = \hat{\Delta}$ and $\tilde{\tau}^2$ has a scaled chi-squared distribution,

$\tilde{\tau}^2 \sim \tau^2 \chi^2_m \big/ m,$ (4.21)

with mean and variance $\tilde{\tau}^2 \sim (\tau^2,\ 2\tau^4/m)$, so again the effect of correlation within Σ is to reduce the effective sample size from m to $\tilde{m}$, (4.8).

We can approximate $\tilde{\Delta}$, (4.7), with

$\hat{\Delta} = X' \hat{\sigma}^{-2} X \big/ m,$ (4.22)

where $\hat{\sigma}_{ii}^2$ is an estimate of $\Sigma_{ii}$ based on the observed variability in row i. If the rows of X have been standardized, then $\hat{\sigma}_{ii}^2 = 1$ and $\hat{\Delta}$ returns to its original definition X′X/m.

Both Theorem 2 and the corollary encourage us to think of $\hat{\Delta}$ as following, approximately, a scaled Wishart distribution based on an independent sample of size $\tilde{m}$,

$\hat{\Delta} \overset{\cdot}{\sim} \mathrm{Wishart}(\tilde{m}, \Delta)\big/\tilde{m}.$ (4.23)

The dangers of this approximation are discussed in Section 5, but it is, nevertheless, an evocative heuristic, as shown below.

Figure 4 returns to the question of the seemingly overwhelming correlation .805 between arrays 31 and 32 seen in Figure 1. A one-sided p-value was calculated for each of the 946 column correlations, using as null hypothesis the normal-theory correlation coefficient distribution based on a sample of $\tilde{m}$ = 17.2 pairs of $N_2(0, I)$ points (the correct null if Δ = I in (4.23)). Benjamini and Hochberg’s (1995) False Discovery Rate procedure, level q = .1, was applied to the 946 p-values. This yielded 7 significant cases, those with sample correlation ≥ .723; all 7 were from the block of arrays 27 to 32 indicated in Figure 2. Correlation .805 does turn out to be significant, but by a much closer margin than Figure 1’s scattergram suggests.
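A sketch of the Figure 4 calculation. Under the null of $\tilde{m}$ independent $N_2(0, I)$ pairs, $r[(\nu-2)/(1-r^2)]^{1/2}$ follows a t distribution with $\nu - 2$ degrees of freedom, here with the fractional value $\nu = \tilde{m}$ = 17.2; the Remark 6.6 recentering is omitted for simplicity:

    import numpy as np
    from scipy.stats import t as t_dist

    def fdr_significant(r, m_tilde, q=0.1):
        """One-sided p-values for correlations r under the m_tilde-pair
        null, then the Benjamini-Hochberg step-up rule at level q."""
        nu = m_tilde - 2                        # fractional df are fine here
        p = t_dist.sf(r * np.sqrt(nu / (1 - r**2)), df=nu)
        order = np.argsort(p)
        k = np.arange(1, len(p) + 1)
        passed = p[order] <= q * k / len(p)
        n_sig = passed.nonzero()[0].max() + 1 if passed.any() else 0
        return order[:n_sig]                    # indices of significant cases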

Figure 4.

Dashed curve is the normal-theory null density for a correlation coefficient from $\tilde{m}$ = 17.2 pairs of points; see Remark 6.6. Histogram shows the 946 column correlations, right panel of Figure 3. The FDR test, q = .1, yielded 7 significant correlations, $\widehat{\mathrm{Cor}} \ge .723$, including .805 between arrays 31 and 32, Figure 1.

The Fdr procedure was also applied using the simpler null distribution N(−.023, .241²), (3.7). This raised the significance threshold from .723 to .780, removing two of the previously significant correlations.

Theorem 1 showed that the variance of the observed column correlations is useless for testing column-wise independence, since any value at all can be induced by row correlations. The test in Figure 4 avoids this trap by looking for unusual outliers among the column correlations. It does not depend on the order of the columns, objection (2) in Section 2 for permutation tests, but pays the price of increased modeling assumptions.

5 Other Test Statistics

Theorem 2 offers a normal-theory strategy for testing column-wise independence. We begin with $X \sim N_{m,n}(0, \Sigma \otimes \Delta)$, (4.1), taking

$\Sigma_{ii} = 1 \text{ for } i = 1, \dots, m \qquad \text{and} \qquad \Delta_{jj} = 1 \text{ for } j = 1, \dots, n,$ (5.1)

as suggested by double standardization. The null hypothesis of column-wise independence is equivalent to the column correlation matrix equaling the identity,

$H_0: \Delta = I,$ (5.2)

since then (4.2) says that all pairs of entries in different columns are independent.

To test (5.2), we estimate Δ with $\hat{\Delta}$, (4.22), or more simply $\hat{\Delta} = X'X/m$ after standardization, and compute a test statistic

$S = s(\hat{\Delta}),$ (5.3)

where s(·) is some measure of distance between $\hat{\Delta}$ and I. The accuracy approximation $\hat{\Delta} \overset{\cdot}{\sim} (\Delta, \Delta^{(2)}/\tilde{m})$ from (4.8), with Δ = I, is used to assess the significance level of the observed S, maybe even employing the more daring approximation $\hat{\Delta} \overset{\cdot}{\sim} \mathrm{Wishart}(\tilde{m}, I)/\tilde{m}$. Strategy (5.3) looks promising but, as the examples of this section will show, it suffers from serious difficulties that are absent under the classic assumption of independent rows.

One of the difficulties stems from Theorem 1. An obvious test statistic for H0 : Δ = I is

$S = \sum_{j<j'} \hat{\Delta}_{jj'}^2 \Big/ \binom{n}{2},$ (5.4)

the average squared off-diagonal element of $\hat{\Delta}$. But $\hat{\Delta} = \widehat{\mathrm{Cov}}$, (3.1), so in the doubly standardized situation of (3.6), S is a monotone increasing function of $\hat{\alpha}$, the estimated total correlation. This disqualifies S as a test statistic for (5.2), since large values of $\hat{\alpha}$ can always be attributed to row-wise correlation alone.

Similarly, the variance of the eigenvalues (2.13),

$S = \sum_{k=1}^K \big(e_k - \bar{e}\big)^2 \Big/ K \qquad \Big( \bar{e} = \sum e_k \big/ K \Big),$ (5.5)

looks appealing, since the true eigenvalues all equal 1 when Δ = I. However (5.5) is also a monotone function of $\hat{\alpha}$; see Remark 6.1.

The general difficulty here is “leakage,” the fact that row-wise correlations affect the observed pattern of column-wise correlations. This becomes clearer by comparison with classical multivariate methods, where row-wise correlations are assumed away by taking Σ = I in (4.1). Johnson and Graybill (1972) consider a two-way ANOVA problem where, after subtraction of main effects, X has the form

$X_{ij} = a_i \beta_j + \epsilon_{ij} \qquad \text{for } i = 1, 2, \dots, m \text{ and } j = 1, 2, \dots, n,$ (5.6)

ai ~ N(0, λ) and εij ~ N(0, 1), all independently, with β = (β1, β2, …, βn) a fixed but unknown vector (representing “one degree of freedom for nonadditivity” in the two-way table X, Johnson and Graybill’s extension of Tukey’s procedure).

In the Kronecker notation (4.1), $X \sim N_{m,n}(0, I \otimes \Delta)$ with

$\Delta = I + \lambda \beta \beta'.$ (5.7)

Now (5.2) becomes H0 : λ = 0. Johnson and Graybill show that, with β unknown, the likelihood ratio test rejects H0 for large values of the eigenvalue ratio (2.13),

$S = e_1 \Big/ \sum_{k=1}^K e_k.$ (5.8)

Since the m rows of X are assumed independent, they can test $H_0$ by comparing S with values $S^* = e_1^* \big/ \sum_{k=1}^K e_k^*$ obtained from

$\hat{\Delta}^* \sim \mathrm{Wishart}(m, I)\big/m,$ (5.9)

as in (4.5).

Getting back to the correlated-rows situation, Theorem 2 suggests comparing S with values S* from

$\hat{\Delta}^* \sim \mathrm{Wishart}(\tilde{m}, I)\big/\tilde{m},$ (5.10)

$\tilde{m}$ as in (4.8). The solid histogram in Figure 5 compares 100 S* values from (5.10), $\tilde{m}$ = 17.2 for the Cardio data, with the observed value S = .207 from the doubly standardized Cardio matrix for the healthy subjects used in Figure 3. All 100 S* values are much smaller than S, providing strong evidence against $H_0$ : Δ = I.
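A sketch of the null simulation (5.10); since a Wishart draw with fractional degrees of freedom is awkward, $\tilde{m}$ is rounded to the nearest integer here and $\hat{\Delta}^*$ formed directly as Z'Z/m₀ from m₀ independent N(0, $I_n$) rows:

    import numpy as np

    def eigenratio(Delta):
        """S = e_1 / sum(e_k), (5.8)."""
        e = np.linalg.eigvalsh(Delta)
        return e.max() / e.sum()

    def null_eigenratios(m_tilde, n, n_sim=100, seed=0):
        """Simulate S* from Delta* ~ Wishart(m0, I)/m0, m0 = round(m_tilde)."""
        rng = np.random.default_rng(seed)
        m0 = int(round(m_tilde))
        S_star = np.empty(n_sim)
        for i in range(n_sim):
            Z = rng.standard_normal((m0, n))
            S_star[i] = eigenratio(Z.T @ Z / m0)
        return S_star

    S_star = null_eigenratios(17.2, 44)   # compare with the observed S = .207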

Figure 5.

Eigenratio statistic (5.8) equals .207 for the 20426 × 44 Cardio matrix X; solid histogram shows 100 simulations S* from the Wishart (5.10), $\tilde{m}$ = 17.2; line histogram shows 100 simulations from correlated-row X* matrices (5.11), α = .241, Δ = I.

The evidence looks somewhat weaker, though, if we simulate S* values with $\hat{\Delta}^*$ obtained from random matrices

$X^* \sim N_{m,n}\big(0, \Sigma^* \otimes I_n\big),$ (5.11)

doubly standardized, where Σ* has total correlation α = .241, the estimated value for X, (4.9). The line histogram in Figure 5 shows 100 such S* values, all still smaller than S, but substantially less so. (Remark 6.8 describes the construction of X*.)

Why does (5.11) produce larger “null” S* values than (5.10)? The answer is simple: even though the first and second moments of $\hat{\Delta}^* = X^{*\prime}X^*/m$ match those of $\hat{\Delta}^*$ from (5.10), its eigenvalues do not. The non-zero eigenvalues of $X^{*\prime}X^*/m$ equal those of $X^*X^{*\prime}/m$, which depends on Σ*. This is another example of leakage, where the fact that Σ* in (5.11) is not the identity $I_m$ distorts the estimated eigenvalues of $\hat{\Delta}$ even if Δ = $I_n$.

The eigenratio statistic $S = e_1/\sum e_k$ is invariant under permutations of the columns of X, answering objection (2) to permutation testing of Section 2. Because of invariance, the eigenratio and permutation tests provide independent p-values for testing the null hypothesis of i.i.d. columns, and so can be employed together. Figure 5 is disturbing nonetheless, in suggesting that an appropriate null distribution for S depends considerably on the choice of the nuisance parameter Σ* in (5.11).

The bilinear form (4.19)-(4.20) yields another class of test statistics,

$\hat{\tau}^2 = w' \hat{\Delta} w \overset{\cdot}{\sim} \big( \tau^2,\ 2\tau^4\big/\tilde{m} \big),$ (5.12)

where w is a pre-chosen n-vector and $\tau^2 = w' \Delta w$. Delta-method arguments give $\mathrm{CV}(\hat{\tau}) \doteq (2\tilde{m})^{-1/2}$ for the coefficient of variation of $\hat{\tau}$. Defining

$Z_i = x_i' w \qquad \big(x_i \text{ the } i\text{th row of } X\big),$ (5.13)

yields the alternative form

$\hat{\tau}^2 = \sum_{i=1}^m Z_i^2 \Big/ m.$ (5.14)

In a two-sample situation like that of the Cardio study, sample sizes $n_1$ and $n_2$, we can choose

$w = \left( \frac{n_1 n_2}{n_1 + n_2} \right)^{1/2} \left( -\frac{1_{n_1}'}{n_1},\ \frac{1_{n_2}'}{n_2} \right)',$ (5.15)

“$1_n$” indicating a vector of n 1s. This choice makes

$Z_i = \left( \frac{n_1 n_2}{n_1 + n_2} \right)^{1/2} \big( \bar{x}_{2i} - \bar{x}_{1i} \big),$ (5.16)

the multiple of the mean response difference between the two samples that has variance 1 if Δ = I. In terms of (5.12), $\|w\|^2 = 1$ so $\tau^2 = 1$.

For the Cardio study, with $n_1$ = 44, $n_2$ = 19, and $\tilde{m}$ = 17.2, we obtain $\hat{\tau}$ = 1.48, coefficient of variation 0.17. This puts $\hat{\tau}$ more than 2.8 standard errors above the null hypothesis value τ = 1, again providing evidence against column-wise independence. The $Z_i$ values from (5.16) are nearly indistinguishable from the $z_i$ values in Figure 1, not surprisingly, since with the rows of X standardized, $Z_i$ is an equivalent form of the two-sample t-statistic $t_i$ in (1.1).
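A sketch of the $\hat{\tau}$ computation (5.12)-(5.16), assuming X is row standardized with the $n_1$ group-1 columns first (the sign convention in w does not affect $\hat{\tau}$):

    import numpy as np

    def tau_hat(X, n1, n2):
        """tau-hat from (5.14): w is the two-sample contrast (5.15) and
        Z_i = x_i'w, (5.13), one value per gene."""
        m = X.shape[0]
        w = np.sqrt(n1 * n2 / (n1 + n2)) * np.concatenate(
            [-np.ones(n1) / n1, np.ones(n2) / n2])
        Z = X @ w
        return np.sqrt(np.sum(Z**2) / m)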

Once again, however, there are difficulties with this as a test for column-wise independence. There is no question that the $Z_i$’s are overdispersed compared to the theoretical value τ = 1. But problems other than column dependence can cause overdispersion, in particular unobserved covariate differences between subjects in the two samples (Efron, 2004, 2008).

The statistic $S = w' \hat{\Delta} w$, with w as in (5.15), does not depend upon the order of the columns of X within each of the two samples, answering objection (2) against permutation tests, but it is the only such choice for a two-sample situation. Other w’s might yield interesting results. The version of (5.15) comparing the first 22 healthy Cardio subjects with the second 22 provided the spectacular value $\hat{\tau}$ = 1.87, and here the “unobserved covariate” objection has less force.

Now, however, the test statistic depends on the order of the columns within the healthy subjects’ matrix, reviving objection (2). Again we might want to check a catalog of possible w vectors $w_1, w_2, \dots, w_H$, leading back to the test statistic

$S_B = \sum_h w_h' \hat{\Delta} w_h = \mathrm{tr}\big(\hat{\Delta} B\big) \qquad \Big( B = \sum_h w_h w_h' \Big)$ (5.17)

as in (2.10), the only difference being that the null distribution of $\hat{\Delta}$ now involves normal theory rather than permutations. Remark 6.9 shows that the null first and second moments of $S_B$ are similar to (5.12),

$S_B \overset{H_0}{\sim} \Big( \mathrm{tr}(B),\ \frac{2}{\tilde{m}}\, \mathrm{tr}\big(B^2\big) \Big).$ (5.18)

In summary, normal-theory methods are interesting and promising, but are not yet proven competitors for the permutation tests of Section 2.

6 Remarks

This section presents some brief remarks and details supplementing the previous material.

Remark 6.1. The constant c². The variance constant c² in Theorem 1, (3.3), can be expressed as

$c^2 = \frac{K}{(mn)^2}\left[ \bar{e}^2 + \frac{1}{K}\sum_{k=1}^K \big(e_k - \bar{e}\big)^2 \right] \qquad \Big( \bar{e} = \frac{1}{K}\sum e_k \Big),$ (6.1)

so that $c^2 \ge K(\bar{e}/mn)^2$, with equality only if the eigenvalues $e_k$ are all equal. In the doubly standardized case $\bar{e} = mn/K$, giving

$c^2 \ge \frac{1}{K},$ (6.2)

where K is the rank of X.

Remark 6.2. Permutation invariance. If the columns of X are i.i.d. observations from a distribution on $R^m$, then the distribution of X is invariant under permutations: $X\pi \sim X$ for any n × n permutation matrix π. Now suppose $\tilde{X} = L(X)$, where L performs the same operation on each column of X, for example replacing each column by its normal-scores vector. Then

$\tilde{X}\pi = L(X)\pi = L(X\pi) \sim L(X) = \tilde{X},$ (6.3)

showing that X~ is permutation invariant.

Similarly, suppose $\tilde{X} = R(X)$, performing the same operation $\tilde{x}_i = r(x_i)$ on each row of X, where now we require $r(x)\pi = r(x\pi)$ for all n-vectors x. The same argument as (6.3) demonstrates that $\tilde{X}$ is still permutation invariant. Iterating row and column standardizations as in Table 1 then shows that if the original data matrix X is permutation invariant, so is its doubly standardized version.

Table 1.

Successive row and column standardizations of the 20426 × 44 matrix of healthy Cardio subjects. “Col”: empirical standard deviation of $\widehat{\mathrm{Cor}}_{jj'}$, j < j′; “Eig”: $\hat{\alpha}$ from (3.6); “Row”: from a 1% sample of $\widehat{\mathrm{cor}}_{ii'}$ values, adjusted for overdispersion (6.6), sampling standard error ±.0034.

                 Col     Row     Eig                   Col     Row     Eig
    demeaned    0.252   0.286   0.000    demeaned     0.252   0.286   0.000
    col         0.252   0.249   0.251    row          0.241   0.283   0.279
    row         0.242   0.255   0.246    col          0.241   0.251   0.240
    col         0.242   0.241   0.242    row          0.240   0.247   0.241
    row         0.241   0.246   0.235    col          0.240   0.247   0.240
    col         0.241   0.244   0.241    row          0.241   0.240   0.235
    row         0.241   0.245   0.234    col          0.241   0.237   0.240
    col         0.241   0.238   0.241    row          0.241   0.233   0.233

Remark 6.3. Covariances after demeaning. Suppose that X is normally distributed, with covariances Σ ⊗ Δ, (4.2), all columns having the same expectation vector μ. Let $\tilde{X}$ be the demeaned matrix obtained by subtracting all the row and column means of X. Then

$\tilde{X} \sim N_{m,n}\big(0, \tilde{\Sigma} \otimes \tilde{\Delta}\big),$ (6.4)

where

$\tilde{\Delta}_{jj'} = \Delta_{jj'} - \Delta_{\cdot j'} - \Delta_{j \cdot} + \Delta_{\cdot\cdot},$ (6.5)

dots indicating averaging over the missing subscripts, and similarly for $\tilde{\Sigma}$. This shows that demeaning tends to reduce covariances by recentering them around 0.

Remark 6.4. Standardization. A matrix X is “column standardized” by individually subtracting the mean and dividing by the standard deviation of each column, and similarly for row standardization. Table 1 shows the effect of successive row and column standardizations on the 20426 × 44 demeaned matrix of healthy Cardio subjects. Here “Col” is the empirical standard deviation of the 946 column-wise correlations $\widehat{\mathrm{Cor}}_{jj'}$, j < j′; “Eig” is $\hat{\alpha}$ in (3.6); and “Row” is the empirical standard deviation $\hat{\beta}$ of a 1% sample of the row correlations $\widehat{\mathrm{cor}}_{ii'}$, but adjusted for overdispersion,

$\mathrm{Row}^2 = \frac{n}{n-1}\left( \hat{\beta}^2 - \frac{1}{n-1} \right).$ (6.6)

Sampling error of the Row entries is about ±.0034.

The doubly standardized matrix X used for Figure 3 was obtained after five successive column-row standardizations. This was excessive; the Figure looked almost the same after two iterations. Other microarray examples converged equally rapidly, though small counterexamples can be constructed where double standardization isn’t possible.

Microarray analyses usually begin with some form of column-wise standardization (Bolstad et al., 2003; Qiu, Klebanov and Yakovlev, 2005b), designed to negate “brightness” differences between the n microarrays. In the same spirit, row standardization helps prevent incidental gene differences (for example, very great or very small expression level variabilities) from obscuring the actual effects of interest. Standardization tends to reduce the apparent correlations as in Remark 6.3. Without standardization, the scatterplot in Figure 1 stretches out along the main diagonal, correlation .917, driven by genes with unusually large or small inherent expression levels.

Remark 6.5. Corrected estimates of the total correlation. Suppose that the true row correlations $\mathrm{cor}_{ii'}$ have mean 0 and variance α², as in (3.8) with $\overline{\mathrm{cor}} = 0$, and that given $\mathrm{cor}_{ii'}$, the usual estimate $\widehat{\mathrm{cor}}_{ii'}$ has mean and variance

$\widehat{\mathrm{cor}}_{ii'} \overset{\cdot}{\sim} \left[ \mathrm{cor}_{ii'},\ \frac{\big(1 - \mathrm{cor}_{ii'}^2\big)^2}{n-3} \right],$ (6.7)

(6.7) being a good normal-theory approximation (Johnson and Kotz, 1970, Chap. 32). Letting $\bar{\alpha}^2$ be the empirical variance of the $\widehat{\mathrm{cor}}_{ii'}$ values, a standard empirical Bayes derivation yields

$\hat{\alpha}^2 = A_2 - \frac{3}{n-5}\, A_4 \qquad \left[ A_2 = \frac{(n-3)\bar{\alpha}^2 - 1}{n-5} \right]$ (6.8)

as an approximately unbiased estimate of α². (If $\overline{\mathrm{cor}}$ is not assumed to equal 0, a slightly more complicated formula applies.) Of course $\hat{\alpha}^2$ is set to 0 if the right side of (6.8) is negative.

Theorem 1 implies that $\bar{\alpha}^2$ nearly equals c², (3.3), in the doubly standardized situation. Formula (3.6), with say

$\tilde{\alpha}^2 = \frac{n}{n-1}\left( \bar{\alpha}^2 - \frac{1}{n-1} \right),$ (6.9)

is not identical to (6.8), but provides an excellent approximation for values of α ≤ .5: with n = 44 and $\bar{\alpha}$ = .283 as in (3.6), $\hat{\alpha}$ = .2415 while $\tilde{\alpha}$ = .2412.

Remark 6.6. Column and row centerings. The column correlation mean $\hat{\mu} = -1/(n-1)$ in (3.6) is forced by the row-wise demeaning $\sum_j x_{ij} = 0$, (2.1), centering the solid histogram in the right panel of Figure 3 at −.023. With m = 20426, the corresponding center for the line histogram is nearly 0, and the difference in the two centerings is noticeable. The dashed density curve in Figure 4, and the corresponding p-values for the FDR analysis, were shifted .023 units leftwards.

Remark 6.7. The total correlation α. The total correlation α, which plays a key role in Theorem 2, (4.9), is also the central parameter of the theory developed in Efron (2007a). Equations (3.15)-(3.16) there are equivalent to (5.12) here. In both papers, α has the very convenient feature of summarizing the effects of an enormous m × m correlation matrix Σ in a single number.

Remark 6.8. Σ* for simulation (5.11). The X* simulation used in Figure 5 began with the m × n matrix Y = ($y_{ij}$),

$y_{ij} = c_{Ij} + e_{ij} \qquad \big\{ e_{ij} \sim N(0,1),\ c_{Ij} \sim N(0, \gamma^2) \big\} \quad \text{(all independent)},$ (6.10)

where I = 1, 2, 3, 4, 5 as i is in the first, second, …, last fifth of 1 through m; Y was then column standardized to give X*, so that Σ* had a block form, with large positive correlations (about 0.61) in the (m/5) × (m/5) diagonal blocks. The choice γ = 1.23 was required to yield α = .241.
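A sketch of construction (6.10), with γ = 1.23 and the five row blocks as described; only column standardization is applied here, as in the remark:

    import numpy as np

    def simulate_Xstar(m, n, gamma=1.23, n_blocks=5, seed=0):
        """Y_ij = c_Ij + e_ij, (6.10): rows in the same fifth of 1..m share
        a common N(0, gamma^2) column effect, giving Sigma* a block form."""
        rng = np.random.default_rng(seed)
        block = (np.arange(m) * n_blocks) // m          # fifth containing row i
        c = gamma * rng.standard_normal((n_blocks, n))  # c_Ij
        Y = c[block] + rng.standard_normal((m, n))      # e_ij
        return (Y - Y.mean(0)) / Y.std(0)               # column standardized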

Remark 6.9. Bilinear statistics. Since $\tilde{\Delta} \sim (\Delta, \Delta^{(2)}/\tilde{m})$, (4.8), it is clear that $E\{\tilde{\tau}^2\} = \tau^2$ in corollary (4.20). The variance calculation proceeds as in Theorem 2:

$\mathrm{var}\{\tilde{\tau}^2\} = \sum_{jklh} \Delta^{(2)}_{jk,lh} w_j w_k w_l w_h \big/ \tilde{m} = \sum_{jklh} \big[ \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big] w_j w_k w_l w_h \big/ \tilde{m}$
$\qquad = \Big[ \sum_{jl}\sum_{kh} (\Delta_{jl} w_j w_l)(\Delta_{kh} w_k w_h) + \sum_{jh}\sum_{kl} (\Delta_{jh} w_j w_h)(\Delta_{kl} w_k w_l) \Big] \big/ \tilde{m} = 2\Big( \sum_{jk} \Delta_{jk} w_j w_k \Big)^2 \big/ \tilde{m} = 2\tau^4\big/\tilde{m}.$ (6.11)

The verification of (5.18) is the same, except with element $b_{jk}$ of B replacing $w_j w_k$ above, $b_{lh}$ replacing $w_l w_h$, etc.

Footnotes

1. The entries of X are log(red/green) ratios obtained from oligonucleotide arrays.

2. The referee points out that when Affymetrix CEL files are available, array run dates will usually be found in the DatHeader lines.

3. Most multivariate texts reverse the situation, taking the columns as independent replicas of possibly correlated rows.

References

  1. Anderson TW. An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, New York; 2003.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B. 1995;57:289–300.
  3. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. Comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi:10.1093/bioinformatics/19.2.185. Available at http://web.mit.edu/biomicro/education/RMA.pdf.
  4. Callow M, Dudoit S, Gong E, Speed T, Rubin E. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research. 2000;10:2022–2029. doi:10.1101/gr.10.12.2022.
  5. Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 2004;99:96–104.
  6. Efron B. Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 2007a;102:93–103.
  7. Efron B. Size, power, and false discovery rates. Ann. Statist. 2007b;35:1351–1377.
  8. Efron B. Microarrays, empirical Bayes, and the two-groups model (with discussion and rejoinder). Statist. Sci. 2008;23:1–47.
  9. Johnson DE, Graybill FA. An analysis of a two-way model with interaction and no replication. J. Amer. Statist. Assoc. 1972;67:862–868.
  10. Johnson NL, Kotz S. Continuous Univariate Distributions–1. Houghton Mifflin, Boston; 1970.
  11. Mardia K, Kent J, Bibby J. Multivariate Analysis. Academic Press, London; 1979.
  12. Owen AB. Variance of the number of false discoveries. J. Roy. Statist. Soc. Ser. B. 2005;67:411–426.
  13. Qiu X, Brooks AI, Klebanov L, Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005a;6:120. doi:10.1186/1471-2105-6-120. Available at http://www.biomedcentral.com/1471-2105/6/120.
  14. Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statist. Appl. Genet. Mol. Biol. 2005b;4, article 34. doi:10.2202/1544-6115.1157. Available at http://www.bepress.com/sagmb/vol4/iss1/art34.
  15. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–209. doi:10.1016/s1535-6108(02)00030-2.
  16. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. USA. 2001;98:5116–5121. doi:10.1073/pnas.091062498. Available at http://www.pnas.org/cgi/content/full/98/9/5116.
