Published in final edited form as: Ann Appl Stat. 2009 Jan 1;3(3):922–942. doi: 10.1214/09-AOAS236

Are a set of microarrays independent of each other?

Bradley Efron

Abstract

Having observed an m × n matrix X whose rows are possibly correlated, we wish to test the hypothesis that the columns are independent of each other. Our motivation comes from microarray studies, where the rows of X record expression levels for m different genes, often highly correlated, while the columns represent n individual microarrays, presumably obtained independently. The presumption of independence underlies all the familiar permutation, cross-validation, and bootstrap methods for microarray analysis, so it is important to know when independence fails. We develop nonparametric and normal-theory testing methods. The row and column correlations of X interact with each other in a way that complicates test procedures, essentially by reducing the accuracy of the relevant estimators.

Keywords: total correlation, effective sample size, permutation tests, matrix normal distribution, row and column correlations

1 Introduction

The formal statistical problem considered here can be stated simply: having observed an m × n data matrix X with possibly correlated rows, test the hypothesis that the columns are independent of each other. Relationships between the row correlations and column correlations of X complicate the problem’s solution.

Why are we interested in column-wise independence? The motivation in this paper comes from microarray studies, where X is a matrix of expression levels for m genes on n microarrays. In the “Cardio” study I will use for illustration there are m = 20426 genes each measured on n = 63 arrays, with the microarrays corresponding to 63 subjects, 44 healthy controls and 19 cardiovascular patients¹. We expect the gene expressions to be correlated, inducing substantial correlations within each column (Owen, 2005; Efron, 2007a; Qiu, Brooks, Klebanov and Yakovlev, 2005a), but most of the standard analysis techniques begin with an assumption of independence across microarrays, that is, across the columns of X. This can be a risky assumption: all of the familiar permutation, cross-validation and bootstrap methods for microarray analysis, such as the popular SAM program of Tusher, Tibshirani and Chu (2001), depend on column-wise independence of X; dependence can invalidate the usual choice of a null hypothesis, as discussed next, leading to flawed assessments of significance.

An immediate purpose of the Cardio study is to identify genes involved in the disease process. For gene i we compute the two-sample t-statistic “ti” comparing sick versus healthy subjects. It will be convenient for discussion to convert these to z-scores,

$z_i = \Phi^{-1}\big(F_{61}(t_i)\big), \qquad i = 1, 2, \dots, m,$ (1.1)

with $\Phi$ and $F_{61}$ the cumulative distribution functions (cdf) of the standard normal and $t_{61}$ distributions; under the usual assumptions, $z_i$ will have a standard N(0, 1) null distribution, called here the “theoretical null.” Unusually large values of $z_i$ or $-z_i$ are used to identify non-null genes, with the meaning of “unusual” depending heavily on column-wise independence.
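In scipy the transformation (1.1) is a one-liner; a minimal sketch, with placeholder t-values standing in for the gene-wise two-sample statistics t_i (not the actual Cardio data):

    import numpy as np
    from scipy.stats import norm, t as t_dist

    # Placeholder t-statistics; the real t_i come from the sick-versus-healthy
    # two-sample comparison, with 63 - 2 = 61 degrees of freedom.
    t_vals = np.random.default_rng(0).standard_t(df=61, size=20426)
    z = norm.ppf(t_dist.cdf(t_vals, df=61))   # z_i = Phi^{-1}(F_61(t_i)), (1.1)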

The left panel of Figure 1 shows the histogram of all 20426 $z_i$ values, which is seen to be much wider than N(0, 1) near its center. An “empirical null” fit to the center as in Efron (2007b) was estimated to be N(.03, 1.57²). Null overdispersion has many possible causes (Efron, 2004, 2007a,b), one of which is positive correlation across the columns of X. Such correlations reduce the effective degrees of freedom for the t-statistic, causing (1.1) to yield overdispersed null $z_i$’s, and of course changing our assessment of significance for outlying values.

Figure 1.

Left panel: histogram of m = 20426 z-values (1.1) for the Cardio study; the center of the histogram is much wider than the N(0, 1) theoretical null. Right panel: scatterplot of microarrays 31 and 32, $(x_{i,31}, x_{i,32})$ for i = 1, 2, …, m, after removal of row-wise gene means; the scattergram seems to indicate substantial correlation between the two arrays.

The right panel of Figure 1 seems to offer a “smoking gun” for correlation: the scattergram of expression levels for microarrays 31 and 32 looks strikingly correlated, with sample correlation coefficient .805. Here X has been standardized by subtraction of its row means, so the effect is not due to so-called ecological correlations. (X is actually “doubly standardized,” as defined in Section 2.) Nevertheless the question of whether or not correlation .805 is significantly positive turns out to be surprisingly close, as discussed in Section 4, because the row-wise correlations in X drastically reduce the degrees of freedom for the scatterplot. Despite the massive appearance of 20426 points, the scattergram’s accuracy is no more than would be given by 17 independent bivariate normal pairs.

Answering the title’s question, that is, testing for column-wise independence in the presence of row-wise dependence, has both easy and difficult aspects. Section 2 introduces a class of simple permutation tests which, in the case of the Cardio data, clearly discredit column-wise independence. However these tests depend on the ordering of the n columns, and can’t be used if the initial order is lost. It is natural and desirable to look for test statistics of column-wise independence that are invariant under permutation of the columns. Classical multivariate analysis, as in Anderson (2003), develops column independence tests in terms of the eigenvalues of an n by n Wishart matrix. However, this theory depends on the assumption of row-wise independence, disqualifying it for use here.

Sections 3 through 5 consider more general classes of independence tests, both from nonparametric and normal-theory points of view. The theorem in Section 3 illustrates a key difficulty: correlation between the rows of X (ruled out in the classic theory) can give a misleading appearance of column-wise dependence. Similarly, row-wise dependence can greatly degrade the accuracy of the usual n × n sample covariance matrix of the columns, as shown by the theorem in Section 4. Various non-permutation normal-theory tests are discussed in Section 5, some promising, but with difficulties seen for all of them. The paper ends in Section 6 with a collection of remarks and details.

2 Permutation Tests of Column-Wise Independence

Simple permutation tests can provide strong evidence against column-wise independence, as we will see for the Cardio data. Our main example concerns the 44 healthy subjects, where X is now an m × n matrix with m = 20426 and n = 44. For convenience we assume that X has been “demeaned” by the subtraction of row and column means, giving

$\sum_i x_{ij} = \sum_j x_{ij} = 0 \qquad \text{for } i = 1, 2, \dots, m \text{ and } j = 1, 2, \dots, n.$ (2.1)

Our numerical results go further and assume “double standardization”: that in addition to (2.1),

$\sum_j x_{ij}^2 = n \quad \text{and} \quad \sum_i x_{ij}^2 = m \qquad \text{for } i = 1, \dots, m \text{ and } j = 1, \dots, n,$ (2.2)

i.e., that each row and column of X has mean 0 and variance 1; see Remark 6.4 in Section 6.
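Double standardization can be computed by alternating row and column standardizations until both (2.1) and (2.2) hold, as in Table 1 of Section 6. A minimal numpy sketch (the iteration count and tolerance are illustrative, not from the paper):

    import numpy as np

    def double_standardize(X, n_iter=10, tol=1e-8):
        """Alternate row and column standardizations until each row and
        column of X has mean 0 and variance 1, conditions (2.1)-(2.2)."""
        X = np.asarray(X, dtype=float).copy()
        for _ in range(n_iter):
            X = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)  # rows
            X = (X - X.mean(0, keepdims=True)) / X.std(0, keepdims=True)  # columns
            if (np.allclose(X.mean(1), 0, atol=tol)
                    and np.allclose(X.std(1), 1, atol=tol)):
                break
        return X

As noted in Remark 6.4, a couple of iterations usually suffice for microarray-sized matrices, though small counterexamples exist where the iteration cannot converge.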

Let Δ^ be the familiar estimate of the n × n covariance matrix between the columns of X,

$\hat{\Delta} = X'X/m.$ (2.3)

Under double standardization, $\hat{\Delta}$ is actually the sample correlation matrix, which we expect to be near the identity matrix $I_n$ under column-wise independence. Also let $v_1$ denote the first eigenvector of $\hat{\Delta}$. The left panel of Figure 2 plots the components of $v_1$ versus array number 1, 2, …, 44. Suppose that the columns of the original expression matrix, before standardization, are independent and identically distributed m-vectors (“i.i.d.”). Then it is easy to see (Remark 6.2 of Section 6) that all orderings of the components of $v_1$ are equally likely. This is not what Figure 2 shows: the components seem to increase from left to right, with a noticeable block of large values for arrays 27–32.

Figure 2.

Left panel: components of the first eigenvector of the column sample correlation matrix $\hat{\Delta}$ for the 44 healthy Cardio subjects, plotted versus array number 1, 2, …, 44; dashes emphasize the block of large components for arrays 27–32. Right panel: first eigenvectors for healthy (solid line) and cancer (dashed) subjects, prostate cancer study, Singh et al. (2002); there was a systematic drift in expression levels as the study progressed.

Let $S(v_1)$ be a statistic that measures structure, for instance a linear regression of $v_1$ versus array index. Comparing $S(v_1)$ with a set of permuted values

$\{S^*_l = S(v^*_l),\ l = 1, 2, \dots, L\},$ (2.4)

$v^*_l$ a random permutation of the components of $v_1$, provides a quick test of the i.i.d. null hypothesis.

Permutation testing was applied to v1 for the Cardio data, using the “block” statistic

$S(v_1) = v_1' B v_1,$ (2.5)

where B is the n × n matrix

$B = \sum_h \beta_h \beta_h'.$ (2.6)

The sum in (2.6) is over all vectors βh of the form

$\beta_h = (0, 0, \dots, 0, 1, 1, \dots, 1, 0, 0, \dots, 0)',$ (2.7)

with the 1s forming blocks of length between 2 and 10 inclusive. A heuristic rationale for block testing appears below; intuitively, microarray experiments are prone to block disturbances because of the way they are developed and read; see Callow et al. (2000). After L = 5000 permutations, only three S* values exceeded the actual value S(v1), p-value .0006, yielding strong evidence against the i.i.d. null hypothesis.
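A sketch of this block permutation test, assuming $v_1$ has already been extracted as the first eigenvector of $\hat{\Delta} = X'X/m$; the block-length range 2 through 10 follows (2.7), and the raw permutation proportion is reported as the p-value:

    import numpy as np

    def block_B(n, min_len=2, max_len=10):
        """B = sum of beta_h beta_h' over all block vectors (2.7)."""
        B = np.zeros((n, n))
        for length in range(min_len, max_len + 1):
            for start in range(n - length + 1):
                beta = np.zeros(n)
                beta[start:start + length] = 1.0
                B += np.outer(beta, beta)
        return B

    def block_permutation_pvalue(v1, L=5000, seed=0):
        """Compare S(v1) = v1' B v1, (2.5), with L random reorderings of v1."""
        rng = np.random.default_rng(seed)
        B = block_B(len(v1))
        S_obs = v1 @ B @ v1
        S_star = np.empty(L)
        for l in range(L):
            v = rng.permutation(v1)
            S_star[l] = v @ B @ v
        return np.mean(S_star >= S_obs)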

The right panel of Figure 2 pertains to a microarray prostate cancer study (Singh et al., 2002) discussed in Efron (2008): m = 6033 genes were measured on each of n = 102 men, 50 healthy controls and 52 prostate cancer patients. The right panel plots first eigenvectors for Δ^, (2.3), computed separately for the healthy controls and the cancer patients (the two matrices being individually doubly standardized). Both vectors increase almost linearly from left to right. Taking S(v1) as the linear regression of v1 versus array number, permutation testing overwhelmingly rejected the i.i.d. null hypothesis, as it also did using the block test. The prostate study appears as a favorable example of microarray technology in Efron (2008). Nevertheless, Figure 2 indicates a systematic drift in the expression level readings as the study progressed. Some genes drift up, others down (the average drift equaling 0 because of standardization), inducing a small amount of column-wise correlation.

Section 5 discusses models for X where the n × n column covariance matrix is of the “single degree of freedom” form

$\Delta = I + \lambda \beta \beta'$ (2.8)

for some known fixed vector β, the null hypothesis of column-wise independence being H0 : λ = 0. An obvious choice of test statistic in this situation is

$S_\beta = \beta'\big(\hat{\Delta} - I\big)\beta,$ (2.9)

a monotone increasing function of $\beta' \hat{\Delta} \beta$. If β is unknown we can replace $S_\beta$ with

$S_B = \sum_{h=1}^H \beta_h' \hat{\Delta} \beta_h = \mathrm{tr}\Big(\hat{\Delta} \sum_h \beta_h \beta_h'\Big) \equiv \mathrm{tr}\big(\hat{\Delta} B\big),$ (2.10)

where $\{\beta_1, \beta_2, \dots, \beta_H\}$ is a catalog of “likely prospects” as in (2.7).

Permutation test statistics such as (2.5) can be motivated from the singular value decomposition (SVD) of X,

$X_{m \times n} = U_{m \times K}\, d_{K \times K}\, V'_{K \times n},$ (2.11)

where K is the rank, d the diagonal matrix of ordered singular values, and U and V orthonormal matrices of sizes m × K and n × K,

$U'U = V'V = I_K,$ (2.12)

IK the K × K identity. The squares of the diagonal elements, say

$e_1 \ge e_2 \ge \dots \ge e_K > 0 \qquad (e_k = d_k^2)$ (2.13)

are the eigenvalues of $X'X = V d^2 V'$.

SB in (2.10) can now be written as

$S_B = \sum_{k=1}^K \frac{e_k}{m}\, \big(v_k' B v_k\big).$ (2.14)

Model (2.8) suggests that most of the information against the null hypothesis $H_0$ of independence lies in the first eigenvector $v_1$, getting us back to the test statistic $S(v_1) = v_1' B v_1$ as in (2.5).
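The equivalence of (2.10) and (2.14) is easy to verify numerically; a small sketch with illustrative dimensions and an arbitrary symmetric B:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 20
    X = rng.standard_normal((m, n))
    W = rng.standard_normal((n, 4))
    B = W @ W.T                                   # any symmetric B will do
    Delta_hat = X.T @ X / m                       # (2.3)
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    e = d**2                                      # eigenvalues (2.13)
    S_direct = np.trace(Delta_hat @ B)            # tr(Delta-hat B), (2.10)
    S_svd = sum(e[k] / m * Vt[k] @ B @ Vt[k] for k in range(len(e)))  # (2.14)
    assert np.isclose(S_direct, S_svd)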

What should the statistician do if column-wise independence is strongly rejected, as in the Cardio example? Use of an empirical null rather than a permutation or theoretical null, N(.03, 1.57²) rather than N(0, 1) in Figure 1, removes the reliance on column-wise independence for hypothesis testing methods such as False Discovery Rates, at the expense of increased variability. Efron (2008) discusses these points.

Two objections can be raised to our permutation tests: (1) they are really testing i.i.d., not independence; (2) non-independence might not manifest itself in the order of $v_1$ (particularly if the order of the microarrays has been shuffled in some unknown way).

Column-wise standardization makes the column distributions more similar, mitigating objection (1). Going further, “quantile standardization” — say replacing each column’s entries by normal scores (Bolstad, Irizarry, Åstrand and Speed, 2003) — makes the marginals exactly the same. The Cardio data was reanalyzed using normal scores, with almost identical results.

Objection (2) is more worrisome from the point of view of statistical power. The order in which the arrays were obtained should be available to the statistician, and should be analyzed to expose possible trends like those in Figure 2². It would be desirable, nevertheless, to have independence tests that do not depend on order — that is, test statistics invariant under column-wise permutations. The remainder of this paper concerns both the possibilities and difficulties in the development of “non-permutation” tests.

3 Row and Column Correlations

There is an interesting relationship between the row and column correlations of the matrix X, which complicates the question of column-wise independence. For the notation of this section define the n × n matrix of sample covariances between the columns of X as

$\widehat{\mathrm{Cov}} = X'X/m,$ (3.1)

called Δ^ in Section 2, and likewise

$\widehat{\mathrm{cov}} = XX'/n$ (3.2)

for the m × m matrix of row-wise sample covariances (having more than 400,000,000 entries in the Cardio example!).

Theorem 1. If X has row and column means 0, (2.1), then the n² entries of $\widehat{\mathrm{Cov}}$ have empirical mean 0 and variance c²,

$c^2 = \sum_{k=1}^K e_k^2 \Big/ (mn)^2,$ (3.3)

with $e_k$ the eigenvalues (2.13), and so do the m² entries of $\widehat{\mathrm{cov}}$.

Proof. The sum of $\widehat{\mathrm{Cov}}$’s entries is

$1_n' X' X 1_n \big/ m = 0,$ (3.4)

according to (2.1), while the mean of squared entries is

$\sum_{j=1}^n \sum_{j'=1}^n \widehat{\mathrm{Cov}}_{jj'}^2 \Big/ n^2 = \mathrm{tr}\big((X'X)^2\big) \big/ m^2 n^2 = \mathrm{tr}\big(V d^4 V'\big) \big/ m^2 n^2 = c^2.$ (3.5)

Replacing X′X with XX′ yields the same results for the row covariances $\widehat{\mathrm{cov}}$.

Under double standardization (2.1)-(2.2), the covariances become sample correlations, say $\widehat{\mathrm{Cor}}$ and $\widehat{\mathrm{cor}}$ for the columns and rows. Theorem 1 has a surprising consequence: whether or not the columns of X are independent, the column sample correlations will have the same mean and variance as the row correlations. In other words, substantial row-wise correlation can induce the appearance of column-wise correlation.
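Theorem 1 is easy to check numerically; a sketch under the demeaning conditions (2.1) alone (dimensions illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 500, 40
    X = rng.standard_normal((m, n))
    X -= X.mean(1, keepdims=True)        # row means 0
    X -= X.mean(0, keepdims=True)        # column means 0 (row means stay 0)
    Cov = X.T @ X / m                    # (3.1)
    cov = X @ X.T / n                    # (3.2)
    e = np.linalg.svd(X, compute_uv=False)**2     # eigenvalues (2.13)
    c2 = np.sum(e**2) / (m * n)**2                # (3.3)
    print(Cov.mean(), cov.mean())                 # both 0
    print((Cov**2).mean(), (cov**2).mean(), c2)   # all three agree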

Figure 3 concerns the 44 healthy subjects in the Cardio study, with X an (m, n) = (20426, 44) doubly standardized matrix. All 44² column correlations are shown by the solid histogram, while the line histogram is a random sample of 10,000 row correlations. Here c² = .283², so according to the theorem both histograms have mean 0 and standard deviation .283.

Figure 3.

Left panel: solid histogram shows the 44² column sample correlations for X the doubly standardized matrix of healthy Cardio subjects; line histogram is a sample of 10,000 of the 20426² row correlations. Right panel: solid histogram shows the column correlations excluding the diagonal 1s; line histogram shows the row correlations corrected for sampling overdispersion.

The 44 diagonal elements of $\widehat{\mathrm{Cor}}$ protrude as a prominent spike at 1. (We can’t see the spike of 20426 diagonal elements for the row correlation matrix $\widehat{\mathrm{cor}}$ because they form such a small fraction of all 20426².) It is easy to remove the diagonal 1s from consideration.

Corollary. In the doubly standardized situation, the off-diagonal elements of the column correlation matrix $\widehat{\mathrm{Cor}}$ have empirical mean and variance

$\hat{\mu} = -\frac{1}{n-1} \qquad \text{and} \qquad \hat{\alpha}^2 = \frac{n}{n-1}\left(c^2 - \frac{1}{n-1}\right).$ (3.6)

For n = 44 and c² = .283² this gives

$(\hat{\mu}, \hat{\alpha}^2) = (-.023,\ .241^2).$ (3.7)

The corresponding diagonal-removing corrections for the row correlations (replacing n by m in (3.6)) are negligible for m = 20426. However c² overestimates the variance of the row correlations for another reason: with only 44 points available to estimate each correlation, estimation error adds a considerable component of variance to the $\widehat{\mathrm{cor}}$ histogram in the left panel, as discussed next.

Suppose now that the columns of X are in fact independent, in which case the substantial column correlations seen in Figure 3 must actually be induced by row correlations, via Theorem 1. Let $\mathrm{cor}_{ii'}$ indicate the true correlation between rows i and i′ (that is, between $X_{ij}$ and $X_{i'j}$), and define the total correlation α to be the root mean square of the $\mathrm{cor}_{ii'}$ values,

$\alpha^2 = \sum_{i < i'} \mathrm{cor}_{ii'}^2 \Big/ \binom{m}{2}.$ (3.8)

Remark 6.5 of Section 6 shows that $\hat{\alpha}^2$ in (3.6) is an approximately unbiased estimate of α², assuming column-wise independence. For the Cardio example $\hat{\alpha} = .241$, similar in size to the microarray correlation estimates in Efron (2007a), Owen (2005), and Qiu et al. (2005a). Section 4 discusses the crucial role of α in determining the accuracy of estimates based on X.
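The chain from the eigenvalues to $\hat{\alpha}$ can be coded directly; a sketch assuming X has already been doubly standardized:

    import numpy as np

    def total_correlation_hat(X):
        """alpha-hat via (3.3) and (3.6): c^2 from the eigenvalues, then
        the diagonal-removing correction for the column correlations."""
        m, n = X.shape
        e = np.linalg.svd(X, compute_uv=False)**2     # eigenvalues (2.13)
        c2 = np.sum(e**2) / (m * n)**2                # (3.3)
        alpha2 = n / (n - 1) * (c2 - 1 / (n - 1))     # (3.6)
        return np.sqrt(max(alpha2, 0.0))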

The right panel of Figure 3 compares the histogram of the column correlations $\widehat{\mathrm{Cor}}_{jj'}$, now excluding the cases j = j′, with the row correlation histogram corrected for sampling overdispersion via the shrinkage factor .241/.283. As predicted by Theorem 1, the similarity is striking. A possible difference lies in the long right tail of the $\widehat{\mathrm{Cor}}$ distribution (including $\widehat{\mathrm{Cor}}_{31,32}$, the case illustrated in Figure 1), whose significance is examined in Section 4.

4 Normal Theory

The results of Sections 2 and 3 were developed nonparametrically. This section concerns multivariate normal theory, afterwards used in Section 5 to draw the connection with classical multivariate independence tests. We consider the matrix normal distribution for X,

$X \sim N_{m,n}\big(0,\ \Sigma_{m \times m} \otimes \Delta_{n \times n}\big),$ (4.1)

where the Kronecker notation indicates covariance structure

$\mathrm{cov}\big(X_{ij}, X_{i'j'}\big) = \Sigma_{ii'}\, \Delta_{jj'}.$ (4.2)

Row $x_i$ of X has covariance matrix proportional to Δ,

$x_i \sim N_n\big(0, \Sigma_{ii} \Delta\big)$ (4.3)

(not independently across rows unless Σ is diagonal), and likewise for column $x_j$, $x_j \sim N_m(0, \Delta_{jj} \Sigma)$. As in (2.1), we take all means equal to 0.

Much of classical multivariate analysis focuses on the situation Σ = I, where the rows $x_i$ are independent replicates³,

$x_i \overset{\text{iid}}{\sim} N_n(0, \Delta), \qquad i = 1, 2, \dots, m,$ (4.4)

in which case the sample covariance matrix $\hat{\Delta} = X'X/m$ has a scaled Wishart distribution,

$\hat{\Delta} \sim \mathrm{Wishart}(m, \Delta)\big/m.$ (4.5)

Distribution (4.5) has first and second moments

$\hat{\Delta}_{n \times n} \sim \Big( \Delta_{n \times n},\ \Delta^{(2)}_{n^2 \times n^2}\big/m \Big) \qquad \text{with} \qquad \Delta^{(2)}_{jk,lh} = \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl}$ (4.6)

for j, k, l, h = 1, 2, … , n; see Mardia, Kent and Bibby (1979, p. 92).

Relation (4.6) says that when Σ = I, that is, when the rows of X are independent, $\hat{\Delta}$ unbiasedly estimates the row covariance matrix Δ with accuracy proportional to $m^{-1/2}$. Correlation between rows reduces the accuracy of $\hat{\Delta}$, as shown next.

Returning to the general situation (4.1)-(4.3), define

$\tilde{\Delta} = X' \sigma^{-2} X \big/ m,$ (4.7)

where σ is the m × m diagonal matrix with diagonal entries $\sigma_{ii} = \Sigma_{ii}^{1/2}$.

Theorem 2. Under model (4.1), $\tilde{\Delta}$ has first and second moments

$\tilde{\Delta} \sim \Big( \Delta,\ \Delta^{(2)}\big/\tilde{m} \Big), \qquad \tilde{m} = \frac{m}{1 + (m-1)\alpha^2},$ (4.8)

where α is the total correlation as in (3.8),

$\alpha^2 = \sum_{i \ne i'} \mathrm{cor}_{ii'}^2 \Big/ m(m-1), \qquad \mathrm{cor}_{ii'} = \Sigma_{ii'} \big/ \big( \Sigma_{ii} \Sigma_{i'i'} \big)^{1/2},$ (4.9)

and $\Delta^{(2)}$ is the Wishart covariance (4.6).

Comparing (4.8) with (4.6), we see that correlation between the rows reduces the “effective sample size” from m to $\tilde{m}$: for α = .241 as in (3.7), the reduction is from m = 20426 to $\tilde{m}$ = 17.2! (Notice that row standardization effectively makes $\tilde{\Delta} \doteq \hat{\Delta}$, (2.3), justifying the comparison.) The total correlation α shows up in other efficiency calculations; see Remark 6.7.
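The effective sample size is a one-line calculation; with the Cardio values it reproduces $\tilde{m}$ = 17.2:

    def m_tilde(m, alpha):
        """Effective sample size m-tilde = m / (1 + (m - 1) alpha^2), (4.8)."""
        return m / (1 + (m - 1) * alpha**2)

    print(m_tilde(20426, 0.241))   # about 17.2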

Proof. The row-standardized matrix $\tilde{X} = \sigma^{-1} X$ has matrix normal distribution

$\tilde{X} \sim N_{m,n}\big(0, \tilde{\Sigma} \otimes \Delta\big),$ (4.10)

where $\tilde{\Sigma} = \sigma^{-1} \Sigma \sigma^{-1}$ has diagonal elements $\tilde{\Sigma}_{ii} = 1$. From (4.2) we see that $\tilde{\Sigma}_{ii'}$ is the correlation between elements $X_{ij}$ and $X_{i'j}$ in the same column of X. $\tilde{\Delta} = \tilde{X}'\tilde{X}/m$ has entries $\tilde{\Delta}_{jk} = \sum_i \tilde{X}_{ij}\tilde{X}_{ik}/m$ and is unbiased for Δ,

$E\{\tilde{\Delta}_{jk}\} = \Delta_{jk},$ (4.11)

using (4.2).

The covariance calculation for $\tilde{\Delta}$ involves the expansion

$\tilde{\Delta}_{jk}\tilde{\Delta}_{lh} = \Big( \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\big/m \Big)\Big( \sum_{i'} \tilde{X}_{i'l}\tilde{X}_{i'h}\big/m \Big)$ (4.12)
$\qquad = \frac{1}{m^2}\Big( \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{il}\tilde{X}_{ih} + \sum_{i \ne i'} \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{i'l}\tilde{X}_{i'h} \Big).$ (4.13)

Using the formula

$E\{Z_1 Z_2 Z_3 Z_4\} = \gamma_{12}\gamma_{34} + \gamma_{13}\gamma_{24} + \gamma_{14}\gamma_{23}$ (4.14)

for a normal vector $(Z_1, Z_2, Z_3, Z_4)$ with means 0 and covariances $\gamma_{ij}$, (4.2) gives

$E\Big\{ \sum_i \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{il}\tilde{X}_{ih} \Big\} = m\big[ \Delta_{jk}\Delta_{lh} + \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big]$ (4.15)

and

$E\Big\{ \sum_{i \ne i'} \tilde{X}_{ij}\tilde{X}_{ik}\tilde{X}_{i'l}\tilde{X}_{i'h} \Big\} = m(m-1)\Big[ \Delta_{jk}\Delta_{lh} + \alpha^2\big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big) \Big].$ (4.16)

Then (4.13) yields

$E\{\tilde{\Delta}_{jk}\tilde{\Delta}_{lh}\} = \Delta_{jk}\Delta_{lh} + \big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big)\,\frac{1 + (m-1)\alpha^2}{m},$ (4.17)

giving

$\mathrm{cov}\big(\tilde{\Delta}_{jk}, \tilde{\Delta}_{lh}\big) = \big( \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big)\big/\tilde{m},$ (4.18)

as in (4.8).

A corollary of Theorem 2, used in Section 5, concerns bilinear functions of Δ and $\tilde{\Delta}$,

$\tau^2 = w' \Delta w \qquad \text{and} \qquad \tilde{\tau}^2 = w' \tilde{\Delta} w,$ (4.19)

where w is a given n-vector.

Corollary. Under model (4.1), $\tilde{\tau}^2$ has mean and variance

$\tilde{\tau}^2 \sim \big( \tau^2,\ 2\tau^4\big/\tilde{m} \big).$ (4.20)

The proof follows that for Theorem 2; see Remark 6.9.

If Σ = I in (4.1), then $\tilde{\Delta} = \hat{\Delta}$ and $\tilde{\tau}^2$ has a scaled chi-squared distribution,

$\tilde{\tau}^2 \sim \tau^2 \chi^2_m \big/ m,$ (4.21)

with mean and variance $\tilde{\tau}^2 \sim (\tau^2,\ 2\tau^4/m)$, so again the effect of correlation within Σ is to reduce the effective sample size from m to $\tilde{m}$, (4.8).

We can approximate $\tilde{\Delta}$, (4.7), with

$\hat{\Delta} = X' \hat{\sigma}^{-2} X \big/ m,$ (4.22)

where $\hat{\sigma}_{ii}^2$ is an estimate of $\Sigma_{ii}$ based on the observed variability in row i. If the rows of X have been standardized, then $\hat{\sigma}_{ii}^2 = 1$ and $\hat{\Delta}$ returns to its original definition X′X/m.

Both Theorem 2 and the corollary encourage us to think of $\hat{\Delta}$ as following, approximately, a scaled Wishart distribution based on an independent sample of size $\tilde{m}$,

$\hat{\Delta} \overset{\cdot}{\sim} \mathrm{Wishart}(\tilde{m}, \Delta)\big/\tilde{m}.$ (4.23)

The dangers of this approximation are discussed in Section 5, but it is, nevertheless, an evocative heuristic, as shown below.

Figure 4 returns to the question of the seemingly overwhelming correlation .805 between arrays 31 and 32 seen in Figure 1. A one-sided p-value was calculated for each of the 946 column correlations, using as null hypothesis the normal-theory correlation coefficient distribution based on a sample of $\tilde{m}$ = 17.2 pairs of $N_2(0, I)$ points (the correct null if Δ = I in (4.23)). Benjamini and Hochberg’s (1995) False Discovery Rate procedure, level q = .1, was applied to the 946 p-values. This yielded 7 significant cases, those with sample correlation ≥ .723; all 7 were from the block of arrays 27 to 32 indicated in Figure 2. Correlation .805 does turn out to be significant, but by a much closer margin than Figure 1’s scattergram suggests.
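A sketch of the Figure 4 calculation. Under the null of $\tilde{m}$ independent $N_2(0, I)$ pairs, $r[(\nu-2)/(1-r^2)]^{1/2}$ follows a t distribution with $\nu - 2$ degrees of freedom, here with the fractional value $\nu = \tilde{m}$ = 17.2; the Remark 6.6 recentering is omitted for simplicity:

    import numpy as np
    from scipy.stats import t as t_dist

    def fdr_significant(r, m_tilde, q=0.1):
        """One-sided p-values for correlations r under the m_tilde-pair
        null, then the Benjamini-Hochberg step-up rule at level q."""
        nu = m_tilde - 2                        # fractional df are fine here
        p = t_dist.sf(r * np.sqrt(nu / (1 - r**2)), df=nu)
        order = np.argsort(p)
        k = np.arange(1, len(p) + 1)
        passed = p[order] <= q * k / len(p)
        n_sig = passed.nonzero()[0].max() + 1 if passed.any() else 0
        return order[:n_sig]                    # indices of significant cases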

Figure 4.

Dashed curve is the normal-theory null density for a correlation coefficient from $\tilde{m}$ = 17.2 pairs of points; see Remark 6.6. Histogram shows the 946 column correlations, right panel of Figure 3. The FDR test, q = .1, yielded 7 significant correlations, $\widehat{\mathrm{Cor}} \ge .723$, including .805 between arrays 31 and 32, Figure 1.

The Fdr procedure was also applied using the simpler null distribution N(−.023, .241²), (3.7). This raised the significance threshold from .723 to .780, removing two of the previously significant correlations.

Theorem 1 showed that the variance of the observed column correlations is useless for testing column-wise independence, since any value at all can be induced by row correlations. The test in Figure 4 avoids this trap by looking for unusual outliers among the column correlations. It does not depend on the order of the columns, objection (2) in Section 2 for permutation tests, but pays the price of increased modeling assumptions.

5 Other Test Statistics

Theorem 2 offers a normal-theory strategy for testing column-wise independence. We begin with $X \sim N_{m,n}(0, \Sigma \otimes \Delta)$, (4.1), taking

$\Sigma_{ii} = 1 \text{ for } i = 1, \dots, m \qquad \text{and} \qquad \Delta_{jj} = 1 \text{ for } j = 1, \dots, n,$ (5.1)

as suggested by double standardization. The null hypothesis of column-wise independence is equivalent to the column correlation matrix equaling the identity,

$H_0: \Delta = I,$ (5.2)

since then (4.2) says that all pairs of entries in different columns are independent.

To test (5.2), we estimate Δ with $\hat{\Delta}$, (4.22), or more simply $\hat{\Delta} = X'X/m$ after standardization, and compute a test statistic

$S = s(\hat{\Delta}),$ (5.3)

where s(·) is some measure of distance between $\hat{\Delta}$ and I. The accuracy approximation $\hat{\Delta} \overset{\cdot}{\sim} (\Delta, \Delta^{(2)}/\tilde{m})$ from (4.8), with Δ = I, is used to assess the significance level of the observed S, maybe even employing the more daring approximation $\hat{\Delta} \overset{\cdot}{\sim} \mathrm{Wishart}(\tilde{m}, I)/\tilde{m}$. Strategy (5.3) looks promising but, as the examples of this section will show, it suffers from serious difficulties that are absent under the classic assumption of independent rows.

One of the difficulties stems from Theorem 1. An obvious test statistic for H0 : Δ = I is

$S = \sum_{j<j'} \hat{\Delta}_{jj'}^2 \Big/ \binom{n}{2},$ (5.4)

the average squared off-diagonal element of $\hat{\Delta}$. But $\hat{\Delta} = \widehat{\mathrm{Cov}}$, (3.1), so in the doubly standardized situation of (3.6), S is a monotone increasing function of $\hat{\alpha}$, the estimated total correlation. This disqualifies S as a test statistic for (5.2), since large values of $\hat{\alpha}$ can always be attributed to row-wise correlation alone.

Similarly, the variance of the eigenvalues (2.13),

$S = \sum_{k=1}^K \big(e_k - \bar{e}\big)^2 \Big/ K \qquad \Big( \bar{e} = \sum e_k \big/ K \Big),$ (5.5)

looks appealing, since the true eigenvalues all equal 1 when Δ = I. However (5.5) is also a monotone function of $\hat{\alpha}$; see Remark 6.1.

The general difficulty here is “leakage,” the fact that row-wise correlations affect the observed pattern of column-wise correlations. This becomes clearer by comparison with classical multivariate methods, where row-wise correlations are assumed away by taking Σ = I in (4.1). Johnson and Graybill (1972) consider a two-way ANOVA problem where, after subtraction of main effects, X has the form

$X_{ij} = a_i \beta_j + \epsilon_{ij} \qquad \text{for } i = 1, 2, \dots, m \text{ and } j = 1, 2, \dots, n,$ (5.6)

ai ~ N(0, λ) and εij ~ N(0, 1), all independently, with β = (β1, β2, …, βn) a fixed but unknown vector (representing “one degree of freedom for nonadditivity” in the two-way table X, Johnson and Graybill’s extension of Tukey’s procedure).

In the Kronecker notation (4.1), $X \sim N_{m,n}(0, I \otimes \Delta)$ with

$\Delta = I + \lambda \beta \beta'.$ (5.7)

Now (5.2) becomes H0 : λ = 0. Johnson and Graybill show that, with β unknown, the likelihood ratio test rejects H0 for large values of the eigenvalue ratio (2.13),

$S = e_1 \Big/ \sum_{k=1}^K e_k.$ (5.8)

Since the m rows of X are assumed independent, they can test $H_0$ by comparing S with values $S^* = e_1^* \big/ \sum_{k=1}^K e_k^*$ obtained from

$\hat{\Delta}^* \sim \mathrm{Wishart}(m, I)\big/m,$ (5.9)

as in (4.5).

Getting back to the correlated-rows situation, Theorem 2 suggests comparing S with values S* from

$\hat{\Delta}^* \sim \mathrm{Wishart}(\tilde{m}, I)\big/\tilde{m},$ (5.10)

$\tilde{m}$ as in (4.8). The solid histogram in Figure 5 compares 100 S* values from (5.10), $\tilde{m}$ = 17.2 for the Cardio data, with the observed value S = .207 from the doubly standardized Cardio matrix for the healthy subjects used in Figure 3. All 100 S* values are much smaller than S, providing strong evidence against $H_0$ : Δ = I.
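A sketch of the null simulation (5.10); since a Wishart draw with fractional degrees of freedom is awkward, $\tilde{m}$ is rounded to the nearest integer here and $\hat{\Delta}^*$ formed directly as Z'Z/m₀ from m₀ independent N(0, $I_n$) rows:

    import numpy as np

    def eigenratio(Delta):
        """S = e_1 / sum(e_k), (5.8)."""
        e = np.linalg.eigvalsh(Delta)
        return e.max() / e.sum()

    def null_eigenratios(m_tilde, n, n_sim=100, seed=0):
        """Simulate S* from Delta* ~ Wishart(m0, I)/m0, m0 = round(m_tilde)."""
        rng = np.random.default_rng(seed)
        m0 = int(round(m_tilde))
        S_star = np.empty(n_sim)
        for i in range(n_sim):
            Z = rng.standard_normal((m0, n))
            S_star[i] = eigenratio(Z.T @ Z / m0)
        return S_star

    S_star = null_eigenratios(17.2, 44)   # compare with the observed S = .207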

Figure 5.

Eigenratio statistic (5.8) equals .207 for the 20426 × 44 Cardio matrix X; solid histogram shows 100 simulations S* from the Wishart (5.10), $\tilde{m}$ = 17.2; line histogram shows 100 simulations from correlated-row X* matrices (5.11), α = .241, Δ = I.

The evidence looks somewhat weaker, though, if we simulate S* values with $\hat{\Delta}^*$ obtained from random matrices

$X^* \sim N_{m,n}\big(0, \Sigma^* \otimes I_n\big),$ (5.11)

doubly standardized, where Σ* has total correlation α = .241, the estimated value for X, (4.9). The line histogram in Figure 5 shows 100 such S* values, all still smaller than S, but substantially less so. (Remark 6.8 describes the construction of X*.)

Why does (5.11) produce larger “null” S* values than (5.10)? The answer is simple: even though the first and second moments of $\hat{\Delta}^* = X^{*\prime}X^*/m$ match those of $\hat{\Delta}^*$ from (5.10), its eigenvalues do not. The non-zero eigenvalues of $X^{*\prime}X^*/m$ equal those of $X^*X^{*\prime}/m$, which depends on Σ*. This is another example of leakage, where the fact that Σ* in (5.11) is not the identity $I_m$ distorts the estimated eigenvalues of $\hat{\Delta}$ even if Δ = $I_n$.

The eigenratio statistic $S = e_1/\sum e_k$ is invariant under permutations of the columns of X, answering objection (2) to permutation testing of Section 2. Because of invariance, the eigenratio and permutation tests provide independent p-values for testing the null hypothesis of i.i.d. columns, and so can be employed together. Figure 5 is disturbing nonetheless, in suggesting that an appropriate null distribution for S depends considerably on the choice of the nuisance parameter Σ* in (5.11).

The bilinear form (4.19)-(4.20) yields another class of test statistics,

$\hat{\tau}^2 = w' \hat{\Delta} w \overset{\cdot}{\sim} \big( \tau^2,\ 2\tau^4\big/\tilde{m} \big),$ (5.12)

where w is a pre-chosen n-vector and $\tau^2 = w' \Delta w$. Delta-method arguments give $\mathrm{CV}(\hat{\tau}) \doteq (2\tilde{m})^{-1/2}$ for the coefficient of variation of $\hat{\tau}$. Defining

$Z_i = x_i' w \qquad \big(x_i \text{ the } i\text{th row of } X\big),$ (5.13)

yields the alternative form

$\hat{\tau}^2 = \sum_{i=1}^m Z_i^2 \Big/ m.$ (5.14)

In a two-sample situation like that of the Cardio study, sample sizes $n_1$ and $n_2$, we can choose

$w = \left( \frac{n_1 n_2}{n_1 + n_2} \right)^{1/2} \left( -\frac{1_{n_1}'}{n_1},\ \frac{1_{n_2}'}{n_2} \right)',$ (5.15)

“$1_n$” indicating a vector of n 1s. This choice makes

$Z_i = \left( \frac{n_1 n_2}{n_1 + n_2} \right)^{1/2} \big( \bar{x}_{2i} - \bar{x}_{1i} \big),$ (5.16)

the multiple of the mean response difference between the two samples that has variance 1 if Δ = I. In terms of (5.12), $\|w\|^2 = 1$ so $\tau^2 = 1$.

For the Cardio study, with $n_1$ = 44, $n_2$ = 19, and $\tilde{m}$ = 17.2, we obtain $\hat{\tau}$ = 1.48, coefficient of variation 0.17. This puts $\hat{\tau}$ more than 2.8 standard errors above the null hypothesis value τ = 1, again providing evidence against column-wise independence. The $Z_i$ values from (5.16) are nearly indistinguishable from the $z_i$ values in Figure 1, not surprisingly, since with the rows of X standardized, $Z_i$ is an equivalent form of the two-sample t-statistic $t_i$ in (1.1).
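A sketch of the $\hat{\tau}$ computation (5.12)-(5.16), assuming X is row standardized with the $n_1$ group-1 columns first (the sign convention in w does not affect $\hat{\tau}$):

    import numpy as np

    def tau_hat(X, n1, n2):
        """tau-hat from (5.14): w is the two-sample contrast (5.15) and
        Z_i = x_i'w, (5.13), one value per gene."""
        m = X.shape[0]
        w = np.sqrt(n1 * n2 / (n1 + n2)) * np.concatenate(
            [-np.ones(n1) / n1, np.ones(n2) / n2])
        Z = X @ w
        return np.sqrt(np.sum(Z**2) / m)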

Once again, however, there are difficulties with this as a test for column-wise independence. There is no question that the $Z_i$’s are overdispersed compared to the theoretical value τ = 1. But problems other than column dependence can cause overdispersion, in particular unobserved covariate differences between subjects in the two samples (Efron, 2004, 2008).

The statistic $S = w' \hat{\Delta} w$, with w as in (5.15), does not depend upon the order of the columns of X within each of the two samples, answering objection (2) against permutation tests, but it is the only such choice for a two-sample situation. Other w’s might yield interesting results. The version of (5.15) comparing the first 22 healthy Cardio subjects with the second 22 provided the spectacular value $\hat{\tau}$ = 1.87, and here the “unobserved covariate” objection has less force.

Now, however, the test statistic depends on the order of the columns within the healthy subjects’ matrix, reviving objection (2). Again we might want to check a catalog of possible w vectors $w_1, w_2, \dots, w_H$, leading back to the test statistic

$S_B = \sum_h w_h' \hat{\Delta} w_h = \mathrm{tr}\big(\hat{\Delta} B\big) \qquad \Big( B = \sum_h w_h w_h' \Big)$ (5.17)

as in (2.10), the only difference being that the null distribution of $\hat{\Delta}$ now involves normal theory rather than permutations. Remark 6.9 shows that the null first and second moments of $S_B$ are similar to (5.12),

$S_B \overset{H_0}{\sim} \Big( \mathrm{tr}(B),\ \frac{2}{\tilde{m}}\, \mathrm{tr}\big(B^2\big) \Big).$ (5.18)

In summary, normal-theory methods are interesting and promising, but are not yet proven competitors for the permutation tests of Section 2.

6 Remarks

This section presents some brief remarks and details supplementing the previous material.

Remark 6.1. The constant c². The variance constant c² in Theorem 1, (3.3), can be expressed as

$c^2 = \frac{K}{(mn)^2}\left[ \bar{e}^2 + \frac{1}{K}\sum_{k=1}^K \big(e_k - \bar{e}\big)^2 \right] \qquad \Big( \bar{e} = \frac{1}{K}\sum e_k \Big),$ (6.1)

so that $c^2 \ge K(\bar{e}/mn)^2$, with equality only if the eigenvalues $e_k$ are all equal. In the doubly standardized case $\bar{e} = mn/K$, giving

$c^2 \ge \frac{1}{K},$ (6.2)

where K is the rank of X.

Remark 6.2. Permutation invariance. If the columns of X are i.i.d. observations from a distribution on $R^m$, then the distribution of X is invariant under permutations: $X\pi \sim X$ for any n × n permutation matrix π. Now suppose $\tilde{X} = L(X)$, where L performs the same operation on each column of X, for example replacing each column by its normal-scores vector. Then

$\tilde{X}\pi = L(X)\pi = L(X\pi) \sim L(X) = \tilde{X},$ (6.3)

showing that X~ is permutation invariant.

Similarly, suppose $\tilde{X} = R(X)$, performing the same operation $\tilde{x}_i = r(x_i)$ on each row of X, where now we require $r(x)\pi = r(x\pi)$ for all n-vectors x. The same argument as (6.3) demonstrates that $\tilde{X}$ is still permutation invariant. Iterating row and column standardizations as in Table 1 then shows that if the original data matrix X is permutation invariant, so is its doubly standardized version.

Table 1.

Successive row and column standardizations of the 20426 × 44 matrix of healthy Cardio subjects. “Col”: empirical standard deviation of $\widehat{\mathrm{Cor}}_{jj'}$, j < j′; “Eig”: $\hat{\alpha}$ from (3.6); “Row”: from a 1% sample of $\widehat{\mathrm{cor}}_{ii'}$ values, adjusted for overdispersion (6.6), sampling standard error ±.0034.

                 Col     Row     Eig                   Col     Row     Eig
    demeaned    0.252   0.286   0.000    demeaned     0.252   0.286   0.000
    col         0.252   0.249   0.251    row          0.241   0.283   0.279
    row         0.242   0.255   0.246    col          0.241   0.251   0.240
    col         0.242   0.241   0.242    row          0.240   0.247   0.241
    row         0.241   0.246   0.235    col          0.240   0.247   0.240
    col         0.241   0.244   0.241    row          0.241   0.240   0.235
    row         0.241   0.245   0.234    col          0.241   0.237   0.240
    col         0.241   0.238   0.241    row          0.241   0.233   0.233

Remark 6.3. Covariances after demeaning. Suppose that X is normally distributed, with covariances Σ ⊗ Δ, (4.2), all columns having the same expectation vector μ. Let $\tilde{X}$ be the demeaned matrix obtained by subtracting all the row and column means of X. Then

$\tilde{X} \sim N_{m,n}\big(0, \tilde{\Sigma} \otimes \tilde{\Delta}\big),$ (6.4)

where

$\tilde{\Delta}_{jj'} = \Delta_{jj'} - \Delta_{\cdot j'} - \Delta_{j \cdot} + \Delta_{\cdot\cdot},$ (6.5)

dots indicating averaging over the missing subscripts, and similarly for $\tilde{\Sigma}$. This shows that demeaning tends to reduce covariances by recentering them around 0.

Remark 6.4. Standardization. A matrix X is “column standardized” by individually subtracting the mean and dividing by the standard deviation of each column, and similarly for row standardization. Table 1 shows the effect of successive row and column standardizations on the 20426 × 44 demeaned matrix of healthy Cardio subjects. Here “Col” is the empirical standard deviation of the 946 column-wise correlations $\widehat{\mathrm{Cor}}_{jj'}$, j < j′; “Eig” is $\hat{\alpha}$ in (3.6); and “Row” is the empirical standard deviation $\hat{\beta}$ of a 1% sample of the row correlations $\widehat{\mathrm{cor}}_{ii'}$, but adjusted for overdispersion,

$\mathrm{Row}^2 = \frac{n}{n-1}\left( \hat{\beta}^2 - \frac{1}{n-1} \right).$ (6.6)

Sampling error of the Row entries is about ±.0034.

The doubly standardized matrix X used for Figure 3 was obtained after five successive column-row standardizations. This was excessive; the Figure looked almost the same after two iterations. Other microarray examples converged equally rapidly, though small counterexamples can be constructed where double standardization isn’t possible.

Microarray analyses usually begin with some form of column-wise standardization (Bolstad et al., 2003; Qiu, Klebanov and Yakovlev, 2005b), designed to negate “brightness” differences between the n microarrays. In the same spirit, row standardization helps prevent incidental gene differences (for example, very great or very small expression level variabilities) from obscuring the actual effects of interest. Standardization tends to reduce the apparent correlations as in Remark 6.3. Without standardization, the scatterplot in Figure 1 stretches out along the main diagonal, correlation .917, driven by genes with unusually large or small inherent expression levels.

Remark 6.5. Corrected estimates of the total correlation. Suppose that the true row correlations $\mathrm{cor}_{ii'}$ have mean 0 and variance α², as in (3.8) with $\overline{\mathrm{cor}} = 0$, and that given $\mathrm{cor}_{ii'}$, the usual estimate $\widehat{\mathrm{cor}}_{ii'}$ has mean and variance

$\widehat{\mathrm{cor}}_{ii'} \overset{\cdot}{\sim} \left[ \mathrm{cor}_{ii'},\ \frac{\big(1 - \mathrm{cor}_{ii'}^2\big)^2}{n-3} \right],$ (6.7)

(6.7) being a good normal-theory approximation (Johnson and Kotz, 1970, Chap. 32). Letting $\bar{\alpha}^2$ be the empirical variance of the $\widehat{\mathrm{cor}}_{ii'}$ values, a standard empirical Bayes derivation yields

$\hat{\alpha}^2 = A_2 - \frac{3}{n-5}\, A_4 \qquad \left[ A_2 = \frac{(n-3)\bar{\alpha}^2 - 1}{n-5} \right]$ (6.8)

as an approximately unbiased estimate of α². (If $\overline{\mathrm{cor}}$ is not assumed to equal 0, a slightly more complicated formula applies.) Of course $\hat{\alpha}^2$ is set to 0 if the right side of (6.8) is negative.

Theorem 1 implies that $\bar{\alpha}^2$ nearly equals c², (3.3), in the doubly standardized situation. Formula (3.6), with say

$\tilde{\alpha}^2 = \frac{n}{n-1}\left( \bar{\alpha}^2 - \frac{1}{n-1} \right),$ (6.9)

is not identical to (6.8), but provides an excellent approximation for values of α ≤ .5: with n = 44 and $\bar{\alpha}$ = .283 as in (3.6), $\hat{\alpha}$ = .2415 while $\tilde{\alpha}$ = .2412.

Remark 6.6. Column and row centerings. The column correlation mean $\hat{\mu} = -1/(n-1)$ in (3.6) is forced by the row-wise demeaning $\sum_j x_{ij} = 0$, (2.1), centering the solid histogram in the right panel of Figure 3 at −.023. With m = 20426, the corresponding center for the line histogram is nearly 0, and the difference in the two centerings is noticeable. The dashed density curve in Figure 4, and the corresponding p-values for the FDR analysis, were shifted .023 units leftwards.

Remark 6.7. The total correlation α. The total correlation α, which plays a key role in Theorem 2, (4.9), is also the central parameter of the theory developed in Efron (2007a). Equations (3.15)-(3.16) there are equivalent to (5.12) here. In both papers, α has the very convenient feature of summarizing the effects of an enormous m × m correlation matrix Σ in a single number.

Remark 6.8. Σ* for simulation (5.11). The X* simulation used in Figure 5 began with the m × n matrix Y = ($y_{ij}$),

$y_{ij} = c_{Ij} + e_{ij} \qquad \big\{ e_{ij} \sim N(0,1),\ c_{Ij} \sim N(0, \gamma^2) \big\} \quad \text{(all independent)},$ (6.10)

where I = 1, 2, 3, 4, 5 as i is in the first, second, …, last fifth of 1 through m; Y was then column standardized to give X*, so that Σ* had a block form, with large positive correlations (about 0.61) in the (m/5) × (m/5) diagonal blocks. The choice γ = 1.23 was required to yield α = .241.
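A sketch of construction (6.10), with γ = 1.23 and the five row blocks as described; only column standardization is applied here, as in the remark:

    import numpy as np

    def simulate_Xstar(m, n, gamma=1.23, n_blocks=5, seed=0):
        """Y_ij = c_Ij + e_ij, (6.10): rows in the same fifth of 1..m share
        a common N(0, gamma^2) column effect, giving Sigma* a block form."""
        rng = np.random.default_rng(seed)
        block = (np.arange(m) * n_blocks) // m          # fifth containing row i
        c = gamma * rng.standard_normal((n_blocks, n))  # c_Ij
        Y = c[block] + rng.standard_normal((m, n))      # e_ij
        return (Y - Y.mean(0)) / Y.std(0)               # column standardized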

Remark 6.9. Bilinear statistics. Since $\tilde{\Delta} \sim (\Delta, \Delta^{(2)}/\tilde{m})$, (4.8), it is clear that $E\{\tilde{\tau}^2\} = \tau^2$ in corollary (4.20). The variance calculation proceeds as in Theorem 2:

$\mathrm{var}\{\tilde{\tau}^2\} = \sum_{jklh} \Delta^{(2)}_{jk,lh} w_j w_k w_l w_h \big/ \tilde{m} = \sum_{jklh} \big[ \Delta_{jl}\Delta_{kh} + \Delta_{jh}\Delta_{kl} \big] w_j w_k w_l w_h \big/ \tilde{m}$
$\qquad = \Big[ \sum_{jl}\sum_{kh} (\Delta_{jl} w_j w_l)(\Delta_{kh} w_k w_h) + \sum_{jh}\sum_{kl} (\Delta_{jh} w_j w_h)(\Delta_{kl} w_k w_l) \Big] \big/ \tilde{m} = 2\Big( \sum_{jk} \Delta_{jk} w_j w_k \Big)^2 \big/ \tilde{m} = 2\tau^4\big/\tilde{m}.$ (6.11)

The verification of (5.18) is the same, except with element $b_{jk}$ of B replacing $w_j w_k$ above, $b_{lh}$ replacing $w_l w_h$, etc.

Footnotes

1. The entries of X are log(red/green) ratios obtained from oligonucleotide arrays.

2. The referee points out that when Affymetrix CEL files are available, array run dates will usually be found in the DatHeader lines.

3. Most multivariate texts reverse the situation, taking the columns as independent replicas of possibly correlated rows.

References

  1. Anderson TW. An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, New York; 2003.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B. 1995;57:289–300.
  3. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. Comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi:10.1093/bioinformatics/19.2.185. Available at http://web.mit.edu/biomicro/education/RMA.pdf.
  4. Callow M, Dudoit S, Gong E, Speed T, Rubin E. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research. 2000;10:2022–2029. doi:10.1101/gr.10.12.2022.
  5. Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 2004;99:96–104.
  6. Efron B. Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 2007a;102:93–103.
  7. Efron B. Size, power, and false discovery rates. Ann. Statist. 2007b;35:1351–1377.
  8. Efron B. Microarrays, empirical Bayes, and the two-groups model (with discussion and rejoinder). Statist. Sci. 2008;23:1–47.
  9. Johnson DE, Graybill FA. An analysis of a two-way model with interaction and no replication. J. Amer. Statist. Assoc. 1972;67:862–868.
  10. Johnson NL, Kotz S. Continuous Univariate Distributions–1. Houghton Mifflin, Boston; 1970.
  11. Mardia K, Kent J, Bibby J. Multivariate Analysis. Academic Press, London; 1979.
  12. Owen AB. Variance of the number of false discoveries. J. Roy. Statist. Soc. Ser. B. 2005;67:411–426.
  13. Qiu X, Brooks AI, Klebanov L, Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005a;6:120. doi:10.1186/1471-2105-6-120. Available at http://www.biomedcentral.com/1471-2105/6/120.
  14. Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statist. Appl. Genet. Mol. Biol. 2005b;4, article 34. doi:10.2202/1544-6115.1157. Available at http://www.bepress.com/sagmb/vol4/iss1/art34.
  15. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–209. doi:10.1016/s1535-6108(02)00030-2.
  16. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. USA. 2001;98:5116–5121. doi:10.1073/pnas.091062498. Available at http://www.pnas.org/cgi/content/full/98/9/5116.
