Author manuscript; available in PMC: 2010 May 26.
Published in final edited form as: Ann Appl Stat. 2009;3(4):1634–1654. doi: 10.1214/09-AOAS262SUPP

Inference on Low-Rank Data Matrices with Applications to Microarray Data

Xingdong Feng 1, Xuming He 1
PMCID: PMC2876352  NIHMSID: NIHMS121567  PMID: 20508835

Abstract

Probe-level microarray data are usually stored in matrices, where the row and column correspond to array and probe, respectively. Scientists routinely summarize each array by a single index as the expression level of each probe-set (gene). We examine the adequacy of a uni-dimensional summary for characterizing the data matrix of each probe-set. To do so, we propose a low-rank matrix model for the probe-level intensities, and develop a useful framework for testing the adequacy of uni-dimensionality against targeted alternatives. This is an interesting statistical problem where inference has to be made based on one data matrix whose entries are not i.i.d. We analyze the asymptotic properties of the proposed test statistics, and use Monte Carlo simulations to assess their small sample performance. Applications of the proposed tests to GeneChip data show that evidence against a uni-dimensional model is often indicative of practically relevant features of a probe-set.

Keywords and phrases: Hypothesis Test, Microarray, Singular Value Decomposition

1. INTRODUCTION

Oligonucleotide expression array technology is popular in many fields of biomedical research. The technology makes it possible to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for a large number of genes simultaneously. One such platform is the GeneChip microarray technology, which is commercially developed by Affymetrix to measure gene expression by hybridizing the sample mRNA on a probe set, typically composed of 11–20 pairs of probes, in a specially designed chip that is called a “microarray” (Parmigiani, et al. (2003)).

Two types of probes are used in the GeneChip microarray technology: the perfect match (PM), which is taken from a gene sequence for specific binding of mRNA for the gene, and the mismatch (MM), which is artificially created by changing one nucleotide of the PM sequence to control nonspecific binding of mRNA from the other genes or non-coding sequences of DNA. The probe pairs are immobilized into an array, where each spot of the array contains a probe. An RNA sample labeled with a fluorescent dye is hybridized to a microarray, and the array is then scanned. The expression levels of different genes can be measured by the intensities of the spots. We use PM or PM − MM as the intensity data for our statistical analysis. Extensive studies have been carried out on how to summarize the gene expression levels based on the probe level data. Li and Wong (2001) proposed a multiplicative model:

$$y_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m, \tag{1}$$

where y is the observed intensity of each spot, θ is the array effect, ϕ is the probe effect, ε is the random error, i indicates the ith array and j refers to the jth probe. This model, along with some of its variations, has been routinely used in microarray data analysis. In the present paper, we focus on one natural question: how well can we use one quantity θi to adequately summarize the expression level for each probe-set in the ith array? Hu, Wright and Zou (2006) show that the least squares estimate (LSE) of the parameters in the model can be obtained as the first component of the singular value decomposition (SVD) of the intensity matrix Y, where

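$$Y = \begin{pmatrix} y_{11} & \cdots & y_{1m} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nm} \end{pmatrix}.$$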

Motivated by their work, we aim to develop useful methods to test if additional parameters are needed to characterize the expression data of each probe-set in each array based on the SVD.
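The connection between Model (1) and the SVD can be illustrated with a short numerical sketch. The function name `rank_one_fit`, the simulated array and probe effects, and the sign convention are assumptions made for illustration only; they are not taken from Li and Wong (2001) or Hu, Wright and Zou (2006).

```python
import numpy as np

def rank_one_fit(Y):
    """Least squares fit of y_ij = theta_i * phi_j from the leading SVD component."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    phi = Vt[0]                 # unit-norm probe effects (first right singular vector)
    theta = s[0] * U[:, 0]      # array effects, equal to Y @ phi
    if phi.sum() < 0:           # sign convention only (an assumption of this sketch)
        phi, theta = -phi, -theta
    return theta, phi, s

rng = np.random.default_rng(0)
theta0 = rng.uniform(2000, 6000, size=10)   # hypothetical array effects
phi0 = np.ones(12) / np.sqrt(12)            # hypothetical unit-norm probe effects
Y = np.outer(theta0, phi0) + rng.normal(0, 50, size=(10, 12))

theta_hat, phi_hat, s = rank_one_fit(Y)
residual_ss = np.sum((Y - np.outer(theta_hat, phi_hat)) ** 2)
```

The residual sum of squares of this fit equals the sum of the remaining squared singular values, so a large second singular value is the first sign that one index per array may not be enough.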

When we applied the SVD to the 20 GeneChip microarrays produced in a recent MicroArray Quality Control (MAQC) project (Shi, et al. (2006)) for contrasting colorectal adenocarcinomas and matched normal colonic tissues, we found a number of probe-sets (including Probe-set “214974_x_at” designed to measure the gene expression for Gene “CXCL5”) with a significant 2-dimensional structure. The first two singular vectors for Probe-set “214974_x_at” are displayed graphically in Figure 1, indicating that the usual uni-dimensional summary of gene expression (corresponding to the first right singular vector) would mask the differential expression of Gene “CXCL5” in the tumor tissues. Recent studies such as that reported in Dimberg, et al. (2007) show that this gene indeed plays an important role in colorectal cancer. More detailed findings about this probe-set can be found in Section 5 together with additional examples.

Fig 1.

Scatterplot of singular vectors for the probe-set “214974_x_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues. Sections 1 and 5 refer to this figure.

In Section 2 we propose a 2-dimensional model to take into account both the mean structure and the variance structure of the data matrix. We use a multiplicative model extended from Model (1), but the array effects are assumed to be random, consistent with the fact that the arrays are typically drawn from a larger population. The LSE of the parameters in the model can be efficiently computed via the SVD. We are interested in the dimensionality of the mean of this data matrix, but first we need to define it in a precise way.

Definition 1.1

Given an n × m random matrix Y, we define the mean matrix as E(Y). If the rank of E (Y) is k, then the dimensionality of Y is defined as k, where k ∈ {1, 2, … , min (n, m)}.

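If the rank of E(Y) is k, it is well known that the SVD of E(Y) has k nonzero singular values, and E(Y) can be decomposed as $\sum_{i=1}^{k} \lambda_i \underline{u}_i \underline{v}_i^T$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ are the singular values, $\underline{u}_i \in \mathbb{R}^n$ is the ith left singular vector and $\underline{v}_i \in \mathbb{R}^m$ is the ith right singular vector, for i = 1, 2, …, k. Moreover,

$$\underline{u}_i^T \underline{u}_j = \underline{v}_i^T \underline{v}_j = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}$$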

Our primary question is whether the dimensionality (rank) of the matrix E(Y) is one or two. For this purpose, we formulate our hypothesis as $H_0: E(Y) = \lambda_1 \underline{u}_1 \underline{v}_1^T$ versus $H_1: E(Y) = \lambda_1 \underline{u}_1 \underline{v}_1^T + \lambda_2 \underline{u}_2 \underline{v}_2^T$. It is possible to consider higher ranks of the mean matrix, but our approach is best illustrated with the rank 2 alternative, which is also the most relevant scenario in many applications. In Section 3, three test statistics are proposed for this problem and their asymptotic results are given. The asymptotic analysis based on the SVD of Y differs from the classical literature on the eigenvalues and eigenvectors of a sample covariance matrix, because the latter works on a data matrix with its mean removed, but our focus is directly on the mean of the data matrix.

When the number of microarrays in an experiment is small due to the cost concerns, the asymptotic distributions of the statistics proposed in Section 3 may not be sufficiently close to their exact distributions. Hence, we apply the bootstrap techniques to calibrate the first two tests discussed in Section 3. In Section 4, we assess the finite sample performance of the tests proposed in Section 3 by Monte Carlo simulations. Finally, in Section 5 we apply the proposed tests to real data sets from two studies. Our analysis shows that the second dimension of the probe-level data is often indicative of interesting features of a probe-set. A number of scenarios for the inadequacy of a uni-dimensional summary are discussed through the case studies and in the concluding Section 6. For example, we point out how our approach relates to and differs from probe re-mapping, and show that a high percentage of probes of poor binding strengths in a probe-set can mask gene expression profiles through a uni-dimensional model. All the proofs of lemmas and theorems given in the paper can be found in the supplemental article Feng and He (2009).

2. MODEL AND ESTIMATION

In this section, we propose a multiplicative model extended from Model (1) to account for a possible second dimension in the data matrices. Furthermore, the asymptotic properties of LSE of the parameters in the model are discussed.

2.1. A Multiplicative Model with Random Effects

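Our proposed model takes the form

$$\underline{y}_i = \theta_{1i}^{(0)} \underline{\phi}_1^{(0)} + \theta_{2i}^{(0)} \underline{\phi}_2^{(0)} + \underline{\varepsilon}_i, \quad i = 1, 2, \ldots, n, \tag{2}$$

where $\underline{y}_i = (y_{i1}, y_{i2}, \ldots, y_{im})^T$ is the ith observed vector, $\underline{\theta}_1^{(0)} = (\theta_{11}^{(0)}, \ldots, \theta_{1n}^{(0)})^T$ and $\underline{\theta}_2^{(0)} = (\theta_{21}^{(0)}, \ldots, \theta_{2n}^{(0)})^T$ are used to explain the row effects, and $\underline{\phi}_1^{(0)} = (\phi_{11}^{(0)}, \ldots, \phi_{1m}^{(0)})^T$ and $\underline{\phi}_2^{(0)} = (\phi_{21}^{(0)}, \ldots, \phi_{2m}^{(0)})^T$ are used to explain the column effects in the data matrix. When applied to the probe level microarray data, θ stands for the array effect and ϕ represents the probe effect. Using $\|\cdot\|_2$ to denote the L2 norm for vectors, and $\underline{a} \perp \underline{b}$ for orthogonality of $\underline{a}$ and $\underline{b}$, we make the following assumptions: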

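  • (M1)

    $\underline{\phi}_1^{(0)}$ and $\underline{\phi}_2^{(0)}$ are two m-dimensional unit vectors with $\underline{\phi}_1^{(0)} \perp \underline{\phi}_2^{(0)}$.

  • (M2)

    $\underline{\theta}_j^{(0)}$ are independently distributed with mean $\underline{\mu}_j = (\mu_{j1}, \ldots, \mu_{jn})^T$ and variance $\sigma_j^2 I_n$, for j = 1, 2, and all the components in each vector are independent. The third and fourth central moments of $\theta_{ji}^{(0)}$ are $\gamma_{j3}$ and $\tau_{j4}$ respectively, for j = 1, 2. Moreover, $\underline{\mu}_1 \perp \underline{\mu}_2$.

  • (M3)

    The error variables $\underline{\varepsilon}_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im})^T$ are independently and identically distributed with mean zero and variance-covariance matrix $\sigma^2 I_m$, and the third and fourth central moments of $\varepsilon_{ij}$ are $\gamma_3$ and $\tau_4$, respectively.

  • (M4)

    $\{\theta_{1i}^{(0)}\}$, $\{\theta_{2i}^{(0)}\}$ and $\{\underline{\varepsilon}_i\}$ are mutually independent.

  • (M5)

    $n^{-1}\|\underline{\mu}_1\|^2 \to \mu_1^2$ and $n^{-1}\|\underline{\mu}_2\|^2 \to \mu_2^2$ as n → ∞ for some finite constants $\mu_1$ and $\mu_2$. We assume that $\mu_1^2 + \sigma_1^2 > \mu_2^2 + \sigma_2^2$, which is necessary for the identifiability of the model parameters.

  • (M6)

    $\|\underline{\mu}_j \odot \underline{\mu}_j\|^2 = O(n)$, j = 1, 2, where ⊙ indicates the pointwise product of two vectors.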
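A data matrix satisfying Model (2) and Assumptions (M1)–(M4) can be simulated as in the following minimal sketch. The normal distributions, the helper name `simulate_model2` and all numerical values are illustrative assumptions; the model itself only specifies moments.

```python
import numpy as np

def simulate_model2(mu1, mu2, sigma1, sigma2, sigma, phi1, phi2, rng):
    """Draw an n x m matrix whose i-th row is theta_1i*phi1 + theta_2i*phi2 + eps_i."""
    n, m = len(mu1), len(phi1)
    theta1 = mu1 + sigma1 * rng.standard_normal(n)   # random row effects, first dimension
    theta2 = mu2 + sigma2 * rng.standard_normal(n)   # random row effects, second dimension
    eps = sigma * rng.standard_normal((n, m))        # i.i.d. errors (normal by assumption)
    return np.outer(theta1, phi1) + np.outer(theta2, phi2) + eps

n, m = 16, 12
phi1 = np.ones(m) / np.sqrt(m)                       # unit vector
phi2 = np.tile([1.0, -1.0], m // 2) / np.sqrt(m)     # unit vector orthogonal to phi1 (M1)
mu1 = np.full(n, 4500.0)
mu2 = np.tile([125.0, -125.0], n // 2)               # orthogonal to mu1 (M2)

Y = simulate_model2(mu1, mu2, 390.0, 390.0, 70.0, phi1, phi2, np.random.default_rng(1))
mean_matrix = np.outer(mu1, phi1) + np.outer(mu2, phi2)   # E(Y), of rank 2 here
```

Setting `mu2` to the zero vector gives a mean matrix of rank one, which is the null hypothesis studied in Section 3.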

2.2. Least Squares Estimate of Column Effect Parameters

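In this section we discuss the properties of the LSE of the column effect parameters. Let $\underline{\theta}_1 = (\theta_{11}, \ldots, \theta_{1n})^T$, $\underline{\theta}_2 = (\theta_{21}, \ldots, \theta_{2n})^T$, $\varphi = (\underline{\phi}_1^T, \underline{\phi}_2^T)^T$ and $\vartheta = (\underline{\theta}_1^T, \underline{\theta}_2^T, \varphi^T)^T$. With the objective function

$$d_n(\vartheta) = \sum_{i=1}^{n} \left\| \underline{y}_i - \theta_{1i}\underline{\phi}_1 - \theta_{2i}\underline{\phi}_2 \right\|^2, \tag{3}$$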

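the least squares estimate of ϑ can be found by minimizing $d_n(\vartheta)$. In the present framework, the total number of parameters increases with the number of observations. To facilitate the analysis, it helps to view $\underline{\theta}_1^{(0)}$ and $\underline{\theta}_2^{(0)}$ as nuisance parameters. If (3) is minimized at $\hat{\vartheta}$, then $\hat{\theta}_{1i}$ and $\hat{\theta}_{2i}$ minimize

$$\left\| \underline{y}_i - \theta_{1i}\hat{\underline{\phi}}_1 - \theta_{2i}\hat{\underline{\phi}}_2 \right\|^2$$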

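with respect to $\theta_{1i}$ and $\theta_{2i}$ given $\hat{\underline{\phi}}_1$ and $\hat{\underline{\phi}}_2$. Furthermore,

$$\hat{\underline{\theta}}_1 = (\hat{\underline{\phi}}_1^T \hat{\underline{\phi}}_1)^{-1} Y \hat{\underline{\phi}}_1, \tag{4}$$

and

$$\hat{\underline{\theta}}_2 = (\hat{\underline{\phi}}_2^T \hat{\underline{\phi}}_2)^{-1} Y \hat{\underline{\phi}}_2. \tag{5}$$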

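Therefore, $\hat{\varphi}$ minimizes the following objective function

$$d_n^*(\varphi) = \sum_{i=1}^{n} \left\| \underline{y}_i - \left[(\underline{\phi}_1^T\underline{\phi}_1)^{-1}\underline{\phi}_1^T\underline{y}_i\right]\underline{\phi}_1 - \left[(\underline{\phi}_2^T\underline{\phi}_2)^{-1}\underline{\phi}_2^T\underline{y}_i\right]\underline{\phi}_2 \right\|^2. \tag{6}$$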
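The profiling step in (4)–(6) can be sketched numerically. The sketch below takes the first two right singular vectors of Y as the column-effect estimates, which is the standard best rank-two approximation property of the SVD, and then recovers the row effects from (4) and (5). The function names and the random test matrix are hypothetical.

```python
import numpy as np

def fit_two_dim(Y):
    """Column effects from the top two right singular vectors, row effects from (4)-(5)."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    phi1_hat, phi2_hat = Vt[0], Vt[1]      # orthonormal, so (phi^T phi)^{-1} = 1
    theta1_hat = Y @ phi1_hat              # Equation (4)
    theta2_hat = Y @ phi2_hat              # Equation (5)
    return theta1_hat, theta2_hat, phi1_hat, phi2_hat

def objective(Y, theta1, theta2, phi1, phi2):
    """Objective (3): total squared residual of the two-dimensional fit."""
    R = Y - np.outer(theta1, phi1) - np.outer(theta2, phi2)
    return np.sum(R ** 2)

rng = np.random.default_rng(2)
Y = rng.normal(size=(20, 11))              # arbitrary matrix, for illustration only
t1, t2, p1, p2 = fit_two_dim(Y)
s = np.linalg.svd(Y, compute_uv=False)
d_min = objective(Y, t1, t2, p1, p2)       # equals the sum of s[2:]**2
```

Any other orthonormal pair of directions gives an objective value at least as large as `d_min`.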

2.2.1. Consistency and Asymptotic Representation

We consider the asymptotic properties of $\hat{\varphi}$ assuming that the number of probes m is fixed but the number of arrays n → ∞. As shown in the preceding subsection, $\hat{\varphi}$ is a constrained M estimator that minimizes (6) subject to $\|\underline{\phi}_1\| = \|\underline{\phi}_2\| = 1$ and $\underline{\phi}_1 \perp \underline{\phi}_2$. The derivations in the Appendix lead to the following results.

Theorem 2.1

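When Model (2) and Assumptions (M1)–(M6) hold, $\hat{\varphi} \xrightarrow{a.s.} \varphi^{(0)}$, where $\hat{\varphi}$ is the least squares estimate of $\varphi^{(0)}$, that is, $\hat{\varphi}$ minimizes $\sum_{i=1}^{n} \rho(\underline{y}_i; \varphi)$ subject to $\|\underline{\phi}_1\| = \|\underline{\phi}_2\| = 1$ and $\underline{\phi}_1 \perp \underline{\phi}_2$, where

$$\rho(\underline{y}_i; \varphi) = \left\| \underline{y}_i - (\underline{\phi}_1^T\underline{y}_i)\underline{\phi}_1 - (\underline{\phi}_2^T\underline{y}_i)\underline{\phi}_2 \right\|^2. \tag{7}$$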

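Theorem 2.1 makes it possible to give the Bahadur representation for $\hat{\underline{\phi}}_1$ and $\hat{\underline{\phi}}_2$ from the results of He and Shao (1996). We now consider the limiting distribution of $\sqrt{n}(\hat{\varphi} - \varphi^{(0)})$, which is critical for us to discuss the asymptotic properties of the test statistics proposed in Section 3. Let

$$\Gamma_n = (n^{-1}\|\underline{\mu}_1\|^2 + \sigma_1^2)\,\underline{\phi}_1^{(0)}\underline{\phi}_1^{(0)T} + (n^{-1}\|\underline{\mu}_2\|^2 + \sigma_2^2)\,\underline{\phi}_2^{(0)}\underline{\phi}_2^{(0)T} + \sigma^2 I_m, \tag{8}$$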

where Im is an m × m identity matrix. Then we have the following theorem.

Theorem 2.2

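When Model (2) and Assumptions (M1)–(M6) hold, we have, for j = 1,2,

$$\hat{\underline{\phi}}_j - \underline{\phi}_j^{(0)} = n^{-1} D_{jn}^{-1} \sum_{i=1}^{n} \left[ 2\underline{y}_i\underline{y}_i^T\underline{\phi}_j^{(0)} - 2\left(\underline{\phi}_j^{(0)T}\underline{y}_i\underline{y}_i^T\underline{\phi}_j^{(0)}\right)\underline{\phi}_j^{(0)} \right] + o(n^{-1+\varepsilon}), \tag{9}$$

where ε is any positive number, and

$$D_{jn} = -2\Gamma_n + 2\,\underline{\phi}_j^{(0)T}\Gamma_n\underline{\phi}_j^{(0)}\, I_m + 4\,\underline{\phi}_j^{(0)}\underline{\phi}_j^{(0)T}\Gamma_n. \tag{10}$$

Thus, both $\sqrt{n}(\hat{\underline{\phi}}_1 - \underline{\phi}_1^{(0)})$ and $\sqrt{n}(\hat{\underline{\phi}}_2 - \underline{\phi}_2^{(0)})$ are asymptotically normally distributed with mean 0 and variance-covariance matrix, say, $C_1$ and $C_2$, respectively, where $C_1$ and $C_2$ are determined by $\varphi^{(0)}$ and the first four moments of $\underline{y}_i$.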

2.3. Least Squares Prediction of Row Effects

We now discuss the asymptotic properties of the least squares prediction of the row effects based on (4) and (5). The result is summarized in the following theorem.

Theorem 2.3

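When Model (2) and Assumptions (M1)–(M6) hold, we have $\hat{\theta}_{1i} = \hat{\underline{\phi}}_1^T\underline{y}_i \xrightarrow{L} \theta_{1i}^{(0)} + \underline{\varepsilon}_i^T\underline{\phi}_1^{(0)}$ and $\hat{\theta}_{2i} = \hat{\underline{\phi}}_2^T\underline{y}_i \xrightarrow{L} \theta_{2i}^{(0)} + \underline{\varepsilon}_i^T\underline{\phi}_2^{(0)}$, where $\xrightarrow{L}$ denotes convergence in distribution.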

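Let

$$\Gamma = (\mu_1^2 + \sigma_1^2)\,\underline{\phi}_1^{(0)}\underline{\phi}_1^{(0)T} + (\mu_2^2 + \sigma_2^2)\,\underline{\phi}_2^{(0)}\underline{\phi}_2^{(0)T} + \sigma^2 I_m. \tag{11}$$

The first two eigenvalues of this matrix are $\mu_1^2 + \sigma_1^2 + \sigma^2$ and $\mu_2^2 + \sigma_2^2 + \sigma^2$, with the remaining eigenvalues equal to $\sigma^2$. Let

$$S_n = n^{-1}\operatorname{tr}(Y^TY) - n^{-1}\|Y\hat{\underline{\phi}}_1\|^2 - n^{-1}\|Y\hat{\underline{\phi}}_2\|^2. \tag{12}$$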

Then, from (4), (5), and Theorem 2.2, we have

$$n^{-1}\|\hat{\underline{\theta}}_j\|^2 \xrightarrow{a.s.} \mu_j^2 + \sigma_j^2 + \sigma^2, \quad (j = 1, 2),$$

and

$$(m-2)^{-1} S_n \xrightarrow{a.s.} \sigma^2,$$

based on the strong law of large numbers. These consistent estimators for all the eigenvalues of the matrix Γ will be used when we construct the tests in the following section. On the other hand, we note that $\theta_{1i}^{(0)}$ and $\theta_{2i}^{(0)}$ may have their individual means $\mu_{1i}$ and $\mu_{2i}$, respectively, and thus it is impossible to consistently estimate the individual parameters $\mu_{1i}$, $\mu_{2i}$, $\sigma_1^2$ and $\sigma_2^2$ without any further information.
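The eigenvalue estimators above are simple functions of the SVD output. The following sketch computes them, reading the first term of (12) as the total sum of squares of Y divided by n; that reading, the helper name `eigenvalue_estimates` and the simulated values are assumptions of this illustration.

```python
import numpy as np

def eigenvalue_estimates(Y):
    """Estimates of the two leading eigenvalues of Gamma and of sigma^2."""
    n, m = Y.shape
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    theta1_hat, theta2_hat = Y @ Vt[0], Y @ Vt[1]
    lam1 = theta1_hat @ theta1_hat / n       # estimates mu_1^2 + sigma_1^2 + sigma^2
    lam2 = theta2_hat @ theta2_hat / n       # estimates mu_2^2 + sigma_2^2 + sigma^2
    S_n = np.sum(Y ** 2) / n - lam1 - lam2   # Equation (12), trace reading
    return lam1, lam2, S_n / (m - 2)

rng = np.random.default_rng(4)
n, m, sigma = 4000, 12, 2.0
phi1 = np.ones(m) / np.sqrt(m)
phi2 = np.tile([1.0, -1.0], m // 2) / np.sqrt(m)
theta1 = rng.normal(10.0, 1.0, size=n)       # mu_1^2 + sigma_1^2 = 101
theta2 = rng.normal(0.0, 3.0, size=n)        # mu_2^2 + sigma_2^2 = 9
Y = np.outer(theta1, phi1) + np.outer(theta2, phi2) + sigma * rng.standard_normal((n, m))

lam1_hat, lam2_hat, sigma2_hat = eigenvalue_estimates(Y)
```

With these hypothetical settings the three estimates should be close to 105, 13 and 4, respectively.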

3. HYPOTHESIS TESTING

In this section, we consider testing the null hypothesis $H_0: \underline{\mu}_2 = \underline{0}$. The second dimension $\underline{\theta}_2^{(0)}\underline{\phi}_2^{(0)T}$ in Model (2) does not provide meaningful information on the mean structure of the data matrix under this null hypothesis. We expect $\hat{\underline{\theta}}_2$ to have zero mean under the null hypothesis and non-zero mean under the alternative hypothesis, because $\hat{\theta}_{2i} \xrightarrow{L} \theta_{2i}^{(0)} + \underline{\varepsilon}_i^T\underline{\phi}_2^{(0)}$ as n → ∞. Motivated by this, we construct test statistics based on $\{\hat{\theta}_{2i}, i = 1, 2, \ldots, n\}$. We consider three specific test statistics in the following sub-sections.

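3.1. Test on a Target Direction

Consider

$$T_{\underline{a}} = n^{-1}\underline{a}^T\hat{\underline{\theta}}_2, \tag{13}$$

for any $\underline{a} = (a_1, \ldots, a_n)^T \in \mathbb{R}^n$ such that $\underline{a}^T\underline{\mu}_1 = 0$, $\|\underline{a}\|^2 = n$ and $\max_{1 \le j \le n} a_j^2/n \to 0$. We choose a vector $\underline{a}$ such that $\underline{a} \perp \underline{\mu}_1$ because $\underline{\mu}_1$ is orthogonal to $\underline{\mu}_2$ and we want to test the null hypothesis that $\underline{\mu}_2 = \underline{0}$. We use $\underline{1}_n$ to indicate the n-dimensional vector with all the components equal to 1. From the asymptotic properties discussed in Section 2, we have the following theorem.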

Theorem 3.1

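If the observations $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_n$ are drawn from Model (2) and Assumptions (M1)–(M6) hold, and $\underline{a} \in \mathbb{R}^n$ is a vector satisfying $\underline{a}^T\underline{\mu}_1 = 0$, $\underline{a}^T\underline{a} = n$ and $\max_{1 \le j \le n} a_j^2/n \to 0$, then

$$n^{-1/2}\underline{a}^T\hat{\underline{\theta}}_2/\hat{\sigma} \xrightarrow{L} N(0, 1)$$

under the null hypothesis that $\underline{\mu}_2 = \underline{0}$, where

$$\hat{\sigma}^2 = n^{-1}\|\hat{\underline{\theta}}_2\|^2 - \hat{\theta}_{2\cdot}^2, \quad \text{and} \quad \hat{\theta}_{2\cdot} = n^{-1}\hat{\underline{\theta}}_2^T\underline{1}_n. \tag{14}$$

The power of the test depends on how far $\underline{a}^T\underline{\mu}_2$ deviates from zero. As to the target direction $\underline{a}$, it is usually determined by some specific comparison in practice. We will give examples of choosing $\underline{a}$ in Section 5.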
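The studentized statistic of Theorem 3.1 takes only a few lines to compute. In the sketch below the function name `target_direction_test`, the rescaling of the direction to squared norm n, and the use of a two-sided normal p-value are assumptions of this illustration; the caller is responsible for supplying a direction orthogonal to the first mean vector.

```python
import math
import numpy as np

def target_direction_test(theta2_hat, a):
    """Return the studentized statistic n^{-1/2} a^T theta2_hat / sigma_hat and a p-value."""
    theta2_hat = np.asarray(theta2_hat, dtype=float)
    a = np.asarray(a, dtype=float)
    n = theta2_hat.size
    a = a * math.sqrt(n) / np.linalg.norm(a)           # enforce ||a||^2 = n
    theta_bar = theta2_hat.mean()                      # theta_hat_{2.} in (14)
    sigma_hat = math.sqrt(np.mean(theta2_hat ** 2) - theta_bar ** 2)
    z = (a @ theta2_hat) / (math.sqrt(n) * sigma_hat)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided N(0, 1) tail probability
    return z, p_value

# Hypothetical second-dimension scores for n = 8 arrays and an alternating contrast.
theta2_hat = np.array([3.0, -1.0, 2.0, -2.0, 3.0, -1.0, 2.0, -2.0])
a = np.tile([1.0, -1.0], 4)
z, p = target_direction_test(theta2_hat, a)
```

For these numbers the inner product is 16 and the estimated variance is 4.25, so the statistic equals 16 divided by the square root of 34.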

3.1.1. A Practical Solution When $\underline{\mu}_1$ is Unknown

In practice, the true value of the mean vector $\underline{\mu}_1$ is unknown, but it can be estimated when extra group information is available. Assume that the observations can be divided into p groups such that the $\mu_{1i}$ are equal within each group. We assume that $\mu_{1,n_{t-1}+1} = \cdots = \mu_{1n_t}$, for t = 1, 2, …, p, where $n_0 = 0 < n_1 < \cdots < n_{p-1} < n_p = n$, and assume that p is fixed but $n_t - n_{t-1} \to \infty$ when n → ∞. For microarray data, those arrays that use the same types of tissues may form one group, and specific examples will be discussed in Section 5.

Suppose that $\hat{\mu}_{1n_t}$ is a consistent estimator of $\mu_{1n_t}$. Let

$$\hat{\underline{\mu}}_1 = (\hat{\mu}_{1n_1}, \ldots, \hat{\mu}_{1n_1}, \hat{\mu}_{1n_2}, \ldots, \hat{\mu}_{1n_2}, \ldots, \hat{\mu}_{1n_p}, \ldots, \hat{\mu}_{1n_p})^T,$$

where the number of $\hat{\mu}_{1n_t}$ in the above vector is $n_t - n_{t-1}$, t = 1, 2, …, p. Furthermore, when we choose a vector $\hat{\underline{a}}$ orthogonal to $\hat{\underline{\mu}}_1$, we only consider the candidates whose entries can be divided into groups and are equal to each other within each group in the form of

$$\hat{\underline{a}} = (\hat{a}_{n_1}, \ldots, \hat{a}_{n_1}, \hat{a}_{n_2}, \ldots, \hat{a}_{n_2}, \ldots, \hat{a}_{n_p}, \ldots, \hat{a}_{n_p})^T.$$

With $\hat{\underline{a}}$ convergent to $\underline{a}$, the statistic $T_{\hat{\underline{a}}} = n^{-1}\hat{\underline{a}}^T\hat{\underline{\theta}}_2$ has the same Bahadur representation as if we chose a vector $\underline{a}$ orthogonal to $\underline{\mu}_1$ under the null hypothesis. Hence, when we construct the tests in Section 3, we can use $\hat{\underline{a}}$ that is orthogonal to $\hat{\underline{\mu}}_1$. The choice of $\hat{\underline{a}}$ is not unique, and is best chosen in response to specific alternatives of interest in a given experiment.
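One way to build a group-wise constant direction orthogonal to the estimated first mean vector is to start from a group contrast of interest and remove its component along that vector. This is a sketch under stated assumptions: the helper name `group_direction`, the starting contrast, and the numerical group means are hypothetical, and the construction is only one of many valid choices.

```python
import numpy as np

def group_direction(labels, group_means, contrast):
    """Group-wise constant a_hat with a_hat orthogonal to mu_hat_1 and ||a_hat||^2 = n."""
    labels = np.asarray(labels)
    mu = np.asarray(group_means, dtype=float)[labels]   # expanded mu_hat_1
    c = np.asarray(contrast, dtype=float)[labels]       # expanded contrast of interest
    a = c - (c @ mu) / (mu @ mu) * mu                   # remove the component along mu_hat_1
    # If the contrast is proportional to the group means, a is zero and no direction exists.
    return a * np.sqrt(len(a) / (a @ a)), mu

# Hypothetical layout: 20 arrays, two groups of ten, interleaved in blocks of five.
labels = np.array([0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)
a_hat, mu_hat = group_direction(labels, group_means=[4000.0, 5000.0], contrast=[1.0, -1.0])
```

With two groups of equal size the result is proportional to the vector that places the second group's mean on the first group and the negative of the first group's mean on the second, whatever starting contrast is used.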

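3.2. A χ2 Test with Multiple Directions

As shown in Section 3.1, the power of the test $T_{\underline{a}}$ depends on the direction $\underline{a}$ that we choose. In some cases, we may consider several directions simultaneously. Let us consider a k × n matrix A, where k is a fixed integer and k < n. The ith row of the matrix A is denoted as $\underline{a}_i$ and the jth component of $\underline{a}_i$ is denoted as $a_{ij}$ for i = 1, …, k and j = 1, …, n. Assume that $\underline{a}_i \perp \underline{a}_j$ for $i \ne j$, $\underline{a}_i \perp \underline{\mu}_1$, $\underline{a}_i^T\underline{a}_i = n$ and $\max_{1 \le j \le n} a_{ij}^2/n \to 0$ for each i. Then, we propose the test statistic

$$T_A = n^{-1}\|A\hat{\underline{\theta}}_2\|^2/\hat{\sigma}^2,$$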

with the following result.

Theorem 3.2

Under the assumptions of Theorem 3.1, and for the matrix A described in this subsection, we have $T_A \to \chi_k^2$ in distribution under the null hypothesis that $\underline{\mu}_2 = \underline{0}$, where $\chi_k^2$ has the chi-square distribution with k degrees of freedom.

In practice, given observations, we should not choose k that is close to n, because

$$\|A\hat{\underline{\theta}}_2\|^2 = n^2\hat{\sigma}^2$$

when k = n − 1, and the variations accumulated from approximation errors will ruin the chi-square approximation.
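The degeneracy at k = n − 1 can be checked numerically. The sketch below assumes that the first mean vector is constant across arrays, so that directions orthogonal to it are orthogonal to the vector of ones; the QR-based construction of the rows and the function name `chi_square_statistic` are illustrative choices only.

```python
import numpy as np

def chi_square_statistic(A, theta2_hat):
    """T_A = n^{-1} ||A theta2_hat||^2 / sigma_hat^2 with sigma_hat^2 as in (14)."""
    n = theta2_hat.size
    sigma2_hat = np.mean(theta2_hat ** 2) - theta2_hat.mean() ** 2
    return np.sum((A @ theta2_hat) ** 2) / (n * sigma2_hat)

n = 8
rng = np.random.default_rng(3)
# Orthonormal basis whose first vector is proportional to the vector of ones.
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))]))
A_full = np.sqrt(n) * Q[:, 1:].T         # k = n - 1 orthogonal rows, each with squared norm n

theta2_hat = rng.normal(size=n)          # any vector of scores gives the same answer
T_full = chi_square_statistic(A_full, theta2_hat)
```

Here `T_full` equals n regardless of the data, so the statistic carries no information when k = n − 1; a small fixed k avoids this.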

3.3. Bootstrap Calibration

Sometimes, the sample size n is too small for the asymptotic approximations to perform well. Hence, we propose a finite sample adjustment to control the type I errors.

A bootstrap method, which avoids resampling from the rows or columns of the data matrix, to test the null hypothesis that μ̱2 = 0 can be described as follows:

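  1. Draw n indices $\{j_1, \ldots, j_n\}$ with replacement from {1, 2, …, n} and let $\hat{\theta}_{2i}^{*} = \hat{\theta}_{2j_i} - \hat{\theta}_{2\cdot}$ (i = 1, 2, …, n), where $\hat{\theta}_{2\cdot} = n^{-1}\sum_{i=1}^{n}\hat{\theta}_{2i}$, and then evaluate $T_{\underline{a}}^{*}$ as
    $$T_{\underline{a}}^{*} = n^{-1/2}\underline{a}^T\hat{\underline{\theta}}_2^{*} \Big/ \left(n^{-1}\|\hat{\underline{\theta}}_2^{*}\|^2 - \hat{\theta}_{2\cdot}^{*2}\right)^{1/2},$$
    where $\hat{\underline{\theta}}_2^{*} = (\hat{\theta}_{21}^{*}, \ldots, \hat{\theta}_{2n}^{*})^T$ and $\hat{\theta}_{2\cdot}^{*} = n^{-1}\sum_{i=1}^{n}\hat{\theta}_{2i}^{*}$;
  2. Repeat Step 1 B times to get the test statistics $T_{\underline{a},b}^{*}$, b = 1, 2, …, B. We estimate the bootstrap p-value by:
    $$p = B^{-1}\sum_{b=1}^{B} I\{|T_{\underline{a},b}^{*}| \ge |T_{\underline{a}}|\}.$$
    To see that this bootstrap method works, we note that
    $$n^{-1/2}\underline{a}^T\hat{\underline{\theta}}_2 = \left(n^{-1/2}\sum_{i=1}^{n} a_i\underline{\phi}_2^{(0)T}\underline{y}_i\right) + o_p(1), \qquad n^{-1/2}\underline{a}^T\hat{\underline{\theta}}_2^{*} = \left(n^{-1/2}\sum_{i=1}^{n} a_i\underline{\phi}_2^{(0)T}\underline{y}_i^{*}\right) + o_p(1),$$
    where $\underline{y}_i^{*} = \underline{y}_{j_i}$, and
    $$n^{-1}\|\hat{\underline{\theta}}_2\|^2 - \hat{\theta}_{2\cdot}^2 - \left(n^{-1}\|\hat{\underline{\theta}}_2^{*}\|^2 - \hat{\theta}_{2\cdot}^{*2}\right) = o_p(1).$$
    Since
    $$\left(n^{-1}\sum_{i=1}^{n}(\underline{\phi}_2^{(0)T}\underline{y}_i)^2 - \left[n^{-1}\sum_{i=1}^{n}\underline{\phi}_2^{(0)T}\underline{y}_i\right]^2\right)^{-1/2}\left(n^{-1/2}\sum_{i=1}^{n} a_i\underline{\phi}_2^{(0)T}\underline{y}_i\right) \xrightarrow{L} N(0, 1)$$
    under the null hypothesis, the bootstrap method works by Theorem 1 of Mammen (1991). Our proposed bootstrap method acts on $\hat{\underline{\theta}}_2$, and avoids repeated computations of the SVD. The same idea can be used for $T_A$.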
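The two bootstrap steps above can be sketched as follows. The sketch assumes that the observed statistic entering the p-value is the studentized version from Theorem 3.1; the function names, the number of replicates and the simulated scores are hypothetical.

```python
import numpy as np

def studentized(a, theta):
    """n^{-1/2} a^T theta divided by the plug-in standard deviation of theta."""
    n = theta.size
    s2 = np.mean(theta ** 2) - theta.mean() ** 2
    return (a @ theta) / np.sqrt(n * s2)

def bootstrap_p_value(theta2_hat, a, B=2000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = theta2_hat.size
    t_obs = studentized(a, theta2_hat)            # assumed studentized observed statistic
    centered = theta2_hat - theta2_hat.mean()     # theta_hat_2i minus its average
    t_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # Step 1: indices drawn with replacement
        t_star[b] = studentized(a, centered[idx])
    return np.mean(np.abs(t_star) >= abs(t_obs))  # Step 2: bootstrap p-value

rng = np.random.default_rng(5)
n = 40
a = np.tile([1.0, -1.0], n // 2)
theta_signal = 3.0 * a + rng.standard_normal(n)   # hypothetical scores with a clear pattern
p_signal = bootstrap_p_value(theta_signal, a, B=2000, rng=rng)
theta_null = rng.standard_normal(n)               # hypothetical scores with no pattern
p_null = bootstrap_p_value(theta_null, a, B=2000, rng=rng)
```

Only the vector of second-dimension scores is resampled, so no SVD is recomputed inside the loop. A resample in which every index coincides would give a zero variance estimate; this is extremely unlikely for the sample sizes considered and is not handled in this sketch.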

3.4. Test based on Maximum over Directions

If we do not have guided directions to look for patterns in μ̱2 , we may wish to search over a larger number of directions. The chi-square test in Section 3.2 does not apply when k is large. However, the maximum over k = n − 1 directions

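$$M_n = \max_{1 \le j \le n-1} n^{-1/2}\underline{a}_j^T\hat{\underline{\theta}}_2, \tag{15}$$

has a simple limiting distribution when $\underline{\varepsilon}_i$ and $\underline{\theta}_2^{(0)}$ are normally distributed.

Let

$$c_n = \sqrt{2\ln(n-1)}, \quad \text{and} \quad b_n = c_n - 2^{-1}c_n^{-1}\ln(4\pi\ln(n-1)). \tag{16}$$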

Theorem 3.3

Assume the conditions of Theorem 3.1, with the additional assumption that $\underline{\theta}_2^{(0)}$ and $\underline{\varepsilon}_i$ are normally distributed. For any matrix A as described in subsection 3.2 with k = n − 1, we have $P(c_n(M_n/\hat{\sigma} - b_n) \le x) \to e^{-e^{-x}}$ as n → ∞ under the null hypothesis that $\underline{\mu}_2 = \underline{0}$.

Under the alternative hypothesis, we should observe larger values of $M_n$. Furthermore, the convergence rate of the extreme statistic is discussed in Section 4.6 of Leadbetter, Lindgren and Rootzen (1983). Based on their arguments, we can use $[\Phi(u)]^{n-1}$ to approximate the probability $P(M_n/\hat{\sigma} \le u)$ in computing the p-values of the proposed test here.

The normality of $\underline{\theta}_2^{(0)}$ and $\underline{\varepsilon}_i$ is not a necessary condition for the limiting distribution to hold. Our simulation results not reported in this paper suggest that Theorem 3.3 may hold in a much broader setting.
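The p-value approximation for the maximum test and the normalizing constants in (16) are straightforward to compute with the standard library. The function names below are hypothetical, and the sketch takes the observed maximum and the scale estimate as given inputs.

```python
import math

def normal_cdf(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def max_test_p_value(M_n, sigma_hat, n):
    """Approximate p-value 1 - Phi(u)^(n-1) with u = M_n / sigma_hat."""
    return 1.0 - normal_cdf(M_n / sigma_hat) ** (n - 1)

def gumbel_constants(n):
    """Normalizing constants c_n and b_n from (16)."""
    c_n = math.sqrt(2.0 * math.log(n - 1))
    b_n = c_n - math.log(4.0 * math.pi * math.log(n - 1)) / (2.0 * c_n)
    return c_n, b_n

p_small = max_test_p_value(0.0, 1.0, 3)    # 1 - 0.5**2 = 0.75
c_17, b_17 = gumbel_constants(17)
```

Larger observed maxima give smaller p-values, in line with the remark that large values of the maximum are expected under the alternative.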

4. SIMULATIONS

To assess the performance of the proposed tests in the present paper, we report Monte Carlo simulation results by simulating data from Model (2), with the following specifications. The sizes of the parameters are chosen to mimic some real microarray data.

  1. $\underline{\theta}_1^{(0)}$ is generated from the multivariate $N(\underline{\mu}_1, 150{,}000\, I_n)$, where $\underline{\mu}_1 = (4{,}500, 4{,}500, \ldots, 4{,}500)^T$;

  2. $\underline{\theta}_2^{(0)}$ is generated from $N(\underline{\mu}_2, 150{,}000\, I_n)$, where $\underline{\mu}_2$ is equal to either $(0, 0, \ldots, 0)^T$ as the null hypothesis or $(125, -125, \ldots, 125, -125)^T$ as an alternative hypothesis;

  3. $\underline{\phi}_1 = (2\sqrt{3})^{-1}(1, 1, \ldots, 1)^T$ and $\underline{\phi}_2 = (2\sqrt{3})^{-1}(1, -1, \ldots, 1, -1)^T$ are of dimension 12;

  4. The errors $\varepsilon_{ij}$ (i = 1, 2, …, n, j = 1, 2, …, 12) are drawn from three different distributions in different experiments: the normal distribution N(0, 5,000), the t-distribution with 5 degrees of freedom multiplied by $10\sqrt{30}$, and the centered χ2-distribution $50(Z^2 - 1)$, where $Z \sim N(0, 1)$.
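The three error distributions in item 4 share the same variance, which the following sketch makes explicit. It assumes that the scale factor of the t-distribution is 10 times the square root of 30, the value for which the variance equals 5,000; the function name `draw_errors` is hypothetical.

```python
import numpy as np

def draw_errors(kind, shape, rng):
    """Errors with variance 5,000 from the three distributions in item 4."""
    if kind == "normal":
        return rng.normal(0.0, np.sqrt(5000.0), size=shape)
    if kind == "t":
        return 10.0 * np.sqrt(30.0) * rng.standard_t(5, size=shape)
    if kind == "chi2":
        return 50.0 * (rng.standard_normal(shape) ** 2 - 1.0)
    raise ValueError(f"unknown error distribution: {kind}")

# Theoretical variances: Var(t_5) = 5/3 and Var(Z^2 - 1) = 2.
var_t = (10.0 * np.sqrt(30.0)) ** 2 * 5.0 / 3.0
var_chi2 = 50.0 ** 2 * 2.0

rng = np.random.default_rng(6)
eps = draw_errors("normal", (200000,), rng)
```

Matching the variances means that differences in size and power across the three columns of the simulation tables reflect tail behavior and skewness rather than noise level.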

4.1. Test on a target direction

Four different sample sizes are used: n = 8, 16, 32 and 128. Furthermore, we chose two different vectors $\underline{a}$ to compare the performance of the test $T_{\underline{a}}$ discussed in Section 3.

4.1.1. Case 1

In the first case, we choose $\underline{a} = (1, -1, \ldots, 1, -1)^T$, which is the ideal choice for detecting the alternative in our settings. We draw 5,000 data sets, and the 5,000 p-values are calculated based on the limiting distribution in Theorem 3.1. For the test $T_{\underline{a}}$, the type I errors are close to the nominal level of 0.05 when n ≥ 16. Also clear from Table 1 is that the power of the test is decent even when the sample size is as small as 8.

TABLE 1.

Type I errors and powers of the target direction test are listed with increasing sample size n. The errors are generated from three different distributions.

          Null                        Alternative
Size      Normal   t        χ2       Normal   t        χ2
8a        0.0560   0.0510   0.0430   0.6362   0.6088   0.5718
16a       0.0542   0.0492   0.0426   0.9540   0.9308   0.9004
32a       0.0470   0.0500   0.0460   0.9998   0.9974   0.9940
128a      0.0522   0.0508   0.0532   1.0000   1.0000   1.0000
8b        0.0552   0.0568   0.0458   0.4202   0.4104   0.3854
16b       0.0530   0.0500   0.0490   0.8358   0.8190   0.7840
32b       0.0546   0.0494   0.0440   0.9934   0.9890   0.9854
128b      0.0522   0.0514   0.0486   1.0000   0.9998   1.0000

a The results are from Case 1.
b The results are from Case 2.

4.1.2. Case 2

We choose

$$\underline{a} = 2^{-1}\sqrt{3}\,(1, -1, \ldots, 1, -1)^T + 2^{-1}(1, \ldots, 1, -1, \ldots, -1)^T,$$

to see whether the test has meaningful power when $\underline{a}$ is not so well chosen to target the true pattern in $\underline{\mu}_2$. The results are given in the lower half of Table 1. A comparison with Case 1 shows that the power of the test $T_{\underline{a}}$ is sensitive to the choice of $\underline{a}$ for small n, so a good target direction based on the nature of the experiment or the knowledge of the experimenter is very valuable.
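A quick check shows why the Case 2 direction is valid but less powerful than the Case 1 direction. The sketch assumes that the second component of the Case 2 vector has its first half equal to +1 and its second half equal to −1, and uses n = 16 as an example.

```python
import numpy as np

n = 16
alt = np.tile([1.0, -1.0], n // 2)                           # Case 1 direction
half = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])   # assumed half-and-half pattern
a_case2 = 0.5 * np.sqrt(3.0) * alt + 0.5 * half              # Case 2 direction

mu1 = np.full(n, 4500.0)                                     # mean of the first dimension
mu2 = 125.0 * alt                                            # alternative mean in Section 4

signal_case1 = alt @ mu2 / n                                 # 125
signal_case2 = a_case2 @ mu2 / n                             # 125 * sqrt(3) / 2
```

Both directions are orthogonal to the first mean vector and have squared norm n, but the Case 2 direction captures only a fraction of about 0.87 of the signal, which is consistent with the lower power in the second half of Table 1.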

4.2. The χ2 test

For the χ2 test of Section 3.2, four sample sizes n = 8, 16, 32, 64 are used with the Monte Carlo sample size of 5,000. We generated k = 4 vectors, which are orthogonal to $\underline{\mu}_1$, orthogonal to each other, and are of length n. The algorithm to generate the vectors can be described as follows.

$$A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \otimes \cdots \otimes \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},$$

where ⊗ is the Kronecker product, and the product has $\log_2 n$ factors, so that A is an n × n matrix. After the first column of A is deleted, the next k = 4 columns are the vectors we use in the χ2 test. The estimated type I errors and powers of the test are listed in Table 2. It is clear that the type I error is not close to 0.05 when n ≤ 16. In fact, we find that the type I errors in Table 3 from the limiting distributions of the $T_{\underline{a}}$ and χ2 tests can be too high or too low when the sample sizes n are small. The bootstrap method manages to control the type I errors even at small samples.
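The Kronecker construction can be sketched as follows, assuming that enough 2 × 2 factors are used for A to be an n × n matrix, which requires n to be a power of two. The function name `hadamard` is hypothetical.

```python
import numpy as np

def hadamard(n):
    """n x n matrix obtained by repeated Kronecker products of [[1, 1], [1, -1]]."""
    H2 = np.array([[1, 1], [1, -1]])
    A = np.array([[1]])
    while A.shape[0] < n:
        A = np.kron(A, H2)
    if A.shape[0] != n:
        raise ValueError("n must be a power of two")
    return A

n = 16
A = hadamard(n)
directions = A[:, 1:5].T      # drop the all-ones first column, keep the next k = 4 columns
```

The first column consists of ones, so the remaining columns are orthogonal to a constant mean vector, orthogonal to each other, and have squared norm n.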

TABLE 2.

Type I errors and powers of the χ2 test are listed with increasing sample size n. The errors are drawn from three distributions.

          Null                        Alternative
Size      Normal   t        χ2       Normal   t        χ2
8         0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
16        0.0296   0.0264   0.0236   0.6406   0.6104   0.5734
32        0.0418   0.0384   0.0394   0.9918   0.9846   0.9822
64        0.0464   0.0504   0.0422   1.0000   0.9990   1.0000

TABLE 3.

Type I errors and powers are listed for comparison between the bootstrap and the large-sample approximation. The errors are generated from three different distributions.

Type I Error

          Asymptotic Approximation    Bootstrap
n         Normal   t        χ2       Normal   t        χ2
6a        0.1174   0.1096   0.1004   0.0420   0.0416   0.0350
8a        0.0552   0.0568   0.0458   0.0484   0.0520   0.0440
8b        0.0000   0.0000   0.0000   0.0406   0.0396   0.0256
16b       0.0296   0.0264   0.0236   0.0520   0.0430   0.0420

Estimated Power

          Asymptotic Approximation    Bootstrap
n         Normal   t        χ2       Normal   t        χ2
6a        0.4950   0.4820   0.4646   0.2560   0.2380   0.2194
8a        0.4202   0.4104   0.3854   0.3738   0.3670   0.3480
8b        0.0000   0.0000   0.0000   0.1508   0.1506   0.1338
16b       0.6404   0.6104   0.5734   0.7142   0.6912   0.6746

a The results are from the test on the target direction $2^{-1}\sqrt{3}\,(1, -1, \ldots, 1, -1)^T + 2^{-1}(1, \ldots, 1, -1, \ldots, -1)^T$.
b The results are from the χ2 test based on the four target directions.

4.3. Test based on maximum over directions

Similar to Table 2, Table 4 shows the performance of the test $M_n$ of Section 3.4 based on the limiting distributions. The test is conservative for small n, but remains quite powerful in the study. The test can be used even when the normality assumption in Theorem 3.3 is violated. However, our simulation results that are not reported here suggest that if $\underline{\theta}_2^{(0)}$ and $\underline{\varepsilon}_i$ do not have finite 4th moments, the limiting distribution would not take effect for realistic sample sizes considered in this paper.

TABLE 4.

Type I errors and powers of the test based on maximum over directions are listed with increasing sample size n. The errors are drawn from three distributions.

          Null                        Alternative
Size      Normal   t        χ2       Normal   t        χ2
8         0.0018   0.0012   0.0008   0.0270   0.0268   0.0232
16        0.0306   0.0256   0.0190   0.6992   0.6620   0.6216
32        0.0378   0.0376   0.0264   0.9850   0.9766   0.9666
64        0.0428   0.0404   0.0362   1.0000   0.9988   0.9994

5. CASE STUDIES

In this section, we analyze two microarray data sets. We apply our testing methods to search for genes with potentially complicated mean structure, and further analyze some of those genes to understand the possible causes. The data are quantile normalized in each case.

5.1. Example 1

We considered the GeneChip data (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5350) obtained from the recent MicroArray Quality Control (MAQC) project and used in Lin, et al. (2006). We have a total of 20 microarrays (HG-U133-Plus-2.0), generated from five colorectal adenocarcinomas and five matched normal colonic tissues with 1 technical replicate at each of two laboratories involved in the MAQC project.

In this study, we use PM as the intensity measure in Y, and carry out the SVD to get the two largest singular values $\hat{\lambda}_1 > \hat{\lambda}_2$. We focus on 350 probe-sets with the highest ratios $\hat{\lambda}_2^2/\hat{\lambda}_1^2$ (with all those ratios above 1/10). For each probe-set, the probe-level microarray data are stored in a matrix, where the rows correspond to the probes and the columns correspond to the arrays. The intensities from the normal tissues are entered in columns 1–5 and 11–15, and those from the tumors are entered in the remaining columns.
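The screening ratio can be illustrated with the leading singular values reported for the three probe-sets discussed in Sections 5.1.1 to 5.1.3. The short sketch below only recomputes the ratio from those reported values; the function name `squared_ratio` is hypothetical.

```python
# First two singular values reported for three probe-sets in Sections 5.1.1-5.1.3.
reported = {
    "214974_x_at": (3387.0, 1388.0),
    "227899_at": (3178.0, 1011.0),
    "1560296_at": (5470.0, 1748.0),
}

def squared_ratio(lam1, lam2):
    """Ratio of the squared second singular value to the squared first one."""
    return lam2 ** 2 / lam1 ** 2

ratios = {name: squared_ratio(*lams) for name, lams in reported.items()}
flagged = [name for name, r in ratios.items() if r > 0.1]
```

All three ratios exceed 1/10 (roughly 0.17, 0.10 and 0.10), in agreement with the screening threshold stated above.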

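We choose a target direction to contrast the two groups in the study. In particular, we use

$$\underline{a}_1 \propto (\hat{\mu}_2, \ldots, \hat{\mu}_2, -\hat{\mu}_1, \ldots, -\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_2, -\hat{\mu}_1, \ldots, -\hat{\mu}_1)^T,$$

where $\hat{\mu}_1$ is taken to be the median of $\hat{\theta}_{1i}$ of the first group (normal tissues), and $\hat{\mu}_2$ the median of $\hat{\theta}_{1i}$ of the other group. Hence, we have $\underline{a}_1 \perp \hat{\underline{\mu}}$, where

$$\hat{\underline{\mu}} = (\hat{\mu}_1, \ldots, \hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_2, \hat{\mu}_1, \ldots, \hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_2)^T.$$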

By the statistical test $T_{\underline{a}}$ developed in Section 3.1, we find that 81 out of 350 probe-sets are detected as individually significant at the 0.05 level. Out of those, 36 probe-sets remain significant after the multiple test adjustment of Benjamini and Hochberg (1995).
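The Benjamini and Hochberg (1995) adjustment can be sketched in a few lines. This is a generic implementation of the step-up procedure written for illustration, with a hypothetical function name and toy p-values; it is not the code used to produce the counts above.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)                   # p_(i) * m / i
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])   # toy p-values
significant = adj <= 0.05
```

A probe-set is declared significant after adjustment when its adjusted p-value does not exceed the chosen level, here 0.05.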

When we plot $(\hat{\theta}_{1i}, \hat{\theta}_{2i})$, i = 1, 2, …, 20, and $(\hat{\phi}_{1j}, \hat{\phi}_{2j})$, j = 1, 2, …, m, for those probe-sets that are detected as significant, some interesting facts can be observed. We now zoom in on three of those probe-sets.

5.1.1. Probe-set “214974_x_at”

In the study, the probe-set “214974_x_at” is used to measure the expression level of Gene “CXCL5”. Our test gave the p-value of $1.11 \times 10^{-3}$, the adjusted p-value of $2.38 \times 10^{-2}$, and the q-value, as proposed in Storey (2003), of $5.77 \times 10^{-4}$, offering significant evidence against the uni-dimensional model. The first four singular values of the data matrix are (3387, 1388, 361, 168). As mentioned in the Introduction with Figure 1, the arrays cannot be easily separated by the first right singular vector, but if we use $(\hat{\theta}_{1i}, \hat{\theta}_{2i})$ jointly, the arrays are well separated in the 2-dimensional space. The usual one-dimensional index of the probe-set is insufficient to summarize the gene expression of “CXCL5”.

Further inspection of the data shows that the intensities from Probe 3 are much higher than those of the other probes, and Probe 3 dominantly contributes to the values of θ̂1i. By the Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nlm.nih.gov/blast/Blast.cgi), we found that Probe 3 is represented in both Gene “CXCL5” and Gene “N-PAC”, but the other probes were confirmed as specific to Gene “CXCL5”. We further confirmed that the intensities of Probe 3 were highly correlated with the intensities of several probes in the probe-set “208506_x_at” (designed by Affymetrix to measure the expression level of Gene “N-PAC”), and thus, we need to take Probe 3 with caution. If Probe 3 were removed from the probe-set, we would have seen a clear separation of the two groups from the first singular vector; see Figure 2. In this case, the second singular vector from the whole probe-set appears to be a better summary of Gene “CXCL5”. We note that Gene “CXCL5” has been indicated as an important gene for colorectal cancer in the literature. For example, Dimberg, et al. (2007) observed significantly higher expression levels of the protein encoded by “CXCL5” in colorectal cancer tumors than in normal tissue, so the multi-dimensionality of the probe-set “214974_x_at” flagged through our statistical work can offer biologically relevant information.

Fig 2.

Scatterplot of singular vectors for the probe-set “214974_x_at” after we remove Probe 3. See Figure 1 for more details about this figure.

5.1.2. Probe-set “227899_at”

The probe-set “227899_at” is designed by Affymetrix to measure the expression level of Gene “VIT”. Our test gave the p-value $8.78 \times 10^{-4}$, the adjusted p-value $2.38 \times 10^{-2}$, and the q-value $5.77 \times 10^{-4}$. The first four singular values are (3178, 1011, 227, 77).

From Figure 3, we note that differential expression can be detected from the second right singular vector, but not the first. From the probe-level data, we find that the intensities of Probe 4 and Probe 7 are much higher than those of the other probes, and these two probes dominate the first two singular vectors. Furthermore, we confirmed by BLAST both probes as specific for measuring the expression level of Gene “VIT”, and so did the other probes. As a double check, we applied the re-mapping method proposed by Lu, et al. (2007) and confirmed all the probes in this probe-set were specified for the three transcript variants for Gene “VIT”. Therefore, a 2-dimensional summary of the gene appears necessary for this probe-set.

Fig 3.

Scatterplot of singular vectors for the probe-set “227899_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues.

To make the point further, we provide the absolute value of percentages calculated from M1 and M2 in Table 5, where

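$$M_1 = \begin{pmatrix} \hat{\theta}_{11}\hat{\phi}_{11} & \cdots & \hat{\theta}_{11}\hat{\phi}_{1m} \\ \vdots & \ddots & \vdots \\ \hat{\theta}_{1n}\hat{\phi}_{11} & \cdots & \hat{\theta}_{1n}\hat{\phi}_{1m} \end{pmatrix},$$

and

$$M_2 = \begin{pmatrix} \hat{\theta}_{21}\hat{\phi}_{21} & \cdots & \hat{\theta}_{21}\hat{\phi}_{2m} \\ \vdots & \ddots & \vdots \\ \hat{\theta}_{2n}\hat{\phi}_{21} & \cdots & \hat{\theta}_{2n}\hat{\phi}_{2m} \end{pmatrix}.$$

TABLE 5.

A summary of the absolute values of $\hat{\theta}_{2i}\hat{\phi}_{2j}/(\hat{\theta}_{1i}\hat{\phi}_{1j})$ in percentage by probes.

227899_at Min.(%) Q1(%) Med.(%) Q3(%) Max.(%)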
Probe 1 0.06 2.48 5.03 6.56 10.92
Probe 2 0.01 0.32 0.64 0.84 1.39
Probe 3 0.04 2.00 4.04 5.27 8.78
Probe 4 0.39 17.37 35.20 45.88 76.40
Probe 5 0.07 2.97 6.02 7.85 13.06
Probe 6 0.04 1.84 3.72 4.85 8.08
Probe 7 0.22 10.01 20.29 26.44 44.02
Probe 8 0.04 1.77 3.59 4.68 7.79
Probe 9 0.03 1.29 2.62 3.42 5.69
Probe 10 0.11 4.74 9.61 12.52 20.85
Probe 11 0.17 7.69 15.59 20.32 33.83

It is clear that the information contained in the second dimension for Probes 4 and 7 is important, because in more than half of the arrays their contributions from the second dimension are more than 20% of those from the first. The joint use of θ̂1i and θ̂2i gives a more complete picture of the expression profile of Gene “VIT”.
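The summary in Table 5 can be reproduced in outline from the singular value decomposition of a probe-set's data matrix. The following is a minimal sketch, assuming the matrix `Y` holds arrays in rows and probes in columns, and assuming each singular value is absorbed into its rank-one layer. The function name and the use of NumPy are illustrative choices, not the authors' code.

```python
import numpy as np

def second_to_first_ratio_summary(Y):
    """Per-probe summary of |M2 / M1| (in percent) across arrays.

    Y: n x m array (rows = arrays, columns = probes); assumed layout.
    Returns an m x 5 array with columns Min, Q1, Median, Q3, Max.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Rank-one layers; absorbing the singular values here is an assumption.
    M1 = s[0] * np.outer(U[:, 0], Vt[0])
    M2 = s[1] * np.outer(U[:, 1], Vt[1])
    ratio = 100.0 * np.abs(M2 / M1)  # entrywise ratio, n x m
    # Percentiles over arrays (axis 0), then one row per probe.
    return np.percentile(ratio, [0, 25, 50, 75, 100], axis=0).T
```

Because each entry of the ratio factors as |θ̂2i/θ̂1i| · |ϕ̂2j/ϕ̂1j|, the rows of the summary are proportional to one another, which agrees with the pattern visible in Table 5.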

5.1.3. Probe-set “1560296_at”

The probe-set “1560296_at” is used in the HG-U133-Plus-2.0 platform to represent Gene “DST”. This probe-set is detected by our test with a significant 2-dimensional mean structure (p-value 1.88 × 10⁻³, adjusted p-value 2.87 × 10⁻², and q-value 6.96 × 10⁻⁴). The first four singular values are (5470, 1748, 504, 271).

From Figure 4, we observe that Probes 1 and 2 are dominant. Further inspection shows that the first singular vector is primarily determined by these two probes. Following the method of Lu, et al. (2007), we find that Probes 1, 2 and 3 are each re-mapped to three transcripts (“veejee.aApr07-unspliced”, “DST.vlApr07-unspliced” and “DST.iApr07”), whereas the other probes are re-mapped to only two variants (“veejee.aApr07-unspliced” and “DST.vlApr07-unspliced”). For this probe-set, the significant 2-dimensional mean structure of the data matrix could be resolved by proper re-mapping of the probes.

Fig 4.

Scatterplot of singular vectors for the probe-set “1560296_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues.

5.2. Example 2

In this example, the data (http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE8874) were collected in a recent experiment with a 2×2×2 factorial design, the details of which are discussed in Leung, et al. (2008). The three factors (with two levels each) are:

  1. mutation: mutant or wild type (WT);

  2. tissue: retinas or whole body;

  3. time: 36 or 52 hours post-fertilization.

Under each condition, three replicate Affymetrix zebrafish genome arrays are used, so we have 24 arrays in total. The vector μ̱̂ is computed as in Example 1 by assuming that the means in each tissue group are equal. Furthermore, we generate two directions a̱1 and a̱2, used to reflect the possible tissue and mutation effects, respectively. In the study, we again use PM as the intensity measure and carry out the singular value decomposition to get the two largest singular values λ̂1 and λ̂2, where λ̂1 ≥ λ̂2. We focus on the 75 probe-sets with the highest λ̂2²/λ̂1² (with all those ratios above 1/10), and use the χ² test described in Section 3.2 on each of those probe-sets.
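The screening step, which ranks probe-sets by the squared ratio of the second to the first singular value, can be sketched as follows. This is an illustrative sketch only: the dictionary `matrices` (probe-set identifier mapped to its array-by-probe PM intensity matrix) and both function names are assumptions introduced here, not part of the original analysis code.

```python
import numpy as np

def squared_sv_ratio(Y):
    """Return (lambda_2 / lambda_1)^2 for the data matrix Y."""
    s = np.linalg.svd(Y, compute_uv=False)  # singular values, descending
    return (s[1] / s[0]) ** 2

def screen_probe_sets(matrices, top_k=75):
    """Rank probe-sets by squared singular value ratio, largest first.

    matrices: dict mapping a probe-set id to its n x m intensity matrix
              (hypothetical container used for illustration).
    Returns a list of (probe_set_id, ratio) pairs of length <= top_k.
    """
    ratios = {name: squared_sv_ratio(Y) for name, Y in matrices.items()}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return [(name, ratios[name]) for name in ranked[:top_k]]
```

As a check against the reported numbers, the probe-set “Dr.7506.1.A1_at” discussed below has leading singular values 43142 and 14839, giving a ratio of (14839/43142)² ≈ 0.118, which is above the 1/10 level mentioned in the text.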

In this example, 39 out of 75 probe-sets are detected as individually significant, and all 39 remain significant after the multiple test adjustment of Benjamini and Hochberg (1995). We shall describe one such probe-set in detail.
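For readers unfamiliar with the adjustment, the step-up procedure of Benjamini and Hochberg (1995) can be written compactly. The sketch below computes adjusted p-values in the standard way; the function name is hypothetical, and whether the adjusted p-values reported in this paper were computed exactly like this is an assumption.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity: running minimum taken from the largest rank down.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted
```

A probe-set is then declared significant at false discovery rate level α when its adjusted p-value is at most α.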

5.2.1. Probe-set “Dr.7506.1.A1_at”

In the zebrafish genome array, the probe-set “Dr.7506.1.A1_at” corresponds to gene “tuba8l2”. The χ² test gave the p-value of 2.37 × 10⁻⁵, the adjusted p-value of 7.52 × 10⁻⁵, and the q-value of 4.83 × 10⁻⁶. The first four singular values are (43142, 14839, 2078, 1688). It is clear from Figure 5 that we cannot distinguish the two tissue groups based on θ̂1i, but the two groups are well separated by θ̂2i. Further inspection of the data shows that the intensities of Probe 3 are linearly related to θ̂1i, whereas the intensities of Probe 15 are linearly related to θ̂2i. From Table 6, we see that the information from θ̂2i is clearly non-negligible. Furthermore, we used BLAST to verify that all the probes are appropriate for Gene “tuba8l2”, so there is strong evidence that the expression profile for Gene “tuba8l2” cannot be summarized by the usual uni-dimensional index across experimental conditions. In fact, the commonly used gene expression index would mask the clear differential expression between the two tissue types.
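One simple way to look for the kind of linear relationship described above is to correlate each probe's intensities with the array scores from the leading singular vectors. The sketch below is illustrative only: it assumes arrays are in rows and probes in columns, uses Pearson correlation as the measure of linear association, and the function name is hypothetical.

```python
import numpy as np

def probe_score_correlations(Y, k=2):
    """Pearson correlation of each probe's intensities with array scores.

    Y: n x m array (rows = arrays, columns = probes); assumed layout.
    Returns an m x k array; entry (j, d) is the correlation between the
    intensities of probe j and the d-th left singular vector of Y.
    The sign of a singular vector is arbitrary, so only |corr| is meaningful.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    m = Y.shape[1]
    corr = np.empty((m, k))
    for j in range(m):
        for d in range(k):
            corr[j, d] = np.corrcoef(Y[:, j], U[:, d])[0, 1]
    return corr
```

A probe whose absolute correlation is close to 1 with the first score but not the second (or the reverse) is a natural candidate for closer inspection.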

Fig 5.

Scatterplot of singular vectors for the probe-set “Dr.7506.1.A1_at”. The probe numbers are shown in the lower plot and the dotted line is a robust linear fit. The circles in the upper plot represent the arrays hybridized by the samples from retinas, while the solid points represent the arrays hybridized by the samples from whole body.

TABLE 6.

A summary of the absolute values of θ̂2iϕ̂2j/θ̂1iϕ̂1j in percentage by probes.

Dr.7506.1.A1_at Min.(%) Q1(%) Med.(%) Q3(%) Max.(%)
Probe 1 23.71 40.44 48.73 58.58 81.25
Probe 2 20.13 34.33 41.37 49.73 68.97
Probe 3 7.76 13.23 15.94 19.16 26.58
Probe 4 30.74 52.42 63.16 75.94 105.32
Probe 5 13.20 22.51 27.12 32.60 45.22
Probe 6 7.95 13.56 16.33 19.64 27.23
Probe 7 12.37 21.10 25.42 30.56 42.38
Probe 8 3.08 5.25 6.32 7.60 10.54
Probe 9 18.24 31.10 37.48 45.05 62.48
Probe 10 27.56 47.00 56.63 68.08 94.42
Probe 11 19.89 33.92 40.87 49.13 68.14
Probe 12 26.07 44.47 53.58 64.42 89.34
Probe 13 11.71 19.98 24.07 28.94 40.13
Probe 14 33.25 56.70 68.32 82.14 113.92
Probe 15 38.84 66.24 79.81 95.96 133.08
Probe 16 39.66 67.64 81.50 97.98 135.88

6. Conclusions

In this article, we have proposed a new framework for testing the uni-dimensional mean structure of the probe-level data matrix. For most applications, we can carry out the tests discussed in the article based on large sample approximations. We also proposed a model-based bootstrap algorithm to better control type I errors when the sample size is small.

In two case studies, the proposed method detected genes whose expression levels were not well summarized by uni-dimensional indices. Through detailed inspection of the probe-level intensities of those genes, we found that the intensities of different probes can show different profiles across experimental conditions. In our investigation, we noticed that the following scenarios exist for the violation of a uni-dimensional gene expression summary.

  1. A large percentage of probes that have poor binding strengths or low intensity measures in a probe-set can mask the gene expression profiles.

  2. One or more probes should be re-mapped to different variants of the same gene.

  3. One or more probes are cross-hybridized.

  4. An outlying and erroneous measurement is present for one of the probes.

  5. The multiplicative model used to summarize gene expression is inadequate even when all the probes are well selected.

It has been observed by Harbig, et al. (2005) that outlier signals on just one probe can seriously affect the calculations used for the subsequent analysis. While we do not always have definite answers as to the biological implications of such structures, our statistical analysis is valuable in flagging potentially interesting and important probes and genes for further scientific investigation. Our approach does not lead directly to probe re-mapping, but may suggest candidates for possible alternative mapping (Gautier, et al. (2004); Lu, et al. (2007)). The bottom line is clear: if we rely solely on models that assume uni-dimensional gene expression, we might miss some of the complexities in gene expression data. When a uni-dimensional model is shown to be inadequate, appropriate actions, such as probe re-mapping, an alternative model or a different summarization method (e.g. Kapur, et al. (2007)), are called for.

Supplementary Material

supplement

Acknowledgments

The research was partially supported by the NSF Grant DMS-0604229, DMS-0800631, NIH Grant R01GM080503-01A1 in the US, NNSF of China Grant 10828102, and a Changjiang Visiting Professorship at the Northeast Normal University, China. The authors thank Drs. Ping Ma and Sheng Zhong, as well as the Associate Editor for their helpful suggestions on the case studies presented in the paper.

Footnotes

SUPPLEMENTARY MATERIAL

Proofs of Main Results

(doi: ???http://lib.stat.cmu.edu/aoas/???/???;.pdf). We give a lemma on consistency, followed by the proofs for the theorems that are described in Sections 2 and 3.

References

  1. Berman S. Limit Theorems for The Maximum Term in Stationary Sequences. The Annals of Mathematical Statistics. 1964;35:502–516.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995;57:289–300.
  3. Dimberg J, Dienus O, Löfgren S, Hugander A, Wågsäter D. Expression and gene polymorphisms of the chemokine CXCL5 in colorectal cancer patients. International Journal of Oncology. 2007;31:97–102.
  4. Feng X, He X. Supplement to “Inference on low-rank data matrices with applications to microarray data.” 2009. doi: 10.1214/09-AOAS262SUPP.
  5. Gautier L, Møller M, Friis-Hansen L, Knudsen S. Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics. 2004;5:111. doi: 10.1186/1471-2105-5-111.
  6. Harbig J, Sprinkle R, Enkemann SA. A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Research. 2005;33:e31. doi: 10.1093/nar/gni027.
  7. He X, Shao Q. A General Bahadur Representation of M-Estimators and Its Application to Linear Regression With Nonstochastic Designs. The Annals of Statistics. 1996;24:2608–2630.
  8. Hu J, Wright F, Zou F. Estimation of Expression Indexes for Oligonucleotide Arrays Using Singular Value Decomposition. Journal of the American Statistical Association. 2006;101:41–50.
  9. Kapur K, Ying Y, Ouyang Z, Wong W. Exon arrays provide accurate assessments of gene expression. Genome Biology. 2007;8:R82. doi: 10.1186/gb-2007-8-5-r82.
  10. Leadbetter MR, Lindgren G, Rootzen H. Extremes and Related Properties of Random Sequences and Processes. 1st ed. New York: Springer–Verlag; 1983.
  11. Leung YF, Ma P, Link BA, Dowling J. Factorial microarray analysis of zebrafish retinal development. Proceedings of the National Academy of Sciences. 2008;105:12909–12914. doi: 10.1073/pnas.0806038105.
  12. Li C, Wong WH. Model-Based Analysis of Oligonucleotide Arrays: Expression Index and Outlier Detection. Proceedings of the National Academy of Sciences USA. 2001;98:31–36. doi: 10.1073/pnas.011404098.
  13. Lu J, Lee JC, Salit ML, Cam MC. Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: High-resolution annotation for microarrays. BMC Bioinformatics. 2007;8:108. doi: 10.1186/1471-2105-8-108.
  14. Mammen E. When Does Bootstrap Work? Asymptotic Results and Simulations. 1st ed. New York: Springer–Verlag; 1991.
  15. Lin G, He X, Ji H, Shi L, Davis RW, Zhong S. Reproducibility Probability Score-Incorporating Measurement Variability Across Laboratories for Gene Selection. Nature Biotechnology. 2006;24:1476–1477. doi: 10.1038/nbt1206-1476.
  16. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. The Analysis of Gene Expression Data. 1st ed. New York: Springer–Verlag; 2003.
  17. Shi L, Reid LH, Jones WD, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology. 2006;24(9):1151–1161. doi: 10.1038/nbt1239.
  18. Storey JD. The Positive False Discovery Rate: A Bayesian Interpretation and The q-value. The Annals of Statistics. 2003;31:2013–2035.
