Inference on Low-Rank Data Matrices with Applications to Microarray Data

Xingdong Feng; Xuming He

doi:10.1214/09-AOAS262SUPP

Author manuscript; available in PMC: 2010 May 26.

Published in final edited form as: Ann Appl Stat. 2009;3(4):1634–1654. doi: 10.1214/09-AOAS262SUPP

PMCID: PMC2876352 NIHMSID: NIHMS121567 PMID: 20508835

Abstract

Probe-level microarray data are usually stored in matrices, where the row and column correspond to array and probe, respectively. Scientists routinely summarize each array by a single index as the expression level of each probe-set (gene). We examine the adequacy of a uni-dimensional summary for characterizing the data matrix of each probe-set. To do so, we propose a low-rank matrix model for the probe-level intensities, and develop a useful framework for testing the adequacy of uni-dimensionality against targeted alternatives. This is an interesting statistical problem where inference has to be made based on one data matrix whose entries are not i.i.d. We analyze the asymptotic properties of the proposed test statistics, and use Monte Carlo simulations to assess their small sample performance. Applications of the proposed tests to GeneChip data show that evidence against a uni-dimensional model is often indicative of practically relevant features of a probe-set.

Keywords and phrases: Hypothesis Test, Microarray, Singular Value Decomposition

1. INTRODUCTION

Oligonucleotide expression array technology is popular in many fields of biomedical research. The technology makes it possible to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for a large number of genes simultaneously. One such platform is the GeneChip microarray technology, developed commercially by Affymetrix, which measures gene expression by hybridizing the sample mRNA to a probe set, typically composed of 11–20 pairs of probes, on a specially designed chip called a “microarray” (Parmigiani, et al. (2003)).

Two types of probes are used in the GeneChip microarray technology: the perfect match (PM), which is taken from a gene sequence for specific binding of mRNA for the gene, and the mismatch (MM), which is artificially created by changing one nucleotide of the PM sequence to control for nonspecific binding of mRNA from other genes or non-coding sequences of DNA. The probe pairs are immobilized into an array, where each spot of the array contains a probe. An RNA sample labeled with a fluorescent dye is hybridized to a microarray, and the array is then scanned. The expression levels of different genes can be measured by the intensities of the spots. We use PM or PM − MM as the intensity data for our statistical analysis. Extensive studies have been carried out on how to summarize the gene expression levels based on the probe level data. Li and Wong (2001) proposed a multiplicative model:

y_{i j} = θ_{i} ϕ_{j} + ε_{i j}, i = 1, \dots, n, j = 1, \dots, m,

(1)

where y is the observed intensity of each spot, θ is the array effect, ϕ is the probe effect, ε is the random error, i indicates the ith array and j refers to the jth probe. This model, along with some of its variations, has been routinely used in microarray data analysis. In the present paper, we focus on one natural question: how well can we use one quantity θ_i to adequately summarize the expression level for each probe-set in the ith array? Hu, Wright and Zou (2006) show that the least squares estimate (LSE) of the parameters in the model can be obtained as the first component of the singular value decomposition (SVD) of the intensity matrix Y, where

Y = (\begin{matrix} y_{11} & \dots & y_{1 m} \\ ⋮ & ⋮ & ⋮ \\ y_{n 1} & \dots & y_{n m} \end{matrix}) .

Motivated by their work, we aim to develop useful methods to test if additional parameters are needed to characterize the expression data of each probe-set in each array based on the SVD.
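The link between the least squares fit of Model (1) and the SVD can be illustrated with a minimal sketch. The function name `rank_one_lse`, the unit-norm scaling of the probe effects, the sign convention, and the toy matrix are all assumptions made for illustration; they are not taken from Li and Wong (2001) or Hu, Wright and Zou (2006).

```python
import numpy as np

def rank_one_lse(Y):
    """Least squares fit of y_ij ~ theta_i * phi_j from the leading SVD component."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    phi = Vt[0]                # probe effects: first right singular vector (unit norm)
    theta = s[0] * U[:, 0]     # array effects: first left singular vector times lambda_1
    # The SVD is unique only up to a joint sign flip; fix it so the probe effects sum > 0.
    if phi.sum() < 0:
        phi, theta = -phi, -theta
    return theta, phi

# Hypothetical toy data: 4 arrays, 3 probes, exactly rank one.
theta_true = np.array([2.0, 4.0, 6.0, 8.0])
phi_true = np.array([1.0, 2.0, 2.0]) / 3.0      # unit norm
Y = np.outer(theta_true, phi_true)
theta_hat, phi_hat = rank_one_lse(Y)
residual = np.linalg.norm(Y - np.outer(theta_hat, phi_hat))
```

For a noiseless rank-one matrix the fit reproduces the generating vectors; with noise, the residual carries whatever structure a single summary per array fails to capture.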

When we applied the SVD to the 20 GeneChip microarrays produced in a recent MicroArray Quality Control (MAQC) project (Shi, et al. (2006)) for contrasting colorectal adenocarcinomas and matched normal colonic tissues, we found a number of probe-sets (including Probe-set “214974_x_at” designed to measure the gene expression for Gene “CXCL5”) with a significant 2-dimensional structure. The first two singular vectors for Probe-set “214974_x_at” are displayed graphically in Figure 1, indicating that the usual uni-dimensional summary of gene expression (corresponding to the first right singular vector) would mask the differential expression of Gene “CXCL5” in the tumor tissues. Recent studies such as that reported in Dimberg, et al. (2007) show that this gene indeed plays an important role in colorectal cancer. More detailed findings about this probe-set can be found in Section 5 together with additional examples.

Fig 1. Scatterplot of singular vectors for the probe-set “214974_x_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues. Sections 1 and 5 refer to this figure.

In Section 2 we propose a 2-dimensional model to take into account both the mean structure and the variance structure of the data matrix. We use a multiplicative model extended from Model (1), but the array effects are assumed to be random, consistent with the fact that the arrays are typically drawn from a larger population. The LSE of the parameters in the model can be computed efficiently via the SVD. We are interested in the dimensionality of the mean of this data matrix, but first we need to define it in a precise way.

Definition 1.1

Given an n × m random matrix Y, we define the mean matrix as E(Y). If the rank of E (Y) is k, then the dimensionality of Y is defined as k, where k ∈ {1, 2, … , min (n, m)}.

If the rank of E(Y) is k, it is well known that the SVD of E(Y) has k nonzero singular values, and E(Y) can be decomposed as $\sum_{i = 1}^{k} λ_{i} {\underline{u}}_{i} {\underline{v}}_{i}^{T}$ where λ₁ ≥ λ₂ ≥ … ≥ λ_k > 0 are the nonzero singular values, u̱_i ∈ ℝⁿ is the ith left singular vector and v̱_i ∈ ℝ^m is the ith right singular vector, for i = 1, 2, …, k. Moreover,

{\underline{u}}_{i}^{T} {\underline{u}}_{j} = {\underline{v}}_{i}^{T} {\underline{v}}_{j} = \begin{cases} 1, & i = j, \\ 0, & i \neq j . \end{cases}

Our primary question is whether the dimensionality (rank) of the matrix E(Y) is one or two. For this purpose, we formulate our hypothesis as $H_{0} : E (Y) = λ_{1} {\underline{u}}_{1} {\underline{v}}_{1}^{T}$ versus $H_{1} : E (Y) = λ_{1} {\underline{u}}_{1} {\underline{v}}_{1}^{T} + λ_{2} {\underline{u}}_{2} {\underline{v}}_{2}^{T}$ . It is possible to consider higher ranks of the mean matrix, but our approach is best illustrated with the rank 2 alternative, which is also the most relevant scenario in many applications. In Section 3, three test statistics are proposed for this problem and their asymptotic results are given. The asymptotic analysis based on the SVD of Y differs from the classical literature on the eigenvalues and eigenvectors of a sample covariance matrix, because the latter works on a data matrix with its mean removed, but our focus is directly on the mean of the data matrix.

When the number of microarrays in an experiment is small due to cost concerns, the asymptotic distributions of the statistics proposed in Section 3 may not be sufficiently close to their exact distributions. Hence, we apply bootstrap techniques to calibrate the first two tests discussed in Section 3. In Section 4, we assess the finite sample performance of the tests proposed in Section 3 by Monte Carlo simulations. Finally, in Section 5 we apply the proposed tests to real data sets from two studies. Our analysis shows that the second dimension of the probe-level data is often indicative of interesting features of a probe-set. A number of scenarios for the inadequacy of a uni-dimensional summary are discussed through the case studies and in the concluding Section 6. For example, we point out how our approach relates to and differs from probe re-mapping, and show that a high percentage of probes of poor binding strengths in a probe-set can mask gene expression profiles through a uni-dimensional model. All the proofs of lemmas and theorems given in the paper can be found in the supplemental article Feng and He (2009).

2. MODEL AND ESTIMATION

In this section, we propose a multiplicative model extended from Model (1) to account for a possible second dimension in the data matrices. Furthermore, the asymptotic properties of LSE of the parameters in the model are discussed.

2.1. A Multiplicative Model with Random Effects

Our proposed model takes the form

{\underline{y}}_{i} = θ_{1 i}^{(0)} {\underline{ϕ}}_{1}^{(0)} + θ_{2 i}^{(0)} {\underline{ϕ}}_{2}^{(0)} + {\underline{ε}}_{i}, i = 1, 2, \dots, n,

(2)

where y̱_i = (y_i1, y_i2, … , y_im)^T is the ith observed vector, ${\underline{θ}}_{1}^{(0)} = {(θ_{11}^{(0)}, \dots, θ_{1 n}^{(0)})}^{T}$ and ${\underline{θ}}_{2}^{(0)} = {(θ_{21}^{(0)}, \dots, θ_{2 n}^{(0)})}^{T}$ are used to explain the row effects, and ${\underline{ϕ}}_{1}^{(0)} = {(ϕ_{11}^{(0)}, \dots, ϕ_{1 m}^{(0)})}^{T}$ and ${\underline{ϕ}}_{2}^{(0)} = {(ϕ_{21}^{(0)}, \dots, ϕ_{2 m}^{(0)})}^{T}$ are used to explain the column effects in the data matrix. When applied to the probe level microarray data, θ stands for the array effect and ϕ represents the probe effect. Using ∥·∥ to denote the L₂ norm for vectors, and a̱⊥ḇ for orthogonality of a̱ and ḇ, we make the following assumptions:

(M1) ${\underline{ϕ}}_{1}^{(0)}$ and ${\underline{ϕ}}_{2}^{(0)}$ are two m-dimensional unit vectors with ${\underline{ϕ}}_{1}^{(0)} ⊥ {\underline{ϕ}}_{2}^{(0)}$ .
(M2) ${\underline{θ}}_{j}^{(0)}$ are independently distributed with mean μ̱_j = (μ_j1, …, μ_jn)^T and variance $σ_{j}^{2} I_{n}$ , for j = 1,2, and all the components in each vector are independent. The third and fourth central moments of $θ_{j i}^{(0)}$ are $γ_{j}^{3}$ and $τ_{j}^{4}$ respectively, for j = 1,2. Moreover, μ̱₁⊥μ̱₂.
(M3) The error variables ε̱_i = (ε_i1, … ,ε_im)^T are identically and independently distributed with mean zero and variance-covariance matrix σ²I_m, and the third and fourth central moments of ε_ij are γ³ and τ⁴, respectively.
(M4) $\{ θ_{1 i}^{(0)} \}, \{ θ_{2 i}^{(0)} \}$ and {ε̱_i} are mutually independent.
(M5) $n^{- 1} ∥ {\underline{μ}}_{1} ∥^{2} \to μ_{1}^{2}$ and $n^{- 1} ∥ {\underline{μ}}_{2} ∥^{2} \to μ_{2}^{2}$ as n → ∞ for some finite constants μ₁ and μ₂. We assume that $μ_{1}^{2} + σ_{1}^{2} > μ_{2}^{2} + σ_{2}^{2}$ , which is necessary for the identifiability of the model parameters.
(M6) ∥μ̱_j ⨀ μ̱_j∥² = O(n), j = 1,2, where ⨀ indicates the pointwise product of two vectors.

2.2. Least Squares Estimate of Column Effect Parameters

In this section we discuss the properties of the LSE of the column effect parameters. Let θ̱₁ = (θ₁₁,…, θ_1n)^T, θ̱₂ = (θ₂₁,… ,θ_2n)^T, $φ = {({\underline{ϕ}}_{1}^{T}, {\underline{ϕ}}_{2}^{T})}^{T}$ and $ϑ = {({\underline{θ}}_{1}^{T}, {\underline{θ}}_{2}^{T}, φ^{T})}^{T}$ . With the objective function

d_{n} (ϑ) = \sum_{i = 1}^{n} ∥ {\underline{y}}_{i} - θ_{1 i} {\underline{ϕ}}_{1} - θ_{2 i} {\underline{ϕ}}_{2} ∥^{2},

(3)

the least squares estimate of ϑ can be found by minimizing d_n(ϑ). In the present framework, the total number of parameters increases with the number of observations. To facilitate the analysis, it helps to view ${\underline{θ}}_{1}^{(0)}$ and ${\underline{θ}}_{2}^{(0)}$ as nuisance parameters. If (3) is minimized at ϑ̂, then θ̂_1i and θ̂_2i minimize

∥ {\underline{y}}_{i} - θ_{1 i} {\hat{\underline{ϕ}}}_{1} - θ_{2 i} {\hat{\underline{ϕ}}}_{2} ∥^{2}

with respect to θ_1i and θ_2i given ϕ̱̂₁ and ϕ̱̂₂. Furthermore,

{\hat{\underline{θ}}}_{1} = {({\hat{\underline{ϕ}}}_{1}^{T} {\hat{\underline{ϕ}}}_{1})}^{- 1} Y {\hat{\underline{ϕ}}}_{1},

(4)

and

{\hat{\underline{θ}}}_{2} = {({\hat{\underline{ϕ}}}_{2}^{T} {\hat{\underline{ϕ}}}_{2})}^{- 1} Y {\hat{\underline{ϕ}}}_{2} .

(5)

Therefore, φ̂ minimizes the following objective function

d_{n}^{*} (φ) = \sum_{i = 1}^{n} ∥ {\underline{y}}_{i} - [{({\underline{ϕ}}_{1}^{T} {\underline{ϕ}}_{1})}^{- 1} {\underline{ϕ}}_{1}^{T} {\underline{y}}_{i}] {\underline{ϕ}}_{1} - [{({\underline{ϕ}}_{2}^{T} {\underline{ϕ}}_{2})}^{- 1} {\underline{ϕ}}_{2}^{T} {\underline{y}}_{i}] {\underline{ϕ}}_{2} ∥^{2} .

(6)
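The profiling step in (4)–(6) can be checked numerically. The sketch below assumes that the two leading right singular vectors of Y are used as ϕ̱̂₁ and ϕ̱̂₂, which is consistent with the statement in Section 1 that the LSE can be obtained via the SVD; the function names and the random test matrix are hypothetical. By the Eckart–Young theorem, the minimum of (6) equals the sum of the squared singular values beyond the second.

```python
import numpy as np

def profile_objective(Y, phi1, phi2):
    """d_n^*(phi) in (6), evaluated for unit-norm, mutually orthogonal phi1 and phi2."""
    fitted = np.outer(Y @ phi1, phi1) + np.outer(Y @ phi2, phi2)
    return float(np.sum((Y - fitted) ** 2))

def two_component_lse(Y):
    """Return (theta1_hat, theta2_hat, phi1_hat, phi2_hat) from the two leading SVD components."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    phi1, phi2 = Vt[0], Vt[1]
    # With unit-norm phi, (4) and (5) reduce to theta_j = Y phi_j.
    return Y @ phi1, Y @ phi2, phi1, phi2

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 5))                 # hypothetical 10 arrays x 5 probes
theta1_hat, theta2_hat, phi1_hat, phi2_hat = two_component_lse(Y)
singular_values = np.linalg.svd(Y, compute_uv=False)
d_star = profile_objective(Y, phi1_hat, phi2_hat)

# Any other orthonormal pair, e.g. two coordinate axes, cannot do better.
e1, e2 = np.eye(5)[0], np.eye(5)[1]
d_axes = profile_objective(Y, e1, e2)
```

Note that (6) depends on ϕ̱₁ and ϕ̱₂ only through the plane they span, so the SVD supplies one convenient orthonormal basis of the minimizing plane.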

2.2.1. Consistency and Asymptotic Representation

We consider the asymptotic properties of φ̂ assuming that the number of probes m is fixed but the number of arrays n → ∞. As shown in the preceding subsection, φ̂ is a constrained M-estimator that minimizes (6) subject to ∥ϕ̱₁∥ = ∥ϕ̱₂∥ = 1 and ϕ̱₁⊥ϕ̱₂. The derivations in the Appendix lead to the following results.

Theorem 2.1

When Model (2) and Assumptions (M1)–(M6) hold, $\hat{φ} \overset{a . s .}{\to} φ^{(0)}$ where φ̂ is the least squares estimate of φ⁽⁰⁾, that is φ̂ minimizes $\sum_{i = 1}^{n} ρ ({\underline{y}}_{i}; φ)$ subject to ∥ϕ̱₁∥ = ∥ϕ̱₂∥ = 1 and ϕ̱₁⊥ϕ̱₂, where

ρ ({\underline{y}}_{i}; φ) = ∥ {\underline{y}}_{i} - ({\underline{ϕ}}_{1}^{T} {\underline{y}}_{i}) {\underline{ϕ}}_{1} - ({\underline{ϕ}}_{2}^{T} {\underline{y}}_{i}) {\underline{ϕ}}_{2} ∥^{2} .

(7)

Theorem 2.1 makes it possible to give the Bahadur representation for ϕ̱̂₁ and ϕ̱̂₂ from the results of He and Shao (1996). We now consider the limiting distribution of $\sqrt{n} (\hat{φ} - φ^{(0)})$ , which is critical for us to discuss the asymptotic properties of the test statistics proposed in Section 3. Let

Γ_{n} = (n^{- 1} ∥ {\underline{μ}}_{1} ∥^{2} + σ_{1}^{2}) {\underline{ϕ}}_{1}^{(0)} {\underline{ϕ}}_{1}^{(0) T} + (n^{- 1} ∥ {\underline{μ}}_{2} ∥^{2} + σ_{2}^{2}) {\underline{ϕ}}_{2}^{(0)} {\underline{ϕ}}_{2}^{(0) T} + σ^{2} I_{m},

(8)

where I_m is an m × m identity matrix. Then we have the following theorem.

Theorem 2.2

When Model (2) and Assumptions (M1)–(M6) hold, we have, for j = 1,2,

{\hat{\underline{ϕ}}}_{j} - {\underline{ϕ}}_{j}^{(0)} = - n^{- 1} D_{j n}^{- 1} \sum_{i = 1}^{n} [2 {\underline{y}}_{i} {\underline{y}}_{i}^{T} {\underline{ϕ}}_{j}^{(0)} - 2 ({\underline{ϕ}}_{j}^{(0) T} {\underline{y}}_{i} {\underline{y}}_{i}^{T} {\underline{ϕ}}_{j}^{(0)}) {\underline{ϕ}}_{j}^{(0)}] + o (n^{- 1 + ε}),

(9)

where ε is any positive number, and

D_{j n} = - 2 Γ_{n} + 2 {\underline{ϕ}}_{j}^{(0) T} Γ_{n} {\underline{ϕ}}_{j}^{(0)} I_{m} + 4 {\underline{ϕ}}_{j}^{(0)} {\underline{ϕ}}_{j}^{(0) T} Γ_{n} .

(10)

Thus, both $\sqrt{n} ({\underline{\hat{ϕ}}}_{1} - {\underline{ϕ}}_{1}^{(0)})$ and $\sqrt{n} ({\underline{\hat{ϕ}}}_{2} - {\underline{ϕ}}_{2}^{(0)})$ are asymptotically normally distributed with mean 0 and variance-covariance matrix, say, C₁ and C₂, respectively, where C₁ and C₂ are determined by φ⁽⁰⁾ and the first four moments of y̱_i.

2.3. Least Squares Prediction of Row Effects

We now discuss the asymptotic properties of the least squares prediction of the row effects based on (4) and (5). The result is summarized in the following theorem.

Theorem 2.3

When Model (2) and Assumptions (M1)–(M6) hold, we have ${\hat{θ}}_{1 i} = {\underline{\hat{ϕ}}}_{1}^{T} {\underline{y}}_{i} \overset{L}{\to} θ_{1 i}^{(0)} + {\underline{ε}}_{i}^{T} {\underline{ϕ}}_{1}^{(0)}$ and ${\hat{θ}}_{2 i} = {\underline{\hat{ϕ}}}_{2}^{T} {\underline{y}}_{i} \overset{L}{\to} θ_{2 i}^{(0)} + {\underline{ε}}_{i}^{T} {\underline{ϕ}}_{2}^{(0)}$ , where $\overset{L}{\to}$ denotes convergence in distribution.

Let

Γ = (μ_{1}^{2} + σ_{1}^{2}) {\underline{ϕ}}_{1}^{(0)} {\underline{ϕ}}_{1}^{(0) T} + (μ_{2}^{2} + σ_{2}^{2}) {\underline{ϕ}}_{2}^{(0)} {\underline{ϕ}}_{2}^{(0) T} + σ^{2} I_{m} .

(11)

The first two eigenvalues of this matrix are $μ_{1}^{2} + σ_{1}^{2} + σ^{2}$ and $μ_{2}^{2} + σ_{2}^{2} + σ^{2}$ , with the remaining eigenvalues σ². Let

S_{n} = n^{- 1} Y^{T} Y - n^{- 1} ∥ Y {\hat{\underline{ϕ}}}_{1} ∥^{2} - n^{- 1} ∥ Y {\hat{\underline{ϕ}}}_{2} ∥^{2} .

(12)

Then, from (4), (5), and Theorem 2.2, we have

n^{- 1} ∥ {\underline{\hat{θ}}}_{j} ∥^{2} \overset{a . s .}{\to} μ_{j}^{2} + σ_{j}^{2} + σ^{2}, (j = 1, 2),

and

{(m - 2)}^{- 1} S_{n} \overset{a . s .}{\to} σ^{2},

based on the strong law of large numbers. These consistent estimators for all the eigenvalues of the matrix Γ will be used when we construct the tests in the following section. On the other hand, we note that $θ_{1 i}^{(0)}$ and $θ_{2 i}^{(0)}$ may have their individual means μ_1i and μ_2i, respectively, and thus it is impossible to consistently estimate the individual parameters $μ_{1 i}, μ_{2 i}, σ_{1}^{2}$ and $σ_{2}^{2}$ without any further information.
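A small numerical check of these eigenvalue estimators is sketched below. It rests on two assumptions that are not stated in the text: the term n⁻¹Y^TY in (12) is read as the scalar n⁻¹tr(Y^TY), and the parameter values used to simulate from Model (2) are hypothetical and chosen only so that the three limits are well separated.

```python
import numpy as np

def eigenvalue_estimates(Y):
    """Estimate the two leading eigenvalues of Gamma in (11) and the error variance."""
    n, m = Y.shape
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    theta1_hat, theta2_hat = Y @ Vt[0], Y @ Vt[1]
    lam1 = theta1_hat @ theta1_hat / n        # -> mu_1^2 + sigma_1^2 + sigma^2
    lam2 = theta2_hat @ theta2_hat / n        # -> mu_2^2 + sigma_2^2 + sigma^2
    # Assumption: n^{-1} Y^T Y in (12) is interpreted as n^{-1} trace(Y^T Y).
    s_n = np.sum(Y ** 2) / n - lam1 - lam2
    return lam1, lam2, s_n / (m - 2)

# Hypothetical parameters: mu_1 = 3, sigma_1 = 1, mu_2 = 0, sigma_2 = 0.5, sigma = 0.2.
rng = np.random.default_rng(1)
n, m = 20000, 6
phi1 = np.ones(m) / np.sqrt(m)
phi2 = np.tile([1.0, -1.0], m // 2) / np.sqrt(m)
theta1 = rng.normal(3.0, 1.0, size=n)
theta2 = rng.normal(0.0, 0.5, size=n)
errors = rng.normal(0.0, 0.2, size=(n, m))
Y = np.outer(theta1, phi1) + np.outer(theta2, phi2) + errors
lam1_hat, lam2_hat, sigma2_hat = eigenvalue_estimates(Y)
```

Under these settings the three limits are 10.04, 0.29 and 0.04, so the estimates should land near those values for large n.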

3. HYPOTHESIS TESTING

In this section, we consider testing the null hypothesis H₀ : μ̱₂ = 0̱. The second dimension ${\underline{θ}}_{2}^{(0)} {\underline{ϕ}}_{2}^{(0) T}$ in Model (2) does not provide meaningful information on the mean structure of the data matrix under this null hypothesis. We expect θ̱̂₂ to have zero mean under the null hypothesis and non-zero mean under the alternative hypothesis, because ${\hat{θ}}_{2 i} \overset{L}{\to} θ_{2 i}^{(0)} + {\underline{ε}}_{i}^{T} {\underline{ϕ}}_{2}^{(0)}$ as n → ∞. Motivated by this, we construct test statistics based on {θ̂_2i, i = 1, 2, … ,n}. We consider three specific test statistics in the following subsections.

3.1. Test on a Target Direction

Consider

T_{\underline{a}} = n^{- 1} {\underline{a}}^{T} {\underline{\hat{θ}}}_{2},

(13)

for any a̱ = (a₁,…, a_n)^T ∈ ℝⁿ such that a̱^T μ̱₁ = 0, ∥a̱∥² = n and ${max}_{1 \leq j \leq n} a_{j}^{2} / n \to 0$ . We choose a vector a̱ such that a̱⊥μ̱₁ because μ̱₁ is orthogonal to μ̱₂ and we want to test the null hypothesis that μ̱₂ = 0̱. We use 1̱_n to indicate the n-dimensional vector with all the components equal to 1. From the asymptotic properties discussed in Section 2, we have the following theorem.

Theorem 3.1

If the observations y̱₁,y̱₂, … y̱_n are drawn from Model (2) and Assumptions (M1)–(M6) hold, and a̱ ∈ ℝⁿ is a vector satisfying a̱^T μ̱₁ = 0, a̱^Ta̱ = n and ${max}_{1 \leq j \leq n} a_{j}^{2} / n \to 0$ , then

n^{- 1 / 2} {\underline{a}}^{T} {\underline{\hat{θ}}}_{2} / \hat{σ} \overset{L}{\to} N (0, 1)

under the null hypothesis that μ̱₂ = 0̱, where

{\hat{σ}}^{2} = n^{- 1} ∥ {\underline{\hat{θ}}}_{2} ∥^{2} - {\hat{θ}}_{2 \cdot}^{2}, and {\hat{θ}}_{2 \cdot} = n^{- 1} {\underline{\hat{θ}}}_{2}^{T} {\underline{1}}_{n} .

(14)

The power of the test depends on how far a̱^T μ̱₂ deviates from zero. As to the target direction a̱, it is usually determined by some specific comparison in practice. We will give examples of choosing a̱ in Section 5.
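Theorem 3.1 translates into a short computation once the predicted row effects θ̂_2i are available. The sketch below is illustrative only: the function name, the two-sided p-value, and the numerical values of θ̂_2i are assumptions, and the caller is responsible for supplying a direction a̱ orthogonal to μ̱₁ (here the alternating vector, which is orthogonal to any constant mean vector).

```python
import math

def target_direction_test(theta2_hat, a):
    """Studentized statistic n^{-1/2} a^T theta2_hat / sigma_hat and a two-sided normal p-value."""
    n = len(theta2_hat)
    if len(a) != n:
        raise ValueError("a and theta2_hat must have the same length")
    # Rescale a so that ||a||^2 = n, as required in Theorem 3.1.
    scale = math.sqrt(n / sum(x * x for x in a))
    a = [scale * x for x in a]
    mean = sum(theta2_hat) / n
    sigma2 = sum(t * t for t in theta2_hat) / n - mean ** 2        # equation (14)
    z = sum(x * t for x, t in zip(a, theta2_hat)) / math.sqrt(n * sigma2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))                   # P(|N(0,1)| >= |z|)
    return z, p_value

# Hypothetical predicted row effects for n = 8 arrays.
theta2_hat = [3.0, -1.0, 2.0, -2.0, 4.0, 0.0, 1.0, -3.0]
a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
z, p_value = target_direction_test(theta2_hat, a)
```

In this toy example a̱^Tθ̱̂₂ = 16 and σ̂² = 5.25, so the statistic equals 16/√42.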

3.1.1. A Practical Solution When μ̱₁ is Unknown

In practice, the true value of the mean vector μ̱₁ is unknown, but it can be estimated when extra group information is available. Assume that the observations can be divided into p groups such that the means μ_{1i} are equal within each group. Specifically, we assume that μ_{1,n_{t−1}+1} = … = μ_{1n_t} for t = 1, 2, …, p, where n₀ = 0 < n₁ < … < n_{p−1} < n_p = n, and that p is fixed but n_t − n_{t−1} → ∞ as n → ∞. For microarray data, those arrays that use the same types of tissues may form one group, and specific examples will be discussed in Section 5.

Suppose that μ̂_{1n_t} is a consistent estimator of μ_{1n_t}. Let

{\hat{\underline{μ}}}_{1} = {({\hat{μ}}_{1 n_{1}}, \dots, {\hat{μ}}_{1 n_{1}}, {\hat{μ}}_{1 n_{2}}, \dots, {\hat{μ}}_{1 n_{2}}, \dots, {\hat{μ}}_{1 n_{p}}, \dots, {\hat{μ}}_{1 n_{p}})}^{T},

where the number of copies of μ̂_{1n_t} in the above vector is n_t − n_{t−1}, t = 1, 2, …, p. Furthermore, when we choose a vector â̱ orthogonal to μ̱̂₁, we only consider the candidates whose entries can be divided into groups and are equal to each other within each group, in the form of

\underline{\hat{a}} \propto {({\hat{a}}_{n_{1}}, \dots, {\hat{a}}_{n_{1}}, {\hat{a}}_{n_{2}}, \dots, {\hat{a}}_{n_{2}}, \dots, {\hat{a}}_{n_{p}}, \dots, {\hat{a}}_{n_{p}})}^{T} .

With â̱ convergent to a̱, the statistic $T_{\underline{\hat{a}}} = n^{- 1} {\underline{\hat{a}}}^{T} {\underline{\hat{θ}}}_{2}$ has the same Bahadur representation as if we chose a vector a̱ orthogonal to μ̱₁ under the null hypothesis. Hence, when we construct the tests in Section 3, we can use â̱ that is orthogonal to μ̱̂₁. The choice of â̱ is not unique, and is best made in response to specific alternatives of interest in a given experiment.

3.2. A χ² Test with Multiple Directions

As shown in Section 3.1, the power of the test T_a̱ depends on the direction a̱ that we choose. In some cases, we may consider several directions simultaneously. Let us consider a k × n matrix A, where k is a fixed integer and k < n. The ith row of the matrix A is denoted as a̱_i and the jth component of a̱_i is denoted as a_ij for i = 1, … ,k and j = 1, … ,n. Assume that a̱_i⊥a̱_j for $i \neq j, {\underline{a}}_{i} ⊥ {\underline{μ}}_{1}, {\underline{a}}_{i}^{T} {\underline{a}}_{i} = n$ and ${max}_{1 \leq j \leq n} a_{i j}^{2} / n \to 0$ for each i. Then, we propose the test statistic

T_{A} = n^{- 1} ∥ A {\underline{\hat{θ}}}_{2} ∥^{2} / {\hat{σ}}^{2},

with the following result.

Theorem 3.2

Under the assumptions of Theorem 3.1, and for the matrix A described in this subsection, we have $T_{A} \to χ_{k}^{2}$ in distribution under the null hypothesis that μ̱₂ = 0̱, where $χ_{k}^{2}$ has the chi-square distribution with k degrees of freedom.

In practice, given observations, we should not choose k that is close to n, because

∥ A {\underline{\hat{θ}}}_{2} ∥^{2} = n^{2} {\hat{σ}}^{2}

when k = n − 1, and the variations accumulated from approximation errors will ruin the chi-square approximation.
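The degeneracy at k = n − 1 is easy to see numerically. The sketch below assumes that μ̱₁ is proportional to the constant vector, so that the rows of A are contrasts; the function name and the data are hypothetical. With n = 4 and all three mutually orthogonal ±1 contrasts, T_A equals n regardless of the data, which is why it carries no information.

```python
def chi_square_statistic(theta2_hat, A):
    """T_A = n^{-1} ||A theta2_hat||^2 / sigma_hat^2, with each row of A having squared norm n."""
    n = len(theta2_hat)
    mean = sum(theta2_hat) / n
    sigma2 = sum(t * t for t in theta2_hat) / n - mean ** 2      # equation (14)
    projections = [sum(a_j * t for a_j, t in zip(row, theta2_hat)) for row in A]
    return sum(v * v for v in projections) / (n * sigma2)

# All k = n - 1 = 3 mutually orthogonal +/-1 contrasts for n = 4.
A_full = [
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
theta2_hat = [0.3, -1.2, 2.5, 0.7]                    # hypothetical predictions
t_full = chi_square_statistic(theta2_hat, A_full)      # degenerate: equals n
t_two = chi_square_statistic(theta2_hat, A_full[:2])   # k = 2 directions only
```

More generally, under the same assumption T_A can never exceed n, so for small n the statistic may be unable to reach the upper quantiles of the χ² distribution with k degrees of freedom.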

3.3. Bootstrap Calibration

Sometimes, the sample size n is too small for the asymptotic approximations to perform well. Hence, we propose a finite sample adjustment to control the type I errors.

A bootstrap method, which avoids resampling from the rows or columns of the data matrix, to test the null hypothesis that μ̱₂ = 0 can be described as follows:

(i) Draw n indices {j₁, …, j_n} with replacement from {1, 2, …, n} and let ${\hat{θ}}_{2 i}^{*} = {\hat{θ}}_{2 j_{i}} - {\hat{θ}}_{2 \cdot}$ (i = 1, 2, …, n), where ${\hat{θ}}_{2 \cdot} = n^{- 1} \sum_{i = 1}^{n} {\hat{θ}}_{2 i}$ , and then evaluate $T_{\underline{a}}^{*}$ as
$T_{\underline{a}}^{*} = n^{- 1 / 2} {\underline{a}}^{T} {\underline{\hat{θ}}}_{2}^{*} / {(n^{- 1} ∥ {\underline{\hat{θ}}}_{2}^{*} ∥^{2} - {\hat{θ}}_{2 \cdot}^{* 2})}^{1 / 2},$
where ${\underline{\hat{θ}}}_{2}^{*} = {({\hat{θ}}_{21}^{*}, \dots, {\hat{θ}}_{2 n}^{*})}^{T}$ and ${\hat{θ}}_{2 \cdot}^{*} = n^{- 1} \sum_{i = 1}^{n} {\hat{θ}}_{2 i}^{*}$ ;
(ii) Repeat Step (i) B times to get the test statistics $T_{\underline{a}, b}^{*}$ , b = 1, 2, …, B. We estimate the bootstrap p-value by
$p = B^{- 1} \sum_{b = 1}^{B} I \{ | T_{\underline{a}, b}^{*} | \geq | T_{\underline{a}} | \} .$

To see why this bootstrap method works, we note that
$\begin{matrix} n^{- 1 / 2} {\underline{a}}^{T} {\underline{\hat{θ}}}_{2} & = & (n^{- 1 / 2} \sum_{i = 1}^{n} a_{i} {\underline{ϕ}}_{2}^{(0) T} {\underline{y}}_{i}) + o_{p} (1), \\ n^{- 1 / 2} {\underline{a}}^{T} {\underline{\hat{θ}}}_{2}^{*} & = & (n^{- 1 / 2} \sum_{i = 1}^{n} a_{i} {\underline{ϕ}}_{2}^{(0) T} {\underline{y}}_{i}^{*}) + o_{p} (1), \end{matrix}$
where ${\underline{y}}_{i}^{*} = {\underline{y}}_{j_{i}}$ , and
$n^{- 1} ∥ {\underline{\hat{θ}}}_{2} ∥^{2} - {\hat{θ}}_{2 \cdot}^{2} - (n^{- 1} ∥ {\underline{\hat{θ}}}_{2}^{*} ∥^{2} - {\hat{θ}}_{2 \cdot}^{* 2}) = o_{p} (1) .$
Since
${(n^{- 1} \sum_{i = 1}^{n} {({\underline{ϕ}}_{2}^{(0) T} {\underline{y}}_{i})}^{2} - {[n^{- 1} \sum_{i = 1}^{n} {\underline{ϕ}}_{2}^{(0) T} {\underline{y}}_{i}]}^{2})}^{- 1 / 2} (n^{- 1 / 2} \sum_{i = 1}^{n} a_{i} {\underline{ϕ}}_{2}^{(0) T} {\underline{y}}_{i}) \overset{L}{\to} N (0, 1)$
under the null hypothesis, the bootstrap method works by Theorem 1 of Mammen (1991). Our proposed bootstrap method acts on θ̂₂, and avoids repeated computations of the SVD. The same idea can be used for T_A.
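The resampling scheme of this subsection can be sketched in a few lines. The code below makes several assumptions that go beyond the text: the observed statistic is taken to be the studentized form n^{−1/2}a̱^Tθ̱̂₂/σ̂ from Theorem 3.1, resamples with zero variance are redrawn as a practical guard, and the function names, the default B, and the toy data are hypothetical.

```python
import math
import random

def studentized(a, theta):
    """n^{-1/2} a^T theta divided by the plug-in standard deviation of theta."""
    n = len(theta)
    mean = sum(theta) / n
    var = sum(t * t for t in theta) / n - mean ** 2
    return sum(x * t for x, t in zip(a, theta)) / math.sqrt(n * var)

def bootstrap_p_value(theta2_hat, a, B=2000, seed=0):
    """Bootstrap p-value from resampling the centered predictions of the row effects."""
    rng = random.Random(seed)
    n = len(theta2_hat)
    center = sum(theta2_hat) / n
    t_obs = abs(studentized(a, theta2_hat))
    hits = 0
    for _ in range(B):
        while True:
            # Draw indices with replacement and center at the original mean.
            star = [theta2_hat[rng.randrange(n)] - center for _ in range(n)]
            star_mean = sum(star) / n
            if sum(t * t for t in star) / n - star_mean ** 2 > 1e-12:
                break      # guard (not in the text): redraw constant resamples
        if abs(studentized(a, star)) >= t_obs:
            hits += 1
    return hits / B

a = [1.0, -1.0] * 4
p_alt = bootstrap_p_value([3.0, -1.0, 2.0, -2.0, 4.0, 0.0, 1.0, -3.0], a)
p_null = bootstrap_p_value([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0], a)
```

Because only the n predicted row effects are resampled, each bootstrap replicate costs O(n) and no SVD is recomputed. In the second toy data set a̱^Tθ̱̂₂ = 0 exactly, so every replicate is at least as extreme and the p-value is 1.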

3.4. Test based on Maximum over Directions

If we do not have guided directions to look for patterns in μ̱₂ , we may wish to search over a larger number of directions. The chi-square test in Section 3.2 does not apply when k is large. However, the maximum over k = n − 1 directions

M_{n} = max_{1 \leq j \leq n - 1} n^{- 1 / 2} {\underline{a}}_{j}^{T} {\underline{\hat{θ}}}_{2},

(15)

has a simple limiting distribution when ε̱_i and ${\underline{θ}}_{2}^{(0)}$ are normally distributed.

Let

c_{n} = \sqrt{2 ln (n - 1)}, and b_{n} = c_{n} - 2^{- 1} c_{n}^{- 1} ln (4 π ln (n - 1)) .

(16)

Theorem 3.3

Assume the conditions of Theorem 3.1, with the additional assumption that ${\underline{θ}}_{2}^{(0)}$ and ε̱_i are normally distributed. For any matrix A as described in Section 3.2 with k = n − 1, we have $P (c_{n} (M_{n} / \hat{σ} - b_{n}) \leq x) \to \exp (- e^{- x})$ as n → ∞ under the null hypothesis that μ̱₂ = 0̱.

Under the alternative hypothesis, we should observe larger values of M_n. Furthermore, the convergence rate of the extreme statistic is discussed in Section 4.6 of Leadbetter, Lindgren and Rootzen (1983). Based on their arguments, we can use [Φ(u)]ⁿ⁻¹ to approximate the probability P(M_n/σ̂ ≤ u) in computing the p-values of the proposed test.

The normality of ${\underline{θ}}_{2}^{(0)}$ and ε̱_i is not a necessary condition for the limiting distribution to hold. Our simulation results not reported in this paper suggest that Theorem 3.3 may hold in a much broader setting.
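The two p-value approximations for the maximum test can be written down directly from (16), Theorem 3.3, and the [Φ(u)]ⁿ⁻¹ approximation. This is a minimal sketch in which the function names are hypothetical and the input is the observed value of M_n/σ̂.

```python
import math

def norming_constants(n):
    """c_n and b_n from (16)."""
    log_term = math.log(n - 1)
    c_n = math.sqrt(2.0 * log_term)
    b_n = c_n - math.log(4.0 * math.pi * log_term) / (2.0 * c_n)
    return c_n, b_n

def gumbel_p_value(m_over_sigma, n):
    """Upper-tail p-value from the limiting distribution exp(-exp(-x)) in Theorem 3.3."""
    c_n, b_n = norming_constants(n)
    return 1.0 - math.exp(-math.exp(-c_n * (m_over_sigma - b_n)))

def normal_power_p_value(m_over_sigma, n):
    """Upper-tail p-value from the finite-n approximation P(M_n / sigma_hat <= u) ~ Phi(u)^(n-1)."""
    phi = 0.5 * math.erfc(-m_over_sigma / math.sqrt(2.0))    # standard normal CDF
    return 1.0 - phi ** (n - 1)

n = 16
c_n, b_n = norming_constants(n)
p_gumbel = gumbel_p_value(3.0, n)           # hypothetical observed M_n / sigma_hat = 3.0
p_normal = normal_power_p_value(3.0, n)
```

The two approximations generally differ at small n, and the second one is the form suggested in the text for computing p-values.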

4. SIMULATIONS

To assess the performance of the proposed tests, we report Monte Carlo results based on data simulated from Model (2) with the following specifications. The sizes of the parameters are chosen to mimic some real microarray data.

${\underline{θ}}_{1}^{(0)}$ is generated from the multivariate N(μ̱₁, 150,000 I_n), where μ̱₁ = (4,500, 4,500, …, 4,500)^T;
${\underline{θ}}_{2}^{(0)}$ is generated from N(μ̱₂, 150,000 I_n), where μ̱₂ is equal to either (0, 0, …, 0)^T under the null hypothesis or (125, −125, …, 125, −125)^T under the alternative hypothesis;
${\underline{ϕ}}_{1} = {(2 \sqrt{3})}^{- 1} {(1, 1, \dots, 1)}^{T}$ and ${\underline{ϕ}}_{2} = {(2 \sqrt{3})}^{- 1} {(1, - 1, \dots, 1, - 1)}^{T}$ are of dimension 12;
The errors ε_ij (i = 1, 2, …, n, j = 1, 2, …, 12) are drawn from three different distributions in different experiments: the normal distribution N(0, 5,000), the t-distribution with 5 degrees of freedom multiplied by $10 \sqrt{30}$ , and the centered χ²-distribution 50(Z² − 1), where Z ∼ N(0,1).
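The simulation design above can be sketched as a data generator. The function name, argument names, and seeding are hypothetical, and the code assumes that θ̱₂⁽⁰⁾ is centered at μ̱₂, as the description of the null and alternative implies. All three error distributions have variance 5,000: the t-distribution with 5 degrees of freedom has variance 5/3, and 3,000 × 5/3 = 5,000, while 50(Z² − 1) has variance 2,500 × 2 = 5,000.

```python
import numpy as np

def simulate_data(n, error="normal", alternative=False, m=12, seed=None):
    """Draw one n x m data matrix from Model (2) under the Section 4 settings."""
    rng = np.random.default_rng(seed)
    signs_n = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
    signs_m = np.where(np.arange(m) % 2 == 0, 1.0, -1.0)
    mu1 = np.full(n, 4500.0)
    mu2 = 125.0 * signs_n if alternative else np.zeros(n)
    sd_theta = np.sqrt(150_000.0)
    theta1 = rng.normal(mu1, sd_theta)
    theta2 = rng.normal(mu2, sd_theta)       # assumption: mean mu_2, variance 150,000
    phi1 = np.ones(m) / np.sqrt(m)           # equals (2 sqrt(3))^{-1} (1, ..., 1)^T for m = 12
    phi2 = signs_m / np.sqrt(m)
    if error == "normal":
        eps = rng.normal(0.0, np.sqrt(5000.0), size=(n, m))
    elif error == "t":
        eps = 10.0 * np.sqrt(30.0) * rng.standard_t(5, size=(n, m))
    elif error == "chisq":
        eps = 50.0 * (rng.standard_normal((n, m)) ** 2 - 1.0)
    else:
        raise ValueError("error must be 'normal', 't' or 'chisq'")
    return np.outer(theta1, phi1) + np.outer(theta2, phi2) + eps

Y_null = simulate_data(8, error="normal", alternative=False, seed=1)
Y_alt = simulate_data(8, error="t", alternative=True, seed=1)
```

Repeating the draw over many seeds and applying a test to each matrix gives Monte Carlo estimates of the type I error (null) and the power (alternative).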

4.1. Test on a target direction

Four different sample sizes are used: n = 8, 16, 32 and 128. Furthermore, we choose two different vectors a̱ to compare the performance of the test T_a̱ discussed in Section 3.

4.1.1. Case 1

In the first case, we choose a̱ = (1, −1, …, 1, −1)^T, which is the ideal choice for detecting the alternative in our settings. We draw 5,000 data sets, and the 5,000 p-values are calculated based on the limiting distribution in Theorem 3.1. For the test T_a̱, the type I errors are close to the nominal level of 0.05 when n ≥ 16. Also clear from Table 1 is that the power of the test is decent even when the sample size is as small as 8.

TABLE 1.

Type I errors and powers of the target direction test are listed with increasing sample size n. The errors are generated from three different distributions.

	Null			Alternative
Size	Normal	t	χ²	Normal	t	χ²

8^a	0.0560	0.0510	0.0430	0.6362	0.6088	0.5718
16^a	0.0542	0.0492	0.0426	0.9540	0.9308	0.9004
32^a	0.0470	0.0500	0.0460	0.9998	0.9974	0.9940
128^a	0.0522	0.0508	0.0532	1.0000	1.0000	1.0000
8^b	0.0552	0.0568	0.0458	0.4202	0.4104	0.3854
16^b	0.0530	0.0500	0.0490	0.8358	0.8190	0.7840
32^b	0.0546	0.0494	0.0440	0.9934	0.9890	0.9854
128^b	0.0522	0.0514	0.0486	1.0000	0.9998	1.0000

^a The results are from Case 1.

^b The results are from Case 2.

4.1.2. Case 2

We choose

\underline{a} = 2^{- 1} \sqrt{3} {(1, - 1, \dots, 1, - 1)}^{T} + 2^{- 1} {(1, \dots, 1, - 1, \dots, - 1)}^{T},

to see whether the test has meaningful power when a̱ is not so well chosen to target the true pattern in μ̱₂. The results are given in the lower half of Table 1. A comparison with Case 1 shows that the power of the test T_a̱ is sensitive to the choice of a̱ for small n, so a good target direction based on the nature of the experiment or the knowledge of the experimenter is very valuable.

4.2. The χ² test

For the χ² test of Section 3.2, four sample sizes n = 8,16,32,64 are used with the Monte Carlo sample size of 5,000. We generated k = 4 vectors, which are orthogonal to μ̱₁ , orthogonal to each other, and are of length n. The algorithm to generate the vectors can be described as follows.

A = (\begin{matrix} 1 & 1 \\ 1 & - 1 \end{matrix}) \otimes \dots \otimes (\begin{matrix} 1 & 1 \\ 1 & - 1 \end{matrix}),

where ⊗ is the Kronecker product, and the 2 × 2 factor appears log₂ n times so that A is an n × n matrix. After the first column of A is deleted, the next k = 4 columns are the vectors we use in the χ² test. The estimated type I errors and powers of the test are listed in Table 2. It is clear that the type I error is not close to 0.05 when n ≤ 16. In fact, we find from Table 3 that the type I errors based on the limiting distributions of T_a̱ and the χ² test can be too high or too low when the sample size n is small. The bootstrap method manages to control the type I errors even in small samples.
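The Kronecker construction of the target directions can be sketched as follows, assuming that n is a power of two so that log₂ n Kronecker factors give an n × n matrix. The function name is hypothetical. The columns of such a matrix are mutually orthogonal with squared norm n, and every column after the first sums to zero, so it is orthogonal to a constant mean vector μ̱₁.

```python
import numpy as np

def kronecker_directions(n, k=4):
    """Return k target directions (as rows) from the repeated Kronecker product of [[1, 1], [1, -1]]."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if k > n - 1:
        raise ValueError("at most n - 1 non-constant directions are available")
    factor = np.array([[1, 1], [1, -1]])
    A = np.array([[1]])
    for _ in range(n.bit_length() - 1):        # log2(n) Kronecker factors
        A = np.kron(A, factor)
    # Drop the constant first column and keep the next k columns as directions.
    return A[:, 1:k + 1].T

directions = kronecker_directions(8, k=4)
gram = directions @ directions.T               # should equal 8 * I_4
row_sums = directions.sum(axis=1)              # should all be zero
```

The Gram matrix confirms the conditions a̱_i⊥a̱_j and a̱_i^Ta̱_i = n required in Section 3.2.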

TABLE 2.

Type I errors and powers of the χ² test are listed with increasing sample size n. The errors are drawn from three distributions.

	Null			Alternative
Size	Normal	t	χ²	Normal	t	χ²

8	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
16	0.0296	0.0264	0.0236	0.6406	0.6104	0.5734
32	0.0418	0.0384	0.0394	0.9918	0.9846	0.9822
64	0.0464	0.0504	0.0422	1.0000	0.9990	1.0000

TABLE 3.

Type I errors and powers are listed for comparison between the bootstrap and the large-sample approximation. The errors are generated from three different distributions.

Type I Error
	Asymptotic Approximation			Bootstrap
n	Normal	t	χ²	Normal	t	χ²

6^a	0.1174	0.1096	0.1004	0.0420	0.0416	0.0350
8^a	0.0552	0.0568	0.0458	0.0484	0.0520	0.0440
8^b	0.0000	0.0000	0.0000	0.0406	0.0396	0.0256
16^b	0.0296	0.0264	0.0236	0.0520	0.0430	0.0420

Estimated Power
	Asymptotic Approximation			Bootstrap
n	Normal	t	χ²	Normal	t	χ²

6^a	0.4950	0.4820	0.4646	0.2560	0.2380	0.2194
8^a	0.4202	0.4104	0.3854	0.3738	0.3670	0.3480
8^b	0.0000	0.0000	0.0000	0.1508	0.1506	0.1338
16^b	0.6404	0.6104	0.5734	0.7142	0.6912	0.6746

^a The results are from the test on the target direction $2^{- 1} \sqrt{3} {(1, - 1, \dots, 1, - 1)}^{T} + 2^{- 1} {(1, \dots, 1, - 1, \dots, - 1)}^{T}$ .

^b The results are from the χ² test based on the four target directions.

4.3. Test based on maximum over directions

Similar to Table 2, Table 4 shows the performance of the test M_n of Section 3.4 based on the limiting distributions. The test is conservative for small n, but remains quite powerful in the study. The test can be used even when the normality assumption in Theorem 3.3 is violated. However, our simulation results that are not reported here suggest that if ${\underline{θ}}_{2}^{(0)}$ and ε̱_i do not have finite 4th moments, the limiting distribution would not take effect for realistic sample sizes considered in this paper.

TABLE 4.

Type I errors and powers of the test based on maximum over directions are listed with increasing sample size n. The errors are drawn from three distributions.

	Null			Alternative
Size	Normal	t	χ²	Normal	t	χ²

8	0.0018	0.0012	0.0008	0.0270	0.0268	0.0232
16	0.0306	0.0256	0.0190	0.6992	0.6620	0.6216
32	0.0378	0.0376	0.0264	0.9850	0.9766	0.9666
64	0.0428	0.0404	0.0362	1.0000	0.9988	0.9994

5. CASE STUDIES

In this section, we analyze two microarray data sets. We apply our testing methods to search for genes with potentially complicated mean structure, and further analyze some of those genes to understand the possible causes. The data are quantile normalized in each case.

5.1. Example 1

We considered the GeneChip data (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5350) obtained from the recent MicroArray Quality Control (MAQC) project and used in Lin, et al. (2006). We have a total of 20 microarrays (HG-U133-Plus-2.0), generated from five colorectal adenocarcinomas and five matched normal colonic tissues with 1 technical replicate at each of two laboratories involved in the MAQC project.

In this study, we use PM as the intensity measure in Y, and carry out the SVD to get the two largest singular values λ̂₁ > λ̂₂. We focus on the 350 probe-sets with the highest ratios ${\hat{λ}}_{2}^{2} / {\hat{λ}}_{1}^{2}$ (all of those ratios are above 1/10). For each probe-set, the probe-level microarray data are stored in a matrix, where the rows correspond to the probes and the columns correspond to the arrays. The intensities from the normal tissues are entered in columns 1–5 and 11–15, and those from the tumors are entered in the remaining columns.
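The screening step can be sketched as follows, assuming each probe-set is available as a NumPy matrix of PM intensities. The dictionary `probe_sets`, the toy matrices, and the helper name `singular_value_ratio` are hypothetical. The sketch ranks probe-sets by the squared ratio of the second to the first singular value and keeps the top ones.

```python
import numpy as np

def singular_value_ratio(y):
    """Return lambda_2^2 / lambda_1^2 for the data matrix y."""
    # compute_uv=False returns only the singular values, in descending order
    s = np.linalg.svd(np.asarray(y, dtype=float), compute_uv=False)
    return (s[1] / s[0]) ** 2

# Toy stand-ins for probe-set matrices (not real intensities).
probe_sets = {
    "toy_A": np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),  # singular values 3 and 1
    "toy_B": np.array([[5.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),  # singular values 5 and 1
    "toy_C": np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),  # singular values 2 and 1
}

ratios = {name: singular_value_ratio(y) for name, y in probe_sets.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)  # highest ratio first
top = ranked[:2]                                       # analogue of keeping the top 350
```

The ratio does not depend on whether arrays are stored as rows or as columns, because a matrix and its transpose share the same singular values.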

We choose a target direction to contrast the two groups in the study. In particular, we use

{\underline{a}}_{1} \propto {(- {\hat{μ}}_{2}, \dots, - {\hat{μ}}_{2}, {\hat{μ}}_{1}, \dots, {\hat{μ}}_{1}, - {\hat{μ}}_{2}, \dots, - {\hat{μ}}_{2}, {\hat{μ}}_{1}, \dots, {\hat{μ}}_{1})}^{T},

where μ̂₁ is taken to be the median of the θ̂_1i in the first group (normal tissues), and μ̂₂ the median of the θ̂_1i in the other group. Because the two groups have the same number of arrays, we have a̱₁ ⊥ μ̱̂, where

\underline{\hat{μ}} = {({\hat{μ}}_{1}, \dots, {\hat{μ}}_{1}, {\hat{μ}}_{2}, \dots, {\hat{μ}}_{2}, {\hat{μ}}_{1}, \dots, {\hat{μ}}_{1}, {\hat{μ}}_{2}, \dots, {\hat{μ}}_{2})}^{T} .
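As an illustration of this construction, the sketch below builds μ̱̂ and a̱₁ from a vector of first-dimension scores and a group indicator, then checks orthogonality numerically. The scores `theta1` are simulated placeholders, the function name `contrast_direction` is hypothetical, and scaling a̱₁ to unit length is an assumption made here, since the text defines a̱₁ only up to proportionality.

```python
import numpy as np

def contrast_direction(theta1, is_normal):
    """Build mu_hat and a unit-length a1 from first-dimension scores theta1."""
    theta1 = np.asarray(theta1, dtype=float)
    is_normal = np.asarray(is_normal, dtype=bool)
    mu1 = np.median(theta1[is_normal])      # median score over normal tissues
    mu2 = np.median(theta1[~is_normal])     # median score over tumors
    mu_hat = np.where(is_normal, mu1, mu2)  # mu1 on normal arrays, mu2 on tumor arrays
    a1 = np.where(is_normal, -mu2, mu1)     # -mu2 on normal arrays, mu1 on tumor arrays
    return mu_hat, a1 / np.linalg.norm(a1)  # unit length is an assumption of this sketch

# Array layout from the text: positions 1-5 and 11-15 are normal, the rest are tumors.
is_normal = np.array([True] * 5 + [False] * 5 + [True] * 5 + [False] * 5)
rng = np.random.default_rng(0)
theta1 = np.where(is_normal, 0.20, 0.24) + rng.normal(scale=0.01, size=20)  # made-up scores

mu_hat, a1 = contrast_direction(theta1, is_normal)
inner = float(a1 @ mu_hat)  # zero up to rounding when the groups have equal size
```

The inner product vanishes because each normal array contributes -μ̂₁μ̂₂ and each tumor array contributes +μ̂₁μ̂₂, and there are ten arrays of each kind.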

Using the test T_a̱ developed in Section 3.1, we find that 81 out of 350 probe-sets are detected as individually significant at the 0.05 level. Out of those, 36 probe-sets remain significant after the multiple test adjustment of Benjamini and Hochberg (1995).
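For readers unfamiliar with the Benjamini and Hochberg (1995) adjustment, a minimal sketch of the step-up procedure in its adjusted p-value form is given below. The function name `bh_adjust` and the p-values are made up for illustration and are not taken from the study.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                            # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)      # p_(k) * m / k
    # enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)  # cap at 1, restore input order
    return adjusted

# Hypothetical p-values from eight probe-set tests.
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
adj = bh_adjust(pvals)
significant = adj <= 0.05  # probe-sets retained at a false discovery rate of 0.05
```

In this toy input, five tests are individually significant at the 0.05 level but only two remain so after adjustment, which mirrors the kind of reduction reported above.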

When we plot (θ̂_1i, θ̂_2i), i = 1, 2, …, 20, and (ϕ̂_1j, ϕ̂_2j), j = 1, 2, …, m, for the probe-sets that are detected as significant, some interesting facts can be observed. We now zoom in on three of those probe-sets.

5.1.1. Probe-set “214974_x_at”

In the study, the probe-set “214974_x_at” is used to measure the expression level of Gene “CXCL5”. Our test gave the p-value of 1.11 × 10⁻³, the adjusted p-value of 2.38 × 10⁻², and the q-value, as proposed in Storey (2003), of 5.77 × 10⁻⁴, offering significant evidence against the uni-dimensional model. The first four singular values of the data matrix are (3387, 1388, 361, 168). As mentioned in the Introduction with Figure 1, the arrays cannot be easily separated by the first right singular vector, but if we use (θ̂_1i, θ̂_2i) jointly, the arrays are well separated in the 2-dimensional space. The usual one-dimensional index of the probe-set is insufficient to summarize the gene expression of “CXCL5”.

Further inspection of the data shows that the intensities from Probe 3 are much higher than those of the other probes, and that Probe 3 contributes dominantly to the values of θ̂_1i. Using the Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nlm.nih.gov/blast/Blast.cgi), we found that Probe 3 is represented in both Gene “CXCL5” and Gene “N-PAC”, whereas the other probes were confirmed as specific to Gene “CXCL5”. We further confirmed that the intensities of Probe 3 were highly correlated with the intensities of several probes in the probe-set “208506_x_at” (designed by Affymetrix to measure the expression level of Gene “N-PAC”), and thus Probe 3 needs to be treated with caution. If Probe 3 is removed from the probe-set, the first singular vector clearly separates the two groups; see Figure 2. In this case, the second singular vector from the whole probe-set appears to be a better summary of Gene “CXCL5”. We note that Gene “CXCL5” has been indicated as an important gene for colorectal cancer in the literature. For example, Dimberg, et al. (2007) observed significantly higher expression levels of the protein encoded by “CXCL5” in colorectal cancer tumors than in normal tissue, so the multi-dimensionality of the probe-set “214974_x_at” flagged through our statistical work can offer biologically relevant information.

Fig 2 — Scatterplot of singular vectors for the probe-set “214974_x_at” after we remove Probe 3. See Figure 1 for more details about this figure.

5.1.2. Probe-set “227899_at”

The probe-set “227899_at” is designed by Affymetrix to measure the expression level of Gene “VIT”. Our test gave the p-value 8.78 × 10⁻⁴, the adjusted p-value 2.38 × 10⁻², and the q-value 5.77 × 10⁻⁴. The first four singular values are (3178, 1011, 227, 77).

From Figure 3, we note that differential expression can be detected from the second right singular vector, but not the first. From the probe-level data, we find that the intensities of Probe 4 and Probe 7 are much higher than those of the other probes, and these two probes dominate the first two singular vectors. Furthermore, we confirmed by BLAST that both probes, as well as the other probes, are specific to Gene “VIT”. As a double check, we applied the re-mapping method proposed by Lu, et al. (2007) and confirmed that all the probes in this probe-set are specific to the three transcript variants of Gene “VIT”. Therefore, a 2-dimensional summary of the gene appears necessary for this probe-set.

Fig 3 — Scatterplot of singular vectors for the probe-set “227899_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues.

To make the point further, we provide the absolute value of percentages calculated from M₁ and M₂ in Table 5, where

M_{1} = \begin{pmatrix} {\hat{θ}}_{11} {\hat{ϕ}}_{11} & \dots & {\hat{θ}}_{11} {\hat{ϕ}}_{1 m} \\ \vdots & \ddots & \vdots \\ {\hat{θ}}_{1 n} {\hat{ϕ}}_{11} & \dots & {\hat{θ}}_{1 n} {\hat{ϕ}}_{1 m} \end{pmatrix},

and

M_{2} = \begin{pmatrix} {\hat{θ}}_{21} {\hat{ϕ}}_{21} & \dots & {\hat{θ}}_{21} {\hat{ϕ}}_{2 m} \\ \vdots & \ddots & \vdots \\ {\hat{θ}}_{2 n} {\hat{ϕ}}_{21} & \dots & {\hat{θ}}_{2 n} {\hat{ϕ}}_{2 m} \end{pmatrix} .

TABLE 5.

A summary of the absolute values of θ̂_2iϕ̂_2j/(θ̂_1iϕ̂_1j), in percent, by probe.

227899_at	Min.(%)	Q1(%)	Med.(%)	Q3(%)	Max.(%)
Probe 1	0.06	2.48	5.03	6.56	10.92
Probe 2	0.01	0.32	0.64	0.84	1.39
Probe 3	0.04	2.00	4.04	5.27	8.78
Probe 4	0.39	17.37	35.20	45.88	76.40
Probe 5	0.07	2.97	6.02	7.85	13.06
Probe 6	0.04	1.84	3.72	4.85	8.08
Probe 7	0.22	10.01	20.29	26.44	44.02
Probe 8	0.04	1.77	3.59	4.68	7.79
Probe 9	0.03	1.29	2.62	3.42	5.69
Probe 10	0.11	4.74	9.61	12.52	20.85
Probe 11	0.17	7.69	15.59	20.32	33.83

It is clear that the information contained in the second dimension for Probes 4 and 7 is important, because in more than half of the arrays their contributions from the second dimension are more than 20% of those from the first. The joint use of θ̂_1i and θ̂_2i gives a more complete picture about the expression profile of Gene “VIT”.
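A per-probe summary of this kind can be sketched as follows, under the assumption that θ̂_ki ϕ̂_kj equals the (i, j) entry of the k-th rank-one term of the SVD of a matrix with arrays in rows and probes in columns, as in the layout of M₁ and M₂. The function name `second_to_first_percent` and the simulated matrix are illustrative only, and NumPy's default quantile rule may differ from the one used for the published table.

```python
import numpy as np

def second_to_first_percent(y):
    """Per-probe Min, Q1, Median, Q3, Max of 100 * |M2 / M1| (rows = arrays, columns = probes)."""
    u, s, vt = np.linalg.svd(np.asarray(y, dtype=float), full_matrices=False)
    m1 = s[0] * np.outer(u[:, 0], vt[0])  # first rank-one component
    m2 = s[1] * np.outer(u[:, 1], vt[1])  # second rank-one component
    pct = 100.0 * np.abs(m2 / m1)         # assumes no entry of m1 is exactly zero
    return np.percentile(pct, [0, 25, 50, 75, 100], axis=0).T  # one row per probe

# Simulated rank-2 matrix standing in for a probe-set with 20 arrays and 11 probes.
rng = np.random.default_rng(1)
n_arrays, n_probes = 20, 11
y = (10.0 * np.outer(rng.uniform(1, 2, n_arrays), rng.uniform(1, 2, n_probes))
     + np.outer(rng.normal(size=n_arrays), rng.normal(size=n_probes)))

summary = second_to_first_percent(y)
```

Because the ratio factors as |θ̂_2i/θ̂_1i| times |ϕ̂_2j/ϕ̂_1j|, every probe's row in such a table is a constant multiple of the same array-level quantiles, which is why the rows of Table 5 are proportional to one another.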

5.1.3. Probe-set “1560296_at”

The probe-set “1560296_at” is used in the HG-U133-Plus-2.0 platform to represent Gene “DST”. This probe-set is detected by our test with a significant 2-dimensional mean structure (p-value 1.88 × 10⁻³, adjusted p-value 2.87 × 10⁻², and q-value 6.96 × 10⁻⁴). The first four singular values are (5470, 1748, 504, 271).

From Figure 4, we observe that Probes 1 and 2 are dominant. Further inspection shows that the first singular vector is primarily determined by these two probes. Following the method of Lu, et al. (2007), we find that Probes 1, 2 and 3 are each re-mapped to three transcripts (“veejee.aApr07-unspliced”, “DST.vlApr07-unspliced” and “DST.iApr07”), whereas the other probes are re-mapped to two variants only (“veejee.aApr07-unspliced” and “DST.vlApr07-unspliced”). For this probe-set, the significant 2-dimensional mean structure of the data matrix could be resolved by proper re-mapping of the probes.

Fig 4 — Scatterplot of singular vectors for the probe-set “1560296_at”. The probe numbers are shown in the lower plot, and the dotted line is given by the least trimmed squares estimate. The circles in the upper plot represent the arrays hybridized by the samples from the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by the samples from the normal colonic tissues.

5.2. Example 2

In this example, the data (http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE8874) were collected in a recent experiment with a 2×2×2 factorial design, the details of which are discussed in Leung, et al. (2008). The three factors (with two levels each) are:

mutation: mutant or wild type (WT);
tissue: retinas or whole body;
time: 36 or 52 hours post-fertilization.

Under each condition, three Affymetrix zebrafish genome arrays are used as replicates, so we have 24 arrays in total. The vector μ̱̂ is computed as in Example 1 by assuming that the means in each tissue group are equal. Furthermore, we generate two directions a̱₁ and a̱₂, used to reflect the possible tissue and mutation effects, respectively. In the study, we again use PM as the intensity measure and carry out the singular value decomposition to get the two largest singular values λ̂₁ and λ̂₂, where λ̂₁ ≥ λ̂₂. We focus on the 75 probe-sets with the highest ${\hat{λ}}_{2}^{2} / {\hat{λ}}_{1}^{2}$ (all of those ratios are above 1/10), and use the χ² test described in Section 3.2 on each of those probe-sets.

In this example, 39 out of 75 probe-sets are detected as individually significant, out of which 39 probe-sets remain significant after the multiple test adjustment of Benjamini and Hochberg (1995). We shall describe one such probe-set in detail.

5.2.1. Probe-set “Dr.7506.1.A1_at”

In the zebrafish genome array, the probe-set “Dr.7506.1.A1_at” corresponds to Gene “tuba8l2”. The χ² test gave the p-value of 2.37 × 10⁻⁵, the adjusted p-value of 7.52 × 10⁻⁵, and the q-value of 4.83 × 10⁻⁶. The first four singular values are (43142, 14839, 2078, 1688). It is clear from Figure 5 that we cannot distinguish the two tissue groups based on θ̂_1i, but the two groups are well separated by θ̂_2i. Further inspection of the data shows that the intensities of Probe 3 are linearly related to θ̂_1i, whereas the intensities of Probe 15 are linearly related to θ̂_2i. From Table 6, we see that the information from θ̂_2i is clearly non-negligible. Furthermore, we used BLAST to verify that all the probes are appropriate for Gene “tuba8l2”, so there is strong evidence that the expression profile for Gene “tuba8l2” cannot be summarized by the usual uni-dimensional index across experimental conditions. In fact, the commonly used gene expression index would mask the clear differential expression between the two tissue types.

Fig 5 — Scatterplot of singular vectors for the probe-set “Dr.7506.1.A1_at”. The probe numbers are shown in the lower plot and the dotted line is a robust linear fit. The circles in the upper plot represent the arrays hybridized by the samples from retinas, while the solid points represent the arrays hybridized by the samples from whole body.

TABLE 6.

A summary of the absolute values of θ̂_2iϕ̂_2j/(θ̂_1iϕ̂_1j), in percent, by probe.

Dr.7506.1.A1_at	Min.(%)	Q1(%)	Med.(%)	Q3(%)	Max.(%)
Probe 1	23.71	40.44	48.73	58.58	81.25
Probe 2	20.13	34.33	41.37	49.73	68.97
Probe 3	7.76	13.23	15.94	19.16	26.58
Probe 4	30.74	52.42	63.16	75.94	105.32
Probe 5	13.20	22.51	27.12	32.60	45.22
Probe 6	7.95	13.56	16.33	19.64	27.23
Probe 7	12.37	21.10	25.42	30.56	42.38
Probe 8	3.08	5.25	6.32	7.60	10.54
Probe 9	18.24	31.10	37.48	45.05	62.48
Probe 10	27.56	47.00	56.63	68.08	94.42
Probe 11	19.89	33.92	40.87	49.13	68.14
Probe 12	26.07	44.47	53.58	64.42	89.34
Probe 13	11.71	19.98	24.07	28.94	40.13
Probe 14	33.25	56.70	68.32	82.14	113.92
Probe 15	38.84	66.24	79.81	95.96	133.08
Probe 16	39.66	67.64	81.50	97.98	135.88

6. Conclusions

In this article, we have proposed a new framework for testing the uni-dimensional mean structure of the probe-level data matrix. For most applications, we can carry out the tests discussed in the article based on large sample approximations. We also proposed a model-based bootstrap algorithm to better control type I errors when the sample size is small.

In two case studies, the proposed method detected genes whose expression levels were not well summarized by uni-dimensional indices. Through detailed inspection of the probe-level intensities of those genes, we found that the intensities of different probes can show different profiles across experimental conditions. In our investigation, we observed the following scenarios in which a uni-dimensional gene expression summary is violated.

A large percentage of probes that have poor binding strengths or low intensity measures in a probe-set can mask the gene expression profiles.
One or more probes should be re-mapped to different variants of the same gene.
One or more probes are cross-hybridized.
An outlying and erroneous measurement is present for one of the probes.
The multiplicative model used to summarize gene expression is inadequate even when all the probes are well selected.

It has been observed by Harbig, et al. (2005) that outlier signals on just one probe can seriously affect the calculations used for the subsequent analysis. While we do not always have definite answers as to the biological implications of such structures, our statistical analysis is valuable in flagging potentially interesting and important probes and genes for further scientific investigation. Our approach does not lead directly to probe re-mapping, but may suggest candidates for possible alternative mapping (Gautier, et al. (2004); Lu, et al. (2007)). The bottom line is clear: if we rely solely on models that assume uni-dimensional gene expression, we might miss some of the complexities in gene expression data. When a uni-dimensional model is shown to be inadequate, appropriate actions, such as probe re-mapping, an alternative model or a different summarization method (e.g. Kapur, et al. (2007)), are called for.

Supplementary Material

Supplement: NIHMS121567-supplement-supplement.pdf (127KB, pdf)

Acknowledgments

The research was partially supported by NSF Grants DMS-0604229 and DMS-0800631 and NIH Grant R01GM080503-01A1 in the US, NNSF of China Grant 10828102, and a Changjiang Visiting Professorship at the Northeast Normal University, China. The authors thank Drs. Ping Ma and Sheng Zhong, as well as the Associate Editor, for their helpful suggestions on the case studies presented in the paper.

Footnotes

SUPPLEMENTARY MATERIAL

Proofs of Main Results

(doi: ???http://lib.stat.cmu.edu/aoas/???/???;.pdf). We give a lemma on consistency, followed by the proofs for the theorems that are described in Sections 2 and 3.

References

Berman S. Limit Theorems for The Maximum Term in Stationary Sequences. The Annals of Mathematical Statistics. 1964;35:502–516.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995;57:289–300.
Dimberg J, Dienus O, Löfgren S, Hugander A, Wågsäter D. Expression and gene polymorphisms of the chemokine CXCL5 in colorectal cancer patients. International Journal of Oncology. 2007;31:97–102.
Feng X, He X. Supplement to “Inference on low-rank data matrices with applications to microarray data.” 2009. doi: 10.1214/09-AOAS262SUPP.
Gautier L, Møller M, Friis-Hansen L, Knudsen S. Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics. 2004;5:111. doi: 10.1186/1471-2105-5-111.
Harbig J, Sprinkle R, Enkemann SA. A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Research. 2005;33:e31. doi: 10.1093/nar/gni027.
He X, Shao Q. A General Bahadur Representation of M-Estimators and Its Application to Linear Regression With Nonstochastic Designs. The Annals of Statistics. 1996;24:2608–2630.
Hu J, Wright F, Zou F. Estimation of Expression Indexes for Oligonucleotide Arrays Using Singular Value Decomposition. Journal of the American Statistical Association. 2006;101:41–50.
Kapur K, Ying Y, Ouyang Z, Wong W. Exon arrays provide accurate assessments of gene expression. Genome Biology. 2007;8:R82. doi: 10.1186/gb-2007-8-5-r82.
Leadbetter MR, Lindgren G, Rootzen H. Extremes and Related Properties of Random Sequences and Processes. 1st ed. New York: Springer–Verlag; 1983.
Leung YF, Ma P, Link BA, Dowling J. Factorial microarray analysis of zebrafish retinal development. Proceedings of the National Academy of Sciences. 2008;105:12909–12914. doi: 10.1073/pnas.0806038105.
Li C, Wong WH. Model-Based Analysis of Oligonucleotide Arrays: Expression Index and Outlier Detection. Proceedings of the National Academy of Sciences USA. 2001;98:31–36. doi: 10.1073/pnas.011404098.
Lu J, Lee JC, Salit ML, Cam MC. Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: High-resolution annotation for microarrays. BMC Bioinformatics. 2007;8:108. doi: 10.1186/1471-2105-8-108.
Mammen E. When Does Bootstrap Work? Asymptotic Results and Simulations. 1st ed. New York: Springer–Verlag; 1991.
Lin G, He X, Ji H, Shi L, Davis RW, Zhong S. Reproducibility Probability Score-Incorporating Measurement Variability Across Laboratories for Gene Selection. Nature Biotechnology. 2006;24:1476–1477. doi: 10.1038/nbt1206-1476.
Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. The Analysis of Gene Expression Data. 1st ed. New York: Springer–Verlag; 2003.
Shi L, Reid LH, Jones WD, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology. 2006;24(9):1151–1161. doi: 10.1038/nbt1239.
Storey JD. The Positive False Discovery Rate: A Bayesian Interpretation and The q-value. The Annals of Statistics. 2003;31:2013–2035.
