NONPARAMETRIC ESTIMATION OF GENEWISE VARIANCE FOR MICROARRAY DATA

Jianqing Fan; Yang Feng; Yue S Niu

doi:10.1214/10-AOS802

Author manuscript; available in PMC: 2010 Nov 12.

Published in final edited form as: Ann Stat. 2010 Nov 1;38(5):2723–2750. doi: 10.1214/10-AOS802


PMCID: PMC2980338 NIHMSID: NIHMS248819 PMID: 21076694

Abstract

Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we propose two novel nonparametric estimators for the genewise variance function and semiparametric estimators for the measurement correlation, obtained by solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated with data from the MicroArray Quality Control (MAQC) project.

Keywords and phrases: Genewise variance estimation, gene selection, local linear regression, nonparametric model, correlation correction, validation test

1. Introduction

Microarray experiments are among the most widely used technologies nowadays, allowing scientists to monitor thousands of gene expressions simultaneously. One of the important scientific endeavors of microarray data analysis is to detect statistically differentially expressed genes for downstream analysis (Cui, Hwang, and Qiu, 2005; Fan, Tam, Vande Woude and Ren, 2004; Fan and Ren, 2006; Storey and Tibshirani, 2003; Tusher, Tibshirani, and Chu, 2001). The standard t-test and F-test are frequently employed. However, due to the cost of the experiment, it is common to see a large number of genes with a small number of replications. Even in customized arrays where the expressions of only several hundred genes are measured, the number of replications is usually limited. As a result, we face a high-dimensional statistical problem with a large number of parameters and a small sample size.

Genewise variance estimation arises at the heart of microarray data analysis. To select differentially expressed genes among thousands of genes, the t-test is frequently employed with a stringent control of type I errors. The degrees of freedom are usually small due to limited replications. The power of the test can be significantly improved if the genewise variance can be estimated accurately. In such a case, the t-test becomes basically a z-test. A simple genewise variance estimator is the sample variance of replicated data, which is not reliable due to the relatively small number of replications. This unreliability has a direct impact on the sensitivity and specificity of the t-test (Cui et al., 2005). Therefore, novel methods for estimating the genewise variances are needed for improving the power of the standard t-test.

Another important application of genewise variance estimation arises from testing whether systematic biases have been properly removed after applying some normalization method, or selecting the most appropriate normalization technique for a given array. Fan and Niu (2007) developed such validation tests (see Section 4), which require the estimation of genewise variance. Existing methods of variance estimation, such as the pooled variance estimator and the REML estimator (Smyth, Michaud, and Scott, 2005), are not accurate enough due to the small number of replications.

Due to the importance of genewise variance in microarray data analysis, conscientious efforts have been made to accurately estimate it. Various methods have been proposed under different models and assumptions. It has been widely observed that genewise variance is to a great extent related to the intensity level. Kamb and Ramaswami (2001) proposed a crude regression estimation of variance from microarray control data. Tong and Wang (2007) discussed a family of shrinkage estimators to improve the accuracy.

Let R_gi and G_gi respectively be the intensities of red (Cy5) and green (Cy3) channels for the i^th replication of the g^th gene in two-color microarray data. The log-ratios and log-intensities are computed respectively as

Y_{g i} = {log}_{2} (G_{g i} / R_{g i}), and X_{g i} = \frac{1}{2} {log}_{2} (G_{g i} R_{g i}), i = 1, \dots, I, g = 1, \dots, N,
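As a small illustration of the two transformed quantities, the following sketch computes the log-ratio Y and the log-intensity X for a single spot, using the formulas exactly as displayed above. The function name and the example intensities are hypothetical and are not taken from the paper.

```python
import math


def log_ratio_and_intensity(red, green):
    """Return (Y, X) for one spot: Y = log2(G / R), X = 0.5 * log2(G * R)."""
    if red <= 0 or green <= 0:
        raise ValueError("channel intensities must be positive")
    y = math.log2(green / red)
    x = 0.5 * math.log2(green * red)
    return y, x


# Hypothetical raw intensities for one replicate of one gene.
y, x = log_ratio_and_intensity(red=2.0, green=8.0)
```

Here X is the average of the two log-intensities, which is why it serves as the overall intensity level of the spot.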

where I is the number of replications for each gene and N is the number of genes with replications. For the purpose of estimating genewise variance, we assume that there are no systematic biases or that the systematic biases have been removed by a certain normalization method. This assumption is always made for selecting significantly differentially expressed genes or for validation tests under the null hypothesis. Thus we have

Y_{g i} = α_{g} + σ_{g i} ε_{g i},

with α_g denoting the log-ratio of gene expressions in the treatment and control samples. Here, (ε_g₁, ···, ε_gI)^T follows a multivariate normal distribution with ε_gi ~ N (0, 1) and Corr(ε_gi, ε_gj) = ρ when i ≠ j. It is also assumed that observations from different genes are independent. Such a model was used in Smyth et al. (2005).
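The equicorrelated noise vector in this model can be simulated for any admissible ρ, including negative values. The sketch below is one possible construction, not the authors' code: it builds a matrix M with M Mᵀ = (1 − ρ)·Id + ρ·E, so that ε = M z has the stated correlation structure when z is a vector of independent standard normals. The function names are hypothetical.

```python
import math
import random


def equicorrelation_mixer(n_rep, rho):
    """Matrix M with M M^T = (1 - rho) * Id + rho * ones, so eps = M z."""
    if not -1.0 / (n_rep - 1) < rho < 1.0:
        raise ValueError("rho must lie in (-1/(I-1), 1)")
    a = math.sqrt(1.0 - rho)                # scale of the centred part z - mean(z)
    b = math.sqrt(1.0 + (n_rep - 1) * rho)  # scale of the common part mean(z)
    off = (b - a) / n_rep
    return [[a + off if i == j else off for j in range(n_rep)]
            for i in range(n_rep)]


def gram(m):
    """Return M M^T for a square matrix stored as nested lists."""
    n = len(m)
    return [[sum(m[i][k] * m[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def draw_eps(m, rng):
    """Draw one correlated noise vector eps = M z with z ~ N(0, Id)."""
    z = [rng.gauss(0.0, 1.0) for _ in m]
    return [sum(row[k] * z[k] for k in range(len(z))) for row in m]


M = equicorrelation_mixer(3, -0.4)
C = gram(M)                      # implied covariance of eps
eps = draw_eps(M, random.Random(0))
```

The range check reflects the fact that an equicorrelation matrix is positive definite only when −1/(I − 1) < ρ < 1.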

In the papers by Wang, Ma, and Carroll (2008) and Carroll and Wang (2008), nonparametric measurement-error models have been introduced to aggregate the information of estimating the genewise variance:

Y_{g i} = α_{g} + σ (α_{g}) ε_{g i}, corr (ε_{g i}, ε_{g i'}) = 0, g = 1, \dots, N, i = 1, \dots, I .

(1)

The model is intended for the analysis of the Affymetrix array (one-color array) data in which α_g represents the expected intensity level, and Y_gi is the ith replicate of observed expression level of gene g. When it is applied to the two-color microarray data as in our setting, in which α_g is the relative expression profiles between the treatment and control, several drawbacks emerge: (a) The model is difficult to interpret as the genewise variance is a function of the log-ratio of expression profiles. (b) Errors-in-variable methods have a very slow rate of convergence for the nonparametric problem and the observed intensity information X_gi is not used. (c) They are usually hard to implement robustly and depend sensitively on the distribution of σ(α_g)ε_gi and the i.i.d. assumption on the noise. (d) In many microarray applications, α_g = 0 for most g and hence σ(α_g) are the same for most genes, which is unrealistic. Therefore, our model (2) below is complementary to that of Wang et al. (2008) and Carroll and Wang (2008), with focus on the applications to two-color microarray data.

To overcome these drawbacks in the applications to microarray data and to utilize the observed intensity information, we assume that σ_gi = σ(X_gi) for a smooth function σ(·). This leads to the following two-way nonparametric model

Y_{g i} = α_{g} + σ (X_{g i}) ε_{g i}, g = 1, \dots, N, i = 1, \dots, I,

(2)

for estimating genewise variance. This model is clearly an extension of the Neyman-Scott problem (Neyman and Scott, 1948), in which the genewise variance is a constant. The Neyman-Scott problem has many applications in astronomy. Note that the number of nuisance parameters {α_g} is proportional to the sample size. This poses an important challenge to the nonparametric problem. It is not even clear whether the function σ(·) can be consistently estimated.

To estimate the genewise variance in their microarray data analysis, Fan et al. (2004) assumed a model similar to (2). But in the absence of other available techniques, they had to impose that the treatment effect {α_g} is also a smooth function of the intensity level so that they could apply nonparametric methods to estimate genewise variance (Ruppert et al., 1997). However, this assumption is not valid in most microarray applications, and the estimator of genewise variance incurs big biases unless {α_g} is sparse, a situation that Fan et al. (2004) hoped for. Fan and Niu (2007) approached this problem in another simple way. When the noise in the replications is small, X_gi ≈ X̄_g, where X̄_g is the sample mean for the g-th gene. They therefore simply smoothed the pairs {(X̄_g, r̄_g)}, where ${\bar{r}}_{g} = \sum_{i = 1}^{I} {(Y_{g i} - {\bar{Y}}_{g})}^{2} / (I - 1)$ . This also leads to a biased estimator, which is denoted as ξ̂²(x). One asks naturally whether the function σ(·) is estimable and how it can be estimated in the general two-way nonparametric model.

We propose a novel nonparametric approach to estimate the genewise variance. We first study a benchmark case when there is no correlation between replications, i.e., ρ = 0. This corresponds to the case with independent replications across arrays (Fan et al., 2005; Huang et al., 2005). It is also applicable to the settings dealt with by the Neyman-Scott problem. By noticing that E{(Y_gi − Ȳ_g)²|X_gi} is a linear combination of σ²(X_gi), we obtain a system of linear equations. Hence, σ²(·) can be estimated via nonparametric regression of a proper linear combination of {(Y_gi − Ȳ_g)², i = 1, ···, I} on {X_gi}. The asymptotic normality of the estimator is established. In the case that the replication correlation does not vanish, the system of equations becomes nonlinear and cannot be solved analytically. However, we are able to derive the correlation corrected estimator, based on the estimator without genewise correlation. The genewise variance function and the correlation coefficient of repeated measurements are simultaneously estimated by iteratively solving a nonlinear equation. The asymptotic normality of such estimators is established.

Model (2) can be applied to the microarrays in which within-array replications are not available. In that case, we can aggregate all the microarrays together and view them as a super array with replications (Fan et al., 2005; Huang et al., 2005). In other words, i in (2) indexes arrays and ρ can be taken as 0, namely (2) is the across-array replication with ρ = 0.

The structure of this paper is as follows. In Section 2 we discuss the estimation schemes of the genewise variance and establish the asymptotic properties of the estimators. Simulation studies are given in Section 3 to verify the performance of our methods in finite samples. Applications to the data from the MicroArray Quality Control (MAQC) project are shown in Section 4 to illustrate the proposed methodology. In Section 5, we give a short summary. Technical proofs are relegated to the Appendix.

2. Nonparametric Estimators of Genewise Variance

2.1. Estimation without correlation

We first consider the specific case where there is no correlation among the replications Y_g₁, ···, Y_gI of the same gene g under model (2). This is usually applicable to the across-array replication and motivates our procedure for the more general case with the replication correlation. In the former case, we have

E [{(Y_{g i} - {\bar{Y}}_{g})}^{2} ∣ X] = {(I - 1)}^{2} σ^{2} (X_{g i}) / I^{2} + \sum_{j \neq i} σ^{2} (X_{g j}) / I^{2}, i = 1, \dots, I .

We will discuss in §2.2.4 the case that I = 2. For I > 2, we have I different equations with I unknowns σ²(X_g₁), σ²(X_g₂), ···, σ²(X_gI) for a given gene g. Solving these I equations, we can express the unknowns in terms of $\{E [{(Y_{g i} - {\bar{Y}}_{g})}^{2} ∣ X]\}_{i = 1}^{I}$ , which are estimable quantities. Let

r_{g} = {({(Y_{g 1} - {\bar{Y}}_{g})}^{2}, \dots, {(Y_{g I} - {\bar{Y}}_{g})}^{2})}^{T}, and σ_{g}^{2} = {(σ^{2} (X_{g 1}), \dots, σ^{2} (X_{g I}))}^{T} .

Then, it can easily be shown that $σ_{g}^{2} = B E [r_{g} ∣ X]$ , where B is the coefficient matrix:

B = \frac{(I^{2} - I) I - E}{(I - 1) (I - 2)},

with I being the I × I identity matrix and E the I × I matrix with all elements 1. Define

Z_{g} = {(Z_{g 1}, \dots, Z_{g I})}^{T} ≜ B r_{g} .

Then, we have

σ^{2} (X_{g i}) = E [Z_{g i} ∣ X] .

(3)
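The construction of the synthetic responses Z_g = B r_g can be sketched as follows. This is a minimal pure-Python illustration under the no-correlation setting; the function names are hypothetical, and the scalar I (number of replications) is written `n_rep` to avoid confusion with the identity matrix.

```python
def coefficient_matrix(n_rep):
    """B = ((I^2 - I) * Id - E) / ((I - 1)(I - 2)), defined for I > 2."""
    if n_rep <= 2:
        raise ValueError("B is only defined for I > 2")
    denom = (n_rep - 1) * (n_rep - 2)
    diag = (n_rep * n_rep - n_rep - 1) / denom
    off = -1.0 / denom
    return [[diag if i == j else off for j in range(n_rep)]
            for i in range(n_rep)]


def synthetic_response(y_rep):
    """Z_g = B r_g for one gene, where r_gi = (Y_gi - mean(Y_g))^2."""
    n = len(y_rep)
    y_bar = sum(y_rep) / n
    r = [(y - y_bar) ** 2 for y in y_rep]
    b = coefficient_matrix(n)
    return [sum(b[i][j] * r[j] for j in range(n)) for i in range(n)]


B = coefficient_matrix(3)
Z = synthetic_response([1.0, 2.0, 3.0])  # hypothetical replicates of one gene
```

Individual Z_gi values can be negative, since B has negative off-diagonal entries. Each column of B sums to I/(I − 1), so the average of Z_g1, ···, Z_gI equals the usual sample variance of the replicates.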

Note that the left hand side of (3) depends only on X_gi, not other variables. By the double expectation formula, it follows that the variance function σ²(·) can be expressed as the univariate regression:

σ^{2} (x) = E [Z_{g i} ∣ X_{g i} = x], i = 1, \dots, I .

(4)

Using the synthetic data {(X_gi, Z_gi), g = 1, ···, N} for each given i, we can apply the local linear regression technique (Fan and Gijbels, 1996) to obtain a nonparametric estimator ${\hat{η}}_{i}^{2} (x)$ of σ²(·). Explicitly, for a given kernel K and bandwidth h,

{\hat{η}}_{i}^{2} (x) = \sum_{g = 1}^{N} W_{N, i} (\frac{X_{g i} - x}{h}) Z_{g i}, i = 1, \dots, I,

(5)

with

W_{N, i} (u) = h^{- 1} K (u) \frac{S_{N, 2} - {u S}_{N, 1}}{S_{N, 2} S_{N, 0} - S_{N, 1}^{2}},

where K_h(u) = h⁻¹ K(u/h) and $S_{N, l} = \sum_{g = 1}^{N} K_{h} (X_{g i} - x) {[(X_{g i} - x) / h]}^{l}$ , whose dependence on i is suppressed. Thus we have I estimators ${\hat{η}}_{1}^{2} (x), \dots, {\hat{η}}_{I}^{2} (x)$ for the same genewise variance function σ²(·). Each of these I estimators ${\hat{η}}_{i}^{2} (x)$ is a consistent estimator of σ²(x). To optimally aggregate those I estimators, we need the asymptotic properties of $η (x) = {({\hat{η}}_{1}^{2} (x), \dots, {\hat{η}}_{I}^{2} (x))}^{T}$ .
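The local linear weights in (5) can be written out directly. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the default kernel is the tricube kernel that the simulations in Section 3.1 use.

```python
def tricube(u):
    """Tricube kernel (70/81) (1 - |u|^3)^3 on [-1, 1]."""
    a = abs(u)
    return 70.0 / 81.0 * (1.0 - a ** 3) ** 3 if a <= 1.0 else 0.0


def local_linear(xs, zs, x0, h, kernel=tricube):
    """Local linear fit at x0 using the weights W_{N,i} of equation (5)."""
    u = [(x - x0) / h for x in xs]
    k = [kernel(ui) / h for ui in u]                 # K_h(X - x0)
    s0 = sum(k)
    s1 = sum(ki * ui for ki, ui in zip(k, u))
    s2 = sum(ki * ui * ui for ki, ui in zip(k, u))
    denom = s0 * s2 - s1 * s1
    if denom <= 0.0:
        raise ValueError("too few points within the bandwidth around x0")
    num = sum(ki * (s2 - ui * s1) * zi for ki, ui, zi in zip(k, u, zs))
    return num / denom


# Sanity check on exactly linear data, which a local linear fit reproduces.
xs = [i / 10.0 for i in range(101)]
zs = [2.0 + 3.0 * x for x in xs]
fit = local_linear(xs, zs, x0=5.0, h=1.0)
```

In the setting of (5), `xs` would hold the intensities X_gi and `zs` the synthetic responses Z_gi for a fixed replication index i.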

Denote

c_{K} = \int_{- \infty}^{\infty} u^{2} K (u) d u, d_{K} = \int_{- \infty}^{\infty} K^{2} (u) d u, σ_{1} = E [σ (X_{g i})], and σ_{2} = E [σ^{2} (X_{g i})] .

Assume that X_gi are i.i.d. with marginal density f_X(·) and ε_gi are i.i.d. random variables from the standard normal distribution. In the following result, we assume that I is fixed, but N diverges.

Theorem 1

Under the regularity conditions in the Appendix, for a fixed point x, we have

Σ^{- 1 / 2} (η - (σ^{2} (x) + b (x) + o_{P} (h^{2})) e) \overset{D}{\to} N (0, I),

provided that h → 0 and N h → ∞, where e = (1, 1, ···, 1)^T and

Σ = V_{1} I + V_{2} (E - I),

with $b (x) = \frac{h^{2}}{2} c_{K} (σ^{2} (x))''$ ,

\begin{array}{l} V_{1} = \frac{d_{K}}{N h f_{X} (x)} \left\{ 2 σ^{4} (x) + \frac{4 + 4 (I - 1) (I - 3)}{(I - 1) {(I - 2)}^{2}} σ_{2} σ^{2} (x) + \frac{2}{(I - 1) (I - 2)} σ_{2}^{2} \right\}, \\ V_{2} = \frac{1}{N} \left\{ \frac{4}{{(I - 1)}^{2}} σ^{4} (x) - \frac{8}{{(I - 1)}^{2}} σ_{2} σ^{2} (x) + \frac{2 (I - 3)}{{(I - 1)}^{2} (I - 2)} σ_{2}^{2} \right\} . \end{array}

Note that V₂ is one order of magnitude smaller than V₁. Hence, the estimators ${\hat{η}}_{1}^{2} (x), \dots, {\hat{η}}_{I}^{2} (x)$ are asymptotically independently distributed as N(σ²(x) + b(x), V₁). Their dependence is only in the second order. The best linear combination of I estimators is

{\hat{η}}^{2} (x) = [{\hat{η}}_{1}^{2} (x) + {\hat{η}}_{2}^{2} (x) + \dots + {\hat{η}}_{I}^{2} (x)] / I,

(6)

with the asymptotic distribution

N (σ^{2} (x) + b (x), V_{1} / I + (1 - 1 / I) V_{2}) .

(7)
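The variance in (7) follows from the covariance structure in Theorem 1 in one line. As a worked step, with e = (1, ···, 1)^T and E = e e^T:

```latex
\operatorname{var}\{\hat{\eta}^{2}(x)\}
  \approx \frac{1}{I^{2}}\, e^{T} \Sigma\, e
  = \frac{1}{I^{2}} \bigl\{ I V_{1} + I (I - 1) V_{2} \bigr\}
  = \frac{V_{1}}{I} + \Bigl( 1 - \frac{1}{I} \Bigr) V_{2} .
```

Because Σ is invariant under permutations of the I coordinates, equal weights are the natural choice among linear combinations whose weights sum to one.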

See also the aggregated estimator (16) with ρ = 0, which has the same asymptotic property as the estimator (6). See Remark 1 below for additional discussion.

Theorem 1 gives the asymptotic normality of the proposed nonparametric estimators under the presence of a large number of nuisance parameters $\{α_{g}\}_{g = 1}^{N}$ . With the newly proposed technique, we do not have to impose any assumptions on α_g such as sparsity or smoothness. This kind of local linear estimator can be applied to most two-color microarray data, for instance, customized arrays and Agilent arrays.

2.2. Variance estimation with correlated replications

2.2.1. Aggregated Estimator

We now consider the case with correlated within-array replications. There is a lot of evidence that correlation among within-array replicated genes exists (Smyth et al., 2005; Fan and Niu, 2007). Suppose that within-array replications have a common correlation corr(Y_gi, Y_gj|X) = ρ when i ≠ j. Observations across different genes or arrays are independent. Then, the conditional variance of (Y_gi − Ȳ_g) can be expressed as

var [(Y_{g i} - {\bar{Y}}_{g}) ∣ X] = {(I - 1)}^{2} σ^{2} (X_{g i}) / I^{2} + \sum_{j \neq i} σ^{2} (X_{g j}) / I^{2} + 2 ρ \sum_{\begin{matrix} 1 \leq j < k \leq I, \\ j \neq i, k \neq i \end{matrix}} σ (X_{g j}) σ (X_{g k}) / I^{2} - 2 (I - 1) ρ \sum_{j \neq i} σ (X_{g i}) σ (X_{g j}) / I^{2} .

(8)

This is a complex system of nonlinear equations and an analytic solution cannot be found. Innovative ideas are needed.

Using the same notation as that in the previous section, it can be calculated that

E [Z_{g i} ∣ X] = σ^{2} (X_{g i}) - \frac{2}{I - 1} \sum_{j \neq i} ρ σ (X_{g i}) σ (X_{g j}) + \frac{2}{(I - 1) (I - 2)} \sum_{\begin{array}{l} 1 \leq j < k \leq I, \\ j \neq i, k \neq i \end{array}} ρ σ (X_{g j}) σ (X_{g k}) .

Taking the expectation with respect to X_gj for all j ≠ i, we obtain

E [Z_{g i} ∣ X_{g i} = x] = σ^{2} (x) - 2 ρ σ_{1} σ (x) + ρ σ_{1}^{2} ≜ η^{2} (x),

(9)

where σ₁ = E[σ(X)].
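The step from the conditional expectation given X to (9) is a counting argument. Assuming, as stated in Section 2.1, that the X_gi are i.i.d., each factor E[σ(X_gj)] with j ≠ i equals σ₁, and the two sums contribute:

```latex
\frac{2}{I - 1} \sum_{j \neq i} \rho\, \sigma(x)\, E[\sigma(X_{gj})]
  = \frac{2}{I - 1}\, (I - 1)\, \rho\, \sigma_{1}\, \sigma(x)
  = 2 \rho\, \sigma_{1}\, \sigma(x),
\qquad
\frac{2}{(I - 1)(I - 2)} \sum_{\substack{1 \le j < k \le I \\ j \neq i,\; k \neq i}}
  \rho\, E[\sigma(X_{gj})]\, E[\sigma(X_{gk})]
  = \frac{2}{(I - 1)(I - 2)} \binom{I - 1}{2} \rho\, \sigma_{1}^{2}
  = \rho\, \sigma_{1}^{2} .
```

The first sum has I − 1 terms and the second has (I − 1)(I − 2)/2 pairs, which cancels the normalizing constants exactly.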

Here, we can directly apply the local linear approach to all aggregated data $\{(X_{g i}, Z_{g i})\}_{i, g = 1}^{I, N}$ , due to the same regression function (9). Let ${\hat{η}}_{A}^{2} (\cdot)$ be the local linear estimator of η²(·), based on the aggregated data. Then,

{\hat{η}}_{A}^{2} (x) = \sum_{g = 1}^{N} \sum_{i = 1}^{I} W_{N} (\frac{X_{g i} - x}{h}) Z_{g i},

(10)

with

W_{N} (u) = h^{- 1} K (u) \frac{S_{N I, 2} - {u S}_{N I, 1}}{S_{N I, 0} S_{N I, 2} - S_{N I, 1}^{2}},

where $S_{N I, l} = \sum_{g = 1}^{N} \sum_{i = 1}^{I} K_{h} (X_{g i} - x) {[(X_{g i} - x) / h]}^{l}$ . There are two solutions to (9)

{\hat{σ}}_{A} {(x, ρ)}^{(1), (2)} = \hat{ρ} {\hat{σ}}_{1} \pm \sqrt{{\hat{ρ}}^{2} {\hat{σ}}_{1}^{2} - \hat{ρ} {\hat{σ}}_{1}^{2} + {\hat{η}}_{A}^{2} (x)},

(11)

Notice that given the sample X and Y, σ̂_A(x, ρ)^(1),(2) are continuous in both x and ρ. For ρ < 0, σ̂_A(x, ρ)⁽¹⁾ should be used since the standard deviation should be nonnegative. Since σ̂_A(x, ρ)⁽¹⁾ > σ̂_A(x, ρ)⁽²⁾ for every x and ρ, by the continuity of the solution in ρ, we can only use the same solution when ρ changes continuously. Then, σ̂_A(x, ρ)⁽¹⁾ should always be used regardless of ρ. From now on, we drop the superscript and denote:

{\hat{σ}}_{A} (x) = ρ σ_{1} + \sqrt{ρ^{2} σ_{1}^{2} - ρ σ_{1}^{2} + {\hat{η}}_{A}^{2} (x)} .

(12)

This is called the aggregated estimator. Note that in (12), ρ, σ₁ and σ(·) are all unknown.
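Equation (12) is the larger root of the quadratic obtained from (9). A minimal sketch, with hypothetical function names and example values, is given below; it inverts (9) for known ρ and σ₁ and checks the round trip.

```python
import math


def eta2_from_sigma(sigma, rho, sigma1):
    """Right-hand side of (9): sigma^2 - 2 rho sigma1 sigma + rho sigma1^2."""
    return sigma ** 2 - 2.0 * rho * sigma1 * sigma + rho * sigma1 ** 2


def sigma_from_eta2(eta2, rho, sigma1):
    """Larger root of s^2 - 2 rho sigma1 s + rho sigma1^2 - eta2 = 0, as in (12)."""
    disc = rho ** 2 * sigma1 ** 2 - rho * sigma1 ** 2 + eta2
    if disc < 0.0:
        raise ValueError("negative discriminant: no real solution")
    return rho * sigma1 + math.sqrt(disc)


# Round trip with illustrative values sigma(x) = 0.5, rho = 0.3, sigma1 = 0.4.
sigma_back = sigma_from_eta2(eta2_from_sigma(0.5, 0.3, 0.4), rho=0.3, sigma1=0.4)
```

The round trip recovers σ(x) whenever σ(x) ≥ ρσ₁, because the discriminant equals (σ(x) − ρσ₁)² when η² is computed from (9).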

2.2.2. Estimation of Correlation

To estimate ρ, we assume that there are J independent arrays (J ≥ 2). In other words, we observed data from (2) independently J times. In this case, the residual maximum likelihood (REML) estimator introduced by Smyth et al. (2005) is as follows:

{\hat{ρ}}_{0} = \frac{\sum_{g = 1}^{N} s_{B, g}^{2} - \sum_{g = 1}^{N} s_{W, g}^{2}}{\sum_{g = 1}^{N} s_{B, g}^{2} + (I - 1) \sum_{g = 1}^{N} s_{W, g}^{2}},

(13)

where $s_{B, g}^{2} = I {(J - 1)}^{- 1} \sum_{j = 1}^{J} {({\bar{Y}}_{g j} - {\bar{Y}}_{g})}^{2}$ , with ${\bar{Y}}_{g j} = I^{- 1} \sum_{i = 1}^{I} Y_{gij}$ and ${\bar{Y}}_{g} = J^{- 1} \sum_{j = 1}^{J} {\bar{Y}}_{g j}$ , is the between-arrays variance and $s_{W, g}^{2}$ is the within-array variance:

s_{W, g}^{2} = \frac{1}{J (I - 1)} \sum_{j = 1}^{J} \sum_{i = 1}^{I} {(Y_{gij} - {\bar{Y}}_{g j})}^{2} .
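The estimator (13) can be computed directly from the between-arrays and within-array variances. The sketch below assumes a hypothetical nested-list layout `y[g][j][i]` (gene g, array j, within-array replicate i); it illustrates the formula and is not the implementation of Smyth et al. (2005).

```python
def rho0_estimate(y):
    """rho_0 of (13) for data y[g][j][i]: gene g, array j, replicate i."""
    n_rep = len(y[0][0])
    sum_b = 0.0   # sum over genes of the between-arrays variance s_B^2
    sum_w = 0.0   # sum over genes of the within-array variance s_W^2
    for gene in y:
        n_arr = len(gene)
        array_means = [sum(arr) / n_rep for arr in gene]
        grand_mean = sum(array_means) / n_arr
        sum_b += n_rep * sum((m - grand_mean) ** 2
                             for m in array_means) / (n_arr - 1)
        sum_w += sum((v - m) ** 2
                     for arr, m in zip(gene, array_means)
                     for v in arr) / (n_arr * (n_rep - 1))
    return (sum_b - sum_w) / (sum_b + (n_rep - 1) * sum_w)


# Two extreme toy cases with one gene, J = 2 arrays and I = 3 replicates.
rho_identical_replicates = rho0_estimate([[[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]])
rho_identical_arrays = rho0_estimate([[[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]])
```

The two toy cases hit the boundaries of the estimator: replicates that agree perfectly within each array give s_W² = 0, while identical array means give s_B² = 0.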

As discussed in Smyth et al. (2005), the estimator ρ̂₀ of ρ is consistent when var(Y_gij|X) = σ_g² is the same for all i = 1, ···, I and j = 1, ···, J. However, this assumption is not valid under the model (2) and a correction is needed. We propose the following estimator:

\hat{ρ} = \frac{σ_{2}}{σ_{1}^{2}} \cdot \frac{\sum_{g = 1}^{N} s_{B, g}^{2} - \sum_{g = 1}^{N} s_{W, g}^{2}}{\sum_{g = 1}^{N} s_{B, g}^{2} + (I - 1) \sum_{g = 1}^{N} s_{W, g}^{2}} .

(14)

The consistency of ρ̂ is given by the following theorem.

Theorem 2

Under the regularity condition in the Appendix, the estimator ρ̂ of ρ is $\sqrt{N}$ -consistent:

\hat{ρ} - ρ = O_{P} (N^{- 1 / 2}) .

With a consistent estimator of ρ, σ₁, σ₂ and σ_A(·) can be solved by the following iterative algorithm:

Step 1. Set ${\hat{η}}_{A}^{2} (\cdot)$ as an initial estimate of $σ_{A}^{2} (\cdot)$ .
Step 2. With σ̂_A(·), compute
${\hat{σ}}_{1} = N^{- 1} \sum_{g = 1}^{N} {\hat{σ}}_{A} (X_{g i}), {\hat{σ}}_{2} = N^{- 1} \sum_{g = 1}^{N} {\hat{σ}}_{A}^{2} (X_{g i}), \hat{ρ} = {\hat{ρ}}_{0} {\hat{σ}}_{2} / {\hat{σ}}_{1}^{2} .$ (15)
Step 3. With σ̂₁, σ̂₂, and ρ̂, compute σ̂_A(·) using (12).
Step 4. Repeat Steps 2 and 3 until convergence.
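Steps 1 to 4 can be sketched as a short fixed-point loop. This is a minimal illustration under several assumptions that are not from the paper: the smoothed values η̂_A² are supplied at the observed intensities, averages in Step 2 are taken over those supplied points, negative discriminants are clipped at zero, and convergence is declared when ρ̂ changes by less than a hypothetical tolerance `tol`.

```python
import math


def solve_sigma(eta2_at_x, rho0, tol=1e-10, max_iter=200):
    """Iterate Steps 1-4 given values of eta_A^2 at the observed intensities."""
    # Step 1: start from sigma = sqrt(eta_A^2).
    sigma = [math.sqrt(max(e, 0.0)) for e in eta2_at_x]
    rho = rho0
    for _ in range(max_iter):
        # Step 2: update sigma1, sigma2 and the corrected correlation (15).
        sigma1 = sum(sigma) / len(sigma)
        sigma2 = sum(s * s for s in sigma) / len(sigma)
        rho_new = rho0 * sigma2 / sigma1 ** 2
        # Step 3: recompute sigma from (12); clipping is an added safeguard.
        disc = [rho_new ** 2 * sigma1 ** 2 - rho_new * sigma1 ** 2 + e
                for e in eta2_at_x]
        sigma = [rho_new * sigma1 + math.sqrt(max(d, 0.0)) for d in disc]
        # Step 4: stop once the correlation estimate has stabilized.
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return sigma, sigma1, sigma2, rho


# Noise-free toy input built from sigma = (0.4, 0.6) and rho = 0.2 via (9),
# so that sigma1 = 0.5, sigma2 = 0.26 and rho0 = rho * sigma1^2 / sigma2.
sigma_hat, sigma1_hat, sigma2_hat, rho_hat = solve_sigma(
    [0.13, 0.29], rho0=0.2 * 0.25 / 0.26)
```

By construction, σ = (0.4, 0.6) with ρ = 0.2 is a fixed point of the iteration for this toy input, so it gives a convenient check that the loop settles where it should.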

This provides simultaneously the estimators σ̂₁, σ̂₂, ρ̂, and σ̂_A(·). From our numerical experience, this algorithm converges quickly after a few iterations. When the algorithm converges, the estimator σ̂_A(x) is given by

{\hat{σ}}_{A} (x) = \hat{ρ} {\hat{σ}}_{1} + \sqrt{{\hat{ρ}}^{2} {\hat{σ}}_{1}^{2} - \hat{ρ} {\hat{σ}}_{1}^{2} + {\hat{η}}_{A}^{2} (x)} .

(16)

Note that the presence of multiple arrays is only used to estimate the correlation ρ for the replications. It is not needed for estimating the genewise variance function. In the case of the presence of J arrays, we can take the average of the J estimates from each array.

2.2.3. Asymptotic properties

Following a similar idea as the case without correlation, we can derive the asymptotic property of ${\hat{η}}_{A}^{2} (x)$ .

Theorem 3

Under the regularity conditions in the Appendix, for a fixed point x, we have:

{V^{*}}^{- 1 / 2} \left\{ {\hat{η}}_{A}^{2} (x) - [η^{2} (x) + β (x)] + o_{P} (h^{2}) \right\} \overset{D}{\to} N (0, 1),

provided that h → 0 and Nh → ∞, with $β (x) = \frac{h^{2}}{2} c_{K} (η^{2} (x))''$ and

V^{*} = \frac{1}{I} V_{1}^{'} + \frac{I - 1}{I} V_{2}^{'},

where

\begin{array}{l} V_{1}^{'} = \frac{d_{K}}{N h f_{X} (x)} \left\{ 2 σ^{4} (x) - 8 ρ σ_{1} σ^{3} (x) + C_{2} σ^{2} (x) + C_{3} σ (x) + C_{4} \right\}, \\ V_{2}^{'} = \frac{1}{N} \left\{ D_{0} σ^{4} (x) + D_{1} σ^{3} (x) + D_{2} σ^{2} (x) + D_{3} σ (x) + D_{4} \right\}, \end{array}

with coefficients C₂, ···, C₄, D₀, ···, D₄ defined in the Appendix.

The asymptotic normality of ${\hat{σ}}_{A}^{2} (x)$ can be derived from that of ${\hat{η}}_{A}^{2} (x)$ . More specifically, ${\hat{σ}}_{A}^{2} (x) = ϕ ({\hat{η}}_{A}^{2} (x))$ with $ϕ (z) = {(ρ σ_{1} + \sqrt{ρ^{2} σ_{1}^{2} - ρ σ_{1}^{2} + z})}^{2}$ . The derivative of ϕ(·) with respect to z is $ψ (z) = ρ σ_{1} / \sqrt{ρ^{2} σ_{1}^{2} - ρ σ_{1}^{2} + z} + 1$ . Then, by the delta method, we have

{V^{*}}^{- 1 / 2} ({\hat{σ}}_{A}^{2} (x) - ϕ (η^{2} (x) + β (x) + o_{P} (h^{2}))) \overset{D}{\to} N (0, ψ^{2} (η^{2} (x))) .

Remark 1

An alternative approach when correlation exists is to apply the same correlation correction idea to $\{(X_{g i}, Z_{g i})\}_{g = 1}^{N}$ for every replication i, resulting in the estimator ${\hat{σ}}_{i}^{2} (x)$ . In this case, it can be proved that the best linear combination of the estimators is

{\hat{σ}}^{2} (x) = [{\hat{σ}}_{1}^{2} (x) + {\hat{σ}}_{2}^{2} (x) + \dots + {\hat{σ}}_{I}^{2} (x)] / I,

(17)

This estimator has the same asymptotic performance as the aggregated estimator. However, we prefer the aggregated estimator due to the following reasons: The equation (16) only needs to be solved once by using the algorithm in §2.2.2, all data are treated symmetrically, and ${\hat{η}}_{A}^{2} (\cdot)$ can be estimated more stably.

2.2.4. Two replications

The aforementioned methods apply to the case when there are more than two replications. For the case I = 2, the equations for var[(Y_gi − Ȳ_g)|X] collapse into one. In this case, it can be shown using the same arguments as before that

var [(Y_{g i} - {\bar{Y}}_{g}) ∣ X_{g i} = x] = \frac{1}{4} σ^{2} (x) + \frac{1}{4} σ_{2} - \frac{1}{2} ρ σ_{1} σ (x), i = 1, 2

(18)

where σ₂ = E[σ²(X_gi)]. In this case, the left hand side is always equal to var[(Y_g₁ − Y_g₂)/2|X_gi = x].

Let η̂²(x) be the local linear estimator of the function on the right hand side by smoothing $\{{(Y_{g 1} - Y_{g 2})}^{2} / 4\}_{g = 1}^{N}$ on $\{X_{g 1}\}_{g = 1}^{N}$ and $\{X_{g 2}\}_{g = 1}^{N}$ . Then, the genewise variance is a solution to the following equation

\hat{σ} (x) = \hat{ρ} {\hat{σ}}_{1} + \sqrt{{\hat{ρ}}^{2} {\hat{σ}}_{1}^{2} - {\hat{σ}}_{2} + 4 {\hat{η}}^{2} (x)} .

(19)

The algorithm in §2.2.2 can be applied directly.
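For I = 2, equation (19) plays the role that (12) plays for I > 2. A minimal sketch with a hypothetical function name and illustrative parameter values:

```python
import math


def sigma_two_replications(eta2, rho, sigma1, sigma2):
    """Solve (18) for sigma(x): rho sigma1 + sqrt(rho^2 sigma1^2 - sigma2 + 4 eta2)."""
    disc = rho ** 2 * sigma1 ** 2 - sigma2 + 4.0 * eta2
    if disc < 0.0:
        raise ValueError("negative discriminant: no real solution")
    return rho * sigma1 + math.sqrt(disc)


# Illustrative values: sigma(x) = 0.5, rho = 0.3, sigma1 = 0.4, sigma2 = 0.2.
eta2_example = 0.25 * 0.5 ** 2 + 0.25 * 0.2 - 0.5 * 0.3 * 0.4 * 0.5   # from (18)
sigma_two = sigma_two_replications(eta2_example, rho=0.3, sigma1=0.4, sigma2=0.2)
```

Unlike (12), this formula involves σ₂ as well as σ₁, so both moments have to be updated in Step 2 of the iteration before (19) is re-evaluated.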

3. Simulations and comparisons

In this section, we conduct simulations to evaluate the finite sample performance of different variance estimators ξ̂²(x), η̂²(x) and ${\hat{σ}}_{A}^{2} (x)$ . First, the bias problem of the naive non-parametric variance estimator ξ̂²(x) is demonstrated. It is shown that this bias issue can be eliminated by our newly proposed methods. Then, we consider the estimators η̂²(x) and ${\hat{σ}}_{A}^{2} (x)$ under different configurations of the within-array replication correlation.

3.1. Simulation design

In all the simulation examples, we set the number of genes N = 2000, each gene having I = 3 within-array replications, and J = 4 independent arrays. For the purpose of investigating the genewise variance estimation, the data are generated from model (2). The details of simulation scheme are summarized as follows:

α_g: The expression levels of the first 250 genes are generated from the standard double exponential distribution. The rest are 0’s. These expression levels are the same over 4 arrays in each simulation, but may vary over simulations.
X: The intensity is generated from a mixture distribution: with probability 0.7 from the distribution .0004(x − 6)³I(6 < x < 16) and 0.3 from the uniform distribution over [6, 16].
ε: ε_gi is generated from the standard normal distribution.

σ²(·): The genewise variance function is taken as

σ^{2} (x) = .15 + .015 {(12 - x)}^{2} I {x < 12} .

The parameters are taken from Fan et al. (2005). The kernel function is selected as $\frac{70}{81} {(1 - {∣ x ∣}^{3})}^{3} I (∣ x ∣ \leq 1)$ . In addition, we fix the bandwidth h = 1 for all the numerical analysis.
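The intensity distribution and the variance function of this design are easy to reproduce. The sketch below is an illustration under stated assumptions rather than the authors' simulation code: the function names are hypothetical, the cubic density is sampled by inverting its CDF 0.0001(x − 6)⁴, and σ₂ = E[σ²(X)] is approximated with a midpoint rule.

```python
import random


def sample_intensity(rng):
    """Draw X: w.p. 0.7 from density 0.0004 (x - 6)^3 on (6, 16), else U[6, 16]."""
    if rng.random() < 0.7:
        # The CDF is 0.0001 (x - 6)^4, so inverse-CDF sampling gives 6 + 10 u^(1/4).
        return 6.0 + 10.0 * rng.random() ** 0.25
    return rng.uniform(6.0, 16.0)


def sigma2_true(x):
    """Genewise variance function 0.15 + 0.015 (12 - x)^2 I{x < 12}."""
    return 0.15 + 0.015 * (12.0 - x) ** 2 if x < 12.0 else 0.15


def intensity_density(x):
    """Mixture density of the intensity X on (6, 16)."""
    return 0.7 * 0.0004 * (x - 6.0) ** 3 + 0.3 / 10.0 if 6.0 < x < 16.0 else 0.0


def expected_sigma2(n=20000):
    """Midpoint-rule approximation of sigma_2 = E[sigma^2(X)] over [6, 16]."""
    step = 10.0 / n
    mids = (6.0 + (k + 0.5) * step for k in range(n))
    return sum(sigma2_true(x) * intensity_density(x) for x in mids) * step


rng = random.Random(0)
xs = [sample_intensity(rng) for _ in range(1000)]
sigma2_mean = expected_sigma2()
```

The numerical value of `sigma2_mean` agrees with the true σ₂ = 0.1857 quoted in Section 3.3 to the reported precision.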

For every setting, we repeat the whole simulation process T times and evaluate the estimates of σ²(·) over K = 101 grid points $\{x_{k}\}_{k = 1}^{K}$ on the interval [6, 16]. For the k-th grid point, we define

\begin{matrix} B_{k} = {\bar{σ}}^{2} (x_{k}) - σ^{2} (x_{k}) \text{ with } {\bar{σ}}^{2} (x_{k}) = T^{- 1} \sum_{t = 1}^{T} {\hat{σ}}_{t}^{2} (x_{k}), \\ S_{k} = T^{- 1} \sum_{t = 1}^{T} {[{\hat{σ}}_{t}^{2} (x_{k}) - {\bar{σ}}^{2} (x_{k})]}^{2}, \end{matrix}

and ${MSE}_{k} = B_{k}^{2} + S_{k}$ . Let f(·) be the density function of intensity X. Let

{Bias}^{2} = \sum_{k = 1}^{K} B_{k}^{2} f (x_{k}) / \sum_{k = 1}^{K} f (x_{k}), VAR = \sum_{k = 1}^{K} S_{k} f (x_{k}) / \sum_{k = 1}^{K} f (x_{k})

and

MISE = \sum_{k = 1}^{K} {MSE}_{k} f (x_{k}) / \sum_{k = 1}^{K} f (x_{k})

be the integrated squared bias (Bias²), the integrated variance (VAR), and the integrated mean squared error (MISE) of the estimate σ̂²(·), respectively. For the t-th simulation experiment, we define

{ISE}_{t} = \sum_{k = 1}^{K} {({\hat{σ}}_{t}^{2} (x_{k}) - σ^{2} (x_{k}))}^{2} f (x_{k}) / \sum_{k = 1}^{K} f (x_{k}),

to be the integrated squared error for the t-th simulation.
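The summary measures above translate directly into code. The following sketch uses a hypothetical layout in which `estimates[t][k]` holds σ̂_t²(x_k), and `truth` and `density` hold σ²(x_k) and f(x_k) on the same grid.

```python
def integrated_summaries(estimates, truth, density):
    """Return (Bias^2, VAR, MISE), weighting each grid point by f(x_k)."""
    n_sim = len(estimates)
    total_f = sum(density)
    bias2 = 0.0
    var = 0.0
    for k, (true_k, f_k) in enumerate(zip(truth, density)):
        column = [estimates[t][k] for t in range(n_sim)]
        mean_k = sum(column) / n_sim
        b_k = mean_k - true_k                                   # B_k
        s_k = sum((v - mean_k) ** 2 for v in column) / n_sim    # S_k
        bias2 += b_k ** 2 * f_k
        var += s_k * f_k
    bias2 /= total_f
    var /= total_f
    return bias2, var, bias2 + var


# Toy example: T = 2 simulations on K = 2 grid points with equal weights.
bias2, var, mise = integrated_summaries(
    [[1.0, 2.0], [3.0, 2.0]], truth=[2.0, 2.0], density=[1.0, 1.0])
```

Since MSE_k = B_k² + S_k at every grid point, MISE is simply the sum of the two weighted averages.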

3.2. The bias of naive nonparametric estimator

A naive approach is to regard α_g in (2) as a smooth function of X_gi, namely, α_g = α(X_gi). The function α(·) can be estimated by a local linear regression estimator, resulting in an estimated function α̂(·). The squared residuals $\{r_{g i}^{2}\}_{g = 1}^{N}$ are then further smoothed on $\{X_{g i}\}_{g = 1}^{N}$ to obtain an estimate ξ̂²(x) of the variance function σ²(·), where r_gi = Y_gi − α̂(X_gi) (Ruppert et al., 1997).

To provide a comprehensive view of the performances of the naive and the new estimators, we first compare the performances of ξ̂²(x) and η̂²(x) under the smoothness assumption of the gene effect α_g. Data from the naive nonparametric regression model is also generated with

α (x) = exp (- \frac{1}{1 - {(x - 13)}^{2}}) I {12 < x < 14} .

This allows us to understand the loss of efficiency when α_g is continuous in X_gi. This usually does not occur for microarray data, but can appear in other applications. Note that α(·) is zero in most of the region and thus is reasonably sparse. Here, the number of simulations is taken to be T = 100. The data is generated with the assumption that ρ = 0, in which case the variance estimators η̂²(x) and ${\hat{σ}}_{A}^{2} (x)$ have the same performance (See also Table 2 below). Thus, we only report the performance of η̂²(x).

Table 2.

Mean integrated squared bias (Bias²), mean integrated variance (VAR), mean integrated squared error (MISE) over 1000 simulations for different variance estimators η̂²(x) and ${\hat{σ}}_{O}^{2} (x)$ . Seven different correlation schemes are simulated: ρ = −0.4, ρ = −0.2, ρ = 0, ρ = 0.2, ρ = 0.4, ρ = 0.6, and ρ = 0.8. All quantities are multiplied by 1000.

	ρ =	−0.4	−0.2	0	0.2	0.4	0.6	0.8
Bias²	η̂²(x)	5.93	1.48	0.00	1.48	5.91	13.31	23.67
	σ̂_A²(x)	0.00	0.00	0.00	0.00	0.00	0.00	0.00
	σ̂_O²(x)	0.00	0.00	0.00	0.00	0.00	0.00	0.01
VAR	η̂²(x)	0.44	0.33	0.24	0.16	0.10	0.05	0.02
	σ̂_A²(x)	0.27	0.25	0.24	0.22	0.20	0.19	0.20
	σ̂_O²(x)	0.27	0.25	0.24	0.22	0.20	0.18	0.23
MISE	η̂²(x)	6.37	1.81	0.24	1.64	6.01	13.37	23.69
	σ̂_A²(x)	0.27	0.25	0.24	0.22	0.21	0.19	0.20
	σ̂_O²(x)	0.27	0.25	0.24	0.22	0.20	0.18	0.24

In Table 1, we report the mean integrated squared bias (Bias²), the mean integrated variance (VAR), and the mean integrated squared error (MISE) of ξ̂²(x) and η̂²(x) with and without the smoothness assumption on the gene effect α_g. From the left panel of Table 1, we can see that when the smoothness assumption is valid, the estimator ξ̂²(x) outperforms η̂²(x). The reason is that the mean function α(X_gi) depends on the replication and is not a constant. Therefore, model (2) fails and η̂²(x) is biased. One should compare the results with those on the second row of the right panel, where the model is right for η̂²(x). In this case, η̂²(x) performs much better. Its variance is about 3/2 as large as the variance of ξ̂²(x) in the case that the mean is generated from a smooth function α(X_gi). This is expected. In the latter case, to eliminate α_g, the degrees of freedom reduce from I = 3 to 2, whereas in the former case, α(X_gi) can be estimated without losing a degree of freedom, namely the number of replications is still 3. The ratio 3/2 is reflected in Table 1. However, when the smoothness assumption does not hold, there is serious bias in the estimator ξ̂²(x), even though α_g is still reasonably sparse. The bias is an order of magnitude larger than those in the other situations.

Table 1.

Mean integrated squared bias (Bias²), mean integrated variance (VAR), mean integrated squared error (MISE) over 100 simulations for variance estimators ξ̂²(x) and η̂²(x). Two different gene effect functions α(·) are implemented. All quantities are multiplied by 1000.

	Smooth Gene Effect			Non-Smooth Gene Effect
	Bias²	VAR	MISE	Bias²	VAR	MISE
ξ̂²(x)	0.01	0.14	0.15	16.00	1.47	17.47
η̂²(x)	0.57	0.24	0.80	0.00	0.22	0.23


To see how variance estimators behave, we plot typical estimators ξ̂²(x) and η̂²(x) with median ISE value among 100 simulations in Figure 1. The solid line is the true variance function while the dotted and dashed lines represent ξ̂²(x) and η̂²(x) respectively. On the left panel of Figure 1, we can see that estimator ξ̂²(x) outperforms the estimator η̂²(x) when the smoothness assumption is valid. The region where the biases occur has already been explained above. However, ξ̂²(x) will generate substantial bias when the nonparametric regression model does not hold, and at the same time, our nonparametric estimator η̂²(x) corrects the bias very well.

Fig 1. Variance estimators ξ̂²(x) and η̂²(x) with median performance when different gene effect functions α(·) are implemented. Left panel: smooth α(·) function. Right panel: non-smooth α(·) function.

3.3. Performance of new estimators

In this example, we consider the setting in §3.1 in which the smoothness assumption on the gene effect α_g is not valid. For comparison purposes only, we add an oracle estimator ${\hat{σ}}_{O}^{2} (x)$ in which we assume that σ₁, σ₂ and ρ are all known. We now evaluate the performance of the estimators η̂²(x), ${\hat{σ}}_{A}^{2} (x)$ , and ${\hat{σ}}_{O}^{2} (x)$ when the correlation between within-array replications varies. To be more specific, seven different correlation settings are considered: ρ = −0.4, −0.2, 0, 0.2, 0.4, 0.6, 0.8, with ρ = 0 representing across-array replications. In this case, we increase the number of simulations to T = 1000. Again, we report Bias², VAR and MISE of the three estimators for each correlation setting in Table 2. When ρ = 0, all three estimators give the same bias and variance. This is consistent with our theory. We can see clearly from the table that, when ρ ≠ 0, the estimator ${\hat{σ}}_{A}^{2} (x)$ produces much smaller biases than η̂²(x). In fact, when |ρ| is as small as 0.2, the bias of η̂²(x) already dominates the variance.

It is worth noticing that the performances of ${\hat{σ}}_{O}^{2} (x)$ and ${\hat{σ}}_{A}^{2} (x)$ are almost always the same, which indicates that our algorithm for estimating ρ, σ₁ and σ₂ is very accurate. To see this more clearly, the squared bias, variance, and MSE of the estimators ρ̂, σ̂₁ and σ̂₂ in ${\hat{σ}}_{A}^{2} (x)$ under the seven correlation settings are reported in Table 3. Here the true values of σ₁ and σ₂ are 0.4217 and 0.1857, respectively. For example, when ρ = 0.8, the bias of ρ̂ is less than 0.002 for ${\hat{σ}}_{A}^{2} (x)$ , which is acceptable because the convergence threshold in the algorithm is set to be 0.001.
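As a quick arithmetic check of the bound quoted above, one can take the square root of the last Bias² entry for ρ̂ in Table 3 (3.90, reported in units of 10⁻⁶). This reads that entry as belonging to the ρ = 0.8 column, which is an assumption about the table layout.

```latex
|\mathrm{bias}(\hat{ρ})| = \sqrt{3.90 \times 10^{-6}} \approx 1.97 \times 10^{-3} < 2 \times 10^{-3} .
```

The resulting value is of the same order as the convergence threshold 0.001, which is why it is regarded as acceptable.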

Table 3.

Squared bias, variance and MSE of ρ̂, σ̂₁ and σ̂₂ in the estimate ${\hat{σ}}_{A}^{2} (x)$ . All quantities are multiplied by 10⁶.

		${\hat{σ}}_{A}^{2} (x)$
	ρ =	−0.4	−0.2	0	0.2	0.4	0.6	0.8
ρ̂	Bias²	0.07	0.04	0.01	0.00	0.00	0.00	3.90
	VAR	7.90	16.91	28.65	36.17	35.68	27.21	20.44
	MSE	7.97	16.95	28.66	36.17	35.68	27.21	24.35

σ̂₁	Bias²	0.24	0.23	0.19	0.14	0.11	0.05	2.47
	VAR	11.65	11.52	11.79	12.46	13.64	15.55	18.66
	MSE	11.89	11.75	11.99	12.60	13.75	15.59	21.12

σ̂₂	Bias²	0.14	0.14	0.12	0.09	0.08	0.05	0.67
	VAR	10.34	10.17	10.45	11.12	12.24	13.96	16.16
	MSE	10.47	10.31	10.57	11.20	12.32	14.00	16.83


In Figure 2, we render the estimates η̂²(x) and ${\hat{σ}}_{A}^{2} (x)$ with the median ISE under four different correlation settings: ρ = −0.4, ρ = 0, ρ = 0.6 and ρ = 0.8. We omit the other correlation schemes since they all have similar performance. The solid lines represent the true variance function. The dotted lines and dashed lines are for η̂²(x) and ${\hat{σ}}_{A}^{2} (x)$ respectively. For the case ρ = 0, the two estimators are indistinguishable. When ρ < 0, η̂²(x) overestimates the genewise variance function, whereas when ρ > 0, it underestimates the genewise variance function.

Fig 2 — Median performance of variance estimators η̂²(x), σ̂²(x), and ${\hat{σ}}_{A}^{2} (x)$ when ρ =−0.4, 0, 0.6, and 0.8.

4. Application to human total RNA samples using Agilent arrays

Our real data example comes from the MicroArray Quality Control (MAQC) project (Patterson et al., 2006). The main purpose of the original paper is to compare the reproducibility, sensitivity and specificity of microarray measurements across different platforms (i.e., one-color and two-color) and testing sites. The MAQC project uses two RNA samples, Stratagene Universal Human Reference total RNA and Ambion Human Brain Reference total RNA. The two RNA samples have been assayed on three kinds of arrays: Agilent, CapitalBio and TeleChem. The data were collected at five sites. Our study focuses only on the Agilent arrays. At each of the three Agilent sites, 10 two-color Agilent microarrays are assayed with 5 of them dye swapped, totalling 30 microarrays.

4.1. Validation Test

In the first application, we revisit the validation test considered in Fan and Niu (2007). For the purpose of the validation tests, we use the gProcessedSignal and rProcessedSignal values from the Agilent Feature Extraction software as input. We follow the preprocessing scheme described in Patterson et al. (2006) and obtain 22144 genes from a total of 41675 non-control genes. Among those, 19 genes, each with 10 replications, are used for the validation tests. Under the null hypothesis of no experimental biases, a reasonable model is

Y_{g i} = α_{g} + ε_{g i}, ε_{g i} \sim N (0, σ_{g}^{2}), i = 1, \dots, I, g = 1, \dots, G .

(20)

We use the notation G to denote the number of genes that have I replications. For our data, G = 19 and I = 10. Note that G can be different from N, the total number of different genes. The validation test statistics in Fan and Niu (2007) include weighted statistics

T_{1} = \sum_{g = 1}^{G} {\sum_{i = 1}^{I} {(Y_{g i} - {\bar{Y}}_{g})}^{2} / σ_{g}^{2}}, T_{2} = \sum_{g = 1}^{G} {\sum_{i = 1}^{I} ∣ Y_{g i} - {\bar{Y}}_{g} ∣ / σ_{g}},

and unweighted test statistics

\begin{array}{l} T_{3} = \left\{ \sum_{g = 1}^{G} \sum_{i = 1}^{I} {(Y_{g i} - {\bar{Y}}_{g})}^{2} - (I - 1) \sum_{g = 1}^{G} σ_{g}^{2} \right\} \left\{ 2 (I - 1) \sum_{g = 1}^{G} σ_{g}^{4} \right\}^{- 1 / 2}, \\ T_{4} = \left\{ \sum_{g = 1}^{G} \sum_{i = 1}^{I} ∣ Y_{g i} - {\bar{Y}}_{g} ∣ - λ_{I} \sum_{g = 1}^{G} σ_{g} \right\} / \left\{ κ_{I} {(\sum_{g = 1}^{G} σ_{g}^{2})}^{1 / 2} \right\}, \end{array}

where $λ_{I} = \sqrt{2 I (I - 1) / π}$ and $κ_{I}^{2} = var (\sum_{i = 1}^{I} ∣ ε_{g i} - {\bar{ε}}_{g} ∣ / σ_{g})$ . Under the null hypothesis, the test statistic T₁ is χ² distributed with (I − 1)G degrees of freedom, and T₂, T₃ and T₄ are all asymptotically normally distributed. As a result, the corresponding p-values can be easily computed.
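To make the unweighted statistics concrete, the following Python sketch computes T₃ and T₄ for a G × I table of replicated log-ratios with the genewise variances treated as known. The function names, the Monte Carlo evaluation of κ_I and the use of upper-tail normal p-values are assumptions made for illustration; the text above does not specify how κ_I is evaluated or which tail is used.

```python
import math
import random
from statistics import NormalDist


def lambda_I(I):
    """Closed form lambda_I = sqrt(2 I (I - 1) / pi)."""
    return math.sqrt(2 * I * (I - 1) / math.pi)


def kappa_I(I, n_sim=20000, seed=0):
    """Monte Carlo approximation (an assumed approach) of kappa_I, the
    standard deviation of sum_i |eps_i - eps_bar| for standard normal eps."""
    rng = random.Random(seed)
    sums = []
    for _ in range(n_sim):
        eps = [rng.gauss(0.0, 1.0) for _ in range(I)]
        mean = sum(eps) / I
        sums.append(sum(abs(e - mean) for e in eps))
    m = sum(sums) / n_sim
    return math.sqrt(sum((s - m) ** 2 for s in sums) / (n_sim - 1))


def validation_tests(Y, sigma2, kappa):
    """Return ((T3, T4), (p3, p4)) with upper-tail normal p-values.

    Y:      G rows, each holding the I replicated log-ratios of one gene
    sigma2: G genewise variances, treated as known
    kappa:  value of kappa_I for this I
    """
    I = len(Y[0])
    lam = lambda_I(I)
    ss = 0.0  # sum of squared deviations from the gene means
    sa = 0.0  # sum of absolute deviations from the gene means
    for row in Y:
        mean = sum(row) / I
        ss += sum((y - mean) ** 2 for y in row)
        sa += sum(abs(y - mean) for y in row)
    t3 = (ss - (I - 1) * sum(sigma2)) / math.sqrt(
        2 * (I - 1) * sum(s ** 2 for s in sigma2))
    t4 = (sa - lam * sum(math.sqrt(s) for s in sigma2)) / (
        kappa * math.sqrt(sum(sigma2)))
    upper = lambda t: 1.0 - NormalDist().cdf(t)
    return (t3, t4), (upper(t3), upper(t4))


# Null data with G = 19 genes and I = 10 replications; the common
# variance 0.05 is a made-up value, not an estimate from the MAQC data.
rng = random.Random(1)
G, I = 19, 10
sigma2 = [0.05] * G
Y = [[rng.gauss(0.0, math.sqrt(s)) for _ in range(I)] for s in sigma2]
kappa = kappa_I(I)
stats, pvals = validation_tests(Y, sigma2, kappa)
```

In practice σ_g² would be replaced by an estimate, so the reference distributions are only approximate.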

Here, we apply the same statistics T₁, T₂, T₃ and T₄ but we replace the pooled sample variance estimator by the aggregated local linear estimator

{\hat{σ}}_{g}^{2} = \sum_{i = 1}^{I} {\hat{σ}}_{A}^{2} (X_{g i}) \hat{f} (X_{g i}) / \sum_{i = 1}^{I} \hat{f} (X_{g i}),

where f̂ is the estimated density function of X_gi. The difference between the new variance estimator and the simple pooled variance estimator is that we consider the genewise variance as a nonparametric function of the intensity level. The latter estimator may drag small variances of certain arrays to much higher levels by averaging, resulting in a larger estimated genewise variance and smaller test statistics or bigger p-values.
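The aggregated estimator is a density-weighted average of the fitted variance function over the I intensity levels of a gene. A minimal Python sketch follows; `var_fn` and `density_fn` stand in for ${\hat{σ}}_{A}^{2}$ and f̂, and both toy functions in the usage lines are hypothetical.

```python
def aggregated_variance(x_reps, var_fn, density_fn):
    """Density-weighted average of var_fn over one gene's replicate intensities."""
    weights = [density_fn(x) for x in x_reps]
    total = sum(weights)
    if total <= 0:
        raise ValueError("estimated density vanishes at all replicate intensities")
    return sum(w * var_fn(x) for w, x in zip(weights, x_reps)) / total


# Toy usage: a linear variance function and a flat density (both made up).
toy_var = lambda x: 0.1 + 0.02 * x
toy_density = lambda x: 1.0
sigma2_g = aggregated_variance([8.0, 10.0, 12.0], toy_var, toy_density)
```

With a flat density the weights are equal and the estimator reduces to the plain average of the fitted variances at the replicate intensities.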

In the analysis here, we first consider all thirty arrays. The estimated correlation among replicated genes is ρ̂ = 0.69. The p-values based on the newly estimated genewise variance are reported in Table 4. As explained in Fan and Niu (2007), T₄ is the most stable test among the four. It turns out that none of the arrays needs further normalization, which agrees with Fan and Niu (2007). Furthermore, we separate the analysis into two groups: the first group uses the 15 arrays without dye swap, with estimated correlation ρ̂ = 0.66, and the second group uses the 15 arrays with dye swap, with estimated correlation ρ̂ = 0.34. The p-values are summarized in Table 5. The results show that arrays AGL-2-D3 and AGL-2-D5 need further normalization at the 5% significance level (for AGL-2-D5, only the p-value of T₄ falls below 5%). The difference arises from the smaller estimated ρ for the dye-swap arrays, since the p-values are sensitive to the genewise variance. We also carried out the analysis after separating the data into 6 groups, by dye swap (with and without) and by the three experimental sites. Due to the small sample size, the six estimates of ρ range from 0.08 to 0.74, and we again find that array AGL-2-D3 needs further normalization.

Table 4.

Comparison of p-values for T₁, ···, T₄ for MAQC project data considering all 30 arrays together

	p-values

slide name	T₁	T₂	T₃	T₄
AGL-1-C1	1.0000	1.0000	1.0000	1.0000
AGL-1-C2	1.0000	1.0000	1.0000	1.0000
AGL-1-C3	1.0000	1.0000	1.0000	1.0000
AGL-1-C4	1.0000	1.0000	1.0000	1.0000
AGL-1-C5	1.0000	1.0000	1.0000	1.0000
AGL-1-D1	1.0000	1.0000	1.0000	1.0000
AGL-1-D2	1.0000	1.0000	1.0000	1.0000
AGL-1-D3	1.0000	1.0000	1.0000	1.0000
AGL-1-D4	1.0000	1.0000	1.0000	1.0000
AGL-1-D5	1.0000	1.0000	1.0000	1.0000

AGL-2-C1	1.0000	1.0000	1.0000	1.0000
AGL-2-C2	1.0000	1.0000	1.0000	1.0000
AGL-2-C3	1.0000	1.0000	1.0000	1.0000
AGL-2-C4	1.0000	1.0000	1.0000	1.0000
AGL-2-C5	1.0000	1.0000	1.0000	1.0000
AGL-2-D1	1.0000	0.9999	0.9996	0.9999
AGL-2-D2	0.8387	0.9011	0.8953	0.9182
AGL-2-D3	0.3525	0.1824	0.3902	0.1905
AGL-2-D4	1.0000	1.0000	1.0000	1.0000
AGL-2-D5	0.8820	0.8070	0.8848	0.7952

AGL-3-C1	1.0000	1.0000	1.0000	1.0000
AGL-3-C2	1.0000	1.0000	1.0000	1.0000
AGL-3-C3	1.0000	1.0000	1.0000	1.0000
AGL-3-C4	1.0000	1.0000	1.0000	1.0000
AGL-3-C5	1.0000	1.0000	1.0000	1.0000
AGL-3-D1	1.0000	1.0000	1.0000	1.0000
AGL-3-D2	1.0000	1.0000	1.0000	1.0000
AGL-3-D3	1.0000	1.0000	1.0000	1.0000
AGL-3-D4	1.0000	1.0000	1.0000	1.0000
AGL-3-D5	1.0000	1.0000	1.0000	1.0000


Table 5.

Comparison of p-values for T₁, ···, T₄ for MAQC project data considering the arrays with and without dye-swap separately

	p-values

slide name	T₁	T₂	T₃	T₄
AGL-1-C1	1.0000	1.0000	1.0000	1.0000
AGL-1-C2	1.0000	1.0000	1.0000	1.0000
AGL-1-C3	1.0000	1.0000	0.9999	1.0000
AGL-1-C4	1.0000	1.0000	1.0000	1.0000
AGL-1-C5	1.0000	1.0000	0.9999	1.0000
AGL-1-D1	1.0000	1.0000	1.0000	1.0000
AGL-1-D2	1.0000	1.0000	1.0000	1.0000
AGL-1-D3	1.0000	1.0000	1.0000	1.0000
AGL-1-D4	1.0000	1.0000	1.0000	1.0000
AGL-1-D5	1.0000	1.0000	1.0000	1.0000

AGL-2-C1	1.0000	1.0000	0.9943	1.0000
AGL-2-C2	1.0000	1.0000	1.0000	1.0000
AGL-2-C3	1.0000	1.0000	1.0000	1.0000
AGL-2-C4	0.0152	0.9493	0.3931	0.9136
AGL-2-C5	1.0000	1.0000	0.8060	1.0000
AGL-2-D1	0.7806	0.8074	0.6622	0.6584
AGL-2-D2	0.2170	0.2984	0.1651	0.2217
AGL-2-D3	0.0002	0.0000	0.0001	0.0000
AGL-2-D4	1.0000	1.0000	1.0000	1.0000
AGL-2-D5	0.1236	0.0662	0.0669	0.0300

AGL-3-C1	1.0000	1.0000	0.9996	1.0000
AGL-3-C2	1.0000	1.0000	0.9988	1.0000
AGL-3-C3	1.0000	1.0000	0.9977	1.0000
AGL-3-C4	1.0000	1.0000	1.0000	1.0000
AGL-3-C5	1.0000	1.0000	0.9999	1.0000
AGL-3-D1	1.0000	1.0000	1.0000	1.0000
AGL-3-D2	1.0000	1.0000	1.0000	1.0000
AGL-3-D3	1.0000	1.0000	1.0000	1.0000
AGL-3-D4	1.0000	1.0000	1.0000	1.0000
AGL-3-D5	1.0000	1.0000	1.0000	1.0000


4.2. Gene selection

To detect the differentially expressed genes, we follow the filtering instructions and obtain 19,802 genes out of 41,000 unique non-control genes as in Patterson et al. (2006), i.e., I = 1. The dye-swap results were averaged before carrying out the one-sample t-test. Thus at each site, we have five microarrays.

For each site, significant genes are selected based on these 5 dye-swap averaged arrays. For all N = 19,802 genes, there are no within-array replications. However, model (2) is still reasonable, with i now indexing the array. Hence, the “within-array correlation” becomes a “between-array correlation” and is reasonably assumed to be ρ = 0.

In our nonparametric estimation for the variance function, all the 19,802 genes are used to estimate the variance function, which gives us enough reason to believe that the estimator ${\hat{σ}}_{A}^{2} (x)$ is close to the inherent true variance function σ²(x).

We applied both the t-test and the z-test to each gene to test whether the logarithm of the expression ratio is zero, using the five arrays collected at each location. The numbers of differentially expressed genes detected by the two tests under three fold changes (FC) and four significance levels are given in Table 6. Large numbers of genes are identified as differentially expressed, which is expected when comparing a brain sample and a tissue pool sample (Patterson et al., 2006). We can see clearly that the z-test associated with our new variance estimator ${\hat{σ}}_{A}^{2} (x)$ leads to more differentially expressed genes. For example, at site 1, using α = 0.001, among the genes with fold change at least 2, the t-test picks 8231 genes whereas the z-test selects 8875 genes. This gives an empirical power increase of (8875–8231)/19802 ≈ 3.25% in the group with observed fold change at least 2.

Table 6.

Comparison of the number of significantly differentially expressed genes

		p < 0.05		p < 0.01		p < 0.005		p < 0.001
		t-test	z-test	t-test	z-test	t-test	z-test	t-test	z-test
Agilent 1	FC>1.5	12692	12802	12464	12752	12313	12722	11744	12646
	FC>2	8802	8879	8654	8872	8556	8869	8231	8858
	FC>4	3493	3493	3431	3493	3376	3493	3231	3493

Agilent 2	FC>1.5	12282	12678	11217	12587	10502	12536	8270	12421
	FC>2	8644	8877	7908	8875	7452	8861	6125	8828
	FC>4	3600	3649	3188	3649	2964	3649	2422	3649

Agilent 3	FC>1.5	12502	12692	11994	12576	11694	12519	10788	12374
	FC>2	8689	8832	8344	8810	8150	8800	7591	8762
	FC>4	3585	3603	3378	3602	3278	3602	2985	3600


To verify the accuracy of our variance estimation in the z-test, we compare the empirical power increase with the expected theoretical power increase. The expected theoretical power increase is computed as

ave {P_{z} (μ_{g} / σ_{g}) - P_{t_{n - 1}} (μ_{g} / σ_{g})},

(21)

taking the average of power increases across all μ_g ≠ 0. However, since μ_g is unknown, we replace it by its estimate, which is the sample average of the n = 5 observed log-expression ratios. Table 7 reports the results at the three sites, in which the columns “Theo” refer to the expected theoretical power increase defined by (21), with μ_g replaced by Ȳ_g and σ_g replaced by its estimate from the genewise variance function, and the columns “Emp” refer to the empirical power increase.
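The quantity in (21) can be sketched numerically as follows. The code assumes two-sided tests with z statistic √n Ȳ_g/σ_g and a one-sample t statistic with n − 1 degrees of freedom, and it evaluates the t-test power by integrating the normal rejection probability over the chi-square distribution of the sample variance. These choices, the function names and the example effect sizes are assumptions for illustration, not the authors' implementation.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def simpson(f, a, b, steps=2000):
    """Composite Simpson rule with an even number of steps."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3


def t_cdf(t, df):
    """Student t CDF, integrating the density from 0 to |t|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    area = simpson(lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2), 0.0, abs(t))
    return 0.5 + area if t >= 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on t_cdf."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return (lo + hi) / 2


def power_z(delta, n, alpha):
    """Two-sided z-test power at standardized effect delta = mu / sigma."""
    zc = NormalDist().inv_cdf(1 - alpha / 2)
    nc = math.sqrt(n) * delta
    return PHI(-zc - nc) + 1 - PHI(zc - nc)


def power_t(delta, n, alpha):
    """Two-sided one-sample t-test power (requires n >= 3 in this sketch)."""
    df = n - 1
    tc = t_quantile(1 - alpha / 2, df)
    nc = math.sqrt(n) * delta
    norm = 2 ** (df / 2) * math.gamma(df / 2)

    def integrand(v):
        # Given V = v ~ chi-square(df), reject when |Z + nc| > tc * sqrt(v / df).
        s = math.sqrt(v / df)
        dens = v ** (df / 2 - 1) * math.exp(-v / 2) / norm
        return (PHI(-tc * s - nc) + 1 - PHI(tc * s - nc)) * dens

    return simpson(integrand, 0.0, df + 80.0)


def expected_power_increase(deltas, n, alpha):
    """Average of P_z - P_t over the supplied standardized effects, as in (21)."""
    gains = [power_z(d, n, alpha) - power_t(d, n, alpha) for d in deltas]
    return sum(gains) / len(gains)


# Hypothetical standardized effects mu_g / sigma_g for three genes.
gain = expected_power_increase([0.5, 1.0, 2.0], n=5, alpha=0.05)
```

For thousands of genes one would compute the t critical value once and reuse it, since the sketch recomputes it on every call.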

Table 7.

Comparison of Expected Theoretical and Empirical Power Difference (In percentage)

	α = 0.05		α = 0.01		α = 0.005		α = 0.001
	Theo	Emp	Theo	Emp	Theo	Emp	Theo	Emp
Agilent 1	2.52	0.61	6.08	3.75	8.06	5.59	13.66	11.74
Agilent 2	4.03	7.56	10.11	17.61	13.61	22.86	23.75	37.63
Agilent 3	3.02	2.56	7.14	7.39	9.42	10.19	15.94	18.18

Average	3.19	3.58	7.78	9.58	10.36	12.88	17.79	22.51


There are two things worth noticing here. First, for the expected theoretical power increase, we use the sample mean Ȳ_g = μ_g + ε̄_g instead of the real gene effect μ_g, which is not observable, so it inevitably involves the error term ε̄_g. Second, the power functions P_z(μ) and P_t(μ) depend sensitively on μ and the tails of the assumed distribution. Despite these caveats, the expected theoretical and empirical power increases are of comparable magnitude and the averages are very close. This provides good evidence that our genewise variance estimation has an acceptable accuracy.

We also apply the SIMEX and permutation SIMEX methods of Wang and Carroll (2008) to the MAQC data, to illustrate their utility. As mentioned in the introduction, their model is not really intended for the analysis of two-color microarray data. If we use only the information on the log-ratios (Y), the model is very hard to interpret. In addition, one might question why the information on X (the observed intensity levels) is not used at all. Nevertheless, we apply the SIMEX methods of Wang and Carroll (2008) to only the log-ratios Y in the two-color data and produce tables analogous to Tables 6 and 7.

From the results, we make the following observations. First, all the numbers for the z-test in Tables 8 and 9 are approximately the same at all significance levels. In fact, the p-values are so small that the counts barely change across significance levels. This indicates that both the SIMEX and permutation SIMEX methods tend to produce very small genewise variance estimates, making the test statistics large throughout. In contrast, our method gives more moderate estimates of the genewise variance, so the counts differ across significance levels. Second, in the implementation, we found that the SIMEX and permutation SIMEX methods are computationally expensive (more than one hour), while our method takes only a few minutes. Third, from Tables 10 and 11 we can see that the expected theoretical power increases and the empirical ones are reasonably close, which is in line with the results for our method.

Table 8.

Comparison of the number of significantly differentially expressed genes using SIMEX method

		p < 0.05		p < 0.01		p < 0.005		p < 0.001
		t-test	z-test	t-test	z-test	t-test	z-test	t-test	z-test
Agilent 1	FC>1.5	12692	12820	12464	12820	12313	12820	11744	12820
	FC>2	8802	8879	8654	8879	8556	8879	8231	8879
	FC>4	3493	3493	3431	3493	3376	3493	3231	3493

Agilent 2	FC>1.5	12282	12721	11217	12721	10502	12721	8270	12721
	FC>2	8644	8878	7908	8878	7452	8878	6125	8878
	FC>4	3600	3649	3188	3649	2964	3649	2422	3649

Agilent 3	FC>1.5	12502	12760	11994	12760	11694	12760	10788	12760
	FC>2	8689	8836	8344	8836	8150	8836	7591	8836
	FC>4	3585	3603	3378	3603	3278	3603	2985	3603


Table 9.

Comparison of the number of significantly differentially expressed genes using permutation SIMEX

		p < 0.05		p < 0.01		p < 0.005		p < 0.001
		t-test	z-test	t-test	z-test	t-test	z-test	t-test	z-test
Agilent 1	FC>1.5	12692	12820	12464	12820	12313	12820	11744	12820
	FC>2	8802	8879	8654	8879	8556	8879	8231	8879
	FC>4	3493	3493	3431	3493	3376	3493	3231	3493

Agilent 2	FC>1.5	12282	12721	11217	12721	10502	12721	8270	12721
	FC>2	8644	8878	7908	8878	7452	8878	6125	8878
	FC>4	3600	3649	3188	3649	2964	3649	2422	3649

Agilent 3	FC>1.5	12502	12760	11994	12760	11694	12760	10788	12760
	FC>2	8689	8836	8344	8836	8150	8836	7591	8836
	FC>4	3585	3603	3378	3603	3278	3603	2985	3603


Table 10.

Comparison of Expected Theoretical and Empirical Power Difference using SIMEX method (In percentage)

	α = 0.05		α = 0.01		α = 0.005		α = 0.001
	Theo	Emp	Theo	Emp	Theo	Emp	Theo	Emp
Agilent 1	2.43	2.06	7.17	5.42	10.30	7.34	20.71	13.44
Agilent 2	7.16	3.41	19.20	12.06	26.17	16.90	43.46	30.42
Agilent 3	4.18	2.88	11.71	7.38	16.45	9.89	31.38	17.57

Average	4.59	2.78	12.69	8.29	17.64	11.38	31.85	20.48


Table 11.

Comparison of Expected Theoretical and Empirical Power Difference using permutation SIMEX method (In percentage)

	α = 0.05		α = 0.01		α = 0.005		α = 0.001
	Theo	Emp	Theo	Emp	Theo	Emp	Theo	Emp
Agilent 1	1.89	2.86	5.66	6.43	8.19	8.59	16.75	15.07
Agilent 2	4.84	7.37	13.44	17.22	18.97	22.50	36.90	37.26
Agilent 3	2.89	4.91	8.34	10.13	11.87	13.11	23.44	21.31

Average	3.20	5.05	9.15	11.26	13.01	14.74	25.70	24.55


5. Summary

The estimation of the genewise variance function is motivated by the downstream analysis of microarray data, such as validation tests and the selection of statistically differentially expressed genes. The methodology proposed here is novel in its use of across-array and within-array replications. It does not require any specific assumptions on α_g such as sparsity or smoothness, and hence reduces the bias of the conventional nonparametric estimators. Although the number of nuisance parameters is proportional to the sample size, we can estimate the quantity of main interest (the variance function) consistently. By substantially increasing the degrees of freedom, both the validation tests and the z-test using our variance estimators are more powerful in identifying arrays that need to be normalized further and more capable of selecting differentially expressed genes.

Our proposed methodology has a wide range of applications. In addition to microarray data analysis with within-array replications, it can also be applied to the case without within-array replications, as long as model (2) is reasonable. Our two-way nonparametric model is a natural extension of the Neyman-Scott problem. Therefore, it is applicable to all the problems where the Neyman-Scott problem is applicable.

There are possible extensions. For example, the SIMEX idea can be applied to our model in order to take the measurement error into account. We can also adapt our methods when we have a prior correlation structure among replications other than the identical correlation assumption.

Acknowledgments

The authors thank the Editor, the Associate Editor and two referees, whose comments have greatly improved the scope and presentation of the paper.

APPENDIX A: APPENDIX

The following regularity conditions are imposed for the technical proofs:

The regression function σ²(x) has a bounded and continuous second derivative.
The kernel function K is a bounded symmetric density function with a bounded support.
h → 0, Nh → ∞.
E[σ⁸(X)] exists and the marginal density f_X(·) is continuous.

We need the following conditional variance-covariance matrix of the random vector Z_g in our asymptotic study.

Lemma 1

Let Ω be the variance-covariance matrix of Z_g conditional on all data X. Then the diagonal and off-diagonal elements are, respectively:

Ω_{i i} = 2 σ^{4} (X_{g i}) + \frac{2}{{(I - 1)}^{2} {(I - 2)}^{2}} \sum_{k \neq l} σ^{2} (X_{g k}) σ^{2} (X_{g l}) + \frac{4 (I - 3)}{(I - 1) {(I - 2)}^{2}} σ^{2} (X_{g i}) \sum_{j \neq i} σ^{2} (X_{g j}), i = 1, \dots, I,

(22)

Ω_{i j} = \frac{4}{{(I - 1)}^{2}} σ^{2} (X_{g i}) σ^{2} (X_{g j}) + \frac{2}{{(I - 1)}^{2} {(I - 2)}^{2}} \sum_{\begin{matrix} k \neq l \\ k, l \neq i, j \end{matrix}} σ^{2} (X_{g k}) σ^{2} (X_{g l}) - \frac{4}{{(I - 1)}^{2} (I - 2)} \sum_{k \neq i, j} σ^{2} (X_{g k}) (σ^{2} (X_{g i}) + σ^{2} (X_{g j})), i, j = 1, \dots, I, i \neq j .

(23)

Proof of Lemma 1

Let A be the variance-covariance matrix of r_g conditioning on all data X. By direct computation, the diagonal elements are given by

\begin{array}{l} A_{i i} = var [{(Y_{g i} - {\bar{Y}}_{g})}^{2} ∣ X] \\ = \frac{2 {(I - 1)}^{4}}{I^{4}} σ^{4} (X_{g i}) + \frac{4 {(I - 1)}^{2}}{I^{4}} \sum_{k \neq i} σ^{2} (X_{g i}) σ^{2} (X_{g k}) + \frac{2}{I^{4}} \sum_{k \neq i} σ^{4} (X_{g k}) + \frac{4}{I^{4}} \sum_{l, k \neq i, l < k} σ^{2} (X_{g l}) σ^{2} (X_{g k}), i = 1, \dots, I, \end{array}

(24)

and the off-diagonal elements are given by

\begin{array}{l} A_{i j} = cov {[{(Y_{g i} - {\bar{Y}}_{g})}^{2}, {(Y_{g j} - {\bar{Y}}_{g})}^{2}] ∣ X} \\ = \frac{2 {(I - 1)}^{2}}{I^{4}} [σ^{4} (X_{g i}) + σ^{4} (X_{g j})] + \frac{4 {(I - 1)}^{2}}{I^{4}} σ^{2} (X_{g i}) σ^{2} (X_{g j}) - \frac{4 (I - 1)}{I^{4}} \sum_{k \neq i, j} σ^{2} (X_{g k}) (σ^{2} (X_{g i}) + σ^{2} (X_{g j})) + \frac{4}{I^{4}} \sum_{k, l \neq i, j; l < k} σ^{2} (X_{g l}) σ^{2} (X_{g k}) + \frac{2}{I^{4}} \sum_{k \neq i, j} σ^{4} (X_{g k}) . \end{array}

(25)

Using Ω = BAB^T, we can obtain the result by direct computation.
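Formulas (24) and (25) can be checked numerically. If, conditionally on X, the errors within a gene are independent mean-zero normal variables (an assumption made for this check), then Y_gi − Ȳ_g and Y_gj − Ȳ_g are jointly normal and the covariance of their squares is twice their squared covariance. The Python sketch below compares that direct computation with the expanded right-hand sides; all function names and the toy variances are hypothetical.

```python
def centered_coeffs(i, I):
    """Coefficients c_k with Y_gi - Ybar_g = sum_k c_k * eps_gk."""
    return [(I - 1) / I if k == i else -1 / I for k in range(I)]


def direct(i, j, s2):
    """2 * cov^2: covariance of squares of jointly normal mean-zero variables."""
    I = len(s2)
    ci, cj = centered_coeffs(i, I), centered_coeffs(j, I)
    cov = sum(a * b * v for a, b, v in zip(ci, cj, s2))
    return 2 * cov ** 2


def pair_sum(values):
    """Sum of products over unordered pairs l < k."""
    n = len(values)
    return sum(values[a] * values[b] for a in range(n) for b in range(a + 1, n))


def a_ii(i, s2):
    """Right-hand side of (24); s2[k] plays the role of sigma^2(X_gk)."""
    I = len(s2)
    others = [s2[k] for k in range(I) if k != i]
    return (2 * (I - 1) ** 4 * s2[i] ** 2
            + 4 * (I - 1) ** 2 * s2[i] * sum(others)
            + 2 * sum(v ** 2 for v in others)
            + 4 * pair_sum(others)) / I ** 4


def a_ij(i, j, s2):
    """Right-hand side of (25)."""
    I = len(s2)
    others = [s2[k] for k in range(I) if k not in (i, j)]
    return (2 * (I - 1) ** 2 * (s2[i] ** 2 + s2[j] ** 2)
            + 4 * (I - 1) ** 2 * s2[i] * s2[j]
            - 4 * (I - 1) * sum(others) * (s2[i] + s2[j])
            + 4 * pair_sum(others)
            + 2 * sum(v ** 2 for v in others)) / I ** 4


# Made-up variances for one gene with I = 4 replications.
s2 = [0.5, 1.0, 2.0, 0.3]
gap_diag = abs(a_ii(0, s2) - direct(0, 0, s2))
gap_off = abs(a_ij(0, 2, s2) - direct(0, 2, s2))
```

Both gaps vanish up to floating-point rounding, which is consistent with (24) and (25) being expansions of 2 var² and 2 cov² respectively.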

The proofs for Theorems 1 and 3 follow a similar idea. Since Theorem 1 doesn’t involve a lot of coefficients, we will show the proof of Theorem 1 and explain the difference in the proof of Theorem 3.

Proof of Theorem 1

First of all, the bias of $η_{i}^{2} (x)$ comes from the local linear approximation. Since ${(X_{g i}, Z_{g i})}_{g = 1}^{N}$ is an i.i.d. sequence, by (4) and the result in Fan and Gijbels (1996), it follows that

E {η_{i}^{2} (x) ∣ X} = σ^{2} (x) + b (x) + o_{P} (h^{2}) .

Similarly, the asymptotic variance of $η_{i}^{2} (x)$ also follows from Fan and Gijbels (1996).
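For readers less familiar with the local linear smoother behind the weights W_N,i used in this proof, the sketch below computes the standard equivalent-kernel weights at a point x and applies them to responses. The Epanechnikov kernel, which satisfies the kernel condition above, and the function names are illustrative assumptions; the text does not state which kernel was used.

```python
def epanechnikov(u):
    """A bounded symmetric density with bounded support."""
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0


def local_linear_weights(xs, x0, h, kernel=epanechnikov):
    """Equivalent-kernel weights of the local linear fit at x0."""
    k = [kernel((x - x0) / h) / h for x in xs]
    s0 = sum(k)
    s1 = sum(ki * (x - x0) for ki, x in zip(k, xs))
    s2 = sum(ki * (x - x0) ** 2 for ki, x in zip(k, xs))
    denom = s0 * s2 - s1 * s1
    if denom <= 0:
        raise ValueError("too few design points inside the bandwidth window")
    return [ki * (s2 - (x - x0) * s1) / denom for ki, x in zip(k, xs)]


def local_linear_fit(xs, zs, x0, h):
    """Weighted sum of the responses zs, i.e. the fitted value at x0."""
    weights = local_linear_weights(xs, x0, h)
    return sum(w * z for w, z in zip(weights, zs))


# Toy check on a made-up design: a linear response is reproduced exactly.
xs = [k / 10 for k in range(11)]
zs = [2 + 3 * x for x in xs]
fit = local_linear_fit(xs, zs, x0=0.37, h=0.3)
```

The weights sum to one and annihilate the linear term (X − x), which is why the leading bias of the fit involves only the second derivative of the regression function.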

We now derive the off-diagonal elements of the matrix var[η|X]:

cov [({\hat{η}}_{i}^{2} (x), {\hat{η}}_{j}^{2} (x)) ∣ X] = V_{2} + o_{P} (1 / N) .

(26)

Recalling that ${\hat{η}}_{i}^{2} (x) = \sum_{g = 1}^{N} W_{N, i} ((X_{g i} - x) / h) Z_{g i}$ , we have

cov [({\hat{η}}_{i}^{2} (x), {\hat{η}}_{j}^{2} (x)) ∣ X] = \sum_{g = 1}^{N} W_{N, i} (\frac{X_{g i} - x}{h}) W_{N, j} (\frac{X_{g j} - x}{h}) cov [(Z_{g i}, Z_{g j}) ∣ X] .

The equality follows from the fact that cov[(Z_gi, Z_g′j)|X] = 0 when g ≠ g′. Recall Ω_ij = cov[(Z_gi, Z_gj)|X] and define R_N,g = N · W_N,j((X_gj − x)/h)Ω_ij. Thus

N \cdot cov [({\hat{η}}_{i}^{2} (x), {\hat{η}}_{j}^{2} (x)) ∣ X] = \sum_{g = 1}^{N} W_{N, i} (\frac{X_{g i} - x}{h}) R_{N, g} .

(27)

The right hand side of (27) can be seen as a local linear smoother of the synthetic data ${(X_{g i}, R_{N, g})}_{g = 1}^{N}$ . Although R_N,g involves N at first glance, its conditional expectation E[R_N,g|X_gi = x] and conditional variance var[R_N,g|X_gi = x] do not grow with N. Since ${(X_{g i}, R_{N, g})}_{g = 1}^{N}$ is an i.i.d. sequence, by the results in Fan and Gijbels (1996), we obtain

N \cdot cov [({\hat{η}}_{i}^{2} (x), {\hat{η}}_{j}^{2} (x)) ∣ X] = E [R_{N, g} ∣ X_{g i} = x] + o_{P} (1) .

To calculate E[R_N,g|X_gi = x], we apply the approximation W_N,i(u) = K(u)(1 + o_P (1))/(Nhf_X(x)) in the example of Fan and Gijbels (1996, p. 64) and have the following arguments

\begin{array}{l} E [R_{N, g} ∣ X_{g i} = x] = E [N \cdot \frac{1}{N h f_{X} (x)} h K_{h} (X_{g j} - x) Ω_{i j} ∣ X_{g i} = x] (1 + o_{P} (1)) \\ = {(f_{X} (x))}^{- 1} \int K (u) Ω_{i j} ∣_{X_{g i} = x} (x + h u, s) f_{X} (x + h u) \, du \, ds + o_{P} (1) \\ = N V_{2} + o_{P} (1), \end{array}

where s represents all the integrating variables corresponding to X_g₁, ···, X_gI except X_gi and X_gj. That justifies (26).

To prove the multivariate asymptotic normality

Σ^{- 1 / 2} (η - (σ^{2} (x) + b (x) + o_{P} (h^{2})) e) \overset{D}{\to} N (0, I_{I}),

(28)

we employ the Cramér-Wold device: for any unit vector a = (a₁, ···, a_I)^T in ℝ^I,

F^{*} ≜ {(a^{T} Σ a)}^{- 1 / 2} \left\{ \sum_{i = 1}^{I} a_{i} \sum_{g = 1}^{N} W_{N, i} (\frac{X_{g i} - x}{h}) (Z_{g i} - σ^{2} (X_{g i})) \right\} \overset{D}{\to} N (0, 1) .

Denote Q_g,i = W_N,i((X_gi − x)/h)(Z_gi − σ²(X_gi)) and ${\tilde{Q}}_{g} = \sum_{i = 1}^{I} a_{i} Q_{g, i}$ . Note that the sequence ${\{{\tilde{Q}}_{g}\}}_{g = 1}^{N}$ is i.i.d. To show the asymptotic normality of F*, it is sufficient to check Lyapunov’s condition:

lim_{N \to \infty} \frac{\sum_{g = 1}^{N} E [{∣ {\tilde{Q}}_{g} ∣}^{4} ∣ X]}{{(\sum_{g = 1}^{N} E [{∣ {\tilde{Q}}_{g} ∣}^{2} ∣ X])}^{2}} = 0.

To facilitate the presentation, we first note that sequences ${Q_{g, i}}_{g = 1}^{N}$ are i.i.d. and satisfy Lyapunov’s condition for each fixed i. Denote $δ_{N, i}^{2} = \sum_{g = 1}^{N} E [{∣ Q_{g, i} ∣}^{2} ∣ X]$ . And recall that $δ_{N, i}^{2} = var [{\hat{η}}_{i}^{2} (x) ∣ X] = O_{P} ({(N h)}^{- 1})$ . Let c* be a generic constant which may vary from one line to another. We have the following approximation

\begin{array}{l} \sum_{g = 1}^{N} E [{∣ Q_{g, i} ∣}^{4} ∣ X] = c^{*} N^{- 3} E {K_{h}^{4} (X_{g i} - x) [{(Z_{g i} - σ^{2} (X_{g i}))}^{4} ∣ X]} (1 + o_{P} (1)) \\ = O_{P} ({(N h)}^{- 3}) . \end{array}

Therefore $\sum_{g = 1}^{N} E [{∣ Q_{g, i} ∣}^{4} ∣ X] = o (δ_{N, i}^{4})$ . By the marginal Lyapunov conditions, we have the following inequality

\sum_{g = 1}^{N} E [{\tilde{Q}}_{g}^{4} ∣ X] \leq c^{*} \sum_{i = 1}^{I} \sum_{g = 1}^{N} E [{∣ Q_{g, i} ∣}^{4} ∣ X] = c^{*} I \cdot o_{P} ({(N h)}^{- 2}) = o_{P} ({(N h)}^{- 2}) .

For the denominator, we have the following arguments

\begin{array}{l} \sum_{g = 1}^{N} E [{∣ {\tilde{Q}}_{g} ∣}^{2} ∣ X] = \sum_{i} a_{i}^{2} \sum_{g = 1}^{N} E [Q_{g, i}^{2} ∣ X] + \sum_{i \neq j} a_{i} a_{j} \sum_{g = 1}^{N} E [Q_{g, i} Q_{g, j} ∣ X] \\ = \sum_{i} a_{i}^{2} var [{\hat{η}}_{i}^{2} (x) ∣ X] + \sum_{i \neq j} a_{i} a_{j} cov [({\hat{η}}_{i}^{2} (x), {\hat{η}}_{j}^{2} (x)) ∣ X] \\ \overset{*}{=} O_{P} ({(N h)}^{- 1}) + O_{P} (N^{- 1}) \\ = O_{P} ({(N h)}^{- 1}) . \end{array}

Note that the second to last equality holds by the asymptotic conditional variance-covariance matrix Σ. Therefore Lyapunov’s condition is justified. That completes the proof.

Proof of Theorem 2

First of all, for each given g,

E s_{B, g}^{2} = I var ({\bar{Y}}_{g j}) = σ_{2} + ρ (I - 1) σ_{1}^{2} .

Note that by (8), we have

\begin{array}{l} E {(Y_{gij} - {\bar{Y}}_{g j})}^{2} = I^{- 2} [I (I - 1) σ_{2} + ρ (I - 1) (I - 2) σ_{1}^{2} - 2 {(I - 1)}^{2} ρ σ_{1}^{2}] \\ = I^{- 1} (I - 1) (σ_{2} - ρ σ_{1}^{2}) . \end{array}

Thus, for all g, we have

E s_{W, g}^{2} = σ_{2} - ρ σ_{1}^{2} .

Since { $s_{B, g}^{2}$ } and { $s_{W, g}^{2}$ } are i.i.d. sequences across the N genes, by the central limit theorem, we have

\begin{array}{l} \frac{1}{N} \sum_{g = 1}^{N} s_{B, g}^{2} = σ_{2} + ρ (I - 1) σ_{1}^{2} + O_{P} (N^{- 1 / 2}), \\ \frac{1}{N} \sum_{g = 1}^{N} s_{W, g}^{2} = σ_{2} - ρ σ_{1}^{2} + O_{P} (N^{- 1 / 2}) . \end{array}

Therefore,

\begin{array}{l} {\hat{ρ}}_{0} = \frac{σ_{2} + ρ (I - 1) σ_{1}^{2} - σ_{2} + ρ σ_{1}^{2} + O_{P} (N^{- 1 / 2})}{σ_{2} + ρ (I - 1) σ_{1}^{2} + (I - 1) (σ_{2} - ρ σ_{1}^{2}) + O_{P} (N^{- 1 / 2})} \\ = ρ σ_{1}^{2} / σ_{2} + O_{P} (N^{- 1 / 2}) . \end{array}
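The last equality can be verified by simplifying the leading terms of the numerator and the denominator separately. The display below spells out this step using only the quantities already introduced, and assumes σ₂ > 0 so that the ratio is well defined.

```latex
\begin{array}{l}
σ_{2} + ρ (I - 1) σ_{1}^{2} - σ_{2} + ρ σ_{1}^{2} = I ρ σ_{1}^{2}, \\
σ_{2} + ρ (I - 1) σ_{1}^{2} + (I - 1) (σ_{2} - ρ σ_{1}^{2}) = I σ_{2}, \\
\frac{I ρ σ_{1}^{2} + O_{P} (N^{- 1 / 2})}{I σ_{2} + O_{P} (N^{- 1 / 2})} = ρ σ_{1}^{2} / σ_{2} + O_{P} (N^{- 1 / 2}) .
\end{array}
```

As an illustration, with the simulation values σ₁ = 0.4217 and σ₂ = 0.1857 quoted in Section 3.3 and ρ = 0.8, the limit ρσ₁²/σ₂ is approximately 0.766 rather than 0.8.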

Proof of Theorem 3

Note that

var [{\hat{η}}_{A}^{2} (x) ∣ X] = \sum_{g = 1}^{N} \sum_{i = 1}^{I} W_{N}^{2} (\frac{X_{g i} - x}{h}) var [Z_{g i} ∣ X] + \sum_{g = 1}^{N} \sum_{i \neq j}^{I} W_{N} (\frac{X_{g i} - x}{h}) W_{N} (\frac{X_{g j} - x}{h}) cov [(Z_{g i}, Z_{g j}) ∣ X] .

Following similar steps in the proof of Theorem 1, one can verify $var [{\hat{η}}_{A}^{2} (x) ∣ X] = V_{1}^{'} / I + (1 - 1 / I) V_{2}^{'} + o_{P} ({(N h)}^{- 1})$ , where the coefficients C₂, ···, C₄, D₀, ···, D₄ are as follows:

\begin{array}{l} C_{2} = \frac{4 (1 + ρ^{2}) σ_{2} + [4 ρ (I - 2) + 4 ρ^{2} (2 I - 3)] σ_{1}^{2}}{I - 1}, \\ C_{3} = - \frac{8 ρ^{2} (I - 3) σ_{1}^{3} + 8 (ρ^{2} + ρ) σ_{1} σ_{2}}{I - 1}, \\ C_{4} = \frac{2}{(I - 1) (I - 2)} {(1 + ρ^{2}) σ_{2}^{2} + 2 (ρ^{2} + ρ) (I - 3) σ_{1}^{2} σ_{2} + (I - 3) (I - 4) ρ^{2} σ_{1}^{4}}, \\ D_{0} = 2 (ρ^{2} - \frac{4 ρ}{I - 1} + \frac{2 (1 + ρ^{2})}{{(I - 1)}^{2}}), \\ D_{1} = \frac{8}{{(I - 1)}^{2}} {(2 I - 4) ρ - (I^{2} - 4 I + 5) ρ^{2}} σ_{1}, \\ D_{2} = \frac{4}{{(I - 1)}^{2} (I - 2)} {{(I - 3)}^{2} ρ^{2} + ({(I - 2)}^{2} + 1) ρ - 2 (I - 2)} σ_{2} + \frac{4 (I - 3)}{{(I - 1)}^{2} (I - 2)} {(3 (I - 2) (I - 3) + 2) ρ^{2} - 2 (I - 2) ρ} σ_{1}^{2}, \\ D_{3} = - \frac{8 {(I - 3)}^{2}}{{(I - 1)}^{2} (I - 2)} {(ρ^{2} + ρ) σ_{1} σ_{2} + (I - 4) ρ^{2} σ_{1}^{3}}, \\ D_{4} = \frac{4}{{(I - 1)}^{2} {(I - 2)}^{2}} {(1 + ρ^{2}) (\begin{matrix} I - 2 \\ 2 \end{matrix}) σ_{2}^{2} + 6 (ρ^{2} + ρ) (\begin{matrix} I - 2 \\ 3 \end{matrix}) σ_{1}^{2} σ_{2} + 12 ρ^{2} (\begin{matrix} I - 2 \\ 4 \end{matrix}) σ_{1}^{4}} . \end{array}

Footnotes

The financial support from NSF grant DMS-0714554 and NIH grant R01-GM072611 is gratefully acknowledged.

Contributor Information

Jianqing Fan, Email: jqfan@princeton.edu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yang Feng, Email: yangfeng@princeton.edu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yue S. Niu, Email: yueniu@math.arizona.edu, Department of Mathematics, University of Arizona, 617 N. Santa Rita Ave., P.O. Box 210089, Tucson, AZ 85721-0089.

References

Carroll RJ, Wang Y. Nonparametric variance estimation in the analysis of microarray data: a measurement error approach. Biometrika. 2008 doi: 10.1093/biomet/asn017. to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cui X, Hwang JT, Qiu J. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75. doi: 10.1093/biostatistics/kxh018. [DOI] [PubMed] [Google Scholar]
Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman & Hall; 1996. [Google Scholar]
Fan J, Niu Y. Selection and validation of normalization methods for c-DNA microarrays using within-array replications. Bioinformatics. 2007;23:2391–2398. doi: 10.1093/bioinformatics/btm361. [DOI] [PubMed] [Google Scholar]
Fan J, Peng H, Huang T. Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion) Journal of the American Statistical Association. 2005;100:781–813. [Google Scholar]
Fan J, Ren Y. Statistical analysis of DNA microarray data in cancer research. Clinical Cancer Research. 2007;12:4469–4473. doi: 10.1158/1078-0432.CCR-06-1033. [DOI] [PubMed] [Google Scholar]
Fan J, Tam P, Vande Woude G, Ren Y. Normalization and analysis of cDNA micro-arrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proceedings of the National Academy of Sciences, USA. 2004;101(5):1135–1140. doi: 10.1073/pnas.0307557100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huang J, Wang D, Zhang C. A two-way semi-linear model for normalization and significant analysis of cDNA microarray data. Journal of the American Statistical Association. 2005;100:814–829. [Google Scholar]
Kamb A, Ramaswami A. A simple method for statistical analysis of intensity differences in microarray-derived gene expression data. BMC Biotechnology. 2001;1:8. doi: 10.1186/1472-6750-1-8.
Neyman J, Scott E. Consistent estimates based on partially consistent observations. Econometrica. 1948;16:1–32.
Patterson T, et al. Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project. Nature Biotechnology. 2006;24:1140–1150. doi: 10.1038/nbt1242.
Ruppert D, Wand MP, Holst U, Hössjer O. Local polynomial variance function estimation. Technometrics. 1997;39:262–273.
Smyth G, Michaud J, Scott H. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005;21:2067–2075. doi: 10.1093/bioinformatics/bti270.
Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
Tong T, Wang Y. Optimal Shrinkage Estimation of Variances with Applications to Microarray Data Analysis. Journal of the American Statistical Association. 2007;102:113–122.
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
Wang Y, Ma Y, Carroll RJ. Variance Estimation in the Analysis of Microarray Data. Journal of the Royal Statistical Society, Series B. 2008, to appear. doi: 10.1111/j.1467-9868.2008.00690.x.
