Author manuscript; available in PMC: 2009 Aug 10.
Published in final edited form as: J Stat Plan Inference. 2008 Feb 1;138(2):387–404. doi: 10.1016/j.jspi.2007.06.007

A new class of mixture models for differential gene expression in DNA microarray data

Ming-Hui Chen a, Joseph G Ibrahim b,*, Yueh-Yun Chi c
PMCID: PMC2724022  NIHMSID: NIHMS128674  PMID: 19672331

Abstract

One of the fundamental issues in analyzing microarray data is to determine which genes are expressed and which ones are not for a given group of subjects. In datasets where many genes are expressed and many are not expressed (i.e., underexpressed), a bimodal distribution for the gene expression levels often results, where one mode of the distribution represents the expressed genes and the other mode represents the underexpressed genes. To model this bimodality, we propose a new class of mixture models that utilize a random threshold value for accommodating bimodality in the gene expression distribution. Theoretical properties of the proposed model are carefully examined. We use this new model to examine the problem of differential gene expression between two groups of subjects, develop prior distributions, and derive a new criterion for determining which genes are differentially expressed between the two groups. Prior elicitation is carried out using empirical Bayes methodology in order to estimate the threshold value as well as elicit the hyperparameters for the two component mixture model. The new gene selection criterion is demonstrated via several simulations to have excellent false positive rate and false negative rate properties. A gastric cancer dataset is used to motivate and illustrate the proposed methodology.

Keywords: Bayesian inference, Empirical Bayes, Gene selection criteria, Prior elicitation, Random threshold mixture model, Simulation study

1. Introduction

Statistical methodologies for differential gene expression have developed rapidly in recent years within both frequentist and Bayesian frameworks. Recent frequentist methods include Tusher et al. (2001), Storey and Tibshirani (2003), Kerr et al. (2000), Dudoit et al. (2002), Lonnstedt and Speed (2002), Olshen and Jain (2002), Chen et al. (1997), and Lee et al. (2002). Bayesian approaches include Efron et al. (2001), Baldi and Long (2001), Ibrahim et al. (2002), Parmigiani et al. (2002), Newton et al. (2001, 2004), Newton and Kendziorski (2003), West (2003), Ishwaran and Rao (2003, 2005), Tadesse et al. (2003), Mueller et al. (2004), Liu et al. (2004), Do et al. (2005), and Hein et al. (2005). An excellent review article on statistical methods in genomics is Sebastiani et al. (2003). A recent edited book on the analysis of microarray data is Parmigiani et al. (2003). We refer the reader to the book and the review article for more detailed discussions of the various methodologies and for additional references.

One of the fundamental issues in analyzing microarray data is trying to determine which genes are expressed and which ones are not, and in particular, determining a threshold value for which any expression level above the threshold value will be deemed as expressed and any expression value below the threshold value would be deemed not expressed, hence underexpressed. In datasets where many genes are expressed and many are not expressed, a bimodal distribution for the gene expression levels often results. Two component mixture distributions can be quite useful for this type of modeling problem. A related problem, which can be viewed as a generalization of the bimodal problem, is to also model genes that are under expressed, expressed, and over expressed, leading to a three component mixture. Bayesian model-based methods for DNA microarray analysis are now becoming quite popular since complex models can be fit in a relatively straightforward fashion and Bayesian hierarchical models can be especially useful for this type of problem.

To motivate the proposed modeling, we consider a cDNA dataset in gastric cancer published in Chen et al. (2003). Gastric cancer, which is a form of stomach cancer, is the second most common cause of cancer death worldwide (Parkin et al., 1999). The dataset contains 90 tumor samples and 22 normal samples. A total of 6688 genes were available for analysis. The goal for these data is to determine which genes are differentially expressed between the two groups. An exploratory analysis of these data shows that for each group, the distribution of gene expression appears to be bimodal. For example, Fig. 1 shows nonparametric density estimates for nine selected genes in the tumor sample group. The horizontal axis in Fig. 1 corresponds to the logarithm of the red to green channel, log(R/G), which is the measure of gene expression. The vertical axis corresponds to the density value. Each plot represents a gene, and the nonparametric density estimate for each gene is based on the n1 = 90 tumor samples. We see from Fig. 1 the apparent bimodality in the gene expression distribution. This bimodality may be due to the fact that certain genes are expressed for certain subjects and not expressed for other subjects, thereby creating the bimodality.

Fig. 1. Densities for nine selected genes.

To model this bimodality, a threshold value can be defined such that all gene expression levels above this threshold value are deemed expressed and all expression values below the threshold are classified as underexpressed. One of the big challenges in this approach is how to determine the threshold value and whether the threshold value should be treated as fixed or random. Towards these goals, we develop a new class of mixture models that utilize a random threshold value for accommodating bimodality in the gene expression distribution, such as that encountered in Fig. 1. The model is then used to determine which genes are differentially expressed between the two groups (tumor vs. normal). The random threshold value can be viewed as a latent variable in the modeling process, in which a novel distribution is specified for it. Then the joint posterior distribution of the parameters and the threshold value is used for inference. Specification of this threshold mixture model has several advantages over a standard two component mixture model, as discussed in Sections 2 and 3. One of its greatest advantages is that it leads to an identifiable two component mixture model and it facilitates a straightforward prior elicitation scheme via empirical Bayes methodology. Prior elicitation using empirical Bayes methods (see Ibrahim et al., 2002; Efron et al., 2001) is now widely recognized as critical in parametric modeling of DNA microarray data since such models are highly parameterized and conventional prior elicitation strategies using noninformative or improper priors lead to either weakly or nonidentifiable models as well as computational instability. These issues are elaborated upon in Sections 2 and 3. The proposed methodology, in some sense, generalizes previous work by Ibrahim et al. 
(2002) in that (i) the threshold parameter is assumed unknown and random, and a distribution is specified for it using a multiple imputation technique, (ii) a probability model is posited for the underexpressed genes as well as the expressed genes, and (iii) we allow general classes of distributions for the gene expression data, in which the log-normal, Box–Cox transformations of a normal random variable, and several others are special cases. We mention that other approaches for dealing with low expression levels include left censoring the gene expression data at the truncation value as in Tadesse et al. (2003). However, such methods assume that the threshold value at which to censor is known, and thus are not as general as the methodology considered here.

In addition to the new threshold mixture model, prior distributions for the model parameters are proposed as well as a new criterion for determining which genes are differentially expressed. This new criterion is demonstrated through several simulations to have excellent false positive rate and false negative rate properties. The proposed methodology is also compared to a fully frequentist procedure called PERMAX developed by Mutter et al. (2001), the significance analysis of microarrays (SAM) method proposed by Tusher et al. (2001), and the parametric empirical Bayes methods for microarray data (EBarrays) proposed by Kendziorski et al. (2003).

The rest of this article is organized as follows. In Section 2, we present a new mixture model for modeling expressed and underexpressed genes using a single threshold value. In Section 3, we introduce a class of hierarchical priors for the random threshold parameter as well as the other parameters. Since the threshold value is random, we propose an algorithm for eliciting prior distributions via a standard multiple imputation technique. In Section 4, a new criterion for determining which genes are differentially expressed is developed. In Section 5, we present extensive simulation results illustrating the operating characteristics of the proposed methodology. In Section 6, we illustrate the methodology on the gastric cancer dataset and carry out prediction analysis of microarrays (PAM), which is a procedure for cross-validation proposed by Tibshirani et al. (2002). We conclude the article with a brief discussion in Section 7.

2. New threshold mixture model

The proposed model is constructed as follows. Let $j = 1, 2$ index the tissue type (normal vs. tumor) and let $y_{jgi}$ denote the gene expression random variable for the $j$th tissue type, the $g$th gene, and the $i$th individual, $i = 1, 2, \ldots, n_{jg}$ and $g = 1, \ldots, G$. Let $p_{jg}$ denote the probability that the $g$th gene is not expressed for tissue type $j$. Here, we do not assume that the raw gene expression levels $y_{jgi}$ follow a particular distribution, but rather assume that $h(y_{jgi})$ is a known differentiable transformation of $y_{jgi}$ chosen to achieve normality. For example, $h(\cdot)$ could be the Box–Cox class of transformations in which

$$
h(x) = \begin{cases} \dfrac{x^{\gamma} - 1}{\gamma}, & \gamma \neq 0, \\[1ex] \log(x), & \gamma = 0. \end{cases}
$$

h(.) can also represent other classes of parametric transformations that achieve normality. Consider the model,

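$$
p(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2) = p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) + (1 - p_{jg})\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2), \tag{2.1}
$$

where

$$
p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) = (2\pi)^{-1/2}\, |h'(y_{jgi})|\, \tau_{jg}^{-1} \exp\left\{ -\frac{1}{2\tau_{jg}^2} \bigl(h(y_{jgi}) - \alpha_{jg}\bigr)^2 \right\}
$$

and

$$
p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2) = (2\pi)^{-1/2}\, |h'(y_{jgi})|\, \sigma_{jg}^{-1} \exp\left\{ -\frac{1}{2\sigma_{jg}^2} \bigl(h(y_{jgi}) - \mu_{jg}\bigr)^2 \right\}.
$$

Note that in (2.1), $p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)$ and $p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2)$ are the distributions of $y_{jgi}$ for the underexpressed and expressed genes, respectively.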
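To make the pieces of (2.1) concrete, the following minimal Python sketch evaluates the Box–Cox transformation, its derivative (the Jacobian factor in the component densities), and the two-component mixture density. The function names and the default `gamma=0.0` (the log-normal case) are illustrative assumptions, not part of the original model specification.

```python
import math

def box_cox(x, gamma):
    """Box-Cox transform h(x): (x**gamma - 1)/gamma if gamma != 0, log(x) if gamma == 0."""
    return math.log(x) if gamma == 0 else (x ** gamma - 1.0) / gamma

def box_cox_deriv(x, gamma):
    """Derivative h'(x) = x**(gamma - 1); for gamma == 0 this is 1/x."""
    return x ** (gamma - 1.0)

def component_density(y, loc, var, gamma=0.0):
    """Density of y > 0 when h(y) is N(loc, var), including the Jacobian |h'(y)|."""
    z = box_cox(y, gamma) - loc
    normal_part = math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)
    return abs(box_cox_deriv(y, gamma)) * normal_part

def mixture_density(y, p, alpha, tau2, mu, sigma2, gamma=0.0):
    """Two-component mixture (2.1): weight p on the underexpressed component."""
    under = component_density(y, alpha, tau2, gamma)
    expressed = component_density(y, mu, sigma2, gamma)
    return p * under + (1.0 - p) * expressed
```

With `gamma=0.0` each component is a log-normal density, which is the case used later in the simulation study.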

One of the potential disadvantages with (2.1) is that it is virtually impossible to develop data-based prior specifications for $p_{jg}$, $\alpha_{jg}$, $\tau_{jg}^2$, $\mu_{jg}$, and $\sigma_{jg}^2$ since one does not know in advance which group each gene expression level belongs to. In order to solve this dilemma, we can construct a cut-off value, or threshold, in the prior elicitation strategy so that all gene expression levels below a certain threshold belong to one group and all gene expression levels above this threshold belong to the other group. Once a threshold value is defined, empirical Bayes prior elicitation would proceed in a straightforward fashion.

In our model development, we wish to introduce a threshold while at the same time retaining the two component structure of (2.1). Towards this goal, we consider an alternative but equivalent model of (2.1). Let $c_{jgi}$ denote a random threshold parameter such that the gene is not expressed if $y_{jgi} \le c_{jgi}$ for the $j$th tissue type, the $i$th individual, and the $g$th gene. Assume that the $c_{jgi}$'s are i.i.d. with distribution

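$$
p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2) = (2\pi)^{-1/2}\, |h'(c_{jgi})|\, \varsigma_{jg}^{-1} \exp\left\{ -\frac{1}{2\varsigma_{jg}^2} \bigl(h(c_{jgi}) - h(c_{0jg})\bigr)^2 \right\}. \tag{2.2}
$$

Let $A_{jgi} = \{y : y \le c_{jgi}\}$ and let $A_{jgi}^c$ denote the complement of $A_{jgi}$. We consider the following joint distribution for $(y_{jgi}, c_{jgi})$:

$$
\begin{aligned}
p(y_{jgi}, c_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, c_{0jg}, \varsigma_{jg}^2) = {} & \bigl[\, p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)\, 1_{A_{jgi}}(y_{jgi}) / q_{y_{jgi}} \\
& + (1 - p_{jg})\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2)\, 1_{A_{jgi}^c}(y_{jgi}) / (1 - q_{y_{jgi}}) \,\bigr]\, p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2),
\end{aligned} \tag{2.3}
$$

where $q_{y_{jgi}} = \int_{y_{jgi}}^{\infty} p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi}$. Now we are led to a useful identity which relates (2.3) to (2.1).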

Identity 2.1

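The marginal distribution of (2.3) for $y_{jgi}$ reduces to the distribution $p(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2)$ given in (2.1). That is,

$$
\int_{0}^{\infty} p(y_{jgi}, c_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi} = p(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2).
$$

Proof

Given $y_{jgi}$, we have

$$
\begin{aligned}
\int_{0}^{\infty} p(y_{jgi}, c_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi}
&= \int_{y_{jgi}}^{\infty} \bigl[ p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) / q_{y_{jgi}} \bigr]\, p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi} \\
&\quad + \int_{0}^{y_{jgi}} \bigl[ (1 - p_{jg})\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2) / (1 - q_{y_{jgi}}) \bigr]\, p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi} \\
&= p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) + (1 - p_{jg})\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2),
\end{aligned}
$$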

which establishes the identity.

The main purpose of (2.3) and (2.1) is that by introducing the latent variable cjgi, we are able to (i) obtain an identifiable model and (ii) construct the same two-component mixture model (2.1) such that once cjgi is given, we then immediately know which genes belong to which group, and this is what facilitates a straightforward data-based prior elicitation scheme. We note that model (2.1) as it stands is not identifiable since the labels of all of the parameters are arbitrary. However, by introducing the threshold parameter cjgi, we induce an ordering in the parameters of the two component mixture model that immediately yields an identifiable model. We mention here that there are alternative approaches to the model development and inference scheme proposed here. One such approach is to deal with the two component mixture model directly and make it identifiable by placing constraints on the means, and then estimate the parameters using the EM algorithm. In this framework, however, estimation of standard errors would be much more difficult than the fully Bayesian approach we adopt here. Identity 2.1 also demonstrates that (2.1) and (2.3) are indeed equivalent.

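Let $\delta_{jgi} = 1$ if $y_{jgi} \le c_{jgi}$ and 0 otherwise. Since $c_{jgi}$ is random, $\delta_{jgi}$ is an unobserved latent variable. However, given the values of $c_{jgi}$ and $y_{jgi}$, the value of $\delta_{jgi}$ is completely known. We present another useful identity which relates $p_{jg}$ to the event $\{y_{jgi} \le c_{jgi}\}$.

Identity 2.2

The probability $p_{jg}$ in (2.1) is the probability that the gene is not expressed, that is, $y_{jgi} \le c_{jgi}$ under the mixture distribution (2.3). Specifically, we have

$$
P(\delta_{jgi} = 1) = P(y_{jgi} \le c_{jgi}) = p_{jg} \quad\text{and}\quad P(\delta_{jgi} = 0) = P(y_{jgi} > c_{jgi}) = 1 - p_{jg}. \tag{2.4}
$$

Proof

It is sufficient to show $P(\delta_{jgi} = 1) = P(y_{jgi} \le c_{jgi}) = p_{jg}$. Based on the definition of the joint distribution of $(y_{jgi}, c_{jgi})$, we obtain

$$
\begin{aligned}
P(y_{jgi} \le c_{jgi})
&= \iint_{y_{jgi} \le c_{jgi}} p(y_{jgi}, c_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi}\, dy_{jgi} \\
&= \int_{0}^{\infty} \int_{y_{jgi}}^{\infty} \bigl[ p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) / q_{y_{jgi}} \bigr]\, p(c_{jgi} \mid c_{0jg}, \varsigma_{jg}^2)\, dc_{jgi}\, dy_{jgi} \\
&= \int_{0}^{\infty} p_{jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)\, dy_{jgi} = p_{jg},
\end{aligned}
$$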

which establishes the identity.

Identity 2.2 shows the relationship between pjg and the threshold parameter cjgi, and thus we see that pjg simply corresponds to the probability that the gene expression level is below the threshold in (2.3).
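Identity 2.2 can be checked numerically. The sketch below assumes $h = \log$ and reads (2.3) as a two-stage draw: pick the component with probability $p_{jg}$, draw $y_{jgi}$ from that component, then draw $c_{jgi}$ from (2.2) truncated to $[y_{jgi}, \infty)$ for the underexpressed component or to $(0, y_{jgi})$ otherwise. The function names, the inverse-CDF truncation, and the parameter values in the usage note are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_pair(rng, p, alpha, tau2, mu, sigma2, c0, zeta2):
    """Draw (y, c) from the joint distribution (2.3), assuming h = log."""
    under = rng.random() < p                      # latent component label
    loc, var = (alpha, tau2) if under else (mu, sigma2)
    y = math.exp(rng.gauss(loc, math.sqrt(var)))  # y from p1 or p2
    s = math.sqrt(zeta2)
    cdf_y = STD_NORMAL.cdf((math.log(y) - math.log(c0)) / s)  # P(c <= y) = 1 - q_y
    u = rng.random()
    # Inverse-CDF draw of c, truncated to [y, inf) if underexpressed, else to (0, y).
    v = cdf_y + u * (1.0 - cdf_y) if under else u * cdf_y
    v = min(max(v, 1e-12), 1.0 - 1e-12)           # keep inv_cdf inside (0, 1)
    c = c0 * math.exp(s * STD_NORMAL.inv_cdf(v))
    return y, c

def fraction_below_threshold(n, seed=0, **params):
    """Monte Carlo estimate of P(y <= c) under (2.3)."""
    rng = random.Random(seed)
    draws = (draw_pair(rng, **params) for _ in range(n))
    return sum(y <= c for y, c in draws) / n
```

For example, `fraction_below_threshold(20000, p=0.4, alpha=1.0, tau2=0.25, mu=4.0, sigma2=2.0, c0=math.exp(2.5), zeta2=1.0)` should return a value close to 0.4, whatever threshold parameters are chosen.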

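We now construct the likelihood as follows. Let $\delta = (\delta_{111}, \ldots, \delta_{2,G,n_{2G}})$, $c = (c_{111}, \ldots, c_{2,G,n_{2G}})$, $c_0 = (c_{011}, c_{021}, \ldots, c_{01G}, c_{02G})$, $\alpha = (\alpha_{11}, \alpha_{21}, \ldots, \alpha_{1G}, \alpha_{2G})$, $\tau^2 = (\tau_{11}^2, \tau_{21}^2, \ldots, \tau_{1G}^2, \tau_{2G}^2)$, $\mu = (\mu_{11}, \mu_{21}, \ldots, \mu_{1G}, \mu_{2G})$, $\sigma^2 = (\sigma_{11}^2, \sigma_{21}^2, \ldots, \sigma_{1G}^2, \sigma_{2G}^2)$, $\varsigma^2 = (\varsigma_{11}^2, \varsigma_{21}^2, \ldots, \varsigma_{1G}^2, \varsigma_{2G}^2)$, $p = (p_{11}, p_{21}, \ldots, p_{1G}, p_{2G})$, and $\theta = (\alpha, \tau^2, \mu, \sigma^2, c_0, \varsigma^2, p)$. Let $D = (y_{111}, \ldots, y_{2,G,n_{2G}}, c)$ denote the complete data and $D_{\mathrm{obs}} = (y_{111}, \ldots, y_{2,G,n_{2G}})$ denote the observed data.

The likelihood function for $\theta$ based on the complete data $D = (y_{111}, \ldots, y_{2,G,n_{2G}}, c)$ is thus given by

$$
L(\theta \mid D) = \prod_{j=1}^{2} \prod_{g=1}^{G} \prod_{i=1}^{n_{jg}} \Bigl\{ [p_{jg}/q_{y_{jgi}}]^{\delta_{jgi}}\, [(1 - p_{jg})/(1 - q_{y_{jgi}})]^{1 - \delta_{jgi}}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)^{\delta_{jgi}} \times p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2)^{1 - \delta_{jgi}} \Bigr\}. \tag{2.5}
$$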

An interesting special case of the general model (2.5) is obtained by taking cjgi = cjg, that is to let the random threshold parameter be free of the sample (subject). This special case is attractive, since (i) the yjgi’s share the same random threshold parameter cjg for the same tissue type and the same gene, and (ii) the yjgi’s are correlated across the same tissue type and the same gene.

3. Prior specifications

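Following Ibrahim et al. (2002), the empirical Bayes methodology is carried out by specifying a data-based guide value for all of the hyperparameters of the priors. We first elicit the parameters $(c_{0jg}, \varsigma_{jg}^2)$ given in (2.2). The guide values for $c_{0jg}$ and $\varsigma_{jg}^2$ are

$$
h(c_{0jg}) = \frac{1}{n_{jg}} \sum_{i=1}^{n_{jg}} h(y_{jgi}) \quad\text{and}\quad \varsigma_{jg}^2 = \frac{\kappa_0}{G - 1} \sum_{g=1}^{G} \bigl(h(c_{0jg}) - \bar{h}(c_{0j})\bigr)^2,
$$

where $\bar{h}(c_{0j}) = \frac{1}{G} \sum_{g=1}^{G} h(c_{0jg})$ and $\kappa_0$ is a fixed parameter. A default choice of $\kappa_0$ is 1. Here, $c_{jgi}$ is an unobserved latent variable. Using multiple imputation, we independently generate $c_{jgi}^{b}$ from (2.2) for $b = 1, 2, \ldots, B$. For each $b$, let $\delta_{jgi}^{*b} = 1$ if $y_{jgi} \le c_{jgi}^{b}$ and 0 otherwise, and set $\delta_{jgi1}^{*b} = \delta_{jgi}^{*b}$ and $\delta_{jgi2}^{*b} = 1 - \delta_{jgi}^{*b}$ for $b = 1, 2, \ldots, B$. We then take

$$
\bar{n}_{jk} = \frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}
$$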

for k = 1, 2.
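The imputation step can be sketched as follows for a single tissue type $j$, assuming $h = \log$ and assuming the variance guide value is normalized by $G - 1$. The function name, the data layout (one list of expression values per gene), and the defaults for `B` and `seed` are illustrative choices rather than part of the original procedure.

```python
import math
import random

def impute_thresholds(y, B=20, kappa0=1.0, seed=0):
    """Guide values and imputed-threshold counts for one tissue type.

    y: list of G lists; y[g] holds the positive expression values of gene g.
    Returns (h_c0, zeta2, nbar) with nbar = [nbar_j1, nbar_j2].
    """
    rng = random.Random(seed)
    G = len(y)
    # h(c_0jg): per-gene mean of the transformed expression values.
    h_c0 = [sum(math.log(v) for v in row) / len(row) for row in y]
    h_bar = sum(h_c0) / G
    # Guide value for the threshold variance (assumed G - 1 normalization).
    zeta2 = kappa0 * sum((v - h_bar) ** 2 for v in h_c0) / (G - 1)
    counts = [0, 0]  # [underexpressed, expressed] totals over all imputations
    for _ in range(B):
        for g, row in enumerate(y):
            for v in row:
                # h(c^b_jgi) is N(h(c_0jg), zeta2) under (2.2).
                h_c = rng.gauss(h_c0[g], math.sqrt(zeta2))
                counts[0 if math.log(v) <= h_c else 1] += 1
    nbar = [c / (B * G) for c in counts]
    return h_c0, zeta2, nbar
```

Because every observation falls in exactly one of the two groups in each imputation, the two entries of `nbar` add up to the average number of observations per gene.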

We follow the same ideas as in Ibrahim et al. (2002) for specifying priors for the rest of the parameters. For αjg, we take

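$$
\alpha_{jg} \mid \tau_{jg}^2, \alpha_{j0} \sim N(\alpha_{j0}, \tau_0 \tau_{jg}^2 / \bar{n}_{j1}),
$$

where $\tau_0 > 0$ is a specified scalar, and

$$
\tau_{jg}^2 \sim \mathcal{G}(a_{j01}, b_{j01}),
$$

where $(a_{j01}, b_{j01})$ are hyperparameters, for $j = 1, 2$. Similarly, we take

$$
\mu_{jg} \mid \sigma_{jg}^2, \mu_{j0} \sim N(\mu_{j0}, \sigma_0 \sigma_{jg}^2 / \bar{n}_{j2}),
$$

where $\sigma_0 > 0$ is a specified scalar and

$$
\sigma_{jg}^2 \sim \mathcal{G}(a_{j02}, b_{j02}),
$$

where $(a_{j02}, b_{j02})$ are hyperparameters, for $j = 1, 2$. We further take $\alpha_{j0} \sim N(m_{j01}, v_{j01}^2)$ and $\mu_{j0} \sim N(m_{j02}, v_{j02}^2)$, $j = 1, 2$, and we take $a_{j0k}$ fixed and $b_{j0k}$ random for our hierarchical prior. Specifically, we take a gamma prior for $b_{j0k}$, i.e., $b_{j0k} \sim \mathcal{G}(q_{j0k}, t_{j0k})$, where $(q_{j0k}, t_{j0k})$ are specified hyperparameters for $k = 1, 2$ and $j = 1, 2$.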

For the pjg’s, we specify the prior as follows. We first let

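$$
e_{jg} = \operatorname{logit}(p_{jg}) = \log\left( \frac{p_{jg}}{1 - p_{jg}} \right)
$$

and then specify a normal prior on the $e_{jg}$'s, therefore inducing a prior on the $p_{jg}$'s. Thus, we take

$$
e_{jg} \sim N(u_{j0}, k_{j0} w_{j0}^2), \quad j = 1, 2,
$$

and for the prior for $u_{j0}$, we take

$$
u_{j0} \sim N(\hat{u}_{j0}, h_{j0} w_{j0}^2), \quad j = 1, 2.
$$

The hyperparameters $k_0 = (k_{10}, k_{20})$, $h_0 = (h_{10}, h_{20})$, and $w_{j0}^2$, $j = 1, 2$, are prespecified.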

The guide values for all the hyperparameters are specified as follows. For mj0k, we take

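$$
m_{j0k} = \frac{1}{B N_{jk}} \sum_{b=1}^{B} \sum_{g=1}^{G} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}\, h(y_{jgi}),
$$

where $N_{jk} = \frac{1}{B} \sum_{b=1}^{B} \sum_{g=1}^{G} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}$, for $k = 1, 2$ and $j = 1, 2$. For $v_{j0k}^2$ we take

$$
v_{j0k}^2 = \eta_{j0k}\, \mathrm{MSG}_{jk},
$$

where

$$
\mathrm{MSG}_{jk} = \frac{1}{G - 1} \sum_{g=1}^{G} n_{jgk} (m_{jg0k} - m_{j0k})^2, \qquad m_{jg0k} = \frac{\sum_{b=1}^{B} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}\, h(y_{jgi})}{\sum_{b=1}^{B} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}},
$$

$n_{jgk} = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b}$, for $k = 1, 2$ and $j = 1, 2$, and $\eta_0 = (\eta_{101}, \eta_{201}, \eta_{102}, \eta_{202})$ is a vector of chosen scalars. A guide value for $t_{j0k}$ is $t_{j0k}^{-1} = d_{j0k}\, \mathrm{MSE}_{jk}$, where

$$
\mathrm{MSE}_{jk} = \frac{1}{B (N_{jk} - G)} \sum_{b=1}^{B} \sum_{g=1}^{G} \sum_{i=1}^{n_{jg}} \delta_{jgik}^{*b} \bigl(h(y_{jgi}) - m_{jg0k}\bigr)^2
$$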

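for $k = 1, 2$ and $j = 1, 2$, and $d_0 = (d_{101}, d_{201}, d_{102}, d_{202})$ is a vector of chosen scalars. We see that $\mathrm{MSE}_{jk}$ is just the mean square error for the expressed or underexpressed gene expression levels for tissue type $j$. Finally, we elicit the guide values for $\hat{u}_{j0}$ and $w_{j0}^2$ based on the sample proportion of underexpressed gene expression values. For $\hat{u}_{j0}$, we propose a guide value of

$$
\hat{u}_{j0} = \log\left[ \frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \hat{p}_{jg}^{\,b} \Bigg/ \left( 1 - \frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \hat{p}_{jg}^{\,b} \right) \right],
$$

where $\hat{p}_{jg}^{\,b}$ is the sample proportion of underexpressed gene expression values over all of the individuals for the $j$th tissue type in the $b$th imputed sample. This guide value for $\hat{u}_{j0}$ seems quite suitable based on the definition of $e_{jg}$. Finally, for $w_{j0}^2$, we take a guide value of the form

$$
w_{j0}^2 = \left\{ \left( \frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \hat{p}_{jg}^{\,b} \right) \left( 1 - \frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \hat{p}_{jg}^{\,b} \right) \right\}^{-1}.
$$

Thus we see that the guide value for $w_{j0}^2$ is just the frequentist variance of $\frac{1}{BG} \sum_{b=1}^{B} \sum_{g=1}^{G} \hat{p}_{jg}^{\,b}$.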
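As a small worked illustration of the last two guide values, the sketch below computes $\hat{u}_{j0}$ and $w_{j0}^2$ from a table of imputed sample proportions. The function name and the nested-list layout `p_hat[b][g]` are hypothetical conveniences.

```python
import math

def logit_guide_values(p_hat):
    """Guide values (u_hat_j0, w2_j0) for one tissue type.

    p_hat[b][g]: proportion of underexpressed values for gene g in imputed sample b.
    """
    B, G = len(p_hat), len(p_hat[0])
    p_bar = sum(sum(row) for row in p_hat) / (B * G)  # average over b and g
    u_hat = math.log(p_bar / (1.0 - p_bar))           # logit of the average proportion
    w2 = 1.0 / (p_bar * (1.0 - p_bar))                # reciprocal of p_bar * (1 - p_bar)
    return u_hat, w2
```

An average proportion of 0.5 gives $\hat{u}_{j0} = 0$ and $w_{j0}^2 = 4$, the smallest value $w_{j0}^2$ can take.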

To gain a better understanding of the prior distributions and their associated hyperparameters, a directed acyclic graph (DAG) of the prior elicitation scheme is given in Fig. 2.

Fig. 2. Graphical display in prior specification. Elements in circles are stochastic, while elements in squares are empirically specified hyperparameters. Shaded circles indicate parameters of interest. Double squares correspond to prespecified scalar hyperparameters.

4. Gene selection criteria

To discriminate between the normal and tumor tissues, we follow Ibrahim et al. (2002) and let

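$$
\psi_{jg} = E_y[\, y_{jgi} \mid p_{jg}, \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2 \,],
$$

where $y = (y_{111}, \ldots, y_{2,G,n_{2G}})$, and the expectation is with respect to the joint distribution of $y$. Thus, we have

$$
\psi_{jg} = E[\, y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2 \,]\, p_{jg} + (1 - p_{jg})\, E[\, y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2 \,]. \tag{4.1}
$$

If $h(y_{jgi}) = \log(y_{jgi})$, then

$$
\psi_{jg} = p_{jg} \exp\left\{ \alpha_{jg} + \frac{\tau_{jg}^2}{2} \right\} + (1 - p_{jg}) \exp\left\{ \mu_{jg} + \frac{\sigma_{jg}^2}{2} \right\}. \tag{4.2}
$$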
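A short numerical illustration of (4.2) may help. The sketch below evaluates the log-normal mixture mean for the two parameter settings used for the "different" genes in the simulation study of Section 5; the function name is hypothetical, and the final ratio is included only to show how strongly the two groups differ on the mean scale under those settings.

```python
import math

def psi_lognormal(p, alpha, tau2, mu, sigma2):
    """Mixture mean (4.2) when h(y) = log(y)."""
    under = math.exp(alpha + tau2 / 2.0)      # mean of the underexpressed component
    expressed = math.exp(mu + sigma2 / 2.0)   # mean of the expressed component
    return p * under + (1.0 - p) * expressed

# Group 1 and group 2 settings of the "different" genes in Section 5.
psi_1 = psi_lognormal(0.4, 1.0, 0.25, 3.0, 2.0)
psi_2 = psi_lognormal(0.4, 1.5, 0.25, 7.0, 2.0)
ratio = psi_2 / psi_1  # ratio of the two group means
```

Note that both the location and the scale parameters enter each exponential term, so a change in either one moves the mean.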

The primary reason why we focus on $\psi_{jg} = E_y[\, y_{jgi} \mid p_{jg}, \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2 \,]$ as a gene selection criterion is that this quantity provides combined information on both the location and scale parameters in the specified model for $y_{jgi}$. In contrast, if for example $y_{jgi}$ has a log-normal distribution and we consider the expected value of $\log(y_{jgi})$ as the gene selection criterion, we can immediately see that this expectation is simply the weighted average of the location parameters. Thus, the expected value of $\log(y_{jgi})$ is not as informative as the expected value of $y_{jgi}$ since it does not use information on both the location and scale parameters.

To compare the gene expression level means between the normal and tumor tissues, we follow Ibrahim et al. (2002) and define

$$
\xi_g = \psi_{2g} / \psi_{1g}, \quad g = 1, 2, \ldots, G. \tag{4.3}
$$

Then, we propose the following algorithm for determining which genes are differentially expressed between the two groups:

  • Step 1. We first compute the posterior distributions of all the $\xi_g$'s, $g = 1, 2, \ldots, G$, and for each $\xi_g$, we compute $\gamma_{g21} = P(\xi_g > 2 \mid D)$ and $\gamma_{g22} = P(\xi_g < 0.5 \mid D)$ (the 2-criterion), as well as $\gamma_{g31} = P(\xi_g > 3 \mid D)$ and $\gamma_{g32} = P(\xi_g < 1/3 \mid D)$ (the 3-criterion).

  • Step 2. We select a cut-off value, denoted $\gamma_0$, for $\gamma_{gjk}$ for determining which genes are different. Possible values of $\gamma_0$ might be $\gamma_0 = 0.7$, $\gamma_0 = 0.8$, and $\gamma_0 = 0.9$.

  • Step 3. We declare gene $g$ different for the two tissue types if $\gamma_{g21} \ge \gamma_0$ or $\gamma_{g22} \ge \gamma_0$ (the 2-criterion), or if $\gamma_{g31} \ge \gamma_0$ or $\gamma_{g32} \ge \gamma_0$ (the 3-criterion).

We note that computing $P(\xi_g > 2 \mid D)$ (or $P(\xi_g > 3 \mid D)$) is quite straightforward since it is a by-product of Markov chain Monte Carlo (MCMC) sampling. Specifically, suppose $\{\xi_{gq}, q = 1, 2, \ldots, Q\}$ is an MCMC sample from the posterior distribution. Then, a Monte Carlo estimate of $P(\xi_g > 2 \mid D)$ is simply

$$
\hat{P}(\xi_g > 2 \mid D) = \frac{1}{Q} \sum_{q=1}^{Q} 1\{\xi_{gq} > 2\},
$$

where $1\{\xi_{gq} > 2\}$ is the indicator function. In using (4.3), Ibrahim et al. (2002) only considered the “one-criterion” for determining which genes are differentially expressed, that is, $P(\xi_g > 1 \mid D)$. Our experience shows that the 2 and 3-criteria yield much better false positive and false negative rates than the 1-criterion, and thus we use these for determining which genes are differentially expressed.
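The three steps and the Monte Carlo estimate above can be sketched in a few lines. The code assumes that a list of posterior draws of $\xi_g$ is already available from an MCMC run; the function names and the `fold` argument (2 for the 2-criterion, 3 for the 3-criterion) are illustrative.

```python
def tail_probability(draws, cutoff, upper=True):
    """Monte Carlo estimate of P(xi_g > cutoff | D), or P(xi_g < cutoff | D) if upper is False."""
    hits = sum((x > cutoff) if upper else (x < cutoff) for x in draws)
    return hits / len(draws)

def is_differential(xi_draws, fold=2.0, gamma0=0.8):
    """Apply the fold-criterion to one gene.

    The gene is flagged if P(xi_g > fold | D) >= gamma0
    or P(xi_g < 1/fold | D) >= gamma0.
    """
    up = tail_probability(xi_draws, fold, upper=True)
    down = tail_probability(xi_draws, 1.0 / fold, upper=False)
    return up >= gamma0 or down >= gamma0
```

For instance, with the toy draws `[2.5, 3.1, 1.9, 2.2, 4.0]`, four of five values exceed 2 and two of five exceed 3, so the gene is flagged under the 2-criterion with $\gamma_0 = 0.8$ but not under the 3-criterion.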

5. A simulation study

We conducted a simulation study to investigate the operating characteristics of the threshold mixture model in (2.5) in the context of differential gene expression, and also to compare the performance of the proposed model with methods for differential gene expression based on Significance Analysis of Microarrays (SAM, Tusher et al., 2001), parametric empirical Bayes methods for microarray data (EBarrays, Kendziorski et al., 2003), and permutation methods based on t statistics (PERMAX, Mutter et al., 2001). Towards these goals, we simulated data from the log-normal model in (2.1). The simulation assumes two groups, with $n_1 = n_2 = 25$ and $G = 1000$ genes. The data were simulated so that 50 genes are in truth “differentially expressed” (i.e., the expression levels are simulated from two different log-normal distributions with different location and scale parameters), and 950 genes are in truth “not differentially expressed” (i.e., the gene expression levels are generated from identical log-normal distributions). Specifically, the data were simulated from the log-normal ($h(y) = \log(y)$) mixture model in (2.1) with $p_{jg} = 0.4$, $\alpha_{jg} = 1$, $\tau_{jg}^2 = 0.25$, $\mu_{jg} = 4$, and $\sigma_{jg}^2 = 2$, $j = 1, 2$, for the 950 “similar” genes, and with $p_{jg} = 0.4$, $\alpha_{1g} = 1$, $\alpha_{2g} = 1.5$, $\tau_{jg}^2 = 0.25$, $j = 1, 2$, $\mu_{1g} = 3$, $\mu_{2g} = 7$, and $\sigma_{jg}^2 = 2$, $j = 1, 2$, for the 50 “different” genes.
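A dataset with this design can be generated along the following lines. This is only a sketch of the stated design: the function names, the seed handling, and the choice to place the 50 "different" genes first are assumptions made for illustration.

```python
import math
import random

def simulate_gene(rng, n, p, alpha, tau2, mu, sigma2):
    """Draw n expression values from the log-normal mixture (2.1)."""
    values = []
    for _ in range(n):
        if rng.random() < p:   # underexpressed component
            values.append(math.exp(rng.gauss(alpha, math.sqrt(tau2))))
        else:                  # expressed component
            values.append(math.exp(rng.gauss(mu, math.sqrt(sigma2))))
    return values

def simulate_dataset(n1=25, n2=25, G=1000, n_diff=50, seed=0):
    """Return (group1, group2, truth); truth[g] is True for the 'different' genes."""
    rng = random.Random(seed)
    group1, group2, truth = [], [], []
    for g in range(G):
        different = g < n_diff  # arbitrary ordering: 'different' genes come first
        if different:
            par1 = (0.4, 1.0, 0.25, 3.0, 2.0)
            par2 = (0.4, 1.5, 0.25, 7.0, 2.0)
        else:
            par1 = par2 = (0.4, 1.0, 0.25, 4.0, 2.0)
        group1.append(simulate_gene(rng, n1, *par1))
        group2.append(simulate_gene(rng, n2, *par2))
        truth.append(different)
    return group1, group2, truth
```

Repeating the call with different seeds gives independent replicates of the kind summarized in the tables of this section.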

Table 1 summarizes the false positive rates (FPR) and false negative rates (FNR) based on three different priors: (I) noninformative with (η0, d0, k0, h0) = (100, 100, 50, 50), (II) moderately informative with (η0, d0, k0, h0) = (1, 1, 1, 1), or (III) informative with (η0, d0, k0, h0) = (0.01, 0.01, 0.01, 0.01). The results shown in Table 1 are based on 500 simulations. We see from Table 1 that the performance of the proposed 2-criterion and 3-criterion is quite good, and appears to behave best with γ0≥0.80. For example, for γ0 =0.80 under noninformative priors, the mean FPR is 0.038 and 0.007 and the mean FNR is 0.0003 and 0.001 under the 2-criterion and the 3-criterion, respectively. Moreover, the FPR and FNR are quite robust with respect to the choice of the prior. We see that we get essentially the same rates for all three priors for several different values of γ0. These results are very encouraging and show that our gene selection algorithm described in Section 4 along with both the 2-criterion and the 3-criterion have good properties.

Table 1.

False negative rate (FNR) and false positive rate (FPR) of the proposed criterion under model (2.5)

| Prior | γ0 | 2-criterion mean FNR | SD | 2-criterion mean FPR | SD | 3-criterion mean FNR | SD | 3-criterion mean FPR | SD |
|---|---|---|---|---|---|---|---|---|---|
| I | 0.70 | 0.0001 | 0.0013 | 0.0965 | 0.0343 | 0.0004 | 0.0029 | 0.0241 | 0.0173 |
| I | 0.80 | 0.0003 | 0.0025 | 0.0381 | 0.0157 | 0.0014 | 0.0052 | 0.0070 | 0.0052 |
| I | 0.85 | 0.0007 | 0.0037 | 0.0198 | 0.0094 | 0.0026 | 0.0071 | 0.0030 | 0.0026 |
| I | 0.90 | 0.0019 | 0.0061 | 0.0079 | 0.0046 | 0.0060 | 0.0114 | 0.0009 | 0.0011 |
| I | 0.95 | 0.0019 | 0.0061 | 0.0016 | 0.0015 | 0.0202 | 0.0191 | 0.0001 | 0.0004 |
| II | 0.70 | 0.0001 | 0.0009 | 0.0790 | 0.0094 | 0.0003 | 0.0024 | 0.0164 | 0.0043 |
| II | 0.80 | 0.0003 | 0.0024 | 0.0299 | 0.0058 | 0.0012 | 0.0050 | 0.0045 | 0.0022 |
| II | 0.85 | 0.0006 | 0.0035 | 0.0153 | 0.0042 | 0.0026 | 0.0072 | 0.0019 | 0.0015 |
| II | 0.90 | 0.0017 | 0.0058 | 0.0057 | 0.0025 | 0.0062 | 0.0114 | 0.0006 | 0.0008 |
| II | 0.95 | 0.0017 | 0.0058 | 0.0012 | 0.0011 | 0.0210 | 0.0206 | 0.0001 | 0.0003 |
| III | 0.70 | 0.0003 | 0.0025 | 0.0700 | 0.0083 | 0.0017 | 0.0056 | 0.0121 | 0.0035 |
| III | 0.80 | 0.0011 | 0.0046 | 0.0284 | 0.0055 | 0.0046 | 0.0093 | 0.0038 | 0.0019 |
| III | 0.85 | 0.0028 | 0.0072 | 0.0153 | 0.0041 | 0.0078 | 0.0125 | 0.0017 | 0.0014 |
| III | 0.90 | 0.0058 | 0.0105 | 0.0067 | 0.0026 | 0.0153 | 0.0180 | 0.0006 | 0.0008 |
| III | 0.95 | 0.0058 | 0.0105 | 0.0016 | 0.0013 | 0.0410 | 0.0268 | 0.0001 | 0.0003 |

Table 2 shows the false positive and false negative rates based on the noninformative prior and various combinations of (n1, n2). In Table 2, the results are based on the 3-criterion. We also compared our procedure to PERMAX, SAM, and EBarrays. In PERMAX, standard pooled variance t statistics for comparing normal tissues to tumor tissues are computed for each gene. We let tg denote the t statistic for the gth gene. To nonparametrically determine the significance of each gene while controlling the overall error rate, the permutation distribution of the most extreme statistic over all genes is used. Since the distributions of the t statistics are not symmetric with unequal group sizes, this is done separately in each tail. Assuming positive values of tg indicate higher values in normal tissues, and letting t(p) be the maximum statistic over all genes for the pth permutation, the p-value for gene g in the direction of higher expression in normal tissues is the proportion of permutations for which t(p) ≥ tg, with a similar calculation in the opposite tail for differences in the opposite direction. SAM is now a well known statistical technique for finding significant genes in a set of microarray experiments. It uses repeated permutations of the data to determine whether the expression of any gene is significantly related to the response variable (the grouping variable in the context of this paper). EBarrays assumes a hierarchical mixture model to account for differences among genes in their average expression levels, differential expression for a given gene among groups, and measurement fluctuations. Posterior probabilities of patterns of differential expression across groups can be computed and used to determine significant genes.
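The maximum-statistic permutation idea described for PERMAX can be sketched as follows. This is not the PERMAX software itself: it is a minimal illustration that assumes complete data with the same samples for every gene, handles only the upper tail (higher expression in normal tissues), and takes the p-value to be the proportion of permutations whose maximum statistic is at least the observed statistic. All function names are hypothetical.

```python
import math
import random

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (positive when mean(a) > mean(b))."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def max_t_pvalues(normal, tumor, n_perm=1000, seed=0):
    """Upper-tail max-statistic permutation p-values, one per gene.

    normal[g] and tumor[g] hold the values of gene g in each group.
    """
    rng = random.Random(seed)
    G, n1 = len(normal), len(normal[0])
    data = [normal[g] + tumor[g] for g in range(G)]
    observed = [pooled_t(normal[g], tumor[g]) for g in range(G)]
    idx = list(range(len(data[0])))
    exceed = [0] * G
    for _ in range(n_perm):
        rng.shuffle(idx)  # one relabeling of the samples, shared by all genes
        a_idx, b_idx = idx[:n1], idx[n1:]
        t_max = max(
            pooled_t([row[i] for i in a_idx], [row[i] for i in b_idx])
            for row in data
        )
        for g in range(G):
            if t_max >= observed[g]:
                exceed[g] += 1
    return [e / n_perm for e in exceed]
```

Comparing each gene with the maximum over all genes is what controls the overall error rate, and it also explains why such a procedure can be conservative when only a few samples are available.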

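Table 2.

Comparison of three methods

| Method | (n1, n2) | γ0 | Mean FNR | SD | Mean FPR | SD |
|---|---|---|---|---|---|---|
| Proposed criterion under model (2.5) | (25, 25) | 0.70 | 0.0004 | 0.0029 | 0.0241 | 0.0173 |
| Proposed criterion under model (2.5) | (25, 25) | 0.80 | 0.0014 | 0.0052 | 0.0070 | 0.0052 |
| Proposed criterion under model (2.5) | (25, 25) | 0.90 | 0.0060 | 0.0114 | 0.0009 | 0.0011 |
| SAM (FDR ≤ 0.05) | (25, 25) | | 0.0000 | 0.0000 | 0.0013 | 0.0011 |
| SAM (FDR ≤ 0.10) | (25, 25) | | 0.0000 | 0.0000 | 0.0038 | 0.0022 |
| PERMAX (α = 0.05) | (25, 25) | | 0.7150 | 0.0627 | 0.0000 | 0.0002 |
| PERMAX (α = 0.10) | (25, 25) | | 0.6382 | 0.0688 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (25, 25) | | 0.0000 | 0.0000 | 0.9999 | 0.0003 |
| EBarrays (PP > 0.7) | (25, 25) | | 0.0000 | 0.0000 | 0.9835 | 0.0247 |
| Proposed criterion under model (2.5) | (20, 20) | 0.70 | 0.0020 | 0.0061 | 0.0341 | 0.0201 |
| Proposed criterion under model (2.5) | (20, 20) | 0.80 | 0.0058 | 0.0105 | 0.0103 | 0.0061 |
| Proposed criterion under model (2.5) | (20, 20) | 0.90 | 0.0197 | 0.0198 | 0.0014 | 0.0014 |
| SAM (FDR ≤ 0.05) | (20, 20) | | 0.0003 | 0.0025 | 0.0015 | 0.0012 |
| SAM (FDR ≤ 0.10) | (20, 20) | | 0.0001 | 0.0015 | 0.0038 | 0.0023 |
| PERMAX (α = 0.05) | (20, 20) | | 0.8462 | 0.0510 | 0.0001 | 0.0003 |
| PERMAX (α = 0.10) | (20, 20) | | 0.7871 | 0.0597 | 0.0001 | 0.0004 |
| EBarrays (PP > 0.5) | (20, 20) | | 0.0000 | 0.0000 | 0.9999 | 0.0002 |
| EBarrays (PP > 0.7) | (20, 20) | | 0.0004 | 0.0029 | 0.9864 | 0.0159 |
| Proposed criterion under model (2.5) | (10, 10) | 0.70 | 0.0283 | 0.0238 | 0.0709 | 0.0334 |
| Proposed criterion under model (2.5) | (10, 10) | 0.80 | 0.0570 | 0.0349 | 0.0235 | 0.0125 |
| Proposed criterion under model (2.5) | (10, 10) | 0.90 | 0.1400 | 0.0565 | 0.0033 | 0.0028 |
| SAM (FDR ≤ 0.05) | (10, 10) | | 0.0855 | 0.0528 | 0.0014 | 0.0013 |
| SAM (FDR ≤ 0.10) | (10, 10) | | 0.0409 | 0.0361 | 0.0040 | 0.0022 |
| PERMAX (α = 0.05) | (10, 10) | | 0.9890 | 0.0142 | 0.0000 | 0.0001 |
| PERMAX (α = 0.10) | (10, 10) | | 0.9799 | 0.0196 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (10, 10) | | 0.0006 | 0.0037 | 0.9995 | 0.0011 |
| EBarrays (PP > 0.7) | (10, 10) | | 0.1319 | 0.1385 | 0.4254 | 0.3458 |
| Proposed criterion under model (2.5) | (5, 5) | 0.70 | 0.1074 | 0.0532 | 0.1403 | 0.0943 |
| Proposed criterion under model (2.5) | (5, 5) | 0.80 | 0.1822 | 0.0693 | 0.0563 | 0.0466 |
| Proposed criterion under model (2.5) | (5, 5) | 0.90 | 0.3441 | 0.0902 | 0.0110 | 0.0114 |
| SAM (FDR ≤ 0.05) | (5, 5) | | 0.6104 | 0.1443 | 0.0007 | 0.0010 |
| SAM (FDR ≤ 0.10) | (5, 5) | | 0.4634 | 0.1624 | 0.0023 | 0.0020 |
| PERMAX (α = 0.05) | (5, 5) | | 0.9978 | 0.0067 | 0.0000 | 0.0002 |
| PERMAX (α = 0.10) | (5, 5) | | 0.9957 | 0.0101 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (5, 5) | | 0.0124 | 0.0184 | 0.9922 | 0.0115 |
| EBarrays (PP > 0.7) | (5, 5) | | 0.9114 | 0.1327 | 0.0026 | 0.0104 |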

Table 2 shows the PERMAX procedure based on 50,000 permutations, as well as the SAM model based on 20,000 repeated permutations. We see from Table 2 that our method provides more satisfactory false negative and false positive rate results compared to the other three approaches. For a given γ0 and criterion (2 or 3-criterion) based on our approach, both the false positive and false negative rates increase as the two sample sizes decrease. Although the PERMAX procedure gives an excellent FPR, the false negative rates based on the PERMAX procedure are extremely high. For example, for (n1, n2) = (25, 25) using a significance level of 0.05, the false negative rate is 0.715. The false negative rates based on the PERMAX procedure increase as the sample size decreases. Based on controlling the false discovery rate (FDR), SAM yields reasonably good false positive and false negative rates; however, the false negative rate increases substantially when (n1, n2) = (5, 5). As with the PERMAX procedure, the false negative rates of SAM increase as the sample size decreases. For the EBarrays results, we assume a log-normal mixture model, and use either 0.5 or 0.7 as the threshold posterior probability to identify genes that are differentially expressed. We can clearly see that under almost all simulation scenarios, EBarrays produces the highest false positive rates. An exception occurs when (n1, n2) = (5, 5) and the threshold posterior probability is 0.7; however, in this case, the false negative rate is extremely high.

In addition, we conducted a study of the robustness of the log-normal distribution. We simulated data from a gamma mixture model with all shape parameters equal to 3 and means matching the ones specified for the log-normal models in the simulation design discussed earlier. Again, the data were simulated so that 50 genes are in truth differentially expressed and 950 genes are in truth not differentially expressed. The results shown in Table 3 are based on noninformative priors and the 3-criterion. Our proposed 3-criterion outperforms the PERMAX, SAM and EBarrays procedures. Although SAM gives impressive results as shown in Table 2, it is far less robust with respect to model assumptions compared to our log-normal hierarchical model in (2.5). Overall, these simulation results show that our mixture model (2.5) along with the 3-criterion (or 2-criterion) gene selection algorithm is very promising.

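Table 3.

Sensitivity analysis

| Method | (n1, n2) | γ0 | Mean FNR | SD | Mean FPR | SD |
|---|---|---|---|---|---|---|
| Proposed criterion under model (2.5) | (25, 25) | 0.70 | 0.0000 | 0.0000 | 0.0963 | 0.1224 |
| Proposed criterion under model (2.5) | (25, 25) | 0.80 | 0.0000 | 0.0000 | 0.0202 | 0.0351 |
| Proposed criterion under model (2.5) | (25, 25) | 0.90 | 0.0001 | 0.0023 | 0.0012 | 0.0021 |
| SAM (FDR ≤ 0.05) | (25, 25) | | 0.4290 | 0.1043 | 0.0014 | 0.0015 |
| SAM (FDR ≤ 0.10) | (25, 25) | | 0.2800 | 0.0834 | 0.0040 | 0.0024 |
| PERMAX (α = 0.05) | (25, 25) | | 0.8046 | 0.0556 | 0.0000 | 0.0002 |
| PERMAX (α = 0.10) | (25, 25) | | 0.7428 | 0.0621 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (25, 25) | | 0.0000 | 0.0000 | 0.9999 | 0.0004 |
| EBarrays (PP > 0.7) | (25, 25) | | 0.0063 | 0.0128 | 0.8935 | 0.1248 |
| Proposed criterion under model (2.5) | (20, 20) | 0.70 | 0.0000 | 0.0009 | 0.1217 | 0.1408 |
| Proposed criterion under model (2.5) | (20, 20) | 0.80 | 0.0001 | 0.0013 | 0.0286 | 0.0476 |
| Proposed criterion under model (2.5) | (20, 20) | 0.90 | 0.0002 | 0.0018 | 0.0020 | 0.0040 |
| SAM (FDR ≤ 0.05) | (20, 20) | | 0.5992 | 0.0886 | 0.0011 | 0.0011 |
| SAM (FDR ≤ 0.10) | (20, 20) | | 0.4646 | 0.1042 | 0.0029 | 0.0021 |
| PERMAX (α = 0.05) | (20, 20) | | 0.8980 | 0.0430 | 0.0001 | 0.0003 |
| PERMAX (α = 0.10) | (20, 20) | | 0.8588 | 0.0501 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (20, 20) | | 0.0001 | 0.0015 | 0.9997 | 0.0006 |
| EBarrays (PP > 0.7) | (20, 20) | | 0.0304 | 0.0381 | 0.6714 | 0.2652 |
| Proposed criterion under model (2.5) | (10, 10) | 0.70 | 0.0024 | 0.0072 | 0.2145 | 0.2037 |
| Proposed criterion under model (2.5) | (10, 10) | 0.80 | 0.0058 | 0.0109 | 0.0718 | 0.1004 |
| Proposed criterion under model (2.5) | (10, 10) | 0.90 | 0.0147 | 0.0180 | 0.0100 | 0.0170 |
| SAM (FDR ≤ 0.05) | (10, 10) | | 0.9176 | 0.0469 | 0.0007 | 0.0009 |
| SAM (FDR ≤ 0.10) | (10, 10) | | 0.8909 | 0.0685 | 0.0011 | 0.0012 |
| PERMAX (α = 0.05) | (10, 10) | | 0.9890 | 0.0141 | 0.0000 | 0.0002 |
| PERMAX (α = 0.10) | (10, 10) | | 0.9798 | 0.0194 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (10, 10) | | 0.0027 | 0.0074 | 0.9970 | 0.0039 |
| EBarrays (PP > 0.7) | (10, 10) | | 0.4563 | 0.1507 | 0.0688 | 0.06111 |
| Proposed criterion under model (2.5) | (5, 5) | 0.70 | 0.0265 | 0.0269 | 0.2989 | 0.2343 |
| Proposed criterion under model (2.5) | (5, 5) | 0.80 | 0.0510 | 0.0372 | 0.1301 | 0.1388 |
| Proposed criterion under model (2.5) | (5, 5) | 0.90 | 0.0991 | 0.0478 | 0.0308 | 0.0334 |
| SAM (FDR ≤ 0.05) | (5, 5) | | 0.9724 | 0.0226 | 0.0011 | 0.0011 |
| SAM (FDR ≤ 0.10) | (5, 5) | | 0.9717 | 0.0252 | 0.0011 | 0.0012 |
| PERMAX (α = 0.05) | (5, 5) | | 0.9996 | 0.0028 | 0.0000 | 0.0002 |
| PERMAX (α = 0.10) | (5, 5) | | 0.9990 | 0.0045 | 0.0001 | 0.0003 |
| EBarrays (PP > 0.5) | (5, 5) | | 0.0209 | 0.0316 | 0.9764 | 0.0335 |
| EBarrays (PP > 0.7) | (5, 5) | | 0.9020 | 0.1032 | 0.0067 | 0.0102 |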

6. Analysis of gastric cancer data

In this section, we revisit the gastric cancer dataset of Chen et al. (2003). The dataset contains 90 tumor and 22 normal samples, and a total of 6688 genes were available for analysis. Here, we analyze these data with the proposed model and compare the results with those of SAM and with the analysis of Chen et al. (2003), which is reported on the website (http://genome-www.stanford.edu/Gastric_Cancer2). Table 4 shows the number of selected genes under the 2- and 3-criteria. As expected, the number of differentially expressed genes becomes smaller as γ0 increases, and the 3-criterion yields fewer differentially expressed genes than the 2-criterion. Table 4 also lists the number and percentage of differentially expressed genes that matched between our list and Chen et al.’s (2003) list, as well as between our list and the list from SAM. We see that our list and Chen et al.’s list matched for at least 80% of the genes for all combinations of γ0 and type of criterion (2- or 3-criterion), with the highest percentage of matches occurring for γ0 = 0.90 along with the 2-criterion. Chen et al. (2003) identified 3329 genes as differentially expressed using a “nonparametric t-test” with a p-value cutoff of 0.001 or 0.002 based on 10,000 random “column permutations”. Further details of this statistical procedure can be found in Troyanskaya et al. (2002). The percentage of matched genes identified as differentially expressed between our method and SAM is at least 96%. Despite this high percentage, the number of genes identified by SAM as differentially expressed is substantially larger than the number identified by our approach. We also note that the gastric cancer dataset contains a substantial number of missing gene expression measurements. For the 90 tumor samples, only 8.49% of the genes are completely observed, and up to 35.36% of the genes have more than five missing gene expression measurements. For the 22 normal samples, only 43.26% of the genes are fully observed and 7.64% of the genes are missing at least five gene expression measurements. Our proposed mixture model intrinsically allows for unbalanced data and is capable of handling the missing data properly, whereas the SAM method naively imputes missing values via a k-nearest neighbor algorithm. This ad hoc handling of missing data may yield misleading results and makes SAM less appropriate as a tool for analyzing microarray data when there is a significant amount of missing data. We mention here that EBarrays cannot handle missing data. More specifically, in the presence of missing gene expression data, the number of genes for each subject becomes different; EBarrays is not able to accommodate this setting and hence is not applicable to non-rectangular data. Therefore, EBarrays cannot be applied to the gastric cancer data.

Table 4.

Number of genes differentially expressed using κ0 = 1

                                                   γ0 = 0.70  γ0 = 0.80  γ0 = 0.90
2-criterion
  Number of genes selected                         762        613        411
  Number of genes matched with Chen et al. (2003)  695        563        379
  Matched percentage                               91.21%     91.84%     92.21%
  Number of genes matched with SAM (FDR ≤ 0.05)    739        598        403
  Matched percentage                               96.98%     97.55%     98.05%
  Number of genes matched with SAM (FDR ≤ 0.10)    744        602        406
  Matched percentage                               97.64%     98.21%     98.78%
3-criterion
  Number of genes selected                         188        145        98
  Number of genes matched with Chen et al. (2003)  160        123        79
  Matched percentage                               85.11%     84.83%     80.61%
  Number of genes matched with SAM (FDR ≤ 0.05)    183        141        96
  Matched percentage                               97.34%     97.24%     97.96%
  Number of genes matched with SAM (FDR ≤ 0.10)    186        143        97
  Matched percentage                               98.94%     98.62%     98.98%

Note that when FDR ≤ 0.05, 4511 genes were identified as differentially expressed, while when FDR ≤ 0.10, 5082 genes were identified as differentially expressed.
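The matched percentages in Table 4 are consistent with dividing the number of matched genes by the number of genes selected by the proposed method. The short sketch below reproduces the 2-criterion comparison with Chen et al. (2003) from the reported counts; the function names are hypothetical, and the gene identifiers in the set-based helper are placeholders rather than genes from the dataset.

```python
def match_summary(selected, reference):
    """Count genes common to two lists and express it as a % of `selected`."""
    selected, reference = set(selected), set(reference)
    n_matched = len(selected & reference)
    return n_matched, 100.0 * n_matched / len(selected)


def matched_percentage(n_matched, n_selected):
    """Matched genes as a percentage of selected genes, to two decimals."""
    return round(100.0 * n_matched / n_selected, 2)


# Counts copied from Table 4: gamma0 -> (number selected, number matched
# with Chen et al. (2003)) under the 2-criterion.
table4_two_criterion = {0.70: (762, 695), 0.80: (613, 563), 0.90: (411, 379)}
percentages = {
    gamma0: matched_percentage(n_matched, n_selected)
    for gamma0, (n_selected, n_matched) in table4_two_criterion.items()
}

# Placeholder gene identifiers, for illustration only.
n_common, pct_common = match_summary(["g1", "g2", "g3", "g4"], ["g2", "g3", "g9"])
```

Because the denominator is the size of our own list, a high matched percentage is compatible with the other method selecting many more genes overall.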

Fig. 3 shows nonparametric density estimates, for the tumor and normal samples, of nine genes that were deemed differentially expressed by our proposed method as well as by Chen et al. (2003). From Fig. 3, we see a clear separation between the tumor and normal distributions, which supports identifying these genes as differentially expressed, as well as apparent bimodality in each distribution. Fig. 4 shows nine genes that were not deemed differentially expressed by Chen et al. (2003) but were deemed differentially expressed by our method using either the 2- or 3-criterion with γ0 ≥ 0.70. Fig. 4 is striking. Although there appears to be much overlap between the distributions in each panel, there are great differences in the tails of the two distributions in each panel, and in particular, there is often bimodality in the tails. Chen et al.’s method cannot pick up these differences, as it is primarily aimed at detecting differences in location and is unable to detect bimodality or large differences in the tails. In contrast, (2.3) is very well suited to pick up these types of differences. The nine genes reported in Fig. 4 are identified by SAM as differentially expressed. Fig. 5 shows a plot of the posterior probability for the 3-criterion (see Step 1 in Section 4) versus the posterior mean of log(ξg) for the 3329 genes that were deemed differentially expressed by Chen et al. (2003). We see from Fig. 5 that many such genes have a small posterior probability (less than 0.70) as well as a small log(ξg) according to our proposed criterion, and thus such genes might be inaccurately claimed to be differentially expressed. Similarly, Fig. 6 shows a plot of the posterior probability for the 3-criterion versus the posterior mean of log(ξg) for the remaining 3359 genes that were not deemed differentially expressed by Chen et al. (2003). We see from Fig. 6 that many such genes actually have a large log(ξg) as well as a large posterior probability according to our proposed method, implying that Chen et al. may be inaccurately declaring certain genes as not differentially expressed, when in reality they may be. Again, these false declarations may be due to the inability of the nonparametric t-test to detect differences in tail behavior and bimodality in the gene expression distributions.

Fig 3.

Densities for nine genes deemed differentially expressed by the proposed methodology as well as by Chen et al. (2003) (solid: tumor, dashed: normal).

Fig 4.

Densities for nine genes that were deemed differentially expressed by the proposed methodology but not deemed differentially expressed by Chen et al. (2003) (solid: tumor, dashed: normal).

Fig 5.

Posterior probability (max{γg31, γg32}) versus the posterior mean of log(ξg) for 3329 genes selected by Chen et al. (2003).

Fig 6.

Posterior probability (max{γg31, γg32}) versus the posterior mean of log(ξg) for genes that were not selected by Chen et al. (2003).

In addition, we have done extensive sensitivity analyses to examine the robustness of the proposed methodology. For the gastric cancer dataset, we considered analyses using κ0 = 0.5, 1, 2, with the 2- and 3-criteria, along with γ0 = 0.70, 0.80, 0.90. The results were remarkably consistent with each other, and for a given γ0 and criterion, the number of differentially expressed genes was nearly identical for all three values of κ0. Table 4 is based on κ0 = 1, yielding, for example, 762, 613, and 411 differentially expressed genes for γ0 = 0.70, 0.80, and 0.90 based on the 2-criterion. For κ0 = 0.5, the number of differentially expressed genes was 76, 611, and 408, corresponding to a matching percentage rate (with κ0 = 1) of 99.74%, 99.67%, and 99.50%, respectively. Similar results were obtained for κ0 = 2 and the 3-criterion. In fact, the matching percentage was always at least 99.32% for any combination of γ0, type of criterion (2- or 3-criterion), and κ0 = 0.5, 2.

Finally, we compared, via cross-validation, the classification and predictive accuracy obtained with the genes selected by our method and with the genes selected by SAM. Cross-validation was carried out using PAM, proposed by Tibshirani et al. (2002). We use 10-fold cross-validation and determine the degree of shrinkage in the calculation of the nearest shrunken centroids by minimizing the cross-validated and test errors. The results show that both our method and SAM are quite capable of producing excellent classification and predictive power.
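The idea of choosing the degree of shrinkage by cross-validation can be sketched as follows. This is a simplified nearest-shrunken-centroid classifier written only for illustration, not the PAM software itself: the standardization constant m_k = sqrt(1/n_k − 1/n), the use of the median pooled standard deviation as the offset s0, the tie-breaking rule, and all function names and toy data are assumptions of this sketch.

```python
import math
import random
import statistics


def fit_shrunken_centroids(X, y, delta):
    """Soft-threshold standardized class centroids toward the overall centroid."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    overall = [sum(row[i] for row in X) / n for i in range(p)]
    counts = {k: sum(1 for lab in y if lab == k) for k in classes}
    cents = {k: [sum(r[i] for r, lab in zip(X, y) if lab == k) / counts[k]
                 for i in range(p)] for k in classes}
    # Pooled within-class standard deviation for each gene.
    s = [math.sqrt(sum((r[i] - cents[lab][i]) ** 2 for r, lab in zip(X, y))
                   / (n - len(classes))) for i in range(p)]
    s0 = statistics.median(s)
    shrunk = {}
    for k in classes:
        m_k = math.sqrt(1.0 / counts[k] - 1.0 / n)  # assumed standardization
        shrunk[k] = []
        for i in range(p):
            scale = m_k * (s[i] + s0)
            d = (cents[k][i] - overall[i]) / scale
            d = math.copysign(max(abs(d) - delta, 0.0), d)  # soft threshold
            shrunk[k].append(overall[i] + scale * d)
    priors = {k: counts[k] / n for k in classes}
    return shrunk, s, s0, priors


def predict(model, x):
    """Assign x to the class with the smallest standardized distance score."""
    shrunk, s, s0, priors = model

    def score(k):
        dist = sum(((xi - ci) / (si + s0)) ** 2
                   for xi, ci, si in zip(x, shrunk[k], s))
        return dist - 2.0 * math.log(priors[k])

    return min(shrunk, key=score)


def cv_error(X, y, delta, n_folds=10, seed=0):
    """Misclassification rate from n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(n_folds):
        held_out = idx[f::n_folds]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_shrunken_centroids([X[i] for i in train],
                                       [y[i] for i in train], delta)
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(X)


def choose_delta(X, y, grid, n_folds=10):
    """Pick the shrinkage value with the smallest cross-validated error."""
    return min(grid, key=lambda d: cv_error(X, y, d, n_folds))


def toy_data(seed=1, n_per_class=20, p=5):
    """Hypothetical data: only the first gene separates the two classes."""
    rng = random.Random(seed)
    y = [0] * n_per_class + [1] * n_per_class
    X = [[rng.gauss(4.0 if (lab == 1 and i == 0) else 0.0, 1.0)
          for i in range(p)] for lab in y]
    return X, y
```

With `X, y = toy_data()`, calling `choose_delta(X, y, [0.0, 1.0, 100.0])` compares no shrinkage, moderate shrinkage, and shrinkage so heavy that every centroid collapses to the overall centroid. Ties are resolved here in favor of the first value in the grid, which is a convenience of this sketch rather than a recommendation.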

7. Discussion

We have proposed a useful two-component mixture model for analyzing gene expression data. The proposed model is especially useful in situations where bimodality exists in the gene expression distributions. Such bimodality is common when there are many non-expressed as well as expressed genes for a given tissue type. In our simulation studies, the proposed methodology outperformed other well-known methods for detecting differentially expressed genes. Future work includes further evaluations of this methodology on real datasets to see if the proposed methods can uncover differentially expressed genes when the biological truth is known. Another point of further evaluation is to apply the proposed methodology to the Affymetrix Latin square database to see if one can recover genes that are known to be differentially expressed in various settings.

Extensions of our proposed model to three or more components are straightforward, accommodating situations involving underexpressed, expressed, and overexpressed genes in a given dataset. For example, suppose we consider the following three-component mixture:

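$$p(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, \phi_{jg}, \eta_{jg}^2) = p_{1jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) + p_{2jg}\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2) + p_{3jg}\, p_3(y_{jgi} \mid \phi_{jg}, \eta_{jg}^2), \tag{7.1}$$

where $p_{ljg} \ge 0$, $l = 1, 2, 3$, $p_{1jg} + p_{2jg} + p_{3jg} = 1$,

$$p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2) = (2\pi)^{-1/2}\, |h'(y_{jgi})|\, \tau_{jg}^{-1} \exp\left\{ -\frac{1}{2\tau_{jg}^2} \big(h(y_{jgi}) - \alpha_{jg}\big)^2 \right\},$$

$$p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2) = (2\pi)^{-1/2}\, |h'(y_{jgi})|\, \sigma_{jg}^{-1} \exp\left\{ -\frac{1}{2\sigma_{jg}^2} \big(h(y_{jgi}) - \mu_{jg}\big)^2 \right\},$$

and

$$p_3(y_{jgi} \mid \phi_{jg}, \eta_{jg}^2) = (2\pi)^{-1/2}\, |h'(y_{jgi})|\, \eta_{jg}^{-1} \exp\left\{ -\frac{1}{2\eta_{jg}^2} \big(h(y_{jgi}) - \phi_{jg}\big)^2 \right\}.$$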
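As a numerical illustration of the three-component mixture (7.1), the sketch below evaluates the density under the assumption that h is the natural logarithm, so that each component is log-normal and the Jacobian term is 1/y. The function names, weights, and parameter values are hypothetical and chosen only to show the structure of the mixture.

```python
import math


def transformed_normal_pdf(y, loc, var):
    """Density of y > 0 when h(y) = log(y) is normal with mean loc, variance var.

    The factor 1 / y is the Jacobian |h'(y)| for h = log (an assumption here).
    """
    z = math.log(y) - loc
    return (1.0 / y) * math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)


def mixture3_pdf(y, weights, params):
    """Three-component mixture density.

    weights: (p1, p2, p3), nonnegative and summing to one.
    params:  ((alpha, tau2), (mu, sigma2), (phi, eta2)) for the underexpressed,
             expressed, and overexpressed components, respectively.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to one")
    return sum(w * transformed_normal_pdf(y, loc, var)
               for w, (loc, var) in zip(weights, params))


# Hypothetical settings: three well-separated components on the log scale.
weights = (0.3, 0.5, 0.2)
params = ((0.0, 0.25), (2.0, 0.25), (4.0, 0.25))
density_at_e2 = mixture3_pdf(math.exp(2.0), weights, params)
```

Since each component is a proper density and the weights sum to one, the mixture integrates to one over y > 0, which can be checked numerically on the log scale.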

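In (7.1), $p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)$, $p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2)$, and $p_3(y_{jgi} \mid \phi_{jg}, \eta_{jg}^2)$ are the distributions of $y_{jgi}$ for the underexpressed, expressed, and overexpressed genes, respectively. An equivalent model to (7.1) with random threshold parameters can be constructed as follows. Let $c_{jgi} = (c_{1jgi}, c_{2jgi})'$ denote a vector of two random threshold parameters such that the gene is underexpressed if $y_{jgi} \le c_{1jgi}$ and expressed if $c_{1jgi} < y_{jgi} \le c_{2jgi}$ for the jth tissue type, ith individual, and gth gene. Assume that the $c_{jgi}$’s are i.i.d. with a continuous bivariate distribution $p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})$ with support $\Omega_c = \{0 < c_{1jgi} < c_{2jgi} < \infty\}$. Let $A_{1jgi} = \{y: y \le c_{1jgi}\}$, $A_{2jgi} = \{y: c_{1jgi} < y \le c_{2jgi}\}$, and $A_{3jgi} = \{y: y > c_{2jgi}\}$, and suppose $(y_{jgi}, c_{jgi})$ have the following joint distribution:

$$\begin{aligned} p(y_{jgi}, c_{jgi} \mid \alpha_{jg}, \tau_{jg}^2, \mu_{jg}, \sigma_{jg}^2, \phi_{jg}, \eta_{jg}^2, c_{0jg}, \Sigma_{0cjg}) = {} & \Big[ p_{1jg}\, p_1(y_{jgi} \mid \alpha_{jg}, \tau_{jg}^2)\, 1_{A_{1jgi}}(y_{jgi}) / q_{1y_{jgi}} + p_{2jg}\, p_2(y_{jgi} \mid \mu_{jg}, \sigma_{jg}^2)\, 1_{A_{2jgi}}(y_{jgi}) / q_{2y_{jgi}} \\ & + p_{3jg}\, p_3(y_{jgi} \mid \phi_{jg}, \eta_{jg}^2)\, 1_{A_{3jgi}}(y_{jgi}) / q_{3y_{jgi}} \Big]\, p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg}), \end{aligned} \tag{7.2}$$

where $q_{1y_{jgi}} = \int_{y_{jgi} \le c_{1jgi} < c_{2jgi}} p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})\, dc_{jgi}$, $q_{2y_{jgi}} = \int_{c_{1jgi} < y_{jgi} \le c_{2jgi}} p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})\, dc_{jgi}$, and $q_{3y_{jgi}} = \int_{c_{1jgi} < c_{2jgi} < y_{jgi}} p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})\, dc_{jgi}$. Following the proof of Identity 2.1, we can show that (7.2) reduces to (7.1) after integrating out $c_{jgi}$. Similar to the two-component mixture model (2.1), we take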

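$$\begin{aligned} p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg}) \propto {} & |h'(c_{1jgi})\, h'(c_{2jgi})|\, |\Sigma_{0cjg}|^{-1/2} \\ & \times \exp\left\{ -\frac{1}{2} \big(h(c_{jgi}) - h(c_{0jg})\big)' \Sigma_{0cjg}^{-1} \big(h(c_{jgi}) - h(c_{0jg})\big) \right\}, \quad 0 < c_{1jgi} < c_{2jgi} < \infty, \end{aligned}$$

where $h(c_{jgi}) = (h(c_{1jgi}), h(c_{2jgi}))'$ and $h(c_{0jg}) = (h(c_{01jg}), h(c_{02jg}))'$. We then elicit the hyperparameters $c_{0jg}$ and $\Sigma_{0cjg}$ using the summary statistics of the quantiles of the $y_{jgi}$’s.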
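The claim that (7.2) reduces to (7.1) can be checked in one step. The following is a sketch of that step, reading the bracket in (7.2) as the sum of three truncated component terms, assuming each q is strictly positive, and writing θ_{ljg} (a shorthand introduced here, not used in the original) for the parameters of the lth component. For fixed y, the set of thresholds for which y falls in the lth region is exactly the integration region that defines the corresponding q, so each ratio cancels.

```latex
\begin{aligned}
\int_{\Omega_c} 1_{A_{ljgi}}(y_{jgi})\, p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})\, dc_{jgi}
  &= q_{l y_{jgi}}, \qquad l = 1, 2, 3, \\
\int_{\Omega_c} p(y_{jgi}, c_{jgi} \mid \cdot)\, dc_{jgi}
  &= \sum_{l=1}^{3} \frac{p_{ljg}\, p_l(y_{jgi} \mid \theta_{ljg})}{q_{l y_{jgi}}}
     \int_{\Omega_c} 1_{A_{ljgi}}(y_{jgi})\, p(c_{jgi} \mid c_{0jg}, \Sigma_{0cjg})\, dc_{jgi} \\
  &= \sum_{l=1}^{3} p_{ljg}\, p_l(y_{jgi} \mid \theta_{ljg}),
\end{aligned}
```

The last line is the three-component mixture (7.1), so the marginal distribution of the expression level does not depend on the threshold distribution.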

References

  1. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001;17:509–519. doi: 10.1093/bioinformatics/17.6.509.
  2. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics. 1997;4:364–374. doi: 10.1117/12.281504.
  3. Chen X, Leung S, Yuen ST, Chu K-M, Ji J, Li R, Chan SY, Law S, Troyanskaya OG, Wong J, Samuel S, Botstein D, Brown PO. Variation in gene expression patterns in human gastric cancers. Molecular Biol. Cell. 2003;14:3208–3215. doi: 10.1091/mbc.E02-12-0833.
  4. Do K-A, Mueller P, Tang F. A Bayesian mixture model for differential gene expression. Appl. Statistics. 2005;54:611–626.
  5. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Statist. Sinica. 2002;12:111–139.
  6. Efron B, Tibshirani R, Storey J, Tusher VG. Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 2001;96:1151–1160.
  7. Hein A-M, Richardson S, Causton HC, Ambler GK, Green PJ. BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data. Biostatistics. 2005;6:349–373. doi: 10.1093/biostatistics/kxi016.
  8. Ibrahim JG, Chen M-H, Gray RJ. Bayesian models for gene expression with DNA microarray data. J. Amer. Statist. Assoc. 2002;97:88–99.
  9. Ishwaran H, Rao JS. Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Statist. Assoc. 2003;98:438–455.
  10. Ishwaran H, Rao JS. Spike and slab gene selection of multigroup microarray data. J. Amer. Statist. Assoc. 2005;100:764–780.
  11. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statist. Med. 2003;22:3899–3914. doi: 10.1002/sim.1548.
  12. Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J. Comput. Biol. 2000;7:819–837. doi: 10.1089/10665270050514954.
  13. Lee MLT, Lu W, Whitmore GA, Beier D. Models for microarray gene expression data. J. Biopharm. Statist. 2002;12:1–19. doi: 10.1081/bip-120005737.
  14. Liu D, Parmigiani G, Caffo B. Screening for differentially expressed genes: are multilevel models helpful? Technical Report, Department of Biostatistics, Johns Hopkins University; 2004.
  15. Lonnstedt I, Speed T. Replicated microarray data. Statist. Sinica. 2002;12:31–46.
  16. Mueller P, Parmigiani G, Robert C, Rousseau J. Optimal sample size for multiple testing: the case of gene expression microarrays. J. Amer. Statist. Assoc. 2004;99:990–1001.
  17. Mutter GL, Baak JPA, Fitzgerald JT, Gray RJ, Neuberg D, Kust GA, Gentleman R, Gullens SR, Wei LJ, Wilcox M. Global expression changes of constitutive and hormonally regulated genes during endometrial neoplastic transformation. Gynecol. Oncol. 2001;83:177–185. doi: 10.1006/gyno.2001.6352.
  18. Newton MA, Kendziorski CM. Parametric empirical Bayes methods for microarrays. In: The Analysis of Gene Expression Data: An Overview of Methods and Software. New York: Springer; 2003. pp. 254–271.
  19. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 2001;8:37–52. doi: 10.1089/106652701300099074.
  20. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5:155–176. doi: 10.1093/biostatistics/5.2.155.
  21. Olshen AB, Jain AN. Deriving quantitative conclusions from microarray data. Bioinformatics. 2002;18:961–970. doi: 10.1093/bioinformatics/18.7.961.
  22. Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E. A statistical framework for expression-based molecular classification in cancer. J. Roy. Statist. Soc. Ser. B. 2002;64:717–736.
  23. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data: An Overview of Methods and Software. New York: Springer; 2003.
  24. Parkin DM, Pisani P, Ferlay J. Estimates of the worldwide incidence of 25 major cancers in 1990. Internat. J. Cancer. 1999;80:827–841. doi: 10.1002/(sici)1097-0215(19990315)80:6<827::aid-ijc6>3.0.co;2-p.
  25. Sebastiani P, Gussoni E, Kohane IS, Ramoni MF. Statistical challenges in functional genomics (with discussion). Statist. Sci. 2003;18:33–70.
  26. Storey J, Tibshirani R. SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: The Analysis of Gene Expression Data: An Overview of Methods and Software. New York: Springer; 2003. pp. 272–290.
  27. Tadesse MG, Ibrahim JG, Mutter G. Identification of differentially expressed genes in high-density oligonucleotide arrays accounting for the quantification limits of the technology. Biometrics. 2003;59:542–554. doi: 10.1111/1541-0420.00064.
  28. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Nat. Acad. Sci. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  29. Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics. 2002;18:1454–1461. doi: 10.1093/bioinformatics/18.11.1454.
  30. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
  31. West M. Bayesian factor regression models in the “large p, small n” paradigm. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics 7. Oxford: Oxford University Press; 2003. pp. 733–742.
