Author manuscript; available in PMC: 2018 Oct 25.
Published in final edited form as: Biometrics. 2011 Apr 22;67(4):1617–1626. doi: 10.1111/j.1541-0420.2011.01602.x

An Empirical Bayes Approach to Joint Analysis of Multiple Microarray Gene Expression Studies

Lingyan Ruan 1,*, Ming Yuan 1,**
PMCID: PMC6201754  NIHMSID: NIHMS282002  PMID: 21517790

Summary

With the prevalence of gene expression studies and the relatively low reproducibility caused by insufficient sample sizes, it is natural to consider joint analysis that combines data from different experiments to achieve improved accuracy. We present in this paper a model-based approach for better identification of differentially expressed genes by incorporating data from different studies. The model seamlessly accommodates a wide range of studies, including those performed on different platforms (by fitting each data set with its own set of parameters) and/or under different but overlapping biological conditions. Model-based inference is carried out in an empirical Bayes fashion. Because of the information sharing among studies, the joint analysis dramatically improves upon inferences based on individual analyses. Simulation studies and real data examples are presented to demonstrate the effectiveness of the proposed approach under a variety of complications that often arise in practice.

Keywords: Empirical Bayes, gene expression, joint analysis, mixture model

1. Introduction

Microarray technology has presented unprecedented opportunities in genomic studies of complex diseases. It allows researchers to simultaneously monitor thousands of transcripts and discover novel biomarkers and genes. Despite their successes, these studies are often hampered by their relatively low reproducibility. This deficiency is often attributed to the high variability of gene expression measurements: sources of distortion and noise are involved in almost every step of the measurement process. It has long been recognized (e.g., Lee et al., 2000; Mukherjee et al., 2003) that this problem could be alleviated through increased sample size. However, experiments with limited sample sizes remain common due to economic considerations. The recent explosion in popularity of high-throughput gene expression studies offers a more cost-effective alternative. With studies of the same diseases carried out independently by different research groups, it is natural to consider efficient ways of combining these data and jointly analyzing them. Through information sharing across studies, the accuracy of inferences could be greatly improved.

Because of its great potential, joint analysis of multiple experiments has attracted much attention in recent years. See, for example, Choi and Ghosh (2008) for a recent review. It is most commonly done through cross-experiment data normalization and transformation, which aims at translating and normalizing measurements from different sources onto a common scale to allow for integration. In particular, Jiang et al. (2004) present a gene shaving method based on random forests (Breiman, 2001) and Fisher’s linear discriminant analysis. Warnat, Eils and Brors (2005) and Shabalin et al. (2008) also discuss different ways of integrating data through cross-experiment transformation. In general, however, it is difficult to integrate data without information loss, which can heavily bias each study, and there are no clear guidelines as to how the integration can be performed efficiently. For example, van’t Veer et al. (2002) and Wang et al. (2005) ended up with different predictive gene subsets with only three genes in common. Alternatively, one can also combine individual analysis results summarized by t-statistics, p-values, scored gene lists and so on (e.g., Rhodes et al., 2002; Choi et al., 2003; Ghosh et al., 2003; Parmigiani et al., 2004; Pyne, Futcher and Skiena, 2006; Garrett-Mayer et al., 2007; Shen, Ghosh and Chinnaiyan, 2007). In particular, Choi et al. (2003) propose to combine the effect sizes of genes from each study and conduct a permutation test to determine the significance level. Rhodes et al. (2002) and Pyne et al. (2006) consider ways of combining the p-values from each study. Due to the small sample size of each study, the summary statistics obtained inevitably have high variability, and consequently these methods are subject to loss of efficiency in information sharing, as illustrated by the studies of van’t Veer et al. (2002) and Wang et al. (2005) mentioned above. It is also demonstrated by Mah et al. (2004) that genes detected on different platforms could have poor overlap. See Hong and Breitling (2008) for a comparison of methods and Rhodes et al. (2004), Parmigiani et al. (2004) and Scharpf et al. (2009) for other approaches.

There are also several major practical hurdles to joint analysis. In particular, there is no general consensus on how gene expression experiments should be conducted. As a result, the choice of sample cohorts (e.g., age, ethnicity, and phase of disease), experiment platforms (e.g., cDNA or oligonucleotide), and processing facilities may all be different, and the scale of observations may not be comparable. These variations among experiments prohibit us from treating them as if they were simple replicates from a single study. In particular, Kuo et al. (2002) compared Affymetrix and spotted cDNA arrays and reported that the correlation between the measurements from the two platforms was fairly low, so that it was unlikely that the two types of data could be transformed or normalized into a common standardized index. In practice, integrating multiple studies can be further complicated by missing data and, sometimes, a mismatch in biological conditions.

Consider, for illustration purposes, the study of prostate cancer, the most commonly diagnosed cancer in men. There are a host of gene expression studies of prostate cancer. To motivate our work, microarray data were collected from four publicly available prostate cancer gene expression datasets generated independently by Dhanasekaran et al. (2001), Luo et al. (2001), Magee et al. (2001) and Welsh et al. (2001), respectively. One of the goals common to all four studies is to determine which genes are differentially expressed between locally advanced prostate cancer and benign tissue. The experiments, however, were done with different technologies. The Dhanasekaran et al. (2001) and Luo et al. (2001) studies used spotted cDNA microarrays, while the other two experiments utilized Affymetrix GeneChips. In particular, the experiment of Magee et al. (2001) was conducted using the HU6800 chip and that of Welsh et al. (2001) was done on the U95A chip. Furthermore, these studies were performed on different but overlapping sets of genes. To overcome this problem, existing methods (see, e.g., Rhodes et al., 2002; Ghosh et al., 2003; Warnat et al., 2005) focus only on genes that are present in all studies. As we shall see in Section 4, such practice may result in more than 75% of the genes being discarded in some studies. Moreover, the remaining 25% of genes contain missing data, i.e., not all genes have complete observations from the samples tested. If the method applied cannot accommodate missing data, the analysis is reduced to only one gene (the only gene that is both in the intersection and completely observed). This is clearly not an effective way of using the data. Another complication in combining the four experiments is the mismatch in biological conditions. Although all four studies include comparisons between locally advanced prostate cancer and benign prostate, Dhanasekaran et al. (2001) and Magee et al. (2001) also included a third biological condition: metastatic prostate cancer. Earlier attempts to combine these studies have either discarded the data collected under this condition or combined them with locally advanced cancer to form a new hypothesis.

These limitations prompt us to develop a new technique. In this paper, we propose a model-based method to integrate information from multiple experiments for the purpose of identifying differentially expressed genes among multiple biological conditions. Following Newton et al. (2001) and Kendziorski et al. (2003), we model the data from each individual study by a parametric empirical Bayes model to share information across transcripts. Within this framework, all data present in every study can be used for analysis, not only the intersection genes. For simplicity, we assume that the genes entering the analysis are concordant across studies. This assumption concerns the choice of genes rather than the proposed method itself: the union of genes can be further screened by removing discordant genes; see Garrett-Mayer et al. (2007) for discussion. The separate models are flexible enough to be applicable to different platforms and multiple biological conditions. Latent variables are then introduced to model the pattern of expression for a particular transcript and to share information across experiments. The modeling framework is fairly flexible and can handle a variety of practical issues, including those mentioned above, with ease.

The rest of this paper is organized as follows. In the next section, we introduce the general modeling framework and show how statistical inferences can be efficiently conducted. Section 3 presents simulation studies to demonstrate the merits and versatility of the proposed method. We revisit the prostate cancer examples in Section 4 as well as another real data example before concluding with some remarks and discussions in Section 5.

2. Model and Inference

2.1 Parametric Empirical Bayes Model for A Single Study

We begin with modeling gene expression data from a single study. Various methods have been developed for such purposes. Interested readers are referred to Parmigiani et al. (2003), Allison et al. (2006) and Do, Müller and Vannucci (2006) for recent surveys. Here we adopt a parametric empirical Bayes approach introduced by Newton et al. (2001) and Kendziorski et al. (2003).

Let $x_{gcr}$ be the gene expression measurement taken from the rth replicate under condition c for gene g. Take the data from Dhanasekaran et al. (2001) as an example: three biological conditions (c = 1, 2, or 3), namely benign prostate, localized prostate cancer and metastatic prostate cancer, and 4,839 genes (g = 1, 2, …, 4839) are considered. A total of 14 replicates (r = 1, 2, …, 14) are obtained for benign prostate, 14 for localized prostate cancer and 20 for metastatic prostate cancer.

To fix ideas, we focus on two conditions (c = 1 or 2) in what follows. Sensible expression patterns concerning the comparison between two conditions for a particular gene include equivalent expression (EE) and differential expression (DE). This can be formulated through latent variables $\mu_{gc}$ representing a population level of expression for gene g under biological condition c. Equivalent expression means that $\mu_{g1} = \mu_{g2}$ whereas differential expression indicates $\mu_{g1} \neq \mu_{g2}$. Our goal is therefore to infer such expression patterns from $x_{g1\cdot} = (x_{g11}, x_{g12}, \ldots, x_{g1n_1})$ and $x_{g2\cdot} = (x_{g21}, x_{g22}, \ldots, x_{g2n_2})$, where $n_1$ and $n_2$ are the numbers of replicates obtained under the two conditions, respectively. It is not hard to see that the marginal distribution of $(x_{g1\cdot}, x_{g2\cdot})$ is

$$f(x_{g1\cdot}, x_{g2\cdot}) = (1-\pi)\, f(x_{g1\cdot}, x_{g2\cdot} \mid \mathrm{EE}) + \pi\, f(x_{g1\cdot}, x_{g2\cdot} \mid \mathrm{DE}), \tag{1}$$

where we use f to denote a generic density function, marginal or conditional; and π = P (DE). The two conditional distributions can be modeled through a two level hierarchical model:

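$$f(x_{g1\cdot}, x_{g2\cdot} \mid \mathrm{EE}) = \int \Big\{\prod_{k=1}^{n_1} f(x_{g1k} \mid \mu_{g1} = \mu; \theta)\Big\} \Big\{\prod_{k=1}^{n_2} f(x_{g2k} \mid \mu_{g2} = \mu; \theta)\Big\} f(\mu; \tau)\, d\mu; \tag{2}$$
$$f(x_{g1\cdot}, x_{g2\cdot} \mid \mathrm{DE}) = \iint \Big\{\prod_{k=1}^{n_1} f(x_{g1k} \mid \mu_{g1}; \theta)\Big\} \times \Big\{\prod_{k=1}^{n_2} f(x_{g2k} \mid \mu_{g2}; \theta)\Big\} f(\mu_{g1}; \tau)\, f(\mu_{g2}; \tau)\, d\mu_{g1}\, d\mu_{g2}; \tag{3}$$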

where θ and τ are parameters shared by all genes and determined by the experiment characteristics.

Two particular choices of f(·|μ; θ) and f(·; τ) are advocated, often referred to as the lognormal-normal (LNN) model and the Gamma-Gamma (GG) model. In the LNN model, f(·|μ; θ) is a lognormal distribution, i.e.,

$$f(x \mid \mu; \theta) = \frac{1}{\sqrt{2\pi\theta}} \exp\Big\{-\frac{(\ln x - \mu)^2}{2\theta}\Big\}; \tag{4}$$

whereas f(·; τ) is a normal distribution, with τ = (τ1, τ2)′ representing its mean and variance, respectively. Alternatively, for the GG model, f(·|μ; θ) is a Gamma distribution, i.e.,

$$f(x \mid \mu; \theta) = \frac{\lambda^{\theta}}{\Gamma(\theta)}\, x^{\theta - 1} \exp(-\lambda x), \tag{5}$$

where θ is the shape parameter and the rate parameter is given by λ = θ/μ, so that the mean is μ. The prior f(·; τ) is chosen such that λ also follows a Gamma distribution:

$$f(\lambda; \tau_1, \tau_2) = \frac{\tau_2^{\tau_1}}{\Gamma(\tau_1)}\, \lambda^{\tau_1 - 1} \exp(-\tau_2 \lambda). \tag{6}$$

Closed-form expressions are available for $f(x_{g1\cdot}, x_{g2\cdot} \mid \mu_{g1} = \mu_{g2})$ and $f(x_{g1\cdot}, x_{g2\cdot} \mid \mu_{g1} \neq \mu_{g2})$ under both the LNN and GG models. Readers are referred to Kendziorski et al. (2003) for further details.
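To make the closed form concrete for the LNN case, the following is a minimal sketch, not the authors' implementation. It works on the log scale, where the replicates sharing one latent mean are jointly normal with mean τ1 and covariance θI + τ2J (J being the all-ones matrix). The Jacobian of the log transform is omitted because it is common to EE and DE and cancels in posterior odds. The function names are hypothetical.

```python
import numpy as np

def lnn_log_marginal(y, tau1, tau2, theta):
    """Log marginal density of log-expression values y sharing one latent mean.

    Assumed model: y_k | mu ~ N(mu, theta), mu ~ N(tau1, tau2).
    Integrating out mu gives a normal vector with mean tau1 and
    covariance theta * I + tau2 * J.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    d = y - tau1
    # Eigenvalues of theta*I + tau2*J are theta (n - 1 times) and theta + n*tau2.
    log_det = (n - 1) * np.log(theta) + np.log(theta + n * tau2)
    # Quadratic form via the Sherman-Morrison inverse of theta*I + tau2*J.
    quad = (np.sum(d ** 2) - tau2 / (theta + n * tau2) * np.sum(d) ** 2) / theta
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)

def lnn_log_ee(x1, x2, tau1, tau2, theta):
    """Equivalent expression: both conditions share a single latent mean."""
    y = np.log(np.concatenate([x1, x2]))
    return lnn_log_marginal(y, tau1, tau2, theta)

def lnn_log_de(x1, x2, tau1, tau2, theta):
    """Differential expression: each condition has its own latent mean."""
    return (lnn_log_marginal(np.log(x1), tau1, tau2, theta)
            + lnn_log_marginal(np.log(x2), tau1, tau2, theta))

# Toy gene whose two conditions are clearly separated on the log scale.
x1 = np.exp([1.0, 1.1, 0.9])
x2 = np.exp([3.0, 2.9, 3.1])
log_odds = lnn_log_de(x1, x2, 2.0, 1.0, 0.01) - lnn_log_ee(x1, x2, 2.0, 1.0, 0.01)
```

A positive `log_odds` means the data favor DE over EE before the prior weight π is taken into account.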

2.2 Joint Modeling with Multiple Studies

We now consider multiple studies. For brevity, we shall first assume that in each study, the same set of genes (g = 1, 2, …, G) and the same set of conditions (c = 1, 2, …, C) are considered. This assumption will later be relaxed. With slight abuse of notation, let $X_s := \{x_{sgcr} : g = 1, \ldots, G;\ c = 1, \ldots, C;\ r = 1, \ldots, n_{sc}\}$ be the gene expression measurements obtained in the sth study (s = 1, 2, …, S), where $n_{sc}$ is the number of replicates under condition c in that study. Clearly $X_s$ can be modeled using the parametric empirical Bayes model discussed before. The hierarchical modeling can be summarized by the diagram below:

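[Diagram (nihms282002u1.jpg): expression pattern → latent expression levels μ, through f(μ; τ) → expression measurements x, through f(x|μ; θ)]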

The latent expression levels are determined stochastically by the expression pattern through distribution f(μ; τ) whereas the expression measurement by the latent levels through conditional distribution f(x|μ; θ). Parameters θ and τ reflect the stochastic variation within a study and therefore are allowed to be experiment-dependent. This is, in particular, necessary when handling studies from different platforms due to their difference in scales. To this end, we shall write θs and τs in what follows to emphasize the dependence between these parameters and the study. On the other hand, given that the same biological process is studied, a gene’s differential expression pattern should remain the same across all studies.

Let x·gc· = {xsgcr: s = 1, …, S; r = 1, …, nsc} be the collection of all expression measurements obtained from all studies on gene g and condition c. Then the conditional distribution of these measurements under the two differential expression patterns can be given by

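$$f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{DE}) = \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{DE}); \tag{7}$$
$$f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{EE}) = \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{EE}), \tag{8}$$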

where the experiment specific conditional distributions are given in the previous subsection. In other words, for a randomly picked gene, its marginal distribution will follow a two-component mixture distribution:

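$$f(x_{\cdot g\cdot\cdot}) = \pi \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{DE}) + (1-\pi) \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{EE}), \tag{9}$$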

where π is the probability that a randomly picked gene is differentially expressed. Note that from (9), data collected from different studies are not independent under our joint modeling framework because

$$f(x_{\cdot g\cdot\cdot}) \neq \prod_{s=1}^{S} f(x_{sg\cdot\cdot}),$$

where f(xsg··) is the marginal density of the data collected from Study s as given by (1).

2.3 Empirical Bayes Inference

If the experiment specific parameters θs and τs, s = 1, …, S are known, inference on a gene’s expression pattern can be conducted through their posterior probabilities, i.e.,

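$$P(\mathrm{DE} \mid x_{\cdot g1\cdot}, x_{\cdot g2\cdot}) = \frac{\pi\, f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{DE})}{\pi\, f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{DE}) + (1-\pi)\, f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{EE})}, \tag{10}$$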

where π = P(DE) is the probability that a randomly selected gene is differentially expressed. According to Bayes rule, we classify a gene as differentially expressed if the posterior probability of differential expression is greater than 50% and equivalent expression otherwise. These posterior probabilities provide a natural means of inferring differential expression by integrating multiple studies.
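As a small illustration of (10), the sketch below combines per-study log densities into a joint posterior probability of differential expression. It assumes the per-study values log f(x_sg | DE) and log f(x_sg | EE) have already been computed and stored in arrays of shape (S, G); the function name and array layout are illustrative choices, not part of the original method description.

```python
import numpy as np

def joint_posterior_de(log_f_de, log_f_ee, pi):
    """Posterior P(DE | data from all studies) for each gene, following (10).

    log_f_de, log_f_ee: arrays of shape (S, G) holding per-study log densities.
    pi: prior probability that a gene is differentially expressed.
    """
    # The product over studies in (7) and (8) becomes a sum on the log scale.
    log_de = np.log(pi) + np.sum(log_f_de, axis=0)
    log_ee = np.log1p(-pi) + np.sum(log_f_ee, axis=0)
    # 1 / (1 + exp(log_ee - log_de)), evaluated in a numerically stable way.
    return np.exp(-np.logaddexp(0.0, log_ee - log_de))

# Toy input: two studies, one gene, densities chosen by hand.
log_f_de = np.log(np.array([[0.2], [0.3]]))
log_f_ee = np.log(np.array([[0.1], [0.1]]))
post = joint_posterior_de(log_f_de, log_f_ee, pi=0.1)
called_de = post > 0.5  # Bayes rule: call DE when the posterior exceeds 50%
```

With these toy numbers the joint posterior (0.4) is larger than the posterior from either study alone (about 0.18 and 0.25), which illustrates how evidence accumulates across studies.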

Following Efron et al. (2001) and Newton et al. (2001), parameters {θs, τs: s = 1, …, S} as well as π can be estimated in an empirical Bayes fashion. Note that these parameters are shared by all genes. The log-likelihood for all data can then be given by

$$\ell(x_{\cdot\cdot 1\cdot}, x_{\cdot\cdot 2\cdot}) = \sum_{g=1}^{G} \ell(x_{\cdot g1\cdot}, x_{\cdot g2\cdot}),$$

where

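$$\ell(x_{\cdot g1\cdot}, x_{\cdot g2\cdot}) = \log\big\{(1-\pi)\, f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{EE}) + \pi\, f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot} \mid \mathrm{DE})\big\}.$$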

The maximum likelihood estimators of all parameters θs and τs, s = 1, …, S, and π can be efficiently computed using the EM algorithm by treating a gene’s differential expression pattern (i.e., EE or DE) as missing.

Denote by $z_g$ gene g’s differential expression pattern. Then the complete-data log-likelihood of the parameter $\eta := \{\pi, \theta_s, \tau_s : 1 \le s \le S\}$ is given by

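$$\ell(\eta; x_\cdot, z_\cdot) = \sum_{g=1}^{G} \Big[ \mathbf{1}(z_g = \mathrm{DE}) \Big\{ \log \pi + \sum_{s=1}^{S} \log f(x_{sg\cdot\cdot} \mid \mathrm{DE}; \theta_s, \tau_s) \Big\} + \mathbf{1}(z_g = \mathrm{EE}) \Big\{ \log(1-\pi) + \sum_{s=1}^{S} \log f(x_{sg\cdot\cdot} \mid \mathrm{EE}; \theta_s, \tau_s) \Big\} \Big].$$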

Let $\{\pi^{[t]}, \theta_s^{[t]}, \tau_s^{[t]} : 1 \le s \le S\}$ be the parameter estimates obtained from the tth iteration of the EM algorithm. Then in the (t + 1)th iteration, we first compute the expectation of the complete-data log-likelihood with respect to the $z_g$’s given x and these parameter estimates (the E step):

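$$Q(\eta) = E_{Z_g : 1 \le g \le G}\, \ell(\eta; x_\cdot, Z_\cdot) = \sum_{g=1}^{G} \Big[ T_g \Big\{ \log \pi + \sum_{s=1}^{S} \log f(x_{sg\cdot\cdot} \mid \mathrm{DE}; \theta_s, \tau_s) \Big\} + (1 - T_g) \Big\{ \log(1-\pi) + \sum_{s=1}^{S} \log f(x_{sg\cdot\cdot} \mid \mathrm{EE}; \theta_s, \tau_s) \Big\} \Big],$$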

where

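$$T_g = \frac{\pi^{[t]} \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{DE}; \theta_s^{[t]}, \tau_s^{[t]})}{\pi^{[t]} \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{DE}; \theta_s^{[t]}, \tau_s^{[t]}) + (1 - \pi^{[t]}) \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot} \mid \mathrm{EE}; \theta_s^{[t]}, \tau_s^{[t]})}.$$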

In the second step, also called the M step, we maximize Q with respect to η to obtain updated parameter estimates. In particular, it is clear that

$$\pi^{[t+1]} = \frac{1}{G} \sum_{g=1}^{G} T_g,$$

and (θs, τs) can be updated by the maximizer of

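$$Q_s(\theta_s, \tau_s) := \sum_{g=1}^{G} \big\{ T_g \log f(x_{sg\cdot\cdot} \mid \mathrm{DE}; \theta_s, \tau_s) + (1 - T_g) \log f(x_{sg\cdot\cdot} \mid \mathrm{EE}; \theta_s, \tau_s) \big\}.$$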
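The E and M steps above can be sketched as follows. This is a deliberately simplified illustration: it holds (θs, τs) fixed, so the per-study log densities are precomputed arrays of shape (S, G), and it iterates only the update of π. A full implementation would also maximize each Q_s numerically inside the loop; `q_s` is included only to show the objective. All function names are hypothetical.

```python
import numpy as np

def e_step(log_f_de, log_f_ee, pi):
    """T_g: posterior probability of DE for each gene at the current pi."""
    a = (np.log1p(-pi) + log_f_ee.sum(axis=0)) - (np.log(pi) + log_f_de.sum(axis=0))
    return np.exp(-np.logaddexp(0.0, a))

def q_s(T, log_f_de_s, log_f_ee_s):
    """Study-specific objective Q_s for given weights T_g (not maximized here)."""
    return np.sum(T * log_f_de_s + (1.0 - T) * log_f_ee_s)

def em_for_pi(log_f_de, log_f_ee, pi0=0.5, n_iter=200, tol=1e-10):
    """EM iterations for pi only, with (theta_s, tau_s) held fixed (an assumption)."""
    pi = pi0
    for _ in range(n_iter):
        T = e_step(log_f_de, log_f_ee, pi)   # E step
        new_pi = T.mean()                    # M step for pi: average of the T_g
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi, e_step(log_f_de, log_f_ee, pi)

# Toy input: one study, four genes, densities chosen by hand.
log_f_de = np.log(np.array([[0.9, 0.8, 0.1, 0.1]]))
log_f_ee = np.log(np.array([[0.1, 0.2, 0.9, 0.9]]))
pi_hat, T_hat = em_for_pi(log_f_de, log_f_ee)
```

At convergence `pi_hat` equals the mean of `T_hat`, which is the fixed-point form of the update for π given above.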

In principle, a prior can also be assigned to these parameters and fully Bayesian inference can be made for the hierarchical model. We opt for the empirical Bayes framework to avoid sophisticated and sometimes subjective prior elicitation.

2.4 Missing Data

As mentioned in Section 1, one of the most common difficulties associated with joint analysis is missing data. Due to limitations of technology and quality control, the set of genes measured in one data set may not be the same as that in another data set. In practice, only those genes measured across all experiments are usually included in the joint analysis. This can be a significant loss of information, as we shall see for the prostate cancer data in Section 4, where 30% to 75% of the data from each experiment are wasted if this approach is taken. In contrast, this problem can be conveniently addressed within our framework. Rather than considering only genes that are present in all experiments, we include all genes that appear in at least one experiment. If a particular gene is not present in an experiment, we treat it as missing data. More specifically, let $\mathcal{M}_g$ (⊂ {1, …, S}) be the collection of study indices where gene g is missing. Then the complete-data log-likelihood becomes

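$$\ell(\eta; x_\cdot, z_\cdot) = \sum_{g=1}^{G} \Big[ \mathbf{1}(z_g = \mathrm{DE}) \Big\{ \log \pi + \sum_{s \notin \mathcal{M}_g} \log f(x_{sg\cdot\cdot} \mid \mathrm{DE}; \theta_s, \tau_s) \Big\} + \mathbf{1}(z_g = \mathrm{EE}) \Big\{ \log(1-\pi) + \sum_{s \notin \mathcal{M}_g} \log f(x_{sg\cdot\cdot} \mid \mathrm{EE}; \theta_s, \tau_s) \Big\} \Big].$$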

The EM algorithm proceeds in exactly the same fashion as before, except that the index s for the products and summations over studies now runs over $s \notin \mathcal{M}_g$ instead of 1 ≤ s ≤ S for gene g.
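One way to express the restricted sums in code is with a boolean mask, as in the hedged sketch below. The array `observed` (shape (S, G), True when gene g was measured in study s) plays the role of the complement of M_g; its name and layout, like the function name, are assumptions made for illustration.

```python
import numpy as np

def joint_posterior_with_missing(log_f_de, log_f_ee, observed, pi):
    """Posterior P(DE) per gene, summing only over studies where the gene is observed.

    log_f_de, log_f_ee: arrays of shape (S, G); entries for missing genes are ignored
    (they may be NaN). observed: boolean array of shape (S, G).
    """
    # A missing (study, gene) pair contributes 0 to the log sum, i.e. a factor of 1.
    sum_de = np.where(observed, log_f_de, 0.0).sum(axis=0)
    sum_ee = np.where(observed, log_f_ee, 0.0).sum(axis=0)
    a = (np.log1p(-pi) + sum_ee) - (np.log(pi) + sum_de)
    return np.exp(-np.logaddexp(0.0, a))

# Toy input: gene 0 is seen in both studies, gene 1 only in study 2,
# gene 2 in neither (its posterior then falls back to the prior pi).
log_f_de = np.log(np.array([[0.2, np.nan, np.nan], [0.3, 0.3, np.nan]]))
log_f_ee = np.log(np.array([[0.1, np.nan, np.nan], [0.1, 0.1, np.nan]]))
observed = np.array([[True, False, False], [True, True, False]])
post = joint_posterior_with_missing(log_f_de, log_f_ee, observed, pi=0.1)
```

The same masking applies inside the E step of the EM algorithm, and each Q_s then sums only over the genes observed in study s.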

2.5 Multiple Conditions and Condition Mismatch

The proposed framework for joint analysis can be easily extended to handle more than two conditions. Consider, for example, the data taken from Dhanasekaran et al. (2001) where three biological conditions are investigated. For each condition, we introduce a latent gene expression level, μsgc, c = 1, 2 or 3. When comparing these conditions for gene g, we have the following equality or inequality conditions that may hold:

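$$\begin{aligned}
\text{Pattern 1}: &\quad \mu_{sg1} = \mu_{sg2} = \mu_{sg3},\\
\text{Pattern 2}: &\quad \mu_{sg1} = \mu_{sg2} \neq \mu_{sg3},\\
\text{Pattern 3}: &\quad \mu_{sg1} \neq \mu_{sg2} = \mu_{sg3},\\
\text{Pattern 4}: &\quad \mu_{sg1} = \mu_{sg3} \neq \mu_{sg2},\\
\text{Pattern 5}: &\quad \mu_{sg1} \neq \mu_{sg2} \neq \mu_{sg3}.
\end{aligned} \tag{11}$$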

Similar to before, these latent expression levels can be modeled by an experiment-specific distribution f(μ; τs): under Pattern 1, all three latent expression levels are obtained as a single sample from f(·; τs); under Pattern 2, μsg1 = μsg2 and μsg3 are two independent samples from f(·; τs); and so on. A formula similar to the earlier ones can therefore be derived for f(x·gc·|Pattern k):

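$$f(x_{\cdot g1\cdot}, x_{\cdot g2\cdot}, x_{\cdot g3\cdot} \mid \text{Pattern } k) = \prod_{s=1}^{S} f(x_{sg1\cdot}, x_{sg2\cdot}, x_{sg3\cdot} \mid \text{Pattern } k),$$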

where the conditional densities can be computed and the inferences can also be conducted in a similar fashion as before.

A practical challenge that often arises with multiple biological conditions is possible condition mismatch. Different experiments are designed to address and compare different but overlapping conditions. The overlap in biological conditions makes information sharing possible, but the difference in biological conditions makes it difficult. For example, among the four prostate cancer studies discussed in the introduction, Dhanasekaran et al. (2001) considered three conditions, namely benign prostate, localized prostate cancer and metastatic prostate cancer, whereas Luo et al. (2001) only investigated the first two conditions. A common practice is to ignore the data obtained under the third condition from Dhanasekaran et al. (2001) and compare the first two conditions through a joint analysis. Although a convenient and sensible solution, it is clearly not the most efficient way of using the data. In general, following this practice, when including multiple studies we can only use those conditions that are present in all studies. Furthermore, as we shall demonstrate by simulations in the next section, doing so may result in loss of efficiency as well.

The problem of condition mismatch can also be handled conveniently within our proposed framework of joint analysis. For illustration purposes, assume that one study has three conditions while the other lacks the third condition. With the third condition missing, it is evident that the expression measurements obtained from the second study have the following conditional distributions:

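$$f(x_{2g1\cdot}, x_{2g2\cdot} \mid \text{Pattern } k) = \begin{cases} f(x_{2g1\cdot}, x_{2g2\cdot} \mid \mathrm{EE}), & k = 1, 2,\\ f(x_{2g1\cdot}, x_{2g2\cdot} \mid \mathrm{DE}), & k = 3, 4, 5, \end{cases}$$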

where $f(x_{2g1\cdot}, x_{2g2\cdot} \mid \mathrm{EE})$ and $f(x_{2g1\cdot}, x_{2g2\cdot} \mid \mathrm{DE})$ are defined by (2) and (3), respectively. The posterior probability of Pattern k can therefore be evaluated as

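$$P(\text{Pattern } k \mid x_{\cdot g\cdot\cdot}) = \frac{\pi_k\, f(x_{1g\cdot\cdot} \mid \text{Pattern } k)\, f(x_{2g\cdot\cdot} \mid \text{Pattern } k)}{\sum_{j=1}^{5} \pi_j\, f(x_{1g\cdot\cdot} \mid \text{Pattern } j)\, f(x_{2g\cdot\cdot} \mid \text{Pattern } j)}.$$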

Parameter estimation can also be carried out in the same fashion as before.
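The mapping from the five patterns to what a two-condition study can see is easy to get wrong, so a small sketch may help. It assumes that the log densities of the three-condition study under Patterns 1 to 5 and of the two-condition study under EE and DE are already available for one gene; the names `TWO_CONDITION_VIEW` and `pattern_posterior` are hypothetical.

```python
import numpy as np

# Patterns 1..5 from (11), seen by a study that lacks condition 3:
# only the relation between mu_1 and mu_2 is identifiable there.
TWO_CONDITION_VIEW = ("EE", "EE", "DE", "DE", "DE")

def pattern_posterior(log_f1, log_f2_ee, log_f2_de, prior):
    """Posterior over Patterns 1..5 for one gene.

    log_f1: length-5 log densities of the three-condition study under each pattern.
    log_f2_ee, log_f2_de: log densities of the two-condition study under EE and DE.
    prior: length-5 prior probabilities (pi_1, ..., pi_5).
    """
    log_f1 = np.asarray(log_f1, dtype=float)
    prior = np.asarray(prior, dtype=float)
    log_f2 = np.array([log_f2_ee if view == "EE" else log_f2_de
                       for view in TWO_CONDITION_VIEW])
    log_post = np.log(prior) + log_f1 + log_f2
    # Normalize on the log scale to avoid underflow.
    return np.exp(log_post - np.logaddexp.reduce(log_post))

# Toy input: the first study is uninformative, the second favors DE between 1 and 2.
post = pattern_posterior(np.zeros(5), np.log(0.1), np.log(0.4), np.full(5, 0.2))
```

In this toy case the second study shifts posterior mass toward Patterns 3, 4 and 5 without distinguishing among them, which is exactly the information a two-condition study can contribute.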

3. Simulation Studies

3.1 Benefit of Joint Analysis

To demonstrate the effectiveness of the proposed method, we first conducted several sets of simulation studies. To demonstrate the benefit of joint analysis, we begin with a simple setting: two biological conditions and no missing data. A total of G = 5,000 genes and S = 4 experiments were simulated. For each experiment, $n_{sc} = 3$ replicates were simulated under each condition. The gene expression data were simulated from the LNN or GG model. Because the two models performed similarly, we report here only the results from the LNN model. The simulation settings for each experiment are similar to those previously employed by Kendziorski et al. (2003) to mimic real gene expression data and represent different experimental variations in practice. Denote by η = (τ1, τ2, θ) the parameters associated with the LNN model. The parameters of the four experiments are set at η1 = (2, 0.5², 0.15²), η2 = (5, 0.6², 0.25²), η3 = (15, 1², 0.35²) and η4 = (30, 1.2², 0.45²), respectively. These parameters are selected to mimic some of the main characteristics of the real data example to be presented later. Note that τ2 reflects the variation of the latent mean of the gene expression levels, so that larger values of τ2 correspond to better separation between the two conditions for differentially expressed genes. In particular, the four studies used in this simulation have average effect sizes of 1.62, 1.8, 2.64 and 3.23, respectively, where the average effect size is defined as the median of the effect sizes of the differentially expressed genes. A randomly chosen π = 10% of the genes are set to be differentially expressed. Both joint and separate analyses were conducted. In the separate analysis, each experiment is analyzed on its own using the empirical Bayes approach of Kendziorski et al. (2003), referred to as EBarrays. In the joint analysis, we apply the proposed approach. The operating characteristics of both analyses based upon 100 runs are summarized in Table 1.
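A data-generating step of this kind can be sketched as below. The sketch assumes that the second and third entries of each η are squared standard deviations (0.5², 0.15², and so on), that the same set of DE genes is shared by all four experiments, and that the random seed and function name are arbitrary illustrative choices; it is not the authors' simulation code.

```python
import numpy as np

def simulate_lnn_study(G, n_rep, tau1, tau2, theta, is_de, rng):
    """Simulate one two-condition study from the LNN model.

    Latent means are N(tau1, tau2); log measurements are N(mu, theta).
    EE genes share one latent mean; DE genes get two independent draws.
    """
    mu1 = rng.normal(tau1, np.sqrt(tau2), size=G)
    mu2 = np.where(is_de, rng.normal(tau1, np.sqrt(tau2), size=G), mu1)
    x1 = np.exp(rng.normal(mu1[:, None], np.sqrt(theta), size=(G, n_rep)))
    x2 = np.exp(rng.normal(mu2[:, None], np.sqrt(theta), size=(G, n_rep)))
    return x1, x2

rng = np.random.default_rng(0)
G, n_rep = 5000, 3

# 10% of genes are DE, and the DE status is shared by all studies.
is_de = np.zeros(G, dtype=bool)
is_de[rng.choice(G, size=G // 10, replace=False)] = True

# (tau1, tau2, theta) for the four experiments, read as squared standard deviations.
params = [(2, 0.5 ** 2, 0.15 ** 2), (5, 0.6 ** 2, 0.25 ** 2),
          (15, 1.0 ** 2, 0.35 ** 2), (30, 1.2 ** 2, 0.45 ** 2)]
studies = [simulate_lnn_study(G, n_rep, *p, is_de, rng) for p in params]
```

Each element of `studies` is a pair of (G, n_rep) arrays, one per condition, on the original (not log) scale.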

Table 1.

Operating characteristics of joint analysis and separate analysis. The results are summarized from 100 runs and all units are in percentages. The numbers in parentheses are the standard errors.

| | Joint Analysis | Separate Analysis: Experiment 1 | Separate Analysis: Experiment 2 | Separate Analysis: Experiment 3 | Separate Analysis: Experiment 4 |
|---|---|---|---|---|---|
| Sensitivity | 99.34 (0.037) | 53.16 (0.245) | 58.39 (0.258) | 78.87 (0.185) | 91.9 (0.109) |
| Specificity | 99.99 (0.001) | 99.48 (0.013) | 99.49 (0.012) | 99.78 (0.007) | 99.92 (0.004) |
| FDR | 0.1 (0.013) | 8.06 (0.173) | 7.24 (0.16) | 2.44 (0.073) | 0.77 (0.041) |

We observe that joint analysis can significantly improve upon the separate analyses. Among the four experiments, Experiment 4 has the strongest signal-to-noise ratio, which is also reflected by its superior performance relative to the other three experiments when analyzed separately. A possible misconception is that it is fruitless to combine such a good-quality experiment with others of relatively poor quality. Our result clearly suggests otherwise: joint analysis can greatly improve upon even the experiment with the best quality.

To gain further insight into the merits of the proposed method, we now compare it with several alternative strategies of joint analysis. The first method is to naively combine the separate analyses of the four experiments by using the largest posterior probability of differential expression. The other two methods are taken from Choi et al. (2003) and Choi et al. (2007), respectively. Unlike the proposed method, these alternatives are no longer tied to the posterior probability of differential expression, and it is therefore unclear what a Bayes rule means in these contexts. Nonetheless, each of these methods does provide a score, similar to the posterior probabilities, measuring the strength of evidence for differential expression. It is therefore of interest to know to what extent these scores can serve as proxies to identify differentially expressed genes. A natural measure is the area under the receiver operating characteristic curve (AUC). AUC, by definition, is necessarily between 0 and 1. The closer the AUC is to one, the more effective the score is. For the proposed method and its three alternatives, the AUCs, summarized from 100 runs, are 99.66% (0.01%), 99.47% (0.02%), 60.98% (0.13%) and 83.41% (0.12%), respectively. The numbers in parentheses are the standard errors. Not surprisingly, the proposed method outperforms the alternatives because our simulation setting matches the model assumptions perfectly.
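For readers who want to reproduce summaries of this kind, the sketch below computes sensitivity, specificity, FDR and AUC from a vector of scores and the true DE labels. The paper does not spell out its formulas, so the standard definitions are assumed here (FDR as the fraction of called genes that are truly EE, AUC as the probability that a DE gene scores higher than an EE gene, with ties counted as one half). Function names are hypothetical.

```python
import numpy as np

def operating_characteristics(post, is_de, cutoff=0.5):
    """Sensitivity, specificity and FDR when genes with post > cutoff are called DE."""
    post = np.asarray(post, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    called = post > cutoff
    tp = np.sum(called & is_de)
    fp = np.sum(called & ~is_de)
    fn = np.sum(~called & is_de)
    tn = np.sum(~called & ~is_de)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    return float(sensitivity), float(specificity), float(fdr)

def auc(score, is_de):
    """Area under the ROC curve via pairwise comparison of DE and EE scores."""
    score = np.asarray(score, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    pos = score[is_de][:, None]
    neg = score[~is_de][None, :]
    return float(np.mean(pos > neg) + 0.5 * np.mean(pos == neg))

# Toy input: five genes, three truly DE.
post = np.array([0.9, 0.6, 0.4, 0.2, 0.7])
is_de = np.array([True, True, True, False, False])
sens, spec, fdr = operating_characteristics(post, is_de)
area = auc(post, is_de)
```

Because AUC depends only on the ranking of the scores, it can be applied to scores that are not posterior probabilities, which is what allows the comparison with the alternative methods.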

To evaluate the robustness of the proposed method, we consider a more complex simulation setup where the experimental data were generated as follows:

  • Experiment 1

    The latent gene expression levels were simulated from an inverse Gamma distribution with shape parameter 2 and location parameter 10. Then the gene expression measurements were simulated from a Gamma distribution with the latent means and shape parameter 20.

  • Experiment 2

    The latent means were simulated so that $A := \log((\mu_{2g1}\mu_{2g2})^{1/2})$ follows a uniform distribution between 5 and 11, and $M = \log(\mu_{2g1}/\mu_{2g2})$ follows a uniform distribution between −1 and 1 for differentially expressed genes and is 0 for equivalently expressed genes. Then the observed gene expression measurements were simulated from a Gamma distribution with shape parameter 15.

  • Experiment 3

    Similar to Experiment 2 except that now M follows a uniform distribution between −2 and 2 and the expression measurements were simulated with shape parameter 25.

  • Experiment 4

    Data were simulated from an LNN model with parameters θ = 0.3² and τ = (2.3, 1.39²).

The other settings are the same as before. The effect sizes of the four studies are 2.09, 1.65, 2.71 and 7.69, respectively. To gain further insights, we also consider three different percentages of differential expression: π = 5%, 10% and 20%. The sensitivity of the joint analysis of the four experiments ranges from 92% to 94%, whereas the individual analyses have sensitivities from 22% to 76%. It is evident that the joint analysis significantly outperforms the separate analyses. We again compare the proposed method with the alternative methods for joint analysis in terms of AUC. Results are reported in Table 2. As before, the proposed method enjoys superior performance.

Table 2.

AUC comparison of the proposed method (Proposed Joint EB), and identifying differential expression based on largest posterior probability of separate analysis (Combined Separate EB) and the methods of Choi et al. (2003) and Choi et al. (2007). Results are based on 100 runs. All units are in percentages. The numbers in parentheses are standard errors.

| Method | π = 5% | π = 10% | π = 20% |
|---|---|---|---|
| Proposed Joint EB | 99.63 (0.02) | 99.61 (0.02) | 99.59 (0.01) |
| Combined Separate EB | 98.56 (0.04) | 98.48 (0.03) | 98.49 (0.02) |
| Choi et al. (2003) | 61.42 (0.19) | 61.11 (0.14) | 60.92 (0.12) |
| Choi et al. (2007) | 57.94 (0.24) | 58.86 (0.16) | 60.15 (0.13) |

3.2 Missing Data

We now consider the problem of missing data. To this end, we consider the following simulation scheme with a total of G = 5,000 genes under two conditions. The proportion of DE genes is 5%. As before, three replicates were simulated under every condition. Because of the robustness of the method, we focus here only on the LNN model with the parameters given before. The difference is that each experiment now involves only a subset of the genes. In particular, Experiment 1 includes 4,500 randomly selected genes, and each of the remaining three experiments has 80% overlap with the first experiment, with the set of overlapping genes drawn randomly. In addition, Experiments 2 and 3 each have 250 new genes randomly selected from the 500 genes not included in Experiment 1, while Experiment 4 covers all 500 genes not available in Experiment 1. As a result, Experiments 2 and 3 each have 3,850 genes whereas Experiment 4 comprises 4,100 genes.
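The gene counts in this scheme can be checked with a short sketch that builds the pattern of observed genes. It assumes that "80% overlap" means 3,600 of the 4,500 Experiment 1 genes are drawn independently for each of the other experiments; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 5000
observed = np.zeros((4, G), dtype=bool)   # observed[s, g]: gene g measured in experiment s+1

# Experiment 1: 4,500 randomly selected genes.
in_exp1 = rng.choice(G, size=4500, replace=False)
not_in_exp1 = np.setdiff1d(np.arange(G), in_exp1)   # the remaining 500 genes
observed[0, in_exp1] = True

# Experiments 2-4: 80% of the Experiment 1 genes plus 250, 250 and 500 new genes.
n_overlap = 4500 * 8 // 10   # 3,600
for s, n_new in [(1, 250), (2, 250), (3, 500)]:
    observed[s, rng.choice(in_exp1, size=n_overlap, replace=False)] = True
    observed[s, rng.choice(not_in_exp1, size=n_new, replace=False)] = True

genes_per_experiment = observed.sum(axis=1)   # 4500, 3850, 3850, 4100
genes_in_all = int(observed.all(axis=0).sum())
```

Under independent draws, a gene from Experiment 1 appears in all three other experiments with probability 0.8³, so about 4,500 × 0.512 = 2,304 genes are expected in the intersection, which is a little under half of the 5,000 genes.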

Table 3 summarizes the results from 100 simulation runs. It is clear that joint analysis dramatically improves the sensitivity with lower false discovery rate.

Table 3.

Performance comparison between joint and separate analysis when there are missing data. All units are in percentages. The numbers in parentheses are standard errors.

| | Joint Analysis | Separate Analysis: Experiment 1 | Separate Analysis: Experiment 2 | Separate Analysis: Experiment 3 | Separate Analysis: Experiment 4 |
|---|---|---|---|---|---|
| Sensitivity | 70.33 (0.351) | 32.16 (0.324) | 32.13 (0.372) | 32.44 (0.417) | 32.75 (0.349) |
| Specificity | 99.73 (0.007) | 99.75 (0.009) | 99.74 (0.011) | 99.75 (0.01) | 99.75 (0.009) |
| FDR | 6.78 (0.157) | 12.86 (0.365) | 12.96 (0.462) | 12.39 (0.422) | 12.62 (0.419) |

The methods of Choi et al. (2003) and Choi et al. (2007) focus only on genes that are present in all studies. In the current setting, this results in discarding about half of the genes, which is clearly undesirable. Alternatively, the strategy of taking the largest posterior probability from the separate analyses can still be applied, which yields an AUC of 93.07% (0.12%) based on 100 runs. This is to be compared with the proposed joint analysis, which results in an AUC of 98.87% (0.04%).

3.3 Condition Mismatch

Our final simulation study is designed to illustrate the effect of condition mismatch. We adopt a simulation setup similar to before, with 5,000 genes and four experiments. There are a total of three biological conditions, but one condition is missing in each of the first three experiments. Specifically, the first experiment has three replicates under the first condition, three under the second condition and none under the third condition. The second experiment has three replicates under each of the first and third conditions, but none under the second condition. The third experiment features three replicates under each of the second and third conditions, and none under the first condition. The last experiment has three replicates under each of the three conditions. As we pointed out earlier, such condition mismatch is a direct consequence of different biological hypotheses of interest. In Experiment 1, our interest is in comparing the first two conditions. The goal is therefore to determine genes that are differentially expressed between these two conditions. Similarly, in Experiment 2, we want to identify genes that are differentially expressed between the first and third conditions; and in Experiment 3, between the second and third conditions. In the last experiment, there are five possible patterns as discussed before, and all patterns except Pattern 1 can be identified as differential expression.

Given the different hypotheses, the natural question is whether or not a joint analysis of all four experiments can be beneficial. For example, for the “investigators” of the first experiment, combining with data on the first two conditions from the last experiment might be helpful, but it is not immediately clear whether or not it helps to include all four experiments. To illustrate the merits of the proposed joint analysis of all experiments, we apply three different strategies here: separate analysis of the first experiment; joint analysis of the first experiment and the last experiment with data from the third condition discarded; and the proposed method of joint analysis of all four experiments, with missing conditions handled as missing data as we discussed before. Table 4 summarizes the operating characteristics of all three methods averaged over 100 runs. It is clear that both joint analyses improve upon the separate analysis, with the proposed method outperforming the joint analysis of only two experiments. Similar comparisons were conducted from the angles of the “investigators” of Experiments 2 and 3, and the results remain similar. Now consider the last experiment, where the goal is to identify differentially expressed genes among all three conditions. We compare the joint analysis that uses data from all four experiments with the individual analysis that uses only data from the last experiment. The results are also given in Table 4, and they suggest that the joint analysis gives superior performance. Note that existing methods for joint analysis are not designed to handle condition mismatch and are therefore not included in the comparison.

Table 4.

Performance comparison of separate analysis and joint analysis with condition mismatch. All units are percentages.

DE in Conditions 1 and 2 DE in Conditions 1 and 3
Sensitivity Specificity FDR Sensitivity Specificity FDR
Exp 1 69.87 (0.188) 99.62 (0.011) 2.97 (0.08) Exp 2 85.92 (0.136) 99.85 (0.006) 0.99 (0.037)
Exp 1 & 4 86.37 (0.135) 99.69 (0.009) 2.02 (0.056) Exp 2 & 4 93.32 (0.093) 99.87 (0.005) 0.79 (0.033)
All Exp 94.74 (0.086) 99.64 (0.008) 2.12 (0.048) All Exp 96.29 (0.067) 99.83 (0.006) 0.98 (0.037)

DE in Conditions 2 and 3 DE among three conditions
Sensitivity Specificity FDR Sensitivity Specificity FDR
Exp 3 88.01 (0.106) 99.88 (0.005) 0.79 (0.03) Exp 4 64.22 (0.173) 99.31 (0.015) 4.11 (0.089)
Exp 3 & 4 94.3 (0.087) 99.89 (0.004) 0.67 (0.026)
All Exp 96.87 (0.063) 99.84 (0.006) 0.9 (0.034) All Exp 98.44 (0.04) 99.95 (0.003) 0.21 (0.014)
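For readers who wish to reproduce summaries of the kind reported in Table 4, the following Python sketch computes sensitivity, specificity and FDR (as percentages) from posterior probabilities of differential expression and the known truth in a simulation. The calling rule, a posterior probability above a threshold with an illustrative default of 0.95, and all names are assumptions for illustration.

```python
def operating_characteristics(post_de, truth, threshold=0.95):
    """Sensitivity, specificity and FDR (in percent) for threshold-based calls.

    post_de:   posterior probabilities of differential expression, one per gene
    truth:     booleans, True if the gene is truly differentially expressed
    threshold: a gene is called DE if its posterior exceeds this (assumed rule)
    """
    called = [p > threshold for p in post_de]
    tp = sum(c and t for c, t in zip(called, truth))
    fp = sum(c and not t for c, t in zip(called, truth))
    fn = sum((not c) and t for c, t in zip(called, truth))
    tn = sum((not c) and (not t) for c, t in zip(called, truth))
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    specificity = 100.0 * tn / (tn + fp) if tn + fp else 0.0
    fdr = 100.0 * fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, fdr


# Toy example (hypothetical numbers): five genes, three of them truly DE.
posteriors = [0.99, 0.97, 0.50, 0.96, 0.10]
is_truly_de = [True, True, True, False, False]
sens, spec, fdr = operating_characteristics(posteriors, is_truly_de)
```

In a simulation study these three numbers would be computed for each run and then averaged, with the standard error across runs reported in parentheses.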

4. Real Examples

To further illustrate the merits of the proposed method, we now return to the prostate cancer example discussed before. As mentioned earlier, four public microarray datasets generated independently by Dhanasekaran et al. (2001), Luo et al. (2001), Magee et al. (2001) and Welsh et al. (2001) were collected to determine genes that are differentially expressed between benign prostate and cancer tumors. As stated before, the data were generated with different platforms: Dhanasekaran et al. (2001) and Luo et al. (2001) employed spotted cDNA microarrays, whereas the other two experiments utilized Affymetrix technology. All four studies include comparisons between locally advanced prostate cancer and benign prostate. Dhanasekaran et al. (2001) and Magee et al. (2001) also included a third biological condition: metastatic prostate cancer. A total of 13,474 unique genes are present in at least one of the experiments. There is, however, a severe mismatch in the sets of genes measured among the four experiments, with less than 10% (1,322) of the genes present in all four experiments. Table 5 summarizes some basic information on the data and gives the number of genes shared between each pair of experiments.

Table 5.

Basic information of the four prostate cancer datasets. D – data from Dhanasekaran et al. (2001); L – data from Luo et al. (2001); M – data from Magee et al. (2001); and W – data from Welsh et al. (2001).

Array Type Number of Replicates Pairwise Overlap Genes
Benign Local PCA Metastatic PCA D L M W
D cDNA 14 14 20 4,839 2,642 1,596 2,126
L cDNA 9 16 0 6,109 2,895 3,574
M Affy 4 8 3 5,228 4,963
W Affy 9 23 0 9,071
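The overlap counts in Table 5 are simple set intersections of the gene identifiers measured in each study. A minimal Python sketch, using made-up gene identifiers rather than the actual data, is given below; the function names are illustrative assumptions.

```python
from itertools import combinations_with_replacement


def pairwise_overlap(gene_sets):
    """Upper-triangular table of shared gene counts; diagonal entries are per-study totals."""
    names = list(gene_sets)
    return {
        (a, b): len(gene_sets[a] & gene_sets[b])
        for a, b in combinations_with_replacement(names, 2)
    }


def present_in_all(gene_sets):
    """Genes measured in every study."""
    return set.intersection(*gene_sets.values())


def present_in_any(gene_sets):
    """Genes measured in at least one study."""
    return set.union(*gene_sets.values())


# Hypothetical gene identifiers for four studies labelled as in Table 5.
toy_sets = {
    "D": {"g1", "g2", "g3"},
    "L": {"g2", "g3", "g4"},
    "M": {"g3", "g5"},
    "W": {"g3", "g4", "g5", "g6"},
}
overlap = pairwise_overlap(toy_sets)
common = present_in_all(toy_sets)
union = present_in_any(toy_sets)
```

A method restricted to `common` would use only one gene in this toy example, whereas a joint analysis that tolerates missing genes can work with all of `union`.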

We ran the joint analysis with both the LNN and GG models, and the results are similar. Therefore, we focus here on the results from the LNN model. Similar to the simulation study conducted before, there are two primary hypotheses concerning differential expression. The goal is to identify genes that are differentially expressed between either type of cancer tumor and benign prostate. In other words, among the five expression patterns given in (11), we are interested in identifying genes in Patterns 2, 3, 4 and 5 as opposed to Pattern 1. Hereafter, we shall refer to genes with Pattern 1 as equivalently expressed genes, and to genes with the other patterns as differentially expressed genes. Similar to earlier studies (see, e.g., Choi et al., 2003), a large number of genes demonstrate a significant difference between prostate cancer and benign prostate. To fix ideas, we focus on the top one hundred genes identified by the joint analysis as following Patterns 2, 3, 4 or 5. All of these genes have posterior probabilities of differential expression greater than 99%. When the four datasets are analyzed separately, 31 of these genes are not identified by any study; 69 are identified as differentially expressed with posterior probability at least 95% in at least one of the four studies; 34 in at least two studies; 7 in three studies; and 0 in all four studies. For the top 100 genes found by the joint analysis, the Venn diagram in Figure 1 shows which of the individual analyses identify them.
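The tally of how many separate analyses support each jointly selected gene can be sketched as follows in Python. The 95% cutoff follows the text; the data layout, the function name and the toy posteriors are assumptions for illustration only.

```python
def support_counts(top_genes, per_study_post, cutoff=0.95):
    """For each gene, count the studies whose separate analysis identifies it.

    per_study_post: dict mapping study label -> {gene id: posterior of DE};
    a gene not measured in a study cannot be identified by that study.
    """
    counts = {
        g: sum(1 for post in per_study_post.values() if post.get(g, 0.0) >= cutoff)
        for g in top_genes
    }
    n_studies = len(per_study_post)
    at_least = {
        k: sum(1 for c in counts.values() if c >= k) for k in range(1, n_studies + 1)
    }
    not_identified = sum(1 for c in counts.values() if c == 0)
    return counts, at_least, not_identified


# Hypothetical posteriors from four separate analyses.
separate = {
    "D": {"a": 0.99, "b": 0.50},
    "L": {"a": 0.97, "c": 0.96},
    "M": {"b": 0.20},
    "W": {"a": 0.40, "c": 0.99},
}
counts, at_least, not_identified = support_counts(["a", "b", "c", "d"], separate)
```

Here `at_least[k]` is the number of top genes identified by at least `k` separate analyses, and `not_identified` counts the genes found only by the joint analysis.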

Figure 1.


Venn diagram of differentially expressed genes for the four prostate cancer datasets: 100 genes are selected by joint analysis. Among them, 69 were identified by separate analysis on at least one data set. This figure shows how many are identified by each one of the four data sets.

The joint analysis reveals significant genes on which the studies agree more often than would be expected by chance. Moreover, among the genes that are identified by the joint analysis but do not appear to be differentially expressed in the individual analysis of any of the four studies is Hs.296638, a known prostate differentiation factor.

A second example we consider here is the four liver cancer datasets from Choi et al. (2003). All data were generated under two biological conditions: normal and tumor tissues. The goal is to identify genes that are differentially expressed between normal and tumor tissues. The datasets are of relatively poor quality compared with the prostate datasets and were used earlier in Choi et al. (2003) primarily to demonstrate the necessity of a joint analysis. Table 6 summarizes some basic information on the data and gives the number of genes shared between each pair of experiments.

Table 6.

Basic information of the four liver cancer datasets.

Number of Replicates Pairwise Overlap Genes
Normal Tumor D1 D2 D3 D4
D1 16 16 10314 10289 10194 9921
D2 23 23 10311 10202 9906
D3 29 5 10216 9815
D4 12 9 9931

Similarly, the analysis is based on the LNN model. In the joint analysis, the top 18 genes have posterior probabilities of differential expression greater than 90%. After examining the annotation of these genes and excluding those of unknown function, 10 genes remain. Of these, 9 are identified by only one study in the separate analyses and 1 is not identified by any study. Table 7 lists these genes.

Table 7.

List of 10 genes in the liver cancer data that are identified as differentially expressed by the joint analysis but by at most one study in the separate analyses. The first gene is not selected by any individual study.

Tissue name
21.2.D.1 TATA box binding protein(TBP), mRNA
15.2.F.4 AL564975 cDNA
15.2.H.9 IL3-CT0219-271099-022-C02 cDNA
16.1.C.4 KIAA0107 gene product(KIAA0107), mRNA
19.4.D.12 KIAA0304 gene product(KIAA0304), mRNA
2.2.D.2 thioredoxin reductase 1(TXNRD1), mRNA
20.2.A.9 triosephosphate isomerase 1(TPI1), mRNA
22.3.A.4 ribosomal protein L13a(RPL13A), mRNA
23.3.A.4 CD24 antigen (small cell lung carcinoma cluster 4 antigen) (CD24), mRNA
7.1.E.7 hepatocyte growth factor regulated tyrosine kinase substrate (HGS), mRNA

5. Conclusions

With the explosion in popularity of microarray experiments, it has become necessary to develop statistical methods that can effectively integrate data from multiple studies. Joint analysis of multiple experiments can alleviate the problems of low sample size and high variability that are often faced in individual studies. In this paper, we propose a model-based joint analysis of gene expression data from multiple studies to determine differentially expressed genes between multiple biological conditions. The proposed method shares information both among genes within one study and across studies without data transformation. The method is flexible enough to handle various practical complications such as missing data and condition mismatch. Simulation studies and real data examples show that the accuracy of statistical inferences can be drastically improved when using the proposed approach to combine multiple studies. It was demonstrated that combining data from multiple sources leads to increased sensitivity and specificity. Even incorporating seemingly less optimal experiments could prove beneficial.

The aim of the proposed approach is to extract differential expression that is, to a certain degree, agreed upon by multiple studies. Genes other than these consensus choices may sometimes also be of interest. Discordance across studies is often observed in practice (see, e.g., Parmigiani et al., 2004). It may be attributed to faulty probe annotation (see, e.g., Dai et al., 2005) or to hidden phenotypic heterogeneity or subtypes. Understanding such causes of discordance can also be of great importance.

Acknowledgments

The authors wish to thank the editor, the associate editor and two reviewers for their constructive comments, which helped greatly improve the presentation of the paper. This research was supported in part by NSF grants DMS-0706724 and DMS-0846234, NIH grant R01GM076274-01 and a grant from the Georgia Cancer Coalition.

References

  1. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation & consensus. Nature Reviews Genetics. 2006;7:55–65. doi: 10.1038/nrg1749.
  2. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
  3. Choi H, Shen R, Chinnaiyan A, Ghosh D. A latent variable approach for meta-analysis of gene expression data from multiple microarray experiments. BMC Bioinformatics. 2007;8:364–383. doi: 10.1186/1471-2105-8-364.
  4. Choi JK, Yu U, Kim S, Yoo OJ. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003;19:i84–i90. doi: 10.1093/bioinformatics/btg1010.
  5. Choi H, Ghosh D. A comparison of meta-analysis methods for gene expression data. In: Biswas A, Datta S, Fine JP, Segal MR, editors. Statistical Advances in the Biomedical Sciences. New York: Wiley; 2008. pp. 200–215.
  6. Dai M, Wang P, Boyd A, Kostov G, Athey B, Jones E, Bunney W, Myers R, Speed T, Akil H, Watson S, Meng F. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Research. 2005;33(20):e175. doi: 10.1093/nar/gni179.
  7. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM. Delineation of prognostic biomarkers in prostate cancer. Nature. 2001;412:822–826. doi: 10.1038/35090585.
  8. Do KA, Müller P, Vannucci M. Bayesian Inference for Gene Expression and Proteomics. Cambridge: Cambridge University Press; 2006.
  9. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96(456):1151–1160.
  10. Garrett-Mayer E, Parmigiani G, Zhong X, Cope L, Gabrielson E. Cross study validation and combined analysis of gene expression microarray data. Biostatistics. 2007;9(2):333–354. doi: 10.1093/biostatistics/kxm033.
  11. Ghosh D, Barrette TR, Rhodes D, Chinnaiyan AM. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Functional Integrative Genomics. 2003;3:180–188. doi: 10.1007/s10142-003-0087-5.
  12. Hong F, Breitling R. A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008;24(3):374–382. doi: 10.1093/bioinformatics/btm620.
  13. Jiang H, Deng Y, Chen HS, Tao L, Sha Q, Chen J, Tsai CJ, Zhang S. Joint analysis of two microarray gene expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics. 2004;5:81. doi: 10.1186/1471-2105-5-81.
  14. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine. 2003;22:3899–3914. doi: 10.1002/sim.1548.
  15. Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002;18:405–412. doi: 10.1093/bioinformatics/18.3.405.
  16. Lee MLT, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences of the United States of America. 2000;97:9834–9839. doi: 10.1073/pnas.97.18.9834.
  17. Luo J, Duggan DJ, Chen Y, Sauvageot J, Ewing CM, Bittner ML, Trent JM, Isaacs WB. Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Research. 2001;61:4683–4688.
  18. Magee JA, Araki T, Patil S, Ehrig T, True L, Humphrey PA, Catalona WJ, Watson MA, Milbrandt J. Expression profiling reveals hepsin overexpression in prostate cancer. Cancer Research. 2001;61:5692–5696.
  19. Mah N, Thelin A, Lu T, Nikolaus S, Kühbacher T, Gurbuz Y, Eickhoff H, Klöppel G, Lehrach H, Mellgård B, Costello CM, Schreiber S. A comparison of oligonucleotide and cDNA-based microarray systems. Physiological Genomics. 2004;16:361–370. doi: 10.1152/physiolgenomics.00080.2003.
  20. Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, Golub TR, Mesirov JP. Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology. 2003;10(2):119–142. doi: 10.1089/106652703321825928.
  21. Newton MA, Kendziorski CM, Richmond CS, Blattner RR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology. 2001;8(1):37–52. doi: 10.1089/106652701300099074.
  22. Parmigiani G, Garrett-Mayer E, Anbazhagan R, Gabrielson E. A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clinical Cancer Research. 2004;10:2922–2927. doi: 10.1158/1078-0432.ccr-03-0490.
  23. Parmigiani G, Garrett-Mayer E, Irizarry R, Zeger S. The Analysis of Gene Expression Data: Methods and Software. New York: Springer; 2003.
  24. Parmigiani G, Garrett-Mayer E, Anbazhagan R, Gabrielson E. A statistical framework for expression-based molecular classification in cancer. Journal of the Royal Statistical Society Series B. 2002;64:717–736.
  25. Pyne S, Futcher B, Skiena S. Meta-analysis based on control of false discovery rate: combining yeast ChIP-chip datasets. Bioinformatics. 2006;22(20):2516–2522. doi: 10.1093/bioinformatics/btl439.
  26. Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research. 2002;62:4427–4433.
  27. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette TR, Pandey A, Chinnaiyan AM. A large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(25):9309–9314. doi: 10.1073/pnas.0401994101.
  28. Scharpf R, Tjelmeland H, Parmigiani G, Nobel A. A Bayesian model for cross-study differential gene expression (with discussions). Journal of the American Statistical Association. 2009;104:1295–1323. doi: 10.1198/jasa.2009.ap07611.
  29. Shabalin AA, Tjelmeland H, Fan C, Perou CM, Nobel AB. Merging two gene-expression studies via cross-platform normalization. Bioinformatics. 2008;24(9):1154–1160. doi: 10.1093/bioinformatics/btn083.
  30. Shen R, Ghosh D, Chinnaiyan AM. Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics. 2004;5:94. doi: 10.1186/1471-2164-5-94.
  31. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.
  32. Wang Y, Klijn J, Zhang Y, Sieuwerts A, Look M, Yang F, Talantov D, Timmermans M, Meijer-van Gelder M, Yu J. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679. doi: 10.1016/S0140-6736(05)17947-1.
  33. Warnat P, Eils R, Brors B. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics. 2005;6:265. doi: 10.1186/1471-2105-6-265.
  34. Welsh JB, Sapinoso LM, Su AI, Kern SG, Wang-Rodriguez J, Moskaluk CA, Frierson HF Jr, Hampton GM. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research. 2001;61:5974–5978.
