Author manuscript; available in PMC: 2018 Aug 1.
Published in final edited form as: J R Stat Soc Ser C Appl Stat. 2016 Dec 16;66(4):847–867. doi: 10.1111/rssc.12199

Biomarker detection and categorization in ribonucleic acid sequencing meta-analysis using Bayesian hierarchical models

Tianzhou Ma 1, Faming Liang 2, George Tseng 3
PMCID: PMC5543999  NIHMSID: NIHMS826248  PMID: 28785119

Abstract

Meta-analysis combining multiple transcriptomic studies increases statistical power and accuracy in detecting differentially expressed genes. As the next-generation sequencing experiments become mature and affordable, increasing number of RNA-seq datasets are available in the public domain. The count-data based technology provides better experimental accuracy, reproducibility and ability to detect low-expressed genes. A naive approach to combine multiple RNA-seq studies is to apply differential analysis tools such as edgeR and DESeq to each study and then combine the summary statistics of p-values or effect sizes by conventional meta-analysis methods. Such a two-stage approach loses statistical power, especially for genes with short length or low expression abundance. In this paper, we propose a full Bayesian hierarchical model (namely, BayesMetaSeq) for RNA-seq meta-analysis by modelling count data, integrating information across genes and across studies, and modelling potentially heterogeneous differential signals across studies via latent variables. A Dirichlet process mixture (DPM) prior is further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies, facilitating improved interpretation and biological hypothesis generation. Simulations and a real application on multi-brain-region HIV-1 transgenic rats demonstrate improved sensitivity, accuracy and biological findings of the proposed method.

Keywords: Bayesian hierarchical model, differential expression (DE), meta-analysis, model-based clustering, RNA sequencing (RNA-seq)

1. Introduction

By using the next-generation sequencing technology to quantify transcriptome, RNA-seq has rapidly become a standard experimental technique in measuring RNA expression levels (Mortazavi et al., 2008; Wang et al., 2009). For RNA-seq, the abundance of transcript in each RNA sample is measured by counting the number of randomly sequenced fragments aligned to each gene. Compared to the popular microarray technology, RNA-seq has the advantage of detecting novel transcripts and quantifying a larger dynamic range of expression levels. It has been shown that RNA-seq performs better than microarray at detecting weakly expressed genes if sequencing is deep enough (Wang et al., 2014). However, new statistical challenges emerge in the differential expression analysis of RNA-seq data. First, the sequencing data are discrete counts rather than continuous intensities, so a count model is more appropriate if parametric approach is used. Secondly, since long transcripts usually have more mapped reads compared to short transcripts and the detection power of differential expression increases as the number of reads increases, short transcripts are always at a statistical disadvantage relative to long transcripts in the same dataset. Analysis of RNA-seq data needs to address such a read count bias considering the fact that many important disease markers are of short length or low expression (Oshlack et al., 2009).

Many methods have been developed to identify differentially expressed genes between two or more conditions from RNA-seq count data. The two most popular tools, edgeR and DESeq, assume a negative binomial model that accounts for over-dispersion and test for differential expression with either a likelihood ratio test or an exact test (Robinson et al., 2010; Anders and Huber, 2010). Other methods such as baySeq and EBSeq apply empirical Bayes approaches to detect patterns of differential expression (Hardcastle and Kelly, 2010; Leng et al., 2013). More recently, methods based on Bayesian hierarchical models have been developed, using either approximation methods or Markov chain Monte Carlo (MCMC) sampling schemes to estimate the parameters (Van De Wiel et al., 2012; Chung et al., 2013). In recent comparative studies, no single method has been shown to outperform the others under all circumstances (Rapaport et al., 2013; Soneson and Delorenzi, 2013). Bayesian approaches are advantageous in handling complex models and allow more flexible modelling of effect size and variance, and thus may increase DE detection power for lowly expressed genes (Chung et al., 2013). However, all existing Bayesian hierarchical models have so far been limited to a single transcriptomic study.

Meta-analysis in genomic research is a set of statistical tools for combining multiple “-omics” studies of a related hypothesis and can potentially increase the detection power of individual studies (Tseng et al., 2012). With the increasing availability of mRNA expression data sets, many transcriptomic meta-analysis methods for microarray data have been developed in the past decade. These methods mainly fall into three categories. The first and the most popular one is a two-stage method, where a single summary statistics is first computed for each study and then meta-analysis methods are used to combine the summary statistics. These methods include combining p-values (Fisher, 1925; Stouffer et al., 1949; Li et al., 2011), combining effect sizes (Choi et al., 2003) or combining rank statistics (Hong et al., 2006). The second category of methods merges the raw data from all microarray studies and normalize simultaneously (a.k.a. mega-analysis), then standard single-study analysis can be applied (Lee et al., 2008; Sims et al., 2008). These approaches have, however, been less favored in the literature since they do not guarantee to remove cross-study discrepancy and may fail to retain study-specific biomarkers. Instead of using two-stage approaches (i.e. DE analysis in single study + meta-analyze summary statistics in the first category, and normalization and combined DE analysis in the second category), the third category integrates differential expression information from all studies using a unified and joint stochastic model (Conlon et al., 2006; Scharpf et al., 2009). Since they are joint hierarchical models by nature, the more flexible Bayesian methods are usually applied. These approaches have the potential to offer additional efficiency over the two-stage methods and, at the same time, retain the study-specific features. This motivates us to develop a Bayesian hierarchical model for RNA-seq meta-analysis.

Almost no meta-analysis methods have so far been developed specifically for RNA-seq. The two existing R packages presented for RNA-seq meta-analysis, metaRNASeq (Rau et al., 2014) and metaSeq (Tsuyuzaki and Nikaido, 2013), essentially apply naive two-stage methods: DESeq or NOISeq is run in each study and the resulting p-values are combined by Fisher's or Stouffer's method. The two-stage approach loses statistical power, especially when the observed counts for a given gene are small. In this paper, we propose a Bayesian hierarchical model, BayesMetaSeq, under a unified meta-analytic framework, to jointly analyze RNA-seq data from multiple studies. The Bayesian hierarchical model shares information across studies and genes to increase DE detection power for genes with low read counts. In addition, a Dirichlet process mixture (DPM) prior is imposed on the DE latent variables to model the homogeneous and heterogeneous differential signals across studies. Model-based clustering embedded in the full Bayesian model categorizes the detected biomarkers according to their differential expression patterns across studies. The result facilitates better biological interpretation and hypothesis generation.

Ramasamy et al. (2008) presented seven key issues when conducting microarray meta-analysis, including identifying and extracting experimental data, preprocessing and annotating each dataset, matching genes across studies, statistical methods for meta-analysis, and final presentation and interpretation. When combining RNA-seq studies for meta-analysis, most preliminary steps and data preparation issues apply similarly. The identification of adequate transcriptomic studies and the decision to include them in the meta-analysis greatly impact the accuracy and reproducibility of biomarker detection (Kang et al., 2012). Many useful RNA-seq preprocessing tools such as fastQC, tophat and bedtools are instrumental for alignment and for preparing expression counts for downstream analysis. Genes are matched across studies using standard gene symbols or isoforms through a common reference genome (e.g. hg18 or hg19) (Oshlack et al., 2010). In the remainder of this paper, we assume that data collection and preprocessing have been carefully done and we focus only on downstream meta-analytic modelling and interpretation.

The paper is organized as follows. Section 2 describes the Bayesian hierarchical model and an MCMC algorithm for simulating posterior distributions of the parameters. Section 3 explains how we perform differential expression analysis and cluster analysis based on Bayesian inference, with multiple comparisons addressed from a Bayesian perspective. In Sections 4 and 5, we apply BayesMetaSeq to simulated data and to a multi-brain-region RNA-seq dataset from HIV-1 transgenic rats. Final conclusions and discussion are provided in Section 6.

2. Bayesian Hierarchical Model

2.1 Notation and Assumptions

In this paper, we denote by $y_{gik}$ the observed count for gene g and sample i in study k, by $T_{ik} = \sum_{g=1}^{G} y_{gik}$ the library size (i.e. the total number of reads) for sample i in study k, and by $X_{ik} \in \{0, 1\}$ the phenotypic condition of sample i in study k. The observed data are:

$$D = \{(y_{gik}, T_{ik}, X_{ik}) : g = 1, \ldots, G;\ i = 1, \ldots, N_k;\ k = 1, \ldots, K\},$$

where G is the total number of genes, Nk is the sample size of study k and K is the number of studies in the meta-analysis. The latent variable of interest δgk ∈ {0, 1} is the study-specific indicator of differential expression for gene g in study k, meaning gene g is differentially expressed in study k if δgk = 1 and non-differentially expressed if δgk = 0.

Here we assume the raw RNA-seq count values follow a negative binomial distribution under each condition. We also assume that genes are matched across studies, although the model could be readily extended to analyze multiple studies with similar but not completely overlapping gene sets. In the next three subsections, we introduce the generative model within each study (Section 2.2), describe information integration of effect sizes across studies (Section 2.3) and model clusters of genes with different DE patterns across studies (Section 2.4). Figure 1 provides a graphical representation of the full Bayesian hierarchical model. Parameters within the rectangle form the main model and parameters outside the rectangle are hyperparameters. The gray shaded parameters δgk (latent variable of DE indicator) and λg (DE effect size) are the parameters of interest in the model. The dashed rectangle refers to a Dirichlet process mixture (DPM) model for DE gene categorization that will be described in Section 2.4.

Figure 1. A graphical representation of the Bayesian hierarchical model.

2.2 Generative model within each study

Below, we describe the generative model for observed data within each study. We assume the counts ygik, conditioning on hyperparameters, are independent and follow a negative binomial distribution. Denoting by μgik = E(ygik) the mean expression level and by ϕgk the gene-specific dispersion parameter, we have:

$$y_{gik} \sim \mathrm{NB}(\mu_{gik}, \phi_{gk}). \qquad (1)$$

We then fit a log-linear regression model for the mean μgik, where αgk denotes the baseline expression relative to the library size and βgk denotes the effect size (i.e. the log fold change of expression between the two conditions):

$$\log(\mu_{gik}) = \log(T_{ik}) + \alpha_{gk} + \beta_{gk} X_{ik}. \qquad (2)$$

Note that we set βgk to depend on both g and k, allowing for between-study heterogeneity for the same gene. If we re-parametrize the negative binomial model in (1) in terms of the proportion $p = \phi\mu/(1+\phi\mu)$ and the dispersion $\phi$, and let $\Psi = \mathrm{logit}(p) = \log\left(\frac{\phi\mu/(1+\phi\mu)}{1/(1+\phi\mu)}\right) = \log(\phi\mu)$, we can re-write equation (2) as:

$$\Psi_{gik} = \log(T_{ik}) + \alpha_{gk} + \beta_{gk} X_{ik} + \log(\phi_{gk}). \qquad (3)$$

The above equation is useful when we later use Gibbs sampling to update the parameters αgk and βgk. Equations (1) and (2) together form our basic GLM:

$$y_{gik} \mid \alpha_{gk}, \beta_{gk}, \phi_{gk} \sim \mathrm{NB}\left(\exp\{\log(T_{ik}) + \alpha_{gk} + \beta_{gk} X_{ik}\},\ \phi_{gk}\right). \qquad (4)$$
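The within-study model in equations (1) to (4) can be checked numerically. The following is a minimal sketch, assuming the common mean/dispersion parametrization in which the variance is μ + ϕμ² (the text does not spell out the variance form); the function names and the numerical values are illustrative and not taken from the paper.

```python
import math

def nb_log_pmf(y, mu, phi):
    """Log-probability of count y under NB with mean mu and dispersion phi.

    Assumed parametrization: size r = 1/phi and proportion
    p = phi*mu / (1 + phi*mu), so that the mean is mu.
    """
    r = 1.0 / phi
    log_p = math.log(phi * mu) - math.log1p(phi * mu)   # log p
    log_q = -math.log1p(phi * mu)                        # log(1 - p)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + y * log_p + r * log_q)

def mean_count(T, alpha, beta, x):
    """Equation (2): log(mu) = log(T) + alpha + beta * x."""
    return math.exp(math.log(T) + alpha + beta * x)

# Hypothetical values for one gene in one sample (not from the paper).
T, alpha, beta, phi = 400_000, -9.0, 1.0, 0.2
mu_control = mean_count(T, alpha, beta, 0)   # condition X = 0
mu_case = mean_count(T, alpha, beta, 1)      # condition X = 1

# Equation (3): logit(p) coincides with log(phi * mu).
p = phi * mu_case / (1.0 + phi * mu_case)
psi = math.log(p / (1.0 - p))
```

Here `psi` agrees with `log(T) + alpha + beta + log(phi)`, which is the identity that the Gibbs updates for αgk and βgk rely on.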

2.3 Information integration of effect size across studies among DE genes

Next, we select appropriate prior distributions for the model parameters in equation (4) to allow information integration across studies. We first define the following vectors:

$$\vec{\alpha}_g = (\alpha_{g1}, \ldots, \alpha_{gK})^T, \quad \vec{\beta}_g = (\beta_{g1}, \ldots, \beta_{gK})^T, \quad \log\vec{\phi}_g = (\log(\phi_{g1}), \ldots, \log(\phi_{gK}))^T,$$

which represent the baseline, effect size and dispersion vectors for gene g, respectively. The three vectors are assumed to be a priori independent of each other. In addition, we define the vector of differential expression indicators of gene g: $\vec{\delta}_g = (\delta_{g1}, \ldots, \delta_{gK})^T$. We assume that each of the vectors $\vec{\alpha}_g$ and $\log\vec{\phi}_g$ follows a multivariate Gaussian distribution:

$$\vec{\alpha}_g \sim N_K(\eta_g, \Lambda), \quad \log\vec{\phi}_g \sim N_K(m_g, \Pi), \qquad (5)$$

where ηg and mg are the gene-specific grand means for $\vec{\alpha}_g$ and $\log\vec{\phi}_g$, respectively. The covariance matrices Λ and Π are shared by all genes and are described below. For $\vec{\beta}_g$, we assume a multivariate Gaussian prior, with different means for DE and Non-DE genes:

$$\vec{\beta}_g \sim N_K(\lambda_g \vec{\delta}_g, \Sigma), \qquad (6)$$

where λg is the gene-specific grand mean for DE genes (i.e. δgk ≠ 0 for some k). For Non-DE genes (δ⃗g = 0), the grand mean is 0. We also allow a different covariance matrix of β⃗g for DE and Non-DE genes, i.e. Σ = Σ1 for DE genes and Σ = Σ0 for Non-DE genes.

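Adopting the separation strategy of Barnard et al. (2000) for modelling covariance matrices, we place independent prior distributions on the diagonal variance components and on the correlation matrix of each of the four covariance matrices mentioned above. Let $[\rho_{(1)kk'}]_{1}^{K}$, $[\rho_{(0)kk'}]_{1}^{K}$, $[r_{kk'}]_{1}^{K}$ and $[t_{kk'}]_{1}^{K}$ denote the correlation matrices corresponding to the covariance matrices Σ1, Σ0, Λ and Π respectively, and let $[\sigma_{(1),k}^{2}]_{1}^{K}$, $[\sigma_{(0),k}^{2}]_{1}^{K}$, $[\tau_{k}^{2}]_{1}^{K}$ and $[\xi_{k}^{2}]_{1}^{K}$ denote the corresponding diagonal matrices with the variance terms on the diagonal. The covariance matrices then decompose as:

$$\begin{aligned}
\Sigma_1 &= ([\sigma_{(1),k}^{2}]_{1}^{K})^{1/2}\, [\rho_{(1)kk'}]_{1}^{K}\, ([\sigma_{(1),k}^{2}]_{1}^{K})^{1/2}, \\
\Sigma_0 &= ([\sigma_{(0),k}^{2}]_{1}^{K})^{1/2}\, [\rho_{(0)kk'}]_{1}^{K}\, ([\sigma_{(0),k}^{2}]_{1}^{K})^{1/2}, \\
\Lambda &= ([\tau_{k}^{2}]_{1}^{K})^{1/2}\, [r_{kk'}]_{1}^{K}\, ([\tau_{k}^{2}]_{1}^{K})^{1/2}, \\
\Pi &= ([\xi_{k}^{2}]_{1}^{K})^{1/2}\, [t_{kk'}]_{1}^{K}\, ([\xi_{k}^{2}]_{1}^{K})^{1/2}.
\end{aligned}$$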
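The separation of a covariance matrix into variances and a correlation matrix can be illustrated with a small numerical sketch. The function names and the K = 3 values below are hypothetical and only serve to show that the decomposition is invertible.

```python
import math

def cov_from_separation(variances, corr):
    """Build Sigma = D^{1/2} R D^{1/2} from variances (diagonal of D)
    and a correlation matrix R, both given as plain Python lists."""
    sd = [math.sqrt(v) for v in variances]
    K = len(sd)
    return [[sd[i] * corr[i][j] * sd[j] for j in range(K)] for i in range(K)]

def separation_from_cov(cov):
    """Recover the variances and the correlation matrix from Sigma."""
    K = len(cov)
    variances = [cov[i][i] for i in range(K)]
    corr = [[cov[i][j] / math.sqrt(variances[i] * variances[j])
             for j in range(K)] for i in range(K)]
    return variances, corr

# Hypothetical example with K = 3 studies (values are not from the paper).
variances = [0.25, 1.0, 4.0]
corr = [[1.0, 0.5, 0.2],
        [0.5, 1.0, -0.3],
        [0.2, -0.3, 1.0]]
sigma = cov_from_separation(variances, corr)
```

Placing independent priors on `variances` and `corr` and then recombining them is what allows the variance terms and the cross-study correlations to be updated separately.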

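For each variance component, we propose a Jeffreys prior, that is to say:

$$p(\sigma_{(1),k}^{2}) \propto \frac{1}{\sigma_{(1),k}^{2}}, \quad p(\sigma_{(0),k}^{2}) \propto \frac{1}{\sigma_{(0),k}^{2}}, \quad p(\tau_{k}^{2}) \propto \frac{1}{\tau_{k}^{2}}, \quad p(\xi_{k}^{2}) \propto \frac{1}{\xi_{k}^{2}}.$$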

For the correlation matrices, we propose an inverse-Wishart prior distribution with the identity matrix as its scale matrix and υ = K + 1 degrees of freedom, which is equivalent to placing a marginally uniform prior on each element of the correlation matrices (Gelman et al., 2014; Scharpf et al., 2009; Barnard et al., 2000). More specifically, we have:

$$[\rho_{(1)kk'}]_{1}^{K},\ [\rho_{(0)kk'}]_{1}^{K},\ [r_{kk'}]_{1}^{K},\ [t_{kk'}]_{1}^{K} \sim W^{-1}(I, \upsilon).$$

For the gene-specific grand means λg, ηg and mg, we assume normal priors, i.e. $\lambda_g \sim N(\mu_\lambda, \sigma_\lambda^2)$, $\eta_g \sim N(\mu_\eta, \sigma_\eta^2)$ and $m_g \sim N(\mu_m, \sigma_m^2)$, with means μλ = 0, μη = 0, μm = 0 and variances $\sigma_\lambda^2 = 10^2$, $\sigma_\eta^2 = 10^2$, $\sigma_m^2 = 10^2$. We performed a sensitivity analysis on the hyperparameter values: because the variances $\sigma_\lambda^2$, $\sigma_\eta^2$ and $\sigma_m^2$ are fairly large, the results change little when the means μλ, μη and μm change (see the Appendix for the result of a sensitivity analysis on the hyperparameter μη).

In addition to the informative parameters listed above, we introduce one supporting parameter ωgik into the model to help obtain closed-form posterior distribution for βgk and αgk by exploiting conditional conjugacy (Polson et al., 2013; Zhou et al., 2012). The prior for ωgik is specified as:

$$\omega_{gik} \sim \mathrm{PG}(y_{gik} + \phi_{gk}^{-1}, 0),$$

where PG refers to the Polya-Gamma distribution. Details about this distribution and how the supporting parameter facilitates conditional conjugacy are provided in the Appendix. The closed-form posterior distributions of βgk and αgk obtained through conditional conjugacy speed up MCMC simulation.

2.4 Model-based clustering to categorize DE genes

We next utilize the differential expression indicators δgk to cluster the DE genes and model the homogeneous and heterogeneous differential signals across studies. Since clustering based on the binary latent variable is unstable and does not take effect size into consideration, we first transform the binary vector into a standard normal vector and then use a Dirichlet process Gaussian mixture model to cluster the DE genes, following Medvedovic et al. (2004). Let P(δgk = 1) = πgk be the prior probability that gene g is DE in study k. The sign of the effect size is used to turn πgk into a signed probability measure $\pi_{gk}^{\pm} = \pi_{gk} \times \mathrm{sign}(\beta_{gk})$, where sign(.) is the sign function. We further rescale it to $\tilde{\pi}_{gk} = (\pi_{gk}^{\pm} + 1)/2$, so that the score falls in the range [0, 1]. Lastly, we transform $\tilde{\pi}_{gk}$ to a Z-score $z_{gk} = \Phi^{-1}(\tilde{\pi}_{gk})$, where Φ is the standard normal cumulative distribution function. Following Ferguson (1983) and Neal (2000), we construct a Dirichlet process mixture (DPM) framework to cluster the DE genes:

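$$\vec{z}_g \mid c_g, \theta \sim F(\vec{\theta}_{c_g}), \quad P(c_g = c) = p_c, \quad \vec{\theta}_c \sim G_0, \quad \vec{p} \sim \mathrm{Dirichlet}(a/C, \ldots, a/C), \qquad (7)$$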

where $\vec{z}_g = (z_{g1}, \ldots, z_{gK})^T$ and cg indicates the “latent cluster” of gene g. $F(\vec{\theta}_c)$ is a K-dimensional multivariate Gaussian distribution with mean $\vec{\theta}_c$ and identity covariance matrix, so that the $\vec{z}_g$ follow a Gaussian mixture. C is the number of clusters, which is stochastic and allowed to go to infinity under the DPM. G0 is the base distribution, in this case $G_0 = N_K(\vec{0}, I)$, and $\vec{p} = (p_1, \ldots, p_C)$ is the vector of mixing proportions for the clusters. a/C is the concentration parameter. In our model, we specify a = C, so that $\vec{p} \sim \mathrm{Dirichlet}(1, \ldots, 1)$, i.e. a flat prior on the mixing proportions under the constraint $\sum_{c=1}^{C} p_c = 1$.
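The transformation from a DE probability and the sign of the effect size to a Z-score, as used in this section, can be sketched as follows. The clipping constant `eps` is an assumption added here so that probabilities of exactly 0 or 1 do not map to infinite Z-scores; the text does not say how such boundary cases are handled.

```python
from statistics import NormalDist

def de_z_score(pi, beta, eps=1e-6):
    """Map a DE probability pi in [0, 1] and an effect size beta to a Z-score.

    Steps: signed probability pi * sign(beta), rescaling to [0, 1],
    then the standard normal quantile function.
    """
    sign = (beta > 0) - (beta < 0)
    signed = pi * sign                      # lies in [-1, 1]
    rescaled = (signed + 1.0) / 2.0         # lies in [0, 1]
    # Assumption: clip away from 0 and 1 to keep the quantile finite.
    rescaled = min(max(rescaled, eps), 1.0 - eps)
    return NormalDist().inv_cdf(rescaled)

# A confidently up-regulated gene, a confidently down-regulated gene
# and a non-DE gene (hypothetical inputs).
z_up = de_z_score(0.9, 2.0)
z_down = de_z_score(0.9, -2.0)
z_null = de_z_score(0.0, 1.3)
```

Up-regulated and down-regulated genes land on opposite sides of zero, while genes with little evidence of DE stay near zero, which is what lets the Gaussian mixture separate patterns by direction.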

The above descriptions fully define the proposed hierarchical Bayesian model. The observed data are the raw counts, the library sizes and the phenotypic indicators {ygik, Tik, Xik}. The parameters we need to update through sampling include δgk, βgk, αgk, ϕgk, λg, ηg, mg, $\sigma_k^2$, $\tau_k^2$, $\xi_k^2$, $\rho_{kk'}$, $r_{kk'}$, $t_{kk'}$, ωgik, cg and C. The hyperparameters we prespecify include υ = K + 1, μλ = 0, μη = 0, μm = 0, $\sigma_\lambda^2 = 10^2$, $\sigma_\eta^2 = 10^2$, $\sigma_m^2 = 10^2$ and Cinit = 10.
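To make the clustering layer of equation (7) concrete, the sketch below draws from its finite-C version: mixing proportions from a symmetric Dirichlet, cluster means from the base distribution N_K(0, I), labels from the mixing proportions, and Z-score vectors from unit-variance Gaussians. The function name, the seed and the chosen G, K and C are illustrative assumptions; this simulates from the prior and is not the posterior sampler.

```python
import random

def draw_mixture(G, K, C, a, rng):
    """Forward-simulate the finite mixture in equation (7)."""
    # p ~ Dirichlet(a/C, ..., a/C), via normalised Gamma draws.
    gammas = [rng.gammavariate(a / C, 1.0) for _ in range(C)]
    total = sum(gammas)
    p = [g / total for g in gammas]
    # theta_c ~ G0 = N_K(0, I): independent standard normal coordinates.
    theta = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(C)]
    # c_g ~ Categorical(p) for each gene g.
    labels = rng.choices(range(C), weights=p, k=G)
    # z_g | c_g ~ N_K(theta_{c_g}, I).
    z = [[rng.gauss(theta[c][k], 1.0) for k in range(K)] for c in labels]
    return p, theta, labels, z

rng = random.Random(2016)          # arbitrary seed
C = 10                             # matches C_init = 10 in the text
p, theta, labels, z = draw_mixture(G=50, K=3, C=C, a=C, rng=rng)
```

With a = C each Dirichlet parameter equals 1, which is the setting used in the model.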

2.5 Simulating posterior distribution via MCMC

We use the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) as well as the Gibbs sampling algorithm (Geman and Geman, 1984) to infer the posterior distribution of the parameters. Depending on the form of the distribution, 5 types of mechanisms are proposed to update the 16 groups of parameters.

  1. The full conditional for (αgk, βgk) is bivariate normal given ω⃗gk. The full conditional for ω⃗gk is a Polya-Gamma distribution given αgk and βgk (Polson et al., 2013; Zhou et al., 2012). We use Gibbs sampling to update them sequentially for each gene g in study k.

  2. The full conditionals for λg, ηg and mg are multivariate Gaussian distributions for each gene g. The full conditional for each element of $[\sigma_k^2]_1^K$, $[\tau_k^2]_1^K$ and $[\xi_k^2]_1^K$ is an inverse-gamma distribution. The full conditionals for $[\rho_{kk'}]_1^K$, $[r_{kk'}]_1^K$ and $[t_{kk'}]_1^K$ are inverse-Wishart distributions. All of these have closed-form conditional distributions, so we use Gibbs sampling to update them.

  3. For ϕ⃗g, we propose an MH algorithm to update it for each gene g. In particular, we sample a new value of ϕ⃗g from a multivariate log-normal jump distribution with mean equal to the old value and covariance matrix equal to Π. The acceptance ratio r is defined as the ratio of the two posterior density functions, and the new value is accepted with probability min[1, r].

  4. For the pair (βgk, δgk), since the support of βgk depends on δgk, we use reversible jump MCMC to update them jointly (Green, 1995; Lewin et al., 2007). First, a potential new value of δgk is proposed by inverting the current value, i.e. δ̃gk = 1 − δgk, and a new value β̃gk is then sampled from the associated full conditional given δ̃gk. We define the ratio of the two joint posterior distributions as r and jointly accept the proposed values (β̃gk, δ̃gk) with probability min[1, r].

  5. Since our DPM model is in a conjugate context, to update the cluster assignment cg we follow Algorithm 3 in Neal (2000) and draw a new value from $c_g \mid c_{-g}, \vec{z}_g$ for g = 1, …, G at each iteration, where $c_{-g}$ denotes the cluster assignments of all genes other than g. The number of clusters C is updated at each iteration based on cg.
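The accept/reject logic of the MH update in step 3 can be sketched in a simplified setting. The example below is an assumption-laden illustration, not the paper's sampler: it updates a single scalar log-dispersion for one gene in one study with a Gaussian random walk on the log scale (the paper uses a multivariate log-normal jump with covariance Π), treats the means as known, and uses a normal prior on log(ϕ). All names and numbers are hypothetical.

```python
import math
import random

def nb_log_pmf(y, mu, phi):
    """NB log-probability with mean mu and dispersion phi (size r = 1/phi)."""
    r = 1.0 / phi
    log_p = math.log(phi * mu) - math.log1p(phi * mu)
    log_q = -math.log1p(phi * mu)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + y * log_p + r * log_q)

def log_post(log_phi, counts, means, m, xi2):
    """Unnormalised log-posterior of log(phi): NB likelihood + N(m, xi2) prior."""
    phi = math.exp(log_phi)
    loglik = sum(nb_log_pmf(y, mu, phi) for y, mu in zip(counts, means))
    logprior = -0.5 * (log_phi - m) ** 2 / xi2
    return loglik + logprior

def mh_step(log_phi, counts, means, m, xi2, step_sd, rng):
    """One random-walk MH step; the proposal is symmetric on the log scale,
    so the acceptance ratio r is the ratio of posterior densities."""
    proposal = rng.gauss(log_phi, step_sd)
    log_r = (log_post(proposal, counts, means, m, xi2)
             - log_post(log_phi, counts, means, m, xi2))
    accept = rng.random() < math.exp(min(0.0, log_r))   # prob. min(1, r)
    return proposal if accept else log_phi

# Hypothetical counts for five samples with a known mean of 7 reads.
rng = random.Random(0)
counts, means = [3, 7, 12, 5, 9], [7.0] * 5
log_phi = math.log(0.5)
for _ in range(200):
    log_phi = mh_step(log_phi, counts, means, -2.0, 1.0, 0.5, rng)
```

Working on the log scale keeps ϕ positive without any boundary handling, which is the same reason a log-normal jump distribution is convenient.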

The detailed updating functions and algorithms for each group of parameters are described in the Appendix. For both simulation and real data, we ran 10,000 MCMC iterations. Selected traceplots (see Appendix) from Simulation I below showed that all parameters reached convergence after a relatively small number of iterations (roughly 3,000). In light of this, the first 3,000 iterations were dropped as the burn-in period in all later analyses, and the remaining 7,000 of the 10,000 iterations were used for inference.

3. Bayesian Inference and Clustering

3.1 Bayesian inference and control of false discovery rate

In the Bayesian literature, Newton et al. (2004) proposed a direct approach to control FDR and defined a Bayesian false discovery rate as:

$$\mathrm{BFDR}(t) = \frac{\sum_{g=1}^{G} P_g(H_0 \mid D)\, d_g(t)}{\sum_{g=1}^{G} d_g(t)},$$

where Pg(H0|D) is the posterior probability of gene g being non-DE (H0) given the data (D) and dg(t) = I{Pg(H0|D) < t}. The threshold t can be tuned to control the BFDR at a given level α. Throughout this paper, the Bayesian false discovery rate (BFDR) is used to address the multiplicity issue for the Bayesian method so that it is comparable to the FDR control of the two-stage methods.
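The BFDR definition above translates directly into code. In this minimal sketch the function name and the example probabilities are hypothetical, and returning 0 when no gene is declared DE is a convention chosen here because the ratio is then 0/0.

```python
def bfdr(null_probs, t):
    """Bayesian FDR at threshold t: the average posterior null probability
    among the genes declared DE, i.e. those with P_g(H0 | D) < t."""
    selected = [p for p in null_probs if p < t]
    if not selected:
        return 0.0          # convention: no discoveries, no false discoveries
    return sum(selected) / len(selected)

# Hypothetical posterior null probabilities for four genes.
null_probs = [0.01, 0.02, 0.20, 0.90]
strict = bfdr(null_probs, 0.05)    # two genes declared DE
lenient = bfdr(null_probs, 0.50)   # three genes declared DE
```

Raising t admits genes with larger null probabilities, so the estimated BFDR can only stay the same or increase.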

For a fair comparison with Fisher's method in meta-analysis, we adopt a union-intersection test (UIT) hypothesis setting (a.k.a. conjunction hypothesis) following Li et al. (2011): $H_0 : \bigcap_k \{\beta_k = 0\}$ vs $H_a : \bigcup_k \{\beta_k \neq 0\}$, i.e. we reject the null when the gene is differentially expressed in at least one study, where βk is the effect size of study k, 1 ≤ k ≤ K. Correspondingly, we define a null set $\Omega_0 = \{\vec{\beta}_g : \sum_{k=1}^{K} I(\beta_{gk} \neq 0) = 0\}$ and the respective DE set $\Omega_1 = \{\vec{\beta}_g : \sum_{k=1}^{K} I(\beta_{gk} \neq 0) > 0\}$. To control BFDR at the gene level, we introduce a Bayesian equivalent of the q-value. From the Bayesian posterior, we can calculate the probability of each gene falling in the null space: $\hat{P}_g(H_0 \mid D) = \hat{P}(\vec{\beta}_g \in \Omega_0 \mid D) = \sum_{t=1}^{T} I\{\vec{\delta}_g^{(t)} = \vec{0}\}/T$, where T is the total number of MCMC samples and $\vec{0}$ is a K-dimensional zero vector. We then define the Bayesian q-value of gene g as $q_g = \min_{t \geq \hat{P}_g(H_0 \mid D)} \mathrm{BFDR}(t)$. This qg is treated in the same way as the q-value obtained from Fisher's method in the frequentist approach. Aside from detection of a DE gene list from meta-analysis, the posterior mean of δgk, E(δgk|D), can be used to infer differential expression for gene g in study k.
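The two quantities in this paragraph, the posterior null probability and the Bayesian q-value, can be sketched as follows. One reading is assumed here: the q-value of gene g is computed as the average null probability over all genes whose null probability is at most that of gene g, i.e. the BFDR of the smallest rejection set that contains g. Function names and inputs are hypothetical.

```python
def null_probability(delta_draws):
    """Fraction of retained MCMC draws in which the indicator vector
    (delta_g1, ..., delta_gK) is all zeros for one gene."""
    T = len(delta_draws)
    return sum(1 for d in delta_draws if not any(d)) / T

def bayesian_q_values(null_probs):
    """q-value per gene: mean null probability over genes at least as
    significant as the gene itself (assumed reading, see lead-in)."""
    q = []
    for p_g in null_probs:
        rejected = [p for p in null_probs if p <= p_g]
        q.append(sum(rejected) / len(rejected))
    return q

# Four retained draws for one gene in K = 2 studies (hypothetical).
draws = [[0, 0], [1, 0], [0, 0], [0, 1]]
p_null = null_probability(draws)

# q-values for four genes with hypothetical null probabilities.
q = bayesian_q_values([0.01, 0.02, 0.20, 0.90])
```

Because the running average of sorted null probabilities never decreases, the smallest rejection set containing a gene also gives the smallest BFDR among all sets that contain it.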

3.2 Summarization of clustering posterior to categorize DE genes

Addressing differential expression in multiple studies is more difficult than in a single study because a gene may be concordantly or discordantly (up-regulated in some studies but not in the others) differentially expressed. The proposed Bayesian method is based on effect size and thus favors DE genes that are concordant across studies. Following Section 2.4, we use the posterior estimate of πgk as an indicator of the cross-study differential expression pattern to cluster the DE genes. To stabilize the estimation, we estimate πgk over non-overlapping windows of 20 MCMC iterations, i.e. $\hat{\pi}_{gk}^{(b)} = \sum_{t(b)=1}^{20} \delta_{gk}^{t(b)}/20$ for the bth window, and then transform it into $z_{gk}$ as in Section 2.4. After each window of 20 iterations, the cluster assignment cg is updated from the DPM model. After all windows, to summarize the posterior estimates of cg, we follow Medvedovic et al. (2004) and Rasmussen et al. (2009) and calculate the co-occurrence probability pg,h for any two genes g and h as the number of times the two genes are assigned to the same cluster divided by the total number of assignments. We then use 1 − pg,h as a dissimilarity measure to further cluster the genes using consensus clustering (Monti et al., 2003). Consensus clustering achieves stability by summarizing hierarchical clustering results (with Ward linkage) over repeated subsampling. The default consensus clustering method does not allow scattered genes (i.e. genes not belonging to any cluster), but one can apply other methods such as tight clustering for that purpose (Tseng and Wong, 2005). As a result, genes with similar differential expression patterns over the chains are grouped together, while those with very different cross-study differential expression patterns are separated.
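The windowed estimate of πgk and the co-occurrence summary described above can be sketched with plain lists. The function names and the toy inputs are hypothetical; the subsequent consensus clustering step is not shown.

```python
def window_means(delta_chain, width=20):
    """Average a 0/1 indicator chain over non-overlapping windows."""
    n_windows = len(delta_chain) // width
    return [sum(delta_chain[b * width:(b + 1) * width]) / width
            for b in range(n_windows)]

def cooccurrence(assignments):
    """Co-occurrence probabilities p_{g,h}: the share of cluster-assignment
    draws in which genes g and h carry the same label."""
    n, G = len(assignments), len(assignments[0])
    counts = [[0] * G for _ in range(G)]
    for labels in assignments:
        for g in range(G):
            for h in range(G):
                if labels[g] == labels[h]:
                    counts[g][h] += 1
    return [[c / n for c in row] for row in counts]

def dissimilarity(co):
    """Dissimilarity 1 - p_{g,h} used as input to the final clustering."""
    return [[1.0 - v for v in row] for row in co]

# Toy chain: DE in the first window, non-DE in the second.
pi_hat = window_means([1] * 20 + [0] * 20)

# Three assignment draws for three genes; label values are arbitrary.
co = cooccurrence([[0, 0, 1], [0, 1, 1], [2, 2, 0]])
dis = dissimilarity(co)
```

Only equality of labels within a draw matters, so the summary is unaffected by label switching between draws.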

3.3 Methods for comparison

Since other existing Bayesian methods for RNA-seq DE analysis, such as “baySeq” and “EBSeq”, were developed for a single study (Hardcastle and Kelly, 2010; Leng et al., 2013), they cannot be directly extended to a meta-analysis framework and compared with our method. Thus, we compare our method to selected two-stage approaches that have been frequently adopted in the literature so far. For the first stage of single-study differential expression analysis of RNA-seq, we use the two most popular tools, edgeR and DESeq (Robinson et al., 2010; Anders and Huber, 2010). For meta-analysis, since no other methods have been proposed specifically for RNA-seq, Fisher's method is applied to combine edgeR or DESeq p-values from multiple RNA-seq studies (Fisher, 1925). The meta-analysed p-values are then adjusted for multiple comparisons by the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). In this paper, we compare BayesMetaSeq with the two-stage edgeR/Fisher and DESeq/Fisher approaches.
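The second stage of the two-stage comparators can be sketched without any external library. Fisher's statistic −2 Σ log p follows a chi-square distribution with 2K degrees of freedom under the null, whose survival function has a closed form for even degrees of freedom. The function names are hypothetical, the per-study p-values would come from edgeR or DESeq in practice, and the inputs are assumed to lie in (0, 1].

```python
import math

def fisher_combine(pvalues):
    """Fisher's combined p-value for K independent p-values in (0, 1]."""
    K = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)   # X / 2 with X = -2 * sum(log p)
    # Chi-square(2K) survival function: exp(-x/2) * sum_{i<K} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, K):
        term *= half / i
        total += term
    return math.exp(-half) * total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):                # from largest to smallest p
        i = order[rank - 1]
        running = min(running, pvalues[i] * m / rank)
        adjusted[i] = running
    return adjusted

# One gene measured in two studies, then four genes adjusted jointly.
combined = fisher_combine([0.05, 0.05])
adjusted = benjamini_hochberg([0.01, 0.02, 0.03, 0.04])
```

Two moderately small p-values combine into a smaller one, which is the gain in power that motivates meta-analysis in the first place.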

4. Simulation

We performed three types of simulation to compare BayesMetaSeq, edgeR/Fisher and DESeq/Fisher. Details are described below.

  1. Simulating homogeneous study effects to assess power and accuracy

    In the first part of the simulation, we assessed the performance of BayesMetaSeq for genes with low, medium and high read counts when the effects were homogeneous across all studies. We simulated expression counts of G = 1000 genes for K = 2, 5 studies with Nk = 10 (5 cases and 5 controls), 1 ≤ k ≤ K. Library sizes for all samples were sampled from 0.3 to 0.5 million, so that the average counts ranged roughly from 300 to 500. Baseline expression was either high (αgk ∼ Unif{−5.5, −4.5}; mean counts ∼ 1500-4500), medium (αgk ∼ Unif{−8.5, −6.5}; mean counts ∼ 80-600) or low (αgk ∼ Unif{−11, −9}; mean counts ∼ 5-50). The log-scaled dispersions were generated accordingly (log(ϕgk) ∼ Unif{−3.5, −2.5} for high mean counts, log(ϕgk) ∼ Unif{−2.5, −1.5} for medium mean counts and log(ϕgk) ∼ Unif{−1.5, −0.5} for low mean counts), assuming genes with larger means had smaller dispersion (Anders et al., 2013). We let the first 20% of genes (N=200) be differentially expressed in all studies; among them, 1/2 were generated from high means and the other 1/2 from low means. The remaining genes (N=800) were non-differentially expressed; 1/4 of them were generated from high means, 1/4 from medium means and the other 1/2 from low means. For differentially expressed genes, the effect size βgk was drawn from Unif{0.8, 2} or Unif{−2, −0.8} (positive or negative log fold change, respectively). For non-differentially expressed genes, βgk was drawn from N(0, 0.5²). We repeated the above parameter sampling for all K studies. Under the same homogeneous scenario, we also repeated the above simulations with weaker DE signals (Simulation IB), i.e. the log-scaled effect size βgk was drawn from Unif{0.7, 1.5} or Unif{−1.5, −0.7} for DE genes and from N(0, 0.7²) for Non-DE genes.

  2. Simulating heterogeneous study effects to assess power and accuracy

    In the second part of the simulation, we assessed the performance of BayesMetaSeq when the effects were heterogeneous across studies. We simulated expression counts of G = 1000 genes for K = 2, 5 studies with Nk = 10, 1 ≤ k ≤ K. Library sizes, baseline expression and the corresponding log-scaled dispersions were generated in the same way as in Simulation I. We assumed the first 30% of genes (N=300) to be differentially expressed. For K = 2, 2/3 of those genes were DE only in the first study or only in the second study, and 1/3 were DE in both; for K = 5, 1/3 of those genes were DE in only one study, 1/3 were DE in only two studies, and 1/3 were DE in more than two studies. As in the previous simulation, 1/2 of the DE genes were from high means and 1/2 from low means. The other 70% of genes (N=700) were non-differentially expressed; 1/4 of them were generated from high means, 1/4 from medium means and 1/2 from low means. For differentially expressed genes, the effect size βgk was drawn from Unif{1, 2.5} or Unif{−2.5, −1}; no discordance across studies was allowed. For non-differentially expressed genes, βgk was drawn from Unif{−0.3, 0.3}.

  3. Simulating cross-study differential patterns to evaluate DE gene clustering

    In the third part of the simulation, we assessed the clustering performance of BayesMetaSeq when the DE genes were generated from varying cross-study differential patterns. We simulated expression counts of G = 1000 genes for K = 3 studies with Nk = 10, 1 ≤ k ≤ K. Library sizes were generated in the same way as in Simulation I. The baseline expression αgk was drawn from Unif{−8.5, −6.5} (mean counts ∼ 80-600) and the log-scaled dispersion log(ϕgk) was drawn from Unif{−2.5, −1.5}. We assumed the first 30% of genes (N=300) were differentially expressed in at least 2 studies. Among them, 1/6 were up-regulated in all studies (“+++”), 1/6 were down-regulated in all studies (“- - -”), and the other 2/3 were either up-regulated or down-regulated in two studies but non-DE in the third study (e.g. 50 genes with the pattern “++0”, 50 genes with the pattern “- - 0”, 50 genes with the pattern “+0+”, 50 genes with the pattern “- 0 -”). For differentially expressed genes, the effect size βgk was drawn from N(2, 0.5²) or N(−2, 0.5²) (up-regulated or down-regulated, respectively). For non-differentially expressed genes, βgk was 0.
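The parameter-generation step of Simulation I can be sketched as below for a single gene in a single study. Several details are assumptions made for illustration: the sign of a DE effect is chosen with equal probability, controls are listed before cases, and only the expected counts are produced (the observed counts would then be drawn from a negative binomial distribution with these means and dispersion, a step omitted here because the Python standard library has no negative binomial sampler).

```python
import math
import random

# (alpha range, log-dispersion range) per expression level, as in Simulation I.
LEVELS = {
    "high":   ((-5.5, -4.5), (-3.5, -2.5)),
    "medium": ((-8.5, -6.5), (-2.5, -1.5)),
    "low":    ((-11.0, -9.0), (-1.5, -0.5)),
}

def simulate_gene(level, is_de, n_samples, rng):
    """Draw parameters and expected counts for one gene in one study."""
    (a_lo, a_hi), (d_lo, d_hi) = LEVELS[level]
    alpha = rng.uniform(a_lo, a_hi)
    phi = math.exp(rng.uniform(d_lo, d_hi))
    if is_de:
        # Assumption: up- and down-regulation are equally likely.
        beta = rng.choice([-1.0, 1.0]) * rng.uniform(0.8, 2.0)
    else:
        beta = rng.gauss(0.0, 0.5)              # N(0, 0.5^2)
    library = [rng.uniform(0.3e6, 0.5e6) for _ in range(n_samples)]
    half = n_samples // 2
    conditions = [0] * half + [1] * (n_samples - half)   # controls, then cases
    means = [T * math.exp(alpha + beta * x)              # equation (2)
             for T, x in zip(library, conditions)]
    return {"alpha": alpha, "beta": beta, "phi": phi,
            "library": library, "conditions": conditions, "means": means}

rng = random.Random(1)               # arbitrary seed
gene = simulate_gene("low", True, 10, rng)
```

Looping this function over genes and studies with the stated proportions of high, medium and low baselines reproduces the parameter layout of Simulation IA.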

For comparison with the other methods (edgeR/Fisher and DESeq/Fisher), we assessed both power and accuracy by plotting the number of true positives against the top number of declared DE genes, as well as the ROC curves respectively for each method.
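Both evaluation measures can be computed from a vector of per-gene scores and the known truth. In this sketch smaller scores mean stronger evidence of DE (as with q-values), the AUC is obtained from pairwise comparisons of DE and non-DE genes, and the function names and toy inputs are hypothetical.

```python
def true_positives_at_top(scores, truth, n):
    """Number of truly DE genes among the n genes with the smallest scores."""
    order = sorted(range(len(scores)), key=lambda g: scores[g])
    return sum(truth[g] for g in order[:n])

def auc(scores, truth):
    """Area under the ROC curve: the probability that a random DE gene has a
    smaller score than a random non-DE gene, counting ties as one half."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with four genes, two of which are truly DE.
truth = [1, 0, 1, 0]
perfect = auc([0.01, 0.20, 0.03, 0.80], truth)
imperfect = auc([0.01, 0.02, 0.03, 0.80], truth)
top2 = true_positives_at_top([0.01, 0.02, 0.03, 0.80], truth, 2)
```

Plotting `true_positives_at_top` against n gives the power curves, while `auc` summarizes each ROC curve in a single number.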

Simulation I, II

The posterior means and standard errors of selected parameters from Simulation IA were summarized and compared to their true values, as shown in the Appendix. The results demonstrated the validity of BayesMetaSeq. In Simulation IA with homogeneous study effects, we found that BayesMetaSeq was more powerful and accurate than the edgeR/Fisher and DESeq/Fisher methods for low mean counts, while the three methods performed almost equally well for high mean counts (for simplicity, we combined both high mean and medium mean genes in this group), as shown in Figure 2(A). Compared to the other two methods, only BayesMetaSeq had an AUC above 0.9 in the low mean region, with both high sensitivity and specificity. As the number of studies K increased, the advantage of the Bayesian method over the other methods in detecting DE genes with low means became more noticeable. Since the signals for high means were very strong, the three approaches performed almost perfectly even when K = 2. For Simulation IB with weaker signals, the results were similar to Simulation IA and the difference between BayesMetaSeq and the other two methods in the low mean region was more noticeable, while in the high mean region the performance of all three methods was alike (Figure 2(B)).

Figure 2. ROC curve (left) and power (right) comparison of BayesMetaSeq vs. edgeR/Fisher vs. DESeq/Fisher. (A) Simulation IA; (B) Simulation IB; (C) Simulation II. The solid line is for BayesMetaSeq, the dashed line is for edgeR and the dotted line is for DESeq. The AUC values are attached to each ROC plot. For the power comparison, the X axis refers to the top number of DE genes declared by each method, and the Y axis refers to the number of true positives.

Similarly, in Simulation II with heterogeneous study effects, although the overall signals became weaker, BayesMetaSeq still performed better than the edgeR/Fisher and DESeq/Fisher methods in terms of both power and accuracy for genes with low mean counts, while their performances were similar in the high mean region, as shown in Figure 2(C). Note that even though the Bayesian method increased the power to detect true DE signals in the low mean region, the detection power for low mean genes was still weaker than for high mean genes under the same scenario, owing to the inherent read count bias.

Simulation III

In Simulation III, we found that BayesMetaSeq clearly identified the six clusters of DE genes with pre-specified cross-study differential patterns (Figure 3 Left). Each of the six clusters corresponded to one particular cross-study differential pattern as reflected in the heatmap of signed E(δgk|D) (Figure 3 Right), for example, cluster 1 included genes up-regulated in all studies and cluster 3 included genes down-regulated in all studies.

Figure 3.

Simulation III. Left: Correlation heatmap of DE genes based on the co-occurrence probability pg,h with consensus clustering. Right: The heatmap of signed posterior mean of the DE latent indicator (i.e. E(δgk|D) ×sign(βgk)) in the six clusters.

5. Real Data Analysis

We applied BayesMetaSeq to a multi-brain-region HIV-1 transgenic rat experiment (GSE47474) comparing the normal F344 strain and the HIV strain (Li et al., 2013). Samples from three brain tissues (hippocampus (HIP), striatum (STR) and prefrontal cortex (PFC)) were sequenced, and we treated these as three studies in our meta-analysis framework. There were 12 samples from each brain region in each strain (N1 = N2 = N3 = 24, K = 3). The experiment was designed to determine expression differences in brain regions of F344 and HIV-1 transgenic rats, in order to identify the mechanisms involved in HIV-1 neuropathology and develop efficient therapy for neuropsychiatric disorders associated with HIV-1 infection (Li et al., 2013). Following the guidance in edgeR (Robinson et al., 2010), we first filtered out genes with mean counts smaller than 1 in any study. After filtering, 10,280 genes remained for analysis. We applied BayesMetaSeq as well as edgeR/Fisher and DESeq/Fisher to the data. After obtaining the DE genes from each approach, we performed pathway enrichment analysis using Fisher's exact test based on the Gene Ontology (GO) database to annotate the identified genes (Khatri et al., 2012). In addition, we analyzed the DE gene categories from BayesMetaSeq using Ingenuity Pathway Analysis (IPA) for more biological insight. IPA is a commercial curated database that contains rich functional annotation, gene-gene interaction and regulatory information (IPA®, QIAGEN Redwood City, www.qiagen.com/ingenuity).

5.1 Differential expression analysis

Controlling FDR at 0.1, edgeR/Fisher detected 51 DE genes and DESeq/Fisher detected 46 DE genes, while BayesMetaSeq detected 245 DE genes (Table 1). A Venn diagram of the overlapping genes indicated good agreement among the three methods (see Appendix). As shown in Figure 4(A), the DE genes detected by BayesMetaSeq have a wider detection range, especially for genes with smaller read counts, smaller RPKM (reads per kilobase per million) or shorter transcript length (Mortazavi et al., 2008). Table 2 lists three DE genes detected by BayesMetaSeq but not by the other two methods. They typically have low counts (table and boxplots of normalized counts shown in the Appendix) due to the short length of the transcripts (e.g. Mir212, Mir384) and/or small RPKM (e.g. Alb). microRNA-212 has been reported in previous studies to promote interleukin-17-producing T-helper cell differentiation (Nakahama et al., 2013). miRNA-384 has been found to regulate both amyloid precursor protein (APP) and β-site APP cleaving enzyme, which play an important role in the pathogenesis of Alzheimer's disease (Liu et al., 2014). The gene Alb encodes albumin, which is a primary carrier protein for steroids, fatty acids and thyroid hormones in the blood, and has been used as a marker of HIV disease progression in the setting of highly active antiretroviral therapy (Shah et al., 2007).

Table 1. Comparison of 3 approaches in real rat data.

Method          FDR at 0.05   FDR at 0.1
BayesMetaSeq    169           245
edgeR/Fisher    36            51
DESeq/Fisher    37            46

Figure 4.

(A) Boxplot of average normalized counts, log(RPKM) and transcript lengths for the declared DE genes by each method. From left to right: BayesMetaSeq, edgeR/Fisher, DESeq/Fisher. (B) Manhattan plot of GO pathways enriched by the top 200 DE genes from each method. X axis refers to the GO pathways sorted by GO IDs, Y axis refers to the −log10(p-values) from Fisher's exact test, and the highlighted points are the GO pathways with FDR < 0.05.

Table 2.

Three example genes that show better detection power of BayesMetaSeq to detect low expressed or short length genes.

Gene    Study  edgeR              DESeq              BayesMetaSeq          Ave. normalized counts  Ave.    Transcript
               p-value  Fisher's  p-value  Fisher's  Posterior  Bayesian   HIV      Normal         RPKM    length (bp)
                        q-value            q-value   mean       q-value    strain   strain
Mir212  HIP    0.02     0.21      0.05     0.51      0.89       2e-3       2.08     3.92           20.27   23
        STR    0.02               0.03               0.99                  2.20     4.93           21.31
        PFC    0.07               0.09               0.83                  2.99     5.08           22.56

Mir384  HIP    0.06     0.39      0.10     0.86      0.88       8e-3       1.59     2.84           15.53   20
        STR    0.65               0.77               0.33                  2.54     2.12           14.50
        PFC    0.004              0.01               0.98                  2.02     4.59           19.28

Alb     HIP    0.002    0.10      0.06     0.88      0.95       6e-3       8.58     2.93           1.31    2676
        STR    0.61               0.60               0.41                  9.17     7.65           1.67
        PFC    0.006              0.03               0.99                  20.09    10.90          2.89

Fisher's and Bayesian q-values and the transcript length are reported once per gene; all other columns are reported per study.

5.2 Pathway enrichment analysis on detected DE genes

Detecting more DE genes does not necessarily indicate better performance of our method. Since the underlying truth is not known in real data, we performed a pathway enrichment analysis on the DE genes identified by each method. For fair comparison, we used the top 200 genes from each of the three methods and regarded them as DE genes in the pathway analysis. We tested three pathway databases in MSigDB (http://software.broadinstitute.org/gsea/msigdb): GO, KEGG and Reactome. Only GO reported significant (q-value < 0.05) pathways for all three methods. Controlling FDR at 0.05 by Benjamini-Hochberg correction, we found 50 GO pathways enriched with the DE genes from BayesMetaSeq, while only 20 and 22 GO pathways were enriched for edgeR and DESeq, respectively. A cluster of enriched pathways was identified on the left of the Manhattan plot for BayesMetaSeq (circled), implying enrichment in a major functional domain (Figure 4(B); pathways sorted by GO IDs). These pathways were mainly related to cell killing, leukocyte mediated cytotoxicity and T-cell mediated cytotoxicity (GO:0001906, GO:0001909, GO:0001910, GO:0001912, GO:0001913, GO:0001914, GO:0001916) and were enriched with BayesMetaSeq only (Table 3; p-values obtained from Fisher's exact test). The enrichment in these GO pathways might reflect changes in the adaptive immune response against HIV.
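The enrichment step can be illustrated with a small standard-library sketch: a one-sided Fisher's exact test (upper hypergeometric tail) per pathway, followed by Benjamini-Hochberg adjustment. The one-sided alternative and the choice of gene universe are assumptions made here for illustration; the text does not state which were used.

```python
from math import comb

def enrichment_pvalue(n_universe, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn from the universe at random."""
    upper = min(n_pathway, n_de)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_de - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_universe, n_de)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q
```

In the setting above, `n_de` would be 200 (the top genes of one method) and `n_universe` would be the number of genes retained after filtering, if that is taken as the background.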

Table 3.

Selected GO pathways enriched only with BayesMetaSeq from Figure 4(B). For fair comparison, the top 200 genes from each approach were regarded as DE genes and used for pathway analysis.

GO ID       GO Term                                                 BayesMetaSeq      edgeR             DESeq
                                                                    p-value (logOR)   p-value (logOR)   p-value (logOR)
GO:0001906  cell killing                                            2.2e-4 (1.87)     0.033 (1.25)      0.12 (0.95)
GO:0001909  leukocyte mediated cytotoxicity                         1e-3 (1.77)       0.105 (1.01)      0.102 (1.03)
GO:0001910  regulation of leukocyte mediated cytotoxicity           2.3e-4 (2.07)     0.056 (1.30)      0.055 (1.31)
GO:0001912  positive regulation of leukocyte mediated cytotoxicity  1.3e-4 (2.18)     0.04 (1.40)       0.043 (1.42)
GO:0001913  T cell mediated cytotoxicity                            9.9e-5 (2.25)     0.039 (1.46)      0.038 (1.47)
GO:0001914  regulation of T cell mediated cytotoxicity              5e-5 (2.39)       0.029 (1.59)      0.028 (1.60)
GO:0001916  positive regulation of T cell mediated cytotoxicity     3.5e-5 (2.46)     0.024 (1.66)      0.24 (1.67)

5.3 Categorization of DE genes by study heterogeneity

We calculated the co-occurrence probability pg,h and used 1 – pg,h as a dissimilarity measure to cluster the DE genes of BayesMetaSeq. As shown in Figure 5(A), we identified seven major clusters from the 245 DE genes. Each of the seven clusters corresponded to one particular cross-study differential pattern based on the signed E(δgk|D) (Figure 5(B)). For example, genes in Cluster 1 were up-regulated in all three studies and genes in Cluster 5 were down-regulated only in STR, but not in HIP and PFC. Moreover, when we analyzed each cluster of genes separately through IPA pathway enrichment analysis, we noticed that the clusters represented different functional domains that were changed in the HIV strain relative to the normal strain in different brain regions. For example, Cluster 1, which included genes up-regulated in all three brain regions, was mainly involved in antimicrobial response, while Cluster 5, which included genes down-regulated in the STR region only, was mainly related to nervous system development (Figure 5(C)). Cluster 7 is not shown since it included very few DE genes and only one enriched pathway was identified. A detailed list of significant pathways in each cluster, with corresponding p-values and log odds ratios, can be found in the Appendix. In our analysis, we detected more region-specific DE markers (Clusters 2-7) than common DE markers (Cluster 1), which was consistent with the results reported in the original paper on these data (Li et al., 2013).
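As a rough illustration of the dissimilarity used above, the sketch below estimates a co-occurrence probability as the fraction of MCMC draws in which two genes share a cluster label, and returns 1 minus that fraction. It assumes that pg,h is defined this way (the formal definition is given in an earlier section) and that the sampled cluster labels are stored as one list per draw; both are assumptions of this example.

```python
def cooccurrence_dissimilarity(assignments):
    """assignments: list of MCMC draws, each a list of cluster labels (one per gene).

    Returns the matrix 1 - p[g][h], where p[g][h] is the fraction of draws in
    which genes g and h carry the same label.
    """
    n_draws = len(assignments)
    n_genes = len(assignments[0])
    together = [[0] * n_genes for _ in range(n_genes)]
    for draw in assignments:
        for g in range(n_genes):
            for h in range(n_genes):
                if draw[g] == draw[h]:
                    together[g][h] += 1
    return [[1.0 - together[g][h] / n_draws for h in range(n_genes)]
            for g in range(n_genes)]
```

Because only label equality within a draw is used, the result does not depend on how clusters are numbered from one draw to the next. The resulting matrix could then be passed to any distance-based clustering routine.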

Figure 5.

(A) Correlation heatmap of 245 Bayesian DE genes based on the co-occurrence probability pg,h with consensus clustering. (B) The heatmap of the signed posterior mean of the DE latent indicator (i.e. E(δgk|D) × sign(βgk)) in the five major clusters. (C) A collection of overlapping IPA pathways enriched with each cluster of genes (deeper color refers to more significant pathways).

6. Discussion and Conclusion

In this paper, we proposed a Bayesian hierarchical model called BayesMetaSeq to conduct meta-analysis of RNA-seq data and biomarker categorization by study heterogeneity. Based on a negative binomial framework, the model assumed study-specific differential expression pattern for each gene and allowed the shrinkage of multiple parameters. MCMC algorithm was applied to update the posterior distribution of model parameters and the multiplicity issue was addressed by global FDR from a Bayesian perspective. A Dirichlet process mixture (DPM) model embedded in the Bayesian framework automatically clustered the detected biomarkers based on cross-study differential patterns. Both the simulations and real rat data analysis showed that the Bayesian unified model was more powerful than the two-stage methods (e.g. edgeR/Fisher, DESeq/Fisher), especially in lowly expressed genes without the loss of power in highly expressed genes, and the false discovery rate was well controlled. The differentially expressed genes identified by BayesMetaSeq between HIV strains and normal strains in the real data were further validated by pathway analysis and many DE genes were enriched in pathways related to immune response. Clustering analysis of the DE genes showed that genes with unique cross-study differential patterns were involved in specific functional domains such as antimicrobial response, inflammatory response and so on.

Bayesian models have long been used in differential analysis of genomic studies such as microarray, RNA-seq and methylation (Hardcastle and Kelly, 2010; Leng et al., 2013; Van De Wiel et al., 2012; Chung et al., 2013; Park et al., 2014). Compared to other approaches, Bayesian methods can handle more complex generative mechanisms and allow the sharing of information across studies and across genes, both of which are essential for meta-analysis. Our unified Bayesian meta-analysis model increases the detection power for genes with low counts by accumulating small counts from multiple studies and encourages the sharing of information across different studies, which is not seen in the two-stage meta-analysis methods. In addition, the flexible and adaptable modelling of variance across samples in our approach also contributes to the improvement of detection power (Chung et al., 2013). A similar advantage of a unified model over two-stage methods has been seen in the categorical analysis literature, where joint modelling of count data to combine multiple sparse contingency tables was shown to be more powerful than traditional two-stage methods (Warn et al., 2002; Bradburn et al., 2007).

The current model relies on a fixed effects model, which assumes that differences of effect sizes are from sampling error alone. It can be readily extended to a random effects scenario, where each effect size is assumed to be drawn from a study-specific distribution (Choi et al., 2003). Model checking can be performed to determine whether the fixed effect model or the random effect model is more adequate for a given dataset. Recent statistical research on RNA-seq proposed zero-inflated negative binomial model as an alternative to the regular negative binomial model and found that it fits better to real data since excessive zeros have always been observed in the NGS data (Van De Wiel et al., 2012). Our model can be easily extended to a zero-inflation framework, and its performance and computing feasibility for applications can be assessed through simulation or real data analysis. In our current approach, only binary outcome is considered. The framework is applicable for continuous outcome or multi-class outcome, where dummy variable regression approach can be applied. Moreover, potential confounding covariates such as age, gender and other individual attributes can be included in the model.

Our real data application presents an example using the same RNA-seq platform across studies. In practice, it is possible that studies from different RNA-seq platforms are included and thus introduce significant bias. For example, the Sequencing Quality Control (SEQC) consortium performed extensive comparison on three RNA-seq platforms (Illumina HiSeq, Life Technology SOLiD and Roche 454) and determined pros and cons of different platforms (Consortium et al., 2014; Xu et al., 2013). As of 07/30/2016, more than 95% of data in GEO used Illumina sequencing systems. As a result, unless different experimental protocols (e.g. mRNA preparation kits) are used in different studies, the platform bias in RNA-seq meta-analysis is not as severe as in microarray. We, however, acknowledge that platform bias may exist or may become more serious if new competing sequencing platforms become popular in the future. Practitioners should apply batch effect diagnostic or removal tools (Leek, 2014; Liu and Markatou, 2016), or extend with random effects in our model to account for cross-platform bias.

Currently, the Bayesian hierarchical model allows study-specific DE status, but favors concordant differential expression across studies. In some applications, discordant DE genes (e.g. a biomarker is up-regulated in one brain region but down-regulated in another brain region) may be expected and another hierarchical layer will be needed to accommodate. Another limitation of our method is the relatively high computational cost. To speed up the computation, we randomly partition the whole dataset into independent gene chunks and apply explicit parallelism using “snowfall” package in R, while merging intermediate outputs for cluster analysis with all genes. It takes about 1 hour for 10,000 MCMC iterations and 10,280 genes with K=3 using 128 computing threads (8 CPUs each with Sixteen-core AMD 2.3GHz and 128GB RAM) in R code. Since the reduction of computing time is almost linear when more computing threads are used, we expect further computing time reduction when powerful computing clusters are used. Optimization of code in C++ and applying further parallel computing such as Consensus Monte Carlo Algorithm and Asynchronous Distributed Gibbs Sampling (Scott et al., 2013; Terenin et al., 2015) should further reduce computing time for general applications in the future. An R package, BayesMetaSeq, is publicly available to perform the analysis (http://tsenglab.biostat.pitt.edu/software.htm).

Supplementary Material

Figure S1: Traceplots of selected parameters from Simulation IA.

Figure S2: Venn Diagram of number of overlapping DE genes (FDR < 0.1) among the three methods applied in real data.

Figure S3: Distribution of normalized counts for the three genes shown in Table 2 (left: HIV strain; right: Normal strain). The values above the boxplots correspond to the respective p-values or posterior means from edgeR/DESeq/BayesMetaSeq, with stars indicating the significance (e.g. p-value ≤ 0.1 or E(δgk|D) ≥ 0.8).

Table S1: Comparison of posterior mean of the parameters estimated by BayesMetaSeq with their true values from Simulation IA, K=2

Table S2: Sensitivity analysis on hyperparameter μη

Table S3: Normalized counts (rounded) for the three genes shown in Table 2.

Table S4: List of significant IPA pathways (p-value < 0.05) from Cluster 1-4 in Figure 5.

Table S5: List of significant IPA pathways (p-value < 0.05) from Cluster 5-7 in Figure 5.

Acknowledgments

Research reported in this publication was supported by NCI of the National Institutes of Health under award number R01CA190766 to T.M. and G.C.T.

Appendix

Parameter estimation by Gibbs Sampling and the Metropolis-Hastings algorithm

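In this section, we describe the conditional distributions used to update each parameter, or the updating algorithm when no closed-form conditional distribution is available. The full conditional posterior is as follows:

$$
\begin{aligned}
P(\cdot \mid Y_{gik}, T_{ik}, X_{ik}) \propto{} & P(Y_{gik} \mid \alpha_{gk}, \beta_{gk}, \phi_{gk}) \\
& \times f(\alpha_{gk} \mid \eta_g, \tau_k^2, r)\, f(\beta_{gk} \mid \lambda_g, \delta_{gk}, \sigma_k^2, \rho)\, f(\phi_{gk} \mid m_g, \xi_k^2, t) \\
& \times f(\eta_g \mid N(\mu_\eta, \sigma_\eta^2))\, (1/\tau_k^2)\, f(r \mid \mathrm{InvWishart}(I, K+1)) \\
& \times f(\lambda_g \mid N(\mu_\lambda, \sigma_\lambda^2))\, (1/\sigma_k^2)\, f(\rho \mid \mathrm{InvWishart}(I, K+1)) \\
& \times f(m_g \mid N(\mu_m, \sigma_m^2))\, (1/\xi_k^2)\, f(t \mid \mathrm{InvWishart}(I, K+1)) \\
& \times f(\delta_{gk} \mid \pi_{gk})\, f(\pi_{gk} \mid \theta, c_g)\, f(\theta \mid G_0)\, f(c_g \mid p)\, f(p \mid a, C).
\end{aligned}
\tag{A.1}
$$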

To update each parameter, we condition on the current values of the remaining parameters and keep only the factors in (A.1) that involve it.

Step 1

Gibbs sampling is used to update αgk and βgk. The two sets of parameters are updated for each gene in each study; for simplicity, we drop the subscripts g and k here. The posterior distributions of these two parameters have closed forms conditional on the auxiliary variable ω from the Polya-Gamma (PG) distribution. Following Polson et al. (2013), ω ∼ PG(b, c) is an infinite convolution of gamma distributions defined as:

$$
\omega \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)},
$$

where each $g_k \sim \mathrm{Gamma}(b, 1)$ is an independent gamma random variable with b > 0 and c ∈ ℝ, and $\overset{D}{=}$ denotes equality in distribution.
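To make the series definition concrete, the sketch below draws approximately from PG(b, c) by truncating the infinite sum after a fixed number of terms. This is an illustrative approximation rather than the sampler used for the results in this paper; the truncation level `n_terms` is an arbitrary choice, and truncation biases draws slightly downwards. The mean formula b/(2c)·tanh(c/2) is a standard property of the PG distribution and is included only as a sanity check.

```python
import math
import random

def sample_pg_truncated(b, c, rng, n_terms=200):
    """Approximate draw from PG(b, c) using the first n_terms of the gamma series."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g_k = rng.gammavariate(b, 1.0)   # g_k ~ Gamma(shape=b, scale=1)
        total += g_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

def pg_mean(b, c):
    """Mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    if c == 0:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)
```

Averaging many calls to `sample_pg_truncated(b, c, random.Random(seed))` should land close to `pg_mean(b, c)`.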

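The PG distribution has two important properties. First, if ω ∼ PG(b, 0), then by the Laplace transform we have $E\{\exp(-\omega t)\} = \cosh^{-b}(\sqrt{t/2})$, where $\cosh(x) = (e^{x} + e^{-x})/2$. Letting $\omega \sim PG(y + \phi^{-1}, 0)$, the negative binomial likelihood in terms of the proportion p and dispersion ϕ can thus be expressed as:

$$
L(p, \phi) \propto p^{y} (1 - p)^{\phi^{-1}}
= \frac{[\exp(\psi)]^{y}}{[1 + \exp(\psi)]^{y + \phi^{-1}}}
= \frac{2^{-(y + \phi^{-1})} \exp\left(\frac{(y - \phi^{-1})\psi}{2}\right)}{\cosh^{y + \phi^{-1}}\left(\frac{\psi}{2}\right)}
\propto \exp\left(\frac{(y - \phi^{-1})\psi}{2}\right) E_{\omega}\left\{\exp\left(-\frac{\omega \psi^{2}}{2}\right)\right\}.
\tag{A.2}
$$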
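The middle equality in (A.2) rests on one algebraic step that is easy to miss. Assuming the logit parameterisation $p = e^{\psi}/(1 + e^{\psi})$ (our reading of equation (3) in the main text, which is not reproduced in this appendix), the step can be spelled out as:

$$
\begin{aligned}
1 + e^{\psi} &= e^{\psi/2}\left(e^{-\psi/2} + e^{\psi/2}\right) = 2\, e^{\psi/2} \cosh(\psi/2), \\
\frac{e^{y\psi}}{(1 + e^{\psi})^{y + \phi^{-1}}}
&= \frac{e^{y\psi}}{2^{\,y + \phi^{-1}}\, e^{(y + \phi^{-1})\psi/2}\, \cosh^{y + \phi^{-1}}(\psi/2)}
= 2^{-(y + \phi^{-1})}\, \frac{e^{(y - \phi^{-1})\psi/2}}{\cosh^{y + \phi^{-1}}(\psi/2)}.
\end{aligned}
$$

The final proportionality then follows from the Laplace transform of PG(b, 0) evaluated at $t = \psi^{2}/2$ with $b = y + \phi^{-1}$, since $\sqrt{t/2} = |\psi|/2$ and cosh is an even function.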

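In other words, conditional on ω, the above reduces to a negative quadratic form in ψ (see equation (3) in the main text) within the exponential. Thus, the normal prior on ψ is a conjugate prior conditional on ω. Returning to equation (3) in the main text, let $B = (\alpha, \beta)^{T}$ and $Z_i = (1, X_i)^{T}$; then, conditional on known $\omega_i$'s, the likelihood of B is proportional to:

$$
L(B) \propto \prod_{i=1}^{N} \exp\left\{ -\frac{\omega_i}{2} \left( Z_i^{T} B - \left( \frac{y_i - \phi^{-1}}{2\omega_i} - \log T_i \right) \right)^{2} \right\}.
\tag{A.3}
$$

Let $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_N)$, $u_i = \frac{y_i - \phi^{-1}}{2\omega_i} - \log T_i$, $U = (u_1, \ldots, u_N)^{T}$ and $Z = (Z_1, \ldots, Z_N)$. With the prior $B \sim N(c, C)$, where $c = (\eta, \lambda\delta)^{T}$ and $C = \mathrm{diag}(\tau^{2}, \sigma^{2})$, the conditional posterior used to update B is:

$$
(B \mid \cdot) \sim N(m, V), \quad \text{where } V = (Z \Omega Z^{T} + C^{-1})^{-1}, \quad m = V(Z \Omega U + C^{-1} c).
\tag{A.4}
$$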
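The conjugate update for B = (α, β) can be sketched with NumPy for a single gene in a single study. The function and argument names are hypothetical, the PG variables `omega` are assumed to have been drawn already, and Z is taken to be the 2 × N matrix whose columns are (1, X_i), which is how the dimensions of the update are read here.

```python
import numpy as np

def draw_B(y, X, T, omega, phi, c, C, rng):
    """One draw of B = (alpha, beta) from N(m, V), given PG variables omega."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    omega = np.asarray(omega, dtype=float)
    Z = np.vstack([np.ones_like(X), X])                # 2 x N, columns (1, X_i)
    U = (y - 1.0 / phi) / (2.0 * omega) - np.log(T)    # working response u_i
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(Z @ np.diag(omega) @ Z.T + C_inv)
    m = V @ (Z @ (omega * U) + C_inv @ np.asarray(c, dtype=float))
    return rng.multivariate_normal(m, V), m, V
```

With a very diffuse prior (large diagonal C), m reduces to the ω-weighted least-squares fit of U on Z, which is a convenient check on an implementation.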

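Another important property of the PG distribution is that any PG(b, c) random variable ω has the following pdf (where the expectation in the denominator is taken with respect to PG(b, 0)):

$$
p(\omega \mid b, c) = \frac{\exp\left(-\frac{c^{2}}{2}\omega\right) p(\omega \mid b, 0)}{E_{\omega}\left\{\exp\left(-\frac{c^{2}}{2}\omega\right)\right\}}.
$$

In other words, the posterior distribution of ω ∼ PG(b, 0) given c still belongs to the PG class (in our case, $b = y + \phi^{-1}$, $c = \psi$):

$$
P(\omega \mid \psi) \propto \exp\left(-\frac{\omega \psi^{2}}{2}\right) PG(y + \phi^{-1}, 0) \propto PG(y + \phi^{-1}, \psi).
\tag{A.5}
$$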

We can update each ωi based on the above distribution using Gibbs sampling.

Step 2

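For ϕg, since no closed-form posterior distribution is available, we used the Metropolis-Hastings (MH) algorithm to update ϕg for all studies together. For each g, we proposed a new vector $\log(\vec{\phi}^{new}) = (\log(\phi_1), \ldots, \log(\phi_K))^{T}$ from a jump distribution $N_K(\log(\vec{\phi}^{old}), \Pi)$. The proposal is accepted with probability min(1, r), where r is the acceptance ratio:

$$
r = \frac{N_K(\log \vec{\phi}_g^{new}; m_g, \Pi) \prod_{k=1}^{K} \prod_{i=1}^{I(k)} NB(y_{gik}; \log T_{ik} + \alpha_{gk} + \beta_{gk} X_{ik}, \phi_{gk}^{new})}{N_K(\log \vec{\phi}_g^{old}; m_g, \Pi) \prod_{k=1}^{K} \prod_{i=1}^{I(k)} NB(y_{gik}; \log T_{ik} + \alpha_{gk} + \beta_{gk} X_{ik}, \phi_{gk}^{old})}.
\tag{A.6}
$$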

If the proposal is accepted, we replace the old log(ϕ⃗) with the new one, otherwise, we keep the current value of log(ϕ⃗).
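A stripped-down version of this accept/reject step is sketched below for one gene in one study, so the vector proposal becomes a scalar random walk on log ϕ. Several things are assumptions of this example rather than statements about the authors' implementation: the negative binomial is parameterised through ψ as the logit of p (matching the form p^y (1 − p)^(1/ϕ) used in Step 1), the prior on log ϕ is a scalar normal with mean `m` and variance `xi2`, and all function names are hypothetical.

```python
import math
import random

def nb_logpmf(y, psi, phi):
    """Log pmf proportional to p^y (1-p)^(1/phi), with p = sigmoid(psi)."""
    s = 1.0 / phi
    log_p = -math.log1p(math.exp(-psi))      # log sigmoid(psi)
    log_1mp = -math.log1p(math.exp(psi))     # log(1 - sigmoid(psi))
    return (math.lgamma(y + s) - math.lgamma(s) - math.lgamma(y + 1)
            + y * log_p + s * log_1mp)

def mh_update_log_phi(log_phi_old, y, psi, m, xi2, step_sd, rng):
    """One random-walk MH step for log(phi); returns (value, accepted)."""
    def log_target(lp):
        phi = math.exp(lp)
        log_prior = -0.5 * (lp - m) ** 2 / xi2
        return log_prior + sum(nb_logpmf(yi, pi, phi) for yi, pi in zip(y, psi))
    proposal = rng.gauss(log_phi_old, step_sd)   # symmetric jump, so it cancels in r
    log_r = log_target(proposal) - log_target(log_phi_old)
    if rng.random() < math.exp(min(0.0, log_r)):
        return proposal, True
    return log_phi_old, False
```

Working on the log scale for the ratio avoids underflow when many samples contribute to the likelihood product.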

Step 3

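We used Gibbs sampling to update λg, ηg and mg based on their full conditional Gaussian distributions as follows:

$$
\begin{aligned}
(\lambda_g \mid \cdot) &\sim N_K(\lambda_\mu, \Sigma_\lambda), \quad \text{where } \Sigma_\lambda = \left(\mathrm{diag}(1/\sigma_\lambda^{2}) + K\Sigma^{-1}\right)^{-1}, \quad \lambda_\mu = \Sigma_\lambda\left(\mathrm{diag}(1/\sigma_\lambda^{2})\mu_\lambda + K\Sigma^{-1}\beta_g\right), \\
(\eta_g \mid \cdot) &\sim N_K(\eta_\mu, \Sigma_\eta), \quad \text{where } \Sigma_\eta = \left(\mathrm{diag}(1/\sigma_\eta^{2}) + K\Lambda^{-1}\right)^{-1}, \quad \eta_\mu = \Sigma_\eta\left(\mathrm{diag}(1/\sigma_\eta^{2})\mu_\eta + K\Lambda^{-1}\alpha_g\right), \\
(m_g \mid \cdot) &\sim N_K(m_\mu, \Sigma_m), \quad \text{where } \Sigma_m = \left(\mathrm{diag}(1/\sigma_m^{2}) + K\Pi^{-1}\right)^{-1}, \quad m_\mu = \Sigma_m\left(\mathrm{diag}(1/\sigma_m^{2})\mu_m + K\Pi^{-1}\log\phi_g\right).
\end{aligned}
\tag{A.7}
$$

To update λg, we only use those βgk for which δgk = 1; if $\vec{\delta}_g = \vec{0}$, we redraw λg from its prior $N(\mu_\lambda, \sigma_\lambda^{2})$. Since we only need one value for each of the above parameters in every iteration, we took the average of each draw.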

Step 4

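The full conditionals for $[\sigma_{(1),k}^{2}]_{k=1}^{K}$, $[\sigma_{(0),k}^{2}]_{k=1}^{K}$, $[\tau_k^{2}]_{k=1}^{K}$ and $[\xi_k^{2}]_{k=1}^{K}$ have closed forms and are updated using Gibbs sampling for each k:

$$
\begin{aligned}
\sigma_{(1),k}^{2} &\sim \mathrm{InvGamma}\left(\frac{\sum_{g=1}^{G}\delta_{gk}}{2},\; \frac{1}{2}\sum_{g=1}^{G}\delta_{gk}(\beta_{gk} - \lambda_g)^{2}\right), \\
\sigma_{(0),k}^{2} &\sim \mathrm{InvGamma}\left(\frac{\sum_{g=1}^{G}(1 - \delta_{gk})}{2},\; \frac{1}{2}\sum_{g=1}^{G}(1 - \delta_{gk})\beta_{gk}^{2}\right), \\
\tau_k^{2} &\sim \mathrm{InvGamma}\left(\frac{G}{2},\; \frac{1}{2}\sum_{g=1}^{G}(\alpha_{gk} - \eta_g)^{2}\right), \\
\xi_k^{2} &\sim \mathrm{InvGamma}\left(\frac{G}{2},\; \frac{1}{2}\sum_{g=1}^{G}(\log\phi_{gk} - m_g)^{2}\right).
\end{aligned}
\tag{A.8}
$$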
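The variance updates are simple to code once an inverse-gamma draw is available. The standard-library sketch below shows the update for the baseline-expression variance τ²; the other three updates follow the same pattern with their own shape and scale. Function names are hypothetical, and the inverse-gamma draw is obtained as the reciprocal of a gamma draw.

```python
import random

def draw_inv_gamma(shape, scale, rng):
    """Draw from InvGamma(shape, scale) as 1 / Gamma(shape, rate=scale)."""
    # random.gammavariate takes (shape, scale), so rate=scale means scale=1/scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)

def update_tau2(alpha_k, eta, rng):
    """Gibbs draw of tau_k^2 given alpha_gk and eta_g for all genes g in study k."""
    G = len(alpha_k)
    sum_sq = sum((a - e) ** 2 for a, e in zip(alpha_k, eta))
    return draw_inv_gamma(G / 2.0, 0.5 * sum_sq, rng)
```

With many genes the draw concentrates near the average squared deviation of αgk from ηg, as one would expect.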

Step 5

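The full conditionals for $[\rho_{(1)kk'}]_{1 \le k, k' \le K}$, $[\rho_{(0)kk'}]_{1 \le k, k' \le K}$, $[r_{kk'}]_{1 \le k, k' \le K}$ and $[t_{kk'}]_{1 \le k, k' \le K}$ have closed forms and are updated using Gibbs sampling:

$$
\begin{aligned}
\text{For } \vec{\delta}_g \neq \vec{0}: \quad [\rho_{(1)kk'}] &\sim \mathrm{InvWishart}\left(\psi = I + \sum_{k=1}^{K}(\bar{\beta}_k - \bar{\lambda})(\bar{\beta}_k - \bar{\lambda})^{T},\; \upsilon = 2K + 1\right), \\
\text{For } \vec{\delta}_g = \vec{0}: \quad [\rho_{(0)kk'}] &\sim \mathrm{InvWishart}\left(\psi = I + \sum_{k=1}^{K}(\bar{\beta}_k)(\bar{\beta}_k)^{T},\; \upsilon = 2K + 1\right), \\
[r_{kk'}] &\sim \mathrm{InvWishart}\left(\psi = I + \sum_{k=1}^{K}(\bar{\alpha}_k - \bar{\eta})(\bar{\alpha}_k - \bar{\eta})^{T},\; \upsilon = 2K + 1\right), \\
[t_{kk'}] &\sim \mathrm{InvWishart}\left(\psi = I + \sum_{k=1}^{K}(\log\bar{\phi}_k - \bar{m})(\log\bar{\phi}_k - \bar{m})^{T},\; \upsilon = 2K + 1\right),
\end{aligned}
\tag{A.9}
$$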

where the average is taken over all genes for βk, λ, αk, η, logϕk and m. After drawing a new covariance matrix from the above posterior, the actual correlation matrix can be obtained by integrating out the variance components.

Step 6

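Since the support for βgk depends on the choice of δgk, we use a reversible jump MCMC algorithm to update (δgk, βgk) together for each g and k. Specifically, a new value $\delta_{gk}^{new} = 1 - \delta_{gk}^{old}$ is proposed, and we then generate $\beta_{gk}^{new}$ from the posterior in Step 1 based on $\delta_{gk}^{new}$. The proposal is accepted with probability min(1, r), where r is the acceptance ratio:

$$
r = \frac{N(\beta_{gk}^{new}; \delta_{gk}^{new}\lambda_g, \sigma^{2}) \prod_{i=1}^{I(k)} NB(y_{gik}; \log T_{ik} + \alpha_{gk} + \beta_{gk}^{new} X_{ik}, \phi_{gk})}{N(\beta_{gk}^{old}; \delta_{gk}^{old}\lambda_g, \sigma^{2}) \prod_{i=1}^{I(k)} NB(y_{gik}; \log T_{ik} + \alpha_{gk} + \beta_{gk}^{old} X_{ik}, \phi_{gk})}.
\tag{A.10}
$$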

We accept or reject the proposed values jointly from the above.

Step 7

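Lastly, upon obtaining the updates of δgk, we can estimate πgk for every 20 chains, and we transform it into zgk through the steps described in Section 2.4. Based on the vector $\vec{z}_g$, we can update the cluster assignment cg for each gene by Gibbs sampling using the following conditional probabilities:

$$
\begin{aligned}
\text{If } c = c_h \text{ for some } h \neq g: \quad & P(c_g = c \mid c_{-g}, \vec{z}_g) = b\, \frac{n_c}{G - 1 + a} \int F(\vec{z}_g, \theta_c)\, dH_{-g,c}(\theta_c), \\
& P(c_g \neq c_h \text{ for all } h \neq g \mid c_{-g}, \vec{z}_g) = b\, \frac{a}{G - 1 + a} \int F(\vec{z}_g, \theta)\, dG_0(\theta),
\end{aligned}
$$

where $H_{-g,c}$ is the posterior distribution of $\theta_c$ based on the prior $G_0$ and all observations $\vec{z}_h$ for which h ≠ g and $c_h = c$, $n_c$ is the cluster size of cluster c, and b is the normalizing constant that makes the probabilities sum to 1. More specifically, $\int F(\vec{z}_g, \theta_c)\, dH_{-g,c}(\theta_c) = f\left(\vec{z}_g; N_K\left(\mu_K = \frac{n_c}{n_c + 1}\bar{\vec{z}}_h,\; \Sigma = \mathrm{diag}\left(\frac{n_c + 2}{n_c + 1}, K\right)\right)\right)$, where $\bar{\vec{z}}_h$ denotes the average of the $\vec{z}_h$ in cluster c, and $\int F(\vec{z}_g, \theta)\, dG_0(\theta) = f(\vec{z}_g; N_K(\mu_K = 0_K, \Sigma = \mathrm{diag}(2, K)))$.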
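The assignment probabilities for one gene can be computed directly from the two predictive densities given above. The sketch below is an illustration with hypothetical names: `clusters` holds the z vectors of all other genes grouped by their current label, `a` is the concentration parameter, and the key `"new"` (a label invented for this example) stands for opening a new cluster. Covariances are taken to be isotropic, reading diag(v, K) as v times the K × K identity.

```python
import math

def log_iso_normal_pdf(z, mean, var):
    """Log density of N_K(mean, var * I) evaluated at z."""
    K = len(z)
    sq = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return -0.5 * (K * math.log(2.0 * math.pi * var) + sq / var)

def cluster_probabilities(z_g, clusters, a):
    """Normalised probabilities of joining each existing cluster or a new one."""
    K = len(z_g)
    n_other = sum(len(members) for members in clusters.values())   # G - 1
    log_w = {}
    for label, members in clusters.items():
        n_c = len(members)
        z_bar = [sum(col) / n_c for col in zip(*members)]
        mean = [n_c / (n_c + 1.0) * v for v in z_bar]
        var = (n_c + 2.0) / (n_c + 1.0)
        log_w[label] = (math.log(n_c / (n_other + a))
                        + log_iso_normal_pdf(z_g, mean, var))
    log_w["new"] = (math.log(a / (n_other + a))
                    + log_iso_normal_pdf(z_g, [0.0] * K, 2.0))
    top = max(log_w.values())                  # subtract max for numerical stability
    total = sum(math.exp(v - top) for v in log_w.values())
    return {label: math.exp(v - top) / total for label, v in log_w.items()}
```

A Gibbs step would then draw the new label for gene g from the returned distribution.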

Contributor Information

Tianzhou Ma, Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261.

Faming Liang, Department of Biostatistics, University of Florida, Gainesville, FL 32611.

George Tseng, Department of Biostatistics (primary appointment), Department of Human Genetics, Department of Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261.

References

  1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biol. 2010;11(10):R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. Count-based differential expression analysis of rna sequencing data using r and bioconductor. Nature protocols. 2013;8(9):1765–1786. doi: 10.1038/nprot.2013.099. [DOI] [PubMed] [Google Scholar]
  3. Barnard J, McCulloch R, Meng XL. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica. 2000;10(4):1281–1312. [Google Scholar]
  4. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological) 1995:289–300. [Google Scholar]
  5. Bradburn MJ, Deeks JJ, Berlin JA, Russell Localio A. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Statistics in medicine. 2007;26(1):53–77. doi: 10.1002/sim.2528. [DOI] [PubMed] [Google Scholar]
  6. Choi JK, Yu U, Kim S, Yoo OJ. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003;19(suppl 1):i84–i90. doi: 10.1093/bioinformatics/btg1010. [DOI] [PubMed] [Google Scholar]
  7. Chung LM, Ferguson JP, Zheng W, Qian F, Bruno V, Montgomery RR, Zhao H. Differential expression analysis for paired rna-seq data. BMC bioinformatics. 2013;14(1):110. doi: 10.1186/1471-2105-14-110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Conlon EM, Song JJ, Liu JS. Bayesian models for pooling microarray studies with multiple sources of replications. BMC bioinformatics. 2006;7(1):247. doi: 10.1186/1471-2105-7-247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Consortium, S-I et al. A comprehensive assessment of rna-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nature biotechnology. 2014;32(9):903–914. doi: 10.1038/nbt.2957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Ferguson TS. Bayesian density estimation by mixtures of normal distributions. Recent advances in statistics. 1983;24(1983):287–302. [Google Scholar]
  11. Fisher RA. Statistical methods for research workers. Genesis Publishing Pvt Ltd; 1925. [Google Scholar]
  12. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. Vol. 2. Taylor & Francis; 2014. [Google Scholar]
  13. Geman S, Geman D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1984;(6):721–741. doi: 10.1109/tpami.1984.4767596. [DOI] [PubMed] [Google Scholar]
  14. Green PJ. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika. 1995;82(4):711–732. [Google Scholar]
  15. Hardcastle TJ, Kelly KA. bayseq: empirical bayesian methods for identifying differential expression in sequence count data. BMC bioinformatics. 2010;11(1):422. doi: 10.1186/1471-2105-11-422. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hastings WK. Monte carlo sampling methods using markov chains and their applications. Biometrika. 1970;57(1):97–109. [Google Scholar]
  17. Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. Rankprod: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006;22(22):2825–2827. doi: 10.1093/bioinformatics/btl476. [DOI] [PubMed] [Google Scholar]
  18. Kang DD, Sibille E, Kaminski N, Tseng GC. Metaqc: objective quality control and inclusion/exclusion criteria for genomic meta-analysis. Nucleic acids research. 2012;40(2):e15–e15. doi: 10.1093/nar/gkr1071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375. doi: 10.1371/journal.pcbi.1002375. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Lee Y, Scheck AC, Cloughesy TF, Lai A, Dong J, Farooqi HK, Liau LM, Horvath S, Mischel PS, Nelson SF. Gene expression analysis of glioblastomas identifies the major molecular basis for the prognostic benefit of younger age. BMC medical genomics. 2008;1(1):52. doi: 10.1186/1755-8794-1-52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic acids research. 2014:gku864. doi: 10.1093/nar/gku864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C. Ebseq: an empirical bayes hierarchical model for inference in rna-seq experiments. Bioinformatics. 2013;29(8):1035–1043. doi: 10.1093/bioinformatics/btt087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Lewin A, Bochkina N, Richardson S. Fully bayesian mixture model for differential gene expression: simulations and model checks. Statistical applications in genetics and molecular biology. 2007;6(1) doi: 10.2202/1544-6115.1314. [DOI] [PubMed] [Google Scholar]
  24. Li J, Tseng GC, et al. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. The Annals of Applied Statistics. 2011;5(2A):994–1019. [Google Scholar]
  25. Li MD, Cao J, Wang S, Wang J, Sarkar S, Vigorito M, Ma JZ, Chang SL. Transcriptome sequencing of gene expression in the brain of the hiv-1 transgenic rat. PloS one. 2013;8(3):e59582. doi: 10.1371/journal.pone.0059582. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Liu CG, Wang JL, Li L, Wang PC. Microrna-384 regulates both amyloid precursor protein and β-secretase expression and is a potential biomarker for alzheimer's disease. International journal of molecular medicine. 2014;34(1):160–166. doi: 10.3892/ijmm.2014.1780. [DOI] [PubMed] [Google Scholar]
  27. Liu Q, Markatou M. Evaluation of methods in removing batch effects on rna-seq data. Infectious Diseases and Translational Medicine. 2016;2(1):3–9. [Google Scholar]
  28. Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;20(8):1222–1232. doi: 10.1093/bioinformatics/bth068. [DOI] [PubMed] [Google Scholar]
  29. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The journal of chemical physics. 1953;21(6):1087–1092. [Google Scholar]
  30. Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning. 2003;52(1-2):91–118. [Google Scholar]
  31. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature methods. 2008;5(7):621–628. doi: 10.1038/nmeth.1226. [DOI] [PubMed] [Google Scholar]
  32. Nakahama T, Hanieh H, Nguyen NT, Chinen I, Ripley B, Millrine D, Lee S, Nyati KK, Dubey PK, Chowdhury K, et al. Aryl hydrocarbon receptor-mediated induction of the microrna-132/212 cluster promotes interleukin-17–producing t-helper cell differentiation. Proceedings of the National Academy of Sciences. 2013;110(29):11964–11969. doi: 10.1073/pnas.1311087110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Neal RM. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics. 2000;9(2):249–265. [Google Scholar]
  34. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5(2):155–176. doi: 10.1093/biostatistics/5.2.155. [DOI] [PubMed] [Google Scholar]
  35. Oshlack A, Robinson MD, Young MD. From rna-seq reads to differential expression results. Genome biology. 2010;11(12):1. doi: 10.1186/gb-2010-11-12-220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Oshlack A, Wakefield MJ, et al. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009;4(1):14. doi: 10.1186/1745-6150-4-14.
  37. Park Y, Figueroa ME, Rozek LS, Sartor MA. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics. 2014:btu339. doi: 10.1093/bioinformatics/btu339.
  38. Polson NG, Scott JG, Windle J. Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association. 2013;108(504):1339–1349.
  39. Ramasamy A, Mondry A, Holmes CC, Altman DG. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 2008;5(9):e184. doi: 10.1371/journal.pmed.0050184.
  40. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14(9):R95. doi: 10.1186/gb-2013-14-9-r95.
  41. Rasmussen CE, De la Cruz BJ, Ghahramani Z, Wild DL. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2009;6(4):615–628. doi: 10.1109/TCBB.2007.70269.
  42. Rau A, Marot G, Jaffrézic F. Differential meta-analysis of RNA-seq data from multiple studies. BMC bioinformatics. 2014;15(1):91. doi: 10.1186/1471-2105-15-91.
  43. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616.
  44. Scharpf RB, Tjelmeland H, Parmigiani G, Nobel AB. A Bayesian model for cross-study differential gene expression. Journal of the American Statistical Association. 2009;104(488). doi: 10.1198/jasa.2009.ap07611.
  45. Scott SL, Blocker AW, Bonassi FV, Chipman H, George E, McCulloch R. Bayes and big data: The consensus Monte Carlo algorithm. EFaBBayes 250 conference. 2013;16.
  46. Shah S, Smith C, Lampe F, Youle M, Johnson M, Phillips A, Sabin C. Haemoglobin and albumin as markers of HIV disease progression in the highly active antiretroviral therapy era: relationships with gender. HIV medicine. 2007;8(1):38–45. doi: 10.1111/j.1468-1293.2007.00434.x.
  47. Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ, Clarke RB. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets–improving meta-analysis and prediction of prognosis. BMC medical genomics. 2008;1(1):42. doi: 10.1186/1755-8794-1-42.
  48. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC bioinformatics. 2013;14(1):91. doi: 10.1186/1471-2105-14-91.
  49. Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. The American soldier: adjustment during army life (Studies in social psychology in World War II, vol 1). 1949.
  50. Terenin A, Simpson D, Draper D. Asynchronous distributed Gibbs sampling. arXiv preprint arXiv:1509.08999. 2015.
  51. Tseng GC, Ghosh D, Feingold E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic acids research. 2012:gkr1265. doi: 10.1093/nar/gkr1265.
  52. Tseng GC, Wong WH. Tight clustering: A resampling-based approach for identifying stable and tight patterns in data. Biometrics. 2005;61(1):10–16. doi: 10.1111/j.0006-341X.2005.031032.x.
  53. Tsuyuzaki K, Nikaido I. metaSeq: Meta-analysis of RNA-seq count data. 2013.
  54. Van De Wiel MA, Leday GG, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2012:kxs031. doi: 10.1093/biostatistics/kxs031.
  55. Wang C, Gong B, Bushel PR, Thierry-Mieg J, Thierry-Mieg D, Xu J, Fang H, Hong H, Shen J, Su Z, et al. A comprehensive study design reveals treatment- and transcript abundance–dependent concordance between RNA-seq and microarray data. Nature biotechnology. 2014;32(9):926. doi: 10.1038/nbt.3001.
  56. Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57–63. doi: 10.1038/nrg2484.
  57. Warn D, Thompson S, Spiegelhalter D. Bayesian random effects meta-analysis of trials with binary outcomes: methods for the absolute risk difference and relative risk scales. Statistics in medicine. 2002;21(11):1601–1623. doi: 10.1002/sim.1189.
  58. Xu J, Su Z, Hong H, Thierry-Mieg J, Thierry-Mieg D, Kreil DP, Mason CE, Tong W, Shi L. Cross-platform ultradeep transcriptomic profiling of human reference RNA samples by RNA-seq. Scientific data. 2013;1:140020–140020. doi: 10.1038/sdata.2014.20.
  59. Zhou M, Li L, Dunson D, Carin L. Lognormal and gamma mixed negative binomial regression. Machine learning: proceedings of the International Conference International Conference on Machine Learning. 2012;2012:1343. NIH Public Access.

Associated Data

Supplementary Materials

Supp Fig S1

Figure S1: Traceplots of selected parameters from Simulation IA.

Supp Fig S2

Figure S2: Venn diagram of the number of overlapping DE genes (FDR < 0.1) among the three methods applied to the real data.

Supp Fig S3

Figure S3: Distribution of normalized counts for the three genes shown in Table 3 (left: HIV strain; right: Normal strain). The values above the boxplots are the corresponding p-values (edgeR, DESeq) or posterior means (BayesMetaSeq), with stars indicating significance (e.g. p-value ≤ 0.1 or E(δgk|D) ≥ 0.8).

Table S1: Comparison of the posterior means of the parameters estimated by BayesMetaSeq with their true values from Simulation IA, K=2.

Table S2: Sensitivity analysis on hyperparameter μη.

Table S3: Normalized counts (rounded) for the three genes shown in Table 3.

Table S4: List of significant IPA pathways (p-value < 0.05) from Clusters 1–4 in Figure 5.

Table S5: List of significant IPA pathways (p-value < 0.05) from Clusters 5–7 in Figure 5.
