Bayesian Hierarchical Modeling and Selection of Differentially Expressed Genes for the EST Data

Fang Yu; Ming-Hui Chen; Lynn Kuo; Peng Huang; Wanling Yang

doi:10.1111/j.1541-0420.2010.01447.x

Author manuscript; available in PMC: 2014 Sep 23.

Published in final edited form as: Biometrics. 2011 Mar;67(1):142–150. doi: 10.1111/j.1541-0420.2010.01447.x

PMCID: PMC4171397 NIHMSID: NIHMS203209 PMID: 20560937

Summary

Expressed sequence tag (EST) sequencing is a one-pass sequencing reading of cloned cDNAs derived from a certain tissue. The frequency of unique tags among different unbiased cDNA libraries is used to infer the relative expression level of each tag. In this paper, we propose a hierarchical multinomial model with a nonlinear Dirichlet prior for the EST data with multiple libraries and multiple types of tissues. A novel hierarchical prior is developed and the properties of the proposed prior are examined. An efficient Markov chain Monte Carlo algorithm is developed for carrying out the posterior computation. We also propose a new selection criterion for detecting which genes are differentially expressed between two tissue types. Our new method with the new gene selection criterion is demonstrated via several simulations to have low false negative and false positive rates. A real EST data set is used to motivate and illustrate the proposed method.

Keywords: Dirichlet distribution, Gene expression, Mixture distributions, Multinomial distribution, Shrinkage estimators

1. Introduction

Differential gene expression can be investigated via several methods, including EST sequence sampling from cDNA libraries, serial analysis of gene expression (SAGE), and high-density DNA microarrays. The EST experiment produces expressed sequence tags (ESTs) (200–300 bp) to detect expressed genes. First, mRNA (sequences from expressed genes) is extracted from a given tissue and reverse transcribed into cDNA. These cDNAs are subsequently subcloned into plasmid vectors and amplified in E. coli. The pool of cDNA clones is often called a cDNA library. Each E. coli colony represents one original mRNA sequence, and each expressed gene may be represented by more than one clone, with the number of clones varying from gene to gene. A different number of clones is randomly selected from each library and sequenced, and the resulting sequences (ESTs) are automatically submitted to an EST database, which provides rapid identification of the reference sequences for all human genes. A major public EST database is dbEST, a division of GenBank available at the National Center for Biotechnology Information (NCBI). Therefore, by identifying and counting the frequency of appearance of particular gene sequences or ESTs, it is possible to estimate the relative abundance of each mRNA transcript. Note that abundant clones are likely to be represented by many ESTs and rare clones by only a few. Unlike other gene expression techniques, only the tissue-specific transcript frequencies are observed in EST and SAGE. In this paper, we focus on detecting differentially expressed (DE) genes between normal and cancer tissues using the EST data. The statistical methods can also be applied to SAGE data.

There is a considerable amount of literature on analyzing EST or SAGE data from different libraries. Schmitt et al. (1999) and Romualdi et al. (2001) fit the observed frequency counts with a collapsed 2 × 2 table and used a Fisher exact test or a χ² test to identify DE genes. These approaches are not efficient, as a substantial amount of information is lost after collapsing the data. Stekel et al. (2000) employed a likelihood ratio test based on a Poisson model to detect DE genes. Audic and Claverie (1997) and Claverie (1999) used a Poisson model to evaluate, for an equally expressed gene, the conditional probability of its frequency count in one condition given its count in the other condition. They claimed a gene to be DE if this conditional probability is small. To account for the possibility of overdispersion in the tag counts, Baggerly et al. (2004) and Lu et al. (2005) considered an overdispersed logistic regression model and an overdispersed loglinear model, respectively, under the binomial or Poisson sampling assumption, for identifying DE genes using SAGE data. Although EST data provide rich genetic information, they are difficult to utilize since only a small proportion of unique tags (genes) are observed in each library. For example, in our study, we consider 9 libraries (6 normal and 3 tumor tissues) sampled from the lymphoreticular or lymph node and observe a total of 8,190 unique genes among these libraries. To obtain a better understanding of the frequency counts in a library, we group the genes that have the same count and record the size of each group. Then we plot the group size (log scale) against the groups (ordered by count). Figure 1 displays this plot for a library with a total of 1,908 tags summed over all groups. We see that the distribution of the group size is highly right-skewed. The largest observed count in this library is 33, while about 7,500 (or exp(8.923)) unique genes (> 91%) have zero counts. All other libraries, no matter which tissue type they come from, share a similar pattern. This skewness in the distribution of gene expression counts has also been noted by others, including Kuznetsov (2001) and Morris et al. (2003, 2006).

To estimate the expression levels of genes, which are highly correlated with the frequency counts in the EST data, a simple approach is to use the maximum likelihood estimator (MLE) of the true relative frequency of each gene. The resulting estimate is the ratio of the observed count for a gene to the corresponding library size. Because library sizes are relatively small, some genes may fail to be sampled. Hence a “missing” gene is not necessarily unexpressed, and the MLE underestimates its expression. The EST data may also accidentally capture too many copies of some genes that actually have a low expression level; in this case, the MLE tends to overestimate them. To circumvent these issues of underestimation and overestimation, the following two methods were proposed by Morris et al. (2003). The first method is to fit a multinomial Dirichlet model to the data. Let n denote the library size and G the number of unique tags. The model assumes a G − 1 variate symmetric Dirichlet prior distribution with the same parameter θ for the probabilities in the multinomial model. It then gives a shrinkage estimator with a weight of $\frac{n}{n + G θ}$ given to the MLE. This approach is simple and easy to implement. However, when the library size n is small relative to the number of unique genes, which is common in EST data, this estimator works poorly for genes with large counts because too much weight is reassigned from abundant genes (with relatively large observed counts) to scarce genes (with relatively small counts). The second method is to fit a non-linear mixture Dirichlet model (NLMD). It assumes that genes are classified into two classes, abundant and scarce, according to their relative frequencies. Two different forms of shrinkage estimators are then obtained for these two classes; they not only protect the abundant genes from being shrunk too much, but also steal weight from scarce genes to compensate for the missing genes. Morris et al. (2003) called it “Reverse Robin Hood” and applied it to only a single library of SAGE data.
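
The shrinkage behaviour of the first (one-class) method can be sketched numerically. The snippet below is only an illustration: it uses the standard posterior mean under a symmetric Dirichlet prior, written as a weighted average of the MLE and the uniform value 1/G, and the counts and the value of θ are made-up toy inputs rather than data from the paper.

```python
def dirichlet_shrinkage(counts, theta):
    """Posterior mean of multinomial probabilities under a symmetric Dirichlet(theta) prior.

    Returns the estimates and the weight w = n / (n + G * theta) given to the MLE.
    """
    n = sum(counts)                    # library size
    G = len(counts)                    # number of unique tags
    w = n / (n + G * theta)            # weight on the MLE
    mle = [x / n for x in counts]
    # Weighted average of the MLE and the uniform probability 1/G;
    # algebraically equal to (x + theta) / (n + G * theta).
    return [w * m + (1.0 - w) / G for m in mle], w


# Toy library (hypothetical counts): one abundant tag, many unobserved tags.
counts = [33, 5, 1, 0, 0, 0, 0, 0, 0, 0]
est, w = dirichlet_shrinkage(counts, theta=0.5)
```

In this toy library the estimate for the abundant tag is pulled below its MLE while the unobserved tags receive positive mass, which is the reassignment of weight from abundant to scarce genes described above.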

We extend their method to EST or SAGE data with multiple libraries and multiple tissue types in Sections 2 and 3. We first modify the hyperparameter in their NLMD prior for a single library to be size dependent, as the total mass is expected to increase when the size of the abundant set increases. We then build a hierarchical structure on the hyperparameters of the NLMD to yield more robust posterior estimates. Specifically, at the library level, an NLMD is assumed for each library, and the parameters of the different NLMDs share common priors across libraries so that information can be borrowed among libraries within the same tissue type. At the tissue level, an extra layer of priors is assumed to synthesize the information across different tissue types.

In Section 4, an efficient Markov chain Monte Carlo (MCMC) sampling algorithm is developed for carrying out the high-dimensional posterior computation. It includes several novel computational strategies for reducing computing time and avoiding potential numerical overflow in MCMC sampling. In Section 5, a new gene selection criterion is proposed for detecting DE genes between two different types of tissues. The proposed criterion is easy to implement and has a natural statistical interpretation. Several simulation studies are reported in Section 6. They show that the proposed method has smaller false negative rates than several existing methods, while its false positive rates are comparable to those of the others. In Section 7, a real EST data set is analyzed to illustrate the proposed methodology. A brief discussion is provided in Section 8.

2. Models for EST Data with Multiple Libraries Under Multiple Tissue Types

We apply the NLMD model given by Morris et al. (2003) to each library and use similar notations and parameterizations to theirs, except we add subscripts for libraries and tissues to the abundance probability of each tag in order to explain the hierarchical structure for sharing information across libraries within each tissue type and then between tissues at a higher level.

Assume each unique tag represents a unique gene. Let G be the number of unique tags (genes) observed in the data. Also let L_t be the total number of libraries of tissue type t for t = 1, …, T. Then, the observed data can be written as D = {X_tli, i = 1, …, G, l = 1, …, L_t, t = 1, …, T}, where X_tli is the number of occurrences of unique EST tag i in library l of type t, and the library size is $n_{tl} = \sum_{i = 1}^{G} X_{tli}$. Further, let X_tl = (X_tl1, …, X_tlG)′ denote the column vector of gene frequencies from library l of type t. Suppose each library size is fixed. Then, as a consequence of the Poisson model for the frequency counts, we can assume that X_tl follows a multinomial distribution with parameters n_tl and p_tl = (p_tl1, …, p_tlG)′, where p_tli represents the relative abundance of the i^th unique tag in the l^th library of type t, and $\sum_{i = 1}^{G} p_{tli} = 1$ for all l and t. The likelihood function is then given by $L (p_{11}, \dots, p_{T L_{T}} \mid D) = \prod_{t = 1}^{T} \prod_{l = 1}^{L_{t}} \binom{n_{tl}}{X_{tl1}, \dots, X_{tlG}} p_{tl1}^{X_{tl1}} \cdots p_{tlG}^{X_{tlG}}$. For each library, we use the same NLMD prior as in Morris et al. (2003) for p_tl, except that we modify their hyperparameters for the total mass of abundant genes to be size dependent.

To build a hierarchical structure, we assume that the different libraries in the same tissue type share the same indicator functions on whether a unique tag is abundant or scarce. That is, let λ_ti be a latent indicator for the i^th tag such that λ_ti = 1 if the i^th unique tag of type t is abundant and λ_ti = 0 otherwise. Write λ_t = (λ_t1, …, λ_tG)′. Then, under the same tissue type t, all unique tags are split into two subsets: an “abundant set” A_t = {i : λ_ti = 1} and a “scarce set” S_t = {i : λ_ti = 0}. In addition, let $k_{A_{t}} = \sum_{i = 1}^{G} λ_{ti}$ denote the number of abundant unique tags, and $k_{S_{t}} = \sum_{i = 1}^{G} (1 - λ_{ti})$ the number of scarce unique tags. We further assume that the λ_ti's are independent over i and that each follows a Bernoulli distribution with probability ϕ_t, the probability of a tag being abundant under tissue type t. Given the value of λ_t, we define $p_{tl}^{*} = \sum_{i \in A_{t}} p_{tli}$, which is the total mass of the abundant set in the l^th library of type t. If all unique tags are abundant, then we have $p_{tl}^{*} = \sum_{i = 1}^{G} p_{tli} = 1$ and there are no scarce genes, i.e., S_t = ∅. Similarly, we have $p_{tl}^{*} = 0$ when no unique tags are abundant; in this case, the abundant set A_t = ∅. Mathematically, we have $p_{tl}^{*} = 1$ if $\sum_{i = 1}^{G} λ_{ti} = G$, $p_{tl}^{*} = \sum_{i \in A_{t}} p_{tli}$ if $0 < \sum_{i = 1}^{G} λ_{ti} < G$, and $p_{tl}^{*} = 0$ if $\sum_{i = 1}^{G} λ_{ti} = 0$. Then we define a column vector q_tl = (q_tl1, …, q_tlG)′ for library l of type t, where q_tli represents the re-scaled conditional probability of observing X_tli for a unique tag i under library l of type t. Depending on whether the unique tag i belongs to the abundant set or the scarce set, q_tl can be partitioned into $q_{Atl} = \{q_{Atli} = p_{tli} / p_{tl}^{*} : i \in A_{t}\}$ and $q_{Stl} = \{q_{Stli} = p_{tli} / (1 - p_{tl}^{*}) : i \in S_{t}\}$. This implies that when a unique tag i belongs to the abundant set, $p_{tli} = q_{Atli} p_{tl}^{*}$, and when a unique tag i belongs to the scarce set, $p_{tli} = q_{Stli} (1 - p_{tl}^{*})$. Note that when all tags are abundant, the q_Stli's are not defined, and when all unique tags are scarce, the q_Atli's are not defined. Based on the relationships between p_tli, q_Atli, and q_Stli, we have $p_{tli} = (q_{Atli} p_{tl}^{*})^{λ_{ti}} \{q_{Stli} (1 - p_{tl}^{*})\}^{1 - λ_{ti}}$. Let $p_{t}^{*} = (p_{t1}^{*}, \dots, p_{t L_{t}}^{*})'$, $p^{*} = ((p_{1}^{*})', \dots, (p_{T}^{*})')'$, $q_{At} = (q_{At1}', \dots, q_{At L_{t}}')'$, $q_{St} = (q_{St1}', \dots, q_{St L_{t}}')'$, $q_{A} = (q_{A1}', \dots, q_{AT}')'$, $q_{S} = (q_{S1}', \dots, q_{ST}')'$, and $λ = (λ_{1}', \dots, λ_{T}')'$. Given λ and the observed data D, the conditional likelihood can be rewritten as $L (p^{*}, q_{A}, q_{S} \mid λ, D) = \prod_{t = 1}^{T} \prod_{l = 1}^{L_{t}} \binom{n_{tl}}{X_{tl1}, \dots, X_{tlG}} \prod_{i = 1}^{G} (q_{Atli} p_{tl}^{*})^{λ_{ti} X_{tli}} \{q_{Stli} (1 - p_{tl}^{*})\}^{(1 - λ_{ti}) X_{tli}}$. If the population is composed of very abundant species, very rare species, and species in between, then we can generalize the above model to a mixture of three Dirichlet distributions, each conditional on the total mass of the genes in its category.
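
The reparameterization of p_tli in terms of the indicators, the total abundant mass, and the re-scaled probabilities can be sketched for a single library as follows. The function name and the toy values are hypothetical and serve only to show that the reconstructed probabilities sum to one, with total mass p* on the abundant set.

```python
def abundance_from_parts(lam, p_star, q_abundant, q_scarce):
    """Rebuild p_tli for one library from (lambda_t, p*_tl, q_Atl, q_Stl).

    lam        : list of 0/1 indicators (1 = abundant tag)
    p_star     : total mass of the abundant set in this library
    q_abundant : dict {tag index: q_Atli}, summing to 1 over the abundant set
    q_scarce   : dict {tag index: q_Stli}, summing to 1 over the scarce set
    """
    p = []
    for i, is_abundant in enumerate(lam):
        if is_abundant:
            p.append(q_abundant[i] * p_star)          # p = q_A * p*
        else:
            p.append(q_scarce[i] * (1.0 - p_star))    # p = q_S * (1 - p*)
    return p


# Toy example with G = 5 tags, two of them abundant (illustrative values only).
lam = [1, 1, 0, 0, 0]
p = abundance_from_parts(lam, 0.9, {0: 0.6, 1: 0.4}, {2: 0.5, 3: 0.3, 4: 0.2})
```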

3. Prior Elicitation

We assume that q_Atl, q_Stl and $p_{tl}^{*}$ are conditionally independent given λ_t. We note that this assumption holds if a Dirichlet prior is assumed for p_tl. For q_Atl and q_Stl, we specify the symmetric Dirichlet distributions $q_{Atl} \sim D_{k_{A_{t}}} (θ_{Atl})$ and $q_{Stl} \sim D_{k_{S_{t}}} (θ_{Stl})$, where $D_{k_{A_{t}}} (θ_{Atl})$ ($D_{k_{S_{t}}} (θ_{Stl})$) denotes a symmetric k_{A_t} − 1 (k_{S_t} − 1) variate Dirichlet distribution with all parameters equal to θ_Atl (θ_Stl). Based on the definition of $p_{tl}^{*}$, we observe that (i) $p_{tl}^{*}$ is highly dependent on the value of λ_t; (ii) the value of $p_{tl}^{*}$ increases when the size of the abundant set k_{A_t} increases; and (iii) when k_{A_t} = G, $p_{tl}^{*} = 1$, and when k_{A_t} = 0, $p_{tl}^{*} = 0$, with probability 1. To account for the dependence of $p_{tl}^{*}$ on λ_t, we specify the conditional distribution of $p_{tl}^{*}$ as a mixture distribution composed of a beta distribution depending on k_{A_t} and two degenerate distributions at 0 (when all genes are scarce) and 1 (when all genes are abundant), where the beta distribution is given by $π (p_{tl}^{*} \mid λ_{t}, α_{tl}) = \frac{Γ (G α_{tl})}{Γ (k_{A_{t}} α_{tl}) Γ (k_{S_{t}} α_{tl})} (p_{tl}^{*})^{k_{A_{t}} α_{tl} - 1} (1 - p_{tl}^{*})^{k_{S_{t}} α_{tl} - 1}$ for $0 < \sum_{i = 1}^{G} λ_{ti} < G$. The hyperparameter α_tl controls the degree of concentration of $p_{tl}^{*}$ around k_{A_t}/G. The larger it is, the more concentrated $p_{tl}^{*}$ is about its mean value (the proportion of abundant genes of type t). This size-dependent prior takes advantage of the information provided by λ_t: it is not only more natural in defining the total mass of abundant genes, but also reduces the two-hyperparameter (mean and variance) specification of Morris et al. (2003) to a single parameter (the variance). This will streamline our hierarchical construction and interpretation for borrowing information across libraries.
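
As a small worked step (a standard property of the beta distribution, not spelled out in the text), the prior mean and variance of $p_{tl}^{*}$ under the beta component follow directly from $k_{A_{t}} + k_{S_{t}} = G$:

$E (p_{tl}^{*} \mid λ_{t}, α_{tl}) = \frac{k_{A_{t}} α_{tl}}{G α_{tl}} = \frac{k_{A_{t}}}{G}$, $Var (p_{tl}^{*} \mid λ_{t}, α_{tl}) = \frac{k_{A_{t}} k_{S_{t}}}{G^{2} (G α_{tl} + 1)}$.

The mean does not involve α_tl, whereas the variance decreases as α_tl grows, which is the sense in which α_tl alone governs how tightly $p_{tl}^{*}$ concentrates around the proportion of abundant genes.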

Priors at the Library Level

For the hyperparameters α_tl, θ_Atl and θ_Stl, we assume that α_tl is independent of θ_Atl and θ_Stl a priori. We specify $α_{tl} \overset{iid}{\sim} E (α_{t})$ for l = 1, …, L_t, where $E (α_{t})$ denotes an exponential distribution with mean 1/α_t. As the frequencies of the scarce genes have smaller variances than those of the abundant genes, we impose the constraint θ_Atl < θ_Stl to ensure identifiability. To ease the computational burden, we consider a de-constrained transformation: $θ_{Stl}^{*} = θ_{Stl} - θ_{Atl}$. Then, we assume that θ_Atl and $θ_{Stl}^{*}$ are independent and take $θ_{Atl} \overset{iid}{\sim} E (θ_{At})$ and $θ_{Stl}^{*} \overset{iid}{\sim} E (θ_{St}^{*})$. Note that the prior mean of α_tl is $E (α_{tl} \mid α_{t}) = \frac{1}{α_{t}}$, which measures the average value of α_tl over the L_t libraries of tissue type t. Similarly, 1/θ_At and $1 / θ_{At} + 1 / θ_{St}^{*}$ measure the average values of θ_Atl and θ_Stl over the L_t libraries of tissue type t.

Priors at the Tissue Type Level

To borrow information across different types of tissues, we take $α_{t} \overset{iid}{\sim} E (α)$, $θ_{At} \overset{iid}{\sim} E (θ_{A})$ and $θ_{St}^{*} \overset{iid}{\sim} E (θ_{S}^{*})$ for t = 1, …, T, where α_t, θ_At, and $θ_{St}^{*}$ are assumed to be independent. We further specify independent gamma priors for α, θ_A, and $θ_{S}^{*}$: α ~ Gamma(a_α, b_α), θ_A ~ Gamma(a_A, b_A) and $θ_{S}^{*} \sim Gamma (a_{S}, b_{S})$, where the shape and scale parameters a_α, b_α, a_A, b_A, a_S, and b_S are pre-specified. In Sections 6 and 7, we use a_α = a_A = a_S = 1, and b_α = b_A = b_S = 0.001 (Section 6) or 0.1 (Section 7). Finally, we specify independent beta priors for ϕ_t (the probability of abundant genes for tissue type t) given by $ϕ_{t} \overset{iid}{\sim} beta \left\{ \frac{1 - C_{0} - ϕ}{C_{0}} ϕ, \frac{1 - C_{0} - ϕ}{C_{0}} (1 - ϕ) \right\}$, where 0 < C₀ < 1 is a pre-specified hyperparameter. We further specify a uniform prior, ϕ ~ U(0, ϕ₀), for ϕ, where 0 < ϕ₀ < 1 − C₀ is pre-specified. It is easy to show that the prior for ϕ_t has mean E(ϕ_t|ϕ) = ϕ and variance Var(ϕ_t|ϕ) = C₀ϕ. Thus, a large value of C₀ reflects a vague prior belief in the prior mean ϕ and a small value yields a strong prior belief in ϕ. In Sections 6 and 7, we consider (ϕ₀, C₀) = (0.3, 0.5), (0.3, 0.1), and (0.6, 0.3) for carrying out a sensitivity analysis of the posterior estimates of the ϕ_t's with respect to ϕ₀ and C₀ in our simulation studies and real data analysis. Morris et al. (2003) considered only one SAGE experiment with only one tissue type. As there is only a single ϕ in their case, ϕ is not estimable, and they therefore fixed its value. As mentioned in Morris et al. (2003), their analysis was very sensitive to the choice of ϕ. In contrast, as empirically shown in Sections 6 and 7, our prior specification for the ϕ_t's and ϕ results in more robust posterior estimates of the ϕ_t's due to the availability of multiple tissue types in our EST data. In order to promote a better understanding of the hierarchical model, a graphical display of the model structure and its prior specifications for T = 2 is constructed in Figure 2.
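
The stated prior mean and variance of ϕ_t can be checked numerically from the two beta parameters. The sketch below uses only the textbook formulas for the mean and variance of a beta distribution; the function name and the particular values of ϕ and C₀ are illustrative choices, not settings taken from the paper.

```python
def phi_t_prior_moments(phi, c0):
    """Mean and variance of the beta prior for phi_t given phi and C0.

    The prior is beta(s * phi, s * (1 - phi)) with s = (1 - C0 - phi) / C0,
    which requires 0 < C0 < 1 and 0 < phi < 1 - C0.
    """
    if not (0.0 < c0 < 1.0 and 0.0 < phi < 1.0 - c0):
        raise ValueError("need 0 < C0 < 1 and 0 < phi < 1 - C0")
    s = (1.0 - c0 - phi) / c0
    a = s * phi
    b = s * (1.0 - phi)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


# Illustrative values: the mean should equal phi and the variance C0 * phi.
mean, var = phi_t_prior_moments(phi=0.2, c0=0.1)
```

With ϕ = 0.2 and C₀ = 0.1 the two beta parameters are 1.4 and 5.6, and the computed moments agree with ϕ and C₀ϕ up to floating-point error.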

Figure 2. Graphical display of the hierarchical structure. The structure on the left (right) is for libraries in tissue type 1 (2). Multiple libraries are stacked up with only the l^th library in each tissue type plotted. The information sharing across libraries is done by second rows top and bottom, and information sharing across tissues is done by first row top and bottom.

4. Posterior Distribution and Computational Development

We develop an efficient MCMC sampling algorithm because the complex hierarchical model admits no analytical solution for the posterior distribution. First, we integrate out the parameters α, θ_A and $θ_{S}^{*}$, and sample from several conditional posterior distributions in a hierarchical fashion. We then use the collapsed Gibbs sampler of Liu (1994) to sample λ, α⁽¹⁾, $θ_{A}^{(1)}$, and $θ_{S}^{* (1)}$ after collapsing out p*, q_A and q_S, where α⁽¹⁾ = (α_tl, l = 1, …, L_t, t = 1, …, T)′, $θ_{A}^{(1)} = (θ_{Atl}, l = 1, \dots, L_{t}, t = 1, \dots, T)'$, and $θ_{S}^{* (1)} = (θ_{Stl}^{*}, l = 1, \dots, L_{t}, t = 1, \dots, T)'$. We note that this is the key step of the MCMC sampling algorithm, as q_A and q_S are extremely high-dimensional. Moreover, instead of sampling α_tl, we sample its log value using a localized Metropolis algorithm with a normal proposal distribution. To sample ϕ_t and ϕ, we use the collapsed Gibbs sampler again by first sampling ϕ given λ and D (data) and then sampling the vector of ϕ_t's given (ϕ, λ, D).

It is worth mentioning that many gamma functions are involved in the conditional posterior distributions. Computing these gamma functions is time-consuming and may cause numerical overflow. However, all gamma functions involved in the conditional posterior distributions can be paired as ratios whose arguments differ by an integer. Combining this manipulation with Stirling's formula, we can efficiently sample those variates from their posterior distributions and avoid overflow problems. In our MCMC algorithm, the generation of λ shares the same feature as in Morris et al. (2003). Otherwise, our algorithm is quite different from theirs, due to the hierarchical structure developed in this paper. The details of the derivation and computation of the posterior distribution are given in Web Appendix A.
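
The pairing of gamma functions whose arguments differ by an integer can be illustrated with a short sketch. It relies on the identity Γ(x + n)/Γ(x) = x(x + 1)⋯(x + n − 1), evaluated on the log scale; this shows only the pairing idea and is not the authors' implementation, which also uses Stirling's formula (see Web Appendix A). The function name and test values are hypothetical.

```python
import math


def log_gamma_ratio(x, n):
    """log{Gamma(x + n) / Gamma(x)} for x > 0 and a non-negative integer n.

    Uses Gamma(x + n) / Gamma(x) = x (x + 1) ... (x + n - 1), so no gamma
    function is ever evaluated directly.
    """
    return sum(math.log(x + j) for j in range(n))


# Illustrative arguments: math.gamma itself overflows for arguments above
# roughly 171, whereas the log-scale sum stays well within range.
x, n = 200.5, 30
via_sum = log_gamma_ratio(x, n)
via_lgamma = math.lgamma(x + n) - math.lgamma(x)   # independent cross-check
```

Working on the log scale means that only differences of log-ratios need to be exponentiated, for example inside a Metropolis acceptance ratio.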

5. Gene Selection Criteria

One of our major objectives is to detect DE genes between two different tissue types. The probability p_tli represents the relative abundance of gene i in library l of tissue type t. To assess whether a gene is DE between two different tissue types, we construct a weighted average of the p_tli to obtain the type-specific gene-level summary measure $p_{wti} = \sum_{l = 1}^{L_{t}} w_{tl} p_{tli}$, where $\sum_{l = 1}^{L_{t}} w_{tl} = 1$ for tissue type t. Following Baggerly et al. (2003), we choose the weight w_tl ∝ n_tl. We declare gene i to be DE if the difference between the weighted averages p_wti and p_wt′i for types t and t′ is large enough with respect to its standard deviation. Specifically, we compute the posterior probability $γ_{i} = \Pr \left\{ \frac{| p_{wti} - p_{wt'i} |}{\sqrt{Var (p_{wti} - p_{wt'i} \mid D)}} \geq 2 \,\Big|\, D \right\}$, where Var(p_wti − p_wt′i|D) is the posterior variance of p_wti − p_wt′i. Then we declare that gene i is DE if γ_i ≥ γ₀, where γ₀ is a predetermined cut-off value (for example 0.5, 0.6, or 0.7).
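
A Monte Carlo version of this criterion can be sketched as follows, assuming that posterior draws of the two weighted averages are available from the MCMC output. The function names, the use of the sample standard deviation of the draws as the estimate of the posterior standard deviation, and the synthetic normal draws are all assumptions made for illustration.

```python
import random
import statistics


def weighted_average(p_by_library, library_sizes):
    """p_wti for one posterior draw, with weights w_tl proportional to n_tl."""
    total = sum(library_sizes)
    return sum(n / total * p for p, n in zip(p_by_library, library_sizes))


def gamma_i(draws_t, draws_t_prime, threshold=2.0):
    """Estimate gamma_i from paired posterior draws of p_wti and p_wt'i."""
    diffs = [a - b for a, b in zip(draws_t, draws_t_prime)]
    sd = statistics.pstdev(diffs)        # posterior s.d. estimated from the draws
    if sd == 0.0:
        return 0.0                       # convention chosen here for degenerate draws
    return sum(abs(d) / sd >= threshold for d in diffs) / len(diffs)


# Synthetic draws (not real MCMC output): one clearly DE gene, one EE gene.
rng = random.Random(1)
de_1 = [rng.gauss(0.010, 0.001) for _ in range(5000)]
de_2 = [rng.gauss(0.004, 0.001) for _ in range(5000)]
ee_1 = [rng.gauss(0.005, 0.001) for _ in range(5000)]
ee_2 = [rng.gauss(0.005, 0.001) for _ in range(5000)]

gamma_de = gamma_i(de_1, de_2)   # close to 1, so declared DE for gamma0 = 0.5
gamma_ee = gamma_i(ee_1, ee_2)   # small, so not declared DE
```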

This method, called the 2-criterion, is an extension of the 1-criterion algorithm proposed in Ibrahim et al. (2002), where a gene is declared to be DE when the posterior probability of at least one standardized unit of change in the weighted averages is larger than γ₀. Compared to the 1-criterion, our new criterion is better calibrated and has a better interpretation. Assuming that T = 2 and that p_w1i − p_w2i is normally distributed, after some algebra we can show that the 2-criterion ensures that the posterior probability p_0i = max{Pr(p_w1i − p_w2i > 0 |D), Pr(p_w1i − p_w2i < 0 |D)} is at least Φ[2 − Φ⁻¹{1 + Φ(−2) − γ₀}] asymptotically for any γ₀ > 0, where Φ denotes the standard normal cumulative distribution function. Furthermore, we can also show that asymptotically the 2-criterion ensures that p_0i > 97.7% when γ₀ = 0.5 and p_0i > 99.4% when γ₀ = 0.7. Let $σ = \sqrt{Var (p_{w1i} - p_{w2i} \mid D)}$ be the posterior standard deviation of p_w1i − p_w2i. Observe that when γ₀ = 0.5, the “center” of the posterior distribution of p_w1i − p_w2i is approximately 2.0σ away from 0; when γ₀ = 0.7, it is approximately 2.5σ away from 0. These results imply that under the 2-criterion a gene is declared to be DE if the difference between p_w1i and p_w2i is larger than approximately two standard deviations when γ₀ = 0.5 and two and a half standard deviations when γ₀ = 0.7. Finally, we note that fewer DE genes are selected if a larger value of γ₀ is chosen.
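
The calibration of the 2-criterion under the normality assumption can be explored numerically. In the sketch below, the standardized posterior difference is taken to be N(m, 1) with m ≥ 0, so that γ = 1 − Φ(2 − m) + Φ(−2 − m); the helper names and the bisection search are illustrative choices, and the closed-form bound is coded for cut-offs larger than Φ(−2), such as 0.5 to 0.7.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gamma_given_shift(m):
    """gamma = Pr(|Z + m| >= 2) for Z ~ N(0, 1), i.e. a posterior centred m s.d. from 0."""
    return 1.0 - STD_NORMAL.cdf(2.0 - m) + STD_NORMAL.cdf(-2.0 - m)


def lower_bound_p0(gamma0):
    """Closed-form bound from the text: Phi[2 - Phi^{-1}{1 + Phi(-2) - gamma0}]."""
    inner = 1.0 + STD_NORMAL.cdf(-2.0) - gamma0      # must lie in (0, 1)
    return STD_NORMAL.cdf(2.0 - STD_NORMAL.inv_cdf(inner))


def shift_at_cutoff(gamma0, lo=0.0, hi=10.0, tol=1e-10):
    """Smallest m >= 0 with gamma(m) >= gamma0, by bisection (gamma increases in m)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_given_shift(mid) < gamma0:
            lo = mid
        else:
            hi = mid
    return hi


m_05 = shift_at_cutoff(0.5)          # distance of the posterior centre from 0, in s.d.
m_07 = shift_at_cutoff(0.7)
p0_05 = STD_NORMAL.cdf(m_05)         # implied posterior probability of the sign
p0_07 = STD_NORMAL.cdf(m_07)
```

Solving for the shift exactly gives values close to 2.0σ and 2.5σ and posterior sign probabilities close to the 97.7% and 99.4% quoted above, while the closed-form expression is a slightly more conservative lower bound.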

6. Simulation Studies

We conducted two simulation studies, each with 100 data sets, simulated from the NLMD model and from a real NCBI data set, respectively. We evaluated the performance of the proposed method (called two-class) in comparison to three other methods: a 2 × 2 table with a χ² test (Romualdi et al., 2001), a multinomial Dirichlet model (called one-class) and the multinomial NLMD model (called two-class (non-hier)) (Morris et al., 2003, 2006). Under the χ² test, a 2 × 2 cross-classification table is constructed for each gene, where the rows correspond to gene i and “all other genes” and the columns correspond to the two tissue types. A gene is declared to be DE if the resulting p-value for the test of homogeneity is less than a pre-specified cut-off value, say 0.01 or 0.05. Under the one-class model, which does not distinguish scarce from abundant genes, we fit the relative expression probabilities p_tl = {p_tli, i = 1, …, G} with a symmetric Dirichlet hierarchical prior distribution $D_{G - 1} (θ_{tl})$ for each t and l; $θ_{tl} \mid θ_{t} \overset{iid}{\sim} E (θ_{t})$ for all l and each fixed t; and $θ_{t} \overset{iid}{\sim} E (θ)$ for all t. Under the two-class (non-hier) model, we modified the prior for the total mass of abundant genes $p_{tl}^{*}$ to be size dependent as defined in Section 3. As in Morris et al. (2003), we pre-specify the values of the hyperparameters {ϕ_t, α_tl, θ_Atl, θ_Stl}. Specifically, we set ϕ_t = 0.2 in Simulation I and 0.05 in Simulation II, α_tl = 0.001, θ_Atl = 0.5, and θ_Stl = 1.0 for all t and l. Note that the two-class (non-hier) model fits the same NLMD model to each library independently, without any hierarchical structure. For the two-class method, we considered (0.3, 0.5), (0.3, 0.1) and (0.6, 0.3) for (ϕ₀, C₀). The one-class, two-class (non-hier) and two-class models all use the 2-criterion based on 5000 MCMC samples with γ₀ = 0.5, 0.6, 0.7 to detect DE genes.

To evaluate the performance of these four methods, we use four error rates: the false negative rate (FNR), false positive rate (FPR), false discovery rate (FDR), and false non-discovery rate (FNDR). The FNR is the proportion of DE genes that fail to be detected as DE. The FPR is the proportion of equally expressed (EE) genes wrongly declared to be DE. The FDR is the realized rate of false detections among the detected genes, and the FNDR is the realized rate of false non-detections among the non-detected genes. A method is considered better than the others if all of its error rates are smaller than those of the others. In addition to these error rates, we also report the number of genes claimed to be DE (CDE), the number of genes correctly claimed as DE (CCDE), and the number of genes correctly claimed as EE (CCEE), averaged over the 100 data sets, with their standard deviations in parentheses (Table 1).
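
The four error rates and the three counts defined above can be computed for a single data set as in the following sketch. The function name, the dictionary layout, and the convention of returning 0 when a denominator is empty are assumptions made for illustration; the ten-gene example is a toy case.

```python
def error_rates(is_de, claimed_de):
    """Counts and error rates for one data set.

    is_de      : list of booleans, True if the gene is truly DE
    claimed_de : list of booleans, True if the method declares the gene DE
    """
    pairs = list(zip(is_de, claimed_de))
    n_de = sum(1 for truth, _ in pairs if truth)
    n_ee = len(pairs) - n_de
    cde = sum(1 for _, claim in pairs if claim)                             # claimed DE
    ccde = sum(1 for truth, claim in pairs if truth and claim)              # correctly claimed DE
    ccee = sum(1 for truth, claim in pairs if not truth and not claim)      # correctly claimed EE
    not_claimed = len(pairs) - cde

    def ratio(num, den):
        return num / den if den else 0.0   # convention chosen here for empty denominators

    return {
        "CDE": cde,
        "CCDE": ccde,
        "CCEE": ccee,
        "FNR": ratio(n_de - ccde, n_de),                  # DE genes missed
        "FPR": ratio(n_ee - ccee, n_ee),                  # EE genes wrongly declared DE
        "FDR": ratio(cde - ccde, cde),                    # false detections among detections
        "FNDR": ratio(not_claimed - ccee, not_claimed),   # missed DE among non-detections
    }


# Toy example with 10 genes: genes 0-3 are truly DE; genes 0, 1, 2 and 7 are claimed DE.
truth = [i < 4 for i in range(10)]
claims = [i in (0, 1, 2, 7) for i in range(10)]
rates = error_rates(truth, claims)
```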

Table 1.

Method comparison based on two simulation studies (G = 5000)

Simulation	Method	CutOff	CDE	CCDE	CCEE	FNR	FPR	FDR	FNDR
I	χ ²	0.05	313.2(8.0)	199.9(0.3)	4686.7(7.9)	0.001	0.024	0.361	0.000
(#DE=200)		0.01	224.8(5.0)	199.8(0.4)	4775.0(5.0)	0.001	0.005	0.111	0.005
	one-class	0.5	271.2(6.4)	199.9(0.3)	4728.7(6.3)	0.001	0.015	0.263	0.000
		0.6	243.8(6.2)	199.9(0.3)	4756.1(6.0)	0.001	0.009	0.180	0.000
		0.7	225.0(5.2)	199.8(0.4)	4774.8(5.3)	0.001	0.005	0.112	0.000
	two-class	0.5	265.9(8.5)	200.0(0.0)	4734.1(8.6)	0.000	0.014	0.247	0.000
	(non-hier)	0.6	239.6(6.1)	200.0(0.1)	4760.4(6.2)	0.000	0.008	0.165	0.000
		0.7	221.6(4.8)	200.0(0.2)	4778.4(4.8)	0.000	0.005	0.097	0.000
	two-class^*	0.5	201.8(1.6)	200.0(0.0)	4798.2(1.4)	0.000	0.000	0.009	0.000
	(0.3, 0.5)	0.6	200.5(0.7)	200.0(0.0)	4799.5(0.0)	0.000	0.000	0.003	0.000
		0.7	200.3(0.5)	200.0(0.0)	4799.7(0.0)	0.000	0.000	0.002	0.000
	two-class	0.5	201.7(1.6)	200.0(0.0)	4798.3(1.4)	0.000	0.000	0.008	0.000
	(0.3, 0.1)	0.6	200.5(0.7)	200.0(0.0)	4799.5(0.0)	0.000	0.000	0.003	0.000
		0.7	200.3(0.5)	200.0(0.0)	4799.7(0.0)	0.000	0.000	0.002	0.000
	two-class	0.5	201.4(1.3)	200.0(0.1)	4798.6(1.2)	0.000	0.000	0.007	0.000
	(0.6, 0.3)	0.6	200.3(0.6)	200.0(0.1)	4799.7(0.6)	0.000	0.000	0.002	0.000
		0.7	200.1(0.3)	200.0(0.1)	4799.9(0.3)	0.000	0.000	0.001	0.000

II	χ ²	0.05	732.8(20.7)	362.4(8.8)	4129.6(17.6)	0.275	0.082	0.505	0.032
(#DE=500)		0.01	336.6(10.4)	277.4(8.6)	4440.8(7.1)	0.445	0.013	0.176	0.048
	one-class	0.5	364.2(11.5)	324.4(9.1)	4460.2(7.0)	0.351	0.009	0.109	0.038
		0.6	303.7(9.6)	287.5(8.6)	4483.8(4.0)	0.425	0.004	0.053	0.045
		0.7	253.7(8.8)	249.4(8.5)	4495.7(2.1)	0.501	0.001	0.017	0.053
	two-class	0.5	267.1(8.6)	264.8(8.5)	4497.7(1.6)	0.470	0.000	0.009	0.050
	(non-hier)	0.6	246.6(8.5)	245.1(8.3)	4498.5(1.2)	0.510	0.000	0.006	0.054
		0.7	227.6(8.1)	226.6(8.0)	4499.0(1.0)	0.547	0.000	0.004	0.057
	two-class	0.5	353.3(10.3)	341.2(8.9)	4487.9(3.7)	0.318	0.003	0.034	0.034
	(0.3, 0.5)	0.6	337.6(9.0)	328.3(8.2)	4490.7(3.2)	0.343	0.002	0.028	0.037
		0.7	323.0(8.4)	315.9(7.9)	4492.9(2.9)	0.368	0.002	0.022	0.039
	two-class	0.5	353.6(10.3)	341.4(9.0)	4487.8(3.7)	0.317	0.003	0.034	0.034
	(0.3, 0.1)	0.6	337.8(8.8)	328.4(8.0)	4490.6(3.2)	0.343	0.002	0.028	0.037
		0.7	322.8(8.4)	315.7(7.9)	4492.9(3.0)	0.369	0.002	0.022	0.039
	two-class	0.5	353.4(10.1)	341.3(9.0)	4487.9(3.2)	0.317	0.003	0.034	0.034
	(0.6, 0.3)	0.6	337.7(8.8)	328.3(8.1)	4490.7(2.8)	0.343	0.002	0.028	0.037
		0.7	322.9(8.4)	315.8(8.0)	4492.9(2.6)	0.368	0.002	0.022	0.039

Note: Under the method column, “two-class (non-hier)” denotes the two-class method without hierarchical structure, with ϕ₁ = ϕ₂ = 0.2 under simulation study I and ϕ₁ = ϕ₂ = 0.05 under simulation study II.

The two numeric values inside the parentheses under “two-class” correspond to the values of ϕ₀ and C₀ respectively.

6.1 Simulation Study I

Data were generated for 6 libraries with equal library size for each tissue type for each data set. The observed counts in each library were generated from a multinomial distribution $M (n_{tl}, (p_{tl1}, \dots, p_{tlG}))$ with G = n_tl = 5000. The parameters of the multinomial distribution (p_tl1, …, p_tlG) were generated as follows. We first set A₁ = {1, …, 1000}, S₁ = {1001, …, 5000}, A₂ = {1, …, 900, 1001, …, 1100}, S₂ = {901, …, 1000, 1101, …, 5000}, and $p_{tl}^{*} = 0.9$. Then, we generated the parameters $(q_{tli}, i \in A_{t}) \sim D (11)$ and $(q_{tli}, i \in S_{t}) \sim D (15)$, both symmetric Dirichlet distributions, under library l and tissue type t. Finally, we set $p_{tli} = q_{tli} p_{tl}^{*}$ if i ∈ A_t and $p_{tli} = q_{tli} (1 - p_{tl}^{*})$ if i ∈ S_t. This simulation setting implies that genes 901 to 1100 are DE while all other genes are EE. Specifically, the first (second) half of the 200 DE genes are abundant (scarce) under type 1 and scarce (abundant) under type 2.
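
The data-generating steps of Simulation I can be sketched for a single library as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the Dirichlet parameters 11 and 15 are taken as printed in the text, the symmetric Dirichlet draw is obtained by normalizing independent gamma variates, and the function names and the random seed are arbitrary.

```python
import random
from collections import Counter


def symmetric_dirichlet(k, theta, rng):
    """One draw from a symmetric Dirichlet with k components (normalized gamma variates)."""
    draws = [rng.gammavariate(theta, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_library(abundant, G, n, p_star, theta_a, theta_s, rng):
    """Abundance vector p (keyed by 1-based gene id) and multinomial counts for one library."""
    abundant_ids = sorted(abundant)
    scarce_ids = [i for i in range(1, G + 1) if i not in abundant]
    p = {}
    for i, q in zip(abundant_ids, symmetric_dirichlet(len(abundant_ids), theta_a, rng)):
        p[i] = q * p_star                   # abundant genes share total mass p*
    for i, q in zip(scarce_ids, symmetric_dirichlet(len(scarce_ids), theta_s, rng)):
        p[i] = q * (1.0 - p_star)           # scarce genes share the remaining mass
    genes = list(p)
    tags = rng.choices(genes, weights=[p[g] for g in genes], k=n)   # multinomial sampling
    return p, Counter(tags)


G = 5000
A1 = set(range(1, 1001))
A2 = set(range(1, 901)) | set(range(1001, 1101))
true_de = A1 ^ A2                            # genes abundant under exactly one type

rng = random.Random(2010)                    # arbitrary seed
p, counts = simulate_library(A1, G=G, n=5000, p_star=0.9,
                             theta_a=11.0, theta_s=15.0, rng=rng)
```

Calling `simulate_library` six times with A₁ and six times with A₂ would give one simulated data set in the layout described above.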

The results under Simulation I are summarized in Table 1. From the top panel of Table 1, we see that the two-class method clearly outperforms the χ² test, the one-class method and the two-class (non-hier) method based on FNR, FPR, FDR and FNDR. The one-class method and the χ² test are comparable in terms of the magnitudes of their error rates: the error rates from the one-class method lie between those obtained from the χ² test with cut-off values of 0.05 and 0.01. The number of DE genes selected by the two-class (non-hier) method tends to fall between those of the one-class and the two-class methods. Compared to the one-class method, the two-class (non-hier) method performs better by correctly selecting more genes, hence providing smaller values for all four error rates. Compared to the two-class method, however, the two-class (non-hier) method performs poorly because of its high FDR. The top panel of Table 1 also shows that the two-class method is the least sensitive to the choice of the cut-off value γ₀. Moving down the CDE column from the χ² test to the two-class models, the number of DE genes selected by each method at the different cut-off values, averaged over the 100 simulated data sets, tends to decrease, and so does its variability. The error rates also demonstrate that the proposed 2-criterion for gene selection works well, as it leads to satisfactorily low FNR and FPR for all of the latter three methods.

We further compared the results obtained from the same methods with different library sizes. The results with G = 5000 and n_tl = 2500 are shown in Table 4 of Web Appendix B. As the library size increases, all four error rates (FNR, FPR, FDR, and FNDR) of the two-class method decrease. For the other methods, improvements were seen in only some of the error rates. In particular, the FPR and FDR obtained from the χ² test and the one-class method increase when the library size increases, while only the FDR increases for the two-class (non-hier) method.

6.2 Simulation Study II

In this study we simulated more realistic data. We started with a real EST data set with 2 brain tumor tissue libraries (type 1) and 2 normal tissue libraries (type 2). We first obtained the MLEs of the relative probabilities ${\hat{p}}_{tg}$ for all genes under tissue type t. Let ${\hat{p}}_{g} = ({\hat{p}}_{1g} + {\hat{p}}_{2g}) / 2$. Then, the top G = 5000 genes with the largest ${\hat{p}}_{g}$ were used in the study, given that genes with small frequency counts for both types are not of interest in detecting DE genes. Among them, the 500 genes with the largest differences in the MLEs between the two tissue types were set to be DE. The other genes are EE. We applied a logit transformation to the original relative frequency count (probability) for each gene in a library, and computed the means and variances for each gene over the libraries. If a gene is EE, we simulated normal variates for each library with the same mean (averaged over the two types) and the pooled variance for that gene. If a gene is DE, we simulated normal variates separately for each library to match the individual means and variances for each type. Then we applied an inverse logit transformation to take the normal variates back to the original scale. A re-scaling within each library was applied to ensure that the probabilities sum to one. The counts for each library were generated independently from a multinomial distribution with library size n_tl = 5000 and these relative probabilities, for a total of 9 libraries (6 assigned to type 1 and 3 to type 2) to mimic the real data set. One hundred data sets were simulated.
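The logit-normal construction described above can be sketched as follows. This is an illustrative outline only: the function name `simulate_study_two` and its inputs (`mu` and `var`, the per-type means and variances of the logit-transformed relative frequencies, and the boolean vector `is_de`) are hypothetical, and the pooled variance for EE genes is taken to be the equal-weight average of the two type-specific variances, which is an assumption.

```python
import numpy as np


def simulate_study_two(rng, mu, var, is_de, n_libs=(6, 3), n_tl=5000):
    """Simulate multinomial counts from logit-normal gene probabilities.

    mu, var : arrays of shape (2, G) with the logit-scale mean and variance
              of each gene under each tissue type (hypothetical inputs).
    is_de   : boolean array of length G marking the genes set to be DE.
    Returns a list with one (n_libraries, G) count matrix per tissue type.
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    mu_ee = mu.mean(axis=0)    # common mean for EE genes, averaged over types
    var_ee = var.mean(axis=0)  # assumption: equal-weight pooled variance
    counts = []
    for t, n_lib in enumerate(n_libs):
        mean_t = np.where(is_de, mu[t], mu_ee)
        sd_t = np.sqrt(np.where(is_de, var[t], var_ee))
        lib_counts = np.empty((n_lib, mu.shape[1]), dtype=np.int64)
        for l in range(n_lib):
            z = rng.normal(mean_t, sd_t)   # normal variates on the logit scale
            q = 1.0 / (1.0 + np.exp(-z))   # inverse logit
            p = q / q.sum()                # re-scale so the library sums to one
            lib_counts[l] = rng.multinomial(n_tl, p)
        counts.append(lib_counts)
    return counts
```

With `rng = np.random.default_rng(0)` and inputs estimated from real libraries, one call produces one simulated data set; repeating the call yields independent replicates.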

The results under Simulation II are also summarized in Table 1. From the bottom half of Table 1, we see that the two-class method performs better than the one-class method, with smaller or comparable error rates. The χ² test with the p-value cutoff set to 0.05 provides the lowest FNR of 0.275 among all methods, but with a very large FDR of 0.505. The χ² test with a cutoff value of 0.01 provides a much smaller FDR of 0.176, which is still much higher than that of the other methods. The results also show that the χ² test with the cutoff value of 0.01 performs worse than the two-class method and the one-class method for γ₀ = 0.5 or 0.6, because of its high values for all four error rates. The two-class (non-hier) method provides small error rates for FPR and FDR, but its FNR is much larger, with a range of 0.470 to 0.547, compared to those from the two-class method (0.318 to 0.368), the one-class method (0.351 to 0.501), and the χ² test (0.275 to 0.445). In addition, the two-class (non-hier) method has the largest FNDRs among all methods. Overall, the two-class method performs better than the other three methods, as it successfully identifies a large proportion of the DE genes while keeping the FDR below 0.05.

Under the two-class model, the posterior means of ϕ₁ and ϕ₂ range from 4.99% to 5.71% and from 5.13% to 6.08%, respectively. Both are slightly above the prespecified value of 5% for ϕ_t under the two-class (non-hier) method. We also checked the posterior means of θ_Atl and θ_Stl for all libraries under both tissue types. Under the first (second) tissue type, the posterior means of θ_Atl range from 0.985 to 1.685 (0.62 to 0.85) and those of θ_Stl range from 3.43 to 10.99 (3.85 to 20.88). The differences between the true values of the hyperparameters and the settings specified in the two-class (non-hier) method might explain the poor performance of the two-class (non-hier) method. Similar to the results of Simulation I, we do not observe any significant changes among the results under the three choices of (ϕ₀, C₀).

7. Real Data Analysis

An in-house Perl program was used to evaluate the 291 most reliable libraries from the NCBI UniGene database (http://www.ncbi.nlm.nih.gov/unigene and http://www.ncbi.nlm.gov/dbEST/). The program outputs EST sequences, their corresponding gene names, UniGene cluster IDs, the cDNA libraries they were derived from, the sizes of the cDNA libraries, and the numbers of detected sequences in each library. The samples were extracted from a variety of tissues, including lung, muscle, bone, and liver. We focus only on the samples from lymphoreticular or lymph node tissue. We are primarily interested in identifying the genes (unique tags) that are DE between the normal tissue and the tumor cell line. In total, 18,324 unique tags were recorded under 6 cDNA libraries of the normal tissue and 3 cDNA libraries of the tumor cell line type. After removing the genes with zero counts under both tissue types, there were 8190 unique tags with at least one non-zero count under one or more cDNA libraries; these form the subset of the data used in our analysis.

The same four methods were used to analyze the real data, with settings similar to those of Simulation Study II. The results are summarized in Table 2. The χ² test with a cut-off value of 0.05 selects 2192 DE genes, the largest number selected among the four methods. The other three methods select, in order, 1322, 1013, and 1059 DE genes with γ₀ = 0.5. The last column of Table 2 reports the proportion of the DE genes selected by each listed method that are also selected by the χ² test (with a cutoff value of 0.05). These proportions are all essentially above 90%, which is not surprising given the large number of genes selected by the χ² test. The two-class methods have slightly larger overlap with the χ² test than the two-class (non-hier) method. The results of the two-class methods under the three choices of (ϕ₀, C₀) are similar, which further confirms that the proposed method is not too sensitive to the choice of (ϕ₀, C₀). To evaluate the performance of the proposed method, we applied the permutation test proposed by Jiao and Zhang (2008), a modification of that of Storey and Tibshirani (2003), with 1000 permutations to estimate the false discovery rate. We obtained an estimate of 0.10–0.12 for the false discovery rate with γ₀ chosen between 0.5 and 0.7 in the 2-criterion algorithm.

Table 2.

Real data analysis based on samples from lymphoreticular or lymph node tissues (G = 8190).

Method	Cut-off	CDE	DE shared by χ² (0.05)	% DE shared by χ² (0.05)
χ²	0.05	2192	2192	100.0
χ²	0.01	1073	1073	100.0
one-class	0.5	1322	1213	91.8
one-class	0.6	1056	1035	98.0
one-class	0.7	828	825	99.6
two-class (non-hier)	0.5	1013	909	89.7
two-class (non-hier)	0.6	931	854	91.7
two-class (non-hier)	0.7	837	786	93.9
two-class* (0.3, 0.5)	0.5	1059	996	94.1
two-class* (0.3, 0.5)	0.6	897	859	95.8
two-class* (0.3, 0.5)	0.7	772	751	97.3
two-class (0.3, 0.1)	0.5	1059	994	93.9
two-class (0.3, 0.1)	0.6	900	861	95.7
two-class (0.3, 0.1)	0.7	773	753	97.4
two-class (0.6, 0.3)	0.5	1059	997	94.2
two-class (0.6, 0.3)	0.6	895	858	95.9
two-class (0.6, 0.3)	0.7	778	757	97.3


Note: Under the method column, “two-class (non-hier)” denotes the two-class method without hierarchical structure with setting of ϕ₁ = ϕ₂ = 0.05.

The two numeric values within the parentheses under “two-class” correspond to the values of ϕ₀ and C₀, respectively.
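The last column of Table 2 is a simple overlap percentage, with the number of DE genes selected by the listed method as the denominator. A minimal sketch of that calculation is given below; the function name `percent_shared` and the use of gene identifier sets as inputs are illustrative assumptions.

```python
def percent_shared(method_genes, chi2_genes):
    """Percentage of a method's selected genes that the chi-square test also selects."""
    method_genes = set(method_genes)
    if not method_genes:
        return 0.0  # nothing selected, so no overlap to report
    shared = method_genes & set(chi2_genes)
    return 100.0 * len(shared) / len(method_genes)
```

For instance, the one-class method with cut-off 0.5 selects 1322 genes, of which 1213 are shared with the χ² test, and 100 × 1213 / 1322 rounds to the tabulated 91.8.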

We further examined the sensitivity of the posterior estimates to the choice of (ϕ₀, C₀). Specifically, we compared the posterior means, the posterior standard deviations (SDs), and the 95% HPD intervals of all parameters under the different choices of (ϕ₀, C₀). The results with (ϕ₀, C₀) = (0.3, 0.5) and (0.3, 0.1) are summarized in Table 3. We see that the two sets of posterior estimates are very similar for all parameters except ϕ. Note that the decision on selecting DE genes is primarily based upon the parameters at the library and tissue levels, including ϕ_t, α_tl, θ_Atl, and θ_Stl for all t and l. The results with (ϕ₀, C₀) = (0.6, 0.3) are not shown in Table 3; however, the posterior estimates are similar to those with the other two choices of (ϕ₀, C₀). For example, the posterior means, SDs, and 95% HPD intervals under (ϕ₀, C₀) = (0.6, 0.3) are 0.1987, 0.0126, and (0.1745, 0.2238) for ϕ₁, and 0.1387, 0.0008, and (0.1235, 0.1535) for ϕ₂. Therefore, the proposed method is fairly robust to the choice of (ϕ₀, C₀) in terms of posterior estimates.

Table 3.

Posterior estimates of the parameters from real data analysis.

Parameter	Mean (C₀ = 0.5)	SD (C₀ = 0.5)	95% HPD Interval (C₀ = 0.5)	Mean (C₀ = 0.1)	SD (C₀ = 0.1)	95% HPD Interval (C₀ = 0.1)
ϕ	0.1432	0.0585	(0.0536, 0.2630)	0.2305	0.0548	(0.1223, 0.2999)
ϕ₁	0.1991	0.0116	(0.1782, 0.2235)	0.1977	0.0115	(0.1748, 0.2199)
ϕ₂	0.1387	0.0078	(0.1236, 0.1537)	0.1386	0.0078	(0.1248, 0.1549)
α₁	0.0003	0.0002	(0.0000, 0.0006)	0.0003	0.0002	(0.0000, 0.0006)
θ_A1	0.2826	0.1485	(0.1055, 0.5689)	0.2829	0.1390	(0.0948, 0.5614)
θ*_S1	0.1158	0.0732	(0.0260, 0.2541)	0.1162	0.0722	(0.0276, 0.2564)
α₂	0.0003		(0.0000, 0.0008)	0.0003		(0.0000, 0.0007)
θ_A2	0.4961	0.3580	(0.1028, 1.1493)	0.5049	0.3793	(0.1158, 1.1894)
θ*_S2	0.5760	0.5002	(0.1350, 1.3476)	0.5587	0.4401	(0.1226, 1.2384)
α₁₁	0.0003	0.0002	(0.0000, 0.0007)	0.0003		(0.0000, 0.0008)
θ_A11	0.0963	0.0083	(0.0804, 0.1129)	0.0958	0.0085	(0.0801, 0.1133)
θ*_S11	0.1445	0.0200	(0.1073, 0.1849)	0.1446	0.0208	(0.1037, 0.1849)
α₁₂	0.0003	0.0002	(0.0000, 0.0007)	0.0003	0.0002	(0.0000, 0.0007)
θ_A12	0.0497	0.0047	(0.0412, 0.0594)	0.0497	0.0049	(0.0410, 0.0599)
θ*_S12	0.1081	0.0128	(0.0842, 0.1342)	0.1072	0.0131	(0.0802, 0.1316)
α₁₃	0.0003	0.0002	(0.0000, 0.0008)	0.0003	0.0002	(0.0000, 0.0007)
θ_A13	0.1636	0.0175	(0.1328, 0.1997)	0.1645	0.0182	(0.1295, 0.1995)
θ*_S13	0.0123	0.0124	(0.0005, 0.0364)	0.0115		(0.0002, 0.0341)
α₁₄	0.0002		(0.0000, 0.0006)	0.0002		(0.0000, 0.0006)
θ_A14	0.4373	0.0257	(0.3870, 0.4873)	0.4386	0.0246	(0.3924, 0.4899)
θ*_S14	0.0603	0.0366	(0.0019, 0.1243)	0.0589	0.0340	(0.0018, 0.1218)
α₁₅	0.0001		(0.0000, 0.0002)	0.0001		(0.0000, 0.0002)
θ_A15	0.3180	0.0343	(0.2496, 0.3814)	0.3210	0.0325	(0.2595, 0.3859)
θ*_S15	0.1027	0.1492	(0.0017, 0.3583)	0.1028	0.1403	(0.0013, 0.3512)
α₁₆	0.0001		(0.0000, 0.0003)	0.0001		(0.0000, 0.0004)
θ_A16	0.1559	0.0217	(0.1140, 0.1983)	0.1554	0.0215	(0.1123, 0.1945)
θ*_S16	0.0072	0.0097	(0.0003, 0.0243)	0.0075	0.0095	(0.0001, 0.0233)
α₂₁	0.0002	0.0001	(0.0000, 0.0004)	0.0002	0.0001	(0.0000, 0.0004)
θ_A21	0.2559	0.0119	(0.2335, 0.2797)	0.2562	0.0121	(0.2335, 0.2798)
θ*_S21	0.0014		(0.0001, 0.0041)	0.0014	0.0013	(0.0000, 0.0042)
α₂₂	0.0002		(0.0000, 0.0005)	0.0002		(0.0000, 0.0006)
θ_A22	0.4683	0.0260	(0.4201, 0.5205)	0.4686	0.0259	(0.4181, 0.5215)
θ*_S22	1.3839	0.1537	(1.0777, 1.6656)	1.3822	0.1519	(1.0934, 1.6765)
α₂₃	0.0002	0.0001	(0.0000, 0.0004)	0.0002	0.0001	(0.0000, 0.0004)
θ_A23	0.3005	0.0126	(0.2759, 0.3257)	0.3012	0.0131	(0.2767, 0.3265)
θ*_S23	0.0017		(0.0001, 0.0053)	0.0018		(0.0000, 0.0055)


Finally, we note that forty-nine DE genes were selected by the two-class method with C₀ = 0.5 but not by the χ² test or the one-class method. Sixteen of them were also not selected by the two-class (non-hier) method. Among them, many are important and of great biological interest. For example, cyclin-dependent kinase 4 (CDK4, Hs.95577) was found to be up-regulated in tumor cell lines but not in normal lymph node. CDK4 is a catalytic subunit of the protein kinase complex and plays an important role in cell cycle G1 phase progression. Mutations in this gene, as well as in its related proteins, were found to be associated with tumorigenesis of a variety of cancers (Molenaar et al., 2008). Another gene found to be up-regulated is heat shock 90kDa protein 1, alpha (HSPCA, Hs.525600), which was shown to be overexpressed in poor-prognosis acute myeloid leukemia cells and plays a role in cell survival and resistance to chemotherapy (Flandrin et al., 2008). Other up-regulated genes code for proteins that are involved in pre-mRNA splicing, protein translation, and cellular metabolism. Among the down-regulated genes, there are several known tumor suppressors: the forkhead box O transcription factor (FOXO1, Hs.370666), which functions as a tumor suppressor by regulating the expression of genes involved in apoptosis, cell cycle arrest, and oxidative detoxification (Liu et al., 2008); BRCA1 (Hs.194143), which is known to suppress tumor growth; and runt-related transcription factor 3 (RUNX3, Hs.170019), which functions as a tumor suppressor and is frequently deleted or transcriptionally silenced in cancer cells (Kim et al., 2008).

8. Discussion

In this paper, we have developed a hierarchical multinomial NLMD model for robust estimation of the EST data by borrowing information from multiple libraries and multiple tissue types. Because multiple libraries and multiple tissue types are available, the hierarchical modeling allows the parameters at the library level and the tissue level (the second and third levels in the hierarchy) to be estimable and identifiable, which yields the advantages observed in the simulation studies in Section 6. In the proposed model, the hierarchical structure is built on the parameters of the NLMD distributions across different libraries. A natural alternative is to build a hierarchical model directly on the overall distributions of the p_tli's by assuming $p_{tl} \sim D_{G}\left(\frac{1 - \xi_{t}}{\xi_{t}} p_{t1}, \dots, \frac{1 - \xi_{t}}{\xi_{t}} p_{t, G - 1}, \frac{1 - \xi_{t}}{\xi_{t}} \left(1 - \sum_{i = 1}^{G - 1} p_{ti}\right)\right)$ and an NLMD prior for (p_t1, …, p_tG)′, where p_ti is the tissue and gene specific probability and ξ_t > 0. Under this formulation, the prior mean is E(p_tli|p_ti) = p_ti, ξ_t controls the degree of concentration of p_tli around p_ti, and posterior inference can be carried out directly on the p_ti rather than on the average of the p_tli's across l. However, this alternative approach greatly increases the computational difficulty and prevents carrying out posterior sampling. In addition, since p_ti equals the mean of the p_tli's, a simple average of the p_tli's across l may serve as a reasonable approximation of p_ti and, hence, posterior inference on the average of the p_tli's across l for detecting DE genes may be similar to that based on the p_ti. Finally, to detect DE genes between two tissue types, we have proposed a new gene selection criterion, called the 2-criterion. This gene selection criterion is easy to implement and also has a nice statistical interpretation.
Although the current version of the gene selection algorithm is written for comparing two types of tissues, it can be extended to three or more tissue types. Further theoretical properties of the 2-criterion and its extension are currently under investigation.
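The role of ξ_t in the alternative Dirichlet formulation discussed above can be made explicit with the standard Dirichlet moment formulas. The following short derivation is a worked illustration rather than a result stated in the paper: it introduces the symbol $c_t = (1 - \xi_t)/\xi_t$ for the total concentration and assumes $0 < \xi_t < 1$ so that $c_t > 0$.

$$
E(p_{tli} \mid p_{ti}) = \frac{c_t \, p_{ti}}{c_t \sum_{j=1}^{G} p_{tj}} = p_{ti},
\qquad
\mathrm{Var}(p_{tli} \mid p_{ti}) = \frac{p_{ti} (1 - p_{ti})}{c_t + 1} = \xi_t \, p_{ti} (1 - p_{ti}),
$$

where the last equality uses $c_t + 1 = 1/\xi_t$. Under these assumptions, a value of ξ_t close to 0 concentrates the library-level probability p_tli tightly around the tissue-level probability p_ti, while larger values of ξ_t allow more library-to-library variation.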

Acknowledgements

The authors wish to thank the editor, the associate editor, and the two referees for their helpful comments and suggestions, which have led to a considerable improvement of this article. Dr. Chen's research was partially supported by NIH grants #GM 70335 and #CA 74015 and Dr. Kuo's research was partially supported by NIH grant #GM 5764-01.

Footnotes

9. Supplementary Materials

The Web Appendix referenced in Section 4 is available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

References

Adams MD, Dubnick M, Kerlavage AR, Moreno R, Kelley JM, Utterback TR, Nagle JW, Fields C, Venter JC. Sequence identification of 2,375 human brain genes. Nature. 1992;355:632–634. doi: 10.1038/355632a0.
Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Research. 1997;7:986–995. doi: 10.1101/gr.7.10.986.
Baggerly KA, Deng L, Morris JS, Aldaz CM. Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics. 2003;19(12):1477–1483. doi: 10.1093/bioinformatics/btg173.
Baggerly KA, Deng L, Morris JS, Aldaz CM. Overdispersed logistic regression for SAGE: Modelling multiple groups and covariates. BMC Bioinformatics. 2004;5:144–159. doi: 10.1186/1471-2105-5-144.
Claverie JM. Computational methods for the identification of differential and coordinated gene expression. Human Molecular Genetics. 1999;8(10):1821–1832. doi: 10.1093/hmg/8.10.1821.
Flandrin P, Guyotat D, Duval A, Cornillon J, Tavernier E, Nadal N, Campos L. Significance of heat-shock protein (HSP) 90 expression in acute myeloid leukemia cells. Cell Stress Chaperones. 2008;13(3):357–364. doi: 10.1007/s12192-008-0035-3.
Ibrahim JG, Chen M-H, Gray RJ. Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association. 2002;97:88–99.
Jiao S, Zhang S. On correcting the overestimation of the permutation-based false discovery rate estimator. Bioinformatics. 2008;24(15):1655–1661. doi: 10.1093/bioinformatics/btn310.
Kim EJ, Kim YJ, Jeong P, Ha YS, Bae SC, Kim WJ. Methylation of the RUNX3 promoter as a potential prognostic marker for bladder tumor. Journal of Urology. 2008;180(3):1141–1145. doi: 10.1016/j.juro.2008.05.002.
Kuznetsov VA. Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP Journal on Applied Signal Processing. 2001;4:285–296.
Liu JS. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association. 1994;89:958–966.
Liu P, Kao TP, Huang H. CDK1 promotes cell proliferation and survival via phosphorylation and inhibition of FOXO1 transcription factor. Oncogene. 2008;27(34):4733–4744. doi: 10.1038/onc.2008.104.
Lu J, Tomfohr JK, Kepler TB. Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics. 2005;6:165–178. doi: 10.1186/1471-2105-6-165.
Molenaar JJ, Ebus ME, Koster J, van Sluis P, van Noesel CJ, Versteeg R, Caron HN. Cyclin D1 and CDK4 activity contribute to the undifferentiated phenotype in neuroblastoma. Cancer Research. 2008;68(8):2599–2609. doi: 10.1158/0008-5472.CAN-07-5032.
Morris JS, Baggerly KA, Coombes KR. Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE. Biometrics. 2003;59:476–486. doi: 10.1111/1541-0420.00057.
Morris JS, Baggerly KA, Coombes KR. Shrinkage estimation for SAGE data using a mixture Dirichlet prior. In: Do KA, Müller P, Vannucci M, editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press; New York: 2006. pp. 254–268.
Romualdi C, Bortoluzzi S, Danieli GA. Detecting differentially expressed genes in multiple tag sampling experiments: comparative evaluation of statistical tests. Human Molecular Genetics. 2001;10(19):2133–2141. doi: 10.1093/hmg/10.19.2133.
Schmitt AO, Specht T, Beckmann G, Dahl E, Pilarsky CP, Hinzmann B, Rosenthal A. Exhaustive mining of EST libraries for genes differentially expressed in normal and tumor tissues. Nucleic Acids Research. 1999;27(21):4251–4260. doi: 10.1093/nar/27.21.4251.
Stekel DJ, Git Y, Falciani F. The comparison of gene expression from multiple cDNA libraries. Genome Research. 2000;10:2055–2061. doi: 10.1101/gr.gr-1325rr.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484–487. doi: 10.1126/science.270.5235.484.
