Author manuscript; available in PMC: 2011 Mar 1.
Published in final edited form as: J Biopharm Stat. 2010 Mar;20(2):267–280. doi: 10.1080/10543400903572746

A Semiparametric Bayesian Approach for Estimating the Gene Expression Distribution

Fei Zou 1, Hanwen Huang 1, Joseph G Ibrahim 1
PMCID: PMC3045782  NIHMSID: NIHMS274546  PMID: 20309758

Abstract

Gene expression microarrays are powerful tools for global comparison and estimation of gene expression. Many microarray studies have demonstrated biologically plausible results with only a few arrays, leading to a misperception that a handful of hybridized arrays can always find something meaningful. From a statistical point of view, it is important to prospectively estimate the required sample size before undertaking a microarray experiment. However, all sample size calculations need to directly or indirectly estimate the unknown distribution of the effect sizes of gene expression intensities. A parametric mixture model has been developed for relating the sample size directly to the false discovery rate (FDR), the most popular criterion for controlling multiple comparisons. In this paper, we extend the parametric mixture model and propose a robust semiparametric Dirichlet process mixture model in which no parametric distribution is specified for the gene expression effect sizes. Inference is performed in a Bayesian framework using Markov chain Monte Carlo. The usefulness of the method is illustrated by simulations and a real murine lung study.

1 Introduction

Gene expression microarrays are powerful tools for global comparison and estimation of gene expression. One of the main applications of microarray studies is to detect genes that are differentially expressed under two (or more) conditions. Early approaches used simple thresholds on the ratios of expression estimates under the two conditions (Chen et al 1997), while ordinary t-tests and Wilcoxon tests (Dudoit et al 2002; Troyanskaya et al 2002) were later used for more rigorous statistical analyses. More sophisticated ideas have been employed in regularized t-tests, for example adding a constant to the variance estimate in the denominator of the t-statistic (Tusher et al 2001; Efron et al 2001).

Many microarray studies have demonstrated biologically plausible results with only a few arrays (Yoon et al 2002), leading to a misperception that a handful of hybridized arrays can always find something meaningful. It is plausible that in some cases surprisingly few arrays may be sufficient, and the current dominance of microarray research in cancer has produced an optimistic view of microarray studies that may be unwarranted in other settings, such as examining expression differences among closely related rodent strains. From a statistical point of view, it is important to prospectively estimate required sample sizes prior to undertaking an experiment. Using ANOVA models of gene expression, Black and Doerge (2002) propose a parametric approach assuming lognormal or gamma distributions for gene expression intensities. Lee and Whitmore (2002) also describe sample size calculations for the ANOVA model under several different experimental designs. Pan et al (2002) discuss sample size calculations by applying a normal mixture model approach. All these methods relate power to sample size while controlling the Type I error. Hu et al (2005) relate the sample size directly to the false discovery rate (FDR), the criterion most often used to control for multiple comparisons.

All sample size calculations need to directly or indirectly estimate the unknown distribution of the effect sizes of the gene expression intensities. In the statistical literature, there are many proposed methods for estimating a distribution parametrically or nonparametrically. Hu et al (2005) employ three-component parametric mixture models to estimate the effect size distribution. Alternatively, the nonparametric methods of Robbins (1955), Laird (1978; 1981), Lindsay (1983), Wu et al (2006), Ruppert et al (2007) and Guan et al (2008) can be used for estimating the effect size distribution. However, these frequentist nonparametric methods assume a (known) fixed number of mixture components. Pan et al (2002) fit a series of normal mixture models with a varying number of mixture components and then use AIC or BIC to select the number of components corresponding to the first local minimum of AIC or BIC. In the Bayesian literature, the mixture of Dirichlet processes (MDP) model (Ferguson 1973; Blackwell and MacQueen 1973; Berry and Christensen 1979; Escobar and West 1995; Liu 1996; MacEachern 1994; MacEachern and Müller 1998; Kleinman and Ibrahim 1998; Neal 2000; Rasmussen 2000; Ishwaran and James 2002) has been widely used in nonparametric Bayesian modeling. The Dirichlet process (DP) provides a nonparametric prior specification for the parameters of a mixture model and allows the number of mixture components to grow as the size of the data grows. While classical parametric modeling techniques assume a fixed number of components, the DP approach makes no such assumption. Tang et al (2007) apply Dirichlet mixtures of beta densities to the distribution of p-values in estimating the positive FDR.

This paper extends the three-component parametric mixture models in Hu et al (2005) to the DP mixture model in estimating the effect size distribution of the gene expressions. Our method is described in the next section with details of the Markov chain Monte Carlo (MCMC) steps for fitting the model. Simulation studies and the analysis of real datasets are presented in Section 3. We conclude the paper with some remarks and discussion.

2 Methods

2.1 Notation and assumptions

Here we largely follow the notation in Hu et al (2005). Let $x_{1j,i}$ ($j = 1, \cdots, n_1$) and $x_{2j,i}$ ($j = 1, \cdots, n_2$) denote the expression levels for a single gene $i$ ($i = 1, \cdots, m$) under two conditions. For simplicity, we assume $n_1 = n_2 = n$. For a single gene, we assume that the gene expression estimates are normally distributed within each condition. Let $\mu_{1,i}$ and $\mu_{2,i}$ denote the true mean expressions for gene $i$ under conditions 1 and 2, and $\sigma_{1,i}^2$ and $\sigma_{2,i}^2$ the corresponding variances. For testing purposes, the hypotheses are $H_0: \mu_{1,i} = \mu_{2,i}$ versus $H_1: \mu_{1,i} \neq \mu_{2,i}$, and we use the statistic
\[ t_i = \frac{\sqrt{n}\,(\bar{x}_{1,i} - \bar{x}_{2,i})}{\sqrt{s_{1,i}^2 + s_{2,i}^2}}, \]
where $\bar{x}_{1,i} = \sum_{j=1}^{n} x_{1j,i}/n$, $\bar{x}_{2,i} = \sum_{j=1}^{n} x_{2j,i}/n$, $s_{1,i}^2 = \sum_{j=1}^{n} (x_{1j,i} - \bar{x}_{1,i})^2/(n-1)$ and $s_{2,i}^2 = \sum_{j=1}^{n} (x_{2j,i} - \bar{x}_{2,i})^2/(n-1)$. Then $t_i$ follows the non-central Student t-distribution with $2n - 2$ degrees of freedom and non-centrality parameter $\sqrt{n}\,\delta_i$. The effect size

\[ \delta_i = \frac{\mu_{1,i} - \mu_{2,i}}{\sqrt{\sigma_{1,i}^2 + \sigma_{2,i}^2}} \tag{1} \]

governs the power to detect departures from H0, and has a biological interpretation as a scaled expression difference. We can now restate the hypotheses as H0 : δi = 0 versus H1 : δi ≠ 0.
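As a quick numerical illustration (ours, not the paper's), the effect size in (1) and the statistic $t_i$ can be estimated from two expression samples by plugging in sample means and variances; the function names below are our own.

```python
import numpy as np

def effect_size(x1, x2):
    """Plug-in estimate of the scaled expression difference delta_i in Eq. (1),
    using sample means and variances in place of mu and sigma^2."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) + x2.var(ddof=1))

def t_statistic(x1, x2):
    """t_i = sqrt(n) * (xbar1 - xbar2) / sqrt(s1^2 + s2^2),
    assuming equal group sizes n1 = n2 = n."""
    n = len(x1)
    return np.sqrt(n) * effect_size(x1, x2)
```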

With thousands of genes assayed in a typical microarray experiment, we can treat the $\delta_i$'s as realizations of a random variable $\Delta$ with common CDF $F$. The distributions of the $t_i$'s and the derived p-values depend entirely on $F$. Because the sign of $\delta_i$ has biological importance, $F$ is typically decomposed as the following three-component mixture:

\[ F(\delta) = \pi_0 F_0(\delta) + \pi_1 F_1(\delta) + \pi_2 F_2(\delta). \tag{2} \]

Here $\pi_0$, $\pi_1$ and $\pi_2$ are the probabilities that $\Delta$ is zero, positive, or negative, respectively, with $\pi_0 + \pi_1 + \pi_2 = 1$. $F_0(\delta) = I_0(\delta)$ is the cumulative distribution function (CDF) of a point mass at 0, whereas $F_1$ and $F_2$ are the conditional CDFs of $\Delta$ given that it is positive or negative. Equation (2) encompasses a large number of practical situations, where $F_1$ and $F_2$ can be discrete or continuous.

Given δ, the conditional distribution of the p-value of a gene is

\[ P(P < p_0 \mid \delta) = 1 - G\!\left(t_{p_0/2};\, \sqrt{n}\,\delta\right) + G\!\left(-t_{p_0/2};\, \sqrt{n}\,\delta\right), \tag{3} \]

where $G(t;\, \sqrt{n}\,\delta)$ is the CDF of the non-central Student t-distribution with $2n - 2$ degrees of freedom and non-centrality parameter $\sqrt{n}\,\delta$, and $t_\alpha$ is the $1 - \alpha$ quantile of the central t-distribution with $2n - 2$ degrees of freedom. The marginal distribution of $P$ can be written as

\[ P(P < p_0) = \int P(P < p_0 \mid \delta)\, dF(\delta) = \pi_0 p_0 + \pi_1 \int_{\delta > 0} P(P < p_0 \mid \delta)\, dF_1(\delta) + \pi_2 \int_{\delta < 0} P(P < p_0 \mid \delta)\, dF_2(\delta). \tag{4} \]

Thus, the relationship between the FDR and p0 is

\[ \mathrm{FDR}(p_0) = \frac{P(P < p_0,\, \delta = 0)}{P(P < p_0)} = \frac{\pi_0 p_0}{P(P < p_0)}, \tag{5} \]

which depends on $n$ through (3). The relationship (5) also depends on the underlying distribution of $\delta$. For known $\pi_0$, $\pi_1$, $\pi_2$, $F_1$ and $F_2$, we can use (5) to estimate the sample size $n$ required to achieve a desired FDR for a given p-value threshold $p_0$. The sample size calculation therefore reduces to estimating $\pi_0$, $\pi_1$, $\pi_2$, $F_1$ and $F_2$. For this purpose, Hu et al (2005) propose a parametric method covering three situations: (1) the point-mass model, where $F_1$ (respectively $F_2$) has a point mass at a positive value $a_1$ (respectively a negative value $a_2$); (2) the exponential mixture model, where $F_1$ and $F_2$ follow positive and negative exponential distributions; and (3) the normal mixture model, where $F_1$ and $F_2$ are normal distributions truncated at 0. For real datasets, however, these parametric assumptions are likely to be violated. Below, we relax them and estimate $F$ nonparametrically via the MDP model.
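Equations (3)-(5) can be evaluated numerically once the mixture is specified. The sketch below (our own code, not the paper's) does so for the point-mass special case, where $F_1$ and $F_2$ put all their mass at $a_1 > 0$ and $a_2 < 0$; it assumes SciPy's non-central t CDF.

```python
import numpy as np
from scipy import stats

def fdr_at_threshold(p0, n, pi0, pi1, pi2, a1, a2):
    """FDR(p0) from Eqs. (3)-(5) for the point-mass case, where F1 and F2
    put all mass at a1 > 0 and a2 < 0. Illustrative sketch; the parameter
    names are ours."""
    d = 2 * n - 2
    t_crit = stats.t.ppf(1 - p0 / 2, d)       # t_{p0/2}: the 1 - p0/2 quantile

    def reject_prob(delta):                   # Eq. (3): P(P < p0 | delta)
        nc = np.sqrt(n) * delta
        return 1 - stats.nct.cdf(t_crit, d, nc) + stats.nct.cdf(-t_crit, d, nc)

    p_reject = pi0 * p0 + pi1 * reject_prob(a1) + pi2 * reject_prob(a2)  # Eq. (4)
    return pi0 * p0 / p_reject                                           # Eq. (5)
```

In practice one would invert this relationship over $n$ to find the smallest sample size achieving a target FDR at a given $p_0$.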

2.2 Mixture Dirichlet Process Model

In this section, we propose the MDP model for δ, derive the full conditional distributions, and demonstrate how to implement the Gibbs sampling. Parametric Bayesian modeling typically assumes that δ follows a prior parametric distribution with known hyperparameters Ψ0. That is,

\[ t_i \mid \delta_i \sim t(\delta_i, d), \qquad \delta_i \mid \Psi_0 \sim h(\delta_i \mid \Psi_0), \tag{6} \]

where $t(\delta, d)$ is the non-central Student t-distribution with $d$ degrees of freedom and non-centrality parameter $\delta$, and $h(\delta \mid \Psi_0)$ is a known distribution. The MDP model removes the distributional assumption on the $\delta_i$'s and assumes that they follow an unspecified distribution $G$, which itself follows a Dirichlet process prior:

\[ t_i \mid \delta_i \sim t(\delta_i, d), \qquad \delta_i \sim G, \qquad G \mid M \sim DP(M \cdot G_0). \tag{7} \]

The parameters of the Dirichlet process $DP(M \cdot G_0)$ are the base measure $G_0$ and a positive scalar $M$. In practice, the base measure $G_0$ is set to a parametric distribution that (hopefully) approximates the true nonparametric shape of $G$. The scalar $M$ reflects our prior belief about how similar the nonparametric distribution $G$ is to the base measure $G_0$. As $M \to \infty$, $G \to G_0$, and the MDP model reduces to the fully parametric case with the base measure as the prior distribution for $\delta$.
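The clustering behavior induced by $DP(M \cdot G_0)$ can be seen through its Pólya-urn (Chinese restaurant) representation: each new item joins an existing cluster with probability proportional to its size, or starts a new cluster with probability proportional to $M$. The short sketch below (ours, for intuition only) draws one such partition and illustrates that the number of clusters grows with $M$.

```python
import random

def crp_partition(m, M, seed=0):
    """Draw the partition of m items induced by DP(M * G0) via the
    Chinese restaurant process; returns the list of cluster sizes."""
    rng = random.Random(seed)
    sizes = []
    for i in range(m):
        # Item i joins cluster k w.p. size_k / (i + M), or opens a new
        # cluster w.p. M / (i + M).
        r = rng.uniform(0, i + M)
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s
            if r < acc:
                sizes[k] += 1
                break
        else:
            sizes.append(1)
    return sizes
```

With $m = 1000$ items, a small $M$ (say 0.1) typically yields only a handful of clusters, while $M = 10$ yields dozens, matching the role of $M$ described above.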

Note that the t-distribution can be obtained as a scale mixture of normal distributions as follows:

\[ t_i \mid \delta_i, \tau_i \sim N(\delta_i, \tau_i^{-1}), \qquad \tau_i \sim \mathrm{Gamma}\!\left(\frac{d}{2}, \frac{d}{2}\right), \tag{8} \]

where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$, and $\mathrm{Gamma}(a, b)$ denotes the Gamma distribution with shape parameter $a$ and rate parameter $b$. The marginal distribution of $t_i$ given $\delta_i$ is then $t(\delta_i, d)$. Thus the MDP model (7) can be re-expressed as

\[ t_i \mid \delta_i, \tau_i \sim N(\delta_i, \tau_i^{-1}), \quad \tau_i \sim \mathrm{Gamma}\!\left(\frac{d}{2}, \frac{d}{2}\right), \quad \delta_i \sim G, \quad G \sim DP(M \cdot G_0). \tag{9} \]

This model, as will be shown below, allows direct Gibbs sampling with closed-form full conditional distributions.
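The representation in (8) is easy to check by simulation. The sketch below (our own) draws $\tau$ from $\mathrm{Gamma}(d/2,\, \text{rate } d/2)$ and then $t \mid \delta, \tau$ from $N(\delta, \tau^{-1})$, so that, integrating out $\tau$, $t$ is the Student t variate with $d$ degrees of freedom centered at $\delta$ that the model writes as $t(\delta, d)$.

```python
import numpy as np

def t_via_scale_mixture(delta, d, size, seed=0):
    """Sample t_i via the scale mixture of Eq. (8):
    tau ~ Gamma(shape=d/2, rate=d/2), then t | delta, tau ~ N(delta, 1/tau)."""
    rng = np.random.default_rng(seed)
    # NumPy's gamma is parameterized by scale = 1/rate, hence scale = 2/d.
    tau = rng.gamma(shape=d / 2, scale=2 / d, size=size)
    return rng.normal(loc=delta, scale=1 / np.sqrt(tau))
```

For $d = 10$ the draws have mean $\approx \delta$ and variance $\approx d/(d-2) = 1.25$, as expected for a t-distribution with 10 degrees of freedom.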

We set the base measure $G_0$ to the three-component normal mixture

\[ G_0(\delta) = p_0 I_0(\delta) + p_1 N_+(\mu_{01}, \tau_{01}^{-1}) + p_2 N_-(\mu_{02}, \tau_{02}^{-1}), \tag{10} \]

where the indicator function $I_x(y) = 1$ if $y = x$ and 0 otherwise; $N_+$ and $N_-$ are positively and negatively truncated normal distributions, respectively; and $p_0, p_1, p_2, \mu_{01}, \mu_{02}, \tau_{01}, \tau_{02}$ are unspecified hyperparameters with hyperpriors

\[ (p_0, p_1, p_2) \sim \mathrm{Dirichlet}(\tilde{p}_0, \tilde{p}_1, \tilde{p}_2), \tag{11} \]
\[ \mu_{01} \sim N(\tilde{\mu}_{01}, \tilde{\tau}_{01}^{-1}), \tag{12} \]
\[ \mu_{02} \sim N(\tilde{\mu}_{02}, \tilde{\tau}_{02}^{-1}), \tag{13} \]
\[ \tau_{01} \sim \mathrm{Gamma}\!\left(\frac{\tilde{\alpha}_{10}}{2}, \frac{\tilde{\beta}_{10}}{2}\right), \tag{14} \]
\[ \tau_{02} \sim \mathrm{Gamma}\!\left(\frac{\tilde{\alpha}_{20}}{2}, \frac{\tilde{\beta}_{20}}{2}\right), \tag{15} \]

respectively. Here $\mathrm{Dirichlet}(\tilde{p}_0, \tilde{p}_1, \tilde{p}_2)$ denotes the Dirichlet distribution of order 3 with parameters $\tilde{p}_0, \tilde{p}_1, \tilde{p}_2 > 0$. Last, we set the prior of $M$ to $\mathrm{Gamma}(M_a, M_b)$.

The MDP model (9) allows us to obtain the full conditional for each unknown parameter via Markov chain Monte Carlo, which we describe in detail below. Let $\theta$ be the vector of all unknown parameters, let $\theta_{-v}$ be the vector $\theta$ with element $v$ removed, and let $t = (t_1, \cdots, t_m)'$.

Step 0: Assign starting values to $\tau_i$, $\delta_i$, $i = 1, \cdots, m$, and $p = (p_0, p_1, p_2)$, $\mu_{01}$, $\mu_{02}$, $\tau_{01}$, $\tau_{02}$.

Step 1: Sample $\tau_i$, $i = 1, \cdots, m$, from the following conditional distribution:

\[ \tau_i \mid \theta_{-\tau_i}, t \sim \mathrm{Gamma}\!\left(\frac{d+1}{2}, \frac{d + (t_i - \delta_i)^2}{2}\right). \tag{16} \]

Step 2: Sample $\delta_i$, $i = 1, \cdots, m$. As derived in Escobar (1994), the full conditional of $\delta_i$ is proportional to

\[ \sum_{k \neq i} q_{ik}\, I_{\delta_k}(\delta_i) + M\, G_0(\delta_i)\, N(t_i \mid \delta_i, \tau_i^{-1}), \tag{17} \]

where $\delta_{-i} = (\delta_1, \cdots, \delta_{i-1}, \delta_{i+1}, \cdots, \delta_m)$ and $N(t_i \mid \delta, \tau_i^{-1})$ denotes the density of $N(\delta, \tau_i^{-1})$ evaluated at $t_i$. Let $q_{ik} = N(t_i \mid \delta_k, \tau_i^{-1})$ for $k = 1, \cdots, i-1, i+1, \cdots, m$, and let $q_{i0} = \int N(t_i \mid \delta, \tau_i^{-1})\, G_0(\delta)\, d\delta$, which equals

\[ q_{i0} = \int N(t_i \mid \delta, \tau_i^{-1})\, G_0(\delta)\, d\delta = p_0 N(t_i \mid 0, \tau_i^{-1}) + p_1 \int_0^{\infty} N(t_i \mid \delta, \tau_i^{-1})\, N_+(\delta \mid \mu_{01}, \tau_{01}^{-1})\, d\delta + p_2 \int_{-\infty}^{0} N(t_i \mid \delta, \tau_i^{-1})\, N_-(\delta \mid \mu_{02}, \tau_{02}^{-1})\, d\delta. \tag{18} \]

The $q_{ik}$'s and $M q_{i0}$ must be normalized to yield selection probabilities. Let $C^{-1} = \sum_{k \neq i} q_{ik} + M q_{i0}$. With probability $C q_{ik}$, $\delta_i$ is set to $\delta_k$, and with probability $C M q_{i0}$, $\delta_i$ is drawn from the distribution

\[ a\, I_0(\delta_i) + b\, N_+\!\left(\delta_i \,\middle|\, \frac{\tau_i t_i + \tau_{01}\mu_{01}}{\tau_i + \tau_{01}},\, \frac{1}{\tau_i + \tau_{01}}\right) + c\, N_-\!\left(\delta_i \,\middle|\, \frac{\tau_i t_i + \tau_{02}\mu_{02}}{\tau_i + \tau_{02}},\, \frac{1}{\tau_i + \tau_{02}}\right), \]

where

\[ a = s p_0, \qquad b = s p_1 \left(\frac{\tau_{01}}{\tau_i + \tau_{01}}\right)^{1/2} \frac{\Phi(\mu_1)}{\Phi(\tau_{01}^{1/2}\mu_{01})} \exp\!\left[-\frac{1}{2}\left(\tau_{01}\mu_{01}^2 - \mu_1^2\right)\right], \tag{19} \]
\[ c = s p_2 \left(\frac{\tau_{02}}{\tau_i + \tau_{02}}\right)^{1/2} \frac{\Phi(-\mu_2)}{\Phi(-\tau_{02}^{1/2}\mu_{02})} \exp\!\left[-\frac{1}{2}\left(\tau_{02}\mu_{02}^2 - \mu_2^2\right)\right], \tag{20} \]

with $s$ the normalizing constant such that $a + b + c = 1$, $\Phi$ the standard normal CDF, and

\[ \mu_1 = \frac{\tau_i t_i + \tau_{01}\mu_{01}}{(\tau_i + \tau_{01})^{1/2}}, \qquad \mu_2 = \frac{\tau_i t_i + \tau_{02}\mu_{02}}{(\tau_i + \tau_{02})^{1/2}}. \tag{21} \]

We therefore sample $\delta_i^{(j+1)}$, $i = 1, \cdots, m$, according to (17). After sampling the $\delta_i$'s for all genes, we group them into clusters such that genes within each cluster share the same $\delta$ value. Let $L$ ($0 < L \leq m$) be the number of unique $\delta$ values, and denote these unique values by $z_l$, $l = 1, \cdots, L$. Note that knowing the $\delta_i$'s is equivalent to knowing the $z_l$'s and the cluster to which each gene belongs.
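For intuition, the following minimal sketch implements one Step 2 update, but with a simplified base measure $G_0 = N(0, \tau_0^{-1})$ in place of the three-component mixture (10), so that $q_{i0}$ and the "new value" draw reduce to single closed forms. All function and parameter names are ours, and this is not the paper's code.

```python
import numpy as np

def sample_delta_i(i, t, delta, tau, M, tau0=1.0, seed=0):
    """One Polya-urn update for delta_i (Step 2), simplified to a plain
    N(0, 1/tau0) base measure. Here q_ik = N(t_i | delta_k, 1/tau_i) and
    q_i0 = N(t_i | 0, 1/tau_i + 1/tau0) in closed form."""
    rng = np.random.default_rng(seed)
    ti, taui = t[i], tau[i]
    others = [k for k in range(len(delta)) if k != i]

    def norm_pdf(x, mean, var):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    q = np.array([norm_pdf(ti, delta[k], 1 / taui) for k in others])
    q0 = M * norm_pdf(ti, 0.0, 1 / taui + 1 / tau0)
    w = np.append(q, q0)
    w /= w.sum()                        # C normalizes sum_k q_ik + M q_i0
    j = rng.choice(len(w), p=w)
    if j < len(others):                 # join an existing cluster
        return delta[others[j]]
    post_prec = taui + tau0             # else draw from the posterior under G0
    return rng.normal(taui * ti / post_prec, 1 / np.sqrt(post_prec))
```

With $M = 0$ the update can only reuse an existing $\delta_k$; as $M$ grows, new values drawn from the base-measure posterior become more likely.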

Bush and MacEachern (1996) recommend one additional step to improve convergence of the Gibbs sampler. To speed mixing over the entire parameter space, they suggest resampling the $z_l$'s from the conditional density of $z_l$ given the cluster assignments:

\[ p(z_l \mid \theta, t) \propto \prod_{i \in \text{cluster } l} p(t_i \mid z_l, \tau_i)\, G_0(z_l), \]

which implies that

\[ p(z_l \mid \theta, t) = \tilde{a}\, I_0(z_l) + \tilde{b}\, N_+\!\left(z_l \,\middle|\, \frac{\sum_{i \in l}\tau_i t_i + \tau_{01}\mu_{01}}{\sum_{i \in l}\tau_i + \tau_{01}},\, \frac{1}{\sum_{i \in l}\tau_i + \tau_{01}}\right) + \tilde{c}\, N_-\!\left(z_l \,\middle|\, \frac{\sum_{i \in l}\tau_i t_i + \tau_{02}\mu_{02}}{\sum_{i \in l}\tau_i + \tau_{02}},\, \frac{1}{\sum_{i \in l}\tau_i + \tau_{02}}\right), \tag{22} \]

where

\[ \tilde{a} = \tilde{s} p_0, \qquad \tilde{b} = \tilde{s} p_1 \left(\frac{\tau_{01}}{\sum_{i \in l}\tau_i + \tau_{01}}\right)^{1/2} \frac{\Phi(\tilde{\mu}_1)}{\Phi(\tau_{01}^{1/2}\mu_{01})} \exp\!\left[-\frac{1}{2}\left(\tau_{01}\mu_{01}^2 - \tilde{\mu}_1^2\right)\right], \tag{23} \]
\[ \tilde{c} = \tilde{s} p_2 \left(\frac{\tau_{02}}{\sum_{i \in l}\tau_i + \tau_{02}}\right)^{1/2} \frac{\Phi(-\tilde{\mu}_2)}{\Phi(-\tau_{02}^{1/2}\mu_{02})} \exp\!\left[-\frac{1}{2}\left(\tau_{02}\mu_{02}^2 - \tilde{\mu}_2^2\right)\right], \tag{24} \]

with $\tilde{s}$ the normalizing constant ensuring $\tilde{a} + \tilde{b} + \tilde{c} = 1$, and

\[ \tilde{\mu}_1 = \frac{\sum_{i \in l}\tau_i t_i + \tau_{01}\mu_{01}}{\left(\sum_{i \in l}\tau_i + \tau_{01}\right)^{1/2}}, \qquad \tilde{\mu}_2 = \frac{\sum_{i \in l}\tau_i t_i + \tau_{02}\mu_{02}}{\left(\sum_{i \in l}\tau_i + \tau_{02}\right)^{1/2}}. \tag{25} \]

Step 3: Sample (p0, p1, p2): The full conditional of p = (p0, p1, p2) is

\[ (p_0, p_1, p_2) \mid \theta_{-(p_0, p_1, p_2)}, t \sim \mathrm{Dirichlet}(\tilde{p}_0 + \tilde{n}_0,\, \tilde{p}_1 + \tilde{n}_1,\, \tilde{p}_2 + \tilde{n}_2), \tag{26} \]

where $\tilde{n}_0$, $\tilde{n}_1$ and $\tilde{n}_2$ are the numbers of $z_l$'s that are equal to, greater than, and less than 0, respectively, so that $\tilde{n}_0 + \tilde{n}_1 + \tilde{n}_2 = L$.
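Step 3 is a standard Dirichlet-multinomial update; a sketch (our own names, with prior counts all set to 1 by default) given the current unique cluster values $z_l$:

```python
import numpy as np

def sample_p(z, p_tilde=(1.0, 1.0, 1.0), seed=0):
    """Step 3: draw (p0, p1, p2) from the Dirichlet full conditional (26),
    given the unique cluster values z_l; p_tilde are the prior counts."""
    z = np.asarray(z, float)
    counts = np.array([(z == 0).sum(), (z > 0).sum(), (z < 0).sum()])
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.asarray(p_tilde) + counts)
```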

Step 4: Sample the hyperparameters μ01, μ02, τ01, τ02 from their full conditionals:

\[ \mu_{01} \mid \theta_{-\mu_{01}}, t \sim N\!\left(\frac{\tau_{01}\sum_{l=1}^{L} I(z_l > 0)\, z_l + \tilde{\tau}_{01}\tilde{\mu}_{01}}{\tilde{n}_1\tau_{01} + \tilde{\tau}_{01}},\, \frac{1}{\tilde{n}_1\tau_{01} + \tilde{\tau}_{01}}\right), \tag{27} \]
\[ \mu_{02} \mid \theta_{-\mu_{02}}, t \sim N\!\left(\frac{\tau_{02}\sum_{l=1}^{L} I(z_l < 0)\, z_l + \tilde{\tau}_{02}\tilde{\mu}_{02}}{\tilde{n}_2\tau_{02} + \tilde{\tau}_{02}},\, \frac{1}{\tilde{n}_2\tau_{02} + \tilde{\tau}_{02}}\right), \tag{28} \]
\[ \tau_{01} \mid \theta_{-\tau_{01}}, t \sim \mathrm{Gamma}\!\left(\frac{\tilde{n}_1 + \tilde{\alpha}_{10}}{2},\, \frac{\sum_{l=1}^{L} I(z_l > 0)(z_l - \mu_{01})^2 + \tilde{\beta}_{10}}{2}\right), \tag{29} \]
\[ \tau_{02} \mid \theta_{-\tau_{02}}, t \sim \mathrm{Gamma}\!\left(\frac{\tilde{n}_2 + \tilde{\alpha}_{20}}{2},\, \frac{\sum_{l=1}^{L} I(z_l < 0)(z_l - \mu_{02})^2 + \tilde{\beta}_{20}}{2}\right), \tag{30} \]

respectively.

Step 5: Sample $M$ in two steps. First, sample a latent variable $\eta$ from

\[ \eta \sim \mathrm{Beta}(M + 1, m), \tag{31} \]

then we sample M from

\[ M \mid \eta \sim \pi_\eta\, \mathrm{Gamma}(M_a + L,\, M_b - \log\eta) + (1 - \pi_\eta)\, \mathrm{Gamma}(M_a + L - 1,\, M_b - \log\eta), \tag{32} \]

where

\[ \frac{\pi_\eta}{1 - \pi_\eta} = \frac{M_a + L - 1}{m\,(M_b - \log\eta)}. \tag{33} \]
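Step 5 is the auxiliary-variable update of Escobar and West (1995); a compact sketch (our own names, with $m$ the number of genes and $L$ the current number of clusters):

```python
import numpy as np

def sample_M(M, L, m, Ma=1.0, Mb=1.0, seed=0):
    """Step 5: draw eta ~ Beta(M+1, m), then M from the two-component
    Gamma mixture of Eqs. (32)-(33) (Escobar-West auxiliary scheme)."""
    rng = np.random.default_rng(seed)
    eta = rng.beta(M + 1, m)
    rate = Mb - np.log(eta)                       # rate of both Gamma components
    odds = (Ma + L - 1) / (m * rate)              # Eq. (33): pi_eta / (1 - pi_eta)
    pi_eta = odds / (1 + odds)
    shape = Ma + L if rng.random() < pi_eta else Ma + L - 1
    return rng.gamma(shape, 1 / rate)             # NumPy gamma takes scale = 1/rate
```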

Steps 1-5 complete one cycle of the Gibbs sampler. From the Gibbs samples, one can easily obtain Bayesian estimates of the parameters of primary interest, such as the FDR shown in Equation (5).

3 Numerical Studies

In this section, we test the performance of our method using both simulated and real datasets. Following Hu et al (2005), we first simulated two sets of data with exponential mixture and normal mixture distributions for $F$. The values of $\pi_0$ range from 0.3 to 0.9, covering realistic biological scenarios. In the exponential setup, we chose $a_1 = a_2 = 1/\lambda = 0.5, 1, 2.5, 5$, representing distributions $F$ ranging from comparatively short to heavy tails. In the normal setup, we chose $\mu_1 = \mu_2 = 0$ and $\sigma_1 = \sigma_2 = 0.5, 1, 2.5, 5$, again ranging from comparatively short to heavy tails. We set the hyperparameters $\tilde{p}_0 = \tilde{p}_1 = \tilde{p}_2 = 1$, $\tilde{\mu}_{01} = \tilde{\mu}_{02} = 0$, $\tilde{\tau}_{01} = \tilde{\tau}_{02} = 1$, $\tilde{\alpha}_{10} = \tilde{\alpha}_{20} = \tilde{\beta}_{10} = \tilde{\beta}_{20} = 1$, and $M_a = M_b = 1$.

The chain consisted of 23,000 Gibbs iterations. The first 3000 samples (the burn-in) were discarded, and every 20th of the remaining samples was saved to reduce serial correlation, yielding 1000 posterior samples for the subsequent Bayesian analysis. Table 1 shows the posterior mean and the 90% credible interval for one of the primary parameters, $\pi_0$; the upper part is for the exponential mixture settings and the lower part for the normal mixture settings. Figure 1 shows one example of the posterior distribution for $\pi_0 = 0.7$. For all the $\pi_0$ values simulated, the farther the two non-zero components of $F$ are from zero, the better our method performs. This is intuitive: when the two non-zero components of $F$ are close to 0, the resulting test statistics $t_i$ are centered around 0, making it harder to distinguish the two non-zero components from 0. This is further illustrated by Figure 2, which provides Q-Q plots of the estimated $F$ distributions against the simulated ones. Comparing the exponential mixture results (upper panel) to the normal mixture ones (lower panel), we find that our approach fits the exponential mixture model better. This is consistent with the above observation, since the two non-zero exponential distributions have longer tails than their normal counterparts when $a = \sigma$.
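The burn-in and thinning scheme just described can be written as a one-line slice; a sketch with our own function name:

```python
def thin_chain(samples, burn_in=3000, thin=20):
    """Discard the burn-in and keep every `thin`-th draw, as in the
    simulation study (23,000 draws -> 1000 posterior samples)."""
    return samples[burn_in::thin]
```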

Table 1.

True values of π0 and their estimates obtained from the simulated data

Exponential Mixture
π0 a=5 a=2.5 a=1 a=0.5
0.3 0.31(0.1,0.43) 0.26(0.002,0.48) 0.49(0.06,0.72) 0.42(0,0.88)
0.5 0.52(0.29,0.64) 0.55(0.25,0.72) 0.49(0,0.81) 0.60(0.06,0.92)
0.7 0.69(0.54,0.75) 0.68(0.46,0.78) 0.74(0.11,0.87) 0.75(0.05,0.92)
0.9 0.92(0.82,0.96) 0.91(0.74,0.97) 0.89(0.12,0.99) 0.97(0.93,1)
Normal Mixture
π0 σ=5 σ=2.5 σ=1 σ=0.5
0.3 0.28(0.001,0.45) 0.43(0.004,0.61) 0.50(0,0.81) 0.86(0.18,1)
0.5 0.42(0.02,0.57) 0.45(0.009,0.67) 0.32(0,0.72) 0.92(0.46,1)
0.7 0.66(0.38,0.75) 0.68(0.45,0.80) 0.64(0.01,0.87) 0.59(0,0.94)
0.9 0.89(0.66,0.97) 0.90(0.66,0.98) 0.97(0.81,1) 0.97(0.88,1)

The fifth and ninety-fifth percentiles of the posterior distribution are given in parentheses.

Figure 1.

Figure 1

Histograms for the posterior distributions of π0 for the four simulation studies for the mixture exponential model (upper) and mixture normal distribution (lower). π0 = 0.7.

Figure 2.

Figure 2

Q-Q plots for the underlying F distribution in a simulation study (n=50, m=10000). The lower panel is for the normal mixture model, the upper panel is for the exponential mixture model, and π0 = 0.7.

We also reconstruct the distributions of the test statistics from the estimated underlying distributions of $\delta$ and compare them with the true distributions of the test statistics (Figure 3). The reconstructed distributions match their corresponding true distributions well, regardless of how heavy the tail of $\delta$ is. Further, our method outperforms the parametric estimates of Hu et al (2005) unless the parametric model assumptions hold exactly.

Figure 3.

Figure 3

Q-Q plots for the observed t-statistics in a simulation study (n=50, m=10000). The lower panel is for the normal mixture model, the upper panel is for the exponential mixture model, and π0 = 0.7.

Next, we demonstrate the effectiveness of our method on a real dataset. We consider a murine lung microarray dataset submitted to the Gene Expression Omnibus (E. Hoffman, Children's National Medical Center, GDS251). The study used the Affymetrix U74Av2 array (12488 probesets) to compare samples from two mouse strains, C57BL6/J (n=12) and Balb/c (n=12), which differ in their sensitivity to pulmonary fibrosis. These data were analyzed by Hu et al (2005), and we compare our results to theirs.

The 95% credible intervals of the $\pi_i$'s are (0.27, 0.47), (0.33, 0.55) and (0.09, 0.29) for $\pi_0$, $\pi_1$ and $\pi_2$, respectively. These results differ somewhat from the estimates in Hu et al (2005). The difference between $\pi_1$ and $\pi_2$ indicates that the distribution of $\delta$ is not symmetric between the positive and negative components. The fitted distribution of $\delta$ is shown in the right panel of Figure 4, and the Q-Q plot of the 12488 observed t-statistics versus the estimated ones is given in the left panel. Compared to the corresponding Q-Q plot (Figure 3) in Hu et al (2005), our nonparametric model clearly provides a better fit.

Figure 4.

Figure 4

Real murine lung study: the left panel is the Q-Q plot of the observed t-statistics vs the estimated ones; the right panel displays the histogram of the posterior distribution of F .

4 Discussion

We have described a semiparametric Bayesian approach for estimating the underlying gene expression distribution $F$. Our method specifies a nonparametric prior for the distribution of the scaled gene expression difference via a mixture of Dirichlet processes, and the resulting model is fitted with the Gibbs sampler. Compared to the parametric mixture model, our nonparametric Bayesian method is more flexible and robust in modeling the gene expression distribution. Once the distribution of the effect size is estimated, the method can be applied directly to the FDR-based sample size calculations of Hu et al (2005). The biases in estimating an $F$ with short tails are not unique to our method; all parametric and nonparametric methods share similar (identifiability) problems, and in such situations increasing the sample size may be necessary to improve estimation accuracy. In this paper, only a two-sample design is considered. A natural extension is to other experimental designs: more general multi-sample designs or general linear regression models require more careful definition of $\delta$, but the essence of the approach remains the same.

Acknowledgments

This research is supported in part by NIH grant R01 GM074175.

References

  • [1]. Berry DA, Christensen R. Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet processes. The Annals of Statistics. 1979;7:558–568.
  • [2]. Black MA, Doerge RW. Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics. 2002;18:1609–1616. doi:10.1093/bioinformatics/18.12.1609
  • [3]. Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. The Annals of Statistics. 1973;1:353–355.
  • [4]. Bush CA, MacEachern SN. A semiparametric Bayesian model for randomized block designs. Biometrika. 1996;83:275–285.
  • [5]. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics. 1997;2:364–367. doi:10.1117/12.281504
  • [6]. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002;12:111–139.
  • [7]. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
  • [8]. Escobar MD. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association. 1994;89:268–277.
  • [9]. Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90:577–588.
  • [10]. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
  • [11]. Guan Z, Wu B, Zhao H. Application of Bernstein polynomial in the estimation of false-discovery-rate. Statistica Sinica. 2008;18:905–923.
  • [12]. Hu J, Wright FA, Zou F. Practical FDR-based sample size calculations in microarray experiments. Bioinformatics. 2005;21:3264–3272. doi:10.1093/bioinformatics/bti519
  • [13]. Ishwaran H, James LF. Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics. 2002;11:508–532.
  • [14]. Kleinman KP, Ibrahim JG. A semiparametric Bayesian approach to the random effects model. Biometrics. 1998;54:921–938.
  • [15]. Lee MT, Whitmore GA. Power and sample size for DNA microarray studies. Statistics in Medicine. 2002;21:3543–3570. doi:10.1002/sim.1335
  • [16]. Laird NM. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association. 1978;73:805–811.
  • [17]. Laird NM. Empirical Bayes estimates using the nonparametric maximum likelihood estimate for the prior. Journal of Statistical Computation and Simulation. 1981;15:211–220.
  • [18]. Lindsay BG. The geometry of mixture likelihoods: a general theory. The Annals of Statistics. 1983;11:86–94.
  • [19]. Liu JS. Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics. 1996;24:911–930.
  • [20]. MacEachern SN. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation. 1994;23:727–741.
  • [21]. MacEachern SN, Müller P. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998;7:223–238.
  • [22]. Neal RM. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics. 2000;9:249–265.
  • [23]. Pan W, Lin J, Le C. How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology. 2002;3:research0022.1–0022.10. doi:10.1186/gb-2002-3-5-research0022
  • [24]. Rasmussen CE. The infinite Gaussian mixture model. In: Solla SA, Leen TK, Müller K-R, editors. Advances in Neural Information Processing Systems 12. MIT Press; 2000. pp. 554–560.
  • [25]. Robbins H. An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; 1955. pp. 157–163.
  • [26]. Ruppert D, Nettleton D, Hwang JTG. Exploring the information in p-values for the analysis and planning of multiple-test experiments. Biometrics. 2007;63:483–495. doi:10.1111/j.1541-0420.2006.00704.x
  • [27]. Tang Y, Ghosal S, Roy A. Nonparametric Bayesian estimation of positive false discovery rates. Biometrics. 2007;63:1126–1134. doi:10.1111/j.1541-0420.2007.00819.x
  • [28]. Troyanskaya OG, Garber ME, Brown P, Botstein D, Altman RB. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics. 2002;18:1454–1461. doi:10.1093/bioinformatics/18.11.1454
  • [29]. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences USA. 2001;98:5116–5121. doi:10.1073/pnas.091062498
  • [30]. Wu B, Guan Z, Zhao H. Parametric and nonparametric FDR estimation revisited. Biometrics. 2006;62:735–744. doi:10.1111/j.1541-0420.2006.00531.x
  • [31]. Yoon H, Liyanarachchi S, Wright FA, Davuluri R, Lockman JC, de la Chapelle A, Pellegata NS. Gene expression profiling of isogenic cells with different tp53 gene dosage reveals numerous genes that are affected by tp53 dosage and identifies cspg2 as a direct target of p53. Proceedings of the National Academy of Sciences USA. 2002;99:15632–15637. doi:10.1073/pnas.242597299
