Cho and Lee (2004) proposed a Bayesian hierarchical error model (HEM) to account for heterogeneous error variability in oligo-nucleotide microarray experiments. They estimated the parameters of their model using Markov Chain Monte Carlo (MCMC) and proposed an F-like summary statistic to identify differentially expressed genes under multiple conditions. Their HEM is one of the emerging Bayesian hierarchical modeling tools that have been developed for the analysis of multiple-level data structures and variation in microarray gene expression data (Broet et al., 2002; Tadesse and Ibrahim, 2004; Cho and Lee, 2004). In this letter, we first discuss the significance of the HEM developed by Cho and Lee. Then, we re-derive the fully conditional distributions for gene and conditional effects, since we think that these two fully conditional distributions were not presented properly in their paper. Finally, we expand the HEM to deal with biological or/and experimental correlations in gene expression data. A FORTRAN 90 program was developed to implement our extended method and it is available from the authors upon request.
Significance of HEM
Many aspects of the HEM presented in the article of Cho and Lee (2004) are significant. The following are the two aspects that we believe are important in the analysis of microarray gene expression data. First, error variability of gene expression data is decomposed into two components, biological and experimental. The former reflects variation owing to biological sample heterogeneity that affects gene expression, while the latter largely comes from variation in sample preparation and sample analysis. Estimating these two sources of errors separately will not only improve estimation of model parameters but also facilitate a better understanding of patterns of over all gene expression.
Second, and more importantly, the use of a Bayesian HEM is an effective way to represent complex relationships that have many parameters, because it allows the parameters to be modeled separately at several levels and provides a link (S) between the data (D) and the unobserved parameters (θ) of interest. In multilevel modeling, the top layer is the data, p(D|S, θ), which is modeled by an appropriate likelihood function. The data are assumed to be generated by some process that depends on certain unknown parameters. Modeling the link or process is the middle layer, where the art of statistical modeling takes place. On the bottom layer are the prior distributions that represent a priori beliefs about these parameters. Following the Bayesian inference, unknown parameters are inferred from the joint distribution of the parameters given the data
| (1) |
Thus, Bayesian hierarchical modeling based on multivariate normal distributions effectively handles complex relationships that are governed by many parameters, and provides a tool for evaluating correlated gene expression data. For example, the model of Cho and Lee (2004) can be extended to allow for biological or/and experimental correlations. These will be discussed in more detail in the third section of this letter. Statistically, the estimation of these variables separately is biased when they are correlated (Johnston, 1972; Gianola and Sorensen, 2004). From a practical standpoint, ignoring correlations in gene expression data would result in high variability of statistical estimators and cause the reproducibility to deteriorate (Qiu et al., 2005).
Fully conditional distributions of gene (gi) and condition (cj) effects
We claim that Cho and Lee (2004) made a few mistakes when they derived the fully conditional distributions for the gene (gi) and condition (cj) effects. Let y be a vector containing all the data (yijkls), x a vector containing all the link variables (xijks) that correspond to expression intensities free from experimental errors, g a vector containing all the gene effects (gis), c a vector containing all the condition effects (cjs), r a vector containing all the interaction effects (rijs) of genes and conditions, and Δ a variance–covariance matrix with diagonal elements or variance ( ) and zero off-diagonal elements or covariance. Following their model specification and prior distributions, the posterior of unknown parameters, given the data (y), is
| (2) |
where θL = {μ, g, c, r, Δ} is a collection of unknown parameters determining the link x, with μ, mij, nijk, and being defined the same as μ, mi,j, ni,j,k, and respectively, in the paper of Cho and Lee (2004). To ensure that our results are comparable with those of Cho and Lee (2004), we made the same assumptions regarding the prior distribution. That is, we also applied a flat prior to the grand mean μ such that p(μ) cancels out in (2) as a constant. [See Cho and Lee (2004) for a complete discussion of the implications of using flat priors.] To derive the fully conditional distribution for an unknown parameter, we simply pick from (2) the component or components that are related to the parameter of interest. First, consider the grand mean. Since
the fully conditional distribution of the overall mean, given the data and all other parameters, is
| (3) |
where . In (3) and throughout this letter, we use ‘|·’ to mean ‘conditional on the data and all unknown parameters except the one currently under consideration’. For the effect that corresponds to the i-th gene, we pick components in (2) that are related to gi, which leads to
Thus, the fully conditional distribution of gi given the data and all other parameters is
| (4) |
where . Similarly, since
the fully conditional distribution of cj, the effect that corresponds to the j-th condition, is
| (5) |
where . Comparing (3), (4) and (5) with the first three equations in Appendix 1 of the paper of Cho and Lee (2004), we found that they applied a common quantity for the three fully conditional distributions and this introduced systematic biases to the posterior estimates of model parameters. In our simulation study, we found this flaw to be dramatically downward-biased posterior estimations of both condition effects and gene effects, and it consequently affected posterior estimation of interaction effects and biological variances. Figure 1 illustrates how it affected posterior sampling of a randomly chosen gene effect toward zero. In contrast, using the corrected fully conditional distribution as shown in (4), we were able to obtain a posterior estimate for this gene effect that was close to its true value (i.e. 0.7923). This situation was representative of all gene and condition effects. However, we observed no significant influence of this flaw on the posterior estimation of μ and . In other words, the posterior estimates of μ and were very close to their true values using either the algorithm of Cho and Lee (2004) or the corrected one that we propose. This is not very surprising if one considers the restrictions Σigi =0 and Σjcj =0 in our data simulation. When these restrictions hold, bias in the posterior estimation of gi and cj did not significantly affect the posterior sampling of μ and .
Fig. 1.
Comparison of Markov chain sampling of a gene effect using (a) previous algorithm by Cho and Lee (2004) and (b) corrected algorithm by Wu, Forney and Joyce. The simulated gene effect was 0.7923. The y-axis represents posterior estimates of the gene effect whereas the x-axis represents saved (i.e. every one-tenth) Markov chain sampling steps.
In their paper, the conditional distribution xijk |· was derived with replicates, as shown below:
| (6) |
We think it was improper for them to reduce (6) to
when there is no experimental replication. By our reasoning, no replication is equivalent to putting nijk = 1, not nijk = 0 as stated in their paper, and , implying that
Modeling biological/experimental correlations
Cho and Lee (2004) assumed independence for both biological and experimental errors, but these assumptions may be relaxed in the HEM. In this section, we expand their HEM to model biological or/and experimental correlation in gene expression data. First, consider correlations among experimental replicates, which may arise when experimental replicates are exposed to some common environmental factors yet collected at different times or locations. In these cases, the gene expression data may show temporal or spatial variation. In bacterial biofilms, for example, microorganisms undergo significant phenotypic changes during the transition from planktonic to biofilm forms, which reflects altered underlying gene expression and regulation as they adapt to form highly organized and structured sessile communities (Schembri et al., 2003; Watnick and Kolter, 2000) with enhanced resistance to antimicrobial treatments and host immune responses. Furthermore, the development of a biofilm may be tuned differently as a response to a nutrient gradient from the bulk media to the inside of the biofilm. Thus, gene expression at different locations in a biofilm may show spatial variation similar to geographic data in such a way that ‘everything is related to everything else, but near things are more related than distant things’ (Tobler, 1970). Biofilms are the dominant mode of growth for bacteria in the environment, and are of great concern in medical, environmental and industry settings. Therefore, considering the spatial correlation in a model would help better understanding the formation of biofims as well as their altered biological activities.
Consider a balanced experimental design with observations on i = 1, …, G genes under j = 1, …, C conditions with k = 1, …, m biological samples each measured in l = 1, 2, …, n chips. Suppose that the n chips represent n experimental replicates (e.g. layers of a biofilm). Let Rn be a n × n spatial variance–covariance matrix such that the residuals e ~ N(0, Im* ⊗ Rn), where Im* is an m*× m* identity matrix (m* = G × C × m) and ⊗ indicates the Kronecker product operation. Thus, in the posterior formulation as shown in (2), is then replaced with p(y | x, Rn) and the latter is given as
If we assume that the prior distribution of Rn to be an inverted Wishart distribution with scale matrix Ve with a degree of freedom parameter κe [denoted as Wishart−1 (κe, Ve)], it can be shown that posterior samples of Rn are sampled from the following:
| (7) |
where Se is a n × n matrix with the (l, l) diagonal element being Σi Σj Σk(yijkl − xijk)2 and the (l, l′) off-diagonal element being Σi Σj Σk(yijkl − xijk)(yijkl′ − xijk). We assumed inverted Wishart distributions instead of inverted multivariate gamma distributions as the prior distributions for the variance–covariance matrix for the sake of convenience in programming. Next, each element in x is sampled from the following fully conditional distribution
| (8) |
where μ* = μ + gi + cj + rij, γll′ is the (l, l′) element of matrix , tr represents the trace matrix operation. Under the assumption of common experimental error variance (i.e. ) with zero covariance (i.e. γll′ = 0), (8) is reduced to (6).
To model correlated biological errors, we similarly define a m × m covariance matrix Δm such that bijk ~ N(0, Ic* ⊗ Δm where Ic* is an C* × C* identity matrix (C* = G × C). Biological covariance matters in analyzing gene expression data when samples are collected from a single experimental system. Then, p(x | μ, g, c, rΔ) in (2) is replaced with the following
Again, we assume the prior distribution for Δm to be an inverted Wishart distribution with scale matrix Vb and degree of freedom parameter κb, leading to the following fully conditional distribution of Δm.
| (9) |
where Sb is a m × m matrix with the diagonal element ΣiΣj(xijk − μ − gi − cj − rij)2 and the off-diagonal element ΣiΣj(xijk − μ − gi − cj − rij)(xijk′ − μ − gi − cj − rij). The fully conditional distributions for elements in g, c and r are derived as below:
| (10) |
where δkk′ is the (k,k′) element in the inverse matrix and
| (11) |
where
| (12) |
where and .
| (13) |
where and . Under the assumption of a non-zero biological covariance, (10), (11) and (12) are reduced to (3), (4) and (5), respectively, and (13) reduced to the fully conditional distribution of rij by Cho and Lee (2004). When both biological and experimental errors are correlated, (8) becomes
| (14) |
To show how our expanded model works, we simulated the expression data of 1000 genes, which were measured using 5 biological samples with 3 experimental replicates of each and 10 biological samples each with 6 experimental replicates of each. Briefly, the grand mean was given as a constant. Gene effects were generated from , condition effects from , and interaction effects from , where and . Biological and experimental errors were randomly generated either from b ~ N(0‚IC* ⊗ Δm) and , or from and b ~ N(0‚Im* ⊗ Rn), where n* = G×C×m×n, . We ran 10 independent Markov chains, each consisting of 100 000 updates for each dataset and 1000 burn-in updates. However, to reduce correlation between successive updates, we used only 1 out of every 10 posterior samples generated for a total of 10 000 saved updates per dataset. Model performance was evaluated based on decomposition of mean squared errors (MSE) of each parameter (say gi) such that
| (15) |
where
represented sampling variation and
represented squared bias, and was an estimate of gi at saved iteration t=1, 2, …, T.
Our extended model facilitates estimating patterns of biological or/and experimental (co)variance or correlations (Fig. 2) whereas such information could not be inferred using the previous model (Cho and Lee, 2004). As mentioned before, biological or experimental correlations could be of considerable interest and necessary to help us better understand the nature of correlated gene expression. For example, suppose the pattern of experimental (co)variance in Figure 2b represents the environmental correlations in a biofilm. We might conclude that nutrient differences result in higher correlations between the inner and bottom layers than between the top and the other two layers.
Fig. 2.
3D patterns of estimated (co)variance components among (a) five biological sample and (b) three experimental replicates.
Further, we investigated how different model specifications about biological/experimental covariance (i.e. zero versus non-zero) would affect the posterior estimates of parameters of interest. Ignoring experimental covariance in the model did not lead to significant differences in parameter estimation (data not shown). However biological covariance components are important for estimating gene and interaction effects. If the biological covariance components are ignored then gene effects and interaction effects will be estimated with large biases. However, grand mean, condition effects and the common residual variance appear to be more robust, and gave similar results under either model specification about biological (co)variance. As shown in Table 1, considering non-zero biological (co)variance in the model considerably decreased estimation biases of gene and interaction effects. Squared biases, and consequently MSE, of these parameters were smaller in the analysis considering biological covariance components than by ignoring them. Given the variation that occurs during the collection of gene expression data, posterior estimation of gene and interaction effects was better, in terms of MSE, sampling variation and squared biases, when more samples were used (Table 1). This result would challenge the present practice where a very limited number (say three) of samples are used in most gene expression DNA array experiments.
Table 1.
Mean squared error (MSE), sampling variance (SV) and squared bias (SB) in posterior estimation of model effects under the assumption of zero [COV(b) = 0] versus non-zero [COV(b) ≠ 0] biological covariancea,b
| COV(b) = 0 | COV(b) ≠ 0 | |||||
|---|---|---|---|---|---|---|
| MSE | SV | SB | MSE | SV | SB | |
| B05R03 | ||||||
| μ | 0.0005 | 0.0004 | 0.0001 | 0.0004 | 0.0004 | <0.0001 |
| gi | 0.4976 | 0.2471 | 0.2505 | 0.3750 | 0.2318 | 0.1432 |
| cj | 0.0024 | 0.0010 | 0.0014 | 0.0016 | 0.0009 | 0.0007 |
| rij | 0.8770 | 0.4321 | 0.4449 | 0.6667 | 0.3983 | 0.2684 |
| V(b) | — | — | — | 0.1455 | 0.0196 | 0.1259 |
| 0.0001 | <0.0001 | 0.0001 | 0.0001 | <0.0001 | 0.0001 | |
| B10R06 | ||||||
| μ | 0.0005 | 0.0004 | 0.0001 | 0.0004 | 0.0004 | <0.0001 |
| gi | 0.4438 | 0.2276 | 0.2162 | 0.2971 | 0.1909 | 0.1062 |
| cj | 0.0012 | 0.0008 | 0.0004 | 0.0012 | 0.0008 | 0.0004 |
| rij | 0.6446 | 0.3500 | 0.2946 | 0.4214 | 0.2381 | 0.1833 |
| V(b) | — | — | — | 0.0924 | 0.0058 | 0.0866 |
| <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | |
μ = grand mean; gi = effect of gene i; cj = effect of condition j; rij = interaction between gene i and condition j; V(b)= variance–covariance matrix of biological errors; = variance of residual errors.
B05R03 = five biological samples each with three experimental replicates; B10R06 = 10 biological samples each with 6 experimental replicates.
Finally, we show how model specifications about biological covariance would affect the estimation of gene expression difference, (μ + gi + cj + rij) − (μ + gi + cj′ + rij′) = (cj−cj′) + (rij − rij′) between various conditions (Fig. 3), which is of importance in identifying differential gene expression. Clearly, the estimated gene expression differences were in better agreement with true differences when considering non-zero biological covariance as compared to ignoring them in the model. In the former case, the R-square value for the linear regression of the estimated and true expression differences was higher and the regression slope is closer to 1.0. The low slope in Figure 3b suggests an obvious systematic bias, which resulted from ignoring non-zero biological covariance. On average, the higher the biological correlations were, the greater the difference. In 10 independent simulations starting with different datasets, we found that 6.5–38.7% of the top 100 differentially expressed genes could vary when one assumes zero or non-biological covariance, respectively.
Fig. 3.
Regression of estimated to true gene expression difference obtained by assuming (a) zero versus (b) non-zero biological covariance components.
In conclusion, we extended the model of Cho and Lee (2004) to allow the analysis of gene expression data with biological/environmental correlations, which not only facilitate estimating biological/expression correlation patterns, but it also improved estimation of model parameters and consequently identifying differentially expressed genes in DNA array experiments. A FORTRAN 90 program was developed to implement our extended method and it is freely downloadable at http://www.ibest.uidaho.edu/tools/banal_array/
References
- Broet P, et al. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J Comput Biol. 2002;9:671–683. doi: 10.1089/106652702760277381. [DOI] [PubMed] [Google Scholar]
- Cho H, Lee JK. Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics. 2004;20:2016–2025. doi: 10.1093/bioinformatics/bth192. [DOI] [PubMed] [Google Scholar]
- Gianola D, Sorensen D. Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics. 2004;167:1407–1424. doi: 10.1534/genetics.103.025734. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johnston JM. Econometric Methods. McGraw-Hill; New York: 1972. [Google Scholar]
- Oleksiak MF, et al. Variation in gene expression within and among natural populations. Nat Genet. 2002;32:261–266. doi: 10.1038/ng983. [DOI] [PubMed] [Google Scholar]
- Qiu X, et al. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005;6:120. doi: 10.1186/1471-2105-6-120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tadesse MG, Ibrahim JG. A Bayesian hierarchical model for the analysis of Affymetrix arrays. Ann N Y Acad Sci. 2004;1020:41–48. doi: 10.1196/annals.1310.006. [DOI] [PubMed] [Google Scholar]
- Tobler W. Review of J. Forrester, ‘Urban Dynamics’. Geogr Anal. 1970;2:198–199. [Google Scholar]
- Schembri MA, et al. Global gene expression in Escherichia coli biofilms. Mol Microbiol. 2003;48:253–267. doi: 10.1046/j.1365-2958.2003.03432.x. [DOI] [PubMed] [Google Scholar]
- Watnick P, Kolter R. Biofilm: city of microbes. J Bacteriol. 2000;182:2675–2579. doi: 10.1128/jb.182.10.2675-2679.2000. [DOI] [PMC free article] [PubMed] [Google Scholar]



