Biostatistics (Oxford, England). 2015 May 11;16(4):641–654. doi: 10.1093/biostatistics/kxv016

Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data

Hongying Dai 1,*, Richard Charnigo 2
PMCID: PMC4701176  PMID: 25964663

Abstract

Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, ρ_j, is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM, in which ρ_j follows a probability distribution, brings more flexibility in explaining correlation variations among genetic variables. The Expectation–Maximization (EM) algorithm and the Stochastic Expectation–Maximization (SEM) algorithm are used to estimate the parameters of CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor–DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK–STAT, MAPK and two other pathways.

Keywords: Cluster, Compound hierarchical correlated beta mixture, Exchangeable correlation structure, EM and SEM algorithms

1. Introduction

For many genetic variables ranging between 0 and 1, beta distributions with two shape parameters can characterize complex distributions of genetic data, while mixture models can accommodate the heterogeneity among genetic data by detecting underlying group structures and assigning different beta distributions to mixture components. Ji and others (2005) laid the groundwork for the beta mixture model (BMM), Σ_j π_j Beta(α_j, β_j), in bioinformatics and suggested that BMM can effectively detect underlying group structures. Studies show that BMM outperforms Gaussian mixture models when data are asymmetric or have complex distributional shapes (Bouguila and others, 2006; Ma and Leijon, 2011). Laurila and others (2011) developed a BMM for methylation microarray data and showed that it substantially reduces the dimensionality of the data and can be applied to sample classification and to detecting changes in methylation status between different samples and tissues. BMM has also been extended to single nucleotide polymorphism (SNP) analysis (Fu and others, 2011), cluster analysis (Dai and others, 2009), quantile normalization to correct probe design bias (Teschendorff and others, 2013), and pattern recognition and image processing (Bouguila and others, 2006; Ma and Leijon, 2011).

However, the existing BMM methods are based on an independence assumption. It is well known that genetic data can be highly correlated, so ignoring the correlation structure in BMM may lead to bias in statistical inference and inflation of the Type I error rate. A compound hierarchical correlated beta mixture (CBM) is herein proposed to cluster a variety of genetic data. CBM utilizes a latent class variable Z_i to model heterogeneity, constructs an exchangeable correlation structure for genetic data by introducing a latent variable λ_i that characterizes underlying mechanisms triggering correlations among genetic data, and then uses compound probability theory to derive a closed-form marginal distribution of the genetic data.

CBM allows the random variable X_i to be multivariate, with n_i the length of X_i. The traditional BMM becomes the special univariate case of CBM with n_i = 1. The fundamental distinction between CBM and BMM is that CBM can cluster genetic vectors by treating X_i as the clustering unit. For instance, if a variable is measured at 3 time points, then CBM can construct a vector X_i = (X_{i1}, X_{i2}, X_{i3}) with an exchangeable correlation structure and keep (X_{i1}, X_{i2}, X_{i3}) together in clustering, whereas the traditional BMM may assign X_{i1}, X_{i2} and X_{i3} to distinct mixture components.

CBM allows the correlation coefficient ρ_j to be homogeneous within a mixture component and heterogeneous between mixture components. The correlation coefficient ρ_j can also be treated as random with a prior distribution g(ρ_j; τ_j), which brings more flexibility in explaining correlation variations among genetic variables. The exchangeable correlation structure is suitable for modeling a wide range of genetic data. For instance: (1) genes within some pathways might be highly correlated, but genes from different pathways might be independent; (2) variants within a haplotype might be highly correlated, but variants from different haplotypes might be independent; (3) variants at nearby genome positions might be highly correlated, but variants at distant genome positions might be independent.

2. Method

2.1. Compound hierarchical CBM model

CBM is a hierarchical model defined by formulae (2.1)–(2.3). Let the genetic vectors be X_i = (X_{i1}, ..., X_{in_i}), i = 1, ..., N, where n_i is the length of the ith vector. Introduce an independently and identically distributed (iid) multinomial latent variable Z_i with

P(Z_i = j) = \pi_j, \quad j = 1, \ldots, g, \qquad \sum_{j=1}^{g} \pi_j = 1.    (2.1)

The latent variable Z_i classifies X_i into the jth mixture component with probability π_j.

Let λ_i be a beta-binomial random variable such that

\lambda_i \mid Z_i = j \sim \text{Beta-Binomial}(m_j, \alpha_j, \beta_j), \qquad P(\lambda_i = \lambda \mid Z_i = j) = \binom{m_j}{\lambda} \frac{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}{B(\alpha_j, \beta_j)}, \quad \lambda = 0, 1, \ldots, m_j,    (2.2)

with parameters m_j, α_j > 0 and β_j > 0, where B(·,·) denotes the beta function. The latent variable λ_i models underlying mechanisms that trigger correlations among X_{i1}, ..., X_{in_i}.

Conditioning on Z_i and λ_i, let

X_{ik} \mid Z_i = j, \lambda_i \sim \mathrm{Beta}(\alpha_j + \lambda_i,\ \beta_j + m_j - \lambda_i), \quad \text{independently for } k = 1, \ldots, n_i.    (2.3)

Since the beta family is conjugate to the binomial family (Minhajuddin and others, 2004), compounding X_{ik} on λ_i yields a correlated beta distribution for X_i. The latent variable λ_i in (2.3) is integrated out in the compound probability distribution of X_i. Thus, the conditional distribution in (2.3) leads marginally to X_{ik} | Z_i = j ~ Beta(α_j, β_j). See Proposition 1.

Since X_{i1}, ..., X_{in_i} share the same λ_i in (2.3), X_{ik} is correlated with X_{ik'} for k ≠ k' within X_i. See Proposition 2.

Theorem 1 —

The marginal distribution of X_i follows a g-component CBM with probability density function

f(x_i) = \sum_{j=1}^{g} \pi_j \sum_{\lambda=0}^{m_j} \binom{m_j}{\lambda} \frac{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}{B(\alpha_j, \beta_j)} \prod_{k=1}^{n_i} \frac{x_{ik}^{\alpha_j + \lambda - 1} (1 - x_{ik})^{\beta_j + m_j - \lambda - 1}}{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}.    (2.4)

Proposition 1 —

Marginally, X_{ik} follows a beta mixture distribution:

X_{ik} \sim \sum_{j=1}^{g} \pi_j\, \mathrm{Beta}(\alpha_j, \beta_j).

Proposition 2 —

X_{ik} and X_{ik'} (k ≠ k') within X_i from the jth mixture component are correlated, with correlation coefficient

\rho_j = \frac{m_j}{\alpha_j + \beta_j + m_j}.

Proposition 3 —

The traditional BMM, Σ_j π_j Beta(α_j, β_j), is a special case of CBM when m_j = 0 for j = 1, ..., g; in that case X_{ik} is independent of X_{ik'} for any k ≠ k'.

Proofs can be found in Supplementary Materials available at Biostatistics online.
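
For intuition, the correlation in Proposition 2 follows from the law of total covariance under (2.2) and (2.3); this is a brief sketch of the argument, with the full proof in the Supplementary Materials. Conditional on Z_i = j (writing λ for λ_i), E(X_{ik} | λ) = (α_j + λ)/(α_j + β_j + m_j) and X_{ik}, X_{ik'} are independent given λ, so

  Cov(X_{ik}, X_{ik'}) = Var{E(X_{ik} | λ)} = Var(λ)/(α_j + β_j + m_j)^2 = m_j α_j β_j / {(α_j + β_j)^2 (α_j + β_j + 1)(α_j + β_j + m_j)},

using the beta-binomial variance Var(λ) = m_j α_j β_j (α_j + β_j + m_j)/{(α_j + β_j)^2 (α_j + β_j + 1)}. Dividing by Var(X_{ik}) = α_j β_j/{(α_j + β_j)^2 (α_j + β_j + 1)}, the variance of the marginal Beta(α_j, β_j) distribution in Proposition 1, gives ρ_j = m_j/(α_j + β_j + m_j).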

In summary, CBM is a hierarchical model that compounds the distribution of X_i over 2 latent variables, Z_i and λ_i. The latent variable Z_i determines the mixture component membership while the latent variable λ_i determines the correlations within X_i. Since Z_i and λ_i are unobservable, the marginal distribution of X_i is fitted to data to cluster the genetic vectors X_i into different mixture components.
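
The hierarchy (2.1)–(2.3) is easy to simulate from, which also clarifies how the two latent variables act. The following R sketch draws Z_i, then λ_i, then the vector X_i; the function name rcbm and all parameter values are illustrative (not the authors' released code), and a common vector length n and fixed m_j are assumed.

  rcbm <- function(N, n, pi_j, alpha, beta, m) {
    g <- length(pi_j)
    Z <- sample.int(g, N, replace = TRUE, prob = pi_j)               # (2.1): latent class labels
    X <- vector("list", N)
    for (i in 1:N) {
      j <- Z[i]
      p <- rbeta(1, alpha[j], beta[j])                               # beta-binomial draw via beta mixing
      lambda <- rbinom(1, m[j], p)                                   # (2.2): latent count shared by the vector
      X[[i]] <- rbeta(n, alpha[j] + lambda, beta[j] + m[j] - lambda) # (2.3): correlated beta variates
    }
    list(X = X, Z = Z)
  }

  # Illustrative values; within component j, corr(X_ik, X_ik') = m_j/(alpha_j + beta_j + m_j),
  # and m_j = 0 reduces to the independent BMM of Proposition 3.
  sim <- rcbm(N = 500, n = 5, pi_j = c(0.7, 0.3), alpha = c(1, 2), beta = c(3, 2), m = c(2, 0))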

2.2. Random exchangeable correlation structure

The exchangeable correlation structure for X_i in the jth mixture component is Corr(X_i | Z_i = j) = (1 - ρ_j) I_{n_i} + ρ_j (1_{n_i} ⊗ 1_{n_i}^T), where I_{n_i} is an identity matrix, 1_{n_i} is a vector of 1s, ⊗ is the Kronecker product and T denotes transpose. It might be too restrictive to fix the correlation coefficient ρ_j for the jth mixture component, as genetic data often have complex correlation structures. To address this issue, one can treat ρ_j as a random variable and insert its hyperparameter(s) τ_j into the hierarchical model.

The correlation within X_i is triggered by λ_i, so ρ_j is determined by the three parameters (m_j, α_j, β_j) of the beta-binomial distribution of λ_i. Since λ_i is a latent variable, the parameters (m_j, α_j, β_j) are determined by fitting the marginal distribution of X_i to data. This leads to a linkage between ρ_j and (m_j, α_j, β_j), namely ρ_j = m_j/(α_j + β_j + m_j).

In the hierarchical model, introduce a probability function g(ρ_j; τ_j), with τ_j the parameter(s) of the distribution of ρ_j. The value of ρ_j then determines the value of the parameter m_j in the distribution of λ_i. Below is the framework to construct a random CBM; a simulation sketch follows Theorem 2.

  • Let ρ_j be iid with density g(ρ_j; τ_j) for 0 < ρ_j < 1 and j = 1, ..., g.

  • For the jth mixture component, let m_j* = ρ_j(α_j + β_j)/(1 - ρ_j).

  • Let m_j = h(m_j*), where h(·) is a function to round m_j* into an integer.

  • Let λ_i | Z_i = j, ρ_j ~ Beta-Binomial(m_j, α_j, β_j).

  • Let X_{ik} | Z_i = j, λ_i ~ Beta(α_j + λ_i, β_j + m_j - λ_i) for k = 1, ..., n_i.

Theorem 2 (Random CBM) —

The above framework generates a CBM with a random exchangeable correlation structure, where the correlation coefficient within the jth mixture component is random with density g(ρ_j; τ_j), j = 1, ..., g. The marginal probability distribution of X_i is

f(x_i) = \sum_{j=1}^{g} \pi_j \int_0^1 f_{CB}(x_i;\ \alpha_j, \beta_j, \rho_j)\, g(\rho_j; \tau_j)\, d\rho_j,    (2.5)

where f_{CB}(·) stands for the probability function of a multivariate correlated beta distribution with random correlation coefficients. See the Supplementary Materials for the proof (available at Biostatistics online). Propositions 1–3 also hold for the random CBM.
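
As an illustration of the framework above, the next R sketch generates one vector from the jth component of a random CBM, taking g(ρ; τ) to be a Beta(τ_1, τ_2) density purely for concreteness; the paper leaves g(·; τ) generic, and the function name is hypothetical.

  rcbm_random_component <- function(n, alpha, beta, tau1, tau2) {
    rho <- rbeta(1, tau1, tau2)                       # rho_j ~ g(rho; tau), here an assumed Beta prior
    m   <- round(rho * (alpha + beta) / (1 - rho))    # linkage rho = m/(alpha + beta + m), then rounded
    p   <- rbeta(1, alpha, beta)
    lambda <- rbinom(1, m, p)                         # lambda | rho ~ Beta-Binomial(m, alpha, beta)
    rbeta(n, alpha + lambda, beta + m - lambda)       # exchangeable correlated beta vector
  }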

2.3. Expectation–Maximization algorithm and Stochastic Expectation–Maximization algorithm for parameter estimation

Let w_{ij} = I(Z_i = j), where I(·) is an indicator function, indicate whether X_i belongs to the jth mixture component. Since the latent variable Z_i is unobservable, the Expectation–Maximization (EM) algorithm is implemented to estimate the parameters of CBM. The EM algorithm relies on the complete-data log-likelihood function and alternates between expectation (E) steps and maximization (M) steps. In the E-step, the latent variable w_{ij} is replaced by its expectation under the posterior distribution of Z_i given the data and the current parameter estimates. In the M-step, the parameter estimates are updated using maximizers of the expected log-likelihood found in the E-step.

Let Θ = (π_1, ..., π_g, θ_1, ..., θ_g) stand for the unknown parameters, where θ_j = (α_j, β_j, m_j) in CBM (2.4) and θ_j = (α_j, β_j, τ_j) in random CBM (2.5), for j = 1, ..., g. Write the log-likelihood function of Θ based on the joint distribution of X_i and Z_i as

l_c(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{g} w_{ij} \{\log \pi_j + \log f_j(x_i; \theta_j)\},    (2.6)

where f_j(x_i; θ_j) denotes the jth component density in (2.4) or (2.5).

  • Step 1: Starting with iteration t = 0, generate initial values Θ^(0). Set the initial mixing weights to π_j^(0) = 1/g.

  • Step 2: E-step. Compute the expectation of w_{ij} using the posterior distribution of Z_i:

    \hat{w}_{ij}^{(t)} = \frac{\pi_j^{(t)} f_j(x_i; \theta_j^{(t)})}{\sum_{l=1}^{g} \pi_l^{(t)} f_l(x_i; \theta_l^{(t)})}.

    Plug ŵ_{ij}^(t) into (2.6) so that the expected log-likelihood function becomes

    Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{j=1}^{g} \hat{w}_{ij}^{(t)} \{\log \pi_j + \log f_j(x_i; \theta_j)\}.    (2.7)

  • Step 3: M-step. Find the π_j and θ_j that maximize (2.7) and update Θ^(t+1). The mixing weights have the closed form π_j^(t+1) = N^{-1} Σ_{i=1}^{N} ŵ_{ij}^(t). Without a closed form for θ_j, the update can be obtained with the numerical optimization function "optim" in R, which is freely available at http://www.r-project.org/.

  • Step 4: Repeat the E-step and M-step until the change in the maximum of the log-likelihood (2.7) is within a tolerance value (say 0.1).

The EM algorithm only ensures convergence to a local maximum of the log-likelihood, so it is critical to select appropriate initial values for Θ. In Step 1, multiple sets of initial values Θ^(0) are randomly generated, or one can perform a grid search over the parameter space. Steps 1–4 are then repeated and the global maximizer of the log-likelihood is selected.
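
A compact R sketch of Steps 1–4 for the fixed-m CBM (2.4) is given below. The helper names dcbm_component and cbm_em are hypothetical, m_j is treated as known (in practice it is estimated as well, and the random CBM would add the τ_j), and random restarts as described above are left to the caller; this is a sketch under those assumptions, not the authors' released code.

  # Component density f_j(x; a, b, m): compound of Beta(a + lambda, b + m - lambda)
  # over lambda ~ Beta-Binomial(m, a, b), for one genetic vector x in (0, 1)^n_i.
  dcbm_component <- function(x, a, b, m) {
    lambda <- 0:m
    w <- choose(m, lambda) * beta(a + lambda, b + m - lambda) / beta(a, b)  # beta-binomial weights
    dens <- sapply(lambda, function(l) prod(dbeta(x, a + l, b + m - l)))    # product over the vector
    sum(w * dens)      # (for long vectors one would accumulate log densities to avoid underflow)
  }

  # X: list of genetic vectors; g: number of components; m: fixed integer vector of length g.
  cbm_em <- function(X, g, m, max.iter = 200, tol = 0.1) {
    N <- length(X)
    pi_j <- rep(1 / g, g)                              # Step 1: equal initial mixing weights
    a <- runif(g, 0.5, 2); b <- runif(g, 0.5, 2)       # random starting shape parameters
    loglik_old <- -Inf
    for (iter in seq_len(max.iter)) {
      f <- sapply(1:g, function(j) sapply(X, dcbm_component, a[j], b[j], m[j]))  # N x g densities
      mix <- sweep(f, 2, pi_j, "*")
      loglik <- sum(log(rowSums(mix)))
      if (abs(loglik - loglik_old) < tol) break        # Step 4: stop when the log-likelihood stabilizes
      loglik_old <- loglik
      w_hat <- mix / rowSums(mix)                      # Step 2 (E-step): posterior weights w_hat[i, j]
      pi_j <- colMeans(w_hat)                          # Step 3 (M-step): closed-form mixing weights
      for (j in 1:g) {                                 # numerical update of (alpha_j, beta_j) via optim
        fit <- optim(log(c(a[j], b[j])), function(par)
          -sum(w_hat[, j] * log(sapply(X, dcbm_component, exp(par[1]), exp(par[2]), m[j]))))
        a[j] <- exp(fit$par[1]); b[j] <- exp(fit$par[2])
      }
    }
    list(pi = pi_j, alpha = a, beta = b, rho = m / (a + b + m),
         posterior = w_hat, loglik = loglik)
  }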

To further prevent estimates from staying near saddle points of the likelihood function, the Stochastic EM (SEM) algorithm (Celeux and Diebolt, 1985) is considered. SEM incorporates a stochastic (S-) step between the E-step and the M-step. This step can be viewed as a Bayesian extension of the EM algorithm and allows SEM to stochastically explore the likelihood surface, partially avoiding convergence to the closest mode.

All steps of the EM algorithm are kept. After Step 2, SEM adds the following step.

Step 2.2: S-step. Randomly generate Z_i^(t) from the multinomial distribution with probabilities (ŵ_{i1}^(t), ..., ŵ_{ig}^(t)). In (2.7), replace ŵ_{ij}^(t) by the indicator I(Z_i^(t) = j).

The sequence of SEM estimates forms an ergodic Markov chain that converges in distribution to a unique stationary distribution as the number of iterations increases.
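
In code, the S-step only needs to turn the posterior weight matrix from the E-step into sampled hard assignments before the M-step runs; a minimal sketch (hypothetical helper name) compatible with cbm_em above:

  sem_s_step <- function(w_hat) {
    g <- ncol(w_hat)
    z <- apply(w_hat, 1, function(p) sample.int(g, 1, prob = p))  # Z_i^(t) ~ Multinomial(1, w_hat[i, ])
    diag(g)[z, , drop = FALSE]               # 0/1 indicators I(Z_i^(t) = j) replace w_hat in (2.7)
  }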

2.4. Number of mixture components

The number of mixture components g can be determined using the model selection criteria AIC, BIC and ICL-BIC (Ji and others, 2005). These criteria are based on the log-likelihood function with a penalty on the number of unknown parameters d in the model. For CBM (2.4), d = 4g - 1. For the random CBM (2.5), d = (3 + d_τ)g - 1, where d_τ is the number of unknown parameters in g(ρ_j; τ_j). Let log L(Θ̂) be the maximum of the log-likelihood function (2.6). Define

  • Akaike information criterion: AIC = -2 log L(Θ̂) + 2d.

  • Bayesian information criterion: BIC = -2 log L(Θ̂) + d log N.

  • Integrated classification likelihood–BIC:

    ICL-BIC = -2 log L(Θ̂) + d log N + 2 EN(ŵ), where EN(ŵ) = -Σ_{i=1}^{N} Σ_{j=1}^{g} ŵ_{ij} log ŵ_{ij}.

ICL-BIC is BIC plus a penalty based on the estimated entropy EN(ŵ) of the fuzzy classification matrix ŵ = (ŵ_{ij}). Mixture models with different numbers of mixture components are fitted using the EM and SEM algorithms as described in Section 2.3, and the model with the smallest AIC, BIC or ICL-BIC is chosen.
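
The criteria are straightforward to compute from a fitted model; the sketch below uses the output of cbm_em above and assumes d = 4g - 1 free parameters for the fixed-m CBM and the usual twice-entropy ICL-BIC penalty.

  cbm_criteria <- function(fit, N, g) {
    d   <- 4 * g - 1                                   # assumed parameter count for CBM (2.4)
    ent <- -sum(fit$posterior * log(pmax(fit$posterior, .Machine$double.eps)))  # entropy EN(w_hat)
    c(AIC = -2 * fit$loglik + 2 * d,
      BIC = -2 * fit$loglik + d * log(N),
      ICL_BIC = -2 * fit$loglik + d * log(N) + 2 * ent)
  }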

3. Simulation

Under regularity conditions, the MLEs π̂_j and θ̂_j of the CBM are consistent estimators of the true parameters π_j and θ_j, i.e. π̂_j → π_j and θ̂_j → θ_j in probability. Five CBM models are presented to illustrate estimation consistency:

  • Model 1: a two-component CBM with correlated variables within both mixture components.

  • Model 2: a two-component CBM with two well-separated, correlated mixture components.

  • Model 3: a three-component CBM with correlated variables within all three mixture components.

  • Model 4: a two-component CBM in which variables within both mixture components are independent (ρ_1 = ρ_2 = 0).

  • Model 5: a three-component CBM in which variables within the first mixture component are independent and variables within the other two mixture components are correlated (ρ_1 = 0).

Models 1 and 2 use 500 genetic vectors with 5 variants in each vector. Model 3 uses 100 genetic vectors with 50 variants in each vector. Each genetic vector is treated as a multivariate random variable with correlation coefficient ρ_j in the jth mixture component.

The distribution of the variables from Model 3 is depicted in Figure 1. Figure 1(a) shows a scatter plot between X_{i1} (labeled "Genetic variant 1") and X_{i2} (labeled "Genetic variant 2") for i = 1, ..., 100. These two variables follow the same three-component mixture distribution. The Q-Q plot in Figure 1(b) is very close to the diagonal line, indicating that variants X_{i1} and X_{i2} have the same distribution. Figure 1(c) is a 2D histogram for variants X_{i1} and X_{i2}. Figure 1(d) shows the three-component beta mixture for variant X_{i1}. Figures 1(a)-(d) jointly confirm that the genetic variants X_{i1}, ..., X_{i50} have the same marginal beta mixture distribution, which confirms Proposition 1. In Figure 1(d), both CBM and BMM are fitted to the histogram of variant X_{i1}. BMM fits a density function close to the histogram, but it does not consider correlation among variants and thus sets ρ_1 = ρ_2 = ρ_3 = 0. Remarkably, CBM not only provides a close fit to the histogram but also accurately estimates the correlation within the 3 mixture components, with estimates of ρ_1, ρ_2 and ρ_3 correct to 2 decimal digits (Table 1).

Fig. 1. Distributions of variables in the simulated CBM Model 3. (a) Correlation plot. (b) Q-Q plot. (c) 2D histogram. (d) Histogram.

Table 2.

Comparison of CBM and BMM using EM and SEM algorithms

                 Model 1           Model 2           Model 3           Model 4           Model 5
                 CBM      BMM      CBM      BMM      CBM      BMM      CBM      BMM      CBM      BMM
EM algorithm
MSE(α̂_1)†        0.002    0.006    0.003    0.006    0.0003   0.002    0.01     0.01     0.02     0.02
MSE(α̂_2)         0.45     2.54     0.64     4.08     0.02     2.02     0.04     0.15     0.48     0.35
MSE(α̂_3)                                             0.18     0.49                       0.66     0.78
MSE(β̂_1)         0.17     0.34     0.11     0.63     0.10     1.90     0.50     1.17     0.47     2.13
MSE(β̂_2)         0.03     0.06     0.02     0.14     0.02     1.46     0.07     0.06     0.17     0.33
MSE(β̂_3)                                             0.003    0.06                       0.12     0.16
MSE(ρ̂_1)         0.004             0.0009            0.001             0.002             0.001
MSE(ρ̂_2)         0.003             0.003             <0.0001           0.002             0.003
MSE(ρ̂_3)                                             0.001                               0.005
MSE(π̂_1)         0.004    0.004    <0.0001  0.0007   0.001    0.002    0.002    0.003    0.01     0.03
MSE(π̂_2)         0.004    0.004    <0.0001  0.0007   0.005    0.003    0.002    0.003    0.01     0.02
MSE(π̂_3)                                             0.002    0.003                      0.001    0.002
AIC‡
BIC‡
ICL-BIC‡                  1627                                 5201              194               387
Accuracy§        0.87     0.83     0.99     0.97     0.96     0.78     0.99     0.84     0.79     0.65
SEM algorithm
MSE(α̂_1)         0.01     0.08     0.01     0.06     0.01     0.03     0.002    0.012    0.16     0.17
MSE(α̂_2)         4.30     4.02     2.04     1.20     1.30     1.59     0.069    0.177    0.24     0.18
MSE(α̂_3)                                             3.32     2.78                       0.06     0.16
MSE(β̂_1)         0.20     2.99     0.82     3.61     1.31     3.63     0.18     0.04     0.16     0.17
MSE(β̂_2)         0.25     0.59     0.14     0.21     1.66     4.85     0.06     0.10     0.24     0.18
MSE(β̂_3)                                             0.12     0.39                       0.06     0.16
MSE(ρ̂_1)         0.002             0.0006            0.0008            0.001             0
MSE(ρ̂_2)         0.001             0.0014            0.002             0                 0
MSE(ρ̂_3)                                             0.01                                0.1
MSE(π̂_1)         0.002    0.02     0.0002   0.001    0.005    0.002    0.0004   0.0007   0.01     0.02
MSE(π̂_2)         0.002    0.02     0.0002   0.001    0.02     0.008    0.0004   0.0007   0.01     0.02
MSE(π̂_3)                                             0.007    0.003                      0.003    0.002
AIC‡
BIC‡
ICL-BIC‡                  1130     -4126    -1764              3353              170               367
Accuracy§        0.88     0.82     0.99     0.98     0.88     0.77     0.99     0.85     0.76     0.67

†Lower MSE indicates less estimation bias.

‡Lower AIC, BIC and ICL-BIC indicate a better model fit.

§Higher accuracy indicates improved classification accuracy for mixture components.

Table 1.

Exchangeable correlation architecture modeled by CBM. The correlation coefficient from the jth mixture component can be either fixed or a random variable with probability function g(ρ_j; τ_j)

Generate S data sets for each model. The mean square error (MSE) for each parameter is then MSE(θ̂) = S^{-1} Σ_{s=1}^{S} (θ̂^{(s)} - θ)^2, where θ̂^{(s)} refers to the MLE of the parameter from the sth data set, estimated through the EM algorithm or the SEM algorithm. A smaller MSE indicates a smaller bias. As shown in Table 2, CBM provides very accurate estimates of the correlation coefficients ρ_j, with MSE no larger than 0.01 under the EM algorithm, for all 5 simulated models. When compared with BMM, CBM yields lower MSE for the mixing weights and the mixture component parameters. For instance, in Model 3 using the SEM algorithm, the MSE of β̂_2 is 1.66 for CBM versus 4.85 for BMM.
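
The simulation design can be mimicked with the sketches given earlier; in the R fragment below, the number of replicates and the parameter values are illustrative, m_j is treated as known, and label switching is handled only crudely by matching the dominant component.

  S <- 100                                                     # illustrative number of replicate data sets
  truth <- list(pi = c(0.7, 0.3), alpha = c(1, 2), beta = c(3, 2), m = c(2, 1))
  rho1_true <- truth$m[1] / (truth$alpha[1] + truth$beta[1] + truth$m[1])
  rho1_hat <- replicate(S, {
    sim <- rcbm(N = 500, n = 5, truth$pi, truth$alpha, truth$beta, truth$m)
    fit <- cbm_em(sim$X, g = 2, m = truth$m)
    fit$rho[which.max(fit$pi)]                                 # align with the larger-weight component
  })
  mse_rho1 <- mean((rho1_hat - rho1_true)^2)                   # Monte Carlo MSE for rho_1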

Models 1–3 are fully correlated, with varying correlation among mixture components; Model 5 is partially correlated and Model 4 is fully independent. Disregarding the correlation structure and fitting BMM to these models leads to higher MSE in parameter estimation, poorer model fit and lower prediction accuracy (see Table 2). AIC, BIC and ICL-BIC are used to compare goodness of fit. For all 5 models, CBM yields lower AIC, BIC and ICL-BIC values than BMM. For instance, in Model 2 using the SEM algorithm, ICL-BIC = -4126 for CBM and ICL-BIC = -1764 for BMM.

Mixture models provide a principled probabilistic framework for prediction and classification. The classification accuracy of CBM is higher than that of BMM in all 5 models. Using the SEM algorithm, CBM increases the classification accuracy by 0.06 in Model 1, 0.11 in Model 3, 0.14 in Model 4 and 0.09 in Model 5. In Model 2, the two mixture components are well separated; therefore, both CBM and BMM have high classification accuracy.

The performance of the EM and SEM algorithms was compared across the 5 models. The SEM algorithm can prevent estimates from staying near saddle points of the likelihood function, but it also generates more variation in the estimates and takes more iterations for the sequence to converge. In Table 2, the MSEs of the second-component shape parameters in Models 1 and 2 and of the third-component shape parameters in Model 3 exceed 1 under the SEM algorithm. These are the mixture parameters with the smaller mixing weights (approximately 0.3, 0.15 and 0.2, respectively), which indicates that the SEM algorithm may increase the estimation variance for mixture parameters with small mixing weights. Because the SEM algorithm adds a stochastic procedure to the EM algorithm to prevent estimates from remaining in a small neighborhood of a fixed value, it may increase the variation in parameter estimation, and this especially affects mixture parameters with small mixing weights. We suggest using the SEM algorithm to prevent estimates from staying near saddle points; investigators may consider the EM algorithm for data with a small sample size and small mixing weights.

Three model selection criteria, AIC, BIC and ICL-BIC, were compared in the simulation. AIC does not take the sample size into account and thus tends to select more complicated models, whereas ICL-BIC tends to select a parsimonious model. The simulation results show that BIC and ICL-BIC are appropriate tools for model selection.

4. Case study: genome-wide analysis of binding probability

We apply CBM to cluster transcription factors based on their binding probabilities in mouse genome data. Transcription factors are DNA-binding proteins that regulate gene expression levels by binding to promoter regions proximal to gene transcription start sites or to more distal enhancer regions that regulate expression through long-range interactions (Rye and others, 2011). Transcription factor binding varies between cell types (Levine and Tjian, 2003; Zhang and others, 2006). Lahdesmaki and others (2008) constructed a test set of annotated binding sites in mouse promoters from existing databases, including TRANSFAC, ORegAnno, the human genome browser at UCSC and ABS (Wingender and others, 2000; Kent and others, 2002; Blanco and others, 2006; Montgomery and others, 2006). A likelihood-based binding prediction method was applied to the 2K base pair upstream promoter regions of all 20,397 mouse genes, where the genomic locations of the promoters were based on RefSeq gene annotations. Evolutionary conservation was used as an additional data source, and binding specificities for 266 transcription factors were taken from TRANSFAC Professional version 10.3. The full binding probability results for all mouse transcription factor–gene pairs (a 20,397 by 266 table) are available at http://xerad.systemsbiology.net/ProbTF/. These genome-wide results indicate that it is rare to observe a high binding probability between transcriptional regulators and their target promoters. See the Supplementary Materials for details (available at Biostatistics online).

Let X_{ik} be the binding probability of the ith transcription factor to the promoter region of the kth gene. To make the clustering more biologically meaningful, we select genetic pathways extracted from the IntPath pathway database (Zhou and others, 2012). The online toolkit of the IntPath program provides a total of 550 pathways for the "M. musculus" species. Let X_i = (X_{i1}, ..., X_{in_p}) be the vector of binding probabilities of the ith transcription factor to the n_p genes in a pathway. The number of genes, n_p, varies by pathway. The goal is to apply CBM to cluster the transcription factors (i = 1, ..., 266) based on their patterns of binding probabilities to all genes in a pathway. The CBM clustering is performed for each pathway separately, so the clustering results differ across pathways. BMM can cluster the binding probabilities between each transcription factor and each gene, but BMM cannot cluster the binding probabilities between each transcription factor and a pathway.

For each genetic pathway, CBM is fitted with varying numbers of mixture components using the EM algorithm; a sketch of this workflow is given below. The optimal number of mixture components is determined by BIC. The posterior probabilities are then used to cluster the transcription factors into mixture components. We select 2 transcription factor pathways: the JAK–STAT signaling pathway and the MAPK cascade. The JAK–STAT pathway is a signaling cascade whose evolutionarily conserved roles include cell proliferation and hematopoiesis. The mitogen-activated protein kinases (MAPKs) belong to a large family of serine/threonine protein kinases that are conserved in organisms as diverse as yeast and humans. We also select 2 pathways that are not transcription factor pathways. Information on these 4 pathways is listed in Table S1 in the Supplementary Materials available at Biostatistics online.
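
A sketch of this per-pathway workflow, using the functions sketched in Section 2, is shown below. The file names, the fixed m_j = 1, and the range of 2-4 mixture components are illustrative assumptions rather than the authors' pipeline, and binding probabilities equal to exactly 0 or 1 may need to be nudged away from the boundary before fitting.

  bind  <- as.matrix(read.table("mouse_TF_binding.txt", header = TRUE, row.names = 1)) # genes x 266 TFs
  genes <- scan("jak_stat_pathway_genes.txt", what = character())    # gene list for one IntPath pathway
  Xp    <- lapply(colnames(bind), function(tf) bind[genes, tf])      # one binding vector per TF
  fits  <- lapply(2:4, function(g) cbm_em(Xp, g = g, m = rep(1, g))) # vary the number of components
  bic   <- sapply(seq_along(fits), function(k) cbm_criteria(fits[[k]], N = length(Xp), g = k + 1)["BIC"])
  best  <- fits[[which.min(bic)]]
  cluster <- max.col(best$posterior)   # assign each transcription factor by its posterior probability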

The clustering of transcription factors is based on their binding probabilities to a pathway. For the JAK–STAT pathway and the histidine metabolism pathway, CBM identified 3 clusters of transcription factors. For the MAPK pathway and superpathway of serine and glycine biosynthesis I, CBM identified 2 clusters of transcription factors. The estimates of parameters within each mixture component are listed in Table 3.

Table 3.

Parameter estimates of CBM fitted to transcription factor binding probability data

                            Cluster 1   Cluster 2   Cluster 3
JAK–STAT signaling pathway
  π̂                         0.38        0.07        0.55
  α̂                         0.70        1.27        0.65
  β̂                         3.34        3.26        0.41
  m̂                         2           3           2
  ρ̂                         0.33        0.40        0.65
  # transcription factors†   99          20          147
MAPK cascade
  π̂                         0.04        0.96
  α̂                         0.17        0.44
  β̂                         3.61        0.36
  m̂                         9           1
  ρ̂                         0.70        0.56
  # transcription factors    11          255
Superpathway of serine and glycine biosynthesis I
  π̂                         0.42        0.58
  α̂                         0.73        0.40
  β̂                         3.24        0.60
  m̂                         1           0
  ρ̂                         0.20        0
  # transcription factors    126         140
Histidine metabolism
  π̂                         0.69        0.25        0.06
  α̂                         0.58        0.30        0.34
  β̂                         2.33        1.85        0.57
  m̂                         0           6           0
  ρ̂                         0           0.74        0
  # transcription factors    186         63          17

†Number of transcription factors classified to each mixture component by CBM.

CBM reveals very different patterns among the clusters (see the density curves in Figure 2). Among the clusters we identified, one cluster often contained a small number of transcription factors and had a distinctive binding pattern compared with the large numbers of transcription factors in the other clusters.

Fig. 2. Histogram of transcription factor–DNA binding probability and fitted probability density distribution. (a) JAK–STAT. (b) MAPK. (c) Serine and glycine. (d) Histidine metabolism.

The marginal distribution of binding probabilities among the 266 transcription factors is displayed in Figure 2. The fitted probability density function for all transcription factors (the solid curve) shows that most binding probabilities lie near 0 and only a small number lie close to 1, confirming the sparse connectivity between transcriptional regulators and their target promoters. The fitted density for the transcription factors within each cluster is depicted by a dashed line. These density curves indicate that transcription factors from different clusters have distinct binding patterns.

Since researchers are particularly interested in high binding probabilities, the patterns among these clusters, especially the probability density at binding probabilities near 1, yield interesting findings regarding transcriptional regulation. For instance, in the superpathway of serine and glycine biosynthesis I, transcription factors in Cluster 2 have higher binding probabilities than those in Cluster 1, as the right tail of the probability density function of Cluster 2 is much higher than that of Cluster 1. For the histidine metabolism pathway, Cluster 2 has a binding peak near 0.3, indicating that this cluster of transcription factors has more moderate binding probabilities than the other clusters. For the JAK–STAT pathway, Cluster 3 exhibits a nearly U-shaped distribution of binding probabilities with 2 peaks, one near 0 and one near 1, indicating that this cluster contains more high binding probabilities (near 1) than the other clusters.

5. Discussion

A hierarchical beta mixture with an exchangeable correlation structure is proposed to cluster multivariate genetic data. The CBM is important for the following reasons: (1) Assuming iid genetic data may be unreasonable in real-world applications, yet much of the existing methodology does so. The present methodology allows both for correlations in genetic data and heterogeneity in such correlations, even if there is no heterogeneity in location. (2) With most existing methodology, there is no guarantee that related observations will be assigned to the same mixture component, even when it would not make sense for them to be assigned to different mixture components. The present methodology ensures that related observations are assigned to the same mixture component. Moreover, the quality of such assignments can be improved and informed by the heterogeneity in correlation.

CBM has a closed-form probability density function because of compound probability theory. The correlation coefficient ρ_j is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM, in which ρ_j follows a distribution g(ρ_j; τ_j), brings more flexibility in explaining correlation variations among genetic variables.

The results of the simulation and case study show that CBM yields more accurate prediction and classification of group structures. For instance, in Model 3 with the EM algorithm, CBM attains 96% classification accuracy while BMM attains only 78%. Correlated variants from a gene can be grouped together as genetic vectors, which allows the unsupervised clustering to focus on genetic vectors and prevents variants from the same genetic vector from being assigned to different mixture components.

The R code for CBM is freely available at http://d.web.umkc.edu/daih/.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

The work of H.D. was supported in part by R01(DK100779, PI: S Patton) awarded by the National Institute of Diabetes and Digestive and Kidney Diseases.


Acknowledgement

We gratefully acknowledge the helpful comments from three referees and the Associate Editor. We thank the HPC system engineer, Shane Corder, for his bioinformatics and computing support in this project. Conflict of Interest: None declared.

References

  1. Blanco E., Farre D., Alba M. M., Messeguer X., Guigo R. (2006). ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research, 34, D63–D67.
  2. Bouguila N., Ziou D., Monga E. (2006). Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statistics and Computing, 16, 215–225.
  3. Celeux G., Diebolt J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73–82.
  4. Dai X., Erkkila T., Yli-Harja O., Lahdesmaki H. (2009). A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data. BMC Bioinformatics, 10, 165.
  5. Fu R., Dey D. K., Holsinger K. E. (2011). A beta-mixture model for assessing genetic population structure. Biometrics, 67, 1073–1082.
  6. Ji Y., Wu C., Liu P., Wang J., Coombes K. R. (2005). Applications of beta-mixture models in bioinformatics. Bioinformatics, 21, 2118–2122.
  7. Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M., Haussler D. (2002). The human genome browser at UCSC. Genome Research, 12, 996–1006.
  8. Lahdesmaki H., Rust A. G., Shmulevich I. (2008). Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820.
  9. Laurila K., Oster B., Andersen C. L., Lamy P., Orntoft T., Yli-Harja O., Wiuf C. (2011). A beta-mixture model for dimensionality reduction, sample classification and analysis. BMC Bioinformatics, 12, 215.
  10. Levine M., Tjian R. (2003). Transcription regulation and animal diversity. Nature, 424, 147–151.
  11. Ma Z., Leijon A. (2011). Bayesian estimation of beta mixture models with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2160–2173.
  12. Minhajuddin A., Harris I., Schucany W. (2004). Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation, 74, 599–607.
  13. Montgomery S. B., Griffith O. L., Sleumer M. C., Bergman C. M., Bilenky M., Pleasance E. D., Prychyna Y., Zhang X., Jones S. J. (2006). ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics, 22, 637–640.
  14. Rye M., Saetrom P., Handstad T., Drablos F. (2011). Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements. BMC Biology, 9, 80.
  15. Teschendorff A. E., Marabita F., Lechner M., Bartlett T., Tegner J., Gomez-Cabrero D., Beck S. (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450k DNA methylation data. Bioinformatics, 29, 189–196.
  16. Wingender E., Chen X., Hehl R., Karas H., Liebich I., Matys V., Meinhardt T., Pruss M., Reuter I., Schacherer F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research, 28, 316–319.
  17. Zhang C., Xuan Z., Otto S., Hover J. R., McCorkle S. R., Mandel G., Zhang M. Q. (2006). A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Research, 34, 2238–2246.
  18. Zhou H., Jin J., Zhang H., Yi B., Wozniak M., Wong L. (2012). IntPath—an integrated pathway gene relationship database for model organisms and important pathogens. BMC Systems Biology, 6(Suppl 2), S2.
