Abstract
Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, $\rho_j$, is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM, in which $\rho_j$ follows a probability distribution $h(\rho_j \mid \tau_j)$, brings more flexibility in explaining correlation variations among genetic variables. The Expectation–Maximization (EM) algorithm and the Stochastic Expectation–Maximization (SEM) algorithm are used to estimate the parameters of CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor–DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820). The results reveal distinct clusters of transcription factors binding to promoter regions of genes in the JAK–STAT, MAPK and two other pathways.
Keywords: Clustering, Compound hierarchical correlated beta mixture, Exchangeable correlation structure, EM and SEM algorithms
1. Introduction
For many genetic variables ranging between 0 and 1, beta distributions with two shape parameters can characterize complex distributions of genetic data, while mixture models can accommodate heterogeneity among genetic data by detecting underlying group structures and assigning different beta distributions to mixture components. Ji and others (2005) laid the groundwork of the beta mixture model (BMM), $\sum_{j=1}^{g} \pi_j \, \text{Beta}(\alpha_j, \beta_j)$, in bioinformatics and suggested that BMM can effectively detect underlying group structures. Studies show that BMM outperforms Gaussian mixture models when data are asymmetric or have complex distribution shapes (Bouguila and others, 2006; Ma and Leijon, 2011). Laurila and others (2011) developed BMM for methylation microarray data and showed that BMM substantially reduces the dimensionality of the data and can be applied for sample classification and to detect changes in methylation status between different samples and tissues. BMM has also been extended to single nucleotide polymorphism (SNP) analysis (Fu and others, 2011), cluster analysis (Dai and others, 2009), quantile normalization to correct probe design bias (Teschendorff and others, 2013), and pattern recognition and image processing (Bouguila and others, 2006; Ma and Leijon, 2011).
However, the existing BMM methods are based on an independence assumption. It is well known that genetic data can be highly correlated; ignoring the correlation structure in BMM may therefore bias statistical inference and inflate the Type I error rate. A compound hierarchical correlated beta mixture (CBM) is herein proposed to cluster a variety of genetic data. CBM utilizes a latent class variable $Z$ to model heterogeneity, constructs an exchangeable correlation structure for genetic data by introducing a latent variable $X$ to characterize underlying mechanisms that trigger correlations among genetic data, and then uses compound probability theory to derive a closed-form marginal distribution of the genetic data.
CBM allows the random variable $\mathbf{p}_i$ to be multivariate, with $n_i$ as the length of $\mathbf{p}_i$. The traditional BMM becomes the special univariate case of CBM with $n_i = 1$. The fundamental distinction between CBM and BMM is that CBM can cluster genetic vectors by treating $\mathbf{p}_i$ as the clustering unit. For instance, if a genetic variable is measured at 3 time points, $p_{i1}, p_{i2}, p_{i3}$, then CBM can construct a vector $\mathbf{p}_i = (p_{i1}, p_{i2}, p_{i3})$ with an exchangeable correlation structure and keep $(p_{i1}, p_{i2}, p_{i3})$ together in clustering, whereas the traditional BMM may assign $p_{i1}$, $p_{i2}$ and $p_{i3}$ to distinct mixture components.
CBM allows the correlation coefficient $\rho_j$ to be homogeneous within a mixture component and heterogeneous between mixture components. The correlation coefficient $\rho_j$ can also be treated as random with a prior distribution $h(\rho_j \mid \tau_j)$, which brings more flexibility in explaining correlation variations among genetic variables. The exchangeable correlation structure is suitable for modeling a wide range of genetic data. For instance: (1) genes within some pathways might be highly correlated, but genes from different pathways might be independent; (2) variants within a haplotype might be highly correlated, but variants from different haplotypes might be independent; (3) variants at close genome positions might be highly correlated, but variants at distant genome positions might be independent.
2. Method
2.1. Compound hierarchical CBM model
CBM is a hierarchical model built from formulae (2.1)–(2.3). Let the genetic vectors be $\mathbf{p}_i = (p_{i1}, \ldots, p_{in_i})$, $i = 1, \ldots, N$, where $n_i$ is the length of the $i$th vector. Introduce an identically and independently distributed (iid) multinomial latent variable $Z_i$ with

$$\Pr(Z_i = j) = \pi_j, \qquad j = 1, \ldots, g, \qquad \sum_{j=1}^{g} \pi_j = 1. \tag{2.1}$$

The latent variable $Z_i$ classifies $\mathbf{p}_i$ to the $j$th mixture component with probability $\pi_j$.
Let $X_i$ be a beta-binomial random variable such that

$$\Pr(X_i = x \mid Z_i = j) = \binom{m_j}{x} \frac{B(\alpha_j + x,\; \beta_j + m_j - x)}{B(\alpha_j, \beta_j)}, \qquad x = 0, 1, \ldots, m_j, \tag{2.2}$$

with parameters $m_j$, $\alpha_j > 0$ and $\beta_j > 0$. The latent variable $X_i$ models underlying mechanisms that trigger correlations among $p_{i1}, \ldots, p_{in_i}$.
Conditioning on $Z_i = j$ and $X_i = x$, let

$$p_{ik} \mid X_i = x,\; Z_i = j \;\overset{\text{iid}}{\sim}\; \text{Beta}(\alpha_j + x,\; \beta_j + m_j - x), \qquad k = 1, \ldots, n_i. \tag{2.3}$$

Since the beta family is conjugate to the binomial family (Minhajuddin and others, 2004), compounding (2.3) over $X_i$ yields a correlated beta distribution for $\mathbf{p}_i$. The latent variable $X_i$ in (2.3) is integrated out in the compound probability distribution of $\mathbf{p}_i$. Thus, the conditional distribution $\text{Beta}(\alpha_j + x, \beta_j + m_j - x)$ leads to the marginal distribution $\text{Beta}(\alpha_j, \beta_j)$ for each $p_{ik}$. See Proposition 1.
Since $p_{i1}, \ldots, p_{in_i}$ share the same $X_i$ in (2.3), correlation is induced among $p_{i1}, \ldots, p_{in_i}$. See Proposition 2.
Theorem 1 —
The marginal distribution of $\mathbf{p}_i$ follows a $g$-component CBM with the probability density function

$$f(\mathbf{p}_i) = \sum_{j=1}^{g} \pi_j \sum_{x=0}^{m_j} \binom{m_j}{x} \frac{B(\alpha_j + x,\, \beta_j + m_j - x)}{B(\alpha_j, \beta_j)} \prod_{k=1}^{n_i} \frac{p_{ik}^{\,\alpha_j + x - 1}\,(1 - p_{ik})^{\,\beta_j + m_j - x - 1}}{B(\alpha_j + x,\, \beta_j + m_j - x)}. \tag{2.4}$$
Proposition 1 —
Marginally, each $p_{ik}$ follows a beta mixture distribution: $p_{ik} \sim \sum_{j=1}^{g} \pi_j \, \text{Beta}(\alpha_j, \beta_j)$.

Proposition 2 —
$p_{ik}$ and $p_{il}$ ($k \neq l$) within $\mathbf{p}_i$ from the $j$th mixture component are correlated with the correlation coefficient $\rho_j = m_j / (\alpha_j + \beta_j + m_j)$.

Proposition 3 —
The traditional BMM $\sum_{j=1}^{g} \pi_j \, \text{Beta}(\alpha_j, \beta_j)$ is a special case of CBM when $m_j = 0$ for $j = 1, \ldots, g$; then $p_{ik}$ is independent of $p_{il}$ for any $k \neq l$.
Proofs can be found in Supplementary Materials available at Biostatistics online.
In summary, CBM is a hierarchical model that compounds the distribution of $\mathbf{p}_i$ over two latent variables, $Z_i$ and $X_i$. The latent variable $Z_i$ determines the number of mixture components, while the latent variable $X_i$ determines correlations within $\mathbf{p}_i$. Since $Z_i$ and $X_i$ are unobservable, the marginal distribution of $\mathbf{p}_i$ is fitted to data to cluster the genetic vectors $\mathbf{p}_1, \ldots, \mathbf{p}_N$ into different mixture components.
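To make the hierarchy concrete, the following is a minimal R sketch (not the authors' released code) of the marginal density (2.4); the function name `dcbm` and all argument names are illustrative.

```r
# CBM density (2.4): mixture over components j, compounded over the latent
# beta-binomial variable X. `a`, `b`, `m` hold (alpha_j, beta_j, m_j) and
# `w` holds the mixing weights pi_j; `p` is one genetic vector.
dcbm <- function(p, w, a, b, m, log = FALSE) {
  g <- length(w)
  comp <- numeric(g)
  for (j in seq_len(g)) {
    x <- 0:m[j]
    # beta-binomial probabilities Pr(X = x | Z = j) from (2.2)
    pbb <- choose(m[j], x) * beta(a[j] + x, b[j] + m[j] - x) / beta(a[j], b[j])
    # given X = x, entries of p are iid Beta(a_j + x, b_j + m_j - x) as in (2.3)
    f_x <- vapply(x, function(xx)
      prod(dbeta(p, a[j] + xx, b[j] + m[j] - xx)), numeric(1))
    comp[j] <- sum(pbb * f_x)
  }
  out <- sum(w * comp)
  if (log) log(out) else out
}

# Example: a two-component CBM evaluated at a genetic vector of length 5
dcbm(c(0.10, 0.22, 0.15, 0.31, 0.08),
     w = c(0.4, 0.6), a = c(0.7, 2.0), b = c(3.0, 2.0), m = c(2, 3))
```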
2.2. Random exchangeable correlation structure
The exchangeable correlation structure for $\mathbf{p}_i$ is $R_j = (1 - \rho_j) I + \rho_j \mathbf{1} \otimes \mathbf{1}^{\mathsf{T}}$, where $I$ is an identity matrix, $\mathbf{1}$ is a vector of 1s, $\otimes$ is the Kronecker product and $\mathsf{T}$ denotes the transpose. It might be too restrictive to fix the correlation coefficient $\rho_j$ for the $j$th mixture component, as genetic data often have complex correlation structures. To address this issue, one can treat $\rho_j$ as a random variable and insert $\rho_j$ as a hyperparameter into the hierarchical model.
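For a vector of length $n$ from the $j$th component, this matrix can be formed in one line of R (a small illustration; the helper name `exch_cor` is hypothetical):

```r
# Exchangeable correlation matrix: unit diagonal, common off-diagonal rho_j
exch_cor <- function(n, rho) (1 - rho) * diag(n) + rho * matrix(1, n, n)
exch_cor(4, 0.33)  # e.g. n = 4 variants with rho_j = 0.33
```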
The correlation of $\mathbf{p}_i$ is triggered by $X_i$; thus $\rho_j$ is determined by the three parameters $(m_j, \alpha_j, \beta_j)$ of the beta-binomial distribution of $X_i$. Since $X_i$ is a latent variable, the parameters $(m_j, \alpha_j, \beta_j)$ are determined by the marginal distribution of $\mathbf{p}_i$ fitted to data. This leads to a linkage between $\rho_j$ and $(m_j, \alpha_j, \beta_j)$: $\rho_j = m_j / (\alpha_j + \beta_j + m_j)$, or equivalently $m_j = \rho_j (\alpha_j + \beta_j) / (1 - \rho_j)$.
In the hierarchical model, introduce a probability function $h(\rho_j \mid \tau_j)$ with $\tau_j$ as the parameter(s) for the distribution of $\rho_j$. Then the value of $\rho_j$ determines the value of the parameter $m_j$ in the distribution of $X_i$. Below is the framework to construct a random CBM; a simulation sketch follows Theorem 2.

- Let $Z_i$ be iid with $\Pr(Z_i = j) = \pi_j$ for $j = 1, \ldots, g$ and $\sum_{j=1}^{g} \pi_j = 1$.
- For the $j$th mixture component, let $\rho_j \sim h(\rho_j \mid \tau_j)$.
- Let $m_j = \text{round}\{\rho_j (\alpha_j + \beta_j)/(1 - \rho_j)\}$, where $\text{round}(\cdot)$ is a function to round its argument to an integer.
- Let $X_i \mid Z_i = j$ follow the beta-binomial distribution $(m_j, \alpha_j, \beta_j)$ as in (2.2).
- Let $p_{ik} \mid X_i = x,\, Z_i = j \sim \text{Beta}(\alpha_j + x, \beta_j + m_j - x)$ for $k = 1, \ldots, n_i$.
Theorem 2 (Random CBM) —
The above framework generates a CBM with a random exchangeable correlation structure, where $\rho_j \sim h(\rho_j \mid \tau_j)$ and $m_j = \text{round}\{\rho_j (\alpha_j + \beta_j)/(1 - \rho_j)\}$ for $j = 1, \ldots, g$. The marginal probability distribution of $\mathbf{p}_i$ is

$$f(\mathbf{p}_i) = \sum_{j=1}^{g} \pi_j \int f_{\text{CB}}(\mathbf{p}_i \mid \alpha_j, \beta_j, \rho_j)\, h(\rho_j \mid \tau_j)\, d\rho_j, \tag{2.5}$$

where $f_{\text{CB}}$ stands for the probability function of a multivariate correlated beta distribution with random correlation coefficients. See Supplementary Materials for the proof (available at Biostatistics online). Propositions 1–3 also hold for the random CBM.
2.3. Expectation–Maximization algorithm and Stochastic Expectation–Maximization algorithm for parameter estimation
Let $z_{ij} = I(Z_i = j)$, where $I(\cdot)$ is an indicator function, measure whether $\mathbf{p}_i$ belongs to the $j$th mixture component. Since the latent variable $Z_i$ is unobservable, the Expectation–Maximization (EM) algorithm is implemented to estimate the parameters of CBM. The EM algorithm relies on the complete-data log-likelihood function and alternates between expectation (E) steps and maximization (M) steps. In the E-step, the latent variable $z_{ij}$ is replaced by the expectation of the posterior distribution of $Z_i$ given the data and the current parameter estimates. In the M-step, parameter estimates are updated using maximizers of the expected log-likelihood found in the E-step.
Let $\Theta = (\Theta_1, \ldots, \Theta_g)$ stand for the unknown parameters, where $\Theta_j = (\pi_j, \alpha_j, \beta_j, m_j)$ in CBM (2.4) and $\Theta_j = (\pi_j, \alpha_j, \beta_j, \tau_j)$ in the random CBM (2.5) for $j = 1, \ldots, g$. Write the log-likelihood function of $\Theta$ based on the joint distribution of $\mathbf{p}_i$ and $Z_i$ as

$$\ell(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{g} z_{ij} \left\{ \log \pi_j + \log f_j(\mathbf{p}_i \mid \Theta_j) \right\}, \tag{2.6}$$

where $f_j$ denotes the $j$th component density in (2.4) or (2.5).
- Step 1: Starting with iteration $t = 0$, generate initial values of $\Theta^{(0)}$. Set the initial mixing weights as $\pi_j^{(0)} = 1/g$.
- Step 2: E-step. Compute the expectation of $z_{ij}$ using the posterior distribution of $Z_i$:
$$\hat{z}_{ij} = \frac{\pi_j f_j(\mathbf{p}_i \mid \Theta_j)}{\sum_{l=1}^{g} \pi_l f_l(\mathbf{p}_i \mid \Theta_l)}.$$
Plug $\hat{z}_{ij}$ into (2.6) so the expected log-likelihood function becomes
$$Q(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{g} \hat{z}_{ij} \left\{ \log \pi_j + \log f_j(\mathbf{p}_i \mid \Theta_j) \right\}. \tag{2.7}$$
- Step 3: M-step. Find the $\alpha_j$ and $\beta_j$ that maximize (2.7) and update $\Theta$. The mixing weight is updated as $\hat{\pi}_j = \sum_{i=1}^{N} \hat{z}_{ij} / N$. Without a closed form, $(\hat{\alpha}_j, \hat{\beta}_j)$ can be determined using the numerical optimization function `optim` in R, which is freely available at http://www.r-project.org/.
- Step 4: Repeat the E-step and M-step until the change in the maximum of the log-likelihood (2.7) is within a tolerance value (say 0.1).
The EM algorithm only ensures convergence to a local maximum of the log-likelihood; therefore, it is critical to select appropriate initial values of $\Theta^{(0)}$. In Step 1, multiple sets of initial values of $\Theta^{(0)}$ are randomly generated, or one can perform a grid search in the parameter space. Then repeat Steps 1–4 and select the global maximizer of the log-likelihood.
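The following R skeleton illustrates Steps 1–4 for the fixed-correlation CBM (2.4). It is a sketch, not the released implementation; for brevity it holds each $m_j$ fixed rather than estimating it. `dcbm_j` evaluates the component-$j$ density, i.e. the inner sum of (2.4).

```r
dcbm_j <- function(p, a, b, m) {
  x <- 0:m
  pbb <- choose(m, x) * beta(a + x, b + m - x) / beta(a, b)
  sum(pbb * vapply(x, function(xx)
    prod(dbeta(p, a + xx, b + m - xx)), numeric(1)))
}

em_cbm <- function(P, g, theta, w, m, tol = 0.1, maxit = 200) {
  # P: N x n data matrix; theta: g x 2 matrix of (alpha_j, beta_j) start values
  N <- nrow(P); ll_old <- -Inf
  for (it in seq_len(maxit)) {
    # E-step: posterior membership probabilities z_ij
    f <- sapply(seq_len(g), function(j)
      apply(P, 1, dcbm_j, a = theta[j, 1], b = theta[j, 2], m = m[j]))
    z <- sweep(f, 2, w, "*"); z <- z / rowSums(z)
    # M-step: closed-form mixing weights; (alpha_j, beta_j) via optim()
    w <- colMeans(z)
    for (j in seq_len(g)) {
      nll <- function(th) -sum(z[, j] *
        log(apply(P, 1, dcbm_j, a = th[1], b = th[2], m = m[j])))
      theta[j, ] <- optim(theta[j, ], nll, method = "L-BFGS-B",
                          lower = c(1e-3, 1e-3))$par
    }
    ll <- sum(log(f %*% w))             # observed-data log-likelihood
    if (abs(ll - ll_old) < tol) break   # Step 4 stopping rule
    ll_old <- ll
  }
  list(w = w, theta = theta, m = m, z = z, loglik = ll, iter = it)
}
```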
To further prevent estimates from staying near saddle points of the likelihood function, the Stochastic EM (SEM) algorithm (Celeux and Diebolt, 1985) is considered. SEM incorporates a stochastic (S-) step between the E-step and the M-step. This step can be viewed as a Bayesian extension of the EM algorithm, and it allows SEM to stochastically explore the likelihood surface, partially avoiding convergence to the closest mode. Keep all steps of the EM algorithm; after Step 2, SEM adds the following step.
Step 2.2: S-step. Randomly generate $\tilde{z}_i = (\tilde{z}_{i1}, \ldots, \tilde{z}_{ig})$ from the multinomial distribution with probabilities $(\hat{z}_{i1}, \ldots, \hat{z}_{ig})$. In (2.7), replace $\hat{z}_{ij}$ by $\tilde{z}_{ij}$.
The sequence of SEM estimates forms an ergodic Markov chain, which converges in distribution to a unique stationary distribution as the number of iterations increases.
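A one-line illustration of the S-step in R (a hypothetical helper; `z` is the $N \times g$ matrix of posterior probabilities from the E-step):

```r
# Replace each posterior row by a hard draw from Multinomial(1, z[i, ])
s_step <- function(z) t(apply(z, 1, function(p) rmultinom(1, size = 1, prob = p)))
```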
2.4. Number of mixture components
The number of mixture components $g$ can be determined using the model selection criteria AIC, BIC and ICL-BIC (Ji and others, 2005). These criteria are based on the log-likelihood function with a penalty on the number of unknown parameters $d$ in a model. For CBM (2.4), $d = 4g - 1$. For the random CBM (2.5), $d = (3 + q)g - 1$, where $q$ is the number of unknown parameters in $h(\rho_j \mid \tau_j)$. Let $\hat{\ell}$ be the maximum of the log-likelihood function (2.6). Define

- Akaike information criterion: $\text{AIC} = -2\hat{\ell} + 2d$.
- Bayesian information criterion: $\text{BIC} = -2\hat{\ell} + d \log N$.
- Integrated classification likelihood BIC: $\text{ICL-BIC} = \text{BIC} + 2\,\text{EN}(\hat{z})$, where $\text{EN}(\hat{z}) = -\sum_{i=1}^{N} \sum_{j=1}^{g} \hat{z}_{ij} \log \hat{z}_{ij}$.

ICL-BIC is BIC plus twice an estimated entropy of the fuzzy classification matrix $(\hat{z}_{ij})$. Mixture models with different numbers of mixture components are fitted using the EM and SEM algorithms as described in Section 2.3; the model with the smallest AIC, BIC or ICL-BIC is chosen.
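The three criteria can be computed directly from an EM/SEM fit. The sketch below assumes the hypothetical `em_cbm` output from Section 2.3 and the parameter count $d = 4g - 1$ for CBM (2.4).

```r
ic_cbm <- function(fit, N) {
  g   <- length(fit$w)
  d   <- 4 * g - 1                                # pi, alpha, beta, m per component
  ent <- -sum(fit$z * log(pmax(fit$z, 1e-12)))    # entropy of fuzzy classification
  c(AIC     = -2 * fit$loglik + 2 * d,
    BIC     = -2 * fit$loglik + d * log(N),
    ICL_BIC = -2 * fit$loglik + d * log(N) + 2 * ent)
}
```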
3. Simulation
Under the regularity conditions, the MLEs $\hat{\Theta}$ and $\hat{\rho}_j$ of the CBM are consistent estimators of the true parameters $\Theta$ and $\rho_j$, i.e. $\hat{\Theta} \to \Theta$ and $\hat{\rho}_j \to \rho_j$ in probability. Five CBM models are presented to illustrate estimation consistency:

- Model 1: a two-component CBM with both components correlated ($\rho_1, \rho_2 > 0$).
- Model 2: a two-component CBM with two well-separated, correlated components.
- Model 3: a three-component CBM with all components correlated ($\rho_1, \rho_2, \rho_3 > 0$).
- Model 4: a two-component CBM with independent components ($\rho_1 = \rho_2 = 0$).
- Model 5: a three-component CBM with one independent component ($\rho_1 = 0$; $\rho_2, \rho_3 > 0$).

Models 1 and 2 are two-component mixture models with 500 genetic vectors and 5 variants in each vector. Model 3 is a three-component mixture model with 100 genetic vectors and 50 variants in each vector. Each genetic vector is treated as a multivariate random variable with correlation coefficient $\rho_j$ in the $j$th mixture component. In Model 4, variables within both mixture components are independent ($\rho_1 = \rho_2 = 0$). In Model 5, variables within the first mixture component are independent and variables within the other two mixture components are correlated ($\rho_1 = 0$; $\rho_2, \rho_3 > 0$).
The distribution of variables from Model 3 is depicted in Figure 1. Figure 1(a) shows a scatter plot between $p_{i1}$ (labeled "Genetic variant 1") and $p_{i2}$ (labeled "Genetic variant 2") for $i = 1, \ldots, 100$. These two variables follow a three-component mixture distribution. The Q–Q plot in Figure 1(b) is very close to the diagonal line, indicating that variants $p_{i1}$ and $p_{i2}$ have the same distribution. Figure 1(c) is a 2D histogram for variants $p_{i1}$ and $p_{i2}$. Figure 1(d) shows a three-component beta mixture for the variant $p_{i1}$. Figures 1(a)–(d) jointly confirm that the genetic variants $p_{i1}, \ldots, p_{i50}$ have the same marginal beta mixture distribution, which confirms Proposition 1. In Figure 1(d), both CBM and BMM are fitted to the histogram of the variant $p_{i1}$. BMM fits a close density function to the histogram, but this method does not consider correlation among variants; thus $\rho_j = 0$ was set in BMM. Remarkably, CBM not only provides a close fit of the density function to the histogram, but also accurately estimates the correlation within the 3 mixture components, with $\hat{\rho}_j$ correct to 2 decimal digits (Table 1).
Fig. 1.
Distributions of variables in the simulated CBM Model 3. (a) Correlation plot. (b) Q–Q plot. (c) 2D histogram. (d) Histogram.
Table 2.
Comparison of CBM and BMM using the EM and SEM algorithms

|  | Model 1 |  | Model 2 |  | Model 3 |  | Model 4 |  | Model 5 |  |
|---|---|---|---|---|---|---|---|---|---|---|
|  | CBM | BMM | CBM | BMM | CBM | BMM | CBM | BMM | CBM | BMM |
| EM algorithm |  |  |  |  |  |  |  |  |  |  |
| MSE($\hat{\alpha}_1$) | 0.002 | 0.006 | 0.003 | 0.006 | 0.0003 | 0.002 | 0.01 | 0.01 | 0.02 | 0.02 |
| MSE($\hat{\alpha}_2$) | 0.45 | 2.54 | 0.64 | 4.08 | 0.02 | 2.02 | 0.04 | 0.15 | 0.48 | 0.35 |
| MSE($\hat{\alpha}_3$) | — | — | — | — | 0.18 | 0.49 | — | — | 0.66 | 0.78 |
| MSE($\hat{\beta}_1$) | 0.17 | 0.34 | 0.11 | 0.63 | 0.10 | 1.90 | 0.50 | 1.17 | 0.47 | 2.13 |
| MSE($\hat{\beta}_2$) | 0.03 | 0.06 | 0.02 | 0.14 | 0.02 | 1.46 | 0.07 | 0.06 | 0.17 | 0.33 |
| MSE($\hat{\beta}_3$) | — | — | — | — | 0.003 | 0.06 | — | — | 0.12 | 0.16 |
| MSE($\hat{\rho}_1$)† | 0.004 | — | 0.0009 | — | 0.001 | — | 0.002 | — | 0.001 | — |
| MSE($\hat{\rho}_2$)† | 0.003 | — | 0.003 | — | <0.0001 | — | 0.002 | — | 0.003 | — |
| MSE($\hat{\rho}_3$)† | — | — | — | — | 0.001 | — | — | — | 0.005 | — |
| MSE($\hat{\pi}_1$) | 0.004 | 0.004 | <0.0001 | 0.0007 | 0.001 | 0.002 | 0.002 | 0.003 | 0.01 | 0.03 |
| MSE($\hat{\pi}_2$) | 0.004 | 0.004 | <0.0001 | 0.0007 | 0.005 | 0.003 | 0.002 | 0.003 | 0.01 | 0.02 |
| MSE($\hat{\pi}_3$) | — | — | — | — | 0.002 | 0.003 | — | — | 0.001 | 0.002 |
| ICL-BIC | — | 1627 | — | — | — | 5201 | — | 194 | — | 387 |
| Accuracy | 0.87 | 0.83 | 0.99 | 0.97 | 0.96 | 0.78 | 0.99 | 0.84 | 0.79 | 0.65 |
| SEM algorithm |  |  |  |  |  |  |  |  |  |  |
| MSE($\hat{\alpha}_1$) | 0.01 | 0.08 | 0.01 | 0.06 | 0.01 | 0.03 | 0.002 | 0.012 | 0.16 | 0.17 |
| MSE($\hat{\alpha}_2$) | 4.30 | 4.02 | 2.04 | 1.20 | 1.30 | 1.59 | 0.069 | 0.177 | 0.24 | 0.18 |
| MSE($\hat{\alpha}_3$) | — | — | — | — | 3.32 | 2.78 | — | — | 0.06 | 0.16 |
| MSE($\hat{\beta}_1$) | 0.20 | 2.99 | 0.82 | 3.61 | 1.31 | 3.63 | 0.18 | 0.04 | 0.16 | 0.17 |
| MSE($\hat{\beta}_2$) | 0.25 | 0.59 | 0.14 | 0.21 | 1.66 | 4.85 | 0.06 | 0.10 | 0.24 | 0.18 |
| MSE($\hat{\beta}_3$) | — | — | — | — | 0.12 | 0.39 | — | — | 0.06 | 0.16 |
| MSE($\hat{\rho}_1$)† | 0.002 | — | 0.0006 | — | 0.0008 | — | 0.001 | — | 0 | — |
| MSE($\hat{\rho}_2$)† | 0.001 | — | 0.0014 | — | 0.002 | — | 0 | — | 0 | — |
| MSE($\hat{\rho}_3$)† | — | — | — | — | 0.01 | — | — | — | 0.1 | — |
| MSE($\hat{\pi}_1$) | 0.002 | 0.02 | 0.0002 | 0.001 | 0.005 | 0.002 | 0.0004 | 0.0007 | 0.01 | 0.02 |
| MSE($\hat{\pi}_2$) | 0.002 | 0.02 | 0.0002 | 0.001 | 0.02 | 0.008 | 0.0004 | 0.0007 | 0.01 | 0.02 |
| MSE($\hat{\pi}_3$) | — | — | — | — | 0.007 | 0.003 | — | — | 0.003 | 0.002 |
| ICL-BIC | — | 1130 | −4126 | −1764 | — | 3353 | — | 170 | — | 367 |
| Accuracy | 0.88 | 0.82 | 0.99 | 0.98 | 0.88 | 0.77 | 0.99 | 0.85 | 0.76 | 0.67 |

†$\rho_j$ is estimated by CBM only; BMM sets $\rho_j = 0$.
Lower MSE indicates less estimation bias.
Lower AIC, BIC and ICL-BIC indicate a better model fit.
Higher accuracy indicates better classification into mixture components.
Table 1.
Exchangeable correlation architecture modeled by CBM. The correlation coefficient from the $j$th mixture component can be either fixed or a random variable with probability $h(\rho_j \mid \tau_j)$

$$R_j = (1 - \rho_j) I + \rho_j \mathbf{1}\mathbf{1}^{\mathsf{T}} = \begin{pmatrix} 1 & \rho_j & \cdots & \rho_j \\ \rho_j & 1 & \cdots & \rho_j \\ \vdots & \vdots & \ddots & \vdots \\ \rho_j & \rho_j & \cdots & 1 \end{pmatrix}, \qquad \rho_j = \frac{m_j}{\alpha_j + \beta_j + m_j}.$$
Generate $B$ data sets for each model. The mean square error (MSE) for each parameter $\theta$ is $\text{MSE}(\hat{\theta}) = \frac{1}{B} \sum_{b=1}^{B} (\hat{\theta}_b - \theta)^2$. Here, $\hat{\theta}_b$ refers to the MLE of the parameter in the $b$th data set, estimated through the EM or SEM algorithm. A smaller MSE indicates a smaller bias. As shown in Table 2, CBM provides very accurate estimates of the correlation coefficients $\rho_j$, with MSE $\leq 0.01$ for all 5 simulated models. When compared with BMM, CBM yields lower MSE for the mixing weights and mixture component parameters. For instance, $\text{MSE}(\hat{\alpha}_2) = 1.30$ for CBM is lower than $\text{MSE}(\hat{\alpha}_2) = 1.59$ for BMM in Model 3 using the SEM algorithm.
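For reference, the Monte Carlo MSE above is a one-liner in R; `est` would hold the $B$ replicate estimates of a single parameter and `truth` its simulated true value (illustrative names):

```r
# Mean square error over B simulation replicates
mse <- function(est, truth) mean((est - truth)^2)
```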
Models 1–3 are fully correlated, with varying correlation among mixture components; Model 5 is partially correlated and Model 4 is fully independent. Disregarding the correlation structure and fitting BMM to these models leads to higher MSE in parameter estimation, poorer model fit and lower prediction accuracy (see Table 2). AIC, BIC and ICL-BIC are used to compare goodness of fit. For all 5 models, CBM yields lower AIC, BIC and ICL-BIC values than BMM. For instance, in Model 2 using the SEM algorithm, ICL-BIC $= -4126$ for CBM and ICL-BIC $= -1764$ for BMM.
Mixture models provide a principled probabilistic framework for prediction and classification. The classification accuracy of CBM is higher than that of BMM in all 5 models. Using the SEM algorithm, CBM increases the classification accuracy by 0.06 in Model 1, 0.11 in Model 3, 0.14 in Model 4 and 0.09 in Model 5. In Model 2, the two mixture components are well separated, so both CBM and BMM achieve high classification accuracy.
The performance of the EM and SEM algorithms was compared across the 5 models. The SEM algorithm can prevent estimates from staying near saddle points of the likelihood function, but it also generates more variation in the estimates and takes more iterations for the sequence to converge. In Table 2, the MSE of $\hat{\alpha}_2$ in Models 1 and 2 and the MSE of $\hat{\alpha}_3$ in Model 3 are $>1$ under the SEM algorithm. These $\hat{\alpha}_j$ are the mixture parameters with the smaller mixing weights ($\pi_j = 0.3$, 0.15 and 0.2, respectively), which indicates that the SEM algorithm may increase the estimation variance for mixture parameters with smaller mixing weights. Because the SEM algorithm adds a stochastic step to the EM algorithm to prevent estimates from sticking around a small neighborhood of a fixed value, it may increase the variation in parameter estimation, and this especially affects mixture parameters with smaller mixing weights. We suggest using the SEM algorithm to prevent estimates from staying near saddle points. Investigators may consider using the EM algorithm for data with a small size and small mixing weights.
Three model selection criteria, AIC, BIC and ICL-BIC, were compared in simulation. AIC does not take the sample size into account, thus it tends to select more complicated models. ICL-BIC tends to select a parsimonious model. The simulation results show that ICL-BIC and BIC are appropriate tools for model selection.
4. Case study: genome-wide analysis of binding probability
We apply CBM to cluster transcription factors based on their binding probabilities in mouse genome data. Transcription factors are DNA-binding proteins that regulate gene expression levels by binding to promoter regions proximal to gene transcription start sites or to more distal enhancer regions that regulate expression through long-range interactions (Rye and others, 2011). Transcription factor binding varies between cell types (Levine and Tjian, 2003; Zhang and others, 2006). Lahdesmaki and others (2008) constructed a test set of annotated binding sites in mouse promoters from existing databases, including TRANSFAC, ORegAnno, the human genome browser at UCSC and ABS (Wingender and others, 2000; Kent and others, 2002; Blanco and others, 2006; Montgomery and others, 2006). A likelihood-based binding prediction method was applied to the 2K base pair upstream promoter regions of all 20,397 mouse genes, where the genomic locations of the promoters were based on RefSeq gene annotations. Evolutionary conservation was used as an additional data source, and binding specificities for 266 transcription factors were taken from TRANSFAC Professional version 10.3. The full binding probability results for all mouse transcription factor–gene pairs (a 20,397 by 266 table) are available at http://xerad.systemsbiology.net/ProbTF/. These genome-wide results indicate that it is rare to observe a high binding probability between transcriptional regulators and their target promoters. See Supplementary Materials for details (available at Biostatistics online).
Let $p_{ij}$ be the binding probability of the $i$th transcription factor to the promoter region of the $j$th gene. To make the clustering more biologically meaningful, we select genetic pathways extracted from the IntPath pathway database (Zhou and others, 2012). The online toolkit of the IntPath program provides a total of 550 pathways for the "M. musculus" species. Let $\mathbf{p}_i = (p_{i1}, \ldots, p_{in})$ be the vector of binding probabilities of the $i$th transcription factor to the $n$ genes in a pathway. The number of genes, $n$, varies by pathway. The goal is to apply CBM to cluster the transcription factors ($\mathbf{p}_i$, $i = 1, \ldots, 266$) based on their patterns of binding probabilities to all genes in a pathway. The CBM clustering is performed separately for each pathway, so the clustering results differ across pathways. BMM can cluster the binding probabilities between each transcription factor and each gene, but BMM cannot cluster the binding probabilities between each transcription factor and a pathway.
For each genetic pathway, we fit CBM with varying numbers of mixture components using the EM algorithm. The optimal number of mixture components was determined by BIC. We then apply the posterior probabilities to cluster transcription factors into mixture components; a post-processing sketch is given below. To make the clustering more biologically meaningful, we select 2 transcription factor pathways: the JAK–STAT signaling pathway and the MAPK cascade. The JAK–STAT pathway is a signaling cascade whose evolutionarily conserved roles include cell proliferation and hematopoiesis. The mitogen-activated protein kinases (MAPKs) belong to a large family of serine/threonine protein kinases that are conserved in organisms as diverse as yeast and humans. We also select 2 pathways that are not transcription factor pathways. The term information for these 4 pathways is listed in Table S1 in the Supplementary Materials available at Biostatistics online.
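This per-pathway pipeline can be sketched in a few lines of R, reusing the hypothetical `em_cbm` and `ic_cbm` helpers from Sections 2.3 and 2.4; `P` holds one row of binding probabilities per transcription factor for the genes in one pathway, and the start values `theta0`, `w0`, `m0` are assumptions for illustration.

```r
# Fit CBM for g = 1..4 components, select g by BIC, then classify by posterior
fits <- lapply(1:4, function(g) em_cbm(P, g, theta0[[g]], w0[[g]], m0[[g]]))
best <- fits[[which.min(vapply(fits, function(f)
  ic_cbm(f, nrow(P))["BIC"], numeric(1)))]]
cluster <- max.col(best$z)   # MAP cluster label for each transcription factor
table(cluster)               # cluster sizes, cf. Table 3
```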
The clustering of transcription factors is based on their binding probabilities to a pathway. For the JAK–STAT pathway and the histidine metabolism pathway, CBM identified 3 clusters of transcription factors. For the MAPK pathway and superpathway of serine and glycine biosynthesis I, CBM identified 2 clusters of transcription factors. The estimates of parameters within each mixture component are listed in Table 3.
Table 3.
Parameter estimates of CBM fitted to transcription factor binding probability data

| Pathway | Parameter | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|---|
| JAK–STAT signaling pathway | $\hat{\pi}_j$ | 0.38 | 0.07 | 0.55 |
|  | $\hat{\alpha}_j$ | 0.70 | 1.27 | 0.65 |
|  | $\hat{\beta}_j$ | 3.34 | 3.26 | 0.41 |
|  | $\hat{m}_j$ | 2 | 3 | 2 |
|  | $\hat{\rho}_j$ | 0.33 | 0.40 | 0.65 |
|  | # transcription factors | 99 | 20 | 147 |
| MAPK cascade | $\hat{\pi}_j$ | 0.04 | 0.96 |  |
|  | $\hat{\alpha}_j$ | 0.17 | 0.44 |  |
|  | $\hat{\beta}_j$ | 3.61 | 0.36 |  |
|  | $\hat{m}_j$ | 9 | 1 |  |
|  | $\hat{\rho}_j$ | 0.70 | 0.56 |  |
|  | # transcription factors | 11 | 255 |  |
| Superpathway of serine and glycine biosynthesis I | $\hat{\pi}_j$ | 0.42 | 0.58 |  |
|  | $\hat{\alpha}_j$ | 0.73 | 0.40 |  |
|  | $\hat{\beta}_j$ | 3.24 | 0.60 |  |
|  | $\hat{m}_j$ | 1 | 0 |  |
|  | $\hat{\rho}_j$ | 0.20 | 0 |  |
|  | # transcription factors | 126 | 140 |  |
| Histidine metabolism | $\hat{\pi}_j$ | 0.69 | 0.25 | 0.06 |
|  | $\hat{\alpha}_j$ | 0.58 | 0.30 | 0.34 |
|  | $\hat{\beta}_j$ | 2.33 | 1.85 | 0.57 |
|  | $\hat{m}_j$ | 0 | 6 | 0 |
|  | $\hat{\rho}_j$ | 0 | 0.74 | 0 |
|  | # transcription factors | 186 | 63 | 17 |
Number of transcription factors classified to mixture components by CBM.
CBM reveals very different patterns among the clusters (see the density curves in Figure 2). For each pathway, one cluster often contained a small number of transcription factors with a distinctive binding pattern compared with the larger clusters of transcription factors.
Fig. 2.
Histogram of transcription factor–DNA binding probability and fitted probability density distribution. (a) JAK–STAT. (b) MAPK. (c) Serine and glycine. (d) Histidine metabolism.
The marginal distribution of binding probabilities among the 266 transcription factors is displayed in Figure 2. The fitted probability density function for all transcription factors (the solid curve) suggests that most binding probabilities are near 0 and only a small number are close to 1. This confirms the sparse connectivity between transcriptional regulators and their target promoters. The fitted density for transcription factors within each cluster is depicted by a dashed line. These density curves indicate that transcription factors from different clusters have distinct binding patterns.
Since researchers are particularly interested in high binding probabilities, the patterns among these clusters, especially the probability density near binding probability 1, yield interesting findings regarding transcriptional regulation. For instance, in the superpathway of serine and glycine biosynthesis I, transcription factors in Cluster 2 have higher binding probabilities than those in Cluster 1, as the right tail of the density for Cluster 2 is much higher than that for Cluster 1. For the histidine metabolism pathway, Cluster 2 has a binding peak near 0.3, indicating that this cluster of transcription factors has more moderate binding probabilities compared with the other clusters. For the JAK–STAT pathway, Cluster 3 exhibits a nearly U-shaped distribution of binding probabilities with 2 peaks, one near 0 and one near 1, indicating that this cluster has more high binding probabilities (near 1) than the other clusters.
5. Discussion
A hierarchical beta mixture with an exchangeable correlation structure is proposed to cluster multivariate genetic data. CBM is important for the following reasons: (1) Assuming iid genetic data may be unreasonable in real-world applications, yet much of the existing methodology does so. The present methodology allows for both correlation in genetic data and heterogeneity in such correlations, even if there is no heterogeneity in location. (2) With most existing methodology, there is no guarantee that related observations will be assigned to the same mixture component, even when it would not make sense for them to be assigned to different mixture components. The present methodology ensures that related observations are assigned to the same mixture component. Moreover, the quality of such assignments can be improved and informed by the heterogeneity in correlation.
CBM has a closed-form probability density function because of compound probability theory. The correlation coefficient $\rho_j$ is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM with $\rho_j \sim h(\rho_j \mid \tau_j)$ brings more flexibility in explaining correlation variations among genetic variables.
The results of the simulation and case study show that CBM yields more accurate prediction and classification of group structures. For instance, in Model 3 with the EM algorithm, CBM attains 96% classification accuracy while BMM attains only 78%. Correlated variants from a gene can be grouped together as genetic vectors, which allows the unsupervised clustering to operate on genetic vectors and prevents assigning variants from one genetic vector to different mixture components.
The R code for CBM is freely available at http://d.web.umkc.edu/daih/.
Supplementary material
Supplementary Material is available at http://biostatistics.oxfordjournals.org.
Funding
The work of H.D. was supported in part by R01(DK100779, PI: S Patton) awarded by the National Institute of Diabetes and Digestive and Kidney Diseases.
Acknowledgement
We gratefully acknowledge the helpful comments from three referees and the Associate Editor. We thank the HPC system engineer, Shane Corder, for his bioinformatics and computing support in this project. Conflict of Interest: None declared.
References
- Blanco E., Farre D., Alba M. M., Messeguer X., Guigo R. (2006). ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research, 34, D63–D67.
- Bouguila N., Ziou D., Monga E. (2006). Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statistics and Computing, 16, 215–225.
- Celeux G., Diebolt J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73–82.
- Dai X., Erkkila T., Yli-Harja O., Lahdesmaki H. (2009). A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data. BMC Bioinformatics, 10, 165.
- Fu R., Dey D. K., Holsinger K. E. (2011). A beta-mixture model for assessing genetic population structure. Biometrics, 67, 1073–1082.
- Ji Y., Wu C., Liu P., Wang J., Coombes K. R. (2005). Applications of beta-mixture models in bioinformatics. Bioinformatics, 21, 2118–2122.
- Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M., Haussler D. (2002). The human genome browser at UCSC. Genome Research, 12, 996–1006.
- Lahdesmaki H., Rust A. G., Shmulevich I. (2008). Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820.
- Laurila K., Oster B., Andersen C. L., Lamy P., Orntoft T., Yli-Harja O., Wiuf C. (2011). A beta-mixture model for dimensionality reduction, sample classification and analysis. BMC Bioinformatics, 12, 215.
- Levine M., Tjian R. (2003). Transcription regulation and animal diversity. Nature, 424, 147–151.
- Ma Z., Leijon A. (2011). Bayesian estimation of beta mixture models with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2160–2173.
- Minhajuddin A., Harris I., Schucany W. (2004). Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation, 74, 599–607.
- Montgomery S. B., Griffith O. L., Sleumer M. C., Bergman C. M., Bilenky M., Pleasance E. D., Prychyna Y., Zhang X., Jones S. J. (2006). ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics, 22, 637–640.
- Rye M., Saetrom P., Handstad T., Drablos F. (2011). Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements. BMC Biology, 9, 80.
- Teschendorff A. E., Marabita F., Lechner M., Bartlett T., Tegner J., Gomez-Cabrero D., Beck S. (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics, 29, 189–196.
- Wingender E., Chen X., Hehl R., Karas H., Liebich I., Matys V., Meinhardt T., Pruss M., Reuter I., Schacherer F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research, 28, 316–319.
- Zhang C., Xuan Z., Otto S., Hover J. R., McCorkle S. R., Mandel G., Zhang M. Q. (2006). A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Research, 34, 2238–2246.
- Zhou H., Jin J., Zhang H., Yi B., Wozniak M., Wong L. (2012). IntPath—an integrated pathway gene relationship database for model organisms and important pathogens. BMC Systems Biology, 6(Suppl 2), S2.