Biostatistics (Oxford, England). 2015 May 11;16(4):641–654. doi: 10.1093/biostatistics/kxv016

Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data

Hongying Dai 1,*, Richard Charnigo 2
PMCID: PMC4701176  PMID: 25964663

Abstract

Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, ρ_j, is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM, in which ρ_j follows a probability distribution, brings more flexibility in explaining correlation variations among genetic variables. The Expectation–Maximization (EM) algorithm and the Stochastic Expectation–Maximization (SEM) algorithm are used to estimate the parameters of CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor–DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK–STAT, MAPK and two other pathways.

Keywords: Cluster, Compound hierarchical correlated beta mixture, Exchangeable correlation structure, EM and SEM algorithms

1. Introduction

For many genetic variables ranging between 0 and 1, beta distributions with two shape parameters can characterize complex distributions of genetic data, while mixture models can accommodate the heterogeneity among genetic data by detecting underlying group structures and assigning different beta distributions to mixture components. Ji and others (2005) laid the groundwork for the beta mixture model (BMM), Σ_j π_j Beta(α_j, β_j), in bioinformatics and suggested that BMM can effectively detect underlying group structures. Studies show that BMM outperforms Gaussian mixture models when data are asymmetric or have complex distributional shapes (Bouguila and others, 2006; Ma and Leijon, 2011). Laurila and others (2011) developed a BMM for methylation microarray data and showed that it substantially reduces the dimensionality of the data and can be applied to sample classification and to detecting changes in methylation status between different samples and tissues. BMM has also been extended to single nucleotide polymorphism (SNP) analysis (Fu and others, 2011), cluster analysis (Dai and others, 2009), quantile normalization to correct probe design bias (Teschendorff and others, 2013), and pattern recognition and image processing (Bouguila and others, 2006; Ma and Leijon, 2011).

However, the existing BMM methods are based on an independence assumption. It is well known that genetic data can be highly correlated, so ignoring the correlation structure in BMM may lead to bias in statistical inference and inflation of the Type I error rate. A compound hierarchical correlated beta mixture (CBM) is herein proposed to cluster a variety of genetic data. CBM utilizes a latent class variable Z_i to model heterogeneity, constructs an exchangeable correlation structure for genetic data by introducing a latent variable λ_i that characterizes underlying mechanisms triggering correlations among genetic data, and then uses compound probability theory to derive a closed-form marginal distribution of the genetic data.

CBM allows the random variable X_i to be multivariate, with n_i the length of X_i. The traditional BMM becomes the special univariate case of CBM with n_i = 1. The fundamental distinction between CBM and BMM is that CBM can cluster genetic vectors by treating X_i as the clustering unit. For instance, if a variable is measured at 3 time points, then CBM can construct a vector X_i = (X_{i1}, X_{i2}, X_{i3}) with an exchangeable correlation structure and keep (X_{i1}, X_{i2}, X_{i3}) together in clustering, whereas the traditional BMM may assign X_{i1}, X_{i2} and X_{i3} to distinct mixture components.

CBM allows the correlation coefficient ρ_j to be homogeneous within a mixture component and heterogeneous between mixture components. The correlation coefficient ρ_j can also be treated as random with a prior distribution g(ρ_j; τ_j), which brings more flexibility in explaining correlation variations among genetic variables. The exchangeable correlation structure is suitable for modeling a wide range of genetic data. For instance: (1) genes within some pathways might be highly correlated, but genes from different pathways might be independent; (2) variants within a haplotype might be highly correlated, but variants from different haplotypes might be independent; (3) variants at nearby genome positions might be highly correlated, but variants at distant genome positions might be independent.

2. Method

2.1. Compound hierarchical CBM model

CBM is a hierarchical model defined by formulae (2.1)–(2.3). Let the genetic vectors be X_i = (X_{i1}, ..., X_{in_i}), i = 1, ..., N, where n_i is the length of the ith vector. Introduce an independently and identically distributed (iid) multinomial latent variable Z_i with

P(Z_i = j) = \pi_j, \quad j = 1, \ldots, g, \qquad \sum_{j=1}^{g} \pi_j = 1.    (2.1)

The latent variable Z_i classifies X_i into the jth mixture component with probability π_j.

Let λ_i be a beta-binomial random variable such that

\lambda_i \mid Z_i = j \sim \text{Beta-Binomial}(m_j, \alpha_j, \beta_j), \qquad P(\lambda_i = \lambda \mid Z_i = j) = \binom{m_j}{\lambda} \frac{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}{B(\alpha_j, \beta_j)}, \quad \lambda = 0, 1, \ldots, m_j,    (2.2)

with parameters m_j, α_j > 0 and β_j > 0, where B(·,·) denotes the beta function. The latent variable λ_i models underlying mechanisms that trigger correlations among X_{i1}, ..., X_{in_i}.

Conditioning on Z_i and λ_i, let

X_{ik} \mid Z_i = j, \lambda_i \sim \mathrm{Beta}(\alpha_j + \lambda_i,\ \beta_j + m_j - \lambda_i), \quad \text{independently for } k = 1, \ldots, n_i.    (2.3)

Since the beta family is conjugate to the binomial family (Minhajuddin and others, 2004), compounding X_{ik} on λ_i yields a correlated beta distribution for X_i. The latent variable λ_i in (2.3) is integrated out in the compound probability distribution of X_i. Thus, the conditional distribution in (2.3) leads marginally to X_{ik} | Z_i = j ~ Beta(α_j, β_j). See Proposition 1.

Since X_{i1}, ..., X_{in_i} share the same λ_i in (2.3), X_{ik} is correlated with X_{ik'} for k ≠ k' within X_i. See Proposition 2.

Theorem 1 —

The marginal distribution of X_i follows a g-component CBM with probability density function

f(x_i) = \sum_{j=1}^{g} \pi_j \sum_{\lambda=0}^{m_j} \binom{m_j}{\lambda} \frac{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}{B(\alpha_j, \beta_j)} \prod_{k=1}^{n_i} \frac{x_{ik}^{\alpha_j + \lambda - 1} (1 - x_{ik})^{\beta_j + m_j - \lambda - 1}}{B(\alpha_j + \lambda,\ \beta_j + m_j - \lambda)}.    (2.4)

Proposition 1 —

Marginally, X_{ik} follows a beta mixture distribution:

X_{ik} \sim \sum_{j=1}^{g} \pi_j\, \mathrm{Beta}(\alpha_j, \beta_j).

Proposition 2 —

X_{ik} and X_{ik'} (k ≠ k') within X_i from the jth mixture component are correlated, with correlation coefficient

\rho_j = \frac{m_j}{\alpha_j + \beta_j + m_j}.

Proposition 3 —

The traditional BMM, Σ_j π_j Beta(α_j, β_j), is a special case of CBM when m_j = 0 for j = 1, ..., g; in that case X_{ik} is independent of X_{ik'} for any k ≠ k'.

Proofs can be found in Supplementary Materials available at Biostatistics online.
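
For intuition, the correlation in Proposition 2 follows from the law of total covariance under (2.2) and (2.3); this is a brief sketch of the argument, with the full proof in the Supplementary Materials. Conditional on Z_i = j (writing λ for λ_i), E(X_{ik} | λ) = (α_j + λ)/(α_j + β_j + m_j) and X_{ik}, X_{ik'} are independent given λ, so

  Cov(X_{ik}, X_{ik'}) = Var{E(X_{ik} | λ)} = Var(λ)/(α_j + β_j + m_j)^2 = m_j α_j β_j / {(α_j + β_j)^2 (α_j + β_j + 1)(α_j + β_j + m_j)},

using the beta-binomial variance Var(λ) = m_j α_j β_j (α_j + β_j + m_j)/{(α_j + β_j)^2 (α_j + β_j + 1)}. Dividing by Var(X_{ik}) = α_j β_j/{(α_j + β_j)^2 (α_j + β_j + 1)}, the variance of the marginal Beta(α_j, β_j) distribution in Proposition 1, gives ρ_j = m_j/(α_j + β_j + m_j).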

In summary, CBM is a hierarchical model that compounds the distribution of X_i over 2 latent variables, Z_i and λ_i. The latent variable Z_i determines the mixture component membership while the latent variable λ_i determines the correlations within X_i. Since Z_i and λ_i are unobservable, the marginal distribution of X_i is fitted to data to cluster the genetic vectors X_i into different mixture components.
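
The hierarchy (2.1)–(2.3) is easy to simulate from, which also clarifies how the two latent variables act. The following R sketch draws Z_i, then λ_i, then the vector X_i; the function name rcbm and all parameter values are illustrative (not the authors' released code), and a common vector length n and fixed m_j are assumed.

  rcbm <- function(N, n, pi_j, alpha, beta, m) {
    g <- length(pi_j)
    Z <- sample.int(g, N, replace = TRUE, prob = pi_j)               # (2.1): latent class labels
    X <- vector("list", N)
    for (i in 1:N) {
      j <- Z[i]
      p <- rbeta(1, alpha[j], beta[j])                               # beta-binomial draw via beta mixing
      lambda <- rbinom(1, m[j], p)                                   # (2.2): latent count shared by the vector
      X[[i]] <- rbeta(n, alpha[j] + lambda, beta[j] + m[j] - lambda) # (2.3): correlated beta variates
    }
    list(X = X, Z = Z)
  }

  # Illustrative values; within component j, corr(X_ik, X_ik') = m_j/(alpha_j + beta_j + m_j),
  # and m_j = 0 reduces to the independent BMM of Proposition 3.
  sim <- rcbm(N = 500, n = 5, pi_j = c(0.7, 0.3), alpha = c(1, 2), beta = c(3, 2), m = c(2, 0))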

2.2. Random exchangeable correlation structure

The exchangeable correlation structure for X_i in the jth mixture component is Corr(X_i | Z_i = j) = (1 - ρ_j) I_{n_i} + ρ_j (1_{n_i} ⊗ 1_{n_i}^T), where I_{n_i} is an identity matrix, 1_{n_i} is a vector of 1s, ⊗ is the Kronecker product and T denotes transpose. It might be too restrictive to fix the correlation coefficient ρ_j for the jth mixture component, as genetic data often have complex correlation structures. To address this issue, one can treat ρ_j as a random variable and insert its hyperparameter(s) τ_j into the hierarchical model.

The correlation within X_i is triggered by λ_i, so ρ_j is determined by the three parameters (m_j, α_j, β_j) of the beta-binomial distribution of λ_i. Since λ_i is a latent variable, the parameters (m_j, α_j, β_j) are determined by fitting the marginal distribution of X_i to data. This leads to a linkage between ρ_j and (m_j, α_j, β_j), namely ρ_j = m_j/(α_j + β_j + m_j).

In the hierarchical model, introduce a probability function g(ρ_j; τ_j), with τ_j the parameter(s) of the distribution of ρ_j. The value of ρ_j then determines the value of the parameter m_j in the distribution of λ_i. Below is the framework to construct a random CBM; a simulation sketch follows Theorem 2.

  • Let ρ_j be iid with density g(ρ_j; τ_j) for 0 < ρ_j < 1 and j = 1, ..., g.

  • For the jth mixture component, let m_j* = ρ_j(α_j + β_j)/(1 - ρ_j).

  • Let m_j = h(m_j*), where h(·) is a function to round m_j* into an integer.

  • Let λ_i | Z_i = j, ρ_j ~ Beta-Binomial(m_j, α_j, β_j).

  • Let X_{ik} | Z_i = j, λ_i ~ Beta(α_j + λ_i, β_j + m_j - λ_i) for k = 1, ..., n_i.

Theorem 2 (Random CBM) —

The above framework generates a CBM with a random exchangeable correlation structure, where the correlation coefficient within the jth mixture component is random with density g(ρ_j; τ_j), j = 1, ..., g. The marginal probability distribution of X_i is

f(x_i) = \sum_{j=1}^{g} \pi_j \int_0^1 f_{CB}(x_i;\ \alpha_j, \beta_j, \rho_j)\, g(\rho_j; \tau_j)\, d\rho_j,    (2.5)

where f_{CB}(·) stands for the probability function of a multivariate correlated beta distribution with random correlation coefficients. See the Supplementary Materials for the proof (available at Biostatistics online). Propositions 1–3 also hold for the random CBM.
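
As an illustration of the framework above, the next R sketch generates one vector from the jth component of a random CBM, taking g(ρ; τ) to be a Beta(τ_1, τ_2) density purely for concreteness; the paper leaves g(·; τ) generic, and the function name is hypothetical.

  rcbm_random_component <- function(n, alpha, beta, tau1, tau2) {
    rho <- rbeta(1, tau1, tau2)                       # rho_j ~ g(rho; tau), here an assumed Beta prior
    m   <- round(rho * (alpha + beta) / (1 - rho))    # linkage rho = m/(alpha + beta + m), then rounded
    p   <- rbeta(1, alpha, beta)
    lambda <- rbinom(1, m, p)                         # lambda | rho ~ Beta-Binomial(m, alpha, beta)
    rbeta(n, alpha + lambda, beta + m - lambda)       # exchangeable correlated beta vector
  }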

2.3. Expectation–Maximization algorithm and Stochastic Expectation–Maximization algorithm for parameter estimation

Let w_{ij} = I(Z_i = j), where I(·) is an indicator function, indicate whether X_i belongs to the jth mixture component. Since the latent variable Z_i is unobservable, the Expectation–Maximization (EM) algorithm is implemented to estimate the parameters of CBM. The EM algorithm relies on the complete-data log-likelihood function and alternates between expectation (E) steps and maximization (M) steps. In the E-step, the latent variable w_{ij} is replaced by its expectation under the posterior distribution of Z_i given the data and the current parameter estimates. In the M-step, the parameter estimates are updated using maximizers of the expected log-likelihood found in the E-step.

Let Θ = (π_1, ..., π_g, θ_1, ..., θ_g) stand for the unknown parameters, where θ_j = (α_j, β_j, m_j) in CBM (2.4) and θ_j = (α_j, β_j, τ_j) in random CBM (2.5), for j = 1, ..., g. Write the log-likelihood function of Θ based on the joint distribution of X_i and Z_i as

l_c(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{g} w_{ij} \{\log \pi_j + \log f_j(x_i; \theta_j)\},    (2.6)

where f_j(x_i; θ_j) denotes the jth component density in (2.4) or (2.5).

  • Step 1: Starting with iteration t = 0, generate initial values Θ^(0). Set the initial mixing weights to π_j^(0) = 1/g.

  • Step 2: E-step. Compute the expectation of w_{ij} using the posterior distribution of Z_i:

    \hat{w}_{ij}^{(t)} = \frac{\pi_j^{(t)} f_j(x_i; \theta_j^{(t)})}{\sum_{l=1}^{g} \pi_l^{(t)} f_l(x_i; \theta_l^{(t)})}.

    Plug ŵ_{ij}^(t) into (2.6) so that the expected log-likelihood function becomes

    Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{j=1}^{g} \hat{w}_{ij}^{(t)} \{\log \pi_j + \log f_j(x_i; \theta_j)\}.    (2.7)

  • Step 3: M-step. Find the π_j and θ_j that maximize (2.7) and update Θ^(t+1). The mixing weights have the closed form π_j^(t+1) = N^{-1} Σ_{i=1}^{N} ŵ_{ij}^(t). Without a closed form for θ_j, the update can be obtained with the numerical optimization function "optim" in R, which is freely available at http://www.r-project.org/.

  • Step 4: Repeat the E-step and M-step until the change in the maximum of the log-likelihood (2.7) is within a tolerance value (say 0.1).

The EM algorithm only ensures convergence to a local maximum of the log-likelihood, so it is critical to select appropriate initial values for Θ. In Step 1, multiple sets of initial values Θ^(0) are randomly generated, or one can perform a grid search over the parameter space. Steps 1–4 are then repeated and the global maximizer of the log-likelihood is selected.
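
A compact R sketch of Steps 1–4 for the fixed-m CBM (2.4) is given below. The helper names dcbm_component and cbm_em are hypothetical, m_j is treated as known (in practice it is estimated as well, and the random CBM would add the τ_j), and random restarts as described above are left to the caller; this is a sketch under those assumptions, not the authors' released code.

  # Component density f_j(x; a, b, m): compound of Beta(a + lambda, b + m - lambda)
  # over lambda ~ Beta-Binomial(m, a, b), for one genetic vector x in (0, 1)^n_i.
  dcbm_component <- function(x, a, b, m) {
    lambda <- 0:m
    w <- choose(m, lambda) * beta(a + lambda, b + m - lambda) / beta(a, b)  # beta-binomial weights
    dens <- sapply(lambda, function(l) prod(dbeta(x, a + l, b + m - l)))    # product over the vector
    sum(w * dens)      # (for long vectors one would accumulate log densities to avoid underflow)
  }

  # X: list of genetic vectors; g: number of components; m: fixed integer vector of length g.
  cbm_em <- function(X, g, m, max.iter = 200, tol = 0.1) {
    N <- length(X)
    pi_j <- rep(1 / g, g)                              # Step 1: equal initial mixing weights
    a <- runif(g, 0.5, 2); b <- runif(g, 0.5, 2)       # random starting shape parameters
    loglik_old <- -Inf
    for (iter in seq_len(max.iter)) {
      f <- sapply(1:g, function(j) sapply(X, dcbm_component, a[j], b[j], m[j]))  # N x g densities
      mix <- sweep(f, 2, pi_j, "*")
      loglik <- sum(log(rowSums(mix)))
      if (abs(loglik - loglik_old) < tol) break        # Step 4: stop when the log-likelihood stabilizes
      loglik_old <- loglik
      w_hat <- mix / rowSums(mix)                      # Step 2 (E-step): posterior weights w_hat[i, j]
      pi_j <- colMeans(w_hat)                          # Step 3 (M-step): closed-form mixing weights
      for (j in 1:g) {                                 # numerical update of (alpha_j, beta_j) via optim
        fit <- optim(log(c(a[j], b[j])), function(par)
          -sum(w_hat[, j] * log(sapply(X, dcbm_component, exp(par[1]), exp(par[2]), m[j]))))
        a[j] <- exp(fit$par[1]); b[j] <- exp(fit$par[2])
      }
    }
    list(pi = pi_j, alpha = a, beta = b, rho = m / (a + b + m),
         posterior = w_hat, loglik = loglik)
  }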

To further prevent estimates from staying near saddle points of the likelihood function, the Stochastic EM (SEM) algorithm (Celeux and Diebolt, 1985) is considered. SEM incorporates a stochastic (S-) step between the E-step and the M-step. This step can be viewed as a Bayesian extension of the EM algorithm and allows SEM to stochastically explore the likelihood surface, partially avoiding convergence to the closest mode.

All steps of the EM algorithm are kept. After Step 2, SEM adds the following step.

Step 2.2: S-step. Randomly generate Z_i^(t) from the multinomial distribution with probabilities (ŵ_{i1}^(t), ..., ŵ_{ig}^(t)). In (2.7), replace ŵ_{ij}^(t) by the indicator I(Z_i^(t) = j).

The sequence of SEM estimates forms an ergodic Markov chain that converges in distribution to a unique stationary distribution as the number of iterations increases.
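
In code, the S-step only needs to turn the posterior weight matrix from the E-step into sampled hard assignments before the M-step runs; a minimal sketch (hypothetical helper name) compatible with cbm_em above:

  sem_s_step <- function(w_hat) {
    g <- ncol(w_hat)
    z <- apply(w_hat, 1, function(p) sample.int(g, 1, prob = p))  # Z_i^(t) ~ Multinomial(1, w_hat[i, ])
    diag(g)[z, , drop = FALSE]               # 0/1 indicators I(Z_i^(t) = j) replace w_hat in (2.7)
  }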

2.4. Number of mixture components

The number of mixture components g can be determined using the model selection criteria AIC, BIC and ICL-BIC (Ji and others, 2005). These criteria are based on the log-likelihood function with a penalty on the number of unknown parameters d in the model. For CBM (2.4), d = 4g - 1. For the random CBM (2.5), d = (3 + d_τ)g - 1, where d_τ is the number of unknown parameters in g(ρ_j; τ_j). Let log L(Θ̂) be the maximum of the log-likelihood function (2.6). Define

  • Akaike information criterion: AIC = -2 log L(Θ̂) + 2d.

  • Bayesian information criterion: BIC = -2 log L(Θ̂) + d log N.

  • Integrated classification likelihood–BIC:

    ICL-BIC = -2 log L(Θ̂) + d log N + 2 EN(ŵ), where EN(ŵ) = -Σ_{i=1}^{N} Σ_{j=1}^{g} ŵ_{ij} log ŵ_{ij}.

ICL-BIC is BIC plus a penalty based on the estimated entropy EN(ŵ) of the fuzzy classification matrix ŵ = (ŵ_{ij}). Mixture models with different numbers of mixture components are fitted using the EM and SEM algorithms as described in Section 2.3, and the model with the smallest AIC, BIC or ICL-BIC is chosen.
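
The criteria are straightforward to compute from a fitted model; the sketch below uses the output of cbm_em above and assumes d = 4g - 1 free parameters for the fixed-m CBM and the usual twice-entropy ICL-BIC penalty.

  cbm_criteria <- function(fit, N, g) {
    d   <- 4 * g - 1                                   # assumed parameter count for CBM (2.4)
    ent <- -sum(fit$posterior * log(pmax(fit$posterior, .Machine$double.eps)))  # entropy EN(w_hat)
    c(AIC = -2 * fit$loglik + 2 * d,
      BIC = -2 * fit$loglik + d * log(N),
      ICL_BIC = -2 * fit$loglik + d * log(N) + 2 * ent)
  }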

3. Simulation

Under regularity conditions, the MLEs π̂_j and θ̂_j of the CBM are consistent estimators of the true parameters π_j and θ_j, i.e. π̂_j → π_j and θ̂_j → θ_j in probability. Five CBM models are presented to illustrate estimation consistency:

  • Model 1: a two-component CBM with correlated variables within both mixture components.

  • Model 2: a two-component CBM with two well-separated, correlated mixture components.

  • Model 3: a three-component CBM with correlated variables within all three mixture components.

  • Model 4: a two-component CBM in which variables within both mixture components are independent (ρ_1 = ρ_2 = 0).

  • Model 5: a three-component CBM in which variables within the first mixture component are independent and variables within the other two mixture components are correlated (ρ_1 = 0).

Models 1 and 2 use 500 genetic vectors with 5 variants in each vector. Model 3 uses 100 genetic vectors with 50 variants in each vector. Each genetic vector is treated as a multivariate random variable with correlation coefficient ρ_j in the jth mixture component.

The distribution of the variables from Model 3 is depicted in Figure 1. Figure 1(a) shows a scatter plot between X_{i1} (labeled "Genetic variant 1") and X_{i2} (labeled "Genetic variant 2") for i = 1, ..., 100. These two variables follow the same three-component mixture distribution. The Q-Q plot in Figure 1(b) is very close to the diagonal line, indicating that variants X_{i1} and X_{i2} have the same distribution. Figure 1(c) is a 2D histogram for variants X_{i1} and X_{i2}. Figure 1(d) shows the three-component beta mixture for variant X_{i1}. Figures 1(a)-(d) jointly confirm that the genetic variants X_{i1}, ..., X_{i50} have the same marginal beta mixture distribution, which confirms Proposition 1. In Figure 1(d), both CBM and BMM are fitted to the histogram of variant X_{i1}. BMM fits a density function close to the histogram, but it does not consider correlation among variants and thus sets ρ_1 = ρ_2 = ρ_3 = 0. Remarkably, CBM not only provides a close fit to the histogram but also accurately estimates the correlation within the 3 mixture components, with estimates of ρ_1, ρ_2 and ρ_3 correct to 2 decimal digits (Table 1).

Fig. 1. Distributions of variables in the simulated CBM Model 3. (a) Correlation plot. (b) Q-Q plot. (c) 2D histogram. (d) Histogram.

Table 2.

Comparison of CBM and BMM using EM and SEM algorithms

                 Model 1           Model 2           Model 3           Model 4           Model 5
                 CBM      BMM      CBM      BMM      CBM      BMM      CBM      BMM      CBM      BMM
EM algorithm
MSE(α̂_1)†        0.002    0.006    0.003    0.006    0.0003   0.002    0.01     0.01     0.02     0.02
MSE(α̂_2)         0.45     2.54     0.64     4.08     0.02     2.02     0.04     0.15     0.48     0.35
MSE(α̂_3)                                             0.18     0.49                       0.66     0.78
MSE(β̂_1)         0.17     0.34     0.11     0.63     0.10     1.90     0.50     1.17     0.47     2.13
MSE(β̂_2)         0.03     0.06     0.02     0.14     0.02     1.46     0.07     0.06     0.17     0.33
MSE(β̂_3)                                             0.003    0.06                       0.12     0.16
MSE(ρ̂_1)         0.004             0.0009            0.001             0.002             0.001
MSE(ρ̂_2)         0.003             0.003             <0.0001           0.002             0.003
MSE(ρ̂_3)                                             0.001                               0.005
MSE(π̂_1)         0.004    0.004    <0.0001  0.0007   0.001    0.002    0.002    0.003    0.01     0.03
MSE(π̂_2)         0.004    0.004    <0.0001  0.0007   0.005    0.003    0.002    0.003    0.01     0.02
MSE(π̂_3)                                             0.002    0.003                      0.001    0.002
AIC‡
BIC‡
ICL-BIC‡                  1627                                 5201              194               387
Accuracy§        0.87     0.83     0.99     0.97     0.96     0.78     0.99     0.84     0.79     0.65
SEM algorithm
MSE(α̂_1)         0.01     0.08     0.01     0.06     0.01     0.03     0.002    0.012    0.16     0.17
MSE(α̂_2)         4.30     4.02     2.04     1.20     1.30     1.59     0.069    0.177    0.24     0.18
MSE(α̂_3)                                             3.32     2.78                       0.06     0.16
MSE(β̂_1)         0.20     2.99     0.82     3.61     1.31     3.63     0.18     0.04     0.16     0.17
MSE(β̂_2)         0.25     0.59     0.14     0.21     1.66     4.85     0.06     0.10     0.24     0.18
MSE(β̂_3)                                             0.12     0.39                       0.06     0.16
MSE(ρ̂_1)         0.002             0.0006            0.0008            0.001             0
MSE(ρ̂_2)         0.001             0.0014            0.002             0                 0
MSE(ρ̂_3)                                             0.01                                0.1
MSE(π̂_1)         0.002    0.02     0.0002   0.001    0.005    0.002    0.0004   0.0007   0.01     0.02
MSE(π̂_2)         0.002    0.02     0.0002   0.001    0.02     0.008    0.0004   0.0007   0.01     0.02
MSE(π̂_3)                                             0.007    0.003                      0.003    0.002
AIC‡
BIC‡
ICL-BIC‡                  1130     -4126    -1764              3353              170               367
Accuracy§        0.88     0.82     0.99     0.98     0.88     0.77     0.99     0.85     0.76     0.67

†Lower MSE indicates less estimation bias.

‡Lower AIC, BIC and ICL-BIC indicate a better model fit.

§Higher accuracy indicates improved classification accuracy for mixture components.

Table 1.

Exchangeable correlation architecture modeled by CBM. The correlation coefficient from the jth mixture component can be either fixed or a random variable with probability function g(ρ_j; τ_j)

Generate S data sets for each model. The mean square error (MSE) for each parameter is then MSE(θ̂) = S^{-1} Σ_{s=1}^{S} (θ̂^{(s)} - θ)^2, where θ̂^{(s)} refers to the MLE of the parameter from the sth data set, estimated through the EM algorithm or the SEM algorithm. A smaller MSE indicates a smaller bias. As shown in Table 2, CBM provides very accurate estimates of the correlation coefficients ρ_j, with MSE no larger than 0.01 under the EM algorithm, for all 5 simulated models. When compared with BMM, CBM yields lower MSE for the mixing weights and the mixture component parameters. For instance, in Model 3 using the SEM algorithm, the MSE of β̂_2 is 1.66 for CBM versus 4.85 for BMM.
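
The simulation design can be mimicked with the sketches given earlier; in the R fragment below, the number of replicates and the parameter values are illustrative, m_j is treated as known, and label switching is handled only crudely by matching the dominant component.

  S <- 100                                                     # illustrative number of replicate data sets
  truth <- list(pi = c(0.7, 0.3), alpha = c(1, 2), beta = c(3, 2), m = c(2, 1))
  rho1_true <- truth$m[1] / (truth$alpha[1] + truth$beta[1] + truth$m[1])
  rho1_hat <- replicate(S, {
    sim <- rcbm(N = 500, n = 5, truth$pi, truth$alpha, truth$beta, truth$m)
    fit <- cbm_em(sim$X, g = 2, m = truth$m)
    fit$rho[which.max(fit$pi)]                                 # align with the larger-weight component
  })
  mse_rho1 <- mean((rho1_hat - rho1_true)^2)                   # Monte Carlo MSE for rho_1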

Models 1–3 are fully correlated, with varying correlation among mixture components; Model 5 is partially correlated and Model 4 is fully independent. Disregarding the correlation structure and fitting BMM to these models leads to higher MSE in parameter estimation, poorer model fit and lower prediction accuracy (see Table 2). AIC, BIC and ICL-BIC are used to compare goodness of fit. For all 5 models, CBM yields lower AIC, BIC and ICL-BIC values than BMM. For instance, in Model 2 using the SEM algorithm, ICL-BIC = -4126 for CBM and ICL-BIC = -1764 for BMM.

Mixture models provide a principled probabilistic framework for prediction and classification. The classification accuracy of CBM is higher than that of BMM in all 5 models. Using the SEM algorithm, CBM increases the classification accuracy by 0.06 in Model 1, 0.11 in Model 3, 0.14 in Model 4 and 0.09 in Model 5. In Model 2, the two mixture components are well separated; therefore, both CBM and BMM have high classification accuracy.

The performance of the EM and SEM algorithms was compared across the 5 models. The SEM algorithm can prevent estimates from staying near saddle points of the likelihood function, but it also generates more variation in the estimates and takes more iterations for the sequence to converge. In Table 2, the MSEs of the second-component shape parameters in Models 1 and 2 and of the third-component shape parameters in Model 3 exceed 1 under the SEM algorithm. These are the mixture parameters with the smaller mixing weights (approximately 0.3, 0.15 and 0.2, respectively), which indicates that the SEM algorithm may increase the estimation variance for mixture parameters with small mixing weights. Because the SEM algorithm adds a stochastic procedure to the EM algorithm to prevent estimates from remaining in a small neighborhood of a fixed value, it may increase the variation in parameter estimation, and this especially affects mixture parameters with small mixing weights. We suggest using the SEM algorithm to prevent estimates from staying near saddle points; investigators may consider the EM algorithm for data with a small sample size and small mixing weights.

Three model selection criteria, AIC, BIC and ICL-BIC, were compared in the simulation. AIC does not take the sample size into account and thus tends to select more complicated models, whereas ICL-BIC tends to select a parsimonious model. The simulation results show that BIC and ICL-BIC are appropriate tools for model selection.

4. Case study: genome-wide analysis of binding probability

We apply CBM to cluster transcription factors based on their binding probabilities in mouse genome data. Transcription factors are DNA-binding proteins that regulate gene expression levels by binding to promoter regions proximal to gene transcription start sites or to more distal enhancer regions that regulate expression through long-range interactions (Rye and others, 2011). Transcription factor binding varies between cell types (Levine and Tjian, 2003; Zhang and others, 2006). Lahdesmaki and others (2008) constructed a test set of annotated binding sites in mouse promoters from existing databases, including TRANSFAC, ORegAnno, the human genome browser at UCSC and ABS (Wingender and others, 2000; Kent and others, 2002; Blanco and others, 2006; Montgomery and others, 2006). A likelihood-based binding prediction method was applied to the 2K base pair upstream promoter regions of all 20,397 mouse genes, where the genomic locations of the promoters were based on RefSeq gene annotations. Evolutionary conservation was used as an additional data source, and binding specificities for 266 transcription factors were taken from TRANSFAC Professional version 10.3. The full binding probability results for all mouse transcription factor–gene pairs (a 20,397 by 266 table) are available at http://xerad.systemsbiology.net/ProbTF/. These genome-wide results indicate that it is rare to observe a high binding probability between transcriptional regulators and their target promoters. See the Supplementary Materials for details (available at Biostatistics online).

Let X_{ik} be the binding probability of the ith transcription factor to the promoter region of the kth gene. To make the clustering more biologically meaningful, we select genetic pathways extracted from the IntPath pathway database (Zhou and others, 2012). The online toolkit of the IntPath program provides a total of 550 pathways for the "M. musculus" species. Let X_i = (X_{i1}, ..., X_{in_p}) be the vector of binding probabilities of the ith transcription factor to the n_p genes in a pathway. The number of genes, n_p, varies by pathway. The goal is to apply CBM to cluster the transcription factors (i = 1, ..., 266) based on their patterns of binding probabilities to all genes in a pathway. The CBM clustering is performed for each pathway separately, so the clustering results differ across pathways. BMM can cluster the binding probabilities between each transcription factor and each gene, but BMM cannot cluster the binding probabilities between each transcription factor and a pathway.

For each genetic pathway, CBM is fitted with varying numbers of mixture components using the EM algorithm; a sketch of this workflow is given below. The optimal number of mixture components is determined by BIC. The posterior probabilities are then used to cluster the transcription factors into mixture components. We select 2 transcription factor pathways: the JAK–STAT signaling pathway and the MAPK cascade. The JAK–STAT pathway is a signaling cascade whose evolutionarily conserved roles include cell proliferation and hematopoiesis. The mitogen-activated protein kinases (MAPKs) belong to a large family of serine/threonine protein kinases that are conserved in organisms as diverse as yeast and humans. We also select 2 pathways that are not transcription factor pathways. Information on these 4 pathways is listed in Table S1 in the Supplementary Materials available at Biostatistics online.
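
A sketch of this per-pathway workflow, using the functions sketched in Section 2, is shown below. The file names, the fixed m_j = 1, and the range of 2-4 mixture components are illustrative assumptions rather than the authors' pipeline, and binding probabilities equal to exactly 0 or 1 may need to be nudged away from the boundary before fitting.

  bind  <- as.matrix(read.table("mouse_TF_binding.txt", header = TRUE, row.names = 1)) # genes x 266 TFs
  genes <- scan("jak_stat_pathway_genes.txt", what = character())    # gene list for one IntPath pathway
  Xp    <- lapply(colnames(bind), function(tf) bind[genes, tf])      # one binding vector per TF
  fits  <- lapply(2:4, function(g) cbm_em(Xp, g = g, m = rep(1, g))) # vary the number of components
  bic   <- sapply(seq_along(fits), function(k) cbm_criteria(fits[[k]], N = length(Xp), g = k + 1)["BIC"])
  best  <- fits[[which.min(bic)]]
  cluster <- max.col(best$posterior)   # assign each transcription factor by its posterior probability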

The clustering of transcription factors is based on their binding probabilities to a pathway. For the JAK–STAT pathway and the histidine metabolism pathway, CBM identified 3 clusters of transcription factors. For the MAPK pathway and superpathway of serine and glycine biosynthesis I, CBM identified 2 clusters of transcription factors. The estimates of parameters within each mixture component are listed in Table 3.

Table 3.

Parameter estimates of CBM fitted to transcription factor binding probability data

                            Cluster 1   Cluster 2   Cluster 3
JAK–STAT signaling pathway
  π̂                         0.38        0.07        0.55
  α̂                         0.70        1.27        0.65
  β̂                         3.34        3.26        0.41
  m̂                         2           3           2
  ρ̂                         0.33        0.40        0.65
  # transcription factors†   99          20          147
MAPK cascade
  π̂                         0.04        0.96
  α̂                         0.17        0.44
  β̂                         3.61        0.36
  m̂                         9           1
  ρ̂                         0.70        0.56
  # transcription factors    11          255
Superpathway of serine and glycine biosynthesis I
  π̂                         0.42        0.58
  α̂                         0.73        0.40
  β̂                         3.24        0.60
  m̂                         1           0
  ρ̂                         0.20        0
  # transcription factors    126         140
Histidine metabolism
  π̂                         0.69        0.25        0.06
  α̂                         0.58        0.30        0.34
  β̂                         2.33        1.85        0.57
  m̂                         0           6           0
  ρ̂                         0           0.74        0
  # transcription factors    186         63          17

†Number of transcription factors classified to each mixture component by CBM.

CBM reveals very different patterns among the clusters (see the density curves in Figure 2). Among the clusters we identified, one cluster often contained a small number of transcription factors and had a distinctive binding pattern compared with the large numbers of transcription factors in the other clusters.

Fig. 2. Histogram of transcription factor–DNA binding probability and fitted probability density distribution. (a) JAK–STAT. (b) MAPK. (c) Serine and glycine. (d) Histidine metabolism.

The marginal distribution of binding probabilities among the 266 transcription factors is displayed in Figure 2. The fitted probability density function for all transcription factors (the solid curve) shows that most binding probabilities lie near 0 and only a small number lie close to 1, confirming the sparse connectivity between transcriptional regulators and their target promoters. The fitted density for the transcription factors within each cluster is depicted by a dashed line. These density curves indicate that transcription factors from different clusters have distinct binding patterns.

Since researchers are particularly interested in high binding probabilities, the patterns among these clusters, especially the probability density at binding probabilities near 1, yield interesting findings regarding transcriptional regulation. For instance, in the superpathway of serine and glycine biosynthesis I, transcription factors in Cluster 2 have higher binding probabilities than those in Cluster 1, as the right tail of the probability density function of Cluster 2 is much higher than that of Cluster 1. For the histidine metabolism pathway, Cluster 2 has a binding peak near 0.3, indicating that this cluster of transcription factors has more moderate binding probabilities than the other clusters. For the JAK–STAT pathway, Cluster 3 exhibits a nearly U-shaped distribution of binding probabilities with 2 peaks, one near 0 and one near 1, indicating that this cluster contains more high binding probabilities (near 1) than the other clusters.

5. Discussion

A hierarchical beta mixture with an exchangeable correlation structure is proposed to cluster multivariate genetic data. The CBM is important for the following reasons: (1) Assuming iid genetic data may be unreasonable in real-world applications, yet much of the existing methodology does so. The present methodology allows both for correlations in genetic data and heterogeneity in such correlations, even if there is no heterogeneity in location. (2) With most existing methodology, there is no guarantee that related observations will be assigned to the same mixture component, even when it would not make sense for them to be assigned to different mixture components. The present methodology ensures that related observations are assigned to the same mixture component. Moreover, the quality of such assignments can be improved and informed by the heterogeneity in correlation.

CBM has a closed-form probability density function because of compound probability theory. The correlation coefficient ρ_j is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM, in which ρ_j follows a distribution g(ρ_j; τ_j), brings more flexibility in explaining correlation variations among genetic variables.

The results of the simulation and case study show that CBM yields more accurate prediction and classification of group structures. For instance, in Model 3 with the EM algorithm, CBM attains 96% classification accuracy while BMM attains only 78%. Correlated variants from a gene can be grouped together as genetic vectors, which allows the unsupervised clustering to focus on genetic vectors and prevents variants from the same genetic vector from being assigned to different mixture components.

The R code for CBM is freely available at http://d.web.umkc.edu/daih/.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

The work of H.D. was supported in part by R01(DK100779, PI: S Patton) awarded by the National Institute of Diabetes and Digestive and Kidney Diseases.


Acknowledgement

We gratefully acknowledge the helpful comments from three referees and the Associate Editor. We thank the HPC system engineer, Shane Corder, for his bioinformatics and computing support in this project. Conflict of Interest: None declared.

References

  1. Blanco E., Farre D., Alba M. M., Messeguer X., Guigo R. (2006). ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research, 34, D63–D67.
  2. Bouguila N., Ziou D., Monga E. (2006). Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statistics and Computing, 16, 215–225.
  3. Celeux G., Diebolt J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73–82.
  4. Dai X., Erkkila T., Yli-Harja O., Lahdesmaki H. (2009). A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data. BMC Bioinformatics, 10, 165.
  5. Fu R., Dey D. K., Holsinger K. E. (2011). A beta-mixture model for assessing genetic population structure. Biometrics, 67, 1073–1082.
  6. Ji Y., Wu C., Liu P., Wang J., Coombes K. R. (2005). Applications of beta-mixture models in bioinformatics. Bioinformatics, 21, 2118–2122.
  7. Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M., Haussler D. (2002). The human genome browser at UCSC. Genome Research, 12, 996–1006.
  8. Lahdesmaki H., Rust A. G., Shmulevich I. (2008). Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820.
  9. Laurila K., Oster B., Andersen C. L., Lamy P., Orntoft T., Yli-Harja O., Wiuf C. (2011). A beta-mixture model for dimensionality reduction, sample classification and analysis. BMC Bioinformatics, 12, 215.
  10. Levine M., Tjian R. (2003). Transcription regulation and animal diversity. Nature, 424, 147–151.
  11. Ma Z., Leijon A. (2011). Bayesian estimation of beta mixture models with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2160–2173.
  12. Minhajuddin A., Harris I., Schucany W. (2004). Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation, 74, 599–607.
  13. Montgomery S. B., Griffith O. L., Sleumer M. C., Bergman C. M., Bilenky M., Pleasance E. D., Prychyna Y., Zhang X., Jones S. J. (2006). ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics, 22, 637–640.
  14. Rye M., Saetrom P., Handstad T., Drablos F. (2011). Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements. BMC Biology, 9, 80.
  15. Teschendorff A. E., Marabita F., Lechner M., Bartlett T., Tegner J., Gomez-Cabrero D., Beck S. (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450k DNA methylation data. Bioinformatics, 29, 189–196.
  16. Wingender E., Chen X., Hehl R., Karas H., Liebich I., Matys V., Meinhardt T., Pruss M., Reuter I., Schacherer F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research, 28, 316–319.
  17. Zhang C., Xuan Z., Otto S., Hover J. R., McCorkle S. R., Mandel G., Zhang M. Q. (2006). A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Research, 34, 2238–2246.
  18. Zhou H., Jin J., Zhang H., Yi B., Wozniak M., Wong L. (2012). IntPath—an integrated pathway gene relationship database for model organisms and important pathogens. BMC Systems Biology, 6(Suppl 2), S2.
