Abstract
In recent years, a large number of genomic and epigenomic studies have focused on the integrative analysis of multiple experimental datasets measured over a large number of observational units. The objectives of such studies include not only inferring a hidden state of activity for each unit over individual experiments, but also detecting highly associated clusters of units based on their inferred states. Although there are a number of methods tailored for specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, it maps observations onto a finite state-space, representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. Both the state-space mapping and clustering can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as to the heterogeneity in replicate experiments. It allows for imposing structural assumptions on each cluster, and enables model selection using information criteria. In our data-driven simulation studies, MBASIC showed significant accuracy in recovering both the underlying state-space variables and clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application focused on identifying groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing transcription factor occupancy data, and illustrated the applicability of MBASIC to a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than a two-step analysis of these data based on ENCODE results on transcription factor occupancy.
1 Introduction
The flow of genetic information through DNA transcription and RNA translation is a highly regulated process. The underlying mechanisms of regulation by both genomic and epigenomic factors are central targets in large numbers of genomic and epigenomic studies. This paper is motivated by a number of such studies that aim to elucidate genomic regulatory mechanisms across multiple biological conditions. Experiments that investigate such processes produce a plethora of data types. For example, relevant to DNA transcription is transcription factor occupancy data that quantify which regions of DNA are occupied by DNA binding proteins that can enhance or reduce gene expression. Histone modification data map covalent post-translational modifications to histone proteins, core proteins that make up the nucleosome structure of DNA. Such modifications impact DNA transcription by altering chromatin structure.
Computational and statistical analyses of these data often involve identifying genomic loci that show significant signal, i.e., enrichment, compared to background noise in the experimental measurements, with the operating principle that multiple loci might exhibit similar signal profiles over different biological conditions.
Improvements in next-generation sequencing technology have further accelerated the generation of these types of data. In turn, the vast availability of such data has revolutionized the scope of genome regulation studies. Previous analyses had been restricted to detecting regions of the genome that were associated with one particular factor in a single organism. Many recent studies focus on detecting more complex functional patterns that integrate data from multiple organisms under multiple conditions. In particular, the associations between DNA elements and how they change across biological and/or experimental conditions have been the focus of many integrative modeling approaches. Examples of these studies include:
- Differential binding analysis among multiple ChIP-seq datasets. One of the key mechanisms of gene expression regulation is through differential activities of transcription factors and epigenetic modifications. Currently, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the state-of-the-art method for genome-wide profiling of protein-DNA interactions. Two such key interactions are DNA occupancy by transcription factors and histone modifications. Most transcription factors, i.e., DNA binding proteins, recognize DNA in a sequence-specific manner and promote or repress gene expression. Similarly, histone modifications can induce diverse biological consequences such as transcriptional activation/deactivation. The study of gene regulation often involves comparing transcription factor occupancy and histone modifications across multiple biological conditions. Such conditions can be different treatment levels, time points of measurements, or different dosage levels (Liang and Keleş, 2012; Anders and Huber, 2010; Ji et al., 2013; Wei et al., 2015).
- Transcription factor regulatory network analysis. The combinatorial nature of transcription factor regulation underlies the large diversity observed in eukaryotic gene control. This largely motivates construction of regulatory networks that model gene expression as a combinatorial function of regulatory interactions between DNA and different transcription factors. The large-scale data from the ENCODE project (The ENCODE Project Consortium, 2012) now enable joint analyses of over one hundred human transcription factors (TFs) across multiple cell types. Such analyses are poised to reveal a great amount of information about co-association patterns between different TFs, hierarchical network organizations, and systems-level integration of complex cellular signals (Neph et al., 2012; Gerstein et al., 2012; Cheng et al., 2011; Zeng et al., 2013). While the large number of TFs makes it computationally formidable to exhaust all possible combinatorial associations for such analyses, it is important to detect the most significant combinatorial patterns that preserve global regulatory dynamics.
- Comparative functional genomic studies across different species. Functional genomics analysis compares gene expression or TF occupancy profiles between multiple species. The main task is to identify divergent and conserved functional modules that are central to evolutionary relationships (e.g., Kunarso et al. (2010), Schmidt et al. (2010)). Existing methods that build on hidden Markov models (Roy et al., 2013) or biclustering (Waltman et al., 2010) implicitly assume that the functional modules should at least have similar signal profiles (i.e., expression, occupancy) among some subsets of the species under consideration. For these analyses, it is also important to identify functional modules that are fully divergent across species. These regions play an equally important role in understanding connectivity among species over the evolutionary history.
Although the types of data for these different studies vary, the underlying statistical principles are largely shared. Therefore, we propose a unified framework for the analysis of such data by formalizing the shared aspects. We formulate the underlying statistical problem as follows. Suppose a dataset {Yik} is collected over a set of observational units (e.g., loci in genomic experiments) i = 1, 2, · · ·, I under conditions k = 1, 2, · · ·, K. Inferring the association patterns within a single experiment involves mapping the corresponding set of observations {Yik : i = 1, 2, · · ·, I} to a finite discrete state-space, 𝒮 = {1, 2, · · ·, S}. This space contains different levels of association (e.g., enrichment/non-enrichment indicating the status of occupancy in ChIP-seq experiments, expressed/not expressed in RNA-seq gene expression experiments). This falls under the classical finite-mixture modeling framework, where a latent state variable θik ∈ 𝒮 is inferred for each observational unit Yik. A higher level of modeling on the matrix Θ = (θik)1≤i≤I,1≤k≤K is required for integrating the association patterns under different conditions. We call this matrix the state-space matrix since it describes the latent states of individual observations.
We propose the following framework to model the state-space matrix Θ. We assume that rows of Θ can be partitioned into J + 1 subsets: {1, · · ·, I} = 𝒞0 ∪ 𝒞1 ∪ ·· · ∪ 𝒞J. Rows of Θ within partition 𝒞j, j ≥ 1, are generated by the same distribution parametrized by wj· = (wjk)1≤k≤K:
while the rows of 𝒞0, which denotes the group of “singleton” units, i.e., units that do not cluster in any of the J groups, are generated by row-specific distributions. The goal of this model is thus to estimate a partitioning that best characterizes the row associations of the state-space matrix Θ.
We refer to the proposed framework as Matrix Based Analysis for State-space Inference and Clustering (MBASIC). MBASIC is related to classical factor analysis, which considers the problem of projecting one dimension (either row or column) of large noisy matrices into low-dimensional spaces. MBASIC has two distinguishing features compared to the existing literature in these areas. First, MBASIC deals with matrices with discrete entries, while most existing methods are designed for matrices on continuous scales. Second, MBASIC estimates the low-dimensional projection by grouping the rows of the original matrix, in contrast to Principal Component Analysis (PCA) approaches, which form linear combinations of the rows (e.g., Ji et al. (2013), Lee et al. (2010)). This is motivated by the following arguments:
- In MBASIC, each factor estimate wj· characterizes the commonality of a group of rows and is easily interpretable in practice. Such interpretability can further be enhanced by imposing structural restrictions on the wj· vector for practical purposes. Examples of such constraints are described in Section 3.3;
- PCA for high-dimensional matrices is often accompanied by regularization techniques, which are computationally prohibitive for many epigenetic datasets. In contrast, clustering the matrix rows can be implemented very efficiently and in a straightforward manner.
The hierarchical structure of MBASIC is similar to two other recently proposed statistical models: iASeq (Wei et al., 2012) and Cormotif (Wei et al., 2015). Both of these models incorporate a state-space clustering structure similar to MBASIC. MBASIC extends these models in several important directions. First, MBASIC is developed for general purposes and can be easily implemented for a wide range of parametric distributions, while Cormotif and iASeq operate with specific distributions targeting the problems of differential expression and allele-specific binding. Second, neither of these models includes a group of singletons with idiosyncratic state-space profiles. When we are agnostic about the “true” clustering structure in applications, separating the singletons can reduce their influence on the estimation of clustering parameters. Third, both iASeq and Cormotif separate estimation of the distributional parameters from the clustering structure, while MBASIC jointly fits all model parameters. A limiting assumption of MBASIC compared to these models is that MBASIC does not allow the distributional parameters within the same state to be heterogeneous. However, a pre-processing step that accounts for the heterogeneity can overcome such a limitation. We evaluate and discuss all of these features with extensive simulation studies in this paper.
This paper is organized as follows. We start with a formal description of MBASIC in Section 2, followed by model estimation and selection methods in Section 3. We also investigate general features of MBASIC compared to iASeq and Cormotif with extensive simulations in this section. Section 4 presents results from several real data examples. Mathematical details of the algorithm are included in Appendix A.1.
2 The Hierarchical Mixture Model Framework
Consider a dataset with observations from I different observational units under K different conditions. For each condition k ∈ {1, 2, · · ·, K}, there are nk replicate experiments, indexed by l = 1, 2, · · ·, nk. We use Yikl to denote the observation for the l-th replicate of unit i under condition k. For each condition k at unit i, there exists a hidden state variable θik ∈ 𝒮 = {1, 2, · · ·, S}. The MBASIC model consists of the following components:
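- State-space Mapping:

  $$(Y_{ikl} \mid \theta_{ik} = s) \sim f_s(\cdot \mid \mu_{kls}, \sigma_{kls}, \gamma_{ikls}). \qquad (1)$$

- State-space Clustering: θik’s are independently sampled from 𝒮 with the sampling probability:

  $$P(\theta_{ik} = s) = \zeta\, p_{is} + (1 - \zeta) \sum_{j=1}^{J} \pi_j w_{jks}. \qquad (2)$$

In (1), μkls and σkls are the parameters related to the mean and dispersion for the s-th state for replicate l under condition k, and γikls is the covariate encoding known information for unit i. In (2), pis, ζ, πj, and wjks are additional non-negative parameters subject to restrictions:

$$0 \le \zeta \le 1, \qquad \sum_{s=1}^{S} p_{is} = 1, \qquad \sum_{j=1}^{J} \pi_j = 1, \qquad \sum_{s=1}^{S} w_{jks} = 1.$$

We further discuss these parameters in Section 2.2.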
2.1 State-space Mapping
Equation 1 partitions observational units i = 1, · · ·, I into S subsets according to their hidden states. Within the same replicate, data from the same hidden state follow the same distribution fs(·|μkls, σkls, γikls). MBASIC assumes that the hidden states θik’s are independent of the replicate index l, which means all replicates under the same condition have the same set of hidden states. However, distributional parameters for a given state can be different among replicates. Such a setting allows for the flexibility of modeling the heterogeneity in replicate experiments.
The density function f can be from an arbitrary parametric distribution. We consider three fundamental families of distributions commonly used for genomic data analysis:
- Log-normal Distribution. LN(μklsγikls, σkls) with density function (3).
- Negative Binomial Distribution. NB(μklsγikls, σkls) with density function (4).
- Binomial Distribution. Binom(γikls, μkls) with density function (5).
In these three examples, γikls represents the known heterogeneity across loci whereas μkls and σkls are unknown parameters. For example, when using Eqn. (3) or (4) in a ChIP-seq analysis with S = 2 states, we can estimate γikl1 using data from the control samples so that the ChIP sample read counts in the background state scale with the control sample data (e.g., as in Zuo and Keleş (2013)), and assume γikl2 = 1 for the enriched state. Eqn. (5) can be used to analyze allele-specific binding data, where γikls is the total read count from both paternal and maternal alleles and is constant across s. Application with the binomial distribution also requires that μkls, ∀k, l, is strictly increasing in s for model identification.
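To make the three families concrete, the following Python sketch evaluates their log-densities for a single observation. It is an illustration only, under parametrizations that the text above does not pin down: for the log-normal, log(Y) is assumed normal with mean μγ and standard deviation σ; for the negative binomial, μγ is assumed to be the mean and σ the size parameter, so that the variance is m + m²/σ. The function names are hypothetical.

```python
import math

def lognormal_logpdf(y, mu, sigma, gamma):
    """Log-density when log(Y) ~ Normal(mu * gamma, sigma^2) (assumed parametrization)."""
    if y <= 0:
        return float("-inf")
    z = (math.log(y) - mu * gamma) / sigma
    return -math.log(y * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def negbinom_logpmf(y, mu, sigma, gamma):
    """Log-pmf of a negative binomial with mean m = mu * gamma > 0 and size sigma (assumed)."""
    m = mu * gamma
    return (math.lgamma(y + sigma) - math.lgamma(sigma) - math.lgamma(y + 1)
            + sigma * math.log(sigma / (sigma + m))
            + y * math.log(m / (sigma + m)))

def binom_logpmf(y, mu, gamma):
    """Log-pmf of Binomial(gamma, mu): gamma trials, success probability 0 < mu < 1."""
    if y < 0 or y > gamma:
        return float("-inf")
    return (math.lgamma(gamma + 1) - math.lgamma(y + 1) - math.lgamma(gamma - y + 1)
            + y * math.log(mu) + (gamma - y) * math.log(1.0 - mu))
```

In a two-state ChIP-seq setting, one would call these with the state-specific γ described above, for example a control-derived γ for the background state and γ = 1 for the enriched state.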
MBASIC can be easily extended to other classes of parametric distributions, and estimation for these distributions follows the same Expectation-Maximization skeleton. While Section 3 relies on these three distributions to describe the model and the estimation algorithms, the second real data example in Section 4 utilizes a more complex parametrization, which demonstrates the wide applicability of the MBASIC framework. Furthermore, we consider the following degenerate distribution:
$$f_s(Y_{ikl} \mid \mu_{kls}, \sigma_{kls}, \gamma_{ikls}) = I(Y_{ikl} = s), \qquad (6)$$
where I(.) denotes the indicator function. This degenerate form corresponds to the situation where the states, θik’s, are directly observed rather than inferred from Yikl’s. We utilize this parametrization for comparing MBASIC to alternative two-step analysis approaches in Section 3.5. Parameter estimation for this case follows a slightly modified procedure from the non-degenerate cases, which is described in Section 3.
2.2 State-space Clustering
Equation (2) models the distribution of θik as a mixture of multiple distributions. To illustrate this model we introduce additional variables. The goal is to identify J clusters from the set of observational units 1 ≤ i ≤ I. Let bi = I(unit i does not belong to any cluster) and zij = I(unit i belongs to cluster j). The bi variables entertain the possibility that some observations are “singletons”, i.e., they do not cluster with any other observational units. With these additional variables, the distribution in Equation (2) can be hierarchically decomposed as follows:
- bi ~ Bernoulli(ζ);
- (zi1, · · ·, ziJ) ~ Multinomial(1, (π1, · · ·, πJ));
- Conditional on bi and zij, θik’s are independent samples from 𝒮, with sampling probabilities P(θik = s|bi = 1) = pis, P(θik = s|bi = 0, zij = 1) = wjks.
In this setup, although the singleton state-space probabilities pis are assumed to be constant across conditions, i.e., P(θik = s) = pis, ∀k, this assumption is only mildly restrictive since it allows (P(θik = 1), · · ·, P(θik = S)) to follow an arbitrary prior distribution (e.g., (P(θik = 1), · · ·, P(θik = S)) ~ Dirichlet(α, · · ·, α), ∀k) as long as it leads to the same marginal distribution for θik for all k.
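The hierarchical decomposition above can be read as a sampling recipe. The sketch below draws the singleton indicators, cluster labels, and state-space matrix for given parameter values. It is a minimal illustration rather than the authors' implementation: the function name and argument layout are hypothetical, and states are coded 0, ..., S − 1 instead of 1, ..., S.

```python
import random

def simulate_states(I, K, S, zeta, pi, w, p, seed=0):
    """Draw (b, z, theta) from the hierarchical state-space clustering model.

    pi[j]      : cluster probabilities (length J)
    w[j][k][s] : state probabilities for cluster j under condition k
    p[i][s]    : state probabilities for unit i when it is a singleton
    """
    rng = random.Random(seed)
    states = list(range(S))
    b, z, theta = [], [], []
    for i in range(I):
        is_singleton = rng.random() < zeta           # b_i ~ Bernoulli(zeta)
        b.append(int(is_singleton))
        if is_singleton:
            z.append(None)                           # singletons carry no cluster label
            row = rng.choices(states, weights=p[i], k=K)
        else:
            j = rng.choices(range(len(pi)), weights=pi, k=1)[0]
            z.append(j)
            row = [rng.choices(states, weights=w[j][k], k=1)[0] for k in range(K)]
        theta.append(row)
    return b, z, theta
```

Feeding the simulated states into state-specific densities, one draw per replicate, would complete a simulation of the observed data Yikl.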
It is worth noting that this hierarchical structure essentially seeks a low-rank representation for the matrix Θ = (θik)1≤i≤I,1≤k≤K. To illustrate this, we introduce additional matrices Θs = (I(θik = s))1≤i≤I,1≤k≤K, Ws = (wjks)1≤j≤J,1≤k≤K, Z = (zij)1≤i≤I,1≤j≤J and vectors ps = (pis)1≤i≤I, B = (bi)1≤i≤I. Then, the conditional expectation of Θs is:
| (7) |
where “∘” denotes the Hadamard product. We note that E(Θs|Z, B) is a matrix of rank J +1, which is usually much smaller than the dimension of the matrix Θs. Similar models for low-rank representation of discrete matrices were considered in Lee et al. (2010), and turned out to be challenging both theoretically and computationally. The row-clustering structure for the matrices E(Θs|Z, B) in MBASIC is more restrictive than the general low-rank structure. Such additional restrictions not only reduce the difficulty in parameter estimation but also enable the flexibility in many useful ways. For example, while Lee et al. (2010) can only estimate one matrix at a time and thus is only applicable when S = 2, MBASIC can be applied to arbitrary values of S.
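The rank statement can be read off from the entrywise form of the conditional expectation, which follows directly from the conditional probabilities stated in Section 2.2. The display below is a sketch derived from those definitions rather than a transcription of Equation (7); the all-ones vectors and the diag(·) notation are introduced here for illustration.

```latex
\begin{aligned}
E\bigl[I(\theta_{ik} = s) \mid Z, B\bigr]
  &= b_i\, p_{is} + (1 - b_i) \sum_{j=1}^{J} z_{ij}\, w_{jks}, \\
E(\Theta_s \mid Z, B)
  &= \operatorname{diag}(B)\, p_s \mathbf{1}_K^{\top}
   + \operatorname{diag}(\mathbf{1}_I - B)\, Z W_s .
\end{aligned}
```

The first term has rank at most 1 and the second has rank at most J, so the sum has rank at most J + 1.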
3 Model Estimation and Selection
3.1 Likelihood Functions
In the MBASIC model, the likelihood function for both the observed random variables Yikl’s and the unobserved θik’s, zij ’s, bi’s, i.e., full data likelihood, is given by:
| (8) |
For non-degenerate distributions, we can show that the marginal likelihood is:
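$$P(\{Y_{ikl}\}) = \prod_{i=1}^{I} \Biggl\{ \zeta \prod_{k=1}^{K} \Biggl[ \sum_{s=1}^{S} p_{is} \prod_{l=1}^{n_k} f_s(Y_{ikl} \mid \mu_{kls}, \sigma_{kls}, \gamma_{ikls}) \Biggr] + (1 - \zeta) \sum_{j=1}^{J} \pi_j \prod_{k=1}^{K} \Biggl[ \sum_{s=1}^{S} w_{jks} \prod_{l=1}^{n_k} f_s(Y_{ikl} \mid \mu_{kls}, \sigma_{kls}, \gamma_{ikls}) \Biggr] \Biggr\}. \qquad (9)$$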
Equation (9) is easily interpretable. Conditional on bi and zij, the joint distribution for each Yikl, 1 ≤ l ≤ nk is a mixture of S components, where the weight on the s-th component is either pis (when bi = 1) or wjks (when bi = 0 and zij = 1). This yields the expressions in the square brackets. Integrating out bi and zij, the joint distribution for Yikl of fixed i is a mixture of J + 1 components, with probability ζ of being a singleton and probability (1 − ζ)πj of belonging to cluster j.
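The verbal description above translates into a short computation for a single unit. The sketch below assumes the state-specific densities have already been evaluated and stored in a nested list; the function name and data layout are hypothetical. It multiplies raw densities for readability, so a practical implementation would work on the log scale to avoid underflow when K or nk is large.

```python
import math

def unit_marginal_loglik(dens, p_i, w, zeta, pi):
    """Marginal log-likelihood contribution of one unit i.

    dens[k][l][s] : density f_s(Y_ikl | ...) for replicate l of condition k
    p_i[s]        : singleton state probabilities for this unit
    w[j][k][s]    : cluster state probabilities
    pi[j]         : cluster probabilities
    """
    S = len(p_i)
    K = len(dens)

    def condition_term(k, weights):
        # Square-bracket term: sum over states of weight_s * prod over replicates of f_s
        return sum(weights[s] * math.prod(rep[s] for rep in dens[k]) for s in range(S))

    singleton = math.prod(condition_term(k, p_i) for k in range(K))
    clustered = sum(
        pi[j] * math.prod(condition_term(k, w[j][k]) for k in range(K))
        for j in range(len(pi))
    )
    # Mixture of J + 1 components: singleton with weight zeta, cluster j with (1 - zeta) * pi_j
    return math.log(zeta * singleton + (1.0 - zeta) * clustered)
```

Summing this quantity over all units gives the log of the marginal likelihood for the non-degenerate case.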
For the degenerate case, by substituting (6) into (9), it can be shown that the marginal likelihood is:
| (10) |
3.2 An Expectation and Maximization (E-M) Algorithm
The hierarchical structure of MBASIC naturally fits the Expectation-Maximization algorithm (Dempster et al., 1977), which maximizes the marginal likelihood (equations (9) or (10)) by iteratively maximizing the conditional expectation of the complete data log-likelihood function. We let ϕ denote the vector of all unknown parameters μ, σ, π, p, ζ, w, and ϕ̂(t) denote the parameter estimates at the t-th iteration. The complete data log-likelihood function is:
| (11) |
The E-M algorithm for MBASIC is outlined by Algorithm 1. E-step updates are listed in equations (12)–(15) and their derivations are provided in Appendix A.
Algorithm 1. Expectation-Maximization (EM)
for t = 1, 2, · · · until convergence do
    Expectation Step: Compute the conditional expectations E[I(θik = s)|ϕ̂(t−1)], E[bi|ϕ̂(t−1)], E[I(θik = s)bi|ϕ̂(t−1)], E[zij(1 − bi)|ϕ̂(t−1)], E[I(θik = s)zij(1 − bi)|ϕ̂(t−1)];
    Maximization Step: Update estimates for parameters μkls, σkls, ζ, πj, wjks, pis as maximizers for (11).
end for
| (12) |
| (13) |
| (14) |
| (15) |
where . Given these results from the E-step, updates of ζ, πj, wjks, pis in the M-step are straightforward, as in equations (16), (17), (18), and (19).
| (16) |
| (17) |
| (18) |
| (19) |
Updates for μkls and σkls have to be treated according to the specific distributions. For the log-normal distributions (3), we have:
| (20) |
| (21) |
For the binomial distributions (5), we have:
| (22) |
Closed-form maximizers of μ and σ do not exist for the negative binomial distribution (4). We adopt the method-of-moments estimates as in Kuan et al. (2011) and Zuo and Keleş (2013), where the updated values of μkls and σkls are the solutions of the following equations:
For the degenerate distributions as in (6), θik’s are directly observed. Therefore, the E-M algorithm for this case requires slight modifications: we skip the estimation for E[I(θik = s)|ϕ̂ (t−1)] in the E-step and for μ, σ in the M-step.
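For the degenerate case, the whole procedure reduces to an E-M algorithm on the clustering layer alone, which makes it a convenient setting for a compact illustration. The sketch below is written from the model description and is not the authors' code: the function name, the starting value for ζ, and the small pseudo-count `eps` (used only to keep probabilities away from zero) are assumptions. States are coded 0, ..., S − 1.

```python
import math

def em_observed_states(theta, S, J, w_init, n_iter=50, zeta=0.1, eps=1e-6):
    """EM for the clustering layer when the states theta[i][k] are directly observed.

    w_init[j][k][s] are strictly positive starting values for the cluster profiles.
    """
    I, K = len(theta), len(theta[0])
    w = [[row[:] for row in wj] for wj in w_init]
    pi = [1.0 / J] * J
    # Singleton profile p[i][s]: with observed states its maximizer is the (smoothed)
    # frequency of state s in row i, so it stays fixed across iterations.
    p = [[(row.count(s) + eps) / (K + S * eps) for s in range(S)] for row in theta]

    for _ in range(n_iter):
        # E-step: posterior probability of being a singleton (post_b) or in cluster j (post_z).
        post_b, post_z = [], []
        for i in range(I):
            log_single = math.log(zeta) + sum(math.log(p[i][theta[i][k]]) for k in range(K))
            log_clust = [
                math.log(1.0 - zeta) + math.log(pi[j])
                + sum(math.log(w[j][k][theta[i][k]]) for k in range(K))
                for j in range(J)
            ]
            top = max([log_single] + log_clust)      # subtract the max for numerical stability
            num_b = math.exp(log_single - top)
            num_z = [math.exp(v - top) for v in log_clust]
            total = num_b + sum(num_z)
            post_b.append(num_b / total)
            post_z.append([v / total for v in num_z])

        # M-step: posterior-weighted proportions.
        zeta = min(max(sum(post_b) / I, eps), 1.0 - eps)
        mass = [sum(post_z[i][j] for i in range(I)) + eps for j in range(J)]
        pi = [m / sum(mass) for m in mass]
        for j in range(J):
            for k in range(K):
                counts = [eps] * S
                for i in range(I):
                    counts[theta[i][k]] += post_z[i][j]
                w[j][k] = [c / sum(counts) for c in counts]
    return zeta, pi, w, post_b, post_z
```

In the non-degenerate case, the same loop would additionally compute the posterior state probabilities in the E-step and update μ and σ in the M-step.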
3.3 Estimating Structured Clusters
In integrative functional genomics studies, the set of experimental conditions usually consists of interactions of multiple experimental factors; hence, it is often important to identify clusters, states of which are homogeneous across the levels of one or more factors. For example, in a typical transcription factor network analysis, experimental conditions include the combination of different cell types and TFs. It is often desirable to separate loci groups whose states are homogeneous within each cell type from those with cell type specific states for the purpose of cell type comparison. Depending on the cell types involved, such comparison can yield insights on cell development, pathology and/or cell-specific functions. We refer to clusters with homogeneous states within each cell type as TF-homogeneous. Another example is encountered in comparative functional genomics studies across different species, where experimental conditions range across both species and TFs. Clusters of loci, states of which are homogeneous across species conditional on each TF, constitute conserved functional modules. The TF-homogeneous clusters in this context represent the marginal effect of the species factor, and play a central role in understanding the evolutionary relationships.
To estimate a cluster with homogeneity for a particular experimental factor, MBASIC allows structural constraints on its state-space parameters. Recall that the parameters of cluster j are represented by wj.s = (wj1s, wj2s, · · ·, wjKs). Marginalizing the effect of this factor, the K experimental conditions can be partitioned into M sets, {1, 2, · · ·, K} = T1 ∪ T2 ∪ ·· · ∪ TM, where conditions within each set differ only in the levels of this factor. The parameters of this cluster satisfy the following constraints:
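$$w_{jk_1s} = w_{jk_2s}, \quad \forall\, k_1, k_2 \in T_m,\ 1 \le m \le M,\ s \in \mathcal{S}. \qquad (23)$$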
A pictorial example with six experimental conditions arising from the full interaction between 2 cell types and 3 TFs is shown in Figure 1. Estimating structured clustering models follows the previous E-M algorithm with a modification in Equation (19). A constrained maximizer for wjks subject to constraint (23) is computed as:
Figure 1.
A graphical description for a parametrization with structural constraints. Interactions of 2 cell types and 3 TFs result in six experimental conditions. Parameters with homogeneous values are shaded by the same color.
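One way to picture the constrained update is as pooling: when the state probabilities must be equal for all conditions in the same set Tm, the multinomial part of the expected log-likelihood is maximized by pooling the expected state counts over that set before normalizing. The sketch below illustrates this idea for one structured cluster. It is written from that argument, not copied from the paper's formula, and the function name and input layout are hypothetical.

```python
def constrained_w_update(expected_counts, partition):
    """Pooled M-step update of the state probabilities for one structured cluster j.

    expected_counts[k][s] : sum over units i of E[I(theta_ik = s) z_ij (1 - b_i)]
    partition             : list of lists of condition indices (T_1, ..., T_M)
    Returns w[k][s], identical for all conditions k in the same set T_m.
    """
    K = len(expected_counts)
    S = len(expected_counts[0])
    w = [None] * K
    for block in partition:
        # Pool the expected counts of each state over all conditions in this block
        pooled = [sum(expected_counts[k][s] for k in block) for s in range(S)]
        total = sum(pooled)
        shared = [c / total for c in pooled]
        for k in block:
            w[k] = shared[:]
    return w
```

With a partition made of singleton sets, the function reduces to the unconstrained per-condition update.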
MBASIC requires that such structural constraints must be specified a priori and remain fixed during model fitting. MBASIC incorporates a model selection procedure to compare models with different hypothesized structural constraints and numbers of clusters. We next describe the details of this model selection procedure.
3.4 Model Selection
The MBASIC framework so far assumes that the total number of clusters J is known a priori. In practice, models with varying values of J need to be fitted independently and compared with each other according to some information criterion to determine the best value of J. Since the E-M algorithm aims to maximize the data likelihood function, AIC and BIC criteria can be utilized with MBASIC. The degrees of freedom for a model with J clusters is , where F1 = 2 for distributions (3) and (4), F1 = 1 for (5), and F2 is the total number of free variables among wjks’s. If there are no structured clusters, we have F2 = JK(S − 1).
When there is no prior information available, both the total number of clusters and the number of clusters following structural constraints have to be determined. This results in a prohibitively large number of candidate models, and computing the information criterion for each of them is not practical. In such cases we incorporate the following two-phase strategy to limit the number of candidate models:
- Phase 1. Evaluate models with varying total number of clusters without any structural constraints. Select Jopt according to the minimal AIC or BIC value.
- Phase 2. Evaluate models with the fixed number of Jopt clusters while varying the number of clusters following each structural constraint. Select the number of clusters following each structural constraint based on the minimal AIC or BIC value.
We acknowledge that the above two-phase strategy is only a practical compromise to restrict the space of candidate models and does not guarantee finding the best model that globally minimizes the information criterion. However, we have conducted extensive simulation studies, which illustrated that the proposed two-phase strategy performs well in a wide variety of settings.
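The selection step in each phase amounts to scoring a handful of fitted models and keeping the one with the smallest criterion value. The sketch below shows this for BIC. The log-likelihoods and degrees of freedom are assumed to come from models fitted elsewhere, the numbers in the usage example are made up for illustration, and the choice of sample size `n_obs` is an assumption, since the text does not state which count enters the penalty.

```python
import math

def bic(loglik, df, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return -2.0 * loglik + df * math.log(n_obs)

def select_by_bic(candidates, n_obs):
    """candidates maps a model label to (loglik, df); returns (best label, all scores)."""
    scores = {label: bic(ll, df, n_obs) for label, (ll, df) in candidates.items()}
    return min(scores, key=scores.get), scores

# Phase 1 style usage: labels are candidate numbers of clusters J (illustrative values).
phase1 = {10: (-6000.0, 300), 20: (-4000.0, 600), 30: (-3990.0, 900)}
j_opt, phase1_scores = select_by_bic(phase1, n_obs=4000)
```

Phase 2 would reuse `select_by_bic` with labels indexing the number of structured clusters while the total number of clusters is held at the selected value.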
3.5 Simulation Studies
We conducted 6 model-based simulation studies to investigate the performance of MBASIC in various settings as summarized in Table 1. Each simulation study has multiple settings that vary the distributional assumptions, size of the state-space (S), proportion of singletons (ζ), number of units (I), number of clusters (J), and number of conditions (K). We provide the details of these simulation studies in Appendix B and highlight the overall conclusions in this section.
Table 1.
Design of the simulation studies. S: size of the state-space; ζ: Proportion of singletons; I: number of units; J: number of clusters; K: number of experimental conditions.
| Study | Distribution | S | ζ | I | (J,K) | Model Selection |
|---|---|---|---|---|---|---|
| 1 | LN, NB, Bin | 2, 3, 4 | 0, 0.1, 0.4 | 4000 | (20, 30) | No |
| 2 | LN, NB, Bin | 2 | 0.1, 0.4 | 4000 | (20, 30) | Yes |
| 3 | iASeq | 3 | 0, 0.1, 0.4 | 4000 | (10, 20), (20, 30) | Yes |
| 4 | Cormotif | 2 | 0 | 10,000 | (4, 4), (5, 8), (5, 10) | Yes |
| 5 | Cormotif | 2 | 0, 0.1, 0.4 | 4000 | (10, 20) | Yes |
| 6 | LN | 2 | 0, 0.33 | 4120, 4600, 6120 | (8, 30) | Yes |
Data in Simulation Studies 1–2 were simulated according to MBASIC’s distributional assumptions. In Simulation Study 1, we emphasized the two most important features of MBASIC: the joint estimation procedure of all model parameters and the inclusion of a singleton cluster. We derived six alternative algorithms (Table B.1) to benchmark MBASIC’s performance in various settings. Three of the algorithms (SE-HC, SE-MC, PE-MC) use two-stage procedures for model estimation, decoupling either the estimation of the state-space variables or the distributional parameters from the mixture modeling of clustering analysis. The other three algorithms are created as variations on these by excluding the singleton feature (SE-MC0, PE-MC0, MBASIC0). These benchmark algorithms are analogous in spirit to procedures in many applied genomic data analyses where the associations between observational units are estimated separately from the estimation of individual dataset-specific parameters (e.g., Gerstein et al. (2012), Wei et al. (2012), Wei et al. (2015)).
Figures B.2–B.4 summarize the performance comparisons in Simulation Study 1. We observed that MBASIC’s joint estimation feature improved the inference for both the clustering structure and the individual states. In the presence of many singletons, the inclusion of their idiosyncratic state-space profiles was essential for robust clustering. In Simulation Study 2, we evaluated the effect of using BIC to select the number of clusters as well as the structural constraints within each cluster. Tables B.2 and B.3 indicate that MBASIC was always able to select models with structures similar to the simulated truth.
In Simulation Studies 3 to 5, we simulated data according to the models proposed by iASeq (Wei et al., 2012) and Cormotif (Wei et al., 2015). These models allow heterogeneous distributional parameters within the same state, a potential advantage over MBASIC in specific data analysis such as differential expression or allele-specific binding. Comparison to these two models is intended to enable investigation of whether MBASIC is robust against such within-state heterogeneity. In Simulation Study 3, we showed that MBASIC with the binomial distribution could directly handle data generated under the iASeq framework and achieve competitive performance (Figure B.5). In Simulation Study 4, we inherited the simulation settings from Wei et al. (2015), where distributions from different states were weakly separable, but the individual states were completely deterministic from the clustering. We explored more dynamic settings in Simulation Study 5, where we had easier separation between different states, but randomness among the states within the same cluster. We showed that a pre-processing step homogenizing the within-state units followed by MBASIC leads to comparable performance to Cormotif in Simulation Study 4 (Figure B.6), and much better performance in Simulation Study 5 (Figure B.7).
Wei et al. (2015) discuss an interesting point: when the clustering model does not accommodate singletons, small clusters tend to be merged together to form spurious clusters, whose estimated state-space patterns are averages over several true clusters. In order to investigate whether such a phenomenon exists for MBASIC, we conducted Simulation Study 6, where we simulated data with two large clusters and six small clusters, and compared the performance of MBASIC and MBASIC0 to highlight the effect of including a singleton cluster. We found that, compared to MBASIC0, MBASIC was significantly less aggressive in merging small clusters. Overall, it captured large clusters and allocated the small cluster units as singletons (Figures B.10 and B.11, Tables B.6, B.7 and B.8). This study highlighted the utility of a singleton cluster as a potential remedy for the merging of small clusters.
Combining results from all of our simulation studies, we conclude that MBASIC is a powerful model for both state-space estimation and clustering structure recovery. Its adaptability to singletons, effectiveness in model selection, and robustness against within-state heterogeneity strongly support its applicability for real data sets.
4 Applications of MBASIC to Genome Research Problems
4.1 Transcription Factor Enrichment Network
Regulation of gene expression relies heavily on the context-specific combinatorial activities of TFs. Gene clustering analysis based on TF occupancy data, i.e., ChIP-seq, aims to identify combinatorial patterns of TF occupancy and group genes based on such patterns. The ENCODE consortium (The ENCODE Project Consortium, 2012) has generated TF ChIP-seq datasets for over 100 TFs across multiple cell types, and has motivated several integrative studies for learning regulation patterns (Gerstein et al., 2012; Wang et al., 2012). In this study, we applied MBASIC to the analysis of such data. Specifically, we focused on the TF enrichment patterns at the promoter regions, i.e., from −5000 bp to +1000 bp relative to the transcription start site, of the 10290 genes that had significant expression, as measured by RNA-seq, in either the Gm12878 or the K562 cells. The input data to MBASIC were the mapped numbers of reads at these promoter regions from the uniformly processed ChIP-seq data by Gerstein et al. (2012). We chose the cell types Gm12878 and K562 because they had the largest numbers of TF ChIP-seq experiments. The final dataset included ChIP-seq data for I = 10290 observational units over 30 TFs, corresponding to K = 60 experimental conditions (cell type × TF) with a total of 166 replicate experiments.
We fitted MBASIC with S = 2 states and used log normal distributions as in Equation (3). s = 1 corresponded to the unenriched state, and we let γikl1 = log(1+xik), where xik is the count from the matching control experiment at unit i. s = 2 corresponded to the enrichment state, and we let γikl2 = 1 for all loci.
We followed the two-phase procedure using BIC from Section 3.4 to select both the number of clusters and the structure of each cluster. In Phase 1, we selected the number of clusters as 24. In Phase 2, we considered two types of structural constraints for each cluster, referred to as TF-homogeneity and cell type-homogeneity and defined as wjk1s = wjk2s if k1 and k2 corresponded to the same TF or the same cell type, respectively. We found that imposing cell type-homogeneity on any cluster would cause that cluster to be degenerate (i.e., no unit was assigned to that cluster). Therefore, we chose the final model among those with TF-homogeneity structures. The BIC and log likelihood values for different models fitted in both phases are shown in Figure 2. The final model had 24 unconstrained clusters, which together contained 1 − ζ = 89.8% of the 10290 loci. The ranges of the estimated distribution parameters among replicates within the same cell type-TF combination are shown in Figure C.12. We notice that these parameters can be substantially different among replicated experiments. This provides further support for our replicate-specific parametrization.
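The BIC comparison used in both phases can be illustrated with a minimal sketch. It assumes the standard definition BIC = −2·logL + p·log(n); the log-likelihoods and parameter counts below are hypothetical placeholders, not values from this study, and the function names are illustrative only.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: -2 * logL + (number of free parameters) * log(sample size)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def select_model(candidates, n_obs):
    """Return the label of the candidate with the smallest BIC.

    candidates maps a label to (log_likelihood, n_params); all values used
    below are hypothetical placeholders, not results from the paper.
    """
    return min(candidates, key=lambda label: bic(*candidates[label], n_obs))

# Hypothetical Phase 1 candidates indexed by the number of clusters J.
phase1 = {20: (-52000.0, 400), 24: (-50000.0, 480), 28: (-49900.0, 560)}
best_j = select_model(phase1, n_obs=10290)
```

In Phase 2 the same comparison would be applied to candidates that share the selected number of clusters but differ in how many clusters carry a structural constraint.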
Figure 2.
(a) BIC and (b) log-likelihood values for models with different structures. All the clusters are unstructured in the Phase 1 models and the x-axis denotes the total number of clusters. The total number of clusters is 24 for Phase 2 models and the x-axis denotes the number of unconstrained clusters. The remaining clusters have TF-homogeneity.
To compare the normalized data and the predicted enrichment probability for each cluster, we computed the normalized signals 1 and compared them to the estimated cluster parameters. Figure 3 depicts such normalized signals from five randomly selected loci within each predicted cluster (Figure 3(a)), as well as the predicted enrichment probabilities at the corresponding condition and cluster (wjk2’s) (Figure 3(b)). We observe that the estimated enrichment probabilities at the cluster level capture the commonality among loci within each cluster. In addition, each cluster of loci exhibits distinct combinatorial patterns of activity across all cell type-TF combinations. The cell type-TF combinations enriched within each cluster are listed in Table C.9.
Figure 3.
(a) Normalized data for each cell-TF combination at five sub-sampled loci within each cluster. (b) Estimated enrichment probability at each cell-TF combination for each cluster.
Our clustering results are consistent with the existing literature on the TF enrichment networks. For example, cooperating TFs tend to be enriched at the same loci. This pattern can be observed in Figure 3(b) between Bcl3 and Bclaf1. Pol2 and Pol24h8 represent Pol2 experiments with different antibodies. As expected, we observe enrichment at the same loci for these two different versions of Pol2 experiments. Moreover, pairs of TFs that have similar binding motifs have similar enrichment probabilities over the clusters. For example, Wang et al. (2012) discovered the UA1 motif as common to both Chd2 and Ets1 and the USF motif for Max, Usf1, and Usf2. Interactions between Taf1 and Tbp have also been studied by Anandapadamanaban et al. (2013). Similar enrichment probabilities of these TFs across clusters can be observed in Figure 3(b). In addition to these observations that are consistent with the literature, our results illustrate how the genome-wide TF association patterns can be attributed to specific clusters. Using raw data, we explored the loci clusters with distinct patterns between cell types (e.g., Pol2 in Cluster 12, Figure 4), TFs from the same families (e.g., Bcl3 vs. Bclaf1 in Cluster 3, Figure C.13), and TFs with similar genome-wide enrichment (e.g., Max vs. Usf1 in Cluster 2, Figure C.14). We further evaluated each cluster of genes for its KEGG pathway enrichment (Subramanian et al., 2005), and identified 8 KEGG pathways that are significantly enriched in individual clusters (Table 2). Three of our clusters (Clusters 7, 9, and 19) have more than half of their genes in one single pathway. Since KEGG pathways curate the known knowledge of molecular interaction systems, these clusters may be driven by unknown biological processes that warrant further investigation.
Figure 4.
(a, b) Plots of the transformed Pol2 ChIP sample read counts against the transformed control sample read counts for all units in (a) Gm12878 and (b) K562 cells. Data from unenriched units are expected to reside around the 45 degree dashed line.
Table 2.
Significantly enriched KEGG pathways across the 24 clusters.
| KEGG Pathway | # Genes Overlapped | Z Score | Cluster | Cluster Size |
|---|---|---|---|---|
| Protein processing in endoplasmic reticulum | 156 | 5.652 | 7 | 391 |
| Fatty acid elongation in mitochondria | 7 | 7.518 | 8 | 133 |
| B cell receptor signaling pathway | 74 | 6.016 | 9 | 146 |
| Lysine biosynthesis | 3 | 6.53 | 9 | 146 |
| D-Glutamine and D-glutamate metabolism | 3 | 5.548 | 12 | 184 |
| Vitamin B6 metabolism | 4 | 5.28 | 14 | 156 |
| Non-homologous end-joining | 12 | 7.539 | 17 | 213 |
| Lysosome | 116 | 5.402 | 19 | 187 |
MBASIC infers the clustering structure based on its own estimates of the state-space profiles. The ENCODE consortium provides the estimated enrichment regions (i.e., peaks) for each experiment in this study. A natural question is then whether MBASIC reveals more information than clustering of genes based on ENCODE-estimated binary enrichment profiles of TFs. To address this, we created a binary vector for each gene by overlapping its promoter with the ENCODE peaks. Then, we applied the state-of-the-art MClust model (Fraley and Raftery, 2002) to cluster the 10290 promoter regions based on these peak profiles. MClust selected 90 clusters based on BIC. Figure C.15 displays cluster-level estimated enrichment probabilities of TFs across the conditions considered. Compared to Figure 3, we can see that many of the MClust clusters have very similar enrichment profiles. For example, Clusters 51, 7, 8, 32, 54 contained almost no enrichment for any TFs, but are classified as distinct clusters. The association between units across these clusters is thus non-trivial to interpret. In addition, we found that for some conditions, the enrichment states predicted by MBASIC are quite different from those in the ENCODE peak profiles (e.g., Figure 5). This is because the ENCODE peaks are identified by genome-wide analysis and may not reflect the differences between the ChIP and control samples at the local promoter regions. MBASIC attains higher raw data fidelity by directly modeling the counts at each unit rather than inheriting results from existing analyses.
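The construction of a binary peak profile by overlapping a promoter with peak lists can be sketched as follows. This is a minimal illustration, assuming half-open (chromosome, start, end) intervals and a simple "any overlap" rule; the coordinates and function names are hypothetical, and the exact overlap criterion used in the analysis is not stated in the text.

```python
def overlaps(region, peak):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return region[0] == peak[0] and region[1] < peak[2] and peak[1] < region[2]

def peak_profile(region, peaks_by_condition):
    """Binary vector: 1 if the region overlaps any peak for that condition."""
    return [int(any(overlaps(region, p) for p in peaks))
            for peaks in peaks_by_condition]

# Hypothetical promoter and peak lists for three conditions.
promoter = ("chr1", 5000, 11000)
peaks_by_condition = [
    [("chr1", 10500, 10800)],   # lies inside the promoter
    [("chr1", 11000, 11200)],   # starts exactly where the promoter ends
    [("chr2", 5000, 11000)],    # same coordinates, different chromosome
]
profile = peak_profile(promoter, peaks_by_condition)
```

A profile of this kind discards the read counts themselves, which is the information MBASIC models directly.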
Figure 5.
(a, b) Transformed ChIP versus control sample read counts from a Gm12878-Ctcf dataset. Enrichment states are annotated by (a) ENCODE peak profiles and (b) MBASIC estimation. In MBASIC, an observational unit is estimated to be enriched if its enrichment probability satisfies P(θik = 2|Y ) > 0.5.
4.2 Genome-wide Identification of +9.5-like Composite Elements
Johnson et al. (2012) and Gao et al. (2013) described the requirement of the intronic +9.5 site, an Ebox-GATA composite element located at chr6: 88143884–88157023 in the mouse genome (genome version mm9), to establish the hematopoietic stem/progenitor cell (HSC) compartment in the fetal liver and for hematopoietic stem cell genesis in the aorta-gonad-mesonephros (AGM), respectively. Furthermore, Johnson et al. (2012) and Hsu et al. (2013) showed that heterozygous +9.5 mutations cause a human immunodeficiency associated with myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). Because the +9.5 site is the only known cis-element whose deletion depletes fetal liver HSCs and is lethal at E13–14 of embryogenesis, identifying additional loci with similar functionality is extremely important for establishing mechanisms that enable GATA factor-bound regions with nonredundant activity, and has the potential to reveal novel targets for therapeutic modulation of hematopoiesis. In this application, we identified 4803 genomic regions with the Ebox-GATA motif (CATCTG-N[7–9]-AGATAA, where N[7–9] denotes a variable size spacer of 7 to 9 nucleotides) in the human genome (genome version hg19). We considered a 150 bps window anchored at each of the 4803 composite elements as the observational unit. To analyze the TF occupancy activities at these units and identify a group of composite elements with occupancy profiles similar to that of the +9.5 composite element, we downloaded all ChIP-seq data for the Huvec and K562 cells from Gerstein et al. (2012). In total, the data set contained 224 replicates spanning K = 84 experimental conditions and 77 TFs.
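The motif pattern CATCTG-N[7–9]-AGATAA translates directly into a regular expression. The sketch below is illustrative only: it scans a single strand of a toy sequence, whereas a genome-wide scan would also need to handle the reverse complement and chromosome coordinates, neither of which is described here. The function name is hypothetical.

```python
import re

# E-box (CATCTG), a 7-9 nt spacer, then the GATA motif (AGATAA).
# The lookahead lets overlapping occurrences be reported.
EBOX_GATA = re.compile(r"(?=(CATCTG[ACGT]{7,9}AGATAA))")

def find_composite_elements(sequence):
    """Return (start, matched_sequence) pairs on the given strand, 0-based."""
    return [(m.start(), m.group(1)) for m in EBOX_GATA.finditer(sequence.upper())]

# Toy sequence with an 8-nt spacer between the two half sites.
hits = find_composite_elements("ttCATCTGacgtacgtAGATAAgg")
```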
We used negative binomial distributions with S = 2 states, where s = 1 denoted the unenriched (unoccupied) state, in the MBASIC framework. We chose γikl1 = 1+xik, where xik is the count for unit i from the matching control experiment for condition k, to incorporate data from the accompanying control experiments of the ChIP samples. For s = 2, we utilized the following mixture distribution to account for the heavy tails observed in the raw data:
Here, the constant 3 represents the minimum count threshold for enrichment estimation. The use of mixture distributions to capture heavy tailed count data was previously considered by Zuo and Keleş (2013). We note that an alternative approach to capture heavy tailed counts would be to fit a model using S = 3 states, with s = 2, 3 representing two distinct enrichment components. Such an approach would differ from the proposed approach in a subtle yet important way. In this alternative approach, allocation of each unit to different enrichment components would affect the clustering estimation, while in our approach, clustering is only determined by the enrichment status of the individual unit regardless of which enrichment component it follows. The E-M algorithm for this setting requires a slight modification as discussed in Section A.2.
Following the two-phase model selection procedure using BIC, we selected the model with 3 clusters, 2 of which were cell type-homogeneous. The ranges of the estimated distribution parameters among replicates within the same condition are displayed in Figures C.16 and C.17. The three clusters (denoted by C1, C2, and C3) included 332, 837, and 157 composite elements, respectively, and the remaining 3477 composite elements were identified as singletons. A heatmap for the enrichment probability of each unit under each cell type-TF combination across the three clusters is shown in Figure 6. The +9.5 element is a member of cluster C3, which consists of a total of 157 +9.5-like composite elements. A detailed genomic annotation of these elements is provided in Table C.10. Notably, 46% of the C3 elements reside in intronic regions and 42% of these are within the first intron. Only 15% of the elements in the cluster are located up to 10Kb upstream of transcription start sites.
Figure 6.
Posterior enrichment probability (i.e., P(θik = 2|Y)) for all units in the three clusters. The rightmost column of the C3 cluster corresponds to the +9.5 element.
A detailed analysis of Figure 6 reveals that cluster C3 is driven by several transcription factors with known associations to GATA2. First, we note that a large fraction of the C3 loci are bound by BRG1. The chromatin remodeler BRG1 is involved in GATA1-mediated chromatin looping (Kim et al., 2009a,b) and co-localizes with GATA1 at some chromatin sites (Hu et al., 2011). BRG1 has broad functions in many cell types; however, conditional knockouts of BRG1 reveal its importance in specific cell and tissue contexts (Holley et al., 2014). Another factor that clearly stands out as having a GATA2-like profile in cluster C3 is ETS1. Our prior work identified the propensity of occupied GATA motifs to reside near Ets motifs (Linneman et al., 2011), and Doré et al. (2012) have highlighted GATA2-ETS co-localization.
We next performed an alternative naive analysis by utilizing the list of peaks provided by the ENCODE project. As in the Transcription Factor Enrichment Network example of Section 4.1, these peaks were identified by analyzing each dataset individually with ENCODE’s uniform ChIP-seq processing pipeline. Figure C.18 displays the ENCODE peak profiles for our cell type-TF conditions. For each of the 4803 composite elements, we constructed a peak profile, which is a binary vector indicating whether the element overlaps with the ENCODE peaks for each cell type-TF combination. We then computed the peak-profile-based similarity between the +9.5 site and each of the composite elements using the R function dist.binary with the “Jaccard index” option. For comparison, we computed pseudo-binary similarities between each element and the +9.5 site using the MBASIC estimated enrichment probabilities across all conditions2. We then ranked the composite elements based on both ENCODE and MBASIC estimated similarities. Figure 7 provides a comparison of the two lists as a function of the number of top ranking composite elements. Overall, we observe that the rankings based on MBASIC estimation are consistent with the rankings based on the ENCODE peak profiles.
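For readers unfamiliar with the Jaccard index, the following sketch computes it for two binary peak profiles. It is a plain illustration of the index itself and does not reproduce the R function named above, which works with a dissimilarity derived from the index; the profiles and the all-zero convention are assumptions made for this example.

```python
def jaccard_index(a, b):
    """Jaccard index of two equal-length binary vectors: |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # convention for two all-zero profiles

# Hypothetical peak profiles over four cell type-TF conditions.
reference_site = [1, 1, 0, 1]
other_element = [1, 0, 0, 1]
similarity = jaccard_index(reference_site, other_element)
```

Conditions in which neither element overlaps a peak do not contribute to the index, so it measures agreement only among occupied conditions.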
Figure 7.

Proportion of overlap between the top ranked +9.5-like composite elements identified by MBASIC and ENCODE peak profiles. The overlap proportion is calculated by considering the same number of top ranked units (x-axis) in both the ENCODE-based and MBASIC-based similarities to the +9.5 site. The dashed lines mark that 78.3% of the C3 units are ranked in the top 157 based on the ENCODE peak profiles.
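The overlap proportion plotted in Figure 7 can be computed as sketched below. The two rankings are hypothetical toy lists, and the function name is illustrative; the sketch only assumes that both rankings order the same set of units from most to least similar to the +9.5 site.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k units of ranking_a that also appear in the top k of ranking_b."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# Hypothetical rankings of five elements by the two similarity measures.
mbasic_rank = ["e1", "e2", "e3", "e4", "e5"]
encode_rank = ["e2", "e1", "e5", "e3", "e4"]
curve = [top_k_overlap(mbasic_rank, encode_rank, k) for k in range(1, 6)]
```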
Although the rankings of the composite elements with respect to their +9.5 similarity using both the ENCODE peak profiles and MBASIC estimation were quite similar, the two approaches resulted in different enrichment estimation at the individual TF-cell combination level. Figure 8(a) compares the estimated cluster-level enrichment probabilities of each cell type-TF combination for cluster C3 against their average ENCODE peak profiles and highlights the difference between the two procedures. To further investigate these differences, we plotted the raw data for individual replicates and compared the composite elements that were estimated to be enriched by the two methods. An example using data from K562-Chd2 is displayed in Figures 8(b) and (c). Although many elements have significantly higher counts in the ChIP sample compared to the control sample, they are not identified as occupied by Chd2 in K562 according to ENCODE peak annotation.
Figure 8.
(a) Top half: Enrichment probabilities for the C3 units across all experimental conditions estimated by MBASIC. Bottom half: Proportion of C3 units that are overlapped by the ENCODE peaks for each condition. (b, c) ChIP sample read counts against normalized control sample read counts for one replicate of the K562-Chd2 dataset. Enrichment statuses are annotated by (b) the ENCODE peak profiles and (c) MBASIC prediction.
Another example using a replicate from K562-Yy1 is shown in Figure C.19, where several elements with zero ChIP count are overlapped by ENCODE peaks. These results indicate that MBASIC provides a grouping of the Ebox-GATA composite elements that is more consistent with the raw data compared to grouping based on ENCODE peak annotation.
5 Conclusions and Discussion
Clustering analysis based on an underlying state-space is a common problem for many genomic and epigenomic studies where multiple data sets over many observational units are integrated. In this paper, we developed a unified statistical framework, called MBASIC, for addressing this class of problems. MBASIC simultaneously projects the observations onto a hidden state-space and infers clustered units in this space. The hierarchical structure of MBASIC enables the information of the state-space clusters to be fed back into the projection of the raw data, thus reinforcing the accuracy of predicting the states of individual units. The MBASIC framework offers flexibility in a number of aspects of experimental design, such as different numbers of replicates under individual experimental conditions and missing values. Additionally, it is applicable to many parametric distributions. Our computational studies highlighted good operating characteristics of MBASIC, and the two genomic applications illustrated how large numbers of ChIP-seq datasets can be integrated for addressing specific problems. In both applications, the MBASIC algorithm converged within 20 minutes for a fixed model on a 64 bit machine with an Intel Xeon 3.0GHz processor and 64GB of RAM. For model selection, we utilized the R package snow to implement the two-phase procedure with parallel fitting of different candidate models on an 8-core, 64 bit, 64GB RAM machine with 8 Intel Xeon 3.0GHz processors. These runs were completed in under 2 hours. The computational efficiency of our model depends on the simple, closed-form updates in our E-M algorithm. Such a mathematical form is due, at least in part, to our modeling assumption that the rows of our state-space matrix are clustered. We have argued that this assumption, as compared to the PCA-type model structures, offers easier interpretation and is well suited for many genomic applications.
MBASIC is available as R package mbasic at https://github.com/chandlerzuo/mbasic.
A Details of the Expectation-Maximization (EM) Algorithms
A.1 Derivation for the E-step
We derive the expressions for the E-step updates of our algorithm in Eqns. (12), (13), (14), (15) as well as the marginal likelihood in Eqns. (9) and (10).
In what follows, we let θiks denote 1{θik = s}. The joint density of (z, b, θ, Y ) is given by:
| (24) |
where . The following elementary equality is used repeatedly throughout the rest of the derivations in this section.
The joint density of (z, b, Y ) can be calculated from Eqn. (24):
| (25) |
Since
Eqn. (25) can be rewritten as:
| (26) |
The joint distribution of (b, Y ) can be calculated from Eqn. (26):
| (27) |
We note that
Then, Eqn. (27) can be rewritten as:
| (28) |
We can calculate the marginal density of Y, given in Eqn. (9), from Eqn. (28) as:
| (29) |
Eqn. (10) can be obtained similarly. Moreover, we can rewrite Eqn. (25) as
| (30) |
by using
Thus, the density of (z, Y ) can be calculated as:
| (31) |
Using Eqns. (28) and (29), we obtain Eqn. (12) as
| (32) |
Similarly, using Eqns. (31) and (29), we have
| (33) |
Using Eqns. (26) and (27), we have
| (34) |
Eqns. (34) and (32) together result in Eqn. (13). Using Eqns. (24) and (25), we have
| (35) |
Therefore, we obtain Eqn. (14) by using Eqns. (27), (34), and (35):
| (36) |
Finally, we obtain Eqn. (15) using Eqns. (35) and (27):
| (37) |
A.2 EM Algorithm with Mixture Data Distributions
An important extension of the MBASIC model is to allow multiple mixture components within each state. For example, our model in Section 4.2 models the data from state s = 2 as a mixture of two negative binomial distributions following the well motivated model of Kuan et al. (2011):
where the constant 3 denotes the minimum number of reads required to be in state θ = 2. In this section, we describe the general algorithm for such extensions. We assume that data from state s has a distribution of ms components:
This can be written in a hierarchical form, using νiklsr as the hidden variable indicating the mixture component within the state:
| (38) |
Here, we allow the distribution parameters μ and σ, as well as the quantity γ derived from prior information, to depend on the component. Let fiklsr = fsr(yikl|μklsr, σklsr, γiklsr). The joint density for this model is:
| (39) |
Let fikls = Σr vklsr fiklsr, where the sum runs over the ms components of state s; then the joint density for (z, b, θ, Y) can be expressed exactly as in Eqn. (24). Therefore, the M-step updates for W, P, ζ and π are not changed, with the related E-step quantities computed as in Eqns. (12), (13), (14), (15). We only need to modify the algorithm to estimate the variables that depend on the component index r: μ, σ, and v.
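The role of the within-state mixture can be illustrated numerically. The sketch below assumes a negative binomial parametrized by its mean and a size (dispersion) parameter, and it treats the mixing weights as given; the parameter values and function names are hypothetical, and the threshold shift used in Section 4.2 is deliberately left out. It returns the state-level density obtained by summing over components, together with the conditional probability of each component given the state and the observation.

```python
import math

def nb_pmf(y, mean, size):
    """Negative binomial pmf with the given mean and size (dispersion) parameter."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
               + size * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

def state_density_and_responsibilities(y, weights, means, sizes):
    """Return (f_s(y), responsibilities) for a within-state mixture.

    f_s(y) = sum_r weights[r] * NB(y; means[r], sizes[r]); the responsibilities
    are the conditional probabilities of each component given state s and y.
    """
    parts = [w * nb_pmf(y, m, s) for w, m, s in zip(weights, means, sizes)]
    total = sum(parts)
    return total, [p / total for p in parts]

# Hypothetical two-component enriched state: a moderate and a heavy-tailed component.
f_s, resp = state_density_and_responsibilities(
    40, weights=[0.7, 0.3], means=[10.0, 50.0], sizes=[5.0, 5.0])
```

Because only the summed density enters the state-level and cluster-level updates, a unit's cluster membership does not depend on which component generated its counts.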
The related quantities that need to be computed are E[νiklsr|Y ] and E[θiksνiklsr|Y ]. By Eqn. (39), we have
Therefore, we have
| (40) |
where and can be computed by Eqns. (36) and (37). As a result,
| (41) |
Given Eqns. (40) and (41), the M-step update for vklsr is:
The M-step updates for μklsr, σklsr can be derived using Eqn. (40). For the negative binomial distribution, as in Section 4.2, we have
B Simulation Studies
This section presents six broad simulation studies to evaluate the performance of MBASIC. Each simulation study had multiple settings as outlined in Table 1 of the main article. We introduced the following four families of distributions in our main article:
- Log-normal Distribution. LN(μklsγikls, σkls) with a density function: (42)
- Negative Binomial Distribution. NB(μklsγikls, σkls) with a density function: (43)
- Binomial Distribution. Binom(γikls, μkls) with a density function: (44)
- Degenerate Distribution: (45)
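For orientation, the standard textbook forms of these four families are sketched below, writing m for the location or mean argument μklsγikls. This is an assumption-laden sketch rather than a restatement of Eqns. (42)–(45): it takes σ to be the standard deviation on the log scale for the log-normal family and the size (dispersion) parameter for the negative binomial family, and the parametrization in the main article may differ, for example by an offset inside the logarithm.

```latex
\begin{align*}
\mathrm{LN}(m, \sigma):\quad
  & f(y) = \frac{1}{y\,\sigma\sqrt{2\pi}}
    \exp\!\left\{-\frac{(\log y - m)^2}{2\sigma^2}\right\}, \quad y > 0,\\
\mathrm{NB}(m, \sigma):\quad
  & f(y) = \frac{\Gamma(y+\sigma)}{\Gamma(\sigma)\, y!}
    \left(\frac{\sigma}{\sigma+m}\right)^{\sigma}
    \left(\frac{m}{\sigma+m}\right)^{y}, \quad y = 0, 1, 2, \ldots,\\
\mathrm{Binom}(\gamma, \mu):\quad
  & f(y) = \binom{\gamma}{y}\, \mu^{y} (1-\mu)^{\gamma-y}, \quad y = 0, 1, \ldots, \gamma,\\
\text{Degenerate}:\quad
  & f_s(y) = 1\{y = s\}.
\end{align*}
```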
B.1 Simulation Study 1
The first simulation study investigated the performance of MBASIC when the true value of J was known and there were no structured clusters. We set the number of observational units as I = 4000 and the number of clusters as J = 20. The number of conditions was set to K = 30, and within each condition the number of replicates was nk = 1, 2, or 3 with probabilities 0.3, 0.5, and 0.2, respectively. The size of the hidden state space was varied at three levels: S = 2, 3, 4. We simulated data under three distributional families: log-normal (LN) (3), negative binomial (NB) (4), and binomial (Bin) (5). We also varied the proportion of singleton units ζ at 0, 0.1 and 0.4. For simplicity, we set γikls = 1 in all distributions.
B.1.1 Parameter Settings
Parameters wjks’s and pis’s generate the hidden state variables θik’s. We set them as follows. For different values of k, j, and i, the vectors wjk. = (wjks : 1 ≤ s ≤ S) and pi. = (pis : 1 ≤ s ≤ S) were simulated independently, each following an S-dimensional Dirichlet distribution Dir(α, · · ·, α). We chose a uniform concentration parameter of α = 0.2 for all dimensions to ensure that for each vector wjk. or pi., the probability mass tended to concentrate on one component. This controlled the conditional variance of (θik|bi, zij). An increased value of α would increase the conditional variance of θik, thus making it more difficult to recover wjks’s and pis’s.
The settings for parameters μkls’s and σkls’s were important. These parameters connected hidden states θik’s to the observed values Yikl’s. In general, recovering hidden states from the observed data is more difficult if: (1) differences of the mean values μkls’s between the states are small; (2) variances of the distributions within each state are large. To control these two aspects at reasonable levels, we set these parameters as follows:
For log-normal distributions (3), we set ξs = 2 + log(4s − 3), simulated μkls ~ N(ξs, 0.05²), and set σkls = 0.5;
For negative binomial distributions (4), we set ξs = 8s − 6, simulated μkls ~ N(ξs, 0.5²), and set σkl1 = 2.82, σkls = 5 for s = 2, 3, 4;
For binomial distributions (5), we simulated μkls ~ Beta(3s, 3(S+1–s)), and γikl1 = γikl2 = · · · = γiklS ~ Pois(10).
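The log-normal setting above can be sketched for a single condition as follows. This is a simplified illustration, not the simulation code used in the study: all units share one state-probability vector (one cluster, no singletons), γ is fixed at 1, the second argument of the normal distribution for μ is read as a variance (standard deviation 0.05), and σ is read as the standard deviation on the log scale. All function names are hypothetical.

```python
import math
import random

def dirichlet(alpha, dim, rng):
    """Draw one vector from a symmetric Dirichlet(alpha, ..., alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_condition(n_units, n_states, rng, alpha=0.2):
    """Simulate states and one log-normal replicate for a single condition.

    Simplification: every unit draws its state from one shared probability
    vector w (a single cluster, no singletons), and gamma is fixed at 1.
    """
    w = dirichlet(alpha, n_states, rng)
    # Log-scale means: xi_s = 2 + log(4s - 3), mu ~ N(xi_s, 0.05^2), sigma = 0.5.
    mu = [rng.gauss(2.0 + math.log(4 * s - 3), 0.05) for s in range(1, n_states + 1)]
    states = rng.choices(range(1, n_states + 1), weights=w, k=n_units)
    data = [rng.lognormvariate(mu[s - 1], 0.5) for s in states]
    return w, mu, states, data

rng = random.Random(0)
w, mu, states, data = simulate_condition(n_units=1000, n_states=3, rng=rng)
```

With α = 0.2 most of the mass of w typically falls on a single state, which mirrors the concentration property described in the text.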
Figure B.1 displays the histograms of Y1,i,l, 1 ≤ i ≤ I from one of the simulated data sets for all three distributions with S = 4 components. For comparison, we also present the histogram of an actual data set from the analysis in Section 4 of our main article. We observe that the mixture distribution of our simulated data with log-normal or negative binomial distributions closely follows the real data.
B.1.2 Alternative Approaches for Benchmarking MBASIC
The MBASIC algorithm is summarized in Algorithm 2.
To the best of our knowledge, there are currently no existing methods suited for the general setup of MBASIC. There are, however, algorithms tailored for analyzing specific data types with hierarchical state-space models similar to MBASIC. These algorithms largely fall into two categories. In the first category, estimation of the state-space variables is separated from state-space clustering. Some examples include Cheng et al. (2011), Gerstein et al. (2012), Ji et al. (2013), and Neph et al. (2012). In the second category, distributional parameters for each experimental replicate are estimated first. These parameters are then fixed, and the state-space variables and the clustering structure are estimated jointly conditional on these estimates. Examples of this approach include Wei et al. (2012), Zeng et al. (2013), and Wei et al. (2015).
Figure B.1.
Histograms for a) a real data set from a K562 Pol2 replicate in Section 4.1 of our main article; and simulated data from one condition based on one simulation for the b) Log-Normal, c) Negative Binomial and d) Binomial distribution with S=4 states.
Algorithm 2.
MBASIC
| for t = 1, 2, · · · until convergence do |
| Expectation-Step: Compute the conditional expectations E[I(θik = s)|ϕ̂(t−1)], E[bi|ϕ̂(t−1)], E[I(θik = s)bi|ϕ̂(t−1)], E[zij(1 − bi)|ϕ̂(t−1)], E[I(θik = s)zij(1 − bi)|ϕ̂(t−1)]; |
| Maximization-Step: Update estimates for parameters μkls, σkls, ζ, πj, wjks, pis. |
| end for |
Table B.1. Simulation Study 1.
A summary of the benchmark algorithms that are compared to MBASIC. Neither SE-MC nor PE-MC performs joint estimation of the model parameters. SE-* algorithms estimate the data-specific model parameters and the state-space as a first step and then cluster the state variables. PE-* algorithms estimate the data-specific model parameters and fix these in the joint estimation of the state-space and clustering.
| Algorithm | Is State-space Estimation Joint with Clustering? | Is Parameter* Estimation Joint with Clustering? | Clustering model | Include singletons |
|---|---|---|---|---|
| MBASIC | Joint | Joint | Mixture model | Yes |
| SE-HC | Separate | Separate | Hierarchical clustering | No |
| SE-MC | Separate | Separate | Mixture model | Yes |
| PE-MC | Joint | Separate | Mixture model | Yes |
| MBASIC0 | Joint | Joint | Mixture model | No |
| SE-MC0 | Separate | Separate | Mixture model | No |
| PE-MC0 | Joint | Separate | Mixture model | No |
(*) Denotes the distributional parameters for each experimental replicate.
To compare the general implementation of MBASIC as in Algorithm 2 with these existing model fitting ideas, we designed six benchmark algorithms. Table B.1 provides a summary of these algorithms. Two of these algorithms, SE-HC (State-space Estimation followed by Hierarchical Clustering) and SE-MC (State-space Estimation followed by Mixture model Clustering), treat the state-space mapping step and the state-space clustering separately. The third algorithm, PE-MC (Parameter Estimation followed by Mixture model Clustering), separates experiment-specific distributional parameter estimation from the joint estimation of other parameters.
For all the three algorithms, in the first step, observations from each experimental condition {Yikl : 1 ≤ i ≤ I, 1 ≤ l ≤ nk} are fitted according to the following model:
| (46) |
The standard E-M algorithm can be used for the first step and results in estimates of qks, μkls, σkls as well as the posterior estimates for the state space P(θik = s|Y). In the second step, SE-MC and SE-HC cluster the observational units based on the estimated P(θik = s|Y) from the first step. SE-HC (Algorithm 3) uses hierarchical clustering, while SE-MC (Algorithm 4) uses MBASIC with degenerate distributions (6) for clustering. The second step of PE-MC (Algorithm 5) is similar to Algorithm 2, except that parameters μkls, σkls’s are not updated.
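The first step shared by the three benchmark algorithms can be illustrated with a minimal EM for a single condition. The sketch assumes S = 2 states, a single replicate, and a normal mixture on log-scale data (equivalently, a log-normal mixture on the original scale); these simplifications, the initialization, and all names are choices made for this example rather than details taken from Eqn. (46).

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def em_two_state(log_y, n_iter=200):
    """EM for a two-component normal mixture on log-scale data.

    Returns (q, means, sds, post), where post[i] is the posterior probability
    of state 2 from the final E-step and state 2 has the larger mean.
    """
    q, means, sds = 0.5, [min(log_y), max(log_y)], [1.0, 1.0]  # crude start
    post = [0.5] * len(log_y)
    for _ in range(n_iter):
        # E-step: posterior probability of state 2 for each unit.
        post = []
        for x in log_y:
            a = (1.0 - q) * normal_pdf(x, means[0], sds[0])
            b = q * normal_pdf(x, means[1], sds[1])
            post.append(b / (a + b) if a + b > 0 else 0.5)
        # M-step: weighted proportion, means, and standard deviations.
        n2 = sum(post)
        n1 = len(log_y) - n2
        q = n2 / len(log_y)
        means = [sum((1 - p) * x for p, x in zip(post, log_y)) / n1,
                 sum(p * x for p, x in zip(post, log_y)) / n2]
        sds = [max(1e-3, math.sqrt(sum((1 - p) * (x - means[0]) ** 2
                                       for p, x in zip(post, log_y)) / n1)),
               max(1e-3, math.sqrt(sum(p * (x - means[1]) ** 2
                                       for p, x in zip(post, log_y)) / n2))]
    return q, means, sds, post

# Toy data: 300 unenriched and 100 enriched units on the log scale.
rng = random.Random(1)
log_y = [rng.gauss(2.0, 0.5) for _ in range(300)] + [rng.gauss(5.0, 0.5) for _ in range(100)]
q, means, sds, post = em_two_state(log_y)
```

In the SE-* benchmarks the posterior probabilities from this step are passed on to a separate clustering step, whereas in the PE-* benchmarks only the estimated means and standard deviations are kept fixed.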
In addition to joint fitting of all model parameters, another important feature of MBASIC is its inclusion of the singleton cluster 𝒞0. To the best of our knowledge, this feature is not included in similar models such as Wei et al. (2012) and Wei et al. (2015). We conjecture that in practice, when some units cannot be grouped together with other units due to their distinct state-space profiles, including this singleton cluster can enhance model estimation. To test this conjecture, we developed a version of each of the SE-MC, PE-MC, and MBASIC algorithms that ignores the singleton cluster, i.e., forces each unit into a cluster. This is achieved simply by initializing ζ = 0 in Algorithms 2, 4, and 5. We refer to these algorithms as SE-MC0, PE-MC0, and MBASIC0.
Algorithm 3.
State-space estimation followed by hierarchical clustering (SE-HC)
| Step 1: |
| for 1 ≤ k ≤ K do |
| Apply the standard E-M algorithm on data {Yikl : 1 ≤ i ≤ I, 1 ≤ l ≤ nk} to estimate posterior probabilities P(θik = s|Y ). |
| end for |
| Step 2: |
| Let θ̃i = {P(θik = s|Y)}1≤k≤K,1≤s≤S. Cluster vectors θ̃i into J clusters using hierarchical clustering algorithm with the Euclidean distance. Estimate wjks as the means within each cluster. |
Algorithm 4.
State-space estimation followed by mixture model clustering (SE-MC)
| Step 1: |
| for 1 ≤ k ≤ K do |
| Apply the standard E-M algorithm on data {Yikl : 1 ≤ i ≤ I, 1 ≤ l ≤ nk} to estimate posterior probabilities P(θik = s|Y). |
| end for |
| Step 2: |
| Denote θ̂ik = argmaxs P(θik = s|Y) (for each 1 ≤ k ≤ K, 1 ≤ i ≤ I). Apply Algorithm 2 with Yik = θ̂ik and fs(y) = I(y = s) to obtain estimates for wjks, pis, ζ, and πj. |
Algorithm 5.
Parameter estimation followed by mixture model clustering (PE-MC)
| Step 1: |
| for 1 ≤ k ≤ K do |
| Apply the standard E-M algorithm on data {Yikl : 1 ≤ i ≤ I, 1 ≤ l ≤ nk} to estimate μkls, σkls for each experiment. |
| end for |
| Step 2: |
| Apply Algorithm 2 without updating μkls, σkls in the Maximization step. |
B.1.3 Results
We utilized several criteria to compare the performance of MBASIC to the benchmark algorithms in Table B.1. To estimate how well the state space was characterized for each cluster, we computed the mean-squared error for W (MSE-W) as . We also evaluated how well each method recovered the true state variables θik’s. This was reflected by the state prediction error (SPE) as the mean squared error between the simulated states θik’s and their posterior probabilities: . Finally, to compare the estimated clustering with the simulated true clustering, we computed the Adjusted Rand Index (ARI) (Rand, 1971). ARI is a measure for the similarity between two different clusterings of the data. Its value ranges between −1 and 1, with 1 indicating perfect match between the two clusterings.
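The ARI can be computed from the contingency table of two labelings using the standard pair-counting form. The sketch below is a self-contained illustration with hypothetical labels; in this setting the singleton set would simply be one more label. The function name is illustrative, and the handling of the degenerate case is a convention chosen for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same units."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:      # e.g. both labelings put all units in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions with permuted label names.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that disagree on every within-cluster pair.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which units are grouped together, so relabeling the clusters leaves it unchanged.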
Figure B.2. Simulation Study 1, log-normal distribution.
Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.
Figure B.3. Simulation Study 1, negative binomial distribution.
Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.
Figure B.4. Simulation study 1, binomial distribution.
Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.
ARI requires the true clusters denoted by 𝒞j, 0 ≤ j ≤ J and their estimates denoted by 𝒞̂j, 0 ≤ j ≤ Ĵ. In our simulations, they were computed as:
where 𝒞0 denoted the set of singleton units. 𝒞̂j was computed from the posterior distributions as 𝒞̂j = {1 ≤ i ≤ I : j = argmax0≤j′≤J P(i ∈ 𝒞j′|Y)}, where P(i ∈ 𝒞0|Y) = E(bi|Y), and P(i ∈ 𝒞j|Y) = E[(1 − bi)zij|Y] for 1 ≤ j ≤ J.
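The hard assignment described above amounts to taking, for each unit, the label with the largest posterior probability, with label 0 reserved for the singleton set. A minimal sketch with hypothetical posterior values and an illustrative function name:

```python
def assign_clusters(singleton_prob, cluster_probs):
    """Hard assignment per unit: 0 for the singleton set, j >= 1 for cluster j.

    singleton_prob[i] plays the role of E(b_i | Y); cluster_probs[i][j-1]
    plays the role of E[(1 - b_i) z_ij | Y].
    """
    labels = []
    for p0, pj in zip(singleton_prob, cluster_probs):
        scores = [p0] + list(pj)
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Two hypothetical units and J = 2 clusters.
labels = assign_clusters([0.7, 0.1], [[0.2, 0.1], [0.3, 0.6]])
```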
The simulation results under various settings are summarized by the boxplots for each criterion in Figures B.2, B.3, and B.4. Across all simulation settings, the performance of MBASIC was consistently among the best in all of the ARI, MSE-W, and SPE metrics. This shows that MBASIC could not only recover the clustering structure, but also achieve high accuracy in estimating individual states. SE-HC, SE-MC and SE-MC0 performed the worst in both detecting the clustering structure and estimating the individual states. This suggests that separating state-space inference from joint model fitting can significantly deteriorate model estimation. Unlike the SE-* methods, PE-MC and PE-MC0 performed much closer to MBASIC. For the negative binomial and binomial distributions (Figures B.3 and B.4), PE-MC achieved ARI levels similar to those of MBASIC and slightly larger SPE values. These observations show that by jointly estimating the clusters and the states, data under different conditions could borrow information from each other and thus substantially improve the state-space estimation. Overall, these observations are consistent with the results in Wei et al. (2015) and Wei et al. (2012).
The simulation results also highlight the effect of modeling the singleton cluster in various settings. Comparing the performance of MBASIC with MBASIC0 and of PE-MC with PE-MC0, we see that modeling the singleton cluster does not have a significant effect when the proportion of singletons is low, i.e., ζ = 0 or 0.1; however, the improvement is highly significant when ζ = 0.4. When ζ = 0.4, including singletons significantly improved the performance with respect to ARI, but did not have an obvious effect on SPE. This has several implications in practice. First, the fact that MBASIC does not underperform any other method when ζ = 0 or 0.1 indicates that increasing the model complexity by introducing singletons does not make inference less robust. Because we are always agnostic about the existence of singletons in real data, keeping them in our model guards against their adverse influence on inferring the clustering structure. Second, although incorporating the singleton cluster does not improve the estimation of individual states, some epigenetic studies focus primarily on the association structure between units, as in our example in Section 4.2. For such studies, the gain in estimating the clustering structure by including the singletons is essential. We note that in the comparison of SE-MC0 with SE-MC for the negative binomial and the binomial distributions (Figures B.3 and B.4), modeling the singletons does not necessarily improve estimation for separate model fitting even when the proportion of singletons is high, e.g., ζ = 0.4. This might suggest that the state-space estimation step introduces additional noise into the clustering step, which in turn makes it less favorable to infer a complicated clustering structure with singletons.
Table B.2. Simulation study 2, Scenario 1, unstructured clusters.
Simulation results for model selection without structural constraints. For each criterion, the mean is computed over 10 simulated data sets, with the standard deviation shown in the parentheses.
| Dist. | ζ | J | ARI | MSE-W | SPE |
|---|---|---|---|---|---|
| Bin | 0.1 | 20.8 ( 2.098 ) | 0.94 ( 0.036 ) | 0.096 ( 0.018 ) | 0.159 ( 0.014 ) |
| Bin | 0.4 | 20.9 ( 1.101 ) | 0.914 ( 0.035 ) | 0.122 ( 0.034 ) | 0.204 ( 0.012 ) |
| LN | 0.1 | 20.7 ( 0.823 ) | 0.989 ( 0.005 ) | 0.044 ( 0.03 ) | 0.086 ( 0.006 ) |
| LN | 0.4 | 21.3 ( 1.337 ) | 0.972 ( 0.007 ) | 0.095 ( 0.027 ) | 0.107 ( 0.008 ) |
| NB | 0.1 | 21.6 ( 0.843 ) | 0.947 ( 0.021 ) | 0.089 ( 0.028 ) | 0.154 ( 0.007 ) |
| NB | 0.4 | 20.6 ( 2.271 ) | 0.902 ( 0.026 ) | 0.112 ( 0.048 ) | 0.189 ( 0.007 ) |
Table B.3. Simulation study 2, Scenario 2, structured clusters.
Simulation results for model selection with structural constraints. For each criterion, the mean is computed over 10 simulated data sets, with the standard deviation shown in the parentheses.
| Dist. | ζ | J1 | J | ARI | MSE-W | SPE |
|---|---|---|---|---|---|---|
| Bin | 0.1 | 10.3 ( 1.16 ) | 20.7 ( 1.494 ) | 0.934 ( 0.022 ) | 0.084 ( 0.035 ) | 0.162 ( 0.02 ) |
| Bin | 0.4 | 10.3 ( 1.636 ) | 21 ( 2.625 ) | 0.897 ( 0.048 ) | 0.125 ( 0.03 ) | 0.196 ( 0.031 ) |
| LN | 0.1 | 10.4 ( 0.516 ) | 20.6 ( 0.516 ) | 0.984 ( 0.015 ) | 0.044 ( 0.032 ) | 0.086 ( 0.006 ) |
| LN | 0.4 | 11.2 ( 1.619 ) | 22.5 ( 1.509 ) | 0.968 ( 0.01 ) | 0.108 ( 0.037 ) | 0.106 ( 0.006 ) |
| NB | 0.1 | 10.9 ( 1.197 ) | 21 ( 1.054 ) | 0.955 ( 0.019 ) | 0.064 ( 0.035 ) | 0.155 ( 0.008 ) |
| NB | 0.4 | 11.2 ( 1.814 ) | 22.2 ( 1.398 ) | 0.926 ( 0.014 ) | 0.108 ( 0.031 ) | 0.184 ( 0.013 ) |
B.2 Simulation Study 2: Model Selection
This second set of simulations aimed to evaluate the use of BIC to select the number of clusters as well as the structural constraints for each cluster. We simulated data sets under two scenarios. For the first scenario, each data set had J = 20 clusters with K = 30 experimental conditions, and none of the clusters had structural constraints. For the second scenario, each data set had J = 20 clusters over K = 30 conditions, but J1 = 10 of the clusters were structurally constrained as follows:
We refer to the two scenarios as the unstructured scenario and the structured scenario, respectively. We considered log-normal distributions (3), negative binomial distributions (4) and binomial distributions (5) for both scenarios. We also varied the proportion of singleton units ζ at 0.1 and 0.4. The number of states was fixed at S = 2. The remaining parameters were simulated following the same mechanism as in Section B.1.1.
For each simulated data set, we fitted a number of candidate models. For the unstructured scenario, we varied the number of clusters J from 10 to 30. For the structured scenario, we followed the two-phase procedure described in Section 3.4 of our main article. The best model was selected by the minimum BIC value. To assess the performance of these selected models, we computed the ARI, MSE-W and SPE metrics as described in Section B.1.3.
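The selection step for the unstructured scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-likelihoods and parameter counts below are made-up numbers, the generic form BIC = −2 log L + p log n is used, and the choice of the sample size n entering the penalty is an assumption rather than the MBASIC definition.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates, n_obs):
    """Return the candidate J with the smallest BIC, together with all BIC values.

    candidates maps J -> (maximized log-likelihood, number of free parameters).
    """
    scores = {j: bic(ll, p, n_obs) for j, (ll, p) in candidates.items()}
    best_j = min(scores, key=scores.get)
    return best_j, scores


# Hypothetical fits for three candidate values of J.
candidates = {10: (-9000.0, 310), 20: (-6000.0, 620), 30: (-5950.0, 930)}
best_j, scores = select_by_bic(candidates, n_obs=4000)
```

In this toy example the likelihood gain from J = 20 to J = 30 is too small to offset the larger penalty, so J = 20 is selected.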
The simulation results are summarized in Tables B.2 and B.3. Under each set of parameters, we computed the mean and the standard deviation of each criterion, as well as of the selected values of J and J1, over 10 simulated data sets. These tables show that the selected values for J and J1 were very close to the true values. Moreover, MBASIC performed uniformly well with respect to ARI, MSE-W, and SPE under different settings. These results indicate that even if MBASIC may not identify the “true” structure that drives the actual data, the identified structures can still properly represent the state-space associations between units.
B.3 Simulation Studies 3-5: Comparison with iASeq and CorMotif
In this section, we compare MBASIC with two recently proposed models for integrative analysis of specific types of genomic data: CorMotif (Wei et al., 2015) and iASeq (Wei et al., 2012). Both models have a state-space clustering structure similar to that of MBASIC. The main difference from MBASIC is that they each incorporate more complicated distributional assumptions targeting specific genomic data types. The CorMotif model specifically addresses integrative differential expression analysis with nk1 case condition replicates and nk0 control condition replicates for each experimental condition k. It inherits the LIMMA (Smyth, 2004) framework for differential analysis of gene-expression data and assumes a mixture of Gaussian distributions with S = 2 states: s = 1 for the equally expressed state, and s = 2 for the differentially expressed state. Specifically, the CorMotif model has the following state-space mapping structure:
where Xikl’s are the observed data from the control experiments, and Yikl’s are the observed data from the case experiments. nk and are hyper parameters specific to each experiment to account for potential heterogeneity among units within the same state, and uk reflects the strength of differential expression. CorMotif assumes almost the same state-space clustering structure as MBASIC except that it does not include singletons. The iASeq model, which targets allele-specific binding problems, has the following state-space mapping structure:
where αk and βk are experiment-specific parameters, and γikl is the observed total number of reads between two alleles. The state-space clustering structure for iASeq is almost the same as that of MBASIC, except that it assumes no singletons, and that one cluster is dedicated to equal binding/occupancy between the alleles (i.e., w1k1 = 1, ∀1 ≤ k ≤ K).
There are two key differences between CorMotif/iASeq and MBASIC. First, both CorMotif and iASeq address the heterogeneity among the units within the same state, and they introduce additional hyper parameters to model the heterogeneous parameters associated with the distribution of individual units. Compared to MBASIC, where we assume the distributions within the same state are homogeneous, such heterogeneous distributional assumptions are much more realistic. Second, CorMotif and iASeq implement two-stage estimation procedures similar to PE-MC0, which separate parameter estimation from state-space clustering. Wei et al. (2015) pointed out that once we have the heterogeneous distributional parameters within each state, joint model fitting for all parameters would require running a Markov Chain Monte-Carlo algorithm rather than the simple E-M algorithm we have developed for MBASIC. The resulting computational cost might therefore limit the applicability of joint fitting to large real data sets.
To compare MBASIC with CorMotif and iASeq, we simulated data according to each of the assumed distributions of CorMotif/iASeq, but fitted MBASIC models using simplified distributions. For data simulated from the iASeq model, we used MBASIC to fit binomial distributions with S = 3 states (5). For data simulated from the CorMotif model, we first generated two versions of t-statistics as follows. For each unit and experiment, denote and vk = 1/nk1 + 1/nk0. We computed the naive t-statistic Tik as:
| (47) |
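As an illustration of what a naive t-statistic of this kind looks like, the following Python sketch computes a standard pooled-variance two-sample t-statistic for one unit and one condition. It is an assumption that Tik takes this textbook form; only the factor vk = 1/nk1 + 1/nk0 is taken from the text, and the function name is hypothetical.

```python
from statistics import mean, variance


def naive_t_statistic(case, control):
    """Pooled-variance two-sample t-statistic for one unit and one condition.

    case and control hold the replicate measurements (Y_ikl and X_ikl).
    """
    n1, n0 = len(case), len(control)
    v_k = 1.0 / n1 + 1.0 / n0  # v_k as defined in the text
    # Assumed form: pooled sample variance across case and control replicates.
    pooled_var = ((n1 - 1) * variance(case) + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
    return (mean(case) - mean(control)) / (v_k * pooled_var) ** 0.5


t_value = naive_t_statistic([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```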
We also computed the limma t-statistic T̃ik by first fitting the data for each condition using LIMMA (Smyth, 2004) to estimate nk and , then computed:
| (48) |
For each set of Tik’s and T̃ik’s, we fitted the MBASIC model with S = 2 components of scaled-t distributions:
| (49) |
Here, μks is the scaling parameter, and σks is the degrees of freedom. Because we pooled the replicate level data to generate these t-statistics, the parameters μ and σ no longer depended on l. We refer to the method using T̃ik as MBASIC-limma, and the method using Tik as MBASIC-t. Because there is no closed-form maximum likelihood solution for t-distributions, we used the method of moments to estimate the μks’s and σks’s in the M-step, similar to the case of negative binomial distributions.
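One way such a moment-based update could look is sketched below in Python. The text does not state which moments are matched, so this is an assumption: the sketch matches the second and fourth moments of a zero-centered scaled t-distribution T = μ·t_ν, using Var(t_ν) = ν/(ν − 2) and excess kurtosis 6/(ν − 4), which requires ν > 4. The weights stand in for the posterior state probabilities from the E-step; all names are hypothetical.

```python
def weighted_raw_moments(values, weights):
    """Weighted second and fourth moments about zero."""
    total = sum(weights)
    m2 = sum(w * x ** 2 for x, w in zip(values, weights)) / total
    m4 = sum(w * x ** 4 for x, w in zip(values, weights)) / total
    return m2, m4


def scaled_t_from_moments(m2, m4):
    """Match E[T^2] and E[T^4] of T = mu * t_nu to obtain (mu, nu).

    Assumes a zero-centered scaled t-distribution with nu > 4.
    """
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    if excess_kurtosis <= 0:
        raise ValueError("sample kurtosis too small for a finite-nu t fit")
    nu = 4.0 + 6.0 / excess_kurtosis
    mu = (m2 * (nu - 2.0) / nu) ** 0.5
    return mu, nu


# Moments of 2 * t_10: E[T^2] = 4 * 10 / 8 = 5 and E[T^4] = 4 * 5^2 = 100.
mu, nu = scaled_t_from_moments(5.0, 100.0)
```

When the weighted sample kurtosis is not above 3, no finite ν matches the moments, so a practical implementation would need a fallback such as capping ν at a large value.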
In Simulation Study 3, we simulated data following the iASeq model. We set αk = βk = 2, and simulated state-space variables the same as in Section B.1 with I = 4000. We set J = 10, 20 and ζ = 0, 0.1, 0.4. Simulation Studies 4-5 compare MBASIC with CorMotif. In Simulation Study 4, we simulated data in four settings corresponding to Simulations 1-4 of Wei et al. (2015) respectively. In these settings, we had nk = 4, uk = 4, . Table B.4 summarizes the settings for the number of clusters, experiment conditions, and units for the state-space variables. We refer our readers to Wei et al. (2015) for more details of the state-space design. We note that the simulations of Wei et al. (2015) did not include singletons (i.e., ζ = 0) and furthermore, their settings assumed wjks ∈ {0, 1}. This means that the state-space variables are completely determined by the clustering structure. In Simulation Study 5, we set nk and the same as in Simulation Study 4, but set uk = 8 for easier distinction between different states. However, we simulated wjks following S-dimensional Dirichlet distributions as in Simulation Study 1 to introduce noise in generating state-space variables. In addition, we simulated data with a smaller number of units (I = 4000), but more clusters (J = 10, 20), and varied the proportion of singletons ζ = 0, 0.1, 0.4. The other details of generating state-space variables were the same as in Section B.1.
Table B.4.
Summary for the designs of the simulation settings in Simulation Study 4, originally designed by Wei et al. (2015).
| Simulation Setting | I | J | K |
|---|---|---|---|
| 1 | 10,000 | 4 | 4 |
| 2 | 10,000 | 4 | 4 |
| 3 | 10,000 | 5 | 8 |
| 4 | 10,000 | 5 | 20 |
Table B.5. Simulation Studies 3-5.
A summary of the simulation designs, the fitting algorithms compared, and the figure numbers for the results.
| Study | J | ζ | True model | Fitting algorithms | Related figures |
|---|---|---|---|---|---|
| 3 | 10, 20 | 0, 0.1, 0.4 | iASeq | MBASIC, iASeq | Figure B.5 |
| 4 | 4, 5 | 0 | CorMotif | MBASIC-limma, MBASIC-t, CorMotif | Figures B.6, B.8 |
| 5 | 10, 20 | 0, 0.1, 0.4 | CorMotif | MBASIC-limma, MBASIC-t, CorMotif | Figure B.7 |
Table B.5 further summarizes the components of the Simulation Studies. For each set of parameters, we simulated 10 data sets. We computed ARI, MSE-W, and SPE based on both the model with the number of clusters selected by BIC, and the oracle model where the number of clusters is set to its true value. The comparison between MBASIC and iASeq is shown in Figure B.5. For all the different settings, MBASIC achieved better clustering performance, with higher ARI values. However, iASeq performed better in SPE and MSE-W. When ζ = 0, iASeq performed overall better than MBASIC, with ARI values similar to MBASIC’s but much lower SPE. However, as ζ increased, iASeq’s ARI value became significantly smaller than MBASIC’s, while its SPE value became closer to MBASIC’s. In such cases, the benefits of modeling singletons seem to outweigh the loss from using simplified distributional assumptions.
The comparison between MBASIC and CorMotif is summarized in Figures B.6 and B.7. In Simulation Study 4 (Figure B.6), because CorMotif models did not allow singletons, we also excluded the singleton cluster in fitting MBASIC models. MBASIC-limma performed the best except in the first setting, where CorMotif achieved the best SPE. Figure B.8 depicts the average true positive rate in detecting states with θik = 2 among the 1000 top-ranking unit-experiment pairs for each of the four settings. In all but Setting 1, MBASIC-limma performed as well as CorMotif. We note that Setting 1 has the fewest clusters (J = 4) and the fewest experimental conditions (K = 4), while the other settings have more complicated state-space structures. The performance of MBASIC-t was the worst in all four settings. This suggests that neglecting the heterogeneity in these cases can significantly increase estimation error. Although the MBASIC model alone does not address the heterogeneity issue, fitting MBASIC models after a data pre-processing step that incorporates the heterogeneity structure, such as computing T̃ik in MBASIC-limma, can significantly improve model inference. In Simulation Study 5, where we had stronger signals in separating distribution components but noisy state-space clusters, CorMotif resulted in the largest SPE values in all settings (Figure B.7). Although its performance in ARI was comparable with MBASIC-limma when ζ ≤ 0.1, it deteriorated with an increasing proportion of singletons, i.e., ζ = 0.4. Simulation Studies 4 and 5 collectively suggest that MBASIC’s performance is competitive with CorMotif in settings where we have less noise in clustering structure, small numbers of clusters, and some level of singletons despite the fact that the distributional assumptions of MBASIC might be mis-specified. This indicates that for real data sets where we are agnostic about the true data generating structure, MBASIC might be a more general and robust approach.
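The true positive rate reported in Figure B.8 can be sketched as a top-k precision computation. The Python sketch below follows the Figure B.8 caption in ranking unit-experiment pairs by the posterior probability P(θik = 2 | Y); the dictionary-based data layout and the function name are assumptions made for illustration.

```python
def top_k_true_positive_rate(posterior_state2, true_state, k=1000):
    """Fraction of the k highest-ranked (unit, condition) pairs whose true state is 2.

    posterior_state2 maps (i, k) -> P(theta_ik = 2 | Y); true_state maps (i, k) -> state.
    """
    ranked = sorted(posterior_state2, key=posterior_state2.get, reverse=True)[:k]
    hits = sum(1 for pair in ranked if true_state[pair] == 2)
    return hits / len(ranked)


# Toy example with two units and two conditions.
posterior = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 0.1}
truth = {(0, 0): 2, (0, 1): 1, (1, 0): 2, (1, 1): 1}
rate = top_k_true_positive_rate(posterior, truth, k=2)
```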
Figure B.5. Simulation Study 3, comparison between MBASIC and iASeq.
We varied the number of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are summarized over 10 simulations under each setting.
Figure B.6. Simulation Study 4, comparison between MBASIC and CorMotif.
We simulated data under four settings as in Table B.4. Results are summarized over 10 simulations under each setting.
Figure B.7. Simulation Study 5, comparison between MBASIC and CorMotif.
We varied the number of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are summarized over 10 simulations under each setting.
Figure B.8. Simulation Study 4, comparison between MBASIC and CorMotif.
The average true positive rate for the 10 simulations among the 1000 highest ranking unit-experiment pairs for each of the four simulation settings in Table B.4. For each simulation, the “true positive” set consists of (i, k)’s with θik = 2, and the ranking is based on the posterior probability P (θik = 2|Y ).
Figure B.9. Simulation Study 6.
Two settings of the true cluster patterns, represented by the matrix .
B.4 Simulation Study 6: Weak Clusters
Wei et al. (2015) pointed out that state-space clustering without accommodating singletons can lead to merging of small clusters with large clusters with distinct state-space profiles. Such a phenomenon may alter the interpretation of the W matrix, because each column may represent the average state-space pattern of several small clusters that lack the data support. It is therefore important to investigate whether such a phenomenon still exists for MBASIC where we include a singleton cluster.
We conducted six simulations in Simulation Study 6 to investigate this issue. We simulated data according to the log-normal distribution with S = 2 states, J = 8 clusters, and K = 30 conditions. The number of replicates within each condition, as well as the distribution parameters within each state, are the same as in Section B.1.1. We set the sizes of the first two clusters as 2000, and varied the size of each of the other six clusters, that is nsmall, as 20 or 100. To vary the level of state-space similarity between the two big clusters and the small clusters, we had two settings for the state-space pattern as shown in Figure B.9. For Simulations 1–3, the conditions in which the small clusters have state s = 2 are distinct from the two big clusters, while for Simulations 4–6, the patterns between the small and large clusters are more similar. To control these cluster patterns, we set wjks ∈ {0.1, 0.9}. Finally, we included nsingleton = 0 or 2000 singletons in each simulated data set. The states for the singleton units were generated the same as in Section B.1.1.
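The state-generation part of this design can be sketched in Python as follows. This is an illustration only: the cluster patterns are drawn at random rather than taken from Figure B.9, the singleton states are drawn with probability 0.5 per condition as a stand-in for the mechanism of Section B.1.1, and the log-normal observations themselves are not simulated. All names are hypothetical.

```python
import random


def simulate_states(n_big=2000, n_small=20, n_singleton=0, n_conditions=30, seed=0):
    """Draw cluster labels and binary states for a weak-cluster design.

    Labels 1-2 are the big clusters, 3-8 the small clusters, and 0 marks singletons.
    """
    rng = random.Random(seed)
    # Hypothetical patterns: P(state 2) is 0.9 or 0.1 in each condition, so that
    # w_jks takes values in {0.1, 0.9}; Figure B.9 patterns are not reproduced.
    patterns = {j: [0.9 if rng.random() < 0.5 else 0.1 for _ in range(n_conditions)]
                for j in range(1, 9)}
    sizes = {1: n_big, 2: n_big, **{j: n_small for j in range(3, 9)}}
    labels, states = [], []
    for j, size in sizes.items():
        for _ in range(size):
            labels.append(j)
            states.append([2 if rng.random() < p else 1 for p in patterns[j]])
    for _ in range(n_singleton):
        labels.append(0)
        # Assumption: singleton states are independent fair coin flips.
        states.append([2 if rng.random() < 0.5 else 1 for _ in range(n_conditions)])
    return labels, states


labels, states = simulate_states(n_small=20, n_singleton=2000)
```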
In each simulation, we fitted the data using MBASIC and MBASIC0 with BIC to select the number of clusters. MBASIC0 differs from MBASIC only by the exclusion of the singleton feature. Therefore, comparing these two methods allows us to assess how fitting a singleton cluster may affect the small cluster merging problem. Tables B.6 and B.7 compare the confusion matrices between the fitted and the true clusters. We also display the state-space patterns of the fitted models in Figures B.10 and B.11. In Simulations 1 and 4, with nsmall = 20 units in each of the small clusters, MBASIC classified the units of small clusters as singletons, while MBASIC0 merged them to form a spurious cluster. The state-space pattern estimated by MBASIC represented the two real big clusters. When the data included singletons, as in Simulations 2 and 5, MBASIC0 formed more spurious clusters, while MBASIC continued to allocate the small clusters as singletons. When we had nsmall = 100 units in each of the small clusters, both methods identified these small cluster patterns in Simulation 6 (Figure B.11), but formed spurious clusters in Simulation 3 (Figure B.10). We compare the resulting ARI, MSE-W, and SPE between MBASIC and MBASIC0 in Table B.8. The performances of these two methods are close when we have no singletons, but differ otherwise. Based on these simulations, we conclude that fitting a singleton cluster can substantially reduce the merging of weak clusters. The state-space patterns estimated by the W matrix are more likely to reflect true underlying clusters rather than the average of several small clusters. We acknowledge that how well modeling the singletons can avoid merging weak clusters requires further investigation in more dynamic settings as we vary the similarity among clusters, the difference among the states, as well as other variables that influence cluster structures such as J, K, S. We leave such potential investigations as future research.
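Confusion matrices of the kind reported in Tables B.6 and B.7 are cross-tabulations of true against estimated cluster labels. A minimal Python sketch, with a hypothetical function name and toy labels, is:

```python
from collections import Counter


def confusion_matrix(true_labels, est_labels):
    """Cross-tabulate true cluster labels (rows) against estimated labels (columns)."""
    counts = Counter(zip(true_labels, est_labels))
    rows = sorted(set(true_labels))
    cols = sorted(set(est_labels))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


# Toy example: true cluster 3 is absorbed into the estimated singleton cluster 0.
rows, cols, table = confusion_matrix([1, 1, 2, 3, 3], [1, 1, 2, 0, 0])
```

A row whose counts fall entirely in column 0 corresponds to a true cluster whose units were all classified as singletons.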
Figure B.10. Simulation Study 6, Simulations 1–3.
Estimated cluster patterns by (a, c, e) MBASIC and (b, d, f) MBASIC0. The true clustering pattern is shown in Figure B.9(a).
Figure B.11. Simulation Study 6, Simulations 4–6.
Estimated cluster patterns by (a, c, e) MBASIC and (b, d, f) MBASIC0. The true clustering pattern is shown in Figure B.9(b).
Table B.6. Simulation Study 6, Simulations 1–3.
Confusion matrix between the true clusters and the estimated clusters. The true cluster pattern is shown in Figure B.9(a).
Simulation 1, nsmall = 20, nsingleton = 0

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 |
|---|---|---|---|---|---|---|---|
| 1 | 23 | 2 | 1975 | 1 | 1998 | 2 | 0 |
| 2 | 30 | 1967 | 3 | 2 | 4 | 1995 | 1 |
| 3 | 20 | 0 | 0 | 3 | 0 | 1 | 19 |
| 4 | 20 | 0 | 0 | 4 | 0 | 1 | 19 |
| 5 | 20 | 0 | 0 | 5 | 0 | 0 | 20 |
| 6 | 20 | 0 | 0 | 6 | 0 | 1 | 19 |
| 7 | 20 | 0 | 0 | 7 | 0 | 0 | 20 |
| 8 | 20 | 0 | 0 | 8 | 0 | 1 | 19 |

Simulation 2, nsmall = 20, nsingleton = 2000

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | MBASIC: 3 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 | MBASIC0: 4 | MBASIC0: 5 | MBASIC0: 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | 19 | 17 | 7 | 0 | 67 | 372 | 620 | 59 | 427 | 455 |
| 1 | 180 | 1818 | 2 | 0 | 1 | 1957 | 25 | 0 | 4 | 0 | 14 |
| 2 | 153 | 3 | 1844 | 0 | 2 | 8 | 23 | 3 | 1950 | 0 | 16 |
| 3 | 20 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 19 |
| 4 | 20 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 20 |
| 5 | 20 | 0 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 19 |
| 6 | 20 | 0 | 0 | 0 | 6 | 0 | 0 | 1 | 0 | 0 | 19 |
| 7 | 20 | 0 | 0 | 0 | 7 | 0 | 0 | 1 | 0 | 0 | 19 |
| 8 | 20 | 0 | 0 | 0 | 8 | 0 | 0 | 1 | 0 | 0 | 19 |

Simulation 3, nsmall = 100, nsingleton = 0

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | MBASIC: 3 | MBASIC: 4 | MBASIC: 5 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 | MBASIC0: 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 21 | 1974 | 0 | 0 | 5 | 0 | 1 | 1993 | 0 | 7 | 0 |
| 2 | 12 | 7 | 0 | 1 | 1980 | 0 | 2 | 7 | 1 | 1991 | 1 |
| 3 | 1 | 1 | 9 | 0 | 0 | 89 | 3 | 1 | 0 | 0 | 99 |
| 4 | 5 | 0 | 89 | 0 | 0 | 6 | 4 | 0 | 2 | 0 | 98 |
| 5 | 6 | 0 | 2 | 92 | 0 | 0 | 5 | 0 | 99 | 0 | 1 |
| 6 | 1 | 0 | 0 | 13 | 0 | 86 | 6 | 0 | 6 | 0 | 94 |
| 7 | 0 | 0 | 96 | 4 | 0 | 0 | 7 | 0 | 100 | 0 | 0 |
| 8 | 0 | 0 | 0 | 97 | 0 | 3 | 8 | 0 | 99 | 0 | 1 |

Table B.7. Simulation Study 6, Simulations 4–6.
Confusion matrix between the true clusters and the estimated clusters. The true cluster pattern is shown in Figure B.9(b).
Simulation 4, nsmall = 20, nsingleton = 0

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 1995 | 0 | 1 | 1999 | 1 | 0 |
| 2 | 1 | 0 | 1999 | 2 | 0 | 0 | 2000 |
| 3 | 1 | 19 | 0 | 3 | 17 | 3 | 0 |
| 4 | 19 | 1 | 0 | 4 | 0 | 20 | 0 |
| 5 | 1 | 0 | 19 | 5 | 0 | 5 | 15 |
| 6 | 20 | 0 | 0 | 6 | 0 | 20 | 0 |
| 7 | 18 | 1 | 1 | 7 | 0 | 19 | 1 |
| 8 | 20 | 0 | 0 | 8 | 1 | 19 | 0 |

Simulation 5, nsmall = 20, nsingleton = 2000

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 | MBASIC0: 4 | MBASIC0: 5 | MBASIC0: 6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1986 | 7 | 7 | 0 | 16 | 393 | 457 | 561 | 560 | 13 |
| 1 | 31 | 1969 | 0 | 1 | 1990 | 0 | 0 | 5 | 5 | 0 |
| 2 | 45 | 0 | 1955 | 2 | 0 | 0 | 0 | 5 | 12 | 1983 |
| 3 | 11 | 9 | 0 | 3 | 12 | 0 | 0 | 1 | 7 | 0 |
| 4 | 20 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | 19 | 0 |
| 5 | 8 | 0 | 12 | 5 | 0 | 0 | 0 | 0 | 6 | 14 |
| 6 | 20 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 20 | 0 |
| 7 | 20 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 20 | 0 |
| 8 | 20 | 0 | 0 | 8 | 0 | 0 | 0 | 1 | 19 | 0 |

Simulation 6, nsmall = 100, nsingleton = 0

| True | MBASIC: 0 | MBASIC: 1 | MBASIC: 2 | MBASIC: 3 | MBASIC: 4 | MBASIC: 5 | MBASIC: 6 | MBASIC: 7 | True | MBASIC0: 1 | MBASIC0: 2 | MBASIC0: 3 | MBASIC0: 4 | MBASIC0: 5 | MBASIC0: 6 | MBASIC0: 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 1983 | 14 | 0 | 0 | 0 | 0 | 0 | 1 | 1976 | 24 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1999 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2000 | 0 | 0 |
| 3 | 0 | 8 | 92 | 0 | 0 | 0 | 0 | 0 | 3 | 7 | 93 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 98 | 0 | 0 | 0 | 4 | 0 | 0 | 99 | 0 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 99 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 99 | 0 | 1 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 99 | 0 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 99 |
| 7 | 1 | 0 | 1 | 96 | 0 | 1 | 0 | 1 | 7 | 0 | 2 | 1 | 1 | 1 | 95 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 8 | 0 | 0 | 0 | 100 | 0 | 0 | 0 |

Table B.8. Simulation Study 6.
ARI, MSE-W, and SPE in all simulations.
| Simulation | nsmall | nsingleton | MBASIC ARI | MBASIC MSE-W | MBASIC SPE | MBASIC0 ARI | MBASIC0 MSE-W | MBASIC0 SPE |
|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 0 | 0.967 | 0.433 | 0.185 | 0.991 | 0.29 | 0.189 |
| 2 | 20 | 2000 | 0.79 | 0.442 | 0.192 | 0.727 | 0.34 | 0.193 |
| 3 | 100 | 0 | 0.969 | 0.203 | 0.187 | 0.975 | 0.233 | 0.193 |
| 4 | 20 | 0 | 0.977 | 0.384 | 0.169 | 0.983 | 0.260 | 0.167 |
| 5 | 20 | 2000 | 0.926 | 0.384 | 0.185 | 0.773 | 0.333 | 0.184 |
| 6 | 100 | 0 | 0.949 | 0.087 | 0.172 | 0.947 | 0.087 | 0.171 |
C Additional Tables and Figures
Figure C.12.
The range of the estimated parameters μkls and σkls among the different replicates under the same experimental condition for the transcription factor enrichment network data in Section 4.1.
Figure C.13.
Plots of the transformed ChIP sample read counts against the transformed control sample read counts for all units in the Gm12878 cell for (a) Bcl3 and (b) Bclaf1. Data from unenriched units are expected to lie around the 45 degree dashed line.
Figure C.14.
Plots of the transformed ChIP sample read counts against the transformed control sample read counts for all units in both Gm12878 and K562 cells for (a) Max and (b) Usf1. Data from unenriched units are expected to lie around the 45 degree dashed line.
Figure C.15.
Estimated enrichment probability for each of the 90 clusters identified by MClust.
Figure C.16.
The range of the estimated parameters μkls among the different replicates under the same experimental condition for the +9.5 composite element data in Section 4.2.
Figure C.17.
The range of the estimated parameters σkls among the different replicates under the same experimental condition for the +9.5 composite element data in Section 4.2. For a subset of the replicates, the data for an individual state can be under-dispersed, resulting in a negative value for the estimated size parameter in the negative binomial distribution. In that case, we set σkls = 100. Although our model does not fully capture the under-dispersion patterns, the large variations in σkls for a subset of the experimental conditions suggest that assuming replicate-specific distributions is necessary.
Figure C.18.
Enrichment states provided by the ENCODE peak profiles. Seven empty rows correspond to the TFs that lack ENCODE peak profiles. Note that only a small percentage of the composite elements that harbor the canonical GATA binding site are identified as enriched for GATA family transcription factors. This suggests that the ENCODE peak profiles can be conservative.
Figure C.19.
(a, b) ChIP sample read counts against control sample read counts for one replicate with K562-Yy1. Enrichment statuses are annotated by (a) the ENCODE peak profiles and (b) MBASIC prediction.
Table C.9.
Enriched cell type-TF combinations for each cluster in the TF enrichment network analysis of Section 4.1 of the main text. TFs with estimated enrichment probability > 95% are listed for each cluster.
| Cluster | # of Loci | Common TF | Gm12878 Specific | K562 Specific |
|---|---|---|---|---|
| 1 | 34 | Bcl3, Max, Sp1, Taf1, Zbtb33 | Atf3, Ets1, Jund, Nrf1, Pol24h8, Sin3ak20, Tr4 | Egr1, Pu1, Rad21, Usf2 |
| 2 | 317 | Bcl3, Chd2, Max, Pol24h8, Sp1, Taf1 | Atf3, Ets1, Nrf1, Sin3ak20, Usf2 | Bclaf1, Egr1, Pu1, Smc3, Tbp, Zbtb33 |
| 3 | 490 | Ets1, Pol2, Pol24h8, Sin3ak20, Taf1, Tbp, Usf1 | Atf3, Chd2, Nrf1, Usf2 | Bcl3, Bclaf1, Max, Sp1 |
| 4 | 555 | Bcl3, Bclaf1, Chd2, Pol2, Pol24h8, Sp1, Taf1, Tbp | Atf3, Ets1, Nrf1, Sin3ak20, Six5, Smc3 | Pu1 |
| 5 | 428 | Bcl3, Chd2, Ets1, Pol24h8, Sp1, Taf1 | Atf3, Ctcf, Nrf1, Rad21, Sin3ak20 | Bclaf1, Pol2, Smc3, Tbp |
| 6 | 729 | Bcl3, Chd2, Ets1, Pol2, Pol24h8, Sin3ak20, Smc3, Sp1, Taf1 | Atf3, Nrf1 | Bclaf1, Egr1, Tbp |
| 7 | 391 | Bcl3, Bclaf1, Chd2, Egr1, Ets1, Smc3, Sp1 | ||
| 8 | 133 | Ctcf | Atf3, Chd2, Rad21 | Smc3, Srf |
| 9 | 146 | Ctcf | Bclaf1, Pol24h8, Smc3, Tbp | |
| 10 | 469 | Pol2, Pol24h8, Taf1 | Smc3 | Bclaf1, Ets1, Sin3ak20, Tbp |
| 11 | 440 | Gabp, Pol24h8, Taf1 | Ets1, Smc3 | Pol2 |
| 12 | 184 | Bcl3, Chd2, Ets1, Nrf1, Pol24h8, Taf1 | ||
| 13 | 277 | Chd2, Pol2, Pol24h8, Sp1, Taf1, Tbp | Smc3 | Cfos |
| 14 | 156 | Chd2, Pol2, Pol24h8, Tbp, Zbtb33 | Ets1, Smc3, Taf1 | |
| 15 | 412 | Pol2, Pol24h8 | ||
| 16 | 327 | Pol24h8 | ||
| 17 | 213 | Usf1 | Atf3, Usf2 | Max |
| 18 | 241 | Six5 | Ets1, Smc3 | |
| 19 | 187 | Chd2, Sp1 | Cfos | |
| 20 | 222 | Ets1 | ||
| 21 | 385 | Pol24h8 | ||
| 22 | 343 | Ctcf | Smc3 | |
| 23 | 449 | Nrf1, Smc3 | ||
| 24 | 1674 |
Table C.10.
Annotations for +9.5 Element-like loci in 5p1 (2Kb upstream of transcription start site (TSS)), 5p2 (2Kb to 10Kb upstream of TSS) and intronic regions.
| Ref ID | Gene | Chr | Strand | Gene Start | Gene End | Region | Distance | Peak Start | Peak End | +9.5 Similarity |
|---|---|---|---|---|---|---|---|---|---|---|
| NM_001145662 | GATA2 | chr3 | − | 128198264 | 128206764 | intron | 4601 | 128202079 | 128202248 | 0.964 |
| NM_001145661 | GATA2 | chr3 | − | 128198264 | 128207373 | intron | 5210 | 128202079 | 128202248 | 0.964 |
| NM_032638 | GATA2 | chr3 | − | 128198264 | 128212030 | intron | 9867 | 128202079 | 128202248 | 0.964 |
| NM_005225 | E2F1 | chr20 | − | 32263292 | 32274210 | 5p2 | −3886 | 32278012 | 32278182 | 0.774 |
| NM_001166 | BIRC2 | chr11 | + | 102217965 | 102249394 | 5p2 | −5004 | 102212877 | 102213046 | 0.753 |
| NM_203343 | EPB41 | chr1 | + | 29213602 | 29446558 | intron | 39968 | 29253487 | 29253655 | 0.74 |
| NM_203342 | EPB41 | chr1 | + | 29213602 | 29446558 | intron | 39968 | 29253487 | 29253655 | 0.74 |
| NM_001166007 | EPB41 | chr1 | + | 29213602 | 29446558 | intron | 39968 | 29253487 | 29253655 | 0.74 |
| NM_004437 | EPB41 | chr1 | + | 29213602 | 29446558 | intron | 39968 | 29253487 | 29253655 | 0.74 |
| NM_001166005 | EPB41 | chr1 | + | 29213602 | 29446558 | intron | 39968 | 29253487 | 29253655 | 0.74 |
| NM_001166006 | EPB41 | chr1 | + | 29241087 | 29391731 | intron | 12483 | 29253487 | 29253655 | 0.74 |
| NM_173485 | TSHZ2 | chr20 | + | 51588876 | 52103965 | intron | 203292 | 51792084 | 51792253 | 0.735 |
| NM_007077 | AP4S1 | chr14 | + | 31494682 | 31555007 | intron | 13234 | 31507832 | 31508001 | 0.733 |
| NM_001128126 | AP4S1 | chr14 | + | 31494682 | 31562634 | intron | 13234 | 31507832 | 31508001 | 0.733 |
| NM_001430 | EPAS1 | chr2 | + | 46524540 | 46613842 | intron | 42757 | 46567214 | 46567382 | 0.728 |
| NM_018119 | POLR3E | chr16 | + | 22308740 | 22345341 | intron | 833 | 22309489 | 22309658 | 0.719 |
| NM_181442 | ADNP | chr20 | − | 49506882 | 49547527 | intron | 27423 | 49520020 | 49520189 | 0.718 |
| NM_015339 | ADNP | chr20 | − | 49506882 | 49547527 | intron | 27423 | 49520020 | 49520189 | 0.718 |
| NM_020359 | PLSCR2 | chr3 | − | 146151081 | 146213722 | 5p1 | −921 | 146214559 | 146214728 | 0.718 |
| NM_006257 | PRKCQ | chr10 | − | 6469104 | 6622238 | intron | 106706 | 6515449 | 6515617 | 0.713 |
| NM_018309 | TBC1D23 | chr3 | + | 99979685 | 100044078 | intron | 28727 | 100008329 | 100008497 | 0.711 |
| NM_020382 | SETD8 | chr12 | + | 123868703 | 123893898 | intron | 4170 | 123872789 | 123872958 | 0.71 |
| NM_015385 | SORBS1 | chr10 | − | 97071530 | 97321171 | intron | 29191 | 97291896 | 97292065 | 0.709 |
| NM_024991 | SORBS1 | chr10 | − | 97071530 | 97321171 | intron | 29191 | 97291896 | 97292065 | 0.709 |
| NM_000440 | PDE6A | chr5 | − | 149237519 | 149324356 | intron | 4584 | 149319688 | 149319857 | 0.699 |
| NM_012091 | ADAT1 | chr16 | − | 75632997 | 75657154 | intron | 2033 | 75655038 | 75655206 | 0.693 |
| NM_005033 | EXOSC9 | chr4 | + | 122722471 | 122738175 | 5p2 | −6644 | 122715743 | 122715912 | 0.691 |
| NM_001034194 | EXOSC9 | chr4 | + | 122722471 | 122738175 | 5p2 | −6644 | 122715743 | 122715912 | 0.691 |
| NM_004099 | STOM | chr9 | − | 124101353 | 124132545 | intron | 388 | 124132073 | 124132243 | 0.689 |
| NM_198194 | STOM | chr9 | − | 124101356 | 124132545 | intron | 388 | 124132073 | 124132243 | 0.689 |
| NM_014395 | DAPP1 | chr4 | + | 100737980 | 100791344 | intron | 25687 | 100763583 | 100763752 | 0.682 |
| NM_181078 | IL21R | chr16 | + | 27413722 | 27462115 | intron | 28911 | 27442549 | 27442718 | 0.681 |
| NM_181079 | IL21R | chr16 | + | 27414422 | 27462115 | intron | 28211 | 27442549 | 27442718 | 0.681 |
| NM_021798 | IL21R | chr16 | + | 27438578 | 27462115 | intron | 4055 | 27442549 | 27442718 | 0.681 |
| NM_002492 | NDUFB5 | chr3 | + | 179322574 | 179342287 | intron | 10827 | 179333318 | 179333486 | 0.68 |
| NM_021831 | AGBL5 | chr2 | + | 27274490 | 27293489 | intron | 11452 | 27285858 | 27286027 | 0.678 |
| NM_020132 | AGPAT3 | chr21 | + | 45285115 | 45407474 | 5p2 | −4281 | 45280751 | 45280919 | 0.677 |
| NM_007356 | LAMB4 | chr7 | − | 107663995 | 107770801 | intron | 50003 | 107720714 | 107720883 | 0.677 |
| NM_001010985 | MYBPHL | chr1 | − | 109834986 | 109849663 | 5p1 | −613 | 109850192 | 109850361 | 0.67 |
| NM_006253 | PRKAB1 | chr12 | + | 120105760 | 120119428 | 5p2 | −2184 | 120103492 | 120103661 | 0.668 |
| NM_015226 | CLEC16A | chr16 | + | 11038344 | 11276044 | intron | 27935 | 11066196 | 11066364 | 0.667 |
| NM_020448 | NIPAL3 | chr1 | + | 24742244 | 24799472 | intron | 22892 | 24765053 | 24765221 | 0.663 |
| NM_015560 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130832 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130831 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130834 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130837 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130836 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130835 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM_130833 | OPA1 | chr3 | + | 193310932 | 193415599 | intron | 67680 | 193378528 | 193378697 | 0.659 |
| NM 001004342 | TRIM67 | chr1 | + | 231298673 | 231357314 | 5p2 | −2591 | 231295998 | 231296167 | 0.659 |
| NM 020201 | NT5M | chr17 | + | 17206679 | 17250975 | intron | 1325 | 17207920 | 17208089 | 0.658 |
| NM 173054 | RELN | chr7 | − | 103112232 | 103629963 | intron | 331913 | 103297966 | 103298135 | 0.657 |
| NM 005045 | RELN | chr7 | − | 103112232 | 103629963 | intron | 331913 | 103297966 | 103298135 | 0.657 |
| NM 014206 | C11orf10 | chr11 | − | 61556602 | 61560085 | 5p2 | −8290 | 61568291 | 61568461 | 0.657 |
| NR 030342 | MIR611 | chr11 | − | 61559967 | 61560033 | 5p2 | −8342 | 61568291 | 61568461 | 0.657 |
| NM 173685 | NSMCE2 | chr8 | + | 126104082 | 126379367 | intron | 242235 | 126346233 | 126346402 | 0.655 |
| NM 001127511 | APC | chr5 | + | 112043217 | 112181935 | 5p2 | −3243 | 112039890 | 112040060 | 0.653 |
| NM 021926 | ALX4 | chr11 | − | 44282277 | 44331716 | intron | 39937 | 44291695 | 44291864 | 0.652 |
| NM 016213 | TRIP4 | chr15 | + | 64680019 | 64747500 | intron | 42584 | 64722519 | 64722688 | 0.649 |
| NM 007217 | PDCD10 | chr3 | − | 167401696 | 167452594 | intron | 10656 | 167441855 | 167442023 | 0.649 |
| NM 145860 | PDCD10 | chr3 | − | 167401696 | 167452630 | intron | 10692 | 167441855 | 167442023 | 0.649 |
| NM 145859 | PDCD10 | chr3 | − | 167401696 | 167452651 | intron | 10713 | 167441855 | 167442023 | 0.649 |
| NM 203318 | MYO18A | chr17 | − | 27400527 | 27507407 | intron | 17382 | 27489941 | 27490111 | 0.645 |
| NM 078471 | MYO18A | chr17 | − | 27400527 | 27507407 | intron | 17382 | 27489941 | 27490111 | 0.645 |
| NM 001626 | AKT2 | chr19 | − | 40736224 | 40791265 | intron | 12426 | 40778755 | 40778924 | 0.643 |
| NM 004767 | GPR37L1 | chr1 | + | 202092028 | 202098633 | 5p2 | −4769 | 202087175 | 202087344 | 0.642 |
| NM 015531 | C2CD3 | chr11 | − | 73745479 | 73882064 | intron | 84354 | 73797626 | 73797796 | 0.637 |
| NM 002738 | PRKCB | chr16 | + | 23847299 | 24231930 | intron | 166844 | 24014060 | 24014228 | 0.634 |
| NM 212535 | PRKCB | chr16 | + | 23847299 | 24231930 | intron | 166844 | 24014060 | 24014228 | 0.634 |
| NM 004571 | PKNOX1 | chr21 | + | 44394642 | 44453688 | intron | 15907 | 44410465 | 44410634 | 0.632 |
| NR 026749 | SKINTL | chr1 | − | 48567386 | 48648100 | intron | 4923 | 48643093 | 48643262 | 0.629 |
| NM 005560 | LAMA5 | chr20 | − | 60884122 | 60942368 | intron | 9836 | 60932449 | 60932617 | 0.626 |
| NM 001080826 | SGK223 | chr8 | − | 8175258 | 8239257 | intron | 9672 | 8229501 | 8229670 | 0.622 |
| NM 130465 | TSPAN17 | chr5 | + | 176074387 | 176086058 | intron | 1462 | 176075765 | 176075934 | 0.621 |
| NM 012171 | TSPAN17 | chr5 | + | 176074387 | 176086058 | intron | 1462 | 176075765 | 176075934 | 0.621 |
| NM 001006616 | TSPAN17 | chr5 | + | 176074387 | 176086058 | intron | 1462 | 176075765 | 176075934 | 0.621 |
| NM 013326 | C18orf8 | chr18 | + | 21083461 | 21111742 | 5p2 | −4363 | 21079015 | 21079183 | 0.616 |
| NM 138371 | FAM113B | chr12 | + | 47610051 | 47630441 | 5p2 | −8921 | 47601047 | 47601215 | 0.616 |
| NM 182498 | ZNF428 | chr19 | − | 44111376 | 44124014 | intron | 3381 | 44120549 | 44120718 | 0.615 |
| NM 025179 | PLXNA2 | chr1 | − | 208195589 | 208417665 | intron | 207170 | 208210411 | 208210580 | 0.614 |
| NM 020133 | AGPAT4 | chr6 | − | 161551056 | 161695107 | intron | 12103 | 161682920 | 161683089 | 0.611 |
| NM 013427 | ARHGAP6 | chrX | − | 11155662 | 11683821 | intron | 252809 | 11430929 | 11431097 | 0.609 |
| NM 006125 | ARHGAP6 | chrX | − | 11161516 | 11683821 | intron | 252809 | 11430929 | 11431097 | 0.609 |
| NM 001669 | ARSD | chrX | − | 2822011 | 2847392 | intron | 6868 | 2840441 | 2840609 | 0.607 |
| NM 009589 | ARSD | chrX | − | 2831654 | 2847392 | intron | 6868 | 2840441 | 2840609 | 0.607 |
| NM 032359 | C3orf26 | chr3 | + | 99536677 | 99897476 | intron | 243457 | 99780050 | 99780220 | 0.605 |
| NM 182909 | FILIP1L | chr3 | − | 99551988 | 99833349 | intron | 53215 | 99780050 | 99780220 | 0.605 |
| NM 001042459 | FILIP1L | chr3 | − | 99566772 | 99833349 | intron | 53215 | 99780050 | 99780220 | 0.605 |
| NM 194298 | SLC16A9 | chr10 | − | 61410521 | 61469649 | 5p2 | −3726 | 61473291 | 61473460 | 0.603 |
| NM 007356 | LAMB4 | chr7 | − | 107663995 | 107770801 | intron | 39404 | 107731314 | 107731482 | 0.594 |
| NM 203456 | PPIE | chr1 | + | 40204529 | 40229585 | intron | 18825 | 40223270 | 40223439 | 0.594 |
| NM 152726 | EFHA1 | chr13 | − | 22066839 | 22178307 | intron | 81061 | 22097162 | 22097331 | 0.592 |
| NM 001025107 | ADAR | chr1 | − | 154554535 | 154600437 | 5p1 | −1420 | 154601773 | 154601942 | 0.592 |
| NM 001130966 | TBXAS1 | chr7 | + | 139478046 | 139720123 | intron | 75241 | 139553203 | 139553372 | 0.591 |
| NM 001166254 | TBXAS1 | chr7 | + | 139478046 | 139720123 | intron | 75241 | 139553203 | 139553372 | 0.591 |
| NM 030984 | TBXAS1 | chr7 | + | 139528951 | 139720123 | intron | 24336 | 139553203 | 139553372 | 0.591 |
| NM 001166253 | TBXAS1 | chr7 | + | 139528951 | 139720123 | intron | 24336 | 139553203 | 139553372 | 0.591 |
| NM 001061 | TBXAS1 | chr7 | + | 139528951 | 139720123 | intron | 24336 | 139553203 | 139553372 | 0.591 |
| NR 029394 | TBXAS1 | chr7 | + | 139528951 | 139720123 | intron | 24336 | 139553203 | 139553372 | 0.591 |
| NM 173542 | PLBD2 | chr12 | + | 113796370 | 113827458 | intron | 20911 | 113817197 | 113817366 | 0.591 |
| NM 001159727 | PLBD2 | chr12 | + | 113796370 | 113827458 | intron | 20911 | 113817197 | 113817366 | 0.591 |
| NM 138356 | SHF | chr15 | − | 45459413 | 45493373 | intron | 31722 | 45461567 | 45461736 | 0.589 |
| NM 021908 | ST7 | chr7 | + | 116593380 | 116863955 | intron | 92727 | 116686023 | 116686192 | 0.588 |
| NM 018412 | ST7 | chr7 | + | 116593380 | 116870073 | intron | 92727 | 116686023 | 116686192 | 0.588 |
| NM 017681 | NUP62CL | chrX | − | 106366657 | 106449670 | intron | 53243 | 106396343 | 106396512 | 0.587 |
| NM 020845 | PITPNM2 | chr12 | − | 123468026 | 123594975 | intron | 73786 | 123521105 | 123521274 | 0.587 |
| NM 001135054 | SIGIRR | chr11 | − | 405715 | 414999 | 5p2 | −7383 | 422299 | 422467 | 0.586 |
| NM 021805 | SIGIRR | chr11 | − | 405715 | 417397 | 5p2 | −4985 | 422299 | 422467 | 0.586 |
| NM 001135053 | SIGIRR | chr11 | − | 405715 | 417397 | 5p2 | −4985 | 422299 | 422467 | 0.586 |
| NM 001012302 | ANO9 | chr11 | − | 417929 | 442011 | intron | 19629 | 422299 | 422467 | 0.586 |
| NM 001098816 | ODZ4 | chr11 | − | 78364328 | 79151695 | intron | 773881 | 78377730 | 78377899 | 0.582 |
| NM 178865 | SERINC2 | chr1 | + | 31885962 | 31907524 | 5p2 | −2738 | 31883140 | 31883309 | 0.581 |
| NM 004481 | GALNT2 | chr1 | + | 230202955 | 230417875 | 5p1 | −672 | 230202200 | 230202368 | 0.579 |
| NM 032427 | MAML2 | chr11 | − | 95711439 | 96076344 | intron | 19672 | 96056588 | 96056758 | 0.577 |
| NM 021961 | TEAD1 | chr11 | + | 12695968 | 12966298 | intron | 202724 | 12898608 | 12898778 | 0.577 |
| NM 016436 | PHF20 | chr20 | + | 34359922 | 34538288 | intron | 130764 | 34490602 | 34490771 | 0.573 |
| NM 003128 | SPTBN1 | chr2 | + | 54683453 | 54898582 | intron | 117920 | 54801290 | 54801458 | 0.573 |
| NM 178313 | SPTBN1 | chr2 | + | 54785530 | 54889444 | intron | 15843 | 54801290 | 54801458 | 0.573 |
| NM 001037165 | FOXK1 | chr7 | + | 4721929 | 4811074 | intron | 30769 | 4752614 | 4752783 | 0.573 |
| NM 005802 | TOPORS | chr9 | − | 32540542 | 32552601 | 5p1 | −1681 | 32554199 | 32554367 | 0.573 |
| NM 182739 | NDUFB6 | chr9 | − | 32553522 | 32573182 | intron | 18900 | 32554199 | 32554367 | 0.573 |
| NM 002493 | NDUFB6 | chr9 | − | 32553522 | 32573182 | intron | 18900 | 32554199 | 32554367 | 0.573 |
| NM 004466 | GPC5 | chr13 | + | 92050934 | 93519485 | intron | 7890 | 92058740 | 92058909 | 0.569 |
| NM 001145169 | GPR113 | chr2 | − | 26531040 | 26541970 | intron | 2083 | 26539803 | 26539972 | 0.563 |
| NM 153835 | GPR113 | chr2 | − | 26531040 | 26569685 | intron | 29798 | 26539803 | 26539972 | 0.563 |
| NM 001145168 | GPR113 | chr2 | − | 26532812 | 26541917 | intron | 2030 | 26539803 | 26539972 | 0.563 |
| NM 000593 | TAP1 | chr6 | − | 32812986 | 32821748 | 5p2 | −3584 | 32825248 | 32825417 | 0.549 |
| NM 002800 | PSMB9 | chr6 | + | 32821937 | 32827626 | intron | 3395 | 32825248 | 32825417 | 0.549 |
| NM 148954 | PSMB9 | chr6 | + | 32821937 | 32827626 | intron | 3395 | 32825248 | 32825417 | 0.549 |
| NM 033104 | STON2 | chr14 | − | 81736910 | 81864927 | intron | 94218 | 81770625 | 81770794 | 0.546 |
| NM 001001894 | TTC3 | chr21 | + | 38445570 | 38575406 | intron | 12717 | 38458203 | 38458372 | 0.544 |
| NM 003316 | TTC3 | chr21 | + | 38455246 | 38575406 | intron | 3041 | 38458203 | 38458372 | 0.544 |
| NM 000147 | FUCA1 | chr1 | − | 24171573 | 24194859 | 5p2 | −4297 | 24199072 | 24199241 | 0.544 |
| NM 016063 | HDDC2 | chr6 | − | 125596495 | 125623282 | 5p1 | −844 | 125624043 | 125624211 | 0.544 |
| NM 000404 | GLB1 | chr3 | − | 33038099 | 33138694 | intron | 89213 | 33049397 | 33049566 | 0.541 |
| NM 001135602 | GLB1 | chr3 | − | 33038099 | 33138694 | intron | 89213 | 33049397 | 33049566 | 0.541 |
| NM 001079811 | GLB1 | chr3 | − | 33038099 | 33138314 | intron | 88833 | 33049397 | 33049566 | 0.541 |
| NM 017803 | DUS2L | chr16 | + | 68057203 | 68113183 | intron | 17562 | 68074682 | 68074850 | 0.54 |
| NM 001101417 | ISPD | chr7 | − | 16127151 | 16460947 | intron | 160970 | 16299893 | 16300063 | 0.532 |
| NM 001101426 | ISPD | chr7 | − | 16127151 | 16460947 | intron | 160970 | 16299893 | 16300063 | 0.532 |
| NM 002736 | PRKAR2B | chr7 | + | 106685177 | 106802255 | intron | 31362 | 106716455 | 106716624 | 0.532 |
| NR 024448 | LOC91316 | chr22 | − | 23980676 | 24059610 | intron | 32642 | 24026884 | 24027054 | 0.529 |
| NM 153615 | RGL4 | chr22 | + | 24033047 | 24041358 | 5p2 | −6079 | 24026884 | 24027054 | 0.529 |
| NM 033631 | LUZP1 | chr1 | − | 23410515 | 23495351 | intron | 53595 | 23441673 | 23441841 | 0.526 |
| NM 001142546 | LUZP1 | chr1 | − | 23410515 | 23495351 | intron | 53595 | 23441673 | 23441841 | 0.526 |
| NM 001134492 | HS2ST1 | chr1 | + | 87380334 | 87564124 | intron | 77077 | 87457327 | 87457496 | 0.52 |
| NM 012262 | HS2ST1 | chr1 | + | 87380334 | 87575680 | intron | 77077 | 87457327 | 87457496 | 0.52 |
| NM 001085481 | MAP1LC3B2 | chr12 | + | 116997185 | 117014425 | intron | 2387 | 116999488 | 116999657 | 0.499 |
| NM 079834 | SCAMP4 | chr19 | + | 1905372 | 1926011 | intron | 2224 | 1907513 | 1907681 | 0.487 |
| NM 138422 | ADAT3 | chr19 | + | 1905416 | 1913443 | intron | 2180 | 1907513 | 1907681 | 0.487 |
| NM 032932 | RAB11FIP4 | chr17 | + | 29718641 | 29865232 | intron | 33750 | 29752307 | 29752476 | 0.48 |
| NM 033129 | SCRT2 | chr20 | − | 642240 | 656823 | 5p2 | −8983 | 665723 | 665891 | 0.47 |
| NM 012079 | DGAT1 | chr8 | − | 145538246 | 145550567 | intron | 1560 | 145548924 | 145549092 | 0.443 |
Footnotes
This work was supported by National Institutes of Health Grants (HG006716 and HG007019) to S.K.
The normalized signal for unit i and condition k is:
The pseudo-binary similarity between two units i1 and i2 is calculated as .
When the actual J and its estimate Ĵ are different, MSE-W is redefined as:
where
References
- Anandapadamanaban M, Andresen C, Helander S, Ohyama Y, Siponen MI, Lundström P, Kokubo T, Ikura M, Moche M, Sunnerhagen M. High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation. Nat Struct Mol Biol. 2013;20:1008–1014. doi: 10.1038/nsmb.2611.
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
- Cheng C, Yan K-K, Hwang W, Qian J, Bhardwaj N, Rozowsky J, Lu ZJ, Niu W, Alves P, Kato M, Snyder M, Gerstein M. Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data. PLoS Computational Biology. 2011;7:e1002190. doi: 10.1371/journal.pcbi.1002190.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1977;39(1):1–38.
- Doré LC, Chlon TM, Brown CD, White KP, Crispino JD. Chromatin occupancy analysis reveals genome-wide GATA factor switching during hematopoiesis. Blood. 2012;119(16):3724–3733. doi: 10.1182/blood-2011-09-380634.
- Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
- Gao X, Johnson KD, Chang YI, Boyer ME, Dewey CN, Zhang J, Bresnick EH. Gata2 cis-element is required for hematopoietic stem cell generation in the mammalian embryo. Journal of Experimental Medicine. 2013;210(13):2833–42. doi: 10.1084/jem.20130733.
- Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, Lacroute P, Leng J, Lian J, Monahan H, O’Geen H, Ouyang Z, Partridge EC, Patacsil D, Pauli F, Raha D, Ramirez L, Reddy TE, Reed B, Shi M, Slifer T, Wang J, Wu L, Yang X, Yip KY, Zilberman-Schapira G, Batzoglou S, Sidow A, Farnham PJ, Myers RM, Weissman SM, Snyder M. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100. doi: 10.1038/nature11245.
- Holley DW, Groh BS, Wozniak G, Donohoe DR, Sun W, Godfrey V, Bultman SJ. The BRG1 Chromatin Remodeler Regulates Widespread Changes in Gene Expression and Cell Proliferation During B Cell Activation. Journal of Cellular Physiology. 2014;229(1):44–52. doi: 10.1002/jcp.24414.
- Hsu AP, Johnson KD, Falcone EL, Sanalkumar R, Sanchez L, Hickstein DD, Cuellar-Rodriguez J, Lemieux JE, Zerbe CS, Bresnick EH, Holland SM. GATA2 haploinsufficiency caused by mutations in a conserved intronic element leads to MonoMAC syndrome. Blood. 2013;121(19):3830–3837. doi: 10.1182/blood-2012-08-452763.
- Hu G, Schones DE, Cui K, Ybarra R, Northrup D, Tang Q, Gattinoni L, Restifo NP, Huang S, Zhao K. Regulation of nucleosome landscape and transcription factor targeting at tissue-specific enhancers by BRG1. Genome Research. 2011;21(10):1650–1658. doi: 10.1101/gr.121145.111.
- Ji H, Li X, Wang Q, Ning Y. Differential principal component analysis of ChIP-seq. PNAS. 2013;110:6789–6794. doi: 10.1073/pnas.1204398110.
- Johnson KD, Hsu A, Ryu MJ, Boyer ME, Keleş S, Zhang J, Lee Y, Holland SM, Bresnick EH. Cis-element mutation in a GATA-2-dependent immunodeficiency syndrome governs hematopoiesis and vascular integrity. Journal of Clinical Investigation. 2012;122(10):3692–3704. doi: 10.1172/JCI61623.
- Kim SI, Bresnick EH, Bultman SJ. BRG1 directly regulates nucleosome structure and chromatin looping of the α globin locus to activate transcription. Nucleic Acids Research. 2009a;37(18):6019–6027. doi: 10.1093/nar/gkp677.
- Kim SI, Bultman SJ, Kiefer CM, Dean A, Bresnick EH. BRG1 requirement for long-range interaction of a locus control region with a downstream promoter. Proceedings of the National Academy of Sciences. 2009b;106(7):2259–2264. doi: 10.1073/pnas.0806420106.
- Kuan PF, Chung D, Pan G, Thomson JA, Stewart R, Keleş S. A statistical framework for the analysis of ChIP-seq data. Journal of the American Statistical Association. 2011;106(495):891–903. doi: 10.1198/jasa.2011.ap09706.
- Kunarso G, Chia N, Jeyakani J, Hwang C, Lu X, Chan Y, Ng H, Bourque G. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nature Genetics. 2010;42:631–634. doi: 10.1038/ng.600.
- Lee S, Huang J, Hu J. Sparse logistic principal components analysis for binary data. The Annals of Applied Statistics. 2010;4:1579–1601. doi: 10.1214/10-AOAS327SUPP.
- Liang K, Keleş S. Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics. 2012;28:121–122. doi: 10.1093/bioinformatics/btr605.
- Linneman AK, O’Geen H, Keleş S, Farnham PJ, Bresnick EH. Genetic framework for GATA factor function in vascular biology. Proceedings of the National Academy of Sciences. 2011;108(33):13641–13646. doi: 10.1073/pnas.1108440108.
- Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012;150:1274–1286. doi: 10.1016/j.cell.2012.04.040.
- Rand W. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850.
- Roy S, Wapinski I, Pfiffner J, French C, Socha A, Konieczka J, Habib N, Kellis M, Thompson D, Regev A. Arboretum: reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules. Genome Research. 2013;23:1039–1050. doi: 10.1101/gr.146233.112.
- Schmidt D, Wilson M, Ballester B, Schwalie P, Brown G, Marshall A, Kutter C, Watt S, Martinez-Jimenez C, Mackay S, Talianidis I, Flicek P, Odom D. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010;328:1036–1040. doi: 10.1126/science.1186176.
- Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004;3:Article 3. doi: 10.2202/1544-6115.1027.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
- The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- Waltman P, Kacmarczyk T, Bate A, Kearns D, Reiss D, Eichenberger P, Bonneau R. Multi-species integrative biclustering. Genome Biology. 2010;11:R96. doi: 10.1186/gb-2010-11-9-r96.
- Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research. 2012;22:1798–1812. doi: 10.1101/gr.139105.112.
- Wei Y, Li X, Wang Q-F, Ji H. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics. 2012;13:681. doi: 10.1186/1471-2164-13-681.
- Wei Y, Tenzen T, Ji H. Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics. 2015;16:31–46. doi: 10.1093/biostatistics/kxu038.
- Zeng X, Sanalkumar R, Bresnick EH, Li H, Chang Q, Keleş S. jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biology. 2013;14:R38. doi: 10.1186/gb-2013-14-4-r38.
- Zuo C, Keleş S. A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics. 2013. doi: 10.1093/bioinformatics/btt200.