A Hierarchical Framework for State-Space Matrix Inference and Clustering

Chandler Zuo; Kailei Chen; Kyle J Hewitt; Emery H Bresnick; Sündüz Keleş

doi:10.1214/16-AOAS938

Author manuscript; available in PMC: 2018 Jun 15.

Published in final edited form as: Ann Appl Stat. 2016 Sep 28;10(3):1348–1372. doi: 10.1214/16-AOAS938


PMCID: PMC6003413 NIHMSID: NIHMS967801 PMID: 29910842

Abstract

In recent years, a large number of genomic and epigenomic studies have focused on the integrative analysis of multiple experimental datasets measured over a large number of observational units. The objectives of such studies include not only inferring a hidden state of activity for each unit over individual experiments, but also detecting highly associated clusters of units based on their inferred states. Although there are a number of methods tailored for specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, it maps observations onto a finite state-space, representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. Both the state-space mapping and clustering can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as the heterogeneity in replicate experiments. It allows for imposing structural assumptions on each cluster, and enables model selection using information criteria. In our data-driven simulation studies, MBASIC showed significant accuracy in recovering both the underlying state-space variables and clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application focused on identifying groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing transcription factor occupancy data and illustrated the applicability of MBASIC in a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than analyzing these data with a two-step approach using ENCODE results on transcription factor occupancy data.

1 Introduction

The flow of genetic information through DNA transcription and RNA translation is a highly regulated process. The underlying mechanisms of regulation by both genomic and epigenomic factors are central targets in large numbers of genomic and epigenomic studies. This paper is motivated by a number of such studies that aim to elucidate genomic regulatory mechanisms across multiple biological conditions. Experiments that investigate such processes produce a plethora of data types. For example, relevant to DNA transcription is transcription factor occupancy data that quantify which regions of DNA are occupied by DNA binding proteins that can enhance or reduce gene expression. Histone modification data map covalent post-translational modifications to histone proteins, core proteins that make up the nucleosome structure of DNA. Such modifications impact DNA transcription by altering chromatin structure.

Computational and statistical analyses of these data often involve identifying genomic loci that show significant signal, i.e., enrichment, compared to background noise in the experimental measurements, with the operating principle that multiple loci might exhibit similar signal profiles over different biological conditions.

Improvements in next-generation sequencing technology have further accelerated the generation of these types of data. In turn, the vast availability of such data has revolutionized the scope of genome regulation studies. Previous analyses had been restricted to detecting regions of the genome that were associated with one particular factor in a single organism. Many recent studies focus on detecting more complex functional patterns that integrate data from multiple organisms under multiple conditions. Namely, the associations between DNA elements and how they change across biological and/or experimental conditions have been the focus of many integrative modeling approaches. Examples of these studies include:

Differential binding analysis among multiple ChIP-seq datasets. One of the key mechanisms of gene expression regulation is through differential activities of transcription factors and epigenetic modifications. Currently, chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) is the state-of-the-art method for genome-wide profiling of protein-DNA interactions. Two such key interactions are DNA occupancy by transcription factors and histone modifications. Most transcription factors, i.e., DNA binding proteins, recognize DNA in a sequence-specific manner and promote or repress gene expression. Similarly, histone modifications can induce diverse biological consequences such as transcriptional activation/deactivation. The study of gene regulation often involves comparing transcription factor occupancy and histone modifications across multiple biological conditions. Such conditions can be different treatment levels, time points of measurements, or different dosage levels (Liang and Keleş, 2012; Anders and Huber, 2010; Ji et al., 2013; Wei et al., 2015).
Transcription factor regulatory network analysis. The combinatorial nature of transcription factor regulation underlies the large diversity observed in eukaryotic gene control. This largely motivates construction of regulatory networks that model gene expression as a combinatorial function of regulatory interactions between DNA and different transcription factors. The large-scale data from the ENCODE project (The ENCODE Project Consortium, 2012) now enable joint analyses of over one hundred human transcription factors (TFs) across multiple cell types. Such analyses are poised to reveal a great amount of information about co-association patterns between different TFs, hierarchical network organizations, and systems-level integration of complex cellular signals (Neph et al., 2012; Gerstein et al., 2012; Cheng et al., 2011; Zeng et al., 2013). While the large number of TFs makes it computationally formidable to exhaust all possible combinatorial associations for such analyses, it is important to detect the most significant combinatorial patterns that preserve global regulatory dynamics.
Comparative functional genomic studies across different species. Functional genomics analysis compares gene expression or TF occupancy profiles between multiple species. The main task is to identify divergent and conserved functional modules that are central to evolutionary relationships (e.g., Kunarso et al. (2010), Schmidt et al. (2010)). Existing methods that build on hidden Markov models (Roy et al., 2013) or biclustering (Waltman et al., 2010) implicitly assume that the functional modules should at least have similar signal profiles (i.e., expression, occupancy) among some subsets of the species under consideration. For these analyses, it is also important to identify functional modules that are fully divergent across species. These regions play an equally important role in understanding connectivity among species over the evolutionary history.

Although the types of data for these different studies vary, the underlying statistical principles are largely shared. Therefore, we propose a unified framework for the analysis of such data by formalizing the shared aspects. We formulate the underlying statistical problem as follows. Suppose a dataset {Y_ik} is collected over a set of observational units (e.g., loci in genomic experiments) i = 1, 2, · · ·, I under conditions k = 1, 2, · · ·, K. Inferring the association patterns within a single experiment involves mapping the corresponding set of observations {Y_ik : i = 1, 2, · · ·, I} to a finite discrete state-space, 𝒮 = {1, 2, · · ·, S}. This space contains different levels of association (e.g., enrichment/non-enrichment indicating the status of occupancy in ChIP-seq experiments, expressed/not expressed in RNA-seq gene expression experiments). This falls under the classical finite-mixture modeling framework, where a latent state variable θ_ik ∈ 𝒮 is inferred for each observational unit Y_ik. A higher level of modeling on the matrix Θ = (θ_ik)_{1≤i≤I, 1≤k≤K} is required for integrating the association patterns under different conditions. We call this matrix the state-space matrix since it describes the latent states of individual observations.

We propose the following framework to model the state-space matrix Θ. We assume that rows of Θ can be partitioned into J + 1 subsets: {1, · · ·, I} = 𝒞₀ ∪ 𝒞₁ ∪ · · · ∪ 𝒞_J. Rows of Θ within partition 𝒞_j, j ≥ 1, are generated by the same distribution parametrized by w_j· = (w_jk)_{1≤k≤K}:

$θ_{ik} \sim g (\cdot ∣ w_{jk}), \quad i \in \mathcal{C}_{j},$

while the rows of 𝒞₀, which denotes the group of “singleton” units, i.e., units that do not cluster in any of the J groups, are generated by row-specific distributions. The goal of this model is thus to estimate a partitioning that best characterizes the row associations of the state-space matrix Θ.

We refer to the proposed framework as Matrix Based Analysis for State-space Inference and Clustering (MBASIC). MBASIC is related to classical factor analysis, which considers the problem of projecting one dimension (either row or column) of large noisy matrices into low-dimensional spaces. MBASIC has two distinguishing features compared to the existing literature in these areas. First, MBASIC deals with matrices with discrete entries, while most existing methods are designed for matrices on continuous scales. Second, MBASIC estimates the low-dimensional projection by grouping the rows of the original matrix, in contrast to the Principal Component Analysis (PCA) approaches, which form linear combinations of the rows (e.g., Ji et al. (2013), Lee et al. (2010)). This is motivated by the following arguments:

In MBASIC, each factor estimate w_j· characterizes the commonality of a group of rows and is easily interpretable in practice. Such interpretability can further be enhanced by imposing structural restrictions on the w_j· vector for practical purposes. Examples of such constraints are described in Section 3.3;
PCA for high dimensional matrices is often accompanied by regularization techniques, which are computationally prohibitive for many epigenetic datasets. In contrast, clustering the matrix rows can be implemented very efficiently and in a straightforward manner.

The hierarchical structure of MBASIC is similar to two other recently proposed statistical models: iASeq (Wei et al., 2012) and Cormotif (Wei et al., 2015). Both of these models incorporate a state-space clustering structure similar to MBASIC. MBASIC extends these models in several essential directions. First, MBASIC is developed for general purposes and can be easily implemented for a wide range of parametric distributions, while Cormotif and iASeq operate with specific distributions targeting the problems of differential expression and allele-specific binding. Second, neither of these models includes a group of singletons with idiosyncratic state-space profiles. When we are agnostic about the “true” clustering structure in applications, separating the singletons can reduce their influence on the estimation of clustering parameters. Third, both iASeq and Cormotif separate estimation for the distributional parameters from the clustering structure, while MBASIC jointly fits all model parameters. A limiting assumption of MBASIC compared to these models is that MBASIC does not allow the distributional parameters within the same state to be heterogeneous. However, a pre-processing step that accounts for the heterogeneity can overcome such a limitation. We evaluate and discuss all of these features with extensive simulation studies in this paper.

This paper is organized as follows. We start with a formal description of MBASIC in Section 2, followed by model estimation and selection methods in Section 3. We also investigate general features of MBASIC compared to iASeq and Cormotif with extensive simulations in this section. Section 4 presents results from several real data examples. Mathematical details of the algorithm are included in Appendix A.1.

2 The Hierarchical Mixture Model Framework

Consider a dataset with observations from I different observational units under K different conditions. For each condition k ∈ {1, 2, · · ·, K}, there are n_k replicate experiments, indexed by l = 1, 2, · · ·, n_k. We use Y_ikl to denote the observation for the l-th replicate of unit i under condition k. For each condition k at unit i, there exists a hidden state variable θ_ik ∈ 𝒮 = {1, 2, · · ·, S}. The MBASIC model consists of the following components:

State-space Mapping:
$Y_{ikl} ∣ θ_{ik} = s \overset{\text{ind.}}{\sim} f_{s} (\cdot ∣ μ_{kls}, σ_{kls}, γ_{ikls}) .$ (1)
State-space Clustering: θ_ik’s are independently sampled from 𝒮 with the sampling probability:
$P (θ_{i k} = s) = ζ p_{i s} + (1 - ζ) \sum_{j = 1}^{J} π_{j} w_{jks} .$ (2)

In (1), μ_kls and σ_kls are the parameters related to the mean and dispersion for the s-th state for replicate l under condition k, and γ_ikls is the covariate encoding known information for unit i. In (2), p_is, ζ, π_j, and w_jks are additional non-negative parameters subject to restrictions:
$0 \leq ζ \leq 1; \sum_{j = 1}^{J} π_{j} = 1; \sum_{s = 1}^{S} w_{jks} = 1, \forall j, k; \sum_{s = 1}^{S} p_{i s} = 1, \forall i .$

We further discuss these parameters in Section 2.2.

2.1 State-space Mapping

Equation (1) partitions observational units i = 1, · · ·, I into S subsets according to their hidden states. Within the same replicate, data from the same hidden state follow the same distribution f_s(·|μ_kls, σ_kls, γ_ikls). MBASIC assumes that the hidden states θ_ik are independent of the replicate index l, which means all replicates under the same condition share the same set of hidden states. However, distributional parameters for a given state can differ among replicates. Such a setting allows for flexibility in modeling the heterogeneity in replicate experiments.

The density function f can be from an arbitrary parametric distribution. We consider three fundamental families of distributions commonly used for genomic data analysis:

Log-normal Distribution. LN(μ_klsγ_ikls, σ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = \frac{1}{\sqrt{2 π} σ_{kls}} \exp \left\{ - \frac{{(\log (y + 1) - μ_{kls} γ_{ikls})}^{2}}{2 σ_{kls}^{2}} \right\} .$ (3)
Negative Binomial Distribution. NB(μ_klsγ_ikls, σ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = \frac{Γ (y + σ_{kls})}{Γ (σ_{kls}) Γ (y + 1)} \frac{{(μ_{kls} γ_{ikls})}^{y} σ_{kls}^{σ_{kls}}}{{(μ_{kls} γ_{ikls} + σ_{kls})}^{y + σ_{kls}}} .$ (4)
Binomial Distribution. Binom(γ_ikls, μ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, γ_{ikls}) = \binom{γ_{ikls}}{y} μ_{kls}^{y} {(1 - μ_{kls})}^{γ_{ikls} - y} .$ (5)

In these three examples, γ_ikls represents the known heterogeneity across loci whereas μ_kls and σ_kls are unknown parameters. For example, when using Eqn. (3) or (4) in a ChIP-seq analysis with S = 2 states, we can estimate γ_ikl₁ using data from the control samples so that the ChIP sample read counts in the background state scale with the control sample data (e.g., as in Zuo and Keleş (2013)), and assume γ_ikl₂ = 1 for the enriched state. Eqn. (5) can be used to analyze allele-specific binding data, where γ_ikls is the total read count from both paternal and maternal alleles and is constant across s. Application with the binomial distribution also requires that $μ_{kls} \sum_{i = 1}^{I} γ_{ikls}$, ∀k, l, is strictly increasing in s for model identification.
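As an illustration, the three densities can be evaluated on the log scale, which is how they would typically enter an E-M implementation. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not taken from the MBASIC software. The negative binomial function uses the standard normalizing constant y! = Γ(y + 1).

```python
import math

def log_lognormal(y, mu, sigma, gamma):
    """Log of density (3): Gaussian in log(y + 1) with mean mu * gamma."""
    resid = math.log(y + 1.0) - mu * gamma
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - resid ** 2 / (2.0 * sigma ** 2))

def log_negbin(y, mu, sigma, gamma):
    """Log negative binomial pmf with mean mu * gamma and size sigma."""
    m = mu * gamma
    return (math.lgamma(y + sigma) - math.lgamma(sigma) - math.lgamma(y + 1.0)
            + y * math.log(m) + sigma * math.log(sigma)
            - (y + sigma) * math.log(m + sigma))

def log_binom(y, mu, gamma):
    """Log of density (5): Binomial(gamma, mu) evaluated at y."""
    log_choose = (math.lgamma(gamma + 1.0) - math.lgamma(y + 1.0)
                  - math.lgamma(gamma - y + 1.0))
    return log_choose + y * math.log(mu) + (gamma - y) * math.log(1.0 - mu)
```

Summing the exponentiated negative binomial or binomial values over the support gives one up to rounding, which is a quick sanity check on the parametrization by mean μγ and size σ.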

MBASIC can be easily extended to other classes of parametric distributions, and estimation for these distributions follows the same Expectation-Maximization skeleton. While Section 3 relies on these three distributions to describe the model and the estimation algorithms, the second real data example in Section 4 utilizes a more complex parametrization, which demonstrates the wide applicability of the MBASIC framework. Furthermore, we consider the following degenerate distribution:

$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = I (y = s),$ (6)

where I(.) denotes the indicator function. This degenerate form corresponds to the situation where the states, θ_ik’s, are directly observed rather than inferred from Y_ikl’s. We utilize this parametrization for comparing MBASIC to alternative two-step analysis approaches in Section 3.5. Parameter estimation for this case follows a slightly modified procedure from the non-degenerate cases, which is described in Section 3.

2.2 State-space Clustering

Equation (2) models the distribution of θ_ik as a mixture of multiple distributions. To illustrate this model we introduce additional variables. The goal is to identify J clusters from the set of observation units 1 ≤ i ≤ I. Let b_i = I(unit i does not belong to any cluster) and z_ij = I(unit i belongs to cluster j). The b_i variables entertain the possibility that some observations are “singletons”, i.e., they do not cluster with any other observational units. With these additional variables, the distribution in Equation (2) can be hierarchically decomposed as follows:

$b_{i} \overset{\text{i.i.d.}}{\sim} \text{Bernoulli} (ζ)$;
$(z_{i1}, z_{i2}, \dots, z_{iJ}) \overset{\text{i.i.d.}}{\sim} \text{MultiNom} (1, (π_{1}, π_{2}, \dots, π_{J}))$;
Conditional on b_i and z_ij, θ_ik’s are independent samples from 𝒮, with sampling probabilities P(θ_ik = s|b_i = 1) = p_is, P(θ_ik = s|b_i = 0, z_ij = 1) = w_jks.
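The three-level decomposition above can be sketched as a small sampler. This is an illustrative sketch only, assuming states are coded 0, ..., S − 1 and that `w[j][k][s]` and `p[i][s]` hold the probabilities w_jks and p_is; the function name `sample_states` is hypothetical.

```python
import random

def sample_states(I, K, S, zeta, pi, w, p, seed=0):
    """Draw (b, z, theta) following the hierarchical decomposition.

    pi[j]: cluster weights; w[j][k][s]: cluster state probabilities;
    p[i][s]: singleton state probabilities. States are coded 0..S-1.
    """
    rng = random.Random(seed)
    J = len(pi)
    b, z, theta = [], [], []
    for i in range(I):
        b_i = 1 if rng.random() < zeta else 0        # b_i ~ Bernoulli(zeta)
        z_i = rng.choices(range(J), weights=pi)[0]   # z_i ~ MultiNom(1, pi)
        row = []
        for k in range(K):
            # Singletons use p_i for every condition; clustered units use w_jk.
            probs = p[i] if b_i == 1 else w[z_i][k]
            row.append(rng.choices(range(S), weights=probs)[0])
        b.append(b_i)
        z.append(z_i)
        theta.append(row)
    return b, z, theta
```

With ζ = 0 and a single cluster whose probabilities are degenerate, every sampled row of θ equals that cluster's profile, which makes the role of w_jks as a shared row template easy to see.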

In this setup, although the singleton state-space probabilities p_is are assumed to be constant across conditions, i.e., P(θ_ik = s|b_i = 1) = p_is, ∀k, this assumption is only mildly restrictive since it allows (P(θ_ik = 1), · · ·, P(θ_ik = S)) to follow an arbitrary prior distribution (e.g., (P(θ_ik = 1), · · ·, P(θ_ik = S)) ~ Dirichlet(α, · · ·, α), ∀k) as long as it leads to the same marginal distribution for θ_ik for all k.

It is worth noting that this hierarchical structure essentially seeks a low-rank representation for the matrix Θ = (θ_ik)_{1≤i≤I, 1≤k≤K}. To illustrate this, we introduce additional matrices Θ_s = (I(θ_ik = s))_{1≤i≤I, 1≤k≤K}, W_s = (w_jks)_{1≤j≤J, 1≤k≤K}, Z = (z_ij)_{1≤i≤I, 1≤j≤J} and vectors p_s = (p_is)_{1≤i≤I}, B = (b_i)_{1≤i≤I}. Then, the conditional expectation of Θ_s is:

$E (Θ_{s} ∣ Z, B) = (Z W_{s}) \circ ((1 - B) 1_{K}^{T}) + (p_{s} \circ B) 1_{K}^{T},$ (7)

where “∘” denotes the Hadamard product. We note that E(Θ_s|Z, B) is a matrix of rank at most J + 1, which is usually much smaller than the dimension of the matrix Θ_s. Similar models for low-rank representation of discrete matrices were considered in Lee et al. (2010), and turned out to be challenging both theoretically and computationally. The row-clustering structure for the matrices E(Θ_s|Z, B) in MBASIC is more restrictive than the general low-rank structure. Such additional restrictions not only reduce the difficulty in parameter estimation but also enable flexibility in many useful ways. For example, while Lee et al. (2010) can only estimate one matrix at a time and thus is only applicable when S = 2, MBASIC can be applied to arbitrary values of S.
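To make Equation (7) concrete, the sketch below evaluates E(Θ_s|Z, B) entry by entry for small inputs. It is an illustration under assumed list-of-lists inputs, and the function name `expected_indicator` is hypothetical.

```python
def expected_indicator(Z, B, W_s, p_s):
    """E(Theta_s | Z, B) from Equation (7), as an I x K list of lists.

    Z[i][j]: cluster indicators; B[i]: singleton indicators;
    W_s[j][k]: w_jks for a fixed state s; p_s[i]: p_is for the same state.
    """
    I, J, K = len(Z), len(W_s), len(W_s[0])
    out = []
    for i in range(I):
        row = []
        for k in range(K):
            # (Z W_s)_{ik} is switched off for singletons by (1 - b_i).
            clustered = sum(Z[i][j] * W_s[j][k] for j in range(J)) * (1 - B[i])
            # Singletons contribute p_is, constant across conditions k.
            row.append(clustered + p_s[i] * B[i])
        out.append(row)
    return out
```

Each clustered row reproduces one of the J rows of W_s and each singleton row is constant in k, so the resulting matrix is spanned by J cluster profiles plus one constant vector.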

3 Model Estimation and Selection

3.1 Likelihood Functions

In the MBASIC model, the likelihood function for both the observed random variables Y_ikl’s and the unobserved θ_ik’s, z_ij ’s, b_i’s, i.e., full data likelihood, is given by:

$l (μ, σ, π, p, ζ, w; y, θ, z, b) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{s = 1}^{S} p_{is}^{I (θ_{ik} = s) b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{ij}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{s = 1}^{S} {\left[ \prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ μ_{kls}, σ_{kls}, γ_{ikls}) \right]}^{I (θ_{ik} = s)} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} \prod_{k = 1}^{K} \prod_{s = 1}^{S} w_{jks}^{I (θ_{ik} = s) (1 - b_{i}) z_{ij}} .$ (8)

For non-degenerate distributions, we can show that the marginal likelihood is:

$l (μ, σ, π, p, ζ, w; y) = \prod_{i = 1}^{I} \left\{ ζ \prod_{k = 1}^{K} \left[ \sum_{s = 1}^{S} p_{is} \prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ μ_{kls}, σ_{kls}, γ_{ikls}) \right] + (1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} \left[ \sum_{s = 1}^{S} w_{jks} \prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ μ_{kls}, σ_{kls}, γ_{ikls}) \right] \right\} .$ (9)

Equation (9) is easily interpretable. Conditional on b_i and z_ij, the joint distribution for each Y_ikl, 1 ≤ l ≤ n_k is a mixture of S components, where the weight on the s-th component is either p_is (when b_i = 1) or w_jks (when b_i = 0 and z_ij = 1). This yields the expressions in the square brackets. Integrating out b_i and z_ij, the joint distribution for Y_ikl of fixed i is a mixture of J + 1 components, with probability ζ of being a singleton and probability (1 − ζ)π_j of belonging to cluster j.
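The mixture-of-mixtures reading of Equation (9) can be sketched for a single unit i. The code below works on the log scale with a log-sum-exp helper to avoid underflow when K is large; this numerical device and the function names are assumptions of the sketch, not details given in the text. The input `log_f[k][s]` stands for the sum over replicates l of log f_s(y_ikl | ·).

```python
import math

def safe_log(x):
    """log(x), with log(0) mapped to -inf instead of raising an error."""
    return math.log(x) if x > 0.0 else float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def unit_log_marginal(log_f, zeta, pi, w, p_i):
    """Log of the i-th factor of Equation (9)."""
    K, S = len(log_f), len(log_f[0])
    # Singleton component: zeta * prod_k sum_s p_is * f_iks
    comps = [safe_log(zeta) + sum(
        logsumexp([safe_log(p_i[s]) + log_f[k][s] for s in range(S)])
        for k in range(K))]
    # Cluster components: (1 - zeta) * pi_j * prod_k sum_s w_jks * f_iks
    for j in range(len(pi)):
        comps.append(safe_log(1.0 - zeta) + safe_log(pi[j]) + sum(
            logsumexp([safe_log(w[j][k][s]) + log_f[k][s] for s in range(S)])
            for k in range(K)))
    return logsumexp(comps)
```

Summing `unit_log_marginal` over all units gives the log of (9), which is the quantity an information criterion would be computed from.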

For the degenerate case, by substituting (6) into (9), it can be shown that the marginal likelihood is:

$l (μ, σ, π, p, ζ, w; θ) = \prod_{i = 1}^{I} \left\{ ζ \prod_{k = 1}^{K} \prod_{s = 1}^{S} p_{is}^{I (θ_{ik} = s)} + (1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} \prod_{s = 1}^{S} w_{jks}^{I (θ_{ik} = s)} \right\} .$ (10)
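The substitution rests on a one-line identity: under the degenerate density (6), with the states observed directly, each inner sum over s in (9) retains only the term for the observed state. Written for the singleton weights (the same argument applies with w_jks in place of p_is), a sketch of the step is:

```latex
\sum_{s = 1}^{S} p_{is} \, I (θ_{ik} = s) \;=\; p_{i θ_{ik}} \;=\; \prod_{s = 1}^{S} p_{is}^{I (θ_{ik} = s)} .
```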

3.2 An Expectation and Maximization (E-M) Algorithm

The hierarchical structure of MBASIC naturally fits the Expectation-Maximization algorithm (Dempster et al., 1977), which maximizes the marginal likelihood (equations (9) or (10)) by iteratively maximizing the expected complete data log-likelihood function. We let ϕ denote a vector including all unknown parameters μ, σ, π, p, ζ, w, and ϕ̂^(t) denote the parameter estimates at the t-th iteration. The expected complete data log-likelihood function is:

$Q (ϕ ∣ {\hat{ϕ}}^{(t - 1)}) = \sum_{i = 1}^{I} \sum_{k = 1}^{K} \sum_{s = 1}^{S} \left[ \sum_{l = 1}^{n_{k}} \log f_{s} (y_{ikl} ∣ μ_{kls}, σ_{kls}, γ_{ikls}) \right] E [I (θ_{ik} = s) ∣ {\hat{ϕ}}^{(t - 1)}] + \sum_{i = 1}^{I} \sum_{k = 1}^{K} \sum_{s = 1}^{S} \log p_{is} E [I (θ_{ik} = s) b_{i} ∣ {\hat{ϕ}}^{(t - 1)}] + \sum_{i = 1}^{I} \sum_{j = 1}^{J} \log π_{j} E [z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}] + \sum_{i = 1}^{I} \left\{ \log ζ E [b_{i} ∣ {\hat{ϕ}}^{(t - 1)}] + \log (1 - ζ) (1 - E [b_{i} ∣ {\hat{ϕ}}^{(t - 1)}]) \right\} + \sum_{i = 1}^{I} \sum_{k = 1}^{K} \sum_{j = 1}^{J} \sum_{s = 1}^{S} E [I (θ_{ik} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}] \log w_{jks} .$ (11)

The E-M algorithm for MBASIC is outlined in Algorithm 1. E-step updates are listed in equations (12)–(15) and their derivations are provided in Appendix A.

Algorithm 1. Expectation-Maximization (EM)

for t = 1, 2, · · · until convergence do

Expectation Step: Compute the conditional expectations E[I(θ_ik = s)|ϕ̂^(t−1)], E[b_i|ϕ̂^(t−1)], E[I(θ_ik = s)b_i|ϕ̂^(t−1)], E[z_ij(1 − b_i)|ϕ̂^(t−1)], E[I(θ_ik = s)z_ij(1 − b_i)|ϕ̂^(t−1)];

Maximization Step: Update estimates for parameters μ_kls, σ_kls, ζ, π_j, w_jks, p_is as maximizers for (11).

end for


$E (b_{i} ∣ {\hat{ϕ}}^{(t - 1)}) = \frac{{\hat{ζ}}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{p}}_{is}^{(t - 1)})}{(1 - {\hat{ζ}}^{(t - 1)}) \sum_{j = 1}^{J} {\hat{π}}_{j}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)}) + {\hat{ζ}}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{p}}_{is}^{(t - 1)})},$ (12)

$E (z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}) = \frac{{\hat{π}}_{j}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)})}{\sum_{j = 1}^{J} {\hat{π}}_{j}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)})} [1 - E (b_{i} ∣ {\hat{ϕ}}^{(t - 1)})],$ (13)

$E (I (θ_{ik} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}) = [1 - E (b_{i} ∣ {\hat{ϕ}}^{(t - 1)})] \frac{{\hat{π}}_{j}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)})}{\sum_{j = 1}^{J} {\hat{π}}_{j}^{(t - 1)} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)})} \cdot \frac{{\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)}}{\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{w}}_{jks}^{(t - 1)}},$ (14)

$E (I (θ_{ik} = s) b_{i} ∣ {\hat{ϕ}}^{(t - 1)}) = E (b_{i} ∣ {\hat{ϕ}}^{(t - 1)}) \cdot \frac{{\hat{f}}_{iks}^{(t - 1)} {\hat{p}}_{is}^{(t - 1)}}{\sum_{s = 1}^{S} {\hat{f}}_{iks}^{(t - 1)} {\hat{p}}_{is}^{(t - 1)}},$ (15)

where ${\hat{f}}_{iks}^{(t - 1)} = \prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ {\hat{μ}}_{kls}^{(t - 1)}, {\hat{σ}}_{kls}^{(t - 1)}, γ_{ikls})$. Given these results from the E-step, updates of ζ, π_j, w_jks, p_is in the M-step are straightforward, as in equations (16), (17), (18), and (19).

${\hat{ζ}}^{(t)} = \frac{\sum_{i = 1}^{I} E [b_{i} ∣ {\hat{ϕ}}^{(t - 1)}]}{I},$ (16)

${\hat{π}}_{j}^{(t)} = \frac{\sum_{i = 1}^{I} E [z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]}{\sum_{i = 1}^{I} \sum_{j = 1}^{J} E [z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]},$ (17)

${\hat{p}}_{is}^{(t)} = \frac{\sum_{k = 1}^{K} E [I (θ_{ik} = s) b_{i} ∣ {\hat{ϕ}}^{(t - 1)}]}{\sum_{s = 1}^{S} \sum_{k = 1}^{K} E [I (θ_{ik} = s) b_{i} ∣ {\hat{ϕ}}^{(t - 1)}]},$ (18)

${\hat{w}}_{jks}^{(t)} = \frac{\sum_{i = 1}^{I} E [I (θ_{ik} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]}{\sum_{s = 1}^{S} \sum_{i = 1}^{I} E [I (θ_{ik} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]} .$ (19)
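Equations (12)–(19) can be sketched as one E-step per unit followed by an M-step over all units. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `f[k][s]` stands for f̂_iks, plain probabilities are used for readability (a practical version would work on the log scale), and all denominators are assumed to be strictly positive.

```python
import math

def e_step_unit(f, zeta, pi, w, p_i):
    """E-step expectations (12)-(15) for one unit i."""
    K, S, J = len(f), len(f[0]), len(pi)
    mix_p = [sum(f[k][s] * p_i[s] for s in range(S)) for k in range(K)]
    mix_w = [[sum(f[k][s] * w[j][k][s] for s in range(S)) for k in range(K)]
             for j in range(J)]
    single = zeta * math.prod(mix_p)
    clus = [pi[j] * math.prod(mix_w[j]) for j in range(J)]
    Eb = single / ((1.0 - zeta) * sum(clus) + single)                  # (12)
    Ez = [(1.0 - Eb) * c / sum(clus) for c in clus]                    # (13)
    Etz = [[[Ez[j] * f[k][s] * w[j][k][s] / mix_w[j][k] for s in range(S)]
            for k in range(K)] for j in range(J)]                      # (14)
    Etb = [[Eb * f[k][s] * p_i[s] / mix_p[k] for s in range(S)]
           for k in range(K)]                                          # (15)
    return Eb, Ez, Etb, Etz

def m_step(units):
    """M-step updates (16)-(19); units is a list of e_step_unit outputs."""
    I = len(units)
    J, K, S = len(units[0][1]), len(units[0][2]), len(units[0][2][0])
    zeta = sum(u[0] for u in units) / I                                # (16)
    pi_num = [sum(u[1][j] for u in units) for j in range(J)]
    pi = [v / sum(pi_num) for v in pi_num]                             # (17)
    p = []
    for u in units:                                                    # (18)
        num = [sum(u[2][k][s] for k in range(K)) for s in range(S)]
        p.append([v / sum(num) for v in num])
    w = []
    for j in range(J):                                                 # (19)
        w.append([])
        for k in range(K):
            num = [sum(u[3][j][k][s] for u in units) for s in range(S)]
            w[j].append([v / sum(num) for v in num])
    return zeta, pi, p, w
```

Because the cluster indicators z_ij sum to one over j, the state posterior P[θ_ik = s | ϕ̂^(t−1)] can be recovered from these outputs as `Etb[k][s]` plus the sum over j of `Etz[j][k][s]`.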

Updates for μ_kls and σ_kls have to be treated according to the specific distributions. For the log-normal distributions (3), we have:

${\hat{μ}}_{kls}^{(t)} = \frac{\sum_{i = 1}^{I} \log (y_{ikl} + 1) P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}]}{\sum_{i = 1}^{I} γ_{ikls} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}]},$ (20)

${({\hat{σ}}_{kls}^{(t)})}^{2} = \frac{\sum_{i = 1}^{I} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}] {[\log (y_{ikl} + 1) - {\hat{μ}}_{kls}^{(t)} γ_{ikls}]}^{2}}{\sum_{i = 1}^{I} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}]} .$ (21)
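Updates (20) and (21) are weighted averages with the state posteriors as weights. A minimal sketch for one fixed (k, l, s) triple, with a hypothetical function name and list inputs indexed by unit i:

```python
import math

def lognormal_update(y, gamma, post):
    """Updates (20) and (21) for one (k, l, s) triple.

    y[i]: observation y_ikl; gamma[i]: covariate gamma_ikls;
    post[i]: P[theta_ik = s | previous estimates]. Returns (mu, sigma).
    """
    logs = [math.log(v + 1.0) for v in y]
    # (20): posterior-weighted log signal over posterior-weighted covariate
    mu = (sum(v * q for v, q in zip(logs, post))
          / sum(g * q for g, q in zip(gamma, post)))
    # (21): posterior-weighted mean squared residual around mu * gamma
    var = (sum(q * (v - mu * g) ** 2 for v, g, q in zip(logs, gamma, post))
           / sum(post))
    return mu, math.sqrt(var)
```

With γ_ikls = 1 and all posterior weights equal to one, this reduces to the sample mean and (biased) standard deviation of log(y + 1).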

For the binomial distributions (5), we have:

${\hat{μ}}_{kls}^{(t)} = \frac{\sum_{i = 1}^{I} y_{ikl} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}]}{\sum_{i = 1}^{I} γ_{ikls} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}]} .$ (22)

Closed form maximizers of μ and σ do not exist for the negative binomial distribution (4). We adopt method-of-moments estimates as in Kuan et al. (2011) and Zuo and Keleş (2013), where the updated values ${\hat{μ}}_{kls}^{(t)}$ and ${\hat{σ}}_{kls}^{(t)}$ are the solutions of the following equations:

${\hat{μ}}_{kls}^{(t)} \sum_{i = 1}^{I} γ_{ikls} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}] = \sum_{i = 1}^{I} y_{ikl} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}],$
$\sum_{i = 1}^{I} \left[ {({\hat{μ}}_{kls}^{(t)})}^{2} γ_{ikls}^{2} \left( 1 + \frac{1}{{\hat{σ}}_{kls}^{(t)}} \right) + {\hat{μ}}_{kls}^{(t)} γ_{ikls} \right] P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}] = \sum_{i = 1}^{I} y_{ikl}^{2} P [θ_{ik} = s ∣ {\hat{ϕ}}^{(t - 1)}] .$
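The two moment equations can be solved in closed form: the first gives μ̂ directly, and substituting it into the second isolates 1/σ̂. The sketch below does this for one fixed (k, l, s) triple. The function name is hypothetical, and the fallback to an infinite size parameter when the weighted second moment shows no overdispersion is a choice made for this sketch, not something stated in the text.

```python
def negbin_moment_update(y, gamma, post):
    """Solve the two moment equations for (mu, sigma), one (k, l, s) triple.

    y[i]: counts y_ikl; gamma[i]: covariate gamma_ikls;
    post[i]: P[theta_ik = s | previous estimates].
    """
    s_gp = sum(g * q for g, q in zip(gamma, post))        # sum gamma * P
    s_g2p = sum(g * g * q for g, q in zip(gamma, post))   # sum gamma^2 * P
    s_yp = sum(v * q for v, q in zip(y, post))            # sum y * P
    s_y2p = sum(v * v * q for v, q in zip(y, post))       # sum y^2 * P
    mu = s_yp / s_gp                                      # first equation
    # Second equation rearranged:
    #   mu^2 * s_g2p / sigma = s_y2p - mu * s_gp - mu^2 * s_g2p
    excess = s_y2p - mu * s_gp - mu * mu * s_g2p
    if excess <= 0.0:
        # No overdispersion in the weighted moments (assumed handling).
        return mu, float("inf")
    return mu, mu * mu * s_g2p / excess
```

For counts 0, 2, 4, 10 with unit covariates and weights, the first equation gives μ̂ = 4 and the second gives σ̂ = 1.6.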

For the degenerate distributions as in (6), θ_ik’s are directly observed. Therefore, the E-M algorithm for this case requires slight modifications: we skip the estimation for E[I(θ_ik = s)|ϕ̂^(t−1)] in the E-step and for μ, σ in the M-step.

3.3 Estimating Structured Clusters

In integrative functional genomics studies, the set of experimental conditions usually consists of interactions of multiple experimental factors; hence, it is often important to identify clusters whose states are homogeneous across the levels of one or more factors. For example, in a typical transcription factor network analysis, experimental conditions include the combination of different cell types and TFs. It is often desirable to separate loci groups whose states are homogeneous within each cell type from those with cell type specific states for the purpose of cell type comparison. Depending on the cell types involved, such comparison can yield insights on cell development, pathology and/or cell-specific functions. We refer to clusters with homogeneous states within each cell type as TF-homogeneous. Another example is encountered in comparative functional genomics studies across different species, where experimental conditions range across both species and TFs. Clusters of loci whose states are homogeneous across species conditional on each TF constitute conserved functional modules. The TF-homogeneous clusters in this context represent the marginal effect of the species factor, and play a central role in understanding the evolutionary relationships.

To estimate a cluster with homogeneity for a particular experimental factor, MBASIC allows structural constraints on its state-space parameters. Recall that the parameters of cluster j are represented by w_j·s = (w_j1s, w_j2s, · · ·, w_jKs). Marginalizing the effect of this factor, the K experimental conditions can be partitioned into M sets, {1, 2, · · ·, K} = T₁ ∪ T₂ ∪ · · · ∪ T_M, where conditions within each set differ only in the levels of this factor. The parameters of this cluster satisfy the following constraints:

$w_{j k_{1} s} = w_{j k_{2} s}, \quad \text{if } \exists \, m \text{ s.t. } k_{1}, k_{2} \in T_{m} .$ (23)

A pictorial depiction with six experimental conditions arising from the full interaction between 2 cell types and 3 TFs is shown in Figure 1. Estimating structured clustering models follows the previous E-M algorithm with a modification in Equation (19). A constrained maximizer for w_jks subject to constraint (23) is computed as:

${\hat{w}}_{jks}^{(t)} = \frac{\sum_{k' \in T_{m}} \sum_{i = 1}^{I} E [I (θ_{ik'} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]}{\# T_{m} \sum_{s = 1}^{S} \sum_{i = 1}^{I} E [I (θ_{ik} = s) z_{ij} (1 - b_{i}) ∣ {\hat{ϕ}}^{(t - 1)}]}, \quad k \in T_{m} .$

Figure 1. A graphical description for a parametrization with structural constraints. Interactions of 2 cell types and 3 TFs result in six experimental conditions. Parameters with homogeneous values are shaded by the same color.
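The constrained update amounts to pooling the expected counts over the conditions in each set T_m and assigning the pooled proportions to every condition in that set. A minimal sketch for one structured cluster j follows, with a hypothetical function name and 0-based condition indices. It normalizes by the pooled total, which coincides with the #T_m-scaled denominator above when the E-step expectations sum over s to the same value for every condition k.

```python
def constrained_w_update(Etz_sum, partition):
    """Constrained update of w_jks for one cluster j under constraint (23).

    Etz_sum[k][s] = sum_i E[I(theta_ik = s) z_ij (1 - b_i)];
    partition: list of lists of condition indices (the sets T_1, ..., T_M).
    """
    K, S = len(Etz_sum), len(Etz_sum[0])
    w = [None] * K
    for T in partition:
        # Pool expected counts over all conditions in T_m.
        pooled = [sum(Etz_sum[k][s] for k in T) for s in range(S)]
        total = sum(pooled)
        shared = [v / total for v in pooled]
        for k in T:
            w[k] = list(shared)   # same probabilities for every k in T_m
    return w
```

A partition made of singleton sets recovers the unconstrained update (19).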

MBASIC requires that such structural constraints be specified a priori and remain fixed during model fitting. MBASIC incorporates a model selection procedure to compare models with different hypothesized structural constraints and numbers of clusters. We next describe the details of this model selection procedure.

3.4 Model Selection

The MBASIC framework so far assumes that the total number of clusters J is known a priori. In practice, models with varying values of J need to be fitted independently and compared with each other according to some information criterion to determine the best value of J. Since the E-M algorithm aims to maximize the data likelihood function, the AIC and BIC criteria can be utilized with MBASIC. The degrees of freedom for a model with J clusters are $df = F_{1} S \sum_{k = 1}^{K} n_{k} + (S - 1) I + J + F_{2}$, where F₁ = 2 for distributions (3) and (4), F₁ = 1 for (5), and F₂ is the total number of free parameters among the w_jks’s. If there are no structured clusters, we have F₂ = JK(S − 1).
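The degrees-of-freedom formula and the two criteria can be sketched as follows. The function names and family labels are hypothetical. The AIC and BIC expressions are the standard definitions, and the sample size `n_obs` entering the BIC penalty is left to the caller because the text does not specify it.

```python
import math

def mbasic_df(S, I, J, K, n_reps, family, F2=None):
    """df = F1 * S * sum_k n_k + (S - 1) * I + J + F2.

    n_reps[k]: number of replicates n_k; family selects F1;
    F2 defaults to J * K * (S - 1), the case without structured clusters.
    """
    F1 = {"lognormal": 2, "negbin": 2, "binomial": 1}[family]
    if F2 is None:
        F2 = J * K * (S - 1)
    return F1 * S * sum(n_reps) + (S - 1) * I + J + F2

def aic(loglik, df):
    """Akaike information criterion (lower is better)."""
    return -2.0 * loglik + 2.0 * df

def bic(loglik, df, n_obs):
    """Bayesian information criterion; n_obs is an assumption of the caller."""
    return -2.0 * loglik + df * math.log(n_obs)
```

For example, S = 2, I = 100, J = 3, K = 4 with two replicates per condition under the log-normal family gives df = 32 + 100 + 3 + 12 = 147.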

When there is no prior information available, both the total number of clusters and the number of clusters following structural constraints have to be determined. This results in a prohibitively large number of candidate models, and computing the information criterion for each of them is not practical. In such cases we incorporate the following two-phase strategy to limit the number of candidate models:

1. Evaluate models with varying total number of clusters without any structural constraints. Select J_opt according to the minimal AIC or BIC value.
2. Evaluate models with the fixed number of J_opt clusters while varying the number of clusters following each structural constraint. Select the number of clusters following each structural constraint based on the minimal AIC or BIC value.
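The two phases can be sketched in a few lines of Python. This is only a schematic under simplifying assumptions: a single type of structural constraint is considered, and `ic_of(J, n_structured)` is a hypothetical user-supplied function that fits a model with J clusters, of which `n_structured` follow the constraint, and returns its AIC or BIC value.

```python
def two_phase_selection(ic_of, J_candidates):
    """Greedy two-phase search over (J, number of structured clusters).

    ic_of(J, n_structured) is a hypothetical callback returning the
    information criterion of the fitted model; smaller is better.
    """
    # Phase 1: no structural constraints, vary the total number of clusters.
    J_opt = min(J_candidates, key=lambda J: ic_of(J, 0))
    # Phase 2: fix J_opt, vary how many clusters follow the constraint.
    n_opt = min(range(J_opt + 1), key=lambda n: ic_of(J_opt, n))
    return J_opt, n_opt


# Toy criterion standing in for real model fits.
toy_ic = lambda J, n: (J - 3) ** 2 + (n - 1) ** 2
print(two_phase_selection(toy_ic, [1, 2, 3, 4, 5]))
```

Phase 2 searches only along the slice J = J_opt, which is why the procedure limits the number of fits but cannot guarantee the global minimizer over all (J, n) pairs.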

We acknowledge that the above two-phase strategy is only a practical compromise to restrict the space of candidate models and does not guarantee finding the model that globally minimizes the information criterion. However, our simulation studies indicate that the proposed two-phase strategy performs well in a wide variety of settings.

3.5 Simulation Studies

We conducted 6 model-based simulation studies to investigate the performance of MBASIC in various settings as summarized in Table 1. Each simulation study has multiple settings that vary the distributional assumptions, size of the state-space (S), proportion of singletons (ζ), number of units (I), number of clusters (J), and number of conditions (K). We provide the details of these simulation studies in Appendix B and highlight the overall conclusions in this section.

Table 1.

Design of the simulation studies. LN: log-normal; NB: negative binomial; Bin: binomial; S: size of the state-space; ζ: proportion of singletons; I: number of units; J: number of clusters; K: number of experimental conditions.

Study	Distribution	S	ζ	I	(J,K)	Model Selection
1	LN, NB, Bin	2, 3, 4	0, 0.1, 0.4	4000	(20, 30)	No
2	LN, NB, Bin	2	0.1, 0.4	4000	(20, 30)	Yes
3	iASeq	3	0, 0.1, 0.4	4000	(10, 20), (20, 30)	Yes
4	Cormotif	2	0	10,000	(4, 4), (5, 8), (5, 10)	Yes
5	Cormotif	2	0, 0.1, 0.4	4000	(10, 20)	Yes
6	LN	2	0, 0.33	4120, 4600, 6120	(8, 30)	Yes

Data in Simulation Studies 1–2 were simulated according to MBASIC’s distributional assumptions. In Simulation Study 1, we emphasized the two most important features of MBASIC: the joint estimation of all model parameters and the inclusion of a singleton cluster. We derived six alternative algorithms (Table B.1) to benchmark MBASIC’s performance in various settings. Three of the algorithms (SE-HC, SE-MC, PE-MC) use two-stage procedures for model estimation, decoupling either the estimation of the state-space variables or the distributional parameters from the mixture modeling of the clustering analysis. The other three algorithms (SE-MC0, PE-MC0, MBASIC0) are variants that exclude the singleton feature. These benchmark algorithms are analogous in spirit to procedures in many applied genomic data analyses where the association between observational units is estimated separately from the estimation of individual dataset-specific parameters (e.g., Gerstein et al. (2012), Wei et al. (2012), Wei et al. (2015)).

Figures B.2–B.4 summarize the performance comparisons in Simulation Study 1. We observed that MBASIC’s joint estimation feature improved the inference for both the clustering structure and the individual states. In the presence of many singletons, the inclusion of their idiosyncratic state-space profiles was essential for robust clustering. In Simulation Study 2, we evaluated the effect of using BIC to select the number of clusters as well as the structural constraints within each cluster. Tables B.2 and B.3 indicate that MBASIC was always able to select models with structures similar to the simulated truth.

In Simulation Studies 3 to 5, we simulated data according to the models proposed by iASeq (Wei et al., 2012) and Cormotif (Wei et al., 2015). These models allow heterogeneous distributional parameters within the same state, a potential advantage over MBASIC in specific data analyses such as differential expression or allele-specific binding. We compared against these two models to investigate whether MBASIC is robust against such within-state heterogeneity. In Simulation Study 3, we showed that MBASIC with the binomial distribution could directly handle data generated under the iASeq framework and achieve competitive performance (Figure B.5). In Simulation Study 4, we inherited the simulation settings from Wei et al. (2015), where distributions from different states were weakly separable, but the individual states were completely determined by the clustering. We explored more dynamic settings in Simulation Study 5, where different states were easier to separate, but the states within the same cluster were random. We showed that a pre-processing step homogenizing the within-state units followed by MBASIC leads to performance comparable to Cormotif in Simulation Study 4 (Figure B.6), and much better performance in Simulation Study 5 (Figure B.7).

Wei et al. (2015) discuss an interesting point: when the clustering model does not accommodate singletons, small clusters tend to be merged together to form spurious clusters whose estimated state-space patterns are averages over several true clusters. In order to investigate whether such a phenomenon exists for MBASIC, we conducted Simulation Study 6, where we simulated data with two large clusters and six small clusters, and compared the performance of MBASIC and MBASIC0 to highlight the effect of including a singleton cluster. We found that, compared to MBASIC0, MBASIC was significantly less aggressive in merging small clusters. Overall, it captured large clusters and allocated the small cluster units as singletons (Figures B.10 and B.11, Tables B.6, B.7 and B.8). This study highlighted the utility of a singleton cluster as a potential remedy for the merging of small clusters.

Combining results from all of our simulation studies, we conclude that MBASIC is a powerful model for both state-space estimation and clustering structure recovery. Its adaptability to singletons, effectiveness in model selection, and robustness against within-state heterogeneity strongly support its applicability for real data sets.

4 Applications of MBASIC to Genome Research Problems

4.1 Transcription Factor Enrichment Network

Regulation of gene expression relies heavily on the context-specific combinatorial activities of TFs. Gene clustering analysis based on TF occupancy data, i.e., ChIP-seq, aims to identify combinatorial patterns of TF occupancy and group genes based on such patterns. The ENCODE consortium (The ENCODE Project Consortium, 2012) has generated TF ChIP-seq datasets for over 100 TFs across multiple cell types, and has motivated several integrative studies for learning regulation patterns (Gerstein et al., 2012; Wang et al., 2012). In this study, we applied MBASIC to the analysis of such data. Specifically, we focused on the TF enrichment patterns at the promoter regions, i.e., from −5000 bp to +1000 bp relative to the transcription start site, of the 10290 genes that had significant expression, as measured by RNA-seq, in either the Gm12878 or the K562 cells. The input data to MBASIC were the mapped numbers of reads at these promoter regions from the uniformly processed ChIP-seq data by Gerstein et al. (2012). We chose the cell types Gm12878 and K562 because they had the largest numbers of TF ChIP-seq experiments. The final dataset included ChIP-seq data for I = 10290 observational units over 30 TFs, corresponding to K = 60 experimental conditions (cell type × TF) with a total of 166 replicate experiments.

We fitted MBASIC with S = 2 states and used log-normal distributions as in Equation (3). State s = 1 corresponded to the unenriched state, and we let γ_ikl₁ = log(1+x_ik), where x_ik is the count from the matching control experiment at unit i. State s = 2 corresponded to the enriched state, and we let γ_ikl₂ = 1 for all loci.

We followed the two-phase procedure using BIC from Section 3.4 to select both the number of clusters and the structure of each cluster. In Phase 1, we selected the number of clusters as 24. In Phase 2, we considered two types of structural constraints for each cluster, referred to as TF-homogeneity and cell type-homogeneity and defined as w_{j k₁ s} = w_{j k₂ s} if k₁ and k₂ corresponded to the same TF or cell type. We found that imposing cell type-homogeneity on any cluster would cause that cluster to be degenerate (i.e., no unit was assigned to that cluster). Therefore, we chose the final model among those with TF-homogeneity structures. The BIC and log-likelihood values for different models fitted in both phases are shown in Figure 2. The final model had 24 unconstrained clusters, consisting of 1 − ζ = 89.8% of the 10290 loci. The ranges of the estimated distribution parameters among replicates within the same cell type-TF combination are shown in Figure C.12. We notice that these parameters can be substantially different among replicated experiments. This provides further support for our replicate-specific parametrization.

Figure 2. (a) BIC and (b) log-likelihood values for models with different structures. All the clusters are unstructured in the Phase 1 models and the x-axis denotes the total number of clusters. The total number of clusters is 24 for Phase 2 models and the x-axis denotes the number of unconstrained clusters. The remaining clusters have TF-homogeneity.

To compare the normalized data and the predicted enrichment probability for each cluster, we computed the normalized signals ¹ and compared them to the estimated cluster parameters. Figure 3 depicts such normalized signals from five randomly selected loci within each predicted cluster (Figure 3(a)), as well as the predicted enrichment probabilities at the corresponding condition and cluster (w_jk₂’s) (Figure 3(b)). We observe that the estimated enrichment probabilities at the cluster level capture the commonality among loci within each cluster. In addition, each cluster of loci exhibits a distinct combinatorial pattern of activity across all cell type-TF combinations. The cell type-TF combinations enriched within each cluster are listed in Table C.9.

Figure 3. (a) Normalized data for each cell-TF combination at five sub-sampled loci within each cluster. (b) Estimated enrichment probability at each cell-TF combination for each cluster.

Our clustering results are consistent with the existing literature on the TF enrichment networks. For example, cooperating TFs tend to be enriched at the same loci. This pattern can be observed in Figure 3(b) between Bcl3 and Bclaf1. Pol2 and Pol24h8 represent Pol2 experiments with different antibodies. As expected, we observe enrichment at the same loci for these two different versions of Pol2 experiments. Moreover, pairs of TFs that have similar binding motifs have similar enrichment probabilities over the clusters. For example, Wang et al. (2012) discovered the UA1 motif as common to both Chd2 and Ets1 and the USF motif for Max, Usf1, and Usf2. Interactions between Taf1 and Tbp have also been studied by Anandapadamanaban et al. (2013). Similar enrichment probabilities of these TFs across clusters can be observed in Figure 3(b). In addition to these observations that are consistent with the literature, our results illustrate how the genome-wide TF association patterns can be attributed to specific clusters. Using raw data, we explored the loci clusters with distinct patterns between cell types (e.g., Pol2 in Cluster 12, Figure 4), TFs from the same families (e.g., Bcl3 vs. Bclaf1 in Cluster 3, Figure C.13), and TFs with similar genome-wide enrichment (e.g., Max vs. Usf1 in Cluster 2, Figure C.14). We further evaluated each cluster of genes for KEGG pathway enrichment (Subramanian et al., 2005), and identified 8 KEGG pathways that are significantly enriched in individual clusters (Table 2). Three of our clusters (Clusters 7, 9, and 19) have a large share of their genes in a single pathway (156 of 391, 74 of 146, and 116 of 187 genes, respectively; Table 2). Since KEGG pathways curate the known knowledge of molecular interaction systems, these clusters may be driven by unknown biological processes that warrant further investigation.

Figure 4. (a, b) Plots of the transformed Pol2 ChIP sample read counts against the transformed control sample read counts for all units in (a) Gm12878 and (b) K562 cells. Data from unenriched units are expected to reside around the 45 degree dashed line.

Table 2.

Significantly enriched KEGG pathways across the 24 clusters.

KEGG.name	# Genes Overlapped	Z Score	Cluster	Cluster Size
Protein processing in endoplasmic reticulum	156	5.652	7	391
Fatty acid elongation in mitochondria	7	7.518	8	133
B cell receptor signaling pathway	74	6.016	9	146
Lysine biosynthesis	3	6.53	9	146
D-Glutamine and D-glutamate metabolism	3	5.548	12	184
Vitamin B6 metabolism	4	5.28	14	156
Non-homologous end-joining	12	7.539	17	213
Lysosome	116	5.402	19	187

MBASIC infers the clustering structure based on its own estimates of the state-space profiles. The ENCODE consortium provides the estimated enrichment regions (i.e., peaks) for each experiment in this study. A natural question is then whether MBASIC reveals more information than clustering genes based on ENCODE-estimated binary enrichment profiles of TFs. To address this, we created a binary vector for each gene by overlapping its promoter with the ENCODE peaks. Then, we applied the state-of-the-art MClust model (Fraley and Raftery, 2002) to cluster the 10290 promoter regions based on these peak profiles. MClust selected 90 clusters based on BIC. Figure C.15 displays cluster-level estimated enrichment probabilities of TFs across the conditions considered. Compared to Figure 3, we can see that many of the MClust clusters have very similar enrichment profiles. For example, Clusters 51, 7, 8, 32, and 54 contained almost no enrichment for any TFs, but are classified as distinct clusters. The association between units across these clusters is thus non-trivial to interpret. In addition, we found that for some conditions, the enrichment states predicted by MBASIC are quite different from those in the ENCODE peak profiles (e.g., Figure 5). This is because the ENCODE peaks are identified by genome-wide analysis and may not reflect the differences between the ChIP and control samples at the local promoter regions. MBASIC attains greater fidelity to the raw data by directly modeling the counts at each unit rather than inheriting results from existing analyses.

Figure 5. (a, b) Transformed ChIP versus control sample read counts from a Gm12878-Ctcf dataset. Enrichment states are annotated by (a) ENCODE peak profiles and (b) MBASIC estimation. In MBASIC, an observational unit is estimated to be enriched if its enrichment probability satisfies P(θ_ik = 2|Y) > 0.5.

4.2 Genome-wide Identification of +9.5-like Composite Elements

Johnson et al. (2012) and Gao et al. (2013) described the requirement of the intronic +9.5 site, an Ebox-GATA composite element located at chr6: 88143884–88157023 in the mouse genome (genome version mm9), to establish the hematopoietic stem/progenitor cell (HSC) compartment in the fetal liver and for hematopoietic stem cell genesis in the aorta-gonad-mesonephros (AGM), respectively. Furthermore, Johnson et al. (2012) and Hsu et al. (2013) showed that heterozygous +9.5 mutations cause a human immunodeficiency associated with myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). Because the +9.5 site is the only known cis-element whose deletion depletes fetal liver HSCs and is lethal at E13–14 of embryogenesis, identifying additional loci with similar functionality is important for establishing the mechanisms that endow GATA factor-bound regions with nonredundant activity, and has the potential to reveal novel targets for therapeutic modulation of hematopoiesis. In this application, we identified 4803 genomic regions with the Ebox-GATA motif (CATCTG-N[7–9]-AGATAA, where N[7–9] denotes a variable-size spacer of 7 to 9 nucleotides) in the human genome (genome version hg19). We considered a 150 bp window anchored at each of the 4803 composite elements as the observational unit. To analyze the TF occupancy activities at these units and identify a group of composite elements with occupancy profiles similar to that of the +9.5 composite element, we downloaded all ChIP-seq data for the Huvec and K562 cells from Gerstein et al. (2012). In total, the data set contained 224 replicates spanning K = 84 experimental conditions and 77 TFs.
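For illustration, a motif of this form can be located with a regular expression. The Python sketch below is not the authors' scanning pipeline: it reads N[7–9] as a spacer of 7 to 9 arbitrary nucleotides, scans only the strand it is given (whether both strands were scanned is not stated here), and uses a hypothetical function name.

```python
import re

# A lookahead is used so that overlapping occurrences are all reported.
MOTIF = re.compile(r"(?=(CATCTG[ACGT]{7,9}AGATAA))")


def find_composite_elements(seq):
    """Return (0-based start, matched sequence) pairs on the given strand."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq.upper())]


# Toy sequence with a 7-nucleotide spacer between the E-box and GATA motifs.
print(find_composite_elements("ttCATCTGACGTACGAGATAAgg"))
```

A window of a fixed width around each reported start could then serve as the observational unit; how the window is anchored on the element is left open in this sketch.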

We used negative binomial distributions with S = 2 states, where s = 1 denoted the unenriched (unoccupied) state, in the MBASIC framework. We chose γ_ikl₁ = 1+x_ik, where x_ik is the count for unit i from the matching control experiment for condition k, to incorporate data from the accompanying control experiments of the ChIP samples. For s = 2, we utilized the following mixture distribution to account for the heavy tails observed in the raw data:

Y_{ikl} - 3 ∣ θ_{i k} = 2 \overset{ind .}{\sim} ν_{ikl} N B (μ_{k l 2}, σ_{k l 2}) + (1 - ν_{ikl}) N B (μ_{k l 3}, σ_{k l 3}), ν_{ikl} \overset{i . i . d .}{\sim} Bernoulli (v_{k l}) .

Here, the constant 3 represents the minimum count threshold for enrichment estimation. The use of mixture distributions to capture heavy tailed count data was previously considered by Zuo and Keleş (2013). We note that an alternative approach to capture heavy tailed counts would be to fit a model using S = 3 states, with s = 2, 3 representing two distinct enrichment components. Such an approach would differ from the proposed approach in a subtle yet important way. In this alternative approach, allocation of each unit to different enrichment components would affect the clustering estimation, while in our approach, clustering is only determined by the enrichment status of the individual unit regardless of which enrichment component it follows. The E-M algorithm for this setting requires a slight modification as discussed in Section A.2.

Following the two-phase model selection procedure using BIC, we selected the model with 3 clusters, 2 of which were cell type-homogeneous. The ranges of the estimated distribution parameters among replicates within the same condition are displayed in Figures C.16 and C.17. The three clusters (denoted by C1, C2, and C3) included 332, 837, and 157 composite elements, respectively, and the remaining 3477 composite elements were identified as singletons. A heatmap of the enrichment probability of each unit under each cell type-TF combination across the three clusters is shown in Figure 6. The +9.5 element is a member of cluster C3, which consists of a total of 157 +9.5-like composite elements. A detailed genomic annotation of these elements is provided in Table C.10. Notably, 46% of the C3 elements reside in intronic regions and 42% of these are within the first intron. Only 15% of the elements in the cluster are located within 10 kb upstream of transcription start sites.

Figure 6. Posterior enrichment probability (i.e., P(θ_ik = 2|Y)) for all units in the three clusters. The rightmost column of the C3 cluster corresponds to the +9.5 element.

A detailed analysis of Figure 6 reveals that cluster C3 is driven by several transcription factors with known associations to GATA2. First, we note that a large fraction of the C3 loci are bound by BRG1. The chromatin remodeler BRG1 is involved in GATA1-mediated chromatin looping (Kim et al., 2009a,b) and co-localizes with GATA1 at some chromatin sites (Hu et al., 2011). BRG1 has broad functions in many cell types; however, conditional knockouts of BRG1 reveal its importance in specific cell and tissue contexts (Holley et al., 2014). Another factor that clearly stands out as having a GATA2-like profile in cluster C3 is ETS1. Our prior work identified the propensity of occupied GATA motifs to reside near Ets motifs (Linneman et al., 2011), and Doré et al. (2012) have highlighted GATA2-ETS co-localization.

We next performed an alternative, naive analysis using the list of peaks provided by the ENCODE project. As in the Transcription Factor Enrichment Network example of Section 4.1, these peaks were identified by analyzing each dataset individually with ENCODE’s uniform ChIP-seq processing pipeline. Figure C.18 displays the ENCODE peak profiles for our cell type-TF conditions. For each of the 4803 composite elements, we constructed a peak profile, which is a binary vector indicating whether the element overlaps with the ENCODE peaks for each cell type-TF combination. We then computed the peak-profile-based similarity between the +9.5 site and each of the composite elements using the R function dist.binary with the “Jaccard index” option. For comparison, we computed pseudo-binary similarities between each element and the +9.5 site using the MBASIC estimated enrichment probabilities across all conditions². We then ranked the composite elements based on both ENCODE and MBASIC estimated similarities. Figure 7 provides a comparison of the two lists as a function of the number of top-ranking composite elements. Overall, we observe that the rankings based on MBASIC estimation are consistent with the rankings based on the ENCODE peak profiles.
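The ranking comparison can be illustrated with a small Python sketch. It computes the Jaccard index directly on binary profiles rather than reproducing the R function, ranks units by similarity to a reference profile, and reports the proportion of shared units among the top n of two rankings. The function names and the convention for two all-zero profiles are assumptions made for this example; the pseudo-binary similarity used for the MBASIC probabilities is not reproduced here.

```python
def jaccard_index(u, v):
    """Jaccard index of two binary vectors: |u AND v| / |u OR v|."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    # Convention (an assumption): two all-zero profiles get similarity 0.
    return both / either if either else 0.0


def rank_by_similarity(profiles, ref):
    """Indices of profiles sorted by decreasing Jaccard similarity to ref."""
    sims = [jaccard_index(p, ref) for p in profiles]
    return sorted(range(len(profiles)), key=lambda i: -sims[i])


def top_overlap(rank_a, rank_b, n):
    """Proportion of units shared by the top n of two rankings."""
    return len(set(rank_a[:n]) & set(rank_b[:n])) / n


# Toy peak profiles over three conditions; the reference is [1, 1, 0].
profiles = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
print(rank_by_similarity(profiles, [1, 1, 0]))
```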

Figure 7. Proportion of overlap between the top ranked +9.5-like composite elements identified by MBASIC and ENCODE peak profiles. The overlap proportion is calculated by considering the same number of top ranked units (x-axis) in both the ENCODE-based and MBASIC-based similarities to the +9.5 site. The dashed lines mark that 78.3% of the C3 units are ranked in the top 157 based on the ENCODE peak profiles.

Although the rankings of the composite elements with respect to their +9.5 similarity using the ENCODE peak profiles and MBASIC estimation were quite similar, the two approaches resulted in different enrichment estimates at the level of individual TF-cell combinations. Figure 8(a) compares the estimated cluster-level enrichment probabilities of each cell type-TF combination for cluster C3 against their average ENCODE peak profiles and highlights the difference between the two procedures. To further investigate these differences, we plotted the raw data for individual replicates and compared the composite elements that were estimated to be enriched by the two methods. An example using data from K562-Chd2 is displayed in Figures 8(b) and (c). Although many elements have significantly higher counts in the ChIP sample compared to the control sample, they are not identified as occupied by Chd2 in K562 according to the ENCODE peak annotation.

Figure 8. (a) Top half: Enrichment probabilities for the C3 units across all experimental conditions estimated by MBASIC. Bottom half: Proportion of C3 units that are overlapped by the ENCODE peaks for each condition. (b, c) ChIP sample read counts against normalized control sample read counts for one replicate of the K562-Chd2 dataset. Enrichment status is annotated by (b) the ENCODE peak profiles and (c) MBASIC prediction.

Another example using a replicate from K562-Yy1 is shown in Figure C.19, where several elements with zero ChIP count are overlapped by ENCODE peaks. These results indicate that MBASIC provides a grouping of the Ebox-GATA composite elements that is more consistent with the raw data compared to grouping based on ENCODE peak annotation.

5 Conclusions and Discussion

Clustering analysis based on an underlying state-space is a common problem for many genomic and epigenomic studies where multiple data sets over many observational units are integrated. In this paper, we developed a unified statistical framework, called MBASIC, for addressing this class of problems. MBASIC simultaneously projects the observations onto a hidden state-space and infers clustered units in this space. The hierarchical structure of MBASIC enables the information of the state-space clusters to be fed back into the projection of the raw data, thus reinforcing the accuracy of predicting the states of individual units. The MBASIC framework offers flexibility in a number of aspects of experimental design, such as different numbers of replicates under individual experimental conditions and missing values. Additionally, it is applicable to many parametric distributions. Our computational studies highlighted good operating characteristics of MBASIC, and the two genomic applications illustrated how large numbers of ChIP-seq datasets can be integrated for addressing specific problems. In both applications, the MBASIC algorithm converged within 20 minutes for a fixed model on a 64 bit machine with an Intel Xeon 3.0GHz processor and 64GB of RAM. For model selection, we utilized the R package snow to implement the two-phase procedure with parallel fitting of different candidate models using an 8-core 64 bit, 64GB RAM machine with 8 Intel Xeon 3.0GHz processors. These runs were completed in under 2 hours. The computational efficiency of our model depends on the simple, closed-form updates in our E-M algorithm. Such a mathematical form is due, at least in part, to our modeling assumption that the rows of our state-space matrix are clustered. We have argued that this assumption, as compared to the PCA-type model structures, offers easier interpretation and is well suited for many genomic applications.
MBASIC is available as the R package mbasic at https://github.com/chandlerzuo/mbasic.

A Details of the Expectation-Maximization (EM) Algorithms

A.1 Derivation for the E-step

We derive the expressions for the E-step updates of our algorithm in Eqns. (12), (13), (14), (15) as well as the marginal likelihood in Eqns. (9) and (10).

In what follows, we let θ_iks denote 1{θ_ik = s}. The joint density of (z, b, θ, Y ) is given by:

f (z, b, θ, y) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{s = 1}^{S} {[f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}]}^{θ_{iks}},

(24)

where $f_{iks} = \prod_{l = 1}^{n_{k}} f (y_{ikl} ∣ μ_{kls}, σ_{kls}, γ_{ikls})$ . The following elementary equality is used repeatedly throughout the rest of the derivations in this section.

\sum_{\sum_{j} a_{i j} = 1, a_{i j} \in {0, 1}} \prod_{i} \prod_{j} b_{i j}^{a_{i j}} = \prod_{i} (\sum_{j} b_{i j}) .
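The equality can be checked numerically by brute force on a small matrix. The Python sketch below enumerates every one-hot assignment a (exactly one j selected per row i) for the left-hand side and compares it with the product of row sums; the function names are chosen for this illustration only.

```python
import itertools
import math


def lhs(b):
    """Sum over all one-hot assignments of prod_i prod_j b_ij ** a_ij."""
    n_cols = len(b[0])
    total = 0.0
    # Each assignment picks one column index j for every row i.
    for choice in itertools.product(range(n_cols), repeat=len(b)):
        total += math.prod(b[i][j] for i, j in enumerate(choice))
    return total


def rhs(b):
    """prod_i (sum_j b_ij)."""
    return math.prod(sum(row) for row in b)


b = [[0.2, 0.5, 0.3], [1.5, 2.0, 0.1]]
print(lhs(b), rhs(b))
```

The identity is simply the expansion of a product of sums, which is why each marginalization step in this section turns a sum over indicator configurations into a product over units.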

The joint density of (z, b, Y ) can be calculated from Eqn. (24):

f (z, b, y) = \sum_{\sum_{s} θ_{iks} = 1} f (z, b, θ, y) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} \cdot \sum_{\sum_{s} θ_{iks} = 1} \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{s = 1}^{S} {[f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}]}^{θ_{iks}} = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} [\sum_{s = 1}^{S} f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}] .

(25)

Since

\sum_{s = 1}^{S} f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}} = \prod_{j = 1}^{J} {[\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}}]}^{z_{i j}},

Eqn. (25) can be rewritten as:

f (z, b, y) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} {[π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}})]}^{z_{i j}} .

(26)

The joint distribution of (b, Y ) can be calculated from Eqn. (26):

f (b, y) = \sum_{\sum_{j} z_{i j} = 1} f (z, b, y) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \sum_{\sum_{j} z_{i j} = 1} \prod_{i = 1}^{I} \prod_{j = 1}^{J} {[π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}})]}^{z_{i j}} = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} [\sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}})] .

(27)

We note that

\sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}}) = {[\sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})]}^{1 - b_{i}} {[\prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})]}^{b_{i}} .

Then, Eqn. (27) can be rewritten as:

f (b, y) = \prod_{i = 1}^{I} {[(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})]}^{1 - b_{i}} {[ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})]}^{b_{i}} .

(28)

We can calculate the marginal density of Y, given in Eqn. (9), from Eqn. (28) as:

f (y) = \sum_{b_{i} \in {0, 1}} f (b, y) = \prod_{i = 1}^{I} [(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}) + ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})] .

(29)
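Because Eqn. (29) multiplies many densities, a direct evaluation can underflow for realistic K and n_k. The Python sketch below evaluates log f(y) in log-space using the log-sum-exp trick. The nested-list layout (`logf[i][k][s]` for log f_iks, `w[j][k][s]`, `p[i][s]`, `pi[j]`) and the function names are assumptions for this illustration, and all parameters are assumed strictly positive with 0 < ζ < 1.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def marginal_loglik(logf, w, p, pi, zeta):
    """log f(y) following Eqn. (29), computed unit by unit."""
    I, K, S, J = len(logf), len(logf[0]), len(logf[0][0]), len(pi)
    total = 0.0
    for i in range(I):
        # log of zeta * prod_k sum_s f_iks p_is (singleton branch).
        log_single = math.log(zeta) + sum(
            logsumexp([logf[i][k][s] + math.log(p[i][s]) for s in range(S)])
            for k in range(K))
        # log of (1 - zeta) * sum_j pi_j prod_k sum_s f_iks w_jks.
        log_clust = math.log(1.0 - zeta) + logsumexp([
            math.log(pi[j]) + sum(
                logsumexp([logf[i][k][s] + math.log(w[j][k][s])
                           for s in range(S)])
                for k in range(K))
            for j in range(J)])
        total += logsumexp([log_single, log_clust])
    return total


# Toy input: one unit, one condition, two states, one cluster.
logf = [[[math.log(0.5), math.log(0.2)]]]
print(marginal_loglik(logf, [[[0.3, 0.7]]], [[0.6, 0.4]], [1.0], 0.25))
```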

Eqn. (10) can be obtained similarly. Moreover, we can rewrite (25) as

f (z, b, y) = \prod_{i = 1}^{I} {[ζ \prod_{k = 1}^{K} \sum_{s = 1}^{S} f_{iks} p_{i s}]}^{b_{i}} {[(1 - ζ) \prod_{j = 1}^{J} \prod_{k = 1}^{K} {(\sum_{s = 1}^{S} f_{iks} w_{jks})}^{z_{i j}}]}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}}

(30)

by using

\sum_{s = 1}^{S} f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}} = {[\prod_{j = 1}^{J} {(\sum_{s = 1}^{S} f_{iks} w_{jks})}^{z_{i j}}]}^{1 - b_{i}} {[\sum_{s = 1}^{S} f_{iks} p_{i s}]}^{b_{i}} .

Thus, the density of (z, Y ) can be calculated as:

f (z, y) = \sum_{b_{i} \in {0, 1}} \prod_{i = 1}^{I} {[ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})]}^{b_{i}} {[(1 - ζ) \prod_{j = 1}^{J} \prod_{k = 1}^{K} {(\sum_{s = 1}^{S} f_{iks} w_{jks})}^{z_{i j}}]}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} = \prod_{i = 1}^{I} [ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s}) + (1 - ζ) \prod_{j = 1}^{J} \prod_{k = 1}^{K} {(\sum_{s = 1}^{S} f_{iks} w_{jks})}^{z_{i j}}] \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} = \prod_{i = 1}^{I} \prod_{j = 1}^{J} {[π_{j} ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s}) + π_{j} (1 - ζ) \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})]}^{z_{i j}} .

(31)

Using Eqns. (28) and (29), we obtain Eqn. (12) as

E [b_{i} ∣ Y] = \frac{ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})}{(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}) + ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})} .

(32)

Similarly, using Eqns. (31) and (29), we have

E [z_{i j} ∣ Y] = \frac{π_{j} ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s}) + (1 - ζ) π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})}{ζ \prod_{k = 1}^{K} \sum_{s = 1}^{S} f_{iks} p_{i s} + (1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})} .

(33)

Using Eqns. (26) and (27), we have

E [z_{i j} ∣ b, Y] = \frac{π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}})}{\sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}^{1 - b_{i}} p_{i s}^{b_{i}})} .

(34)

Eqns. (34) and (32) together result in Eqn. (13). Using Eqns. (24) and (25), we have

E [θ_{iks} ∣ z, b, Y] = \frac{f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}}{\sum_{s = 1}^{S} f_{iks} {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}} .

(35)

Therefore, we obtain Eqn. (14) by using Eqns. (27), (34), and (35):

E [θ_{iks} z_{i j} (1 - b_{i}) ∣ Y] = [1 - E (b_{i} ∣ Y)] E (z_{i j} ∣ b_{i} = 0, Y) E (θ_{iks} ∣ b_{i} = 0, z_{i j} = 1, Y) = \frac{(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})}{(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}) + ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})} \cdot \frac{π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})}{\sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks})} \cdot \frac{f_{iks} w_{jks}}{\sum_{s = 1}^{S} f_{iks} w_{jks}} .

(36)

Finally, we obtain Eqn. (15) using Eqns. (35) and (27):

E (θ_{iks} b_{i} ∣ Y) = E (b_{i} ∣ Y) E (θ_{iks} ∣ b_{i} = 1, Y) = \frac{ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})}{(1 - ζ) \sum_{j = 1}^{J} π_{j} \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} w_{jks}) + ζ \prod_{k = 1}^{K} (\sum_{s = 1}^{S} f_{iks} p_{i s})} \cdot \frac{f_{iks} p_{i s}}{\sum_{s = 1}^{S} f_{iks} p_{i s}} .

(37)
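As an illustration only, the E-step quantities in Eqns. (32), (36), and (37) for a single unit i could be computed as in the following Python sketch. The function name, argument layout, and toy values are assumptions made for this example and are not taken from the MBASIC software. The plain products over k are kept for readability; with many conditions a log-scale computation would likely be needed to avoid underflow.

```python
from math import prod

def e_step_unit(f, w, p, pi, zeta):
    """E-step for one unit i (hypothetical helper).

    f[k][s]    : f_iks, density of the unit's data in condition k under state s
    w[j][k][s] : w_jks, state probabilities of cluster j
    p[s]       : p_is, state probabilities of the unit if it is a singleton
    pi[j]      : cluster proportions; zeta: singleton proportion
    """
    K, S, J = len(f), len(p), len(pi)
    # sum_s f_iks * w_jks for every cluster j and condition k
    clus = [[sum(f[k][s] * w[j][k][s] for s in range(S)) for k in range(K)]
            for j in range(J)]
    # sum_s f_iks * p_is for every condition k
    sing = [sum(f[k][s] * p[s] for s in range(S)) for k in range(K)]
    a = [pi[j] * prod(clus[j]) for j in range(J)]
    b = prod(sing)
    denom = (1 - zeta) * sum(a) + zeta * b
    e_b = zeta * b / denom                                # Eqn. (32)
    e_z = [(1 - zeta) * a[j] / denom for j in range(J)]   # E[z_ij (1 - b_i) | Y]
    e_theta_z = [[[e_z[j] * f[k][s] * w[j][k][s] / clus[j][k]
                   for s in range(S)] for k in range(K)]
                 for j in range(J)]                        # Eqn. (36)
    e_theta_b = [[e_b * f[k][s] * p[s] / sing[k]
                  for s in range(S)] for k in range(K)]    # Eqn. (37)
    return e_b, e_z, e_theta_z, e_theta_b

# Toy input with K = 2 conditions, S = 2 states, J = 2 clusters.
f = [[0.9, 0.1], [0.2, 0.8]]
w = [[[0.8, 0.2], [0.3, 0.7]], [[0.1, 0.9], [0.9, 0.1]]]
e_b, e_z, e_theta_z, e_theta_b = e_step_unit(f, w, [0.5, 0.5], [0.6, 0.4], 0.1)
print(e_b, e_z)
```

By construction, E[b_i ∣ Y] and the J values of E[z_ij(1 − b_i) ∣ Y] sum to one, which gives a quick sanity check on any implementation.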

A.2 EM Algorithm with Mixture Data Distributions

An important extension of the MBASIC model is to allow multiple mixture components within each state. For example, our model in Section 4.2 models the data from state s = 2 as a mixture of two negative binomial distributions following the well motivated model of Kuan et al. (2011):

Y_{ikl} - 3 ∣ θ_{i k} = 2 ~ ν_{ikl} N B (μ_{k l 2}, σ_{k l 2}) + (1 - ν_{ikl}) N B (μ_{k l 3}, σ_{k l 3}), ν_{ikl} ~ Bernoulli (v_{k l}),

where the constant 3 denotes the minimum number of reads required to be in state θ = 2. In this section, we describe the general algorithm for such extensions. We assume that data from state s has a distribution of m_s components:

Y_{ikl} ∣ θ_{iks} = 1 ~ \sum_{r = 1}^{m_{s}} v_{klsr} f_{s r} (\cdot ∣ μ_{klsr}, σ_{klsr}, γ_{iklsr}) .

This can be written in a hierarchical form, using ν_iklsr as the hidden variable indicating the mixture component within the state:

{(ν_{iklsr})}_{1 \leq r \leq m_{s}} ~ Multinom (1, {(v_{klsr})}_{1 \leq r \leq m_{s}}), Y_{ikl} ∣ θ_{iks} = 1, ν_{iklsr} = 1 ~ f_{s r} (\cdot ∣ μ_{klsr}, σ_{klsr}, γ_{iklsr}) .

(38)

Here, we allow the distribution parameters μ and σ as well as the prior information derived γ to depend on the component. Let f_iklsr = f_sr(y_ikl|μ_klsr, σ_klsr, γ_iklsr). The joint density for this model is:

f (z, b, θ, ν, y) = \prod_{i = 1}^{I} ζ^{b_{i}} {(1 - ζ)}^{1 - b_{i}} \cdot \prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{j}^{z_{i j}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{l = 1}^{n_{k}} \prod_{s = 1}^{S} \prod_{r = 1}^{m_{s}} v_{klsr}^{ν_{iklsr}} \cdot \prod_{i = 1}^{I} \prod_{k = 1}^{K} \prod_{s = 1}^{S} {[(\prod_{l = 1}^{n_{k}} \prod_{r = 1}^{m_{s}} f_{iklsr}^{ν_{iklsr}}) {(\prod_{j = 1}^{J} w_{jks}^{z_{i j}})}^{1 - b_{i}} p_{i s}^{b_{i}}]}^{θ_{iks}} .

(39)

Let $f_{iks} = \prod_{l = 1}^{n_{k}} (\sum_{r = 1}^{m_{s}} v_{klsr} f_{iklsr})$ , then the joint density for z, b, θ, Y can be expressed exactly the same as Eqn. (24). Therefore, the M-step updates for W, P, ζ and π are not changed, with the related E-step quantities computed as Eqns. (12), (13), (14), (15). We only need to modify the algorithm to estimate variables that depend on the component index r: μ, σ, and v.

The related quantities that need to be computed are E[ν_iklsr|Y ] and E[θ_iksν_iklsr|Y ]. By Eqn. (39), we have

P (ν_{iklsr} = 1 ∣ θ_{iks} = 0) = v_{klsr}, P (ν_{iklsr} = 1 ∣ θ_{iks} = 1) = \frac{v_{klsr} f_{iklsr}}{\sum_{r = 1}^{m_{s}} v_{klsr} f_{iklsr}} .

Therefore, we have

E [θ_{iks} ν_{iklsr} ∣ Y] = E [θ_{iks} ∣ Y] \frac{v_{klsr} f_{iklsr}}{\sum_{r = 1}^{m_{s}} v_{klsr} f_{iklsr}},

(40)

where $E [θ_{iks} ∣ Y] = \sum_{j = 1}^{J} E [θ_{iks} (1 - b_{i}) z_{i j} ∣ Y] + E [θ_{iks} b_{i} ∣ Y]$ and can be computed by Eqns. (36) and (37). As a result,

E [ν_{iklsr} ∣ Y] = (1 - E [θ_{iks} ∣ Y]) v_{klsr} + E [θ_{iks} ν_{iklsr} ∣ Y] .

(41)

Given Eqns. (40) and (41), the M-step update for v_klsr is:

v_{klsr}^{(t)} = \frac{\sum_{i = 1}^{I} E [ν_{iklsr} ∣ {\hat{ϕ}}^{(t - 1)}]}{I} .
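For fixed indices k, l, and s, Eqns. (40) and (41) and the update for v_klsr can be sketched in Python as follows. The function name and the toy inputs are hypothetical; `e_theta[i]` stands for E[θ_iks ∣ Y] and `f[i][r]` for f_iklsr, both assumed to be already available from the previous steps.

```python
def component_e_step(e_theta, v, f):
    """Mixture-component expectations for fixed (k, l, s); hypothetical helper.

    e_theta[i] : E[theta_iks | Y]
    v[r]       : current mixing weights v_klsr (summing to one)
    f[i][r]    : component densities f_iklsr
    """
    e_theta_nu, e_nu = [], []
    for e_i, f_i in zip(e_theta, f):
        mix = sum(v_r * f_ir for v_r, f_ir in zip(v, f_i))
        # Eqn. (40): E[theta_iks * nu_iklsr | Y]
        row_tn = [e_i * v_r * f_ir / mix for v_r, f_ir in zip(v, f_i)]
        # Eqn. (41): E[nu_iklsr | Y]
        row_n = [(1 - e_i) * v_r + tn for v_r, tn in zip(v, row_tn)]
        e_theta_nu.append(row_tn)
        e_nu.append(row_n)
    # M-step update: average of E[nu_iklsr | Y] over the units
    n_units = len(e_theta)
    v_new = [sum(row[r] for row in e_nu) / n_units for r in range(len(v))]
    return e_theta_nu, e_nu, v_new

e_theta_nu, e_nu, v_new = component_e_step(
    [0.9, 0.2], [0.3, 0.7], [[0.5, 0.1], [0.2, 0.4]])
print(v_new)
```

Each row of `e_nu` sums to one whenever the input weights do, so the updated weights remain a valid probability vector.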

The M-step updates for μ_klsr, σ_klsr can be derived using Eqn. (40). For the negative binomial distribution, as in Section 4.2, we have

{\hat{μ}}_{klsr}^{(t)} \sum_{i = 1}^{I} γ_{ikls} E [θ_{iks} ν_{iklsr} ∣ {\hat{ϕ}}^{(t - 1)}] = \sum_{i = 1}^{I} E [θ_{iks} ν_{iklsr} ∣ {\hat{ϕ}}^{(t - 1)}] (Y_{ikl} - 3), \sum_{i = 1}^{I} E [θ_{iks} ν_{iklsr} ∣ {\hat{ϕ}}^{(t - 1)}] [{({\hat{μ}}_{klsr}^{(t)})}^{2} γ_{ikls}^{2} (1 + \frac{1}{{\hat{σ}}_{klsr}^{(t)}}) + {\hat{μ}}_{klsr}^{(t)} γ_{ikls}] = \sum_{i = 1}^{I} E [θ_{iks} ν_{iklsr} ∣ {\hat{ϕ}}^{(t - 1)}] {(Y_{ikl} - 3)}^{2} .
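The two moment equations above can be solved in closed form: the first gives the mean parameter directly, and substituting it into the second gives the size parameter. The Python sketch below illustrates this for fixed (k, l, s, r). The function name is hypothetical, `weights[i]` stands for E[θ_iks ν_iklsr ∣ ϕ̂^(t−1)], and returning an infinite size when the weighted data show no over-dispersion is an assumption of this example rather than a rule stated in the text.

```python
import math

def nb_moment_update(weights, y_shifted, gamma):
    """Moment-based update of (mu, sigma) for one negative binomial component.

    weights[i]   : E[theta_iks * nu_iklsr | previous estimates]
    y_shifted[i] : Y_ikl - 3
    gamma[i]     : gamma_ikls
    """
    sw_y = sum(w * y for w, y in zip(weights, y_shifted))
    sw_y2 = sum(w * y * y for w, y in zip(weights, y_shifted))
    sw_g = sum(w * g for w, g in zip(weights, gamma))
    sw_g2 = sum(w * g * g for w, g in zip(weights, gamma))
    mu = sw_y / sw_g                                  # first moment equation
    # Second moment equation rearranged: 1/sigma = excess / (mu^2 * sw_g2)
    excess = sw_y2 - mu * sw_g - mu * mu * sw_g2
    if excess <= 0:
        # Assumption: no over-dispersion, so fall back to the Poisson limit.
        return mu, math.inf
    return mu, mu * mu * sw_g2 / excess

mu, sigma = nb_moment_update([1.0] * 4, [0, 2, 4, 6], [1.0] * 4)
print(mu, sigma)
```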

B Simulation Studies

This section presents six broad simulation studies to evaluate the performance of MBASIC. Each simulation study had multiple settings as outlined in Table 1 of the main article. We introduced the following four families of distributions in our main article:

Log-normal Distribution. LN(μ_klsγ_ikls, σ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = \frac{1}{\sqrt{2 π} σ_{kls}} exp {- \frac{{(log (y + 1) - μ_{kls} γ_{ikls})}^{2}}{2 σ_{kls}^{2}}} .$ (42)
Negative Binomial Distribution. NB(μ_klsγ_ikls, σ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = \frac{Γ (y + σ_{kls})}{Γ (σ_{kls}) Γ (y + 1)} \frac{{(μ_{kls} γ_{ikls})}^{y} σ_{kls}^{σ_{kls}}}{{(μ_{kls} γ_{ikls} + σ_{kls})}^{y + σ_{kls}}} .$ (43)
Binomial Distribution. Binom(γ_ikls, μ_kls) with a density function:
$f_{s} (y ∣ μ_{kls}, γ_{ikls}) = (\begin{matrix} γ_{ikls} \\ y \end{matrix}) μ_{kls}^{y} {(1 - μ_{kls})}^{γ_{ikls} - y} .$ (44)
Degenerate Distribution:
$f_{s} (y ∣ μ_{kls}, σ_{kls}, γ_{ikls}) = I (y = s) .$ (45)

B.1 Simulation Study 1

The first simulation study investigated the performance of MBASIC when the true value of J was known and there were no structured clusters. We set the number of observational units as I = 4000 and the number of clusters as J = 20. The number of conditions was set to K = 30, and within each condition the number of replicates was n_k = 1, 2, or 3, with probabilities 0.3, 0.5, and 0.2, respectively. The size of the hidden state space was varied at three levels: S = 2, 3, 4. We simulated data under three distributional families: log-normal (LN) (3), negative binomial (NB) (4), and binomial (Bin) (5). We also varied the proportion of singleton units ζ at 0, 0.1 and 0.4. For simplicity, we set γ_ikls = 1 in all distributions.

B.1.1 Parameter Settings

Parameters w_jks’s and p_is’s generate the hidden state variables θ_ik’s. We set them as follows. For different values of k, j, and i, the vectors w_jk. = (w_jks : 1 ≤ s ≤ S) and p_i. = (p_is : 1 ≤ s ≤ S) were simulated independently, each following an S-dimensional Dirichlet distribution Dir(α, · · ·, α). We chose a uniform concentration parameter of α = 0.2 for all dimensions to ensure that for each vector w_jk. or p_i., the probability mass tended to concentrate on one component. This controlled the conditional variance of (θ_ik|b_i, z_ij ). A larger value of α would increase the conditional variance of θ_ik, thus making it more difficult to recover w_jks’s and p_is’s.

The settings for parameters μ_kls’s and σ_kls’s were important. These parameters connected hidden states θ_ik’s to the observed values Y_ikl’s. In general, recovering hidden states from the observed data is more difficult if: (1) differences of the mean values μ_kls’s between the states are small; (2) variances of the distributions within each state are large. To control these two aspects at reasonable levels, we set these parameters as follows:

For log-normal distributions (3), we set ξ_s = 2+log(4s – 3), simulated μ_kls ~ N(ξ_s, 0.05²), and set σ_kls = 0.5;
For negative binomial distributions (4), we set ξ_s = 8s–6, simulated μ_kls ~ N(ξ_s, 0.5²), and set σ_kl₁ = 2.82, σ_kls = 5 for s = 2, 3, 4;
For binomial distributions (5), we simulated μ_kls ~ Beta(3s, 3(S+1–s)), and γ_ikl₁ = γ_ikl₂ = · · · = γ_iklS ~ Pois(10).
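The data-generating mechanism for the log-normal case can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the code used for the study: the function names and default sizes are hypothetical, clusters are drawn with equal probability (the cluster proportions are not specified in this section), and γ_ikls = 1 so that log(Y + 1) is normal with mean μ_kls and standard deviation 0.5.

```python
import math
import random

def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(g)
    return [x / total for x in g]

def simulate_lognormal(I=200, J=4, K=6, S=3, zeta=0.1, alpha=0.2, seed=0):
    rng = random.Random(seed)
    states = range(S)
    # Number of replicates per condition: 1, 2, 3 with prob. 0.3, 0.5, 0.2.
    n_rep = [rng.choices([1, 2, 3], weights=[0.3, 0.5, 0.2])[0] for _ in range(K)]
    # mu[k][l][s] ~ N(xi_s, 0.05^2) with xi_s = 2 + log(4s - 3), s = 1..S.
    mu = [[[rng.gauss(2 + math.log(4 * (s + 1) - 3), 0.05) for s in states]
           for _ in range(n_rep[k])] for k in range(K)]
    w = [[dirichlet(alpha, S, rng) for _ in range(K)] for _ in range(J)]
    labels, theta, y = [], [], []
    for _ in range(I):
        if rng.random() < zeta:
            labels.append(0)                      # 0 marks a singleton unit
            probs = [dirichlet(alpha, S, rng)] * K  # p_i. is shared across k
        else:
            j = rng.randrange(J)                  # assumption: equal cluster sizes
            labels.append(j + 1)
            probs = w[j]
        th = [rng.choices(states, weights=probs[k])[0] for k in range(K)]
        theta.append(th)
        y.append([[math.exp(rng.gauss(mu[k][l][th[k]], 0.5)) - 1
                   for l in range(n_rep[k])] for k in range(K)])
    return labels, theta, y, n_rep

labels, theta, y, n_rep = simulate_lognormal()
print(n_rep, labels[:10])
```

States are coded 0, …, S − 1 in the sketch, so state index s in the code corresponds to state s + 1 in the text.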

Figure B.1 displays the histograms of Y_1,i,l, 1 ≤ i ≤ I, from one of the simulated data sets for all three distributions with S = 4 components. For comparison, we also present the histogram of an actual data set from the analysis in Section 4 of our main article. We observe that the mixture distribution of our simulated data with log-normal or negative binomial distributions closely follows the real data.

B.1.2 Alternative Approaches for Benchmarking MBASIC

The MBASIC algorithm is summarized as Algorithm 2:

To the best of our knowledge, there are currently no existing methods suited for the general setup of MBASIC. There are, however, algorithms tailored for analyzing specific data types with hierarchical state-space models similar to MBASIC. These algorithms largely fall into two categories. In the first category, estimation of the state-space variables is separated from state-space clustering. Some examples include Cheng et al. (2011), Gerstein et al. (2012), Ji et al. (2013), and Neph et al. (2012). In the second category, distributional parameters for each experimental replicate are estimated first. These parameters are then fixed, and the state-space variables and the clustering structure are estimated jointly conditional on these estimates. Examples of this approach include Wei et al. (2012), Zeng et al. (2013), and Wei et al. (2015).

Figure B.1 — Histograms for a) a real data set from a K562 Pol2 replicate in Section 4.1 of our main article; and simulated data from one condition based on one simulation for the b) Log-Normal, c) Negative Binomial and d) Binomial distribution with S=4 states.

Algorithm 2.

MBASIC

for t = 1, 2, · · · until convergence do

Expectation-Step: Compute the conditional expectations E[I(θ_ik = s)|ϕ̂^(t−1)], E[b_i|ϕ̂^(t−1)], E[I(θ_ik = s)b_i|ϕ̂^(t−1)], E[z_ij(1 − b_i)|ϕ̂^(t−1)], E[I(θ_ik = s)z_ij(1 − b_i)|ϕ̂^(t−1)];

Maximization-Step: Update estimates for parameters μ_kls, σ_kls, ζ, π_j, w_jks, p_is.

end for


Table B.1. Simulation Study 1.

A summary of the benchmark algorithms that are compared to MBASIC. Neither SE-MC nor PE-MC performs joint estimation of the model parameters. SE-* algorithms estimate the data-specific model parameters and the state space as a first step and then cluster the state variables. PE-* algorithms estimate the data-specific model parameters and fix them in the joint estimation of the state space and clustering.

Algorithm	Is State-space Estimation Joint with Clustering?	Is Parameter^* Estimation Joint with Clustering?	Clustering model	Include singletons
MBASIC	Joint	Joint	Mixture model	Yes
SE-HC	Separate	Separate	Hierarchical clustering	No
SE-MC	Separate	Separate	Mixture model	Yes
PE-MC	Joint	Separate	Mixture model	Yes
MBASIC0	Joint	Joint	Mixture model	No
SE-MC0	Separate	Separate	Mixture model	No
PE-MC0	Joint	Separate	Mixture model	No


^* Denotes distributional parameters for each experimental replicate.

To compare the general implementation of MBASIC as in Algorithm 2 with these existing model fitting ideas, we designed six benchmark algorithms. Table B.1 provides a summary of these algorithms. Two of these algorithms, SE-HC (State-space Estimation followed by Hierarchical Clustering) and SE-MC (State-space Estimation followed by Mixture model Clustering), treat the state-space mapping step and the state-space clustering separately. The third algorithm, PE-MC (Parameter Estimation followed by Mixture model Clustering), separates experiment-specific distributional parameter estimation from the joint estimation of other parameters.

For all three algorithms, in the first step, observations from each experimental condition {Y_ikl : 1 ≤ i ≤ I, 1 ≤ l ≤ n_k} are fitted according to the following model:

(Y_{ikl} ∣ θ_{i k} = s) ~ f_{s} (\cdot ∣ μ_{kls}, σ_{kls}, γ_{ikls}), P (θ_{i k} = s) = q_{k s} .

(46)

The standard E-M algorithm can be used for the first step and yields estimates of q_ks, μ_kls, σ_kls as well as the posterior estimates for the state space P(θ_ik = s|Y). In the second step, SE-MC and SE-HC cluster the observational units based on the estimated P(θ_ik = s|Y) from the first step. SE-HC (Algorithm 3) uses hierarchical clustering, while SE-MC (Algorithm 4) uses MBASIC with degenerate distributions (6) for clustering. The second step of PE-MC (Algorithm 5) is similar to Algorithm 2, except that the parameters μ_kls and σ_kls are not updated.
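To make the first step concrete, the following Python sketch runs a standard E-M algorithm for model (46) on a single condition, assuming the log-normal family with γ_ikls = 1 so that x = log(y + 1) is Gaussian within each state. The function names, the quantile-based initialisation, and the floor on the standard deviation are assumptions of this example; it also does not guard against empty components or numerical underflow.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_one_condition(x, S, n_iter=100):
    """x[i][l] = log(y_ikl + 1) for unit i and replicate l of one condition k."""
    I, L = len(x), len(x[0])
    q = [1.0 / S] * S
    mu, sigma = [], []
    for l in range(L):
        col = sorted(row[l] for row in x)
        # Assumed initialisation: means at evenly spaced empirical quantiles.
        mu.append([col[int((s + 0.5) * I / S)] for s in range(S)])
        spread = (col[-1] - col[0]) / (2 * S) or 1.0
        sigma.append([spread] * S)
    post = []
    for _ in range(n_iter):
        # E-step: P(theta_ik = s | Y), replicates share the same hidden state.
        post = []
        for row in x:
            lik = [q[s] * math.prod(normal_pdf(row[l], mu[l][s], sigma[l][s])
                                    for l in range(L)) for s in range(S)]
            total = sum(lik)
            post.append([v / total for v in lik])
        # M-step: weighted proportions, means, and standard deviations.
        for s in range(S):
            n_s = sum(p[s] for p in post)
            q[s] = n_s / I
            for l in range(L):
                mu[l][s] = sum(p[s] * r[l] for p, r in zip(post, x)) / n_s
                var = sum(p[s] * (r[l] - mu[l][s]) ** 2
                          for p, r in zip(post, x)) / n_s
                sigma[l][s] = max(math.sqrt(var), 1e-3)
    return q, mu, sigma, post
```

The returned `post` holds the posterior state probabilities that SE-HC and SE-MC would pass on to their clustering step, while PE-MC would keep only `mu` and `sigma`.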

In addition to joint fitting of all model parameters, another important feature of MBASIC is its inclusion of the singleton cluster 𝒞₀. To the best of our knowledge, this feature is not included in similar models such as Wei et al. (2012) and Wei et al. (2015). We conjecture that in practice, when some units cannot be grouped together with other units due to their distinct state-space profiles, including this singleton cluster can enhance model estimation. To test this conjecture, we developed a version of each of the SE-MC, PE-MC, and MBASIC algorithms that ignores the singleton cluster, i.e., forces each unit into a cluster. This is achieved simply by initializing ζ = 0 in Algorithms 2, 4, and 5. We refer to these algorithms as SE-MC0, PE-MC0, and MBASIC0.

Algorithm 3.

State-space estimation followed by hierarchical clustering (SE-HC)

Step 1:

for 1 ≤ k ≤ K do

Apply the standard E-M algorithm on data {Y_ikl : 1 ≤ i ≤ I, 1 ≤ l ≤ n_k} to estimate posterior probabilities P(θ_ik = s|Y ).

end for

Step 2:

Let θ̃_i = {P(θ_ik = s|Y)}_{1≤k≤K, 1≤s≤S}. Cluster the vectors θ̃_i into J clusters using the hierarchical clustering algorithm with the Euclidean distance. Estimate w_jks as the means within each cluster.


Algorithm 4.

State-space estimation followed by mixture model clustering (SE-MC)

Step 1:

for 1 ≤ k ≤ K do

Apply the standard E-M algorithm on data {Y_ikl : 1 ≤ i ≤ I, 1 ≤ l ≤ n_k} to estimate posterior probabilities P(θ_ik = s|Y).

end for

Step 2:

Denote $θ_{i k}^{*} = \arg\max_{s} P (θ_{i k} = s ∣ Y)$ for each 1 ≤ k ≤ K, 1 ≤ i ≤ I. Apply Algorithm 2 with $θ_{i k} \leftarrow θ_{i k}^{*}$ and f_s = I(y = s) to obtain estimates for w_jks, p_is, ζ, and π_j.


Algorithm 5.

Parameter estimation followed by mixture model clustering (PE-MC)

Step 1:

for 1 ≤ k ≤ K do

Apply the standard E-M algorithm on data {Y_ikl : 1 ≤ i ≤ I, 1 ≤ l ≤ n_k} to estimate μ_kls, σ_kls for each experiment.

end for

Step 2:

Apply Algorithm 2 without updating μ_kls, σ_kls in the Maximization step.


B.1.3 Results

We utilized several criteria to compare the performance of MBASIC to the benchmark algorithms in Table B.1. To estimate how well the state space was characterized for each cluster, we computed the mean-squared error for W (MSE-W) as $MSE - W = \sqrt{\sum_{j, k, s} {({\hat{w}}_{jks} - w_{jks})}^{2} / (JKS)}$ . We also evaluated how well each method recovered the true state variables θ_ik’s. This was reflected by the state prediction error (SPE) as the mean squared error between the simulated states θ_ik’s and their posterior probabilities: $SPE = \sqrt{\sum_{i, k, s} {[1 {θ_{i k} = s} - P (θ_{i k} = s ∣ Y)]}^{2} / (IKS)}$ . Finally, to compare the estimated clustering with the simulated true clustering, we computed the Adjusted Rand Index (ARI) (Rand, 1971). ARI is a measure for the similarity between two different clusterings of the data. Its value ranges between −1 and 1, with 1 indicating perfect match between the two clusterings.
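The three criteria can be written compactly in code. The Python sketch below follows the formulas for MSE-W and SPE as defined above and uses the standard pair-counting form of the Adjusted Rand Index; the function names and the nested-list data layout are assumptions of this example, and states are coded from 0.

```python
from collections import Counter
from math import comb, sqrt

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two partitions of the same units."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:        # degenerate case: both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def rmse_w(w_hat, w_true):
    """MSE-W: root of the mean squared difference over all (j, k, s)."""
    diffs = [(a - b) ** 2
             for wj_hat, wj in zip(w_hat, w_true)
             for wk_hat, wk in zip(wj_hat, wj)
             for a, b in zip(wk_hat, wk)]
    return sqrt(sum(diffs) / len(diffs))

def state_prediction_error(theta, post):
    """SPE: theta[i][k] is the true state, post[i][k][s] = P(theta_ik = s | Y)."""
    total, count = 0.0, 0
    for th_i, post_i in zip(theta, post):
        for th_ik, post_ik in zip(th_i, post_i):
            for s, prob in enumerate(post_ik):
                total += ((1.0 if th_ik == s else 0.0) - prob) ** 2
                count += 1
    return sqrt(total / count)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The ARI is invariant to relabelling of the clusters, which is why the example compares two partitions that differ only in their labels.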

Figure B.2 — Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.

Figure B.3 — Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.

Figure B.4 — Boxplots for ARI, MSE-W, and SPE across 10 simulated datasets. The number of states is varied at 2, 3, and 4, and the proportion of singletons at 0, 0.1, 0.4. Table B.1 summarizes the methods compared.

ARI requires the true clusters denoted by 𝒞_j, 0 ≤ j ≤ J and their estimates denoted by 𝒞̂_j, 0 ≤ j ≤ Ĵ. In our simulations, they were computed as:

\mathcal{C}_{0} = \{1 \leq i \leq I : b_{i} = 1\}; \mathcal{C}_{j} = \{1 \leq i \leq I : b_{i} = 0, z_{i j} = 1\}, j \geq 1,

where 𝒞₀ denoted the set of singleton units. 𝒞̂_j was computed from the posterior distributions as 𝒞̂_j = {1 ≤ i ≤ I : j = argmax_{0 ≤ j ≤ J} P(i ∈ 𝒞_j|Y)}, where P(i ∈ 𝒞₀|Y) = E(b_i|Y), and P(i ∈ 𝒞_j|Y) = E[(1 − b_i)z_ij|Y].
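The assignment rule amounts to taking, for each unit, the largest of J + 1 posterior probabilities. A minimal Python sketch, with a hypothetical function name and with label 0 reserved for the singleton set, is:

```python
def assign_clusters(e_b, e_z1mb):
    """Return estimated labels in {0, 1, ..., J}; 0 denotes the singleton set.

    e_b[i]       : E(b_i | Y)
    e_z1mb[i][j] : E[(1 - b_i) z_ij | Y] for cluster j + 1
    """
    labels = []
    for b_i, z_i in zip(e_b, e_z1mb):
        probs = [b_i] + list(z_i)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels

print(assign_clusters([0.7, 0.1], [[0.2, 0.1], [0.3, 0.6]]))
```

Ties are broken in favour of the smallest label in this sketch, which is an arbitrary choice.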

The simulation results under various settings are summarized by the boxplots for each criterion in Figures B.2, B.3, and B.4. Across all simulation settings, the performance of MBASIC was consistently among the best in terms of ARI, MSE-W, and SPE. This shows that MBASIC could not only recover the clustering structure, but also achieve high accuracy in estimating individual states. SE-HC, SE-MC and SE-MC0 performed the worst in both detecting the clustering structure and estimating the individual states. This suggests that separating state-space inference from clustering, rather than fitting the model jointly, can substantially degrade model estimation. Unlike the SE-* methods, PE-MC and PE-MC0 performed much closer to MBASIC. For the negative binomial and binomial distributions (Figures B.3 and B.4), PE-MC achieved similar ARI levels to MBASIC and slightly larger SPE values. These observations show that by jointly estimating the clusters and the states, data under different conditions could borrow information from each other and thus substantially improve the state-space estimation. Overall, these observations are consistent with the results in Wei et al. (2015) and Wei et al. (2012).

The simulation results also highlight the effect of modeling the singleton cluster in various settings. Comparing the performances of MBASIC with MBASIC0 and PE-MC with PE-MC0, we see that modeling the singleton cluster does not have a significant effect when the proportion of singletons is low, i.e., ζ = 0 or 0.1; however, the improvement is highly significant when ζ = 0.4. When ζ = 0.4, including singletons significantly improved the performance with respect to ARI, but did not have an obvious effect on SPE. This has several implications in practice. First, the fact that MBASIC does not under-perform any other method when ζ = 0 or 0.1 indicates that increasing the model complexity by introducing singletons does not compromise the robustness of inference. Because we are always agnostic about the existence of singletons in any real data, keeping them in our model guards against their adverse influence on inferring the clustering structure. Second, although incorporating the singleton cluster does not improve the estimation of individual states, some epigenetic studies focus primarily on the association structure between units, as in our example in Section 4.2. For such studies, the gain in estimating the clustering structure by including the singletons is essential. We note that in the comparison of SE-MC0 with SE-MC for the negative binomial and the binomial distributions (Figures B.3 and B.4), modeling the singletons does not necessarily improve estimation for separate model fitting even when the proportion of singletons is high, e.g., ζ = 0.4. This might suggest that the state-space estimation step introduces additional noise into the clustering step, which in turn makes it less favorable to infer a complicated clustering structure with singletons.

Table B.2. Simulation study 2, Scenario 1, unstructured clusters.

Simulation results for model selection without structural constraints. For each criterion, the mean is computed over 10 simulated data sets, with the standard deviation shown in the parentheses.

Dist.	ζ	J	ARI	MSE-W	SPE
Bin	0.1	20.8 ( 2.098 )	0.94 ( 0.036 )	0.096 ( 0.018 )	0.159 ( 0.014 )
Bin	0.4	20.9 ( 1.101 )	0.914 ( 0.035 )	0.122 ( 0.034 )	0.204 ( 0.012 )
LN	0.1	20.7 ( 0.823 )	0.989 ( 0.005 )	0.044 ( 0.03 )	0.086 ( 0.006 )
LN	0.4	21.3 ( 1.337 )	0.972 ( 0.007 )	0.095 ( 0.027 )	0.107 ( 0.008 )
NB	0.1	21.6 ( 0.843 )	0.947 ( 0.021 )	0.089 ( 0.028 )	0.154 ( 0.007 )
NB	0.4	20.6 ( 2.271 )	0.902 ( 0.026 )	0.112 ( 0.048 )	0.189 ( 0.007 )


Table B.3. Simulation study 2, Scenario 2, structured clusters.

Simulation results for model selection with structural constraints. For each criterion, the mean is computed over 10 simulated data sets, with the standard deviation shown in the parentheses.

Dist.	ζ	J₁	J	ARI	MSE-W	SPE
Bin	0.1	10.3 ( 1.16 )	20.7 ( 1.494 )	0.934 ( 0.022 )	0.084 ( 0.035 )	0.162 ( 0.02 )
Bin	0.4	10.3 ( 1.636 )	21 ( 2.625 )	0.897 ( 0.048 )	0.125 ( 0.03 )	0.196 ( 0.031 )
LN	0.1	10.4 ( 0.516 )	20.6 ( 0.516 )	0.984 ( 0.015 )	0.044 ( 0.032 )	0.086 ( 0.006 )
LN	0.4	11.2 ( 1.619 )	22.5 ( 1.509 )	0.968 ( 0.01 )	0.108 ( 0.037 )	0.106 ( 0.006 )
NB	0.1	10.9 ( 1.197 )	21 ( 1.054 )	0.955 ( 0.019 )	0.064 ( 0.035 )	0.155 ( 0.008 )
NB	0.4	11.2 ( 1.814 )	22.2 ( 1.398 )	0.926 ( 0.014 )	0.108 ( 0.031 )	0.184 ( 0.013 )


B.2 Simulation Study 2: Model Selection

This second set of simulations aimed to evaluate the use of BIC to select the number of clusters as well as the structural constraints for each cluster. We simulated data sets under two scenarios. For the first scenario, each data set had J = 20 clusters with K = 30 experimental conditions, and none of the clusters had structural constraints. For the second scenario, each data set had J = 20 clusters over K = 30 conditions, but J₁ = 10 of the clusters were structurally constrained as follows:

w_{j, k, s} = w_{j, k + K / 2, s}, \forall j, 1 \leq j \leq J / 2; \forall k, 1 \leq k \leq K / 2.

We refer to the two scenarios as the unstructured scenario and the structured scenario, respectively. We considered log-normal distributions (3), negative binomial distributions (4) and binomial distributions (5) for both cases. We also varied the proportion of singleton units ζ at 0.1 and 0.4. The number of states was fixed at S = 2. The remaining parameters were simulated following the same mechanism as in Section B.1.1.

For each simulated data set, we fitted a number of candidate models. For the unstructured scenario, we varied the number of clusters J from 10 to 30. For the structured scenario, we followed the two-phase procedure described in Section 3.4 of our main article. The best model was selected by the minimum BIC value. To assess the performance of these selected models, we computed the ARI, MSE-W and SPE metrics as described in Section B.1.3.

The simulation results are summarized in Tables B.2 and B.3. Under each set of parameters, we computed the mean and the standard deviation of each criterion, as well as of the selected values of J and J₁, over 10 simulated data sets. These tables show that the selected values for J and J₁ were very close to the true values. Moreover, MBASIC performed uniformly well with respect to ARI, MSE-W, and SPE under different settings. These results indicate that even if MBASIC may not identify the “true” structure that drives the actual data, the identified structures can still properly represent the state-space associations between units.

B.3 Simulation Studies 3-5: Comparison with iASeq and CorMotif

In this section, we compare MBASIC with two recently proposed models for integrative analysis of specific types of genomic data: CorMotif (Wei et al., 2015) and iASeq (Wei et al., 2012). Both models have a state-space clustering structure similar to that of MBASIC. The main difference from MBASIC is that they each incorporate more complicated distributional assumptions targeting specific genomic data types. The CorMotif model specifically addresses integrative differential expression analysis with n_k₁ case condition replicates and n_k₀ control condition replicates for each experimental condition k. It inherits the LIMMA (Smyth, 2004) framework for differential analysis of gene-expression data and assumes a mixture of Gaussian distributions with S = 2 states: s = 1 for the equally expressed state, and s = 2 for the differentially expressed state. Specifically, the CorMotif model has the following state-space mapping structure:

\frac{n_{k} s_{k}^{2}}{σ_{i k}^{2}} ~ χ_{n_{k}}^{2}, μ_{i k} ∣ σ_{i k}^{2} ~ N (0, u_{k} σ_{i k}^{2}), (Y_{ikl} ∣ θ_{i k} = 1) ~ N (μ_{i k 0}, σ_{i k}^{2}), l = 1, 2, \dots, n_{k 1}, (Y_{ikl} ∣ θ_{i k} = 2) ~ N (μ_{i k 0} + μ_{i k}, σ_{i k}^{2}), l = 1, 2, \dots, n_{k 1}, X_{ikl} ~ N (μ_{i k 0}, σ_{i k}^{2}), l = 1, 2, \dots, n_{k 0},

where X_ikl’s are the observed data from control experiments, and Y_ikl’s are the observed data from the case experiments. n_k and $s_{k}^{2}$ are hyperparameters specific to each experiment to account for potential heterogeneity among units within the same state, and u_k reflects the strength of differential expression. CorMotif assumes almost the same state-space clustering structure as MBASIC except that it does not include singletons. The iASeq model, which targets allele-specific binding problems, has the following state-space mapping structure:

Y_{ikl} ~ Binom (γ_{ikl}, p_{i k}), p_{i k} ∣ θ_{i k} = 2 ~ Beta (α_{k}, β_{k}), p_{i k} ∣ θ_{i k} = 1 ~ Unif (0, \frac{α_{k}}{α_{k} + β_{k}}), p_{i k} ∣ θ_{i k} = 3 ~ Unif (\frac{α_{k}}{α_{k} + β_{k}}, 1),

where α_k and β_k are experiment-specific parameters, and γ_ikl is the observed total number of reads between two alleles. The state-space clustering structure for iASeq is almost the same as that of MBASIC, except that it assumes no singletons, and that one cluster is dedicated to equal binding/occupancy between the alleles (i.e., w_1k1 = 1, ∀1 ≤ k ≤ K).
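For readers who want to experiment with this mapping structure, the Python sketch below draws the allele proportion p_ik and the replicate counts Y_ikl for one unit-condition pair given its state. The function name and arguments are hypothetical, and the read depths γ_ikl are treated as given inputs because the text does not specify how they are generated.

```python
import random

def simulate_iaseq_counts(theta, alpha, beta, depths, rng):
    """Draw (p_ik, [Y_ikl]) for one unit and condition under the iASeq mapping.

    theta  : hidden state in {1, 2, 3}
    alpha, beta : experiment-specific Beta parameters
    depths : list of total read counts gamma_ikl, one per replicate (assumed given)
    """
    cut = alpha / (alpha + beta)
    if theta == 1:
        p = rng.uniform(0.0, cut)
    elif theta == 2:
        p = rng.betavariate(alpha, beta)
    elif theta == 3:
        p = rng.uniform(cut, 1.0)
    else:
        raise ValueError("theta must be 1, 2, or 3")
    # Binomial(gamma_ikl, p) drawn as a sum of Bernoulli trials.
    counts = [sum(rng.random() < p for _ in range(g)) for g in depths]
    return p, counts

rng = random.Random(0)
print(simulate_iaseq_counts(2, 2.0, 2.0, [10, 12], rng))
```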

There are two key differences between CorMotif/iASeq and MBASIC. First, both CorMotif and iASeq address the heterogeneity among the units within the same state, and they introduce additional hyperparameters to model the heterogeneous parameters associated with the distribution of individual units. Compared to MBASIC, where we assume the distributions within the same state are homogeneous, such heterogeneous distributional assumptions are much more realistic. Second, CorMotif and iASeq implement two-stage estimation procedures similar to PE-MC0, which separate parameter estimation from state-space clustering. Wei et al. (2015) pointed out that once we have the heterogeneous distributional parameters within each state, joint model fitting for all parameters would require running a Markov Chain Monte-Carlo algorithm rather than the simple E-M algorithm we have developed for MBASIC. Therefore, the resulting computational cost might limit its applicability to large real data sets.

To compare MBASIC with CorMotif and iASeq, we simulated data according to each of the assumed distributions of CorMotif/iASeq, but fitted MBASIC models using simplified distributions. For data simulated from the iASeq model, we used MBASIC to fit binomial distributions with S = 3 states (5). For data simulated from the CorMotif model, we first generated two versions of t-statistics as follows. For each unit and experiment, denote ${\bar{Y}}_{i k} = \sum_{l = 1}^{n_{k 1}} Y_{ikl} / n_{k 1}, {\bar{X}}_{i k} = \sum_{l = 1}^{n_{k 0}} X_{ikl} / n_{k 0}, {\tilde{s}}_{i k}^{2} = [\sum_{l = 1}^{n_{k 1}} {(Y_{ikl} - {\bar{Y}}_{i k})}^{2} + \sum_{l = 1}^{n_{k 0}} {(X_{ikl} - {\bar{X}}_{i k})}^{2}] / (n_{k 1} + n_{k 0} - 2)$ and v_k = 1/n_k₁ + 1/n_k₀. We computed the naive t-statistic T_ik as:

T_{i k} = \frac{{\bar{Y}}_{i k} - {\bar{X}}_{i k}}{\sqrt{v_{k}} {\tilde{s}}_{i k}} .

(47)

We also computed the limma t-statistic T̃_ik by first fitting the data for each condition using LIMMA (Smyth, 2004) to estimate n_k and $s_{k}^{2}$ , then computed:

{\tilde{T}}_{i k} = \frac{\sqrt{n_{k} + n_{k 1} + n_{k 0} - 2} ({\bar{Y}}_{i k} - {\bar{X}}_{i k})}{\sqrt{v_{k} [(n_{k 1} + n_{k 0} - 2) {\tilde{s}}_{i k}^{2} + n_{k} s_{k}^{2}]}} .

(48)
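Eqns. (47) and (48) can be illustrated with the short Python sketch below for a single unit and condition. The function names are hypothetical, and the hyperparameters n_k and s_k² are passed in as plain arguments here, whereas in the study they are estimated with LIMMA.

```python
from math import sqrt

def pooled_stats(y, x):
    """Mean difference, pooled variance, v_k, and residual degrees of freedom."""
    n1, n0 = len(y), len(x)
    y_bar, x_bar = sum(y) / n1, sum(x) / n0
    ss = sum((v - y_bar) ** 2 for v in y) + sum((v - x_bar) ** 2 for v in x)
    df = n1 + n0 - 2
    return y_bar - x_bar, ss / df, 1 / n1 + 1 / n0, df

def naive_t(y, x):
    """Eqn. (47): two-sample t-statistic with the pooled variance."""
    diff, s2, v, _ = pooled_stats(y, x)
    return diff / sqrt(v * s2)

def moderated_t(y, x, n_k, s2_k):
    """Eqn. (48): the pooled variance is shrunk towards the prior value s2_k."""
    diff, s2, v, df = pooled_stats(y, x)
    return sqrt(n_k + df) * diff / sqrt(v * (df * s2 + n_k * s2_k))

y_case, x_ctrl = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]
print(naive_t(y_case, x_ctrl), moderated_t(y_case, x_ctrl, 4.0, 0.02))
```

Setting n_k = 0 removes the shrinkage, in which case the two statistics coincide.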

For each set of T_ik’s and T̃_ik’s, we fitted the MBASIC model with S = 2 components of scaled-t distributions:

T / μ_{k s} ∣ θ_{iks} = 1 ~ t_{σ_{k s}}, s = 1, 2.

(49)

Here, μ_ks is the scaling parameter, and σ_ks is the degrees of freedom. Because we pooled the replicate-level data to generate these t-statistics, the parameters μ and σ no longer depended on l. We refer to the method using T̃_ik as MBASIC-limma, and to the method using T_ik as MBASIC-t. Because there is no closed-form maximum likelihood solution for t-distributions, we used the method of moments to estimate μ_ks’s and σ_ks’s in the M-step, similar to the case of negative binomial distributions.

In Simulation Study 3, we simulated data following the iASeq model. We set α_k = β_k = 2, and simulated state-space variables the same as in Section B.1 with I = 4000. We set J = 10, 20 and ζ = 0, 0.1, 0.4. Simulation Studies 4-5 compare MBASIC with CorMotif. In Simulation Study 4, we simulated data in four settings corresponding to Simulations 1-4 of Wei et al. (2015), respectively. In these settings, we had n_k = 4, u_k = 4, $s_{k}^{2} = 0.02$ . Table B.4 summarizes the settings for the number of clusters, experimental conditions, and units for the state-space variables. We refer our readers to Wei et al. (2015) for more details of the state-space design. We note that the simulations of Wei et al. (2015) did not include singletons (i.e., ζ = 0) and, furthermore, their settings assumed w_jks ∈ {0, 1}. This means that the state-space variables are completely determined by the clustering structure. In Simulation Study 5, we set n_k and $s_{k}^{2}$ the same as in Simulation Study 4, but set u_k = 8 for easier distinction between different states. However, we simulated w_jks following S-dimensional Dirichlet distributions as in Simulation Study 1 to introduce noise in generating state-space variables. In addition, we simulated data with a smaller number of units (I = 4000), but more clusters (J = 10, 20), and varied the proportion of singletons ζ = 0, 0.1, 0.4. The other details of generating state-space variables were the same as in Section B.1.

Table B.4.

Summary for the designs of the simulation settings in Simulation Study 4, originally designed by Wei et al. (2015).

Simulation Setting	I	J	K
1	10,000	4	4
2	10,000	4	4
3	10,000	5	8
4	10,000	5	20


Table B.5. Simulation Studies 3-5.

A summary of the simulation designs, the fitting algorithms compared, and the figure numbers for the results.

Study	J	ζ	True model	Fitting algorithms	Related figures
3	10, 20	0, 0.1, 0.4	iASeq	MBASIC, iASeq	Figure B.5
4	4, 5	0	CorMotif	MBASIC-limma, MBASIC-t, CorMotif	Figures B.6, B.8
5	10, 20	0, 0.1, 0.4	CorMotif	MBASIC-limma, MBASIC-t, CorMotif	Figure B.7


Table B.5 further summarizes the components of the Simulation Studies. For each set of parameters, we simulated 10 data sets. We computed ARI, MSE-W, and SPE based on both the model with the number of clusters selected by BIC, and the oracle model where the number of clusters is set to its true value. The comparison between MBASIC and iASeq is shown in Figure B.5. For all the different settings, MBASIC achieved better clustering performance, with higher ARI values. However, iASeq performed better in SPE and MSE-W. When ζ = 0, iASeq performed better overall than MBASIC, with ARI values similar to those of MBASIC but much lower SPE. However, as ζ increased, iASeq’s ARI value became significantly smaller than MBASIC’s, while its SPE value became closer to MBASIC’s. In such cases, the benefits of modeling singletons seem to outweigh the loss from using simplified distributional assumptions.

The comparison between MBASIC and CorMotif is summarized in Figures B.6 and B.7. In Simulation Study 4 (Figure B.6), because CorMotif models did not allow singletons, we also excluded the singleton cluster in fitting MBASIC models. MBASIC-limma performed the best except in the first setting, where CorMotif achieved the best SPE. Figure B.8 depicts the average true positive rate in detecting states with θ_ik = 2 among the 1000 top ranking units for each of the four settings. In all but Setting 1, MBASIC-limma performed as well as CorMotif. We note that Setting 1 has the fewest clusters J = 4 and the fewest experimental conditions K = 4, while the other settings have more complicated state-space structures. Performance of MBASIC-t was the worst in all four settings. This suggests that neglecting the heterogeneity in these cases can significantly increase estimation error. Although the MBASIC model alone does not address the heterogeneity issue, fitting MBASIC models after a data pre-processing step that incorporates the heterogeneity structure, such as computing T̃_ik in MBASIC-limma, can significantly improve model inference. In Simulation Study 5, where we had stronger signals in separating distribution components but noisy state-space clusters, CorMotif resulted in the largest SPE values in all settings (Figure B.7). Although its performance in ARI was comparable with MBASIC-limma when ζ ≤ 0.1, it deteriorated with an increasing proportion of singletons, i.e., ζ = 0.4. Simulation Studies 4 and 5 collectively suggest that MBASIC’s performance is competitive with CorMotif in settings where we have less noise in clustering structure, small numbers of clusters, and some level of singletons, despite the fact that the distributional assumptions of MBASIC might be mis-specified. This indicates that for real data sets where we are agnostic about the true data generating structure, MBASIC might be a more general and robust approach.

Figure B.5 — We varied the number of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are summarized over 10 simulations under each setting.

Figure B.7 — We varied the number of clusters at 10, 20 and the proportion of singletons at 0, 0.1 and 0.4. Results are summarized over 10 simulations under each setting.

Figure B.8 — The average true positive rate for the 10 simulations among the 1000 highest ranking unit-experiment pairs for each of the four simulation settings in Table B.4. For each simulation, the “true positive” set consists of (i, k)’s with *θ_ik* = 2, and the ranking is based on the posterior probability P (*θ_ik* = 2|Y ).
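
As a rough illustration of the quantity plotted in Figure B.8, the following Python sketch ranks unit-experiment pairs by a posterior probability and reports the share of the top n pairs whose true state equals 2. The function name, the dictionary layout, and the toy numbers are assumptions made for illustration; they are not taken from the MBASIC software, and the paper's exact definition of the rate may differ in detail.

```python
def top_n_true_positive_rate(posterior, truth, n=1000, target_state=2):
    """Fraction of the n highest-ranked (i, k) pairs whose true state is target_state.

    posterior: dict mapping (i, k) -> P(theta_ik = target_state | Y)
    truth:     dict mapping (i, k) -> true state theta_ik
    """
    # Rank pairs from highest to lowest posterior probability.
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    top = ranked[:n]
    if not top:
        raise ValueError("no unit-experiment pairs to rank")
    hits = sum(1 for pair in top if truth[pair] == target_state)
    return hits / len(top)


# Toy example with four pairs (hypothetical values, not from the paper).
posterior = {(1, 1): 0.95, (1, 2): 0.10, (2, 1): 0.80, (2, 2): 0.60}
truth = {(1, 1): 2, (1, 2): 1, (2, 1): 2, (2, 2): 1}
rate = top_n_true_positive_rate(posterior, truth, n=3)
```

In the toy example, two of the three highest-ranked pairs have true state 2. Averaging this rate over the 10 simulated data sets of a setting would give one point of the kind summarized in the figure.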

Figure B.9 — Two settings of the true cluster patterns, represented by the matrix ${(w_{j k 2})}_{1 \leq j \leq 8, 1 \leq k \leq 30}^{T}$ .

B.4 Simulation Study 6: Weak Clusters

Wei et al. (2015) pointed out that state-space clustering without accommodating singletons can merge small clusters into large clusters that have distinct state-space profiles. Such a phenomenon may alter the interpretation of the W matrix, because each column may represent the average state-space pattern of several small clusters that lack data support. It is therefore important to investigate whether such a phenomenon still exists for MBASIC, where we include a singleton cluster.

We conducted six simulations in Simulation Study 6 to investigate this issue. We simulated data according to the log-normal distribution with S = 2 states, J = 8 clusters, and K = 30 conditions. The number of replicates within each condition, as well as the distribution parameters within each state, are the same as in Section B.1.1. We set the sizes of the first two clusters to 2000, and set the size of each of the other six clusters, n_small, to 20 or 100. To vary the level of state-space similarity between the two big clusters and the small clusters, we used two settings for the state-space pattern, as shown in Figure B.9. For Simulations 1–3, the conditions in which the small clusters have state s = 2 are distinct from those of the two big clusters, while for Simulations 4–6, the patterns of the small and large clusters are more similar. To control these cluster patterns, we set w_jks ∈ {0.1, 0.9}. Finally, we included n_singleton = 0 or 2000 singletons in each simulated data set. The states for the singleton units were generated in the same way as in Section B.1.1.
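
To make this design more concrete, the following Python sketch draws hidden states for the clustered units only. It assumes that w_jk2 is the probability that a unit in cluster j has state 2 under condition k, which is consistent with using w_jks ∈ {0.1, 0.9} to control cluster patterns. The block pattern stored in `w2` is a hypothetical stand-in for the patterns in Figure B.9, and the log-normal observation model and singleton generation of Section B.1.1 are not reproduced here.

```python
import random


def simulate_states(w2, sizes, seed=0):
    """Draw a state in {1, 2} for every unit and condition.

    w2[j][k]: assumed probability that a unit in cluster j has state 2 in condition k
    sizes[j]: number of units in cluster j
    Returns (labels, states): cluster label per unit and a list of K states per unit.
    """
    rng = random.Random(seed)
    labels, states = [], []
    for j, (row, n_units) in enumerate(zip(w2, sizes), start=1):
        for _ in range(n_units):
            labels.append(j)
            states.append([2 if rng.random() < p else 1 for p in row])
    return labels, states


J, K, n_small = 8, 30, 20
sizes = [2000, 2000] + [n_small] * 6  # two big clusters, six small ones

# Hypothetical pattern: cluster j is mostly in state 2 on its own block of conditions.
w2 = [[0.9 if k // 4 == j else 0.1 for k in range(K)] for j in range(J)]

labels, states = simulate_states(w2, sizes)
```

With n_small = 20 this produces 4120 clustered units, which corresponds to the sizes used in Simulations 1 and 4 before any singletons are added.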

In each simulation, we fitted the data using MBASIC and MBASIC0, with BIC used to select the number of clusters. MBASIC0 differs from MBASIC only by the exclusion of the singleton feature. Therefore, comparing these two methods allows us to assess how fitting a singleton cluster affects the small cluster merging problem. Tables B.6 and B.7 compare the confusion matrices between the fitted and the true clusters. We also display the state-space patterns of the fitted models in Figures B.10 and B.11. In Simulations 1 and 4, with n_small = 20 units in each of the small clusters, MBASIC classified the units of small clusters as singletons, while MBASIC0 merged them to form a spurious cluster. The state-space pattern estimated by MBASIC represented the two real big clusters. When the data included singletons, as in Simulations 2 and 5, MBASIC0 formed more spurious clusters, while MBASIC continued to allocate the small clusters as singletons. When we had n_small = 100 units in each of the small clusters, both methods identified these small cluster patterns in Simulation 6 (Figure B.11), but formed spurious clusters in Simulation 3 (Figure B.10). We compare the resulting ARI, MSE-W, and SPE between MBASIC and MBASIC0 in Table B.8. The performances of the two methods are close when there are no singletons, but diverge when singletons are present. Based on these simulations, we conclude that fitting a singleton cluster can substantially reduce the merging of weak clusters. The state-space patterns estimated by the W matrix are more likely to reflect true underlying clusters rather than the average of several small clusters. We acknowledge that the extent to which modeling singletons prevents the merging of weak clusters requires further investigation in more varied settings, as we vary the similarity among clusters, the difference among the states, and other variables that influence cluster structures such as J, K, and S. We leave such investigations as future research.
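
The link between the confusion matrices and the ARI values can be sketched in Python as follows. The function implements the standard adjusted Rand index computed from a contingency table; the function name is an assumption, and the example table is the MBASIC panel of Simulation 1 in Table B.6, with the fitted label 0 (singletons) treated as an ordinary cluster. Whether the paper handles the singleton label in exactly this way is not stated in this section.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table (rows: true, columns: fitted)."""
    n = sum(sum(row) for row in table)
    if n < 2:
        raise ValueError("need at least two units")
    # Pairs of units that share both a true cluster and a fitted cluster.
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    # Pairs that share a true cluster, and pairs that share a fitted cluster.
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


# MBASIC panel of Simulation 1, Table B.6 (columns: fitted clusters 0, 1, 2).
sim1_mbasic = [
    [23, 2, 1975],
    [30, 1967, 3],
    [20, 0, 0],
    [20, 0, 0],
    [20, 0, 0],
    [20, 0, 0],
    [20, 0, 0],
    [20, 0, 0],
]
ari = adjusted_rand_index(sim1_mbasic)  # approximately 0.967
```

For this table the sketch gives roughly 0.967, in line with the MBASIC value reported for Simulation 1 in Table B.8. This suggests, but does not confirm, that singleton assignments are counted as one cluster in the reported ARI.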

Figure B.10 — Estimated cluster patterns by (a, c, e) MBASIC and (b, d, f) MBASIC0. The true clustering pattern is shown in Figure B.9(a).

Figure B.11 — Estimated cluster patterns by (a, c, e) MBASIC and (b, d, f) MBASIC0. The true clustering pattern is shown in Figure B.9(b).

Table B.6. Simulation Study 6, Simulations 1–3.

Confusion matrix between the true clusters and the estimated clusters. The true cluster pattern is shown in Figure B.9(a).

Simulation 1, n_small = 20, n_singleton = 0
	MBASIC				MBASIC0
True	0	1	2	True	1	2	3

1	23	2	1975	1	1998	2	0
2	30	1967	3	2	4	1995	1
3	20	0	0	3	0	1	19
4	20	0	0	4	0	1	19
5	20	0	0	5	0	0	20
6	20	0	0	6	0	1	19
7	20	0	0	7	0	0	20
8	20	0	0	8	0	1	19

Simulation 2, n_small = 20, n_singleton = 2000
	MBASIC					MBASIC0
True	0	1	2	3	True	1	2	3	4	5	6

0	1957	19	17	7	0	67	372	620	59	427	455
1	180	1818	2	0	1	1957	25	0	4	0	14
2	153	3	1844	0	2	8	23	3	1950	0	16
3	20	0	0	0	3	0	0	1	0	0	19
4	20	0	0	0	4	0	0	0	0	0	20
5	20	0	0	0	5	0	0	1	0	0	19
6	20	0	0	0	6	0	0	1	0	0	19
7	20	0	0	0	7	0	0	1	0	0	19
8	20	0	0	0	8	0	0	1	0	0	19

Simulation 3, n_small = 100, n_singleton = 0
	MBASIC							MBASIC0
True	0	1	2	3	4	5	True	1	2	3	4

1	21	1974	0	0	5	0	1	1993	0	7	0
2	12	7	0	1	1980	0	2	7	1	1991	1
3	1	1	9	0	0	89	3	1	0	0	99
4	5	0	89	0	0	6	4	0	2	0	98
5	6	0	2	92	0	0	5	0	99	0	1
6	1	0	0	13	0	86	6	0	6	0	94
7	0	0	96	4	0	0	7	0	100	0	0
8	0	0	0	97	0	3	8	0	99	0	1

Table B.7. Simulation Study 6, Simulations 4–6.

Confusion matrix between the true clusters and the estimated clusters. The true cluster pattern is shown in Figure B.9(b).

Simulation 4, n_small = 20, n_singleton = 0

	MBASIC				MBASIC0
True	0	1	2	True	1	2	3

1	5	1995	0	1	1999	1	0
2	1	0	1999	2	0	0	2000
3	1	19	0	3	17	3	0
4	19	1	0	4	0	20	0
5	1	0	19	5	0	5	15
6	20	0	0	6	0	20	0
7	18	1	1	7	0	19	1
8	20	0	0	8	1	19	0

Simulation 5, n_small = 20, n_singleton = 2000

	MBASIC				MBASIC0
True	0	1	2	True	1	2	3	4	5	6

0	1986	7	7	0	16	393	457	561	560	13
1	31	1969	0	1	1990	0	0	5	5	0
2	45	0	1955	2	0	0	0	5	12	1983
3	11	9	0	3	12	0	0	1	7	0
4	20	0	0	4	0	0	0	1	19	0
5	8	0	12	5	0	0	0	0	6	14
6	20	0	0	6	0	0	0	0	20	0
7	20	0	0	7	0	0	0	0	20	0
8	20	0	0	8	0	0	0	1	19	0

Simulation 6, n_small = 100, n_singleton = 0

	MBASIC									MBASIC0
True	0	1	2	3	4	5	6	7	True	1	2	3	4	5	6	7

1	3	1983	14	0	0	0	0	0	1	1976	24	0	0	0	0	0
2	1	0	0	0	0	1999	0	0	2	0	0	0	0	2000	0	0
3	0	8	92	0	0	0	0	0	3	7	93	0	0	0	0	0
4	1	0	0	1	98	0	0	0	4	0	0	99	0	0	1	0
5	1	0	0	0	0	99	0	0	5	0	0	0	0	99	0	1
6	1	0	0	0	0	0	99	0	6	0	0	1	0	0	0	99
7	1	0	1	96	0	1	0	1	7	0	2	1	1	1	95	0
8	0	0	0	0	0	0	0	100	8	0	0	0	100	0	0	0

Table B.8. Simulation Study 6.

ARI, MSE-W, and SPE in all simulations.

Simulation	n_small	n_singleton	MBASIC ARI	MBASIC MSE-W	MBASIC SPE	MBASIC0 ARI	MBASIC0 MSE-W	MBASIC0 SPE
1	20	0	0.967	0.433	0.185	0.991	0.29	0.189
2	20	2000	0.79	0.442	0.192	0.727	0.34	0.193
3	100	0	0.969	0.203	0.187	0.975	0.233	0.193
4	20	0	0.977	0.384	0.169	0.983	0.260	0.167
5	20	2000	0.926	0.384	0.185	0.773	0.333	0.184
6	100	0	0.949	0.087	0.172	0.947	0.087	0.171

C Additional Tables and Figures

Figure C.12 — The range of the estimated parameters *μ_kls* and *σ_kls* among the different replicates under the same experimental condition for the transcription factor enrichment network data in Section 4.1.

Figure C.13 — Plots of the transformed ChIP sample read counts against the transformed control sample read counts for all units in the Gm12878 cell for (a) Bcl3 and (b) Bclaf1. Data from unenriched units are expected to lie around the 45-degree dashed line.

Figure C.14 — Plots of the transformed ChIP sample read counts against the transformed control sample read counts for all units in both Gm12878 and K562 cells for (a) Max and (b) Usf1. Data from unenriched units are expected to lie around the 45-degree dashed line.

Figure C.15 — Estimated enrichment probability for each of the 90 clusters identified by MClust.

Figure C.16 — The range of the estimated parameters *μ_kls* among the different replicates under the same experimental condition for the +9.5 composite element data in Section 4.2.

Figure C.17 — The range of the estimated parameters *σ_kls* among the different replicates under the same experimental condition for the +9.5 composite element data in Section 4.2. For a subset of the replicates, the data for an individual state can be under-dispersed, resulting in a negative value for the estimated size parameter in the negative binomial distribution. In that case, we set *σ_kls* = 100. Although our model does not fully capture the under-dispersion patterns, the large variations in *σ_kls* for a subset of the experimental conditions suggest that assuming replicate-specific distributions is necessary.

Figure C.18 — Enrichment states provided by the ENCODE peak profiles. The seven empty rows correspond to the TFs that lack ENCODE peak profiles. Note that only a small percentage of the composite elements that harbor the canonical GATA binding site are identified as enriched for GATA family transcription factors. This suggests that the ENCODE peak profiles can be conservative.

Figure C.19 — (a, b) ChIP sample read counts against control sample read counts for one replicate with K562-Yy1. Enrichment status is annotated by (a) the ENCODE peak profiles and (b) MBASIC prediction.

Table C.9.

Enriched cell type-TF combination for each cluster in the TF enrichment network analysis of Section 4.1 of the main text. TFs with estimated enrichment probability > 95% are listed for each cluster.

Cluster	# of Loci	Common TF	Gm12878 Specific	K562 Specific
1	34	Bcl3, Max, Sp1, Taf1, Zbtb33	Atf3, Ets1, Jund, Nrf1, Pol24h8, Sin3ak20, Tr4	Egr1, Pu1, Rad21, Usf2
2	317	Bcl3, Chd2, Max, Pol24h8, Sp1, Taf1	Atf3, Ets1, Nrf1, Sin3ak20, Usf2	Bclaf1, Egr1, Pu1, Smc3, Tbp, Zbtb33
3	490	Ets1, Pol2, Pol24h8, Sin3ak20, Taf1, Tbp, Usf1	Atf3, Chd2, Nrf1, Usf2	Bcl3, Bclaf1, Max, Sp1
4	555	Bcl3, Bclaf1, Chd2, Pol2, Pol24h8, Sp1, Taf1, Tbp	Atf3, Ets1, Nrf1, Sin3ak20, Six5, Smc3	Pu1
5	428	Bcl3, Chd2, Ets1, Pol24h8, Sp1, Taf1	Atf3, Ctcf, Nrf1, Rad21, Sin3ak20	Bclaf1, Pol2, Smc3, Tbp
6	729	Bcl3, Chd2, Ets1, Pol2, Pol24h8, Sin3ak20, Smc3, Sp1, Taf1	Atf3, Nrf1	Bclaf1, Egr1, Tbp
7	391			Bcl3, Bclaf1, Chd2, Egr1, Ets1, Smc3, Sp1
8	133	Ctcf	Atf3, Chd2, Rad21	Smc3, Srf
9	146	Ctcf		Bclaf1, Pol24h8, Smc3, Tbp
10	469	Pol2, Pol24h8, Taf1	Smc3	Bclaf1, Ets1, Sin3ak20, Tbp
11	440	Gabp, Pol24h8, Taf1	Ets1, Smc3	Pol2
12	184		Bcl3, Chd2, Ets1, Nrf1, Pol24h8, Taf1
13	277	Chd2, Pol2, Pol24h8, Sp1, Taf1, Tbp	Smc3	Cfos
14	156	Chd2, Pol2, Pol24h8, Tbp, Zbtb33	Ets1, Smc3, Taf1
15	412	Pol2, Pol24h8
16	327		Pol24h8
17	213	Usf1	Atf3, Usf2	Max
18	241	Six5	Ets1, Smc3
19	187	Chd2, Sp1		Cfos
20	222			Ets1
21	385			Pol24h8
22	343	Ctcf		Smc3
23	449		Nrf1, Smc3
24	1674

Table C.10.

Annotations for +9.5 Element-like loci in 5p1 (2Kb upstream of transcription start site (TSS)), 5p2 (2Kb to 10Kb upstream of TSS) and intronic regions.

Ref ID	Gene	Chr	Strand	Gene Start	Gene End	Region	Distance	Peak Start	Peak End	+9.5 Similarity
NM 001145662	GATA2	chr3	−	128198264	128206764	intron	4601	128202079	128202248	0.964
NM 001145661	GATA2	chr3	−	128198264	128207373	intron	5210	128202079	128202248	0.964
NM 032638	GATA2	chr3	−	128198264	128212030	intron	9867	128202079	128202248	0.964
NM 005225	E2F1	chr20	−	32263292	32274210	5p2	−3886	32278012	32278182	0.774
NM 001166	BIRC2	chr11	+	102217965	102249394	5p2	−5004	102212877	102213046	0.753
NM 203343	EPB41	chr1	+	29213602	29446558	intron	39968	29253487	29253655	0.74
NM 203342	EPB41	chr1	+	29213602	29446558	intron	39968	29253487	29253655	0.74
NM 001166007	EPB41	chr1	+	29213602	29446558	intron	39968	29253487	29253655	0.74
NM 004437	EPB41	chr1	+	29213602	29446558	intron	39968	29253487	29253655	0.74
NM 001166005	EPB41	chr1	+	29213602	29446558	intron	39968	29253487	29253655	0.74
NM 001166006	EPB41	chr1	+	29241087	29391731	intron	12483	29253487	29253655	0.74
NM 173485	TSHZ2	chr20	+	51588876	52103965	intron	203292	51792084	51792253	0.735
NM 007077	AP4S1	chr14	+	31494682	31555007	intron	13234	31507832	31508001	0.733
NM 001128126	AP4S1	chr14	+	31494682	31562634	intron	13234	31507832	31508001	0.733
NM 001430	EPAS1	chr2	+	46524540	46613842	intron	42757	46567214	46567382	0.728
NM 018119	POLR3E	chr16	+	22308740	22345341	intron	833	22309489	22309658	0.719
NM 181442	ADNP	chr20	−	49506882	49547527	intron	27423	49520020	49520189	0.718
NM 015339	ADNP	chr20	−	49506882	49547527	intron	27423	49520020	49520189	0.718
NM 020359	PLSCR2	chr3	−	146151081	146213722	5p1	−921	146214559	146214728	0.718
NM 006257	PRKCQ	chr10	−	6469104	6622238	intron	106706	6515449	6515617	0.713
NM 018309	TBC1D23	chr3	+	99979685	100044078	intron	28727	100008329	100008497	0.711
NM 020382	SETD8	chr12	+	123868703	123893898	intron	4170	123872789	123872958	0.71
NM 015385	SORBS1	chr10	−	97071530	97321171	intron	29191	97291896	97292065	0.709
NM 024991	SORBS1	chr10	−	97071530	97321171	intron	29191	97291896	97292065	0.709
NM 000440	PDE6A	chr5	−	149237519	149324356	intron	4584	149319688	149319857	0.699
NM 012091	ADAT1	chr16	−	75632997	75657154	intron	2033	75655038	75655206	0.693
NM 005033	EXOSC9	chr4	+	122722471	122738175	5p2	−6644	122715743	122715912	0.691
NM 001034194	EXOSC9	chr4	+	122722471	122738175	5p2	−6644	122715743	122715912	0.691
NM 004099	STOM	chr9	−	124101353	124132545	intron	388	124132073	124132243	0.689
NM 198194	STOM	chr9	−	124101356	124132545	intron	388	124132073	124132243	0.689
NM 014395	DAPP1	chr4	+	100737980	100791344	intron	25687	100763583	100763752	0.682
NM 181078	IL21R	chr16	+	27413722	27462115	intron	28911	27442549	27442718	0.681
NM 181079	IL21R	chr16	+	27414422	27462115	intron	28211	27442549	27442718	0.681
NM 021798	IL21R	chr16	+	27438578	27462115	intron	4055	27442549	27442718	0.681
NM 002492	NDUFB5	chr3	+	179322574	179342287	intron	10827	179333318	179333486	0.68
NM 021831	AGBL5	chr2	+	27274490	27293489	intron	11452	27285858	27286027	0.678
NM 020132	AGPAT3	chr21	+	45285115	45407474	5p2	−4281	45280751	45280919	0.677
NM 007356	LAMB4	chr7	−	107663995	107770801	intron	50003	107720714	107720883	0.677
NM 001010985	MYBPHL	chr1	−	109834986	109849663	5p1	−613	109850192	109850361	0.67
NM 006253	PRKAB1	chr12	+	120105760	120119428	5p2	−2184	120103492	120103661	0.668
NM 015226	CLEC16A	chr16	+	11038344	11276044	intron	27935	11066196	11066364	0.667
NM 020448	NIPAL3	chr1	+	24742244	24799472	intron	22892	24765053	24765221	0.663
NM 015560	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130832	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130831	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130834	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130837	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130836	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130835	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 130833	OPA1	chr3	+	193310932	193415599	intron	67680	193378528	193378697	0.659
NM 001004342	TRIM67	chr1	+	231298673	231357314	5p2	−2591	231295998	231296167	0.659
NM 020201	NT5M	chr17	+	17206679	17250975	intron	1325	17207920	17208089	0.658
NM 173054	RELN	chr7	−	103112232	103629963	intron	331913	103297966	103298135	0.657
NM 005045	RELN	chr7	−	103112232	103629963	intron	331913	103297966	103298135	0.657
NM 014206	C11orf10	chr11	−	61556602	61560085	5p2	−8290	61568291	61568461	0.657
NR 030342	MIR611	chr11	−	61559967	61560033	5p2	−8342	61568291	61568461	0.657
NM 173685	NSMCE2	chr8	+	126104082	126379367	intron	242235	126346233	126346402	0.655
NM 001127511	APC	chr5	+	112043217	112181935	5p2	−3243	112039890	112040060	0.653
NM 021926	ALX4	chr11	−	44282277	44331716	intron	39937	44291695	44291864	0.652
NM 016213	TRIP4	chr15	+	64680019	64747500	intron	42584	64722519	64722688	0.649
NM 007217	PDCD10	chr3	−	167401696	167452594	intron	10656	167441855	167442023	0.649
NM 145860	PDCD10	chr3	−	167401696	167452630	intron	10692	167441855	167442023	0.649
NM 145859	PDCD10	chr3	−	167401696	167452651	intron	10713	167441855	167442023	0.649
NM 203318	MYO18A	chr17	−	27400527	27507407	intron	17382	27489941	27490111	0.645
NM 078471	MYO18A	chr17	−	27400527	27507407	intron	17382	27489941	27490111	0.645
NM 001626	AKT2	chr19	−	40736224	40791265	intron	12426	40778755	40778924	0.643
NM 004767	GPR37L1	chr1	+	202092028	202098633	5p2	−4769	202087175	202087344	0.642
NM 015531	C2CD3	chr11	−	73745479	73882064	intron	84354	73797626	73797796	0.637
NM 002738	PRKCB	chr16	+	23847299	24231930	intron	166844	24014060	24014228	0.634
NM 212535	PRKCB	chr16	+	23847299	24231930	intron	166844	24014060	24014228	0.634
NM 004571	PKNOX1	chr21	+	44394642	44453688	intron	15907	44410465	44410634	0.632
NR 026749	SKINTL	chr1	−	48567386	48648100	intron	4923	48643093	48643262	0.629
NM 005560	LAMA5	chr20	−	60884122	60942368	intron	9836	60932449	60932617	0.626
NM 001080826	SGK223	chr8	−	8175258	8239257	intron	9672	8229501	8229670	0.622
NM 130465	TSPAN17	chr5	+	176074387	176086058	intron	1462	176075765	176075934	0.621
NM 012171	TSPAN17	chr5	+	176074387	176086058	intron	1462	176075765	176075934	0.621
NM 001006616	TSPAN17	chr5	+	176074387	176086058	intron	1462	176075765	176075934	0.621
NM 013326	C18orf8	chr18	+	21083461	21111742	5p2	−4363	21079015	21079183	0.616
NM 138371	FAM113B	chr12	+	47610051	47630441	5p2	−8921	47601047	47601215	0.616
NM 182498	ZNF428	chr19	−	44111376	44124014	intron	3381	44120549	44120718	0.615
NM 025179	PLXNA2	chr1	−	208195589	208417665	intron	207170	208210411	208210580	0.614
NM 020133	AGPAT4	chr6	−	161551056	161695107	intron	12103	161682920	161683089	0.611
NM 013427	ARHGAP6	chrX	−	11155662	11683821	intron	252809	11430929	11431097	0.609
NM 006125	ARHGAP6	chrX	−	11161516	11683821	intron	252809	11430929	11431097	0.609
NM 001669	ARSD	chrX	−	2822011	2847392	intron	6868	2840441	2840609	0.607
NM 009589	ARSD	chrX	−	2831654	2847392	intron	6868	2840441	2840609	0.607
NM 032359	C3orf26	chr3	+	99536677	99897476	intron	243457	99780050	99780220	0.605
NM 182909	FILIP1L	chr3	−	99551988	99833349	intron	53215	99780050	99780220	0.605
NM 001042459	FILIP1L	chr3	−	99566772	99833349	intron	53215	99780050	99780220	0.605
NM 194298	SLC16A9	chr10	−	61410521	61469649	5p2	−3726	61473291	61473460	0.603
NM 007356	LAMB4	chr7	−	107663995	107770801	intron	39404	107731314	107731482	0.594
NM 203456	PPIE	chr1	+	40204529	40229585	intron	18825	40223270	40223439	0.594
NM 152726	EFHA1	chr13	−	22066839	22178307	intron	81061	22097162	22097331	0.592
NM 001025107	ADAR	chr1	−	154554535	154600437	5p1	−1420	154601773	154601942	0.592
NM 001130966	TBXAS1	chr7	+	139478046	139720123	intron	75241	139553203	139553372	0.591
NM 001166254	TBXAS1	chr7	+	139478046	139720123	intron	75241	139553203	139553372	0.591
NM 030984	TBXAS1	chr7	+	139528951	139720123	intron	24336	139553203	139553372	0.591
NM 001166253	TBXAS1	chr7	+	139528951	139720123	intron	24336	139553203	139553372	0.591
NM 001061	TBXAS1	chr7	+	139528951	139720123	intron	24336	139553203	139553372	0.591
NR 029394	TBXAS1	chr7	+	139528951	139720123	intron	24336	139553203	139553372	0.591
NM 173542	PLBD2	chr12	+	113796370	113827458	intron	20911	113817197	113817366	0.591
NM 001159727	PLBD2	chr12	+	113796370	113827458	intron	20911	113817197	113817366	0.591
NM 138356	SHF	chr15	−	45459413	45493373	intron	31722	45461567	45461736	0.589
NM 021908	ST7	chr7	+	116593380	116863955	intron	92727	116686023	116686192	0.588
NM 018412	ST7	chr7	+	116593380	116870073	intron	92727	116686023	116686192	0.588
NM 017681	NUP62CL	chrX	−	106366657	106449670	intron	53243	106396343	106396512	0.587
NM 020845	PITPNM2	chr12	−	123468026	123594975	intron	73786	123521105	123521274	0.587
NM 001135054	SIGIRR	chr11	−	405715	414999	5p2	−7383	422299	422467	0.586
NM 021805	SIGIRR	chr11	−	405715	417397	5p2	−4985	422299	422467	0.586
NM 001135053	SIGIRR	chr11	−	405715	417397	5p2	−4985	422299	422467	0.586
NM 001012302	ANO9	chr11	−	417929	442011	intron	19629	422299	422467	0.586
NM 001098816	ODZ4	chr11	−	78364328	79151695	intron	773881	78377730	78377899	0.582
NM 178865	SERINC2	chr1	+	31885962	31907524	5p2	−2738	31883140	31883309	0.581
NM 004481	GALNT2	chr1	+	230202955	230417875	5p1	−672	230202200	230202368	0.579
NM 032427	MAML2	chr11	−	95711439	96076344	intron	19672	96056588	96056758	0.577
NM 021961	TEAD1	chr11	+	12695968	12966298	intron	202724	12898608	12898778	0.577
NM 016436	PHF20	chr20	+	34359922	34538288	intron	130764	34490602	34490771	0.573
NM 003128	SPTBN1	chr2	+	54683453	54898582	intron	117920	54801290	54801458	0.573
NM 178313	SPTBN1	chr2	+	54785530	54889444	intron	15843	54801290	54801458	0.573
NM 001037165	FOXK1	chr7	+	4721929	4811074	intron	30769	4752614	4752783	0.573
NM 005802	TOPORS	chr9	−	32540542	32552601	5p1	−1681	32554199	32554367	0.573
NM 182739	NDUFB6	chr9	−	32553522	32573182	intron	18900	32554199	32554367	0.573
NM 002493	NDUFB6	chr9	−	32553522	32573182	intron	18900	32554199	32554367	0.573
NM 004466	GPC5	chr13	+	92050934	93519485	intron	7890	92058740	92058909	0.569
NM 001145169	GPR113	chr2	−	26531040	26541970	intron	2083	26539803	26539972	0.563
NM 153835	GPR113	chr2	−	26531040	26569685	intron	29798	26539803	26539972	0.563
NM 001145168	GPR113	chr2	−	26532812	26541917	intron	2030	26539803	26539972	0.563
NM 000593	TAP1	chr6	−	32812986	32821748	5p2	−3584	32825248	32825417	0.549
NM 002800	PSMB9	chr6	+	32821937	32827626	intron	3395	32825248	32825417	0.549
NM 148954	PSMB9	chr6	+	32821937	32827626	intron	3395	32825248	32825417	0.549
NM 033104	STON2	chr14	−	81736910	81864927	intron	94218	81770625	81770794	0.546
NM 001001894	TTC3	chr21	+	38445570	38575406	intron	12717	38458203	38458372	0.544
NM 003316	TTC3	chr21	+	38455246	38575406	intron	3041	38458203	38458372	0.544
NM 000147	FUCA1	chr1	−	24171573	24194859	5p2	−4297	24199072	24199241	0.544
NM 016063	HDDC2	chr6	−	125596495	125623282	5p1	−844	125624043	125624211	0.544
NM 000404	GLB1	chr3	−	33038099	33138694	intron	89213	33049397	33049566	0.541
NM 001135602	GLB1	chr3	−	33038099	33138694	intron	89213	33049397	33049566	0.541
NM 001079811	GLB1	chr3	−	33038099	33138314	intron	88833	33049397	33049566	0.541
NM 017803	DUS2L	chr16	+	68057203	68113183	intron	17562	68074682	68074850	0.54
NM 001101417	ISPD	chr7	−	16127151	16460947	intron	160970	16299893	16300063	0.532
NM 001101426	ISPD	chr7	−	16127151	16460947	intron	160970	16299893	16300063	0.532
NM 002736	PRKAR2B	chr7	+	106685177	106802255	intron	31362	106716455	106716624	0.532
NR 024448	LOC91316	chr22	−	23980676	24059610	intron	32642	24026884	24027054	0.529
NM 153615	RGL4	chr22	+	24033047	24041358	5p2	−6079	24026884	24027054	0.529
NM 033631	LUZP1	chr1	−	23410515	23495351	intron	53595	23441673	23441841	0.526
NM 001142546	LUZP1	chr1	−	23410515	23495351	intron	53595	23441673	23441841	0.526
NM 001134492	HS2ST1	chr1	+	87380334	87564124	intron	77077	87457327	87457496	0.52
NM 012262	HS2ST1	chr1	+	87380334	87575680	intron	77077	87457327	87457496	0.52
NM 001085481	MAP1LC3B2	chr12	+	116997185	117014425	intron	2387	116999488	116999657	0.499
NM 079834	SCAMP4	chr19	+	1905372	1926011	intron	2224	1907513	1907681	0.487
NM 138422	ADAT3	chr19	+	1905416	1913443	intron	2180	1907513	1907681	0.487
NM 032932	RAB11FIP4	chr17	+	29718641	29865232	intron	33750	29752307	29752476	0.48
NM 033129	SCRT2	chr20	−	642240	656823	5p2	−8983	665723	665891	0.47
NM 012079	DGAT1	chr8	−	145538246	145550567	intron	1560	145548924	145549092	0.443

Footnotes

This work was supported by National Institutes of Health Grants (HG006716 and HG007019) to S.K.

The normalized signal for unit i and condition k is:

{\tilde{θ}}_{i k} = \frac{\prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ {\hat{μ}}_{k l 2}, {\hat{σ}}_{k l 2}, γ_{ikl 1})}{\prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ {\hat{μ}}_{k l 2}, {\hat{σ}}_{k l 2}) + \prod_{l = 1}^{n_{k}} f_{s} (y_{ikl} ∣ {\hat{μ}}_{k l 1}, {\hat{σ}}_{k l 1}, γ_{ikl 2})}

The pseudo-binary similarity between two units i₁ and i₂ is calculated as $s(i_{1}, i_{2}) = \frac{\sum_{k} P(θ_{i_{1} k} = 1 ∣ Y) P(θ_{i_{2} k} = 1 ∣ Y)}{\sum_{k} [P(θ_{i_{1} k} = 1 ∣ Y) + P(θ_{i_{2} k} = 1 ∣ Y) - P(θ_{i_{1} k} = 1 ∣ Y) P(θ_{i_{2} k} = 1 ∣ Y)]}$.
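
A minimal Python sketch of this similarity is given below. It reads the sum in the denominator as applying to the whole expression, which makes the measure a probabilistic analogue of the Jaccard index. The function name and the handling of a zero denominator are assumptions made for illustration.

```python
def pseudo_binary_similarity(p1, p2):
    """Similarity between two units from their per-condition posterior probabilities.

    p1[k] and p2[k] stand for P(theta_{i1,k} = 1 | Y) and P(theta_{i2,k} = 1 | Y).
    """
    if len(p1) != len(p2):
        raise ValueError("both units need one probability per condition")
    # Expected number of conditions in which both units have state 1.
    shared = sum(a * b for a, b in zip(p1, p2))
    # Expected number of conditions in which at least one unit has state 1.
    either = sum(a + b - a * b for a, b in zip(p1, p2))
    if either == 0:
        return 0.0  # assumption: no probability mass on state 1 gives similarity 0
    return shared / either
```

With hard 0/1 probabilities the function reduces to the ordinary Jaccard index: identical profiles give 1 and profiles with no shared condition give 0.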

When the actual J and its estimate Ĵ are different, MSE-W is redefined as:

$$\text{MSE-W} = \left[\frac{\sum_{1 \leq j \leq J, k, s} (w_{jks} - \hat{w}_{c_{1}(j) k s})^{2} + \sum_{1 \leq j \leq \hat{J}, k, s} (\hat{w}_{jks} - w_{c_{2}(j) k s})^{2}}{K S (J + \hat{J})}\right]^{\frac{1}{2}},$$

where

$$c_{1}(j) = \arg\min_{j' \leq \hat{J}} \sum_{k, s} (w_{jks} - \hat{w}_{j' k s})^{2}, \quad c_{2}(j) = \arg\min_{j' \leq J} \sum_{k, s} (\hat{w}_{jks} - w_{j' k s})^{2}.$$
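
The redefined MSE-W can be sketched in Python as follows, assuming each cluster profile is stored as a flat list of its K·S entries w_jks and that c₁ and c₂ pick the closest profile in squared Euclidean distance. The function and variable names are hypothetical.

```python
import math


def squared_distance(u, v):
    """Sum over (k, s) of squared differences between two cluster profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mse_w(w_true, w_hat):
    """MSE-W when the true and estimated numbers of clusters may differ.

    w_true: J true profiles, w_hat: J_hat estimated profiles;
    each profile is a flat list with K * S entries.
    """
    ks = len(w_true[0])
    # Each true cluster j is compared with its closest estimated cluster c1(j).
    true_to_hat = sum(min(squared_distance(w, wh) for wh in w_hat) for w in w_true)
    # Each estimated cluster j is compared with its closest true cluster c2(j).
    hat_to_true = sum(min(squared_distance(wh, w) for w in w_true) for wh in w_hat)
    return math.sqrt((true_to_hat + hat_to_true) / (ks * (len(w_true) + len(w_hat))))


# Toy example: two true clusters, but only one estimated cluster.
w_true = [[0.1, 0.9], [0.9, 0.1]]
w_hat = [[0.1, 0.9]]
value = mse_w(w_true, w_hat)
```

Matching in both directions means that a missing cluster and a spurious extra cluster both increase the error, so the measure stays meaningful when Ĵ differs from J.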

References

Anandapadamanaban M, Andresen C, Helander S, Ohyama Y, Siponen MI, Lundstrm P, Kokubo T, Ikura M, Moche M, Sunnerhagen M. High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation - Nat Struct Mol Biol. 2013;20:1008–1014. doi: 10.1038/nsmb.2611. [DOI] [PMC free article] [PubMed] [Google Scholar]
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng C, Yan K-K, Hwang W, Qian J, Bhardwaj N, Rozowsky J, Lu ZJ, Niu W, Alves P, Kato M, Snyder M, Gerstein M. Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data. PLoS Computational Biology. 2011:7. doi: 10.1371/journal.pcbi.1002190. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 1977;39(1):1–38. [Google Scholar]
Doré LC, Chlon TM, Brown CD, White KP, Crispino JD. Chromatin occupancy analysis reveals genome-wide GATA factor switching during hematopoiesis. Blood. 2012;119(16):3724–3733. doi: 10.1182/blood-2011-09-380634. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association. 2002;97:611–631. [Google Scholar]
Gao X, Johnson KD, Chang YI, Boyer ME, Dewey CN, Zhang J, Bresnick EH. Gata2 cis-element is required for hematopoietic stem cell generation in the mammalian embryo. Journal of Experimental Medicine. 2013;210(13):2833–42. doi: 10.1084/jem.20130733. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, Lacroute P, Leng J, Lian J, Monahan H, O/’Geen H, Ouyang Z, Partridge EC, Patacsil D, Pauli F, Raha D, Ramirez L, Reddy TE, Reed B, Shi M, Slifer T, Wang J, Wu L, Yang X, Yip KY, Zilberman-Schapira G, Batzoglou S, Sidow A, Farnham PJ, Myers RM, Weissman SM, Snyder M. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100. doi: 10.1038/nature11245. [DOI] [PMC free article] [PubMed] [Google Scholar]
Holley DW, Groh BS, Wozniak G, Donohoe DR, Sun W, Godfrey V, Bultman SJ. The BRG1 Chromatin Remodeler Regulates Widespread Changes in Gene Expression and Cell Proliferation During B Cell Activation. Journal of Cellular Physiology. 2014;229(1):44–52. doi: 10.1002/jcp.24414. [DOI] [PubMed] [Google Scholar]
Hsu AP, Johnson KD, Falcone EL, Sanalkumar R, Sanchez L, Hickstein DD, Cuellar-Rodriguez J, Lemieux JE, Zerbe CS, Bresnick EH, Holland SM. GATA2 haploinsufficiency caused by mutations in a conserved intronic element leads to MonoMAC syndrome. Blood. 2013;121(19):3830–3837. doi: 10.1182/blood-2012-08-452763. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hu G, Schones DE, Cui K, Ybarra R, Northrup D, Tang Q, Gattinoni L, Restifo NP, Huang S, Zhao K. Regulation of nucleosome landscape and transcription factor targeting at tissue-specific enhancers by BRG1. Genome Research. 2011;21(10):1650–1658. doi: 10.1101/gr.121145.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ji H, Li X, Wang Q, Ning Y. Differential principle component analysis of ChIP-seq. PNAS. 2013;110:6789–6794. doi: 10.1073/pnas.1204398110. [DOI] [PMC free article] [PubMed] [Google Scholar]
Johnson KD, Hsu A, RMJ, Boyer ME, Keleş S, Zhang J, Lee Y, Holland SM, Bresnick EH. Cis-element mutation in a GATA-2-dependent immunodeficiency syndrome governs hematopoiesis and vascular integrity. Journal of Clinical Investigation. 2012;10(122):36923704. doi: 10.1172/JCI61623. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim SI, Bresnick EH, Bultman SJ. BRG1 directly regulates nucleosome structure and chromatin looping of the a globin locus to activate transcription. Nucleic Acids Research. 2009a;37(18):6019–6027. doi: 10.1093/nar/gkp677. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim SI, Bultman SJ, Kiefer CM, Dean A, Bresnick EH. BRG1 requirement for long-range interaction of a locus control region with a downstream promoter. Proceedings of the National Academy of Sciences. 2009b;106(7):2259–2264. doi: 10.1073/pnas.0806420106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kuan PF, Chung D, Pan G, Thomson JA, Stewart R, Keleş S. A statistical framework for the analysis of ChIP-seq data. Journal of the American Statistical Association. 2011;106(495):891–903. doi: 10.1198/jasa.2011.ap09706. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kunarso G, Chia N, Jeyakani J, Hwang C, Lu X, Chan Y, Ng H, Bourque G. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nature Genetics. 2010;42:631–634. doi: 10.1038/ng.600. [DOI] [PubMed] [Google Scholar]
Lee S, Huang J, Hu J. Sparse logistic principal components analysis for binary data. The Annals of Applied Statistics. 2010;4:1579–1601. doi: 10.1214/10-AOAS327SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liang K, Keleş S. Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics. 2012;28:121–122. doi: 10.1093/bioinformatics/btr605. [DOI] [PMC free article] [PubMed] [Google Scholar]
Linneman AK, O’Geen H, Keleş S, Farnham PJ, Bresnick EH. Genetic framework for GATA factor function in vascular biology. Proceedings of the National Academy of Sciences. 2011;108(33):13641–13646. doi: 10.1073/pnas.1108440108. [DOI] [PMC free article] [PubMed] [Google Scholar]
Neph1 S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012;150:12741286. doi: 10.1016/j.cell.2012.04.040. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rand W. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850. [Google Scholar]
Roy S, Wapinski I, Pfiffner J, French C, Socha A, Konieczka J, Habib N, Kellis M, Thompson D, Regev A. Arboretum: reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules. Genome Research. 2013;23:1039–1050. doi: 10.1101/gr.146233.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schmidt D, Wilson M, Ballester B, Schwalie P, Brown G, Marshall A, Kutter C, Watt S, Martinez-Jimenez C, Mackay S, Talianidis I, Flicek P, Odom D. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010;328:1036–1040. doi: 10.1126/science.1186176. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004:3. doi: 10.2202/1544-6115.1027. [DOI] [PubMed] [Google Scholar]
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
Waltman P, Kacmarczyk T, Bate A, Kearns D, Reiss D, Eichenberger P, Bonneau R. Multi-species integrative biclustering. Genome Biology. 2010;11:R96. doi: 10.1186/gb-2010-11-9-r96.
Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research. 2012;22:1798–1812. doi: 10.1101/gr.139105.112.
Wei Y, Li X, Wang QF, Ji H. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics. 2012;13. doi: 10.1186/1471-2164-13-681.
Wei Y, Tenzen T, Ji H. Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics. 2015;16:31–46. doi: 10.1093/biostatistics/kxu038.
Zeng X, Sanalkumar R, Bresnick EH, Li H, Chang Q, Keleş S. jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biology. 2013;14. doi: 10.1186/gb-2013-14-4-r38.
Zuo C, Keleş S. A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics. 2013. doi: 10.1093/bioinformatics/btt200.
