Abstract
Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume that one combination of histone modifications invariantly translates to one transcriptional output regardless of the local chromatin environment. In this study we hypothesize that the genome is organized into local domains that manifest similar enrichment patterns of histone modifications, which leads to orchestrated regulation of expression of genes with relevant biological functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment patterns of histone marks. We further characterize BLOCKs by their transcription levels, distribution of genes, degree of co-regulation and GO enrichment. Our results demonstrate that these BLOCKs, although inferred merely from histone modifications, correspond closely to physical domains, which suggests important roles in chromatin organization and coordinated gene regulation.
Keywords: Bayesian change point model, Histone modification, chromosomal domain
1. Introduction
Epigenetics refers to the study of heritable changes affecting gene expression and other phenotypes that occur without a change in DNA sequence. Epigenetic mechanisms, including chromatin remodeling, histone modification, DNA methylation and binding of non-histone proteins, provide a fundamental level of transcriptional control. Extensive studies on histone modifications have led to the “histone code” hypothesis that histone modifications do not occur in isolation but rather in a combinatorial manner to provide “ON” or “OFF” signature for transcriptional events (Allis, 2007).
Genome-wide studies using high-throughput technologies such as chromatin immunoprecipitation (ChIP) followed by microarray analysis (ChIP on chip) or deep sequencing (ChIP-seq) have begun to decipher the “histone code” at the genome-wide scale. Currently, a common approach to assess chromatin states using these data is a multivariate Hidden Markov Model (HMM) introduced by Ernst and Kellis (2010), which has been used in several modENCODE and ENCODE project publications (modENCODE Consortium 2010, Kharchenko et al. 2011, Riddle et al. 2011, Eaton et al. 2011). This model associates each 200bp genomic window with a particular state, generating a chromatin-centric annotation. However, a pre-defined number of states needs to be specified in HMMs and it is difficult to justify and interpret a particular choice. Different studies trying to balance resolution and interpretability based on different criteria often led to different numbers of states, both between different organisms (Ernst and Kellis 2010, modENCODE Consortium 2010) and within the same organism (Filion et al. 2010, modENCODE Consortium 2010). Moreover, HMM summarizes chromatin information by a vector of “emission” probabilities associated with each chromatin state and a vector of “transition” probabilities with which different chromatin states occur in spatial relationship of each other (Ernst and Kellis 2010). These settings assume the homogeneity of hidden states and their transitions across the genome. However, since histone modifications are outcomes of interplay with local environment, the assumption of spatial homogeneity may not hold at the genome level.
To address the limitations in the HMM-based approaches, we propose an alternative approach to examining combinatorial histone marks at coarse scales. We hypothesize that the genome is organized into local blocks that display regionalized histone signatures. Those blocks may have important roles in orchestrated regulation of expression of genes with relevant biological functions. We note that our approach does not require a pre-defined number of possible states and it identifies local patterns without the assumption on spatial homogeneity.
To computationally infer these blocks, we propose a multivariate Bayesian Change Point (BCP) model which is capable of incorporating both local and global information. The BCP model was first proposed by Barry and Hartigan (1992, 1993) to describe a process where the observations can be considered to arise from a series of contiguous blocks, with distributional parameters that differ across blocks. One of the inferential goals is to identify the change points separating contiguous blocks. By “assuming probability of any partition is proportional to a product of prior cohesions, one for each block in the partition, and that given the blocks the parameters in different blocks have independent prior distributions” (Barry and Hartigan 1992, 1993), a fully Bayesian approach can be adopted to detect change points from a sequence of observations. Barry and Hartigan (1992) considered in detail the case where the observations X1, …, Xn are independent and normally distributed given the sequence of parameters μl with Xi ∼ N(μl, σ²), where the observations from the same block l have the same μl. This method has been used by Erdman and Emerson (2008) to segment microarray data. However, this model cannot be directly applied to infer histone modification blocks because observed modification data do not follow normal distributions. This is because histone modifications are usually observed at a small proportion of genomic locations, with the signal over the rest of the genome being at or near zero (Figure S1). To accommodate these features, we present a multivariate BCP model that introduces a zero-inflated Gaussian mixture distribution to partition the genome into blocks, where each block is relatively homogeneous with respect to histone marks.
1.1. Outline of the Paper
The paper is organized as follows. In Section 2, we present the methodological details of the BCP model with a mixture prior and an MCMC algorithm to infer the posterior probability. Section 3 presents results from simulation studies. In Section 4, we describe a change point analysis of the D. melanogaster genome with multiple histone marks using S2 cell data from the modENCODE project. The identified chromosomal blocks are referred to as BLOCKs in the rest of this article. We then present two sets of exploratory analyses: Section 4.2 on the relationship of BLOCKs with physical domains and Section 4.3 on the functional relevance of BLOCKs. In Section 4.4, we compare our results with HMM. We conclude the paper with a summary and discussion in Section 5.
1.2. Notations
We denote the density function of N(μ, σ²) by ϕ(·|μ, σ), and denote the density function of Beta(a, b) by ψ(·|a, b). The Dirac measure δ denotes the point mass at 0. For a set S, #S is the cardinality of S. For a random variable X, {X = 1} is the indicator function taking value 1 if X = 1 and taking value 0 if X ≠ 1. The indicator function {X = 0} is defined in the same way. The set {i + 1, i + 2, …, j} with integers i < j is denoted by (i : j). The function f(·|·) is a generic notation for a conditional density when the distribution is clear from the context.
2. Method
2.1. A BCP model for block identification
The observed data form an M × n matrix whose rows are X1, …, XM, where each Xm for m = 1, …, M is the signal of one modification mark, with length n. We first describe the likelihood of each Xm and then combine them. For notational simplicity, we suppress the subscript and write X instead of Xm.
Let X = (X1, …, Xn) be a vector with length n. Create another vector Z = (Z1, …, Zn) to indicate whether each Xh is zero or not. That is, Zh = 0 if Xh = 0, and Zh = 1 if Xh ≠ 0. Note Z is fully determined by X.
For the index set {1, …, n}, let ρ be a partition of this set. That is, ρ = {S1, …, SN}, with S1 ∪ … ∪ SN = {1, …, n} and Sl1 ∩ Sl2 = ∅ for all l1 ≠ l2. N is the number of blocks of {1, …, n}. For the change-point problem, each Sl is a contiguous subset of {1, …, n}. That is, Sl = (i : j) = {i + 1, …, j} for some i < j.
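To make the (i : j) convention concrete, a contiguous partition can be stored as its sorted list of block end points. The following is a minimal Python sketch; the function names `blocks_from_endpoints` and `members` are ours and purely illustrative.

```python
def blocks_from_endpoints(endpoints, n):
    """Turn sorted block end points j_1 < ... < j_N = n into blocks (i : j).

    Each block (i : j) stands for the index set {i + 1, ..., j} and is
    returned as the pair (i, j).
    """
    if not endpoints or endpoints[-1] != n:
        raise ValueError("the last block must end at n")
    blocks, start = [], 0
    for end in endpoints:
        if end <= start:
            raise ValueError("end points must be strictly increasing")
        blocks.append((start, end))
        start = end
    return blocks


def members(block):
    """Indices {i + 1, ..., j} covered by the block (i : j)."""
    i, j = block
    return list(range(i + 1, j + 1))
```

For example, end points 3, 5 and 10 with n = 10 give the blocks (0 : 3), (3 : 5) and (5 : 10), which together cover {1, …, 10} without overlap.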
2.1.1. Likelihood
Given the partition ρ = {S1, …, SN}, Xk follows a mixture distribution Xk ∼ (1 − λl)N(μl, σ²) + λlδ, for k ∈ Sl and each l = 1, …, N. The parameter μl is block-specific, while σ is shared among blocks. The parameter λl is the probability that Xk is exactly zero, and it varies across blocks. Thus, given (ρ, μ1, …, μN, λ1, …, λN, σ), the likelihood of (X, Z) is fully specified. That is,
(2.1) |
where for each l with Sl = {i + 1, …, j},
(2.2) |
(2.3) |
where and .
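The mixture stated above determines the block likelihood: within a block, each zero contributes a factor λl and each nonzero value contributes (1 − λl)ϕ(Xk|μl, σ). The sketch below evaluates this on the log scale; it is our own illustration of the stated mixture (function names are hypothetical) and assumes 0 < λl < 1.

```python
import math


def block_log_likelihood(x, mu, lam, sigma):
    """Log-likelihood of one block under (1 - lam) * N(mu, sigma^2) + lam * delta_0.

    Assumes 0 < lam < 1. Zeros contribute log(lam); nonzero values
    contribute log(1 - lam) plus the Gaussian log-density.
    """
    n_zero = sum(1 for v in x if v == 0)
    n_nonzero = len(x) - n_zero
    ll = n_zero * math.log(lam) + n_nonzero * math.log(1.0 - lam)
    for v in x:
        if v != 0:
            ll += -0.5 * math.log(2.0 * math.pi * sigma ** 2)
            ll -= (v - mu) ** 2 / (2.0 * sigma ** 2)
    return ll


def log_likelihood(x, blocks, mus, lams, sigma):
    """Sum of block log-likelihoods for blocks given as (i, j) pairs.

    With 0-based Python lists, the slice x[i:j] holds exactly the
    observations X_{i+1}, ..., X_j of the block (i : j).
    """
    return sum(
        block_log_likelihood(x[i:j], mu, lam, sigma)
        for (i, j), mu, lam in zip(blocks, mus, lams)
    )
```

Because blocks are conditionally independent given the partition and the parameters, the full log-likelihood is simply the sum over blocks.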
2.1.2. Prior
We proceed to specify the prior distribution on the parameters (ρ, μ1, …, μN, λ1, …, λN, σ).
(2.4) |
(2.5) |
(2.6) |
The prior (2.4) on the partition ρ is called the product partition model, which was originally described in Barry and Hartigan (1993). The quantity c(Sl) is called the cohesion. In this paper, c(Sl) is defined to be c(i:j) = (1 − p)^(j−i−1) p when j < n and c(i:j) = (1 − p)^(j−i−1) when j = n, where 0 ≤ p ≤ 1 and Sl = {i + 1, …, j} as mentioned before. This specification implies that the sequence of change points forms a discrete renewal process with inter-arrival times that are identically geometrically distributed. The geometric distribution is memoryless. For histone mark data, this means we assume that the probability of a genomic position (bin) being a BLOCK boundary is roughly constant along the genome. Note that the cohesion prior is a nonparametric prior over all 2^(n−1) possible partitions of n data points; thus the number of blocks does not need to be specified and can be inferred from the data. The priors (2.5) and (2.6) are conjugate priors with respect to the likelihood. The prior on the variance σ² will be jointly specified with the hyper-parameters.
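As a sanity check on the cohesion prior, the sketch below assumes the Barry and Hartigan form, c(i:j) = (1 − p)^(j−i−1) p for j < n and (1 − p)^(j−i−1) for j = n, and enumerates all contiguous partitions of a short sequence. Under that assumption the product of cohesions equals p^(N−1) (1 − p)^(n−N), and the probabilities sum to one over the 2^(n−1) partitions. Function names are ours.

```python
from itertools import product


def partition_prior(block_lengths, p):
    """Product of cohesions for consecutive blocks with the given lengths.

    The final block ends at n, so it carries no change-point factor p.
    """
    prob = 1.0
    last = len(block_lengths) - 1
    for idx, length in enumerate(block_lengths):
        prob *= (1.0 - p) ** (length - 1)
        if idx != last:  # every block except the final one ends in a change point
            prob *= p
    return prob


def all_partitions(n):
    """Yield the block lengths of all 2^(n-1) contiguous partitions of n points."""
    for cuts in product([0, 1], repeat=n - 1):
        lengths, run = [], 1
        for cut in cuts:
            if cut:
                lengths.append(run)
                run = 1
            else:
                run += 1
        lengths.append(run)
        yield lengths
```

Exhaustive enumeration is only feasible for very small n; it is shown here solely to illustrate why the number of blocks need not be fixed in advance.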
To pursue a fully Bayesian approach, we put priors on the hyper-parameters (p, μ0, σ0) in (2.4) and (2.5). Define . We jointly specify the priors on the hyper-parameters together with the prior on σ2.
(2.7) |
(2.8) |
(2.9) |
(2.10) |
The priors (2.7), (2.9) and (2.10) are uniform priors, reflecting our lack of prior knowledge. The prior (2.8) can be viewed as a uniform distribution on the logarithmic scale. Note that (2.7) and (2.8) are improper priors. This does not cause a problem for the sampling procedure described later.
2.1.3. Posterior
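Our goal here is to find the posterior distribution of the partition, which is f(ρ|X, Z). According to Bayes' formula,
$$f(\rho \mid X, Z) = \frac{f(X, Z \mid \rho)\, f(\rho)}{\sum_{\rho'} f(X, Z \mid \rho')\, f(\rho')} \tag{2.11}$$
Since the denominator of (2.11) is a sum over all possible partitions and is complicated to evaluate, we use MCMC to sample from the posterior through
$$f(\rho \mid X, Z) \propto f(X, Z \mid \rho)\, f(\rho) \tag{2.12}$$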
The conditional density f(X, Z|ρ) is obtained by integrating the likelihood function (2.1) against the priors on (μ1, …, μN, λ1, …, λN, σ) specified in (2.5), (2.6), (2.7), (2.8) and (2.9). The prior f(ρ) is obtained by integrating f(ρ|p) specified in (2.4) with respect to (2.10). We first find f(ρ).
(2.13) |
We then find f(X, Z|ρ). We first integrate out (μ1, …, μN, λ1, …, λN) in (2.1) using (2.5) and (2.6). Recall that ψ(·|a, b) is the density of Beta(a, b). Using (2.3) as the representation of (2.1), we have
(2.14) |
where
(2.15) |
Next, we integrate out (μ0, w, σ) in (2.14) using priors (2.7), (2.8) and (2.9).
(2.16) |
(2.17) |
(2.18) |
To model multiple histone marks, we assume that X1, …, XM are independent vectors given the same block structure ρ. As calculated in (2.18), for each m,
(2.19) |
where am, bm, Wm, Bm, Tm and Am are the values of a, b, W, B, T and A defined above for the m-th sequence. Zm is the indicator vector determined by Xm, and Zk,m is the k-th element of Zm. Combining (2.13) and (2.19), we have
(2.20) |
Although an exact implementation of this model is tractable, the calculations are O(n³), which makes evaluating the posterior probability prohibitive when n is large. We have therefore implemented an MCMC approximation that greatly facilitates the estimation.
2.2. MCMC algorithm for BCP model inference
Following Barry and Hartigan (1993), for a partition ρ induced by U = (U1, …, Un), where Ui = 1 indicates a change point at position i + 1, the odds ratio for the conditional probability of a change point at the position i + 1 is:
where , , and are the within and between block sums of squares obtained for the m-th sequence when Ui = 0 and Ui = 1 respectively, and is the values of (2.15) obtained for the m-th sequence when Ui = 0 and Ui = 1 respectively. The result is a direct consequence of (2.20).
We then approximate these integrals by incomplete beta function as:
We initialize Ui to 0 for all i < n, with Un = 1. We then update each Ui in turn, conditional on the current values of the others, in repeated passes through the data. We used 500 passes for block identification.
3. Simulation studies
We first used simulated data to study the performance of the proposed method. The simulation assumed that there were 10 blocks and that six histone modification marks were observed at each of 2000 locations in the genome. The lengths of the 10 blocks ranged from 10 to 1500 (in simulation 1 shown in Figure 1, the lengths are 152, 10, 102, 416, 27, 799, 217, 22, 206 and 49). We use X(i:j),m to denote the observed signal within a block from the (i + 1)-th to the j-th location for the m-th mark. We assumed that each component of X(i:j),m followed a mixture distribution 0.2 ∗ N(μ(i:j),m, 1) + 0.8 ∗ δ, where μ(i:j),m was a random draw from U(−2, 2). These settings are based on the empirical observation that, for a specific histone mark, on average ∼20% of the genome displays binding peaks, with intensities ranging from −2 to 2 for the normalized data. To apply our method, we need to specify the values of the hyper-parameters p0, w0, am and bm. In the simulation, we investigated the sensitivity of the results to these values by considering p0 ∈ {0.1, 0.2, 0.3, 0.4}, w0 ∈ {0.1, 0.2, 0.3, 0.4}, and (am, bm) ∈ {(1, 1), (2, 2), (0.5, 0.5)}, giving a total of 4 × 4 × 3 = 48 specifications for (p0, w0, am, bm). We simulated 20 data sets. For each simulated data set, we ran 48 MCMC chains, each using one of the 48 hyper-parameter specifications. Change points were inferred to be those locations that had a posterior probability larger than 0.8 (the results were similar under different cutoff values).
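The simulation design described above can be sketched in a few lines of Python. This is an illustration under the stated settings (block lengths of simulation 1, a 0.2/0.8 mixture, block means drawn from U(−2, 2), unit variance); the function names and the seed are ours, and the sketch assumes that each mark draws its own block means.

```python
import random

# Block lengths of simulation 1 as listed in the text; they sum to 2000.
SIM1_LENGTHS = [152, 10, 102, 416, 27, 799, 217, 22, 206, 49]


def simulate_mark(block_lengths, rng, nonzero_prob=0.2, mu_range=(-2.0, 2.0)):
    """Simulate one mark: each block has its own mean, and each position is
    N(mu, 1) with probability nonzero_prob and exactly 0 otherwise."""
    signal = []
    for length in block_lengths:
        mu = rng.uniform(*mu_range)
        for _ in range(length):
            if rng.random() < nonzero_prob:
                signal.append(rng.gauss(mu, 1.0))
            else:
                signal.append(0.0)
    return signal


def simulate_dataset(block_lengths, n_marks=6, seed=0):
    """Simulate n_marks independent marks sharing the same block structure."""
    rng = random.Random(seed)
    return [simulate_mark(block_lengths, rng) for _ in range(n_marks)]
```

Calling `simulate_dataset(SIM1_LENGTHS)` returns six tracks of length 2000 in which roughly 80% of the entries are exactly zero.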
We then computed the precision and recall rates based on the true and inferred change points from the simulated data. The precision rate is defined as TP/(TP + FP), and the recall rate as TP/(TP + FN), where TP is the number of true positives (predicted block boundaries that are true), FP is the number of false positives (predicted boundaries that are not true), and FN is the number of false negatives (undiscovered true block boundaries). In our assessment, an inferred change point within 3 units of a true change point was considered a true positive. As shown in Figure 1B, the overall posteriors are insensitive to the specified values of the hyperparameters p0, am, bm. The simulation studies also show that the proposed method is capable of identifying large blocks spanning over 1000 positions as well as small blocks of size around 10 (Figure 1). Moreover, the ability to identify zero-inflated blocks is substantially improved by the introduction of the mixture prior (Figure 1).
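The tolerance-based precision and recall can be computed as in the following sketch. The text does not say how to count cases where several predictions fall near the same true change point; here every such prediction counts as a true positive, which is our assumption, as is the function name.

```python
def precision_recall(true_cps, predicted_cps, tol=3):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) with a matching tolerance.

    A predicted change point is a true positive if it lies within `tol`
    units of any true change point; a true change point is a false
    negative if no prediction lies within `tol` units of it.
    """
    tp = sum(
        1 for p in predicted_cps if any(abs(p - t) <= tol for t in true_cps)
    )
    fp = len(predicted_cps) - tp
    fn = sum(
        1 for t in true_cps if not any(abs(p - t) <= tol for p in predicted_cps)
    )
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For instance, with true change points at 152, 162 and 264 and predictions at 150, 163 and 500, two predictions are matched, one is spurious and one true change point is missed, so both precision and recall equal 2/3.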
4. Application to modENCODE epigenome Data
All data used in this analysis were generated by the modENCODE project (Table 1). Specifically, we used pre-processed enrichment scores of 18 histone marks in S2 cells from the study “Genomic Distributions of Histone Modifications”; the S2 cell transcriptome data came from the study “Paired End RNA-Seq of Drosophila Cell Lines”; the transcriptome data for 9 different developmental stages were drawn from the study “Developmental Stage Timecourse Transcriptional Profiling with RNA-Seq”. To identify and characterize blocks from histone marks, we divided the Drosophila melanogaster genome into 1000-bp bins and calculated the enrichment level for each bin by averaging the log2 intensity values of each mark. The transcription level (in S2 cells and at different developmental stages) was calculated by averaging read counts from replicates.
Table 1.
modENCODE Experiment | Method | Cell Line or Tissue Type | Sample |
---|---|---|---|
Genomic Distributions of Histone Modifications | ChIP-chip/ChIP-seq | S2-DRSC, ML-DmBG3-c2 | H3K18ac, H3K23ac, H3K27Ac, H3K27Me3, H3K36me1, H3K36me3, H3K4Me3, H3K4me1, H3K4me2, H3K79Me2, H3K79Me1, H3K9ac, H3K9me2, H3K9me3, H4AcTetra, H4K16ac, H4K5ac, H4K8ac |
Transcriptional profiling of Drosophila cell lines | RNA-seq | S2-DRSC | |
Developmental Stage Timecourse Transcriptional Profiling | RNA-seq | Embryo 10–12h, White Pre-pupae 24h, Larvae L1, Adult Female Eclosure 1d | |
4.1. Identification of chromatin blocks based on histone modifications
We applied the proposed method to 18 histone methylation and acetylation marks in S2 cells. Change points with posterior probability greater than 0.75 were defined as block boundaries. Because chromosome X is distinguished by the high level enrichment of H4K16ac and H3K36me3 from other chromosomes (Kharchenko et al. 2011), we applied our model to autosomes only.
A total of 994 blocks (referred to as BLOCKs; Table S1) were inferred from chromosomes 2L, 2R, 3L and 3R, with 90% of them ranging in size from 21kb to 247kb and a median size of 70kb. We observed that BLOCKs captured the combinatorial pattern of histone modifications and reflected local transcriptional activities. We use chr2L:4142–5520kb as an example to illustrate this (Figure 2). For simplicity, we only show the enrichment levels of several chromatin signatures, including the transcription activation marks H3K4me3 and H3K9ac, and the transcription repression marks H3K9me3 and H3K27me3 (see Figure 4 for an example of all marks). PolII enrichment and RNA-seq counts on the log10 scale are shown as a reference for transcriptional activity. Compared with the “chromatin states” annotation for non-overlapping 200bp windows in the genome (Kharchenko et al. 2011) (Figure 2C), BLOCKs depict the genome as local domains at a larger scale. We divided BLOCKs into five groups by size quantile (≤ 5%, 6% ∼ 35%, 36% ∼ 65%, 66% ∼ 95%, ≥ 96%) and examined the distribution of transcription activity in each group (Figure 3E). Transcription activity does not show a systematic bias as a function of block size.
4.2. BLOCK boundaries are potentially physical domain boundaries
A recently published high-resolution chromosomal contact map of Drosophila embryonic nuclei (Sexton et al. 2012) shows that the entire genome is linearly partitioned into well-demarcated physical domains. We therefore studied the link between physical domains and BLOCKs inferred from histone marks. A total of 994 physical domains were identified from Drosophila embryonic nuclei (Sexton et al. 2012) on chromosomes 2L, 2R, 3L and 3R, with sizes ranging from 10kb to 823kb and a median of 60kb. We observed a strong association between physical domain boundaries and BLOCK boundaries. For example, 36% of BLOCK boundaries are within 10kb of physical domain boundaries, whereas this proportion never exceeds 26% in 1000 randomized block partitions; likewise, 56% of BLOCK boundaries are within 20kb of physical domain boundaries, whereas this proportion never exceeds 42% in 1000 randomized block partitions.
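The boundary comparison above can be illustrated with a short sketch. The text does not describe how the randomized block partitions were generated; the version below assumes they are obtained by shuffling the observed block sizes along the chromosome, which is our assumption, as are all function names.

```python
import bisect
import random


def fraction_near(boundaries, reference, window):
    """Fraction of `boundaries` lying within `window` of any reference boundary."""
    ref = sorted(reference)
    hits = 0
    for b in boundaries:
        k = bisect.bisect_left(ref, b)
        # Only the reference boundaries just below and just above b can be nearest.
        neighbours = [ref[m] for m in (k - 1, k) if 0 <= m < len(ref)]
        if any(abs(b - r) <= window for r in neighbours):
            hits += 1
    return hits / len(boundaries)


def shuffled_boundaries(boundaries, start, rng):
    """Randomize a partition by permuting its block sizes (assumed scheme)."""
    edges = [start] + sorted(boundaries)
    sizes = [b - a for a, b in zip(edges, edges[1:])]
    rng.shuffle(sizes)
    out, pos = [], start
    for size in sizes:
        pos += size
        out.append(pos)
    return out


def max_randomized_fraction(boundaries, reference, window,
                            start=0, n_perm=1000, seed=0):
    """Largest value of fraction_near over n_perm randomized partitions."""
    rng = random.Random(seed)
    return max(
        fraction_near(shuffled_boundaries(boundaries, start, rng), reference, window)
        for _ in range(n_perm)
    )
```

An observed fraction that exceeds `max_randomized_fraction` corresponds to the statement that the proportion "never exceeds" a given value in 1000 randomizations.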
In Sexton et al. (2012), the authors characterized physical domains into four epigenetic classes based on the enrichment of epigenetic marks. Of the four classes, transcriptionally “Active” domains are associated with H3K4me3, H3K36me3, and hyperacetylation, “PcG” domains are associated with the mark H3K27me3, the “HP1/Centromere” class is associated with HP1, and “Null” domains are not enriched for any available marks. We explored whether BLOCKs can be aligned to the classification in Sexton et al. (2012). We assigned the four classes to BLOCKs based on enrichment of H3K4me3, H3K27me3 and HP1a. Specifically, BLOCKs with average HP1a intensity greater than 1 and coverage greater than 10% are classified as “HP1/Centromere” domains, BLOCKs with average H3K27me3 intensity greater than 0.5 and coverage greater than 25% are classified as “PcG” domains, BLOCKs with average H3K4me3 intensity greater than 1 and coverage greater than 25% are classified as “Active” domains, and all the remaining ones are classified as “Null” domains. Figure 4 shows the alignment between BLOCKs and physical domains with epigenetic classes. Of the 93835 genomic bins annotated by both BLOCKs and physical domains, 62987 have the same assignment, leading to a Jaccard index of 0.5. The concordance between BLOCKs and physical domains suggests that BLOCKs link epigenetic domains to topological domains. The differences may be introduced by the techniques, data quality and cell types used in these two studies.
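The class assignment rules can be written as a small decision function. The text does not state which rule takes precedence when a BLOCK satisfies more than one; the sketch below checks the rules in the order they are listed, which is our assumption. Each argument is a hypothetical (average intensity, coverage fraction) pair computed elsewhere.

```python
def classify_block(hp1a, h3k27me3, h3k4me3):
    """Assign an epigenetic class from (mean intensity, coverage) pairs.

    Rules are checked in the order given in the text (HP1/Centromere,
    PcG, Active); that precedence is an assumption of this sketch.
    """
    rules = [
        ("HP1/Centromere", hp1a, 1.0, 0.10),
        ("PcG", h3k27me3, 0.5, 0.25),
        ("Active", h3k4me3, 1.0, 0.25),
    ]
    for label, (intensity, coverage), min_intensity, min_coverage in rules:
        if intensity > min_intensity and coverage > min_coverage:
            return label
    return "Null"
```

For example, a BLOCK with H3K4me3 average intensity 2.0 and coverage 50%, and no HP1a or H3K27me3 enrichment, is labeled “Active”.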
Another line of indirect evidence for BLOCKs as physical domains is their consistency with replication timing. Replication timing refers to the order in which segments of DNA along the length of a chromosome are duplicated. Since the packaging of DNA with proteins into chromatin takes place immediately after the DNA is duplicated, replication timing reflects the order of assembly of chromatin. Recent studies suggest that late-replicating regions generically define not only a repressed but also a physically segregated nuclear compartment. Thus replication timing is a manifestation of the spatial organization of the chromosome. To investigate the association of BLOCKs with replication timing, we compared BLOCKs with the meta peaks of replication origins (10kb to 285kb) from cell lines BG3, Kc and S2 analyzed by the modENCODE project. We observed that 69% of meta peaks are within 20kb of BLOCK boundaries. This is comparable to the result for physical domains, for which we observed that 60% of meta peaks are within 20kb of the physical boundaries characterized in Sexton et al. (2012).
4.3. Functional relevance of BLOCKs
To investigate whether BLOCKs represent domains of functional importance, we performed three different analyses. First, we checked whether genes within each BLOCK tended to be co-regulated, using transcriptomes of L1 larvae and 10–12h embryos measured by RNA-seq. A total of 11376 FlyBase genes were used in our analysis. When a gene had multiple isoforms, the longest one was used. We defined the following rules to describe the status change of each gene between the L1 larvae stage (or 10–12h embryo) and S2 cells: genes whose expression increased by more than 2 fold and whose expression levels were not below 10 were categorized as “up-regulated”; those with fold change less than 0.5 and expression levels not below 10 as “down-regulated”; and all others as “no-change”. To examine whether each BLOCK is enriched for genes with a specific status, we used as a test statistic the proportion of BLOCKs in which the dominant status accounted for at least 50% of the genes. This proportion was 71.8% and 67.6% for L1 larvae and 10–12h embryo, respectively, with 55.4% of the BLOCKs shared between the two comparisons. These observed statistics reach statistical significance when tested against randomly permuted blocks. For the physical domains in Sexton et al. (2012), we observed 68% and 65.8% with dominant co-regulation for L1 larvae and 10–12h embryo, respectively.
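The co-regulation statistic can be sketched as follows. The text does not specify which expression value must be at least 10; the sketch requires the larger of the two values to reach that level, which is our assumption, and the function names are illustrative only.

```python
from collections import Counter


def gene_status(stage_expr, s2_expr, min_expr=10.0):
    """Classify a gene by its fold change between a stage and S2 cells.

    Assumption: the expression filter is applied to the larger of the
    two values; genes failing it are reported as "no-change".
    """
    if max(stage_expr, s2_expr) < min_expr:
        return "no-change"
    if stage_expr > 2.0 * s2_expr:
        return "up-regulated"
    if stage_expr < 0.5 * s2_expr:
        return "down-regulated"
    return "no-change"


def dominant_fraction(statuses_by_block, threshold=0.5):
    """Fraction of non-empty blocks whose most common status covers at
    least `threshold` of the genes in the block."""
    blocks = [statuses for statuses in statuses_by_block if statuses]
    dominated = sum(
        1
        for statuses in blocks
        if Counter(statuses).most_common(1)[0][1] / len(statuses) >= threshold
    )
    return dominated / len(blocks)
```

Significance can then be assessed by recomputing `dominant_fraction` on permuted block assignments and comparing with the observed value.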
Second, we asked whether genes within each BLOCK tended to have similar biological functions. We tested for the enrichment of Gene Ontology (GO) categories within each BLOCK using the hypergeometric test with Bonferroni correction. 51.2% (412 out of 805 BLOCKs with more than 2 genes) were enriched for at least one GO category at a 0.05 cutoff, and 1172 GO categories in total were enriched (Table S2). The observed numbers of GO-enriched BLOCKs and enriched GO categories were both significantly higher than those from permuted blocks. We further asked which biological processes or functions involve genes that are significantly linearly juxtaposed. We found that 86.4% (108/125) of chromatin assembly or disassembly genes (GO:0006333) in Drosophila were juxtaposed within a single BLOCK located on chr2L: 21344–21579kb, with a striking p-value of 3.3 × 10^(−235). Genes in chitin-based cuticle development (GO:0040003), body morphogenesis (GO:0010171) and proteinaceous extracellular matrix (GO:0005578) were also found to be significantly clustered, with over 70% of the genes in one BLOCK sharing the same function.
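A hypergeometric enrichment test with Bonferroni correction can be sketched with the standard library alone. This is a generic illustration of the test named above, not the exact pipeline used in the analysis; the function names and argument order are ours.

```python
from math import comb


def hypergeom_sf(k, n_total, n_annotated, n_drawn):
    """P(X >= k) for the number of annotated genes among n_drawn genes
    sampled without replacement from n_total genes, of which n_annotated
    carry the GO category."""
    upper = min(n_annotated, n_drawn)
    numerator = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_drawn)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Here `n_total` would be the number of genes in the background, `n_drawn` the number of genes in a BLOCK, and `k` the number of those genes annotated with the GO category under test.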
Third, we reasoned that if BLOCKs reflect coordinated regulation of genes with relevant biological functions, BLOCKs enriched in developmentally specific GO categories should show large variation across developmental stages, while BLOCKs enriched in “house-keeping” GO categories should display limited fluctuations. We ranked the BLOCKs by the standard deviation of their transcription level across 9 different developmental stages (Table S3 and S4). BLOCKs with the 20% largest deviations and the 20% smallest deviations were checked for GO enrichment separately, and were then listed in Tables S2 and S3 in order of statistical significance. Notably, in BLOCKs displaying the most striking changes across developmental stages, we found GO categories associated with development-specific biological processes or functions, such as heart development, structural constituent of chitin-based cuticle, positive regulation of muscle organ development, and midgut development, among others. Moreover, metabolism-related functions, such as serine-type endopeptidase activity and peptidyl-dipeptidase activity, display turnover across developmental transcriptomes and are near the top of our list. GO categories associated with “house-keeping” functions, such as transferase activity, aminoacylase activity, chromatin assembly and insulin receptor binding, showed limited fluctuations through development. This result provides further evidence for the role of BLOCKs in coordinated regulation.
4.4. Comparison with ChromHMM
In this subsection, we compare the results from our method with those from a popular HMM-based method, ChromHMM. We applied ChromHMM to the same dataset (18 histone modifications, 1kb bins, S2 cells). The data were binarized to meet ChromHMM’s input requirement. More specifically, all intervals with intensities greater than 0 were set to 1 and the remaining intervals were set to 0. To obtain blocks at coarse levels, we explored ChromHMM models by varying the pre-specified number of hidden states (from 3 to 18). We observed that a smaller number of hidden states tended to produce larger blocks. Here we report ChromHMM models with 3 to 5 hidden states. The ChromHMM model with 3 hidden states generates 12517 segments, the model with 4 hidden states generates 9157 segments, and the model with 5 hidden states generates 12444 segments. For each ChromHMM model, the sizes of segments range from 2kb (5% quantile) to 26kb (95% quantile), with a median of 5kb. The distributions of segment sizes from ChromHMM models and BLOCKs are visualized in Figure 5. Compared with BLOCKs (median size 70kb), ChromHMM segments are much smaller even with few hidden states, so our model is better suited to characterizing histone modification patterns at coarse scales.
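The binarization rule and the segment-size summary can be illustrated as follows. This sketch does not call ChromHMM itself; it only shows the thresholding described above and how consecutive bins with the same state label collapse into segments. Function names are ours.

```python
from itertools import groupby


def binarize(track):
    """Set bins with intensity greater than 0 to 1 and all others to 0."""
    return [1 if value > 0 else 0 for value in track]


def segment_lengths(states, bin_kb=1):
    """Collapse a per-bin state sequence into (state, length in kb) segments."""
    return [
        (state, sum(1 for _ in run) * bin_kb)
        for state, run in groupby(states)
    ]
```

Applied to a per-bin state assignment, `segment_lengths` yields the segment sizes whose quantiles are compared with BLOCK sizes.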
4.5. How robust is the result?
The BCP model used in this paper assumes that different histone marks are independent. However, some histone marks, such as H3K4me3 and H3K4me2, are highly correlated with each other. Moreover, it is known that there is redundancy and exclusivity among the active and repressive marks. To explore how the choice of input histone marks affects the result, we performed the change point analysis with 4 marks, 7 marks and 10 marks as input, respectively. The marks for each model were selected based on their correlation across the entire genome. As shown in Figure 6A, there are mainly 7 groups of marks based on their correlation patterns: the first group consists of H3K9me2 and H3K9me3; the second group is featured by H3K36me3 and H3K79me1; the third group consists of H4K5ac, H3K18ac, H4K8ac, H3K27ac, H4Ac, H3K36me1 and H3K4me1; the fourth group is featured by H3K79me2, H3K9ac, H3K4me3 and H3K4me2; whereas three separate groups are formed by H4K16ac, H3K23ac and H3K27me3, respectively. For the 7 marks model, we selected one mark from each of the 7 groups, with the input marks being H3K18ac, H3K23ac, H3K27Me3, H3K36me3, H3K4Me3, H3K9me2, and H4K16ac. For the 10 marks model, we further introduced H4, H3K79Me2, and H3K9ac into the 7 marks model. For the 4 marks model, we excluded H3K18ac, H3K36me3, and H4K16ac from the 7 marks model. The 4 marks, 7 marks and 10 marks models identified 698, 868 and 927 blocks, respectively. We observed high consistency between these results and the reported BLOCKs obtained with 18 marks. For example, 84% of boundaries from the 10 marks model and 84% of boundaries from the 7 marks model are within 20kb of BLOCK boundaries (see Figure 6B for other comparisons).
To investigate how the posterior probability cutoff would affect the characterization of BLOCKs, we varied the threshold and checked the distribution of the sizes. The results were rather stable under different cut-off values. When the cut-off value was set as low as 0.25, only 2 new boundaries were added, leading to a total of 996 blocks.
5. Discussion
5.1. Methodological comparisons
Our BCP model was developed with a different purpose from existing methods for analyzing combinatorial patterns of histone marks. For example, ChromaSig (Hon, Ren and Wang 2008) was designed to uncover potential regulatory elements by searching for chromatin signatures that occur frequently genome-wide. Spatial clustering (Jaschek and Tanay 2009) identified novel patterns of local co-occurrence among histone modifications by imposing a spatial K-clustering solution on HMM. Segway (Hoffman et al. 2012), based on Dynamic Bayesian Networks, achieved a breakthrough in precision and resolution in finding known elements and in handling missing data compared to HMM-based approaches. The most recent method of this kind, ChAT (Wang, Lunyak and Jordan 2012), extends the characterization of chromatin signatures through an inherent statistical criterion for classification. All these methods try to detect chromatin signatures associated with a variety of small functional elements. To the best of our knowledge, our model is the first effort to examine histone marks at coarse scales, although no explicit constraint is put on block size. By separately modeling zero and nonzero signals, our model is able to capture local enrichment patterns of different sizes implicitly, which is superior to the existing ad hoc merging strategy (Wang, Lunyak and Jordan 2012).
BCP differs substantially from several previously described approaches to subdividing the genome at the “domain level”. de Wit et al. (2008) reported a study to identify nested chromatin domain structure through a statistical test of each chromatin component. Their chromatin domains are specific to each component or factor, whereas our approach captures domains with combinatorial patterns of multiple factors. Thurman et al. (2007) used a simple two-state HMM to segment the ENCODE regions into active and repressed domains based on multiple tracks of functional genomic data, including activating and repressive histone modifications, RNA output, and DNA replication timing. By using wavelet smoothing, their method focuses on a single scale at a time (Lian et al. 2008). In contrast, our analysis focuses on histone modifications only and simultaneously captures enrichment patterns over different scales. BCP is most similar to a four-state CPM model proposed to characterize chromatin accessibility based on tiled microarray DNaseI sensitivity data only (Lian et al. 2008). Both methods formulate the segmentation of the genome as a change point detection problem. However, the two methods differ in several respects. First, CPM is still a hidden-state model, with transition probabilities imposed on segments rather than on equal-sized bins as in HMM, whereas BCP is hidden-state free with an emphasis on local patterns. Second, the four-state CPM model was developed to interpret single-track DNaseI array data, while our method is based on multivariate histone modification data. Third, CPM models the DNaseI signal as a continuous mixture of Gaussians in each state, whereas we model the histone binding signal with a zero-inflated Gaussian mixture because of the spatial sparsity of binding events.
5.2. Summary and future directions
In this paper, we have developed a novel multivariate BCP model to partition the genome into contiguous blocks based on histone modifications. It could be extended to analyze ChIP-sequencing data or applied to other problems that require partitioning multiple zero-inflated observation tracks. Our model presents a new approach to examining combinatorial histone marks. Not only are histone marks signatures for functional elements (Kharchenko et al. 2011, Ernst and Kellis 2010), but our results from the D. melanogaster S2 cell genome suggest that they are also roadmaps for chromatin organization at coarse scales.
It is worthwhile to further investigate whether BLOCKs and topological domains are substantively different, or whether BLOCKs merely re-describe topological domains based on histone marks. Besides the differences introduced by techniques, data quality and cell types, we believe two other possible reasons for the imperfect alignment between BLOCKs and physical domains are: 1) the partition is not saturated based on the current profile of histone modifications; 2) the equal weight assigned to different histone modifications in the partition limits the identification of finer domains (a drawback of all current approaches).
It has become increasingly clear that functionally related genes are often located next to one another in the linear genome (Sproul, Gilbert and Bickmore 2005), resembling operons in bacteria (Chen et al. 2012, Keene 2007). This proximity is essential for coordinated gene regulation. Genome-wide expression analyses have identified many clusters of co-expressed genes during Drosophila development (Lee and Sonnhammer 2003, Yi, Sze and Thon 2007), such as the Hox gene clusters (Duboule 2007). One mechanism for this coordinated regulation is that these genes are organized into a chromatin domain that acts as a regulatory unit through epigenetic mechanisms (Kosak and Groudine 2004, Sproul, Gilbert and Bickmore 2005). Several such chromatin domains have already been characterized (Kosak and Groudine 2004, Tolhuis et al. 2006, Pickersgill et al. 2006, Orlando and Paro 1993). In this study, we illustrated the widespread existence of these chromatin domains as BLOCKs identified by histone marks.
Last but not least, although we have shown that a substantial portion of BLOCKs can potentially act as regulatory units, this is likely still an underestimate. Firstly, our BLOCKs were identified based on combinatorial patterns of 18 histone marks from the S2 epigenome. We do not know how many histone marks in total are sufficient to saturate the segmentation. It is likely that more marks, including potentially undiscovered ones, will be needed to obtain a complete view of the epigenetic landscape. Over 100 histone marks have been discovered, many of which are mutually exclusive or correlated with one another. Future studies addressing relationships among histone marks will give us more insight into this open question. It is also important to develop block identification methods that can accommodate the dependency structure among marks. Secondly, when evaluating the expression of genes within an individual BLOCK, we used developmental transcriptome data from Drosophila tissues rather than from S2 cells, which represent only a weighted average of varying BLOCKs across the different cell types within each developmental stage. In reality, each cell type is likely to have its own distinct pattern of BLOCKs. Thirdly, plasticity in chromosomal modifications has been shown in several reports (Riddle et al. 2011, Eaton et al. 2011, modENCODE Consortium 2010). We would therefore expect BLOCKs to be dynamic structures, and the percentage of BLOCKs with a tendency toward co-regulation might be even higher if this plasticity were taken into account. This conjecture could be tested when more histone mark data across developmental stages become available. Fourthly, because knowledge of gene functions in the GO database (as well as others) is incomplete and inaccurate (Khatri, Sirota and Butte 2012), many BLOCKs with functional relevance may not stand out simply because the supporting information does not yet exist.
Finally, coordinated regulation is a complex process accomplished by miRNAs, transcription factors and other regulatory elements, with feedback effects on chromatin organization. Further analysis of the binding sites of regulatory elements and their interplay with genes within BLOCKs will shed more light on the underlying mechanism.
Acknowledgments
We thank the reviewers for their constructive comments and Chao Gao for discussion.
Contributor Information
Mengjie Chen, Email: mengjie@email.unc.edu, Department of Biostatistics and Genetics, University of North Carolina, Chapel Hill, NC 27599.
Haifan Lin, Email: haifan.lin@yale.edu, Yale Stem Cell Center, Yale School of Medicine, New Haven, CT 06520.
Hongyu Zhao, Email: hongyu.zhao@yale.edu, Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520.
References
- Allis D. Epigenetics. CSHL Press; 2007.
- Barry D, Hartigan JA. Product Partition Models for Change Point Problems. The Annals of Statistics. 1992;20:260–279.
- Barry D, Hartigan JA. A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association. 1993;88:309–319.
- Chen D, Zheng W, Lin A, Uyhazi K, Zhao H, Lin H. Pumilio 1 Suppresses Multiple Activators of p53 to Safeguard Spermatogenesis. Current Biology. 2012;22:420–425. doi: 10.1016/j.cub.2012.01.039.
- de Wit E, Braunschweig U, Greil F, Bussemaker HJ, van Steensel B. Global Chromatin Domain Organization of the Drosophila Genome. PLoS Genetics. 2008;4:e1000045. doi: 10.1371/journal.pgen.1000045.
- Duboule D. The rise and fall of Hox gene clusters. Development. 2007;134:2549–2560. doi: 10.1242/dev.001065.
- Eaton ML, Prinz JA, MacAlpine HK, Tretyakov G, Kharchenko PV, et al. Chromatin signatures of the Drosophila replication program. Genome Research. 2011;21:164–174. doi: 10.1101/gr.116038.110.
- Erdman C, Emerson JW. A fast Bayesian change point analysis for the segmentation of microarray data. Bioinformatics. 2008;24:2143–2148. doi: 10.1093/bioinformatics/btn404.
- Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology. 2010;28:817–826. doi: 10.1038/nbt.1662.
- Filion GJ, van Bemmel JG, Braunschweig U, Talhout W, Kind J, et al. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell. 2010;143:212–224. doi: 10.1016/j.cell.2010.09.009.
- Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods. 2012;9:473–476. doi: 10.1038/nmeth.1937.
- Hon G, Ren B, Wang W. ChromaSig: A Probabilistic Approach to Finding Common Chromatin Signatures in the Human Genome. PLoS Computational Biology. 2008;4(10):e1000201. doi: 10.1371/journal.pcbi.1000201.
- Jaschek R, Tanay A. Spatial Clustering of Multivariate Genomic and Epigenomic Information. Research in Computational Molecular Biology. 2009;5541:170–183.
- Keene JD. RNA regulons: coordination of post-transcriptional events. Nature Reviews Genetics. 2007;8:533–543. doi: 10.1038/nrg2111.
- Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011;471:480–485. doi: 10.1038/nature09725.
- Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012;8:e1002375. doi: 10.1371/journal.pcbi.1002375.
- Kosak ST, Groudine M. Gene order and dynamic domains. Science. 2004;306:644–647. doi: 10.1126/science.1103864.
- Lee JM, Sonnhammer ELL. Genomic gene clustering analysis of pathways in eukaryotes. Genome Research. 2003;13:875–882. doi: 10.1101/gr.737703.
- Lian H, Thompson WA, Thurman R, Stamatoyannopoulos JA, Noble WS, et al. Automated mapping of large-scale chromatin structure in ENCODE. Bioinformatics. 2008;24(17):1911–1916. doi: 10.1093/bioinformatics/btn335.
- modENCODE Consortium, T. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787. doi: 10.1126/science.1198374.
- Orlando V, Paro R. Mapping Polycomb-repressed domains in the bithorax complex using in vivo formaldehyde cross-linked chromatin. Cell. 1993;75:1187–1198. doi: 10.1016/0092-8674(93)90328-n.
- Pickersgill H, Kalverda B, de Wit E, Talhout W, Fornerod M, van Steensel B. Characterization of the Drosophila melanogaster genome at the nuclear lamina. Nature Genetics. 2006;38:1005–1014. doi: 10.1038/ng1852.
- Riddle NC, Minoda A, Kharchenko PV, Alekseyenko AA, Schwartz YB, et al. Plasticity in patterns of histone modifications and chromosomal proteins in Drosophila heterochromatin. Genome Research. 2011;21:147–163. doi: 10.1101/gr.110098.110.
- Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, et al. Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. Cell. 2012;148:1–15. doi: 10.1016/j.cell.2012.01.010.
- Sproul D, Gilbert N, Bickmore WA. The role of chromatin structure in regulating the expression of clustered genes. Nature Reviews Genetics. 2005;6:775–781. doi: 10.1038/nrg1688.
- Thurman RE, Day N, Noble WS, Stamatoyannopoulos JA. Identification of higher-order functional domains in the human ENCODE regions. Genome Research. 2007;17:917–927. doi: 10.1101/gr.6081407.
- Tolhuis B, Muijrers I, de Wit E, Teunissen H, Talhout W, van Steensel B, van Lohuizen M. Genome-wide profiling of PRC1 and PRC2 Polycomb chromatin binding in Drosophila melanogaster. Nature Genetics. 2006;38:694–699. doi: 10.1038/ng1792.
- Wang J, Lunyak VV, Jordan IK. Chromatin signature discovery via histone modification profile alignments. Nucleic Acids Research. 2012. doi: 10.1093/nar/gks848.
- Yi G, Sze SH, Thon MR. Identifying clusters of functionally related genes in genomes. Bioinformatics. 2007;23:1053–1060. doi: 10.1093/bioinformatics/btl673.