CHANGE POINT ANALYSIS OF HISTONE MODIFICATIONS REVEALS EPIGENETIC BLOCKS LINKING TO PHYSICAL DOMAINS

Mengjie Chen; Haifan Lin; Hongyu Zhao

doi:10.1214/16-AOAS905

Author manuscript; available in PMC: 2017 Mar 1.

Published in final edited form as: Ann Appl Stat. 2016 Mar 25;10(1):506–526. doi: 10.1214/16-AOAS905

PMCID: PMC4876974 NIHMSID: NIHMS755035 PMID: 27231496

Abstract

Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume that one combination of histone modifications invariantly translates to one transcriptional output regardless of the local chromatin environment. In this study we hypothesize that the genome is organized into local domains that manifest similar enrichment patterns of histone modifications, which leads to orchestrated regulation of expression of genes with relevant biological functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment patterns of histone marks. We further characterized BLOCKs by their transcription levels, distribution of genes, degree of co-regulation and GO enrichment. Our results demonstrate that these BLOCKs, although inferred merely from histone modifications, are strongly associated with physical domains, which suggests important roles in chromatin organization and coordinated gene regulation.

Keywords: Bayesian change point model, Histone modification, chromosomal domain

1. Introduction

Epigenetics refers to the study of heritable changes affecting gene expression and other phenotypes that occur without a change in DNA sequence. Epigenetic mechanisms, including chromatin remodeling, histone modification, DNA methylation and binding of non-histone proteins, provide a fundamental level of transcriptional control. Extensive studies on histone modifications have led to the “histone code” hypothesis that histone modifications do not occur in isolation but rather in a combinatorial manner to provide “ON” or “OFF” signature for transcriptional events (Allis, 2007).

Genome-wide studies using high-throughput technologies such as chromatin immunoprecipitation (ChIP) followed by microarray analysis (ChIP on chip) or deep sequencing (ChIP-seq) have begun to decipher the “histone code” at the genome-wide scale. Currently, a common approach to assess chromatin states using these data is a multivariate Hidden Markov Model (HMM) introduced by Ernst and Kellis (2010), which has been used in several modENCODE and ENCODE project publications (modENCODE Consortium 2010, Kharchenko et al. 2011, Riddle et al. 2011, Eaton et al. 2011). This model associates each 200bp genomic window with a particular state, generating a chromatin-centric annotation. However, a pre-defined number of states needs to be specified in HMMs and it is difficult to justify and interpret a particular choice. Different studies trying to balance resolution and interpretability based on different criteria often led to different numbers of states, both between different organisms (Ernst and Kellis 2010, modENCODE Consortium 2010) and within the same organism (Filion et al. 2010, modENCODE Consortium 2010). Moreover, HMM summarizes chromatin information by a vector of “emission” probabilities associated with each chromatin state and a vector of “transition” probabilities with which different chromatin states occur in spatial relationship of each other (Ernst and Kellis 2010). These settings assume the homogeneity of hidden states and their transitions across the genome. However, since histone modifications are outcomes of interplay with local environment, the assumption of spatial homogeneity may not hold at the genome level.

To address the limitations in the HMM-based approaches, we propose an alternative approach to examining combinatorial histone marks at coarse scales. We hypothesize that the genome is organized into local blocks that display regionalized histone signatures. Those blocks may have important roles in orchestrated regulation of expression of genes with relevant biological functions. We note that our approach does not require a pre-defined number of possible states and it identifies local patterns without the assumption on spatial homogeneity.

To computationally infer these blocks, we propose a multivariate Bayesian Change Point (BCP) model which is capable of incorporating both local and global information. The BCP model was first proposed by Barry and Hartigan (1992, 1993) to describe a process where the observations can be considered to arise from a series of contiguous blocks, with distributional parameters different across blocks. One of the inferential goals is to identify the change points separating contiguous blocks. By “assuming probability of any partition is proportional to a product of prior cohesions, one for each block in the partition, and that given the blocks the parameters in different blocks have independent prior distributions” (Barry and Hartigan 1992, 1993), a fully Bayesian approach can be adopted to detect change points from a sequence of observations. Barry and Hartigan (1992) considered in detail the case where the observations X₁, …, X_n are independent and normally distributed given the sequence of parameters μ_l with X_i ∼ N(μ_l, σ²) where the observations from the same block l have the same μ_l. This method has been used by Erdman and Emerson (2008) to segment microarray data. However, this model cannot be directly applied to infer histone modification blocks because observed modification data do not follow normal distributions. This is due to the fact that histone modifications are usually observed at a small proportion of the genome locations with signal at the rest of the genome being (or near) zero (Figure S1). To accommodate these unique features, here we present a multivariate BCP model through the introduction of a zero-inflated Gaussian mixture distribution, to partition the genome into blocks where each block is relatively homogeneous with respect to histone marks.

1.1. Outline of the Paper

The paper is organized as follows. In Section 2, we present the methodological details of the BCP model with a mixture prior and an MCMC algorithm to infer the posterior probability. Section 3 presents results from simulation studies. In Section 4, we describe a change point analysis of the D. melanogaster genome with multiple histone marks using S2 cell data from the modENCODE project. The identified chromosomal blocks are called BLOCKs in the rest of this article. We then present two sets of exploratory analyses: Section 4.2 on the relationship of BLOCKs with physical domains and Section 4.3 on the functional relevance of BLOCKs. In Section 4.4, we compare our results with those of an HMM. We conclude the paper with a summary and discussion in Section 5.

1.2. Notations

We denote the density function of N(μ, σ²) by ϕ(·|μ, σ), and the density function of Beta(a, b) by ψ(·|a, b). The Dirac function δ indicates the point mass at 0. For a set S, #S is the cardinality of S. For a random variable X, {X = 1} is the indicator function taking value 1 if X = 1 and value 0 if X ≠ 1. The indicator function {X = 0} is defined in the same way. The set {i + 1, i + 2, …, j} with integers i < j is denoted by (i : j). The function f(·|·) is generic notation for a conditional density when the distribution is clear from the context.

2. Method

2.1. A BCP model for block identification

The observation we have is an M × n data matrix $X = {(X_{1}, \dots, X_{M})}^{T}$ , where each X_m for m = 1, …, M is a modification mark with length n. We first describe the likelihood of each X_m and then combine them together. For notational simplicity, we suppress the subscript and write X instead of X_m.

Let X = (X₁, …, X_n) be a vector with length n. Create another vector Z = (Z₁, …, Z_n) to indicate whether each X_h is zero or not. That is, Z_h = 0 if X_h = 0, and Z_h = 1 if X_h ≠ 0. Note Z is fully determined by X.

For the index set {1, …, n}, let ρ be a partition of this set. That is, ρ = {S₁, …, S_N}, with $\{1, \dots, n\} = \cup_{l = 1}^{N} S_{l}$ and $S_{l_{1}} \cap S_{l_{2}} = \emptyset$ for all l₁ ≠ l₂. The number N is the number of blocks of {1, …, n}. For the change-point problem, each S_l is a contiguous subset of {1, …, n}. That is, S_l = (i : j) = {i + 1, …, j} for some i < j.

2.1.1. Likelihood

Given the partition ρ = {S₁, …, S_N}, X_k follows a mixture distribution X_k ∼ (1 − λ_l)N(μ_l, σ²) + λ_lδ, for k ∈ S_l and each l = 1, …, N. The parameter μ_l is block-specific, while σ is shared among different blocks. The parameter λ_l describes how likely X_k is zero, which varies across different blocks. Thus, given (ρ, μ₁, …, μ_N, λ₁, …, λ_N, σ), the likelihood of (X, Z) can be fully specified. That is,

L (X, Z | ρ, μ_{1}, \dots, μ_{N}, λ_{1}, \dots, λ_{N}, σ) = \prod_{l = 1}^{N} f (X_{S_{l}}, Z_{S_{l}} | μ_{l}, λ_{l}, σ),

(2.1)

where for each l with S_l = {i + 1, …, j},

f (X_{S_{l}}, Z_{S_{l}} | μ_{l}, λ_{l}, σ)

(2.2)

= {(1 - λ_{l})}^{\# \{k \in S_{l} : Z_{k} = 1\}} λ_{l}^{\# \{k \in S_{l} : Z_{k} = 0\}} \prod_{\{k \in S_{l} : Z_{k} = 1\}} ϕ (X_{k} | μ_{l}, σ),

(2.3)

where $X_{S_{l}} = (X_{i + 1}, \dots, X_{j})$ and $Z_{S_{l}} = (Z_{i + 1}, \dots, Z_{j})$ .
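As an illustration of (2.3), the log-likelihood of a single block can be evaluated directly. The following is a minimal sketch under the stated mixture model; the function name `block_log_likelihood` and its argument layout are hypothetical and are not taken from the paper.

```python
import math

def block_log_likelihood(x_block, mu, lam, sigma):
    """Log of (2.3) for one block of a zero-inflated Gaussian mixture.

    x_block: signals X_k for k in S_l (Z_k = 0 exactly when X_k == 0)
    mu, sigma: block mean and shared standard deviation
    lam: probability that a position in this block is zero
    """
    nonzero = [x for x in x_block if x != 0]
    n_zero = len(x_block) - len(nonzero)
    ll = 0.0
    if nonzero:
        ll += len(nonzero) * math.log1p(-lam)      # (1 - lam)^{#{Z_k = 1}}
    if n_zero:
        ll += n_zero * math.log(lam)               # lam^{#{Z_k = 0}}
    for x in nonzero:                              # Gaussian part, nonzero entries only
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return ll

print(block_log_likelihood([0.0, 0.0, 1.0], mu=1.0, lam=0.5, sigma=1.0))
```

Working on the log scale avoids underflow when a block contains many positions.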

2.1.2. Prior

We proceed to specify the prior distribution on the parameters (ρ, μ₁, …, μ_N, λ₁, …, λ_N, σ).

ρ ~ \prod_{l = 1}^{N} c (S_{l}),

(2.4)

\begin{array}{l} μ_{l} ~ N (μ_{0}, σ_{0}^{2} d_{l}^{- 1}) \text{ for each } l \text{ with } S_{l} = \{i + 1, \dots, j\}, \\ \text{and } d_{l} = \# \{k \in S_{l} : Z_{k} = 1\}, \end{array}

(2.5)

λ_{l} ~ Beta (a, b) .

(2.6)

The prior (2.4) on the partition ρ is called the product partition model, which was originally described in Barry and Hartigan (1993). The quantity c(S_l) is called the cohesion. In this paper, c(S_l) is defined to be $c_{(i : j)} = {(1 - p)}^{j - i - 1} p$ when j < n and $c_{(i : j)} = {(1 - p)}^{j - i - 1}$ when j = n, where 0 ≤ p ≤ 1 and S_l = {i + 1, …, j} as before; the product of these cohesions over the N blocks is $p^{N - 1} {(1 - p)}^{n - N}$, the integrand in (2.13). This specification implies that the sequence of change points forms a discrete renewal process with inter-arrival times identically geometrically distributed. The geometric distribution is memoryless. For histone mark data, this means we assume that the probability that a genomic position (bin) is a BLOCK boundary is roughly constant. Note that the cohesion prior is a truly nonparametric prior on all 2ⁿ⁻¹ possible partitions of n data points, so the number of blocks does not need to be specified and can be inferred from the data. The priors (2.5) and (2.6) are conjugate priors with respect to the likelihood. The prior on the variance σ² will be specified jointly with the hyper-parameters.
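The product partition prior can be checked numerically on a small example. The sketch below assumes the cohesion (1 − p)^(j−i−1) p for blocks ending before n and (1 − p)^(j−i−1) for the last block, which is the form that reproduces the integrand p^(N−1)(1 − p)^(n−N) in (2.13); the function names are hypothetical.

```python
from itertools import product

def cohesion(i, j, n, p):
    """Cohesion of the block {i+1, ..., j} in a sequence of length n."""
    base = (1.0 - p) ** (j - i - 1)
    return base * p if j < n else base

def partition_prior(block_ends, n, p):
    """Product of cohesions; block_ends are sorted end positions, the last equal to n."""
    prob, start = 1.0, 0
    for end in block_ends:
        prob *= cohesion(start, end, n, p)
        start = end
    return prob

def all_partitions(n):
    """Enumerate all 2^(n-1) contiguous partitions of {1, ..., n} as lists of block ends."""
    for flags in product((0, 1), repeat=n - 1):
        yield [k + 1 for k, f in enumerate(flags) if f] + [n]

n, p = 5, 0.3
ends = [2, 5]                      # two blocks: {1, 2} and {3, 4, 5}
N = len(ends)
print(partition_prior(ends, n, p), p ** (N - 1) * (1 - p) ** (n - N))
print(sum(partition_prior(e, n, p) for e in all_partitions(n)))   # total prior mass
```

Because the prior mass over all contiguous partitions sums to one for any fixed p, no number of blocks has to be chosen in advance.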

To pursue a fully Bayesian approach, we put priors on the hyper-parameters (p, μ₀, σ₀) in (2.4) and (2.5). Define $w = \frac{σ^{2}}{σ^{2} + σ_{0}^{2}}$ . We jointly specify the priors on the hyper-parameters together with the prior on σ².

μ_{0} ~ 1, - \infty < μ_{0} < \infty

(2.7)

σ^{2} ~ \frac{1}{σ^{2}}, 0 \leq σ^{2} < \infty,

(2.8)

w ~ \frac{1}{w_{0}}, 0 \leq w \leq w_{0},

(2.9)

p ~ \frac{1}{p_{0}}, 0 \leq p \leq p_{0} .

(2.10)

The priors (2.7), (2.9) and (2.10) are uniform priors, reflecting our lack of prior knowledge. The prior (2.8) can be viewed as a uniform distribution on the logarithmic scale. Notice that (2.7) and (2.8) are improper priors. This does not cause a problem in view of the sampling procedure described later.

2.1.3. Posterior

Our goal here is to find the posterior distribution of the partition, which is $f (ρ | X, ℤ)$ . According to Bayes formula,

f (ρ | X, ℤ) = \frac{\prod_{m = 1}^{M} f (X_{m}, Z_{m} | ρ) f (ρ)}{\int \prod_{m = 1}^{M} f (X_{m}, Z_{m} | ρ) f (ρ) d ρ} .

(2.11)

Since the denominator of (2.11) is complicated, we need to use MCMC to sample from the posterior by

f (ρ | X, ℤ) \propto \prod_{m = 1}^{M} f (X_{m}, Z_{m} | ρ) f (ρ) .

(2.12)

The conditional density f(X, Z|ρ) is obtained by integrating the likelihood function (2.1) against the priors on (μ₁, …, μ_N, λ₁, …, λ_N, σ) specified in (2.5), (2.6), (2.7), (2.8) and (2.9). The prior f(ρ) is obtained by integrating f(ρ|p) specified in (2.4) with respect to (2.10). We first find f(ρ).

\begin{array}{l} f (ρ) = \frac{1}{p_{0}} \int_{0}^{p_{0}} f (ρ | p) d p \propto \frac{1}{p_{0}} \int_{0}^{p_{0}} (\prod_{l = 1}^{N} c (S_{l})) d p \\ = \frac{1}{p_{0}} \int_{0}^{p_{0}} (\prod_{S_{l} = \{i + 1, \dots, j\}} c_{(i : j)}) d p \\ = \frac{1}{p_{0}} \int_{0}^{p_{0}} p^{N - 1} {(1 - p)}^{n - N} d p . \end{array}

(2.13)
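The integral in (2.13) depends on the partition only through the number of blocks N. A minimal sketch of its numerical evaluation follows, using composite Simpson quadrature in log space; the quadrature choice and the helper names `log_simpson` and `log_partition_prior` are assumptions made for illustration, not part of the paper.

```python
import math

def log_simpson(log_f, a, b, steps=2000):
    """Log of the integral of exp(log_f(t)) over [a, b], composite Simpson rule (steps even)."""
    h = (b - a) / steps
    logs = [log_f(a + k * h) for k in range(steps + 1)]
    m = max(logs)                                  # subtract the max to avoid underflow
    total = 0.0
    for k, value in enumerate(logs):
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * math.exp(value - m)
    return m + math.log(total * h / 3)

def log_partition_prior(N, n, p0):
    """Log of (1/p0) * integral_0^p0 p^(N-1) (1-p)^(n-N) dp, as in (2.13)."""
    def log_f(p):
        if p <= 0.0:
            return 0.0 if N == 1 else -math.inf
        if p >= 1.0:
            return 0.0 if n == N else -math.inf
        return (N - 1) * math.log(p) + (n - N) * math.log1p(-p)
    return log_simpson(log_f, 0.0, p0) - math.log(p0)

# Prior log-weight of partitions with N blocks of n = 50 positions, p0 = 0.2
for N in (1, 5, 10, 25):
    print(N, round(log_partition_prior(N, 50, 0.2), 3))
```

The same quantity can be written in terms of the regularized incomplete beta function where a numerical library providing it is available.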

Next, we find f(X, Z|ρ). We first integrate out (μ₁, …, μ_N, λ₁, …, λ_N) in (2.1) using (2.5) and (2.6). Recall that ψ(·|a, b) is the density of Beta(a, b). Using (2.3) as the representation of (2.1), we have

\begin{array}{l} f (X, Z | ρ, μ_{0}, w, σ) \\ = \prod_{l = 1}^{N} \int \prod_{\{k \in S_{l} : Z_{k} = 1\}} ϕ (X_{k} | μ_{l}, σ) ϕ (μ_{l} | μ_{0}, σ_{0} d_{l}^{- 1 / 2}) d μ_{l} \\ \times \prod_{l = 1}^{N} \int_{0}^{1} {(1 - λ_{l})}^{\# \{k \in S_{l} : Z_{k} = 1\}} λ_{l}^{\# \{k \in S_{l} : Z_{k} = 0\}} ψ (λ_{l} | a, b) d λ_{l} \\ \propto A \times {(2 π σ^{2})}^{\frac{- T}{2}} w^{\frac{N}{2}} \exp (- \frac{1}{2 σ^{2}} (W + w B + w T {(μ_{0} - {\bar{X}}_{T})}^{2})), \end{array}

(2.14)

where

\begin{array}{l} T = \sum_{k = 1}^{n} \{Z_{k} = 1\} \\ {\bar{X}}_{T} = T^{- 1} \sum_{k = 1}^{n} X_{k} \\ {\bar{X}}_{(i : j), Z_{k}} = \frac{1}{\# \{k : Z_{k} = 1, i < k \leq j\}} \sum_{\{k : Z_{k} = 1, i < k \leq j\}} X_{k} \\ W = \sum_{\{(i : j) = S_{l} \in ρ\}} \sum_{\{k : Z_{k} = 1, i < k \leq j\}} {(X_{k} - {\bar{X}}_{(i : j), Z_{k}})}^{2} \\ B = \sum_{\{(i : j) = S_{l} \in ρ\}} \# \{k : Z_{k} = 1, i < k \leq j\} {({\bar{X}}_{(i : j), Z_{k}} - {\bar{X}}_{T})}^{2} \\ A = \prod_{\{(i : j) = S_{l} \in ρ\}} \frac{Γ (a + \# \{k : Z_{k} = 0, i < k \leq j\}) Γ (b + \# \{k : Z_{k} = 1, i < k \leq j\})}{Γ (a + b + j - i)} . \end{array}

(2.15)
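The quantities in (2.15) are simple block-wise summaries of the nonzero signal. The sketch below computes T, W, B and log A for one mark and a given partition; the function name `block_statistics` and the representation of the partition as a list of block end positions are hypothetical choices for illustration.

```python
import math

def block_statistics(x, block_ends, a=1.0, b=1.0):
    """Return (T, W, B, log A) from (2.15) for one mark.

    x: signals X_1..X_n (zero means Z_k = 0)
    block_ends: sorted block end positions, the last equal to len(x)
    a, b: Beta prior parameters for the zero proportion
    """
    nonzero_all = [v for v in x if v != 0]
    T = len(nonzero_all)
    grand_mean = sum(nonzero_all) / T if T else 0.0
    W = B = log_A = 0.0
    start = 0
    for end in block_ends:
        block = x[start:end]
        nz = [v for v in block if v != 0]
        n1, n0 = len(nz), len(block) - len(nz)
        if n1:
            mean = sum(nz) / n1
            W += sum((v - mean) ** 2 for v in nz)        # within-block sum of squares
            B += n1 * (mean - grand_mean) ** 2           # between-block sum of squares
        log_A += (math.lgamma(a + n0) + math.lgamma(b + n1)
                  - math.lgamma(a + b + len(block)))
        start = end
    return T, W, B, log_A

print(block_statistics([0, 1, 3, 0, 5, 5], [3, 6]))
```

In this example W + B equals the total sum of squares of the nonzero values about their overall mean, which is a convenient sanity check.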

Next, we integrate out (μ₀, w, σ) in (2.14) using priors (2.7), (2.8) and (2.9).

f (X, Z | ρ)

(2.16)

\propto \frac{1}{w_{0}} \int_{0}^{w_{0}} \int σ^{- 2} \int f (X, Z | ρ, μ_{0}, w, σ) d μ_{0} d (σ^{2}) d w

(2.17)

\propto A \int_{0}^{w_{0}} \frac{w^{\frac{N - 1}{2}}}{{[W + w B]}^{\frac{T - 1}{2}}} d w .

(2.18)

To model multiple histone marks, we assume that X₁, …, X_M are independent vectors given the same block structure ρ. As calculated in (2.18), for each m,

f (X_{m}, Z_{m} | ρ) \propto A_{m} \int_{0}^{w_{0}} \frac{w^{\frac{N - 1}{2}}}{{[W_{m} + w B_{m}]}^{\frac{T_{m} - 1}{2}}} d w,

(2.19)

where a_m, b_m, W_m, B_m, T_m and A_m are the values for the m-th sequence of a, b, W, B, T and A defined above, Z_m is the vector of indicators determined by X_m, and Z_{k,m} is the k-th element of Z_m. Combining (2.13) and (2.19), we have

\begin{array}{l} f (ρ | X, ℤ) \propto \frac{1}{p_{0}} \int_{0}^{p_{0}} p^{N - 1} {(1 - p)}^{n - N} d p \\ \times \prod_{m = 1}^{M} A_{m} \times \prod_{m = 1}^{M} \int_{0}^{w_{0}} \frac{w^{\frac{N - 1}{2}}}{{[W_{m} + w B_{m}]}^{\frac{T_{m} - 1}{2}}} d w \end{array}

(2.20)

Although an exact implementation of this model is tractable, the calculations are O(n³). It is prohibitive to evaluate the posterior probability when n is large. We have implemented an MCMC approximation that greatly facilitates the estimation.

2.2. MCMC algorithm for BCP model inference

Following Barry and Hartigan (1993), for a partition ρ induced by U = (U₁, …, U_n), where U_i = 1 indicates a change point at position i + 1, the odds ratio for the conditional probability of a change point at the position i + 1 is:

\begin{array}{l} \frac{P (U_{i} = 1 | X, ℤ, U_{j}, j \neq i)}{P (U_{i} = 0 | X, ℤ, U_{j}, j \neq i)} \\ = \frac{\int_{0}^{p_{0}} p^{N} {(1 - p)}^{n - N - 1} d p \times \prod_{m = 1}^{M} A_{m}^{1} \int_{0}^{w_{0}} \frac{w^{\frac{N}{2}}}{{[W_{m}^{1} + w B_{m}^{1}]}^{\frac{T_{m} - 1}{2}}} d w}{\int_{0}^{p_{0}} p^{N - 1} {(1 - p)}^{n - N} d p \times \prod_{m = 1}^{M} A_{m}^{0} \int_{0}^{w_{0}} \frac{w^{\frac{N - 1}{2}}}{{[W_{m}^{0} + w B_{m}^{0}]}^{\frac{T_{m} - 1}{2}}} d w} \end{array}

where $W_{m}^{0}$ , $B_{m}^{0}$ , $W_{m}^{1}$ and $B_{m}^{1}$ are the within and between block sums of squares obtained for the m-th sequence when U_i = 0 and U_i = 1 respectively, and $A_{m}^{0}$ and $A_{m}^{1}$ are the values of A in (2.15) obtained for the m-th sequence when U_i = 0 and U_i = 1 respectively. Here N denotes the number of blocks when U_i = 0. The result is a direct consequence of (2.20).

We then approximate these integrals using incomplete beta functions:

\begin{array}{l} \frac{P (U_{i} = 1 | X, ℤ, U_{j}, j \neq i)}{P (U_{i} = 0 | X, ℤ, U_{j}, j \neq i)} \\ = \prod_{m = 1}^{M} ({(\frac{W_{m}^{1}}{B_{m}^{1}})}^{\frac{1}{2}} {(\frac{W_{m}^{0}}{B_{m}^{1}})}^{\frac{T_{m} - N - 2}{2}} {(\frac{B_{m}^{0}}{B_{m}^{1}})}^{\frac{N + 1}{2}}) \\ \times \frac{\prod_{m = 1}^{M} \int_{0}^{\frac{B_{m}^{1} w_{0} / W_{m}^{1}}{1 + B_{m}^{1} w_{0} / W_{m}^{1}}} x^{(N + 2) / 2} {(1 - x)}^{(T_{m} - N - 3) / 2} d x}{\prod_{m = 1}^{M} \int_{0}^{\frac{B_{m}^{0} w_{0} / W_{m}^{0}}{1 + B_{m}^{0} w_{0} / W_{m}^{0}}} x^{(N + 1) / 2} {(1 - x)}^{(T_{m} - N - 2) / 2} d x} \\ \times \frac{\int_{0}^{p_{0}} p^{N} {(1 - p)}^{n - N - 1} d p \times \prod_{m = 1}^{M} A_{m}^{1}}{\int_{0}^{p_{0}} p^{N - 1} {(1 - p)}^{n - N} d p \times \prod_{m = 1}^{M} A_{m}^{0}} . \end{array}

We initialize U_i to 0 for all i < n, with U_n = 1. We then update each U_i in turn by repeated passes through the data. A total of 500 passes were used in block identification.
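The sampling scheme can be sketched on a toy example. The code below is an illustrative implementation, not the authors' software: it evaluates the unnormalized posterior (2.20) by midpoint quadrature instead of the incomplete beta approximation, recomputes all statistics at every update (so it is only practical for short sequences), averages indicator values over all passes without burn-in, and adds a small constant inside the logarithm to guard against W = B = 0. All function names and default hyper-parameter values are assumptions.

```python
import math
import random

def log_integral(log_f, lo, hi, steps=200):
    """Log of the integral of exp(log_f) over (lo, hi); midpoint rule in log space."""
    h = (hi - lo) / steps
    vals = [log_f(lo + (k + 0.5) * h) for k in range(steps)]
    m = max(vals)
    return m + math.log(h * sum(math.exp(v - m) for v in vals))

def log_posterior(marks, U, p0=0.2, w0=0.2, a=1.0, b=1.0):
    """Unnormalized log of (2.20); U[i] = 1 marks a block ending at position i + 1."""
    n = len(U)
    ends = [i + 1 for i, u in enumerate(U) if u]
    N = len(ends)
    lp = log_integral(lambda p: (N - 1) * math.log(p) + (n - N) * math.log1p(-p), 0.0, p0)
    for x in marks:
        nonzero = [v for v in x if v != 0]
        T = len(nonzero)
        grand_mean = sum(nonzero) / T if T else 0.0
        W = B = 0.0
        start = 0
        for end in ends:
            nz = [v for v in x[start:end] if v != 0]
            if nz:
                mean = sum(nz) / len(nz)
                W += sum((v - mean) ** 2 for v in nz)
                B += len(nz) * (mean - grand_mean) ** 2
            n1, n0 = len(nz), (end - start) - len(nz)
            lp += math.lgamma(a + n0) + math.lgamma(b + n1) - math.lgamma(a + b + end - start)
            start = end
        # The 1e-12 term is a guard added for this sketch only.
        lp += log_integral(
            lambda w: 0.5 * (N - 1) * math.log(w) - 0.5 * (T - 1) * math.log(W + w * B + 1e-12),
            0.0, w0)
    return lp

def gibbs_pass(marks, U, rng, **hyper):
    """One pass: resample each U[i] from its conditional odds; U[n-1] stays fixed at 1."""
    for i in range(len(U) - 1):
        U[i] = 1
        lp1 = log_posterior(marks, U, **hyper)
        U[i] = 0
        lp0 = log_posterior(marks, U, **hyper)
        diff = max(-700.0, min(700.0, lp1 - lp0))      # clamp to keep exp() finite
        prob1 = 1.0 / (1.0 + math.exp(-diff))
        U[i] = 1 if rng.random() < prob1 else 0

def change_point_probabilities(marks, passes=20, seed=0, **hyper):
    """Fraction of passes in which each position ends a block."""
    n = len(marks[0])
    U = [0] * (n - 1) + [1]
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(passes):
        gibbs_pass(marks, U, rng, **hyper)
        for i, u in enumerate(U):
            counts[i] += u
    return [c / passes for c in counts]

toy = [[1.0, 1.1, 0.9, 1.0, 1.05, 9.0, 9.1, 8.9, 9.0, 9.05]]   # one mark, mean shift after position 5
probs = change_point_probabilities(toy)
print([round(q, 2) for q in probs])
```

A practical implementation would update W, B and A incrementally when a single indicator is flipped, rather than recomputing them from scratch as done here.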

3. Simulation studies

We first used simulated data to study the performance of the proposed method. The simulation assumed that there were 10 blocks and that six histone modification marks were observed at each of 2000 locations in the genome. The lengths of the 10 blocks ranged from 10 to 1500 (in simulation 1, shown in Figure 1, the lengths are 152, 10, 102, 416, 27, 799, 217, 22, 206 and 49). We use X_{(i:j),m} to denote the observed signal within a block from the (i + 1)-th to the j-th location for the m-th mark. We assumed that each component of X_{(i:j),m} followed the mixture distribution 0.2 ∗ N(μ_{(i:j),m}, 1) + 0.8 ∗ δ, where μ_{(i:j),m} was a random draw from U(−2, 2). These settings are based on the empirical observation that, for a specific histone mark, on average ∼20% of the genome displays binding peaks, with intensities ranging from −2 to 2 for the normalized data. To apply our method, we need to specify the values of the hyper-parameters p₀, w₀, a_m and b_m. In the simulation, we investigated the sensitivity of the results to these values by considering p₀ ∈ {0.1, 0.2, 0.3, 0.4}, w₀ ∈ {0.1, 0.2, 0.3, 0.4}, and (a_m, b_m) ∈ {(1, 1), (2, 2), (0.5, 0.5)}, for a total of 4 × 4 × 3 = 48 specifications of (p₀, w₀, a_m, b_m). We simulated 20 data sets. For each simulated data set, we ran 48 MCMC chains, each using one of the 48 hyper-parameter specifications described above. Change points were inferred to be those locations in the genome that had a posterior probability larger than 0.8 (the results were similar under different cutoff values).
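The simulation design can be reproduced in a few lines. The following sketch assumes that block means are drawn independently for each mark and block, as the description suggests; the function name `simulate_marks` and the seed are hypothetical.

```python
import random
from itertools import accumulate

def simulate_marks(block_lengths, n_marks=6, nonzero_prob=0.2, seed=0):
    """Draw n_marks sequences from the mixture 0.2 * N(mu, 1) + 0.8 * delta, block by block."""
    rng = random.Random(seed)
    marks = []
    for _ in range(n_marks):
        x = []
        for length in block_lengths:
            mu = rng.uniform(-2.0, 2.0)            # block- and mark-specific mean
            for _position in range(length):
                is_peak = rng.random() < nonzero_prob
                x.append(rng.gauss(mu, 1.0) if is_peak else 0.0)
        marks.append(x)
    return marks

LENGTHS = [152, 10, 102, 416, 27, 799, 217, 22, 206, 49]   # simulation 1 in the text
marks = simulate_marks(LENGTHS)
true_change_points = list(accumulate(LENGTHS))             # block end positions
nonzero_fraction = sum(v != 0 for x in marks for v in x) / (len(marks) * len(marks[0]))
print(true_change_points, round(nonzero_fraction, 3))
```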

Fig 1 — Simulation results. A. One example of simulated datasets with posterior probabilities inferred from proposed BCP model with p₀ = 0.1, w₀ = 0.1, a_m = b_m = 0.5 and from original BCP model using function bcp() in R package bcp. B. Jitter plot for precision and recall rates of BCP model with 48 different sets of hyper-parameters on 20 simulated datasets.

We then checked the precision and recall rates based on the true and inferred change points from the simulated data. The precision rate is defined as TP/(TP + FP), and the recall rate as TP/(TP + FN), where TP is the number of true positives (predicted block boundaries that are true), FP is the number of false positives (predicted boundaries that are not true), and FN is the number of false negatives (undiscovered true block boundaries). In our assessment, an inferred change point 3 units or less from one of the true change points was considered a true positive. As shown in Figure 1B, the overall posteriors are insensitive to the specified values of the hyperparameters p₀, a_m, b_m. Simulation studies also show that the proposed method is capable of identifying large blocks spanning over 1000 positions as well as small blocks of size around 10 (Figure 1). Moreover, the ability to identify zero-inflated blocks is substantially improved by the introduction of the mixture prior (Figure 1).
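The evaluation rule can be written compactly. This sketch counts an inferred change point as a true positive when it lies within the tolerance of any true change point, and a true change point as a false negative when no inferred point lies within the tolerance; how ties and multiple matches were handled in the original analysis is not stated, so this is one plausible reading. The function name is hypothetical.

```python
def precision_recall(true_cps, inferred_cps, tol=3):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) with a +/- tol matching window."""
    tp = sum(1 for c in inferred_cps if any(abs(c - t) <= tol for t in true_cps))
    fp = len(inferred_cps) - tp
    fn = sum(1 for t in true_cps if not any(abs(c - t) <= tol for c in inferred_cps))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Two of three inferred boundaries fall within 3 positions of a true boundary.
print(precision_recall([100, 200, 300], [101, 150, 298]))
```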

4. Application to modENCODE epigenome Data

All data used in this analysis were generated by the modENCODE project (Table 1). Specifically, we used pre-processed enrichment scores of 18 histone marks in S2 cells from the study “Genomic Distributions of Histone Modifications”; the S2 cell transcriptome data came from the study “Paired End RNA-Seq of Drosophila Cell Lines”; and the transcriptome data for 9 different developmental stages were drawn from the study “Developmental Stage Timecourse Transcriptional Profiling with RNA-Seq”. To identify and characterize blocks from histone marks, we divided the Drosophila melanogaster genome into 1000-bp bins and calculated the enrichment level for each bin by averaging the log2 intensity values of each mark. The transcription level (in S2 cells and at different developmental stages) was calculated by averaging read counts from replicates.

Table 1.

Overview of modENCODE data that were used in this study

modENCODE Experiment	Method	Cell Line or Tissue Type	Sample
Genomic Distributions of Histone Modifications	ChIP-chip/ChIP-seq	S2-DRSC, ML-DmBG3-c2	H3K18ac, H3K23ac, H3K27Ac, H3K27Me3, H3K36me1, H3K36me3, H3K4Me3, H3K4me1, H3K4me2, H3K79Me2, H3K79Me1, H3K9ac, H3K9me2, H3K9me3, H4AcTetra, H4K16ac, H4K5ac, H4K8ac
Transcriptional profiling of Drosophila cell lines	RNA-seq	S2-DRSC
Developmental Stage Timecourse Transcriptional Profiling	RNA-seq	Embryo 10–12h, White Pre-pupae 24h, Larvae L1, Adult Female Eclosure 1d

4.1. Identification of chromatin blocks based on histone modifications

We applied the proposed method to 18 histone methylation and acetylation marks in S2 cells. Change points with posterior probability greater than 0.75 were defined as block boundaries. Because chromosome X is distinguished by the high level enrichment of H4K16ac and H3K36me3 from other chromosomes (Kharchenko et al. 2011), we applied our model to autosomes only.

A total of 994 blocks (called BLOCKs; Table S1) were inferred from chromosomes 2L, 2R, 3L and 3R, with 90% of the blocks ranging in size from 21kb to 247kb and a median size of 70kb. We observed that BLOCKs captured the combinatorial pattern of histone modifications and reflected local transcriptional activities. We use chr2L:4142–5520kb as an example to illustrate this (Figure 2). For simplicity, we only show the enrichment levels of several chromatin signatures, including the transcription activation marks H3K4me3 and H3K9ac and the transcription repression marks H3K9me3 and H3K27me3 (see Figure 4 for an example of all marks). PolII enrichment and RNA-seq counts at log10 scale are shown as a reference for transcriptional activity. Compared with the “chromatin states” annotation for non-overlapping 200bp windows in the genome (Kharchenko et al. 2011) (Figure 2C), BLOCKs depict the genome as local domains at a larger scale. We divided BLOCKs into five groups based on size quantiles (≤ 5%, 6% ∼ 35%, 36% ∼ 65%, 66% ∼ 95%, ≥ 96%) and examined the distribution of transcription activity in each group (Figure 3E). Transcription activity does not show a systematic bias as a function of block size.

Fig 2 — BLOCKs inferred from multiple histone marks in Drosophila melanogaster S2 cells. A. An overview of the BLOCKs in S2 cells with average transcriptional levels shown in gradient. B. An example of BLOCK characterization at a specific locus on chromosome 2L. BLOCK boundaries are shown as solid black lines. The enrichment levels of several chromatin signatures are shown at 1kb resolution, including the transcription activation marks H3K4me3 and H3K9ac and the transcription repression marks H3K9me3 and H3K27me3. PolII and RNA-seq counts at log10 scale are shown as a reference for transcription activity. C. “Chromatin states” annotation from Kharchenko et al. (2011).

Fig 4 — A. The alignment of BLOCKs (S2 cells) with physical domains in Sexton et al. (2012) (embryonic nuclei cells). B. A comparison of ChromHMM, BLOCKs and physical domains at a locus on chromosome 2L (8Mb–12Mb). BLOCK boundaries are shown as vertical gray lines.

Fig 3 — BLOCK characterization. A. A locus on chromosome 2R with BLOCKs displaying diverse sizes, gene densities and transcription activities (the corresponding “chromatin states” annotation from Kharchenko et al. (2011) is shown on the top). B. Transcription activity vs. gene density with block size shown in gradient. C. Boxplot of gene density for the five block size quantiles. D. Boxplot of transcription activity for the five block size quantiles.

4.2. BLOCK boundaries are potentially physical domain boundaries

A recently published high-resolution chromosomal contact map of Drosophila embryonic nuclei (Sexton et al. 2012) shows that the entire genome is linearly partitioned into well-demarcated physical domains. We therefore studied the link between physical domains and BLOCKs inferred from histone marks. A total of 994 physical domains were identified from Drosophila embryonic nuclei (Sexton et al. 2012) on chromosomes 2L, 2R, 3L and 3R, with sizes ranging from 10kb to 823kb and a median of 60kb. We observed a strong association between physical domain boundaries and BLOCK boundaries. For example, 36% of BLOCK boundaries are within 10kb of physical domain boundaries, whereas this proportion never exceeds 26% in 1000 randomized block partitions; likewise, 56% of BLOCK boundaries are within 20kb of physical domain boundaries, whereas this proportion never exceeds 42% in 1000 randomized block partitions.
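The boundary comparison can be sketched as follows. The randomization scheme used in the analysis is not described in detail; the sketch assumes that each randomized partition places the same number of boundaries uniformly at random along a chromosome of given length, and all names and the toy coordinates (in kb) are hypothetical.

```python
import bisect
import random

def fraction_near(boundaries, reference, window):
    """Fraction of boundaries lying within `window` of the nearest reference boundary."""
    ref = sorted(reference)
    hits = 0
    for pos in boundaries:
        k = bisect.bisect_left(ref, pos)
        nearest = min(abs(pos - ref[j]) for j in (k - 1, k) if 0 <= j < len(ref))
        hits += nearest <= window
    return hits / len(boundaries)

def randomized_fractions(n_boundaries, length, reference, window, n_perm=1000, seed=0):
    """Same statistic for n_perm random partitions (assumed uniform boundary placement)."""
    rng = random.Random(seed)
    return [fraction_near(rng.sample(range(1, length), n_boundaries), reference, window)
            for _ in range(n_perm)]

# Toy coordinates in kb, for illustration only.
domain_bounds = [120, 480, 900, 1500, 2300]
block_bounds = [115, 500, 1210, 1495, 2310]
observed = fraction_near(block_bounds, domain_bounds, window=10)
null = randomized_fractions(len(block_bounds), 2500, domain_bounds, window=10, n_perm=200)
print(observed, max(null))
```

Comparing the observed fraction with the maximum over randomized partitions mirrors the "never exceeds" statement in the text.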

In Sexton et al. (2012), the authors grouped physical domains into four epigenetic classes based on the enrichment of epigenetic marks. Of the four classes, transcriptionally “Active” domains are associated with H3K4me3, H3K36me3, and hyperacetylation; “PcG” domains are associated with the mark H3K27me3; the “HP1/Centromere” class is associated with HP1; and “Null” domains are not enriched for any available marks. We explored whether BLOCKs can be aligned to the classification in Sexton et al. (2012). We assigned the four classes to BLOCKs based on enrichment of H3K4me3, H3K27me3 and HP1a. Specifically, BLOCKs with average HP1a intensity greater than 1 and coverage greater than 10% are classified as “HP1/Centromere” domains, BLOCKs with average H3K27me3 intensity greater than 0.5 and coverage greater than 25% are classified as “PcG” domains, BLOCKs with average H3K4me3 intensity greater than 1 and coverage greater than 25% are classified as “Active” domains, and all remaining BLOCKs are classified as “Null” domains. Figure 4 shows the alignment between BLOCKs and physical domains with epigenetic classes. Of the 93835 genomic bins annotated by both BLOCKs and chromHMM, 62987 have the same assignment, leading to a Jaccard index of 0.5. The high concordance between BLOCKs and physical domains suggests that BLOCKs link epigenetic domains with topological domains. The differences may be introduced by the techniques, data quality and cell types used in the two studies.
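The classification thresholds and the agreement statistic can be illustrated as below. Two points are assumptions of this sketch: the rules are applied in the order in which they are listed (so a BLOCK satisfying several rules takes the first matching class), and the Jaccard index is computed by treating each annotation as a set of (bin, class) pairs, which is one reading that reproduces the reported value of about 0.5. Function names are hypothetical.

```python
def classify_block(hp1a, h3k27me3, h3k4me3):
    """Assign an epigenetic class; each argument is (average intensity, coverage fraction)."""
    if hp1a[0] > 1 and hp1a[1] > 0.10:
        return "HP1/Centromere"
    if h3k27me3[0] > 0.5 and h3k27me3[1] > 0.25:
        return "PcG"
    if h3k4me3[0] > 1 and h3k4me3[1] > 0.25:
        return "Active"
    return "Null"

def jaccard_from_agreement(n_agree, n_bins):
    """Jaccard index of two labelings of the same n_bins bins that agree on n_agree bins.

    Each labeling is viewed as a set of n_bins (bin, class) pairs, so the
    intersection has n_agree elements and the union has 2 * n_bins - n_agree.
    """
    return n_agree / (2 * n_bins - n_agree)

print(classify_block((0.2, 0.05), (0.8, 0.40), (0.3, 0.10)))
print(round(jaccard_from_agreement(62987, 93835), 3))
```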

Another piece of indirect evidence for BLOCKs as physical domains is their consistency with replication timing. Replication timing refers to the order in which segments of DNA along the length of a chromosome are duplicated. Since the packaging of DNA with proteins into chromatin takes place immediately after the DNA is duplicated, replication timing reflects the order of assembly of chromatin. Recent studies suggest that late-replicating regions generically define not only a repressed but also a physically segregated nuclear compartment. Thus replication timing is a manifestation of the spatial organization of the chromosome. To investigate the association of BLOCKs with replication timing, we compared BLOCKs with the meta peaks of replication origins (10kb to 285kb) from cell lines BG3, Kc and S2 analyzed by the modENCODE project. We observed that 69% of meta peaks are within 20kb of BLOCK boundaries. This statistic agrees well with that for physical domains, since we observed that 60% of meta peaks are within 20kb of the physical domain boundaries characterized in Sexton et al. (2012).

4.3. Functional relevance of BLOCKs

To investigate whether BLOCKs represent domains of functional importance, we performed three different analyses. First, we checked whether genes within each BLOCK tended to be co-regulated, using transcriptomes of L1 larvae and 10–12h embryos measured by RNA-seq. A total of 11376 FlyBase genes were used in our analysis. When a gene had multiple isoforms, the longest one was used. We defined the following rules to describe the status change of each gene between the L1 larvae stage (or 10–12h embryo) and S2 cells: genes whose expression increased by more than 2 fold, with expression levels not below 10, were categorized as “up-regulated”; those with a fold change of less than 0.5, with expression levels not below 10, as “down-regulated”; and all others as “no-change”. To examine whether each BLOCK is enriched for genes with a specific status, we used as a test statistic the proportion of blocks in which the dominant status accounted for at least 50% of the genes. The percentage of BLOCKs in which the dominant status accounted for more than 50% of the genes was 71.8% and 67.6% for L1 larvae and 10–12h embryo, respectively, with 55.4% of the BLOCKs shared between the two comparisons. These observed statistics reach statistical significance when tested against randomly permuted blocks. For physical domains in Sexton et al. (2012), we observed 68% and 65.8% with dominant co-regulation for L1 larvae and 10–12h embryo, respectively.
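The status rules and the test statistic can be sketched as follows. The text leaves two details open, so the sketch makes explicit assumptions: the fold change is taken as stage expression over S2 expression, and the expression filter requires the larger of the two values to be at least 10. Function names and the toy data are hypothetical.

```python
from collections import Counter

def gene_status(stage_expr, s2_expr, min_expr=10.0):
    """Classify a gene by its change between a developmental stage and S2 cells."""
    if max(stage_expr, s2_expr) < min_expr:        # assumed form of the expression filter
        return "no-change"
    if stage_expr > 2 * s2_expr:                   # fold change > 2
        return "up-regulated"
    if stage_expr < 0.5 * s2_expr:                 # fold change < 0.5
        return "down-regulated"
    return "no-change"

def dominant_fraction(blocks, threshold=0.5):
    """Fraction of non-empty blocks whose most common status covers >= threshold of genes."""
    non_empty = [statuses for statuses in blocks if statuses]
    hits = 0
    for statuses in non_empty:
        top_count = Counter(statuses).most_common(1)[0][1]
        hits += top_count / len(statuses) >= threshold
    return hits / len(non_empty)

# Toy (stage, S2) expression pairs grouped by block.
toy_blocks = [[(50, 10), (80, 30), (4, 20)], [(12, 10), (3, 1), (5, 40), (90, 20)]]
statuses = [[gene_status(s, c) for s, c in block] for block in toy_blocks]
print(statuses, dominant_fraction(statuses))
```

Significance would then be assessed by recomputing `dominant_fraction` on permuted blocks, as described in the text.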

Second, we asked whether genes within each BLOCK tended to have similar biological functions. We tested for the enrichment of Gene Ontology (GO) categories within each BLOCK using the hypergeometric test with Bonferroni correction. Of the 805 BLOCKs with more than 2 genes, 412 (51.2%) were enriched for at least one GO category at a 0.05 cutoff, and 1172 GO categories in total were enriched (Table S2). The observed numbers of GO-enriched BLOCKs and enriched GO categories were both significantly higher than those from permuted blocks. We further asked which biological processes or functions involve genes that are significantly linearly juxtaposed. We found that 86.4% (108/125) of chromatin assembly or disassembly genes (GO:0006333) in Drosophila were juxtaposed within a BLOCK located on chr2L: 21344–21579kb, with a striking p-value of 3.3 × 10⁻²³⁵. Genes in chitin-based cuticle development (GO:0040003), body morphogenesis (GO:0010171) and proteinaceous extracellular matrix (GO:0005578) were found to be significantly clustered, with over 70% of the genes in one BLOCK sharing the same function.
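The enrichment test can be illustrated with exact integer arithmetic. The sketch below computes the upper-tail hypergeometric probability and a Bonferroni adjustment; the number of tests used for the correction in the analysis is not stated, so `n_tests` is left as a free parameter, and the function names and toy counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, block_size, category_size, n_genes):
    """P(X >= k) when block_size genes are drawn without replacement from n_genes,
    of which category_size belong to the GO category."""
    upper = min(block_size, category_size)
    total = comb(n_genes, block_size)
    tail = sum(comb(category_size, x) * comb(n_genes - category_size, block_size - x)
               for x in range(k, upper + 1))
    return tail / total

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Toy example: 8 of 20 genes in a block belong to a category of 50 genes out of 11376.
p = hypergeom_pvalue(8, 20, 50, 11376)
print(p, bonferroni(p, n_tests=1000))
```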

Third, we reasoned that if BLOCKs reflected coordinated regulation of genes with relevant biological functions, we would expect BLOCKs enriched in developmentally specific GO categories to show large variation across different developmental stages, while BLOCKs enriched in “house-keeping” GO categories would display limited fluctuations. We ranked the BLOCKs based on the standard deviation of their transcription level across 9 different developmental stages (Table S3 and S4). BLOCKs with the top 20% largest deviations and the 20% smallest deviations were checked for their GO enrichment respectively, and were then listed in Tables S2 and S3 in order of statistical significance. Notably, in BLOCKs displaying the most striking changes across different developmental stages, we found GO categories associated with development-specific biological processes or functions, such as heart development, structural constituent of chitin-based cuticle, positive regulation of muscle organ development, and midgut development, among others. Moreover, metabolism-related functions, such as serine-type endopeptidase activity and peptidyl-dipeptidase activity, display turnover across developmental transcriptomes and are among the top of our list. GO categories associated with “house-keeping” functions, such as transferase activity, aminoacylase activity, chromatin assembly and insulin receptor binding, showed limited fluctuations through development. This result provides further evidence for the role of BLOCKs in coordinated regulation.

4.4. Comparison with ChromHMM

In this subsection, we compare the results from our method with those from a popular HMM-based method, ChromHMM. We applied ChromHMM to the same dataset (18 histone modifications, 1kb bins, S2 cells). The data were binarized to meet ChromHMM’s input requirement: all intervals with intensities greater than 0 were set to 1 and the remaining intervals were set to 0. To obtain blocks at coarse levels, we explored ChromHMM models by varying the pre-specified number of hidden states (from 3 to 18). We observed that a smaller number of hidden states tended to produce larger blocks. Here we report ChromHMM models with 3 to 5 hidden states. The model with 3 hidden states generates 12517 segments, the model with 4 hidden states generates 9157 segments, and the model with 5 hidden states generates 12444 segments. For each ChromHMM model, segment sizes range from 2kb (5% quantile) to 26kb (95% quantile), with a median of 5kb. The distributions of segment sizes from the ChromHMM models and from BLOCKs are visualized in Figure 5. Because even these coarsest ChromHMM models yield thousands of short segments, our model has advantages over HMM models in characterizing histone modification patterns at coarse levels.
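The binarization rule and the way segment sizes are read off a state sequence can be sketched as below. This example does not run ChromHMM; it assumes that a per-bin state label is already available and that a segment is a maximal run of bins sharing one label. The function names and toy values are hypothetical.

```python
from itertools import groupby
from statistics import median

BIN_KB = 1  # bin width used in the comparison (1kb bins)

def binarize(track):
    """Set intensities greater than 0 to 1 and all others to 0."""
    return [1 if x > 0 else 0 for x in track]

def segment_sizes_kb(states, bin_kb=BIN_KB):
    """Sizes (in kb) of maximal runs of identical state labels."""
    return [sum(1 for _ in run) * bin_kb for _, run in groupby(states)]

# Toy intensities for one mark and a toy state sequence over 10 bins.
signal = [0.0, 1.3, -0.2, 2.1]
states = [1, 1, 1, 2, 2, 1, 3, 3, 3, 3]

binary = binarize(signal)
sizes = segment_sizes_kb(states)
print(binary, sizes, median(sizes))
```

Summaries such as the median or the 5% and 95% quantiles of `sizes` are what a size comparison between methods would be based on.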

Fig 5. Boxplot of the sizes of segments identified using different methods: physical domains in embryonic nuclei identified using Hi-C data (Sexton et al. 2012), ChromHMM with 5, 4 and 3 hidden states, and BLOCKs with posterior probability greater than 0.75 and 0.25.

4.5. How robust is the result?

The BCP model used in this paper assumes that different histone marks are independent. However, some histone marks, such as H3K4me3 and H3K4me2, are highly correlated with each other. Moreover, redundancy and exclusivity are known to exist between active and repressive marks. To explore how the choice of input histone marks affects the result, we performed the change point analysis with 4, 7 and 10 input marks, respectively. The marks for each model were selected based on their correlation across the entire genome. As shown in Figure 6A, the marks fall into 7 main groups according to their correlation patterns: the first group consists of H3K9me2 and H3K9me3; the second group is featured by H3K36me3 and H3K79me1; the third group consists of H4K5ac, H3K18ac, H4K8ac, H3K27ac, H4Ac, H3K36me1 and H3K4me1; the fourth group is featured by H3K79me2, H3K9ac, H3K4me3 and H3K4me2; and H4K16ac, H3K23ac and H3K27me3 each form a separate group. For the 7 marks model, we selected one mark from each of the 7 groups, giving the input marks H3K18ac, H3K23ac, H3K27me3, H3K36me3, H3K4me3, H3K9me2 and H4K16ac. For the 10 marks model, we added H4, H3K79me2 and H3K9ac to the 7 marks model. For the 4 marks model, we excluded H3K18ac, H3K36me3 and H4K16ac from the 7 marks model. The 4 marks, 7 marks and 10 marks models identified 698, 868 and 927 blocks, respectively. We observed high consistency between these results and the reported BLOCKs obtained with 18 marks. For example, 84% of boundaries from the 10 marks model are within 20kb of BLOCK boundaries, and 84% of boundaries from the 7 marks model are within 20kb of BLOCK boundaries (see Figure 6B for other comparisons).

Fig 6. A. Genome-wide correlation plot for 18 histone marks in S2 cells. The marks are ordered based on the result of hierarchical clustering. B. Comparison of models with different input histone marks. Each off-diagonal element is the percentage of boundaries (within 20kb) shared by a pair of models. Each diagonal element is the number of boundaries shared with the physical domain boundaries in Sexton et al. (2012) (abbreviated as PD) / the number of segments detected by that model.
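The boundary comparison can be illustrated with a short sketch that computes the fraction of one model's boundaries lying within 20kb of any boundary from another model. It assumes boundary positions are given in kb on a single chromosome; the function name and the toy coordinates are hypothetical, and the measure as written is directional (query against reference), which may differ from how the figure was computed.

```python
from bisect import bisect_left

def fraction_within(query, reference, tol_kb=20):
    """Fraction of query boundaries within tol_kb of some reference boundary."""
    ref = sorted(reference)

    def near(pos):
        # Only the reference boundaries flanking pos can be the closest ones.
        i = bisect_left(ref, pos)
        return any(abs(ref[j] - pos) <= tol_kb
                   for j in (i - 1, i) if 0 <= j < len(ref))

    return sum(near(q) for q in query) / len(query)

# Toy boundary positions in kb (hypothetical).
reference = [100, 500, 900]
query = [110, 485, 700, 915]
shared = fraction_within(query, reference)
print(shared)
```

For a genome-wide figure, the same count would be accumulated per chromosome before dividing by the total number of query boundaries.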

To investigate how the posterior probability cutoff affects the characterization of BLOCKs, we varied the threshold and checked the distribution of block sizes. The results were rather stable under different cutoff values. When the cutoff was set as low as 0.25, only 2 new boundaries were added, leading to a total of 996 blocks.
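The effect of the cutoff can be sketched as follows. This example assumes that the sampler output has been summarized as one posterior probability per gap between adjacent bins, and that a boundary is called wherever this probability exceeds the cutoff; the function name and the toy probabilities are hypothetical.

```python
def call_blocks(post, n_bins, cutoff=0.75):
    """Turn per-gap change point probabilities into blocks.

    post[i] is the posterior probability of a change point between
    bin i and bin i + 1, so len(post) == n_bins - 1.
    Returns half-open (start_bin, end_bin) intervals.
    """
    cuts = [i + 1 for i, p in enumerate(post) if p > cutoff]
    edges = [0] + cuts + [n_bins]
    return list(zip(edges[:-1], edges[1:]))

# Toy posterior probabilities for 8 bins (7 gaps).
post = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.1]
strict = call_blocks(post, n_bins=8, cutoff=0.75)
loose = call_blocks(post, n_bins=8, cutoff=0.25)
print(len(strict), strict)
print(len(loose), loose)
```

When most probabilities sit close to 0 or 1, as in this toy input, lowering the cutoff adds few boundaries, which is the kind of stability described above.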

5. Discussion

5.1. Methodological comparisons

Our BCP model was developed with a different purpose from existing methods for analyzing combinatorial patterns of histone marks. For example, ChromaSig (Hon, Ren and Wang 2008) was designed to uncover potential regulatory elements by searching for chromatin signatures that occur frequently across the genome. Spatial clustering (Jaschek and Tanay 2009) identified novel patterns of local co-occurrence among histone modifications by imposing a spatial K-clustering solution on an HMM. Segway (Hoffman et al. 2012), based on dynamic Bayesian networks, achieved a breakthrough in precision and resolution in finding known elements and in handling missing data compared to HMM-based approaches. The most recent method of this kind, ChAT (Wang, Lunyak and Jordan 2012), extends the characterization of chromatin signatures through an inherent statistical criterion for classification. All these methods try to detect chromatin signatures associated with a variety of small functional elements. To the best of our knowledge, our model is the first effort to examine histone marks at coarse scales, although no explicit constraint is placed on block size. By modeling zero and nonzero signals separately, our model implicitly captures local enrichment patterns of different sizes, which is an improvement over the existing ad hoc merging strategy (Wang, Lunyak and Jordan 2012).

BCP differs substantially from several previously described studies that subdivide the genome at the “domain level”. de Wit et al. (2008) identified nested chromatin domain structure through a statistical test of each chromatin component. Their chromatin domains are specific to each component or factor, whereas our approach captures domains with combinatorial patterns of multiple factors. Thurman et al. (2007) used a simple two-state HMM to segment the ENCODE regions into active and repressed domains based on multiple tracks of functional genomic data, including activating and repressive histone modifications, RNA output, and DNA replication timing. By using wavelet smoothing, their method focuses on a single scale at a time (Lian et al. 2008). In contrast, our analysis focuses on histone modifications only and simultaneously captures enrichment patterns over different scales. BCP is most similar to a four-state CPM model proposed to characterize chromatin accessibility based solely on tiled microarray DNaseI sensitivity data (Lian et al. 2008). Both methods formulate the segmentation of the genome as a change point detection problem. However, the two methods differ in several respects. First, CPM is still a hidden-state model, with transition probabilities imposed on segments rather than on the equal-sized bins of an HMM, whereas BCP is hidden-state free with emphasis on local patterns. Second, the four-state CPM model was developed to interpret single-track DNaseI array data, while our method examines multivariate histone modification data. Third, CPM models the DNaseI signal as a continuous mixture of Gaussians at each state, whereas we model the histone binding signal with a zero-inflated Gaussian mixture because binding events are spatially sparse.
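To make the idea of zero inflation concrete, the sketch below evaluates the log-likelihood of a segment under a deliberately simplified model: a point mass at zero combined with a single Gaussian for nonzero signals. This is an illustration under stated assumptions, not the likelihood used by BCP, which involves a Gaussian mixture; the function name and parameter names are hypothetical.

```python
import math

def zig_loglik(xs, p_zero, mu, sigma):
    """Log-likelihood of xs under a simplified zero-inflated Gaussian.

    Each value is exactly 0 with probability p_zero; otherwise it is drawn
    from Normal(mu, sigma^2). Assumption: one Gaussian component only.
    """
    ll = 0.0
    for x in xs:
        if x == 0:
            ll += math.log(p_zero)
        else:
            ll += (math.log(1.0 - p_zero)
                   - 0.5 * math.log(2.0 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2.0 * sigma ** 2))
    return ll

# A sparse toy segment: mostly zeros with one enriched bin.
segment = [0.0, 0.0, 0.0, 0.0, 2.0]
ll_sparse = zig_loglik(segment, p_zero=0.8, mu=2.0, sigma=1.0)
ll_dense = zig_loglik(segment, p_zero=0.2, mu=2.0, sigma=1.0)
print(ll_sparse, ll_dense)
```

The sparse parameterization explains the mostly-zero segment better than the dense one, which is the behavior that separate modeling of zero and nonzero signals is meant to capture.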

5.2. Summary and future directions

In this paper, we have developed a novel multivariate BCP model to partition the genome into contiguous blocks based on histone modifications. It could be extended to analyze ChIP-sequencing data, or applied to other studies whose task is to partition multiple zero-inflated observation tracks. Our model presents a new approach to examining combinatorial histone marks. Not only are histone marks signatures for functional elements (Kharchenko et al. 2011, Ernst and Kellis 2010); our results from the D. melanogaster S2 cell genome suggest that they are also roadmaps for chromatin organization at coarse scales.

It is worthwhile to further investigate whether BLOCKs and topological domains are substantively different, or whether BLOCKs merely re-describe topological domains based on histone marks. Besides the differences introduced by techniques, data quality and cell types, we believe there are two other possible reasons for the imperfect alignment between BLOCKs and physical domains: 1) the partition is not saturated given the current profile of histone modifications; 2) the equal weight assigned to different histone modifications in the partition limits the identification of finer domains (a drawback of all current approaches).

It has become increasingly clear that functionally related genes are often located next to one another in the linear genome (Sproul, Gilbert and Bickmore 2005), resembling DNA operons in bacteria (Chen et al. 2012, Keene 2007). This proximity is essential for coordinated gene regulation. Genome-wide expression analyses have identified many clusters of co-expressed genes during Drosophila development (Lee and Sonnhammer 2003, Yi, Sze and Thon 2007), such as the Hox gene clusters (Duboule 2007). One mechanism for this coordinated regulation is that these genes are organized into a chromatin domain that acts as a regulatory unit through epigenetic mechanisms (Kosak and Groudine 2004, Sproul, Gilbert and Bickmore 2005). Several such chromatin domains have already been characterized (Kosak and Groudine 2004, Tolhuis et al. 2006, Pickersgill et al. 2006, Orlando and Paro 1993). In this study, we illustrated the widespread existence of these chromatin domains as BLOCKs identified by histone marks.

Last but not least, although we have shown that a substantial portion of BLOCKs can potentially act as regulatory units, this is likely still an underestimate. Firstly, our BLOCKs were identified based on combinatorial patterns of 18 histone marks from the S2 epigenome. We do not know how many histone marks in total are sufficient to saturate the segmentation. It is likely that more marks, including potentially undiscovered ones, will be needed to get a complete view of the epigenetic landscape. Over 100 histone marks have been discovered, yet many of them are mutually exclusive or correlated. Future studies addressing relationships among histone marks will give us more insight into this open question. It is also important to develop block identification methods that can accommodate the dependency structure among marks. Secondly, when evaluating expression of genes within an individual BLOCK, we used developmental transcriptomes from Drosophila tissues rather than S2 cells; these present only a weighted average over the varying BLOCKs of different cell types within each developmental stage. In reality, each cell type is likely to have its own distinct pattern of BLOCKs. Thirdly, plasticity in chromosomal modifications has been shown in several reports (Riddle et al. 2011, Eaton et al. 2011, modENCODE Consortium 2010). Thus we would expect BLOCKs to be dynamic structures, and the percentage of BLOCKs with a tendency toward co-regulation might be even higher if this plasticity were taken into account. This conjecture could be tested when more histone mark data across developmental stages become available. Fourthly, because knowledge of gene functions in the GO database (as well as others) is incomplete and inaccurate (Khatri, Sirota and Butte 2012), many BLOCKs with functional relevance likely do not stand out simply because supporting information does not exist yet. Finally, coordinated regulation is a complex process accomplished by miRNAs, transcription factors and other regulatory elements, with feedback effects on chromatin organization. Further analysis of the binding sites of regulatory elements and their interplay with genes within BLOCKs will shed more light on the underlying mechanism.

Supplementary Material

Supp 1

Figure S1. Number of enriched regions of 46 histone marks and non-histone chromosomal proteins from modENCODE project.

Table S1. BLOCKs identified by BCP in S2 cells using posterior probability cutoff 0.75.

Table S2. Gene lists in GO enriched BLOCKs in S2 cell.

Table S3. BLOCKs with the top 20% largest deviations in the transcription across 9 different developmental stages.

Table S4. BLOCKs with the top 20% smallest deviations in the transcription across 9 different developmental stages.

NIHMS755035-supplement-Supp_1.csv^{(44.8KB, csv)}

Supp 2

NIHMS755035-supplement-Supp_2.csv^{(115.1KB, csv)}

Supp 3

NIHMS755035-supplement-Supp_3.csv^{(36KB, csv)}

Supp 4

NIHMS755035-supplement-Supp_4.pdf^{(450.9KB, pdf)}

Supp 5

NIHMS755035-supplement-Supp_5.csv^{(34.3KB, csv)}

Acknowledgments

We thank the reviewers for their constructive comments and Chao Gao for discussion.

Contributor Information

Mengjie Chen, Email: mengjie@email.unc.edu, Department of Biostatistics and Genetics, University of North Carolina, Chapel Hill, NC 27599.

Haifan Lin, Email: haifan.lin@yale.edu, Yale Stem Cell Center, Yale School of Medicine, New Haven, CT 06520.

Hongyu Zhao, Email: hongyu.zhao@yale.edu, Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520.

References

Allis D. Epigenetics. CSHL Press; 2007.
Barry D, Hartigan JA. Product Partition Models for Change Point Problems. The Annals of Statistics. 1992;20:260–279.
Barry D, Hartigan JA. A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association. 1993;88:309–319.
Chen D, Zheng W, Lin A, Uyhazi K, Zhao H, Lin H. Pumilio 1 Suppresses Multiple Activators of p53 to Safeguard Spermatogenesis. Current Biology. 2012;22:420–425. doi: 10.1016/j.cub.2012.01.039.
de Wit E, Braunschweig U, Greil F, Bussemaker HJ, van Steensel B. Global Chromatin Domain Organization of the Drosophila Genome. PLoS Genetics. 2008;4:e1000045. doi: 10.1371/journal.pgen.1000045.
Duboule D. The rise and fall of Hox gene clusters. Development. 2007;134:2549–2560. doi: 10.1242/dev.001065.
Eaton ML, Prinz JA, MacAlpine HK, Tretyakov G, Kharchenko PV, et al. Chromatin signatures of the Drosophila replication program. Genome Res. 2011;21:164–174. doi: 10.1101/gr.116038.110.
Erdman C, Emerson JW. A fast Bayesian change point analysis for the segmentation of microarray data. Bioinformatics. 2008;24:2143–2148. doi: 10.1093/bioinformatics/btn404.
Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology. 2010;28:817–826. doi: 10.1038/nbt.1662.
Filion GJ, van Bemmel JG, Braunschweig U, Talhout W, Kind J, et al. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell. 2010;143:212–224. doi: 10.1016/j.cell.2010.09.009.
Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods. 2012;9:473–476. doi: 10.1038/nmeth.1937.
Hon G, Ren B, Wang W. ChromaSig: A Probabilistic Approach to Finding Common Chromatin Signatures in the Human Genome. PLoS Computational Biology. 2008;4(10):e1000201. doi: 10.1371/journal.pcbi.1000201.
Jaschek R, Tanay A. Spatial Clustering of Multivariate Genomic and Epigenomic Information. Research in Computational Molecular Biology. 2009;5541:170–183.
Keene JD. RNA regulons: coordination of post-transcriptional events. Nature Reviews Genetics. 2007;8:533–543. doi: 10.1038/nrg2111.
Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011;471:480–485. doi: 10.1038/nature09725.
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012;8:e1002375. doi: 10.1371/journal.pcbi.1002375.
Kosak ST, Groudine M. Gene order and dynamic domains. Science. 2004;306:644–647. doi: 10.1126/science.1103864.
Lee JM, Sonnhammer ELL. Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 2003;13:875–882. doi: 10.1101/gr.737703.
Lian H, Thompson WA, Thurman R, Stamatoyannopoulos JA, Noble WS, et al. Automated mapping of large-scale chromatin structure in ENCODE. Bioinformatics. 2008;24(17):1911–1916. doi: 10.1093/bioinformatics/btn335.
modENCODE Consortium, T. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787. doi: 10.1126/science.1198374.
Orlando V, Paro R. Mapping Polycomb-repressed domains in the bithorax complex using in vivo formaldehyde cross-linked chromatin. Cell. 1993;75:1187–1198. doi: 10.1016/0092-8674(93)90328-n.
Pickersgill H, Kalverda B, de Wit E, Talhout W, Fornerod M, van Steensel B. Characterization of the Drosophila melanogaster genome at the nuclear lamina. Nature Genetics. 2006;38:1005–1014. doi: 10.1038/ng1852.
Riddle NC, Minoda A, Kharchenko PV, Alekseyenko AA, Schwartz YB, et al. Plasticity in patterns of histone modifications and chromosomal proteins in Drosophila heterochromatin. Genome Res. 2011;21:147–163. doi: 10.1101/gr.110098.110.
Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, et al. Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. Cell. 2012;148:1–15. doi: 10.1016/j.cell.2012.01.010.
Sproul D, Gilbert N, Bickmore WA. The role of chromatin structure in regulating the expression of clustered genes. Nat Rev Genet. 2005;6:775–781. doi: 10.1038/nrg1688.
Thurman RE, Day N, Noble WS, Stamatoyannopoulos JA. Identification of higher-order functional domains in the human ENCODE regions. Genome Research. 2007;17:917–927. doi: 10.1101/gr.6081407.
Tolhuis B, Muijrers I, de Wit E, Teunissen H, Talhout W, van Steensel B, van Lohuizen M. Genome-wide profiling of PRC1 and PRC2 Polycomb chromatin binding in Drosophila melanogaster. Nat Genet. 2006;38:694–699. doi: 10.1038/ng1792.
Wang J, Lunyak VV, Jordan IK. Chromatin signature discovery via histone modification profile alignments. Nucleic Acids Research. 2012 doi: 10.1093/nar/gks848.
Yi G, Sze SH, Thon MR. Identifying clusters of functionally related genes in genomes. Bioinformatics. 2007;23:1053–1060. doi: 10.1093/bioinformatics/btl673.
