SUMMARY:
Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) is a technique to detect genomic regions containing protein-DNA interaction, such as transcription factor binding sites or regions containing histone modifications. One goal of the analysis of ChIP-seq experiments is to identify genomic loci enriched for sequencing reads pertaining to DNA bound to the factor of interest. The accurate identification of such regions aids in the understanding of epigenomic marks and gene regulatory mechanisms. Given the reduction of massively parallel sequencing costs, methods to detect consensus regions of enrichment across multiple samples are of interest. Here, we present a statistical model to detect broad consensus regions of enrichment from ChIP-seq technical or biological replicates through a class of Zero-Inflated Mixed Effects Hidden Markov Models. We show that the proposed model outperforms existing methods for consensus peak calling in common epigenomic marks by accounting for the excess zeros and sample-specific biases. We apply our method to data from the Encyclopedia of DNA Elements (ENCODE) and Roadmap Epigenomics projects and also from an extensive simulation study.
Keywords: ChIP-Seq, Hidden Markov Model, Mixed Model
1. Introduction
Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq) is a technique to detect genome-wide regions of protein-DNA interaction, such as transcription factor (TF) binding sites or regions containing histone modifications (Robertson et al., 2007). These interactions may regulate gene expression and influence biological processes (Jones et al., 2016). ChIP-seq experiments have been used to understand epigenomic mechanisms in which TFs and histone modifications play an important role (Barski et al., 2007). Such mechanisms are hypothesized to explain heterogeneity at both the molecular level (gene expression, gene silencing, etc.) and on an individual level (cancer incidence, cardiovascular disease, etc.). In cancer research, histone modifications have been shown to play an important role in carcinogenesis, progression, and tumor suppression (Huang et al., 2017).
ChIP-seq experiments begin with cross-linking DNA and proteins on chromatin structures followed by sonication-induced fragmentation. DNA fragments bound to the protein of interest are then isolated by chromatin immunoprecipitation. The associated fragments are sequenced via massively parallel sequencing to generate short sequencing reads pertaining to the original fragments. These sequences are then mapped back to a reference genome through sequence alignment to determine their likely genomic locations of origin. Genomic coordinates containing a high density of mapped reads, referred to as enriched regions, are then identified through statistical analysis. These coordinates indicate likely locations where the protein of interest was bound to the DNA. Here, we refer to all other genomic positions pertaining to non-enriched regions as background. Methods available for the detection of enrichment regions calculate the distribution of the read counts (signal) in ChIP-seq experiments by first tiling the genome with non-overlapping windows (Rashid et al., 2011) or sliding windows (Zhang et al., 2008), and then computing the number of reads mapped into each window.
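The windowing step described above can be sketched as follows. This is a minimal illustration, not the counting code of any of the cited tools; the function name `window_counts`, the 0-based read start coordinates, and the single-chromosome input are assumptions made for the example.

```python
def window_counts(read_starts, chrom_length, window_size=500):
    """Count reads in non-overlapping windows tiling one chromosome.

    read_starts: 0-based start positions of mapped reads (hypothetical input).
    Returns one count per window; the last window may be shorter than window_size.
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # ignore reads outside the chromosome
            counts[pos // window_size] += 1
    return counts


# Toy usage: six reads on a 2 kbp chromosome tiled with 500 bp windows.
toy_reads = [10, 499, 500, 1200, 1201, 1999]
toy_counts = window_counts(toy_reads, chrom_length=2000)
```

A sliding-window variant would differ only in letting consecutive windows overlap, so that a read can contribute to more than one count.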
The detection of enrichment regions (peaks) in ChIP-seq experiments is challenging for several reasons: the diversity of enrichment profiles, the presence of serial correlation in the data, sample-specific characteristics such as the signal-to-noise ratio, and an excessive number of zeros in the distribution of read counts. Hence, peak callers need to be tailored to capture the specific signal of the protein of interest. Although single-sample ChIP-seq peak callers have been successfully presented in the literature (Zhang et al., 2008; Xu et al., 2010; Kuan et al., 2011; Song and Smith, 2011; Xing et al., 2012; Rashid et al., 2014), multi-sample peak callers are of growing interest given the reduction of sequencing costs. Integrating additional data into a joint framework leads to a significant improvement when detecting consensus peaks across samples (Yang et al., 2014). However, current approaches that integrate multiple ChIP-seq replicates (Ibrahim et al., 2014; Cuscò and Filion, 2016) show poor spatial resolution of peak calls when analyzing diffuse data due to the low signal profile of broad histone modifications. Under these scenarios, we observed that broad regions of enrichment are fragmented into narrow and discontiguous peak calls by the aforementioned methods (see Sections 2 and 6).
To tackle these challenges, we present a Zero-Inflated Mixed effects Hidden Markov Model (ZIMHMM) to analyze data and detect broad peaks in consensus across multiple ChIP-seq technical or biological replicates. The ZIMHMM accounts for the excess of zeros as well as sample-specific sequencing depth and ChIP-control relationship via random effects. Using data from H3K27me3 and H3K36me3 ChIP-seq experiments on human cells from the ENCODE Consortium and the Roadmap Epigenomics Project (Dunham et al. 2012; Bernstein et al. 2010; see Web Appendix A for details), we compared the performance of the ZIMHMM to the current multi-sample peak callers JAMM and Zerone, as well as the single-sample methods BCP, CCAT, MACS2, MOSAiCS, and RSEG. Based on real data analyses, we show that the ZIMHMM outperforms the existing approaches for detection of broad consensus regions of enrichment from multiple ChIP-seq experiments (see Section 6).
2. Background
Histones are proteins found in eukaryotic cells and make up structural units called nucleosomes, which aid in DNA packaging. When these proteins are enzymatically modified by methylation, ADP-ribosylation, phosphorylation, glycosylation, or acetylation, their electric charge and shape are affected along with the structural and functional properties of the chromatin. Consequently, these modifications directly affect transcription, DNA repair, replication, and recombination (Bannister and Kouzarides, 2011). Among all the modified forms of histones, the trimethylations of histone H3 at lysines 36 and 27 (H3K36me3 and H3K27me3) are of particular interest due to their association with actively transcribed genes and gene repression, respectively (Liu et al., 2016). In cancer research, for instance, the epigenomic mark H3K27me3 has been shown to play an important role in prostate carcinogenesis and progression, while H3K36me3-deficient cancer cells are acutely sensitive to WEE1 inhibition and can be selectively killed by dNTP starvation (Pfister et al., 2015).
ChIP-seq experiments usually differ with respect to the number of mappable reads, referred to as the sequencing depth. Jung et al. (2014) suggested a practical lower bound of 40–50 million reads for most of the marks from human cells in order to ensure robust conclusions from results derived from peak-calling algorithms. In general, publicly available ChIP-seq data do not meet this suggested minimum number of reads and show high variation regarding their sequencing depths. We observed that this variation mediates the effect of the input control on the distribution of ChIP signal across different experiments and regions of the genome (see Section 6 and Figure A.1 in Web Appendix A). While the input controls might well explain the technical variation in ChIP read counts on enrichment regions from highly sequenced experiments, their effect is not pronounced in under-sequenced data.
When analyzing diffuse or under-sequenced data, we observed that current multi-sample peak callers fail to call sufficiently broad regions of enrichment. In general, such methods call narrow and discontiguous peaks that do not correspond to the entire range of the protein-DNA binding sites. Under the pooling type of analysis, methods tended to call the union of individual peaks as the ChIP-seq signals were combined by merging reads from multiple samples (Ibrahim et al., 2014). In addition, these data are characterized by a low read density profile and an excess of zeros that one would not expect to observe if modeling the signal with either a Poisson or Negative Binomial (NB) distribution. In this scenario, we find that the Zero-Inflated Negative Binomial (ZINB) model appears to accurately capture the excess of zeros present in background regions of the genome (Figure 1).
Figure 1.
Low and broad signal profile of histone modifications H3K36me3 (top panels) and H3K27me3 (bottom panels). On the left, pooled read counts of technical replicates of diffuse histone marks ChIP-seq on human Huvec and Nhek cells, respectively, and peaks called by some of the current methods. On the right, bar plots of the observed count distribution of ENCODE background regions on these cells and expected proportions under the Poisson, NB, and ZINB models. This figure appears in color in the electronic version of this article.
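A quick way to see the excess of zeros discussed above is to compare the observed fraction of zero-count windows with the fraction a Poisson model with the same mean would predict. The sketch below is only a diagnostic under that simple assumption; the function name `zero_excess` and the toy counts are hypothetical and are not taken from the ENCODE data.

```python
import math


def zero_excess(counts):
    """Return (observed zero fraction, zero fraction expected under a Poisson
    distribution whose mean equals the sample mean of the counts)."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected_poisson = math.exp(-mean)  # P(Y = 0) for Poisson(mean)
    return observed, expected_poisson


# Toy background-like counts: many empty windows plus a few covered ones.
toy = [0] * 60 + [1] * 10 + [3] * 10 + [5] * 10 + [8] * 10
observed, expected = zero_excess(toy)
```

When the observed fraction is far above the expected one, as in this toy example, a zero-inflated distribution is a natural candidate for the background state.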
Before the introduction of methods focused on the integration of multiple ChIP-seq experiments, consensus regions of enrichment were detected by using ad hoc rules to combine peaks called independently from different samples (Valouev et al., 2008). Alternatively, aligned reads from all experiments available could be pooled and analyzed by single-sample procedures (Young et al., 2011). However, we observed that peak calls from pooled samples usually correspond to the union of individual enrichment regions (see Figure D.1 of the Web Appendix). In recent years, a few methods have focused on the joint analysis of ChIP-seq data to call consensus peaks. JAMM integrates multiple technical replicates and fits a multivariate Gaussian mixture model to cluster genomic windows and call regions of consensus. Zerone fits a three-state HMM with Zero-Inflated Negative Multinomial emission distributions to identify regions of enrichment. As shown in Figure 1, these methods do not perform well when capturing broad regions of enrichment in consensus across multiple samples.
3. Methods
In Section 3.1, we first introduce an immediate extension of the single-sample HMM proposed by Rashid et al. (2014). This extension aims to call consensus regions of enrichment from multiple ChIP-seq experiments by fitting a two-state fixed effects multivariate Zero-Inflated HMM. In Section 3.2, we present the ZIMHMM, a mixed effects version of the extended model motivated by the work of Altman (2007). Both models capture the excess of background zeros from diffuse data and, in addition, the ZIMHMM accounts for sample-specific differences via random effects.
3.1. Multi-sample Zero-Inflated HMM
From here onwards, all models will be presented under a two-state HMM with ZINB and NB emission distributions associated with the background and enrichment states, respectively. For genomic window j of experiment i, j = 1, …, M and i = 1, …, N, let Yij and Xij denote the random variables pertaining to the ChIP and log-transformed input control read counts, respectively. Here, yij and xij denote the observed values of Yij and Xij, respectively. For multiple experiments sharing the same input control, we have Xij = Xi′j for all i ≠ i′. We assume a single latent discrete-time stationary Markov chain {Zj}, j = 1, …, M, with state-to-state transition probabilities γ = (γ11, γ12, γ21, γ22)′ and initial probabilities π = (π1, π2)′. Conditionally upon Zj, we observe the vectors of independent counts Y.j = (Y1j, …, YNj)′ and X.j = (X1j, …, XNj)′, for all windows j = 1, …, M and across all N replicated experiments.
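Let ψ = (ψ1′, ψ2′)′ denote the vector of state-specific parameters, f1 and f2 denote the emission distributions corresponding to the hidden states, and xij denote the predictor of $\mu_{ij}^{(z_j)}$, the state-specific mean read count of Yij. Under this set-up, the observed data follow a multi-sample HMM whose likelihood function is

$$f(\mathbf{y} \mid \mathbf{x}; \Psi) = \sum_{\mathbf{z}} \pi_{z_1} \prod_{i=1}^{N} f_{z_1}(y_{i1} \mid x_{i1}; \psi_{z_1}) \prod_{j=2}^{M} \gamma_{z_{j-1} z_j} \prod_{i=1}^{N} f_{z_j}(y_{ij} \mid x_{ij}; \psi_{z_j}), \tag{1}$$

where the sum runs over all state paths z = (z1, …, zM)′ and the emission distributions f1 and f2 are defined as

$$f_1(y_{ij} \mid x_{ij}; \psi_1) = \theta_{ij} I(y_{ij} = 0) + (1 - \theta_{ij}) \, \mathrm{NB}(y_{ij}; \mu_{ij}^{(1)}, \phi_1), \qquad f_2(y_{ij} \mid x_{ij}; \psi_2) = \mathrm{NB}(y_{ij}; \mu_{ij}^{(2)}, \phi_2). \tag{2}$$

Here, I(·) is an indicator function, Ψ = (π′, γ′, ψ′)′, ψ1 = (λ1, λ2, β11, β12, ϕ1)′, and ψ2 = (β21, β22, ϕ2)′. In addition, NB(·; μ, ϕ) indicates the NB probability mass function with mean μ and dispersion ϕ, where $\log \mu_{ij}^{(z_j)} = \beta_{z_j 1} + \beta_{z_j 2} x_{ij}$, zj ∈ {1, 2}, and log(θij/(1 − θij)) = λ1 + λ2xij. For ChIP-seq experiments with a single input control, we allow the probabilities θij, i = 1, …, N, to differ across replicates by including the log-transformed total number of ChIP read counts as an offset in the model. This is particularly important as replicates with different numbers of mapped reads are likely to have different distributions of observed zeros in the background regions. We describe the EM algorithm to obtain parameter estimates from (1) in Section 4.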
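To make the structure of the multi-sample likelihood concrete, the sketch below evaluates the log-likelihood of a two-state HMM with a ZINB background state and an NB enrichment state by the forward recursion, working in log space. It is an illustration under stated assumptions rather than the authors' implementation: the NB is parameterized by its mean and a size-type dispersion (variance μ + μ²/ϕ), the state means are log-linear in xij, the offset for sequencing depth is omitted, and the names `nb_logpmf`, `emission_logprob`, `hmm_loglik`, and `example_par` are hypothetical. The numerical values in `example_par` are the true values listed in Table 1.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log NB pmf with mean mu and size phi (variance mu + mu**2 / phi; assumed)."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu)) + y * math.log(mu / (phi + mu)))


def emission_logprob(state, y, x, par):
    """Log emission probability; state 1 is ZINB (background), state 2 is NB."""
    b0, b1 = par["beta"][state]
    mu = math.exp(b0 + b1 * x)  # log-linear mean in the input control x
    log_nb = nb_logpmf(y, mu, par["phi"][state])
    if state == 2:
        return log_nb
    lam1, lam2 = par["lam"]
    theta = 1.0 / (1.0 + math.exp(-(lam1 + lam2 * x)))  # zero-inflation probability
    if y > 0:
        return math.log1p(-theta) + log_nb
    return math.log(theta + (1.0 - theta) * math.exp(log_nb))


def logsumexp2(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def hmm_loglik(Y, X, par):
    """Forward recursion for N samples sharing one hidden path; Y[i][j], X[i][j]."""
    n_samples, n_windows = len(Y), len(Y[0])

    def log_emis(s, j):
        # Samples are independent given the state, so log emissions add up.
        return sum(emission_logprob(s, Y[i][j], X[i][j], par) for i in range(n_samples))

    alpha = {s: math.log(par["pi"][s]) + log_emis(s, 0) for s in (1, 2)}
    for j in range(1, n_windows):
        alpha = {s: logsumexp2(alpha[1] + math.log(par["gamma"][(1, s)]),
                               alpha[2] + math.log(par["gamma"][(2, s)])) + log_emis(s, j)
                 for s in (1, 2)}
    return logsumexp2(alpha[1], alpha[2])


example_par = {
    "pi": {1: 0.5, 2: 0.5},
    "gamma": {(1, 1): 0.95, (1, 2): 0.05, (2, 1): 0.05, (2, 2): 0.95},
    "beta": {1: (1.5, 0.75), 2: (2.5, 0.5)},
    "phi": {1: 5.0, 2: 2.5},
    "lam": (-0.75, -0.60),
}
```

The recursion costs a number of operations linear in the number of windows, whereas summing over all state paths directly grows exponentially with it.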
3.2. Multi-sample Zero-Inflated Mixed Effects HMM
Here we present the ZIMHMM, an immediate mixed effects extension of the model presented in Section 3.1 and a special case of the model proposed by Altman (2007), as it assumes a single sequence of hidden states common to all experiments to ensure the detection of consensus peaks. Let the latent random vector B = (B1, …, BN)′ be an N-dimensional vector of sample-specific scalar random effects to be included in the linear model. We will assume that B ∼ N(0, σ²I), where I denotes an N × N identity matrix. For better computational stability and efficiency, we will make use of the change of variables B to random effects U following the ideas presented by Bates et al. (2014). Define the linear transformation from an N-dimensional spherical random vector, U, to B as B = σU. We will assume that, conditional on the random effects U and the Markov chain Z, the observed data follow an HMM, and observations from different experiments are independent. In addition, conditionally upon the unobserved realization ui of Ui, we model Yij according to ZINB and NB emission distributions associated with background and enriched states, respectively.
Let rij denote the design variable associated with the random effects, indicating whether the model has a sample-specific random intercept (rij = 1) or a random slope (rij = xij). In addition, let λ = (λ1, λ2)′, β = (β11, β12, β21, β22)′, ϕ = (ϕ1, ϕ2)′, and let Ψ = (π′, γ′, λ′, β′, ϕ′, σ)′ denote the vector of all model parameters. The likelihood function of the ZIMHMM is
| (3) |
where , , and , for zj ∈ {1, 2}. Here, f1 and f2 are defined as in (2) with $\log \mu_{ij}^{(z_j)} = \beta_{z_j 1} + \beta_{z_j 2} x_{ij} + r_{ij} \sigma u_i$, zj ∈ {1, 2}. In ChIP-seq peak calling, a model with random intercepts would account for differences in the sequencing depth of replicates by modeling sample-specific random shifts in the mean model of read counts. Conversely, a random slope model would be particularly interesting when modeling experiments with input controls having differential relationships with the distribution of read counts. Different datasets might exhibit different ChIP-control relationships due to differences in immunoprecipitation (IP) efficiency across experiments (Chen et al., 2015; Lun and Smyth, 2015). While efficient IP shows strong peaks in read coverage at binding sites and a mild control effect (in adjusting for technical variability in enrichment regions), inefficient IP will result in weaker peaks and a larger control effect in enrichment regions, as it is harder to separate technical variability from the true signal in such cases.
Under this model setup, the inclusion of random effects has a critical impact on the marginal covariance structure of read counts. Specifically, it is possible to show that Cov(Yij, Yij′) → κ > 0, as |j − j′| → ∞ (Altman 2007; see Section B2 of the Web Appendix for technical derivations). For the fixed effects model presented in Section 3.1, however, such a long-range positive dependence decays to zero. We propose an EM algorithm to estimate the model parameters from the likelihood function (3), which is presented in Section 4.
4. Estimation
Besides the unknown parameters Ψ = (π′, γ′, λ′, β′, ϕ′, σ)′, the likelihood (3) contains two unobserved quantities: the M-dimensional vector of the state path z = (z1, …, zM)′, and the N-dimensional vector of sample-specific random effects u = (u1, …, uN)′. In the sth step of the EM algorithm, the Q function of the complete data log-likelihood can be written as
We use Laplace's approximation to maximize the Q function with respect to Ψ. Following the notation presented in Altman (2007), the Q function can be rewritten as (see Section B3 of the Web Appendix for technical derivations)
| (4) |
where , , is a 2 × 2 matrix with elements for all l and k in {1, 2}, and is a 2-dimensional vector of ones. The integral (4) is approximated by its integrand evaluated at such that . Here, denotes the Jacobian of the function g evaluated at . Note that neither g nor its Hessian matrix depends on Ψ.
In the E-step, we compute via numerical optimization of g using the BOBYQA algorithm (Powell, 2009). The posterior probabilities from (4) can be calculated by a standard Forward-Backward algorithm (Rashid et al., 2014). In the M-step, the Q function is maximized with respect to the unknown parameters Ψ. It is possible to show (see Section B3 of the Web Appendix) that one can approximate the Q function as
| (5) |
In this setting, one can obtain closed forms for the estimates of the initial and transition probabilities. We perform conditional maximizations to compute estimates of λ, β, ϕ, and σ using the BFGS algorithm (Fletcher, 2013). The EM algorithm iterates until the maximum absolute relative change in the parameter estimates three iterations apart is less than 10⁻³ for three consecutive iterations. For better efficiency, we use a rejection-controlled EM (RCEM; Ma et al., 2006) with threshold 0.05 and a weighted maximization approach on aggregated data. The final set of posterior probabilities can be used to determine the hidden path of the states Z and segment the genome into either enriched or background windows. Denoting by pj the posterior probability of window j belonging to background, one could classify window j as enriched if pj ⩽ α, where α is chosen such that the total false discovery rate (FDR) is controlled at a prespecified level (Efron et al., 2001). Alternatively, the Viterbi algorithm (Viterbi, 1967) can be used to determine the most likely sequence of background and enrichment windows without the need for a subjective choice of an FDR threshold. Finally, regions of enrichment are created by merging adjacent windows that either meet the cutoff α or belong to the same Viterbi-predicted state.
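The last two steps, thresholding the background posterior probabilities and merging adjacent enriched windows, can be sketched as below. The FDR rule used here is an assumption made for illustration: the cutoff α is taken as the largest observed probability for which the average background posterior among the called windows stays at or below the target level, which is one common way of estimating the FDR from posterior probabilities. The function names `fdr_cutoff` and `merge_windows` are hypothetical.

```python
def fdr_cutoff(post_bg, target_fdr=0.05):
    """Largest cutoff alpha such that the mean of post_bg over windows with
    post_bg <= alpha (an estimate of the FDR) does not exceed target_fdr.
    Returns None if no window can be called at that level."""
    ordered = sorted(post_bg)
    best, running = None, 0.0
    for k, p in enumerate(ordered, start=1):
        running += p
        last_of_ties = (k == len(ordered)) or (ordered[k] > p)
        if last_of_ties:  # only evaluate cutoffs that include all tied values
            if running / k <= target_fdr:
                best = p
            else:
                break  # the running mean is non-decreasing, so stop here
    return best


def merge_windows(enriched, window_size=500):
    """Merge runs of adjacent enriched windows into (start, end) regions,
    with 0-based starts and exclusive ends."""
    regions, start = [], None
    for j, flag in enumerate(enriched):
        if flag and start is None:
            start = j
        elif not flag and start is not None:
            regions.append((start * window_size, j * window_size))
            start = None
    if start is not None:
        regions.append((start * window_size, len(enriched) * window_size))
    return regions


# Toy usage with seven windows of 500 bp.
toy_post_bg = [0.99, 0.01, 0.02, 0.03, 0.90, 0.04, 0.30]
alpha = fdr_cutoff(toy_post_bg, target_fdr=0.05)
toy_regions = merge_windows([p <= alpha for p in toy_post_bg])
```

The same `merge_windows` helper applies unchanged to a Viterbi path once it is converted to a sequence of enriched/background flags.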
5. Simulation Study
In this study, we evaluated the performance of the ZIMHMM under a set of different scenarios where experimental replicates differed with respect to sequencing depth and ChIP-input control relationship. We compared the ZIMHMM to its fixed-effects version (ZIHMM) and to a multi-sample HMM that does not account for zero-inflation (HMM). For each scenario, we simulated one hundred multi-sample ChIP-seq data sets under random intercept and random slope models mimicking the main characteristics of H3K27me3 ChIP-seq data. First, we generated a sequence of hidden states with length M = 25,000 from a first-order Markov chain with two states and transition probabilities γ11 = γ22 = 0.95 to ensure broad background and enrichment regions. Secondly, for a given path of states, a set of N input control read counts was independently simulated following a NB distribution with parameters (μ, ϕ)′ = (9, 2.5)′. Thirdly, N sequences of ChIP read counts with length M were simulated as a function of the log-transformed input control counts following a mixture of random effects ZINB and NB distributions. We simulated data under scenarios with N ∈ {2, 3, 6, 9} ChIP-seq replicates and explored scenarios with low, medium, and high levels of heterogeneity across the N simulated ChIP replicates. These levels of heterogeneity are represented by different values of the variance component σ² for both the random intercept and random slope models (see Figure C.1 of the Web Appendix).
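The three simulation steps above can be sketched for the random intercept case as follows. This is a simplified illustration, not the simulation code used in the study: the parameter values are the true values listed in Table 1, the NB is drawn as a gamma-Poisson mixture with variance μ + μ²/ϕ, one pseudo count is added before log-transforming the control (an assumption, to avoid log(0)), and the names `rpois`, `rnbinom`, and `simulate` are hypothetical.

```python
import math
import random


def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method; large means are split
    into two independent halves to keep exp(-lam) away from underflow."""
    if lam > 500:
        return rpois(lam / 2, rng) + rpois(lam / 2, rng)
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rnbinom(mu, phi, rng):
    """NB draw with mean mu and variance mu + mu**2 / phi (gamma-Poisson mixture)."""
    return rpois(rng.gammavariate(phi, mu / phi), rng)


def simulate(M=25000, N=3, sigma2=0.2, stay=0.95, seed=1):
    """Simulate one hidden path z and N replicates of control (X) and ChIP (Y) counts."""
    rng = random.Random(seed)
    beta = {1: (1.5, 0.75), 2: (2.5, 0.5)}
    phi = {1: 5.0, 2: 2.5}
    lam = (-0.75, -0.60)
    # Step 1: two-state Markov chain with gamma11 = gamma22 = stay.
    z = [rng.choice((1, 2))]
    for _ in range(M - 1):
        z.append(z[-1] if rng.random() < stay else 3 - z[-1])
    # Sample-specific random intercepts with variance sigma2.
    b = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(N)]
    X, Y = [], []
    for i in range(N):
        # Step 2: input control counts, log-transformed after adding one.
        x_i = [math.log(rnbinom(9.0, 2.5, rng) + 1) for _ in range(M)]
        # Step 3: ChIP counts from ZINB (state 1) or NB (state 2).
        y_i = []
        for j in range(M):
            s = z[j]
            mu = math.exp(beta[s][0] + beta[s][1] * x_i[j] + b[i])
            if s == 1:
                theta = 1.0 / (1.0 + math.exp(-(lam[0] + lam[1] * x_i[j])))
                if rng.random() < theta:
                    y_i.append(0)  # structural zero in the background state
                    continue
            y_i.append(rnbinom(mu, phi[s], rng))
        X.append(x_i)
        Y.append(y_i)
    return z, X, Y


# Small toy run; the study itself used M = 25,000 windows per data set.
z_toy, X_toy, Y_toy = simulate(M=1000, N=2, seed=7)
```

A random slope version would multiply the random effect by the control covariate inside the exponent instead of adding it as a shift.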
5.1. Simulation Results
Table 1 shows the true values, the sample median, and the 25th and 75th percentiles of the parameter estimates from simulated data for scenarios with a medium level of heterogeneity under the random intercept model. The estimates associated with the parameters from the mean model appeared to be symmetrically distributed and centered at the true values, suggesting that the proposed Laplace approximation works relatively well even for a small number of replicates. The estimates of the variance component were close to the true values in all simulated scenarios. We present the median observed true and false positive rates (TPR and FPR, respectively) based on the sequence of states predicted by the Viterbi algorithm. Regardless of the number of replicates, the ZIMHMM performed well when predicting the path of hidden states. We observed that its classification performance improved in scenarios with a higher number of replicates, as expected. This is particularly important as a common practice in the analysis of multiple ChIP-seq data is to call peaks utilizing two replicates only. In the analyzed scenario, integrating data from additional replicates improved the detection of enrichment regions in consensus.
Table 1.
Median (first and third quartiles) of parameter estimates under random intercept models (medium heterogeneity).
| Parameter | True value | Two rep. | Three rep. | Six rep. | Nine rep. |
|---|---|---|---|---|---|
| β11 | 1.50 | 1.51 (1.07, 2.05) | 1.36 (0.93, 1.78) | 1.61 (1.35, 1.88) | 1.55 (1.46, 1.73) |
| β12 | 0.75 | 0.75 (0.74, 0.75) | 0.75 (0.75, 0.76) | 0.75 (0.75, 0.75) | 0.75 (0.75, 0.75) |
| β21 | 2.50 | 2.51 (2.11, 2.99) | 2.42 (1.96, 2.81) | 2.63 (2.39, 2.91) | 2.55 (2.46, 2.73) |
| β22 | 0.50 | 0.50 (0.50, 0.50) | 0.50 (0.50, 0.51) | 0.50 (0.50, 0.50) | 0.50 (0.50, 0.50) |
| ϕ1 | 5.00 | 4.96 (4.79, 5.04) | 4.57 (4.05, 4.87) | 4.14 (3.61, 4.52) | 4.00 (3.65, 4.28) |
| ϕ2 | 2.50 | 2.48 (2.44, 2.50) | 2.39 (2.22, 2.46) | 2.28 (2.12, 2.38) | 2.23 (2.13, 2.31) |
| λ1 | −0.75 | −0.75 (−0.79, −0.71) | −0.75 (−0.78, −0.72) | −0.75 (−0.77, −0.73) | −0.76 (−0.78, −0.74) |
| λ2 | −0.60 | −0.60 (−0.62, −0.58) | −0.59 (−0.61, −0.58) | −0.59 (−0.61, −0.58) | −0.60 (−0.60, −0.59) |
| σ2 | 0.20 | 0.17 (0.00, 1.24) | 0.21 (0.06, 1.14) | 0.26 (0.11, 0.44) | 0.20 (0.12, 0.32) |
| TPR | | 0.94 (0.94, 0.95) | 0.96 (0.95, 0.96) | 0.98 (0.98, 0.98) | 0.99 (0.99, 0.99) |
| FPR | | 0.07 (0.07, 0.07) | 0.05 (0.04, 0.05) | 0.02 (0.02, 0.02) | 0.01 (0.01, 0.01) |
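For reference, window-level TPR and FPR values of the kind reported in Table 1 can be computed from a true and a predicted state path as in the following sketch. The function name `tpr_fpr` and the coding of the enriched state as 2 are assumptions made for the example.

```python
def tpr_fpr(true_states, predicted_states, enriched=2):
    """True and false positive rates, treating `enriched` as the positive class."""
    tp = fp = positives = negatives = 0
    for truth, pred in zip(true_states, predicted_states):
        if truth == enriched:
            positives += 1
            tp += pred == enriched
        else:
            negatives += 1
            fp += pred == enriched
    return tp / positives, fp / negatives


# Toy usage on six windows.
toy_tpr, toy_fpr = tpr_fpr([1, 1, 2, 2, 2, 1], [1, 2, 2, 2, 1, 1])
```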
The simulation results indicated that estimates associated with dispersion parameters (ϕ1, ϕ2)′ were biased even for scenarios with a high number of ChIP replicates. An extensive statistical literature makes reference to biased estimates of the dispersion parameter in the NB regression model and proposes possible corrections to it (Robinson and Smyth, 2007). Here, given the good classification performance of the ZIMHMM regarding the TPR and FPR across all different simulated scenarios, we did not explore alternative solutions to the estimation of the dispersion parameter as this investigation would be beyond the scope of this work. Nonetheless, we believe that such a correction would lead to better precision for the parameter estimates. Thresholding posterior probabilities with different FDR levels allowed us to compare the performance of the ZIMHMM with the ZIHMM and HMM. The ZIMHMM had a better classification performance than the misspecified models ZIHMM and HMM in all the scenarios (see Figure C.3 of the Web Appendix). Moreover, the relative advantage of the ZIMHMM over the ZIHMM and HMM was larger when a low number of replicates was analyzed. In the context of heterogeneous replicates, these results suggest that accounting for sample-specific biases boosts the detection of consensus regions of enrichment and that the improvement is particularly pronounced when only a few replicates are available.
6. Data Applications
We applied the ZIMHMM with sample-specific random intercepts to detect consensus regions of enrichment from multiple ChIP-seq experiments of H3K27me3 and H3K36me3 marks from the ENCODE Consortium and the Roadmap Epigenomics Project. Data were analyzed in two different scenarios. In Section 6.1, we report results from the analysis of technical replicates from H3K36me3 and H3K27me3 experiments of Huvec and Nhek cell lines, respectively. In this standard scenario of multi-sample ChIP-seq peak calling, technical replicates are expected to show low spatial heterogeneity regarding the signal profile across the genome. In Section 6.2, we present results of the analysis of H3K36me3 and H3K27me3 experiments from white blood cell lines CD4 memory, CD4 naïve, CD8 naïve, and CD34 primary cells. In this scenario, white blood cell lines are assumed to be similar but show a certain level of heterogeneity regarding the signal profile and genomic locations of protein-DNA binding sites.
We sought to benchmark the genome-wide performance of the ZIMHMM to the multi-sample peak callers JAMM and Zerone, as well as single-sample peak callers under the pooling approach BCP-P, CCAT-P, MACS2-P, MOSAiCS-P, and RSEG-P. We compared methods regarding peak accuracy, broadness, coverage of the observed read density from both analyzed marks, coverage of active and inactive genomic regions, and running time. To assess the benefits of the random effects approach, results from the fixed effects model ZIHMM presented in Section 3.1 are shown. Read counts were computed using non-overlapping windows of 500bp in both scenarios. For the ZIMHMM and the ZIHMM, enrichment regions were defined by merging neighboring predicted enriched windows using the Viterbi algorithm. A discussion about the choice of the window size and a comparison between the Viterbi algorithm and the FDR thresholding approach is presented in Section 6.1.
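The Viterbi step used to define enriched windows can be sketched for a generic two-state HMM as follows. This is a textbook implementation in log space given for illustration; it takes precomputed log emission probabilities as input, codes the states as 0 (background) and 1 (enriched), and the function name `viterbi` and its argument layout are assumptions rather than the interface of the authors' software.

```python
import math


def viterbi(log_emis, log_pi, log_gamma):
    """Most likely state path of a two-state HMM.

    log_emis[j][s]: log emission probability of window j under state s.
    log_pi[s]: log initial probability; log_gamma[r][s]: log transition r -> s.
    """
    score = [log_pi[s] + log_emis[0][s] for s in (0, 1)]
    back = []  # back[j-1][s] = best previous state for state s at window j
    for j in range(1, len(log_emis)):
        new_score, pointers = [], []
        for s in (0, 1):
            cand = [score[r] + log_gamma[r][s] for r in (0, 1)]
            best = 0 if cand[0] >= cand[1] else 1
            pointers.append(best)
            new_score.append(cand[best] + log_emis[j][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy usage: five windows with sticky transitions (0.95 on the diagonal).
toy_log_emis = [[-1.0, -3.0], [-1.2, -2.0], [-4.0, -0.5], [-3.0, -0.7], [-0.9, -2.5]]
toy_log_pi = [math.log(0.5), math.log(0.5)]
toy_log_gamma = [[math.log(0.95), math.log(0.05)], [math.log(0.05), math.log(0.95)]]
toy_path = viterbi(toy_log_emis, toy_log_pi, toy_log_gamma)
```

In the multi-sample setting, the log emission of a window would be the sum of the sample-specific log emissions, since replicates are assumed independent given the hidden state.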
6.1. Analysis of ChIP-seq Data From Technical Replicates
For benchmarking purposes, we computed a set of measures and associations first introduced by Xing et al. (2012) (Table 2). First, we calculated the median size of called peaks (in kbp) by each method. For both analyzed marks, we observed that the ZIMHMM called substantially broader regions of enrichment than the multi-sample peak callers JAMM and Zerone, but narrower than the regions of the single-sample peak callers BCP-P and RSEG-P. Next, we defined the read coverage as the proportion of reads from the analyzed mark mapped on called peaks out of the total number of mapped reads. Read counts were previously normalized by the median log-ratios of each sample over the geometric mean (after adding 1 pseudo count to avoid undefined ratios in windows with zero counts). Results showed that the ZIMHMM covered most of the mapped reads while still maintaining relatively small peak calls. While RSEG-P had a reasonable coverage of counts, its called peaks were often excessively large and did not capture minor changes in the signal profile (Figure 2). This is a known characteristic of the pooling type of analysis of single-sample peak callers, and the results highlight the improved accuracy of the peaks called by the multi-sample ZIMHMM. Here, the ZIHMM was fitted using the total sum of read counts as an offset to attempt the correction of differences in sequencing depth across replicates. However, the inclusion of replicate-specific random effects led to a better coverage of read counts across the genome.
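The normalization and read coverage described above can be sketched as follows. The exact normalization used in the analysis is not spelled out beyond the description in the text, so this is one plausible reading under stated assumptions: a per-sample scaling factor is obtained as the exponentiated median, across windows, of the log-ratio between the sample's counts (plus one) and the across-sample geometric mean. The function names `size_factors` and `read_coverage` are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts[i][j]: reads of sample i in window j. Returns one factor per sample."""
    n_samples, n_windows = len(counts), len(counts[0])
    log_counts = [[math.log(c + 1) for c in row] for row in counts]  # pseudo count
    log_geo_mean = [sum(log_counts[i][j] for i in range(n_samples)) / n_samples
                    for j in range(n_windows)]
    return [math.exp(median(log_counts[i][j] - log_geo_mean[j] for j in range(n_windows)))
            for i in range(n_samples)]


def read_coverage(counts, in_peak):
    """Proportion of normalized reads falling in windows flagged as peaks."""
    factors = size_factors(counts)
    total = covered = 0.0
    for row, factor in zip(counts, factors):
        for c, flag in zip(row, in_peak):
            total += c / factor
            if flag:
                covered += c / factor
    return covered / total


# Toy usage: two samples, four windows, windows 2 and 3 called as peaks.
toy_counts = [[1, 3, 7, 0], [3, 7, 15, 1]]
toy_factors = size_factors(toy_counts)
toy_coverage = read_coverage(toy_counts, [False, True, True, False])
```

Using the median rather than the mean of the log-ratios keeps the scaling factors from being driven by a few strongly enriched windows.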
Table 2.
Genome-wide peak calls and common associations for ChIP-seq data of H3K36me3 and H3K27me3 marks from three technical replicates of Huvec and Nhek cells, respectively. The running time of each method is shown in hours.
| Mark | Method | Peaks | Median Size (kbp) | Coverage: Reads | Coverage: Active Regions | Coverage: Inactive Regions | Time (h) |
|---|---|---|---|---|---|---|---|
| H3K36me3 | BCP-P | 6852 | 29.298 | 0.400 | 0.497 | 0.027 | 1.618 |
| | CCAT-P | 94181 | 1.026 | 0.345 | 0.345 | 0.015 | 17.642 |
| | JAMM | 66751 | 0.300 | 0.123 | 0.105 | 0.007 | 5.376 |
| | MOSAiCS-P | 3626 | 17.704 | 0.184 | 0.178 | 0.004 | 0.512 |
| | MACS2-P | 53950 | 1.616 | 0.353 | 0.356 | 0.018 | 0.132 |
| | RSEG-P | 8259 | 33.204 | 0.470 | 0.623 | 0.043 | 1.659 |
| | Zerone | 16913 | 7.322 | 0.336 | 0.346 | 0.016 | 0.024 |
| | ZIHMM | 14867 | 18.064 | 0.508 | 0.682 | 0.049 | 0.324 |
| | ZIMHMM | 12574 | 22.948 | 0.517 | 0.709 | 0.055 | 6.336 |
| H3K27me3 | BCP-P | 6618 | 16.114 | 0.412 | 0.032 | 0.147 | 1.335 |
| | CCAT-P | 193893 | 0.978 | 0.504 | 0.034 | 0.165 | 30.758 |
| | JAMM | 109855 | 0.303 | 0.253 | 0.012 | 0.058 | 6.925 |
| | MOSAiCS-P | 4726 | 4.090 | 0.159 | 0.004 | 0.024 | 1.829 |
| | MACS2-P | 89258 | 1.147 | 0.394 | 0.019 | 0.100 | 0.148 |
| | RSEG-P | 12801 | 20.997 | 0.564 | 0.047 | 0.246 | 0.981 |
| | Zerone | 34397 | 1.465 | 0.240 | 0.008 | 0.040 | 0.027 |
| | ZIHMM | 51276 | 5.859 | 0.622 | 0.053 | 0.262 | 0.642 |
| | ZIMHMM | 54994 | 5.845 | 0.634 | 0.056 | 0.272 | 12.307 |
Figure 2.
Pooled read counts of three technical replicates of histone modifications H3K36me3 (A) and H3K27me3 (B) on human cells Huvec and Nhek, respectively. At the top, called peaks from benchmarked methods. At the bottom, posterior probabilities of enrichment from ZIMHMM, which calls broad peaks in consensus that better associate with the read counts profile from the analyzed diffuse marks. This figure appears in color in the electronic version of this article.
To assess whether the high sensitivity of the ZIMHMM was indeed due to an improved segmentation, we computed empirical TPRs and FPRs based on the coverage of actively transcribed genes and reads of the reverse mark (see Figure 3). Histones H3K36me3 and H3K27me3 are known to be associated with gene transcription and repression, respectively. For the former (latter), enrichment regions are usually deposited on genes with high (low) expression and are nearly mutually exclusive, although the activity of H3K27me3 can also be seen in genomic regions without any gene bodies (Xing et al. 2012; Figure 1). Using RNA-seq data, we determined cell line-specific actively transcribed genes and computed the coverage of active and inactive regions (see Section D1 of the Web Appendix for details). Here, we define an inactive region to be any genomic region not overlapping an actively transcribed gene, which includes intergenic regions and inactive genes. We observed that the ZIMHMM had the highest coverage among all methods. Its called peaks for H3K36me3 (H3K27me3) covered most of the active (inactive) locations while still maintaining low false positives. Both multi-sample peak callers JAMM and Zerone performed poorly under this scenario regarding these metrics (Figure 2). Here, single-sample peak callers had mixed performances and called peaks that were either excessively large and spanned multiple actively transcribed gene bodies (BCP-P and RSEG-P) or overly segmented (CCAT-P, MACS2-P, and MOSAiCS-P).
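The coverage of active (or inactive) regions by called peaks amounts to an interval-overlap computation, which can be sketched as below. The sketch assumes half-open (start, end) intervals on a single chromosome and measures coverage as the fraction of base pairs in the regions that fall inside a peak; the exact definition used in the analysis is given in the Web Appendix, and the function names `merge_intervals` and `covered_fraction` are hypothetical.

```python
def merge_intervals(intervals):
    """Sort intervals and merge those that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(regions, peaks):
    """Fraction of base pairs in `regions` overlapped by `peaks`."""
    regions, peaks = merge_intervals(regions), merge_intervals(peaks)
    overlap, k = 0, 0
    for r_start, r_end in regions:
        # Peaks ending before this region cannot overlap it or any later region.
        while k < len(peaks) and peaks[k][1] <= r_start:
            k += 1
        m = k
        while m < len(peaks) and peaks[m][0] < r_end:
            overlap += min(r_end, peaks[m][1]) - max(r_start, peaks[m][0])
            m += 1
    return overlap / sum(end - start for start, end in regions)


# Toy usage: one broad peak spanning parts of two active genes.
toy_genes = [(0, 1000), (2000, 3000)]
toy_peaks = [(500, 2500)]
toy_fraction = covered_fraction(toy_genes, toy_peaks)
```

For a genome-wide figure, the same computation would be repeated per chromosome and the overlaps and region lengths summed before taking the ratio.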
Figure 3.
Genome-wide performance of ZIMHMM and other peak callers. We analyzed diffuse histone modifications H3K36me3 and H3K27me3 under scenarios of technical replicates and multiple cell lines. ZIMHMM showed superior performance in most of the scenarios, better associating with gene expression and read counts than other methods. Overall, peaks called by ZIMHMM showed a reasonably low number of false positives, here characterized by the coverage of inactive (active) regions by H3K36me3 (H3K27me3) peaks and coverage of reads from the other mark. This figure appears in color in the electronic version of this article.
We observed that peak callers varied substantially regarding their computational time. MACS2-P, Zerone, and ZIHMM were among the fastest methods under comparison, taking no longer than an hour to analyze the entire human genome. Conversely, CCAT-P, JAMM, and ZIMHMM took the longest to complete the analysis. The approximate running time of the ZIMHMM was six and twelve hours to analyze three replicates of H3K36me3 and H3K27me3, respectively, whereas CCAT-P took approximately 18 and 31 hours for these marks (Table 2). It is worth noting that single-sample peak callers such as BCP-P, MOSAiCS-P, RSEG-P, and MACS2-P are in general faster than multi-sample peak callers simply because technical replicates are pooled together and analyzed as a single experiment. We believe that the computational performance of the ZIMHMM can be further improved, which we leave for a future implementation of the model.
The performance of peak callers under different choices of window sizes was investigated. Results were consistent across windows of 250bp, 500bp, 750bp, and 1000bp, although peaks from ZIMHMM became larger for wider window sizes (see Tables D.1–D.3 in Web Appendix D). In Ibrahim et al. (2014), the authors propose the use of a cost function to select the window size. Here, we choose to report results based on the window size of 500bp calculated as a function of the average fragment length, an approach also used by MACS2. Moreover, we compared peaks called by the ZIMHMM via both the Viterbi algorithm and FDR thresholding. The Viterbi peaks were similar regarding the metrics used in this paper to peaks based on a FDR cutoff of 0.05. An increasing (decreasing) trend in sensitivity (specificity) across the different thresholds was observed (see Tables D.4–D.7 in Web Appendix D). We compared the performance of the ZIMHMM under the whole-genome analysis presented in this paper with peaks called chromosome-wise. We observed a better sensitivity/specificity of the whole-genome analysis over the chromosome-wise analysis for small chromosomes (see Figures D.3 and D.4 in Web Appendix D). A possible explanation for the increase in performance is that small chromosomes may have less data to better resolve peak regions. In addition, chromosomes with less gene activity are likely to have fewer enrichment regions for certain marks. The whole-genome analysis could be a workaround for a potential convergence issues in a chromosome where most of the reads are coming from background.
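As an illustration of FDR thresholding on HMM output, the sketch below applies one common rule for controlling the FDR from posterior probabilities: windows are ranked by their posterior probability of being background, and the largest set whose average background probability stays at or below the cutoff is called enriched. Whether the ZIMHMM implementation uses exactly this rule is an assumption here, and the function names and toy probabilities are hypothetical. Adjacent called windows are then merged into peaks using the 500bp window size reported above.

```python
def fdr_threshold_calls(post_enriched, alpha=0.05):
    """Call windows enriched while the estimated FDR stays <= alpha.

    post_enriched: posterior probability of enrichment for each window.
    The estimated FDR of a call set is the mean posterior probability of
    background (1 - post_enriched) over the called windows.
    """
    local_fdr = [1.0 - p for p in post_enriched]
    # Rank windows from most to least confidently enriched.
    order = sorted(range(len(local_fdr)), key=local_fdr.__getitem__)
    calls = [False] * len(local_fdr)
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += local_fdr[idx]
        # The running mean of ascending values never decreases, so we can
        # stop at the first window that pushes it above alpha.
        if running_sum / rank > alpha:
            break
        calls[idx] = True
    return calls


def windows_to_peaks(calls, window_size=500):
    """Merge runs of consecutive called windows into (start, end) peaks."""
    peaks, start = [], None
    for i, called in enumerate(calls):
        if called and start is None:
            start = i
        elif not called and start is not None:
            peaks.append((start * window_size, i * window_size))
            start = None
    if start is not None:
        peaks.append((start * window_size, len(calls) * window_size))
    return peaks


# Toy posterior probabilities of enrichment for five consecutive windows.
posteriors = [0.99, 0.97, 0.10, 0.95, 0.96]
calls = fdr_threshold_calls(posteriors, alpha=0.05)
peaks = windows_to_peaks(calls, window_size=500)
print(calls, peaks)
```

A larger cutoff admits more windows, which matches the trend of increasing sensitivity and decreasing specificity across thresholds noted above.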
6.2. Analysis of ChIP-seq Data From Multiple Cell Lines
We analyzed data from CD4 memory, CD4 naive, CD8 naive, and CD34 mobilized primary cell lines from the Roadmap Project. We expected these cell lines to be heterogeneous regarding the enrichment profile of read counts, and they therefore served as a basis for a sensitivity analysis of the benchmarked consensus peak callers. The measures presented in Section 6.1 were also used in this scenario. Using RNA-seq data, genes were considered to be actively transcribed in consensus across cell lines if they were simultaneously active in all white blood cells (see Section D1 of the Web Appendix). Results are presented in Table 3.
Table 3.
Genome-wide peak calls and common associations for ChIP-seq data of H3K36me3 and H3K27me3 marks from CD4 memory primary, CD4 naive primary, CD8 naive primary, and CD34 mobilized primary cell lines. The running time of each method is shown in hours.
| Mark | Method | Peaks | Median Size | Coverage: Reads | Coverage: Active Regions | Coverage: Inactive Regions | Time (hours) |
|---|---|---|---|---|---|---|---|
| H3K36me3 | BCP-P | 8572 | 26.368 | 0.331 | 0.595 | 0.030 | 3.150 |
| | CCAT-P | 38735 | 1.075 | 0.131 | 0.150 | 0.003 | 11.247 |
| | JAMM | 72470 | 0.571 | 0.219 | 0.317 | 0.012 | 5.955 |
| | MOSAiCS-P | 15941 | 12.573 | 0.334 | 0.579 | 0.029 | 1.586 |
| | MACS2-P | 64331 | 1.478 | 0.310 | 0.489 | 0.025 | 1.036 |
| | RSEG-P | 6936 | 27.833 | 0.280 | 0.478 | 0.021 | 4.033 |
| | Zerone | 28913 | 2.930 | 0.210 | 0.289 | 0.009 | 0.024 |
| | ZIHMM | 31852 | 5.370 | 0.345 | 0.578 | 0.032 | 0.588 |
| | ZIMHMM | 29747 | 5.371 | 0.328 | 0.538 | 0.028 | 20.183 |
| H3K27me3 | BCP-P | 6872 | 12.208 | 0.191 | 0.015 | 0.120 | 2.379 |
| | CCAT-P | 8725 | 1.026 | 0.028 | 0.001 | 0.007 | 3.012 |
| | JAMM | 118528 | 0.295 | 0.106 | 0.009 | 0.054 | 8.123 |
| | MOSAiCS-P | 16630 | 8.508 | 0.190 | 0.015 | 0.107 | 1.503 |
| | MACS2-P | 91632 | 0.792 | 0.158 | 0.012 | 0.077 | 1.113 |
| | RSEG-P | 855 | 13.673 | 0.029 | 0.004 | 0.014 | 10.578 |
| | Zerone | 29304 | 2.929 | 0.118 | 0.010 | 0.061 | 0.038 |
| | ZIHMM | 58117 | 8.301 | 0.520 | 0.071 | 0.394 | 0.947 |
| | ZIMHMM | 51655 | 9.277 | 0.543 | 0.076 | 0.424 | 12.559 |
In this scenario, we observed that peak callers performed similarly for the H3K36me3 mark regarding the coverage of read counts, although BCP-P and MOSAiCS-P had a slightly higher coverage of actively transcribed gene bodies than the ZIMHMM. However, regions called by these two methods were consistently larger than actual gene bodies and did not show a reasonable spatial resolution when detecting minor changes in enrichment profile across cells (see Figures D.1 and D.2 of the Web Appendix). As noted, these are known characteristics of the pooling type of analysis from single-sample peak callers. For H3K27me3, we observed a significant improvement of the ZIMHMM over current multi- and single-sample peak callers regarding the coverage of read counts and gene bodies. Specifically, benchmarked methods covered no more than 20% of the mapped reads and had a low genome-wide coverage of inactive regions. In this scenario, accounting for cell line-specific shifts in the signal profile of read counts significantly improved the detection of enrichment regions in consensus across cells. Here, the ZIMHMM was more time consuming than other approaches, especially single-sample peak callers that call peaks with pooled data.
6.3. Association of H3K36me3, H3K27me3, and Gene Expression
We further compared peak callers regarding the genome-wide association of peaks with gene expression data as well as the coverage of the reads from the opposite mark. Called peaks were sorted with respect to the number of mapped reads, and the coverage of active and inactive regions by the top- and bottom-most peaks, respectively, was calculated. Peaks were also sorted regarding their read counts, and the coverage of H3K27me3 and H3K36me3 reads mapped onto the top- and bottom-most peaks, respectively, was calculated. These quantities provide measures of association between the two analyzed marks and their role in gene activation and suppression. In all the scenarios, read counts were previously normalized by the median log-ratios as in Section 6.1. Results are presented in Figure 3.
Overall, top peaks called by the ZIMHMM outperformed those of all other methods in most of the scenarios. The proposed model covered more of the actively transcribed gene bodies and read counts for H3K36me3 and H3K27me3, respectively. We observed that the performance of all methods except CCAT-P, JAMM, and Zerone was homogeneous when calling H3K36me3 peaks from white blood cells. Both multi-sample peak callers performed poorly for the two analyzed diffuse marks in all the scenarios.
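The ranking-based comparison can be sketched as follows. The example assumes each peak carries its own read count (used for ranking) and the number of opposite-mark reads mapped onto it; the dictionary keys, the helper name `cumulative_share`, and the toy counts are hypothetical, and the exact quantities plotted in Figure 3 may be defined differently.

```python
def cumulative_share(peaks, rank_key, value_key, total):
    """Cumulative share of `value_key` captured by the top-ranked peaks.

    Peaks are sorted by `rank_key` in decreasing order; element k of the
    result is the fraction of `total` contained in the top k + 1 peaks.
    """
    ranked = sorted(peaks, key=lambda peak: peak[rank_key], reverse=True)
    shares, running = [], 0
    for peak in ranked:
        running += peak[value_key]
        shares.append(running / total)
    return shares


# Toy H3K36me3 peaks: 'reads' are normalized H3K36me3 counts and
# 'other_reads' are H3K27me3 reads falling inside the same peak.
h3k36me3_peaks = [
    {"name": "a", "reads": 120, "other_reads": 10},
    {"name": "b", "reads": 300, "other_reads": 5},
    {"name": "c", "reads": 50, "other_reads": 25},
]
total_h3k27me3_reads = 100  # assumed genome-wide total for this toy example

curve = cumulative_share(
    h3k36me3_peaks, "reads", "other_reads", total_h3k27me3_reads
)
print(curve)
```

A curve that stays low for the top-ranked peaks indicates that strongly enriched H3K36me3 regions contain few H3K27me3 reads, which is the pattern expected for two nearly mutually exclusive marks.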
7. Discussion
Here, we presented the ZIMHMM, a statistical model tailored to call broad peaks in consensus across multiple ChIP-seq technical or biological replicates. The ZIMHMM models the excess of zeros of broad and diffuse marks and accounts for sample differences via random effects.
The ZIMHMM should be applied to multiple biological or technical ChIP-seq replicates with broad regions of signal, such as those pertaining to epigenomic marks. Methods focused on peak calling from multiple samples are of growing interest given the reduction of sequencing costs and higher data availability. Prior work on multi-sample peak callers has shown the benefits of data integration in ChIP-seq data analysis. However, there is no consensus in the literature on how to integrate results from multiple replicates, and current approaches perform poorly in finding epigenomic marks with broad peaks. In this paper, we analyzed H3K36me3 and H3K27me3, marks that are associated with gene activation and gene suppression, respectively. For the former mark, in particular, enrichment regions detected by the ZIMHMM were better associated with activated gene bodies than those of any other benchmarked peak caller. These results could, for instance, provide new insights to investigators interested in detecting cell-specific activated genes. The ZIMHMM is comparable to most of the current peak callers in terms of computing time and has been implemented in an R package that is available for download (see Section A4 of the Web Appendix for details).
Acknowledgements
The authors wish to thank the editor, associate editor and two referees for helpful comments and suggestions, which have led to an improvement of this article. This research was partially supported by NIH grants GM70335, P01CA142538, P30CA016086, P50CA058223 and by the Brazilian funding agency CAPES (13195/13–1).
Footnotes
Supporting Information
Web Appendices A–D, referenced in Sections 1–6, and a link to the implemented software are available with this article at the Biometrics website on Wiley Online Library.
References
- Altman RM (2007). Mixed hidden Markov models: an extension of the hidden Markov model to the longitudinal data setting. Journal of the American Statistical Association 102, 201–210.
- Bannister AJ and Kouzarides T (2011). Regulation of chromatin by histone modifications. Cell research 21, 381.
- Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, Wei G, Chepelev I, and Zhao K (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837.
- Bates D, Mächler M, Bolker B, and Walker S (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.
- Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, et al. (2010). The NIH Roadmap Epigenomics Mapping Consortium. Nature biotechnology 28, 1045.
- Chen L, Wang C, Qin ZS, and Wu H (2015). A novel statistical method for quantitative comparison of multiple ChIP-seq datasets. Bioinformatics 31, 1889–1896.
- Cuscò P and Filion GJ (2016). Zerone: a ChIP-seq discretizer for multiple replicates with built-in quality control. Bioinformatics 32, 2896–2902.
- Dunham I, Kundaje A, Aldred S, Collins P, Davis C, Doyle F, Epstein C, Frietze S, Harrow J, et al. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57.
- Efron B, Tibshirani R, Storey JD, and Tusher V (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151–1160.
- Fletcher R (2013). Practical methods of optimization. John Wiley & Sons.
- Huang T, Lin C, Zhong LL, Zhao L, Zhang G, Lu A, Wu J, and Bian Z (2017). Targeting histone methylation for colorectal cancer. Therapeutic advances in gastroenterology 10, 114–131.
- Ibrahim MM, Lacadie SA, and Ohler U (2014). JAMM: a peak finder for joint analysis of NGS replicates. Bioinformatics 31, 48–55.
- Jones PA, Issa J-PJ, and Baylin S (2016). Targeting the cancer epigenome for therapy. Nature reviews Genetics 17, 630.
- Jung YL, Luquette LJ, Ho JW, Ferrari F, Tolstorukov M, Minoda A, Issner R, Epstein CB, Karpen GH, Kuroda MI, et al. (2014). Impact of sequencing depth in ChIP-seq experiments. Nucleic acids research 42, e74–e74.
- Kuan PF, Chung D, Pan G, Thomson JA, Stewart R, and Keleş S (2011). A statistical framework for the analysis of ChIP-seq data. Journal of the American Statistical Association 106, 891–903.
- Liu X, Wang C, Liu W, Li J, Li C, Kou X, Chen J, Zhao Y, Gao H, Wang H, et al. (2016). Distinct features of H3K4me3 and H3K27me3 chromatin domains in pre-implantation embryos. Nature 537, 558.
- Lun AT and Smyth GK (2015). csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. Nucleic acids research 44, e45–e45.
- Ma P, Castillo-Davis CI, Zhong W, and Liu JS (2006). A data-driven clustering method for time course gene expression data. Nucleic acids research 34, 1261–1269.
- Pfister SX, Markkanen E, Jiang Y, Sarkar S, Woodcock M, Orlando G, Mavrommati I, Pai C-C, Zalmas L-P, Drobnitzky N, et al. (2015). Inhibiting WEE1 selectively kills histone H3K36me3-deficient cancers by dNTP starvation. Cancer cell 28, 557–568.
- Powell MJ (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Cambridge pages 26–46.
- Rashid NU, Giresi PG, Ibrahim JG, Sun W, and Lieb JD (2011). ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome biology 12, R67.
- Rashid NU, Sun W, and Ibrahim JG (2014). Some statistical strategies for DAE-seq data analysis: variable selection and modeling dependencies among observations. Journal of the American Statistical Association 109, 78–94.
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature methods 4, 651.
- Robinson MD and Smyth GK (2007). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332.
- Song Q and Smith AD (2011). Identifying dispersed epigenomic domains from ChIP-seq data. Bioinformatics 27, 870–871.
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, and Sidow A (2008). Genome-wide analysis of transcription factor binding sites based on ChIP-seq data. Nature methods 5, 829.
- Viterbi A (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory 13, 260–269.
- Xing H, Mo Y, Liao W, and Zhang MQ (2012). Genome-wide localization of protein-DNA binding and histone modification by a Bayesian change-point method with ChIP-seq data. PLoS computational biology 8, e1002613.
- Xu H, Handoko L, Wei X, Ye C, Sheng J, Wei C-L, Lin F, and Sung W-K (2010). A signal–noise model for significance analysis of ChIP-seq with negative control. Bioinformatics 26, 1199–1204.
- Yang Y, Fear J, Hu J, Haecker I, Zhou L, Renne R, Bloom D, and McIntyre LM (2014). Leveraging biological replicates to improve analysis in ChIP-seq experiments. Computational and structural biotechnology journal 9, e201401002.
- Young MD, Willson TA, Wakefield MJ, Trounson E, Hilton DJ, Blewitt ME, Oshlack A, and Majewski IJ (2011). ChIP-seq analysis reveals distinct H3K27me3 profiles that correlate with transcriptional activity. Nucleic acids research 39, 7415–7427.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, et al. (2008). Model-based analysis of ChIP-Seq (MACS). Genome biology 9, R137.