Abstract
Background
Next-generation sequencing (NGS) is widely used for genome-wide identification and quantification of DNA elements involved in the regulation of gene transcription. Studies that generate multiple high-throughput NGS datasets require data integration methods for two general tasks: 1) generation of genome-wide data tracks representing an aggregate of multiple replicates of the same experiment; and 2) combination of tracks from different experimental types that provide complementary information regarding the location of genomic features such as enhancers.
Results
NGS-Integrator is a Java-based command line application, facilitating efficient integration of multiple genome-wide NGS datasets. NGS-Integrator first transforms all input data tracks using the complement of the minimum Bayes’ factor so that all values are expressed in the range [0,1] representing the probability of a true signal given the background noise. Then, NGS-Integrator calculates the joint probability for every genomic position to create an integrated track. We provide examples using real NGS data generated in our laboratory and from the mouse ENCODE database.
Conclusions
Our results show that NGS-Integrator is both time- and memory-efficient. Our examples show that NGS-Integrator can integrate information to facilitate downstream analyses that identify functional regulatory domains along the genome.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-020-07220-7.
Keywords: Efficient data integration, Genome-wide NGS, NGS data analysis, Minimum Bayes factor
Background
Genome-wide next-generation sequencing (NGS) can provide information about binding of proteins to DNA, chromatin modifications, and chromatin accessibility. These factors are under intense study because they determine what genes are expressed in a given cell type and how gene expression is regulated in different physiological or pathophysiological states. In general, as with any experimental data, genome-wide NGS data has background noise, random and systematic errors that can obscure the features being investigated. Therefore, multiple replicates are needed to maximize confidence and consensus in feature identification. It is essential to replicate multiple NGS experiments for confidence in the produced data. However, approaches to utilization of multiple data tracks for feature identification has been limited largely to visualization of superimposed [1] or stacked data tracks. Peak-calling tools [2–5] are typically applied to individual replicates and not the aggregate data from multiple tracks.
In this paper, we propose an efficient method for integrating multiple tracks for data averaging of multiple replicates. Our proposed method, called NGS-Integrator, transforms genome-wide NGS data for identification of genomic regulatory elements and chromatin modifications (e.g. ChIP-Seq, ATAC-Seq, and Bisulfite-Seq) to a probability vector using the complement of the minimum Bayes’ factor for each track, followed by calculation of joint probabilities for the tracks as a function of position along the genome. The integrated tracks can then be used as input to peak calling programs to identify specific features with greater specificity.
Methods
NGS-Integrator combines NGS data tracks from experiments that identify genomic regulatory elements and chromatin modifications (e.g. ChIP-Seq, ATAC-Seq, and Bisulfite-Seq), but not RNA-Seq data. A genome-wide NGS track is a vector R = [ri, i] where ri represents a measured value at nucleotide base position i along the length of DNA representing the genome. Usually, but not always, ri represents a discrete value, e.g. the number of sequencing reads that overlap the position. In general, values of ri can range from 0 to infinity. Conceptionally, the value ri gives information about the likelihood of a given feature being present, e.g. binding of a particular transcription factor (TF) from ChIP-Seq data, when considered in the context of the intrinsic background noise, ni, in the signal. To transform the data in terms of probability of a feature, we can use the complement of the minimum Bayes factor,
where zi equals ri/ni. With two data tracks k = 1,2, a probability vector that combines information from both can be obtained by calculating the joint probability at each i (Pi) as
Implementation
Algorithm
Computationally, NGS-Integrator consists of two elements (Fig. 1a). The Calculator element calculates the complement of the minimum Bayes factor (cMBF) at each position i, estimating ni as the median of the ri values across a window of i values straddling the position at which pi is being calculated. This calculation assumes that the features being detected are relatively sparsely represented along the genome, allowing the median to be representative of the true background noise level. The Integrator element calculates the joint probability between the two NGS tracks Pi based on the respective pi vectors. When integrating multiple tracks, sequential dual-input calculations are performed, based on the commutativity of the joint probability operation. Also, it is obvious that the computational time complexity of the algorithms is O(kw) where k is the number of nucleotides in the genome and w is the size of the window. The memory-complexity of the algorithm is a linear function of k.
Input and output
For calculation of genome-wide minimum Bayes’ factors from the read depth, NGS-Integrator requires a bed file containing coverage (number of the sequence reads at each nucleotide) generated from standard tools, such as samtools (depth) and bedtools (coverage, genomecov). The input file should contain 4 fields, chromosome number, start region, end region and coverage value. The output of NGS-Integrator as a BED file contains cMBF values at each nucleotide position across the whole genome. The output BED file can be easily modified to bedGraph or BigWig file format to be visualized on genome browsers. The output of NGS-Integrator could also be used to identify overlapping regions across multiple genome-wide NGS data types (Fig. 1a).
Datasets for integration analyses using NGS-Integrator
Three replicates of ChIP-Seq for the transcription factor ELF1 (unpublished) using an ELF1 antibody (sc-631, Santa Cruz Biotechnology) done in mouse kidney cortical collecting duct cell mpkCCD cells as described previously [6] were used to generate input BED files with genome-wide coverage (10 bp-scale bin). Coverage across the whole genome was estimated by bedtools (coverage) from BAM files. To compare with our ELF1 ChIP-Seq data, public data for ELF1 ChIP-Seq data in erythroleukemia cells (ENCSR033OW) with fold change of signal over control track and its optimal irreproducible discover rate threshold peaks was downloaded from the mouse ENCODE database (https://www.encodeproject.org/).
Four replicates of ATAC-Seq data (GSE108786) were generated in mpkCCD cells as described previously [6]. Datasets of ChIP-Seq data for histone H3K27Ac (GSE95009) and RNA Polymerase II (GSE79584) were used from previous published data [7, 8]. Genome-wide coverage of every 10 bp-scale bin as an input bed file of each replicate was generated by bedtools (coverage) from BAM files.
Multiple ChIP-Seq datasets (GSE29218) published as part of the mouse ENCODE project were used to examine integration of multiple types of ChIP-Seq data. The ChIP-Seq datasets for P300 (GSM723018), RNA Polymerase II (GSM723019) and H3K27Ac (GSM851278) were generated in Bruce4 mouse embryonic stem cells (mESCs) by Dr. Bing Ren’s laboratory [9]. Two replicates from each ChIP-Seq dataset were used for integration using NGS-Integrator.
Results
Example 1:Integration of multiple-replicate transcription factor ChIP-Seq data
To provide an example of an application of NGS-Integrator for genome-wide NGS data for multiple replicates of the same experiment, we processed previously unpublished datasets obtained from antibody-based chromatin immunoprecipitation followed by NGS (ChIP-Seq) using an antibody to the transcription factor ELF1 (Fig. 1b). Based on technical limitations, especially in the antibody-based immunoprecipitation step, ChIP-Seq data for TF binding site identification often includes significant background noise that obscures specific binding sites. In this example, we used three replicates of ELF1 ChIP-Seq data generated in mouse kidney cortical collecting duct (mpkCCD) cells [6, 10]. Figure 1 shows a genomic region on the UCSC genome browser that includes the Eef2 gene. The replicates are shown along with peaks identified with MACS2 [11] for each. Only one peak is common to all three replicates. Integration of three replicate datasets using NGS-Integrator is shown as the red track. When this track was used as input to MACS2, additional peaks were called, many of which contain typical ELF1 binding motifs (ANCCGGAAGT) identified with Homer 4.9 (blue lines, Fig. 1). For comparison, we showed ELF1 ChIP-Seq data reported in ENCODE (ENCSR033OWC) for murine erythroleukemia (MEL) cells along with called peaks (IDR peaks) indicated by green bars. ELF1 binding from the present study is consistent with ELF1 binding in the ENCODE data even though the cell types are different. We provide an additional example of the use of NGS-Integrator to integrate ChIP-Seq data in Supplementary Fig. 1, mapping genomic binding sites of the TF SNAIL1 in human LS174T colorectal cancer cells [12].
Example 2:Integration of multiple replicates of ATAC-Seq data
ATAC-Seq (Assay for Transposase-Accessible Chromatin using Sequencing) is a recently developed method to map open chromatin regions across the genome [13] and is able to identify regulatory DNA elements where transcriptional regulators bind. We previously applied ATAC-Seq in mpkCCD cells to identify active regulatory DNA elements near the Aqp2 gene, which codes for the water channel protein aquaporin-2 [6]. Here, four replicates of ATAC-Seq data from mpkCCD cells were integrated (Fig. 2a) as another example of homogeneous data integration. Data integration using NGS-Integrator allows the data to be summarized in a single track with improved overall signal-to-noise discrimination (red track, Fig. 2a).
Example 3:Combining data from RNA polymerase II ChIP-Seq, histone H3K27Ac ChIP-Seq, and ATAC-Seq to predict locations of enhancers
As a third example, heterogeneous data integration was done using three types of data expected to overlap at active enhancer sites, viz. ATAC-Seq, ChIP-Seq for histone H3 acetylation at Lysine 27 (H3K27Ac) and ChIP-Seq for RNA polymerase II (RNA Pol II), all from mouse mpkCCD cells (Fig. 2b). Although these three features are known to overlap at enhancers [14], none alone are adequate to precisely identify the boundaries of active enhancers. Typically, sequences within identified enhancers are analyzed to identify putative TF binding sites. By sharpening enhancer boundaries, NGS-Integrator can narrow the list of putative TFs that need to be studied further to understand the regulation of a particular gene.
Example 4:Combining data from P300, RNA polymerase II, histone H3K27Ac ChIP-Seq datasets to predict locations of active regulatory elements
To test integration of a large number of datasets, NGS-Integrator was applied to ChIP-Seq datasets published by Dr. Bing Ren’s laboratory. Total 6 ChIP-Seq datasets for the histone acetyltransferase P300, RNA Pol II and Histone H3K27Ac (two replicates per target protein) generated from mouse embryonic stem cell line, Bruce4, were used to calculate cMBFs from each dataset and combine all 6 datasets into one track with NGS-Integrator (Fig. 3), revealing active regions surrounding Eef2 including one at the transcriptional start site.
Performance characteristics
Figure 4 shows execution time and memory usage for a range of input file sizes for the Calculator module (Fig. 4a) and the Integrator module of NGS-Integrator (Fig. 4b). To vary input file sizes, multiple scale bins (1000 bp, 500 bp, 100 bp, 50 bp, 10 bp and 5 bp) were generated in four different mouse chromosomes (1, 10, 17, and 19) through the coverage calculation using bedtools prior to application of NGS-Integrator. As shown in Fig. 4a, the execution time increases with a linear-trend with increasing size of the input file. The memory consumption for given set of files was also noted to be consistent with increasing size of the data. Figure 4b shows that decreasing the file size for integrating output files from the Calculation module corresponds to a sharp decrease in the execution time, as expected. Complexity analysis of the algorithms suggests that it can operate in linear time and space with respect to the length of the genome, consistent with the results shown here.
Conclusions
In this paper we have proposed, designed and implemented easy-to-use software called NGS-Integrator which allows integration of multiple genome-wide NGS data tracks. Our software is implemented in Java and allows integration of data obtained either from multiple datasets of the same technique or data sets derived from different, but complementary, techniques. As shown in the examples, NGS-Integrator can integrate information to facilitate downstream analyses that identify functional domains along the genome. In the Calculator module, the NGS data are transformed to range [0,1] through use of the complement of the minimum Bayes’ factor. Fundamentally, any data transformation that yields values in the range [0,1] and has a positive monotonic relation to the input data would work. The main reason for the choice of the minimum Bayes’ factor is the direct use of the background noise for the calculation, which is easily estimated for sparse data as described. We expect that NGS-Integrator will be very useful for a wide variety of experimental studies where multiple replicates of genome-wide data need to be integrated into a single track.
Availability and requirements
Project name: NGS-Integrator.
Project home page: https://hpcwebapps.cit.nih.gov/ESBL/NGS-Integrator/.
Archived version: GitHub (https://github.com/ESBL/NGS-Integrator).
Operating system(s): Platform independent.
Programming language: Java.
Other requirements: Java 1.7 or higher.
License: Open source.
Supplementary Information
Acknowledgements
We carried out next-generation sequencing in the National Heart, Lung, and Blood Institute DNA Sequencing Core Facility (Dr. Yuesheng Li, Director).
Abbreviations
- NGS
Next-generation sequencing
- TF
Transcription factor
- cMBF
Complement of the minimum Bayes factor
- ChIP-seq
Chromatin immunoprecipitation followed by DNA sequencing
- ATAC-seq
Assay for Transposase-Accessible Chromatin using sequencing
- H3K27Ac
Histone H3 acetylation at lysine 27
- RNA Pol II
RNA polymerase II
Authors’ contributions
BW, LC, HJJ and MAK designed the algorithm. BW and HJJ wrote the code and HJJ carried out NGS experiments. BW, HJJ and FS carried out performance testing. BW, LC, HJJ, FS and MAK wrote the manuscript and all authors revised and approved the final manuscript.
Funding
The work was supported by the Division of Intramural Research, National Heart, Lung, and Blood Institute projects ZIA-HL001285 (to MAK) and ZIA-HL006129 (to MAK). BW was supported by the NHLBI Summer Internship Program (June–August 2018). Fahad Saeed was supported by the NIGMS of the National Institutes of Health (NIH) under award number: R01GM134384, and by the National Science Foundations (NSF) under the award number: NSF CAREER OAC-1925960.
Availability of data and materials
The datasets analyzed during the current study are available in the GEO: ATAC-Seq data (GSE108786), ChIP-Seq data for histone H3K27Ac (GSE95009), and RNA Polymerase II (GSE79584). The datasets of ELF1 ChIP-seq analyzed during the current study are available from the corresponding author on request.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Bronte Wen and Hyun Jun Jung contributed equally to this work.
References
- 1.An J, Lai J, Wood DL, Sajjanhar A, Wang C, Tevz G, et al. RNASeqBrowser: a genome browser for simultaneous visualization of raw strand specific RNAseq reads and UCSC genome browser custom tracks. BMC Genomics. 2015;16:145. doi: 10.1186/s12864-015-1346-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP-seq enrichment using MACS. Nat Protoc. 2012;7(9):1728–1740. doi: 10.1038/nprot.2012.101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009;27(1):66–75. doi: 10.1038/nbt.1518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Xu J, Zhang Y. A generalized linear model for peak calling in ChIP-Seq data. J Comput Biol. 2012;19(6):826–838. doi: 10.1089/cmb.2012.0023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008;26(11):1293–1300. doi: 10.1038/nbt.1505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Jung HJ, Raghuram V, Lee JW, Knepper MA. Genome-wide mapping of DNA accessibility and binding sites for CREB and C/EBPbeta in vasopressin-sensitive collecting duct cells. J Am Soc Nephrol. 2018;29(5):1490–1500. doi: 10.1681/ASN.2017050545. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sandoval PC, Claxton JS, Lee JW, Saeed F, Hoffert JD, Knepper MA. Systems-level analysis reveals selective regulation of Aqp2 gene expression by vasopressin. Sci Rep. 2016;6:34863. doi: 10.1038/srep34863. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Isobe K, Jung HJ, Yang CR, Claxton J, Sandoval P, Burg MB, et al. Systems-level identification of PKA-dependent signaling in epithelial cells. Proc Natl Acad Sci U S A. 2017;114(42):E8875–E8884. doi: 10.1073/pnas.1709123114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488(7409):116–120. doi: 10.1038/nature11243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Yu MJ, Miller RL, Uawithya P, Rinschen MM, Khositseth S, Braucht DW, et al. Systems-level analysis of cell-specific AQP2 gene expression in renal collecting duct. Proc Natl Acad Sci U S A. 2009;106(7):2441–2446. doi: 10.1073/pnas.0813002106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS) Genome Biol. 2008;9(9):R137. doi: 10.1186/gb-2008-9-9-r137. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Beyes S, Andrieux G, Schrempp M, Aicher D, Wenzel J, Anton-Garcia P, et al. Genome-wide mapping of DNA-binding sites identifies stemness-related genes as directly repressed targets of SNAIL1 in colorectal cancer cells. Oncogene. 2019;38(40):6647–6661. doi: 10.1038/s41388-019-0905-4. [DOI] [PubMed] [Google Scholar]
- 13.Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10(12):1213–1218. doi: 10.1038/nmeth.2688. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Spicuglia S, Vanhille L. Chromatin signatures of active enhancers. Nucleus. 2012;3(2):126–131. doi: 10.4161/nucl.19232. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets analyzed during the current study are available in the GEO: ATAC-Seq data (GSE108786), ChIP-Seq data for histone H3K27Ac (GSE95009), and RNA Polymerase II (GSE79584). The datasets of ELF1 ChIP-seq analyzed during the current study are available from the corresponding author on request.