Abstract
Summary
Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows both to evaluate reproducibility of biological or technical replicates, and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers.
Availability and Implementation
An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Computational assessment of reproducibility across nucleic acid sequencing data is a pivotal component in genomic studies. Moreover, the ever-growing list of available datasets demands robust methods to quickly mine such resources to identify novel potential functional correlations between various genetic and epigenetic regulations. Chromatin immunoprecipitation followed by sequencing, or ChIP-seq, is a widely used method to profile histone modifications (HMs) and transcription factor (TF) binding at genome-wide scale. For each dataset, a set of peaks (regions of statistically significant read counts when compared against an IgG or input DNA controls) can be obtained (Bailey et al., 2013). Reproducibility can then be evaluated by genome-wide Pearson correlation analysis, and peaks in replicates can be compared using Irreproducible Discovery Rate (IDR) analysis and/or overlap analysis (Bailey et al., 2013; Li et al., 2011). However, IDR was designed to find a set of reproducible peaks among different replicates of the same type, but cannot be used to compare distinct HMs or TFs datasets. Overlap analysis suffers as well from inherent statistical problems (Bardet et al., 2011). The author has previously developed a methodology that, by using functional principal component analysis, revealed novel correlations between histone modifications that do not colocalize (Madrigal and Krajewski, 2015). Here, we present fCCAC, a functional canonical correlation analysis approach to allow the assesment of: (i) reproducibility of biological or technical replicates analyzing their shared covariance in higher order components; (ii) the associations between different datasets. We propose a new statistic to summarize canonical correlations that can be used instead of genome-wide (or peak based) Pearson correlation coefficient, with the advantage of using the profile of the genomic regions to study their covariance at higher orders. We assume that technical and biological replicates will share most of the variability, as will do so bona-fide interactions between different co-factors. Overall, fCCAC greatly facilitates the assessment of covariance in genomic applications.
2 Implementation
Functional data analysis is a raising field of statistics that allows moving from discrete measurements to functional approximations using an expansion in basis functions (Ramsay and Silverman, 2005). As in Madrigal and Krajewski (2015), we have used cubic splines to approximate data, which we read from genomic coverages in bigWig format. For N genomic regions (provided in BED format) we have two sets of curves, (xi, yi), . The curves are then centered, and principal modes of variation and between xi and yi in terms of probe weight functions ξ and η can be estimated (Supplementary Material). The N pairs of probe scores represent shared variability if they correlate strongly with one another. Then, squared canonical correlations , can be calculated as in Ramsay and Silverman (2005) by constraining successive canonical probe values to be orthogonal. Values of close to 1.0 imply high covariation between the two samples (Supplementary Information). For K squared canonical correlations, we can compute a weighted squared correlation as , where the weights are the kth harmonic number, and decrease with the order of the canonical component. Then, we can report SK as a fraction over the maximum , where F represents an overall measure of shared covariation. The user interacts with the main function fccac (examples can be found in the Supplementary Information and in the vignette of the package in Bioconductor).
3 Results
To exemplify the methodology we explored the correlation between the nucleosomal HM H3K4me3 and several TFs and chromatin epigenetic remodelers. For this, we focused on human embryonic stem cells (hESCs). We took advantage of recently published H3K4me3 ChIP-seq data (Bertero et al., 2015), which was performed in biological triplicate from the H9 hESC line. First, we defined an aggregated list of peaks at H3K4me3 as our reference set to study replicate reproducibility (23 422 peaks). The results showed high shared covariation (F > 95%) for the H3K4me3 ChIP-seq triplicates, as expected (analogous analysis for H3K27me3 confirmed the irreproducibility of one of the replicates; Supplementary Material). Then, we analyzed the relationships between H3K4me3 deposition and other genomic datasets for DNA binding proteins. For this, we included ChIP-seq data for DPY30 (Bertero et al., 2015), since this protein is part of the enzymatic complex responsible for the deposition of the H3K4me3 mark, as well 58 other DNA binding proteins included in the ENCODE dataset for the H1 hESC line (97 datasets) (ENCODE Project Consortium et al., 2012). We found high canonical correlations between H3K4me3 and DPY30 (Fig. 1A), as expected (Bertero et al., 2015). Only PHF8 (F = 54.2%) and KDM4A (JMJD2C) showed higher F value than DPY30 (F = 37.2%; Fig. 1B), in agreement with their known ability to bind to H3K4me3 (Feng et al., 2010; Pedersen et al., 2014). When we monitored all possible combinations of interactions in H3K4me3 regions, TFs BRCA1 and CHD2 showed F = 92% in H3K4me3, in agreement with motif analyses suggesting that they might form part of the same complex (Kheradpour and Kellis, 2014). Finally, we compared F to Pearson product-moment correlation coefficient. Both measures were similar between replicates of same HM or TF, but substantially differed otherwise (Supplementary Information).
4 Conclusion
fCCAC represents a more sophisticated approach that complements Pearson correlation of genomic coverage. This method can be used (i) to evaluate reproducibility, and flag datasets showing low canonical correlations; (ii) or to investigate covariation between genetic and epigenetic regulations, in order to infer their potential functional correlations. Overall, this method will facilitate the development of new hypothesis regarding how TFs, chromatin remodelling enzymes, histone marks, RNA binding proteins, and epitranscriptome changes can cooperatively dictate the specification of cell function and identity.
Supplementary Material
Acknowledgements
The author would like to thank Alessandro Bertero for useful comments on an early draft.
Funding
This work was supported by the ERC starting grant Relieve-IMDs and core support grant from the Wellcome Trust and MRC to the Wellcome Trust – Medical Research Council Cambridge Stem Cell Institute.
Conflict of Interest: none declared.
References
- Bailey T. et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput. Biol., 9, e1003326.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bardet A.F. et al. (2011) A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc., 7, 45–61. [DOI] [PubMed] [Google Scholar]
- Bertero A. et al. (2015) Activin/nodal signaling and NANOG orchestrate human embryonic stem cell fate decisions by controlling the H3K4me3 chromatin mark. Genes. Dev., 29, 702–717. [DOI] [PMC free article] [PubMed] [Google Scholar]
- ENCODE Project Consortium et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng W. et al. (2010) PHF8 activates transcription of rRNA genes through H3K4me3 binding and H3K9me1/2 demethylation. Nat. Struct. Mol. Biol., 17, 445–450. [DOI] [PubMed] [Google Scholar]
- Kheradpour P., Kellis M. (2014) Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res., 42, 2976–2987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Q. et al. (2011) Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat., 5, 1752–1779. [Google Scholar]
- Madrigal P., Krajewski P. (2015) Uncovering correlated variability in epigenomic datasets using the Karhunen-Loeve transform. BioData Min., 8, 20.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pedersen M.T. et al. (2014) The demethylase JMJD2C localizes to H3K4me3-positive transcription start sites and is dispensable for embryonic development. Mol. Cell Biol., 34, 1031–1045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ramsay J.O., Silverman B.W. (2005) Functional Data Analysis. Springer-Verlag, New York. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.