Summary
Genome-wide chromatin conformation capture technologies such as Hi-C are commonly employed to study chromatin spatial organization. In particular, to identify statistically significant long-range chromatin interactions from Hi-C data, most existing methods such as Fit-Hi-C/FitHiC2 and HiCCUPS assume that all chromatin interactions are statistically independent. Such an independence assumption is reasonable at low resolution (e.g., 40 kb bin) but is invalid at high resolution (e.g., 5 or 10 kb bins) because spatial dependency of neighboring chromatin interactions is non-negligible at high resolution. Our previous hidden Markov random field-based methods accommodate spatial dependency but are computationally intensive. It is urgent to develop approaches that can model spatial dependence in a computationally efficient and scalable manner. Here, we develop HiC-ACT, an aggregated Cauchy test (ACT)-based approach, to improve the detection of chromatin interactions by post-processing results from methods assuming independence. To benchmark the performance of HiC-ACT, we re-analyzed deeply sequenced Hi-C data from a human lymphoblastoid cell line, GM12878, and mouse embryonic stem cells (mESCs). Our results demonstrate advantages of HiC-ACT in improving sensitivity with controlled type I error. By leveraging information from neighboring chromatin interactions, HiC-ACT enhances the power to detect interactions with lower signal-to-noise ratio and similar (if not stronger) epigenetic signatures that suggest regulatory roles. We further demonstrate that HiC-ACT peaks show higher overlap with known enhancers than Fit-Hi-C/FitHiC2 peaks in both GM12878 and mESCs. HiC-ACT, effectively a summary statistics-based approach, is computationally efficient (∼6 min and ∼2 GB memory to process 25,000 pairwise interactions).
Keywords: chromatin interactions, Hi-C, HiC-ACT, aggregated Cauchy test, summary statistics-based approach
Introduction
Chromatin spatial organization plays a critical role in genome functions such as transcription regulation and DNA replication.1, 2, 3 Studies have shown that millions of putative cis-regulatory elements, such as enhancers, exist within the genome; many of these elements are far away in one-dimensional (1D) genomic distance from their target genes (e.g., up to 1 Mb away).1,2,4, 5, 6 Because of the abundance of enhancers and their long-range regulation roles, systematic mapping of enhancer-promoter interactions is challenging.1
Genome-wide chromosome conformation capture techniques such as Hi-C7 have been widely used to study three-dimensional (3D) organization of chromatin. Hi-C data can be summarized into a contact matrix of all possible pairwise interactions between ligated fragments genome wide. As comprehensive chromatin interaction maps become increasingly prominent as a result of increases in sequencing capacity and decreases in cost, there is an urgent need to develop tools to analyze and interpret this type of data.8 Such methods to detect statistically significant long-range chromatin interactions (also referred to as “peak callers”) seek to determine whether the observed contact frequency is significantly higher than expected from chromatin random collision.
Fit-Hi-C9 is a popular method to evaluate pairs of chromatin loci independently, and it assigns each pair a statistical confidence (p value). Fit-Hi-C corrects distance dependence and potential systematic biases in Hi-C datasets by fitting non-parametric splines to model the background chromatin contact frequency.9, 10, 11 Recently, a re-implementation, FitHiC2,10 was released. Along with the addition of new computational modules, FitHiC2 can be applied to the highest-resolution Hi-C datasets currently available.10 However, in high-resolution data (e.g., 5 or 10 kb bin resolution), neighboring chromatin interactions are unlikely to be independent, as assumed in FitHiC2. When this independence assumption is violated, the p values corresponding to chromatin interactions are inaccurate.
We have previously demonstrated that spatial dependency is non-negligible when analyzing Hi-C data at high resolution. Accordingly, we developed hidden Markov random field (HMRF)-based methods, HMRF-Bayes12 and FastHiC,13 to accommodate spatial dependency for improved statistical properties. However, compared with FitHiC2, which analyzes each pair of chromatin loci separately, our HMRF-based framework is more computationally intensive.
HiCCUPS14 (Hi-C computational unbiased peak search) is another commonly adopted method for identifying significant chromatin interactions. Unlike Fit-Hi-C/FitHiC2 and our HMRF-based methods, which use a global background model,9,10,12 HiCCUPS uses a local background model in which each chromatin loci pair has a unique model influenced by information from its local neighborhood.14 This model defines peaks on the basis of whether the loci pair interacts significantly more frequently than loci pairs in its neighborhood. Therefore, HiCCUPS effectively detects summits of chromatin interactions rather than the peaks identified by Fit-Hi-C/FitHiC2 and our HMRF-based methods. A more recently published method, MUSTACHE, similarly relies on a local background model and detects summits by using a scale-space modeling framework inspired by methods in computer vision.15 The summit-detection strategy is valuable for distinguishing the most frequently interacting pairs from their neighborhoods, but it limits the ability to identify many functionally important interactions linking cis-regulatory elements such as promoters and enhancers.10,14
In this paper, we develop HiC-ACT, a method for post-processing peak calling results from methods that do not consider spatial dependency. HiC-ACT’s post-processing via an aggregated Cauchy test approach accounts for possible correlation between adjacent loci pairs from high resolution Hi-C data. HiC-ACT, a summary statistics-based approach, is flexible in application, only requiring the input of bin identifiers and corresponding raw p values generated from established 3D peak callers rather than raw Hi-C data. HiC-ACT also allows users to specify a smoothing parameter based on the data resolution. Moreover, HiC-ACT does not require any information about the underlying correlation structure in the data while being able to account for the inherent correlation between bin (loci) pairs.
The implementation of p value smoothing in HiC-ACT improves identification of significant chromatin interactions and recovers information lost in sparse data. Since HiC-ACT borrows information from neighboring loci pairs, it calls peaks rather than summits. Thus, we chose to compare HiC-ACT to FitHiC2. Both simulation studies and real data analysis demonstrate that HiC-ACT outperforms FitHiC2 in increasing recall with comparable precision.
In the remainder of this article, we specify the HiC-ACT model and provide details regarding the workflow. Next, we show real data-based simulation results based on Hi-C data from the human lymphoblastoid cell line GM12878 at various sequencing depths. Then, we perform real data analysis using Hi-C datasets from GM12878 and mouse embryonic stem cells (mESCs). Finally, we conclude with some discussions.
Material and methods
Aggregated Cauchy combination test
HiC-ACT is based on the aggregated Cauchy combination test16 to combine a set of p values, $p_1, \ldots, p_k$. We use a linear combination of transformed p values with non-negative weights:
$$T = \sum_{i=1}^{k} w_i \tan\{(0.5 - p_i)\pi\}$$ (Equation 1)
where $p_i$ is the individual p value, $w_i$ is the non-negative weight such that $\sum_{i=1}^{k} w_i = 1$, and $k$ is the total number of p values to be combined. When only one p value is considered ($k = 1$), it is straightforward to show that $T$ follows a Cauchy distribution (location parameter 0, scale parameter 1) under the null hypothesis that $p_1$ is uniformly distributed between 0 and 1.16 Liu and Xie showed that this combination of p values, $T$, follows a standard Cauchy distribution under the null hypothesis.16 Assume that the p values are calculated from Z scores and let $Z = (Z_1, \ldots, Z_k)$, where $Z_i$ is a test statistic corresponding to $p_i$. The null hypothesis can then be written as $H_0: E(Z) = 0$.
Thus, $p_i$ can be expressed as $p_i = 2\{1 - \Phi(|Z_i|)\}$, where $\Phi$ is the standard normal cumulative distribution function. We can then rewrite (Equation 1) as follows:
$$T = \sum_{i=1}^{k} w_i \tan\{[2\Phi(|Z_i|) - 3/2]\pi\}$$ (Equation 2)
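The step from Equation 1 to Equation 2 is a direct substitution. As a short worked check, assuming two-sided p values of the form $p_i = 2\{1 - \Phi(|Z_i|)\}$ with $\Phi$ the standard normal cumulative distribution function, the argument of the tangent becomes:

$$(0.5 - p_i)\pi = \left[0.5 - 2 + 2\Phi(|Z_i|)\right]\pi = \left[2\Phi(|Z_i|) - \tfrac{3}{2}\right]\pi$$

so each term of the weighted sum can be written either in terms of the p value or in terms of the underlying Z score.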
If the $Z_i$s are perfectly dependent (i.e., all the $Z_i$s are equal or linear functions of one another) or perfectly independent, it can be shown that the sum of multiple independent Cauchy random variables also follows a Cauchy distribution. Furthermore, it has been shown that this holds even when the $Z_i$s are correlated.16,17
Liu et al. further showed that under arbitrary dependency structures $T$ has approximately a Cauchy tail.16,17 They also demonstrated that correlation among the $Z_i$s has very limited effect on the tail of the distribution. Consequently, we can transform the test statistic $T$ back to a p value by using the Cauchy(0,1) distribution:
$$p \approx \frac{1}{2} - \frac{\arctan(T)}{\pi}$$ (Equation 3)
Because of the heavy tail of the Cauchy distribution, $T$ is insensitive to the correlation of the p values, especially at the tail of the distribution, leading to accurate approximations for small p values.16,17 This desirable property of the approximation in Equation 3 for small p values is of particular interest in Hi-C data analysis. Liu et al. also argued that if the individual p values are conservative, the combined p value will be conservative as well and the type I error is controlled.17
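The combination rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cauchy_combine` and the default of equal weights are assumptions made here. For numerical stability with very small p values, the sketch uses two identities that are mathematically equivalent to the formulas in the text: tan((0.5 − p)π) = 1/tan(πp), and 1/2 − arctan(T)/π = atan2(1, T)/π.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Combine p values with the Cauchy combination test.

    Hypothetical helper for illustration; assumes every p value lies in (0, 1).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p value")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * k          # equal weights (an assumption of this sketch)
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = sum(weights)
    # Weighted sum of Cauchy-transformed p values, with weights rescaled to sum to 1.
    # 1 / tan(pi * p) equals tan((0.5 - p) * pi) but keeps precision for tiny p.
    t = sum((w / total) / math.tan(math.pi * p) for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy at t.
    # atan2(1, t) / pi equals 0.5 - atan(t) / pi without cancellation for large t.
    return math.atan2(1.0, t) / math.pi
```

With a single p value the function returns that p value unchanged, and a single very small p value among otherwise unremarkable ones dominates the combined result, which is the behavior the heavy Cauchy tail is meant to provide.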
HiC-ACT test statistic
Using the framework above, we specify the HiC-ACT test statistic as follows. Let $p_{ij}$ represent the p value for chromatin interaction between bin $i$ and bin $j$ from a specific Hi-C peak calling method. Consider the null hypothesis that the contact frequency between bin pair $(i, j)$ is due to random chromatin collision. Define the HiC-ACT test statistic as
$$T_{\mathrm{ACT}}(i,j) = \sum_{(i',j'):\ |i'-i| + |j'-j| \le h} w_{i'j'} \tan\{(0.5 - p_{i'j'})\pi\}$$ (Equation 4)
Here, $h$ is the local smoothing bandwidth. We followed the strategy adopted by the HiCRep18 method to determine the size of the smoothing window based on data resolution (Table S1). We take $w_{i'j'}$ to be the Gaussian kernel weight function, defined as
(Equation5) |
The criterion in Equation 4 that determines which bin pairs are included in the chromatin interaction neighborhood is derived from the equation of a diamond and ensures that the p values of all bin pairs within a specified distance from the bin pair of interest are combined. Note that the p value for the bin pair itself contributes to the statistic and, thus, the smoothed p value.
On the basis of the theory established, the HiC-ACT test statistic $T_{\mathrm{ACT}}(i,j)$ approximately follows a standard Cauchy distribution under the null hypothesis. Therefore, the p value for $T_{\mathrm{ACT}}(i,j)$ can be approximated by
$$p_{\mathrm{ACT}}(i,j) \approx \frac{1}{2} - \frac{\arctan\{T_{\mathrm{ACT}}(i,j)\}}{\pi}$$ (Equation 6)
We can interpret $p_{\mathrm{ACT}}(i,j)$ as the local neighborhood smoothed p value. Intuitively, for a biologically meaningful chromatin interaction, all bin pairs in its neighborhood are more likely to have significant p values. Thus, the combined p value tends to be more significant and is driven by small p values in its neighborhood.
In an application to rare variant association analysis, Liu et al. demonstrated that the aggregated Cauchy test is powerful under sparse alternatives.17 Our application to Hi-C data is also subject to sparse alternatives because there are relatively few interactions due to chromatin looping compared to the vast number of random events of chromatin collision. As shown by Liu and Xie, the Cauchy combination test handles arbitrary dependency structures without knowledge of the correlation values.16 Through this property, HiC-ACT specifically accounts for the inherent correlation across neighboring pairs while maintaining the benefit of not needing to specify the correlations.
Workflow
To implement HiC-ACT, we first obtain results from a standard peak caller that does not consider spatial dependency. HiC-ACT only requires bin pair identifiers and the corresponding p values. Next, we set the smoothing bandwidth $h$ based on the data resolution (see Table S1 for suggestions). Then, we identify a set of bin pairs of interest, e.g., by selecting bin pair $(i, j)$ if its raw p value $p_{ij}$ is less than a specified threshold. We recommend that this threshold depend on the total number of mapped reads in the data (Table S2). For each pair, HiC-ACT determines all possible pairs that meet the criterion in Equation 4, calculates the weights, and then computes the test statistic $T_{\mathrm{ACT}}(i,j)$ and its corresponding p value for each pair in the set of interest by using Equation 6.
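The workflow can be illustrated with a small Python sketch. It is not the released HiC-ACT code, and several choices are assumptions made only for illustration: the function names, the dictionary input keyed by integer bin indices, the specific Gaussian kernel form exp(−(Δi² + Δj²)/(2h²)), the decision to skip neighbors that the peak caller did not report, and the renormalization of weights so that they sum to 1. Raw p values are assumed to lie in (0, 1].

```python
import math

def gaussian_weight(di, dj, h):
    """Hypothetical Gaussian kernel on the bin offset (di, dj); the exact
    form and scaling by h are assumptions of this sketch."""
    return math.exp(-(di * di + dj * dj) / (2.0 * h * h))

def hic_act_pvalue(pvals, i, j, h):
    """Neighborhood-smoothed p value for bin pair (i, j).

    pvals: dict mapping (bin_i, bin_j) -> raw p value from a peak caller.
    """
    num = 0.0
    den = 0.0
    for di in range(-h, h + 1):
        span = h - abs(di)                 # diamond criterion: |di| + |dj| <= h
        for dj in range(-span, span + 1):
            p = pvals.get((i + di, j + dj))
            if p is None:                  # neighbor not reported: skipped (assumption)
                continue
            w = gaussian_weight(di, dj, h)
            # 1 / tan(pi * p) equals tan((0.5 - p) * pi), stable for tiny p
            num += w / math.tan(math.pi * p)
            den += w
    if den == 0.0:
        raise KeyError("no p values found in the neighborhood")
    t_act = num / den                      # weights rescaled to sum to 1
    # standard Cauchy upper tail; equals 0.5 - atan(t_act) / pi
    return math.atan2(1.0, t_act) / math.pi

def run_hic_act(pvals, h, p_filter):
    """Smooth only the bin pairs whose raw p value passes the initial filter."""
    return {pair: hic_act_pvalue(pvals, pair[0], pair[1], h)
            for pair, p in pvals.items() if p < p_filter}
```

A call such as `run_hic_act(pvals, h=20, p_filter=1e-3)` would mirror the 10 kb setting discussed in the text, with the filter value chosen according to sequencing depth. Because the center pair is inside its own diamond, an isolated bin pair keeps its raw p value.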
In Figure 1, we present a motivating example for HiC-ACT by using 10 kb GM12878 Hi-C data acquired from the Rao et al. study consisting of ∼4.9 billion pairwise contacts.14 Each colored pixel on the heatmap represents the strength of the FitHiC2-identified interaction (p value), represented on the −log10 scale. This specific chromatin interaction (i.e., bin pair) is centered at 50,625,000 bp and 50,975,000 bp on chromosome 22 (marked by a blue “x”) and has one end overlapping with a super-enhancer reported by the Roadmap Epigenomics Consortium19 and the other end overlapping with the transcription start site (TSS)20 of the highly expressed TRABD gene (FPKM = 17).21 However, when the data is down-sampled to ∼1 billion raw reads (a more realistic sequencing depth), this interaction is not classified as a significant peak by FitHiC2 (p = 8.62e−4) (see Table S2 for details on how peaks were determined) (Figure 1A). When HiC-ACT is applied to these FitHiC2 results, the resulting p value is highly significant (p = 2.73e−19) as expected given the biological evidence. Figure 1B displays the corresponding heatmap for FitHiC2 interactions/p values called on the full GM12878 data (∼4.9 billion reads). The FitHiC2 p value here for the specified interaction is 3.50e−11. Comparing Figure 1A to Figure 1B, we notice that information is lost in data with shallower sequencing depth. HiC-ACT is able to recover some information lost in Hi-C data with shallower sequencing depths by leveraging information from neighboring loci.
We also note that there are other bin pairs in this illustrated neighborhood with significant interactions. As mentioned previously, HiC-ACT and FitHiC2 call peaks rather than summits. Calling peaks ensures a higher coverage of capturing functional chromatin interactions, as opposed to calling summits, which can be driven by a combination of stochasticity and proximity to bona fide interactions. Although the highlighted bin pair in Figure 1 does not have the strongest signal, it completely overlaps the enhancer region, as opposed to, for example, the bin pair directly below that only partially overlaps with the enhancer region.
Results
Real data-based simulations
We first used real data-based simulations to assess the performance of HiC-ACT. The simulations were based on the 10 kb GM12878 Hi-C data consisting of ∼4.9 billion pairwise contacts.14 FitHiC2 results generated from this high-depth data were treated as the truth. Approximately 1.57 million significant chromatin interactions were identified on the basis of the criterion that the observed contact count > 15, the expected contact count > 5, the ratio of observed to expected > 1.5, and the p value < 1.0e−12. The p value threshold was informed by a recent study of 10 kb bin resolution deeply sequenced Hi-C data from human brain cortex, where high-confidence regulatory chromatin interactions were determined with p value < 2.31e−11.5
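The peak definition used for the working truth can be written as a simple predicate. The sketch below is illustrative only; the function name `is_peak` and its argument names are assumptions, while the default thresholds are the ones stated in the text for the full-depth data.

```python
def is_peak(observed, expected, p_value,
            min_observed=15, min_expected=5, min_ratio=1.5, p_threshold=1.0e-12):
    """Return True if a bin pair meets all four significance criteria.

    Hypothetical helper; defaults follow the full-depth criteria in the text.
    """
    return (observed > min_observed
            and expected > min_expected
            and observed > min_ratio * expected   # observed/expected ratio > min_ratio
            and p_value < p_threshold)
```

Writing the ratio test as `observed > min_ratio * expected` avoids a division and therefore any special handling of an expected count of zero.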
To simulate more realistic sequencing depths and reflect the sequencing depths of most studies, we down-sampled the GM12878 Hi-C data to 10%–40% of the original depth corresponding to ∼0.5–2.0 billion raw reads. We performed down-sampling by generating multinomially distributed random number vectors. The parameters for the multinomial distribution were specified with the down-sampling ratio (i.e., 10%–40%) and the contact counts for bin pairs in the full data. For each of these down-sampled data, we ran FitHiC2 then applied HiC-ACT. Following HiCRep,18 we chose the smoothing bandwidth (h) to be 20 because we analyzed the data at 10 kb resolution.
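As a toy illustration of multinomial down-sampling, the sketch below draws a fixed number of contacts with probabilities proportional to the full-data counts. The function name, the dictionary input, and the use of Python's standard `random.choices` are assumptions made here; a genome-scale analysis would need a vectorized sampler rather than this per-draw approach.

```python
import random
from collections import Counter

def downsample_counts(counts, ratio, seed=0):
    """Multinomially down-sample contact counts.

    counts: dict mapping bin pair -> contact count in the full data.
    ratio:  down-sampling ratio, e.g. 0.1 for 10% of the original depth.
    """
    rng = random.Random(seed)              # fixed seed for reproducibility
    pairs = list(counts)
    total = sum(counts.values())
    n_draws = round(ratio * total)         # number of contacts to keep
    # Each draw picks a bin pair with probability count / total, which yields
    # a multinomially distributed count vector over the bin pairs.
    drawn = Counter(rng.choices(pairs, weights=[counts[p] for p in pairs], k=n_draws))
    return {p: drawn.get(p, 0) for p in pairs}
```

Because the draws are made with replacement, the down-sampled total equals `round(ratio * total)` exactly, while individual bin-pair counts vary around `ratio` times their original values.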
Significant pairwise interactions were defined via sequencing depth-specific thresholds for the minimum observed contact count, the minimum expected contact count, the global p value significance threshold, and, for HiC-ACT, the initial p value filter. In each case, a minimum ratio between observed count and expected count of 1.5 was required to determine a significant interaction. Table S2 provides recommendations for defining significant interactions (i.e., peaks) on the basis of this simulation via sequencing depth-specific initial p value filtering.
Assuming that deeply sequenced data is more reliable than data with shallower sequencing depth, we used the FitHiC2 peak calls (i.e., defined significant interactions) from the full GM12878 data as the working truth in our simulations. Accordingly, we counted the number of interactions correctly classified as significant or insignificant by HiC-ACT and FitHiC2 in each down-sampled dataset. HiC-ACT correctly identified 75%–641% more significant interactions than FitHiC2 and achieved comparable precision (Table 1 and Table S3).
Table 1.
Sequencing depth (billions) | Sensitivity/recall (HiC-ACT) | Sensitivity/recall (FitHiC2) | Precision (HiC-ACT) | Precision (FitHiC2) | F1 score (HiC-ACT) | F1 score (FitHiC2) |
---|---|---|---|---|---|---|
0.5 | 0.44 | 0.06 | 0.92 | 1.00 | 0.59 | 0.11 |
1.0 | 0.57 | 0.16 | 0.93 | 1.00 | 0.71 | 0.28 |
1.5 | 0.65 | 0.28 | 0.93 | 1.00 | 0.77 | 0.44 |
2.0 | 0.70 | 0.40 | 0.93 | 1.00 | 0.80 | 0.57 |
Sensitivity, precision, and the corresponding F1 score (harmonic mean of precision and recall) of calling true peaks at various GM12878 10 kb sequencing depths (in approximate billions of raw reads) are reported. Peaks are defined with the guidelines in Table S2, and peaks called by FitHiC2 in the full GM12878 data (∼4.9 billion raw reads) are treated as working truth.
Although HiC-ACT tends to be driven by the most significant interactions in a neighborhood (those pairs with extremely small p values), it maintains large (i.e., non-significant) p values for truly insignificant interactions. To demonstrate this, we calculated the sensitivity/recall and precision for correct identification of significant interactions for each method. Sensitivity, also known as the true positive rate, is the proportion of true peaks identified out of the total number of true peaks. Precision, also known as the positive predictive value, is the proportion of true peaks identified out of the number of interactions called as peaks. We also report the F1 score, defined as the harmonic mean of precision and recall, where a value of 1 indicates perfect precision and recall. Table 1 displays a summary of these results at various sequencing depths (in billions of raw reads). HiC-ACT considerably improves sensitivity with a modest loss of precision, as demonstrated by greater F1 scores, at all sequencing depths, although the largest improvements are seen when sequencing depth is low. We note that the pattern of these results holds when the global significance threshold is adjusted (Table S3).
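The three metrics defined above can be computed directly from sets of called and true peaks. The following is a minimal sketch; the function name and the representation of peaks as hashable identifiers (for example, bin-pair tuples) are assumptions for illustration.

```python
def classification_metrics(called, truth):
    """Return (precision, recall, F1) for called peaks against working-truth peaks."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Precision: true peaks among everything called as a peak.
    precision = true_positives / len(called) if called else 0.0
    # Recall (sensitivity): true peaks recovered among all true peaks.
    recall = true_positives / len(truth) if truth else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

For instance, calling three peaks of which two are among four true peaks gives a precision of 2/3, a recall of 1/2, and an F1 score of 4/7.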
In the Hi-C peak calling problem, the number of true positives (significant interactions/peaks) and true negatives (insignificant interactions/background noise) is highly unbalanced. Because of the large proportion of true negatives, comparing sensitivity versus specificity is not ideal. Precision versus recall is a more appropriate performance metric in this scenario.22 Accordingly, specificity is omitted from Table 1 since the values for both methods are nearly 1 because of the large number of insignificant interactions. Specificity, along with peak classification counts, can be found in Table S3.
We can further examine the relationship between true positives (i.e., correctly identifying significant interactions/calling true peaks) and false positives (i.e., incorrectly identifying interactions as significant/calling false peaks) through precision-recall curves (PRCs). Ideally, we desire a method that has both high precision (few false positives) and high recall (few false negatives), which is represented by a PRC located in the top right region of the plot. Figure 2 shows the PRCs for calling true peaks (as defined in Table S2) in the GM12878 10 kb data when the data is down-sampled to different depths. Each panel displays the PRC for peaks called via FitHiC2 as well as HiC-ACT. The shapes indicate where a specific p value threshold for defining FitHiC2/HiC-ACT peaks lies on the curve. For example, with ∼0.5 billion raw reads, FitHiC2 (gray curve) achieves a recall of approximately 0.06 and precision near 1 when the significance threshold p is between 1.0e−14 and 1.0e−10. However, HiC-ACT with initial p value filtering of 1.0e−3 (blue curve) is able to significantly improve peak classification, achieving recall of approximately 0.36 with negligible loss in precision (0.97) (Figure 2A).
The pink dashed curves correspond to HiC-ACT applied with our suggested filter p′ (values of p′ can be found in Table S2). As detailed in Table S2, we suggest using a more stringent initial p value filter for data with high sequencing depth and using a more lenient initial p value filter for data with shallow sequencing depth. As the sequencing depth increases, the choice of initial p value filter has less effect on the precision and recall of HiC-ACT (Figure 2D). In general, HiC-ACT peak calling can be made more conservative (or liberal) by choosing a more stringent (or lenient) initial p value filtering threshold. In other words, HiC-ACT allows us to improve precision at the cost of recall (e.g., detection of true peaks) by selecting a smaller initial p value filter. We obtained similar results when the global significance threshold is adjusted (Figures S1 and S2 and Table S3).
Lastly, the PRCs also suggest that type I errors are largely maintained in that the curves (particularly the parts where the significance thresholds were selected) are rather flat, reflecting no large drop in precision. Given the much larger number of non-peaks compared to peaks, a small increase in type I error could lead to a rather drastic increase in the denominator for precision calculation, which would incur a large drop in precision. Therefore, it is reassuring to observe that the HiC-ACT PRCs remain largely flat.
HiC-ACT also shows improved power to detect significant interactions with low normalized contact frequency. Specifically, we compared the observed to expected contact count ratios between methods for their most significant interactions (ranked p values). Figure 3 shows the distribution of the ratios of the most significant true peaks (significant interactions called in the full data) called by each method in the ∼1 billion raw read data. The median ratio for HiC-ACT is ∼3.3 (0.5 on the log10 scale) across all top n peaks, whereas the median ratio for FitHiC2 decreases from 6 to 4.5 (0.8 to 0.7 on the log10 scale) as the number of top peaks increases. The observed to expected contact count ratios of HiC-ACT are significantly lower than those of FitHiC2 (Wilcoxon test p value < 2.2e−16) in each case. We reached similar conclusions at other sequencing depths (0.5–2.0 billion raw reads, data not shown).
HiC-ACT identifies biologically relevant interactions
GM12878 Roadmap Epigenomics Consortium enhancers
Using the same GM12878 Hi-C data at 10 kb bin resolution, we compared the peaks called by HiC-ACT and FitHiC2 to typical enhancers (TEs) and super-enhancers (SEs) reported from the Roadmap Epigenomics Consortium.19 There are 10,335 enhancers in total, 252 of which are SEs. First, we identified which peaks have one end overlapping with an enhancer and the other end overlapping with the TSS20 of an expressed gene21 (FPKM > 1), and defined such peaks as overlapping an enhancer-promoter (E-P) interaction.
At each sequencing depth, we counted the total number of super-enhancer-promoter (SE-P) interactions (Figure 4A) and typical enhancer-promoter (TE-P) interactions (Figure 4B) identified by each method. HiC-ACT interactions overlap more with SE-P and TE-P interactions compared to FitHiC2 interactions. We also counted the total number of unique SEs (Figure 4C) and unique TEs (Figure 4D) identified by each method. HiC-ACT is able to capture 90%–95% of the SEs and 63%–81% of TEs, compared to 74%–94% and 32%–72%, respectively, captured by FitHiC2. HiC-ACT appears to be less sensitive to sequencing depth than FitHiC2 and shows more significant improvements over FitHiC2 at shallower sequencing depths. Figures 4C and 4D display the total counts as well as the odds ratios and corresponding p values for the number of enhancers identified (out of 252 total SEs and 10,083 total TEs).
Next, we examined the total number of interactions overlapping an E-P interaction within a specified number of top peaks (ranked p values). At all sequencing depths (∼0.5–2.0 billion raw reads), we observed improved performance of HiC-ACT over FitHiC2 for SE-P interactions and comparable performance between HiC-ACT and FitHiC2 for TE-P interactions (Figure S3). Moreover, the most significant interactions identified by each method are different (Figures 4E–4G). Figure 4E illustrates the number of HiC-ACT-specific, FitHiC2-specific, and shared peaks that overlap an SE-P interaction at various top peaks. For example, out of the top 100,000 peaks called by HiC-ACT and FitHiC2, 1,219 and 1,064 peaks overlap with SE-P interactions, respectively. Among them, 688 peaks are HiC-ACT specific, 513 peaks are FitHiC2 specific, and 552 peaks are shared by two methods (Figure 4F). A similar example for the top 200,000 peaks called by each method is displayed in Figure 4G.
mESC ChIP-seq/ATAC-seq peaks
We applied HiC-ACT (h = 20) to FitHiC2 results from Hi-C data from mESCs at 10 kb bin resolution.23 Because this data is deeply sequenced (∼7 billion reads), we chose a HiC-ACT initial p value filter of 1e−6. This choice was informed by the PRCs in Figure 2D. Significant interactions were defined with the same thresholds as the GM12878 data (observed contact count > 15, expected contact count > 5, the ratio of observed to expected contact counts > 1.5, and global p value < 1.0e−12). By these criteria, HiC-ACT identifies ∼1.8 million significant interactions and FitHiC2 identifies ∼1 million significant interactions.
We compared these peak calls to mESC ChIP-seq (H3K4me3, H3K4me1, H3K27ac, and CTCF) peaks24, 25, 26 and ATAC-seq peaks.26 We defined an overlap as a HiC-ACT/FitHiC2-called peak with either 10 kb bin overlapping a ChIP-seq/ATAC-seq peak. Further, we defined a 10 kb bin as an enhancer bin or a promoter bin if it overlaps with a H3K27ac ChIP-seq peak or H3K4me3 ChIP-seq peak and TSS20 of an expressed gene27 (FPKM > 1), respectively. We defined a HiC-ACT- or FitHiC2-identified peak as an E-P interaction if one anchor bin is an enhancer bin and the other anchor bin is a promoter bin. We similarly defined enhancer-enhancer (E-E) and promoter-promoter (P-P) interactions.
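The bin-level definitions above translate into a small interval-overlap check. The sketch below is a hedged illustration restricted to a single chromosome: the function names, the half-open coordinate convention, and the input layout (lists of `(start, end)` ChIP-seq peaks and a list of TSS positions of expressed genes) are all assumptions made here rather than details from the study.

```python
def overlaps(bin_start, bin_end, intervals):
    """True if the half-open bin [bin_start, bin_end) overlaps any (start, end) interval."""
    return any(start < bin_end and end > bin_start for start, end in intervals)

def classify_interaction(anchor1, anchor2, h3k27ac, h3k4me3, expressed_tss,
                         bin_size=10_000):
    """Return the set of labels ("E-P", "E-E", "P-P") for a called peak.

    anchor1, anchor2: start coordinates of the two anchor bins.
    """
    def is_enhancer(start):
        # Enhancer bin: overlaps an H3K27ac ChIP-seq peak.
        return overlaps(start, start + bin_size, h3k27ac)

    def is_promoter(start):
        # Promoter bin: overlaps an H3K4me3 peak and contains an expressed-gene TSS.
        return (overlaps(start, start + bin_size, h3k4me3)
                and any(start <= tss < start + bin_size for tss in expressed_tss))

    e1, e2 = is_enhancer(anchor1), is_enhancer(anchor2)
    p1, p2 = is_promoter(anchor1), is_promoter(anchor2)
    labels = set()
    if (e1 and p2) or (p1 and e2):
        labels.add("E-P")
    if e1 and e2:
        labels.add("E-E")
    if p1 and p2:
        labels.add("P-P")
    return labels
```

A bin can satisfy both definitions at once, so the function returns a set of labels rather than a single category.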
Because HiC-ACT identifies more significant interactions than FitHiC2, we examined the same number of top most significant interactions (ranked by p values) called by each method for a fair comparison. The most significant HiC-ACT-specific interactions show higher overlap with E-P, E-E, and P-P interactions than the same number of most significant FitHiC2-specific interactions (Figure 5A). The odds of the most significant HiC-ACT peaks showing overlap with E-P, E-E, or P-P interactions is significantly higher (odds ratio estimate ≈ 1.5, p value < 2.2e−16) than the odds of the most significant FitHiC2 peaks (Table S4). We observed similar results when only considering the 1D overlaps in H3K27ac, H3K4me1, and H3K4me3 ChIP-seq peaks and a comparable performance between HiC-ACT and FitHiC2 in ATAC-seq peaks and CTCF ChIP-seq peaks (Figure 5B). Table S4 lists the number of overlaps displayed by Figure 5 at various numbers of top peaks.
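For readers less familiar with the odds ratio used in this comparison, the calculation from two overlap counts can be sketched as follows. The function name and argument layout are assumptions for illustration, and the sketch does not reproduce the significance test behind the reported p values.

```python
def odds_ratio(hits_a, total_a, hits_b, total_b):
    """Odds ratio of overlap for method A relative to method B.

    hits_*:  number of top peaks that overlap an annotated interaction.
    total_*: number of top peaks examined for that method.
    """
    odds_a = hits_a / (total_a - hits_a)   # overlapping vs. non-overlapping peaks
    odds_b = hits_b / (total_b - hits_b)
    return odds_a / odds_b
```

With hypothetical counts of 30 overlaps out of 100 top peaks for one method and 20 out of 100 for the other, the odds ratio is (30/70)/(20/80) = 12/7, or about 1.7.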
mESC FANTOM5 and dbSUPER enhancers
Next, we compared the mESC HiC-ACT and FitHiC2 calls at 10 kb resolution to mESC enhancers cataloged in the FANTOM5 database28,29 and in the dbSUPER database.30 FANTOM5 includes 43,662 enhancers and dbSUPER includes 229 SEs. For each set of enhancers, we counted how many interactions called by HiC-ACT and FitHiC2 overlap with an E-P interaction (one end overlapping with an enhancer and the other end overlapping with the TSS20 of an expressed gene27). The most significant HiC-ACT peaks have approximately 1.4–2 times the odds of overlapping an E-P interaction compared with the same number of most significant FitHiC2 peaks. Figure 6A displays the number of peaks overlapping E-P interactions for each enhancer database and method, as well as the corresponding odds ratio estimates and p values.
We next examined the total number of unique E-P interactions identified by each method for the enhancers in the dbSUPER and FANTOM5 databases (Figure 6B). HiC-ACT identifies 198 more dbSUPER SE-P interactions and 29,632 more FANTOM5 E-P interactions than FitHiC2. Further, one enhancer may interact with multiple promoters, so we also report the total number of unique enhancers among the identified E-P interactions. Interestingly, all FitHiC2-identified dbSUPER enhancers and all but 14 FitHiC2-identified FANTOM5 enhancers are also identified by HiC-ACT.
Discussion
Hi-C has been widely adopted to study chromatin spatial organization with several peak callers proposed and commonly used to analyze and interpret this data. Here, we present HiC-ACT, a method to improve the detection of chromatin interactions by post-processing 3D peak calling results from methods relying on the assumption that pairs of chromatin interactions are statistically independent. HiC-ACT leverages the power of an aggregated Cauchy test to specifically account for the correlation without requiring any information about its structure. We demonstrated that HiC-ACT can improve sensitivity while maintaining comparable precision. We also provide guidelines regarding decision rules to maintain a desired type I error.
As expected, we observed most pronounced improvement over FitHiC2 when sequencing depth is less than 1 billion reads, which is the typical depth for the vast majority of Hi-C datasets generated to date. As shown through our analyses of the GM12878 data, the performance of FitHiC2 decreases as sequencing depth decreases; therefore, there is relatively more room for improvement for Hi-C data with lower sequencing depth. Further, Hi-C data with shallower sequencing depths are more likely than Hi-C data with higher sequencing depths to have lower signal-to-noise ratios for some significant interactions, and by borrowing information from neighboring interactions, HiC-ACT is able to more powerfully identify these interactions than FitHiC2. Even with increasing sequencing depth anticipated in some future Hi-C studies, we consider HiC-ACT useful because it will allow more powerful 3D peak calling at finer resolution (e.g., 5 kb or even 1 kb resolution, particularly when cut with the appropriate restriction enzymes such as the 4-base pair cutter MboI or DpnII).
It is unsurprising that the most significant interactions called by each method are different. Intuitively, all bin pairs in the neighborhood of a biologically relevant interaction are more likely to be significant than randomly colliding bin pairs. However, in Hi-C data with shallower sequencing depths, the signal strengths for all bin pairs in the neighborhood may not be adequately reflected in the unsmoothed p values. We have demonstrated that HiC-ACT has higher power to detect peaks with lower signal-to-noise ratio than FitHiC2 (Figure 3). Accordingly, for a highly significant interaction, HiC-ACT is more likely than FitHiC2 to call peaks in its neighborhood as well, leading to the differences observed in the top peak calls of each method.
We note that although FastHiC accounts for spatial dependency, it is not intended to be used as a chromosome-wide peak caller in high-resolution Hi-C data in the way that FitHiC2 is. We find that FastHiC underperforms both HiC-ACT and FitHiC2 in this scenario (Figure S4). Although HiC-ACT can theoretically be applied to HiCCUPS results, we consider such application inappropriate because of the intrinsic nature of HiCCUPS to call summits in peak regions. HiCCUPS contrasts each chromatin loci pair with its local neighborhood; however, our goal is to call peaks by borrowing information from the neighborhood.
HiC-ACT is computationally efficient and scalable. HiC-ACT can process 25,000 pairwise interactions in ∼6 min with ∼2 GB memory and 0.5 million pairwise interactions in ∼2 h with ∼30 GB memory on 2.50 GHz and 3.40 GHz Intel processors, respectively. For reference, at 10 kb resolution, chromosome 1 has ∼90,000 and ∼168,000 pairwise interactions passing the suggested initial p value filter in the GM12878 Hi-C data with ∼0.5 and ∼1 billion raw reads, respectively.
Future work may involve fine-tuning the smoothing parameter, particularly for Hi-C data at 1 kb bin resolution, and investigating different weight functions.
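To make the roles of the smoothing parameter and the weight function concrete, the following is a minimal sketch of an aggregated Cauchy test applied to the neighborhood of one bin pair. It is an illustration under stated assumptions, not the HiC-ACT implementation: the function names (`cauchy_combine`, `gaussian_weight`, `smooth_pvalue`), the Gaussian form of the weight, the Euclidean distance in bin units, and the `radius` cutoff are all hypothetical choices made for this example. It also assumes p values strictly inside (0, 1) and omits the numerical safeguards needed for extremely small p values.

```python
import math


def cauchy_combine(pvalues, weights):
    """Combine p values with the Cauchy combination (aggregated Cauchy) test.

    Weights must be non-negative; they are normalized to sum to 1 here.
    """
    total = sum(weights)
    # Each p value is mapped to a standard Cauchy variate, then averaged.
    t = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for p, w in zip(pvalues, weights)
    )
    # Upper tail of the standard Cauchy distribution evaluated at t.
    return 0.5 - math.atan(t) / math.pi


def gaussian_weight(distance, h):
    """Hypothetical weight function: Gaussian kernel with smoothing parameter h."""
    return math.exp(-(distance ** 2) / (2.0 * h ** 2))


def smooth_pvalue(pmap, i, j, h, radius):
    """Smooth the p value of bin pair (i, j) using neighbors within `radius` bins.

    `pmap` maps (bin_i, bin_j) -> unsmoothed p value from a method that
    assumes independence; bin pairs absent from `pmap` are skipped.
    """
    pvalues, weights = [], []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            p = pmap.get((i + di, j + dj))
            if p is None:
                continue
            pvalues.append(p)
            weights.append(gaussian_weight(math.hypot(di, dj), h))
    return cauchy_combine(pvalues, weights)
```

In this sketch, a larger `h` spreads weight more evenly across the neighborhood, whereas a smaller `h` concentrates weight on the central bin pair, so the smoothed p value approaches the unsmoothed one. Swapping `gaussian_weight` for another kernel is the kind of alternative weight function mentioned above.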
By identifying statistically significant long-range interactions with enhanced statistical power and improved computational efficiency, HiC-ACT can improve our knowledge of regions with regulatory potential and help establish links between cis-regulatory regions and their target genes. We anticipate that HiC-ACT will become a convenient tool for many researchers.
Data and code availability
This paper did not generate any datasets.
Declaration of interests
The authors declare no competing interests.
Acknowledgments
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under grant DGE-1650116. T.M.L. and Y.L. are partially funded by R01 HL129132 (awarded to Y.L.). Y.L. is additionally supported by R01 GM105785 and P50 HD103573. A.A. and M.H. are partially funded by NIH grants U54DK107977 and UM1HG011585 (awarded to M.H.).
Published: February 4, 2021
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.01.009.
Contributor Information
Yuchen Yang, Email: yyuchen@email.unc.edu.
Yun Li, Email: yunli@med.unc.edu.
Web resources
ENCODE, https://www.encodeproject.org/
FANTOM5, https://fantom.gsc.riken.jp/5/
GEO, https://www.ncbi.nlm.nih.gov/geo/
HiC-ACT, https://github.com/tmlagler/hicACT
HiC-ACT, https://yunliweb.its.unc.edu/hicACT/
References
- 1. Li Y., Hu M., Shen Y. Gene regulation in the 3D genome. Hum. Mol. Genet. 2018;27(R2):R228–R233. doi: 10.1093/hmg/ddy164.
- 2. Yu M., Ren B. The Three-Dimensional Organization of Mammalian Genomes. Annu. Rev. Cell Dev. Biol. 2017;33:265–289. doi: 10.1146/annurev-cellbio-100616-060531.
- 3. Schoenfelder S., Fraser P. Long-range enhancer-promoter contacts in gene expression control. Nat. Rev. Genet. 2019;20:437–455. doi: 10.1038/s41576-019-0128-0.
- 4. Fulco C.P., Munschauer M., Anyoha R., Munson G., Grossman S.R., Perez E.M., Kane M., Cleary B., Lander E.S., Engreitz J.M. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science. 2016;354:769–773. doi: 10.1126/science.aag2445.
- 5. Giusti-Rodriguez P., Lu L., Yang Y., Crowley C.A., Liu X., Juric I., Martin J.S., Abnousi A., Allred S.C., Ancalade N. Using three-dimensional regulatory chromatin interactions from adult and fetal cortex to interpret genetic results for psychiatric disorders and cognitive traits. bioRxiv. 2018 doi: 10.1101/406330.
- 6. Martin J.S., Xu Z., Reiner A.P., Mohlke K.L., Sullivan P., Ren B., Hu M., Li Y. HUGIn: Hi-C Unifying Genomic Interrogator. Bioinformatics. 2017;33:3793–3795. doi: 10.1093/bioinformatics/btx359.
- 7. Lieberman-Aiden E., van Berkum N.L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B.R., Sabo P.J., Dorschner M.O. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369.
- 8. Dekker J., Marti-Renom M.A., Mirny L.A. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 2013;14:390–403. doi: 10.1038/nrg3454.
- 9. Ay F., Bailey T.L., Noble W.S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 2014;24:999–1011. doi: 10.1101/gr.160374.113.
- 10. Kaul A., Bhattacharyya S., Ay F. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2. Nat. Protoc. 2020;15:991–1012. doi: 10.1038/s41596-019-0273-0.
- 11. Schmitt A.D., Hu M., Ren B. Genome-wide mapping and analysis of chromosome architecture. Nat. Rev. Mol. Cell Biol. 2016;17:743–755. doi: 10.1038/nrm.2016.104.
- 12. Xu Z., Zhang G., Jin F., Chen M., Furey T.S., Sullivan P.F., Qin Z., Hu M., Li Y. A hidden Markov random field-based Bayesian method for the detection of long-range chromosomal interactions in Hi-C data. Bioinformatics. 2016;32:650–656. doi: 10.1093/bioinformatics/btv650.
- 13. Xu Z., Zhang G., Wu C., Li Y., Hu M. FastHiC: a fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data. Bioinformatics. 2016;32:2692–2695. doi: 10.1093/bioinformatics/btw240.
- 14. Rao S.S.P., Huntley M.H., Durand N.C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D., Lander E.S., Aiden E.L. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021.
- 15. Roayaei Ardakany A., Gezer H.T., Lonardi S., Ay F. Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation. Genome Biol. 2020;21:256. doi: 10.1186/s13059-020-02167-0.
- 16. Liu Y., Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 2019;115:393–402. doi: 10.1080/01621459.2018.1554485.
- 17. Liu Y., Chen S., Li Z., Morrison A.C., Boerwinkle E., Lin X. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies. Am. J. Hum. Genet. 2019;104:410–421. doi: 10.1016/j.ajhg.2019.01.002.
- 18. Yang T., Zhang F., Yardımcı G.G., Song F., Hardison R.C., Noble W.S., Yue F., Li Q. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 2017;27:1939–1949. doi: 10.1101/gr.220640.117.
- 19. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 20. Frankish A., Diekhans M., Ferreira A.-M., Johnson R., Jungreis I., Loveland J., Mudge J.M., Sisu C., Wright J., Armstrong J. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47(D1):D766–D773. doi: 10.1093/nar/gky955.
- 21. Schmitt A.D., Hu M., Jung I., Xu Z., Qiu Y., Tan C.L., Li Y., Lin S., Lin Y., Barr C.L., Ren B. A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Rep. 2016;17:2042–2059. doi: 10.1016/j.celrep.2016.10.061.
- 22. Mladenić D., Grobelnik M. Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: Bratko I., Dzeroski S., editors. ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning. 1999. pp. 258–267.
- 23. Bonev B., Mendelson Cohen N., Szabo Q., Fritsch L., Papadopoulos G.L., Lubling Y., Xu X., Lv X., Hugnot J.-P., Tanay A., Cavalli G. Multiscale 3D Genome Rewiring during Mouse Neural Development. Cell. 2017;171:557–572.e24. doi: 10.1016/j.cell.2017.09.043.
- 24. Davis C.A., Hitz B.C., Sloan C.A., Chan E.T., Davidson J.M., Gabdank I., Hilton J.A., Jain K., Baymuradov U.K., Narayanan A.K. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018;46(D1):D794–D801. doi: 10.1093/nar/gkx1081.
- 25. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 26. Juric I., Yu M., Abnousi A., Raviram R., Fang R., Zhao Y., Zhang Y., Qiu Y., Yang Y., Li Y. MAPS: Model-based analysis of long-range chromatin interactions from PLAC-seq and HiChIP experiments. PLoS Comput. Biol. 2019;15:e1006982. doi: 10.1371/journal.pcbi.1006982.
- 27. Li Y., Rivera C.M., Ishii H., Jin F., Selvaraj S., Lee A.Y., Dixon J.R., Ren B. CRISPR reveals a distal super-enhancer required for Sox2 expression in mouse embryonic stem cells. PLoS ONE. 2014;9:e114485. doi: 10.1371/journal.pone.0114485.
- 28. Forrest A.R., Kawaji H., Rehli M., Baillie J.K., de Hoon M.J., Haberle V., Lassmann T., Kulakovskiy I.V., Lizio M., Itoh M., FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature. 2014;507:462–470. doi: 10.1038/nature13182.
- 29. Andersson R., Gebhard C., Miguel-Escalada I., Hoof I., Bornholdt J., Boyd M., Chen Y., Zhao X., Schmidl C., Suzuki T. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507:455–461. doi: 10.1038/nature12787.
- 30. Hnisz D., Abraham B.J., Lee T.I., Lau A., Saint-André V., Sigova A.A., Hoke H.A., Young R.A. Super-enhancers in the control of cell identity and disease. Cell. 2013;155:934–947. doi: 10.1016/j.cell.2013.09.053.