Nucleic Acids Research. 2021 Dec 20;50(6):e35. doi: 10.1093/nar/gkab1235

HiCRes: a computational method to estimate and predict the genomic resolution of Hi-C libraries

Claire Marchal, Nivedita Singh, Ximena Corso-Díaz, Anand Swaroop
PMCID: PMC8990515  PMID: 34928367

Abstract

Three-dimensional (3D) conformation of the chromatin is crucial to stringently regulate gene expression patterns and DNA replication in a cell-type specific manner. Hi-C is a key technique for measuring 3D chromatin interactions genome wide. Estimating and predicting the resolution of a library is an essential step in any Hi-C experimental design. Here, we present the mathematical concepts to estimate the resolution of a dataset and predict whether deeper sequencing would enhance the resolution. We have developed HiCRes, a docker pipeline, by applying these concepts to several Hi-C libraries.

INTRODUCTION

Within mammalian nuclei, chromatin is compacted following a well-defined three-dimensional (3D) organization. Chromosomes remain separated into distinct territories that can be labeled and observed by microscopy (1,2). Within each chromosome, the chromatin can be organized into megabase-size domains, called topologically associated domains (TADs) (3,4). At the kilobase level, two genomic loci can join together to form chromatin loops (4–6). This organization is dynamic and changes during distinct stages of a cell's life including cell cycle (7–9), differentiation (5,10,11) and senescence (12,13). 3D chromatin organization is associated with gene expression regulation (14–19) and DNA replication timing (7,20–24), but the relationship between these features is still poorly known. Chromosome Conformation Capture technologies, such as Hi-C, have permitted access to 3D chromatin interactions genome-wide (25–27) and are among the most common techniques used to explore the relationship between the 3D genome and chromatin associated processes.

Hi-C libraries are generated by in-nuclei enzymatic digestion of cross-linked chromatin. Digested chromatin is then ligated, producing chimeric fragments of neighboring chromatin loci, which are purified and sequenced pairwise. Interactions between loci separated by a restriction digestion site are kept for further analysis (25). Many laboratories are implementing Hi-C to examine 3D chromatin interactions and gain better insight into a biological process. ‘How deep do I need to sequence my Hi-C library?’ is one of the first questions when deciding to perform Hi-C experiments. The answer is far from trivial and depends on the chromatin structures to be observed, such as compartments, TADs or loops, as well as on the quality of the Hi-C library (4). Compartments are very robust across sequencing depths and can be called on small Hi-C datasets (20,28). In contrast, loop calling requires high-resolution Hi-C obtained by deep sequencing of good-quality libraries (6,28). High sequencing depth is a large expense; thus, the first step is usually to sequence the Hi-C library at low depth (e.g. 100M read pairs), which allows one to evaluate its quality and assess whether deeper sequencing would yield a higher resolution. Accurately predicting the future resolution of a Hi-C library at different sequencing depths would allow a user to choose how deep the library needs to be sequenced to obtain a given resolution. While sequencing depth is the main determinant of the resolution, it is important to note that the resolution of Hi-C data is limited by the restriction enzyme used in the assay (27,28). For example, the HindIII restriction enzyme produces an average fragment length of 4 kb on the human genome, and thus the best possible resolution will be around 4 kb (26). For assays using MboI or DpnII, one can achieve a resolution of around 500 bp (26).

A useful definition of the resolution reached by a sequenced Hi-C library was set up by Rao et al. (6). This definition sets the resolution of a Hi-C experiment as the smallest window size which, when used to calculate the genome coverage, leads to 80% of the windows being covered by at least 1000 reads (6,27). Mathematically, the resolution is the window size for which the 20th percentile of the reads per window equals 1000. This definition is based on the global distribution of the coverage and allows an estimation of the range of interactions that can be observed in a given library, providing an excellent standard for comparison among multiple datasets.
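As a rough illustration of this definition, the sketch below counts reads per fixed-size window and tests whether the 20th percentile of those counts reaches 1000. It assumes a single linear coordinate system with one position per read; the function names and arguments are hypothetical and are not part of HiCRes.

```python
import numpy as np


def coverage_percentile_20(read_positions, genome_length, window_size):
    """Return the 20th percentile of read counts over fixed-size windows."""
    n_windows = -(-genome_length // window_size)  # ceiling division
    window_index = np.asarray(read_positions, dtype=np.int64) // window_size
    counts = np.bincount(window_index, minlength=n_windows)
    return float(np.percentile(counts, 20))


def reaches_resolution(read_positions, genome_length, window_size, min_reads=1000):
    """Approximate the criterion "80% of windows hold at least min_reads reads"
    by checking that the 20th percentile of the counts is at least min_reads."""
    p20 = coverage_percentile_20(read_positions, genome_length, window_size)
    return p20 >= min_reads
```

Under this reading, the resolution of a dataset is the smallest window size for which `reaches_resolution` returns true.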

Nevertheless, the relationship between the maximum resolution of a Hi-C experiment and the size of its library is non-linear (27). This relationship depends on: (i) the complexity of the library, which can be predicted using published tools such as preseq (29), (ii) the percentage of uniquely mapped valid read pairs, directly proportional to the number of de-duplicated reads and (iii) the distribution of uniquely mapped valid read pairs, which can be estimated and predicted by the model presented in this study. We included all these steps within a single pipeline and developed a docker image, called HiCRes. We present and validate here this tool to assess and predict the resolution a Hi-C library can reach at different sequencing depths, following the definition of resolution by Rao et al. (6). This is the first method to estimate the resolution that can be reached by a given Hi-C library at different sequencing depths.

MATERIALS AND METHODS

Subsampling datasets

Datasets are downloaded from SRA and fastq files are extracted using SRAtoolkit (ncbi.github.io/sra-tools/). These files are converted into text files with one complete read pair (sequence and quality) per line. Random lines are then selected using awk and its built-in rand function. The seed for the random extraction is set to the date and time at which the script is run. This method allows the fast extraction of a chosen proportion of reads without holding the dataset in RAM. Datasets are subsampled to approximately 100M read pairs to generate predictions, or to 10M, 25M, 50M and 75M read pairs to test predictions using smaller datasets. GM12878 (HIC003) and mESC datasets have also been subsampled to 200M, 300M, 400M and 500M read pairs.
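The same idea can be sketched in Python rather than awk: stream the one-pair-per-line file and keep each line with a fixed probability, so memory use stays constant. This is only an illustrative stand-in for the awk step; `subsample_lines` and its parameters are hypothetical names.

```python
import random


def subsample_lines(lines, keep_fraction, seed=None):
    """Yield each line with probability keep_fraction, one line at a time.

    With seed=None, Python seeds the generator from system entropy or the
    clock; pass an explicit seed to make the selection reproducible.
    """
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < keep_fraction:
            yield line
```

Because each line is kept independently, the output size is only approximately `keep_fraction` times the input size, which matches the "approximately 100M read pairs" wording above.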

Mapping and filtering

Subsampled libraries are mapped and filtered using bowtie2 (34) and HiCUP (35), on hg38 (human samples), mm10 (mouse samples), TAIR10 (Arabidopsis thaliana), dm3 (Drosophila melanogaster) or ce10 (Caenorhabditis elegans), using genomes digested in silico by MboI, HindIII or the Arima kit enzymes (Supplementary Figure S1). Proportions of reads pairing, mapping and filtering are calculated with HiCUP. These proportions are considered constant and independent of the library sequencing depth. Similarly, the proportion of cis versus trans interactions is considered independent of the library sequencing depth (data not shown).

Measuring the observed resolution interval

The observed resolution is calculated following Rao and colleagues’ definition (6). The final HiCUP output for each subsample is processed through bedtools (36) to calculate the read coverage per window using several window sizes ranging from 100 bp to 100 kb (0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, 30, 40, 50 and 100 kb). The 20th percentile of the coverage is then calculated for each window size using R. An interval containing the Hi-C resolution of each subsample is inferred from these values: the minimum of this interval is the largest window size for which the 20th percentile of the coverage is below 1000 reads, while the maximum of this interval is the smallest window size for which the 20th percentile of the coverage is higher than 1000 reads.
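The interval rule described above can be written as a short function. This is a minimal sketch that takes a mapping from window size (in bp) to the measured 20th percentile of coverage; the function name and input layout are assumptions made for illustration.

```python
def observed_resolution_interval(percentile_by_window, threshold=1000):
    """Return (lower, upper) window sizes bracketing the observed resolution.

    lower: largest window size whose 20th percentile is below the threshold.
    upper: smallest window size whose 20th percentile is above the threshold.
    Either bound is None when no tested window size falls on that side.
    """
    below = [w for w, p in percentile_by_window.items() if p < threshold]
    above = [w for w, p in percentile_by_window.items() if p > threshold]
    lower = max(below) if below else None
    upper = min(above) if above else None
    return lower, upper
```

For example, with 20th percentiles of 640 at 5 kb and 1350 at 10 kb, the observed resolution lies between 5 and 10 kb.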

To ensure that this observed resolution relates to Hi-C features, contact maps have been plotted using R and visually inspected at several CTCF loops using multiple subsamples of the GM12878 dataset (Supplementary Figure S2A). In addition, the stratified correlation coefficient (SCC) has been computed using several window sizes using subsamples of the GM12878 dataset (Supplementary Figure S2B). A variation of 0.004 of the SCC was previously shown to be within the confidence interval for subsamples of a similar size dataset (37), indicating that the subsamples were achieving the same reproducibility as the original and thereby reaching the same resolution. Here, the SCC between the original dataset and each subsample is computed using several window sizes (5, 10, 25, 50 and 100 kb) with multiple subsample sizes (10, 25, 50, 75, 100, 200, 300 and 400M read pairs). The window sizes including the observed resolution interval for each given dataset correspond to SCC from either side of the 0.996 threshold, demonstrating that the observed resolution interval captures accurately the resolution reached by each subsample.

Estimating and predicting the library complexity

Library complexity is estimated from a 100M read pairs subsample of the library, which is the minimum size required for the library provided by the user. The library complexity is estimated on the raw mapped read pairs (see Mapping and filtering). When both ends of two pairs are mapped at the same positions, the pairs are considered duplicates. The preseq tool is used to predict the library yield at higher sequencing depths from the duplicate distribution (29). Preseq uses an empirical Bayes method on a sequencing dataset to estimate the number of unique reads that can be obtained from a library at different sequencing depths. To evaluate the accuracy of preseq on these data, the full mESC (SRX2636668) library and several subsamples are also analyzed. For each subsample, the non-de-duplicated mapped reads and the de-duplicated mapped reads are counted and compared to the data predicted by preseq based on the 100M read pairs subsample. The near-perfect overlap between preseq predictions and the observed duplicate rates supports the accuracy of preseq on these data (Supplementary Figure S3).
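The duplicate rule stated above (two pairs are duplicates when both ends map to the same positions) can be sketched as follows. Treating the two ends as unordered and ignoring strand are assumptions of this sketch, and `deduplicate_pairs` is a hypothetical helper, not the HiCUP or preseq implementation.

```python
def deduplicate_pairs(pairs):
    """Yield the first occurrence of each read pair, dropping duplicates.

    Each pair is ((chrom, position), (chrom, position)). The two ends are
    compared in sorted order so that swapped ends count as the same pair.
    """
    seen = set()
    for end_a, end_b in pairs:
        key = (end_a, end_b) if end_a <= end_b else (end_b, end_a)
        if key not in seen:
            seen.add(key)
            yield (end_a, end_b)
```

The counts of pairs before and after this step are the kind of quantities compared with the preseq predictions in the text.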

Extracting and plotting the resolution versus the sequencing depth

The unique read pair counts obtained at several sequencing depths and the associated confidence intervals predicted with preseq are combined with HiCUP statistics to estimate the number of valid read pairs, and of valid read pairs in cis or cis-far (intra-chromosomal interactions longer than 10 kb), for various sequencing depths. These values and their confidence intervals are then used to calculate the predicted resolution based on Equation (4).
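Since the HiCUP proportions are treated as constant across sequencing depths, one plausible reading of this step is a simple rescaling of the predicted unique read pair count and of its confidence bounds. The sketch below shows that arithmetic only; the function and parameter names are hypothetical, and the exact combination used inside HiCRes is not given in the text.

```python
def estimate_valid_pairs(unique_pairs, valid_fraction, subset_fraction=1.0):
    """Scale a predicted unique read pair count to valid read pairs.

    valid_fraction: assumed share of unique pairs that pass filtering.
    subset_fraction: assumed share of valid pairs that are cis or cis-far
    (leave at 1.0 to keep all valid interactions).
    """
    if not (0.0 <= valid_fraction <= 1.0 and 0.0 <= subset_fraction <= 1.0):
        raise ValueError("fractions must lie between 0 and 1")
    return unique_pairs * valid_fraction * subset_fraction


def scale_interval(lower, estimate, upper, valid_fraction, subset_fraction=1.0):
    """Apply the same scaling to a point estimate and its confidence bounds."""
    return tuple(
        estimate_valid_pairs(value, valid_fraction, subset_fraction)
        for value in (lower, estimate, upper)
    )
```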

Hi-C on mouse rods

Mouse rods were purified from the Nrlp-EGFP C57BL/6J strain as described (38). All procedures were approved by the Animal Care and Use Committee (NEI-ASP#650). Rods were fixed with 1% formaldehyde for 15 min followed by 5 min incubation with Glycine (125 mM) before cell sorting. Two million purified rods were used for Hi-C, per instructions from the Arima kit (Arima, # A510008). Libraries were sequenced using Illumina HiSeq 2500. The sequenced libraries have been analyzed using HiCUP (Supplementary Figure S4A), contact maps have been visualized using Juicebox (39) (Supplementary Figure S4B). The data have been compared to subsamples of 120M interactions of the mESC Hi-C and of a publicly-available mouse rods Hi-C (SRX6657342 (30)) by calculating the stratified correlation coefficient (SCC) on each autosome using HiCRep (37) (Supplementary Figure S4C).

RESULTS

Modeling the Hi-C fragments distribution on the genome

Elucidating the relationship between the resolution and the quantity of Hi-C interactions requires understanding the relationship between the read coverage distribution, the window size used to calculate the coverage, and the number of read pairs. To model the relationship between these parameters, we used large high-resolution Hi-C libraries from human GM12878 cells (6) and mESC (10). We first explored the association of the read coverage distribution with the window size and the number of read pairs. To measure the distribution of read coverage, we used the 20th percentile of the read coverage per window, calculated at variable window sizes and sample sizes. We observed that for uniquely mapped valid read pairs on the genome, the cube root of the 20th percentile of read coverage varies linearly with the cube root of the window size used to assess the coverage (Figure 1A). Similarly, it varies linearly with the cube root of the number of valid reads (Figure 1B). These two linear relationships have also been observed using mouse embryonic stem cells Hi-C (mESC, Supplementary Figure S5A and B). Thus, the cube root of the 20th percentile of coverage can be considered as a linear function of the cube root of the window size for a given number of reads, as well as a linear function of the cube root of the number of valid reads for a given window size. This means that it can be written as a function p(x,y), where x is the cube root of the window size and y the cube root of the number of valid read pairs. Since we observed that, for a given y value, the cube root of the 20th percentile varies linearly with x, the relationship between the 20th percentile and x can be written as Equation (1).

Figure 1.

Hi-C resolution model fits the observed resolution. (A, B) Variation of 20th percentile of the coverage with the window size (A) and the subsample size (B) used to calculate the coverage. For each data subsample (A) and each window size (B), this variation between the cube root values is linear. (C) 3D plot showing the prediction of the 20th percentile of window coverage by our model versus the window size and the number of valid read pairs. The surface represents the function predicting the 20th percentile for any window size and valid read number, while the dots are the observed 20th percentile for each window size/valid read pairs. The color scale represents the 20th percentile. (D) Predicted resolution versus the number of valid read pairs. Predictions are computed using a 100M sequenced read pairs subsample. Observed resolutions of several subsamples are plotted as an interval containing the observed resolution (red segments).

Equation (1): relation of the 20th percentile of coverage (p^3) to the window size (x^3) for a given number of valid read pairs (y^3):

[Equation (1), image M0001.gif, not reproduced here]

Similarly, for a given x value, the cube root of the 20th percentile varies linearly with y, which can be written as Equation (2).

Equation (2): relation of the 20th percentile of coverage (p^3) to the number of valid read pairs (y^3) for a given window size (x^3):

[Equation (2), image M0001a.gif, not reproduced here]

From a mathematical point of view, these two functions are the partial derivatives of a third function, p(x,y), describing the cube roots of the 20th percentile of coverage versus the window size and the number of valid read pairs. Here, we show that this last function can be written as Equation (3), where x is the cube root of window size, y the cube root of the number of valid read pairs and a, b, c and d are some constants to be determined for each library.

Equation (3): relation of the 20th percentile of coverage (p^3) to the window size (x^3) and the number of valid read pairs (y^3):

[Equation (3), image M0002.gif, not reproduced here]

The coefficients a, b, c and d of Equation (3) are specific to each sequenced Hi-C library. To determine the value of these four coefficients, four (x,y) pairs from a given dataset need to be used: (x1,y1), (x1,y2), (x2,y1) and (x2,y2). For each of these data points, the 20th percentile of the coverage per window (i.e. p(x,y)^3) is measured. HiCRes uses two subsample sizes of the final valid interactions to compute y1 and y2, and two window sizes (20 and 50 kb) to compute x1 and x2.
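Four measurements on a two-by-two grid are exactly enough to determine four coefficients. The sketch below assumes that Equation (3) has the bilinear form p = a·x·y + b·x + c·y + d, which is the only form that is linear in x at fixed y and linear in y at fixed x; the equation image is not available here, so both the form and the assignment of a, b, c and d to its terms are assumptions, and `fit_coefficients` is a hypothetical name.

```python
import numpy as np


def fit_coefficients(points):
    """Solve for (a, b, c, d) in p = a*x*y + b*x + c*y + d from four points.

    Each point is (window_size, valid_pairs, percentile_20). Cube roots are
    taken here, so x = cbrt(window_size), y = cbrt(valid_pairs) and
    p = cbrt(percentile_20).
    """
    rows, targets = [], []
    for window_size, valid_pairs, percentile_20 in points:
        x, y = np.cbrt(window_size), np.cbrt(valid_pairs)
        rows.append([x * y, x, y, 1.0])
        targets.append(np.cbrt(percentile_20))
    # Four equations, four unknowns: an exact linear solve is sufficient.
    return np.linalg.solve(np.array(rows), np.array(targets))
```

The system is solvable as long as the two window sizes differ and the two subsample sizes differ, which is why a grid of two window sizes by two subsample sizes is a natural choice.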

Under the hypothesis that the 20th percentile of the coverage depends only on the read number and the window size, Equation (3) should be sufficient to predict the 20th percentile of the read coverage for any pair of read number and window size. To confirm this hypothesis, we manually assessed the 20th percentile of the coverage for several read number / window size pairs in the GM12878 Hi-C library and calculated the coefficients a, b, c and d specific to this library using a 100M sequenced read pairs subsample. The function described by Equation (3) perfectly overlaps the observed cube roots of the 20th percentiles of coverage from different subsamples and window sizes used for the coverage assessment of a high-resolution Hi-C library from GM12878 cells (Figure 1C). Similarly, we manually assessed the 20th percentile of the coverage for several read number / window size pairs in the mESC library and calculated the coefficients of Equation (3) specific to this library using a 100M sequenced read pairs subsample. In this mESC library too, the function described in Equation (3) perfectly overlaps the observed cube roots of the 20th percentiles of coverage per window (Supplementary Figure S5C). Together, this shows that Equation (3) accurately describes the relationship between the 20th percentile of the coverage, the read number and the window size. From this equation, the resolution r, i.e. the window size for which the 20th percentile is equal to 1000 reads, or p(r^(1/3), y^(1/3)) = 1000^(1/3), with y the number of valid read pairs, can be written as Equation (4).

Equation (4): final model relating the resolution (r) to the number of valid read pairs (y):

[Equation (4), image M0003.gif, not reproduced here]
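Because the equation images are not available in this text, the following is a hedged reconstruction of the kind of model the description implies, not the authors' typeset formulas. It assumes that the two linear relationships hold for every window size and read number, which forces a bilinear form; the symbols w (window size) and N (number of valid read pairs) and the assignment of a, b, c and d to individual terms are introduced here for illustration.

```latex
% Assumed notation: w = window size, N = number of valid read pairs,
% x = w^{1/3}, y = N^{1/3}, p = cube root of the 20th percentile of coverage.
\begin{align}
p(x, y) &= A(y)\,x + B(y) && \text{(linear in $x$ at fixed $y$)} \\
A(y) &= a\,y + b, \quad B(y) = c\,y + d && \text{(so that $p$ is linear in $y$ at fixed $x$)} \\
p(x, y) &= a\,x\,y + b\,x + c\,y + d \\
\intertext{Setting $p = 1000^{1/3} = 10$ and $x = r^{1/3}$, then solving for $r$:}
r &= \left( \frac{10 - c\,N^{1/3} - d}{a\,N^{1/3} + b} \right)^{3}
\end{align}
```

Under this form, the resolution depends on the number of valid read pairs only through its cube root, which is consistent with the non-linear relationship between library size and resolution noted in the Introduction.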

Validation of the model using published datasets

To validate our model, we used several subsamples of high-resolution Hi-C datasets that were publicly available (see Materials and Methods) (6,10). We predicted the resolution versus the number of valid read pairs using a 100M sequenced read pairs subsample of the GM12878 and mESC datasets. For each dataset, we assessed the number of valid read pairs and measured the interval including the observed resolution (see Materials and Methods). For each dataset, the predicted resolution is within the interval comprising the observed resolution (Figure 1D and Supplementary Figure S5D). We reproduced this result using two other public Hi-C datasets, in HMEC and NHEK cell lines (Supplementary Figure S5E and F). For all the datasets tested, our predictions overlapped perfectly with the observed intervals, thereby validating our model of Hi-C resolution prediction from the number of valid read pairs.
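The validation step amounts to predicting a resolution for a given number of valid read pairs and checking that it falls inside the observed interval. The sketch below assumes a bilinear model p = a·x·y + b·x + c·y + d in cube-root space (an assumption, since the equation images are not available here) and inverts it at a 20th percentile of 1000 reads; all function names are hypothetical.

```python
def predict_resolution(valid_pairs, a, b, c, d, min_reads=1000):
    """Window size (bp) at which the modeled 20th percentile equals min_reads.

    Assumes p = a*x*y + b*x + c*y + d with x = cbrt(window size),
    y = cbrt(valid read pairs) and p = cbrt(20th percentile of coverage).
    """
    y = valid_pairs ** (1.0 / 3.0)
    target = min_reads ** (1.0 / 3.0)
    slope = a * y + b
    if slope <= 0:
        raise ValueError("model slope in x must be positive")
    x = (target - c * y - d) / slope
    if x <= 0:
        raise ValueError("no positive window size satisfies the model")
    return x ** 3


def within_interval(predicted, lower, upper):
    """True when the predicted resolution lies inside the observed interval."""
    return lower <= predicted <= upper
```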

The initial measure of the resolution by Rao and colleagues was performed using all valid interactions. Nevertheless, the most informative Hi-C interactions are cis-far interactions, i.e. intra-chromosomal interactions longer than 10 kb. We thus assessed the robustness of our model when considering only cis or only cis-far interactions. Using the same 100M sequenced read pairs subsamples to predict the resolution of our test datasets, we predicted the resolution of the full datasets including cis only or cis-far only interactions. For each dataset, we measured the interval including the resolution of the full dataset using cis only or cis-far only interactions. For each dataset tested (GM12878, mESC, HMEC and NHEK), the prediction of the resolution perfectly overlapped the interval comprising the observed resolution (Supplementary Figure S6).

Implementation of the pipeline HiCRes

Our model successfully links the number of valid Hi-C interactions to the resolution. Nevertheless, to predict the sequencing depth required for a library to reach a given resolution, the number of valid interactions needs to be linked to the sequencing depth. To do so, we developed HiCRes, a user-friendly pipeline combining our model with published tools to measure any Hi-C resolution from raw or analyzed data (Figure 2A). HiCRes is able to predict the resolution versus the sequencing depth of any Hi-C library of 100M read pairs or more. For this purpose, our pipeline measures the library complexity and predicts future yields using the preseq algorithm (29), which we confirmed to be accurate on Hi-C libraries (Supplementary Figure S3). After estimating the future yield of the library, the percentage of uniquely mapped valid read pairs is evaluated using bowtie2 (34) and HiCUP (35). Next, the constants a, b, c and d of Equation (4) are calculated using our model. The tool also confirms the linear relationships between the cube root of the 20th percentile of the coverage per window and the cube roots of the sample size or the window size. If either of these relationships is not linear (i.e. the Pearson correlation between three data points is less than 0.98), the tool does not output any prediction but an error message (Figure 2A). Finally, the predicted resolution can be calculated for different sequencing depths. For interoperability, our tool is available as a docker image and can be run on any system where either docker or singularity is installed.
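The linearity check can be illustrated with a Pearson correlation computed on cube-rooted values. This is a minimal sketch of the test as described (three data points, threshold 0.98), not the code shipped in the HiCRes image; `is_cube_root_linear` is a hypothetical name.

```python
import numpy as np


def is_cube_root_linear(values, percentiles_20, min_correlation=0.98):
    """Check that cbrt(20th percentile) is linear in cbrt(values).

    values: window sizes or valid read pair counts for the data points.
    percentiles_20: the matching 20th percentiles of coverage per window.
    """
    r = np.corrcoef(np.cbrt(values), np.cbrt(percentiles_20))[0, 1]
    return bool(r >= min_correlation)
```

A pipeline following the text would run this check once against window size and once against sample size, and stop with an error if either call returns false.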

Figure 2.

HiCRes accurately predicts the resolution in relation to the sequencing depth. (A) Overview of the HiCRes pipeline. It combines library complexity prediction, estimation of the percentage of valid read pairs, and prediction of the Hi-C resolution. This pipeline predicts the resolution of a given Hi-C library at different sequencing levels. (B) Predicted resolution versus the sequencing depth in GM12878 used for the model. Predictions are calculated using a 100M sequenced read pairs subsample. The corresponding observed resolutions of the total library are plotted as an interval containing the observed resolution (red segment). (C–E) Same as in (B), using independent datasets: mESC (C), HMEC (D) and NHEK (E).

To validate the accuracy of the pipeline in predicting the resolution that can be reached by Hi-C libraries from raw sequenced reads, we tested the HiCRes pipeline on public Hi-C datasets. We subsampled the GM12878, mESC, HMEC and NHEK Hi-C datasets to 100M sequenced read pairs. We then used HiCRes to predict the resolution each subsampled dataset will reach at various sequencing depths. To test the accuracy of our predictions, we measured the resolution interval, which is an interval comprising the observed resolution for the full public dataset. For each dataset tested, the prediction of the resolution corresponded with the observed resolution interval (Figure 2B–E). These analyses confirm the accuracy of HiCRes in predicting Hi-C library resolutions at different sequencing depths based on 100M sequenced read pairs.

To assess the robustness of HiCRes on small input datasets, we challenged our model with low sequencing depth input using the GM12878 dataset. We subsampled the dataset into small subsamples of 10M, 25M, 50M and 75M sequenced read pairs and compared the predictions of the resolution using these libraries to the prediction made using the 100M sequenced read pairs subsample. Whether using all, cis only or cis-far only interactions to estimate the resolution the library can reach at deeper sequencing levels, the prediction curves using data from 25M, 50M or 75M sequenced read pairs are almost indistinguishable from the prediction curve using 100M sequenced read pairs, while the prediction curve using 10M sequenced read pairs deviates slightly (Figure 3). Thus, our model can be used to predict the resolution a Hi-C library can reach at different sequencing depths starting from libraries as small as 25M sequenced read pairs.

Figure 3.

HiCRes predictions are robust to low depth datasets. (A–C) Predicted resolution versus the number of valid read pairs using all (A), cis only (B) or cis-far only (C) interactions. Predictions are computed using several subsample sizes (10M, 25M, 50M, 75M or 100M sequenced read pairs).

Validation of HiCRes pipeline on diverse Hi-C conditions

HiCRes has been developed using Hi-C data from MboI digested chromatin in human and mouse cells. The use of different restriction enzymes leads to different fragment sizes (Supplementary Figure S1) and could influence the accuracy of our model. Additionally, the genome of the species used to perform the Hi-C experiment could have an impact on the read distribution, such as a strong heterogeneity of the read coverage distribution when using different window sizes to compute the coverage. In this case, the accuracy of our model could be strongly challenged. Thus, to test whether our model and this pipeline can be extended to other species and Hi-C conditions, we used several public and lab-produced datasets (Table 1). Overall, we tested HiCRes on different combinations of restriction enzymes (MboI, HindIII and the Arima kit), species (human, mouse, Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster) and samples (cultured cells, tissues). We subsampled each dataset to 100M sequenced read pairs and used the HiCRes pipeline to estimate the resolution at various sequencing depths as described above. We then compared the predictions to the observed resolution interval of the full dataset. For all these samples, the predictions fell within the observed resolution intervals when using all interactions (Figure 4 and Supplementary Figure S7). When using cis only or cis-far only interactions to compute the resolution, most of the predictions overlapped the observed resolution interval of the full dataset, apart from the predictions in mouse retinas and D. melanogaster, which diverged slightly from the observed values (Supplementary Figure S8).

Table 1.

Hi-C Datasets used in this study

Sample | Species | Restriction enzyme | Size (read pairs) | Ref.* | SRA number | Reference
GM12878 | Human | MboI | 486 848 168 | HIC003 | SRR1658572 | Rao et al. (6)
Retina | Mouse | MboI | 1 433 302 476 | HiC_Retina_Adult-Rep1 | SRR9906313 | Norrie et al. (30)
HMEC | Human | MboI | 456 577 382 | HIC058 | SRR1658680 | Rao et al. (6)
NHEK | Human | MboI | 536 747 653 | HIC067 | SRR1658691 | Rao et al. (6)
GM12878 | Human | HindIII | 1 195 923 990 | HIC035 | SRX764970 | Rao et al. (6)
Rods | Mouse | Arima | 194 604 167 | - | GSE152491 | This study
mESC | Mouse | DpnII | 2 812 503 253 | HiC_ES_3 | SRX2636668 | Bonev et al. (10)
A. thaliana | A. thaliana | HindIII | 169 121 538 | HiC Col | SRR1197490 | Grob et al. (31)
D. melanogaster | D. melanogaster | DpnII | 367 216 747 | Asynchronous Hi-C rep1 | SRX2997988 | Wang et al. (32)
C. elegans (Crane et al.) | C. elegans | DpnII | 266 503 508 | Replicate 1 | SRR1665087, SRR1665088, SRR1665087 | Crane et al. (33)
C. elegans (Rowley et al.) | C. elegans | DpnII | 168 530 362 | HiC_hermaphrodites_rep1 | SRX6055741 | Rowley et al. (4)
Rods | Mouse | MboI | Subsampled to 120 000 000 | HiC_Retina_NRLGFPpositive | SRX6657342 | Norrie et al. (30)
NHEK | Human | HindIII | 536 747 653 | HIC068 | SRR1658692 | Rao et al. (6)
GM12878 | Human | MboI | 1 195 923 990 | HIC004 | SRR1658573 | Rao et al. (6)
mESC | Mouse | HindIII | 465 473 330 | Replicate 1 | SRX116341 | Dixon et al. (3)

*Identifier used in the study from which the dataset comes.

Figure 4.

HiCRes is accurate across species. (A–D) Predicted resolution versus the sequencing depth in datasets from various species/Hi-C protocols: Mouse rods using Arima kit (A), A. thaliana aerial tissue of seedlings using HindIII digestion (B), D. melanogaster embryonic cells using DpnII (C) and C. elegans using DpnII (D). Predictions are calculated using a 100M read pairs subsample. Observed resolutions of the total library are plotted as an interval containing the observed resolution (red segment).

Using all, cis- or cis-far-interactions to calculate the resolution

Most tools employed to call 3D chromatin structures use contact maps generated on each chromosome (15,40). Thus, only the interactions occurring within the same chromosome, i.e. the cis-interactions, are usually informative for calling compartments, TADs or loops. Moreover, short-range (i.e. within 10 kb) interactions are usually enriched in self-ligation products, and thus are not informative about specific chromatin structures. Accordingly, we added the prediction of the resolution using cis-interactions only and cis-far (separated by more than 10 kb) interactions only to our pipeline output. Because cis- and cis-far interactions represent a sub-fraction of all interactions, a lower resolution is expected when using only these interactions to estimate the resolution, compared to all interactions. For each library tested, our predictions are in accordance with this (Figure 5, Supplementary Figure S9). As would be intuitively expected, we observe a stronger difference between the predictions using cis- (respectively cis-far) versus all interactions in libraries with a lower proportion of cis- (respectively cis-far) interactions, compared to libraries with a higher proportion (Figure 5, Supplementary Figure S9). Thus, a low percentage of cis-interactions as well as a low percentage of cis-far interactions will directly affect the final resolution of a Hi-C dataset.
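The three interaction classes used here follow directly from the chromosome of each read end and the distance between the ends. The sketch below illustrates that classification with the 10 kb cutoff given in the text; the function name and the "cis-close" label for short-range cis pairs are choices made for this example.

```python
def classify_pair(chrom_a, pos_a, chrom_b, pos_b, far_threshold=10_000):
    """Label a valid read pair as 'trans', 'cis-close' or 'cis-far'.

    'cis' in the text corresponds to 'cis-close' and 'cis-far' together.
    """
    if chrom_a != chrom_b:
        return "trans"
    if abs(pos_b - pos_a) > far_threshold:
        return "cis-far"
    return "cis-close"
```

Counting the labels over a library gives the proportions of cis and cis-far interactions that drive the differences between the three predicted resolutions.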

Figure 5.

The proportions of cis- and cis-far interactions impact the resolution of a Hi-C dataset. (A–D) Predicted resolution versus the sequencing depth using all, cis only or cis-far only interactions in samples with various proportions of cis and cis-far interactions: GM12878 (A), mouse retina (B), GM12878 using HindIII digestion (C) and C. elegans (D). For all the samples, the proportions of cis and cis-far interactions are reported on the plot.

DISCUSSION

Here, we present HiCRes, a tool to estimate and predict the resolution a given Hi-C library will reach when sequenced deeper. We demonstrate that HiCRes accurately predicts the resolution of Hi-C libraries obtained from distinct human and mouse cell types generated using different restriction digestion enzymes. HiCRes is available as a docker image, making it possible to perform the different steps of the pipeline using one simple command line. Our observations show that HiCRes accurately predicts the resolution a Hi-C library can reach at different sequencing depths. On the datasets tested, HiCRes was accurate in predicting resolutions up to 2 kb for single libraries. Thus, we recommend using HiCRes for predictions up to 2 kb.

Two conditions need to be satisfied to apply our model: the cube root of the 20th percentile of the read coverage must vary linearly with the cube root of the window size used to calculate the coverage, and it must vary linearly with the cube root of the number of valid interactions. Our pipeline tests whether these two conditions are met and will not produce any estimation or prediction if they are not satisfied. In that scenario, the resolution can be measured manually as described in the Methods section and by Rao et al. (6), but no prediction can be calculated.

HiCRes uses either sequenced reads as input, to produce a prediction of the resolution versus the sequencing depth, or already processed Hi-C data, to produce a prediction of the resolution versus the number of valid interactions. Using 40 CPUs, HiCRes predicts the resolution of a 200M read pair dataset (fastq files) in approximately 5 h (Table 2, ‘Starting from fastq’). Alternatively, processed data (bam files) can be used as an input for HiCRes. In this case, using 40 CPUs, HiCRes will take approximately 30 min to produce the predictions (Table 2, ‘Starting from bam’). Nevertheless, when starting with already analyzed data, the predictions will be made only in relation to the number of valid interactions, not to the sequenced read number. Thus, we recommend running HiCRes on raw sequenced reads to predict the resolution that libraries can reach at deeper sequencing levels. To simply estimate the resolution of a given library (with no need for prediction at different sequencing depths), we recommend running HiCRes directly on processed data.

Table 2.

Benchmarking for Hi-C datasets

Dataset | Size (read pairs) | Species | Enzyme | Time (from fastq) | Resolution (from fastq) | Time (from bam) | Resolution (from bam)
SRR1658692 | 274M | Human | HindIII | 4h53m | 26883 bp | 36m | 26896 bp
SRR1658573* | 161M | Human | MboI | 4h48m | 24744 bp | 25m | 24746 bp
SRR443883, SRR443884, SRR443885 | 465M | Mouse | HindIII | 4h21m | 22346 bp | 29m | 22356 bp
SRR9906313 | 270M | Mouse | MboI | 6h32m | 17764 bp | 39m | 17739 bp
This study | 195M | Mouse | Arima | 5h | 15661 bp | 29m | 15685 bp

*For datasets marked with an asterisk, the prediction of the library yield (duplicates) failed. In this case, HiCRes gives a warning and generates the predictions on the unique sequenced read pairs.
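As a quick check of how closely the two entry points agree, the resolutions reported in Table 2 can be compared directly. The sketch below only re-uses the values from the table; the variable and function names are illustrative.

```python
# Resolutions (bp) from Table 2: (starting from fastq, starting from bam).
table2_resolutions = {
    "SRR1658692": (26883, 26896),
    "SRR1658573": (24744, 24746),
    "SRR443883/SRR443884/SRR443885": (22346, 22356),
    "SRR9906313": (17764, 17739),
    "This study": (15661, 15685),
}


def relative_difference_percent(from_fastq, from_bam):
    """Absolute difference between the two estimates, as a percentage of the fastq-based one."""
    return abs(from_fastq - from_bam) / from_fastq * 100.0


differences = {
    name: relative_difference_percent(fastq_res, bam_res)
    for name, (fastq_res, bam_res) in table2_resolutions.items()
}
largest_difference = max(differences.values())

for name, diff in differences.items():
    print(f"{name}: {diff:.3f}% difference")
```

For every dataset in Table 2 the two estimates differ by well under 1%, which supports running HiCRes on bam files when only the current resolution is needed.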

An important parameter to consider when estimating the resolution of any Hi-C experiment is the type of interactions included when computing the resolution. The original definition of the resolution by Rao and colleagues included all valid interactions in the valid read count (6). However, most of the information about chromatin structures such as TADs or loops is embedded within the long-range (more than 10 kb) intra-chromosomal interactions, or cis-far interactions. Depending on the quality of the Hi-C experiment, contamination by trans-interactions or by short-range interactions can influence the resolution estimate, as we have shown here (Figure 5). For comparing datasets with each other or with published data using the definition of the resolution set by Rao and colleagues, using all interactions makes sense, especially if associated with other quality control comparisons, such as the proportion of trans and cis short-range interactions. Nevertheless, including or excluding trans or cis short-range interactions can have a significant impact on the final estimate of the resolution. Because of this, we recommend using the resolution computed from cis-far interactions to obtain a good overview of the size range of the 3D structures that can be observed in a given Hi-C dataset. Moreover, given that Hi-C contact maps are usually generated per chromosome with cis-interactions only and are used as input for many tools that perform further analysis (compartment, TAD or loop calling) (15,40), the resolution computed using all interactions would not be meaningful in that context.
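The three interaction classes discussed above can be separated with a few lines of code. The following is a minimal sketch, assuming each valid interaction is available as a tuple of two chromosome names and two positions; this tuple layout and the function names are illustrative and do not describe the HiCRes internals. The 10 kb cutoff follows the definition of cis-far interactions given in the text.

```python
from collections import Counter

CIS_FAR_MIN_DISTANCE = 10_000  # bp; cis-far means more than 10 kb apart


def classify_interaction(chrom1, pos1, chrom2, pos2, min_far=CIS_FAR_MIN_DISTANCE):
    """Label one interaction as 'trans', 'cis-far' or 'cis-short'."""
    if chrom1 != chrom2:
        return "trans"
    if abs(pos2 - pos1) > min_far:
        return "cis-far"
    return "cis-short"


def count_interaction_types(pairs):
    """Count interactions per class for an iterable of (chrom1, pos1, chrom2, pos2) tuples."""
    return Counter(classify_interaction(*pair) for pair in pairs)


# Hypothetical interactions, for illustration only.
example_pairs = [
    ("chr1", 1_000_000, "chr1", 1_250_000),  # same chromosome, 250 kb apart
    ("chr1", 1_000_000, "chr1", 1_004_000),  # same chromosome, 4 kb apart
    ("chr1", 1_000_000, "chr7", 5_000_000),  # different chromosomes
    ("chr2", 30_000, "chr2", 20_000),        # same chromosome, exactly 10 kb apart
]
counts = count_interaction_types(example_pairs)
```

Only the interactions labelled cis-far would then contribute to the coverage used for the recommended resolution estimate, while the proportions of the other two classes remain useful as quality indicators.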

The resolution calculated by this approach is a powerful way to estimate the size limit of the chromatin structures that can be observed. This prediction can be used to compare different datasets and will help decide the sequencing depth needed for a given library. Nevertheless, this number does not directly reflect the quality of the Hi-C experiment, and other quality indicators should be used as a complement, such as the proportion of valid interactions, the ratio of cis- to trans-interactions, the distance-dependent decay of interaction frequency (28) or the reproducibility among replicates (41). Moreover, the resolution is an average value for the whole genome, while the local resolution can be affected by read mappability, the presence of restriction sites and the accessibility of these sites to the restriction enzymes. Thus, the measured resolution does not replace the statistical analysis needed to assess the local significance of any contact enrichment observed in a Hi-C experiment (27).

DATA AVAILABILITY

The HiCRes pipeline is available as a docker image at hub.docker.com/r/marchalc/hicres. All the scripts used to produce the figures in this study, as well as the benchmarking for the HiCRes docker image, are available on GitHub (github.com/ClaireMarchal/HiCRes). The dockerfile used to generate the image is also available on GitHub.

DATA ACCESSIBILITY

The Hi-C dataset on mouse rods generated in this study is available on the GEO database (www.ncbi.nlm.nih.gov/geo/) under the accession number GSE152491.

ACKNOWLEDGEMENTS

The authors are grateful to Frederic Mentink-Vigier from the National High Magnetic Field Laboratory (FSU, FL) for providing insights that helped this study. We also thank Linn Gieser for assistance with next generation sequencing and Zachary Batz for comments and assistance with next generation sequencing. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

Contributor Information

Claire Marchal, Neurobiology, Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, MSC0610, 6 Center Drive, Bethesda, MD 20892, USA; In silichrom Ltd, First Floor, Angel Court, 81 St Clements St, Oxford OX4 1AW, UK.

Nivedita Singh, Neurobiology, Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, MSC0610, 6 Center Drive, Bethesda, MD 20892, USA.

Ximena Corso-Díaz, Neurobiology, Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, MSC0610, 6 Center Drive, Bethesda, MD 20892, USA.

Anand Swaroop, Neurobiology, Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, MSC0610, 6 Center Drive, Bethesda, MD 20892, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Intramural Research Program of the National Eye Institute [ZIAEY000450, ZIAEY000546]. Funding for open access charge: same grants.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Cremer T., Kurz A., Zirbel R., Dietzel S., Rinke B., Schrock E., Speicher M.R., Mathieu U., Jauch A., Emmerich P. et al. Role of chromosome territories in the functional compartmentalization of the cell nucleus. Cold Spring Harb. Symp. Quant. Biol. 1993; 58:777–792.
  • 2. Maya-Mendoza A., Jackson D.A. Labeling DNA replication foci to visualize chromosome territories in vivo. Curr. Protoc. Cell Biol. 2017; 75:22.21.1–22.21.16.
  • 3. Dixon J.R., Selvaraj S., Yue F., Kim A., Li Y., Shen Y., Hu M., Liu J.S., Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485:376–380.
  • 4. Rowley M.J., Corces V.G. Organizational principles of 3D genome architecture. Nat. Rev. Genet. 2018; 19:789–800.
  • 5. Lu L., Liu X., Huang W.K., Giusti-Rodriguez P., Cui J., Zhang S., Xu W., Wen Z., Ma S., Rosen J.D. et al. Robust Hi-C maps of enhancer-promoter interactions reveal the function of non-coding genome in neural development and diseases. Mol. Cell. 2020; 79:521–534.
  • 6. Rao S.S., Huntley M.H., Durand N.C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D., Lander E.S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159:1665–1680.
  • 7. Dileep V., Ay F., Sima J., Vera D.L., Noble W.S., Gilbert D.M. Topologically associating domains and their long-range contacts are established during early G1 coincident with the establishment of the replication-timing program. Genome Res. 2015; 25:1104–1113.
  • 8. Nagano T., Lubling Y., Varnai C., Dudley C., Leung W., Baran Y., Mendelson Cohen N., Wingett S., Fraser P., Tanay A. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature. 2017; 547:61–67.
  • 9. Naumova N., Imakaev M., Fudenberg G., Zhan Y., Lajoie B.R., Mirny L.A., Dekker J. Organization of the mitotic chromosome. Science. 2013; 342:948–953.
  • 10. Bonev B., Mendelson Cohen N., Szabo Q., Fritsch L., Papadopoulos G.L., Lubling Y., Xu X., Lv X., Hugnot J.P., Tanay A. et al. Multiscale 3D genome rewiring during mouse neural development. Cell. 2017; 171:557–572.
  • 11. Dixon J.R., Jung I., Selvaraj S., Shen Y., Antosiewicz-Bourget J.E., Lee A.Y., Ye Z., Kim A., Rajagopal N., Xie W. et al. Chromatin architecture reorganization during stem cell differentiation. Nature. 2015; 518:331–336.
  • 12. Iwasaki O., Tanizawa H., Kim K.D., Kossenkov A., Nacarelli T., Tashiro S., Majumdar S., Showe L.C., Zhang R., Noma K.I. Involvement of condensin in cellular senescence through gene regulation and compartmental reorganization. Nat. Commun. 2019; 10:5688.
  • 13. Sati S., Bonev B., Szabo Q., Jost D., Bensadoun P., Serra F., Loubiere V., Papadopoulos G.L., Rivera-Mulia J.C., Fritsch L. et al. 4D genome rewiring during oncogene-induced and replicative senescence. Mol. Cell. 2020; 78:522–538.
  • 14. Cremer T., Cremer C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat. Rev. Genet. 2001; 2:292–301.
  • 15. Heinz S., Texari L., Hayes M.G.B., Urbanowski M., Chang M.W., Givarkes N., Rialdi A., White K.M., Albrecht R.A., Pache L. et al. Transcription elongation can affect genome 3D structure. Cell. 2018; 174:1522–1536.
  • 16. Hug C.B., Grimaldi A.G., Kruse K., Vaquerizas J.M. Chromatin architecture emerges during zygotic genome activation independent of transcription. Cell. 2017; 169:216–228.
  • 17. Stadhouders R., Vidal E., Serra F., Di Stefano B., Le Dily F., Quilez J., Gomez A., Collombet S., Berenguer C., Cuartero Y. et al. Transcription factors orchestrate dynamic interplay between genome topology and gene regulation during cell reprogramming. Nat. Genet. 2018; 50:238–249.
  • 18. Schoenfelder S., Fraser P. Long-range enhancer-promoter contacts in gene expression control. Nat. Rev. Genet. 2019; 20:437–455.
  • 19. Gorkin D.U., Leung D., Ren B. The 3D genome in transcriptional regulation and pluripotency. Cell Stem Cell. 2014; 14:762–775.
  • 20. Dileep V., Wilson K.A., Marchal C., Lyu X., Zhao P.A., Li B., Poulet A., Bartlett D.A., Rivera-Mulia J.C., Qin Z.S. et al. Rapid irreversible transcriptional reprogramming in human stem cells accompanied by discordance between replication timing and chromatin compartment. Stem Cell Rep. 2019; 13:193–206.
  • 21. Marchal C., Sima J., Gilbert D.M. Control of DNA replication timing in the 3D genome. Nat. Rev. Mol. Cell Biol. 2019; 20:721–737.
  • 22. Moindrot B., Audit B., Klous P., Baker A., Thermes C., de Laat W., Bouvet P., Mongelard F., Arneodo A. 3D chromatin conformation correlates with replication timing and is conserved in resting cells. Nucleic Acids Res. 2012; 40:9470–9481.
  • 23. Pope B.D., Ryba T., Dileep V., Yue F., Wu W., Denas O., Vera D.L., Wang Y., Hansen R.S., Canfield T.K. et al. Topologically associating domains are stable units of replication-timing regulation. Nature. 2014; 515:402–405.
  • 24. Sima J., Chakraborty A., Dileep V., Michalski M., Klein K.N., Holcomb N.P., Turner J.L., Paulsen M.T., Rivera-Mulia J.C., Trevilla-Garcia C. et al. Identifying cis elements for spatiotemporal control of mammalian DNA replication. Cell. 2019; 176:816–830.
  • 25. Lieberman-Aiden E., van Berkum N.L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B.R., Sabo P.J., Dorschner M.O. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009; 326:289–293.
  • 26. Belaghzal H., Dekker J., Gibcus J.H. Hi-C 2.0: an optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation. Methods. 2017; 123:56–65.
  • 27. Schmitt A.D., Hu M., Ren B. Genome-wide mapping and analysis of chromosome architecture. Nat. Rev. Mol. Cell Biol. 2016; 17:743–755.
  • 28. Lajoie B.R., Dekker J., Kaplan N. The Hitchhiker's guide to Hi-C analysis: practical guidelines. Methods. 2015; 72:65–75.
  • 29. Daley T., Smith A.D. Predicting the molecular complexity of sequencing libraries. Nat. Methods. 2013; 10:325–327.
  • 30. Norrie J.L., Lupo M.S., Xu B., Al Diri I., Valentine M., Putnam D., Griffiths L., Zhang J., Johnson D., Easton J. et al. Nucleome dynamics during retinal development. Neuron. 2019; 104:512–528.
  • 31. Grob S., Schmid M.W., Grossniklaus U. Hi-C analysis in Arabidopsis identifies the KNOT, a structure with similarities to the flamenco locus of Drosophila. Mol. Cell. 2014; 55:678–693.
  • 32. Wang Q., Sun Q., Czajkowsky D.M., Shao Z. Sub-kb Hi-C in D. melanogaster reveals conserved characteristics of TADs between insect and mammalian cells. Nat. Commun. 2018; 9:188.
  • 33. Crane E., Bian Q., McCord R.P., Lajoie B.R., Wheeler B.S., Ralston E.J., Uzawa S., Dekker J., Meyer B.J. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature. 2015; 523:240–244.
  • 34. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9:357–359.
  • 35. Wingett S., Ewels P., Furlan-Magaril M., Nagano T., Schoenfelder S., Fraser P., Andrews S. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res. 2015; 4:1310.
  • 36. Quinlan A.R. BEDTools: the Swiss-Army tool for genome feature analysis. Curr. Protoc. Bioinformatics. 2014; 47:11.12.1–11.12.1.
  • 37. Yang T., Zhang F., Yardimci G.G., Song F., Hardison R.C., Noble W.S., Yue F., Li Q. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 2017; 27:1939–1949.
  • 38. Akimoto M., Cheng H., Zhu D., Brzezinski J.A., Khanna R., Filippova E., Oh E.C., Jing Y., Linares J.L., Brooks M. et al. Targeting of GFP to newborn rods by Nrl promoter and temporal expression profiling of flow-sorted photoreceptors. Proc. Natl. Acad. Sci. U.S.A. 2006; 103:3890–3895.
  • 39. Robinson J.T., Turner D., Durand N.C., Thorvaldsdottir H., Mesirov J.P., Aiden E.L. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Syst. 2018; 6:256–258.
  • 40. Durand N.C., Shamim M.S., Machol I., Rao S.S., Huntley M.H., Lander E.S., Aiden E.L. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016; 3:95–98.
  • 41. Yardimci G.G., Ozadam H., Sauria M.E.G., Ursu O., Yan K.K., Yang T., Chakraborty A., Kaul A., Lajoie B.R., Song F. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 2019; 20:57.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
