Author manuscript; available in PMC: 2018 Mar 12.
Published in final edited form as: Cell Syst. 2016 Jul;3(1):95–98. doi: 10.1016/j.cels.2016.07.002

Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments

Neva C Durand 1,2,3,4,*, Muhammad S Shamim 1,2,3,*, Ido Machol 1,2,3, Suhas S P Rao 1,2,3,5, Miriam H Huntley 1,2,3,6, Eric S Lander 4,7,8, Erez Lieberman Aiden 1,2,3,4,9
PMCID: PMC5846465  NIHMSID: NIHMS804301  PMID: 27467249

Abstract

Hi-C experiments explore the three-dimensional structure of the genome, generating terabases of data to create high resolution contact maps. Here, we introduce Juicer, an open-source tool for analyzing terabase-scale Hi-C datasets. Juicer allows users without a computational background to transform raw sequence data into normalized contact maps with one click. Juicer produces a hic file containing compressed contact matrices at many resolutions, facilitating visualization and analysis at multiple scales. Structural features, such as loops and domains, are automatically annotated. Juicer is available as open source software at http://aidenlab.org/juicer/

Main Text

Hi-C experiments probe the three-dimensional structure of DNA and chromatin by ligating and sequencing DNA loci that are spatially proximate to one another (Lieberman-Aiden and Van Berkum et al., 2009; Rao and Huntley et al., 2014). The resulting maps reflect patterns of physical contact between loci, making it possible to deduce how loci are organized in 3D.

Efforts to improve the resolution of 3D maps have caused the amount of DNA sequence produced from Hi-C experiments to skyrocket. Our original maps, derived from 30 million reads and 16 Gb of DNA sequence, described the genome at 1 megabase resolution (Lieberman-Aiden and Van Berkum et al., 2009). In contrast, we recently generated 6.5 billion reads and 1.6 Tb of DNA sequence in order to create a single 3D map of the genome at kilobase resolution (Rao and Huntley et al., 2014).

Although pipelines for Hi-C data analysis exist (Lieberman-Aiden and Van Berkum et al., 2009; Schmid et al., 2015; Servant et al., 2015; Sauria et al., 2015), these packages are not designed to process datasets at the terabase scale or to annotate the structural features that these maps reflect. Moreover, tools that require high-performance computation must remain reliable and easy to use across software platforms and hardware instances. Ensuring such compatibility can be a considerable engineering challenge.

Here, we introduce Juicer, an easy-to-use, fully-automated pipeline for the processing and annotation of data from Hi-C and other contact mapping experiments. Juicer is closely based on the algorithms that we recently developed in order to analyze and annotate our terabase-scale Hi-C experiments (Rao and Huntley et al., 2014). In order to meet the engineering challenge of handling such massive datasets, Juicer supports the use of parallelization and hardware acceleration whenever possible, including CPU clusters, general-purpose graphics processing units (GP-GPUs), and field-programmable gate arrays (FPGAs). Juicer is also compatible with a variety of cloud and cluster architectures.

Juicer comprises three tools, which are designed to be run in sequence.

First, Juicer transforms raw sequence data into a list of Hi-C contacts (pairs of genomic positions that were adjacent to each other in three-dimensional space during the experiment). To accomplish this, read pairs are aligned to the genome; both duplicates and near-duplicates are removed, and read pairs that align to three or more locations are set aside. When appropriate hardware is available, this procedure can be accelerated, either by parallelizing across multiple CPUs or by using an FPGA (see Table 1).
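
The duplicate-removal step described above can be illustrated with a minimal sketch. The record layout, the function name `remove_duplicates`, and the 4 bp `wobble` tolerance used to define a near-duplicate are all assumptions made for illustration; they are not taken from Juicer's code or file formats.

```python
from typing import List, Tuple

# Hypothetical contact record (an assumption, not Juicer's format):
# (chrom1, pos1, strand1, chrom2, pos2, strand2)
Contact = Tuple[str, int, int, str, int, int]


def remove_duplicates(contacts: List[Contact], wobble: int = 4) -> List[Contact]:
    """Keep one contact per group of duplicates or near-duplicates.

    Two contacts count as near-duplicates when their chromosomes and
    strands match and both ends lie within `wobble` bp of each other.
    """
    kept: List[Contact] = []
    for c in sorted(contacts):
        is_dup = False
        # Because the list is sorted, only recently kept contacts can match:
        # stop as soon as the first end is on another chromosome or too far away.
        for k in reversed(kept):
            if k[0] != c[0] or c[1] - k[1] > wobble:
                break
            same_layout = (k[2], k[3], k[5]) == (c[2], c[3], c[5])
            if same_layout and abs(k[4] - c[4]) <= wobble:
                is_dup = True
                break
        if not is_dup:
            kept.append(c)
    return kept


example = [
    ("chr1", 100, 0, "chr1", 5000, 16),
    ("chr1", 102, 0, "chr1", 5001, 16),  # near-duplicate of the first
    ("chr1", 100, 0, "chr2", 5000, 16),  # different second chromosome
    ("chr1", 300, 0, "chr1", 5000, 16),  # first end too far away
]
unique = remove_duplicates(example)
```

Sorting first means each contact only has to be compared with a short run of neighbors, which is why the pipeline sorts and merges before removing duplicates.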

Table 1.

Using Juicer to process 1.5 billion paired-end Hi-C reads on different cluster systems. “RAM (GB)” and “VM (GB)” are the maximum RAM and virtual memory, respectively, used for each task. Loop annotation was not performed on the Broad cluster, which does not offer GPUs. See Table S1.

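| System | Amazon Web Services g2.8xlarge | Broad Univa Grid Engine | Rice PowerOmics | Rice PowerOmics + FPGA |
|---|---|---|---|---|
| CPU | Intel Xeon E5-2670 @ 2.60 GHz | Intel Xeon X5650 @ 2.66 GHz | IBM POWER8E @ 2.061 GHz, revision 2.1 | IBM POWER8E @ 2.061 GHz, revision 2.1 |
| Cores/node | 4×8 cores | 4×6 cores | 2×24 cores | 2×24 cores |
| RAM | 60 GB | 32 GB | 256 GB | 256 GB |
| Cluster OS | OpenLava 2.2 (LSF compatible) | UGE 8.3.0 | Slurm 14.11.8 | Slurm 14.11.8 |
| GPU | NVIDIA Quadro K5000 | None | NVIDIA Tesla K80 | NVIDIA Tesla K80 |
| FPGA | None | None | None | Edico Genome DRAGEN Bio-IT Platform |
| Max parallel cores | 32 | 1200 | 1536 | 1536 |

Core hours are given as hr:min. Column groups, left to right: Amazon Web Services (AWS), Broad, Rice PowerOmics (Rice), Rice PowerOmics + FPGA (Rice+FPGA).

| Task | AWS core hours | AWS RAM (GB) | AWS VM (GB) | Broad core hours | Broad RAM (GB) | Broad VM (GB) | Rice core hours | Rice RAM (GB) | Rice VM (GB) | Rice+FPGA core hours | Rice+FPGA RAM (GB) | Rice+FPGA VM (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Align | 8744:49 | 12.3 | 13.5 | 11614:07 | 10.8 | 11.9 | 4221:29 | 13.1 | 14.0 | 1:29 | 0 | 0 |
| Merge Sort | 35:36 | 9.9 | 10.1 | 117:03 | 8.7 | 198.1 | 452:13 | 14.0 | 120.0 | 426:30 | 30.0 | 120.0 |
| Duplicate Removal | 12:21 | 0.5 | 0.5 | 17:04 | 0.4 | 0.5 | 3:12 | 0.4 | 0.0 | 1:28 | 0.4 | 0.0 |
| .hic Creation | 112:43 | 21.8 | 34.9 | 209:43 | 13.4 | 19.5 | 139:17 | 19.3 | 8 | 177:04 | 19.3 | 8 |
| Feature Annotation | 2:07 | 10.5 | 139.3 | 1:04 | 6.4 | 19.5 | 3:25 | 4.2 | 9.1 | 4:28 | 77.1 | 9.1 |
| Total | 8906:11 | | | 11959:01 | | | 4819:36 | | | 608:59 | | |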

Next, the catalog of contacts is used to create contact matrices. To do so, the linear genome is partitioned into loci of a fixed size, or “resolution” (e.g., 1 Mb or 1 kb). These loci correspond to the rows and columns of a contact matrix; each entry in the matrix reflects the number of contacts observed between the corresponding pair of loci during a Hi-C experiment. Due to factors such as chromatin accessibility, certain loci are observed more frequently in Hi-C experiments. Juicer can adjust for these biases in multiple ways. The options include our original normalization scheme (Lieberman-Aiden and Van Berkum et al., 2009), as well as a matrix balancing scheme that ensures that each row and column of the contact matrix sums to the same value (Knight and Ruiz, 2012). A wide array of quality statistics is also calculated, making it possible to assess the success and reliability of a given experiment before the costly deep-sequencing step.
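
As a rough illustration of binning and balancing, the sketch below builds a symmetric contact matrix for a single chromosome and then rescales it until every row and column sums to 1. The function names are hypothetical, and the balancing loop is a simple symmetric iterative scaling used here as a stand-in; it is not the Knight and Ruiz algorithm cited above, which reaches the same kind of balanced matrix far more efficiently.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def bin_contacts(pairs: List[Tuple[int, int]], chrom_length: int,
                 resolution: int) -> Matrix:
    """Count contacts between fixed-size loci on a single chromosome."""
    n = (chrom_length + resolution - 1) // resolution  # number of loci
    m = [[0.0] * n for _ in range(n)]
    for pos1, pos2 in pairs:
        i, j = pos1 // resolution, pos2 // resolution
        m[i][j] += 1.0
        if i != j:
            m[j][i] += 1.0  # keep the matrix symmetric
    return m


def balance(m: Matrix, iterations: int = 200) -> Matrix:
    """Rescale a symmetric matrix so each row and column sums to about 1.

    Simple iterative scaling (an illustrative stand-in, not Knight-Ruiz):
    divide entry (i, j) by sqrt(rowsum_i * rowsum_j) and repeat.
    """
    n = len(m)
    w = [row[:] for row in m]
    for _ in range(iterations):
        sums = [sum(row) for row in w]
        # Rows with no contacts are left untouched (scale factor 1).
        scale = [s ** 0.5 if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                w[i][j] /= scale[i] * scale[j]
    return w


pairs = [(50, 60), (50, 70), (50, 150), (50, 250),
         (120, 130), (150, 250), (250, 260)]
raw = bin_contacts(pairs, chrom_length=300, resolution=100)
balanced = balance(raw)
```

In this toy input the first locus is observed more often than the others; after balancing, its row sum matches theirs, which is the bias correction the text describes.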

The contact matrices generated in this way are stored efficiently in a compressed format, which is designed to facilitate all subsequent computations. For instance, 1 terabyte of raw sequencing data is represented as an 80 gigabyte hic file containing normalized and non-normalized contact matrices at 18 different resolutions, from 2.5Mb resolution to single restriction fragment resolution for a 4-cutter restriction enzyme (~400bp). Contact matrices in the hic format can also be visualized using Juicebox, which is described in the accompanying paper.

Finally, Juicer contains a suite of algorithms that are designed to annotate contact matrices and thus identify features of genome folding. These features include loops, loop anchor motifs, and contact domains.

Loops are identified using the HiCCUPS algorithm (Rao and Huntley et al., 2014), which searches for clusters of contact matrix entries in which the frequency of contact is enriched relative to the local background. Since there are trillions of pixels in a kilobase-resolution Hi-C map, HiCCUPS is implemented using GP-GPUs. Given CTCF and/or cohesin ChIP-Seq tracks for the same cell type, HiCCUPS can frequently use FIMO (Grant et al., 2011) to identify the CTCF motif that serves as the anchor for each loop. We recently performed CRISPR experiments disrupting seven different CTCF motifs, each of which was identified by HiCCUPS as the anchor of one or more loops. In each case, disruption of the motif led to disruption of the corresponding loop, thus confirming the accuracy of HiCCUPS loop anchor annotations (Sanborn and Rao et al., 2015).
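
The idea of enrichment relative to a local background can be sketched as follows. This is a deliberately simplified illustration and not HiCCUPS itself: it scores a single pixel against the mean of a surrounding ring of pixels, and the function name, the `peak_width` and `window` parameters, and the plain ratio used as a score are assumptions for the example.

```python
from typing import List, Optional


def local_enrichment(m: List[List[float]], i: int, j: int,
                     peak_width: int = 1, window: int = 3) -> Optional[float]:
    """Ratio of pixel (i, j) to the mean of a ring-shaped neighborhood.

    The ring contains pixels within `window` of (i, j), excluding those
    within `peak_width`. Returns None when the background is empty.
    """
    n = len(m)
    total, count = 0.0, 0
    for a in range(max(0, i - window), min(n, i + window + 1)):
        for b in range(max(0, j - window), min(n, j + window + 1)):
            if abs(a - i) <= peak_width and abs(b - j) <= peak_width:
                continue  # too close to the candidate peak
            total += m[a][b]
            count += 1
    if count == 0 or total == 0.0:
        return None
    return m[i][j] / (total / count)


# Toy 9x9 map: uniform background with one enriched pixel in the middle.
toy = [[1.0] * 9 for _ in range(9)]
toy[4][4] = 10.0
score = local_enrichment(toy, 4, 4)
```

Each pixel is scored independently of every other pixel, which is the kind of workload that maps naturally onto the GP-GPUs mentioned above.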

Contact domains are identified using a dynamic programming algorithm that relies on applying the Arrowhead transformation [$A_{i,i+d} = (M^*_{i,i-d} - M^*_{i,i+d}) / (M^*_{i,i-d} + M^*_{i,i+d})$] to a normalized contact matrix $M^*$ (Rao and Huntley et al., 2014). Many of these domains are associated with loops, and can be disrupted by manipulating the corresponding loop anchors (Sanborn and Rao et al., 2015).
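
The Arrowhead transformation itself is straightforward to compute. The sketch below applies the formula above to a small normalized matrix; the function name and the choice to leave entries at zero when the denominator is zero or the window runs off the matrix are assumptions. The dynamic programming step that turns the transformed matrix into domain calls is not shown.

```python
from typing import List


def arrowhead(m: List[List[float]]) -> List[List[float]]:
    """Compute A[i][i+d] = (M[i][i-d] - M[i][i+d]) / (M[i][i-d] + M[i][i+d])."""
    n = len(m)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in range(1, n):
            if i - d < 0 or i + d >= n:
                break  # one of the two loci falls outside the matrix
            upstream, downstream = m[i][i - d], m[i][i + d]
            denom = upstream + downstream
            if denom > 0:
                a[i][i + d] = (upstream - downstream) / denom
    return a


# Toy normalized matrix with two domains: loci 0-2 and loci 3-5.
toy = [[10.0 if (r < 3) == (c < 3) else 1.0 for c in range(6)] for r in range(6)]
result = arrowhead(toy)
```

In this toy matrix the entry is 0 where both neighbors share a domain with locus i (for example `result[1][2]`), positive where only the upstream neighbor does (`result[2][3]`), and negative where only the downstream neighbor does (`result[3][4]`).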

It is frequently useful to examine the cumulative signal from a large number of putative features at once, including both loops and domains. To this end, Juicer includes an implementation of Aggregate Peak Analysis (Rao and Huntley et al., 2014).
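
A minimal sketch of this kind of aggregation is shown below. It sums equally sized submatrices centered on each putative feature and then compares the central pixel with the corner nearest the diagonal. The function names, the window sizes, and the particular summary ratio are assumptions for illustration, not a description of Juicer's implementation.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def aggregate_peaks(m: Matrix, peaks: List[Tuple[int, int]],
                    half_width: int = 2) -> Tuple[Matrix, int]:
    """Sum the (2*half_width+1)-wide windows centered on each peak pixel."""
    n = len(m)
    size = 2 * half_width + 1
    agg = [[0.0] * size for _ in range(size)]
    used = 0
    for i, j in peaks:
        # Skip peaks whose window would extend beyond the matrix.
        if min(i, j) - half_width < 0 or max(i, j) + half_width >= n:
            continue
        for a in range(size):
            for b in range(size):
                agg[a][b] += m[i - half_width + a][j - half_width + b]
        used += 1
    return agg, used


def center_to_lower_left(agg: Matrix, corner: int = 2) -> Optional[float]:
    """Ratio of the central pixel to the mean of the lower-left corner."""
    size = len(agg)
    center = agg[size // 2][size // 2]
    cells = [agg[a][b] for a in range(size - corner, size) for b in range(corner)]
    mean = sum(cells) / len(cells)
    return center / mean if mean > 0 else None


toy = [[1.0] * 12 for _ in range(12)]
toy[3][8] = 5.0
toy[4][9] = 7.0
agg, used = aggregate_peaks(toy, [(3, 8), (4, 9), (0, 11)])
ratio = center_to_lower_left(agg)
```

Features that are individually too weak to call can still produce a clear central signal once many windows are summed, which is the motivation given in the text.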

Juicer is an open-source project. It is available at github.com/theaidenlab/juicer as a series of packages designed for a variety of hardware configurations: either a single machine, or clusters that run LSF, Univa Grid Engine, or SLURM. In addition, Juicer is available on the cloud at Amazon Web Services. Table 1 displays different performance metrics on each cluster system; the details of each setup are in the supplemental text. Once installed, Juicer can be executed using a single command, by users without informatics experience.

Experimental Methods

All algorithms and data are drawn from Rao and Huntley et al., 2014, except as described in the supplement.

Figure 1. Juicer analyzes terabases of Hi-C data with one click.

(A) Sequenced read pairs (horizontal bars) are aligned to the genome in parallel. Color indicates genomic position. Read pairs aligning to more than two positions are excluded. Those remaining are sorted by position and merged into a single list, at which point duplicate reads are removed. The .hic file stores contact matrices at many resolutions, which can be loaded into Juicebox for visualization. See Table S2. (B) Contact domains (yellow) are annotated using the Arrowhead algorithm. (C) Loops (cyan) are annotated using HiCCUPS.

Acknowledgments

Supported by NIH New Innovator Award 1DP2OD008540, NIH 4D Nucleome Grant U01HL130010, NSF Physics Frontier Center PHY-1427654, NHGRI HG006193, Welch Foundation Q-1866, Cancer Prevention Research Institute of Texas Scholar Award R1304, an NVIDIA Research Center Award, an IBM University Challenge Award, a Google Research Award, a McNair Medical Institute Scholar Award, and the President’s Early Career Award in Science and Engineering to E.L.A.; an NHGRI grant (HG003067) to E.S.L.; and a PD Soros Fellowship to S.S.P.R. The Rice PowerOmics cluster was a gift from IBM.

Footnotes

Author Contributions: E.L.A. conceived of this project; N.C.D. created the pipeline; S.S.P.R. created HiCCUPS; M.H.H. created APA; M.H.H. and N.C.D. created Arrowhead; M.S.S. re-implemented all feature annotation algorithms in Java as fully-automated, end-to-end tools; I.M. ported the pipeline to SLURM and AWS; N.C.D., M.S.S., I.M., and E.S.L. contributed to tool development; N.C.D. and E.L.A. prepared the manuscript.

The software and test data sets used to review this manuscript are available at http://dx.doi.org/10.17632/c6bg4cbggn.1

References

  1. Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018. doi: 10.1093/bioinformatics/btr064.
  2. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2012;33:1029–1047.
  3. Lieberman-Aiden E, van Berkum N, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie B, Sabo P, Dorschner M, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369.
  4. Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A Three-dimensional Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021.
  5. Sanborn AL, Rao SSP, Huang S, Durand NC, Huntley MH, Jewett AI, Bochkov ID, Chinnappan D, Cutkosky A, Geeting KP, Gnirke A, Melnikov A, McKenna D, Stamenova EK, Lander ES, Aiden EL. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proceedings of the National Academy of Sciences. 2015;112(47):E6456–E6465. doi: 10.1073/pnas.1518552112.
  6. Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology. 2015;16:259. doi: 10.1186/s13059-015-0831-x.
  7. Schmid MW, Grob S, Grossniklaus U. HiCdat: a fast and easy-to-use Hi-C data analysis tool. BMC Bioinformatics. 2015;16(1):277. doi: 10.1186/s12859-015-0678-x.
  8. Sauria MEG, Phillips-Cremins JE, Corces VG, Taylor J. HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biology. 2015;16:237. doi: 10.1186/s13059-015-0806-y.
