hictk: blazing fast toolkit to work with .hic and .cool files

Roberto Rossini; Jonas Paulsen

doi:10.1093/bioinformatics/btae408

. 2024 Jun 24;40(7):btae408. doi: 10.1093/bioinformatics/btae408

hictk: blazing fast toolkit to work with .hic and .cool files

Roberto Rossini ¹, Jonas Paulsen ^2,^3,^✉

Editor: Alfonso Valencia

PMCID: PMC11216752 PMID: 38913844

Abstract

Motivation

Hi-C is gaining prominence as a method for mapping genome organization. With declining sequencing costs and a growing demand for higher-resolution data, efficient tools for processing Hi-C datasets at different resolutions are crucial. Over the past decade, the .hic and Cooler file formats have become the de-facto standard to store interaction matrices produced by Hi-C experiments in binary format. Interoperability issues make it unnecessarily difficult to convert between the two formats and to develop applications that can process each format natively.

Results

We developed hictk, a toolkit that can transparently operate on .hic and .cool files with excellent performance. The toolkit is written in C++ and consists of a C++ library with Python and R bindings as well as CLI tools to perform common operations directly from the shell, including converting between .hic and .mcool formats. We benchmark the performance of hictk and compare it with other popular tools and libraries. We conclude that hictk significantly outperforms existing tools while providing the flexibility of natively working with both file formats without code duplication.

Availability and implementation

The hictk library, Python bindings and CLI tools are released under the MIT license as a multi-platform application available at github.com/paulsengroup/hictk. Pre-built binaries for Linux and macOS are available on bioconda. Python bindings for hictk are available on GitHub at github.com/paulsengroup/hictkpy, while R bindings are available on GitHub at github.com/paulsengroup/hictkR.

1 Introduction

Hi-C and related tools involve probing and quantifying pairwise proximities throughout entire genomes' 3D structure, and are invaluable for studying genome- and chromatin organization. Hi-C data production has seen a dramatic rise recently, both for advancing genome assembly (Burton et al. 2013, Kaplan and Dekker 2013, Selvaraj et al. 2013) and for precise mapping of the genome's 3D architecture (Lieberman-Aiden et al. 2009), putting to test the performance of available computational tools to process and analyze Hi-C data.

Sequencing data produced by Hi-C experiments are usually processed with specialized pipelines, such as those provided by ENCODE, 4DNucleome, nf-core, and Open2C. The main file output by these pipelines stores a matrix-representation of all pairwise Hi-C interactions within and between the reference chromosomes.

These interaction matrices are typically very large (9.5 trillion values at 1 kb resolution in the human genome) and thus cannot be represented as dense matrices in plain-text format.

Two popular file formats are currently used to efficiently store this kind of data: Cooler (Abdennur and Mirny 2019) and .hic (Durand et al. 2016a). Both formats represent interaction matrices as sparse, compressed and chunked matrices, yet in substantially different ways, requiring specialized libraries for reading or writing each format. As a consequence, application developers need to write two versions of the same code to read data from files: one for .hic files and another for Cooler files. Yet, responsibility of format conversion is usually placed on the users of the tools. Notably, conversion between the two formats is not straightforward or computationally efficient, requiring an intermediate text representation for reliable conversion. This process is inefficient and error prone, as users are tasked with verifying the accuracy of the text format and ensuring proper sorting.

Files stored in the .hic format undergo processing through several tools: JuiceboxGUI (Durand et al. 2016a) offers an interactive interface where users can visualize .hic matrices at different resolutions, with various normalizations and annotation tracks. JuicerTools and HiCTools (Durand et al. 2016b) facilitate the generation of .hic files, and straw (Durand et al. 2016a) functions as a versatile .hic file reader. Notably, straw is accessible as a library for multiple programming languages such as C++, Matlab, R, and Python. In contrast, JuiceboxGUI and HiCTools are Java-based applications designed to efficiently handle .hic files.

Cooler files on the other hand, are processed with the homonymous suite of tools and can be visualized using HiGlass (Kerpedjiev et al. 2018, Abdennur and Mirny 2019). The Cooler file format is based on HDF5 (www.hdfgroup.org/HDF5), with a CLI and library written in Python. Usage of HDF5 represents a difference in philosophy from the .hic format, allowing flexible and relatively user-friendly runtime-based introspection of data. However, users willing to open cooler files using programming languages other than Python still need to create dedicated code to e.g. satisfy queries by genomic coordinates. The Cooler file format specification defines three formats: single-resolution (.cool), multi-resolution (.mcool), and single-cell (.scool) Cooler files. Single-resolution Cooler files store interactions using a compressed sparse row (CSR) scheme; multi-resolution Coolers are collections of .cool files at different resolutions; single-cell Cooler files are meant to store single-cell Hi-C data. Contrary to files in Cooler format, files in .hic format can be visualized using the UCSC genome browser.

Finally, a tool named hic2cool is available to easily convert from .hic to .mcool. hic2cool is written in Python and borrows code from straw and cooler. However, hic2cool is relatively slow (see benchmarks) and is not capable of reading the latest version of .hic files (github.com/aidenlab/hic-format).

We have developed the hictk toolkit with the aim of streamlining reading, writing and converting .hic and Cooler files within the same library, and doing so with excellent performance. hictk consists of a library and CLI components. The library allows reading and writing both .hic and Cooler files. The CLI makes it easy to perform common operations directly from the shell, including converting files between Cooler and .hic formats (see Fig. 1 and Supplementary Fig. S1).

Figure 1. — Overview of hictk functionality. (A) List of CLI subcommands provided by hictk. (B) Snipped of Python code showing how to use hictkpy to fetch genome-wide interactions from a .cool file. (C) Snippet of R code showing how to use hictkR to fetch genome-wide interactions from a .cool file. (D) Example usage of hictk dump. The example shows how to use the dump subcommand to extract chr1 interactions at 1000 bp resolution from a .hic or .cool file and print them to stdout as a list of bedGraph2 records. (E) Example usage of hictk load. The example shows how to ingest interactions in 4DN pairs format to generate a single-resolution .hic or .cool file. (F) Example usage of hictk convert showing how to convert a file in .hic format to .mcool and vice-versa. (G) Example usage of hictk merge showing how to merge multiple .hic or .cool files, producing a single file in the same format as the input files.

2 Materials and methods

2.1 Overview of the hictk library

hictk is implemented in C++ and was designed with computational- and memory efficiency and composability in mind. To achieve this, hictk heavily relies on iterators to lazily traverse collections of pixels. From the user perspective, there is no difference between reading interactions from a .hic, .cool, or .mcool file. To read interactions for a region of interest, users first create a file object by providing the path to a file and the resolution in bp to be used. The file object implements a fetch method that takes as input several optional parameters, including a query range (e.g. chr1:0–10 000 000). The fetch method returns a PixelSelector object, providing begin() and end() methods allowing pixel traversal for the queried range (see Supplementary Table S1 found in Supplementary Text S1). PixelSelector objects also support returning interaction frequencies directly as sparse or dense matrices. Iterators are implemented to strike a good balance between memory usage and performance, requiring few KBs of memory when fetching small regions on coarse matrices up to hundreds of MBs when fetching large queries from high resolution matrices while traversing millions of pixels per second (see benchmarks for more details).

The library supports reading files with integer and floating-point interaction frequencies, with and without matrix balancing using algorithms such as iterative correction and eigenvector decomposition (ICE) or Knight-Ruiz (KR) (Imakaev et al. 2012, Knight and Ruiz 2012). For .hic files, expected and observed/expected interaction frequencies are also supported. This level of flexibility is achieved through runtime polymorphism enabled by the visitor pattern (Palsberg and Jay 1998). Beside reading interactions, the library provides a set of operations that can be used as building blocks for more complex applications, such as functions to create .cool, .mcool, and .scool files, merge multiple files, coarsening interactions from .hic and .cool files, and more.

Unrecoverable errors are propagated through exceptions, letting the caller decide whether an application using hictk should abort in case of errors.

The API of hictkpy, hictk’s Python bindings, was heavily inspired by cooler’s Python API.

Beside performance, hictk’s C++ implementation allows writing bindings for popular scripting languages such as Python and R through e.g. nanobind github.com/wjakob/nanobind and Rcpp (Eddelbuettel 2013). We currently provide bindings for the Python and R languages through the hictkpy and hictkR packages, respectively. In Section S9 in the Supplementary Text S1 we show how to update a Python application to use hictkpy instead of cooler to read contact matrices, making the application capable of reading .hic files with few changes.

2.2 Testing

hictk is thoroughly tested with a continuous integration pipeline (CI) which automatically executes hictk unit tests (80%+ coverage) on Linux, macOS and Windows using different versions of GCC, Clang, Apple-Clang and MSVC.

CLI tools are further tested with 20 integration tests, covering the most common use-cases of each subcommand.

As an example, hictk convert is tested by two different integration tests covering conversion from .mcool to .hic and vice-versa. After performing the conversion, the tests dump interactions for the original and converted files to two different text files. The text files are then compared and an error is raised in case differences are found.

Finally, fetch functionality is automatically tested every time a pull request is merged, or weekly, with thousands of random queries to ensure that hictk and cooler return identical results for the same query.

2.3 Documentation

Documentation for hictk’s C++ API and CLI interface is available on ReadTheDocs at the following address: hictk.readthedocs.io. The documentation also contains instructions on how to install or build hictk.

Documentation for hictkpy and hictkR are hosted separately at the following addresses: hictkpy.readthedocs.io, paulsengroup.github.io/hictkR.

3 Results

3.1 Availability and implementation

hictk is a header-only C++17 library accompanied by command-line tools to transparently operate on .hic and .cool files. hictk source code is available on GitHub at github.com/paulsengroup/hictk under the MIT license. Pre-built binaries for hictk can be obtained from bioconda (Linux and macOS). Docker images are available on GHCR.io and DockerHub. hictk can be built from source on Linux, macOS and Windows using a compiler toolchain supporting C++17, CMake v3.25+ and optionally Conan v2+. The C++ API can be used by other projects by including hictk source code using CMake or by installing hictk with Conan. hictkpy can be obtained through PyPI and bioconda or installed from source, while hictkR can be installed from source.

The CLI interface of hictk is built around facilities provided by the library component and is based on the very expressive CLI interface of cooler.

The interface consists of a single executable, named hictk with several subcommands (see Supplementary Table S2 from Supplementary Text S1).

Most hictk subcommands accept both Cooler and .hic files as input, and can output .hic, .cool, and .mcool files as needed. This is thanks to the hictk C++ library, which can natively read and write both Cooler and .hic files.

3.2 Benchmarks

We characterize hictk performance through a number of comparative benchmarks with .hic v8, v9, and .mcool files with close to 7 billion pairwise interactions (see Supplementary Figs S2–S8; Supplementary Text S1 for more details). In summary, hictk is a near-drop-in, high-performance alternative to existing tools that can perform a variety of common operations on .hic and Cooler files. A thorough overview of the benchmark results, including benchmark graphs, is available in Supplementary Text S1 (Supplementary Sections S4–S11).

3.3 Current limitations

hictk supports the majority of operations enabled by cooler, straw and hic2cool, and features of the .hic and Cooler format specification. However, some aspects are not yet fully supported, including native operation on remote files which requires implementing a Virtual File Layer driver in the case of Cooler, and low level remote file operations in case of .hic files. Reading remote files can however already be achieved by mounting and accessing these on the local filesystem using e.g. FUSE. Support for reading or writing asymmetric cooler files is also lacking, albeit their usage is infrequent. Finally, hictk does not support reading or writing FRAG interactions from .hic files.

Supplementary Material

btae408_Supplementary_Data

btae408_supplementary_data.zip^{(3.4MB, zip)}

Acknowledgements

We thank the ENCODE Consortium and the lab of Erez Lieberman Aiden for contributing with ENCODE data used as part of this study. Unpublished genome assemblies and sequencing data for Schistocerca americana, Schistocerca cancellata, and Schistocerca serialis cubense are used with permission from the DNA Zoo Consortium (www.dnazoo.org). We thank the DNA Zoo and the Behavioral Plasticity Research Institute (BPRI) for providing the genome assemblies for Schistocerca americana, Schistocerca cancellata, and Schistocerca serialis cubense.

Contributor Information

Roberto Rossini, Department of Biosciences, University of Oslo, Oslo 0316, Norway.

Jonas Paulsen, Department of Biosciences, University of Oslo, Oslo 0316, Norway; Centre for Bioinformatics, Department of Informatics, University of Oslo, Oslo 0316, Norway.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the Norwegian Research Council [grant number 324137 and 343102].

Data availability

All data used for the analyses in this article are publicly available on ENCODE and the UCSC FTP server (accession IDs are listed in Supplementary File S1). The code used for benchmarking is hosted on GitHub at github.com/paulsengroup/2023-hictk-paper and is archived on Zenodo at doi.org/10.5281/zenodo.10868277. Input files used in this study are available at the Gene Expression Omnibus under identifier GSE242815 (identifiers.org/geo:GSE242815). Benchmark results and matrices used for visualization purposes throughout the paper are deposited on Zenodo at doi.org/10.5281/zenodo.10868775.

References

Abdennur N, Mirny LA. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics 2019;36:311–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Burton JN, Adey A, Patwardhan RP et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013;31:1119–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
Durand NC, Robinson JT, Shamim MS et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 2016a;3:99–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
Durand NC, Shamim MS, Machol I et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. CELS 2016b;3:95–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eddelbuettel D. Seamless R and C++ Integration with RCPP. New York: Springer. 2013. [Google Scholar]
Imakaev M, Fudenberg G, McCord RP et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods 2012;9:999–1003. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kaplan N, Dekker J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat Biotechnol 2013;31:1143–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kerpedjiev P, Abdennur N, Lekschas F et al. HiGlass: web-based visual exploration and analysis of genome interaction maps. Genome Biol 2018;19:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal 2012;33:1029–47. [Google Scholar]
Lieberman-Aiden E, Van Berkum NL, Williams L et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326:289–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
Palsberg J, Jay CB. The essence of the Visitor pattern. In: Proceedings of the Twenty-Second Annual International Computer Software and Applications Conference (Compsac ’98) (Cat. No.98CB 36241), IEEE. 1998, 9–15.
Selvaraj S, R Dixon J, Bansal V et al. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat. Biotechnol 2013;31:1111–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btae408_Supplementary_Data

btae408_supplementary_data.zip^{(3.4MB, zip)}

Data Availability Statement

[btae408-B1] Abdennur N, Mirny LA. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics 2019;36:311–6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B2] Burton JN, Adey A, Patwardhan RP et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013;31:1119–25. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B3] Durand NC, Robinson JT, Shamim MS et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 2016a;3:99–101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B4] Durand NC, Shamim MS, Machol I et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. CELS 2016b;3:95–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B5] Eddelbuettel D. Seamless R and C++ Integration with RCPP. New York: Springer. 2013. [Google Scholar]

[btae408-B6] Imakaev M, Fudenberg G, McCord RP et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods 2012;9:999–1003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B7] Kaplan N, Dekker J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat Biotechnol 2013;31:1143–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B8] Kerpedjiev P, Abdennur N, Lekschas F et al. HiGlass: web-based visual exploration and analysis of genome interaction maps. Genome Biol 2018;19:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B9] Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal 2012;33:1029–47. [Google Scholar]

[btae408-B10] Lieberman-Aiden E, Van Berkum NL, Williams L et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326:289–93. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae408-B11] Palsberg J, Jay CB. The essence of the Visitor pattern. In: Proceedings of the Twenty-Second Annual International Computer Software and Applications Conference (Compsac ’98) (Cat. No.98CB 36241), IEEE. 1998, 9–15.

[btae408-B12] Selvaraj S, R Dixon J, Bansal V et al. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat. Biotechnol 2013;31:1111–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

hictk: blazing fast toolkit to work with .hic and .cool files

Roberto Rossini

Jonas Paulsen

Roles

Abstract

Motivation

Results

Availability and implementation

1 Introduction

Figure 1.

2 Materials and methods

2.1 Overview of the hictk library

2.2 Testing

2.3 Documentation

3 Results

3.1 Availability and implementation

3.2 Benchmarks

3.3 Current limitations

Supplementary Material

Acknowledgements

Contributor Information

Supplementary data

Conflict of interest

Funding

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

hictk: blazing fast toolkit to work with .hic and .cool files

Roberto Rossini

Jonas Paulsen

Roles

Abstract

Motivation

Results

Availability and implementation

1 Introduction

Figure 1.

2 Materials and methods

2.1 Overview of the hictk library

2.2 Testing

2.3 Documentation

3 Results

3.1 Availability and implementation

3.2 Benchmarks

3.3 Current limitations

Supplementary Material

Acknowledgements

Contributor Information

Supplementary data

Conflict of interest

Funding

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases