Abstract
Summary
Minimizer digestion is an increasingly common component of bioinformatics tools, including tools for de Bruijn graph assembly and sequence classification. We describe a new open-source tool and library to facilitate efficient digestion of genomic sequences. It can produce digests based on the related ideas of minimizers, modimizers, or syncmers. Digest uses efficient data structures, scales well to many threads, and produces digests with expected spacings between digested elements.
Availability and implementation
Digest is implemented in C++17 with a Python API, and is available open source at https://github.com/VeryAmazed/digest. The Python library is available on Bioconda. Rust bindings are available as a public crate at https://crates.io/crates/digest-rs.
1 Introduction
Digestion is the process of transforming a biological sequence into a shorter sequence that is still a useful reference for read alignment, sequence classification (Ahmed et al. 2023), or de novo assembly (Ekim et al. 2021). Digestion works by selecting certain substrings to be kept according to strategies like minimizers (Schleimer et al. 2003, Roberts et al. 2004) or syncmers (Edgar 2021). The selected substrings are concatenated to form the digested sequence, which is often much shorter than the original. For example, SPUMONI 2 uses minimizer digestion to reduce the reference sequence prior to indexing, reducing overall index sizes by a factor of 2.
Digestion can be supplemented by “alphabet promotion,” where the alphabet is shifted from the 4-letter DNA alphabet to a larger alphabet in which each distinct minimizer is a symbol, further shortening the sequence and speeding up matching algorithms; this has been explored previously (Ekim et al. 2021, Ahmed et al. 2023). The result of digestion is often called a “sketch”; however, we propose the term “digest” to differentiate methods that shrink a sequence into a smaller (but still linear-size) representation from those that build a sublinear-size data structure.
We present a new C++ software library called digest that performs digestion with (i) improved efficiency compared to previously described data structures, (ii) efficient scaling to many threads, (iii) three different substring-selection strategies: minimizers, syncmers, or modimizers, and (iv) an API allowing for various downstream uses, including Python bindings. In its command-line tool form, it can efficiently convert FASTA genomic sequences into digested sequences also in FASTA format.
We show that a combination of different data structures allows digest to work efficiently across a range of sequence lengths and window sizes. We show that naïve approaches, as well as approaches based on the segment tree, are superior to approaches proposed in the past for common ranges of window sizes. Finally, we describe the API exposed by the digest tool and library, and how it operates in parallel in its multithreaded mode.
2 Materials and methods
Digest is a C++ software library that exposes an Application Programming Interface (API) for DNA sequence digestion. The following subsections detail its key data structures, how it was optimized, its interface, and how it maps to typical use cases.
Digest builds on the ntHash library (Mohamadi et al. 2016, Kazemi et al. 2022) for efficient hashing of DNA sequences. Besides the features described in the following subsections, we describe additional implementation details in Note 1, available as supplementary data at Bioinformatics online.
2.1 Digestion schemes
Digest supports three strategies. The first uses “modimizers.” In this scheme, a length-k substring is included in the digest if and only if its hash value is equivalent to 0 mod n, where k and n are parameters. The second is based on “minimizers.” Here, a length-k substring is included in the digest if and only if its hash value is minimal in at least one of the length-(w + k − 1) windows containing the k-mer, where w is the window size in k-mers. The third uses “syncmers” (specifically closed syncmers). In this scheme, a length-k substring is included in the digest if and only if its leftmost or rightmost length-t substring (where t < k) has the minimal hash value among all the length-t substrings of the k-mer.
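To make the three selection rules concrete, the following sketch (not digest’s implementation; it operates on precomputed lists of hash values and applies the rightmost tie-breaking rule) selects indices under each scheme:

```python
def modimizers(hashes, n):
    """Indices of k-mers whose hash is 0 mod n."""
    return [i for i, h in enumerate(hashes) if h % n == 0]

def minimizers(hashes, w):
    """Indices of k-mers that are minimal in at least one window of
    w consecutive k-mers, taking the rightmost hash on ties."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        # rightmost occurrence of the minimum in this window
        selected.add(start + w - 1 - window[::-1].index(m))
    return sorted(selected)

def closed_syncmers(hashes_t, k, t):
    """Indices of k-mers whose leftmost or rightmost t-mer achieves the
    minimal t-mer hash within the k-mer; hashes_t holds t-mer hashes."""
    out = []
    n_t = k - t + 1  # number of t-mers inside one k-mer
    for i in range(len(hashes_t) - n_t + 1):
        window = hashes_t[i:i + n_t]
        m = min(window)
        if window[0] == m or window[-1] == m:
            out.append(i)
    return out
```

The quadratic minimizer loop here is for exposition only; the data structures in the next section compute the same windowed minima efficiently.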
In the event of multiple hash values that are both equal and minimal, we choose the rightmost by default.
2.2 Data structures
Modimizers are easy to compute, but the other two schemes require an auxiliary data structure to track hash values and compute minima.
We implemented and benchmarked various structures supporting an insert operation, given an index and hash, as well as a min query, which returns the index of the minimum hash in the current window. An index cannot be assumed to increase by one with each insert, as skips are possible due to unknown or ambiguous bases. In the following algorithm bounds, n refers to the length of the input and w refers to the size of the window in k-mers.
The Naïve method uses a deque (double-ended queue), stored as a circular array in memory. insert adds an element at the head of the queue, simultaneously evicting an element from the tail. min performs a linear scan of the queue to find the minimum element. The worst-case time is O(nw).
The Naïve-memo method additionally memoizes (stores) the index of the minimum hash from the previous iteration. On an insert query, the new hash, if smaller than the stored minimum, replaces the memoized variable. If the stored minimum leaves the window, a linear scan is used to search for the new minimum. The min query retrieves the memoized variable via a constant-time lookup (Algorithm 2, available as supplementary data at Bioinformatics online). Its worst-case time is O(nw), but its amortized cost is constant time, as we argue in Note 2, available as supplementary data at Bioinformatics online.
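The memoization idea can be sketched as follows (an illustrative Python rendering, not the library’s C++ implementation; ties keep the rightmost hash as described above):

```python
from collections import deque

class NaiveMemo:
    """Sliding-window minimum that memoizes the current minimum and
    rescans the window only when the memoized entry is evicted."""

    def __init__(self, w):
        self.w = w
        self.window = deque()   # (index, hash) pairs, oldest first
        self.best = None        # memoized (index, hash) of the minimum

    def _rescan(self):
        # linear scan; <= keeps the rightmost entry on hash ties
        best = self.window[0]
        for item in self.window:
            if item[1] <= best[1]:
                best = item
        self.best = best

    def insert(self, index, hash_):
        self.window.append((index, hash_))
        if len(self.window) > self.w:
            evicted = self.window.popleft()
            if evicted == self.best:
                # the stored minimum left the window: rescan
                self._rescan()
                return
        if self.best is None or hash_ <= self.best[1]:
            self.best = (index, hash_)

    def min(self):
        return self.best[0]     # constant-time lookup
```

Each hash can trigger at most one full rescan over its lifetime in the window, which is the intuition behind the amortized constant cost argued in Note 2.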
The Set method uses an ordered set, typically implemented as a red-black tree. We supplement the set with a deque of pointers to its elements. Inserting, removing, and finding the minimum element all take time logarithmic in the size of the window, for an overall worst-case time of O(n log w).
The monotone method uses a monotonic queue, which has been recommended for finding minimizers due to its linear runtime (Zheng et al. 2023); it does indeed run in O(n) time. The queue maintains the invariant that its hashes are kept in increasing order. Old hashes are popped from the front as they leave the window. On an insertion, to maintain the invariant, all hashes greater than the inserted hash are deleted from the back; these deleted hashes are of no further use, since a smaller hash has entered the window. The minimum can be found by querying the front of the queue in O(1) time, while an insertion can take O(w) time but amortizes to O(1).
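A compact Python sketch of the monotonic queue (again illustrative, not the benchmarked C++ code; index-based eviction also handles skipped indices from ambiguous bases):

```python
from collections import deque

class MonotoneQueue:
    """Sliding-window minimum via a monotonic queue: the deque holds
    (index, hash) pairs with strictly increasing hashes."""

    def __init__(self, w):
        self.w = w
        self.q = deque()

    def insert(self, index, hash_):
        # pop hashes from the front once they leave the window
        while self.q and self.q[0][0] <= index - self.w:
            self.q.popleft()
        # pop dominated hashes from the back; >= keeps the rightmost tie
        while self.q and self.q[-1][1] >= hash_:
            self.q.pop()
        self.q.append((index, hash_))

    def min(self):
        return self.q[0][0]     # the front holds the window minimum
```

The two inner while-loops are the conditional checks that, despite the linear amortized bound, make this method slower than naïve in practice (Section 3.1).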
The segment tree method uses a binary tree and maintains the invariant that a node holds the minimum of its two children (Bentley 1977). All leaves are situated at the same level, and the number of leaves is a power of two. If the specified window size does not give a power-of-two number of leaves, new leaves annotated with the maximum possible value are added as padding. The tree is represented implicitly in an array. The minimum can be queried in O(1) time by simply querying the root of the tree. Updates can traverse up the entire tree, taking O(log w) time (Algorithm 3, available as supplementary data at Bioinformatics online).
2.3 Multithreaded operation
We parallelize the digestion process by breaking the input sequence into partitions. We must allow for some overlap between the partitions for schemes that consider both large and small windows. A further complication is that the amount of overlap interacts with the treatment of non-ACGT characters, since the presence of ambiguous characters can effectively cause the large window to grow, so as to include the target number of non-ambiguous k-mers. This is discussed further in Note 3, available as supplementary data at Bioinformatics online.
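Under the simplifying assumption of ACGT-only input (so that no window needs to grow past ambiguous characters), the partitioning can be sketched as follows; the overlap of w + k − 2 characters ensures every length-(w + k − 1) window lies entirely within some partition:

```python
def partitions(seq, threads, k, w):
    """Split seq into up to `threads` equal pieces, each extended by
    w + k - 2 characters of overlap with the next piece.
    Assumes no ambiguous bases; handling those requires growing the
    overlap, as discussed in Note 3."""
    overlap = w + k - 2
    step = -(-len(seq) // threads)       # ceiling division
    out = []
    for start in range(0, len(seq), step):
        out.append((start, seq[start:start + step + overlap]))
    return out
```

Each thread digests its piece independently, and each reports only the selected positions whose windows start inside its own [start, start + step) range, so no element is reported twice.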
2.4 Application programming interface
The digest software supports two APIs, one for C++17 and one for Python. The C++17 API is designed to be easy to use, with no dependencies on other libraries besides ntHash. Input sequence data is represented with an STL string, and results are appended to an STL vector. Users first instantiate the proper digester object from a hierarchy that includes an abstract parent class called Digester, a templated class called WindowMin for the windowed schemes, and a concrete class called ModMin implementing modimizers. For ease of use, we also provide a non-templated, concrete class called Adaptive that will attempt to select the optimal concrete class (and, therefore, data structure) for a given window-size scenario. Adaptive64 is implemented similarly, with support for 64-bit hashes.
Digest also has a Python API that applies the desired digestion scheme to a Python string, returning the result in a Python list. Note that this API wraps the Adaptive class, and so it allows the C++ library to choose the appropriate data structures according to the window-size scenario.
Example code using both APIs is included in Note 4, available as supplementary data at Bioinformatics online.
3 Results
3.1 Data structures
The computational bottleneck in computing minimizers is the data structure used to facilitate both (i) the selection of the minimal value in a window (min) and (ii) the updating of the window (insert). To understand the relative merits of the data structures, we implemented them and conducted a benchmarking study wherein we applied each to an array of 10 million uniformly distributed hash values, executing the insert operation for every hash value and the min operation for every window. As seen in Fig. 1A, the best-performing algorithms are naïve, segment tree, and naïve-memo. The set method’s runtime, which grows logarithmically with window size, exceeded the bounds of the chart and was omitted. Monotone, although a theoretically linear-time data structure, also performed poorly, hampered by the many conditional checks executed during the insert loop. Adaptive traces the shape of the best algorithms by selecting the optimal data structure based on the desired window size. For small window sizes (6–14), naïve outperforms all others due to a quick loop that is unrolled by the compiler and optimized with conditional-move assembly instructions. For larger window sizes (≥16), naïve-memo performs best by avoiding unnecessary window scanning while maintaining the simple loop structure of naïve. In the final implementation of digest, we omitted the monotone and set methods, and suggest adaptive as the default back-end for general use cases.
Figure 1.
(A) Comparison of min query speed for different data structures as a function of window size. In this benchmark, each data structure performs 10 million queries on an array of uniformly distributed 32-bit hash values. (B) Throughput of the different digestion schemes in Digest (using a segment tree data structure) when computing the digest of a 62-Mbp human chromosome Y sequence consisting of only A/C/G/T characters. Benchmarking for both (A) and (B) was performed on a 48-core 3 GHz Intel Xeon Gold Cascade Lake 6248R CPU with 192 GB RAM.
3.2 Thread scaling
To test multithreading scalability, we benchmarked the digest library when digesting the human chromosome Y sequence from the T2T-HG002 assembly (Nurk et al. 2022) with an increasing number of threads.
Figure 1B shows the scalability of each digestion strategy. All three strategies scale linearly with an increasing number of threads until we approach the number of cores on our machine, which was 48. Around 48 threads, we observed a decline in throughput for modimizers, and a less pronounced slowdown for syncmers and minimizers. The throughput observed as we continue to increase the number of threads beyond 48 can be attributed to better load balancing on this machine.
As expected, the modimizer scheme has the highest throughput given its simpler strategy, which does not require an auxiliary data structure for range-minimum queries. Overall, the digest library shows consistent throughput improvements for each digestion scheme as we add more threads, reaching speeds ranging from 1.7 Gbps to 2.8 Gbps.
4 Discussion
We present a new, efficient C++/Python software library called digest implementing modimizers, minimizers, and syncmers. We benchmarked the tool comprehensively for multithread scalability and across different back-end implementations.
We identified various avenues for future development. digest’s thread-scaling capability is currently limited when dealing with non-ACGT characters, and additional policies for handling a general input alphabet are warranted for broader applicability beyond biological sequence data.
Additionally, our multithreading strategy divides the input into a number of equal-sized, overlapping partitions, where the number of partitions equals the number of simultaneous threads. In the future, we plan to implement a strategy that uses a work queue so that the number of threads can be specified independently of the size of each individual partition, allowing for better load balancing at smaller numbers of threads.
Since one of the major uses of minimizer digestion is in settings where the digested sequence should undergo “alphabet promotion,” a goal for future versions of the digest library will be to support this as an immediate output of the digestion process. For instance, users could digest a long biological sequence into a shorter “promoted” sequence of, say, 8-bit minimizer symbols, in a single API call.
Lastly, minimizer research is evolving rapidly, with great interest in deriving schemes with ever-lower expected densities (Kille et al. 2025, Koerkamp and Pibiri 2024, Groot Koerkamp et al. 2025). The closer the density is to the theoretical lower bound, the smaller the digest will be, which typically translates into reduced running time and memory costs downstream. It will be important to add new schemes to the digest library in the future to provide users with options, since different schemes can perform optimally in different use cases.
Acknowledgements
We thank Ragnar Groot Koerkamp for his comments on the manuscript.
Contributor Information
Alan Zheng, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States.
Ishmeal Lee, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States.
Vikram S Shivakumar, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States.
Omar Y Ahmed, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States.
Ben Langmead, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States.
Author contributions
Alan Zheng (Software [equal], Validation [equal]), Ishmeal Lee (Software [equal], Validation [equal]), Vikram S. Shivakumar (Software [equal], Supervision [equal], Writing—original draft [equal], Writing—review & editing [equal]), Omar Y. Ahmed (Software [equal], Supervision [equal], Validation [equal], Writing—original draft [equal], Writing—review & editing [equal]), and Ben Langmead (Conceptualization [equal], Funding acquisition [equal], Project administration [equal], Software [equal], Supervision [equal])
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: None declared.
Funding
V.S.S., O.Y.A., and B.L. were supported by the NIH [R35GM139602 to B.L.].
Data availability
Digest is available open source at https://github.com/VeryAmazed/digest. The benchmarking was performed on version 0.3.0 (Zheng et al. 2025). Rust bindings are available at https://crates.io/crates/digest-rs and Python bindings are available on Bioconda.
References
- Ahmed OY, Rossi M, Gagie T et al. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol 2023;24:122.
- Bentley JL. Solutions to Klee’s rectangle problems. Unpublished manuscript. 1977:282–300.
- Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021;9:e10805.
- Ekim B, Berger B, Chikhi R. Minimizer-space de Bruijn graphs: whole-genome assembly of long reads in minutes on a personal computer. Cell Syst 2021;12:958–68.e6.
- Groot Koerkamp R, Liu D, Pibiri GE. The open-closed mod-minimizer algorithm. Algorithms Mol Biol 2025;20:4.
- Kazemi P, Wong J, Nikolić V et al. ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics 2022;38:4812–3.
- Kille B, Groot Koerkamp R, McAdams D et al. A near-tight lower bound on the density of forward sampling schemes. Bioinformatics 2025;41.
- Koerkamp RG, Pibiri GE. The mod-minimizer: a simple and efficient sampling algorithm for long k-mers. bioRxiv, 2024, preprint: not peer reviewed. 10.1101/2024.05.25.595898
- Mohamadi H, Chu J, Vandervalk BP et al. ntHash: recursive nucleotide hashing. Bioinformatics 2016;32:3492–4.
- Nurk S, Koren S, Rhie A et al. The complete sequence of a human genome. Science 2022;376:44–53.
- Roberts M, Hayes W, Hunt BR et al. Reducing storage requirements for biological sequence comparison. Bioinformatics 2004;20:3363–9.
- Schleimer S, Wilkerson DS, Aiken A. Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2003, 76–85.
- Zheng A, Lee I, Shivakumar V et al. digest v0.3.0, 2025. 10.5281/zenodo.15538544
- Zheng H, Marçais G, Kingsford C. Creating and using minimizer sketches in computational genomics. J Comput Biol 2023;30:1251–76.

