Abstract
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of $k$-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a $k$-mer set has emerged as a shared underlying component. A set of $k$-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a $k$-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
Additional Key Words and Phrases: k-mer sets, de Bruijn graphs, navigational data structures, Bloom filters, unitigs, FM-index, k-mers, biological sequencing data, data structures
1. INTRODUCTION
String algorithms have found some of their biggest applications in modern analysis of sequencing data. Sequencing is a type of technology that takes a biological sample of DNA or RNA and extracts many reads from it. Each read is a short substring (e.g., anywhere between 50 characters and several thousands of characters, or more) of the original sample, subject to errors. Analysis of sequencing data relies on string matching with these reads, and many popular methods are based on first identifying short, fixed-length substrings of the reads. These are called $k$-mers, where $k$ refers to the length of the substring; equivalently, some papers use the term $k$-gram instead of $k$-mer. Such $k$-mer-based methods have become more popular in the past 10 years due to their inherent scalability and simplicity. They have been applied across a wide spectrum of biological domains, e.g., genome and transcriptome assembly, transcript expression quantification, metagenomic classification, structural variation detection, and genotyping. While the algorithms working with $k$-mers are rather diverse, storing and querying a set of $k$-mers has emerged as a shared underlying component. Because of the massive size of these sets, minimizing their storage requirements and query times is becoming its own area of research.
In this survey, we describe published data structures for indexing a set of $k$-mers such that set membership can be checked either directly or by attempting to extend elements already in the set (called navigational queries, to be defined in Section 2). We evaluate the data structures based on their theoretical time for membership and navigational queries, space and time for construction, and time for insertion or deletion. We also describe known lower bounds on the space usage of such data structures and various extensions that go beyond membership and navigational queries. We do not describe the various applications of $k$-mer sets to biological problems, i.e., strategies for constructing the $k$-mer set from the biological data (e.g., sampling, error detection) or algorithms that use $k$-mer set data structures to solve some problem (e.g., assembly, genotyping). An example of an application we do not specifically discuss is the use of a $k$-mer set as an index, e.g., when a $k$-mer is used to retrieve a position in a reference genome.
Since these data structures are often developed in an applied context and published outside the theoretical computer science community, they do not consistently contain thorough mathematical analysis or even problem statements. There is the additional problem of inconsistent definitions and terminology. In this survey, we attempt to unify them under a common set of query operations, categorize them, and draw connections between them. We present a combination of: (1) non-specialized data structures (i.e., hash tables) that have been applied to $k$-mer sets as is, (2) non-specialized data structures that have been adapted for their use on $k$-mer sets, and (3) data structures that have been developed specifically for $k$-mer sets. We give a high-level overview of all categories, but we give a more detailed description for the third category. The survey can be read with only an undergraduate-level understanding of computer science, though knowledge of the FM-index would lead to a deeper understanding in some places.
Let $S$ denote a set of $n$ $k$-mers. Representing a set is a well-studied problem in computer science. However, the fact that the set consists of strings, and that the strings are fixed-length, lends structure that can be exploited for efficiency. There are other factors as well. First, in most applications, the alphabet has constant size, denoted by $\sigma$. Second, most applications revolve around sets where $n = o(\sigma^k)$; in this survey, we refer to these as sparse sets. Third, $n$ is typically much larger than $k$, e.g., $k$ is usually between 20 and 200, while $n$ can be in the billions.
Another unique aspect of a $k$-mer set is what we call the spectrum-like-property. A $k$-mer set $S$ has the spectrum-like-property if there exists a collection of long strings that “generates” $S$. By “generates,” we mean that $S$ contains a significant portion of the $k$-mers of the collection, and, conversely, many of the $k$-mers of $S$ are either exact or “noisy” substrings of the collection. The generating collection is usually unknown. For example, sequencing a metagenome sample (the generating collection would be the set of abundant genomes in this case) generates a set of reads, which cover most of the abundant genomes in the sample. A computational tool would then chop the reads up into their constituent $k$-mers and store these in the set $S$. Some other examples of a generating collection are a single genome (e.g., whole genome sequencing), a collection of transcripts (RNA-seq or Iso-Seq), or enriched genomic regions (e.g., ChIP-seq). We introduce this property to informally capture an important aspect of $S$ in many applications that arise from sequencing. Our definition is necessarily imprecise to capture the huge diversity in how sequencing technologies are applied and how sequencing data is used. However, as we will show, this property is exploited by methods for representing a $k$-mer set and also drives the types of queries that are performed on them.
2. OPERATIONS
In this section, we describe a common set of operations that unifies many of the data structures for representing a set of $k$-mers. First, let us assume that the size of the alphabet is constant, all logs are base 2, strings are 1-indexed, and the $k$-mer set $S$ is sparse. The most basic operations that a data structure representing $S$ supports are its construction and checking whether a $k$-mer is in $S$ (memb, which returns a Boolean value). If the data structure is dynamic, it also supports inserting a $k$-mer into $S$ (insert) or deleting a $k$-mer from $S$ (delete). A data structure where insertion and deletion are either not possible or would require as much time as re-construction is called static.
Recall that in the context of the spectrum-like-property, there is an underlying set of strings that is generating the $k$-mers of $S$. This implies that many $k$-mers in $S$ will have dovetail overlaps with each other (i.e., the suffix of one $k$-mer equals the prefix of another), often by $k-1$ characters. Algorithms that use $S$ to reconstruct the generating strings often work by starting from a $k$-mer and extending it one character at a time to obtain those strings. This motivates having efficient support for operations that check if an extension of a $k$-mer exists in $S$. A forward extension of $x$ is any $k$-mer $y$ such that $y[1,k-1] = x[2,k]$, and a backward extension is any $k$-mer $y$ such that $y[2,k] = x[1,k-1]$ (we use the notation $x[i,j]$ to refer to the substring of $x$ starting from the $i$th character up to and including the $j$th character). Formally, given a $k$-mer $x$ and a character $a$, the operation fwd$(x,a)$ returns true if $x[2,k] \cdot a$ is in $S$ (we use $\cdot$ to signify string concatenation). Similarly, the operation bwd$(x,a)$ checks whether $a \cdot x[1,k-1]$ is in $S$. We refer to fwd and bwd operations as navigation operations.
We assume that a data structure maintains some kind of internal state corresponding to the last queried $k$-mer, i.e., a memb$(x)$ query would leave the data structure in a state corresponding to $x$, a fwd$(x,a)$ query would leave the state corresponding to $x[2,k] \cdot a$, and so on. For example, for a hash table, the internal state after a memb$(x)$ query would correspond to the hash value of $x$ and to the memory location of $x$’s slot; in the case of an FM-index or a similar data structure, the internal state corresponds to an interval representing $x$.
We also assume that prior to a call to fwd$(x,a)$ or bwd$(x,a)$, the data structure is in a state corresponding to $x$. In this way, fwd$(x,a)$ and bwd$(x,a)$ are different from memb$(x[2,k] \cdot a)$ and memb$(a \cdot x[1,k-1])$, respectively. For example, it would be invalid to execute fwd$(x,a)$ after executing memb$(y)$ for some $y \neq x$, because the memb operation would leave the data structure in a state corresponding to $y$ and executing fwd$(x,a)$ requires it to be in a state corresponding to $x$. For data structures that do not support fwd or bwd explicitly or do not maintain an internal state, there is always the default implementation using the corresponding membership query.
In the following, we will first summarize some basic data structures for the above problem (Section 3). In Section 4, we will make the connection to de Bruijn graphs and present data structures that aim for fast fwd and bwd queries. In Section 5, we present a special type of data structure where memb queries are very expensive or impossible, but navigational queries are cheap. We summarize the query, construction, and modification time and space complexities of the key data structures in Tables 1 and 2. In the Appendix, we show how these complexities are derived for the cases when it is not explicit in the original papers. We then continue to other aspects. In Section 6, we describe the known space lower bounds for storing a set of $k$-mers. Finally, in Section 7, we describe several variations on and extensions of the data structures presented in Sections 3–5.
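The default implementation mentioned above can be sketched as follows. This is a minimal illustration, assuming a plain Python `set` as the underlying representation; the class name `NaiveKmerSet` and the choice to leave the state unchanged after a failed query are assumptions made for this sketch and are not prescribed by the text.

```python
class NaiveKmerSet:
    """Baseline k-mer set where fwd/bwd fall back to membership queries."""

    def __init__(self, kmers):
        self.kmers = set(kmers)
        self.state = None  # last k-mer found by a query (the "internal state")

    def memb(self, x):
        """Return True iff x is in the set; on success the state becomes x."""
        found = x in self.kmers
        if found:
            self.state = x
        return found

    def _require_state(self, x):
        # fwd/bwd are only valid if the structure is in the state of x.
        if self.state != x:
            raise ValueError("data structure is not in the state of x")

    def fwd(self, x, a):
        """Check whether x[2,k].a is in the set (default: a membership query)."""
        self._require_state(x)
        return self.memb(x[1:] + a)

    def bwd(self, x, a):
        """Check whether a.x[1,k-1] is in the set (default: a membership query)."""
        self._require_state(x)
        return self.memb(a + x[:-1])
```

For instance, after `s.memb("ACG")` succeeds, `s.fwd("ACG", "T")` tests `"CGT"` and, if it is present, moves the state there, so the next navigation call must be issued for `"CGT"`.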
Table 1.
Query Complexities
| data structure | |||
|---|---|---|---|
| sorted list | a | a | |
| hash table adj. list | b | b | |
| Conway and Bromage | a | a | |
| Bloom filter1 | b | b | |
| Bloom filter trie | a | a | |
| BOSS (static) | 1 | 1 | |
| BOSS (dynamic) | |||
| unitig-based2 | c | c | |
| Belazzougui et al3 | a | a |
Big O notation is implied for all the complexities, but the symbol is omitted from the table for clarity.
a) There is no specialized navigational query, so the time is the same as for memb.
b) Constant time occurs if a rolling hash function is used; otherwise, there is no specialized navigational query.
c) For DBGFM and deGSM, constant time holds if the extension lies on the same unitig; for BLight, it holds if the extension lies on the same super-$k$-mer; for pufferfish, it holds if a rolling MPHF is used.
1) The Bloom filter is non-exact and may return false positives.
2) This includes DBGFM [Chikhi et al. 2014], deGSM [Guo et al. 2019], pufferfish [Almodaresi et al. 2018], and BLight [Marchet et al. 2019b].
3) This includes both the static and dynamic versions presented in Belazzougui et al. [2016a]. However, the dynamic version may, with low probability, give incorrect query answers.
Table 2.
Construction and Modification Time and Space Complexities
| data structure | construction | modification | ||
|---|---|---|---|---|
| time | space | |||
| sorted list | - | - | ||
| hash table adj. list | ||||
| Conway and Bromage | - | - | ||
| Bloom filter | - | |||
| Bloom filter trie | - | |||
| BOSS (static) | a | - | - | |
| BOSS (dynamic) | a | - | ||
| unitig-based | - | - | ||
| Belazzougui et al (static) | b | - | - | |
| Belazzougui et al (dynamic) | b | |||
Construction space refers to the size of the constructed data structure, rather than to the memory used by the construction algorithm.
This assumes that either the number of sources and sinks is negligible [Boucher et al. 2015; Bowe et al. 2012], or the membership queries are not always exact [Li et al. 2016]; otherwise, in the worst case, the space needed is .
is the number of connected components in the underlying undirected dBG.
We note that the definition of $S$ as a set implies that there is no count information associated with a $k$-mer in $S$. However, some of the data structures we will present also support maintaining count information with each $k$-mer. Rather than present how this is done together with each data structure that supports it, we have a separate section (Section 7.4) dedicated to how the presented data structures can be adapted to store count information.
3. BASIC APPROACHES
Perhaps the most basic static representation that is used in practice is a lexicographically sorted list of $k$-mers. For a set $S$ of $n$ $k$-mers, the construction time is $O(nk)$ using any linear time string sort algorithm and the space needed to store the list is $O(nk)$. A membership query is executed as a binary search in time $O(k \log n)$. This representation is both space- and time-inefficient, as it is dominated by other approaches we will discuss (e.g., unitig-based approaches or BOSS). But it can be used by someone with a very limited computer science background, making it still relevant.
Sorted lists can be partitioned to speed up queries. In this approach, taken by Wood and Salzberg [2014], the $k$-mers are partitioned according to a minimizer function. For a given $m$, an $m$-minimizer of a $k$-mer $x$ is the smallest (according to some given permutation function) $m$-mer substring of $x$ [Roberts et al. 2004; Schleimer et al. 2003]. A minimizer function is a function that maps a $k$-mer to its minimizer, or, equivalently, to an integer in a range of size $\sigma^m$. In the partitioned sorted list approach, the $k$-mers within each partition are stored in a separate sorted list, and a separate direct-access table maps each partition to the location of the stored list. For this table to fit into memory, $m$ should be small. This approach can work well to speed up queries when there are not many $k$-mers in each partition. However, the space used is still $O(nk)$, which is inefficient compared to more recent methods we will present.
Two traditional types of data structures to represent sets of elements are binary search trees and hash tables. A binary search tree and its variants require $O(k \log n)$ time for membership queries and are in most aspects worse than a string trie [Mäkinen et al. 2015]. To the best of our knowledge, binary search trees have not been used for directly indexing $k$-mers. In a hash table, the amortized time for a membership query, insertion, deletion, and fwd/bwd is equivalent to the time for hashing a $k$-mer [Cormen et al. 2009]. Hashing a $k$-mer generally requires $O(k)$ time, but one can also use rolling hash functions. In a rolling hash function [Lemire and Kaser 2010], if we know the hash value for a $k$-mer $x$, we can compute the hash value of any forward or backward extension of $x$ in $O(1)$ time. Using a rolling hash function can therefore improve the fwd/bwd query time to $O(1)$. These fast query and modification times and the availability of efficient and easy-to-use hash table libraries in most popular programming languages make hash tables popular in some applications. However, a hash table requires $O(nk)$ space, which is prohibitive for large applications due to the $k$ factor.
Conway and Bromage [2011] were among the first to consider more compact representations of a $k$-mer set. $S$ can be thought of as a binary bitvector of length $\sigma^k$, where each $k$-mer corresponds to a position in the bitvector and the value of the bit reflects whether the $k$-mer is present in $S$. Since $S$ is sparse, storing the bitvector explicitly wastes a lot of space. The field of compact data structures [Navarro 2016] is concerned exactly with how to store such bitvectors space-efficiently. In this case, a sparse bitmap representation [Okanohara and Sadakane 2007] based on Elias-Fano coding [Elias 1974] can be used to store the bitvector; then, the memb operation becomes a pair of rank operations (i.e., finding the number of ones in a prefix of a bitvector) on the compressed bitvector. However, if $S$ is exponentially sparse (i.e., there exists $\epsilon > 0$ such that $n \le \sigma^{(1-\epsilon)k}$), then the space needed is still $\Theta(nk)$ bits, because Elias-Fano coding spends about $\log(\sigma^k / n)$ bits per element.
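A minimal sketch of the sorted-list representation follows. The function names are hypothetical, and Python's built-in comparison sort stands in for the linear-time string sort mentioned above, so the construction here does not achieve the stated linear bound.

```python
from bisect import bisect_left


def build_sorted_list(kmers):
    """Deduplicate and lexicographically sort the k-mers.

    Note: sorted() is a comparison sort, used here only as a stand-in
    for a linear-time string sort.
    """
    return sorted(set(kmers))


def memb(sorted_kmers, x):
    """Binary search: O(log n) string comparisons, each costing up to O(k)."""
    i = bisect_left(sorted_kmers, x)
    return i < len(sorted_kmers) and sorted_kmers[i] == x
```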
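The partitioned approach can be sketched as below, assuming plain lexicographic order as the permutation function and a Python dictionary in place of the direct-access table; both are simplifications chosen for this illustration, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def minimizer(x, m):
    """Smallest m-mer substring of x (lexicographic order is assumed here)."""
    return min(x[i:i + m] for i in range(len(x) - m + 1))


def build_partitions(kmers, m):
    """Group k-mers by minimizer and keep each group as a sorted list."""
    buckets = defaultdict(list)
    for x in set(kmers):
        buckets[minimizer(x, m)].append(x)
    return {mz: sorted(bucket) for mz, bucket in buckets.items()}


def memb(partitions, x, m):
    """Binary search restricted to the partition of x's minimizer."""
    bucket = partitions.get(minimizer(x, m), [])
    i = bisect_left(bucket, x)
    return i < len(bucket) and bucket[i] == x
```

The binary search then runs over a single partition rather than the whole set, which is where the speed-up comes from when partitions are small.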
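To illustrate how a rolling hash supports constant-time extensions, here is a sketch using a simple polynomial hash over the DNA alphabet. The encoding, base, and modulus are assumptions made for this example; it is not the specific hash family analyzed by Lemire and Kaser [2010].

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed DNA alphabet
BASE = 4
MOD = (1 << 61) - 1  # a Mersenne prime, so BASE has a modular inverse


def full_hash(x):
    """Polynomial hash of a k-mer computed from scratch in O(k)."""
    h = 0
    for c in x:
        h = (h * BASE + CODE[c]) % MOD
    return h


def roll_forward(h, x, a, top):
    """Hash of x[2,k].a from the hash h of x; top = BASE**(k-1) % MOD."""
    return ((h - CODE[x[0]] * top) * BASE + CODE[a]) % MOD


def roll_backward(h, x, a, top, inv_base):
    """Hash of a.x[1,k-1] from the hash h of x; inv_base = BASE**-1 mod MOD."""
    return ((h - CODE[x[-1]]) * inv_base + CODE[a] * top) % MOD
```

Here `top` and `inv_base` would be precomputed once, e.g., as `pow(BASE, k - 1, MOD)` and `pow(BASE, -1, MOD)` (the latter requires Python 3.8 or newer), so each extension costs a constant number of arithmetic operations.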
3.1. Approximate Membership Query Data Structures
An approximate membership query data structure is a type of probabilistic data structure that represents a set in a space-efficient manner in exchange for allowing membership queries to occasionally return false positives (no false negatives are allowed, though). A false positive occurs when $x \notin S$ but memb$(x)$ returns true. These data structures are applicable whenever space savings outweigh the drawback of allowing some false positives or when the effect of false positives can be mitigated using other methods. Note that approximate membership queries are not related to the type of queries that ask whether $S$ contains a $k$-mer with some bounded number of mismatches (e.g., one substitution) to the query $k$-mer.
Bloom filters [Bloom 1970] (abbreviated BF) are a classical example of an approximate membership data structure that has found widespread use in representing a $k$-mer set (see Broder and Mitzenmacher [2003] for a definition and analysis of Bloom filters). Some of the earliest applications were by Shi et al. [2010] and Stranneheim et al. [2010]. BFs applied to $k$-mers support memb, fwd, bwd, and insert operations in the time it takes to hash a $k$-mer (usually $O(k)$, except for rolling hash functions) and take $O(n)$ space. A BF does not support delete, though there are variants of BFs that make tradeoffs to support it in $O(k)$ time (e.g., counting BFs [Fan et al. 2000] and spectral BFs [Cohen and Matias 2003]). Further time-space tradeoffs can be achieved by compressing a BF using RRR [Raman et al. 2007] encoding [Mitzenmacher 2002]. See Tarkoma et al. [2011] for a survey of BF variations and the tradeoffs they offer.
Pellow et al. [2017] developed several modifications of a Bloom filter, specifically for a $k$-mer set. They take advantage of the spectrum-like property to either reduce the false positive rate or decrease the space usage. The general idea is that when $S$ has the spectrum-like property, most of its $k$-mers will have some backward and forward extension present in $S$. The (hopefully small number of) $k$-mers for which this is not true are maintained in a separate hash table. For the rest, to determine whether a $k$-mer $x$ is in $S$, they make sure the BF contains not only $x$ but also at least one forward and one backward extension. Using similar ideas, they give other versions of a BF for when $S$ is the spectrum of a read set or of one string. In another paper, Chu et al. [2018] developed what they called a multi-index BF, which similarly takes advantage of the spectrum-like property (details omitted).
Bloom filters are popular, because they reduce the space usage to $O(n)$ while maintaining $O(k)$ membership query time. BFs and their variants are also valuable for their simplicity and flexibility. However, operations on Bloom filters generally require access to distant parts of the data structure and therefore do not scale well when they do not fit into RAM. Here, we highlight some more advanced approximate membership data structures that offer better performance and have been applied to $k$-mer sets. There is the quotient filter [Bender et al. 2012] and the counting quotient filter [Pandey et al. 2017a], which have been applied to storing a $k$-mer set in Pandey et al. [2017b] and Pandey et al. [2018]. There is also the quasi-dictionary [Marchet et al. 2018] and $l$-Othello [Liu et al. 2018], both generally applicable to any set of elements but applied to a $k$-mer set by the authors. Cuckoo filters [Fan et al. 2014] are another approximate membership data structure that has been applied to $k$-mers [Zentgraf et al. 2020].
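A basic Bloom filter for $k$-mers can be sketched as follows. The class layout and the use of salted BLAKE2b digests as the hash functions are assumptions made for this illustration (the salts simply give independent-looking hashes); practical tools typically use faster, often rolling, hash functions.

```python
import hashlib


class BloomFilter:
    """Bit array with several hash functions; supports insert and memb only."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, x):
        # One bit position per hash function, derived from a salted digest.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                x.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p >> 3] |= 1 << (p & 7)

    def memb(self, x):
        """True if all bits of x are set: may be a false positive,
        but never a false negative."""
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(x))
```

Because bits are shared among elements, clearing the bits of one $k$-mer could erase others, which is why this basic variant has no delete operation.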
3.2. String-based Indices
There is a rich literature of string-based indices [Mäkinen et al. 2015], some of which can be modified to store and query a $k$-mer set. One of the most popular string-based indices to be applied to bioinformatics is the FM-index [Ferragina and Manzini 2000]. It can be defined and constructed for a set of strings, using the Extended Burrows-Wheeler Transform [Mantaci et al. 2005]. A scalable version has been implemented in the BEETL software [Bauer et al. 2013]. This can in principle be applied to $S$ (by treating every $k$-mer in $S$ as a separate string), though we are unaware of such an application in practice. In theory, it results in the construction and memb query times reported by Bauer et al. [2013]. A naive implementation of fwd and bwd operations in this setting would require a new memb query; however, we hypothesize that a more sophisticated approach, using bidirectional indices, may improve the runtime (this does not appear in the literature and is not proven). The FM-index is not usually used directly for storing $k$-mers; rather, it is either used in combination with other strategies (e.g., DBGFM and deGSM, which we will describe in Section 4.2) or in a form specifically adapted to $k$-mer queries (i.e., the BOSS structure, which we will describe in Section 4.1).
Another popular string-based index is the trie data structure and its variations. A trie is a tree-based index known for its fast query time, with strings labeling nodes and/or edges (see Mäkinen et al. [2015] for details). Tries have been adapted to the $k$-mer set setting in a data structure called the Bloom filter trie [Holley et al. 2016]. It combines the elements of Bloom filters and burst tries [Heinz et al. 2002]. Conceptually, a small parameter is chosen and all the $k$-mers are split into equal-length parts. The $i$th part is then stored within a node at the $i$th level of the trie. Bloom filters are used within nodes to quickly filter out true negatives when querying the membership of a $k$-mer part. The Bloom filter trie offers fast $O(k)$ memb time but requires $O(nk)$ space.
4. DE BRUIJN GRAPHS
A de Bruijn graph provides a useful way to think about a $k$-mer set that has the spectrum-like-property and for which fwd and bwd operations should be supported more efficiently than membership operations. A de Bruijn graph (dBG) is a directed graph built from a set of $k$-mers $S$. In the node-centric dBG, the node set is given by $S$ and there is an edge from $x$ to $y$ iff the last $k-1$ characters of $x$ are equal to the first $k-1$ characters of $y$. In an edge-centric dBG, the node set is given by the set of $(k-1)$-mers present in $S$, and, for every $x \in S$, there is an edge from $x[1,k-1]$ to $x[2,k]$. In other words, the $k$-mers of $S$ are nodes in the node-centric dBG and edges in the edge-centric dBG. Figure 1 shows an example. The graphs represent equivalent information. Technically, the node-centric dBG of $S$ is a line graph [Bang-Jensen and Gutin 2009] of the edge-centric dBG of $S$, and without loss of generality, we mostly focus our discussion on node-centric dBGs.
Fig. 1. An example of the node-centric de Bruijn graph (left) and the edge-centric one (right), both built for the same $k$-mer set. There are three maximal unitigs in the node-centric graph, highlighted in the figure with orange rectangles.
The concept of a de Bruijn graph in bioinformatics is originally borrowed from combinatorics, where it is used to denote the node-centric dBG (in the sense we define here) of the full $k$-mer set, i.e., the set of all $k$-mers. It found its initial application in bioinformatics in genome assembly algorithms [Simpson and Pop 2015]. We do not discuss this application here, but rather, we discuss its relationship to the representation of a $k$-mer set.
The dBG is a mathematical object constructed from $S$ that explicitly captures the overlaps between the $k$-mers of $S$. Since this information is already implicitly present in $S$, the dBG contains the same underlying information as $S$. However, the graph formalism gives us a way to apply graph-theoretic concepts, such as walks or connected components, to a $k$-mer set. In theory, all these concepts could be stated in terms of $S$ directly without the use of the dBG. For example, a simple path in the node-centric dBG could be defined as an ordered subset of $S$ such that every consecutive pair of $k$-mers $x$ and $y$ obeys $x[2,k] = y[1,k-1]$. However, using the formalism of de Bruijn graphs makes the use of graph-theoretic concepts simpler and more immediate.
Just like $S$ is a mathematical object that can be represented by various data structures, so is the dBG. In this sense, the term dBG can have a fuzzy meaning when it is used to refer to not just the mathematical object but to the data structure representing it. Generally, though, when a data structure is said to represent the dBG (as opposed to $S$), it is meant that edge queries can be answered efficiently. When projected onto the operations we consider in this article, in- and out-edge queries are equivalent to bwd and fwd queries, respectively. In particular, a query to check if $x$ has an outgoing edge to $y$ is equivalent to the fwd$(x, y[k])$ operation, while bwd$(x, y[1])$ is equivalent to checking if $y$ has an outgoing edge to $x$.
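The two definitions can be made concrete with a short sketch. The function names and the default DNA alphabet are assumptions for this illustration, and the graphs are returned as explicit Python sets, which is only sensible for tiny examples.

```python
def node_centric_dbg(kmers, alphabet="ACGT"):
    """Nodes are the k-mers; edge x -> y iff x[2,k] == y[1,k-1]."""
    nodes = set(kmers)
    edges = {
        (x, x[1:] + a) for x in nodes for a in alphabet if x[1:] + a in nodes
    }
    return nodes, edges


def edge_centric_dbg(kmers):
    """Nodes are the (k-1)-mers; every k-mer x is an edge x[1,k-1] -> x[2,k]."""
    kmer_set = set(kmers)
    nodes = {x[:-1] for x in kmer_set} | {x[1:] for x in kmer_set}
    edges = {(x[:-1], x[1:]) for x in kmer_set}
    return nodes, edges
```

For the set {ACG, CGT, CGA}, the node-centric graph has three nodes and the edges ACG→CGT and ACG→CGA, while the edge-centric graph has the four nodes AC, CG, GT, GA and one edge per $k$-mer.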
4.1. Node- or Edge-based Representations
The simplest data structures that represent graphs are the incidence matrix and the adjacency list. The incidence matrix representation requires $O(n^2)$ space and is rarely used for dBGs (the inefficiency can also be explained by the fact that the incidence matrix is not intended for sparse graphs, but the dBG is sparse, because its nodes have constant in- and out-degrees of at most $\sigma$). A hash table adjacency list representation is possible using a hash table that stores, for each node, $2\sigma$ bits to signify which incident edges exist in the graph. Concretely, each node potentially has $2\sigma$ incident edges, corresponding to the $\sigma$ possible forward extensions and $\sigma$ backward extensions. Thus, we can use one bit for each of the possible edges to indicate their presence/absence. The navigational operations still require the time needed to hash a $k$-mer, because the hash value for the extension needs to be calculated to change the “internal state” of the hash table to the extension. However, checking which extensions exist can be done in constant time. While this representation requires $O(nk)$ space, its ease of implementation makes it a popular choice for smaller $n$ or $k$.
The special structure of dBGs (relative to arbitrary graphs) has been exploited to create a more space-efficient representation called BOSS (the name comes from the initials of the inventors [Bowe et al. 2012]). BOSS represents the edge-centric dBG as a list of the edges’ extension characters (i.e., for each edge $x \in S$, the character $x[k]$), sorted by the concatenation of the reverse of the source node label and the extension character (i.e., the reverse of $x[1,k-1]$ followed by $x[k]$). The details of the query algorithm are too involved to present here, and we refer the reader to either the original paper or to Mäkinen et al. [2015]. BOSS builds upon the XBW-transform [Ferragina et al. 2009] representation of trees, which itself is an extension of the FM-index [Ferragina and Manzini 2000] for strings. BOSS further modified the XBW-transform to work for dBGs. Historically, BOSS was initially introduced such that it was computed on a single string as input [Bowe et al. 2012]; then an efficient implementation used $k$-mer-counted input (COSMO, Boucher et al. [2015]); finally, some modifications have been made to the original structure for usage in a real genome assembler [Li et al. 2016].
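The hash table adjacency list can be sketched as follows, assuming the DNA alphabet ($\sigma = 4$, hence 8 bits per node) and a Python dictionary as the hash table; the bit layout and function names are choices made for this example only.

```python
ALPHABET = "ACGT"  # assumed alphabet, sigma = 4
SIGMA = len(ALPHABET)


def build_adjacency(kmers):
    """Map each k-mer to 2*sigma bits: bit i marks forward extension by
    ALPHABET[i], bit sigma+i marks backward extension by ALPHABET[i]."""
    kmer_set = set(kmers)
    table = {}
    for x in kmer_set:
        bits = 0
        for i, a in enumerate(ALPHABET):
            if x[1:] + a in kmer_set:
                bits |= 1 << i
            if a + x[:-1] in kmer_set:
                bits |= 1 << (SIGMA + i)
        table[x] = bits
    return table


def has_fwd(table, x, a):
    """Constant-time check (after locating x) of a forward extension by a."""
    return bool((table[x] >> ALPHABET.index(a)) & 1)


def has_bwd(table, x, a):
    """Constant-time check (after locating x) of a backward extension by a."""
    return bool((table[x] >> (SIGMA + ALPHABET.index(a))) & 1)
```

Moving to the extension itself still requires looking up the new $k$-mer in the table, which is the hashing cost discussed above.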
BOSS occupies bits of space and allows operation in time, which works like the search operation in an FM-index [Ferragina and Manzini 2000]. This assumes that there is only one source and one sink in the dBG. If there are more sources and sinks in the dBG but their number is negligible, the space becomes (this is due to a distinct separator character being needed, as described in Bowe et al. [2012]). Otherwise, in the worst case, the space needed becomes [Boucher et al. 2015; Bowe et al. 2012]. In the version given by Li et al. [2016], the space is always , but then membership queries sometimes give incorrect answers. BOSS achieves a runtime for the operation, while still runs in time. The query time can further be reduced to using the method of Belazzougui et al. [2016b], at the cost of extra space. This representation is static, but a dynamic one is also possible by sacrificing some query time [Belazzougui et al. 2016b; Bowe et al. 2012]. Like approximate membership data structures, BOSS achieves space and memb query time. The main difference is that approximate data structures have false positives while BOSS only achieves the space when the number of sources/sinks is small.
4.2. Unitig-based Representations
A unitig in a node-centric dBG is a path over the nodes , with such that either (1) , or (2) for all , the out- and in-degree of is 1 and the in-degree of is 1 and the out-degree of is 1. A unitig is maximal if the underlying path cannot be extended by a node while maintaining the property of being a unitig. The set of maximal unitigs in a graph is unique and forms a node decomposition of the graph (Lemma 2 in Chikhi et al. [2016]). See Figure 1 for an example of maximal unitigs. In the literature, maximal unitigs are sometimes referred to as unipaths or as simply unitigs. Computing the maximal unitigs can also be viewed as a task of compacting together their constituent nodes in the graph; hence, this is sometimes referred to as graph compaction.
A maximal unitig spells a string with the property that a -mer is a substring of iff . Thus, the list of maximal unitigs is an alternate representation of the -mers in in the sense that if and only if is a suc . This representation reduces the amount of space, since a maximal unitig represents a set of -mers using characters, while the raw set of -mers uses characters. The number of characters taken by the list is , where is the number of maximal unitigs. In many bioinformatic applications, is much smaller than , and this representation can greatly reduce the space. However, since one can always construct a set with , this representation does not yield an improvement when using worst-case analysis.
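The following sketch computes the spellings of the maximal unitigs of a small node-centric dBG by direct graph traversal. It is a naive in-memory illustration (not the algorithm of BCALM or any other cited tool); it assumes the DNA alphabet, ignores reverse complements, and uses hypothetical function names.

```python
def maximal_unitigs(kmers, alphabet="ACGT"):
    """Return the spellings of the maximal unitigs of the node-centric dBG."""
    nodes = set(kmers)
    succ = {x: [x[1:] + a for a in alphabet if x[1:] + a in nodes] for x in nodes}
    pred = {x: [a + x[:-1] for a in alphabet if a + x[:-1] in nodes] for x in nodes}

    def mergeable(x, y):
        # Edge x -> y can be compacted iff it is x's only out-edge
        # and y's only in-edge.
        return len(succ[x]) == 1 and len(pred[y]) == 1

    visited = set()

    def walk(start):
        path = [start]
        visited.add(start)
        cur = start
        while (succ[cur] and mergeable(cur, succ[cur][0])
               and succ[cur][0] not in visited):
            cur = succ[cur][0]
            path.append(cur)
            visited.add(cur)
        return path

    paths = []
    for x in sorted(nodes):
        # x starts a unitig unless its unique in-edge can be compacted.
        if not (len(pred[x]) == 1 and mergeable(pred[x][0], x)):
            paths.append(walk(x))
    for x in sorted(nodes):
        # Any node still unvisited lies on an isolated cycle.
        if x not in visited:
            paths.append(walk(x))
    # Spell each path: the first k-mer plus the last character of the others.
    return [p[0] + "".join(y[-1] for y in p[1:]) for p in paths]
```

For the set {ACG, CGT, CGA, GTT} ($n = 4$, $k = 3$), this yields three unitigs spelling ACG, CGA, and CGTT, for a total of 10 characters, which matches $n$ plus $k-1$ characters for each of the three unitigs.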
Given these space savings, one can pre-compute the maximal unitigs of as an initial, lossless, compression step. This is itself a task that builds upon other -mer set representations. However, there are fast and low-memory stand-alone tools for compaction such as BCALM [Chikhi et al. 2016] or others [Guo et al. 2019; Pan et al. 2020]; more generally, algorithms for compaction are often presented as part of genome assembly algorithms, which are too numerous to cite here.
To support efficient , and queries, the maximal unitigs must be appropriately indexed. The DBGFM data structure [Chikhi et al. 2014] builds an FM-index of the maximal unitigs to allow queries. In deGSM [Guo et al. 2019], the authors similarly build a BWT (which is the major component of an FM-index) of the maximal unitigs; but, they demonstrate how this can be done more efficiently by not explicitly constructing the strings of maximal unitigs (details omitted). These representations allow for queries. For a -mer that is not the first or last -mer of a maximal unitig, there is exactly one and extension, and it is determined by the next character in the unitig. For such -mers, these operations can be done in very small constant time without the need to use the FM-index. In the case that a -mer lies at the end of its maximal unitig, it may have multiple extensions, and they would be at an extremity of another maximal unitig. In this case a new query is required, though more sophisticated techniques may be possible to reduce the query times. It should be noted that these approaches, as implemented, are static; however, it may be possible to modify them to allow for insertion and deletion.
Another approach to index unitigs is taken by Bifrost [Holley and Melsted 2019], using minimizers. Bifrost builds a hash table where the keys are all the distinct minimizers of and the values of the locations of those minimizers in the maximal unitigs. The membership of a -mer is then checked by first computing its minimizer and then checking all the minimizer occurrences in the unitigs for a full match. The index is dynamic, i.e., it intelligently recomputes the unitigs and the minimizer index based on a -mer insertion or deletion.
Before presenting other unitig-based indices, we make an aside to introduce minimal perfect hash functions. Given a static set of size , a hash function is perfect if its image by has cardinality , i.e., there are no collisions. Furthermore, the hash function is minimal if the image consists of integers smaller or equal to . Minimal perfect hash functions (MPHF) can in theory be efficiently constructed and evaluated; we omit the details and refer the reader to Belazzougui et al. [2009] for an example. When applied to a -mer set , one can construct an MPHF in time and store it in bits of space where is a small constant (around 3) [Belazzougui et al. 2009; Limasset et al. 2017]; calculating the hash value of a -mer is done in time. There exists an efficient implementation of MPHF for a -mer set, BBHash [Limasset et al. 2017], designed to handle sets of billions of -mers. The advantage of an MPHF is that one can use it to associate information with each -mer in ; this is done by creating an array of size and using the MPHF value of a -mer as its index into the array. Unlike a hash table, this requires instead of space. The disadvantage of an MPHF is that if it is given a -mer , then it will still return a location associated with some arbitrary . Thus, it cannot be used to test for membership without further additions. Furthermore, support for insertions and deletions would require a dynamic perfect hashing scheme, yet to the best of our knowledge the only efficient implementation for large key sets [Limasset et al. 2017] is static. This limitation is inherited by the MPHF-based schemes we will describe in this article.
The pufferfish index [Almodaresi et al. 2018] uses an MPHF as an alternative to the FM-index when indexing the maximal unitigs. The MPHF along with additional information enables mapping each k-mer to its location in the maximal unitigs. To check for membership, a k-mer x is first mapped to its location; then, x is in S if and only if the k-mer at the location is equal to x. The pufferfish index is static, because of its reliance on the MPHF. A similar approach is the BLight index [Marchet et al. 2019b]. It also uses an MPHF to map k-mers to locations in unitigs, though it does it in a somewhat different way (we omit the details here).
Břinda [2016], Břinda et al. [2020], and Rahman and Medvedev [2020] recently extended the idea of unitig-based representations to spectrum-preserving string set representations (alternatively, these are referred to as simplitigs). They observed that what makes unitigs useful as a representation is that they contain the exact same k-mers as S, without any duplicates. They defined a spectrum-preserving string set representation as any set of strings that has this property and gave a greedy algorithm to construct one. The resulting simplitigs had a substantially lower number of characters than unitigs in practice. To support queries, simplitigs were combined with an FM-index [Rahman and Medvedev 2020] in the same manner that unitigs were combined with an FM-index to obtain DBGFM.
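A greedy construction of this kind can be sketched as follows. This is a simplified illustration, not the published algorithms: it picks an arbitrary seed k-mer, extends it to the right and then to the left for as long as an unused neighboring k-mer exists, and ignores reverse complements. The function name is an assumption.

```python
def greedy_simplitigs(kmers):
    """Cover a k-mer set with strings that contain each k-mer exactly once."""
    unused = set(kmers)
    if not unused:
        return []
    k = len(next(iter(unused)))
    simplitigs = []
    while unused:
        s = unused.pop()  # arbitrary seed k-mer
        while True:  # extend to the right
            nxt = next((s[len(s) - k + 1:] + c for c in "ACGT"
                        if s[len(s) - k + 1:] + c in unused), None)
            if nxt is None:
                break
            unused.remove(nxt)
            s += nxt[-1]
        while True:  # extend to the left
            prv = next((c + s[:k - 1] for c in "ACGT"
                        if c + s[:k - 1] in unused), None)
            if prv is None:
                break
            unused.remove(prv)
            s = prv[0] + s
        simplitigs.append(s)
    return simplitigs
```

Every appended character consumes exactly one unused k-mer, so the output strings contain the same k-mers as the input, without duplicates.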
5. NAVIGATIONAL DATA STRUCTURES
Many genome assembly algorithms start from a k-mer in the de Bruijn graph and proceed to navigate the graph by following the out- and in-neighbor edges. Membership queries are only needed to seed the start of a navigation with a k-mer. Afterwards, only fwd and bwd queries are performed. In this way, we can continue navigating to all the k-mers reachable from the seed. A data structure to represent S can take advantage of this access pattern to reduce its space usage, as we will see in this section. Formally, a navigational data structure is one where memb queries are either very expensive or impossible, but fwd and bwd queries are cheap (e.g., ). Navigational data structures were first used by Chikhi and Rizk [2012] and later formalized in Chikhi et al. [2014].
An MPHF in combination with a hash table adjacency list representation of a dBG forms a natural basis for a navigational data structure, as follows. This scheme was first described in the literature by Belazzougui et al. [2016a] but was previously implemented in the SPAdes assembler [Bankevich et al. 2012]. An MPHF is first built on S and then used to index a direct access table (i.e., an array). Each entry is composed of bits indicating which incident edges exist. For a k-mer x in S, we can answer fwd and bwd queries using the table. Given x's hash value, it takes only time to find out if an extension exists, but the queries take time, because a hash value has to be computed to actually navigate to the extension. If a rolling MPHF is used, this can also take time.
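The direct access table can be sketched as below. Two assumptions are made for illustration: each entry uses eight bits (four out-edges and four in-edges over the DNA alphabet), and a Python dictionary stands in for the MPHF. A real MPHF would not store the keys, which is exactly why this structure supports navigation from known k-mers but not membership.

```python
ALPHABET = "ACGT"

def build_table(kmers):
    """Return (index, table): index stands in for an MPHF, table holds edge bits."""
    order = sorted(set(kmers))
    present = set(order)
    index = {x: i for i, x in enumerate(order)}  # stand-in for an MPHF
    table = []
    for x in order:
        bits = 0
        for j, c in enumerate(ALPHABET):
            if x[1:] + c in present:
                bits |= 1 << j        # out-neighbor reached by appending c
            if c + x[:-1] in present:
                bits |= 1 << (4 + j)  # in-neighbor reached by prepending c
        table.append(bits)
    return index, table

def fwd(x, index, table):
    """Out-neighbors of x, assuming x is known to be in the set."""
    bits = table[index[x]]
    return [x[1:] + c for j, c in enumerate(ALPHABET) if (bits >> j) & 1]

def bwd(x, index, table):
    """In-neighbors of x, assuming x is known to be in the set."""
    bits = table[index[x]]
    return [c + x[:-1] for j, c in enumerate(ALPHABET) if (bits >> (4 + j)) & 1]
```

Reading the bits is a single array lookup once the slot is known; continuing the walk requires hashing the neighbor to find its own slot.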
The list of maximal unitigs also forms a natural basis for a navigational data structure without the need of constructing any additional index to support queries. As previously described, when maximal unitigs are stored, the fwd and bwd queries are trivial for most k-mers. The exceptions occur when fwd is executed on the last k-mer in a maximal unitig or when bwd is executed on the first k-mer in a maximal unitig. These extensions must be stored in a structure separate from the maximal unitigs; for example, the hash table adjacency list indexed by an MPHF can be used as described above. This approach of indexing the extensions was taken by Limasset et al. [2016]. When the number of maximal unitigs is significantly smaller than |S|, the cost of this additional structure is negligible.
Another approach to constructing a navigational data structure builds on the Bloom filter (BF). A BF is first built to store the k-mers of S, but a hash table is also used to store the k-mers that are false positives in the BF and are extensions of elements of S [Chikhi and Rizk 2012]. This makes it possible to avoid false positives for navigational queries by double-checking the hash table. More memory-efficient approaches use a cascading Bloom filter [Jackman et al. 2017; Salikhov et al. 2013], which is a sequence of increasingly smaller Bloom filters, where the first is an initial Bloom filter that stores S, and each subsequent filter stores the k-mers that are false positives of the previous one. BF-based navigational data structures support exact navigational queries in time (or with a rolling hash); as a bonus, they can also support approximate membership queries (they do not support insert operations). In this sense, they can be viewed as a compromise between navigational and normal data structures that trades exact membership of non-extension k-mers for better space-efficiency. Alternatively, they can be viewed as an augmentation of the simple Bloom filter representation to guarantee that at least the navigational queries are exact.
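The first variant, a Bloom filter plus an explicit set of false-positive extensions, can be sketched as follows. This is a minimal illustration and not the implementation of Chikhi and Rizk [2012]: the Bloom filter class, its SHA-256-based hashing, the parameter defaults, and all names are assumptions, and a Python set stands in for the hash table.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m, self.h = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def neighbors(x):
    return [x[1:] + c for c in "ACGT"] + [c + x[:-1] for c in "ACGT"]

def build(kmers, num_bits=64, num_hashes=2):
    """Return the Bloom filter and the false positives that neighbor true k-mers."""
    true_kmers = set(kmers)
    bf = BloomFilter(num_bits, num_hashes)
    for x in true_kmers:
        bf.add(x)
    critical_fp = {y for x in true_kmers for y in neighbors(x)
                   if y in bf and y not in true_kmers}
    return bf, critical_fp

def fwd(x, bf, critical_fp):
    """Exact out-neighbors of x, assuming x itself is in the original set."""
    return [y for y in (x[1:] + c for c in "ACGT")
            if y in bf and y not in critical_fp]
```

The original k-mer set is needed only during construction; afterwards, navigation from a true k-mer is exact, while membership of an arbitrary k-mer remains approximate.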
Belazzougui et al. [2016a] proposed a mechanism to transform their navigational data structure (described earlier in this section) into a membership data structure. They give both a static and dynamic version; we present the static one here. They first find a forest of node-disjoint rooted trees that is a node-covering subgraph of the dBG. Each tree has bounded height (between and , or less in case of a small connected component). They build an MPHF of and use it to store the adjacency list of the dBG, as described above. They also use it to record, for each -mer, whether it is a root in the forest and in case it is not, a number between 0 and to represent which navigational query will lead to its parent. A dictionary is used to store the node sequences of -mers associated with each root. Apart from these, no other node sequence is stored. The tree structure requires an additional bits to store, where is implementation-dependent, and supports membership queries in time. It is assumed that the space to store the root -mers is a lower-order term of the whole structure, which is the case except when the graph consists of many small connected components.
To check for membership of a k-mer x, we start with the node that the MPHF identifies as corresponding to x. We use the stored navigation instructions to follow it up to its root (using at most queries). If a tree root cannot be reached after steps, or if any of the navigational instructions violate the information in the MPHF adjacency list, then we can conclude that the node does not correspond to x and hence x is not in S. If a tree root is reached within steps, then x is in S if and only if the sequence of the root (computed dynamically from traveling up the tree) is equal to the stored k-mer associated with the root.
6. SPACE LOWER BOUNDS
How many bits are necessary to store , in the worst case, so membership queries can be answered (without mistakes)? Conway and Bromage [2011] provided an information theoretic answer, based on the fact that to store elements from a universe of size requires bits. In our case, we denote this lower bound by and, using standard inequality bounds, we have:
This asymptotically matches the space of Conway and Bromage’s data structure (Table 2). The quantity reflects the density of the set, and we have that . If is exponentially sparse, then .
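As a worked illustration of the kind of standard inequality bounds referred to above, the following derivation uses the usual estimates of a binomial coefficient. The notation is an assumption on our part: n denotes the number of k-mers in the set and 4^k the size of the universe of DNA k-mers; it is a sketch of the standard argument rather than a reproduction of the original formula.

```latex
\left(\frac{4^k}{n}\right)^{n} \;\le\; \binom{4^k}{n} \;\le\; \left(\frac{e\,4^k}{n}\right)^{n}
\quad\Longrightarrow\quad
n \log_2 \frac{4^k}{n} \;\le\; \log_2 \binom{4^k}{n} \;\le\; n \log_2 \frac{4^k}{n} + n \log_2 e .
```

For example, if n is at most 4^{(1-\epsilon)k} for some constant \epsilon > 0, then \log_2(4^k/n) is at least 2\epsilon k, so the number of bits required grows proportionally to nk.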
Chikhi et al. [2014] explored lower bounds for navigational data structures. Here, how many bits are necessary to store , in the worst case, so navigational queries can be answered (without mistakes)? They showed that bits are required to represent a navigational data structure (for ). Note that this beats the above lower bound for membership data structures, because a navigational data structure cannot answer arbitrary queries.
The above are traditional worst-case lower bounds, meaning that, for any representation that uses fewer than (respectively, ) bits for all possible sets with elements of k-mers, there will exist at least one input where the representation will produce a false answer to a membership (respectively, navigational) query. However, this is of limited interest in the bioinformatics setting, where the k-mers in S come from an underlying biological source. For example, the family of graphs used to prove the bound would never occur in bioinformatics practice. As a result, the value that worst-case lower bounds bring to practical representation of a k-mer set is limited. In fact, the static BOSS and the static Belazzougui data structures are able to beat this lower bound in practice by taking advantage of a de Bruijn graph that is typically highly connected.
The difficulty of finding an alternative to worst-case lower bounds is the difficulty of modeling the input distribution. Chikhi et al. [2014] considered the opposite end of the spectrum. They call S linear if the node-centric de Bruijn graph of S is a single unitig. They showed that the number of bits needed to represent an S that is linear is . A linear k-mer set is in some sense the best case that can occur in practice. However, a linear k-mer set is much easier to represent than the sets arising in practice; hence, this bound is too conservative to serve as a practical lower bound.
An intermediate model was also considered by Chikhi et al. [2014], where S is parametrized by the number of maximal unitigs in the de Bruijn graph. They used this parameter to describe how much space their representation takes; however, they did not pursue the interesting question of a lower bound parametrized by the number of maximal unitigs.
An alternative to traditional worst-case lower bounds or modeling the input distribution is to derive more instance-specific lower bounds. Typically, a lower bound is derived as a function of the input size, but a more instance-specific lower bound might be a function of the degree distribution of the de Bruijn graph or something even more specific to the graph structure. These types of lower bounds are extremely satisfying when they can be used to show an algorithm is instance-optimal, i.e., it matches the lower bound on every instance. Rahman and Medvedev [2020] derive such a lower bound for the number of characters in a spectrum-preserving string set representation. Their lower bound did not match the performance of their greedy algorithm in the worst case, but it came very close (within a factor of 2%) on the evaluated input.
7. VARIATIONS AND EXTENSIONS
There are natural variations and extensions of data structures for storing a k-mer set, which we describe in this section. These are not included in Tables 1 and 2, because they do not neatly fit into the framework of those tables.
7.1. Membership of
A useful operation may be to check if S contains a given string u of length less than k. In some data structures, like the Bloom filter trie, it is easy to find if a k-mer begins with u, but there is no specialized way to check if u appears as a non-prefix in S. One way to check for u's membership is to enumerate all the k-mers in S and then perform an exact string-matching algorithm in time (e.g., Knuth-Morris-Pratt, described in the textbook of Cormen et al. [2009]). Another way is to attempt all possible ways to complete a k-mer from u. Both these ways are prohibitively inefficient for most applications. However, both the static BOSS and the FM-index on top of unitigs [Chikhi et al. 2014; Guo et al. 2019] data structures support checking u's membership in time; dynamic BOSS also supports this, in time . We omit the details of these implementations here.
7.2. Variable-order de Bruijn Graphs
The and operations require an overlap of characters to navigate . However, if such an overlap does not exist, then in some applications it makes sense to look for a shorter overlap. The variable-order BOSS was introduced to allow this [Boucher et al. 2015]. For a given , it simultaneously represents all the dBGs for , as follows: At any given time, the variable-order BOSS maintains an intermediate state, which is a value and a range of nodes (denoted as ) that share the same suffix of length , representing a single node in the dBG for . It supports new operations shorter() and longer() for changing the value of (by one), running in and time, respectively. The operation runs in the same asymptotic time as BOSS, but fwd runs in time. A bidirectional variable order BOSS improved that operation from to [Belazzougui et al. 2016b]. The memb times are unaffected compared to BOSS. The space complexity is bits, adding an extra bits to the space of BOSS.
7.3. Double Strandedness
The reverse complement of a string is the string reversed and every nucleotide (i.e., character) replaced by its Watson-Crick complement. In many applications, it is often useful to treat a k-mer and its reverse complement as being identical. There are two general ways in which data structures for storing a k-mer set can be adapted to achieve this.
The first way is to make all k-mers canonical. A k-mer is canonical if it is lexicographically no larger than its reverse complement. To make a k-mer x canonical, one replaces it by its reverse complement if x is not canonical. The elements of S are made canonical prior to construction of the data structure, and queries always make the k-mer canonical first. This approach works well in data structures that are hash-based (e.g., sorted list, hash table adjacency list, Conway and Bromage, Bloom filter) or the Bloom filter trie. The space of these data structures does not increase, but the query times increase by the operations that may be needed to make a k-mer canonical.
For a data structure such as BOSS, using canonical k-mers is incompatible with the specialized fwd and bwd operations. For such cases, there is a second way to handle reverse complements. Concretely, we can compute the reverse complement closure of S, as follows: We first modify S by checking, for every k-mer x in S, if the reverse complement of x is in S, and, if not, adding this reverse complement to S. This increases the size of the data structure by up to a factor of two, but maintains the same time for fwd and bwd operations.
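Canonicalization as defined above can be written in a few lines. The sketch assumes an uppercase DNA alphabet without ambiguity codes; the function names are ours.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every nucleotide, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc
```

A structure built on canonical k-mers then calls `canonical` both on every inserted k-mer and on every query, so a k-mer and its reverse complement always resolve to the same key.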
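The reverse complement closure amounts to a set union, as in the following minimal sketch (assuming an uppercase DNA alphabet; the function name is ours).

```python
def rc_closure(kmers):
    """Return the k-mer set together with the reverse complement of each element."""
    complement = str.maketrans("ACGT", "TGCA")
    closed = set(kmers)
    closed.update(x.translate(complement)[::-1] for x in kmers)
    return closed
```

The result has at most twice as many elements as the input, and fewer when some reverse complements were already present or when a k-mer is its own reverse complement.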
In the case of unitig-based representations, the unitigs themselves can be constructed on what is called a bidirected de Bruijn graph [Medvedev et al. 2019, 2007]. A bidirected graph naturally captures the notion of double-stranded k-mer extensions in a graph-theoretic framework. The unitigs can then be indexed using their canonical form. We omit the details here.
7.4. Maintaining k-mer Counts
In many contexts it is natural to store a positive integer count associated with each k-mer in S. Alternatively, this may be viewed as storing a multi-set instead of a set. In the same way that a set of k-mers can be thought of as a de Bruijn graph, a multi-set of k-mers can also be thought of as a weighted de Bruijn graph.
Many of the data structures discussed naturally support maintaining counts, including operations to increment or decrement a count. Any of the data structures that associate some memory location with each k-mer in S can be augmented to store counts, e.g., a hash table adjacency list representation, a BOSS, or a representation based on unitigs or on a spectrum-preserving string set. More generally, if a data structure provides a method to obtain the rank of a k-mer within S (e.g., Conway and Bromage), that rank can be used as an index into an integer vector containing the counts. For Bloom filters, there also exist variants that allocate a fixed number of bits per k-mer to store the approximate counts (the counting Bloom filter [Fan et al. 2000]).
The downside of such representations, however, is that they are space-inefficient when the distribution of count values is skewed. For example, in one typical situation, most k-mers will have a count of at most 10, but there will be a few with a count in the thousands. Since these representations use a fixed number of bits to represent a count, they will waste a lot of bits for low-count k-mers to support just a few k-mers with a large count. To alleviate this, variable-length counters can be used. Conway and Bromage [2011] proposed a tiered approach, storing higher-order bits only as needed. More recently, the counting quotient filter [Pandey et al. 2017a] was designed with variable-length counters in mind; it was applied to store a k-mer multi-set by the Squeakr [Pandey et al. 2017b] and deBGR [Pandey et al. 2017c] algorithms.
Mäkinen et al. [2015, Section 9.7.2] also present a count-aware alternative to BOSS, also based on the BWT and following Välimäki and Rivals [2013]. In this representation, a BWT is constructed without removing duplicate k-mers, and the count of a k-mer can then be inferred from the number of entries in the BWT corresponding to it. This approach avoids storing an explicit count vector; however, it requires space to represent each extra copy of a k-mer. This tradeoff can be beneficial when the count values are skewed and most k-mers have low counts.
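The rank-based scheme can be illustrated with a sorted list, where the rank of a k-mer is its position in the list. This is only a sketch: the class name and interface are assumptions, and the sorted list stands in for whichever structure provides rank in practice.

```python
from bisect import bisect_left

class RankedCounts:
    """Counts stored in a vector indexed by the rank of each k-mer in a sorted list."""

    def __init__(self, kmers):
        self.keys = sorted(set(kmers))
        self.counts = [0] * len(self.keys)

    def rank(self, kmer):
        i = bisect_left(self.keys, kmer)
        if i == len(self.keys) or self.keys[i] != kmer:
            raise KeyError(kmer)
        return i

    def increment(self, kmer, by=1):
        self.counts[self.rank(kmer)] += by

    def count(self, kmer):
        return self.counts[self.rank(kmer)]
```

The count vector is kept separate from the k-mer representation, so the same vector could be attached to any structure that exposes a rank operation.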
7.5. Sets of k-mer Sets
A natural extension of a -mer set is a set of -mer sets, i.e., , where each is a -mer set. Sets of -mer sets have received significant recent interest, as they are used to index large collections of sequencing datasets or genomes from a population. An equivalent way to think about this is a set of -mers where each -mer is associated with a set of genomes (often called colors) . A set of colors is referred to as a color class. If the underlying set of -mers is intended to support navigational queries, then a representation of is referred to as a colored de Bruijn graph [Iqbal et al. 2012]. This is an extension of viewing a -mer set as a de Bruijn graph to the case of multiple sets.
The literature has focused on two types of queries. The first is the basic k-mer color query: Given a k-mer, is it present in any of the sets, and, if yes, what is its color class? The second is a color-matching query: Given a set of query k-mers and a threshold, identify all colors that contain at least the threshold fraction of the query k-mers.
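Both query types can be stated precisely with a naive reference implementation over explicit sets. The sketch below is a specification aid under our own assumptions (a dictionary from color name to k-mer set, and hypothetical function names); the representations discussed next answer the same queries in far less space.

```python
def kmer_color_query(kmer, colors):
    """Return the color class of a k-mer: the set of colors whose k-mer set contains it."""
    return {c for c, kmers in colors.items() if kmer in kmers}

def color_matching_query(query_kmers, colors, threshold):
    """Return the colors that contain at least a `threshold` fraction of the query k-mers."""
    query = set(query_kmers)
    if not query:
        return []
    return sorted(c for c, kmers in colors.items()
                  if len(query & kmers) / len(query) >= threshold)
```

An empty color class returned by `kmer_color_query` means the k-mer occurs in none of the sets.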
Proposed representations have generally fallen into two categories. The first explicitly stores each -mer’s color class in a way that can be indexed by the -mer. For example, Holley et al. [2016] proposed storing the color class of a -mer at its corresponding leaf in a Bloom filter trie, while Pandey et al. [2018] stored the color class in the -mer’s slot of a counting quotient filter. Alternatively, a BOSS can be used to store the -mers and the colors can be stored in an auxiliary binary color matrix [Almodaresi et al. 2017; Muggli et al. 2017]. Here, if the ith -mer in the BOSS ordering has a color . Instead of using a BOSS, -mers in the color matrix can also be indexed using a minimal perfect hash function [Yu et al. 2018] or a unitig-based representation [Holley and Melsted 2019].
A column of the color matrix can be viewed as a binary vector specifying the k-mer membership of the corresponding k-mer set. A variation of this then replaces each column using a Bloom filter representation of that set [Bingmann et al. 2019; Bradley et al. 2019; Mustafa et al. 2019]. Thus, each row of the color matrix becomes a position in the Bloom filter, instead of a k-mer. This results in space savings, but the representation of the color class is no longer guaranteed to be correct.
The color matrix is sometimes compressed using a standard compression technique such as RRR [Raman et al. 2007] or Elias-Fano encoding [Muggli et al. 2017]. Further compression can be achieved based on the idea that, in some applications, many k-mers share the same color class. For example, Holley et al. [2016], Almodaresi et al. [2017], and Pandey et al. [2018] assign an integer code to each color class in increasing order of the number of k-mers that belong to it. Thus, frequently occurring color classes are represented using fewer bits. Yu et al. [2018] proposed an adaptive approach to encoding color classes. Based on how many colors a color class contains, the class is stored as either a list of the colors, a delta-list encoding of the colors, or as a bitvector of length . Almodaresi et al. [2019] take advantage of the fact that adjacent k-mers in the de Bruijn graph are likely to have similar color classes; they then store many of the color classes not as an explicit encoding but as a difference vector to a similar color class. Finally, an alternative way to encode the color matrix based on wavelet trees is given by Mustafa et al. [2019].
The second category of representations is based on the Bloofi [Crainiceanu and Lemire 2015] data structure, which is designed to exploit the fact that many of the sets are similar and, more generally, many color classes have similar k-mer compositions. Here, each set is stored in a Bloom filter and a tree is constructed with each such filter as a leaf. Each internal node represents the union of the k-mers of its descendants, also represented as a Bloom filter. The Bloofi data structure was adapted to the k-mer setting by Solomon and Kingsford [2016], who called it the Sequence Bloom Tree. The color-matching query can be answered by traversing the tree top-down and pruning the search at any node where fewer than the threshold fraction of the query k-mers match. Further improvements were made to reduce its size and query times [Harris and Medvedev 2020; Solomon and Kingsford 2018; Sun et al. 2018]. For example, k-mers that appear in all the nodes of a subtree can be marked as such to allow more pruning during queries, and the information about such k-mers can be stored at the root, thereby saving space [Solomon and Kingsford 2018; Sun et al. 2018]. Using a hierarchical clustering to improve the topology of the tree also yields space savings and better query times [Sun et al. 2018]. A better organization of the bitvectors was shown to reduce saturation and improve performance [Harris and Medvedev 2020].
The first category of representations is designed with the basic k-mer color query in mind, though it can be adapted to answer the color-matching query as well. The second category of methods, however, is specifically designed to answer the color-matching query. These methods can be viewed as aggregating k-mer information at the color level, while the first category can be viewed as aggregating color information at the k-mer level. For a more thorough survey of this topic, please see Marchet et al. [2019a].
8. CONCLUSION
In this article, we have surveyed data structures for storing a DNA -mer set in a way that can efficiently support membership and/or navigational queries. This problem falls into the more general category of indexing a set of elements, which has been widely studied in computer science. The aspects of a DNA -mer set that make it unique are that the elements are fixed length strings over a constant-sized alphabet, the set is sparse, and is much less than . A DNA -mer set tends to also have what we have termed the spectrum-like-property. This property is hard to capture with mathematical precision, but it has been a major driver behind the design of specialized data structures. Another way that a DNA -mer set is different from a general set is that queries are sometimes more constrained than arbitrary membership queries. In particular, navigational queries start from a -mer that is known to be in the set and ask which of its extensions are also present.
We now give a summary of the major developments in this field. Some methods for storing a set proved to be useful right out-of-the-box, with the major examples being hash tables, Bloom filters, and sparse bitvectors. These methods are generic, in the sense that there is nothing specific to -mer sets about them. Hash tables and Bloom filters, especially, gained widespread use because of their broad software availability and conceptual simplicity, respectively. These two offered a tradeoff between query accuracy and space; concretely, Bloom filters require only space but have false positives, while hash tables have no false positives but require space. They both offered fast query times of for membership and for navigational queries (assuming rolling hash functions are used). Beyond these, other generic methods found applicability in -mer sets, especially approximate membership query data structures. These offer both practical and theoretical improvements; however, describing these requires a more fine-grained analysis than we are able to provide here.
Generic data structures were also modified to take advantage of properties inherent to a DNA -mer set, either simply that the strings are of fixed length or, more strongly, have the spectrum-like-property. The most notable examples of this were the works by Pellow et al. [2017] to modify Bloom filters, by Holley et al. [2016] to modify string tries (i.e., the Bloom filter trie data structure), and by Bowe et al. [2012] to modify the FM-index (i.e., BOSS data structure). Pellow et al. [2017] improved the space usage of Bloom filters, though the theoretical analysis is beyond the scope of this survey. The improvements of Holley et al. [2016] to a string trie were more practical and difficult to theoretically analyze. Bowe et al. [2012] were able to simultaneously achieve the space usage of Bloom filters and the perfect accuracy of a hash table without affecting the query times. This, however, does not hold in the worst case, because it assumes that the number of sources and sinks in the de Bruijn graph is negligible. Later papers showed how to modify BOSS to achieve different tradeoffs [Belazzougui et al. 2016b; Li et al. 2016].
There were also two novel types of data structures developed specifically for the k-mer setting. The first was unitig-based representations, proposed by Chikhi et al. [2014] and later extended to spectrum-preserving string set representations by Břinda [2016], Rahman and Medvedev [2020], and Břinda et al. [2020]. These representations work by first constructing the unitigs and then building an index on top of them. The type of index varies: The FM-index is used by Chikhi et al. [2014] and Guo et al. [2019], while a minimal perfect hash function is used by Almodaresi et al. [2018], Marchet et al. [2019b], and Holley and Melsted [2019]. Unitig-based representations were specifically designed to exploit the spectrum-like-property to save space, resulting in space ( is the number of maximal unitigs in the input). Membership and navigation remain efficient ( and , respectively), except that for k-mers at the boundaries of unitigs, navigation takes . The idea is that in practice, the spectrum-like-property implies that is much smaller than , resulting in low space and making boundary k-mers rare in practice. A direct comparison between unitig-based representations and other representations (e.g., BOSS) to determine the regimes in which one outperforms the other has not, to the best of our knowledge, been attempted; this includes either a theoretical or a comprehensive empirical analysis.
The second type of data structure developed specifically for a -mer set is a navigational data structure, which exploits the way that a DNA -mer set is often queried. These data structures retain navigational queries but sacrifice the efficiency and/or feasibility of membership queries to achieve space. Chikhi and Rizk [2012] were the first to use such a data structure, and Chikhi et al. [2014] later formalized the idea; other navigational data structures were later developed by Bankevich et al. [2012], Salikhov et al. [2013], Jackman et al. [2017], Belazzougui et al. [2016a], and Limasset et al. [2016].
Reading through the literature in this field, one often encounters papers on the representation of de Bruijn graphs as opposed to representation of a -mer set. The distinction between the two is unclear to us, as a de Bruijn graph and a -mer set represent equivalent information (i.e., there is a bijection between the universe of -mer sets and the universe of de Bruijn graphs). One distinction may be that the term “de Bruijn graph” implies that edge queries (which in the node-centric version correspond to navigational queries, in our terminology) are efficient, while the term “ -mer set” does not connote anything about navigation. However, “de Bruijn graph” obfuscates the fact that there are no degrees of freedom in defining the edge set: Once the node labels (i.e., -mers) are determined, so are the edges. This is in the node-centric setting, but in the edge-centric setting, it is the nodes that are determined once the edge labels (i.e., -mers) are fixed.
Beyond the data structures, we also discussed what is known about space lower bounds. Unfortunately, there have been only limited results. Besides the basic information-theoretic lower bound by Conway and Bromage [2011], nothing is known for membership data structures. For navigational data structures, Chikhi et al. [2014] provided some lower bounds; however, these are of limited practical use, because they only consider worst-case lower bounds, which are easily beaten on real data. Within the confines of spectrum-preserving string set representations, instance-specific lower bounds were successfully applied empirically to demonstrate the near-optimality of the greedy representation on real data.
In this survey, we did not discuss in any detail how DNA k-mer sets are used in practice; we assume that there is some algorithm that takes a set of reads and extracts a k-mer set from them in a way that is useful to downstream algorithms. However, bringing such algorithms into some kind of unified framework would be a fascinating topic for another survey.
We hope that this area receives more systematic attention in the future, as k-mer set representations underlie many bioinformatics tools. This might include expanding the set of operations beyond what we have described here to better capture the way a DNA k-mer set is used. Another promising avenue of research is to better and more explicitly model the distribution of k-mer sets that arise in sequencing data; such models can then uncover more efficient representations as well as provide useful lower bounds. Progress in the field can also come through the creation of benchmarking datasets and through impartial competitive assessment of existing tools (e.g., as in Bradnam et al. [2013]; Sczyrba et al. [2017]). The ultimate goal, though, remains practical: to come up with data structures that improve on the space and query time of existing ones.
CCS Concepts:
• Applied computing → Computational biology; • Theory of computation → Pattern matching;
Acknowledgments
This research started during J. Holub's research stay at the Pennsylvania State University, supported by the Fulbright Visiting Scholar Program, and it was finished with the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics.” This work was partially supported by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM130691, and the INCEPTION project (PIA/ANR-16-CONV-0005). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
APPENDIX
A. DERIVATIONS OF COMPLEXITIES
A.1. Conway and Bromage
Conway and Bromage [2011] present separate structures for dense and sparse sets; in our case, the sparse bitmap representation (called sarray in Conway and Bromage [2011]) is relevant. The space taken by sarray is given in Table 1 of Conway and Bromage [2011] as . In our case, and . Membership is implemented as a constant number of rank operations, which are supported in sarray in time (Table 1 in Conway and Bromage [2011]). In terms of construction time, we did not find an analysis in either Conway and Bromage [2011] or Okanohara and Sadakane [2007]. We show the construction time as , since it is at least necessary to hash each -mer.
A.2. Bloom Filter Tries
The Bloom filter trie complexities depend on several internal parameters (e.g., in the article). For our analysis, we have treated these as constants, and, in particular, we have set , as it minimizes the complexity of operations. Yet, this is an extreme case that has not been explicitly considered in the original article, and Holley et al. [2016] suggested optimizations for performing faster navigational queries that are not reflected by our analysis here. A more fine-grained analysis than we have done here is likely possible, in terms of these internal parameters.
A.3. BOSS
In Bowe et al. [2012], the time complexity of query (called is , where is (rank & select [Raman et al. 2007]) for the static case and (a balanced binary search tree) for the dynamic case, and is the maximum of the complexities of the functions rank, select, and access on strings, which is for the static implementation [Ferragina et al. 2007] and for the dynamic implementation [Navarro and Sadakane 2014]. Considering that the alphabet size is constant in our case, the static implementation makes the query time complexity equal to and the dynamic implementation makes it .
The time complexity of query (called ) is , which is for the static case and for the dynamic case. The time complexity of query (called ) is , which is for the static case and for the dynamic case. Both static [Ferragina et al. 2007] and dynamic [Navarro and Sadakane 2014] rank & select implementations have the same asymptotic space complexity; therefore, both the static and dynamic BOSS have the same asymptotic space complexity.
A.4. Variable-order BOSS
In the case of a constant alphabet, the variable-order BOSS [Boucher et al. 2015] representation uses the data structures of original BOSS and a new array requiring space [Boucher et al. 2015, Theorem 1]. The query is used in the same way as in BOSS. Operations and for -mers are also used in the same way as in BOSS. For -mers with the operations are implemented in a different (slower) way: Note, in the variable-order BOSS returns a list of nodes with an edge to . In Boucher et al. [2015] (Section 5), variable-order BOSS time complexity is . Operation runs in time (i.e., time for , runs in time. Operation runs in time , and operation runs in time , where is a range of nodes sharing the same suffix of length .
Footnotes
We note that the FM-index and its variants are also sometimes referred to as BWT indices, since they are based on the Burrows-Wheeler Transform (BWT).
Contributor Information
RAYAN CHIKHI, Center of Bioinformatics and Biostatistics and Integrative Biology.
JAN HOLUB, Department of Theoretical Computer Science, Czech Technical University in Prague.
PAUL MEDVEDEV, Center for Computational Biology and Bioinformatics.
REFERENCES
- Almodaresi Fatemeh, Pandey Prashant, Ferdman Michael, Johnson Rob, and Patro Rob. 2019. An efficient, scalable and exact representation of high-dimensional color information enabled via de Bruijn graph search. In Proceedings of the International Conference on Research in Computational Molecular Biology (Lecture Notes in Computer Science), Vol. 11467. Springer, 1–18. DOI: 10.1007/978-3-030-17083-7_1
- Almodaresi Fatemeh, Pandey Prashant, and Patro Rob. 2017. Rainbowfish: A succinct colored de Bruijn graph representation. In WABI 2017: Algorithms in Bioinformatics (LIPIcs-Leibniz International Proceedings in Informatics), Schwartz Russell and Reinert Knut (Eds.), Vol. 88. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 18:1–18:15. DOI: 10.4230/LIPIcs.WABI.2017.18
- Almodaresi Fatemeh, Sarkar Hirak, Srivastava Avi, and Patro Rob. 2018. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34, 13 (2018), i169–i177. DOI: 10.1093/bioinformatics/bty292
- Bang-Jensen Jørgen and Gutin Gregory Z. 2009. Digraphs: Theory, Algorithms and Applications. Springer Science & Business Media. DOI: 10.1007/978-1-84800-998-1
- Bankevich Anton, Nurk Sergey, Antipov Dmitry, Gurevich Alexey A., Dvorkin Mikhail, Kulikov Alexander S., Lesin Valery M., Nikolenko Sergey I., Pham Son, Prjibelski Andrey D. et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol 19, 5 (2012), 455–477. DOI: 10.1089/cmb.2012.0021
- Bauer Markus J., Cox Anthony J., and Rosone Giovanna. 2013. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci 483 (2013), 134–148. DOI: 10.1016/j.tcs.2012.02.002
- Belazzougui Djamal, Botelho Fabiano C., and Dietzfelbinger Martin. 2009. Hash, displace, and compress. In ESA 2009: European Symposium on Algorithms (Lecture Notes in Computer Science), Vol. 5757. Springer, 682–693. DOI: 10.1007/978-3-642-04128-0_61
- Belazzougui Djamal, Gagie Travis, Mäkinen Veli, and Previtali Marco. 2016a. Fully dynamic de Bruijn graphs. In SPIRE 2016: String Processing and Information Retrieval (Lecture Notes in Computer Science), Inenaga Shunsuke, Sadakane Kunihiko, and Sakai Tetsuya (Eds.), Vol. 9954. Springer, 145–152. DOI: 10.1007/978-3-319-46049-9_14
- Belazzougui Djamal, Gagie Travis, Mäkinen Veli, Previtali Marco, and Puglisi Simon J. 2016b. Bidirectional variable-order de Bruijn graphs. In LATIN 2016: Theoretical Informatics (Lecture Notes in Computer Science), Kranakis Evangelos, Navarro Gonzalo, and Chávez Edgar (Eds.), Vol. 9644. Springer, 164–178. DOI: 10.1007/978-3-662-49529-2_13
- Bender Michael A., Farach-Colton Martin, Johnson Rob, Kraner Russell, Kuszmaul Bradley C., Medjedovic Dzejla, Montes Pablo, Shetty Pradeep, Spillane Richard P., and Zadok Erez. 2012. Don’t thrash: How to cache your hash on flash. Proc. VLDB Endow 5, 11 (2012), 1627–1637. DOI: 10.14778/2350229.2350275
- Bingmann Timo, Bradley Phelim, Gauger Florian, and Iqbal Zamin. 2019. COBS: A compact bit-sliced signature index. In SPIRE 2019: String Processing and Information Retrieval (Lecture Notes in Computer Science), Vol. 11811. Springer, 285–303. DOI: 10.1007/978-3-030-32686-9_21
- Bloom Burton H. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426. DOI: 10.1145/362686.362692
- Boucher Christina, Bowe Alex, Gagie Travis, Puglisi Simon J., and Sadakane Kunihiko. 2015. Variable-order de Bruijn graphs. In Proceedings of the Data Compression Conference, Bilgin A, Marcellin MW, Serra-Sagristà J, and Storer JA (Eds.). IEEE Computer Society Press, 383–392. DOI: 10.1109/DCC.2015.70
- Bowe Alexander, Onodera Taku, Sadakane Kunihiko, and Shibuya Tetsuo. 2012. Succinct de Bruijn graphs. In WABI 2012: Algorithms in Bioinformatics (Lecture Notes in Computer Science), Raphael Ben and Tang Jijun (Eds.), Vol. 7534. Springer-Verlag, 225–235. DOI: 10.1007/978-3-642-33122-0_18
- Bradley Phelim, den Bakker Henk, Rocha Eduardo, McVean Gil, and Iqbal Zamin. 2019. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol 37 (2019), 152–159. DOI: 10.1038/s41587-018-0010-1
- Bradnam Keith R., Fass Joseph N., Alexandrov Anton, Baranay Paul, Bechner Michael, Birol Inanç, Boisvert Sébastien, Chapman Jarrod A., Chapuis Guillaume, Chikhi Rayan et al. 2013. Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2, 1 (2013). DOI: 10.1186/2047-217X-2-10
- Břinda Karel. 2016. Novel Computational Techniques for Mapping and Classifying Next-generation Sequencing Data. Ph.D. Dissertation. University of Paris-Est Marne-la-Vallée. DOI: 10.5281/zenodo.1045317
- Břinda Karel, Baym Michael, and Kucherov Gregory. 2020. Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv 903443 (2020). DOI: 10.1101/2020.01.12.903443
- Broder Andrei and Mitzenmacher Michael. 2003. Network applications of Bloom filters: A survey. Internet Math 1, 4 (2003), 485–509. DOI: 10.1080/15427951.2004.10129096
- Chikhi Rayan, Limasset Antoine, Jackman Shaun, Simpson Jared T., and Medvedev Paul. 2014. On the representation of de Bruijn graphs. In RECOMB 2014: Research in Computational Molecular Biology (Lecture Notes in Computer Science), Sharan Roded (Ed.), Vol. 8394. Springer, 35–55. DOI: 10.1007/978-3-319-05269-4_4
- Chikhi Rayan, Limasset Antoine, and Medvedev Paul. 2016. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, 12 (2016), i201–i208. DOI: 10.1093/bioinformatics/btw279
- Chikhi Rayan and Rizk Guillaume. 2012. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In WABI 2012: Algorithms in Bioinformatics (Lecture Notes in Computer Science), Raphael Ben and Tang Jijun (Eds.), Vol. 7534. Springer, 236–248. DOI: 10.1007/978-3-642-33122-0_19
- Chu Justin, Mohamadi Hamid, Erhan Emre, Tse Jeffery, Chiu Readman, Yeo Sarah, and Birol Inanç. 2018. Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters. bioRxiv (2018), 434795. DOI: 10.1101/434795
- Cohen Saar and Matias Yossi. 2003. Spectral Bloom filters. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, 241–252. DOI: 10.1145/872757.872787
- Conway Thomas C. and Bromage Andrew J. 2011. Succinct data structures for assembling large genomes. Bioinformatics 27, 4 (2011), 479–486. DOI: 10.1093/bioinformatics/btq697
- Cormen Thomas H., Leiserson Charles E., Rivest Ronald L., and Stein Clifford. 2009. Introduction to Algorithms. The MIT Press.
- Crainiceanu Adina and Lemire Daniel. 2015. Bloofi: Multidimensional Bloom filters. Inf. Syst 54 (2015), 311–324. DOI: 10.1016/j.is.2015.01.002
- Elias Peter. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2 (1974), 246–260. DOI: 10.1145/321812.321820
- Fan Bin, Andersen Dave G., Kaminsky Michael, and Mitzenmacher Michael D. 2014. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies. Association for Computing Machinery, 75–88. DOI: 10.1145/2674005.2674994
- Fan Li, Cao Pei, Almeida Jussara, and Broder Andrei Z. 2000. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw 8, 3 (2000), 281–293. DOI: 10.1109/90.851975
- Ferragina Paolo, Luccio Fabrizio, Manzini Giovanni, and Muthukrishnan Shan. 2009. Compressing and indexing labeled trees, with applications. J. ACM 57, 1 (2009), 4:1–4:33. DOI: 10.1145/1613676.1613680
- Ferragina Paolo and Manzini Giovanni. 2000. Opportunistic data structures with applications. In Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS’00). IEEE Computer Society, 390–398. DOI: 10.1109/SFCS.2000.892127
- Ferragina Paolo, Manzini Giovanni, Mäkinen Veli, and Navarro Gonzalo. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algor 3, 2 (2007). DOI: 10.1145/1240233.1240243
- Guo Hongzhe, Fu Yilei, Gao Yan, Li Junyi, Wang Yadong, and Liu Bo. 2019. deGSM: Memory scalable construction of large scale de Bruijn Graph. IEEE/ACM Trans. Comput. Biol. Bioinf (2019), Early access. DOI: 10.1109/TCBB.2019.2913932
- Harris Robert S. and Medvedev Paul. 2020. Improved representation of sequence Bloom trees. Bioinformatics 36, 3 (2020), 721–727. DOI: 10.1093/bioinformatics/btz662
- Heinz Steffen, Zobel Justin, and Williams Hugh E. 2002. Burst tries: A fast, efficient data structure for string keys. ACM Trans. Inf. Syst 20, 2 (2002), 192–223. DOI: 10.1145/506309.506312
- Holley Guillaume and Melsted Páll. 2019. Bifrost-Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv (2019), 695338. DOI: 10.1101/695338
- Holley Guillaume, Wittler Roland, and Stoye Jens. 2016. Bloom filter trie: An alignment-free and reference-free data structure for pan-genome storage. Algor. Molec. Biol 11, 1 (2016), 3. DOI: 10.1186/s13015-016-0066-8
- Iqbal Zamin, Caccamo Mario, Turner Isaac, Flicek Paul, and McVean Gil. 2012. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genetics 44 (2012), 226–232. DOI: 10.1038/ng.1028
- Jackman Shaun D., Vandervalk Benjamin P., Mohamadi Hamid, Chu Justin, Yeo Sarah, Hammond S. Austin, Jahesh Golnaz, Khan Hamza, Coombe Lauren, Warren Rene L. et al. 2017. ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome Res 27 (2017), 768–777. DOI: 10.1101/gr.214346.116
- Lemire Daniel and Kaser Owen. 2010. Recursive n-gram hashing is pairwise independent, at best. Comput. Speech Lang 24, 4 (2010), 698–710. DOI: 10.1016/j.csl.2009.12.001
- Li Dinghua, Luo Ruibang, Liu Chi-Man, Leung Chi-Ming, Ting Hing-Fung, Sadakane Kunihiko, Yamashita Hiroshi, and Lam Tak-Wah. 2016. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102 (2016), 3–11. DOI: 10.1016/j.ymeth.2016.02.020
- Limasset Antoine, Cazaux Bastien, Rivals Eric, and Peterlongo Pierre. 2016. Read mapping on de Bruijn graphs. BMC Bioinf 17 (2016). DOI: 10.1186/s12859-016-1103-9
- Limasset Antoine, Rizk Guillaume, Chikhi Rayan, and Peterlongo Pierre. 2017. Fast and scalable minimal perfect hashing for massive key sets. In Proceedings of the 16th International Symposium on Experimental Algorithms (SEA’17) (Leibniz International Proceedings in Informatics (LIPIcs)), Iliopoulos Costas S., Pissis Solon P., Puglisi Simon J., and Raman Rajeev (Eds.), Vol. 75. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 25:1–25:16. DOI: 10.4230/LIPIcs.SEA.2017.25
- Liu Xinan, Yu Ye, Liu Jinpeng, Elliott Corrine F., Qian Chen, and Liu Jinze. 2018. A novel data structure to support ultrafast taxonomic classification of metagenomic sequences with k-mer signatures. Bioinformatics 34, 1 (2018), 171–178. DOI: 10.1093/bioinformatics/btx432
- Mäkinen Veli, Belazzougui Djamal, Cunial Fabio, and Tomescu Alexandru I. 2015. Genome-scale Algorithm Design. Cambridge University Press. DOI: 10.1017/CBO9781139940023
- Mantaci Sabrina, Restivo Antonio, and Sciortino Marinella. 2005. An extension of the Burrows Wheeler transform to k words. In Proceedings of the Data Compression Conference, Storer James A. and Cohn Martin (Eds.). IEEE Computer Society Press, 469. DOI: 10.1109/DCC.2005.13
- Marchet Camille, Boucher Christina, Puglisi Simon J., Medvedev Paul, Salson Mikaël, and Chikhi Rayan. 2019a. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv 866756 (2019). DOI: 10.1101/866756
- Marchet Camille, Kerbiriou Maël, and Limasset Antoine. 2019b. Indexing de Bruijn graphs with minimizers. bioRxiv (2019), 546309. DOI: 10.1101/546309
- Marchet Camille, Lecompte Lolita, Limasset Antoine, Bittner Lucie, and Peterlongo Pierre. 2018. A resource-frugal probabilistic dictionary and applications in bioinformatics. Disc. Appl. Math 274 (2018), 92–102. DOI: 10.1016/j.dam.2018.03.035
- Medvedev Paul, Chikhi Rayan, and Limasset Antoine. 2019. Bi-directed graphs in BCALM 2. Retrieved from https://github.com/GATB/bcalm/blob/master/bidirected-graphs-in-bcalm2/bidirected-graphs-in-bcalm2.md.
- Medvedev Paul, Georgiou Konstantinos, Myers Gene, and Brudno Michael. 2007. Computability of models for sequence assembly. In WABI 2007: Algorithms in Bioinformatics (Lecture Notes in Computer Science), Giancarlo Raffaele and Hannenhalli Sridhar (Eds.), Vol. 4645. Springer, 289–301. DOI: 10.1007/978-3-540-74126-8_27
- Mitzenmacher Michael. 2002. Compressed Bloom filters. IEEE/ACM Trans. Netw 10, 5 (2002), 604–612. DOI: 10.1109/TNET.2002.803864
- Muggli Martin D., Bowe Alexander, Noyes Noelle R., Morley Paul S., Belk Keith E., Raymond Robert, Gagie Travis, Puglisi Simon J., and Boucher Christina. 2017. Succinct colored de Bruijn graphs. Bioinformatics 33, 20 (2017), 3181–3187. DOI: 10.1093/bioinformatics/btx067
- Mustafa Harun, Schilken Ingo, Karasikov Mikhail, Eickhoff Carsten, Rätsch Gunnar, and Kahles André. 2019. Dynamic compression schemes for graph coloring. Bioinformatics 35, 3 (2019), 407–414. DOI: 10.1093/bioinformatics/bty632
- Navarro Gonzalo. 2016. Compact Data Structures: A Practical Approach. Cambridge University Press. DOI: 10.1017/CBO9781316588284
- Navarro Gonzalo and Sadakane Kunihiko. 2014. Fully functional static and dynamic succinct trees. ACM Trans. Algor 10, 3 (2014), 16:1–16:39. DOI: 10.1145/2601073
- Okanohara Daisuke and Sadakane Kunihiko. 2007. Practical entropy-compressed rank/select dictionary. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX’07). Society for Industrial and Applied Mathematics, 60–70. DOI: 10.5555/2791188.2791194
- Pan Tony, Nihalani Rahul, and Aluru Srinivas. 2020. Fast de Bruijn graph compaction in distributed memory environments. IEEE/ACM Trans. Comput. Biol. Bioinf 17, 1 (2020), 136–148. DOI: 10.1109/TCBB.2018.2858797
- Pandey Prashant, Almodaresi Fatemeh, Bender Michael A., Ferdman Michael, Johnson Rob, and Patro Rob. 2018. Mantis: A fast, small, and exact large-scale sequence search index. Cell Syst (2018), 201–207. DOI: 10.1016/j.cels.2018.05.021
- Pandey Prashant, Bender Michael A., Johnson Rob, and Patro Rob. 2017c. deBGR: An efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 33, 14 (2017), i133–i141. DOI: 10.1093/bioinformatics/btx261
- Pandey Prashant, Bender Michael A., Johnson Rob, and Patro Rob. 2017a. A general-purpose counting filter: Making every bit count. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’17). Association for Computing Machinery, 775–787. DOI: 10.1145/3035918.3035963
- Pandey Prashant, Bender Michael A., Johnson Rob, and Patro Rob. 2017b. Squeakr: An exact and approximate k-mer counting system. Bioinformatics 34, 4 (2017), 568–575. DOI: 10.1093/bioinformatics/btx636
- Pellow David, Filippova Darya, and Kingsford Carl. 2017. Improving Bloom filter performance on sequence data using k-mer Bloom filters. J. Comput. Biol 24, 6 (2017), 547–557. DOI: 10.1089/cmb.2016.0155
- Rahman Amatur and Medvedev Paul. 2020. Representation of k-mer sets using spectrum-preserving string sets. bioRxiv 896928 (2020). DOI: 10.1101/2020.01.07.896928
- Raman Rajeev, Raman Venkatesh, and Satti Srinivasa Rao. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor 3, 4 (2007), 43. DOI: 10.1145/1290672.1290680
- Roberts Michael, Hayes Wayne, Hunt Brian R., Mount Stephen M., and Yorke James A. 2004. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 18 (2004), 3363–3369. DOI: 10.1093/bioinformatics/bth408
- Salikhov Kamil, Sacomoto Gustavo, and Kucherov Gregory. 2013. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. In WABI 2013: Algorithms in Bioinformatics (Lecture Notes in Computer Science), Darling Aaron and Stoye Jens (Eds.), Vol. 8126. Springer, 364–376. DOI: 10.1007/978-3-642-40453-5_28
- Schleimer Saul, Wilkerson Daniel S., and Aiken Alex. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, 76–85. DOI: 10.1145/872757.872770
- Sczyrba Alexander, Hofmann Peter, Belmann Peter, Koslicki David, Janssen Stefan, Dröge Johannes, Gregor Ivan, Majda Stephan, Fiedler Jessika, Dahms Eik et al. 2017. Critical assessment of metagenome interpretation—A benchmark of metagenomics software. Nat. Methods 14, 11 (2017), 1063–1071. DOI: 10.1038/nmeth.4458
- Shi Haixiang, Schmidt Bertil, Liu Weiguo, and Muller-Wittig Wolfgang. 2010. A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware. J. Comput. Biol 17, 4 (2010), 603–615. DOI: 10.1089/cmb.2009.0062
- Simpson Jared T. and Pop Mihai. 2015. The theory and practice of genome sequence assembly. Annu. Rev. Genom. Hum. Genet 16 (2015), 153–172. DOI: 10.1146/annurev-genom-090314-050032
- Solomon Brad and Kingsford Carl. 2016. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol 34, 3 (2016), 300–302. DOI: 10.1038/nbt.3442
- Solomon Brad and Kingsford Carl. 2018. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J. Comput. Biol 25, 7 (2018), 755–765. DOI: 10.1089/cmb.2017.0265
- Stranneheim Henrik, Käller Max, Allander Tobias, Andersson Björn, Arvestad Lars, and Lundeberg Joakim. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics 26, 13 (2010), 1595–1600. DOI: 10.1093/bioinformatics/btq230
- Sun Chen, Harris Robert S., Chikhi Rayan, and Medvedev Paul. 2018. AllSome Sequence Bloom Trees. J. Comput. Biol 25, 5 (2018), 467–479. DOI: 10.1089/cmb.2017.0258
- Tarkoma Sasu, Rothenberg Christian Esteve, and Lagerspetz Eemil. 2011. Theory and practice of Bloom filters for distributed systems. IEEE Commun. Surv. Tutor 14, 1 (2011), 131–155. DOI: 10.1109/SURV.2011.031611.00024
- Välimäki Niko and Rivals Eric. 2013. Scalable and versatile k-mer indexing for high-throughput sequencing data. In Proceedings of the 9th International Symposium on Bioinformatics Research and Applications (ISBRA’13) (Lecture Notes in Computer Science), Vol. 7875. Springer, 237–248. DOI: 10.1007/978-3-642-38036-5_24
- Wood Derrick E. and Salzberg Steven L. 2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15 (2014), R46.
- Yu Ye, Liu Jinpeng, Liu Xinan, Zhang Yi, Magner Eamonn, Qian Chen, and Liu Jinze. 2018. SeqOthello: Query over RNA-seq experiments at scale. Genome Biol 19 (2018). DOI: 10.1186/s13059-018-1535-9
- Zentgraf Jens, Timm Henning, and Rahmann Sven. 2020. Cost-optimal assignment of elements in genome-scale multi-way bucketed Cuckoo hash tables. In 2020 Proceedings of the Twenty-Second Workshop on Algorithm Engineering and Experiments (ALENEX), Blelloch Guy and Finocchi Irene (Eds.). SIAM, 186–198. DOI: 10.1137/1.9781611976007.15
