Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%–44% compared with other state-of-the-art low-memory indices.
Keywords: bidirected graph, k-mer compression, k-mer index, k-mer set, path cover, unitigs
1. Introduction
Algorithms based on k-mers are now among the top performing tools for many bioinformatics analyses. Instead of working directly with reads or alignments, these tools work with the set of k-mer substrings present in the data, often relying on specialized data structures for representing sets of k-mers (for a survey, see Chikhi et al., 2019). Since modern sequencing datasets are huge, the space used by such data structures is a bottleneck when attempting to scale up to large databases. For example, as part of our group's work on building indices for RNA-seq data, we are storing gzipped k-mer set files from about 2500 experiments (Harris and Medvedev, 2018). Though this is only a fraction of experiments in the SRA, it already consumes 6 TB of space. For these and other applications, the development of space-efficient representations of k-mer sets can improve scalability and enable novel biological discoveries.
Conway and Bromage (2011) showed that at least $\log_2 \binom{4^k}{n}$ bits are needed to losslessly store a set of n k-mers, in the worst case. However, a set of k-mers generated from a sequencing experiment typically exhibits the spectrum-like property (Chikhi et al., 2019) and contains a lot of redundant information. Therefore, in practice, most data structures can substantially improve on that bound (Chikhi et al., 2014).
A common way to reduce the redundancy in a k-mer set K is to convert it into a set of maximal unitigs. A unitig is a non-branching path in the de Bruijn graph, a graph whose nodes are the k-mers of K and whose edges are the overlaps between k-mers. A unitig u can be written as a string $\mathrm{spell}(u)$ of length $|u| + k - 1$, such that the k-mers of u are exactly the k-mer substrings of $\mathrm{spell}(u)$. For example, the unitig (AAC, ACG, CGT) is spelled as AACGT. This gives a way to represent $|u|$ k-mers using $|u| + k - 1$ characters, instead of the $k|u|$ characters used by a naive approach. When unitigs are long, as they are in real data, the space savings are significant. The idea can be extended to store the whole set K, because the set of maximal unitigs U forms a decomposition of K, and, therefore, has the nice property that $x \in K$ iff x is a substring of $\mathrm{spell}(u)$, for some $u \in U$.
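The character count above can be checked with a minimal sketch. The function name `spell_path` is hypothetical, 3-mers are used purely for illustration, and reverse complements are ignored here (they are handled by the bidirected definitions of Section 2).

```python
def spell_path(kmers):
    """Spell a path of k-mers in which each k-mer overlaps the next by k-1 characters."""
    k = len(kmers[0])
    out = kmers[0]
    for prev, cur in zip(kmers, kmers[1:]):
        if prev[1:] != cur[:k - 1]:
            raise ValueError(f"{prev} and {cur} do not overlap by k-1 characters")
        out += cur[-1]  # each additional k-mer contributes exactly one new character
    return out

unitig = ["AAC", "ACG", "CGT"]
spelled = spell_path(unitig)
naive_chars = sum(len(x) for x in unitig)  # k * |u| characters
print(spelled, len(spelled), naive_chars)  # prints: AACGT 5 9
```

The saving grows with the length of the path: a path of |u| k-mers costs |u| + k - 1 characters rather than k|u|.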
The maximal unitigs U can be computed efficiently (Chikhi et al., 2016; Pan et al., 2018; Guo et al., 2019) and combined with an auxiliary index to obtain a membership data structure (i.e., one that can efficiently determine whether a k-mer belongs to K or not). In particular, Unitigs-FM (Chikhi et al., 2014) and deGSM (Guo et al., 2019) use the FM-index as the auxiliary index; Pufferfish (Almodaresi et al., 2018) and BLight (Marchet et al., 2019) use a minimum perfect hash function; and Bifrost (Holley and Melsted, 2019) uses a minimizer hash table. Alternatively, U can be compressed to obtain a compressed disk representation of K, although without efficient support for membership queries before decompression.
Although unitigs conveniently fit the needs of those applications, we observe in this article that they are not necessarily the best that can be done. Concretely, we claim that what makes U useful in these scenarios is that they are a type of spectrum-preserving string set (SPSS) representation of K, which we define to be a set of strings X such that a k-mer is in K if and only if it is a substring of a string in X. [This is in contrast to the way unitigs are used in assembly, where it is crucial that they are not chimeric (Medvedev, 2018).] The weight of X is the number of characters it contains. In this article, we explore the idea of low-weight representations and their applicability. In particular, are there representations with a smaller weight than U that can be efficiently computed? What is the lowest weight that is achievable by a representation? Can such representations seamlessly replace unitig representations in downstream applications, and can they improve space performance?
In this article, we show that the problem of finding a minimum weight SPSS representation is equivalent to finding the smallest path cover in a compacted de Bruijn graph (Section 3). We use the reduction to give a lower bound on the weight achievable by any SPSS representation (Section 4), and we give an efficient greedy algorithm UST (Unitig-STitch) to find a representation that improves on U (Section 5) and is empirically near-optimal. We demonstrate the usefulness of our representation by using two applications (Section 6). One, we combine it with an FM-index into a membership data structure called UST-FM, and, two, we combine it with a general compression algorithm to give a compression algorithm called UST-Compress. Both applications result in a substantial space decrease over the state of the art (Section 7), demonstrating the usefulness of SPSS representations. Our software is freely available at https://github.com/medvedevgroup/UST/.
1.1. Related work
The idea of using a SPSS for a membership index was previously independently described in a PhD thesis (Břinda, 2016), and questions similar to the ones in our article are simultaneously and independently studied in Břinda et al. (2020). The idea of greedily gluing unitigs (as UST does) has previously appeared in read compression (Jones et al., 2012), where contigs were greedily constructed from the reads and the reads were stored as alignments to these contigs. The idea also appeared in the context of sequence assembly, where a greedy traversal of an assembly graph was used as an intermediate step during assembly (Haas et al., 2013; Kolmogorov et al., 2019).
The compression of k-mer sets has not been extensively studied, except in the context of how k-mer counters store their output (Marçais and Kingsford, 2011; Rizk et al., 2013; Kokot et al., 2017; Pandey et al., 2017c). DSK (Rizk et al., 2013) uses an HDF5-based encoding, KMC3 (Kokot et al., 2017) combines a dense storage of prefixes with a sparse storage of suffixes, and Squeakr (Pandey et al., 2017c) uses a counting quotient filter (Pandey et al., 2017a). The compression of read data, on the other hand, stored in either unaligned or aligned formats, has received a lot of attention (Hosseini et al., 2016; Numanagić et al., 2016; Hernaez et al., 2019). In the scenario where the k-mer set to be compressed was originally generated from FASTA files by a k-mer counter, an alternative to k-mer compression is to compress the original FASTA file and use a k-mer counter as part of the decompression to extract the k-mers on the fly. This approach is unsatisfactory because (1) as shown in this article, it takes substantially more space than direct k-mer compression, (2) k-mer counting on the fly adds significant time and memory to the decompression process, and (3) there are applications where the k-mer set cannot be reproduced by simply counting k-mers in a FASTA file, for example, when it is a product of a multi-sample error correction algorithm (Yang et al., 2012).
Further, there are applications where the k-mer set is not related to sequence read data at all, for example, a universal hitting set (Orenstein et al., 2017), a chromosome-specific reference dictionary (Rangavittal et al., 2019), or a winnowed min-hash sketch [e.g., as in Sahlin and Medvedev (2019), or see Marçais et al. (2019) and Rowe (2019) for a survey].
Membership data structures for k-mer sets were surveyed in a recent paper (Chikhi et al., 2019). In addition to the unitig-based approaches already mentioned, other exact representations include succinct de Bruijn graphs (referred to as BOSS; Bowe et al., 2012) and their variations (Boucher et al., 2015; Belazzougui et al., 2016a), dynamic de Bruijn graphs (Belazzougui et al., 2016b; Crawford et al., 2018), and Bloom filter tries (Holley et al., 2016). Some data structures are non-static, that is, they provide the ability to insert and/or delete k-mers. However, such operations are not needed in many read-only applications, where the cost of supporting them can be avoided. Membership data structures can be extended to associate additional information with each k-mer, for instance an abundance count (e.g., deBGR; Pandey et al., 2017b) or a color class (for a short overview, see Chikhi et al., 2019).
2. Definitions
2.1. Strings
In this article, we assume all strings are over the alphabet $\Sigma = \{A, C, G, T\}$. The length of string x is denoted by $|x|$. A string of length k is called a k-mer. For a set of strings S, $\mathrm{weight}(S) = \sum_{x \in S} |x|$ denotes the total count of characters. We write x(i..j) to denote the substring of x from the ith to the jth character, inclusive. We define $\mathrm{suf}_k(x)$ (respectively, $\mathrm{pre}_k(x)$) to be the last (respectively, first) k characters of x. For x and y with $\mathrm{suf}_{k-1}(x) = \mathrm{pre}_{k-1}(y)$, we define gluing x and y as $x \odot^{k-1} y = x \cdot y(k..|y|)$. For $s \in \{0, 1\}$, we define $\mathrm{orient}(x, s)$ to be x if $s = 0$ and to be the reverse complement of x if $s = 1$. A string x is canonical if x is the lexicographically smaller of x and its reverse complement. To canonize x is to replace it by its canonical version (i.e., $\min_{s} \mathrm{orient}(x, s)$). We say that $x_0$ and $x_1$ have a $(s_0, s_1)$-oriented-overlap if $\mathrm{suf}_{k-1}(\mathrm{orient}(x_0, s_0)) = \mathrm{pre}_{k-1}(\mathrm{orient}(x_1, s_1))$. Intuitively, such an overlap exists between two strings if we can orient them in such a way that they are glueable. We define the k-spectrum $\mathrm{sp}^k(x)$ as the multiset of all canonized k-mer substrings of x. The k-spectrum for a set of strings S is defined as $\mathrm{sp}^k(S) = \bigcup_{x \in S} \mathrm{sp}^k(x)$.
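The string operations of this section translate almost directly into code. The following is a minimal sketch; the function names are our own choices for illustration, k is assumed to be at least 2, and the multiset is represented by a `Counter`.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    """Reverse complement of a DNA string."""
    return x.translate(COMPLEMENT)[::-1]

def orient(x, s):
    """x itself if s == 0, its reverse complement if s == 1."""
    return x if s == 0 else revcomp(x)

def canonize(x):
    """Lexicographically smaller of x and its reverse complement."""
    return min(x, revcomp(x))

def glue(x, y, k):
    """Glue x and y when the last k-1 characters of x equal the first k-1 of y."""
    if x[-(k - 1):] != y[:k - 1]:
        raise ValueError("strings are not glueable")
    return x + y[k - 1:]

def has_oriented_overlap(x0, s0, x1, s1, k):
    """True if x0 and x1, oriented by s0 and s1, are glueable in that order."""
    return orient(x0, s0)[-(k - 1):] == orient(x1, s1)[:k - 1]

def spectrum(x, k):
    """k-spectrum of x: the multiset of its canonized k-mer substrings."""
    return Counter(canonize(x[i:i + k]) for i in range(len(x) - k + 1))

print(glue("AAC", "ACG", 3))   # prints: AACG
print(spectrum("AACGT", 3))    # ACG is counted twice because CGT canonizes to ACG
```

The last line shows why the spectrum is a multiset: a string can contain a k-mer and its reverse complement, and both canonize to the same k-mer.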
2.2. Bidirected graphs
A bidirected graph G is a pair $(V, E)$, where the elements of V are called vertices and E is a set of edges. An edge e is a 4-tuple $(u_0, s_0, u_1, s_1)$, where $u_i \in V$ and $s_i \in \{0, 1\}$, for $i \in \{0, 1\}$. Intuitively, every vertex has two sides, and an edge connects to a side of a vertex. Note that there can be multiple edges between two vertices, but only one edge once the sides are fixed. An edge is a loop if $u_0 = u_1$. Given a non-loop edge e that is incident to a vertex u, we denote by $\mathrm{side}(u, e)$ the side of u to which it is incident. We say that a vertex u is isolated if it has no edge incident to it and is a dead-end if it has exactly one side to which no edges are incident. We define $n_{dead}$ and $n_{iso}$ as the number of dead-end and isolated vertices, respectively. A sequence $w = (u_0, e_1, u_1, \ldots, e_n, u_n)$ is a walk if for all $1 \le i \le n$, $e_i$ is incident to $u_{i-1}$ and to $u_i$, and for all $1 \le i \le n - 1$, $\mathrm{side}(u_i, e_i) = 1 - \mathrm{side}(u_i, e_{i+1})$. Vertices $u_1, \ldots, u_{n-1}$ are called internal, and $u_0$ and $u_n$ are called endpoints. A walk can also be a single vertex, in which case it is considered to have no internal vertex and one endpoint. A path cover W of G is a set of walks such that every vertex is in exactly one walk in W and no walk visits a vertex more than once.
2.3. Bidirected DNA graphs
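A bidirected DNA graph is a bidirected graph G where every vertex u has a string label $\mathrm{lab}(u)$, and for every edge $e = (u_0, s_0, u_1, s_1)$, there is a $(1 - s_0, s_1)$-oriented-overlap between $\mathrm{lab}(u_0)$ and $\mathrm{lab}(u_1)$. G is said to be overlap-closed if there is an edge for every such overlap. Let $w = (u_0, e_1, u_1, \ldots, e_n, u_n)$ be a walk. We define $x_0 = \mathrm{orient}(\mathrm{lab}(u_0), 1 - \mathrm{side}(u_0, e_1))$ and, for $1 \le i \le n$, $x_i = \mathrm{orient}(\mathrm{lab}(u_i), \mathrm{side}(u_i, e_i))$. The spelling of a walk is defined as $\mathrm{spell}(w) = x_0 \odot^{k-1} x_1 \odot^{k-1} \cdots \odot^{k-1} x_n$. (The fact that the $x_i$'s are glueable in this way can be derived from definitions.) If W is a set of walks, then we define $\mathrm{spell}(W) = \{\mathrm{spell}(w) : w \in W\}$.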
2.4. de Bruijn graphs
Let K be a set of canonical k-mers. The node-centric bidirected de Bruijn graph, denoted by $\mathrm{dBG}(K)$, is the overlap-closed bidirected DNA graph where the vertices and their labels correspond to K. Figure 1A shows an example. In this article, we will assume that $\mathrm{dBG}(K)$ is not just a single cycle; such a case is easy to handle in practice but is a space-consuming corner case in all the analyses. A walk in $\mathrm{dBG}(K)$ is a unitig if all its vertices have in- and out-degrees of 1, except that the first vertex can have any in-degree and the last vertex can have any out-degree. A single vertex is also a unitig. A unitig is maximal if it is not a sub-walk of another unitig. It was shown in Chikhi et al. (2016) that if $\mathrm{dBG}(K)$ is not a cycle, then a unitig cannot visit a vertex more than once, and the set of maximal unitigs forms a unique decomposition of the vertices in $\mathrm{dBG}(K)$ into non-overlapping walks. The bidirected compacted de Bruijn graph of K, denoted by $\mathrm{cdBG}(K)$, is the overlap-closed bidirected DNA graph where the vertices are the maximal unitigs of $\mathrm{dBG}(K)$, and the labels of the vertices are the spellings of the unitigs. Figure 1B shows an example.
3. Equivalence of SPSS Representations and Path Covers
For this section, we fix K to be a canonical set of k-mers. A set of strings X is said to be a SPSS representation of K if their k-spectrums are equal (i.e., $\mathrm{sp}^k(X) = K$) and each string in X is of length $\ge k$. For brevity, we say X represents K. Note that because in our definitions K is a set (i.e., no duplicates) and the k-spectrum is a multi-set, this effectively restricts X to not contain duplicate k-mers (see, e.g., Figure 1B, C). In this article, we consider the problem of finding a minimum weight SPSS representation of K. In this section, we will show that it is equivalent to the problem of finding the smallest path cover of $\mathrm{cdBG}(K)$, in the following sense:
Theorem 1. Let $X_{opt}$ be a minimum weight SPSS representation of K. Let $W_{opt}$ be the smallest path cover on $\mathrm{cdBG}(K)$. Then, $\mathrm{weight}(X_{opt}) = |K| + |W_{opt}|(k - 1)$.
First, we show that the weight of an SPSS representation is a linear increasing function of its size (i.e., the number of strings it contains) and, hence, finding an SPSS representation of minimum weight is equivalent to finding one of minimum size.
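Lemma 1. Let X be an SPSS representing K. Then, $\mathrm{weight}(X) = |X|(k - 1) + |K|$.
Proof. Every string x of length $|x| \ge k$ contains $|x| - k + 1$ k-mers. X has $|K|$ k-mers, since X and K have the same k-spectrum. Combining these, $|K| = \sum_{x \in X} (|x| - k + 1) = \mathrm{weight}(X) - |X|(k - 1)$.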
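The counting argument can be checked numerically on a toy instance. The sketch below assumes the identity of Lemma 1 in the form weight(X) = |X|(k - 1) + |K|; the two string sets and the helper names are hypothetical and chosen only for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(x, k):
    """List of canonized k-mer substrings of x, in order of occurrence."""
    kmers = (x[i:i + k] for i in range(len(x) - k + 1))
    return [min(m, m.translate(COMPLEMENT)[::-1]) for m in kmers]

def lemma1_sides(X, k):
    """Return (weight(X), |X|(k-1) + |K|) for a string set X without repeated k-mers."""
    kmers = [m for x in X for m in canonical_kmers(x, k)]
    K = set(kmers)
    if len(kmers) != len(K):
        raise ValueError("X repeats a k-mer, so it is not an SPSS representation")
    weight = sum(len(x) for x in X)
    return weight, len(X) * (k - 1) + len(K)

# Two representations of the same set K = {AAC, ACT}, with k = 3.
print(lemma1_sides(["AAC", "ACT"], 3))  # prints: (6, 6)
print(lemma1_sides(["AACT"], 3))        # prints: (4, 4)
```

Both representations satisfy the identity, and the one with fewer strings has the lower weight, which is why minimizing weight and minimizing the number of strings are the same problem.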
The intuition behind Theorem 1 is that there is a natural size-preserving bijection between path covers of $\mathrm{dBG}(K)$ and SPSS representations of K. Since it is more efficient to work with compacted de Bruijn graphs, we would like this to hold for $\mathrm{cdBG}(K)$ as well. However, the path covers with an endpoint at an internal vertex of a unitig in $\mathrm{dBG}(K)$ do not project onto $\mathrm{cdBG}(K)$. Nevertheless, this is not an issue because such path covers are necessarily non-optimal.
Lemma 2. Let W be a path cover of $\mathrm{cdBG}(K)$. Then, $\mathrm{spell}(W)$ represents K.
Proof. By construction, all strings in $\mathrm{spell}(W)$ are at least k-long, so we only need to show that the spectrum of $\mathrm{spell}(W)$ is K. Let $W_0$ be the path cover with every vertex as its own walk. We can view W as being constructed from $W_0$ by repeatedly taking a pair of walks that share endpoints and joining them together. We prove the Lemma by induction on the number of joins.
For the base case, $\mathrm{spell}(W_0)$ are the maximal unitigs of $\mathrm{dBG}(K)$, which, by definition, have the same spectrum as K (Chikhi et al., 2016). Now let $W_i$ be the path cover after i walk-joins. Then, $W_{i+1}$ is the result of joining some two walks $w$ and $w'$ into $w''$. Observe that joining walks preserves the k-spectrum of their spellings, that is, $\mathrm{sp}^k(\mathrm{spell}(w)) \cup \mathrm{sp}^k(\mathrm{spell}(w')) = \mathrm{sp}^k(\mathrm{spell}(w''))$. Combining with the inductive hypothesis for $W_i$, $\mathrm{sp}^k(\mathrm{spell}(W_{i+1})) = \mathrm{sp}^k(\mathrm{spell}(W_i)) = K$.
Lemma 3. Let X be the smallest SPSS representation of K. Then, there exists a path cover W of $\mathrm{cdBG}(K)$ with $|W| = |X|$.
Proof. Let $X = \{x_1, \ldots, x_{|X|}\}$. Every string $x_i$ is spelled by a walk $w'_i$ in $\mathrm{dBG}(K)$, visiting the sequence of its canonized constituent k-mers. Since X is spectrum preserving with respect to K, it contains every k-mer in K exactly once; therefore, $\{w'_1, \ldots, w'_{|X|}\}$ is a path cover of $\mathrm{dBG}(K)$.
Since X has the smallest number of strings, the endpoints of $w'_i$ cannot be on internal vertices of unitigs, otherwise there would exist another string $x_j$ that could be glued with $x_i$ to form a smaller SPSS representing K. Therefore, there exists a corresponding walk $w_i$ in $\mathrm{cdBG}(K)$ such that $\mathrm{spell}(w_i) = x_i$. Hence, the set of walks $W = \{w_1, \ldots, w_{|X|}\}$ is a path cover of $\mathrm{cdBG}(K)$ with $|W| = |X|$.
Now we can prove Theorem 1.
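Proof. By Lemma 1, $X_{opt}$ has minimum size and, hence, by Lemma 3, there exists a path cover W with $|W| = |X_{opt}|$. By the optimality of $W_{opt}$, $|W_{opt}| \le |W| = |X_{opt}|$. Next, by Lemma 2, $\mathrm{spell}(W_{opt})$ represents K and, by definition, $|\mathrm{spell}(W_{opt})| = |W_{opt}|$. Since $X_{opt}$ has minimum size, $|X_{opt}| \le |\mathrm{spell}(W_{opt})| = |W_{opt}|$. This proves $|X_{opt}| = |W_{opt}|$. Lemma 1 then implies the Theorem.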
4. Lower Bound on the Weight of a SPSS Representation
In this section, we will prove a lower bound on the size of a path cover of a bidirected graph, which, by Theorem 1, gives a lower bound on the weight of any SPSS representation. Finding the minimum size of a path cover in general directed graphs is NP-hard, since a directed graph has a Hamiltonian path if and only if it has a path cover of size 1. However, we do not know the complexity of the problem when restricted to compacted de Bruijn graphs of k-mer sets. The minimum size of a path cover is known to be bounded from above by the maximum size of an independent set (at least for directed graphs; Diestel, 2005); however, finding a maximum independent set is itself NP-hard. We, therefore, take a different approach.
For this section, let $G = (V, E)$ be a bidirected graph without loops and let W be a path cover. A vertex-side is a pair $(u, s_u)$, where $u \in V$ and $s_u \in \{0, 1\}$. For a non-isolated vertex u, we say $(u, s_u)$ is a dead-side if there are no edges incident to $(u, s_u)$. Note that the number of dead-sides is, by definition, the number of dead-end vertices. Consider a walk $(u_0, e_1, u_1, \ldots, e_n, u_n)$ with $n \ge 1$. Denote its endpoint-sides as $(u_0, 1 - \mathrm{side}(u_0, e_1))$ and $(u_n, 1 - \mathrm{side}(u_n, e_n))$. If a walk contains just one vertex $u$, then denote its endpoint-sides as $(u, 0)$ and $(u, 1)$.
We observe that every walk in a path cover must have two unique endpoint-sides. Our strategy is to give a lower bound on the number of endpoint-sides, thereby giving a lower bound on the size of a path cover. We know, for instance, that dead-sides must be endpoint-sides and we also know that the sides of an isolated vertex must be endpoint-sides. For other cases, we cannot predict exactly the endpoint-sides, but we can create disjoint sets of vertex-sides (which we call special neighborhoods) such that, for each set, we can guarantee that all but one of its vertex-sides are endpoint-sides. Formally, for a vertex-side $(u, s_u)$, its special neighborhood $B_{u,s_u}$ is the set of vertex-sides $(v, s_v)$ such that there exists an edge between $(u, s_u)$ and $(v, s_v)$ and it is the only edge incident on $(v, s_v)$. A vertex-side that belongs to a special neighborhood is called a special-side. Figure 2 shows an example. Our key lemma is that all but one member of a special neighborhood must be an endpoint-side:
Lemma 4. For a vertex-side $(u, s_u)$, there must be at least $|B_{u,s_u}| - 1$ endpoint-sides of W in $B_{u,s_u}$.
Proof. Assume without loss of generality that $|B_{u,s_u}| > 1$, since the lemma is otherwise vacuous. Let $(v, s_v) \in B_{u,s_u}$ be a vertex-side that is not an endpoint-side in W, and let $w_v$ be the walk containing v. Since, in particular, $(v, s_v)$ is not an endpoint-side of $w_v$, then $w_v$ must contain an edge incident to $(v, s_v)$. By definition of special neighborhood, the only such edge is incident to $(u, s_u)$. By definition of a path cover, there can only be one walk in W that contains an edge incident to $(u, s_u)$ and it can contain only one such edge. Hence, there can only be one $(v, s_v) \in B_{u,s_u}$ that is not an endpoint-side.
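Next, we show that the special neighborhoods are disjoint, and we can therefore define $n_{sp} = \sum_{u \in V,\, s_u \in \{0,1\}} \max(0, |B_{u,s_u}| - 1)$ as a lower bound on the number of special-sides that are endpoint-sides:
Lemma 5. There are at least $n_{sp}$ special-sides that are endpoint-sides of W.
Proof. We claim that $B_{u,s_u} \cap B_{v,s_v} = \emptyset$ for all $(u, s_u) \ne (v, s_v)$. Let $(w, s_w) \in B_{u,s_u} \cap B_{v,s_v}$. By definition of $B_{u,s_u}$, the only edge touching $(w, s_w)$ is incident to $(u, s_u)$. Similarly, by definition of $B_{v,s_v}$, the only edge touching $(w, s_w)$ is incident to $(v, s_v)$. Hence, $(u, s_u) = (v, s_v)$. The Lemma follows by applying Lemma 4 to each $B_{u,s_u}$ and summing the result.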
Finally, we are ready to prove our lower bound on the size of path cover.
Theorem 2. $|W| \ge \lceil (n_{dead} + n_{sp})/2 \rceil + n_{iso}$.
Proof. Define a walk as isolated if it has only one vertex and that vertex is isolated. There are exactly $n_{iso}$ isolated walks in W. Next, dead-sides are trivially endpoint-sides of a non-isolated walk in W and, by Lemma 5, so are at least $n_{sp}$ of the special-sides. Since the set of dead-sides and the set of special-sides are, by their definition, disjoint, the number of distinct endpoint-sides of non-isolated walks is at least $n_{dead} + n_{sp}$. Since every walk in a path cover must have exactly two distinct endpoint-sides, there must be at least $\lceil (n_{dead} + n_{sp})/2 \rceil$ non-isolated walks.
By applying Theorem 1 to Theorem 2 and observing that loops do not affect path covers, we get a lower bound on the minimum weight of any SPSS representation:
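Corollary 1. Let K be a set of canonical k-mers and let $X_{opt}$ be its minimum weight SPSS representation. Then, $\mathrm{weight}(X_{opt}) \ge |K| + (k - 1)\left(\lceil (n_{dead} + n_{sp})/2 \rceil + n_{iso}\right)$, where $n_{dead}$, $n_{sp}$, and $n_{iso}$ are defined with respect to the graph obtained by removing loops from $\mathrm{cdBG}(K)$.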
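To make the three quantities in the bound concrete, the following sketch counts isolated vertices, dead-end vertices, and special-sides on a small hand-built bidirected graph. The edge encoding (4-tuples of vertex, side, vertex, side) follows Section 2.2; the function name and the toy graph are hypothetical, and this is an illustration rather than the code used in our experiments.

```python
from collections import defaultdict
from math import ceil

def path_cover_lower_bound(vertices, edges):
    """Lower bound on the size of any path cover, assuming edges are (u, su, v, sv) tuples."""
    incident = defaultdict(set)          # vertex-side -> neighbouring vertex-sides
    for u, su, v, sv in edges:
        if u == v:
            continue                     # loops are removed before counting
        incident[(u, su)].add((v, sv))
        incident[(v, sv)].add((u, su))

    n_iso = n_dead = n_sp = 0
    for u in vertices:
        empty_sides = sum(1 for s in (0, 1) if not incident[(u, s)])
        if empty_sides == 2:
            n_iso += 1                   # no edge on either side
        elif empty_sides == 1:
            n_dead += 1                  # exactly one side without edges
        for s in (0, 1):
            # special neighbourhood: neighbours whose only edge leads back to (u, s)
            special = [nb for nb in incident[(u, s)] if len(incident[nb]) == 1]
            n_sp += max(0, len(special) - 1)
    return ceil((n_dead + n_sp) / 2) + n_iso

# Side 1 of "c" fans out to three leaves; "z" is isolated.
vertices = ["c", "a", "b", "d", "z"]
edges = [("c", 1, "a", 0), ("c", 1, "b", 0), ("c", 1, "d", 0)]
print(path_cover_lower_bound(vertices, edges))  # prints: 4
```

In this toy graph the bound is attained: one walk joins c with a single leaf, and the remaining two leaves and the isolated vertex each form their own walk.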
We note that the lower bound is not tight, as in the example of Figure 2B; it can likely be improved by accounting for higher-order relationships in G. However, the empirical gap between our lower bound and algorithm is so small (Section 7) that we did not pursue this direction.
5. The UST Algorithm for Computing a SPSS Representation
In this section, we describe our algorithm called UST for computing an SPSS representation of a set of k-mers K. We first use the Bcalm2 tool (Chikhi et al., 2016) to construct $\mathrm{cdBG}(K)$, then find a path cover W of $\mathrm{cdBG}(K)$, and finally output $\mathrm{spell}(W)$, which by Lemma 2 is an SPSS representation of K.
The UST constructs a path cover W by greedily exploring the vertices, with each vertex explored exactly once. We maintain the invariant that W is a path cover over all the vertices explored up to that point, and that the currently explored vertex is an endpoint of a walk in W. To start, we pick an arbitrary vertex u, add a walk consisting of only u to W, and start an exploration from u.
An exploration from u works as follows. First, we mark u as explored. Let $w_u$ be the walk in W that contains u as an endpoint, and let $s_u$ be the endpoint-side of u in $w_u$. We then search for an edge $e = (u, s_u, v, s_v)$, for some v and $s_v$. If we find such an edge and v has not been explored, then we extend $w_u$ with e and start a new exploration from v. If v has been explored and is an endpoint vertex of a walk $w_v$ in W, then we merge $w_u$ and $w_v$ together if the orientations allow (i.e., if $(v, s_v)$ is an endpoint-side of $w_v$) and start a new exploration from an arbitrary unexplored vertex. In all other cases (i.e., if e is not found, if the orientations do not allow merging $w_v$ with $w_u$, or if v is an internal vertex in $w_v$), we start a new exploration from an arbitrary unexplored vertex. The algorithm terminates once all the vertices have been explored. It follows directly via the loop invariant that the algorithm finds a path cover, though we omit an explicit proof.
In our implementation, we do not store the walks W explicitly but rather just store a walk ID at every vertex along with some associated information. This makes the algorithm run-time and memory linear in the number of vertices and the number of edges, except for the possibility of needing to merge walks (i.e., merging of wu and wv). But we implement these operations by using a union-find data structure, making the total time near-linear.
We note that the UST's path cover depends on the arbitrary choices of which vertex to explore. Figure 1C gives an example of where this leads to suboptimal results. However, our results indicate that UST cannot be significantly improved in practice, at least for the datasets we consider (Section 7).
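The exploration procedure can be illustrated with a short sketch. It is not the UST implementation and departs from it in several labeled ways: it runs directly on k-mer vertices rather than on the Bcalm2 compacted graph, builds edges by a quadratic all-pairs overlap test, stores walks as explicit lists instead of walk IDs with union-find, explores only from the tail of the current walk, prefers an unexplored neighbor over a merge, and refuses to merge a walk with itself. All function names and the toy k-mer set are hypothetical.

```python
from collections import defaultdict

COMP = str.maketrans("ACGT", "TGCA")

def orient(x, s):
    """Return x if s == 0, otherwise its reverse complement."""
    return x if s == 0 else x.translate(COMP)[::-1]

def build_edges(labels, k):
    """Map each vertex-side (u, su) to the vertex-sides (v, sv) it overlaps (loops dropped)."""
    adj = defaultdict(list)
    for u, lu in labels.items():
        for su in (0, 1):
            suffix = orient(lu, 1 - su)[-(k - 1):]  # end of a walk leaving u via side su
            for v, lv in labels.items():
                for sv in (0, 1):
                    if u != v and orient(lv, sv)[:k - 1] == suffix:
                        adj[(u, su)].append((v, sv))
    return adj

def greedy_path_cover(labels, k):
    """Greedy cover; a walk is a list of (vertex, side through which the walk enters it)."""
    adj = build_edges(labels, k)
    walks, walk_of = {}, {}
    for start in labels:                      # arbitrary (here: insertion) order
        if start in walk_of:
            continue
        walks[start], walk_of[start] = [(start, 0)], start
        u = start
        while u is not None:
            w = walks[walk_of[u]]
            nbrs = adj.get((u, 1 - w[-1][1]), [])   # edges at the free side of the tail u
            fresh = next((n for n in nbrs if n[0] not in walk_of), None)
            if fresh is not None:             # extend the walk and explore from v
                w.append(fresh)
                walk_of[fresh[0]] = walk_of[u]
                u = fresh[0]
                continue
            for v, sv in nbrs:                # otherwise try to merge with another walk
                wv = walks[walk_of[v]]
                if wv is w:
                    continue                  # merging a walk with itself would form a cycle
                if wv[-1] == (v, 1 - sv):     # v is the tail of wv: flip wv so it starts at v
                    wv = [(x, 1 - o) for x, o in reversed(wv)]
                if wv[0] != (v, sv):
                    continue                  # side sv of v is already used inside wv
                del walks[walk_of[v]]
                w.extend(wv)
                for x, _ in wv:
                    walk_of[x] = walk_of[u]
                break
            u = None                          # restart from an unexplored vertex
    return list(walks.values())

def spell(walk, labels, k):
    """Glue the oriented labels along a walk."""
    (v0, o0), rest = walk[0], walk[1:]
    out = orient(labels[v0], o0)
    for v, o in rest:
        out += orient(labels[v], o)[k - 1:]
    return out

k = 3
labels = {x: x for x in ["AAC", "ACG", "ACT", "CAG", "CCA"]}  # hypothetical canonical 3-mers
spss = [spell(w, labels, k) for w in greedy_path_cover(labels, k)]
print(spss)  # prints: ['AACG', 'ACTGG']
```

On this toy input, five 3-mers (15 characters) are stored in two strings with nine characters in total. Changing the insertion order of `labels` changes which walks are formed, which mirrors the dependence on arbitrary choices noted above.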
6. Applications
We apply the UST to solve two problems. First, we use it to construct a compression algorithm UST-Compress. UST-Compress supports only compression and decompression and not membership and is intended to reduce disk space. We take K as input [in the binary output format of either DSK (Rizk et al., 2013) or Jellyfish (Marçais and Kingsford, 2011)], run UST on K, and finally compress the resulting SPSS by using a generic nucleotide compressor MFC (Pinho and Pratas, 2013). UST-Compress can also be run in a mode that takes as input a count associated with each k-mer. In this mode, it outputs a list of counts in the order of their respective k-mers in the output SPSS representation (this is a trivial modification to UST). This list is then compressed by using the generic LZMA compression algorithm. Note that we use MFC and LZMA due to their superior compression ratios, but other compressors could be substituted. To decompress, we simply run the MFC or LZMA decompressing algorithm.
Second, we use UST to construct an exact static membership data structure UST-FM. Given K, we first run UST on K, and then construct an FM-index (Ferragina and Manzini, 2000) (as implemented in https://github.com/jts/dbgfm) on top of the resulting SPSS representation. The FM-index then supports membership queries. In comparison to hash-based approaches, the FM-index does not support insertion or deletion; on the other hand, it allows membership queries of strings shorter than k.
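The query semantics of such an index can be illustrated without an FM-index. The sketch below is a naive stand-in that scans every string of the SPSS with Python's substring operator; it is not how dbgfm works internally and is far slower, but it answers the same question. The function name and the toy SPSS are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def contains(spss, query):
    """True if the query or its reverse complement occurs in some string of the SPSS."""
    rc = query.translate(COMPLEMENT)[::-1]
    return any(query in s or rc in s for s in spss)

spss = ["AACG", "ACTGG"]        # a toy SPSS for k = 3
print(contains(spss, "CAG"))    # True: its reverse complement CTG occurs in ACTGG
print(contains(spss, "GGG"))    # False
print(contains(spss, "CT"))     # True: queries shorter than k are also answered
```

Each string is searched separately, so a query can never match across the boundary between two strings of the SPSS.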
7. Empirical Results
We use different types of publicly available sequencing data, because each type may result in a de Bruijn graph with different properties and may inherently be more or less compressible. Our datasets include human, bacterial, and fish samples; they also include genomic, metagenomic, and RNA-seq data (Table 1). Each dataset was k-mer counted by using DSK (Rizk et al., 2013), using k = 31 with singleton k-mers removed. Although these are not the optimal values for each of the respective applications, it allows us to have a uniform comparison across datasets. In addition, we k-mer count one of the datasets with k = 61, removing singletons, to study the effect of k-mer size. All our experiments were run on a server with an Intel® Xeon® CPU E5-2683 v4 @ 2.10 GHz with 64 cores and 512 GB of memory. All tested algorithms were verified for correctness in all datasets. Table 2 shows the version numbers of all tools tested, and further reproducibility details are available at https://github.com/medvedevgroup/UST/tree/master/experiments.
Table 1.
Dataset | Source | No. of reads | Read length (bp) | No. of distinct k-mers |
---|---|---|---|---|
Zebrafish RNA-seq | SRX3022435 | 59,741,039 | 101 | 124,740,993 |
Human RNA-seq | SRR957915 | 49,459,840 | 101 | 101,017,526 |
Human chromosome 14 | GAGE (Salzberg et al., 2012) | 36,504,800 | 101 | 99,941,572 |
Whole human genome | SRR034939 | 36,201,642 | 100 | 391,766,120 |
Human gut metagenome | SRR341725 | 25,479,128 | 90 | 103,814,001 |
Human RNA-seq (k = 61) | SRR957915 | 49,459,840 | 101 | 75,013,109 |
Singletons are not included in the k-mer count. Unless otherwise stated, k = 31.
Table 2.
Tool | URL | Git commit hash/version | Non-default option |
---|---|---|---|
Bcalm2 | https://github.com/gatb/bcalm | f4e0012e8056c56a04c7b00a927c260d5dbd2636 | -kmer-size 31 -abundance-min 2 -all-abundance-counts |
Cosmo/VARI | https://github.com/cosmo-team/cosmo/tree/VARI | d35bc3dd2d6ba7861232c49274dc6c63320cedc1 | -d |
Dbgfm | https://github.com/jts/dbgfm | ef82d38af2c402beab9ef9f12a72e7dcaeff210c | |
KMC | https://github.com/refresh-bio/KMC | 85ad76956d890aa24fc8525eee5653078ed86ace | -fa -k31 -ci2 -sm -m2 |
Squeakr | https://github.com/splatlab/squeakr | aa30936a40ac07b556d48b867ccadcebc5525021 | -e -k 31 -c 2 -s 2000 -t 1 |
McCortex | https://github.com/mcveanlab/mccortex/ | d3901d900cacff376e1201e86223adf1cc56784a | |
MFC | http://bioinformatics.ua.pt/software/mfcompress/ | Version 1.01 | |
Default options were used except as noted in the last column. We show the options for k = 31. Reproducibility details are available at https://github.com/medvedevgroup/UST/tree/master/experiments.
7.1. Evaluation of the UST representation
We compare our UST representation against the unitig representation as well as against the SPSS lower bound of Corollary 1 (Table 3, with a deeper breakdown in Table 4). The UST reduces the number of nucleotides (i.e., weight) compared to the unitigs by 10%–32%, depending on the dataset. The number of nucleotides obtained is always within 3% of the SPSS lower bound; in fact, when considering the gap between the unitig representation and the lower bound, UST closes 92%–99% of that gap. These results indicate that our greedy algorithm is a nearly optimal SPSS representation, on these datasets. They also indicate that the lower bound of Corollary 1, though not theoretically tight, is nearly tight on the type of real data captured by our experiments.
Table 3.
Dataset | No. of distinct k-mers | SPSS lower bound: no. of strings | SPSS lower bound: nt/k-mer | UST: no. of strings | UST: nt/k-mer | Unitigs: no. of strings | Unitigs: nt/k-mer |
---|---|---|---|---|---|---|---|
Zebrafish RNA-seq | 124,740,993 | 3,979,856 | 1.96 | 4,174,867 | 2.00 | 7,775,719 | 2.87 |
Human RNA-seq | 101,017,526 | 3,924,803 | 2.17 | 4,132,115 | 2.23 | 7,665,682 | 3.28 |
Human chromosome 14 | 99,941,572 | 2,235,267 | 1.67 | 2,386,324 | 1.72 | 4,871,245 | 2.46 |
Whole human genome | 391,766,120 | 13,964,825 | 2.07 | 14,423,449 | 2.10 | 19,581,835 | 2.50 |
Human gut metagenome | 103,814,001 | 1,517,107 | 1.34 | 1,522,139 | 1.34 | 2,187,669 | 1.49 |
Human RNA-seq (k = 61) | 75,013,109 | 2,651,729 | 3.12 | 2,713,825 | 3.17 | 4,371,173 | 4.50 |
The second column shows $|K|$. For a representation X, the number of strings is $|X|$ and the number of nucleotides per distinct k-mer is $\mathrm{weight}(X)/|K|$. Unitigs were computed by using BCALM2.
SPSS, spectrum-preserving string set; UST, Unitig-STitch.
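The nt/k-mer columns of Table 3 follow from the string counts. The sketch below recomputes them for three rows, assuming the identity weight(X) = |K| + |X|(k - 1) from Lemma 1, and also computes the fraction of the gap between unitigs and the lower bound that UST closes. The function names are our own; the numbers are copied from Table 3.

```python
# (dataset, k, |K|, strings in lower bound, strings in UST, strings in unitigs)
ROWS = [
    ("Zebrafish RNA-seq", 31, 124_740_993, 3_979_856, 4_174_867, 7_775_719),
    ("Human chromosome 14", 31, 99_941_572, 2_235_267, 2_386_324, 4_871_245),
    ("Human RNA-seq (k = 61)", 61, 75_013_109, 2_651_729, 2_713_825, 4_371_173),
]

def nt_per_kmer(num_kmers, num_strings, k):
    """weight(X) / |K|, with weight(X) = |K| + |X|(k - 1)."""
    return (num_kmers + num_strings * (k - 1)) / num_kmers

def gap_closed(lower, ust, unitigs):
    """Fraction of the unitig-to-lower-bound gap (in number of strings) closed by UST."""
    return (unitigs - ust) / (unitigs - lower)

for name, k, n, lower, ust, unitigs in ROWS:
    cols = [round(nt_per_kmer(n, x, k), 2) for x in (lower, ust, unitigs)]
    print(name, cols, f"{gap_closed(lower, ust, unitigs):.1%}")
```

Because the weight is linear in the number of strings, the gap closed is the same whether it is measured in strings or in nucleotides.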
Table 4.
Dataset | |||
---|---|---|---|
Zebrafish RNA-seq | 13 | 18 | 21 |
Human RNA-seq | 13 | 20 | 18 |
Human chromosome 14 | 10 | 14 | 21 |
Whole human genome | 54 | 8 | 9 |
Human gut metagenome | 44 | 10 | 15 |
Human RNA-seq (k = 61) | 24 | 22 | 15 |
7.2. Evaluation of UST-Compress
We measure the compressed space usage (Table 5), compression time and memory (Table 6), and decompression time and memory. We compare against the following lossless compression strategies: (1) the binary output of the k-mer counters DSK (Rizk et al., 2013), KMC (Kokot et al., 2017), and Squeakr-exact (Pandey et al., 2017c); (2) the original FASTA sequences, with headers removed; (3) the maximal unitigs; and (4) the BOSS representation (Bowe et al., 2012) (as implemented in COSMO [https://github.com/cosmo-team/cosmo/tree/VARI]). In all cases, the stored data are additionally compressed by using MFC (for nucleotide sequences, i.e., 2 and 3) or LZMA (for binary data, i.e., 1 and 4). The second strategy (which we already discussed in Section 1.1) is not a k-mer compression strategy per se, but it is how many users store their data in practice. The fourth strategy uses BOSS, the empirically most space efficient exact membership data structure according to a recent comparison (Crawford et al., 2018). We include this comparison to measure the advantage that can be gained by not needing to support membership queries. Note that strategies 1 and 2 retain count information, unlike strategies 3 and 4. Squeakr-exact also has an option to store only the k-mers, without counts.
Table 5.
Dataset | With counts: Squeakr | With counts: KMC | With counts: DSK | With counts: FASTA | With counts: UST-Compress | Without counts: Squeakr | Without counts: BOSS | Without counts: Unitigs | Without counts: UST-Compress |
---|---|---|---|---|---|---|---|---|---|
Zebrafish RNA-seq | 91 | 41 | 47 | 33 | 5.4 | 45 | 5.9 | 5.0 | 3.6 |
Human RNA-seq | 94 | 41 | 48 | 41 | 6.3 | 41 | 6.9 | 5.8 | 4.1 |
Human chromosome 14 | 98 | 43 | 48 | 49 | 5.8 | 41 | 5.5 | 4.3 | 3.1 |
Whole human genome | 85 | 41 | 43 | 17 | 4.7 | 40 | 7.0 | 4.7 | 4.1 |
Human gut metagenome | 90 | 46 | 51 | 23 | 4.2 | 44 | 5.3 | 3.0 | 2.7 |
Human RNA-seq (k = 61) | — | 82 | 77 | 41 | 6.4 | — | 9.0 | 5.5 | 4.3 |
We show the average number of bits per distinct k-mer in the dataset. All files are compressed with MFC or LZMA, in addition to the tool shown in the column name. Squeakr-exact's implementation is limited in the k-mer sizes it supports (Pandey et al., 2017c), and so it could not be run for k = 61.
Table 6.
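Dataset | Time (min), BOSS: Cosmo | Time (min), BOSS: LZMA | Time (min), BOSS: total | Time (min), Unitigs: bcalm2 | Time (min), Unitigs: MFC | Time (min), Unitigs: total | Time (min), UST-Compress: UST | Time (min), UST-Compress: MFC | Time (min), UST-Compress: total | Peak memory (GB): BOSS | Peak memory (GB): Unitigs | Peak memory (GB): UST-Compress |
---|---|---|---|---|---|---|---|---|---|---|---|---|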
Zebrafish RNA-seq | 6.3 | 0.7 | 7.0 | 3.0 | 1.5 | 4.4 | 1.5 | 0.9 | 5.3 | 4.0 | 3.1 | 3.1 |
Human RNA-seq | 4.0 | 0.8 | 4.8 | 4.7 | 1.3 | 5.9 | 1.6 | 0.8 | 7.1 | 3.6 | 3.4 | 3.4 |
Human chromosome 14 | 4.9 | 0.5 | 5.4 | 2.1 | 1.0 | 3.1 | 1.1 | 0.7 | 3.9 | 4.2 | 3.4 | 3.4 |
Whole human genome | 17.3 | 3.0 | 20.3 | 10.4 | 2.2 | 12.5 | 4.1 | 1.9 | 16.3 | 4.0 | 4.3 | 4.3 |
Human gut metagenome | 6.6 | 0.7 | 7.3 | 3.2 | 0.9 | 4.0 | 0.5 | 0.8 | 4.5 | 3.3 | 3.9 | 3.9 |
Human RNA-seq (k = 61) | 4.4 | 0.6 | 5.0 | 3.6 | 3.9 | 7.5 | 1.1 | 2.4 | 7.1 | 4.3 | 2.3 | 2.3 |
For BOSS and unitigs, the times are separated according to the two steps of compression: running the core algorithm (Cosmo and bcalm2) followed by the generic compressor (respectively, LZMA and MFC). For UST-Compress, the first step is exactly the same as for unitigs (Bcalm2), so the column is not repeated.
First, we observe that compared with the compressed native output of k-mer counters, UST-Compress reduces the space by roughly an order of magnitude; this, however, comes at the expense of compression time. When the value of k is increased, the improvement grows further: as k nearly doubles, the UST-Compress output size remains the same, whereas the compressed binary files output by k-mer counters approximately double in size. Our results indicate that when disk space is a more limited resource than compute time, SPSS-based compression can be very beneficial.
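The "order of magnitude" claim can be checked directly from Table 5. The following is a small sketch that divides the bits per k-mer of each k-mer counter by that of UST-Compress, using three rows of the with-counts columns; choosing the with-counts columns for this comparison is our reading of the text, and the helper name `reduction_factors` is introduced here only for illustration.

```python
# Bits per distinct k-mer, copied from the with-counts columns of Table 5.
TABLE5_WITH_COUNTS = {
    "Zebrafish RNA-seq": {"Squeakr": 91, "KMC": 41, "DSK": 47, "UST-Compress": 5.4},
    "Human RNA-seq": {"Squeakr": 94, "KMC": 41, "DSK": 48, "UST-Compress": 6.3},
    "Human RNA-seq (k = 61)": {"KMC": 82, "DSK": 77, "UST-Compress": 6.4},
}


def reduction_factors(row):
    """Return how many times smaller UST-Compress is than each k-mer counter."""
    ust = row["UST-Compress"]
    return {tool: bits / ust for tool, bits in row.items() if tool != "UST-Compress"}


for dataset, row in TABLE5_WITH_COUNTS.items():
    factors = reduction_factors(row)
    print(dataset, {tool: round(f, 1) for tool, f in factors.items()})
```

Comparing the two Human RNA-seq rows also illustrates the effect of k: the counter outputs roughly double in size while the UST-Compress figure barely changes.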
Second, we observe a 4–8× space improvement compared with just compressing the reads FASTA file. In this case, however, the extra time needed for UST compression is balanced by the extra time needed to recount the k-mers from the FASTA file. Therefore, if all that is used downstream are the k-mers and possibly their counts, then SPSS-based compression is again very beneficial. Third, UST-Compress uses between 39% and 48% less space than BOSS, with comparable construction time and memory. Fourth, compared with the other SPSS-based compression (based on maximal unitigs), UST-Compress uses 10% to 29% less space, but it has 10% to 24% slower compression times (with the exception of the k = 61 dataset, where it compresses 6% faster). The ratio of space savings after compression closely parallels the ratio of the weights of the two SPSS representations (Table 3). Fifth, we note that the best compression ratios achieved are significantly better than the worst-case lower bound of Conway and Bromage (2011), both for the k = 61 dataset (95 bits per k-mer) and for the other datasets. Finally, we note that the differences in the peak construction memory, and in the total decompression run time and memory (table not shown), were negligible.
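For readers who want to reproduce a bound of this kind, the Conway and Bromage (2011) worst-case bound for storing n distinct k-mers is log2 of the binomial coefficient C(4^k, n) bits. The sketch below computes the bound per k-mer using the approximation C(N, n) ≈ N^n / n!, which is an assumption that is accurate only when n is much smaller than N = 4^k. The function name and the example value of n are hypothetical and are not taken from the datasets above.

```python
import math


def conway_bromage_bits_per_kmer(k, n):
    """Approximate log2(C(4**k, n)) / n, the worst-case bits per k-mer.

    Uses C(N, n) <= N**n / n!, which is tight when n is much smaller
    than N = 4**k (an assumption of this sketch).
    """
    log2_universe = 2 * k  # log2(4**k)
    log2_n_factorial = math.lgamma(n + 1) / math.log(2)
    return log2_universe - log2_n_factorial / n


# Hypothetical set size, for illustration only.
n = 10**8
for k in (31, 61):
    print(k, round(conway_bromage_bits_per_kmer(k, n), 1))
```

Because the universe term grows as 2k while the second term depends only on n, the bound rises by 2 bits per k-mer for every unit increase in k at a fixed set size.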
We also compressed a subset of samples from a de-noised index of 450,000 microbial DNA datasets used recently in the large-scale indexing projects BIGSI (Bradley et al., 2019) and COBS (Bingmann et al., 2019). Each sample consists of error-corrected 31-mers (without abundance information) from a corresponding sequencing experiment, natively stored as a bzipped McCortex binary file [see Bingmann et al. (2019) and Bradley et al. (2019) for details]. We downloaded 19,000 of these files from http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx/. We ran UST-Compress, which reduced the disk space from 507 to 14.7 GB, a 35× reduction. The compression took a total of 82 hours and a peak memory of 3 GB (using one core).
7.3. Evaluation of UST-FM
We measure the memory taken by the data structure (Table 8), the query times (Table 9), and the time and memory taken during construction (Table 7). We compare UST-FM against two other space-efficient exact static membership data structures for k-mer sets. The first builds the FM index on top of the maximal unitigs (we refer to this as unitigs-FM, but it is referred to originally as dbgfm in Chikhi et al., 2014). The second is BOSS, which, as previously mentioned, was shown (Crawford et al., 2018) to have superior space usage. We did not compare against the Bloom filter trie (Holley et al., 2016), which is fast but uses an order of magnitude more memory than BOSS (Crawford et al., 2018). Other data structures, such as Pufferfish (Almodaresi et al., 2018), blight (Marchet et al., 2019), and Bifrost (Holley and Melsted, 2019), implement more sophisticated operations and hence use significantly more memory than BOSS. Moreover, these make use of a unitig SPSS representation and hence could potentially themselves incorporate the UST approach.
Table 8.
Dataset | BOSS | Unitigs-FM | UST-FM |
---|---|---|---|
Zebrafish RNA-seq | 7.5 | 7.9 | 5.5 |
Human RNA-seq | 9.0 | 9.2 | 6.3 |
Human chromosome 14 | 8.7 | 6.9 | 4.8 |
Whole human genome | 7.7 | 6.8 | 5.7 |
Human gut metagenome | 8.8 | 5.4 | 4.9 |
Human RNA-seq (k = 61) | 13.4 | 13.6 | 10.0 |
This was measured by taking the peak memory usage during membership queries.
Table 9.
BOSS | Unitigs-FM | UST-FM | |
---|---|---|---|
3.80 | 0.51 | 0.49 | |
1.48 | 0.38 | 0.37 | |
15.25 | 1.61 | 1.58 | |
5.10 | 0.35 | 0.37 |
The first set contains k-mers drawn from the dataset, so that UST-FM returns a hit. The second set takes randomly generated k-mers that were verified to not be present in the dataset. We measured the query times (per k-mer) after the index was already loaded into memory.
Table 7.
Dataset | BOSS: time (min) | Unitigs-FM: time (min) | UST-FM: time (min) | BOSS: memory (GB) | Unitigs-FM: memory (GB) | UST-FM: memory (GB) |
---|---|---|---|---|---|---|
Zebrafish RNA-seq | 6.3 | 24 | 17 | 4.0 | 3.1 | 3.1 |
Human RNA-seq | 4.0 | 21 | 15 | 3.6 | 3.4 | 3.4 |
Human chromosome 14 | 4.9 | 15 | 11 | 4.2 | 3.4 | 3.4 |
Whole human genome | 17.3 | 111 | 92 | 4.0 | 4.3 | 4.3 |
Human gut metagenome | 6.6 | 13 | 12 | 3.3 | 3.9 | 3.9 |
Human RNA-seq (k = 61) | 4.2 | 16 | 9 | 4.3 | 2.3 | 2.3 |
First, the UST-FM index is 25%–44% smaller and the queries are 4 to 11 times faster compared with BOSS; however, it takes 2 to 5 times longer to build. This time is dominated by FM-index construction, rather than by UST. Second, the UST-FM index is 10%–32% smaller than the unitigs-FM index, with a negligibly faster query time. Finally, the memory use during construction was similar for all approaches.
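To make the membership query concrete, here is a toy sketch of an FM-index built over a set of strings and queried by backward search. It is not the UST-FM or dbgfm implementation: the suffix array is built naively, the occurrence table is stored uncompressed, and the choices of separator, terminator, and checking both a k-mer and its reverse complement are assumptions made for this illustration. All function names are hypothetical.

```python
SEP, END = "$", "#"  # END must sort before every other character in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def build_fm_index(strings):
    """Build a toy FM-index (C array and occurrence table) over the strings."""
    text = SEP.join(strings) + END
    # Naive suffix array; fine for a small illustrative input.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of pattern in the indexed text by backward search."""
    first, occ, size = index
    lo, hi = 0, size
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def contains_kmer(index, kmer):
    """Report a hit if the k-mer or its reverse complement is indexed."""
    return (count_occurrences(index, kmer) > 0
            or count_occurrences(index, reverse_complement(kmer)) > 0)


index = build_fm_index(["ACGTAC", "GGA"])
print(contains_kmer(index, "CGT"), contains_kmer(index, "CGG"))
```

The separator prevents a query from matching across two stored strings, which is why an index over fewer, longer strings can be smaller without changing the set of k-mers it reports.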
8. Conclusion
In this article, we define the notion of an SPSS representation of a set of k-mers, give a lower bound on what could be achieved by such a representation, and give an algorithm to compute a representation that comes close to the lower bound. We demonstrate the applicability of the SPSS definition by using our algorithm to substantially improve space efficiency of the state of the art in two applications.
A natural question is why we limit ourselves to SPSS representations. One can imagine alternative strategies, such as allowing a k-mer to appear more than once in the string set, or allowing other types of characters. In fact, for any concrete application, one might argue that an SPSS representation is too restrictive and can be improved. However, we chose to focus on SPSS representations because they are the common denominator in the applications of unitig-based representations we have observed (Chikhi et al., 2014; Almodaresi et al., 2018; Holley and Melsted, 2019; Marchet et al., 2019). In this way, they retain broad applicability, as opposed to more specialized representations.
One limitation of UST is the time and memory needed to run Bcalm2 as a first step. Bcalm2 works by repeatedly gluing k-mers into longer strings, taking care to never glue across a unitig boundary. However, this care is wasted in our case, since UST then greedily glues across unitig boundaries anyway. Therefore, a potentially significant speedup and memory reduction of UST would be to implement it as a modification of Bcalm2, as opposed to running on top of it. This can keep the high-level algorithm the same but change the implementation to work directly on the k-mer set by incorporating algorithmic aspects of Bcalm2.
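As a rough illustration of what gluing directly on the k-mer set could look like, the sketch below greedily extends a seed k-mer to the right and then to the left through (k-1)-overlaps, using each k-mer exactly once. This is not UST or Bcalm2: it ignores reverse complements (forward strand only), assumes k ≥ 2, keeps everything in memory, and the names `greedy_stitch` and `spectrum` are hypothetical.

```python
def spectrum(strings, k):
    """Return the sorted list of k-mers (with multiplicity) in the strings."""
    return sorted(s[i:i + k] for s in strings for i in range(len(s) - k + 1))


def greedy_stitch(kmers):
    """Greedily glue k-mers into strings so each k-mer appears exactly once."""
    kmers = sorted(set(kmers))
    k = len(kmers[0])
    by_prefix, by_suffix = {}, {}
    for km in kmers:
        by_prefix.setdefault(km[:-1], []).append(km)
        by_suffix.setdefault(km[1:], []).append(km)
    unused = set(kmers)

    def take(candidates):
        # Claim the first candidate that has not been used yet, if any.
        for km in candidates:
            if km in unused:
                unused.remove(km)
                return km
        return None

    strings = []
    for seed in kmers:
        if seed not in unused:
            continue
        unused.remove(seed)
        s = seed
        while True:  # extend to the right through (k-1)-overlaps
            nxt = take(by_prefix.get(s[-(k - 1):], []))
            if nxt is None:
                break
            s += nxt[-1]
        while True:  # extend to the left through (k-1)-overlaps
            prv = take(by_suffix.get(s[:k - 1], []))
            if prv is None:
                break
            s = prv[0] + s
        strings.append(s)
    return strings


kmers = {"ACG", "CGT", "GTA", "TTT"}
stitched = greedy_stitch(kmers)
print(stitched, spectrum(stitched, 3) == sorted(kmers))
```

Each glue step adds exactly one new k-mer to the string, namely the one just claimed, so the output has the same spectrum as the input while using fewer characters in total.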
Acknowledgment
The authors are grateful to Rayan Chikhi for feedback and help with modifying Bcalm2.
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
P.M. and A.R. were supported by NSF awards 1453527 and 1439057. A.R. is supported by NIH Computation, Bioinformatics, and Statistics training program.
References
- Almodaresi, F., Sarkar, H., Srivastava, A., et al. 2018. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34, i169–i177.
- Belazzougui, D., Gagie, T., Mäkinen, V., et al. 2016a. Bidirectional variable-order de Bruijn graphs. In LATIN 2016: Theoretical Informatics. Springer.
- Belazzougui, D., Gagie, T., Veli, M., et al. 2016b. Fully dynamic de Bruijn graphs. Presented at the International Symposium on String Processing and Information Retrieval. Springer.
- Bingmann, T., Bradley, P., Gauger, F., et al. 2019. COBS: A compact bit-sliced signature index. arXiv:1905.09624.
- Boucher, C., Bowe, A., Gagie, T., et al. 2015. Variable-order de Bruijn graphs. Presented at the 2015 Data Compression Conference. IEEE.
- Bowe, A., Onodera, T., Sadakane, K., et al. 2012. Succinct de Bruijn graphs. In Algorithms in Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_18
- Bradley, P., den Bakker, H.C., Rocha, E.P., et al. 2019. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 37, 152.
- Břinda, K. 2016. Novel computational techniques for mapping and classifying next-generation sequencing data [doctoral dissertation]. Université Paris-Est. DOI: 10.5281/zenodo.1045317
- Břinda, K., Baym, M., and Kucherov, G. 2020. Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv. DOI: 10.1101/2020.01.12.903443
- Chikhi, R., Holub, J., and Medvedev, P. 2019. Data structures to represent sets of k-long DNA sequences. arXiv:1903.12312 [cs, q-bio].
- Chikhi, R., Limasset, A., Jackman, S., et al. 2014. On the representation of de Bruijn graphs. Presented at the International Conference on Research in Computational Molecular Biology. Springer.
- Chikhi, R., Limasset, A., and Medvedev, P. 2016. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, i201–i208.
- Conway, T.C., and Bromage, A.J. 2011. Succinct data structures for assembling large genomes. Bioinformatics 27, 479–486.
- Crawford, V.G., Kuhnle, A., Boucher, C., et al. 2018. Practical dynamic de Bruijn graphs. Bioinformatics 34, 4189–4195.
- Diestel, R. 2005. Graph theory 101
- Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. Presented at the Proceedings 41st Annual Symposium on Foundations of Computer Science, IEEE, Redondo Beach, CA, USA.
- Guo, H., Fu, Y., Gao, Y., et al. 2019. deGSM: Memory scalable construction of large scale de Bruijn Graph. IEEE/ACM Trans. Comput. Biol. Bioinform.
- Haas, B.J., Papanicolaou, A., Yassour, M., et al. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494.
- Harris, R.S., and Medvedev, P. 2018. Improved representation of sequence bloom trees. bioRxiv. DOI: 10.1101/501452
- Hernaez, M., Pavlichin, D., Weissman, T., et al. 2019. Genomic data compression. Annu. Rev. Biomed. Data Sci. 2, 19–37.
- Holley, G., and Melsted, P. 2019. Bifrost-Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv 695338.
- Holley, G., Wittler, R., and Stoye, J. 2016. Bloom Filter Trie: An alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11, 3.
- Hosseini, M., Pratas, D., and Pinho, A. 2016. A survey on data compression methods for biological sequences. Information 7, 56.
- Jones, D.C., Ruzzo, W.L., Peng, X., et al. 2012. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171–e171.
- Kokot, M., Długosz, M., and Deorowicz, S. 2017. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761.
- Kolmogorov, M., Yuan, J., and Lin, Y. 2019. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540.
- Marçais, G., and Kingsford, C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770.
- Marçais, G., Solomon, B., Patro, R., et al. 2019. Sketching and sublinear data structures in genomics. Annu. Rev. Biomed. Data Sci. 2, 93–118.
- Marchet, C., Kerbiriou, M., and Limasset, A. 2019. Indexing de Bruijn graphs with minimizers. bioRxiv. DOI: 10.1101/546309
- Medvedev, P. 2018. Modeling biological problems in computer science: A case study in genome assembly. Brief Bioinform. 20, 1376–1383.
- Numanagić, I., Bonfield, J.K., Hach, F., et al. 2016. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005.
- Orenstein, Y., Pellow, D., Marçais, G., et al. 2017. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput. Biol. 13, e1005777.
- Pan, T., Nihalani, R., and Aluru, S. 2018. Fast de Bruijn graph compaction in distributed memory environments. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 136–148.
- Pandey, P., Bender, M.A., Johnson, R., et al. 2017a. A general-purpose counting filter: Making every bit count. Presented at the Proceedings of the 2017 ACM International Conference on Management of Data. ACM.
- Pandey, P., Bender, M.A., Johnson, R., et al. 2017b. deBGR: An efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 33, i133–i141.
- Pandey, P., Bender, M.A., Johnson, R., et al. 2017c. Squeakr: An exact and approximate k-mer counting system. Bioinformatics 34, 568–575.
- Pinho, A.J., and Pratas, D. 2013. MFCompress: A compression tool for FASTA and multi-FASTA data. Bioinformatics 30, 117–118.
- Rangavittal, S., Stopa, N., Tomaszkiewicz, M., et al. 2019. DiscoverY: A classifier for identifying Y chromosome sequences in male assemblies. BMC Genomics 20, 641.
- Rizk, G., Lavenier, D., and Chikhi, R. 2013. DSK: K-mer counting with very low memory usage. Bioinformatics 29, 652–653.
- Rowe, W.P. 2019. When the levee breaks: A practical guide to sketching algorithms for processing the flood of genomic data. Genome Biol. 20, 199.
- Sahlin, K., and Medvedev, P. 2019. De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm. Presented at the International Conference on Research in Computational Molecular Biology. Springer.
- Salzberg, S.L., Phillippy, A.M., Zimin, A., et al. 2012. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567.
- Yang, X., Chockalingam, S.P., and Aluru, S. 2012. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. 14, 56–66.