Skip to main content
Algorithms for Molecular Biology : AMB logoLink to Algorithms for Molecular Biology : AMB
. 2021 Jun 21;16:10. doi: 10.1186/s13015-021-00192-7

Disk compression of k-mer sets

Amatur Rahman 1,, Rayan Chikhi 2, Paul Medvedev 1
PMCID: PMC8218509  PMID: 34154632

Abstract

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Keywords: De Bruijn graphs, Compression, k-mer sets, Spectrum-preserving string sets

Introduction

Many of today’s bioinformatics analyses are powered by tools that are k-mer based. These tools first reduce the input sequence data, which may be of various lengths and type, to a set of short fixed length strings called k-mers. K-mer based methods are used in a broad range of applications, including genome assembly [1], metagenomics [2], genotyping [3, 4], variant calling [5], and phylogenomics [6]. They have also become the basis of a recent wave of database search tools [715], surveyed in [16]. K-mer based methods are not new, but only recently they have started to be applied to terabyte-sized datasets. For example, the dataset used to test the BIGSI database search index, which is composed of 31-mers from 450,000 microbial genomes [11], takes about 12 TB to store in compressed form.

Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. For tool developers, development time is significantly increased when having to manage such large files. Due to the iterative nature of the development process, these files do not typically just sit in one place, but instead get created/moved/recreated many times. For tool users, the time it takes for the tools to write these files to disk and load them into memory is non-negligible. In addition, as we scale to even larger datasets, storage costs start to play a larger factor. Finally, for reproducibility efforts, storing and moving terabytes of data across networks can be detrimental.

To minimize these negative effects, disk compression of k-mer sets is a natural solution. By disk compression, we refer to a compressed representation that, while supporting decompression, does not support any other querying of the compressed data. Compressed representations that allow for membership queries [17] are important in their own right, but are sub-optimal when only storage is required. Most k-mer sets are currently stored on disk in one of two ways. In the situation where the set of k-mers comes from k-mer counting reads, one can simply compress the reads themselves using one of many read compression tools [1820]. This approach requires the substantial overhead of running a k-mer counter as part of decompression, but it is often used in the absence of better options. The second approach is to gzip/bzip the output of the k-mer counter [2125]. As we showed in [26], both of these approaches are space-inefficient by at least an order-of-magnitude. This is not surprising, as neither of these approaches was designed specifically for disk compression of k-mer sets.

Disk compression tailor-made for k-mer sets was first considered in our earlier work [26]. The idea was based on the concept of spectrum-preserving string sets (SPSS), introduced in [2628]. In [28], the concept of SPSS is introduced under the name simplitigs. A set of strings S is said to be a SPSS representation of a set of k-mers K iff 1) the set of k-mers contained in S is exactly K, 2) S does not contain duplicate k-mers, and 3) each string in S is of length k. The weight of an SPSS is the number of characters it contains. For example, if K={ACG,CGT,CGA}, then {ACGT,CGA} would be an SPSS of weight 7; also K itself would be an SPSS of K of weight 9. On the other hand, {CGACGT} is not an SPSS, because it contains GACK. Intuitively, a low weight SPSS can be constructed by gluing together k-mers in K, with each glue operation reducing the weight by k-1. In [26], we proposed the following simple compression strategy, called UST-Compress. We find a low-weight SPSS S, using a greedy algorithm called UST, and compress S to disk using a generic nucleotide compression algorithm (e.g. MFC [29]). UST-Compress achieved significantly better compression sizes than the two approaches mentioned above.

UST-Compress was not designed to be the best possible disk compression algorithm but only to demonstrate one of the possible applications of the SPSS concept. When the goal is specifically disk compression, we are no longer bound to store a set of strings with exactly the same k-mers as K, as long as a decompression algorithm can correctly recover K. The main idea of this paper is to replace the SPSS with a more relaxed string set representation, over the alphabet {A,C,G,T,[,],+,-}. Our approach is loosely inspired by the notion of elastic-degenerate strings [30]. It attempts to remove even more duplicate (k-1)-mers from the representation than SPSS does, using the extra alphabet characters as placeholders for nearby repetitive (k-1)-mers. For the above example, our representation would be ACG[+A]T, where the ``+ is interpreted as a placeholder for the k-1 characters before the open bracket (i.e. CG). After replacing the ``+, we get ACG[CGA]T and we split the string by cleaving out the substring within brackets; i.e., we get ACGT and CGA.

Based on this idea, we present two algorithms for the disk compression of k-mer sets, ESS-Compress and ESS-Tip-Compress. We explore the behavior of these algorithms both theoretically and on real data. We give a lower bound on how well this type of algorithm can compress. We show that they improve the compression sizes achieved by UST-Compress by up to 27% across a breadth of datasets. The two algorithms present a trade-off between time/memory and compression size, which we explore in our results. The two algorithms are freely available open source tools on http://github.com/medvedevgroup/ESSCompress.

Preliminaries

Basic definitions

Strings

The length of string x is denoted by |x|. A string of length k is called a k-mer. We assume k-mers are over the DNA alphabet. A string over the alphabet {A,C,G,T,[,],+,-} is said to be enriched. We use · as the string concatenation operator. For a set of strings S, weight(S)=xS|x| denotes the total count of characters. We define sufk(x) (respectively, prek(x)) to be the last (respectively, first) k characters of x. We define cutPrek(x)=suf|x|-k(x) as x with the prefix removed. When the subscript is omitted from pre, suf, and cutPre, we assume it is k-1. A string x is canonical if it is the lexicographically smaller of x and its reverse complement.

For x and y with suf(x)=pre(y), we define gluing x and y as xy=x·cutPre(y). For s{0,1}, we define orient(xs) to be x if s=0 and to be the reverse complement of x if s=1. We say that x0 and x1 have a (s0,s1)-oriented-overlap if suf(orient(x0,1-s0))=pre(orient(x1,s1)). Intuitively, such an overlap exists between two strings if we can orient them in such a way that they are glueable. For example, AAC and TTG have a (0, 0)-oriented overlap.

Bidirected de Bruijn graphs

A bidirected graph G is a pair (VE) where the set V are called vertices and E is a set of edges. An edge e is a set of two pairs, {(u0,s0),(u1,s1)}, where uiV and si{0,1}, for i{0,1}. Note that this differs from the notion of an edge in an undirected graph, where EV×V. Intuitively, every vertex has two sides, and an edge connects to a side of a vertex (see Fig. 1 for examples). An edge is a loop if u0=u1. Given a non-loop edge e that is incident to a vertex u, we denote side(ue) as the side of u to which it is incident. We say that a vertex u is a dead-end if it has exactly one side to which no edges are incident. A bidirected DNA graph is a bidirected graph G where every vertex u has a string label lab(u), and for every edge e={(u0,s0),(u1,s1)}, there is a (s0,s1)-oriented-overlap between lab(u0) and lab(u1) (see Fig. 1 for examples). G is said to be overlap-closed if there is an edge for every such overlap. Let K be a set of canonical k-mers. The node-centric bidirected de Bruijn graph, denoted by dBG(K), is the overlap-closed bidirected DNA graph where the vertices and their labels correspond to K. In this paper, we will assume that dBG(K) is not just a single cycle; such a case is easy to handle in practice but is a space-consuming corner-case in all the analyses.

Fig. 1.

Fig. 1

Examples of the four types of absorption. Each panel shows the edges along two paths: ψp (red vertices inside a shaded rectangle) and ψc (blue vertices inside a shaded rectangle) and an absorption edge e={(up,sp),(uc,sc)} (dashed line) between the parent unitig up and the child unitig uc. The graph being shown in each panel is cdBG(K), but only the absorption edge and the edges of ψp and ψc are shown. In this simple example, the unitigs of dBG(K) are just paths made of single vertices, and hence the vertices of cdBG(K) have labels of length k=3. Each vertex is shown as a pointed rectangle with its label inside; we use the convention that the “zero” side of a vertex is the flat side on the left, and the “one” side is the pointy side on the right. At the bottom left of each panel, we show the spectrum-preserving string set (SPSS) spell({ψp,ψc}). At the bottom right, we show the enriched representation generated by our algorithm. Depending on the value of sp and sc, four different cases can arise. When sp=1,sc=0 (shown in A), pre(lab(uc)) is replaced with marker “+”, as it is same as suf(lab(up)). When sp=1,sc=1 (shown in B), pre(lab(uc)) is replaced by “−”, as it is same as the reverse complement of suf(lab(up)). When sp=0,sc=0 (shown in C), pre(lab(uc)) is replaced with “−”, as it is the same as the reverse complement of pre(lab(up)). When sp=0,sc=1 (shown in D), suf(lab(uc)) is replaced with “+”, as it is the same as pre(lab(up))

Paths and spellings

A sequence p=(u0,e1,u1,,en,un) is a path iff (1) for all 1in, ei is incident to ui-1 and to ui, (2) for all 1in-1, side(ui,ei)=1-side(ui,ei+1), and (3) all the uis are different. A path can also be any single vertex. Vertices u1,,un-1 are called internal and u0 and un are called endpoints. We call u0 to be the initiator vertex of p. We say that p is normalized if for every ei, side(ui-1,ei)=1 and side(ui,ei)=0; intuitively, the path uses edges like in a directed graph. The spelling of a normalized path p is defined as spell(p)=lab(u0)lab(un). If P is a set of normalized paths, then spell(P)=pPspell(p).

Unitigs and the compacted de Bruijn graph

A path in dbG(K) is a unitig if all its vertices have in- and out-degrees of 1, except that the first vertex can have any in-degree and the last vertex can have any out-degree. A single vertex is also a unitig. A unitig is maximal if it is not a sub-path of another unitig. It was shown in [31] that if dBG(K) is not a cycle, then the set of maximal unitigs forms a unique decomposition of the vertices in dBG(K) into vertex-disjoint paths. The bidirected compacted de Bruijn graph of K, denoted by cdBG(K), is the overlap-closed bidirected DNA graph where the vertices are the maximal unitigs of dBG(K), and the labels of the vertices are the spellings of the unitigs. In practice, this graph can be efficiently constructed from K using the BCALM2 tool [31, 32].

Spanning out-forest

Given a directed graph D, an out-tree is a subgraph in which every vertex except one, called the root, has in-degree one, and, when the directions on the edges are ignored, is a tree. An out-forest is a collection of vertex-disjoint out-trees. An out-forest is spanning if it covers all the vertices of D.

Path covers and UST-compress

A vertex-disjoint normalized path cover Ψ of cdBG(K) is a set of normalized paths such that every vertex is in exactly one path and no path visits a vertex more than once; we will sometimes use the shorter term path cover to mean the same thing. There is a close relationship between SPSS representations of K and path covers, shown in [26]. In particular, a path cover Ψ induces the SPSS spell(Ψ). An example of a path cover is one where every vertex of cdBG(K) is in its own path, and the corresponding SPSS is the set of all maximal unitig sequences. Figures 1 and 2 show examples of path covers. The number of paths in Ψ (denoted as |Ψ|) and the weight of the induced SPSS is closely related:

weight(spell(Ψ))=|K|+|Ψ|(k-1) 2.1

This relationship also translates to the number of edges in Ψ; by its definition, the number of edges in Ψ is simply the number of vertices in cdBG(K) minus |Ψ|.

Fig. 2.

Fig. 2

More complex absorption examples. In A, one path absorbs multiple paths. In B, one unitig up absorbs multiple paths. In C, one path (ψ1) absorbs another (ψ2) which itself absorbs another (ψ3). This is a recursive absorption, showing how a path can be both a child and a parent

The idea of our previous algorithm UST-Compress [26] is to find a path cover ΨUST with as many edges as possible. Having more edges reduces the number of paths, which in turn reduces the weight of the corresponding SPSS and the size of the final compressed output. We can understand this intuitively as follows. Edges in cdBG(K) connect unitigs whose endpoints have the same (k-1)-mer (after accounting for reverse complements). For every edge we add to our path cover, we glue these two unitigs and remove one duplicate instance of the (k-1)-mer from the corresponding SPSS. Note however that ΨUST does not remove all duplicate (k-1)-mers from the SPSS, because Ψ can only have two edges incident on a vertex, one from each side, and hence a unitig can only be glued at most twice. If a unitig has edges to more than two other unitigs, then some of the adjacent unitigs would include the duplicate (k-1)-mer in the SPSS. The idea of our paper is to exploit the redundancy due to those remaining edges an thus further reduce the size of the representation.

ESS-compress

Main algorithm

Our starting point is a set of canonical k-mers K, the graph cdBG(K), and a vertex-disjoint normalized path cover Ψ of cdBG(K) returned by UST.1 To develop the intuition for our algorithm, we first consider a simple example (Fig. 1A). In this example, we see a vertex-disjoint path cover Ψ composed of two paths, ψp and ψc. There is an edge between an internal vertex (=unitig2) up of ψp and the initiator vertex uc of ψc. Such an edge is an example of an absorption edge. ESS-Compress constructs an enriched string representation of K, as shown in the figure. The basic idea is that up and uc share a common (k-1)-mer (i.e. GT). We can cut out this common portion from the string representing uc and replace it with a special marker character “+”. We can then include uc inside of the representation of up by surrounding uc with brackets. The marker character ``+ is a placeholder for the k-1 nucleotides right before the opening bracket. To decompress the enriched string, we first replace the marker to get TCGT[GTAA]T and then cleave out the bracketed string to get {TCGTT,GTAA}. This exactly recovers the SPSS representation of ψp and ψc.

Formally, we say that an edge in cdBG(K) is an absorption edge iff (1) it connects two unitigs up and uc, on two distinct paths ψp and ψc, respectively, (2) up is an internal vertex, and (3) uc is an initiator vertex. We refer to up and ψp as parents and uc and ψc as children; we also say that ψp and up absorb ψc and uc.3

Figure 1B–D shows the other cases, corresponding to the possible orientation of the absorption edge. The logic is the same, but we need to introduce a second marker character ``- that is a placeholder for the reverse complement of the last k-1 characters right before the opening bracket. In each of these cases, we add 3 extra characters (two brackets and one marker) and remove k-1 nucleotide characters. Note that, expanding the alphabet has its inherent cost, but even after taking that into account, we get lower number of characters than SPSS representation when k>4.

Next, observe that a single parent path can absorb multiple children paths, as illustrated in Fig. 2A. Also, observe that a single parent unitig can absorb more than one child path, as shown in Fig. 2B. As in the previous example, we save k-1-3=k-4 characters for every absorbed edge.

These absorptions can be recursively combined, as shown in Fig. 2C. Because we require a parent unitig to be an internal vertex and a child unitig to be an initiator vertex, the same unitig cannot be both parent and child. Therefore, ESS-Compress can construct a representation recursively, without any conflicts. The recursion tree is reflected in the nesting structure of the brackets in the enriched string.

However, we must be careful to avoid cycles in the recursion. We define the absorption digraph DA as the directed graph whose vertex set is the set of paths Ψ and an edge (ψpψc) if ψp absorbs ψc. For every edge in DA, we also associate the corresponding bidirected edge between up and uc in cdBG(K). We would like to select a subset of edges F along which to perform absorptions, so as to avoid cycles in DA and to make sure a path cannot be absorbed by more than one other path. We would also try to choose as many edges as possible, since each absorption saves k-4 characters. To achieve these goals, ESS-Compress defines F as a spanning out-forest in DA with the maximum number of edges. We postpone the algorithm to find F to Sect. "Algorithm to choose absorption edges".

The high-level pseudo-code of ESS-Compress is shown in Algorithm 1 and illustrated in Fig. 3. The recursive algorithm to create the enriched representation using F as a guide is shown in Algorithm 2. It follows the intuition we just developed. It starts from the paths that will not be absorbed (i.e. the roots in F). For a path ψp, it first computes the enriched representations of all the child paths (Lines 3–9). It then integrates them into the appropriate locations of spell(ψp) (Lines 10–14). It then uses a marker to replace the redundant sequence in the spelling of ψp, with respect to ψp’s own parent (Lines 17–31). To decide which marker to use, it receives as a parameter the absorption edge eD that was used to absorb ψp.

Fig. 3.

Fig. 3

Visual overview of the steps in Algorithm 1

Decompression is done by a recursive algorithm DEC that takes as input an enriched string x and a (k-1)-mer called markerReplacement. Initially, DEC is called independently on every enriched string xESS-Compress(K), with markerReplacement=null. We call the characters of x which are not enclosed within brackets outer. The brackets themselves are not considered outer characters. DEC first replaces any occurrence of an outer ``+ (respectively, ``-) with markerReplacement (respectively, the reverse complement of markerReplacement). It then outputs all the outer characters as a single string. Then, for every top-level open/close bracket pair in x, it calls DEC recursively on the sequence in between the brackets, and passes as markerReplacement the rightmost k-1 outer characters to the left of the open bracket.

Algorithm to choose absorption edges

Let D be any directed graph and consider the problem of finding a spanning out-forest with the maximum number of edges. We call this the problem of finding an edge-maximizing spanning out-forest. This problem is a specific instance of the maximum weight out-forest problem [33], which allows for weights to be placed on the edges. As we show in this section, there is an optimal algorithm for our problem that is simpler than the algorithm for arbitrary weights described in [33].graphic file with name 13015_2021_192_Figa_HTML.jpg

Our algorithm first decomposes D into strongly connected components, and builds SC(D), the strongly connected component digraph of D. In SC(D), the vertices are the strongly connected components of D, and there is an edge from component c1 to c2 if there is an edge in D from some vertex in c1 to some vertex in c2. For every component that is a source in SC(D), we pick an arbitrary vertex from it (in D) and put it into a “starter” set. Then, we perform a depth-first search (DFS) traversal of D, but whenever we start a new tree, we initiate it with a vertex from the starter set, if one is available. We remove the vertex from the starter set once it is used to initiate a tree. We then output the DFS forest F.

We will prove that F is a spanning out-forest of D with the maximum number of edges.

Lemma 3.1

(Correctness of edge-maximizing spanning out-forest algorithm) Let D be a directed graph, let F be the spanning out-forest returned by our algorithm run on D, and let nsc be the number of source components in SC(D). Then, the number of out-trees in F is nsc and this is the smallest possible for any spanning out-forest. Also, the number of edges in F is the maximum possible for any spanning out-forest.

Proof

Consider any spanning out-forest of D. If it has less than nsc out-trees, then by the pigeonhole principle, there are two source components c1 and c2 with vertices v1 and v2, respectively, belonging to the same out-tree. This is a contradiction, since c1 and c2 are source components and hence there cannot be a path between them. Hence, any spanning out-forest must have at least nsc out-trees. Now, consider F. Every vertex in D is reachable from one of the vertices in the starter set, by its construction. There are nsc starter vertices, so F will have at most nsc out-trees. Since any spanning out-forest must have at least nsc out-trees, F will have nsc out-trees and it will be the minimum achievable. Also, in any spanning out-forest, the number of edges is the number of vertices minus the number of out-trees; hence F will have the the maximum number of edges of any spanning out-forest.

The weight of the ESS-compress representation

In this section, we derive a formula for the weight of the ESS-Compress representation and explore the potential benefits of some variations of ESS-Compress.graphic file with name 13015_2021_192_Figb_HTML.jpg

Theorem 3.2

Let K be a set of canonical k-mers, and let Ψ be a vertex-disjoint normalized path cover of cdBG(K) that is used by ESS-Compress(K). Let nsc be the number of sources in the strongly connected component graph of the absorption graph DA. Let X be the solution returned by ESS-Compress(K). Then

weight(X)=|K|+3|Ψ|+nsc(k-4)

Proof

If we unroll the recursion of ESS-Compress, then there are exactly |Ψ| runs of Spell-Path-Enrich, one for each ψΨ. For each call, we let nψ be the number of characters in the returned string that are added non-recursively (i.e. everything except ins0 and ins1). Considering the structure of the recursion and accounting for characters in this way, we have that weight(X)=ψΨnψ.

Prior to marker replacement (Line 17, the non-recursive part of x is spell(ψ)). When ψ is a root in the absorption forest F, then the marker absorption stage is not executed and so nψ=|spell(ψ)|. Otherwise, the marker absorption phase (Lines 17 to 31) removes k-1 characters, adds 1 new marker character, and adds two new bracket characters. Hence, nψ=|spell(ψ)|-(k-1)+3=|spell(ψ)|-(k-4). By Lemma 3.1, F contains nsc roots. Hence,

weight(X)=ψΨnψ=ψis a root|spell(ψ)|+ψis not a root|spell(ψ)|-(k-4)=ψΨ|spell(ψ)|-(k-4)(|Ψ|-nsc)=|K|+3|Ψ|-nsc(k-4)

The last equality follows by applying Eq. (2.1) from Sect. "Preliminaries".

We can use Theorem 3.2 to better understand ESS-Compress. The weight depends on the choice of Ψ. The Ψ returned by UST has, empirically, almost the minimum |Ψ| possible [26]. This (almost) minimizes the 3|Ψ| term in Theorem 3.2. However, this may not necessarily lead to the lowest total weight, because there is an interplay between Ψ and nsc, as follows. Let Ψ be a vertex-disjoint normalized path cover with |Ψ|>|Ψ|. Its paths are shorter, on average, than Ψ’s. There may now be edges of cdBG(K) that become absorption edges, that were not with Ψ. For example, an edge between two unitigs which are internal in Ψ is not, by our definition, an absorption edge. With the shorter paths in Ψ, one of these unitigs may become an initiator vertex, making the edge absorbing. This may in turn improve connectivity in DA and decrease nsc, counterbalancing the increase in |Ψ|. Nevertheless, ESS-Compress does not consider alternative path covers and always uses the one returned by UST.

Another aspect of ESS-Compress that could be changed is the definition of absorption edge. We restrict absorption edges to be between an initiator unitig and an internal unitig; however, one could in principle also define ways to absorb between an endpoint unitig and an internal unitig, or between two internal unitigs. This could potentially decrease nsc by increasing the number of absorption edges, though it would likely need more complicated and space-consuming encoding schemes.

How much could be gained by modifying the path cover and the absorption rules that ESS-Compress uses? We can answer this by observing that nsc cannot be less than C, the number of connected components of the undirected graph underlying cdBG(K). At the same time, in [26] we gave an algorithm to compute an instance-specific lower bound β on the number of paths in any vertex-disjoint path cover. Putting this together, we conclude that regardless of which path cover is used and which subset of cdBG(K) edges are allowed to be absorbing, the weight of a ESS-Compress representation cannot be lower than:

|K|+3β+C(k-4) 3.1

As we will see in the results, the weight of ESS-Compress is never more than 2% higher than this lower bound, which is why we did not pursue these other possible optimizations to ESS-Compress. We note, however, that the above is not a general lower bound and does not rule out the possibility of lower-weight string set representations that beat ESS-Compress.

ESS-Tip-compress: a simpler alternative

ESS-Compress is designed to achieve a low compression size but can require a large memory stack due to its recursive structure. The memory during compression and decompression is proportional to the depth of this stack, which is the depth of the out-forest F. If F were to be more shallow, then the memory would be reduced. In this section, we describe ESS-Tip-Compress, a simpler, faster, and lower-memory technique that can be used when compression speed/memory are prioritized. It is centered on dead-end vertices in the compacted graph, which usually correspond to tips in the uncompacted dBG and are typically due to sequencing errors, endpoints of transcripts, or coverage gaps. ESS-Tip-Compress is based on the observation that a large chunk of the graph is dead-end vertices (at least for sequencing data), and limiting absorption to only them can yield much of the benefits of a more sophisticated algorithm.

First, we find a vertex-disjoint normalized path cover Ψ that is forced to have each dead-end vertex in its own dedicated path (i.e. its path only contains the vertex itself). This can be done easily by running UST on the graph obtained from cdBG(K) by removing all dead-end vertices. Next, we select the absorption forest F as follows. For each dead-end vertex v, we identify a non-dead-end vertex u which is connected to v via an edge e. In the rare case that such a u does not exist, we skip v. Otherwise, we add (uv) to F. We can assume without loss of generality that side(u,e)=1-side(v,e) because if that is not the case, than we can replace lab(v) by its reverse complement and thereby change the side to which e is incident. For any paths that remain uncovered by F, we add them as roots of their own tree. Finally, we run a slightly modified version of Spell-Path-Enrich, using this Ψ and this F.

We modify Spell-Path-Enrich as follows. First, observe that F has max depth of 2 vertices. Hence, the parenthesis generated by Spell-Path-Enrich are never nested. Second, observe that the marker value is always ``+, because side(u,e)=1-side(v,e) for all absorption edges in F. These observations allow us to reduce the number of extra characters we need for each absorption down to 2, instead of 3 (we omit the implementation details).

Empirical results

We evaluated our methods on one small bacterial dataset, two metagenomic datasets from NIH human microbiome project, reads from whole human genome, and RNA-seq reads from both human and plant (Table 1). To obtain the set of k-mers K from these datasets, we ran the DSK k-mer-counter [22] with k=31 and filtered out low-frequency k-mers (<5 for whole human genome and <2 for the other datasets). We then constructed cdBG(K) using BCALM2. The last three columns in Table 1 show the properties of the graph: number of vertices, number of dead-end vertices and total percentage of isolated vertices. We ran all our experiments single-threaded on a server with an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10 GHz processor with 64 cores and 512 GB of memory. We used /usr/bin/time to measure time and memory. Detailed steps to reproduce our experiments are available at https://github.com/medvedevgroup/ESSCompress/tree/master/experiments.

Table 1.

Dataset characteristics

Dataset Source Read length (bp) # Reads # Distinct 31-mers # unitigs % Dead-end unitigs (%) % Isolated unitigs (%)
R. sphaeroides GAGE [37] 101 2,050,868 5,908,467 442,681 47 8
Human RNA-seq SRR957915 101 49,459,840 101,017,526 7,665,682 4 13
Gingiva metagenome SRS014473 101 55,419,548 101,872,420 5,678,516 36 15
Soybean RNA-seq SRR11458718 125 83,594,116 111,206,789 3,659,969 28 12
Tongue metagenome SRS011086 101 81,664,789 165,159,726 11,358,233 37 11
Whole human ERR174310 101 207,579,467 2,319,022,432 51,094,913 14 18

The output of our tools was compressed with MFC. Note that MFC is not optimized for non-nucleotide characters, but such characters are rare in our string sets (<0.1 bits per k-mer). We compared our tools against four other approaches. The first is UST-Compress, which we showed in our previous work to outperform other disk compressors [26]. The second is to strip the read FASTA files of all non-sequence information and compress them using MFC. The third is to simply write one distinct k-mer per line to a file and compress it using MFC. The fourth is the BOSS method, as implemented in [34]. BOSS is a succinct implementation of a de Bruijn graph [35]. Though it is designed to answer membership queries, it also achieved the closest compression size to UST-Compress in our previous study [26]. As in [26], we compressed BOSS’s binary output using LZMA. We confirmed the correctness of all evaluated tools, including our own, on the datasets.

We did not explore the possibility of replacing UST in our pipeline with ProphAsm [36]. ProphAsm is an alternative algorithm to compute an SPSS called simplitigs, but we showed in [26] that the UST SPSS representation is nearly optimal, with only 2–3% difference to the lower bound of weight. Since ProphAsm computes the same kind of representation, it is impossible for it to improve result beyond 2–3%. We also did not compare against other k-mer membership data structures because in our previous paper [26], we showed that UST-Compress and BOSS achieve a better compression ratio on the tested datasets.

String set properties

We first measure the weights and sizes of our ESS-Compress and ESS-Tip-Compress, shown in Table 2. ESS-Compress uses 13–42% less characters than UST. ESS-Tip-Compress was worse than ESS-Compress (6–13% larger), but still better than UST-Compress (3–38% smaller). The lower bound computed by Eq. (3.1) is very close to the weight of ESS-Compress (within 1.7%, Table 2), indicating that the alternate strategies explored in Sect. "The weight of the ESS-Compress representation" would not be useful on these datasets.

Table 2.

The weights and sizes of various string set representations

Dataset UST ESS-Tip-Compress ESS-Compress Eq. (3.1) lower bound
# strings #char/ k-mer # strings #char/ k-mer # strings #char/ k-mer #char/ k-mer
R. sphaeroides 240,562 2.22 61,909 1.38 36,456 1.29 1.28
Human RNA-seq 4,098,389 2.22 1,834,945 1.60 1,098,938 1.42 1.39
Gingiva metagenome 3,095,476 1.91 1,499,270 1.48 917,388 1.33 1.32
Soybean RNA-seq 1,806,078 1.49 1,137,350 1.32 515,244 1.17 1.17
Tongue metagenome 6,030,814 2.10 2,664,422 1.53 1,327,701 1.33 1.32
Whole human 22,072,219 1.32 21,320,263 1.28 10,321,275 1.15 1.14

The rightmost column shows the lower bound computed by Eq. (3.1) in Sect. "The weight of the ESS-Compress representation". The weight of ESS-Compress was verified to be the same as predicted by Theorem 3.2

Compression size

Table 3 shows the final compression sizes, after the string sets are compressed with MFC. ESS-Compress outperforms the second best tool (which is usually UST-Compress) by 4–27%. It outperforms the naive strategies (i.e. read FASTA or one k-mer per line) by an order-of-magnitude. Interestingly, it outperforms ESS-Tip-Compress by only 1–8%; this can be attributed to the large number of dead-end vertices (Table 1).

Table 3.

The compression sizes, as measured in bits per k-mer in the compressed output

Dataset Read FASTA One k-mer per line BOSS UST-Compress ESS-Tip-Compress ESS-Compress
R. sphaeroides 45.4 28.4 6.55 3.93 2.90 2.87
Human RNA-seq 45.8 31.7 6.89 4.14 3.43 3.33
Gingiva metagenome 48.0 32.4 10.64 3.76 3.22 3.05
Soybean RNA-seq 43.0 33.1 5.97 2.83 2.66 2.55
Tongue metagenome 48.1 33.3 3.59 4.07 3.32 3.07
Whole human 31.9 48.2 4.65 2.49 2.46 2.40

All string representations (i.e. not BOSS) are compressed using MFC in the final step. Since BOSS is a binary representation, we use LZMA for the final compression step

We observe that our improvement in weight (Table 2) does not directly translate to improvement after compression with MFC (Table 3). For ESS-Compress, the average improvement in weight over UST is 30% but the improvement in bits is 17%. We attribute this to the fact that MFC works by exploiting redundant regions, based on their context. Thus, the redundant sequence that ESS-Compress removes is likely the sequence that was more compressible by MFC and hence MFC looses some of its effectiveness.

We also verified that ESS-Compress can successfully compress datasets of varying k-mer sizes (between 21 and 71) and low-frequency thresholds (2, 3, and 4). Figure 4 shows compressed sizes of human RNA-seq data in bits/k-mer as well as their weights compared to the lower bounds. The weight of ESS-Compress closely matches the lower bound across all parameters (<2.4% gap), but the weight and compression size increase for larger k and lower thresholds.

Fig. 4.

Fig. 4

Compression performance of ESS-Compress when varying k and the low-frequency filter threshold, on Human RNA-seq dataset. In the left panel, solid lines represent the weight of the ESS-Compress representation, compared against the lower bound, represented by the dashed lines. In the right panel, compressed sizes are shown in bits/k-mer

Decompression and compression time and memory

The cost of decompression is important since it is incurred every time the dataset is used for analysis. For both ESS-Compress and ESS-Tip-Compress, the decompression memory is <1 GB (Table 5) the time is <10 min for the large whole human dataset and <1.5 minutes for the other datasets (Table 4). Both of these are dominated by the MFC portion.

Table 5.

Peak memory usage for compression and decompression

Dataset Compression (GB) Decompression (MB)
BOSS UST-Compress ESS-Tip-Compress ESS-Compress UST-Compress ESS-Tip-Compress ESS-Compress
MFC-D MFC-D Core MFC-D Core
R. sphaeroides 2 3 3 3 509 513 3 513 4
Human RNA-seq 4 3 3 6 515 515 3 515 38
Gingiva metagenome 4 2 2 5 515 515 3 515 4
Soybean 4 2 2 3 515 515 3 515 12
Tongue metagenome 4 2 2 9 515 515 3 515 6
Whole human 5 12 11 42 515 515 3 515 735

Decompression takes far less memory than compression, so compression memory is shown in GB and decompression memory in MB. Decompression memory is split in the same manner as the running time in Table 4

Table 4.

Decompression time in seconds

Dataset UST-Compress ESS-Tip-Compress ESS-Compress
MFC-D MFC-D Core Total MFC-D Core Total
R. sphaeroides 3 2 1 4 2 1 3
Human RNA-seq 40 41 19 60 34 17 51
Gingiva metagenome 37 38 16 54 30 15 45
Soybean 31 33 13 46 29 13 42
Tongue metagenome 62 61 28 89 49 25 74
Whole human 302 337 259 596 303 250 553

The time is broken down into the portion taken by MFC to decompress the binary file into an enriched string set and the portion taken by our core algorithm to decompress the enriched string set into an SPSS. Note that BOSS does not implement decompression (because it is a membership data structure) so it is not included

Compression is typically done only once, but the time and memory use can still be important in some applications. Tables 5 and 6 show the compression time and memory usage. For UST-Compress, the time is dominated by the cdBG construction step (i.e. BCALM2). For ESS-Compress, the time and memory are significantly increased beyond that. Here, the advantage of ESS-Tip-Compress stands out. Its run time is nearly the same as UST-Compress, and its memory, while higher than UST-Compress, is significantly lower than ESS-Compress.

Table 6.

Compression time, measured in minutes

Dataset BOSS BCALM UST-Compress ESS-Tip-Compress ESS-Compress
UST MFC Total Core MFC Total Core MFC Total
R. sphaeroides 0.2 0.4 0.1 0.1 1 0.1 0.0 1 0.2 0.0 1
Human RNA-seq 4.0 6.6 1.6 0.8 9 1.3 0.7 9 5.0 0.6 12
Gingiva metagenome 4.3 5.5 1.2 0.7 7 1.0 0.7 7 3.4 0.6 10
Soybean 5.7 9.6 0.8 0.6 11 0.7 0.7 11 2.4 0.5 13
Tongue metagenome 7.4 8.7 1.6 0.8 11 1.9 1.1 12 7.6 0.9 17
Whole human 95 106 11 7 124 10 6 122 40 7 152

The column for BOSS includes the time for k-mer counting the reads using KMC [21], the time to run BOSS construction, and the time to run LZMA. The total time in UST-Compress, ESS-Tip-Compress and ESS-Compress include the time to compute cdBG from the reads using BCALM. The time to compute cdBG is same for all three. The columns labelled core refer to Algorithm 1. ESS-Tip-Compress core uses the specific instance of Algorithm 1 defined in Sect. "ESS-Tip-Compress: a simpler alternative"

Note that MFC is one of many DNA sequence compressors that can be used with our algorithms. MFC is known to achieve superior compression ratios but is slower for compression/decompression than other competitors [38]. We recommend using MFC since it was not the time or memory bottleneck during compression, in our datasets.

Discussion

In this paper, we presented a disk compression algorithm for k-mer sets called ESS-Compress. ESS-Compress is based on the strategy of representing a set of k-mers as a set of longer strings with as few total characters as possible. Once this string set is constructed, it is compressed using a generic nucleotide compressor such as MFC. On real data, ESS-Compress uses up to 42% less characters than the previous best algorithm UST-Compress. After MFC compression, ESS-Compress uses up to 27% less bits than UST-Compress.

We also presented a second algorithm ESS-Tip-Compress. It is simpler than ESS-Compress and does not achieve as good of compression sizes. However, the difference is less than 8% on our data, and it has the advantage of being about twice as fast and using significantly less memory during compression. For many users, this may be a desirable trade-off.

Our algorithms can also be used to compress information associated with the k-mers in K, such as their counts. Every k-mer in K corresponds to a unique location in the enriched string set. The counts can then be ordered sequentially, in the same order as the k-mers appear in the string set, and stored in a separate file. This file can then be compressed/decompressed separately using a generic compressor. After decompression of the enriched string set, the order of k-mers in the output SPSS will be the same as in the counts file.

We discussed several potential improvements to ESS-Compress, such as allowing more edges in the compacted de Bruijn graph to be absorbing or exploring the space of all path covers. We also gave a lower bound to what such improvements could achieve and showed they cannot gain more than 2% in space on our datasets. This makes these improvement of little interest, unless we encounter datasets where the gap is much larger.

ESS-Compress works by removing redundant (k-1)-mers from the string set, but a more general strategy could be to somehow remove -mer duplicates, for all mink-1. Such a strategy would require novel algorithms but would still be unable to reduce the characters per k-mer below one. On our datasets, this amounts to at most a 30% improvement in characters, which would be further reduced after MFC compression. It is also not clear if a 30% improvement in characters is even possible, since this kind of strategy would require a more sophisticated encoding scheme with more overhead.

Another direction to achieve lower compression sizes is to look beyond string set approaches. We observe, for example, that the large improvement of ESS-Compress compared to UST-Compress, measured in the weight of the string set, was significantly reduced when measured in bits after MFC compression. This indicates that some of the work done by ESS-Compress duplicates the work done by MFC on UST, which is itself designed to remove redundancy in the input. Thus, generic compressors such as MFC could potentially be modified to work directly on k-mer sets.

We believe that the biggest opportunity for improving the two algorithms of this paper are the compression time and memory. The time is dominated by the initial step of running BCALM2 to find unitigs. It may be possible to avoid this step by running UST directly on the non-compacted graph. Such an approach was taken in [28], and it would be interesting to see if it ends up improving on the memory and run-time of BCALM2. The memory usage, on the other hand, can likely be optimized with better software engineering. The current implementation of Algorithm 2 is done in a memoized bottom-up manner. Instead, a top down iterative implementation can reduce memory usage by directly writing to disk as soon as a vertex is processed. A “max-depth” option in Algorithm 2 could also be used to limit the depth of the recursion, thereby controlling memory at the cost of the compression ratio.

Another practical extension of ESS-compress is to allow the compression of associated information. Each k-mer, for example, could have an abundance. count associated with it. ESS-Compress representation defines an ordering on the k-mers. This ordering can be tracked through the algorithm and can then be used to sort the input associated data in the same order. Then, the associated data can be further compressed using LZMA (or any suitable compressor) and distributed with the ESS-Compress representation. The decompression algorithm would then similarly track the reordering of k-mers and apply the same permutation to the associated data.

Acknowledgements

PM and AR were supported by NSF awards 1453527 and 1439057. AR is supported by NIH Computation, Bioinformatics, and Statistics training program. RC is supported by INCEPTION project (PIA/ANR-16-CONV-0005).

Declarations

Competing interests

The authors declare that they have no competing interests.

Footnotes

1

Though we did not explain it in [26], UST always returns normalized paths. It flips any vertex that is in the wrong orientation on its path, by reverse complementing its label, without affecting anything else.

2

Note that the vertices of this graph (i.e. cdBG(K)) correspond to maximal unitigs in the non-compacted graph (i.e. dBG(K)). We will generally use “vertex” and “unitig” interchangeably, to refer to a vertex in cdBG(K). We never use “unitig” to refer to a type of path in cdBG(K).

3

In our code, we actually allow a slightly broader definition of absorption. In particular, we also allow an edge to be absorbing if up is an initiator and sp=1, or if up is an initiator and |lab(up)|2k-2. For the sake of simplicity, we do not consider this edge case in the paper.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Amatur Rahman, Email: aur1111@psu.edu.

Rayan Chikhi, Email: rayan.chikhi@pasteur.fr.

Paul Medvedev, Email: pzm11@psu.edu.

References

  • 1.Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–477. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):46. doi: 10.1186/gb-2014-15-3-r46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Sun C, Medvedev P. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics. 2018;35(3):415–420. doi: 10.1093/bioinformatics/bty641. [DOI] [PubMed] [Google Scholar]
  • 4.Denti L., Previtali M., Bernardini G., Schönhuth A., Bonizzoni P. MALVA: genotyping by Mapping-free ALlele detection of known VAriants. iScience. 2019;18:20–27. doi: 10.1016/j.isci.2019.07.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Standage D.S., Brown C.T., Hormozdiari F. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience. 2019;18:28–36. doi: 10.1016/j.isci.2019.07.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. doi: 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Solomon B, Kingsford C. Fast search of thousands of short-read sequencing experiments. Nat Biotechnol. 2016;34(3):300–302. doi: 10.1038/nbt.3442. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Solomon B, Kingsford C. Improved search of large transcriptomic sequencing databases using split sequence bloom trees. J Comput Biol. 2018;25(7):755–765. doi: 10.1089/cmb.2017.0265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Sun C, Harris RS, Chikhi R, Medvedev P. AllSome Sequence Bloom Trees. In: 21st Annual International Conference. Research in Computational Molecular Biology. RECOMB 2017, Hong Kong, China, May 3–7, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10229, 2017;pp. 272–286.
  • 10.Harris RS, Medvedev P. Improved representation of sequence bloom trees. Bioinformatics. 2020;36(3):721–727. doi: 10.1093/bioinformatics/btz662. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Bradley P, den Bakker HC, Rocha EP, McVean G, Iqbal Z. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol. 2019;37(2):152. doi: 10.1038/s41587-018-0010-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Bingmann T, Bradley P, Gauger F, Iqbal Z. COBS: a compact bit-sliced signature index. arXiv preprint arXiv:1905.09624 2019.
  • 13.Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 2018;7(2):201–207. doi: 10.1016/j.cels.2018.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Dadi TH, Siragusa E, Piro VC, Andrusch A, Seiler E, Renard BY, Reinert K. DREAM-Yara: an exact read mapper for very large databases with short update time. Bioinformatics. 2018;34(17):766–772. doi: 10.1093/bioinformatics/bty567. [DOI] [PubMed] [Google Scholar]
  • 15.Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. bioRxiv 2020. [DOI] [PMC free article] [PubMed]
  • 16.Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv, 866756 2019. [DOI] [PMC free article] [PubMed]
  • 17.Chikhi R, Holub J, Medvedev P. Data structures to represent sets of k-long DNA sequences. arXiv:1903.12312 [cs, q-bio] 2019.
  • 18.Hosseini M, Pratas D, Pinho A. A survey on data compression methods for biological sequences. Information. 2016;7(4):56. doi: 10.3390/info7040056. [DOI] [Google Scholar]
  • 19.Hernaez M, Pavlichin D, Weissman T, Ochoa I. Genomic data compression. Ann Rev Biomed Data Sci. 2019;2.
  • 20.Numanagić I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC. Comparison of high-throughput sequencing data compression tools. Nat Methods. 2016;13(12):1005. doi: 10.1038/nmeth.4037. [DOI] [PubMed] [Google Scholar]
  • 21.Kokot M, Długosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33(17):2759–2761. doi: 10.1093/bioinformatics/btx304. [DOI] [PubMed] [Google Scholar]
  • 22.Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013;29(5):652–653. doi: 10.1093/bioinformatics/btt020. [DOI] [PubMed] [Google Scholar]
  • 23.Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Pandey P, Bender MA, Johnson R, Patro R. Squeakr: an exact and approximate k-mer counting system. Bioinformatics. 2017;34(4):568–575. doi: 10.1093/bioinformatics/btx636. [DOI] [PubMed] [Google Scholar]
  • 25.Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de bruijn graphs. Bioinformatics. 2018;34(15):2556–2565. doi: 10.1093/bioinformatics/bty157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Rahman A, Medvedev P. Representation of k-mer sets using spectrum-preserving string sets. In: 24th Annual International Conference. Research in Computational Molecular Biology. RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings. Lecture Notes in Computer Science, vol. 12074, pp. 152–168. Springer, 2020.
  • 27.Břinda K. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. PhD thesis, Université Paris-Est (November 2016). 10.5281/zenodo.1045317.
  • 28.Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv 2020. 10.1101/2020.01.12.903443. [DOI] [PMC free article] [PubMed]
  • 29.Pinho AJ, Pratas D. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics. 2013;30(1):117–118. doi: 10.1093/bioinformatics/btt594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Iliopoulos CS, Kundu R, Pissis SP. Efficient pattern matching in elastic-degenerate texts. In: International Conference on Language and Automata Theory and Applications, 2017;pp. 131–142. Springer.
  • 31.Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32(12):201–208. doi: 10.1093/bioinformatics/btw279. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de Bruijn graphs. In: Research in Computational Molecular Biology, RECOMB 2014. Lecture Notes in Computer Science, 2014; vol. 8394: pp. 35–55. Springer.
  • 33.Bang-Jensen J, Gutin GZ. Digraphs: theory. Algorithms and applications. Berlin: Springer; 2008. [Google Scholar]
  • 34.https://github.com/cosmo-team/cosmo/tree/VARI.
  • 35.Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de bruijn graphs. In: Proceedings of the 12th International Conference on Algorithms in Bioinformatics. LNCS, 2012; vol. 7534: pp. 225–235. Springer.
  • 36.https://github.com/prophyle/prophasm.
  • 37.Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22(3):557–567. doi: 10.1101/gr.131383.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. bioRxiv, 501130 2018. [DOI] [PMC free article] [PubMed]

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of BMC

RESOURCES