Abstract
The widespread adoption of k-mers in bioinformatics has led to efficient methods utilizing genomic sequences in a variety of biological tasks. However, understanding the influence of k-mer sizes within these methods remains a persistent challenge, as the outputs of complex bioinformatics pipelines obscure this influence with various noisy factors. The choice of k-mer size is often arbitrary, with justification frequently omitted in the literature and method tutorials. Furthermore, recent methods employing multiple k-mer sizes encounter significant computational challenges.
Nevertheless, most methods are built on well-defined objects related to k-mers, such as de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer sizes within these objects is more intuitive and can be described by numerous quantities and metrics. Therefore, exploring these objects across k-mer sizes opens opportunities for robust analyses and new applications. However, the evolution of k-mer objects with respect to k-mer sizes is surprisingly elusive.
We introduce a novel substring index, the Prokrustean graph, that elucidates the transformation of k-mer sets across k-mer sizes. Our framework built upon this index rapidly computes k-mer-based quantities for all k-mer sizes, with computational complexity independent of the size range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs across a range of k-mer sizes is achieved in seconds using our index on a gigabase-scale dataset. We present a variety of such experiments relevant to pangenomics and metagenomics.
The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Through this construction, it becomes evident that other modern substring indices inherently face difficulties in exploring k-mer objects across sizes, which motivated our data structure.
Our implementation is available at: https://github.com/KoslickiLab/prokrustean.
1. Introduction
As the volume and number of sequencing reads and reference genomes grow every year, k-mer-based methods continue to gain popularity in computational biology. This simple approach—cutting sequences into substrings of a fixed length—offers significant benefits across various disciplines. Biologists consider k-mers as intuitive markers representing biologically significant patterns, bioinformaticians easily formulate novel methods by regarding k-mers as fundamental units that reflect their original sequences, and engineers leverage their fixed-length nature to optimize computational processes at a very low level. However, understanding the most crucial parameter, k, remains elusive, thereby leaving the accuracy of experimental results obscure to practitioners.
Two central challenges emerge in k-mer-based methods. First, the selection of the k-mer size is often arbitrary, despite its well-recognized influence on outcomes. This issue, though widely acknowledged, remains insufficiently addressed in the literature, with little formal guidance on determining an optimal k for a given application. The reasoning behind these choices is frequently unclear, typically confined to specific, unpublished experimental analyses (e.g., “we found that k=31 was appropriate...”). Second, methods attempting to utilize multiple k-mer sizes encounter significant computational burdens. The accuracy benefits of incorporating multiple k-mer sizes are overshadowed by the computational cost, which escalates with each additional k-mer size. Consequently, researchers often resort to “folklore” k values based on prior empirical results.
The influence of k-mer sizes within each method is a complex function reflecting bioinformatics pipelines that process k-mers. As biological adjustments and engineering strategies complicate the pipelines, their outputs obscure the impact of k-mer sizes with noisy factors, as simplified in the following abstraction:
pipeline(k-mer-based objects(sequences, k), biological adjustments, engineering strategies).
Despite the complexity of the pipelines, there always exist mathematically well-defined k-mer-based objects that form the foundation of method formulation. These objects abstract the utilization of k-mers and depend solely on sequences and k-mer sizes. Thus, they offer potential for generalizability and quantification of the influence of k-mers in methods. Indeed, intuitive quantities have been derived from k-mer-based objects; however, there are challenges in actually computing them, as detailed in several examples we now present.
In genome analysis, the number of distinct k-mers is often used to reflect the complexity of genomes. Although the number varies by k-mer sizes, large data sizes restrict experiments to a few k values [9, 15, 39]. A recent study suggests that the number of distinct k-mers across all k-mer sizes provides additional insights into pangenome complexity [10]. Furthermore, the frequencies of k-mers provide richer information, as discussed in [1, 46], but their computation becomes more complicated and sometimes impractical, even with a fixed k-mer size [7].
In comparative analyses, Jaccard Similarity is utilized for genome indexing and searching by being approximated through hashing techniques [26, 34]. Experiments attempting to assess the influence of k-mer sizes on Jaccard Similarity must undergo tedious iterations through various k-mer sizes [8]. Additionally, Bray-Curtis dissimilarity is a k-mer frequency-based measure frequently used to compare metagenomic samples, where experimentalists face resource limitations due to the growing size of sequencing data, even with a fixed k-mer size. Yet, the demand for analyzing multiple k-mer sizes continues to increase [23, 35, 36].
Genome assemblers utilizing k-mer-based de Bruijn graphs are probably the most sensitive to k-mer sizes, and those employing multi-k approaches face significant computational challenges. The choice of k-mer sizes is particularly crucial in de novo assemblers, yet they predominantly rely on heuristic methods [19, 27, 38]. The topological features of de Bruijn graphs are succinctly summarized by the compacting process, where vertices represent simple paths called unitigs, but exploring these features across varying k-mer sizes is computationally intensive. Furthermore, the development of multi-k de Bruijn graphs continues to be a challenging and largely theoretical endeavor [22, 40, 43], with no substantial advancements following the heuristic selection of multiple k-mer sizes implemented by metaSPAdes [33].
These examples motivated the pressing need to generalize the exploration of k-mer-based objects across varying k-mer sizes. We introduce a computational framework for rapidly computing quantities derived from k-mer-based objects across a range of k sizes. The objective is clarified in Section 2, with experimental results in Section 3. The underlying substring index and its usage are briefly presented in Section 4 and Section 5.
Due to the length of proofs and algorithms, construction algorithms are left in Supplementary materials. We emphasize that distinctions between the new substring index and other modern substring indexes are only briefly covered in Section 5, while the construction algorithms provide essential insights. The new index can be built from the Burrows-Wheeler Transform (BWT), but the BWT alone is unlikely to achieve the same efficiency as our framework. It appears that modern substring indexes based on longest common prefixes, such as the BWT, suffix tree, FM-index, and r-index, share similar limitations.
2. Setup and main theorem
Let $\Sigma$ be our alphabet, let $S \in \Sigma^*$ represent a finite-length string, and let $\mathcal{U} \subset \Sigma^*$ represent a set of strings. Consider the following preliminary definitions and main result detailing the utility of the Prokrustean graph:
Definition 1.
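- A region in the string $S$ is a triple $(S, i, j)$ such that $1 \le i \le j \le |S|$.
** See the example after Definition 2, which justifies this non-standard region notation.
- The string of a region is the corresponding substring: $\mathrm{str}(S, i, j) := S_i S_{i+1} \cdots S_j$.
- The size or length of a region is the length of its string: $|(S, i, j)| := j - i + 1$.
- An extension of a region $(S, i, j)$ is a larger region $(S, i', j')$ including $(S, i, j)$: $i' \le i \le j \le j'$ and $j - i < j' - i'$.
- The occurrences of $S$ in $\mathcal{U}$ are those regions in strings of $\mathcal{U}$ corresponding to $S$: $\mathrm{occ}_{\mathcal{U}}(S) := \{(S', i, j) : S' \in \mathcal{U},\ \mathrm{str}(S', i, j) = S\}$.
- $S$ is a maximal repeat in $\mathcal{U}$ if it occurs at least twice and more often than any of its extensions: $|\mathrm{occ}_{\mathcal{U}}(S)| > 1$, and for every proper superstring $S'$ of $S$, $|\mathrm{occ}_{\mathcal{U}}(S)| > |\mathrm{occ}_{\mathcal{U}}(S')|$. That is, a maximal repeat occurs more than once in $\mathcal{U}$ (i.e., it is a repeat), and any extension of it has a lower frequency in $\mathcal{U}$ than the repeat itself.
- Let $\mathcal{R}_{\mathcal{U}}$ represent the set of all maximal repeats in $\mathcal{U}$.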
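To make Definition 1 concrete, the following brute-force Python sketch counts occurrences and lists the maximal repeats of a toy string set. It is only an illustration under stated assumptions: the function names `substring_counts` and `maximal_repeats` are hypothetical, the enumeration is quadratic per string and is not the index-based method of this paper, and it relies on the observation that checking one-letter extensions suffices because occurrence counts never increase when a string is extended.

```python
from collections import Counter


def substring_counts(seqs):
    """Count every occurrence (region) of every non-empty substring in seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[s[i:j]] += 1
    return counts


def maximal_repeats(seqs):
    """Strings that occur at least twice and more often than any one-letter extension."""
    counts = substring_counts(seqs)
    alphabet = sorted(set("".join(seqs)))
    repeats = set()
    for w, c in counts.items():
        if c < 2:
            continue
        # Highest occurrence count among all left and right one-letter extensions.
        best_ext = max(
            max(counts.get(a + w, 0), counts.get(w + a, 0)) for a in alphabet
        )
        if c > best_ext:
            repeats.add(w)
    return repeats


toy = ["ACCCT", "GACCC", "TCCCG"]
print(sorted(maximal_repeats(toy), key=lambda w: (len(w), w)))
```

Note that single letters such as T can be maximal repeats in this toy set: T occurs twice, but each of its extensions (CT, TC) occurs only once.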
Theorem 1. (Main Result)
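Given a set of sequences $\mathcal{U}$, there exists a substring representation $\mathcal{G}_{\mathcal{U}}$ of size $O(|\Sigma||\mathcal{R}_{\mathcal{U}}|)$. $\mathcal{G}_{\mathcal{U}}$ can be used to compute k-mer quantities of $\mathcal{U}$ for all $k \in [1, k_{\max}]$ within $O(|\mathcal{G}_{\mathcal{U}}|)$ time and space, where $k_{\max}$ is the length of the longest sequence in $\mathcal{U}$. The k-mer quantities are defined as follows: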
(Counts) The number of distinct k-mers, as used in genome cardinality DandD [10].
(Frequencies) k-mer Bray-Curtis dissimilarities, as used in comparing metagenomic samples [36].
(Extensions) The number of maximal unitigs in k-mer de Bruijn graphs, as used in analyses of reference genomes and genome assembly [29, 42].
(Occurrences) The number of edges with annotated length of k in an overlap graph. [45].
Furthermore, can be constructed in time and space utilizing bwt, a BWT representation of .
Proof. Section 4 defines and derives its size . Section 5 introduces the computation of k-mer quantities, and the construction algorithm is introduced in the Supplementary Section S3. □
is termed the Prokrustean graph of . The main contribution is that the time complexity for computing k-mer quantities for all is independent of the range, specifically . Notably, is sublinear relative to , where is the cumulative sequence length of . In contrast, any direct computation of k-mer quantities from for a fixed k-mer size demands, at best, time. Extending this computation naively to increases the complexity to to cover all possible substrings.
3. Results
The construction and four applications of the Prokrustean graph were conducted with various datasets that were randomly selected to represent a broad range of sizes, sequencing technologies, and biological origins. Diverse sequencing datasets were collected, including two metagenome short-read datasets, one metagenome long-read dataset, and two human transcriptome short-read datasets. The full list of datasets is provided in section S1. All experiments were performed on a Mac Studio equipped with an Apple M1 Max of 10 CPU cores and 64 GB memory.
3.1. Prokrustean graph construction with $k_{\min}$
The construction algorithm is introduced in the Supplementary Section S3. The Prokrustean graph grows with the number of maximal repeats, i.e., . Also, the size of can be controlled by dropping maximal repeats of length below . Then requires space where is the set of maximal repeats of length at least . Then, subsequent sections cover the results of four applications that compute k-mer quantities for . We assume is always less than , even if the largest possible value is chosen, which is typical for sequencing data. Figure 1 shows how the graph size of short sequencing reads decreases as increases.
Fig.1:
Size decay of a Prokrustean graph as a function of . The sizes of Prokrustean graphs, i.e., number of vertices and edges, versus values for a short read dataset ERR3450203 are depicted. The size of the graph rapidly drops around , meaning the maximal repeats and locally-maximal repeat regions are dense when is under 11. The number of edges falls again around because the graph eventually becomes fully disconnected.
Table 1 shows that the size of the Prokrustean graphs is sublinear to that of their inputs. Also, the space occupancy is close to the inputs (read files) with practical values. We used that is common in genome assemblers and comparative microbiome analyses.
Table 1:
Prokrustean graph construction introduced in Section S3. Symbols M, B, mb, and gb denote millions, billions, megabytes, and gigabytes, respectively. represents the cumulative sequence length of . is the vertex set and is the edge set of . Since , the time complexity of subsequent applications is sublinear to the input, even if all k-mer sizes are considered.
dataset | reads | N | $k_{\min}$ | time | memory | output size | vertices | edges |
---|---|---|---|---|---|---|---|---|
SRR20044276 (metagenomic) | 0.94M | 72M | 1 | 20s | 200mb | 40mb | 8M | 21M |
SRR20044276 (metagenomic) | 0.94M | 72M | 20 | 13s | 180mb | 35mb | 2.7M | 5.4M |
ERR3450203 (metagenomic) | 55M | 4.5B | 20 | 29min | 12.7gb | 3.1gb | 254M | 477M |
ERR3450203 (metagenomic) | 55M | 4.5B | 30 | 28min | 11gb | 2.7gb | 221M | 407M |
SRR18495451 (metagenomic/long) | 0.83M | 11.3B | 30 | 90min | 43gb | 14gb | 890M | 1.7B |
SRR7130905 (human/rna-seq) | 45M | 6.8B | 30 | 39min | 18gb | 4gb | 280M | 580M |
SRR21862404 (human/rna-seq) | 336M | 22.3B | 30 | 99min | 37gb | 5.9gb | 330M | 800M |
Regarding subsequent results, note that, to our knowledge, no existing computational techniques perform the exact same tasks across a range of k-mer sizes, making performance comparisons inherently “unfair.” Readers are encouraged to focus on the overall time scale to gauge the efficiency of the Prokrustean graph. Additionally, discrepancies in output comparisons may arise due to practice-specific configurations in computational tools, such as the use of only ACGT (i.e., no “N”) and canonical k-mers in KMC. Consequently, we tested correctness against our own brute-force implementations, which are available in the GitHub repository.
3.2. Counting distinct k-mers for all k-mer sizes
The number of distinct k-mers is commonly used to assess the complexity and structure of pangenomes and metagenomes [13, 44]. A recent study by [10] spans multiple k contexts, suggesting that the number of k-mers across all k-mer sizes should be used to estimate genome sizes. The Supplementary Algorithm S1 counts distinct k-mers for all k-mer sizes in time . We employed KMC [28], an optimized k-mer counting library designed for a fixed k size, for comparison, and used it iteratively for all k values.
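For orientation, the quantity in question can be written as a naive baseline that simply repeats a set construction for every k. The sketch below is not Algorithm S1 and does not use the Prokrustean graph; the name `distinct_kmer_counts` is hypothetical, and unlike KMC it uses no canonical k-mers. Its cost grows with the number of k values, which is exactly the dependence the index is designed to avoid.

```python
def distinct_kmer_counts(seqs):
    """Number of distinct k-mers for every k from 1 to the longest sequence length."""
    k_max = max(len(s) for s in seqs)
    counts = {}
    for k in range(1, k_max + 1):
        # Sequences shorter than k contribute nothing (empty range).
        kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
        counts[k] = len(kmers)
    return counts


toy = ["ACCCT", "GACCC", "TCCCG"]
for k, c in distinct_kmer_counts(toy).items():
    print(k, c)
```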
3.3. Counting simple paths in de Bruijn graphs for all k-mer sizes
A maximal unitig of a de Bruijn graph represents a simple path that cannot be extended further. Maximal unitigs are vertices of a compacted de Bruijn graph [20], reflecting a topological characteristic of the graph. More maximal unitigs generally indicate more complex graph structures, which significantly impact genome assembly performance [3, 14]. Therefore, their number across multiple k-mer sizes reflects the influence of k-mer sizes on the complexity of assembly. Algorithm S3 counts maximal unitigs for all k-mer sizes in time .
We utilized GGCAT [20] for comparisons, a mature library constructing compacted de Bruijn graphs of a fixed k-mer size. We assumed their construction is equivalent to counting maximal unitigs. Again, computing maximal unitigs with the Prokrustean graph becomes more efficient as additional k-mer sizes are considered. Table 3 displays the comparison, and Figure 2 illustrates one of the outputs.
Table 3:
Counting maximal unitigs with GGCAT and Algorithm S3 using the Prokrustean Graph. GGCAT was executed iteratively for each k size, with each iteration taking approximately 1–5 minutes. The efficiency gap becomes clearer when there are many possible k-mer sizes, as in long reads. Both methods used 8 threads.
Dataset | k | GGCAT time | GGCAT memory | Prokrustean graph time | Prokrustean graph memory |
---|---|---|---|---|---|
SRR20044276 | 20..150 | 13m | 310mb | 0.36s | 70mb |
ERR3450203 | 20..150 | 98m | 1.1gb | 21s | 6.1gb |
SRR18495451 | 30..50000 | days | - | 127s | 26gb |
SRR7130905 | 30..150 | 126m | 1.4gb | 46s | 8.5gb |
SRR21862404 | 30..150 | 168m | 0.8gb | 76s | 13.8gb |
Fig.2:
The number of maximal unitigs in de Bruijn graphs of order k = 20,...,150 of metagenomic short reads (ERR3450203). A complex scenario is revealed: The decrease of the numbers fluctuates in k = 30,...,60, and the peak around k = 90 corresponds to the sudden decrease in maximal repeats. Lastly, further disconnections increase contigs and eventually make the graph completely disconnected.
3.4. Computing Bray-Curtis dissimilarities for all k-mer sizes
The Bray-Curtis dissimilarity is a frequency-based metric that measures the dissimilarity between biological samples. The Bray-Curtis dissimilarity for k-mer sets is defined as follows:
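In its standard frequency-based form, the dissimilarity between two samples A and B at a fixed k is 1 - 2 Σ_w min(f_A(w), f_B(w)) / Σ_w (f_A(w) + f_B(w)), where w ranges over k-mers and f_X(w) is the number of occurrences of w in sample X. The symbols A, B, and f are notation assumed for this illustration. A minimal brute-force Python sketch of this standard form follows; the names `kmer_freqs` and `bray_curtis` are hypothetical, and this is neither Algorithm S2 nor the implementation in Simka.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Occurrence count of every k-mer in a sample (a list of sequences)."""
    freqs = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            freqs[s[i:i + k]] += 1
    return freqs


def bray_curtis(sample_a, sample_b, k):
    """Standard Bray-Curtis dissimilarity on k-mer frequencies for one k."""
    fa, fb = kmer_freqs(sample_a, k), kmer_freqs(sample_b, k)
    total = sum(fa.values()) + sum(fb.values())
    if total == 0:
        raise ValueError("neither sample contains a k-mer of this size")
    # Only k-mers present in both samples contribute to the shared mass.
    shared = sum(min(fa[w], fb[w]) for w in fa.keys() & fb.keys())
    return 1 - 2 * shared / total


a, b = ["ACGT"], ["ACGA"]
for k in (1, 2, 3):
    print(k, bray_curtis(a, b, k))
```

Evaluating this for several k values requires recomputing both frequency tables each time, which is the per-k cost that a multi-k approach seeks to avoid.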
This quantity is particularly popular in metagenomics analysis [23, 36]. In each citation we found, the dissimilarity scores are presented for a single k-mer size. However, the values are sensitive to the choice of k-mer sizes, so they are often calculated with multiple k-mer sizes for analyses [23, 36, 47], yet most state-of-the-art tools utilize a fixed k-mer size [7]. Algorithm S2 computes Bray-Curtis dissimilarities for all k-mer sizes in , where contains sequencing reads for samples, and Pairs is the sample pairs compared.
Figure 3 shows that dissimilarities between four example metagenome samples are not consistent in that at some k-mer sizes, one observed pattern of dissimilarity is completely flipped at another k-mer size. I.e. no k size is “correct”; instead, new insights are obtained from multiple k sizes. This task took about 10 minutes with the Prokrustean graph of four samples of 12 gigabase pairs in total, and the graph construction took around 1 hour. Computing the same quantity takes around 5 to 8 minutes for a fixed k-mer size with Simka [7], and no library computes it across k-mer sizes.
Fig.3:
Bray-Curtis dissimilarities between four metagenomic samples computed with Algorithm S2. We used the sequencing data of four samples derived from infant fecal microbiota, which were studied in [36]. Samples include two from 12-month-old human subjects and two from 3-week-olds. The relative order between values shifts as k changes, e.g., compared to all other dissimilarities, the dissimilarity between samples 2 and 4 is highest at k = 8 but lowest at k = 21. Note that diverse k sizes (10, 12, 21, 31) are actively utilized in practice.
3.5. Counting vertex degrees of the overlap graph of threshold $k_{\min}$
Overlap graphs are extensively utilized in genome and metagenome assembly, alongside de Bruijn graphs. Although their definition appears unrelated to k-mers, we can view them as representing common suffix-prefix k-mers of sequences in . Overlap information is particularly crucial in read classification [2, 16, 31, 41] and contig binning [32, 45] in metagenomics. These applications interpret vertex degrees as (abundant-weighted) read coverage in samples, which exhibit high variances due to species diversity and abundance variability.
A common computational challenge is the quadratic growth of overlap graphs relative to the number of reads: . In contrast, the Prokrustean graph, which encompasses the overlap graph as a sort of hierarchical form, requires space to access the structure of overlaps, enabling the efficient computation of vertex degrees.
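The quadratic behavior is easy to see in a direct implementation. The sketch below is a brute-force illustration, not Algorithm S4: it compares every ordered pair of reads and adds an edge from read a to read b when some suffix of a, of length at least a threshold, equals a prefix of b. The name `overlap_degrees`, the restriction to proper overlaps, and the choice to count at most one edge per ordered pair are assumptions of this sketch.

```python
def overlap_degrees(reads, min_overlap):
    """Out- and in-degrees of a suffix-prefix overlap graph, by all-pairs comparison."""
    n = len(reads)
    out_deg = [0] * n
    in_deg = [0] * n
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            # Only proper overlaps: shorter than both reads.
            longest = min(len(reads[a]), len(reads[b])) - 1
            if any(
                reads[a].endswith(reads[b][:length])
                for length in range(min_overlap, longest + 1)
            ):
                out_deg[a] += 1
                in_deg[b] += 1
    return out_deg, in_deg


reads = ["ACGTAC", "GTACGG", "TACGGA"]
print(overlap_degrees(reads, 3))
```

The nested loop over read pairs is what becomes infeasible for tens of millions of reads.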
Figure 4 is the result of computing the vertex degrees on a metagenomic short read dataset, as computed by Algorithm S4. With the Prokrustean graph of 27 million short reads (4.5 gigabase pairs in total; ERR3450203), the computation used 8.5 gigabytes memory and 1 minute to count all vertex degrees.
Fig.4:
Vertex degrees of the overlap graph of metagenomic short reads (ERR3450203) computed with Algorithm S4. The number of vertices per each degree is scaled as log log y. There is a clear peak around $10^2$ at both incoming and outgoing degrees, which may imply dense existence of abundant species. The intermittent peaks after $10^2$ might be related to repeating regions within and across species.
4. Prokrustean graph: A hierarchy of maximal repeats
The key idea of the Prokrustean graph is to recursively capture maximal repeats by their relative frequencies of occurrences in . Consider Figure 5 depicting repeats in three sequences ACCCT, GACCC, and TCCCG.
Fig.5:
Recursive repeats. At each sequence on the top level, locally-maximal repeat regions (indicated with blue underlining) are used to denote substrings that are more frequent than their superstring. The arrows then point to a node of the substring, and the process can repeat hierarchically.
Definition 2.
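$(S, i, j)$ is a locally-maximal repeat region in $\mathcal{U}$ if:
$$|\mathrm{occ}_{\mathcal{U}}(\mathrm{str}(S, i, j))| > |\mathrm{occ}_{\mathcal{U}}(S)| \qquad (1)$$
and for each extension $(S, i', j')$ of $(S, i, j)$,
$$|\mathrm{occ}_{\mathcal{U}}(\mathrm{str}(S, i', j'))| = |\mathrm{occ}_{\mathcal{U}}(S)|. \qquad (2)$$
**Note that we often omit “in $\mathcal{U}$” when the context is clear.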
A locally-maximal repeat region captures a substring that appears more frequently than the “parent” string, and any extension of the region captures a substring that appears as frequently as the parent string. Consider the previous example $\mathcal{U} = \{\text{ACCCT}, \text{GACCC}, \text{TCCCG}\}$. The region (ACCCT,1,4) is a locally-maximal repeat region capturing the substring ACCC in ACCCT. However, (ACCCT,2,4) is not a locally-maximal repeat region because the region capturing CCC in ACCCT can be extended to the left to capture ACCC, which occurs more frequently than ACCCT.
Note, an alternative definition using substring notations, such as locally-maximal repeats, instead of regions, is not robust, hence the requirement for us to use region notation. Consider and . Observe that the region () capturing CCC is a locally-maximal repeat region because CCC occurs more frequently than in , but its immediate left and right extensions capture ACCC and CCCT that occur the same number of times as . In contrast, a subregion () capturing CC is not a locally-maximal repeat region because an extension () still captures a string (CCC) which occurs 2 times in which is more than . However, (), which also captures CC, is a locally-maximal repeat region because and . Consequently, defining a locally-maximal repeat as S6..7=CC becomes ambiguous when compared with S2..3=CC, whereas an explicit expression of a region () more accurately reflects the desired property of hierarchy of occurrences.
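The dependence on regions rather than substrings can be checked by brute force on the toy set ACCCT, GACCC, TCCCG from Figure 5. The sketch below is illustrative only: the names `substring_counts` and `lmr_regions` are hypothetical, regions are reported as 1-indexed inclusive pairs (i, j), and it tests only the immediate left and right extensions, which suffices because any larger extension occurs no more often than an immediate one and no less often than the parent string.

```python
from collections import Counter


def substring_counts(seqs):
    """Count every occurrence of every non-empty substring in seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[s[i:j]] += 1
    return counts


def lmr_regions(s, counts):
    """Locally-maximal repeat regions (i, j) of s, 1-indexed and inclusive."""
    n, base = len(s), counts[s]
    regions = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if counts[s[i - 1:j]] <= base:
                continue  # not more frequent than the parent string
            left_ok = i == 1 or counts[s[i - 2:j]] == base
            right_ok = j == n or counts[s[i - 1:j + 1]] == base
            if left_ok and right_ok:
                regions.append((i, j))
    return regions


counts = substring_counts(["ACCCT", "GACCC", "TCCCG"])
print(lmr_regions("TCCCG", counts))  # the region (2, 3) capturing CC is absent
print(lmr_regions("CCC", counts))    # both regions capturing CC are present
```

The same substring CC is therefore locally maximal inside CCC but not inside TCCCG, which is why the definition is stated for regions.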
The Prokrustean graph of $\mathcal{U}$ is simply a graph representation of the recursive structure. Its vertices represent the substrings and its edges represent locally-maximal repeat regions. The proposition below says that all maximal repeats in $\mathcal{U}$ are captured along the recursive description.
Proposition 1 (Complete).
Construct a string set as follows:
for each locally-maximal repeat region where , add to and
for each locally-maximal repeat region where , add to .
Then it follows that holds.
Proof. Any string in is a maximal repeat of , so the claim is satisfied if the process captures every maximal repeat of . Assume a maximal repeat is not in . Consider any occurrence of in some . Extending the maximal repeat within makes the number of its occurrences drop, but since cannot occur as a locally-maximal repeat region by assumption, it must be included in some locally-maximal repeat region that captures a maximal repeat , hence , and is a proper substring of . Now consider any occurrence of within . This argument continues recursively, capturing maximal repeats , but cannot extend indefinitely as is always shorter than for every . Eventually, must be occurring as a locally-maximal repeat region of some , which is a contradiction. □
Therefore, we use maximal repeats as vertices of the Prokrustean graph, along with sequences, so the vertex set corresponds to $\mathcal{U} \cup \mathcal{R}_{\mathcal{U}}$.
Definition 3.
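The Prokrustean graph of $\mathcal{U}$ is a directed multigraph $\mathcal{G}_{\mathcal{U}} := (\mathcal{V}_{\mathcal{U}}, \mathcal{E}_{\mathcal{U}})$:
- Vertex set $\mathcal{V}_{\mathcal{U}}$: $v_S \in \mathcal{V}_{\mathcal{U}}$ if and only if $S \in \mathcal{U} \cup \mathcal{R}_{\mathcal{U}}$.
- Each vertex $v_S \in \mathcal{V}_{\mathcal{U}}$ is annotated with the string size, $\mathrm{size}(v_S) = |S|$.
- Edge set $\mathcal{E}_{\mathcal{U}}$: $e_{S,i,j} \in \mathcal{E}_{\mathcal{U}}$ if and only if $(S, i, j)$ is a locally-maximal repeat region.
- Each edge $e_{S,i,j}$ directs from $v_S$ to $v_{\mathrm{str}(S,i,j)}$ and is annotated with the interval $(i, j)$, representing the locally-maximal repeat region.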
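A brute-force construction following Definition 3 can serve as a reference for small inputs. The sketch below is an assumption-laden illustration, not the BWT-based construction of Supplementary Section S3: the function names are hypothetical, vertices are keyed by their strings for readability (the actual graph stores only sizes and intervals), and all substring counts are materialized explicitly, which is feasible only for toy data.

```python
from collections import Counter


def substring_counts(seqs):
    """Count every occurrence of every non-empty substring in seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[s[i:j]] += 1
    return counts


def lmr_regions(s, counts):
    """Locally-maximal repeat regions (i, j) of s, 1-indexed and inclusive."""
    n, base = len(s), counts[s]
    regions = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if counts[s[i - 1:j]] <= base:
                continue
            left_ok = i == 1 or counts[s[i - 2:j]] == base
            right_ok = j == n or counts[s[i - 1:j + 1]] == base
            if left_ok and right_ok:
                regions.append((i, j))
    return regions


def prokrustean_graph(seqs):
    """Return (vertices, edges): vertices map string -> size, edges are (parent, child, i, j)."""
    counts = substring_counts(seqs)
    vertices, edges = {}, []
    pending = list(dict.fromkeys(seqs))  # start the recursion from the sequences
    while pending:
        s = pending.pop()
        if s in vertices:
            continue
        vertices[s] = len(s)
        for i, j in lmr_regions(s, counts):
            child = s[i - 1:j]
            edges.append((s, child, i, j))
            pending.append(child)
    return vertices, edges


toy = ["ACCCT", "GACCC", "TCCCG"]
vertices, edges = prokrustean_graph(toy)
print(sorted(set(vertices) - set(toy), key=len))
print(len(vertices), len(edges))
```

On this toy set the non-sequence vertices reached by the recursion are exactly the maximal repeats, in line with Proposition 1, and the vertex for CCC has two outgoing edges to the same child CC, which is why the graph is a multigraph.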
Figure 6 visualizes how locally-maximal repeat regions are encoded in a Prokrustean graph. Next, the cardinality analysis of this representation reveals promising bounds.
Fig.6:
The Prokrustean graph of . The left graph illustrates the recursion of locally-maximal repeat regions, depicted as blue regions, following the same rule described in Figure 5. The Prokrustean graph on the right represents the same structure that white vertices correspond to sequences and blue vertices represent maximal repeats found through the recursion process shown in the left graph. The strings are not stored, so integer labels store the regions and the size of the substrings.
Theorem 2 (Compact).
.
Proof. Every vertex in has at most one incoming edge per letter extension on the right (or similarly on the left), so has at most incoming edges. Assume the contrary: a maximal repeat appears as two different locally-maximal repeat regions and within , , and can extend by the same letter on the right, i.e., and capture the same string. Extending these regions further together will eventually diverge their strings, because either or the original two regions are differently located even if . Consequently, at least one of them— without loss of generality—results in a decreased occurrence count of its string as it extends. Hence, . This leads to contradiction because a locally-maximal repeat region should satisfy . □
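Assuming that $|\mathcal{U}| \ll |\mathcal{R}_{\mathcal{U}}|$, as is overwhelmingly the case in practice (meaning there are significantly more maximal repeats than sequences), we derive that $|\mathcal{V}_{\mathcal{U}}| = O(|\mathcal{R}_{\mathcal{U}}|)$. Hence, $|\mathcal{E}_{\mathcal{U}}| = O(|\Sigma||\mathcal{R}_{\mathcal{U}}|)$ by the theorem. Given that $|\Sigma|$ is typically constant in genomic sequences (e.g., $\Sigma = \{A, C, G, T\}$), the graph’s size depends on the number of maximal repeats. Letting $N$ be the accumulated sequence length of $\mathcal{U}$, it is a well-known fact that $|\mathcal{R}_{\mathcal{U}}| < N$ [25], so the graph grows sublinearly in the input size. Furthermore, Section 3 restricted $\mathcal{G}_{\mathcal{U}}$ to maximal repeats of length at least $k_{\min}$. Setting a practical threshold allows for significantly more efficient and configurable space usage.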
4.1. Prokrustean graph construction
We leave the construction algorithm in Supplementary Section S3. However, we emphasize that the core insight of our study comes from the relationship between the Prokrustean graph and suffix trees. It turns out that a straightforward transformation can be defined with affix trees—the union of the suffix tree of and the suffix tree of its reversed strings. A subset of edges of the affix tree of corresponds to the reversed edge set of the Prokrustean graph of . But affix trees are resource-intensive, so our final algorithm utilizes the BWT of , adding more intermediate steps to compensate for the non-bidirectional substring representation.
5. Framework: Computing quantities for all sizes
We first describe the type of information required to efficiently compute k-mer quantities (defined in Theorem 1) to elucidate the uniqueness of the Prokrustean graph. Specifically, why can’t other modern substring indexes be used for the computations in previous results? The answer also implies why developing efficient multi-k k-mer-based methods has been challenging.
5.1. A proxy problem: computing substring co-occurrence.
There exists a theoretical void in identifying when a k-mer and a k′-mer serve similar roles within their respective substring sets. Although close k-mer sizes generally yield comparable outputs, dissecting this phenomenon at the level of local substring scopes is unexpectedly challenging. For instance, de Bruijn graphs constructed from sequencing reads with k = 30 and k′ = 31 display distinct yet highly similar topologies [40]. However, anyone formally articulating the topological similarity would find it quite elusive, as vertex mapping and other techniques establishing correspondences between the two graphs fail to consistently explain it.
A common underlying difficulty is that the entire k-mer set undergoes complete transformations as the k-mer size changes. Since the role of a k-mer is assigned within its k-mer set, the role of a k′-mer within k′-mers cannot be “locally” derived. To address this issue, we propose substring co-occurrence as a generalized framework for consistently grouping k-mers of similar roles across varying k-mer sizes.
Definition 4.
A string co-occurs within a string in if:
Modern substring indexes built on longest common prefixes of suffix arrays are adept at capturing co-occurrence while extending a substring. In suffix trees, a substring that extends from to along an edge—without passing through a node—indicates preservation of co-occurrence. Similarly, the BWT and its variants represent both and with the same set of suffix array intervals to imply co-occurrence. However, there has been no recognized necessity for capturing co-occurrence for substrings rather than superstrings:
Definition 5.
is the set of substrings that co-occur within .
Modern substring indexes would fail to “smoothly” move over the order unless is efficiently accessible. For instance, the variable-order de Bruijn graph and its variants extend a compact de Bruijn graph representation built on the Burrows-Wheeler Transform [12]. To address the large substring space, these solutions require non-trivial operation times for both forward moves and order changes [5, 11].
We argue the usefulness of with a simple yet foundational property.
Theorem 3 (Principle).
For any substring such that ,
Proof. Assume the theorem does not hold, i.e., co-occurs within either no string or multiple strings in . Consider the former case: co-occurs within no string in , meaning occurs more than any of its superstrings in . Given that occurs in some string (since , must be occurring more frequently than by assumption. Then, there exists a maximal repeat such that is a substring of and is a substring of , because extending within decreases its occurrence at some point. Again, whenever is a substring of a maximal repeat , must be occurring more frequently than by assumption. Following the similar argument above, there exists another maximal repeat such that is a substring of and is a proper substring of . This recursive argument will eventually terminate as is strictly shorter than . Thus, co-occurs within the last maximal repeat, which contradicts the assumption.
Now consider the latter case where co-occurs within at least two strings in . If is extended within these two strings, either to the right or left by one step at each time, the extensions eventually diverge into two different strings since the two superstrings differ, thereby reducing the number of occurrences. Hence, must be occurring more frequently than at least one of those two superstrings, leading to a contradiction. □
This theorem provides two insights. First, the sequences and maximal repeats comprehensively and disjointly cover the entire substring, i.e., every substring in is in . Second, by computing for every , we can categorize k-mers of similar roles across various lengths. Therefore, we aim to design a succinct representation of that can be accessed by the Prokrustean graph.
5.2. An algorithmic idea: co-occurrence stacks.
We apply the Prokrustean graph to compute k-mer quantities for all k sizes within the time and space. Previously, a toy example in Figure 5 depicted blue regions that cover locally-maximal repeat regions. Then, given a k-mer size, a complementary region is a maximal region that covers k-mers that are not included in a blue region. See the red regions underlining in Figure 7. An opportunistic property is that k-mers in red regions co-occur within their parent strings. The following notation captures substrings in red regions.
Fig.7:
Complementary regions. Blue underlining depicts locally maximal regions, and red underlining depicts regions that complement k-mers not covered in blue regions, given k values of 3 and 4. Every k-mer appears exactly once within a red region. For example, in the left image, the 3-mer ACC is included in a red region within ACCCC and does not appear again in any other red region. Note that red regions can include multiple co-occurring k-mers. So, in the right image, both 4-mers TCCC and CCCG co-occur within TCCCG.
Definition 6.
A string k-co-occurs within a string in if:
Furthermore, maximally k-co-occurs within in if no superstring of k-co-occurs within in . The red regions in Figure 7 identify maximal k-co-occurring substrings of , so the Prokrustean graph can be used to efficiently compute co-occurring k-mers of a fixed k-mer size. Now, we extend red regions for all possible k values, as shown in Figure 8. Observe that stack-like structures are formed around locally-maximal repeat regions of the string. The shape motivated the following definition:
Fig.8:
Three types of maximal co-occurrence stacks. Maximal co-occurrence stacks of (sets of red rectangles) are introduced at each row for each type, given two locally-maximal repeat regions in (blue rectangles). The Rate column describes the change in the count of expressed k-mers as the k-mer size increases by 1. Hence, the rate is . Also, and control the vertical shapes of stacks. For example, the Type 0 stack forms like a rectangle with , but the Type 1 stack forms like stairs on its left with .
Definition 7.
Consider two strings and where is a substring of , along with two numbers: a depth and a k-mer size , each at least 1. A co-occurrence stack of in is defined as a 4-tuple:
if for each , for , and for change rates is NOT a prefix of and is NOT a suffix of , the following holds:
A co-occurrence stack expresses a k-mer if the k-mer is a substring of some k-co-occurring substring identified by the stack. A co-occurrence stack is maximal if its expressed k-mers are not a subset of those expressed by any other co-occurrence stack.
Let the set of all maximal co-occurrence stacks of in as . It is straightforward to check that no two stacks in express the same k-mer together. So, we can nicely cover with . We extend the scheme to the substring space of . Define .
Theorem 4 (Framework).
is a complete substring representation of such that every substring (any k-mer of any k value) in is expressed by exactly one stack in is bounded by , and can be enumerated in time given .
Proof. Supplementary Section S2 derives this result.
5.3. Algorithms
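The set of maximal co-occurrence stacks is used in the four algorithms computing the results in Section 3. Each algorithm performs constant-time operations per stack, and since the number of stacks is bounded by the size of the Prokrustean graph (Theorem 4), the time complexity of the algorithms is independent of the k-mer size interval.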
For example, consider using the stack in the first row of Figure 8 for counting distinct k-mers. The co-occurrence stack compactly represents the information of co-occurring substrings. From k = 6 to k = 9, the count of k-mers expressed by the stack decreases by 1 at each step, starting from 4. Therefore, for a count vector indexed by k-mer size, an algorithm can implement the contribution of the stack by “adding 4 to the count vector at k = 6” and “recording a change rate decrease of 1 at k = 6 and restoring it at k = 9.” Repeating this process for every maximal co-occurrence stack counts all k-mers at a constant cost per stack. Refer to Supplementary Algorithms S1–S4 for the full details.
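The count-vector bookkeeping described above can be sketched with two difference arrays, one for count adjustments and one for change rates. This is a minimal illustration rather than Supplementary Algorithm S1: the names `StackContribution` and `count_vector` are hypothetical, and the step that removes a stack's remaining count just after its last k-mer size is an assumption added here so that each contribution is confined to its own range.

```python
from typing import Iterable, List, NamedTuple


class StackContribution(NamedTuple):
    """Hypothetical summary of one stack: counts over k_start..k_end (inclusive)."""
    k_start: int        # first k-mer size at which the stack expresses k-mers
    k_end: int          # last k-mer size at which the stack expresses k-mers
    initial_count: int  # number of expressed k-mers at k_start
    rate: int           # change in the count each time k increases by 1


def count_vector(stacks: Iterable[StackContribution], k_min: int, k_max: int) -> List[int]:
    """Accumulate all stack contributions, then sweep once over k."""
    stacks = list(stacks)
    size = max([k_max] + [s.k_end for s in stacks]) + 2
    delta = [0] * size       # one-off adjustments to the count at a given k
    rate_delta = [0] * size  # adjustments to the running change rate at a given k

    for s in stacks:  # constant work per stack
        delta[s.k_start] += s.initial_count   # "add the initial count at k_start"
        rate_delta[s.k_start] += s.rate       # "record the change rate at k_start"
        rate_delta[s.k_end] -= s.rate         # "restore the rate at k_end"
        # Assumption: drop whatever count remains once the stack's range is over.
        remaining = s.initial_count + s.rate * (s.k_end - s.k_start)
        delta[s.k_end + 1] -= remaining

    counts: List[int] = []
    count, rate = 0, 0
    for k in range(size):
        count += rate + delta[k]  # apply the rate carried over from k - 1, then adjust
        rate += rate_delta[k]     # the updated rate takes effect from k + 1 onward
        if k_min <= k <= k_max:
            counts.append(count)
    return counts


# The stack from the text: 4 k-mers at k = 6, decreasing by 1 per step up to k = 9.
stack = StackContribution(k_start=6, k_end=9, initial_count=4, rate=-1)
print(count_vector([stack], k_min=5, k_max=11))  # [0, 4, 3, 2, 1, 0, 0]
```

Each stack touches a fixed number of array cells regardless of how many k-mer sizes it spans, which is the property that keeps the per-stack cost constant; only the final sweep visits every k.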
6. Conclusion
We have introduced the problem of computing co-occurring substrings as a proxy for analyzing the influence of k-mer sizes in k-mer-based methods. The Prokrustean graph provides access to co-occurring substrings through iteration over maximal co-occurrence stacks, which are used to implement algorithms that compute k-mer-based quantities across a range of k-mer sizes. Once the Prokrustean graph is constructed from the BWT in a space-efficient manner, computing k-mer-based quantities for all k-mer sizes becomes roughly equivalent to scanning the graph. Furthermore, performance can be controlled by choosing the smallest maximal repeat size, so that the graph size is adjusted to the needs of the analyzed methods.
The Prokrustean graph and the affix tree have an intuitive relationship, as shown in Supplementary Figure S10. The contrast lies in how the co-occurrence structure is accessed. Suffix trees represent co-occurrence by extending substrings, functioning as a “bottom-up” approach. The Prokrustean graph, however, follows a “top-down” approach, as co-occurring substrings (not superstrings) are identified in constant time. The issue with the bottom-up approach is that not all co-occurrence stacks are easily identified, even if bidirectional substring indexes are employed. For example, the Type 2 stack in Figure 8 requires information about two locally-maximal repeat regions in close proximity within a string, which is difficult to identify with substring-extension operations. We therefore conjecture that exploring the space of k-mers with modern LCP-based substring indices presents an intrinsic challenge.
The results in Section 3.3 on counting maximal unitigs imply that a Prokrustean graph can smoothly explore de Bruijn graphs of all orders. Thus, the Prokrustean graph can advance the so-called variable-order scheme [11] by representing the union of de Bruijn graphs of all orders. A straightforward approach is to use the maximal co-occurrence stacks as a vertex set. We have confirmed that an edge set of size can represent all extensions between the stacks. Since the Prokrustean graph does not grow faster than the number of maximal repeats, this new representation can serve as a compact multi-k assembly graph, more precisely identifying sequencing read errors, SNPs in population genomes, and genome contigs.
Thus, Open Problem 5 in [40], which calls for a practical representation of variable-order de Bruijn graphs that generates assemblies comparable to those of overlap graphs, can be answered with the Prokrustean graph. Recall that Section 3.5 analyzed the overlap graph. The two popular objects used in genome assembly can therefore be accessed together through the Prokrustean graph.
There is abundant literature analyzing the influence of k-mer sizes in various bioinformatics tasks, but little effort has been made to derive rigorous formulations or quantities that explain these phenomena. We expect our framework to serve as an initial step toward understanding this influence, at least at the base level of k-mer-based objects built from sequencing data or reference genomes.
Supplementary Material
Table 2:
Counting distinct k-mers with the Prokrustean graph using Algorithm S1. For comparison, KMC had to be executed iteratively, once for each k-mer size, so its computational time increases steadily as the range of k widens. Each KMC run took about 1–3 minutes per k-mer size.
Dataset | k | KMC time | KMC memory | KMC threads | Prokrustean graph time (build) | Prokrustean graph memory | Prokrustean graph threads |
---|---|---|---|---|---|---|---|
SRR20044276 | 20–150 | 4.5 min | 840 MB | 8 | 0.36 s (13 s) | 80 MB | 8 |
ERR3450203 | 20–150 | 50 min | 12 GB | 8 | 30 s (29 min) | 6 GB | 8 |
SRR18495451 | 30–50000 | days | - | 8 | 88 s (90 min) | 24 GB | 8 |
SRR7130905 | 30–150 | 66 min | 12 GB | 8 | 46 s (39 min) | 8.5 GB | 8 |
SRR21862404 | 30–150 | 112 min | 13 GB | 8 | 74 s (99 min) | 13.8 GB | 8 |
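For orientation, the kind of per-k baseline that the table contrasts against can be sketched as a brute-force routine that makes one full pass over the data for every k-mer size. This is not how KMC works internally; it is a toy reference only, and the function name `distinct_kmers_per_k` is hypothetical. It can also serve as a correctness check for any all-k method on small inputs.

```python
from typing import Dict, List


def distinct_kmers_per_k(sequences: List[str], k_min: int, k_max: int) -> Dict[int, int]:
    """Count distinct k-mers for each k by re-scanning every sequence once per k."""
    result: Dict[int, int] = {}
    for k in range(k_min, k_max + 1):  # one full pass over the data per k-mer size
        seen = set()
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        result[k] = len(seen)
    return result


print(distinct_kmers_per_k(["ACGT", "CGTA"], 2, 5))  # {2: 4, 3: 3, 4: 2, 5: 0}
```

The total work of this baseline grows with the number of k-mer sizes requested, which is the cost the stack-based counting avoids.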
Bibliography
- [1] AlEisa H. N., Hamad S., and Elhadad A. K-mer spectrum-based error correction algorithm for next-generation sequencing data. Computational Intelligence and Neuroscience, 2022, 2022.
- [2] Balvert M., Luo X., Hauptfeld E., Schönhuth A., and Dutilh B. E. Ogre: overlap graph-based metagenomic read clustering. Bioinformatics, 37(7):905–912, 2021.
- [3] Bankevich A., Bzikadze A. V., Kolmogorov M., Antipov D., and Pevzner P. A. Multiplex de bruijn graphs enable genome assembly from long, high-fidelity reads. Nature biotechnology, 40(7):1075–1081, 2022.
- [4] Belazzougui D. Linear time construction of compressed text indices in compact space. In Proceedings of the forty-sixth Annual ACM Symposium on Theory of Computing, pages 148–193, 2014.
- [5] Belazzougui D. and Cunial F. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
- [6] Beller T., Gog S., Ohlebusch E., and Schnattinger T. Computing the longest common prefix array based on the burrows–wheeler transform. Journal of Discrete Algorithms, 18:22–31, 2013.
- [7] Benoit G. Simka: fast kmer-based method for estimating the similarity between numerous metagenomic datasets. In RCAM, 2015.
- [8] Besta M., Kanakagiri R., Mustafa H., Karasikov M., Rätsch G., Hoefler T., and Solomonik E. Communication-efficient jaccard similarity for high-performance distributed genome comparisons. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1122–1132. IEEE, 2020.
- [9] Bonnici V. and Manca V. Informational laws of genome structures. Scientific reports, 6(1):28840, 2016.
- [10] Bonnie J. K., Ahmed O., and Langmead B. Dandd: efficient measurement of sequence growth and similarity. bioRxiv, pages 2023–02, 2023.
- [11] Boucher C., Bowe A., Gagie T., Puglisi S. J., and Sadakane K. Variable-order de bruijn graphs. In 2015 data compression conference, pages 383–392. IEEE, 2015.
- [12] Bowe A., Onodera T., Sadakane K., and Shibuya T. Succinct de bruijn graphs. In International workshop on algorithms in bioinformatics, pages 225–235. Springer, 2012.
- [13] Breitwieser F. P., Baker D. N., and Salzberg S. L. Krakenuniq: confident and fast metagenomics classification using unique kmer counts. Genome biology, 19:1–10, 2018.
- [14] Břinda K., Baym M., and Kucherov G. Simplitigs as an efficient and scalable representation of de bruijn graphs. Genome biology, 22:1–24, 2021.
- [15] Bussi Y., Kapon R., and Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PloS one, 16(10):e0258693, 2021.
- [16] Cavattoni M. and Comin M. Classgraph: improving metagenomic read classification with overlap graphs. Journal of Computational Biology, 30(6):633–647, 2023.
- [17] Cenzato D., Guerrini V., Lipták Z., and Rosone G. Computing the optimal BWT of very large string collections. In Proc. of the 33rd Data Compression Conference, DCC 2023, pages 71–80, 2023.
- [18] Cenzato D. and Lipták Z. A survey of bwt variants for string collections. Bioinformatics, page btae333, 2024.
- [19] Chikhi R. and Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31–37, 2014.
- [20] Cracco A. and Tomescu A. I. Extremely fast construction and querying of compacted and colored de bruijn graphs with ggcat. Genome Research, pages gr–277615, 2023.
- [21] Díaz-Domínguez D. and Navarro G. Efficient construction of the bwt for repetitive text using string compression. Information and Computation, 294:105088, 2023.
- [22] Díaz-Domínguez D., Onodera T., Puglisi S. J., and Salmela L. Genome assembly with variable order de bruijn graphs. bioRxiv, pages 2022–09, 2022.
- [23] Dubinkina V. B., Ischenko D. S., Ulyantsev V. I., Tyakht A. V., and Alexeev D. G. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC bioinformatics, 17:1–11, 2016.
- [24] Gog S., Beller T., Moffat A., and Petri M. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326–337, 2014.
- [25] Gusfield D. Algorithms on strings, trees, and sequences: Computer science and computational biology. Acm Sigact News, 28(4):41–60, 1997.
- [26] Irber L., Brooks P. T., Reiter T., Pierce-Ward N. T., Hera M. R., Koslicki D., and Brown C. T. Lightweight compositional analysis of metagenomes with fracminhash and minimum metagenome covers. bioRxiv, pages 2022–01, 2022.
- [27] Islam R., Raju R. S., Tasnim N., Shihab I. H., Bhuiyan M. A., Araf Y., and Islam T. Choice of assemblers has a critical impact on de novo assembly of sars-cov-2 genome and characterizing variants. Briefings in bioinformatics, 22(5):bbab102, 2021.
- [28] Kokot M., Długosz M., and Deorowicz S. Kmc 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017.
- [29] Krannich T., White W. T. J., Niehus S., Holley G., Halldórsson B. V., and Kehr B. Population-scale detection of nonreference sequence variants using colored de bruijn graphs. Bioinformatics, 38(3):604–611, 2022.
- [30] Li H. Fast construction of fm-index for long sequence reads. Bioinformatics, 30(22):3274–3275, 2014.
- [31] Liao X., Li M., Luo J., Zou Y., Wu F.-X., Pan Y., Luo F., and Wang J. Improving de novo assembly based on read classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(1):177–188, 2018.
- [32] Mallawaarachchi V. Metagenomics Binning Using Assembly Graphs. PhD thesis, The Australian National University (Australia), 2022.
- [33] Nurk S., Meleshko D., Korobeynikov A., and Pevzner P. A. metaspades: a new versatile metagenomic assembler. Genome research, 27(5):824–834, 2017.
- [34] Ondov B. D., Treangen T. J., Melsted P., Mallonee A. B., Bergman N. H., Koren S., and Phillippy A. M. Mash: fast genome and metagenome distance estimation using minhash. Genome biology, 17:1–14, 2016.
- [35] Pérez-Cobas A. E., Gomez-Valero L., and Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microbial genomics, 6(8):e000409, 2020.
- [36] Ponsero A. J., Miller M., and Hurwitz B. L. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports, 2(4), 2023.
- [37] Prezza N. and Rosone G. Space-efficient computation of the lcp array from the burrows-wheeler transform. arXiv preprint arXiv:1901.05226, 2019.
- [38] Prjibelski A., Antipov D., Meleshko D., Lapidus A., and Korobeynikov A. Using spades de novo assembler. Current protocols in bioinformatics, 70(1):e102, 2020.
- [39] Ranallo-Benavidez T. R., Jaron K. S., and Schatz M. C. Genomescope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat. comm., 11(1):1432, 2020.
- [40] Rizzi R., Beretta S., Patterson M., Pirola Y., Previtali M., Della Vedova G., and Bonizzoni P. Overlap graphs and de bruijn graphs: data structures for de novo genome assembly in the big data era. Quantitative Biology, 7:278–292, 2019.
- [41] Rodriguez-r L. M. and Konstantinidis K. T. Estimating coverage in metagenomic data sets and why it matters. The ISME journal, 8(11):2349–2351, 2014.
- [42] Schmidt S., Khan S., Alanko J. N., Pibiri G. E., and Tomescu A. I. Matchtigs: minimum plain text representation of k-mer sets. Genome Biology, 24(1):136, 2023.
- [43] Shariat B., Movahedi N. S., Chitsaz H., and Boucher C. Hydavista: towards optimal guided selection of k-mer size for sequence assembly. BMC genomics, 15(10):1–8, 2014.
- [44] Tang D., Li Y., Tan D., Fu J., Tang Y., Lin J., Zhao R., Du H., and Zhao Z. Kcoss: an ultra-fast k-mer counter for assembled genome analysis. Bioinformatics, 38(4):933–940, 2022.
- [45] Wickramarachchi A. and Lin Y. Metagenomics binning of long reads using read-overlap graphs. In RECOMB International Workshop on Comparative Genomics, pages 260–278. Springer, 2022.
- [46] Yang Z., Li H., Jia Y., Zheng Y., Meng H., Bao T., Li X., and Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evolutionary Biology, 20:1–15, 2020.
- [47] Zhai H. and Fukuyama J. A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures. PLOS Computational Biology, 19(1):e1010821, 2023.