Abstract
Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive.
This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set.
Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings.
We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
Keywords: k-mer, k-mer spectra, FM-index, BWT, genome assembly, pangenomics, metagenomics, 03B70, 05C85, 92–08
1. Introduction
As the volume and number of sequencing reads and reference genomes grow every year, k-mer-based methods continue to gain popularity in computational biology. This simple approach—cutting sequences into substrings of a fixed length—offers significant benefits across various disciplines. Biologists consider k-mers as intuitive markers representing biologically significant patterns, bioinformaticians easily formulate novel methods by regarding k-mers as fundamental units that reflect their original sequences, and engineers leverage their fixed-length nature to optimize computational processes at a very low level. However, the understanding of the most crucial parameter, , remains surprisingly elusive, thereby impeding further methodological advancements.
Two central challenges emerge in the application of k-mer-based methods. First, the selection of the k-mer size is often arbitrary, despite its well-recognized influence on outcomes. This issue, though widely acknowledged, remains insufficiently addressed in the literature, with little formal guidance on how to determine an optimal size for different applications. The reasoning behind these choices is frequently obscured, typically confined to specific, unpublished experimental analyses (e.g. “we found that was appropriate…”). Second, methods attempting to utilize multiple k-mer sizes encounter significant computational burdens. While incorporating multiple k-mer sizes can naturally enhance the accuracy of k-mer-based methods, the computational costs escalate with each additional k-mer size used. Consequently, researchers often resort to “folklore” value(s) based on prior empirical results.
The influence of k-mer sizes within each method is a complex function reflecting bioinformatics pipelines that process k-mers. As biological adjustments and engineering strategies complicate the pipelines, their outputs obscure the impact of k-mer sizes with noisy factors, as simplified in the following abstraction:
Despite the complexity of the pipelines, there always exist mathematically well-defined k-mer-based objects that form the foundation of method formulation. These objects abstract the utilization of k-mers and depend solely on sequences and k-mer sizes. Thus, they offer potential for generalizability and quantification of the influence of k-mers in methods. Indeed, intuitive quantities have been derived from k-mer-based objects; however, there are challenges in actually computing them, as detailed in several examples we now present.
In genome analysis, the number of distinct k-mers is often used to reflect the complexity of genomes. Although the number varies by k-mer sizes, large data sizes restrict experiments to a few values [9, 39, 15]. A recent study suggests that the number of distinct k-mers across all k-mer sizes provides additional insights into pangenome complexity [10]. Furthermore, the frequencies of k-mers provide richer information, as discussed in [1, 46], but their computation becomes more complicated and sometimes impractical, even with a fixed k-mer size [7].
In comparative analyses, Jaccard Similarity is utilized for genome indexing and searching by being approximated through hashing techniques [34, 26]. Experiments attempting to assess the influence of k-mer sizes on Jaccard Similarity must undergo tedious iterations through various k-mer sizes [8]. Additionally, Bray-Curtis dissimilarity is a k-mer frequency-based metric frequently used in comparing metagenomic samples, where experiments face resource limitations due to the growing size of sequencing data, even with a fixed k-mer size. Yet, the demand for analyzing multiple k-mer sizes continues to increase [23, 36, 35].
Genome assemblers utilizing k-mer-based de Bruijn graphs are probably the most sensitive to k-mer sizes, and those employing multi-k approaches face significant computational challenges. The choice of k-mer sizes is particularly crucial in de novo assemblers, yet it predominantly relies on heuristic methods [27, 19, 38]. The topological features of de Bruijn graphs are succinctly summarized by the compacting process, where vertices represent simple paths called unitigs, but exploring these features across varying k-mer sizes is computationally intensive. Furthermore, the development of multi-k de Bruijn graphs continues to be a challenging and largely theoretical endeavor [40, 43, 22], with no substantial advancements following the heuristic selection of multiple k-mer sizes implemented by metaSPAdes [33].
These examples motivate the pressing need to generalize the exploration of k-mer-based objects across k-mer sizes. Our manuscript introduces a framework for rapidly computing quantities derived from k-mer-based objects, thereby addressing the prevalent challenge in bioinformatics. The organization of the manuscript is as follows:
Section 2 defines the main objective of computing k-mer-based quantities and introduces the proxy problem of computing substring co-occurrence.
Section 3 defines , the Prokrustean graph, a novel substring representation that provides straightforward access to substring co-occurrence.
Section 4 outlines a framework built upon the Prokrustean graph, accompanied by algorithms that treat various k-mer-based objects and compute related quantities across all possible k-mer sizes in time.
Section 5 presents the experimental results of computing k-mer-based quantities.
Section 6 derives the construction algorithm of , and discusses the limitations of other modern substring indices in computing substring co-occurrence.
2. Problem Formulation
2.1. Basic Notations
Let be our alphabet and let represent a finite-length string, and for a set of strings. Because of the biological sequence motivation for this work, the elements of are called sequences. Consider the following preliminary definitions:
Definition 1.
A region in the string is a triple () such that .
The string of a region is the corresponding substring: .
The size or length of a region is the length of its string: .
- An extension of a region is a larger region including :
- The occurrences of in are those regions in strings of corresponding to :
is a maximal repeat in if it occurs at least twice and more than any of its extensions: , and for all a superstring of , . I.e., given a fixed set , a maximal repeat is one that occurs more than once in (i.e. is a repeat), and any extension of which has a lower frequency in than the original string .
is the set of all maximal repeats in .
2.2. The proxy problem: how to compute substring co-occurrence?
There exists a theoretical void in identifying when a k-mer and a -mer serve similar roles within their respective substring sets. Although close k-mer sizes generally yield comparable outputs, dissecting this phenomenon at the level of local substring scopes is unexpectedly challenging. For instance, de Bruijn graphs constructed from sequencing reads with and display distinct yet highly similar topologies [40]. However, anyone formally articulating the topological similarity would find it quite elusive, as vertex mapping and other techniques establishing correspondences between the two graphs fail to consistently explain it.
A common underlying difficulty is that the entire k-mer set undergoes complete transformations as the k-mer size changes. Since the role of a k-mer is assigned within its k-mer set, the role of a -mer within -mers cannot be “locally” derived. To address this issue, we propose substring co-occurrence as a generalized framework for consistently grouping k-mers of similar roles across varying k-mer sizes.
Definition 2.
A string co-occurs within a string in if:
Modern substring indexes built on longest common prefixes (LCP) of their (implicit) suffix array are adept at capturing co-occurrence in this “extending direction.” In suffix trees, a substring that extends from to along an edge—without passing through a node—indicates preservation of co-occurrence. Similarly, in the Burrows-Wheeler Transform (BWT) and its variants such as the FM-index and r-index, representing both and with the same set of LCPs corresponding to some suffix array interval implies co-occurrence. However, there has been no recognized necessity for addressing the following “substring direction”:
Definition 3.
is the set of substrings that co-occur within .
Modern substring indexes are not efficient at computing , which results in their failure to “smoothly” explore k-mer-based objects across varying k-mer sizes. Specifically, shrinking a substring found in always requires some link (or pointer) to navigate to the corresponding LCP in the (implicit) suffix array. Given that the number of substrings scales quadratically with the cumulative size of the input sequences, this linkage system becomes impractical. For instance, the variable-order de Bruijn graph and its variants address the space issue by employing a compact de Bruijn graph representation built above the Burrow-Wheeler transform [12]; however, this solution requires nontrivial operation times for both “forward” moves and changes in order () [11, 5]. So, these data structures do not allow extensive exploration of the substring space, and hence their multi-k approaches do not scale well.
Our work demonstrates that can be computed efficiently, and then the functionality can be used to rapidly compute k-mer quantities. We introduce a simple yet foundational property as groundwork.
Theorem 1 (Principle).
For any substring such that ,
Proof. Assume the theorem does not hold, i.e., co-occurs within either no string or multiple strings in . Consider the former case: co-occurs within no string in , meaning occurs more than any of its superstrings in . Given that occurs in some string , must be occurring more frequently than by assumption. Then, there exists a maximal repeat such that is a substring of and is a substring of , because extending within decreases its occurrence at some point. Again, whenever is a substring of a maximal repeat , must be occurring more frequently than by assumption. Following the similar argument above, there exists another maximal repeat such that is a substring of and is a proper substring of . This recursive argument will eventually terminate as is strictly shorter than . Thus, co-occurs within the last maximal repeat, which contradicts the assumption.
Now consider the latter case where co-occurs within at least two strings in . If is extended within these two strings, either to the right or left by one step at each time, the extensions eventually diverge into two different strings since the two superstrings differ, thereby reducing the number of occurrences. Hence, must be occurring more frequently than at least one of those two superstrings, leading to a contradiction. □
This theorem provides two key insights. First, every substring found in exclusively co-occurs as a substring of either a sequence or a maximal repeat in , so the collective sets of from every comprehensively and disjointly cover the entire substring space. Second, by computing for every , we can categorize k-mers of similar roles across various lengths, devising various algorithmic ideas on . The following main theorem outlines the components discussed in the remainder of the manuscript.
Theorem 2. (Main Result)
Given a set of sequences , there exists a substring representation of size . can be used to compute k-mer quantities of for all within time and space, where is the longest sequence length in . The k-mer quantities are given as follows:
(Counts) The number of distinct k-mers, as used in genome cardinality DandD [10].
(Frequencies) k-mer Bray-Curtis dissimilarities, as used in comparing metagenomic samples [36].
(Extensions) The number of unitigs in k-mer de Bruijn graphs, as used in analyses of reference genomes and genome assembly [29, 42].
(Occurrences) The number of edges with annotated length of in the overlap graph. [45].
Furthermore, the graph can be constructed in time and space, where is the size of the Burrows-Wheeler transform representation of .
Proof. Section 3 analyzes the size of , Section 4 introduces the computation of k-mer quantities, and Section 6 derives the construction algorithm. □
The main contribution is that the time complexity for computing k-mer quantities for all using is independent of the range of k-mer sizes, which is . Specifically, letting be the cumulative sequence length of , is sublinear relative to . In contrast, the direct computation of k-mer quantities from the sequence set for any fixed k-mer size demands, at best, time. Extending this computation naively to cover all would exponentially increase the computational demand to . This significant improvement in complexity leverages a proper representation of repeats in , as will be introduced in the next section.
3. Prokrustean graph: A hierarchy of maximal repeats
The key idea of the Prokrustean graph is to recursively capture maximal repeats by their relative frequencies of occurrences in . Consider the following two descriptions of repeats in three sequences: ACCCT, GACCC, and TCCCG.
ACCC is in 2 sequences, and CCC is in all 3 sequences.
ACCC is in 2 sequences, and CCC is in 1 sequence TCCCG and 1 substring ACCC.
Each first and second case uses 5 and 4 units of occurrences, respectively, yet preserves the same meaning. As depicted in Figure 1, the rule of the compact second case is to recursively capture regions of more frequent substrings.
Figure 1:
Recursive repeats. At each sequence on the top level, maximally long subregions (indicated with blue underlining) are used to denote substrings that are more frequent than its superstring. The arrows then points to a node of the substring, and the process repeats hierarchically.
Definition 4.
is a locally-maximal repeat region in if:
(1) |
and for each extension of ,
(2) |
**Note that we often omit “in ” when the context is clear.
A locally-maximal repeat region captures a substring that appears more frequently than the “parent” string, and any extension of the region captures a substring that appears as frequently as the parent string. Consider the previous example . (ACCCT,1,4) is a locally-maximal repeat region capturing the substring ACCC in ACCCT. However, (ACCCT,2,4) is not a locally-maximal repeat region because the region capturing CCC in ACCCT can be extended to the left to capture ACCC which occurs more frequently than ACCCT.
Note, an alternative definition using substring notations, such as locally-maximal repeats, instead of regions, is not robust. Consider and . Observe that the region capturing CCC is a locally-maximal repeat region because CCC occurs more frequently than in , but its immediate left and right extensions capture ACCC and CCCT that occur the same number of times as . In constrast, a subregion capturing CC is not a locally-maximal repeat region because an extension still captures a string (CCC) which occurs 2 times in which is more than . However, , which also captures CC, is a locally-maximal repeat region because and . Consequently, defining a locally-maximal repeat as becomes ambiguous with , whereas an explicit expression of a region more accurately reflects the desired property of hierarchy of occurrences.
3.1. Prokrustean Graph
The Prokrustean graph of is simply a graph representation of recursive locally-maximal repeat regions on , i.e. vertexes represent substrings and edges represent locally-maximal repeat regions. The theorem below says that all maximal repeats in are caught along the recursive description.
Theorem 3 (Complete).
Construct a string set as follows:
for each locally-maximal repeat region where , add to and
- for each locally-maximal repeat region where , add to .
Proof. Any string in is a maximal repeat of , so the claim is satisfied if the process captures every maximal repeat of . Assume a maximal repeat is not in . Consider any occurrence of in some . Extending the maximal repeat within makes the number of its occurrences drop, but since cannot occur as a locally-maximal repeat region by assumption, it must be included in some locally-maximal repeat region that captures a maximal repeat , hence , and is a proper substring of . Now consider any occurrence of within . This argument continues recursively, capturing maximal repeats , but cannot extend indefinitely as is always shorter than for every . Eventually, must be occurring as a locally-maximal repeat region of some , which is a contradiction. □
Therefore, we use maximal repeats as vertices of the Prokrustean graph, along with sequences, so .
Definition 5.
The Prokrustean graph of is a directed multigraph :
if and only if .
Each vertex is annotated with the string size, .
if and only if is a locally-maximal repeat region.
Each edge directs from and is annotated with interval , to represent the locally-maximal repeat region.
Figure 2 visualizes how locally-maximal repeat regions are encoded in a Prokrustean graph. Next, the cardinality analysis of this representation reveals promising bounds.
Figure 2:
The Prokrustean graph of . The left graph describes the recursion of locally-maximal repeat regions, and the Prokrustean graph on the right represents the same structure with integer labels storing the regions and the size of the substrings.
Theorem 4 (Compact).
Proof. Every vertex in has at most one incoming edge per letter extension on the right (or similarly on the left), so has at most incoming edges. Assume the contrary: a maximal repeat appears as two different locally-maximal repeat regions and within , and can extend by the same letter on the right, i.e., and capture the same string. Extending these regions further together will eventually diverge their strings, because either or the original two regions are differently located even if . Consequently, at least one of them— without loss of generality—results in a decreased occurrence count of its string as it extends. Hence, . This leads to contradiction because a locally-maximal repeat region should satisfy . □
Assuming trivially that , meaning there are significantly more maximal repeats than the number of sequences, we derive that , hence by the theorem. Given that is typically constant in genomic sequences (eg. ), the graph’s size depends on the number of maximal repeats. Letting be the accumulated sequence length of , it is a well-known fact that [25], so the graph grows sublinear to the input size. Furthermore, in the implementations described in Section 5, can be restricted to maximal repeats of length at least . Although the size of is comparable to that of the suffix tree of , setting allows for significantly more efficient and configurable space usage.
4. Framework: Computing k-mer quantities for all sizes
This section introduces the computation of k-mer quantities for all sizes, preserving the time and space.
4.1. Accessing co-occurring k-mers for a single size
We briefly cover how k-mers of a fixed k-mer size are accessed through the Prokrustean graph, before generalizing the computation to all k-mer sizes in the next section. Previously, a toy example in Figure 1 depicted blue regions that cover locally-maximal repeat regions. Then, given a k-mer size, a complementary region is a maximal region that covers k-mers that are not included in a blue region. See the red regions underlining in Figure 3.
Figure 3:
Accessing k-mers by complementing locally-maximal repeat regions. Blue underlining depicts locally-maximal regions, and red underlining depicts regions that complemented k-mers not covered in blue regions, given sizes of and 4. Every k-mer appears exactly once within a red region. For example, at the figure of on left, the 3-mer ACC is included in a red region in ACCCC and does not appear again in any other red region.
An opportunistic property is that k-mers in red regions co-occur within their parent strings. Following notation captures substrings in red regions:
Defin ition 6.
A string k-co-occurs within a string in if:
We mean the same by being a k-co-occuring substring of . Also, maximally k-co-occurs within in if no superstring of k-co-occurs within in . Maximal k-co-occurring substrings of are easily computed as complementary regions in as introduced in Figure 3. So, they are efficiently identified with the Prokrustean graph, as implied by the proposition below.
Proposition 1.
A string maximally k-co-occurs within a string in if and only if its region in intersect locally-maximal repeat regions by under , and on both sides, either extends to an end of or intersects a locally-maximal repeat region by exactly .
Proof. The forward direction is straightforward: if k-co-occurs within , its region cannot intersect any locally-maximal repeat region by at any case, and if the intersection is under on its edge, by extending and making the intersection , another superstring of k-co-occurs within .
For the reverse direction, assume a region satisfies the conditions. A k-mer appearing in the region cannot occur more than , as it would mean the region intersects a locally-maximal repeat region by at least , so k-co-occurs within . is also maximal because extending its region increases the size of the intersection to more than . □
Proposition 1 implies that computing maximal k-co-occurring substrings of a string takes time linear to the number of locally-maximal repeat regions in , given the regions pre-ordered by positions: Enumerate locally-maximal repeat regions of length at least and check its left and right whether a complementary region of length at least can be defined. Lastly, maximally capture complementary regions on both sides so that they either intersect locally-maximal repeat regions by or meet an end of . See Figure 4 for a detailed example.
Figure 4:
Computing k-co-occurring substrings (red) of of varying sizes. The rule is to cover every k-mer that is not covered by a locally-maximal repeat region (blue). Note that a k-co-occurring substring can appear on the intersection of two locally-maximal repeat regions too. For example, the region () capturing CAGC in the second row () overlaps two blue regions by 3.
Therefore, the idea for computing complementary regions can be applied to count distinct k-mers for a fixed k-mer size. The toy algorithm below takes time and space.
Note that line 2 is the computation derived from Proposition 1, as depicted in Figure 4. The locally-maximal repeat regions of are represented by the outgoing edges of . The nested loop on lines 1 and 2 thus takes , i.e., time. For correctness, Theorem 1 states that any k-mer co-occurs within exactly one string in . extends to a unique maximal k-co-occurring substring of ; otherwise, multi-occurrence in is implied. Lastly, Proposition 1 guarantees that every maximal k-co-occurring substring of can be identified by scanning the outgoing edges of .
4.2. Accessing co-occurring k-mers for a range of
The computation can be extended to a range of k-mer sizes while preserving the complexity. Recall that the problem is to efficiently express . It turns out that can be expressed by a finite number of stack-like substring structures.
Definition 7.
Consider two strings and where is a substring of , along with two numbers: a depth and a k-mer size , each at least 1. A co-occurrence stack of in is defined as a 4-tuple:
where for , with change rates and .
A co-occurrence stack expresses a k-mer if it is a substring of some k-co-occurring substring identified by the stack. A co-occurrence stack is maximal if its expressed k-mers are not a subset of those expressed by any other co-occurrence stack. It is clear that any k-mer expressed in a co-occurrence stack of co-occurs within . So, maximal co-occurrence stacks of collectively express .
Definition 8.
is the set of all maximal co-occurrence stacks of in .
A maximal co-occurrence stack of is type-0 if the substring is itself, type-1 if is a proper prefix or suffix of , and type-2 otherwise. The numeral in each type’s name indicates the number of locally-maximal repeat regions intersecting the regions of substrings identified by the co-occurrence stack. Refer to Figure 5 for intuition on the type names.
Figure 5:
Three types of maximal co-occurrence stacks of (sets of red rectangles), given two locally-maximal repeat regions in (blue rectangles). The Rate column describes the change in the count of expressed k-mers as the k-mer size increases by 1. Hence, the rate is , and it controls the vertical shapes of stacks.
Theorem 5.
completely and disjointly cover co-occurring substrings of , i.e.,
Also,
Proof. Appendix B.1 covers the proof. □
Lastly, this scheme is extended to the entire substring space. Also, all maximal co-occurrence stacks in can be enumerated within the computation time .
Proposition 2.
The number of all maximal co-occurrence stacks in is . Precisely,
Proof. This is implied by the inequality (the number of locally-maximal repeat regions in ). The term accounts for the constant contribution of 1 from each , and corresponds to the right term because represents the total set of locally-maximal repeat regions in . □
Proposition 3.
Given , enumerating for some takes time.
Proof. This is derived from the proof of the number of locally-maximal repeat regions in in Appendix B.1. □
Since k-mers expressed by each stack of inherit characteristics of , such as frequency and extensions within , algorithms can leverage this information to compute k-mer quantities, by considering k-mers in the same stack as having similar roles in the k-mer-based objects. We further introduce four applications motivated by biological problems that benefit from this methodology.
4.3. Application: Counting distinct k-mers of all sizes
The number of distinct k-mers is commonly used to assess the complexity and structure of pangenomes and metagenomes [44, 13]. A recent study by [10] spans multiple contexts, suggesting that the number of k-mers across all k-mer sizes should be used to estimate genome sizes. The following algorithm counts distinct k-mers for all k-mer sizes in time . We assume the maximum possible k-mer size, , is always less than , which is typical for most sequencing data and genome references.
The core idea leverages the change rate of k-mers within each co-occurrence stack, thereby eliminating the need to consider k-mer counts individually. The number of k-mers expressed by each type-0, 1, or 2 co-occurrence stack changes by −1, 0, or +1 as increases by 1, respectively, as detailed in the Rate column of Figure 5. This property is utilized in lines 4–7, and hence the loop in line 2 runs in time.
4.4. Application: Computing Bray-Curtis dissimilarities of all sizes
The Bray-Curtis dissimilarity, defined with k-mers, is a frequency-based metric that measures the dissimilarity between biological samples. The values are sensitive to the choice of k-mer sizes, so they are often calculated with multiple k-mer sizes to check the influence [47] [23] [36], yet most state-of-the-art tools utilize a fixed k-mer size [7].
We computed the Bray-Curtis dissimilarity for all k-mer sizes of samples A and B using the Prokrustean graph constructed from the union of their sequence sets, and . This graph tracks the origins of sequences, computing the following quantities:
Recall that the value assignments in line 5 and 6 utilized co-occurrence stacks for counting k-mers in Algorithm 2. The values and are assigned instead of a constant 1, because they grows linear to the number of co-occurring k-mers weighted by the frequency-based values. Consequently, vectors and contains values of the numerator and denominator part of the Bray-Curtis dissimilarities for all k-mer sizes.
Typically, biological experiments compute these values for pairs of multiple samples. The proposed algorithm can be extended to consider multiple samples, requiring time.
4.5. Application: Counting maximal unitigs of all sizes
A maximal unitig of a de Bruijn graph, or a vertex of a compacted de Bruijn graph [20], represents a simple path that cannot be extended further, reflecting a topological characteristic of the graph. More maximal unitigs generally indicate more complex graph structures, which significantly impact genome assembly performance [14, 3]. Their number across multiple k-mer sizes reflects the influence of k-mer sizes on the complexity of assembly.
The idea is that maximal unitigs are implied by tips, convergences, and divergences, which are indicated by type-0 or 1 co-occurrence stacks. For instance, consider a string in with multiple right extensions, such as C and T. If a suffix region in of length is an locally-maximal repeat region, it suggests that a k-co-occurring suffix of where corresponds to a vertex in the respective de Bruijn graph of order with right extensions C and T. Therefore, identifying the locally-maximal repeat regions on suffixes and prefixes of strings in is sufficient.
4.6. Application: Computing vertex degrees of overlap graph
Overlap graphs are extensively utilized in genome and metagenome assembly, alongside de Bruijn graphs. Although their definition appears unrelated to k-mers, we can view them as representing suffix-prefix k-mers of sequences in . Overlap information is particularly crucial for metagenomics in read classification [16, 41, 31, 2] and contig binning [45, 32]. Vertex degrees are often interpreted as (abundant-weighted) read coverage in metagenomic samples, which exhibit high variances due to species diversity and abundance variability.
A common computational challenge is the quadratic growth of overlap graphs relative to the number of reads . A significant drawback of overlap graphs is their quadratic growth in the number of reads, . In contrast, the Prokrustean graph, which encompasses the overlap graph as a hierarchical subgraph, requires only space. This structure enables the efficient computation of vertex degrees in the overlap graph.
The overlap graph of has the vertex set , and an edge () if and only if a suffix of matches a prefix of . Define as the number of occurring as a prefix in strings in , which is easily computed through the recursive structure.
Starting from , explore a suffix path, meaning recursively choose edges that are locally-maximal repeat regions on suffices. Then, sum up for all vertices encountered along the path. Since represents the occurrence of as a prefix, the summed quantity equals the outgoing degree of vertex in the overlap graph of . Symmetrically, incoming degrees can be computed by defining . The recursion can be strategically organized so that vertices are visited only once when computing all overlap degrees of , thereby reducing the overall complexity to .
5. Experiments and Results
Here, we implement and test against various data sets the construction and four applications of the Prokrustean graph. Datasets were randomly selected to represent a broad range of sizes, sequencing technologies, and biological origins. Both reference genomes and sequencing data were used to demonstrate the scalability of the Prokrustean graph. 16 human pangenome and 3,500 metagenome references were used, and diverse sequencing datasets were collected, including two metagenome short-read datasets, one metagenome long-read dataset, and two human transcriptome short-read datasets. The full list of datasets is provided in appendix A.
5.1. Prokrustean graph construction with
The construction algorithm is introduced later (Section 6). Recall that the Prokrustean graph grows with the number of maximal repeats, i.e., . The size of can be controlled by dropping maximal repeats of length below . This threshold limits the length of locally-maximal repeat regions to or above, capturing a subgraph of the Prokrustean graph, and hence achieves space while computing -mer quantities for . Figure 6 shows how the graph size of short sequencing reads of length 151 decreases as increases.
Figure 6:
Sizes of Prokrustean graphs (number of vertices and edges) versus values for a short read dataset ERR3450203. The size of the graph drops around , meaning the maximal repeats and locally-maximal repeat regions are dense when is under 11. The number of edges falls again around because the graph eventually becomes fully disconnected.
Additionally, for 3500 E. coli genomes with , we observed that and . Figure 7 shows how references of pangenome and metagenome grow. Again, the growth of the graph follows the growth of .
Figure 7:
Sizes of Prokrustean graphs as a function of sequences added. Left: Prokrustean graphs constructed using an accumulation of 16 human chromosome 1 references. Right: same but generated using 3500 E. coli references exibiting an increased growth rate reflecting E. coli species diversity.
Table 1 shows that with some practical values applied, the size of the Prokrustean graphs stays around the size of their inputs. It is observed that , with and with short reads.
Table 1:
Performances of Prokrustean graph construction from BWTs. Symbols M, B, mb, and gb mean millions, billions, megabytes, and gigabytes. N denotes the cumulative sequence length of and we can see the output is comparable to .
dataset | reads | N | time | mem | output | |
---|---|---|---|---|---|---|
SRR20044276(metagenomic) | 0.94M | 72M | 20 | 13s | 200mb | 40mb |
ERR3450203(metagenomic) | 55M | 4.5B | 30 | 28min | 11gb | 2.7gb |
SRR18495451(metagenomic/long) | 0.83M | 11.3B | 30 | 90min | 43gb | 14gb |
SRR7130905 (human/rna-seq) | 45M | 6.8B | 30 | 39min | 18gb | 4gb |
SRR21862404 (human/rna-seq) | 336M | 22.3B | 30 | 99min | 37gb | 5.9gb |
Subsequent sections cover the results of four applications. Note that, to our knowledge, there are no existing computational techniques that perform the exact same tasks for a range of k-mer sizes, making performance comparisons inherently “unfair.” Readers are encouraged to focus on the overall time scale to gauge the efficiency of the Prokrustean graph. Additionally, discrepancies in the outputs of the comparisons may arise due to practice-specific configurations in computational tools, such as the use of only ACGT (i.e. no “N”) and canonical k-mers in KMC. Consequently, we tested correctness on GitHub using our brute-force implementations.
5.2. Result: Counting distinct k-mers for
We employed KMC [28], an optimized k-mer counting library designed for a fixed size , and used it iteratively for all values. This “unfair” comparison emphasizes the limitations of current practices in counting k-mers across various sizes.
5.3. Result: Computing Bray-Curtis dissimilarities for
Bray-Curtis dissimilarity is a popular metric used in metagenomics analysis [23, 36]. In each reference we found, dissimilarity scores were presented for a single k-mer size. Figure 8 shows that dissimilarities between four example metagenome samples are not consistent in that at some k-mer sizes, one observed pattern of dissimilarity is completely flipped at another k-mer size. I.e. no size is “correct”; instead, new insights are obtained from multiple sizes. This task took about 10 minutes with the Prokrustean graph of four samples of 12 gigabase pairs in total, and the graph construction took around 1 hour. Computing the same quantity takes around 5 to 8 minutes for a fixed k-mer size with Simka [7], and no library computes it across k-mer sizes.
Figure 8:
Bray-Curtis dissimilarities between four samples derived from infant fecal microbiota, which were studied in [36]. Samples include two from 12-month-old human subjects and two from 3-week-olds. The relative order between values shift as changes, e.g. the dissimilarity between sample 2 and 4 is highest at but lowest at . Note that diverse sizes (10, 12, 21, 31) are actively utilized in practice.
5.4. Result: Counting maximal unitigs for
We utilized GGCAT [20], a mature library optimized for constructing compacted de Bruijn graphs at a fixed k-mer size. Since maximal unitigs are vertices of a compacted de Bruijn graph, we measured their de Bruijn graph construction time. Again, the Prokrustean graph becomes more efficient as additional k-mer sizes are included in the computation. The following table displays the comparison, and Figure 9 illustrates some of the outputs.
Figure 9:
The number of maximal unitigs in de Bruijn graphs of order of metagenomic short reads (ERR3450203). A complex scenario is revealed: The decrease of the numbers fluctuates in , and the peak around corresponds to the sudden decrease in maximal repeats. Lastly, further disconnections increase contigs and eventually make the graph completely disconnected.
5.5. Result: Counting vertex degrees of overlap graph of threshold
Here, we describe computing the vertex degrees of an overlap graph using a threshold of on a metagenomic short read dataset. Limiting the task to vertex degree counting, time and space are required (Section 4.6). With the Prokrustean graph of 27 million short reads (4.5 gigabase pairs in total), the computation generating Figure 10 used 8.5 gigabytes memory and 1 minute to count all vertex degrees.
Figure 10:
Vertex degrees of the overlap graph of metagenomic short reads (ERR3450203). The number of vertices per each degree is scaled as log log . There is a clear peak around 102 at both incoming and outgoing degrees, which may imply dense existence of abundant species. The intermittent peaks after 102 might be related to repeating regions within and across species.
6. Prokrustean graph construction
Recall that Section 2.2 addressed the limitations of LCP-based substring indexes in computing , which motivated the adoption of the Prokrustean graph. LCP-based representations inherently provide a “bottom-up” approach and are at best able to access incoming edges of in the Prokrustean graph. In contrast, the Prokrustean graph supports a “top-down” approach by accessing through co-occurrence stacks whose number depends on the outgoing edges of .
We start with a more straightforward, but less efficient approach: Section 6.1 employs affix trees—the union of the suffix tree of and the suffix tree of its reversed strings—to extract the Prokrustean graph. Although affix trees allow straightforward operations supporting bidirectional extensions above suffix trees, they are resource-intensive in practice. Section 6.2 refines the approach using the BWT of , incorporating an intermediate model to compensate the bidirectional context lost in moving away from affix trees. Section 6.3 then briefly introduce the implementation, covering recent advancements in BWT-based computations supporting it.
6.1. Extracting Prokrustean graph from two suffix trees
For ease of explanation, we use two suffix trees to represent the affix tree. Let denote the generalized suffix tree constructed from . The reversed string of a string is denoted by , and the reversed string set is . A node in a tree is represented as if the path from the root to the node spells out the string , with the considered tree being always clear from the context. We assume affix links are constructed, which map between in and in if both nodes exist in their respective trees.
Additionally, a string preceded/followed by a letter represents an extension of the string with that letter, as in or . Similarly, concatenation of two strings and can be written as . An asterisk on a substring denotes any number (including zero) of letters extending the substring, as in .
6.1.1. Vertex set
Collecting maximal repeats of requires information about and for each , for any considered substring . Most substring indexes built above LCPs provide access to occurrences through the LCP intervals, and the number of occurrences is represented by the length of the corresponding LCP intervals. The reverse representation is not required yet, implying that the same information is accessible via single-directional indexes like the BWT in the next section.
6.1.2. Edge set
A more nuanced approach is required to extract the edge set of a Prokrustean graph. Since an edge in a suffix tree basically implies co-occurrence, it sometimes directly indicates a locally-maximal repeat region, providing a subset of substrings that co-occur within their extensions. That is, an edge from to in implies that may represent a locally-maximal repeat region in capturing a prefix of . However, different scenarios may arise—sometimes the reverse direction should be considered, or both directions, or neither. These variations depend on the combinations of letter extensions around the substring. We present a theorem that delineates these cases, visually explained in Figure 11 for intuition.
Figure 11:
Deriving locally-maximal repeat regions from edges of suffix trees. From a node of a maximal repeat or , the extension is identified by adding strings on edge annotations. Then, the three scenarios depict how each edge(s) locates each locally-maximal repeat region that captures within .
We define terms distinguishing occurrence patterns of extensions. A string is left-maximal if for every , and left-non-maximal with if , and right-maximality and right-non-maximality are symmetrically defined.
Theorem 6.
Consider a maximal repeat and letters . The following three cases define a bijective correspondence between selected edges in the suffix trees ( and ) and the edges of the Prokrustean graph of .
Case 6.1 (Prefix Condition) The following three statements are equivalent.
is left-maximal.
There exists a string of form in such that an edge from node to node is in .
There exists an edge in satisfying and .
Case 6.2 (Suffix Condition) The following three statements are equivalent.
is right-maximal.
There exists a string of form in such that an edge from to is in .
There exists an edge satisfying and .
Case 6.3 (Interior Condition) The following three statements are equivalent.
is left-non-maximal with and is right-non-maximal with .
There exist strings and , and of form in such that an edge from to is in and an edge from to is in .
There exists an edge in satisfying .
Proof. Refer to Appendix C.1 for the proof, which is straightforward but tedious. □
The computation of extracting the edge set of the Prokrustean graph is realized by collecting incoming edges for each vertex: Explore to identify each maximal repeat , and verify the maximality conditions in items labeled (a) in the theorem. Upon satisfying (a) of any case, use the corresponding (b) to identify a superstring from the suffix trees and . The corresponding (c) then confirms that the region of within is indeed a locally-maximal repeat region in . The precise region is inferred from the decomposition of outlined in (b). For instance, if is as in Case 9.2, then is the suffix region , so an edge from to is labeled .
This approach covers every edge in the Prokrustean graph. Observe that items labeled (c) distinguish locally-maximal repeat regions by the letters on immediate left or right. Each region is unique with respect to the letter extension, as previously established in the proof of Theorem 4. Hence, for each maximal repeat , the conditions met in (a) have bijectively mapped incoming edges to , thus the entire edge set is identified through Theorem 9.
We have leveraged a bidirectional substring index, but if only one direction is supported, i.e., is unavailable, finding as outlined in (b) becomes challenging. This limitation with single-directional substring indexes motivates the development of an enhanced algorithm.
6.2. Extracting Prokrustean graph from Burrows-Wheeler transform
This section addresses two issues of the previous computation when implemented in practice. Firstly, suffix tree representations generally consume substantial memory, which hinders scalability when handling large genomic datasets such as pangenomes. Secondly, bidirectional indexes are more space-consuming, expensive to build, and less commonly implemented than single-directional indexes. Instead, BWTs enable traversal of the same suffix tree structure using sublinear space relative to the input sequences, and are supported by a range of actively improved construction algorithms.
Since the bidirectional extensions in items of (b) in Theorem 9 are not directly applicable to the BWT or other single-directional substring indexes, we utilize a subset of occurrences of maximal repeats as an intermediate representation of locally-maximal repeat regions. This approach necessitates a consistent way of utilizing suffix orders within the generalized suffix arrays of .
Before describing the construction of the Prokrustean from a BWT, we need the following definitions.
Definition 9.
Let the rank of a region be the rank of the suffix of starting at in the generalized suffix array of , which is implied by a substring index being used. Then, the first occurrence of a string , is the lowest ranked occurrence among .
Note that suffix ordering is consistent only within a sequence but varies across sequences depending on implementations. For example, among BWTs that employ different strategies, some consider the global lexicographical order, yielding for , while others impose a strict order on input sequences, such as , resulting in . This variability is extensively discussed in [18]. In any scenario, the identical Prokrustean graph will be generated as long as the first occurrences are consistently considered.
The following intermediate model achieves the goal by collecting specific occurrences of maximal repeats utilizing first occurrences. These occurrences are designed to form one-to-one correspondences with the edges of the Prokrustean graph.
Definition 10.
is a subset of occurrences of strings in , where elements are specified as follows:
For every sequence , the entire region of the sequence, i.e., .
-
For every maximal repeat and letters ,
If is left-maximal, then given .
If is right-maximal, then given .
If is left-non-maximal with and is right-non-maximal with , then given .
Refer to Figure 12 for intuition that is indirectly deriving the locally-maximal repeat regions in . Although the same maximality conditions are used in both Theorem 9 and the definition of , the occurrences in are organized in a nested manner, allowing locally-maximal repeat regions to be inferred through their inclusion relationships. So, is a projected image of the Prokrustean graph of . The decoding rule for , necessary for reconstructing the Prokrustean graph, is articulated in the following proposition and theorem.
Figure 12:
Construction of the Prokrustean graph via projected occurrences. Gray vertices, depicted as either rectangles or circles, represent sequences in , while blue vertices denote maximal repeats of . The suffix tree is shown flipped to better illustrate the correspondence, providing intuition for the “bottom-up” approach. Dotted rectangles in each diagram refer to a maximal repeat . First occurrences of substrings like and are identified through the construction of . Then, the nested structure of projected occurrences derive incoming edges of in the Prokrustean graph.
The following theorem elaborates the decoding rule. Let the relative occurrence of within be defined only if .
Theorem 7.
, i.e., is a locally-maximal repeat region in some if and only if there exist satisfying:
, and
the relative occurrence of within is , and
no region satisfies .
Proof. Refer to Appendix C.2 for the proof, which is straightforward but tedious. □
Therefore, two occurrences in their “closest” containment relationship in imply their strings form a locally-maximal repeat region. Since every occurrence in originates from either a maximal repeat or a sequence, each represents a vertex in the Prokrustean graph. Therefore, collecting these relative occurrences constructs the edge set of the Prokrustean graph. Refer to Appendix C for the algorithm.
6.3. Implementation
We briefly introduce the techniques implemented in our code base: https://github.com/KoslickiLab/prokrustean.
Firstly, the BWT is constructed from a set of sequences using any modern algorithm supporting multiple sequences [30, 17, 21]. The resulting BWT is then converted into a succinct string representation, such as a wavelet tree, to facilitate access to the “nodes” of the implied suffix tree. For this purpose, we used the implementation from the SDSL project [24]. The traversal is built on foundational works by Belazzougui et al. [4] and Beller et al. [6], as detailed by Nicola Prezza et al. [37]. Their node representation, based on LCP intervals, enables constant-time access to and for each substring and letter . This capability is crucial for verifying the maximality conditions outlined in Definition 10. Also, the first occurrence of so that the first occurrences and are identified as the start of the LCP interval of and , respectively, so can be built.
is collected by exploring the nodes of the suffix tree implicitly supported by the BWT of . The exploration requires time in total, because succinct string operations take time. Identifying a maximal repeat takes time and then collecting its occurrences in takes , which is obvious because each combination of occurrence, e.g., is accessible in constant time. Therefore, constructing takes time and space where is the size of the succinct string representing the BWT of .
Lastly, reconstructing the Prokrustean graph using Theorem 10 takes time, which is introduced in Appendix C. Note that the computation need not consider every combination of occurrences in ; it leverages the property that an occurrence can imply locally-maximal repeat regions with up to two extensions in found in . Therefore, is grouped by each sequence , and the edge set can be built from each group. Hence, the total space usage is , where . The space usage is primarily output-dependent that around in practice, and the most time-consuming part is the construction of via traversing the implicit suffix tree.
7. Discussion
We have introduced the problem of computing co-occurring substrings (Section 2) and derived the Prokrustean graph (Section 3). The graph facilitates computing to efficiently express (Theorem 8), which is used to implement algorithms computing k-mer-based quantities (Section 4.3, Section 4.4, Section 4.5, Section 4.6). Lastly, the graph was constructed from the BWT (Section 6).
It is worthwhile to note that the construction described in Section 6.1 highlights the limitations of LCP-based substring indices by comparing the differences between the Prokrustean graph and suffix trees. Suffix trees cannot access the co-occurrence structure in constant time without the “top-down” representation supported by the Prokrustean graph. Therefore, we conjecture that building multi-k methods with modern LCP-based substring indices is challenging regardless of the underlying approach.
The application Section 4.5 indicates that a Prokrustean graph somehow access de Bruijn graphs of all orders. That is, a Prokrustean graph has potential for advancing the so-called variable-order scheme. Indeed, a Prokrustean of can be converted into the union of de Bruijn graphs of all orders. Since the Prokrustean graph does not grow faster than maximal repeats, the new representation can help formulating multi-k methods to identify errors in sequencing reads, SNPs in population genomes, and assembly genomes.
Thus, the open problem 5 in [40] can be answered by the Prokrustean graph, which calls for a practical representation of variable-order de Bruijn graphs that generates assembly comparable to that of overlap graphs. Recall that Section 4.6 analyzed the overlap graph of from the Prokrustean graph of . So, the two popular objects used in genome assembly are elegantly accessed through the Prokrustean graph.
There is abundant literature analyzing the influence of k-mer sizes in many bioinformatics tasks. However, little effort has been made to derive rigorous formulations or quantities to explain these phenomena. It is a significant disadvantage to leave the effects of k-mer sizes shrouded in mystery, especially as many other biological assumptions are made in these tasks and complex pipelines. Understanding the usage of k-mers at least at the level of substring representations of the sequencing data or references is an essential initial step. Our framework is expected to contribute to forming this prerequisite so that further improvement involving biological knowledge is desired afterwards.
Table 2:
Counting distinct k-mers with Profcrustean graphs (Algorithm 2) compared with KMC. KMC had to be executed iteratively causing the computational time to increase steadily as the range of increases. The running time of KMC for each took about 1–3 minutes.
Dataset | k | KMC | Prokrustean Graph | ||||
---|---|---|---|---|---|---|---|
time | memory | threads | time | memory | threads | ||
SRR20044276 | 1...150 | 5m | 837mb | 22 | 0.36s | 80mb | 8 |
ERR3450203 | 1...150 | 70m | 12gb | 22 | 106s | 11gb | 8 |
SRR18495451 | 30...50000 | days | - | 22 | 88s | 24gb | 8 |
SRR7130905 | 30...150 | 66m | 12gb | 22 | 46s | 8.5gb | 8 |
SRR21862404 | 30...150 | 112m | 13gb | 22 | 74s | 13.8gb | 8 |
Table 3:
Counting maximal unitigs with GGCAT and Prokrustean Graph(Algorithm 4). The efficiency shown in the compute column is emphasized, where only a few seconds are required for large datasets. The GGCAT was executed for sizes iteratively. The running time for each took about 1–5 minutes for GGCAT. Both methods used 8 threads.
Dataset | k | GGCAT | Prokrustean Graph | ||
---|---|---|---|---|---|
time | memory | time | memory | ||
SRR20044276 | 20..150 | 13m | 310mb | 0.36s | 70mb |
ERR3450203 | 30..150 | 98m | 1.1gb | 31s | 5.8gb |
SRR18495451 | 30..50000 | days | - | 127s | 26gb |
SRR7130905 | 30...150 | 126m | 1.4gb | 46s | 8.5gb |
SRR21862404 | 30...150 | 168m | 0.8gb | 76s | 13.8gb |
A. Datasets
-
Datasets mainly used in applications.
SRR20044276, ERR3450203, SRR18495451, SRR21862404, SRR7130905
-
Human pangenome references (Chromosome 1).
AP023461.1, CH003448.1, CH003496.1, CM000462.1, CM001609.2, CM003683.2, CM009447.1, CM009872.1, CM010808.1, CM021568.2, CM034951.1, CM035659.1, CM039011.1, CM045155.1, CM073952.1, CM074009.1, CP139523.1, NC_000001.11, NC_060925.1
-
Metagenome e.coli references.
3682 E. coli assemblies in NCBI circa 2020. (https://zenodo.org/records/6577997)
-
Metagenome sample for Bray-Curtis, referring to [36].
ERS11976829, ERS11976830, ERS11976565, ERS11976566
B. Maximal co-occurrence stacks
B.1. The proof of completely covering and its bound.
Proposition 4.
A maximal k-co-occurring substring and a maximal -co-occurring substring of are identified by the same maximal co-occurrence stack if and only if they share identical boundary conditions: on both sides, they either extend to the end of or intersect the same locally-maximal repeat region by and , respectively.
Proof. Consider maximal k-co-occurring substring and -co-occurring substring of where . The forward direction is straightforward; a maximal co-occurrence stack , which identifies and , identifies as a -co-occurring substring of by definition, and either extends to an end of or intersects an locally-maximal repeat region by on both sides according to Proposition 1. Then, as and are identified as the -th and -th elements by the stack, respectively, the rates and derived from the stack ensure and sharing the equivalent side conditions with as described in Proposition 1.
For the reverse direction, consider three types of co-occurrence stacks: 1. , i.e., they extend to both ends of . Then a type-0 stack identifies them. Whenever -co-occurs within , the same holds for and above until . Therefore, with some maximal identifies and .
2. and are proper prefixes of . Then a type-1 stack identifies them. The condition says there is a locally-maximal repeat region () intersecting the regions of and in by and , respectively. Without loss of generality, let the intersection be on the left of . Then, a -co-occurring prefix of with intersects by ; if not, there must be a -mer occurring more than starting between positions 1 and . But -mers of start in the positions too, so does not -co-occur within , which leads to contradiction. Thus, is a type-1 stack that identifies both and .
3. and are neither prefixes nor suffixes. Then a type-2 stack identifies them. There are two locally-maximal repeat regions and that intersect the regions of and by and , respectively. Assuming is smaller, a similar argument as in the previous paragraph shows that a -co-occurring substring intersects both and by given , so identifies and . □
This proposition implies that either zero, one, or two locally-maximal repeat regions in characterize a co-occurrence stack by identifying substrings of identical side conditions. This corresponds to the classifications—type-0, type-1, and type-2. The proposition also implies that the number of maximal co-occurrence stacks of does not grow faster than the number of locally-maximal repeat regions in .
Theorem 8.
completely and disjointly cover co-occurring substrings of , i.e.,
Also,
Proof. completely and disjointly covers : any k-mer occurring in co-occurs within an exactly one string in by Theorem 1. The k-mer extends to a unique superstring that maximally k-co-occurs within because appearing in two distinct maximally k-co-occurring substrings imply multi-occurrence of the k-mer in . The superstring is identified by a unique maximal co-occurrence stack by Proposition 4. So, the k-mer is expressed by exactly one maximal co-occurrence stack.
To demonstrate the inequality, first, consider a type-0 stack on a string , where all identified substrings are itself by definition, i.e., extend to ends of . A single type-0 stack identifies them all by Proposition 4. Hence, just one type-0 stack identifies them all by Proposition 4.
Next, for a type-1 or type-2 maximal co-occurrence stack (), we show its longest identified substring always intersects some locally-maximal repeat region of length by exactly . Since a proper type-1 stack identifies every k-co-occurring prefix (or suffix) of whose region intersects the same locally-maximal repeat region by , by Proposition 4, the largest order works too. Then, should be the length of the locally-maximal repeat region; otherwise, we can find a ()-co-occurring prefix (or suffix) of , denoted , intersecting the locally-maximal repeat region by , so () is a proper co-occurrence stack, resulting in the original stack non-maximal. Similarly, a type-2 stack identifies substrings that intersect two locally-maximal repeat regions on both sides, so checking the length of the smaller region be permits a similar deduction.
Note that each left and right side of a locally-maximal repeat region of length intersect the region of at most one maximally -co-occurring substring by exactly . Since such maximal intersection happens with every type-1 or type-2 maximal co-occurrence stack at least once, the number of such stacks is bounded by twice the number of locally-maximal repeat regions, thus is 2(the number of outgoing edges of in ) in total.
Lastly, enumerating is efficient. A type-0 stack is uniquely defined for a string where the depth is one more than the length of the largest locally-maximal repeat region. Each side of a locally-maximal repeat region maximally intersects the largest substring identified in at most one stack, either type-1 or type-2. Without loss of generality, a type-1 stack is defined on the left side of a locally-maximal repeat region if it is the largest one among those found on its left, while a type-2 stack pairs with the nearest locally-maximal repeat region of at least the same size on its left. These properties can be verified by a single scan of the locally-maximal repeat regions of a string. □
C. Prokrustean graph construction
C.1. The proof of suffix trees to Prokrustean graph
Theorem 9.
Consider a maximal repeat and letters . The following three cases define a bijective correspondence between selected edges in the suffix trees ( and ) and the edges of the Prokrustean graph of .
Case 9.1 (Prefix Condition)
The following three statements are equivalent.
is left-maximal.
There exists a string of form in such that an edge from to is in .
There exists an edge in satisfying and .
Case 9.2 (Suffix Condition)
The following three statements are equivalent.
is right-maximal.
There exists a string of form in such that an edge from to is in .
There exists an edge satisfying and .
Case 9.3 (Interior Condition)
The following three statements are equivalent.
is left-non-maximal with and is right-non-maximal with .
There exist strings and , and of form in such that an edge from to is in and an edge from to ) is in .
There exists an edge in satisfying .
Proof. Case 9.1 : We proceed arguing in the contrapositive. Assume which is non-maximal on either left or right. Observe that the definition of the construction of the suffix tree implies and being right-maximal. Since is right-maximal, must be left-non-maximal with some , so . Since meaning is left-non-maximal with some .
Case 9.1 : Since holds and , showing is a locally-maximal repeat region is enough to show satisfies the statement. Let . First, the region’s string occurs more than because is a maximal repeat. Next, as is a prefix region, every extension includes the smallest extension , so the string of any of its extensions occurs at least , but implies , so the extensions co-occur within . Hence, is a locally-maximal repeat region and .
Case 9.1 : Since is a locally-maximal repeat region, its extension captures , so holds. Also, implies is left- and right-maximal, so holds for any . Therefore, for any , hence is left-maximal.
**Case 9.2 is symmetrically argued in line with Case 9.1.
Case 9.3 : The non-maximalities in (a) imply . Also, by co-occurrences implied by two suffix trees, and hold. Hence, is satisfied, meaning is left- and right-maximal that .
Case 9.3 : The maximal repeat occurs more than and the string of any extension occurs at least or times, while is implied by the two suffix trees. Hence, is a locally-maximal repeat region such that .
Case 9.3 : This statement is trivial because the locally-maximal repeat region implies , hence holds, which implies non-maximalities on both sides of . □
C.2. The proof of projected occurrences to Prokrustean graph
Below proposition is useful to prove the main theorem.
Proposition 5.
For every maximal repeat is in .
Proof. If a maximal repeat is a whole sequence in , it is trivially in . So, assume every occurrence of in is extendable by some letters.
Let , and define and/or . One of or might not be defined, so consider without loss of generality.
If is right-maximal, then trivially belongs to by 2-(2) of Definition 10 Otherwise, is right-non-maximal with some , so must hold.
Consequently, whether is left-maximal or left-non-maximal with holds because conditions 2-(1) or 2-(3) of Definition 10 apply, respectively. The symmetric argument assuming ( confirms the same results. □
Theorem 10.
, i.e., is a locally-maximal repeat region in some if and only if there exist satisfying:
, and
the relative occurrence of within is , and
no region Uatisfies .
Proof. The forward direction is composed of three parts. We show there exists some corresponding to that is specified by Theorem 9, and then show there exists some such that the relative occurrence of in the region is . Lastly, the third statement is shown.
First, there exists an occurrence that captures the same string and shares some left or right extension with . must be satisfying exactly one of three items labeled (c) in Case 9.1, Case 9.2, and Case 9.3 conditioned by extensions of (). Then the corresponding item of (a) describes a maximality condition around that is equally described in 2-(1), 2-(2), and 2-(3) in Definition 10 to derive elements in . So, there is always some () in chosen for () by the agreement in letter extensions, i.e., either or or both holds.
Next, there exists some such that the relative occurrence of within is . The construction of showed either or holds, so assume without loss of generality. Since is a locally-maximal repeat region, co-occurs within and hence co-occurs within . Therefore, can be extended to some so that is captured and is at reflecting the position of within . Similarly, holds, and hence is recovered by .
We need to finish showing the second statement. If is in , then trivially holds and is applied by item 1. in Definition 10. Otherwise, is in , then is because the first occurrence of some subtring co-occurring within was utilized for identifying . Proposition 5 confirms that .
Lastly, assume some () in satisfies . Then a maximal repeat exists such that is a superstring of and is a superstring of . Then could be captured by extending and occurs more than , so is not a locally-maximal repeat region, leading to a contradiction.
For the backward direction, assume is not a locally-maximal repeat region for contradiction. Then some locally-maximal repeat region must be found by extending because occurs more than . Then by following the the same steps in the first paragraph utilizing Theorem 9 and Definition 10, is used to identify an occurrence in that shares some letter extension. So, letting , a substring of form either or is used to find a first occurrence and the substring co-occurs within , so () appears within so that its relative occurrence within is . Therefore, . Also, because is an extension of that is an extension of too. Lastly, indeed holds because of . That is, if is not but some other occurrence of in , then the corresponding occurrence of in must be found in instead of . Thus, the presence of leads to contradiction. □
C.3. The algorithm — projected occurrences to Prokrustean graph
The algorithm below reduces the task of reconstructing the Prokrustean graph from to a problem designed for a specific string . For the subset of regions , determine their inclusion relationships to identify the closest related pairs. This is easily parallelized as it processes the intervals for each separately. And the running time is linear to the size of the subset of regions per each , hence || in total.
An important assumption in the algorithm is that for any region , within the subset , there are at most two regions such that and no exists where . It is straightforward to check the property.
References
- [1].AlEisa Hussah N, Hamad Safwat, and Elhadad Ahmed. K-mer spectrum-based error correction algorithm for next-generation sequencing data. Computational Intelligence and Neuroscience, 2022, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Balvert Marleen, Luo Xiao, Hauptfeld Ernestina, Schönhuth Alexander, and Dutilh Bas E. Ogre: overlap graph-based metagenomic read clustering. Bioinformatics, 37(7):905–912, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Bankevich Anton, Bzikadze Andrey V, Kolmogorov Mikhail, Antipov Dmitry, and Pevzner Pavel A. Multiplex de bruijn graphs enable genome assembly from long, high-fidelity reads. Nature biotechnology, 40(7):1075–1081, 2022. [DOI] [PubMed] [Google Scholar]
- [4].Belazzougui Djamal. Linear time construction of compressed text indices in compact space. In Proceedings of the forty-sixth Annual ACM Symposium on Theory of Computing, pages 148–193, 2014. [Google Scholar]
- [5].Belazzougui Djamal and Cunial Fabio. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019. [Google Scholar]
- [6].Beller Timo, Gog Simon, Ohlebusch Enno, and Schnattinger Thomas. Computing the longest common prefix array based on the burrows–wheeler transform. Journal of Discrete Algorithms, 18:22–31, 2013. [Google Scholar]
- [7].Benoit Gaëtan. Simka: fast kmer-based method for estimating the similarity between numerous metagenomic datasets. In RCAM, 2015. [Google Scholar]
- [8].Besta Maciej, Kanakagiri Raghavendra, Mustafa Harun, Karasikov Mikhail, Rätsch Gunnar, Hoefler Torsten, and Solomonik Edgar. Communication-efficient jaccard similarity for high-performance distributed genome comparisons. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1122–1132. IEEE, 2020. [Google Scholar]
- [9].Bonnici Vincenzo and Manca Vincenzo. Informational laws of genome structures. Scientific reports, 6(1):28840, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Bonnie Jessica K, Ahmed Omar, and Langmead Ben. Dandd: efficient measurement of sequence growth and similarity. bioRxiv, pages 2023–02, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Boucher Christina, Bowe Alex, Gagie Travis, Puglisi Simon J, and Sadakane Kunihiko. Variable-order de bruijn graphs. In 2015 data compression conference, pages 383–392. IEEE, 2015. [Google Scholar]
- [12].Bowe Alexander, Onodera Taku, Sadakane Kunihiko, and Shibuya Tetsuo. Succinct de bruijn graphs. In International workshop on algorithms in bioinformatics, pages 225–235. Springer, 2012. [Google Scholar]
- [13].Breitwieser Florian P, Baker Daniel N, and Salzberg Steven L. Krakenuniq: confident and fast metagenomics classification using unique k-mer counts. Genome biology, 19:1–10, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Břinda Karel, Baym Michael, and Kucherov Gregory. Simplitigs as an efficient and scalable representation of de bruijn graphs. Genome biology, 22:1–24, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Bussi Yuval, Kapon Ruti, and Reich Ziv. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PloS one, 16(10):e0258693, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Cavattoni Margherita and Comin Matteo. Classgraph: improving metagenomic read classification with overlap graphs. Journal of Computational Biology, 30(6):633–647, 2023. [DOI] [PubMed] [Google Scholar]
- [17].Cenzato Davide, Guerrini Veronica, Lipták Zsuzsanna, and Rosone Giovanna. Computing the optimal BWT of very large string collections. In In Proc. of the 33rd Data Compression Conference, DCC 2023, 2023, pages 71–80, 2023. doi: 10.1109/DCC55655.2023.00015. [DOI] [Google Scholar]
- [18].Cenzato Davide and Lipták Zsuzsanna. A survey of bwt variants for string collections. Bioinformatics, page btae333, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Chikhi Rayan and Medvedev Paul. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31–37, 2014. [DOI] [PubMed] [Google Scholar]
- [20].Cracco Andrea and Tomescu Alexandru I. Extremely fast construction and querying of compacted and colored de bruijn graphs with ggcat. Genome Research, pages gr–277615, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Díaz-Domínguez Diego and Navarro Gonzalo. Efficient construction of the bwt for repetitive text using string compression. Information and Computation, 294:105088, 2023. [Google Scholar]
- [22].D’ıaz-Dom’ınguez Diego, Onodera Taku, Puglisi Simon J, and Salmela Leena. Genome assembly with variable order de bruijn graphs. bioRxiv, pages 2022–09, 2022. [Google Scholar]
- [23].Dubinkina Veronika B, Ischenko Dmitry S, Ulyantsev Vladimir I, Tyakht Alexander V, and Alexeev Dmitry G. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC bioinformatics, 17:1–11, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Gog Simon, Beller Timo, Moffat Alistair, and Petri Matthias. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326–337, 2014. [Google Scholar]
- [25].Gusfield Dan. Algorithms on stings, trees, and sequences: Computer science and computational biology. Acm Sigact News, 28(4):41–60, 1997. [Google Scholar]
- [26].Irber Luiz, Brooks Phillip T, Reiter Taylor, Tessa Pierce-Ward N, Hera Mahmudur Rahman, Koslicki David, and Titus Brown C. Lightweight compositional analysis of metagenomes with fracminhash and minimum metagenome covers. bioRxiv, pages 2022–01, 2022. [Google Scholar]
- [27].Islam Rashedul, Raju Rajan Saha, Tasnim Nazia, Shihab Istiak Hossain, Bhuiyan Maruf Ahmed, Araf Yusha, and Islam Tofazzal. Choice of assemblers has a critical impact on de novo assembly of sars-cov-2 genome and characterizing variants. Briefings in bioinformatics, 22(5):bbab102, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Kokot Marek, Długosz Maciej, and Deorowicz Sebastian. Kmc 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017. [DOI] [PubMed] [Google Scholar]
- [29].Krannich Thomas, Timothy J White W, Niehus Sebastian, Holley Guillaume, Halldórsson Bjarni V, and Kehr Birte. Population-scale detection of non-reference sequence variants using colored de bruijn graphs. Bioinformatics, 38(3):604–611, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Li Heng. Fast construction of fm-index for long sequence reads. Bioinformatics, 30(22):3274–3275, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Liao Xingyu, Li Min, Luo Junwei, Zou You, Wu Fang-Xiang, Pan Yi, Luo Feng, and Wang Jianxin. Improving de novo assembly based on read classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(1):177–188, 2018. [DOI] [PubMed] [Google Scholar]
- [32].Mallawaarachchi Vijini. Metagenomics Binning Using Assembly Graphs. PhD thesis, The Australian National University (Australia), 2022. [Google Scholar]
- [33].Nurk Sergey, Meleshko Dmitry, Korobeynikov Anton, and Pevzner Pavel A. metaspades: a new versatile metagenomic assembler. Genome research, 27(5):824–834, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Ondov Brian D, Treangen Todd J, Melsted Páll, Mallonee Adam B, Bergman Nicholas H, Koren Sergey, and Phillippy Adam M. Mash: fast genome and metagenome distance estimation using minhash. Genome biology, 17:1–14, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].Pérez-Cobas Ana Elena, Gomez-Valero Laura, and Buchrieser Carmen. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microbial genomics, 6(8):e000409, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Ponsero Alise Jany, Miller Matthew, and Hurwitz Bonnie Louise. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports, 2(4), 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [37].Prezza Nicola and Rosone Giovanna. Space-efficient computation of the lcp array from the burrows-wheeler transform. arXiv preprint arXiv:1901.05226, 2019. [Google Scholar]
- [38].Prjibelski Andrey, Antipov Dmitry, Meleshko Dmitry, Lapidus Alla, and Korobeynikov Anton. Using spades de novo assembler. Current protocols in bioinformatics, 70(1):e102, 2020. [DOI] [PubMed] [Google Scholar]
- [39].Rhyker Ranallo-Benavidez T, Jaron Kamil S, and Schatz Michael C. Genomescope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat. comm., 11(1):1432, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [40].Rizzi Raffaella, Beretta Stefano, Patterson Murray, Pirola Yuri, Previtali Marco, Vedova Gianluca Della, and Bonizzoni Paola. Overlap graphs and de bruijn graphs: data structures for de novo genome assembly in the big data era. Quantitative Biology, 7:278–292, 2019. [Google Scholar]
- [41].Rodriguez-r Luis M and Konstantinidis Konstantinos T. Estimating coverage in metagenomic data sets and why it matters. The ISME journal, 8(11):2349–2351, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Schmidt Sebastian, Khan Shahbaz, Alanko Jarno N, Pibiri Giulio E, and Tomescu Alexandru I. Matchtigs: minimum plain text representation of k-mer sets. Genome Biology, 24(1):136, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Shariat Basir, Movahedi Narjes Sadat, Chitsaz Hamidreza, and Boucher Christina. Hyda-vista: towards optimal guided selection of k-mer size for sequence assembly. BMC genomics, 15(10):1–8, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Tang Deyou, Li Yucheng, Tan Daqiang, Fu Juan, Tang Yelei, Lin Jiabin, Zhao Rong, Du Hongli, and Zhao Zhongming. Kcoss: an ultra-fast k-mer counter for assembled genome analysis. Bioinformatics, 38(4):933–940, 2022. [DOI] [PubMed] [Google Scholar]
- [45].Wickramarachchi Anuradha and Lin Yu. Metagenomics binning of long reads using read-overlap graphs. In RECOMB International Workshop on Comparative Genomics, pages 260–278. Springer, 2022. [Google Scholar]
- [46].Yang Zhenhua, Li Hong, Jia Yun, Zheng Yan, Meng Hu, Bao Tonglaga, Li Xiaolong, and Luo Liaofu. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evolutionary Biology, 20:1–15, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [47].Zhai Hongxuan and Fukuyama Julia. A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures. PLOS Computational Biology, 19(1):e1010821, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]