This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Dec 20:2023.11.21.568151. [Version 7] doi: 10.1101/2023.11.21.568151

Prokrustean Graph: A substring index for rapid k-mer size analysis

Adam Park a, David Koslicki a,b,c
PMCID: PMC11160577  PMID: 38853857

Abstract

Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive.

This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for $k = 1, \ldots, 100$ can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set.

Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties in exploring varying k-mer sizes due to their limitations in grouping co-occurring substrings.

We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.

Keywords: k-mer, k-mer spectra, FM-index, BWT, genome assembly, pangenomics, metagenomics, 03B70, 05C85, 92–08

1. Introduction

As the volume and number of sequencing reads and reference genomes grow every year, k-mer-based methods continue to gain popularity in computational biology. This simple approach—cutting sequences into substrings of a fixed length—offers significant benefits across various disciplines. Biologists consider k-mers as intuitive markers representing biologically significant patterns, bioinformaticians easily formulate novel methods by regarding k-mers as fundamental units that reflect their original sequences, and engineers leverage their fixed-length nature to optimize computational processes at a very low level. However, the understanding of the most crucial parameter, k, remains surprisingly elusive, thereby impeding further methodological advancements.

Two central challenges emerge in the application of k-mer-based methods. First, the selection of the k-mer size is often arbitrary, despite its well-recognized influence on outcomes. This issue, though widely acknowledged, remains insufficiently addressed in the literature, with little formal guidance on how to determine an optimal k size for different applications. The reasoning behind these choices is frequently obscured, typically confined to specific, unpublished experimental analyses (e.g. “we found that k=31 was appropriate…”). Second, methods attempting to utilize multiple k-mer sizes encounter significant computational burdens. While incorporating multiple k-mer sizes can naturally enhance the accuracy of k-mer-based methods, the computational costs escalate with each additional k-mer size used. Consequently, researchers often resort to “folklore” k value(s) based on prior empirical results.

The influence of k-mer sizes within each method is a complex function reflecting bioinformatics pipelines that process k-mers. As biological adjustments and engineering strategies complicate the pipelines, their outputs obscure the impact of k-mer sizes with noisy factors, as simplified in the following abstraction:

$$\mathrm{pipeline}\big(k\text{-mer-based objects}(\text{sequences}, k),\ \text{biological adjustments},\ \text{engineering}\big).$$

Despite the complexity of the pipelines, there always exist mathematically well-defined k-mer-based objects that form the foundation of method formulation. These objects abstract the utilization of k-mers and depend solely on sequences and k-mer sizes. Thus, they offer potential for generalizability and quantification of the influence of k-mers in methods. Indeed, intuitive quantities have been derived from k-mer-based objects; however, there are challenges in actually computing them, as detailed in several examples we now present.

In genome analysis, the number of distinct k-mers is often used to reflect the complexity of genomes. Although the number varies by k-mer sizes, large data sizes restrict experiments to a few k values [9, 39, 15]. A recent study suggests that the number of distinct k-mers across all k-mer sizes provides additional insights into pangenome complexity [10]. Furthermore, the frequencies of k-mers provide richer information, as discussed in [1, 46], but their computation becomes more complicated and sometimes impractical, even with a fixed k-mer size [7].

In comparative analyses, Jaccard Similarity is utilized for genome indexing and searching by being approximated through hashing techniques [34, 26]. Experiments attempting to assess the influence of k-mer sizes on Jaccard Similarity must undergo tedious iterations through various k-mer sizes [8]. Additionally, Bray-Curtis dissimilarity is a k-mer frequency-based metric frequently used in comparing metagenomic samples, where experiments face resource limitations due to the growing size of sequencing data, even with a fixed k-mer size. Yet, the demand for analyzing multiple k-mer sizes continues to increase [23, 36, 35].

Genome assemblers utilizing k-mer-based de Bruijn graphs are probably the most sensitive to k-mer sizes, and those employing multi-k approaches face significant computational challenges. The choice of k-mer sizes is particularly crucial in de novo assemblers, yet it predominantly relies on heuristic methods [27, 19, 38]. The topological features of de Bruijn graphs are succinctly summarized by the compacting process, where vertices represent simple paths called unitigs, but exploring these features across varying k-mer sizes is computationally intensive. Furthermore, the development of multi-k de Bruijn graphs continues to be a challenging and largely theoretical endeavor [40, 43, 22], with no substantial advancements following the heuristic selection of multiple k-mer sizes implemented by metaSPAdes [33].

These examples motivate the pressing need to generalize the exploration of k-mer-based objects across k-mer sizes. Our manuscript introduces a framework for rapidly computing quantities derived from k-mer-based objects, thereby addressing the prevalent challenge in bioinformatics. The organization of the manuscript is as follows:

  • Section 2 defines the main objective of computing k-mer-based quantities and introduces the proxy problem of computing substring co-occurrence.

  • Section 3 defines $\mathcal{G}_\mathcal{U}$, the Prokrustean graph, a novel substring representation that provides straightforward access to substring co-occurrence.

  • Section 4 outlines a framework built upon the Prokrustean graph, accompanied by algorithms that treat various k-mer-based objects and compute related quantities across all possible k-mer sizes in $O(|\mathcal{G}_\mathcal{U}|)$ time.

  • Section 5 presents the experimental results of computing k-mer-based quantities.

  • Section 6 derives the construction algorithm of $\mathcal{G}_\mathcal{U}$, and discusses the limitations of other modern substring indices in computing substring co-occurrence.

2. Problem Formulation

2.1. Basic Notations

Let $\Sigma$ be our alphabet, let $S \in \Sigma^*$ represent a finite-length string, and let $\mathcal{U} \subset \Sigma^*$ be a set of strings. Because of the biological sequence motivation for this work, the elements of $\mathcal{U}$ are called sequences. Consider the following preliminary definitions:

Definition 1.

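  • A region in the string $S$ is a triple $(S,i,j)$ such that $1 \le i \le j \le |S|$.

  • The string of a region is the corresponding substring: $\mathrm{str}(S,i,j) := S_i S_{i+1} \cdots S_j$.

  • The size or length of a region is the length of its string: $|(S,i,j)| := j - i + 1$.

  • An extension of a region $(S,i,j)$ is a larger region $(S,i',j')$ including $(S,i,j)$:
    $(S,i,j) \subsetneq (S,i',j') :\iff i' \le i \le j \le j'$ and $|(S,i,j)| < |(S,i',j')|$.
  • The occurrences of $S$ in $\mathcal{U}$ are those regions in strings of $\mathcal{U}$ corresponding to $S$:
    $\mathrm{occ}_\mathcal{U}(S) := \{(S',i,j) \mid S' \in \mathcal{U} \text{ and } \mathrm{str}(S',i,j) = S\}$.
  • $S$ is a maximal repeat in $\mathcal{U}$ if it occurs at least twice and more than any of its extensions: $|\mathrm{occ}_\mathcal{U}(S)| > 1$, and for every proper superstring $S'$ of $S$, $|\mathrm{occ}_\mathcal{U}(S)| > |\mathrm{occ}_\mathcal{U}(S')|$. That is, given a fixed set $\mathcal{U}$, a maximal repeat $S$ is one that occurs more than once in $\mathcal{U}$ (i.e., is a repeat), and any extension of which has a lower frequency in $\mathcal{U}$ than the original string $S$.

  • $\mathcal{R}_\mathcal{U}$ is the set of all maximal repeats in $\mathcal{U}$.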
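The definitions above can be made concrete with a brute-force sketch. It is only an illustration for toy inputs, not the construction used in this work: it enumerates all substrings, which is quadratic in the input size. The function names are hypothetical, and a sequence is identified by its index in the input list rather than by the string itself.

```python
def occurrences(seqs, s):
    """occ_U(s): all regions (sequence index, i, j), 1-based and inclusive, whose string is s."""
    return [(n, i + 1, i + len(s))
            for n, t in enumerate(seqs)
            for i in range(len(t) - len(s) + 1)
            if t.startswith(s, i)]


def maximal_repeats(seqs):
    """Brute-force set of maximal repeats of seqs (assumes small inputs)."""
    subs = {t[i:j] for t in seqs for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    occ = {s: len(occurrences(seqs, s)) for s in subs}
    repeats = set()
    for s in subs:
        if occ[s] < 2:
            continue
        # One-letter extensions suffice: every longer superstring contains one
        # of them, so it cannot occur more often than that extension does.
        exts = [x for x in subs
                if len(x) == len(s) + 1 and (x.startswith(s) or x.endswith(s))]
        if all(occ[x] < occ[s] for x in exts):
            repeats.add(s)
    return repeats


U = ["ACCCT", "GACCC", "TCCCG"]
print(occurrences(U, "CCC"))
print(sorted(maximal_repeats(U)))
```

On this toy set, single letters such as C, G, and T also qualify as maximal repeats under the definition, alongside ACCC, CCC, and CC.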

2.2. The proxy problem: how to compute substring co-occurrence?

There exists a theoretical void in identifying when a $k$-mer and a $k'$-mer serve similar roles within their respective substring sets. Although close k-mer sizes generally yield comparable outputs, dissecting this phenomenon at the level of local substring scopes is unexpectedly challenging. For instance, de Bruijn graphs constructed from sequencing reads with $k=30$ and $k=31$ display distinct yet highly similar topologies [40]. However, formally articulating this topological similarity is elusive, as vertex mapping and other techniques establishing correspondences between the two graphs fail to consistently explain it.

A common underlying difficulty is that the entire k-mer set undergoes complete transformations as the k-mer size changes. Since the role of a $k$-mer is assigned within its $k$-mer set, the role of a $k'$-mer within the $k'$-mers cannot be “locally” derived. To address this issue, we propose substring co-occurrence as a generalized framework for consistently grouping k-mers of similar roles across varying k-mer sizes.

Definition 2.

A string $S$ co-occurs within a string $S'$ in $\mathcal{U}$ if:

$$S \text{ is a substring of } S' \quad\text{and}\quad |\mathrm{occ}_\mathcal{U}(S)| = |\mathrm{occ}_\mathcal{U}(S')|.$$

Modern substring indexes built on longest common prefixes (LCP) of their (implicit) suffix array are adept at capturing co-occurrence in this “extending direction.” In suffix trees, a substring that extends from $S$ to $S'$ along an edge, without passing through a node, indicates preservation of co-occurrence. Similarly, in the Burrows-Wheeler Transform (BWT) and its variants such as the FM-index and r-index, representing both $S$ and $S'$ with the same set of LCPs corresponding to some suffix array interval implies co-occurrence. However, there has been no recognized necessity for addressing the following “substring direction”:

Definition 3.

$\mathrm{cosubstr}_\mathcal{U}(S)$ is the set of substrings that co-occur within $S$.

Modern substring indexes are not efficient at computing $\mathrm{cosubstr}_\mathcal{U}(S)$, which results in their failure to “smoothly” explore k-mer-based objects across varying k-mer sizes. Specifically, shrinking a substring found in $\mathcal{U}$ always requires some link (or pointer) to navigate to the corresponding LCP in the (implicit) suffix array. Given that the number of substrings scales quadratically with the cumulative size of the input sequences, this linkage system becomes impractical. For instance, the variable-order de Bruijn graph and its variants address the space issue by employing a compact de Bruijn graph representation built on the Burrows-Wheeler transform [12]; however, this solution requires nontrivial operation times for both “forward” moves and changes in order ($k$) [11, 5]. So, these data structures do not allow extensive exploration of the substring space, and hence their multi-k approaches do not scale well.

Our work demonstrates that cosubstr𝒰(S) can be computed efficiently, and then the functionality can be used to rapidly compute k-mer quantities. We introduce a simple yet foundational property as groundwork.

Theorem 1 (Principle).

For any substring $S$ such that $|\mathrm{occ}_\mathcal{U}(S)| > 0$,

$$S \text{ co-occurs within exactly one sequence or maximal repeat in } \mathcal{U} \cup \mathcal{R}_\mathcal{U}.$$

Proof. Assume the theorem does not hold, i.e., $S$ co-occurs within either no string or multiple strings in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. Consider the former case: $S$ co-occurs within no string in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$, meaning $S$ occurs more than any of its superstrings in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. Given that $S$ occurs in some string $S' \in \mathcal{U}$ (i.e., $|\mathrm{occ}_\mathcal{U}(S)| > 0$), $S$ must be occurring more frequently than $S'$ by assumption. Then, there exists a maximal repeat $R$ such that $S$ is a substring of $R$ and $R$ is a substring of $S'$, because extending $S$ within $S'$ decreases its occurrence at some point. Again, whenever $S$ is a substring of a maximal repeat $R$, $S$ must be occurring more frequently than $R$ by assumption. Following the same argument, there exists another maximal repeat $R'$ such that $S$ is a substring of $R'$ and $R'$ is a proper substring of $R$. This recursive argument eventually terminates because $R'$ is strictly shorter than $R$. Thus, $S$ co-occurs within the last maximal repeat, which contradicts the assumption.

Now consider the latter case where $S$ co-occurs within at least two strings in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. If $S$ is extended within these two strings, either to the right or left by one step at a time, the extensions eventually diverge into two different strings since the two superstrings differ, thereby reducing the number of occurrences. Hence, $S$ must be occurring more frequently than at least one of those two superstrings, leading to a contradiction. □
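As a sanity check of Definition 2 and Theorem 1, the following brute-force sketch lists, for a substring, every string of $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ within which it co-occurs. The helper names are hypothetical, and the enumeration is only meant for toy inputs.

```python
def occ(seqs, s):
    """Number of occurrences of s across all sequences (overlaps counted)."""
    return sum(t.startswith(s, i) for t in seqs for i in range(len(t) - len(s) + 1))


def all_substrings(seqs):
    return {t[i:j] for t in seqs for i in range(len(t)) for j in range(i + 1, len(t) + 1)}


def maximal_repeats(seqs):
    """Repeats that occur more often than each of their one-letter extensions."""
    subs = all_substrings(seqs)
    return {s for s in subs
            if occ(seqs, s) > 1
            and all(occ(seqs, x) < occ(seqs, s)
                    for x in subs
                    if len(x) == len(s) + 1 and (x.startswith(s) or x.endswith(s)))}


def cooccurrence_hosts(seqs, s):
    """Strings of U ∪ R_U within which s co-occurs (Definition 2)."""
    vertices = set(seqs) | maximal_repeats(seqs)
    return sorted(v for v in vertices if s in v and occ(seqs, v) == occ(seqs, s))


U = ["ACCCT", "GACCC", "TCCCG"]
for s in ("A", "CCT", "CC"):
    print(s, "co-occurs within", cooccurrence_hosts(U, s))
```

For this toy set, every substring has exactly one host, as the theorem states; for example, A co-occurs within the maximal repeat ACCC rather than within either sequence containing it.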

This theorem provides two key insights. First, every substring found in $\mathcal{U}$ exclusively co-occurs as a substring of either a sequence or a maximal repeat in $\mathcal{U}$, so the collective sets of $\mathrm{cosubstr}_\mathcal{U}(S)$ from every $S \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$ comprehensively and disjointly cover the entire substring space. Second, by computing $\mathrm{cosubstr}_\mathcal{U}(S)$ for every $S \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$, we can categorize k-mers of similar roles across various lengths, devising various algorithmic ideas on $\mathrm{cosubstr}_\mathcal{U}(S)$. The following main theorem outlines the components discussed in the remainder of the manuscript.

Theorem 2. (Main Result)

Given a set of sequences $\mathcal{U}$, there exists a substring representation $\mathcal{G}_\mathcal{U}$ of size $O(|\Sigma||\mathcal{R}_\mathcal{U}|)$. $\mathcal{G}_\mathcal{U}$ can be used to compute k-mer quantities of $\mathcal{U}$ for all $k = 1, \ldots, k_{\max}$ within $O(|\mathcal{G}_\mathcal{U}|)$ time and space, where $k_{\max}$ is the longest sequence length in $\mathcal{U}$. The k-mer quantities are given as follows:

  • (Counts) The number of distinct k-mers, as used by DandD for genome cardinality [10].

  • (Frequencies) k-mer Bray-Curtis dissimilarities, as used in comparing metagenomic samples [36].

  • (Extensions) The number of unitigs in k-mer de Bruijn graphs, as used in analyses of reference genomes and genome assembly [29, 42].

  • (Occurrences) The number of edges with annotated length of $k$ in the overlap graph [45].

Furthermore, the graph 𝒢𝒰 can be constructed in O(|BWT|+|𝒢𝒰|) time and space, where |BWT| is the size of the Burrows-Wheeler transform representation of 𝒰.

Proof. Section 3 analyzes the size of 𝒢𝒰, Section 4 introduces the computation of k-mer quantities, and Section 6 derives the construction algorithm. □

The main contribution is that the time complexity for computing k-mer quantities for all $k \in [1, k_{\max}]$ using $\mathcal{G}_\mathcal{U}$ is $O(|\mathcal{G}_\mathcal{U}|)$, independent of the range of k-mer sizes. Specifically, letting $N$ be the cumulative sequence length of $\mathcal{U}$, $|\mathcal{G}_\mathcal{U}|$ is sublinear relative to $N$. In contrast, the direct computation of k-mer quantities from the sequence set $\mathcal{U}$ for any fixed k-mer size demands, at best, $O(N)$ time. Extending this computation naively to cover all $k = 1, \ldots, k_{\max}$ would multiply the computational demand by the number of k-mer sizes, to $O(N k_{\max})$. This significant improvement in complexity leverages a proper representation of repeats in $\mathcal{U}$, as will be introduced in the next section.

3. Prokrustean graph: A hierarchy of maximal repeats

The key idea of the Prokrustean graph is to recursively capture maximal repeats by their relative frequencies of occurrences in 𝒰. Consider the following two descriptions of repeats in three sequences: ACCCT, GACCC, and TCCCG.

  1. ACCC is in 2 sequences, and CCC is in all 3 sequences.

  2. ACCC is in 2 sequences, and CCC is in 1 sequence TCCCG and 1 substring ACCC.

The first and second descriptions use 5 and 4 units of occurrences, respectively, yet preserve the same meaning. As depicted in Figure 1, the rule of the more compact second description is to recursively capture regions of more frequent substrings.

Figure 1:

Recursive repeats. In each sequence on the top level, maximally long subregions (indicated with blue underlining) denote substrings that are more frequent than their superstring. Each arrow then points to the node of that substring, and the process repeats hierarchically.

Definition 4.

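$(S,i,j)$ is a locally-maximal repeat region in $\mathcal{U}$ if:

$$|\mathrm{occ}_\mathcal{U}(\mathrm{str}(S,i,j))| > |\mathrm{occ}_\mathcal{U}(S)| \quad (1)$$

and for each extension $(S,i',j')$ of $(S,i,j)$,

$$|\mathrm{occ}_\mathcal{U}(\mathrm{str}(S,i',j'))| = |\mathrm{occ}_\mathcal{U}(S)|. \quad (2)$$

Note that we often omit “in $\mathcal{U}$” when the context is clear.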

A locally-maximal repeat region captures a substring that appears more frequently than the “parent” string, and any extension of the region captures a substring that appears as frequently as the parent string. Consider the previous example 𝒰={ACCCT,GACCC,TCCCG}. (ACCCT,1,4) is a locally-maximal repeat region capturing the substring ACCC in ACCCT. However, (ACCCT,2,4) is not a locally-maximal repeat region because the region capturing CCC in ACCCT can be extended to the left to capture ACCC which occurs more frequently than ACCCT.

Note that an alternative definition in terms of substrings (locally-maximal repeats) instead of regions is not robust. Consider $\mathcal{U}=\{\text{ACCCTCCG}, \text{GCCC}\}$ and $S := \text{ACCCTCCG}$. Observe that the region $(S,2,4)$ capturing CCC is a locally-maximal repeat region because CCC occurs more frequently than $S$ in $\mathcal{U}$, but its immediate left and right extensions capture ACCC and CCCT, which occur the same number of times as $S$. In contrast, the subregion $(S,2,3)$ capturing CC is not a locally-maximal repeat region because its extension $(S,2,4)$ still captures a string (CCC) that occurs 2 times in $\mathcal{U}$, which is more than $S$. However, $(S,6,7)$, which also captures CC, is a locally-maximal repeat region because $|\mathrm{occ}_\mathcal{U}(\text{CC})| = 5 > |\mathrm{occ}_\mathcal{U}(S)| = 1$ and $|\mathrm{occ}_\mathcal{U}(S)| = |\mathrm{occ}_\mathcal{U}(\text{TCC})| = |\mathrm{occ}_\mathcal{U}(\text{CCG})|$. Consequently, defining a locally-maximal repeat as $S_{6..7} = \text{CC}$ becomes ambiguous with $S_{2..3} = \text{CC}$, whereas an explicit expression of a region $(S,6,7)$ more accurately reflects the desired property of hierarchy of occurrences.
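The two examples above can be reproduced with a brute-force check of conditions (1) and (2). This is a sketch for toy inputs with hypothetical function names; it tests only one-step extensions, which suffices because an occurrence count can never increase as a region grows and can never fall below that of $S$ itself.

```python
def occ(seqs, s):
    """Number of occurrences of s across all sequences (overlaps counted)."""
    return sum(t.startswith(s, i) for t in seqs for i in range(len(t) - len(s) + 1))


def locally_maximal_regions(seqs, s):
    """1-based inclusive regions (i, j) of s that satisfy conditions (1) and (2)."""
    base, n, found = occ(seqs, s), len(s), []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if occ(seqs, s[i - 1:j]) <= base:          # condition (1) fails
                continue
            one_step = [(a, b) for a, b in ((i - 1, j), (i, j + 1)) if a >= 1 and b <= n]
            if all(occ(seqs, s[a - 1:b]) == base for a, b in one_step):   # condition (2)
                found.append((i, j))
    return found


print(locally_maximal_regions(["ACCCT", "GACCC", "TCCCG"], "ACCCT"))
print(locally_maximal_regions(["ACCCTCCG", "GCCC"], "ACCCTCCG"))
```

Besides the regions discussed in the text, the check also reports single-letter regions such as the final G of ACCCTCCG, since G occurs twice in that set while CG occurs once.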

3.1. Prokrustean Graph

The Prokrustean graph of $\mathcal{U}$ is simply a graph representation of recursive locally-maximal repeat regions on $\mathcal{U}$, i.e., vertices represent substrings and edges represent locally-maximal repeat regions. The theorem below says that all maximal repeats in $\mathcal{U}$ are captured by this recursive description.

Theorem 3 (Complete).

Construct a string set $R(\mathcal{U})$ as follows:

  1. for each locally-maximal repeat region $(S,i,j)$ where $S \in \mathcal{U}$, add $\mathrm{str}(S,i,j)$ to $R(\mathcal{U})$ and

  2. for each locally-maximal repeat region $(S,i,j)$ where $S \in R(\mathcal{U})$, add $\mathrm{str}(S,i,j)$ to $R(\mathcal{U})$.
    It follows that $R(\mathcal{U}) = \mathcal{R}_\mathcal{U}$ holds.

Proof. Any string in $R(\mathcal{U})$ is a maximal repeat of $\mathcal{U}$, so the claim is satisfied if the process captures every maximal repeat of $\mathcal{U}$. Assume a maximal repeat $R$ is not in $R(\mathcal{U})$. Consider any occurrence of $R$ in some $S \in \mathcal{U}$. Extending the maximal repeat $R$ within $S$ makes the number of its occurrences drop, but since $R$ cannot occur as a locally-maximal repeat region by assumption, it must be included in some locally-maximal repeat region that captures a maximal repeat $R_1$, hence $R_1 \in R(\mathcal{U})$, and $R$ is a proper substring of $R_1$. Now consider any occurrence of $R$ within $R_1$. This argument continues recursively, capturing maximal repeats $R_1, R_2, \ldots$, but cannot extend indefinitely as $R_{i+1}$ is always shorter than $R_i$ for every $i \ge 1$. Eventually, $R$ must be occurring as a locally-maximal repeat region of some $R_n$, which is a contradiction. □

Therefore, we use maximal repeats as vertices of the Prokrustean graph, along with sequences, so the vertices correspond to $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$.

Definition 5.

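The Prokrustean graph of $\mathcal{U}$ is a directed multigraph $\mathcal{G}_\mathcal{U} := (\mathcal{V}_\mathcal{U}, \mathcal{E}_\mathcal{U})$:

  • $\mathcal{V}_\mathcal{U}$: $v_S \in \mathcal{V}_\mathcal{U}$ if and only if $S \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$.

  • Each vertex $v_S \in \mathcal{V}_\mathcal{U}$ is annotated with the string size, $\mathrm{size}(v_S) = |S|$.

  • $\mathcal{E}_\mathcal{U}$: $e_{S,i,j} \in \mathcal{E}_\mathcal{U}$ if and only if $(S,i,j)$ is a locally-maximal repeat region.

  • Each edge $e_{S,i,j} \in \mathcal{E}_\mathcal{U}$ directs from $v_S$ to $v_{\mathrm{str}(S,i,j)}$ and is annotated with $\mathrm{interval}(e_{S,i,j}) := (i,j)$, to represent the locally-maximal repeat region.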

Figure 2 visualizes how locally-maximal repeat regions are encoded in a Prokrustean graph. Next, the cardinality analysis of this representation reveals promising bounds.

Figure 2:

The Prokrustean graph of 𝒰={ACCCT,GACCC,TCCCG,GGCG}. The left graph describes the recursion of locally-maximal repeat regions, and the Prokrustean graph on the right represents the same structure with integer labels storing the regions and the size of the substrings.

Theorem 4 (Compact). $|\mathcal{E}_\mathcal{U}| \le 2|\Sigma||\mathcal{V}_\mathcal{U}|$

Proof. Every vertex in $\mathcal{V}_\mathcal{U}$ has at most one incoming edge per letter extension on the right (or similarly on the left), so has at most $2|\Sigma|$ incoming edges. Assume the contrary: a maximal repeat appears as two different locally-maximal repeat regions $(S,i,j)$ and $(S',i',j')$ within $S, S' \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$, and can extend by the same letter on the right, i.e., $(S,i,j+1)$ and $(S',i',j'+1)$ capture the same string. Extending these regions further together will eventually diverge their strings, because either $S \ne S'$ or the original two regions are differently located even if $S = S'$. Consequently, at least one of them, say $(S,i,j+1)$ without loss of generality, results in a decreased occurrence count of its string as it extends. Hence, $|\mathrm{occ}_\mathcal{U}(\mathrm{str}(S,i,j+1))| > |\mathrm{occ}_\mathcal{U}(S)|$. This leads to a contradiction because a locally-maximal repeat region $(S,i,j)$ should satisfy $|\mathrm{occ}_\mathcal{U}(\mathrm{str}(S,i,j+1))| = |\mathrm{occ}_\mathcal{U}(\mathrm{str}(S,1,|S|))|$. □

Assuming that $|\mathcal{U}| \ll |\mathcal{R}_\mathcal{U}|$, meaning there are significantly more maximal repeats than sequences, we derive that $O(|\mathcal{V}_\mathcal{U}|) = O(|\mathcal{R}_\mathcal{U}|)$, hence $O(|\mathcal{G}_\mathcal{U}|) := O(|\mathcal{E}_\mathcal{U}|) = O(|\Sigma||\mathcal{R}_\mathcal{U}|)$ by the theorem. Given that $|\Sigma|$ is typically constant in genomic sequences (e.g., $\Sigma = \{A, C, T, G\}$), the graph’s size depends on the number of maximal repeats. Letting $N$ be the accumulated sequence length of $\mathcal{U}$, it is a well-known fact that $|\mathcal{R}_\mathcal{U}| < N$ [25], so the graph grows sublinearly with the input size. Furthermore, in the implementations described in Section 5, $\mathcal{R}_\mathcal{U}$ can be restricted to maximal repeats of length at least $k_{\min}$. Although the size of $\mathcal{G}_\mathcal{U}$ is comparable to that of the suffix tree of $\mathcal{U}$, setting $k_{\min}$ allows for significantly more efficient and configurable space usage.
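The structure of this section can be illustrated by building the graph of the Figure 2 example by brute force and checking Theorems 3 and 4 on it. This is a sketch under stated assumptions: it is not the BWT-based construction of Section 6, the function names and the dictionary layout are hypothetical, and the cost is far from the bounds discussed above.

```python
def occ(seqs, s):
    """Number of occurrences of s across all sequences (overlaps counted)."""
    return sum(t.startswith(s, i) for t in seqs for i in range(len(t) - len(s) + 1))


def locally_maximal_regions(seqs, s):
    """1-based inclusive regions (i, j) of s satisfying Definition 4."""
    base, n, found = occ(seqs, s), len(s), []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if occ(seqs, s[i - 1:j]) <= base:
                continue
            one_step = [(a, b) for a, b in ((i - 1, j), (i, j + 1)) if a >= 1 and b <= n]
            if all(occ(seqs, s[a - 1:b]) == base for a, b in one_step):
                found.append((i, j))
    return found


def maximal_repeats(seqs):
    """Independent brute-force enumeration, used only to check Theorem 3."""
    subs = {t[i:j] for t in seqs for i in range(len(t)) for j in range(i + 1, len(t) + 1)}
    return {s for s in subs
            if occ(seqs, s) > 1
            and all(occ(seqs, x) < occ(seqs, s)
                    for x in subs
                    if len(x) == len(s) + 1 and (x.startswith(s) or x.endswith(s)))}


def prokrustean_graph(seqs):
    """Return (sizes, edges): vertex -> |S|, and vertex -> list of (i, j, target string)."""
    edges, todo = {}, list(seqs)
    while todo:                      # follow edges recursively, as in Theorem 3
        s = todo.pop()
        if s in edges:
            continue
        edges[s] = [(i, j, s[i - 1:j]) for i, j in locally_maximal_regions(seqs, s)]
        todo.extend(target for _, _, target in edges[s])
    return {s: len(s) for s in edges}, edges


U = ["ACCCT", "GACCC", "TCCCG", "GGCG"]
sizes, edges = prokrustean_graph(U)
for s in sorted(edges, key=len, reverse=True):
    print(s, sizes[s], edges[s])
```

The vertices reached by the recursion are exactly the sequences and the brute-force maximal repeats, and the edge count stays below $2|\Sigma||\mathcal{V}_\mathcal{U}|$ on this input.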

4. Framework: Computing k-mer quantities for all k sizes

This section introduces the computation of k-mer quantities for all k sizes, preserving the $O(|\mathcal{G}_\mathcal{U}|)$ time and space.

4.1. Accessing co-occurring k-mers for a single k size

We briefly cover how k-mers of a fixed k-mer size are accessed through the Prokrustean graph, before generalizing the computation to all k-mer sizes in the next section. Previously, a toy example in Figure 1 depicted blue regions that cover locally-maximal repeat regions. Then, given a k-mer size, a complementary region is a maximal region that covers k-mers that are not included in a blue region. See the red underlined regions in Figure 3.

Figure 3:

Accessing k-mers by complementing locally-maximal repeat regions. Blue underlining depicts locally-maximal repeat regions, and red underlining depicts the complementary regions, which cover the k-mers not covered by blue regions, for $k=3$ and $k=4$. Every k-mer appears exactly once within a red region. For example, in the $k=3$ panel on the left, the 3-mer ACC is included in a red region in ACCCC and does not appear again in any other red region.

A convenient property is that k-mers in red regions co-occur within their parent strings. The following notation captures substrings in red regions:

Definition 6.

A string $S$ $k$-co-occurs within a string $S'$ in $\mathcal{U}$ if:

$$\text{every } k\text{-mer in } S \text{ co-occurs within } S' \text{ in } \mathcal{U}.$$

Equivalently, we say that $S$ is a $k$-co-occurring substring of $S'$. Also, $S$ maximally $k$-co-occurs within $S'$ in $\mathcal{U}$ if no superstring of $S$ $k$-co-occurs within $S'$ in $\mathcal{U}$. Maximal $k$-co-occurring substrings of a string $S$ are easily computed as complementary regions in $S$, as introduced in Figure 3. So, they are efficiently identified with the Prokrustean graph, as implied by the proposition below.

Proposition 1.

A string $S$ maximally $k$-co-occurs within a string $S'$ in $\mathcal{U}$ if and only if its region in $S'$ intersects each locally-maximal repeat region by at most $k-1$, and on both sides, either extends to an end of $S'$ or intersects a locally-maximal repeat region by exactly $k-1$.

Proof. The forward direction is straightforward: if $S$ $k$-co-occurs within $S'$, its region cannot intersect any locally-maximal repeat region by $k$ or more, and if the intersection at one side is under $k-1$ and the region does not reach that end of $S'$, then extending $S$ until the intersection is $k-1$ yields a superstring of $S$ that also $k$-co-occurs within $S'$, so $S$ is not maximal.

For the reverse direction, assume a region satisfies the conditions. A $k$-mer appearing in the region cannot occur more than $S'$, as it would mean the region intersects a locally-maximal repeat region by at least $k$, so $S$ $k$-co-occurs within $S'$. $S$ is also maximal because extending its region increases the size of an intersection to more than $k-1$. □

Proposition 1 implies that computing maximal $k$-co-occurring substrings of a string $S$ takes time linear in the number of locally-maximal repeat regions in $S$, given the regions pre-ordered by positions: enumerate locally-maximal repeat regions of length at least $k$ and check, on the left and right of each, whether a complementary region of length at least $k$ can be defined. Lastly, maximally extend complementary regions on both sides so that they either intersect locally-maximal repeat regions by $k-1$ or meet an end of $S$. See Figure 4 for a detailed example.
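The linear scan described above can be sketched as follows. The function name and the input layout are assumptions: regions are given as 1-based inclusive pairs sorted by position, and the example regions are the brute-force locally-maximal repeat regions of $S = \text{ACCCTCCG}$ in $\mathcal{U}=\{\text{ACCCTCCG}, \text{GCCC}\}$ (including the single-letter region for the final G).

```python
def max_k_cooccurring_regions(n, regions, k):
    """Maximal k-co-occurring regions of a string of length n.

    regions: locally-maximal repeat regions (i, j), 1-based inclusive, sorted by position.
    Returns complementary regions (a, b) whose k-mers lie in no repeat region.
    """
    out = []
    p = 1                       # first k-mer start position not yet accounted for
    last = n - k + 1            # last valid k-mer start position
    for i, j in regions:
        if j - i + 1 < k:       # too short to contain any k-mer
            continue
        if p < i:               # k-mers starting at p..i-1 are not covered
            out.append((p, i + k - 2))   # overlaps the repeat region by exactly k-1
        p = max(p, j - k + 2)   # first start position not inside this region
    if p <= last:
        out.append((p, n))      # reaches the right end of the string
    return out


S = "ACCCTCCG"
regions = [(2, 4), (6, 7), (8, 8)]
for k in (1, 2, 3, 4):
    found = max_k_cooccurring_regions(len(S), regions, k)
    print(k, [(a, b, S[a - 1:b]) for a, b in found])
```

For $k=2$ the middle region $(4,6)$ capturing CTC overlaps both neighbouring repeat regions by exactly one position, which is the situation Proposition 1 describes.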

Figure 4:

Computing k-co-occurring substrings (red) of S of varying k sizes. The rule is to cover every k-mer that is not covered by a locally-maximal repeat region (blue). Note that a k-co-occurring substring can appear on the intersection of two locally-maximal repeat regions too. For example, the region (S,3,6) capturing CAGC in the second row (k=4) overlaps two blue regions by 3.

Therefore, the idea for computing complementary regions can be applied to count distinct k-mers for a fixed k-mer size. The toy algorithm below takes O(|𝒢𝒰|) time and space.

[Algorithm 1: counting distinct k-mers of a fixed k-mer size]

Note that line 2 is the computation derived from Proposition 1, as depicted in Figure 4. The locally-maximal repeat regions of $S$ are represented by the outgoing edges of $v_S$. The nested loop on lines 1 and 2 thus takes $O(|\mathcal{E}_\mathcal{U}|)$, i.e., $O(|\Sigma||\mathcal{R}_\mathcal{U}|)$ time. For correctness, Theorem 1 states that any k-mer $K$ co-occurs within exactly one string $S$ in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. $K$ extends to a unique maximal $k$-co-occurring substring of $S$; otherwise, $K$ would occur more than once in $S$, contradicting co-occurrence. Lastly, Proposition 1 guarantees that every maximal $k$-co-occurring substring of $S$ can be identified by scanning the outgoing edges of $v_S$.
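A minimal sketch of this fixed-k counting is given below. It is not the listing from this work: the graph is replaced by a brute-force table from each vertex string to its locally-maximal repeat regions (a hypothetical stand-in for $\mathcal{G}_\mathcal{U}$), and only the final scan reflects the vertex and outgoing-edge loops described above.

```python
def occ(seqs, s):
    """Number of occurrences of s across all sequences (overlaps counted)."""
    return sum(t.startswith(s, i) for t in seqs for i in range(len(t) - len(s) + 1))


def locally_maximal_regions(seqs, s):
    """1-based inclusive regions (i, j) of s satisfying Definition 4, sorted by position."""
    base, n, found = occ(seqs, s), len(s), []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if occ(seqs, s[i - 1:j]) <= base:
                continue
            one_step = [(a, b) for a, b in ((i - 1, j), (i, j + 1)) if a >= 1 and b <= n]
            if all(occ(seqs, s[a - 1:b]) == base for a, b in one_step):
                found.append((i, j))
    return found


def region_table(seqs):
    """Brute-force stand-in for G_U: vertex string -> its locally-maximal repeat regions."""
    table, todo = {}, list(seqs)
    while todo:
        s = todo.pop()
        if s not in table:
            table[s] = locally_maximal_regions(seqs, s)
            todo.extend(s[i - 1:j] for i, j in table[s])
    return table


def count_distinct_kmers(seqs, k):
    """Count distinct k-mers by summing k-mers in complementary regions of every vertex."""
    total = 0
    for s, regions in region_table(seqs).items():   # every vertex
        p, last = 1, len(s) - k + 1                 # next unassigned k-mer start, last start
        for i, j in regions:                        # outgoing edges of the vertex
            if j - i + 1 < k:
                continue
            total += max(0, i - p)                  # k-mers starting at p..i-1
            p = max(p, j - k + 2)
        total += max(0, last - p + 1)               # k-mers after the last repeat region
    return total


U = ["ACCCT", "GACCC", "TCCCG"]
print([count_distinct_kmers(U, k) for k in range(1, 6)])
```

Each distinct k-mer is counted once because, by Theorem 1, it lies in a complementary region of exactly one vertex.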

4.2. Accessing co-occurring k-mers for a range of k

The computation can be extended to a range of k-mer sizes while preserving the complexity. Recall that the problem is to efficiently express cosubstr𝒰(S). It turns out that cosubstr𝒰(S) can be expressed by a finite number of stack-like substring structures.

Definition 7.

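Consider two strings $S$ and $T$ where $T$ is a substring of $S$, along with two numbers: a depth $d^*$ and a k-mer size $k^*$, each at least 1. A co-occurrence stack of $S$ in $\mathcal{U}$ is defined as a 4-tuple:

$$(S, T, k^*, d^*)$$
$$\text{when } \mathrm{str}\big(T,\ 1 + \delta_l (i-1),\ |T| - \delta_r (i-1)\big) \text{ maximally } k_i\text{-co-occurs within } S$$

where $k_i = k^* - (i-1)$ for $i = 1, 2, \ldots, d^*$, with change rates $\delta_l := \mathbb{1}[T \text{ is NOT a prefix of } S]$ and $\delta_r := \mathbb{1}[T \text{ is NOT a suffix of } S]$.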

A co-occurrence stack expresses a k-mer if it is a substring of some k-co-occurring substring identified by the stack. A co-occurrence stack is maximal if its expressed k-mers are not a subset of those expressed by any other co-occurrence stack. It is clear that any k-mer expressed in a co-occurrence stack of S co-occurs within S. So, maximal co-occurrence stacks of S collectively express cosubstr𝒰(S).
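To make the definition concrete, the sketch below expands a co-occurrence stack into the substrings it identifies, one per k-mer size. It assumes that $T$ is given by its region in $S$ (so that prefix and suffix status is decided by position) and that $k_i$ decreases by one per level while the region shrinks by $\delta_l$ and $\delta_r$; the function name is hypothetical.

```python
def expand_stack(s, t_start, t_end, k_star, depth):
    """List (k_i, substring) pairs identified by the stack (s, T, k_star, depth).

    T is the region (t_start, t_end) of s, 1-based inclusive.
    """
    delta_l = 0 if t_start == 1 else 1          # T is NOT a prefix of s
    delta_r = 0 if t_end == len(s) else 1       # T is NOT a suffix of s
    levels = []
    for i in range(1, depth + 1):
        a = t_start + delta_l * (i - 1)
        b = t_end - delta_r * (i - 1)
        levels.append((k_star - (i - 1), s[a - 1:b]))
    return levels


S = "ACCCTCCG"
print(expand_stack(S, 4, 6, 2, 2))   # a stack between two repeat regions (type-2 in the later naming)
print(expand_stack(S, 1, 3, 3, 3))   # a stack anchored at the prefix (type-1 in the later naming)
```

In the first stack the number of expressed k-mers goes from 1 at $k=1$ to 2 at $k=2$, while in the prefix stack it stays at 1 for every level.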

Definition 8.

$\mathrm{costack}_\mathcal{U}(S)$ is the set of all maximal co-occurrence stacks of $S$ in $\mathcal{U}$.

A maximal co-occurrence stack of S is type-0 if the substring T is S itself, type-1 if T is a proper prefix or suffix of S, and type-2 otherwise. The numeral in each type’s name indicates the number of locally-maximal repeat regions intersecting the regions of substrings identified by the co-occurrence stack. Refer to Figure 5 for intuition on the type names.

Figure 5:

Three types of maximal co-occurrence stacks of $S$ (sets of red rectangles), given two locally-maximal repeat regions in $S$ (blue rectangles). The Rate column describes the change in the count of expressed k-mers as the k-mer size increases by 1. Hence, the rate is $-1 + \delta_l + \delta_r$, and it controls the vertical shapes of stacks.

Theorem 5.

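$\mathrm{costack}_\mathcal{U}(S)$ completely and disjointly covers the co-occurring substrings of $S$, i.e.,

$$\mathrm{cosubstr}_\mathcal{U}(S) = \bigcup_{(S,T,k^*,d^*) \in \mathrm{costack}_\mathcal{U}(S)} \{\text{substrings expressed by } (S,T,k^*,d^*)\}.$$

Also,

$$|\mathrm{costack}_\mathcal{U}(S)| \le 1 + 2 \cdot (\text{the number of locally-maximal repeat regions in } S).$$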

Proof. Appendix B.1 covers the proof. □

Lastly, this scheme is extended to the entire substring space. Also, all maximal co-occurrence stacks in $\mathcal{U}$ can be enumerated within the computation time $O(|\mathcal{E}_\mathcal{U}|)$.

Proposition 2.

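The number of all maximal co-occurrence stacks in $\mathcal{U}$ is $O(|\mathcal{G}_\mathcal{U}|)$. Precisely,

$$\sum_{S \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}} |\mathrm{costack}_\mathcal{U}(S)| \le |\mathcal{V}_\mathcal{U}| + 2|\mathcal{E}_\mathcal{U}|.$$

Proof. This is implied by the inequality $|\mathrm{costack}_\mathcal{U}(S)| \le 1 + 2 \cdot (\text{the number of locally-maximal repeat regions in } S)$. The term $|\mathcal{V}_\mathcal{U}|$ accounts for the constant contribution of 1 from each $S \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$, and $2|\mathcal{E}_\mathcal{U}|$ corresponds to the right term because $\mathcal{E}_\mathcal{U}$ represents the total set of locally-maximal repeat regions in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. □

Proposition 3.

Given $\mathcal{G}_\mathcal{U}$, enumerating $\mathrm{costack}_\mathcal{U}(S)$ for some $v_S \in \mathcal{V}_\mathcal{U}$ takes $O(|\text{outgoing edges of } v_S|)$ time.

Proof. This is derived from the proof of $|\mathrm{costack}_\mathcal{U}(S)| \le 1 + 2 \cdot (\text{the number of locally-maximal repeat regions in } S)$ in Appendix B.1. □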

Since k-mers expressed by each stack of S inherit characteristics of S, such as frequency and extensions within 𝒰, algorithms can leverage this information to compute k-mer quantities, by considering k-mers in the same stack as having similar roles in the k-mer-based objects. We further introduce four applications motivated by biological problems that benefit from this methodology.

4.3. Application: Counting distinct k-mers of all k sizes

The number of distinct k-mers is commonly used to assess the complexity and structure of pangenomes and metagenomes [44, 13]. A recent study by [10] spans multiple k contexts, suggesting that the number of k-mers across all k-mer sizes should be used to estimate genome sizes. The following algorithm counts distinct k-mers for all k-mer sizes in time O𝒢𝒰. We assume the maximum possible k-mer size, kmax, is always less than 𝒰, which is typical for most sequencing data and genome references.

The core idea leverages the change rate of k-mers within each co-occurrence stack, thereby eliminating the need to consider k-mer counts individually. The number of k-mers expressed by each type-0, 1, or 2 co-occurrence stack changes by −1, 0, or +1 as $k$ increases by 1, respectively, as detailed in the Rate column of Figure 5. This property is utilized in lines 4–7, and hence the loop in line 2 runs in $O(|\mathcal{G}_\mathcal{U}|)$ time.

[Algorithm 2: counting distinct k-mers of all k-mer sizes]
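The rate idea can be illustrated with a sketch that counts distinct k-mers for every $k$ in one pass over the vertices and regions. It is an assumption-laden alternative, not the listing from this work: instead of enumerating co-occurrence stacks, it uses an equivalent inclusion-exclusion in which each vertex contributes its k-mers, each locally-maximal repeat region removes the k-mers it contains, and each overlap of consecutive regions adds back what was removed twice. Every such span of length $L$ contributes $\max(0, L-k+1)$, so a histogram of span lengths followed by two suffix sums yields all counts. The graph is again replaced by a brute-force table with hypothetical names.

```python
def occ(seqs, s):
    """Number of occurrences of s across all sequences (overlaps counted)."""
    return sum(t.startswith(s, i) for t in seqs for i in range(len(t) - len(s) + 1))


def locally_maximal_regions(seqs, s):
    """1-based inclusive regions (i, j) of s satisfying Definition 4, sorted by position."""
    base, n, found = occ(seqs, s), len(s), []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if occ(seqs, s[i - 1:j]) <= base:
                continue
            one_step = [(a, b) for a, b in ((i - 1, j), (i, j + 1)) if a >= 1 and b <= n]
            if all(occ(seqs, s[a - 1:b]) == base for a, b in one_step):
                found.append((i, j))
    return found


def region_table(seqs):
    """Brute-force stand-in for G_U: vertex string -> its locally-maximal repeat regions."""
    table, todo = {}, list(seqs)
    while todo:
        s = todo.pop()
        if s not in table:
            table[s] = locally_maximal_regions(seqs, s)
            todo.extend(s[i - 1:j] for i, j in table[s])
    return table


def count_distinct_kmers_all_k(seqs):
    """Return [C(1), ..., C(kmax)], the number of distinct k-mers for every k."""
    kmax = max(len(s) for s in seqs)
    hist = [0] * (kmax + 2)                      # signed number of spans per length
    for s, regions in region_table(seqs).items():
        hist[len(s)] += 1                        # all k-mers of the vertex
        for idx, (i, j) in enumerate(regions):
            hist[j - i + 1] -= 1                 # k-mers inside a repeat region
            if idx + 1 < len(regions):
                overlap = j - regions[idx + 1][0] + 1
                if overlap > 0:
                    hist[overlap] += 1           # removed twice, add back once
    counts, spans = [0] * (kmax + 2), 0
    for k in range(kmax, 0, -1):
        spans += hist[k]                         # signed number of spans of length >= k
        counts[k] = counts[k + 1] + spans        # each such span holds one more k-mer
    return counts[1:kmax + 1]


print(count_distinct_kmers_all_k(["ACCCT", "GACCC", "TCCCG"]))
```

After the table is built, the counting loop touches each vertex and region once and then performs a single sweep over $k$, so its cost does not grow with the number of k-mer sizes beyond that sweep.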

4.4. Application: Computing Bray-Curtis dissimilarities of all k sizes

The Bray-Curtis dissimilarity, defined with k-mers, is a frequency-based metric that measures the dissimilarity between biological samples. The values are sensitive to the choice of k-mer sizes, so they are often calculated with multiple k-mer sizes to check the influence [47] [23] [36], yet most state-of-the-art tools utilize a fixed k-mer size [7].

We computed the Bray-Curtis dissimilarity for all k-mer sizes of samples A and B using the Prokrustean graph constructed from the union of their sequence sets, 𝒰A and 𝒰B. This graph tracks the origins of sequences, computing the following quantities:

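$$\mathrm{BC}(\mathcal{U}_A, \mathcal{U}_B, k) = 1 - \frac{\sum_{k\text{-mer in } \mathcal{U}_A \cup \mathcal{U}_B} 2\min\left(\mathrm{freq}_A(k\text{-mer}), \mathrm{freq}_B(k\text{-mer})\right)}{\sum_{k\text{-mer in } \mathcal{U}_A \cup \mathcal{U}_B} \left(\mathrm{freq}_A(k\text{-mer}) + \mathrm{freq}_B(k\text{-mer})\right)}$$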

Recall that the value assignments in lines 5 and 6 of Algorithm 2 utilized co-occurrence stacks for counting k-mers. Here, the values $\min(\mathrm{freq}_A(v_S), \mathrm{freq}_B(v_S))$ and $\mathrm{freq}_A(v_S) + \mathrm{freq}_B(v_S)$ are assigned instead of a constant 1, because they grow linearly with the number of co-occurring k-mers weighted by the frequency-based values. Consequently, vectors N and D contain the values of the numerator and denominator parts of the Bray-Curtis dissimilarities for all k-mer sizes.
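For checking such outputs on small inputs, the definition can be evaluated directly. The following Python sketch is a brute-force reference for a single k, not the Prokrustean-based algorithm; the function names are illustrative, and it counts every k-mer occurrence on the forward strand only (no canonical k-mers), which is an assumption.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Count every occurrence of every k-mer in a list of strings."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def bray_curtis(seqs_a, seqs_b, k):
    """Bray-Curtis dissimilarity of two samples at a single k-mer size."""
    freq_a = kmer_frequencies(seqs_a, k)
    freq_b = kmer_frequencies(seqs_b, k)
    # k-mers missing from either sample contribute min(...) = 0,
    # so summing over the shared k-mers is enough for the numerator.
    numerator = sum(
        2 * min(freq_a[kmer], freq_b[kmer])
        for kmer in freq_a.keys() & freq_b.keys()
    )
    denominator = sum(freq_a.values()) + sum(freq_b.values())
    if denominator == 0:
        raise ValueError("no k-mers of this size in either sample")
    return 1 - numerator / denominator


# Shared 2-mers: AC and CG; numerator = 4, denominator = 6.
print(bray_curtis(["ACGT"], ["ACGA"], 2))
```

Calling `bray_curtis` once per k reproduces, at brute-force cost, the per-k numerator and denominator that the vectors N and D hold for all k at once.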

Typically, biological experiments compute these values for pairs of multiple samples. The proposed algorithm can be extended to consider multiple samples, requiring $O(|\text{samples}|^2 \cdot |\mathcal{G}_\mathcal{U}|)$ time.

4.5. Application: Counting maximal unitigs of all k sizes

A maximal unitig of a de Bruijn graph, or a vertex of a compacted de Bruijn graph [20], represents a simple path that cannot be extended further, reflecting a topological characteristic of the graph. More maximal unitigs generally indicate more complex graph structures, which significantly impact genome assembly performance [14, 3]. Their number across multiple k-mer sizes reflects the influence of k-mer sizes on the complexity of assembly.

The idea is that maximal unitigs are implied by tips, convergences, and divergences, which are indicated by type-0 or type-1 co-occurrence stacks. For instance, consider a string $S$ in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ with multiple right extensions, such as C and T. If a suffix region in $S$ of length $l$ is a locally-maximal repeat region, it suggests that a $k$-co-occurring suffix of $S$ where $k > l$ corresponds to a vertex in the respective de Bruijn graph of order $k$ with right extensions C and T. Therefore, identifying the locally-maximal repeat regions on suffixes and prefixes of strings in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ is sufficient.
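As a point of comparison for small inputs, maximal unitigs can be counted by building the de Bruijn graph explicitly for one k. The Python sketch below is a brute-force reference under stated assumptions, not the paper's Algorithm 4: it uses a node-centric graph (an edge exists whenever two present k-mers overlap by k−1), forward strand only, and counts an isolated cycle as one unitig. Tools such as GGCAT may use different conventions.

```python
def count_maximal_unitigs(sequences, k, alphabet="ACGT"):
    """Count maximal non-branching paths in a node-centric de Bruijn graph."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    succ = {u: [u[1:] + c for c in alphabet if u[1:] + c in kmers] for u in kmers}
    pred = {u: [c + u[:-1] for c in alphabet if c + u[:-1] in kmers] for u in kmers}

    def starts_unitig(v):
        # v continues a unitig only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return not (len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1)

    starts = [v for v in kmers if starts_unitig(v)]
    visited = set()
    for v in starts:
        # Walk forward along the non-branching path beginning at v.
        while True:
            visited.add(v)
            if len(succ[v]) != 1 or starts_unitig(succ[v][0]):
                break
            v = succ[v][0]

    count = len(starts)
    # Vertices never reached lie on isolated cycles; count each cycle once.
    for v in kmers:
        if v not in visited:
            count += 1
            while v not in visited:
                visited.add(v)
                v = succ[v][0]
    return count


# ACG branches into CGT and CGA, giving three maximal unitigs at k = 3.
print(count_maximal_unitigs(["ACGT", "ACGA"], 3))  # 3
```

Running this once per k is exactly the iterative cost that the stack-based approach avoids.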

4.6. Application: Computing vertex degrees of overlap graph

Overlap graphs are extensively utilized in genome and metagenome assembly, alongside de Bruijn graphs. Although their definition appears unrelated to k-mers, we can view them as representing suffix-prefix k-mers of sequences in 𝒰. Overlap information is particularly crucial for metagenomics in read classification [16, 41, 31, 2] and contig binning [45, 32]. Vertex degrees are often interpreted as (abundance-weighted) read coverage in metagenomic samples, which exhibits high variance due to species diversity and abundance variability.

A significant drawback of overlap graphs is their quadratic growth in the number of reads, $O(|\mathcal{U}|^2)$. In contrast, the Prokrustean graph, which encompasses the overlap graph as a hierarchical subgraph, requires only $O(|\mathcal{G}_\mathcal{U}|)$ space. This structure enables the efficient computation of vertex degrees in the overlap graph.

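The overlap graph of $\mathcal{U}$ has the vertex set $\mathcal{U}$, and an edge $(v_S, v_{S'})$ if and only if a suffix of $S$ matches a prefix of $S'$. For $v_R \in \mathcal{V}_\mathcal{U}$, define $\mathrm{pref}(v_R)$ as the number of occurrences of $R$ as a prefix of strings in $\mathcal{U}$, which is easily computed through the recursive structure.

$$\mathrm{pref}(v_R) := \sum_{e = (v_{R'}, v_R) \in \mathcal{E}_\mathcal{U},\ \mathrm{region}(e) = (1, *)} \mathrm{pref}(v_{R'})$$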

Starting from $v_S \in \mathcal{V}_\mathcal{U}$, explore a suffix path, meaning recursively choose edges that are locally-maximal repeat regions on suffixes. Then, sum up $\mathrm{pref}(v_*)$ for all vertices encountered along the path. Since $\mathrm{pref}(v_R)$ represents the occurrence of $R$ as a prefix, the summed quantity equals the outgoing degree of vertex $S$ in the overlap graph of $\mathcal{U}$. Symmetrically, incoming degrees can be computed by defining $\mathrm{suff}(v_R)$. The recursion can be strategically organized so that vertices are visited only once when computing all overlap degrees of $\mathcal{U}$, thereby reducing the overall complexity to $O(|\mathcal{G}_\mathcal{U}|)$.
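A quadratic brute-force computation makes the target quantity concrete and can serve as a check on small read sets. The Python sketch below is not the Prokrustean-based recursion. It assumes distinct reads, counts at most one edge per ordered pair of reads, requires the overlap to be at least `k_min` long and shorter than both reads, and ignores self-overlaps; how the paper counts multiple overlaps between the same pair is not specified here, so this convention is an assumption.

```python
def overlap_degrees(reads, k_min):
    """Out- and in-degrees in a brute-force overlap graph.

    An edge s -> t exists when some suffix of s of length >= k_min
    (and shorter than both reads) equals a prefix of t.
    """
    if k_min < 1:
        raise ValueError("k_min must be at least 1")
    out_degree = {read: 0 for read in reads}
    in_degree = {read: 0 for read in reads}
    for s in reads:
        for t in reads:
            if s == t:
                continue  # self-overlaps are ignored in this sketch
            longest = min(len(s), len(t)) - 1
            if any(s[-length:] == t[:length] for length in range(k_min, longest + 1)):
                out_degree[s] += 1
                in_degree[t] += 1
    return out_degree, in_degree


out_deg, in_deg = overlap_degrees(["ACGT", "GTAC", "CGTT"], 2)
print(out_deg)  # {'ACGT': 2, 'GTAC': 1, 'CGTT': 0}
print(in_deg)   # {'ACGT': 1, 'GTAC': 1, 'CGTT': 1}
```

The pairwise loop is the $O(|\mathcal{U}|^2)$ behavior that the suffix-path summation is designed to avoid.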

5. Experiments and Results

Here, we implement the construction and the four applications of the Prokrustean graph and test them on various datasets. Datasets were randomly selected to represent a broad range of sizes, sequencing technologies, and biological origins. Both reference genomes and sequencing data were used to demonstrate the scalability of the Prokrustean graph. 16 human pangenome and 3,500 metagenome references were used, and diverse sequencing datasets were collected, including two metagenome short-read datasets, one metagenome long-read dataset, and two human transcriptome short-read datasets. The full list of datasets is provided in Appendix A.

5.1. Prokrustean graph construction with kmin

The construction algorithm is introduced later (Section 6). Recall that the Prokrustean graph grows with the number of maximal repeats, i.e., $O(|\mathcal{G}_\mathcal{U}|) = O(|\Sigma||\mathcal{R}_\mathcal{U}|)$. The size of $\mathcal{G}_\mathcal{U}$ can be controlled by dropping maximal repeats of length below $k_{min}$. This threshold limits the length of locally-maximal repeat regions to $k_{min}$ or above, capturing a subgraph of the Prokrustean graph, and hence achieves $O(|\Sigma||\mathcal{R}_\mathcal{U}|)$ space, counting only maximal repeats of length at least $k_{min}$, while computing k-mer quantities for $k = k_{min}, \ldots, k_{max}$. Figure 6 shows how the graph size of short sequencing reads of length 151 decreases as $k_{min}$ increases.

Figure 6:

Sizes of Prokrustean graphs (number of vertices and edges) versus $k_{min}$ values for a short read dataset ERR3450203. The size of the graph drops around $k_{min} = 11, \ldots, 19$, meaning the maximal repeats and locally-maximal repeat regions are dense when $k_{min}$ is under 11. The number of edges falls again around $k_{min} = 100$ because the graph eventually becomes fully disconnected.

Additionally, for 3500 E. coli genomes with $k_{min} = 10$, we observed that $|\mathcal{V}_\mathcal{U}| < 0.01N$ and $|\mathcal{E}_\mathcal{U}| < 0.03N$. Figure 7 shows how references of pangenome and metagenome grow. Again, the growth of the graph follows the growth of 𝒰.

Figure 7:

Sizes of Prokrustean graphs as a function of sequences added. Left: Prokrustean graphs constructed using an accumulation of 16 human chromosome 1 references. Right: the same, but generated using 3500 E. coli references, exhibiting an increased growth rate that reflects E. coli species diversity.

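Table 1 shows that with some practical $k_{min}$ values applied, the size of the Prokrustean graphs stays around the size of their inputs. It is observed that $|\mathcal{G}_\mathcal{U}|$ is comparable to $N$, with $|\mathcal{V}_\mathcal{U}| < 0.2N$ and $|\mathcal{E}_\mathcal{U}| < 0.5N$ with short reads.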

Table 1:

Performances of Prokrustean graph construction from BWTs. Symbols M, B, mb, and gb mean millions, billions, megabytes, and gigabytes. N denotes the cumulative sequence length of 𝒰 and we can see the output is comparable to N.

dataset reads N kmin time mem output
SRR20044276(metagenomic) 0.94M 72M 20 13s 200mb 40mb
ERR3450203(metagenomic) 55M 4.5B 30 28min 11gb 2.7gb
SRR18495451(metagenomic/long) 0.83M 11.3B 30 90min 43gb 14gb
SRR7130905 (human/rna-seq) 45M 6.8B 30 39min 18gb 4gb
SRR21862404 (human/rna-seq) 336M 22.3B 30 99min 37gb 5.9gb

Subsequent sections cover the results of four applications. Note that, to our knowledge, there are no existing computational techniques that perform the exact same tasks for a range of k-mer sizes, making performance comparisons inherently “unfair.” Readers are encouraged to focus on the overall time scale to gauge the efficiency of the Prokrustean graph. Additionally, discrepancies in the outputs of the comparisons may arise due to practice-specific configurations in computational tools, such as the use of only ACGT (i.e. no “N”) and canonical k-mers in KMC. Consequently, we tested correctness on GitHub using our brute-force implementations.

5.2. Result: Counting distinct k-mers for $k = k_{min}, \ldots, k_{max}$

We employed KMC [28], an optimized k-mer counting library designed for a fixed k size, and used it iteratively for all k values (Table 2). This “unfair” comparison emphasizes the limitations of current practices in counting k-mers across various k sizes.

5.3. Result: Computing Bray-Curtis dissimilarities for $k = k_{min}, \ldots, k_{max}$

Bray-Curtis dissimilarity is a popular metric used in metagenomics analysis [23, 36]. In each reference we found, dissimilarity scores were presented for a single k-mer size. Figure 8 shows that dissimilarities between four example metagenome samples are not consistent: a pattern of dissimilarity observed at one k-mer size can be completely flipped at another. That is, no k size is “correct”; instead, new insights are obtained from multiple k sizes. This task took about 10 minutes with the Prokrustean graph of four samples of 12 gigabase pairs in total, and the graph construction took around 1 hour. Computing the same quantity takes around 5 to 8 minutes for a fixed k-mer size with Simka [7], and no library computes it across k-mer sizes.

Figure 8:

Bray-Curtis dissimilarities between four samples derived from infant fecal microbiota, which were studied in [36]. Samples include two from 12-month-old human subjects and two from 3-week-olds. The relative order between values shifts as k changes, e.g., the dissimilarity between samples 2 and 4 is highest at k=8 but lowest at k=21. Note that diverse k sizes (10, 12, 21, 31) are actively utilized in practice.

5.4. Result: Counting maximal unitigs for $k = k_{min}, \ldots, k_{max}$

We utilized GGCAT [20], a mature library optimized for constructing compacted de Bruijn graphs at a fixed k-mer size. Since maximal unitigs are vertices of a compacted de Bruijn graph, we measured their de Bruijn graph construction time. Again, the Prokrustean graph becomes more efficient as additional k-mer sizes are included in the computation. Table 3 displays the comparison, and Figure 9 illustrates some of the outputs.

Figure 9:

The number of maximal unitigs in de Bruijn graphs of order $k = 20, \ldots, 150$ of metagenomic short reads (ERR3450203). A complex scenario is revealed: the decrease of the numbers fluctuates in $k = 30, \ldots, 60$, and the peak around $k = 90$ corresponds to the sudden decrease in maximal repeats. Lastly, further disconnections increase contigs and eventually make the graph completely disconnected.

5.5. Result: Counting vertex degrees of overlap graph of threshold kmin

Here, we describe computing the vertex degrees of an overlap graph using a threshold of $k_{min}$ on a metagenomic short read dataset. Limiting the task to vertex degree counting, $O(|\Sigma||\mathcal{R}_\mathcal{U}|)$ time and space are required (Section 4.6). With the Prokrustean graph of 27 million short reads (4.5 gigabase pairs in total), the computation generating Figure 10 used 8.5 gigabytes of memory and 1 minute to count all vertex degrees.

Figure 10:

Vertex degrees of the overlap graph of metagenomic short reads (ERR3450203). The number of vertices per each degree is scaled as log log y. There is a clear peak around $10^2$ at both incoming and outgoing degrees, which may imply dense existence of abundant species. The intermittent peaks after $10^2$ might be related to repeating regions within and across species.

6. Prokrustean graph construction

Recall that Section 2.2 addressed the limitations of LCP-based substring indexes in computing cosubstr𝒰(S), which motivated the adoption of the Prokrustean graph. LCP-based representations inherently provide a “bottom-up” approach and are at best able to access incoming edges of vS in the Prokrustean graph. In contrast, the Prokrustean graph supports a “top-down” approach by accessing cosubstr𝒰(S) through co-occurrence stacks whose number depends on the outgoing edges of vS.

We start with a more straightforward, but less efficient approach: Section 6.1 employs affix trees (the union of the suffix tree of 𝒰 and the suffix tree of its reversed strings) to extract the Prokrustean graph. Although affix trees allow straightforward operations supporting bidirectional extensions above suffix trees, they are resource-intensive in practice. Section 6.2 refines the approach using the BWT of 𝒰, incorporating an intermediate model to compensate for the bidirectional context lost in moving away from affix trees. Section 6.3 then briefly introduces the implementation, covering recent advancements in BWT-based computations supporting it.

6.1. Extracting Prokrustean graph from two suffix trees

For ease of explanation, we use two suffix trees to represent the affix tree. Let $\mathcal{T}_\mathcal{U}$ denote the generalized suffix tree constructed from $\mathcal{U}$. The reversed string of a string $S$ is denoted by $rev(S)$, and the reversed string set $rev(\mathcal{U})$ is $\{rev(S) \mid S \in \mathcal{U}\}$. A node in a tree is represented as $node(S)$ if the path from the root to the node spells out the string $S$, with the considered tree being always clear from the context. We assume affix links are constructed, which map between $node(S)$ in $\mathcal{T}_\mathcal{U}$ and $node(rev(S))$ in $\mathcal{T}_{rev(\mathcal{U})}$ if both nodes exist in their respective trees.

Additionally, a string preceded/followed by a letter represents an extension of the string with that letter, as in $\sigma S$ or $S\sigma$. Similarly, concatenation of two strings $S$ and $S'$ can be written as $SS'$. An asterisk (*) on a substring denotes any number (including zero) of letters extending the substring, as in $S*$.

6.1.1. Vertex set

Collecting maximal repeats of $\mathcal{U}$ requires information about $|occ_\mathcal{U}(\sigma R)|$ and $|occ_\mathcal{U}(R\sigma)|$ for each $\sigma \in \Sigma$, for any considered substring $R$. Most substring indexes built above LCPs provide access to occurrences through the LCP intervals, and the number of occurrences is represented by the length of the corresponding LCP intervals. The reverse representation $\mathcal{T}_{rev(\mathcal{U})}$ is not required yet, implying that the same information is accessible via single-directional indexes like the BWT in the next section.

6.1.2. Edge set

A more nuanced approach is required to extract the edge set of a Prokrustean graph. Since an edge in a suffix tree basically implies co-occurrence, it sometimes directly indicates a locally-maximal repeat region, providing a subset of substrings that co-occur within their extensions. That is, an edge from $node(S)$ to $node(S')$ in $\mathcal{T}_\mathcal{U}$ implies that $S$ may represent a locally-maximal repeat region in $S'$, capturing the prefix $S$ of $S'$. However, different scenarios may arise: sometimes the reverse direction should be considered, or both directions, or neither. These variations depend on the combinations of letter extensions around the substring. We present a theorem that delineates these cases, visually explained in Figure 11 for intuition.

Figure 11:

Deriving locally-maximal repeat regions from edges of suffix trees. From a node of a maximal repeat R or rev(R), the extension S is identified by adding strings on edge annotations. Then, the three scenarios depict how each edge(s) locates each locally-maximal repeat region that captures R within S.

We define terms distinguishing occurrence patterns of extensions. A string $R$ is left-maximal if $|occ_\mathcal{U}(\sigma R)| < |occ_\mathcal{U}(R)|$ for every $\sigma \in \Sigma$, and left-non-maximal with $\sigma$ if $|occ_\mathcal{U}(\sigma R)| = |occ_\mathcal{U}(R)|$. Right-maximality and right-non-maximality are symmetrically defined.
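These definitions can be tested directly on toy inputs. The Python sketch below enumerates maximal repeats by brute force rather than through an LCP-based index; it assumes that a repeat is a non-empty string occurring at least twice in total across the sequences (overlapping occurrences included), and the function names are illustrative.

```python
def count_occurrences(sequences, pattern):
    """Number of (possibly overlapping) occurrences of pattern in all sequences."""
    width = len(pattern)
    return sum(
        1
        for seq in sequences
        for start in range(len(seq) - width + 1)
        if seq[start:start + width] == pattern
    )


def is_left_maximal(sequences, r, alphabet):
    """|occ(sigma R)| < |occ(R)| for every letter sigma."""
    occ = count_occurrences(sequences, r)
    return all(count_occurrences(sequences, a + r) < occ for a in alphabet)


def is_right_maximal(sequences, r, alphabet):
    """|occ(R sigma)| < |occ(R)| for every letter sigma."""
    occ = count_occurrences(sequences, r)
    return all(count_occurrences(sequences, r + a) < occ for a in alphabet)


def maximal_repeats(sequences, alphabet="ACGT"):
    """All substrings occurring at least twice that are left- and right-maximal."""
    substrings = {
        seq[i:j]
        for seq in sequences
        for i in range(len(seq))
        for j in range(i + 1, len(seq) + 1)
    }
    return {
        r
        for r in substrings
        if count_occurrences(sequences, r) >= 2
        and is_left_maximal(sequences, r, alphabet)
        and is_right_maximal(sequences, r, alphabet)
    }


print(sorted(maximal_repeats(["GGAA", "CCAA"])))  # ['A', 'AA', 'C', 'G']
```

In this example AA occurs twice with different left letters (G and C) and no right letter, so it is both left- and right-maximal.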

Theorem 6.

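Consider a maximal repeat $R \in \mathcal{R}_\mathcal{U}$ and letters $\sigma_l, \sigma_r \in \Sigma$. The following three cases define a bijective correspondence between selected edges in the suffix trees ($\mathcal{T}_\mathcal{U}$ and $\mathcal{T}_{rev(\mathcal{U})}$) and the edges $\mathcal{E}_\mathcal{U}$ of the Prokrustean graph of $\mathcal{U}$.

Case 6.1 (Prefix Condition) The following three statements are equivalent.

  (a) $R\sigma_r$ is left-maximal.

  (b) There exists a string $T$ of form $R\sigma_r{*}$ in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ such that an edge from $node(R)$ to $node(T)$ is in $\mathcal{T}_\mathcal{U}$.

  (c) There exists an edge $e_{T,i,j}$ in $\mathcal{E}_\mathcal{U}$ satisfying $str(T,i,j+1) = R\sigma_r$ and $i = 1$.

Case 6.2 (Suffix Condition) The following three statements are equivalent.

  (a) $\sigma_l R$ is right-maximal.

  (b) There exists a string $T$ of form ${*}\sigma_l R$ in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ such that an edge from $node(rev(R))$ to $node(rev(T))$ is in $\mathcal{T}_{rev(\mathcal{U})}$.

  (c) There exists an edge $e_{T,i,j} \in \mathcal{E}_\mathcal{U}$ satisfying $str(T,i-1,j) = \sigma_l R$ and $j = |T|$.

Case 6.3 (Interior Condition) The following three statements are equivalent.

  (a) $R\sigma_r$ is left-non-maximal with $\sigma_l$ and $\sigma_l R$ is right-non-maximal with $\sigma_r$.

  (b) There exist strings $S_l$ and $S_r$, and $T := S_l R S_r$ of form ${*}\sigma_l R \sigma_r{*}$ in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$ such that an edge from $node(R)$ to $node(RS_r)$ is in $\mathcal{T}_\mathcal{U}$ and an edge from $node(rev(R))$ to $node(rev(S_l R))$ is in $\mathcal{T}_{rev(\mathcal{U})}$.

  (c) There exists an edge $e_{T,i,j}$ in $\mathcal{E}_\mathcal{U}$ satisfying $str(T,i-1,j+1) = \sigma_l R \sigma_r$.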

Proof. Refer to Appendix C.1 for the proof, which is straightforward but tedious. □

The computation of extracting the edge set of the Prokrustean graph is realized by collecting incoming edges for each vertex: explore $\mathcal{T}_\mathcal{U}$ to identify each maximal repeat $R \in \mathcal{R}_\mathcal{U}$, and verify the maximality conditions in items labeled (a) in the theorem. Upon satisfying (a) of any case, use the corresponding (b) to identify a superstring $T$ from the suffix trees $\mathcal{T}_\mathcal{U}$ and $\mathcal{T}_{rev(\mathcal{U})}$. The corresponding (c) then confirms that the region of $R$ within $T$ is indeed a locally-maximal repeat region in $\mathcal{U}$. The precise region $(T,i,j)$ is inferred from the decomposition of $T$ outlined in (b). For instance, if $T$ is $S_l R$ as in Case 6.2, then $(T,i,j)$ is the suffix region $(T, |S_l|+1, |T|)$, so an edge from $v_T$ to $v_R$ is labeled $(|S_l|+1, |T|)$.

This approach covers every edge in the Prokrustean graph. Observe that items labeled (c) distinguish locally-maximal repeat regions by the letters on the immediate left or right. Each region is unique with respect to the letter extension, as previously established in the proof of Theorem 4. Hence, for each maximal repeat $R$, the conditions met in (a) bijectively map to the incoming edges of $v_R$, and thus the entire edge set $\mathcal{E}_\mathcal{U}$ is identified through Theorem 6.

We have leveraged a bidirectional substring index, but if only one direction is supported, i.e., 𝒯rev(𝒰) is unavailable, finding T as outlined in (b) becomes challenging. This limitation with single-directional substring indexes motivates the development of an enhanced algorithm.

6.2. Extracting Prokrustean graph from Burrows-Wheeler transform

This section addresses two issues of the previous computation when implemented in practice. Firstly, suffix tree representations generally consume substantial memory, which hinders scalability when handling large genomic datasets such as pangenomes. Secondly, bidirectional indexes are more space-consuming, expensive to build, and less commonly implemented than single-directional indexes. Instead, BWTs enable traversal of the same suffix tree structure using sublinear space relative to the input sequences, and are supported by a range of actively improved construction algorithms.

Since the bidirectional extensions in items of (b) in Theorem 6 are not directly applicable to the BWT or other single-directional substring indexes, we utilize a subset of occurrences of maximal repeats as an intermediate representation of locally-maximal repeat regions. This approach necessitates a consistent way of utilizing suffix orders within the generalized suffix arrays of 𝒰.

Before describing the construction of the Prokrustean graph from a BWT, we need the following definitions.

Definition 9.

Let the rank of a region $(S,i,j)$ be the rank of the suffix of $S$ starting at $i$ in the generalized suffix array of $\mathcal{U}$, which is implied by a substring index being used. Then, the first occurrence of a string $R$, $occ_\mathcal{U}(R)_1$, is the lowest ranked occurrence among $occ_\mathcal{U}(R)$.

Note that suffix ordering is consistent only within a sequence but varies across sequences depending on implementations. For example, among BWTs that employ different strategies, some consider the global lexicographical order, yielding $occ_\mathcal{U}(AA)_1 = (CCAA,3,4)$ for $\mathcal{U} = \{GGAA, CCAA\}$, while others impose a strict order on input sequences, such as $\mathcal{U} = \{GGAA\#_1, CCAA\#_2\}$, resulting in $occ_\mathcal{U}(AA)_1 = (GGAA,3,4)$. This variability is extensively discussed in [18]. In any scenario, the identical Prokrustean graph will be generated as long as the first occurrences are consistently considered.
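The two conventions in this example can be mimicked on toy inputs with a brute-force search. In the Python sketch below, suffixes are compared as plain strings and ties between equal suffixes are broken either by input position or by the lexicographic order of the source sequence. The second rule is only one assumed way to model a "global lexicographical order"; real BWT implementations may resolve ties differently. Regions are 1-indexed and inclusive, as in the text.

```python
def first_occurrence(sequences, pattern, tie_break="input_order"):
    """Lowest-ranked occurrence of pattern as a region (S, i, j).

    tie_break="input_order": equal suffixes are ordered by input position.
    tie_break="lexicographic": equal suffixes are ordered by source sequence.
    Both tie-breaking rules are illustrative assumptions.
    """
    if tie_break not in ("input_order", "lexicographic"):
        raise ValueError("unknown tie_break")
    candidates = []
    for index, seq in enumerate(sequences):
        for start in range(len(seq) - len(pattern) + 1):
            if seq.startswith(pattern, start):
                tie = index if tie_break == "input_order" else seq
                rank_key = (seq[start:], tie)
                region = (seq, start + 1, start + len(pattern))
                candidates.append((rank_key, region))
    if not candidates:
        return None
    return min(candidates)[1]


reads = ["GGAA", "CCAA"]
print(first_occurrence(reads, "AA", "input_order"))    # ('GGAA', 3, 4)
print(first_occurrence(reads, "AA", "lexicographic"))  # ('CCAA', 3, 4)
```

Whichever rule is used, the requirement is only that it is applied consistently for every string whose first occurrence is taken.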

The following intermediate model achieves the goal by collecting specific occurrences of maximal repeats utilizing first occurrences. These occurrences are designed to form one-to-one correspondences with the edges of the Prokrustean graph.

Definition 10.

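$\mathrm{ProjOcc}_\mathcal{U}$ is a subset of occurrences of strings in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$, where elements are specified as follows:

  1. For every sequence $S \in \mathcal{U}$, the entire region of the sequence, i.e., $(S,1,|S|) \in \mathrm{ProjOcc}_\mathcal{U}$.

  2. For every maximal repeat $R \in \mathcal{R}_\mathcal{U}$ and letters $\sigma_l, \sigma_r \in \Sigma$,

    1. If $R\sigma_r$ is left-maximal, then $(S,i,j-1) \in \mathrm{ProjOcc}_\mathcal{U}$ given $(S,i,j) := occ_\mathcal{U}(R\sigma_r)_1$.

    2. If $\sigma_l R$ is right-maximal, then $(S,i+1,j) \in \mathrm{ProjOcc}_\mathcal{U}$ given $(S,i,j) := occ_\mathcal{U}(\sigma_l R)_1$.

    3. If $R\sigma_r$ is left-non-maximal with $\sigma_l$ and $\sigma_l R$ is right-non-maximal with $\sigma_r$, then $(S,i+1,j-1) \in \mathrm{ProjOcc}_\mathcal{U}$ given $(S,i,j) := occ_\mathcal{U}(\sigma_l R \sigma_r)_1$.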

Refer to Figure 12 for intuition that $\mathrm{ProjOcc}_\mathcal{U}$ indirectly derives the locally-maximal repeat regions in $\mathcal{U} \cup \mathcal{R}_\mathcal{U}$. Although the same maximality conditions are used in both Theorem 6 and the definition of $\mathrm{ProjOcc}_\mathcal{U}$, the occurrences in $\mathrm{ProjOcc}_\mathcal{U}$ are organized in a nested manner, allowing locally-maximal repeat regions to be inferred through their inclusion relationships. So, $\mathrm{ProjOcc}_\mathcal{U}$ is a projected image of the Prokrustean graph of $\mathcal{U}$. The decoding rule for $\mathrm{ProjOcc}_\mathcal{U}$, necessary for reconstructing the Prokrustean graph, is articulated in the following proposition and theorem.

Figure 12:

Construction of the Prokrustean graph via projected occurrences. Gray vertices, depicted as either rectangles or circles, represent sequences in $\mathcal{U}$, while blue vertices denote maximal repeats of $\mathcal{U}$. The suffix tree is shown flipped to better illustrate the correspondence, providing intuition for the “bottom-up” approach. Dotted rectangles in each diagram refer to a maximal repeat $R$. First occurrences of substrings like $\sigma_l R$ and $R\sigma_r$ are identified through the construction of $\mathrm{ProjOcc}_\mathcal{U}$. Then, the nested structure of projected occurrences derives incoming edges of $v_R$ in the Prokrustean graph.

The following theorem elaborates the decoding rule. Let the relative occurrence of $(S,i,j)$ within $(S,i',j')$ be $(str(S,i',j'), i-i'+1, j-i'+1)$, defined only if $(S,i,j) \subsetneq (S,i',j')$.

Theorem 7.

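$e_{T,p,q} \in \mathcal{E}_\mathcal{U}$, i.e., $(T,p,q)$ is a locally-maximal repeat region in some $T \in \mathcal{U} \cup \mathcal{R}_\mathcal{U}$, if and only if there exist $(S,i,j), (S,i',j') \in \mathrm{ProjOcc}_\mathcal{U}$ satisfying:

  • $(S,i,j) \subsetneq (S,i',j')$, and

  • the relative occurrence of $(S,i,j)$ within $(S,i',j')$ is $(T,p,q)$, and

  • no region $(S,x,y) \in \mathrm{ProjOcc}_\mathcal{U}$ satisfies $(S,i,j) \subsetneq (S,x,y) \subsetneq (S,i',j')$.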

Proof. Refer to Appendix C.2 for the proof, which is straightforward but tedious. □

Therefore, two occurrences in their “closest” containment relationship in ProjOcc𝒰 imply their strings form a locally-maximal repeat region. Since every occurrence in ProjOcc𝒰 originates from either a maximal repeat or a sequence, each represents a vertex in the Prokrustean graph. Therefore, collecting these relative occurrences constructs the edge set of the Prokrustean graph. Refer to Appendix C for the algorithm.
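The "closest containment" rule can be sketched for a single sequence as follows. This Python example is a brute-force illustration, cubic in the number of regions, and not the linear-time procedure of Appendix C. It assumes the projected occurrences inside one sequence are given as 1-indexed inclusive `(i, j)` pairs, and it reports each edge as a hypothetical tuple `(parent_string, child_string, p, q)`, where `(p, q)` is the relative occurrence.

```python
def strictly_contains(outer, inner):
    """True if region inner lies inside region outer and they differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def decode_edges(sequence, regions):
    """Derive edges from closest-containment pairs of regions in one sequence."""
    regions = sorted(set(regions))
    edges = []
    for inner in regions:
        for outer in regions:
            if not strictly_contains(outer, inner):
                continue
            # Skip the pair if some other region sits strictly between them.
            if any(
                strictly_contains(outer, mid) and strictly_contains(mid, inner)
                for mid in regions
            ):
                continue
            (i, j), (i_outer, j_outer) = inner, outer
            parent = sequence[i_outer - 1:j_outer]
            child = sequence[i - 1:j]
            edges.append((parent, child, i - i_outer + 1, j - i_outer + 1))
    return edges


toy_regions = [(1, 10), (2, 6), (3, 4), (8, 9)]
for edge in decode_edges("ABCDEFGHIJ", toy_regions):
    print(edge)
# ('ABCDEFGHIJ', 'BCDEF', 2, 6)
# ('BCDEF', 'CD', 2, 3)
# ('ABCDEFGHIJ', 'HI', 8, 9)
```

Region (3, 4) yields an edge from BCDEF but not from the full sequence, because (2, 6) lies strictly between them.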

6.3. Implementation

We briefly introduce the techniques implemented in our code base: https://github.com/KoslickiLab/prokrustean.

Firstly, the BWT is constructed from a set of sequences using any modern algorithm supporting multiple sequences [30, 17, 21]. The resulting BWT is then converted into a succinct string representation, such as a wavelet tree, to facilitate access to the “nodes” of the implied suffix tree. For this purpose, we used the implementation from the SDSL project [24]. The traversal is built on foundational works by Belazzougui et al. [4] and Beller et al. [6], as detailed by Nicola Prezza et al. [37]. Their node representation, based on LCP intervals, enables constant-time access to $|occ_\mathcal{U}(R\sigma)|$ and $|occ_\mathcal{U}(\sigma R)|$ for each substring $R$ and letter $\sigma \in \Sigma$. This capability is crucial for verifying the maximality conditions outlined in Definition 10. Also, the first occurrences $occ_\mathcal{U}(R\sigma)_1$ and $occ_\mathcal{U}(\sigma R)_1$ are identified as the start of the LCP intervals of $R\sigma$ and $\sigma R$, respectively, so $\mathrm{ProjOcc}_\mathcal{U}$ can be built.

$\mathrm{ProjOcc}_\mathcal{U}$ is collected by exploring the nodes of the suffix tree implicitly supported by the BWT of $\mathcal{U}$. The exploration requires $O(\log|\Sigma| \cdot |\mathcal{T}_\mathcal{U}|)$ time in total, because succinct string operations take $O(\log|\Sigma|)$ time. Identifying a maximal repeat takes $O(|\Sigma|)$ time, and then collecting its occurrences in $\mathrm{ProjOcc}_\mathcal{U}$ takes $O(|\Sigma|^2)$ time, because the count for each combination of extensions, e.g., $|occ_\mathcal{U}(\sigma_l R \sigma_r)|$, is accessible in constant time. Therefore, constructing $\mathrm{ProjOcc}_\mathcal{U}$ takes $O(\log|\Sigma| \cdot |\Sigma|^2 \cdot |\mathcal{T}_\mathcal{U}|)$ time and $O(|BWT| + |\mathrm{ProjOcc}_\mathcal{U}|)$ space, where $|BWT|$ is the size of the succinct string representing the BWT of $\mathcal{U}$.

Lastly, reconstructing the Prokrustean graph using Theorem 7 takes $O(|\mathrm{ProjOcc}_\mathcal{U}|)$ time, as introduced in Appendix C. Note that the computation need not consider every combination of occurrences in $\mathrm{ProjOcc}_\mathcal{U}$; it leverages the property that an occurrence $(S,i,j) \in \mathrm{ProjOcc}_\mathcal{U}$ can imply locally-maximal repeat regions with up to two extensions in $S$ found in $\mathrm{ProjOcc}_\mathcal{U}$. Therefore, $\mathrm{ProjOcc}_\mathcal{U}$ is grouped by each sequence $S \in \mathcal{U}$, and the edge set $\mathcal{E}_\mathcal{U}$ can be built from each group. Hence, the total space usage is $O(|BWT| + |\mathrm{ProjOcc}_\mathcal{U}| + |\mathcal{G}_\mathcal{U}|)$, where $O(|\mathrm{ProjOcc}_\mathcal{U}|) = O(|\mathcal{G}_\mathcal{U}|)$. The space usage is primarily output-dependent, around $2|\mathcal{G}_\mathcal{U}|$ in practice, and the most time-consuming part is the construction of $\mathrm{ProjOcc}_\mathcal{U}$ via traversing the implicit suffix tree.

7. Discussion

We have introduced the problem of computing co-occurring substrings cosubstr𝒰(S) (Section 2) and derived the Prokrustean graph (Section 3). The graph facilitates computing costack𝒰(S) to efficiently express cosubstr𝒰(S) (Theorem 8), which is used to implement algorithms computing k-mer-based quantities (Section 4.3, Section 4.4, Section 4.5, Section 4.6). Lastly, the graph was constructed from the BWT (Section 6).

It is worthwhile to note that the construction described in Section 6.1 highlights the limitations of LCP-based substring indices by comparing the differences between the Prokrustean graph and suffix trees. Suffix trees cannot access the co-occurrence structure in constant time without the “top-down” representation supported by the Prokrustean graph. Therefore, we conjecture that building multi-k methods with modern LCP-based substring indices is challenging regardless of the underlying approach.

The application in Section 4.5 indicates that a Prokrustean graph implicitly accesses de Bruijn graphs of all orders. That is, a Prokrustean graph has potential for advancing the so-called variable-order scheme. Indeed, a Prokrustean graph of 𝒰 can be converted into the union of de Bruijn graphs of all orders. Since the Prokrustean graph does not grow faster than maximal repeats, the new representation can help formulate multi-k methods to identify errors in sequencing reads, detect SNPs in population genomes, and assemble genomes.

Thus, the open problem 5 in [40] can be answered by the Prokrustean graph, which calls for a practical representation of variable-order de Bruijn graphs that generates assembly comparable to that of overlap graphs. Recall that Section 4.6 analyzed the overlap graph of 𝒰 from the Prokrustean graph of 𝒰. So, the two popular objects used in genome assembly are elegantly accessed through the Prokrustean graph.

There is abundant literature analyzing the influence of k-mer sizes in many bioinformatics tasks. However, little effort has been made to derive rigorous formulations or quantities to explain these phenomena. It is a significant disadvantage to leave the effects of k-mer sizes shrouded in mystery, especially as many other biological assumptions are made in these tasks and complex pipelines. Understanding the usage of k-mers at least at the level of substring representations of the sequencing data or references is an essential initial step. Our framework is expected to contribute to forming this prerequisite, so that further improvements involving biological knowledge can follow.

Table 2:

Counting distinct k-mers with Prokrustean graphs (Algorithm 2) compared with KMC. KMC had to be executed iteratively, causing the computational time to increase steadily as the range of k increases. The running time of KMC for each k took about 1–3 minutes.

Dataset k KMC Prokrustean Graph
time memory threads time memory threads
SRR20044276 1...150 5m 837mb 22 0.36s 80mb 8
ERR3450203 1...150 70m 12gb 22 106s 11gb 8
SRR18495451 30...50000 days - 22 88s 24gb 8
SRR7130905 30...150 66m 12gb 22 46s 8.5gb 8
SRR21862404 30...150 112m 13gb 22 74s 13.8gb 8

Table 3:

Counting maximal unitigs with GGCAT and the Prokrustean graph (Algorithm 4). The efficiency shown in the compute column is emphasized, where only a few seconds are required for large datasets. GGCAT was executed iteratively for each k size, and its running time for each k took about 1–5 minutes. Both methods used 8 threads.

Dataset k GGCAT Prokrustean Graph
time memory time memory
SRR20044276 20..150 13m 310mb 0.36s 70mb
ERR3450203 30..150 98m 1.1gb 31s 5.8gb
SRR18495451 30..50000 days - 127s 26gb
SRR7130905 30...150 126m 1.4gb 46s 8.5gb
SRR21862404 30...150 168m 0.8gb 76s 13.8gb

A. Datasets

B. Maximal co-occurrence stacks

B.1. The proof of costack𝒰(S) completely covering cosubstr𝒰(S) and its bound.

Proposition 4.

A maximal $k$-co-occurring substring $T$ and a maximal $k'$-co-occurring substring $T'$ of $S$ are identified by the same maximal co-occurrence stack if and only if they share identical boundary conditions: on both sides, they either extend to the end of $S$ or intersect the same locally-maximal repeat region by $k-1$ and $k'-1$, respectively.

Proof. Consider a maximal $k$-co-occurring substring $T$ and a $k'$-co-occurring substring $T'$ of $S$ where $k < k'$. The forward direction is straightforward; a maximal co-occurrence stack $(S,S',k^*,d^*)$, which identifies $T$ and $T'$, identifies $S'$ as a $k^*$-co-occurring substring of $S$ by definition, and $S'$ either extends to an end of $S$ or intersects a locally-maximal repeat region by $k^*-1$ on both sides according to Proposition 1. Then, as $T$ and $T'$ are identified as the $(k^*-k+1)$-th and $(k^*-k'+1)$-th elements by the stack, respectively, the rates $\delta_l$ and $\delta_r$ derived from the stack ensure that $T$ and $T'$ share the equivalent side conditions with $S'$ as described in Proposition 1.

For the reverse direction, consider three types of co-occurrence stacks: 1. $T = T' = S$, i.e., they extend to both ends of $S$. Then a type-0 stack identifies them. Whenever $S$ $k$-co-occurs within $S$, the same holds for $k+1$ and above until $|S|$. Therefore, the stack $(S, S, k^* := |S|, d^*)$ with some maximal $d^*$ identifies $T$ and $T'$.

2. $T$ and $T'$ are proper prefixes of $S$. Then a type-1 stack identifies them. The condition says there is a locally-maximal repeat region $(S,i,j)$ intersecting the regions of $T$ and $T'$ in $S$ by $k-1$ and $k'-1$, respectively. Without loss of generality, let the intersection be on the left of $(S,i,j)$. Then, a $k^*$-co-occurring prefix of $S$ with $k^* := |(S,i,j)|$ intersects $(S,i,j)$ by $k^*-1$; if not, there must be a $k^*$-mer occurring more than $|occ_\mathcal{U}(S)|$ times starting between positions $1$ and $i-1$. But $k$-mers of $T$ start in these positions too, so $T$ does not $k$-co-occur within $S$, which leads to a contradiction. Thus, $(S, str(S,1,j-1), k^* := |(S,i,j)|, d^*)$ is a type-1 stack that identifies both $T$ and $T'$.

3. $T$ and $T'$ are neither prefixes nor suffixes. Then a type-2 stack identifies them. There are two locally-maximal repeat regions $(S,i,j)$ and $(S,i',j')$ that intersect the regions of $T$ and $T'$ by $k-1$ and $k'-1$, respectively. Assuming $(S,i,j)$ is smaller, a similar argument as in the previous paragraph shows that a $k^*$-co-occurring substring intersects both $(S,i,j)$ and $(S,i',j')$ by $k^*-1$ given $k^* := |(S,i,j)|$, so $(S, str(S,j-k^*+2,i'+k^*-2), k^* := |(S,i,j)|, d^*)$ identifies $T$ and $T'$. □

This proposition implies that either zero, one, or two locally-maximal repeat regions in S characterize a co-occurrence stack by identifying substrings of identical side conditions. This corresponds to the classifications—type-0, type-1, and type-2. The proposition also implies that the number of maximal co-occurrence stacks of S does not grow faster than the number of locally-maximal repeat regions in S.

Theorem 8.

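$\mathrm{costack}_\mathcal{U}(S)$ completely and disjointly covers the co-occurring substrings of $S$, i.e.,

$$\mathrm{cosubstr}_\mathcal{U}(S) = \bigsqcup_{(S,T,k^*,d^*) \in \mathrm{costack}_\mathcal{U}(S)} \{\text{substrings expressed by } (S,T,k^*,d^*)\}.$$

Also,

$$|\mathrm{costack}_\mathcal{U}(S)| \le 1 + 2 \cdot (\text{the number of locally-maximal repeat regions in } S)$$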

Proof. costack𝒰(S) completely and disjointly covers cosubstar𝒰(S): any k-mer occurring in 𝒰 co-occurs within an exactly one string S in 𝒰𝒰 by Theorem 1. The k-mer extends to a unique superstring that maximally k-co-occurs within S because appearing in two distinct maximally k-co-occurring substrings imply multi-occurrence of the k-mer in S. The superstring is identified by a unique maximal co-occurrence stack by Proposition 4. So, the k-mer is expressed by exactly one maximal co-occurrence stack.

To demonstrate the inequality, first, consider a type-0 stack on a string $S$, where all identified substrings are $S$ itself by definition, i.e., extend to both ends of $S$. Hence, just one type-0 stack identifies them all by Proposition 4.

Next, for a type-1 or type-2 maximal co-occurrence stack $(S,T,k^*,d^*)$, we show its longest identified substring always intersects some locally-maximal repeat region of length $k^*$ by exactly $k^*-1$. Since a proper type-1 stack identifies every $k$-co-occurring prefix (or suffix) of $S$ whose region intersects the same locally-maximal repeat region by $k-1$, by Proposition 4, the largest order $k^*$ works too. Then, $k^*$ must be the length of the locally-maximal repeat region; otherwise, we can find a $(k^*+1)$-co-occurring prefix (or suffix) of $S$, denoted $T'$, intersecting the locally-maximal repeat region by $k^*$, so $(S,T',k^*+1,d^*+1)$ is a proper co-occurrence stack, making the original stack non-maximal. Similarly, a type-2 stack identifies substrings that intersect two locally-maximal repeat regions on both sides, so checking that the length of the smaller region is $k^*$ permits a similar deduction.

Note that each of the left and right sides of a locally-maximal repeat region of length $l$ intersects the region of at most one maximally $l$-co-occurring substring by exactly $l-1$. Since such a maximal intersection happens at least once for every type-1 or type-2 maximal co-occurrence stack, the number of such stacks is bounded by twice the number of locally-maximal repeat regions, that is, by $2\cdot(\text{the number of outgoing edges of }v_S\text{ in }\mathcal{G}_\mathcal{U})$ in total. Adding the single type-0 stack gives the stated bound.

Lastly, enumerating $\mathrm{costack}_\mathcal{U}(S)$ is efficient. A type-0 stack is uniquely defined for a string $S$ where the depth $d^*$ is one more than the length of the largest locally-maximal repeat region. Each side of a locally-maximal repeat region maximally intersects the largest substring identified in at most one stack, either type-1 or type-2. Without loss of generality, a type-1 stack is defined on the left side of a locally-maximal repeat region if it is the largest one among those found on its left, while a type-2 stack pairs with the nearest locally-maximal repeat region of at least the same size on its left. These properties can be verified by a single scan of the locally-maximal repeat regions of a string. □
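The single scan described above can be illustrated with a monotonic stack. The following is a minimal sketch, not the authors' implementation: it only handles the left sides of regions, it assumes the locally-maximal repeat regions of one string are given as 1-indexed inclusive pairs `(i, j)` that are not nested, and the function names `nearest_left_at_least_as_long` and `stack_count_bound` are hypothetical. A result of `None` marks a left side that would be a type-1 candidate, and an index marks the region a type-2 candidate would pair with.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # assumed encoding: 1-indexed, inclusive (i, j) within one string S


def region_length(region: Region) -> int:
    return region[1] - region[0] + 1


def nearest_left_at_least_as_long(regions: List[Region]) -> List[Optional[int]]:
    """For each region, return the index of the nearest region on its left
    whose length is at least as large, or None if there is none."""
    order = sorted(range(len(regions)), key=lambda r: regions[r][0])
    result: List[Optional[int]] = [None] * len(regions)
    stack: List[int] = []  # region indices; lengths are non-increasing bottom to top
    for r in order:
        # Shorter regions on the left can never be the answer for r or for
        # anything to the right of r, so they are discarded.
        while stack and region_length(regions[stack[-1]]) < region_length(regions[r]):
            stack.pop()
        result[r] = stack[-1] if stack else None
        stack.append(r)
    return result


def stack_count_bound(regions: List[Region]) -> int:
    """Upper bound from Theorem 8: one type-0 stack plus two sides per region."""
    return 1 + 2 * len(regions)


if __name__ == "__main__":
    example = [(2, 4), (6, 6), (8, 9), (11, 15), (17, 18)]
    print(nearest_left_at_least_as_long(example))  # [None, 0, 0, None, 3]
    print(stack_count_bound(example))              # 11
```

Each region is pushed and popped at most once, so the scan is linear after sorting by start position. The right sides would be handled by the mirrored scan.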

C. Prokrustean graph construction

C.1. The proof of suffix trees to Prokrustean graph

Theorem 9.

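Consider a maximal repeat $R\in\mathcal{R}_\mathcal{U}$ and letters $\sigma_l,\sigma_r\in\Sigma$. The following three cases define a bijective correspondence between selected edges in the suffix trees ($\mathcal{T}_\mathcal{U}$ and $\mathcal{T}_{\mathrm{rev}(\mathcal{U})}$) and the edges $\mathcal{E}_\mathcal{U}$ of the Prokrustean graph of $\mathcal{U}$.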

Case 9.1 (Prefix Condition)

The following three statements are equivalent.

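  (a) $R\sigma_r$ is left-maximal.

  (b) There exists a string $T$ of form $R\sigma_r{*}$ in $\mathcal{U}\cup\mathcal{R}_\mathcal{U}$ such that an edge from $\mathrm{node}(R)$ to $\mathrm{node}(T)$ is in $\mathcal{T}_\mathcal{U}$.

  (c) There exists an edge $e_{T,i,j}$ in $\mathcal{E}_\mathcal{U}$ satisfying $\mathrm{str}(T,i,j+1)=R\sigma_r$ and $i=1$.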

Case 9.2 (Suffix Condition)

The following three statements are equivalent.

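  (a) $\sigma_l R$ is right-maximal.

  (b) There exists a string $T$ of form ${*}\sigma_l R$ in $\mathcal{U}\cup\mathcal{R}_\mathcal{U}$ such that an edge from $\mathrm{node}(\mathrm{rev}(R))$ to $\mathrm{node}(\mathrm{rev}(T))$ is in $\mathcal{T}_{\mathrm{rev}(\mathcal{U})}$.

  (c) There exists an edge $e_{T,i,j}$ in $\mathcal{E}_\mathcal{U}$ satisfying $\mathrm{str}(T,i-1,j)=\sigma_l R$ and $j=|T|$.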

Case 9.3 (Interior Condition)

The following three statements are equivalent.

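  (a) $R\sigma_r$ is left-non-maximal with $\sigma_l$ and $\sigma_l R$ is right-non-maximal with $\sigma_r$.

  (b) There exist strings $S_l$ and $S_r$, and $T:=S_l R S_r$ of form ${*}\sigma_l R\sigma_r{*}$ in $\mathcal{U}\cup\mathcal{R}_\mathcal{U}$ such that an edge from $\mathrm{node}(R)$ to $\mathrm{node}(RS_r)$ is in $\mathcal{T}_\mathcal{U}$ and an edge from $\mathrm{node}(\mathrm{rev}(R))$ to $\mathrm{node}(\mathrm{rev}(S_l R))$ is in $\mathcal{T}_{\mathrm{rev}(\mathcal{U})}$.

  (c) There exists an edge $e_{T,i,j}$ in $\mathcal{E}_\mathcal{U}$ satisfying $\mathrm{str}(T,i-1,j+1)=\sigma_l R\sigma_r$.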

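Proof. Case 9.1 (a)$\Rightarrow$(b): We argue by contraposition. Assume $T:=RS_r\notin\mathcal{U}\cup\mathcal{R}_\mathcal{U}$, i.e., $T$ is non-maximal on either the left or the right. Observe that the construction of the suffix tree $\mathcal{T}_\mathcal{U}$ implies $|\mathrm{occ}_\mathcal{U}(R\sigma_r)|=|\mathrm{occ}_\mathcal{U}(RS_r)|$ and that $RS_r$ is right-maximal. Since $RS_r$ is right-maximal, $RS_r$ must be left-non-maximal with some $\sigma\in\Sigma$, so $|\mathrm{occ}_\mathcal{U}(\sigma RS_r)|=|\mathrm{occ}_\mathcal{U}(RS_r)|$. Combining the two equalities gives $|\mathrm{occ}_\mathcal{U}(\sigma RS_r)|=|\mathrm{occ}_\mathcal{U}(RS_r)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$, and because $|\mathrm{occ}_\mathcal{U}(\sigma RS_r)|\le|\mathrm{occ}_\mathcal{U}(\sigma R\sigma_r)|\le|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$, $R\sigma_r$ is left-non-maximal with $\sigma$.

Case 9.1 (b)$\Rightarrow$(c): Since $R,T\in\mathcal{U}\cup\mathcal{R}_\mathcal{U}$ holds and $T[|R|+1]=\sigma_r$, showing that $(T,1,|R|)$ is a locally-maximal repeat region is enough to show that $e_{T,1,|R|}$ satisfies the statement. Let $T:=RS_r$. First, the region's string $R$ occurs more than $RS_r$ because $R$ is a maximal repeat. Next, as $(RS_r,1,|R|)$ is a prefix region, every extension includes the smallest extension $(RS_r,1,|R|+1)$, so the string of any of its extensions occurs at most $|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$ times and at least $|\mathrm{occ}_\mathcal{U}(RS_r)|$ times. But $\mathcal{T}_\mathcal{U}$ implies $|\mathrm{occ}_\mathcal{U}(R\sigma_r)|=|\mathrm{occ}_\mathcal{U}(RS_r)|$, so the extensions co-occur within $RS_r$. Hence, $(T,1,|R|)$ is a locally-maximal repeat region and $\mathrm{str}(T,1,|R|+1)=R\sigma_r$.

Case 9.1 (c)$\Rightarrow$(a): Since $(T,1,|R|)$ is a locally-maximal repeat region, its extension captures $R\sigma_r$, so $|\mathrm{occ}_\mathcal{U}(R\sigma_r)|=|\mathrm{occ}_\mathcal{U}(T)|$ holds. Also, $T\in\mathcal{U}\cup\mathcal{R}_\mathcal{U}$ implies $T$ is left- and right-maximal, so $|\mathrm{occ}_\mathcal{U}(\sigma T)|<|\mathrm{occ}_\mathcal{U}(T)|$ holds for any $\sigma\in\Sigma$. Therefore, $|\mathrm{occ}_\mathcal{U}(\sigma R\sigma_r)|<|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$ for any $\sigma\in\Sigma$, hence $R\sigma_r$ is left-maximal.

Case 9.2 is argued symmetrically to Case 9.1.

Case 9.3 (a)$\Rightarrow$(b): The non-maximalities in (a) imply $|\mathrm{occ}_\mathcal{U}(\sigma_l R)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|=|\mathrm{occ}_\mathcal{U}(\sigma_l R\sigma_r)|$. Also, by the co-occurrences implied by the two suffix trees, $|\mathrm{occ}_\mathcal{U}(S_l R)|=|\mathrm{occ}_\mathcal{U}(\sigma_l R)|$ and $|\mathrm{occ}_\mathcal{U}(RS_r)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$ hold. Hence, $|\mathrm{occ}_\mathcal{U}(S_l R)|=|\mathrm{occ}_\mathcal{U}(S_l RS_r)|=|\mathrm{occ}_\mathcal{U}(RS_r)|$ is satisfied, meaning $S_l RS_r$ is left- and right-maximal, so that $T:=S_l RS_r\in\mathcal{U}\cup\mathcal{R}_\mathcal{U}$.

Case 9.3 (b)$\Rightarrow$(c): The maximal repeat $R$ occurs more than $S_l RS_r$, and the string of any extension occurs at most $|\mathrm{occ}_\mathcal{U}(\sigma_l R)|$ or $|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$ times, while $|\mathrm{occ}_\mathcal{U}(\sigma_l R)|=|\mathrm{occ}_\mathcal{U}(S_l RS_r)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$ is implied by the two suffix trees. Hence, $(T,i,j):=(S_l RS_r,|S_l|+1,|S_l|+|R|)$ is a locally-maximal repeat region such that $\mathrm{str}(T,i-1,j+1)=\sigma_l R\sigma_r$.

Case 9.3 (c)$\Rightarrow$(a): This statement is trivial because the locally-maximal repeat region implies $|\mathrm{occ}_\mathcal{U}(\sigma_l R)|=|\mathrm{occ}_\mathcal{U}(S_l RS_r)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|$, hence $|\mathrm{occ}_\mathcal{U}(\sigma_l R)|=|\mathrm{occ}_\mathcal{U}(R\sigma_r)|=|\mathrm{occ}_\mathcal{U}(\sigma_l R\sigma_r)|$ holds, which implies non-maximalities on both sides of $R$. □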
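The three equivalences of Theorem 9 can be sanity-checked by brute force on a toy sequence set. The sketch below is only an illustration under stated assumptions, not part of the construction: it takes a locally-maximal repeat region of $T$ to be a proper region whose string occurs more often than $T$ while every one-letter extension occurs exactly as often as $T$ (our reading of the definition used in the proof above), it works directly on occurrence counts rather than on suffix trees, and all function names are hypothetical. It compares statements (a) and (c) of each case.

```python
from itertools import product


def occ(U, w):
    """Number of (possibly overlapping) occurrences of w over all sequences in U."""
    return sum(1 for s in U for a in range(len(s) - len(w) + 1) if s.startswith(w, a))


def maximal_repeats(U, alphabet):
    """Strings occurring at least twice whose every one-letter extension occurs less."""
    subs = {s[a:b] for s in U for a in range(len(s)) for b in range(a + 1, len(s) + 1)}
    return {w for w in subs
            if occ(U, w) >= 2
            and all(occ(U, c + w) < occ(U, w) and occ(U, w + c) < occ(U, w)
                    for c in alphabet)}


def edges(U, alphabet):
    """Assumed edge set: regions (T, i, j), 1-indexed inclusive, T in U or a maximal repeat."""
    found = set()
    for T in set(U) | maximal_repeats(U, alphabet):
        n_T = occ(U, T)
        for i in range(1, len(T) + 1):
            for j in range(i, len(T) + 1):
                if (i, j) == (1, len(T)) or occ(U, T[i - 1:j]) <= n_T:
                    continue
                left_ok = i == 1 or occ(U, T[i - 2:j]) == n_T
                right_ok = j == len(T) or occ(U, T[i - 1:j + 1]) == n_T
                if left_ok and right_ok:
                    found.add((T, i, j))
    return found


def check_theorem9(U, alphabet):
    """True if (a) and (c) agree in Cases 9.1, 9.2 and 9.3 for every R and letters."""
    E = edges(U, alphabet)
    for R in maximal_repeats(U, alphabet):
        for c in alphabet:
            n = occ(U, R + c)  # Case 9.1: R c is left-maximal
            a1 = n > 0 and all(occ(U, d + R + c) < n for d in alphabet)
            c1 = any(i == 1 and T[:j + 1] == R + c for T, i, j in E)
            m = occ(U, c + R)  # Case 9.2: c R is right-maximal
            a2 = m > 0 and all(occ(U, c + R + d) < m for d in alphabet)
            c2 = any(j == len(T) and T[i - 2:j] == c + R for T, i, j in E)
            if a1 != c1 or a2 != c2:
                return False
        for l, r in product(alphabet, repeat=2):
            k = occ(U, l + R + r)  # Case 9.3: both sides non-maximal
            a3 = k > 0 and k == occ(U, R + r) and k == occ(U, l + R)
            c3 = any(1 < i and j < len(T) and T[i - 2:j + 1] == l + R + r
                     for T, i, j in E)
            if a3 != c3:
                return False
    return True


if __name__ == "__main__":
    U = ["ACCCT", "GACCC"]
    print(sorted(maximal_repeats(U, "ACGT")))  # ['ACCC', 'C', 'CC']
    print(check_theorem9(U, "ACGT"))           # True
```

For this toy set the sketch finds six edges: one from each sequence into `ACCC`, two from `ACCC` into `CC`, and two from `CC` into `C`. The brute-force counting is quadratic or worse and is meant only for small inputs.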

C.2. The proof of projected occurrences to Prokrustean graph

The following proposition is useful for proving the main theorem.

Proposition 5.

For every maximal repeat $R\in\mathcal{R}_\mathcal{U}$, $\mathrm{occ}_\mathcal{U}(R)_1$ is in $\mathrm{ProjOcc}_\mathcal{U}$.

Proof. If a maximal repeat is a whole sequence in $\mathcal{U}$, it is trivially in $\mathrm{ProjOcc}_\mathcal{U}$. So, assume every occurrence of $R$ in $\mathcal{U}$ is extendable by some letters.

Let $(S,i,j):=\mathrm{occ}_\mathcal{U}(R)_1$, and define $\sigma_l:=S[i-1]$ and/or $\sigma_r:=S[j+1]$. One of $\sigma_l$ or $\sigma_r$ might not be defined, so consider $(S,i-1,j)=\mathrm{occ}_\mathcal{U}(\sigma_l R)_1$ without loss of generality.

If $\sigma_l R$ is right-maximal, then $\mathrm{occ}_\mathcal{U}(R)_1$ trivially belongs to $\mathrm{ProjOcc}_\mathcal{U}$ by 2-(2) of Definition 10. Otherwise, $\sigma_l R$ is right-non-maximal with some $\sigma_r$, so $(S,i,j+1)=\mathrm{occ}_\mathcal{U}(R\sigma_r)_1$ must hold.

Consequently, whether $R\sigma_r$ is left-maximal or left-non-maximal with $\sigma_l$, $\mathrm{occ}_\mathcal{U}(R)_1\in\mathrm{ProjOcc}_\mathcal{U}$ holds because conditions 2-(1) or 2-(3) of Definition 10 apply, respectively. The symmetric argument assuming $(S,i,j+1)=\mathrm{occ}_\mathcal{U}(R\sigma_r)_1$ confirms the same result. □

Theorem 10.

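$e_{T,p,q}\in\mathcal{E}_\mathcal{U}$, i.e., $(T,p,q)$ is a locally-maximal repeat region in some $T\in\mathcal{U}\cup\mathcal{R}_\mathcal{U}$, if and only if there exist $(S,i,j),(S,i',j')\in\mathrm{ProjOcc}_\mathcal{U}$ satisfying:

  • $(S,i,j)\subsetneq(S,i',j')$, and

  • the relative occurrence of $(S,i,j)$ within $(S,i',j')$ is $(T,p,q)$, and

  • no region $(S,x,y)\in\mathrm{ProjOcc}_\mathcal{U}$ satisfies $(S,i,j)\subsetneq(S,x,y)\subsetneq(S,i',j')$.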

Proof. The forward direction is composed of three parts. We show there exists some $(S,i,j)\in\mathrm{ProjOcc}_\mathcal{U}$ corresponding to $(T,p,q)$ that is specified by Theorem 9, and then show there exists some $(S,i',j')\in\mathrm{ProjOcc}_\mathcal{U}$ such that the relative occurrence of $(S,i,j)$ in the region is $(T,p,q)$. Lastly, the third statement is shown.

First, there exists an occurrence $(S,i,j)\in\mathrm{ProjOcc}_\mathcal{U}$ that captures the same string and shares some left or right extension with $(T,p,q)$. $(T,p,q)$ must satisfy exactly one of the three items labeled (c) in Case 9.1, Case 9.2, and Case 9.3, depending on the extensions of $(T,p,q)$. Then the corresponding item (a) describes a maximality condition around $R$ that is equally described in 2-(1), 2-(2), and 2-(3) of Definition 10 to derive elements in $\mathrm{ProjOcc}_\mathcal{U}$. So, there is always some $(S,i,j)$ in $\mathrm{ProjOcc}_\mathcal{U}$ chosen for $(T,p,q)$ by the agreement in letter extensions, i.e., either $S[i-1]=T[p-1]$ or $S[j+1]=T[q+1]$ or both hold.

Next, there exists some $(S,i',j')$ such that the relative occurrence of $(S,i,j)$ within $(S,i',j')$ is $(T,p,q)$. The construction of $(S,i,j)$ showed that either $S[i-1]=T[p-1]$ or $S[j+1]=T[q+1]$ holds, so assume $S[i-1]=T[p-1]$ without loss of generality. Since $(T,p,q)$ is a locally-maximal repeat region, $\mathrm{str}(T,p-1,q)$ co-occurs within $T$ and hence $\mathrm{str}(S,i-1,j)$ co-occurs within $T$. Therefore, $(S,i,j)$ can be extended to some $(S,i',j')$ so that $T$ is captured and $i$ is at $i'+p-1$, reflecting the position of $(T,p,q)$ within $T$. Similarly, $j=i'+q-1$ holds, and hence $(T,p,q)$ is recovered by $(\mathrm{str}(S,i',j'),i-i'+1,j-i'+1)$.

We need $(S,i',j')\in\mathrm{ProjOcc}_\mathcal{U}$ to finish showing the second statement. If $T$ is in $\mathcal{U}$, then trivially $(S,i',j')=(S,1,|S|)$ holds and $(S,1,|S|)\in\mathrm{ProjOcc}_\mathcal{U}$ follows from item 1 of Definition 10. Otherwise, $T$ is in $\mathcal{R}_\mathcal{U}$, and then $(S,i',j')$ is $\mathrm{occ}_\mathcal{U}(T)_1$ because the first occurrence of some substring co-occurring within $T$ was utilized for identifying $(S,i,j)$. Proposition 5 confirms that $\mathrm{occ}_\mathcal{U}(T)_1\in\mathrm{ProjOcc}_\mathcal{U}$.

Lastly, assume some $(S,x,y)$ in $\mathrm{ProjOcc}_\mathcal{U}$ satisfies $(S,i,j)\subsetneq(S,x,y)\subsetneq(S,i',j')$. Then a maximal repeat $R'$ $(:=\mathrm{str}(S,x,y))$ exists such that $R'$ is a superstring of $R$ $(:=\mathrm{str}(T,p,q))$ and $T$ is a superstring of $R'$. Then $R'$ could be captured by extending $(T,p,q)$ and $R'$ occurs more than $T$, so $(T,p,q)$ is not a locally-maximal repeat region, leading to a contradiction.

For the backward direction, assume for contradiction that $(T,p,q)$ is not a locally-maximal repeat region. Then some locally-maximal repeat region $(T,p',q')$ must be found by extending $(T,p,q)$ because $R:=\mathrm{str}(T,p,q)$ occurs more than $T$. Then, by following the same steps as in the first paragraph utilizing Theorem 9 and Definition 10, $(T,p',q')$ is used to identify an occurrence $(S,x,y)$ in $\mathrm{ProjOcc}_\mathcal{U}$ that shares some letter extension. So, letting $R':=\mathrm{str}(T,p',q')$, a substring of form either $\sigma_l R'$ or $R'\sigma_r$ is used to find a first occurrence and the substring co-occurs within $T$, so $(S,x,y)$ appears within $\mathrm{occ}_\mathcal{U}(T)_1$ so that its relative occurrence within $\mathrm{occ}_\mathcal{U}(T)_1$ is $(T,p',q')$. Therefore, $(S,x,y)\subsetneq\mathrm{occ}_\mathcal{U}(T)_1$. Also, $(S,i,j)\subsetneq(S,x,y)$ because $(T,p',q')$ is an extension of $(T,p,q)$, so that $(S,x,y)$ is an extension of $(S,i,j)$ too. Lastly, $\mathrm{occ}_\mathcal{U}(T)_1=(S,i',j')$ indeed holds because of $(S,i,j)\subsetneq(S,i',j')$. That is, if $\mathrm{occ}_\mathcal{U}(T)_1$ is not $(S,i',j')$ but some other occurrence of $T$ in $S'$, then the corresponding occurrence of $R$ in $\mathrm{ProjOcc}_\mathcal{U}$ must be found in $S'$ instead of $(S,i,j)$. Thus, the presence of $(S,x,y)$ leads to a contradiction. □

C.3. The algorithm — projected occurrences to Prokrustean graph

The algorithm below reduces the task of reconstructing the Prokrustean graph from $\mathrm{ProjOcc}_\mathcal{U}$ to a problem designed for a specific string $S\in\mathcal{U}$. For the subset of regions $\{(S',i,j)\in\mathrm{ProjOcc}_\mathcal{U}\mid S'=S\}$, determine their inclusion relationships to identify the closest related pairs. This is easily parallelized as it processes the intervals for each $S$ separately. The running time is linear in the size of the subset of regions for each $S$, hence $O(|\mathrm{ProjOcc}_\mathcal{U}|)$ in total.

An important assumption in the algorithm is that for any region $(S,i,j)\in\mathrm{ProjOcc}_\mathcal{U}$, within the subset $\{(S',i,j)\in\mathrm{ProjOcc}_\mathcal{U}\mid S'=S\}$, there are at most two regions $(S,i',j')$ such that $(S,i,j)\subsetneq(S,i',j')$ and no $(S,x,y)$ exists where $(S,i,j)\subsetneq(S,x,y)\subsetneq(S,i',j')$. It is straightforward to check the property.
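The inclusion step for one string can be illustrated as follows. This is a minimal sketch and not the linear-time algorithm referred to above: it uses a quadratic brute-force search for clarity, regions are assumed to be 1-indexed inclusive pairs `(i, j)` within a single string $S$, and the names `closest_enclosing` and `relative_occurrence` are hypothetical. For each region it returns the properly enclosing regions with no other region strictly in between, and it converts such a pair into the relative occurrence $(p,q)=(i-i'+1,\,j-i'+1)$ used in Theorem 10.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # assumed encoding: 1-indexed, inclusive (i, j) within one string S


def properly_contains(outer: Region, inner: Region) -> bool:
    """True if inner lies inside outer and the two regions differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def closest_enclosing(regions: List[Region]) -> Dict[Region, List[Region]]:
    """Map each region to its properly enclosing regions that have no other
    region strictly between them and the region itself."""
    result: Dict[Region, List[Region]] = {}
    for r in regions:
        containers = [c for c in regions if properly_contains(c, r)]
        result[r] = [c for c in containers
                     if not any(properly_contains(c, x) for x in containers)]
    return result


def relative_occurrence(inner: Region, outer: Region) -> Region:
    """Position (p, q) of inner when outer is read as a standalone string."""
    return (inner[0] - outer[0] + 1, inner[1] - outer[0] + 1)


if __name__ == "__main__":
    regions = [(1, 10), (2, 6), (3, 5), (4, 8), (4, 5)]
    parents = closest_enclosing(regions)
    print(parents[(4, 5)])                       # [(3, 5), (4, 8)]
    print(relative_occurrence((4, 5), (3, 5)))   # (2, 3)
    print(relative_occurrence((4, 5), (4, 8)))   # (1, 2)
```

In the example, the region `(4, 5)` has two closest enclosing regions, which matches the "at most two" assumption stated above. Each pair of a region and a closest enclosing region would correspond to one candidate edge.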



Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
