Author manuscript; available in PMC: 2025 Dec 17.
Published in final edited form as: Int Symp String Process Inf Retr. 2025 Sep 22;16073:10–17. doi: 10.1007/978-3-032-05228-5_2

KeBaB: k-mer based breaking for finding long MEMs

Nathaniel K Brown 1, Lore Depuydt 2, Mohsen Zakeri 1, Anas Alhadi 3, Nour Allam 3, Dove Begleiter 3, Nithin Bharathi Kabilan Karpagavalli 3, Suchith Sridhar Khajjayam 3, Hamza Wahed 3, Travis Gagie 3, Ben Langmead 1
PMCID: PMC12700359  NIHMSID: NIHMS2112843  PMID: 41393220

Abstract

Long maximal exact matches (MEMs) are used in many genomics applications such as read classification and sequence alignment. Li’s ropebwt3 finds long MEMs quickly because it can often ignore much of its input, skipping matching steps that are redundant to the final output. In this paper we propose KeBaB, a fast and space-efficient k-mer filtration step using a Bloom filter. This approach speeds up MEM-finders such as ropebwt3 even further by letting them ignore even more of the input: it breaks the input into substrings called “pseudo-MEMs”, which are guaranteed to contain all long MEMs. We also show experimentally that KeBaB can accelerate metagenomic classification without significantly reducing accuracy, either by finding all long MEMs or by leveraging the filter to find only the long MEMs present in the t longest pseudo-MEMs.

Keywords: Maximal exact matches, k-mer filtration, Pseudo-MEMs

1. Introduction

A challenge for today’s string-matching algorithms is to compute exact matches with respect to an index over a large, repetitive text. This is a pressing problem in computational genomics, where databases of reference genomes and pangenomes are growing very rapidly. One highly practical full-text indexing method for pangenomes is ropebwt3 [10], which indexes using a run-length compressed form of the Burrows-Wheeler Transform of the text. Its strategy for querying the index involves skipping along the query in the style of Boyer-Moore pattern matching [3], an idea that was first connected to BWT queries by Gagie [8].

In this paper we propose a fast k-mer filtration strategy using a Bloom filter that allows for more skipping and speeds up MEM-finders such as ropebwt3 substantially. We call our strategy KeBaB for “k-mer based breaking”. In Section 2 we briefly review MEM-finding. In Section 3, we describe how to break a pattern into substrings we call pseudo-MEMs that are guaranteed to contain all sufficiently long MEMs of the pattern with respect to an indexed text. If we are interested only in the t longest MEMs, then we can search in the pseudo-MEMs in non-increasing order by length and stop when we have found t MEMs at least as long as the next pseudo-MEM. This would require modifying existing MEM-finders such as ropebwt3, but our experiments in Section 4 indicate that simply searching in the t longest pseudo-MEMs and discarding the rest does not significantly affect downstream results in metagenomic classification, even compared to using all the long MEMs. Figure 1 shows an example of how to use KeBaB to find pseudo-MEMs.

Fig. 1.

An example of how to use KeBaB to find pseudo-MEMs.

2. MEMs, forward-backward and BML

A maximal exact match (MEM) — also called super-maximal exact match (SMEM) — of a pattern P[0..m–1] with respect to a text T[0..n–1] is a substring P[i..j] such that

  • P[i..j] occurs in T,

  • i = 0 or P[i – 1..j] does not occur in T,

  • j = m – 1 or P[i..j + 1] does not occur in T.

Finding MEMs is an important step in many bioinformatics pipelines, such as aligning long and error-prone DNA reads to large pangenomic references [19].
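The three conditions in the definition above can be checked directly by brute force. The following is a minimal sketch for illustration only: the function name `mems_brute_force` is hypothetical, and Python's substring test stands in for an index query, so it is far too slow for real pangenomes.

```python
def mems_brute_force(P, T):
    """Return all MEMs of P with respect to T as inclusive (i, j) pairs."""
    m = len(P)
    mems = []
    for i in range(m):
        for j in range(i, m):
            if P[i:j + 1] not in T:
                break  # longer substrings starting at i cannot occur either
            # Conditions 2 and 3 of the definition: no extension occurs in T.
            left_maximal = i == 0 or P[i - 1:j + 1] not in T
            right_maximal = j == m - 1 or P[i:j + 2] not in T
            if left_maximal and right_maximal:
                mems.append((i, j))
    return mems


print(mems_brute_force("ATTAGACA", "GATTACA"))
# [(0, 3), (4, 5), (5, 7)]
```

Here the MEMs are "ATTA", "GA" and "ACA"; note that consecutive MEMs may overlap, as "GA" and "ACA" do.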

For Li’s [9] popular forward-backward MEM-finding algorithm, we keep FM-indexes [6] of T and its reverse T^rev (see [12, 14] for an introduction to FM-indexes). Assuming all the characters in P occur in T, the leftmost MEM starts at P[0]. We can therefore find the leftmost MEM P[0..e_1] through forward extension by searching for P^rev in the index for T^rev. If e_1 < m – 1 then the second MEM P[s_2..e_2] from the left in P includes P[e_1 + 1]. By definition, no MEM includes P[s_2 – 1..e_1 + 1], so we can find s_2 through backward extension by searching for P[0..e_1 + 1] in the index for T. Conceptually, we can then recurse on P[s_2..m – 1] and find e_2 and the remaining MEMs. The number of backward steps this takes in the indexes is proportional to the total length of the MEMs.
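The alternation between forward and backward extension can be sketched as follows. This is an illustrative sketch rather than Li's implementation: the hypothetical helper `occurs` uses a plain substring test where the real algorithm would take one step in an FM-index of T or of its reverse, and the function name is an assumption. Unlike the description above, the sketch also tolerates characters of P that do not occur in T.

```python
def forward_backward_mems(P, T):
    """All MEMs of P w.r.t. T as inclusive (start, end) pairs, left to right."""
    occurs = lambda s: s in T  # stand-in for an FM-index query (assumption)
    m = len(P)
    mems = []
    s = 0
    while s < m:
        if not occurs(P[s]):
            s += 1  # no MEM can contain a character absent from T
            continue
        # Forward extension: grow the match to the right from s.
        e = s
        while e + 1 < m and occurs(P[s:e + 2]):
            e += 1
        mems.append((s, e))
        if e == m - 1:
            break
        # Backward extension: the next MEM contains P[e + 1]; grow leftwards
        # from there. P[s..e + 1] does not occur, so the loop stops with s2 > s.
        s2 = e + 2
        while occurs(P[s2 - 1:e + 2]):
            s2 -= 1
        s = s2
    return mems


print(forward_backward_mems("ATTAGACA", "GATTACA"))
# [(0, 3), (4, 5), (5, 7)]
```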

For many applications we are interested only in MEMs that are sufficiently long; these are biologically significant since they are unlikely to be the result of noise. Unfortunately, the total length of the MEMs is often dominated by many short MEMs, which we would like to ignore. Suppose we are interested only in MEMs of length at least L. Gagie [8] recently observed that any such MEM starting in P[0..L – 1] includes P[L – 1], so if we search for P[0..L – 1] in the index for T and find that P[s..L – 1] occurs in T but P[s – 1..L – 1] does not, for some s ≥ 1, then we can ignore P[0..s – 1] and recurse on P[s..m – 1]. If we find that all of P[0..L – 1] occurs in T then we can still use the first few steps of forward-backward to find the leftmost MEM and the starting position of the second MEM from the left in P, and then recurse. Since this approach is reminiscent of Boyer-Moore pattern matching, we call it Boyer-Moore-Li (BML). Li [10] incorporated BML into ropebwt3 and found it significantly accelerates MEM-finding. Depuydt et al.’s [4] b-move also supports accelerated MEM-finding by applying BML to the bidirectional r-index.
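The skipping rule can be sketched in the same style. Again this is only an illustration under stated assumptions: the function name `bml_long_mems` is hypothetical and `occurs` replaces FM-index steps with a substring test, so the sketch shows which parts of P are skipped rather than reproducing the speed of ropebwt3 or b-move.

```python
def bml_long_mems(P, T, L):
    """MEMs of P w.r.t. T of length >= L, as inclusive (start, end) pairs."""
    occurs = lambda s: s in T  # stand-in for an FM-index query (assumption)
    m = len(P)
    mems = []
    i = 0
    while i + L <= m:
        end = i + L - 1
        # Search the window P[i..end] right to left.
        s = end + 1
        while s > i and occurs(P[s - 1:end + 1]):
            s -= 1
        if s > i:
            # P[s - 1..end] is absent from T, so no MEM of length >= L
            # starts in P[i..s - 1]: skip ahead to s.
            i = s
            continue
        # The whole window occurs: forward-extend to finish this MEM ...
        e = end
        while e + 1 < m and occurs(P[i:e + 2]):
            e += 1
        mems.append((i, e))
        if e == m - 1:
            break
        # ... then backward-extend from P[e + 1] to find where the next MEM starts.
        s2 = e + 2
        while occurs(P[s2 - 1:e + 2]):
            s2 -= 1
        i = s2
    return mems


print(bml_long_mems("ATTAGACA", "GATTACA", 3))
# [(0, 3), (5, 7)]
```

In this example the short MEM "GA" is never reported, because the window starting at it fails after three comparisons and the search jumps straight to position 5.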

3. k-mer based breaking into pseudo-MEMs

Another technique for speeding up pattern matching is k-mer filtration. In contrast to BML, this requires scanning the whole input and deciding which parts can be ignored because they cannot contain significant-length matches. If the alphabet’s size is polylogarithmic in n and BML uses a sublinear number of backward steps, then in the word-RAM model filtration is asymptotically slower; however, the filtration scan is sequential, incurring few cache misses and allowing it to be fast in practice compared to FM-index queries, which tend to incur many cache misses.

Suppose we are given k when we index T and we build a Bloom filter [2] for the distinct k-mers in T. Bloom filters can give false positive results but not false negative ones, so if the filter answers “no” for a k-mer P[i..i + k – 1] then no MEM of length at least k includes that k-mer. It follows that when we are given P and L > k, we can break P up into maximal substrings containing only k-mers for which the filter answers “yes”, and these substrings contain all the MEMs of length at least L. Two such substrings can overlap by up to k – 2 characters (when a single rejected k-mer separates them) but cannot nest. We call these substrings pseudo-MEMs because they are our best guesses at the MEMs of length at least L based on the information we can glean from the filter. Further, any filter which cannot give false negatives can be used.
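A minimal Bloom filter over the k-mers of T might look as follows. This is a sketch for illustration and not kebab's implementation: the class and function names are hypothetical, and salted SHA-256 digests are an assumed stand-in for the hashing kebab actually uses.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter over strings: false positives, no false negatives."""

    def __init__(self, num_bits, num_hashes=1):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per hash function, derived from a salted digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_kmer_filter(T, k, num_bits, num_hashes=1):
    """Insert every k-mer of T into a fresh Bloom filter."""
    f = BloomFilter(num_bits, num_hashes)
    for i in range(len(T) - k + 1):
        f.add(T[i:i + k])
    return f


f = build_kmer_filter("GATTACA", k=3, num_bits=64)
print(all(kmer in f for kmer in ["GAT", "ATT", "TTA", "TAC", "ACA"]))  # True
```

A k-mer absent from T may still be reported as present, which can only make pseudo-MEMs longer; it never causes a long MEM to be lost.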

Definition 1. Let a filter be a function f : Σ^k → {0, 1} with respect to a given k, alphabet Σ, and text T, such that if k-mer x occurs in T, then f(x) = 1. We say that a k-mer x appears in f if and only if f(x) = 1 (the filter answers “yes”).

Definition 2. A pseudo-MEM of a pattern P[0..m – 1] with respect to a text T[0..n – 1], an integer k ≥ 1, a filter f for the distinct k-mers in T, and an integer L > k, is any maximal non-empty substring P[i..j] of P of length at least L such that all the k-mers in P[i..j] appear in the filter f.

Proposition 1. All the MEMs of P with respect to T of length at least L > k are contained in the pseudo-MEMs of P with respect to T and any filter for the distinct k-mers in T.
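Definition 2 translates into a single left-to-right scan over the k-mers of P. The sketch below assumes a hypothetical callable `in_filter` that plays the role of f; in the usage example an exact set of k-mers is used, which is a valid filter because it never gives false negatives.

```python
def pseudo_mems(P, k, L, in_filter):
    """Pseudo-MEMs of P as inclusive (i, j) pairs, in left-to-right order."""
    out = []
    run_start = None
    num_kmers = len(P) - k + 1
    for i in range(num_kmers + 1):  # the extra iteration closes a trailing run
        yes = i < num_kmers and in_filter(P[i:i + k])
        if yes and run_start is None:
            run_start = i
        elif not yes and run_start is not None:
            j = i + k - 2  # last character of the k-mer starting at i - 1
            if j - run_start + 1 >= L:
                out.append((run_start, j))
            run_start = None
    return out


T, P, k, L = "GATTACA", "ATTAGACA", 2, 3
kmers_of_T = {T[i:i + k] for i in range(len(T) - k + 1)}
print(pseudo_mems(P, k, L, kmers_of_T.__contains__))
# [(0, 3), (4, 7)]
```

In this example the MEMs of length at least 3 are P[0..3] = "ATTA" and P[5..7] = "ACA", and each lies inside one of the two pseudo-MEMs, as Proposition 1 requires.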

Our experiments in Section 4 show that computing the pseudo-MEMs and searching in them is in practice already faster than searching in all of P. Further, they show that if T is highly repetitive then the Bloom filter tends to be smaller than the respective indexes for the MEM-finders we tested.

If we seek only the top-t longest MEMs of length at least L, however, then we can leverage the filter step to accelerate the MEM-finding step even further through early stopping.

Proposition 2. If we are finding the top-t longest MEMs of length at least L and we are searching the pseudo-MEMs in non-increasing order by length, we can stop when we have already found t MEMs at least as long as the next pseudo-MEM.
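The early-stopping rule can be sketched as follows, using the "at least as long as" form of the stopping condition given in the Introduction. All names are hypothetical, the pseudo-MEMs are passed in as inclusive (start, end) pairs, and a brute-force helper stands in for a real MEM-finder such as ropebwt3.

```python
import heapq


def long_mems(P, T, L):
    """Brute-force MEMs of P w.r.t. T of length >= L (stand-in MEM-finder)."""
    m, out = len(P), []
    for i in range(m):
        for j in range(i + L - 1, m):
            if P[i:j + 1] not in T:
                break
            left = i == 0 or P[i - 1:j + 1] not in T
            right = j == m - 1 or P[i:j + 2] not in T
            if left and right:
                out.append((i, j))
    return out


def top_t_with_early_stop(P, T, pseudo, L, t):
    """Return the t longest long MEMs and how many pseudo-MEMs were searched."""
    order = sorted(pseudo, key=lambda ab: ab[1] - ab[0], reverse=True)
    best = []  # min-heap of (length, start, end) with at most t entries
    searched = 0
    for a, b in order:
        # Stop once t MEMs are at least as long as the next pseudo-MEM.
        if len(best) == t and best[0][0] >= b - a + 1:
            break
        searched += 1
        for i, j in long_mems(P[a:b + 1], T, L):
            item = (j - i + 1, a + i, a + j)
            if len(best) < t:
                heapq.heappush(best, item)
            elif item[0] > best[0][0]:
                heapq.heapreplace(best, item)
    return sorted(best, reverse=True), searched


T = "ACGTTGCA"
P = "ACGTTGCANCGTTGNTGCANACG"
pseudo = [(0, 7), (9, 13), (15, 18)]  # pseudo-MEMs for k = 3, L = 4
print(top_t_with_early_stop(P, T, pseudo, L=4, t=1))
# ([(8, 0, 7)], 1)
```

With t = 1 only the longest pseudo-MEM is searched, since the MEM of length 8 found there cannot be beaten by anything inside a pseudo-MEM of length 5.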

We can compute and sort the pseudo-MEMs independently of the MEM-finding algorithm in use. While modifying a MEM-finder to track the t longest MEMs found so far and stop when the next pseudo-MEM becomes shorter would enable early stopping optimizations, KeBaB fits seamlessly into any MEM-finding pipeline without requiring such changes and preserves exact MEM output. As a result, we have not yet modified any MEM-finder, even though doing so could provide further speedups.

Without modifying a MEM-finder, we can estimate how long it would take to find the top-t long MEMs by finding them ourselves ahead of time and giving it only the pseudo-MEMs it would search in before stopping. Our experiments in Section 4 show that for reasonable values of t, this should be much faster than running ropebwt3 on all the pseudo-MEMs; moreover, at least for the metagenomic classifier using b-move we tested, it does not significantly hurt the accuracy. In fact, we found that using the long MEMs we found in only the top-t pseudo-MEMs — which are not guaranteed to be the top-t long MEMs but which we can find without modifying existing MEM-finders — is even faster and still results in nearly identical classification accuracy.
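The difference between the top-t long MEMs and the long MEMs in the top-t pseudo-MEMs can be seen on a small constructed example. The sketch below uses hypothetical names, a brute-force stand-in for the MEM-finder, and a toy text in which "N" and "#" act as separators; it is meant only to show why the two sets can differ.

```python
def long_mems(P, T, L):
    """Brute-force MEMs of P w.r.t. T of length >= L (stand-in MEM-finder)."""
    m, out = len(P), []
    for i in range(m):
        for j in range(i + L - 1, m):
            if P[i:j + 1] not in T:
                break
            left = i == 0 or P[i - 1:j + 1] not in T
            right = j == m - 1 or P[i:j + 2] not in T
            if left and right:
                out.append((i, j))
    return out


def mems_in_top_t_pseudo_mems(P, T, pseudo, L, t):
    """Long MEMs found by searching only the t longest pseudo-MEMs."""
    top = sorted(pseudo, key=lambda ab: ab[1] - ab[0], reverse=True)[:t]
    found = []
    for a, b in top:
        found += [(a + i, a + j) for i, j in long_mems(P[a:b + 1], T, L)]
    return sorted(found)


T = "ACGTNGTAGNCCCCC"
P = "ACGTAG#CCCCC"
pseudo = [(0, 5), (7, 11)]  # pseudo-MEMs for k = 2, L = 3
print(mems_in_top_t_pseudo_mems(P, T, pseudo, L=3, t=1))
# [(0, 3), (2, 5)]
print(long_mems(P, T, 3))
# [(0, 3), (2, 5), (7, 11)]
```

The longest pseudo-MEM "ACGTAG" contains two MEMs of length 4, while the single longest MEM "CCCCC" sits in the shorter pseudo-MEM and is missed when t = 1.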

4. Experiments

Our kebab implementation in C++ is available at github.com/drnatebrown/kebab. It streams over k-mers using a rolling nucleotide hash defined by ntHash supporting both the forward and reverse complement DNA strands [13]. We use HyperLogLog [7] to estimate the cardinality of a text collection to initialize the Bloom filter size, which is optimized with respect to the number of filter hashes used. We then add canonical k-mers (the smaller of each k-mer and its reverse complement by hash value) to the filter. Given a pattern, we query its canonical k-mers and extract the pseudo-MEMs. Further optimizations, such as Fibonacci hashing, parallelization, and latency hiding, are explained in the appendix alongside technical details.
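Two of these steps can be illustrated in a few lines. The sketch below is not kebab's code: CRC32 is an assumed stand-in for ntHash, the function names are hypothetical, and the sizing function uses the standard Bloom filter approximation p ≈ (1 − e^(−hn/m))^h with a target false-positive rate p chosen by the user, which may differ from the sizing rule described in the appendix.

```python
import math
import zlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """The smaller of a k-mer and its reverse complement by hash value."""
    rc = reverse_complement(kmer)
    # Ties on the hash are broken by the string so both strands agree.
    return min(kmer, rc, key=lambda s: (zlib.crc32(s.encode()), s))


def bloom_bits(n, h, p):
    """Bits needed for n distinct k-mers, h hash functions and target rate p."""
    return math.ceil(-h * n / math.log(1.0 - p ** (1.0 / h)))


print(reverse_complement("ACGTT"))                 # AACGT
print(canonical("ACGTT") == canonical("AACGT"))    # True
print(bloom_bits(1000, 1, 0.1))                    # 9492
```

Because both strands of a k-mer map to the same canonical form, a read can be queried against the filter without knowing which strand it came from.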

4.1. MEM-finding

We tested the speed of MEM-finding on a mock community dataset of 7 microbial species (5867 genomes, ~27 GB) used in Ahmed et al.’s [1] SPUMONI 2 study. Patterns consist of long ONT null reads (10245 yeast reads with average length 19693) and positive reads (581802 microbial reads with average length 25378). Constructing ropebwt3 took 162.88 minutes with a 0.7988 GB index. Building kebab with k = 20 and one hash function took 4.02 minutes with a 0.2684 GB filter (about a third of the size of ropebwt3’s index).

We compared the time to find MEMs with ropebwt3 alone with default settings, to the time to first generate pseudo-MEMs with kebab and then search them with ropebwt3. We also simulated early stopping to find the 10 longest MEMs as explained in Section 3. Figure 2 shows the total times for different choices of L, and Figure 3 shows times for null and positive reads with L = 40. For L ≥ 30, the running time of the kebab step alone on the reads was at most about 3 times the time to copy them to another file, which is a rough lower bound on file I/O for a filtering step. For L = 40, the pseudo-MEMs for positive reads had an average length of 70.97, while the long MEMs themselves had an average length of 70.46.

Fig. 2.

Total runtime in seconds for MEM-finding methods, searching in a microbial pangenome with different minimum MEM-length values L.

Fig. 3.

For L = 40, time per base in the original input to find all long MEMs or only the 10 longest MEMs.

4.2. Metagenomic Classification

To see how using a few long MEMs affects downstream applications, we replicated the metagenomic classification experiment in Depuydt et al.’s [5] tagger study, using the same dataset of 8 microbial species (8165 genomes, ~37 GB) and 50000 simulated long ONT reads with average length 5236. By default, tagger uses b-move with BML to find long MEMs together with sample species containing occurrences of them, then classifies the reads based on the sample species containing each read’s long MEMs. We note that b-move is usually larger but faster than ropebwt3, so speedups with kebab are not as dramatic.

We computed tagger’s accuracies (Figure 4, on the left) — that is, its percentages of true-positive classifications — and the average number of steps b-move takes (Figure 4, on the right) when finding and classifying based on

  • all the reads’ MEMs of length at least L (“default”),

  • only the longest t MEMs from each read (“top-t MEMs”),

  • only the MEMs of length at least L in the longest t pseudo-MEMs from each read (“top-t pseudo-MEMs”),

for L = 25 and various values of t. We ran tagger with default settings and L = 25 because Depuydt et al. found it gave good results. For t greater than about 10, using only the t longest MEMs in each read, or only the MEMs of length at least L = 25 in the t longest pseudo-MEMs, does not noticeably hurt tagger’s accuracy but significantly reduces the number of steps b-move takes for MEM-finding.

Fig. 4.

tagger’s accuracy (left) and the average number of steps b-move takes (right) for MEM-finding.

We also computed the total times, shown in Figure 5, to classify the reads with tagger after first

  • finding all the MEMs of length at least L with b-move (“tagger”),

  • finding all the pseudo-MEMs with kebab and then finding all the MEMs of length at least L in them with b-move (“kebab + tagger”),

  • finding all the pseudo-MEMs with kebab and then finding the MEMs of length at least L in the t longest pseudo-MEMs from each read with b-move (“kebab + tagger, t = 30, 20, 10”).

We ran tagger with default settings and L = 25, and kebab with k = 20 and one hash function. The index for b-move took 7.869 GB and the filter for kebab took an additional 0.2684 GB. Clearly, kebab can also speed up tagger’s pipeline.

Fig. 5.

The time to classify the reads using all the MEMs of length at least L = 25, found either by tagger alone or by kebab followed by tagger, or using only the MEMs of length at least L = 25 in the t longest pseudo-MEMs of each read, found by kebab followed by tagger.

5. Conclusion

KeBaB substantially speeds up MEM-finding, and can further speed up analyses like metagenomic classification by considering only the longest MEMs or long MEMs present in the longest pseudo-MEMs. Parameter tuning and further optimizations (blocked Bloom filters [16], SIMD acceleration [11], and sub-k-mer queries [15, 17]) could make kebab even more efficient. In fact, adopting the approach of sub-k-mer methods such as Robidou and Peterlongo’s findere may reduce Bloom filter queries by skipping certain k-mers. Future work includes implementing early stopping in ropebwt3, exploring other early-stopping approaches, and applying kebab to scenarios that can particularly benefit from filtration such as sequences with hypervariable regions.

Supplementary Material

Supplementary info

Acknowledgments.

Many thanks to Finlay Maguire for pointing out the similarities between BML and Boyer-Moore, and to the reviewers for insightful comments. NKB, MZ, and BL were funded by NIH grants R01HG011392 and R56HG013865 to BL. NKB was also funded by a JHU CS PhD Fellowship and an NSERC PGS-D. LD was funded by a PhD Fellowship FR (1117322N), Research Foundation – Flanders (FWO). TG was funded by NSERC grant RGPIN-07185-2020.

Footnotes

Disclosure of Interests. The authors declare no competing interests.

References

  • 1. Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B: Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology 24(1), 122 (2023)
  • 2. Bloom BH: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7), 422–426 (1970)
  • 3. Boyer RS, Moore JS: A fast string searching algorithm. Communications of the ACM 20(10), 762–772 (1977)
  • 4. Depuydt L, Renders L, Van de Vyver S, Veys L, Gagie T, Fostier J: b-move: Faster lossless approximate pattern matching in a run-length compressed index. Algorithms for Molecular Biology (accepted)
  • 5. Depuydt L, Ahmed OY, Fostier J, Langmead B, Gagie T: Run-length compressed metagenomic read classification with smem-finding and tagging. bioRxiv pp. 2025–02 (2025)
  • 6. Ferragina P, Manzini G: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)
  • 7. Flajolet P, Fusy É, Gandouet O, Meunier F: Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science (2007)
  • 8. Gagie T: How to find long maximal exact matches and ignore short ones. In: Proc. 28th Conference on Developments in Language Theory (DLT). pp. 131–140 (2024)
  • 9. Li H: Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28(14), 1838–1844 (2012)
  • 10. Li H: BWT construction and search at the terabase scale. Bioinformatics 40(12), btae717 (2024)
  • 11. Lu J, Wan Y, Li Y, Zhang C, Dai H, Wang Y, Zhang G, Liu B: Ultra-fast bloom filters using simd techniques. IEEE Transactions on Parallel and Distributed Systems 30(4), 953–964 (2018)
  • 12. Mäkinen V, Belazzougui D, Cunial F, Tomescu AI: Genome-scale algorithm design: bioinformatics in the era of high-throughput sequencing. Cambridge University Press (2023)
  • 13. Mohamadi H, Chu J, Vandervalk BP, Birol I: nthash: recursive nucleotide hashing. Bioinformatics 32(22), 3492–3494 (2016)
  • 14. Navarro G: Compact data structures: A practical approach. Cambridge University Press (2016)
  • 15. Pellow D, Filippova D, Kingsford C: Improving bloom filter performance on sequence data using k-mer bloom filters. Journal of Computational Biology 24(6), 547–557 (2017)
  • 16. Putze F, Sanders P, Singler J: Cache-, hash-, and space-efficient bloom filters. Journal of Experimental Algorithmics (JEA) 14, 4–4 (2010)
  • 17. Robidou L, Peterlongo P: findere: fast and precise approximate membership query. In: International Symposium on String Processing and Information Retrieval. pp. 151–163. Springer (2021)
  • 18. Thorup M: High speed hashing for integers and strings. arXiv preprint arXiv:1504.06804 (2015)
  • 19. Varki R, Rossi M, Ferro E, Oliva M, Garrison E, Langmead B, Boucher C: Accurate short-read alignment through r-index-based pangenome indexing. Genome Research 35(7), 1609–1620 (2025)
