Genome Biology. 2025 Jun 17;26:169. doi: 10.1186/s13059-025-03644-0

Mumemto: efficient maximal matching across pangenomes

Vikram S Shivakumar, Ben Langmead
PMCID: PMC12172372  PMID: 40528225

Abstract

Aligning genomes into common coordinates is central to pangenome construction, though computationally expensive. Multi-sequence maximal unique matches (multi-MUMs) help to frame and solve the multiple alignment problem. We introduce Mumemto, a tool that computes multi-MUMs and other match types across large pangenomes. Mumemto allows for visualization of synteny, reveals aberrant assemblies and scaffolds, and highlights pangenome conservation and structural variation. Mumemto computes multi-MUMs across 320 human assemblies (960 GB) in 25.7 h with 800 GB of memory and hundreds of fungal assemblies in minutes. Mumemto is implemented in C++ and Python and available open-source at https://github.com/vikshiv/mumemto (v1.1.1 at doi.org/10.5281/zenodo.15053447).

Supplementary information

The online version contains supplementary material available at 10.1186/s13059-025-03644-0.

Background

Recent pangenomes can span hundreds of genomes. The Human Pangenome Reference Consortium, for example, has released hundreds of high-quality human genome assemblies [1]. Large pangenomes can shed light on sequence conservation and large-scale variants, and they provide reference panels for read alignment, genotype imputation, and sequence classification. As a result, there is a growing need for algorithms to align large pangenomes and reveal their underlying coordinate systems. This has spurred a range of new alignment and classification tools, as well as a new set of compressed-space algorithms for efficient construction of pangenome indexes.

Maximal exact matches (MEMs) and maximal unique matches (MUMs) are used in whole genome alignment [2, 3] and multiple sequence alignment [4–8] as syntenic anchors between more variable sequences. Collinear multi-MUMs (MUMs across multiple sequences) can span large sections of conserved sequence in an MSA. However, existing methods for computing multi-MUMs do not scale well beyond relatively small collections of bacterial genomes. While tools like MUMmer4 [2] compute pairwise MUMs, the problem of aligning pangenomes is inherently multi-way, and computing them by way of pairwise alignments can require a quadratic number of sequence comparisons, plus a substantial merging step.

We introduce Mumemto, a tool to compute maximal exact or unique matches across many sequences. Mumemto uses prefix-free parsing (PFP) [9], a compressed-space method for computing the enhanced suffix array in sublinear space for pangenome sequence collections. For instance, Mumemto can compute multi-MUMs across 89 human genomes in under 4 h and across hundreds of fungal genomes in 2 min. We show that multi-MUMs can form a rudimentary MSA, reveal genomic conservation and structure, and identify aberrant assembly artifacts and potential pangenome issues. We propose Mumemto as a first-step pangenome diagnostic and visualization tool that scales efficiently for future large pangenome collections.

Results

Overview

Mumemto computes multi-MUMs and MEMs in a streaming algorithm over the suffix array (SA), Burrows-Wheeler Transform (BWT), and longest common prefix (LCP) arrays (Algorithm 1). Prefix-free parsing [9] computes these arrays efficiently for large, repetitive sequence collections like pangenomes. PFP computes these arrays sequentially, allowing Mumemto to consume them in a streaming fashion and avoiding the need for multiple passes or writing the arrays to disk (they can optionally be stored on disk so that other match types can be computed later without rebuilding the arrays). In short, Mumemto finds all relevant matches by performing some modest additional computation (see Additional file 1: Fig. S1) on top of the indexing process already used to produce compressed indexes like the r-index [10, 11] and move structure [12, 13].

Mumemto computes a variety of match types. Multi-MUMs are maximal exact matches that occur exactly once in all genomes. The other match types relax these constraints in some way; e.g., a partial multi-MUM may occur in only some sequences (Fig. 1).

Fig. 1.


Exact match types that Mumemto can compute. Two flags control how many sequences a match appears in (-k) and how many times a match may appear in any given sequence (-f)

Matches can be combined to define longer blocks of synteny, which may straddle important variants such as structural variants (SVs), e.g., inversion polymorphisms and rearrangements. The output of Mumemto can be used to visualize synteny and SVs by way of multi-MUMs. It can also characterize highly repetitive regions using multi-MEMs. As described below, Mumemto can also use partial-MUMs to identify potential assembly errors and other large-scale aberrations.

Mumemto differs from existing MUM-finding methods such as MUMmer4 [2] in that it computes matches shared across many sequences rather than just pairs of sequences. Though MUMs across N sequences could be computed by invoking a pairwise matching algorithm O(N^2) times and merging the resulting pairwise MUMs, the quadratic time requirement is impractical for large pangenomes. For example, running MUMmer4 on all pairs of chr19 haplotypes from HPRC takes >30 h (30× slower than Mumemto), not counting the time to merge pairwise MUMs into multi-MUMs. As a result, we consider multi-MUM finding to be a distinct problem and omit pairwise methods from further comparisons.
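To make the quadratic growth concrete, the short sketch below counts the unordered sequence pairs that an all-pairs approach would have to compare. The function name is illustrative only; the collection sizes are the ones mentioned in this article.

```python
from math import comb


def pairwise_comparisons(n_sequences: int) -> int:
    """Number of unordered sequence pairs, i.e. N choose 2."""
    return comb(n_sequences, 2)


# Collection sizes taken from the text: 89 haplotypes and 320 assemblies.
for n in (89, 320):
    print(f"{n} sequences -> {pairwise_comparisons(n)} pairwise runs")
```

Each of these pairwise runs would still need to be merged into multi-MUMs afterwards, which the count above does not include.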

Efficient core genome alignment and pangenome construction

Mumemto is the fastest multi-MUM finder

To evaluate its multi-MUM finding algorithm, we compared Mumemto to the widely used multi-MUM-based multiple sequence aligners, Parsnp2 [5] and ProgressiveMauve [14]. Both tools find multi-MUMs as an intermediate step prior to collinear blocking and detailed alignment. Mauve uses a hash table of short seed matches and extends unique seeds present in all sequences. Parsnp builds a compressed suffix graph over a reference sequence, then computes and merges multi-MUMs in each sequence across the collection. We found that both tools tend to miss a few multi-MUMs and falsely report a small number of non-unique matches; however, these differences were negligible for the purposes of method evaluation.

We computed multi-MUMs across 89 haplotypes of each autosomal chromosome from the Human Pangenome Reference Consortium [1]. We measured the time and memory usage for only the MUM-finding step of each tool in Fig. 2A and B. Mumemto was 7–11× faster than Mauve and 3–15× faster than Parsnp, while using 24%–44% less memory than Mauve and 39%–52% less than Parsnp. (Note that Mumemto was run on a single thread, while Mauve and Parsnp were run on 48 threads.) One exception was memory usage for chromosome 9, where Mumemto’s prefix-free parse included (by chance) long, unique centromeric substrings which are difficult to compress, yielding a slightly higher memory footprint compared to Parsnp and Mauve for that chromosome.

Fig. 2.


A, B Comparison of runtime and peak memory usage (measured as maximum resident set size) between multi-MUM finders. Only the initial multi-MUM computation was considered for ProgressiveMauve and Parsnp2. Note: Parsnp2 took >48 h for chromosomes 1 and 2, so these are omitted. Parsnp2 and ProgressiveMauve were run with 48 threads, while Mumemto was run single-threaded. C, D Time and memory scaling comparison for increasing sequence collection sizes of chr19. E, F Comparison of time and memory for a Mumemto-seeded Parsnp2 alignment pipeline compared to the original Parsnp2 pipeline, and G a comparison of the alignments from each pipeline. H Regions excluded from Minigraph-Cactus (MC) while aligning chr19 assemblies, compared to regions excluded by a Mumemto-seeded MC pipeline (overlaid on a MUM synteny plot in gray). I, J Syntenic view of MUMs vs tube map [15] view of the equivalent graph

When computing multi-MUMs over increasingly large collections of chromosome 19 haplotypes, Mumemto’s speed and memory scaled better than that of Mauve or Parsnp (Fig. 2C and D), owing to the PFP algorithm’s ability to scale with the amount of non-redundant sequence. Further, we computed multi-MUMs across 320 human genome assemblies from HPRC (available at [16]) using 8 threads, which completed in 25.7 h while using 800 GB of memory. If run serially, Mumemto would compute these multi-MUMs in under a week within 139 GB of memory.

Mumemto accelerates core genome alignment

Parsnp [17] uses multi-MUMs as initial guideposts to build a “core genome alignment,” i.e., a multiple alignment involving the conserved portions of the genomes. We modified the Parsnp pipeline to use Mumemto-computed multi-MUMs. We compared the original pipeline with the Mumemto-accelerated pipeline (Fig. 2E–G). Though the peak memory footprint is dominated by the downstream portions of the Parsnp pipeline, the total runtime is up to 12× faster using Mumemto, while the overall alignment coverage is nearly identical to that of the original pipeline (Fig. 2G). Minor differences in alignment between the two pipelines are likely due to Parsnp omitting a few true multi-MUMs and including some additional smaller, locally unique matches.

Multi-MUMs form preliminary graphs

Multi-MUMs can also inform the construction of a preliminary pangenome graph. Collinear multi-MUMs represent conserved stretches of columns in the underlying MSA, and so are prime candidates for being collapsed into pangenome graph nodes. Gaps between collinear MUMs due to genomic variation are often short (<5 bp) and common across haplotypes. For example, we found that among haplotypes of chr19, 78% of gaps between collinear MUMs were single nucleotide variants. As a result, for intraspecific pangenomes (such as HPRC) with high genome similarity, Mumemto can simplify and accelerate graph construction.

As a proof of concept, we compared various graph building strategies over HPRC haplotypes of chromosome 19. We built graphs using Mumemto and its reported multi-MUMs and compared these to a graph built entirely with Minigraph-Cactus [18] (Table 1). The Mumemto-full strategy first computes multi-MUMs, then “collapses” gaps between collinear and adjacent multi-MUMs if the gap sequence is identical between any haplotypes. This strategy only includes small (100 kb) gaps, assuming that larger gaps are similar to the “brnn” regions (e.g., centromeres), which are also excluded from the HPRC pangenome. The regions excluded in Minigraph-Cactus and Mumemto-full are further compared in Fig. 2H. Mumemto does not perform base-level alignment in order to resolve variable-length gaps, but still achieves a compression ratio of 13.7×, meaning the 89-haplotype pangenome collapses to about 6.5 haplotypes worth of sequence needed to label the pangenome graph. As seen in Table 1, the Mumemto-full graph is larger than the others in terms of total sequence labels, its number of nodes and edges, and its memory footprint.
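The relationship between the compression ratio and the "haplotypes worth of sequence" figure quoted above is a single division. The sketch below reproduces it from the two numbers given in the text; the function name is an illustrative choice, not part of Mumemto.

```python
def haplotype_equivalents(n_haplotypes: int, compression_ratio: float) -> float:
    """Sequence retained in the graph, expressed in units of one haplotype."""
    return n_haplotypes / compression_ratio


# 89 haplotypes at a 13.7x compression ratio, as reported in the text.
print(round(haplotype_equivalents(89, 13.7), 1))
```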

Table 1.

Comparison of different graph construction methods. Mumemto-full refers to a purely MUM-based graph where identical gaps between collinear MUMs are merged when possible. Mumemto-collapsed refers to a further compressed version of Mumemto-full, where large gaps between collinear MUMs are merged using Minigraph-cactus. Mumemto + MC refers to a Minigraph-cactus pipeline where the initial SV-only graph construction step is replaced with a simplified version of Mumemto-full

Metric              | Minigraph-Cactus (MC) | Mumemto-full | Mumemto-collapsed     | Mumemto + MC
Total sequence (bp) | 76,934,678            | 400,616,341  | 243,003,864           | 75,263,941
# nodes             | 2,607,631             | 1,223,511    | 1,429,051             | 4,160,567
# edges             | 3,579,661             | 1,815,102    | 2,097,814             | 5,965,807
# haplotype walks   | 1442                  | 2022         | 1471                  | 1137
Avg coverage        | 92.568%               | 90.333%      | 90.333%               | 92.623%
Time to construct   | 18:44:58              | 1:04:13      | 3:49:33               | 13:35:19
Memory (GB)         | 69.5                  | 16.2         | 16.2                  | 113.9
Threads             | 48                    | 1            | 1 (Mumemto), 48 (MC)  | 1 (Mumemto), 48 (MC)

The Mumemto-full graph is larger because no attempt is made to collapse the interstitial sequence in large (>10 kb) gaps between collinear MUMs. To address this, we identified the 50 largest gaps and aligned the inter-MUM sequence with Minigraph-Cactus. This graph (Mumemto-collapsed) is just under half the size of Mumemto-full, while only requiring an additional 2.75 h to compute.

We also considered how Mumemto could accelerate the Minigraph-Cactus pipeline. We replaced the initial SV graph construction step of Minigraph-Cactus with a simplified version of the Mumemto-full graph that included only short gaps (50 bp–100 kbp) that were partially shared (<45 unique gap sequences). We seeded the Minigraph-Cactus pipeline with this MUM-based SV graph, resulting in the Mumemto + MC graph. This strategy was faster than running Minigraph-Cactus and provided a graph with a comparable coverage and compression ratio to Minigraph-Cactus (Table 1).

Finally, we compared the computational efficiency and accuracy of short read alignment to each of these graphs using giraffe [19] and Illumina reads from the HG002 individual from the Google Brain dataset [20]. We found comparable alignment quality and speed (Additional file 1: Table S1). The Mumemto-seeded MC graph was slower for alignment, likely due to its higher complexity, e.g., larger number of nodes and edges (Table 1), though this could be improved by further fine-tuning construction of the initial SV-only graph from multi-MUMs. Nonetheless, the fast construction time and comparable sequence compression demonstrate the potential of a Mumemto-accelerated approach for constructing pangenome graph indexes.

Mumemto reveals aberrations in pangenome assemblies

We found that examining the collinear MUMs reported by Mumemto revealed and helped to visualize aberrant features of pangenome assemblies. Large, private insertions and deletions in a single sequence manifested as a characteristic pattern of short, spurious MUMs spread across the genome (Fig. 3A and B). These spurious MUMs are collinear in all but one sequence, helping to pinpoint the affected region (Fig. 3D and E). If a large number of collinear MUM pairs are separated in a specific sequence, Mumemto can identify the problem region as either an insertion (Fig. 3A) or deletion (Fig. 3B) depending on the location of the spurious MUMs in other sequences.

Fig. 3.


A, B MUM synteny visualization of collinear MUM blocks (red) and MUMs that break collinearity in a single sequence (gray (+)/green (−) based on orientation). C Large (>4 Mbp) syntenic region lost in HG02080.1, but recovered by partial MUMs (in gray). Evidence from non-collinear MUMs (D, E) and missing sequence present in partial MUMs (F) points to a potential aberrant assembly artifact in the HG02080 paternal haplotype. G, H Genome-wide multi-MUMs reveal an interchromosomal join (confirmed to be a misassembly by HPRC [1]) in the aberrant regions

We also found that Mumemto-reported partial MUMs (MUMs present in a subset of sequences) provide additional evidence. For instance, partial MUMs present in all but one sequence reveal large, private deletions (Fig. 3C), which can indicate an assembly error or rare large-scale variant. Mumemto can identify these regions and quantify the sequence “missing” from each assembly using partial MUMs (Fig. 3F).

As a case study, we used Mumemto to identify aberrant regions across HPRC haplotypes scaffolded using RagTag [21]. Figure 3 highlights a large insertion in the paternal haplotype of HG02080 chr19, identified by a spike in broken collinear MUM pairs. Additionally, we found a large deletion in chr17 of the same haplotype using partial MUMs. Computing multi-MUMs across full genome assemblies revealed a large interchromosomal join between chr17 and chr19 in the HG02080.1 assembly. This potential translocation was confirmed to be a misassembly by the HPRC team [1]. Identifying this problem was straightforward both quantitatively and visually using Mumemto multi-MUMs and partial multi-MUMs; it did not require finding pairwise alignments, building a graph, or computing a multiple alignment.

Scaffolding errors

We scaffolded the assembly contigs provided by HPRC using RagTag, with default parameters and the T2T-CHM13 assembly as the reference [22]. Homology-based scaffolding is commonly used when there is no separate line of evidence such as Hi-C reads. However, scaffolding with respect to a single linear reference—even a high-quality reference—can bias contig placement and orientation [23, 24].

Mumemto can highlight potential scaffolding errors given the contig breakpoints in an assembly by identifying inversions at contig boundaries. We examined two instances of this on human chromosome 8 (Fig. 4A). Both are located in one of the largest inversion polymorphisms in the human genome [25]. Based on RagTag scaffolding, the contigs covering the inversion in the paternal haplotypes of HG03098 and HG02148 are both oriented in the same direction as the CHM13 reference. RagTag assigns both contigs an orientation confidence score of 1.0, the highest possible confidence. However, it is clear from the MUM synteny visualization that both contigs should be reversed to preserve the flanking region orientation. Reversing them yields an inversion polymorphism that single reference-guided scaffolding would avoid, yet the overall pangenome synteny reveals that this polymorphism is common in the population. Mumemto synteny visualization and multi-MUM information can be used to correct the reference-guided scaffolding errors in each assembly using pangenome context.

Fig. 4.


A HPRC chr8 assemblies visualized with multi-MUM synteny. Regions of high multi-MEM density shown in red. (Zoom panels) Two examples of incorrectly oriented contigs during scaffolding, with contig breakpoints represented by diamond markers. Inversions shown in green. B Assemblies of chr3 across the potato family, shown with multi-MUM synteny and MEM density colored in red. (top) Density of gene and LTR retrotransposon annotations for potato accession A6-26 (shown in the top row of syntenic view)

Mumemto highlights pangenome-scale biology

Pangenomes across the tree of life

We ran Mumemto on five recently released pangenome collections with genome lengths ranging from 13 Mbp (yeast [26]) to 3 Gbp (human [1]). For each, we computed all-pairs k-mer-based Jaccard similarities using Dashing2 [27]. As expected, the two interspecific datasets—potato [28] and maize [29]—had the lowest inter-genome similarity, as captured by the inter-quartile range of Jaccard similarities (Table 2). The human pangenome had the highest inter-genome similarity, as well as the highest MUM coverage. We report the time and memory footprint for computing multi-MUMs across each pangenome dataset in Table 2. We also measured the wall-clock time required to compute MUMs over each set of chromosome assemblies. Since we used parallel threads for this, our memory measurement is the total memory footprint across all parallel threads.
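As a rough illustration of the similarity measure used here, the sketch below computes an exact k-mer Jaccard index between two sequences. Dashing2 estimates this quantity with sketching rather than by building full k-mer sets, so this is a stand-in for the definition and not a description of that tool; the function names and the value of k are assumptions made for the example.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_index(seq_a: str, seq_b: str, k: int) -> float:
    """|A intersect B| / |A union B| over the two k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


# k-mer sets {AC, CG, GT} and {AC, CG, GA} share 2 of 4 distinct k-mers.
print(jaccard_index("ACGT", "ACGA", k=2))
```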

Table 2.

Multi-MUM and partial multi-MUM coverage, computation time, memory, and statistics for five pangenomes of varying sizes. n/r refers to the ratio of the total size of the pangenome (n) to the number of runs in the Burrows-Wheeler Transform over the sequence collection (r). The Jaccard index range is presented as an interquartile range (IQR), representing the 25th–75th percentile. Both Jaccard index and n/r are included as approximate measures of intra-pangenome genomic divergence. Partial MUMs (pMUMs) are defined for the purposes of this table as matches that occur in at least half the dataset

Dataset     | N   | # chr | Genome size (Gbp) | n/r    | Jaccard index (IQR) | Median MUM length (bp) | MUM coverage | Median pMUM length (bp) | pMUM coverage | Time (hh:mm:ss) | Memory (GB) | Ref
Human       | 89  | 22    | 2.86              | 135.16 | 0.92–0.93           | 110                    | 83.76%       | 445                     | 91.70%        | 4:14:11         | 707.5       | [16]
Maize       | 27  | 10    | 2.13              | 42.33  | 0.44–0.47           | 41                     | 13.48%       | 139                     | 30.36%        | 2:29:49         | 661.6       | [30]
Potato      | 60  | 12    | 0.752             | 30.20  | 0.32–0.45           | 29                     | 10.31%       | 51                      | 62.54%        | 1:38:50         | 545.5       | [31]
Arabidopsis | 69  | 5     | 0.135             | 42.03  | 0.56–0.63           | 37                     | 43.06%       | 91                      | 73.88%        | 0:32:53         | 92.6        | [32]
Yeast       | 127 | 16    | 0.013             | 65.30  | 0.62–0.72           | 29                     | 13.19%       | 195                     | 38.78%        | 0:01:52         | 13.3        | [33]

Table 2 also catalogs the strict (i.e., appearing in all sequences) MUM coverage and partial (i.e., appearing in a majority of sequences) MUM coverage. MUM coverage refers to the fraction of bases in each genome covered by at least one multi-MUM or partial MUM, averaged across all assemblies in the pangenome. Two datasets in particular, potato and Arabidopsis, displayed a large increase in coverage when including partial multi-MUMs. This trend generally indicates a small subset of sequences which form a distinct subgroup due to large genomic variation or incomplete assemblies within the group.
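The coverage definition above (bases covered by at least one match, averaged over assemblies) can be sketched as an interval-union computation. The input layout, half-open `(start, end)` intervals per genome, is an assumption made for this example and is not Mumemto's output format.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of bases covered by at least one half-open [start, end) interval."""
    covered, prev_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, prev_end)  # skip bases already counted
        if end > start:
            covered += end - start
            prev_end = end
    return covered / genome_length


def mean_coverage(per_genome_intervals, genome_lengths):
    """Average the per-genome covered fraction across all assemblies."""
    fractions = [
        covered_fraction(ivs, length)
        for ivs, length in zip(per_genome_intervals, genome_lengths)
    ]
    return sum(fractions) / len(fractions)


# Two overlapping matches and one separate match on a 100 bp toy genome.
print(covered_fraction([(0, 10), (5, 15), (20, 25)], 100))
```

Taking the union first matters for partial MUMs and MEMs, whose occurrences may overlap within a genome.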

Mumemto can compute an outlier score when computing partial multi-MUMs. For a given sequence, this value is the aggregate of the lengths of partial MUMs in which the sequence is excluded, i.e., a high-scoring sequence tends not to share MUMs that are present in all other sequences. Figure 5 shows the outlier score for assemblies in the Arabidopsis (A) and potato (B) pangenomes. For potato, assemblies of S. candolleanum, a progenitor of cultivated potatoes which is considered a distinct clade within the Petota section of Solanum [28, 34], tend to score higher, i.e., tend to be excluded from MUMs shared by others. Similarly for Arabidopsis, accessions from the African continent and from the geographically isolated Madeira islands have higher scores, along with two Asian accessions from Japan (Fig. 5A).
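The outlier score described above can be sketched in a few lines. The representation of a partial MUM as a `(length, member_set)` pair is an assumption for illustration; Mumemto's own implementation and file formats may differ.

```python
def outlier_scores(partial_mums, n_sequences):
    """Sum, per sequence, the lengths of partial MUMs that exclude it.

    partial_mums: iterable of (length, members), where members is the set of
    sequence indices in which the match occurs (hypothetical layout).
    """
    scores = [0] * n_sequences
    for length, members in partial_mums:
        for seq in range(n_sequences):
            if seq not in members:
                scores[seq] += length
    return scores


toy_pmums = [
    (100, {0, 1, 2}),    # missing from sequence 3
    (50, {0, 1, 3}),     # missing from sequence 2
    (30, {0, 1, 2, 3}),  # a strict multi-MUM contributes to no score
]
print(outlier_scores(toy_pmums, 4))
```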

Fig. 5.


Aggregate length of partial MUMs that are not present in each genome assembly. A A. thaliana accessions are grouped by geographical region, and B potato (Solanum section Petota) are grouped by species

MUMs reveal genomic organization

Mumemto-computed multi-MUMs for potato chromosome 3 revealed an immediately noticeable pattern (Fig. 4B). By visualizing the collinear blocks of multi-MUMs, we observed a denser arrangement of syntenic blocks conserved across the pangenome in the flanks of the chromosome. This correlates highly with gene density (Additional file 1: Fig. S2), as gene-rich regions are more evolutionarily conserved [35]. Regions with low MUM density also tend to be repetitive regions with more structural and less functional characteristics. The density of multi-MEMs (shown as a heatmap in red in Fig. 4B) also recapitulates this trend, where spikes in multi-MEMs correspond to an increased density of LTR retrotransposons, the most abundant type of transposable element (TE) in the potato genome [35]. Both multi-MEM and LTR TE density tend to be highest in the pericentromeric and centromeric regions, which has been previously observed in plants [35]. We note that this trend was clear despite the relatively low level of MUM coverage overall (Table 2).

We also observed large-scale structural variations in chromosome 3. The largest of these was the 5.8 Mbp inversion polymorphism at a 40–50 Mbp offset in each assembly. Mumemto can further use the orientation of collinear multi-MUM blocks to identify large inversions and can report approximate inversion boundaries. This inversion has been linked with the Y locus that controls tuber flesh color in potatoes and has been observed to cause suppressed recombination [34].

Discussion

Mumemto is an efficient tool for finding maximal exact matches, including multi-MUMs and related match types like partial MUMs and MEMs, across large collections of sequences. By computing these matches, Mumemto can rapidly define a pangenome coordinate system, aid in visualizing pangenome conservation and major structural variants, and reveal potential assembly issues and outliers. Mumemto also serves as an efficient multi-MUM-finding engine that can accelerate existing tools for core genome alignment and pangenome graph construction.

Mumemto can find partial multi-MUMs appearing in any subset of sequences as efficiently as it finds “strict” multi-MUMs. This enables a new view on pangenomes that reveals shared sequence within subgroups of a collection. Our findings for the Arabidopsis pangenome showed that partial MUMs can help to discern distinct subgroups of sequences and, ultimately, could quantify inter-sequence distances in a pangenome. Though this idea has been previously proposed [14], Mumemto’s algorithms allow for partial MUM computation at a scale of hundreds to thousands of genomes, potentially improving genomic distance and evolutionary inference for large sequence collections.

Mumemto computes multi-MUMs at the same time as PFP is computing the SA, LCP array, and the BWT. These arrays are exactly the key components of compressed full-text pangenome indexes like the r-index and move structure. These indexes can then be used to compute matching statistics [36] and other measures that quantify MEM-level similarity between query reads and pangenomes. In other work, we showed that considering multi-MUMs in the r-index improves read classification [37]. In this way, Mumemto can be integrated into a full-text indexing method to improve downstream alignment by providing a unifying coordinate system for full-text index-based alignment.

Further, Mumemto can be expanded into a generalized framework for suffix tree traversal in compressed space. Due to the modularity of the implementation, Mumemto could provide a convenient method for mapping arbitrary functions—e.g., a function to identify tandem repeats—over the internal nodes of an underlying suffix tree [3], leveraging the scalability of PFP.

A potential drawback of Mumemto is its memory usage. Though PFP operates in compressed space, its memory footprint still reaches hundreds of gigabytes for large pangenomes. Currently, peak memory use could be reduced by splitting pangenomes by chromosome and computing only intra-chromosomal matches. Advances in compressed-space BWT construction [38] and improvements to prefix-free parsing [39] could help to decrease memory requirements without sacrificing inter-chromosomal resolution. We also plan to implement minimizer digestion both to decrease the input size and to overcome minor genomic variation that may truncate syntenic blocks. Lastly, we plan to explore methods for merging multi-MUMs, such as the method in Parsnp2 [5] or index data structures like in RopeBWT3 [38], enabling a more incremental and parallelizable approach.

Due to how multi-MUMs are defined, they tend to cover less of the pangenome as the pangenome grows to include more individuals. While we find that multi-MUMs retain much of their utility even when they cover a lower percentage of the pangenome, we expect that a looser definition of MUM (e.g., a partial MUM present in a majority of sequences) becomes more appropriate as the pangenome grows. However, it will also be important to investigate other ways to loosen these requirements as the pangenome grows. Parsnp2, for example, uses a recursive process whereby it computes finer-grained multi-MUMs with respect to the spaces between previously identified coarser-grained multi-MUMs. We will explore a similar strategy in Mumemto, which could require multiple scans over the enhanced suffix array.

Visualization of pangenome synteny is currently limited to pairwise comparisons of genomes in a defined order. This order is arbitrary and thus multiple-sequence-based synteny is crucial to reveal the true pangenome coordinate system. Various methods exist for pairwise syntenic visualization [40, 41]. Although Mumemto implements a new visualization module intended for multi-MUMs, existing visualization methods could be used. However, this would require formatting the multi-MUM output as a set of pairwise comparisons for input, which would be computationally inefficient.

Conclusions

We showed that Mumemto can accelerate existing pipelines for pangenome alignment and construction. We also discussed the ability of Mumemto to reveal potential pangenome aberrations and misassemblies, improving newly assembled sequence collections and visualizing pangenomic variation structure. These use cases highlighted Mumemto’s potential as a core method for pangenomics, making it ideal as an initial tool in future pangenomic pipelines.

Methods

Algorithm 1.


Find multi-MEMs/MUMs

Preliminaries

Given a text $T$ of length $n$, $T[i..n]$ is defined as the $i$th suffix. The suffix array (SA) over $T$ is defined as the offsets of suffixes in $T$, ordered by lexicographic rank, such that $T[\mathrm{SA}[i]..n] > T[\mathrm{SA}[i-1]..n]$ in lexicographic order. The Burrows-Wheeler Transform $\mathrm{BWT}_T$ is a permutation of $T$ defined such that $\mathrm{BWT}_T[i] = T[\mathrm{SA}[i]-1]$, i.e., the BWT contains the characters preceding each suffix of $T$ in lexicographic order. The Longest Common Prefix (LCP) array holds the lengths of the longest common prefixes between lexicographically successive suffixes. Formally, $\mathrm{LCP}_T[i] = \mathrm{LCP}(T[\mathrm{SA}[i]..n], T[\mathrm{SA}[i-1]..n])$, where $\mathrm{LCP}_T[0] = 0$. Here, we will consider that the SA, LCP array, and BWT are being constructed over a collection of sequences $T = T_1 \# T_2 \# \cdots \# T_N$, where $\#$ represents a unique sequence delimiter. We define the document array $\mathrm{DA}[i]$ as an array that identifies the sequence of origin ($T_1$, $T_2$, etc.) for the suffix at offset $i$ in the SA.
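The definitions above can be made concrete with a deliberately naive construction. This sketch sorts suffixes directly, which is only feasible for toy inputs and is not how PFP or Mumemto build these arrays. It assumes 0-based indexing, gives each sequence its own terminator character (chr(1), chr(2), ...) as one way to realize the unique delimiter, and lets the BWT wrap around at text position 0; all names are illustrative.

```python
def build_arrays(seqs):
    """Naively build SA, LCP, BWT and DA for a small sequence collection."""
    # Assumption: one distinct low-valued terminator per sequence, so this
    # toy version is limited to fewer than 32 input sequences.
    text = "".join(s + chr(1 + d) for d, s in enumerate(seqs))
    n = len(text)

    doc_of_pos = []
    for d, s in enumerate(seqs):
        doc_of_pos.extend([d] * (len(s) + 1))

    sa = sorted(range(n), key=lambda p: text[p:])  # suffix offsets in sorted order
    bwt = [text[p - 1] for p in sa]                # p == 0 wraps to the last character
    da = [doc_of_pos[p] for p in sa]

    lcp = [0] * n                                  # lcp[0] = 0 by definition
    for i in range(1, n):
        a, b, length = sa[i - 1], sa[i], 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        lcp[i] = length
    return text, sa, lcp, bwt, da


text, sa, lcp, bwt, da = build_arrays(["ACGT", "ACGA"])
for i in range(len(sa)):
    print(i, sa[i], lcp[i], repr(bwt[i]), da[i], repr(text[sa[i]:]))
```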

Multi-sequence maximal unique matches (multi-MUMs, referred to as simply MUMs when unambiguous) are maximal exact matches that appear in all sequences ($T_1, T_2, \ldots, T_N$) and in each sequence exactly once. A maximal exact match is an exact match that cannot be extended further to the left or right. Multi-sequence maximal exact matches (multi-MEMs) relax the uniqueness constraint. That is, a multi-MEM appears in every sequence at least once, rather than exactly once. Partial multi-MUMs (pMUMs) are exact matches that appear exactly once in a subset of the sequences, and do not occur in the remaining sequences.
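The three definitions differ only in how a match's occurrences are distributed over the sequences. The sketch below classifies a match from the list of sequence indices of its occurrences, assuming maximality has already been established elsewhere. The function name, the "other" label, and the requirement that a partial match span at least two sequences are choices made for this example.

```python
from collections import Counter


def classify_match(occurrence_docs, n_sequences):
    """Classify a maximal match by where its occurrences fall.

    occurrence_docs: one sequence index per occurrence of the match.
    """
    counts = Counter(occurrence_docs)
    in_all = len(counts) == n_sequences
    once_each = all(c == 1 for c in counts.values())
    if in_all and once_each:
        return "multi-MUM"
    if in_all:
        return "multi-MEM"
    if once_each and len(counts) >= 2:
        return "partial multi-MUM"
    return "other"


print(classify_match([0, 1, 2], 3))
print(classify_match([0, 0, 1, 2], 3))
print(classify_match([0, 2], 3))
```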

Computing multi-MUMs and MEMs

Abouelhoda et al. [42] showed how to use the SA and LCP array to simulate bottom-up traversal of a suffix tree and find supermaximal repeats between two sequences. Deogun et al. [6] extended this to compute multi-MUMs by identifying LCP-intervals of size N that contain a suffix from each sequence and are not preceded by the same character. Note that the character that precedes the $i$th suffix is $\mathrm{BWT}_T[i]$.

Mumemto’s core match-finding algorithm is adapted from bottom-up traversal ([3]; algorithm 4.1) and uses the multi-MUM properties defined by Deogun et al. [6]. Its inputs are the SA, LCP array, BWT, and DA. We present this in Algorithm 1. By varying which properties we check in the IsValidInterval function, we can compute any form of multi-MUM or MEM using three main parameters: (1) number of sequences a match appears in (-k), (2) maximum occurrences within any given sequence (-f), (3) maximum total occurrences within the collection (-F) (see Fig. 1, Additional file 1: Fig. S3). Algorithm 1 can be computed using just a buffer of values from each array, corresponding to the oldest interval in the stack, allowing for streaming computation. To extend this algorithm to compute matches on either strand of a nucleotide sequence, we include the forward and reverse complement of each sequence in the original text. A caveat is that the LCP-interval of a palindromic multi-MUM would appear as an interval of length 2N, violating the multi-MUM properties. Currently, Mumemto does not report palindromic multi-MUMs.
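The interval properties used for strict multi-MUMs can be illustrated with a small, non-streaming sketch. Instead of the stack-based bottom-up traversal of Algorithm 1, it slides a window of N consecutive suffix-array rows and checks the same conditions directly: the window is a complete LCP-interval, it holds one suffix per sequence, and the preceding characters are not all identical. The arrays are built naively, only the forward strand is considered, and each sequence gets its own terminator character; these are simplifying assumptions, not Mumemto's implementation.

```python
def find_multi_mums(seqs):
    """Return [(length, [(seq_index, offset), ...]), ...] for strict multi-MUMs."""
    n_seqs = len(seqs)
    if n_seqs < 2:
        raise ValueError("need at least two sequences")
    # Assumption: unique terminators chr(1), chr(2), ... (fewer than 32 sequences).
    text = "".join(s + chr(1 + d) for d, s in enumerate(seqs))
    n = len(text)
    starts, doc_of_pos = [], []
    for d, s in enumerate(seqs):
        starts.append(len(doc_of_pos))
        doc_of_pos.extend([d] * (len(s) + 1))

    sa = sorted(range(n), key=lambda p: text[p:])  # naive suffix sort, toy inputs only
    bwt = [text[p - 1] for p in sa]                # wraps around at p == 0
    da = [doc_of_pos[p] for p in sa]
    lcp = [0] * (n + 1)                            # lcp[n] = 0 acts as a right sentinel
    for i in range(1, n):
        a, b, length = sa[i - 1], sa[i], 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        lcp[i] = length

    mums = []
    for i in range(n - n_seqs + 1):
        j = i + n_seqs - 1
        length = min(lcp[i + 1:j + 1])             # shared prefix length of rows i..j
        if length == 0:
            continue
        if lcp[i] >= length or lcp[j + 1] >= length:
            continue                               # the match occurs more than N times
        if len(set(da[i:j + 1])) != n_seqs:
            continue                               # some sequence is missing
        if len(set(bwt[i:j + 1])) == 1:
            continue                               # extendable to the left
        occurrences = sorted((da[r], sa[r] - starts[da[r]]) for r in range(i, j + 1))
        mums.append((length, occurrences))
    return mums


print(find_multi_mums(["GATTACA", "CATTACG", "TATTACT"]))
```

In the toy input, "ATTAC" occurs once per sequence with differing flanking characters, so it is reported; shorter substrings such as "TTAC" are rejected because every occurrence is preceded by the same character.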

Prefix-free parsing

Typical suffix array construction algorithms (e.g., gSACA [43, 44]) scale linearly with input size and incur a memory footprint of many bytes per character of input sequence. Prefix-free parsing was introduced by Boucher et al. [9] as a method for computing a suffix array, BWT, and LCP array in compressed space, allowing it to scale to pangenomes, which tend to be highly repetitive. The key idea was to parse the input sequence into a dictionary D containing a set of phrases, and a parse P, holding the order in which the phrases must be concatenated to obtain the original text. Boucher et al. show that the BWT, LCP array, and SA can be computed in space proportional to O(|D|+|P|). For repetitive inputs, phrases will tend to be long (reducing |P|) and appear many times (reducing |D|). PFP computes the necessary arrays for Mumemto’s exact match computation. Importantly, it computes these values in order, allowing Mumemto to operate in a streaming fashion, and avoiding the need to store the SA and LCP fully in memory or on disk.
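The parsing step can be sketched as follows. This is a simplified illustration of the idea, not the implementation used by PFP or Mumemto: phrases end wherever a window of length w hashes to 0 modulo p, consecutive phrases overlap by w characters, and the text is padded with w copies of "$". The hash function, the default values of w and p, and the function names are assumptions for the example, and the hash is recomputed per window rather than rolled.

```python
def window_hash(window: str) -> int:
    """Simple polynomial hash standing in for a Karp-Rabin rolling hash."""
    value = 0
    for ch in window:
        value = (value * 256 + ord(ch)) % 1_000_003
    return value


def prefix_free_parse(text: str, w: int = 4, p: int = 7):
    """Split text into overlapping phrases; return (dictionary, parse)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the padding character")
    padded = text + "$" * w
    phrases, start = [], 0
    for i in range(1, len(text)):
        if window_hash(padded[i:i + w]) % p == 0:  # trigger window found at i
            phrases.append(padded[start:i + w])    # phrase ends with the trigger
            start = i                              # next phrase starts with it
    phrases.append(padded[start:])
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int = 4) -> str:
    """Invert the parse: drop the w-character overlap between neighbours."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return padded[:-w]


toy_text = "GATTACAGATTACACATTAG" * 50
dictionary, parse = prefix_free_parse(toy_text)
print(len(toy_text), len(parse), sum(len(phrase) for phrase in dictionary))
```

On repetitive input the same phrases recur, so the dictionary stays small relative to the text while the parse records only phrase ranks.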

Multi-MUM collinear blocks

We define an ordering of multi-MUMs in each sequence such that multi-MUM C[ij] appears in sequence i with rank j. We define two multi-MUMs to be collinear if they form a pair that is consecutive in rank in every sequence, allowing the pair to be reversed in sequences where the multi-MUMs fall on the negative strand. A set of consecutive, collinear multi-MUMs is defined as a collinear block. We refer to the region between two adjacent collinear multi-MUMs as a collinear MUM gap. Collinear blocks can optionally be split into two when collinear MUM gaps are sufficiently large.
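The block definition can be made concrete with a short Python sketch. It assumes each multi-MUM is given as a hypothetical `(length, [start in each sequence])` tuple, handles forward-strand matches only (the reversed-pair case is omitted), and uses a hypothetical `max_gap` argument for the optional splitting step.

```python
def collinear_blocks(mums, max_gap=None):
    """Group multi-MUMs into collinear blocks (lists of indices into mums).

    Two MUMs are chained when they are adjacent in rank in every sequence
    and, if max_gap is given, no sequence has a larger gap between them.
    Runs containing a single MUM are not reported as blocks.
    """
    num_seqs = len(mums[0][1])
    ranks = []                                   # ranks[s][m] = rank of MUM m in sequence s
    for s in range(num_seqs):
        order = sorted(range(len(mums)), key=lambda m: mums[m][1][s])
        rank = [0] * len(mums)
        for r, m in enumerate(order):
            rank[m] = r
        ranks.append(rank)

    order0 = sorted(range(len(mums)), key=lambda m: mums[m][1][0])
    runs, current = [], [order0[0]]
    for a, b in zip(order0, order0[1:]):
        adjacent = all(rank[b] - rank[a] == 1 for rank in ranks)
        # Largest collinear MUM gap between a and b over all sequences.
        gap = max(mums[b][1][s] - (mums[a][1][s] + mums[a][0])
                  for s in range(num_seqs))
        if adjacent and (max_gap is None or gap <= max_gap):
            current.append(b)
        else:
            runs.append(current)
            current = [b]
    runs.append(current)
    return [run for run in runs if len(run) > 1]


# Five MUMs of length 10 across three sequences; the last one is
# translocated in the third sequence, which breaks the chain twice.
example = [
    (10, [0, 0, 0]),
    (10, [20, 25, 20]),
    (10, [40, 50, 80]),
    (10, [60, 70, 100]),
    (10, [80, 90, 40]),
]
print(collinear_blocks(example))              # [[0, 1], [2, 3]]
print(collinear_blocks(example, max_gap=12))  # [[2, 3]]
```

In the second call the 15-base gap between the first two MUMs in the second sequence exceeds `max_gap`, so that pair is split and only the second block remains.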

Pangenome graph construction

Collinear multi-MUM blocks can be connected to form preliminary pangenome graphs. For each collinear block, the gaps between collinear multi-MUMs can be collapsed into the set of unique inter-MUM sequences. These represent a snarl [45] flanked by the multi-MUM sequence. Similarly, gaps between collinear blocks are collapsed when possible, and the necessary edges are included between MUM and inter-MUM nodes to provide haplotype walks that represent the input sequences. The nodes, edges, and walks are written to GFA format v1.1 for use with any downstream graph method.
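A minimal Python sketch of this construction for a single collinear block is given below. It assumes forward-strand multi-MUMs supplied as hypothetical `(length, [start in each sequence])` tuples, covers only the region from the first to the last multi-MUM of the block, and uses invented segment names (`s1`, `s2`, ...); it is not Mumemto's graph code. Each distinct gap sequence becomes one node, and haplotypes with an empty gap step directly to the next MUM node.

```python
def block_to_gfa(names, seqs, block):
    """Write one collinear block as GFA v1.1 text (S, L, and W lines)."""
    segments, edges = [], set()
    walks = [[] for _ in seqs]

    def new_segment(sequence):
        segments.append(sequence)
        return f"s{len(segments)}"

    def step(h, node):
        # Extend haplotype h's walk and record the edge it traverses.
        if walks[h]:
            edges.add((walks[h][-1], node))
        walks[h].append(node)

    for i, (length, starts) in enumerate(block):
        # One shared node per multi-MUM; every haplotype passes through it.
        mum_node = new_segment(seqs[0][starts[0]:starts[0] + length])
        for h in range(len(seqs)):
            step(h, mum_node)
        if i + 1 < len(block):
            next_starts = block[i + 1][1]
            gap_nodes = {}                     # distinct gap sequence -> node name
            for h, seq in enumerate(seqs):
                gap = seq[starts[h] + length:next_starts[h]]
                if gap:
                    if gap not in gap_nodes:
                        gap_nodes[gap] = new_segment(gap)
                    step(h, gap_nodes[gap])

    lines = ["H\tVN:Z:1.1"]
    lines += [f"S\ts{k}\t{s}" for k, s in enumerate(segments, 1)]
    lines += [f"L\t{a}\t+\t{b}\t+\t0M" for a, b in sorted(edges)]
    for h, name in enumerate(names):
        walk = "".join(">" + node for node in walks[h])
        lines.append(f"W\t{name}\t0\t{name}\t*\t*\t{walk}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    names = ["hap1", "hap2", "hap3"]
    seqs = ["ACGTACGTATTGGCCAA", "ACGTACGTCTTGGCCAA", "ACGTACGTATTGGCCAA"]
    block = [(8, [0, 0, 0]), (8, [9, 9, 9])]
    print(block_to_gfa(names, seqs, block), end="")
```

In the toy input, two haplotypes carry an A between the flanking multi-MUMs and one carries a C, so the gap collapses to two nodes and the three walks diverge only at that position.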

Supplementary information

13059_2025_3644_MOESM1_ESM.pdf (261.6KB, pdf)

Additional file 1. Contains Fig S1, S2, S3 and Table S1. PDF. Fig S1 contains a breakdown of the PFP pipeline runtime, Fig S2 contains correlations between coverage of MUM/MEMs and genes, and Fig S3 contains an example of the MUM finding procedure. Table S1 contains a breakdown of read alignment against different graphs.

Acknowledgements

We thank Todd Treangen and Bryce Kille for the valuable discussions on multi-MUMs. We also thank Katie Jenike for pointing us to useful pangenome datasets.

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Authors' contributions

V.S.S. and B.L. conceived the project. V.S.S. wrote the software and conducted the experiments. V.S.S. and B.L. wrote the manuscript. All authors revised and approved the final manuscript.

Funding

This work was carried out at the Advanced Research Computing at Hopkins (ARCH) core facility, supported by the National Science Foundation (OAC 1920103). V.S.S. was supported by the National Science Foundation (DGE2139757).

B.L. and V.S.S. were supported by the National Human Genome Research Institute (R01HG011392 to B.L.) and the National Science Foundation (IIBR 2029552).

Data availability

The Mumemto software is available open source at https://github.com/vikshiv/mumemto and released under a GPL-3.0 license. Code for reproducing the results and figures is available at https://github.com/vikshiv/mumemto-reproducibility-scripts [46]. Mumemto v1.1.1 [47] was used for all experiments in this manuscript, and can be found at 10.5281/zenodo.15053448.

Declarations

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Vikram S. Shivakumar, Email: vshivak1@jhu.edu

Ben Langmead, Email: langmea@cs.jhu.edu

References

  • 1. Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, et al. A draft human pangenome reference. Nature. 2023;617(7960):312–24.
  • 2. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):e1005944.
  • 3. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discret Algoritm. 2004;2(1):53–86.
  • 4. Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14(7):1394–403.
  • 5. Kille B, Nute MG, Huang V, Kim E, Phillippy AM, Treangen TJ. Parsnp 2.0: scalable core-genome alignment for massive microbial datasets. Bioinformatics. 2024;40(5):btae311.
  • 6. Deogun JS, Yang J, Ma F. Emagen: an efficient approach to multiple whole genome alignment. In: Proceedings of the second conference on Asia-Pacific bioinformatics. Darlinghurst: Australian Computer Society, Inc.; 2004;29:113–122.
  • 7. Höhl M, Kurtz S, Ohlebusch E. Efficient multiple genome alignment. Bioinformatics. 2002;18(suppl_1):S312–S320.
  • 8. Treangen TJ, Messeguer X. M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species. BMC Bioinformatics. 2006;7:1–15.
  • 9. Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T. Prefix-free parsing for building big BWTs. Algoritm Mol Biol. 2019;14:1–15.
  • 10. Gagie T, Navarro G, Prezza N. Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms. SIAM; 2018. pp. 1459–1477.
  • 11. Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Comput Biol. 2020;27:500–513.
  • 12. Nishimoto T, Tabei Y. Optimal-time queries on BWT-runs compressed indexes. In: 48th international colloquium on automata, languages, and programming (ICALP 2021). Schloss-Dagstuhl-Leibniz Zentrum für Informatik; 2021.
  • 13. Zakeri M, Brown NK, Ahmed OY, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. iScience. 2024;27(12):111464.
  • 14. Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010;5(6):e11147.
  • 15. Beyer W, Novak AM, Hickey G, Chan J, Tan V, Paten B, et al. Sequence tube maps: making graph genomes intuitive to commuters. Bioinformatics. 2019;35(24):5318–20.
  • 16. Li H. A collection of high-quality human assemblies. Zenodo; 2024. 10.5281/zenodo.13955431.
  • 17. Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol. 2014;15:1–15.
  • 18. Hickey G, Monlong J, Ebler J, et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol. 2024;42:663–673.
  • 19. Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2021;374(6574):abg8871.
  • 20. Baid G, Nattestad M, Kolesnikov A, Goel S, Yang H, Chang PC, et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. bioRxiv [Preprint]. 2020.12.11.422022.
  • 21. Alonge M, Lebeigle L, Kirsche M, Jenike K, Ou S, Aganezov S, et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 2022;23(1):258.
  • 22. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.
  • 23. Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lió P, et al. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015;31(15):2443–51.
  • 24. Chen KT, Chen CJ, Shen HT, Liu CL, Huang SH, Lu CL. Multi-CAR: a tool of contig scaffolding using multiple references. BMC Bioinformatics. 2016;17:185–92.
  • 25. Logsdon GA, Vollger MR, Hsieh P, Mao Y, Liskovykh MA, Koren S, et al. The structure, function and evolution of a complete human chromosome 8. Nature. 2021;593(7857):101–7.
  • 26. O’Donnell S, Yue JX, Saada OA, Agier N, Caradec C, Cokelaer T, et al. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat Genet. 2023;55(8):1390–9.
  • 27. Baker DN, Langmead B. Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2. Genome Res. 2023;33(7):1218–27.
  • 28. Cheng L, Wang N, Bao Z, et al. Leveraging a phased pangenome for haplotype design of hybrid potato. Nature. 2025;640:408–417. 10.1038/s41586-024-08476-9.
  • 29. Hufford MB, Seetharam AS, Woodhouse MR, Chougule KM, Ou S, Liu J, et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science. 2021;373(6555):655–62.
  • 30. Maize GDB, 2021. https://maizegdb.org/NAM_project. Accessed 17 April 2024.
  • 31. Potato Database, 2024. http://solomics.agis.org.cn/potato/. Accessed 17 April 2024.
  • 32. Arabidopsis thaliana (PRJNA1033522), 2024. NCBI BioProject Accession PRJNA1033522. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1033522/. Accessed 11 April 2024.
  • 33. Saccharomyces cerevisiae reference assembly panel (ScRAP), 2023. ENA Project Accession PRJEB59869. https://www.ebi.ac.uk/ena/browser/view/PRJEB59869. Accessed 17 April 2024.
  • 34. Tang D, Jia Y, Zhang J, Li H, Cheng L, Wang P, et al. Genome evolution and diversity of wild and cultivated potatoes. Nature. 2022;606(7914):535–41.
  • 35. Zavallo D, Crescente JM, Gantuz M, Leone M, Vanzetti LS, Masuelli RW, et al. Genomic re-assessment of the transposable element landscape of the potato genome. Plant Cell Rep. 2020;39:1161–74.
  • 36. Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol. 2023;24(1):122.
  • 37. Brown NK, Shivakumar VS, Langmead B. Improved pangenomic classification accuracy with chain statistics. In: International Conference on Research in Computational Molecular Biology. Cham: Springer; 2025;15647:190–208.
  • 38. Li H. BWT construction and search at the terabase scale. Bioinformatics. 2024;40(12):btae717.
  • 39. Ferro E, Oliva M, Gagie T, Boucher C. Building a pangenome alignment index via recursive prefix-free parsing. iScience. 2024;27(10):110933.
  • 40. Porubsky D, Guitart X, Yoo D, Dishuck PC, Harvey WT, Eichler EE. SVbyEye: a visual tool to characterize structural variation among whole-genome assemblies. bioRxiv [Preprint]. 2024 Sep 17:2024.09.11.612418. 10.1101/2024.09.11.612418.
  • 41. Goel M, Schneeberger K. Plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics. 2022;38(10):2922–6.
  • 42. Abouelhoda MI, Kurtz S, Ohlebusch E. The enhanced suffix array and its applications to genome analysis. In: Algorithms in bioinformatics: second international workshop, WABI 2002 Rome, Italy, September 17–21, 2002 Proceedings 2. Springer; 2002. pp. 449–463.
  • 43. Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans Inf Syst (TOIS). 2013;31(3):1–15.
  • 44. Louza FA, Gog S, Telles GP. Inducing enhanced suffix arrays for string collections. Theor Comput Sci. 2017;678:22–39.
  • 45. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36(9):875–9.
  • 46. Shivakumar V. Reproducibility scripts for Mumemto. Zenodo; 2025. 10.5281/zenodo.15538966.
  • 47. Shivakumar V. vikshiv/mumemto: v1.1.1. Zenodo; 2025. 10.5281/zenodo.15053448.


Articles from Genome Biology are provided here courtesy of BMC
