Abstract
Motivation:
The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
Results:
To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
2. Introduction
The recent deluge of genomic data accelerated by population-scale long-read sequencing efforts has driven an urgent need for scalable long-read mapping and comparative genomics algorithms. The completion of the first Telomere-to-Telomere (T2T) human genome Nurk et al. (2022) and the launch of the Human Pangenome Project Wang et al. (2022a) have paved the way to mapping genomic diversity at unprecedented scale and resolution. A key goal when comparing a newly sequenced human genome to a reference genome or pangenome is to accurately identify homologous sequences, that is, DNA sequences that share a common evolutionary source.
Algorithms for pairwise sequence alignment, which aim to accurately identify homologous regions between two sequences, have continued to advance in recent years Marco-Sola et al. (2021). While powerful and ubiquitous in computational biology, exact alignment algorithms are typically reserved for situations where the boundaries of homology are known a priori, due to their quadratic runtime costs and inability to model nonlinear sequence relationships such as inversions, translocations, and copy number variants. Because of this, long-read mapping or whole-genome alignment methods must first identify homologous regions across billions of nucleotides, after which the exact methods can be deployed to compute a base-level “gapped” read alignment for each region. To efficiently identify candidate mappings, the prevailing strategy is to first sample k-mers and then identify consecutive k-mers that appear in the same order in both sequences, steps known as “seeding” and “chaining”, respectively.
For many use cases, an exact gapped alignment is not needed and only an estimate of sequence identity is required. As a result, methods have been developed which can predict sequence identity without the cost of computing a gapped alignment. Jaccard similarity, a metric used for comparing the similarity of two sets, has found widespread use for this task, especially when combined with locality sensitive hashing of k-mer sets Ondov et al. (2016); Brown and Irber (2016); Ondov et al. (2019); Jain et al. (2017, 2018a); Baker and Langmead (2019); Shaw and Yu (2023). By comparing only k-mers, the Jaccard similarity can be used to estimate the average nucleotide identity (ANI) of two sequences without the need for an exact alignment Ondov et al. (2016, 2019); Blanca et al. (2022).
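The conversion from a Jaccard value to an ANI estimate is not spelled out in this section. As a minimal sketch, the two commonly used closed forms are shown below: the Poisson form popularised by Mash and the binomial form referred to later in the text. Both assume sequences of roughly equal length and independent point mutations; the function name and `model` argument are illustrative and not taken from MashMap.

```python
import math


def ani_from_jaccard(j, k, model="binomial"):
    """Estimate ANI (a fraction in [0, 1]) from a k-mer Jaccard value j."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard value must lie in [0, 1]")
    if j == 0.0:
        return 0.0
    # Estimated fraction of k-mers left untouched by mutations.
    shared = 2.0 * j / (1.0 + j)
    if model == "binomial":
        # A k-mer survives if all k of its bases survive: shared = ANI ** k.
        return shared ** (1.0 / k)
    if model == "poisson":
        # Mash-style distance d = -ln(shared) / k, reported as ANI = 1 - d.
        return max(0.0, 1.0 + math.log(shared) / k)
    raise ValueError("model must be 'binomial' or 'poisson'")
```

Under the binomial form, an ANI of 0.95 with k = 16 corresponds to a Jaccard value of p / (2 - p) with p = 0.95^16, and the function inverts that relation.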
To accelerate mapping and alignment, k-mers from the input sequences are often down-sampled using a “winnowing scheme” in a way that reduces the input size while still enabling meaningful comparisons. For example, both MashMap Jain et al. (2017, 2018a) and Minimap Li (2018) use a minimizer scheme Roberts et al. (2004), which selects only the smallest k-mer from all $w$-length substrings of the genome. Of relevance to this study, MashMap2 then uses these minimizers to approximate the Jaccard similarity between the mapped sequences, and these estimates have been successfully used by downstream methods such as FastANI Jain et al. (2018b) and MetaMaps Dilthey et al. (2019).
However, a recent investigation noted limitations of the “winnowed minhash” scheme introduced by MashMap Belbasi et al. (2022). Although the original MashMap paper notes a small, but negligible bias in its estimates Jain et al. (2017), Belbasi et al. proved that no matter the length of the sequences, the bias of the minimizer-based winnowed minhash estimator is never zero Belbasi et al. (2022).
To address this limitation, we propose a novel winnowing scheme, the “minmer” scheme, which is a generalization of minimizers that allows for the selection of multiple k-mers per window. We define this scheme, characterize its properties, and provide an implementation in MashMap3. Importantly, we show that minmers, unlike minimizers, enable an unbiased prediction of the local Jaccard similarity.
3. Preliminaries
Let $\Sigma$ be an alphabet and $S_k \colon \Sigma^* \to \mathcal{P}(\Sigma^k)$ be a function which maps a sequence $X$ to the set of all k-mers in $X$. Similarly, given a sequence $X$, we define $W_i^{(w)}(X)$ as the sequence of $w$ k-mers in $X$ starting at the $i$th k-mer. When $w$ and $X$ are clear from context, we use $W_i$. We use the terms sequence and string interchangeably.
3.1. Jaccard similarity and the minhash approximation
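Given two sets $A$ and $B$, their Jaccard similarity is defined as $J(A, B) = |A \cap B| / |A \cup B|$. The Jaccard similarity between two sequences $R$ and $Q$ can be computed as the Jaccard similarity of their k-mer sets, $J(S_k(R), S_k(Q))$, for some k-mer size $k$.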
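As a small illustration of the definition, the exact Jaccard similarity of two k-mer sets can be computed directly. The helper names below are illustrative only.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


# "ACGTACGT" yields {ACG, CGT, GTA, TAC}; "ACGTTCGT" yields {ACG, CGT, GTT, TTC, TCG}.
# They share 2 of 7 distinct 3-mers.
similarity = jaccard(kmer_set("ACGTACGT", 3), kmer_set("ACGTTCGT", 3))
```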
However, computing the exact Jaccard for the full k-mer sets of $R$ and $Q$ is not an efficient method for determining similarity for long reads and whole genomes. Instead, the minhash algorithm provides an estimator for the Jaccard similarity while only needing to compare a fraction of the two sets. Assuming $U$ is the universe of all possible elements and $\pi \colon U \to [|U|]$ is a function which imposes a randomized total order on the universe of elements, we have that
$$J(A, B) = \Pr\left(\min_{x \in A} \pi(x) = \min_{x \in B} \pi(x)\right).$$
This equivalency, proven by Broder (1997), is key to the minhash algorithm and yields an unbiased and consistent Jaccard estimator with the help of a sketching function $\pi_s$. Let $\pi_s(A)$ return the lowest $s$ items from the input set $A$ according to the random total order $\pi$. Then we define the minhash as
$$\hat{J}(A, B; s) = \frac{|\pi_s(A \cup B) \cap \pi_s(A) \cap \pi_s(B)|}{|\pi_s(A \cup B)|}.$$
Importantly, this Jaccard estimator has an expected error that scales with $O(1/\sqrt{s})$ and is therefore independent of the size of the original input sets. While there are a number of variants of minhash which provide the same guarantee Cohen (2016), we will be using the “bottom-s sketch” (as opposed to the $s$-mins and $s$-partition sketches) since it ensures a consistent sketch size regardless of the parameters and requires only a single hash computation per element of the input set. Additionally, the simplicity of the bottom-s sketch leads to a streamlined application of the sliding window model, which we describe next.
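The bottom-s estimator can be sketched in a few lines. This is a minimal illustration, not MashMap's implementation: it assumes a 64-bit BLAKE2b digest as a stand-in for the random order, and the function names are hypothetical.

```python
import hashlib


def hash64(item):
    """Deterministic 64-bit hash standing in for the random total order."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_s(items, s):
    """Bottom-s sketch: the s smallest distinct hash values of a set."""
    return set(sorted({hash64(x) for x in items})[:s])


def minhash_jaccard(a, b, s):
    """Bottom-s minhash estimate of the Jaccard similarity of sets a and b."""
    sketch_a, sketch_b = bottom_s(a, s), bottom_s(b, s)
    # The s smallest hashes of the two sketches form the sketch of the union.
    sketch_union = set(sorted(sketch_a | sketch_b)[:s])
    if not sketch_union:
        return 1.0
    return len(sketch_union & sketch_a & sketch_b) / len(sketch_union)
```

When `s` is at least the size of the union, every element is kept and the estimate coincides with the exact Jaccard similarity (barring hash collisions); smaller `s` trades accuracy for sketch size.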
3.2. Winnowing
While sequences can be reduced into their corresponding sketch via the method described above, this is a global sketch and it is difficult to determine where two sequences share similarity. In order to perform local mapping, Schleimer et al. (2003) and Roberts et al. (2004) independently introduced the concept of winnowing and minimizers. In short, given some total ordering on the k-mers, a window of length $w$ is slid over the sequence and the element with the lowest rank in each window (the minimizer) is selected, using the left-most position to break ties Roberts et al. (2004). By definition, winnowing ensures that at least one element is sampled per window and therefore there is never a gap of more than $w$ elements between sampled positions. Here, we extend the winnowing concept to allow the selection of more than one element per window (the minmers), and we refer to the set of all minmers and/or their positions as the winnowed sequence.
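A direct, unoptimised sketch of minimizer winnowing over a list of k-mer hash values is given below. It recomputes the minimum of every window for clarity, whereas practical implementations use a monotone queue; the function name is illustrative.

```python
def minimizer_positions(hashes, w):
    """Positions sampled by sliding a window of w k-mers over the hash values."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # min() returns the first minimal element, so ties go to the leftmost k-mer.
        offset = min(range(w), key=window.__getitem__)
        selected.add(start + offset)
    return sorted(selected)


# Windows of 3 over [5, 3, 8, 3, 9, 1, 7] select positions 1, 1, 3, 5, 5.
positions = minimizer_positions([5, 3, 8, 3, 9, 1, 7], 3)
```

Consecutive sampled positions are never more than `w` apart, which is the window guarantee described above.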
3.2.1. Winnowing scheme characteristics
Definition 3.1.
A winnowing scheme has a $(w, s)$-window guarantee if for every window of $w$ k-mers, there are at least $\min(s, d)$ k-mers sampled from the window, where $d$ is the number of distinct k-mers in the window.
This definition is more general than the commonly used $w$-window guarantee, which is equivalent to the $(w, 1)$-window guarantee. While not all winnowing schemes must have such a guarantee, this ensures that no area of the sequence is under-sampled. Shaw and Yu (2022) recently provided an analytical framework for winnowing schemes and showed that mapping sensitivity is related to the distribution of distances (or spread) between sampled positions, and precision is related to the proportion of unique values relative to the total number of sampled positions. As the overarching goal of winnowing is to reduce the size of the input while preserving as much information as possible, winnowing schemes typically aim to optimize the precision/sensitivity metrics given a particular density.
Definition 3.2.
The density of a winnowing scheme is defined as the expected frequency of sampled positions from a long random string, and the density factor is defined as the expected number of sampled positions in a window of $w + 1$ k-mers.
There has been significant work on improving the performance of minimizers by identifying orderings that reduce the density factor Marçais et al. (2017). Minimizer schemes which use a uniformly random ordering have a density factor of 2, and recent schemes like Miniception Zheng et al. (2020) and PASHA Ekim et al. (2020) are able to obtain density factors as low as 1.7 for certain values of $w$ and $k$.
For the remainder of this work, we will assume that $w \ll 4^k$, i.e. the windows are not so large that we expect duplicate k-mers in a random string. This ensures that each k-mer in a window has probability $s/w$ of being in the sketch for that window.
3.2.2. Winnowing scheme hierarchies
Recent winnowing methods have focused on schemes that select at most a single position per window, which simplifies analyses but restricts the universe of possible schemes. Minimizers belong to the class of forward winnowing schemes, where the sequence of positions sampled from adjacent sliding windows is non-decreasing Marçais et al. (2018). More general is the concept of a $w$-local scheme Shaw and Yu (2022), defined on windows of $w$ consecutive k-mers but without the forward requirement. Non-forward schemes are more powerful and are not limited by the same density factor bounds as forward schemes. While the need of non-forward schemes to “jump back” in order to obtain lower sampling densities is acknowledged by Marçais et al. (2018), there are currently no well-studied, non-forward, $w$-local schemes.
3.3. MashMap
MashMap is a minimizer-based tool for long-read and whole-genome sequence homology mapping that is designed to identify all pairwise regions above some sequence similarity cutoff Jain et al. (2017, 2018a). Specifically, for a reference sequence $R$ and a query sequence $Q$ comprised of $w$ k-mers, MashMap aims to find all positions $i$ in the reference such that $J(A, B_i) \geq c$, where $A = S_k(Q)$ and $B_i = S_k(W_i(R))$, and $c$ is the sequence similarity cutoff. For ease of notation, we will use $B_i$ to refer to the sequence of k-mers from the reference sequence $R$ that starts at position $i$. Importantly, MashMap only requires users to specify a minimum segment length and minimum sequence identity threshold, and the algorithm will automatically determine the parameters needed to return all mappings that meet these criteria with parameterized confidence under a binomial mutation model.
Here we replace the minimizer-based approach of prior versions of MashMap with minmers. While the problem formulation remains the same, our method for computing the reference index and filtering candidate mappings is novel. We will first introduce the concept of minmers, which enable winnowing the input sequences while still maintaining the k-mers necessary to compute an unbiased Jaccard estimation between any two windows of length at least $w$. We will then discuss the construction of the reference index and show how query sequences can be efficiently mapped to the reference such that their expected ANI is above the desired threshold.
4. The minmer winnowing scheme
Minmers are a generalization of minimizers that allow for the selection of more than one minimum value per window. The relationship between minmers and minimizers was noted by Berlin et al. (2015) but as a global sketch and without the use of a sliding window. Here we formalize a definition of the minmer winnowing scheme.
Definition 4.1.
Given a tuple $(w, s, k, \pi)$, where $w$, $s$, and $k$ are integers and $\pi$ is an ordering on the set of all k-mers, a k-mer in a sequence is a minmer if it is one of the smallest $s$ k-mers in any of the subsuming windows of $w$ k-mers.
Similar to other $w$-local winnowing schemes, ties between k-mers are broken by giving priority to the leftmost k-mer. From the definition, it follows that by letting $s = 1$ we obtain the definition of the minimizer scheme. Compared to minimizers with the same $w$ value, minmers guarantee that at least $s$ k-mers will be sampled from each window. However, as a non-forward scheme, a minmer may be one of the smallest $s$ k-mers in two non-adjacent windows, yet not one of the smallest $s$ k-mers in an intervening window (Figure 1). To account for this and simplify development of this scheme, we define a minmer interval to be the interval for which the k-mer at position $i$ is a minmer for all windows starting within that interval. Thus, a single k-mer may have multiple minmer intervals starting at different positions.
Definition 4.2.
A tuple $(i, a, b)$ is a minmer interval for a sequence $S$ if the k-mer at position $i$ is a minmer for all windows $W_j$ where $a \leq j < b$, but not $W_{a-1}$ or $W_b$.
Any window $W_i$ may contain more than $s$ minmers, and so to naively compute the Jaccard between a query $Q$ and $W_i$ would require identification of the smallest $s$ k-mers in $W_i$. Minmer intervals are convenient because for any window start position $i$, the smallest $s$ k-mers in $W_i$ are simply the ones whose minmer intervals contain $i$. Thus, indexing with minmer intervals enables the efficient retrieval of the smallest $s$ k-mers for any window without additional sorting or comparisons.
Another benefit of minmer intervals is that the smallest $s$ k-mers for any window of length $w' > w$ are guaranteed to be a subset of the combined $(w, s)$-minmers contained in that window. This subset can be easily computed with minmer intervals, since the set of $(w, s)$-minmer intervals that overlap with the range $[i, i + w' - w]$ are also guaranteed to include the smallest $s$ k-mers of the larger window, and the overlapping minmer intervals can be inspected to quickly identify them.
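The definitions can be made concrete with a brute-force reference implementation that recomputes the bottom-s set of every window. It is quadratic and meant only as a sketch for checking faster code; intervals are reported as half-open tuples `(position, a, b)`, ties are broken by position, and the function names are illustrative.

```python
def window_sketch(hashes, start, w, s):
    """Positions of the s smallest k-mers in the window starting at `start`."""
    positions = range(start, start + w)
    # Sorting on (hash, position) breaks ties in favour of the leftmost k-mer.
    return set(sorted(positions, key=lambda p: (hashes[p], p))[:s])


def minmer_intervals(hashes, w, s):
    """All (position, a, b): the k-mer at position is sampled for windows a <= j < b."""
    n_windows = len(hashes) - w + 1
    open_at = {}    # position -> index of the window where its interval began
    intervals = []
    for j in range(n_windows + 1):
        # One extra iteration with an empty sketch closes the remaining intervals.
        sketch = window_sketch(hashes, j, w, s) if j < n_windows else set()
        for p in [p for p in open_at if p not in sketch]:
            intervals.append((p, open_at.pop(p), j))
        for p in sketch:
            open_at.setdefault(p, j)
    return sorted(intervals)


# Position 3 is sampled in window 0, dropped in window 1, and sampled again in window 2.
example = minmer_intervals([40, 10, 50, 30, 20, 60], w=4, s=2)
```

The example shows the non-forward behaviour: the k-mer at position 3 receives two separate minmer intervals.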
4.1. Constructing the rolling minhash index
In this section, we will describe our rolling bottom-$s$ sketch algorithm for collecting minmers and their corresponding minmer intervals. Popic and Batzoglou (2017) proposed a related rolling minhash method for short-read mapping, but using an $s$-mins scheme without minmer intervals. For the remainder of the section, we will assume no duplicate k-mers in a window and an ideal uniform hash function which maps to $[0, 1]$. Duplicate k-mers are handled in practice by keeping a counter of the number of active positions for a particular k-mer, similar to the original MashMap implementation Jain et al. (2017). Minmer intervals longer than the window length sometimes arise due to duplicate k-mers and are split into adjacent intervals of length at most $w$. This bound on the minmer interval length is necessary for the mapping step.
For ease of notation, we now consider $R$ as a sequence of k-mer hash values $r_0, r_1, \ldots$ where each $r_i \in [0, 1]$ and refer to these elements as hashes and k-mers interchangeably. We use a min-heap $H$ and a sorted map $M$, both ordered on the hash values, to keep track of the rolling minhash index. As the window slides across $R$, $M$ will contain the minmer intervals for the lowest $s$ hashes in the window and $H$ will contain the remaining hashes in the window. We denote the minmer interval of a hash $x$ in $M$ by $M[x]_{\mathrm{start}}$ and $M[x]_{\mathrm{end}}$. In practice, $H$ may contain “expired” k-mers which are no longer part of the current window; however, by storing the k-mer position as well, we can immediately discard such k-mers whenever they appear at the top of the heap. To prevent expired k-mers from accumulating, all expired k-mers are pruned from the heap whenever the heap size exceeds $2w$. After initialization of $H$ and $M$ with the first $w$ k-mers of $R$, we begin sliding the window for each consecutive position $i$ and collect the minmer intervals in an index $I$. For each window $W_i$, there will be a single “exiting” k-mer $r_{i-1}$ and a single “entering” k-mer $r_{i+w-1}$, each of which may or may not belong to the lowest $s$ k-mers. Therefore, we have four possibilities, examples of which can be seen in Figure 1.
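- $r_{i-1} \notin M$ and $r_{i+w-1} > \max(M)$: Neither the exiting nor entering k-mer is in the sketch. Insert $r_{i+w-1}$ into the heap $H$.
- $r_{i-1} \notin M$ and $r_{i+w-1} < \max(M)$: The exiting k-mer was not in the sketch, but the entering k-mer will be. Since the incoming k-mer enters the sketch, the largest element in the sketch must be removed. Therefore, $M[\max(M)]_{\mathrm{end}}$ is set to $i$ and the minmer interval is appended to the index $I$. $\max(M)$ is then removed from the sorted map $M$ and the new k-mer is inserted into $M$, marking $M[r_{i+w-1}]_{\mathrm{start}} = i$.
- $r_{i-1} \in M$ and $r_{i+w-1} > \max(M)$: The exiting k-mer was in the sketch, but the entering k-mer will not be. Since the exiting k-mer was a member of the sketch, set $M[r_{i-1}]_{\mathrm{end}} = i$, remove $r_{i-1}$ from $M$ and append it to $I$, and insert $r_{i+w-1}$ into $H$. At this point, $|M| = s - 1$, as we removed an element from the sketch but did not replace it. To fill the empty sketch position, k-mers are popped from $H$ until a k-mer $x$ which has not expired is obtained. This k-mer is added to $M$, setting $M[x]_{\mathrm{start}} = i$.
- $r_{i-1} \in M$ and $r_{i+w-1} < \max(M)$: Both the exiting and entering k-mers are in the sketch. As before, set $M[r_{i-1}]_{\mathrm{end}} = i$, remove $r_{i-1}$ from $M$, and append it to $I$. The entering k-mer belongs in the sketch, so insert it into $M$ and set $M[r_{i+w-1}]_{\mathrm{start}} = i$.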
Our implementation of the sorted map $M$ uses a balanced binary tree, and the heap $H$ is pruned in $O(w)$ time at most once every $w$ k-mers; therefore, the amortized time complexity of each sliding window update is $O(\log w)$. In order to efficiently use the index for mapping, we sort the index $I$ based on the start positions of the minmers. In addition to $I$, we compute a reverse lookup table $T$ which maps hash values to ordered lists of start and end points of minmer intervals for that hash value. Overall, the indexing requires $O(|R| \log w + |I| \log |I|)$ time, where $|I|$ is estimated to be the length of $R$ multiplied by the minmer interval density, as shown in section 5.1.2.
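The four cases above can be sketched as follows. This is a simplified illustration rather than the MashMap3 code: a sorted Python list stands in for the balanced tree (so removals cost O(s) instead of O(log s)), k-mers are stored as `(hash, position)` tuples so that ties resolve to the leftmost position instead of using a duplicate counter, a k-mer evicted from the sketch is pushed back onto the heap because it is still inside the window, and `1 <= s <= w <= len(hashes)` is assumed.

```python
import bisect
import heapq


def rolling_minmer_intervals(hashes, w, s):
    """Rolling bottom-s sketch; returns sorted half-open (position, a, b) intervals."""
    sketch = []   # sorted (hash, pos) of the s smallest k-mers in the window
    start = {}    # pos -> window index where its open interval began
    index = []    # collected minmer intervals

    def close(item, j):
        sketch.remove(item)
        index.append((item[1], start.pop(item[1]), j))

    def open_(item, j):
        bisect.insort(sketch, item)
        start[item[1]] = j

    first = sorted((hashes[p], p) for p in range(w))
    for item in first[:s]:
        open_(item, 0)
    heap = first[s:]  # a sorted list already satisfies the heap invariant
    n_windows = len(hashes) - w + 1
    for j in range(1, n_windows):
        exiting = (hashes[j - 1], j - 1)
        entering = (hashes[j + w - 1], j + w - 1)
        largest = sketch[-1]
        exit_in = exiting[1] in start
        enter_in = entering < largest
        if exit_in:
            close(exiting, j)
        if enter_in:
            if not exit_in:
                # The largest sketch element is displaced but stays in the window.
                close(largest, j)
                heapq.heappush(heap, largest)
            open_(entering, j)
        else:
            heapq.heappush(heap, entering)
            if exit_in:
                # Refill the sketch, discarding expired k-mers at the top of the heap.
                while heap[0][1] < j:
                    heapq.heappop(heap)
                open_(heapq.heappop(heap), j)
        if len(heap) > 2 * w:
            heap = [x for x in heap if x[1] >= j]
            heapq.heapify(heap)
    for item in list(sketch):
        close(item, n_windows)
    return sorted(index)
```

On the hash sequence `[40, 10, 50, 30, 20, 60]` with `w=4` and `s=2`, the k-mer at position 3 leaves the sketch in the second window and re-enters it in the third, so it contributes two intervals.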
4.2. Querying the rolling minhash index
MashMap computes mappings in a two-stage process. In the first stage, all regions within the reference that may contain a mapping satisfying the desired ANI constraints are obtained. In the second stage, the minhash algorithm is used to estimate the Jaccard for each candidate mapping position produced by the first stage. As the second stage is the most computationally intensive step, we introduce both a new candidate region filter and a more efficient minhash computation to improve overall runtime. We assume here that query sequences are $w$ k-mers long. In practice, sequences longer than $w$ are split into windows of $w$ k-mers, mapped independently, and then chained and filtered as described in Jain et al. (2018a).
4.2.1. Stage 1: Candidate region filter
First, the query sequence is winnowed using a min-heap to obtain the $s$ lowest hash values. All minmer intervals in the reference with matching hashes are obtained from the reverse lookup table $T$ and a sorted list $L$ is created in $O(|L| \log s)$ time, where $L$ consists of all minmer start and end positions. In this way, we can iterate through the list and keep a running count of the overlapping minmer intervals by incrementing the count for each start-point and decrementing the count for each endpoint.
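The running count over sorted start and end points is a standard sweep. A minimal sketch is given below, assuming half-open intervals `(a, b)` and an illustrative function name; it returns the largest number of simultaneously overlapping intervals and the first position where that count is reached.

```python
def max_overlap(intervals):
    """Sweep over interval endpoints; return (max overlap count, first position reached)."""
    events = []
    for a, b in intervals:
        events.append((a, 1))    # interval opens at a
        events.append((b, -1))   # half-open interval closes at b
    # Tuples sort by position first, then -1 before +1, so an interval ending
    # at x is closed before another interval starting at x is counted.
    events.sort()
    best, best_pos, count = 0, None, 0
    for pos, delta in events:
        count += delta
        if count > best:
            best, best_pos = count, pos
    return best, best_pos


# Intervals (0,5), (3,8) and (4,6) all cover position 4.
peak = max_overlap([(0, 5), (3, 8), (4, 6), (8, 10)])
```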
Unlike the previous versions of MashMap that look for all mappings above a certain ANI threshold, MashMap3 provides the option to instead filter out all mappings which are not likely to be within $\Delta_{\mathrm{ANI}}$ of the best predicted mapping ANI. This significantly reduces the number and size of the candidate regions passed on to the more expensive second stage.
Let be a random variable representing the numerator of the minhash formula for and . Given , we observe that is distributed hypergeometrically, where we have success states in population of states (proof in Supplementary Materials). Let be a position with the maximum intersection size over all , i.e. the position in that overlaps with the most selected minmer intervals. We can now find a minimum intersection size such that for any ,
where is the difference in the Jaccard that corresponds to an ANI value less than the ANI value predicted by and is a desired confidence level. To calculate this probability, we can use the following summation
For each intersection size, we can identify a cutoff in time. As a preprocessing step, we compute cutoffs for each of the possible intersection sizes at the indexing stage. Candidate regions that are unlikely to have an ANI within of the best predicted ANI are then pruned. The default and confidence parameters of MashMap3 are 0 and 0.999, respectively, as in many cases the lower scoring mappings for a segment are filtered out by the plane-sweep filtering method of MashMap described in Jain et al. (2018a).
We make two passes over the interval endpoints in the sorted list $L$. In the first pass of stage 1, the maximum intersection size is obtained. In the second pass, candidate mappings whose intersection is above the cutoff derived from the maximum intersection size are obtained. Consecutive candidate mappings are grouped into candidate regions and passed to stage 2.
4.2.2. Stage 2: Efficiently computing the rolling minhash
Given a candidate region , the goal of stage 2 is to calculate the minhash for all , pairs for . In order to track the minhash of and for each , MashMap2 previously used a sorted map to track all active seeds in each window. We improve upon this by observing that the minhash can be efficiently tracked using only , , and the number of minmers from in-between each consecutive pair of minmers from . To do so, MashMap3 uses an array where each represents one of the minmer hash values from in increasing order and for each , the values and are
if else 0
We can imagine as a set of buckets labeled by the corresponding hash values of and sorted in increasing order. At each position , each bucket holds and all reference minmers in , which are between and . A bucket is marked “good” if . It remains to find the largest integer such that the number of minmers in the first buckets is at most . Given , the numerator of the minhash formula, , is the number of “good” buckets in the first buckets.
For a candidate region , we initialize by inserting all of the minmers from the reference index whose intervals overlap with and set
It follows that
In order to keep track of intervals which overlap with the current position, we use a min-heap sorted on interval endpoints. We then continue to iterate through minmer intervals from the reference in order based on their start points, stopping once the intervals no longer overlap with . For each minmer interval starting at , we pop intervals from that end at or before . For each interval popped from , we update in time through a binary search, decrementing the corresponding and setting if the interval represents a shared minmer. The new interval is added in a similar manner and the necessary and values are updated. After is updated, is updated from by incrementing or decrementing until it is the maximal value such that . By keeping track of and the sums and , the new and corresponding sums are updated in constant time per window.
While the MashMap3 implementation of the second filtering stage still requires $O(\log s)$ time to update the minhash for each sliding window within the candidate region, it is significantly more efficient than MashMap2’s ordered map in practice because the array is a static data structure in contiguous memory that only requires updates to counters.
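For a single window, the numerator of the bottom-s estimator can also be computed directly by merging the two sorted sketches. The sketch below is a non-incremental reference computation, useful for checking an incremental implementation; it is not the counter array used by MashMap3, and the function name is illustrative.

```python
def minhash_numerator(query_sketch, ref_sketch, s):
    """Count shared hashes among the s smallest hashes of the union.

    Both inputs are sorted lists of distinct hash values (bottom-s sketches).
    """
    i = j = taken = shared = 0
    while taken < s and (i < len(query_sketch) or j < len(ref_sketch)):
        q = query_sketch[i] if i < len(query_sketch) else None
        r = ref_sketch[j] if j < len(ref_sketch) else None
        if r is None or (q is not None and q < r):
            i += 1               # next smallest hash occurs only in the query
        elif q is None or r < q:
            j += 1               # next smallest hash occurs only in the reference
        else:
            shared += 1          # hash present in both sketches
            i += 1
            j += 1
        taken += 1
    return shared


# Union order is 1, 2, 4, 6, 7, 9; the four smallest contain the shared hashes 4 and 6.
numerator = minhash_numerator([1, 4, 6, 9], [2, 4, 6, 7], 4)
```

Dividing the result by `s` gives the Jaccard estimate for that window.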
4.2.3. Early termination of stage 2
Instead of computing the stage 2 step for each candidate region obtained in the first stage, we aim to terminate the second stage once we have confidently identified all mappings whose predicted ANI is within $\Delta_{\mathrm{ANI}}$ of the best predicted ANI. We do this by sorting the candidate regions in decreasing order of their maximum interval overlap size obtained in stage 1. The stage 2 minhash calculation is then performed on each candidate region in order, keeping track of the best predicted ANI value seen. Consider the minhash numerator that corresponds to an ANI value $\Delta_{\mathrm{ANI}}$ less than the best predicted ANI value seen so far. Because the minhash numerator of a mapping can never exceed the overlap size of its region, once we reach a candidate region whose maximum overlap size is below this numerator, we know that no remaining candidate region can contain mappings whose predicted ANI is within $\Delta_{\mathrm{ANI}}$ of the predicted ANI of the best mapping.
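The early-termination loop can be sketched generically. In the illustration below, `stage2` and `min_numerator` are hypothetical callbacks: the first returns the best minhash numerator found in a region, and the second maps the best numerator seen so far to the smallest numerator that is still within the allowed ANI margin. Neither corresponds to a named MashMap3 function.

```python
def early_terminated_mappings(regions, stage2, min_numerator):
    """regions: list of (max_overlap, region). Returns [(region, numerator), ...]."""
    best = 0
    results = []
    # Visit regions from the largest stage-1 overlap to the smallest.
    for overlap, region in sorted(regions, key=lambda r: r[0], reverse=True):
        if overlap < min_numerator(best):
            # A numerator never exceeds the overlap size, and all later
            # regions have overlaps at most this large, so we can stop.
            break
        numerator = stage2(region)
        best = max(best, numerator)
        results.append((region, numerator))
    cutoff = min_numerator(best)
    return [(region, y) for region, y in results if y >= cutoff]
```

With overlaps 10, 7 and 3 and a margin of two shared hashes, the third region is never passed to `stage2` once the first region scores 9.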
5. Results
5.1. Characteristics of the minmer scheme
Here we provide formulas for the density of minmers and minmer intervals and an approximation for the distance between adjacent minmers. Proofs of the formulas are presented in the Supplementary Materials. We then compare these formulas to results on both simulated and empirical sequences. For the simulated dataset, we generated a sequence of 1 million uniform random hash values. For the empirical dataset, we used MurmurHash to hash the sequence of k-mers in the recently-completed human Y-chromosome Rhie et al. (2022) with .
5.1.1. Minmer density
To obtain the formula for the minmer density, we consider how the rank of a random k-mer changes with each consecutive window that contains it. As a result, we have a distribution of the rank of a random k-mer throughout consecutive sliding windows. This distribution enables us to not only obtain the density (Figure 2), but also determine other characteristics such as the likelihood of being a minmer given some initial rank or given its hash value.
Theorem 5.1.
Let be the expected density of -minmers in a random sequence. Then, where and where and .
5.1.2. Minmer interval density
Theorem 5.2.
Let $D^*_{(w,s)}$ be the density of $(w, s)$-minmer intervals in a random sequence. Then,
$$D^*_{(w,s)} = 1 - \frac{(w - s)(w - s + 1)}{w(w + 1)}.$$
As expected, letting $s = 1$ yields the same density as minimizers, $2/(w + 1)$, and a similar formula appears when determining the probability of observing a run of consecutive unsampled k-mers under the minimizer scheme Spouge (2022). As the number of minmers is a strict lower bound on the number of minmer intervals, this result also gives an upper bound on the density of $(w, s)$-minmers.
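A small exact check of the interval density is possible under the random-order, no-duplicate assumption. Each time the bottom-$s$ sketch changes as the window slides by one position, exactly one new minmer interval opens, so the interval density equals the probability of a change. The closed form $1 - \frac{(w-s)(w-s+1)}{w(w+1)}$ coded below reduces to $2/(w+1)$ at $s = 1$; the function names are illustrative, and the enumeration is only feasible for small $w$.

```python
from fractions import Fraction
from itertools import permutations


def interval_density(w, s):
    """Closed form for the expected density of (w, s)-minmer intervals."""
    return 1 - Fraction((w - s) * (w - s + 1), w * (w + 1))


def sketch_change_probability(w, s):
    """Exact probability that the bottom-s sketch changes when the window slides.

    Enumerates every relative ordering of the w + 1 k-mers spanned by two
    consecutive windows (positions 0..w-1 and 1..w).
    """
    changed = total = 0
    for ranks in permutations(range(w + 1)):
        before = set(sorted(range(w), key=ranks.__getitem__)[:s])
        after = set(sorted(range(1, w + 1), key=ranks.__getitem__)[:s])
        changed += before != after
        total += 1
    return Fraction(changed, total)
```

For every small parameter pair tried, the enumerated probability and the closed form agree exactly.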
5.1.3. Minmer window guarantee
As the main difference between minimizers and minmers is the window guarantee, it is important to observe the difference in the density of the minmer scheme compared to a minimizer scheme which also satisfies the $(w, s)$-window guarantee. In Figure 2, we consider the case where we have a $(w, s)$-minmer scheme and a $w'$-minimizer scheme, where $w'$ is set to obtain the same window guarantee as the minmer scheme by letting $w' = \lfloor w/s \rfloor$. We observe that for sketch sizes other than 1 and 1000, for which the densities of the schemes are equal, the density of the minmer scheme is strictly less than the density of the corresponding minimizer scheme. For some values of $s$, the density of the $w'$-minimizer scheme is over larger than that of the $(w, s)$-minmer scheme.
5.1.4. Minmer spread
Let be the distance between the selected minmer and the selected minmer. For a -minmer scheme with a density factor , we have that
To see how well this approximation holds, we plot the results on both empirical and simulated data in Supplemental Figure 2.
5.2. ANI prediction on ideal sequences
We replicated the experiments for Table 1 of Belbasi et al. (2022) using the minmer-based MashMap3 (commit 0b47608), with the exception that we report the mean predicted sequence divergence as opposed to the median. For each divergence rate $r$, 100 random windows of 10,000 base pairs were selected from the Escherichia coli genome and $10{,}000r$ positions were selected at random and mutated, ensuring that no duplicate k-mers were generated. The reads were mapped back to the reference E. coli genome and the predicted divergence was compared to the ground truth (Figure 3).
The parameters of the minmer-based MashMap3 were set to obtain a similar number of sampled k-mers as the minimizer-based MashMap2 under MashMap2’s default settings, resulting in a density of 0.009 for both tools. As expected, the results show that the ANI values predicted by the minmer scheme are significantly closer to the ground truth than those predicted by the minimizer scheme. Notably, in the case where the true divergence was 1%, the relative error is reduced from 29% to 2% (Figure 3).
5.3. ANI prediction on simulated reads
In addition to the ANI prediction measurements from Belbasi et al. (2022), we also simulated reads from the human T2T-CHM13 reference genome Nurk et al. (2022) at varying error rates to determine the accuracy of the ANI predictions. We compared the minmer-based MashMap3 against the minimizer-based MashMap2 with similar densities for each run as well as against Minimap2 Li (2018). Minimap2 was run in its default mode with -x map-ont set which, like MashMap, computes approximate mappings and estimates the alignment identity. MashMap2 was modified to use the binomial model for estimating the ANI from the Jaccard estimator which has been shown to be more accurate Belbasi et al. (2022).
We used Pbsim Ono et al. (2013) to simulate three datasets: “ONT-95”, “ONT-98”, and “ONT-99”, where the number following the dash represents the average ANI across reads. The standard deviation of the error rates was set to 0, and the ratio of substitutions, insertions, and deletions was set to 20:40:40, respectively, to ensure that mapped regions would, on average, be the same length as the reads. For each dataset, 5,000bp reads were generated with the CLR profile at a depth of 2, resulting in 1.25 million reads for each dataset. The mappings output by the different methods were parsed and the predicted ANI was compared to the gap-compressed ANI of the ground-truth mapping. The results of the simulations can be seen in Table 1.
Table 1: Metrics for simulated Nanopore read mapping to the human genome.
Dataset | Minimap2 CPU time (m) | Minimap2 Memory (GB) | Minimap2 ME | Minimap2 MAE | MashMap2 CPU time (m) | MashMap2 Memory (GB) | MashMap2 ME | MashMap2 MAE | MashMap3 CPU time (m) | MashMap3 Memory (GB) | MashMap3 ME | MashMap3 MAE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ONT-99 | 154.20 | 9.89 | −0.25 | 0.34 | 80.27 | 9.92 | −0.27 | 0.29 | 33.64 | 13.07 | 0.03 | 0.17 |
ONT-98 | 147.29 | 9.89 | −0.36 | 0.52 | 82.46 | 9.92 | −0.33 | 0.39 | 35.13 | 13.09 | 0.06 | 0.29 |
ONT-95 | 96.35 | 9.89 | −0.46 | 0.81 | 106.81 | 9.92 | −0.25 | 0.59 | 42.81 | 13.10 | 0.21 | 0.62 |
For MashMap2 and MashMap3, we used a k-mer size of 19 and set the MashMap2 minimizer to 89 and minmer to to obtain a density of 0.0222 for both tools. The ANI cutoff was set to 94%, 93%, and 90% for the ONT-99, ONT-98, and ONT-95 datasets, respectively. The indexing times for Minimap2, MashMap2, and MashMap3 were 1.7, 2.8, and 9.8 minutes, respectively.
5.4. ANI prediction on mammalian genome alignments
To test the performance of MashMap3 at the genome-mapping scale, we computed mappings between the T2T human reference genome and reference genomes for chimpanzee Kronenberg et al. (2018) and macaque Warren et al. (2020). In the absence of ground truth ANI values, we used wfmash Guarracino et al. (2021) to compute the gap-compressed ANI of the segment mappings output by MashMap and report the results of the mappings with ≥ 80% complexity in Table 2. For a small proportion of segment mappings output by MashMap2 and MashMap3, wfmash did not produce an alignment. When the ANI threshold was 85%, these cases accounted for 0.07% of chimpanzee mappings and 0.3% of macaque mappings. When the ANI threshold was 90% or 95%, less than 0.01% of mappings were not aligned with wfmash for both chimpanzee and macaque. We consider these mappings as false positives. For the ANI thresholds of 95%, 90%, and 85%, the winnowing scheme densities were set to 0.043, 0.053, and 0.064, respectively.
Table 2: Comparison of MashMap2 and MashMap3 for identifying mappings between pairs of mammalian genomes.
Query Species | ANI Threshold | MashMap2 Basepairs mapped (Gbp) | MashMap2 CPU time (m) | MashMap2 Memory (GB) | MashMap2 ME | MashMap2 MAE | MashMap3 Basepairs mapped (Gbp) | MashMap3 CPU time (m) | MashMap3 Memory (GB) | MashMap3 ME | MashMap3 MAE |
---|---|---|---|---|---|---|---|---|---|---|---|
Chimpanzee | 95% | 2.80 | 39.76 | 19.95 | −0.25 | 0.29 | 2.81 | 32.76 | 27.07 | 0.01 | 0.22 |
Chimpanzee | 90% | 2.82 | 118.31 | 24.55 | −0.22 | 0.29 | 2.82 | 51.12 | 36.20 | 0.01 | 0.25 |
Chimpanzee | 85% | 2.83 | 787.44 | 44.96 | −0.18 | 0.27 | 2.83 | 64.48 | 39.47 | 0.02 | 0.25 |
Macaque | 95% | 0.38 | 30.0 | 20.83 | 0.29 * | 0.46 | 1.08 | 28.67 | 28.97 | 0.57* | 0.66 |
Macaque | 90% | 2.54 | 40.49 | 23.04 | −0.30 | 0.69 | 2.56 | 34.87 | 35.91 | 0.01 | 0.74 |
Macaque | 85% | 2.60 | 446.71 | 38.13 | −0.24 | 0.74 | 2.61 | 43.74 | 39.49 | 0.05 | 0.87 |
* Sampling bias leads to ANI overestimation. See the Discussion for details.
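The ME and MAE columns of Table 2 summarize the signed and absolute differences between predicted and alignment-derived ANI. A minimal sketch of how such metrics can be computed is given below. It assumes the sign convention error = predicted − reference, so that a negative ME indicates underprediction; the function names and the example values are hypothetical and are not taken from the experiments.

```python
from statistics import fmean


def mean_error(predicted, reference):
    """Signed mean error; negative values indicate underprediction (assumed convention)."""
    return fmean([p - r for p, r in zip(predicted, reference)])


def mean_absolute_error(predicted, reference):
    """Mean of the absolute differences between predicted and reference values."""
    return fmean([abs(p - r) for p, r in zip(predicted, reference)])


# Hypothetical ANI values (in percent) for three segment mappings.
predicted_ani = [95.2, 93.1, 96.0]
aligned_ani = [95.0, 93.5, 95.9]

print(mean_error(predicted_ani, aligned_ani))
print(mean_absolute_error(predicted_ani, aligned_ani))
```

Errors of opposite sign cancel in the ME but not in the MAE, which is why an estimator can have an ME near zero while its MAE remains clearly positive.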
To isolate the effect of the new seeding method, we turned chaining off for both tools. As the Jaccard estimator is known to perform poorly in the presence of many degenerate k-mers, results for query regions above and below 80% complexity are reported separately, where complexity is defined as the ratio of observed distinct k-mers in a region to . Low-complexity mappings account for at most 1% and 3% of the mappings for the chimpanzee and macaque genomes, respectively. Metrics for the low-complexity mappings are given in Supplementary Table 1.
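The complexity measure can be illustrated with a short sketch. It assumes that the denominator of the ratio is the total number of k-mers in the region, which is an assumption made here for illustration rather than a definition taken from the text; it also ignores canonical (strand-independent) k-mers. The function name is hypothetical.

```python
def kmer_complexity(region: str, k: int) -> float:
    """Fraction of k-mers in a region that are distinct.

    Assumption: the denominator is the total number of k-mers in the region.
    """
    n_kmers = len(region) - k + 1
    if n_kmers <= 0:
        raise ValueError("region is shorter than k")
    distinct = {region[i:i + k] for i in range(n_kmers)}
    return len(distinct) / n_kmers


print(kmer_complexity("ACACACACAC", 3))  # prints 0.25 (2 distinct of 8 k-mers)
print(kmer_complexity("ACGTTGCA", 3))    # prints 1.0 (all 6 k-mers distinct)
```

Under this assumption, a tandem repeat falls far below an 80% cutoff, whereas a region without repeated k-mers has a complexity of 1.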
6. Discussion
Minmers are a novel “non-forward” winnowing scheme with a -window guarantee. Similar to what has been done for other proposed schemes, we have derived formulas (approximate and exact) that describe the scheme’s characteristics. We have replaced minimizers with minmers in MashMap3 and demonstrated that minmers eliminate Jaccard estimator bias and enable new methods to reduce mapping runtime compared to MashMap2. In addition, we show that minmers require substantially less density than minimizers when a -window guarantee is required.
The minmer scheme enables sparser sketches
The minimizer winnowing scheme has long been the dominant method for winnowing due to its -window guarantee, simplicity, and performance. Other 1-local methods such as strobemers Sahlin (2021) and syncmers Edgar (2021) remove the window guarantee and rely on a random sequence assumption to provide probabilistic bounds on the expected distance between sampled k-mers.
Minmers represent a novel class of winnowing schemes that extend the window guarantee of minimizers. Unlike strobemers, syncmers, and other 1-local methods, the minmer scheme guarantees that the desired number of k-mers will be sampled from every window, so long as the window contains at least that many distinct k-mers. This is particularly desirable for accurate Jaccard estimation and for winnowing low-complexity sequence, where the density of k-mers sampled by 1-local schemes can vary significantly.
Minmers yield an unbiased estimator at lower computational costs
Indexing minmers rather than minimizers removes the Jaccard estimator bias present in earlier versions of MashMap. For any window, the set of sampled k-mers is guaranteed to be a superset of the bottom-s sketch of that window. Therefore, running the minhash algorithm on the minmers yields the same estimator as running it on the full set of k-mers.
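The superset argument can be checked on a toy example. The sketch below is a naive illustration, not the rolling implementation used in MashMap3: it samples minmers as the union of the bottom-s sketches of all windows, treats k-mers as hash values rather than positions, and uses hypothetical function names and toy parameters (k = 15, w = 40, s = 4). It then verifies that the bottom-s Jaccard estimate computed from the sampled k-mers in a pair of windows equals the estimate computed from all k-mers in those windows.

```python
import hashlib
import random


def kmer_hashes(seq, k):
    """Hash each k-mer of seq to a 64-bit integer, in sequence order."""
    return [
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    ]


def bottom_s(hashes, s):
    """Return the s smallest distinct hash values as a set."""
    return set(sorted(set(hashes))[:s])


def minmers(hashes, w, s):
    """Naive minmer sampling: union of the bottom-s sketches of every window."""
    sampled = set()
    for start in range(len(hashes) - w + 1):
        sampled |= bottom_s(hashes[start:start + w], s)
    return sampled


def jaccard_estimate(hashes_a, hashes_b, s):
    """Bottom-s minhash estimate of the Jaccard similarity of two hash sets."""
    sketch_a, sketch_b = bottom_s(hashes_a, s), bottom_s(hashes_b, s)
    union_sketch = bottom_s(sketch_a | sketch_b, s)
    return len(union_sketch & sketch_a & sketch_b) / len(union_sketch)


rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(400))
# Query: each base of the reference is redrawn at random with probability 0.05.
qry = "".join(rng.choice("ACGT") if rng.random() < 0.05 else c for c in ref)
k, w, s = 15, 40, 4
ref_h, qry_h = kmer_hashes(ref, k), kmer_hashes(qry, k)
ref_m, qry_m = minmers(ref_h, w, s), minmers(qry_h, w, s)

same = True
for i in range(len(ref_h) - w + 1):
    ref_win, qry_win = ref_h[i:i + w], qry_h[i:i + w]
    full = jaccard_estimate(ref_win, qry_win, s)
    sparse = jaccard_estimate(
        [h for h in ref_win if h in ref_m], [h for h in qry_win if h in qry_m], s
    )
    same = same and full == sparse
print(same)  # prints True
```

The equality holds for every window because the estimator only ever inspects the bottom-s sketch of its inputs, and that sketch is preserved by the sampling.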
In addition to the experiments from Belbasi et al. (2022), which focus on “ideal” sequences with no repetitive -mers, we also measured the performance of the ANI prediction for different levels of divergence on the human genome across mappings of simulated reads and a sample of mammalian genomes. Our results showed that MashMap3 with minmers not only produced unbiased and more accurate predictions of the ANI than Minimap2 and MashMap2, but it did so in a fraction of the time.
Across all experiments, we reproduced the tendency of minimizers to underpredict ANI reported in Belbasi et al. (2022). At the same time, in both the simulated reads and empirical genome alignment results, we see that MashMap3 slightly over-predicts the ANI at larger divergences. Further inspection reveals that this is due to indels in the alignment, which are not modeled by the binomial mutation model used to convert the Jaccard to ANI (Supplementary Table 2).
The optimizations to the second stage of mapping, combined with the minmer interval indexing, lead to significantly better mapping speeds in MashMap3. Relative to Minimap2 and MashMap2, MashMap3 spends a significant amount of time indexing the genome. This, however, serves as an investment for the mapping phase, which is significantly faster than that of MashMap2, particularly at lower ANI thresholds. As an additional feature, MashMap3 provides the option to save the reference index so that users can leverage the increased mapping speeds for previously indexed genomes.
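The conversion from Jaccard similarity to ANI under a binomial mutation model can be sketched as follows. The sketch assumes a commonly used form of the model, in which each of the k bases of a k-mer is unchanged independently with probability equal to the ANI (substitutions only, no indels), both sequences contain the same number of k-mers, and there are no repeated or spurious matches. These assumptions and the function names are introduced here for illustration and may differ from the exact formula used in MashMap3.

```python
def ani_to_jaccard(ani: float, k: int) -> float:
    """Expected Jaccard similarity for a given ANI under the assumed binomial model."""
    q = ani ** k          # probability that a k-mer contains no substitution
    return q / (2.0 - q)  # shared k-mers / (total k-mers in the union)


def jaccard_to_ani(jaccard: float, k: int) -> float:
    """Invert the model: J = q / (2 - q) gives q = 2J / (1 + J), and ANI = q^(1/k)."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    return (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / k)


j = ani_to_jaccard(0.95, 19)
print(j, jaccard_to_ani(j, 19))
```

Because the model accounts only for substitutions, a k-mer lost to an indel is treated as if it had been lost to a mismatch, so the converted value need not agree with a gap-compressed ANI computed from an alignment.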
Similar to MashMap2, MashMap3 by default uses the plane-sweep post-processing algorithm described in Jain et al. (2018a) to filter out redundant segment mappings. We show that by using the probabilistic filtering method described in Section 4.2.1, we can discard many of these mappings at the beginning of the process as opposed to the end, yielding significant runtime improvements.
MashMap3 is significantly more efficient at lower ANI thresholds, which is helpful for detecting more distant homologies. For example, in our human-chimpanzee mapping, we recovered an additional 50 Mbp of mapped sequence by reducing the ANI threshold from 95% to 85% while also completing over 10 times faster than MashMap2. It is also worth noting that the default ANI threshold of MashMap2 and MashMap3 is 85%, and often the ANI of homologies between genomes is not known a priori.
Further motivating the improved efficiency at low ANI thresholds is the fact that thresholds above the true ANI can lead to recovering mappings that over-predict the ANI while discarding those that accurately predict or underpredict it. This sampling bias leads to an increase in the ANI estimation bias. We see this behavior in the human-macaque alignment with a 95% ANI threshold (Table 2). At lower ANI thresholds, we observe that the majority of mappings are in the 90%–95% ANI range.
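The sampling bias described above can be reproduced with a toy simulation. The sketch below assumes, purely for illustration, that every mapping has the same true ANI of 93% and that the estimator is unbiased with Gaussian noise of 1.5 percentage points; none of these values comes from the experiments, and the function name is hypothetical.

```python
import random
from statistics import fmean


def thresholded_mean_error(true_ani, noise_sd, threshold, n=100_000, seed=0):
    """Mean error of unbiased, noisy ANI estimates after discarding those below a threshold.

    Returns (mean error of the kept estimates, fraction of estimates kept).
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_ani, noise_sd) for _ in range(n)]
    kept = [e for e in estimates if e >= threshold]
    if not kept:
        raise ValueError("no estimate passed the threshold")
    return fmean([e - true_ani for e in kept]), len(kept) / n


# Threshold above the true ANI: only over-predicting estimates survive.
strict_error, strict_kept = thresholded_mean_error(93.0, 1.5, threshold=95.0)
# Threshold well below the true ANI: essentially all estimates survive.
loose_error, loose_kept = thresholded_mean_error(93.0, 1.5, threshold=85.0)

print(strict_error, strict_kept)
print(loose_error, loose_kept)
```

With the strict threshold, every surviving estimate exceeds the true value by at least two percentage points, so the mean error is positive even though the underlying estimator is unbiased; with the loose threshold, the mean error stays close to zero.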
Limitations and future directions
MashMap’s Jaccard-based similarity method tends to overestimate ANI in low-complexity sequences. For downstream alignment applications, the resulting false-positive mappings can be pruned using a chaining or exact alignment algorithm to validate the mappings. Unreliable ANI estimates could also be flagged by using the bottom-s sketch to determine the complexity of a segment as described in Cohen and Kaplan (2007), but a sketching method and distance metric that better approximates ANI across all sequence and mutational contexts would be desirable.
An important characteristic of MashMap is the relatively few parameter settings necessary to tune across different use cases. Building on this, we aim to develop a methodology that can find maximal homologies without a pre-determined segment size, similar to the approach of Wang et al. (2022b).
7. Conclusion
In this work, we proposed and studied the characteristics of the minmer scheme and showed that it belongs to the unexplored class of non-forward local schemes, which have the potential to achieve lower densities under the same locality constraints as forward schemes Marçais et al. (2018). We derived formulas for the density and approximate spread of minmers, enabling them to be objectively compared to other winnowing schemes.
By construction, minmers, unlike minimizers, enable unbiased estimation of the Jaccard similarity. We replaced the minimizer winnowing scheme in MashMap2 with minmers and showed that minmers significantly reduce the bias in both simulated and empirical datasets.
Through leveraging the properties of the minmers, we implemented a number of algorithmic improvements in MashMap3. In our experiments, these improvements yielded significantly lower runtimes, particularly in the case when the ANI threshold of MashMap is set to the default of 85%. With the improvements in MashMap3, it is no longer necessary to estimate the ANI of homologies a priori to avoid significantly longer runtimes, making it an ideal candidate for a broad range of comparative genomics applications.
Supplementary Material
Acknowledgements
We would like to thank Chirag Jain for helpful discussions and his implementation of the original MashMap software, as well as Andrea Guarracino for improvements and discussions. We would also like to thank Nicolae Sapoval and Fritz Sedlazeck for their feedback on the proofs and primate alignments, respectively.
Funding
B.K. was supported by the NLM Training Program in Biomedical Informatics and Data Science (Grant: T15LM007093). A.M.P. was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. B.K. and T.T. were supported in part by the National Institute of Allergy and Infectious Diseases (Grant# P01-AI152999). T.T. was supported in part by National Science Foundation grant EF-2126387. E.G. was supported by National Institutes of Health/NIDA U01DA047638, National Institutes of Health/NIGMS R01GM123489, NSF PPoSS Award #2118709, and the Tennessee Governor’s Chairs program.
Availability:
MashMap3 is available at https://github.com/marbl/MashMap
References
- Baker D. N. and Langmead B. (2019). Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biology, 20(1), 1–12.
- Belbasi M. et al. (2022). The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics.
- Berlin K. et al. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, 33(6), 623–630.
- Blanca A. et al. (2022). The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches. Journal of Computational Biology, 29(2), 155–168.
- Broder A. Z. (1997). On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE.
- Brown C. T. and Irber L. (2016). sourmash: a library for MinHash sketching of DNA. Journal of Open Source Software, 1(5), 27.
- Cohen E. (2016). Min-hash sketches.
- Cohen E. and Kaplan H. (2007). Summarizing data using bottom-k sketches. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 225–234.
- Dilthey A. T. et al. (2019). Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nature Communications, 10(1), 1–12.
- Edgar R. (2021). Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 9, e10805.
- Ekim B. et al. (2020). A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In International Conference on Research in Computational Molecular Biology, pages 37–53. Springer.
- Guarracino A. et al. (2021). wfmash: a pangenome-scale aligner.
- Jain C. et al. (2017). A fast approximate algorithm for mapping long reads to large reference databases. In International Conference on Research in Computational Molecular Biology, pages 66–81. Springer.
- Jain C. et al. (2018a). A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics, 34(17), i748–i756.
- Jain C. et al. (2018b). High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9(1), 1–8.
- Kronenberg Z. N. et al. (2018). High-resolution comparative analysis of great ape genomes. Science, 360(6393), eaar6343.
- Li H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100.
- Marçais G. et al. (2017). Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33(14), i110–i117.
- Marçais G. et al. (2018). Asymptotically optimal minimizers schemes. Bioinformatics, 34(13), i13–i22.
- Marco-Sola S. et al. (2021). Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, 37(4), 456–463.
- Nurk S. et al. (2022). The complete sequence of a human genome. Science, 376(6588), 44–53.
- Ondov B. D. et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 1–14.
- Ondov B. D. et al. (2019). Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biology, 20(1), 1–13.
- Ono Y. et al. (2013). PBSIM: PacBio reads simulator-toward accurate genome assembly. Bioinformatics, 29(1), 119–121.
- Popic V. and Batzoglou S. (2017). A hybrid cloud read aligner based on MinHash and k-mer voting that preserves privacy. Nature Communications, 8(1), 1–7.
- Rhie A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv.
- Roberts M. et al. (2004). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18), 3363–3369.
- Sahlin K. (2021). Effective sequence similarity detection with strobemers. Genome Research, 31(11), 2080–2094.
- Schleimer S. et al. (2003). Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 76–85.
- Shaw J. and Yu Y. W. (2022). Theory of local k-mer selection with applications to long-read alignment. Bioinformatics, 38(20), 4659–4669.
- Shaw J. and Yu Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. bioRxiv, pages 2023–01.
- Spouge J. L. (2022). A closed formula relevant to ‘Theory of local k-mer selection with applications to long-read alignment’ by Jim Shaw and Yun William Yu. Bioinformatics, 38(20), 4848–4849.
- Wang T. et al. (2022a). The Human Pangenome Project: a global resource to map genomic diversity. Nature, 604(7906), 437–446.
- Wang Z. et al. (2022b). TxtAlign: efficient near-duplicate text alignment search via bottom-k sketches for plagiarism detection. In Proceedings of the 2022 International Conference on Management of Data, pages 1146–1159.
- Warren W. C. et al. (2020). Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility. Science, 370(6523), eabc6617.
- Zheng H. et al. (2020). Improved design and analysis of practical minimizers. Bioinformatics, 36(Supplement_1), i119–i127.