Briefings in Bioinformatics. 2025 Jun 12;26(3):bbaf267. doi: 10.1093/bib/bbaf267

EvANI benchmarking workflow for evolutionary distance estimation

Sina Majidian 1, Stephen Hwang 2, Mohsen Zakeri 3, Ben Langmead 4
PMCID: PMC12159288  PMID: 40501070

Abstract

Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a metric for estimating the genetic similarity between two genomes, usually calculated as the mean identity of their shared genomic regions. These regions are typically found with genome aligners like Basic Local Alignment Search Tool (BLAST) or MUMmer. ANI has been applied to species delineation, building guide trees, and searching large sequence databases. Since computing ANI via genome alignment is computationally expensive, the field has increasingly turned to sketch-based approaches that use assumptions and heuristics to speed this up. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics impact distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer-based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of k, e.g. k = 10 and k = 19 for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.

Keywords: genome, average nucleotide identity, evolution, BLAST, k-mer, sketching

Introduction

The availability of assembled genomes of eukaryotes and prokaryotes has surged [1, 2], owing to advances in long-read sequencing technology [3] and assembly algorithms [4]. This new abundance of high-quality assemblies creates an ideal setting to reexamine the definitions and algorithms we use to study sequence similarity and evolution.

Robust metrics and models are needed to accurately quantify genome comparisons for different evolutionary rates across genomic regions and species. A basic model of DNA sequence evolution is Jukes-Cantor, which assumes equal frequency of each base (or amino acid) and equiprobable substitutions [5]. More flexible models have also been suggested (e.g. K80, F81, general time-reversible (GTR), point accepted mutation (PAM), blocks substitution matrix (BLOSUM)) and applied [6]. These more complex models use scoring functions, assigning higher scores to substitutions that occur more frequently. In this way, alignments are driven toward identifying homologous segments descended from a common ancestor [7].

Average nucleotide identity (ANI) is a widely used measure for genomic similarity [8–11]. ANI was originally proposed as a measure for delineating species and strains, specifically as an alternative to the labor-intensive DNA–DNA hybridization (DDH) technique [12, 13]. ANI has proven useful in other applications including building phylogeny [14] and guide trees [13, 15], improving the NCBI (National Center for Biotechnology Information) taxonomy [16], studying microbial presence in a metagenomic sample [17], analyzing genomic repeats [18] and searching large databases [19].

Computing ANI is neither conceptually nor computationally trivial, despite the development of several ANI tools. What ANI truly represents is difficult to define; its definition has evolved over the years and holds different meanings for different subfields. Further, how ANI is computed in a given software tool is contingent on its scoring functions and heuristics.

In this review, we explore the state of the art in estimating sequence similarity and evolutionary distance between genome sequences. We propose an accuracy measure based on rank correlation with tree distance. Using this measure, we evaluate various ways that researchers have defined ANI. We then compare and contrast current approaches, highlighting methods on either end of the efficiency-versus-accuracy spectrum. Finally, we suggest how future methods could achieve even more favorable combinations of speed and accuracy. Our EvANI evaluation framework is available as an open-source software tool at https://github.com/sinamajidian/EvANI.

Definition of ANI

The earliest uses of the term ANI date to the 1990s, when RNA polymerase gene sequences of Cyanobacteria [20, 21] were compared using the CLUSTAL multiple sequence aligner [22]. In 2005, a formal definition was introduced, with the aim of capturing similarity as measured using the molecular technique of DDH [12]. In that work, predicted protein-coding sequences (CDSs) from one genome (query) were searched against the genomic sequence of another genome (reference), and ANI was calculated based on conserved genes identified by BLAST (Basic Local Alignment Search Tool). The conserved genes were defined as BLAST matches with more than 60% overall sequence identity over an alignable region covering at least 70% of their nucleotide length in the reference.

Later, the meaning of ANI was widened to cover comparisons of whole-genome sequences [23]. To achieve this, the sequence of the query genome was divided into consecutive 1020-base fragments. The choice of 1020 bases was due to the size of the DNA fragments used in DDH experiments. In silico, these 1020-base fragments were compared to the entire genomic sequence of the other genome (reference) in the pair using BLAST. The ANI between the query and the reference was calculated as the average sequence identity of all BLAST matches with greater than 30% sequence identity over alignable regions covering more than 70% of their length. They showed that a 95% ANI is equivalent to a DDH value of 70%, the threshold used for species delineation. In their study, they also performed a reverse search in which the reference genome is used as the query to provide reciprocal values. Their results show a small difference (less than 0.1%) between the two reciprocal ANI values for each pair.
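The fragment-based definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hit` record is a hypothetical container (not a BLAST output format), the trailing short fragment is kept rather than discarded, and the thresholds are the ones quoted in the text (identity greater than 30%, alignable region covering more than 70% of the fragment).

```python
from dataclasses import dataclass
from statistics import mean

FRAGMENT_LEN = 1020  # fragment size quoted in the text


@dataclass
class Hit:
    """Hypothetical record for the best match of one query fragment."""
    fragment_len: int   # length of the query fragment
    aligned_len: int    # length of the alignable region
    identity: float     # percent identity of the match


def split_into_fragments(genome: str, size: int = FRAGMENT_LEN) -> list[str]:
    """Cut the query genome into consecutive fragments (last one may be shorter)."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def anib_like(hits, min_identity: float = 30.0, min_coverage: float = 0.70):
    """Average identity over the hits that pass both thresholds."""
    kept = [
        h.identity
        for h in hits
        if h.identity > min_identity
        and h.aligned_len / h.fragment_len > min_coverage
    ]
    return mean(kept) if kept else None


example_hits = [
    Hit(1020, 1000, 98.0),  # kept
    Hit(1020, 900, 96.0),   # kept
    Hit(1020, 500, 99.0),   # dropped: covers less than 70% of the fragment
    Hit(1020, 1000, 25.0),  # dropped: identity not above 30%
]
estimate = anib_like(example_hits)
```

Note that the third hit has the highest identity but is excluded by the coverage filter, which is one way the choice of thresholds shapes the final value.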

To improve computational efficiency, the JSpecies tool [24] used NUCmer (NUCleotide MUMmer) [25, 26] as a replacement for BLAST in order to compute the whole genome alignment (Fig. 1A). In Section 2.2, we will describe these in more detail. Recent methods define ANI simply based on a whole-genome alignment. Once a whole-genome alignment has been performed, ANI is computed as the fraction of aligned positions that are matches.
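As a minimal sketch of the last step (ANI as the fraction of aligned positions that are matches), consider two sequences that have already been aligned and padded with gap characters. The function name `alignment_identity` and the decision to skip gap columns entirely are assumptions made for illustration; individual tools may treat gaps differently.

```python
def alignment_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of aligned (non-gap) columns in which the two bases are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    aligned = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap or base_b == gap:
            continue  # assumption: gap columns are not counted as aligned positions
        aligned += 1
        if base_a == base_b:
            matches += 1
    if aligned == 0:
        raise ValueError("no aligned positions")
    return matches / aligned


# Nine gap-free columns, eight of which match.
example = alignment_identity("ACGT-ACGTA", "ACGTTACCTA")
```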

Figure 1.

Alt text: A figure presenting an overview of the ANI definition and the EvANI benchmarking pipeline.

(A) ANI quantifies the similarity between two genomes. ANI can be defined as the number of aligned positions where the two aligned bases are identical, divided by the total number of aligned bases. Historically, ANI was calculated using a single gene family for multiple sequence alignment. Another approach finds orthologous genes between two genomes and reports the average similarity between their CDSs. This method was later extended to whole-genome alignment by identifying local alignments and excluding supplementary alignments with lower similarity. (B) Different ANI tools employ various approaches in calculating ANI values. ANIm, OrthoANI, and FastANI use aligners to identify homologous regions, whereas Mash uses k-mer hashing to estimate similarities. Only alignments with higher similarity represented by green arrows are included in ANI calculations, while red arrows, corresponding to paralogs, are excluded. (C) The proposed benchmarking method evaluates the performance of different tools using both real and simulated data. It assumes that more distantly related species on the phylogenetic tree should have lower ANI similarities. This is measured by calculating the statistics of Spearman rank correlation. We expect a negative correlation between ANI and the tree distance (scatter plot on the right).

While each of these definitions of ANI had practical motivations and interpretations, they did not yield the same definition. One definition is more concerned with how genes co-occur, while the other is more concerned with similarity of substrings that may or may not overlap genes. We also highlight some additional ambiguities in the definition and goal of ANI. First, some methods for computing ANI have the goal of computing average base-level identity over “alignable” regions only [23, 24, 27]. In these cases, the portions of one genome that fail to align onto a counterpart in the other genome are totally excluded from the computation; i.e. they are excluded from both the numerator and the denominator of the fraction computed. Whether this is a reasonable strategy depends on the scenario, as we will detail later. If the genomes being compared are distant, this requirement can result in an ANI estimate of zero or a value that is very close to zero. For instance, consider two scenarios: (i) two genomes with 90% ANI similarity over a 70% aligned region, and (ii) two other genomes with 85% ANI similarity over a 100% aligned region. It is not clear which scenario represents the smaller evolutionary distance, even though the aligned region exceeds the threshold (70%). Furthermore, the threshold is often chosen arbitrarily, which adds to the uncertainty.
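The ambiguity in the two scenarios can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every unaligned base is counted as a mismatch; this is one possible convention, not one prescribed by any of the cited tools. Under it, the ordering of the two scenarios is the reverse of the ordering given by ANI alone.

```python
def genome_wide_identity(ani: float, aligned_fraction: float) -> float:
    """Identity over the whole genome if unaligned bases count as mismatches.

    This penalty for unaligned regions is an assumption for illustration.
    """
    return ani * aligned_fraction


scenario_i = genome_wide_identity(0.90, 0.70)   # 90% ANI over 70% of the genome
scenario_ii = genome_wide_identity(0.85, 1.00)  # 85% ANI over 100% of the genome

# ANI alone ranks scenario (i) as closer (0.90 > 0.85), whereas the
# penalised value ranks scenario (ii) as closer.
ani_prefers_i = 0.90 > 0.85
penalised_prefers_ii = scenario_ii > scenario_i
```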

One challenge in the definition of ANI lies in using a notion of conserved regions. The concept attempts to infer conservation through pairwise comparisons. However, the concept of conservation needs to be evaluated across several genomes in a pangenomics context. In other words, when identifying conserved regions—commonly referred to as core genomes—a pairwise approach is insufficient. Instead, comparisons across multiple genomes might be necessary, typically achieved through multiple genome alignment [28].

The primary approach intended in the ANI calculation is to capture regions of common ancestry using orthology [27]. When a distance metric aims to reflect a species tree, only orthologous regions should be taken into account. The rationale for including only orthologous regions is to avoid confounding effects from other evolutionary events, such as duplications. Consequently, duplicated regions should be excluded from the distance calculation to better conform to distances on the species tree. However, in practice, this goal is often not fully achieved, as identifying orthologous regions is challenging. Several methods for orthology inference have been proposed [29, 30] but they focus only on CDS regions. ANI tools often use reciprocal best hits as a proxy for finding orthologous regions. However, due to varying evolutionary rates across different genomic regions, the reciprocal best hits serve as a poor approximation [31]. It is also crucial to distinguish duplications that happened before or after speciation events, as this distinction impacts the orthology assignment.

Another limitation arises from using non-overlapping 1 kb segments as the unit of comparison in the proposed definition. This approach interferes with finding the true borders of orthologs (or homologs). The proposed ANI definitions rely on alignment tools such as BLAST or MUMmer. Thus, the calculation is at the mercy of the heuristics used in the alignment process. BLAST, for example, employs a dynamic programming algorithm (i.e. local or global alignment), while MUMmer uses its own set of heuristics.

One important challenge is the prevalence of lateral gene transfer (LGT), a.k.a., horizontal gene transfer (HGT). This refers to the movement of genetic material between organisms that are not direct (vertical) descendants of each other. It could be argued that HGT has limited impact if the analysis is performed genome-wide, where ANI is calculated by averaging across all regions of similarity.

In conclusion, while the ANI concept is practically useful, it lacks a single, universally used definition. ANI’s reliance on alignable regions and pairwise comparisons introduces inaccuracies, particularly for distant genomes, as it excludes unaligned genomic portions and struggles with identifying conserved orthologous regions. Additionally, challenges arise from the use of fixed-width (e.g. 1 kb) segments, from differing alignment heuristics, and from the impact of phenomena that have a more dramatic impact on both sequence similarity and tree shape, such as LGT.

Evaluation of ANI tools

We divide ANI estimation tools into two categories: alignment-based and k-mer-based approaches. Alignment-based tools include OrthoANI, digital DDH, JSpecies, ANIb, and ANIm, with PyANI serving as a Python wrapper for the last two. K-mer-based approaches include those that use the Jaccard coefficient, such as Mash and Dashing (Table 1).

Table 1.

A summary of the two broad categories of ANI tools including k-mer-based and alignment-based is provided along with their underlying software, links, and references (see Fig. 1 for a visualization of how tools operate)

| Tools | Unit (default) | Alignment | Reciprocal | Underlying software | Software link | Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Mash | k-mer (k = 21) | No | No | MinHash | github.com/marbl/Mash | [32] |
| Dashing | k-mer (k = 31) | No | No | . | github.com/dnbaker/dashing | [33] |
| ANIb | segment (1020 bp) | S2G | No | BLAST | N/A | [23] |
| ANIm | genome | G2G | No | MUMmer | N/A | [24] |
| PyANI | * | * | * | BLAST/MUMmer | github.com/widdowquinn/pyani | [34] |
| JSpecies | genome | G2G | No | BLAST/MUMmer | jspecies.ribohost.com | [24, 35] |
| Digital DDH | genome | G2G | . | BLAST | ggdc.dsmz.de/ggdc.php | [36] |
| OrthoANI | segment (1020 bp) | S2S | Yes | BLAST | help.ezbiocloud.net | [27] |
| FastANI | segment (3000 bp) | S2G | Yes | MashMap | github.com/ParBLiSS/FastANI | [10] |

Note that MUMmer refers specifically to its alignment subprogram, NUCmer. G2G: genome to genome, S2G: segment to genome, S2S: segment to segment. *:PyANI is a wrapper to ANIb/ANIm. Python wrappers for FastANI and OrthoANI also exist [37].

The alignment approach

Early methods for ANI estimation focused on aligning genomes to each other. First, alignable portions of the two genomes were found with BLAST as a tool for homology discovery. This was achieved by extracting consecutive blocks of 1020 bases of one genome and then aligning them to the other [23]. This was the approach taken for the method called “ANIb” [23, 24], which is implemented in the PyANI package [34].

A similar approach, called OrthoANI [27], divides both genomes into segments of 1020 bases. These segments are then aligned to each other to identify reciprocal best hits. ANI is calculated considering only the segments for which a reciprocal best hit was found. There are two notable differences between OrthoANI and ANIb (Fig. 1B). First, only one genome is divided into windows in the ANIb method, whereas both genomes are divided into windows in the OrthoANI method. Second, in the ANIb analysis, the order of query and reference was swapped to identify the conserved regions, whereas OrthoANI implements a reciprocal best hit calculation to find orthologous regions. The challenge with this approach lies in the arbitrary definition of homology boundaries (with blocks of 1020 bases), which may fail to reflect the true evolutionary history. The BLAST options used include -dust=no, -xdrop_gap=150, -penalty=-1, -reward=1 and -evalue=1e-15. However, the rationale behind the choice of these scoring parameters and thresholds remains unclear.
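The reciprocal-best-hit step can be sketched as follows. The segment identifiers and the score dictionaries are hypothetical inputs (for example, alignment scores keyed by query and target segment), and ties are resolved by keeping the first hit seen, which is an assumption rather than documented OrthoANI behaviour.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query segment to its highest-scoring target segment."""
    best: dict[str, tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: a2 prefers b1, but b1 prefers a1, so a2 is dropped.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 400, ("a2", "b1"): 850, ("a3", "b3"): 700}
b_vs_a = {("b1", "a1"): 900, ("b1", "a2"): 850, ("b2", "a1"): 400, ("b3", "a3"): 700}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Only the segments appearing in `pairs` would contribute to the average identity; `b2` is also left out because its best hit `a1` is already paired with `b1`.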

Digital DDH is a web service that computes various distances based on local alignments (a.k.a., high-scoring segment pairs) identified with the BLAST aligner [36] (see Table 1). One distance measure is defined as 1 minus a ratio known as the alignment fraction (AF): the numerator is the total length of all local alignments, including both reference-versus-query and query-versus-reference alignments, and the denominator is the sum of both genomes’ lengths. Another distance measure is similar but modifies the numerator using the sum of identical base pairs. Besides these formulas, sampling schemes based on bootstrapping and jackknifing are used to calculate confidence intervals. Notably, the outputs of these two distance formulas can vary significantly. These distance values are converted to DDH estimates using generalized linear models fitted to empirical reference datasets. For a test case of a simulated genome pair, the DDH estimates were reported as 100%, 71.3%, and 98.7%, corresponding to alignment length divided by total length (AF), identities divided by alignment length, and identities divided by total length, respectively. The authors recommend using the middle metric, which reports the lower similarity in this case. They argue that since the other two formulas use genome length in their denominator, they can be inaccurate when the assemblies being compared are incomplete.
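The three ratios mentioned above can be written out as distances. The sketch below follows the verbal description only; the function name, the dictionary keys, and the assumption that alignment lengths and identities are already summed over both search directions are ours, and the exact formulas used by the web service may differ in detail.

```python
def ddh_like_distances(hsp_len_total: int, identities_total: int,
                       genome_len_a: int, genome_len_b: int) -> dict[str, float]:
    """Three distances of the form 1 - ratio, as described in the text.

    hsp_len_total and identities_total are assumed to be summed over both
    search directions (reference vs query and query vs reference).
    """
    total_len = genome_len_a + genome_len_b
    return {
        "alignment_len/total_len": 1 - hsp_len_total / total_len,       # 1 - AF
        "identities/alignment_len": 1 - identities_total / hsp_len_total,
        "identities/total_len": 1 - identities_total / total_len,
    }


# Hypothetical pair of 1 Mb genomes.
distances = ddh_like_distances(
    hsp_len_total=1_600_000,
    identities_total=1_440_000,
    genome_len_a=1_000_000,
    genome_len_b=1_000_000,
)
```

Only the middle distance is independent of genome length, which is consistent with the authors' argument that it is the least affected by incomplete assemblies.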

The ANIm method is based on the MUMmer whole-genome aligner [24, 26]. Some studies use ANIm as the “gold standard” against which methods should be compared, e.g. the HyperGen [38] study. ANIm has a lower computational cost compared to BLAST-based methods by using MUMmer’s fast maximal unique match (MUM) and maximal exact match (MEM) finding algorithms. MUMmer uses a 32-bit suffix tree data structure (up to version 3 [26]) or a 48-bit suffix array (in version 4 [39]) for fast match finding, and has flexible options for parallel processing. ANIm uses NUCmer, a wrapper program that invokes MUMmer and then uses local alignment to extend and combine matches found by MUMmer. NUCmer uses MUMs as anchors in the MUM mode. MUMs are matches that are unique in both the reference and query. Alternatively, NUCmer’s maxmatch mode uses all anchor matches regardless of their uniqueness (i.e. MEMs). ANIm is available through the JSpecies online platform [35] and is implemented in the PyANI package [34].
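To make the distinction between MEMs and MUMs concrete, the following brute-force sketch enumerates maximal exact matches between two short strings and then keeps those whose matched substring occurs exactly once in each sequence. It is quadratic and meant only for toy inputs; MUMmer itself relies on suffix trees or suffix arrays, and the function names here are illustrative assumptions.

```python
def maximal_exact_matches(ref: str, qry: str, min_len: int):
    """Return (ref_pos, qry_pos, length) for matches that cannot be extended."""
    mems = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue  # extendable to the left, so not maximal
            length = 0
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1  # extend to the right as far as possible
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, p)
               for p in range(len(text) - len(pattern) + 1))


def maximal_unique_matches(ref: str, qry: str, min_len: int):
    """Keep MEMs whose substring is unique in both reference and query."""
    return [
        (i, j, n) for i, j, n in maximal_exact_matches(ref, qry, min_len)
        if count_occurrences(ref, ref[i:i + n]) == 1
        and count_occurrences(qry, qry[j:j + n]) == 1
    ]


# "ACGT" occurs twice in the first reference, so it yields MEMs but no MUM.
mems_repeat = maximal_exact_matches("ACGTACGTTT", "GGACGTCC", 4)
mums_repeat = maximal_unique_matches("ACGTACGTTT", "GGACGTCC", 4)
mums_unique = maximal_unique_matches("TTACGTAA", "GGACGTCC", 4)
```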

FastANI [10] is generally faster than both ANIm and ANIb. It works by aligning segments of 3 kb extracted from one genome to another using the MashMap aligner [40]. The process begins with indexing the genome, and subsequently finding all alignments using a winnowed-MinHash estimator to measure similarity between the 3 kb segment and regions of the other genome. This approach operates under the assumption that k-mers follow a Poisson distribution, which might not be the case in practice. Since identifying matching k-mers is a step within the MashMap aligner, FastANI could be considered either alignment-based or k-mer-based.

The k-mer approach

An alternative approach for calculating the distance between two genomes involves decomposing the genomes into their constituent k-mers and summarizing these in a “sketch” data structure. A sketch functions as an approximate version of a set data structure. Sketches can be queried to estimate set cardinalities, as well as the similarity between two sets as measured by the Jaccard index. The Jaccard index is defined as the ratio of the number of distinct shared k-mers between the two genomes divided by the total number of distinct k-mers in either genome:

$$J(A, B) = \frac{|K(A) \cap K(B)|}{|K(A) \cup K(B)|} \qquad (1)$$

where $K(X)$ is the set of all k-mers in the genome $X$, ignoring their multiplicity (i.e. each distinct k-mer counts once).
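The exact Jaccard index over distinct k-mers can be computed directly for small inputs. The sketch below is a simplification: it does not use canonical k-mers (a k-mer and its reverse complement are treated as different), which tools such as Mash typically do, and the function names are chosen for illustration.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Set of distinct k-mers in seq (multiplicity ignored)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a: str, seq_b: str, k: int) -> float:
    """Shared distinct k-mers divided by distinct k-mers in either sequence."""
    kmers_a, kmers_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


# 3-mers of ACGTAC: ACG, CGT, GTA, TAC; of ACGTTC: ACG, CGT, GTT, TTC.
# Two are shared out of six distinct 3-mers overall.
example_jaccard = jaccard("ACGTAC", "ACGTTC", 3)
```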

In their seminal study, Ondov et al. described the Mash method, which uses the MinHash sketch to estimate pairwise distance. Instead of computing the Jaccard index over all distinct k-mers, Mash computes it over a sketch, which is a much smaller, approximate representation of the set. The MinHash sketch is constructed by hashing each k-mer from the input sequence and selecting the $s$ smallest hash values. This reduced sketch enables fast, approximate set operations, such as intersection, facilitating efficient distance calculations:

$$J(A, B) \approx \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|} \qquad (2)$$

where $S(X)$ is the set of hash values in the sketch built over genome $X$’s k-mers, and $S(A \cup B)$ is the sketch built over the k-mers of both genomes. The Mash index (similarity) is calculated as below

$$I_{\mathrm{Mash}}(A, B) = 1 + \frac{1}{k} \ln \frac{2 J(A, B)}{1 + J(A, B)} \qquad (3)$$

This calculation is based on the expectation that the number of mutations in a k-mer is $kd$, where $d$ represents the mutation probability. Under a Poisson model, assuming unique k-mers and random independent mutations, the probability of no mutation occurring in a k-mer is given by $e^{-kd}$ (for details, see page 10 of [32] and page 14 of [41]).
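A bottom-s MinHash sketch and the resulting similarity can be sketched as below. This is an illustration under assumptions, not Mash's implementation: `hashlib.blake2b` stands in for the hash function Mash actually uses, k-mers are not canonicalised, the Jaccard estimate is taken over the s smallest hashes of the merged sketches, and a similarity of 0 is returned when no hashes are shared.

```python
import hashlib
import math


def kmer_hashes(seq: str, k: int) -> set[int]:
    """Hash every distinct k-mer to a 64-bit integer (stand-in hash function)."""
    seq = seq.upper()
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
        for i in range(len(seq) - k + 1)
    }


def bottom_sketch(hashes: set[int], s: int) -> set[int]:
    """Keep the s smallest hash values."""
    return set(sorted(hashes)[:s])


def sketch_jaccard(sketch_a: set[int], sketch_b: set[int], s: int) -> float:
    """Estimate Jaccard from the s smallest hashes of the merged sketches."""
    merged = bottom_sketch(sketch_a | sketch_b, s)
    if not merged:
        return 0.0
    return len(merged & sketch_a & sketch_b) / len(merged)


def mash_similarity(jaccard_estimate: float, k: int) -> float:
    """1 + (1/k) ln(2J / (1 + J)); returns 0 when J = 0 (assumed convention)."""
    if jaccard_estimate <= 0:
        return 0.0
    return 1 + math.log(2 * jaccard_estimate / (1 + jaccard_estimate)) / k


k, s = 3, 100  # toy values; s exceeds the number of distinct k-mers here
sk_a = bottom_sketch(kmer_hashes("ACGTAC", k), s)
sk_b = bottom_sketch(kmer_hashes("ACGTTC", k), s)
j_est = sketch_jaccard(sk_a, sk_b, s)
similarity = mash_similarity(j_est, k)
```

When the sketch size exceeds the number of distinct k-mers, as in this toy case, the estimate coincides with the exact Jaccard index; with realistic genomes the sketch is far smaller than the k-mer set and the estimate carries sampling error.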

K-mer approaches like Mash are much faster than alignment-based approaches but typically achieve lower accuracy, as we discuss in Results. However, Mash is a widely used and successful tool, as are similar tools such as Sourmash [42] and Dashing [33], highlighting the utility of methods that sacrifice accuracy for computational efficiency.

A major goal of this review is to investigate reasons why k-mer-based methods lose accuracy compared to alignment-based methods. To this end, we will catalog their key assumptions and heuristics. Specifically, Mash employs a fixed k-mer length and applies a sketching step with a predetermined sketch size, limiting its focus to a sampled subset of k-mers. This k-mer-based strategy overlooks the collinearity of genomic segments and does not perform alignment at the base level, relying solely on exact k-mer matches. Once the statistics of shared k-mers are found, Mash calculates distance based on the assumption that k-mers are unique and independent of each other [41], ignoring the fact that two k-mers with a long shared suffix/prefix are more likely to co-occur than two k-mers with no shared suffix/prefix.

Other k-mer-based tools include Bindash [43, 44] (which uses b-bit one-permutation rolling MinHash), HyperGen [38] (which incorporates hyperdimensional modeling of Dothash [45]), Gsearch [46], Skmer [47] (designed for raw reads), vclust [48] (designed for viral data [49]), and skani [50] (for metagenomics which uses FracMinhash [51] and k-mer chaining).

Results

We developed a workflow to evaluate the performance of ANI tools under different scenarios and considering both distant and closely related species. We simulated genome evolution using the artificial life framework (ALF) simulator [52], varying amounts of duplication, LGT, and the branch length [52]. This simulation framework generates genomic sequences and a phylogenetic tree that includes branch lengths, providing true distance values between genome pairs (Fig. 1C). Additionally, we incorporated real genomes from NCBI RefSeq and phylogenetic data from both the NCBI taxonomy and genome taxonomy database (GTDB) [13, 53]. We worked from the premise that if an estimate of ANI is a good distance measure, it should accurately reflect the distances inherent in the species tree. In other words, a higher ANI value between two genomes should correspond to a smaller genetic distance on the tree, as indicated by a smaller sum of branch lengths between them.
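The premise above can be checked with Spearman's rank correlation between ANI estimates and tree distances for the same genome pairs. The sketch below computes only the correlation coefficient in pure Python, with tied values receiving average ranks; the example values are invented for illustration, and a library routine such as `scipy.stats.spearmanr` would additionally report a P-value.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        tied_rank = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y (inputs must not be constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented values: ANI falls as tree distance grows, so rho is negative.
ani_values = [99.1, 95.0, 88.2, 80.5, 72.0]
tree_distances = [0.01, 0.08, 0.20, 0.45, 0.90]
rho = spearman_rho(ani_values, tree_distances)
```

Because only ranks enter the calculation, the measure does not require ANI and tree distance to be linearly related, only monotonically related.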

Sketching of k-mers retains most of the essential information for distance calculation but there is no one optimal k-mer length across clades

We first evaluated how well the k-mer-based tools such as Mash and Dashing can capture evolutionary distances. Tree distances represent the evolutionary divergence between genomes, providing a reference for assessing the accuracy of the calculated distance measures. Thus, we investigated the correlation between the tree distance and the ANI estimates in different scenarios of genome evolution with varying degrees of genetic divergence.

Mash has two main parameters: k-mer length and sketch size. A larger k-mer length generally provides greater specificity, while smaller k-mers may offer more sensitivity [54]. However, large genomes might share small k-mers by chance, rather than due to sharing homologous regions. The sketch size refers to the number of k-mers that are sampled and used in the analysis. A smaller sketch enables faster comparison of genomes. The default parameters in Mash are k = 21 and s = 1000.

We simulated the evolution of 15 genomes with varying amounts of divergence by increasing the root-to-leaf branch length from 5 to 100, resulting in average leaf-to-leaf ANI values from 97.5% to 71.5%, respectively (Supplementary Figs S14 and S15). The simulation outputs include the genomic content of species and the phylogenetic tree, which shows how samples are related to each other. We generated one species tree topology, which was used in all experiments. The simulation experiment was repeated five times.

In our experiments with simulated data, we found that the default sketch size of 1000 was too small, leading to a Spearman correlation P-value of Inline graphic for datasets with more distantly related species (branch length of 100), whereas a P-value of Inline graphic could be achieved using Mash without sketching (i.e. Jaccard). Increasing the sketch size from 1000 to 500 000 improved the Spearman P-value from Inline graphic to Inline graphic for a branch length of 100. Varying the k-mer length also affected distance estimation, with lengths of 15 or 17 yielding the best correlation (Fig. 2).

Figure 2.

Alt text: A line chart illustrating the impact of sketch size and k-mer length on ANI.

The Inline graphic P-value of Spearman’s rank correlation test between ANI (calculated using Jaccard or Mash distance) and the tree distance (based on the true phylogenetic tree) for simulated evolution of 15 genomes. (left) Mash with k-mer length of Inline graphic, (right) Mash with sketch size of Inline graphic. The branch length parameter refers to the total length of root-to-leaf branches in the simulated tree used for genome evolution simulations [52]. Overall, increasing the sketch size improves the rank correlation (decreasing the P-value), and the optimal k value that minimizes the P-value differs between scenarios.

We also analyzed real genome assemblies, focusing on eight clades: c__Caldisericia (with Inline graphic genomes), o__Bacillales_A (Inline graphic), p__Aquificota (Inline graphic), c__Dethiobacteria (Inline graphic), f__Neisseriaceae (Inline graphic), o__Cyanobacteriales (Inline graphic), o__Chlamydiales (Inline graphic), and o__Anaerolineales (Inline graphic). These were selected across all bacterial clades in order to cover different taxonomic ranks including phylum, class, order, and family. Two of the clades – c__Dethiobacteria and o__Anaerolineales – were selected at random to assess the representativeness of the selection (see Methods section). One of these clades, Cyanobacteria, is already understood to be highly diverse with respect to GC content and genome size.

To calculate Spearman’s rank correlation, we considered the tree distance from the GTDB phylogenetic tree. The results are depicted in Fig. 3. Notably, for Cyanobacteriales and Chlamydiales, statistics across different k-values revealed distinct local minima at Inline graphic and Inline graphic. The values of k that led to globally the best Spearman rank statistics were in the range of 19 to 23 for these eight clades. No single value of k consistently optimized the rank correlation across different clades, highlighting a limitation of approaches that estimate evolutionary distance using fixed-length k-mers.

Figure 3.

Alt text: A figure showing the impact of k-mer length on ANI across eight clades.

The statistics of the Spearman rank correlation test comparing Jaccard index (calculated by Dashing) and tree distances for eight clades: c__Caldisericia, o__Bacillales_A, p__Aquificota, c__Dethiobacteria, f__Neisseriaceae, o__Cyanobacteriales, o__Chlamydiales, and o__Anaerolineales. Note that the two clades c__Dethiobacteria and o__Anaerolineales were chosen randomly to assess the representativeness of the selection (see Methods section). Red arrows show a local minimum or a notable change in the statistics. Although a value of k around 19 to 23 optimizes the correlation, there is no single k-mer length that optimizes the statistics for estimating evolutionary distances across all clades, and some clades exhibit multiple local optima, highlighting a fundamental limitation of k-mer-based approaches.

Different clades exhibited this phenomenon to varying degrees. In some clades, two distinct local minima can be observed, while in others (e.g. c__Caldisericia and o__Bacillales_A), a milder change in curvature is observable, but without creating a local minimum. We hypothesized that different k-mer lengths have different strengths when estimating tree distance between a genome pair. For very short k-mers (roughly 9-mers or shorter), the absence of specific k-mers in one genome or the other may be particularly informative, since the base expectation is that very short k-mers are likely to be present in both genomes. For longer k-mers (e.g. 21-mers), the co-occurrence of a k-mer in both genomes represents an informative “coincidence.” For intermediate values of k, e.g. roughly 9–11-mers, about half the k-mers might be present in the union of two k-mer sets. In this case, observing or failing to observe a k-mer in the intersection of the two sets is a high-entropy observation relative to short values of k (where nearly all k-mers are present in the intersection) or high values of k (where a tiny fraction of k-mers are present in the intersection).

To explore this further, we analyzed the fraction of distinct k-mers present across different k values for genome pairs [55]. Supplementary Fig. S16 shows this fraction, calculated as the number of distinct k-mers divided by the total number of possible k-mers ($4^k$). Notably, for k smaller than 9, almost all possible k-mers are present in all genomes and thus shared across them, limiting their ability to distinguish between genomes and to measure ANI (e.g. using the Jaccard index as implemented in tools like Mash or Dashing). Specifically, we observed that around Inline graphic to Inline graphic, the fraction of distinct k-mers is about 0.5, i.e. maximizing the entropy, which is close to the smaller local minimum of the Spearman statistics. For larger k values, the presence of k-mers becomes more informative, since they will rarely occur by coincidence.
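The quantity discussed here, the fraction of the $4^k$ possible k-mers that actually occur in a sequence, and the entropy of a presence/absence observation at that fraction can be sketched as follows. The function names are illustrative, k-mers containing characters other than A, C, G, T are skipped as an assumption, and the binary entropy is used as a simple proxy for how informative k-mer presence is.

```python
import math


def distinct_kmer_fraction(seq: str, k: int) -> float:
    """Number of distinct k-mers in seq divided by the 4**k possible k-mers."""
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    kmers = {kmer for kmer in kmers if set(kmer) <= set("ACGT")}  # skip N etc.
    return len(kmers) / 4 ** k


def binary_entropy(p: float) -> float:
    """Entropy in bits of a presence/absence event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Toy sequence: all four 1-mers occur, but only 4 of the 16 possible 2-mers.
fraction_k1 = distinct_kmer_fraction("ACGTACGT", 1)
fraction_k2 = distinct_kmer_fraction("ACGTACGT", 2)
entropy_at_half = binary_entropy(0.5)
```

The entropy peaks at a fraction of 0.5 and falls toward zero as the fraction approaches 1 (very short k) or 0 (long k), matching the intuition given in the text.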

Related to this, we hypothesized that the presence of multiple local minima indicates that different k-mer lengths can contribute complementary information about evolutionary distance. If true, combining the information from both k-mer lengths should yield improved rank correlation. For the order Chlamydiales, the minima were at k = 10 and k = 19. We computed a merged ranking by averaging the ranks of each genome pair calculated using 10-mers and 19-mers, then re-ranking according to these averages. The distances derived from the merged ranks improved Spearman correlation (Fig. 4), supporting our hypothesis. An analysis that used GC content specifically in place of the 10-mers did not improve rank correlation in the same way (see Supplementary Figs S1 and S2), indicating that integrating 10-mers does not simply add information about GC content.
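The merged ranking can be sketched as follows: rank the genome pairs separately under each k-mer length, average the two ranks per pair, and rank those averages again. The distance values below are invented, and the handling of ties by average rank is an assumption; the text does not specify how ties were treated.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        tied_rank = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = tied_rank
        start = end + 1
    return ranks


def merged_ranks(dist_small_k, dist_large_k):
    """Average the per-pair ranks from two k-mer lengths, then re-rank."""
    ranks_small = average_ranks(dist_small_k)
    ranks_large = average_ranks(dist_large_k)
    mean_ranks = [(a + b) / 2 for a, b in zip(ranks_small, ranks_large)]
    return average_ranks(mean_ranks)


# Invented distances for four genome pairs under 10-mers and 19-mers.
dist_10mers = [0.10, 0.30, 0.20, 0.40]   # ranks 1, 3, 2, 4
dist_19mers = [0.50, 0.60, 0.90, 0.80]   # ranks 1, 2, 4, 3
combined = merged_ranks(dist_10mers, dist_19mers)  # mean ranks 1, 2.5, 3, 3.5
```

The combined ranking can then be correlated with tree distance in the same way as the ranking from a single k-mer length.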

Figure 4.

Alt text: A figure showcasing the benefit of combining information from two k-mer sizes.

The Spearman rank correlation test between the distance on the GTDB tree and the Jaccard index calculated using Dashing in full-hash mode. For the orders of Chlamydiales and Cyanobacteriales, two distinct k-values performed well (green). Using both sets of 10-mers and 19-mers (for Chlamydiales) to find distance ranks improved the statistics (blue/orange), which demonstrates that small and large k-mers can capture complementary information, since using both resulted in a better estimation.

Mash is robust to duplication and LGT while ANIm and FastANI are not as robust, but often perform better

We also conducted simulation experiments to study the impact of duplication and LGT on distance estimation. We used the ALF simulator [52] to generate genomes under duplication rates of 0.05%, 0.1%, and 0.2% or LGT rates of 0.01%, 0.05%, 0.1%, and 0.2% all using the same tree topology. We simulated each scenario five times and reported the median.

We ran Mash, FastANI and ANIm with different parameters. For Mash, we varied k-mer lengths and sketch sizes. For ANIm, we varied the minimum length requirement for the MUMs. For FastANI, we varied the minimum fraction of genome shared (Fig. 5) and the fragment length (Supplementary Fig. S4). We observed that Mash was robust to increasing duplication rate but showed moderate sensitivity to higher LGT rates, with the impact becoming more pronounced for larger sketch sizes (see Fig. 5). For example, for Inline graphic and Inline graphic, the averaged P-value for data without LGT was Inline graphic, but with an LGT rate of Inline graphic, the average P-value rose to Inline graphic. This matched our expectation that a higher rate of LGT adversely affects distance estimates, since more genes can move among species in a manner that contradicts the species tree. The results also showed that when the sketch is large, the LGT has a more pronounced impact on Mash as mentioned earlier. On the other hand, for a sketch size of 1000, this value changed from Inline graphic to Inline graphic, also indicating a reduction in accuracy (see Fig. 5D).

Figure 5.

Alt text: Graphs showing the performance of ANI tools, with subfigures labeled A to H.

The ALF simulator was used to generate related genomes under two series of evolutionary scenarios. One series simulated duplication rates of 0.05%, 0.1%, and 0.2% (left column). Another series simulated LGT rates of 0.01%, 0.05%, 0.1%, and 0.2% (right column). (A–D) We ran Mash with different k-mer length and sketch size parameters. The alignment-based tools for estimating ANI include FastANI (E and F) and ANIm (G and H). See Supplementary Fig. S4 for the impact of fragment length on FastANI.

In contrast, alignment-based ANI estimation tools such as FastANI and ANIm produced markedly worse estimates as LGT and duplication rates increased. For these tools, the fragment length in FastANI or the minimum length of MUMs in ANIm did not significantly affect the Spearman correlation (Fig. 5E-F).

So far, we studied the impact of duplication and LGT on the performance of each tool individually. Here, we draw a comparison among tools. Note that most tools have accuracies that fall into overlapping ranges, except Mash, which has appreciably worse correlation than the rest (Fig. 6, right). As the amount of LGT increased, the log Spearman P-value for all tools increased, i.e. their estimates exhibited worse correlations. The Jaccard index (found using Dashing in its --use-full-khash-sets mode) had accuracy similar to that of alignment-based tools like ANIb and ANIm for low LGT, which is surprising. Jaccard’s P-value changes drastically from Inline graphic to Inline graphic as the LGT rate increases to Inline graphic. Mash is less sensitive to LGT, achieving a moderate P-value that changes from Inline graphic to Inline graphic. This difference in behavior between Mash and Dashing (full-hash) may be attributed to Mash’s sketching. Given that the number of exchanged elements due to LGT is limited, most of these elements are likely excluded from the sketched set of k-mers, resulting in minimal impact on the distance calculation by Mash. In contrast, Dashing (full-hash) incorporates k-mers from the entire genome, including these exchanged elements, into the distance calculation. Since these LGT genomic elements do not adhere to the vertical evolutionary history represented by the species tree, they should not be included in the distance calculation. Alignment-based approaches showed a decline in performance as the LGT rate increased, likely for the same reason.

Figure 6.

Alt text: A figure showing the impact of duplications and LGT on ANI calculation.

Comparing alignment-based tools (FastANI, ANIm, ANIb) and k-mer-based tools (Dashing, Mash) on datasets with different LGT and duplication rates (summarizing the Fig. 5). The alignment approaches performed better than k-mer-based Mash in Spearman correlation test. fastANI-l1k-frac0.1: FastANI with a fragment length of 1000 and minimum fraction shared genome of 0.1, Mash-k14-s10k: Mash with k-mer length of 14 and sketch size of 10,000. ANIm-l11:ANIm with a minimum MUM length of 11. DashingFull-k14: The Jaccard index was calculated with the Dashing tool which was run in the mode --use-full-khash-sets.

We also noted that in scenarios with high LGT or duplication, the ANIm and FastANI methods showed worse correlation. Specifically, ANIm’s P-value moved from Inline graphic to Inline graphic (Inline graphic) as the LGT rate (duplication rate) increased from Inline graphic to Inline graphic. Similarly, FastANI’s P-value moved from Inline graphic to Inline graphic (Inline graphic) under the same conditions. But even in these cases, ANIm and FastANI exhibited better correlation compared to Mash (Fig. 5).

To study how different tools mimic the behavior of ANIb (or ANIm), we compared each tool versus ANIb (or ANIm) in terms of the rank correlation between the pairwise distances computed by the tool versus those computed by ANIb (ANIm). The results showed that ANIb and ANIm produce similar rankings, whereas Mash is less similar to both ANIb and ANIm (Fig. 7). When there is a high rate of duplication, FastANI is more similar to the Jaccard index as computed by Dashing than it is to the distances produced by ANIb/ANIm.

Figure 7.

Alt text: A graph comparing ANI tools under evolutionary simulations with varying levels of duplications and LGT.

The Inline graphic P-value of the Spearman correlation test between different ANI tools (FastANI, Mash, Dashing, ANIm) and ANIb (left) using simulated data. We considered a range of LGT and duplication rates. (Right) A similar analysis in comparison to ANIm, showing the strongest rank correlation between ANIm and ANIb.

Consideration of match uniqueness, match length, and AF boosts ANIm’s accuracy

ANIm is one of the better-performing tools in our previous experiments. It is based on the MUMmer whole-genome aligner, which offers many parameters to adjust its performance and sensitivity. For example, MUMmer can use either MUMs or MEMs to anchor the alignment, with a default match length of 20. Starting from these “seed” matches, MUMmer and NUCmer work to cluster the seeds and extend them into full alignments. The match length is adjustable: a shorter length increases sensitivity but at a greater computational cost. When run using MUMs as seeds (rather than MEMs), the uniqueness requirement reduces the available seeds to those that occur outside of repeats, which could also affect sensitivity. Here, we provide a comprehensive investigation into the impact of these parameters on ANI calculation.

We first noted that ANIm has a tendency to report an ANI of zero for genome pairs that are moderately distant (Supplementary Fig. S13). In a real dataset, we noticed that this was due to a lack of matches exceeding NUCmer’s minimum MUM length parameter. Note that so far, and for the following analysis, NUCmer was run in MUM mode. Later we discuss the impact of using NUCmer in MUM mode compared to its maxmatch (MEM-based) mode.

Decreasing ANIm’s minimum MUM length parameter from 20 to 14 caused it to report non-zero distance values in more cases. Related to this, Spearman correlation for its distance estimates also improved (Supplementary Figs S3 and S12).

To further understand the advantage of MUM-based methods, we also explored the effect of weighting ANIm results by the fraction of the genomes that aligned. This was attempted in previous studies [14, 50] and we hypothesized that the adjustment could become more important as the genomes become more distant.

In an experiment with 84 Cyanobacteraia genomes, we tested this approach. We observed that for this dataset ANIm has a slight positive correlation with tree distance (Spearman statistic Inline graphic and P-value Inline graphic, Fig. 8). This result is unexpected, as ANIm is a similarity metric and should correlate negatively, not positively, with tree distance. It may be explained by a previously reported limitation of ANIm in distance calculation when species are distant [8, 50]. Interestingly, when ANIm values were adjusted by multiplying them by the alignment fraction (AF), a negative correlation was observed, with a Spearman statistic of −0.286 and a noticeably low Spearman rank P-value of Inline graphic.

Figure 8.

Alt text: A graph showing the impact of weighting ANI by AF.

Impact of weighting ANIm with AF for distance calculation in Cyanobacteraia. 84 genomes were considered for studying the correlation between ANIm (or Inline graphic) versus tree distance from NCBI and each point corresponds to a pair of genomes (see Supplementary Fig. S10 when GTDB is used).

The simulation results showed that Inline graphic performs slightly worse than ANIm for closely-related species, but outperforms ANIm for more distantly related genomes. Under conditions where the branch length is 100 (producing divergent genomes in the evolutionary scenario), weighting improves the P-value from Inline graphic to Inline graphic (minimum MUM length=11, top left, Fig. 9). When the minimum MUM length is set to 21 (default), the improvement is even more pronounced, from Inline graphic to Inline graphic. For the same branch length (=100), the P-value of ANIb is Inline graphic. Interestingly, Inline graphic’s performance surpassed ANIb’s when the branch length is 200 (Supplementary Fig. S11).

Figure 9.

Alt text: A graph comparing the impact of match uniqueness in the MUMmer aligner.

The Spearman correlation between ANIm (or ANIm*AF) and tree distance for simulated datasets with different root-to-leaf branch length, and duplication and LGT rates. The top row uses MUMs in alignment with MUMmer, followed by keeping only 1-to-1 alignments. The bottom row corresponds to using all maximal matches without any filtering. Longer branches in trees used for genome evolution resulted in more distant genomes. Different minimum MUM lengths used in ANIm did not impact the result (Supplementary Fig. S6). The branch length in PAM [52, 56] varies from 5 to 300. AF = alignment fraction.

On the other hand, our simulations also showed that the improvement achieved by using AF-weighted ANIm does not hold when there is a high rate of duplication or LGT (Fig. 9 (top row)). Notably, we observed that the AF (calculated with NUCmer in MUM mode) of genomes with duplication or LGT showed a much lower Spearman correlation with tree distance; specifically, the Spearman P-values increased from Inline graphic (when no duplication/LGT exists) to Inline graphic for a duplication rate of Inline graphic and to Inline graphic for an LGT rate of Inline graphic (Supplementary Fig. S7). Note that AF values are based on the 1-to-1 alignments that are found by filtering all NUCmer alignments. The fraction of unfiltered alignments correlates better with the distance (the last row of Supplementary Fig. S7). This likely occurs because filtering removes a substantial portion of alignments in the presence of high duplication, which reduces the AF numerator (total alignment length) without changing the denominator (genome length). This led to an AF that is lower than the true proportion of homologous regions.
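The effect of 1-to-1 filtering on AF can be made concrete with a small sketch. It assumes that ANI is computed as one minus the fraction of similarity errors over the total aligned length and that AF is the total aligned length divided by the genome length; overlaps between alignments are ignored. The `Alignment` record and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    aligned_len: int  # aligned bases on the query genome
    sim_errors: int   # mismatches and indel positions within the alignment


def ani_and_af(alignments, genome_len):
    """Return (ANI, AF) for a set of alignments against one genome."""
    total = sum(a.aligned_len for a in alignments)
    if total == 0:
        return 0.0, 0.0  # no alignments: ANI is reported as zero
    errors = sum(a.sim_errors for a in alignments)
    return 1 - errors / total, total / genome_len


genome_len = 1_000_000
all_alignments = [
    Alignment(400_000, 20_000),
    Alignment(300_000, 15_000),
    Alignment(200_000, 10_000),  # assumed to fall in a duplicated region
]
# Assume the 1-to-1 filter discards the alignment in the duplicated region.
one_to_one = all_alignments[:2]

for label, alns in (("unfiltered", all_alignments), ("1-to-1", one_to_one)):
    ani, af = ani_and_af(alns, genome_len)
    print(f"{label}: ANI={ani:.3f} AF={af:.3f} ANI*AF={ani * af:.3f}")
```

In this toy case the identity of the retained alignments is unchanged, but the AF numerator shrinks while the genome length stays fixed, so the AF-weighted value drops even though the homologous fraction did not.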

The described analyses of ANIm were based on NUCmer’s MUM mode. In this mode, NUCmer uses only maximal unique matches (MUMs) as anchors. Thus, NUCmer does not find all the available alignments, due to the uniqueness requirement on anchors. This is observed as a decrease in the fraction of alignment to around 0.90 when duplicated genes are present (Supplementary Figs S7 and S8). To address this issue, we ran NUCmer in maxmatch mode, which uses all maximal matches as anchors. Interestingly, Inline graphic performed similarly to ANIm even when high duplication or LGT is present (the bottom row of Fig. 9). We hypothesized that the uniqueness of MUMs results in discarding some of the duplicated regions, creating big discontinuities and poor rank correlation. Consequently, Inline graphic does not appear to be a universally suitable measure for distance estimation. Developing an optimal weighting scheme that accounts for diverse evolutionary scenarios, including LGT and duplications with their distinct alignment profiles, could be a direction for future research. Overall, ANIm demonstrated superior computational efficiency compared to ANIb. Tuning match parameters (uniqueness and length) combined with AF weighting extends ANIm’s applicability to more distantly related genomes, achieving an accuracy comparable to that of ANIb.

Orthologous genes showed a stronger rank correlation with the GTDB true tree distance compared to the whole genome using k-mers

We investigated the impact of different genomic regions in distance calculation using k-mers. In an experiment on 44 genomes of o__Bacillales_A2, we inferred orthologous genes with the FastOMA tool [29]. We considered four types of regions: the whole genome, all CDSs, 100 random genes of each genome, and 100 orthologous genes. We calculated the Jaccard similarity using each of these genomic regions separately. A typical k-mer-based distance method considers the whole genome. The results indicate that orthologous genes show a stronger correlation with species tree distances compared to whole-genome data for the clade of o__Bacillales_A2 (Fig. 10). This is because these genes are responsible for speciation events, shaping the species tree. However, it is important to note that changes in CDSs alone may not provide sufficient resolution for calculating distances between subspecies or strains within a single species. Additionally, a limitation of this approach is its reliance on accurate orthology information for genes. Of note, gene annotation itself is a Herculean task [57], which impacts the orthology assignment [58] and ultimately the distance estimation. This highlights the important point that methods should clearly specify the range of evolutionary distances they are designed to address. It also emphasizes the need for future tools that integrate the strengths of approaches specialized in short evolutionary distances and those tailored for longer evolutionary distances.

Figure 10.

Alt text: A graph comparing how the inclusion of different genomic regions, such as genes, impacts ANI.

The log P-value of the Spearman correlation test between the Jaccard index and distance on the GTDB tree (the lower, the better). This shows the impact of using different genomic regions in distance calculation for the clade o__Bacillales_A2 when k-mers are taken from the whole genome, all CDSs, 100 random CDS genes, or 100 orthologous genes.

Performance evaluation

To evaluate the performance of different ANI tools, we compared them in terms of run time and memory footprint on different datasets.

The runtime of each tool varied drastically, chiefly depending on whether it is alignment-based or k-mer-based (Table 2). Since the N=15 simulated genomes have a length of around 5 Mbp each, Mash and Jaccard took only a few seconds to run, whereas ANIm and ANIb needed minutes to estimate ANIs between the full set of 105 genome pairs (=$\binom{15}{2}$). This difference was more pronounced for the Cyanobacteriales clade, which comprises $N=585$ samples and so requires distance estimates for 170 820 genome pairs. ANIm took 13 h using 48 CPU cores (556 CPU-hours) whereas FastANI needed 1 h (34 CPU-hours). FastANI’s speed is likely due to MashMap’s use of fast k-mer sketching methods. ANIb, which was run via PyANI on 48 CPU cores, did not complete this task for the Cyanobacteriales within a 24-h time limit. Mash finished the task in less than two CPU minutes, again owing to its use of a fast sketching method. The Dashing-based computation of the full-fidelity Jaccard index needed around 5 CPU-hours, making it six times faster than the fastest alignment-based tool, FastANI. While ANIm and ANIb provided more accurate distance estimates, especially for higher LGT and duplication rates in simulated data, they were considerably slower than the other methods. In summary, ANIb consistently provided the most accurate tree distance estimates but at a much higher runtime.

Table 2.

The wall-clock time and CPU time (in hours:minutes:seconds) of different ANI tools for simulated genomes ($N=15$; 105 genome pairs), Caldisericia ($N=49$; 1176 genome pairs), and Cyanobacteriales ($N=585$; 170 820 genome pairs) using 48 CPU cores

Tools | Simulated data, wall-clock time | Simulated data, CPU time | Caldisericia, wall-clock time | Caldisericia, CPU time | Cyanobacteriales, wall-clock time | Cyanobacteriales, CPU time
--- | --- | --- | --- | --- | --- | ---
Mash | 00:00:01 | 00:00:02 | 00:00:01 | 00:00:48 | 00:00:04 | 00:01:37
Dashing (full-hash) | 00:00:03 | 00:00:34 | 00:00:03 | 00:01:13 | 00:06:53 | 05:16:20
FastANI | 00:00:10 | 00:04:14 | 00:00:07 | 00:03:37 | 01:02:47 | 33:40:50
ANIm | 00:01:11 | 00:20:22 | 00:01:27 | 00:45:19 | 12:48:08 | 556:41:48
ANIb | 00:03:02 | 00:41:47 | 00:03:57 | 00:57:37 | N/A | N/A

ANIb did not finish for Cyanobacteriales in 24 h.
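The pair counts and CPU-hour figures quoted for Table 2 can be checked with a few lines of arithmetic. The sketch below uses only values from the table; the dataset size of 49 for Caldisericia is inferred from its 1176 genome pairs, and the helper `to_hours` is hypothetical.

```python
from math import comb


def to_hours(hms: str) -> float:
    """Convert an hours:minutes:seconds string into fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


# Number of unordered genome pairs for each dataset size.
dataset_sizes = {"Simulated": 15, "Caldisericia": 49, "Cyanobacteriales": 585}
for name, n in dataset_sizes.items():
    print(f"{name}: N={n}, pairs={comb(n, 2)}")

# CPU times on Cyanobacteriales, copied from Table 2.
cpu_time = {
    "Mash": "00:01:37",
    "Dashing (full-hash)": "05:16:20",
    "FastANI": "33:40:50",
    "ANIm": "556:41:48",
}
cpu_hours = {tool: to_hours(t) for tool, t in cpu_time.items()}
for tool, hours in cpu_hours.items():
    print(f"{tool}: {hours:.2f} CPU-hours")

print("FastANI / Dashing:", round(cpu_hours["FastANI"] / cpu_hours["Dashing (full-hash)"], 1))
print("ANIm / FastANI:", round(cpu_hours["ANIm"] / cpu_hours["FastANI"], 1))
```

The quadratic growth of `comb(n, 2)` with the number of genomes is what drives the gap between tools on the largest clade.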

We also investigated the impact of the number of genomes and genome sizes on the runtime of ANI calculation. First, we considered 585 Cyanobacteriales genomes and randomly selected nested subsets of sizes 10, 50, 100, and 300, in addition to the full set. We recorded the runtime and memory footprint for ANIb, ANIm, FastANI, Dashing (full-hash), and Mash using the Linux time command. PyANI was used to calculate ANIb and ANIm. Since the number of pairwise comparisons increases quadratically with the number of genomes, the runtime of the ANI tools (ANIb, ANIm, FastANI, Dashing, and Mash) showed a similarly quadratic trend. The same pattern was observed for memory usage (Fig. 11C). Memory consumption statistics showed that FastANI can be run on a system with 16 GB of memory for up to 300 Cyanobacteriales genomes, but not more than that. Since FastANI and Dashing (in full-hash mode) load all genomes into memory, their memory requirements are high. One workaround is to run these tools on each genome pair separately. In contrast, ANIb and ANIm, which rely on BLAST and MUMmer (run as part of PyANI), perform alignment on a limited number of genome pairs at a time, as handled by PyANI internally. PyANI only stores the alignment output in memory, allowing its memory usage to remain below 1 GB.

Figure 11.

Alt text: A graph comparing the run times of different ANI tools.

Performance evaluation of ANI tools on the Cyanobacteriales dataset. (A) Wall-clock time versus number of genomes. (B) CPU time versus number of genomes. (C) Max memory versus number of genomes. (D) Comparison of tools by genome size.

Additionally, we assessed how genome size affects the runtime of each tool. We selected 50 Cyanobacteriales genomes and visualized the runtime for each genome pair against the sum of their genome sizes (in bases). Genome sizes in this dataset range from 6.2 to 22 million bases. We can observe a quadratic increase in runtime with increasing genome size. Overall, these lines of evidence highlighted the distinct performance characteristics of different ANI tools, including their time and memory requirements.

Discussion and conclusion

A goal of many modern bioinformatics methods is to efficiently compute distance values, e.g. Mash distance, that serve as proxies for evolutionary distance. In this study, we surveyed different definitions for ANI and examined how ANI is influenced by the heuristics implemented in different ANI calculators. To achieve this, we developed a benchmarking workflow called EvANI to evaluate the performance of these tools across a diverse range of scenarios.

The term ANI has been used broadly in the literature over the past years, often referring to related concepts with different assumptions. We argue that the initial definition of ANI emerged as a practical approach, rather than a comprehensive definition. Future studies should clearly define the assumptions underlying the use of ANI and be more explicit about the evolutionary phenomena that a study aims to measure, as these assumptions directly influence how a method calculates the ANI value. Specifically, it is important to clearly describe how unaligned and duplicated regions are treated in distance calculation.

As we discussed in this work, ANIb demonstrates impressive performance by aligning as much of the alignable sequence as possible, with unaligned regions likely reflecting a lack of homology. On the other hand, the k-mer approach takes the whole genome into consideration but its underlying assumptions about k-mer length limit its accuracy, resulting in performance below that of ANIb.

In other words, ANIb appears to be the most effective at capturing tree-based distances. However, its computational demands make it impractical for large-scale studies. Additionally, the scoring function used in BLAST (as the underlying software in ANIb) introduces assumptions that have not been thoroughly investigated. The current literature lacks an exploration of optimal scoring functions for alignment approaches in this context. Compared to k-mer-based methods, the space of possible scoring functions is much larger, indicating significant opportunities for further work. Designing scoring functions tailored to different genomic regions (e.g. genic versus intergenic regions) based on their evolutionary histories might be particularly beneficial. Revisiting this problem will be crucial for improving ANI tools.

Overall, benchmarking the ANI tools is hard since there is no mathematical definition but only a practical description for ANI. In this study, we designed an approach that enables benchmarking ANI tools using distance on the tree. We highlighted different situations where different assumptions in the ANI tools made a difference. This resulted in the recommendation that assumptions and limitations should be explicitly described and carefully considered when using these ANI tools.

The current study points to several directions for future research. Exploring alternative alignment tools, such as Minimap2 [59] or LastZ [60], and evaluating their performance could provide valuable insights. In the context of k-mer-based methods, investigating other sketching techniques such as minimizers [61] could enhance performance. It also remains an open question whether we could achieve higher accuracy by combining different k-mer lengths together with spaced k-mers [62, 63], or more generally matching statistics [64]. Other lines of research could explore a specialized ANI definition for eukaryotes [65], haplotype variations of diploid or polyploid species [66, 67], non-tree evolutionary histories [68], recombination [69], and large language models [70]. Finally, a comprehensive benchmarking of phylogenies [71] inferred from evolutionary distances, compared to those derived from gene marker-based [72, 73] or reconciliation-based approaches [74], would be highly valuable for the field.

Materials and methods

The EvANI benchmarking method requires two inputs: the genomes and the phylogeny (Fig. 1C), which could be either simulated or real data. These together enable us to assess how well different ANI tools capture evolutionary distances. To do so, we use a Spearman rank correlation test to quantify the correlation between ANI values and phylogenetic distances. The results are reported as correlation statistics or as $\log_{10}$ P-values.
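For reference, the Spearman statistic used above can be written as the Pearson correlation of ranks. The notation below is introduced here for illustration only: $x_i$ is the ANI (or similarity) value and $y_i$ the tree distance of the $i$-th of $n$ genome pairs, and $R(\cdot)$ denotes the rank within the respective list, with ties given average ranks.

```latex
\rho
= \frac{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)\bigl(R(y_i) - \bar{R}_y\bigr)}
       {\sqrt{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)^2}\,
        \sqrt{\sum_{i=1}^{n} \bigl(R(y_i) - \bar{R}_y\bigr)^2}},
\qquad
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\quad \text{when there are no ties, with } d_i = R(x_i) - R(y_i).
```

Since ANI is a similarity and tree distance is a dissimilarity, a well-behaved tool is expected to give $\rho$ close to $-1$, and a smaller $\log_{10}$ P-value indicates a stronger association.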

Real data

We utilized the phylogenetic tree from the GTDB resource [13, 53]. This phylogeny is constructed using multiple sequence alignments of 120 gene markers and thus includes branch lengths. To calculate tree distances, we employed the ETE3 package [75]. We first manually selected six different clades, including c__Caldisericia, o__Bacillales_A, p__Aquificota, f__Neisseriaceae, o__Cyanobacteriales, and o__Chlamydiales. These clades span the bacteria and cover different ranks, including phylum, class, order, and family. Then, to assess the representativeness of our analysis, we included two more clades using the following procedure. Using the GTDB tree, we found all 820 bacterial clades with sizes between 50 and 1000 and selected two of them at random using Python’s random.randrange function with a seed of 100. These correspond to c__Dethiobacteria and o__Anaerolineales with 102 and 525 genomes, respectively.

For each clade, we retrieved the taxonomic IDs from GTDB and obtained their genomes from the NCBI Assembly database using the command line esearch -db assembly -query tax_id | esummary |  xtract -pattern DocumentSummary -element FtpPath_GenBank. We also used the phylogeny from the NCBI taxonomy. However, this phylogeny does not have branch lengths. Thus, we used topological tree distances (as integer values) in our method.
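The two kinds of tree distance mentioned above (branch-length distance on the GTDB tree and integer topological distance on the NCBI taxonomy) can be illustrated on a toy tree. This is a pure-Python sketch that is independent of ETE3; the tree, the child-to-parent mapping, and the function names are hypothetical, and the topological distance is taken here to be the number of edges on the path between two leaves.

```python
# Toy rooted tree ((A:0.1,B:0.2)X:0.3,C:0.5)R stored as child -> (parent, branch length).
parent = {
    "A": ("X", 0.1),
    "B": ("X", 0.2),
    "X": ("R", 0.3),
    "C": ("R", 0.5),
}


def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node][0]
        path.append(node)
    return path


def tree_distance(a, b, parent, topological=False):
    """Distance between two nodes through their lowest common ancestor."""
    ancestors_b = set(path_to_root(b, parent))
    lca = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    dist = 0.0
    for node in (a, b):
        while node != lca:
            up, length = parent[node]
            # Count one unit per edge, or add the branch length of the edge.
            dist += 1 if topological else length
            node = up
    return dist


for a, b in (("A", "B"), ("A", "C"), ("B", "C")):
    print(a, b, tree_distance(a, b, parent), tree_distance(a, b, parent, topological=True))
```

Topological distances take few distinct integer values, so they produce many ties in the rank correlation, which is one reason results on the two phylogenies can differ.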

Simulated data

To model genome evolution, we utilized the ALF simulator [52], which is available as a command-line tool. For each scenario, we generated 15 genomes. We repeated this five times, and averaged the results across these replicates. Prior to genome simulation, ALF generates a species tree (phylogeny) by fixing speciation events (Supplementary Fig. S9). The tree is sampled using a birth–death process, with a birth rate of 0.01 and a death rate of 0.001. The root genome consists of 100 genes, each with a length of 50 kb, resulting in genomes with an average size of 5 Mbp.

At each speciation event (an internal node in the species tree), two new species are generated, each inheriting the ancestral genome. These offspring genomes evolve independently, undergoing different mutations and accumulating distinct differences as the simulation progresses. We used the PAM substitution model at the amino-acid level. In some experiments, we adjusted the branch length parameter in the tree (MutRate in ALF [52]) without changing the tree topology to increase the amount of accumulated mutations and genome divergence.

Additionally, insertions and deletions (indels) occur independently of substitutions at a rate of 0.0001, following the ZIPF model [52].

Gene duplications occur randomly at the sequence level along with evolutionary events. In our simulation, we consider duplication rates of 0, 0.0005, 0.001, and 0.002 [52, 76]. Another event included in the simulation is LGT, which allows a genome to acquire new genes. ALF randomly selects donor and recipient genomes as well as the genes to be transferred. The transferred genes are inserted at random positions in the recipient genome. We consider the LGT rate, varying from 0 to 0.0001, 0.001, and a maximum of 0.002 [52, 76].

ANI tools

We ran FastANI [10] version v1.34 using fastANI --ql fasta_list --rl fasta_list -o distances.tab --threads 48 --minFraction 0.1 --fragLen 3000. The argument fasta_list is the list of input FASTAs. We considered a range of values for the minimum fraction of the genome that is shared (minFraction) and the length of the fragment (fragLen).

Furthermore, we used OrthoANI [27] version v1.40 with java -jar OAT_cmd.jar -blastplus_dir ncbi-blast-2.15.0+bin -num_threads 48 -fasta1 ref -fasta2 query ref_query.out. This calculates the distance between two genomes, namely, fasta1 and fasta2. We provided OrthoANI with the executable of BLAST version 2.15.0.

We used PyANI [34] version 0.2.12 for measuring runtimes of ANIb and ANIm and ran average_nucleotide_identity.py  -m ANIm -i fasta_folder -o out -v -l out.log --workers 48, where fasta_folder is the folder including FASTA files.

PyANI [34] does not allow modifying options of the underlying aligner (NUCmer for ANIm or BLAST for ANIb), and it soon ran out of memory with ANIb. We therefore separately executed the underlying tools of ANIm and ANIb, which are NUCmer and BLAST, respectively.

Subsequently, we used Python functions from PyANI to compute the distance values. This approach allowed us to modify the minimum MUM length in NUCmer when we ran it with nucmer -l 20 --mum -p ref_query.out ref.fa query.fa, where -l sets the minimum length of maximal unique matches (MUMs). We also ran nucmer --maxmatch to examine the impact of match uniqueness on weighting ANIm with AF. The output of this step is an alignment in delta format. Then, we filtered these delta files to select the best hit using delta-filter -1 ref_query.out.delta > ref_query.filtered.delta.

To compute the Jaccard index, we used the Dashing tool [33] version v1.0.2-4-g0635 for different k-mer lengths, using dashing dist --use-full-khash-sets -k 21 -p 48 -O distance.tab --full-tsv fasta/*fa. We executed Mash [32] version 2.3 in two steps: sketching and distance calculation. We ran mash sketch -p 48 -o all -s 1000 -k 21 fasta/*fa for creating sketches and mash dist -p 48 all.msh all.msh -t > distances.tab for creating the distance matrix. We varied the sketch size and the k-mer length with the -s and -k arguments, respectively.
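The difference between the full-set Jaccard index and a sketch-based estimate can be illustrated with a minimal sketch. It is a simplified stand-in, not the Dashing or Mash implementation: it uses forward-strand k-mers only, a generic 64-bit hash from hashlib, a bottom-s MinHash sketch, and the Mash distance formula $D = -\frac{1}{k}\ln\frac{2j}{1+j}$. All function names, the toy genomes, and the parameter values are assumptions chosen for illustration.

```python
import hashlib
import math
import random


def kmers(seq, k):
    """Set of forward-strand k-mers (no canonicalization, an assumption)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets."""
    return len(a & b) / len(a | b)


def hash64(kmer):
    """Generic 64-bit hash of a k-mer; real tools use their own hash functions."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_sketch(kmer_set, s):
    """Bottom-s MinHash sketch: the s smallest hash values."""
    return sorted(hash64(km) for km in kmer_set)[:s]


def sketch_jaccard(sketch_a, sketch_b, s):
    """Estimate Jaccard from the s smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for value in merged if value in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance for a Jaccard estimate j and k-mer length k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


# Toy genomes: a random sequence and a copy with random substitutions.
rng = random.Random(1)
genome_a = "".join(rng.choice("ACGT") for _ in range(20_000))
genome_b = "".join(rng.choice("ACGT") if rng.random() < 0.02 else base for base in genome_a)

k, s = 14, 1000
set_a, set_b = kmers(genome_a, k), kmers(genome_b, k)
exact = jaccard(set_a, set_b)
estimate = sketch_jaccard(bottom_sketch(set_a, s), bottom_sketch(set_b, s), s)
print("exact Jaccard:", round(exact, 4), "distance:", round(mash_distance(exact, k), 4))
print("sketch Jaccard:", round(estimate, 4), "distance:", round(mash_distance(estimate, k), 4))
```

The sketch sees only a sample of the k-mers, so a small number of unusual elements in a genome has a limited chance of entering it, whereas the full-set Jaccard index counts every k-mer.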

We ran FastOMA [29] version v0.2.0 to infer orthologous genes using nextflow run FastOMA.nf --input_folder proteomes --output_folder output, where the translated genes were downloaded from NCBI.

Key Points

  • We surveyed a wide range of methods for estimation of evolutionary distances.

  • We developed the EvANI benchmarking framework and datasets to evaluate the accuracy of distance estimation algorithms.

  • Bi-k-mer spectra provide better evolutionary distance estimates for Chlamydiales than a single k-mer.

  • BLAST-based ANIb effectively captures tree-based distances but is computationally demanding.

Supplementary Material

Supplementary_Figures_bbaf267

Acknowledgments

We would like to thank Vikram Shivakumar for feedback and helpful discussions. S.M. and B.L. were supported by NIH grant R35GM139602 to B.L.

Contributor Information

Sina Majidian, Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.

Stephen Hwang, XDBio Program, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.

Mohsen Zakeri, Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.

Ben Langmead, Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.

Author contributions

S.M. drafted the manuscript and ran the experiments. B.L. and S.M. designed the experiments. B.L., S.H., and M.Z. contributed to the analysis and revised the manuscript. All authors read and approved the final manuscript.

Conflict of interest: No competing interest is declared.

Data availability

The benchmarking workflow is available at https://github.com/sinamajidian/EvANI and benchmarking datasets are available at https://zenodo.org/records/14579845.

References

  • 1. Lewin HA, Robinson GE, Kress WJ. et al. Earth biogenome project: sequencing life for the future of life. Proc Natl Acad Sci 2018;115:4325–33. doi: 10.1073/pnas.1720115115
  • 2. Hunt M, Lima L, Shen W. et al. Allthebacteria-all bacterial genomes assembled, available and searchable. Preprint bioRxiv. 2024;2024–03. doi: 10.1101/2024.03.08.584059
  • 3. Wenger AM, Peluso P, Rowell WJ. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 2019;37:1155–62. doi: 10.1038/s41587-019-0217-9
  • 4. Rautiainen M, Nurk S, Walenz BP. et al. Telomere-to-telomere assembly of diploid chromosomes with verkko. Nat Biotechnol 2023;41:1474–82. doi: 10.1038/s41587-023-01662-6
  • 5. Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN (ed) Mammalian Protein Metabolism, 1969, 21–132. doi: 10.1016/B978-1-4832-3211-9.50009-7
  • 6. Sumner JG, Jarvis PD, Fernández-Sánchez J. et al. Is the general time-reversible model bad for molecular phylogenetics? Syst Biol 2012;61:1069–74. doi: 10.1093/sysbio/sys042
  • 7. Altschul SF, Gish W, Miller W. et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10. doi: 10.1016/S0022-2836(05)80360-2
  • 8. Yoon S-H, Ha S-m, Lim J. et al. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek 2017;110:1281–6. doi: 10.1007/s10482-017-0844-4
  • 9. Palmer M, Steenkamp ET, Blom J. et al. All ANIs are not created equal: Implications for prokaryotic species boundaries and integration of ANIs into polyphasic taxonomy. Int J Syst Evol Microbiol 2020;70:2937–48. doi: 10.1099/ijsem.0.004124
  • 10. Jain C, Rodriguez-R LM, Phillippy AM. et al. High throughput ANI analysis of 90k prokaryotic genomes reveals clear species boundaries. Nat Commun 2018;9:5114. doi: 10.1038/s41467-018-07641-9
  • 11. Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021;16:e0258693. doi: 10.1371/journal.pone.0258693
  • 12. Konstantinidis KT, Tiedje JM. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 2005;102:2567–72. doi: 10.1073/pnas.0409727102
  • 13. Parks DH, Chuvochina M, Chaumeil PA. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 2020;38:1079–86. doi: 10.1038/s41587-020-0501-8
  • 14. Gosselin S, Fullmer MS, Feng Y. et al. Improving phylogenies based on average nucleotide identity, incorporating saturation correction and nonparametric bootstrap support. Syst Biol 2022;71:396–409. doi: 10.1093/sysbio/syab060
  • 15. Chaumeil PA, Mussig AJ, Hugenholtz P. et al. GTDB-Tk v2: Memory friendly classification with the genome taxonomy database. Bioinformatics 2022;38:5315–6. doi: 10.1093/bioinformatics/btac672
  • 16. Ciufo S, Kannan S, Sharma S. et al. Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int J Syst Evol Microbiol 2018;68:2386–92. doi: 10.1099/ijsem.0.002809
  • 17. Koslicki D, White S, Ma C. et al. Yacht: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample. Bioinformatics 2024;40:btae047. doi: 10.1093/bioinformatics/btae047
  • 18. Sweeten A, Schatz MC, Phillippy AM. ModDotPlot—rapid and interactive visualization of tandem repeats. Bioinformatics 2024;40:8. doi: 10.1093/bioinformatics/btae493
  • 19. Sapoval N, Liu Y, Curry KD. et al. Lightweight taxonomic profiling of long-read metagenomic datasets with Lemur and Magnet. Preprint bioRxiv. 2024;2024–06. doi: 10.1101/2024.06.01.596961
  • 20. Palenik B. Cyanobacterial community structure as seen from RNA polymerase gene sequence analysis. Appl Environ Microbiol 1994;60:3212–9. doi: 10.1128/aem.60.9.3212-3219.1994
  • 21. Biasin MR, Fiordalisi G, Zanella I. et al. A DNA hybridization method for typing hepatitis C virus genotype 2c. J Virol Methods 1997;65:307–15. doi: 10.1016/S0166-0934(97)02202-7
  • 22. Higgins DG, Bleasby AJ, Fuchs R. CLUSTAL V: Improved software for multiple sequence alignment. Bioinformatics 1992;8:189–91. doi: 10.1093/bioinformatics/8.2.189
  • 23. Goris J, Konstantinidis KT, Klappenbach JA. et al. DNA–DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 2007;57:81–91. doi: 10.1099/ijs.0.64483-0
  • 24. Richter M, Rosselló-Móra R. Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci U S A 2009;106:19126–31. doi: 10.1073/pnas.0906412106
  • 25. Deloger M, El Karoui M, Petit MA. A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 2009;191:91–9. doi: 10.1128/jb.01202-08
  • 26. Kurtz S, Phillippy A, Delcher AL. et al. Versatile and open software for comparing large genomes. Genome Biol 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12
  • 27. Lee I, Ouk Kim Y, Park SC. et al. OrthoANI: an improved algorithm and software for calculating average nucleotide identity. Int J Syst Evol Microbiol 2016;66:1100–3. doi: 10.1099/ijsem.0.000760
  • 28. Kille B, Balaji A, Sedlazeck FJ. et al. Multiple genome alignment in the telomere-to-telomere assembly era. Genome Biol 2022;23:182. doi: 10.1186/s13059-022-02735-6
  • 29. Majidian S, Nevers Y, Yazdizadeh Kharrazi A. et al. Orthology inference at scale with fastoma. Nat Methods 2024;22:269–72. doi: 10.1038/s41592-024-02552-8
  • 30. Altenhoff A, Nevers Y, Tran V. et al. New developments for the quest for orthologs benchmark service. NAR genom bioinform 2024;6:lqae167. doi: 10.1093/nargab/lqae167
  • 31. Dalquen DA, Dessimoz C. Bidirectional best hits miss many orthologs in duplication-rich clades such as plants and animals. Genome Biol Evol 2013;5:1800–6. doi: 10.1093/gbe/evt132
  • 32. Ondov BD, Treangen TJ, Melsted P. et al. Mash: fast genome and metagenome distance estimation using minhash. Genome Biol 2016;17:132. doi: 10.1186/s13059-016-0997-x
  • 33. Baker DN, Langmead B. Genomic sketching with multiplicities and locality-sensitive hashing using dashing 2. Genome Res 2023;33:1218–27. doi: 10.1101/gr.277655.123
  • 34. Pritchard L, Glover RH, Humphris S. et al. Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens. Anal Methods 2016;8:12–24. doi: 10.1039/C5AY02550H
  • 35. Richter M, Rosselló-Móra R, Oliver Glöckner F. et al. Jspeciesws: a web server for prokaryotic species circumscription based on pairwise genome comparison. Bioinformatics 2016;32:929–31. doi: 10.1093/bioinformatics/btv681
  • 36. Meier-Kolthoff JP, Auch AF, Klenk HP. et al. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 2013;14:60. doi: 10.1186/1471-2105-14-60
  • 37. Larralde M, Zeller G, Carroll LM. PyOrthoANI, PyFastANI, and Pyskani: a suite of Python libraries for computation of average nucleotide identity. Preprint bioRxiv. 2025;2025–02. doi: 10.1101/2025.02.13.638148
  • 38. Xu W. et al. HyperGen: compact and efficient genome sketching using hyperdimensional vectors. Bioinformatics 2024;40:7. doi: 10.1093/bioinformatics/btae452
  • 39. Marçais G, Delcher AL, Phillippy AM. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol 2018;14:e1005944. doi: 10.1371/journal.pcbi.1005944
  • 40. Jain C, Dilthey A, Koren S. et al. A fast approximate algorithm for mapping long reads to large reference databases. J Comput Biol 2018;25:766–79. doi: 10.1089/cmb.2018.0036
  • 41. Fan H, Ives AR, Surget-Groba Y. et al. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 2015;16:522. doi: 10.1186/s12864-015-1647-5
  • 42. Irber L, Pierce-Ward NT, Abuelanin M. et al. Sourmash v4: a multitool to quickly search, compare, and analyze genomic and metagenomic data sets. Journal of Open Source Software 2024;9:6830. doi: 10.21105/joss.06830
  • 43. Zhao X. Bindash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics 2019;35:671–3. doi: 10.1093/bioinformatics/bty651
  • 44. Zhao J, Zhao X, Pierre-Both J. et al. Bindash 2.0: new MinHash scheme allows ultra-fast and accurate genome search and comparisons. Preprint bioRxiv. 2024;2024–03. doi: 10.1101/2024.03.13.584875
  • 45. Nunes I, Heddes M, Vergés P. et al. Dothash: estimating set similarity metrics for link prediction and document deduplication. In: Singh A (ed) Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2023, pp 1758–69. doi: 10.1145/3580305.359931
  • 46. Zhao J, Both JP, Rodriguez-R LM. et al. GSearch: Ultra-fast and scalable genome search by combining k-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res 2024;52:e74–4. doi: 10.1093/nar/gkae609
  • 47. Sarmashghi S, Bohmann K, P. Gilbert MT. et al. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019;20:34–20. doi: 10.1186/s13059-019-1632-4
  • 48. Zielezinski A, Gudys A, Barylski J. et al. Ultrafast and accurate sequence alignment and clustering of viral genomes. Nat Methods 2025. doi: 10.1038/s41592-025-02701-7
  • 49. Ndovie W, Havránek J, Koszucki J. et al. Exploration of the genetic landscape of bacterial dsDNA viruses reveals an ANI gap amid extensive mosaicism. mSystems 2025;10:e01661-24. doi: 10.1128/msystems.01661-24
  • 50. Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat Methods 2023;20:1661–5. doi: 10.1038/s41592-023-02018-3
  • 51. Irber L, Brooks PT, Reiter T. et al. Lightweight compositional analysis of metagenomes with fracminhash and minimum metagenome covers. Preprint bioRxiv. 2022;2022–01.
  • 52. Dalquen DA, Anisimova M, Gonnet GH. et al. ALF–a simulation framework for genome evolution. Mol Biol Evol 2012;29:1115–23. doi: 10.1093/molbev/msr268
  • 53. Parks DH, Chuvochina M, Rinke C. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res 2022;50:D785–94. doi: 10.1093/nar/gkab776
  • 54. Greenberg G, Ravi AN, Ilan S. LexicHash: Sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics 2023;39:btad652. doi: 10.1093/bioinformatics/btad652
  • 55. Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 2017;33:2759–61. doi: 10.1093/bioinformatics/btx304
  • 56. Higgs PG, Attwood TK. Bioinformatics and Molecular Evolution. Oxford, UK: Blackwell Publishing Ltd, 2013. doi: 10.1002/9781118697078
  • 57. Salzberg SL. Next-generation genome annotation: we still struggle to get it right. Genome Biol 2019;20:92. doi: 10.1186/s13059-019-1715-2
  • 58. Langschied F, Bordin N, Cosentino S. et al. Quest for orthologs in the era of biodiversity genomics. Genome Biol Evol 2024;16:evae224. doi: 10.1093/gbe/evae224
  • 59. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34:3094–100. doi: 10.1093/bioinformatics/bty191
  • 60. Schwartz S, Kent WJ, Smit A. et al. Human–mouse alignments with BLASTZ. Genome Res 2003;13:103–7. doi: 10.1101/gr.809403
  • 61. Ndiaye M, Prieto-Baños S, Fitzgerald LM. et al. When less is more: sketching with minimizers in genomics. Genome Biol 2024;25:270. doi: 10.1186/s13059-024-03414-4
  • 62. Morgenstern B, Zhu B, Horwege S. et al. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol 2015;10:5–12. doi: 10.1186/s13015-015-0032-x
  • 63. Ma B, Tromp J, Li M. Patternhunter: Faster and more sensitive homology search. Bioinformatics 2002;18:440–5. doi: 10.1093/bioinformatics/18.3.440
  • 64. Hwang S, Brown NK, Ahmed OY, Jenike KM, Kovaka S, Schatz MC, Langmead B. MEM-based pangenome indexing for k-mer queries. Algorithms Mol Biol 2025;20:3. doi: 10.1186/s13015-025-00272-y
  • 65. Hart R, Moran NA, Ochman H. Genomic divergence across the tree of life. Proc Natl Acad Sci 2025;122:e2319389122. doi: 10.1073/pnas.2319389122
  • 66. Yan Z, Cao Z, Nakhleh L. Polyphest: fast polyploid phylogeny estimation. Bioinformatics 2024;40:ii20–8. doi: 10.1093/bioinformatics/btae390
  • 67. Majidian S, Kahaei MH. NGS based haplotype assembly using matrix completion. PLoS One 2019;14:e0214455. doi: 10.1371/journal.pone.0214455
  • 68. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006;23:254–67. doi: 10.1093/molbev/msj030
  • 69. Frolova D, Lima L, Roberts LW. et al. Applying rearrangement distances to enable plasmid epidemiology with pling. Microbial Genomics 2024;10:001300. doi: 10.1099/mgen.0.001300
  • 70. Chen R, Foley G, Boden M. Learning the language of phylogeny with MSA transformer. Preprint bioRxiv. 2024;2024–12. doi: 10.1101/2024.12.18.629037
  • 71. Zielezinski A, Girgis HZ, Bernard G. et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019;20:144. doi: 10.1186/s13059-019-1755-7
  • 72. Dylus D, Altenhoff A, Majidian S. et al. Inference of phylogenetic trees directly from raw sequencing reads using read2tree. Nat Biotechnol 2024;42:139–47. doi: 10.1038/s41587-023-01753-4
  • 73. Digiacomo A, Cloutier A, Grayson P. et al. The unfinished synthesis of comparative genomics and phylogenetics: examples from flightless birds. In: Kubatko L, Knowles L (eds) Species Tree Inference: A Guide to Methods and Applications. New Jersey: Princeton University Press, 2023. doi: 10.2307/j.ctv2wr4wdf.19
  • 74. Zhang C, Mirarab S. Astral-pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics 2022;38:4949–50. doi: 10.1093/bioinformatics/btac620
  • 75. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol 2016;33:1635–8. doi: 10.1093/molbev/msw046
  • 76. Tria FD, Martin WF. Gene duplications are at least 50 times less frequent than gene transfers in prokaryotic genomes. Genome Biol Evol 2021;13:evab224. doi: 10.1093/gbe/evab224


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
