Abstract
An abundance of microbial genomes have been sequenced in the past two decades. For fundamental comparative genomic investigations, where the goal is to determine the major gain and loss events shaping the pangenome of a species or broader taxon, it is often unnecessary and computationally onerous to include all available genomes in studies. In addition, the over-representation of specific lineages due to sampling and sequencing bias can have undesired effects on evolutionary analyses. To assist users with genomic dereplication, we developed skDER and CiDDER (https://github.com/raufs/skDER) to select a subset of representative genomes for downstream comparative genomic investigations. skDER is a nucleotide-based genomic dereplication tool that can dereplicate thousands of microbial genomes leveraging recent advances in average nucleotide identity (ANI) inference. CiDDER dereplicates microbial genomes based on saturation assessment of distinct protein-coding genes. To support usability, auxiliary functionalities are incorporated for testing the number of representative genomes resulting from applying various clustering parameters, automated downloading of genomes belonging to a bacterial species or genus, clustering non-representative genomes to their closest representative genomes and filtering plasmids and phages prior to dereplication. From benchmarking against other ANI-based dereplication tools, skDER, when run in the default mode, was efficient and achieved comparable pangenome coverage and strictly adhered to user-defined cutoffs for both ANI and aligned fraction (AF). Further, we showcase that CiDDER is a convenient alternative to ANI-based dereplication that allows users to more directly optimize the selection of representative genomes to cover a large breadth of a taxon’s pangenome.
Keywords: comparative genomics, microbial, genome dereplication
Impact Statement
Due to the increased availability of genomes for certain microbial species, performing fundamental comparative genomic investigations has become restricted to those with access to more advanced computational infrastructure. Genomic dereplication, the process of selecting distinct representative genomes to capture the breadth of a taxonomic group, thus presents a valuable solution to overcome this embarrassment of riches. Specifically, genomic dereplication allows simplifying the scale of comparative investigations while minimizing the risk of biassing analyses to specific lineages, which might be overrepresented in genomic databases. We present here two programmes for genomic dereplication, one based on average nucleotide identity inference, skDER, and the other based on assessing the saturation of total coding genes sampled, CIDDER. These tools are implemented with a variety of auxiliary options and are designed for ease of use.
Data Summary
skDER and CiDDER are provided as open-source software implemented in Python and C++ on Github (https://github.com/raufs/skDER) with version updates tracked on Zenodo, https://zenodo.org/records/13887710 [1]. Installation of the software is supported via both Bioconda [2] and Docker. Additional code and data for analyses presented in this manuscript can be found on Zenodo at https://zenodo.org/records/13891800 [3]. Pre-computed representative genomes selected by skDER (v1.0.7) for 18 common bacterial taxonomic groups, referencing classifications from Genome Taxonomy Database release 214 [4], are also provided on Zenodo at https://zenodo.org/records/10041203 [5].
Introduction
The boundaries of bacterial species are commonly defined using average nucleotide identity (ANI)-based thresholds [6,8]. However, other types of approaches are often used for the delineation of more resolute clusters of strains or lineages within species and the selection of individual genomes to serve as representatives. Specifically, phylogenetic [9], alignment [10,11] and k-mer-based [12,13] approaches have been used as alternatives to ANI-based methods for such applications. While these alternative approaches have demonstrated scalability to handle species featuring thousands of genomes, they typically involve multi-step workflows, for instance, needing preliminary construction of core genome sequence alignments or phylogenies. In contrast, an approach based on ANI-cutoffs is more direct and simply requires genomic assemblies as input.
To reduce computational costs and resource requirements for downstream analyses, genome dereplication can be extremely useful. Methods that have used ANI to perform dereplication of an input set of genomes – to select representatives – have largely been designed for metagenomic applications. A common strategy of these methods [14,16] involves first performing preliminary clustering using faster but less precise ANI calculators [17,18] to coarsely bin related genomes together. Afterwards, a secondary, more stringent, clustering is performed using a more accurate but lower efficiency approach, such as FastANI [7,19]. More recently, a new method, skani, was developed for high-accuracy and efficient estimation of ANI [20], thus enabling a quick and direct approach for clustering or dereplicating large sets of similar (meta-)genomes. skani’s speed and accuracy stem from an internal, preliminary k-mer sketching step to avoid computationally costly ANI estimation for distantly related genome pairs and the use of k-mer chaining for finding orthologous regions between genomes [21].
For many comparative genomics studies, achieving nucleotide resolution diversity might be less important than understanding the key gene gain and loss events that have shaped the pangenome of the focal species or genus. While the metrics of ANI and percentage of shared genes for pairs of genomes are positively associated [7,22], if the researcher aims to select representative genomes that adequately sample a taxon’s pangenome space, a direct approach to maximize the number of different genes sampled by the fewest number of representative genomes might be preferred to an ANI-based approach. For such methods, recent advances in heuristics for efficient clustering of large protein datasets [23,25] can be leveraged.
Here, we describe skDER and CiDDER, which implement approaches for genomic dereplication based on assessments of ANI and protein cluster saturation, respectively. skDER features two modes, ‘greedy’, which is akin to greedy set cover clustering and strictly adheres to thresholds for dereplication specified by users, and ‘dynamic’, which approximates single-linkage clustering and aims to minimize the number of representative genomes selected. We apply the dereplication algorithms primarily on the diverse genus of Enterococcus and the mobile genetic element (MGE)-rich species of Enterococcus faecalis. We demonstrate their efficiency compared to currently available genomic dereplication tools primarily designed for metagenomic applications.
Methods
Genomic dereplication approaches
Two distinct approaches for genomic dereplication are provided within skDER. Both approaches are based on first estimating ANI between pairs of genomes using skani [20]. Specifically, the skani triangle mode is applied with user-customizable parameter options. The default options for running skani triangle are mostly retained as the default options in skDER, with the one exception being an increase to the value of the screening parameter, -s, from 80.0 to 90.0, which is sensible for improving computational efficiency for most use cases of skDER. In v1.3.0 and more recent versions, the value of this screening parameter is instead by default set to X%, where X is the requested ANI cutoff minus 10%.
For this manuscript, we used version 1.2.6 of skDER and CiDDER, version 0.2.2 of skani [20], version 3.5.1 of Pyrodigal [26] and version 4.8.1 of CD-HIT [27]. Visualization to showcase dereplication was performed using the program granet (version 1.2.7), which uses the igraph library [28] and is included within the skDER package.
By default, the ANI threshold in skDER was set to 99% but increased to 99.5% in v1.3.0 to reflect recently determined thresholds for distinguishing different sequence types [29]. Because CiDDER aims to select a set of representative genomes that adequately sample the pangenome of a taxon, the default value for the percentage of total distinct protein clusters that need to be accounted for is set to 90%.
skDER greedy ANI-based dereplication
The ‘greedy’ algorithm is designed to ensure that all non-representative genomes have ANI and AF values that meet the user-specified cutoffs relative to at least one representative genome. To this end, the ‘greedy’ algorithm for dereplication within skDER begins by initializing a set called representative_genomes. Pairwise genome-to-genome similarity information from skani [20] is next assessed, where for each genome, a list is constructed of alternate genomes deemed similar to it using user-adjustable ANI and AF cutoffs. A score indicating the value of a genome to serve as a representative is computed. This score is simply the connectivity of a genome, how many other genomes it is similar to, multiplied by the N50 for the genome. Genomes are sorted in descending order by their respective score. The first genome, with the greatest score, is automatically selected as a representative and appended to the representative_genomes set. The direct list of alternate genomes regarded as redundant to this genome is noted and kept track of in another set, nonrepresentative_genomes, disqualifying them from serving as representatives. As the full list of genomes is traversed linearly, they are appended to the set of representative_genomes if they have not previously been added to the set of nonrepresentative_genomes. Similar to the first representative genome, genomes subsequently selected as representatives also have their respective sets of genomes deemed as redundant to them appended to the nonrepresentative_genomes set.
skDER dynamic ANI-based dereplication
The ‘dynamic’ algorithm for dereplication within skDER begins by initializing an empty set, redundant_genomes. Pairwise genome-to-genome similarity information from skani [20] is assessed, and, similar to the greedy algorithm-based approach, a score is used to determine the preferred genome in the pair to use as a reference. The score for a genome is computed as the product of its assembly N50 and the number of genomes that are regarded as highly similar to it based on user-adjustable ANI and AF cutoffs. Assuming the genomes are similar to each other at the required ANI and AF cutoffs and the AF for one genome is not substantially larger than the other, the genome with the lower score, less ideal to serve as a representative, is added to the set of redundant_genomes. Further, smaller genomes largely contained within larger genomes should not serve as a representative. Thus, a third filter, the difference in the AF between a pair of genomes, which is non-symmetric, is also assessed during dereplication. If this difference exceeds a certain threshold, the ANI between the genomes meets the ANI cutoff, and the AF for at least one genome exceeds the AF cutoff, then the smaller genome, with the higher AF, will by default be added to the redundant_genomes. Once all pairs of genomes are assessed, genomes that are absent in the redundant_genomes set are reported as representatives. It is important to note that this approach approximates selecting a single representative genome for each coarse transitive cluster of related genomes. Therefore, some non-representative genomes might exhibit greater distances to the nearest representative genome than the requested thresholds if they are indirectly connected to the representative genome.
CiDDER protein cluster saturation-based dereplication
CiDDER performs genomic dereplication by applying methods for large-scale protein clustering to achieve some user-defined saturation of the total coding pangenome space across the full genome set amongst selected representatives. Coding sequences on the edge of scaffolds or contigs are requested not to be reported when running Pyrodigal by default. Starting in v1.3.0, users can also provide proteomes directly to CiDDER as input to avoid gene-calling and the taxonomic constraints of Pyrodigal-based gene calling. After gene-calling, predicted protein sequences are concatenated and clustered using CD-HIT [25]. The key parameters controlling the greedy-based protein clustering done by CD-HIT are user-adjustable, but by default, CiDDER sets configurations to those that are sensible for dereplication of genomes belonging to a single species, for instance, by collapsing proteins at an identity threshold of 95% amino acid identity. Namely, the identity threshold is set to 95.0% with a word size of 5 used, an alignment coverage threshold of the shorter sequence by the longer sequence of 90% and an alignment coverage of the longer sequence by the shorter sequence of 75%. After determining protein clusters, the number of unique protein clusters per genome is computed. The genome with the largest number of protein clusters is selected as the initial representative genome. Then, in an iterative process, the next representative genome is selected based on which one contributes the most additional number of unique proteins, which have not been seen in a previously selected representative genome. This addition of representative genomes continues until one of three criteria is met: (i) the next genome adds less than X number of distinct protein clusters (X is by default 0), (ii) over Y% of the total distinct protein clusters across all genomes are found in the so-far selected representative genomes (Y is by default 90%) or (iii) over Z% of the total distinct multi-genome protein clusters across all genomes are found in the so-far selected representative genomes (Z is by default 100%). Thus, while all three parameters for controlling representative genome selection are user-adjustable, by default, only Y is used for representative genome selection. An option to request incremental selection of additional representative genomes from the remaining pool of non-representative genomes that lack minimal overlap of protein clusters, 90% by default, to at least one established representative genome was introduced in v1.3.0.
Auxiliary utility functionalities present in skDER and/or CiDDER
For convenience, we also implemented several auxiliary functionalities within skDER and CiDDER. For instance, users are able to automatically download all genomes belonging to a single bacterial species or genus in the Genome Taxonomy Database (GTDB) [4,30]. In addition, similar to other tools for genomic dereplication, both skDER and CiDDER allow users to request a secondary clustering of non-representative genomes to representative genomes, which will also report similarity metrics for pairs of genomes in the same cluster. For CiDDER, such clustering can involve either assessing similarity based on shared protein clusters or based on ANI. For skDER, a tiered approach is taken, whereby for each non-representative genome, its best matching representative is first selected based on the largest ANI from the subset of representative genomes that meet both the ANI and AF cutoffs requested. If no representative genome is close enough to the focal non-representative genome at the requested ANI and AF cutoffs, then the representative genome with the highest ANI is selected. If there are ties, AF is used as a secondary criterion for sorting and selecting the best representative genome.
Both skDER and CiDDER also feature an option to filter genomes for predicted MGEs using either PhiSpy [31], primarily designed for phage prediction – but run more loosely by not requiring any hallmark phages – or geNomad [32]. While geNomad is robustly designed to identify both plasmids and phages, leading to more accurate removal of such large MGEs, it takes more time to run computationally. Filtering of predicted MGEs from genomes is performed using mgecut, a standalone wrapper of PhiSpy and geNomad included in the skDER suite. Importantly, because it splits scaffolds along coordinates of MGEs, mgecut is applied in skDER after computing N50 so as not to penalize genomes with many MGEs and allow them to still serve as representatives. Finally, skDER features an option to assess the number of representative genomes that would result from using a variety of ANI or AF thresholds. This can be useful if users aim to select at most some number of representative genomes and would like to learn uniform cutoffs to achieve such a selection.
Assessment of the influence of MGE filtering on ANI and AF estimates
mgecut was run using geNomad (v1.8.0; database v1.7) [32] on all 1,902 E. faecalis genomes included in GTDB release 2074. Pairwise ANI estimates were calculated for E. faecalis genomes with and without MGE filtering using skani triangle [20].
Benchmarking computational resource usage of different genome dereplication programmes
We compared the performance of skDER and CiDDER for genomic dereplication against dRep (v3.5.0) [14] and galah (v1.4.0) [16] through application to genomes classified as Enterococcus (including Enterococcus_A and Enterococcus_B) in GTDB release 207 [4]. We ran dRep with Mash (v2.3) [18] for primary clustering and FastANI (v1.34) [7] for secondary clustering because the configuration is one of the most frequently used for genomic dereplication in recent years. CheckM [33] was skipped in dRep due to excessive time requirements. More recently, galah was developed and replaced the usage of MASH for primary clustering of genomes into coarsely related genomic groups with a faster alternative, Dashing [17]. However, given the advantages of skani for ANI estimations and performing preliminary screening to avoid more computationally intensive ANI computation between distantly related genomes, recent versions of galah have switched to using skani for both the preliminary and final clustering steps. We, thus, ran galah (v1.4.0) [16] with the recent defaults using skani code in the backend. For benchmarking, all programmes were run using 30 threads where possible, and resource use was measured using the Unix time command. To systematically reassess the distance of non-representative genomes to representative genomes determined by different methods, we recomputed ANI and AF between all pairs of genomes using skani triangle [20] run in ‘-slow’ mode for increased accuracy. The closest matching representative genome per non-representative genome was selected based on a score equivalent to the product of the non-representative AF and ANI.
A subset of 1,902 genomes corresponding to the species E. faecalis, based on GTDB R207 classification, was gathered, annotated using Prokka (v1.13) [34] and processed through Panaroo (v1.5.0) [35] for orthology inference. Importantly, Prokka was requested not to report coding sequences on the edge of scaffolds or contigs to avoid overinflating the pangenome with partial CDS features due to assembly fragmentation [36]. To define the phylogenetic breadth of the lineage, single-copy core ortholog groups were identified, and muscle super5 (v5.1) [37] was used to construct protein alignments for them. The protein alignments were next converted to codon alignments using PAL2NAL (v14.1) [38], which were then trimmed for sites that featured gaps in greater than 10% of sites using trimAl. Trimmed codon alignments were concatenated together and provided as input to FastTree2 (v2.1.11) [39] to construct a phylogeny using a generalized time-reversible model. The ete3 (v3.1.3) [40] package was used to extract subtrees corresponding to the different sets of representative genomes selected by the different dereplication programmes, with the ‘preserve_branch_length’ option set to True, and to measure the summed branch length of the subtrees.
Results
Convenient tools to select representative genomes for microbial comparative genomic analyses
skDER and CiDDER are both programs designed to perform dereplication of large microbial datasets (Fig. 1, Text S1, available in the online Supplementary Material). First, skDER leverages advances in fast and accurate ANI estimations by skani [20] and provides two memory-efficient approaches for processing skani’s results to determine a set of genomes that are well suited to serve as representatives. The two approaches within skDER are referred to as the ‘greedy’ algorithm and the ‘dynamic’ algorithm (Fig. 2a, b). The ‘greedy’ algorithm guarantees that all non-representative genomes have ANI and AF values that exceed the user-adjustable specified cutoffs to at least one representative genome. It functions similarly to approaches for genomic dereplication in other popular software such as dRep and galah [14,16]. In contrast, the ‘dynamic’ algorithm approximates single-linkage clustering and more sparingly selects a single representative per coarse clustering of genomes. The ‘dynamic’ algorithm, thus, does not guarantee all non-representative genomes are within a certain ANI and AF distance from their closest representative genome.
Fig. 1. Overview of algorithms for genomic dereplication. Overviews for the (a) skDER ‘greedy’ dereplication, (b) skDER ‘dynamic’ dereplication and (c) CiDDER dereplication approaches are shown.
Fig. 2. Visualization of the dereplication of 66 Cutibacterium avidum genomes using skDER and CiDDER with granet. The programs skDER and CiDDER were run to download and dereplicate 66 genomes classified as C. avidum in GTDB release 220. Afterwards, the granet tool was used to generate network visualizations for the (a) skDER greedy, (b) skDER dynamic and (c) CiDDER runs. Nodes correspond to different genomic assemblies, and edge connections represent pairs of genomes exhibiting >99 % ANI and a maximal AF >90 %. Genomes selected as representatives are shown in red.
In CiDDER, comprehensive protein clustering using CD-HIT [25] is applied to minimize the number of representative genomes needed to maximize sampling of the coding pangenome space (Fig. 2c). This is performed via an iterative approach in which genomes that contribute the most additional protein clusters, not accounted for in previously selected representative genomes, are prioritized to serve as the next representative. By default, the algorithm terminates when 90% of the total number of distinct protein clusters in the coding pangenome space are sampled. Users can adjust this cutoff and apply other criteria for controlling the selection of additional representative genomes.
Beyond dereplication, skDER and CiDDER both offer users a variety of options for convenience and additional functionality. For instance, they feature an option to automatically download genomes from National Center for Biotechnology Information’s GenBank database belonging to a single species or genus based on GTDB classifications [4,30, 41] and perform dereplication on them. They also feature options to perform secondary clustering of non-representative to representative genomes. Clustering reports feature detailed information such as the ANI and AF of the non-representative genome to its closest representative genome or the percentage of shared protein clusters between them. In addition, skDER features an option to test combinations of different ANI and AF cutoffs to assess how many representative genomes are selected when they are applied (Fig. S1). This option might be of interest to users when they aim to select at most a certain number of genomes as representatives.
Functionalities for visualizing the relationship between genomes and assessing the effects of MGEs
The skDER package also includes two standalone scripts: granet and mgecut. The granet program can be used with results from skDER or CiDDER for creating network visualizations, which summarize the dereplication process (Fig. 2). In such networks, nodes represent genomes and edges are shown between genome pairs that are similar enough to meet ANI and AF cutoffs. Nodes are coloured red if they are representative genomes and black otherwise. The mgecut program aims to filter large MGEs, namely plasmids and viruses, from genomes prior to computing AF estimates. It is a simple wrapper program that runs and processes the results of either PhiSpy [31] or geNomad [32], two programs for MGE annotation, and then cuts out MGE coordinates in the input genome. mgecut is integrated within skDER and CiDDER and can be requested to process genomes prior to ANI estimation or protein clustering.
To assess the effect of MGEs on ANI, we processed 1,902 E. faecalis genomes through mgecut using geNomad [32]. E. faecalis was selected because it is known to be rich in MGEs, which have shaped the evolution of the species [42,43]. skani [20] was next used to infer ANI and AF estimates between pairs of unaltered genomes and separately between pairs of MGE-filtered genomes. ANI estimates between corresponding pairs of genomes were compared to reveal that MGEs tended to reduce ANI between genomes, with a median difference of −0.05 (Fig. 3a). Similarly, AF values were also generally less similar between pairs of genomes when MGEs were included, with a median difference of −3.43 (Fig. 3b). In addition, Rodriguez-R et al. recently reported that an intra-species ANI gap is consistently found and proposed a cutoff of at least 99.99% ANI to distinguish strains [29,44]. We find that this ANI gap is still present after removal of MGEs (Fig. S2).
Fig. 3. Assessment of MGE effects on ANI and AF between pairs of E. faecalis genomes. The difference between unprocessed and mgecut processed genomes for (a) ANI and (b) AF estimates is shown. The dashed line in red indicates the median value. (c) The relationship between ANI computed using unprocessed genomes and ANI computed using mgecut processed genomes where each point corresponds to a pair of genomes. The dashed blue line provides a one-to-one reference. Note, because ANI is symmetric while AF is not, there are twice as many genome pairs considered in panel (b) in comparison to panels (a) and (c).
Comparison of skDER to other ANI-based approaches for representative genome selection across Enterococcus
We compared the performance of skDER and CiDDER for representative genome selection across the diverse genus of Enterococcus [45] with two other ANI-based dereplication programs, dRep and galah, which are primarily designed for selecting reference genomes for mapping metagenomic readsets. The dynamic and greedy-based approaches implemented in skDER, as well as CiDDER, dRep and galah, were independently applied to a set of 5,291 low to high-quality Enterococcus genomes from GTDB [4], including metagenome assembled genomes (MAGs). FastANI [7] was requested for secondary clustering in dRep, whereas skani was used for ANI inference and screening in galah and skDER. For all ANI-based methods, dereplication was performed using an ANI threshold of 99% with either a coverage or AF cutoff of 25% or 90%.
One of the most common approaches taken for the task of genomic dereplication is running dRep with FastANI for secondary clustering. Importantly, this method was designed for application on MAGs. When applied to a dataset of MAGs from a common microbiome, the divide-and-conquer approach it implements to first partition MAGs into coarse bins and, afterwards, perform more precise ANI inference separately within each partition of genomes is extremely efficient. However, in our application of dRep on Enterococcus, only two partitions were identified – leading to extensive FastANI computations that took considerably longer than skDER and galah, which were run using skani for accurate ANI inference instead. Thus, much of the speed boost for galah and skDER is from the use of skani, which is substantially faster than FastANI [20] (Fig. 4a). Dereplication with dRep also required greater memory than galah, skDER and CiDDER; however, all methods used less than 18 GB of memory (Fig. 4b). The quality of representative genomic assemblies selected by the different methods, measured using the N50 statistic, appeared largely similar (Fig. S3).
Fig. 4. Comparisons of skDER and CiDDER to other dereplication approaches. Bar plots show the (a) runtime (based on using 30 threads), (b) maximum memory consumption and (c) total representative genomes selected from dereplication of 5,291 Enterococcus genomes by different methods. The time and memory needed to run CheckM2 are not accounted for in the plots for two galah dereplication investigations where CheckM2 information was incorporated. (d) The proportion of non-representative genomes that meet ANI and AF criteria in relation to a nearby representative genome. Neither indicates either ANI is <99% or AF is <25%. While dereplication methods that used skani ran it with largely default settings, pairwise ANI and AF estimates used for this assessment were obtained from running skani in ‘-slow’ mode for increased accuracy.
Surprisingly, despite being run with similar thresholds for ANI and AF, the three software selected a discordant number of representative genomes (Fig. 4c). For instance, while skDER in greedy dereplication mode selected 807 representatives at the cutoffs of 99% ANI and 90% AF, galah with information on genome quality from CheckM2 and dRep selected 630 and 1,805 representative genomes at the same cutoffs, respectively. Further investigation revealed that while galah requires all non-representative genomes to exhibit an ANI equivalent to or greater than the cutoff to a representative genome, no strict enforcement is applied for the AF cutoff (Figs 4d and S4). This explains why skDER selects more representative genomes than galah, despite similar parameters and algorithms, since, in greedy dereplication mode, skDER requires both ANI and AF of non-representative genomes to meet the specified cutoffs in relation to one or more representative genomes. The substantial increase in the selected number of representative genomes by dRep in comparison to skDER and galah when using similarly stringent cutoffs for ANI and AF might be due to underlying differences between FastANI and skani in how genomic similarity metrics are estimated.
In addition, all methods for dereplication selected at least one genome representative for each Enterococcus species in GTDB [4]. Therefore, one of the key differences between the four methods was the number of representative genomes selected for highly sequenced species in the genus, in particular E. faecalis. We further investigated the coverage of the E. faecalis pangenome by representative genomes selected using skDER, CiDDER, dRep and galah. Panaroo [35] was used to group genes into ortholog groups for all selected E. faecalis genomes. The granularity of representative genome selection was similar for each dereplication method when applied to E. faecalis, as was previously observed in their application to the entire genus (Fig. 5). Because of the stringent requirement for AF cutoffs, skDER run in greedy mode selected more representative genomes than galah under similar parameter configurations but accounted for considerably more distinct ortholog groups. While the rate of distinct ortholog group discovery per representative genome selected was 39 between galah runs using 25% AF and 90% AF cutoffs, the rate was largely retained between the galah run with a 90% AF cutoff and skDER run in greedy mode with a 90% AF cutoff at 32.5. This indicates that rate saturation had not occurred and that the additional count of genomes included as representatives by skDER substantially increased coverage of the E. faecalis pangenome. In contrast, the rate of distinct ortholog groups discovered between skDER greedy run with a 90% AF cutoff and dRep run requesting a 90% AF cutoff, which selected the highest number of representative genomes, was just 6.5. CiDDER – when run with default parameters, which requests to select representatives until 90% of the total protein clusters observed across all genomes have been sampled – performs similarly to skDER run in greedy dereplication mode with cutoffs of 99% ANI and 90% AF. Further, while CiDDER sampled a slightly greater number of ortholog groups per representative genome, the skDER run covered a wider phylogenetic breadth across representative genomes, measured as the sum of lengths for relevant branches along a core genome phylogeny. These trends are expected since ANI-based dereplication should be more sensitive to variation in the core genome used for phylogeny inference.
Fig. 5. Overview of E. faecalis representative selections by different dereplication software. Scatterplots show the (a) number of ortholog groups determined by Panaroo and (b) the phylogenetic breadth of the core genome for 1,902 E. faecalis genomes as a function of the number of representative genomes selected by different dereplication methods.
Discussion
Here, we introduced skDER, a convenient tool that leverages recent advances in fast yet accurate ANI estimation by skani to perform genomic dereplication. We also described CiDDER, an independent program that performs genomic dereplication based on the user’s desired saturation of the non-redundant coding space observed across all genomes. Both tools are primarily designed to reduce genomic datasets for downstream comparative genomic investigations.
In the last decade, ANI-based methods for genomic dereplication have primarily been used for metagenomic applications to select representative MAGs for bacterial species or strains to then align reads to and assess presence across related microbiomes [14,16]. While skDER and CiDDER are designed primarily to assist with genomic selection for comparative studies of microbial species or genera, they can also be useful for metagenomic studies. skDER’s dynamic dereplication approach might be well-suited for metagenomic applications where selection of fewer representative genomes should alleviate concerns that reads belonging to the same strain get falsely partitioned across closely related genomes. In contrast, CiDDER and the greedy dereplication mode of skDER can be used to ensure representative genomes capture the breadth of a taxon’s pangenome since they are more sensitive to distinctions in auxiliary gene content between genomes. These methods could thus be particularly useful for the selection of genomes in comparative studies aiming to account for variably conserved traits that are associated with particular lineages or habitats [46]. Ultimately, whether a more comprehensive or concise selection of representative genomes is appropriate will likely depend on the downstream metagenomic analyses being performed.
A key limitation of skDER and CiDDER relative to dRep and galah for metagenomic applications is their inability to incorporate potentially available information on genomic contamination [14,16, 33, 47]. While the extent of contamination is an important consideration when selecting representatives amongst MAGs in particular [48], a simple solution we recommend is for users to assess results from CheckM2 [47] or other software and remove highly contaminated genomes prior to genomic dereplication. Alternatively, users should consider the use of charcoal [49] or similar tools that enable filtering genomic assemblies for specific scaffolds or contigs exhibiting indications of being contaminants prior to running skDER or CiDDER instead of disqualifying entire MAGs from serving as representatives. Contamination within genomes becomes more difficult to resolve when it involves multiple strains belonging to the same species [14,47]. Ultimately, further benchmarking is needed to compare the usage of skDER and CiDDER to dRep and galah for metagenomic applications.
A recent study found that, on average, 100 bp differences between strains result in distinct ecological dynamics within microbiomes [50]. This corresponds to an ANI of 99.998% identity if we assume an average genome size of five million bp and supports proposals by Rodriguez-R et al. to use a stringent ANI cutoff of 99.99% to distinguish different strains [29]. However, some k-mer-based ANI estimation methods might exhibit difficulties in achieving such precision, especially when applied to fragmented genomes [14,20]. Furthermore, because MGEs, such as plasmids, are less stably integrated within genomes, users might prefer to filter them prior to dereplication or determining strain clusters. For one species, we observed that such MGEs usually lower the ANI between strains. This is likely because they can evolve much faster than chromosomes, especially with regard to syntenic structure [51,53], and can, in a nested fashion, contain smaller MGEs, such as transposons, themselves [54,55]. skDER and CiDDER both have options to automatically remove such MGEs, but their application can become expensive when dereplicating thousands of genomes.
In summary, skDER and CiDDER aim to serve as convenient, open-source tools for performing genomic dereplication to simplify downstream comparative genomic investigations. Low computational resource requirements to dereplicate over five thousand Enterococcus genomes highlight how they can be used to select diverse representative genomes that can then be investigated with user-friendly tools or web applications for comparative genomic investigations [35,56,62] on generally accessible infrastructure, such as laptops.
Supplementary material
Acknowledgements
We thank Dr C. Titus Brown, Dr N. Tessa Pierce-Ward and Dr Karthik Anantharaman for helpful discussions on the development of skDER and CiDDER.
Abbreviations
- AF
aligned fraction
- ANI
average nucleotide identity
- GTDB
genome taxonomy database
- MAGs
metagenome assembled genomes
Footnotes
Funding: This work was supported by grants from the National Institutes of Health awarded to L.R.K (NIAID U19AI142720 and NIGMS R35GM137828). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Contributor Information
Rauf Salamzade, Email: salamzade@wisc.edu;salamzader@gmail.com.
Aamuktha Kottapalli, Email: akottapalli@wisc.edu.
Lindsay R. Kalan, Email: kalanlr@mcmaster.ca.
References
- 1.Salamzade R. SkDER. Zenodo; 2024. [DOI] [Google Scholar]
- 2.Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15:475–476. doi: 10.1038/s41592-018-0046-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Salamzade R, Kalan L. SkDER & CIDDER - manuscript data & code. Zenodo. 2024 doi: 10.5281/ZENODO.13891800. [DOI] [Google Scholar]
- 4.Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50:D785–D794. doi: 10.1093/nar/gkab776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Salamzade R, Kalan L. SkDER representative genomes for select bacterial taxa. Zenodo. 2023 [Google Scholar]
- 6.Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol. 2007;57:81–91. doi: 10.1099/ijs.0.64483-0. [DOI] [PubMed] [Google Scholar]
- 7.Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9:5114. doi: 10.1038/s41467-018-07641-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Olm MR, Crits-Christoph A, Diamond S, Lavy A, Matheus Carnevali PB, et al. Consistent metagenome-derived metrics verify and delineate bacterial species boundaries. mSystems . 2020;5:e00731-19. doi: 10.1128/mSystems.00731-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Menardo F, Loiseau C, Brites D, Coscolla M, Gygli SM, et al. Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity. BMC Bioinformatics. 2018;19:164. doi: 10.1186/s12859-018-2164-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Cheng L, Connor TR, Sirén J, Aanensen DM, Corander J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol Biol Evol. 2013;30:1224–1228. doi: 10.1093/molbev/mst028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Tonkin-Hill G, Lees JA, Bentley SD, Frost SDW, Corander J. Fast hierarchical bayesian analysis of population structure. Nucleic Acids Res. 2019;47:5539–5549. doi: 10.1093/nar/gkz361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, et al. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019;29:304–316. doi: 10.1101/gr.241455.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Xu X, Yin Z, Yan L, Zhang H, Xu B, et al. RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biol. 2023;24:121. doi: 10.1186/s13059-023-02961-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11:2864–2868. doi: 10.1038/ismej.2017.126. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Evans JT, Denef VJ. To dereplicate or not to dereplicate? mSphere . 2020;5 doi: 10.1128/mSphere.00971-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Aroney C, Camargo STN, Tyson AP, Woodcroft GW, Ben J. Galah: more scalable dereplication for metagenome assembled genomes
- 17.Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biol. 2019;20:265. doi: 10.1186/s13059-019-1875-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Konstantinidis KT, Tiedje JM. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA. 2005;102:2567–2572. doi: 10.1073/pnas.0409727102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat Methods. 2023;20:1661–1665. doi: 10.1038/s41592-023-02018-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Abouelhoda MI, Ohlebusch E. Chaining algorithms for multiple genome comparison. J Discrete Algorithms. 2005;3:321–341. doi: 10.1016/j.jda.2004.08.011. [DOI] [Google Scholar]
- 22.Wolf YI, Makarova KS, Lobkovsky AE, Koonin EV. Two fundamentally different classes of microbial genes. Nat Microbiol. 2016;2:16208. doi: 10.1038/nmicrobiol.2016.208. [DOI] [PubMed] [Google Scholar]
- 23.Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–1028. doi: 10.1038/nbt.3988. [DOI] [PubMed] [Google Scholar]
- 24.Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost H-G. Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust. Genomics. doi: 10.1101/2023.01.24.525373. [DOI] [Google Scholar]
- 25.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Larralde M. Pyrodigal: python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. J Open Source Softw. 2022;7:4296. doi: 10.21105/joss.04296. [DOI] [Google Scholar]
- 27.Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26:680–682. doi: 10.1093/bioinformatics/btq003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Csardi G, Nepusz T. The igraph software. Complex Syst. 1695:1–9. [Google Scholar]
- 29.Rodriguez-R LM, Conrad RE, Viver T, Feistel DJ, Lindner BG, et al. An ANI gap within bacterial species that advances the definitions of intra-species units. mBio. 2024;15:e0269623. doi: 10.1128/mbio.02696-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Creators Blin K. Ncbi-genome-download. 10.5281/zenodo.8192486
- 31.Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40:e126. doi: 10.1093/nar/gks406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Camargo AP, Roux S, Schulz F, Babinski M, Xu Y, et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2024;42:1303–1312. doi: 10.1038/s41587-023-01953-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–1055. doi: 10.1101/gr.186072.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–2069. doi: 10.1093/bioinformatics/btu153. [DOI] [PubMed] [Google Scholar]
- 35.Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, et al. Producing polished prokaryotic pangenomes with the panaroo pipeline. Genome Biol. 2020;21:180. doi: 10.1186/s13059-020-02090-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Klassen JL, Currie CR. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics. 2012;13:14. doi: 10.1186/1471-2164-13-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Edgar RC. Muscle5: high-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun. 2022;13:6968. doi: 10.1038/s41467-022-34630-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–12. doi: 10.1093/nar/gkl315. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490. doi: 10.1371/journal.pone.0009490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 2016;33:1635–1638. doi: 10.1093/molbev/msw046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, et al. GenBank. Nucleic Acids Res. 2013;41:D36–42. doi: 10.1093/nar/gks1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Palmer KL, Gilmore MS. Multidrug-resistant enterococci lack CRISPR-cas. mBio. 2010;1:e00227-10. doi: 10.1128/mBio.00227-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Johnson CN, Sheriff EK, Duerkop BA, Chatterjee A. Let me upgrade you: impact of mobile genetic elements on enterococcal adaptation and evolution. J Bacteriol. 2021;203:e0017721. doi: 10.1128/JB.00177-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Viver T, Conrad RE, Rodriguez-R LM, Ramírez AS, Venter SN, et al. Towards estimating the number of strains that make up a natural bacterial population. Nat Commun. 2024;15:544. doi: 10.1038/s41467-023-44622-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Schwartzman JA, Lebreton F, Salamzade R, Martin MJ, Schaufler K, et al. Global diversity of Enterococci and description of 18 novel species. bioRxiv . 2023:2023.05.18.540996. doi: 10.1101/2023.05.18.540996. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Franzosa EA, McIver LJ, Rahnavard G, Thompson LR, Schirmer M, et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods. 2018;15:962–968. doi: 10.1038/s41592-018-0176-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods. 2023;20:1203–1212. doi: 10.1038/s41592-023-01940-w. [DOI] [PubMed] [Google Scholar]
- 48.Meziti A, Rodriguez-R LM, Hatt JK, Peña-Gonzalez A, Levy K, et al. The reliability of metagenome-assembled genomes (MAGs) in representing natural populations: insights from comparing MAGs against isolate genomes derived from the same fecal sample. Appl Environ Microbiol. 2021;87:e02593-20. doi: 10.1128/AEM.02593-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Charcoal: Remove Contaminated Contigs from Genomes Using k-Mers and Taxonomies. Github; [Google Scholar]
- 50.Goyal A, Bittleston LS, Leventhal GE, Lu L, Cordero OX. Interactions between strains govern the eco-evolutionary dynamics of microbial communities. elife. 2022;11:e74987. doi: 10.7554/eLife.74987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Frolova D, Lima L, Roberts L, Bohnenkämper L, Wittler R, et al. Applying rearrangement distances to enable plasmid epidemiology with pling. Bioinformatics. 2024 doi: 10.1101/2024.06.12.598623. [DOI] [PMC free article] [PubMed]
- 52.Shan X, Yang M, Wang N, Schwarz S, Li D, et al. Plasmid fusion and recombination events that occurred during conjugation of poxtA-carrying plasmids in Enterococci. Microbiol Spectr. 2022;10:e0150521. doi: 10.1128/spectrum.01505-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Wang X, Zhao J, Ji F, Chang H, Qin J, et al. Multiple-replicon resistance plasmids of Klebsiella Mediate extensive dissemination of antimicrobial genes. Front Microbiol. 2021;12:754931. doi: 10.3389/fmicb.2021.754931. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Sheppard AE, Stoesser N, Wilson DJ, Sebra R, Kasarskis A, et al. Nested Russian doll-like genetic mobility drives rapid dissemination of the carbapenem resistance gene blaKPC. Antimicrob Agents Chemother. 2016;60:3767–3778. doi: 10.1128/AAC.00464-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Salamzade R, Manson AL, Walker BJ, Brennan-Krohn T, Worby CJ, et al. Inter-species geographic signatures for tracing horizontal gene transfer and long-term persistence of carbapenem resistance. Genome Med. 2022;14:37. doi: 10.1186/s13073-022-01040-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Eren AM, Kiefl E, Shaiber A, Veseli I, Miller SE, et al. Community-led, integrated, reproducible multi-omics with anvi’o. Nat Microbiol. 2020;6:3–6. doi: 10.1038/s41564-020-00834-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31:3691–3693. doi: 10.1093/bioinformatics/btv421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Mungan MD, Alanjary M, Blin K, Weber T, Medema MH, et al. ARTS 2.0: feature updates and expansion of the antibiotic resistant target seeker for comparative genome mining. Nucleic Acids Res. 2020;48:W546–W552. doi: 10.1093/nar/gkaa374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, et al. KBase: the United States department of energy systems biology knowledgebase. Nat Biotechnol. 2018;36:566–569. doi: 10.1038/nbt.4163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Chen I-MA, Chu K, Palaniappan K, Ratner A, Huang J, et al. The IMG/M data management and analysis system v.7: content updates and new features. Nucleic Acids Res. 2023;51:D723–D732. doi: 10.1093/nar/gkac976. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Marquis B, Pillonel T, Carrara A, Bertelli C. zDB: bacterial comparative genomics made easy. mSystems. 2024;9:e0047324. doi: 10.1128/msystems.00473-24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14:1394–1403. doi: 10.1101/gr.2289704. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.





