Nucleic Acids Research. 2025 Feb 5;53(3):gkaf045. doi: 10.1093/nar/gkaf045

zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters

Rauf Salamzade 1,2, Patricia Q Tran 3,4, Cody Martin 5,6, Abigail L Manson 7, Michael S Gilmore 8,9,10, Ashlee M Earl 11, Karthik Anantharaman 12, Lindsay R Kalan 13,14,15,16
PMCID: PMC11795205  PMID: 39907107

Abstract

Many universally and conditionally important genes are genomically aggregated within clusters. Here, we introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile-genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck to reliably perform comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes. First, fai allows the identification of orthologous instances of a query gene cluster of interest amongst a database of target genomes. Subsequently, zol enables reliable, context-specific inference of ortholog groups for individual protein-encoding genes across gene cluster instances. In addition, zol performs functional annotation and computes a variety of evolutionary statistics for each inferred ortholog group. Importantly, in comparison to tools for visual exploration of homologous relationships between gene clusters, zol can scale to handle thousands of gene cluster instances and produce detailed reports that are easy to digest. To showcase fai and zol, we apply them for: (i) longitudinal tracking of a virus in metagenomes, (ii) performing population genetic investigations of BGCs for a fungal species, and (iii) uncovering evolutionary trends for a virulence-associated gene cluster across thousands of genomes from a diverse bacterial genus.

Graphical Abstract

Introduction

The determination of ortholog groups, clusters of proteins from multiple genomes that originate from the same ancestral speciation event and are suspected to perform similar cellular processes, forms the basis of comparative genomics [1–3]. De novo inference of ortholog groups typically involves searching for reciprocal best hits of proteins between pairs of genomes, indicative of orthology, and subsequently clustering pairs of inferred orthologs and in-paralogs across multiple genomes [4–7]. Initial methods for orthology inference were designed to be able to identify orthologs between distinct species but were limited in the number of genomes they could process [4–6]. This limitation is largely due to the all-vs-all alignment of proteomes, core to most methods for de novo ortholog grouping, which is an O(n²) operation and a major computational bottleneck. Approaches to overcome this bottleneck include limiting proteome comparisons by using a guiding-phylogeny [8, 9], adapting alignment searching parameters and heuristics to further boost speeds [10, 11], or preliminary aggressive clustering of proteins into coarse homolog groups [12]. Recently, graph-based and iterative-clustering approaches have also allowed vast scalability to thousands of bacterial genomes, but are primarily designed for application to a single species [13–16].

Available orthology inference methods struggle to infer ortholog groups across large datasets of taxonomically diverse genomes, potentially representing thousands of species, such as a set of metagenome-assembled genomes (MAGs) related to a common microbiome. While multiple methods exist to identify instances of previously established ortholog groups within the predicted proteome of a metagenome [2, 17–19], these are unable to account for proteins not represented in their database. Recently, independent advancements in methods to collapse large protein sets based on sequence similarity have enabled rapid clustering of millions of sequences [20–22]. These approaches have even been used on massive protein datasets gathered from across multiple metagenomic datasets [23]; however, more resolute delineation of functionally analogous ortholog groups across thousands of genomes from multiple species remains difficult to perform de novo.

Of relevance, within bacterial genomes, genes are often co-located within smaller, discrete, multi-gene units, which we will broadly refer to as gene clusters. Examples of gene clusters include operons [24, 25], prophages [26], metabolic gene clusters [27], biosynthetic gene clusters (BGCs) [28–31], and pathogenicity islands [32, 33]. Although less common, eukaryotic genomes can also contain genes aggregated within discrete clusters [34–36]. Sometimes gene clusters are highly conserved, encoding for products essential to the survival of the organism [37]. In other cases, a single gene cluster can exhibit variability in gene carriage and order across different strains or species [38–40]. This is often the case for BGCs encoding specialized metabolites or virulence-associated gene clusters, where evolution of gene content and sequence divergence can influence fitness and contribute to adaptation within a changing ecosystem [41–43].

Syntenic conservation has been used to assist de novo identification of homologous instances of a gene cluster of interest in diverse target genomes [44–47]. Homologous gene cluster instances can then be comprehensively investigated to delineate homolog or ortholog groups of the proteins found across them [46, 48]. While such targeted approaches can alleviate time and computational resources by avoiding more comprehensive identification of orthologs at genome-wide scales, currently available methods are mostly designed for specific types of gene clusters, such as BGCs [44, 46, 47]. Many of the software implementing such approaches also do not provide support for uniform annotation of coding sequences in target genomes, which can decrease sensitivity for targeted detection of gene clusters. In addition, most methods do not account for gene cluster paralogy, which has been observed for BGCs in bacterial [40] and fungal genomes [35], or provide specialized capabilities for finding gene clusters across fragmented genomes or metagenomic assemblies [40]. Distinguishing paralogs from orthologs is especially critical if researchers aim to compute evolutionary statistics for genes downstream.

Following identification of homologous gene clusters in target genomes, software to understand the evolutionary relationships between gene cluster instances and infer protein ortholog groups have largely applied coarse protein clustering and aimed to provide visualization based exploration to users [46, 48–50]. Visual assessment of related gene clusters and manual refinement of ortholog groups work well at smaller scales but become impractical when dealing with hundreds to thousands of gene cluster instances. Scalability challenges are due to both computational costs needed to render visuals as well as the figures becoming convoluted and difficult to interpret. An effective solution to ease the identification of evolutionary trends amongst homologous gene clusters is to first identify ortholog groups [46] and present information pertaining to their conservation and sequence divergence within tabular reports [13, 40]. Such tabular reports scale by the number of unique ortholog groups and can be organized by their consensus order along gene cluster instances. We recently introduced construction of such reports in a software suite for exploring microdiversity amongst homologous BGCs from a single taxon [40]; however, the functionality was difficult to use outside of the suite and reliant on orthologous relationships between proteins of gene clusters being known in advance.

Here, we introduce the zol suite (https://github.com/Kalan-Lab/zol), providing functionalities for targeted gene cluster detection and subsequent inference and investigation of protein ortholog groups across homologous gene clusters. The versatility and scalability of these programs is demonstrated through application to three types of gene clusters within different genomic contexts including a virus within environmental metagenomes, fungal secondary metabolite encoding BGCs, and a conserved polysaccharide antigen locus from the diverse bacterial genus of Enterococcus.

Materials and methods

Software availability

zol is provided as an open-source software suite, developed primarily in Python3 on GitHub at: https://github.com/Kalan-Lab/zol. Docker and Bioconda [51] based installations of the suite are supported. Documentation for the programs in the zol suite is provided on the Wiki for the GitHub repo at: https://github.com/Kalan-Lab/zol/wiki. For the analyses presented in this manuscript, v1.5.5 of the zol software package was primarily used. The one exception was for the domain mode analysis of the aflatoxin BGC in Aspergillus flavus, where v1.5.7 of the suite was used instead. Immutable records of the suite for each version following v1.2 onwards are tracked on Zenodo (https://zenodo.org/records/14276364). Version information for major dependencies of the zol suite [52–63] and other software used [46, 64, 65] for analyses in this study is provided in Supplementary Table S1. Code and data for generation of figures in this manuscript are provided separately on GitHub and Zenodo (https://github.com/Kalan-Lab/Salamzade_etal_zol; https://zenodo.org/records/14278963).

Assessment of compute time, memory usage, and disk space: The UNIX time command was mostly applied to measure the runtime and memory usage of programs. Specifically, the ‘Elapsed (wall clock) time’ was regarded as the runtime and the ‘Maximum resident set size (kbytes)’ as the maximum memory usage. To determine the average runtime for the annotation of individual fungal genomes using either prepTG with miniprot or funannotate, log files were assessed instead. All analyses were computed on the same server running Ubuntu 18.04.06 LTS with AMD EPYC 7451 24-Core processors, 472 GB of 288-Pin DDR4 random-access memory, and a Samsung 970 Pro solid disk drive.

Overview of tools and algorithms

prepTG - processing and preparing target genomes for searching with fai: prepTG allows users to create a database of target genomes that can be searched for homologous instances of query gene clusters with fai. In addition to formatting and producing files for optimizing fai searches, prepTG integrates pyrodigal [60], prodigal [66], prodigal-gv [67], and miniprot [62] for gene-calling or protein-mapping in prokaryotic, viral, and eukaryotic genomes as well as metagenomes to aid consistency in fai's performance and limit bias due to potential differences in gene-calling methods. For miniprot-based protein-mapping, coding sequence predictions are required to exhibit an identity of at least 80% to the reference protein and instances of overlapping mRNA features are resolved by retaining only the highest scoring mappings.

prepTG also features options to download pre-built databases for select bacterial taxa that are commonly studied [68], such as ESKAPE pathogens, or to download all genomes belonging to any genus or species in GTDB [69] and subsequently construct a database ab initio.

fai - automated identification of homologous instances of gene clusters: fai allows for rapid, targeted detection of gene clusters in a database of genomes. It accepts a database of target genomes prepared by prepTG and query gene cluster(s). Query gene cluster(s) can be provided in one of three formats: (i) GenBank file(s) with CDS features, (ii) a coordinate along a reference genome, or (iii) a set of proteins in FASTA format. When using coordinates along a reference genome to define a gene cluster, fai reperforms gene-calling along the reference using pyrodigal [60] and extracts a GenBank file corresponding to the specified region.

fai implements HMM-based and CDS separation-based approaches for determining homologous gene cluster instances in target genomes, which can further be combined in a hybrid approach. For both approaches, homologs of proteins from query gene clusters are first searched for in predicted proteomes of target genomes using DIAMOND alignment [57]. Then, in ‘Gene-Clumper’ mode, which is the default, scaffolds with homologs of query proteins are dynamically assessed for whether homologs are within a maximum number of CDS predictions to be regarded as belonging to the same gene cluster. In ‘HMM’ mode, scaffolds of target genomes are instead scanned gene-by-gene using an HMM and neighborhoods or sets of genes are regarded as being in a state of homology to the query gene cluster if several individual genes depict homology to the proteins from the query gene cluster(s). The algorithm is similar to lsaBGC-Expansion [40]; however, it is not dependent on a preliminary genome-wide orthology grouping analysis and thus features a different set of filters to still enable high-throughput automated detection of homologous gene cluster segments. lsaBGC-Expansion is reliant on a preliminary orthology analysis to identify BGC-specific genes that could be used to differentiate true homologous instances of BGCs and customize weighting of HMM emission probabilities for distinct genes. It further requires the length of genes within putative homologous regions to be within a certain deviation from the median length of known gene instances. In contrast, fai has preconfigured emission probabilities which can be customized by users and has no length requirement for potential homologous instances of genes.
fai further allows the ‘HMM-based’ approach to be run with the parameter for aggregating CDS predictions for the ‘Gene-Clumper’ mode, whereby gene cluster segments detected by the HMM can be joined with other such segments if they are within a certain number of CDS features from each other. Similar to lsaBGC-Expansion, syntenic similarity between candidate and query gene cluster segments can also be used to filter candidate segments using a gene cluster-wide correlation metric [40].
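The ‘Gene-Clumper’ idea described above can be illustrated with a minimal Python sketch. It groups the positions of CDS predictions with query hits along a single scaffold, joining consecutive hits whenever they lie within a maximum number of CDS predictions of each other. The function name, the `max_gap` parameter, and the exact distance convention are assumptions made for illustration and are not taken from the fai source code.

```python
def clump_hits(hit_positions, max_gap=5):
    """Group CDS indices (order along one scaffold) that carry query hits.

    Consecutive hits join the same candidate segment when their CDS
    indices differ by at most `max_gap` (hypothetical parameter name
    and distance convention).
    """
    segments = []
    for pos in sorted(set(hit_positions)):
        if segments and pos - segments[-1][-1] <= max_gap:
            # Close enough to the previous hit: extend the open segment.
            segments[-1].append(pos)
        else:
            # Too far away (or first hit): start a new candidate segment.
            segments.append([pos])
    return segments


# Toy scaffold: CDS indices at which a homolog of any query protein was found.
example_segments = clump_hits([3, 4, 9, 30, 32, 50], max_gap=5)
```

Each returned list is one candidate gene cluster segment, which would then still need to pass the gene-count and similarity filters described in the text.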

By default, fai requires filters pertaining to the number of genes from query gene clusters to be met for each homologous gene cluster candidate segment. However, in ‘draft mode’, thresholds for detection of gene clusters within target genomes are assessed in aggregate for putative gene cluster segments found near scaffold edges (< 2,000 bp). Visual reports produced by fai showcasing the sequence similarity of target genome proteins to the query protein(s) can then be manually investigated by users to assess the validity of fragmented gene cluster instances. In addition, fai features an option to filter for paralogous, overlapping candidate segments of a gene cluster in target genomes. Briefly, paralogous gene cluster candidate segments are identified per target genome if they feature homologous hits to two or more of the distinct query proteins. Two homologs are required to disregard cases where two segments have parts of the same query protein that has simply been fragmented due to assembly issues [70]. If paralogy is determined, fai then assesses the sum of bitscores across query proteins for each segment and classifies the segment with the highest score as the primary match to the query gene cluster. In addition, fai offers an intuitive visualization of gene cluster segments, if requested, to allow users to assess their quality, including proximity of candidate segments to scaffold edges. Together, all these options enable the large-scale identification of orthologous gene clusters across genomes which can then be leveraged by zol to perform context-specific inference of protein ortholog groups.
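The paralogy filter can be sketched as follows, under one reading of the description above: two candidate segments from the same target genome are treated as paralogous when they share hits to at least two distinct query proteins, and the segment with the highest summed bitscore is taken as the primary match. The function name, the input layout, and the segment labels are hypothetical.

```python
from itertools import combinations


def classify_segments(segments, min_shared=2):
    """segments: segment id -> {query protein: best bitscore} for one genome.

    Returns (paralogy_detected, primary_segment_id, summed_bitscores).
    `min_shared` is a hypothetical name for the two-homolog requirement.
    """
    # Paralogy: some pair of segments hits >= min_shared of the same queries.
    paralogy = any(
        len(set(hits_a) & set(hits_b)) >= min_shared
        for hits_a, hits_b in combinations(segments.values(), 2)
    )
    # Sum of bitscores across query proteins for each segment.
    totals = {seg_id: sum(hits.values()) for seg_id, hits in segments.items()}
    primary = max(totals, key=totals.get)
    return paralogy, primary, totals


# Toy input: two candidate segments in one target genome.
toy_segments = {
    "scaffold1:10-25": {"qA": 310.0, "qB": 220.5, "qC": 150.0},
    "scaffold7:2-9": {"qA": 120.0, "qB": 98.5},
}
paralogy, primary, totals = classify_segments(toy_segments)
```

Requiring two shared query proteins means that a single query protein split across two scaffolds by assembly fragmentation does not by itself trigger the filter.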

In addition to a directory of homologous gene clusters in GenBank format, to serve as input for zol analysis, and a small set of visual PDF files, fai generates an in-depth report on which target genomes have the query gene cluster as an XLSX spreadsheet. This spreadsheet includes information such as the average amino acid identity (AAI), syntenic similarity, and number of conserved genes for gene clusters from target genomes relative to the query gene cluster. The spreadsheet allows for easy sorting of various columns to assist identification of which target genomes feature a gene cluster to the desired degree of similarity for the user.

zol - computes a variety of evolutionary statistics and can perform gene cluster specific dereplication: The zol workflow begins by processing the input directory of gene cluster GenBank files to assess validity and perform filtering of gene clusters or individual proteins. Filtering can be performed at the gene cluster level by requesting filtering of draft-quality gene clusters, those marked as being near scaffold edges, or low-quality gene clusters, those with ≥ 10% missing base-pairs (e.g. Ns) in their sequence. Filtering of individual proteins which are near scaffold edges can also be performed if fai was used to identify the input gene cluster set, because fai marks these proteins with a special feature tag in the resulting gene cluster GenBank files.

Next, zol will perform dereplication of gene clusters, if requested by users, with skani [63] by clustering gene clusters which depict some user-defined coverage and identity thresholds using single linkage clustering or more resolved MCL-based clustering, for which the inflation parameter can be adjusted. Representative gene clusters are selected from each cluster as part of the dereplication based on maximum length and, if comparative analysis is requested, whether the representative gene cluster is part of the focal or focal-complement set of gene cluster instances specified by the user.
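As an illustration of the single-linkage option, the sketch below clusters gene clusters from a list of pairs that already passed the coverage and identity thresholds, then picks the longest member of each cluster as its representative. It does not call skani, omits the MCL alternative, and ignores the focal-set preference mentioned above; all names and the input layout are assumptions.

```python
def dereplicate(lengths, similar_pairs):
    """lengths: gene cluster id -> length in bp.
    similar_pairs: (id, id) pairs that passed the similarity thresholds.

    Returns {representative id: sorted member ids}.
    """
    parent = {gc: gc for gc in lengths}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage: any passing pair merges the two groups.
    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for gc in lengths:
        groups.setdefault(find(gc), []).append(gc)

    # Representative = longest member; ties broken by id for determinism.
    return {
        max(members, key=lambda g: (lengths[g], g)): sorted(members)
        for members in groups.values()
    }


toy_lengths = {"gc1": 42000, "gc2": 41000, "gc3": 45000, "gc4": 30000}
toy_reps = dereplicate(toy_lengths, [("gc1", "gc2"), ("gc2", "gc3")])
```

Note that gc1 and gc3 end up in the same cluster even though they were never compared directly, which is the chaining behaviour that distinguishes single linkage from the more resolved MCL-based clustering.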

The input set of gene clusters or set of dereplicated representative gene clusters is then used to identify protein ortholog groups with an InParanoid-type approach [6]. Briefly, DIAMOND [57] is used to perform all vs. all pairwise alignment between proteins from the set of gene clusters after which the alignments are processed to identify reciprocal best hits (RBH) between pairs of gene clusters. In-paralogs are identified within each gene cluster based on whether two coding sequences depict more similarity to each other than one does to an RBH with a different gene cluster. Bitscores, standardized through division by reflexive bitscore values for query proteins, are used to assess homology. Specifically, the average normalized bitscore between each pair of orthologs and in-paralogs is recorded. Afterwards, bitscores between such protein pairs are further standardized through dividing them with the average values between pairs of gene clusters to aid proper clustering of proteins downstream. This is akin to the genome-wide normalization procedure recommended in OrthoMCL, owing to the realization that orthologs between distantly related species are also more likely to exhibit lower sequence similarity, which should be corrected for prior to MCL clustering [5]. This information is input into MCL with the inflation parameter set to 1.5, similar to other orthology inference methods [10, 71]. The inflation parameter and minimum identity and coverage cutoffs to consider valid pairs of in-paralogs and orthologs are adjustable by users.
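The first part of this procedure, reflexive bitscore normalization followed by detection of reciprocal best hits between pairs of gene clusters, can be sketched as below. The sketch assumes that alignment bitscores are already available as a dictionary (no DIAMOND call is made) and leaves out in-paralog detection, the second per-cluster-pair standardization, and MCL clustering. Function and variable names are hypothetical.

```python
def reciprocal_best_hits(bitscores, cluster_of):
    """bitscores: (query, subject) -> bitscore, including self hits (q, q).
    cluster_of: protein id -> gene cluster id.

    Returns {(protein_a, protein_b): average normalized bitscore} for RBH pairs.
    """
    # Normalize each hit by the query's reflexive (self-hit) bitscore.
    norm = {
        (q, s): score / bitscores[(q, q)]
        for (q, s), score in bitscores.items()
        if q != s
    }
    # Best hit of each query protein within each *other* gene cluster.
    best = {}
    for (q, s), score in norm.items():
        if cluster_of[q] == cluster_of[s]:
            continue
        key = (q, cluster_of[s])
        if key not in best or score > best[key][1]:
            best[key] = (s, score)
    # Keep pairs whose best hits point at each other.
    rbh = {}
    for (q, _), (s, score) in best.items():
        back = best.get((s, cluster_of[q]))
        if back is not None and back[0] == q:
            rbh[tuple(sorted((q, s)))] = (score + back[1]) / 2
    return rbh


toy_scores = {
    ("a1", "a1"): 200, ("a2", "a2"): 100, ("b1", "b1"): 200, ("b2", "b2"): 100,
    ("a1", "b1"): 180, ("b1", "a1"): 180,
    ("a1", "b2"): 40, ("b2", "a1"): 40,
    ("a2", "b2"): 90, ("b2", "a2"): 90,
}
toy_clusters = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
toy_rbh = reciprocal_best_hits(toy_scores, toy_clusters)
```

In the toy data, a1 and b2 align to each other, but neither is the other's best hit, so only the a1/b1 and a2/b2 pairs are retained.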

Reinflation can also be requested by users to expand ortholog groups to include proteins from the full input set of gene clusters if gene cluster dereplication was requested [13]. Reinflation of ortholog groups is performed by first performing comprehensive and granular clustering of proteins from all input gene clusters using CD-HIT [55], requiring proteins to depict ≥ 98% sequence similarity and ≥ 95% bi-directional coverage to the representative sequences of clusters. Proteins in CD-HIT clusters are then mapped to ortholog groups if they co-cluster with proteins from dereplicated gene clusters which are already assigned to ortholog groups. Dereplication and reinflation are not recommended if sequence redundancy amongst the set of input gene clusters is low. Stringent cutoffs used for CD-HIT clustering during reinflation assume that dereplication was also run with stringent parameters to only collapse highly similar gene clusters. Otherwise, reinflation could miss more distant instances of ortholog groups, resulting in an underestimation of ortholog group conservation amongst gene clusters.

Next, zol will partition protein and nucleotide sequences from gene clusters according to ortholog groups, perform protein alignment using MUSCLE [61], and create codon alignments using PAL2NAL [72]. We also offer an option to use reference proteins to refine and filter sequences based on multiple sequence alignment using MUSCLE [61], which might be useful to further filter intronic sequences in eukaryotic ORFs. Codon alignments are filtered for regions with high ambiguity (≥10% gaps) using trimAL [53] which are then used downstream for calculation of evolutionary statistics and to construct approximate maximum-likelihood phylogenies using FastTree 2 [54] for each ortholog group. Consensus protein sequences for each ortholog group are finally constructed using HMMER3 [56].

Using protein consensus sequences of each ortholog group, zol is next able to linearize annotation of ortholog groups with various annotation databases including KOfam [17], the PGAP database [73], VFDB [74], CARD [75], MIBiG [76], ISfinder [77], the PaperBLAST database [78], and Pfam [79]. A custom FASTA file can also be provided by users to annotate ortholog groups. The best hit per ortholog group for each annotation database is selected by score, if annotation is HMM based [80], or bitscore, if it is DIAMOND alignment based [57], and a default E-value cutoff of 1e-5. The E-value of the alignment is provided in the zol report for each putative annotation except Pfam domains. However, for Pfam annotations, only domains meeting trusted thresholds are reported.

Next, zol will compute basic statistics per ortholog group including the consensus order, consensus directionality, whether proteins are single-copy across gene clusters, the median length of ortholog group sequences, their median GC percentage, and GC skew values. The consensus order and directionality are determined similarly to lsaBGC-PopGene [40]. Afterwards, in the sixth step, zol will calculate evolutionary statistics for each ortholog group including Tajima's D [81], the proportion of filtered codon alignments which correspond to segregating sites, the average sequence entropy of the filtered codon alignment and the 100 bp upstream region, and the median and maximum Beta-RDgc. Beta-RDgc is a statistic that is derived from the Beta-RD statistic which we described in lsaBGC [40] and measures the divergence of a pair of protein sequences based on the expected divergence between the gene clusters. Values below one suggest that protein divergence is larger for the pair than expected based on other shared proteins between the two gene clusters; conversely, the opposite trend might suggest high conservation of the particular protein between the gene clusters and potentially gene-specific horizontal gene transfer. Finally, we perform site-specific selection analyses using the FUBAR [82] and GARD [83] methods offered in the HyPhy suite. While highly scalable relative to comparable methods [82], these analyses can still take considerable time and are turned off by default. Importantly, GARD recombination detection [83] and partitioning of input alignments for ortholog groups can also be used for alternate HyPhy analyses with HyPhy Vision [59], to extend beyond the site-specific selection analyses using FUBAR [82] supported directly in zol.
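For readers unfamiliar with Tajima's D, the following Python sketch computes the statistic from its standard textbook definition for a small alignment. It assumes equal-length, gap-free nucleotide sequences and is not the implementation used in zol, which operates on filtered codon alignments; the function name and the handling of undefined cases are assumptions.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length, gap-free aligned nucleotide sequences.

    Returns None when the statistic is undefined in this sketch
    (fewer than 4 sequences, no segregating sites, or zero variance).
    """
    n = len(seqs)
    if n < 4:
        return None
    # S: number of alignment columns with more than one distinct base.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return None
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Standard constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignment in which every variant is a singleton.
d_value = tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"])
```

An excess of rare variants, as in the toy alignment where each variant occurs in a single sequence, yields a negative value.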

Prior to the generation of a final report, zol allows users to perform an optional comparative analysis between user-defined set(s) of focal and complementary or alternate gene cluster instances. In these comparative analyses, the conservation and fixation index [84] are calculated for each ortholog group.

Finally, we generate a consensus report and a spreadsheet in XLSX format where each row corresponds to an ortholog group and columns correspond to basic statistics, evolutionary statistics, and annotation information. Quantitative fields are automatically colored to make visual detection of patterns easier for users. A basic heatmap showing the presence of ortholog groups across gene clusters is also produced.

zol additionally features three optional modes that can be triggered via specific arguments. First, the ‘only-orthologs’ argument will invoke zol to only compute ortholog groups and exit after determining them. Second, the ‘select-fai-params-mode’ argument allows users to provide a handful of known instances for a gene cluster and determine appropriate thresholds for searching for additional instances of the gene cluster using fai. This mode assumes that the known instances provided are representative of the breadth of diversity expected for the gene cluster amongst the target genomes being searched. Finally, the ‘domain-mode’ argument allows inference of ortholog groups at the resolution of protein domains.

When ‘domain-mode’ is requested, CDS features from input gene clusters are first partitioned into domain and inter-domain subunits. PyHMMER hmmsearch [80] is used to annotate proteins with domains from the Pfam database [85] using trusted thresholds. Hits are ordered based on score and largely non-overlapping domain ranges are identified for each CDS feature. CDS are then split up into cCDS features corresponding to discrete domain or inter-domain coordinates that have minimal overlap and meet a user-adjustable minimum length criterion (default of 50 aa). These cCDS features are then used to infer domain resolution ortholog groups (DOGs), their consensus order and direction, and compute evolutionary statistics and perform functional annotations in a manner that is equivalent to a standard zol analysis. A secondary single-linkage clustering is also performed for full CDS features (proteins) based on DOG relationships and this information is provided as a separate column in zol's final report.
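The partitioning step can be approximated with a greedy sketch: domain hits are visited in order of decreasing score, a hit is kept only if it does not overlap an already accepted hit, and the protein is then cut into domain and inter-domain segments that meet the minimum length. The text above describes "largely non-overlapping" ranges, so the strict no-overlap rule used here is a simplification; the function name, coordinate convention, and domain labels are hypothetical.

```python
def partition_cds(protein_length, hits, min_length=50):
    """hits: list of (start, end, score, name) in 0-based, half-open aa coordinates.

    Returns a list of (start, end, label) segments of at least `min_length` aa.
    """
    accepted = []
    # Visit hits from highest to lowest score; keep those with no overlap.
    for start, end, score, name in sorted(hits, key=lambda h: -h[2]):
        if all(end <= s or start >= e for s, e, _ in accepted):
            accepted.append((start, end, name))
    accepted.sort()

    # Walk along the protein, emitting inter-domain gaps between domains.
    segments, cursor = [], 0
    for start, end, name in accepted:
        if start > cursor:
            segments.append((cursor, start, "inter-domain"))
        segments.append((start, end, name))
        cursor = end
    if cursor < protein_length:
        segments.append((cursor, protein_length, "inter-domain"))

    # Drop segments shorter than the minimum length.
    return [seg for seg in segments if seg[1] - seg[0] >= min_length]


toy_hits = [(10, 150, 80.0, "domA"), (120, 260, 60.0, "domB"), (300, 390, 45.0, "domC")]
toy_segments = partition_cds(400, toy_hits)
```

In the toy input, the lower-scoring domB hit overlaps domA and is discarded, and the short flanking regions at both ends of the protein fall below the 50 aa minimum.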

cgc and cgcg – tools for scalable visualization of zol results: To complement the tabular reports produced by zol, the suite features two downstream programs to create publication-quality figures summarizing evolutionary information from potentially thousands of gene cluster instances. cgc, collapsed gene cluster, and cgcg, collapsed gene cluster graph, both take as input a directory of results from zol analysis to produce visuals. Both tools also feature a variety of customization options. cgc creates a PDF visualization for which the basis is a gene schematic where genes correspond to ortholog groups, their order and direction to the consensus order and direction computed by zol, and their length to the median length of the ortholog groups. Additional tracks for quantitative evolutionary statistics featured in zol's report can further be added on top of the gene schematic as bar plots. While cgc features several options accessible via the command line, an R script for generating the final figure is also created. This allows users who are familiar with R to further customize figures directly as needed. In contrast, cgcg instead leverages the gravis package [86] in Python to create an interactive network visualization output as a portable HTML file. Nodes in the network correspond to distinct ortholog groups with edges indicating gene order information. Nodes can be colored based on certain evolutionary statistics computed by zol, including ortholog group conservation and Tajima's D value. Various options are provided via the command line to customize the figure, including a flag to show the major path across the gene cluster in gold. Gravis further features an interface on the right panel of the network allowing users to directly configure and adjust various options, such as the size of ortholog group labels or the strength of clustering. 
In addition, nodes in the network can dynamically be rearranged and cgcg embeds select information for ortholog groups from zol reports to be displayed on a bottom panel when users click on different nodes.

abon, atpoc, and apos – tools for assessing novelty and conservation of BGCs, phages, and plasmids from a single strain: The zol suite features three small wrapper programs called abon, atpoc, and apos which assess the conservation and novelty of a single genome's BGC-ome, phage-ome, and plasmid-ome, respectively, relative to a target genome database constructed by prepTG. The target genomes database could be all other genomes belonging to the focal genome's species or genus. The three programs are wrappers of fai but also offer a simple BLAST search alternative, to more thoroughly check for whether individual genes from BGCs, phages, and plasmids are present in the target genomes being searched. These tools accept results from standard software for annotation of BGCs [65, 87], phages [64, 67, 88], and plasmids [67, 89] but do not integrate them within the suite. Similar to fai and zol, they produce auto-formatted XLSX spreadsheets as primary results.

Application of fai and zol to track a virus within lake metagenomes

VIBRANT [64] was used to identify viral contigs or sub-contigs in the three total metagenomes from Tran et al. 2023 [90] sampled on the earliest date of 07/24. Afterwards, predicted circular contigs were clustered using BiG-SCAPE [46], which revealed that a ∼36 kb virus was present in two of the three metagenomes.

prepTG was used to prepare all 16 total metagenomic assemblies from the Tran et al. 2023 study for comprehensive targeted searching of the virus with fai. For improved viral gene predictions, prodigal-gv [67] was requested in prepTG and similarly used to reperform gene-calling for the two query instances of the virus. fai was initially run with largely default settings, with filtering of secondary instances of the virus requested to retain only the best matching scaffold or scaffold segment resembling the queries. Because the virus is suspected to be circular, no syntenic similarity assessment was applied. To assess the performance of cblaster for preparing the target metagenomes database and subsequently searching for the virus, we provided GenBank files with CDS features produced by prepTG as input for cblaster makedb and adjusted searching parameters for cblaster search to more closely match what we used for fai. To more sensitively detect the virus, fai was also run with the following parameter changes: (i) draft mode was specified, (ii) the E-value threshold for query protein homology detection was loosened from 1e-10 to 1e-5, (iii) the maximum distance allowed between homologs for being grouped in the same candidate gene cluster segment was increased from 5 to 10, and (iv) the proportion of distinct query proteins with homologs needed was lowered from 0.5 to 0.1.

Microevolutionary investigations of leporin and aflatoxin BGCs in Aspergillus flavus

Genomic assemblies downloaded from NCBI GenBank were processed using prepTG. Of the 217 genomic assemblies downloaded, one, GCA_000006275.3, was dropped from the analysis because the original GenBank file had multiple CDS features with the same name, leading to difficulties in performing BGC prediction with antiSMASH [65], and because alternate assemblies were available for the isolate. prepTG was run on all assemblies with miniprot [62] based gene-mapping of the high-quality gene coordinate predictions available for A. flavus NRRL 3357 (GCA_009017415.1) [91] requested. Target genomes were then searched for the leporin (BGC0001445) and aflatoxin (BGC0000008) BGCs using GenBank files downloaded from MIBiGv3 [76] as queries. For leporin, AFLA_066840, as represented in the MIBiG database, was treated as a key protein required for detection of the BGC. Similarly, for aflatoxin, PksA (AAS90022.1), as represented in the MIBiG database, was treated as a key protein required for detection of the BGC. Draft mode and filtering of paralogous segments were requested and a syntenic similarity correlation threshold of 0.6 was applied. For both analyses, ortholog groups found in fewer than 5% of gene cluster instances were disregarded.

We reidentified population B as previously delineated [39] using k-mer based ANI estimation [92] and neighbor-joining tree construction [93]. A discrete clade (n = 81) in the tree was validated to feature all isolates previously determined as part of population B [39] and thus regarded as such.

For comprehensive and de novo BGC prediction, antiSMASH was run on the 216 genomic assemblies with ‘glimmerhmm’ requested for the option ‘--genefinding-tool’. Similarly, antiSMASH was also run on full GenBank files for genomes generated by prepTG from reference proteome-mapping via miniprot and using GFFs from funannotate gene annotation (https://github.com/nextgenusfs/funannotate). Briefly, the funannotate pipeline was first used to clean, sort, and mask repeats in genomes. Afterwards, funannotate predict, which incorporates several ab initio gene prediction tools such as GeneMark-ES/ET [94] and AUGUSTUS [95], was run on the processed genomes to determine CDS features. BGCs from each set of genome annotations were independently clustered using BiG-SCAPE with ‘mix’ clustering analysis and MIBiG reference BGC integration requested. The gene cluster family and clan matching the reference leporin BGC in MIBiG (BGC0001445) were regarded as the leporin BGC.

For remote cblaster [47] analysis, CAGECAT [96] was used to search NCBI’s nr database with proteins from the leporin BGC representative (BGC0001445) provided as a query. Only 13 scaffolds, belonging to 12 assemblies (including GCA_000006275.3), were identified. cblaster was also used to locally search all 216 genomes with gene calling annotations from both prepTG and funannotate.

Evolutionary investigations of the epa locus across Enterococcus

All Enterococcus genomes represented in GTDB R207 [69] (n = 5,291) were downloaded using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). The same query for epa was used for all analyses. Specifically, coordinates extending from 2,071,671 to 2,115,174 along the E. faecalis V583 chromosome, corresponding to genes EF2164 to EF2200, were used as a query for the epa locus in fai to identify homologous instances in target genomes [97, 98].

Comparing orthology/homology inferences between fai & zol, OrthoFinder, and eggNOG-mapper: Representative genome assemblies were selected for each of the 92 species of Enterococcus in GTDB R207 [69] based on the N50 metric. One set of species representative genomes corresponded to those with the largest N50 values and the other set comprised genomes with the lowest N50 values. The two sets of species representative genomes were processed and investigated identically but independently. Gene calling was first performed for genomes using prepTG with pyrodigal [60]. To generate the input for OrthoFinder [71] and eggNOG-mapper [18], proteins from prepTG’s genome-wide GenBank files were extracted in FASTA format. Afterwards, OrthoFinder and eggNOG-mapper were largely run with default settings, with the argument ‘--tax_scope bacteria’ also provided to eggNOG-mapper. Phylogenetic hierarchical orthogroups inferred by OrthoFinder were used for comparisons. Protein matches to coarse, root-level, and resolute, Enterococcaceae-specific, precomputed eggNOG ortholog groups were assessed. fai was run using largely default settings to identify homologs of the epa locus from E. faecalis V583 in representative genomes. Adjustments from default settings included lowering the minimum proportion of distinct query proteins needed for a gene cluster instance from 0.5 to 0.25 and specifying draft mode to alleviate possible cases of gene cluster fragmentation due to assembly issues. Similarly, zol was also run using mostly default settings but with the flags ‘only-orthologs’, to stop after it determined ortholog groups, and ‘allow-edge-cds’, to allow usage of CDS features marked by fai to be near scaffold edges. All three methods for orthology determination were provided 20 threads wherever possible. Comparisons between the methods were performed by assessment of shared pairs of proteins belonging to common ortholog groups.
Because OrthoFinder and eggNOG-mapper were applied at the genome-wide scale, only pairs of proteins where at least one protein was found in an epa locus identified by fai were considered. Congruence between methods was also calculated when restricted to using only protein pairs where both proteins were found in epa loci identified by fai. The Jaccard index, the number of protein pairs presumed orthologs shared by both methods divided by the total number of such protein pairs identified by at least one method, was used to quantify congruence between ortholog group prediction methods.
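The pair-based congruence measure described above can be sketched as follows. This is a minimal illustration, assuming ortholog groups are available as a mapping from group name to protein identifiers; the function names, the toy group assignments, and the `focal` set of proteins standing in for fai-identified epa loci are all hypothetical and not taken from the zol codebase.

```python
from itertools import combinations


def ortholog_pairs(groups):
    """Return the set of unordered protein pairs that share an ortholog group."""
    pairs = set()
    for members in groups.values():
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def restrict_pairs(pairs, focal, require_both=False):
    """Keep pairs with at least one (or, optionally, both) proteins in the focal set."""
    if require_both:
        return {p for p in pairs if p[0] in focal and p[1] in focal}
    return {p for p in pairs if p[0] in focal or p[1] in focal}


def jaccard_index(pairs_a, pairs_b):
    """Shared pairs divided by pairs reported by at least one method."""
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# Toy, hypothetical group assignments from two methods.
method_1 = {"OG1": ["p1", "p2", "p3"], "OG2": ["p4", "p5"]}
method_2 = {"HOG_a": ["p1", "p2"], "HOG_b": ["p3", "p4", "p5"]}
focal = {"p1", "p2", "p3"}  # proteins assumed to lie in detected loci

pairs_1 = ortholog_pairs(method_1)
pairs_2 = ortholog_pairs(method_2)
at_least_one = jaccard_index(restrict_pairs(pairs_1, focal),
                             restrict_pairs(pairs_2, focal))
both = jaccard_index(restrict_pairs(pairs_1, focal, require_both=True),
                     restrict_pairs(pairs_2, focal, require_both=True))
print(at_least_one, both)
```

Restricting to pairs with both proteins in the focal set removes disagreements that involve proteins outside the loci, which is why the two variants can give different values for the same pair of methods.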

Comprehensive and tailored usages of fai and zol for finding epa in Enterococcus: Based on prior comparative analyses that had shown that gene conservation and gene order can be slightly variable between epa loci from E. faecalis and E. faecium [99, 100], we retained the default syntenic similarity threshold of candidate gene cluster matches in target genomes to the query in fai as 0.0. In addition, we relaxed the minimum percentage of query proteins needed to report a homologous instance of the epa locus to 10%. Instead, we required the presence of 50% of key epa proteins found in both E. faecalis and E. faecium, defined as epaABCDEFGHLMOPQR, for the identification of valid homologous instances of the epa locus. The E-value cutoff to determine presence for the key epa proteins was also relaxed from 1e-20 to 1e-10 to be inclusive of shorter genes and allow for higher levels of sequence divergence across the Enterococcus genus. To gather auxiliary genes flanking the core epa region in target genomes, we further requested the inclusion of CDS features found within 20 kb of the boundary genes in detected instances of the epa locus within the resulting GenBank files produced by fai. A phylogenetic heatmap was constructed for the presence of the epa locus across a species tree using species representative genomes, selected based on largest assembly N50, where the values of the heatmap corresponded to the maximum percent identity of a query protein to its best match in target genomes. Because EF2173 and EF2185 are identical transposases, they were shown as one column in the heatmap. The species tree was constructed using GToTree [101] using HMMs for proteins regarded as largely single-copy core to the phylum Bacillota. The phylogenetic heatmap visual was created using iTOL [102].
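The combined reporting criteria described above (a relaxed overall proportion of query proteins plus a stricter requirement on key proteins) can be illustrated with a short sketch. This is not fai's implementation: the function name, the input format (a mapping from query protein to best-hit E-value), and the toy query size are assumptions made for illustration only.

```python
# Key epa proteins as listed in the text (epaABCDEFGHLMOPQR).
KEY_EPA = ["epa" + letter for letter in "ABCDEFGHLMOPQR"]


def passes_criteria(best_evalues, query_size, key_proteins,
                    min_query_prop=0.10, min_key_prop=0.50,
                    evalue_cutoff=1e-10, key_evalue_cutoff=1e-10):
    """Hypothetical check of whether a candidate segment would be reported.

    best_evalues maps each query protein with a hit in the candidate
    segment to the E-value of its best hit.
    """
    detected = {q for q, e in best_evalues.items() if e <= evalue_cutoff}
    key_detected = {
        q for q in key_proteins
        if best_evalues.get(q, float("inf")) <= key_evalue_cutoff
    }
    enough_query = len(detected) / query_size >= min_query_prop
    enough_key = len(key_detected) / len(key_proteins) >= min_key_prop
    return enough_query and enough_key


# Toy candidates: the query size of 40 is an arbitrary illustrative value.
candidate_ok = {name: 1e-30 for name in KEY_EPA[:7]}    # 7 of 14 key proteins
candidate_weak = {name: 1e-30 for name in KEY_EPA[:6]}  # 6 of 14 key proteins
print(passes_criteria(candidate_ok, 40, KEY_EPA))
print(passes_criteria(candidate_weak, 40, KEY_EPA))
```

Separating the two thresholds lets a search tolerate loss of auxiliary genes across a divergent genus while still demanding that the conserved core of the locus be present.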

From inspection of fai's resulting XLSX spreadsheet, zol's parameters were adjusted to relax identity and coverage thresholds for assessing protein pairs for orthology prior to MCL clustering to 20% and 25%, respectively. Identical processing was performed for the full set of epa loci and epa loci from only species representative genomes. During the comprehensive processing of all high-quality epa loci identified, one instance was dropped during zol analysis despite meeting requirements because all CDS features in it were found near scaffold edges and, by default, such features are not used in zol to aid more accurate inference of ortholog groups and assessment of their sequence variation. A third run of zol was performed using identical settings and all the gene cluster instances but leveraging the dereplication and reinflation options to showcase how the combination of the options can reduce the runtime needed for comprehensive processing. For dereplication of gene clusters, alignment fraction was increased from the default of 95% to 99% and MCL was used for clustering to gather more resolute representative gene clusters. Major ortholog groups determined between the comprehensive and the dereplication + reinflation runs were found to be similarly conserved based on matching to known epa genes.

Phylogenetic assessment of glycosyltransferase orthology predictions: Proteins from ortholog groups determined by zol analysis of species representative genomes were extracted based on whether the ortholog group was annotated as featuring the keywords ‘glycosyl’ and ‘transferase’ in Pfam protein domain annotations [85]. Two additional ortholog groups were included and featured the Pfam domain ‘Bacterial sugar transferase’, including epaR, which is also regarded as a glycosyltransferase [98]. The comprehensive set of glycosyltransferases was next aligned using MUSCLE with the default align mode [61]. Filtering of the alignment was next performed using trimal with options ‘-keepseqs -gt 0.9’ to filter sites composed largely of gaps, and the alignment was further filtered to remove sequences which were composed of > 10% gaps or ambiguous characters (‘X’). IQ-TREE [103] was used to construct a maximum-likelihood phylogeny with ModelFinder limited to the WAG and LG substitution models. The phylogeny was visualized using iTOL [102] with classifications for ortholog groups most closely matching E. faecalis V583 epa glycosyltransferases marked on leaves. Ortholog groups were assigned to specific epa gene designations based on sequence alignment of their consensus sequences to E. faecalis V583 epa-associated proteins. Best matching ortholog groups for each E. faecalis V583 epa glycosyltransferase were identified based on E-value.
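The sequence-level filtering step applied after site trimming can be sketched as below. This is an illustrative stand-in rather than the script used in the study: the function names and the toy alignment are hypothetical, and the alignment is assumed to be held in memory as a mapping from sequence name to aligned sequence.

```python
def gap_or_ambiguous_fraction(aligned_seq):
    """Fraction of alignment positions that are gaps ('-') or ambiguous ('X')."""
    if not aligned_seq:
        return 1.0  # treat an empty sequence as entirely uninformative
    return sum(ch in "-X" for ch in aligned_seq.upper()) / len(aligned_seq)


def filter_alignment(records, max_fraction=0.10):
    """Drop sequences composed of more than max_fraction gaps or 'X' characters."""
    return {
        name: seq for name, seq in records.items()
        if gap_or_ambiguous_fraction(seq) <= max_fraction
    }


# Toy trimmed alignment with hypothetical sequence names.
alignment = {
    "seq_kept": "MKT-AILLVA",     # 1 of 10 positions is a gap (10%)
    "seq_dropped": "MK--XILLVA",  # 3 of 10 positions are gaps or 'X' (30%)
}
filtered = filter_alignment(alignment)
print(sorted(filtered))
```

Raising `max_fraction` to 0.20 reproduces the looser cutoff described for the larger E. faecalis analysis.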

Large-scale evolutionary investigations of epa loci from E. faecalis

The full set of epa loci identified by fai in E. faecalis genomes was processed through zol, requesting retention of only complete instances that were also distant from scaffold edges. For projection of conservation, Tajima's D, and sequence entropy statistics onto genes for the epa locus in E. faecalis V583, sequence alignment was used to identify the best matching ortholog groups based on E-value. For the identical transposases, EF2173 and EF2185, data from the same ortholog group was used.
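As a rough illustration of a sequence entropy statistic, the sketch below computes the mean per-column Shannon entropy of an alignment. The exact definition used by zol is not given in this section, so ignoring gaps, using log base 2, and averaging across columns are all assumptions, as are the function names.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((k / total) * log2(k / total) for k in counts.values())


def mean_alignment_entropy(seqs):
    """Average column entropy across an alignment of equal-length sequences."""
    columns = list(zip(*seqs))
    return sum(column_entropy(col) for col in columns) / len(columns)


# Toy alignment: the first column is invariant, the second is split evenly.
print(mean_alignment_entropy(["AA", "AT"]))
```

Under these assumptions, an invariant column contributes zero and a column split evenly between two residues contributes one bit, so lower averages indicate stronger sequence conservation.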

Investigation of glycosyltransferase phylogenetic diversity: A similar phylogeny of glycosyltransferases was constructed for the E. faecalis analysis as was done for the investigation of epa glycosyltransferases across species representatives of Enterococcus. Glycosyltransferase ortholog groups were identified based on Pfam domains featuring the keywords ‘glycosyl’ and ‘transferase’ or because they matched epa genes regarded as glycosyltransferases in prior studies [98]. To accommodate the larger number of sequences: (i) only ortholog groups found in > 1% of epa loci instances were regarded, (ii) MUSCLE [61] super5 mode was used for alignment, and (iii) FastTree 2 [54] was used for approximate maximum-likelihood phylogeny construction. After trimal-based filtering of sites, only sequences featuring greater than 20% gaps or ambiguous characters (‘X’) were removed, so as to retain epaA in the final alignment prior to phylogeny construction.

Results

fai and zol allow for the rapid inference of gene cluster orthologs across diverse genomes

The zol suite consists of three major programs: prepTG (prepare target genomes), fai (find additional instances), and zol (zoom on locus) (Fig. 1A). First, prepTG and fai can be run to process a set of target genomes and rapidly search for a query gene cluster within them, respectively. Afterwards, zol can perform reliable and efficient context-limited inference of ortholog groups across homologous gene cluster instances identified using a flexible InParanoid-type algorithm [6]. For each ortholog group, zol will further compute evolutionary statistics, such as Tajima's D [81], and functional annotations, using several, diverse databases suitable for a variety of gene clusters, including those specific to phages [104], virulence elements [74], and BGCs [76]. Ultimately, zol will summarize data in a table report where each row corresponds to a distinct ortholog group. This report is automatically color formatted and provided as an XLSX spreadsheet to allow for easy interpretation of the data, which can span thousands of gene cluster instances.

Figure 1.

Overview of the zol suite. (A) A cartoon schematic of how prepTG, fai, and zol, as well as visualization tools cgc and cgcg, are integrated. Certain statistics in the zol report will not be calculated if not enough instances of an ortholog group are identified, resulting in non-available (NA) values being reported. Squiggles correspond to arbitrary text pertaining to functional annotation information, etc. (B) An overview of steps in the core programs in the suite: prepTG, (C) fai, and (D) zol algorithms and workflows. Inputs and outputs for the programs are indicated with bolder coloring.

The program prepTG aims to provide a convenient interface to transform genomic or metagenomic datasets into a format ready for targeted gene cluster detection using fai (Fig. 1B). To promote consistency in gene calling across target genomes, we have incorporated computationally light-weight dependencies for de novo gene prediction in bacterial genomes [60, 66] and protein-mapping in eukaryotic genomes [62] within prepTG. Options are also available to download pre-built databases of distinct representative genomes for 18 commonly studied bacterial taxa [68] or to build comprehensive databases for any genus or species in the latest release of the Genome Taxonomy Database (GTDB) [69].

fai has two key features which are absent in most existing methods for targeted gene cluster detection (Fig. 1C; Supplementary Table S2; Supplementary Text). First, it has an option to automatically filter secondary or paralogous instances of query gene clusters identified for each target genome to promote appropriate evolutionary analysis of protein ortholog groups downstream when using zol. This option is off by default, however, to prioritize more comprehensive identification of gene clusters. Second, fai implements a mode for searching for gene clusters in draft quality genomes, MAGs, or unbinned metagenomic assemblies, where gene clusters might be fragmented across multiple scaffolds. When this mode is activated, fai relaxes requirements for reporting a gene cluster as present in a genome or metagenome if multiple homologous gene cluster regions are identified near scaffold edges in a target genome and assesses whether reporting criteria are met in unison across such instances (Supplementary Fig. S1). Similar to prepTG, fai also aims to provide convenience for users and can accept query gene clusters in different formats to ease searching for gene clusters and genomic islands cataloged in databases such as ICEberg [105], MIBiG [76], or IslandViewer [106]. Query gene clusters can be provided as a coordinate along a reference genome, in GenBank format, or as a set of proteins in FASTA format. In addition, to simplify conservation and novelty assessment of a single isolate's BGCs, phages, and plasmids relative to other genomes from the same genus or species, specialized wrapper programs of fai are also provided within the zol suite (Supplementary Fig. S2).

zol will infer ortholog groups for proteins across homologous gene clusters and then construct a tabular report with information on conservation, evolutionary trends, and annotation for each individual ortholog group (Fig. 1D). To make annotated reports generated by zol more comprehensive for different types of gene clusters, several databases have been included, such as VOGDB [107], VFDB [74], ISFinder [77], and CARD [75]. In addition, zol incorporates HyPhy [59] as a dependency and calculates various evolutionary statistics. Ultimately, beyond high-throughput inference of ortholog groups across diverse genomic datasets, the rich tabular report produced by zol provides complementary information to figures generated by comparative visualization software such as clinker [48], CORASON [46], gggenomes [108], and Easyfig [109]. Importantly, such visualization software, which display pairwise relationships between homologous gene cluster instances, are inherently limited in the scale at which they can be applied. Thus, we also provide the tools cgc and cgcg in the zol suite to transform tabular reports from zol into static or interactive visualizations summarizing evolutionary trends across potentially thousands of gene cluster instances.

Several options and distinct modes are available in zol. One key argument in zol enables ‘domain mode’, where CDS features in input gene cluster instances are partitioned into domain and inter-domain subunits and used to infer domain-resolution ortholog groups. This mode is especially helpful for researchers interested in BGCs, where key biosynthetic genes, such as polyketide synthases, can be large and feature modular domains that frequently recombine or undergo gene conversion [46, 110, 111]. In addition, zol features the ability to dereplicate gene clusters directly using skani [63], which was recently shown to be more reliable at estimating average nucleotide identity between genomes of variable contiguity relative to comparative methods. Dereplication can allow for more appropriate inference of evolutionary statistics to overcome availability or sampling biases in genomic databases [112]. It can also be used to subset distinct representative gene cluster instances to make investigation using visualization software more tractable. Another important ability of zol is a mode where users can provide a handful of known instances for a gene cluster to estimate optimal parameters to search for additional instances of the gene cluster using fai. We applied this functionality of zol on sets of homologous BGCs and phages to determine distributions for search parameters in fai which users could consult as priors (Supplementary Fig. S3; Supplementary Text).

Finally, zol allows for comparative investigations of gene clusters based on taxonomic or ecological groupings [113–115]. For instance, users can designate a subset of gene clusters as belonging to a specific population to allow zol to calculate ortholog group conservation across just the focal set of gene clusters. In addition, zol will compute the fixation index [84], FST, for each ortholog group to assess gene flow between the focal and complementary sets of gene clusters.
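To make the fixation index concrete, the sketch below implements one widely used sequence-based estimator, FST = 1 − Hw/Hb, where Hw is the mean number of pairwise differences within populations and Hb is the mean between populations. Whether zol uses this exact estimator is not stated in this section, so the choice of estimator, the function names, and the toy alleles are assumptions; sequences are assumed to be aligned, gap-free, and of equal length.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(seqs):
    """Mean pairwise differences among sequences of a single population."""
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def fst(focal, other):
    """Sequence-based FST = 1 - Hw / Hb; needs >= 2 sequences per population."""
    if len(focal) < 2 or len(other) < 2:
        return float("nan")
    h_within = (mean_within(focal) + mean_within(other)) / 2
    between = [hamming(a, b) for a, b in product(focal, other)]
    h_between = sum(between) / len(between)
    if h_between == 0:
        return float("nan")  # no differences between populations
    return 1 - h_within / h_between


# Toy alleles for one ortholog group in a focal and a complementary set.
focal_alleles = ["AAAA", "AAAA", "AAAT"]
other_alleles = ["TTTT", "TTTT", "TTTA"]
print(fst(focal_alleles, other_alleles))
```

Values near 1 arise when alleles within each set are nearly identical but differ consistently between sets, which is the pattern interpreted as restricted gene flow between the focal and complementary gene clusters.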

Longitudinal tracking of a virus within lake metagenomic assemblies

Metagenomic datasets represent a large reservoir of underexplored sequence space [116, 117]. To demonstrate the ability of the zol suite to identify and investigate gene clusters in metagenomes, we applied it to track a virus in a longitudinal metagenomic dataset profiling a lake's microbiome over space and time [90].

We first identified large (20 kb) viruses, that were also predicted to represent circular molecules, across a subset of the metagenomic assemblies corresponding to the earliest sampling date [64]. Afterwards, clustering based on the sequence and syntenic similarity of protein domains led to the identification of a ∼36 kb highly conserved virus in two of the metagenomes sampled from lower lake depths.

All 16 metagenomic assemblies, spanning five distinct sampling timepoints and four distinct sampling depths, were processed through prepTG to identify coding sequences and construct a database ready to search for gene clusters using fai. GenBank files with coding sequence annotations for metagenomic assemblies generated by prepTG, amassing 27 Gb total in size, were further provided as input for cblaster makedb, which serves a similar role to prepTG in the cblaster suite to format genomic data for downstream gene cluster searches. However, cblaster makedb does not feature the ability to perform de novo gene-calling for either genomes or metagenomes and is not designed to accommodate the size of metagenomic assemblies. During database construction, cblaster makedb required around 30 Gb of memory, while prepTG needed less than 3 Gb of memory (Supplementary Fig. S4A).

Next, fai was used to perform a rapid, targeted search for this ∼36 kb Caudovirales virus across the full set of 16 metagenomes to identify additional instances of the virus. fai completed its search of the metagenomes, featuring >20 million proteins and 10.7 million contigs, in <4 min using 20 threads, performing equivalently to cblaster, run using similar settings as fai (Supplementary Fig. S4B). Of the 16 total metagenomes, the virus was confidently found in ten metagenomes, including all nine metagenomes surveying anoxic conditions (P < 0.001; one-sided Fisher's exact test; Fig. 2A). This is concordant with inferences for the host for the virus being Rhodoferax, which are purple bacteria featuring species classified as anaerobic photoheterotrophs [90, 118, 119]. In addition, Rhodoferax classified MAGs from the metagenomic dataset were exclusively obtained from anoxic conditions [90].
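The reported significance can be checked with a short worked example. The 2×2 table below is inferred from the counts in the text (nine anoxic metagenomes, all with the virus; seven other metagenomes, one with the virus); the exact table the authors tested is not shown, so this layout is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    total = comb(n, col1)
    p = 0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(n - row1, col1 - k)
    return p / total


# Rows: anoxic vs. other conditions; columns: virus found vs. not found.
p_value = fisher_one_sided_greater(9, 0, 1, 6)
print(p_value)
```

With these margins only one table is as extreme as the observed one, giving 7/8008 ≈ 0.00087, which agrees with the reported P < 0.001.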

Figure 2.

Targeted viral detection in metagenomes using fai. (A) Total metagenomes from a single site in Lake Mendota across multiple depths and timepoints from Tran et al. (2023) were investigated using fai for the presence of a virus found in two of the three earliest microbiome samplings (red box; samples from 7/24). The presence of the virus is indicated by a virus icon. * denotes a metagenome sample where the virus was partially detected based on more sensitive searching criteria using fai. Metagenome samples are colored according to whether they corresponded to oxic, oxycline, or anoxic conditions. The shallowest sampling depths varied for different dates and were consolidated as a single row corresponding to a sampling depth of either 5 or 10 meters. (B) A depiction of the pangenome of the virus created using cgcg is shown. Nodes correspond to ortholog groups with sizes indicating the median size in bp divided by 100. Only ortholog groups found in ≥25% of virus instances are shown. Coloring, which can be configured, for this figure corresponds to conservation of ortholog groups across instances of the virus. Edges and arrows show the consensus order of ortholog groups, with border colors of nodes indicating the consensus direction of the ortholog groups. Edges which are gold coincide with the major path most commonly observed across the 10 instances of the virus. Functional annotations were manually added to the figure. (C) A zoom-in of a region in the pangenome graph showing the interactive capabilities of cgcg, implemented via the gravis library, to allow users to explore zol results in a network visual.

Because low-abundance organisms might lack proper representation in metagenomic assemblies, especially in complex microbiomes [120, 121], we reran fai using more sensitive parameters to assess whether we missed additional instances of the virus. When run requesting draft mode and lowering thresholds needed for detection, fai was able to detect homologous segments of the virus in all additional metagenomes but took longer to run (Fig. 2A, Supplementary Table S3). Manual assessment of these instances suggested that most were likely distantly related sequences to the virus; however, one metagenomic sample, initially regarded as lacking the virus, had 94.6% of the query proteins for the virus at roughly 90.3% average amino acid identity across 114 partial segments. This suggests that the virus, or a diverged variant of it, might exist at lower abundances in this metagenomic sample and potentially other samples. Thus, users should carefully consider the completeness of metagenomic assemblies they are searching and also consider read-based alternatives for gene cluster detection [122–124].

To investigate how the gene repertoire of the virus evolved over time, we next applied zol and cgcg on the high-quality instances detected using our initial search with fai. These analyses revealed that 44 (69.8%) of the 63 total distinct ortholog groups were core to all instances of the virus across the ten metagenomes from which instances were confidently identified (Fig. 2B; Supplementary Table S4). These genes were further highly conserved in sequence over the course of 2.5 months (Supplementary Table S4). Furthermore, 16 of the 63 ortholog groups were not observed in the query viruses from the earliest sampling date, suggesting the potential acquisition or duplication of genes in the virus during the span of sampling at the lake.

Investigating population-level and species-wide evolutionary trends of BGCs in the eukaryotic species Aspergillus flavus

Low sensitivity for gene cluster detection in eukaryotic genome assemblies can arise from their incompleteness, leading to gene clusters being fragmented across multiple scaffolds [70, 125], as well as challenges in ab initio gene prediction due to alternative splicing [126, 127]. Therefore, many gene cluster detection software are either specific for bacterial genomes or require coding sequence annotations for eukaryotic genomes to be provided by the user. To overcome such challenges to user application, we integrated miniprot [62] into prepTG which allows for mapping high-quality protein annotations from a reference genome to the remainder of the genomes available for a species or genus. We showcase the ability of prepTG and fai to simplify the reliable identification of gene clusters in eukaryotic genomes by using them to find instances of two BGCs across genomes belonging to the fungal species Aspergillus flavus.

The genus of Aspergillus is a source of several natural products, including aflatoxins, a common and economically impactful contaminant of food [128]. The genus also contains species that are model organisms for studying fungal secondary metabolism [36, 129, 130]. Examination of the secondary metabolome of A. flavus has revealed that different clades or populations can exhibit variability in their metabolite production despite high conservation of core BGC genes encoding enzymes for synthesis of these metabolites [39, 131]. For instance, population B A. flavus were identified as producing a greater abundance of the insecticide leporin B relative to populations A and C [39, 132]. We showcase zol's ability to aid comparative analysis of gene clusters from different populations through application to the leporin BGC. We further show how zol can detect variation in sequence conservation for different genes from the aflatoxin BGC and be inclusive of genes present in target genome annotations but missing in the query gene cluster, allowing for comprehensive profiling of BGC auxiliary content.

Based on read alignment to a reference genome, the leporin cluster was recently identified to be a core component of the A. flavus genome [39]. However, a restricting factor in the direct prediction of gene clusters in A. flavus assemblies is the lack of gene annotations, with only 11 (5.1%) of 216 genomes for the species in NCBI’s GenBank database having coding sequence predictions (Fig. 3A). Therefore, we mapped high-quality protein predictions for a reference A. flavus genome [91] to the remainder of the 216 genomes available for the species using prepTG. Running fai in ‘draft mode’ led to the identification of the leporin BGC within 212 (98.1%) assemblies, consistent with the prior read mapping-based investigation suggesting that the BGC was core to the species [39]. Although cblaster [47] run using annotated genomes from prepTG was similarly sensitive to fai, the web-application for running cblaster, CAGECAT [96], was limited to genomes with protein coding annotations available on NCBI (Fig. 3B). We also investigated the ability of non-targeted approaches for BGC detection to identify the leporin BGC by applying antiSMASH followed by BiG-SCAPE for clustering related BGCs and matching them to characterized BGCs in the MIBiG database. When this approach was applied using GenBank files prepared by prepTG, the gene cluster clan containing the leporin BGC was found in all A. flavus genomes provided as input. However, when antiSMASH was run using de novo gene prediction in antiSMASH based on GlimmerHMM [133] with Cryptococcus gene annotation models, recovery of the leporin BGC was limited (Fig. 3B). To check that prepTG’s protein-mapping approach was not causing false positive detection of the leporin BGC, we manually assessed visuals produced by fai and applied a comprehensive pipeline for fungal gene calling, funannotate, on all genomes, which led to similar conclusions that the leporin BGC was largely core to A. flavus (Fig. 3B).
Furthermore, on average, funannotate took 6.1 hours and four threads to perform gene calling per sample, whereas miniprot and prepTG processing completed in just six minutes per sample using one thread.

Figure 3.

Evolutionary trends of common BGCs in A. flavus. (A) The proportion of 216 A. flavus genomes from NCBI’s GenBank database with coding-sequence predictions available. (B) Comparison of the sensitivity of prepTG and fai with alternate assembly-based approaches for detecting the leporin BGC. The dashed vertical lines indicate the number of genomes with CDS features available on NCBI (n = 11; pink) and the total number of genomes assessed (n = 216; violet), respectively. Dark gray indicates instances identified by CAGECAT/cblaster or fai or as belonging to the same GCF as the reference leporin BGC from MIBiG by antiSMASH and BiG-SCAPE analysis. Lighter gray indicates the number of similar BGCs identified by BiG-SCAPE as belonging to the same clan but not to the same GCF as the reference leporin BGC. A schematic of the (C) leporin and (D) aflatoxin BGCs is shown with genes present in ≥10% of samples shown in consensus order and relative directionality. Coloring of genes in (C) corresponds to FST values and in (D) to Tajima's D values, as calculated by zol. Vertical bars in the legends, at (C) 0.92 and (D) −1.06, indicate the mean values for the statistics across genes in the BGC. *For the leporin BGC, lepB corresponds to an updated open-reading frame (ORF) prediction by Skerker et al. 2021 which was the combination of AFLA_066860 and AFLA_066870 ORFs in the MIBiG entry BGC0001445 used as the query for fai. For the aflatoxin BGC, ORFs which were not represented in the MIBiG entry BGC0000008 but predicted to be within the aflatoxin BGC by mapping of gene-calls from A. flavus NRRL 3357 by Skerker et al. 2021 are noted in gold text. The major allele frequency distributions are shown for (E) pksA and (F) aflX, which depict opposite trends in sequence conservation according to their respective Tajima's D calculations.

Of the 212 genomes with the leporin BGC identified by fai, 202 contained instances that were high-quality and not near scaffold edges. This set of 202 instances of the gene cluster was further investigated using zol with options to perform comparative investigation of BGC instances from A. flavus population B genomes to instances from other populations. High sequence conservation was observed for all genes in the leporin gene cluster as previously reported [39] (Supplementary Table S5). Further, alleles for genes in the BGC from population B genomes were generally more similar to each other than to alleles from outside the population, as indicated by high FST values (>0.85 for 9 of 10 genes) (Fig. 3C; Supplementary Table S5). While regulation of secondary metabolites in Aspergillus is complex [134], zol analysis showed that essential genes for leporin biosynthesis [132], in comparison to auxiliary genes, exhibit greater sequence conservation 100 bp upstream of their exonic coordinates (Supplementary Fig. S5).

fai and zol were also applied to the BGC encoding aflatoxin across A. flavus [135] (Supplementary Table S6). Similar to the leporin BGC, the aflatoxin BGC was highly prevalent in the species and found in 71.8% of genomes. However, in contrast to the leporin BGC, the aflatoxin BGC contained several genes with positive Tajima's D values, indicating greater sequence variability for these coding regions across the species (Fig. 3D). One of the genes with a positive Tajima's D value was aflX, which has been shown to influence conversion of the precursor versicolorin A to downstream intermediates in the aflatoxin biosynthesis pathway [136] (Fig. 3F). An abundance of sites with mid-frequency alleles in the oxidoreductase encoding gene could represent granular control for the amount of aflatoxin relative to intermediates produced. The polyketide synthase gene pksA had the lowest Tajima's D value of −2.5, which suggests it is either highly conserved or under purifying selection (Fig. 3E). In addition, because the reference proteome used to infer genomic coding regions was constructed recently [91], fai and zol detected several highly conserved genes within the aflatoxin BGC that are not represented in the original reference gene cluster input for fai [76]. This includes a gene annotated as a noranthrone monooxygenase and recently characterized as contributing to aflatoxin biosynthesis [137, 138] (Fig. 3D). We further investigated the aflatoxin BGC using zol run in ‘domain mode’ and found that another one of these genes, hypE, which is also suspected to participate in aflatoxin biosynthesis [139], has a domain likely evolving under balancing selection (Supplementary Table S6).
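The link between allele frequencies and the sign of Tajima's D can be seen with a compact sketch of the standard statistic (Tajima 1989). This is not zol's code: the function name and toy alignments are hypothetical, sequences are assumed to be aligned, gap-free, and of equal length, and the statistic is returned as undefined for fewer than four sequences or no segregating sites.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned, gap-free sequences; None when undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is zero or undefined for n < 4
    seg_sites = sum(len(set(col)) > 1 for col in zip(*seqs))
    if seg_sites == 0:
        return None
    n_pairs = n * (n - 1) / 2
    # Mean number of pairwise differences (pi).
    pi = sum(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    ) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments: rare (singleton) variants vs. mid-frequency variants.
rare_variants = ["AAAA", "TAAA", "ATAA", "AATA"]
mid_frequency = ["AAA", "AAA", "TTT", "TTT"]
print(tajimas_d(rare_variants), tajimas_d(mid_frequency))
```

An excess of rare variants pulls the statistic negative, as described for pksA, whereas sites with mid-frequency alleles push it positive, as described for aflX.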

Identification of the enterococcal polysaccharide antigen and assessment of context restricted orthology inference

To demonstrate the ability of zol and fai to reliably identify ortholog groups across multiple species and thousands of genomes, we used the tools to assess the distribution of the enterococcal polysaccharide antigen (Epa) and its individual genes across the diverse genus of Enterococcus. Because previous comparative genomic investigations have been performed between epa loci from different species [99, 100], we also showcase how such prior insight can be used to tailor parameters in fai for searching for the locus across the full genus and how results from fai can be assessed for appropriate selection of parameter values in zol.

The Epa is a signature component of the cellular envelope of multiple species within Enterococcus [99, 100, 140, 141] and has mostly been characterized in the species Enterococcus faecalis [97, 98, 140, 142, 143]. While molecular studies have provided evidence that the locus contributes to enterococcal host colonization [143], evasion of immune systems [144], and sensitivity to antibiotics [145] and phages [145, 146], it was only recently that the structure of Epa was resolved and a model for its biosynthesis and localization formally proposed [98]. A homologous instance of the epa locus was identified in the other prominent pathogenic species from the genus, Enterococcus faecium [99, 100, 147]; however, the prevalence and conservation of epa across the diverse genus of Enterococcus [148–150] remains poorly studied.

We first assessed the performance of fai and zol to identify epa loci across representative genomes for each of the 92 species of Enterococcus in GTDB R207 [69] and subsequently delineate protein ortholog groups relative to other methods. Specifically, we compared the runtime and ortholog group predictions of fai and zol to OrthoFinder [71], an established software for de novo multi-species ortholog group delineation, and eggNOG-mapper [18], which maps proteins to a pre-computed database of ortholog groups. The targeted approach taken by fai and zol was the fastest of the three methods and able to identify ortholog groups for the epa locus in less than two minutes (Fig. 4A). Ortholog pairs determined by fai and zol were generally concordant with orthology predictions by the two alternative methods (Fig. 4B). While concordance dropped between OrthoFinder and the other methods when the tools were applied to lower-quality assemblies, fai and zol still exhibited high overlap for pairs of orthologs with eggNOG-mapper (Fig. 4C). We further performed evolutionary simulation of the epa locus, allowing for sequence gains and losses, and assessed context-limited orthology inference by zol, using a variety of parameters, as well as OrthoFinder (Supplementary Fig. S6; Supplementary Text). This test showed that zol, when run using certain options, can, like OrthoFinder, appropriately cluster truncated or fused forms of ancestrally related genes, even when no core sequence exists across all instances. However, when run using default settings, zol was found to produce fine-grained ortholog groups to conservatively avoid false positive orthology prediction.

Figure 4.


Searching for the epa locus across the diverse genus of Enterococcus. (A) Overview of the time needed to run orthology/homology inference methods on the 92 genomes with the highest N50 for each distinct Enterococcus species. OrthoFinder and eggNOG-mapper were run at the genome-wide scale, while fai was used to first identify genomic regions corresponding to the epa locus from E. faecalis V583 and zol was subsequently applied to determine ortholog groups. The asterisk denotes that manual assessment or filtering of homologous gene clusters identified by fai is encouraged and thus additional time is often required for them. The Jaccard indices between ortholog pair sets identified by fai & zol, OrthoFinder, and eggNOG-mapper are shown following their application to representative genomes from GTDB R214 with the (B) highest N50 and (C) lowest N50 for the 92 different species. The upper-right triangles show values between methods when strictly considering ortholog pairs which are possible for zol to infer from targeted detection of epa by fai. The lower-left triangles show values between methods when considering ortholog pairs with only one protein needing to be found in an epa region identified by fai – thus allowing for ortholog pairs between epa proteins and other proteins across genomes by OrthoFinder and eggNOG-mapper. (D) The distribution of the epa locus, based on criteria used for running fai, is shown across a species phylogeny for 92 genomes representative of distinct Enterococcus species in GTDB R214. The coloring of the heatmap corresponds to the percent identity of the best matching protein from each genome to the query epa proteins from E. faecalis V583. Note, the representative genome for E. faecalis (GCA_902166685.1) is not V583 and certain strain-variable genes are not found for it. (E) A schematic of the epa gene cluster from E. faecalis V583 (from EF2164 to EF2200) with glycosyltransferase encoding genes shown in color. (F) A maximum-likelihood phylogeny of zol-identified ortholog groups corresponding to glycosyltransferases in epa loci across Enterococcus. (G) The distribution of different glycosyltransferase ortholog groups across the four major clades of Enterococcus is shown. For D and F, the tree scales correspond to the number of amino acid substitutions per site along the alignments used for phylogeny construction.
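The concordance measure used in panels B and C can be sketched as follows: each method's ortholog groups are expanded into sets of unordered protein pairs, and the Jaccard index is the size of the intersection of two pair sets divided by the size of their union. The snippet is a minimal illustration; the protein identifiers are hypothetical, and the filtering applied in the figure (restricting pairs to those involving proteins in fai-identified epa regions) is not reproduced here.

```python
from itertools import combinations


def ortholog_pairs(groups):
    """Expand ortholog groups into a set of unordered protein pairs."""
    pairs = set()
    for members in groups:
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def jaccard_index(pairs_a, pairs_b):
    """Jaccard index of two pair sets; defined as 1.0 when both are empty."""
    union = pairs_a | pairs_b
    if not union:
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical ortholog groups from two methods ("genome|protein" identifiers).
method_one = [["g1|p1", "g2|p1", "g3|p1"]]
method_two = [["g1|p1", "g2|p1"], ["g3|p1"]]
score = jaccard_index(ortholog_pairs(method_one), ortholog_pairs(method_two))
```

Here the first method implies three ortholog pairs and the second implies one, which is shared, so the index is 1/3. Splitting a group therefore lowers concordance even when every retained pair is correct.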

Next, to properly and comprehensively assess the distribution of epa across the entire set of 5291 genomes in GTDB classified as one of the 92 Enterococcus species [69], we applied fai with more careful consideration of parameter values and requested more advanced features for gene cluster detection. A sensitive search criterion was selected based on prior comparative genomics for the locus [99, 100] and its coordinates along the E. faecalis V583 genome as a reference [97, 98]. For detection of epa orthologous regions, co-location of at least seven of the 14 epa genes previously identified as conserved in both E. faecalis and E. faecium was required. The default threshold for syntenic conservation of homologous instances to the query gene cluster was disregarded to increase sensitivity for the detection of epa in enterococcal species more distantly related to E. faecalis. In addition, key proteins were specified and the length of the flanking context to include as part of the loci was expanded. Using these criteria, 5085 of the genomes assessed were found to possess an epa locus, with phylogenomic investigations further revealing that the locus is highly conserved in three of the four major clades of Enterococcus (Fig. 4D; Supplementary Table S7).
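The co-location requirement described above can be pictured with a simplified sketch: hits to query proteins are grouped into candidate regions along a scaffold, and a region is retained only if it contains at least seven distinct key genes. This is a conceptual illustration rather than fai's implementation; the gap-based grouping, the `max_gap` parameter and the `key01` to `key14` labels are all hypothetical.

```python
def candidate_regions(hits, key_genes, min_key_hits=7, max_gap=5):
    """Group hits on one scaffold into regions and keep those with enough key genes.

    hits: iterable of (gene_index_on_scaffold, matched_query_gene) tuples.
    Returns a list of (first_gene_index, last_gene_index, n_distinct_key_genes).
    Illustrative only; this is not the detection algorithm used by fai.
    """
    regions, current = [], []
    for index, query in sorted(hits):
        # Start a new region when the distance to the previous hit is too large.
        if current and index - current[-1][0] > max_gap:
            regions.append(current)
            current = []
        current.append((index, query))
    if current:
        regions.append(current)

    kept = []
    for region in regions:
        found = {query for _, query in region if query in key_genes}
        if len(found) >= min_key_hits:
            kept.append((region[0][0], region[-1][0], len(found)))
    return kept


# Hypothetical labels standing in for the 14 conserved epa genes.
key_genes = {f"key{i:02d}" for i in range(1, 15)}
# One dense region with eight key genes and one sparse region with three.
hits = [(10 + i, f"key{i + 1:02d}") for i in range(8)]
hits += [(100 + i, f"key{i + 1:02d}") for i in range(3)]
regions = candidate_regions(hits, key_genes)
```

Only the dense region passes the threshold of seven distinct key genes; the sparse region, which could represent scattered paralogous matches, is discarded.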

Based on fai's reports, we realized that to achieve optimal clustering for ortholog groups across the diverse set of epa loci identified, we needed to lower the default thresholds for percent identity and coverage that protein pairs needed to exhibit for being considered as orthologs (Fig. 4D; Supplementary Table S7). We ran zol on both the full set of 5052 high-quality epa loci and only loci from species representative genomes. For the comprehensive analysis, zol was able to identify 14 ortholog groups as core or near-core, found in > 90% of loci instances (Supplementary Table S8). When provided 30 threads, zol completed in 25.02 h and had a maximum memory usage of 101.6 GB. The more restricted analysis of zol to investigate epa instances from 65 species representative genomes was to allow for assessing the quality of ortholog group predictions using phylogenetics (Supplementary Table S9). After applying zol on epa from species representative genomes, orthology predictions were assessed through construction of a maximum-likelihood phylogeny of epa associated glycosyltransferases. Ortholog groups which corresponded to glycosyltransferases from E. faecalis V583 were labelled on the phylogeny and confirmed to match distinct phylogenetic clades, which suggests their appropriate delineation (Fig. 4E and F). zol further identified several epa associated glycosyltransferase ortholog groups that were absent in the E. faecalis representative genome and other representative genomes from the E. faecalis clade (Fig. 4G). These distinct glycosyltransferases might impact the final structure or decoration of Epa in other Enterococcus species.
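The notion of core or near-core ortholog groups used above reduces to a simple proportion: the fraction of gene cluster instances in which a group is found, compared against a 90% cutoff. A minimal sketch, with hypothetical group and locus identifiers, is given below.

```python
def conservation(og_to_loci, n_loci):
    """Fraction of loci in which each ortholog group is present."""
    return {og: len(loci) / n_loci for og, loci in og_to_loci.items()}


def core_groups(og_to_loci, n_loci, threshold=0.9):
    """Ortholog groups found in more than `threshold` of loci (core or near-core)."""
    fractions = conservation(og_to_loci, n_loci)
    return sorted(og for og, frac in fractions.items() if frac > threshold)


# Hypothetical presence data for ten loci.
all_loci = [f"locus_{i}" for i in range(10)]
og_to_loci = {
    "OG_1": set(all_loci),       # present in 10/10 loci
    "OG_2": set(all_loci[:8]),   # present in 8/10 loci
    "OG_3": set(all_loci[:2]),   # present in 2/10 loci
}
core = core_groups(og_to_loci, n_loci=len(all_loci))
```

With these toy data only `OG_1` exceeds the 90% cutoff.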

zol identifies genetic diversity of epaX-like glycosyltransferases in E. faecalis

zol features several options related to the dereplication of input gene clusters to retain only distinct representative instances for orthology inference and other downstream analytics. Importantly, the application of these methods can substantially reduce zol's runtime and impact some of the evolutionary statistics computed (Supplementary Figs S7–S9; Supplementary Text). Whether dereplication is appropriate for a particular analysis should thus be carefully considered by users depending on their research aims. In particular, dereplication can impact investigations for highly sequenced bacterial taxa, including the opportunistic pathogen E. faecalis. For such pathogens, certain lineages, such as those commonly isolated at clinics, might be overrepresented in genomic databases, and the researcher may find it beneficial for the analysis to apply dereplication.

To showcase the scalability of zol and its ability to expand knowledge for even well-studied gene clusters, we applied it to high-quality, complete epa loci from 1232 E. faecalis genomes without dereplication. In accordance with prior studies [98, 99], zol was able to distinguish core and strain-variable patterns. The report from zol showed that one end of the locus corresponds to genes which are highly conserved and core to E. faecalis (epaA-epaR), whereas the other end contained strain-specific genes (Fig. 5A; Supplementary Table S10). Using zol, we further found that variably conserved genes exhibit high sequence dissimilarity, as measured using both Tajima's D and average sequence entropy, in comparison to the core genes of the locus (Fig. 5B and C). These statistics were robust to the application of dereplication and thus unlikely to be heavily impacted by well-sequenced lineages (Supplementary Figs S8 and S9).
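For readers less familiar with alignment-based diversity measures, the sketch below shows one common way to compute average sequence entropy (mean per-column Shannon entropy, in bits) together with the per-site major allele frequency from a multiple sequence alignment. The handling of gaps and the absence of any normalization are assumptions made for illustration and may differ from how zol computes these statistics.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in Counter(residues).values())


def average_entropy(alignment):
    """Mean per-column entropy across an alignment of equal-length sequences."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col) for col in columns) / len(columns)


def major_allele_frequencies(alignment):
    """Frequency of the most common non-gap residue at each alignment column."""
    frequencies = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        if not residues:
            frequencies.append(0.0)
            continue
        frequencies.append(Counter(residues).most_common(1)[0][1] / len(residues))
    return frequencies


# Hypothetical two-sequence alignment: first column invariant, second column split.
toy_alignment = ["AA", "AC"]
```

An invariant column contributes zero entropy and a major allele frequency of 1, whereas a column split evenly between two residues contributes one bit and a major allele frequency of 0.5.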

Figure 5.


High sequence diversity of epaX-like glycosyltransferases amongst E. faecalis. A schematic of the epa locus from E. faecalis V583 with evolutionary statistics, (A) conservation, (B) Tajima's D and (C) sequence entropy, gathered from the best corresponding ortholog group for each protein. Ortholog groups were inferred from zol investigation of 1232 epa loci from the species. Genes upstream of and including epaR were recently proposed to be involved in Epa decoration by Guerardel et al. 2020. ‘//’ indicates that the ortholog group was not single-copy in the context of the gene-cluster and calculation of evolutionary statistics for these genes was avoided (gray in panels B and C). Note, the same ortholog group was regarded for EF2173 and EF2185 which correspond to an identical ISEf1 transposase. The lengths of proteins in the locus schematic are the median lengths of the corresponding ortholog groups. (D) The major allele frequency is depicted across the alignment for the ortholog group featuring epaX. Sites predicted to be under negative selection by FUBAR, Prob(α > β) ≥ 0.9, are marked in red. (E) An approximate maximum-likelihood phylogeny of glycosyltransferase ortholog groups identified by zol which were found in > 1% of epa instances. Ortholog groups identified by zol are indicated by colored circular nodes with names of epa genes from E. faecalis V583 noted where possible. The number of leaves/proteins for each clade is provided for labeled ortholog groups. The tree scale corresponds to the number of amino acid substitutions per site along the input protein alignment used for phylogeny construction.

One ortholog group, corresponding to the glycosyltransferase epaX, exhibited substantially higher sequence variation than other epa associated glycosyltransferases (Fig. 5B and D). This finding was further validated through phylogenetic analysis of glycosyltransferases from the species, which highlighted the breadth of diversity observed for the epaX ortholog group relative to other epa associated glycosyltransferases (Fig. 5E).

Discussion

Here, fai and zol are introduced to enable large-scale evolutionary investigations of gene clusters in diverse taxa. Together these tools overcome current bottlenecks in computational biology to infer orthologous sets of genes at scale across thousands of diverse genomes and large metagenomic assemblies.

The set of input gene clusters for zol does not need to be produced by fai. cblaster [47] is another tool that can identify instances of a query gene cluster within a set of target genomes and extract them in GenBank format for downstream investigations using zol. For those lacking computational resources needed for fai analysis, cblaster offers remote searching of BGCs using NCBI’s BLAST infrastructure and non-redundant databases. More recently, CAGECAT [96], a highly accessible web-application for running cblaster, was also developed and can similarly be used to identify and extract gene cluster instances from genomes represented in NCBI databases. In contrast to these tools, prepTG and fai feature algorithms and options for users interested in: (i) identification of gene clusters in metagenomes, (ii) performing standardized gene annotation across target genomes, (iii) improved sensitivity for gene cluster detection in draft-quality assemblies, and (iv) automated filtering of secondary, or paralogous, matches to query gene clusters. In addition, users can apply zol to further investigate homologous sets of gene clusters identified from IslandCompare [151], BiG-SCAPE [46], or vConTACT2 [152] analyses, which perform comprehensive clustering of predicted genomic islands, BGCs, or viruses.

The application of fai to identify gene clusters in metagenomes is demonstrated here through rapid, targeted detection of a virus across lake metagenomic assemblies. We expect that both fai and zol will gain greater relevance for metagenomic applications in the future as long-read sequencing becomes cheaper [121, 153]. Importantly, the tools can be applied directly on assemblies without the need for binning scaffolds into MAGs, avoiding complications associated with binning [154]. In addition to their application to viral tracking, fai and zol's application to metagenomes could be useful for assessing the presence of concerning transposons carrying antimicrobial resistance traits [155–157] and identifying novel auxiliary genes within known BGCs which may tailor the resulting specialized metabolites and expand chemical diversity [158, 159]. While fai features options to alleviate loss of sensitivity, detection of gene clusters from metagenomic assemblies can still be challenging, especially if gene clusters of interest are low-abundance or when searching more complex microbiomes [120, 160].

Reidentifying gene clusters in eukaryotic genomes remains difficult due to technical challenges in gene prediction owing to the presence of alternative splicing. The ability of fai and zol to perform population-level genetics on BGCs from the eukaryotic species A. flavus was demonstrated. While there are over 200 genomes of A. flavus in NCBI, only 5.1% had coding-sequence information readily available. We used prepTG with miniprot [62] to map high quality gene coordinate predictions from a representative genome in the species [91] to the remainder of genomic assemblies to enable high sensitivity targeted detection of BGCs with fai. Our analysis provides confirmation that the leporin BGC is conserved across the species [39] using an assembly-based approach.

The ability of zol to identify ortholog groups across 5052 gene cluster instances from 71 distinct species using limited computational resources was demonstrated through investigation of the epa locus across Enterococcus. While such large-scale investigations will be largely limited to those with access to a server, we expect datasets to often feature some degree of species level redundancy. For instance, 80.2% of the 5052 epa instances were from only two species, E. faecalis and E. faecium. Thus, to alleviate computational costs, we have included functions for dereplication of gene clusters and reinflation of ortholog groups in zol. Applying these features to the comprehensive set of epa loci using 30 threads reduced runtime from 25 to 2.9 h and maximum memory usage from 101.6 to 2.3 GB (Supplementary Table S11).
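Expressed as fold changes, the savings reported above can be checked with a few lines of arithmetic. The input values are those quoted in the text; the variable names are illustrative.

```python
# Values quoted in the text for the comprehensive epa analysis with 30 threads.
runtime_full_h, runtime_derep_h = 25.0, 2.9
memory_full_gb, memory_derep_gb = 101.6, 2.3

# Fold reduction = resource use without dereplication / resource use with it.
runtime_fold = runtime_full_h / runtime_derep_h
memory_fold = memory_full_gb / memory_derep_gb

print(f"runtime reduced ~{runtime_fold:.1f}-fold, memory reduced ~{memory_fold:.1f}-fold")
```

Dereplication with reinflation thus corresponds to a reduction of roughly 8.6-fold in runtime and roughly 44-fold in peak memory for this dataset.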

We further assessed the quality of ortholog group predictions by fai and zol using phylogenetic investigations and comparisons with other software for homology inference. Specifically, we compared orthology relationships determined by fai and zol to those obtained from OrthoFinder [71] and eggNOG-mapper [18], which were used to detect ortholog groups at the genome-wide scale. OrthoFinder was chosen as a representative method for standard multi-species orthology inference because it has been shown to perform well for several criteria in prior benchmarking studies [71, 161]. Similarly, eggNOG-mapper was included as a popular representative for methods which map proteins to precomputed ortholog groups instead of inferring them de novo. Through application to diverse epa loci from multiple distinct species and evolutionary simulation of the locus from E. faecalis, we found zol produces reliable orthology predictions that are mostly in accordance with OrthoFinder and eggNOG-mapper. Of note, when applied to lower-quality assemblies, orthologs identified by fai and zol retained high overlap with resolute orthology relationships identified by eggNOG-mapper. When applied to challenging scenarios where gene evolution has involved several complex structural changes, running zol using default settings might result in missing true orthologous relationships. Thus, future versions could explore new methods to alleviate under-clustering or implement a tiered approach for orthology prediction similar to OrthoFinder [71].

Our investigation of epa loci from multiple species revealed the presence of a multitude of glycosyltransferases associated with production or decoration of the polysaccharide, including some that are absent in the representative E. faecalis genome, the species in which the polysaccharide has been most extensively characterized. Through population-genetic investigations of the locus in E. faecalis using zol, we further determined that an ortholog group containing epaX-like glycosyltransferases possessed high sequence divergence relative to other glycosyltransferases associated with the locus. In addition to influencing the ability of E. faecalis to colonize hosts [143], mutations in epaX and other genes from the ortholog group have also been shown to impact susceptibility to phage predation [162–165]. Therefore, we hypothesize that extensive evolution of the epaX ortholog group is a result of contrasting selective forces, pressuring E. faecalis to retain or (re-)acquire the glycosyltransferase to gain a fitness advantage within hosts and to also lose the gene to escape phage predation.

Practically, zol presents a comprehensive analysis tool for comparative genetics of related gene clusters to facilitate identification of evolutionary patterns that might be less apparent from pairwise, visual assessment of homologous gene clusters. Fundamentally, the algorithms presented within fai and zol enable the reliable detection of orthologous gene clusters, and subsequently orthologous proteins, across multi-species datasets spanning thousands of genomes and help overcome a key barrier in scalability for comparative genomics.

Supplementary Material

gkaf045_Supplemental_Files

Acknowledgements

The authors are grateful to James Kosmopoulos, Dr. Caitlin Pepperell, Dr. Caitlin Sande, Dr. Mary Hannah Swaney, and reviewers for feedback or assistance with data acquisition as well as Dr. Devon Ryan and Dr. Robert A. Petit III for assistance with incorporation of the suite into Bioconda.

R.S.: conceptualization; investigation; data curation; formal analysis; methodology; software; visualization; writing – original draft; writing – review & editing. P.Q.T.: data curation; formal analysis; writing – review & editing. C.M.: writing – review & editing. A.L.M.: writing – review & editing. M.S.G.: conceptualization; writing – review & editing. A.M.E.: funding acquisition; writing – review & editing. K.A.: conceptualization; writing – review & editing. L.R.K.: conceptualization; investigation; funding acquisition; supervision; writing – original draft; writing – review & editing.

Contributor Information

Rauf Salamzade, Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53706, United States; Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, 53706, United States.

Patricia Q Tran, Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States; Freshwater and Marine Science Doctoral Program, University of Wisconsin-Madison, Madison, WI, 53706, United States.

Cody Martin, Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, 53706, United States; Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States.

Abigail L Manson, Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States.

Michael S Gilmore, Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States; Department of Ophthalmology, Harvard Medical School and Massachusetts Eye and Ear, Boston, MA, 02114, United States; Department of Microbiology, Harvard Medical School and Massachusetts Eye and Ear, Boston, MA, 02115, United States.

Ashlee M Earl, Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States.

Karthik Anantharaman, Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States.

Lindsay R Kalan, Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53706, United States; Department of Medicine, Division of Infectious Disease, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53705, United States; M.G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, L8S 4L8, Canada; Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, L8S 4K1, Canada.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

This work was supported by grants from the National Institutes of Health awarded [NIAID U19AI142720 and NIGMS R35GM137828 to L.R.K.] and the Broad Institute [U19AI110818]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funding to pay the Open Access publication charges for this article was provided by NIH [U19AI142720].

Data availability

Genomes and metagenomes used to showcase the application of fai and zol are listed with GenBank accession identifiers in Supplementary Table S12. Total metagenomes and their associated information from Lake Mendota microbiome samplings were originally described in Tran et al. (2023) [90] and deposited in NCBI under BioProject PRJNA758276. Genomic assemblies available for A. flavus in NCBI’s GenBank database on 31 January 2023 were downloaded in FASTA format using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). Genomic assemblies for Enterococcus that met quality and taxonomic criteria for belonging to the genus or related genera (e.g. Enterococcus_A, Enterococcus_B, etc.) in GTDB [69] release R207 were similarly downloaded from NCBI’s GenBank database using ncbi-genome-download in FASTA format.

References

  • 1. Fitch  WM  Distinguishing homologous from analogous proteins. Syst Zool. 1970; 19:99–113. 10.2307/2412448. [DOI] [PubMed] [Google Scholar]
  • 2. Tatusov  RL, Galperin  MY, Natale  DA  et al.  The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000; 28:33–6. 10.1093/nar/28.1.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Huerta-Cepas  J, Szklarczyk  D, Heller  D  et al.  eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019; 47:D309–14. 10.1093/nar/gky1085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Enright  AJ, Kunin  V, Ouzounis  CA  Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 2003; 31:4632–8. 10.1093/nar/gkg495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Li  L, Stoeckert  CJ  Jr, Roos  DS  OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003; 13:2178–89. 10.1101/gr.1224503. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Remm  M, Storm  CE, Sonnhammer  EL  Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001; 314:1041–52. 10.1006/jmbi.2000.5197. [DOI] [PubMed] [Google Scholar]
  • 7. van Dongen  SM  Graph Clustering Via a Discrete Uncoupling Process. SIMAX. 2008; 30:121–41. 10.1137/040608635. [DOI] [Google Scholar]
  • 8. Schreiber  F, Sonnhammer  ELL  Hieranoid: hierarchical orthology inference. J Mol Biol. 2013; 425:2072–81. 10.1016/j.jmb.2013.02.018. [DOI] [PubMed] [Google Scholar]
  • 9. Georgescu  CH, Manson  AL, Griggs  AD  et al.  SynerClust: a highly scalable, synteny-aware orthologue clustering tool. Microb Genom. 2018; 4:e000231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Hu  X, Friedberg  I  SwiftOrtho: a fast, memory-efficient, multiple genome orthology classifier. Gigascience. 2019; 8:giz118. 10.1093/gigascience/giz118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Cosentino  S, Iwasaki  W  SonicParanoid: fast, accurate and easy orthology inference. Bioinformatics. 2019; 35:149–51. 10.1093/bioinformatics/bty631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Ding  W, Baumdicker  F, Neher  RA  panX: pan-genome analysis and exploration. Nucleic Acids Res. 2018; 46:e5. 10.1093/nar/gkx977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Page  AJ, Cummins  CA, Hunt  M  et al.  Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015; 31:3691–3. 10.1093/bioinformatics/btv421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Bayliss  SC, Thorpe  HA, Coyle  NM  et al.  PIRATE: Afast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience. 2019; 8:giz119. 10.1093/gigascience/giz119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Tonkin-Hill  G, MacAlasdair  N, Ruis  C  et al.  Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 2020; 21:180. 10.1186/s13059-020-02090-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Gautreau  G, Bazin  A, Gachet  M  et al.  PPanGGOLiN: depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020; 16:e1007732. 10.1371/journal.pcbi.1007732. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Aramaki  T, Blanc-Mathieu  R, Endo  H  et al.  KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020; 36:2251–2. 10.1093/bioinformatics/btz859. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Cantalapiedra  CP, Hernández-Plaza  A, Letunic  I  et al.  eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021; 38:5825–9. 10.1093/molbev/msab293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Melnyk  RA, Hossain  SS, Haney  CH  Convergent gain and loss of genomic islands drive lifestyle changes in plant-associated Pseudomonas. ISME J. 2019; 13:1575–88. 10.1038/s41396-019-0372-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Steinegger  M, Söding  J  MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017; 35:1026–8. 10.1038/nbt.3988. [DOI] [PubMed] [Google Scholar]
  • 21. Buchfink  B, Ashkenazy  H, Reuter  K  et al.  Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust. bioRxiv25 Januuary 2023, preprint: not peer reviewed 10.1101/2023.01.24.525373. [DOI]
  • 22. Coelho  LP, Alves  R, Del  Río ÁR  et al.  Towards the biogeography of prokaryotic genes. Nature. 2022; 601:252–6. 10.1038/s41586-021-04233-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Steinegger  M, Söding  J  Clustering huge protein sequence sets in linear time. Nat Commun. 2018; 9:2542. 10.1038/s41467-018-04964-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Snyder  L, Henkin  TM, Peters  JE  et al.  Molecular Genetics of Bacteria. 2013; 4th EdUnited States: ASM Press. [Google Scholar]
  • 25. Price  MN, Arkin  AP, Alm  EJ  The life-cycle of operons. PLoS Genet. 2006; 2:859–73. 10.1371/journal.pgen.0020096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Ptashne  M  A Genetic Switch: Gene Control and Phage Lambda. 1986; Palo Alto, CA: Blackwell Scientific Publications. [Google Scholar]
  • 27. Andreu  VP, Augustijn  HE, Chen  L  et al.  gutSMASH predicts specialized primary metabolic pathways from the human gut microbiota. Nat Biotechnol. 2023; 41:1416–23. 10.1038/s41587-023-01675-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Cortes  J, Haydock  SF, Roberts  GA  et al.  An unusually large multifunctional polypeptide in the erythromycin-producing polyketide synthase of saccharopolyspora erythraea. Nature. 1990; 348:176–8. 10.1038/348176a0. [DOI] [PubMed] [Google Scholar]
  • 29. Donadio  S, Staver  MJ, McAlpine  JB  et al.  Modular organization of genes required for complex polyketide biosynthesis. Science. 1991; 252:675–9. 10.1126/science.2024119. [DOI] [PubMed] [Google Scholar]
  • 30. Walsh  CT, Fischbach  MA  Natural products version 2.0: connecting genes to molecules. J Am Chem Soc. 2010; 132:2469–93. 10.1021/ja909118a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Medema  MH, Blin  K, Cimermancic  P  et al.  antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011; 39:W339–46. 10.1093/nar/gkr466. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Gal-Mor  O, Finlay  BB  Pathogenicity islands: a molecular toolbox for bacterial virulence. Cell Microbiol. 2006; 8:1707–19. 10.1111/j.1462-5822.2006.00794.x. [DOI] [PubMed] [Google Scholar]
  • 33. Kaper  JB, Nataro  JP, Mobley  HL  Pathogenic Escherichia coli. Nat Rev Micro. 2004; 2:123–40. 10.1038/nrmicro818. [DOI] [PubMed] [Google Scholar]
  • 34. Buchanan  BB, Gruissem  W, Jones  RL  Biochemistry and Molecular Biology of Plants. 2015; 2nd EdChichester, UK: John Wiley & Sons Ltd. [Google Scholar]
  • 35. Rokas  A, Mead  ME, Steenwyk  JL  et al.  Biosynthetic gene clusters and the evolution of fungal chemodiversity. Nat Prod Rep. 2020; 37:868–78. 10.1039/C9NP00045C. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Robey  MT, Caesar  LK, Drott  MT  et al.  An interpreted atlas of biosynthetic gene clusters from 1,000 fungal genomes. Proc Natl Acad Sci USA. 2021; 118:e2020230118. 10.1073/pnas.2020230118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Lindahl  L, Zengel  JM  Operon-specific regulation of ribosomal protein synthesis in Escherichia coli. Proc Natl Acad Sci USA. 1979; 76:6542–46. 10.1073/pnas.76.12.6542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Cordero  OX, Polz  MF  Explaining microbial genomic diversity in light of evolutionary ecology. Nat Rev Micro. 2014; 12:263–73. 10.1038/nrmicro3218. [DOI] [PubMed] [Google Scholar]
  • 39. Drott  MT, Rush  TA, Satterlee  TR  et al.  Microevolution in the pansecondary metabolome of Aspergillus flavus and its potential macroevolutionary implications for filamentous fungi. Proc Natl Acad Sci USA. 2021; 118:e2021683118. 10.1073/pnas.2021683118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Salamzade  R, Cheong  JZA, Sandstrom  S  et al.  Evolutionary investigations of the biosynthetic diversity in the skin microbiome using lsaBGC. Microb Genom. 2023; 9:mgen000988. 10.1099/mgen.0.000988. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Ziemert  N, Lechner  A, Wietz  M  et al.  Diversity and evolution of secondary metabolism in the marine actinomycete genus Salinispora. Proc Natl Acad Sci USA. 2014; 111:E1130–9. 10.1073/pnas.1324161111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. van Bergeijk  DA, Terlouw  BR, Medema  MH  et al.  Ecology and genomics of actinobacteria: new concepts for natural product discovery. Nat Rev Micro. 2020; 18:546–58. 10.1038/s41579-020-0379-y. [DOI] [PubMed] [Google Scholar]
  • 43. Chevrette  MG, Gutiérrez-García  K, Selem-Mojica  N  et al.  Evolutionary dynamics of natural product biosynthesis in bacteria. Nat Prod Rep. 2020; 37:566–99. 10.1039/C9NP00048H. [DOI] [PubMed] [Google Scholar]
  • 44. Medema  MH, Takano  E, Breitling  R  Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol Biol Evol. 2013; 30:1218–23. 10.1093/molbev/mst025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Abby  SS, Néron  B, Ménager  H  et al.  MacSyFinder: a program to mine genomes for molecular systems with an application to CRISPR-Cas systems. PLoS One. 2014; 9:e110726. 10.1371/journal.pone.0110726. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Navarro-Muñoz  JC, Selem-Mojica  N, Mullowney  MW  et al.  A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol. 2020; 16:60–68. 10.1038/s41589-019-0400-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Gilchrist  CLM, Booth  TJ, van Wersch  B  et al.  Cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinform Adv. 2021; 1:vbab016. 10.1093/bioadv/vbab016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Gilchrist  CLM, Chooi  Y-H  clinker & clustermap.js: automatic generation of gene cluster comparison figures. Bioinformatics. 2021; 37:2473–5. [DOI] [PubMed] [Google Scholar]
  • 49. Hackl  T, Ankenbrand  M, van Adrichem  B  gggenomes: A Grammar of Graphics for Comparative Genomics. R package. https://github.com/thackl/gggenomes.
  • 50. pyGenomeViz: a genome visualization python package for comparative genomics. GitHub. https://github.com/moshi4/pyGenomeViz.
  • 51. Grüning  B, Dale  R, Sjödin  A  et al.  Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018; 15:475–6. 10.1038/s41592-018-0046-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Cock  PJA, Antao  T, Chang  JT  et al.  Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25:1422–3. 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Capella-Gutiérrez  S, Silla-Martínez  JM, Gabaldón  T  trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009; 25:1972–3. 10.1093/bioinformatics/btp348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Price  MN, Dehal  PS, Arkin  AP  FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5:e9490. 10.1371/journal.pone.0009490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Huang  Y, Niu  B, Gao  Y  et al.  CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010; 26:680–2. 10.1093/bioinformatics/btq003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Eddy  SR  Accelerated profile HMM searches. PLoS Comput Biol. 2011; 7:e1002195. 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Buchfink  B, Xie  C, Huson  DH  Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015; 12:59–60. 10.1038/nmeth.3176. [DOI] [PubMed] [Google Scholar]
  • 58. Schreiber  J  Pomegranate: fast and flexible probabilistic modeling in python. J Mach Learn Res. 2017; 18:5992–7. https://dl.acm.org/doi/10.5555/3122009.3242021. [Google Scholar]
  • 59. Kosakovsky Pond  SL, Poon  AFY, Velazquez  R  et al.  HyPhy 2.5—A customizable platform for evolutionary hypothesis testing using phylogenies. Mol Biol Evol. 2020; 37:295–9. 10.1093/molbev/msz197. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Larralde  M  Pyrodigal: python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. JOSS. 2022; 7:4296. 10.21105/joss.04296. [DOI] [Google Scholar]
  • 61. Edgar  RC  Muscle5: high-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun. 2022; 13:6968. 10.1038/s41467-022-34630-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Li  H  Protein-to-genome alignment with miniprot. Bioinformatics. 2023; 39:btad014. 10.1093/bioinformatics/btad014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Shaw  J, Yu  YW  Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat Methods. 2023; 20:1661–5. 10.1038/s41592-023-02018-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Kieft  K, Zhou  Z, Anantharaman  K  VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome. 2020; 8:90. 10.1186/s40168-020-00867-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Blin  K, Shaw  S, Kloosterman  AM  et al.  antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res. 2021; 49:W29–35. 10.1093/nar/gkab335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Hyatt  D, Chen  G-L, Locascio  PF  et al.  Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 2010; 11:119. 10.1186/1471-2105-11-119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Camargo  AP, Roux  S, Schulz  F  et al.  Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2024; 42:1303–12. 10.1038/s41587-023-01953-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Salamzade  R, Kalan  LR  skDER: microbial genome dereplication approaches for comparative and metagenomic applications. bioRxiv, 22 November 2023, preprint: not peer reviewed. 10.1101/2023.09.27.559801. [DOI]
  • 69. Parks  DH, Chuvochina  M, Rinke  C  et al.  GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022; 50:D785–94. 10.1093/nar/gkab776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Klassen  JL, Currie  CR  Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics. 2012; 13:14. 10.1186/1471-2164-13-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Emms  DM, Kelly  S  OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019; 20:238. 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Suyama  M, Torrents  D, Bork  P  PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006; 34:W609–12. 10.1093/nar/gkl315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Li  W, O’Neill  KR, Haft  DH  et al.  RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 2021; 49:D1020–8. 10.1093/nar/gkaa1105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Liu  B, Zheng  D, Jin  Q  et al.  VFDB 2019: a comparative pathogenomic platform with an interactive web interface. Nucleic Acids Res. 2019; 47:D687–92. 10.1093/nar/gky1080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Alcock  BP, Huynh  W, Chalil  R  et al.  CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Res. 2023; 51:D690–9. 10.1093/nar/gkac920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Terlouw  BR, Blin  K, Navarro-Muñoz  JC  et al.  MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 2023; 51:D603–10. 10.1093/nar/gkac1049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Siguier  P, Perochon  J, Lestrade  L  et al.  ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 2006; 34:D32–6. 10.1093/nar/gkj014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Price  MN, Arkin  AP  PaperBLAST: text mining papers for information about homologs. Msystems. 2017; 2:e00039-17. 10.1128/mSystems.00039-17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Finn  RD, Bateman  A, Clements  J  et al.  Pfam: the protein families database. Nucleic Acids Res. 2014; 42:D222–30. 10.1093/nar/gkt1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Larralde  M, Zeller  G  PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics. 2023; 39:btad214. 10.1093/bioinformatics/btad214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Tajima  F  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989; 123:585–95. 10.1093/genetics/123.3.585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Murrell  B, Moola  S, Mabona  A  et al.  FUBAR: a fast, unconstrained bayesian approximation for inferring selection. Mol Biol Evol. 2013; 30:1196–205. 10.1093/molbev/mst030. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Kosakovsky Pond  SL, Posada  D, Gravenor  MB  et al.  GARD: a genetic algorithm for recombination detection. Bioinformatics. 2006; 22:3096–8. 10.1093/bioinformatics/btl474. [DOI] [PubMed] [Google Scholar]
  • 84. Hudson  RR, Slatkin  M, Maddison  WP  Estimation of levels of gene flow from DNA sequence data. Genetics. 1992; 132:583–9. 10.1093/genetics/132.2.583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Mistry  J, Chuguransky  S, Williams  L  et al.  Pfam: the protein families database in 2021. Nucleic Acids Res. 2021; 49:D412–9. 10.1093/nar/gkaa913. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Haas  R  gravis: interactive graph visualizations with python and HTML/CSS/JS. GitHub. https://github.com/robert-haas/gravis.
  • 87. Carroll  LM, Larralde  M, Fleck  JS  et al.  Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv, 4 May 2021, preprint: not peer reviewed. 10.1101/2021.05.03.442509. [DOI]
  • 88. Akhter  S, Aziz  RK, Edwards  RA  PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012; 40:e126. 10.1093/nar/gks406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Robertson  J, Nash  JHE  MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom. 2018; 4:e000206. 10.1099/mgen.0.000206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90. Tran  PQ, Bachand  SC, Peterson  B  et al.  Viral impacts on microbial activity and biogeochemical cycling in a seasonally anoxic freshwater lake. bioRxiv, 19 April 2023, preprint: not peer reviewed. 10.1101/2023.04.19.537559. [DOI]
  • 91. Skerker  JM, Pianalto  KM, Mondo  SJ  et al.  Chromosome assembled and annotated genome sequence of Aspergillus flavus NRRL 3357. G3. 2021; 11:jkab213. 10.1093/g3journal/jkab213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92. Ondov  BD, Treangen  TJ, Melsted  P  et al.  Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016; 17:132. 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93. Paradis  E, Claude  J, Strimmer  K  APE: analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004; 20:289–90. 10.1093/bioinformatics/btg412. [DOI] [PubMed] [Google Scholar]
  • 94. Ter-Hovhannisyan  V, Lomsadze  A, Chernoff  YO  et al.  Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 2008; 18:1979–90. 10.1101/gr.081612.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95. Stanke  M, Diekhans  M, Baertsch  R  et al.  Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008; 24:637–44. 10.1093/bioinformatics/btn013. [DOI] [PubMed] [Google Scholar]
  • 96. van den Belt  M, Gilchrist  C, Booth  TJ  et al.  CAGECAT: the CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters. BMC Bioinf. 2023; 24:181. 10.1186/s12859-023-05311-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97. Teng  F, Singh  KV, Bourgogne  A  et al.  Further characterization of the epa gene cluster and epa polysaccharides of Enterococcus faecalis. Infect Immun. 2009; 77:3759–67. 10.1128/IAI.00149-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98. Guerardel  Y, Sadovskaya  I, Maes  E  et al.  Complete structure of the enterococcal polysaccharide antigen (EPA) of vancomycin-resistant Enterococcus faecalis V583 reveals that EPA decorations are teichoic acids covalently linked to a rhamnopolysaccharide backbone. mBio. 2020; 11:e00277-20. 10.1128/mBio.00277-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99. Palmer  KL, Godfrey  P, Griggs  A  et al.  Comparative genomics of enterococci: variation in Enterococcus faecalis, clade structure in E. faecium, and defining characteristics of E. gallinarum and E. casseliflavus. mBio. 2012; 3:e00318-11. 10.1128/mBio.00318-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100. Qin  X, Galloway-Peña  JR, Sillanpaa  J  et al.  Complete genome sequence of Enterococcus faecium strain TX16 and comparative genomic analysis of Enterococcus faecium genomes. BMC Microbiol. 2012; 12:135. 10.1186/1471-2180-12-135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101. Lee  MD  GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; 35:4162–4. 10.1093/bioinformatics/btz188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102. Letunic  I, Bork  P  Interactive Tree of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019; 47:W256–9. 10.1093/nar/gkz239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103. Minh  BQ, Schmidt  HA, Chernomor  O  et al.  IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020; 37:1530–34. 10.1093/molbev/msaa015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104. Grazziotin  AL, Koonin  EV, Kristensen  DM  Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res. 2017; 45:D491–8. 10.1093/nar/gkw975. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105. Liu  M, Li  X, Xie  Y  et al.  ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Res. 2019; 47:D660–65. 10.1093/nar/gky1123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106. Bertelli  C, Laird  MR, Williams  KP  et al.  IslandViewer 4: expanded prediction of genomic islands for larger-scale datasets. Nucleic Acids Res. 2017; 45:W30–5. 10.1093/nar/gkx343. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107. Trgovec-Greif  L, Hellinger  H-J, Mainguy  J  et al.  VOGDB-database of virus orthologous groups. Viruses. 2024; 16:1191. 10.3390/v16081191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108. Hackl  T, Duponchel  S, Barenhoff  K  et al.  Virophages and retrotransposons colonize the genomes of a heterotrophic flagellate. eLife. 2021; 10:e72674. 10.7554/eLife.72674. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109. Sullivan  MJ, Petty  NK, Beatson  SA  Easyfig: a genome comparison visualizer. Bioinformatics. 2011; 27:1009–10. 10.1093/bioinformatics/btr039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110. Helfrich  EJN, Ueoka  R, Chevrette  MG  et al.  Evolution of combinatorial diversity in trans-acyltransferase polyketide synthase assembly lines across bacteria. Nat Commun. 2021; 12:1422. 10.1038/s41467-021-21163-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111. Nivina  A, Herrera Paredes  S, Fraser  HB  et al.  GRINS: genetic elements that recode assembly-line polyketide synthases and accelerate their diversification. Proc Natl Acad Sci USA. 2021; 118:e2100751118. 10.1073/pnas.2100751118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112. Blackwell  G, Hunt  M, Malone  KM  et al.  Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. 2022; 19:e3001421. 10.1099/acmi.ac2021.po0143. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113. Lebreton  F, van Schaik  W, McGuire  AM  et al.  Emergence of epidemic multidrug-resistant Enterococcus faecium from animal and commensal strains. mBio. 2013; 4:e00534-13. 10.1128/mBio.00534-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114. Lieberman  TD, Flett  KB, Yelin  I  et al.  Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat Genet. 2014; 46:82–7. 10.1038/ng.2848. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115. Crits-Christoph  A, Olm  MR, Diamond  S  et al.  Soil bacterial populations are shaped by recombination and gene-specific selection across a grassland meadow. ISME J. 2020; 14:1834–46. 10.1038/s41396-020-0655-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116. Pavlopoulos  GA, Baltoumas  FA, Liu  S  et al.  Unraveling the functional dark matter through global metagenomics. Nature. 2023; 622:594–602. 10.1038/s41586-023-06583-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117. Vanni  C, Schechter  MS, Acinas  SG  et al.  Unifying the known and unknown microbial coding sequence space. eLife. 2022; 11:e67667. 10.7554/eLife.67667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118. Willems  A  The family Comamonadaceae. The Prokaryotes. 2014; Berlin, Heidelberg: Springer Berlin Heidelberg; 777–851. [Google Scholar]
  • 119. Roux  S, Camargo  AP, Coutinho  FH  et al.  iPHoP: an integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 2023; 21:e3002083. 10.1371/journal.pbio.3002083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120. Tamames  J, Cobo-Simón  M, Puente-Sánchez  F  Assessing the performance of different approaches for functional and taxonomic annotation of metagenomes. BMC Genomics. 2019; 20:960. 10.1186/s12864-019-6289-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121. Yorki  S, Shea  T, Cuomo  CA  et al.  Comparison of long- and short-read metagenomic assembly for low-abundance species and resistance genes. Brief Bioinform. 2023; 24:bbad050. 10.1093/bib/bbad050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122. Pascal Andreu  V, Augustijn  HE, van den Berg  K  et al.  BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes. mSystems. 2021; 6:e0093721. 10.1128/msystems.00937-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 123. Olm  MR, Crits-Christoph  A, Bouma-Gregson  K  et al.  inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol. 2021; 39:727–36. 10.1038/s41587-020-00797-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 124. Gregory  AC, Gerhardt  K, Zhong  Z-P  et al.  MetaPop: a pipeline for macro- and microdiversity analyses and visualization of microbial and viral metagenome-derived populations. Microbiome. 2022; 10:49. 10.1186/s40168-022-01231-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125. Thomma  BPHJ, Seidl  MF, Shi-Kunne  X  et al.  Mind the gap; seven reasons to close fragmented genome assemblies. Fung Genet Biol. 2016; 90:24–30. 10.1016/j.fgb.2015.08.010. [DOI] [PubMed] [Google Scholar]
  • 126. Drăgan  M-A, Moghul  I, Priyam  A  et al.  GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics. 2016; 32:1559–61. 10.1093/bioinformatics/btw015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127. Scalzitti  N, Jeannin-Girardon  A, Collet  P  et al.  A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics. 2020; 21:293. 10.1186/s12864-020-6707-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 128. Jallow  A, Xie  H, Tang  X  et al.  Worldwide aflatoxin contamination of agricultural products and foods: from occurrence to control. Comp Rev Food Sci Food Safe. 2021; 20:2332–81. 10.1111/1541-4337.12734. [DOI] [PubMed] [Google Scholar]
  • 129. Bok  JW, Hoffmeister  D, Maggio-Hall  LA  et al.  Genomic mining for Aspergillus natural products. Chem Biol. 2006; 13:31–7. 10.1016/j.chembiol.2005.10.008. [DOI] [PubMed] [Google Scholar]
  • 130. Vadlapudi  V, Borah  N, Yellusani  KR  et al.  Aspergillus Secondary Metabolite Database, a resource to understand the secondary metabolome of Aspergillus genus. Sci Rep. 2017; 7:387. 10.1038/s41598-017-07436-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131. Hatmaker  EA, Rangel-Grimaldo  M, Raja  HA  et al.  Genomic and phenotypic trait variation of the opportunistic human pathogen Aspergillus flavus and its close relatives. Microbiol Spectr. 2022; 10:e0306922. 10.1128/spectrum.03069-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132. Cary  JW, Uka  V, Han  Z  et al.  An Aspergillus flavus secondary metabolic gene cluster containing a hybrid PKS-NRPS is necessary for synthesis of the 2-pyridones, leporins. Fung Genet Biol. 2015; 81:88–97. 10.1016/j.fgb.2015.05.010. [DOI] [PubMed] [Google Scholar]
  • 133. Majoros  WH, Pertea  M, Salzberg  SL  TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004; 20:2878–9. 10.1093/bioinformatics/bth315. [DOI] [PubMed] [Google Scholar]
  • 134. Yang  K, Tian  J, Keller  NP  Post-translational modifications drive secondary metabolite biosynthesis in Aspergillus: a review. Environ Microbiol. 2022; 24:2857–81. 10.1111/1462-2920.16034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135. Klich  MA  Aspergillus flavus: the major producer of aflatoxin. Mol Plant Pathol. 2007; 8:713–22. 10.1111/j.1364-3703.2007.00436.x. [DOI] [PubMed] [Google Scholar]
  • 136. Cary  JW, Ehrlich  KC, Bland  JM  et al.  The aflatoxin biosynthesis cluster gene, aflX, encodes an oxidoreductase involved in conversion of versicolorin A to demethylsterigmatocystin. Appl Environ Microb. 2006; 72:1096–101. 10.1128/AEM.72.2.1096-1101.2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 137. Cleveland  TE, Yu  J, Fedorova  N  et al.  Potential of Aspergillus flavus genomics for applications in biotechnology. Trends Biotechnol. 2009; 27:151–7. 10.1016/j.tibtech.2008.11.008. [DOI] [PubMed] [Google Scholar]
  • 138. Ehrlich  KC, Li  P, Scharfenstein  L  et al.  HypC, the anthrone oxidase involved in aflatoxin biosynthesis. Appl Environ Microb. 2010; 76:3374–7. 10.1128/AEM.02495-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139. Ehrlich  KC  Predicted roles of the uncharacterized clustered genes in aflatoxin biosynthesis. Toxins. 2009; 1:37–58. 10.3390/toxins1010037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140. Xu  Y, Murray  BE, Weinstock  GM  A cluster of genes involved in polysaccharide biosynthesis from Enterococcus faecalis OG1RF. Infect Immun. 1998; 66:4313–23. 10.1128/IAI.66.9.4313-4323.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 141. Hancock  LE, Murray  BE, Sillanpää  J  Enterococcal cell wall components and structures. In: Gilmore  MS, Clewell  DB, Ike  Y, Shankar  N (eds). Enterococci: From Commensals to Leading Causes of Drug Resistant Infection. 2014; Boston: Massachusetts Eye and Ear Infirmary; 375–408. [Google Scholar]
  • 142. Teng  F, Jacques-Palaz  KD, Weinstock  GM  et al.  Evidence that the Enterococcal polysaccharide antigen gene (epa) cluster is widespread in Enterococcus faecalis and influences resistance to phagocytic killing of E. faecalis. Infect Immun. 2002; 70:2010–15. 10.1128/IAI.70.4.2010-2015.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 143. Rigottier-Gois  L, Madec  C, Navickas  A  et al.  The surface rhamnopolysaccharide epa of Enterococcus faecalis is a key determinant of intestinal colonization. J Infect Dis. 2015; 211:62–71. 10.1093/infdis/jiu402. [DOI] [PubMed] [Google Scholar]
  • 144. Smith  RE, Salamaga  B, Szkuta  P  et al.  Decoration of the enterococcal polysaccharide antigen EPA is essential for virulence, cell surface charge and interaction with effectors of the innate immune system. PLoS Pathog. 2019; 15:e1007730. 10.1371/journal.ppat.1007730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 145. Singh  KV, Murray  BE  Loss of a major enterococcal polysaccharide antigen (Epa) by Enterococcus faecalis is associated with increased resistance to Ceftriaxone and Carbapenems. Antimicrob Agents Chemother. 2019; 63:e00481-19. 10.1128/AAC.00481-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 146. Ho  K, Huo  W, Pas  S  et al.  Loss-of-function mutations in epaR confer resistance to ϕNPV1 infection in Enterococcus faecalis OG1RF. Antimicrob Agents Chemother. 2018; 62:e00758-18. 10.1128/AAC.00758-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147. Fiore  E, Van Tyne  D, Gilmore  MS  Pathogenicity of enterococci. Microbiol Spectr. 2019; 7:1–23. 10.1128/microbiolspec.GPP3-0053-2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 148. Lebreton  F, Willems  RJL, Gilmore  MS  Enterococcus diversity, origins in nature, and gut colonization. In: Gilmore  MS, Clewell  DB, Ike  Y, Shankar  N (eds). Enterococci: From Commensals to Leading Causes of Drug Resistant Infection. 2014; Boston: Massachusetts Eye and Ear Infirmary; 1–45. [PubMed] [Google Scholar]
  • 149. Lebreton  F, Manson  AL, Saavedra  JT  et al.  Tracing the enterococci from paleozoic origins to the hospital. Cell. 2017; 169:849–61. 10.1016/j.cell.2017.04.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 150. Schwartzman  JA, Lebreton  F, Salamzade  R  et al.  Global diversity of enterococci and description of 18 novel species. Proc Natl Acad Sci USA. 2023; 121:1–12. 10.1073/pnas.231085212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151. Bertelli  C, Gray  KL, Woods  N  et al.  Enabling genomic island prediction and comparison in multiple genomes to investigate bacterial evolution and outbreaks. Microb Genom. 2022; 8:mgen000818. 10.1099/mgen.0.000818. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 152. Bin Jang  H, Bolduc  B, Zablocki  O  et al.  Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019; 37:632–9. 10.1038/s41587-019-0100-8. [DOI] [PubMed] [Google Scholar]
  • 153. Marx  V  Method of the year: long-read sequencing. Nat Methods. 2023; 20:6–11. 10.1038/s41592-022-01730-w. [DOI] [PubMed] [Google Scholar]
  • 154. Meyer  F, Fritz  A, Deng  Z-L  et al.  Critical Assessment of Metagenome interpretation: the second round of challenges. Nat Methods. 2022; 19:429–40. 10.1038/s41592-022-01431-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 155. Salamzade  R, Manson  AL, Walker  BJ  et al.  Inter-species geographic signatures for tracing horizontal gene transfer and long-term persistence of carbapenem resistance. Genome Med. 2022; 14:37. 10.1186/s13073-022-01040-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 156. Sheppard  AE, Stoesser  N, Wilson  DJ  et al.  Nested Russian doll-like genetic mobility drives rapid dissemination of the carbapenem resistance gene blaKPC. Antimicrob Agents Chemother. 2016; 60:3767–78. 10.1128/AAC.00464-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 157. Groussin  M, Poyet  M, Sistiaga  A  et al.  Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell. 2021; 184:2053–67. 10.1016/j.cell.2021.02.052. [DOI] [PubMed] [Google Scholar]
  • 158. Crits-Christoph  A, Diamond  S, Butterfield  CN  et al.  Novel soil bacteria possess diverse genes for secondary metabolite biosynthesis. Nature. 2018; 558:440–44. 10.1038/s41586-018-0207-y. [DOI] [PubMed] [Google Scholar]
  • 159. Bickhart  DM, Kolmogorov  M, Tseng  E  et al.  Generation of lineage-resolved complete metagenome-assembled genomes by precision phasing. Nat Biotechnol. 2022; 40:711–9. 10.1101/2021.05.04.442591. [DOI] [PubMed] [Google Scholar]
  • 160. Seshadri  R, Roux  S, Huber  KJ  et al.  Expanding the genomic encyclopedia of actinobacteria with 824 isolate reference genomes. Cell Genomics. 2022; 2:100213. 10.1016/j.xgen.2022.100213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161. Nevers  Y, Jones  TEM, Jyothi  D  et al.  The Quest for Orthologs orthology benchmark service in 2022. Nucleic Acids Res. 2022; 50:W623–32. 10.1093/nar/gkac330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 162. Chatterjee  A, Johnson  CN, Luong  P  et al.  Bacteriophage resistance alters antibiotic-mediated intestinal expansion of enterococci. Infect Immun. 2019; 87:e00085-19. 10.1128/IAI.00085-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 163. Chatterjee  A, Willett  JLE, Nguyen  UT  et al.  Parallel genomics uncover novel enterococcal-bacteriophage interactions. mBio. 2020; 11:e03120-19. 10.1128/mbio.03120-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 164. Canfield  GS, Chatterjee  A, Espinosa  J  et al.  Lytic bacteriophages facilitate antibiotic sensitization of Enterococcus faecium. Antimicrob Agents Chemother. 2021; 65:e00143-21. 10.1101/2020.09.22.309401. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 165. Kirsch  JM, Ely  S, Stellfox  ME  et al.  Targeted IS-element sequencing uncovers transposition dynamics during selective pressure in enterococci. PLoS Pathog. 2023; 19:e1011424. 10.1371/journal.ppat.1011424. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

Genomes and metagenomes used to showcase the application of fai and zol are listed with GenBank accession identifiers in Supplementary Table S12. Total metagenomes and their associated information from Lake Mendota microbiome samplings were originally described in Tran et al. (2023) [90] and deposited in NCBI under BioProject PRJNA758276. Genomic assemblies available for A. flavus in NCBI’s GenBank database on 31 January 2023 were downloaded in FASTA format using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). Genomic assemblies for Enterococcus that met quality and taxonomic criteria for belonging to the genus or related genera (e.g. Enterococcus_A, Enterococcus_B, etc.) in GTDB [69] release R207 were similarly downloaded from NCBI’s GenBank database using ncbi-genome-download in FASTA format.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press