A fast comparative genome browser for diverse bacteria and archaea

Morgan N Price; Adam P Arkin

doi:10.1371/journal.pone.0301871

PLoS One. 2024 Apr 9;19(4):e0301871.


Editor: Vasilis J Promponas

PMCID: PMC11003636 PMID: 38593165

Abstract

Genome sequencing has revealed an incredible diversity of bacteria and archaea, but there are no fast and convenient tools for browsing across these genomes. It is cumbersome to view the prevalence of homologs for a protein of interest, or the gene neighborhoods of those homologs, across the diversity of the prokaryotes. We developed a web-based tool, fast.genomics, that uses two strategies to support fast browsing across the diversity of prokaryotes. First, the database of genomes is split up. The main database contains one representative from each of the 6,377 genera that have a high-quality genome, and additional databases for each taxonomic order contain up to 10 representatives of each species. Second, homologs of proteins of interest are identified quickly by using accelerated searches, usually in a few seconds. Once homologs are identified, fast.genomics can quickly show their prevalence across taxa, view their neighboring genes, or compare the prevalence of two different proteins. Fast.genomics is available at https://fast.genomics.lbl.gov.

Introduction

As of July 2023, the Genome Taxonomy Database lists about 85,000 species of bacteria and archaea and about 20,000 genera [1]. Although many of these taxa are known only from low-quality metagenome-assembled genomes, high-quality assemblies [2, 3] are available for over 6,000 genera and over 29,000 species. These genomes contain valuable information about the functions of genes and their ecological roles. For example, knowing which other protein families are encoded near a protein of interest, across diverse prokaryotes, can allow one to propose the protein’s function [4–6]. If homologs of two proteins co-occur, this suggests a close functional relationship [6, 7]. And the distribution of a protein family across taxa can be used to identify horizontal gene transfer events [8] or to give hints as to the protein’s ecological role [9].

All of these analyses depend on identifying homologs of the protein(s) of interest across many genomes. The standard approach is to use protein BLAST [10] or a similar tool to find homologs, but given the size of current genome databases, this typically takes over 10 minutes, which precludes interactive analysis. An alternative is to pre-compute groups of homologous proteins, using resources such as PFam [11], TIGRFam [12], or eggNOG [13]. Examples of fast tools that rely on pre-computed homology groups include GeCoViz, for identifying genes that are conserved near a query gene [14]; AnnoTree, for viewing the taxonomic distribution of protein families [15]; and PhyloCorrelate, for identifying protein families whose distribution is similar to that of a protein family of interest [16]. However, these tools will not give useful results, or might give misleading results, if the protein family of interest is a poor match to the pre-computed families. Tools that rely on pre-computed homology groups can also be cumbersome to maintain due to the computational cost of comparing new genomes to the existing protein families.

To support interactive browsing of protein families across the diversity of bacteria and archaea, we built a new website, fast.genomics (https://fast.genomics.lbl.gov/). The key challenge for interactive browsing is the fast identification of homologs for a protein of interest. We selected one high-quality representative genome for each genus, so the main database of fast.genomics contains proteins from “only” 6,377 genomes. Because the Genome Taxonomy Database (GTDB) aims for a roughly consistent age for each genus, as estimated using relative evolutionary divergence [1], these representatives should evenly cover the sequenced diversity of bacteria and archaea. In case searching the main database does not find enough homologs, fast.genomics also has a database for every taxonomic order, with up to 10 representatives of each species. Searching in the order that a protein belongs to will usually yield many additional homologs.

To speed up homology searches against the main database, fast.genomics uses MMseqs2 [17], an accelerated alternative to protein BLAST, and stores the MMseqs2 index in memory. Because MMseqs2 is optimized for large-scale searching, it does not benefit from multiple CPUs when handling a single query. To get around this, fast.genomics splits the candidates from the prefilter step of MMseqs2 into 10 lists and analyzes them in parallel. Parallel MMseqs2 takes an average of just 3.3 seconds per query, and is almost as sensitive as protein BLAST. Because some of the taxonomic orders contain thousands of genomes, fast.genomics also uses an accelerated strategy, based on clustering similar sequences, to search the order-level databases.

Once homologs have been identified, fast.genomics includes tools to view their gene neighborhoods, to view their taxonomic distribution, and to compare the distribution of the homologs of two proteins (also known as phylogenetic profiling or co-occurrence analysis).

Results and discussion

The fast.genomics databases

To ensure that the presence or absence of a gene family can be determined reliably, fast.genomics includes only high-quality genomes. Specifically, we required that an assembly be at least 90% complete and have at most 5% contamination, as assessed by CheckM [3], and that the assembly not be chimeric, as assessed by GUNC [18]. For metagenome-assembled genomes or single-cell amplified genomes, we required that they meet the MIMAG guidelines for a high-quality draft [2], which means that they contain rRNAs and most tRNAs. Using the April 2023 release of the Genome Taxonomy Database, we identified 6,377 genera and 29,413 species with high-quality genomes (Table 1). All taxonomic assignments in fast.genomics, including genus and species names, are taken from GTDB. Roughly speaking, GTDB defines species as clusters of genomes with above 95% average nucleotide identity, and defines groups at higher taxonomic levels to have consistent ages [1].

Table 1. Statistics for the main database, for the biggest order-level databases, and for all databases combined.

The number of proteins in each database is the number of distinct sequences, not the number of protein-coding genes.

Database	genomes	genera	species	proteins, millions	clusters, millions
Main	6,377	6,377	6,377	21.8	-
Pseudomonadales	4,321	163	1,807	11.7	1.9
Burkholderiales	3,772	340	1,997	12.3	2.8
Actinomycetales	3,339	245	1,891	7.7	2.2
Rhizobiales	3,271	215	1,643	12.5	2.4
Lactobacillales	3,223	98	1,071	3.7	0.9
Combined	56,186	6,377	29,413	217.3	44.3


Overview of the fast.genomics website

The main function of fast.genomics is to select a protein or gene of interest and then to find its homologs. The analyst can select a protein using locus tags or protein identifiers from fast.genomics as well as identifiers from a variety of other databases, including UniProt, the NCBI Protein database, RefSeq, or the Protein Data Bank. Alternatively, the analyst can enter a protein sequence. Fast.genomics also supports combined genus/annotation searches, such as “Escherichia thymidylate synthase”. Finally, the analyst can search for a taxon of interest and view genomes within a given genus or species. From the genome page, the analyst can find a gene or protein of interest by searching the text annotations or by using Curated BLAST to find proteins that are similar to characterized proteins that have that annotation [19].

Once the protein sequence of interest has been specified, fast.genomics provides accelerated sequence searches, against either the main database or against an order-level database, followed by several ways of visualizing the results.

Speed and sensitivity of finding homologs in the main database

To find homologs in the main database, fast.genomics uses fast parallel MMseqs2. To test the speed and sensitivity of our approach, we selected 1,000 proteins at random from the main database. We searched for homologs of these proteins in the main database using protein BLAST from NCBI’s BLAST+ package [20], fast parallel MMseqs2, or MMseqs2 with the most sensitive settings. We ran each query separately, as would occur during interactive use. BLAST+ and parallel MMseqs2 used 10 threads, and for MMseqs2, the index was loaded into memory beforehand.

We searched for up to 6,377 homologs with E ≤ 10⁻³ (6,377 is the number of genomes). These are the same settings used by the fast.genomics website. Fast parallel MMseqs2 took an average of 3.3 seconds per query, which was 7 times faster than BLAST+ (23 seconds on average) and 2.6 times faster than MMseqs2 with the same settings but no parallelism (8.4 seconds on average). Running times were correlated with the lengths of the queries, with a linear correlation of 0.57 for parallel MMseqs2 or 0.95 for BLAST.

BLAST+ and MMseqs2 use different alignment scores, so weak hits according to one scoring might not meet the E-value cutoff for the other, or they might be very far down in the sorted list of hits. To test the sensitivity of MMseqs2, we considered the top 3,188 BLAST+ hits for each query with E ≤ 10⁻⁵. (3,188 is half the number of genomes.) MMseqs2 with sensitive settings missed 1.4% or 2.2% of these hits (depending on the k-mer size), with a much lower miss rate for the top-ranked hits (Fig 1A). Fast parallel MMseqs2 missed 6.7% of hits, but most of these missed alignments were relatively weak: their median identity was 26.6%.

Fig 1 — (A) The miss rate as a function of the hit’s rank, for hits from BLAST+ with E ≤ 10⁻⁵. Each point is an average across 50 ranks and up to 1,000 queries. Besides fast parallel MMseqs2, we also show results for MMseqs2 with the most sensitive settings (using either 6-mers or 7-mers) and for COGs (as assigned using eggNOG-mapper). For COGs, we only considered the top 200 hits of each query. (B) The miss rate for potential orthologs (at least 30% identity and at least 50% coverage of the query).

When examining gene prevalence, the focus is usually on proteins that are potential functional orthologs. (Focusing on orthologs also justifies considering only the top #genomes/2 hits: since relatively few proteins have orthologs in the majority of prokaryotes, hits further down in the list are unlikely to be orthologs.) We considered a hit to be a potential functional ortholog if the alignment was at least 30% identity and covered at least 50% of the query. By this criterion, the average query had 1,847 potential orthologs (the median was 724). For potential orthologs with E ≤ 10⁻⁵ and rank ≤ #genomes/2, the miss rate of fast parallel MMseqs2 was just 1.2%. Again, the miss rate was much lower for the top-ranked hits (Fig 1B). Another threshold that is sometimes used for identifying potential orthologs is a bit score ratio of 0.3, that is, the bit score for the alignment is at least 30% of the maximum [21]. By this criterion, the average query had 893 potential orthologs (the median was 123). Among hits with E ≤ 10⁻⁵, rank ≤ #genomes/2, and score ratio ≥ 0.3, fast parallel MMseqs2 missed just 0.2% of hits.
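The criteria used in this comparison are simple enough to state as code. The following Python sketch is illustrative only and is not taken from the fast.genomics source; the function names and the way hits are represented (as identifiers) are assumptions, and how the published miss rates were pooled across queries is not specified here.

```python
def is_potential_ortholog(identity_pct, query_coverage_pct):
    """At least 30% identity and at least 50% coverage of the query."""
    return identity_pct >= 30.0 and query_coverage_pct >= 50.0


def score_ratio(bits, self_bits):
    """Bit score of a hit divided by the query's score against itself."""
    return bits / self_bits


def miss_rate(reference_ids, candidate_ids):
    """Fraction of reference hits (e.g. from BLAST+) that the
    candidate search (e.g. parallel MMseqs2) did not report."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(reference - set(candidate_ids)) / len(reference)
```

For a single query, `miss_rate` would be applied after restricting the reference hits to those that pass the E-value, rank, and ortholog filters described above.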

Overall, by using parallel MMseqs2, fast.genomics can find homologs for a protein of interest across 6,377 genomes in a few seconds, with a sensitivity for potential orthologs of around 99%.

Ortholog groups miss many potential orthologs despite being too broad

As discussed above, some fast comparative genomics tools rely on pre-computed ortholog groups from eggNOG. Given a protein of interest, eggNOG can quickly assign it to an ortholog group; if all of the proteins in the database have been assigned to ortholog groups, then this assignment also yields a list of homologs. To test the sensitivity of this approach, we ran eggNOG-mapper [22] on the top 200 homologs of each of our 1,000 test proteins.

We first considered the broadest ortholog groups returned by eggNOG-mapper, which are primarily from the COG database [23]. (COG is short for clusters of orthologous groups.) Of the 1,000 queries, 93 were not assigned to ortholog groups, and could not be handled by the ortholog approach. Forty of these lacked homologs besides themselves (as identified by BLAST+ with E ≤ 10⁻⁵), but the other 53 did have at least one homolog in our top-level database, and in all of these cases, fast parallel MMseqs2 identified at least one homolog. Across the remaining 907 queries, the ortholog group approach missed 9.2% of the high-ranking homologs. If we restrict our attention to high-ranking potential orthologs (at least 30% identity and 50% coverage, and again, rank at most 200), ortholog groups still missed 6.0% of homologs. Some of these potential orthologs might have different domain content, which would lead them to be classified in a different COG (even if they are close homologs). But when we required the alignment to cover 90% of both the query and the subject, ortholog groups still missed 3.6% of homologs. For comparison, among these high-ranking high-coverage hits (that were assigned to ortholog groups by eggNOG-mapper), fast parallel MMseqs2 missed just 0.1% of homologs.

Another issue with pre-computed ortholog groups is that they are often too broad to be useful. For example, in the model gut bacterium Bacteroides thetaiotaomicron VPI-5482, 41% of genes are in top-level ortholog groups with 5 or more members. Lower-level ortholog groups are narrower and can avoid this problem, but they have a much higher rate of missed homologs. For example, the third-level ortholog group (usually at the phylum level) missed 26% of high-ranking potential orthologs (homologs with at least 30% identity, at least 90% coverage both ways, and rank at most 200). Overall, searching for homologs with fast parallel MMseqs2 is much more accurate than relying on ortholog groups.

Speed and sensitivity of finding homologs in a large order

Because the largest order-level databases contain over ten million proteins each (Table 1), BLAST can be too slow for interactive analysis of these databases. Fast.genomics does not use MMseqs2 for the order-level databases because of MMseqs2’s memory requirements (63 GB for the 22 million proteins in the main database). Instead, fast.genomics uses a strategy based on sequence clustering. The largest orders contain roughly 2–3 times more genomes than species, and over 10 times more genomes than genera. This suggests that these orders contain many highly-similar sequences. Indeed, when we used CD-HIT [24, 25] to cluster the proteins in each order at 70% identity and 90% coverage both ways, we reduced the number of sequences in the largest orders by 3.5- to 6.2-fold (Table 1).

Given the redundancy of the large orders, fast.genomics can find homologs quickly by first searching against the reduced database, using BLAST+, and then searching for additional homologs in each cluster, using LAST [26]. To test the speed and accuracy of these clustered searches, we selected 1,000 random proteins from the Rhizobiales, which is the order with the largest number of proteins (12.5 million, which reduces to 2.4 million after clustering). We ran both clustered search and regular BLAST+ for each of these proteins. All analyses used 12 threads and each query was run separately. Regular BLAST+ took an average of 11.7 seconds, while clustered BLAST took an average of 3.6 seconds (3.3 times faster).
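The two-stage idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the fast.genomics implementation: the `score` callable stands in for both BLAST+ (first stage) and LAST (second stage), `clusters` is a hypothetical mapping from each cluster representative to all of its members (including itself), and a single score cutoff replaces the E-value threshold.

```python
def clustered_search(query, clusters, score, min_score):
    """Search representatives first, then expand only the clusters
    whose representative matched the query."""
    hits = []
    for representative, members in clusters.items():
        # Stage 1: compare the query to the reduced database.
        if score(query, representative) < min_score:
            continue
        # Stage 2: look for additional homologs within the cluster.
        for member in members:
            s = score(query, member)
            if s >= min_score:
                hits.append((member, s))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

The sketch also makes one likely source of missed homologs visible: a cluster member that would pass the cutoff on its own is never examined if its representative does not pass.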

To quantify the sensitivity of clustered BLAST, we considered the top #genomes/2 = 1,635 hits from regular BLAST+ with E < 10⁻⁵. 1.8% of these homologs were missed by clustered BLAST. Among the top 200 homologs for each query, just 0.3% were missed. Among potential orthologs (30% identity and 50% coverage of the query), 1.0% were missed. Overall, clustered BLAST typically finds homologs within the largest order in a few seconds, with a sensitivity for potential orthologs of around 99%.

Viewing gene neighborhoods

Once homologs have been identified for a protein of interest, fast.genomics can quickly show the gene neighborhoods around its top homologs (Fig 2). Hovering on each gene shows its annotation. A blue bar within each homolog shows the extent of homology. Hovering on the bar gives more information about the alignment, and clicking on it runs protein BLAST to show the pairwise alignment.

Fig 2 — The query protein is ING2E5A_RS06865.

To show if gene neighbors are conserved, the genes are color-coded by homology. Specifically, fast.genomics runs LAST [26] on all of the protein sequences of genes that are visible. Then, genes that are similar (with at least 50% coverage) are assigned the same color. This usually takes less than a second.
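One way to turn pairwise similarities into colors is to group genes into connected components. The Python sketch below assumes single-linkage grouping (any chain of similar pairs shares a color), which is an assumption about the grouping rule rather than something stated here; `similar_pairs` is a hypothetical list of gene pairs that already passed the LAST similarity and coverage filter.

```python
def assign_colors(genes, similar_pairs):
    """Return {gene: color index}; genes linked by any chain of
    similar pairs receive the same index (union-find grouping)."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the group's root, shortening the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    color_of_root = {}
    colors = {}
    for g in genes:
        root = find(g)
        if root not in color_of_root:
            color_of_root[root] = len(color_of_root)
        colors[g] = color_of_root[root]
    return colors
```

Because only the genes currently on screen are compared, the number of pairs stays small, which is consistent with the sub-second response described above.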

By default, the genus or species is shown at the right side of each gene neighborhood, and hovering on the taxon shows the taxonomic lineage. Alternatively, the full taxonomic lineage can be shown above each gene neighborhood. This makes it easier to understand which taxa contain close homologs of the query, but fewer homologs can be seen at once.

By default, the homologs are shown in descending order of bit score, but the analyst can also request that a multiple sequence alignment and a phylogenetic tree be computed. Fast.genomics uses MUSCLE [27], FastTree 2 [28], and midpoint rooting; together, these usually take a second or less. Splits in the tree are reordered to show the closest homologs at the top (Fig 2). This feature of fast.genomics was inspired by the MicrobesOnline tree browser [29], but MicrobesOnline uses pre-computed trees and was last updated in 2011.

Because the order-level databases contain up to 10 representative genomes for each species, there are often multiple very-similar homologs from different representatives of a species. By default, the gene neighborhood viewer will collapse proteins that belong to the same species if they belong to the same CD-HIT cluster (at least 70% identity). This allows more distant homologs to be shown, and also makes it easy to see if close homologs are conserved within each species.

Viewing the taxonomic distribution of a protein’s homologs

Fast.genomics can show which taxa contain homologs of a protein, at whatever taxonomic level the analyst selects. By default, all homologs are considered, but fast.genomics can also consider only potential orthologs (at least 30% identity and 50% coverage), as in Fig 3, or “good” homologs (with a bit score at least 30% of the maximum score). Alternatively, the analyst can download a table of homologs that includes GTDB’s classification of their genomes, bit scores, and e-values.

Fig 3 — Here we show the prevalence of potential orthologs of ING2E5A_RS06865 across phyla. “B” at left is short for bacteria. “Max ratio” reports the highest bit score ratio (the alignment score for a homolog divided by the alignment score of the query protein to itself) among homologs from that taxonomic group.

Comparing the presence and absence of two proteins’ homologs

Once two proteins are selected, and their homologs have been computed, fast.genomics can compare their distributions (co-occurrence analysis or phylogenetic profiling) in two ways. First, as shown in Fig 4, fast.genomics can plot the score ratio (the bit score divided by the maximum score) for the best hit in each genome. If the two proteins co-occur, then a genome with a high score ratio for one protein will tend to have a high score ratio for the other. Fast.genomics also highlights genomes that have the two proteins encoded close by (within 5 kb and on the same strand). This aspect of the presence/absence plot is conceptually similar to the information in the gene neighbor view, but the gene neighbor view does not scale to so many homologs. In this case, relatively weak homologs (score ratios below 0.2) are sometimes encoded close by. Fig 4 also shows a few high-scoring homologs that are in the same genome but not nearby.
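The two quantities behind this plot, the best score ratio per genome and the proximity flag, can be sketched as follows. The Python below is illustrative only: the gene record fields (`scaffold`, `strand`, `begin`, `end`) are hypothetical, and measuring "within 5 kb" as the gap between the two genes is an assumption, since the exact distance definition is not given here.

```python
def best_ratio_per_genome(hits, self_bits):
    """hits: iterable of (genome_id, bit_score) for one query.
    Returns {genome_id: highest bit score / self score}."""
    best = {}
    for genome_id, bits in hits:
        ratio = bits / self_bits
        if ratio > best.get(genome_id, 0.0):
            best[genome_id] = ratio
    return best


def encoded_close_by(gene1, gene2, max_gap=5000):
    """True if the genes share a scaffold and strand and the gap
    between them is at most max_gap nucleotides."""
    if gene1["scaffold"] != gene2["scaffold"]:
        return False
    if gene1["strand"] != gene2["strand"]:
        return False
    gap = max(gene1["begin"], gene2["begin"]) - min(gene1["end"], gene2["end"])
    return gap <= max_gap
```

Each genome then contributes one point, with the ratio for protein 1 on one axis and the ratio for protein 2 on the other.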

Fig 4 — Each point is a genome that contains a homolog of protein 1 and/or protein 2. Each axis shows the score ratio (the bit score divided by the maximum) for the highest-scoring homolog in that genome. Genomes are highlighted in green if the two homologs are encoded close by. If the same protein in a genome is the highest-scoring homolog for both queries (such as a fusion protein), then the genome would be highlighted in blue. Other genomes (with the homologs not encoded nearby) are shown in black. If a genome contains a homolog for one query but not the other, then the score is shown in the gray region below zero. Hovering on a point shows the taxonomic lineage of that genome. The two query proteins are ING2E5A_RS06865 and ING2E5A_RS06860.

Fast.genomics reports how often the homologs co-occur, both for all homologs and for “good” homologs with a bit score ≥ 30% of the maximum. Fast.genomics also selects a rank threshold that minimizes the probability of the observed co-occurrence, under a simple neutral model in which each gene has a fixed probability of being present in any genome. This threshold is shown as the “optimal” line (Fig 4). This feature was inspired by partial phylogenetic profiling [30], but to maintain a quick interactive response, fast.genomics optimizes one threshold (the same rank threshold for both proteins) instead of two thresholds. Fast.genomics uses Fisher’s exact test with a Bonferroni correction for the number of possible thresholds considered.
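A single-threshold scan of this kind might look like the Python sketch below. It is one interpretation of the description, not the fast.genomics code: it assumes a one-sided Fisher's exact test on genome-level presence, tries every rank from 1 to the length of the shorter hit list, and multiplies the best P value by the number of ranks tried. The exact-integer arithmetic is simple but slow for thousands of genomes.

```python
from math import comb


def fisher_right_tail(n_both, n_a, n_b, n_total):
    """P(overlap >= n_both) when n_a and n_b genomes are drawn
    independently from n_total (hypergeometric upper tail)."""
    upper = min(n_a, n_b)
    numerator = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return numerator / comb(n_total, n_b)


def best_rank_threshold(ranked_a, ranked_b, n_total):
    """ranked_a, ranked_b: genome ids of each protein's hits, best first.
    Returns (rank threshold, Bonferroni-corrected P value)."""
    n_thresholds = min(len(ranked_a), len(ranked_b))
    if n_thresholds == 0:
        raise ValueError("both proteins need at least one hit")
    best = None
    for k in range(1, n_thresholds + 1):
        set_a, set_b = set(ranked_a[:k]), set(ranked_b[:k])
        p = fisher_right_tail(len(set_a & set_b), len(set_a), len(set_b), n_total)
        if best is None or p < best[1]:
            best = (k, p)
    k, p = best
    return k, min(1.0, p * n_thresholds)
```

Using the same rank cutoff for both proteins keeps the scan linear in the number of ranks, instead of quadratic as it would be with two independent thresholds.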

Although the presence/absence plot summarizes information from many genomes, it does not give a quick indication of which taxa contain the genes. Fast.genomics can also show a table of the taxa that contain both genes, or of the taxa that contain the two genes in proximity.

Links to other sequence analysis tools

Although fast.genomics does not include pre-computed sequence features such as protein family membership, the protein page includes links to a variety of sequence analysis tools. These include two fast ways to place proteins into families: searching against the Conserved Domain Database [31], or finding the closest homolog in UniProt using SANSparallel [32], which then links to the InterPro results for that homolog [33]. In our experience, most of the proteins in the fast.genomics database have a homolog with 99%-100% identity in UniProt.

Limitations of fast.genomics

The main limitation of fast.genomics is probably the splitting of the database by order. This might make fast.genomics less effective for selfish genetic elements that can move between distantly related bacteria. Still, the main (genus-level) database can quickly indicate which orders frequently contain the protein of interest.

To ensure that presence/absence analyses give reliable results, fast.genomics includes only high-quality genomes; but this means that much of the diversity visible in metagenomes is missing from fast.genomics. For instance, of the 20,739 genera in GTDB, only about a third (6,377) are included in fast.genomics. Including low-quality metagenome-assembled genomes might allow analysts to identify more conserved gene neighborhoods. Still, the fast.genomics main database includes 1,418 genomes from uncultivated genera, of which 1,368 are metagenome-assembled genomes. And most prokaryotic proteins do have homologs in the main database: for instance, for random proteins from its database, fast.genomics finds at least one potential ortholog (at least 30% identity and 50% coverage) for 94% of queries.

Fast.genomics does not support iterative searches to find distant homologs (homologs that are too distant to be found using pairwise methods such as BLAST). In this case, we recommend using the HMMer web server [34] for iterated searches against reference proteomes, or using a tool such as AnnoTree [15] for predefined families.

Comparisons to other comparative genomics websites

As far as we know, the other fast websites for comparative genomics of bacteria and archaea all rely on pre-computed orthology groups, such as eggNOG. Above, we assessed the accuracy of ortholog groups for finding homologs. Here, we give an example of how tools that rely on ortholog groups can fail. Consider the TonB-dependent transporter BT2172. We previously proposed, based partly on comparative genomics evidence, that this transporter functions together with another protein, BT2173, belonging to the uncharacterized family DUF4249 [35]. Fast.genomics shows that these proteins form a conserved operon, and that this subfamily of TonB-dependent transporters is almost always encoded next to a DUF4249 protein, across dozens of genera.

When we ran GeCoViz [14] with the sequence of BT2172 and selected the top-scoring eggNOG (ENOG501R5B9, E = 3.94e-91), GeCoViz failed to identify any highly conserved gene neighbors. Specifically, no neighbors met the conservation threshold of 0.5. When we dropped the threshold to 0.1, then a variety of conserved gene neighbors were reported, but not DUF4249. This is because the top-scoring eggNOG mixes BT2172 and its close homologs together with several other larger subfamilies of TonB-dependent receptors.

We also analyzed BT2172 using the STRING database of functional associations between proteins, which includes associations based on conserved proximity or co-occurrence. These associations are pre-computed using a combination of BLAST searches and orthology groups [36, 37]. (As of October 2023, STRING 12.0 includes 2,424 core genomes and 12,535 total genomes, and all proteins from core genomes are compared to all proteins from all genomes.) Given BT2172 (BT_2172) as a query, STRING correctly identified BT2173 (BT_2173) as a conserved gene neighbor. However, STRING also reported that the co-occurrence of the two proteins across genomes was “none / insignificant.” STRING’s co-occurrence viewer suggested that BT2172 is present in diverse bacteria, while BT2173 is present only in Bacteroidetes. In contrast, fast.genomics reports that the closest homologs of the two proteins co-occur (P = 10^−23.6). Our interpretation is that the distant homologs of BT2172 from other phyla have a different function. STRING’s co-occurrence viewer does show that the homologs of BT2172 from other phyla are quite distant, but by a subtle color coding, and this is apparently not taken into account in the automated analysis. In reality, disrupting either BT2172 or BT2173 has similar consequences [35], which confirms that they function together and suggests that the distant homologs of BT2172 from genomes that lack homologs of BT2173 have a different function. But in the absence of the genetic data, the report from STRING that BT2172 has a different distribution than BT2173 would suggest that BT2172 can function on its own, which is misleading.

Conversely, BLAST-based tools should give accurate results, but they are far slower than fast.genomics. Based on a recent review [38], we identified three websites that are still maintained and that report gene neighborhoods for the most similar proteins, as identified using BLAST. These were WebFlaGs, the Enzyme Function Initiative’s Genome Neighborhood Tool (EFI-GNT retrieve neighborhood diagrams), and IMG/M (top homologs combined with gene cart neighborhoods) [39–41]. When we ran these tools using a 290 amino acid protein (BT2157) as the query, the quickest response was from WebFlaGs, in 11 minutes. None of these tools show the gene neighborhoods together with information about the similarity of each homolog or the extent of homology (in case of changes to domain structure). In the IMG/M results, the first 60 homologs are over 90% identical, and all but one of these are from the genus Bacteroides. Information from such close homologs is probably not functionally informative, because gene neighborhoods beyond operons (that is, the proximity of genes that probably lack a functional relationship) are often conserved in closely-related bacteria [42]. WebFlaGs and EFI-GNT have options to search against reduced databases (such as UniRef90, which is clustered at 90% identity), which avoids this problem. Similarly, fast.genomics avoids this issue by having just one representative per genus in the main database, or for the sub-databases, by clustering together similar proteins from the same species.

As far as we know, none of these websites can use a tree derived from protein sequences to organize the results. (This capability is described in the WebFlaGs manuscript, but as of October 2023, it is not available.) By grouping together similar sequences, fast.genomics can highlight subfamilies with consistent gene neighborhoods. In contrast, GeCoViz and STRING use the species tree to organize their displays. Because of rampant horizontal gene transfer, the species tree is generally not a reasonable guide to the gene tree. Furthermore, neither GeCoViz nor STRING’s gene neighborhood viewer shows which homologs are closely related to the query.

Conclusions

Fast.genomics supports interactive browsing of protein families across the diversity of bacteria and archaea. To identify homologs for a protein of interest in a few seconds, fast.genomics splits the database of genomes into a main database, with one representative per genus, and a database for each taxonomic order. Furthermore, fast.genomics uses accelerated searches which are almost as sensitive as protein BLAST, but are several times faster. These accelerated searches are much more accurate than pre-computed ortholog groups. Once homologs have been identified, fast.genomics can rapidly show their gene neighborhoods and their taxonomic distribution.

Given ongoing improvements in long-read sequencing and metagenomics, we expect the number of genera with high-quality genomes to increase substantially over the next few years. For example, if all of the genera in GTDB had at least one representative with a high-quality genome, then fast.genomics’ main database would need to expand three-fold. In the future, we hope to maintain quick access to the expanding diversity of high-quality prokaryotic genomes by using faster CPUs and solid-state storage and by making improvements to the parallelism of MMseqs2.

Materials and methods

Data sources

Taxonomic assignments and metadata about assemblies, including CheckM scores and MIMAG quality assessments, were downloaded from the Genome Taxonomy Database (release 08-RS214). Assemblies, including gene annotations, were downloaded from RefSeq or Genbank. Although we ran GUNC on some assemblies ourselves, for the most part, we relied on previous classifications of assemblies as being non-chimeric, either from the GUNC website (https://grp-bork.embl-community.io/gunc/datasets.html) or from proGenomes3 [43].

Which genomes to include

As mentioned above, we required that each genome have CheckM completeness of at least 90% and contamination of at most 5%; we required that the genome not be classified as chimeric by GUNC; and if the genome was not from an isolate, we required that it meet the MIMAG criteria for high quality. Assemblies without protein annotations in Genbank or Refseq were not considered. Also, we excluded a few genomes that had an unreasonably high proportion of pseudogenes; specifically, we required that at least half of the genes be protein coding.
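These inclusion rules can be summarized as a single predicate. The Python sketch below is illustrative and not the code used to build the databases; the `Assembly` record and all of its field names are hypothetical, and only the thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    # Hypothetical record; field names are assumptions for illustration.
    completeness: float            # CheckM completeness, percent
    contamination: float           # CheckM contamination, percent
    chimeric: bool                 # GUNC classification
    is_isolate: bool
    mimag_high_quality: bool
    has_protein_annotations: bool
    n_genes: int
    n_protein_coding: int


def is_included(a: Assembly) -> bool:
    if not a.has_protein_annotations:
        return False
    if a.completeness < 90.0 or a.contamination > 5.0:
        return False
    if a.chimeric:
        return False
    # MIMAG high quality is required only for non-isolate genomes.
    if not a.is_isolate and not a.mimag_high_quality:
        return False
    # At least half of the genes must be protein coding.
    return 2 * a.n_protein_coding >= a.n_genes
```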

We selected one genome per genus for the main database, and up to 10 genomes per species for the order-level databases. When selecting genomes, we preferred assemblies that were in RefSeq; that were the type species of a genus (according to GTDB); that were selected by GTDB as the representative of a species; that had a lower value of 2 × contamination − completeness (using CheckM scores); that were in proGenomes3’s list of highly important strains; or that had a longer largest scaffold.
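A preference list like this can be expressed as a sort key. The Python sketch below assumes that the criteria are applied in the order listed, with each later criterion used only to break ties in the earlier ones; that strict ordering is an assumption, as are the dictionary keys, and the actual selection procedure may combine the preferences differently.

```python
def preference_key(genome):
    """Smaller tuples sort first, so preferred genomes come first."""
    return (
        not genome["in_refseq"],
        not genome["is_type_species"],
        not genome["is_gtdb_representative"],
        2 * genome["contamination"] - genome["completeness"],
        not genome["is_important_strain"],
        -genome["largest_scaffold"],
    )


def pick_genomes(candidates, n):
    """Return the n most preferred candidates (1 per genus for the
    main database, up to 10 per species for an order-level database)."""
    return sorted(candidates, key=preference_key)[:n]
```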

Fast parallel MMseqs2

MMseqs2 provides an “easy-search” workflow that emulates the output of protein BLAST. This workflow includes four steps: createdb, prefilter, align, and convertalis. Most of the time is taken up by the prefilter step, which finds potential homologs with a promising ungapped alignment, and the align step, which searches around each of those ungapped alignments for the optimal gapped alignment. We wrote a Perl script (mmseqsParallel.pl) that runs the same steps as easy-search, but splits the results of the prefilter into 10 parts. It then runs the align step on those 10 parts in parallel, combines the parts, keeps the top #genomes entries, and runs the final step. mmseqsParallel.pl gives the same results as MMseqs2’s easy-search except for the truncation of the result list to #genomes entries and the ordering of hits with the same e-value and bit score.
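The split, align, and merge logic can be sketched as follows. This is an illustration in Python, not a port of mmseqsParallel.pl: the `align` callable is a hypothetical stand-in for the MMseqs2 align step, and ranking the merged hits by bit score alone is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Hit = tuple[str, float]  # (target identifier, bit score)


def split_into_parts(items: Sequence[str], n_parts: int) -> list[list[str]]:
    """Split items into n_parts round-robin slices of nearly equal size."""
    return [list(items[i::n_parts]) for i in range(n_parts)]


def parallel_align(
    candidates: Sequence[str],
    align: Callable[[list[str]], list[Hit]],
    n_genomes: int,
    n_parts: int = 10,
) -> list[Hit]:
    """Align each slice of prefilter candidates concurrently, then keep the top hits."""
    parts = split_into_parts(candidates, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(align, parts))
    merged = [hit for part in results for hit in part]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # best bit score first
    return merged[:n_genomes]
```

Threads are used here because a real `align` would mostly wait on an external process; any worker pool would serve the same purpose.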

For the typical query, with no parallelization, the prefilter and align steps each take about the same amount of time, which explains why the median time for parallel MMseqs2 is about half that for regular MMseqs2 (3.0 seconds versus 6.1 seconds). However, for longer queries that have many homologs, the align step can take much longer, so that parallel MMseqs2 gives a somewhat larger reduction in average time (2.6x instead of 2.0x).
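The relationship between the align step's share of the run time and the achievable speedup can be illustrated with an idealized Amdahl-style model. This is a simplification introduced here, not a calculation from the paper: it assumes the align step divides perfectly across the parts and that everything else runs serially with no added overhead.

```python
def expected_speedup(align_fraction: float, n_parts: int = 10) -> float:
    """Idealized speedup when only the align step is split across n_parts workers."""
    if not 0.0 <= align_fraction <= 1.0:
        raise ValueError("align_fraction must be between 0 and 1")
    serial = 1.0 - align_fraction        # prefilter and other serial work
    parallel = align_fraction / n_parts  # align step, perfectly divided
    return 1.0 / (serial + parallel)
```

Under this model, an even split between prefilter and align (`align_fraction=0.5`) gives a speedup of roughly 1.8x, and an align share of two-thirds gives 2.5x, which is broadly in line with the median and average reductions reported above.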

As an alternative to running the alignment step in parallel, we also tried splitting the database into 10 pieces and running MMseqs2 in parallel on each piece. This was slower than parallelizing only the alignment step, because the prefilter step is I/O intensive, and splitting the database into 10 pieces would increase the number of random memory accesses against the index of k-mers by 10-fold.

Conceptually, the prefilter stage combines two separate computations: looking up k-mers to find candidate homologs (and which diagonal the homology is on), and finding the best ungapped alignment for these candidates. Looking up k-mers is I/O intensive and probably wouldn’t benefit from using multiple CPUs, but computing ungapped alignments might benefit from multi-threading. This is an opportunity to further speed up parallel MMseqs2.

Settings for MMseqs2

We chose to use a k-mer size of 6 instead of 7, as this leads to only a slight decrease in the maximum sensitivity (Fig 1) and it reduces the time in the prefilter step, which cannot be parallelized. For a k-mer size of 6 and 6,377 genomes, the MMseqs2 database is 63 GB (including the k-mer index) and takes about 5 minutes to build.

To run MMseqs2 with sensitive settings, we used the maximum value of the sensitivity parameter (7.5) and asked MMseqs2 to consider up to 100 * #genomes alignments (the max-seqs parameter). Fast parallel MMseqs2 considers up to 8 * #genomes alignments and returns a maximum of #genomes alignments. Fast parallel MMseqs2 varies the sensitivity parameter depending on the query length because in initial testing, the miss rate was much higher for shorter sequences. The sensitivity levels are: 7.5 (the maximum) for queries of up to 150 amino acids; or 7.0 if the query is 250 amino acids or less; or 6.25 if 350 amino acids or less; or 6.0 if 650 amino acids or less; or 5.7 (the default) for queries of over 650 amino acids. With these settings, the miss rate for potential orthologs is still the highest for the shortest queries (2.0% for queries of 150 amino acids or less), but these are already being run at maximum sensitivity.
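The length-dependent sensitivity schedule can be written as a small lookup. The thresholds and multipliers below are taken from the paragraph above; the function names are hypothetical.

```python
def sensitivity_for_query(length_aa: int) -> float:
    """Return the MMseqs2 sensitivity setting for a query of the given length."""
    if length_aa <= 150:
        return 7.5   # the maximum
    if length_aa <= 250:
        return 7.0
    if length_aa <= 350:
        return 6.25
    if length_aa <= 650:
        return 6.0
    return 5.7       # the MMseqs2 default


def alignment_limits(n_genomes: int) -> tuple[int, int]:
    """Return (alignments considered, alignments returned) for fast parallel MMseqs2."""
    return 8 * n_genomes, n_genomes
```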

Because MMseqs2 with sensitive settings returns up to 100 * #genomes alignments, while fast parallel MMseqs2 returns only #genomes alignments, the difference in their sensitivity (shown in Fig 1) is inflated, but only by a small amount. If we consider only the top #genomes hits from MMseqs2 with sensitive settings and a k-mer size of 6, then its miss rate for potential orthologs increases slightly from 0.6% to 0.7%. This difference reflects the different scoring used by BLAST+ and MMseqs2, which leads to a different ordering of the hits.

Clustered search

Given a query, clustered search runs BLAST+ against a database of cluster representatives. Then it expands the hits by adding additional members of the clusters, and it uses LAST [26] to compare the query to all of the candidate homologs. To limit the time for the second step, the number of potential homologs is limited to the number of genomes in the order-level database or 200, whichever is greater. Note that LAST and BLAST+ report different bit scores.
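The expansion step and its cap can be sketched as below. The text specifies the size of the cap but not how candidates are chosen when the cap is reached, so walking the representative hits best-first and stopping at the limit is an assumption, as are the function and parameter names.

```python
def expand_cluster_hits(
    representative_hits: list[str],
    members_of: dict[str, list[str]],
    n_genomes: int,
    floor: int = 200,
) -> list[str]:
    """Expand hits to cluster representatives into candidate homologs, up to a cap."""
    limit = max(n_genomes, floor)  # number of genomes or 200, whichever is greater
    candidates: list[str] = []
    seen: set[str] = set()
    for rep in representative_hits:  # assumed to be ordered best hit first
        for protein in members_of.get(rep, [rep]):
            if protein in seen:
                continue
            seen.add(protein)
            candidates.append(protein)
            if len(candidates) >= limit:
                return candidates
    return candidates
```

The returned candidates would then be compared to the query in the second step (LAST in the paper).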

The clusters were identified by running CD-HIT [24, 25] separately for each order-level database. We ran CD-HIT with 5-mers, required 70% identity and 90% alignment coverage, and used 50 threads (-n 5 -c 0.7 -aS 0.9 -aL 0.9 -T 50). CD-HIT is the slowest part of building the fast.genomics order-level databases, which took a total of 44 hours.

E-values and compositional bias

All homology searches in fast.genomics use an e-value cutoff of 10⁻³. For searches against the main database, e-values and alignment scores (bit scores) are from MMseqs2, which adjusts for local compositional bias [17]. The e-values from MMseqs2 should be unbiased: in a large-scale validation test that was conducted by its authors, and that included proteins with disordered and low-complexity regions, about 1 per 1,000 queries had any false positives with E < 10⁻³ [17].

For searches against the order-level databases, the alignment scores and e-values are from LAST [26], which does not correct for compositional bias. We do not expect the use of LAST to add false positives because proteins are only considered by LAST if they are very similar to homologs that were identified using BLAST+, which does correct for compositional bias [44]. However, the authors of MMseqs2 found that e-values from BLAST+ are overly optimistic [17], so we wondered if there was a high rate of false positives in the clustered search results. For the 1,000 random query proteins from Rhizobiales, we considered the weak hits from clustered search (10⁻⁵ < E < 10⁻³). We considered only the top 100 hits for each query, because by default, the gene neighborhood view shows the top 50 species clusters, and this database has roughly twice as many genomes as species. Of these high-ranking weak hits, 85% were found by MMseqs2 as well (MMseqs2 e-value < 0.05) and are unlikely to be false positives. We checked five of the exceptions and all five pairs are probably true homologs: they matched the same family (hidden Markov model) in InterPro [33] or they had similar predicted structures [45].

Regardless of the risk of false positives, distant homologs (under 30% identity) are likely to have different functions, and should probably be ignored unless they have conserved gene neighborhoods.

Features of the gene neighborhood view

To identify sequence similarity between the genes that are shown, fast.genomics runs LAST [26] on their protein sequences, using 12 threads and default settings. Only alignments that cover at least 50% of both sequences are considered. A greedy clustering algorithm is used to group similar genes together, and each group is given the same color. Two hatch patterns are used to expand the effective number of colors from 11 colors to 33. The query and its homologs are shown in white, which is effectively another color.
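The coverage filter, grouping, and color assignment can be sketched as follows. The text does not describe the greedy algorithm in detail, so comparing each gene only to the first member (seed) of each existing group is an assumption, as are the pattern names and the wrap-around once all 33 styles are used.

```python
from typing import Callable


def covers_both(aligned_a: int, len_a: int, aligned_b: int, len_b: int,
                min_coverage: float = 0.5) -> bool:
    """True if the alignment covers at least min_coverage of both sequences."""
    return aligned_a >= min_coverage * len_a and aligned_b >= min_coverage * len_b


def greedy_cluster(genes: list[str],
                   similar: Callable[[str, str], bool]) -> dict[str, int]:
    """Assign each gene to the first group whose seed it resembles, else start a group."""
    seeds: list[str] = []
    group_of: dict[str, int] = {}
    for gene in genes:
        for idx, seed in enumerate(seeds):
            if similar(gene, seed):
                group_of[gene] = idx
                break
        else:
            seeds.append(gene)
            group_of[gene] = len(seeds) - 1
    return group_of


def style_for_group(idx: int, n_colors: int = 11,
                    patterns: tuple[str, ...] = ("solid", "hatch1", "hatch2")) -> tuple[int, str]:
    """Map a group index to (color index, pattern); 11 colors x 3 patterns = 33 styles."""
    idx %= n_colors * len(patterns)  # wrap-around is an assumption
    return idx % n_colors, patterns[idx // n_colors]
```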

To infer a phylogenetic tree, MUSCLE 3 is run with fast settings (-maxiters 2 -maxmb 1000) and only the part of each sequence that is similar to the query is included in the alignment. Then, FastTree 2 is run with default settings, followed by rooting the tree at the midpoint (to minimize the maximum distance from the root to any sequence).

Fast.genomics can show up to 200 homologs. (Fig 1 shows only 25 homologs to save space.) But some genes have thousands of potential orthologs. To make it easier to see the range of gene neighbors for a protein family, fast.genomics can select homologs (or “good” homologs with bit score ≥ 30% of the maximum score) at random, instead of showing the top homologs.
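Random selection among "good" homologs can be sketched as below. The 30% threshold and the limit of 200 come from the text; the function name, the `(identifier, bit score)` tuple layout, and the optional seed are assumptions for illustration.

```python
import random
from typing import Optional


def sample_homologs(
    hits: list[tuple[str, float]],
    n_show: int = 200,
    good_only: bool = True,
    min_fraction: float = 0.3,
    seed: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Pick up to n_show homologs at random, optionally restricted to good hits."""
    if not hits:
        return []
    if good_only:
        # "Good" homologs score at least min_fraction of the best bit score.
        cutoff = min_fraction * max(score for _, score in hits)
        hits = [hit for hit in hits if hit[1] >= cutoff]
    if len(hits) <= n_show:
        return list(hits)
    return random.Random(seed).sample(hits, n_show)
```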

Genes in the same operon often have a functional relationship, and genes are much more likely to be in the same operon if they are very close together [46, 47]. So it can be important to know just how close the genes are together—but this is difficult to see in a gene-level view. In fast.genomics, hovering in between two genes or near the edge of a gene shows the spacing in nucleotides.

Implementation of the web site

The web site is implemented in perl 5 using the common gateway interface (CGI). Graphics are rendered using scalable vector graphics (SVG). To avoid redoing computations such as identifying homologs for a protein sequence, color coding the proteins in the gene neighborhood view, or inferring a phylogenetic tree, the inputs are hashed using MD5, which produces a 128-bit hash, and the results are stored on disk based on that hash.
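The caching scheme can be sketched as follows. The site itself is written in Perl; this Python illustration only shows the idea of keying stored results on an MD5 hash of the inputs. The file naming, the JSON serialization, and the function names are assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cache_path(cache_dir: str, kind: str, inputs: dict) -> Path:
    """Build a file path from the MD5 hash (128 bits, 32 hex digits) of the inputs."""
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(canonical).hexdigest()
    return Path(cache_dir) / f"{kind}_{digest}.json"


def cached(cache_dir: str, kind: str, inputs: dict,
           compute: Callable[[dict], Any]) -> Any:
    """Return a stored result if one exists; otherwise compute it and store it."""
    path = cache_path(cache_dir, kind, inputs)
    if path.exists():
        return json.loads(path.read_text())
    result = compute(inputs)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

MD5 is used here only as a cache key, not for security, so its known cryptographic weaknesses do not matter for this purpose.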

Tables of genomes, genes, proteins, and taxa are stored in a SQLite3 database. There is a separate SQLite3 database file for the main database and for each order, so that most of the code is the same regardless of whether it is operating on the main database or an order-level database.

Hardware and memory requirements

The fast.genomics server has 1 TB of memory and 64 AMD Opteron 6376 CPUs running at 1.4 GHz. The server is used for other memory-intensive computations as well, but it usually has little load.

For the top-level database, which has 21.8 million proteins and 7.1 billion amino acids, the MMseqs2 database requires 63 GB. For good performance, this needs to be kept in memory. A SQLite3 database of genes takes another 5.2 GB and this should also be kept in memory. For example, to describe the taxonomic prevalence of a protein’s homologs, fast.genomics needs to look up the gene(s) and genome identifier(s) for every homolog’s protein identifier. For clustered search within each order, the clustering information (18.5 GB) needs to be kept in memory, as looking up the additional members of thousands of clusters could otherwise take several seconds. Overall, the server needs at least 87 GB of memory (63 + 5.2 + 18.5 GB). We run a script, once per hour, to read these tables or files to ensure that they are kept in memory. In the future, we hope to switch to solid-state storage, which will make this less important.
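A script of the kind described might simply read each file from start to finish so that the operating system retains it in its file cache. The sketch below assumes that mechanism; it is not the authors' script, and the function names are hypothetical.

```python
def touch_file(path: str, chunk_size: int = 1 << 24) -> int:
    """Read a file sequentially in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


def warm_files(paths: list[str]) -> int:
    """Read every listed file once; intended to be run periodically (e.g. hourly)."""
    return sum(touch_file(path) for path in paths)
```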

Data Availability

Fast.genomics is available at http://fast.genomics.lbl.gov. The source code and the database have also been archived at figshare (https://doi.org/10.6084/m9.figshare.24010353.v1).

Funding Statement

This material by ENIGMA - Ecosystems and Networks Integrated with Genes and Molecular Assemblies (http://enigma.lbl.gov), a Science Focus Area Program at Lawrence Berkeley National Laboratory, is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36: 996–1004. doi: 10.1038/nbt.4229
2. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35: 725–731. doi: 10.1038/nbt.3893
3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi: 10.1101/gr.186072.114
4. Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998;23: 324–328. doi: 10.1016/s0968-0004(98)01274-2
5. Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001;11: 356–372. doi: 10.1101/gr.gr-1619r
6. Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000;10: 1204–1210. doi: 10.1101/gr.10.8.1204
7. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999;96: 4285–4288. doi: 10.1073/pnas.96.8.4285
8. Zhaxybayeva O, Doolittle WF. Lateral gene transfer. Curr Biol. 2011;21: R242–6. doi: 10.1016/j.cub.2011.01.045
9. Price MN, Deutschbauer AM, Arkin AP. Four families of folate-independent methionine synthases. PLoS Genet. 2021;17: e1009342. doi: 10.1371/journal.pgen.1009342
10. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi: 10.1093/nar/25.17.3389
11. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42: D222–30. doi: 10.1093/nar/gkt1223
12. Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E. Tigrfams and genome properties in 2013. Nucleic Acids Res. 2013;41: D387–95. doi: 10.1093/nar/gks1234
13. Hernández-Plaza A, Szklarczyk D, Botas J, Cantalapiedra CP, Giner-Lamia J, Mende DR, et al. eggNOG 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res. 2023;51: D389–D394. doi: 10.1093/nar/gkac1022
14. Botas J, Rodríguez Del Río Á, Giner-Lamia J, Huerta-Cepas J. GeCoViz: genomic context visualisation of prokaryotic genes from a functional and evolutionary perspective. Nucleic Acids Res. 2022;50: W352–W357. doi: 10.1093/nar/gkac367
15. Mendler K, Chen H, Parks DH, Lobb B, Hug LA, Doxey AC. AnnoTree: visualization and exploration of a functionally annotated microbial tree of life. Nucleic Acids Res. 2019;47: 4442–4448. doi: 10.1093/nar/gkz246
16. Tremblay BJ-M, Lobb B, Doxey AC. PhyloCorrelate: inferring bacterial gene-gene functional associations through large-scale phylogenetic profiling. Bioinformatics. 2021;37: 17–22. doi: 10.1093/bioinformatics/btaa1105
17. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35: 1026–1028. doi: 10.1038/nbt.3988
18. Orakov A, Fullam A, Coelho LP, Khedkar S, Szklarczyk D, Mende DR, et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 2021;22: 178. doi: 10.1186/s13059-021-02393-0
19. Price MN, Arkin AP. Curated BLAST for genomes. mSystems. 2019;4. doi: 10.1128/mSystems.00072-19
20. Park Y, Sheetlin S, Ma N, Madden TL, Spouge JL. New finite-size correction for local alignment score distributions. BMC Res Notes. 2012;5: 286. doi: 10.1186/1756-0500-5-286
21. Lerat E, Daubin V, Moran NA. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol. 2003;1: E19. doi: 10.1371/journal.pbio.0000019
22. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol. 2021;38: 5825–5829. doi: 10.1093/molbev/msab293
23. Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43: D261–9. doi: 10.1093/nar/gku1223
24. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22: 1658–1659. doi: 10.1093/bioinformatics/btl158
25. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28: 3150–3152. doi: 10.1093/bioinformatics/bts565
26. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21: 487–493. doi: 10.1101/gr.113985.110
27. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32: 1792–1797. doi: 10.1093/nar/gkh340
28. Price MN, Dehal PS, Arkin AP. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5: e9490. doi: 10.1371/journal.pone.0009490
29. Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, Chivian D, et al. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010;38: D396–400. doi: 10.1093/nar/gkp919
30. Basu MK, Selengut JD, Haft DH. ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process. BMC Bioinformatics. 2011;12: 434. doi: 10.1186/1471-2105-12-434
31. Marchler-Bauer A, Derbyshire MK, Gonzales NR, Lu S, Chitsaz F, Geer LY, et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 2015;43: D222–6. doi: 10.1093/nar/gku1221
32. Somervuo P, Holm L. SANSparallel: interactive homology search against Uniprot. Nucleic Acids Res. 2015;43: W24–9. doi: 10.1093/nar/gkv317
33. Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 2017;45: D190–D199. doi: 10.1093/nar/gkw1107
34. Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46: W200–W204. doi: 10.1093/nar/gky448
35. Price MN, Deutschbauer AN, Arkin AP. Many Families of Lids for TonB-dependent Transporters in Bacteroides. BioRxiv. 2023. doi: 10.1101/2023.03.17.533168
36. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Krüger B, et al. STRING 7—recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 2007;35: D358–62. doi: 10.1093/nar/gkl825
37. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51: D638–D646. doi: 10.1093/nar/gkac1000
38. Reed C, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, et al. Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighborhood information for protein families. BioRxiv. 2023. doi: 10.1101/2023.05.03.539116
39. Saha CK, Sanches Pires R, Brolin H, Delannoy M, Atkinson GC. FlaGs and webFlaGs: discovering novel biology through the analysis of gene neighbourhood conservation. Bioinformatics. 2021;37: 1312–1314. doi: 10.1093/bioinformatics/btaa788
40. Oberg N, Zallot R, Gerlt JA. EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol. 2023;435: 168018. doi: 10.1016/j.jmb.2023.168018
41. Chen I-MA, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, et al. The IMG/M data management and analysis system v.7: content updates and new features. Nucleic Acids Res. 2023;51: D723–D732. doi: 10.1093/nar/gkac976
42. Rocha EPC. Inference and analysis of the relative stability of bacterial chromosomes. Mol Biol Evol. 2006;23: 513–522. doi: 10.1093/molbev/msj052
43. Fullam A, Letunic I, Schmidt TSB, Ducarmon QR, Karcher N, Khedkar S, et al. proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes. Nucleic Acids Res. 2023;51: D760–D766. doi: 10.1093/nar/gkac1078
44. Yu Y-K, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics. 2005;21: 902–911. doi: 10.1093/bioinformatics/bti070
45. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50: D439–D444. doi: 10.1093/nar/gkab1061
46. Moreno-Hagelsieb G, Collado-Vides J. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics. 2002;18 Suppl 1: S329–36. doi: 10.1093/bioinformatics/18.suppl_1.s329
47. Price MN, Huang KH, Alm EJ, Arkin AP. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005;33: 880–892. doi: 10.1093/nar/gki232

PLoS One. doi: 10.1371/journal.pone.0301871.r001

Decision Letter 0

Vasilis J Promponas

8 Feb 2024

PONE-D-23-38656

A fast comparative genome browser for diverse bacteria and archaea

PLOS ONE

Dear Dr. Price,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 24 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vasilis J Promponas

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Note from Emily Chenette, Editor in Chief of PLOS ONE, and Iain Hrynaszkiewicz, Director of Open Research Solutions at PLOS: Did you know that depositing data in a repository is associated with up to a 25% citation advantage (https://doi.org/10.1371/journal.pone.0230416)? If you’ve not already done so, consider depositing your raw data in a repository to ensure your work is read, appreciated and cited by the largest possible audience. You’ll also earn an Accessible Data icon on your published paper if you deposit your data in any participating repository (https://plos.org/open-science/open-data/#accessible-data).

3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. Thank you for stating the following financial disclosure:

"This material by ENIGMA- Ecosystems and Networks Integrated with Genes and Molecular Assemblies (http://enigma.lbl.gov), a Science Focus Area Program at Lawrence Berkeley National Laboratory is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231"

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This material by ENIGMA- Ecosystems and Networks Integrated with Genes and Molecular Assemblies (http://enigma.lbl.gov), a Science Focus Area Program at Lawrence Berkeley National Laboratory is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH1123."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"This material by ENIGMA- Ecosystems and Networks Integrated with Genes and Molecular Assemblies (http://enigma.lbl.gov), a Science Focus Area Program at Lawrence Berkeley National Laboratory is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231"

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

6. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Specifically, the reviewers who assessed your manuscript and the Fast.genomics software/server (whose detailed comments can be found below) both find value in your work: Fast.genomics is a welcome addition to the computational tools available for computational comparative genomics, functions as described in the paper and the manuscript is clearly written.

Reviewer #1 poses several interesting topics for discussion, which I believe are easy to address in a revised version of this work. Of particular importance is the comment on the suitability of different e-value thresholds for establishing homology relations; it can be emphasized that by providing the Fast.genomics data and code in GitHub, interested researchers can experiment and fine tune the tools for their particular needs.

Reviewer #2 initially highlights the lack of comparison to online compute environments providing similar functionality to Fast.genomics. Even though it might be of interest to see detailed benchmarks to systems like IMG or MGnify, I believe it will be adequate to only provide high level comparisons (e.g., comparison of features available) since such systems are installed in compute infrastructures which are not easy to replicate/have access to and their performance when run over the web can vary based on the specific workload of their servers. Regarding the second comment, although this is definitely worth discussing within the manuscript, I understand that providing a detailed comparison to the work presented in the recently published paper by Pavlopoulos and colleagues is a whole project on its own. However, adding some practical tips (either in the manuscript or in the GitHub repository) on how interested readers could exploit these newly characterized protein families for enhancing Fast.genomics functionalities would be welcome.

On a personal note, it has long been reported that local compositional extremes (e.g., regions with low complexity/compositional bias/tandem repeats) in protein sequences can be a major confounding factor when searching for homologous proteins. However, by carefully inspecting the manuscript/webserver and quickly going through the GitHub repository, I could see no information on how your approach deals with such cases. I suggest that the manuscript text is amended to accurately reflect how local compositional biases are handled, with an appropriate rationale for this choice. Even though such features are not as prominent in prokaryotes, a simple UniProt query (https://www.uniprot.org/uniprotkb?query=%28taxonomy_id%3A83333%29+COMPBIAS; accessed Feb 05, 2024) yields more than 200 proteins annotated with compositional bias in the Escherichia coli (strain K12) proteome. It would be illuminating to examine how Fast.genomics handles such proteins, especially with respect to identification of spurious (i.e. false positive) homologs.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have developed a web-based tool, called fast.genomics, that supports fast browsing of prokaryotic genomes.

They have a main database that contains one representative from each of the 6,377 genera that have high quality genomes. They also have a second database that contains 10 representatives of each species.

Next, they identify homologs of proteins of interest. Afterwards, fast genomics shows their distribution in the various taxa, as well as their neighboring genes. Furthermore, the software can compare the distribution of homologs for two different proteins. The main idea behind this is to predict function, via phylogenetic profiling. Also, such an approach may identify operons that may be horizontally transferred in diverse taxa.

This is a very useful prokaryotic comparative genomics tool that focuses on rapid analysis and allows for quick conclusions. The website works fine, and the underlying software that the pipeline relies on for sequence analysis and phylogeny consists of standard tools well accepted by the community.

Reassuringly, the authors first apply quality filters to the genomes that they integrate.

Also, the manuscript is well written and easy to follow.

I only have some minor corrections/comments for future updates of the software.

One computational limitation is rapid identification of homologs in such vast datasets of thousands of genera and species. Thus, the authors use MMseqs2, which is a fast alternative to BLAST. How does MMseqs2 compare to DIAMOND (Buchfink et al., 2015, PMID: 25402007)? From DIAMOND’s web site: “DIAMOND is a high-throughput program for aligning DNA reads or protein sequences against a protein reference database such as NR, at up to 20,000 times the speed of BLAST, with high sensitivity.”

The authors apply a homology cutoff of 1e-3. From my experience, I see better results with 1e-5. 1e-3 tends to get a lot of false positives, but this is only a comment of mine. Since different researchers have their own criteria for homology, depending on what they are after, it would be good to have a local version where the user may define their sets of genomes and their homology criteria, such as e-value and % identity with certain coverage.

Another issue is the criteria used for orthology. This is much more difficult than identifying homologs. The authors justify their own criteria for a fast search, which is the main idea behind this work. Otherwise, better and more accurate approaches would not allow for such large-scale and rapid analyses. Thus, the orthology criteria constitute a reasonable tradeoff between precision and recall.

Concerning which genomes to include: I understand that the authors use the species name as provided by GenBank. However, very frequently, genomes are misannotated in terms of species names (Nikolaidis et al., 2022 and Nikolaidis et al., 2023 - PMID: 36144322 and PMID: 37266990). One approach, maybe for future updates of the web tool, is to use FastANI (Jain et al., 2018, PMID: 30504855) to assign the proper genomes to each species. In any case, it would be good to discuss this issue of species misannotations in GenBank (which is caused by the researchers who submit their genomes).

Reviewer #2: The authors propose a method for identifying homologs of protein(s) of interest across many genomes. Focusing on speed-up, they argue that their approach is faster than standard approaches for finding homologs, such as protein BLAST or similar tools, which, given the size of current genome databases, typically require over 10 minutes. They argue against alternative approaches that pre-compute groups of homologous proteins using resources such as Pfam (for example, fast tools that rely on pre-computed homology groups such as GeCoViz, AnnoTree and PhyloCorrelate), as the authors believe these may give misleading results, especially if the protein of interest is a poor match to the pre-computed families.

The authors have built an interactive web service for browsing of protein families (fast.genomics - https://fast.genomics.lbl.gov/). Moreover, they have taken multiple shortcuts to speed up the process of obtaining comparative analysis for protein(s) of interest. These include:

1) Selection of one high-quality representative genome for each genus, so the main database of fast.genomics contains fewer genomes (6,377 genomes) to be used as a search space. They also allow for the expansion to up to 10 representatives of each species if the initial database does not provide adequate search results.

2) The use of a parallel version of MMseqs2 that allows for splitting of candidates from the prefilter step of MMseqs2 into 10 lists and analyzes them in parallel.

3) Use of CD-HIT to cluster the proteins in the largest order-level databases to speed up searches.
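The splitting described in item 2 can be pictured with a minimal Python sketch. This is not the fast.genomics or MMseqs2 code: `align_candidates` is a hypothetical stand-in for the alignment step, threads are used only to keep the example self-contained, and the only detail taken from the text is the default of 10 lists.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_lists(candidates, n_lists=10):
    """Deal candidates round-robin into n_lists lists of near-equal size."""
    return [candidates[i::n_lists] for i in range(n_lists)]


def align_candidates(query, chunk):
    """Hypothetical stand-in for aligning the query to each candidate.

    The 'score' is just the number of distinct shared characters, so the
    example runs without any external tool.
    """
    return [(cand, len(set(query) & set(cand))) for cand in chunk]


def parallel_search(query, candidates, n_lists=10):
    """Score each list of candidates in its own worker, then merge."""
    chunks = split_into_lists(candidates, n_lists)
    with ThreadPoolExecutor(max_workers=n_lists) as pool:
        per_chunk = pool.map(lambda c: align_candidates(query, c), chunks)
        hits = [hit for chunk_hits in per_chunk for hit in chunk_hits]
    # Sort best-first so the merged output does not depend on chunk order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

The merge step sorts the combined hits, so the result is the same as a serial run over the full candidate list.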

To test the speed and sensitivity of their approach, the authors selected 1,000 proteins at random from the main database and searched for homologs of these proteins in the main database using protein BLAST from NCBI’s BLAST+ package, fast parallel MMseqs2, or MMseqs2 with the most sensitive settings.

Their results showed that fast parallel MMseqs2 took an average of 3.3 seconds per query, which was 7 times faster than BLAST+ (23 seconds on average) and 2.6 times faster than MMseqs2 with the same settings but no parallelisation (8.4 seconds on average).

They also assess the sensitivity of their parallel MMseqs2 approach by considering the top 3,188 BLAST+ hits for each query with E ≤ 10⁻⁵. MMseqs2 with sensitive settings missed 1.4% or 2.2% of these hits (depending on the k-mer size), with a much lower miss rate for the top-ranked hits. Fast parallel MMseqs2 missed 6.7% of the hits.
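A miss rate of this kind can be computed as in the following Python sketch. It assumes that the rate is pooled over all reference hits rather than averaged per query, which the text above does not specify, and the function and argument names are illustrative.

```python
def miss_rate(reference_hits, test_hits):
    """Fraction of reference hits (e.g. from BLAST+) absent from test_hits.

    Both arguments map a query id to a set of subject ids. The rate is
    pooled over all queries (an assumption, not taken from the manuscript).
    """
    total = 0
    missed = 0
    for query, ref in reference_hits.items():
        found = test_hits.get(query, set())
        total += len(ref)
        missed += len(ref - found)
    return missed / total if total else 0.0
```

For example, if the reference search reports four hits in total and the faster search recovers two of them, the function returns 0.5.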

The authors also compared to some fast comparative genomics tools that rely on pre-computed ortholog groups from eggNOG. They considered the broadest ortholog groups returned by eggNOG-mapper for the 1,000 benchmark queries and found that 93 were not assigned to ortholog groups and could not be handled by the ortholog approach. Of these, 40 lacked homologs besides themselves (as identified by BLAST+ with E ≤ 10⁻⁵), but the other 53 did have at least one homolog in the top-level database, and in all of these cases, fast parallel MMseqs2 identified at least one homolog. Across the remaining 907 queries, the ortholog group approach missed 9.2% of the high-ranking homologs. Among these high-ranking high-coverage hits (that were assigned to ortholog groups by eggNOG-mapper), fast parallel MMseqs2 missed just 0.1% of homologs.

The authors also report that clustering with CD-HIT reduces the number of sequences in the largest orders by 3.5- to 6.2-fold, and that fast.genomics can find homologs quickly and at high sensitivity by first searching against the reduced database, again as assessed on a benchmark of 1,000 random proteins.

Fast.genomics further allows for extra functionalities including: 1) viewing gene neighborhoods, 2) viewing the taxonomic distribution of a protein’s homologs, 3) comparison of the distributions of two selected proteins, and their homologs (co-occurrence analysis or phylogenetic profiling) and 4) providing links to a variety of sequence analysis tools from the fast.genomics protein page.
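The co-occurrence comparison in item 3 amounts to tabulating presence and absence of the two proteins' homologs across genomes. The Python sketch below illustrates the idea; the Jaccard index is one common summary chosen here for illustration and is not necessarily the statistic that fast.genomics reports, and all names are hypothetical.

```python
def cooccurrence(genomes, has_a, has_b):
    """Count genomes by presence/absence of homologs of proteins A and B."""
    genomes = set(genomes)
    a = set(has_a) & genomes
    b = set(has_b) & genomes
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(genomes) - both - only_a - only_b
    union = both + only_a + only_b
    # Jaccard index: shared genomes over genomes with either protein.
    jaccard = both / union if union else 0.0
    return {"both": both, "only_a": only_a, "only_b": only_b,
            "neither": neither, "jaccard": jaccard}
```

A high count of genomes with both or neither protein, relative to genomes with only one, is the pattern that phylogenetic profiling interprets as a hint of a functional relationship.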

The authors provide a specific example to highlight discrepancies that can arise from other fast websites for comparative genomics of bacteria and archaea that rely on pre-computed orthology groups, such as eggNOG. By considering the TonB-dependent transporter BT2172 and using GeCoViz with the sequence of BT2172 selected, the authors show that this tool failed to identify any highly conserved gene neighbours.

The authors also compare to BLAST-based tools that should give accurate results but are far slower than fast.genomics. They specifically compare with webFlaGs, the Enzyme Function Initiative’s Genome Neighborhood Tool (EFI-GNT; retrieve neighborhood diagrams), and IMG/M (top homologs combined with gene cart neighborhoods). Again, they use the 290 amino acid protein (BT2157) as the query, and they report that the quickest response was from webFlaGs, at 11 minutes. They further showcase the functionalities of fast.genomics by comparing against the 3 BLAST-based tools and show that none of these tools shows the gene neighborhoods together with information about the similarity of each homolog or the extent of homology.

Overall, this is a nice approach that provides some interesting results with respect to speed-up of homology searching for metagenomic proteins as well as a user interface with some useful functionalities. However, I have the following major concerns:

1) I feel that the authors have not performed a complete and adequate comparison with other competing tools. Especially with some tools that provide an integrated environment for analysis, management, storage, and sharing of metagenomic projects, like IMG/M or MGnify (supported by the EBI). Although, they perform some form of assessment using a specific example, namely the BT2172 sequence, I would have liked to see a more thorough comparison with tools such as IMG/M. These tools are specifically built for running large jobs and are hosted in servers with high calibre specifications suited for fast running and high sensitivity of results.

2) Recently Pavlopoulos et al. Nature, 2023 (Unraveling the functional dark matter through global metagenomics) published an approach that examines functional diversity beyond what was currently possible through the lens of reference genomes. Their computational approach generates reference-free protein families from the sequence space in metagenomes. They analysed over 26,000 metagenomes and identified >1 billion protein sequences with no prior similarity to any sequences from >100,000 reference genomes or the Pfam database. It would be nice for the authors to make a comparison with this approach and provide at least some measure of sensitivity with the results reported by this publication. Moreover, Pavlopoulos et al. provided a tremendous amount of novel protein families which have not been considered by the authors in their approach.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 9;19(4):e0301871. doi: 10.1371/journal.pone.0301871.r002

Author response to Decision Letter 0

27 Feb 2024

Responses to comments from reviewer #1

Reviewer 1: "One computational limitation is rapid identification of homologs in such vast datasets of thousands of genera and species. Thus, the authors use MMseqs2, which is a fast alternative to Blast. How does MMseqs2 compare to DIAMOND (Buchfink et al., 2015, PMID: 25402007)? From DIAMOND’s web site 'DIAMOND is a high-throughput program for aligning DNA reads or protein sequences against a protein reference database such as NR, at up to 20,000 times the speed of BLAST, with high sensitivity.'"

DIAMOND is optimized for searches with many queries. For searches with a single query, as would occur during interactive use, our experience is that DIAMOND takes around the same time as protein BLAST.

Reviewer 1: "The authors apply a homology cutoff of 1e-3. From my experience, I see better results with 1e-5. 1e-3 tends to get a lot of false positives, but this is only a comment of mine."

The editor was also curious about this question, and the related issue of how fast.genomics handles compositional bias. To address it, we added a section to the Materials and Methods, titled "E-values and compositional bias". Briefly, we do not believe that there are many false positives in the hits with e-values near the cutoff.

We'd also like to note that for most queries, changing the e-value cutoff from 1e-3 to 1e-5 would not affect the gene neighborhood view with default settings. For the main database, the gene neighborhood view includes the top 50 hits by default. In the test set of 1,000 random prokaryotic proteins, only 8% have any hits from MMseqs2 with rank <= 50 and E > 1e-5 (and E < 1e-3). For the test set of 1,000 proteins from Rhizobiales, only 6% have hits from clustered search with rank <= 100 and E > 1e-5. (The default view for order databases shows species clusters of genes, and there's ~2x more genomes than species in this order-level database, so it is more appropriate to consider the top 2*50 = 100 homologs.)
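The check described above (how many queries have a hit within the default view whose e-value lies between the two cutoffs) can be sketched in Python as follows. The function name and input layout are assumptions made for illustration; this is not the script used for the reported 8% and 6% figures.

```python
def fraction_with_borderline_hits(hits_by_query, max_rank=50,
                                  strict=1e-5, lenient=1e-3):
    """Fraction of queries with a top-ranked hit between the two cutoffs.

    hits_by_query maps a query id to its hits' e-values, sorted best-first.
    A query counts if any hit with rank <= max_rank has
    strict < E < lenient.
    """
    if not hits_by_query:
        return 0.0
    affected = 0
    for evalues in hits_by_query.values():
        top = evalues[:max_rank]
        if any(strict < e < lenient for e in top):
            affected += 1
    return affected / len(hits_by_query)
```

For the order-level databases, the same function would be called with `max_rank=100`, following the reasoning given in parentheses above.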

More broadly, the focus of fast.genomics is on homologs which are likely to have the same function, namely closer homologs or homologs that have the same gene context (and are unlikely to be false positives).

Reviewer 1: "Since different researchers have their own criteria for homology, depending on what they are after, it would be good to have a local version where the user may define their sets of genomes and their homology criteria, such as e-value and % identity with certain coverage."

Fast.genomics allows the user to download a table of homologs. We revised the Results to mention that this table includes e-values. Similarly, the gene presence/absence tool allows the user to download a table of the top hit for both queries in each genome; this table now includes the e-values. In either case, the user could easily filter out weak hits from these tables, or choose a subset of genomes of interest.
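As an illustration of such filtering, the Python sketch below applies user-chosen thresholds to a tab-delimited table of hits. The column names `evalue`, `identity` and `coverage` and the default thresholds are assumptions for the example; the header of the actual downloaded table may differ and should be checked first.

```python
import csv
import io


def filter_hits(tsv_text, max_evalue=1e-5, min_identity=30.0,
                min_coverage=0.7):
    """Keep rows of a tab-delimited hits table that pass all thresholds.

    The column names 'evalue', 'identity' and 'coverage' are assumptions;
    rename them to match the header of the downloaded file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if (float(row["evalue"]) <= max_evalue
                and float(row["identity"]) >= min_identity
                and float(row["coverage"]) >= min_coverage):
            kept.append(row)
    return kept
```

A stricter or looser notion of homology only requires changing the keyword arguments, for example `filter_hits(text, max_evalue=1e-3)`.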

If a user wishes to build a version of fast.genomics with additional genomes, the source code for fast.genomics is available, including the scripts for building the database. We revised the code availability statement to make it clear that the scripts for building the databases are included.

Reviewer 1: "Concerning which genomes to include: I understand that the authors use the species name as provided by genebank. However, very frequently, genomes are misannotated in terms of species names (Nikolaidis et al., 2022 and Nikolaidis et al., 2023 - PMID: 36144322 and PMID: 37266990). One approach, maybe for future updates of the web-tool is to use FASTANI…"

Actually, all taxonomic assignments in fast.genomics, including the species names, are taken from GTDB, which uses ANI comparisons to define species. We revised the section of the Results on "The fast.genomics databases" to explain that the species definitions in fast.genomics are from GTDB.

Responses to comments from reviewer #2

Reviewer 2: "1) I feel that the authors have not performed a complete and adequate comparison with other competing tools. Especially with some tools that provide an integrated environment for analysis, management, storage, and sharing of metagenomic projects, like IMG/M or MGnify (supported by the EBI). Although, they perform some form of assessment using a specific example, namely the BT2172 sequence, I would have liked to see a more thorough comparison with tools such as IMG/M. These tools are specifically built for running large jobs and are hosted in servers with high calibre specifications suited for fast running and high sensitivity of results."

In regards to "These tools are ... suited for fast running and high sensitivity of results", this is not our experience. For instance, the IMG/M web site states that "Real time BLAST request on average takes about 2 mins. to 15 mins. to complete." The homologs feature of IMG/M is just not designed to be fast. And, as discussed in our manuscript, we feel that the IMG web site does not organize the results as well as fast.genomics does. Similarly, our impression is that MGnify is not suitable for fast searches. As of February 13, the sequence search page says: "We recognize that our service has faced challenges in providing the latest version of the MGnify protein database, and we sincerely apologize for any inconvenience caused. The recent rapid growth of the protein database to over 3 billions has present technical challenges in scaling the search infrastructure which we are currently addressing."

We did not compare fast.genomics to MGnify because MGnify does not provide analogous functionality. In particular, as far as we know, MGnify does not provide any way to compare the gene neighborhoods of the homologous proteins or to compare the presence/absence of two proteins across taxa.

More broadly, the reviewer was concerned that we did not compare fast.genomics to tools that support metagenomics projects -- but supporting metagenomics might not be compatible with the goals of fast.genomics. In particular, fast.genomics includes only high-quality genomes (whether from isolates or assembled from metagenomes) to ensure that analyses of the presence or absence of a gene family across genomes will give reliable results. It's not clear how to compare gene presence/absence across taxa from fragmented metagenomic assemblies. We do hope that in the future, many more high-quality MAGs will be available, and fast.genomics’ coverage of the diversity of bacteria and archaea will improve (see the Conclusions section).

In the revised manuscript, the section on "The fast.genomics databases" clarifies why we only include high-quality genomes. And a new paragraph in the "Limitations" section reports how many MAGs are included and discusses the trade-off between supporting presence/absence analyses and incorporating more of the sequenced diversity of bacteria and archaea.

Reviewer 2: "2) Recently Pavlopoulos et al. Nature, 2023 (Unraveling the functional dark matter through global metagenomics) published an approach that examines functional diversity beyond what was currently possible through the lens of reference genomes. Their computational approach generates reference-free protein families from the sequence space in metagenomes. They analysed over 26,000 metagenomes and identified >1 billion protein sequences with no prior similarity to any sequences from >100,000 reference genomes or the Pfam database. It would be nice for the authors to make a comparison with this approach and provide at least some measure of sensitivity with the results reported by this publication. Moreover, Pavlopoulos et al. provided a tremendous amount of novel protein families which have not been considered by the authors in their approach."

Fast.genomics does not use precomputed families, and we demonstrated that for the main database (of mostly isolate genomes), this often leads to better results. So we're not sure if it would make sense to incorporate the families identified by Pavlopoulos et al into fast.genomics.

Fast.genomics's main database does include 1,418 non-isolate genomes, so some of the families from Pavlopoulos et al do have homologs in fast.genomics already, and these can be found using fast.genomics's search tools. As metagenome assemblies improve, a greater proportion of the diversity of bacteria and archaea will be represented in fast.genomics. For now, fast.genomics's databases do not include low-quality MAGs because we want to support presence/absence analyses (see above).

Attachment

Submitted filename: responses.pdf

pone.0301871.s001.pdf (34.3 KB, pdf)

PLoS One. doi: 10.1371/journal.pone.0301871.r003

Decision Letter 1

Vasilis J Promponas

24 Mar 2024

A fast comparative genome browser for diverse bacteria and archaea

PONE-D-23-38656R1

Dear Dr. Price,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

Please, when preparing for submitting the final version of your manuscript, note the following possible typo:

- line 303: "These include fast two ways ... " should probably read "These include two fast ways ..."

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the 'Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vasilis J Promponas

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0301871.r004

Acceptance letter

Vasilis J Promponas

27 Mar 2024

PONE-D-23-38656R1

PLOS ONE

Dear Dr. Price,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vasilis J Promponas

Academic Editor

PLOS ONE

Associated Data

Data Availability Statement

Fast.genomics is available at http://fast.genomics.lbl.gov. The source code and the database have also been archived at figshare (https://doi.org/10.6084/m9.figshare.24010353.v1).

1. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36: 996–1004. doi: 10.1038/nbt.4229

2. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35: 725–731. doi: 10.1038/nbt.3893

3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi: 10.1101/gr.186072.114

4. Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998;23: 324–328. doi: 10.1016/s0968-0004(98)01274-2

5. Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001;11: 356–372. doi: 10.1101/gr.gr-1619r

6. Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000;10: 1204–1210. doi: 10.1101/gr.10.8.1204

7. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999;96: 4285–4288. doi: 10.1073/pnas.96.8.4285

8. Zhaxybayeva O, Doolittle WF. Lateral gene transfer. Curr Biol. 2011;21: R242–6. doi: 10.1016/j.cub.2011.01.045

9. Price MN, Deutschbauer AM, Arkin AP. Four families of folate-independent methionine synthases. PLoS Genet. 2021;17: e1009342. doi: 10.1371/journal.pgen.1009342

10. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi: 10.1093/nar/25.17.3389

11. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42: D222–30. doi: 10.1093/nar/gkt1223

12. Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E. Tigrfams and genome properties in 2013. Nucleic Acids Res. 2013;41: D387–95. doi: 10.1093/nar/gks1234

13. Hernández-Plaza A, Szklarczyk D, Botas J, Cantalapiedra CP, Giner-Lamia J, Mende DR, et al. eggNOG 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res. 2023;51: D389–D394. doi: 10.1093/nar/gkac1022

14. Botas J, Rodríguez Del Río Á, Giner-Lamia J, Huerta-Cepas J. GeCoViz: genomic context visualisation of prokaryotic genes from a functional and evolutionary perspective. Nucleic Acids Res. 2022;50: W352–W357. doi: 10.1093/nar/gkac367

15. Mendler K, Chen H, Parks DH, Lobb B, Hug LA, Doxey AC. AnnoTree: visualization and exploration of a functionally annotated microbial tree of life. Nucleic Acids Res. 2019;47: 4442–4448. doi: 10.1093/nar/gkz246

16. Tremblay BJ-M, Lobb B, Doxey AC. PhyloCorrelate: inferring bacterial gene-gene functional associations through large-scale phylogenetic profiling. Bioinformatics. 2021;37: 17–22. doi: 10.1093/bioinformatics/btaa1105

17. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35: 1026–1028. doi: 10.1038/nbt.3988

18. Orakov A, Fullam A, Coelho LP, Khedkar S, Szklarczyk D, Mende DR, et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 2021;22: 178. doi: 10.1186/s13059-021-02393-0

19. Price MN, Arkin AP. Curated BLAST for genomes. mSystems. 2019;4. doi: 10.1128/mSystems.00072-19

20. Park Y, Sheetlin S, Ma N, Madden TL, Spouge JL. New finite-size correction for local alignment score distributions. BMC Res Notes. 2012;5: 286. doi: 10.1186/1756-0500-5-286

21. Lerat E, Daubin V, Moran NA. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol. 2003;1: E19. doi: 10.1371/journal.pbio.0000019

22. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol. 2021;38: 5825–5829. doi: 10.1093/molbev/msab293

[pone.0301871.ref023] 23.Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43: D261–9. doi: 10.1093/nar/gku1223 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref024] 24.Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22: 1658–1659. doi: 10.1093/bioinformatics/btl158 [DOI] [PubMed] [Google Scholar]

[pone.0301871.ref025] 25.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28: 3150–3152. doi: 10.1093/bioinformatics/bts565 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref026] 26.Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21: 487–493. doi: 10.1101/gr.113985.110 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref027] 27.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32: 1792–1797. doi: 10.1093/nar/gkh340 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref028] 28.Price MN, Dehal PS, Arkin AP. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5: e9490. doi: 10.1371/journal.pone.0009490 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref029] 29.Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, Chivian D, et al. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010;38: D396–400. doi: 10.1093/nar/gkp919 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref030] 30.Basu MK, Selengut JD, Haft DH. ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process. BMC Bioinformatics. 2011;12: 434. doi: 10.1186/1471-2105-12-434 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref031] 31.Marchler-Bauer A, Derbyshire MK, Gonzales NR, Lu S, Chitsaz F, Geer LY, et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 2015;43: D222–6. doi: 10.1093/nar/gku1221 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref032] 32.Somervuo P, Holm L. SANSparallel: interactive homology search against Uniprot. Nucleic Acids Res. 2015;43: W24–9. doi: 10.1093/nar/gkv317 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref033] 33.Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 2017;45: D190–D199. doi: 10.1093/nar/gkw1107 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0301871.ref034] 34.Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46: W200–W204. doi: 10.1093/nar/gky448 [DOI] [PMC free article] [PubMed] [Google Scholar]

35. Price MN, Deutschbauer AM, Arkin AP. Many Families of Lids for TonB-dependent Transporters in Bacteroides. BioRxiv. 2023. doi: 10.1101/2023.03.17.533168

36. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Krüger B, et al. STRING 7 – recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 2007;35: D358–62. doi: 10.1093/nar/gkl825

37. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51: D638–D646. doi: 10.1093/nar/gkac1000

38. Reed C, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, et al. Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighborhood information for protein families. BioRxiv. 2023. doi: 10.1101/2023.05.03.539116

39. Saha CK, Sanches Pires R, Brolin H, Delannoy M, Atkinson GC. FlaGs and webFlaGs: discovering novel biology through the analysis of gene neighbourhood conservation. Bioinformatics. 2021;37: 1312–1314. doi: 10.1093/bioinformatics/btaa788

40. Oberg N, Zallot R, Gerlt JA. EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol. 2023;435: 168018. doi: 10.1016/j.jmb.2023.168018

41. Chen I-MA, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, et al. The IMG/M data management and analysis system v.7: content updates and new features. Nucleic Acids Res. 2023;51: D723–D732. doi: 10.1093/nar/gkac976

42. Rocha EPC. Inference and analysis of the relative stability of bacterial chromosomes. Mol Biol Evol. 2006;23: 513–522. doi: 10.1093/molbev/msj052

43. Fullam A, Letunic I, Schmidt TSB, Ducarmon QR, Karcher N, Khedkar S, et al. proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes. Nucleic Acids Res. 2023;51: D760–D766. doi: 10.1093/nar/gkac1078

44. Yu Y-K, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics. 2005;21: 902–911. doi: 10.1093/bioinformatics/bti070

45. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50: D439–D444. doi: 10.1093/nar/gkab1061

46. Moreno-Hagelsieb G, Collado-Vides J. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics. 2002;18 Suppl 1: S329–36. doi: 10.1093/bioinformatics/18.suppl_1.s329

47. Price MN, Huang KH, Alm EJ, Arkin AP. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005;33: 880–892. doi: 10.1093/nar/gki232
