mSystems. 2024 Jun 28;9(7):e00473-24. doi: 10.1128/msystems.00473-24

zDB: bacterial comparative genomics made easy

Bastian Marquis 1, Trestan Pillonel 1, Alessia Carrara 1, Claire Bertelli 1
Editor: James C Stegen 2
PMCID: PMC11264898  PMID: 38940522

ABSTRACT

The analysis and comparison of genomes rely on different tools for tasks such as annotation, orthology prediction, and phylogenetic inference. Most tools are specialized for a single task, and additional efforts are necessary to integrate and visualize the results. To fill this gap, we developed zDB, an application integrating a Nextflow analysis pipeline and a Python visualization platform built on the Django framework. The application is available on GitHub (https://github.com/metagenlab/zDB) and from the bioconda channel. Starting from annotated Genbank files, zDB identifies orthologs and infers a phylogeny for each orthogroup. A species phylogeny is also constructed from shared single-copy orthologs. The results can be enriched with Pfam protein domain prediction, Clusters of Orthologous Genes and Kyoto Encyclopedia of Genes and Genomes annotations, and Swissprot homologs. The web application allows searching for specific genes or annotations, running Blast queries, and comparing genomic regions and whole genomes. The metabolic capacities of organisms can be compared at either the module or pathway levels. Finally, users can run queries to examine the conservation of specific genes or annotations across a chosen subset of genomes and display the results as lists of genes, Venn diagrams, or heatmaps. These features make zDB useful for both bioinformaticians and researchers more accustomed to laboratory research.

IMPORTANCE

Genome comparison and analysis rely on many independent tools, leaving scientists with the burden of integrating and visualizing the results for interpretation. To alleviate this burden, we have built zDB, a comparative genomics tool that includes both an analysis pipeline and a visualization platform. The analysis pipeline automates gene annotation, orthology prediction, and phylogenetic inference, while the visualization platform allows scientists to easily explore the results in a web browser. Among other features, the interface allows users to visually compare whole genomes and targeted regions, assess the conservation of genes or metabolic pathways, perform Blast searches, or look for specific annotations. Altogether, this tool will be useful for a broad range of applications in comparative studies of between two and a hundred genomes. Furthermore, it is designed to allow easy sharing of data sets at a local or international scale, thereby supporting exploratory analyses by non-bioinformaticians of the genomes of their favorite organisms.

KEYWORDS: comparative genomics, microbial genomics, genome visualization

INTRODUCTION

Since the publication of the first complete genome in 1995 (1), the number of available sequences has kept on growing, with now 450,000 different species available in the Genbank database (2). As recent sequencing technologies make it possible to sequence an organism in a matter of hours at a cost affordable even for small research laboratories, this trend is unlikely to abate in the future. These technological improvements transferred the burden from sequencing to the actual analysis of the sequences. While a plethora of different tools already exist for this purpose, they are often specialized for specific tasks such as gene calling, orthology prediction, or phylogenetic inference. Moreover, these tools are often standalone programs that do not readily integrate each other’s results. As the results are often not produced in a format that easily allows their exploration, additional visualization efforts are also necessary.

The need for tools designed to aggregate results from different sources has been illustrated by the success of programs like Prokka (3), which merges the results of different annotation tools into files ready for submission and visualization in genome browsers. The idea is further extended by pipelines like Bactopia (4), TORMES (5), and ASA3P (6), which automate all steps from read quality control to antibiotic resistance gene prediction and generate simple HTML reports for visualizing the main results. As these pipelines were developed with a focus on clinical microbiology, they are limited in terms of comparative genomics analysis. In contrast, websites dedicated to the comparative genomics of specific groups of organisms (7, 8) offer powerful interfaces that allow users to make custom queries and generate complex plots. However, these websites do not allow users to analyze their own sets of genomes. Some web-based comparative genomics platforms, like EDGAR (9), Pathogenwatch (10), PhyloCloud (11), KBase (12), CoGe (13), or MicroScope (14), implement similar interfaces while allowing users to upload their own data sets. Some of those platforms are, however, closed-source, and as the analysis is performed on the platform’s respective servers, users are required to register and upload their data sets. The ideal comparative genomics platform would, therefore, be open source, run on any infrastructure, be as flexible and scalable as Bactopia, and, similar to MicroScope and EDGAR, offer an extensive interface to visualize the results. anvi’o (15), OpenGenomeBrowser (16), and bioBakery (17), all of which implement an analysis pipeline and a visualization platform, demonstrate the feasibility of such an approach.

Using a similar design, we developed zDB, an open-source comparative genomics analysis pipeline and visualization platform. The analysis pipeline performs functional annotations, orthology, and phylogenetic inference, while the visualization platform offers an interactive modern web-based interface to explore the results. Altogether, the ease of installing and executing the tool and the ability to easily visualize the results will benefit both bioinformaticians and researchers more accustomed to lab work.

RESULTS

The visualization platform can be started with a single command as soon as the analyses are complete. The command starts a web server that makes the results available via a web browser, either locally only or, depending on the setup, over the Internet. The platform can also be used to visualize archived results imported from another computer.

Visualization toolkit

The visualization platform implements a set of plots and queries to explore the results of orthology prediction and phylogeny inference. In addition, zDB comes with several features of more general interest, such as the possibility to run Blast queries, to search for a specific annotation or gene using a search bar, and to draw interactive Circos plots or genomic regions.

A sidebar is present on all pages of the web application (Fig. 1A) and allows quick access to all available analyses. The content of the “Annotations” tab depends on which optional analyses were performed. Similarly, the “Metabolism” tab is only present when the genomes were annotated with Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologs. Summaries of the main characteristics of the genomes of interest, including the results from CheckM, can be visualized either as lists or directly annotated on the species phylogeny (Fig. S1) via the “Genomes” and “Phylogeny” tabs, respectively. Finally, the web interface also includes a search bar (Fig. 1A) that allows users to look for genes, gene products, bacteria, or specific annotations based on their names. The search bar accepts wildcards and logical operators to combine different search terms.

Fig 1.

(A) zDB side panel listing all available analyses. The search bar (red box) allows users to look for a gene or annotation of interest. (B) Blast search result. The whole data set was searched for several proteins of the type III secretion system of Salmonella Typhimurium. Blast hits are displayed as a heatmap of identities. (C) zDB can draw a comparison of the genomic region of a specified gene of interest and its orthologs (in red) in several genomes. Black arrows represent pseudogenes. Gray bars link orthologous genes, and their shades reflect amino-acid identity. (D) Visualization of protein conservation of a reference genome (here Shigella flexneri) compared to four selected genomes (here from the genus Escherichia). The inner circle represents the GC content of each open reading frame (ORF) in the reference genome. The next four circles represent the absence (in blue) or presence (in red) of homologs of proteins from the reference genome in the selected genomes, with a color scale representing protein identity. The next two circles represent the localization of the ORFs on the forward and reverse strands of the reference genome. The two outer circles represent the contigs in the reference assembly and a histogram of the number of homologs to each protein of the reference genome in the selected genomes. (E) Venn diagram illustrating the distribution of the orthologous groups in four genomes. (F) Gene phylogeny. The identity column shows the identity relative to the CP0154 locus, as it was accessed through the page dedicated to this locus. The rightmost columns display the domain architecture. (G) Heatmap of gene conservation in the four genomes as in (E); pink bars represent genes present in multiple copies in a genome, and blue bars represent single-copy genes. (H) Distribution of the orthologous groups as a function of the number of genomes in which they occur.

The “Orthology” tab links to pages that allow users to explore gene conservation in the data set. In particular, users can visualize gene conservation across a chosen set of genomes as either heatmaps (Fig. 1G), Venn diagrams (Fig. 1E), or lists. zDB can also draw the commonly used core and pan-genome plots as well as a plot of the number of orthogroups as a function of the number of genomes in which they occur (Fig. 1H). The latter plot allows for a quick assessment of the number of singletons, the size of the core genome, and the detection of gene groups occurring in a subset of the genomes. Finally, zDB implements an interface that allows users to search for genes conserved in a chosen set of genomes but absent in another one.
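The two queries described above (genes conserved in one set of genomes but absent from another, and the number of orthogroups per number of genomes) reduce to set operations on a presence table. The following is a minimal sketch under stated assumptions: the `presence` mapping, the genome names, and the orthogroup identifiers are hypothetical, and the functions are illustrative rather than the zDB implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def conserved_but_absent(
    presence: Dict[str, Set[str]],
    include: Iterable[str],
    exclude: Iterable[str],
) -> Set[str]:
    """Orthogroups found in every `include` genome and in no `exclude` genome."""
    include = list(include)
    if not include:
        return set()
    shared = set.intersection(*(presence[g] for g in include))
    unwanted = set().union(*(presence[g] for g in exclude))
    return shared - unwanted


def occurrence_histogram(presence: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map 'number of genomes' to 'number of orthogroups found in that many genomes'."""
    genomes_per_orthogroup = Counter(
        og for orthogroups in presence.values() for og in orthogroups
    )
    return dict(Counter(genomes_per_orthogroup.values()))


# Hypothetical toy data: genome name -> orthogroups present in that genome.
presence = {
    "genomeA": {"OG1", "OG2", "OG3"},
    "genomeB": {"OG1", "OG2"},
    "genomeC": {"OG1", "OG4"},
}
specific = conserved_but_absent(presence, ["genomeA", "genomeB"], ["genomeC"])
histogram = occurrence_histogram(presence)
```

In the histogram, the entry for the total number of genomes corresponds to the core genome, and the entry for one genome corresponds to orthogroups restricted to a single genome.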

As searching for specific sequences in organisms of interest is a frequent task, the visualization platform implements an interface that allows users to run their own Blast searches on either the whole data set or a specific genome. Several types of Blast searches can be performed (blastp, blastn, and tblastn), either with a single query or with multiple queries in FASTA format. The results are displayed interactively using the BlasterJS library (18). Moreover, if the search was run on the whole data set, zDB can display the results as a heatmap of the identity of the best Blast hits linked to the species phylogenetic tree (Fig. 1B). This makes it possible to quickly detect patterns in the distribution of Blast hits as a function of phylogenetic distance.
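The data behind such a heatmap can be thought of as a query-by-genome matrix holding the identity of the best hit. The sketch below assumes the standard 12-column tabular Blast output (qseqid, sseqid, pident, ..., bitscore) and a hypothetical `locus_to_genome` mapping from subject identifiers to genome names; how zDB actually builds the heatmap is not described in the text, so this is only one plausible way to do it.

```python
from typing import Dict, Iterable, Tuple


def best_identity_matrix(
    rows: Iterable[str],
    locus_to_genome: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """Return {query: {genome: percent identity of the highest-bitscore hit}}."""
    best: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for line in rows:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        genome = locus_to_genome.get(subject)
        if genome is None:
            continue  # subject does not belong to a known genome
        key = (query, genome)
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, pident)

    matrix: Dict[str, Dict[str, float]] = {}
    for (query, genome), (_, pident) in best.items():
        matrix.setdefault(query, {})[genome] = pident
    return matrix


# Hypothetical hits: two in genomeA (the stronger one is kept), one in genomeB.
example_rows = [
    "q1\tgA_001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0",
    "q1\tgA_002\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-10\t60.0",
    "q1\tgB_001\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-30\t120.0",
]
example_map = {"gA_001": "genomeA", "gA_002": "genomeA", "gB_001": "genomeB"}
example_matrix = best_identity_matrix(example_rows, example_map)
```

Ordering the columns of this matrix by the leaves of the species tree would give the layout described above.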

Finally, zDB can draw plots to compare genomic regions sharing orthologous genes (Fig. 1C) and Circos plots to compare a set of genomes to a specified reference (Fig. 1D). The former notably allows users to visually compare the orthology inference with the predicted gene order in homologous regions. The minimal setup also includes summary pages for every gene and orthogroup. The gene summary page allows easy access to nucleotide and amino-acid sequences (for protein-coding genes), displays the genomic region of the gene of interest, and provides a list of orthologous genes (Fig. 2A). The orthogroup page allows users to examine the gene phylogeny and the distribution of the orthogroup in the genomes of the data set (Fig. 1F and 2C). Both pages are enriched with the results of the optional analyses if they were performed (Fig. 2A and 1F).

Fig 2.

(A) Example of a gene summary page, with its genomic region, Pfam domains, and Clusters of Orthologous Genes (COG) annotation. The phylogenetic distribution and list of Swissprot homologs shown in (B and C) can be accessed from the highlighted tab. (B) The list of Swissprot homologs. (C) Part of the phylogenetic tree and protein conservation. The first column shows the number of homologs of the gene in a given genome. The second column shows the amino-acid identity between the gene of interest and its closest homolog in a given genome.

Pfam, COG, and KEGG functional analyses

The conservation of Pfam, Clusters of Orthologous Genes (COG), and KEGG annotations across genomes can be compared in a similar way to orthogroups. In particular, Venn diagrams, heatmaps, and pan- and core-genome plots can be drawn for those annotations, and an interface to search for annotations present in a set of chosen genomes but absent in another is also available. Since COG and KEGG orthologs are assigned to high-level functional categories, users can visualize the distribution of annotated genes in those categories across one or several genomes, either as bar charts (Fig. 3A and B) or as heatmaps (Fig. 3C). This allows users to quickly visualize differences in functional capabilities between organisms.

Fig 3.

(A and B) Distribution of genes annotated with COG and KEGG orthologs in their functional categories for four chosen genomes. (C) Details of the completeness of KEGG module 83 (fatty acid biosynthesis and elongation). Red squares in the heatmap correspond to the number of genes annotated as a given KEGG ortholog. Green squares correspond to a gene without a specific KEGG annotation but in the same orthogroup as other genes having this annotation. This may indicate a shared function, and in such cases, the corresponding KEGG ortholog is considered present when estimating the module completeness. The last column indicates module completeness, as determined by the module definition language. (D) Proportion of the genes in genomes assigned to the different COG categories (some categories were removed for the sake of simplicity). (E) Overview of module completeness in any given KEGG category. Green squares indicate a complete KEGG module, and orange squares indicate an incomplete module. The number of genes annotated as KEGG orthologs for a given module is indicated in each square.

To further characterize metabolic capacities, zDB implements a parser for the KEGG module definition language, which makes it possible to assess the completeness of a metabolic module based on the KEGG orthologs present in a genome. Module completeness can be compared at the level of a single KEGG module (Fig. 3C) or at the level of categories or sub-categories (Fig. 3E). The results are directly linked to the species phylogeny, making it easy to notice patterns of metabolic capacities linked to specific clades.
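To make the idea of evaluating a module definition concrete, the sketch below implements a small recursive-descent evaluator for a simplified subset of the KEGG module definition syntax. It is not the parser shipped with zDB. It assumes that spaces separate consecutive steps, commas separate alternatives, "+" joins the required components of a complex, and "-" marks an optional component that is ignored; the top level is assumed to be a plain space-separated list of steps, and other constructs (such as "--" for a missing step) are rejected. The KO identifiers in the example are placeholders.

```python
import re
from typing import List, Optional, Set

_TOKEN = re.compile(r"K\d{5}|[()+,\-]|\s+")


def tokenize(definition: str) -> List[str]:
    """Split a definition into KO ids, operators, and single-space separators."""
    text = definition.strip()
    tokens = _TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError(f"unsupported syntax in {definition!r}")
    return [" " if tok.isspace() else tok for tok in tokens]


class _Evaluator:
    """Evaluate one step: alternatives > sequence > component > term."""

    def __init__(self, tokens: List[str], present: Set[str]) -> None:
        self.tokens, self.pos, self.present = tokens, 0, present

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> Optional[str]:
        tok = self.peek()
        self.pos += 1
        return tok

    def alternatives(self) -> bool:  # a,b : at least one alternative holds
        result = self.sequence()
        while self.peek() == ",":
            self.take()
            rhs = self.sequence()
            result = result or rhs
        return result

    def sequence(self) -> bool:  # a b : every sub-step holds
        result = self.component()
        while self.peek() == " ":
            self.take()
            rhs = self.component()
            result = result and rhs
        return result

    def component(self) -> bool:  # a+b required, a-b : b is optional
        result = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            if op == "+":
                result = result and rhs
        return result

    def term(self) -> bool:
        tok = self.take()
        if tok == "(":
            result = self.alternatives()
            if self.take() != ")":
                raise ValueError("missing ')'")
            return result
        if tok is None or not tok.startswith("K"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in self.present


def module_completeness(definition: str, present: Set[str]) -> float:
    """Fraction of top-level steps satisfied by the KOs in `present`."""
    steps: List[List[str]] = [[]]
    depth = 0
    for tok in tokenize(definition):
        depth += (tok == "(") - (tok == ")")
        if tok == " " and depth == 0:
            steps.append([])  # a top-level space starts a new step
        else:
            steps[-1].append(tok)
    satisfied = 0
    for step in steps:
        evaluator = _Evaluator(step, present)
        ok = evaluator.alternatives()
        if evaluator.peek() is not None:
            raise ValueError("trailing tokens in step")
        satisfied += ok
    return satisfied / len(steps)


# Placeholder definition with three steps (not a real KEGG module).
EXAMPLE_DEFINITION = "(K00001,K00002) K00003+K00004 ((K00005,K00006) K00007,K00008)"
full = module_completeness(EXAMPLE_DEFINITION, {"K00002", "K00003", "K00004", "K00008"})
partial = module_completeness(EXAMPLE_DEFINITION, {"K00001", "K00003", "K00005"})
```

Reporting the fraction of satisfied steps is a design choice made for this sketch; a stricter definition would call a module complete only when every step is satisfied.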

The Swissprot homologs are listed both on the orthogroup home page and on the gene homepage (Fig. 2B).

Benchmarking

An initial benchmark evaluated the duration of computations for data sets with an increasing number of genomes (Fig. 4B) using a configuration mimicking a high-end desktop computer. Generating a database with all the optional analyses (except the RefSeq homologs search) took 1.9 h, 3.9 h, 8.6 h, 21.0 h, and 55.6 h for the data sets with 10, 20, 40, 90, and 179 genomes, respectively. The CPU (central processing unit) time spent in the optional analyses increased linearly with the number of genomes. This was, however, not the case for the core analysis. In particular, the cost of orthology prediction increased faster than that of the other analyses and will likely be the limiting factor for larger data sets. This is expected given the O(n²) complexity of the all-against-all genome comparison performed by Orthofinder (19).
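As a rough check of the overall scaling, the durations reported above can be fitted on a log-log scale, where the slope estimates the exponent k in t ∝ n^k. The sketch below uses only the five reported data points; the fit describes the total duration, not the orthology step alone, and the function names are illustrative. The second function simply counts genome-against-genome comparisons under the assumption that every ordered pair, including self-comparisons, is run.

```python
import math
from typing import Sequence

# Values reported in the benchmark above.
GENOMES = [10, 20, 40, 90, 179]
HOURS = [1.9, 3.9, 8.6, 21.0, 55.6]


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


def all_against_all_comparisons(n_genomes: int) -> int:
    """Number of ordered genome pairs in an all-against-all comparison."""
    return n_genomes * n_genomes


exponent = loglog_slope(GENOMES, HOURS)
growth_90_to_179 = all_against_all_comparisons(179) / all_against_all_comparisons(90)
```

An exponent only slightly above 1 for the total duration is compatible with the statement that the optional analyses scale linearly while the quadratic orthology step does not yet dominate at this data set size.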

Fig 4.

(A) zDB dataflow. The core analyses are performed for each data set, while users can choose to perform additional optional analyses. (B) Total CPU time of the analyses, split by type (core and optional), as a function of the number of genomes in the benchmark data set. (C) Benchmarking of the different analysis types with the 179-genome data set. Functional analysis includes COG and KEGG orthology annotations and Pfam domain prediction. Swissprot and RefSeq represent searching for homologs in Swissprot and RefSeq, respectively.

The choice of optional analyses for the 179 genome data set significantly impacts the computing time (Fig. 4C). Despite the use of Diamond instead of blastp, searching for homologs in the RefSeq database took about four times longer than the other analyses. Similarly, searching for homologs in the Swissprot database took as long as performing the KEGG, COG, and Pfam annotations together.

Finally, to confirm that zDB can process about 100 genomes on a desktop computer, we ran the analysis pipeline on the 90-genome data set on a desktop machine, which took 71 h to complete.

DISCUSSION

zDB is a comparative genomics analysis tool that runs entirely on the user side and includes both an analysis pipeline and a web interface to visually explore the results. It was designed to require minimal typing on the command line: only three commands are needed from installation to visualization of the results. As shown in the benchmarks, the analysis pipeline can process data sets of about 100 genomes in a matter of days on a desktop computer, making dedicated computing infrastructures unnecessary for all but the largest data sets. The possibility to easily run Blast queries, to search for specific genes, or to retrieve amino-acid and nucleotide sequences will make zDB useful for researchers more accustomed to lab work, while features such as core and pan-genome analysis and genomic region comparisons will be useful to seasoned bioinformaticians. The ability to share the database and launch the web interface from any computer facilitates data sharing and accessibility within and across groups and institutions with minimal infrastructure. Altogether, this makes zDB a tool that is easy to install and use for a wide variety of applications such as genome browsing, characterization of newly sequenced genomes, or even setting up a public database for an organism of interest.

Comparison to existing tools

Some of the features and analyses implemented in zDB overlap with those offered by similar existing tools such as Anvi’o (15) and OpenGenomeBrowser (16). The pangenomics and phylogenomics workflows of Anvi’o are notably similar to the approach used in zDB. Single-copy core orthologs are first identified and used to infer a phylogenetic tree that can then be visualized in the interface. OpenGenomeBrowser can infer a species phylogeny based on Orthofinder predictions. The visualization of Blast search results and the ability to estimate KEGG module completeness are also available in OpenGenomeBrowser and Anvi’o, respectively. OpenGenomeBrowser also allows users to search for genes or annotations that are statistically associated with user-specified phenotypic characteristics, while Anvi’o implements an enrichment analysis to detect functional differences between clusters of genes.

The focus of those tools is however different. zDB was designed for comparative genomics, and as a consequence, most visualizations are articulated around orthology predictions and the inferred phylogenetic trees, allowing the recognition of phylogenetic patterns in the distribution of genes or annotations. Anvi’o being less specialized, it lacks some of the visualizations implemented in zDB but offers a much broader set of analyses including the ability to handle meta-transcriptomics or meta-genomics data. While OpenGenomeBrowser was also designed for comparative genomics, it does not integrate annotations with the inferred phylogenetic tree but provides some other analyses such as dot-plot visualization of pairwise alignments of genomes.

Limitations and future directions

Unlike other tools such as Bactopia (4), which can either use already annotated genomes or perform de novo assemblies from raw reads, zDB only accepts pre-annotated genomes. While this may hinder researchers not used to bioinformatics, good-quality, easy-to-use annotation tools are available (3, 20), so this requirement should not represent a significant limitation to users. As zDB focuses on functional annotations and phylogenetic inference, we chose to leave the burden of gene prediction, and the responsibility to ensure the quality of the genomes included in the analysis, to the user. However, given the modularity of the pipeline, an optional annotation step could easily be included in future releases along with de novo assembly and quality checks.

Large phylogenies are cumbersome to visualize as they may not entirely fit on a computer screen. To alleviate this, we plan to replace the ete3 drawing engine with a custom JavaScript library that draws interactive phylogenetic trees and allows the user to collapse and expand branches. As of now, genomes cannot be added to or removed from an existing database; any such change requires repeating all the analyses on the modified data set. We, therefore, plan to implement the possibility to add or remove genomes, which will allow users to incrementally improve a database without having to repeat the analyses. Finally, we will rapidly extend the set of optional analyses with additional annotations such as the prediction of antibiotic resistance genes, virulence factors, protein transmembrane domains, or signal peptides.

MATERIALS AND METHODS

Design and implementation

zDB is composed of two parts that can be run independently (Fig. 4A): an analysis pipeline that performs all the computationally intensive steps and stores the results in a Sqlite3 database, and a visualization platform that renders the results stored in the database in a graphical interface.

The analyses are separated into a set of core analyses focused on orthology prediction and phylogeny inference and a set of independent optional analyses focused on functional annotation. To simplify the installation and make the analyses reproducible and scalable, all steps are run either within Docker (21) or Singularity (22) containers or in conda environments, under the control of the Nextflow workflow manager (23). Nextflow allows the analyses to be easily scaled from high-performance clusters to desktop machines, while containers guarantee reproducibility and ease of installation by packaging the tools in controlled environments. By default, the analysis pipeline runs the analyses in parallel, with a modifiable maximum of four simultaneous processes. As several of the tools used by zDB can take advantage of multi-threading, parallelism can also be applied at the analysis level by editing the configuration file, although this is not enabled by default. After the completion of the analysis pipeline, zDB can export the results as a compressed archive for subsequent use. The ability to export the results was developed to facilitate sharing and to accommodate the fact that the analysis may have to be run and exported from a high-performance computing cluster, where long-term storage might not be possible due to disk space constraints.

The visualization platform was implemented as a Django website building upon the scaffold of ChlamDB (7). The Django server can be instantiated either on a desktop computer, for local access, or on an Internet-facing computer, if the website is to be made public (internally within a network or externally). The results are rendered as lists, annotated phylogenetic trees, and interactive plots. The phylogenetic trees are drawn as static images with the ete3 toolkit (24), while the interactive plots are generated by a collection of in-house scripts based on the d3.js framework and several libraries such as jvenn.js (25), Circos.js (https://github.com/nicgirault/circosJS), BlasterJS (18), and plotly (https://plotly.com). All the plots drawn by the website can be downloaded as scalable vector graphics (.svg) images for subsequent use. Finally, users can also retrieve the results directly from the database via a Python (26) interface, if custom analyses are to be performed.
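Because the results are stored in a Sqlite3 database, custom analyses can in principle be expressed as SQL queries issued from Python. The sketch below illustrates the idea with the standard `sqlite3` module on an in-memory database. The `membership` table, its columns, and the example rows are entirely hypothetical and do not reflect the actual zDB schema or its Python interface.

```python
import sqlite3
from typing import List


def core_orthogroups(conn: sqlite3.Connection, n_genomes: int) -> List[str]:
    """Orthogroups that have at least one member in every genome."""
    query = """
        SELECT orthogroup
        FROM membership
        GROUP BY orthogroup
        HAVING COUNT(DISTINCT genome) = ?
        ORDER BY orthogroup
    """
    return [row[0] for row in conn.execute(query, (n_genomes,))]


# Hypothetical stand-in for a results database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE membership (orthogroup TEXT, genome TEXT, locus_tag TEXT)"
)
conn.executemany(
    "INSERT INTO membership VALUES (?, ?, ?)",
    [
        ("OG1", "genomeA", "A_0001"),
        ("OG1", "genomeB", "B_0001"),
        ("OG1", "genomeB", "B_0002"),  # second copy in genomeB
        ("OG2", "genomeA", "A_0002"),
    ],
)
core = core_orthogroups(conn, n_genomes=2)
```

With a real database, the connection would point to the database file produced by the pipeline, and the query would need to be adapted to the actual table layout.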

The code is available on GitHub (https://github.com/metagenlab/zDB), and zDB can be installed as a conda package from the bioconda channel (27).

Minimal analyses: quality control, orthology inference, and core genome phylogeny

The minimal set of analyses includes quality control with CheckM v1.2.1 (28), the generation of Blast (29) databases, orthology prediction, and phylogeny inference. zDB takes GenBank files as input and has currently been tested with the output of Prokka (3), PGAP (30), and Bakta (20). As locally assembled genomes may have duplicated accessions or locus tags, zDB first checks their uniqueness and automatically generates new identifiers if necessary. Amino-acid and nucleotide sequences are then extracted from the GenBank files and used as input for subsequent analyses (Fig. 4A). Annotations such as gene names and protein products are also extracted from the GenBank files. These annotations are particularly valuable when reference genomes are analyzed together with draft genomes, as the annotations of the genes from a reference genome may hint at the function of their homologs in draft genomes.
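The uniqueness check on locus tags can be illustrated with a short function. This is a minimal sketch rather than the zDB implementation: the policy (the first occurrence keeps its tag, later duplicates receive a generated one) and the `NEWTAG` prefix are assumptions chosen for the example.

```python
from typing import List


def ensure_unique(tags: List[str], prefix: str = "NEWTAG") -> List[str]:
    """Return the tags in order, replacing repeated ones with generated identifiers."""
    used = set(tags)  # generated identifiers must not collide with any input tag
    seen = set()
    serial = 0
    result = []
    for tag in tags:
        if tag not in seen:
            seen.add(tag)  # first occurrence keeps its original tag
            result.append(tag)
            continue
        while True:  # find the next free generated identifier
            serial += 1
            candidate = f"{prefix}_{serial:05d}"
            if candidate not in used:
                break
        used.add(candidate)
        result.append(candidate)
    return result


# Two assemblies annotated with the same default prefix share the tag "A_1".
renamed = ensure_unique(["A_1", "A_2", "A_1"])
```

Keeping a record of the original tag next to each generated identifier would let users trace genes back to their input files.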

Orthology is inferred using Orthofinder v2.5.2 (19). The sequences of orthologous proteins are aligned with MAFFT v7.487 (31), and the alignments are used to infer phylogenetic trees for each orthogroup using FastTree v2.1.8 (32). In addition to gene phylogenies, zDB also generates a species tree with FastTree using the concatenated alignments of the single-copy core orthologs. As some assemblies may be incomplete, the condition that core orthologs must be present in all genomes can be relaxed to allow missing genes. zDB generates Blast databases with both amino-acid and nucleotide sequences for each individual genome and for the whole data set. This allows users to search for sequences in a specific genome without the interference of better hits in other genomes, while still making it possible to perform global searches on the whole data set.
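The selection of single-copy core orthologs with a tolerance for missing genes can be sketched as a filter over per-genome copy numbers. The data structure, the `max_missing` parameter name, and the example values below are assumptions made for illustration; the exact criterion used by zDB is not specified in the text.

```python
from typing import Dict, List


def single_copy_core(
    copy_numbers: Dict[str, Dict[str, int]],
    genomes: List[str],
    max_missing: int = 0,
) -> List[str]:
    """Orthogroups with at most one copy per genome and at most `max_missing` absences."""
    selected = []
    for orthogroup, counts in copy_numbers.items():
        per_genome = [counts.get(genome, 0) for genome in genomes]
        if any(count > 1 for count in per_genome):
            continue  # a paralog in any genome disqualifies the orthogroup
        missing = sum(1 for count in per_genome if count == 0)
        if missing <= max_missing:
            selected.append(orthogroup)
    return sorted(selected)


# Hypothetical copy numbers: orthogroup -> genome -> number of members.
copy_numbers = {
    "OG1": {"genomeA": 1, "genomeB": 1, "genomeC": 1},
    "OG2": {"genomeA": 1, "genomeB": 1},  # missing in genomeC
    "OG3": {"genomeA": 2, "genomeB": 1, "genomeC": 1},  # duplicated in genomeA
}
genome_names = ["genomeA", "genomeB", "genomeC"]
strict = single_copy_core(copy_numbers, genome_names)
relaxed = single_copy_core(copy_numbers, genome_names, max_missing=1)
```

In a concatenated alignment, genomes lacking a selected orthogroup would typically be represented by gaps for that block.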

Optional analyses: homology search, COG, KEGG, and Pfam annotations

To complement the core analysis, zDB can perform optional analyses focused on function prediction. Optional analyses all take the proteins of the non-redundant pan-genome as input and include assignment to COG clusters, mapping to KEGG orthologs, prediction of Pfam protein domains, and a search for homologs in the SwissProt database. COG annotations offer clues regarding protein functions and allow their classification into broad functional categories. The assignment to COG (33) clusters is performed by rps-blast (29) searches using the position-specific score matrices of the NCBI Conserved Domain Database (CDD) (34). KEGG annotations give insights into the metabolic capacities of the analyzed bacteria. The mapping to KEGG orthologs is performed by Kofamscan (35) using the prokaryotic profiles of the KEGG database (36). As they can offer functional insights into otherwise unannotated proteins and as domain architecture conservation may be a valuable addition to a gene phylogeny, Pfam protein annotations were also added to the optional analyses. The annotation is performed with the Pfam_scan (37) tool and the Pfam-A database. Finally, zDB can also perform a homology search with blastp (29) against the manually curated entries of the SwissProt (38) database. The reference databases used by zDB to perform these analyses are listed in Table 1. Of note, the core analyses can be performed without any reference database.

TABLE 1.

Reference databases used by zDB

Database | Release | Size (a) | Search tool
Swissprot (38) | Release 2021_04 | 86M | Blastp v2.9.0 (29)
Refseq nr (b) (39) | Release 210 | 34.9G | Diamond v2.0.13 (40)
KEGG hmm profiles (36) | Release 03/2022 | 1.2G | Kofamscan v1.3.0 (35)
CDD (34) | Release 3.19 | 4.0G | Rpsblast v2.9.0 (29)
Pfam-A hmm profiles (37) | Release 35.0 | 279M | Pfam_scan v1.6 (37)

(a) Refers to the volume of data to download.

(b) As downloading the non-redundant RefSeq database is prone to failure, it is not automatically downloaded by zDB. A script to download and prepare the databases is installed with zDB but has to be run manually.

To screen for lateral gene transfers using a well-validated method (41), zDB can search the RefSeq database for homologs of proteins from the non-redundant pangenome. The search is performed by Diamond (40) to reduce the duration of the analysis. The proteins of every orthogroup and their best hits (the best four hits of every protein, by default) are then aligned with MAFFT, and the alignment is used by FastTree to infer a phylogenetic tree. As reference genomes downloaded from RefSeq may have been included in the zDB input data set, the best hits from genomes already present in the database are filtered out. If the database was populated with genomes of related bacteria, observing that a protein from a distant taxon clusters more closely than the other proteins from the same orthogroup may indicate a lateral gene transfer.
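The hit selection described above (discard hits from genomes already in the data set, then keep the best four per protein) can be sketched as follows. The tuple layout, the use of an assembly identifier to recognize genomes of the data set, and the ranking by bitscore are assumptions made for this example and are not taken from the zDB code.

```python
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, str, str, float]  # (query, subject, subject assembly, bitscore)


def select_reference_hits(
    hits: Iterable[Hit],
    dataset_assemblies: Set[str],
    max_hits: int = 4,
) -> Dict[str, List[str]]:
    """Keep the `max_hits` best-scoring subjects per query, skipping data set genomes."""
    by_query: Dict[str, List[Tuple[float, str]]] = {}
    for query, subject, assembly, bitscore in hits:
        if assembly in dataset_assemblies:
            continue  # the subject comes from a genome already in the data set
        by_query.setdefault(query, []).append((bitscore, subject))
    return {
        query: [
            subject
            for _, subject in sorted(scored, key=lambda item: -item[0])[:max_hits]
        ]
        for query, scored in by_query.items()
    }


# Hypothetical hits for one protein; "asm_in_dataset" is part of the input data set.
example_hits = [
    ("prot1", "ref_A", "asm_in_dataset", 500.0),
    ("prot1", "ref_B", "asm_1", 300.0),
    ("prot1", "ref_C", "asm_2", 450.0),
    ("prot1", "ref_D", "asm_3", 100.0),
    ("prot1", "ref_E", "asm_4", 200.0),
    ("prot1", "ref_F", "asm_5", 50.0),
]
kept = select_reference_hits(example_hits, {"asm_in_dataset"})
```

The retained subjects would then be aligned together with the members of the orthogroup before tree inference.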

Benchmarking

Although the analysis pipeline could process more genomes, the visualization platform is designed for data sets ranging from tens to hundreds of genomes. Therefore, we chose a representative data set composed of the NCBI’s 179 reference genomes of the Enterobacteriaceae family to benchmark the analysis pipeline. The genomes were downloaded as Genbank files from the NCBI (Table S1). We ran a first benchmark to measure the running time of the different optional analyses on the full data set. As the search for RefSeq homologs proved to be prohibitively long (Fig. 4C), it was not included in the subsequent benchmarks. The pipeline was then run on randomly generated subsets of the 179 genomes composed of 10, 20, 40, 90, or all genomes, all with a mean genome size of 3.8 Mbp.

The performance of the pipeline was measured using Nextflow with the report option. All analyses were run on an Ubuntu 18 server (112 Intel Xeon Platinum 8280 2.7 GHz CPUs and 377 GB of random-access memory [RAM]), limiting parallelization to 20 simultaneous processes (with the Nextflow cpus option) and total memory usage to 32 GB (with the Nextflow memory option) to mimic the computing power of a high-end desktop computer. We also ran the 90-genome data set on a desktop computer with six cores and 16 GB of RAM to get a better idea of the performance of zDB on a computer with more limited resources.

ACKNOWLEDGMENTS

The authors would like to thank the other members of the laboratory, Niklaus Johner, Elena Montenegro-Borbolla, Carmen Chen, Sedreh Nassirnia, Yangji Choi, and Elindi De Coning, who tested the tool and provided feedback during the development. Bastian Marquis is funded by the Jürg-Tschopp MD-PhD scholarship.

This work was supported as a part of NCCR Microbiomes, a National Center of Competence in Research, funded by the Swiss National Science Foundation (grant number 180575).

Contributor Information

Claire Bertelli, Email: claire.bertelli@chuv.ch.

James C. Stegen, Pacific Northwest National Laboratory, Richland, Washington, USA

SUPPLEMENTAL MATERIAL

The following material is available online at https://doi.org/10.1128/msystems.00473-24.

Fig. S1. msystems.00473-24-s0001.docx.

The species tree generated by zDB, annotated with the main characteristics of the included genomes.

DOI: 10.1128/msystems.00473-24.SuF1
Table S1. msystems.00473-24-s0002.xlsx.

All the genomes used for the benchmarking of zDB.

DOI: 10.1128/msystems.00473-24.SuF2

ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

Articles from mSystems are provided here courtesy of American Society for Microbiology (ASM)
