Nucleic Acids Research. 2025 May 14;53(W1):W369–W375. doi: 10.1093/nar/gkaf413

M1CR0B1AL1Z3R 2.0: an enhanced web server for comparative analysis of bacterial genomes at scale

Yair Shimony, Edo Dotan, Elya Wygoda, Naama Wagner, Iris Lyubman, Noa Ecker, Gianna Durante, Gal Mishan, Jeff H Chang, Oren Avram, Tal Pupko
PMCID: PMC12230721  PMID: 40366021

Abstract

Large-scale analyses of bacterial genomic datasets contribute to the comprehensive characterization of complex microbial dynamics among different strains and species. Such analyses often include open reading frame extraction, orthogroup inference, phylogeny reconstruction, and functional annotation of proteins. We have previously developed the M1CR0B1AL1Z3R web server, a “one-stop shop” for conducting comparative analyses of microbial genomes. Here, we present M1CR0B1AL1Z3R 2.0, an enhanced version that includes a new user-friendly web interface and an improved, optimized, and more versatile pipeline. The following features were added: (i) a computationally efficient inference of orthogroups, which allows the analysis of up to 2000 bacterial genomes; (ii) genome completeness analysis; (iii) lists of orphan genes per genome; (iv) genome numeric representation that allows detecting genomic rearrangement events; (v) codon bias analysis; (vi) annotation of orthogroups with KEGG Orthology numbers; and (vii) a map of pairwise average nucleotide identity values. M1CR0B1AL1Z3R 2.0 is freely available at https://microbializer.tau.ac.il/.

Graphical Abstract

Introduction

Next-generation sequencing (NGS) technologies are now routinely employed to sequence large collections of bacterial samples. Samples can include diverged species and strains or different isolates from a single species. The resulting sequence reads are assembled into contigs for each sample, forming a draft representation of the respective genome. These genomic assemblies are subsequently subjected to comparative genomics analyses, which highlight differences among the analyzed genomes and elucidate their phylogenomic relationships.

Common analyses include gene prediction, prediction of orthologous and paralogous relationships, phylogenetic inference, and functional annotation. Executing these computational tasks necessitates the integration of multiple bioinformatics tools. This complexity generally mandates the involvement of specialized bioinformaticians to design, implement, and maintain analysis pipelines. However, these pipelines frequently impose operational constraints, including specific software dependencies, high-performance computing (HPC) infrastructure (e.g. multicore servers), and advanced technical expertise for installation and execution. These limitations motivated the development of various ready-to-run pipelines and web servers to analyze datasets of microbial genomes [1–5]. We have previously developed M1CR0B1AL1Z3R [6] (pronounced: microbializer), a web-based platform designed to streamline microbial genome analysis and enhance accessibility for the broader scientific community. M1CR0B1AL1Z3R includes various analyses that are not provided by competing tools [6]. Here, we introduce M1CR0B1AL1Z3R 2.0, an enhanced platform featuring an updated pipeline (Fig. 1) that includes new outputs and algorithmic advancements and supports the analysis of user-provided datasets of up to 2000 bacterial genomes.

Figure 1. M1CR0B1AL1Z3R 2.0 pipeline workflow.

Materials and methods

In the subsequent sections, we detail the input of the pipeline, its processing steps, and the resulting output.

Input

The pipeline input is an archived folder (.zip or .tar.gz) containing multiple FASTA-formatted files, each holding the genomic sequence of a single species/strain/isolate (we support up to 2000 genomes, a six-fold increase compared to the previous version). Each file can contain a fully assembled genome, a collection of contigs, or the set of open reading frames (ORFs) of a single genome. Notably, in many metagenomic studies, the assignment of the various contigs to separate isolates is unknown, and in this case, the data should be binned prior to running M1CR0B1AL1Z3R [7].

An optional initial step in the pipeline involves filtering contigs or ORFs associated with plasmids by detecting the term “plasmid” in the record header. This filtering step can be activated via a user-defined flag.
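The optional plasmid filter can be illustrated with a short Python sketch. This is not the pipeline's own code: the function names `parse_fasta` and `drop_plasmid_records` are hypothetical, and case-insensitive matching of the term "plasmid" is an assumption.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_plasmid_records(text: str) -> List[Tuple[str, str]]:
    """Keep only records whose header does not mention 'plasmid'.

    Case-insensitive matching is an assumption of this sketch.
    """
    return [(h, s) for h, s in parse_fasta(text) if "plasmid" not in h.lower()]


# Toy input: one chromosomal contig and one plasmid contig.
example = """>contig_1 chromosome
ATGAAA
TTTTAA
>contig_2 plasmid pXYZ
ATGCCC
"""
kept = drop_plasmid_records(example)
```

Because the filter relies only on the record header, plasmid sequences that are not labeled as such in the header would pass through unchanged.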

ORF extraction

We extract ORFs from each genome using Prodigal [8], as described in [6]. If the user uploads DNA ORF files as input, this step is skipped. In both cases, the ORFs are translated into amino acid sequences, which are used in the next steps of the pipeline.

Orthogroup inference

When considering a set of species, an orthogroup (OG) is defined as the group of genes descended from a single gene in the last common ancestor of the species. Orthogroup inference is the process of inferring all orthogroups in a dataset of genomes representing different species (or strains/isolates), i.e. clustering the (translated) ORFs of the genomes into orthogroups. Over the years, numerous computational methods have been developed for this purpose, beginning with Clusters of Orthologous Groups [9–13] and followed by algorithms such as InParanoid [14–16], OMA [17, 18], EggNOG [19, 20], OrthoMCL [21], and OrthoFinder [22, 23]. While each method defines its clustering objective slightly differently, the overarching goal is to infer groups of genes with orthologous relationships. In practice, this task is challenging due to gene duplication events, which introduce complex relationship patterns, including one-to-many and many-to-many orthologous connections. Additional complications stem from gene loss and lateral gene transfer events.

In this work, we implement a variant of the OrthoMCL method to cluster ORFs into groups comprising orthologs and recent paralogs [21]. The workflow begins with identifying reciprocal best hits (RBHs) between each pair of genomes [6] and recording the bit scores between protein RBHs. Sequence similarity searches are performed using MMseqs2 [24] with user-defined similarity and coverage thresholds. In parallel, paralogous gene pairs within each genome are identified under the same thresholds, retaining those with higher similarity to each other than to any ORF in another genome—these are classified as recent paralogs. Bit scores are subsequently normalized and ORFs are clustered using the Markov Cluster (MCL) algorithm [25], following the original OrthoMCL procedure. We adopt an inflation parameter of 1.5, as recommended by OrthoFinder [22], to balance cluster granularity and cohesiveness.
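The RBH step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: hits are given as (query, target, bit score) tuples that have already passed the identity and coverage thresholds, the forward-direction bit score is the one recorded, and ties are resolved in favor of the first hit seen. The names `best_hits` and `reciprocal_best_hits` are hypothetical and do not come from the pipeline.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query ORF, target ORF, bit score)


def best_hits(hits: List[Hit]) -> Dict[str, Tuple[str, float]]:
    """Map each query ORF to its highest-scoring target (first hit wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return best


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Hit]:
    """Return pairs (a, b, score) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    rbhs = []
    for a_orf, (b_orf, score) in best_ab.items():
        if b_orf in best_ba and best_ba[b_orf][0] == a_orf:
            # Recording the A-versus-B score is an assumption of this sketch.
            rbhs.append((a_orf, b_orf, score))
    return sorted(rbhs)


# Toy search results between genome A and genome B.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]
rbhs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy example only the pair (a1, b1) is reciprocal: a2 points to b2, but the best hit of b2 is a1, so that pair is discarded.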

For large genomic datasets, the initial step of identifying RBHs between genome pairs becomes computationally prohibitive, as the number of pairwise comparisons increases quadratically with the number of genomes. To mitigate this limitation, and inspired by [4], we implemented the following optimization strategy. For datasets comprising a large number of genomes, we divide the input into smaller batches and infer orthogroups within each batch using the full algorithm described above. The output of this stage is a table of orthogroups for each batch. Subsequently, we construct a “pseudo-genome” from each such table by selecting a representative sequence from each orthogroup. The orthogroup inference algorithm is then reapplied to the set of pseudo-genomes. This step generates a table of orthogroups, each comprising sequences from pseudo-genomes. Finally, this table is updated by replacing each sequence within each orthogroup by the original sequences that it corresponds to (see Supplementary Material S1 for a detailed description).
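The effect of batching, and the final step of mapping pseudo-genome sequences back to the original ORFs, can be sketched as follows. The batch size of 200 genomes, the identifiers, and the function names are hypothetical; the actual batching scheme and the choice of representatives are described in Supplementary Material S1 and are not reproduced here.

```python
from typing import Dict, List


def n_genome_pairs(n_genomes: int) -> int:
    """Number of unordered genome pairs, which grows quadratically."""
    return n_genomes * (n_genomes - 1) // 2


def expand_orthogroups(
    merged: List[List[str]], members_of: Dict[str, List[str]]
) -> List[List[str]]:
    """Replace each representative with the original ORFs it stands for.

    merged: orthogroups inferred over pseudo-genomes (lists of representative IDs).
    members_of: representative ID -> ORFs of its batch-level orthogroup.
    """
    expanded = []
    for group in merged:
        orfs: List[str] = []
        for representative in group:
            orfs.extend(members_of[representative])
        expanded.append(sorted(orfs))
    return expanded


# Pairwise comparisons for 2000 genomes, with and without batching
# (10 batches of 200 genomes is an illustrative choice, not the pipeline's).
all_pairs = n_genome_pairs(2000)
batched_pairs = 10 * n_genome_pairs(200) + n_genome_pairs(10)

# Toy expansion of two merged orthogroups.
members_of = {
    "batch1_rep0": ["g1_orf1", "g2_orf7"],
    "batch2_rep3": ["g3_orf2", "g4_orf9"],
    "batch2_rep5": ["g4_orf1"],
}
merged = [["batch1_rep0", "batch2_rep3"], ["batch2_rep5"]]
final_groups = expand_orthogroups(merged, members_of)
```

Under these illustrative numbers, the count of genome-pair comparisons drops from 1,999,000 to 199,045, roughly a ten-fold reduction.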

The main output of this step is the orthogroup table, in which each row corresponds to an orthogroup and each column contains the set of genes from a specific genome. The (i, j) entry contains the corresponding gene names of the i-th orthogroup and the j-th genome. The orthogroup table is sorted by the ORF coordinates of the genome in the first column, i.e. the first orthogroup is the one that contains the first ORF of the genome in the first column. Orthogroups that do not contain a representative in the first genome are sorted by the ORF coordinates of the genome in the second column, and so on.

The pipeline provides several additional outputs that relate to the orthogroup table: (i) the orthogroup structure in OrthoXML format; (ii) a FASTA file encoding the phyletic pattern of the genomes and orthogroups, following the format described in [6]. The phyletic pattern constitutes a binary presence–absence matrix, where each genome is represented as a binary vector indicating its membership across orthogroups; (iii) a visual representation of this matrix as a blue–white grid, in which blue represents presence, white represents absence, rows correspond to genomes, and columns correspond to orthogroups. This matrix representation is enhanced with a dendrogram of the genomes; and (iv) a two-dimensional UMAP projection [26] of the binary matrix that groups together genomes with similar phyletic patterns. Additionally, the genomes are clustered using HDBSCAN [27], and each cluster in the UMAP is assigned a different color.
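As a rough illustration of the phyletic pattern, the sketch below derives one binary string per genome from a toy orthogroup table and writes it in a FASTA-like layout. The in-memory table structure (orthogroup name mapped to genome name mapped to gene list) and the function names are assumptions; the exact file layout follows [6] and may differ in detail.

```python
from typing import Dict, List

# Hypothetical in-memory form of the orthogroup table:
# orthogroup name -> genome name -> gene names (empty or missing = absent).
OrthogroupTable = Dict[str, Dict[str, List[str]]]


def phyletic_pattern(og_table: OrthogroupTable, genomes: List[str]) -> Dict[str, str]:
    """Return one 0/1 string per genome, one character per orthogroup (row order)."""
    og_names = list(og_table)  # dict order stands in for the table's row order
    return {
        genome: "".join("1" if og_table[og].get(genome) else "0" for og in og_names)
        for genome in genomes
    }


def to_fasta(pattern: Dict[str, str]) -> str:
    """Serialize the binary vectors as FASTA records, one per genome."""
    return "".join(f">{genome}\n{bits}\n" for genome, bits in pattern.items())


og_table = {
    "OG_0": {"A": ["a1"], "B": ["b1"], "C": ["c1", "c2"]},
    "OG_1": {"A": ["a2"], "B": []},
    "OG_2": {"C": ["c3"]},
}
pattern = phyletic_pattern(og_table, ["A", "B", "C"])
fasta_text = to_fasta(pattern)
```

Note that a genome with two paralogs in an orthogroup (genome C in OG_0) is still encoded as a single "1", since the pattern records presence or absence only.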

Following the orthogroup inference, we detect for each genome its orphan genes, i.e. genes belonging to that genome that do not have orthologs in any other genome. We distinguish between orphan orthogroups, which are clusters of paralogs in the genome that do not have orthologs in other genomes, and orphan single genes, which are genes in the genome that have neither orthologs nor paralogs.
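The distinction between orphan orthogroups and orphan single genes can be sketched as below. The sketch assumes that an ORF either appears in the orthogroup table or is absent from it, and that a single-gene group confined to one genome counts as an orphan single gene; how the pipeline stores singletons is not stated here, so this handling and the function name `find_orphans` are assumptions.

```python
from typing import Dict, List, Tuple

OrthogroupTable = Dict[str, Dict[str, List[str]]]


def find_orphans(
    og_table: OrthogroupTable, genome_orfs: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (orphan orthogroups per genome, orphan single genes per genome)."""
    orphan_groups: Dict[str, List[str]] = {g: [] for g in genome_orfs}
    orphan_singles: Dict[str, List[str]] = {g: [] for g in genome_orfs}
    clustered = set()
    for og, members in og_table.items():
        present = {g: orfs for g, orfs in members.items() if orfs}
        for orfs in present.values():
            clustered.update(orfs)
        if len(present) == 1:  # all members come from a single genome
            genome, orfs = next(iter(present.items()))
            if len(orfs) > 1:
                orphan_groups[genome].append(og)  # paralog cluster, no orthologs
            else:
                orphan_singles[genome].extend(orfs)
    for genome, orfs in genome_orfs.items():
        # ORFs that never entered any orthogroup have neither orthologs nor paralogs.
        orphan_singles[genome].extend(o for o in orfs if o not in clustered)
        orphan_singles[genome].sort()
    return orphan_groups, orphan_singles


og_table = {
    "OG_0": {"A": ["a1"], "B": ["b1"]},
    "OG_1": {"A": ["a2", "a3"]},
    "OG_2": {"B": ["b2"]},
}
genome_orfs = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2"]}
orphan_groups, orphan_singles = find_orphans(og_table, genome_orfs)
```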

Alignments of orthogroups and phylogenetic tree reconstruction

For each orthogroup, all protein sequences are aligned using MAFFT with the “--auto” flag [28], and the aligned sequences are reverse-translated to generate codon-level alignments [29]. The multiple sequence alignments of all core orthogroups, i.e. orthogroups that contain an ORF from each genome, are concatenated to construct a core genome (using the DNA sequences) and a core proteome (using the amino acid sequences). In the case where there is more than one gene from a genome in a core orthogroup (e.g. due to gene duplication), we choose one randomly to be included in the concatenated core genome and proteome. To account for the absence of core orthogroups, users are given the option to adjust the threshold for the minimum percentage of genomes required for an orthogroup to be classified as core. For instance, if the threshold is set to 80%, orthogroups containing genes from at least 80% of the genomes will be considered core and included in the concatenated core genome and proteome.
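The core-orthogroup threshold and the random choice among paralogs can be sketched as follows. The table structure, the function names, and the use of a seeded random generator are assumptions made for illustration only.

```python
import random
from typing import Dict, List

OrthogroupTable = Dict[str, Dict[str, List[str]]]


def core_orthogroups(
    og_table: OrthogroupTable, n_genomes: int, min_percent: float = 100.0
) -> List[str]:
    """Orthogroups present in at least min_percent of the genomes."""
    core = []
    for og, members in og_table.items():
        n_present = sum(1 for orfs in members.values() if orfs)
        if 100.0 * n_present / n_genomes >= min_percent:
            core.append(og)
    return core


def pick_member(orfs: List[str], rng: random.Random) -> str:
    """Choose one gene at random when a genome has several genes in a core orthogroup."""
    return rng.choice(orfs)


genomes = ["g1", "g2", "g3", "g4", "g5"]
og_table = {
    "OG_0": {g: [f"{g}_x"] for g in genomes},              # present in 5 of 5
    "OG_1": {g: [f"{g}_y"] for g in genomes[:4]},          # present in 4 of 5
    "OG_2": {g: [f"{g}_z"] for g in genomes[:3]},          # present in 3 of 5
}
strict_core = core_orthogroups(og_table, len(genomes))           # default 100%
relaxed_core = core_orthogroups(og_table, len(genomes), 80.0)    # 80% threshold
chosen = pick_member(["g1_x", "g1_x_dup"], random.Random(0))
```

With five genomes, lowering the threshold from 100% to 80% admits OG_1, which is missing from one genome, while OG_2 (present in 60%) remains excluded.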

A maximum-likelihood phylogenetic tree is reconstructed based on the concatenated core proteome, using IQ-Tree [30] with the WAG substitution model for protein evolution and rate heterogeneity modeled by a discrete Gamma distribution. A user can choose to compute branch support using bootstrap [31] and to root the tree by selecting one of the input genome names as an outgroup. If no outgroup is specified, the resulting Newick-formatted tree will be unrooted, whereas the graphical representation of the tree will be displayed using midpoint rooting. In cases where the concatenated core proteome contains >1000 core orthogroups, the species tree is reconstructed by randomly sampling 1000 core orthogroups, to reduce running times.

Pairwise whole genome similarity

Average nucleotide identity (ANI) values are widely used for assessing pairwise whole-genome similarities [32–34]. We compute ANI values for each pair of genomes using FastANI [35] and represent the results as a heatmap to provide a visual representation of genome similarities. We also identify for each genome its closest relative. Notably, FastANI does not return results for genome pairs with ANI values significantly below 80%, leading to missing entries in the heatmap. When all genomes in the dataset exhibit relatively high similarity, resulting in no missing values, the heatmap is enhanced with hierarchical clustering to further illustrate genome relationships.

Genome completeness

To evaluate genome completeness, we utilize the BUSCO framework [36]. This method uses a database of profile hidden Markov models (pHMMs) representing universal single-copy orthologs. A genome encoding ORFs with a significant match score against all these profiles is considered complete. Otherwise, the fraction of significant matches represents the level of completeness. We utilize the OrthoDB Bacteria dataset (version 9), which comprises 148 core-bacterial pHMMs [37]. Each input genome is queried against these profiles using hmmsearch from HMMER3 [38], and genome completeness is quantified as the fraction of profiles with at least one hit to a protein encoded in the genome, with an E-value threshold of 0.01.
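The completeness score described above reduces to a simple fraction, sketched below. The input is assumed to be a list of (profile ID, E-value) pairs already parsed from the hmmsearch output; the parsing step is omitted, and treating the threshold as inclusive (E-value ≤ 0.01) is an assumption.

```python
from typing import List, Tuple


def completeness(
    hits: List[Tuple[str, float]], n_profiles: int = 148, max_evalue: float = 0.01
) -> float:
    """Fraction of profiles with at least one hit at or below the E-value threshold."""
    matched = {profile for profile, evalue in hits if evalue <= max_evalue}
    return len(matched) / n_profiles


# Toy example with 4 profiles instead of 148: P1 and P3 have significant hits,
# P2 has only a weak hit, and P4 has no hit at all.
toy_hits = [("P1", 1e-30), ("P1", 1e-5), ("P2", 0.5), ("P3", 1e-3)]
score = completeness(toy_hits, n_profiles=4)
```

Multiple hits to the same profile count once, so a duplicated marker gene does not inflate the score.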

Genome numeric representation

To identify genome rearrangement events, we perform an analysis termed “genome numeric representation.” This analysis begins with constructing for each genome a sorted list of its ORF identifiers. The list is sorted by the ORF coordinates in the genome. If the user uploads ORF files instead of full genomes, then we assume the order of the ORFs in each file corresponds to the genomic coordinates. Next, we replace each ORF identifier with the orthogroup number it belongs to or with the number −1 if it does not belong to any orthogroup. These numeric representations of all genomes are written to a single file, one row for each genome. This representation enables the straightforward visualization of genomic rearrangements, such as insertions, deletions, inversions, and translocations of genomic segments (Fig. 2).
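A minimal Python sketch of the numeric representation is given below. The ORF identifiers, the orthogroup numbering, and the function name are hypothetical; the layout of the output file is not specified here and is not modeled.

```python
from typing import Dict, List


def numeric_representation(sorted_orfs: List[str], orf_to_og: Dict[str, int]) -> List[int]:
    """Replace each ORF (in genomic order) with its orthogroup number, or -1 if it has none."""
    return [orf_to_og.get(orf, -1) for orf in sorted_orfs]


# Orthogroup membership for two toy genomes, X and Y.
orf_to_og = {
    "x1": 0, "x2": 1, "x3": 2, "x4": 3,
    "y1": 0, "y2": 2, "y3": 1, "y5": 3,   # y4 belongs to no orthogroup
}
genome_x = numeric_representation(["x1", "x2", "x3", "x4"], orf_to_og)
genome_y = numeric_representation(["y1", "y2", "y3", "y4", "y5"], orf_to_og)
```

Comparing the two rows, genome X reads 0, 1, 2, 3 while genome Y reads 0, 2, 1, -1, 3: the swapped order of orthogroups 1 and 2 suggests an inversion, and the -1 marks an inserted gene without orthologs.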

Figure 2. Comparative analyses of Chlamydia species using M1CR0B1AL1Z3R 2.0. (A) Histogram of orthogroup sizes from run A. (B) Histogram of orthogroup sizes from run B, demonstrating the effect of increased sequence identity and coverage thresholds on orthogroup formation. (C) Heatmap of ANI for the genomes analyzed in run A. The color scale represents ANI values, with red indicating high similarity and green denoting more distant relationships. (D) Distribution of genome BUSCO completeness scores from run A, showing the completeness of genome assemblies based on single-copy orthologs. (E) Phylogenetic tree reconstructed from run A, with Waddlia chondrophila used as an outgroup. Bootstrap support values are shown in red. (F) Genome numeric representation highlighting gene insertion (in blue), inversion (in pink), and translocation (in red) between Chlamydia abortus strain 162STDY5437294 and Chlamydia trachomatis strain tet9. The “...” is a placeholder representing all genes between #51 and #976. (G) Phyletic pattern of the genomes analyzed in run A, alongside a hierarchical clustering of them based on their orthogroup membership.

Functional annotation of orthogroups

KEGG Orthology (KO) annotations are assigned to each orthogroup using the approach described in KofamKOALA [39]. As a preprocessing step, the KOfam database (version 2024-09-01, based on KEGG release 111.0), which contains a pHMM for each KO along with a predefined score threshold, was downloaded and filtered to contain only prokaryotic-associated pHMMs. For an input dataset, rather than using the KofamScan tool [39], which we found to be computationally inefficient, we implemented an equivalent workflow that runs hmmsearch against the protein sequences of the orthogroups, using the profile database as a query. This is followed by filtering the results to retain only KO assignments with scores exceeding their respective thresholds. To optimize runtime, instead of analyzing all sequences within each orthogroup, we use the consensus sequence as a representative proxy (see Supplementary Material S2 for details).
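The threshold-filtering step can be sketched as below. The input is assumed to be a list of (orthogroup, KO, score) tuples parsed from the hmmsearch output, and thresholds are given as a KO-to-score mapping; the KO identifiers and scores are made up for the example, and the strict comparison follows the wording "exceeding" rather than any confirmed detail of the implementation. KOfam distinguishes between score types, which this sketch ignores.

```python
from typing import Dict, List, Tuple


def assign_kos(
    hits: List[Tuple[str, str, float]], thresholds: Dict[str, float]
) -> Dict[str, List[str]]:
    """Keep, per orthogroup, the KOs whose hit score exceeds the KO-specific threshold."""
    assigned: Dict[str, List[str]] = {}
    for orthogroup, ko, score in hits:
        if ko in thresholds and score > thresholds[ko]:
            kos = assigned.setdefault(orthogroup, [])
            if ko not in kos:
                kos.append(ko)
    return assigned


# Illustrative thresholds and hits (values are not taken from the KOfam database).
thresholds = {"K00001": 100.0, "K00002": 250.0}
hits = [
    ("OG_0", "K00001", 150.0),  # passes: 150 > 100
    ("OG_0", "K00002", 200.0),  # fails: 200 <= 250
    ("OG_1", "K00002", 300.0),  # passes: 300 > 250
]
annotations = assign_kos(hits, thresholds)
```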

Codon bias analysis

Codon bias refers to the non-random usage of synonymous codons to encode a specific amino acid, a phenomenon widely observed across diverse organisms [40]. This bias is shaped by evolutionary forces, including selection for translational efficiency and accuracy, and is most pronounced in highly expressed genes (HEGs). Understanding codon bias is essential for investigating gene expression regulation and evolutionary dynamics at the genomic level. To quantify codon bias, as a preprocessing step, we generated an amino acid FASTA file comprising 40 well-characterized HEGs from Escherichia coli. The selection of these HEGs was based on data from CBDB [41], and the corresponding sequences were retrieved from the NCBI Protein Database.

For each analyzed genome, ORFs are screened for homologs of the HEGs using tblastn [42], where the genomic ORFs serve as the database, and the E. coli HEG protein sequences function as queries. Subsequently, the W vector of codon relative adaptiveness is calculated for each genome using the identified HEGs, following the method defined in [40], with Biopython [43] facilitating the computation. Additionally, genomes are clustered based on their W vectors, with principal component analysis used for visualizing the clusters.

In the next stage, the codon adaptation index (CAI), as defined in [40], is computed for all genes within each orthogroup, using the genome-specific W vector. The output includes the orthogroup table alongside the mean CAI for each orthogroup and a histogram depicting the distribution of mean CAI values across orthogroups.
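The two quantities used above can be sketched in plain Python, assuming the commonly used definitions: the relative adaptiveness of a codon is its count in the HEGs divided by the count of the most frequent synonymous codon for the same amino acid, and the CAI of a gene is the geometric mean of these values over its codons. The pipeline relies on Biopython for this computation; the sketch below does not, uses a synonymous-codon table truncated to two amino acids for brevity, and skips zero-weight codons, all of which are simplifying assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

# Truncated table for illustration only: lysine and phenylalanine.
SYNONYMOUS = {"K": ["AAA", "AAG"], "F": ["TTT", "TTC"]}


def codons(seq: str) -> List[str]:
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(heg_seqs: List[str]) -> Dict[str, float]:
    """W vector: codon count divided by the count of the best synonymous codon."""
    counts = Counter(c for seq in heg_seqs for c in codons(seq))
    w: Dict[str, float] = {}
    for family in SYNONYMOUS.values():
        top = max(counts[c] for c in family)
        if top > 0:
            for c in family:
                w[c] = counts[c] / top
    return w


def cai(seq: str, w: Dict[str, float]) -> float:
    """Geometric mean of codon weights; codons without a positive weight are skipped."""
    logs = [math.log(w[c]) for c in codons(seq) if w.get(c, 0.0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy highly expressed gene: AAA x2, AAG x1, TTC x3, TTT x1.
heg = "".join(["AAA", "AAA", "AAG", "TTC", "TTC", "TTC", "TTT"])
w = relative_adaptiveness([heg])
cai_mixed = cai("AAAAAG", w)      # one preferred and one non-preferred lysine codon
cai_preferred = cai("AAATTC", w)  # preferred codons only
```

In the toy data, AAG is used half as often as AAA, so a gene with one codon of each has a CAI of the square root of 0.5 (about 0.71), while a gene built only from preferred codons has a CAI of 1.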

Outputs

Results are provided to the user as downloadable files when the pipeline finishes running. These files are organized in the following folders:

(1) a table and heatmap of ANI values between all pairs of genomes;
(2a) an ORF file of each genome;
(2b) ORF count and GC content of ORFs for each genome, alongside matching histograms;
(2c) an amino acid file for each genome that contains its translated ORFs;
(3) BUSCO value of each genome alongside a matching histogram;
(4) a list of orphan genes of each genome alongside statistics and a histogram;
(5a) the inferred orthogroups in a table format, an annotated orthogroups table with orthogroup sizes, KO annotations, and the average CAI of each orthogroup, and the orthogroups in an OrthoXML format;
(5b) a histogram depicting the distribution of orthogroup sizes (in terms of number of genomes included in each orthogroup), a phyletic pattern of the genomes and orthogroups, a matrix visualization of the phyletic pattern, a UMAP projection of the genomes (which are represented as binary vectors of orthogroup membership), a clustering of the genomes, and a numeric representation of the genomes;
(6a) a FASTA file for each orthogroup with the unaligned DNA sequences;
(6b) a FASTA file for each orthogroup with the unaligned amino acid sequences;
(6c) a FASTA file for each orthogroup with the aligned amino acid sequences;
(6d) a FASTA file for each orthogroup with the aligned DNA sequences (codon alignment);
(7a) the concatenated core proteome and a list of the core orthogroups comprising it;
(7b) the concatenated core genome and the same list of core orthogroups comprising it;
(8) a species phylogeny tree in .newick, .png, and .svg formats;
(9) a CSV file of the W vectors (codons relative adaptiveness) of all genomes, a clustering of the W vectors, the orthogroups table with the mean CAI value of each orthogroup, and a histogram of mean CAI values.

Implementation

M1CR0B1AL1Z3R 2.0 is developed in Python 3.8, using the packages listed in https://github.com/orenavram/microbializer/blob/master/pipeline/microbializer.yaml. The web server operates on an HPC cluster hosted at Tel Aviv University, utilizing Slurm for job scheduling and resource management. To optimize runtime efficiency, each job’s computational steps are executed in parallel across multiple CPU cores and computing nodes. The web server includes a Gallery, an Overview, a Frequently Asked Questions (FAQ) section, and an output example, to assist users in maximizing the platform’s utility.

Case study

To demonstrate the utility of M1CR0B1AL1Z3R 2.0 for comparative genomics, we analyzed a dataset of Chlamydia genomes. Chlamydia are intracellular bacterial pathogens that can infect various mucosal surfaces, leading to a range of health issues, especially in the reproductive and urinary systems [44]. Two independent analyses were conducted: run A, which included 40 Chlamydia genomes along with Waddlia chondrophila as an outgroup, and run B, which excluded the outgroup to focus exclusively on Chlamydia species. The analyzed genomes were downloaded from the NCBI repository in January 2025. The two runs also differed in their input parameters: in run A, we used the default thresholds for homolog detection (40% for sequence identity and 70% for sequence coverage), whereas in run B, we used a 60% threshold for sequence identity and 80% for sequence coverage. The complete results of these analyses are available in the Gallery section of the web server (https://microbializer.tau.ac.il/gallery).

Multiple genomic features were investigated (Fig. 2), including ORF counts, orthogroup distributions, ANI, orphan gene counts, genome completeness, and phylogenetic relationships. The core genome of Chlamydia (including the outgroup) was found to comprise 384 genes conserved across all strains. The ANI analysis revealed clustering patterns consistent with genetic divergence among Chlamydia species. Orthogroup analysis in run B, which employed stricter homolog detection thresholds, resulted in an increased number of orthogroups with fewer genes per orthogroup and a higher proportion of orphan genes. Genome completeness evaluations confirmed the high-quality assembly of most genomes, while phylogenetic analysis provided detailed insights into species relationships. Furthermore, genome numeric representation identified gene insertions, inversions, and translocations, highlighting structural genome variations across the analyzed species.

Acknowledgements

Y.S., E.W., and N.E. were supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. Y.S. was also supported by the Tel Aviv University Center for AI and Data Science (TAD).

Author contributions: Yair Shimony (Conceptualization, Data curation, Formal analysis, Investigation, Project administration, Software, Visualization, Writing - original draft, Writing - review & editing), Edo Dotan (Software, Visualization), Elya Wygoda (Software, Visualization), Naama Wagner (Data curation, Formal analysis), Iris Lyubman (Data curation, Visualization), Noa Ecker (Formal analysis), Gianna Durante (Data curation, Formal analysis, Visualization), Gal Mishan (Data curation, Formal analysis), Jeff H. Chang (Conceptualization), Oren Avram (Conceptualization, Software, Project administration, Supervision, Writing - review & editing), and Tal Pupko (Conceptualization, Project administration, Supervision, Writing - review & editing).

Contributor Information

Yair Shimony, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Edo Dotan, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel; The Henry and Marilyn Taub Faculty of Computer Science, Technion—Israel Institute of Technology, Haifa 3200003, Israel.

Elya Wygoda, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Naama Wagner, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Iris Lyubman, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Noa Ecker, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Gianna Durante, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Gal Mishan, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Jeff H Chang, Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, United States.

Oren Avram, Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States; Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, United States; Department of Anesthesiology and Perioperative Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States.

Tal Pupko, The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

Supplementary data

Supplementary data are available at NAR online.

Conflict of interest

None declared.

Funding

T.P. was supported by Israel Science Foundation [2818/21]. Funding to pay the Open Access publication charges for this article was provided by Israel Science Foundation.

Data availability

M1CR0B1AL1Z3R 2.0 is free and open to all users at https://microbializer.tau.ac.il/ and there is no login requirement. The source code of the pipeline, the pHMMs from OrthoDB v9 (utilized to assess genome completeness), and the FASTA file of E. coli HEGs (used for codon bias analysis) are available at https://github.com/orenavram/microbializer and https://doi.org/10.5281/zenodo.15306283.

References

  • 1. Rodriguez-R  LM, Gunturu  S, Harvey  WT  et al.  The Microbial Genomes Atlas (MiGA) webserver: taxonomic and gene diversity analysis of archaea and bacteria at the whole genome level. Nucleic Acids Res. 2018; 46:W282–8. 10.1093/nar/gky467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Chen  X, Zhang  Y, Zhang  Z  et al.  PGAweb: a web server for bacterial pan-genome analysis. Front Microbiol. 2018; 9:389106. 10.3389/fmicb.2018.01910. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Keegan  KP, Glass  EM, Meyer  F  MG-RAST, a metagenomics service for analysis of microbial community structure and function. Methods Mol Biol. 2016; 1399:207–33. 10.1007/978-1-4939-3369-3_13. [DOI] [PubMed] [Google Scholar]
  • 4. Ding  W, Baumdicker  F, Neher  RA  panX: pan-genome analysis and exploration. Nucleic Acids Res. 2018; 46:e5. 10.1093/nar/gkx977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Marquis  B, Pillonel  T, Carrara  A  et al.  zDB: bacterial comparative genomics made easy. mSystems. 2024; 9:e0047324. 10.1128/msystems.00473-24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Avram  O, Rapoport  D, Portugez  S  et al.  M1CR0B1AL1Z3R—a user-friendly web server for the analysis of large-scale microbial genomics data. Nucleic Acids Res. 2019; 47:W88–92. 10.1093/nar/gkz423. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Liu  Y, Hou  T, Kang  B  et al.  Unsupervised binning of metagenomic assembled contigs using improved fuzzy c-means method. IEEE/ACM Trans Comput Biol Bioinformatics. 2017; 14:1459–67. 10.1109/tcbb.2016.2576452. [DOI] [PubMed] [Google Scholar]
  • 8. Hyatt  D, Chen  GL, LoCascio  PF  et al.  Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010; 11:119. 10.1186/1471-2105-11-119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Tatusov  RL, Koonin  EV, Lipman  DJ  A genomic perspective on protein families. Science. 1997; 278:631–7. 10.1126/science.278.5338.631. [DOI] [PubMed] [Google Scholar]
  • 10. Tatusov  RL, Fedorova  ND, Jackson  JD  et al.  The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003; 4:41. 10.1186/1471-2105-4-41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Tatusov  RL, Galperin  MY, Natale  DA  et al.  The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000; 28:33–6. 10.1093/nar/28.1.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Galperin  MY, Makarova  KS, Wolf  YI  et al.  Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015; 43:D261–9. 10.1093/nar/gku1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Galperin  MY, Kristensen  DM, Makarova  KS  et al.  Microbial genome analysis: the COG approach. Brief Bioinform. 2019; 20:1063–70. 10.1093/bib/bbx117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Remm  M, Storm  CEV, Sonnhammer  ELL  Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001; 314:1041–52. 10.1006/jmbi.2000.5197. [DOI] [PubMed] [Google Scholar]
  • 15. Alexeyenko  A, Tamas  I, Liu  G  et al.  Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006; 22:e9–15. 10.1093/bioinformatics/btl213. [DOI] [PubMed] [Google Scholar]
  • 16. Cosentino  S, Iwasaki  W  SonicParanoid: fast, accurate and easy orthology inference. Bioinformatics. 2019; 35:149–51. 10.1093/bioinformatics/bty631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Roth  ACJ, Gonnet  GH, Dessimoz  C  Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics. 2008; 9:1–10. 10.1186/1471-2105-9-518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Majidian S, Nevers Y, Kharrazi AY et al. Orthology inference at scale with FastOMA. Nat Methods. 2025; 22:269–72. 10.1038/s41592-024-02552-8.
  • 19. Jensen LJ, Julien P, Kuhn M, von Mering C et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 2008; 36:D250–4. 10.1093/nar/gkm796.
  • 20. Hernández-Plaza A, Szklarczyk D, Botas J et al. eggNOG 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res. 2023; 51:D389–94. 10.1093/nar/gkac1022.
  • 21. Li L, Stoeckert CJ, Roos DS OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003; 13:2178–89. 10.1101/gr.1224503.
  • 22. Emms DM, Kelly S OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015; 16:157. 10.1186/s13059-015-0721-2.
  • 23. Emms DM, Kelly S OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019; 20:238. 10.1186/s13059-019-1832-y.
  • 24. Steinegger M, Söding J MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017; 35:1026–8. 10.1038/nbt.3988.
  • 25. Van Dongen S Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl. 2008; 30:121–41. 10.1137/040608635.
  • 26. McInnes L, Healy J, Saul N, Großberger L UMAP: Uniform Manifold Approximation and Projection. JOSS. 2018; 3:861. 10.21105/joss.00861.
  • 27. McInnes L, Healy J Accelerated hierarchical density based clustering. IEEE International Conference on Data Mining Workshops (ICDMW). 2017; New Orleans, LA, USA: 33–42. 10.1109/icdmw.2017.12.
  • 28. Katoh K, Standley DM MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013; 30:772–80. 10.1093/molbev/mst010.
  • 29. Wernersson R, Pedersen AG RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003; 31:3537–9. 10.1093/nar/gkg609.
  • 30. Minh BQ, Schmidt HA, Chernomor O et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020; 37:1530–4. 10.1093/molbev/msaa015.
  • 31. Felsenstein J Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985; 39:783–91. 10.1111/j.1558-5646.1985.tb00420.x.
  • 32. Goris J, Konstantinidis KT, Klappenbach JA et al. DNA–DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol. 2007; 57:81–91. 10.1099/ijs.0.64483-0.
  • 33. Lee I, Kim YO, Park SC et al. OrthoANI: an improved algorithm and software for calculating average nucleotide identity. Int J Syst Evol Microbiol. 2016; 66:1100–3. 10.1099/ijsem.0.000760.
  • 34. Yoon SH, Ha SM, Lim J et al. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek. 2017; 110:1281–6. 10.1007/s10482-017-0844-4.
  • 35. Jain C, Rodriguez-R LM, Phillippy AM et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018; 9:5114. 10.1038/s41467-018-07641-9.
  • 36. Simão FA, Waterhouse RM, Ioannidis P et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; 31:3210–2. 10.1093/bioinformatics/btv351.
  • 37. Zdobnov EM, Tegenfeldt F, Kuznetsov D et al. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res. 2017; 45:D744–9. 10.1093/nar/gkw1119.
  • 38. Eddy SR Accelerated profile HMM searches. PLoS Comput Biol. 2011; 7:e1002195. 10.1371/journal.pcbi.1002195.
  • 39. Aramaki T, Blanc-Mathieu R, Endo H et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020; 36:2251–2. 10.1093/bioinformatics/btz859.
  • 40. Sharp PM, Li WH The Codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987; 15:1281–95. 10.1093/nar/15.3.1281.
  • 41. Hilterbrand A, Saelens J, Putonti C CBDB: the codon bias database. BMC Bioinformatics. 2012; 13:62. 10.1186/1471-2105-13-62.
  • 42. Camacho C, Coulouris G, Avagyan V et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10:421. 10.1186/1471-2105-10-421.
  • 43. Cock PJA, Antao T, Chang JT et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25:1422–3. 10.1093/bioinformatics/btp163.
  • 44. Elwell C, Mirrashidi K, Engel J Chlamydia cell biology and pathogenesis. Nat Rev Microbiol. 2016; 14:385–400. 10.1038/nrmicro.2016.30.

Associated Data


Supplementary Materials

gkaf413_Supplemental_File

Data Availability Statement

M1CR0B1AL1Z3R 2.0 is freely available to all users at https://microbializer.tau.ac.il/, with no login requirement. The source code of the pipeline, the pHMMs from OrthoDB v9 (used to assess genome completeness), and the FASTA file of E. coli HEGs (used for codon bias analysis) are available at https://github.com/orenavram/microbializer and https://doi.org/10.5281/zenodo.15306283.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
