G3: Genes | Genomes | Genetics. 2024 Jun 5;14(8):jkae119. doi: 10.1093/g3journal/jkae119

OrthoPhyl—streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales

Earl A Middlebrook, Robab Katani, Jeanne M Fair
Editor: A Whitehead
PMCID: PMC11304591  PMID: 38839049

Abstract

There are a staggering number of publicly available bacterial genome sequences (at writing, 2.0 million assemblies in NCBI's GenBank alone), and the deposition rate continues to increase. This wealth of data begs for phylogenetic analyses to place these sequences within an evolutionary context. A phylogenetic placement not only aids in taxonomic classification but informs the evolution of novel phenotypes, targets of selection, and horizontal gene transfer. Building trees from multi-gene codon alignments is a laborious task that requires bioinformatic expertise, rigorous curation of orthologs, and heavy computation. Compounding the problem is the lack of tools that can streamline these processes for building trees from large-scale genomic data. Here we present OrthoPhyl, which takes bacterial genome assemblies and reconstructs trees from whole genome codon alignments. The analysis pipeline can analyze an arbitrarily large number of input genomes (>1200 tested here) by identifying a diversity-spanning subset of assemblies and using these genomes to build gene models to infer orthologs in the full dataset. To illustrate the versatility of OrthoPhyl, we show three use cases: E. coli/Shigella, Brucella/Ochrobactrum and the order Rickettsiales. We compare trees generated with OrthoPhyl to trees generated with kSNP4 and GToTree along with published trees using alternative methods. We show that OrthoPhyl trees are consistent with other methods while incorporating more data, allowing for greater numbers of input genomes, and more flexibility of analysis.

Keywords: rickettsiales, brucella, assembly, maximum likelihood, phylogenetics


Understanding the evolutionary history of organisms is fundamental to biology. However, bacterial phylogenomic analyses require highly specialized software tools and extensive time and computational investments, so getting it right is difficult and errors are costly. These bottlenecks become even more pronounced when inferring trees with hundreds to thousands of leaves. Here, Middlebrook et al. present OrthoPhyl, which takes much of the guesswork and troubleshooting out of inferring orthologs, generating codon alignments, and finally, producing phylogenetic trees.

Introduction

Underpinning every aspect of an organism's biology is its evolutionary history. Phylogenetic methods strive to reconstruct these histories using genetic data. With the advent of “Next-” and “Third-gen” sequencing methodologies, the amount of genetic data in public databases has increased dramatically. At the time of writing, there are 2,018,201 bacterial genome assemblies available through NCBI alone (ncbi.nlm.nih.gov/assembly/?term=bacteria). The wealth of available genetic data has transformed the fields of bacterial phylogenetics and taxonomy (Konstantinidis and Tiedje 2007; Varghese et al. 2015; Jain et al. 2018). Unfortunately, most whole genome phylogenetic analysis methods require specialized bioinformatic data processing and expertise (Smith 2013; Lozano-Fernandez 2022). There are few very large-scale trees to place these sequences into an evolutionary framework. The available large-scale trees are under-resolved and built by reconciling many disparate phylogenetic studies, such as NCBI taxonomy (Schoch et al. 2020), or are generated with methods unsuited for both narrow and broad evolutionary distances (Hördt et al. 2020), reducing their utility at multiple evolutionary scales. This flood of genomic data necessitates an easy-to-use phylogenetic analysis pipeline to help reveal the evolutionary context for the myriad of sequenced bacteria.

Analysis options

Single and multi-locus

A classic way to generate bacterial trees is to compare single homologous loci (e.g. 16S ribosomal, recA, gyrA, rpoB, or dnaK genes). While largely replaced by other methods, single gene trees have some advantages, which include the ability to use conserved primer sites to amplify sequences, very low computational intensity, and the availability of curated databases (SILVA, RDP, NCBI GenBank, etc.). Using genes other than rDNA allows for greater resolution at certain evolutionary distances but still lacks broad resolution (i.e. near resolution for swiftly evolving genes, far for conserved) (Yang 1998). However, single gene trees are prone to misleading topologies due to horizontal gene transfers, incomplete lineage sorting, paralogous gene conversions, and limited phylogenetic information (Huerta-Cepas et al. 2007). Using multi-locus alignments to build phylogenetic trees increases the total amount of phylogenetic information (more sequence used), reduces the effects of incongruent gene tree topologies due to recombination or horizontal gene transfer, and widens the breadth of evolutionary resolution (Gontcharov et al. 2004). However, in multi-locus analyses of taxa without well-vetted multi-locus methods, locus selection is non-trivial, and the effort and cost of primer design and amplification optimization scale with the number of sequences used, which can rapidly become prohibitive for many labs. Additionally, strains of many taxa, like Brucella, are not differentiated well by available multi-locus methods, leading to unknown evolutionary histories (Sankarasubramanian et al. 2019).

Whole genome

Classically, whole genome alignments were required to identify Single Nucleotide Polymorphisms (SNPs) of assemblies (see Shakya et al. 2020; Treangen et al. 2014 for method details). These methods add considerable phylogenetic data for tree building and can produce different topologies than multi-locus alignments, illustrating their potential utility (Baltrus et al. 2014). A major drawback of whole genome alignments is that they tend to produce short alignments, or fail outright, for more distantly related genomes (Darling et al. 2004; Angiuoli and Salzberg 2011; Chung et al. 2018; Shakya et al. 2020). Thus, phylogenetic information drops quickly as a function of evolutionary distance. The practical result is a lack of resolution at even moderate evolutionary distances. For alignments to a reference, having the “missing” data enriched for the more divergent sites can lead to an underestimation of the sequence-to-reference evolutionary distances for more divergent sequences (Spencer et al. 2007; Shavit Grievink et al. 2013; Bertels et al. 2014). Alternatively, one can use all-vs-all whole genome alignments; however, they scale poorly and are limited in the number of input taxa that can be aligned (Angiuoli and Salzberg 2011).

K-mer overlap

An emergent method to quickly estimate evolutionary histories from genome assemblies is to use k-mer content. This involves identifying subsequences of length k within each assembly and calculating the shared k-mer content (for details see Bussi et al. 2021). The advantage of this class of methods is that it is extremely fast relative to alignment-based methods. The comparison of k-mer overlap between sequences naturally leads to pairwise distance matrices, thus neighbor-joining trees can be generated directly from the results (Saitou and Nei 1987). Recent work has integrated more complicated distance statistics with success (Tang et al. 2023). However, k-mer distances become non-linearly associated with true percent identity as sequences diverge, leading to erroneous estimated branch lengths (Jain et al. 2018).
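As a minimal illustration of the k-mer overlap idea, the sketch below computes a plain Jaccard distance between the k-mer sets of two sequences and assembles a pairwise distance table. It is not the statistic used by any of the tools cited above; the function names and the small default k are assumptions chosen for readability.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 - |A & B| / |A | B| for the k-mer sets A and B of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(seqs, k=4):
    """Pairwise distances for a dict of name -> sequence."""
    names = sorted(seqs)
    return {(x, y): jaccard_distance(seqs[x], seqs[y], k)
            for x in names for y in names}
```

Because a single substitution alters up to k overlapping k-mers, this distance grows faster than the fraction of mismatched sites, which is one way to see why k-mer distances are not linear in percent identity.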

An alternative k-mer-based approach is kSNP (Gardner and Hall 2013; Gardner et al. 2015; Hall and Nisbet 2023), which identifies SNPs by flanking conserved k-mers. This approach allows the use of tree-building methods including parsimony, maximum likelihood, and Bayesian. Like whole genome alignments and k-mer overlap methods, kSNP suffers from a rapid decline in phylogenetic information as sequences diverge. Additionally, the signal/noise ratio degrades quickly because non-homologous k-mers are increasingly common as sequences diverge (Gardner and Hall 2013).

Predicted gene alignments

To use the wealth of data generated by NGS methods and simultaneously simplify the problem of whole genome alignments, CDS or protein sequences derived from annotations or transcriptome sequencing can be compared. This method breaks the problem of identifying homologous loci into two tractable parts: gene identification, then identification of homologous sequences. This approach identifies many phylogenetically informative sites (like whole genome), but alignments are more tractable (like single loci). However, gene-based alignments are negatively affected by inferred paralogs, i.e. duplicated genes or contamination. Thus, most methods filter out gene families with paralogs. However, see ASTRAL-Pro for a method that is “paralog aware” (Zhang et al. 2020). A widely used pipeline that accomplishes this task is GToTree, which annotates assemblies, uses precomputed hidden Markov models to identify single copy orthologs, aligns predicted amino acid sequences, then generates trees with FastTree2 or IQTree2 from concatenated alignments (Lee 2019). While focused on protein alignments, GToTree can also be run in nucleotide mode to use CDS alignments for phylogenetic inference instead.

A trade-off arises while using predicted coding or protein sequences to build species trees. Coding sequences (CDS) have the most phylogenetic information due to codon degeneracy but are hard to align at high divergences (States et al. 1991). Proteins remain alignable at high divergences but lack the information content of CDS alignments. This can be addressed by converting protein alignments to corresponding codon alignments, leveraging nucleotide phylogenetic information and protein alignment accuracy (Wernersson and Pedersen 2003; Bininda-Emonds 2005). However, at increasing evolutionary distances, even with accurate alignments, mutational saturation in nucleotide data can negatively affect phylogenetic estimates. Evolutionary model selection, avoiding compositional bias, and dense taxon sampling can largely alleviate these issues; see Kapli et al. (2023) for details.

Benefits of an automated workflow

The skills required to generate bacterial phylogenetic trees from whole genomes are extensive. Some of the many steps include gathering and annotating assemblies, identifying and filtering orthologs, sequence alignment, trimming and concatenation, and finally, tree inference (Ashford et al. 2020; Lozano-Fernandez 2022). Each one of these steps, especially for large data sets, requires expertise in picking parameters, file management, and data format manipulation and filtering. Many of these steps also require familiarity with the UNIX command line.

Beyond taxonomic studies, the phylogenetic context of organisms is becoming increasingly important for standard molecular and evolutionary studies. Even with the available data and a clear need for the phylogenetic placement of focal study species, technical barriers preclude many researchers from inferring their evolutionary histories. An easy-to-use, accurate phylogenomic pipeline will encourage wide adoption of phylogenomic methods to complement ongoing environmental, evolutionary, and clinical microbiological research, and perhaps help standardize the estimation of phylogenetic trees within and between research labs.

To meet this demand, we present OrthoPhyl, a phylogenomic pipeline that takes bacterial genomes as input, annotates them, identifies orthologs, converts protein to nucleotide alignments, and builds species trees with both concatenated alignments and gene tree to species tree reconciliation. The workflow accepts an arbitrarily large number of input genomes (tested up to 1,200 here). To accelerate analysis, a subset of samples representing the whole dataset's diversity is identified; their proteomes are used to identify orthologs and build hidden Markov models, which are then expanded to the full input dataset by iterative searches. This strategy allows the generation of trees for 689 Brucella assemblies in ∼48 hrs using 30 CPUs and 58.3 GB memory with no hands-on time required. This pipeline is designed to be an easy-to-install, scalable, turn-key solution for generating high-resolution bacterial trees from diverse clades.

Materials and methods

OrthoPhyl's workflow

The structure of OrthoPhyl can be broken down into four main steps: Genome assemblies are annotated (Fig. 1a), resulting proteins are assigned to orthogroups (Fig. 1b), orthogroup proteins are aligned and converted to codon alignments (Fig. 1c), and finally, concatenated codon alignments are used to infer phylogenetic trees (Fig. 1d). All processes below use default parameters unless specified otherwise.

Fig. 1.

Workflow diagram of OrthoPhyl. Grey boxes indicate processes. Programs used in each step are listed unless a custom script was used. Orange, tan, and purple boxes represent user input, intermediate files, and species tree outputs, respectively. Purple arrows show iterative approaches. The workflow is divided into four main tasks a) annotate assemblies and remove identical CDSs. If more than “N” assemblies are being analyzed, b1) identify a subset of diversity-spanning assemblies, b2) pass them through OrthoFinder to generate orthogroups, and b3) expand the OrthoFinder-identified orthogroups to the full dataset of assemblies through iterative HMM searches. c) Align full orthogroup protein sets, generate and trim matching codon alignments, then filter orthogroups by taxon representation. Finally, d) estimate species tree topologies with concatenated codon alignment supermatrices along with a gene tree to species tree consensus method.

Structural annotation

OrthoPhyl starts by generating gene calls from input assemblies (Fig. 1a) with Prodigal (Hyatt et al. 2010). Redundant CDSs (> 99.9% nucleotide identity) found in a genome are removed using BBMap's dedupe (BBMap 2022), on the logic that they are likely the result of recent gene duplications and provide little or no phylogenetic information. This removes unnecessary paralogs from the dataset to preserve the usefulness of the containing orthogroup for downstream analysis.

Orthogroup assignment

To infer orthogroups to use for tree generation, OrthoFinder is used with default parameters, except that gene trees are inferred from multiple sequence alignments (Emms and Kelly 2015; Emms and Kelly 2019). When species trees are being generated for large numbers (default: > 30) of genome assemblies, finding orthogroups directly becomes intractable because the computational time of the all-vs-all homology search grows quadratically with the number of genomes. To overcome this, OrthoPhyl identifies a diversity-spanning subset of assemblies to identify orthogroups. Briefly, average nucleotide identity (ANI) is estimated for all assembly pairs with FastANI (Jain et al. 2018). A custom algorithm is then used to cluster assemblies based on pairwise ANI values. Through successive rounds, samples with the highest ANI are merged into clusters. The merged samples’ ANI to other samples and clusters are averaged. Rounds of merging are performed until N (default: 30) clusters are formed. A single representative sample for each of the N groups is chosen as input for OrthoFinder to infer a reduced set of orthogroups. FastANI fails to give an ANI percent for genome comparisons with ANI less than ∼75%. For these instances, an arbitrary value of 50% is used. The representative genome picking will be unaffected if there are fewer than 30 clusters at this or greater distances to all other clusters.
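The clustering described above can be sketched as a greedy agglomerative procedure. This is an illustration of the stated steps, not OrthoPhyl's own code: the function names are hypothetical, the ANI of a merged cluster to each remaining cluster is taken as the unweighted mean of the two merged values, and picking the alphabetically first member as representative is an assumption, since the text does not say how representatives are chosen.

```python
from itertools import combinations

MISSING_ANI = 50.0  # substituted when FastANI reports no value for a pair


def cluster_by_ani(samples, ani, n_clusters):
    """Merge the most similar clusters until only n_clusters remain.

    ani maps frozenset({a, b}) -> percent identity for sample pairs.
    """
    clusters = [(s,) for s in samples]
    # Similarity between clusters, keyed by the pair of cluster tuples.
    sim = {frozenset((a, b)): ani.get(frozenset((a[0], b[0])), MISSING_ANI)
           for a, b in combinations(clusters, 2)}
    while len(clusters) > n_clusters:
        pair = max(sim, key=sim.get)          # highest-ANI pair this round
        a, b = tuple(pair)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Averaged ANI of the merged cluster to every other cluster.
            sim[frozenset((merged, c))] = (
                sim.pop(frozenset((a, c))) + sim.pop(frozenset((b, c)))) / 2
        del sim[pair]
        clusters.append(merged)
    return clusters


def pick_representatives(clusters):
    """Assumed rule: take the alphabetically first member of each cluster."""
    return sorted(min(c) for c in clusters)
```

With four samples where a/b and c/d are near-identical pairs, asking for two clusters yields one representative from each pair, which is the intended diversity-spanning behavior.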

To expand orthogroups to the full assembly dataset, proteins from each orthogroup are realigned with Mafft (Katoh and Standley 2013), then hmmer (Eddy 2008, 2009, 2011) generates hidden Markov models and searches against all predicted proteins using default parameters. OrthoPhyl then finds the minimum HMM hit score cutoff that removes all paralogs; the remaining hits above this score (if any) are designated as the final orthogroup, ensuring that all potential paralogs are removed and the greatest number of taxa are represented in the orthogroup. Each orthogroup is then realigned and HMMs are again generated and searched against all proteins to capture additional, more divergent orthologues.
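One way to read the score cutoff step is sketched below. It is an interpretation, not OrthoPhyl's implementation: hits are assumed to be (genome, protein ID, score) tuples, a paralog is assumed to be any second hit from a genome already represented, and the function name is hypothetical.

```python
def paralog_free_hits(hits):
    """Keep the largest set of top-scoring hits with at most one per genome.

    hits: list of (genome, protein_id, score) tuples from one HMM search.
    """
    cutoff = float("-inf")
    seen = set()
    # Walk hits from best to worst until a genome would appear twice.
    for genome, _, score in sorted(hits, key=lambda h: -h[2]):
        if genome in seen:
            cutoff = score  # best-scoring hit that would introduce a paralog
            break
        seen.add(genome)
    # Only hits scoring strictly above the cutoff form the orthogroup.
    return [h for h in hits if h[2] > cutoff]
```

Lowering the cutoff any further would admit a second hit from some genome, so this is the most inclusive paralog-free set under the stated assumptions.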

Codon alignment generation

Once orthogroups are identified for the full set of assemblies, protein sequences are realigned with Mafft (Katoh and Standley 2013). These alignments are then used as the basis for codon alignment using PAL2NAL with codon table 11 (Suyama et al. 2006). Codon alignments are then trimmed with trimAL using arguments “-resoverlap .5 -seqoverlap 50 -gt .80 -cons 60 -w 3” (Capella-Gutiérrez et al. 2009). Users can change trimming options with the “control_files.user” if required. Alignment_Assessment (Portik et al. 2016) is used to visualize the quality of codon alignments, including their phylogenetic signals, taxa per alignment, and alignments per taxa (Fig. 2). This is critical for ensuring high-quality data are used for generating trees and can aid in troubleshooting phylogenies with low branch support.
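The protein-to-codon conversion can be illustrated with a short sketch that threads an unaligned CDS onto its gapped protein alignment row. This is not PAL2NAL itself: the function name is hypothetical, and the sketch assumes an in-frame CDS, tolerates one trailing stop codon, and does not check the translation against codon table 11.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Return the codon-level alignment row matching a gapped protein row."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Drop a trailing stop codon if the CDS carries one extra codon.
    if len(codons) == n_residues + 1:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("CDS length does not match the protein sequence")
    remaining = iter(codons)
    # Each residue consumes one codon; each protein gap becomes three gaps.
    return "".join("---" if aa == "-" else next(remaining)
                   for aa in aligned_protein)
```

For example, the protein row "M-K" with CDS "ATGAAATAA" becomes "ATG---AAA", keeping the protein-level gap placement while restoring the nucleotide information.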

Fig. 2.

Alignment metric visualization generated by OrthoPhyl. a) Histogram of the number of taxa in each alignment. The dotted line indicates the >30% taxa cutoff for alignments used to generate the “SCO relaxed” trees. b) Histogram of the amount of missing data (including gaps and ambiguous bases) per alignment. This is exclusive of genomes without detected orthologs. c) The number of phylogenetically informative sites (sites with at least 2 different states in at least 2 taxa each) compared to alignment length for each orthogroup. A linear regression is shown as a solid line with its associated slope. A one-to-one relationship is shown as a black dotted line.

Our tool then classifies orthogroups as relaxed or strict single-copy orthologs (SCOs). Strict SCOs are genes found in every taxon with no paralogs, while relaxed SCOs are found in a subset of assemblies still with no paralogs. The percent of assemblies with the SCO to count in this set is tunable with the default set as 30% (shown in Fig. 2a: red dashed line). See (Wiens and Morrill 2011) for analysis on the effects of missing data. These two sets of orthogroups are used in parallel moving forward to generate species trees.
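The strict/relaxed distinction can be written as a small decision rule. The sketch below is illustrative only: the function name and the input layout (genome mapped to a list of gene IDs) are assumptions, and a strict SCO is treated as also qualifying for the relaxed set.

```python
def classify_orthogroup(members, all_genomes, min_fraction=0.30):
    """Label an orthogroup as 'strict', 'relaxed', or 'rejected'.

    members: dict of genome -> list of gene IDs assigned to this orthogroup.
    """
    # Any genome contributing more than one gene means a paralog is present.
    if any(len(genes) > 1 for genes in members.values()):
        return "rejected"
    present = {g for g, genes in members.items() if genes}
    if present == set(all_genomes):
        return "strict"   # single copy in every genome
    if len(present) / len(all_genomes) > min_fraction:
        return "relaxed"  # single copy in enough genomes
    return "rejected"
```

Raising min_fraction trades the number of retained orthogroups against the amount of missing data in the resulting supermatrix.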

Species tree estimation

Two general methods are used to infer species trees for the input assemblies: gene-tree to species-tree estimation with ASTRAL-III (Zhang et al. 2018) and maximum likelihood methods based on codon alignment supermatrices generated with catfasta2phyml (https://github.com/nylander/catfasta2phyml). First, to use ASTRAL, ML gene trees are generated with either “FastTree2 -gtr -gamma” (Price et al. 2010) or IQtree2 with default settings (Nguyen et al. 2015; Kalyaanamoorthy et al. 2017; Hoang et al. 2018) per user input. Then ASTRAL will generate species trees with the strict and relaxed SCO gene tree datasets. For the maximum likelihood species tree methods, the user can specify any combination of IQtree, FastTree2 and/or RAxML (Stamatakis 2014) to infer tree structure(s). Default parameters are used for each tree method except for RAxML and FastTree2 being set to use the GTR + gamma model and 100 and 1,000 bootstraps (with Shimodaira-Hasegawa test), respectively, and IQtree2 set to run ModelFinder to identify a best fit evolutionary model and run 1,000 ultrafast bootstraps.
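For the supermatrix route, concatenation amounts to joining per-orthogroup alignments taxon by taxon and padding taxa that lack an orthogroup with gaps. The sketch below shows that idea under simple assumptions (each alignment is a dict of taxon to equal-length sequence); it is not the catfasta2phyml code, and the function name is hypothetical.

```python
def concatenate(alignments, taxa):
    """Build a supermatrix from a list of {taxon: aligned sequence} dicts."""
    parts = {t: [] for t in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for t in taxa:
            # Taxa missing from this orthogroup are filled with gaps.
            parts[t].append(aln.get(t, "-" * length))
    return {t: "".join(chunks) for t, chunks in parts.items()}
```

With relaxed SCOs, the gap padding is what produces the higher missing-data percentages relative to the strict set.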

Finally, the trees generated during the OrthoPhyl run are compared with generalized Robinson-Foulds metrics provided by ETEtoolkit (Huerta-Cepas et al. 2007) indicating the stability of the resultant tree structure when different inference methods and matrix completenesses are used. All default parameters used with these tools can be found in the OrthoPhyl repository at OrthoPhyl/control_file.defaults.
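To make the tree comparison concrete, the sketch below computes the classic (not generalized) Robinson-Foulds distance between two trees with identical leaf sets by counting bipartitions found in only one of them. It does not use ETEtoolkit; representing trees as nested tuples of leaf names and the function names are assumptions made to keep the example self-contained.

```python
def splits(tree, all_leaves):
    """Collect the non-trivial bipartitions of a tree given as nested tuples."""
    all_leaves = frozenset(all_leaves)
    ref = min(all_leaves)  # fixed reference leaf to give each split one form
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        if 1 < len(below) < len(all_leaves) - 1:
            # Store the side of the split that excludes the reference leaf.
            found.add(below if ref not in below else all_leaves - below)
        return below

    leaves_below(tree)
    return found


def robinson_foulds(tree_a, tree_b, all_leaves):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a, all_leaves) ^ splits(tree_b, all_leaves))
```

A distance of 0 means the two unrooted topologies are identical; larger values indicate more conflicting splits between inference methods or datasets.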

Datasets

As a proof of concept, OrthoPhyl was used to build trees for three bacterial clades. We analyzed well-characterized E. coli/Shigella strains. These included a list of 34 complete genomes analyzed in (Shakya et al. 2020), obtained from NCBI.

The genus Brucella was used to show OrthoPhyl's ability to resolve very closely related sequences while dealing with very long relative branch lengths in the same tree (Ochrobactrum group). Assemblies were acquired using NCBI Taxon number 234 (Schoch et al. 2020) with the utility script “gather_genomes.sh” (Supplementary Methods “Gathering and Filtering Assemblies” and Supplementary Fig. 1), which is packaged with OrthoPhyl (github.com/eamiddlebrook/OrthoPhyl/utils/). This resulted in 689 NCBI Genbank or RefSeq assemblies after filtering for >98% completion, < 0.1% duplication, and < 1% contamination with CheckM (Benson et al. 2013; O’Leary et al. 2016; Parks et al. 2015). Here, duplication is the total number of loci identified as marker genes divided by the number of identified marker genes (see Supplementary Methods “Gathering and Filtering Assemblies” and “utils/checkm_assemblies.slurm” within the github repository). It is important to note that Ochrobactrum species were recently moved to the genus Brucella (Hördt et al. 2020). However, for clarity and due to controversy within the field, we chose to keep the Ochrobactrum labeling for these species. An assembly of Mycoplana dimorpha (GCA_003046475.1) was added to the dataset as an outgroup.

Finally, a tree for all NCBI Rickettsiales assemblies was generated to illustrate OrthoPhyl's utility in dealing with bacterial order level divergences. For Rickettsiales assemblies, we again used gather_genomes.sh, this time with NCBI taxon number 766. To illustrate a more challenging scenario, filtering of Rickettsiales assemblies was less stringent with cut-offs at > 95% completion, < 0.2% duplication, and < 1.5% contamination, again using CheckM (Parks et al. 2015) and our independent calculation of duplication (Supplementary Methods “Gathering and Filtering Assemblies”). This resulted in 1,201 assemblies. Seven Pelagibacter assemblies were added to this dataset to serve as an outgroup. Accessions and stats for all genome assemblies used in this manuscript are available in Supplementary Tables 2–4.

Results

Small Tree of Closely Related Assemblies: E. coli and Shigella

To illustrate OrthoPhyl's ability to build trees from moderate numbers of closely related samples on a local Linux computer, we ran our E. coli analysis (34 genomes) on a desktop with a 12-core Intel processor with 16 GB RAM running RHEL8 (centOS8). Because the assemblies are of high quality and the strains are closely related, OrthoPhyl was used to generate trees from only the strict SCO dataset. Run statistics are provided in Table 1. The full pipeline took 3.25 hours to build trees with FastTree2 and ASTRAL. OrthoPhyl identified 2,217 strict SCOs, and the alignment of these sequences constitutes 2.105 MB with 6% phylogenetically informative sites (Table 1).

Table 1. OrthoPhyl output, runtime, and resource usage.

| Metric | E. coli + Shigella | Brucella | Rickettsiales |
| --- | --- | --- | --- |
| Number of assemblies | 34 | 689 + 1 | 1208 + 7 |
| Number SCOs, strict | 2217 | 897 | 42 |
| Total length (bp), strict | 2,105,000 | 855,924 | 29,248 |
| % Informative SNPs, strict | 6.1 | 41.4 | 81.5 |
| % missing data, strict | 0.4 | 0.5 | 2.3 |
| ANI % (avg/med/min)a, strict | 97.6/99.2/89.9 | 95.4/99.8/63.7 | 83.2/57.3/48.1 |
| Number SCOs, relaxed | - | 1762 | 348 |
| Total length (bp), relaxed | - | 1,627,004 | 295,820 |
| % Informative SNPs, relaxed | - | 43.3 | 78.9 |
| % missing data, relaxed | - | 2.7 | 10.6 |
| ANI % (avg/med/min)a, relaxed | - | 95.0/99.8/63.2 | 82.6/55.7/48.0 |
| CPU time (hr:min) | 16:20 | 373:09 | 152:00 |
| CPU efficiency | 47% | 24% | 41% |
| Total runtime (hr:min) | 3:11 | 48:34 | 12:10 |
| Max memory used (GB) | 13.07 | 58.34 | 17.3 |

a ANI percentage is calculated with CEANIA (github.com/eamiddlebrook/CEANIA) on the final concatenated codon alignments excluding gapped positions. See Supplementary Methods for details.

The resulting tree built by FastTree2 from the SCO-strict dataset shows successful differentiation of the E. coli phylotypes (Fig. 3). This tree shows B1 and A as sister phylotypes, with S. sonnei, S. boydii, and S. flexneri between them. That clade is in turn next to phylotype E and S. dysenteriae, then D1 and D2. Finally, B2 is the sister to all of them. This topology is identical to the reported tree from (Shakya et al. 2020) using whole genome alignments, except for the placement of the root (E. fergusonii), where they have the root placed between D2/B2 (red arrow) and the rest while OrthoPhyl's tree indicates it should be placed between B2 and the other phylotypes.

Fig. 3.

Strict single copy ortholog based maximum likelihood phylogeny of 34 Escherichia and Shigella assemblies inferred by FastTree2 using the GTR gamma model with the default 1,000 bootstraps. The tree was constructed with 2,217 genes totaling 2.04MB of sequence. Tips are labeled with the given species and strain from NCBI's BioSample database (details in Supplementary Table 2). All bootstrap support values are 100% except for the split leading to E. coli strains APEC O1 and S88 (black arrow), which is 98.7%. Bars next to tree indicate E. coli phylotypes [Shakya 2017]. The tree was rooted at E. fergusonii then root node was removed. The blue and red arrows indicate the alternative placement of E. coli IAI1 and the root from Shakya 2017, respectively.

Genus level Species trees of 689 assemblies: Brucella/Ochrobactrum

For the 689 Brucella/Ochrobactrum assemblies analyzed (plus the Mycoplana dimorpha outgroup), OrthoPhyl was run on a RedHat8 compute node with 30 CPUs and 500 GB of RAM. The full analysis (generating strict and relaxed SCO trees with FastTree2 and ASTRAL) took just over 2 days (48 hrs 31 min). Total RAM usage was moderate at 58 GB. CPU usage efficiency was 24%, with a total CPU time of approx. 373 hours (Table 1).

OrthoPhyl identified 785 strict and 1,635 relaxed SCOs in the full dataset of assemblies. The distribution of taxon number represented in each alignment shows that most orthogroup alignments have all taxa represented, with approximately 1,400 orthogroups having 680–689 taxa (Fig. 2a). The orthogroup alignments have very little missing data (gaps or ambiguous bases), with the vast majority showing less than 10% (Fig. 2b). Figure 2c shows the number of informative sites vs each alignment's length. A slope of 0.42 indicates that the alignments are far from the one-to-one relationship (Fig. 2c, dotted line) that would characterize noisy, erroneous alignments.
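The informative-site count plotted against alignment length can be reproduced for any alignment with a few lines. The sketch below applies the definition given in the Fig. 2 caption (at least 2 different states, each present in at least 2 taxa); the function name is hypothetical, and only uppercase A, C, G, and T are counted as states, so gaps and ambiguous bases are ignored.

```python
from collections import Counter


def informative_sites(alignment):
    """Count columns with >= 2 states that each occur in >= 2 sequences.

    alignment: list of equal-length aligned nucleotide sequences.
    """
    count = 0
    for column in zip(*alignment):
        states = Counter(base for base in column if base in "ACGT")
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count
```

Dividing this count by the alignment length gives the per-alignment proportion whose trend across orthogroups is summarized by the regression slope.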

The maximum likelihood (ML) Brucella phylogeny generated by FastTree2 based on the 1,635 relaxed SCOs (Fig. 4) shows very close agreement with trees generated by Ashford et al. (2020) for the Ochrobactrum clades (see Supplementary Fig. 2 for a high-resolution tree with accessions, bootstrap values, and species labeled). For instance, it supports two major Ochrobactrum clades: Group A (O. anthropi, O. lupini, O. tritici, O. pecoris, O. oryzae, O. ciceri, O. intermedium, O. pseudointermedium, and O. daejeonense) and Group B (O. grignonense, O. pituitosa (pituitosum), O. quorumnocens, O. rhizosphaerae, O. pseudogrignonense, O. thiophenivorans, and O. gallinifaecis) (Fig. 4, orange bars). Like Ashford et al., our tree supports O. endophytica as basal to the other Ochrobactrum. However, our tree shows the clade with O. haematophila, O. soli, and O. teleogrylli as sister to Group B Ochrobactrum with >90% support, instead of Group A as in Ashford et al. (Fig. 4a and Supplementary Fig. 2, orange arrows). A major distinction between our tree topology and the one presented in Ashford et al. is that our tree shows the monophyletic Brucella clade splitting the Ochrobactrum clade between O. endophytica and all other Ochrobactrum with >90% support (Supplementary Fig. 2, black arrow). In contrast, the Brucellae in Ashford et al. split Ochrobactrum between Group A and B, as in Hördt et al. (2020). This alternative placement of Brucellae is shown in Fig. 4a and Supplementary Fig. 2 with blue arrows. The difference could be a product of having different outgroups (Mycoplana ramosa, M. dimorpha, and Rhizobium etli for Ashford et al. and Mycoplana dimorpha for OrthoPhyl), with our tree possibly being subject to long-branch attraction between Brucella and the unbroken long Mycoplana branch. Additionally, our tree indicates that O. thiophenivorans, O. pituitosa, and O. rhizosphaerae samples are para/polyphyletic with >90% bootstrap support for all associated splits (Fig. 4a and Supplementary Fig. 2, asterisks), indicating they were possibly mislabeled when uploaded to NCBI.

Fig. 4.

a) Maximum likelihood tree of Brucella/Ochrobactrum assemblies generated by FastTree2 within OrthoPhyl. If provided by NCBI, species are labeled. To aid in visualization of mono/polyphyly, branches are colored by reported species. Internal branch colors are given based on majority rule of offspring. Branch lengths represent relative number of mutations per site. The tree was rooted with Mycoplana dimorpha, which was subsequently removed, and the classic Brucella clade was collapsed. b) Subtree of the collapsed classic Brucella clade. A high-resolution version of the full tree showing Species, strain, accession, and bootstrap support is available in Supplementary Fig. 2.

Ochrobactrum lupini and O. ciceri are closely associated with O. anthropi and O. intermedia, respectively. Like Ashford et al., who compared field isolates to type strains, we had many O. anthropi and O. intermedia assemblies, which show that O. lupini and O. ciceri arose from within their respective clades. They are therefore likely misclassified as separate species and possibly should instead be classified as O. anthropi and O. intermedia, respectively. For O. lupini and O. anthropi, this is supported by additional work (Gazolla Volpiano et al. 2019).

The clade comprising the classical Brucella agrees with published phylogenies (Fig. 4b) (Ashford et al. 2020; Suárez-Esquivel et al. 2020), with Brucella vulpis at the base, B. pinnipedialis arising from within B. ceti (rendering B. ceti paraphyletic) (Orsini et al. 2022), B. canis arising from within B. suis (rendering B. suis paraphyletic), and B. melitensis and B. abortus being sister groups. Interestingly, several Brucella assemblies seem to be mislabeled (Supplementary Fig. 2, grey tip labels). Whether these placements are accurate remains to be determined; however, B. melitensis and B. abortus are very closely related (precluding saturation) and do not show signs of significant interspecies recombination (Vishnu et al. 2015; Suárez-Esquivel et al. 2020). The classical Brucella have very short branches at the crown (B. microti, B. neotomae, B. ovis, B. suis/canis, B. abortus/melitensis, and B. pinnipedialis/ceti), indicating very rapid divergence of these clades, as seen elsewhere (Ashford et al. 2020; Suárez-Esquivel et al. 2020; Orsini et al. 2022). While the topology of this tree is congruent with several published trees, more analyses are required to create the most robust tree possible (see Discussion).

GToTree vs OrthoPhyl: Brucella/Ochrobactrum

OrthoPhyl compares favorably to the phylogenomics pipeline GToTree (Lee 2019). We ran GToTree v1.8.4 in nucleotide mode on the same genome dataset as OrthoPhyl allowing the use of 28 cores while searching for orthologs with the Alphaproteobacteria HMM gene set containing 117 models. All other options were left as default. The full analysis took slightly more time than OrthoPhyl, using 50 hours of real time with a total CPU time of 1057 hours. The alignment length produced was 74.9 kb, with 51 million total characters and approximately 1 million being missing/ambiguous.

The GToTree topology for deep splits is very similar to the tree generated by OrthoPhyl above. Specifically, all of the species-level splits of the Ochrobactrum group are identical between the trees (Supplementary Figs. 2 and 3). Additionally, the Brucella clade splits the main Ochrobactrum clade and O. endophytica in both trees. Notable disagreements between the trees include many strain-level differences in the Ochrobactrum species subtrees and the basal Brucella. Additionally, the classical Brucella clade shows much lower resolution in the GToTree phylogeny, with many splits supported in less than 50% of bootstrap trees (Supplementary Fig. 3, red dots). These results are perhaps expected given the relatively recent radiations of these clades and GToTree's use of a highly conserved gene set from Alphaproteobacteria, which likely misses Brucella/Ochrobactrum-specific phylogenetic signal that could resolve these relationships.

kSNP4 vs OrthoPhyl: Brucella/Ochrobactrum

To compare the performance of OrthoPhyl with an alternative assembly-to-tree method, kSNP4 (Gardner et al. 2015; Hall and Nisbet 2023) was run with the same hardware as OrthoPhyl. See Supplementary methods for details. The resulting kSNP4 matrix for the Brucella/Ochrobactrum dataset has 7.15 million SNPs, which is considerably longer than the maximum genome length. Many of these are likely artifacts of the sequences being divergent enough to have mutations within the flanking k-mers used to identify SNPs. Thus, much of the matrix consists of missing data (∼78%) indicating k-mers not being shared between assemblies.

FastTree2 was used externally to generate a tree with the k-mer-based SNP alignment matrix from kSNP4. As for the OrthoPhyl tree, sites found in less than 30% of assemblies were removed up to the point of removing 80% of total sites. This left 1,569,787 SNPs (exactly 20% of original SNPs, with ∼13% total missing data). The trimmed SNP alignment file was then used to generate a tree with FastTree2 using the GTR + gamma model and default parameters. For a high-resolution tree with species, strain, bootstrap support and assembly accessions labeled, see Supplementary Fig. 4.
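The occupancy-based trimming described above can be sketched as follows. This is a minimal illustration, not the code used by OrthoPhyl or kSNP4: the function name `filter_sites`, the set of characters treated as missing, and the choice to drop the sparsest sites first when the removal cap is reached are all assumptions made for the example.

```python
def filter_sites(alignment, min_occupancy=0.30, max_removed_fraction=0.80,
                 missing=frozenset("-N?")):
    """Drop sparsely populated alignment columns, up to a removal cap.

    alignment: dict mapping sample name -> aligned sequence (equal lengths).
    Columns present in fewer than `min_occupancy` of samples are candidates
    for removal; at most `max_removed_fraction` of all columns are dropped,
    sparsest first (an assumption of this sketch).
    """
    names = list(alignment)
    n_taxa = len(names)
    n_sites = len(alignment[names[0]])
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")

    # Fraction of samples with a called base at each column.
    occupancy = [
        sum(alignment[name][j].upper() not in missing for name in names) / n_taxa
        for j in range(n_sites)
    ]

    # Candidate columns, ordered from least to most occupied.
    sparse = sorted(
        (j for j in range(n_sites) if occupancy[j] < min_occupancy),
        key=lambda j: occupancy[j],
    )
    max_remove = int(max_removed_fraction * n_sites)
    dropped = set(sparse[:max_remove])

    kept = [j for j in range(n_sites) if j not in dropped]
    return {name: "".join(alignment[name][j] for j in kept) for name in names}


toy = {"a": "A-C-T", "b": "A-C-G", "c": "A-CNT", "d": "ATC-T"}
trimmed = filter_sites(toy)
```

In the toy alignment, the second and fourth columns are called in fewer than 30% of samples and are removed, leaving three columns per sample. When the cap is reached, as in the kSNP4 matrix above where exactly 20% of sites remain, some columns below the occupancy threshold are retained.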

For deep branches outside of the classical Brucella, this tree shows largely the same topology as the tree generated by OrthoPhyl. Some notable exceptions within the Ochrobactrum include: 1) O. gallinifaecis and one of the O. thiophenivorans assemblies are sister to O. pseudogrignonensis in the OrthoPhyl tree, whereas they are sister to the clade of O. pseudogrignonensis, O. quorumnocens, and O. pituitosa (among others) in the kSNP4-based tree (Fig. 4a, orange arrow), and 2) O. daejeonensis is placed one node more basal in the kSNP4 tree (Fig. 4a, red arrow). For the Brucella clade, there are 2 major differences. The first is the placement of the root, which kSNP4 places between B. ovis and the rest of Brucella (Fig. 4b, black arrow). The second is (given the OrthoPhyl root) that B. neotomae is placed, along with two B. suis samples (Fig. 4b, asterisk), on the short branch leading to the B. melitensis, B. abortus, B. ovis, B. pinnipedialis, B. ceti clade, instead of sister to all Brucella internal to B. microti (Fig. 4b, purple arrow). Compared to kSNP4, OrthoPhyl had favorable runtime and resource usage requirements. Major differences include max memory usage (58 vs 204 GB), max storage (53 vs 480 GB) and runtime (48 vs >216 hours) for OrthoPhyl and kSNP4, respectively.

Order level divergence tree with 1,200 samples: Rickettsiales

The order Rickettsiales was chosen as a challenging taxon to test OrthoPhyl's ability to deal with wide phylogenetic distances and extensive genome reduction. Additionally, this set of assemblies was not filtered stringently, leaving a total of 1,208 assemblies (7 outgroup) of varying quality (see Datasets section above). As with the Brucella analysis, OrthoPhyl was run on a 30 CPU compute node with 500 GB of available RAM. The program took just 12 hours and 10 minutes to complete, using a total of 17.3 GB of memory. The CPU usage efficiency was 41%, with a total CPU time of approx. 152 hours (Table 1).
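As a check on the figures above, CPU usage efficiency can be computed as total CPU time divided by wall-clock time multiplied by the number of cores. The short calculation below takes the wall-clock time to be 12 hours and 10 minutes, a reading that the reported 41% efficiency supports; the function name is ours and the inputs are the rounded values from the text.

```python
def cpu_efficiency(cpu_hours, wall_hours, n_cores):
    """Fraction of the available core-hours that was actually used."""
    return cpu_hours / (wall_hours * n_cores)


wall_hours = 12 + 10 / 60            # 12 hours and 10 minutes
efficiency = cpu_efficiency(cpu_hours=152, wall_hours=wall_hours, n_cores=30)
print(f"{efficiency:.1%}")           # prints 41.6%
```

The result, roughly 41.6%, agrees with the reported efficiency of 41% given that the CPU time is stated only approximately.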

Even with this challenging genomic dataset, OrthoPhyl identifies >29,000 homologous base pairs from strict SCOs and 295,000 bp from relaxed SCOs (found in 366 or more assemblies). These homologous sequences come from 42 and 348 genes, respectively (Table 1). The relaxed SCO FastTree2 ML tree reconstructed from the 1,201 Rickettsiales assemblies recovers all valid genera but one as monophyletic (Fig. 5a). The sole paraphyletic genus recovered is Candidatus Jidaibacter. Other than Jidaibacter, the recovered topology of genera is consistent with previously published analyses (Salje 2021; Schön et al. 2022), with two major clades: 1) Wolbachia sister to Ehrlichia and Anaplasma, with Neorickettsia basal to them, and 2) Rickettsia and Orientia, separated from the rest by the root. A high-resolution tree is provided in Supplementary Fig. 5, which provides species, assembly accession, bootstrap support, and number of SCOs per sample.
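The distinction between strict and relaxed SCOs can be illustrated with a small sketch. The definitions used here are assumptions for illustration only: a strict SCO is taken to be single copy in every assembly, and a relaxed SCO is taken to be single copy wherever present and found in at least a minimum fraction of assemblies. The function name, input layout, and default threshold are hypothetical and do not describe OrthoPhyl's internal filtering.

```python
import math


def classify_orthogroups(copy_counts, n_assemblies, min_fraction=0.30):
    """Split orthogroups into strict and relaxed single-copy sets.

    copy_counts: dict mapping orthogroup -> {assembly: gene copy number}.
    Assemblies absent from the inner dict are treated as lacking the gene.
    """
    min_assemblies = math.ceil(min_fraction * n_assemblies)
    strict, relaxed = [], []
    for orthogroup, counts in copy_counts.items():
        present = [copies for copies in counts.values() if copies > 0]
        if any(copies > 1 for copies in present):
            continue  # multi-copy in at least one assembly: not an SCO
        if len(present) == n_assemblies:
            strict.append(orthogroup)
        if len(present) >= min_assemblies:
            relaxed.append(orthogroup)
    return strict, relaxed


toy_counts = {
    "OG1": {"g1": 1, "g2": 1, "g3": 1, "g4": 1},   # single copy everywhere
    "OG2": {"g1": 1, "g2": 1, "g3": 0, "g4": 0},   # single copy, half of assemblies
    "OG3": {"g1": 2, "g2": 1, "g3": 1, "g4": 1},   # duplicated in g1
    "OG4": {"g1": 1},                              # too rare
}
strict, relaxed = classify_orthogroups(toy_counts, n_assemblies=4, min_fraction=0.5)
```

Under these assumed definitions every strict SCO is also a relaxed SCO, which is in line with the relaxed set above being much larger (348 vs 42 genes).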

Fig. 5.


a) Maximum likelihood tree for Rickettsiales generated by FastTree2 within OrthoPhyl. Branch colors show reported genera for each sample according to NCBI metadata. To aid in identification, labels are also provided. Samples with no genus-level identification are labeled in grey. Internal branch colors are given based on majority rule of offspring. Note the red Wolbachia clade has been compressed vertically (1/5 ratio); thus, clade widths are not proportional to number of samples across the tree. Branch lengths are relative probability of a mutation per site. The tree was rooted with a Pelagibacter outgroup that was subsequently removed. Candidate genera (Candidatus) are labeled Can. The single genus that is not recovered as monophyletic (Candidatus Jidaibacter) is labeled with red arrows. b) The Rickettsia subtree is shown with monophyletic species identifiers collapsed. The number of collapsed tips follows each species tip label in parentheses. The two most basal branches are omitted, as they are identified only to genus. All assemblies with species-level classification show monophyly apart from R. conorii (red arrows) and R. massiliae/R. rhipicephali (orange arrow).

To illustrate OrthoPhyl's ability to resolve narrow evolutionary windows along with order level divergences in the same tree, the species-rich Rickettsia genus subtree is shown (Fig. 5b). Only three Rickettsia species with species-level identification (provided by NCBI metadata) are not recovered as monophyletic. First, samples labeled R. conorii are placed in 3 very different clades across the tree (Fig. 5b, red arrows), indicating that they are possibly mislabeled in NCBI's genome database. Analysis by Hördt et al. (2020), using digital DNA:DNA hybridization, also indicates that sequences labeled as R. conorii and related species are phylogenetically problematic. Additionally, the placements of R. massiliae and R. rhipicephali samples render each other polyphyletic (Fig. 5b, orange arrow). Other studies mirror our results (Diop et al. 2018, 2020). The four R. conorii assemblies analyzed by Diop et al. are monophyletic in both their papers and this one; the additional R. conorii samples in our dataset are the ones that appear polyphyletic. Diop et al., like us, also recovered R. massiliae and R. rhipicephali as problematic.

It should be noted that, although we recover identical topologies to others, deeper splits in OrthoPhyl's tree could be affected by saturation at informative sites (Supplementary Fig. 6). Thus, a robust Rickettsiales analysis would involve also inferring trees from alignments excluding saturated genes and/or sites and potentially from corresponding amino acid matrices.

Key features

Codon alignments

Using codon alignments to infer phylogenetic trees leverages the favorable signal-to-noise ratio of protein alignments (States et al. 1991) while gaining phylogenetic signal through codon degeneracy, allowing greater amounts of information per gene (Wernersson and Pedersen 2003; Bininda-Emonds 2005; Kapli et al. 2023).
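The idea behind a codon alignment can be shown with a minimal sketch: each residue of an aligned protein is replaced by its codon from the unaligned coding sequence, and each protein gap becomes a three-base gap. This illustrates the general back-translation approach implemented by tools such as PAL2NAL and RevTrans; it is not their implementation, and the function name is ours. It assumes the coding sequence is in frame and contains no stop codon.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein sequence."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")

    codons = []
    position = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")                      # protein gap -> codon-sized gap
        else:
            codons.append(cds[position:position + 3])  # next codon in frame
            position += 3
    return "".join(codons)


# Two sequences with identical proteins (M, K, L) but a synonymous difference
# in the lysine codon (AAA vs AAG), aligned with a gap in different columns.
seq1 = protein_to_codon_alignment("M-KL", "ATGAAACTG")
seq2 = protein_to_codon_alignment("MK-L", "ATGAAGCTG")
```

The two toy proteins carry no amino acid differences, yet their codon alignments differ at a synonymous third position, which is the additional signal gained from codon degeneracy.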

Many input assemblies

Whole genome alignment tools, such as Mugsy and Mauve, do not scale well to large datasets and are limited to generating multiple alignments for about 80 input assemblies (Darling et al. 2004; Angiuoli and Salzberg 2011). OrthoPhyl can generate alignments for >1,000 assemblies, allowing large trees to be inferred. It does this by identifying orthogroups in a diversity-spanning subset of assemblies (see Methods), then assigning proteins to these orthogroups for the full set of predicted proteins by iterative HMM searches. This reduces the computationally demanding all-vs-all protein searching that would make these large analyses intractable for many researchers.
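One generic way to pick a diversity-spanning subset from pairwise genome distances is greedy farthest-first selection, sketched below. This is offered only as an illustration of the concept; it is not the selection procedure used by OrthoPhyl (see Methods for that), and the function name, the distance measure (for example, 1 minus ANI), and the choice of starting genome are assumptions.

```python
def farthest_first_subset(names, dist, k):
    """Greedily choose k genomes that are far apart from one another.

    names: list of genome identifiers.
    dist:  nested dict with dist[a][b] giving a symmetric pairwise distance
           (dist[a][a] must be 0).
    """
    if not names or k <= 0:
        return []
    chosen = [names[0]]                      # arbitrary starting genome
    # Distance from every genome to its nearest already-chosen genome.
    nearest = {name: dist[name][names[0]] for name in names}
    while len(chosen) < min(k, len(names)):
        candidate = max(
            (name for name in names if name not in chosen),
            key=lambda name: nearest[name],
        )
        chosen.append(candidate)
        for name in names:
            nearest[name] = min(nearest[name], dist[name][candidate])
    return chosen


pairs = {("A", "B"): 0.01, ("A", "C"): 0.20, ("A", "D"): 0.21,
         ("B", "C"): 0.20, ("B", "D"): 0.22, ("C", "D"): 0.02}
genomes = ["A", "B", "C", "D"]
toy_dist = {g: {h: 0.0 for h in genomes} for g in genomes}
for (g, h), d in pairs.items():
    toy_dist[g][h] = toy_dist[h][g] = d

subset = farthest_first_subset(genomes, toy_dist, k=2)
```

In the toy example, A and B are near-identical, as are C and D, so a subset of two takes one genome from each pair. Orthogroups built on such a subset can then serve as seeds for HMM searches against the full dataset, as described above.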

User friendly

OrthoPhyl manages the formatting and organization of intermediate files, which is a non-trivial task in a phylogenomic workflow such as this. As with any phylogenetic method, gene sets, alignment, and tree estimation parameters will likely need to be tuned to obtain robust results; thus, we exposed these variables so they can be altered at runtime. If OrthoPhyl fails due to input error or lack of memory resources, it can restart at the last completed step. Being robust to restarting also allows users to tune parameters without rerunning the entire pipeline.
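Restarting at the last completed step is commonly implemented with per-step completion markers. The sketch below shows that general pattern under our own assumptions; the marker-file layout, step names, and function name are hypothetical and are not taken from the OrthoPhyl code base.

```python
from pathlib import Path


def run_pipeline(steps, workdir):
    """Run named steps in order, skipping those that already completed.

    steps:   list of (name, callable) pairs.
    workdir: directory in which "<name>.done" marker files are stored.
    Returns the names of the steps executed during this call.
    """
    marker_dir = Path(workdir) / "checkpoints"
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        marker = marker_dir / f"{name}.done"
        if marker.exists():
            continue              # finished in an earlier run
        func()                    # an exception here leaves no marker behind
        marker.touch()            # record success only after the step returns
        executed.append(name)
    return executed
```

Because a marker is written only after a step returns, a run that fails partway (for example, from running out of memory during alignment) can be relaunched and will resume at the failed step. Deleting the marker of a step forces that step to be rerun with new parameters.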

Discussion

OrthoPhyl successfully reconstructs phylogenetic trees for clades with broad evolutionary divergences. This is achieved with little user input, requiring only one command line input pointing the pipeline to the assembly and output directories. OrthoPhyl can generate trees for >30 bacterial assemblies on a laptop or >1,000 assemblies on a workstation or single compute node with moderate resources (30 CPUs and 100 GB RAM). Many useful options are provided by OrthoPhyl, including alignment trimming parameters, number of assemblies from which to identify initial SCOs, tree-building software to use, and evolutionary models for tree estimation.

With few exceptions, the trees estimated for the E. coli/Shigella, Brucella/Ochrobactrum, and Rickettsiales datasets are consistent with published topologies. This illustrates OrthoPhyl's ability to resolve broad evolutionary relationships: from highly similar genomes with ANIs of >99% for E. coli/Shigella and Brucella/Ochrobactrum to highly dissimilar assemblies with ANIs of 57% for the Rickettsiales. With OrthoPhyl being able to generate whole genome-based phylogenies for more than 1,000 assemblies, it fills a large gap in phylogenomic analysis software, providing an easy-to-use, assembly-to-tree tool that will help many researchers incorporate evolutionary analysis into their ongoing bacterial studies.

The Brucella/Ochrobactrum and Rickettsiales trees are, to our knowledge, the densest phylogenies published for these clades, let alone using whole genome methods. While more work is necessary to validate the topologies presented here, they are a starting point for assessing metadata-based species labels and expected clade monophyly. Future work will test the robustness of topologies to different evolutionary models, gene sets, tree inference methods, and sample incorporation.

Workflow considerations

There are many parameters to tune during the various steps of phylogenomic inference. Since we focus on usability, there is an inevitable tradeoff with optimization, mainly through automated or hard-coded parameter choice. OrthoPhyl does not produce publication-ready trees without user inspection of analysis metrics such as alignment quality, single copy ortholog numbers, percent phylogenetically informative sites, mutational saturation, and of course branching support values. Even so, caution should be exercised when interpreting bootstrap support from phylogenomic methods, as support values can coalesce on erroneous topologies because of mutational saturation, compositional bias, and/or model misspecification. Ultimately, trees should be compared to existing data to ensure major clades are consistent, and evidence for novel branching should be closely scrutinized.

Users should take different approaches to inferring trees depending on their final goals. If one wishes to infer an initial phylogenetic tree to inform experimental design or taxon sampling, running OrthoPhyl with default parameters to generate a tree with FastTree2 will likely be sufficient. This will produce a tree using the widely applicable GTR + gamma model. To infer a robust tree for publication, researchers should consider using IQtree2, which runs ModelFinder to choose the evolutionary model that best fits their data. They should also ensure even taxon sampling, with a focus on breaking up long branches. For trees aimed at revising the taxonomy of a studied group, users should consider screening out genes with high saturation and inspecting gene trees for signals of horizontal gene transfer. Additionally, building trees using multiple “best” models, tree software, and gene sets and then comparing results will add robustness to the inferred topologies. We refer readers to Lozano-Fernandez (2022) for an in-depth discussion on the subject.
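When comparing trees built with different models, software, or gene sets, a simple quantitative summary is the number of clades found in one tree but not the other (a rooted, clade-based form of the Robinson-Foulds distance). The sketch below computes it for small rooted trees written as nested tuples; the representation and function names are ours, and dedicated phylogenetics libraries would normally be used for real Newick files.

```python
def clades(tree):
    """Return (non-trivial clades, all tips) for a rooted nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])          # a leaf is just its own name
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    all_tips = tips(tree)
    found.discard(all_tips)                   # the root clade is uninformative
    return found, all_tips


def rooted_rf(tree_a, tree_b):
    """Count clades present in exactly one of two trees on the same tips."""
    clades_a, tips_a = clades(tree_a)
    clades_b, tips_b = clades(tree_b)
    if tips_a != tips_b:
        raise ValueError("trees must contain the same tip labels")
    return len(clades_a ^ clades_b)


tree_1 = ((("A", "B"), "C"), "D")
tree_2 = ((("A", "C"), "B"), "D")
distance = rooted_rf(tree_1, tree_2)
```

A distance of zero means the two rooted topologies are identical; here the trees disagree only on whether A groups with B or with C, giving a distance of 2. Nodes that differ between analyses are the ones whose support and underlying data deserve closer inspection.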

Future

Future versions of this software package will include several additional evolutionary analyses. Bayesian tree inference methods will be incorporated into the pipeline to allow more flexibility in tree estimation. Since codon alignments of single-copy orthologs are generated while running this pipeline, we also plan on integrating a software package, e.g. ETE3 (Huerta-Cepas et al. 2016), to test evolutionary models such as positive/negative selection and neutrality. A “forest of life” based horizontal gene transfer analysis will be added using gene tree comparisons. This analysis will look at k-means clustering of gene trees to build multiple consensus trees to identify if blocks of genes do not agree with a single consensus species tree hypothesis (Puigbò et al. 2019).

In the current iteration of OrthoPhyl, very ancient paralogs (originating before the most recent common ancestor of the dataset species) might be clustered into the same orthogroup, leading to the orthogroup's removal during filtering even though 2 sub-trees of the orthogroup could be consistent with the species tree. Future iterations of this software will filter the orthogroups in a deep paralog-aware manner, using a rapidly generated neighbor-joining species tree to identify and rescue such orthogroups.

Although this pipeline is geared towards prokaryotes, only two changes are needed to adapt it to eukaryotic phylogenetics: 1) allowing users to input precomputed annotations directly, avoiding the complex task of eukaryotic annotation within the workflow itself, and 2) taking into consideration paralog subtrees that are consistent with a consensus species tree. The next version of OrthoPhyl will incorporate these changes to expand the use cases to eukaryotic phylogenetics.

OrthoPhyl identifies gene family groups for the dataset provided. While this allows the workflow to adapt to any set of genome assemblies, it requires compute resources that could be saved if predefined gene family models could be used as a starting point. Un-clustered protein sequences would then be fed into the standard OrthoPhyl ortholog inference pipeline to capture additional gene families.

Conclusions

Phylogenetic reconstruction is critical to understanding the evolution of pathogenicity, novel traits, horizontal gene transfer, and metabolic potentials of bacteria. Unfortunately, large-scale phylogenomic analyses of diverse bacterial groups require a deep understanding of bioinformatic methods and great care in dealing with the myriad intermediate files and formats used during the analysis. To our knowledge, OrthoPhyl is the only software to take >1,000 bacterial assemblies with order level divergence, generate ortholog codon alignments, then reconstruct accurate phylogenetic trees without user input or management of intermediate files being necessary. Thus, OrthoPhyl will allow many research groups, including those with modest computing resources and knowledge, to leverage the wealth of publicly available genomic data to enrich their ongoing analyses with robust phylogenomic inferences across a broad swath of bacterial diversity.

Web resources

Code used in this manuscript is freely available at https://github.com/eamiddlebrook/OrthoPhyl/ under branch OrthoPhyl_1.0, while the default branch holds the current OrthoPhyl version. Installation and execution instructions are provided in the associated GitHub README.md files. OrthoPhyl requires a Linux OS. Third-party software versions and OrthoPhyl execution files will remain static in the OrthoPhyl_1.0 branch. For versions of software dependencies, see Supplementary Table 1. To aid in usability, a Singularity container is available at https://cloud.sylabs.io/library/earlyevol/default/orthophyl or with the command singularity pull library://earlyevol/default/orthophyl:1.0_ms. Also, see the GitHub page for a Singularity usage guide.

Supplementary Material

jkae119_Supplementary_Data

Acknowledgments

The authors would like to thank Migun Shakya, Taehyung Kwon, and John Gillece for their valuable insight and comments and the whole Disease Surveillance and Molecular Epidemiology of Brucella project team members in Tanzania and Rwanda, especially Prof. Joram Buza for comments and suggestions.

Contributor Information

Earl A Middlebrook, Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA.

Robab Katani, 401 Huck Life Sciences Building, Huck Institutes of Life Sciences, Pennsylvania State University, University Park, PA 16802, USA.

Jeanne M Fair, Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA.

Data availability

Assemblies used within this manuscript are available from https://www.ncbi.nlm.nih.gov/assembly/ with accessions in Supplementary tables 2–4.

Supplemental material available at G3 online.

Funding

This research was funded by the Defense Threat Reduction Agency through Triad National Security, LLC, operator of the Los Alamos National Laboratory under Contract No. 89233218CNA000001 with the U.S. Department of Energy.

Literature cited

  1. Angiuoli SV, Salzberg SL. 2011. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 27(3):334–342. doi: 10.1093/bioinformatics/btq665.
  2. Ashford RT, Muchowski J, Koylass M, Scholz HC, Whatmore AM. 2020. Application of whole genome sequencing and pan-family multi-locus sequence analysis to characterize relationships within the family Brucellaceae. Front Microbiol. 11:1329. doi: 10.3389/fmicb.2020.01329.
  3. Baltrus DA, Dougherty K, Beckstrom-Sternberg SM, Beckstrom-Sternberg JS, Foster JT. 2014. Incongruence between multi-locus sequence analysis (MLSA) and whole-genome-based phylogenies: Pseudomonas syringae pathovar pisi as a cautionary tale. Mol Plant Pathol. 15(5):461–465. doi: 10.1111/mpp.12103.
  4. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2013. GenBank. Nucleic Acids Res. 41:D36–42. doi: 10.1093/nar/gks1195.
  5. Bertels F, Silander OK, Pachkov M, Rainey PB, van Nimwegen E. 2014. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol. 31(5):1077–1088. doi: 10.1093/molbev/msu088.
  6. Bininda-Emonds OR. 2005. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics. 6(1):156. doi: 10.1186/1471-2105-6-156.
  7. Bushnell B. 2022. BBMap. SourceForge. https://sourceforge.net/projects/bbmap/. [accessed 2023 Mar 23].
  8. Bussi Y, Kapon R, Reich Z. 2021. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One. 16(10):e0258693. doi: 10.1371/journal.pone.0258693.
  9. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25(15):1972–1973. doi: 10.1093/bioinformatics/btp348.
  10. Chung M, Munro JB, Tettelin H, Dunning Hotopp JC. 2018. Using core genome alignments to assign bacterial species. mSystems. 3(6):e00236–18. doi: 10.1128/mSystems.00236-18.
  11. Darling ACE, Mau B, Blattner FR, Perna NT. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14(7):1394–1403. doi: 10.1101/gr.2289704.
  12. Diop A, El Karkouri K, Raoult D, Fournier P-E. 2020. Genome sequence-based criteria for demarcation and definition of species in the genus Rickettsia. Int J Syst Evol Microbiol. 70(3):1738–1750. doi: 10.1099/ijsem.0.003963.
  13. Diop A, Raoult D, Fournier P-E. 2018. Rickettsial genomics and the paradigm of genome reduction associated with increased virulence. Microbes Infect. 20(7–8):401–409. doi: 10.1016/j.micinf.2017.11.009.
  14. Eddy SR. 2008. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 4(5):e1000069. doi: 10.1371/journal.pcbi.1000069.
  15. Eddy SR. 2009. A new generation of homology search tools based on probabilistic inference. Genome Inform Int Conf Genome Inform. 23(1):205–211. doi: 10.1142/9781848165632_0019.
  16. Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7(10):e1002195. doi: 10.1371/journal.pcbi.1002195.
  17. Emms DM, Kelly S. 2015. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16(1):157. doi: 10.1186/s13059-015-0721-2.
  18. Emms DM, Kelly S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20(1):238. doi: 10.1186/s13059-019-1832-y.
  19. Gardner SN, Hall BG. 2013. When whole-genome alignments just won’t work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes. PLoS One. 8(12):e81760. doi: 10.1371/journal.pone.0081760.
  20. Gardner SN, Slezak T, Hall BG. 2015. kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics. 31(17):2877–2878. doi: 10.1093/bioinformatics/btv271.
  21. Gontcharov AA, Marin B, Melkonian M. 2004. Are combined analyses better than single gene phylogenies? A case study using SSU rDNA and rbcL sequence comparisons in the Zygnematophyceae (Streptophyta). Mol Biol Evol. 21(3):612–624. doi: 10.1093/molbev/msh052.
  22. Grievink LS, Penny D, Holland BR. 2013. Missing data and influential sites: choice of sites for phylogenetic analysis can be as important as taxon sampling and model choice. Genome Biol Evol. 5(4):681–687. doi: 10.1093/gbe/evt032.
  23. Hall BG, Nisbet J. 2023. Building phylogenetic trees from genome sequences with kSNP4. Mol Biol Evol. 40(11):msad235. doi: 10.1093/molbev/msad235.
  24. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2: improving the ultrafast bootstrap approximation. Mol Biol Evol. 35(2):518–522. doi: 10.1093/molbev/msx281.
  25. Hördt A, López MG, Meier-Kolthoff JP, Schleuning M, Weinhold L-M, Tindall BJ, Gronow S, Kyrpides NC, Woyke T, Göker M. 2020. Analysis of 1,000+ type-strain genomes substantially improves taxonomic classification of Alphaproteobacteria. Front Microbiol. 11:468. doi: 10.3389/fmicb.2020.00468.
  26. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T. 2007. The human phylome. Genome Biol. 8(6):R109. doi: 10.1186/gb-2007-8-6-r109.
  27. Huerta-Cepas J, Serra F, Bork P. 2016. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 33(6):1635–1638. doi: 10.1093/molbev/msw046.
  28. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 11(1):119. doi: 10.1186/1471-2105-11-119.
  29. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput ANI analysis of 90 K prokaryotic genomes reveals clear species boundaries. Nat Commun. 9(1):5114. doi: 10.1038/s41467-018-07641-9.
  30. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 14(6):587–589. doi: 10.1038/nmeth.4285.
  31. Kapli P, Kotari I, Telford MJ, Goldman N, Yang Z. 2023. DNA sequences are as useful as protein sequences for inferring deep phylogenies. Syst Biol. 72(5):1119–1135. doi: 10.1093/sysbio/syad036.
  32. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 30(4):772–780. doi: 10.1093/molbev/mst010.
  33. Konstantinidis KT, Tiedje JM. 2007. Prokaryotic taxonomy and phylogeny in the genomic era: advancements and challenges ahead. Curr Opin Microbiol. 10(5):504–509. doi: 10.1016/j.mib.2007.08.006.
  34. Lee MD. 2019. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 35(20):4162–4164. doi: 10.1093/bioinformatics/btz188.
  35. Lozano-Fernandez J. 2022. A practical guide to design and assess a phylogenomic study. Genome Biol Evol. 14(9):evac129. doi: 10.1093/gbe/evac129.
  36. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating Maximum-likelihood phylogenies. Mol Biol Evol. 32(1):268–274. doi: 10.1093/molbev/msu300.
  37. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44(D1):D733–745. doi: 10.1093/nar/gkv1189.
  38. Orsini M, Ianni A, Zinzula L. 2022. Brucella ceti and Brucella pinnipedialis genome characterization unveils genetic features that highlight their zoonotic potential. MicrobiologyOpen. 11(5):e1329. doi: 10.1002/mbo3.1329.
  39. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25(7):1043–1055. doi: 10.1101/gr.186072.114.
  40. Portik DM, Smith LL, Bi K. 2016. An evaluation of transcriptome-based exon capture for frog phylogenomics across multiple scales of divergence (class: Amphibia, order: Anura). Mol Ecol Resour. 16(5):1069–1083. doi: 10.1111/1755-0998.12541.
  41. Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 – approximately Maximum-likelihood trees for large alignments. PLoS One. 5(3):e9490. doi: 10.1371/journal.pone.0009490.
  42. Puigbò P, Wolf YI, Koonin EV. 2019. Genome-Wide comparative analysis of phylogenetic trees: the prokaryotic forest of life. Methods Mol Biol Clifton NJ. 1910:241–269. doi: 10.1007/978-1-4939-9074-0_8.
  43. Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 4(4):406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  44. Salje J. 2021. Cells within cells: Rickettsiales and the obligate intracellular bacterial lifestyle. Nat Rev Microbiol. 19(6):375–390. doi: 10.1038/s41579-020-00507-2.
  45. Sankarasubramanian J, Vishnu US, Gunasekaran P, Rajendhran J. 2019. Development and evaluation of a core genome multilocus sequence typing (cgMLST) scheme for Brucella spp. Infect Genet Evol. 67:38–43. doi: 10.1016/j.meegid.2018.10.021.
  46. Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, Leipe D, Mcveigh R, Neill O, Robbertse K, et al. 2020. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020:baaa062. doi: 10.1093/database/baaa062.
  47. Schön ME, Martijn J, Vosseberg J, Köstlbacher S, Ettema TJG. 2022. The evolutionary origin of host association in the Rickettsiales. Nat Microbiol. 7(8):1189–1199. doi: 10.1038/s41564-022-01169-x.
  48. Shakya M, Ahmed SA, Davenport KW, Flynn MC, Lo C-C, Chain PSG. 2020. Standardized phylogenetic and molecular evolutionary analysis applied to species across the microbial tree of life. Sci Rep. 10(1):1723. doi: 10.1038/s41598-020-58356-1.
  49. Smith DR. 2013. The battle for user-friendly bioinformatics. Front Genet. 4:187. doi: 10.3389/fgene.2013.00187.
  50. Spencer M, Bryant D, Susko E. 2007. Conditioned genome reconstruction: how to avoid choosing the conditioning genome. Syst Biol. 56(1):25–43. doi: 10.1080/10635150601156313.
  51. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinforma Oxf Engl. 30(9):1312–1313. doi: 10.1093/bioinformatics/btu033.
  52. States DJ, Gish W, Altschul SF. 1991. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods. 3(1):66–70. doi: 10.1016/S1046-2023(05)80165-3.
  53. Suárez-Esquivel M, Chaves-Olarte E, Moreno E, Guzmán-Verri C. 2020. Brucella genomics: macro and micro evolution. Int J Mol Sci. 21(20):7749. doi: 10.3390/ijms21207749.
  54. Suyama M, Torrents D, Bork P. 2006. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34(Web Server):W609–W612. doi: 10.1093/nar/gkl315.
  55. Tang R, Yu Z, Li J. 2023. KINN: an alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol. 179:107662. doi: 10.1016/j.ympev.2022.107662.
  56. Treangen TJ, Ondov BD, Koren S, Phillippy AM. 2014. The harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol. 15(11):524. doi: 10.1186/s13059-014-0524-x.
  57. Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, Pati A. 2015. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43(14):6761–6771. doi: 10.1093/nar/gkv657.
  58. Vishnu US, Sankarasubramanian J, Sridhar J, Gunasekaran P, Rajendhran J. 2015. Identification of recombination and positively selected genes in Brucella. Indian J Microbiol. 55(4):384–391. doi: 10.1007/s12088-015-0545-5.
  59. Volpiano CG, Sant’Anna FH, Ambrosini A, Lisboa BB, Vargas LK, Passaglia LMP. 2019. Reclassification of Ochrobactrum lupini as a later heterotypic synonym of Ochrobactrum anthropi based on whole-genome sequence analysis. Int J Syst Evol Microbiol. 69(8):2312–2314. doi: 10.1099/ijsem.0.003465.
  60. Wernersson R, Pedersen AG. 2003. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 31(13):3537–3539. doi: 10.1093/nar/gkg609.
  61. Wiens JJ, Morrill MC. 2011. Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst Biol. 60(5):719–731. doi: 10.1093/sysbio/syr025.
  62. Yang Z. 1998. On the best evolutionary rate for phylogenetic analysis. Syst Biol. 47(1):125–133. doi: 10.1080/106351598261067.
  63. Zhang C, Rabiee M, Sayyari E, Mirarab S. 2018. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 19(S6):153. doi: 10.1186/s12859-018-2129-y.
  64. Zhang C, Scornavacca C, Molloy EK, Mirarab S. 2020. ASTRAL-Pro: quartet-based Species-tree inference despite paralogy. Mol Biol Evol. 37(11):3292–3307. doi: 10.1093/molbev/msaa139.



Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press
