Proceedings of the National Academy of Sciences of the United States of America. 2010 Jun 7;107(25):11453–11458. doi: 10.1073/pnas.1001291107

Phylogenetic incongruence arising from fragmented speciation in enteric bacteria

Adam C Retchless, Jeffrey G Lawrence
PMCID: PMC2895130  PMID: 20534528

Abstract

Evolutionary relationships among species are often assumed to be fundamentally unambiguous, where genes within a genome are thought to evolve in concert and phylogenetic incongruence between individual orthologs is attributed to idiosyncrasies in their evolution. We have identified substantial incongruence between the phylogenies of orthologous genes, both among Escherichia, Salmonella, and Citrobacter and among E. coli, E. fergusonii, and E. albertii. The source of incongruence was inferred to be recombination, because individual genes support conflicting topologies more robustly than expected from stochastic sequence homoplasies. Clustering of phylogenetically informative sites on the genome indicated that the regions of recombination extended over several kilobases. Analysis of phylogenetically distant taxa resulted in consensus among individual gene phylogenies, suggesting that recombination is not ongoing; instead, conflicting relationships among genes in descendent taxa reflect recombination among their ancestors. Incongruence could have resulted from random assortment of ancestral polymorphisms if species were instantly created from the division of a recombining population. However, the estimated branch lengths in alternative phylogenies would require ancestral populations with far more diversity than is found in extant populations. Rather, these and previous data collectively suggest that genome-wide recombination rates decreased gradually, with variation in rate among loci, leading to pluralistic relationships among their descendent taxa.

Keywords: recombination, species, Tree of Life, population structure, incomplete lineage sorting


At first glance, prokaryotes appear to have simple, well-ordered relationships resulting from asexual reproduction and divergence by mutation. However, homologous recombination between closely related strains can lead to complex, nonclonal relationships (1). Recombination has implications that are so profound that its potential within populations is often taken to be the definitive feature of species. Mayr's biological species concept (BSC) frames species in the context of reproductive barriers whereby only conspecific individuals exchange genes; individuals that fail to recombine represent different species (2). Despite its formulation for sexual eukaryotes, Dykhuizen and Green (1) proposed that the BSC could apply to bacteria; operationally, phylogenies of orthologous genes would be identical for strains of different species but demonstrably different for strains within species due to recombination. Several studies have applied these criteria to multilocus phylogenetic analysis of prokaryotes, supporting the notion that there are distinct groups of organisms experiencing recombination within each group but not between them (3).

Despite these results, the BSC may not be generally applicable to prokaryotes, even for taxa that undergo high rates of recombination. Whereas eukaryotic recombination occurs genome-wide during meiosis, recombination in prokaryotes involves only a small fragment of DNA being introduced into a cell via transformation, phage-mediated transduction, or plasmid conjugation. In both prokaryotes and eukaryotes, recombinants may be counter-selected when genomic incompatibilities reduce hybrid fitness. In eukaryotes, this inhibits gene exchange genome-wide (4), whereas in prokaryotes, recombination interference affects only that small portion of the genome that carries the incompatible DNA. Similarly, antirecombination driven by sequence divergence and the mismatch repair system causes hybrid sterility in eukaryotes such as Saccharomyces (5), whereas in prokaryotes it simply prevents the integration of the particular sequence that has diverged from the recipient (6).

As a result, barriers to recombination in bacteria could be limited to specific portions of a genome. Consider a bacterial population freely recombining at all loci. Subpopulations can develop through genetic isolation of only a few loci, driven by ecology or sequence divergence (7); such subpopulations—recombining at many loci but genetically isolated at other loci—could be numerous within a larger population. Consistent with this fragmented speciation model (8), several multilocus sequence analysis (MLSA) studies identified closely related populations that appear to be recombining at some loci but remain genetically isolated at others (9, 10). In addition, a comparison of Escherichia and Salmonella genomes revealed extensive variation in the level of sequence divergence (after correcting for position and codon selection) across regions of the chromosome, suggesting that many regions experienced homogenizing recombination as much as 70 million years after other regions had ceased recombination (11). Critically, excessively diverged regions were clustered around the loci where gene gains or losses distinguish Escherichia from Salmonella; such adaptive changes in gene inventory could have contributed to ecological differentiation within the recombining ancestral population and been the focus of selection against recombination. This suggests that interference with recombination emerged at different loci at different times, driven by adaptation. Consistent with this view, population genetic simulations examining the recombination-suppressing role of sequence divergence do not result in population splitting if given recombination parameters similar to those observed in E. coli (12, 13), but population splitting is observed when using the more restrictive parameters inferred from Salmonella (14).

If recombination barriers are imparted gradually as populations split, they may not be complete before each descendent population splits again. The stepwise acquisition of genetic isolation at different locations around the chromosome would lead to differing phylogenies of orthologous genes, resulting in the lack of clear organismal relationships. Alternatively, recombination could cease for all loci instantly when each population splits, as suggested for Yersinia pestis (15). In this instant speciation model, all recombination events occur before the acquisition of genome-wide genetic isolation. Any phylogenetic incongruence would result from the partitioning of ancestral variation among descendent lineages, which would confound our ability to discern otherwise robust organismal relationships. Here, we test these models directly.

Results

Phylogenetic Discordance in the Enterobacteriaceae Does Not Reflect Ongoing Recombination.

To identify taxa with potentially confounded relationships, we looked within the well-characterized species-rich clade of enteric bacteria. To establish a reference phylogeny, we aligned a core genome containing 1,174 orthologous open reading frames (ORFs) in each of 17 genomes (Table S1), with <15% of aligned sites having gaps in any genome. A Neighbor-Net (16) analysis of the concatenated codon alignment of these genes shows conflicting phylogenetic signals among these genomes (Fig. 1). Regions with conflicting signal may reflect the incongruent histories among genes due to recombination (17). Examining each gene independently by maximum likelihood (ML), there is near-universal support for the separation of Erwinia, Dickeya, Pectobacterium, Serratia, and Yersinia from Cronobacter and the other genomes (>99% of those alignments with a single topology in the 90% confidence limit). These taxa are used as outgroup genomes in subsequent tests.

Fig. 1.

Phylogenetic network of enteric bacteria. The Neighbor-Net (16) dendrogram was calculated by SplitsTree. The shaded region indicates the range for placement of the node separating Escherichia coli and Salmonella enterica according to relative divergence analysis (11). The Inset focuses on the divergence of Escherichia, Salmonella, and Citrobacter (arrow A).

Using this outgroup, we examined reference pairs of taxa in a quartet analysis to test the robustness of their relationship with respect to an additional taxon (Fig. 2). For high-confidence phylogenies, the additional taxon should be either a clear outgroup (supporting the reference pair as sister taxa) or a clear ingroup (rejecting the pair as sister taxa). As expected from the Neighbor-Net results, virtually all genes supported the Escherichia/Salmonella pair when it was evaluated with either Klebsiella or Cronobacter, and virtually no gene supported this pair when evaluated with E. albertii, E. fergusonii (Fig. 2C), or S. enterica arizonae (Fig. 2B). However, two regions of the Neighbor-Net phylogeny show substantial conflicting signal (Fig. 1, regions A and B). The patterns of divergence between Escherichia and Salmonella support a fragmented speciation model (11) whereby chromosomal regions became genetically isolated during an extended time frame (shaded area in Fig. 1). This range includes the nodes representing the divergence of the Citrobacter lineages, suggesting that the relationship between these three genera will be ambiguous (Fig. 1 Inset). This theme was reinforced by individual gene quartet analyses, where relationships between Escherichia and Salmonella were ambiguous—with the pair being neither widely accepted nor widely rejected—when evaluated with Citrobacter koseri (14% accepted) or C. youngae (30% accepted). A similar pattern was observed for the C. youngae/Salmonella clade (Fig. 2A).

Fig. 2.

Percentage of genes supporting clades. Each panel shows how often a reference genome was found to be the sister taxon to a second genome (defining each trendline; see legend) in a four-taxon ML phylogeny. Each phylogeny included the reference pair, a constant outgroup, and each of a series of test genomes. Results are plotted according to the distance from the reference genome to the node leading to the test genome on a neighbor-joining tree based on estimates of amino acid substitution counts (Ka). Gene counts were limited to those ORF alignments that generated substantial likelihood support for a single topology (90% confidence interval, SH test). Quartet composition is listed to the right of each chart; x represents the test genome, which is identified on the distance axis. Cko, Citrobacter koseri; Csp, C. sp. 30_2; Cyo, C. youngae; Csa, Cronobacter sakazakii; Eal, E. albertii; Eco, E. coli MG1655; Efe, E. fergusonii; Esp, Enterobacter sp. 638; Kpn, K. pneumoniae; Saz, S. enterica arizonae; Sen, S. enterica LT2; UTI, E. coli UTI89.

In contrast to the divergence of Escherichia, Salmonella, and Citrobacter, which have been separated for tens of millions of years, the three species of Escherichia are likely in the final throes of genetic isolation. MLSA results (18) are consistent with very low levels of recombination between otherwise distinct groups of Escherichia with divergence comparable to E. coli and E. fergusonii. The vast majority of genes in our analysis also support the monophyly of E. coli K12 with E. coli UTI89 (Fig. 2C) as well as the monophyly of Escherichia relative to other genera; however, the relationships between the three Escherichia species remain unclear. The E. coli/E. albertii clade was supported by 53% of genes and the alternative E. coli/E. fergusonii clade by 44% of genes. Thus, these taxa represent the genesis of the phylogenetic ambiguity that plagues the relationships of Escherichia, Salmonella, and Citrobacter.

These gene-based quartets provided no evidence for recent recombination between species of different genera, indicating that (for the genomes analyzed here) any substantial gene flow was limited to the time periods before the genera diversified into extant species. Yet with both radiations, the remaining phylogenetic incongruence may be interpreted several ways. The conflicting signal may simply represent noise, especially when inferring the relationships between the more distantly related Escherichia, Salmonella, and Citrobacter. Alternatively, conflicting phylogenetic signal may reflect maintenance of ancestral polymorphism, whereby incomplete lineage sorting leads to ambiguous phylogenetic relationships even when genetic isolation occurs instantly for all genes (19). Lastly, the fragmented speciation model predicts conflicting phylogenetic signal due to stepwise acquisition of barriers to recombination.

Robust Alternative Relationships Among Bacterial Genera.

One expects the phylogenetic signal to be weakest for very short branches, so one may posit that the inference of conflicting topologies described above simply reflects the lack of support for the “true” organismal phylogeny. To test this hypothesis, we determined whether the support for alternative topologies is stronger than expected. Likelihood analyses were performed on alignments of sequences from 14 genomes representing the maximal available diversity among Escherichia, Salmonella, Citrobacter, and Klebsiella, while maintaining the monophyly of each group (6, 4, 2, and 2 genomes, respectively; Table S1). The use of multiple taxa for each clade increased the signal-to-noise ratio. Although 2,028 potentially orthologous ORFs were present in each of the 14 genomes, we removed 705 unreliably aligned ORFs (>5% of their multiple sequence alignment contained gaps), 14 potentially paralogous ORFs for which syntenic neighboring ORFs could not be reliably identified, and 165 ORFs for which the monophyly of each of the four genera was not confidently supported by Bayesian analysis. For the 1,144 remaining ORFs, the three possible relationships of the four genera were evaluated by codon-based maximum likelihood, holding the relationships within each genus fixed as defined by Bayesian analysis (Dataset S1).

A substantial number of alignments supported each of the three topologies (Fig. 3 A and B) and several lines of evidence ruled out stochastic and systematic errors as the basis for these incongruent results. Among those alignments generating strong bootstrap support for a topology (Fig. 3A), where bootstrap support thresholds provide conservative estimates of accuracy (20, 21), no topology had the level of support expected if it were the true topology for all genes (dashed line). Furthermore, an excess of alignments rejected each topology with high confidence, indicating strong phylogenetic information at the gene level despite the widespread incongruence between genes (Fig. 3B; P < 0.01 for all categories where individual genes are rejected at P ≤ 0.25; binomial test using threshold gene P values as the expected probabilities). Unlike what would be expected for an unambiguous organismal phylogeny, large fractions of alignments reject the Escherichia/Citrobacter clade, the Escherichia/Salmonella clade, and, for high-confidence alignments (P < 0.25), the Citrobacter/Salmonella clade.
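The logic of the binomial comparison can be sketched as follows. If one topology were true for every gene, a test with threshold P value p should reject it in roughly a fraction p of the genes, so the number of rejections can be compared with a Binomial(n, p) null. This is a minimal sketch, not the authors' code: the function names are illustrative, and the threshold and observed count below are hypothetical placeholders rather than values taken from Fig. 3B.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def binomial_upper_tail(k, n, p):
    """P(X >= k): chance of at least k rejections if each gene
    rejects the topology independently with probability p."""
    return sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))

# 1,144 is the number of alignments analyzed in the text.
# The threshold and the observed count are HYPOTHETICAL placeholders.
n_genes = 1144
threshold = 0.05
observed_rejections = 120

expected_rejections = n_genes * threshold
p_value = binomial_upper_tail(observed_rejections, n_genes, threshold)
print(f"expected {expected_rejections:.1f}, "
      f"observed {observed_rejections}, P = {p_value:.3g}")
```

The probability mass is computed in log space because the binomial coefficients for n = 1,144 are too large to convert to floating point directly.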

Fig. 3.

Phylogenetic discordance at all confidence levels. Measures of confidence on ML tests for individual genes do not match expectations for a genome-wide topology, either for the relationship among Citrobacter, Escherichia, and Salmonella (A and B) or E. albertii, E. coli, and E. fergusonii (C and D). Bootstrap support is expected to correspond to accuracy (A and C), and SH test P values provide expected frequencies of topology rejection (B and D). Support for (A and C) or rejection of (B and D) alternative clades is indicated by trendlines. Within-species quartets are shown in Fig. S1.

The proportions of alignments supporting each topology were robust to subsampling guided by statistical confidence and to a variety of other subsampling techniques that would purge different varieties of stochastic and systematic errors. Support for a single topology did not arise when we removed potentially mismatched orthologs, alignments with gaps or few informative sites, or any alignment generating inconsistent phylogenetic results using a codon-position model, other outgroups, or fewer taxa (Table S2).

Robust Alternative Relationships Among the Escherichia.

The same set of 1,144 alignments was tested according to a codon-position maximum-likelihood model applied to each of the three topologies involving E. albertii, E. coli, and E. fergusonii, with an outgroup comprising two genomes each of Salmonella and Klebsiella. As with the original tests (Fig. 2C), roughly equal portions of alignments supported the clustering of E. coli with either E. albertii or E. fergusonii, whereas E. albertii and E. fergusonii rarely clustered together (Fig. 3 C and D). None of the topologies had support from a sufficient number of genes to justify the hypothesis that it is the single true topology and the others are artifactual (Fig. 3C), whereas each topology was rejected more often than expected by chance (Fig. 3D). Support for a single topology did not arise when we removed potentially mismatched orthologs or alignments with gaps, few informative sites, or that did not produce identical results when Klebsiella, Citrobacter, or Salmonella were used as single outgroups (Table S3).

Interestingly, alignments supporting the E. albertii/E. fergusonii clade are rare (Fig. 3 C and D), illustrating the complexity of the isolation process. E. fergusonii and E. albertii may have arisen from small populations that rarely encountered each other, but continued to recombine with E. coli. Alternatively, ecological differentiation may be greater between E. fergusonii and E. albertii than between either of them and E. coli, suppressing recombination between them more. The former explanation is consistent with elevated substitution rates in the lineages leading to E. fergusonii and E. albertii relative to E. coli observed previously (18), but is not exclusive of the latter explanation. Also of note, whereas the E. albertii/E. coli clade is favored by those genes with strongest bootstrap support, a greater number of genes support the E. fergusonii/E. coli clade (Fig. 3 C and D). This could reflect the idiosyncratic nature of the fragmented speciation process, suggesting either that some loci in the E. fergusonii genome have not recombined with E. albertii/E. coli for a long time, or that E. albertii/E. coli continued to recombine at these loci until relatively recently.

As a final test of the robustness of the incongruence in both quartet analyses, we attempted to identify hidden likelihood support (22) for a congruent topology by concatenating those alignments supporting each of the three topologies tested, then repeating the maximum-likelihood analysis. In each case, we found unambiguous support for the topology that the genes had individually supported (no alternate topology within a 99% confidence interval). To guard against the analysis being dominated by a few genes with the strongest support for the given topology, we repeated the analysis using only those genes that had 50–60% bootstrap support for the given topology. Again, we recovered unambiguous support for each topology except the E. albertii/E. fergusonii clade, which produced 64% bootstrap support for itself, but could not reject the E. fergusonii/E. coli clade.

Clustering of Phylogenetic Signal in the Chromosome.

The above results suggest that no single phylogeny is appropriate to describe the relationship between Escherichia, Salmonella, and Citrobacter, or between the three species of Escherichia. If recombination between nascent species were responsible for the phylogenetic incongruence, then the phylogenetically informative sites should be clustered in their respective genomes according to the topology that they support. To evaluate clustering, we concatenated single-gene, high-quality alignments of genes with reliable identification of neighboring ORFs. Supporting sites in the alignment were defined as those for which one topology is more parsimonious than the other two. Parsimony criteria were applied to nucleotide, amino acid, and synonymous codon alignments, and two analyses were performed to measure the clustering of sites supporting each topology within the 1,309 ORFs (the 2,028 candidate orthologs less the 705 unreliably aligned and 14 potentially paralogous ORFs).
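For intuition, the definition of a supporting site can be sketched for the simplest case of one sequence per taxon. With four taxa, one topology is strictly more parsimonious than the other two only when the column shows exactly two states, each shared by two taxa. The analysis in the text used multiple genomes per group, so this is a simplification; the function names and the toy sequences are hypothetical.

```python
def supported_topology(column):
    """Classify one alignment column for taxa ordered (A, B, C, D).

    Returns 1 for AB|CD, 2 for AC|BD, 3 for AD|BC, or None when all
    three topologies need the same number of changes (or a gap occurs).
    """
    a, b, c, d = column
    if "-" in column:
        return None
    if a == b and c == d and a != c:
        return 1
    if a == c and b == d and a != b:
        return 2
    if a == d and b == c and a != b:
        return 3
    return None

def classify_sites(seqs):
    """Return (position, topology) for every supporting site in four
    aligned, equal-length sequences."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        topology = supported_topology(column)
        if topology is not None:
            sites.append((pos, topology))
    return sites

# Toy alignment (hypothetical sequences), taxa in order A, B, C, D.
toy = ["AACGT", "AAGGA", "CAGCT", "CACCA"]
print(classify_sites(toy))
```

The resulting ordered list of (position, topology) pairs is the kind of input that a test of clustering along the chromosome would consume.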

A runs test for randomness in the order of supporting sites indicated highly significant clustering of supporting sites by topology (P ≪ 10^−10; Table 1). That is, for both the Escherichia/Salmonella/Citrobacter and E. albertii/E. coli/E. fergusonii clades, sites supporting each of the conflicting topologies were clustered in these genomes. To test whether sites supporting each topology were clustered, we repeated the runs test by omitting each topology in turn; significant clustering was still observed (Table S4; P < 10^−5). To investigate the scale of clustering as a function of distance between supporting sites, pairs of sites were binned according to distance, calculating the frequency that both sites within the binned distance range supported the same topology (Fig. 4). For all analyses, there was a clear enrichment of sites supporting the same topology over the frequency expected from genome averages. This is most noticeable for the analysis of Escherichia species, where phylogenetic signal is expected to be the strongest; clustering is apparent both within and between genes (Fig. 4 A–C). Clustering is detectable in the Escherichia/Citrobacter/Salmonella analysis (Fig. 4D), although less apparent due to the accumulation of noise.

Table 1.

Occurrence of sites supporting the same topology

                     Sites supporting topology     Runs of sites
Quartet  Characters  1        2        3           Exp      SD    Obs      Z
ESC      Nuc         19,539   19,886   18,829      38,827   114   35,684   27.6
ESC      Pro          2,382    2,879    2,442       5,117    41    4,600   12.5
ESC      Syn         12,229   11,666   11,621      23,672    89   22,844    9.3
EEE      Nuc          9,687   19,298   18,950      30,718   102   24,327   62.6
EEE      Pro            471    3,074    1,361       2,558    29    1,660   31.0
EEE      Syn          6,950   11,009   13,658      20,357    83   17,159   38.7

Topologies: EEE1, Eal,Efe; EEE2, Eal,Eco; EEE3, Eco,Efe; ESC1, Cit,Esc; ESC2, Cit,Sal; ESC3, Esc,Sal; Nuc, nucleotide; Pro, protein; Syn, synonymous codons. Runs test of randomness was performed on the sequence of sites supporting each topology. We observed (Obs) fewer runs than expected (Exp) given the number of sites supporting each topology, indicating that sites supporting a given topology are clustered. Z-scores [(Obs − Exp)/SD], evaluated with a one-tailed test, reject the null hypothesis of random ordering at P ≪ 10^−10 for all tests, including both quartets analyzed and all three character types evaluated. Additional tests are shown in Table S4.
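The runs statistics in Table 1 can be illustrated with a short sketch. The text does not state which formulas were used; the code below assumes the standard moments of the number of runs for a random ordering of k categories, which reproduce the Exp and SD values of the ESC Nuc row. The function names are illustrative. The computed Z is negative because fewer runs were observed than expected, whereas Table 1 lists magnitudes.

```python
from itertools import groupby
from math import sqrt

def count_runs(labels):
    """Number of maximal stretches of identical consecutive labels."""
    return sum(1 for _ in groupby(labels))

def runs_test(counts, observed_runs):
    """Expected runs, SD, and Z for a random ordering of categories.

    counts: number of sites supporting each topology.
    Assumes the standard k-category runs-test moments:
      E[R]   = (N(N+1) - S2) / N
      Var[R] = (S2(S2 + N(N+1)) - 2*N*S3 - N^3) / (N^2 (N-1))
    with N = sum(n_i), S2 = sum(n_i^2), S3 = sum(n_i^3).
    """
    n = sum(counts)
    s2 = sum(c ** 2 for c in counts)
    s3 = sum(c ** 3 for c in counts)
    expected = (n * (n + 1) - s2) / n
    variance = (s2 * (s2 + n * (n + 1)) - 2 * n * s3 - n ** 3) / (n * n * (n - 1))
    sd = sqrt(variance)
    z = (observed_runs - expected) / sd
    return expected, sd, z

# Site counts and observed runs from the ESC Nuc row of Table 1.
expected, sd, z = runs_test([19539, 19886, 18829], 35684)
print(f"Exp = {expected:.0f}, SD = {sd:.0f}, Z = {z:.1f}")
```

In practice, `count_runs` would be applied to the chromosome-ordered list of topology labels to obtain the observed value.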

Fig. 4.

Chromosomal clustering of parsimony informative sites supporting each topology. Each site supporting a distinct topology was compared against each other site and the observation binned according to the distance between sites. Trendlines report proportions of observations where a site supporting a given topology was paired with a site supporting the same topology. Expected values are derived from the genome-wide proportion of sites supporting that topology. Data are plotted at the midpoint of the bin range. Within-species quartets are shown in Fig. S2.
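The pairwise comparison summarized in this figure can be sketched as below. Each supporting site is paired with every other site, the pair is assigned to a distance bin, and for each topology the code reports how often the partner site supports the same topology. The expected value is the genome-wide proportion of sites supporting that topology. The function names, bin edges, and toy data are hypothetical, and the quadratic loop is meant for illustration rather than genome-scale input.

```python
from collections import Counter, defaultdict

def concordance_by_distance(sites, bin_edges):
    """sites: list of (position, topology).
    bin_edges: ascending distances defining half-open bins [lo, hi).
    Returns {(topology, lo, hi): fraction of partner sites in that
    distance bin that support the same topology}."""
    bins = list(zip(bin_edges, bin_edges[1:]))
    same = defaultdict(int)
    total = defaultdict(int)
    for i, (pos_i, topo_i) in enumerate(sites):
        for j, (pos_j, topo_j) in enumerate(sites):
            if i == j:
                continue
            dist = abs(pos_i - pos_j)
            for lo, hi in bins:
                if lo <= dist < hi:
                    key = (topo_i, lo, hi)
                    total[key] += 1
                    same[key] += int(topo_j == topo_i)
                    break
    return {key: same[key] / total[key] for key in total}

def expected_proportions(sites):
    """Genome-wide proportion of supporting sites for each topology."""
    counts = Counter(topology for _, topology in sites)
    return {t: c / len(sites) for t, c in counts.items()}

# Toy data (hypothetical): two tight clusters plus one distant site.
toy_sites = [(0, 1), (5, 1), (8, 1), (100, 2), (104, 2), (210, 1)]
observed = concordance_by_distance(toy_sites, [1, 50, 500])
expected = expected_proportions(toy_sites)
print(observed, expected)
```

Clustering appears as observed fractions above the expected proportion in the short-distance bins, decaying toward the expected value as distance grows.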

Phylogenetic Incongruence Does Not Reflect Incomplete Lineage Sorting.

The clustering of informative sites is evidence that recombination produced different phylogenies for different regions of these genomes, thus eliminating the possibility of recovering an unambiguous organismal phylogeny from the sequence data. Yet this evidence is not sufficient to determine whether recombination occurred at some loci subsequent to genetic isolation at other loci (fragmented speciation), or an ancestral population split into two descendent populations, one of which split again before its ancestral polymorphisms had been resolved (incomplete lineage sorting). By the latter model, one topology would reflect the history of population splitting (the species topology), and the other two would represent the diversity of the original population (19). To evaluate this “instant-speciation” model empirically, we compared the observed diversity of the extant populations to the inferred diversity of the ancestral population that could have generated the incongruent phylogenies (Fig. S3). Putative ancestral diversity was measured as the length of the innermost branch connecting all four genera on a maximum-likelihood tree (internal branch). Extant diversity (terminal branch) was measured as half of the branch length separating genes from two E. coli strains. To maximize our estimate of extant diversity, we began with a collection of four maximally divergent E. coli genomes (13), selecting two (UMN026/UTI89) where the terminal branch length exceeds the internal branch length most often. As reported above, widespread support for the monophyly of E. coli indicates that E. albertii and E. fergusonii do not recombine freely with E. coli; therefore they were not included in measures of extant diversity.

If incomplete lineage sorting were to produce the observed phylogenetic ambiguity between Escherichia, Salmonella, and Citrobacter, then the terminal and internal branch lengths on maximum-likelihood trees with nonspecies topologies would be comparable because both would represent within-species diversity. Only trees with the “true” species topology could have accumulated extra divergence along the internal branch. However, we observed that the terminal branch was generally shorter than the internal branch. This was true for each topology, in 93.0% (346/372), 94.8% (292/308), and 95.7% (443/463) of the trees that clustered Escherichia with Salmonella, Escherichia with Citrobacter, or Citrobacter with Salmonella, respectively (Fig. S3A). When using S. enterica enterica strains, extant diversity was smaller than the ancestral variation in at least 98% of trees for each comparison (S. enterica arizonae was an outgroup in >97.5% of alignments where no alternate topology is within the 90% confidence limit).

To determine how often the internal branch would be longer than the terminal branch if genetic isolation were imposed simultaneously for all genes, we simulated the evolution of these taxa according to the best ML tree. This guide tree was modified so that the internal branch length was equal to the average terminal branch length between strains UMN026 and UTI89 (Fig. S3B). We repeated the quartet analysis using 100 simulations of the set of 1,144 genes (Fig. S3C). The internal branch on an ML tree of the simulated data was longer than the average terminal branch for 60 ± 1% of the genes (maximum value 63.9%), well below the 93.0–95.7% observed for the real data. Comparable values were found when the genes supporting each topology were analyzed separately (Table S5; maximum value 66.8% for the 308 genes supporting the Escherichia/Citrobacter clade). Therefore, our data indicate that measured ancestral diversity for the Escherichia/Salmonella/Citrobacter split far exceeds diversity found in extant species of E. coli and S. enterica. Because all three topologies show the excess internal-branch divergence that incomplete lineage sorting permits only for the single true species topology, we reject incomplete lineage sorting as the mechanism leading to phylogenetic incongruence. Similar results were found using the internal branch of the E. albertii/E. coli/E. fergusonii split when the genes supporting each of the two dominant topologies were analyzed (Table S5); the rare gene alignments that supported an E. albertii/E. fergusonii clade produced branch lengths within the expected distribution, which is consistent with the lack of prolonged recombination between genes in these lineages, as was suggested above.

Discussion

Questioning the Tree of Life.

Significant phylogenetic incongruence was observed between bacterial taxa (Fig. 1). This incongruence could have reflected noise, recent recombination between otherwise genetically isolated populations, or the random assortment of ancestral diversity following instant acquisition of genetic isolation. Above, we provided evidence rejecting these alternatives; therefore, another model, such as the stepwise acquisition of genetic isolation (fragmented speciation), must be invoked to explain the data. In accordance with this model, previous data for E. coli and S. enterica show that genetic isolation occurred at different times for different genes, driven by adaptive change (11). The fragmented speciation model suggests that organismal phylogenies cannot be deduced from gene phylogenies because genes have different evolutionary histories. Given the vast diversity of prokaryotes (23), groups of ambiguously related taxa produced by rapid evolutionary radiations may be common.

The number and density of such problematic relationships can only increase as more microbial diversity is characterized. In the most extreme interpretation, this would invalidate the Tree of Life hypothesis, which is founded on the idea that extant taxa have unambiguous relationships (24). Phylogenetic trees of organisms serve as frameworks for interpreting evolutionary change; characteristics of ancestral taxa are inferred by coalescence and serve as platforms for interpreting changes in descendent taxa. Yet our data suggest that such ancestral taxa may not have existed, and inferences that require them, such as any utilization of parsimony, would fail. For example, if one accepts an organismal phylogeny that places Escherichia as an outgroup to the Citrobacter/Salmonella clade, then any feature in common between E. coli and Salmonella would be interpreted as a parallel gain or a loss from Citrobacter (Fig. S4). Alternatively, this feature may have been shared by the two taxa throughout the fragmented speciation process, because a distinct taxon ancestral to the Citrobacter/Salmonella clade need not exist (17). Given these complications, the Tree of Life, in demanding a strict bifurcating relationship among descendent taxa, cannot form the basis for rigorous examination of bacterial diversity.

Questioning Bacterial Species Concepts.

The patterns of incongruence that we identified suggest that populations of potentially recombinogenic bacteria are neither freely recombining nor genetically isolated at all loci as required by the BSC. The implication is that any apparently freely recombining population actually comprises many partially genetically isolated subpopulations. As a result, Mayrian species boundaries cannot be defined rigorously by gene flow because extant species will include numerous ecological protospecies that are in partial genetic isolation, leading to ambiguity in the relationships among derived taxa as shown above. Moreover, ecotypes are not a good basis for species identification, as closely related ecotypes can still experience substantial gene flow, causing the evolution of one ecotype to influence the trajectory of another as in Neisseria (9) or Campylobacter (25). Given the complexity of bacterial gene exchange, we are unlikely to identify any rules for identifying the threshold beyond which two populations are destined to follow separate paths. Historical evidence for recombination does not necessitate ongoing potential for recombination.

Thus, species concepts may not apply to bacteria (26), even if phenotypically distinct groups of related bacteria are readily identifiable. Such concepts connect patterns of phenotypic diversity in groups of organisms to the evolutionary forces acting upon those organisms’ constituent genes. Forces that lead to cohesion within sexual eukaryotic populations act upon all genes in concert; as a result, the history of such organisms is reflected in the collective history of their genes. Their sexual systems simplify species conceptualization by producing mating barriers that affect entire genomes at once (4). In contrast, evolutionary forces do not act on all bacterial genes in unison; recombination may be successful at some loci, but be counter-selected at others. The evolutionary independence of bacterial genes afforded by position-specific gene exchange generates incongruence among gene trees. Therefore, a species concept attributing the unambiguous species delineation to the action of a particular evolutionary process may be unattainable in bacteria.

One response would be for taxonomy to embrace the pluralistic nature of bacterial taxa, placing strains into more than one species (27), or abandoning species names altogether for a less hierarchical approach (28). Yet one could argue that bacterial species names carry the greatest practical impact, placing organisms into defined groups that are used for agriculture, biotechnology, epidemiology, public health, disease diagnosis, and bioterrorism. Indeed, the public policy impact of such ambiguity and fluidity in the characterization of bacteria may preclude the widespread adoption of such a classification system. Barring this approach, then, what is left is the necessary use of practical definitions in the absence of a feasible species concept. Such definitions would encompass collections of bacteria that are phenotypically similar by criteria that are subjectively important to the classifiers, leading to both narrowly defined (e.g., Bacillus anthracis) and broadly defined (e.g., E. coli) groups. The ease with which many medically relevant taxa can be classified suggests that this can be an effective approach. Although such groupings lack the elegance and satisfaction of those driven by biological processes, the absence of a strong theoretical underpinning to their delineation does not detract from their utility.

Materials and Methods

Genomes.

The sequences of Citrobacter sp. 30_2, C. koseri, C. youngae, Cronobacter sakazakii, Dickeya zeae, Klebsiella pneumoniae strains 342 and MGH 78578, Enterobacter sp. 638, Erwinia tasmaniensis, Escherichia coli MG1655, UMN026, UTI89, and IAI39, E. fergusonii, E. albertii, Pectobacterium wasabiae, Salmonella enterica enterica LT2, CT18, and CVM19633, S. enterica arizonae, Serratia proteamaculans, and Yersinia enterocolitica were downloaded from the National Center for Biotechnology Information. Accession numbers appear in Table S1.

Ortholog Identification.

Annotated ORFs were translated and used as BLASTP queries to search databases composed of ORFs from each of the other genomes (e < 1) followed by semiglobal alignment. Sets of putative orthologs were assembled from those ORFs where each was a reciprocal best match with the others. Analyses of 17 enteric genomes (Figs. 1 and 2) used alignments with >65% similarity; analyses of 14 genomes (Figs. 3 and 4; Table 1) used alignments with >70% sequence similarity (Dataset S1). Multiple sequence alignments (MSA) were produced with ClustalW and back-translated to codon alignments. Orthology was accepted only where at least five syntenic genes were identified.
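The assembly of ortholog sets from reciprocal best matches can be illustrated with a minimal sketch. It assumes that the best BLASTP match of every ORF in every other genome has already been tabulated; the function name, the `orfs_by_genome` and `best_hit` structures, and the toy identifiers are hypothetical and are not taken from the original pipeline. Similarity thresholds and the synteny requirement are not modeled.

```python
from itertools import combinations

def reciprocal_best_sets(orfs_by_genome, best_hit):
    """Return ortholog sets in which every ORF is the best match of every other.

    orfs_by_genome: dict mapping genome name -> list of ORF identifiers.
    best_hit: dict mapping (query ORF, target genome) -> best-matching ORF
              in that genome (assumed to be precomputed from BLASTP output).
    """
    genomes = list(orfs_by_genome)
    seed_genome, others = genomes[0], genomes[1:]
    ortholog_sets = []
    for seed in orfs_by_genome[seed_genome]:
        # Candidate member from each other genome: the seed's best hit there.
        members = {seed_genome: seed}
        for genome in others:
            hit = best_hit.get((seed, genome))
            if hit is None:
                break  # no match in this genome, so no complete set
            members[genome] = hit
        else:
            # Keep the set only if all pairs are reciprocal best matches.
            if all(
                best_hit.get((members[a], b)) == members[b]
                and best_hit.get((members[b], a)) == members[a]
                for a, b in combinations(genomes, 2)
            ):
                ortholog_sets.append(members)
    return ortholog_sets

# Toy data: a1/b1/c1 are mutually best matches; a2 and b2 both point to c1,
# but c1 points back to a1 and b1, so the second candidate set is rejected.
orfs = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
best = {
    ("a1", "B"): "b1", ("a1", "C"): "c1",
    ("b1", "A"): "a1", ("b1", "C"): "c1",
    ("c1", "A"): "a1", ("c1", "B"): "b1",
    ("a2", "B"): "b2", ("a2", "C"): "c1",
    ("b2", "A"): "a2", ("b2", "C"): "c1",
}
sets_found = reciprocal_best_sets(orfs, best)
```

Requiring reciprocity for every pair, rather than only against a reference genome, discards families in which any one genome contributes an ambiguous match.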

Quartet Analyses.

For each analysis, we evaluated the relationships among four groups of genomes for each ortholog. Alignments for each gene were subjected to maximum-likelihood analysis for each of the three possible topologies, with the relationships within each group specified if the group comprised more than two genomes. The root in Fig. 2 was specified according to the dominant topology in the Neighbor-Net tree (Fig. 1), and the relationships among Salmonella and Escherichia strains in Fig. 3 A and B were specified using MrBayes (29).
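Four groups can be joined in only three unrooted ways, each defined by which group is paired with the first. The sketch below enumerates them as Newick strings; the function name and the group labels are illustrative assumptions, and a group containing several genomes could be passed as a prespecified Newick subtree in place of a single label.

```python
def quartet_topologies(groups):
    """Return the three unrooted topologies for four groups as Newick strings.

    groups: sequence of four labels (or Newick subtrees) for the groups.
    """
    if len(groups) != 4:
        raise ValueError("a quartet needs exactly four groups")
    a, b, c, d = groups
    # Each topology is defined by which group is paired with the first one.
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return [f"(({p},{q}),({r},{s}));" for (p, q), (r, s) in splits]

# Hypothetical group labels, used only for illustration.
topologies = quartet_topologies(
    ["Escherichia", "Salmonella", "Citrobacter", "Klebsiella"]
)
```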

Maximum Likelihood on ORFs.

The topologies were evaluated by the PAML package (30), using each MSA in turn, generating resampling estimated log-likelihood (RELL) bootstrap support and Shimodaira–Hasegawa (SH) test P values (31). Our simulations (below) support bootstrap thresholds as a conservative estimate of accuracy. Codon-based ML used a single Ω parameter and the Miyata geometric amino acid substitution probabilities. Codon-position ML constructed an HKY85 nucleotide-substitution model for each codon position. This model is computationally efficient relative to the codon model, and has been shown to have similar accuracy for both closely and distantly related sequences (32).
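The idea behind RELL bootstrap support can be sketched as follows, assuming that per-site log-likelihoods under each fixed topology are already available from the ML analysis. Sites are resampled with replacement and the per-site values are summed without re-estimating parameters; the support for a topology is the fraction of replicates in which it has the highest total. The function name, the default replicate count, and the toy values are assumptions for illustration, and PAML's own implementation may differ in detail.

```python
import random

def rell_support(site_lnl, replicates=1000, seed=0):
    """Estimate RELL bootstrap support for each candidate topology.

    site_lnl: one list per topology of per-site log-likelihoods, all of
              equal length (assumed to come from a prior ML analysis).
    Returns the fraction of replicates won by each topology.
    """
    n_sites = len(site_lnl[0])
    if any(len(row) != n_sites for row in site_lnl):
        raise ValueError("all topologies need the same number of sites")
    rng = random.Random(seed)
    wins = [0] * len(site_lnl)
    for _ in range(replicates):
        # Resample site indices with replacement; reuse the fixed per-site
        # log-likelihoods instead of re-optimizing the model.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(row[i] for i in sample) for row in site_lnl]
        # Ties are credited to the first topology with the maximum total.
        wins[totals.index(max(totals))] += 1
    return [w / replicates for w in wins]

# Toy input: the first topology fits better at every site, so it should
# win every replicate.
toy_support = rell_support([[-1.0] * 10, [-1.2] * 10, [-1.5] * 10])
```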

Simulation of Instant Speciation.

Simulated sequences were generated by the Evolver program in PAML. Test trees were generated by a codon-position nucleotide model. Using actual sequences, this produced results comparable to the full codon model, but was much less computationally intensive. All parameters were based upon the actual MSA being tested. The input tree was identical to the ML tree generated from the actual MSA, except that the innermost branch was set to be the same length as other branches in the tree that were being compared with the innermost branch. Sequence length was set to be identical to the number of sites aligned across all sequences of the MSA. Codon proportions were identical to the frequencies observed across all sequences in the MSA, and the κ and Ω parameters were set according to the parameters estimated by the YN00 program, with pairwise values averaged together by first averaging all pairs of genomes between any two groups within the quartets, and then averaging the six pairwise values for the four groups within the quartet analysis.
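The two-stage averaging of pairwise parameter estimates can be made concrete with a short sketch. It assumes the pairwise values (for example, κ from YN00) are held in a dictionary keyed by genome pair; the function name, data layout, and numbers are hypothetical and do not come from the original analysis.

```python
from itertools import combinations, product
from statistics import mean

def quartet_average(groups, pairwise):
    """Average pairwise estimates within group pairs, then across group pairs.

    groups: dict mapping each of the four group names -> list of genomes.
    pairwise: dict mapping frozenset({genome1, genome2}) -> estimated value.
    """
    group_pair_means = []
    # Four groups give six group pairs.
    for first, second in combinations(groups, 2):
        values = [
            pairwise[frozenset((x, y))]
            for x, y in product(groups[first], groups[second])
        ]
        group_pair_means.append(mean(values))
    return mean(group_pair_means)

# Toy values: group G1 holds two genomes, the other groups one each.
toy_groups = {"G1": ["a", "b"], "G2": ["c"], "G3": ["d"], "G4": ["e"]}
toy_pairwise = {
    frozenset(("a", "c")): 2.0, frozenset(("b", "c")): 4.0,
    frozenset(("a", "d")): 1.0, frozenset(("b", "d")): 3.0,
    frozenset(("a", "e")): 2.0, frozenset(("b", "e")): 2.0,
    frozenset(("c", "d")): 1.0, frozenset(("c", "e")): 2.0,
    frozenset(("d", "e")): 2.0,
}
toy_value = quartet_average(toy_groups, toy_pairwise)
```

Averaging within each group pair first gives every group pair equal weight, so a group represented by many genomes does not dominate the final value as it would in a flat mean over all genome pairs.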

Acknowledgments

This work was supported by National Institutes of Health Grant GM078092.

Footnotes

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1001291107/-/DCSupplemental.

References

  • 1. Dykhuizen DE, Green L. Recombination in Escherichia coli and the definition of biological species. J Bacteriol. 1991;173:7257–7268. doi: 10.1128/jb.173.22.7257-7268.1991.
  • 2. Mayr E. Systematics and the Origin of Species. New York: Columbia Univ Press; 1942.
  • 3. Wertz JE, Goldstone C, Gordon DM, Riley MA. A molecular phylogeny of enteric bacteria and implications for a bacterial species concept. J Evol Biol. 2003;16:1236–1248. doi: 10.1046/j.1420-9101.2003.00612.x.
  • 4. Rieseberg LH, Wood TE, Baack EJ. The nature of plant species. Nature. 2006;440:524–527. doi: 10.1038/nature04402.
  • 5. Greig D. Reproductive isolation in Saccharomyces. Heredity. 2009;102:39–44. doi: 10.1038/hdy.2008.73.
  • 6. Shen P, Huang HV. Homologous recombination in Escherichia coli: Dependence on substrate length and homology. Genetics. 1986;112:441–457. doi: 10.1093/genetics/112.3.441.
  • 7. Ray JL, et al. Sexual isolation in Acinetobacter baylyi is locus-specific and varies 10,000-fold over the genome. Genetics. 2009;182:1165–1181. doi: 10.1534/genetics.109.103127.
  • 8. Lawrence JG. Gene transfer in bacteria: Speciation without species? Theor Popul Biol. 2002;61:449–460. doi: 10.1006/tpbi.2002.1587.
  • 9. Spratt BG, Bowler LD, Zhang QY, Zhou J, Smith JM. Role of interspecies transfer of chromosomal genes in the evolution of penicillin resistance in pathogenic and commensal Neisseria species. J Mol Evol. 1992;34:115–125. doi: 10.1007/BF00182388.
  • 10. Hanage WP, Fraser C, Spratt BG. Fuzzy species among recombinogenic bacteria. BMC Biol. 2005;3:6. doi: 10.1186/1741-7007-3-6.
  • 11. Retchless AC, Lawrence JG. Temporal fragmentation of speciation in bacteria. Science. 2007;317:1093–1096. doi: 10.1126/science.1144876.
  • 12. Fraser C, Hanage WP, Spratt BG. Recombination and the nature of bacterial speciation. Science. 2007;315:476–480. doi: 10.1126/science.1127573.
  • 13. Touchon M, et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet. 2009;5:e1000344. doi: 10.1371/journal.pgen.1000344.
  • 14. Falush D, et al. Mismatch induced speciation in Salmonella: Model and data. Philos Trans R Soc Lond B Biol Sci. 2006;361:2045–2053. doi: 10.1098/rstb.2006.1925.
  • 15. Dykhuizen DE. Yersinia pestis: An instant species? Trends Microbiol. 2000;8:296–298. doi: 10.1016/s0966-842x(00)01783-2.
  • 16. Bryant D, Moulton V. Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. 2004;21:255–265. doi: 10.1093/molbev/msh018.
  • 17. Lawrence JG, Retchless AC. The myth of bacterial species and speciation. Biol Philos. 2010 in press.
  • 18. Walk ST, et al. Cryptic lineages of the genus Escherichia. Appl Environ Microbiol. 2009;75:6534–6544. doi: 10.1128/AEM.01262-09.
  • 19. Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5:568–583. doi: 10.1093/oxfordjournals.molbev.a040517.
  • 20. Taylor DJ, Piel WH. An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol Biol Evol. 2004;21:1534–1537. doi: 10.1093/molbev/msh156.
  • 21. Hillis DM, Bull JJ. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol. 1993;42:182–192.
  • 22. Gatesy J, Baker RH. Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst Biol. 2005;54:483–492. doi: 10.1080/10635150590945368.
  • 23. Dykhuizen DE. Santa Rosalia revisited: Why are there so many species of bacteria? Antonie Van Leeuwenhoek. 1998;73:25–33. doi: 10.1023/a:1000665216662.
  • 24. Doolittle WF, Bapteste E. Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci USA. 2007;104:2043–2049. doi: 10.1073/pnas.0610699104.
  • 25. Sheppard SK, McCarthy ND, Falush D, Maiden MC. Convergence of Campylobacter species: Implications for bacterial evolution. Science. 2008;320:237–239. doi: 10.1126/science.1155532.
  • 26. Doolittle WF, Zhaxybayeva O. On the origin of prokaryotic species. Genome Res. 2009;19:744–756. doi: 10.1101/gr.086645.108.
  • 27. Bapteste E, Boucher Y. Epistemological impacts of horizontal gene transfer on classification in microbiology. Methods Mol Biol. 2009;532:55–72. doi: 10.1007/978-1-60327-853-9_4.
  • 28. Lawrence JG, Hatfull GF, Hendrix RW. Imbroglios of viral taxonomy: Genetic exchange and failings of phenetic approaches. J Bacteriol. 2002;184:4891–4905. doi: 10.1128/JB.184.17.4891-4905.2002.
  • 29. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. doi: 10.1093/bioinformatics/btg180.
  • 30. Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088.
  • 31. Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999;16:1114–1116.
  • 32. Ren F, Tanaka H, Yang Z. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol. 2005;54:808–818. doi: 10.1080/10635150500354688.
