Abstract
In 2009, five monophyletic Escherichia clades were described and referred to as “cryptic” based on the inability to distinguish them from representative E. coli isolates using diagnostic biochemical reactions. Since this original publication, a number of studies have explored the genomic, transcriptomic, and phenotypic diversity of cryptic clade isolates to better understand their phylogenetic, physiological, and ecological distinctiveness with respect to previously named Escherichia species. This chapter reviews the original discovery of the cryptic clades, discusses available evidence that some are environmentally adapted, and evaluates current support for taxonomic designations of these microorganisms. The importance of these clades to clinical research, epidemiology, population genetics, and microbial speciation is also discussed.
INTRODUCTION
Many non-coli species of the genus Escherichia have been proposed, but only two, E. albertii and E. fergusonii, are supported by substantial phylogenetic inference. Some historically named species include E. adecarboxylata (1) (1962), E. blattae (2) (1973), E. hermannii (3), and E. vulneris (4) (both in 1982). E. adecarboxylata and E. blattae have since been reclassified as Leclercia adecarboxylata (5) and Shimwellia blattae (6), respectively. E. hermannii and E. vulneris were originally given their generic status due to metabolic similarity to E. coli and not necessarily due to DNA-DNA similarity based on hybridization (3, 4). They remain nominally in the genus despite some phylogenetic evidence that Salmonella spp. share a more recent common ancestor with the type species of the genus, E. coli, than they do (7). Thus, by 2005, only E. coli, E. fergusonii (8) (1985), and E. albertii (9) (2003) were considered true monophyletic species of the Escherichia genus.
It was in this context that E. coli population genetic work in the laboratory of Thomas S. Whittam, then at Michigan State University, was being conducted. This article reviews the original discovery of five previously unnamed Escherichia clades and why they were originally called “cryptic.” At least 65 studies have referenced these clades, and a detailed review of this literature is presented with respect to clinical and epidemiologic research, population genetics, and microbial speciation. Finally, a taxonomic evaluation of the clades is presented, as their ambiguity has quickly faded and so the hypothesis that they are indeed novel species is warranted.
DISCOVERY OF “CRYPTIC” CLADES
Nearly all of what is known about Escherichia comes from characterization of E. coli, and nearly all of what is known about non-coli Escherichia comes from evaluation and characterization of clinical isolates from human and animal disease. It has been appreciated for some time that E. coli, at least, can be found in the gastrointestinal tract (GIT) of warm-blooded animals as well as water, sediment, and soil (10). Michael Savageau referred to the GIT as the “primary” habitat and water, sediment, and soil as the “secondary” habitat, pointing out that both were complex and not well understood. Unfortunately, it seems that the use of “primary” may have been misinterpreted as “the most important,” when the actual use of the word was likely just a placeholder to differentiate it from its other habitat. The reality is that E. coli and other members of the Escherichia genus are quite ubiquitous in the environment, and a better understanding of their ecology and evolution in ex vivo habitats is needed to evaluate their biodiversity and their biology in general.
The “cryptic” clades were discovered while conducting a population genetic analysis of E. coli from freshwater beaches (11). Multilocus sequence typing (MLST) was being used to define and understand the ecological relationship between certain genotypes and the wave-washed zone of public beaches in Michigan, some of which are frequently used as recreation areas. The isolates were originally recovered from selective agar plates (mTEC) according to U.S. Environmental Protection Agency methods for enumerating fecal indicator bacteria from recreational water (12, 13). A total of 205 isolates were selected as presumptive E. coli, and subsequent biochemical testing (API 20E test strips; bioMérieux) revealed that 196 were E. coli according to the sensu stricto definition, while 9 were unequivocally other Enterobacteriaceae species. MLST was initiated on the group of 196, and it became clear with preliminary sequence data that one additional isolate was a non-Escherichia bacterium. MLST was completed on the remaining 195 isolates, which uniformly clustered together with 5 exceptions. These 5 isolates fell outside of typical E. coli diversity on long branches, somewhere between E. coli and Salmonella enterica subsp. enterica.
An initial thought was that these divergent isolates were likely E. albertii because E. fergusonii would have been identified by the API testing. Luckily, E. albertii MLST sequences were available for comparison as part of another project in the lab (14) and E. fergusonii sequences were available online. Comparison to these sequences revealed that the five divergent isolates were neither E. albertii nor E. fergusonii, but appeared to be as divergent (i.e., long branch lengths) as these two species. During completion of the beach sand project, an E. coli MLST study was published by the Achtman lab in which two divergent isolates belonging to the same sequence type (ST) were identified (15). This ST was clearly divergent from typical E. coli and did not cluster with E. albertii (E. fergusonii was not included in the analysis). The authors concluded that this divergent ST represented some ancestral diversity of the E. coli species and that contemporary E. coli diversity was the result of drastic population contractions and bottlenecks. It was clear at this point that additional isolates were needed to more completely understand the biology and taxonomic placement of these isolates.
Diverse and “wild” E. coli isolates were obtained from collaborators in Michigan (Jeffrey L. Ram, Wayne State University), Puerto Rico (Gary A. Toranzos, University of Puerto Rico), and Australia (David M. Gordon, Australian National University). Extended MLST using internal fragments of 22 housekeeping loci (11,161 bp) was conducted, including MLST loci used in both the Whittam and Achtman labs so that a direct comparison could be made between these databases (16). A dendrogram based on these data revealed not one, but five novel clades of Escherichia, each on long branches with high bootstrap support (Fig. 1). The five novel clades differed from the other named Escherichia species by hundreds of parsimoniously informative nucleotide sites (Table 1). Given the numbers of E. coli isolates previously characterized by multilocus enzyme electrophoresis (17) and by MLST in labs around the world for more than 3 decades, we thought it remarkable that these clades remained ambiguous, or undiscovered, for so long.
Table 1.
| | E. coli | Clade I | Clade II | Clade III | Clade IV | Clade V | E. fergusonii |
|---|---|---|---|---|---|---|---|
| Clade I | 925 (566) | | | | | | |
| Clade II | 978 (367) | 896 (165) | | | | | |
| Clade III | 1,105 (806) | 934 (737) | 770 (190) | | | | |
| Clade IV | 940 (715) | 891 (641) | 770 (106) | 517 (452) | | | |
| Clade V | 1,199 (990) | 1,036 (900) | 934 (250) | 932 (803) | 923 (803) | | |
| E. fergusonii | 1,148 (900) | 957 (694) | 946 (118) | 1,013 (931) | 1,014 (869) | 1,204 (1,706) | |
| E. albertii | 1,295 (1,053) | 1,148 (988) | 1,072 (272) | 1,105 (989) | 1,082 (948) | 1,192 (1,071) | 1,267 (1,123) |
The numbers of parsimoniously informative sites are given in parentheses. The total number of nucleotide sites in the MLST dataset was 11,161.
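For readers unfamiliar with the term, a site is conventionally called parsimoniously informative when it shows at least two nucleotide states and at least two of those states each occur in two or more sequences. The short Python sketch below counts variable and parsimoniously informative sites under that standard definition. The function name and the toy alignment are illustrative assumptions; they are not the MLST data or the software used in the original study.

```python
from collections import Counter


def site_counts(alignment):
    """Return (variable_sites, parsimony_informative_sites) for aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    variable = 0
    informative = 0
    for column in zip(*alignment):
        # Gaps and ambiguous bases are ignored when tallying character states.
        states = Counter(base for base in column if base in "ACGT")
        if len(states) > 1:
            variable += 1
        # Informative: at least two states that are each seen in two or more sequences.
        if sum(1 for n in states.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACTTACGA",
    "ACTTGCGA",
]
print(site_counts(toy_alignment))
```

In the toy alignment, three columns vary but only one of them splits the sequences two against two, so only that column can favor one tree topology over another.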
Part of the ambiguity can be explained by the lack of characterized isolates from ex vivo (i.e., outside the host) habitats. For example, the breadth of E. coli genetic diversity was initially described for a standard reference collection (E. coli reference [ECOR] collection) of isolates “from a variety of hosts and geographic locations” (18). Isolates in the ECOR collection were selected from thousands of archived isolates based on previously published work and to represent human and animal hosts in different locations around the world. Isolates were also selected to represent allelic diversity at 11 enzyme-coding loci based on multilocus enzyme electrophoresis. The resulting 72 ECOR isolates represented and, in large part, still encompass nearly all of the known genetic diversity of typical E. coli even in the nucleotide sequencing and comparative genomics era (19). The ECOR collection was included in the analysis of the Achtman MLST collection mentioned above (15), along with ∼390 additional isolates from humans and 41 species of domesticated, captive, and wild mammals, birds, and reptiles. So, a total of zero (n = 0) ex vivo isolates were included in the derivation of the ECOR collection or in the published Achtman MLST analysis (isolates representing the divergent ST identified by the Achtman group were from a dog and a parrot). Therefore, the frequency at which these divergent clades occur in hosts appears to be very low (<1 in ∼10,000).
Another explanation for why these divergent Escherichia isolates were only recently discovered is that they are difficult to distinguish from E. coli using standard biochemical reactions. Along with extended MLST analysis, we generated utilization data for 31 biochemical substrates and compared the biochemical profiles of the divergent clades to those of 692 typical E. coli isolates using nonmetric multidimensional scaling (16). Many of these isolates were indistinguishable from E. coli, which led to the term “cryptic.” The same was not true for E. fergusonii and E. albertii isolates that had statistically different biochemical utilization profiles. Why these cryptic bacteria utilize a similar repertoire of substrates compared with E. coli has not been resolved.
CLADE I PATHOGENS
Three isolates from cryptic clade I that were included in the original publication (16) (Fig. 2) were isolated from children with diarrhea in Guinea-Bissau, and all tested positive for the classic enterotoxigenic E. coli (ETEC) virulence factor, heat-labile toxin (LT) (20). These isolates belong to the ETEC clonal group 12, as described by Steinsland et al. (20), and a total of 34 clonal group 12 isolates have been reported. Clonal group 12 isolates were also recovered from sick children in India, and most (27 of 34; 79%) carried both LT and the porcine heat-stable toxin (STp). These data suggest that clade I ETECs are not necessarily rare in sick children and support the identification of clade I isolates as potential human pathogens. However, LT and STp are plasmid encoded, and so their presence is likely to be variable among isolates. Indeed, PCR screening of an expanded collection of clade I isolates showed that most (68%) did not carry LT (21). More data are needed to determine the frequency of clade I isolates among human ETEC cases, but given the repeated observation of clonal group 12 in the above ETEC study, it is possible that at least a subset of clade I lineages emerged recently in humans because of their ability to cause disease.
In addition to being an enterotoxigenic form of Escherichia, it is possible that clade I isolates can cause human gastroenteritis as a result of Shiga toxin production. In 2003, a presumed E. coli isolate, 7V, was found to carry a Shiga toxin 2 allelic variant, named stx2g (22). The strain was originally isolated from a healthy cow in China and was recently sequenced as part of a study examining atypical Shiga toxin-producing E. coli (STEC) (23). Analysis of nucleotide sequences at MLST loci in this genome revealed that this isolate is highly similar (only two nonsynonymous polymorphisms across 3,714 amino acids) to other Escherichia clade I members (Fig. 2). In their genomic comparison (23), Steyert et al. found that the 7V genome contained more unique sequences than any other STEC in the analysis and that these unique sequences contained putative virulence factors and hypothetical proteins. Isolate 7V did not carry the locus of enterocyte effacement (LEE) or the gene encoding the attachment molecule intimin (eae), which are two hallmark virulence factors of typical enterohemorrhagic E. coli. However, other non-LEE, non-intimin STECs are currently of clinical importance to humans and may cause hemolytic-uremic syndrome (24). Escherichia-like bacteria that carry the stx2g Shiga toxin appear to be rare in cattle (0.7% in one study [22] and 3% in another [25]), but other than the 7V isolate, no information is available as to whether stx2g-carrying isolates belong to clade I or E. coli sensu stricto. More data are needed from non-LEE, non-intimin STEC to better characterize the genomic determinants of clinical disease beyond Shiga toxin.
Clade I isolates were also recovered from patients with bacteremia in France (26). It is not entirely clear what clinical diagnosis was associated with all patients, but the isolates were collected from “extraintestinal” sites of symptomatic patients, which would qualify them as extraintestinal pathogenic E. coli. In addition to clade I isolates, two clade V and one clade III isolates were observed among a total of 1,081 isolates from a large, multicenter study of septicemic patients in France (26, 27). These were the first and only reports of non-clade I isolates from cases of human disease.
LOW VIRULENCE POTENTIAL IN CLADES II TO V
For cryptic clades II to V, at least four lines of evidence support the hypothesis that these bacteria are rarely opportunistic or specialized human pathogens. First, clades III and V were only associated with human disease in a single study (see above) and clades II and IV have never been reported from cases of human and/or animal infection. Second, PCR screening for genes encoding virulence factors involved in human intestinal and extraintestinal E. coli infection revealed that most of these factors were absent in representative isolates (21, 28). A few notable exceptions include genes for the outer membrane protease, ompT; an invasion enhancement gene, ibeA; the capsule polysaccharide export inner-membrane protein, kpsE; the aerotaxis receptor, aer; and arginine N-succinyltransferase, astA (21, 28). In general, clade I and V isolates were found to carry more of these genes compared with the other clades. Third, representative clade isolates did not cause mortality in a murine sepsis model (21). None of the mice (n = 10 mice per isolate) challenged with 10⁹ CFU/ml given subcutaneously in the abdomen developed disease out to 7 days postinjection, indicating that these isolates were avirulent in this context. Fourth and finally, testing for resistance to 7 antibiotics (nalidixic acid, chloramphenicol, kanamycin, streptomycin, tetracycline, amoxicillin, and sulfamethoxazole) revealed a higher level of resistance in clade I isolates (at least one isolate resistant to 5 of 7 drugs) compared with clade III (1 of 7 drugs) and clade V (2 of 7 drugs), and no resistance was observed among clade IV isolates (21). These data indicate that antibiotic selection pressure is low in clades II to V, presumably because they do not commonly circulate in human populations, or at least those that access modern health care.
It should be noted, however, that under most circumstances, cryptic clade isolates may be mistaken for E. coli during routine clinical testing. As a result, there are few data with which to assess the abundance and distribution of cryptic Escherichia among cases of human or animal disease. When E. coli is suspected, detailed isolate characterizations are necessary to quantify the true clinical impact of the cryptic clades.
THE ENVIRONMENTAL HYPOTHESIS
Archived isolate collections are useful to help narrow the possible habitats in which bacterial genotypes circulate. However, “samples of convenience” as opposed to consecutively drawn samples may or may not provide accurate estimates of genotypic frequency. Also, some sort of phylogenetic characterization (MLST [15, 16] or allele-specific PCR [26]) is necessary to identify cryptic clade isolates. As a result, there are few sampling studies of the appropriate design or the appropriate discriminatory power to resolve accurate genotypic frequencies for cryptic Escherichia clades; thus, the habitat(s) in which they achieve maximum genotypic frequencies remains somewhat debatable. Based on the relatively high frequency of cryptic clade isolates in the original freshwater beach study (5 of 196; 2.5%) (11, 16) and their near absence in MLST datasets from host-associated isolates, we hypothesized that the cryptic clades are more abundant in and better adapted to habitats outside the host compared with inside the host. This “environmental” hypothesis is clearly not appropriate for clade I, as these isolates were detected in host-associated databases upon further review (16) and have been associated with a variety of human diseases (see above).
Results from other studies support an overabundance of cryptic clade isolates in water, soil, and/or aquatic sediment samples. For example, cryptic clade isolates comprised 8.1% (8 of 99), 8.5% (14 of 164), and 14.5% (20 of 138) of isolates from this type of environmental sample in Australia (26), France (29), and Italy (28), respectively. There is also evidence that cryptic clades circulate at a high frequency in birds and nonhuman mammals: 28.2% (11 of 39) and 7.8% (6 of 77) of bird isolates in France and Australia, respectively, as well as 8.2% (23 of 279) and 3.2% (4 of 124) of nonhuman mammal isolates in France and Australia, respectively (26). These data are compelling, but it is unclear whether the frequency of cryptic clade isolates in the environment can explain their frequency in wild animals, or vice versa. To help add clarity to these possibilities, my lab has been assembling a collection of Escherichia isolates from ducks and a pond that is visited daily (study in progress). More than 500 isolates from the pond water (n = 14) and duck fecal samples (n = 11) have been characterized using the phylogrouping (30) and cryptic clade (26) PCRs of Clermont et al. At least 10 isolates were characterized from each pond water sample, and at least 33 isolates were characterized from each duck fecal sample. A total of 5 cryptic clade isolates (4 clade III and 1 clade V) have been identified so far, and all were isolated from pond water, despite the more-intensive sampling of duck feces. These preliminary results from consecutive sampling suggest, at least in the case of this duck pond, that water is a source of the cryptic clades and ducks are a spillover host. Ultimately, more data are needed to differentiate between source populations of cryptic clade isolates and their spillover hosts.
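The frequencies quoted above follow directly from the reported counts, and the "more than 500" figure for the duck pond collection is consistent with the stated per-sample minimums. The Python sketch below simply recomputes both from the numbers given in the text; the dictionary labels and variable names are illustrative, and no data beyond those in the paragraph are assumed.

```python
# (cryptic clade isolates, total isolates) as reported in the paragraph above
reported = {
    "environmental, Australia": (8, 99),
    "environmental, France": (14, 164),
    "environmental, Italy": (20, 138),
    "birds, France": (11, 39),
    "birds, Australia": (6, 77),
    "nonhuman mammals, France": (23, 279),
    "nonhuman mammals, Australia": (4, 124),
}


def percent(count, total):
    """Frequency as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


frequencies = {source: percent(*counts) for source, counts in reported.items()}
for source, value in frequencies.items():
    print(f"{source}: {value}%")

# Lower bound on the duck pond collection: 14 water samples with at least
# 10 isolates each and 11 fecal samples with at least 33 isolates each.
min_water_isolates = 14 * 10
min_fecal_isolates = 11 * 33
min_total_isolates = min_water_isolates + min_fecal_isolates
print(min_water_isolates, min_fecal_isolates, min_total_isolates)
```

The lower bound also makes the sampling imbalance explicit: at least 363 isolates came from duck feces versus at least 140 from pond water, yet all 5 cryptic clade isolates came from water.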
Comparative Genomics Support for the Environmental Hypothesis
Comparative genomics of representative strains (31) supported the original hypothesis that cryptic clades are “environmentally adapted.” Genes that were commonly shared among representative genomes of clades III, IV, and V (i.e., environmental), but not E. coli or clade I (i.e., enteric), appeared to provide functions that enhance survival in the environment, such as diol utilization and lysozyme production (31). Conversely, genes that were enriched among enteric genomes encoded functions thought to be important in the mammalian GIT, such as the transport and use of N-acetylglucosamine, gluconate, and 5-C and 6-C sugars (e.g., fucose). The environmental genomes also carried about 34% more genes of unknown or uncharacterized function compared with the enteric genomes (39 versus 29), even though nearly twice as many enteric genomes were included in the analysis (13 enteric versus 7 environmental). These data support the genomic and perhaps functional differentiation of clades III, IV, and V from E. coli and clade I (no clade II genome has been sequenced or compared to date). One may ask whether these results could be due to differences in the annotation of orthologous loci. For example, diol (propanediol) utilization may not be unique to the cryptic Escherichia clades, as orthologs from E. coli, E. fergusonii, and S. enterica are members of the same clusters of orthologous groups (COG) pathway (COG4869) in the evolutionary genealogy of genes: nonsupervised orthologous groups database (eggNOG version 3.0) (32). Similarly, genes involved in lysozyme production (COG3757) are found in cryptic Escherichia clade, E. coli, and E. fergusonii genomes, according to the same database (32). It appears that functional differences between cryptic Escherichia and other Escherichia species based on comparative genomics may require additional investigation using in vitro assays.
The comparative genomics study above was expanded using DNA microarray analysis of a larger collection (n = 27) of strains (33). This study identified 98 genes that were differentially shared between environmental (GIII) and enteric (GII) groups, as defined by genome content. Microarray analysis identified the potential loss, in environmental strains, of adherence factors that are thought to be important for bacterial adhesion to mucosal tissue in the GIT. Other notable functions enriched in GII but not GIII were oxidative stress defense, efflux pumps, adhesins, and maltose catabolism. More data are needed to determine whether these functions can be used to more accurately define host adaptation among Escherichia strains.
Comparative Transcriptomics Support for the Environmental Hypothesis
The environmental hypothesis was again supported by transcriptomic comparisons between two E. coli and two cryptic clade strains (one from clade IV and one from clade V) (34). Global gene expression profiling was conducted under batch culture, constant growth (chemostat), and starvation conditions. When each culture condition was considered separately, the environmental strains clustered together and the enterics clustered together, regardless of whether all genes, only genes shared among the four representative strains, or the Escherichia core genes as defined by the comparative genomics study above (31) were considered. Ecologically distinct patterns of gene expression were noted under starvation conditions for stress response (more genes at higher levels in environmentals) and carbon substrate catabolism/central metabolism (maintenance of expression by enterics). These results are in agreement with the comparative genomics study and provide additional evidence that cryptic Escherichia strains are better adapted to life outside the host.
Microbial Physiology Support for the Environmental Hypothesis
Some physiologic traits also support the environmental hypothesis. Multiple representative isolates of Escherichia (E. coli, E. fergusonii, E. albertii, and clades III to V) were used to quantify differences in maximum growth rate, optimal growth temperature, low-temperature limit to replication, survival in the absence of nutrients, and biofilm formation (21). No differences were observed in maximum growth rate or optimal growth temperature for any of the Escherichia isolates, suggesting that these traits are not under strong diversifying selection. In contrast, all of the cryptic clade isolates were able to replicate at 2°C, whereas none of the E. coli, E. fergusonii, or E. albertii isolates replicated at this temperature. Survival in the absence of nutrients was also significantly different among the Escherichia isolates, but the cryptic clades ranked second behind E. coli in mean survival. Finally, biofilm formation was measured under four different conditions comprising two temperatures (37 and 24°C) and two types of media (minimal glucose media and nutrient broth), and cryptic clade isolates formed more substantial biofilms overall compared with the other Escherichia isolates (significantly more substantial biofilms compared with E. coli and E. albertii under all four conditions and compared with E. fergusonii under three of the four conditions). The ability to replicate at lower temperatures and form more substantial biofilms likely enhances cryptic clade survival in environments outside the host.
Interestingly, survival of cryptic clades in the absence of nutrients was not superior to that of E. coli in the above study. This may seem at odds with the environmental hypothesis, but the transcriptomic results discussed above suggest that cryptic clades and E. coli have evolved different strategies for responding to starvation conditions. For example, E. coli maintained expression of nutrient-scavenging genes, which likely led to longer survival under starvation conditions. Conversely, the cryptic clades shut off these genes more rapidly during the transition to nutrient-limited environments and upregulated stress response genes. These results predict that the cryptic clades will have marginal survival in the face of any one stress, but will be more likely to survive multiple stressors simultaneously compared with E. coli. It seems logical to predict that environmental conditions have more simultaneous threats to bacterial survival compared with the GIT of warm-blooded hosts, but more data are needed to test the influence of additive (individual versus multiple) stressors on Escherichia survival.
ARE THE CRYPTIC CLADES ECOLOGICALLY DISTINCT?
The cryptic clades were named as such because they could not be differentiated from E. coli by standard biochemical testing. Upon characterization of additional isolates, Clermont and colleagues found that most E. coli isolates utilized both lysine and ornithine, but clades III and IV tended to be deficient in lysine utilization and clades II and V tended to be deficient in ornithine utilization (26). More data are needed from additional isolates to verify these biochemical results (i.e., relatively few isolates have been characterized for these phenotypes to date), but at least some biochemical differentiation may exist between these lineages.
The combination of phylogenetic (16), genomic (31), and transcriptomic (34) results discussed above provides strong support that at least some cryptic clades are ecologically distinct from other named Escherichia species. Two ecological partitioning tools based on (i) patterns of sequence diversity (“ecotype simulation” [35]) and (ii) differences in habitat of isolation (AdaptML [36]) supported an ecological distinction between clades III to V and other Escherichia isolates (37). It is equally clear that cryptic clade I remains biochemically (26), phylogenetically (16, 26), and ecologically (37) similar to E. coli sensu stricto, and that these two lineages recombine at a higher rate compared with recombination between E. coli and the other clades (31). Given also their ability to cause human infection (20) and carry hallmark E. coli virulence factors (21, 26), it seems somewhat reasonable to consider clade I as a divergent E. coli.
SUBSPECIES DESIGNATION
An important consideration before designating a taxonomic status for the cryptic clades is the placement of Enterobacteriaceae subspecies. Subspecies designations have been used to subdivide S. enterica (38, 39), Cronobacter dublinensis (40), Klebsiella pneumoniae (41), and others. It is currently difficult to understand how Escherichia lineages compare to these designations because there is little consensus in the field as to the appropriate definition of a subspecies and few relevant nucleotide sequence datasets are available for comparison between genera and species in which this designation has been made. For example, the Salmonella genus currently contains two species, S. enterica and S. bongori (39), and only a single multilocus (n = 4 housekeeping loci) nucleotide sequence-based phylogeny is available for all lineages (42). This dataset supports a previous recommendation (43) of seven S. enterica subspecies but remains one of the only available datasets for comparing Salmonella at the nucleotide level. (This author was unable to find representative genome sequences for S. enterica subsp. indica.)
Genetic distances can be used to estimate the evolutionary divergence between lineages in the above Salmonella MLST dataset to understand how divergent Salmonella subspecies are compared to full-fledged species. A variety of distance estimators can be used to conduct such an analysis based on different assumptions about evolutionary change in bacterial populations. A popular distance, the Kimura 2-parameter distance, assumes that the frequencies of all four nucleotides at each site are the same and that the rates of substitution do not vary among these sites (44). This estimation also takes into account differences in the likelihood of the two classes of substitution (transitions and transversions) (44). The Kimura 2-parameter distance separating S. enterica subspecies using the 4-locus MLST dataset above ranges between 0.016 (between subspecies I and II) and 0.049 (between subspecies IIIa and VII). Using the same distance based on the extended (22 loci) MLST dataset for Escherichia (16), the only distances that fall within this range are 0.033 (between clade III and clade IV) and 0.034 (between E. coli and clade I). Thus, using Salmonella as a model for subspeciation, the available data suggest that clades III and IV are subspecies belonging to the same novel species and that clade I is a subspecies of E. coli.
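To make the distance concrete, the standard Kimura two-parameter formula is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively. The Python sketch below applies this formula to a pair of aligned sequences and checks the two Escherichia distances quoted above against the Salmonella subspecies range. The function names and the toy sequences are illustrative assumptions; the analysis in the text was not necessarily carried out this way.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def within_subspecies_range(distance, low=0.016, high=0.049):
    """True if a distance falls inside the Salmonella subspecies range from the text."""
    return low <= distance <= high


# Toy example: 100 sites with 2 transitions (A->G) and 1 transversion (A->C).
toy_a = "A" * 100
toy_b = "GGC" + "A" * 97
print(round(k2p_distance(toy_a, toy_b), 4))

# Distances reported in the text for clade III vs IV and E. coli vs clade I.
print(within_subspecies_range(0.033), within_subspecies_range(0.034))
```

Because transitions and transversions are corrected separately, the estimated distance for the toy pair is slightly larger than the raw proportion of differing sites (0.03).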
Unfortunately, two of the genes used to construct the Salmonella phylogeny above, phoP and gapA, are not found in genome sequence data of cryptic clade representatives (31) using the BLASTn algorithm, leaving only data for mdh and recA for a head-to-head comparison of all Escherichia and Salmonella members. A dendrogram based on these two loci for representative Salmonella and Escherichia lineages with K. pneumoniae subsp. pneumoniae as the outgroup is shown in Fig. 3. The phylogenetic relationships in this dendrogram, especially the deep nodes, should be considered with caution because only two loci were used and so the amount of shared genetic diversity among these bacteria has been undersampled. Indeed, at least some of the phylogenetic relationships are skewed, as E. fergusonii appears to be the most divergent Escherichia lineage using data at just these two loci. However, enough phylogenetic signal is contained in these two loci to differentiate all the known Salmonella and Escherichia lineages, with the single exception of E. coli and clade I (i.e., they are monophyletic at these two loci), and so this tree can be used to preliminarily evaluate subspecies designation.
The nodes and branch lengths in Fig. 3 represent the combined (concatenated) histories of the mdh and recA loci in Salmonella and Escherichia. How similar this history is to other shared loci remains to be determined (hopefully when sequencing efforts on the S. enterica subspecies are completed). Figure 3 is called a “time tree,” as generated using the MEGA6 program (45), which means that relative estimates of time have been given for each bifurcating node. The time estimates are in complete agreement with the distance-based estimates (Kimura 2-parameter) of MLST data above. In other words, S. enterica subspecies began radiating between 0.009 and 0.022, and the only two Escherichia nodes within these relative time points correspond to the split between clade I and E. coli (0.011; node not shown) and the split between clades III and IV (0.013). Therefore, both distance estimates of MLST data (n = 4 loci for Salmonella and n = 22 for Escherichia) and head-to-head comparison of relative divergence times based on two shared loci (mdh and recA) support the hypothesis that clade I is an E. coli subspecies and clades III and IV are subspecies of a novel Escherichia species.
GENOME-WIDE TAXONOMIC DESIGNATION
The gold standard for designating bacterial species is DNA-DNA hybridization (DDH), in which the percentage of DNA complementarity between isolates is quantified at increasing temperatures. A general consensus has been reached that members of a bacterial species tend to show ≥70% DNA-DNA hybridization. Few laboratories routinely conduct DDH assays, however, which limits the utility of DDH as a standard. Recently, a digital DNA-DNA hybridization technique (dDDH) was developed that estimates the percentage of DDH from partial and completed genome sequences (46). Meier-Kolthoff and colleagues subsequently extended dDDH to predict a cutoff of 79 to 80% for subspecies in E. coli (47). Unlike the previous DDH species designations, however, the more recent subspecies designations were based solely on in silico analyses, and no significant phenotypic or ecological differences were noted between the predicted E. coli subspecies. Whether these in silico groups represent physiologically or ecologically distinct bacteria remains to be shown, but whether this is an important distinction for subspecies status remains equally open to debate. It should be noted that the 79 to 80% dDDH-based subspecies definition does not seem appropriate for the already named S. enterica subspecies, as all but one of the dDDH values between four representative genomes (arizonae, salamae, enterica, and houtenae) are well below this cutoff (Table 2).
Table 2.
| dDDH (%) | arizonae | salamae | enterica |
|---|---|---|---|
| salamae | 51.5 | | |
| enterica | 51.3 | 88.7 | |
| houtenae | 51.9 | 60.3 | 59.9 |
Genome identification numbers can be found in Supplemental Table 1.
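The comparison of the Table 2 values against the dDDH cutoffs can be sketched in a few lines of Python. The pairwise values are copied from Table 2; the 70% species and 79% subspecies thresholds come from the text, but reading 79% as the lower bound for "same subspecies" (and 70 to 79% as "same species, different subspecies") is an interpretation assumed here, and the names `DDDH`, `classify`, and the cutoff constants are illustrative.

```python
# Pairwise dDDH values (%) between S. enterica subspecies, from Table 2
DDDH = {
    ("arizonae", "salamae"): 51.5,
    ("arizonae", "enterica"): 51.3,
    ("salamae", "enterica"): 88.7,
    ("arizonae", "houtenae"): 51.9,
    ("salamae", "houtenae"): 60.3,
    ("enterica", "houtenae"): 59.9,
}

SPECIES_CUTOFF = 70.0     # conventional DDH species threshold
SUBSPECIES_CUTOFF = 79.0  # lower end of the proposed 79 to 80% range (assumed)


def classify(dddh):
    """Map one dDDH value to the designation the cutoffs would predict."""
    if dddh < SPECIES_CUTOFF:
        return "different species"
    if dddh < SUBSPECIES_CUTOFF:
        return "same species, different subspecies"
    return "same subspecies"


for (a, b), value in sorted(DDDH.items(), key=lambda item: item[1]):
    print(f"{a:>9} vs {b:<9} {value:5.1f}  {classify(value)}")

below = [pair for pair, value in DDDH.items() if value < SUBSPECIES_CUTOFF]
print(f"{len(below)} of {len(DDDH)} pairs fall below the subspecies cutoff")
```

Under these assumed cutoffs, five of the six pairs of already named subspecies would be called different species, which is the mismatch noted in the text.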
Traditional DDH correlates well with the average nucleotide identity (ANI) between bacterial genomes, and the DDH species cutoff of 70% corresponds well with 95% ANI (48). A comparison of dDDH and ANI among available cryptic clade genomes (clade I, n = 2; clade III, n = 2; clade IV, n = 3; and clade V, n = 2) and the already named Escherichia species (E. coli, n = 4; E. fergusonii, n = 2; and E. albertii, n = 2) supports the overall correlation between ANI and dDDH (Fig. 4). Both metrics support a species-level designation for clade V and for the combination of clades III and IV as a species. However, ANI and dDDH support different taxonomic designations for clade I: ANI supports clade I as a divergent E. coli, whereas dDDH supports clade I as a new species (Fig. 4). ANI is therefore in complete agreement with the distance-based MLST and time tree analyses above, in that clade I is a subspecies of E. coli and clades III and IV are subspecies of a novel species. More data are needed to understand whether clade I violates the actual 70% DDH cutoff as predicted by dDDH.
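The way the two genome-wide metrics can agree or conflict for a given pair of genomes can be made explicit with a small Python sketch. The 95% ANI and 70% DDH cutoffs are those cited in the text (48); the function name `compare_species_calls` and the example input values are hypothetical and are not measurements from Fig. 4.

```python
ANI_SPECIES_CUTOFF = 95.0   # ANI (%) corresponding to the DDH species cutoff
DDH_SPECIES_CUTOFF = 70.0   # DDH or dDDH (%) species cutoff


def compare_species_calls(ani, dddh):
    """Report whether ANI and dDDH give the same species-level call."""
    same_by_ani = ani >= ANI_SPECIES_CUTOFF
    same_by_dddh = dddh >= DDH_SPECIES_CUTOFF
    if same_by_ani and same_by_dddh:
        return "same species (metrics agree)"
    if not same_by_ani and not same_by_dddh:
        return "different species (metrics agree)"
    return "conflict: ANI and dDDH disagree"


# Hypothetical values chosen only to illustrate the three outcomes
print(compare_species_calls(ani=97.0, dddh=80.0))
print(compare_species_calls(ani=91.0, dddh=45.0))
print(compare_species_calls(ani=95.4, dddh=66.0))
```

A pair that lands in the third category, as described for clade I versus E. coli, is the case where additional data such as wet-lab DDH would be needed to settle the designation.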
In summary, the cryptic Escherichia clades appear to belong to at least two novel species, of which clades III and IV are subspecies of the same novel species and clade V is a second novel species. Clade I currently appears to be a novel subspecies of E. coli. More (genome) data are needed from representative clade II isolates before a taxonomic designation can be made. A manuscript is being prepared for consideration of these results as well as proposed names for the cryptic clades (S. T. Walk, unpublished data).
CONCLUSION
The discovery of phylogenetically distinct Escherichia lineages from undersampled habitats outside the gut of warm-blooded animals, including humans, highlights a current knowledge gap in the natural history of “enteric” bacteria in general and Escherichia species specifically. A disproportionately large amount of data has been generated from host-associated isolates and samples, which likely slowed the discovery of these cryptic lineages. More epidemiologic research on non-coli Escherichia is needed. The work reviewed and discussed here suggests that a concerted effort be made to define and understand Escherichia biodiversity as it transmits between hosts or establishes autochthonous populations in the environment. These efforts should include genome-based phylogenetics as well as biochemical and transcriptomic characterizations as at least some of the Escherichia diversity is partitioned by differences in gene expression and not necessarily gene content. Finally, clinical research should recognize the possibility that new E. coli subspecies, like clade I, represent an understudied source of genetic material that may include novel virulence factors and antibiotic resistance elements.
SUPPLEMENTAL MATERIAL
REFERENCES
- 1. Leclerc H. 1962. Biochemical study of pigmented Enterobacteriaceae. Ann Inst Pasteur (Paris) 102:726–741. (In French.)
- 2. Burgess NR, McDermott SN, Whiting J. 1973. Aerobic bacteria occurring in the hind-gut of the cockroach, Blatta orientalis. J Hyg (Lond) 71:1–7. doi:10.1017/S0022172400046155
- 3. Brenner DJ, Davis BR, Steigerwalt AG, Riddle CF, McWhorter AC, Allen SD, Farmer JJ, III, Saitoh Y, Fanning GR. 1982. Atypical biogroups of Escherichia coli found in clinical specimens and description of Escherichia hermannii sp. nov. J Clin Microbiol 15:703–713.
- 4. Brenner DJ, McWhorter AC, Knutson JK, Steigerwalt AG. 1982. Escherichia vulneris: a new species of Enterobacteriaceae associated with human wounds. J Clin Microbiol 15:1133–1140.
- 5. Tamura K, Sakazaki R, Kosako Y, Yoshizaki E. 1986. Leclercia adecarboxylata gen. nov., comb. nov., formerly known as Escherichia adecarboxylata. Curr Microbiol 13:179–184. doi:10.1007/BF01568943
- 6. Priest FG, Barker M. 2010. Gram-negative bacteria associated with brewery yeasts: reclassification of Obesumbacterium proteus biogroup 2 as Shimwellia pseudoproteus gen. nov., sp. nov., and transfer of Escherichia blattae to Shimwellia blattae comb. nov. Int J Syst Evol Microbiol 60:828–833. doi:10.1099/ijs.0.013458-0
- 7. Lawrence JG, Ochman H, Hartl DL. 1991. Molecular and evolutionary relationships among enteric bacteria. J Gen Microbiol 137:1911–1921. doi:10.1099/00221287-137-8-1911
- 8. Farmer JJ, III, Fanning GR, Davis BR, O’Hara CM, Riddle C, Hickman-Brenner FW, Asbury MA, Lowery VA, III, Brenner DJ. 1985. Escherichia fergusonii and Enterobacter taylorae, two new species of Enterobacteriaceae isolated from clinical specimens. J Clin Microbiol 21:77–81.
- 9. Huys G, Cnockaert M, Janda JM, Swings J. 2003. Escherichia albertii sp. nov., a diarrhoeagenic species isolated from stool specimens of Bangladeshi children. Int J Syst Evol Microbiol 53:807–810. doi:10.1099/ijs.0.02475-0
- 10. Savageau MA. 1983. Escherichia coli habitats, cell types, and molecular mechanisms of gene control. Am Nat 122:732–744. doi:10.1086/284168
- 11. Walk ST, Alm EW, Calhoun LM, Mladonicky JM, Whittam TS. 2007. Genetic diversity and population structure of Escherichia coli isolated from freshwater beaches. Environ Microbiol 9:2274–2288. doi:10.1111/j.1462-2920.2007.01341.x
- 12. U.S. Environmental Protection Agency. 2000. Improved Enumeration Methods for the Recreational Water Quality Indicators: Enterococci and Escherichia coli. U.S. Environmental Protection Agency, Washington, DC.
- 13. Wheeler Alm E, Burke J, Spain A. 2003. Fecal indicator bacteria are abundant in wet sand at freshwater beaches. Water Res 37:3978–3982. doi:10.1016/S0043-1354(03)00301-4
- 14. Hyma KE, Lacher DW, Nelson AM, Bumbaugh AC, Janda JM, Strockbine NA, Young VB, Whittam TS. 2005. Evolutionary genetics of a new pathogenic Escherichia species: Escherichia albertii and related Shigella boydii strains. J Bacteriol 187:619–628. doi:10.1128/JB.187.2.619-628.2005
- 15. Wirth T, Falush D, Lan R, Colles F, Mensa P, Wieler LH, Karch H, Reeves PR, Maiden MC, Ochman H, Achtman M. 2006. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol Microbiol 60:1136–1151. doi:10.1111/j.1365-2958.2006.05172.x
- 16. Walk ST, Alm EW, Gordon DM, Ram JL, Toranzos GA, Tiedje JM, Whittam TS. 2009. Cryptic lineages of the genus Escherichia. Appl Environ Microbiol 75:6534–6544. doi:10.1128/AEM.01262-09
- 17. Selander RK, Caugant DA, Ochman H, Musser JM, Gilmour MN, Whittam TS. 1986. Methods of multilocus enzyme electrophoresis for bacterial population genetics and systematics. Appl Environ Microbiol 51:873–884.
- 18. Ochman H, Selander RK. 1984. Standard reference strains of Escherichia coli from natural populations. J Bacteriol 157:690–693.
- 19. Sahl JW, Matalka MN, Rasko DA. 2012. Phylomark, a tool to identify conserved phylogenetic markers from whole-genome alignments. Appl Environ Microbiol 78:4884–4892. doi:10.1128/AEM.00929-12
- 20. Steinsland H, Lacher DW, Sommerfelt H, Whittam TS. 2010. Ancestral lineages of human enterotoxigenic Escherichia coli. J Clin Microbiol 48:2916–2924. doi:10.1128/JCM.02432-09
- 21. Ingle DJ, Clermont O, Skurnik D, Denamur E, Walk ST, Gordon DM. 2011. Biofilm formation by and thermal niche and virulence characteristics of Escherichia spp. Appl Environ Microbiol 77:2695–2700. doi:10.1128/AEM.02401-10
- 22. Leung PH, Peiris JS, Ng WW, Robins-Browne RM, Bettelheim KA, Yam WC. 2003. A newly discovered verotoxin variant, VT2g, produced by bovine verocytotoxigenic Escherichia coli. Appl Environ Microbiol 69:7549–7553. doi:10.1128/AEM.69.12.7549-7553.2003
- 23. Steyert SR, Sahl JW, Fraser CM, Teel LD, Scheutz F, Rasko DA. 2012. Comparative genomics and stx phage characterization of LEE-negative Shiga toxin-producing Escherichia coli. Front Cell Infect Microbiol 2:133. doi:10.3389/fcimb.2012.00133
- 24. Newton HJ, Sloan J, Bulach DM, Seemann T, Allison CC, Tauschek M, Robins-Browne RM, Paton JC, Whittam TS, Paton AW, Hartland EL. 2009. Shiga toxin-producing Escherichia coli strains negative for locus of enterocyte effacement. Emerg Infect Dis 15:372–380. doi:10.3201/eid1503.080631
- 25. Krüger A, Lucchesi PM, Parma AE. 2007. Evaluation of vt2-subtyping methods for identifying vt2g in verotoxigenic Escherichia coli. J Med Microbiol 56:1474–1478. doi:10.1099/jmm.0.47307-0
- 26. Clermont O, Gordon DM, Brisse S, Walk ST, Denamur E. 2011. Characterization of the cryptic Escherichia lineages: rapid identification and prevalence. Environ Microbiol 13:2468–2477. doi:10.1111/j.1462-2920.2011.02519.x
- 27. Lefort A, Panhard X, Clermont O, Woerther PL, Branger C, Mentré F, Fantin B, Wolff M, Denamur E, Colibafi Group. 2011. Host factors and portal of entry outweigh bacterial determinants to predict the severity of Escherichia coli bacteremia. J Clin Microbiol 49:777–783. doi:10.1128/JCM.01902-10
- 28. Vignaroli C, Di Sante L, Magi G, Luna GM, Di Cesare A, Pasquaroli S, Facinelli B, Biavasco F. 2014. Adhesion of marine cryptic Escherichia isolates to human intestinal epithelial cells. ISME J 9:508–515. doi:10.1038/ismej.2014.164
- 29. Berthe T, Ratajczak M, Clermont O, Denamur E, Petit F. 2013. Evidence for coexistence of distinct Escherichia coli populations in various aquatic environments and their survival in estuary water. Appl Environ Microbiol 79:4684–4693. doi:10.1128/AEM.00698-13
- 30. Clermont O, Christenson JK, Denamur E, Gordon DM. 2013. The Clermont Escherichia coli phylo-typing method revisited: improvement of specificity and detection of new phylo-groups. Environ Microbiol Rep 5:58–65. doi:10.1111/1758-2229.12019
- 31. Luo C, Walk ST, Gordon DM, Feldgarden M, Tiedje JM, Konstantinidis KT. 2011. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci U S A 108:7200–7205. doi:10.1073/pnas.1015622108
- 32. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P. 2012. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40:D284–D289. doi:10.1093/nar/gkr1060
- 33. Oh S, Buddenborg S, Yoder-Himes DR, Tiedje JM, Konstantinidis KT. 2012. Genomic diversity of Escherichia isolates from diverse habitats. PLoS One 7:e47005. doi:10.1371/journal.pone.0047005
- 34. Vital M, Chai B, Østman B, Cole J, Konstantinidis KT, Tiedje JM. 2014. Gene expression analysis of E. coli strains provides insights into the role of gene regulation in diversification. ISME J 9:1130–1140. doi:10.1038/ismej.2014.204
- 35. Koeppel A, Perry EB, Sikorski J, Krizanc D, Warner A, Ward DM, Rooney AP, Brambilla E, Connor N, Ratcliff RM, Nevo E, Cohan FM. 2008. Identifying the fundamental units of bacterial diversity: a paradigm shift to incorporate ecology into bacterial systematics. Proc Natl Acad Sci U S A 105:2504–2509. doi:10.1073/pnas.0712205105
- 36. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, Polz MF. 2008. Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320:1081–1085. doi:10.1126/science.1157890
- 37. Cohan FM, Kopac SM. 2011. Microbial genomics: E. coli relatives out of doors and out of body. Curr Biol 21:R587–R589. doi:10.1016/j.cub.2011.06.011
- 38. Le Minor L, Popoff MY. 1987. Designation of Salmonella enterica sp. nov., nom. rev., as the type and only species of the genus Salmonella: request for an opinion. Int J Syst Bacteriol 37:465–468. doi:10.1099/00207713-37-4-465
- 39. Tindall BJ, Grimont PA, Garrity GM, Euzéby JP. 2005. Nomenclature and taxonomy of the genus Salmonella. Int J Syst Evol Microbiol 55:521–524. doi:10.1099/ijs.0.63580-0
- 40. Iversen C, Mullane N, McCardell B, Tall BD, Lehner A, Fanning S, Stephan R, Joosten H. 2008. Cronobacter gen. nov., a new genus to accommodate the biogroups of Enterobacter sakazakii, and proposal of Cronobacter sakazakii gen. nov., comb. nov., Cronobacter malonaticus sp. nov., Cronobacter turicensis sp. nov., Cronobacter muytjensii sp. nov., Cronobacter dublinensis sp. nov., Cronobacter genomospecies 1, and of three subspecies, Cronobacter dublinensis subsp. dublinensis subsp. nov., Cronobacter dublinensis subsp. lausannensis subsp. nov. and Cronobacter dublinensis subsp. lactaridi subsp. nov. Int J Syst Evol Microbiol 58:1442–1447. doi:10.1099/ijs.0.65577-0
- 41. Orskov I. 1984. Genus V. Klebsiella Trevisan 1885, vol 1. Williams & Wilkins, Baltimore, MD.
- 42. McQuiston JR, Herrera-Leon S, Wertheim BC, Doyle J, Fields PI, Tauxe RV, Logsdon JM, Jr. 2008. Molecular phylogeny of the salmonellae: relationships among Salmonella species and subspecies determined from four housekeeping genes and evidence of lateral gene transfer events. J Bacteriol 190:7060–7067. doi:10.1128/JB.01552-07
- 43. Boyd EF, Wang FS, Whittam TS, Selander RK. 1996. Molecular genetic relationships of the salmonellae. Appl Environ Microbiol 62:804–808.
- 44. Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120. doi:10.1007/BF01731581
- 45. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. 2013. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30:2725–2729. doi:10.1093/molbev/mst197
- 46. Meier-Kolthoff JP, Auch AF, Klenk HP, Göker M. 2013. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14:60. doi:10.1186/1471-2105-14-60
- 47. Meier-Kolthoff JP, Hahnke RL, Petersen J, Scheuner C, Michael V, Fiebig A, Rohde C, Rohde M, Fartmann B, Goodwin LA, Chertkov O, Reddy T, Pati A, Ivanova NN, Markowitz V, Kyrpides NC, Woyke T, Göker M, Klenk HP. 2014. Complete genome sequence of DSM 30083(T), the type strain (U5/41(T)) of Escherichia coli, and a proposal for delineating subspecies in microbial taxonomy. Stand Genomic Sci 9:2. doi:10.1186/1944-3277-9-2
- 48. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81–91. doi:10.1099/ijs.0.64483-0