Proceedings of the National Academy of Sciences of the United States of America. 2009 May 13;106(21):8713–8718. doi: 10.1073/pnas.0812949106

A precise reconstruction of the emergence and constrained radiations of Escherichia coli O157 portrayed by backbone concatenomic analysis

Shana R Leopold a,b,1, Vincent Magrini c,1, Nicholas J Holt a, Nurmohammad Shaikh a, Elaine R Mardis c, Joseph Cagno d, Yoshitoshi Ogura e,f, Atsushi Iguchi e, Tetsuya Hayashi e,f, Alexander Mellmann g,h, Helge Karch g,h, Thomas E Besser i, Stanley A Sawyer j, Thomas S Whittam k,2, Phillip I Tarr a,3
PMCID: PMC2689004  PMID: 19439656

Abstract

Single nucleotide polymorphisms (SNPs) in stable genome regions provide durable measurements of species evolution. We systematically identified each SNP in concatenations of all backbone ORFs in 7 newly or previously sequenced evolutionarily instructive pathogenic Escherichia coli O157:H7, O157:H−, and O55:H7. The 1,113 synonymous SNPs demonstrate emergence of the largest cluster of this pathogen only in the last millennium. Unexpectedly, shared SNPs within circumscribed clusters of organisms suggest severely restricted survival and limited effective population sizes of pathogenic O157:H7, tenuous survival of these organisms in nature, source-sink evolutionary dynamics, or, possibly, a limited number of mutations that confer selective advantage. A single large segment spanning the rfb-gnd gene cluster is the only backbone region convincingly acquired by recombination as O157 emerged from O55. This concatenomic analysis also supports using SNPs to differentiate closely related pathogens for infection control and forensic purposes. However, constrained radiations raise the possibility of making false associations between isolates.

Keywords: E. coli, evolution, SNPs


Shiga toxin (Stx)-producing sorbitol-nonfermenting Escherichia coli O157:H7, sorbitol-fermenting (SF) E. coli O157:H−, and nontoxigenic E. coli O55:H7 (1) are the principal subgroups of the enterohemorrhagic E. coli (EHEC) 1 clade. Globally distributed E. coli O157:H7 are harmless when in bovine hosts, but among the E. coli genospecies, O157:H7 is singularly virulent for humans. SF O157:H− is a nonmotile human pathogen confined largely to Germany. In the current stepwise scenario based on multilocus enzyme electrophoresis (2), O55:H7 belong to an ancestral subgroup (A) from which considerably more pathogenic subgroups B (SF O157:H−) and C (O157:H7) emerged (Fig. 1 and Fig. S1). A probably extinct SF O157:H7 (gray sphere, Fig. 1 and Fig. S1) has an stx2 bacteriophage and is a hypothetical node between O157:H7 and SF O157:H− (3). Unlike subgroup A or B strains, subgroup C O157:H7 fail to ferment sorbitol, have fimA promoter (4) and uidA (5) mutations, and possess the TAI genomic island (6). Human subgroup C O157:H7 belong to 3 sequentially emerged clusters defined by the presence of a truncated bacteriophage in yehV and an intact wrbA gene (cluster 1), stx2 bacteriophage disruption of wrbA (cluster 2), and acquisition of stx1 by the truncated bacteriophage occupying yehV (cluster 3) (7).

Fig. 1.

Evolutionary scenario for EHEC 1 pathogens. E. coli O55:H7 belongs to the most ancestral subgroup of the EHEC 1 clade (subgroup A). The gray sphere depicts a probably extinct intermediate between O55:H7 and E. coli expressing the O157 lipopolysaccharide, with subgroup B consisting of the SF O157:H− and subgroup C consisting of E. coli O157:H7. Critical intraclade events are noted. Intracluster ovals represent genomically sequenced strains (dark-pink), strains used for SNP consensus sampling (green), inferred founders (pale-pink), and postulated organisms that are immediate progenitors to the next cluster or subgroup (white). Screened strains were assigned to the main branch if they had each of the 3 signature SNPs among the 111 shared SNPs (38/48), and to the minor branch if they lacked these SNPs (10/48). More extensive evolutionary detail is provided in Fig. S1. Distances not drawn to scale.

The 5.4 million base pair (Mb) chromosomes of O157:H7 cluster 3 strains EDL933 and O157 Sakai (the only fully sequenced EHEC at the start of our study) are, as in many pathogens, mosaics. Each chromosome has E. coli K-12 orthologous segments (the 4.1 Mb “backbone”) (8, 9) separated by hundreds of O-islands. O-islands arise from bacteriophage transduction and smaller insertions and deletions (indels) (10), often encode virulence traits, and are unstably integrated (11).

EHEC 1 strains that belong to ancestral clusters and subgroups continue to infect humans. As such, they present exceptional opportunities to illuminate the emergence of a pathogen. We hypothesized that by finding all single nucleotide polymorphisms (SNPs) in concatenated backbone open reading frames of evolutionarily informative EHEC 1 strains, we could precisely time and characterize the emergence and radiation of these pathogens. This “concatenomic” analysis portrays highly nonrandom radiations from group founders, limited recombination within backbones, and quite recent pathogen emergence. The data also offer statistical guidance for sequence-based differentiation of closely related pathogens.

Results

Correction of Sanger-Generated SNPs in the Database and Validation of Newly Pyrosequenced Genomes.

SNP validations were performed by PCR-amplifying across sites of interest, Sanger sequencing the resulting amplicons, and correcting the database, if necessary.

At the outset, we focused on the 124 SNPs in the backbones of the previously sequenced cluster 3 strains O157 Sakai and EDL933, and the 11 single nucleotide differences between these strains in a subset of comparable O-island segments. We verified only 65 (52%) and 6 (55%) of these SNPs, respectively. We next turned our attention to the pyrosequenced strains (Table S1). After aligning them to the corrected backbone ORFs in the 2 cluster 3 strains, we found 235 single nucleotide indels (SNIs) in or near homopolymer runs (≥4 nucleotides). Three of these 235 SNIs were based on 100% concordant bidirectional reads and verified by Sanger sequencing, and eliminated from further analysis. However, because pyrosequencing can generate SNIs in error (12), we studied 35 random additional SNIs based on nonunanimous reads, and validated only one. We removed that ORF from further analysis, but because the preponderance of these 35 SNIs was incorrect, we retained for analysis all ORFs with nonunanimous SNIs (Table S2).
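The homopolymer screen described above can be illustrated with a minimal sketch. It only shows one way to flag a position as lying in or near a run of ≥4 identical nucleotides; the function names and the `flank` parameter (how close counts as "near") are assumptions made for illustration, not part of the published pipeline.

```python
import re

def homopolymer_runs(seq, min_len=4):
    """Return (start, end, base) for each run of >= min_len identical bases.

    Coordinates are 0-based and end-exclusive.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]

def near_homopolymer(seq, pos, min_len=4, flank=1):
    """True if 0-based position `pos` is inside, or within `flank` bases of, a run.

    `flank` is a hypothetical tolerance; the source does not quantify "near".
    """
    return any(start - flank <= pos < end + flank
               for start, end, _ in homopolymer_runs(seq, min_len))

example = "ACGTTTTTACGAAAAC"  # toy sequence, not real data
runs = homopolymer_runs(example)
```

A candidate single nucleotide indel at a flagged position would then be treated with more suspicion than one in non-repetitive sequence.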

Next, we validated the SNPs in 3 pyrosequenced subgroup C O157:H7s (strains 87–14, TW14359, and 86–24). These 3 isolates had 186 SNPs that were called on the basis of 100% concordance in all reads, and which were not found in any other strains. We verified all 10 synonymous and all 10 nonsynonymous strain-specific SNPs randomly selected from each of these strains (SNPs were termed nonsynonymous if they changed the amino acid designated by the codon in which they occur without elongating or truncating the protein, and synonymous if they did not alter the amino acid). These 3 strains also had 19 backbone SNPs based on 51–99% concordant reads, and we verified each. However, none of the 12 high confidence SNPs based on ≤50% concordant reads was correct. Finally, we BLAST-compared 100 randomly selected ORFs (containing 491,598 nucleotides) in each of the 5 pyrosequenced strains to the O157 Sakai sequence, and found no SNPs other than those detected by Lasergene. Thus, bacterial genome pyrosequencing produces considerably fewer false positive SNPs than does Sanger sequencing, and few, if any, false negative SNPs. (Table S2 and Table S3)
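The synonymous/nonsynonymous definition given in parentheses above can be expressed as a short sketch. It uses the standard genetic code and returns a third label for changes that create or remove a stop codon, because the definition excludes SNPs that elongate or truncate the protein. The function name, the label strings, and the example codons are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(ref_codon, alt_codon):
    """Classify a codon change following the definition in the text."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if "*" in (ref_aa, alt_aa):
        # Gain or loss of a stop codon changes protein length.
        return "length-altering"
    return "nonsynonymous"

# Illustrative codons only (not taken from the strains in this study):
asn_to_lys = classify_snp("AAC", "AAA")   # asparagine -> lysine
silent_leu = classify_snp("CTG", "CTA")   # leucine -> leucine
new_stop = classify_snp("TGG", "TGA")     # tryptophan -> stop
```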

Categorization of SNPs as Radial, Linear, or Provisionally Linear.

SNPs were termed radial if they differed from their cluster founder backbone concatenome sequence and were found in no descendant group of organisms, indicating that they occurred as their branch “radiated” away from the founder but remained within a cluster. The founder of cluster 1 is considered to be the first sorbitol nonfermenting E. coli O157:H7 that contained a truncated bacteriophage in yehV. The founder of cluster 2 is considered to be the first E. coli O157:H7 in which wrbA is occupied by an stx2 bacteriophage. The founder of cluster 3 is considered to be the first E. coli O157:H7 in which the bacteriophage occupying yehV acquires stx1. SNPs were termed linear if they represent a polymorphic site that agrees with the consensus within the cluster in which it first appeared, and was carried forward in each descendant cluster, and provisionally linear if we were unable to infer the true founder backbone concatenome sequence for a terminal group of organisms, but can establish by consensus sampling (defined below) their probable linear status (Table S4).

SNP Categorizations for Groups That Cannot Be Accurately Rooted.

For terminal subgroup C, cluster 3, we do not know whether a SNP shared by O157 Sakai and EDL933 is retained in descendant sets of organisms. Therefore, we termed the 11 such SNPs that appear in these 2 strains as being “provisionally linear” (example 15, Table S4) after we confirmed that each of these mutations was found in each of 6 additional cluster 3 strains (Table S1).

We next directed attention to SNPs found only in strain TB182A, the single sequenced member of ancestral subgroup A, where we have no near-ancestral groups for reference. We randomly selected 25 synonymous and 25 nonsynonymous TB182A strain-specific SNPs and studied these sites in each of 3 additional subgroup A Diarrheagenic E. coli (Dec) (“consensus sampling”). Dec 5b shares 23 of 50 sample SNPs unique to TB182A, Dec 5a shares these 23 and 10 additional sample SNPs, and Dec 5d shares 49 of the 50 sample SNPs. These mutation “tiers” imply at least 3 subgroup A segmentations (Fig. 1 and Fig. S1). If Dec 5b belongs to an ancestral group with subgroup A (scenario represented by the leftward of the bidirectional arrows in Fig. 1 and Fig. S1), the linear SNPs in strain TB182A might also be provisional.

We also consensus sampled 25 synonymous and 25 nonsynonymous random sample SNPs in the sole sequenced subgroup B isolate, where descendant groups have also not been identified. Three SF O157:H− strains share 48 of 50 sample SNPs unique to 493/89. We therefore classified 96% of the SNPs in this subgroup as being provisionally linear (example 14, Table S4). One of the 2 remaining sample SNPs was found in strains 493/89 and GeSB-18; the other was found only in strain 493/89. These last 2 SNPs (and by inference 4% of the 325 subgroup B-specific SNPs) were classified as radial.

Precise Portrayal of Emergence of the EHEC 1 Clade, and Evidence of Constrained Radiation.

Within subgroup C, exactly 60 synonymous and 57 nonsynonymous linear SNPs occurred as cluster 2 emerged from cluster 1. Six synonymous and 5 nonsynonymous provisionally linear SNPs accrued as cluster 3 emerged from cluster 2 (distances are founder to founder). From consensus sampling, we estimate that 396 synonymous and 368 nonsynonymous SNPs differentiate strain Dec 5b, the O55:H7 strain in this study most closely related to O157, from the subgroup B founder. We also estimate that 545 synonymous and 530 nonsynonymous SNPs separate Dec 5b from the cluster 1 founder (Fig. S1 and Table S2).

Strains 87–14 and TW14359 have 175 and 217 radial SNPs, respectively, of which 111 are shared (between the red symbols in Fig. 1 and Fig. S1). We sequenced in 48 additional cluster 1 O157:H7 strains 3 sites that correspond to SNPs randomly selected from these 111 shared SNPs. We did this to test the hypothesis that this cluster radiated in a highly nonrandom pattern. Thirty-eight (79%) of these North American and European isolates contain each of these 3 “signature” SNPs. The remaining 10 isolates contain none of these 3 SNPs but do have an N135K FimH mutation characteristic of cluster 2 and 3 O157:H7 (Fig. S1). These 10 strains, therefore, most likely lie on or have diverged from, the branch denoted by green symbols in Fig. 1 and Fig. S1.

O157 Sakai has 35 and EDL933 has 30 different radial SNPs. We then asked whether any of the 31 synonymous radial SNPs in these strains are found in any of 6 other cluster 3 O157:H7s. Of these 6 additional strains, 2 bovine isolates have 3 of O157 Sakai's 12 radial synonymous SNPs, 2 human isolates share 2 of the 19 radial synonymous SNPs found in EDL933, and 1 bovine and 1 human isolate have none of these 31 SNPs.

There was no relation between the dN/dS ratio and dS in any appropriate pairwise comparison among the 7 sequenced strains (Fig. S2).

Backbone Segments Have Experienced Minimal Recombination.

GENECONV identified only 1 segment in strain TB182A containing more SNPs than expected compared with all other backbones in this study (P < 0.0001, Fig. 2 and Fig. S3). This 161-kb region includes gnd, which “hitchhikes” with the adjacent rfb cluster, which encodes the O-lipopolysaccharide side chain antigen (3, 13). After removing the rfb-gnd segment, we found that among the 850 total informative sites in the backbone that were identified by the homoplasy test of Maynard Smith and Smith (14), the 3 homoplastic sites were the only 3 sites in the entire dataset that we already knew contained >2 nucleotides (Table S3) (P = 0.9717 with 10,000 simulations). Thus, the EHEC 1 clade backbone experienced no significant gene conversion.

Fig. 2.

SNP clusterings, E. coli O55:H7 strain TB182A versus founder of E. coli O157:H7 cluster 1. Percentage of single nucleotide differences in each backbone ORF in the subgroup C backbone concatenome, comparing ORFs in strain TB182A and the founder of subgroup C, cluster 1. The y axis portrays the percentage of ORF nucleotides that differ from their counterparts in descendant E. coli O157:H7 in the most closely related cluster. The x axis portrays concatenated ORFs. The 160,818 nucleotide region between hisG and yehL has 39,295 O-island and 121,523 backbone nucleotides. The backbone part of this region in strain TB182A has 1,337 SNPs that are not found in the founder of E. coli O157:H7 cluster 1 or in strain 493/89, whereas the remaining 3,190,617 EHEC 1 backbone concatenome nucleotides have only 1,764 SNPs. This 161-kb region includes the O55 gnd gene and the O55 rfb cluster, which encodes the lipopolysaccharide side chain (Fig. S3). (Inset) Noted are several ORFs relevant to this region, including cross-over sites designated by GENECONV (amn and yehL) and by Wang, et al. (amn-hisG and galF) (3). The backbone region encoding ECs2830–2847, which spans gnd, rfb genes, and galF, was not included in this pairwise comparison because of <95% similarity (gnd) or unequal lengths (galF) between orthologues, or absence of genes from K-12 (rfb cluster). However, we have included their relative position between included backbone ORFs in this region.
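As a quick worked check of the contrast stated in this legend, the SNP densities inside and outside the hisG-yehL region can be computed directly from the four counts given above. The variable names are illustrative; no other data are assumed.

```python
# Counts taken from the figure legend above.
region_snps, region_nt = 1_337, 121_523      # backbone part of the hisG-yehL region
rest_snps, rest_nt = 1_764, 3_190_617        # remaining backbone concatenome

region_density = region_snps / region_nt     # SNPs per backbone nucleotide
rest_density = rest_snps / rest_nt
fold = region_density / rest_density

print(f"{region_density:.2%} vs {rest_density:.3%}; ~{fold:.0f}-fold")
# prints: 1.10% vs 0.055%; ~20-fold
```

The roughly 20-fold excess is the kind of local SNP enrichment that a recombination detector such as GENECONV is designed to flag.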

Modeling Interstrain Differentiation Using SNPs.

We found ≥1 differentiating SNP in 100 randomly selected ORFs in 99% of 100,000 Monte Carlo simulations between the 2 completely sequenced cluster 1 strains (170 SNPs, 87–14 vs. TW14359). For the 2 cluster 3 strains, only 83% of simulations produced ≥1 differentiating SNP in 100 ORFs (65 SNPs, EDL933 vs. O157 Sakai) (Fig. 3).
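The sampling experiment can be sketched as a small Monte Carlo simulation. This is not the authors' program: the total number of backbone ORFs (3,500 here) and the simplification that each SNP sits in its own ORF are placeholder assumptions, and the default number of simulations is smaller than the 100,000 used in the study.

```python
import random

def fraction_with_snp(n_orfs_total, snp_orfs, n_sampled, n_sims=10_000, seed=0):
    """Fraction of simulations in which a random ORF sample hits >= 1 SNP-bearing ORF.

    n_orfs_total : number of ORFs available for sampling (hypothetical value below)
    snp_orfs     : indices of ORFs that carry at least one differentiating SNP
    n_sampled    : number of ORFs sequenced per simulated comparison
    """
    rng = random.Random(seed)
    snp_orfs = set(snp_orfs)
    hits = 0
    for _ in range(n_sims):
        sample = rng.sample(range(n_orfs_total), n_sampled)  # without replacement
        if any(orf in snp_orfs for orf in sample):
            hits += 1
    return hits / n_sims

ASSUMED_TOTAL_ORFS = 3_500  # placeholder; the text does not give this number here

# Assumes, for illustration only, one SNP per ORF.
cluster1 = fraction_with_snp(ASSUMED_TOTAL_ORFS, range(170), 100)
cluster3 = fraction_with_snp(ASSUMED_TOTAL_ORFS, range(65), 100)
```

Under these assumptions the cluster 1 comparison almost always yields a differentiating SNP, whereas the cluster 3 comparison fails noticeably more often, which is the qualitative pattern reported above.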

Fig. 3.

Percentages of 100,000 simulated sequences with SNPs, as functions of ORFs sampled, and of clusters of origin of strains. A total of 50, 100, 300, and 500 backbone concatenome ORFs were randomly selected in 100,000 independent simulations. We compared the 2 genomically sequenced clinical isolates in cluster 1 (A) and the 2 genomically sequenced clinical isolates in cluster 3 (B) to model how many ORFs one would need to sequence to find differentiating SNPs. The distributions of ORFs that contained SNPs in pairwise interpathogen comparisons are portrayed as curves, with each curve representing a different number of ORFs sampled. The x axis depicts numbers of SNPs identified. The y axis depicts percentage of simulations.

Discussion

We were surprised to find O157:H7s collected on 3 continents over 3 decades from humans, cattle, and food share so many intracluster SNPs. Cluster 1 is exceptionally informative. Cluster 1's founder backbone concatenome sequence (unlike the founder backbone concatenome sequence of cluster 3) is deduced with near certainty because we know ancestral and descendant sequence. Unlike cluster 2, cluster 1 is common among contemporaneous pathogens (15). Because cluster 1 O157:H7s have been in existence for several millennia, ample time has presumably elapsed for its members to have radiated on multiple branches from their founder. However, despite this opportunity for random radiation, no node occurs for the 2 completely sequenced strains until their ancestor accrued 111 SNPs, and the validation sample of 48 Old and New World strains demonstrates only a major and a minor branch (between the red and green symbols, respectively, in Fig. 1 and Fig. S1). This pattern vividly portrays human pathogenic O157:H7 as belonging to a very small subset of potential founder offspring, and attests to the limited effective population sizes of the clusters. It also confirms the hypothesis that evolution of this pathogen from the founder within the well-circumscribed and well-established cluster 1 is far from random.

Why do cluster 1 O157:H7s radiate on only 2 branches from their founder? A bottleneck might restrict diversity. However, a single population-reducing event is very unlikely to have constrained the cluster 1 population: by the time 111 backbone SNPs had occurred, the mutating bacteria would have been geographically dispersed and less susceptible to extermination. Constrained radiation could be explained by “black-hole” source-sink evolution dynamics, as proposed for uropathogenic E. coli (16) and other pathogens (17). According to this theory, mutations in bacteria that reside mainly in reservoirs (the source, in this case cattle) in which they are not pathogens confer on a small subset a phenotype that results in injury to an incidental host (the sink, in this case humans) (17). A third possibility is that a strong selective advantage conferred by a mutation immediately before the node in Fig. 1 and Fig. S1 enabled cluster 1 O157:H7s to flourish in cattle, making them more available for spillover into humans (scenarios portrayed in Fig. 4). In any case, these data prompt scrutiny of the shared SNPs before and after that node. We cannot test the significance of the observed shared SNPs in cluster 3 because of the paucity of accrued mutations, so we cannot discard the possibility that these mutations are consistent with a neutral coalescence model.

Fig. 4.

Possible scenarios underlying constrained radiation observed in cluster 1. (A) Many different radiations of similarly virulent O157:H7s extend from the cluster 1 founder, until a bottleneck (selective sweep) eliminated all descendants of the founder except for one that had sustained the 111 shared SNPs between the red symbols in Fig. 1 and Fig. S1. The descendants of the survivor of this event flourished, and randomly spill over to human populations today. (B) Heterogeneous cluster 1 O157:H7s reside in their natural reservoir (different colored bacilli), but only a small minority (red bacilli) are capable of injuring an incidentally infected human (source-sink dynamics). (C) Many different radiations of similarly virulent O157:H7s extend from the cluster 1 founder. A mutation that conferred a selective advantage occurred in an O157:H7 at or immediately before the node in Fig. 1 and Fig. S1. The descendants of this advantaged mutant flourished and randomly spill over to human populations today.

Additional data from this concatenomic analysis suggest that survival of pathogenic EHEC 1 organisms is exceptional. If Dec 5b emerged from an O55:H7 progenitor that is also ancestral to Dec 5a, 5d and TB182A (leftward arrows, Fig. 1 and Fig. S1), then a single radiation from a founder of subgroup A spawned extant O55:H7. Also, the common O157 ancestor to subgroups B and C has never been found, and is probably extinct. Long-lived O157:H7 clusters arise rarely, despite the ubiquity and flux of prophages that could conceivably spawn new groups (7, 11). Cluster 2, whose strains are recovered much less frequently from humans than are isolates from clusters 1 or 3 (15), might also be heading toward extinction. Cluster 3 emerged too recently to predict its durability, but shared SNPs in some of its members portray nonrandom radiations from the founder, as in cluster 1.

SNPs might be useful to type O157:H7 (18) and other pathogens, including Bacillus anthracis (19). Pulsed field gel electrophoresis, the standard technique (20), is hindered by genomic alterations (11, 21), and theoretic and technical challenges (22). Known SNPs can differentiate pathogens, but confidence that 2 isolates of the same SNP type are from the same source is directly proportional to the amount of identical de novo–generated sequence between them. For example, SNP typing, using some of the 111 shared SNPs in cluster 1, might lead to the incorrect conclusion that the 2006 nationwide O157:H7 spinach outbreak was caused by strain 87–14, isolated in the 1980s in Seattle.

SNPs are rarer than gene gain or loss in this clade (23), but calibrate evolution more accurately than do other kinds of mutations. By dividing the panchromosomal dS by the rate of synonymous site substitutions per annum (24, 25), calculated as the product of 1.44 × 10⁻¹⁰ synonymous site changes per generation (26) and 200 annual generations (27), we estimate that the cluster 2 founder emerged ≈2,300 years after cluster 1 was founded, and the cluster 3 founder emerged only 2 centuries later. The synonymous radial SNPs in O157 Sakai and EDL933 suggest a mere 4 to 6 century interval between the founding of cluster 3 and the present. Interestingly, these short intervals coincide with massive trans-Atlantic cattle migration, because European herds populated and then proliferated in the New World (28, 29). Our analysis positions the last common ancestor (LCA) to O157:H7 and SF O157:H− as existing only ≈7,000 years ago, which is only approximately a millennium before the first evidence of cattle domestication in the eastern Mediterranean (30) and Africa. Zhang, et al., estimated that this LCA existed much earlier (51,000 to 58,000 years ago) (31), but used widely separated strains 493/89 and O157 Sakai and only 10% of their backbone ORFs to calculate the time since LCA. However, associations between cattle domestication and migration and EHEC 1 evolutionary events are highly circumstantial and speculative.
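The dating arithmetic in this paragraph can be sketched in a few lines. The two rate constants come from the text; the number of synonymous sites in the backbone concatenome is not stated here, so the value below is a placeholder assumption chosen only to make the example concrete. The sketch also assumes founder-to-founder distances along a single lineage, so no factor of 2 is applied.

```python
SYN_CHANGES_PER_GENERATION = 1.44e-10   # per synonymous site (value from the text)
GENERATIONS_PER_YEAR = 200              # value from the text

def years_elapsed(synonymous_snps, synonymous_sites):
    """Elapsed time = dS / (per-site synonymous substitution rate per year)."""
    d_s = synonymous_snps / synonymous_sites
    rate_per_year = SYN_CHANGES_PER_GENERATION * GENERATIONS_PER_YEAR
    return d_s / rate_per_year

ASSUMED_SYN_SITES = 900_000  # hypothetical placeholder, not a figure from the paper

# 60 synonymous linear SNPs separate the cluster 1 and cluster 2 founders;
# 6 synonymous provisionally linear SNPs separate clusters 2 and 3.
cluster1_to_2 = years_elapsed(60, ASSUMED_SYN_SITES)
cluster2_to_3 = years_elapsed(6, ASSUMED_SYN_SITES)
```

Because elapsed time scales linearly with the SNP count and inversely with the number of synonymous sites and the assumed rates, any uncertainty in those inputs carries through proportionally to the date estimates.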

Our agnostic approach, which does not rely only on known SNPs, extends the work of Manning, et al., who parsed a large O157:H7 strain set into 9 clades and 39 SNP types (32). However, their study, confined to 96 known SNPs, did not generate a maximally definitive phylogeny. A Neighbor-Joining phylogeny (Fig. S4) generated with SNPs in our database independent of any other information also assigns a population structure for the clade identical to Fig. 1. We contend that the EHEC 1 phylogeny postulated 2 decades ago (2) is now proven.

Holt et al. (33), studying backbones of Salmonella Typhi, identified cluster H58, with 6 strains producing 6 spokes in a centrifugal burst, which contrasts with the 2 radiating branches produced by the 48 cluster 1 O157:H7s. Holt et al. (33) concluded that accumulating and abundant gene inactivations imply a small effective pathogen population size, and attributed this to Typhi's host range, which is restricted to humans. Our SNP census of O157:H7 more definitively converges on the same conclusion: This pathogenic E. coli also has a small effective population size, its broader host range than Typhi notwithstanding.

The dN/dS ratios of the SNPs in this set are <1.0 (1.0 being the ratio expected if synonymous and nonsynonymous SNPs accrue without bias), but are higher than would be expected if a large majority of such mutations are deleterious, suggesting that EHEC 1 strains tolerate such mutations relatively well. Although we observe no diminution of the dN/dS ratio as dS increases (34), it is possible that the time intervals examined (dS) are still too short to permit deleterious mutations to cause elimination from this population.

The rfb-gnd region, the only part of the backbone acquired by recombination, is much larger than the rfb-gnd horizontally acquired segment proposed by Wang, et al. (3). The transferring elements and complexity (single or multistep) of the recombination of such a large segment are unknown. However, as in Typhi, recombination seems to have played little role in the evolution of backbone architecture since the LCA in this clade (33).

Finally, massively parallel sequencing is much more accurate than Sanger sequencing for identifying SNPs; pyrosequencing produces negligible false positives or false negatives when “majority rule” base-calling policies are used among backbone ORFs. Such accuracy is critical when studying phylogenetic differences between closely related organisms.

In summary, E. coli O157:H7 emerged much more recently than was estimated using more distantly related EHEC 1 strains. O55 and O157 antigen conversion cotransferred much more backbone DNA than previously proposed, but the backbone provides no other evidence of recombination. Pyrosequencing accuracy and economics increase the appeal of seeking backbone SNPs to type pathogens for disease control purposes. However, finding a few common SNPs is necessary but not sufficient to state that 2 isolates had the same progenitor or source; the greater the number of nucleotides confirmed to be identical, the lower the likelihood of making a false association. Massively parallel sequencing of whole genomes of phylogenetically instructive bacteria and backbone SNPs provide extraordinary clarity in illuminating evolutionary history. Finally, and most intriguingly, extant E. coli O157:H7 represent remarkably few descendants of cluster founders. Taken together, our data portray E. coli O157:H7's existence as fortuitous, and quite possibly ephemeral.

Methods

ORF Qualifications and the “Backbone Concatenome.”

We first sought qualifying backbone ORFs (Table S5) in 2 published (8, 9) O157:H7 sequences. ORF qualifications are provided in Table S6. We then sequenced the genomes of 5 additional strains (Table S1) using 454GS20 or 454FLX pyrosequencers, and identified their backbone ORFs. Then, we concatenated all qualifying ORFs in their correct orientation and order into 3 different “Backbone Concatenomes” (defined in Table S5) to use as denominator populations for intra- and intergroup comparisons. Each qualifying ORF that belonged to at least one backbone concatenome in any pyrosequenced strains was deposited in GenBank with its own accession number (Table S7).

O-Islands Posed Particular Challenges.

Many O-island orthologues are nonsyntenic, and O-island polymorphisms might arise from gene conversions (35) or bacteriophage transduction, and not stochastically occurring SNPs. However, to partly estimate O-island SNP rates, we compared a subset of 184 qualifying O-island segments in the 2 cluster 3 O157:H7 strains (Table S5 and Table S6). We also compared the EHEC 1 O55:H7 rfb region (36) to the O157:H7 rfb region, because this is a clearly horizontally transferred set of backbone and O-island genes.

Initial SNP Identification.

SNPs were identified in the cluster 3 backbone concatenome using a program written by T.S.W., and ORFs were identified using ORF BLAST analysis. For pyrosequenced strains, we assembled contigs, using 454 software, and identified all SNPs generated with ≥ 3 high quality, nonduplicate reads (>Q20 in 15 bases around variation sites, where there are multiple forward and reverse strand reads available for scrutiny), using 454 1.1.03 software. The reads also use sequences that align 5 bases flanking each site in question. Variations with frequencies >10% are considered high confidence (“high confidence reads”). We then compared these SNPs to verified O157 Sakai sequence (Lasergene v7.2, DNASTAR).

Deduction of Founder Backbone Concatenome for Flanked Clusters and Characterization of SNPs in Nonflanked Subgroups and Clusters.

We deduced the backbone concatenome of the founders of clusters 1 and 2 by forming consensuses, using nucleotides at corresponding sites in flanking subgroups and/or clusters. We had too few data to deduce the backbone concatenome for the founders of subgroup A or B, or subgroup C, cluster 3. For subgroup A, we approximated radial:linear proportions among 50 SNP consensus samples, using additional subgroup A strains [Dec 5a, 5b, and 5d (all O55:H7)] (Table S2; SNPs and primers in Table S3). These 50 sites corresponded to SNPs identified by pyrosequencing in strain TB182A. We used these sites in other strains to build a consensus to determine whether these SNPs resembled examples 1 or 13, Table S4. For subgroup B, which has no descendant groups of organisms, we similarly identified 50 random SNPs in strain 493/89 and sequenced these sites in subgroup B strains 3072/96, GeSB-13, and GeSB-18 (Table S2 and Table S3) to determine whether the consensus resembled examples 2 or 14, Table S4. We also sequenced each of the 11 SNPs found in both EDL933 and O157 Sakai in 3 additional cluster 3 strains to determine whether the consensus resembled examples 9 or 15 in Table S4 (Table S2 and Table S3). These consensus samplings enabled us to estimate the proportion of monomorphic SNPs in the subgroup A and B strains and the 2 subgroup C cluster 3 strains that are radial (examples 2 and 9, Table S4) or provisionally linear (examples 14 and 15, Table S4). The “provisional” modifier recognizes that apparently linear SNPs might represent a branch radiating from a founder, because we cannot infer the true founder backbone concatenome sequence in terminal subgroups/clusters.
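The idea of deducing a founder sequence from flanking groups can be illustrated with a minimal per-site consensus. The exact consensus rule is not spelled out in the text, so the simple majority vote below, and the use of "N" for unresolved ties, are assumptions made for this sketch; the function name and toy sequences are likewise hypothetical.

```python
from collections import Counter

def founder_consensus(flanking_alleles):
    """Infer a founder sequence by majority vote at each aligned site.

    flanking_alleles : list of equal-length strings, one per flanking
                       subgroup or cluster, covering the same SNP sites.
    Sites with no unique majority are reported as 'N' (an assumption).
    """
    consensus = []
    for column in zip(*flanking_alleles):
        (top, top_n), *rest = Counter(column).most_common()
        tied = bool(rest) and rest[0][1] == top_n
        consensus.append("N" if tied else top)
    return "".join(consensus)

# Toy alleles at four polymorphic sites in three flanking groups:
inferred = founder_consensus(["ACGT", "ACGA", "ACTA"])
```

A strain allele that differs from the inferred founder and is absent from descendant groups would then be a candidate radial SNP, in the sense defined in Results.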

Validations to Assess False Calls.

We amplified and then Sanger sequenced DNA spanning all 124 candidate backbone SNPs and the 11 candidate O-island SNPs in the cluster 3 strains O157 Sakai and EDL933. We also validated in the subgroup C pyrosequenced strains (O157:H7 strains 87–14, TW14359, and 86–24) 10 random synonymous and 10 random nonsynonymous radial SNPs based on 100% concordance in all pyrosequencing reads (60 sites total), all 19 backbone SNPs based on 51–99% concordant reads, and the 12 high confidence SNPs called with ≤50% concordance. We limited confirmations to singleton radial SNPs; we did not confirm shared radial or linear SNPs because confirmation is built into their definition (i.e., they must be present in ≥2 organisms in a phylogenetically logical context) (Table S2 and Table S3). We manually BLAST-compared 100 random ORFs from each of the 5 pyrosequenced strains (Table S2) to the O157 Sakai reference sequence to determine whether there were any SNPs present other than those detected by Lasergene.

Radial SNP Conservation Within Clusters.

We sequenced sites corresponding to the 31 confirmed synonymous radial backbone SNPs in strains EDL933 and O157 Sakai in 6 additional cluster 3 O157:H7s to determine whether organisms of diverse origin in this cluster share any of these mutations. We also sequenced nucleotides corresponding to positions 337933, 1460599, and 2370797 (randomly chosen synapomorphic SNPs on the main radiating branch of cluster 1) from 48 clinical isolates from Washington, Oregon, Idaho, Montana, Missouri, Illinois, and Germany (cluster 1 main branch “signature” SNPs). For the 2 United States cluster 1 isolates without any of these signature SNPs, and for all 18 German cluster 1 isolates, we determined whether there was an N (cluster 1) or a K (cluster 2) at FimH position 135 (Table S1, Table S2, and Table S3).

Phylogenetic Analysis.

We constructed a neighbor-joining phylogenetic tree with programs written by S. Sawyer, following published methodology (37).

Identification of Recombination Events.

After aligning all backbone ORFs of sequenced genomes, we used GENECONV v1.81 (38), with command-line parameter gscale = 3, to identify regions acquired by recombination. The Homoplasy Test was performed using the “simple method” with Se = 0.6 × S and 10,000 simulations (14).
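To make the role of Se and the simulation count concrete, the following is a simplified Monte Carlo sketch of the idea behind a homoplasy test. It is not the authors' program and not a full implementation of the method in ref. 14. It assumes that, without recombination, homoplasies arise only when independent changes fall on the same site, so it scatters the observed number of changes over Se effective sites and asks how often at least the observed number of repeat hits occurs. The function name, parameter names, and example numbers are hypothetical.

```python
import random


def homoplasy_p_value(n_changes, observed_homoplasies, s_sites,
                      se_factor=0.6, n_sim=10000, seed=1):
    """Estimate P(homoplasies >= observed) when changes fall at random sites.

    s_sites is S, the number of candidate sites; the effective number of
    sites is taken as Se = se_factor * S (0.6 in the "simple method").
    """
    se = max(1, round(se_factor * s_sites))
    rng = random.Random(seed)  # seeded for reproducibility
    at_least = 0
    for _ in range(n_sim):
        hits = [rng.randrange(se) for _ in range(n_changes)]
        # Each change beyond the first at a site counts as one homoplasy.
        homoplasies = n_changes - len(set(hits))
        if homoplasies >= observed_homoplasies:
            at_least += 1
    return at_least / n_sim


# Toy numbers only: 50 changes, 3 observed homoplasies, S = 10,000 sites.
p = homoplasy_p_value(n_changes=50, observed_homoplasies=3, s_sites=10000)
print(p)
```

A small p-value in this sketch would mean that more sites were hit repeatedly than chance multiple hits explain, which is the kind of signal a homoplasy test attributes to recombination.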

Supplementary Material

Supporting Information

Acknowledgments.

We thank Beth Wolf for manuscript preparation assistance; Nicole Perna, Guy Plunkett, Steve Moseley, David Hunstad, James Johnson, and Vincent Young for helpful comments; and William E. Bennett Jr. for artistic assistance. We communicated sequencing errors to investigators at the University of Wisconsin Enteropathogen Resource Integration Center Bioinformatics Resource Center (ERIC-BRC), and we thank Guy Plunkett for assuming responsibility for correcting the errors in EDL933 in the National Center for Biotechnology Information database and for annotating the confirmed polymorphisms in the ERIC-BRC database. We also thank the participants of the 2006 Next Generation Sequencing Course, who contributed pyrosequencing reads. This work was supported by National Institutes of Health Grants AI47499 and R56AI063282; United States Department of Agriculture Grant 2002–35212–12355 (to P.I.T.); National Institutes of Health Grant 5P30 DK052574 (to Washington University Digestive Diseases Research Core Center); National Institutes of Health Grant 5T32AI007172 (to S.R.L.); National Institutes of Health Contracts NO1-A1–30055 (to T.E.B.) and N01-AI-30058 (to T.S.W.) (strains are deposited at the Michigan State University STEC Center, supported by this contract); Ministry of Education, Culture, Sports, Science and Technology and Ministry of Health, Labor and Welfare of Japan grants (to T.H.); Interdisziplinäres Zentrum für Klinische Forschung Grant Me2/023/08 and Federal Ministry of Education and Research Grant 01KI0801 (to A.M. and H.K.); and the Melvin E. Carnahan Professorship of Pediatrics (to P.I.T.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data Deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. EU889374–889556, EU889560–889952, EU889956–891224, EU891230–EU892004, EU892010–EU896054, EU896060–EU901119, EU901125–EU906854, FJ197142–197143, and FJ667493–FJ667497).

This article contains supporting information online at www.pnas.org/cgi/content/full/0812949106/DCSupplemental.

References

1. Feng P, Lampel KA, Karch H, Whittam TS. Genotypic and phenotypic changes in the emergence of Escherichia coli O157:H7. J Infect Dis. 1998;177:1750–1753. doi: 10.1086/517438.
2. Whittam TS, Wilson RA. Genetic relationships among pathogenic Escherichia coli of serogroup O157. Infect Immun. 1988;56:2467–2473. doi: 10.1128/iai.56.9.2467-2473.1988.
3. Wang L, Huskic S, Cisterne A, Rothemund D, Reeves PR. The O-antigen gene cluster of Escherichia coli O55:H7 and identification of a new UDP-GlcNAc C4 epimerase gene. J Bacteriol. 2002;184:2620–2625. doi: 10.1128/JB.184.10.2620-2625.2002.
4. Li B, Koch WH, Cebula TA. Detection and characterization of the fimA gene of Escherichia coli O157:H7. Mol Cell Probes. 1997;11:397–406. doi: 10.1006/mcpr.1997.0132.
5. Monday SR, Whittam TS, Feng PC. Genetic and evolutionary analysis of mutations in the gusA gene that cause the absence of beta-glucuronidase activity in Escherichia coli O157:H7. J Infect Dis. 2001;184:918–921. doi: 10.1086/323154.
6. Friedrich AW, et al. Distribution of the urease gene cluster among and urease activities of enterohemorrhagic Escherichia coli O157 isolates from humans. J Clin Microbiol. 2005;43:546–550. doi: 10.1128/JCM.43.2.546-550.2005.
7. Shaikh N, Holt NJ, Johnson JR, Tarr PI. Fim operon variation in the emergence of Enterohemorrhagic Escherichia coli: An evolutionary and functional analysis. FEMS Microbiol Lett. 2007;273:58–63. doi: 10.1111/j.1574-6968.2007.00781.x.
8. Perna NT, et al. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001;409:529–533. doi: 10.1038/35054089.
9. Hayashi T, et al. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 2001;8:11–22. doi: 10.1093/dnares/8.1.11.
10. Britten RJ, Rowen L, Williams J, Cameron RA. Majority of divergence between closely related DNA samples is due to indels. Proc Natl Acad Sci USA. 2003;100:4661–4665. doi: 10.1073/pnas.0330964100.
11. Mellmann A, et al. Recycling of Shiga toxin 2 genes in sorbitol-fermenting enterohemorrhagic Escherichia coli O157:NM. Appl Environ Microbiol. 2008;74:67–72. doi: 10.1128/AEM.01906-07.
12. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007;8:R143. doi: 10.1186/gb-2007-8-7-r143.
13. Nelson K, Selander RK. Intergeneric transfer and recombination of the 6-phosphogluconate dehydrogenase gene (gnd) in enteric bacteria. Proc Natl Acad Sci USA. 1994;91:10227–10231. doi: 10.1073/pnas.91.21.10227.
14. Maynard Smith J, Smith NH. Detecting recombination from gene trees. Mol Biol Evol. 1998;15:590–599. doi: 10.1093/oxfordjournals.molbev.a025960.
15. Besser TE, et al. Greater diversity of Shiga toxin-encoding bacteriophage insertion sites among Escherichia coli O157:H7 isolates from cattle than in those from humans. Appl Environ Microbiol. 2007;73:671–679. doi: 10.1128/AEM.01035-06.
16. Chattopadhyay S, et al. Haplotype diversity in “source-sink” dynamics of Escherichia coli urovirulence. J Mol Evol. 2007;64:204–214. doi: 10.1007/s00239-006-0063-5.
17. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE. Source-sink dynamics of virulence evolution. Nat Rev Microbiol. 2006;4:548–555. doi: 10.1038/nrmicro1446.
18. Cebula TA, Jackson SA, Brown EW, Goswami B, LeClerc JE. Chips and SNPs, bugs and thugs: A molecular sleuthing perspective. J Food Protect. 2005;68:1271–1284. doi: 10.4315/0362-028x-68.6.1271.
19. Van Ert MN, et al. Global genetic population structure of Bacillus anthracis. PLoS ONE. 2007;2:e461. doi: 10.1371/journal.pone.0000461.
20. Karama M, Gyles CL. Methods for genotyping verotoxin-producing Escherichia coli. Zoonoses Public Health. 2009. doi: 10.1111/j.1863-2378.2009.01259.x.
21. Iguchi A, et al. Effects of repeated subculturing and prolonged storage at room temperature of enterohemorrhagic Escherichia coli O157:H7 on pulsed-field gel electrophoresis profiles. J Clin Microbiol. 2002;40:3079–3081. doi: 10.1128/JCM.40.8.3079-3081.2002.
22. Davis MA, Hancock DD, Besser TE, Call DR. Evaluation of pulsed-field gel electrophoresis as a tool for determining the degree of genetic relatedness between strains of Escherichia coli O157:H7. J Clin Microbiol. 2003;41:1843–1849. doi: 10.1128/JCM.41.5.1843-1849.2003.
23. Wick LM, Qi W, Lacher DW, Whittam TS. Evolution of genomic content in the stepwise emergence of Escherichia coli O157:H7. J Bacteriol. 2005;187:1783–1791. doi: 10.1128/JB.187.5.1783-1791.2005.
24. Guttman DS, Dykhuizen DE. Clonal divergence in Escherichia coli as a result of recombination, not mutation. Science. 1994;266:1380–1383. doi: 10.1126/science.7973728.
25. Sharp PM. Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: Codon usage, map position, and concerted evolution. J Mol Evol. 1991;33:23–33. doi: 10.1007/BF02100192.
26. Lenski RE, Winkworth CL, Riley MA. Rates of DNA sequence evolution in experimental populations of Escherichia coli during 20,000 generations. J Mol Evol. 2003;56:498–508. doi: 10.1007/s00239-002-2423-0.
27. Ochman H, Elwyn S, Moran NA. Calibrating bacterial evolution. Proc Natl Acad Sci USA. 1999;96:12638–12643. doi: 10.1073/pnas.96.22.12638.
28. Cossette E, Horard-Herbin M. A contribution to the morphometrical study of cattle in colonial North America. J Archaeol Sci. 2003;30:263–274.
29. Primo AT. El ganado bovino ibérico en las Américas: 500 años después. Arch Zootec. 1992;41:421–432.
30. Toussaint-Samat M. A History of Food. Malden, MA: Wiley-Blackwell; 2008. pp. 87–88.
31. Zhang W, et al. Probing genomic diversity and evolution of Escherichia coli O157 by single nucleotide polymorphisms. Genome Res. 2006;16:757–767. doi: 10.1101/gr.4759706.
32. Manning SD, et al. Variation in virulence among clades of Escherichia coli O157:H7 associated with disease outbreaks. Proc Natl Acad Sci USA. 2008;105:4868–4873. doi: 10.1073/pnas.0710834105.
33. Holt KE, et al. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet. 2008;40:987–993. doi: 10.1038/ng.195.
34. Rocha EP, et al. Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol. 2006;239:226–235. doi: 10.1016/j.jtbi.2005.08.037.
35. Morris RT, Drouin G. Ectopic gene conversions in four Escherichia coli genomes: Increased recombination in pathogenic strains. J Mol Evol. 2004;58:596–605. doi: 10.1007/s00705-002-0900-9.
36. Iguchi A, et al. Genomic comparison of the O-antigen biosynthesis gene clusters of Escherichia coli O55 strains belonging to three distinct lineages. Microbiology. 2008;154(Pt 2):559–570. doi: 10.1099/mic.0.2007/013334-0.
37. Naidu RA, Sawyer S, Deom CM. Molecular diversity of RNA-2 genome segments in pecluviruses causing peanut clump disease in West Africa and India. Arch Virol. 2003;148:83–98. doi: 10.1007/s00705-002-0900-9.
38. Sawyer SA. GENECONV: A computer package for the statistical detection of gene conversion. 1999. Available at www.math.wustl.edu/~sawyer.

Supplementary Materials

Supporting Information
0812949106_ST1_pdf.pdf (890.2KB, pdf)
0812949106_ST2_pdf.pdf (485KB, pdf)
0812949106_ST4_PDF.pdf (20.3KB, pdf)
0812949106_ST5_PDF.pdf (11.3KB, pdf)
0812949106_ST6_PDF.pdf (11.9KB, pdf)
0812949106_ST3.xls (1.3MB, xls)
0812949106_ST7.xls (1.9MB, xls)

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences