Proceedings of the National Academy of Sciences of the United States of America. 2005 Jan 24;102(5):1542–1547. doi: 10.1073/pnas.0408633102

A genomic population genetics analysis of the pathogenic enterocyte effacement island in Escherichia coli: The search for the unit of selection

Amanda Castillo 1, Luis E Eguiarte 1, Valeria Souza 1,*
PMCID: PMC547851  PMID: 15668384

Abstract

Comparative genomic analysis is a powerful tool for understanding the history and organization of complete genomes. The mathematical tools of population genetics combined with genomic analysis provide a powerful approach to dissect heterogeneities in genome evolution. This study presents a hierarchical analysis of the enterocyte effacement island (35 kb), which is found in the enteropathogenic and enterohemorrhagic strains of Escherichia coli and in Citrobacter rodentium. The locus of enterocyte effacement in E. coli is considered to be a clonal unit inside a clonal organism and is expected to evolve as a single unit. This analysis examines the clonal assumption by determining genetic diversity, GC content, and the substitution rates at the different functional levels of (i) the complete pathogenic island, (ii) the five operons in which the island is organized, and (iii) each of the 41 individual genes that comprise the locus. We find that there is a conserved region that is composed of genes that belong to the type III secretion system and that may be products of horizontal transfer. A more diverse region is composed of genes for secreted proteins and genes that we infer to be original components of the E. coli genome. This genetic mosaic seems to be differentially affected by selection and mutation. Our results suggest that recombination and selection may be breaking this structure so that different elements are, at best, weakly coupled in their evolution. These observations suggest that the units of selection are not the complete island, but rather, much smaller units that comprise the island.

Keywords: natural selection, pathogenicity island, positive Darwinian selection, mutation, GC content


Escherichia coli is a diverse bacterial species living in multiple habitats, including the intestine of mammals and other vertebrates as a free organism, commensal organism, or pathogenic organism (1). The comparative analysis of complete genome sequences of four E. coli serotypes, the enterohemorrhagic E. coli (EHEC) pathogens EHEC O157:H7 strain EDL933 (2) and O157 Sakai (3), the uropathogenic strain CFT073 (4), and the nonpathogenic laboratory K-12 MG1655 strain (5), reveals that this bacterium exhibits substantial genome diversity, where only 39.2% of proteins are shared between the four strains (4). This result strongly suggests that the genome is a mosaic that includes a conserved backbone considered to be the core E. coli genome, together with genomic islands comprised of groups of genes interleaved throughout the genome (2). Numerous studies of population genetics of human-related E. coli have strongly suggested that this enteric bacterium is a clonal organism (6, 7), where periodic selection is the cohesive evolutionary force. Yet, when viewed at the whole-genome level, E. coli is a mosaic characterized by different units (islands, operons, and genes) with different evolutionary histories.

Genomic islands and operons are considered units where groups of genes are transcribed together and whose products contribute to a specific function (8). Typical examples of genomic islands are the pathogenicity islands (PAIs) present in pathogenic bacteria that form the principal molecular component responsible for the development of a specific disease (9–11). Evidence supports the idea that PAIs are genetic units that have been horizontally transferred among bacterial species during evolution (11). PAIs have common features including a preference for insertion at tRNA sites and atypical GC content (9, 10). On the other hand, it has been suggested that operons are also mobile elements that originated through horizontal transfer events (8). These genetic units (PAIs and operons), where genes are acting in concert, are expected to evolve at homogeneous rates owing to their mutual interdependence in producing a phenotype (8). Thus, evolutionary parameters of these genetic units such as GC content, genetic diversity, codon usage, and substitution rates are expected to be homogeneous. However, if selection is sufficiently weak and if the magnitude of recombination is sufficiently large, the PAIs may become decoupled in their evolution (8). A second important feature of genomic islands is their hierarchical organization because islands are themselves composed of groups of operons, which are in turn composed of groups of genes that are interdependent in their regulation. The goal of this study is to ask whether an important PAI evolves in a homogeneous fashion, and if not, whether patterns of evolution are homogeneous within the operons and/or within the genes that comprise the PAI.

A PAI of interest is the locus of enterocyte effacement (LEE) that is likely to encode almost all of the genes necessary to produce an intestinal attaching/effacing (A/E) lesion (12); the acquisition of this PAI probably transforms nonpathogenic E. coli strains into pathogenic strains (13). The average size of LEE is ≈35 kb with a GC content of 38% (14, 15), which is very different from the housekeeping genes of E. coli (GC content = 50%) (4). The locus comprises 41 genes that include a type III secretion system (TTSS) (16), an adhesin called intimin (eae) (17), its receptor (tir) (18), several secreted proteins (espA, espD, espB, and espF), and their chaperones (19). The TTSS apparatus directs the transfer of specific proteins across the bacterial envelope, where the secreted proteins function to transfer effector proteins into host cells (16). The adhesin receptor tir is transferred into host cells, where it is modified by host kinases, and becomes inserted into the plasma membrane to orchestrate cytoskeletal rearrangements; this activity depends on its interaction with the adhesin (eae) and tyrosine phosphorylation (18). The secreted proteins are required for the translocation of other proteins into the host cell; in the specific case of espA, this protein forms a filamentous conduit along which secreted proteins travel before they arrive at the translocation pore in the plasma membrane of the host cell, which is composed of espB and espD (19, 20). Before secretion, many secreted proteins are maintained in the bacterial cytoplasm by association with a specific chaperone (19–21). The 41 genes are organized in five polycistronic operons known as LEE1, LEE2, LEE3, TIR, and LEE4, all of them positively regulated by ler (12), which is localized at LEE1. Additionally, recent evidence suggests that there are two more regulators inside LEE: orf10 seems to encode a negative regulator and orf11 seems to encode a positive regulator (22).
However, nearly half of LEE genes (with the exception of the TTSS) seem to have no homologs in other bacteria and have no identified function. LEE islands can be found in a diverse range of A/E pathogens with different host specificity and evolutionary history. These pathogens include E. coli strains that are natural pathogens of animals, such as rabbits, pigs, cats, and dogs, and Citrobacter rodentium, a mouse pathogen.

We present a comparative bioinformatic analysis of the LEE island from six epidemic pathogenic strains of E. coli that includes descriptions of the genetic structures of the islands and a determination of the role of selection and mutation on the present structure of LEE.

Methods

Sequences. For this study, we used six complete sequences for LEE that are available at GenBank; they correspond to the A/E pathogens described in Table 1, which is published as supporting information on the PNAS web site. We carried out three different scales of analysis: (i) we used the complete LEE as a unit, (ii) we dissected LEE into its five operons and analyzed them separately, and (iii) we studied each of the 41 genes that comprise the island individually. ANOVA and Tukey's tests were also performed to test for significant differences in GC content and substitution rates between operons and genes.
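As an illustration of the kind of comparison the ANOVA makes, the following minimal sketch computes a one-way ANOVA F statistic for GC content grouped by operon. The function name and the per-gene GC values are hypothetical and invented for the example; they are not the data of this study, and the sketch stops at the F statistic (it does not compute a P value or Tukey's post hoc comparisons).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists of observations, one list per group.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented per-gene GC percentages for three hypothetical operons.
toy_gc = {
    "operon_A": [32.5, 33.0, 34.1],
    "operon_B": [37.2, 38.0, 38.4],
    "operon_C": [42.9, 43.5, 44.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(toy_gc.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with the reported degrees of freedom indicates that between-operon differences exceed what within-operon scatter would explain.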

Alignment. clustalw (23) was used to produce a multiple alignment of the six LEE islands. The corresponding delimitation of the coding and noncoding regions was performed by using bioedit (24) (the complete alignment is available on request). Both LEE islands of EHEC strains are reported in the reverse orientation compared with the other strains, so we hand-corrected them to be in the same direction. Shiga toxigenic E. coli corresponds to LEE locus II, which contains LEE as generally described, and a region of 23,586 bp that carries additional elements of pathogenesis characteristic of this strain, but that we discarded for this study. A special case is orf3 that is not considered to be present in EPEC strain RDEC-1 and Shiga toxigenic E. coli strains because a mutation of T to C at the first codon interrupts the initial methionine; however, because the rest of the sequence remains unchanged and homologous to the other islands, we included it for the study as additional informative sites. C. rodentium has an inversion of the two first genes of the island (rorf1 and espG), and they are localized at the end of LEE. For the purpose of this study, we analyzed them as if they were in the same order as the rest of the strains. For the espF gene, we included only the partial sequence (606 bp) because the last base pairs of the gene are hypervariable (data not shown).

Genetic Diversity. The genetic diversity of LEE was assessed by the estimation of pi (π), which was calculated from the total number of sites and determined with dnasp (25).
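For readers unfamiliar with π, the sketch below computes it as the average proportion of differing sites over all pairs of aligned sequences. This is a simplified stand-in and not the dnasp implementation: the function name is hypothetical, the toy sequences are invented, and dropping every column that contains a gap or ambiguous base is an assumption made here for simplicity.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise proportion of differing sites (pi) in an alignment."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: keep only columns where every sequence has A, C, G, or T.
    cols = [i for i in range(length) if all(s[i] in "ACGT" for s in seqs)]
    if not cols:
        raise ValueError("no usable sites")
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a[i] != b[i] for i in cols) for a, b in pairs)
    return diffs / (len(pairs) * len(cols))


# Invented toy alignment of three sequences, ten sites each.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
print(nucleotide_diversity(toy))
```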

GC Content. The GC content determination was carried out by using mega 2.1 (26). We present the average GC content for the complete LEE, for each of the five operons, and for every gene. We also include the distribution of GC content at the first, second, and third codon positions of LEE genes.
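The per-position GC content can be sketched as follows, assuming an in-frame coding sequence with no gaps. The function name and the toy sequence are hypothetical; this is an illustration of the calculation and not the mega 2.1 procedure.

```python
def gc_by_codon_position(cds):
    """Return overall GC% and GC% at codon positions 1, 2, and 3."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")

    def gc_percent(bases):
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    result = {"all": gc_percent(cds)}
    for pos in range(3):
        # Every third base, starting at offset pos, is one codon position.
        result[f"pos{pos + 1}"] = gc_percent(cds[pos::3])
    return result


# Invented six-codon toy sequence.
for key, value in gc_by_codon_position("ATGGCATTAGAAAAATAA").items():
    print(key, round(value, 2))
```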

Codon Adaptation Index. The codon adaptation index for the 41 genes of LEE was calculated by using codonw, which was written by J. Peden (Institut Pasteur, Paris), and can be accessed at www.bioweb.pasteur.fr/seqanal/interfaces/codonw.html, based on the work of Ikemura (27).
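The codon adaptation index is usually defined as the geometric mean of the relative adaptiveness of the codons in a gene, where each codon is weighted by its frequency relative to the most used synonymous codon in a reference set of highly expressed genes. The sketch below follows that usual definition and is not the codonw source code: the function names are hypothetical, the reference set is an invented toy, and the 0.5 pseudocount for unobserved codons is an assumption.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODE)
    weights = {}
    # Met, Trp, and stop codons carry no synonymous choice and are skipped.
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODE.items() if a == aa]
        # Assumption: unobserved codons get a pseudocount of 0.5.
        fam_counts = {c: counts[c] or 0.5 for c in family}
        top = max(fam_counts.values())
        for c in family:
            weights[c] = fam_counts[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the codon weights over the scored codons of a gene."""
    ws = [weights[c] for c in codons(cds) if c in weights]
    if not ws:
        raise ValueError("no scorable codons")
    return math.exp(sum(math.log(w) for w in ws) / len(ws))


# Invented reference set in which CTG is the preferred leucine codon.
w = relative_adaptiveness(["CTGCTGCTGCTA"])
print(cai("CTGCTGCTG", w), cai("CTGCTACTA", w))
```

Values near 1 mean that a gene uses the preferred codons of the reference set; low values, such as those reported here for LEE genes, mean that it relies on codons that are rare in that set.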

Phylogenetic Analyses. We constructed the genealogy of the six islands with mega 2.1 (26), using the alignment described above, which corresponds to the complete core consensus sequence homologous between them and includes coding and noncoding sites. The genetic distances were generated under neighbor joining (28) with Tamura–Nei distance (29) (data not shown). The same method was used for the genealogy construction of every individual LEE gene. Character support for the genealogies was assessed by 5,000 bootstrap resamplings of the data (30) in the case of LEE genes, and by 10,000 resamplings for the six complete islands. In addition to the genealogy construction, congruence between genes was assessed by using the incongruence length difference (ILD) test (31), which is available in paup v.4.03b (32). The ILD was performed for combined data matrices of some selected genes, and the test was performed on all matrices with 1,000 data partitions by using branch and bound searches (33). One additional method was used to support the ILD results. Split-decomposition analysis detects conflicting phylogenetic signals by allowing the genealogies to be expressed in a network rather than a tree-like representation of relatedness. This analysis was performed by using splitstree v.2.4 (34) with Hamming distances and using informative sites only.
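The bootstrap step can be illustrated with a deliberately simplified sketch. Instead of rebuilding a neighbor-joining tree with Tamura–Nei distances for each replicate, it resamples alignment columns with replacement and asks how often the closest pair of sequences (by uncorrected proportion of differences) stays the same. All names and the toy alignment are hypothetical; the statistic is a stand-in for clade support and not the procedure used in mega 2.1.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Names of the two sequences with the smallest p-distance."""
    names = sorted(seqs)
    return min(combinations(names, 2),
               key=lambda pair: p_distance(seqs[pair[0]], seqs[pair[1]]))


def bootstrap_support(seqs, replicates=1000, seed=0):
    """Fraction of column-resampled alignments that recover the observed closest pair."""
    rng = random.Random(seed)
    length = len(next(iter(seqs.values())))
    observed = closest_pair(seqs)
    hits = 0
    for _ in range(replicates):
        # Draw alignment columns with replacement, keeping the alignment length.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(s[i] for i in cols) for name, s in seqs.items()}
        if closest_pair(resampled) == observed:
            hits += 1
    return observed, hits / replicates


# Invented toy alignment with three hypothetical strains.
toy = {
    "strain_1": "ACGTACGTACGTACGT",
    "strain_2": "ACGTACGTACGTACGA",
    "strain_3": "ACGAACTTACGGACGA",
}
pair, support = bootstrap_support(toy, replicates=200, seed=1)
print(pair, support)
```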

Statistical Test for Adaptive Selection. The ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site is frequently used as a test for positive Darwinian selection (PDS). When this ratio is equal to 1, the gene is evolving under the neutral model; if this ratio is >1, PDS is inferred, whereas if this ratio is <1, purifying selection is inferred (35, 36). To assess the role of selection among LEE genes, we used two methods. First, we performed a general estimation of dS and dN for the whole-sequence sample; in this case, we used the method developed by Nei and Gojobori (37) as implemented in the dnasp package (25). Second, we used the integrative approach for detecting selection at specific amino acid sites included in hyphy 0.901b, which can be accessed at www.hyphy.org (38), from the site www.datamonkey.org developed by Kosakovsky Pond and Frost (39). The integrative analysis for detecting selection includes three different analyses called single likelihood-derived ancestor counting, approximate likelihood ratio at a site, and full likelihood (see refs. 39–41 for more detailed information on the methodology). All of the analyses start with a given estimate of the genealogy (described above) and fit a codon substitution model; the number of changes that occurred along each genealogy is then estimated with a different methodology in each analysis. We present the average dN/dS ratio and the sites for each LEE gene that are inferred to be under PDS and/or purifying selection.
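To make the dN and dS estimates concrete, the following sketch implements a pairwise calculation in the spirit of the Nei–Gojobori method: count synonymous and nonsynonymous sites, count synonymous and nonsynonymous differences (averaging over mutational pathways when codons differ at more than one position), and apply the Jukes–Cantor correction. It is not the dnasp implementation. The function names and toy sequences are hypothetical, and two choices are assumptions: changes to stop codons are counted as nonsynonymous sites, and pathways through stop codons are ignored.

```python
import math
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                mutant = codon[:pos] + b + codon[pos + 1:]
                # Assumption: a change to a stop codon counts as nonsynonymous.
                if CODE[mutant] == CODE[codon]:
                    s += 1 / 3
    return s


def codon_differences(c1, c2):
    """(synonymous, nonsynonymous) differences, averaged over pathways."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return 0.0, 0.0
    paths = []
    for order in permutations(diff):
        cur, sd, nd, valid = c1, 0, 0, True
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":  # ignore pathways through stop codons
                valid = False
                break
            if CODE[nxt] == CODE[cur]:
                sd += 1
            else:
                nd += 1
            cur = nxt
        if valid:
            paths.append((sd, nd))
    if not paths:  # assumption: no valid pathway, count all as nonsynonymous
        return 0.0, float(len(diff))
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))


def jukes_cantor(p):
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(seq1, seq2):
    """Return (dN, dS) for two aligned, in-frame coding sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    S = N = Sd = Nd = 0.0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in CODE or c2 not in CODE or "*" in (CODE[c1], CODE[c2]):
            continue  # skip codons with gaps, ambiguities, or stops
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        S += s
        N += 3 - s
        sd, nd = codon_differences(c1, c2)
        Sd += sd
        Nd += nd
    return jukes_cantor(Nd / N), jukes_cantor(Sd / S)


# Invented four-codon toy pair; real genes are far longer.
dN, dS = dn_ds("ATGGCTAAACTG", "ATGGCCAGACTA")
print("dN/dS =", dN / dS)
```

A ratio below 1 from such a calculation is read as purifying selection, a ratio near 1 as neutrality, and a ratio above 1 as PDS, as described above.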

Results

General Analysis of LEE Islands. A total of 32,148 bp comprises the core sequence shared by the six LEE islands, including coding and noncoding sites. From the general alignment, it is evident that the core region of LEE is conserved among A/E pathogens in sequence and structure. The variable parts include the flanking regions, the insertion sites, some intergenic regions of genes such as the intimin (eae), rorf1, and espG, and insertions between the operons TIR and LEE4, which are especially evident in C. rodentium (data not shown). From the core sequence of the six islands, 78% (25,067) corresponds to conserved sites, and 22% (7,051) are polymorphic sites. Although there is a high degree of conservation in structure, the genetic diversity is relatively high (π = 0.10). The GC content is low at 38.6% (± 0.46 SE), which is congruent with previous reports (38%) (14, 15). The complete genealogy for the six islands was in agreement with the pathotype relationships described previously (42) (Table 1 and Fig. 1), where the EHEC-1 and enteropathogenic E. coli (EPEC)-1 groups are more divergent and more closely related to each other, and the EHEC-2 group is more conserved and less divergent (42).
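The split into conserved and polymorphic sites can be sketched as a simple column count over the alignment. The function name and toy alignment are hypothetical, and treating a gap as just another character state is an assumption of this sketch.

```python
def site_summary(seqs):
    """Count conserved and polymorphic columns in an alignment."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # A column is polymorphic when more than one character state occurs in it.
    polymorphic = sum(len({s[i] for s in seqs}) > 1 for i in range(length))
    return {
        "sites": length,
        "conserved": length - polymorphic,
        "polymorphic": polymorphic,
        "polymorphic_fraction": polymorphic / length,
    }


# Invented toy alignment of three sequences.
print(site_summary(["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]))
```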

Fig. 1.

Genealogy of the six LEE islands used in the study. The genealogy was constructed by using the 32,148 bp that comprise the core consensus of the six islands (including coding and noncoding sites), under the neighbor-joining method with Tamura–Nei distance, and 10,000 bootstrap resamplings. STEC, Shiga toxigenic E. coli.

LEE Operons. The number of genes varies among LEE operons. From Table 2, which is published as supporting information on the PNAS web site, we observe that the first three operons (LEE1, LEE2, and LEE3) have a similar number of genes, genetic diversity, and GC content. These three operons comprise almost the entire TTSS of the LEE locus. On the other hand, the last two operons (TIR and LEE4) exhibit more variation in gene number and sequence size. LEE1 comprises nine genes (from ler to escU) with an average genetic diversity of π = 0.060. This operon has the lowest GC content (33.1%) of the group, which is significantly lower (P < 0.0001) than the average of the complete LEE locus (38.6%). This operon contains the main regulator recognized for LEE (ler). LEE2 includes six genes (from cesD to sepZ) with a genetic diversity of π = 0.090 and a GC content of 37.8%, close to the average of the complete island. This operon includes the gene with the highest genetic polymorphism (sepZ, π = 0.24), which has not been functionally studied yet. LEE3 includes six genes (from orf12 to espH) with an average genetic variation of π = 0.073 that, with the exception of espH (π = 0.18), is conserved among the genes that make up the operon. The average GC content is 39.2% (including espH). TIR comprises only three genes (tir, cesT, and eae) that have the highest genetic variation observed for the LEE locus (π = 0.133). The average GC content (43.3%) is also significantly higher (P < 0.0001) than the average of the LEE locus. This operon includes the adhesin called intimin (eae) and its receptor (tir) that together are fundamental for the development of the A/E lesion, giving the close attachment to the host membrane that characterizes the A/E pathogens. LEE4 includes eight genes (from sepL to espF) with a genetic diversity similar to TIR (π = 0.129) and a GC content of 41.9%.
This operon contains several secreted proteins such as espA, espD, espB, and espF that are responsible for the signal transduction system and for the melting of the microtubules of the host cell. The ANOVA and Tukey's tests show that GC content differs significantly among the five operons (P < 0.0001).

Genetic Diversity of LEE. The 41 genes of LEE have an average genetic diversity distribution that ranges from π = 0.03 (orf11 SE ± 0.008) to 0.24 (sepZ SE ± 0.027) (Table 3, which is published as supporting information on the PNAS web site). This distribution characterizes the broad range of genetic variation contained within the island. A comparison of the diversity of LEE genes with genes that are part of the conserved backbone of E. coli like mdh (π = 0.01; n = 46), putp (π = 0.02; n = 12), fimA (π = 0.06; n = 7), and trpA (π = 0.03; n = 25) (43) reveals the high genetic diversity characteristic of the pathogenic genes. This finding is especially evident if we consider the sample size of the present study (n = 6), and that all of the strains used belong to epidemic clones. It will be of future interest to explore this variation in nonepidemic strains from a broad range of E. coli natural hosts (A.C., unpublished work).

GC Content of LEE Genes and Codon Adaptation Index. The average GC content of LEE genes also exhibits a broad distribution from 28.3% SE ± 0.33 (rorf3) to 53.1% SE ± 0.22 for espF (Fig. 2). To describe in more detail how this GC content is distributed among the different coding positions of the genes, we divided it into first, second, and third positions. The GC content is significantly different (P < 0.0001) at the first position (45.09%), as compared with the second (34.4%) and third (33.5%) positions that are similar to each other. It has been proposed that sequences introduced by horizontal transfer, as suspected for the complete LEE locus, will be affected by the same mutational processes as the recipient genome and eventually will converge to the base composition and codon usage of the resident genome (44). This process should occur most rapidly at sites with little or no functional constraint, particularly the third position, where most changes are synonymous. Accordingly, we expect the third position to have a higher GC content. However, in this case, the first position is the most similar to the resident genome in GC content. Perhaps more interesting is the observation that the third position has an even lower average GC content than the second position. If mutational processes and selection affect all genes homogeneously, we might expect GC content to have increased toward the E. coli average (55.4% for the third position, 40.7% for the second position, and 58.8% for the first position) (45); instead, we observe a very low average GC content at the third position (33.5%). These results indicate a high codon bias for LEE genes. This result could be interpreted as a signal of conservation from its original source, even when the island has been horizontally transferred on multiple occasions for a considerable time (11, 13), or it could be a signal of regulation, because LEE is only expressed in the logarithmic growth phase during infection of the host (17).
We calculated the codon adaptation index (0.200 ± 0.0016) for the genes of the LEE locus (excluding C. rodentium) and found that it was considerably lower than the average for E. coli genes (0.485 ± 0.051) (4). Low values of this index indicate the use of rare codons. In this case, it shows that the genes of LEE are biased and differ from the remainder of the E. coli genome.

Fig. 2.

GC content, genetic diversity, and dN/dS ratio distribution for the 41 genes of LEE.

Genealogy of LEE Genes. The phylogenetic study of LEE island genes highlighted differences in the evolutionary histories of some of its members. Almost all of the 41 genes give the same branch order among the strains, in congruence with that described for the genealogy based on the pooled LEE gene sequence data. Interesting exceptions are orf3, cesD, rorf6, rorf8, sepZ, orf16, espH, cesF, map, tir, eae, espA, and espB, which give different branch orders and lengths (the genealogy of each gene is available on request). In these genes, the EPEC strain E2348/69 is very divergent, not closely related to the EHEC-1 group as described before, but is grouped with C. rodentium. We selected a sample of these genes (Table 4, which is published as supporting information on the PNAS web site) and, for the ILD test, constructed pairwise matrices that combined some of the genes with a different genealogy with a sample of those that share the consensus LEE genealogy. The ILD test statistically confirmed the topological incongruence, which supports the idea that some genes have a different phylogenetic history than the rest of LEE. Significant differences (P = 0.001) were observed when we compared some genes of the TTSS (escJ, escR, escS, escU, escV, escN, escD, sepZ, and escF) with espH, map, tir, eae, espA, and espB (Fig. 3). We also used these genes in the split-decomposition analysis to test for incongruence and for possible signals of recombination. We observed recombination for espH, map, tir, eae, espA, and espB (Fig. 4). This result suggests that recombination may be breaking the linkage between these genes and allowing them to diverge.

Fig. 3.

Genealogy of nine LEE genes used in the study. The genealogy of each gene was constructed under the neighbor-joining method with Tamura–Nei distance and 5,000 bootstrap resamplings. The genes escJ, escV, escN, and escF belong to the TTSS. The genes espH, map, and espA are secreted proteins, and tir is the receptor for the adhesin eae.

Fig. 4.

Split-decomposition analysis of eight LEE genes.

Nucleotide Substitutions of LEE Genes. We determined the dS and dN for the 41 genes of the LEE locus. The dS is assumed to be primarily related to mutational processes because synonymous substitutions do not alter the amino acid composition. The dS estimates range from 0.10 (SE ± 0.02) for orf11 to 1.12 (SE ± 0.20) for cesF (Fig. 2). From this analysis, we can observe that dS estimates are not uniform among the members of LEE.

The dN describes the substitution rates at sites with amino acid changes, so it is an index of both selective and neutral events. The dN estimates for the LEE genes are more restricted in distribution than the dS. This result is especially true in the case of the TTSS, where we find values as low as dN = 0.011, SE ± 0.001 (escS), indicating that most mutations at these positions are eliminated. These genes have homologs in Salmonella and Yersinia (46) and are clearly a product of horizontal gene transfer. In contrast, dN estimates for some genes are high (Table 3), suggesting that some LEE members may show a signal of PDS (sepZ, dN = 0.221, SE ± 0.024; tir, dN = 0.202, SE ± 0.024; espH, dN = 0.176, SE ± 0.019; espB, dN = 0.165, SE ± 0.020; espF, dN = 0.121, SE ± 0.016; map, dN = 0.108, SE ± 0.013; cesF, dN = 0.107, SE ± 0.011; espD, dN = 0.107, SE ± 0.015; eae, dN = 0.096, SE ± 0.011; escF, dN = 0.086, SE ± 0.032; and espA, dN = 0.082, SE ± 0.010). Some of these genes have direct interaction with the host.

dN/dS ratios were averaged over all of the sites in the gene sequence and were also estimated for each site of the gene sequence. Thus, we obtained the average dN/dS and also an indication of sites that may be under purifying selection and/or PDS. For LEE, although there is high polymorphism, most members appear to be under purifying selection (dN/dS <1), whereas some genes are close to neutrality (dN/dS = 1), but none of the genes appear to be under PDS (dN/dS >1) (Fig. 2). The gene that has the highest dN/dS ratio is espG (0.54), which encodes a secreted protein; in C. rodentium, this gene is localized at the far end of the island, possibly as a product of a rearrangement. Other genes with a high dN/dS ratio are espF, espH, and sepZ with 0.43, 0.41, and 0.39, respectively. The lowest dN/dS ratios are found in some members of the TTSS, with values as low as 0.05 for escS or 0.07 for escT and escC.

The result from the site-by-site analysis shows that few genes have sites under PDS. These genes are espG (one site), map (one site), tir (one site), eae (three sites), espD (one site), and espF (one site) (Table 3); none of them are part of the TTSS. These are interesting sites for the study of directed mutagenesis and gene therapy because they are genes that have an important role in the virulence of A/E pathogens, especially the intimin (eae) and its receptor (tir). Two of the regulators of LEE seem to be neutral (orf10 and orf11), with no sites under adaptive or purifying selection. Most of the genes with the highest number of sites under purifying selection belong to the TTSS (orf4, orf5, escR, escC, escJ, escV, and escN) (Table 5, which is published as supporting information on the PNAS web site). TTSS genes are involved in the development of a complex structure that crosses the membrane, so perhaps any change at these sites is purged to preserve the structure. The receptor of the intimin (tir) is a special case that, along with escV (TTSS) and escC (TTSS), presents the highest number of sites under purifying selection, with 102, 110, and 115 sites, respectively.

Discussion

The pathogenic LEE island has been the focus of numerous epidemiological and molecular studies because it represents an excellent model for the evolution of pathogenesis. The acquisition of this PAI is thought to confer pathogenic characteristics upon a normal commensal E. coli strain (13). From the evolutionary point of view, LEE has been considered a genetic unit that has been horizontally transferred during the evolution of A/E pathogens (13). However, from the analysis of wild hosts of E. coli, it has been suggested that this island could be more dynamic, involving an ongoing process of construction and disruption (47). This PAI confers different fitness to some pathogenic strains of E. coli, as in the case of the serotype O157:H7, a successful epidemic clone responsible for important epidemic outbreaks that have caused the death of adults and children (15).

From the present study, we observe an increase of genetic diversity and GC content along the LEE island as we analyze it from its 3′ end through the 5′ end (Fig. 2). The TTSS, largely represented in the first three operons, is the group of genes with lower levels of genetic diversity and GC content. On the other hand, genes such as map, tir, eae, espA, espD, espB, and espF present the highest levels of genetic diversity and GC content closer to the average of E. coli genes. From the GC content, substitution rates, and phylogenetic analyses, we infer that the TTSS genes travel together as a cluster of genes linked by function. The nonsynonymous substitution rate for these genes is low and purifying selection is eliminating diversity. This group of genes, which has the lowest GC content and genetic diversity and conserved nucleotide substitution rates, may preserve the phylogenetic signal of the early formation of the island. These results, together with the finding of other regulators of the secretion system localized outside the LEE island (22), strongly support the idea that the TTSS may also be participating in other processes not necessarily related to virulence. The parallel acquisition of some other virulence factors, such as an invasin (eae), by nonpathogenic E. coli strains that already have a TTSS may increase their virulence, completing the molecular scenario for the appearance of a new pathogen. We suggest that the TTSS could be a good candidate for calculating the time of its integration into the E. coli genome and for comparing it with that of the adhesin (eae) or of some secreted proteins such as espB. This result may reveal the time when the LEE island was originally assembled. Otherwise, because the TTSS is involved in pathogenesis, this finding may provide targets for future therapies.

The fact that genes like the adhesin and its receptor are more divergent, less conserved, and with different genealogies than the rest of the island, supports the hypothesis that virulence is a recent derived state that may be a result of the parallel acquisition of virulence factors. The presence of genes like the adhesin, which is a mosaic product of recombination (48, 49), suggests that the origin and evolution of the LEE island is a complex process.

From the observed polymorphism present at some LEE genes and the nucleotide substitution results, we expected that some genes would have a dN/dS ratio >1, a clear sign of the participation of PDS. However, none of the loci seem consistent with this assumption, and again this ratio is highly variable across the genes of the island. The second part of the analysis highlighted the specific sites under PDS or purifying selection. Interestingly, it seems that PDS had, at best, a minor role in shaping the history of the LEE island, but this does not mean that its role was unimportant. A few sites of LEE genes are under PDS in genes that are fundamental for the correct development of the A/E lesion, such as map or the adhesin and its receptor. These sites are very important for directed mutagenesis analysis and antibiotic treatment because these genes are key virulence factors in the A/E lesion. On the other hand, the participation of purifying selection is evident and important for several genes of LEE, like some members of the TTSS, where any change is purged to preserve the protein structure and function. Thus, purifying selection seems to be delimiting important regions for the protein structure.

There are two different types of genealogy for LEE. The TTSS has a different genealogy than other genes like map, tir, or eae, indicating that at least two different transfer events originated the island or that recombination is weakening the assemblage. If recombination is common, we might expect to reconstruct many different genealogies within the island. Curiously, we only have two types of relatedness inside LEE, the general consensus LEE genealogy and the alternative genealogy where the EPEC strain is more related to C. rodentium. The ILD test shows that there is congruence between the genes that belong to the TTSS, so it is clear that this group of genes has a shared ancestry and has been traveling together since the origin of LEE. On the other hand, internal recombination is shaping the genealogies of genes such as map, tir, eae, espA, and espB, and may be promoting the development of new pathotypes. Thus, the ancient formation of the LEE island may be a product of two different transfer events: the first was the acquisition of the TTSS, and the second was the acquisition of genes such as the adhesin. However, although they are not part of the conserved backbone of genes in E. coli, some unique genes in LEE (like tir) may have been generated de novo. This result is possible, especially if we consider pathogenesis as a derived state in E. coli.

There is a high degree of heterogeneity present among LEE island genes in genetic diversity, GC content, and nucleotide substitution rates. We dissect these heterogeneities beginning with the complete LEE island level on through its constituent operons and genes. A group of genes linked by function and coregulation might be expected to experience common mutational and selective processes. Consequently, if the island is a genetic unit evolving in concert, this fact will be reflected in the conservation of a signature present in GC content, genetic diversity, and nucleotide substitution rates, especially if they share common ancestry and are regulated in synchrony. In this study, we find that diversity, GC content, and nucleotide substitution rates are variable, suggesting that mutation and selection are acting with different intensity within the PAI, generating a mosaic. The results derived from the present study suggest that the assumption that LEE was assembled and has been evolving as a unit is not supported by the data. There probably have been recombination events and different selection pressures for different parts of LEE generating a genetic mosaic, which will create different coalescent times for different regions of the PAI.

The present study included only pathogenic strains from epidemic clones, where selection is believed to be sufficiently strong to maintain LEE as a unit during horizontal transfer. However, even in this group of strains, the phylogenetic signal points to different transfer events in the origin of the LEE island. The results of this study suggest that the origin and evolution of LEE are more complex than previously thought. The unit of selection in this specific case is not the whole island, but smaller modules inside the island and within its constituent genes. This study may also help delimit the minimal LEE unit needed for A/E strains to become pathogens in the first instance. Future work will need to explore both atypical strains of EPEC and EHEC and wild strains of E. coli not associated with humans, to better understand the evolution of pathogenesis.

Supplementary Material

Supporting Tables
pnas_102_5_1542__.html (7.3KB, html)

Acknowledgments

We thank L. Martínez-Castilla for technical assistance, C. Silva and L. Forney for carefully reviewing the manuscript, and two anonymous reviewers and Dr. Mike Clegg, who made interesting suggestions that considerably improved this study. This work was supported by Consejo Nacional de Ciencia y Tecnología Genomic Project 0028 and Dirección General de Asuntos del Personal Académico Grant IN-208601.

Abbreviations: EPEC, enteropathogenic Escherichia coli; EHEC, enterohemorrhagic E. coli; TTSS, type III secretion system; PAI, pathogenicity island; LEE, locus of enterocyte effacement; A/E, attaching/effacing; dS, synonymous substitutions per synonymous site; dN, nonsynonymous substitutions per nonsynonymous site; PDS, positive Darwinian selection; ILD, incongruence length difference.


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
