Proceedings of the National Academy of Sciences of the United States of America. 2015 Jul 7;112(29):9070–9075. doi: 10.1073/pnas.1510839112

Recombinant transfer in the basic genome of Escherichia coli

Purushottam D Dixit, Tin Yau Pang, F William Studier, Sergei Maslov
PMCID: PMC4517234  PMID: 26153419

Significance

A significant fraction of the length of Escherichia coli genomes comprises mobile elements integrated at various sites in a ∼4-Mbp basic genome shared by the species. We find that the entire basic genome is continually exchanged by homologous recombination with genome fragments acquired from other genomes in the population. Evolutionary groups appear to exchange DNA preferentially within the same group but also with other groups to different extents. Entering DNA is often fragmented by restriction systems of the recipient cell, with surviving pieces replacing homologous parts of the recipient chromosome. Coevolving populations of phages that package genome fragments and deliver them to cells that have appropriate receptors are likely mediators of most DNA transfers, distributing variability throughout the species.

Keywords: E. coli evolution, basic genome, core genome, recombinant transfer, generalized transduction

Abstract

An approximation to the ∼4-Mbp basic genome shared by 32 strains of Escherichia coli representing six evolutionary groups has been derived and analyzed computationally. A multiple alignment of the 32 complete genome sequences was filtered to remove mobile elements and identify the most reliable ∼90% of the aligned length of each of the resulting 496 basic-genome pairs. Patterns of single base-pair mutations (SNPs) in aligned pairs distinguish clonally inherited regions from regions where either genome has acquired DNA fragments from diverged genomes by homologous recombination since their last common ancestor. Such recombinant transfer is pervasive across the basic genome, mostly between genomes in the same evolutionary group, and generates many unique mosaic patterns. The six least-diverged genome pairs have one or two recombinant transfers of length ∼40–115 kbp (and few if any other transfers), each containing one or more gene clusters known to confer strong selective advantage in some environments. Moderately diverged genome pairs (0.4–1% SNPs) show mosaic patterns of interspersed clonal and recombinant regions of varying lengths throughout the basic genome, whereas more highly diverged pairs within an evolutionary group or pairs between evolutionary groups having >1.3% SNPs have few clonal matches longer than a few kilobase pairs. Many recombinant transfers appear to incorporate fragments of the entering DNA produced by restriction systems of the recipient cell. A simple computational model can closely fit the data. Most recombinant transfers seem likely to be due to generalized transduction by coevolving populations of phages, which could efficiently distribute variability throughout bacterial genomes.


The increasing availability of complete genome sequences of many different bacterial and archaeal species, as well as metagenomic sequencing of mixed populations from natural environments, has stimulated theoretical and computational approaches to understand mechanisms of speciation and how prokaryotic species should be defined (1–8). Much genome analysis and comparison has been at the level of gene content, identifying core genomes (the set of genes found in most or all genomes in a group) and the continually expanding pan-genome. Population genomics of Escherichia coli has been particularly well studied because of its long history in laboratory research and because many pathogenic strains have been isolated and completely sequenced (9–14). Proposed models of how related groups or species form and evolve include isolation by ecological niche (7–9, 11, 15), decreased homologous recombination as divergence between isolated populations increases (2–4, 8, 14, 16), and coevolving phage and bacterial populations (6).

E. coli genomes are highly variable, containing an array of phage-related mobile elements integrated at many different sites (17), random insertions of multiple transposable elements (18), and idiosyncratic genome rearrangements that include inversions, translocations, duplications, and deletions. Although E. coli grows by binary cell division, genetic exchange by homologous recombination has come to be recognized as a significant factor in adaptation and genome evolution (9, 10, 19). Of particular interest has been the relative contribution to genome variability of random mutations (single base-pair differences referred to as SNPs) and replacement of genome regions by homologous recombination with fragments imported from other genomes (here referred to as recombinant transfers or transferred regions). Estimates of the rate, extent, and average lengths of recombinant transfers in the core genome vary widely, as do methods for detecting transferred regions and assessing their impact on phylogenetic relationships (12–14, 20, 21).

In a previous comparison of complete genome sequences of the K-12 reference strain MG1655 and the reconstructed genome of the B strain of Delbrück and Luria referred to here as B-DL, we observed that SNPs are not randomly distributed among 3,620 perfectly matched pairs of coding sequences but rather have two distinct regimes: sharply decreasing numbers of genes having 0, 1, 2, or 3 SNPs, and an abrupt transition to a much broader exponential distribution in which decreasing numbers of genes contain increasing numbers of SNPs from 4 to 102 SNPs per gene (22). Genes in the two regimes of the distribution are interspersed in clusters of variable lengths throughout what we referred to as the basic genome, namely, the ∼4 Mbp shared by the two genomes after eliminating mobile elements. We speculated that genes having 0 to 3 SNPs may primarily have been inherited clonally from the last common ancestor, whereas genes comprising the exponential tail may primarily have been acquired by horizontal transfer from diverged members of the population.

The current study was undertaken to extend these observations to a diverse set of 32 completely sequenced E. coli genomes and to analyze how SNP distributions in the basic genome change as a function of evolutionary divergence between the 496 pairs of strains in this set. We have taken a simpler approach than those of Touchon et al. (13), Didelot et al. (14), and McNally et al. (21), who previously analyzed multiple alignments of complete genomes of E. coli strains. The appreciably larger basic genome derived here is not restricted to protein-coding sequences and retains positional information.

Results and Discussion

Deriving Basic Genomes.

We selected for analysis the completely sequenced chromosomal genomes of 32 independently isolated E. coli strains from six previously defined evolutionary groups: A, B1, B2, D1, D2, and E (Fig. 1). Whole-genome multiple alignment of all 32 genomes was produced by the Mauve program (23). Computational filters and procedures described in Materials and Methods generated a 3,955,192-bp alignment that has eliminated essentially all mobile elements and approximates the basic genomes of these 32 strains. The organization of 21 of these 32 basic genomes is that of the comprehensively annotated K-12 laboratory strain MG1655. The remaining 11 genomes contained idiosyncratic inversions and translocations, most of which were reconfigured manually to align with the consensus organization in the multiple alignment (SI Materials and Methods).

Fig. 1.

Phylogenetic tree derived from filtered genome-wide average SNP densities (Δ) between 496 pairs of 32 basic genomes. Previously recognized phylogenetic groups: E (light blue), A (green), B1 (blue), D2 (yellow), D1 (brown), and B2 (red). The tree was calculated using the UPGMA algorithm. Dots in the lines connecting pairs of evolutionary groups are placed approximately at average SNP densities between them. The groups are ordered to fit the relative divergences among them summarized in Table S1. GenBank accession numbers are given in Table S3.

Reliable Genomewide SNP Densities.

The filtered multiple alignment contains 105 ordered alignment blocks that were arbitrarily divided into tandem strings of 1-kbp segments, starting at the left end of each block. This process generated 3,903 segments of 1 kbp in 88 strings from 1 to 370 segments long, separated by 105 shorter right-end segments covering ∼1.3% of the multiple alignment. The segments are numbered 1–4,008 from left to right. Segment ends are indexed to both the basic- and complete-genome sequences of each strain to allow easy comparisons and link to annotations. SNPs between each of the 496 pairs of basic genomes and cumulatively across all 32 strains are extracted directly from the filtered multiple alignment.
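The segmentation described above can be sketched in a few lines. This is only an illustration of the tiling rule (tandem 1-kbp segments from the left end of each block, with a shorter right-end remainder); the function names `split_block` and `number_segments` are hypothetical and are not taken from the authors' pipeline.

```python
def split_block(block_length, segment_length=1000):
    """Tile one alignment block from its left end.

    Returns 0-based, half-open (start, end) pairs: as many full-length
    segments as fit, followed by one shorter right-end segment if any
    aligned positions remain.
    """
    segments = []
    start = 0
    while start + segment_length <= block_length:
        segments.append((start, start + segment_length))
        start += segment_length
    if start < block_length:
        # Shorter right-end segment covering the remainder of the block.
        segments.append((start, block_length))
    return segments


def number_segments(block_lengths, segment_length=1000):
    """Number segments 1..N from left to right across ordered blocks.

    Each entry is (segment_number, block_index, start, end).
    """
    numbered = []
    for block_index, length in enumerate(block_lengths):
        for start, end in split_block(length, segment_length):
            numbered.append((len(numbered) + 1, block_index, start, end))
    return numbered
```

A block shorter than 1 kbp yields only a right-end segment under this rule, which is consistent with 105 blocks giving 88 strings of full-length segments.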

SNPs are accurately identified in regions where all 32 basic genomes are unambiguously aligned, but highly variable regions, particularly where alignment lengths differ, can be problematic. To minimize erroneous SNPs due to such multiple-alignment difficulties, we focused most of our analyses on a set of 3,769 segments of 1 kbp that have an approximately normal distribution over cumulative SNP densities of 0.3–18.0% (averaging 7.5%) and cover 95.3% of the filtered 32-genome multiple alignment (Fig. S1). The 134 most-diverged 1-kbp segments in the scattered tail of the distribution extend to 60.2% cumulative SNP density and are primarily in known regions of high variability and subject to known selective pressures, including genes for making O-antigens, lipopolysaccharides, flagella, fimbrial-adhesins, DNA modification and restriction enzymes, and surface receptors for phages and colicins. The most variable parts of such exchangeable regions did not pass the computational filters and are not present in the aligned basic genomes analyzed here.

Fig. S1.

Distribution of the 3,903 basic-genome segments of 1 kbp as a function of cumulative SNP density in the filtered 32-strain basic-genome alignment (δcum is the percentage of aligned base pair positions having a SNP in any strain). An approximately normal distribution of 3,769 segments with δcum = 0.3–18.0%, average of 7.5%, is followed by a scattered tail of 134 segments with δcum = 18.1–60.2%, average of 29.1%.
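As an illustration of the quantity δcum defined in this legend, the following sketch counts alignment columns in which at least two strains differ. The treatment of gapped columns (skipped here) is an assumption, and `cumulative_snp_density` is a hypothetical name rather than part of the published analysis.

```python
def cumulative_snp_density(aligned_segments, gap="-"):
    """Percentage of scored alignment columns with a SNP in any strain.

    `aligned_segments` holds one aligned string per strain for the same
    segment. Columns containing a gap are skipped (an assumption).
    """
    if not aligned_segments:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_segments[0])
    if any(len(seq) != length for seq in aligned_segments):
        raise ValueError("sequences must have equal aligned length")

    scored = 0
    variable = 0
    for column in zip(*aligned_segments):
        if gap in column:
            continue  # assumption: indel columns are not scored
        scored += 1
        if len(set(column)) > 1:
            variable += 1
    return 100.0 * variable / scored if scored else 0.0
```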

Even in regions where cumulative SNP densities are in the normal range, alignment problems involving group-specific or individual deletions, misplaced remnants of insertion sequence (IS) elements, variable numbers of repeats, or other idiosyncrasies can generate false SNPs in some genome pairs. To minimize such problems, most of our analyses were limited to perfectly aligned 1-kbp segments having no indels. Perfectly aligned segments in the set of 3,769 usually cover ∼90% of an aligned basic-genome pair. This additional filter reduces average SNP density by 5–15% (but as much as 30% between closely related genomes with few total SNPs).

SNP Distributions in Genome Pairs.

Our measure of evolutionary divergence is SNP density, the average percentage of SNPs between perfectly aligned 1-kbp segments in the set of 3,769, referred to here as Δ for an entire basic-genome pair and δ for individual segments. Distributions of SNP densities among individual segments are shown in Fig. 2A for five basic-genome pairs over the range of Δ = 0.38–2.54%. The alignments are between the K-12 reference genome MG1655 and two other group A genomes, and between MG1655 and one genome each from groups B1, E, and B2. The two group A pairs show a sharply decreasing number of segments with increasing numbers of SNPs from 0 to 3 per 1-kbp segment (0–0.3% SNP density), which we refer to as the clonal peak on the assumption that most of the segments in that peak are likely to have been clonally inherited from a common ancestor. Consistent with our previous observations using matched protein-coding sequences between MG1655 and B-DL (22), the more highly diverged segments are distributed in a roughly exponential tail extending from the clonal peak.
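A minimal sketch of how per-segment SNP counts and their distribution could be tabulated for one genome pair is given below. It assumes that each segment is available as a pair of aligned strings with "-" marking indels; the helper names are hypothetical, and the short strings in any usage stand in for real 1-kbp segments.

```python
from collections import Counter


def pairwise_snps(seg_a, seg_b, gap="-"):
    """Count mismatches between two aligned segments.

    Returns None when the segment is not perfectly aligned, that is,
    when either sequence contains an indel.
    """
    if len(seg_a) != len(seg_b):
        raise ValueError("aligned segments must have equal length")
    if gap in seg_a or gap in seg_b:
        return None
    return sum(1 for a, b in zip(seg_a, seg_b) if a != b)


def snp_histogram(segment_pairs):
    """Map SNP count -> number of perfectly aligned segments with that count."""
    histogram = Counter()
    for seg_a, seg_b in segment_pairs:
        count = pairwise_snps(seg_a, seg_b)
        if count is not None:
            histogram[count] += 1
    return histogram
```

For 1-kbp segments, a count of n SNPs corresponds to a density δ of n/10 percent, so the bins 0 to 3 of such a histogram make up the clonal peak described above.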

Fig. 2.

Distributions of SNP densities between basic genomes. (A) Distribution of perfectly aligned 1-kbp segments from the set of 3,769 as a function of average SNP density δ for five basic-genome pairs: group A strain MG1655 aligned with ETEC (A–A, Δ = 0.38%, black triangles); B–DL (A–A, Δ = 0.72%, brown squares); SE11 (A–B1, Δ = 1.16%, red triangles); O157 (A–E, Δ = 1.56%, blue circles); and IHE (A–B2, Δ = 2.54%, green diamonds). (B) Distributions of SNP density as a function of δ as predicted by the computational model given in Supporting Information for pairs of genomes having the same average SNP densities Δ as the five genome pairs in A.

A clonal peak is apparent only when both of the paired genomes are in the same evolutionary group, but not all genome pairs within a group show a clonal peak: the 28 pairs between the 8 genomes of group A, the 28 pairs between the 8 genomes of group B1, and the single pairs in D2 and E all show at least a small peak; however, only 4 of the 45 pairs between the 10 genomes of B2 show a clonal peak, and neither the single pair in D1 nor any of the 392 pairs between genomes from different groups show a pronounced clonal peak (Dataset S1). As the clonal peak decreases, the increasing number of segments in the exponential tail maintains approximately the same slope, and the most highly diverged genome pairs have a broad maximum around δ = 1–2% (Fig. 2A). Our computational model of genome divergence, summarized in a later section and detailed in Supporting Information, fits the observed distributions quite well over a broad range of Δ (Fig. 2B).

Recombinant Transfers and Mosaic Genomes.

For simplicity, we refer to all perfectly aligned 1-kbp segments having 0 to 3 SNPs as clonal, because distinguishing whether such segments were inherited vertically from a common ancestor or represent incidental matches to mosaic regions acquired by recombinant transfer is not always unambiguous. We also refer to segments having more than three SNPs as transferred, meaning that the SNPs were acquired at least in part by recombinant transfer from a diverged genome, even though we recognize that some isolated segments containing four to six SNPs are likely to be clonal. Using these designations, we calculated four quantities for each of the 496 basic-genome pairs: Δ, the average SNP density for all perfectly aligned 1-kbp segments from the set of 3,769; fc, the fraction of these segments containing 0 to 3 SNPs, referred to as the clonal fraction; Δc, the average SNP density in the clonal fraction; and Δt, the average SNP density in the remaining, putatively transferred segments (Dataset S1). The values of fc, Δc, and Δt are plotted as a function of Δ for all 496 genome pairs in Fig. 3 A–C and summarized for all combinations between evolutionary groups in Table S1.
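The four quantities defined above follow directly from a list of per-segment SNP counts. The sketch below assumes equal-length 1-kbp segments and uses the 0-to-3 SNP threshold stated in the text; the function name and the dictionary keys are hypothetical.

```python
def divergence_summary(snp_counts, segment_length=1000, clonal_max_snps=3):
    """Compute Δ, fc, Δc, and Δt (densities in percent) for one genome pair.

    `snp_counts` lists the SNP count of each perfectly aligned segment.
    A density is None when its class of segments is empty.
    """
    if not snp_counts:
        raise ValueError("need at least one segment")

    clonal = [c for c in snp_counts if c <= clonal_max_snps]
    transferred = [c for c in snp_counts if c > clonal_max_snps]

    def density(counts):
        if not counts:
            return None
        return 100.0 * sum(counts) / (len(counts) * segment_length)

    return {
        "delta": density(snp_counts),        # Δ, all segments
        "fc": len(clonal) / len(snp_counts), # clonal fraction
        "delta_c": density(clonal),          # Δc, clonal segments
        "delta_t": density(transferred),     # Δt, transferred segments
    }
```

With equal-length segments these quantities satisfy Δ = fc·Δc + (1 − fc)·Δt, which is why Δt approaches Δ once the clonal fraction becomes negligible.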

Fig. 3.

Values of clonal fraction, Δc, and Δt as a function of overall divergence Δ in 496 basic-genome pairs. (A) Clonal fraction fc; (B) average SNP density in the clonal fraction Δc; (C) average SNP density in the transferred fraction Δt. Data points for genome pairs within evolutionary group A are given by solid green circles; B1 by blue squares; B2 by red diamonds; between A and B1 by violet asterisks; and other pairs by a black X. Dashed red lines with error bars are values predicted by the computational model given in Supporting Information. Error bars correspond to SD of 100 runs of the model.

Table S1.

Summary of divergences between 496 basic-genome pairs within and between evolutionary groups, limited to perfectly aligned 1-kbp segments from the set of 3,769

Group pair | No. of pairs | Δ min, % | Δ max, % | Clonal fraction max | Clonal fraction min | Δt* avg, % | Δt* SD
A–A | 28 | 0.179 | 0.876 | 0.909 | 0.405 | 1.393 | 0.031
B1–B1 | 28 | 0.111 | 0.678 | 0.956 | 0.468 | 1.175 | 0.016
B2–B2 | 45 | 0.034 | 1.221 | 0.987 | 0.147 | 1.279 | 0.071
D1–D1 | 1 | 1.296 | | 0.200 | | 1.583 |
D2–D2 | 1 | 1.217 | | 0.278 | | 1.658 |
E–E | 1 | 0.088 | | 0.979 | | |
A–B1 | 64 | 1.034 | 1.189 | 0.271 | 0.184 | 1.413 | 0.016
A–E | 16 | 1.542 | 1.593 | 0.071 | 0.060 | 1.676 | 0.013
B1–E | 16 | 1.602 | 1.629 | 0.064 | 0.055 | 1.706 | 0.006
D1–D2 | 4 | 1.879 | 2.008 | 0.049 | 0.031 | 2.018 | 0.044
A–D2 | 16 | 2.066 | 2.119 | 0.055 | 0.027 | 2.168 | 0.024
B1–D2 | 16 | 2.072 | 2.130 | 0.031 | 0.021 | 2.151 | 0.022
D2–E | 4 | 2.126 | 2.162 | 0.026 | 0.019 | 2.189 | 0.019
B2–D1 | 20 | 2.111 | 2.195 | 0.040 | 0.024 | 2.228 | 0.017
A–D1 | 16 | 2.226 | 2.265 | 0.041 | 0.020 | 2.317 | 0.009
B1–D1 | 16 | 2.251 | 2.295 | 0.028 | 0.018 | 2.323 | 0.010
D1–E | 4 | 2.299 | 2.314 | 0.022 | 0.018 | 2.350 | 0.003
B2–D2 | 20 | 2.392 | 2.512 | 0.017 | 0.012 | 2.499 | 0.030
A–B2 | 80 | 2.496 | 2.566 | 0.028 | 0.011 | 2.578 | 0.012
B1–B2 | 80 | 2.516 | 2.575 | 0.016 | 0.011 | 2.581 | 0.012
B2–E | 20 | 2.533 | 2.578 | 0.015 | 0.010 | 2.594 | 0.011

*Excluding genome pairs with clonal fraction >0.9 in A, B1, B2, and E.

These data show that recombinant transfers are pervasive throughout the basic genome. The 104 genome pairs in which both genomes are from any one of the six evolutionary groups have clonal fractions that decrease approximately linearly from >0.90 to ∼0.15 as Δ increases from <0.1% to ∼1.3% (Fig. 3A). Over the same range, Δc increases from <0.02% to ∼0.2% (Fig. 3B). Clearly, most of the divergence between basic-genome pairs within an evolutionary group is due to accumulating recombinant transfers from diverged genomes since their last common ancestor. Assuming random recombinant transfers, a recombining population will generate over time a steady-state population of mosaic genomes with many uniquely different patterns of interspersed clonal and recombinant regions. Examples of mosaic patterns between genome pairs of different divergence can be visualized in Dataset S2 by scrolling through the spreadsheet, which has color coding and information at the top and bottom to help locate specific features.

Clonal Fraction (fc) and Average SNP Density in Transferred Segments (Δt) Within and Between Groups.

As clonal fraction decreases with accumulating recombinant transfers, Δt remains relatively constant, averaging 1.39%, 1.18%, and 1.28% in groups A, B1, and B2 in relatively narrow distributions with slight overlaps (except for considerable scatter in the least-diverged pairs, which have few recombinant transfers) (Fig. 3C, Table S1, and Dataset S1). The average Δt within a group should reflect the average divergence in the recombining population, and the relatively narrow distributions and slight overlaps suggest that each group may be exchanging genome fragments by recombinant transfers primarily within its own recombining population, but occasionally with genomes in other recombining groups. The D1, D2, and E groups are represented by only one genome pair each. The D1 pair (fc = 0.20) and D2 pair (fc = 0.28) each have more than 2,000 recombinant 1-kbp segments, and their Δt of 1.58% and 1.66% may be approximately representative of the average divergences in their recombining populations. However, the group E pair (fc = 0.98) has fewer than 100 recombinant segments and its Δt may be far from representative.

The only intergroup genome pairs with significant clonal fractions are the 64 pairings between group A and B1 genomes (fc = 0.27–0.18 and Δ = 1.03–1.19%), similar to the 17 most diverged genome pairs within the B2 group (those involving SE15 or O127:H6 have fc = 0.22–0.15 and Δ = 1.03–1.22%). The average of Δt over the 64 A–B1 genome pairs is 1.41% ± 0.02 (SD) and that of the 17 most diverged B2–B2 pairs is 1.33% ± 0.06. The appreciable clonal fractions in A–B1 genome pairs suggest that the two groups diverged relatively recently.

The 328 intergroup genome pairs in the other 14 of the 15 possible pairings between the six evolutionary groups assort into sets of 4–80 genome pairs per combination (Table S1 and Dataset S1). The average values of Δt of all genome pairs in any single set range from 1.7% to 2.6% and each combination has a very narrow distribution (SD usually less than 0.02). These narrow distributions support the interpretation that the different mosaic genomes in each recombining population are well equilibrated to the average diversity in the population throughout their lengths. The clonal fraction is negligible in all 14 of these sets of intergroup genome pairs, decreasing from 0.07 to 0.01 as Δ increases from ∼1.6% to 2.6%. As a consequence, Δt is approximately the same as Δ throughout this range (Fig. 3C). The few 1-kbp segments that appear clonal in the most diverged intergroup genome pairs are usually highly conserved across the 32 genomes (Dataset S2, set 4) and thus would appear to be clonal whether or not they had been transferred. The longest of these highly conserved regions contains the cluster of 26 ribosomal protein genes rplQ to rpsJ in 13 tandem 1-kbp segments.

Six Slightly Diverged Genome Pairs Reveal an Important Mode of Recombinant Transfer.

The six least-diverged pairs of basic genomes, all of which have a clonal fraction >0.95, have a striking pattern of recombinant transfers. Instead of mosaic transferred regions of various lengths dispersed throughout the basic genome in moderately diverged genome pairs, each genome pair has only one or two long transferred regions, each region extending across 42–107 kbp of basic-genome sequence and together containing the vast majority of SNPs attributable to recombinant transfer (Table S2). End points of these transferred regions are even farther apart in the complete-genome sequences, ∼85–240 kbp, due mostly to mobile elements, which may either have been in the transferred fragment or inserted after acquisition.

Table S2.

Long transfers in the least-diverged basic-genome pairs

Genome pair | Group | Δ, % | Δc, % | fc | % SNPs | Clonal gaps, No. | Clonal gaps, % length | Region | Length ba, kbp | Length cg, kbp
IHE–APEC | B2 | 0.034 | 0.015 | 0.99 | 2.23 | 2 | 7.1 | O-antigen | 58 | 102
IHE–S88 | B2 | 0.095 | 0.013 | 0.97 | 2.86 | 2 | 7.1 | O-antigen | 58 | 104
 | | | | | 4.05 | 0 | | Rest/fimbri | 107 | 240
APEC–S88 | B2 | 0.097 | 0.011 | 0.96 | 3.34 | 1 | 10.8 | O-antigen | 42 | 85
 | | | | | 4.05 | 0 | | Rest/fimbri | 107 | 231
ABU–CFT | B2 | 0.084 | 0.015 | 0.96 | 3.01 | 2 | 3.1 | O-antigen | 65 | 86
 | | | | | 1.37 | 15 | 23.0 | Capsule | 104 | 223
O111–O26 | B1 | 0.111 | 0.046 | 0.96 | 5.27 | 1 | 1.0 | O-antigen | 102 | 202
 | | | | | 1.67 | 11 | 11.3 | Rest/fimbri | 105 | 162
O55–O157 | E | 0.088 | 0.057 | 0.98 | 3.57 | 2 | 2.1 | O-antigen | 94 | 122

%SNPs is SNP density in long transferred regions in the total set of 4,008 segments, including clonal gaps. ba, basic genome; cg, complete genome. Rest/fimbri refers to DNA restriction and nearby fimbrial gene clusters.

These long transferred regions replace the mosaic pattern of the recipient genome region with the mosaic pattern of the homologous region of the donor genome. The number of incidental clonal matches between them and the percentage of each transferred region occupied by clonal matches should decrease with increasing average SNP density in the transferred region. Indeed, the least diverged of the 10 transferred regions in Table S2 (1.37% SNP density) has 15 clonal matches of one to four segments covering 23% of the transferred length; the next (1.67% SNP density) has 11 clonal matches constituting 11% of the transferred length; the six transferred regions in the range of 2.23–3.57% SNP density have only one or two clonal matches constituting 2–7% of the transferred length; the transferred region in S88 with 4.05% SNP density has no clonal matches; and the transferred region with 5.27% SNP density has a single clonal segment constituting 1% of its length. The transferred region in the B2 strain S88 came from a group A genome, as evidenced by tandem strings of clonal segments in this region when the S88 genome is paired with different group A genomes (Dataset S2, set 4).

Initially, it appeared that short recombinant transfers were present elsewhere in the six genome pairs containing long transfers. However, examination of the candidate SNP clusters provided other explanations for almost all of them. The most interesting proved to be the result of shuffling of variants in ribosomal RNA operons by internal recombination (gene conversion) among the seven operons characteristic of E. coli (Dataset S2). These internally generated SNPs in rRNA operons can be a significant fraction of putative recombinant SNPs in the six least-diverged genome pairs. Most of the other high-density SNP clusters can be attributed to multiple-alignment difficulties. Internally generated and erroneous SNP clusters become a negligible fraction of transferred SNPs in even moderately diverged genome pairs, and corrections for them were not made in the figures and SI datasets. Three isolated clusters of 28–38% SNPs in 88–197 bp in APEC but not in any of the other 31 genomes are the most likely candidates in the six least-diverged genome pairs to have resulted from short recombinant transfers (Dataset S2, set 2).

Each of the long transferred regions contains gene clusters known to have exchangeable variants that provide a strong selective advantage in some situations. These variable gene clusters can nonetheless be exchanged by homologous recombination because the sequences flanking them are much less variable. The O-antigen and DNA restriction genes were previously identified as having many variants that are frequently exchanged and efficiently retained in E. coli populations because they can confer significant selective advantage (24). An O-antigen gene cluster was transferred into at least one of the genomes in all six pairs, the DNA restriction cluster into two genomes, and a gene cluster for capsule formation in one (Table S2). It seems likely that each of these long recombinant transfers was the initial step in divergence of a new lineage in a population under stress, and that the selective advantage it conferred fixed not only the advantageous gene cluster but also the unique mosaic pattern of the recipient genome as the ancestral sequence of the new lineage.

The last of these selective transfers in each of the nine different genomes in the least-diverged genome pairs was apparently recent enough that no subsequent relatively neutral recombinant transfers have become fixed in the population since the last common ancestor (with the possible exception of the three short candidates in APEC). Rough estimates of the number of generations since the mosaic patterns in the >95% of basic-genome length in these six genome pairs were fixed can be made by dividing their corrected numbers of SNPs per base pair by twice the estimated mutation rate of 8.9 × 10⁻¹¹ mutations per base pair per generation (25) (arbitrarily assigning one-half of the SNPs to each genome in a pair). This calculation gives estimates of ∼8 × 10⁵ generations for the five B2 genomes and ∼3 × 10⁶ for the two B1 and two E genomes.
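The arithmetic behind these estimates can be reproduced with a short sketch. The corrected SNP counts themselves are not listed in this section, so the example below uses the Δc values from Table S2 as stand-ins; that substitution is an assumption, as is the function name.

```python
MUTATION_RATE = 8.9e-11  # mutations per base pair per generation (ref. 25)


def generations_since_ancestor(snp_density_percent, mutation_rate=MUTATION_RATE):
    """Estimate generations since the last common ancestor of a genome pair.

    Half of the SNPs are assigned to each lineage, so the pairwise SNP
    density per base pair is divided by twice the mutation rate.
    """
    snps_per_bp = snp_density_percent / 100.0
    return snps_per_bp / (2.0 * mutation_rate)


# Δc values (percent) taken from Table S2 as a proxy for corrected SNP densities.
b2_estimate = generations_since_ancestor(0.015)  # e.g., IHE-APEC (group B2)
b1_estimate = generations_since_ancestor(0.046)  # O111-O26 (group B1)
e_estimate = generations_since_ancestor(0.057)   # O55-O157 (group E)
```

Under this assumption the B2 value falls between 8 × 10⁵ and 9 × 10⁵ generations and the B1 and E values fall near 3 × 10⁶, in line with the rounded estimates quoted above.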

DNA Restriction Has a Prominent Role in Recombinant Transfer.

Many types and specificities of DNA restriction and modification are widely distributed in E. coli and about one-half of completely sequenced E. coli genomes contain one or more of the many variants of hsd genes, which specify type I restriction/modification systems, and often other restriction enzymes as well (26, 27). Milkman and colleagues (28, 29) used restriction fragment length polymorphism to show that genome fragments introduced into E. coli by transduction or conjugation are fragmented and reduced in length by restriction systems, a process they postulated is responsible for generating the mosaic structure of E. coli genomes. More recently, we analyzed three completely sequenced genomes to deduce the patterns of recombinant transfer by P1 transduction of genomic DNA from the K-12 strain W3110 across a type I restriction barrier into two different B genomes (22). P1 is a generalized transducing phage that delivers genome fragments as long as ∼115 kbp without accompanying phage DNA (30). One of these recombinant transfers delivered a single fragment of 6.0–10.6 kbp, but the second delivered six DNA fragments totaling 44.9–55.1 kbp across 71.3–77.0 kbp of genome. The average length of the six transferred W3110 fragments is ∼8.3 kbp and that of the five clonal B intervals is ∼4.8 kbp, but the lengths of individual fragments have very wide possible ranges, 0.3–26.5 kbp for transferred fragments and 0.1–13.9 kbp for clonal intervals. We examined nucleotide sequences of all 32 genomes in the primary locus of genes specifying type I and other types of restriction enzymes, referred to as the immigration control region (31). Eighteen of the 32 genomes appear to have the intact hsd genes needed for type I restriction, with several different types represented (details in Supporting Information).

Lengths of Interspersed Clonal and Recombinant Regions.

As random recombinant transfers accumulate in genomes that have a recent common ancestor, average lengths of clonal regions decrease rapidly with the corresponding increases in Δ (Fig. 4A and Dataset S3). The seven least-diverged genome pairs (adding the group A pair MG–F18) all have clonal fractions >0.90, uninterrupted strings of clonal segments hundreds of kilobase pairs long, and one, two, or possibly five recombinant transfer events consistent with entering DNA longer than 40 kbp (Table S2 and Datasets S2 and S3). The next levels of divergence are seen in a set of six group A genome pairs having clonal fractions between 0.76 and 0.62 (Δ = 0.38–0.59%), five of which have 20–50% of their aligned lengths in uninterrupted strings of clonal segments longer than 50 kbp (Dataset S3). Including all 4,008 segments in the analyses (and allowing clonal regions to extend through the high-SNP segments in rRNA operons, segments containing obviously erroneous SNPs, and rare isolated segments containing four to six SNPs per kilobase pair), these moderately diverged genome pairs have as many as seven clonal regions longer than 100 kbp, the longest covering 967 kbp of aligned length (Dataset S2, set 3).

Fig. 4.

Average lengths of clonal and transferred regions, and numbers of each as a function of overall divergence Δ in 496 basic-genome pairs. (A) Average lengths of clonal regions (in kilobase pairs); (B) average lengths of transferred regions (in kilobase pairs); (C) number of transferred regions (equal numbers of interspersed clonal and transferred regions). Data points, dashed red lines, and error bars are as in Fig. 3.

These long clonal regions are distributed across the paired genomes at intervals consistent with clonal inheritance from a recent common ancestor interrupted by random recombinant transfers of indeterminate numbers and lengths from genomes with different mosaic patterns. However, the P12b genome uniquely has long clonal matches with two different genomes: five clonal regions of 184–715 kbp are evident when P12b is aligned with MG, and a single 756-kbp clonal region when aligned with Crooks (in a region that has a divergent mosaic pattern relative to MG; Dataset S2, set 3). The long clonal region between P12b and Crooks must almost certainly be due to an uninterrupted recombinant exchange of ∼20% of the basic genome of P12b with DNA delivered by conjugation from a close relative of Crooks.

At higher divergence, average lengths of clonal matches between genomes within an evolutionary group decrease to 1.5–3 kbp, and the longest clonal regions in moderately to highly diverged pairs within a group are typically less than 1% of the aligned basic-genome length (Fig. 4A and Dataset S3). The lengths and distributions of clonal and transferred regions reflect the uniquely different mosaic patterns generated by random recombinant transfers in the two lineages, and incidental clonal matches in transferred regions should decrease with increasing distance from the last common ancestor of donor and recipient. Fragmentation of the entering DNA in a significant fraction of transfers can also reduce average lengths of both clonal and transferred regions. Average lengths of transferred regions are ∼2.6–4.6 kbp in the range of Δ = 0.38–1.0% before increasing at higher divergence as newly acquired recombinant transfers overlap those acquired previously (Fig. 4B). The average number of interspersed regions reaches a maximum around 500 in group A genome pairs and around 600 in B1 and B2 genome pairs before decreasing somewhat in the B2 pairs having Δ > 1.0% due to overlaps (Fig. 4C).
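The lengths of interspersed clonal and transferred regions can be extracted from an ordered list of per-segment SNP counts by run-length encoding. The sketch below assumes contiguous 1-kbp segments and applies the 0-to-3 SNP threshold uniformly; it does not reproduce the special handling of rRNA operons, erroneous SNPs, or isolated four-to-six-SNP segments mentioned above, and the function names are hypothetical.

```python
from itertools import groupby


def region_runs(snp_counts, clonal_max_snps=3):
    """Collapse consecutive segments of the same class into regions.

    Returns a list of (label, length_in_segments) in genome order,
    where label is "clonal" or "transferred".
    """
    runs = []
    for is_clonal, group in groupby(snp_counts, key=lambda c: c <= clonal_max_snps):
        label = "clonal" if is_clonal else "transferred"
        runs.append((label, len(list(group))))
    return runs


def average_run_lengths(runs):
    """Average region length (in segments, i.e. kbp for 1-kbp segments) per class."""
    averages = {}
    for label in ("clonal", "transferred"):
        lengths = [n for kind, n in runs if kind == label]
        averages[label] = sum(lengths) / len(lengths) if lengths else None
    return averages
```

Because the two classes alternate along the genome, the numbers of clonal and transferred regions differ by at most one, which matches the equal numbers of interspersed regions noted in the legend to Fig. 4C.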

Computational Model of Divergence.

We developed a simple computational model of divergence in a steady-state population of genomes assumed to comprise 4,000 segments of 1 kbp and to be accumulating random point mutations and acquiring random tandem segments of variable lengths by recombinant transfer from other genomes in the population at fixed rates. Analytical derivation of the probabilities of possible outcomes, combined with Markov chain simulations, generated probable SNP distributions across 100 pairs of genomes evolving to any given Δ. The model (details in Supporting Information) fits the data quite well (Figs. 2C, 3, and 4) using a mutation rate µ = 8.9 × 10^−11 SNPs per base pair per generation (25), an average length of recombinant transfer of 3 kbp (Fig. 4B and Dataset S3), and a ratio of rate of accumulation of SNPs by recombinant transfer relative to random mutations, ρ/µ, of 0.31, slightly higher than the 0.14–0.21 calculated in Dataset S1 for group A, B1, and B2 genome pairs. The value of r/m, a measure of the ratio of the number of mutations in transferred regions relative to random mutations throughout the genome, is 11.2 when calculated from the parameters used in the model, slightly higher than the 8.9 for group A and 5.4 for groups B1 and B2 in Dataset S1 (overall average = 6.5; SD, 0.2). These values for r/m may be compared with the original estimate of 50 by Guttman and Dykhuizen (20), from limited data, and 0.34 by McNally et al. (21), 1.5 by Touchon et al. (13), and 7 by Didelot et al. (14) from complete-genome sequences.

Coevolving Transducing Phages as Primary Vectors of Transfer.

How are the continuing, pervasive, and primarily group-specific transfers of genome fragments of 40–115 kbp (or even larger) accomplished routinely in recombining populations of E. coli? Of the three well-studied mechanisms of DNA transfer (conjugation, transformation, and transduction), generalized transduction seems to us likely to be the primary mode of routine transfer. Generalized transducing phages such as coliphage P1 (30) are widely distributed and can have packaging capacity as high as 300 kbp of DNA (32–34). Coevolving populations of phages and bacteria (6) would maintain specificity for delivery of genome fragments primarily to members of the recombining population of bacteria, but host-range variability could provide occasional delivery of genome fragments to cells of other populations. Transducing phage particles potentially deliver random, well-protected genome fragments throughout a coevolving population much more widely and efficiently than conjugation or conjugative plasmids, which require specialized transfer mechanisms and cell-to-cell contact. That is not to argue that conjugation or transfer of conjugative plasmids does not occur or is not important, and we did detect one obvious example of conjugation in the P12b genome. However, phages are ideally suited to mediate the pervasive and continuing recombinant transfer documented by our analyses, and coevolution of phage and bacterial populations provides a simple explanation for a large body of previous work on genome variability, recombinant transfer, and speciation.

Basic Genome: A Platform for Annotation.

Further development of the basic-genome platform and computational methodology developed here could simplify and facilitate classification and annotation of the current and anticipated flood of complete, draft, and metagenome sequences of E. coli. Consensus basic-genome sequences of all 32 strains and of the group A, B1, and B2 strains are given as FASTA files in Datasets S4–S7. Alignment to these consensus sequences can help to classify newly sequenced E. coli genomes, identify orthologs, and distinguish types and locations of mobile elements. Standardized basic-genome annotations and catalogs of group-specific features, exchangeable gene clusters, and other features could accelerate and improve uniformity and reliability of annotation. The methodology should be applicable to any bacterial or archaeal species or evolutionary group.

Materials and Methods

Extracting Basic Genomes.

We used the Mauve program with default parameters (23) to produce a multiple alignment of the complete genome sequences of 32 independently isolated strains of E. coli. Their names, GenBank accession numbers, complete genome lengths, and derived basic-genome lengths are given in Table S3, and an annotated complete genome sequence of B–DL is given as Dataset S8. Sequences of 11 genomes were reconfigured to simplify the multiple alignment (SI Materials and Methods).

Table S3.

E. coli strains in 32-genome Mauve multiple alignment

Group Strain GenBank accession Complete genome, bp Basic genome, bp ba/cg, % ba/32, % Reconfigured for alignment
A MG1655 U00096.2 4,639,675 3,917,107 84.4 99.0
A B-DL Dataset S8 4,620,778 3,892,824 84.2 98.4
A HS CP000802.1 4,643,538 3,884,968 83.7 98.2
A P12b CP002291.1 4,935,294 3,901,785 79.1 98.6 1.2-Mbp invert btw IS3s
A Crooks (ATCC 8739) CP000946.1 4,746,218 3,914,315 82.5 99.0 Invert and rotate
A UMNF18 AGTD01000001.1 5,239,207 3,920,484 74.8 99.1
A ETEC H10407 FN649414.1 5,153,435 3,912,687 75.9 98.9
A UMNK88 CP002729.1 5,186,416 3,927,265 75.7 99.3
B1 W CP002185.1 4,900,968 3,930,428 80.2 99.4
B1 55989 CU928145.2 5,154,862 3,914,831 75.9 99.0 Rotate
B1 O26:H11 str. 11368 NC_013361.1 5,697,240 3,912,181 68.7 98.9 471- and 5.3-kbp inverts
B1 O111:H- str. 11128 AP010960.1 5,371,077 3,910,010 72.8 98.9 92-kbp invert btw IS629s
B1 O103:H2 str. 12009 NC_013353.1 5,449,314 3,924,647 72.0 99.2 456-kbp invrt w 5 transloc
B1 SE11 AP009240.1 4,887,515 3,932,908 80.5 99.4
B1 E24377A CP000800.1 4,979,619 3,929,451 78.9 99.3
B1 IAI1 CU928160.2 4,700,560 3,927,748 83.6 99.3
E O55:H7 str. CB9615 CP001846.1 5,386,352 3,892,416 72.3 98.4
E O157:H7 str. Sakai NC_002695.1 5,498,450 3,849,037 70.0 97.3
B2 SE15 AP009378.1 4,717,338 3,783,265 80.2 95.7 Invert between asn tRNAs
B2 IHE3034 CP001969.1 5,108,383 3,799,695 74.4 96.1 Invert between asn tRNAs
B2 S88 CU928161.2 5,032,268 3,808,318 75.7 96.3
B2 APEC O1 CP000468.1 5,082,025 3,800,601 74.8 96.1 Rotate
B2 ED1A NC_011745.1 5,209,548 3,739,178 71.8 94.5 Rotate, 8.8-kbp invert
B2 536 CP000247.1 4,938,920 3,797,593 76.9 96.0 Invert between asn tRNAs
B2 LF82 CU651637.1 4,773,108 3,796,695 79.5 96.0 Rotate
B2 ABU 83972 CP001671.1 5,131,397 3,794,741 74.0 95.9 Invert between asn tRNAs
B2 CFT073 AE014075.1 5,231,428 3,784,222 72.3 95.7 Invert between asn tRNAs
B2 O127:H6 E2348/69 FM180568.1 4,965,553 3,779,984 76.1 95.6
D O7:K1 str. CE10 CP003034.1 5,313,531 3,903,557 73.5 98.7
D SMS-3-5 NC_010498.1 5,068,389 3,911,226 77.2 98.9 1.4-Mbp invrt btw IS110s
D 042 FN554766.1 5,241,977 3,879,605 74.0 98.1
D UMN026 NC_011751.1 5,202,090 3,888,373 74.7 98.3 Rotate

Length of filtered 32-genome Mauve multiple alignment: 3,955,192.
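The two percentage columns of Table S3 can be recomputed from the listed lengths. The following is a minimal Python sketch, reading ba/cg as basic-genome length over complete-genome length and ba/32 as basic-genome length over the filtered alignment length; the function name `basic_genome_fractions` is ours and purely illustrative.

```python
ALIGNMENT_LENGTH = 3_955_192  # filtered 32-genome Mauve multiple alignment, bp


def basic_genome_fractions(complete_bp, basic_bp, alignment_bp=ALIGNMENT_LENGTH):
    """Return (ba/cg, ba/32) as percentages rounded to one decimal place."""
    ba_cg = 100.0 * basic_bp / complete_bp   # share of the complete genome
    ba_32 = 100.0 * basic_bp / alignment_bp  # share of the filtered alignment
    return round(ba_cg, 1), round(ba_32, 1)


# Lengths copied from Table S3.
mg1655 = basic_genome_fractions(4_639_675, 3_917_107)
ed1a = basic_genome_fractions(5_209_548, 3_739_178)
print(mg1655, ed1a)
```

Under this reading, the computed values for MG1655 and ED1A match the tabulated entries.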

A key step in deriving basic genomes from the multiple alignment was to apply a simple computational filter designed to eliminate mobile elements, idiosyncratic insertions or duplications, and highly diverged regions that do not align well. The filter applied in the present analysis removed every base pair position in which fewer than 22 of the 32 genomes have an aligned base pair, which retained idiosyncratic and group-specific deletions. This filter reduced the initial 3,044 Mauve alignment blocks to 105 ordered blocks and the total aligned sequence within these blocks from 20.5 Mbp to the 3,955,192-bp approximation to the basic genome analyzed here. We determined that all mobile elements (prophages, rhs elements, IS elements) annotated in the extensively analyzed complete-genome sequences of MG1655 and B–DL (22) were removed by the filter, but at least a few other genomes retained remnants due to difficulties in multiple alignment at sites where mobile elements are integrated. Our procedures appear to have captured a reasonable approximation to the basic genome shared by these 32 strains and provide a platform for further analysis and refinement.
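The column filter described above can be illustrated with a short Python sketch. It assumes the alignment has already been flattened into equal-length gapped strings, one per genome, with `-` marking positions where a genome has no aligned base pair; Mauve itself reports alignment blocks, so this is a simplification, and the function name `filter_columns` is hypothetical.

```python
def filter_columns(aligned, min_genomes=22, gap="-"):
    """Keep alignment columns in which at least `min_genomes` rows have a base."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # A column survives when enough genomes carry an aligned base there.
    keep = [
        col for col in range(length)
        if sum(row[col] != gap for row in aligned) >= min_genomes
    ]
    return ["".join(row[c] for c in keep) for row in aligned]


# Toy alignment of three genomes, with the threshold lowered to 2 for illustration.
toy = ["ACGT-A", "AC-T-A", "A--TGA"]
filtered = filter_columns(toy, min_genomes=2)
print(filtered)
```

Columns present in only one toy genome are dropped, whereas a gap in a retained column is kept, which mirrors how the filter retains idiosyncratic and group-specific deletions.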

Consensus Sequences.

Majority rule at each position where at least 22 genomes are represented generated a consensus basic-genome sequence for all 32 genomes. At least six genomes were required at each position in the group A and B1 consensus sequences, and at least seven genomes in B2.
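A majority-rule consensus with a minimum-representation threshold might be sketched as follows. The handling of under-represented positions (emitting `N`) and of ties (first base encountered wins) are our assumptions; the text does not specify either, and the function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(aligned, min_genomes, gap="-", missing="N"):
    """Most common base per column, or `missing` when too few genomes are represented."""
    out = []
    for column in zip(*aligned):
        bases = [b for b in column if b != gap]
        if len(bases) < min_genomes:
            out.append(missing)  # assumption: mark under-represented positions
            continue
        # most_common(1) returns the first-seen base among equally common ones.
        out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)


toy = ["ACGT", "ACGA", "A-TA", "T-GA"]
toy_consensus = consensus(toy, min_genomes=3)
print(toy_consensus)
```

For the 32-genome consensus the threshold would be 22; for the group A and B1 consensus sequences it would be 6, and for B2 it would be 7.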

SI Results and Discussion

DNA Restriction.

We previously analyzed three completely sequenced genomes to deduce the patterns of recombinant transfer by P1 transduction of genomic DNA from the K-12 strain W3110 across a type I restriction barrier into ancestors of two different B genomes, BL21(DE3) and REL606 (22). Transductants had been selected for the repair of a 6-kbp deletion that had apparently been caused in B strains by an IS1 element that remained at the site of the deletion. The only recombinant transfer detected in the BL21(DE3) lineage was a fragment of maximum length 10.6 kbp that replaced the IS1 element and restored the deleted region. However, recombinant transfers in the REL606 lineage replaced six DNA fragments totaling 44.9–55.1 kbp across 71.3–77.0 kbp of genome. Minimum and maximum lengths were determined by the ends of uninterrupted successions of single base-pair differences that define which DNA is represented (positions of all single base-pair differences among strains are given in table S1 of ref. 22). The observed patterns of recombinant transfers are consistent with the positions of known recognition sites for the B restriction endonuclease in W3110 DNA and the mechanism of action of the B restriction endonuclease (22). The average length of the six transferred W3110 fragments is ∼8.3 kbp and that of the five clonal B intervals is ∼4.8 kbp, but the lengths of individual fragments have very wide possible ranges, 0.3–26.5 kbp for transferred fragments and 0.1–13.9 kbp for clonal intervals.

We examined nucleotide sequences of all 32 genomes in the primary locus of genes specifying type I and other types of restriction enzymes, referred to as the immigration control region (ICR) (31). Fourteen of these genomes lack all or part of the hsd genes needed for type I restriction, some by exchange of the entire ICR with the unrelated pac gene (which specifies a penicillin acylase), some by deletions associated with IS elements, and some by simple deletions. Most of the genomes are highly variable relative to each other in at least the specificity gene hsdS, if not all three hsd genes. However, five pairs of genomes have essentially or completely identical sequences through the entire hsd region (and across most or all of the entire ICR), including three of the six least-diverged genome pairs (IHE–APEC, ABU–CFT, and O55–O157). In addition, O111 is one of the genomes that has exchanged the entire ICR with the pac gene, and the long transferred region containing the ICR in the O111–O26 genome pair could either have delivered active hsd genes to O26 or replaced them with the pac gene in O111.

Computational Model of Divergence.

In our numerical simulations, we follow the divergent evolutionary path of two bacterial strains within a population of coevolving strains with a constant effective population size Ne. Computational limitations do not allow us to explicitly model several billions of other strains in the population. Instead, we use the mathematical expressions derived below to describe the diversity within the population. We subsequently simulate a Markov chain model following the time course of sequence divergence in a pair of strains. In agreement with the empirical data on E. coli’s basic genome, each of two strains in our simulation has a circular genome comprising NG = 4,000 gene-sized segments consisting of LG=1,000 base pairs each. To speed up our simulations, the finest resolution of genomes in our model is at the level of a segment rather than at the level of individual base pairs.

Our simulations aim at modeling the neutral evolution of the basic genome by random point mutations and homologous recombinations. Thus, they ignore evolutionary processes rearranging gene order or changing the size of the genome such as inversions, short indels, or large-scale additions and deletions of genomic fragments. In this simplified picture, in each generation one of two things may happen: (i) a randomly selected segment acquires a point mutation at a rate μ per base per generation or (ii) a randomly selected segment on the genome is chosen as the starting point of a horizontal transfer event at a rate ρ per base pair per generation. Horizontal transfer comprises transfer of a genomic region of length L segments from a randomly selected donor strain from the population of coevolving strains. Similar to the model used in ref. 3, we assume that segments are indivisible in horizontal transfer. The length of the transferred region is chosen from an exponential distribution with mean Lt = ⟨L⟩.

We next describe how we chose the donor strain of the horizontally transferred segment. It is known that the probability of successful transport followed by homologous recombination of a donor fragment into the recipient genome, which we refer to as transfer efficiency, decreases with the increasing sequence divergence between the donor and recipient genome regions. See, for example, refs. 16 and 35–37. Factors likely to affect transfer efficiency include the following: (i) ecological or geographical niche separation of donor and recipient strains; (ii) likelihood that generalized transducing phages from the donor population are capable of infecting recipient strains; (iii) defense mechanisms such as DNA restriction modification systems; (iv) efficiency of integration into the recipient genome by homologous recombination; and (v) probability that an integrated fragment becomes fixed in the population. In our simulations, we assume that transfer efficiency can be approximated by an exponentially decreasing probability of successful recombinant transfer, p ∝ e^(−δ/δTE), where δ is the local sequence divergence between the transferred fragment (comprising on average Lt segments) in donor and recipient genomes. δTE is the parameter that dictates the severity of the penalty. For example, a higher δTE implies lowered restriction to transfer and vice versa. Previous modeling efforts have also used an exponential biasing function to favor horizontal transfer among closely related strains (3, 38).
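The exponential penalty can be made concrete with a small Python sketch. It assumes the proportionality constant is 1, so that a transfer between identical fragments always succeeds, and uses δTE = 0.8%, the best-fit value reported in the Parameters section; the function names are hypothetical.

```python
import math
import random


def transfer_efficiency(delta, delta_te):
    """Relative success probability exp(-delta/delta_te); prefactor assumed to be 1."""
    return math.exp(-delta / delta_te)


def attempt_transfer(delta, delta_te, rng=random):
    """Accept a proposed transfer with probability exp(-delta/delta_te)."""
    return rng.random() < transfer_efficiency(delta, delta_te)


DELTA_TE = 0.008  # 0.8% local divergence

# Doubling the donor-recipient divergence from 0.8% to 1.6% lowers the
# success probability by a further factor of e.
ratio = transfer_efficiency(0.016, DELTA_TE) / transfer_efficiency(0.008, DELTA_TE)
print(ratio)
```

Each additional δTE of local divergence therefore reduces the acceptance probability by a factor of e, which is why a higher δTE corresponds to a weaker barrier.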

Theoretical Framework.

In our computational analysis, we are interested in understanding how a pair of strains, referred to as strains X and Y from here onward, acquires SNPs as the two genomes diverge from their common ancestor as members of the recombining population of Ne strains. We have developed a semianalytical theoretical model to estimate the divergence among different segments of strains X and Y by taking into account transfers from and within the rest of the population.

Let us denote by δ¯(t) = {δ1(t), δ2(t), …, δ4,000(t)} the vector representing the local sequence divergence in all 4,000 segments as a function of time t of divergence from the last common ancestor of X and Y. Local sequence divergence in a given segment is simply the number of SNPs in that segment divided by 1,000, the length of each segment. In what follows, we analytically derive the transition probability matrix of a Markov process to propagate δ¯(t) probabilistically in time.

In every generation, the vector of divergence δ¯(t) is updated if (i) there is a point mutation in one of the segments in one of the two genomes (happens at a rate μ per segment per generation) or (ii) there is a horizontal transfer event starting at one of the segments in one of the genomes (happens at ρ per segment per generation) (diagrammed below). Updating the SNPs vector after a point mutation is straightforward: a mutation in kth segment will increase the kth component of δ¯(t), δk(t), by 1/1,000. On the other hand, horizontal transfer of a genomic fragment comprising L segments in one of the two compared strains will most likely acquire foreign genomic fragments from a randomly chosen strain (referred to as strain D, for donor, from here onward) from the population of coevolving strains. This horizontal transfer will update the divergence vector at L consecutive segments at the same time. Without loss of generality, if strain X receives a horizontal transfer from a donor strain D, the local divergence between X and Y in the horizontally transferred region after the transfer will be the local divergence in the same region between the donor D and Y.

[Diagram: pnas.1510839112sfx01.jpg]

We model horizontal transfer as a two-step process. First, a donor genomic fragment from strain D in the coevolving population is chosen at random, and second, the integration of this fragment into the genome of strain X is attempted with the probability of success determined by transfer efficiency: p ∝ e^(−δ/δTE). Given that horizontally transferred fragments have variable lengths, overlap between different horizontal transfer events is possible. In this instance, the phylogenetic lineage of a given fragment cannot be uniquely established because different segments comprising the fragment may potentially have come from different strains. Nevertheless, as a first approximation, justified a posteriori by the agreement of the model with empirical observations, we assume that the coalescent framework (see below), which strictly speaking is applicable only to indivisible fragments, is still a valid approximation in tracing the lineage of transferred DNA. Consequently, we derive the conditional probability, p(δa|δb) for a typical fragment that is a candidate for horizontal transfer. Here, δb (b for before) denotes the local divergence between X and Y before the horizontal transfer, and δa (a for after) denotes the local divergence after integration of the transferred region in X.

A genome fragment integrated in X by recombinant transfer from D will have three possible phylogenetic relationships with the equivalent fragment in the diverging lineage Y, relative to their most recent common ancestor (MRCA): (i) D is more closely related to X than to Y; (ii) D is more closely related to Y than to X; or (iii) X and Y are more closely related to each other than either is to D.

[Diagram: pnas.1510839112sfx02.jpg]

Case 1.

Here, we want to find the probability p(δa=δb|δb) when the common ancestor XD of X and D is younger than XYD, the common ancestor of X, Y, and D. Note that, in this case, δXD<δXY=δb and δa=δb.

Assume that the number of generations separating X and Y is sXY. Let D be a randomly chosen strain. The probability that D and X have their MRCA sXD generations ago can be found as follows (39, 40). We start with D and choose an ancestor from the previous generation of strains randomly and keep going back in time until we coalesce with the ancestral lineage of X. At every generation, X has a unique ancestor (among Ne strains in the population at that time). At any generation, the probability that we will not randomly pick an ancestor for D that is also in the ancestral lineage of X is 1 − 1/Ne. Consequently, the probability that D will coalesce with the ancestral lineage of X in exactly sXD generations is (1/Ne)(1 − 1/Ne)^(sXD − 1) ≈ (1/Ne) e^(−sXD/Ne).

Similarly, the probability that D avoids the ancestral lineage of Y before coalescing with XD is equal to (1 − 1/Ne)^(sXD − 1) ≈ e^(−sXD/Ne). Combining the two, the probability that X and D are separated by sXD generations and XYD is older than XD is given by the product of the two, p = (1/Ne) e^(−2sXD/Ne).
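The quality of the exponential approximation to the geometric coalescence probability can be checked numerically. The sketch below uses an illustrative population size of 10^6, far smaller than the bound estimated for E. coli in the Parameters section, simply to keep the numbers easy to read; the function names are ours.

```python
import math


def coalescence_exact(s, ne):
    """Probability of first coalescing exactly s generations back (geometric)."""
    return (1.0 / ne) * (1.0 - 1.0 / ne) ** (s - 1)


def coalescence_approx(s, ne):
    """Large-Ne approximation (1/Ne) * exp(-s/Ne)."""
    return (1.0 / ne) * math.exp(-s / ne)


NE = 1_000_000  # illustrative value only
s = NE // 2     # half an Ne worth of generations

rel_err = abs(coalescence_exact(s, NE) / coalescence_approx(s, NE) - 1.0)
print(rel_err)
```

The relative error is of order 1/Ne, so for the population sizes considered here the two expressions are interchangeable.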

We can convert the above probability into divergences by noting δ=2μs and by defining θ=2μNe as the average local divergence in the population. We have p = (1/θ) e^(−2δXD/θ). Finally, integrating over all possible δXD<δXY=δb by applying the transfer efficiency criterion,

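$$p(\delta_a=\delta_b\mid\delta_b)=\frac{1}{\Omega}\int_{0}^{\delta_{XY}=\delta_b} e^{-\delta_{XD}/\delta_{TE}}\times\frac{1}{\theta}\,e^{-2\delta_{XD}/\theta}\,d\delta_{XD}. \tag{S1}$$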

Here, Ω is an overall normalization constant.

Case 2.

Here, we want to find out the probability p(δa,δa<δb|δb) when YD, MRCA of Y and D, is younger than XYD. Note in this case that δXD=δb and δDY=δa<δb.

Similar to case 1, the probability that Y and D had their MRCA exactly sDY generations ago is given by p = (1/Ne) e^(−sDY/Ne). The probability that D does not coalesce with the ancestral lineage of X before coalescing with that of Y is given by p = e^(−sDY/Ne) (see case 1). The product of the two gives the probability, after including the transfer efficiency criterion and converting to divergences:

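$$p(\delta_a,\delta_a<\delta_b\mid\delta_b)=\frac{1}{\Omega}\,\frac{1}{\theta}\,e^{-2\delta_a/\theta}\times e^{-\delta_b/\delta_{TE}}. \tag{S2}$$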

Case 3.

Here, we want to find out the probability p(δa,δa>δb|δb) when the MRCA of X and D and the MRCA of Y and D is the same (XYD) and is older than the MRCA of X and Y, XY. Note in this case that δDY=δXD=δa>δb=δXY.

In this case, we first require that D does not coalesce with the ancestral lineage of X or of Y in sXY generations. This probability is given by p = e^(−2sXY/Ne) (see cases 1 and 2). After avoiding the ancestral lineages for the first sXY generations, we now require D to coalesce with the ancestral lineages of both X and Y in exactly sDY − sXY generations. This probability is given by p = (1/Ne) e^(−(sDY − sXY)/Ne) (see cases 1 and 2). The product of the two gives the probability that D coalesces with Y in sDY generations with sDY > sXY: p = (1/Ne) e^(−(sDY + sXY)/Ne). Converting to divergences and applying the transfer efficiency criterion, we have the following:

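$$p(\delta_a,\delta_a>\delta_b\mid\delta_b)=\frac{1}{\Omega}\,\frac{1}{\theta}\,e^{-(\delta_a+\delta_b)/\theta}\times e^{-\delta_a/\delta_{TE}}. \tag{S3}$$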

The normalization constant Ω=Ω(δb) depends on the original local divergence δb between X and Y and is simply defined as the normalization constant that ensures that probabilities in Eqs. S1, S2, and S3 add up to one.

Combining Eqs. S1, S2, and S3, we can finally write down that the overall conditional probability p(δa|δb) is given by the following:

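$$p(\delta_a\mid\delta_b)=\frac{1}{\Omega}\Big(\mathrm{Di}(\delta_a-\delta_b)\times p(\delta_a=\delta_b\mid\delta_b)+\Theta(\delta_b-\delta_a)\times p(\delta_a,\delta_a<\delta_b\mid\delta_b)+\Theta(\delta_a-\delta_b)\times p(\delta_a,\delta_a>\delta_b\mid\delta_b)\Big). \tag{S4}$$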

Here, Di(x) is the Dirac Delta function and Θ(x) is the Heaviside function. Using the conditional probability in Eq. S4, we predict the distribution of the local divergence δa between X and Y after horizontal transfer from δb, the corresponding local divergence before the horizontal transfer.
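One way to sample δa from this conditional distribution is to integrate the three cases in closed form, pick a case in proportion to its weight, and then draw δa within that case. The Python sketch below does this under our reading of Eqs. S1–S3 (case 1 as a point mass at δb, case 2 as a truncated exponential on (0, δb), case 3 as a shifted exponential); the function names and the illustrative value θ = 2.6% are assumptions, not part of the original model code.

```python
import math
import random


def case_weights(delta_b, theta, delta_te):
    """Unnormalized weights of cases 1-3, i.e., Eqs. S1-S3 integrated over delta_a."""
    k1 = 1.0 / delta_te + 2.0 / theta
    w1 = (1.0 - math.exp(-k1 * delta_b)) / (theta * k1)
    w2 = 0.5 * math.exp(-delta_b / delta_te) * (1.0 - math.exp(-2.0 * delta_b / theta))
    k3 = 1.0 / theta + 1.0 / delta_te
    w3 = math.exp(-delta_b / theta) * math.exp(-k3 * delta_b) / (theta * k3)
    return w1, w2, w3


def sample_delta_after(delta_b, theta, delta_te, rng=random):
    """Draw the local divergence after a transfer, given the divergence before it."""
    w1, w2, w3 = case_weights(delta_b, theta, delta_te)
    u = rng.random() * (w1 + w2 + w3)  # dividing by the sum plays the role of Omega
    if u < w1:
        return delta_b  # case 1: divergence unchanged
    if u < w1 + w2:
        # case 2: density proportional to exp(-2*delta_a/theta) on (0, delta_b),
        # sampled by inverting its cumulative distribution
        v = rng.random()
        return -0.5 * theta * math.log(1.0 - v * (1.0 - math.exp(-2.0 * delta_b / theta)))
    # case 3: delta_b plus an exponential with rate 1/theta + 1/delta_te
    return delta_b + rng.expovariate(1.0 / theta + 1.0 / delta_te)


rng = random.Random(1)
weights = case_weights(0.005, 0.026, 0.008)
draws = [sample_delta_after(0.005, 0.026, 0.008, rng) for _ in range(20000)]
frac_unchanged = sum(d == 0.005 for d in draws) / len(draws)
print(weights, frac_unchanged)
```

Normalizing by the sum of the three weights is equivalent to dividing by Ω(δb), so Ω never needs to be computed separately.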

Thus, we have analytically derived the probability to update a local fragment of the SNP vector δ¯(t). Combined with the relative rates of mutation and recombination, we can construct the Markovian transition probability for δ¯(t) → δ¯(t+1). Using this Markovian transition probability, we numerically propagate the stochastic trajectory of δ¯(t) as detailed below.

Simulations.

Using Eq. S4, we are equipped to perform a Markov chain simulation to propagate δ¯(t). We proceed as follows:

  • 1) Initialize the SNPs vector as a vector of zeros at time t=0. This implies that we start with two strains with identical genomes at time t=0.

  • 2) At every generation, do the following:

    • a) Because we know that μ ≪ 1, it is very unlikely that a single mutation event brings about two or more SNPs. Thus, first draw NG = 4,000 random variables zk uniformly from [0, 1]. If zk<μ, add 1/1,000 to δk(t).

    • b) Draw 4,000 random numbers r uniformly from [0, 1] and compare them with ρ. Identify segment(s) with r<ρ. These segments are candidates for horizontal transfer. Given that the rate of horizontal transfer is very likely to be low, in any time step it is very unlikely that more than one segment would be selected for horizontal transfer.

      • i) If segment k is selected for horizontal transfer, draw an integer L from an exponential distribution with the mean Lt. Segments k to k + L − 1 are potential candidates for horizontal replacement of genomic fragments. Note that the genome is circular, i.e., segments 1 and 4,000 are neighbors of each other, and that we allow segment transfers to start and end only at segment boundaries.

      • ii) Find the local divergence δb in the selected genomic fragment over the L segments before the horizontal transfer and, using the conditional probability p(δa|δb) defined in Eq. S4, sample δa.

      • iii) Draw L Poisson random variables with parameter δa and distribute them randomly on the L segments and update the SNPs vector for these L segments using the newly sampled Poisson variables.

Step 2.b.iii assumes that the newly acquired genomic fragment has no spatial correlations of its SNPs, an assumption which is reasonable for the current analysis but may need to be relaxed if one is interested in understanding SNP positional correlations along the genome.

Each Markov chain starts at t = 0 with strains X and Y with identical genomes, i.e., δ¯(t)=0¯. As time increases, the global genomic divergence Δ increases like a random walk with a drift. To generate SNPs vectors δ¯(t) at any given Δ, we stop the Markov chain when Δ reaches within 0.01% of the prescribed divergence. For each Δ, we collect 100 such Markov chains. It is a straightforward calculation to generate the figures in the main text from the SNPs vectors.
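The per-generation update in steps 2a and 2b can be sketched in Python as follows. This is a simplified illustration rather than the code used for the figures: the sampler for δa is passed in as a function, the exponential length is rounded to a whole number of segments (minimum 1), and step 2.b.iii is interpreted as drawing, for each segment, a Poisson SNP count with mean δa × 1,000. All three choices, and all function names, are our assumptions.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def step(delta, mu_seg, rho_seg, mean_len, sample_delta_after, rng, seg_len=1000):
    """Advance the per-segment divergence vector `delta` by one generation, in place."""
    n = len(delta)
    # Step 2a: each segment gains at most one SNP per generation.
    for k in range(n):
        if rng.random() < mu_seg:
            delta[k] += 1.0 / seg_len
    # Step 2b: each segment may be the start of a horizontal transfer.
    for k in range(n):
        if rng.random() < rho_seg:
            # 2.b.i: tandem length in whole segments, wrapped around the circle
            length = max(1, min(n, round(rng.expovariate(1.0 / mean_len))))
            idx = [(k + j) % n for j in range(length)]
            # 2.b.ii: divergence before the transfer, then sample the new value
            delta_b = sum(delta[i] for i in idx) / length
            delta_a = sample_delta_after(delta_b, rng)
            # 2.b.iii: uncorrelated Poisson SNP counts across the fragment
            for i in idx:
                delta[i] = poisson(delta_a * seg_len, rng) / seg_len


# Toy run with exaggerated rates on a 10-segment genome; the stand-in sampler
# simply returns the pre-transfer divergence.
rng = random.Random(0)
toy = [0.0] * 10
for _ in range(50):
    step(toy, 0.05, 0.01, 3, lambda delta_b, r: delta_b, rng)
print(sum(toy) / len(toy))
```

A full chain would call `step` repeatedly and stop once the mean of `delta` comes within the stated tolerance of the prescribed Δ; with realistic rates most generations change nothing, so a practical implementation would likely skip ahead between events.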

Parameters.

Our theoretical and numerical model depends on four dimensionless parameters:

  • i) θ=2μNe, Watterson’s measure of effective population diversity;

  • ii) ρ/μ, the ratio of the horizontal transfer rate to the mutation rate;

  • iii) Lt/(LG×NG), the fraction of the genome length replaced in a single transfer event;

  • iv) δTE, the transfer efficiency in units of pairwise sequence divergence.

A ratio of ρ/μ=0.31 best fits the profile of average clonal and recombined SNP densities shown in Fig. 3 B and C. Based on a mutation rate μ = 8.9 × 10^−11 per base pair per generation (25), that ratio corresponds to a recombination rate of ρ = 2.8 × 10^−11 per base pair per generation.

The fraction of genome length replaced in a single transfer event dictates the maximum number of uninterrupted strings of transferred segments (Fig. 4C). Trial and error showed that Lt/(LG×NG)=0.075% gives a good fit to the data, corresponding to an average transferred length of Lt=3 kbp.

The exponential tail in the SNP distributions generated in our numerical simulations (Fig. 2C) is influenced by the combined parameter 1/θ+1/δTE (Eq. S3). For genomes with a large number of genes, Watterson’s θ=2μNe is roughly equal to the maximum of the overall divergence Δ between any two genomes in the population. The maximum divergence of our 32 genomes is 2.6%, so θ must be greater than 2.6%, which imposes a lower bound on the effective population size, Ne = θ/2μ > 1.5 × 10^8. If transfers were perfectly efficient (δTE ≫ 1), the slope of the exponential tail would be governed entirely by θ. For θ=2.6%, SNP distributions predicted by our model have exponential tails that are about twice as wide as those observed in the empirical data (Fig. 2B). Thus, we assumed that the slope of the exponential distribution in the tails of P(δi) is governed primarily by δTE. To simplify calculations in our model, we set θ ≫ δTE. In this limit, the value δTE=0.8% provides the best fit (Fig. 2C) to the exponential slopes observed in SNP histograms of genome pairs within group A and between group A and B1 (Fig. 2B).
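The derived quantities quoted in this section follow from simple arithmetic on the fitted parameters, as the short Python check below illustrates. The constant names are ours; the values are those given in the text.

```python
MU = 8.9e-11               # mutations per base pair per generation (ref. 25)
RHO_OVER_MU = 0.31         # fitted ratio of transfer rate to mutation rate
N_SEGMENTS = 4_000         # NG
SEGMENT_BP = 1_000         # LG
MEAN_TRANSFER_BP = 3_000   # Lt expressed in base pairs
THETA_MIN = 0.026          # maximum observed divergence among the 32 genomes

# Recombination rate implied by the fitted ratio.
rho = RHO_OVER_MU * MU

# Fraction of the model genome replaced by an average transfer.
transfer_fraction = MEAN_TRANSFER_BP / (N_SEGMENTS * SEGMENT_BP)

# Lower bound on effective population size from theta = 2 * mu * Ne.
ne_lower_bound = THETA_MIN / (2 * MU)

print(f"rho = {rho:.2e} per bp per generation")
print(f"Lt/(LG x NG) = {100 * transfer_fraction:.3f}%")
print(f"Ne > {ne_lower_bound:.2e}")
```

The results agree with the rounded values in the text: ρ of about 2.8 × 10^−11, a transfer fraction of 0.075%, and a lower bound on Ne of about 1.5 × 10^8.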

Comparing our results with previous work, our estimated average transfer length of 3 kbp is considerably longer than the ∼50 bp estimated by Touchon et al. (13) or the 542 bp estimated by Didelot et al. (14). The ratio ρ/μ=0.31 in our model is smaller than the estimate of 0.0128/0.01251 by Didelot et al. (14) (albeit with a much shorter length of recombined regions) and much smaller than the 2.47 calculated by Touchon et al. (13). A composite parameter r/m = (ρ/μ) Lt δTE, an estimate of the relative contributions of recombinant transfers and random mutations to overall divergence, is 11.2 in our model, much lower than the early estimate of 50 by Guttman and Dykhuizen (20), considerably higher than 0.34 by McNally et al. (21) and 1.5 by Touchon et al. (13), and somewhat higher than 7, estimated by Didelot et al. (14).

SI Materials and Methods

The results given in the figures and most tables are limited to perfectly aligned 1-kbp segments in the set of 3,769. Extending the analyses to the entire set of 4,008 basic-genome segments or increasing the clonal cutoff from three to four or five SNPs per 1-kbp segment produced similar results with relatively minor differences or trends.
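To illustrate how a clonal cutoff turns per-segment SNP counts into interspersed clonal and transferred regions, the following Python sketch groups consecutive segments by class and reports run lengths. It assumes that a segment counts as clonal when it has fewer SNPs than the cutoff and that the sequence is treated as linear rather than circular; both are our simplifications, and the function name is hypothetical.

```python
from itertools import groupby


def region_lengths(snps_per_segment, clonal_cutoff=3):
    """Return run lengths (in segments) of clonal and of transferred regions."""
    clonal, transferred = [], []
    # groupby merges consecutive segments that fall in the same class.
    for is_clonal, run in groupby(snps_per_segment, key=lambda n: n < clonal_cutoff):
        (clonal if is_clonal else transferred).append(len(list(run)))
    return clonal, transferred


counts = [0, 1, 0, 7, 12, 2, 0, 9]  # toy SNP counts for eight 1-kbp segments
clonal_runs, transferred_runs = region_lengths(counts)
print(clonal_runs, transferred_runs)
```

Raising `clonal_cutoff` to 4 or 5 reclassifies borderline segments as clonal, which is the kind of sensitivity check described above.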

Strain names of the 32 genomes used in this work, GenBank accession numbers, the lengths of both complete genomes and the basic genomes derived from them, and summaries of reconfigurations of the genome sequences made to simplify the multiple alignment are given in Table S3. No significant inversions or translocations are seen among 21 of the 32 genomes, including the reference genome MG1655. The first base pair in the numbering of six genome sequences was located at a different position relative to gene content from that of MG1655, and those sequence files were rotated, or inverted and rotated, and renumbered so that their first base pair corresponds to that of MG1655.
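Rotating or inverting a circular genome sequence so that its numbering starts at a chosen position is a simple string operation. The sketch below is a generic Python illustration with hypothetical function names and 0-based coordinates; it is not the procedure or software used for the reconfigurations reported here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Invert a sequence: complement each base and reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate(seq, new_start):
    """Renumber a circular sequence so that 0-based index `new_start` comes first."""
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]


def invert_and_rotate(seq, new_start):
    """Reverse-complement the whole circle, then rotate the result."""
    return rotate(reverse_complement(seq), new_start)


toy = "GATTACA"
print(rotate(toy, 3), reverse_complement(toy), invert_and_rotate(toy, 2))
```

Characters outside A, C, G, and T are left unchanged by `reverse_complement` in this sketch, so ambiguity codes would need separate handling.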

Eleven genomes have significant inversions, translocations, or both relative to the consensus organization. To reduce the complexity of the multiple alignment, the following inversions and translocations were corrected manually to the same orientation and/or position as in MG1655 using Clone Manager (Scientific and Educational Software): 4.2- to 5.3-kbp inversions between comparable pairs of oppositely oriented asparagine tRNA sequences in the five B2 strains IHE3034, ABU 83972, CFT073, 536, and SE15; a 1.2-Mbp inversion between oppositely oriented IS3 insertions in P12b; a 92-kbp inversion between oppositely oriented IS629 insertions in O111:H−; a 1.4-Mbp inversion between oppositely oriented IS110 insertions in SMS-3-5; an 8.8-kbp inversion of uncertain origin in ED1A; a 456-kbp inversion with five translocations at a single IS621 insertion in O103:H2; and an inversion of 471 kbp between oppositely oriented IS621 insertions plus a 5.3-kbp inversion at a single IS621 insertion in O26:H11. The O26:H11 genome in the multiple alignment retained two substantial translocations relative to MG1655.

In analyzing scattered SNP clusters in the six least-diverged genome pairs, we found that the sequence files of CFT073 and APEC O1 have clusters of sequence-ambiguity codes that were being scored as SNPs. Such SNPs were eliminated from all genomes in our analyses simply by requiring that each SNP contain only A, G, C, or T.
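The requirement that each SNP contain only A, G, C, or T can be expressed as a small predicate. The following Python sketch assumes two equal-length aligned sequences and treats lowercase bases as equivalent to uppercase; the function names are ours.

```python
VALID_BASES = frozenset("ACGT")


def is_scorable_snp(base_a, base_b):
    """True only when both bases are unambiguous (A, C, G, or T) and differ."""
    a, b = base_a.upper(), base_b.upper()
    return a in VALID_BASES and b in VALID_BASES and a != b


def count_snps(seq_a, seq_b):
    """Count scorable SNPs between two aligned sequences of equal length."""
    return sum(is_scorable_snp(a, b) for a, b in zip(seq_a, seq_b))


# R is an ambiguity code and '-' is a gap; neither position is scored.
n_snps = count_snps("ACGTRA-", "ACCTGAT")
print(n_snps)
```

Only the G/C difference at the third position is counted in this toy pair.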

Basic genomes of individual strains are 94.5–99.4% of the total length of the filtered multiple alignment (due to deletions) and are 68.7–84.4% of their corresponding complete-genome lengths (Table S3), reflecting different percentages occupied by mobile elements and exchangeable operons or functional modules present in fewer than 22 of these genomes. Some exchangeable operons or gene clusters that should be considered part of the basic genome did not have enough representation in this set of genomes to have passed the filter.

Supplementary Material

Supplementary File
pnas.1510839112.sd03.xls (254.5KB, xls)

Acknowledgments

This work was supported by Grants PM-031 and ELS165 from the Office of Biological and Environmental Research of the US Department of Energy and internal research funding from Brookhaven National Laboratory.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510839112/-/DCSupplemental.
