Genome Biology and Evolution. 2023 May 22;15(6):evad087. doi: 10.1093/gbe/evad087

T Residues Preceded by Runs of G Are Hotspots of T→G Mutation in Bacteria

Joshua L Cherry
Editor: Laura Katz
PMCID: PMC10243904  PMID: 37216188

Abstract

The rate of mutation varies among positions in a genome. Local sequence context can affect the rate and has different effects on different types of mutation. Here, I report an effect of local context that operates to some extent in all bacteria examined: the rate of T→G mutation is greatly increased by preceding runs of three or more G residues. The strength of the effect increases with the length of the run. In Salmonella, in which the effect is strongest, a G run of length 3 increases the rate by a factor of ∼26, a run of length 4 increases it by almost a factor of 100, and runs of length 5 or more increase it by a factor of more than 400 on average. The effect is much stronger when the T is on the leading rather than the lagging strand of DNA replication. Several observations eliminate the possibility that this effect is an artifact of sequencing error.

Keywords: mutation rate, mutational spectrum, homopolymer run, transversions


Significance.

The rate and spectrum of mutation vary among nucleotide positions in a genome, in part due to effects of local sequence context. This work reports an effect of this type that is widespread in bacteria: the rate at which T mutates to G is unusually high at positions preceded by three or more G residues, and longer G runs are associated with higher mutation rates. This phenomenon is responsible for a large fraction of T→G mutations in some bacteria and accounts for some previous observations.

Introduction

The rate of mutation varies among nucleotide sites within a genome (Hodgkinson and Eyre-Walker 2011). In addition to moderate variation among the bulk of sites, some sites, often referred to as hotspots, exhibit very high mutation rates. Often, only mutation to a particular nucleotide is accelerated at a hotspot.

Some hotspots of mutation are caused by DNA modification. Methylation of cytosine at the C5 position increases the rate of C→T mutation in bacteria (Coulondre et al. 1978; Cherry 2018) and eukaryotes (Bird 1980; Cooper and Youssoufian 1988). N6-methylated adenines in Dam sites also exhibit an elevated mutation rate (Lee et al. 2012). Certain N4-methylated cytosines are hotspots of mutation, though the causal role of this methylation is uncertain (Cherry 2021).

Repeats of short sequence motifs, including homopolymer runs, are subject to high rates of insertion and deletion of one or more repeat units. The mechanism is thought to involve slipped-strand mispairing, in which the strands of the repeat region pair out of register near the end of the strand being synthesized (Levinson and Gutman 1987).

Although natural selection may generally favor lower mutation rates, it may favor hotspots at certain locations, particularly if reversal is facile or the mutation rate is high only in somatic cells. Bacterial “contingency loci” in which insertion and deletion occur frequently in homopolymer runs (Moxon et al. 2006) are examples. Reversibility need not be an issue if the benefits of hypermutation are due mainly to kin selection, that is, they mostly accrue not to the mutant and its descendants, but to nonmutant close relatives (Cherry 2020).

Mutation rate and its determinants have been studied using natural variation and laboratory experiments. Laboratory experiments include mutation accumulation experiments, in which the effects of selection are minimized; laboratory evolution experiments, in which the main goal is adaptation; and selection experiments, in which only mutants of a particular type survive. Experiments with mutator genotypes provide clues to mechanisms of mutation and heterogeneity of its rate. Natural variation, even within a species, is subject to confounding effects of selection if long evolutionary distances are involved, but it reflects mutational processes as they occur in nature. Analysis of natural mutations of very recent origin can greatly reduce the effects of selection while yielding much larger numbers of mutations than laboratory experiments.

Studies of various types have revealed effects of nearest-neighbor nucleotides (the bases immediately preceding and following the mutated position) on mutation rate (Lee et al. 2012; Sung et al. 2015; Goncearenco et al. 2017). The identities of flanking bases have different effects on different types of mutations, for example, C→T versus C→A and C→G. Effects of nonadjacent nearby nucleotides have also been reported (Aggarwala and Voight 2016; Ling et al. 2020).

Factors other than local sequence context have been found to correlate with mutation rate. These include orientation with respect to the direction of replication (leading/lagging strand asymmetry) (Lujan et al. 2014; Sung et al. 2015), the rate and direction of transcription (Hudson et al. 2003; Lujan et al. 2014), distance from the origin of replication (Mira and Ochman 2002; Lujan et al. 2014), and nucleosome positioning in eukaryotes (Lujan et al. 2014; Li and Luscombe 2020).

Here, I describe a mutational phenomenon that occurs in bacteria: the rate of T→G mutation is greatly increased if the T is immediately preceded by three or more G residues. This effect was detected in all bacteria analyzed. It was inferred mainly from sequence comparisons between closely related natural isolates, but is also apparent in results of mutation accumulation experiments. In Salmonella, in which the effect is strongest, this phenomenon accounts for about one third of all T→G mutations.

Results

Elevated T→G Mutation Rate at GGGT Sites in Salmonella

Examination of candidate hotspots of mutation in Salmonella revealed that many exhibit mainly T→G changes at GGGT. This observation motivated a formal analysis of the rate of changes of this type.

Analysis of Salmonella data from the NCBI Pathogens project (PDG000000002.2479) yielded 1,494,207 polarizable single-nucleotide changes meeting criteria for strength of support. Of the 102,153 T→G changes, 36,130, or ∼35%, were at positions with at least 3 immediately preceding G residues (G≥3 sites), though these constitute only 1.1% of the T positions.
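The site classification used here (G0, G1, ..., G≥3) can be illustrated with a minimal sketch. The function names and the cap at five are illustrative choices, not taken from the analysis code, and the sketch treats the sequence as linear rather than circular.

```python
# Hypothetical helpers: classify each T by the length of the G run
# immediately 5' of it, counting both strands of a linear sequence.

def preceding_g_run(seq: str, i: int) -> int:
    """Number of consecutive G residues immediately preceding position i."""
    n = 0
    j = i - 1
    while j >= 0 and seq[j] == "G":
        n += 1
        j -= 1
    return n

def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_t_sites_by_run(seq: str, cap: int = 5) -> list:
    """Counts of T positions on both strands, indexed by preceding
    G-run length; runs of length >= cap are pooled in the last bin."""
    counts = [0] * (cap + 1)
    for strand in (seq, revcomp(seq)):
        for i, base in enumerate(strand):
            if base == "T":
                counts[min(preceding_g_run(strand, i), cap)] += 1
    return counts
```

In this scheme a G≥3 site is any T whose bin index is three or more; the same function applied to the reverse complement covers A positions followed by runs of C on the forward strand.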

Table 1 presents relative T→G rates for positions preceded by G tracts of different lengths. The rate is increased substantially by G tracts of length three and even more by longer tracts. Because T→G transversions constitute only ∼21% of point mutations at T positions not preceded by G, the factor by which the total mutation rate is increased is about 5-fold lower, but nevertheless quite large.

Table 1. Effects of Preceding G Tracts on T→G Mutation Rate in Salmonella

Length of preceding G tract   1           2           3           4          5+
Mutations                     6,953       3,439       16,527      9,920      9,683
Sites (a)                     367,103     120,477     20,557      3,288      733
Relative rate                 0.62        0.93        26.3        98.6       431
95% CI                        0.60–0.63   0.90–0.97   25.8–26.7   96.5–101   422–441

(a) Weighted mean number of sites per genome.
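As a quick check on the arithmetic, the counts in table 1 can be combined as follows. This is only a sketch: the table does not list G0 counts, so rates are compared with G1 sites here rather than with the G0 baseline used for the published relative rates.

```python
# Counts copied from table 1; keys are lengths of the preceding G tract.
mutations = {"1": 6953, "2": 3439, "3": 16527, "4": 9920, "5+": 9683}
sites = {"1": 367103, "2": 120477, "3": 20557, "4": 3288, "5+": 733}
total_t_to_g = 102153  # all T->G changes reported in the text

# Share of T->G changes that fall at G>=3 sites.
g3plus = sum(mutations[k] for k in ("3", "4", "5+"))
share = g3plus / total_t_to_g

# Per-site rates (in arbitrary units) and their ratio to the G1 rate.
per_site = {k: mutations[k] / sites[k] for k in mutations}
ratio_vs_g1 = {k: per_site[k] / per_site["1"] for k in per_site}

print(g3plus, f"{share:.1%}")
for k, r in ratio_vs_g1.items():
    print(k, round(r, 1))
```

The ratio for G3 relative to G1 obtained this way agrees with the ratio of the two published relative rates (26.3/0.62) to within rounding.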

Evidence against Systematic Sequencing Error

Systematic sequencing error might produce the appearance of high mutation rates at certain sequence motifs. Several types of analysis were performed to test for this possibility.

A mutation may occur along an internal branch of the phylogenetic tree, usually affecting multiple isolates that constitute a clade. In contrast, a sequencing error at a particular position is unlikely to affect multiple closely related isolates. The fraction of mutations of a particular type inferred to have occurred along internal branches by the most parsimonious reconstruction (see Materials and Methods) is therefore an indication of whether they are genuine. This fraction is 32.7% for mutations other than those of interest. For mutations of interest, this fraction is very similar: 32.7%, 33.3%, and 32.1% for T→G mutations at sites preceded by G runs of length three, four, and five or more, respectively. This result suggests that the mutations of interest are mostly genuine.

Although motifs including runs of G are a source of sequence-specific errors for some Illumina platforms (Stoler and Nekrutenko 2021), only reads in one direction are affected. Because the analysis in table 1 considered only mutations with strong bidirectional support (at least 20 aligned reads, at least 90% of which supported the called base, with at least 25% of the supporting reads in each direction), such errors should mostly be eliminated. Furthermore, removing all requirements for read support does not increase the apparent strength of the effect, instead reducing it slightly (results not shown). Making the requirements more stringent (at least 50 reads, 100% supporting, and at least 40% in each direction) also does not change the strength of the effect greatly.
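The read-support criteria can be written as a small predicate. This is a sketch under one reading of the thresholds (the directional fraction is taken relative to the supporting reads); the function and parameter names are hypothetical.

```python
def passes_read_support(fwd_support: int, rev_support: int, total_reads: int,
                        min_reads: int = 20, min_frac: float = 0.90,
                        min_dir_frac: float = 0.25) -> bool:
    """Return True if a called base has sufficient bidirectional support.

    fwd_support / rev_support: reads supporting the called base in each
    direction; total_reads: all reads aligned to the position.
    """
    support = fwd_support + rev_support
    if total_reads < min_reads or support == 0:
        return False
    # Enough of the aligned reads must support the called base.
    if support / total_reads < min_frac:
        return False
    # Each direction must contribute a minimum share of the support.
    return min(fwd_support, rev_support) / support >= min_dir_frac
```

The more stringent setting mentioned in the text corresponds to calling the function with min_reads=50, min_frac=1.0, and min_dir_frac=0.40.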

Additional evidence that the phenomenon is not an artifact of systematic sequencing error is presented in supplementary text S1, Supplementary Material online.

Nonsynonymous versus Synonymous Sites

Comparison of rates of nonsynonymous and synonymous change can provide information about the effects of selection on the apparent spectrum of mutation. The ratio of all nonsynonymous to synonymous changes in the Salmonella data is ∼80% of the neutral expectation. Supplementary table S3, Supplementary Material online compares nonsynonymous and synonymous T→G rates at positions with preceding G tracts of different lengths. The nonsynonymous rate is approximately equal to the synonymous rate for G0 sites and slightly smaller for G1 and G2 sites. Among G≥3 sites, the rate is higher at nonsynonymous sites, especially among G4 sites. As a consequence, the rate increase associated with G runs is somewhat smaller for synonymous sites than for all sites, but is nonetheless quite large: a factor of 22.5, 49.5, and 320 for G3, G4, and G≥5 positions, respectively. Supplementary table S3, Supplementary Material online also shows that G3 sites are the category most likely to be synonymous, but G4 and G≥5 are the least likely.

Effect of the Downstream Nucleotide

Figure 1 compares T→G rates at G0–G≥5 sites with different nucleotides immediately following the T. Strong effects of the 3′ nucleotide are evident for G3, G4, and G≥5 positions. The order of apparent mutation rates is C > G > T > A. The rate is 7.4–8.9 times higher with a 3′ C than with a 3′ A.

Fig. 1. Effect of the 3′ base on the T→G mutation rate.

Leading/Lagging Strand Asymmetry

Replication of a circular bacterial chromosome generally proceeds in both directions from an origin of replication, terminating at approximately the opposite point of the circle. The replication fork is asymmetric, and the mutational process differs between leading and lagging strand synthesis (Fijalkowska et al. 1998). I therefore assessed the T→G mutation rate in Salmonella as a function of orientation with respect to the direction of replication.

Many of the genome assemblies, including the reference assemblies for some clusters, are highly fragmented. Along with the fact that many genome rearrangements have occurred within Salmonella, this makes it impossible to know the orientation of single-nucleotide polymorphism (SNP) positions in some clusters. I therefore made use of the Salmonella cluster with the largest number of inferred mutations, PDS000089910.229, which contains 10,062 isolates and yields 66,534 chromosomal mutations for analysis. The use of a single cluster also makes it easier to know that different changes have occurred at the same position in the genome.

Figure 2, top panel, shows GC skew for the sequence of the reference chromosome for the cluster (CP093400.1), along with the approximate locations of the origin and terminus of replication determined from its minimum and maximum. The location of the origin is approximately consistent with the location of the dnaA gene at the beginning of the sequence and its forward-strand orientation, as the origin is generally not far upstream of this gene in Salmonella.

Fig. 2. Orientation dependence of GGGT→GGGG mutation in Salmonella. The upper panel shows GC skew, which reveals the approximate locations of the origin and terminus of replication. The lower panel shows the cumulative number of mutations of interest for the forward and reverse orientations.

The lower panel of figure 2 shows the cumulative distribution of the genome position of T→G changes at positions preceded by three or more G residues for “forward” and “reverse” orientations. Also shown are the approximate positions of the origin and terminus of replication. It is evident that there are many more changes at sites where GGGT rather than its complement is on the leading strand. Table 2 shows that the rate per site after G runs of length three or more is much higher when the T is on the leading strand.
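Assigning a site to the leading or lagging strand can be sketched as follows. The sketch assumes a circular chromosome replicated bidirectionally from a single origin, and the usual convention that, on the replichore where the fork moves toward increasing coordinates, the forward strand as written corresponds to the leading strand. The function name and arguments are hypothetical.

```python
def is_leading(pos: int, motif_on_forward: bool,
               origin: int, terminus: int, genome_len: int) -> bool:
    """True if a motif at coordinate pos lies on the leading strand.

    motif_on_forward: whether the motif (e.g., GGGT) reads 5'->3' on the
    forward strand of the reference sequence.
    """
    # Distances measured in the direction of increasing coordinate,
    # wrapping around the circle.
    from_origin = (pos - origin) % genome_len
    origin_to_terminus = (terminus - origin) % genome_len
    # On this replichore the fork moves toward increasing coordinates.
    fork_moves_forward = from_origin < origin_to_terminus
    # The motif is on the leading strand when its strand direction
    # matches the direction of fork movement.
    return motif_on_forward == fork_moves_forward
```

With the origin at coordinate 0 and the terminus near the midpoint, forward-strand motifs in the first half of the sequence and reverse-strand motifs in the second half are classified as leading.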

Table 2. Orientation Dependence of T→G Mutation Rate in Salmonella

                                 Length of preceding G tract
                                 0           1           2           3           4           5+
Leading strand   Mutations       1,340       184         134         431         296         287
                 Sites           898,982     190,697     64,534      10,878      1,927       485
                 Relative rate   1.08        0.7         1.5         28.7        111         429
                 95% CI          1.01–1.15   0.60–0.81   1.25–1.79   25.9–31.8   98.3–126    378–485
Lagging strand   Mutations       1,151       154         62          102         30          21
                 Sites           905,907     175,673     55,659      9,802       1,375       248
                 Relative rate   0.92        0.64        0.81        7.54        15.8        61.4
                 95% CI          0.86–0.99   0.54–0.75   0.62–1.04   6.12–9.19   10.6–22.6   37.9–94.0
Rate ratio (a)                   1.17        1.1         1.86        3.81        7.04        6.99
                 95% CI          1.08–1.27   0.88–1.37   1.37–2.56   3.06–4.77   4.83–10.6   4.49–11.5

(a) Ratio of rate per site on leading and lagging strand.
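The point estimates in table 2 follow directly from the counts. The sketch below recomputes them, using the pooled G0 rate of both strands as the baseline, as described in Materials and Methods; confidence intervals are not reproduced because the interval method is not specified here.

```python
# Counts copied from table 2; index = length of preceding G tract (5 = 5+).
lead_mut = [1340, 184, 134, 431, 296, 287]
lead_sites = [898982, 190697, 64534, 10878, 1927, 485]
lag_mut = [1151, 154, 62, 102, 30, 21]
lag_sites = [905907, 175673, 55659, 9802, 1375, 248]

# Baseline: G0 rate per site with both strands pooled.
baseline = (lead_mut[0] + lag_mut[0]) / (lead_sites[0] + lag_sites[0])

lead_rel = [m / s / baseline for m, s in zip(lead_mut, lead_sites)]
lag_rel = [m / s / baseline for m, s in zip(lag_mut, lag_sites)]

# Leading/lagging ratio of the per-site rates.
ratio = [a / b for a, b in zip(lead_rel, lag_rel)]

for n in range(6):
    print(n, round(lead_rel[n], 2), round(lag_rel[n], 2), round(ratio[n], 2))
```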

Effects of G Runs on Other Types of Mutation

Figure 3 shows relative rates of some other types of mutation at positions adjacent to G tracts of different lengths in Salmonella. Values are relative to the rate with a G tract of length zero, that is, for positions preceded or followed by a base other than G.

Fig. 3. Relative rates of various types of mutation at positions adjacent to G tracts of different lengths in Salmonella. Rates are relative to positions not adjacent to G. Error bars indicate 95% CIs. “5+” denotes five or more G residues.

Mutation of T to C or A is not strongly affected by preceding runs of G, that is, their effect is specific for mutation to G. Although the rate of T→C mutation is about 2-fold higher when the preceding base is a G, it does not increase substantially with additional preceding G residues.

An effect on mutation of C or A to G is evident, particularly for longer runs, but it is weak compared with the effect on mutation of T. Although the effect for runs of length four or more appears to be stronger for C than for A, this is attributable to the larger denominator for A, which reflects the generally higher rate of transitions compared with transversions. In terms of absolute numbers of additional mutations per site, the effect on A is larger than that on C, but still much smaller than the effect on T. The apparent strength for runs of at least five may also be affected by differences between preceding run length distributions among nucleotides.

G runs that follow rather than precede a T have only a small effect on T→G mutation rate. There is no substantial effect of following G runs on A→G mutation. Equivalently, preceding runs of C do not substantially affect the rate of T→C mutation.

Supplementary figures S2 and S3, Supplementary Material online show the same analysis for the remaining categories of mutation. In most cases, mutation rates vary by less than a factor of three with the number of preceding or following G residues. The exceptions are T→A and A→T changes at positions followed by G runs, for which rates vary by up to a factor of about eight.

Contribution to Nearest Neighbor Effects

Figure 4 shows the effects of nearest neighbors (the immediately preceding and following nucleotides) on rates of different types of mutation in Salmonella. All mutations are oriented so that the ancestral base is a pyrimidine. The contribution of GGGT→GGGG mutations is shown in red. Also shown are the contributions of C→T mutation at Dcm methylation sites and those of mutation at Dam methylation sites (represented as mutation of the paired T).

Fig. 4. Context dependence of mutation rates in Salmonella. For each strand-symmetrized category of point mutation, the number of mutations per site is shown for all 16 combinations of preceding and following base. The contributions of GGGT→GGGG mutations, transitions at Dcm-methylated sites, and mutations opposite Dam-methylated adenines are indicated.

Other Bacteria

Other bacteria were assessed for an effect of preceding G runs on the rate of T→G mutation using the NCBI Pathogens data. The choice of taxa for analysis was guided by the quantities of available data, the desire for phylogenetic diversity, and the desire to analyze close relatives of Salmonella.

Figure 5 presents results for several bacteria. The phenomenon operates to some extent in all of these, though its strength varies considerably among them (note that the vertical axis is logarithmic). The effect is strongest in Salmonella, followed by its close relative Escherichia coli.

Fig. 5. Relative T→G mutation rate with different numbers of preceding G residues in various bacteria. Values are relative to the rate with no preceding G. Error bars indicate 95% CIs. “5+” denotes five or more G residues. Note the logarithmic vertical scale.

Confirmation by Mutation Accumulation Results

Mutation accumulation experiments employ repeated population bottlenecks to reduce the effects of selection. Although they yield relatively small numbers of mutations and reflect laboratory rather than natural growth conditions, they present an opportunity for confirmation that the phenomenon is not an artifact of effects of selection.

Table 3 compares T→G mutation rates at G≥3 positions with those at G<3 positions in MA experiments with nonmutator strains of five species of bacteria (two pooled experiments each for E. coli and Salmonella enterica). In every case, the rate at G≥3 positions is higher by an estimated factor of ∼20 or more, and the lower bound on the 95% confidence interval (CI) is well above 1, confirming the mutagenic effect of G runs. The CIs for E. coli and Vibrio cholerae include the corresponding estimates from natural isolates. The rate enhancement in the Salmonella MA experiments is somewhat smaller than that inferred from natural isolates: the upper bound of the CI for the former is 69% of the estimate for the latter. This is not primarily due to a paucity of GGGT→GGGG mutations, which constitute 2.1% of the total, close to the 2.42% in natural isolates and statistically indistinguishable from it. It instead reflects an ∼2-fold higher fraction of G<3 T→G mutations in the MA experiments. As shown in supplementary figure S4, Supplementary Material online, this is the only substantial difference between the fractions of single-base mutation types in the two datasets.

Table 3. Effects of Preceding G Runs on T→G Mutation Rates in Mutation Accumulation Experiments

               Mutation counts              Fraction   Fraction   Rate enhancement
Species        Total   T→G   GGGT→GGGG     of total   of T→G     MA (95% CI)        Natural isolates
S. enterica    660     81    14             2.1%       17.3%      19.6 (10.2–35.2)   51.3
E. coli        449     51    9              2.0%       17.6%      19.9 (8.53–41.5)   40.5
V. cholerae    138     17    5              3.6%       29.4%      41.3 (11.4–126)    20.2
V. fischeri    219     26    3              1.4%       11.5%      22.3 (4.29–73.9)
M. smegmatis   856     111   48             5.6%       43.2%      27.4 (18.4–40.5)
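The two fraction columns of table 3, and the 69% figure quoted in the text for Salmonella, can be recomputed from the counts. This is a minimal sketch; the dictionary layout and function name are illustrative.

```python
# (total mutations, T->G mutations, GGGT->GGGG mutations) from table 3.
table3 = {
    "S. enterica": (660, 81, 14),
    "E. coli": (449, 51, 9),
    "V. cholerae": (138, 17, 5),
    "V. fischeri": (219, 26, 3),
    "M. smegmatis": (856, 111, 48),
}

def fractions(total: int, t_to_g: int, gggt: int) -> tuple:
    """GGGT->GGGG mutations as a fraction of all mutations and of T->G."""
    return gggt / total, gggt / t_to_g

for species, counts in table3.items():
    of_total, of_t_to_g = fractions(*counts)
    print(f"{species}: {of_total:.1%} of total, {of_t_to_g:.1%} of T->G")

# Upper CI bound for the Salmonella MA estimate relative to the
# natural-isolate estimate (values from table 3).
salmonella_upper_vs_natural = 35.2 / 51.3
```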

Kucukyildirim et al. (2016) noted an overrepresentation of A→C mutations at GACC and CACC sites in the Mycolicibacterium smegmatis MA experiment. They attributed this to conjectured methylation of these adenines. However, 88% of these mutations are at positions followed by three or more C residues, that is, they are GGGT→GGGG mutations. The elevated rate at these sites apparently reflects the phenomenon reported here rather than indicating DNA methylation.

No Strong Effect in Yeast

An analysis of data from a yeast mutation accumulation experiment (Sharp et al. 2018) is shown in supplementary table S4, Supplementary Material online. No GGGT→GGGG mutations were observed. This result is consistent with no enhancement of transversion rate by preceding runs of G. Nonetheless, a sizable effect of four or more G residues cannot be ruled out due to the low absolute numbers of other types of mutations. For just three, however, the 95% CI limits the size of any enhancement to about a factor of three. If all sites with G runs of length three or more are combined, a tighter upper bound of 2.65 is obtained. Assuming that additional G residues do not decrease the mutation rate, this bound can be applied to runs of length three. The implication is that any effect of three preceding G residues in yeast is weaker than that in any of the bacteria analyzed, with the possible exception of Campylobacter. Use of single-sided CIs, arguably justified for counts of zero, would reduce all of the upper bounds by ∼20%, strengthening this conclusion.

Effect of MutT Deficiency

The MutT protein catalyzes the hydrolysis of 8-oxo-dGTP, reducing the rate of T→G mutation that results from misincorporation of 8-oxo-G across from A in the template (Setoyama et al. 2011). This pathway is responsible for most mutations in strains deficient in mutT. Couce et al. (2017) compared the mutational spectra of E. coli nonmutator and mutator strains, including mutT mutants.

As demonstrated above, the effect of G runs on T→G mutation is apparent during laboratory growth of nonmutator E. coli. Supplementary table S5, Supplementary Material online shows analogous results for mutT-deficient strains in mutation accumulation and long-term evolution experiments. Mutation rates are largely unaffected by preceding runs of G in mutT mutants, strongly suggesting that G runs do not exert their effect through incorporation of 8-oxo-G.

Discussion

The rate of T→G mutation in bacteria is especially high at positions immediately preceded by at least three G residues and increases with the number of these. This phenomenon was apparent in all of the bacteria analyzed, which are phylogenetically diverse, but its strength varies considerably among them. It is strongest in Salmonella and nearly as strong in the closely related E. coli. In Salmonella, the T→G rate at such sites is 4-fold to 7-fold higher when the T, rather than the paired A, is on the leading strand of DNA replication.

Several types of evidence indicate that the phenomenon is not an artifact of systematic sequencing error. These include the high frequencies with which mutations of the type of interest are detected in two or more appropriately related isolates, increases in their numbers with time and with sequence divergence, and full support from sequencing technologies expected to produce fewer errors of the relevant type. In addition, the strong observed leading/lagging strand bias suggests a genuine biological phenomenon, as does the absence of a strong effect in yeast. Furthermore, a strong hotspot of T→G mutation that has been characterized in the laboratory (Horton et al. 2021) appears to be an example of the phenomenon, as the position is preceded by a run of four G nucleotides.

Mutation Rate versus Selection

Selection is expected to affect the estimates of mutation rates from natural variation. However, its effect is expected to be weak for the NCBI Pathogens data because the mutations analyzed are of recent origin. The ratio of nonsynonymous to synonymous changes in the Salmonella data is only ∼20% lower than expected under selective neutrality, indicating that the effects of purifying selection are modest. Furthermore, only systematic differences in selection between categories of sites will affect the estimates of their relative mutation rates.

The apparent nonsynonymous T→G rate is indeed only slightly lower than, or approximately equal to, the synonymous rate at G<3 positions (supplementary table S3, Supplementary Material online). At G≥3 positions, the apparent nonsynonymous rate exceeds the synonymous rate, the opposite of what would be expected from stronger selection against nonsynonymous changes. This could be due to positive selection for certain nonsynonymous changes, though this would have to overcome the effects of purifying selection. It might instead indicate a genuinely higher average mutation rate at the nonsynonymous sites due to association of nonsynonymy with features that affect mutation rate, such as local sequence context and leading/lagging strand orientation. In any case, a very strong effect is apparent from synonymous mutations alone, which are expected to be little influenced by selection.

A strong effect of preceding G runs is also apparent in MA experiments, which are designed to minimize the effects of selection. In Salmonella MA experiments, the effect appears to be somewhat smaller than that estimated from natural isolates. This might be due to effects of selection in nature. However, the difference is due primarily to a much higher overall fraction of T→G mutations at G<3 sites in the MA experiments rather than rarity of such mutations at G≥3 sites. A selective explanation is therefore difficult to reconcile with the comparison of synonymous and nonsynonymous rates. The discrepancy may instead reflect a genuine difference in mutational spectra between natural and laboratory growth conditions. Growth conditions can have large effects on the spectrum of mutation (Maharjan and Ferenci 2017, 2018; Shewaramani et al. 2017; Ferenci 2019).

Mechanism

One mechanism of T→G mutation involves incorporation of 8-oxo-G opposite an A in the template. In E. coli with inactivated mutT, this pathway is the main cause of mutation. In experiments involving such strains, T→G mutation was no more common at sites preceded by G·C runs. This observation is strong evidence that this pathway is not the predominant mechanism of GGGT→GGGG mutation in a wild-type background because it does not disproportionately affect such sites.

Homopolymer runs, particularly of G·C, are prone to expansion and contraction in bacteria. Insertions and deletions in these runs are thought to occur mainly through slipped-strand mispairing (Levinson and Gutman 1987). This process might also be involved in the high rate of T→G transversion after a G·C tract. After slipped-strand mispairing and addition of an extra G at the end of the run, a return to in-register pairing would result in a G:A mismatch and potentially a T→G mutation. Analogous mechanisms would also explain the small effects of preceding G runs on A and C mutation to G. It is not obvious, however, why the effect would be strongest for mutation of T or why preceding runs of C would not increase the rate of T→C mutation.

The leading/lagging strand asymmetry of the mutation rate suggests that a replication-associated event is involved in the phenomenon. It does not, however, distinguish between an event affecting GGGT synthesis that is more common during leading strand synthesis and an event affecting ACCC synthesis that is more common during lagging strand synthesis. Furthermore, orientation might affect one step in the pathway to mutation, whereas homopolymer runs affect another.

If G runs do exert their effect during DNA synthesis, they might do so during either replication or repair. Even if slipped-strand mispairing plays no role, it seems more likely that they affect synthesis of the T than that of the complementary A. In the structures of both DNA polymerases 1 (PDB 1QTM (Li et al. 1999)) and 3 (PDB 3F2B (Evans et al. 2008)) complexed with primer/template and dNTP, the downstream base(s) of the template strand is/are distant (more than 9 Å) from the template base, incoming dNTP, and polymerase catalytic site. In contrast, the immediately upstream bases of both strands come within 3.5 Å of the nascent base pair. The upstream duplex approximates an ordinary double helix, with interactions that could communicate effects of further upstream bases to the region of the nascent base pair. Thus, the G run seems more likely to affect synthesis when it is upstream of the nascent base pair, that is, during T-strand synthesis. An effect on T synthesis is also concordant with the fact that the mutation rate is higher when GGGT is on the leading strand in conjunction with the suggestion that leading strand synthesis is more error-prone (Fijalkowska et al. 1998).

Consequences

The phenomenon accounts for some of the strongest effects of neighboring bases on mutation rate in Salmonella (fig. 4). The mutagenic effects of Dcm and Dam methylation explain some, though not all, of the remaining strong effects.

The contribution of the phenomenon to mutation varies among bacteria due to variation in the degree to which mutation rate is elevated, differences in the frequencies of the motifs affected by it, and differences in the base rate of T→G mutation relative to rates of other types of single-base mutation. In Salmonella, the phenomenon accounts for 2.42% of mutations, more than half the fraction due to transitions at Dcm-methylated positions. It accounts for more than one-third of T→G mutations and about 10% of mutations from T or A to C or G. The resulting mutations increase the length of the G run by at least one nucleotide, sometimes producing a motif even more prone to mutation and further extension of the run.

Because a highly mutable motif will tend to be short-lived in the absence of selection against the high-frequency mutation, these motifs may be found disproportionately at positions where such selection operates. This is particularly true of sites with more than three preceding G residues, which have particularly high mutation rates. Purifying selection may therefore eliminate most changes at such sites in the long run, though they are apparent in the short-range comparisons considered here.

Methods

NCBI Pathogens Data

Most analyses were based on data from the NCBI Pathogens database (https://www.ncbi.nlm.nih.gov/pathogens/). This contains information derived from whole-genome sequencing of large numbers of bacterial isolates for many taxonomic groups (species or genera; Shigella is combined with E. coli, of which it is part phylogenetically). It provides SNP calls for clusters of very closely related isolates, along with phylogenetic trees based on these SNPs.

The build runs (versions of taxon-specific datasets) used in the analysis were PDG000000002.2479 (Salmonella), PDG000000004.3383 (E. coli), PDG000000001.2875 (Listeria), PDG000000003.1716 (Campylobacter), PDG000000032.298 (Neisseria), PDG000000036.681 (Pseudomonas aeruginosa), PDG000000055.317 (V. cholerae), PDG000000026.169 (Legionella pneumophila), and PDG000000073.309 (Staphylococcus aureus).

The sequencing technology used for each assembly was determined from the “Sequencing technology” line of the associated assembly_stats.txt file. For mutations that were inferred to have occurred on an internal branch of the tree, it was checked that the SNP data for the supporting genome(s) sequenced by the technology of interest contained the derived G, rather than an ambiguity that had been resolved by the ancestral state reconstruction or a different nucleotide due to an ostensible second change at the position. The few cases not meeting this condition were not included in the analysis.

Reconstruction of Sequence Evolution

For each cluster containing at least five isolates, sequence changes within the cluster were reconstructed for all reported SNP positions that passed the filters for sequence quality and SNP density. Tree distances between isolates in the same cluster were at most a few hundred SNPs in genomes of several million base pairs. Reconstruction was done by maximum parsimony with a “soft” treatment of multifurcations (Maddison 1989). The nature of each change (the ancestral and derived nucleotide) and its sequence context were recorded, along with its location in the reference genome for the cluster. The details of this procedure have been described elsewhere (Cherry 2018; Cherry 2020).
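To illustrate how parsimony places a change on a branch, the sketch below applies Fitch's algorithm to a single site on a strictly binary tree. This is a simplified stand-in: it does not implement the soft treatment of multifurcations used in the analysis, and the tree encoding (nested tuples of leaf names) is a hypothetical convenience.

```python
def fitch(tree, leaf_states):
    """Bottom-up pass of Fitch parsimony for one site.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: dict mapping leaf name to its observed nucleotide.
    Returns (set of candidate states at this node, minimum change count).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children can agree: no additional change is needed here.
        return common, left_changes + right_changes
    # Children disagree: one change is required on a descendant branch.
    return left_set | right_set, left_changes + right_changes + 1

# Isolates a and b share a derived G; c and d retain the ancestral T.
tree = ((("a", "b"), "c"), "d")
states = {"a": "G", "b": "G", "c": "T", "d": "T"}
root_set, n_changes = fitch(tree, states)
```

In this toy example the single inferred change is a T→G substitution on the internal branch leading to the clade of a and b, the kind of event that a sequencing error in one isolate would be unlikely to mimic.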

Analysis of Apparent Mutation Rates

Sequence changes that were mapped to branches descending directly from the root of a tree were excluded from the analysis because of the impossibility of determining their direction (e.g., C→T vs. T→C). Some were also excluded on the basis of information about sequence reads that aligned to the SNP position. For cases in which more than one isolate was affected, a single representative was chosen at random so that all mutations could be subjected to identical criteria. A case was included in the analysis only if there were at least 20 aligned reads, at least 90% of them supported the called base, and at least 25% were supporting reads in each direction.
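
A minimal sketch of the read-support filter is given below. It is an interpretation, not the code used for the analysis: it reads the strand criterion as requiring each orientation to contribute at least 25% of the supporting reads, and the function and parameter names are hypothetical.

```python
def passes_read_filter(total_reads: int, support_fwd: int, support_rev: int) -> bool:
    """Apply the three read-support criteria to one SNP call.

    total_reads: number of reads aligned to the position
    support_fwd, support_rev: reads supporting the called base, by orientation
    """
    supporting = support_fwd + support_rev
    # At least 20 aligned reads.
    if total_reads < 20:
        return False
    # At least 90% of aligned reads support the called base
    # (integer arithmetic avoids floating-point edge cases).
    if 10 * supporting < 9 * total_reads:
        return False
    # Assumption: each orientation supplies at least 25% of supporting reads.
    return 4 * min(support_fwd, support_rev) >= supporting
```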

Numbers of T→G changes at positions with various numbers of preceding G residues were calculated, as were analogous counts for other types of mutation. For analyses involving multiple NCBI Pathogens clusters, frequencies of sites of each type were computed as averages among the reference genomes for each cluster, weighted by the number of usable changes in the cluster. The weighted mean site frequencies were multiplied by the weighted genome size. This multiplication does not affect relative rates at different types of sites, because it affects all sites in the same way, but it allows interpretation of the products as approximate numbers of sites per genome. Simply using the site counts in a single genome yielded similar results. Numbers proportional to rates per site were calculated as the ratio of relevant changes to the (weighted mean) number of occurrences of the site category. Rates relative to positions adjacent to zero-length tracts were calculated by simple division. The relative rate for zero-length tracts is identically one, and hence not presented, except in the analysis of leading/lagging strand asymmetry, for which the denominator was calculated using both strands.
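
The calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions: runs of five or more G residues are binned together, the chromosome is treated as linear (a run spanning the ends of a circular sequence is not joined), and all function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def t_sites_by_g_run(seq, cap=5):
    """Count T residues on one strand by the length of the preceding G run."""
    counts = Counter()
    run = 0  # length of the G run ending just before the current base
    for base in seq:
        if base == "T":
            counts[min(run, cap)] += 1  # runs of `cap` or more share a bin
        run = run + 1 if base == "G" else 0
    return counts


def sites_both_strands(seq, cap=5):
    """Site counts summed over both strands of the sequence."""
    seq = seq.upper()
    return t_sites_by_g_run(seq, cap) + t_sites_by_g_run(reverse_complement(seq), cap)


def weighted_sites_per_genome(clusters):
    """Weighted mean site frequency times weighted mean genome size.

    clusters: list of (n_usable_changes, genome_size, site_counts) tuples,
    one per cluster; the number of usable changes serves as the weight.
    """
    total_w = sum(w for w, _, _ in clusters)
    mean_size = sum(w * size for w, size, _ in clusters) / total_w
    keys = set().union(*(c.keys() for _, _, c in clusters))
    return {
        k: mean_size * sum(w * c.get(k, 0) / size for w, size, c in clusters) / total_w
        for k in keys
    }


def relative_rates(changes, sites):
    """Changes per site for each run length, relative to run length zero.

    Requires a nonzero site count and change count for run length zero.
    """
    per_site = {k: changes.get(k, 0) / n for k, n in sites.items() if n > 0}
    return {k: v / per_site[0] for k, v in per_site.items()}
```

Here `changes` would map each run-length category to the number of T→G changes observed at such sites, and `sites` would be the output of `weighted_sites_per_genome` (or of `sites_both_strands` for a single genome).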

Origin and Terminus of Replication

The approximate locations of the origin and terminus of replication for CP093400.1 (PDT001274653.1), the reference sequence for Salmonella cluster PDS000089910.229, were determined from GC skew. This was calculated for each position in the chromosome as the count of G nucleotides minus the count of C nucleotides in the sequence up to that position. The positions with minimum and maximum GC skew were taken as the origin and terminus, respectively.
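
The cumulative GC skew calculation is simple enough to sketch directly. The function name is hypothetical, positions are reported as 1-based, and ties are resolved in favor of the first position attaining the extreme value, which is an arbitrary choice made for this illustration.

```python
from itertools import accumulate
from typing import Tuple


def origin_and_terminus(seq: str) -> Tuple[int, int]:
    """Estimate origin and terminus positions (1-based) from cumulative GC skew.

    The skew at a position is the count of G minus the count of C in the
    sequence up to and including that position.
    """
    steps = [1 if b == "G" else -1 if b == "C" else 0 for b in seq.upper()]
    cumulative = list(accumulate(steps))
    indices = range(len(cumulative))
    origin = min(indices, key=cumulative.__getitem__) + 1    # minimum skew
    terminus = max(indices, key=cumulative.__getitem__) + 1  # maximum skew
    return origin, terminus
```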

Analysis of Laboratory Evolution and Mutation Accumulation Data

Mutation data for the E. coli long-term evolution and mutT mutation accumulation experiments reported in Couce et al. (2017) were downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.sq67g. These data were analyzed in conjunction with the applicable chromosome sequences of E. coli REL606 (NC_012967.1) and MG1655 (NC_000913.2). Sequence differences appearing in multiple clones from the same experimental replicate were counted only once for the purpose of estimating relative mutation rates. For assessing the relationship between changes of interest and time, every difference from the REL606 sequence in every genome from a time point was counted.

Results of mutation accumulation experiments with nonmutator bacteria were analyzed for S. enterica (Lind and Andersson 2008; Pan et al. 2022), E. coli (Lee et al. 2012; Tenaillon et al. 2016), V. cholerae and V. fischeri (Dillon et al. 2017), and M. smegmatis (Kucukyildirim et al. 2016). The two S. enterica experiments were pooled, as were the two E. coli experiments. Yeast data were from Sharp et al. (2018). Mutations from haploid and diploid yeast were combined, and mutations in the mitochondrial genome were not included.

In most cases, the sequence context of the mutations was determined from the genome sequence(s) on which their positions were reported. For S. enterica mutations from Pan et al. (2022), in which mutations in diverse strains were all referenced to the LT2 genome, the sequence reads for the relevant strains were searched with 101 bp fragments of the LT2 sequence centered on each mutation position to determine the actual context. Target site frequencies were determined from the reference genomes, except for S. enterica, for which the frequencies in table 1 were used. For E. coli, the frequencies in the ancestral sequences for the two experiments were averaged.
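
Extraction of the query fragments might look like the following sketch. It covers only the construction of a 101 bp fragment centered on a mutation position, not the search of the sequence reads. Positions are assumed to be 1-based, the sequence is treated as linear, and the function name is hypothetical.

```python
def centered_fragment(genome: str, pos: int, length: int = 101) -> str:
    """Return a fragment of odd `length` centered on 1-based position `pos`."""
    if length % 2 == 0:
        raise ValueError("length must be odd so that the fragment has a center")
    half = length // 2
    start = pos - 1 - half  # 0-based start of the fragment
    end = pos + half        # 0-based exclusive end of the fragment
    if start < 0 or end > len(genome):
        # Fragments that would run past either end are not handled here.
        raise ValueError("fragment would extend past the end of the sequence")
    return genome[start:end]
```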

Analysis of P. aeruginosa Serial Isolate Data

The data provided by Marvig et al. (2015) were used for the analysis. Isolates bearing nonsynonymous changes or insertions or deletions in mutator genes were considered to have a mutator phenotype. Point mutations present only in such isolates were excluded from the analysis.

The isolates are divided into 36 clone types. Because a single mutation event may affect more than one isolate from a clone type, counts from isolates within a clone type are not independent. Therefore, just one nonmutator isolate was chosen randomly from each clone type for the purpose of calculating the coefficient of correlation between the number of mutations of interest and the number of point mutations of other types. Correlation coefficients were calculated for 100,000 independent random choices of isolates.
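
The resampling scheme can be sketched as follows, assuming that the Pearson coefficient is the correlation measure and that the input maps each clone type to the per-isolate counts for its nonmutator isolates. The data layout and function names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def sampled_correlations(clone_types, n_draws, rng):
    """Correlations for repeated random choices of one isolate per clone type.

    clone_types: dict mapping clone type to a list of
    (mutations_of_interest, other_point_mutations) tuples, one tuple per
    nonmutator isolate of that clone type.
    """
    results = []
    for _ in range(n_draws):
        # One randomly chosen isolate represents each clone type.
        picks = [rng.choice(isolates) for isolates in clone_types.values()]
        x = [p[0] for p in picks]
        y = [p[1] for p in picks]
        results.append(pearson(x, y))
    return results
```

Calling `sampled_correlations(data, 100000, random.Random())` would correspond to the 100,000 independent random choices described above.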

Randomized Counts for Correlation Coefficients

Correlation coefficients between the numbers of mutations of interest and of other types of mutation were calculated for Salmonella and P. aeruginosa. Because the actual numbers of mutations vary stochastically around their expected values, the correlations between the counts will usually be imperfect even if their expected values are exactly proportional. For comparison, the correlation coefficient expected under such proportionality was estimated using each randomly chosen set of isolates or isolate pairs. For each set, counts were drawn from a multivariate hypergeometric distribution such that the number of mutations for each isolate (pair) and the total number of mutations of interest were maintained. This procedure is equivalent to randomly designating mutations affecting the chosen isolates as mutations of interest, with the total number of mutations of interest equal to the actual total and each mutation equally likely to be so designated.
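
One way to draw such randomized counts is sketched below. It uses the equivalence stated above: sampling mutations without replacement and designating them as mutations of interest yields a draw from the multivariate hypergeometric distribution. The function name and data layout are hypothetical.

```python
import random
from collections import Counter


def randomized_interest_counts(totals, n_interest, rng):
    """Randomly designate `n_interest` mutations as mutations of interest.

    totals: total number of mutations for each isolate (or isolate pair).
    Returns the randomized number of mutations of interest for each isolate;
    per-isolate totals and the overall number of interest are preserved.
    """
    # One label per mutation, identifying the isolate that carries it.
    labels = [i for i, t in enumerate(totals) for _ in range(t)]
    # Sampling without replacement makes every mutation equally likely
    # to be designated.
    chosen = Counter(rng.sample(labels, n_interest))
    return [chosen.get(i, 0) for i in range(len(totals))]
```

For each isolate, the randomized count of other mutations is its total minus the returned value, and the correlation between the two randomized counts can then be compared with the observed correlation.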

Acknowledgment

This work was supported by the intramural research program of the National Library of Medicine, National Institutes of Health. The opinions expressed in this article are those of the author and do not reflect the view of the National Institutes of Health, the Department of Health and Human Services, or the US government.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Data Availability

All data are available at https://ftp.ncbi.nlm.nih.gov/pub/jcherry/gggt/.

Literature Cited

  1. Aggarwala V, Voight BF. 2016. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet. 48:349–355.
  2. Bird AP. 1980. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8:1499–1504.
  3. Cherry JL. 2018. Methylation-induced hypermutation in natural populations of bacteria. J Bacteriol. 200:e00371–18.
  4. Cherry JL. 2020. Selection-driven gene inactivation in Salmonella. Genome Biol Evol. 12:18–34.
  5. Cherry JL. 2021. Extreme C-to-A hypermutation at a site of cytosine-N4 methylation. mBio 12:e00172–21.
  6. Cooper DN, Youssoufian H. 1988. The CpG dinucleotide and human genetic disease. Hum Genet. 78:151–155.
  7. Couce A, et al. 2017. Mutator genomes decay, despite sustained fitness gains, in a long-term experiment with bacteria. Proc Natl Acad Sci U S A. 114:E9026–E9035.
  8. Coulondre C, Miller JH, Farabaugh PH, Gilbert W. 1978. Molecular basis of base substitution hotspots in Escherichia coli. Nature 274:775–780.
  9. Dillon MM, Sung W, Sebra R, Lynch M, Cooper VS. 2017. Genome-wide biases in the rate and molecular spectrum of spontaneous mutations in Vibrio cholerae and Vibrio fischeri. Mol Biol Evol. 34:93–109.
  10. Evans RJ, et al. 2008. Structure of PolC reveals unique DNA binding and fidelity determinants. Proc Natl Acad Sci U S A. 105:20695–20700.
  11. Ferenci T. 2019. Irregularities in genetic variation and mutation rates with environmental stresses. Environ Microbiol. 21:3979–3988.
  12. Fijalkowska IJ, Jonczyk P, Tkaczyk MM, Bialoskorska M, Schaaper RM. 1998. Unequal fidelity of leading strand and lagging strand DNA replication on the Escherichia coli chromosome. Proc Natl Acad Sci U S A. 95:10020–10025.
  13. Goncearenco A, et al. 2017. Exploring background mutational processes to decipher cancer genetic heterogeneity. Nucleic Acids Res. 45:W514–W522.
  14. Hodgkinson A, Eyre-Walker A. 2011. Variation in the mutation rate across mammalian genomes. Nat Rev Genet. 12:756–766.
  15. Horton JS, Flanagan LM, Jackson RW, Priest NK, Taylor TB. 2021. A mutational hotspot that determines highly repeatable evolution can be built and broken by silent genetic changes. Nat Commun. 12:6092.
  16. Hudson RE, Bergthorsson U, Ochman H. 2003. Transcription increases multiple spontaneous point mutations in Salmonella enterica. Nucleic Acids Res. 31:4517–4522.
  17. Kucukyildirim S, et al. 2016. The rate and spectrum of spontaneous mutations in Mycobacterium smegmatis, a bacterium naturally devoid of the postreplicative mismatch repair pathway. G3 (Bethesda) 6:2157–2163.
  18. Lee H, Popodi E, Tang H, Foster PL. 2012. Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proc Natl Acad Sci U S A. 109:E2774–E2783.
  19. Levinson G, Gutman GA. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol. 4:203–221.
  20. Li C, Luscombe NM. 2020. Nucleosome positioning stability is a modulator of germline mutation rate variation across the human genome. Nat Commun. 11:1363.
  21. Li Y, Mitaxov V, Waksman G. 1999. Structure-based design of Taq DNA polymerases with improved properties of dideoxynucleotide incorporation. Proc Natl Acad Sci U S A. 96:9491–9496.
  22. Lind PA, Andersson DI. 2008. Whole-genome mutational biases in bacteria. Proc Natl Acad Sci U S A. 105:17878–17883.
  23. Ling G, Miller D, Nielsen R, Stern A. 2020. A Bayesian framework for inferring the influence of sequence context on point mutations. Mol Biol Evol. 37:893–903.
  24. Lujan SA, et al. 2014. Heterogeneous polymerase fidelity and mismatch repair bias genome variation and composition. Genome Res. 24:1751–1764.
  25. Maddison W. 1989. Reconstructing character evolution on polytomous cladograms. Cladistics 5:365–377.
  26. Maharjan RP, Ferenci T. 2017. A shifting mutational landscape in 6 nutritional states: stress-induced mutagenesis as a series of distinct stress input-mutation output relationships. PLoS Biol. 15:e2001477.
  27. Maharjan RP, Ferenci T. 2018. The impact of growth rate and environmental factors on mutation rates and spectra in Escherichia coli. Environ Microbiol Rep. 10:626–633.
  28. Marvig RL, Sommer LM, Molin S, Johansen HK. 2015. Convergent evolution and adaptation of Pseudomonas aeruginosa within patients with cystic fibrosis. Nat Genet. 47:57–64.
  29. Mira A, Ochman H. 2002. Gene location and bacterial sequence divergence. Mol Biol Evol. 19:1350–1358.
  30. Moxon R, Bayliss C, Hood D. 2006. Bacterial contingency loci: the role of simple sequence DNA repeats in bacterial adaptation. Annu Rev Genet. 40:307–333.
  31. Pan J, et al. 2022. Rates of mutations and transcript errors in the foodborne pathogen Salmonella enterica subsp. enterica. Mol Biol Evol. 39:msac081.
  32. Setoyama D, Ito R, Takagi Y, Sekiguchi M. 2011. Molecular actions of Escherichia coli MutT for control of spontaneous mutagenesis. Mutat Res. 707:9–14.
  33. Sharp NP, Sandell L, James CG, Otto SP. 2018. The genome-wide rate and spectrum of spontaneous mutations differ between haploid and diploid yeast. Proc Natl Acad Sci U S A. 115:E5046–E5055.
  34. Shewaramani S, et al. 2017. Anaerobically grown Escherichia coli has an enhanced mutation rate and distinct mutational spectra. PLoS Genet. 13:e1006570.
  35. Stoler N, Nekrutenko A. 2021. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform. 3:lqab019.
  36. Sung W, et al. 2015. Asymmetric context-dependent mutation patterns revealed through mutation-accumulation experiments. Mol Biol Evol. 32:1672–1683.
  37. Tenaillon O, et al. 2016. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536:165–170.

Articles from Genome Biology and Evolution are provided here courtesy of Oxford University Press
