PLOS Genetics. 2019 Nov 8;15(11):e1008493. doi: 10.1371/journal.pgen.1008493

Linking high GC content to the repair of double strand breaks in prokaryotic genomes

JL Weissman 1, William F Fagan 1, Philip L F Johnson 1,*
Editor: Xavier Didelot 2
PMCID: PMC6867656  PMID: 31703064

Abstract

Genomic GC content varies widely among microbes for reasons unknown. While mutation bias partially explains this variation, prokaryotes near-universally have a higher GC content than predicted solely by this bias. Debate surrounds the relative importance of the remaining explanations of selection versus biased gene conversion favoring GC alleles. Some environments (e.g. soils) are associated with a high genomic GC content of their inhabitants, which implies that either high GC content is a selective adaptation to particular habitats, or that certain habitats favor increased rates of gene conversion. Here, we report a novel association between the presence of the non-homologous end joining DNA double-strand break repair pathway and GC content; this observation suggests that DNA damage may be a fundamental driver of GC content, leading in part to the many environmental patterns observed to-date. We discuss potential mechanisms accounting for the observed association, and provide preliminary evidence that sites experiencing higher rates of double-strand breaks are under selection for increased GC content relative to the genomic background.

Author summary

The overall nucleotide composition of an organism’s genome varies greatly between species. Previous work has identified certain environmental factors (e.g., oxygen availability) associated with the relative number of GC bases as opposed to AT bases in the genomes of species. Many of these environments that are associated with high GC content are also associated with relatively high rates of DNA damage. We show that organisms possessing the non-homologous end-joining DNA repair pathway, which is one mechanism to repair DNA double-strand breaks, have an elevated GC content relative to expectation. We also show that certain sites on the genome that are particularly susceptible to double strand breaks have an elevated GC content. This leads us to suggest that an important underlying driver of variability in nucleotide composition across environments is the rate of DNA damage (specifically double-strand breaks) to which an organism living in each environment is exposed.

Introduction

Prokaryotic genomes vary widely in their GC content, from the small genomes of endosymbionts with low GC content (as low as 16% [1]) to the larger genomes of soil dwelling microbes with high GC content (> 60% [2, 3]). This bias in content might naturally be assumed to arise from biases in mutation rates, but a puzzle arose when observational studies surprisingly revealed a GC→AT mutational bias (which implies an expected equilibrium GC content < 50%) in genomes with actual GC content > 50% [4, 5]; more recently, controlled mutation accumulation experiments showed that even genomes with < 50% actual GC content still have greater GC content than expected from mutation rates [6]. This discrepancy between mutation rates and GC content implies that GC alleles fix at a higher rate than AT alleles. Two mechanisms could lead to biased fixation: selection directly on GC content [4, 5] or biased gene conversion (BGC), wherein homologous recombination favors GC alleles when resolving heteroduplex DNA mismatches [7]. Much debate has resulted over the relative contribution of these two mechanisms to the observed genomic GC content in prokaryotes [4, 5, 7–10].

This debate between proponents of the selection and BGC hypotheses continues, with many studies focusing on patterns of genetic diversity that by themselves cannot easily differentiate between these two hypotheses because recombination will also locally increase the efficiency of selection [9, 11]; however, the addition of phenotypic information provides the tantalizing clue that GC content correlates with shared environmental factors [2, 3, 12, 13] independent of phylogenetic similarity [3]. Thus, these environmental factors must either lead to an unknown selective advantage for high/low GC content [11] or lead to elevated rates of BGC through an as-yet unknown mechanism.

We noticed that many environments containing high GC content microbes, such as soils and aerobic environments [3, 12], induce relatively high rates of DNA damage in the form of double-strand breaks (DSB) that necessitate repair [14, 15]. For instance, in aerobes, this damage typically results from reactive oxygen species produced during metabolism [14] that can lead to DSBs by producing collapsed replication forks [16], as well as via a number of other mechanisms (often in conjunction with other stressors; [17–24]). In soil-dwelling microbes, DSBs are associated with desiccation and spore formation [25–28]. Even going back nearly 50 years, it was suggested that the rate of exposure to UV radiation, which can lead to DSBs [29], might be driving observed variation in genomic GC content among microbes [30].

To repair DSBs, microbes may use one of two pathways: homologous recombination (HR), or non-homologous end joining (NHEJ) [15]. HR machinery is ubiquitous across microbes [31], although it requires multiple genome copies to function. To-date, much work on GC content has focused on associating rates of HR locally along a genome (inferred using polymorphism data) with local GC content, which would be taken as evidence for the action of BGC [7]. We might also expect that organisms experiencing many DSBs would have a high overall recombination rate in order to repair these breaks. However, average rates of recombination in different genomes do not seem to be correlated with genomic GC content [5], despite the systematic environmental variation in GC content discussed above. It is possible that analyses that correlate global recombination rates and GC content looking across many genomes are too coarse-grained to reveal subtle differences between microbes leading to larger divergence in GC content over evolutionary time. Additionally, it is difficult to get accurate estimates of recombination rates from population-level polymorphism data, and it is unclear how strongly these rates would correlate with DSB formation specifically. Thus, an alternative, complementary indicator of high rates of DSB formation would be useful.

In contrast to HR, the NHEJ repair pathway is rarer and generally found in organisms experiencing DSBs with only a single copy of the genome present in the cell (e.g. during an extended stationary phase; [26]). Notably, we expect NHEJ repair to be favored only when HR is not an option, as NHEJ is generally considered a highly error prone pathway [15, 32]. NHEJ repair requires the presence of the highly conserved Ku protein [33, 34], which makes Ku presence/absence a useful indicator of genomes more/less likely to be subjected to high rates of DSBs during especially vulnerable periods (i.e., one genome copy present). We leverage Ku as an indicator of the rate of DSB formation and examine how the incidence of the NHEJ pathway co-varies with genomic GC content. We find a strong association between Ku presence and elevated GC content, and go on to discuss several mechanisms that could explain this pattern under a selection or a BGC paradigm.

Results and discussion

NHEJ and high GC content found in similar environments

A number of ecological factors have been associated with GC content in previous works, including aerobicity [12], nitrogen fixation [35], exposure to UV radiation [30], and growth temperature (although this last association has been disputed; [25, 36–38]). Notably, many of these associations are weak, and in general there is no known universal driver or mechanism that explains the high genomic GC content seen across many environments. We noted that many of the environmental factors correlated with GC content that have been identified in previous analyses are also associated with high rates of DNA damage, specifically DSBs. Perhaps, then, the unifying driver of GC content is the rate of DSB formation, and the environmental trends observed to-date can be attributed to this underlying driver.

While NHEJ presence is an imperfect indicator of DSB incidence in general, we expect this pathway to be especially common among organisms experiencing many DSBs during periods of slow or no growth [26]. Using a large-scale microbial trait database [39] paired with genomes from RefSeq [40], we identified known ecological correlates of NHEJ incidence as well as genomic GC content (some of which were redundant, S1 Fig). A principal component analysis of these traits revealed similar patterns of NHEJ incidence and high genomic GC content in trait-space (Fig 1), consistent with the idea that DNA damage is associated with genomic GC content. In fact, the pairwise correlation between an ecological trait and genomic GC content tracks almost perfectly with correlation between each trait and Ku incidence (S2 Fig).

Fig 1. Ku and high GC content share a particular region of trait space.

PCA of microbial trait data for select traits with species colored based on either their mean genomic GC content or whether they have known members that encode the Ku protein. Trait loadings signified by arrows. Note the clear separation of Ku and no-Ku organisms in trait space.

Nevertheless, the inclusion of Ku along with ecological traits in a linear model to explain genomic GC content resulted in most other environmental traits still being statistically significant (S1 Table), indicating that either there is some aspect of the environment affecting GC content that is not attributable to DSBs or that NHEJ is an imperfect indicator of the rate of DSB formation (or both). In fact this is trivially true, as Ku presence is a discrete, binary variable whereas the rate of DSB formation is continuous. Despite the fact that Ku is not the sole predictor of genomic GC content, the shared region of trait-space between NHEJ-capable organisms and high GC content organisms is quite striking (Fig 1).

Organisms with NHEJ machinery have high GC content

Next, we looked directly at the Ku versus GC content relationship. Using a large set of genomes from RefSeq we found that genomes with Ku have a dramatically shifted GC content relative to genomes without Ku (Fig 2A, S3 Fig; Pearson correlation between GC content and Ku across genomes, r = 0.54, p < 2.2 × 10^-16), even though Ku presence/absence is sprinkled throughout the prokaryotic phylogeny (Fig 2B). Indeed, this association remains highly significant even after formally correcting for phylogeny using phylogenetic regression with a Brownian motion (BM) model of trait evolution (Table 1). Our analysis is robust to the choice of evolutionary model, as repetition with an Ornstein-Uhlenbeck (OU) model of trait evolution yielded similar results (Table 1). Similarly, restriction of our analysis to a particular phylum (Actinobacteria, Firmicutes, and Proteobacteria, respectively; each with >1000 genomes on our tree) shows that this effect is not attributable to a single branch of the prokaryotic tree but is quite general (Table 1). Finally, to control for the possibility that Ku gain/loss via horizontal transfer is frequent and potentially confounding, we also restricted our analysis to a subset of the data where Ku presence/absence did not vary within each genus (discarding variable genera) and found qualitatively the same result (Table 1). In sum, the presence of NHEJ on a genome is positively associated with the GC content of that genome.
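As an illustration of the genome-level comparison above, the following minimal sketch computes GC content from a sequence and the Pearson correlation between GC content and a binary Ku indicator (with a 0/1 variable this is the point-biserial correlation). The function names and the four toy data points are assumptions made for illustration; they are not the RefSeq data and not necessarily how the original analysis was implemented.

```python
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a sample with zero variance has no correlation")
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: 1 = genome encodes Ku, 0 = it does not.
ku_present = [0, 0, 1, 1]
genome_gc = [0.3, 0.4, 0.6, 0.7]
r = pearson_r(ku_present, genome_gc)
```

A raw correlation of this kind ignores shared ancestry among genomes, which is why the text goes on to use phylogenetic regression.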

Fig 2. The relationship between genomic GC content and the NHEJ pathway in prokaryotes.

(a) Microbes that code for the Ku protein tend to have much higher genomic GC content than those that do not (all RefSeq assemblies shown, 21389 out of 104297 genomes encode Ku). (b) Ku incidence mapped onto the SILVA Living Tree [64]. While Ku incidence is not randomly distributed across the prokaryotic tree, neither is it isolated to a particular clade. Organisms coding for Ku shown in blue, those not coding for Ku shown in red.

Table 1. Coefficients and p-values for all phylogenetic regressions performed.

All tests significant after Benjamini-Hochberg correction (α = 0.05). “Uniform Ku” refers to the dataset restricted to where Ku is always present/absent within each genus (see Methods).

| Data | Model | AIC | β Ku | p Ku | β GenomeLength | p GenomeLength | β Interaction | p Interaction |
|---|---|---|---|---|---|---|---|---|
| All | BM | -10754 | 1.29 | 1.465 × 10^-13 | 0.251 | 2.2 × 10^-16 | -0.191 | 2.522 × 10^-13 |
| All | OU | -10760 | 1.28 | 1.938 × 10^-13 | 0.248 | 2.2 × 10^-16 | -0.190 | 3.291 × 10^-13 |
| All (GC4) | BM | 6719 | 4.53 | 3.763 × 10^-12 | 0.812 | 2.2 × 10^-16 | -0.666 | 8.247 × 10^-12 |
| All (GC4) | OU | 6676 | 5.58 | 2.342 × 10^-12 | 0.827 | 2.2 × 10^-16 | -0.675 | 5.160 × 10^-12 |
| Actinobacteria | BM | -1997.131 | 3.26 | 2.2 × 10^-16 | 0.610 | 2.2 × 10^-16 | -0.483 | 2.2 × 10^-16 |
| Actinobacteria | OU | -2011 | 3.41 | 2.2 × 10^-16 | 0.644 | 2.2 × 10^-16 | -0.505 | 2.2 × 10^-16 |
| Firmicutes | BM | -1541 | 1.63 | 0.004541 | 0.104 | 0.028670 | -0.246 | 0.004358 |
| Firmicutes | OU | -1536 | 1.64 | 0.004242 | 0.110 | 0.022632 | -0.248 | 0.004088 |
| Proteobacteria | BM | -3181 | 0.719 | 0.01704 | 0.244 | 1.054 × 10^-11 | -0.105 | 0.01969 |
| Proteobacteria | OU | -3180 | 0.716 | 0.01744 | 0.238 | 3.377 × 10^-11 | -0.104 | 0.02011 |
| Uniform Ku | BM | -5167 | 2.22 | 6.517 × 10^-12 | 0.290 | 2.2 × 10^-16 | -0.325 | 2.032 × 10^-11 |
| Uniform Ku | OU | -5166 | 2.22 | 8.146 × 10^-12 | 0.287 | 2.2 × 10^-16 | -0.324 | 2.499 × 10^-11 |
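The table caption refers to a Benjamini-Hochberg correction at α = 0.05. For readers unfamiliar with the procedure, here is a minimal sketch of the standard step-up rule: sort the p-values, find the largest rank k whose p-value is at most (k/m)·α, and reject the k smallest. The function name and the toy p-values are illustrative assumptions, not the values or code used for Table 1.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    # Largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, in the original (unsorted) test order.
toy_pvalues = [0.001, 0.04, 0.03, 0.5]
toy_rejected = benjamini_hochberg(toy_pvalues)
```

Note that the rule rejects every hypothesis up to the largest qualifying rank, even if some intermediate p-value exceeds its own threshold.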

Importantly, we control for genome length in all our phylogenetic models, which potentially co-varies with Ku incidence and is known to be associated with genomic GC content in prokaryotes (Table 1 and S4 Fig). Interestingly, Ku presence and genome length have a significant negative interaction in their effect on GC content (Table 1).

Clearly organisms that encode Ku have a higher genomic GC content than organisms that do not, but can Ku help explain why organisms have a higher GC content than expected? Prokaryotes typically have higher GC content than predicted from their mutational biases, which are nearly always skewed towards AT alleles [4–6]. Does the observed association between NHEJ and genomic GC content contribute to this deviation? In other words, are GC alleles more likely to fix than the neutral expectation in Ku-encoding genomes, and is this deviance from neutrality larger than in genomes that do not encode Ku? Alternatively, it is possible that the error-prone NHEJ machinery simply leads to an excess of GC mutations. In this case, NHEJ incidence would help explain differences in mutational biases between microbes, but would not help explain the mystery of higher than expected genomic GC content among microbes.

Examining mutation accumulation experiments in detail, Ku shows no effect on GC↔AT mutational biases (S5 Fig; data from [6]). Since mutation accumulation data are limited, we also used the GC bias of inferred polymorphisms as a proxy for mutation. Similar to previous studies of prokaryotic GC content [4], and using the same intuition as the McDonald-Kreitman test for selection [41], we assume that polymorphisms within a population have experienced minimal selection and are therefore representative of the mutational biases of a given organism. Thus the GC content of polymorphisms within a population gives an estimate of the GC↔AT mutational bias pre-selection, whereas the genomic GC content of an organism results from a combination of mutational biases and mechanisms that alter the probability of fixation of an allele, such as selection and BGC. We obtained multiple alignments of all orthologous genes for organisms in the ATGC database [42] belonging to clusters that contained at least three genomes (to identify and polarize polymorphisms) and used this dataset to compare the background GC content to the GC content of polymorphisms. We saw greater evidence for BGC/selection in genomes with Ku than without Ku, with the presence of Ku leading to higher observed GC content regardless of the expected GC content (Fig 3, S7 Fig). In order to minimize the effects of selection on our estimate of expected GC content, we repeated this analysis only using polymorphisms at fourfold degenerate sites and found qualitatively similar results, despite having only about a third as many informative polymorphisms (S6 and S7 Figs).
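The logic of estimating an expected GC content from polymorphisms can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the paper: polymorphisms are polarized by majority rule (the more common allele in a column is treated as ancestral), per-site GC→AT and AT→GC rates are taken as counts divided by the number of ancestral GC or AT sites, and the expected GC content is the two-state equilibrium v / (u + v), where u is the GC→AT rate and v the AT→GC rate. All function names and the toy alignment are hypothetical.

```python
from collections import Counter


def polarize(column):
    """Return (ancestral, derived) for an alignment column, with derived=None
    for invariant sites. Returns None if the column cannot be polarized."""
    bases = [b.upper() for b in column]
    if any(b not in "ACGT" for b in bases):
        return None  # gaps or ambiguous bases
    ranked = Counter(bases).most_common()
    if len(ranked) == 1:
        return ranked[0][0], None
    if len(ranked) > 2 or ranked[0][1] == ranked[1][1]:
        return None  # multi-allelic or tied: majority rule is uninformative
    return ranked[0][0], ranked[1][0]


def expected_gc_from_polymorphisms(alignment):
    """Equilibrium GC fraction implied by polarized GC<->AT polymorphisms."""
    gc_sites = at_sites = gc_to_at = at_to_gc = 0
    for column in zip(*alignment):
        result = polarize(column)
        if result is None:
            continue
        ancestral, derived = result
        anc_is_gc = ancestral in "GC"
        if anc_is_gc:
            gc_sites += 1
        else:
            at_sites += 1
        if derived is None:
            continue
        der_is_gc = derived in "GC"
        if anc_is_gc and not der_is_gc:
            gc_to_at += 1
        elif der_is_gc and not anc_is_gc:
            at_to_gc += 1
    if gc_sites == 0 or at_sites == 0:
        raise ValueError("need both GC and AT ancestral sites")
    u = gc_to_at / gc_sites  # per-site GC->AT rate (relative units)
    v = at_to_gc / at_sites  # per-site AT->GC rate (relative units)
    if u + v == 0:
        raise ValueError("no GC<->AT polymorphisms observed")
    return v / (u + v)


# Hypothetical three-genome alignment: two GC->AT changes, one AT->GC change.
toy_alignment = ["GGGGAAAA", "GGGGAAAA", "AAGGAAAG"]
toy_expected_gc = expected_gc_from_polymorphisms(toy_alignment)
```

In the toy alignment the background GC content is 50% while the expected equilibrium is one third, which is the kind of gap between observed and expected GC content that Fig 3 compares between genomes with and without Ku.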

Fig 3. Genomes with Ku appear to fix GC alleles at a greater rate than expected (either due to BGC or selection).

(a,b) Genomes with Ku have, on average, even greater elevation of GC content over expectation than genomes without Ku. Expected GC content was estimated from polymorphism data. This signal is conservative due to observed polymorphisms experiencing some effects of BGC/selection (see Methods for discussion).

Thus, the association between Ku and genomic GC content is not due to differences in mutational bias. This implies that DSBs are either leading to selection for high GC content or influencing the rate and/or biases of homologous recombination to increase the overall action of BGC. We emphasize that this effectively rules out the possibility that biases during NHEJ repair are causing the observed patterns. NHEJ repair may be error-prone, but if those errors (i.e., mutations) were driving genome-wide GC-bias it would affect the GC-bias of polymorphisms as well as fixed alleles in the test described above.

Finally, we note that there is a small subset of genomes in Fig 2 that both encode Ku and have a low GC content (< 40%). Of these, 80% belong to the family Bacillaceae. This family has uniformly low GC content (> 99% of genomes have GC content < 50%, and 76% have GC content < 40%), and an ancestral state reconstruction suggests that its most recent common ancestor encoded Ku (S8 Fig and see Methods), though Ku has been lost multiple times across the group. We do not know why the Bacillaceae violate the pattern seen across the rest of the dataset; it may be an accident of evolutionary history or some particular aspect of this group’s ecology and/or physiology.

No apparent relationship between rates of homologous recombination and NHEJ

The above analyses suggest that GC alleles fix with a higher probability in organisms experiencing an elevated rate of DSB formation. If BGC were the primary driver of GC content evolution in prokaryotes, would we expect an association between damage and GC content as we see here? We can think of at least one plausible scenario. The formation of DSBs should stimulate recombination for repair, and assuming that recombination is biased we might expect rates of BGC to increase as the rate of DSB formation increases. We saw no positive association between Ku incidence and inferred rates of homologous recombination looking between genomes, as would be predicted by this hypothesis (S9 Fig with data from [43, 44], and S10 Fig with data from the ATGC database [42]). In fact the relationship appeared to be negative regardless of method to measure recombination rate (though not significant). That being said, the effects of BGC are typically only apparent locally, comparing GC content and recombination rates along a genome rather than between genomes [7]. The clearest evidence for BGC leading to high GC content in prokaryotes comes from Lassalle et al. [7], who compared the GC content of genes that recombined frequently or rarely within genomes. We checked if any of the 21 taxa they studied typically encode Ku based on the genomes we downloaded from RefSeq above. Four taxa carry Ku at any appreciable frequency (>1%): Mycobacterium tuberculosis (100%), Burkholderia pseudomallei (100%), Burkholderia cenocepacia (87%), and Bacillus anthracis (68%). These organisms did not show a consistent association between recombination rate and GC content, unlike most other taxa in the study. M. tuberculosis and B. pseudomallei are highly clonal and did not present enough diversity for a complete analysis by Lassalle et al. [7], though B. pseudomallei had a negative association between recombination rate and GC content at the third codon position (positive for GC content overall, in both cases not significant). B. cenocepacia presented enough data for analysis, but showed no relationship between recombination rate and GC content (again with a negative but non-significant effect at the third codon position). Finally, B. anthracis had inconsistent effects, with a significant negative association between recombination rate and GC content overall but a significant positive interaction when restricting to GC content at the third codon position. What to make of this? Among the very limited number of species that encode Ku in Lassalle et al.’s dataset, the evidence for BGC is not strong.

Given the small number of organisms in Lassalle et al.’s dataset that had Ku, we endeavoured to repeat this analysis using a larger set of organisms. Using the ATGC database (as we did with our analysis of polymorphism earlier), we obtained multiple alignments of all orthologous genes for each cluster of organisms [42]. We then classified genes as recombining or non-recombining using the PHI statistic [45]. Similar to Lassalle et al. [7], we found that recombining genes had higher GC content than non-recombining genes, though this difference was small (paired t-test, df = 154, p = 1.503 × 10^-11; Fig 4). Interestingly, while a link between recombination and GC content was apparent, it seemed to explain none of the difference between Ku-encoding and Ku-lacking organisms (Fig 4a). In fact the difference in GC content between recombining and non-recombining genes was actually smaller for Ku-encoding organisms than Ku-lacking ones, the opposite of what we would expect if recombination were driving the link between Ku and GC content (t-test, df = 83.698, p = 0.0308; Fig 4b).
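The paired comparison described above treats each cluster as one pair: the mean GC content of its recombining genes against that of its non-recombining genes. A minimal sketch of the paired t statistic follows; with 155 clusters (114 Ku-lacking plus 41 Ku-encoding, per the Fig 4 caption) the degrees of freedom are 154, matching the reported value. The function name and the four toy pairs are illustrative assumptions, and the sketch stops at the t statistic rather than computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(recombining_gc, non_recombining_gc):
    """Return (t, df) for a paired t-test on per-cluster mean GC contents."""
    if len(recombining_gc) != len(non_recombining_gc):
        raise ValueError("samples must be paired (equal length)")
    diffs = [a - b for a, b in zip(recombining_gc, non_recombining_gc)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical per-cluster mean GC contents (one entry per cluster).
toy_recombining = [0.52, 0.61, 0.45, 0.70]
toy_non_recombining = [0.50, 0.60, 0.46, 0.66]
toy_t, toy_df = paired_t_statistic(toy_recombining, toy_non_recombining)
```

Pairing within clusters removes the large between-cluster variation in baseline GC content, which is why a small mean difference can still be highly significant.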

Fig 4. Recombination contributes to GC content locally but cannot explain the relationship between GC content and Ku incidence.

(a) The mean GC content of genes with evidence for recombination (PHI statistic, [45], see Methods) plotted against the mean GC content of genes without evidence for recombination in a given closely related cluster of organisms (ATGC database [42]). Recombining genes have slightly higher GC content than non-recombining genes (points mostly lie above the dashed x = y line). (b) The difference in GC content for recombining and non-recombining genes within a cluster is smaller for Ku-encoding than Ku-lacking clusters. Clusters were classified as Ku-lacking if no members encode Ku (n = 114) and Ku-encoding if at least one member has Ku (n = 41). Clusters were excluded if no evidence for recombination was found for any of their genes.

Taking a step back, there are two primary reasons to disfavor BGC as the hypothetical mechanism underlying the observed positive association between NHEJ and genomic GC content. First, the extent to which BGC is a driver of genomic GC content in prokaryotes has been questioned. Studies using alternative methods to quantify recombination across the genome to those used presently by us and previously by Lassalle et al. [7] have found either no relationship between recombination rate and local GC content or inconsistent patterns between species (with some species even showing a negative relationship; [46, 47]). More recently, looking specifically at polymorphisms arising via recombination across many prokaryote species, Bobay and Ochman [10] reject the BGC hypothesis outright. They note that some positive relationship between recombination and GC content is apparent at a coarser scale (as seen here), but attribute this pattern to the increased efficiency of selection in regions of high recombination. Even in some microbial eukaryotes, where the evidence of BGC is thought to be strong, tetrad analysis has been unable to reveal any evidence of BGC leading to elevated GC content, including in genomes where recombination rate and GC content are locally correlated [48]. This suggests that correlative methods may be insufficient to conclusively demonstrate BGC, and it has been suggested that high GC content may actually increase the rate of recombination locally (effectively reversing the logic behind the evidence for BGC; [49]).

Second, while organisms encoding Ku are likely to be experiencing DSBs, they are unlikely to experience high rates of recombination. NHEJ is thought of as an alternative pathway to HR, specifically used when HR cannot proceed because the genome is only present as a single copy [15, 26]. Thus we expect NHEJ to be favored specifically in situations where BGC is unlikely. While it is possible that high rates of damage could still favor both NHEJ and HR, albeit at different points in an organism’s life cycle, the extremely strong and specific association between GC content and Ku suggests that this relationship may be particular to the specific conditions selecting for Ku (especially considering the absence of an association between recombination and GC content when looking between genomes [5]; S9 and S10 Figs). Nevertheless, we have insufficient information to completely rule out BGC as a mechanism at this time.

High GC content near regions with frequent breaks

Given the inability of BGC to explain the association between NHEJ and high GC content (Fig 4), perhaps selection can provide an alternative hypothesis. Could it be that organisms with NHEJ machinery are under stronger selection for high GC content than those without? This leads us to a puzzle: what fitness advantage might be conferred by GC content? In fact, high GC content may promote DNA repair, both by facilitating canonical NHEJ (i.e., Ku-dependent [50–52]) and alternative NHEJ (i.e., Ku-independent [32, 53]) pathways.

During DSB repair, the NHEJ machinery in prokaryotes takes advantage of homology in any short overhanging regions or nearby microhomology in order to help align the two broken DNA ends [50–52]. Any factor that stabilizes this interaction (e.g., high GC content via an increased number of hydrogen bonds) may have the potential to increase the efficiency of NHEJ repair [50–52]. It has also been shown that prokaryotes can employ alternative high-fidelity end-joining pathways that are independent of the NHEJ machinery [32, 53], and that these pathways are primarily dependent on short (2-5bp [32]) nearby microhomology (DNA ends are typically degraded to reveal internal homologies [32, 53, 54]) to tether the DNA ends together. It stands to reason that high GC content in these regions of microhomology might help stabilize the end-pairings and improve the efficiency of repair [32, 53]. In fact, in eukaryotes, high GC overhangs or microhomologies specifically promote the use of a similar NHEJ-independent, high-accuracy end-joining repair pathway [55, 56]. In these systems high GC content is thought to help tether the DNA together and thus perform a similar role to Ku [55, 56], though this has not yet been confirmed in prokaryotes. Nevertheless, this mechanism suggests that high GC content could help to ameliorate the negative effects of DSBs, especially in environments with high rates of DSBs but only a single genome copy. That being said, high genomic GC content alone cannot protect an entire genome, since most genomes will have at least some AT-rich regions. Thus a combined Ku and high GC content strategy would potentially be favorable for DSB-vulnerable microbes and could explain the strong positive relationship we observed between selection for high genomic GC content and the presence of Ku (Fig 2).

In addition, our hypothesis makes a novel testable prediction: regions of the genome that are especially prone to DSBs should be under selection to have higher GC content. Restriction modification (RM) systems provide an ideal test case as damage due to self-targeting is a known phenomenon (e.g. [57]), and we know the potential locations of self-targeting if we know the restriction enzyme recognition sequence. We hypothesized that restriction enzymes on a genome would be selected to target sites higher in GC content than expected from the genomic background to help mitigate the effects of autoimmunity (otherwise they should match host background GC content because phage typically track their host nucleotide composition, often as a byproduct of optimizing codon usage bias for transcription in their host, e.g. [58]). We further predicted that, for restriction enzymes with low GC content recognition sequences, the bases flanking restriction sites on the genome would have elevated GC content. Both of these predictions were borne out. We analyzed the complete set of genomes and their listed restriction enzymes in the REBASE database [59] and found that restriction enzymes indeed tend to have higher GC content recognition sequences than their genomic background, and that the bases immediately flanking AT-rich recognition sites have elevated GC content (Fig 5). In principle, evidence of high GC content near breaks could also be taken as support for BGC (despite other evidence to the contrary [10, 48]) since the rate of HR repair should increase locally, meaning that ultimately experimental approaches will be needed to tease apart these hypotheses.

Fig 5. Restriction sites are associated with elevated GC content.

(a) Restriction enzymes tend to target sequences with GC content higher than the genomic average. (b) Bases immediately flanking AT-rich restriction sites (≥ 75% AT, n = 214 genomes) have an elevated mean GC content. This signature mostly decays within 50bp of the recognition site. This pattern is particularly striking when looking at the first flanking base. Error bars represent bootstrapped 95% confidence intervals of the mean.
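The flanking-base analysis summarized in Fig 5b can be sketched as follows: locate each occurrence of a recognition sequence, record whether the base at each distance from the site is G/C, and bootstrap the mean to obtain a 95% confidence interval. This is a simplified illustration under several assumptions that may differ from the original analysis: only the given strand of a linear sequence is searched, the recognition sequence contains no degenerate bases, and the percentile bootstrap is used. The function names, the toy genome, and the example site are hypothetical.

```python
import random


def flank_gc_by_distance(genome, site, max_dist):
    """Map each distance d (1..max_dist) to a list of 0/1 values, one per
    flanking base at that distance from an occurrence of `site`
    (1 = G or C, 0 = A or T)."""
    genome, site = genome.upper(), site.upper()
    values = {d: [] for d in range(1, max_dist + 1)}
    start = genome.find(site)
    while start != -1:
        end = start + len(site)  # index of the first base after the site
        for d in range(1, max_dist + 1):
            for pos in (start - d, end + d - 1):  # upstream, downstream
                if 0 <= pos < len(genome) and genome[pos] in "ACGT":
                    values[d].append(1 if genome[pos] in "GC" else 0)
        start = genome.find(site, start + 1)  # allow overlapping hits
    return values


def bootstrap_ci(values, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% confidence interval for the mean."""
    if not values:
        raise ValueError("cannot bootstrap an empty sample")
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


# Hypothetical example: one AT-rich site with GC-rich flanking bases.
toy_flanks = flank_gc_by_distance("CCGAATTCGG", "GAATTC", max_dist=2)
toy_ci = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1])
```

Comparing the per-distance means against the genome-wide GC content would then show whether the elevation decays with distance from the site, as described in the caption.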

We emphasize that the idea that DSB formation selects for high GC content, while consistent with our data, is at this point largely speculative. By outlining this scenario we hope to enable experimentalists to design specific, mechanistic studies. Much of the debate over genomic GC content (including our present contribution), and especially its ecological role, has relied on population-scale correlative studies and has largely avoided mechanisms. Our “GC-tethering” hypothesis has the advantage of being amenable to laboratory-based investigations.

Finally, we caution that one is unlikely to see genome-wide differences in GC content when comparing across organisms with different numbers of restriction enzymes, since restriction sites comprise a very limited subset of loci along the genome (and self targeting should be somewhat restrained via methylation of the host chromosome). Presumably if self targeting was frequent enough to select for elevated GC content at a genome-wide scale, the corresponding cost of encoding these enzymes would be prohibitively high.

Conclusions

We found a strong positive association between the presence of the NHEJ pathway on a genome and genomic GC content across prokaryotes. This association holds controlling for phylogeny and genome length and cannot be explained by mutational biases. The NHEJ repair pathway is broadly but sparsely distributed across the prokaryotic tree (Fig 2), showing up in only about a quarter of genomes [33], and is expected to be favored among organisms experiencing high rates of DSB formation during periods of no or slow growth (i.e., only a single genome copy present so that HR is impossible; [15, 26]). This suggests that high GC content may be an adaptation to deal with DSBs when HR is not feasible, especially when rates of damage are high. In fact, we find that in regions of the genome where DSBs are likely to occur, GC content is locally elevated (Fig 5). Alternatively, the presence of NHEJ may be an indicator of high rates of DSBs in general, so that these organisms are also experiencing higher rates of HR for repair, and subsequently increased BGC. We discussed the relative merits of these two hypothetical mechanisms linking DSBs to genomic GC content, though at this point it is not possible to state conclusively which mechanism is the primary driver of the pattern we observe. It is also possible that some combination of BGC and selection is acting to increase genomic GC content in organisms experiencing DNA damage.

Regardless of the underlying mechanism, high risk of DSB formation is a common factor in many of the habitats that high GC content microbes have been shown to inhabit. While the presence of NHEJ cannot single-handedly explain high GC content in all organisms (there are many organisms incapable of NHEJ that still have high GC content, Fig 2), it is possible that DSB formation can (or at least come close). For example, Deinococcus radiodurans is resilient to extremely high rates of DSB formation [60] and has high genomic GC content, but lacks Ku. It is difficult to directly assess the rate of DSB formation, but not impossible [61]. We hope to see future work that assays rates of damage in the environment. DNA damage is an important challenge for microbes to overcome, and a systematic understanding of how damage varies between environments is of broad ecological interest.

Methods

Data

We downloaded all available completely sequenced prokaryotic genomes from NCBI’s non-redundant RefSeq database FTP site on December 23, 2017 [62] and searched for the presence of the gene coding for the Ku protein, which is central to the NHEJ pathway, using hmmsearch (E-value cutoff $10^{-2}$/#Genomes, Pfam family PF02735; [63]). This identified 21389 genomes containing Ku out of a total of 104297 genomes analyzed.
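As a worked illustration of the threshold described above, the per-search E-value cutoff can be computed by dividing the base cutoff by the number of genomes searched. The function name below is hypothetical and the snippet only reproduces the arithmetic; it does not run hmmsearch.

```python
def per_search_evalue_cutoff(base_cutoff: float, n_genomes: int) -> float:
    """Divide a base E-value cutoff by the number of genomes searched."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return base_cutoff / n_genomes


# Values taken from the text: base cutoff 10^-2, 104297 genomes, 21389 with Ku.
cutoff = per_search_evalue_cutoff(1e-2, 104297)
ku_fraction = 21389 / 104297

print(f"E-value cutoff per search: {cutoff:.2e}")
print(f"Fraction of genomes with Ku: {ku_fraction:.3f}")
```

With the numbers reported here, the cutoff is on the order of $10^{-7}$ and roughly a fifth of the analyzed genomes carry Ku.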

For phylogenetic analyses, we downloaded the SILVA Living Tree 16S rRNA tree [64]. We obtained mutational bias estimates from Long et al. (m; [6]), of which 22 could be matched with a genome in our dataset. We obtained estimates of the rate of homologous recombination from [43, 44]. For our analyses on the fate of polymorphism and for estimating recombination we downloaded alignments from the Alignable Tight Genomic Cluster (ATGC) database [42].

Trait data were obtained from the ProTraits microbial trait database (2679 species; [39]). ProTraits scores are expressed as separate confidences that a particular species does or does not have a trait based on the results of an automated text-mining algorithm, giving two scores per binary trait. We combined these scores to obtain what is essentially a probability that a microbe has a given trait using the equation in Weissman et al. [65], yielding a single score between zero and one for each trait for each species. We selected a suite of traits known to be associated with either the incidence of Ku or genomic GC content (soil-dwelling, aerobicity, growth temperature, nitrogen fixation, spore formation) to include in our analysis. To assess trait vs. Ku relationships we sampled a single genome per species from our RefSeq dataset to determine Ku presence/absence in species with trait data available (617 with Ku, 2062 without). Most species either always have or always lack Ku (S11 Fig), meaning that sampling should give a reliable estimate of whether we can expect a species to typically have Ku.

We downloaded the complete list of genomes from the REBASE restriction enzyme database [59], which includes all RM enzymes found on a given genome, along with their target sequence if known. Using the listed accession numbers, we then downloaded each corresponding genome from RefSeq in order to assess GC content near restriction sites on the genome (potential sites where DSBs would occur).

Please visit https://github.com/jlw-ecoevo/gcku/ for code and intermediate datasets.

Phylogenetic linear models

Using the 6648 organisms on the SILVA tree for which we had a genome and could assess Ku presence/absence (2051 with, 4597 without) we built a series of phylogenetically corrected linear models of genomic GC content using the phylolm package in R [66]. First we logit transformed our GC content (%GC) values

$$\mathrm{GC}_l = \log\left(\frac{\%\mathrm{GC}}{1 - \%\mathrm{GC}}\right) \tag{1}$$

so that values were in the range (−∞, ∞). We then used Ku incidence as a binary predictor along with log10 genome length as a continuous predictor to predict GCl using phylogenetic regression:

$$y = X\beta + \varepsilon \tag{2}$$

where

$$\varepsilon \sim N\left(0,\ \sigma_p^2 V + \sigma_e^2 I\right) \tag{3}$$

so that V is the phylogenetic covariance matrix and $\sigma_e^2$ is the variance of the measurement error [66]. Brownian motion (BM) models of trait evolution are inappropriate when trait values are bounded. While GC content is theoretically bounded at zero and one, no species approach these bounds, and logit transforming the GC content values should ameliorate this issue. Ornstein-Uhlenbeck (OU) models of trait evolution are sometimes also used in these cases. We found that using an OU model had no qualitative effect on our result (Table 1), though this model had a lower AIC than the Brownian motion model (-10760 versus -10754).
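The logit transform used for the response variable can be sketched as follows. This is a minimal illustration, assuming GC content is expressed as a fraction strictly between zero and one and that the logarithm is natural; the function names are hypothetical, and the phylogenetic regression itself (fitted with phylolm in R) is not reproduced here.

```python
import math


def logit_gc(gc_fraction: float) -> float:
    """Map a GC fraction in (0, 1) onto the whole real line."""
    if not 0.0 < gc_fraction < 1.0:
        raise ValueError("GC fraction must lie strictly between 0 and 1")
    return math.log(gc_fraction / (1.0 - gc_fraction))


def inverse_logit(value: float) -> float:
    """Map a logit-scale value back to a GC fraction."""
    return 1.0 / (1.0 + math.exp(-value))


for gc in (0.30, 0.50, 0.70):
    print(gc, round(logit_gc(gc), 3))
```

A genome with 50% GC maps to zero on the logit scale, and the transform is symmetric around that point, which is why values near the theoretical bounds no longer compress the scale.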

We also applied the above analyses independently to the three best-represented phyla in the dataset, each with >1000 genomes: Actinobacteria (614 with Ku, 446 without), Firmicutes (320 with Ku, 699 without), and Proteobacteria (539 with Ku, 1384 without).

Finally, for our “Uniform Ku” models we excluded all genera from our dataset that had fewer than two genomes with which to assess Ku incidence, and then excluded any genera for which Ku incidence was not uniform (i.e., we retained only genera in which either all genomes had Ku or all genomes lacked Ku). We then repeated our above analysis (779 taxa with Ku, 2365 without).
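The genus-level filter can be sketched as below. This is an illustrative reading of the procedure, assuming one (genus, Ku present) record per genome; the function name and the toy genus labels are hypothetical.

```python
from collections import defaultdict


def uniform_ku_genera(records):
    """Return {genus: has_ku} for genera with >= 2 genomes and uniform Ku calls.

    records: iterable of (genus, has_ku) pairs, one per genome.
    """
    calls_by_genus = defaultdict(list)
    for genus, has_ku in records:
        calls_by_genus[genus].append(bool(has_ku))

    kept = {}
    for genus, calls in calls_by_genus.items():
        if len(calls) < 2:
            continue  # too few genomes to assess Ku incidence
        if all(calls) or not any(calls):
            kept[genus] = calls[0]  # Ku status shared by every genome
    return kept


toy = [
    ("GenusA", True), ("GenusA", True),    # uniform: all with Ku
    ("GenusB", False), ("GenusB", False),  # uniform: all without Ku
    ("GenusC", True), ("GenusC", False),   # mixed: excluded
    ("GenusD", True),                      # single genome: excluded
]
print(uniform_ku_genera(toy))
```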

Ancestral state reconstruction

We performed an ancestral state reconstruction of the presence/absence of Ku in the Bacillaceae (S8 Fig). We used the R package corHMM to reconstruct the evolutionary history of this trait on the subtree of the SILVA phylogeny describing the Bacillaceae [67]. We allowed for up to two rate classes (for trait evolution) across the tree when building our evolutionary model (rate.cat parameter in function corHMM, otherwise default parameters), but found that a model with a single rate class had a lower AICc (257.3119 vs. 263.8347). Thus we only retained a model using a single rate class across the tree.

Fate of polymorphism

The ATGC database groups closely-related genomes into “clusters” and provides alignments of their core genes [42]. We downloaded multiple alignments corresponding to clusters in the ATGC database that had at least three genomes. In this way we could, at a minimum, identify polymorphisms between two genomes while using a third, more distantly related genome to polarize these polymorphisms. We restricted our analysis to orthologous genes (COGs) that were present in all members of a cluster. For each genome in a cluster we obtained a set of polymorphisms for that genome by comparing to the most similar genome in that cluster, using the most diverged genome in the cluster to polarize these polymorphisms (assuming that the diverged genome represented the ancestral state, and ignoring cases where neither of the other two genomes matched this “ancestral” allele). Similarity was calculated as the percent identity over the entire aligned core genome provided by ATGC. In order to ensure that polymorphisms were recent and had not yet undergone selection, we discarded genomes that were not within 1% pairwise divergence of any other genome in their respective cluster (calculated over the set of core genes provided in the ATGC alignments). We also discarded pairs of genomes that had fewer than 5 informative sites (either GC→AT or AT→GC) in order to avoid extreme expected GC content estimates. Thus we obtained a set of 1868739 polarized polymorphisms for 1643 pairs of genomes for which GC content could be assessed and compared to the background genomic GC content (Fig 3). Expected GC content was calculated as in Long et al. [6].
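The polarization step described above can be sketched for three aligned sequences. This is a simplified illustration, not the authors' pipeline: the function name is hypothetical, the inputs are assumed to be pre-aligned strings of equal length, and the derived allele is counted regardless of which of the two close genomes carries it.

```python
STRONG = {"G", "C"}  # GC bases
WEAK = {"A", "T"}    # AT bases


def count_polarized_changes(focal, nearest, outgroup):
    """Count GC->AT and AT->GC changes between two aligned genomes,
    polarized by a third, more diverged genome (assumed ancestral)."""
    if not (len(focal) == len(nearest) == len(outgroup)):
        raise ValueError("aligned sequences must have equal length")

    counts = {"GC_to_AT": 0, "AT_to_GC": 0}
    for f, n, o in zip(focal.upper(), nearest.upper(), outgroup.upper()):
        if f == n:
            continue  # site is not polymorphic within the pair
        if o == f:
            ancestral, derived = f, n
        elif o == n:
            ancestral, derived = n, f
        else:
            continue  # outgroup matches neither allele: cannot polarize
        if ancestral in STRONG and derived in WEAK:
            counts["GC_to_AT"] += 1
        elif ancestral in WEAK and derived in STRONG:
            counts["AT_to_GC"] += 1
        # Gaps, ambiguous bases, and G<->C or A<->T changes are ignored
        # because they do not change GC content.
    return counts


print(count_polarized_changes("ACGGA", "ATGAC", "ACGAT"))
```

In the toy alignment, the second column is a C→T change, the fourth is an A→G change, and the last column is skipped because the outgroup matches neither allele.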

To obtain expected GC content at fourfold degenerate sites we repeated the above analysis only looking at polymorphisms at fourfold degenerate sites (S6 Fig). There are about a third as many polymorphisms in this dataset (574944) but a similar number of genome pairs are retained (1351).

In order to estimate mutational biases, we assume that recent polymorphisms will not have had a chance to undergo selection (or BGC). This is similar to the intuition underlying the McDonald-Kreitman test for selection [41], and similar analyses have been performed in past work on GC content [4, 5]. Therefore we can obtain an estimate of the expected GC content based on mutational bias, and infer that selection (or BGC) is acting if the realized genomic GC content differs from the expectation. In practice, because we are looking at alignable coding sequence, selection is likely to be strong, and may bias our estimates (S12 Fig). This is further compounded by the fact that genomes in a cluster can still be quite diverged, although we control for this by restricting to genomes within 1% sequence divergence from each other. In any case, the direction of bias will be towards the equilibrium GC content, as estimated via the genomic background. Thus, this test for selection suffers somewhat in that it has an increased probability of false-negatives, but this bias should not cause a false signal of selection to occur.
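The expectation under mutation bias alone can be illustrated with a short sketch. It assumes the standard two-state equilibrium, in which the expected GC fraction is 1/(1 + m) with m the ratio of the per-site GC→AT rate to the per-site AT→GC rate; the text defers to Long et al. [6] for the exact calculation, so the normalization by numbers of GC and AT sites and the function names here are assumptions.

```python
def mutation_bias(n_gc_to_at, n_at_to_gc, n_gc_sites, n_at_sites):
    """Ratio m of the per-site GC->AT rate to the per-site AT->GC rate."""
    if n_at_to_gc == 0 or n_gc_sites == 0 or n_at_sites == 0:
        raise ValueError("need at least one AT->GC change and nonzero site counts")
    rate_gc_to_at = n_gc_to_at / n_gc_sites
    rate_at_to_gc = n_at_to_gc / n_at_sites
    return rate_gc_to_at / rate_at_to_gc


def expected_gc(m):
    """Equilibrium GC fraction if mutation were the only force acting."""
    return 1.0 / (1.0 + m)


# Toy counts chosen to give m = 4.5, the value quoted for Vibrio fischeri.
m = mutation_bias(n_gc_to_at=90, n_at_to_gc=20, n_gc_sites=500, n_at_sites=500)
print(round(m, 2), round(expected_gc(m), 3))
```

Under this assumption, a bias of m = 4.5 corresponds to an expected GC fraction of about 0.18, so any genome with substantially higher GC content would require selection or BGC to explain the difference.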

While our estimates do not perfectly align with those found in mutation accumulation experiments (S12 Fig), we note that even within a genus, values for GC↔AT mutation bias can be highly heterogeneous. For example, Long et al. [6] estimate the ratio of the rate of GC→AT mutation to the rate of AT→GC mutation to be 4.5 in Vibrio fischeri but 2.3 in Vibrio cholerae. Similarly, they find values of 6.6 in Staphylococcus epidermidis but 4.6 in Staphylococcus aureus, despite these two species having very similar values for genomic GC content (0.33 vs. 0.32) and GC content at fourfold degenerate sites (0.20 vs. 0.19). This implies that closely related organisms may have very different mutational biases, making comparisons between datasets challenging.

Measuring recombination

We obtained all available alignments of shared genes within each cluster of organisms in the ATGC database [42]. We then ran the program PhiPack [45] using 10000 permutations to generate p-values for the occurrence of recombination in each cluster-gene pair. To correct for multiple testing we used a Benjamini-Hochberg correction with a false-discovery rate of 5%. Altogether this yielded 52117 genes with significant evidence of recombination out of 438580 cluster-gene pairs with sufficient information to run PhiPack. To obtain GC content for each cluster-gene pair we took the mean GC content across sequences in the relevant alignment. To obtain cluster-wide estimates of GC content and Ku incidence we took the mean across genomes associated with organisms in that cluster (each cluster member in ATGC is associated with a RefSeq genome).
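The multiple-testing step can be sketched with a plain implementation of the Benjamini-Hochberg step-up procedure. This is a generic illustration of the method at a 5% false-discovery rate, not the authors' code; the function name and toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests are declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


toy_p = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(toy_p))
```

Because the procedure is step-up, a p-value that fails its own rank threshold can still be declared significant if a larger p-value passes at a higher rank.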

Restriction sites

We identified all genomes encoding restriction enzymes with known restriction sequences in our dataset using the REBASE database [59]. We then restricted our analyses to genomes encoding enzymes that had low-GC content restriction sequences (AT-rich restriction sequences defined as those with ≥ 75% AT, n = 214, no genomes had multiple enzymes with AT-rich targets). For each remaining genome we mapped the corresponding restriction sequence to the genome itself to find all potential sites of self-targeting. We then calculated the mean GC content of the sites directly flanking these self-targets across the genome, obtaining a value for average GC content for each distance (1-200bp) from the target for each genome.
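The flank analysis can be sketched as follows. This is a simplified illustration with several assumptions that the text does not specify: the genome is treated as a single linear sequence, recognition sequences contain no ambiguity codes, only exact matches on the given strand are counted, and flanks on both sides of a site are pooled. The function names are hypothetical.

```python
def find_sites(genome, site):
    """Start indices of all (possibly overlapping) exact matches of site."""
    starts = []
    i = genome.find(site)
    while i != -1:
        starts.append(i)
        i = genome.find(site, i + 1)
    return starts


def flank_gc_by_distance(genome, site, max_dist=200):
    """Mean GC content at each distance (1..max_dist) from matched sites."""
    genome, site = genome.upper(), site.upper()
    starts = find_sites(genome, site)
    profile = {}
    for d in range(1, max_dist + 1):
        gc = total = 0
        for s in starts:
            # One base d positions upstream of the site start and
            # one base d positions downstream of the site end.
            for pos in (s - d, s + len(site) - 1 + d):
                if 0 <= pos < len(genome):
                    total += 1
                    gc += genome[pos] in "GC"
        profile[d] = gc / total if total else None
    return profile


print(flank_gc_by_distance("ATAATTGA", "AATT", max_dist=3))
```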

In order to generate an adequate null for comparison, for each genome-restriction sequence pair in our dataset we generated a novel restriction sequence. To do this, we took each restriction recognition sequence and randomly permuted it to obtain a new sequence with identical base composition (if the permuted sequence was identical to the original, we continued drawing until a different sequence was obtained). We then repeated the above flank-analysis with this set of “fake” restriction recognition sequences (a single, large simulated dataset was generated with 15923 genome-enzyme pairs).
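The generation of a null recognition sequence can be sketched as a shuffle with rejection of the original sequence. The function name, the retry limit, and the use of a seeded generator are illustrative assumptions.

```python
import random


def permuted_site(site, rng, max_tries=1000):
    """Shuffle a recognition sequence until it differs from the original.

    The result has the same base composition as the input.
    """
    if len(set(site)) < 2:
        raise ValueError("no distinct permutation exists for this sequence")
    letters = list(site)
    for _ in range(max_tries):
        rng.shuffle(letters)
        candidate = "".join(letters)
        if candidate != site:
            return candidate
    raise RuntimeError("failed to draw a distinct permutation")


rng = random.Random(0)  # seeded only to make the example repeatable
null_site = permuted_site("AATT", rng)
print(null_site)
```

Keeping base composition fixed means that any difference in flanking GC content between real and permuted sites cannot be attributed to the composition of the recognition sequence itself.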

Finally, for each flank-distance (1-200bp) in each genome we calculated the difference in mean GC content of the bases flanking true restriction sites from bases flanking the null sites. We bootstrapped the mean of this distribution for each flank distance across genomes to obtain 95% confidence intervals (Fig 5).
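The confidence intervals can be illustrated with a percentile bootstrap of the mean. The text does not state which bootstrap variant or how many replicates were used, so the percentile method, the replicate count, and the function name below are assumptions.

```python
import random


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each replicate.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy per-genome differences (true minus null flank GC) at one flank distance.
toy_diffs = [0.0, 0.1, 0.2, 0.3, 0.4]
print(bootstrap_mean_ci(toy_diffs, n_boot=500, seed=1))
```

Applied separately at each flank distance, an interval that excludes zero would indicate that GC content near true restriction sites differs from that near the permuted null sites.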

Supporting information

S1 Table. Output of linear model relating GC content to environmental variables.

The formal model was $\mathrm{GC} = \beta_0 + \beta_{\mathrm{Ku}}\,\mathrm{Ku} + \sum_i \beta_i\,\mathrm{trait}_i + \epsilon$, where GC is genomic GC content and Ku is a binary variable representing the presence/absence of Ku.

(PDF)

S1 Fig. The pairwise correlation between traits among species in the trait dataset.

Note that some traits are highly correlated.

(PDF)

S2 Fig. The correlation of trait values for microbial species with their average genomic GC content is similar to the correlation of trait values with the presence/absence of Ku.

Note that each point is an individual trait, as shown in Fig 1. The dashed diagonal line indicates the x = y line. For a direct analysis of the relationship between GC content and Ku incidence among organisms see Fig 2 and Table 1.

(PDF)

S3 Fig. GC content at fourfold degenerate sites follows a similar pattern to that of genomic GC content overall (Fig 2).

The effect of Ku is significant even taking phylogeny into account using an identical approach to overall genomic GC content (Table 1).

(PDF)

S4 Fig. While there is a positive GC content versus genome length trend, genomes with Ku have elevated GC independent of this relationship.

(a) Regression and contour lines were created using default ggplot settings. (b) The positive GC versus Ku relationship holds across taxa, independently of any relationship with genome length. Regressions of GC versus log genome length for Ku and non-Ku genomes are shown.

(PDF)

S5 Fig. Mutational bias does not appear to be associated with the NHEJ pathway.

Organisms with the Ku protein did not differ significantly in their GC↔AT mutational biases from those without the Ku protein (t-test, p > 0.34). Estimates of mutational bias were obtained from Long et al. [6].

(PDF)

S6 Fig. Genomes with Ku appear to fix GC alleles at a greater rate than expected (either due to BGC or selection).

(a,b) Genomes with Ku have, on average, even greater elevation of GC over expectation than genomes without Ku. Expected GC estimated from polymorphism data; in contrast to main text Fig 3, here we only use polymorphisms at fourfold degenerate sites. This signal is conservative due to observed polymorphisms experiencing some effects of BGC/selection (see Methods for discussion).

(PDF)

S7 Fig. Genomes with Ku appear to fix GC alleles at a greater rate than expected (either due to BGC or selection).

Genomes with Ku have, on average, even greater elevation of GC over expectation than genomes without Ku. This figure is identical to panels from Fig 3 and S6 Fig except that we draw loess smoothing lines using default ggplot settings instead of linear model fits.

(PDF)

S8 Fig. Phylogeny of the Bacillaceae (subtree of the SILVA tree).

(a) Ku presence/absence plotted on the tips of the tree as in Fig 2 (blue with, red without Ku). (b) Ancestral state reconstruction of Ku (one rate class). Each internal node is represented by a pie chart describing the probability that that organism either had (black) or did not have (white) Ku. Notice that the root and most nodes near the root are likely to have had Ku.

(PDF)

S9 Fig. Frequency of Ku presence does not appear to be positively associated with rates of homologous recombination for a species.

(a) Estimated rate of recombination relative to mutation rate from Vos and Didelot [43]. (b,c) Estimated number of recombination events per gene family for species estimated with two methods by Rendueles et al. [44]. In general all of these methods give highly correlated results [44].

(PDF)

S10 Fig. No relationship between genome-wide recombination frequency and (a) Ku incidence or (b) GC content in the ATGC database.

We used the PHI statistic (see Methods) to determine if genes within each ATGC cluster of genomes had evidence for recombination. The percent of genes with evidence for recombination (out of all genes with sufficient data to test) showed no relationship to either Ku or GC content (averaged across genomes in a particular ATGC cluster).

(PDF)

S11 Fig. Most species in RefSeq tend to always encode or always lack Ku on their genomes.

Shown is the proportion of genomes within a species that have Ku (all RefSeq assemblies) plotted against the total number of assemblies in RefSeq for that species.

(PDF)

S12 Fig. We evaluate the use of polymorphisms as a proxy for mutation by comparing estimates for the few species present in both the polymorphism and mutation accumulation data.

(a) Estimates based on all polymorphisms. (b) Estimates based on polymorphisms at fourfold degenerate sites. Here we see selection/BGC appears to bias the polymorphism estimates when mutation is extremely biased towards AT.

(PDF)

Data Availability

All data used came from public repositories. Completely sequenced prokaryotic genomes were from NCBI’s non-redundant RefSeq database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/). Relationships between prokaryotes were from the SILVA Living Tree (https://www.arb-silva.de/projects/living-tree/). Clusters of related genomes were from the Alignable Tight Genomic Cluster (ATGC) database (http://dmk-brain.ecn.uiowa.edu/ATGC/). Prokaryotic trait data were from the ProTraits database (http://protraits.irb.hr/). Linkages between genomes and restriction enzymes were from the REBASE database (http://rebase.neb.com/rebase/rebase.html). Intermediate data files and code may be found at: https://github.com/jlw-ecoevo/gcku.

Funding Statement

JLW was supported by a GAANN Fellowship from the U.S. Department of Education and the University of Maryland as well as a COMBINE Fellowship from the University of Maryland and funded by NSF DGE-1632976. WFF was partially supported by the U.S. Army Research Laboratory and the U.S. Army Research Office under Grant W911NF-14-1-0490. PLFJ was supported in part by NIH R00 GM104158. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, et al. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science. 2006;314(5797):267–267. doi: 10.1126/science.1134196
  • 2. Foerstner KU, von Mering C, Hooper SD, Bork P. Environments shape the nucleotide composition of genomes. EMBO reports. 2005. Dec;6(12):1208–1213. doi: 10.1038/sj.embor.7400538
  • 3. Reichenberger ER, Rosen G, Hershberg U, Hershberg R. Prokaryotic nucleotide composition is shaped by both phylogeny and the environment. Genome Biology and Evolution. 2015. Apr;7(5):1380–1389. doi: 10.1093/gbe/evv063
  • 4. Hershberg R, Petrov DA. Evidence That Mutation Is Universally Biased towards AT in Bacteria. PLOS Genetics. 2010. Sep;6(9):e1001115. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001115
  • 5. Hildebrand F, Meyer A, Eyre-Walker A. Evidence of Selection upon Genomic GC-Content in Bacteria. PLOS Genetics. 2010. Sep;6(9):e1001107. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001107
  • 6. Long H, Sung W, Kucukyildirim S, Williams E, Miller SF, Guo W, et al. Evolutionary determinants of genome-wide nucleotide composition. Nature Ecology & Evolution. 2018. Feb;2(2):237–240. Available from: https://www.nature.com/articles/s41559-017-0425-y.
  • 7. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V. GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. PLOS Genetics. 2015. Feb;11(2):e1004941. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004941
  • 8. Raghavan R, Kelkar YD, Ochman H. A selective force favoring increased G+ C content in bacterial genes. Proceedings of the National Academy of Sciences. 2012;109(36):14504–14507. doi: 10.1073/pnas.1205683109
  • 9. Rocha EPC. Neutral Theory, Microbial Practice: Challenges in Bacterial Population Genetics. Molecular Biology and Evolution. 2018. Jun;35(6):1338–1347. Available from: https://academic.oup.com/mbe/article/35/6/1338/4976545
  • 10. Bobay LM, Ochman H. Impact of recombination on the base composition of bacteria and archaea. Molecular biology and evolution. 2017;34(10):2627–2636. doi: 10.1093/molbev/msx189
  • 11. Rocha EPC, Feil EJ. Mutational Patterns Cannot Explain Genome Composition: Are There Any Neutral Sites in the Genomes of Bacteria? PLOS Genetics. 2010. Sep;6(9):e1001104. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001104
  • 12. Naya H, Romero H, Zavala A, Alvarez B, Musto H. Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. Journal of Molecular Evolution. 2002. Sep;55(3):260–264. doi: 10.1007/s00239-002-2323-3
  • 13. Romero H, Pereira E, Naya H, Musto H. Oxygen and Guanine—Cytosine Profiles in Marine Environments. Journal of Molecular Evolution. 2009. Aug;69(2):203–206. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722718/
  • 14. Karanjawala ZE, Murphy N, Hinton DR, Hsieh CL, Lieber MR. Oxygen Metabolism Causes Chromosome Breaks and Is Associated with the Neuronal Apoptosis Observed in DNA Double-Strand Break Repair Mutants. Current Biology. 2002. Mar;12(5):397–402. Available from: http://www.sciencedirect.com/science/article/pii/S096098220200684X
  • 15. Pitcher RS, Brissett NC, Doherty AJ. Nonhomologous end-joining in bacteria: a microbial perspective. Annual Review of Microbiology. 2007;61:259–282. doi: 10.1146/annurev.micro.61.080706.093354
  • 16. Charbon G, Bjørn L, Mendoza-Chamizo B, Frimodt-Møller J, Løbner-Olesen A. Oxidative DNA damage is instrumental in hyperreplication stress-induced inviability of Escherichia coli. Nucleic acids research. 2014;42(21):13228–13241. doi: 10.1093/nar/gku1149
  • 17. Dianov GL, Timehenko TV, Sinitsina OI, Kuzminov AV, Medvedev OA, Salganik RI. Repair of uracil residues closely spaced on the opposite strands of plasmid DNA results in double-strand break and deletion formation. Molecular and General Genetics MGG. 1991;225(3):448–452. doi: 10.1007/bf00261686
  • 18. Kozmin SG, Sedletska Y, Reynaud-Angelin A, Gasparutto D, Sage E. The formation of double-strand breaks at multiply damaged sites is driven by the kinetics of excision/incision at base damage in eukaryotic cells. Nucleic acids research. 2009;37(6):1767–1777. doi: 10.1093/nar/gkp010
  • 19. Hong Y, Li L, Luan G, Drlica K, Zhao X. Contribution of reactive oxygen species to thymineless death in Escherichia coli. Nature microbiology. 2017;2(12):1667. doi: 10.1038/s41564-017-0037-y
  • 20. Henrikus SS, Henry C, McDonald JP, Hellmich Y, Wood EA, Woodgate R, et al. DNA double-strand breaks induced by reactive oxygen species promote DNA polymerase IV activity in Escherichia coli. bioRxiv. 2019; p. 533422.
  • 21. Bonura T, Town CD, Smith KC, Kaplan HS. The influence of oxygen on the yield of DNA double-strand breaks in X-irradiated Escherichia coli K-12. Radiation research. 1975;63(3):567–577. doi: 10.2307/3574108
  • 22. Tilby MJ, Loverock PS. Measurements of DNA double-strand break yields in E. coli after rapid irradiation and cell inactivation: the effects of inactivation technique and anoxic radiosensitizers. Radiation research. 1983;96(2):309–321. doi: 10.2307/3576214
  • 23. Van der Schans G, Blok J. The influence of oxygen and sulphhydryl compounds on the production of breaks in bacteriophage DNA by gamma-rays. International Journal of Radiation Biology and Related Studies in Physics, Chemistry and Medicine. 1970;17(1):25–38. doi: 10.1080/09553007014550041
  • 24. Mahaseth T, Kuzminov A. Prompt repair of hydrogen peroxide-induced DNA lesions prevents catastrophic chromosomal fragmentation. DNA repair. 2016;41:42–53. doi: 10.1016/j.dnarep.2016.03.012
  • 25. Wang HC, Susko E, Roger AJ. On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: Data quality and confounding factors. Biochemical and Biophysical Research Communications. 2006. Apr;342(3):681–684. Available from: http://www.sciencedirect.com/science/article/pii/S0006291X06003214
  • 26. Pitcher RS, Green AJ, Brzostek A, Korycka-Machala M, Dziadek J, Doherty AJ. NHEJ protects mycobacteria in stationary phase against the harmful effects of desiccation. DNA repair. 2007;6(9):1271–1276. doi: 10.1016/j.dnarep.2007.02.009
  • 27. Vriezen JA, De Bruijn FJ, Nüsslein K. Responses of rhizobia to desiccation in relation to osmotic stress, oxygen, and temperature. Applied and Environmental Microbiology. 2007;73(11):3451–3459. doi: 10.1128/AEM.02991-06
  • 28. Dupuy P, Gourion B, Sauviac L, Bruand C. DNA double-strand break repair is involved in desiccation resistance of Sinorhizobium meliloti, but is not essential for its symbiotic interaction with Medicago truncatula. Microbiology. 2017;163(3):333–342. doi: 10.1099/mic.0.000400
  • 29. Slieman TA, Nicholson WL. Artificial and solar UV radiation induces strand breaks and cyclobutane pyrimidine dimers in Bacillus subtilis spore DNA. Appl Environ Microbiol. 2000;66(1):199–205. doi: 10.1128/aem.66.1.199-205.2000
  • 30. Singer CE, Ames BN. Sunlight ultraviolet and bacterial DNA base ratios. Science. 1970;170(3960):822–826. doi: 10.1126/science.170.3960.822
  • 31. Rocha EP, Cornet E, Michel B. Comparative and evolutionary analysis of the bacterial homologous recombination systems. PLoS genetics. 2005;1(2):e15. doi: 10.1371/journal.pgen.0010015
  • 32. Gong C, Bongiorno P, Martins A, Stephanou NC, Zhu H, Shuman S, et al. Mechanism of nonhomologous end-joining in mycobacteria: a low-fidelity repair system driven by Ku, ligase D and ligase C. Nature structural & molecular biology. 2005;12(4):304. doi: 10.1038/nsmb915
  • 33. Aravind L, Koonin EV. Prokaryotic homologs of the eukaryotic DNA-end-binding protein Ku, novel domains in the Ku protein and prediction of a prokaryotic double-strand break repair system. Genome Research. 2001. Aug;11(8):1365–1374. doi: 10.1101/gr.181001
  • 34. Doherty Aidan J, Jackson Stephen P, Weller Geoffrey R. Identification of bacterial homologues of the Ku DNA repair proteins. FEBS Letters. 2001. Jul;500(3):186–188. Available from: https://febs.onlinelibrary.wiley.com/doi/full/10.1016/S0014-5793%2801%2902589-3.
  • 35. Mcewan CE, Gatherer D, Mcewan NR. Nitrogen-fixing aerobic bacteria have higher genomic GC content than non-fixing species within the same genus. Hereditas. 1998;128(2):173–178. doi: 10.1111/j.1601-5223.1998.00173.x
  • 36. Musto H, Naya H, Zavala A, Romero H, Alvarez-Valín F, Bernardi G. Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochemical and biophysical research communications. 2006;347(1):1–3. doi: 10.1016/j.bbrc.2006.06.054
  • 37. Galtier N, Lobry J. Relationships between genomic G+ C content, RNA secondary structures, and optimal growth temperature in prokaryotes. Journal of molecular evolution. 1997;44(6):632–636. doi: 10.1007/pl00006186
  • 38. Hurst LD, Merchant AR. High guanine—cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proceedings of the Royal Society of London Series B: Biological Sciences. 2001;268(1466):493–497. doi: 10.1098/rspb.2000.1397
  • 39. Brbić M, Piškorec M, Vidulin V, Kriško A, Šmuc T, Supek F. The landscape of microbial phenotypic traits and associated genes. Nucleic Acids Research. 2016. Dec;44(21):10074–10090. Available from: https://academic.oup.com/nar/article/44/21/10074/2290929
  • 40. Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic acids research. 2013;42(D1):D553–D559. doi: 10.1093/nar/gkt1274
  • 41. McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351(6328):652. doi: 10.1038/351652a0
  • 42. Kristensen DM, Wolf YI, Koonin EV. ATGC database and ATGC-COGs: an updated resource for micro-and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic acids research. 2016; p. gkw934.
  • 43. Vos M, Didelot X. A comparison of homologous recombination rates in bacteria and archaea. The ISME journal. 2009;3(2):199. doi: 10.1038/ismej.2008.93
  • 44. Rendueles O, de Sousa JAM, Bernheim A, Touchon M, Rocha EP. Genetic exchanges are more frequent in bacteria encoding capsules. PLoS genetics. 2018;14(12):e1007862. doi: 10.1371/journal.pgen.1007862
  • 45. Bruen TC, Philippe H, Bryant D. A simple and robust statistical test for detecting the presence of recombination. Genetics. 2006;172(4):2665–2681. doi: 10.1534/genetics.105.048975
  • 46. Yahara K, Didelot X, Jolley KA, Kobayashi I, Maiden MC, Sheppard SK, et al. The landscape of realized homologous recombination in pathogenic bacteria. Molecular biology and evolution. 2015;33(2):456–471. doi: 10.1093/molbev/msv237
  • 47. González-Torres P, Rodríguez-Mateos F, Antón J, Gabaldón T. Impact of homologous recombination on the evolution of prokaryotic core genomes. mBio. 2019;10(1):e02494–18. doi: 10.1128/mBio.02494-18
  • 48. Liu H, Huang J, Sun X, Li J, Hu Y, Yu L, et al. Tetrad analysis in plants and fungi finds large differences in gene conversion rates but no GC bias. Nature ecology & evolution. 2018;2(1):164. doi: 10.1038/s41559-017-0372-7
  • 49. Marsolier-Kergoat MC, Yeramian E. GC content and recombination: reassessing the causal effects for the Saccharomyces cerevisiae genome. Genetics. 2009;183(1):31–38. doi: 10.1534/genetics.109.105049
  • 50. Brissett NC, Pitcher RS, Juarez R, Picher AJ, Green AJ, Dafforn TR, et al. Structure of a NHEJ polymerase-mediated DNA synaptic complex. Science. 2007;318(5849):456–459. doi: 10.1126/science.1145112
  • 51. Brissett NC, Doherty AJ. Repairing DNA double-strand breaks by the prokaryotic non-homologous end-joining pathway. Biochemical Society transactions. 2009. Jun;37:539–545. doi: 10.1042/BST0370539
  • 52. Della M, Palmbos PL, Tseng HM, Tonkin LM, Daley JM, Topper LM, et al. Mycobacterial Ku and ligase proteins constitute a two-component NHEJ repair machine. Science. 2004;306(5696):683–685. doi: 10.1126/science.1099824
  • 53. Aniukwu J, Glickman MS, Shuman S. The pathways and outcomes of mycobacterial NHEJ depend on the structure of the broken DNA ends. Genes & development. 2008;22(4):512–527. doi: 10.1101/gad.1631908
  • 54. Sfeir A, Symington LS. Microhomology-mediated end joining: a back-up survival mechanism or dedicated pathway? Trends in biochemical sciences. 2015;40(11):701–714. doi: 10.1016/j.tibs.2015.08.006
  • 55. Sandoval A, Labhart P. High G/C content of cohesive overhangs renders DNA end joining Ku-independent. DNA repair. 2004;3(1):13–21. doi: 10.1016/j.dnarep.2003.08.014
  • 56. Daley JM, Wilson TE. Rejoining of DNA double-strand breaks as a function of overhang length. Molecular and cellular biology. 2005;25(3):896–906. doi: 10.1128/MCB.25.3.896-906.2005
  • 57. Pleška M, Qian L, Okura R, Bergmiller T, Wakamoto Y, Kussell E, et al. Bacterial autoimmunity due to a restriction-modification system. Current Biology. 2016;26(3):404–409. doi: 10.1016/j.cub.2015.12.041
  • 58. Limor-Waisberg K, Carmi A, Scherz A, Pilpel Y, Furman I. Specialization versus adaptation: two strategies employed by cyanophages to enhance their translation efficiencies. Nucleic acids research. 2011;39(14):6016–6028. doi: 10.1093/nar/gkr169
  • 59. Roberts RJ, Vincze T, Posfai J, Macelis D. REBASE—a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Research. 2010. Jan;38(suppl_1):D234–D236. Available from: https://academic.oup.com/nar/article/38/suppl_1/D234/3112229
  • 60. Cox MM, Battista JR. Deinococcus radiodurans-the consummate survivor. Nature Reviews Microbiology. 2005;3(11):882. doi: 10.1038/nrmicro1264 [DOI] [PubMed] [Google Scholar]
  • 61. Rohwer F, Azam F. Detection of DNA damage in prokaryotes by terminal deoxyribonucleotide transferase-mediated dUTP nick end labeling. Appl Environ Microbiol. 2000;66(3):1001–1006. doi: 10.1128/aem.66.3.1001-1006.2000 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research. 2016. Jan;44(Database issue):D733–D745. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702849/ [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Eddy SR. Profile hidden Markov models. Bioinformatics (Oxford, England). 1998;14(9):755–763. doi: 10.1093/bioinformatics/14.9.755 [DOI] [PubMed] [Google Scholar]
  • 64. Yarza P, Richter M, Peplies J, Euzeby J, Amann R, Schleifer KH, et al. The All-Species Living Tree project: A 16S rRNA-based phylogenetic tree of all sequenced type strains. Systematic and Applied Microbiology. 2008. Sep;31(4):241–250. Available from: http://www.sciencedirect.com/science/article/pii/S072320200800060X [DOI] [PubMed] [Google Scholar]
  • 65. Weissman JL, Laljani RM, Fagan WF, Johnson PL. Visualization and prediction of CRISPR incidence in microbial trait-space to identify drivers of antiviral immune strategy. The ISME journal. 2019. doi: 10.1038/s41396-019-0411-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Ho LsT, Ané C. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Systematic Biology. 2014. May;63(3):397–408. doi: 10.1093/sysbio/syu005 [DOI] [PubMed] [Google Scholar]
  • 67. Beaulieu JM, O’Meara BC, Donoghue MJ. Identifying hidden rate changes in the evolution of a binary morphological character: the evolution of plant habit in campanulid angiosperms. Systematic biology. 2013;62(5):725–737. doi: 10.1093/sysbio/syt034 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Lotte Søgaard-Andersen, Xavier Didelot

1 Oct 2019

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Johnson,

Thank you very much for submitting your Research Article entitled 'Linking high GC content to the repair of double strand breaks in prokaryotic genomes' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved. In particular, reviewer 3 makes suggestions for inclusion of additional data on homologous recombination, and use of the PHI test as in previous comparable studies, and we would like to encourage the authors to follow these suggestions.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Xavier Didelot

Associate Editor

PLOS Genetics

Lotte Søgaard-Andersen

Section Editor: Prokaryotic Genetics

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this study, Weissman and colleagues explore a new mechanism potentially explaining the evolution of GC-content in bacteria. Much work has been published on this question, yet no single explanation has prevailed. The authors hypothesize that DNA breaks and repair could be the underlying cause of GC-content evolution across bacteria. Although they cannot provide direct evidence supporting their hypothesis, the results show, at the very least, a clear link between NHEJ and GC-content. The methods used by the authors are generally sound and I only have a few criticisms. The manuscript is well written and very pleasant to read. I think this is an interesting hypothesis and a good study. As mentioned by the authors, future experimental work could potentially test this hypothesis.

The methods appear adequate as far as I can judge, and I don’t have any major concerns. However, I am not sure I understand how the authors calculated the incidence of Ku across the data set. From what I understood, it represents the probability that Ku is really present in a genome. Figure S1 is particularly intriguing, but I am not sure what to take from it. It seems to represent a correlation of correlation coefficients, which might be a very indirect way to show a correlation. It would be more straightforward to represent the correlation between Ku incidence and GC-content directly.

The main issue with the results of the manuscript is that the authors are using the presence of Ku as evidence for more frequent DSBs. As stated by the authors, the NHEJ pathway is either present or absent but DSBs can occur at different rates. Figure 2A is rather convincing but it would be interesting to indicate the size of each sample. Also, it is not clear to me whether the data were computed on the entire dataset of genomes or if only one genome was selected for each species. Species that don’t encode Ku appear to present a wider range of GC-content while species that encode Ku appear much more biased toward high GC-content. It would be informative to explore and discuss the rare species that encode Ku but present a relatively low GC-content. These cases might be insightful, but maybe the authors did not find anything worth reporting in the manuscript.

The authors argue that restriction systems might elevate the frequency of DSBs and use the presence of RM systems as an indirect indicator of elevated DSBs. If we follow their logic, we might expect species encoding larger numbers of RM systems to present higher GC-content. I find this argument not very convincing considering that Helicobacter pylori encodes an exceptionally high number of RM systems (Oliveira, Touchon and Rocha, NAR 2014) but presents a relatively low GC-content (~39%).

Finally, I think it is interesting that GC-content and genome length correlate. This observation is not new, but it supports the hypothesis of the authors. I believe it can be safely assumed that, overall, bacteria with larger genomes endure more frequent DSBs. Under the authors’ assumption, the higher GC-content of larger genomes could be explained by the need to repair more frequent DNA breaks. I think this was not explicitly formulated in the manuscript and could be emphasized.

Reviewer #2: I have reviewed this article before for another journal. In this new version, most of my initial criticisms have been addressed and I find the article quite good. The relationship between genomic GC-content and the presence of the NHEJ pathway is an interesting point to bring to the debate on the evolution of GC-content in genomes. However, I still have difficulties with the hypothesis of the authors that there is selection for high GC-content to favor double strand break repair. It is not clear to me why the hypothesis that NHEJ is itself a repair mechanism that is biased toward GC (just like BGC in HR) is not considered and discussed. The argument that GC-rich regions are better repaired seems relatively weak to me.

Reviewer #3: This manuscript presents new observations and a new hypothesis to explain the long-time puzzle of prokaryotic GC content heterogeneity, and the discrepancy between observed GC contents and their – almost universally lower – expected value based on mutational patterns. The authors report that the non-homologous end-joining (NHEJ) protein Ku is strikingly associated with high genome GC content, and also with the departures from mutational equilibrium, in a stronger way than any previously considered trait (notably those associated with lifestyle). The authors interpret this Ku-GC association as a signature of GC elevation being a response to frequent exposure to double-stranded DNA (dsDNA) breaks (DSBs). This is considered under several hypotheses, including that GC elevation and Ku occurrence may both be correlated responses to the high incidence of DSBs, via separate mechanisms. Alternatively, they investigate a hypothesis where Ku is causally linked to GC elevation, via a selective process promoting the elevation of GC content in the genome and in particular in regions susceptible to regular DSBs, such as self-target sites for restriction enzymes, to improve the efficiency of Ku repair function. They conclude that Ku (or the NHEJ pathway) is unlikely to account on its own for the whole higher-than-expected-GC phenomenon, but may at least be the functional mechanism of a selective process that accounts for part of this phenomenon.

The manuscript is very well written and documented, and presents relevant analyses to test the new hypothesis. The authors also attempt to link these new results to observations made previously regarding other hypotheses of mechanism for above-mutational-equilibrium genome GC contents, namely selection for higher %GC per se, and biased gene conversion (BGC).

The evidence presented in support of the Ku-GC association is sufficient and convincing, and its interpretation is cautiously discussed to consider known evidence, and to take into account potential interactions or confounding signatures with other mechanisms.

However, it would be desirable that the authors bring their study a step further, and bring a bit more material to help the reader (and future investigations) to resolve this puzzle. Namely, in order to test the relevance of the BGC hypothesis in the light of the facts presented in this study, they compare Ku occurrence and GC content data to homologous recombination (HR) data, which are only recovered from other studies. This brings the concern that these data are not properly matched with the study’s own datasets: summary statistics from studies using different genome sets (and potentially different sets of sequences within genomes), matched on the basis of species name alone, are unlikely to reflect the exact properties of the genome datasets investigated here. Considering the scale of the present investigation (the whole prokaryotic tree of life), it is crucial that each data point accurately represents the properties of the considered organism, and hence that all measurements be made on the same dataset. As explained below, applying the HR test/quantification procedures described in the cited literature to this dataset would be a feasible undertaking, and would add much value to the paper.

Finally, I notice that the intermediary data is not made available. This includes tables describing the sets of genomes used, the occurrence of Ku in these genomes, the list of restriction enzymes found in them and their corresponding target sequences, the genome tree presented in Fig 2 in machine-readable format, the estimates of GC at the mutational equilibrium, etc. The scripts used to generate such data, as well as those used to test their association, should be provided as well. I think that publication in a scientific journal, and especially in the open-access pioneer PLoS journals, should always be backed by full access to data and proceedings of the analyses so they can be replicated. Please attach them as a supplement, or provide a link to an external data/code repository (my recommendation).

I let the editor appreciate the relevance of the request for additional data on HR. Provided that the few minor comments below are addressed and that intermediary data are provided, I think the manuscript would be otherwise generally fit for publication in PLoS Genetics. I thus recommend the paper for minor revision.

Detailed comments

L70-82: this paragraph belongs to the introduction, with which it is slightly redundant.

L87-89: this correlation of the Ku and GC, as revealed by correlation of each with ‘third-party’ trait, is striking. However, it would be nice to have a more straightforward estimate and visualisation of their association. Could the authors provide a correlation r^2 and p-value for GC ~ Ku occurrence? In complement of the PCA in fig. 1, could they also plot the result of a linear discriminant analysis (LDA) maximizing the separation of the samples based on their Ku +/- state, and plotting the %GC over it (as well as showing the explained variance of such a projection)?
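A minimal sketch of the direct estimate requested here, assuming Ku presence is coded as 0/1 and GC content as a fraction per genome. With a binary predictor the Pearson coefficient is the point-biserial correlation. The function name and the toy values are hypothetical, and the sketch ignores phylogenetic non-independence, so a p-value would still need to come from a phylogenetically-aware model.

```python
from math import sqrt


def point_biserial_r(binary, values):
    """Pearson correlation between a 0/1 trait and a continuous trait."""
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    n = len(values)
    mean_b = sum(binary) / n
    mean_v = sum(values) / n
    # Cross-products and sums of squares around the two means.
    cov = sum((b - mean_b) * (v - mean_v) for b, v in zip(binary, values))
    var_b = sum((b - mean_b) ** 2 for b in binary)
    var_v = sum((v - mean_v) ** 2 for v in values)
    if var_b == 0 or var_v == 0:
        raise ValueError("both traits must vary")
    return cov / sqrt(var_b * var_v)


# Hypothetical toy data: Ku presence (1/0) and genomic GC fraction.
ku = [1, 1, 1, 0, 0, 0]
gc = [0.68, 0.65, 0.70, 0.40, 0.45, 0.38]
r = point_biserial_r(ku, gc)
r_squared = r ** 2  # share of GC variance associated with Ku state
```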

Actually, something like a heatmap of a correlation matrix of all these traits would be helpful (in supplement) for the reader to see how the traits are associated with each other.

L90-92 / S1 Table: I think that the table legend should spell out how the model was formulated (e.g., give the R code or a more formal string like ‘y ~ trait1 + trait2’). Once that is clarified, it would be interesting to present results of a general linear model where the prioritization of the explanatory variables would have been different: with Ku as first explanatory variable, would the other traits have any variance left to explain?

L95-96: “In fact this is trivially true, as Ku presence is a discrete, binary variable whereas the rate of DSB formation is continuous.”

This is a relevant point, and should be considered further. In fact, the presence/absence of the Ku protein (used as a proxy of a functional NHEJ pathway) is a trait that can vary among strains of a clade or species, as stated by the authors L176-178. Transitions between the Ku +/- states might have happened recently in certain strain lineages, and at potentially high frequency over time. On the contrary, %GC increase is expected to be a long process, given that the effect size of either selection for higher %GC or BGC phenomena are likely small, that they act against the mutational bias, and that selection for other traits may interfere with this background amelioration process. This is to be opposed to phenotypic traits (usually considered for correlation under BM or OU models) that result from the expression of the genotype of an individual organism, i.e. in sync with its current genotype.

It follows that the association between a potentially recently acquired trait (Ku presence) and the result of a long-standing process (%GC increase away from the mutational equilibrium) could possibly be coincidental. The authors should try and repeat their analyses by restricting them to genomes in clades where the Ku +/- state is conserved, and where we can expect that it has been present/absent for long enough so that the base substitution process is in its steady state. The situation that “Ku presence/absence is sprinkled throughout the prokaryotic phylogeny”, and described in Figure 2B, where it seems that many clades have a homogeneous pattern of Ku occurrence, should allow them to run such restricted analyses with enough statistical power (while still using the phylogenetically-aware regression models to avoid over-counting the replicated data points within such homogeneous clades).

This is an important point, as most studies trying to confirm/invalidate the hypothesis of BGC have tried to correlate the %GC with the recombination rate inferred from recent polymorphism data, which again reflect a recent property of the population, but might not reflect the long-term average recombination rate that the lineage has experienced – a major issue that prevented most past analyses from settling the debate on the existence or not of BGC in Prokaryotes. Ku occurrence is a simple binary trait and its past distribution is more easily estimated than the past recombination rate, whose estimation from polymorphism data is inherently biased towards recent times due to saturation of homoplasy signals; by studying this simpler trait, the authors here have an opportunity to bring stronger evidence on that subject than any other previous study.

Section “No Apparent Relationship Between Rate of Homologous Recombination and NHEJ”:

I agree with the general conclusions of the authors for this section, that is the impossibility to conclude given the data, but I think they could try and provide further evidence to fuel the debate. In particular, they only rely on data from previous study to quantify the effect of homologous recombination (HR) on species they investigated in their own dataset. The third-party data they report is likely to be inadequate to answer the question asked, for several reasons.

The quantification of HR rates (r/m) by Vos and Didelot (ref [44]) is made using ClonalFrame, a method that is able to grasp the long-term average HR rate (see comment above), which is a good thing, but was based on multi-locus data and on quite a variable set of strains depending on the species, thus unlikely to reflect findings from sets of whole-genomes of calibrated diversity (from the ATGC database) used in the present study.

The data from Rendueles et al. [45] are also unlikely to have used the same set of genomes, and use simple linkage disequilibrium-based metrics which have been designed to test for the occurrence of HR, not to quantify it, and whose application at the whole-genome scale is unlikely to grasp any nuance in such signal.

The fairer comparison is with the data from Lassalle et al. (ref [7]), but again the genome datasets are unlikely to be matched. Published genome data expand rapidly and, as a consequence, prokaryotic species definitions are being regularly revised; the genomes available for what was considered to be B. anthracis by Lassalle et al. in 2015 are thus unlikely to be what is available today in the ATGC database under this same name. I believe this drastically limits the scope of what the authors are able to say about HR in the framework of this study.

I would suggest that the authors replicate the procedure used by Lassalle et al., that is running the PHI test on the core gene alignments of their species datasets (or at least a representative subset), as provided by the ATGC database. The PHI test is very fast and can easily be run in parallel on a large collection of gene alignments. This is not essential to the core argument of the paper, but would help going further on the matter.

L216-220: “the extremely strong and specific association between GC and Ku suggests that this relationship may be particular to the specific conditions selecting for Ku (especially considering the absence of an association between HR and GC when looking between genomes [5]; S7 Fig)”

As discussed above, these datasets are very unlikely to be matched with the authors’, and rejecting the association of elevated %GC (or Ku occurrence) with HR rates on this basis is possibly flawed. Again, I would suggest the authors run their own recombination tests/quantifications on their own datasets so they can draw robust conclusions.

L217 “association between GC and Ku”; L219 “association between HR and GC”; L223 “association between NHEJ and high GC content” and more:

The authors need to use a consistent term to refer to the A/T vs. G/C base composition of genomes; the early sections of the manuscript use the acronym ‘%GC’, but later just name it ‘GC’, or ‘GC content’. One term should be chosen and used throughout the manuscript.

L223: “Given our lack of enthusiasm for BGC as a mechanism”

I appreciate the author’s willingness to disclose any subjective bias they may have towards one or another scientific hypothesis, but I don’t think it is appropriate to use it to justify what they investigate. Please rephrase into something like “Given the lack of evidence in support of the BGC hypothesis as reported above, we chose to investigate an alternative hypothesis.”

Importantly, the authors should make clear that they are not opposing hypotheses, i.e. rejecting BGC because of support for the selection hypothesis, or vice versa. In principle, both hypotheses could be true, and so could be a third (or more) alternative that was not yet proposed in the literature.

L228-229: “high GC content may promote DNA repair, both by facilitating canonical NHEJ (i.e., Ku-dependent) and alternative NHEJ (i.e., Ku-independent) pathways.”

Please cite relevant literature supporting these claims. If they are supported by the references [50, 51, 52] cited in the following paragraph, please connect these text sections (e.g. by not ending the sentence L229 and connecting it to the next with a colon) so as to make it clear.

L232-234 “Any factor that stabilizes this interaction (e.g., high GC via an increased number of hydrogen bonds) may thus increase the efficiency of NHEJ repair.”

L238-239: “It stands to reason that high GC content in these regions of microhomology might help stabilize the end-pairings and improve the efficiency of repair.”

Again, please cite the relevant literature (redundancy of citation with the previous sentence is not an issue in my opinion) so as to clarify whether this is a (reasonable) speculation of mechanism by the authors or something that is backed by experimental evidence.

L235-237: “alternative high-fidelity end-joining pathways that are independent of the NHEJ machinery, and that these pathways are primarily dependent on nearby microhomology to tether the DNA ends together”

Please clarify which bits of sequence are required to present microhomology for the NHEJ or NHEJ-independent end-joining pathways to function. If it is the immediate sequence on both free ends of the broken dsDNA, this means that sequences with short repeats would be more likely to be repaired by these pathways. This would come as a confounding factor for the prediction of effect of %GC in this system (for instance, because short repeats are enriched in mobile elements like phages, transposons or integrons, which are themselves generally AT-rich…); the authors should mention these potential pitfalls as they develop this hypothesis.

L262-264: “We further predicted that, for restriction enzymes with low GC recognition sequences, the bases flanking restriction sites on the genome would have elevated GC”

The test presented afterwards could also support selection-free hypotheses where the converse rationale would stand, i.e. that increased repair at those DSB-prone sites would induce higher %GC; typically, it would be in line with the BGC hypothesis as HR-associated pathways are also taking part in the repair of restriction enzyme-induced breaks.
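To make the quantity under discussion concrete, the following is a rough sketch of how the GC content of bases flanking restriction sites might be measured. It assumes a linear sequence, forward-strand exact matching (adequate for palindromic sites), and a hypothetical flank width; it is not the authors' pipeline, and it cannot by itself distinguish selection from a repair-driven bias.

```python
def flanking_gc(genome, site, flank=5):
    """GC fraction of the bases within `flank` positions of each site match.

    Returns None when the site does not occur or no unambiguous
    flanking bases are available.
    """
    genome = genome.upper()
    site = site.upper()
    gc = 0
    total = 0
    start = genome.find(site)
    while start != -1:
        end = start + len(site)
        left = genome[max(0, start - flank):start]
        right = genome[end:end + flank]
        for base in left + right:
            if base in "ACGT":  # skip ambiguous characters such as N
                total += 1
                gc += base in "GC"
        # Step by one so overlapping occurrences are also counted.
        start = genome.find(site, start + 1)
    return gc / total if total else None


# Hypothetical toy sequences around a GAATTC (EcoRI) site.
gc_rich_flanks = flanking_gc("GGGCCGAATTCCCGGGATATA", "GAATTC")
at_rich_flanks = flanking_gc("AAAAAGAATTCTTTTT", "GAATTC")
```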

L259: “to help ameliorate the effects of autoimmunity”

‘mitigate’ instead of ‘ameliorate’

L316: “This identified 21389 genomes containing Ku out of a total 104297 genomes analysed”

Please provide a list of the genomes, and of which were deemed positive for the Ku protein-coding gene.

L322: “we downloaded alignments from the Alignable Tight Genomic Cluster (ATGC) database [43]”

Please provide the list of genomes assigned to each cluster, the number of gene alignments and clarify how many were dropped/retained when filters were applied.

L324: “Trait data were obtained from the ProTraits microbial trait database (2679 species; [39])”

Please provide the table of how genomes from RefSeq were matched with entries of ProTraits (or if sharing identifiers, a list of genomes covered by both databases).

L382: “The rationale behind this test”

No test has been described at this point; I assume the authors refer to the comparison of the expected %GC (based on the mutational pattern estimated from phased polymorphism data) to the realized genomic %GC, which they describe right after; please rephrase.
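For readers unfamiliar with this comparison, a minimal sketch under a simple two-state mutation model: if AT sites mutate to GC at per-site rate u and GC sites mutate to AT at per-site rate v, the expected equilibrium GC fraction is u / (u + v). The function names and example rates are hypothetical, and the authors' actual estimator may differ.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """Expected GC fraction at mutational equilibrium (two-state model).

    Rates are per-site: AT->GC changes per AT site, GC->AT per GC site.
    """
    if rate_at_to_gc < 0 or rate_gc_to_at < 0:
        raise ValueError("rates must be non-negative")
    total = rate_at_to_gc + rate_gc_to_at
    if total == 0:
        raise ValueError("at least one rate must be positive")
    return rate_at_to_gc / total


def gc_excess(observed_gc, rate_at_to_gc, rate_gc_to_at):
    """Observed GC fraction minus the mutational-equilibrium expectation."""
    return observed_gc - equilibrium_gc(rate_at_to_gc, rate_gc_to_at)


# Hypothetical example: GC->AT mutations three times as frequent as AT->GC.
expected = equilibrium_gc(1.0, 3.0)
excess = gc_excess(0.60, 1.0, 3.0)
```

A positive excess indicates a genome that is more GC-rich than its mutational pattern alone would predict.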

L413: “no genomes with multiple AT-rich enzymes”

Please clarify how you define AT-rich enzymes (if based on the composition of the target sequence, what threshold of %GC?).
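One way such a definition could be made explicit is sketched below, assuming the classification is based on the GC fraction of the recognition sequence. The 0.5 cutoff is purely a placeholder, since the threshold actually used is what this comment asks the authors to state; ambiguity codes are simply ignored here.

```python
def gc_fraction(seq):
    """GC fraction over the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no unambiguous bases")
    return gc / (gc + at)


def is_at_rich(recognition_seq, gc_threshold=0.5):
    """True if the recognition sequence falls below the GC cutoff.

    gc_threshold is a hypothetical placeholder, not the authors' value.
    """
    return gc_fraction(recognition_seq) < gc_threshold


# EcoRI recognizes GAATTC and SmaI recognizes CCCGGG.
ecori_at_rich = is_at_rich("GAATTC")
smai_at_rich = is_at_rich("CCCGGG")
```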

L424: “We then repeated the above”

Please specify how many draws of the permutations were conducted.
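The number of draws matters because it bounds the smallest p-value a permutation test can report. A generic sketch of a one-sided two-group permutation test with the draw count as an explicit parameter follows; the statistic, group values, and default of 10000 draws are illustrative assumptions, not the authors' procedure.

```python
import random


def permutation_p_value(group_a, group_b, n_draws=10000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b) by label shuffling."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed:
            hits += 1
    # Add-one correction: the smallest reportable p is 1 / (n_draws + 1).
    return (hits + 1) / (n_draws + 1)


# Hypothetical GC fractions near sites (a) versus genomic background (b).
a = [0.70, 0.68, 0.72, 0.69, 0.71]
b = [0.40, 0.42, 0.38, 0.41, 0.39]
p = permutation_p_value(a, b)
```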

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: No: A link to a table including all the features of each genome (GC, genome size, ecological information, taxonomy, gene presence, etc...) should be provided to allow reproducibility of the results.

Reviewer #3: No: the intermediary data is not made available. This includes tables describing the sets of genomes used, the occurrence of Ku in these genomes, the list of restriction enzymes found in them and their corresponding target sequences, the genome tree presented in Fig 2 in machine-readable format, the estimates of GC at the mutational equilibrium, etc. The scripts used to generate such data, as well as those used to test their association, should be provided as well.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Florent Lassalle

Decision Letter 1

Lotte Søgaard-Andersen, Xavier Didelot

25 Oct 2019

Dear Dr Johnson,

We are pleased to inform you that your manuscript entitled "Linking high GC content to the repair of double strand breaks in prokaryotic genomes" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Xavier Didelot

Associate Editor

PLOS Genetics

Lotte Søgaard-Andersen

Section Editor: Prokaryotic Genetics

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01378R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Lotte Søgaard-Andersen, Xavier Didelot

1 Nov 2019

PGENETICS-D-19-01378R1

Linking high GC content to the repair of double strand breaks in prokaryotic genomes

Dear Dr Johnson,

We are pleased to inform you that your manuscript entitled "Linking high GC content to the repair of double strand breaks in prokaryotic genomes" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 Table. Output of linear model relating GC content to environmental variables.

    The formal model was $GC = \beta_0 + \beta_{Ku}\,Ku + \sum_i \beta_i\,\mathrm{trait}_i + \epsilon$, where GC is genomic GC content and Ku is a binary variable representing the presence/absence of Ku.

    (PDF)
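As an illustration only, the linear model described for S1 Table can be fitted by ordinary least squares as in the sketch below. The function name `fit_gc_model`, the input layout (one row per species, one column per trait), and the use of NumPy are assumptions made for this example; the authors' own code is in the repository named in the Data Availability Statement and may differ.

```python
import numpy as np

def fit_gc_model(gc, ku, traits):
    """Least-squares fit of GC = b0 + b_Ku * Ku + sum_i b_i * trait_i + error.

    gc     : length-n sequence of genomic GC content (fraction or percent)
    ku     : length-n sequence of 0/1 values (Ku absent/present)
    traits : n-by-k array of trait values (hypothetical layout)
    """
    gc = np.asarray(gc, dtype=float)
    ku = np.asarray(ku, dtype=float)
    traits = np.asarray(traits, dtype=float)
    # Design matrix: intercept column, Ku indicator, then one column per trait.
    X = np.column_stack([np.ones_like(gc), ku, traits])
    coef, *_ = np.linalg.lstsq(X, gc, rcond=None)
    return {"intercept": coef[0], "ku": coef[1], "traits": coef[2:]}

if __name__ == "__main__":
    # Synthetic data, not values from the paper.
    rng = np.random.default_rng(0)
    n = 200
    ku = rng.integers(0, 2, size=n)
    traits = rng.integers(0, 2, size=(n, 3))
    gc = 0.45 + 0.10 * ku + traits @ np.array([0.02, -0.01, 0.03])
    gc = gc + rng.normal(0.0, 0.01, size=n)
    print(fit_gc_model(gc, ku, traits))
```

This sketch returns point estimates only; standard errors and p-values of the kind reported in a model output table would need a statistics package.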

    S1 Fig. The pairwise correlation between traits among species in the trait dataset.

    Note that some traits are highly correlated.

    (PDF)

    S2 Fig. The correlation of trait values for microbial species with their average genomic GC content is similar to the correlation of trait values with the presence/absence of Ku.

    Note that each point is an individual trait, as shown in Fig 1. The dashed diagonal line indicates the x = y line. For a direct analysis of the relationship between GC content and Ku incidence among organisms see Fig 2 and Table 1.

    (PDF)

    S3 Fig. GC content at fourfold degenerate sites follows a similar pattern to that of genomic GC content overall (Fig 2).

    The effect of Ku is significant even taking phylogeny into account using an identical approach to overall genomic GC content (Table 1).

    (PDF)
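For readers unfamiliar with the quantity in S3 Fig, the sketch below shows one way to compute GC content at fourfold degenerate sites from a single in-frame coding sequence. It assumes the standard genetic code (the eight codon families are the same in the bacterial and archaeal code, translation table 11); the function name `gc4` is hypothetical, and this section does not describe the authors' exact pipeline.

```python
# First two bases of the eight codon families in which any third base
# encodes the same amino acid (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}

def gc4(cds):
    """Fraction of G or C at third positions of fourfold degenerate codons.

    cds is an in-frame coding sequence; a trailing partial codon and codons
    with an ambiguous third base are ignored.
    """
    cds = cds.upper()
    gc_count = 0
    total = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            total += 1
            if codon[2] in "GC":
                gc_count += 1
    if total == 0:
        raise ValueError("no fourfold degenerate codons found")
    return gc_count / total

if __name__ == "__main__":
    # Toy sequence: ATG GCC CTA GGG TAA
    print(gc4("ATGGCCCTAGGGTAA"))
```

A genome-wide value would pool these counts across all coding sequences rather than average per-gene fractions; which of the two was used is not stated here.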

    S4 Fig. While there is a positive GC content versus genome length trend, genomes with Ku have elevated GC independent of this relationship.

    (a) Regression and contour lines were created using default ggplot settings. (b) The positive GC versus Ku relationship holds across taxa, independently of any relationship with genome length. Regressions of GC versus log genome length for Ku and non-Ku genomes are shown.

    (PDF)

    S5 Fig. Mutational bias does not appear to be associated with the NHEJ pathway.

    Organisms with the Ku protein did not differ significantly in their GC↔AT mutational biases from those without the Ku protein (t-test, p > 0.34). Estimates of mutational bias were obtained from Long et al. [6].

    (PDF)

    S6 Fig. Genomes with Ku appear to fix GC alleles at a greater rate than expected (either due to BGC or selection).

    (a,b) Genomes with Ku have, on average, even greater elevation of GC over expectation than genomes without Ku. Expected GC estimated from polymorphism data; in contrast to main text Fig 3, here we only use polymorphisms at fourfold degenerate sites. This signal is conservative due to observed polymorphisms experiencing some effects of BGC/selection (see Methods for discussion).

    (PDF)

    S7 Fig. Genomes with Ku appear to fix GC alleles at a greater rate than expected (either due to BGC or selection).

    Genomes with Ku have, on average, even greater elevation of GC over expectation than genomes without Ku. This figure is identical to panels from Fig 3 and S6 Fig except that we draw loess smoothing lines using default ggplot settings instead of linear model fits.

    (PDF)

    S8 Fig. Phylogeny of the Bacillaceae (subtree of the SILVA tree).

    (a) Ku presence/absence plotted on the tips of the tree as in Fig 2 (blue with, red without Ku). (b) Ancestral state reconstruction of Ku (one rate class). Each internal node is represented by a pie chart describing the probability that the organism at that node either had (black) or did not have (white) Ku. Notice that the root and most nodes near the root are likely to have had Ku.

    (PDF)

    S9 Fig. Frequency of Ku presence does not appear to be positively associated with rates of homologous recombination for a species.

    (a) Estimated rate of recombination relative to mutation rate from Vos and Didelot [43]. (b,c) Estimated number of recombination events per gene family for species estimated with two methods by Rendueles et al. [44]. In general all of these methods give highly correlated results [44].

    (PDF)

    S10 Fig. No relationship between genome-wide recombination frequency and (a) Ku incidence or (b) GC content in the ATGC database.

    We used the PHI statistic (see Methods) to determine if genes within each ATGC cluster of genomes had evidence for recombination. The percent of genes with evidence for recombination (out of all genes with sufficient data to test) showed no relationship to either Ku or GC content (averaged across genomes in a particular ATGC cluster).

    (PDF)

    S11 Fig. Most species in RefSeq tend to always encode or always lack Ku on their genomes.

    Shown is the proportion of genomes within a species that have Ku (all RefSeq assemblies) plotted against the total number of assemblies in RefSeq for that species.

    (PDF)

    S12 Fig. We evaluate the use of polymorphisms as a proxy for mutation by comparing estimates for the few species present in both the polymorphism and mutation accumulation data.

    (a) Estimates based on all polymorphisms. (b) Estimates based on polymorphisms at fourfold degenerate sites. Here we see selection/BGC appears to bias the polymorphism estimates when mutation is extremely biased towards AT.

    (PDF)

    Attachment

    Submitted filename: ResponseToReviewers_PlosGen.pdf

    Data Availability Statement

    All data used came from public repositories. Completely sequenced prokaryotic genomes were from NCBI’s non-redundant RefSeq database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/). Relationships between prokaryotes were from the SILVA Living Tree (https://www.arb-silva.de/projects/living-tree/). Clusters of related genomes were from the Alignable Tight Genomic Cluster (ATGC) database (http://dmk-brain.ecn.uiowa.edu/ATGC/). Prokaryotic trait data were from the ProTraits database (http://protraits.irb.hr/). Linkages between genomes and restriction enzymes were from the REBASE database (http://rebase.neb.com/rebase/rebase.html). Intermediate data files and code may be found at: https://github.com/jlw-ecoevo/gcku.


    Articles from PLoS Genetics are provided here courtesy of PLOS
