ABSTRACT
Gene transfer agents (GTAs) are virus-like elements that are encoded by some bacterial and archaeal genomes. The production of GTAs can be induced by carbon depletion and results in host lysis and the release of virus-like particles that contain mostly random fragments of the host DNA. The remaining members of a GTA-producing population act as GTA recipients by producing proteins needed for GTA-mediated DNA acquisition. Here, we detected a codon usage bias toward codons with more readily available tRNAs in the RcGTA-like GTA genes of alphaproteobacterial genomes. Such bias likely improves the translational efficacy during GTA gene expression. While the strength of codon usage bias fluctuates substantially among individual GTA genes and across taxonomic groups, it is especially pronounced in Sphingomonadales, whose members are known to inhabit nutrient-depleted environments. By screening genomes for gene families with trends in codon usage biases similar to those in GTA genes, we found a gene that likely encodes head completion protein in some GTAs where it appeared missing, and 13 genes previously not implicated in the GTA life cycle. The latter genes are involved in various molecular processes, including the homologous recombination and transport of scarce organic matter. Our findings provide insights into the role of selection for translational efficiency in the evolution of GTA genes and outline genes that are potentially involved in the previously hypothesized integration of GTA-delivered DNA into the host genome.
IMPORTANCE Horizontal gene transfer (HGT) is a fundamental process that drives evolution of microorganisms. HGT can result in a rapid dissemination of beneficial genes within and among microbial communities and can be achieved via multiple mechanisms. One peculiar HGT mechanism involves viruses “domesticated” by some bacteria and archaea (their hosts). These so-called gene transfer agents (GTAs) are encoded in hosts’ genomes, produced under starvation conditions, and cannot propagate themselves as viruses. We show that GTA genes are under selection to improve the efficiency of their translation when the host activates GTA production. The selection is especially pronounced in bacteria that occupy nutrient-depleted environments. Intriguingly, several genes involved in incorporation of DNA into a genome are under similar selection pressure, suggesting that they may facilitate the integration of GTA-delivered DNA into the host genome. Our findings underscore the potential importance of GTAs as a mechanism of HGT under nutrient-limited conditions, which are widespread in microbial habitats.
KEYWORDS: GTA, Sphingomonadales, codon usage bias, nutrient depletion, homologous recombination, tonB, addAB, head completion protein
INTRODUCTION
Gene transfer agents are phage-like particles produced by multiple groups of bacteria and archaea (1). Unlike viruses, GTA particles tend to package random pieces of the host cell DNA instead of genes that encode GTAs themselves (2, 3). Released GTA particles can deliver the packaged genetic material to other cells (4), impacting the exchange of genetic material in prokaryotic populations (5–7). The benefits of GTA production and GTA-mediated DNA acquisition are not well understood. It has been hypothesized that GTAs may facilitate DNA repair (8), enable the population-level exchange of traits needed under the conditions of nutritional stress via horizontal gene transfer (HGT) (5), or decrease population density during the carbon starvation periods (9).
To date, at least three independently exapted GTAs are functionally characterized (10). The most studied GTA system (RcGTA) belongs to the alphaproteobacterium Rhodobacter capsulatus (2). RcGTA is encoded by at least 24 genes that are distributed across 5 distinct genomic loci (11, 12). Seventeen of the 24 genes are situated in one locus, which is dubbed the ‘head-tail’ cluster because it encodes most of the structural proteins of the RcGTA particles (1). RcGTA-like ‘head-tail’ clusters are present in many alphaproteobacterial genomes. They evolve slowly and are inferred to be inherited mostly vertically from a common ancestor of an alphaproteobacterial clade that spans multiple taxonomic orders (12–14). Additionally, multiple cellular genes regulate RcGTA production, release, and reception (11, 15). Other, yet undiscovered, genes in R. capsulatus are likely involved in the GTA life cycle.
Expression of RcGTA is known to be triggered by nutrient depletion (16), under which a small fraction of the R. capsulatus population becomes dedicated to GTA production (17, 18). As a result, RcGTA-producing cells likely express GTA genes at high levels. By extension, RcGTA-like GTA genes in other alphaproteobacteria (here referred to as “GTA genes” for brevity) are also likely to be highly expressed in GTA-producing cells of alphaproteobacterial populations.
Highly expressed genes that are involved in core biological processes, such as translational machinery, are known to exhibit a strong codon usage bias (19). For example, codon usage in ribosomal proteins, which are highly expressed in almost all organisms, deviates most dramatically from the distribution of codons expected under their equal usage corrected for organismal GC content (20). Such bias is primarily due to selection to match the pool of most abundant tRNA molecules in order to have the most efficient translation for proteins needed in a high number of copies (21–23). As a result, highly expressed genes tend to have codons that correspond to the most abundant tRNA molecules in the cell. This type of selection is known as the “selection for translational efficiency” and is ubiquitous among bacteria (24).
Besides constitutively highly expressed genes, selection for translational efficiency also acts on genes that are highly expressed under specific environmental conditions that microorganisms experience (19, 24, 25). For instance, genes that utilize galactose have higher codon usage biases in budding yeasts that live in dairy-associated habitats than in yeasts that occupy alcohol-associated habitats (25). Additionally, genes that encode interacting proteins and genes involved in the same pathway often exhibit similar codon usage biases (25, 26).
In earlier work, we discovered that alphaproteobacterial GTA genes have a striking bias toward GC-rich codons in comparison to the rest of the genome (9, 12). However, this bias is different from the codon usage bias: The skewed composition of the encoded proteins toward containing energetically less expensive amino acids caused the first two positions of the codons to be enriched in Gs and Cs, due to the structure of the genetic code (9). In this study, we examined GTA genes in 208 alphaproteobacterial genomes and assessed if there was additional codon usage bias due to the genes being under selection for translation efficiency. For this purpose, we used two well-established metrics for the assessment of codon usage bias and its match to the tRNA abundance: the effective number of codons (ENC) (20) and tRNA adaptation index (tAI) (27). ENC quantifies how equally synonymous codons are used in a gene and varies from 20 (when only one codon is used per each amino acid; strong bias) to 61 (when all codons are used equally; no bias). The tAI measures how optimally the codon usage of each gene fits the available tRNA pool by correlating the frequency of each codon in the gene with the abundance of its cognate tRNA. The degree of adaptation of a gene is gauged by comparing its tAI value to the tAI values of all other genes in a genome. We also searched for genes whose involvement in GTA production and regulation is currently unsuspected by screening GTA-encoding genomes for genes with codon usage trends similar to those of GTA genes.
RESULTS
Codon usage bias of GTA genes and its match to available tRNAs varies across GTA genes and GTA-containing genomes.
To assess the presence of codon usage bias in GTA genes across alphaproteobacteria, we have calculated ENC values for each “reference GTA gene” (see Materials and Methods for the definition) and compared them against the expected ENC values for all genes in a genome under the null model of no codon usage bias, corrected for the genomic GC content (28). Indeed, we found that 1,543 out of 2,308 (66.8%) reference GTA genes detected across 208 GTA head-tail clusters deviate from the genome-specific null expectations by more than 10% (Fig. S1). However, there is a substantial variation in this deviation for different GTA genes (Fig. S2) and only in genes g5 and g8 is the deviation significantly higher than the genomic average (Kruskal-Wallis rank sum test, P < 2.2e−16; Dunn’s test, P < 0.05, Benjamini-Hochberg correction).
Distribution of deviations of the effective number of codon (ENC) values from the expected ENC values under the null model of no codon bias. The distribution contains deviations for 2,308 reference GTA genes found in the 208 genomes. Numbers on the plot designate the number of reference GTA genes in an interval delineated by dashed lines. Download FIG S1, PDF file, 0.10 MB (99.3KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Deviation of the effective number of codon (ENC) values for individual reference GTA genes in comparison to the genomic average. The deviation of the ENC from the expectation under the null model for each GTA gene was normalized by the average ENC deviation of its genome. A line within a box displays the median normalized ENC value for a GTA gene across all genomes. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within the 1.5× interquartile range. Dots outside whiskers are outliers. Download FIG S2, PDF file, 0.09 MB (89.1KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
To assess the match of the observed codon usage bias to the available tRNA pool, we calculated tAI values of the reference GTA genes across 208 genomes and converted them to percentile tAI values (ptAI; see Materials and Methods for the definition) to allow for the intergenomic comparisons. Similar to the ENC values, the ptAI values also vary substantially across the genes and genomes (Fig. 1), suggesting that the strength of selection for translational efficiency should be examined in individual GTA genes and in specific taxonomic groups, which we report in the next two sections.
FIG 1.
Distribution of ptAI values among reference GTA genes from GTA head-tail clusters in 208 alphaproteobacterial genomes. The line within a box displays the median ptAI value for a GTA gene across all genomes, in which the gene was detected. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within a 1.5× interquartile range. Dots outside whiskers are outliers.
Selection for translational efficiency is uneven among GTA genes.
The differences in ptAI values among the reference GTA genes are statistically significant (Kruskal-Wallis rank sum test, P < 2.2e−16) (Fig. 1). Particularly notable is a significant decline in ptAI values of the region encoding genes g12 through g15 (Dunn’s test, P < 0.05, Benjamini-Hochberg correction). These genes are located at the 3′ ends of the head-tail cluster and encode the tail components of a GTA’s particle. In contrast, ptAI values of the genes g5 (encoding major capsid protein) and g11 (encoding tail tape measure protein) are significantly higher than ptAI values of other GTA genes (Dunn’s test, P < 0.05, Benjamini-Hochberg correction). Notably, protein g5 is detected in the largest number of copies (145) per RcGTA particle than any other protein (29), while proteins g12 to g15 are present in a small number of copies (between 1 and 6) per RcGTA particle (29). Given that genes encoding proteins needed in a larger number of copies have a higher degree of adaptation to the tRNA pool (30), we hypothesize that the observed variation in ptAI values of GTA genes reflected the different numbers of GTA proteins in a GTA particle. Protein g11, however, is found in only 3 copies per RcGTA particle (29), and, therefore, a demand for a larger copy number cannot explain its high ptAI values.
Variation of ptAI values could also be due to the physical location of the genes in the GTA head-tail cluster. Similar to the operons (31), genes in the RcGTA head-tail cluster are co-transcribed from a single promoter upstream of the cluster (15, 32). Because genes at the 3′ end of operons tend to have lower expression levels (33), the low ptAI values of GTA genes g12-g15 may be due to their distant location from the promoter.
Selection for translational efficiency is the strongest in Sphingomonadales’ genomes.
In addition to variability in ptAI values across different GTA genes, ptAI values of individual GTA genes vary substantially across the 208 genomes (Fig. 1). To evaluate if these differences represent variation in selection pressure in distinct taxonomic groups, we initially examined the ptAI values of the g5 gene that were grouped by alphaproteobacterial order. The g5 gene was chosen due to its high abundance of the encoded protein in RcGTA particles (more copies than all other structural proteins combined) and for being the only gene with the highest detected deviations from the average genomic values for both ENC and ptAI. We found that the ptAI values of the g5 gene vary significantly among members of the four alphaproteobacterial orders (Kruskal-Wallis rank sum test, P < 0.05) (Fig. 2A). In particular, g5 genes from the Sphingomonadales’ genomes have significantly higher ptAI values than those from genomes of bacteria from other three orders (Mann-Whitney U test, P < 0.05, Benjamini-Hochberg correction). Twelve of the 14 g5 genes with the highest overall ptAI values (>90) (Fig. 2A) also belong to the Sphingomonadales genomes. Beyond the g5 gene, all reference GTA genes, as a group, have higher ptAI values in Sphingomonadales than in members of the three other alphaproteobacterial orders (Mann-Whitney U test, P < 0.05, Benjamini-Hochberg correction) (Fig. S3). These observations suggest that, in Sphingomonadales, there is a strong selection for efficient production of GTA particles. Because Sphingomonadales are known to live in nutrient-depleted environments (34), we suggest that GTA production is especially beneficial in those habitats to exert strong selection for translational efficiency.
FIG 2.
Distributions of (A) ptAI values in the major capsid protein-encoding gene (g5) and (B) carbon content of amino acids in the g5 protein across four orders of the class Alphaproteobacteria. (A and B) A line within a box displays the median ptAI value for g5 representatives within an order. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within the 1.5× interquartile range. Dots outside whiskers are outliers. The phylogenetic tree on the y-axis is the reference phylogenomic tree (see Materials and Methods for details), in which branches are collapsed at the taxonomic order level. The dashed line in (A) marks a ptAI value of 90.
Distributions of ptAI values in all reference GTA genes across four orders of the class Alphaproteobacteria. A line within a box displays the median ptAI value for a GTA gene across all genomes. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within the 1.5× interquartile range. Dots outside whiskers are outliers. Download FIG S3, PDF file, 0.10 MB (99.5KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Increased translational efficiency of GTA genes is associated with a reduced energetic cost to produce the encoded proteins.
Among the GTA proteins in four alphaproteobacterial orders, Sphingomonadales’ GTA proteins also have the strongest skew in amino acid composition toward energetically less expensive amino acids (Fig. 2B). To evaluate if selection for energy efficiency is linked to selection for translational efficiency, we examined the relationship between the ptAI values of GTA genes and the number of carbons in amino acid chains encoded by the Sphingomonadales GTA genes. We found that there is a significant negative correlation between them (Pearson R = −0.19, N = 636, P < 0.05). We propose that, in Sphingomonadales, benefits associated with the production of GTA particles in nutrient-limited conditions led not only to the selection for translational efficiency but also to the selection for use of energetically less expensive amino acids in the GTA genes.
Fourteen gene families have translational efficiency trends similar to those of GTA genes.
While the strength of selection for translational efficiency acting on GTA genes varies across genes and genomes, we found that the combinations of ptAI values across all reference GTA genes in a genome have similar trends across the 208 genomes. The similarity is significant in all pairwise reference GTA gene comparisons (Fig. S4), as determined using the phylogenetic generalized least-squares (PGLS) method. We conjecture that ptAI values of genes in the other loci of a GTA “genome,” as well as the host genes involved in the GTA life cycle, would exhibit similar trends to the ptAI values of reference GTA genes, allowing for the discovery of yet unsuspected genes involved in GTA life cycle. To identify such unknown genes that may be coexpressed with GTA genes, we examined correlations of ptAI values between reference GTA genes and 3,477 other gene families present in 208 alphaproteobacterial genomes. The PGLS analysis revealed 14 gene families, whose ptAI values correlate significantly with the ptAI values of the reference GTA genes (Table 1).
TABLE 1.
Functional annotations of 14 gene families, whose ptAI values have a significantly similar trend to ptAI values of the reference GTA genes
Gene | GenBank accession number of a representative protein | Functional annotation | COG category | COG functional category description |
---|---|---|---|---|
gafA | AYM61721.1 | DUF6456 domain-containing protein | K | Transcription |
addA | AYM63601.1 | Double-strand break repair helicase AddA | L | Replication, recombination, and repair |
addB | AYM63602.1 | Double-strand break repair protein AddB | L | Replication, recombination, and repair |
xseA | AYM63827.1 | Exodeoxyribonuclease VII large subunit | L | Replication, recombination, and repair |
dinG | ANY21229.1 | ATP-dependent DNA helicase | KL | Transcription; Replication, recombination, and repair |
hrpB | AYM63629.1 | ATP-dependent helicase HrpB | L | Replication, recombination, and repair |
priA | AYM63372.1 | Primosomal protein N′ | L | Replication, recombination, and repair |
glnE | AYM61788.1 | Bifunctional [glutamine synthetase] adenylyltransferase/[glutamine synthetase]-adenylyl-l-tyrosine phosphorylase | OT | Molecular chaperones and related functions; Signal transduction mechanism |
ccmE | AYM61781.1 | Cytochrome c maturation protein CcmE | O | Molecular chaperones and related functions |
ATP12 | AYM62266.1 | ATP12 family chaperone protein | O | Molecular chaperones and related functions |
tonB | APG61850.1 | Energy transducer TonB | M | Cell wall/membrane/envelope biogenesis |
TPR | AYM61439.1 | Tetratricopeptide repeat protein | M | Cell wall/membrane/envelope biogenesis |
smrA | AYM63588.1 | Smr/MutS family protein | S | Function unknown |
crtB | AYM62351.1 | Phytoene/Squalene synthase family protein | I | Lipid transport and metabolism |
PGLS model fit among ptAI values of the reference GTA gene pairs. Each pairwise comparison is represented by a rectangle that is color-coded according to the P values from the PGLS analysis of the reference GTA gene pairs. The numerical P values are listed within each rectangle. Download FIG S4, PDF file, 0.1 MB (145.6KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
One of 14 identified gene families is a homolog of gafA, which encode a crucial transcription activator of GTA particle production in Rhodobacter capsulatus (11, 15). This gene is located outside the RcGTA’s head-tail cluster and, therefore, was not included in the set of reference GTA genes. However, its discovery demonstrate the suitability of our approach to identifying genes linked to the GTA life cycle. Interestingly, gafA homologs were previously described only in the genomes of Rhodobacterales and some Rhizobiales (11, 12, 15). However, with different criteria in the OrthoFinder-based similarity searches, we were able to identify this regulator in 196 of the 208 genomes (94.2%), spanning all GTA-containing alphaproteobacterial orders. The evolutionary history of the gafA homologs is largely congruent with the reference phylogenomic tree (normalized quartet score of 0.87) and even more so with the phylogeny of the concatenated GTA reference genes (normalized quartet score of 0.92) (trees are available at https://doi.org/10.6084/m9.figshare.20082749), suggesting that the gafA gene had co-evolved with the GTA ‘head-tail’ cluster since the time of the last common ancestor of RcGTA-like GTAs.
The remaining 13 gene families belong to several functional categories of the clusters of orthologous groups (COG) (Table 1). While proteins encoded by some of these genes could be postulated to be involved in the GTA life cycle (exemplified below by the addAB, xseA, and tonB genes), the similarity of codon usage biases between other genes and reference GTA genes can be explained by their expression at similar environmental conditions (exemplified by three genes from ‘molecular chaperones and related functions’ COG category).
Protein products of the addA and addB genes form the heterodimeric helicase-nuclease complex that repairs double-stranded DNA breaks by homologous recombination and is functionally equivalent to the RecBCD complex (35). The knockout of the AddAB complex is associated with a deficiency in RecA-dependent homologous recombination (36). We hypothesize that the addAB pathway is involved in the recombination of GTAs’ genetic material with the host’s genome.
The main function of exodeoxyribonuclease VII large subunits (xseA), which is encoded by the xseA gene, is to form a complex with xseB and degrade single-stranded DNA to oligonucleotides. However, expression of the xseA gene without the xseB gene leads to cell death (37). Because we did not find any correlation of codon usage bias between the xseB gene and GTA genes, we speculate that the xseA gene product facilitates the lysis of GTA-producing cells and the release of GTA particles.
The tonB gene encodes the tonB energy transducer. TonB-dependent transporters are involved in the transport of diverse compounds, including carbohydrates, amino acids, lipids, vitamins, and iron (38–40). Similar to the quorum-sensing regulated expression of the gene encoding GTA receptor in the non-GTA-producing cells of a Rhodobacter capsulatus population (4), the tonB gene could also be regulated to be expressed in the non-GTA-producing cells to aid the uptake of the nutrients released from the lysed cells via TonB-dependent transporters. The tonB gene was detected only in members of the Sphingomonadales order, suggesting that such nutrient uptake is most relevant in nutrient-limited environments.
Three genes from the ‘molecular chaperones and related functions’ COG category are less likely to be directly involved in the GTA life cycle because GTAs already encode their chaperones that assist GTA protein folding (29). However, it is well known that chaperones tend to be highly expressed in bacteria at times of stress and facilitate the survival of cells in rapidly changing environmental conditions (41). Because chaperones are essential in responding to starvation-induced cellular stresses (42), we conjecture that the observed similarity in ptAI values of the reference GTA genes is due to their expression being triggered by similar environmental conditions.
To evaluate if the detected gene families interacted with each other and with GTA genes, we constructed the protein-protein interaction network of the 14 gene families, 12 GTA reference genes, and 50 additional interactor proteins from the STRING database (Fig. 3). Thirteen of the 14 families and all 12 reference GTA genes belong to two protein-protein interaction subnetworks (Fig. 3), one of which contains all GTA reference genes, while the other is involved in a wide range of functions (Table 1). By carrying out the KEGG enrichment analysis, we found a significant overrepresentation of four molecular pathways in the second protein-protein interaction network (Table S1). Consistent with the 6 of the 13 gene families being assigned to the “replication, recombination, and repair” COG category, two of the KEGG pathways are ‘homologous recombination’ and ‘mismatch repair’, further corroborating the involvement of identified genes in the integration of the genetic material delivered by GTAs into recipients’ genomes. Two additional pathways, ‘carotenoid biosynthesis’ and ‘terpenoid backbone biosynthesis’, are less likely to be directly involved in the life cycle of GTAs. Production of secondary metabolites is known to be protective against stress factors (43, 44), and carbon starvation leads to the upregulation of the carotenoid biosynthesis pathway (45, 46). Similar to the above-described genes encoding chaperones, we hypothesize that the expression of ‘carotenoid biosynthesis’ and ‘terpenoid backbone biosynthesis’ genes is not related to the GTA life cycle but is initiated by conditions that also activates the production of GTAs.
FIG 3.
The protein-protein interactions among 12 GTA reference proteins and 14 proteins putatively coexpressed with GTAs. Nodes represent individual proteins. Blue-colored nodes correspond to GTA reference proteins and red-colored nodes correspond to 14 putatively coexpressed proteins. Gray-colored nodes represent additional proteins found through the STRING functional enrichment analysis. The thickness of the edges is proportional to the STRING’s confidence score of protein interactions (varying between 0.4 [thin line] and 1.0 [thick line]).
Four molecular pathways were significantly overrepresented in the protein-protein interaction network shown in Fig. 3. The pathway information was obtained from the KEGG database. Download Table S1, PDF file, 0.07 MB (76.2KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Replacement of the head completion protein in Sphingomonadales’ GTAs.
Gene content of GTA head-tail clusters varies across alphaproteobacteria (12). While some clusters do not contain homologs of all RcGTA genes, others include additional genes that are conserved across multiple clusters but have no known function (12, 13). To predict whether any of these additional genes play a role in GTA production, we compared the ptAI values of genes found in at least 10 genomes to the ptAI values of the reference GTA genes. One gene family, which is found only within GTA head-tail clusters of 11 genomes in one subclade of Sphingomonadales (GenBank accessions are available at https://doi.org/10.6084/m9.figshare.20082749), has a significant positive correlation with 5 out of the 12 GTA reference genes (Table S2). Interestingly, within Sphingomonadales GTA head-tail clusters this gene is located where the g7 gene, which encodes a head completion protein, is found in the RcGTA head-tail cluster (Fig. S5). Only seven of the 55 Sphingomonadales genomes in our data set had detectable homologs of the g7 gene. Among the remaining 48 genomes, 22 contain a gene encoding a protein of unknown function in the “gene g7 locus,” while 26 genomes do not have any gene in that locus.
Significance and slope of the fit of the phylogenetic generalized least squares (PGLS) models between the reference GTA genes and putative head-completion protein in Sphingomonadales. Statistically significant associations (P < 0.05) are highlighted in orange. Download Table S2, PDF file, 0.09 MB (94.4KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Gene neighborhood of the RcGTA gene g7 and the putative g7 replacement gene in 11 Sphingomonadales spp. The only region corresponding to the RcGTA ‘head-tail’ cluster is depicted, with each gene represented by an arrow scaled relative to its length within each cluster. The RcGTA genes (g1 to g15) are color-coded and their homologs in Sphingomonadales are shown in the same color. Putative g7 replacements in Sphingomonadales are shown in magenta and marked with an arrow. Pseudogenes are colored in black, while genes without an established relationship to GTA production are shown in gray. A phylogenetic tree is a subtree extracted from the reference phylogeny. The scale bar corresponds to the number of substitutions per site. Download FIG S5, PDF file, 0.1 MB (136.6KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
The members of the identified gene family are substantially shorter than the RcGTA gene g7 and have a different secondary structure (Fig. 4), precluding the possibility that the identified protein is simply too divergent for a detectable amino acid similarity. However, we found viral head completion proteins that have similar protein length and similar secondary structures to both GTA head completion protein and the identified gene family (Fig. 4). We conjecture that the gene encoding the head completion protein was replaced in some Sphingomonadales by a gene encoding an analogous viral protein.
FIG 4.
Secondary structures of head completion proteins from phages and GTAs. The Enterobacteria phage lambda gpW and Bacillus phage SPP1 (highlighted in red) are two representatives of viral head completion proteins with major differences in lengths and secondary structures. The secondary structures of R. capsulatus (PDB ID 6TUI_8), Enterobacteria phage lambda gpW (1HYW), and Bacillus phage SPP1 (2KCA) proteins were retrieved from the PDB database. The secondary structures of the putative head completion proteins from Sphingomonadales were predicted computationally. The secondary structures are scaled with respect to the protein lengths, which are listed in parentheses next to the taxonomic names.
DISCUSSION
Our analyses of codon usage biases suggest that alphaproteobacterial GTA systems are under selection for an optimal translation of GTA proteins from GTA genes achieved by using codons with more readily available tRNAs. The strength of such selection for translational efficiency is the most pronounced (and therefore most easily detectable) in the major capsid protein gene, which needs to be expressed to produce thousands of copies per GTA-producing bacterium. Additionally, the strength of the selection for translational efficiency varies across taxonomic groups but is particularly prominent in the Sphingomonadales order, whose members typically inhabit nutrient-limited conditions. We hypothesize that the observed variation in the selection strength depends on the severity and duration of the nutrient scarcity experienced by a population capable of producing GTAs. On the one hand, long-term exposure to nutrient-depleted conditions would trigger a more efficient and/or more frequent production of GTA particles, which would lead to greater survival of communities with better translational efficiency of GTA systems and, thus, a higher codon usage bias in the GTA genes. On the other hand, if GTAs are needed only on rare occasions due to the stable and abundant nutrient supplies, the selection for translational efficiency would be weak and would result in a lower codon usage bias. Combined with an observation that the production of GTAs is triggered by nutritional stress (16), our findings that the selection is the strongest in alphaproteobacteria that inhabit nutrient-limited environments further underscore the earlier hypothesized importance of GTA systems in situations of nutrient scarcity (9).
Additionally, the stronger selection for translation efficiency in GTA genes is associated with a larger decline in the carbon content of the encoded proteins. These findings suggest that the benefits associated with GTA production are substantial enough to drive selection for both translational efficiency and low energetic costs of the translated proteins. We speculate that these modifications of GTA proteins allow the bacterial population under adverse conditions to increase both the speed of GTA particle production and the number of released GTA particles.
We hypothesize that genes that are located outside the GTA head-tail cluster but are involved in the GTA life cycle, including processing and integration of the GTA-delivered DNA, would have signatures of selection for translational efficiency similar to those of GTA genes. Gratifyingly, our genome-wide screen for such patterns detected the direct GTA activator gene, gafA (15). We also identified multiple genes not yet implicated in the GTA life cycle. Several of these genes are involved in recombination and mismatch repair, providing bioinformatic evidence for the hypothesis that GTAs facilitate HGT by distributing genetic fragments that become incorporated into recipients’ genomes via homologous recombination (4). The involvement of other genes with similar selection pressures in the GTA life cycle is speculative and needs to be investigated experimentally. But the putative coexpression of xseA and tonB genes with GTA genes raises an intriguing possibility that, in addition to HGT, GTA production may provide an extra benefit in a nutrient-depleted environment: scavenging of scarce organic matter from GTA-producing cells. The lysis of the GTA-producing cells could be mediated by XseA and the cellular debris could be imported as nutrients by the surviving cells via TonB-dependent transporters.
Alphaproteobacterial GTAs likely originated millions of years ago from a lysogenic phage, and they were mostly vertically inherited by many alphaproteobacterial lineages (12, 14). However, similar to the HGT influence on many other regions of a typical bacterial genome (47), it is very likely that over time GTAs experienced gene replacements via HGT (12). Instances of HGT between GTAs and phages have been already documented (11, 48). By examining the patterns of selection for translational efficiency, we identified another case of gene exchange with viruses that resulted in the replacement of the gene encoding head completion protein in some GTAs of Sphingomonadales. Curiously, the gene currently has no significant primary sequence similarity to any gene in GenBank. Many other unannotated ORFs in alphaproteobacterial head-tail clusters outside Rhodobacterales (12) may also have functional roles in their respective GTA regions. Notably, when alphaproteobacterial RcGTA-like genomic regions appear incomplete due to the lack of homologs to genes required for GTA production in R. capsulatus, it could be due to our inability to recognize some genes due to their replacements with analogous genes. Because such incomplete RcGTA-like clusters are abundant in alphaproteobacteria (12), GTAs could be morphologically diverse and even more widespread across alphaproteobacteria than we currently estimate (13).
MATERIALS AND METHODS
Data set of representative alphaproteobacterial genomes with GTA head-tail clusters.
As an initial data set, we selected 212 representative alphaproteobacterial genomes previously predicted to contain GTAs (9). The gene annotations of the genomes were downloaded from the RefSeq database (49) in October 2020. GTA head-tail clusters (1) were predicted using the GTA-Hunter program (13). Because GTA-Hunter identified only 11 out of the 17 genes in the RcGTA’s head-tail cluster and requires genes to align with their RcGTA homologs by at least 60% of their length, some GTA genes were likely missed by GTA-Hunter. To look for these potential false negatives, additional BLASTP (50) searches with the E value cutoff of 0.1 were performed using 17 RcGTA head-tail cluster genes as queries and protein-coding genes in 212 genomes as a database. Only matches located within the genomic regions designated as GTA gene clusters by GTA-Hunter were kept. In four genomes, calculations of genes’ adaptation to the tRNA pool (see “Evaluation of the adaptiveness of protein-coding genes to the tRNA pool” below for details) did not converge. As a result, only 208 genomes were retained in the reported analyses (GenBank accessions are available at https://doi.org/10.6084/m9.figshare.20082749).
Identification of gene families in 208 alphaproteobacterial genomes.
Within each genome, protein-coding genes of less than 300 nucleotides in length were excluded to reduce the stochasticity of codon usage bias values due to the insufficient number of codons. The remaining protein-coding genes were clustered into gene families using OrthoFinder v2.4 (51) with default parameters and DIAMOND (52) for the amino acid sequence similarity search. Only gene families detected in at least 40 genomes were retained to ensure statistical power.
Some alphaproteobacterial GTA head-tail cluster regions contain protein-coding ORFs that do not have significant similarity to the RcGTA homologs of the genes shown to be required for GTA production in RcGTA. Gene families of these ORFs were retrieved from the collection of gene families predicted for all protein-coding genes (regardless of their length) using OrthoFinder v2.4 (51) with default parameters and DIAMOND (52) for the amino acid sequence similarity search. Only gene families that are both located within the genomic region encoding GTA head-tail cluster and found in at least 10 genomes were retained.
Reference set of GTA genes.
Although RcGTA head-tail cluster contains 17 genes, genes g3.5 and g10.1 are less than 300 nucleotides in length, and genes g1 and g7 are not detected widely across analyzed genomes. Additionally, codon usage patterns of gene g9 were found to be different from other GTA genes (see “Examination of similarity in adaptation to the tRNA pool among GTA genes” for details). Therefore, in our inferences about selection, we considered only 12 of the 17 GTA genes (Table S3), which we designate throughout the manuscript as “GTA reference genes.”
Decisions behind choosing the RcGTA 'head-tail' cluster homologs for the reference GTA gene set. The GTA genes selected for the reference set are highlighted in orange. Functional annotations are based on the descriptions in the RefSeq database records unless noted otherwise. Download Table S3, PDF file, 0.1 MB (141.8KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Amino acid sequences of GTA reference genes were aligned individually using MAFFT-linsi v7.455 (53) and then concatenated into a single alignment. Each gene was treated as a separate partition in the alignment and the best substitution model for each gene was determined by ModelFinder (54). The maximum likelihood tree was reconstructed using IQ-TREE v1.6.7 (55) and the support values were calculated via 1,000 ultrafast bootstrap replicates (56).
Reconstruction of the reference phylogenomic tree.
Twenty-nine marker proteins that are present in a single copy in more than 95% of the 208 retained genomes were retrieved using AMPHORA2 (57). Amino acid sequences within each of the 29 marker families were aligned using MAFFT-linsi v7.455 (53). The best substitution matrix for each family was determined by ProteinModelSelection.pl script downloaded from https://github.com/stamatak/standard-RAxML/tree/master/usefulScripts in October 2020. Individual alignments of the marker families were concatenated, but each alignment was treated as a separate partition with its own best substitution model in the subsequent phylogenetic reconstruction. The maximum likelihood tree was reconstructed using IQ-TREE v1.6.7 (55), and the support values were calculated using 1,000 ultrafast bootstrap replicates (56).
Evaluation of codon usage bias in protein-coding genes using an “effective number of codons” metric.
For the retained genes in each genome, an effective number of codons (ENC) (20) and G+C content variation at the 3rd codon position in the synonymous sites (GC3s) were calculated using CodonW (http://codonw.sourceforge.net). The null model of no codon usage bias was calculated as described in dos Reis et al. (28) using an in-house script (available in the FigShare repository at https://doi.org/10.6084/m9.figshare.20082749). For every gene, the deviation of its ENC from the null model was calculated using the in-house script. Genes that have observed ENC higher than expected were excluded from analyses.
Evaluation of the adaptiveness of protein-coding genes to the tRNA pool.
The tRNA genes in each genome were predicted using tRNAscan-SE v 2.06, using a model trained on bacterial genomes (58, 59) and the Infernal mode without HMM filter to improve the sensitivity of the search (60). tRNA gene copy number was used as the proxy for tRNA abundance, following the previously reported observation that the two correlate strongly (28, 61). The adaptiveness of each codon (ωi) to the tRNA pool was calculated using the stAIcalc program with the maximum hill climbing stringency (62). The tRNA adaptation index (tAI) of each retained gene was calculated as the geometric mean of its ωi values (27). Because the distribution of tAI values varies among genomes (63) (Fig. S6), tAI values were converted to their relative percentile tRNA adaptation index (ptAI) within a genome. The ptAI value of a gene ranges between 0 and 100 and represents the percentage of analyzed genes in a genome that have a smaller tAI than tAI of that gene.
Distribution of tAI values in protein-coding genes of the analyzed genomes. Only genes at least 300 nucleotides in length were included. (A) Distribution of tAI values of genes in three representative alphaproteobacterial genomes selected to have the lowest, the median, and the highest mean tAI value among 208 genomes. (B) Distribution of the average genomic tAI values across 208 genomes. Download FIG S6, PDF file, 0.1 MB (139.4KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Examination of similarity in adaptation to the tRNA pool among GTA genes.
The ptAI values were retrieved for a subset of 13 GTA genes that are at least 300 nucleotides in length and are widely detected across all taxonomic groups. The linear regression analysis of ptAI values between all GTA gene pairs was conducted using the phylogenetic generalized least-squares method (PGLS) (64). The reference phylogenomic tree was used to correct the shared evolutionary history. The analysis was done using the ‘caper’ package (65) and λ, δ and κ parameters were estimated using the maximum likelihood function. Because the ptAI values of the g9 gene were not significantly correlated with the ptAI values of 8 out of the 12 other examined GTA genes at a P value cutoff of 0.001 (Table S4), the g9 gene was not included in the reference set of GTA genes.
Significance and slope of the fit of the phylogenetic generalized least squares (PGLS) models between the reference GTA genes and the g9 gene. Download Table S4, PDF file, 0.09 MB (93.2KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Identification of genes with ptAI values similar to that of the GTA genes.
For each gene family, the “within-genome” ptAI values were retrieved. For gene families with at least two paralogs, the ptAI values for all paralogs from a particular genome were replaced with their median ptAI value.
To identify gene families that exhibit tRNA pool adaptation patterns similar to those of GTA genes, a linear regression model of ptAI values between these gene families and reference GTA genes was fit using the PGLS (64). The PGLS analysis was carried out using the ‘caper’ package (65) and λ, δ, and κ parameters were estimated using the maximum likelihood function. The reference phylogenomic tree was used to correct the shared phylogenetic history. For gene families found in at least 40 genomes, a gene family was designated to be associated with a GTA if the obtained fit of the model was statistically significant across all reference GTA genes. Because gene families found in less than 40 genomes were kept only if the genes are located within the genomic regions encoding GTA head-tail clusters (see the “Identification of gene families in 208 alphaproteobacterial genomes”), a more relaxed criterion was adopted for such gene families. A gene family was designated to be associated with a GTA if the fit of the model was statistically significant across at least 40% of reference GTA genes. If a significantly associated gene family contained paralogs, the PGLS analysis was repeated by using individual ptAI values across all possible combinations of paralogs (if the total number of combinations was <1,000) or across random 1,000 combinations of paralogs (if the total number of combinations was >1,000). This was carried out to ensure that the detected signal was not due to sampling associated with selecting the median ptAI value.
Genes with significant similarity in their ptAI trends were annotated via eggNOG-mapper v2.1 (66).
Protein-protein interaction of GTA genes and gene families with similar ptAI values.
To identify protein-protein interaction networks, reference GTA genes and genes from families with similar tRNA pool adaptation patterns were retrieved from the Sphingomonas sp. MM1 genome, chosen because it is the only genome that contains all genes from the GTA reference gene set and all 14 gene families listed in Table 1. The locus tags of the retrieved Sphingomonas sp. MM1 genes were used as queries against STRING database v 11.0b (last accessed July 2021) (67) with the medium confidence score cutoff and all active interaction sources. The retrieved protein-protein interaction network was visualized in STRING using the queries and up to 50 additional interactor proteins and displaying edges based on the STRING confidence scores. The KEGG pathways (68) enrichment analysis was conducted via hypergeometric testing on the whole retrieved network, as implemented in STRING.
Analysis of other protein-coding genes situated within GTA head-tail clusters.
For gene families within GTA head-tail clusters, ptAI values were retrieved and compared to ptAI values of the reference GTA gene set using PGLS analysis as described above. For the only gene family with a significant association with GTA genes, the secondary structure of its proteins was predicted using Porter v5.0 (69). To retrieve available viral head completion proteins, the phrase ‘head-completion protein’ was used as a query against the UniProt database (accessed in August 2021) (70). Among the 24 manually annotated (“reviewed”) matches from the Swiss-Prot subdatabase of the UniProt database, only 2 viral matches (accession numbers P68656 and P68660) had lengths similar to the genes in the above-described gene family. Both proteins belong to the λ phage gpW family and for Escherichia phage λ protein 3D structure is available in PDB (71). The secondary structure of RcGTA’s g7 protein, a structural viral homolog of RcGTA’s g7 from Bacillus phage SPP1 (gp16) (29) and head completion protein of phage λ were retrieved from the PDB database (71) in August 2021.
In 48 Sphingomonadales genomes without a detectable homolog of RcGTA gene g7, the genomic space either between the homologs of the RcGTA genes g6 and g8, or, in genomes without g6 homolog, between homologs of the RcGTA genes g5 and g8, was searched for the presence of open reading frames.
Refinement of the tonB gene family using a phylogenetic tree.
To identify orthologs within the large tonB gene family, the evolutionary history of the tonB gene family was reconstructed and evaluated. To do so, amino acid sequences of the tonB gene family were aligned using MAFFT-linsi v7.455 (53). The phylogeny was reconstructed in IQ-TREE v1.6.7 (55) using the best substitution model (LG+F+R6) detected by ModelFinder (54). The tree was visualized using the iTOL v6 (72). The phylogeny was used to subdivide the family into two families, whereas the five genes on very long branches served as an outgroup (tree is available at https://doi.org/10.6084/m9.figshare.20082749).
Calculation of energetic cost associated with the production of the encoded proteins.
To quantify the energetic cost of proteins, the carbon content of their amino acids was used as a proxy and was calculated by counting the number of carbons in the amino acid side chains, as described in Kogay et al. (9). The total number of carbons in each protein was normalized by the protein length.
Retrieval and phylogenetic analyses of gafA homologs.
Amino acid sequences of gafA homologs in 196 alphaproteobacterial genomes were detected via OrthoFinder (gene family OG0001218). Only homologs found in a single copy in a genome (194 in total) were retained for phylogenetic analysis. These homologs were aligned using MAFFT-linsi v7.455 (53). The best substitution model (LG+F+R6) was determined by ModelFinder (54) and the maximum-likelihood tree was reconstructed by IQ-TREE v1.6.7 (55) with the number of iterations to stop set to 500. The support values were calculated using 1,000 ultrafast bootstrap replicates (56). Both the GTA reference tree and reference phylogenomic tree were pruned to match the taxa in gafA phylogeny. The normalized quartet scores were calculated using ASTRAL v5.7.8 (73).
Data availability.
The following data are available in the FigShare repository at https://doi.org/10.6084/m9.figshare.20082749: accession numbers of 208 analyzed alphaproteobacterial genomes; accession numbers of the GTA regions identified in the analyzed genomes; accession numbers of genes in gene families across analyzed genomes; raw data related to tAI and ENC calculations; an in-house script for ENC calculations; slopes and P values of associations detected in PGLS analyses; accession numbers of the putative g7 proteins in Sphingomonadales genomes; multiple sequence alignments and phylogenetic trees of tonB and gafA gene families, concatenated phylogenomic markers, and concatenated GTA reference genes.
ACKNOWLEDGMENTS
This work was supported in part by the Simons Foundation Investigators in the Mathematical Modeling of Living Systems program (award number 327936 to O.Z.).
We declare no competing interests.
Contributor Information
Olga Zhaxybayeva, Email: olga.zhaxybayeva@dartmouth.edu.
Rachel Poretsky, University of Illinois at Chicago.
REFERENCES
- 1.Lang AS, Westbye AB, Beatty JT. 2017. The distribution, evolution, and roles of gene transfer agents in prokaryotic genetic exchange. Annu Rev Virol 4:87–104. doi: 10.1146/annurev-virology-101416-041624. [DOI] [PubMed] [Google Scholar]
- 2.Marrs B. 1974. Genetic recombination in Rhodopseudomonas capsulata. Proc Natl Acad Sci USA 71:971–973. doi: 10.1073/pnas.71.3.971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Lang AS, Zhaxybayeva O, Beatty JT. 2012. Gene transfer agents: phage-like elements of genetic exchange. Nat Rev Microbiol 10:472–482. doi: 10.1038/nrmicro2802. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Brimacombe CA, Ding H, Beatty JT. 2014. Rhodobacter capsulatus DprA is essential for RecA-mediated gene transfer agent (RcGTA) recipient capability regulated by quorum-sensing and the CtrA response regulator. Mol Microbiol 92:1260–1278. doi: 10.1111/mmi.12628. [DOI] [PubMed] [Google Scholar]
- 5.McDaniel LD, Young E, Delaney J, Ruhnau F, Ritchie KB, Paul JH. 2010. High frequency of horizontal gene transfer in the oceans. Science 330:50. doi: 10.1126/science.1192243. [DOI] [PubMed] [Google Scholar]
- 6.Brimacombe CA, Ding H, Johnson JA, Beatty JT. 2015. Homologues of genetic transformation DNA import genes are required for Rhodobacter capsulatus gene transfer agent recipient capability regulated by the response regulator CtrA. J Bacteriol 197:2653–2663. doi: 10.1128/JB.00332-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Quebatte M, Christen M, Harms A, Korner J, Christen B, Dehio C. 2017. Gene transfer agent promotes evolvability within the fittest subpopulation of a bacterial pathogen. Cell Syst 4:611–621.e6. doi: 10.1016/j.cels.2017.05.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Marrs B, Wall JD, Gest H. 1977. Emergence of the biochemical genetics and molecular biology of photosynthetic bacteria. Trends Biochem Sci 2:105–108. doi: 10.1016/0968-0004(77)90173-6. [DOI] [Google Scholar]
- 9.Kogay R, Wolf YI, Koonin EV, Zhaxybayeva O. 2020. Selection for reducing energy cost of protein production drives the GC content and amino acid composition bias in gene transfer agents. mBio 11:e01206-20. doi: 10.1128/mBio.03139-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kogay R, Koppenhofer S, Beatty JT, Lang AS, Kuhn JH, Zhaxybayeva O. 2022. Formal recognition and classification of gene transfer agents as viriforms. Virus Evolution. doi: 10.1093/ve/veac100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Hynes AP, Shakya M, Mercer RG, Grull MP, Bown L, Davidson F, Steffen E, Matchem H, Peach ME, Berger T, Grebe K, Zhaxybayeva O, Lang AS. 2016. Functional and evolutionary characterization of a gene transfer agent's multilocus genome. Mol Biol Evol 33:2530–2543. doi: 10.1093/molbev/msw125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Shakya M, Soucy SM, Zhaxybayeva O. 2017. Insights into origin and evolution of alpha-proteobacterial gene transfer agents. Virus Evol 3:vex036. doi: 10.1093/ve/vex036. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Kogay R, Neely TB, Birnbaum DP, Hankel CR, Shakya M, Zhaxybayeva O. 2019. Machine-learning classification suggests that many alphaproteobacterial prophages may instead be gene transfer agents. Genome Biol Evol 11:2941–2953. doi: 10.1093/gbe/evz206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Lang AS, Beatty JT. 2007. Importance of widespread gene transfer agent genes in alpha-proteobacteria. Trends Microbiol 15:54–62. doi: 10.1016/j.tim.2006.12.001. [DOI] [PubMed] [Google Scholar]
- 15.Fogg PCM. 2019. Identification and characterization of a direct activator of a gene transfer agent. Nat Commun 10:595. doi: 10.1038/s41467-019-08526-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Westbye AB, O'Neill Z, Schellenberg-Beaver T, Beatty JT. 2017. The Rhodobacter capsulatus gene transfer agent is induced by nutrient depletion and the RNAP omega subunit. Microbiology (Reading) 163:1355–1363. doi: 10.1099/mic.0.000519. [DOI] [PubMed] [Google Scholar]
- 17.Fogg PC, Westbye AB, Beatty JT. 2012. One for all or all for one: heterogeneous expression and host cell lysis are key to gene transfer agent activity in Rhodobacter capsulatus. PLoS One 7:e43772. doi: 10.1371/journal.pone.0043772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Hynes AP, Mercer RG, Watton DE, Buckley CB, Lang AS. 2012. DNA packaging bias and differential expression of gene transfer agent genes within a population during production and release of the Rhodobacter capsulatus gene transfer agent, RcGTA. Mol Microbiol 85:314–325. doi: 10.1111/j.1365-2958.2012.08113.x. [DOI] [PubMed] [Google Scholar]
- 19.Roller M, Lucic V, Nagy I, Perica T, Vlahovicek K. 2013. Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res 41:8842–8852. doi: 10.1093/nar/gkt673. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wright F. 1990. The 'effective number of codons' used in a gene. Gene 87:23–29. doi: 10.1016/0378-1119(90)90491-9. [DOI] [PubMed] [Google Scholar]
- 21.Rocha EP. 2004. Codon usage bias from tRNA's point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Res 14:2279–2286. doi: 10.1101/gr.2896904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Quax TE, Claassens NJ, Soll D, van der Oost J. 2015. Codon bias as a means to fine-tune gene expression. Mol Cell 59:149–161. doi: 10.1016/j.molcel.2015.05.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Zhou Z, Dang Y, Zhou M, Li L, Yu CH, Fu J, Chen S, Liu Y. 2016. Codon usage is an important determinant of gene expression levels largely through its effects on transcription. Proc Natl Acad Sci USA 113:E6117–E6125. doi: 10.1073/pnas.1606724113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Supek F, Skunca N, Repar J, Vlahovicek K, Smuc T. 2010. Translational selection is ubiquitous in prokaryotes. PLoS Genet 6:e1001004. doi: 10.1371/journal.pgen.1001004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.LaBella AL, Opulente DA, Steenwyk JL, Hittinger CT, Rokas A. 2021. Signatures of optimal codon usage in metabolic genes inform budding yeast ecology. PLoS Biol 19:e3001185. doi: 10.1371/journal.pbio.3001185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Fraser HB, Hirsh AE, Wall DP, Eisen MB. 2004. Coevolution of gene expression among interacting proteins. Proc Natl Acad Sci USA 101:9033–9038. doi: 10.1073/pnas.0402591101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.dos Reis M, Wernisch L, Savva R. 2003. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. Nucleic Acids Res 31:6976–6985. doi: 10.1093/nar/gkg897. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.dos Reis M, Savva R, Wernisch L. 2004. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res 32:5036–5044. doi: 10.1093/nar/gkh834. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Bárdy P, Füzik T, Hrebík D, Pantůček R, Thomas Beatty J, Plevka P. 2020. Structure and mechanism of DNA delivery of a gene transfer agent. Nat Commun 11:3034. doi: 10.1038/s41467-020-16669-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Plotkin JB, Kudla G. 2011. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet 12:32–42. doi: 10.1038/nrg2899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Lim HN, Lee Y, Hussein R. 2011. Fundamental relationship between operon organization and gene expression. Proc Natl Acad Sci USA 108:10626–10631. doi: 10.1073/pnas.1105692108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Lang AS, Beatty JT. 2000. Genetic analysis of a bacterial genetic exchange element: the gene transfer agent of Rhodobacter capsulatus. Proc Natl Acad Sci USA 97:859–864. doi: 10.1073/pnas.97.2.859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Nishizaki T, Tsuge K, Itaya M, Doi N, Yanagawa H. 2007. Metabolic engineering of carotenoid biosynthesis in Escherichia coli by ordered gene assembly in Bacillus subtilis. Appl Environ Microbiol 73:1355–1361. doi: 10.1128/AEM.02268-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Balkwill DL, Fredrickson JK, Romine MF. 2006. Sphingomonas and related genera, p 605–629. In Dworkin M, Falkow S, Rosenberg E, Schleifer K-H, Stackebrandt E (ed), The Prokaryotes: Volume 7: Proteobacteria: Delta, Epsilon Subclass. Springer New York, New York, NY. [Google Scholar]
- 35.Kooistra J, Haijema BJ, Venema G. 1993. The Bacillus subtilis addAB genes are fully functional in Escherichia coli. Mol Microbiol 7:915–923. doi: 10.1111/j.1365-2958.1993.tb01182.x. [DOI] [PubMed] [Google Scholar]
- 36.Marsin S, Lopes A, Mathieu A, Dizet E, Orillard E, Guerois R, Radicella JP. 2010. Genetic dissection of Helicobacter pylori AddAB role in homologous recombination. FEMS Microbiol Lett 311:44–50. doi: 10.1111/j.1574-6968.2010.02077.x. [DOI] [PubMed] [Google Scholar]
- 37.Jung H, Liang J, Jung Y, Lim D. 2015. Characterization of cell death in Escherichia coli mediated by XseA, a large subunit of exonuclease VII. J Microbiol 53:820–828. doi: 10.1007/s12275-015-5304-0. [DOI] [PubMed] [Google Scholar]
- 38.Tang K, Jiao N, Liu K, Zhang Y, Li S. 2012. Distribution and functions of TonB-dependent transporters in marine bacteria and environments: implications for dissolved organic matter utilization. PLoS One 7:e41204. doi: 10.1371/journal.pone.0041204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Eisenbeis S, Lohmiller S, Valdebenito M, Leicht S, Braun V. 2008. NagA-dependent uptake of N-acetyl-glucosamine and N-acetyl-chitin oligosaccharides across the outer membrane of Caulobacter crescentus. J Bacteriol 190:5230–5238. doi: 10.1128/JB.00194-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Blanvillain S, Meyer D, Boulanger A, Lautier M, Guynet C, Denance N, Vasse J, Lauber E, Arlat M. 2007. Plant carbohydrate scavenging through tonB-dependent receptors: a feature shared by phytopathogenic and aquatic bacteria. PLoS One 2:e224. doi: 10.1371/journal.pone.0000224. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Genest O, Wickner S, Doyle SM. 2019. Hsp90 and Hsp70 chaperones: collaborators in protein remodeling. J Biol Chem 294:2109–2120. doi: 10.1074/jbc.REV118.002806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Rockabrand D, Livers K, Austin T, Kaiser R, Jensen D, Burgess R, Blum P. 1998. Roles of DnaK and RpoS in starvation-induced thermotolerance of Escherichia coli. J Bacteriol 180:846–854. doi: 10.1128/JB.180.4.846-854.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Gershenzon J, Dudareva N. 2007. The function of terpene natural products in the natural world. Nat Chem Biol 3:408–414. doi: 10.1038/nchembio.2007.5. [DOI] [PubMed] [Google Scholar]
- 44.Tyc O, Song C, Dickschat JS, Vos M, Garbeva P. 2017. The ecological role of volatile and soluble secondary metabolites produced by soil bacteria. Trends Microbiol 25:280–292. doi: 10.1016/j.tim.2016.12.002. [DOI] [PubMed] [Google Scholar]
- 45.Yang Y, Liu B, Du X, Li P, Liang B, Cheng X, Du L, Huang D, Wang L, Wang S. 2015. Complete genome sequence and transcriptomics analyses reveal pigment biosynthesis and regulatory mechanisms in an industrial strain, Monascus purpureus YY-1. Sci Rep 5:8331. doi: 10.1038/srep08331. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Ram S, Mitra M, Shah F, Tirkey SR, Mishra S. 2020. Bacteria as an alternate biofactory for carotenoid production: a review of its applications, opportunities and challenges. J Functional Foods 67:103867. doi: 10.1016/j.jff.2020.103867. [DOI] [Google Scholar]
- 47.Soucy SM, Huang J, Gogarten JP. 2015. Horizontal gene transfer: building the web of life. Nat Rev Genet 16:472–482. doi: 10.1038/nrg3962. [DOI] [PubMed] [Google Scholar]
- 48.Zhan Y, Huang S, Voget S, Simon M, Chen F. 2016. A novel roseobacter phage possesses features of podoviruses, siphoviruses, prophages and gene transfer agents. Sci Rep 6:30372. doi: 10.1038/srep30372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–D745. doi: 10.1093/nar/gkv1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Emms DM, Kelly S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20:238. doi: 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60. doi: 10.1038/nmeth.3176. [DOI] [PubMed] [Google Scholar]
- 53.Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods 14:587–589. doi: 10.1038/nmeth.4285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2: improving the ultrafast bootstrap approximation. Mol Biol Evol 35:518–522. doi: 10.1093/molbev/msx281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Wu M, Scott AJ. 2012. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics 28:1033–1034. doi: 10.1093/bioinformatics/bts079. [DOI] [PubMed] [Google Scholar]
- 58.Lowe TM, Eddy SR. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955–964. doi: 10.1093/nar/25.5.955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Chan PP, Lin BY, Mak AJ, Lowe TM. 2021. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res 49:9077–9096. doi: 10.1093/nar/gkab688. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Duret L. 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet 16:287–289. doi: 10.1016/S0168-9525(00)02041-2. [DOI] [PubMed] [Google Scholar]
- 62.Sabi R, Volvovitch Daniel R, Tuller T. 2017. stAIcalc: tRNA adaptation index calculator based on species-specific weights. Bioinformatics 33:589–591. doi: 10.1093/bioinformatics/btw647. [DOI] [PubMed] [Google Scholar]
- 63.LaBella AL, Opulente DA, Steenwyk JL, Hittinger CT, Rokas A. 2019. Variation and selection on codon usage bias across an entire subphylum. PLoS Genet 15:e1008304. doi: 10.1371/journal.pgen.1008304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Martins EP, Hansen TF. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. The Am Nat 149:646–667. doi: 10.1086/286013. [DOI] [Google Scholar]
- 65.Orme D. 2018. The caper package: comparative analysis of phylogenetics and evolution in R.
- 66.Cantalapiedra CP, Hernandez-Plaza A, Letunic I, Bork P, Huerta-Cepas J. 2021. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol 38:5825–5829. doi: 10.1093/molbev/msab293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, Jensen LJ, von Mering C. 2021. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 49:D605–D612. doi: 10.1093/nar/gkaa1074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. 2021. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res 49:D545–D551. doi: 10.1093/nar/gkaa970. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Torrisi M, Kaleel M, Pollastri G. 2019. Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction. Sci Rep 9:12374. doi: 10.1038/s41598-019-48786-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.UniProt Consortium. 2021. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49:D480–D489. doi: 10.1093/nar/gkaa1100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The protein data bank. Nucleic Acids Res 28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Letunic I, Bork P. 2021. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 49:W293–W296. doi: 10.1093/nar/gkab301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Zhang C, Rabiee M, Sayyari E, Mirarab S. 2018. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics 19:153. doi: 10.1186/s12859-018-2129-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Distribution of deviations of the effective number of codon (ENC) values from the expected ENC values under the null model of no codon bias. The distribution contains deviations for 2,308 reference GTA genes found in the 208 genomes. Numbers on the plot designate the number of reference GTA genes in an interval delineated by dashed lines. Download FIG S1, PDF file, 0.10 MB (99.3KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Deviation of the effective number of codon (ENC) values for individual reference GTA genes in comparison to the genomic average. The deviation of the ENC from the expectation under the null model for each GTA gene was normalized by the average ENC deviation of its genome. A line within a box displays the median normalized ENC value for a GTA gene across all genomes. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within the 1.5× interquartile range. Dots outside whiskers are outliers. Download FIG S2, PDF file, 0.09 MB (89.1KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Distributions of ptAI values in all reference GTA genes across four orders of the class Alphaproteobacteria. A line within a box displays the median ptAI value for a GTA gene across all genomes. The boxes are bounded by first and third quartiles. Whiskers represent ptAI values within the 1.5× interquartile range. Dots outside whiskers are outliers. Download FIG S3, PDF file, 0.10 MB (99.5KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
PGLS model fit among ptAI values of the reference GTA gene pairs. Each pairwise comparison is represented by a rectangle that is color-coded according to the P values from the PGLS analysis of the reference GTA gene pairs. The numerical P values are listed within each rectangle. Download FIG S4, PDF file, 0.1 MB (145.6KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Four molecular pathways were significantly overrepresented in the protein-protein interaction network shown in Fig. 3. The pathway information was obtained from the KEGG database. Download Table S1, PDF file, 0.07 MB (76.2KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Significance and slope of the fit of the phylogenetic generalized least squares (PGLS) models between the reference GTA genes and putative head-completion protein in Sphingomonadales. Statistically significant associations (P < 0.05) are highlighted in orange. Download Table S2, PDF file, 0.09 MB (94.4KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Gene neighborhood of the RcGTA gene g7 and the putative g7 replacement gene in 11 Sphingomonadales spp. The only region corresponding to the RcGTA ‘head-tail’ cluster is depicted, with each gene represented by an arrow scaled relative to its length within each cluster. The RcGTA genes (g1 to g15) are color-coded and their homologs in Sphingomonadales are shown in the same color. Putative g7 replacements in Sphingomonadales are shown in magenta and marked with an arrow. Pseudogenes are colored in black, while genes without an established relationship to GTA production are shown in gray. A phylogenetic tree is a subtree extracted from the reference phylogeny. The scale bar corresponds to the number of substitutions per site. Download FIG S5, PDF file, 0.1 MB (136.6KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Decisions behind choosing the RcGTA 'head-tail' cluster homologs for the reference GTA gene set. The GTA genes selected for the reference set are highlighted in orange. Functional annotations are based on the descriptions in the RefSeq database records unless noted otherwise. Download Table S3, PDF file, 0.1 MB (141.8KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Distribution of tAI values in protein-coding genes of the analyzed genomes. Only genes at least 300 nucleotides in length were included. (A) Distribution of tAI values of genes in three representative alphaproteobacterial genomes selected to have the lowest, the median, and the highest mean tAI value among 208 genomes. (B) Distribution of the average genomic tAI values across 208 genomes. Download FIG S6, PDF file, 0.1 MB (139.4KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Significance and slope of the fit of the phylogenetic generalized least squares (PGLS) models between the reference GTA genes and the g9 gene. Download Table S4, PDF file, 0.09 MB (93.2KB, pdf) .
Copyright © 2022 Kogay and Zhaxybayeva.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Data Availability Statement
The following data are available in the FigShare repository at https://doi.org/10.6084/m9.figshare.20082749: accession numbers of 208 analyzed alphaproteobacterial genomes; accession numbers of the GTA regions identified in the analyzed genomes; accession numbers of genes in gene families across analyzed genomes; raw data related to tAI and ENC calculations; an in-house script for ENC calculations; slopes and P values of associations detected in PGLS analyses; accession numbers of the putative g7 proteins in Sphingomonadales genomes; multiple sequence alignments and phylogenetic trees of tonB and gafA gene families, concatenated phylogenomic markers, and concatenated GTA reference genes.