Selection shapes synonymous stop codon use in mammals

Cathal Seoighe; Stephen J Kiniry; Andrew Peters; Pavel V Baranov; Haixuan Yang

doi:10.1007/s00239-020-09957-x

Author manuscript; available in PMC: 2025 Oct 21.

Published in final edited form as: J Mol Evol. 2020 Jul 2;88(7):549–561. doi: 10.1007/s00239-020-09957-x

PMCID: PMC7618042 EMSID: EMS208187 PMID: 32617614

Abstract

Phylogenetic models of the evolution of protein-coding sequences can provide insights into the selection pressures that have shaped them. In the application of these models synonymous nucleotide substitutions, which do not alter the encoded amino acid, are often assumed to have limited functional consequences and used as a proxy for the neutral rate of evolution. The ratio of nonsynonymous to synonymous substitution rates is then used to categorize the selective regime that applies to the protein (e.g. purifying selection, neutral evolution, diversifying selection). Here, we extend the Muse & Gaut model of codon evolution to explore the extent of purifying selection acting on substitutions between synonymous stop codons. Using a large collection of coding sequence alignments we estimate that a high proportion (approximately 57%) of mammalian genes are affected by selection acting on stop codon preference. This proportion varies substantially by codon, with UGA stop codons far more likely to be conserved. Genes with evidence of selection acting on synonymous stop codons have distinctive characteristics, compared to unconserved genes with the same stop codon, including longer 3′ untranslated regions (UTRs) and shorter mRNA half-life. The coding regions of these genes are also much more likely to be under strong purifying selection pressure. Our results suggest that the preference for UGA stop codons found in many multicellular eukaryotes is selective rather than mutational in origin.

Keywords: Stop codon, Evolutionary model, Selection

Background

The standard genetic code includes three stop codons, UAG, UAA and UGA, that signal the end of translation. In eukaryotes termination of translation involves two proteins, eRF1 and eRF3, termed release factors. When a stop codon is within the A site of the ribosome it is recognized by eRF1. The nascent polypeptide is then released from the ribosome in a process mediated by a ternary complex of eRF1, eRF3 and guanosine triphosphate (GTP) [24]. Although all three stop codons can be recognized by eRF1, the efficiency of translation termination differs significantly between them, ranging from UAA (highest fidelity of translation termination) to UGA (lowest), with UAG being intermediate [12]. The altered conformation of the ribosome accommodates the nucleotide immediately downstream of the stop codon (known as the +4 nucleotide) within the A site [7] and this nucleotide can also have a substantial impact on the efficiency of translation termination [11, 12]. Nucleotides further downstream of the stop codon and nucleotides upstream of the stop codon can also affect translation termination [11]. Failure to terminate translation at the stop codon is known as stop codon readthrough and some human genes have been found to have stop codon readthrough rates as high as 17% [37].

Consistent with the higher efficiency of translation termination associated with purines at the +4 position, G is over-represented at this position in mammalian genes [38], particularly in highly expressed genes [51]. By contrast, despite having the lowest efficiency of translation termination among the three, UGA is the most common stop codon in many multicellular eukaryotes [49]. The relative frequencies of the three stop codons are dependent on multiple factors and strongly associated with variation in GC content of the coding sequence [53]. The frequency of UAA is strongly negatively correlated with GC content, while the use of UAG and particularly of UGA increases with increasing GC content. This suggests a mutational origin for the variation in stop codon use, as mammalian GC content may be largely determined by mutational effects, reflecting variation in deoxynucleoside triphosphate (dNTP) abundance across S-phase that favours the incorporation of A and T nucleotides in late-replicating DNA [30]. However, the relationship with GC content is less strong for stop codon use than for sense codons and this has been interpreted as evidence that selection also contributes to the choice of stop codon [53]. This is further supported by the associations that have been reported between protein function and stop codon use [53].

Although a preference for efficient translational termination could explain variation in stop codon preference between different gene classes, it does not explain why UGA, the least efficient stop codon, remains the most frequent stop codon in many multicellular eukaryotes, including human, in which it occurs in close to 50% of protein-coding genes. This suggests either that the differences in fitness resulting from differences in the efficiency of translation termination are small or, alternatively, that stop codon readthrough may have useful functional consequences. Programmed readthrough, which refers to readthrough that contributes to biological function, was discovered in viruses and enables their typically compact genomes to increase their coding capacity (reviewed in [17]). Only a limited number of cases of functional readthrough have been confirmed in higher eukaryotes; however, the application of genome-scale techniques has recently suggested over a hundred human genes as candidates for functional readthrough [48].

Stop codon readthrough may also have an important regulatory role, potentially involving mechanisms that degrade proteins resulting from readthrough translation [3]. Yordanova et al. [55] recently proposed an intriguing model whereby low-level readthrough of a stop codon may play a role in gene regulation and mRNA quality control by placing a constraint on the total translational capacity of an mRNA. Under this model, ribosomes that continue past the stop codon form a queue, backing up from a downstream ribosome stalling site. The rate of stop codon readthrough together with the length of the interval between the stop codon and the stall site control the number of times the mRNA can be translated before the ribosome queue backs up as far as the stop codon, inhibiting translation. Currently it is not known how widespread this mechanism is; however, if it is common it may have an impact on stop codon use, as the different readthrough efficiencies of the different stop codons would have implications for the number of times the mRNA is translated. Termination of translation is a slower process than the addition of an amino acid to the nascent peptide; hence, stop codons themselves are often used as ribosome stalling sites. Such stalling may affect mRNA stability via NoGo decay [15] (i.e. the decay of mRNA associated with stalled ribosomes), but may have other important functions as in the above example or in the example of the XBP1 gene where it is required for its unusual cytoplasmic mRNA splicing [54]. Yet another regulatory event that involves stop codons is programmed ribosomal frameshifting [4] that often takes place at stop codons, e.g. +1 frameshifting in all three human antizyme genes (OAZ1, OAZ2 and OAZ3) takes place at stop codons and their identity is highly conserved. These, or as yet undiscovered functional implications of stop codon choice, may provide a selective explanation for the markedly high abundance of UGA stop codons in multicellular eukaryotes.

Here we set out to assess the extent of purifying selection affecting stop codon evolution in mammals. We extended models of codon sequence evolution that have previously been used to assess selection acting on coding sequences [2] to encompass the stop codon and fitted mixture models to estimate both the strength of purifying selection and the proportion of genes affected by purifying selection acting on the stop codon. Our model enabled us to assess the statistical evidence for selection for individual genes. Genes with conserved stop codons showed striking characteristics, including longer 3′ UTRs and shorter mRNA half-life compared to other genes. Stop codon conservation was more strongly associated with the selective constraints acting on the coding sequence than with the GC content of the gene, suggesting a selective, rather than a mutational origin for the variation in stop codon use with GC-content.

Results

Model

Standard codon models are characterised by a 61 × 61 transition rate matrix, Q, that gives the instantaneous rate of transition between each pair of sense codons [39, 20, 13, 2]. Typically, these models assume that codons evolve through single nucleotide substitutions, according to a continuous-time Markov process, so that the instantaneous rate of transition between codons differing at more than one position is zero. Here we extend models of codon evolution by proposing a full 64 × 64 rate matrix that includes the three stop codons. As implemented here, we set the rate of substitutions between sense and stop codons to zero. Although the point at which translation terminates may shift (resulting from mutations between sense and stop codons), we consider only aligned sequences with the stop codons positionally homologous to the end of the alignment and assume correctness of the sequence alignment. The model can be modified easily to allow mutations between stop and sense codons by the addition of parameters corresponding to the rates of these substitutions. Note that the stop codon UAA is accessible by a single base mutation from both of the other stop codons (UAG and UGA); however, the instantaneous rate of transition between UAG and UGA is zero, as it requires two single nucleotide substitutions. We note also that, unlike standard codon models, the transition probability matrix obtained by exponentiating our rate matrix is not irreducible and does not, therefore, have a unique stationary distribution (a chain starting with a sense/stop codon will remain a sense/stop codon). Several forms have been proposed for the generator matrix of codon substitution models, differing mainly in how differences in codon usage are modeled. The model of Muse and Gaut [39] sets the rate of substitution from codon i to codon j (which differ at a single nucleotide position, k) to be proportional to $π_{j_{k}}$ , the equilibrium frequency of the nucleotide at position k of codon j. 
We follow this approach for all of the results presented in the manuscript, as it has been found to be less prone to detecting spurious context-dependent effects than the version of Goldman and Yang [20], which sets the substitution rate to be proportional to the equilibrium frequency of codon j [35]. The entries, q_ij, of the rate matrix of our model are therefore:

$$
q_{ij} = \begin{cases}
π_{j_{k}} & \text{synonymous transversion } (i, j \in S) \\
κ π_{j_{k}} & \text{synonymous transition } (i, j \in S) \\
ω π_{j_{k}} & \text{nonsynonymous transversion } (i, j \in S) \\
ω κ π_{j_{k}} & \text{nonsynonymous transition } (i, j \in S) \\
0 & \text{if } i \text{ and } j \text{ differ at} > 1 \text{ position} \\
ϕ κ π_{j_{k}} & \text{synonymous transition } (i, j \in N) \\
0 & i \in N \oplus j \in N
\end{cases}
$$

where S and N are the sets of sense and stop (or nonsense) codons, respectively (note that all mutations between stop codons are transitions). The last two conditions represent the extension of the model to accommodate stop codons. A parameter, ϕ, models the substitution rate between stop codons, relative to the rate of synonymous substitutions between sense codons. The last condition assigns a zero rate for substitutions between sense and stop codons (the exclusive OR means that a zero rate applies if exactly one of i and j is a stop codon). Although a parameter can easily be added to allow mutations between sense and stop codons (resulting in a shift in the stop codon position), as presented above the model consists of two subsets of intercommunicating states (corresponding to the sense and stop codons). This model could be fitted as a combination of a standard codon model and a 3 × 3 rate matrix for the stop codons; however, the model parameters would need to be estimated jointly and therefore we consider that this is best represented as a single reducible Markov process.
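As an illustration only, the rate matrix defined above can be assembled as in the following Python sketch. It is not the R implementation used for the analyses. A single set of nucleotide frequencies shared across the three codon positions, the absence of any rate normalisation, and the names `build_rate_matrix`, `CODONS` and `STOPS` are all assumptions made for the example.

```python
import itertools

BASES = "UCAG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
STOPS = {"UAA", "UAG", "UGA"}
PURINES = {"A", "G"}

# Standard genetic code in UCAG order; '*' marks the three stop codons.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip(CODONS, AA))


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)


def build_rate_matrix(pi, kappa, omega, phi):
    """Return a 64 x 64 generator matrix (list of lists).

    pi    : dict mapping each base to its equilibrium frequency
            (assumed identical at all codon positions)
    kappa : transition/transversion rate ratio
    omega : nonsynonymous/synonymous rate ratio (sense codons)
    phi   : stop<->stop rate relative to synonymous sense changes
    """
    n = len(CODONS)
    Q = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(CODONS):
        for j, cj in enumerate(CODONS):
            diffs = [k for k in range(3) if ci[k] != cj[k]]
            if len(diffs) != 1:
                continue  # identical codons or more than one difference
            if (ci in STOPS) != (cj in STOPS):
                continue  # sense <-> stop substitutions have rate zero
            k = diffs[0]
            rate = pi[cj[k]]  # frequency of the target nucleotide
            if is_transition(ci[k], cj[k]):
                rate *= kappa
            if ci in STOPS:
                rate *= phi  # substitution between synonymous stop codons
            elif CODE[ci] != CODE[cj]:
                rate *= omega  # nonsynonymous change between sense codons
            Q[i][j] = rate
        # Usual generator convention: each row sums to zero.
        Q[i][i] = -sum(Q[i])
    return Q
```

With equal nucleotide frequencies, the only non-zero rates among stop codons are UAA ↔ UAG and UAA ↔ UGA, each equal to ϕκ/4, which reproduces the two-block (reducible) structure described in the text.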

We implemented this model in R [45], optimizing parameters using the Nelder-Mead [40] method. We simulated data under the model and found that we could recover simulated values of ϕ, the parameter of interest, with no evident bias (Fig. S1a; slope = 1.008, estimated with robust regression, using an M estimator). We also simulated data under the Goldman and Yang [20] model, using empirical triplet frequencies which we obtained from intron sequences, to investigate if we could recover ϕ, when the simulated data were generated with a different version of the codon model to the one used for inference. We found that ϕ could still be recovered from the simulated data with little bias (Fig. S1b; robust linear model slope = 1.04; see Methods for details of the simulations).

The model we present above essentially leverages the rate of synonymous substitution in the coding region to infer the effects of selection on substitutions between synonymous stop codons. However, it is possible that the parameters of the model are influenced by selection acting on synonymous mutations within the coding region. This could result from selective codon use (reflecting differences in the abundance of different tRNA species) or selection against mutations that alter mRNA splicing, which can reduce the rate of synonymous substitution. Although the latter effect is likely to make our method conservative (see Discussion), it is possible that the properties of the coding sequences distort our parameter estimates resulting in bias in our inference of selection affecting mutations between synonymous stop codons. Therefore, we developed an alternative approach to inferring selection acting on substitutions between synonymous stop codons based on alignments of intron sequences from the same gene (see Methods for details). This model makes use of appropriately scaled branch lengths and model parameters (e.g. the transition to transversion rate ratio) estimated from the intron to model stop codon evolution. In this case ϕ is the rate of synonymous substitutions between stop codons relative to the rate of transitions in the intron (accounting in both cases for unequal nucleotide frequencies). Although this method does not make any use of the coding sequence alignment (other than the stop codon) estimates of ϕ obtained in this way were close to and well correlated with estimates obtained using the coding sequence alignment (Fig. S2). 
Because all mutations between synonymous stop codons are A ↔ G, it is possible that differences in rates between different synonymous substitution types could bias our results; however, we found that results obtained from the intron sequences using the GTR model [52] were very similar to results obtained using the HKY model [23], upon which the MG model and, by extension, our model are based (Fig. S3).

Proportion of stop codons under purifying selection

To estimate the proportion of stop codons evolving under the influence of purifying selection, we fitted our stop-extended codon model to the codon-aware alignments of mammalian orthologues, obtained from the OrthoMaM database [16], using a mixture distribution for the ϕ parameter. The mixture distribution consisted of two point masses, one with ϕ a free parameter (constrained to be < 1), corresponding to alignments with a stop codon evolving under purifying selection and another with ϕ fixed at 1, corresponding to neutral evolution – i.e. substitutions between stop codons occurring at a rate consistent with the rate of synonymous substitutions in the coding region. We then used maximum likelihood to estimate the two free parameters of this mixture model (the ϕ parameter for the constrained stop codons and the mixture weight parameter). Note that we are using the synonymous substitution rate to approximate the rate of selectively neutral substitution (see Discussion for caveats).
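The structure of this two-component mixture likelihood can be sketched in Python as follows. The sketch assumes that per-gene log-likelihoods have already been computed at a fixed ϕ < 1 and at ϕ = 1 (the lists `ll_sel` and `ll_neu` are hypothetical inputs), and it estimates only the mixture weight by a simple grid search. In the analysis described above, ϕ and the weight are estimated jointly by maximum likelihood, so this is a simplification rather than the procedure actually used.

```python
import math


def mixture_loglik(w, ll_sel, ll_neu):
    """Log-likelihood of a mixture of two point masses on phi.

    w      : mixture weight of the constrained component (phi < 1)
    ll_sel : per-gene log-likelihoods at the constrained phi
    ll_neu : per-gene log-likelihoods at phi = 1
    """
    total = 0.0
    for a, b in zip(ll_sel, ll_neu):
        m = max(a, b)  # factor out the larger term for numerical stability
        total += m + math.log(w * math.exp(a - m) + (1.0 - w) * math.exp(b - m))
    return total


def fit_weight(ll_sel, ll_neu, steps=1000):
    """Grid-search estimate of the mixture weight on the open interval (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda w: mixture_loglik(w, ll_sel, ll_neu))
```

Each gene contributes the logarithm of a weighted sum of its two likelihoods, so genes that are individually uninformative still contribute to the estimate of the proportion under selection.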

We estimated that 57% (Fig. 1) of mammalian stop codons are evolving under relatively strong (point mass for ϕ at 0.24) purifying selection. Using simulation we found that this estimate is not strongly dependent on modeling assumptions (Fig. S4). To investigate differences in selection pressure between genes with each of the three stop codons, we separated the genes into three groups, depending on which stop codon was found in the human gene. When we analysed these groups separately, we found that a substantially higher proportion of UGA stop codons are under selective constraint, compared to UAG and UAA stop codons. Bootstrapping over orthologue families suggests that the higher frequency of conservation in UGA stop codons is highly robust to sampling error (Fig. 1).

Identification of genes with conserved stop codon use

We also fitted our extended codon model to each orthologue family independently, estimating a separate value of ϕ for each gene. The rate of evolution of stop codons varied widely (Fig. S5). For 3,642 orthologue families (29.5% of those included in the analysis) the stop codon was completely conserved across the mammalian phylogeny, resulting in maximum likelihood estimates of zero for ϕ, while for some other genes the point estimate of ϕ was greater than one. An example of a gene in each category is provided in Fig. S6. Of the 3,642 completely conserved stop codons, 2,585, 667 and 390 were UGA, UAG and UAA stop codons, respectively.

To assess the evidence for selection acting on stop codon use at the level of individual genes we used a likelihood ratio test, comparing the likelihood of the Null model with ϕ = 1 to the maximum likelihood of the alternative model with ϕ as a free parameter, bounded above by 1. For data simulated under the Null model the likelihood ratio test (LRT) statistic (twice the difference in the log likelihood between the null and alternative models) matched the expected χ² distribution, with one degree of freedom (Fig. S7a). The fit with the χ² distribution was less good when we simulated data using a model based on triplet nucleotide frequencies from introns (Fig. S7b; see Methods for details); however, even in this case the proportion of simulations for which the LRT statistic exceeded the critical value and for which ϕ was less than one was 0.058, approximately equal to the significance level (α = 0.05) applied. All genes for which the LRT statistic exceeded the critical value and for which the maximum likelihood estimate, $\hat{ϕ}$ , was less than one were considered as putatively under selective constraint. Using these criteria 27% of genes showed evidence of purifying selection acting on stop codon preference. We caution that some of these may be false positives, given the 5% significance threshold applied. The lower proportion (27%) of genes with conserved stop codons detected using the likelihood ratio test applied to individual genes, compared to the estimate from the mixture model, likely reflects limits of the power to reject the null hypothesis for individual alignments, as the null hypothesis was not rejected for some of the alignments with completely conserved stop codons. For 2,881 of the 3,642 genes with completely conserved stop codons, we reject the null hypothesis of neutral evolution (the null hypothesis was also rejected in favour of purifying selection for a further 447 genes with some stop codon substitutions). 
Failure to reject the null hypothesis even when the stop codon was completely conserved across the phylogeny was more likely to occur for UGA (neutral model rejected for 78% of completely conserved stop codons) or UAG (73%) stop codons than for UAA (95%) stop codons. The higher power for UAA results from the fact that UAA may mutate to UGA or UAG in a single point mutation, reducing the likelihood of passive conservation, relative to UGA and UAG, which can only mutate via single point mutations to UAA. Differences in power between stop codons mean that the frequency of purifying selection cannot be compared between the stop codons on the basis of individual tests; however, this does not undermine the comparison based on the mixture model as described above, as the latter can accumulate evidence of frequent purifying selection across genes.
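The per-gene decision rule described above can be written compactly. The following Python sketch assumes that the maximised log-likelihoods under the null (ϕ = 1) and alternative models, and the estimate of ϕ, are already available; the function name `lrt_stop_codon` and its arguments are hypothetical. The p-value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(√(x/2)).

```python
import math

# Critical value of the chi-squared distribution (1 d.f.) at alpha = 0.05.
CHI2_1DF_CRIT_05 = 3.841459


def lrt_stop_codon(ll_null, ll_alt, phi_hat, crit=CHI2_1DF_CRIT_05):
    """Likelihood ratio test of phi = 1 against phi free.

    Returns (statistic, p_value, conserved), where `conserved` is True when
    the statistic exceeds the critical value and the estimate of phi is < 1.
    """
    stat = 2.0 * (ll_alt - ll_null)
    # Survival function of chi-squared with one degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    conserved = stat > crit and phi_hat < 1.0
    return stat, p_value, conserved
```

For example, `lrt_stop_codon(-105.0, -102.0, 0.2)` gives a statistic of 6.0 and flags the gene as putatively constrained, whereas the same statistic with an estimate of ϕ above one would not be flagged.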

Properties of genes with conserved stop codons

We compared the following properties between genes with conserved and non-conserved stop codons: 3′ UTR length, 5′ UTR length, downstream open-reading frame length, coding sequence length, mRNA half-life, dN/dS, GC content, gene expression breadth and gene expression level. These properties were compared between all genes and between groups of genes, defined by the stop codon found in humans. Genes with putatively conserved stop codons had several striking features. Notably, they had longer 3′ UTRs than other genes (median of 1,183 bp compared to a median of 877 bp for the remainder of the genes in the dataset; p = 1 × 10⁻³⁹). However, the lengths of 3′ and 5′ UTRs are strongly negatively correlated with GC content [42]. When we fitted a linear model to 3′ UTR length as a function of stop codon conservation and mRNA GC content we found the effect of conservation remained positive and significant (effect size estimate = 300 bp; p = 7 × 10⁻¹⁴). Interestingly, the mRNA half-life of genes with conserved stop codons was shorter than that of other genes (Fig. 2). A strong effect was observed only for UGA stop codons with weak effects in opposite directions observed for UAG and UAA stop codons. For genes with a UGA stop codon (in human) half-life remained significantly associated with stop codon conservation when we fitted a logistic regression model with GC content, 3′ UTR length, coding sequence (CDS) length and the ratio of nonsynonymous to synonymous substitution rates, ω, as predictor variables (p = 0.0004). Both the expression level and breadth of the stop-conserved genes were significantly higher than those of other genes (p = 1 × 10⁻⁸ and 2 × 10⁻¹⁰, respectively, using a Wilcoxon rank sum test). When these effects were investigated separately in genes with different stop codons (in human) they remained strongly statistically significant in genes with UGA and UAA stop codons but not in genes with UAG stop codons.

Fig. 2 — Estimated probability of stop codon conservation (and 95% confidence interval) as a function of (a) mRNA half-life and (b) ω for each stop codon type. Conservation is based on model comparison in (a) and on complete sequence conservation across the alignment in (b). Estimates are from a logistic regression model, which included the number of taxa for which the stop codon was positionally homologous with the end of the alignment as a covariate.

Model-free analysis supports a major role for selection in shaping stop codon use

As an alternative to the model-based approach to defining conserved stop codons we investigated the characteristics of genes for which the stop codon was completely conserved across the entire alignment, regardless of whether there was sufficient evidence from the model to reject neutrality. We found that ω, the ratio of nonsynonymous to synonymous substitution rates, was strongly associated with stop codon conservation, with genes with low ω values (consistent with strong purifying selection acting on the CDS) having a much higher probability of having a stop codon that was completely conserved across the alignment (Fig. 2; this effect was also evident for the model-based conserved set). We fitted a logistic regression model, treating complete conservation of the stop codon across the alignment as the outcome and with GC-content, stop codon (in human) and ω as predictors and including the interaction between stop codon and ω (i.e. allowing for different ω effects for different stop codons). There was a striking effect of ω on the probability that a stop codon was conserved (Fig. 2). This suggests that conservation of the stop codon is influenced by selection, as genes with low values of ω are under strong selective constraint. Moreover, when we fitted an equivalent model but with GC-content and stop codon (in human) and included interaction terms, none of the interactions between stop codon and GC-content were significant, revealing no difference in the relationship with GC-content between the stop codons. Given the large variation in mutational patterns as a function of GC-content [30] this suggests that variation in mutation patterns is not the cause of differences in stop codon conservation. Further evidence that stop codon use is influenced by selection, rather than mutational effects, comes from analysis of the frequency of each stop-associated triplet in the 3′ UTR. 
The frequency of UAG, UGA and UAA in the 3′ UTR in human (across all three forward frames) was 23%, 38% and 38%, respectively, suggesting that the excess of UGA over UAA stop codons is not mutational in origin, although the low abundance (22% in human) of UAG stop codons may be a mutational effect.
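The triplet frequencies reported above can be obtained with a simple count. The Python sketch below assumes that "across all three forward frames" means counting every overlapping occurrence of the three triplets on the forward strand and expressing each count as a share of the three combined; that reading, and the function name `stop_triplet_fractions`, are assumptions made for illustration.

```python
def stop_triplet_fractions(seq):
    """Relative frequency of UAG, UGA and UAA among stop-associated triplets.

    Every overlapping triplet on the forward strand is examined, which
    covers all three forward reading frames. DNA input (T) is accepted
    and converted to RNA (U).
    """
    seq = seq.upper().replace("T", "U")
    counts = {"UAG": 0, "UGA": 0, "UAA": 0}
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        if triplet in counts:
            counts[triplet] += 1
    total = sum(counts.values())
    return {t: (c / total if total else 0.0) for t, c in counts.items()}
```

Applied to a concatenation of 3′ UTR sequences, the returned fractions would correspond to the kind of percentages quoted in the text, under the counting convention assumed here.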

Characteristics of the set of genes with stop codons conserved across the mammalian alignment (the sequence-based set) were similar to those of the genes identified using the model (the model-based set) and, indeed, the majority (70%) of the genes that occurred in either group occurred in both. The sequence-based conserved gene set also had significantly longer 3′ UTRs. The mRNA half-life effect was more striking. In a logistic regression model with membership of this set as the outcome variable and including GC content, 3′ UTR length, CDS length, mRNA half-life and dN/dS ratio as predictors, the mRNA half-life coefficient was significantly different from zero for the complete set of genes (p = 1 × 10⁻⁸) as well as for the genes with UGA or UAG stop codons in human (p = 2 × 10⁻⁶ and 0.01, respectively), but not for genes with UAA stop codons in human (p = 0.68).

Nonsynonymous but not synonymous divergence strongly predicts conservation of the stop codon

To investigate further whether conservation of the stop codon results from purifying selection or from a lower mutation rate or random chance we assessed the relationship between the probability of stop codon conservation and synonymous/nonsynonymous divergence. We obtained the number of nonsynonymous substitutions per nonsynonymous site (dN) and the number of synonymous substitutions per synonymous site (dS) for human-mouse orthologues from Ensembl [1]. Although we already report above that the dN/dS ratio (i.e. ω) is predictive of stop codon conservation, these data allowed us to investigate the relationship with dN and dS separately, using values calculated independently of the alignments analyzed here. We found that dN was highly predictive of complete stop codon conservation but dS was only weakly predictive (Fig. 3). On average synonymous substitutions are under much weaker selection than nonsynonymous substitutions and dS is therefore much more reflective of the underlying mutation rate than is dN. Our observation that dS is a much weaker predictor of stop codon conservation than dN suggests that a lower mutation rate is not sufficient to explain conservation of the stop codon across mammals. It is possible that the weak relationship between dS (human-mouse) and stop codon conservation is due to saturation of synonymous substitutions between human and mouse, given the relatively distant divergence of these species. Therefore, we also fitted the same model to complete stop codon conservation as a function of divergence values with macaque, which diverged from humans much more recently. Again dN was strongly associated with stop codon conservation (at least for UGA and UAA stop codons), while dS was not (Fig. 3).

Fig. 3 — Estimated probability of complete sequence conservation (and 95% confidence interval) as a function of (a) dN between human and mouse (n = 12,266), (b) dS between human and mouse (n = 12,266), (c) dN between human and macaque (n = 12,083) and (d) dS between human and macaque (n = 12,083). In all cases the x-axis is truncated to 1, as most of the divergence values are lower than this. The number of taxa for which the stop codon was positionally homologous with the end of the alignment was included as a covariate in the model.

Discussion

We set out to determine the extent to which alternative stop codon use is affected by purifying selection in mammalian genes. By extending models that were developed to understand the selection pressures acting on amino acid encoding sequences, we estimated that the stop codon is subject to purifying selection in a large proportion (approximately 57%) of mammalian genes. The proportion under selection varies between the stop codons and is highest for genes with UGA stop codons (Fig. 1). UGA is by far the most common stop codon in human and many other multicellular eukaryotes (close to 50% of human protein-coding genes have a UGA stop codon). Given the high rate of purifying selection affecting UGA stop codons we propose that the predominance of UGA codons is a result of selection rather than mutation. Stop codon use is strongly associated with GC-content [53] and large-scale variation in GC content across genomes has a mutational origin [30]. However, if the high abundance of UGA codons were mutational in origin we would expect that UGA codons in regions with high GC content would be much more likely to be conserved than UGA codons in low GC regions, given the strong relationship between GC content and stop codon use. This was not the case, as we found only a weak relationship between the probability of complete sequence conservation and GC content for all three stop codons.

Stop codon conservation may result from gene regulatory mechanisms that have a preference for the use of a specific stop codon. These mechanisms may include the control of translation capacity of mRNA molecules [55] and translational pausing, which has previously been associated with localization of the translating ribosome [54]. Strong enrichment of UGA among conserved stop codons suggests that UGA may frequently be the preferred codon in such cases (71% of completely conserved stop codons were UGA, compared to 50% UGA among all human protein-coding genes). In principle, it is possible that some cases of purifying selection acting on stop codons result from missed examples of UGA codons that encode the amino acid selenocysteine, rather than signaling the end of translation. However, given the small number of genes encoding selenoproteins [32] and the large number of conserved UGA stop codons, this is very unlikely to explain a substantial proportion of the conservation we report.

The use of the synonymous substitution rate as a proxy for the rate of neutral evolution has been criticised, as it is known that synonymous substitutions may have functional consequences [8, 10, 9, 41, 47]. Codon models that include a variable synonymous substitution rate have been developed [43, 46]. By not incorporating variability in the synonymous substitution rate in the coding region we are effectively comparing the rate of synonymous stop codon substitutions to their expected rate, given the mean rate of synonymous substitution in the coding region. Given the extent of purifying selection acting on synonymous sites, the mean synonymous substitution rate is likely to be an underestimate of the neutral rate of evolution. However, because our objective here is to study purifying selection affecting synonymous stop codons, underestimation of the neutral rate would make our method conservative (we would miss some genes under purifying selection, but the underestimate of the neutral rate should not cause false positives). We also observed many genes for which the maximum likelihood estimate of ϕ was greater than one (including PARP1, shown in Fig. S6). It is possible that some of the genes with stop codons evolving at a rate greater than expected, given the synonymous rate, result from purifying selection acting on mutations between synonymous sense codons, but it is also possible that there is diversifying selection acting on stop codon use in some genes. However, the number of genes with ϕ significantly greater than one was not more than we expected by chance (260 from a total of 12,336 genes at a significance threshold of α = 0.05). Using a very different method to ours, Belinky et al. [6] recently carried out an analysis of stop codon substitutions in a wide range of taxa. Although the majority of the taxa studied were microbial, they included three mammalian species. 
The authors reported an excess of substitutions to UGA stop codons, which they attributed to positive selection. However, consistent with our findings, they also report widespread purifying selection acting on synonymous stop codon mutations in primates, particularly for UGA [6].

Our method to infer purifying selection between synonymous stop codons using the synonymous substitution rate in the coding region does not consider the impact of selection acting on codon use. Differences in tRNA abundance can result in striking differences in the frequencies of synonymous codons, particularly for highly expressed genes in organisms with large effective population sizes [18, 21]. However, although translational selection on codon use has been reported in mammals [14], it tends to be weak due to relatively small effective population sizes [18, 44]. Our use of the coding-region synonymous substitution rate to approximate the neutral rate, when testing for purifying selection on substitutions between synonymous stop codons, is supported by the consistent results we obtained when we instead used the rate of substitution in introns (fig. S2). Although the use of intron sequences circumvents concerns that the rate of synonymous substitutions in the coding region may not be close to the rate of neutral substitution, there are significant advantages to the use of coding sequences. In particular, not all genes have introns, coding regions can be aligned much more accurately, and large-scale datasets of aligned coding regions are more readily available. We caution, however, that at least in its current form our model should not be applied to infer purifying selection on synonymous stop codon substitutions in organisms in which codon use is likely to be shaped by translational selection.

Among the most striking properties of genes with conserved stop codon use was the decreased mRNA half-life for conserved genes with UGA stop codons (fig. 2). The length of the 3′ UTRs is known to be inversely correlated with mRNA half-life and the conserved genes had longer 3′ UTRs; however, the reduced half-life in the UGA genes remained significant, even when we fitted a logistic regression model incorporating 3′ UTR length and GC-content as covariates. Although there may be many possible explanations for the lower half-life of these genes, we note that the model proposed by Yordanova et al. [55], consisting of a mechanism to limit the number of times an mRNA molecule is translated, may result in lowered half-life of the affected mRNAs because stalled ribosomes trigger mRNA decay [15]. Arribere et al. found evidence of instability of proteins resulting from readthrough in Caenorhabditis elegans and human cells and noted that this has been reported to result in mRNA instability in the HBA2 gene [3, 34]. Given the apparent ubiquity of protein instability resulting from readthrough, the higher readthrough rate for UGA codons and the shorter mRNA half-life of genes with conserved UGA stop codons, destabilization of mRNA resulting from readthrough may be a common mechanism of controlling protein abundance. However, selection acting against synonymous mutations between stop codons may have causes other than readthrough. In this regard we note a recent analysis of stop codon readthrough in Saccharomyces cerevisiae and Drosophila melanogaster [33], which suggests that many stop codon readthrough events may be molecular errors rather than having a specific function.

Previous studies have investigated stop codon sequence conservation, for example as one of the strands of evidence of functional translational readthrough [28, 29, 36]. Our study is distinct in that we do not set out to investigate a specific function of conserved stop codons but, instead, to estimate the frequency and strength of selection acting on synonymous stop codon use and to investigate the properties of the associated genes, in general. In principle, inference of stop codon conservation using our extended codon model is preferable to inference based on sequence conservation alone, as the latter does not take into account differences in GC-content and mutation rates between genomic regions. Our method explicitly models variation in codon usage, through incorporation of parameters (estimated empirically from the entire CDS alignment) for codon equilibrium frequencies. However, the total tree length and number of taxa in the mammalian orthologue alignments from OrthoMaM [16] were only just sufficient in many cases, and in some other cases insufficient, to reject the neutral model, even in the presence of complete conservation of the stop codon across all taxa. As the size of the families of reliably aligned coding sequences increases, the power to estimate accurately the strength of purifying selection acting on synonymous stop codons will increase. This will allow much more accurate identification of the affected genes and represents an example of the value of continued efforts to sequence the genomes of an ever increasing range of organisms and of the curation and alignment of orthologue families.

Conclusions

Our manuscript describes a method to assess the selection pressure acting on mutations between stop codons using the observed rate of synonymous substitution in the coding region or the rate of comparable substitutions in intron sequences. Using mixture models to combine information across alignments allowed us to infer that a large proportion of stop codons are under purifying selection pressure that reduces the rate of substitutions between synonymous stop codons. We conclude that selection plays an important and largely overlooked role in determining stop codon use in mammalian protein-coding genes.

Methods

Model optimization and data

We downloaded 14,526 coding sequence alignments for mammalian orthologue families and the corresponding inferred phylogenetic trees from the OrthoMaM (version 8) database [16]. These included sequences from 43 completely sequenced mammalian genomes, spanning from platypus to placental mammals. We restricted the analysis to the 12,336 families with at least 20 taxa for which the stop codon was included in the sequence alignment and positionally homologous with the end of the alignment. For each sequence alignment, we first re-estimated branch lengths of the phylogenetic trees using the Muse and Gaut model [39] with the F1X4 model of codon equilibrium frequencies (MGF1X4), implemented in CodonPhyML [19]. Treating the tree topology and relative branch lengths estimated by CodonPhyML as fixed, we then optimized the model in equation 1 over the parameters κ, ω, ϕ and a scaling factor, s, by which we multiplied all branch lengths of the tree (in practice the scaling factor was typically close to 1; interquartile range: 0.96–0.98). The model was implemented in R [45] and optimized using the optim function with the Nelder-Mead method [40]. Likelihood ratio tests were used to identify genes with evidence of conserved stop codons, with twice ΔL (the difference in log likelihood between a model with ϕ set to its maximum likelihood value and a model with ϕ fixed at 1) compared to a χ² distribution with one degree of freedom. Code and data to reproduce our results are available from https://github.com/cseoighe/StopEvol.
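The likelihood ratio test described above can be illustrated with a short sketch. It is written in Python for illustration only (the analysis itself was implemented in R), and the function names `chi2_sf_1df` and `stop_codon_lrt` are hypothetical. It assumes the two maximised log likelihoods are already available, and relies on the fact that the upper tail of a χ² distribution with one degree of freedom has the closed form erfc(√(x/2)).

```python
import math
from typing import Tuple


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    if x <= 0.0:
        return 1.0
    # If Z is standard normal, Z**2 is chi-squared with 1 df, so
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def stop_codon_lrt(loglik_null: float, loglik_alt: float) -> Tuple[float, float]:
    """Return (test statistic, p-value) for the nested model comparison.

    loglik_null: maximised log likelihood with phi fixed at 1
    loglik_alt:  maximised log likelihood with phi estimated freely
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    # Guard against small negative values caused by imperfect optimisation.
    stat = max(stat, 0.0)
    return stat, chi2_sf_1df(stat)
```

For example, log likelihoods of -1000.0 (ϕ = 1) and -998.0 (ϕ free) give a statistic of 4.0 and a p-value of about 0.0455. A gene would be classed as having a conserved stop codon only if, in addition, the estimate of ϕ is below 1.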

Inference of selection from intron sequences

Sequence alignments in Multiple Alignment Format (MAF) for 37 mammals, inferred using the Ensembl Compara [25] EPO pipeline, were retrieved by FTP from Ensembl Compara in October 2019. Using gene structure information obtained from BioMart [31] and custom Perl scripts we retrieved the genomic coordinates (in the human genome) of the 3′ most intron of all human protein-coding genes (taking the longest protein-coding transcript for each gene). Based on these coordinates we retrieved alignments corresponding to the last introns of human genes from the MAF files. In each case we excluded the first and last 100 bp of the intron (to reduce the representation of sequences involved in mRNA splicing). Genes for which the last intron was shorter than 1200 bp were not considered. We then selected only the taxa that were also represented in the OrthoMaM (version 8) alignment corresponding to the same human gene. We further removed all positions from the alignment at which at least 10% of the sequences had a gap. We retained only genes for which there remained at least 1000 bp of aligned sequence and trimmed to the central 1000 bp of each remaining intron alignment, so that the size of the remaining intron alignments was the same for each gene. We selected an arbitrary subset of 500 of these alignments, each including at least 20 taxa. For each alignment we pruned the phylogenetic tree obtained from OrthoMaM to remove species not present in the intron alignment. We then used PhyML (version 3.3) [22], with the HKY and GTR models (for figures S2 and S3, respectively), to re-estimate branch lengths and model parameters. We converted the resulting branch lengths (in units of substitutions per site) to codon branch lengths (with units of substitutions per codon) by multiplying by a factor of three. We then applied these branch lengths and rate parameters (κ or the A ↔ G rate for the HKY and GTR models, respectively) to model the data at the stop codon. In this case ϕ models the rate of substitution between synonymous stop codons relative to the rate of the corresponding mutation type in the intron (transitions in the case of HKY and A ↔ G transitions in the case of GTR).
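The column filtering, trimming and branch-length conversion steps described above can be sketched as follows. This is an illustrative Python sketch, not the custom Perl scripts used in the study; the function names and the representation of an alignment as a dictionary of equal-length gapped strings are assumptions made for the example.

```python
from typing import Dict, Optional


def filter_gap_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.10) -> Dict[str, str]:
    """Drop every column in which at least max_gap_fraction of sequences have a gap."""
    names = list(alignment)
    n_seqs = len(names)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[name][i] == "-" for name in names) / n_seqs < max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


def central_window(alignment: Dict[str, str],
                   width: int = 1000) -> Optional[Dict[str, str]]:
    """Trim to the central `width` columns; return None if the alignment is too short."""
    length = len(next(iter(alignment.values())))
    if length < width:
        return None
    start = (length - width) // 2
    return {name: seq[start:start + width] for name, seq in alignment.items()}


def site_to_codon_branch_length(subs_per_site: float) -> float:
    """Convert substitutions per site to substitutions per codon (three sites)."""
    return 3.0 * subs_per_site
```

Applied in sequence, `filter_gap_columns` followed by `central_window` yields either a fixed-width alignment or `None` for genes that would be discarded.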

Simulations

We first produced simulated data under the model in equation 1 and optimized the parameters of the same model. Coding-sequence alignments for 1,000 genes (and the corresponding phylogenetic trees) were sampled at random from the OrthoMaM database. We re-estimated the branch lengths of the tree using a codon model (MGF1X4) implemented in CodonPhyML [19]. Codons at the root of the tree were sampled from their equilibrium frequency under the F1X4 model (the F1X4 model sets codon frequencies to the product of the frequencies of their constituent nucleotides). Coding sequences were evolved from the root node over the phylogeny using code written in R. Model parameters were estimated from the simulated data as described above. We also simulated data under a Goldman and Yang [20] model (GY). The GY model differs from the MG model in that it uses triplet frequencies in place of the equilibrium frequency of the target nucleotide at the mutated codon site. We used empirical target triplet frequencies, derived from intron sequences of the same gene. Intron sequences from human protein-coding genes were derived from Ensembl Genes 94 [1]. We downloaded cDNA sequences and exon coordinates using BioMart and subtracted the exonic regions from the cDNA sequences. The first 100 bp and last 100 bp of each intron were discarded to reduce the impact of splicing signals on triplet frequencies. Codons were sampled from the intronic triplet frequencies for the root node of each tree and coding sequences were again simulated over the phylogeny.
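As an illustration of the F1X4 frequencies used to sample root codons, the following Python sketch computes each codon frequency as the product of its nucleotide frequencies and then draws a root sequence. It is not the R code used for the simulations. The function names are hypothetical, and the choice to exclude the three stop codons from sense positions and renormalise is an assumption of this example rather than a detail taken from the text.

```python
import random
from itertools import product
from typing import Dict, Iterable, List

NUCLEOTIDES = "ACGT"
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def f1x4_codon_frequencies(nuc_freqs: Dict[str, float],
                           exclude: Iterable[str] = STOP_CODONS) -> Dict[str, float]:
    """Codon frequencies as products of nucleotide frequencies (F1X4).

    Codons listed in `exclude` are dropped and the remainder renormalised
    (an assumption of this sketch).
    """
    excluded = set(exclude)
    raw = {}
    for triplet in product(NUCLEOTIDES, repeat=3):
        codon = "".join(triplet)
        if codon in excluded:
            continue
        raw[codon] = nuc_freqs[codon[0]] * nuc_freqs[codon[1]] * nuc_freqs[codon[2]]
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


def sample_root_codons(codon_freqs: Dict[str, float], n_codons: int,
                       rng: random.Random) -> List[str]:
    """Draw n_codons independent codons from the given equilibrium frequencies."""
    codons = list(codon_freqs)
    weights = [codon_freqs[c] for c in codons]
    return rng.choices(codons, weights=weights, k=n_codons)
```

The same sampling function could be used with empirical triplet frequencies in place of the F1X4 products, which corresponds to the root sampling step described for the GY simulations.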

Mixture model, bootstrapping and simulation

We used a mixture model to estimate the proportion, p, of alignments for which the stop codon was under purifying selection pressure. Conditional on belonging to this set of alignments the value of ϕ was treated as a free parameter, while ϕ was equal to 1 otherwise. For tractability, we set ω, κ and the tree scaling parameter to their maximum likelihood values. We performed a bootstrap over alignments to assess uncertainty in the estimates of p and ϕ. To test the accuracy with which the proportion of stop codons under purifying selection could be recovered we performed additional simulations. We simulated 1,000 genes with neutrally evolving stop codons (ϕ = 1) and a further 1,000 genes with ϕ a normal random variable with mean 0.3 and standard deviation 0.1. Both sets of genes were simulated under a GY model with empirical triplet frequencies derived from intron sequences, as described above. We then performed 100 simulations. For each simulation we sampled 1,000 genes from the two sets above, a random proportion (uniformly sampled from 0.1 to 0.8) of which were from the purifying selection set. We then applied our method (using the MGF1X4 model) to estimate the proportion of genes under purifying selection. Despite the substantial differences between the simulation and the model, the results are highly correlated (R² = 0.996) and show only a very slight downward bias in the estimates (fig. S4). We also applied the mixture model using the GY model with the F3X4 model of codon frequencies, but found that this yielded less accurate results, despite being closer to the model under which the simulation was performed (fig. S8).
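The two-component mixture likelihood and the bootstrap over alignments can be sketched as follows. This Python example is a simplified illustration under stated assumptions: per-alignment stop codon log likelihoods are taken as precomputed inputs, and a grid search over p and ϕ stands in for numerical optimisation. The function names and the grid approach are not taken from the original implementation.

```python
import math
import random
from typing import List, Sequence, Tuple


def mixture_loglik(p: float, ll_selected: Sequence[float],
                   ll_neutral: Sequence[float]) -> float:
    """Sum over alignments of log(p * L_i(phi) + (1 - p) * L_i(1)), with 0 < p < 1."""
    total = 0.0
    for a, b in zip(ll_selected, ll_neutral):
        m = max(a, b)  # factor out the larger term for numerical stability
        total += m + math.log(p * math.exp(a - m) + (1.0 - p) * math.exp(b - m))
    return total


def fit_mixture(ll_by_phi: Sequence[Sequence[float]], ll_neutral: Sequence[float],
                phi_grid: Sequence[float], p_grid: Sequence[float]) -> Tuple[float, float]:
    """Grid search for (p, phi); ll_by_phi[k][i] is the log likelihood of
    alignment i with phi = phi_grid[k]."""
    best_ll, best_p, best_phi = -math.inf, p_grid[0], phi_grid[0]
    for phi, ll_selected in zip(phi_grid, ll_by_phi):
        for p in p_grid:
            ll = mixture_loglik(p, ll_selected, ll_neutral)
            if ll > best_ll:
                best_ll, best_p, best_phi = ll, p, phi
    return best_p, best_phi


def bootstrap_p(ll_by_phi: Sequence[Sequence[float]], ll_neutral: Sequence[float],
                phi_grid: Sequence[float], p_grid: Sequence[float],
                n_boot: int, rng: random.Random) -> List[float]:
    """Resample alignments with replacement and refit p each time."""
    n = len(ll_neutral)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_sel = [[row[i] for i in idx] for row in ll_by_phi]
        boot_neu = [ll_neutral[i] for i in idx]
        estimates.append(fit_mixture(boot_sel, boot_neu, phi_grid, p_grid)[0])
    return estimates
```

The spread of the values returned by `bootstrap_p` gives a rough indication of the uncertainty in the estimate of p. The grid for p should exclude 0 and 1 so that the logarithm is always defined.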

Gene properties and enrichment analysis

Sequences of 3′ untranslated regions (UTRs) for human and mouse were downloaded from Ensembl Genes 91 [1]. Lengths of UTRs and coding regions were compared between genes that showed evidence of stop codon conservation (ϕ < 1 and p < 0.05) and the remaining genes using Wilcoxon rank sum tests. We performed tests of enrichment using DAVID (version 6.8) [26, 27].

Expression level, expression breadth and mRNA half-life

Expression level and breadth were calculated using median values, per tissue, of gene transcripts per million (TPM) data from GTEx (version 7) [5], downloaded on 8 February 2018. We used the median of the per-tissue median TPM values as a measure of expression level and the number of tissues in which the median TPM was greater than 10 as a measure of expression breadth. mRNA half-life data were obtained from [50].
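The two expression summaries defined above reduce to a few lines of code. The following Python sketch assumes the per-tissue median TPM values for one gene are supplied as a dictionary; the function name and the tissue labels in the usage note are illustrative only.

```python
from statistics import median
from typing import Dict, Tuple


def expression_summary(tissue_median_tpm: Dict[str, float],
                       breadth_threshold: float = 10.0) -> Tuple[float, int]:
    """Return (expression level, expression breadth) for one gene.

    level:   median of the per-tissue median TPM values
    breadth: number of tissues whose median TPM exceeds the threshold
    """
    values = list(tissue_median_tpm.values())
    level = median(values)
    breadth = sum(1 for value in values if value > breadth_threshold)
    return level, breadth
```

For a hypothetical gene with per-tissue medians of 2.0, 12.0 and 30.0 TPM, the level is 12.0 and the breadth is 2.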

Supplementary Material

Supporting information

EMS208187-supplement-Supporting_information.pdf (1.6 MB, pdf)

Acknowledgements

We are grateful to Estienne Swart and Gary Loughran for comments on the manuscript.

Funding

C.S. is supported by Science Foundation Ireland, award number 16/IA/4612.

P.V.B. is supported by SFI-HRB-Wellcome Trust Biomedical Research Partnership [210692/Z/18/Z]. S.J.K. wishes to acknowledge personal support from the Irish Research Council.

Footnotes

Competing interests

The authors declare that they have no competing interests.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Authors’ contributions

CS initiated the project, developed the model, wrote the code, performed analysis and drafted the manuscript. PVB, HY and SK suggested and performed further analyses. AP developed and maintains the software repository.

Contributor Information

Cathal Seoighe, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Galway, Ireland.

Stephen J. Kiniry, School of Biochemistry and Cell Biology, University College Cork, Cork, Ireland.

Andrew Peters, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Galway, Ireland.

Pavel V. Baranov, School of Biochemistry and Cell Biology, University College Cork, Cork, Ireland.

Haixuan Yang, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Galway, Ireland.

Availability of data and materials

Code and data to reproduce our results are available from https://github.com/cseoighe/StopEvol.

References

1. Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Gil L, et al. Ensembl 2017. Nucleic Acids Res. 2017;45(D1):D635–D642. doi: 10.1093/nar/gkw1104.
2. Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol. 2009;26(2):255–271. doi: 10.1093/molbev/msn232.
3. Arribere JA, Cenik ES, Jain N, Hess GT, Lee CH, Bassik MC, Fire AZ. Translation readthrough mitigation. Nature. 2016;534(7609):719–723. doi: 10.1038/nature18308.
4. Atkins JF, Loughran G, Bhatt PR, Firth AE, Baranov PV. Ribosomal frameshifting and transcriptional slippage: From genetic steganography and cryptography to adventitious use. Nucleic Acids Res. 2016;44(15):7007–7078. doi: 10.1093/nar/gkw530.
5. Battle A, Brown CD, Engelhardt BE, Montgomery SB, Aguet F, Ardlie KG, Cummings BB, Gelfand ET, Getz G, Hadley K, Handsaker RE, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. doi: 10.1038/nature24277.
6. Belinky F, Babenko VN, Rogozin IB, Koonin EV. Purifying and positive selection in the evolution of stop codons. Sci Rep. 2018;8(1):9260. doi: 10.1038/s41598-018-27570-3.
7. Brown A, Shao S, Murray J, Hegde RS, Ramakrishnan V. Structural basis for stop codon recognition in eukaryotes. Nature. 2015;524(7566):493–496. doi: 10.1038/nature14896.
8. Caceres EF, Hurst LD. The evolution, impact and properties of exonic splice enhancers. Genome Biol. 2013;14(12):R143. doi: 10.1186/gb-2013-14-12-r143.
9. Carlini DB, Genut JE. Synonymous SNPs provide evidence for selective constraint on human exonic splicing enhancers. J Mol Evol. 2006;62(1):89–98. doi: 10.1007/s00239-005-0055-x.
10. Chamary JV, Hurst LD. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biol. 2005;6(9):R75. doi: 10.1186/gb-2005-6-9-r75.
11. Cridge AG, Crowe-McAuliffe C, Mathew SF, Tate WP. Eukaryotic translational termination efficiency is influenced by the 3’ nucleotides within the ribosomal mRNA channel. Nucleic Acids Res. 2018;46(4):1927–1944. doi: 10.1093/nar/gkx1315.
12. Dabrowski M, Bukowy-Bieryllo Z, Zietkiewicz E. Translational readthrough potential of natural termination codons in eucaryotes–The impact of RNA sequence. RNA Biol. 2015;12(9):950–958. doi: 10.1080/15476286.2015.1068497.
13. Delport W, Scheffler K, Seoighe C. Models of coding sequence evolution. Brief Bioinformatics. 2009;10(1):97–109. doi: 10.1093/bib/bbn049.
14. Doherty A, McInerney JO. Translational selection frequently overcomes genetic drift in shaping synonymous codon usage patterns in vertebrates. Mol Biol Evol. 2013;30(10):2263–2267. doi: 10.1093/molbev/mst128.
15. Doma MK, Parker R. Endonucleolytic cleavage of eukaryotic mRNAs with stalls in translation elongation. Nature. 2006;440(7083):561–564. doi: 10.1038/nature04530.
16. Douzery EJ, Scornavacca C, Romiguier J, Belkhir K, Galtier N, Delsuc F, Ranwez V. OrthoMaM v8: a database of orthologous exons and coding sequences for comparative genomics in mammals. Mol Biol Evol. 2014;31(7):1923–1928. doi: 10.1093/molbev/msu132.
17. Firth AE, Brierley I. Non-canonical translation in RNA viruses. J Gen Virol. 2012;93(Pt 7):1385–1409. doi: 10.1099/vir.0.042499-0.
18. Galtier N, Roux C, Rousselle M, Romiguier J, Figuet E, Glemin S, Bierne N, Duret L. Codon Usage Bias in Animals: Disentangling the Effects of Natural Selection, Effective Population Size, and GC-Biased Gene Conversion. Mol Biol Evol. 2018;35(5):1092–1103. doi: 10.1093/molbev/msy015.
19. Gil M, Zanetti MS, Zoller S, Anisimova M. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models. Mol Biol Evol. 2013;30(6):1270–1280. doi: 10.1093/molbev/mst034.
20. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11(5):725–736. doi: 10.1093/oxfordjournals.molbev.a040153.
21. Gouy M, Gautier C. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982;10(22):7055–7074. doi: 10.1093/nar/10.22.7055.
22. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321. doi: 10.1093/sysbio/syq010.
23. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–174. doi: 10.1007/BF02101694.
24. Hellen CUT. Translation Termination and Ribosome Recycling in Eukaryotes. Cold Spring Harb Perspect Biol. 2018;10(10) doi: 10.1101/cshperspect.a032656.
25. Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, Vilella AJ, Searle SM, Amode R, Brent S, Spooner W, et al. Ensembl comparative genomics resources. Database (Oxford) 2016;2016 doi: 10.1093/database/bav096.
26. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13. doi: 10.1093/nar/gkn923.
27. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57. doi: 10.1038/nprot.2008.211.
28. Jungreis I, Chan CS, Waterhouse RM, Fields G, Lin MF, Kellis M. Evolutionary Dynamics of Abundant Stop Codon Readthrough. Mol Biol Evol. 2016;33(12):3108–3132. doi: 10.1093/molbev/msw189.
29. Jungreis I, Lin MF, Spokony R, Chan CS, Negre N, Victorsen A, White KP, Kellis M. Evidence of abundant stop codon readthrough in Drosophila and other metazoa. Genome Res. 2011;21(12):2096–2113. doi: 10.1101/gr.119974.110.
30. Kenigsberg E, Yehuda Y, Marjavaara L, Keszthelyi A, Chabes A, Tanay A, Simon I. The mutation spectrum in genomic late replication domains shapes mammalian GC content. Nucleic Acids Res. 2016;44(9):4222–4232. doi: 10.1093/nar/gkw268.
31. Kinsella RJ, Kahari A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P, Kerhornou A, Kersey P, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford) 2011;2011:bar030. doi: 10.1093/database/bar030.
32. Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigo R, Gladyshev VN. Characterization of mammalian selenoproteomes. Science. 2003;300(5624):1439–1443. doi: 10.1126/science.1083516.
33. Li C, Zhang J. Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLoS Genet. 2019;15(5):e1008141. doi: 10.1371/journal.pgen.1008141.
34. Liebhaber SA, Kan YW. Differentiation of the mRNA transcripts originating from the alpha 1- and alpha 2-globin loci in normals and alpha-thalassemics. J Clin Invest. 1981;68(2):439–446. doi: 10.1172/JCI110273.
35. Lindsay H, Yap VB, Ying H, Huttley GA. Pitfalls of the most commonly used models of context dependent substitution. Biol Direct. 2008;3:52. doi: 10.1186/1745-6150-3-52.
36. Loughran G, Chou MY, Ivanov IP, Jungreis I, Kellis M, Kiran AM, Baranov PV, Atkins JF. Evidence of efficient stop codon readthrough in four mammalian genes. Nucleic Acids Res. 2014;42(14):8928–8938. doi: 10.1093/nar/gku608.
37. Loughran G, Howard MT, Firth AE, Atkins JF. Avoidance of reporter assay distortions from fused dual reporters. RNA. 2017;23(8):1285–1289. doi: 10.1261/rna.061051.117.
38. McCaughan KK, Brown CM, Dalphin ME, Berry MJ, Tate WP. Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc Natl Acad Sci USA. 1995;92(12):5431–5435. doi: 10.1073/pnas.92.12.5431.
39. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11(5):715–724. doi: 10.1093/oxfordjournals.molbev.a040152.
40. Nelder JA, Mead R. A simplex method for function minimization. Computer Journal. 1965;7:308–313.
41. Ngandu NK, Scheffler K, Moore P, Woodman Z, Martin D, Seoighe C. Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences. Virol J. 2008;5:160. doi: 10.1186/1743-422X-5-160.
42. Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S. Structural and functional features of eukaryotic mRNA untranslated regions. Gene. 2001;276(1-2):73–81. doi: 10.1016/s0378-1119(01)00674-6.
43. Pond SK, Muse SV. Site-to-site variation of synonymous substitution rates. Mol Biol Evol. 2005;22(12):2375–2385. doi: 10.1093/molbev/msi232.
44. Pouyet F, Mouchiroud D, Duret L, Semon M. Recombination, meiotic expression and human codon usage. Elife. 2017;6 doi: 10.7554/eLife.27344.
45. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2017. URL https://www.R-project.org/
46. Rubinstein ND, Doron-Faigenboim A, Mayrose I, Pupko T. Evolutionary models accounting for layers of selection in protein-coding genes and their impact on the inference of positive selection. Mol Biol Evol. 2011;28(12):3297–3308. doi: 10.1093/molbev/msr162.
47. Sauna ZE, Kimchi-Sarfaty C. Understanding the contribution of synonymous mutations to human disease. Nat Rev Genet. 2011;12(10):683–691. doi: 10.1038/nrg3051.
48. Schueren F, Thoms S. Functional Translational Readthrough: A Systems Biology Perspective. PLoS Genet. 2016;12(8):e1006196. doi: 10.1371/journal.pgen.1006196.
49. Sun J, Chen M, Xu J, Luo J. Relationships among stop codon usage bias, its context, isochores, and gene expression level in various eukaryotes. J Mol Evol. 2005;61(4):437–444. doi: 10.1007/s00239-004-0277-3.
50. Tani H, Mizutani R, Salam KA, Tano K, Ijiri K, Wakamatsu A, Isogai T, Suzuki Y, Akimitsu N. Genome-wide determination of RNA stability reveals hundreds of short-lived noncoding transcripts in mammals. Genome Res. 2012;22(5):947–956. doi: 10.1101/gr.130559.111.
51. Tate WP, Poole ES, Horsfield JA, Mannering SA, Brown CM, Moffat JG, Dalphin ME, McCaughan KK, Major LL, Wilson DN. Translational termination efficiency in both bacteria and mammals is regulated by the base following the stop codon. Biochem Cell Biol. 1995;73(11-12):1095–1103. doi: 10.1139/o95-118.
52. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences. 1986;17(2):57–86.
53. Trotta E. Selective forces and mutational biases drive stop codon usage in the human genome: a comparison with sense codon usage. BMC Genomics. 2016;17:366. doi: 10.1186/s12864-016-2692-4.
54. Yanagitani K, Kimata Y, Kadokura H, Kohno K. Translational pausing ensures membrane targeting and cytoplasmic splicing of XBP1u mRNA. Science. 2011;331(6017):586–589. doi: 10.1126/science.1197142.
55. Yordanova MM, Loughran G, Zhdanov AV, Mariotti M, Kiniry SJ, O’Connor PBF, Andreev DE, Tzani I, Saffert P, Michel AM, Gladyshev VN, et al. AMD1 mRNA employs ribosome stalling as a mechanism for molecular memory formation. Nature. 2018;553(7688):356–360. doi: 10.1038/nature25174.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting information

EMS208187-supplement-Supporting_information.pdf^{(1.6MB, pdf)}

Data Availability Statement

Code and data to reproduce our results are available from https://github.com/cseoighe/StopEvol.

[R1] 1.Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Gil L, et al. Ensembl 2017. Nucleic Acids Res. 2017;45(D1):D635–D642. doi: 10.1093/nar/gkw1104. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol. 2009;26(2):255–271. doi: 10.1093/molbev/msn232. [DOI] [PubMed] [Google Scholar]

[R3] 3.Arribere JA, Cenik ES, Jain N, Hess GT, Lee CH, Bassik MC, Fire AZ. Translation readthrough mitigation. Nature. 2016;534(7609):719–723. doi: 10.1038/nature18308. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Atkins JF, Loughran G, Bhatt PR, Firth AE, Baranov PV. Ribosomal frameshifting and transcriptional slippage: From genetic steganography and cryptography to adventitious use. Nucleic Acids Res. 2016;44(15):7007–7078. doi: 10.1093/nar/gkw530. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5. Battle A, Brown CD, Engelhardt BE, Montgomery SB, Aguet F, Ardlie KG, Cummings BB, Gelfand ET, Getz G, Hadley K, Handsaker RE, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. doi: 10.1038/nature24277.

[R6] 6. Belinky F, Babenko VN, Rogozin IB, Koonin EV. Purifying and positive selection in the evolution of stop codons. Sci Rep. 2018;8(1):9260. doi: 10.1038/s41598-018-27570-3.

[R7] 7. Brown A, Shao S, Murray J, Hegde RS, Ramakrishnan V. Structural basis for stop codon recognition in eukaryotes. Nature. 2015;524(7566):493–496. doi: 10.1038/nature14896.

[R8] 8. Caceres EF, Hurst LD. The evolution, impact and properties of exonic splice enhancers. Genome Biol. 2013;14(12):R143. doi: 10.1186/gb-2013-14-12-r143.

[R9] 9. Carlini DB, Genut JE. Synonymous SNPs provide evidence for selective constraint on human exonic splicing enhancers. J Mol Evol. 2006;62(1):89–98. doi: 10.1007/s00239-005-0055-x.

[R10] 10. Chamary JV, Hurst LD. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biol. 2005;6(9):R75. doi: 10.1186/gb-2005-6-9-r75.

[R11] 11. Cridge AG, Crowe-McAuliffe C, Mathew SF, Tate WP. Eukaryotic translational termination efficiency is influenced by the 3′ nucleotides within the ribosomal mRNA channel. Nucleic Acids Res. 2018;46(4):1927–1944. doi: 10.1093/nar/gkx1315.

[R12] 12. Dabrowski M, Bukowy-Bieryllo Z, Zietkiewicz E. Translational readthrough potential of natural termination codons in eucaryotes–The impact of RNA sequence. RNA Biol. 2015;12(9):950–958. doi: 10.1080/15476286.2015.1068497.

[R13] 13. Delport W, Scheffler K, Seoighe C. Models of coding sequence evolution. Brief Bioinformatics. 2009;10(1):97–109. doi: 10.1093/bib/bbn049.

[R14] 14. Doherty A, McInerney JO. Translational selection frequently overcomes genetic drift in shaping synonymous codon usage patterns in vertebrates. Mol Biol Evol. 2013;30(10):2263–2267. doi: 10.1093/molbev/mst128.

[R15] 15. Doma MK, Parker R. Endonucleolytic cleavage of eukaryotic mRNAs with stalls in translation elongation. Nature. 2006;440(7083):561–564. doi: 10.1038/nature04530.

[R16] 16. Douzery EJ, Scornavacca C, Romiguier J, Belkhir K, Galtier N, Delsuc F, Ranwez V. OrthoMaM v8: a database of orthologous exons and coding sequences for comparative genomics in mammals. Mol Biol Evol. 2014;31(7):1923–1928. doi: 10.1093/molbev/msu132.

[R17] 17. Firth AE, Brierley I. Non-canonical translation in RNA viruses. J Gen Virol. 2012;93(Pt 7):1385–1409. doi: 10.1099/vir.0.042499-0.

[R18] 18. Galtier N, Roux C, Rousselle M, Romiguier J, Figuet E, Glemin S, Bierne N, Duret L. Codon Usage Bias in Animals: Disentangling the Effects of Natural Selection, Effective Population Size, and GC-Biased Gene Conversion. Mol Biol Evol. 2018;35(5):1092–1103. doi: 10.1093/molbev/msy015.

[R19] 19. Gil M, Zanetti MS, Zoller S, Anisimova M. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models. Mol Biol Evol. 2013;30(6):1270–1280. doi: 10.1093/molbev/mst034.

[R20] 20. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11(5):725–736. doi: 10.1093/oxfordjournals.molbev.a040153.

[R21] 21. Gouy M, Gautier C. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982;10(22):7055–7074. doi: 10.1093/nar/10.22.7055.

[R22] 22. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321. doi: 10.1093/sysbio/syq010.

[R23] 23. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–174. doi: 10.1007/BF02101694.

[R24] 24. Hellen CUT. Translation Termination and Ribosome Recycling in Eukaryotes. Cold Spring Harb Perspect Biol. 2018;10(10) doi: 10.1101/cshperspect.a032656.

[R25] 25. Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, Vilella AJ, Searle SM, Amode R, Brent S, Spooner W, et al. Ensembl comparative genomics resources. Database (Oxford) 2016;2016 doi: 10.1093/database/bav096.

[R26] 26. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13. doi: 10.1093/nar/gkn923.

[R27] 27. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57. doi: 10.1038/nprot.2008.211.

[R28] 28. Jungreis I, Chan CS, Waterhouse RM, Fields G, Lin MF, Kellis M. Evolutionary Dynamics of Abundant Stop Codon Readthrough. Mol Biol Evol. 2016;33(12):3108–3132. doi: 10.1093/molbev/msw189.

[R29] 29. Jungreis I, Lin MF, Spokony R, Chan CS, Negre N, Victorsen A, White KP, Kellis M. Evidence of abundant stop codon readthrough in Drosophila and other metazoa. Genome Res. 2011;21(12):2096–2113. doi: 10.1101/gr.119974.110.

[R30] 30. Kenigsberg E, Yehuda Y, Marjavaara L, Keszthelyi A, Chabes A, Tanay A, Simon I. The mutation spectrum in genomic late replication domains shapes mammalian GC content. Nucleic Acids Res. 2016;44(9):4222–4232. doi: 10.1093/nar/gkw268.

[R31] 31. Kinsella RJ, Kahari A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P, Kerhornou A, Kersey P, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford) 2011;2011:bar030. doi: 10.1093/database/bar030.

[R32] 32. Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigo R, Gladyshev VN. Characterization of mammalian selenoproteomes. Science. 2003;300(5624):1439–1443. doi: 10.1126/science.1083516.

[R33] 33. Li C, Zhang J. Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLoS Genet. 2019;15(5):e1008141. doi: 10.1371/journal.pgen.1008141.

[R34] 34. Liebhaber SA, Kan YW. Differentiation of the mRNA transcripts originating from the alpha 1- and alpha 2-globin loci in normals and alpha-thalassemics. J Clin Invest. 1981;68(2):439–446. doi: 10.1172/JCI110273.

[R35] 35. Lindsay H, Yap VB, Ying H, Huttley GA. Pitfalls of the most commonly used models of context dependent substitution. Biol Direct. 2008;3:52. doi: 10.1186/1745-6150-3-52.

[R36] 36. Loughran G, Chou MY, Ivanov IP, Jungreis I, Kellis M, Kiran AM, Baranov PV, Atkins JF. Evidence of efficient stop codon readthrough in four mammalian genes. Nucleic Acids Res. 2014;42(14):8928–8938. doi: 10.1093/nar/gku608.

[R37] 37. Loughran G, Howard MT, Firth AE, Atkins JF. Avoidance of reporter assay distortions from fused dual reporters. RNA. 2017;23(8):1285–1289. doi: 10.1261/rna.061051.117.

[R38] 38. McCaughan KK, Brown CM, Dalphin ME, Berry MJ, Tate WP. Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc Natl Acad Sci USA. 1995;92(12):5431–5435. doi: 10.1073/pnas.92.12.5431.

[R39] 39. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11(5):715–724. doi: 10.1093/oxfordjournals.molbev.a040152.

[R40] 40. Nelder JA, Mead R. A simplex method for function minimization. Computer Journal. 1965;7:308–313.

[R41] 41. Ngandu NK, Scheffler K, Moore P, Woodman Z, Martin D, Seoighe C. Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences. Virol J. 2008;5:160. doi: 10.1186/1743-422X-5-160.

[R42] 42. Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S. Structural and functional features of eukaryotic mRNA untranslated regions. Gene. 2001;276(1-2):73–81. doi: 10.1016/s0378-1119(01)00674-6.

[R43] 43. Pond SK, Muse SV. Site-to-site variation of synonymous substitution rates. Mol Biol Evol. 2005;22(12):2375–2385. doi: 10.1093/molbev/msi232.

[R44] 44. Pouyet F, Mouchiroud D, Duret L, Semon M. Recombination, meiotic expression and human codon usage. Elife. 2017;6 doi: 10.7554/eLife.27344.

[R45] 45. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2017. URL https://www.R-project.org/

[R46] 46. Rubinstein ND, Doron-Faigenboim A, Mayrose I, Pupko T. Evolutionary models accounting for layers of selection in protein-coding genes and their impact on the inference of positive selection. Mol Biol Evol. 2011;28(12):3297–3308. doi: 10.1093/molbev/msr162.

[R47] 47. Sauna ZE, Kimchi-Sarfaty C. Understanding the contribution of synonymous mutations to human disease. Nat Rev Genet. 2011;12(10):683–691. doi: 10.1038/nrg3051.

[R48] 48. Schueren F, Thoms S. Functional Translational Readthrough: A Systems Biology Perspective. PLoS Genet. 2016;12(8):e1006196. doi: 10.1371/journal.pgen.1006196.

[R49] 49. Sun J, Chen M, Xu J, Luo J. Relationships among stop codon usage bias, its context, isochores, and gene expression level in various eukaryotes. J Mol Evol. 2005;61(4):437–444. doi: 10.1007/s00239-004-0277-3.

[R50] 50. Tani H, Mizutani R, Salam KA, Tano K, Ijiri K, Wakamatsu A, Isogai T, Suzuki Y, Akimitsu N. Genome-wide determination of RNA stability reveals hundreds of short-lived noncoding transcripts in mammals. Genome Res. 2012;22(5):947–956. doi: 10.1101/gr.130559.111.

[R51] 51. Tate WP, Poole ES, Horsfield JA, Mannering SA, Brown CM, Moffat JG, Dalphin ME, McCaughan KK, Major LL, Wilson DN. Translational termination efficiency in both bacteria and mammals is regulated by the base following the stop codon. Biochem Cell Biol. 1995;73(11-12):1095–1103. doi: 10.1139/o95-118.

[R52] 52. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences. 1986;17(2):57–86.

[R53] 53. Trotta E. Selective forces and mutational biases drive stop codon usage in the human genome: a comparison with sense codon usage. BMC Genomics. 2016;17:366. doi: 10.1186/s12864-016-2692-4.

[R54] 54. Yanagitani K, Kimata Y, Kadokura H, Kohno K. Translational pausing ensures membrane targeting and cytoplasmic splicing of XBP1u mRNA. Science. 2011;331(6017):586–589. doi: 10.1126/science.1197142.

[R55] 55. Yordanova MM, Loughran G, Zhdanov AV, Mariotti M, Kiniry SJ, O’Connor PBF, Andreev DE, Tzani I, Saffert P, Michel AM, Gladyshev VN, et al. AMD1 mRNA employs ribosome stalling as a mechanism for molecular memory formation. Nature. 2018;553(7688):356–360. doi: 10.1038/nature25174.
