Proceedings of the National Academy of Sciences of the United States of America. 2005 Sep 21;102(40):14338–14343. doi: 10.1073/pnas.0504070102

Why highly expressed proteins evolve slowly

D Allan Drummond, Jesse D Bloom, Christoph Adami, Claus O Wilke, Frances H Arnold
PMCID: PMC1242296  PMID: 16176987

Abstract

Much recent work has explored molecular and population-genetic constraints on the rate of protein sequence evolution. The best predictor of evolutionary rate is expression level, for reasons that have remained unexplained. Here, we hypothesize that selection to reduce the burden of protein misfolding will favor protein sequences with increased robustness to translational missense errors. Pressure for translational robustness increases with expression level and constrains sequence evolution. Using several sequenced yeast genomes, global expression and protein abundance data, and sets of paralogs traceable to an ancient whole-genome duplication in yeast, we rule out several confounding effects and show that expression level explains roughly half the variation in Saccharomyces cerevisiae protein evolutionary rates. We examine causes for expression's dominant role and find that genome-wide tests favor the translational robustness explanation over existing hypotheses that invoke constraints on function or translational efficiency. Our results suggest that proteins evolve at rates largely unrelated to their functions and can explain why highly expressed proteins evolve slowly across the tree of life.

Keywords: evolutionary rate, protein misfolding, yeast, translation errors, gene duplication


A central problem in molecular evolution is why proteins evolve at different rates. Protein evolutionary rates, quantified by the number of nonsynonymous nucleotide changes per site (dN) in the encoding genes, are routinely used to build phylogenetic trees, detect selection, find orthologous proteins among related species (1), and evaluate the functional importance of genes (2), yet we possess only hints of the biophysical cause of rate differences. Thirty years ago, Zuckerkandl (3) proposed that a protein's sequence will evolve at a rate primarily determined by the proportion of its sites involved in specific functions (or “functional density”). Although this proposal has gained wide acceptance (2), measurement of functional density remains problematic because residues may contribute to protein function in unpredictable ways, and arduous sequence-wide saturation mutagenesis and mutant characterization studies are required to ascertain these effects.

Instead, many recent studies have focused on other, more readily obtained, measures that may approximate functional density. For example, protein-protein interactions presumably constrain interfacial residues, and some reports indicate that highly interactive proteins evolve slowly (4). The intuition that a protein's overall functional importance should amplify the fitness costs of mutations at sites that make subtle functional contributions has been captured in analyses of how a gene's functional category (5, 6), its essentiality for organism survival (6–8), or the fitness effect of its deletion (or “dispensability”) (9, 10) correlate with evolutionary rate. In all cases, the effects under consideration explain only a small fraction (≈5% or less) of the observed variation in evolutionary rate as quantified by their squared correlation coefficients, r².

Surprisingly, from bacteria to mammals, the best indicator of a protein's relative evolutionary rate is the expression level of the encoding gene, measured in mRNA transcripts per cell (5, 6, 11-14). Highly expressed proteins evolve slowly, accounting for as much as 34% of rate variation in yeast (5). Moreover, after expression level is controlled for, the remaining influence of protein-protein interactions and dispensability decreases or, in some datasets, vanishes completely (15-17). Expression level's disproportionate influence remains unexplained (5, 6, 16-20).

Significant questions have persisted about whether expression level truly determines evolutionary rate, because highly expressed proteins may possess unique structural or functional features that constrain their sequences. Paralogous gene pairs resulting from a whole-genome duplication (WGD) event, such as in the lineage of Saccharomyces cerevisiae (21), minimize such differences: homology ensures a similar structure, and the majority of yeast paralogs show little, if any, difference in function (22). Analyses of evolutionary rates among paralogs have, to date, confirmed only a small independent role for expression level. Among a set of 185 yeast paralog pairs, evolutionary rate and expression level in mRNA molecules per cell correlated (r² = 0.341), but the correlation of rate and expression differences between members of a paralogous pair was much smaller (r² = 0.046), and no significant tendency for the higher-expressed paralog to evolve slower was found (5). A recent study that proved the WGD in yeast (21) analyzed patterns of paralog evolutionary rates and concluded that they supported a widely cited model of evolution by gene duplication (23) in which one duplicate gene retains the ancestral function and evolves slowly, whereas the other duplicate gene evolves rapidly and acquires a new function. Such behavior would obscure the influence of other variables such as expression level on paralog evolutionary rates.

Recently, several resources have become available that allow a more thorough analysis of these issues: a set of 900 S. cerevisiae paralogs derived from gene synteny and traceable to the WGD event (21), a global measurement of yeast protein abundances (24), and several additional yeast genome sequences (21, 25). Here, using this new information, we examine the strength, independence, and physical basis of expression-based constraints on protein sequence evolution. We carry out a systematic analysis designed to answer several questions. How strongly does expression constrain yeast protein evolution after controlling for structure and function? What role does functional differentiation play, compared with gene expression, in predicting the relative evolutionary rates of duplicate genes? And, what do these correlations reveal about underlying causes of evolutionary rate differences? We introduce a previously unexplored hypothesis to explain why highly expressed proteins evolve slowly and test this explanation against other causal hypotheses by using genome-wide data. Finally, we explore whether the selective pressure that we propose increases functional density and examine the biological costs underlying it.

Materials and Methods

Gene Sequences. Genome sequences for S. cerevisiae, Saccharomyces kudriavzevii, Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus were obtained from the Saccharomyces Genome Database (ftp://genome-ftp.stanford.edu). The genome sequence of Kluyveromyces waltii was obtained from ref. 21.

Identification of Orthologs and Paralogs. A total of 900 paralogous S. cerevisiae genes identified by synteny (21) were downloaded. Of these pairs, 290 (580 genes) were nonribosomal proteins with a measured expression level (26) and an ortholog in S. bayanus and were used in our analysis. We excluded ribosomal proteins from all analyses because they tend to be highly expressed and slow-evolving and could skew our results.

Orthologs for S. cerevisiae genes in members of the Saccharomyces genus were found by the reciprocal shortest distance (RSD) algorithm (1) with a protein-protein BLAST software (27) E-value cutoff of 10⁻²⁰, 80% minimum alignable residues, and distances computed as dN by using PAML (see below). RSD yielded 4,255 nonribosomal S. cerevisiae genes with S. bayanus orthologs and a measured expression level; 2,790 genes with S. mikatae orthologs; 4,407 genes with S. paradoxus orthologs; and 2,984 genes with S. kudriavzevii orthologs. The S. paradoxus ortholog set was expanded to include S. cerevisiae matches reported by Kellis et al. (ref. 25). Our data sets are available upon request.

Expression-Level Data. We used gene expression data measured in mRNA molecules per cell by Holstege et al. (26). To estimate variability in expression-level data, we used normalized fluorescence data collected by using the same commercial oligonucleotide array as used by Cho et al. (28), with mean expression levels computed as described in ref. 29. Because laboratory growth media and temperatures may not reflect evolutionarily relevant environmental conditions, potentially distorting expression profiles, we repeated all analyses by using each gene's codon adaptation index (CAI) (30) as an expression-level proxy (10) (see Supporting Text and Fig. 4, which are published as supporting information on the PNAS web site). We assume that species closely related to S. cerevisiae have similar expression profiles.

Measurement of Evolutionary Rates. Orthologous gene alignments were constructed from protein sequences aligned by using ClustalW (31). dN and the number of synonymous substitutions per site (dS) were estimated by maximum likelihood using the PAML software (32) program codeml operating on codons.

Statistical Analysis. We used R (33) for statistical analysis and plotting. To compute correlations on log-transformed dN data, we applied the transformation f(k) = log(k + 0.001) as described in ref. 10 to avoid excluding zeros.

Results

Expression Level and Evolutionary Rate. Using genome-wide measurements of expression level (mRNA molecules per cell) and evolutionary rate (dN) in S. cerevisiae, we confirm that expression level strongly predicts protein evolutionary rate. Fig. 1a shows that expression level alone explains between one-quarter and one-third of the uncorrected variance in dN for 4,255 S. cerevisiae proteins with S. bayanus orthologs and measured expression levels (Pearson's correlation; r² = 0.28, P ≪ 10⁻⁹) and for the 580 paralogs (290 pairs) (r² = 0.31, P ≪ 10⁻⁹). We find that the strongest simple relationship linking dN and expression is a power law (linear on a log-log scale) and that evolutionary rates span three orders of magnitude. Expression level affects evolutionary rates of duplicated and nonduplicated genes similarly.
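A power-law fit of this kind can be sketched as an ordinary least-squares line through log-transformed values. The snippet below is only an illustration under stated assumptions: the function name `fit_loglog`, the base-10 logarithm, and the toy expression and dN values are hypothetical. The 0.001 offset follows the transformation given in Materials and Methods.

```python
import math

def fit_loglog(expr, dn, offset=0.001):
    """Least-squares line through (log10 expr, log10(dN + offset)).

    Returns (slope, intercept, r2). A power law dN = c * expr**b appears
    as a straight line with slope b on these axes.
    """
    xs = [math.log10(x) for x in expr]
    ys = [math.log10(k + offset) for k in dn]  # offset keeps dN = 0 finite
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2

# Hypothetical data: expression in mRNA molecules per cell, dN per site.
toy_expr = [0.1, 1.0, 10.0, 100.0]
toy_dn = [0.5 * x ** -0.5 for x in toy_expr]
slope, intercept, r2 = fit_loglog(toy_expr, toy_dn)
print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```

A negative slope corresponds to slower evolution at higher expression, and r2 is the fraction of log-scale variance explained.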

Fig. 1.

Expression level governs gene and paralog evolutionary rates in S. cerevisiae. (a) Highly expressed proteins evolve more slowly, and paralogs mirror the genome-wide pattern. Evolutionary rates measured relative to S. bayanus for 4,255 S. cerevisiae genes (gray squares) and 580 paralogous genes (black squares) correlate with expression levels. Lines show best log-log linear fit. For all genes (dashed line), r² = 0.28, P ≪ 10⁻⁹; for paralogs (solid line), r² = 0.31, P ≪ 10⁻⁹. (b) Within a paralog pair, the ratio of expression levels correlates with the ratio of evolutionary rates (r² = 0.29, P ≪ 10⁻⁹), as predicted from the log-log linear relationship in a. Each pair generates two ratio points, making the plot symmetrical. (c) Relative expression level determines relative evolutionary rate. The percentage of pairs in which the higher-expressed paralog evolves slower is shown as a function of minimum paralog pair expression ratio (black squares). Point areas are proportional to the number of included pairs.

Structural or functional differences between proteins with differing expression levels may systematically bias the dN-expression relationship. If the power-law relationship observed across paralogs holds between paralogs in a pair, the ratio of paralog expression levels should correlate linearly with the ratio of evolutionary rates on a log-log scale. Fig. 1b confirms this prediction (r² = 0.29, P ≪ 10⁻⁹) and demonstrates that a more limited previous analysis (5) underestimated this relationship's strength by more than 6-fold.
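The step from a power law across genes to a linear log-log relationship between within-pair ratios can be written out explicitly. In this sketch the prefactor c and exponent β are illustrative symbols that do not appear in the original text, and the derivation assumes that both members of a pair (subscripts 1 and 2) share the same c and β.

```latex
\begin{align}
dN &= c\,x^{\beta} \\
\log dN &= \log c + \beta \log x \\
\log\frac{dN_1}{dN_2} &= \bigl(\log c + \beta\log x_1\bigr) - \bigl(\log c + \beta\log x_2\bigr)
  = \beta \log\frac{x_1}{x_2}
\end{align}
```

Under these assumptions the prefactor cancels, so the ratio plot passes through the origin with slope β, which is negative when higher expression means slower evolution.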

Measurement noise attenuates correlations, possibly obscuring the strength of the relationships that we have examined. For example, yeast gene expression levels measured by different groups correlate with coefficients of only 0.39–0.68 (29). We, therefore, first examined the dependence of relative interparalog evolutionary rate on the degree of expression-level disparity and found a dramatic association (Fig. 1c). For all 290 pairs, in 192 cases, the higher-expressed protein evolved slower (P < 10⁻⁷, binomial test). Among the 19 pairs for which expression differs by at least 18-fold, all of the higher-expressed paralogs have evolved slower and [Inline graphic]. The dN-expression correlation also can be corrected for attenuation, allowing us to determine how much of the explainable variation in dN, variation not due to measurement noise, can be attributed to expression level. Spearman's correction for attenuation in a squared correlation coefficient is $r^2_{\mathrm{true}} = r^2_{\mathrm{obs}} / (r_{\mathrm{expr}}\, r_{dN})$, where $r_{\mathrm{expr}}$ and $r_{dN}$ are the correlations between repeated measurements of each variable. We found that the correlation between two independent measurements of yeast gene expression by using the same commercial oligonucleotide array was $r_{\mathrm{expr}} = 0.72$ (Pearson's correlation; 5,555 genes), and the correlation between dNs that we measured by using orthologs in S. bayanus to those measured by using S. paradoxus orthologs was $r_{dN} = 0.92$ (4,208 genes), yielding an overall $r^2_{\mathrm{true}} = 0.31/(0.72 \times 0.92) \approx 0.47$ for the 580 paralogs and $r^2_{\mathrm{true}} = 0.28/(0.72 \times 0.92) \approx 0.42$ for all 4,255 genes.
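The two calculations in this paragraph, the sign test on paralog pairs and the attenuation correction, can be reproduced with a few lines of arithmetic. This is a rough sketch rather than the authors' R code: the function names are hypothetical, the observed r² values are those in the Fig. 1 caption, and the binomial tail is computed one-sided because the text does not say which form was used.

```python
from math import comb

def disattenuate_r2(r2_obs, reliability_x, reliability_y):
    """Spearman's correction applied to a squared correlation coefficient."""
    return r2_obs / (reliability_x * reliability_y)

def binom_upper_tail(k, n):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, 1/2).

    Integer arithmetic is used until the final division, so large
    binomial coefficients do not overflow floating point.
    """
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

R_EXPR = 0.72  # repeat-measurement correlation of expression (from the text)
R_DN = 0.92    # correlation of dN across two ortholog sets (from the text)

r2_paralogs = disattenuate_r2(0.31, R_EXPR, R_DN)
r2_all_genes = disattenuate_r2(0.28, R_EXPR, R_DN)
p_sign = binom_upper_tail(192, 290)

print(f"corrected r2, paralogs:  {r2_paralogs:.2f}")
print(f"corrected r2, all genes: {r2_all_genes:.2f}")
print(f"one-sided sign-test P for 192 of 290: {p_sign:.2e}")
```

Dividing by the product of the two reliabilities scales the observed r² up to the share of variance that is not measurement noise.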

These analyses lead us to conclude that expression level accounts for roughly half of the explainable variation in yeast protein evolutionary rates, even when considering only proteins with similar structures and functions.

Functional Divergence of Gene Duplicates and Evolutionary Rate. Are the disparate evolutionary rates in paralogous proteins a result of acquisition of new function (“neofunctionalization”) in one paralog (21, 23), or do they simply reflect expression differences? Both explanations predict asymmetric paralog evolutionary rates measured against a preduplication relative. However, only the expression level explanation predicts that asymmetric rates will continue indefinitely, which can be measured by using a postduplication relative in which the genomic upheavals after WGD (massive gene loss, genome rearrangements, neofunctionalization) have long since quieted.

For S. cerevisiae, the preduplication relative K. waltii, which diverged >100 million years ago, allows evaluation of evolutionary rates relative to a single gene descended directly from the ancestral duplicated gene (21) (Fig. 2). S. paradoxus, at present the closest relative of S. cerevisiae with a sequenced genome, with a divergence time of ≈5 million years ago (25), provides a suitable postduplication relative (Fig. 2).

Fig. 2.

Phylogenetic relationships between analyzed yeast species. Relationships follow ref. 49, branch lengths indicate nucleotide substitution distances from ref. 50, and the indicated time of the WGD follows ref. 21.

We found unique S. paradoxus orthologs and measured expression levels for both paralogs in 73 of the 115 paralog pairs claimed to strongly support Ohno's functional divergence model (21) (as above, we excluded ribosomal proteins). In 64 of 73 cases (88%), the faster-evolving paralog relative to K. waltii also has evolved faster relative to S. paradoxus, even though ≈100 million years have elapsed since the duplication event. [When CAI was used as a proxy for expression level, 74 of 84 pairs (88%) showed the same pattern.] In 48 of 52 pairs (92%) in which expression differs at least 2-fold, the higher-expressed paralog evolves slower. Finally, as Fig. 1 shows, duplicated genes obey the same evolutionary-rate-expression relationship as the rest of the genome, and relative expression between paralogs predicts their relative evolutionary rates.

In sum, we find little evidence that functional differentiation causes disparate evolutionary rates among duplicate genes and plentiful evidence for the influence of expression level. A categorical consideration of neofunctionalization models is beyond our scope; we simply note that relative expression level cannot be ignored in evolutionary analyses of gene duplicates.

Causal Hypotheses. Having established the strong and apparently independent correlation of expression level with evolutionary rate, we now turn to our central question: Why do highly expressed proteins evolve slowly? We will first attend to hypotheses offering a unified mechanistic explanation for most or all of expression level's effect, and only then address the possibility that expression level merely aggregates many independent effects to create the illusion of a single cause. In considering unified explanations, we begin by eliminating all of the effects considered in the Introduction: previous analyses have already established that essentiality, dispensability, recombination rate, functional category, amino acid biosynthetic cost, and number or type of protein-protein interactions explain roughly 0-5% of evolutionary rate variation, whereas expression level accounts for >30%.

As Table 1 shows, the nonparametric correlation between expression and dN is twice as strong as that between expression and dS. Nucleotide-level pressures, such as transcription-associated mutation or DNA repair or selection on mRNA structure or stability, cannot be the primary explanation for why highly expressed proteins evolve slowly, because they predict equal expression-linked constraints on dS and dN.

Table 1. Evolutionary rate vs. expression correlations (Kendall's τ) relative to four yeast species for S. cerevisiae genes, including and excluding preferred codons.

                             All codons                Codons with relative adaptedness <0.5
Ortholog (no. of genes)      dN-expr      dS-expr      dN-expr      dS-expr
S. bayanus (2,614)           −0.300***    −0.181***    −0.273***    −0.010
S. mikatae (2,102)           −0.335***    −0.163***    −0.302***    −0.009
S. paradoxus (4,383)         −0.340***    −0.153***    −0.303***    +0.046**
S. kudriavzevii (2,193)      −0.340***    −0.162***    −0.314***    −0.004

All values are Kendall's τ. *, P < 10⁻²; **, P < 10⁻⁴; ***, P < 10⁻⁶.

We now consider three hypotheses for why highly expressed proteins evolve slowly. The first, most concisely phrased by Rocha and Danchin (6), posits that each protein molecule contributes a small amount to organism fitness by performing its function, so mutations that reduce two proteins' functional output (e.g., catalytic rate) equally will have fitness effects weighted by the number of molecules of each protein in the cell, or their abundances, causing the more abundant protein to evolve slower. We call this the “functional loss” hypothesis. Note that a highly expressed protein (whose encoding gene is transcribed at high levels) can have a low abundance (if the mRNA is translated infrequently or the protein is rapidly turned over) and vice versa. The second hypothesis, due to Akashi (18, 19), holds that because increased expression level leads to selection for synonymous codons that are translated faster or more accurately, nonsynonymous mutations to translationally less efficient codons may be evolutionarily disfavored, slowing the rate of amino acid sequence change. We call this the “translational efficiency” hypothesis.

We advance a third hypothesis based on a simple observation: to reduce the number of proteins that misfold due to translation errors, selection can act both on the nucleotide sequence, to increase translational accuracy by optimizing codon usage (34), and on the amino acid sequence, to increase the number of proteins that fold properly despite mistranslation (Fig. 3). We call this increased tolerance for translational missense errors “translational robustness.” At the canonical ribosomal error rate of 5 errors per 10,000 codons translated (35), ≈19% of average-length yeast proteins (415 aa) contain a missense error, and these errors may cause misfolding (36). Proteins vary in their tolerance for amino acid substitutions (37), providing the necessary raw material for evolution, whereas misfolded-protein aggregation and toxicity (36, 38) and production of nonfunctional protein (39) impose burdens on most cellular metabolisms, providing selective pressure. So long as translationally robust sequences are comparatively rare, intensified selection pressure resulting from increased expression level will slow the rate of amino acid substitution in higher-expressed proteins.
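The ≈19% figure follows from the per-codon error rate if errors at different codons are treated as independent, which is an assumption of this sketch and is not stated in the text. The function name is hypothetical.

```python
def fraction_with_missense_error(length_codons, error_rate=5e-4):
    """Probability that a translated protein carries at least one missense error.

    Assumes each codon is mistranslated independently with the same
    probability (5 errors per 10,000 codons by default).
    """
    return 1.0 - (1.0 - error_rate) ** length_codons

# Average-length yeast protein from the text: 415 amino acids.
avg_fraction = fraction_with_missense_error(415)
print(f"{avg_fraction:.1%} of 415-aa proteins contain at least one error")
```

Because the probability rises with length, longer proteins are more exposed to error-induced misfolding at the same per-codon error rate.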

Fig. 3.

Translational selection against the cost of misfolded proteins can act at two distinct points. mRNA (left) may be translated without errors to produce a folded protein (top); if an error is made, the resulting protein may still fold properly, or may misfold and undergo degradation (right). Selection can act at A to increase the proportion of error-free proteins through codon preference (translational accuracy), and also at B to increase the proportion of proteins that fold despite errors (translational robustness). We neglect misfolding of error-free proteins (see text).

These three hypotheses differ in important ways. The functional loss hypothesis points to loss of protein function as the key cost constraining evolution. The translational efficiency hypothesis states that the protein sequence is constrained as a side effect of selection on the mRNA sequence. And the translational robustness hypothesis instead implicates the direct costs of misfolded proteins, independent of function. These hypotheses make testable and opposing predictions, which we now consider.

Functional Loss vs. Translational Robustness. Given two proteins with differing abundances A > a, measured in protein molecules per cell, but oppositely differing expression levels x < X, measured in mRNA molecules per cell, the functional loss hypothesis predicts $dN_{Ax} < dN_{aX}$: the more abundant protein will evolve slower. By contrast, the translational robustness hypothesis states that fitness costs are dominated by translation error-induced misfolding, leading to the opposite prediction ($dN_{Ax} > dN_{aX}$), because despite Ax's higher abundance, aX's higher expression level suggests more frequent translation and turnover (40).

We tested these competing predictions by using a recent global analysis of protein abundance in yeast (24). Ten thousand unique pairs of yeast proteins for which one member had a higher expression level and a lower abundance than the other were assembled at random. In 5,579 of 10,000 pairs, the more abundant but lower-expressed protein evolved faster ($dN_{Ax} > dN_{aX}$, P ≪ 10⁻⁹, binomial test), consistent with translational robustness but contradicting the functional loss hypothesis. When we sampled pairs with at least a 2-fold difference in each measure, limiting the influence of measurement noise, 5,430 of 10,000 pairs showed the same pattern (P ≪ 10⁻⁹). Among synteny-derived paralog pairs, 25 of 48 showed the same pattern (not significant), as did 7 of 8 pairs with 2-fold differences (P < 0.05). When CAI was used as an expression proxy (see Materials and Methods), 6,262 of 10,000 pairs (P ≪ 10⁻⁹) and 17 of 20 paralog pairs (P < 0.002) also showed the same pattern. These results suggest that the number of translation events, a correlate of expression level and CAI, is a better predictor of relative protein evolutionary rates than the number of functional protein molecules.
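The pair-sampling procedure can be sketched as follows. This is a hedged illustration, not the authors' implementation: the record fields (`name`, `mrna`, `abundance`, `dN`), the function names, the attempt cap, and the three toy genes are all hypothetical.

```python
import random

def sample_discordant_pairs(genes, n_pairs, rng, min_fold=1.0):
    """Randomly draw unique gene pairs with opposite abundance and expression order.

    Each returned pair is (g, h), where g has the higher protein abundance
    but the lower mRNA level, each by at least `min_fold`.
    """
    seen, pairs = set(), []
    attempts = 0
    while len(pairs) < n_pairs and attempts < 1000 * n_pairs:
        attempts += 1
        g, h = rng.sample(genes, 2)
        if g["abundance"] < h["abundance"]:
            g, h = h, g  # orient so g is the more abundant protein
        discordant = (
            g["abundance"] > h["abundance"]
            and h["mrna"] > g["mrna"]
            and g["abundance"] >= min_fold * h["abundance"]
            and h["mrna"] >= min_fold * g["mrna"]
        )
        key = (g["name"], h["name"])
        if discordant and key not in seen:
            seen.add(key)
            pairs.append((g, h))
    return pairs

def count_abundant_faster(pairs):
    """Number of pairs in which the more abundant, lower-expressed protein has higher dN."""
    return sum(1 for g, h in pairs if g["dN"] > h["dN"])

# Hypothetical toy records; real input would be genome-wide measurements.
toy_genes = [
    {"name": "geneA", "mrna": 1.0, "abundance": 100.0, "dN": 0.20},
    {"name": "geneB", "mrna": 5.0, "abundance": 10.0, "dN": 0.05},
    {"name": "geneC", "mrna": 2.0, "abundance": 50.0, "dN": 0.10},
]
toy_pairs = sample_discordant_pairs(toy_genes, 3, random.Random(0))
print(count_abundant_faster(toy_pairs), "of", len(toy_pairs), "pairs")
```

The resulting count can then be compared with the 50% expected by chance by using a binomial test, and `min_fold=2.0` corresponds to the stricter 2-fold sampling.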

The functional loss hypothesis rests on the supposition that protein molecules contribute roughly the same amount to organism fitness through their biological function, so that less-abundant proteins are less important to organism fitness. We find this assumption difficult to accept on biochemical grounds. Protein abundance seems to depend mainly on substrate or target availability, which has no obvious relationship to fitness contribution. For example, most gene regulatory proteins and DNA polymerases have only a few hundred targets and correspondingly low cellular abundances yet play crucial cellular roles. Although cells seem unlikely to invest in synthesis of high-abundance proteins without a comparably high return, the inference that low-expression proteins generate low fitness returns does not follow. Accordingly, under the functional loss hypothesis, we should expect low-expression proteins to span the range of evolutionary rates while high-expression proteins evolve under a more uniformly tight constraint. Instead, in yeast, the slowest-evolving low-expression proteins evolve an order of magnitude more rapidly than do their highly expressed counterparts (Fig. 1a). This pattern again supports translational robustness, which supposes that, whereas folded proteins may confer widely varying fitness benefits, misfolded polypeptides impose similar costs.

Translational Efficiency vs. Translational Robustness. Pressure to retain translationally efficient preferred codons will constrain synonymous evolution (dS) and, as a consequence, protein evolution (dN). Pressure for translationally efficient amino acids (19) would bias amino acid preferences at aligned positions in high- and low-expression paralogs. By contrast, translational robustness predicts that the dS and dN constraints reflect two independent points of selection (Fig. 3) and that no consistent translational preference for either codons or amino acids is required to explain the dN trend.

To assess the protein-level constraint attributable to selection for preferred codons, which is strongest at functionally important and conserved sites (36), we computed evolutionary rates by using the portions of genes consisting only of unpreferred codons. Because those sites most constrained by codon preference are removed in these reduced genes, the translational efficiency hypothesis predicts that the correlation of expression level with dS and dN should vanish. Translational robustness hypothesizes a direct constraint on the amino acid sequence, so the dN-expression correlation should remain strong while the dS-expression correlation vanishes, essentially an impossibility if synonymous-site selection for translational efficiency governs protein evolution. Using sets of aligned S. cerevisiae-ortholog genes (see Materials and Methods), we discarded all aligned codons except those where the “relative adaptedness” (30) of the S. cerevisiae codon was <0.5. We then recomputed dN, dS, and their expression correlations by using these reduced genes, discarding genes with <30 codons or dS values of ≥3.0.
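The codon-filtering step can be illustrated as below. The sketch assumes the usual definition of relative adaptedness from the CAI literature, in which a codon's usage in a reference gene set is divided by the usage of the most-used synonymous codon. The function names and the toy counts are hypothetical, and dN and dS would still be estimated afterwards with a tool such as codeml.

```python
def relative_adaptedness(counts_by_amino_acid):
    """Map each codon to its count divided by the top count in its synonymous family.

    counts_by_amino_acid: {amino_acid: {codon: count in a reference gene set}}
    """
    w = {}
    for codon_counts in counts_by_amino_acid.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            w[codon] = count / top
    return w

def reduce_alignment(aligned_codons, w, cutoff=0.5, min_codons=30):
    """Keep aligned codon pairs whose S. cerevisiae codon is unpreferred.

    aligned_codons: list of (cerevisiae_codon, ortholog_codon) tuples.
    Codons missing from `w` are treated as preferred and dropped.
    Returns None when fewer than `min_codons` positions remain.
    """
    kept = [(c, o) for c, o in aligned_codons if w.get(c, 1.0) < cutoff]
    return kept if len(kept) >= min_codons else None

# Hypothetical reference counts for the two lysine codons.
toy_w = relative_adaptedness({"K": {"AAA": 10, "AAG": 40}})
toy_alignment = [("AAA", "AAG"), ("AAG", "AAG"), ("AAA", "AAA")]
print(reduce_alignment(toy_alignment, toy_w, min_codons=1))
```

Genes whose reduced alignment is shorter than 30 codons are dropped, matching the threshold in the text. The dS ≥ 3.0 filter applies after rate estimation and is not shown.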

Table 1 shows that, after removal of preferred codons, the reduced genes showed only slightly reduced dN-expression correlations, whereas the dS-expression correlations all became insignificant or, in the case of S. paradoxus, reversed direction. We found similar results by using CAI as an expression proxy (see Table 2, which is published as supporting information on the PNAS web site). These results demonstrate that expression-linked synonymous selection is concentrated at sites bearing preferred codons and that sites showing no such selection still show strong protein-level constraint, consistent with selection for translational robustness.

Translational efficiency selection on amino acids predicts asymmetric substitution of one amino acid for another in highly expressed proteins. If two amino acids x and y have efficiencies x < y, then at aligned positions in paralogs where both x and y occur, y should disproportionately appear in the higher-expressed paralog. We tabulated these pair-wise frequencies in the 580 paralogs analyzed in Fig. 1 and assessed statistical significance by using a binomial test with the false-discovery-rate correction for multiple tests (41). All residue pairs appeared in our dataset, but no pairs showed asymmetries at the 1% or 5% levels.
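The asymmetry test can be sketched in three steps: tabulate which residue sits in the higher-expressed paralog at each mismatched aligned position, test each unordered residue pair against a 50:50 split, and correct for multiple tests. The text cites ref. 41 without naming the correction, so the Benjamini-Hochberg step-up procedure used here is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from math import comb

def tabulate_mismatches(aligned_residues):
    """Count (residue_in_higher_expressed, residue_in_lower_expressed) mismatches."""
    counts = Counter()
    for high, low in aligned_residues:
        if high != low:
            counts[(high, low)] += 1
    return counts

def two_sided_binom(k, n):
    """Exact two-sided P for k successes in n fair trials (symmetric null)."""
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the set of indices rejected at false-discovery rate alpha."""
    order = sorted(range(len(pvals)), key=pvals.__getitem__)
    m = len(pvals)
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank passing its threshold
    return set(order[:cutoff])

# Hypothetical aligned positions: (higher-expressed, lower-expressed) residues.
toy = [("K", "R")] * 17 + [("R", "K")] * 3 + [("A", "S")] * 5 + [("S", "A")] * 5
counts = tabulate_mismatches(toy)
tests = [("K", "R"), ("A", "S")]
pvals = [two_sided_binom(counts[(x, y)], counts[(x, y)] + counts[(y, x)])
         for x, y in tests]
print(pvals, benjamini_hochberg(pvals))
```

The same tabulation applies unchanged to synonymous codons, which is how the control test described in the following paragraph could be set up.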

As a control, we performed the same test by using synonymous codons and found that 21 codon pairs showed significant asymmetries at the 1% level, invariably favoring the codon with higher relative adaptedness in the higher-expressed paralog (Table 3, which is published as supporting information on the PNAS web site). Of the 21 favored codons, 17 were unique and encoded 13 of the 18 amino acids with synonymous codons.

Our results offer no support for translational efficiency selection on amino acids but confirm such selection on synonymous codons, though with little consequence for dN. Although translational efficiency selection may constrain amino acid sequences to some degree, it cannot explain why highly expressed yeast proteins evolve slowly.

Expression Level Is a Master Causal Variable. We now consider the possibility that many variables (e.g., dispensability, number of protein-protein interactions, amino acid biosynthetic cost, codon preference, recombination rate) independently exert small but cumulatively severe constraining effects on protein sequence evolution, and expression level's influence derives from its relationships to each of these variables. Although such a possibility cannot be ruled out, several observations make it unlikely.

First, expression level is a major determinant of most of the candidate variables: high expression causes decreased dispensability (42), causes more experimentally detected interactions (15), increases pressure for cheaper proteins and higher translational efficiency (18), and, through increased transcription, causes exposed chromatin structures that are hotspots for recombination. No reverse mechanisms have been proposed by which these variables cause genes to become highly expressed.

Second, as we have noted earlier, the degree to which these variables appear to influence evolutionary rate becomes small or even disappears after controlling for expression level. This trend holds for protein-protein interactions (4, 15), recombination rates (43), and amino acid cost in bacteria (6), as well as essentiality, dispensability, network centrality, and gene length (44).

Discussion

We have provided evidence that expression level is the dominant determinant of evolutionary rate in S. cerevisiae genes. Our results show that (i) expression level explains roughly half the variation in gene evolutionary rates; (ii) expression level affects evolutionary rates of duplicated and singleton genes similarly; (iii) once variability in expression level is accounted for, the higher-expressed member of a paralog pair is disproportionately likely to evolve slower; (iv) asymmetric evolutionary rates in duplicated genes persist over tens of millions of years, consistent with expression-level differences but not neofunctionalization; and (v) expression level appears to influence evolutionary rate through the number of translation events rather than cellular protein abundance, constraining the protein sequence directly rather than through translational efficiency selection.

We have introduced a general hypothesis to explain why highly expressed proteins evolve slowly: selection against the expression-level-dependent cost of misfolded proteins favors rare protein sequences that fold properly despite translation errors (Fig. 3). Tests comparing the opposing predictions of this translational robustness hypothesis to two previously advanced alternative hypotheses show that genome-wide yeast data support the predictions of translational robustness and contradict the alternatives. Our hypothesis contradicts the intuitive notion that highly expressed proteins evolve slowly because they are more functionally important, perhaps explaining why more direct measures of functional importance, such as essentiality and dispensability, explain far less variation in evolutionary rates. The hypothesis also provides an explanation for the widely observed correlation between dN and dS (20): Fig. 3 indicates how one cost (misfolding) can be counteracted in two ways (translational accuracy, slowing dS, and translational robustness, slowing dN).

Would more translationally robust proteins have a higher functional density (3)? Consider URA5 and URA10 (orotate phosphoribosyltransferases 1 and 2), paralogs with similar functions that differ >60-fold in expression and 6-fold in evolutionary rate. Do we expect URA5 to have a larger proportion of its residues involved in specific functions? The translational robustness hypothesis suggests not. Instead, functionally unconstrained residues may be more carefully selected to preserve the protein's native structure after missense substitutions in URA5 than in URA10. These residues would contribute to fitness not by aiding in URA5's function, but by preventing the burdensome misfolding of mistranslated polypeptides. Thus, the fitness density of a protein, the proportion of residues under meaningful natural selection, can be larger than the functional density, and directly determines the rate of sequence evolution.

Functional constraints slow evolution at certain sites; our results suggest that these constraints operate on a sequence-wide background rate determined largely by expression. Expression patterns as well as levels may impose additional constraints if highly expressed proteins have unique cellular localization or cell-cycle expression profiles.

How large are the costs underlying translational robustness? We can make a crude general estimate. As mentioned above, ≈19% of average-length yeast proteins will contain a missense error at typical ribosomal error rates. For diverse proteins, 20–65% of amino acid substitutions lead to inactivation (37, 45), generally due to misfolding (37). Consequently, 4–12% of a typical protein species would be expected to misfold because of missense errors. Because yeast protein abundances span five orders of magnitude (24), the fitness impact of error-induced misfolding could range widely. If we assume a 5% misfolding rate, the number of misfolded protein molecules ranges from negligible, as for the ≈3 misfolded molecules required to generate the measured cellular complement of 64 molecules of DSE4 (endo-1,3-β-glucanase), to potentially devastating, as for the ≈63,000 misfolded molecules required to generate 1.26 million molecules per cell of the H+-transporting P-type ATPase PMA1 (24). The latter misfolded species would be more abundant than 97% of yeast proteins (24). We have neglected protein turnover, a further cost multiplier. We also have neglected the misfolding of error-free proteins; a likely biophysical mechanism for increasing translational robustness will also mitigate stochastic misfolding (see below). Protein misfolding generates highly toxic species capable of killing cells in a concentration-dependent manner (46), so increased translational robustness in highly expressed proteins may reflect pressure for survival as well as efficiency.
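The arithmetic behind this crude estimate can be sketched in a few lines of Python. This is an illustrative reconstruction only: the function names are hypothetical, and the code assumes that the misfolded count is approximated simply as abundance times misfolding rate, which appears consistent with the ≈3 and ≈63,000 figures quoted above.

```python
# Crude estimate of error-induced misfolding, using the figures quoted in the text.
P_MISSENSE = 0.19  # fraction of average-length yeast proteins carrying a missense error


def misfold_fraction(p_inactivating, p_missense=P_MISSENSE):
    """Fraction of molecules expected to misfold because of missense errors.

    Assumption (following the text): an inactivating substitution is
    treated as a misfolding event.
    """
    return p_missense * p_inactivating


def misfolded_molecules(abundance, misfold_rate=0.05):
    """Misfolded molecules accompanying a given cellular abundance.

    Assumption: the count is approximated as abundance * misfold_rate.
    """
    return abundance * misfold_rate


low = misfold_fraction(0.20)   # lower bound on inactivating substitutions
high = misfold_fraction(0.65)  # upper bound on inactivating substitutions
print(f"Expected misfolded fraction: {low:.3f} to {high:.3f}")

for name, abundance in (("DSE4", 64), ("PMA1", 1.26e6)):
    count = misfolded_molecules(abundance)
    print(f"{name}: about {round(count):,} misfolded molecules per cell")
```

The bounds of the misfolded fraction follow directly from multiplying the missense fraction by the two inactivation frequencies; neither turnover nor stochastic misfolding of error-free proteins is modeled, mirroring the simplifications stated in the text.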

Can selection for accuracy through codon preference eliminate (or make negligible) such error-induced misfolding costs? Although codon preference cannot counter mistranslation due to misacylation of tRNAs and transcription errors, both of which occur at frequencies approaching those of missense errors (35), experimental measurements of a 4- to 9-fold reduction in missense errors from preferred codons have been reported (47). Assuming all preferred codons are translated 10-fold more accurately than nonpreferred codons, how much accuracy improvement can we expect? Randomly selecting codons produces genes containing ≈35% preferred codons, whereas the most highly expressed genes have >80% preferred codons (only 9 of the 4,255 yeast genes that we analyzed contain >90% preferred codons). Even if translational error-rate measurements reflect the worst case of codon-randomized genes, the maximum accuracy gain in the most optimized genes is roughly 5-fold. In the case of PMA1 (86% preferred codons), such a reduction would still leave thousands of misfolded proteins from this single gene to burden the cell. Although that level of misfolding may represent the “cost of doing business” for the cell, such an argument assumes that mutant versions of PMA1 carried by evolutionary competitors tolerate equivalent numbers of translation errors and generate similar costs. Because a protein's tolerance to substitutions can in some cases be significantly altered with a single mutation (37), we suspect that this assumption is rarely justified. Given variability in misfolding, natural selection will then favor those mutants whose costs undercut their competitors'.
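To make the codon-preference argument concrete, the following Python sketch computes the accuracy gain under a simple linear mixing model. The model is an assumption introduced here for illustration and is not necessarily the calculation behind the roughly 5-fold figure: each codon is either preferred or nonpreferred, preferred codons are translated 10-fold more accurately, and a gene's error rate is the average over its codons. The function names are hypothetical, and the 63,000-molecule starting point is the PMA1 estimate from the 5% misfolding assumption above.

```python
def relative_error_rate(preferred_fraction, accuracy_fold=10.0):
    """Per-codon missense error rate relative to a gene with no preferred codons.

    Assumption: nonpreferred codons have relative error 1, preferred codons
    have relative error 1 / accuracy_fold, and errors average linearly.
    """
    return (1.0 - preferred_fraction) + preferred_fraction / accuracy_fold


def accuracy_gain(preferred_fraction, baseline_fraction=0.35, accuracy_fold=10.0):
    """Fold reduction in error rate relative to a codon-randomized gene
    (about 35% preferred codons, per the text)."""
    baseline = relative_error_rate(baseline_fraction, accuracy_fold)
    optimized = relative_error_rate(preferred_fraction, accuracy_fold)
    return baseline / optimized


PMA1_MISFOLDED = 63_000  # misfolded PMA1 molecules at a 5% misfolding rate

for frac in (0.80, 0.86, 0.90, 1.00):
    gain = accuracy_gain(frac)
    remaining = PMA1_MISFOLDED / gain
    print(f"{frac:.0%} preferred codons: {gain:.1f}-fold gain, "
          f"about {remaining:,.0f} misfolded PMA1 molecules remain")
```

Under these assumptions the gain is about 3-fold at PMA1's 86% preferred codons and approaches 7-fold only for a gene composed entirely of preferred codons, the same order of magnitude as the estimate in the text. In every case thousands of misfolded PMA1 molecules remain, which is the point of the argument.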

A counterintuitive prediction of the translational robustness hypothesis is that selection for proteins that are more tolerant to amino acid change yields underlying genes that appear less tolerant to nucleotide change (because they evolve slowly). How is this result possible? Consider a hypothetical allele of PMA1 for which only 0.1% (≈1,000 molecules) of translated proteins misfold because of errors. A nonsynonymous genetic mutation yielding a functionally equivalent mutant protein that misfolds 5% of the time, producing ≈50,000 potentially toxic proteins, would be evolutionarily disfavored relative to the wild type due to increased misfolding costs without showing any functional difference. Thus, the wild type, despite encoding a highly robust protein that retains function after most mutations, will appear mutationally fragile over evolutionary time. A striking example of this robust-molecule/fragile-gene behavior may be found in ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco), perhaps the most abundant protein on Earth and a rigidly conserved, generally essential enzyme for which genetic studies have nonetheless been hampered by the difficulty of finding inactivating missense mutations (48).

How might translational robustness manifest itself biophysically? We can offer only a speculation. Because most substitutions destabilize the native structure of a protein, modest increases in thermodynamic stability broaden the spectrum of substitutions a protein can tolerate before misfolding (37), increasing fitness as long as function is not compromised. Pressure for increased stability in highly expressed proteins would restrict the set of evolutionarily viable sequences and slow sequence evolution as a consequence.

Supplementary Material

Supporting Information

Acknowledgments

This work was supported by National Institutes of Health National Research Service Award 5 T32 MH19138 (to D.A.D.) and a Howard Hughes Medical Institute Predoctoral Fellowship (to J.D.B.).

Author contributions: D.A.D. designed and performed research; D.A.D., J.D.B., C.A., C.O.W., and F.H.A. contributed new reagents/analytic tools; D.A.D. analyzed data; and D.A.D., J.D.B., C.A., C.O.W., and F.H.A. wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: dN, number of nonsynonymous substitutions per site; dS, number of synonymous substitutions per site; CAI, codon adaptation index; WGD, whole-genome duplication.

References
