Author manuscript; available in PMC: 2008 Dec 11.
Published in final edited form as: Structure. 2007 Nov;15(11):1442–1451. doi: 10.1016/j.str.2007.09.010

Genome-wide structural mapping of protein interactions reveals differences in abundance- and expression-dependent evolutionary pressures

Matt Eames 1,2, Tanja Kortemme 1,2,3
PMCID: PMC2600897  NIHMSID: NIHMS38697  PMID: 17997970

Summary

Genome-wide studies in Saccharomyces cerevisiae concluded that the dominant determinant of protein evolutionary rates is expression level, where highly-expressed proteins evolve most slowly. To determine how this constraint affects the evolution of protein interactions, we directly measure evolutionary rates of protein interface, surface and core residues by structurally mapping domain interactions to yeast genomes. We find that mRNA level and protein abundance, though correlated, report on pressures affecting regions of proteins differently. Pressures proportional to mRNA level slow evolutionary rates of all structural regions and reduce the variability in rate differences between interfaces and other surfaces. In contrast, the evolutionary rate variation within a domain is less dependent on protein abundance. Distinct pressures may be associated primarily with the cost (mRNA level) and functional benefit (protein abundance) of protein production. Interfaces of proteins with low mRNA levels may have higher evolutionary flexibility, and could constitute the raw material for new functions.

Introduction

What are the evolutionary pressures acting on proteins building up biological protein-protein interaction networks? An intuitive expectation that has been tested by many studies (Caffrey et al., 2004; Hu et al., 2000; Mintseris and Weng, 2005; Nooren and Thornton, 2003; Teichmann, 2002) is that amino acids in protein interaction interfaces should be more conserved than protein surfaces not involved in interactions. An underlying assumption here is that interfaces, which encode functionally important signaling, regulatory and structural information, are likely to contain a lower fraction of residues capable of accepting nearly-neutral mutations (Kimura, 1968; King and Jukes, 1969; Ohta, 1973) relative to the remainder of the protein surface. Hence, protein interfaces would have a higher “functional density” and evolve at slower evolutionary rates (defined as the number of nonsynonymous substitutions per site).

Several recent studies have analyzed the degree and mechanisms by which protein-protein interactions constrain the rate of protein sequence evolution (Bloom and Adami, 2003, 2004; Fraser et al., 2002; Fraser et al., 2003; Kim et al., 2006; Mintseris and Weng, 2005). An attractive hypothesis is that proteins with many interaction partners are more likely to have a high functional density of important regions, and should hence be more conserved. While the suggested inverse dependency between a protein’s evolutionary rate and the number of its interaction partners has indeed been observed (Fraser et al., 2002; Pal et al., 2001), this effect may be difficult to separate from the strong correlation of gene expression (measured in mRNA transcripts per cell) and evolutionary rates: both evolutionary rates and number of interaction partners are related to mRNA expression level, where highly expressed proteins evolve most slowly and also have a larger number of interaction partners as detected in high-throughput studies (Bloom and Adami, 2003; Drummond et al., 2006). In addition, the constraints acting on protein interfaces may be difficult to detect by measuring the evolutionary rate of whole proteins because of the relatively small number of residues that constitute interface regions. To circumvent this problem, information on the three-dimensional structures of proteins and protein complexes is needed to separately analyze conservation in protein interfaces and the remainder of the protein (Kim et al., 2006; Mintseris and Weng, 2005). Two recent studies used structural data to support the functional density hypothesis by showing that proteins with a larger number of distinct binding interfaces (Kim et al., 2006) and a smaller proportion of solvent-exposed residues (Lin et al., 2007) are indeed more conserved.

Although structural/functional constraints thus clearly slow the rate of evolution of proteins and protein interfaces (Choi et al., 2006; Kim et al., 2006; Lin et al., 2007), it has not been quantified how protein interfaces are affected by known determinants of protein evolutionary rates such as mRNA level and protein abundance. A recent key study proposed that gene expression level-dependent evolutionary pressures are predominantly due to constraints imposed by robustness against translational errors (translational robustness): Highly expressed proteins would be substantially optimized to still fold despite translational errors to avoid potential toxic accumulation of misfolded species (Drummond et al., 2005). Accordingly, one could expect expression-level dependent pressures to be strongest for amino acids located in protein cores, where mutations should have the most dramatic effect on stability. Following this argument, it is unclear whether the same dominant pressure would persist for amino acid positions in protein interfaces that are surface-exposed in the unbound form of the protein. Due to the relatively smaller contribution of interface residues to protein stability, protein interface residues may be freed from the evolutionary constraints acting on highly expressed proteins, and show evolutionary rates that are largely determined by functional pressures. Alternatively, mRNA expression level could exert a global pressure on all residues in a protein. If such a global effect primarily reflects the frequency of translational events but not the actual amount of protein present, differences between pressures proportional to mRNA and protein levels may be observable.

To address these questions, here we aim to characterize the combined influence of structural, functional, mRNA expression- and protein-level dependent constraints on protein interface evolution. We derive structural information from all characterized protein complexes in the Protein Data Bank (PDB) (Berman et al., 2000), allowing us to classify individual residues as members of the protein core, the surface, or a protein-protein interface. Protein-protein interface residues are further subdivided by structural features proposed to be linked to interaction affinity and specificity, such as burial in interface cores, classification as “anchor” residues that become highly buried upon complex formation (Rajamani et al., 2004), and polar character. Using domain definitions derived from the Pfam database of protein families (Finn et al., 2006) we map these structural characteristics to residues of all matching domains in a single genome, that of Saccharomyces cerevisiae. The availability of high-throughput information specific to S. cerevisiae and sequence information for related fungal genomes enables us to quantify mRNA expression-level and protein abundance-dependent evolutionary constraints on protein interfaces using a nonsynonymous substitution rate metric.

We find, first, that pressures proportional to mRNA expression level appear to slow the evolutionary rates of residues in all structural regions, including interface and other surface regions. Moreover, this global pressure significantly reduces the variability in conservation between protein cores, interfaces and surfaces in proteins with high mRNA levels. Second, while previous work has pointed to the common importance of abundance and expression level (both have been identified as the dominant determinants of evolutionary rates), we detect a distinction between pressures correlated with mRNA-expression and those correlated with protein-abundance levels: The evolutionary rate variation between structural regions within a domain is much less dependent on protein abundance, in contrast to the observed decrease in rate variation at high mRNA level. These findings support a model where each parameter reports on distinct selective pressures that may be associated primarily with the cost (mRNA expression) and functional benefit (protein abundance) of protein production. Interfaces of proteins with low mRNA expression level (lower cost) have higher evolutionary flexibility, and may constitute the raw material for evolving new functions.

Results

We aimed to characterize selective pressures imposed upon protein-protein interfaces. We set out to directly measure evolutionary rates (quantified by the number of non-synonymous mutations per site) of residues at protein interfaces and investigated, first, structural characteristics, second, pressures proportional to mRNA expression level, third, the role of protein abundance, and fourth, functional constraints.

An outline of the computational strategy is given in Figure 1A. To define and characterize protein interfaces, we first required a dataset of known protein complex structures. By mapping these structures to a single, well-studied organism, S. cerevisiae, we aimed to integrate structural information with other high-throughput data such as mRNA expression and protein abundance. To increase structural coverage (the part of the S. cerevisiae genome we could assign to structurally characterized protein-protein interfaces), we used Pfam (Finn et al., 2006) domains instead of whole proteins as the unit of structural mapping, to take advantage of the fact that a single modular domain could have occurrences in dozens of S. cerevisiae proteins. To identify amino acid residues present in protein-protein interfaces (Figure 1B), we compiled a dataset of domain-domain residue-specific contacts from the database of 3D Interacting Domains (3did) (Stein et al., 2005).

Figure 1.

Illustration of the computational strategy. (A) Flow chart outlining the steps for genome-wide structural mapping of S. cerevisiae domains and determination of evolutionary rates for structural subsets. (B) Cartoon depicting the four structural subsets used in this analysis.

Evolutionary Rates of Domain Structural Subsets

To assess the role of structural characteristics in interface conservation, we divided domains into four structural subsets: surface residues, core residues, surface interface residues (interface residues partly exposed in the complex), and buried interface residues (Figure 1B and Methods). Buried interface residues were defined as residues in the interface that became buried upon complex formation but were not part of the protein core in the monomeric form of the domain. For all structural subsets, we computed evolutionary rates as the number of non-synonymous substitutions per site (dN, see Methods). Rates in this paper were calculated by aligning S. cerevisiae proteins to their orthologs in either Candida albicans, Candida glabrata, Saccharomyces bayanus, or Saccharomyces mikatae. Unless otherwise noted, rate data presented were taken from alignments with C. albicans, but similar results were obtained using comparisons to the other fungal genomes (see Table S1 in Supplementary Materials).
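The four-way partition described above can be sketched as a small decision rule. This is only an illustration: the actual burial criteria are given in the Methods and are not reproduced here, so the three boolean inputs (`in_interface`, `buried_in_monomer`, `buried_in_complex`) and the precedence of the core assignment are assumptions made for the example.

```python
from enum import Enum


class Subset(Enum):
    CORE = "core"
    SURFACE = "surface"
    SURFACE_INTERFACE = "surface interface"
    BURIED_INTERFACE = "buried interface"


def classify_residue(in_interface: bool,
                     buried_in_monomer: bool,
                     buried_in_complex: bool) -> Subset:
    """Assign one residue to a structural subset.

    Hypothetical inputs: the burial flags stand in for whatever
    accessibility criterion the Methods define.
    """
    # Assumption: residues already buried in the monomer count as core,
    # even if they also contact a partner domain.
    if buried_in_monomer:
        return Subset.CORE
    if in_interface:
        # Interface residues are split by whether complex formation buries them.
        if buried_in_complex:
            return Subset.BURIED_INTERFACE
        return Subset.SURFACE_INTERFACE
    return Subset.SURFACE
```

Under this rule, a residue that is exposed in the monomer and becomes buried in the complex is a buried interface residue, which matches the definition given in the text.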

We first confirmed that residues classified as part of the core by our mapping procedure have a slower median non-synonymous substitution rate than surface residues (0.223 vs 0.361; P = 9.5e-15, Wilcoxon signed-rank test; Table 1 and Figure 2), in agreement with data from earlier studies (Bloom et al., 2006; Mintseris and Weng, 2005). We also found buried interface and surface interface residues to be significantly more conserved than surface residues outside of the interface (0.231 vs 0.361, P = 2.6e-11; and 0.273 vs 0.361, P = 0.003, respectively). These results indicate that our structural mapping procedure is able to recover expected conservation properties. It should, however, be noted that the surface residue dataset is likely to contain a mixture of both surface and interface residues, given the high probability that many interfaces have not yet been structurally characterized (Mintseris and Weng, 2005; see Methods section in Supplementary Materials). Another source of error in our assignment of structural subsets is that not all structurally characterized domain interface residues compiled from multiple organisms may actually be involved in interface formation in fungal species (see Discussion).
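For readers unfamiliar with the test, a minimal sketch of a paired Wilcoxon signed-rank comparison is given below. It assumes that the two dN values being compared come from the same domain (for example, core versus surface), and it uses the large-sample normal approximation without a tie correction, so it is not a substitute for a statistics package such as `scipy.stats.wilcoxon`. The function name and return convention are illustrative only.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation, two-sided).

    Returns (W_plus, W_minus, p_value). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis W has mean n(n+1)/4 and variance n(n+1)(2n+1)/24.
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return w_plus, w_minus, p_value
```

If core residues are slower than surface residues in nearly every domain, almost all of the rank mass falls on one sign and the P value becomes small.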

Table 1.

The median non-synonymous substitution rates (dN) for the 4 structural subsets. “High exp” and “low exp” constitute the 17.5% highest and lowest expressed genes found in our dataset. “High ab” and “low ab” constitute the 17.5% most and least abundant proteins in our dataset.

| Structural subset | Total dN | High exp. dN | Low exp. dN | High ab. dN | Low ab. dN |
|---|---|---|---|---|---|
| Core | 0.22 | 0.12 | 0.38 | 0.18 | 0.26 |
| Buried interface | 0.23 | 0.02 | 0.36 | 0.11 | 0.24 |
| Surface | 0.36 | 0.19 | 0.53 | 0.28 | 0.42 |
| Surface interface | 0.27 | 0.17 | 0.47 | 0.21 | 0.36 |

Figure 2.

Distributions of non-synonymous substitution rates, dN, for residue subsets with different structural characteristics. Boxes enclose the first and third quartile of the distribution and display a notch at the median; whiskers extend outward to the most extreme data point no more than three times the interquartile range from the box. Data points outside this range are drawn individually. As defined by a Wilcoxon signed-rank test, the dN distributions for all structural characteristics are significantly different from each other (P < 0.005) with the exception of “buried residues” and “buried interface residues” (P = 0.18, not significant). Four outliers fall outside the upper boundary.

Expression-Dependent Pressures

It has been well documented that gene expression level, measured as mRNA molecules per cell, is the single most important determinant of evolutionary rates in proteins, at least in unicellular organisms such as S. cerevisiae (Drummond et al., 2005; Drummond et al., 2006; Pal et al., 2001). Three principal arguments have been set forth to explain the negative correlation between divergence and expression level, and the “translational robustness” hypothesis was shown to agree best with the data (Drummond et al., 2005). This model postulates that given the expression level-dependent cost of misfolding proteins, highly expressed proteins will be under selective pressure to favor amino acid sequences that fold properly despite translational errors. On account of the importance of the protein core for folding, we reasoned that gene expression level would evolutionarily constrain core residues to a greater degree than surface residues. To detect this effect, we measured the difference in nonsynonymous substitution rate between core residues and surface residues for each protein domain as a function of mRNA expression level (Figure 3A).

Figure 3.

The evolutionary rate differences between structural regions within the same domain are reduced at high mRNA expression levels. Shown is the difference in dN, as a function of mRNA expression level, between (A) the core residues and surface residues subsets for each domain or (B) the buried interface residues and surface residues sets. One outlier falls outside the limits in (B).

Moreover, we wanted to address the question of whether a similar selective pressure would also affect protein residues buried only upon interface formation. We imagined two different scenarios: Since protein interfaces (of transient interactions) are exposed to solvent in the uncomplexed form, interface mutations are less likely than core mutations to substantially destabilize the monomer. It follows that interface mutations may not significantly alter a protein’s misfolding probability after translational errors. In this case and under the translational robustness hypothesis, we would expect the evolutionary rates of protein interface residues to be largely uncorrelated with expression level. Alternatively, mRNA expression level could exert a strong global pressure on all protein residues such that the evolutionary rates of surface, core, and buried interface residues are similarly dependent on mRNA expression level. To test these possibilities, we computed the difference in evolutionary rates of buried interface residues and surface residues for each domain considered as a function of mRNA expression level (Figure 3B). We used interface residues buried upon complex formation, as opposed to all interface residues; we expected selective pressure for this set to be most easily detectable (see Figure 3B) as we reasoned that buried interface portions are likely to be less susceptible to alignment errors made in our structural mapping procedure.

We found that the magnitude of evolutionary rate differences for core versus surface residues, ΔdN(core-surface), or buried interface versus surface residues, ΔdN(interface-surface), was highly dependent on mRNA expression level (Figure 3). For proteins with medium and low expression levels, protein cores have a notable tendency to be more conserved than the surfaces of their corresponding domains, as expected. This is not generally the case for buried interface residues, which have the freedom to evolve faster or slower than surfaces in proteins with low mRNA levels (Figure 3B). The significance of these fast evolving interfaces is not clear, and they could be artifacts resulting from assumptions made in our structural mapping procedure. Examination of proteins with a higher relative evolutionary rate of interface to surface residues did not reveal any biases in protein function or number of interaction partners (data not shown). Nevertheless, we observe that the difference in evolutionary rates between surface and core residues, and between surface and buried interface residues, becomes progressively smaller with increasing mRNA level.

Next we asked whether the observed reduction in evolutionary rate differences between structural regions at high mRNA expression levels was primarily due to a progressively higher pressure on core residues or whether all structural regions were affected to some extent by a pressure proportional to mRNA level. Figure 4A indeed shows a global reduction in evolutionary rate of all structural regions with increasing mRNA level, which is surprising given the expected stronger pressure on core residues in accord with the translational robustness hypothesis.
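The binning behind this comparison can be sketched as follows. The example assumes that domains are ranked by mRNA expression level and split into four bins of near-equal size (the 1–25%, 26–50%, 51–75% and 76–100% bins named in the Figure 4 legend), and that one dN value per domain is available for the structural subset of interest; the function name and input layout are hypothetical.

```python
from statistics import median


def median_dn_by_quartile(records):
    """Median dN per expression quartile for one structural subset.

    records: list of (expression_level, dN) pairs, one per domain.
    Returns four medians, ordered from the lowest to the highest
    expression quartile.
    """
    if len(records) < 4:
        raise ValueError("need at least four domains to form four bins")
    ranked = sorted(records, key=lambda pair: pair[0])
    n = len(ranked)
    # Integer arithmetic keeps the four bins contiguous and non-empty.
    bins = [ranked[i * n // 4:(i + 1) * n // 4] for i in range(4)]
    return [median(dn for _, dn in quartile) for quartile in bins]
```

Running the same function with protein abundance in place of expression level gives the corresponding abundance bins; a global pressure shows up as medians that fall from the first to the last bin in every structural subset.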

Figure 4.

Pressures proportional to a protein’s mRNA expression level affect the degree of conservation in all structural subsets, whereas the effects of protein abundance are substantially weaker. Shown are the median dN values for protein sets binned by increasing mRNA expression level (A) or protein abundance (B). As defined by a Wilcoxon signed-rank test, the dN distributions for all expression bins are significantly different from each other within each structural subset class (P ≤ 0.004, with the exception of 26–50% vs 51–75% for all four subsets). The dN distributions for all abundance bins are not significantly different within each structural subset class (P > 0.01, with the exception of 1–25% vs 76–100% for all four subsets). For a list of P values, see Table S3.

Because relatively few proteins shown in Figure 3A have very high mRNA expression levels, we investigated whether the observed dependencies were the result of a small sampling set for higher expression regimes. We assessed the statistical significance of evolutionary rate distributions in both the upper and lower expression extremes (the 17.5% highest and lowest expressed genes). We found that the reduction in the variability of ΔdN(interface-surface) values in highly expressed proteins relative to a lower expression set of equal size was significant (P = 0.0004, Figure 5A). This trend is independent of how the percentiles of the upper and lower expression extremes are chosen, and can even be observed by dividing the expression distribution into two halves (see Table S2 in Supplementary Materials). This constraint is also observed to restrict the variability in ΔdN(core-surface) values in proteins with high mRNA levels (Figure S1A in Supplementary Materials). The observed narrowing of evolutionary rate variability within a domain is not merely a consequence of an overall lower rate of proteins with high mRNA levels, as normalizing ΔdN(interface-surface) by dN of the domain also showed a significant difference between proteins with high and low mRNA levels (P = 0.001).
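A minimal sketch of the comparison of expression extremes is given below. It assumes each domain is a dictionary with hypothetical keys `mrna`, `dn_buried_interface` and `dn_surface`, selects the lowest and highest 17.5% of domains by mRNA level, and computes the two-sample Kolmogorov-Smirnov statistic (the maximum distance between the two empirical cumulative distributions). Only the statistic is computed here; a P value would normally be taken from a statistics package such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def expression_extremes(domains, fraction=0.175):
    """Return (lowest, highest) equal-sized sets of domains ranked by mRNA level."""
    ranked = sorted(domains, key=lambda d: d["mrna"])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def delta_dn(domain):
    """ΔdN(interface-surface) for one domain; negative means a slower interface."""
    return domain["dn_buried_interface"] - domain["dn_surface"]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is sufficient to find the largest gap.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A typical use would be `low, high = expression_extremes(domains)` followed by `ks_statistic([delta_dn(d) for d in low], [delta_dn(d) for d in high])`. Because the statistic compares whole distributions, it is sensitive to the narrowing of the ΔdN spread and not only to a shift in its median.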

Figure 5.

mRNA expression level reduces the evolutionary rate differences between buried interface and surface residues of the same domain, while protein abundance does not. The distribution of differences in dN between the buried interface residues and surface residues sets as grouped by expression extremes (A) or abundance extremes (B). The extremes reflect the 17.5% highest and lowest expressed genes (A) or abundant proteins (B). Using the Kolmogorov-Smirnov test, we found a significant difference between the two expression distributions, but not between the two abundance distributions (P = 0.0004 for (A), P = 0.835 for (B)). One outlier falls outside the limits in (B).

Ribosomal proteins are known to be both highly conserved and highly expressed. To test for bias caused by these special characteristics of ribosomal proteins, we repeated the analysis excluding ribosomal domains (Figure S2 in Supplementary Materials). The difference in the distributions of evolutionary rate differences between buried interface and surface residues for highly and lowly expressed proteins was still significant (Kolmogorov-Smirnov test, P = 0.0023). We obtained similar statistical significance values when computing evolutionary rates relative to other fungal genomes (see Table S1 in Supplementary Materials).

Abundance-Dependent Pressures

mRNA expression level and protein abundance are correlated, but to varying extents (Greenbaum et al., 2003). We hence asked whether the observed dependence of the ΔdN(interface-surface) distributions on mRNA expression level is also observed when using protein abundance data. We found that protein abundance does not restrict ΔdN(interface-surface) in the same fashion that mRNA expression level does (Figure 5B, see Figure S1B in Supplementary Materials for ΔdN(core-surface)). In contrast to what we observed with mRNA expression level, the distributions in ΔdN(interface-surface) values are not distinguishable between proteins with high and low protein abundance levels. We observe a similar differential effect of abundance and expression level when computing substitution rates of S. cerevisiae relative to other sequenced fungal genomes, such as C. glabrata, S. bayanus and S. mikatae (see Table S1 in Supplementary Materials), and when normalizing ΔdN(interface-surface) by dN(domain), which did not result in a significant difference between proteins with high and low abundance (P = 0.17), but did between proteins with high and low mRNA expression (P = 0.001).

To further confirm the difference between mRNA-level and protein abundance dependent pressures, we tested how pressures proportional to protein abundance affect evolutionary rates of core, buried interface and surface regions separately, as above. As expected, evolutionary rates of all structural regions are less affected by abundance than by mRNA level (Figure 4B). Residues become only slightly more conserved with increasing protein abundance level. Only the highest and lowest protein abundance bins have significantly different evolutionary rates for each of the 4 structural subsets (see Table S3 in Supplementary Materials).

If pressures proportional to mRNA levels in fact limit the evolutionary rates of all protein residues similarly regardless of the residue location on the surface or at an interface (i.e. independent of a functional constraint imposed by protein interactions), a comparison of randomly picked residue patches of similar size as the interface should show the same distribution as a function of mRNA level as depicted in Figure 5A. Figure S3A in the Supplementary Materials confirms this suggestion. Analogously, repeating the residue patch comparison as a function of abundance should not reveal a dependence on protein levels, and it indeed yields results similar to those depicted in Figure 5B (Figure S3B in Supplementary Materials).

Combining the results of Figure 4 and Figure 5, we conclude that first, the evolutionary rates of all structural regions are affected in proportion to mRNA level, but to a significantly lesser extent by protein levels; the effect proportional to protein levels could result largely from the underlying correlation between mRNA and protein levels. Second, the variability in evolutionary rates between different structural regions within a given domain should be correlated with mRNA level but be much less affected by protein abundance. Figure 6 shows that this is indeed the case, where the mean differences of evolutionary rates between surfaces and cores and their standard deviations decrease with mRNA level but stay mainly constant as a function of protein abundance.

Figure 6.

The mean difference in evolutionary rates of core and surface residues within the same domain decreases with mRNA level, but not with protein abundance. The mean ΔdN(core-surface) for proteins is binned by increasing mRNA expression level (A) or protein abundance (B). The whiskers represent standard deviations within each bin.

Conservation of Polar Residues

We next sought to resolve functional pressures that constrain the evolution of protein interfaces. It has been suggested that polar residues are highly conserved in interfaces because they constitute energetic “hot spots” or contribute to interaction specificity (Hu et al., 2000). To confirm this finding in our analysis, we subclassified our structural subsets into polar residues (Asp, Glu, Lys, Arg, Ser, Thr, Asn, and Gln; definitions taken from Hu et al., 2000) and remainder (all residues not classified as polar, including nonpolar residues and aromatics). We found that, with the exception of the buried interface residue set, polar residues in all structural subsets evolved faster than remainder residues, with a high degree of statistical significance (Table 2). However, buried interface residues were unique in that their polar and remainder residues were conserved to a statistically indistinguishable degree (median dN of 0.219 and 0.184, respectively; P = 0.730, Table 2). These results suggest the presence of pressures favoring the conservation of polar residues in protein interfaces.
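The polar/remainder split can be expressed directly from the residue list above. The sketch below assumes residues are given as three-letter codes; the helper names are illustrative, and the computation of dN for each group is not shown.

```python
# Polar set as listed in the text (definitions taken from Hu et al., 2000).
POLAR = frozenset({"ASP", "GLU", "LYS", "ARG", "SER", "THR", "ASN", "GLN"})


def is_polar(residue_name: str) -> bool:
    """True if the three-letter residue code is in the polar set."""
    return residue_name.upper() in POLAR


def split_polar(residue_names):
    """Split residues into (polar, remainder), preserving input order."""
    polar = [name for name in residue_names if is_polar(name)]
    remainder = [name for name in residue_names if not is_polar(name)]
    return polar, remainder
```

Note that under this definition His, Tyr, Trp and Cys all fall into the remainder class, together with the aliphatic residues.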

Table 2.

Polar buried residues in interfaces are as conserved as all other residues. Polar residues include Asp, Glu, Lys, Arg, Ser, Thr, Asn, and Gln. All other residues are classified as the remainder. The median dN for each dataset is shown.

| Structural subset | Surface | Core | Surface interface | Buried interface |
|---|---|---|---|---|
| dN polars | 0.434 | 0.305 | 0.322 | 0.219 |
| dN remainder | 0.311 | 0.209 | 0.260 | 0.184 |
| P (Wilcoxon) | 2.4e-7 | 1.9e-6 | 0.002 | 0.730 |

Relationship Between Degree of Burial and Evolutionary Rates

Previous research has highlighted the importance of “anchor residues” in protein-protein interactions. Anchor residues were identified as side chains that pack into structurally constrained grooves of their binding partners (Rajamani et al., 2004). Such residues are exposed in the monomeric form of the protein and bury most of their surface area only upon complex formation. To assess the conservation of these anchor residues in our set of domain-domain interactions, we measured the relationship between evolutionary rate and degree of burial upon interface formation (Figure 7). Degree of burial was defined as the difference in number of neighboring Cβ atoms within a 10 Å sphere of the residue of interest between the monomeric and complexed forms of the domain. Interface residues were binned according to their degree of burial, and the median evolutionary rate of each bin was computed. We found a correlation between increase in burial upon complex formation (additional neighbors) and sequence conservation (1–6 additional neighbors, median dN = 0.302; 7–12 additional neighbors, median dN = 0.295; 13–18 additional neighbors, median dN = 0.195; 19–24 additional neighbors, median dN = 0.031). Similar results were obtained when using solvent-accessible surface area (SASA) to define degree of burial (Figure S4 in Supplementary Materials). As defined by a Wilcoxon signed-rank test, the dN distributions for all interface residue bins are significantly different from each other (P < 3e-7) with the exception of the first two bins 1–6, 7–12 (P = 0.58) and the last two 13–18, 19–24 (P = 0.06). Thus we observe a significant conservation signal not only for residues buried in interfaces, which has been shown more generally to lead to proteins with larger interface area being more conserved (Kim et al., 2006), but also specifically for representatives of anchor residues that show an increase of 13–24 neighbors upon complex formation (Figure 7). This result was not altered when using different bin widths (data not shown).
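The degree-of-burial measure can be sketched as a neighbor count. The example assumes that the monomeric form is represented by the domain's own coordinates taken from the complex structure, in which case the difference in neighbor counts reduces to the number of partner Cβ atoms within the sphere. The function names, the coordinate layout (tuples of x, y, z in Å) and the inclusive distance cutoff are assumptions of this illustration.

```python
from math import dist


def neighbor_count(target_cb, other_cbs, radius=10.0):
    """Number of Cβ atoms in other_cbs within `radius` Å of target_cb."""
    return sum(1 for xyz in other_cbs if dist(target_cb, xyz) <= radius)


def degree_of_burial(target_cb, own_domain_cbs, partner_cbs, radius=10.0):
    """Additional Cβ neighbors gained on complex formation.

    own_domain_cbs should exclude the target residue itself.
    """
    monomer = neighbor_count(target_cb, own_domain_cbs, radius)
    complexed = monomer + neighbor_count(target_cb, partner_cbs, radius)
    return complexed - monomer


def burial_bin(additional_neighbors, width=6):
    """Map a count to its (low, high) bin: 1–6, 7–12, 13–18, 19–24, ..."""
    if additional_neighbors < 1:
        return None  # no additional neighbors: not binned as an interface residue
    low = ((additional_neighbors - 1) // width) * width + 1
    return (low, low + width - 1)
```

The `width` parameter makes it straightforward to repeat the binning with other bin widths, as done for the robustness check mentioned in the text.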

Figure 7.

The evolutionary rates of interface residues depend on the degree of burial upon complex formation. As defined by a Wilcoxon signed-rank test, the dN distributions for all degree of burial bins are significantly different from each other (P < 3e-7) with the exception of the first two bins 1–6, 7–12 (P = 0.58) and the last two 13–18, 19–24 (P = 0.06). Bin 1–6 represents 22,210 residues; bin 7–12, 14,479 residues; bin 13–18, 3,229 residues; bin 19–24, 418 residues. Nine outliers fall outside the upper boundary.

To detect other possible biases in the binned datasets, which could arise from unusual structural/functional characteristics of domains containing residues with high degrees of burial (such as domains largely buried within complexes) or from expression level biases, we also measured the evolutionary rate of surface residues of the domains in each bin. The median values for surface evolutionary rates ranged from 0.35 to 0.37. Differences between the bins were found to be statistically insignificant (P > 0.2), suggesting that the observed conservation signal for anchor residues is not substantially biased by expression-level dependent correlations.

Discussion

We set out to combine genome-wide structural mapping with systems-level data on mRNA expression and protein abundance to characterize pressures on the evolution of proteins and protein-protein interfaces in yeast genomes. While expression level has been established as the dominant determinant of protein evolutionary rates, we aimed to dissect the influence of functional local evolutionary pressures acting on residues in protein interfaces proportional to mRNA expression-level and protein abundance. Our analysis reveals two main new results: First, we show that mRNA expression level has a profound influence on the evolutionary rates of protein interfaces relative to protein surfaces: for proteins with high mRNA levels, the distribution of ΔdN(interface-surface) values is substantially narrowed (Figure 5A). This suggests that interface residues possess substantially more evolutionary flexibility in proteins with low mRNA expression level, allowing such interfaces to evolve at different rates compared to residues on other surfaces of the same domains. Similar results are observed for the ΔdN(core-surface) distributions. Second and in contrast, this restriction in the variation of evolutionary rates between different regions in the same protein domain is to a much lesser extent proportional to protein abundance (Figure 5B). This indicates that despite some correlation between abundance and expression level, their differential effects reflect separate selective pressures.

Prior work has suggested that the selective pressure to evolve sequences robust enough to fold properly despite translational errors is the primary constraint on highly expressed proteins (Drummond et al., 2005). Following this “translational robustness” hypothesis, we initially reasoned that protein cores (but not necessarily protein interfaces exposed in the uncomplexed form) may be affected most dramatically by protein expression level, as non-synonymous core mutations would be expected to be most destabilizing. However, the evolutionary rates of all structural regions (core, surface and interface residues) appear to be slowed by pressures proportional to mRNA expression level (Figure 4A), including translational robustness. Though expression level is likely to reflect multiple selective pressures, it is apparent that one such pressure affects all residues, irrespective of their structural location.

Proteins with low mRNA expression levels are affected to a lesser extent by these global pressures, and hence variability in selective pressures targeting structural subsets becomes detectable (Figure 5A). In addition, our data suggest that in proteins with low mRNA levels, core residues and residues buried in the interface may not be under the same pressures. Core residues rarely evolve faster than the surface, which contrasts with the apparent freedom of residues buried in interfaces to evolve at rates both faster and slower than the surface (Figure 3). It is difficult to gauge the significance of such quickly evolving interfaces. They may be a result of recently evolved functionality, or merely a byproduct of caveats intrinsic to our structural mapping procedure (see below).

As with mRNA expression levels, protein abundance is also known to strongly correlate with evolutionary rate (Drummond et al., 2006). This finding has been attributed to a correlation between expression level and protein abundance. Hence, the observed differences in the evolution of protein interfaces as a function of mRNA expression level and protein abundance (compare Figures 5A and 5B) are counterintuitive. However, our finding supports evolutionary mechanisms in which restraints are linked to the frequency of translation events, which have a stronger correlation with expression level than with protein abundance (Drummond et al., 2005; Greenbaum et al., 2003).

Two broad classes of selective pressures may be at play: pressures associated with production, which could be thought of as operating globally across the nucleotide sequence, and pressures associated with benefit, which may selectively target local protein structural and functional characteristics. mRNA expression level and protein abundance, which are somewhat correlated, will reflect selective pressures within both classes. However, the disparate evolutionary trends observed when contrasting expression level and protein abundance suggest that the two parameters show differing degrees of correlation within each class.

mRNA expression level may in large part reflect cost-dependent pressures. In particular, mRNA expression has been related to both codon efficiency and translational robustness, which measure potential costs due to protein misfolding. The fact that evolutionary rates of all structural regions are reduced proportionally to mRNA level (Figure 4A) implies a correlation with pressures operating globally upon the sequence. In contrast, protein abundance may more directly report on benefit-dependent pressures related to protein structure and function. Structural/functional constraints can act locally; indeed, localized constraints, including those observed in protein interfaces, are still distinguishable by substantial variability of evolutionary rates of different structural regions in highly abundant proteins, despite pressures that may affect whole sequences uniformly. This conclusion is further supported by comparing evolutionary rate differences of buried interface and surface residues between pairs of proteins in which one member has a higher abundance and a lower mRNA expression level than its partner: for 10,000 randomly generated pairs meeting this definition, 5,814 of those with higher abundance had a greater evolutionary rate difference between buried interface and surface residues (P < 2.2e-16). A more detailed dissection of expression- and abundance-dependent evolutionary forces will require quantitative comparisons of the cost and benefit of protein production in mutant populations.
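The text does not state which test produced the reported P value for the paired comparison. As a rough plausibility check, and assuming a one-sided exact binomial (sign) test against a null of 0.5 with independent pairs, the count of 5,814 out of 10,000 can be evaluated as sketched below. The function name is hypothetical and the choice of test is an assumption, not the authors' documented procedure.

```python
from math import comb


def one_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5).

    Assumption: pairs are treated as independent trials, which randomly
    re-sampled protein pairs may not strictly satisfy.
    """
    # Build the upper tail with exact integers to avoid floating-point underflow.
    term = comb(trials, successes)
    tail = 0
    for k in range(successes, trials + 1):
        tail += term
        # C(n, k+1) = C(n, k) * (n - k) / (k + 1); the division is always exact.
        term = term * (trials - k) // (k + 1)
    return tail / 2 ** trials


p_value = one_sided_binomial_p(5814, 10_000)
print(p_value < 2.2e-16)
```

Under these assumptions the tail probability falls far below the 2.2e-16 floor that R typically prints, which is consistent with the reported bound.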

The evolutionary trends we have uncovered are strong enough to persist through the approximations we have made, including completeness of interface identification from known complexes in the PDB (see Results & Methods), functionality of each interface inferred from the PDB in S. cerevisiae, and accuracy of alignment between the PDB and S. cerevisiae sequences. While the preservation of interface functionality between organisms and domain family members used in this analysis is difficult to estimate, it is likely a source of error. We can, however, measure the evolutionary rates of structural subsets in S. cerevisiae proteins that have been crystallized, where the data allow correct structural classification. This test, though the small number of S. cerevisiae structures limits the statistical significance, reveals that functional interfaces exhibit conservation similar to that observed in Figure 2 for inferred interfaces. Another possible explanation for the disparity in selective pressures observed for highly and lowly expressed proteins is bias in the sequence-structure alignments. When we investigated this possibility, we indeed found that highly expressed proteins had a statistically higher sequence identity in alignments made with PDB domains. However, further analysis separating highly and lowly expressed proteins into bins of comparable alignment identity showed that results similar to those summarized in Figure 3 and Figure 5 are obtained regardless of the level of sequence identity (see Methods).

From the evidence presented in this paper, we speculate that proteins with low mRNA levels have sufficiently relaxed evolutionary constraints to serve as the raw genetic material for new genes. If this is indeed the case, it would provide insight into how proteins evolve, from the mutational trajectory of proteins following gene duplication events to modified approaches for directed evolution experimentation. Modulation of not only gene expression patterns but also gene expression levels has been suggested to contribute substantially to evolutionary changes (Pal et al., 2006). Lowering mRNA expression but not necessarily protein abundance levels may increase the mutational flexibility in naturally occurring systems and directed evolution experiments.

Experimental Procedures

Determination of Interface Residues

Residues were identified as participating in domain-domain interfaces as designated through either side-chain-side-chain or side-chain-main-chain contacts. Contacts were identified using the criteria established in the 3did (Stein et al., 2005), with interacting residues defined as having one or more of hydrogen bonds (N-O distances ≤ 3.5Å), salt bridges (N-O distances ≤ 5.5Å), or van der Waals interactions (C-C distances ≤ 5Å). Using the domain definitions and domain boundaries listed in the 3did, protein domain sequences and their corresponding interacting residues were compiled as the “PDB domain dataset” (7,471 sequences, 580 different domains; Figure 1A).
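The distance criteria above can be sketched as a small classifier. This is a minimal illustration, not the 3did implementation: the atom tuple layout, the `is_charged` flag used to separate salt bridges from hydrogen bonds, and the function name are all assumptions, and the caller is assumed to have already restricted the atoms to side-chain-side-chain or side-chain-main-chain pairs.

```python
from math import dist

# Distance cutoffs in angstroms, as quoted in the text.
HBOND_NO_MAX = 3.5
SALT_BRIDGE_NO_MAX = 5.5
VDW_CC_MAX = 5.0


def contact_types(atoms_a, atoms_b):
    """Return the set of contact types found between two residues.

    Each atom is a hypothetical (element, is_charged, (x, y, z)) tuple.
    Assumption: a salt bridge requires both the N and the O to be charged.
    """
    found = set()
    for elem_a, charged_a, xyz_a in atoms_a:
        for elem_b, charged_b, xyz_b in atoms_b:
            d = dist(xyz_a, xyz_b)
            if {elem_a, elem_b} == {"N", "O"}:
                if d <= HBOND_NO_MAX:
                    found.add("hydrogen_bond")
                if charged_a and charged_b and d <= SALT_BRIDGE_NO_MAX:
                    found.add("salt_bridge")
            elif elem_a == "C" and elem_b == "C" and d <= VDW_CC_MAX:
                found.add("van_der_waals")
    return found


def residues_interact(atoms_a, atoms_b):
    """Two residues interact if at least one contact type is present."""
    return bool(contact_types(atoms_a, atoms_b))
```

For example, an uncharged N and O separated by 3.0 Å would yield only a hydrogen bond under this sketch, while two carbons 6.0 Å apart would yield no contact.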

Mapping Residues to Saccharomyces cerevisiae

Domain sequences, using Pfam domain definitions, of all S. cerevisiae proteins containing domains observed in the PDB domain dataset were collected as the S. cerevisiae domain dataset (1,570 domains from 950 proteins). Alignments between sequences in the PDB domain dataset and the S. cerevisiae domain dataset were made using CLUSTALW (Thompson et al., 1994). S. cerevisiae residues that aligned with PDB interface residues were identified as contributing to the interface, unless the position was the site of an insertion/deletion. For every S. cerevisiae domain, the interface was defined as the sum of all mapped interface residues from every domain-domain alignment.
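The mapping step described above can be illustrated with a short sketch. It assumes each alignment is available as two equal-length gapped strings (with "-" for gaps) and that PDB interface residues are given as 0-based positions in the ungapped PDB domain sequence; the function names and input layout are hypothetical.

```python
def map_interface(pdb_row, yeast_row, pdb_interface):
    """Map PDB interface positions onto yeast domain positions.

    Columns where either sequence has a gap (an insertion/deletion)
    are skipped, so no interface residue is transferred there.
    """
    if len(pdb_row) != len(yeast_row):
        raise ValueError("alignment rows must have equal length")
    mapped = set()
    pdb_pos = yeast_pos = 0
    for p, y in zip(pdb_row, yeast_row):
        if p != "-" and y != "-" and pdb_pos in pdb_interface:
            mapped.add(yeast_pos)
        # Advance each ungapped coordinate only when a residue is present.
        if p != "-":
            pdb_pos += 1
        if y != "-":
            yeast_pos += 1
    return mapped


def domain_interface(alignments):
    """Combine mapped residues from every (pdb_row, yeast_row, pdb_interface) triple."""
    combined = set()
    for pdb_row, yeast_row, pdb_interface in alignments:
        combined |= map_interface(pdb_row, yeast_row, pdb_interface)
    return combined
```

Here the "sum of all mapped interface residues" is implemented as a set union, so a yeast position mapped from several PDB domains is counted once.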

Previous research addressing the relationship between sequence and interaction divergence has defined the 20–30% identity regime as the “twilight zone” at which structural similarities between sequences begin to break down, but has found a reduced identity cutoff for reliable interaction modeling using the Pfam domain classification employed here (Aloy et al., 2003). To test the effects of alignment error on our results, we recomputed our data using a minimum 20% identity in PDB-S. cerevisiae alignments for mapping. The difference in evolutionary rates of interfaces of highly and lowly expressed proteins was retained in the recomputed results, as expected given that the vast majority of our domains align above this threshold. It has also been put forth that an 80% sequence identity should be used to guarantee true interactions (Yu et al., 2004); our results support the conservation of interface residues, but do not address questions of interaction specificity, which most likely depends on higher-resolution features of protein interfaces.
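A minimum-identity filter of this kind can be sketched as follows. The text does not specify how percent identity was computed, so the denominator used here (columns in which both sequences have a residue) is an assumption, as are the function names.

```python
MIN_IDENTITY = 20.0  # percent, the threshold quoted in the text


def percent_identity(row_a, row_b):
    """Percent identity over alignment columns where neither row has a gap."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


def passes_identity_filter(row_a, row_b, minimum=MIN_IDENTITY):
    """Keep an alignment for mapping only if it reaches the minimum identity."""
    return percent_identity(row_a, row_b) >= minimum
```

Other common definitions divide by the full alignment length or by the shorter sequence length, which would give lower values for gapped alignments.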

Determination of Buried and Surface Residues

Surface and buried residues were identified in PDB structures by tallying the number of Cβ atoms of neighboring residues within a 10Å sphere around the Cβ atom of the residue of interest (≤16 Cβ neighbors for surface residues, ≥25 Cβ neighbors for buried residues). Buried residues were determined in both monomeric form (analyzing neighbors on the same chain, “core residues”) and complexed form (analyzing neighbors in the entire structure, “residues buried in complex”). Surface residues were only determined in the monomeric form. “Buried interface residues” were identified as the intersection of “interface residues” and “residues buried in complex”, but removing “core residues”; “surface interface residues” were defined as the intersection of “interface residues” and “surface residues”; the “surface residues” set excluded “buried interface residues”. (For illustration of the structural sets, see Figure 1B).
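The neighbor-count classification and the set definitions above can be sketched as follows. This assumes one Cβ coordinate per residue, that a residue does not count itself as a neighbor, and that the 10 Å boundary is inclusive; none of these details is stated in the text. The "intermediate" label for counts between the two thresholds and all function names are hypothetical.

```python
from math import dist


def neighbor_counts(cb_coords, radius=10.0):
    """For each residue, count the other C-beta atoms within `radius` angstroms."""
    counts = []
    for i, xyz_i in enumerate(cb_coords):
        n = sum(
            1
            for j, xyz_j in enumerate(cb_coords)
            if j != i and dist(xyz_i, xyz_j) <= radius
        )
        counts.append(n)
    return counts


def classify(count):
    """Apply the thresholds from the text: <=16 surface, >=25 buried."""
    if count <= 16:
        return "surface"
    if count >= 25:
        return "buried"
    return "intermediate"


def structural_sets(interface, core, buried_in_complex, surface_monomer):
    """Derive the residue sets from the monomer and complex classifications."""
    buried_interface = (interface & buried_in_complex) - core
    surface_interface = interface & surface_monomer
    surface = surface_monomer - buried_interface
    return {
        "buried_interface": buried_interface,
        "surface_interface": surface_interface,
        "surface": surface,
        "core": set(core),
    }
```

To obtain "core residues", `neighbor_counts` would be run on the Cβ atoms of a single chain; for "residues buried in complex", it would be run on all chains of the structure.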

mRNA Expression Levels and Protein Abundance

Gene expression data for S. cerevisiae, measured in mRNA molecules per cell, were taken from Holstege et al. (1998). Protein abundance data were taken from Ghaemmaghami et al. (2003). Results using protein abundance data reported in the text excluded proteins defined as having “no detected expression” in Ghaemmaghami et al. (2003); however, inclusion of these proteins did not significantly affect the results (data not shown).

Evaluation of Evolutionary Rates

Evolutionary rates were obtained by comparing the S. cerevisiae domain-containing protein sequences to their C. albicans, C. glabrata, S. bayanus, and S. mikatae orthologs. Putatively orthologous sequences were identified by means of reciprocal best hits, with the minimum length of the alignable region >80% of the longer protein, and a BlastP (Altschul et al., 1990) cutoff of E < 10^-10. From this set, protein sequence alignments were made with CLUSTALW and then mapped to their respective nucleotide sequences. Interface-containing domains, as defined in the S. cerevisiae domain set, were then used to compute non-synonymous substitution rates (dN) for this alignment subset using the PAML software program CODEML (Yang, 1997). 396 orthologous (cerevisiae-albicans) pairs were used in this analysis. The average domain contained 9.1 “buried interface residues”, 28.6 “surface interface residues”, 67.1 “surface residues”, and 72.6 “core residues”. Analyses performed in this paper were replicated using curated orthologs from the Candida Genome Database (CGD) (http://www.candidagenome.org/), producing similar results (data not shown).
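The reciprocal-best-hit filter can be sketched as below. It assumes the best BlastP hit for each protein has already been parsed into a dictionary, that hits are kept when the E-value is below 10^-10, and that coverage is the aligned length divided by the length of the longer protein. The input layout and names are hypothetical, and for brevity only the forward hit's E-value and coverage are checked.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a, lengths,
                         max_evalue=1e-10, min_coverage=0.8):
    """Return (a, b) pairs that are each other's best hit and pass both filters.

    best_a_to_b and best_b_to_a map a query id to a tuple
    (best_hit_id, evalue, aligned_length); lengths maps ids to protein lengths.
    """
    pairs = []
    for a, (b, evalue, aligned_length) in best_a_to_b.items():
        back = best_b_to_a.get(b)
        # Require that b's best hit in the other genome is a itself.
        if back is None or back[0] != a:
            continue
        longer = max(lengths[a], lengths[b])
        if evalue < max_evalue and aligned_length / longer > min_coverage:
            pairs.append((a, b))
    return pairs
```

The resulting pairs would then be aligned at the protein level and threaded onto their nucleotide sequences before dN estimation, as described above.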

Sequences

Protein and nucleotide sequences were obtained from the Saccharomyces Genome Database (SGD, http://genome-www.stanford.edu/Saccharomyces/) for S. cerevisiae, from the CGD (http://www.candidagenome.org/) for C. albicans, downloaded from ftp://genome-ftp.stanford/pub/yeast/sequence/fungal_genomes/S_bayanus/ for S. bayanus, ftp://genome-ftp.stanford/pub/yeast/sequence/fungal_genomes/S_mikatae/ for S. mikatae and http://cbi.labri.u-bordeaux.fr/Genolevures/ for C. glabrata.

Statistical Analysis

The R package was used for statistical analysis (Ihaka and Gentleman, 1996).

Supplementary Material

01

Acknowledgements

We would like to thank David Agard, Andrej Sali, Patsy Babbitt, David Baker and members of the Kortemme laboratory for insightful comments and stimulating discussion. This work was supported by a grant from the Sandler Program in Basic Sciences. M.E. was partially supported by a training grant from the NIH (GM08284). T.K. is an Alfred P. Sloan Fellow in Molecular Biology.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. Aloy P, Ceulemans H, Stark A, Russell RB. The relationship between sequence and interaction divergence in proteins. Journal of molecular biology. 2003;332:989–998. doi: 10.1016/j.jmb.2003.07.006.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
  3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  4. Bloom JD, Adami C. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol Biol. 2003;3:21. doi: 10.1186/1471-2148-3-21.
  5. Bloom JD, Adami C. Evolutionary rate depends on number of protein-protein interactions independently of gene expression level: response. BMC Evol Biol. 2004;4:14. doi: 10.1186/1471-2148-4-14.
  6. Bloom JD, Drummond DA, Arnold FH, Wilke CO. Structural determinants of the rate of protein evolution in yeast. Molecular biology and evolution. 2006;23:1751–1761. doi: 10.1093/molbev/msl040.
  7. Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES. Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 2004;13:190–202. doi: 10.1110/ps.03323604.
  8. Choi SS, Vallender EJ, Lahn BT. Systematically assessing the influence of 3-dimensional structural context on the molecular evolution of mammalian proteomes. Molecular biology and evolution. 2006;23:2131–2133. doi: 10.1093/molbev/msl086.
  9. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 2005;102:14338–14343. doi: 10.1073/pnas.0504070102.
  10. Drummond DA, Raval A, Wilke CO. A single determinant dominates the rate of yeast protein evolution. Molecular biology and evolution. 2006;23:327–337. doi: 10.1093/molbev/msj038.
  11. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251. doi: 10.1093/nar/gkj149.
  12. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. Evolutionary rate in the protein interaction network. Science. 2002;296:750–752. doi: 10.1126/science.1068696.
  13. Fraser HB, Wall DP, Hirsh AE. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol Biol. 2003;3:11. doi: 10.1186/1471-2148-3-11.
  14. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS. Global analysis of protein expression in yeast. Nature. 2003;425:737–741. doi: 10.1038/nature02046.
  15. Greenbaum D, Colangelo C, Williams K, Gerstein M. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 2003;4:117. doi: 10.1186/gb-2003-4-9-117.
  16. Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA. Dissecting the regulatory circuitry of a eukaryotic genome. Cell. 1998;95:717–728. doi: 10.1016/s0092-8674(00)81641-4.
  17. Hu Z, Ma B, Wolfson H, Nussinov R. Conservation of polar residues as hot spots at protein interfaces. Proteins. 2000;39:331–342.
  18. Ihaka R, Gentleman R. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics. 1996;5:299–314.
  19. Kim PM, Lu LJ, Xia Y, Gerstein MB. Relating three-dimensional structures to protein networks provides evolutionary insights. Science. 2006;314:1938–1941. doi: 10.1126/science.1136174.
  20. Kimura M. Evolutionary rate at the molecular level. Nature. 1968;217:624–626. doi: 10.1038/217624a0.
  21. King JL, Jukes TH. Non-Darwinian evolution. Science. 1969;164:788–798. doi: 10.1126/science.164.3881.788.
  22. Lin YS, Hsu WL, Hwang JK, Li WH. Proportion of solvent-exposed amino acids in a protein and rate of protein evolution. Molecular biology and evolution. 2007;24:1005–1011. doi: 10.1093/molbev/msm019.
  23. Mintseris J, Weng Z. Structure, function, and evolution of transient and obligate protein-protein interactions. Proc Natl Acad Sci U S A. 2005;102:10930–10935. doi: 10.1073/pnas.0502667102.
  24. Nooren IM, Thornton JM. Structural characterisation and functional significance of transient protein-protein interactions. Journal of molecular biology. 2003;325:991–1018. doi: 10.1016/s0022-2836(02)01281-0.
  25. Ohta T. Slightly deleterious mutant substitutions in evolution. Nature. 1973;246:96–98. doi: 10.1038/246096a0.
  26. Pal C, Papp B, Hurst LD. Highly expressed genes in yeast evolve slowly. Genetics. 2001;158:927–931. doi: 10.1093/genetics/158.2.927.
  27. Pal C, Papp B, Lercher MJ. An integrated view of protein evolution. Nat Rev Genet. 2006;7:337–348. doi: 10.1038/nrg1838.
  28. Rajamani D, Thiel S, Vajda S, Camacho CJ. Anchor residues in protein-protein interactions. Proc Natl Acad Sci U S A. 2004;101:11287–11292. doi: 10.1073/pnas.0401942101.
  29. Stein A, Russell RB, Aloy P. 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res. 2005;33:D413–D417. doi: 10.1093/nar/gki037.
  30. Teichmann SA. The constraints protein-protein interactions place on sequence divergence. Journal of molecular biology. 2002;324:399–407. doi: 10.1016/s0022-2836(02)01144-0.
  31. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673.
  32. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555.
  33. Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 2004;14:1107–1118. doi: 10.1101/gr.1774904.
