Quantifying evolutionary importance of protein sites: A Tale of two measures

Avital Sharir-Ivry; Yu Xia

doi:10.1371/journal.pgen.1009476

2021 Apr 7;17(4):e1009476

Editor: Jianzhi Zhang²

PMCID: PMC8026052 PMID: 33826605

Abstract

A key challenge in evolutionary biology is the accurate quantification of selective pressure on proteins and other biological macromolecules at single-site resolution. The evolutionary importance of a protein site under purifying selection is typically measured by the degree of conservation of the protein site itself. A possible alternative measure is the strength of the site-induced conservation gradient in the rest of the protein structure. However, the quantitative relationship between these two measures remains unknown. Here, we show that despite major differences, there is a strong linear relationship between the two measures such that more conserved protein sites also induce stronger conservation gradient in the rest of the protein. This linear relationship is universal as it holds for different types of proteins and functional sites in proteins. Our results show that the strong selective pressure acting on the functional site in general percolates through the rest of the protein via residue-residue contacts. Surprisingly however, catalytic sites in enzymes are the principal exception to this rule. Catalytic sites induce significantly stronger conservation gradients in the rest of the protein than expected from the degree of conservation of the site alone. The unique requirement for the active site to selectively stabilize the transition state of the catalyzed chemical reaction imposes additional selective constraints on the rest of the enzyme.

Author summary

Sites within proteins that are important for stability or function are under stronger selective pressure and evolve more slowly than other sites. Catalytic sites in enzymes are such highly conserved sites with relatively low evolutionary rates. Recently, catalytic sites were shown to induce a strong gradient of conservation such that the closer a residue is to the catalytic site, the more conserved it is. Here we show that there is a universal linear relationship between the degree of evolutionary conservation of a protein site and the conservation gradient it induces in the protein tertiary structure, applicable to all types of sites. Our findings suggest that selective pressure acting on a protein site generally percolates through the rest of the protein via residue-residue contacts. Remarkably, however, catalytic sites induce significantly stronger conservation gradients than expected from their degree of conservation alone. Our results indicate that the strong conservation gradient induced by catalytic sites is driven by the unique function of enzyme catalysis, which requires the participation of many residues beyond the few key catalytic residues. Our results provide insights into evolutionary conservation patterns within and around protein functional sites, with implications for functional site prediction and protein design.

Introduction

The evolutionary importance of protein sites under purifying selection can be quantified in two very different ways. The classical, “intrinsic” measure for the evolutionary importance of a protein site is the degree of conservation or evolutionary rate of the protein site itself. Protein residues experience different degrees of selective pressure as a result of the different roles they play in protein stability and function[1,2]. For example, residues in a protein core are generally under stronger selective pressure than surface residues due to their importance in stabilizing the protein. Indeed, structural determinants such as solvent exposure[3–7] and degree of packing [8–10] were shown to explain a large portion of the variability in the observed site-specific evolutionary rates. In addition, residues in functional sites such as catalytic sites[11] and ligand-binding sites are also under stronger selective pressure than non-functional residues.

An alternative, “extrinsic” measure for the evolutionary importance of a protein site is the conservation gradient the site exerts on the rest of the protein. Rather than quantifying evolutionary conservation of the protein site itself, this measure captures how the evolutionary conservation of other residues surrounding the site gradually decreases with distance from the site in the tertiary structure. Several studies have indicated the possibility for selective pressure to propagate from the functional site to the rest of the protein via physical interactions between neighboring residues in the three-dimensional structure [12–15]. While it is clear that the two measures of the evolutionary importance of protein sites are substantially different, it remains unknown how the two measures relate to each other.

In this paper, we addressed the fundamental question whether there is a direct relationship between the intrinsic and extrinsic measures of the evolutionary importance of protein sites. We have based our study on a dataset of homology-based structural models of the yeast proteome [5,7]. Despite their major differences, we show here that there is a strong linear relationship between the degree of conservation of a protein site and the conservation gradient induced from it. In other words, more conserved protein sites tend to induce stronger conservation gradient in the rest of the protein, as selective pressure acting on the protein site percolates via residue-residue interactions. This linear evolutionary conservation-percolation relationship is universal in that it holds for different types of proteins as well as for different types of functional sites in proteins. Remarkably however, catalytic sites in enzymes are the exception to this universal rule, as catalytic sites induce significantly stronger conservation gradient than other types of functional sites with similar degrees of conservation. We conclude that for many different types of functional sites, site-induced conservation gradient can be explained by the percolation of site-specific selective pressure through the rest of the protein via residue-residue contacts. However, catalytic sites in enzymes induce significantly stronger conservation gradient in the rest of the protein than expected from the percolation theory. This is likely due to the unique requirement for the enzyme active site to selectively bind to and stabilize the transition state of the catalyzed chemical reaction [16].

Overall, we show that a more complete understanding of the selective pressure on protein sites can be achieved by integrating the intrinsic measure of site-specific evolutionary conservation with the extrinsic measure of site-induced conservation gradient, with potential implications in protein design, functional site prediction and the study of disease mutations.

Results

Evolutionary conservation gradient induced from a protein residue in the protein structure is linearly correlated with the conservation of the residue

We based our study on a dataset of homology-based structural models of the S. cerevisiae proteome. This basic dataset contains structural templates from the Protein Data Bank (PDB)[17] mapped to ORFs of S. cerevisiae via sequence alignment (see Methods). We had 1274 yeast ORFs with structural models from the PDB for which residue conservation scores are available in ConSurf-DB [18,19]. Each residue in the dataset was ranked according to its relative conservation within the protein (the residue’s rank of conservation divided by the total number of residues). This normalized conservation rank of a residue ranges from 0 to 1, with higher rank corresponding to higher conservation within the protein. We have also calculated for each residue the Pearson correlation between the conservation scores of all other residues in the protein and their distance from that reference residue. This Pearson correlation between conservation and distance describes the degree of percolation of the evolutionary conservation from a residue throughout the protein tertiary structure, i.e., the ‘conservation gradient’.
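The two per-residue quantities defined above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a hypothetical list of conservation scores in which larger values mean stronger conservation, one representative 3D coordinate per residue, and ties in rank broken by input order. The function names are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def normalized_ranks(scores):
    """Rank of each residue (1 = least conserved) divided by the residue count.

    Assumption: a larger score means a more conserved residue.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position / len(scores)
    return ranks

def conservation_gradient(ref, coords, scores):
    """Pearson correlation between the conservation of all other residues
    and their distance from the reference residue `ref`."""
    others = [i for i in range(len(coords)) if i != ref]
    dists = [math.dist(coords[ref], coords[i]) for i in others]
    cons = [scores[i] for i in others]
    return pearson(cons, dists)

# Toy protein: five residues on a line, conservation decaying from residue 0.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
scores = [5.0, 4.0, 3.0, 2.0, 1.0]
print(normalized_ranks(scores))
print(conservation_gradient(0, coords, scores))
```

With this sign convention, a strongly negative value means that conservation falls off with distance from the reference residue, that is, a strong gradient.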

A clear negative linear trend is observed between the conservation rank of a residue and the strength of the evolutionary conservation gradient induced from it (Fig 1A). The more conserved a residue is within the protein (higher conservation rank), the stronger the evolutionary conservation gradient it induces. A similar result is obtained when conservation gradients are calculated using Spearman correlation rather than Pearson correlation (S1A Fig). Conservation gradients calculated up to 30Å away from the residue are overall smaller in magnitude than those calculated over the entire domain; however, they correlate more strongly with residue conservation (S2A and S3A Figs). The distribution of per-protein Pearson correlations between conservation ranks and conservation gradients (the per-protein conservation-percolation trend) lies mainly between -0.2 and -0.7, and the average correlation in the dataset is -0.5 (Fig 1B), indicating that the linear relationship between site-specific conservation and site-induced conservation gradient is strong for all types of proteins. Similarly, with conservation gradients calculated with Spearman correlation, the average per-protein correlation between conservation gradient and conservation rank is also -0.5 (S1B Fig). When conservation gradients are calculated up to 30Å away, the average per-protein conservation-percolation correlation is stronger, -0.6 (S2B Fig).

Fig 1 — (A) Violin plots and respective average of conservation gradient from a residue as a function of conservation rank for all residues in the dataset binned into 20 equally spaced bins of conservation rank along with the linear fit calculated over all residues. (B) Distribution of per-protein Pearson correlation between residues’ conservation ranks and conservation gradients.

We have also examined this relationship between site-specific conservation and site-induced conservation gradient specifically for functional sites. We identified different functional sites in the dataset: catalytic sites, non-catalytic ligand-binding sites (on enzymes and on nonenzymatic proteins), protein-protein interaction sites, and allosteric sites. We identified catalytic sites using the Mechanism and Catalytic Site Atlas (M-CSA)[20], ligand-binding sites using BioLip [21], allosteric sites using the Allosteric Database (ASD) [22–24] and protein-protein interaction sites using our previous protocol of identifying protein-protein interfaces in the yeast proteome[5]. Relative solvent accessibility (RSA) of residues was calculated and residues were classified as buried if RSA = 0.0, exposed if RSA>0.8 and middle for 0.0<RSA≤0.8. The linear relationship between site-specific conservation rank and site-induced conservation gradient holds regardless of the residue’s location within the protein or its functional role (Fig 2 and S1 Table). Overall, our results reveal the existence of a conservation–percolation relationship in which higher residue conservation leads to stronger percolation of selective pressure to adjacent sites in the tertiary structure. Furthermore, this relationship holds for residues in different types of functional sites.
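The burial classes described above amount to a simple threshold rule on RSA. The sketch below restates it; the function name is hypothetical, and the exact comparison with 0.0 assumes RSA values are reported as exact zeros for fully buried residues.

```python
def classify_by_rsa(rsa):
    """Assign a burial class from relative solvent accessibility (RSA).

    Thresholds follow the text: buried if RSA = 0.0, exposed if RSA > 0.8,
    and middle for 0.0 < RSA <= 0.8.
    """
    if rsa == 0.0:
        return "buried"
    if rsa > 0.8:
        return "exposed"
    return "middle"

for value in (0.0, 0.35, 0.8, 0.95):
    print(value, classify_by_rsa(value))
```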

Fig 2 — Violin plots of conservation gradient of residues as a function of their conservation rank within the protein along with the linear fit calculated over all residues in different types of structural and functional sites.

Conservation gradient induced from a protein site is linearly correlated with the evolutionary rate of the site

The linear trend between conservation and percolation shown above is based on conservation ranks of residues, which are relative within a protein. In these calculations, we have lumped together residues from different proteins having similar conservation rank, however these residues, being from different proteins, can be under different selective pressure. To address this caveat, we examined whether the conservation-percolation linear trend still holds when conservation is measured in terms of absolute evolutionary rate (dN/dS). The evolutionary rate is calculated for the yeast proteins in S. cerevisiae compared with its orthologs in four closely related yeast species (see Methods). Conservation score annotations are transferred from structural homologs onto yeast proteins using sequence alignment. Then, we binned all residues into 100 equally spaced bins of conservation rank and calculated their average evolutionary rate (dN/dS). The average evolutionary rate increases for residues with increasing conservation rank (decreasing conservation) (S4 Fig), showing high correlation between the evolutionary rate of yeast protein sites and the conservation scores of their structural homologs. The conservation-percolation linear trend is shown to still hold here, where conservation is measured in terms of absolute evolutionary rate (Fig 3). The lower the average evolutionary rate of a protein site is, the stronger percolation of evolutionary conservation is induced from it.
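The binning step can be illustrated as follows. This is a sketch under stated assumptions: ranks lie in [0, 1], a rank of exactly 1.0 is placed in the last bin, and empty bins are reported as `None`. The function name and the toy values are hypothetical.

```python
def bin_averages(ranks, values, n_bins=100):
    """Average `values` within equally spaced bins of conservation rank."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for rank, value in zip(ranks, values):
        # Clamp so that rank == 1.0 falls into the last bin.
        b = min(int(rank * n_bins), n_bins - 1)
        sums[b] += value
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy data: four residues with hypothetical dN/dS values, ten bins.
ranks = [0.05, 0.07, 0.95, 1.0]
dn_ds = [1.0, 3.0, 2.0, 4.0]
print(bin_averages(ranks, dn_ds, n_bins=10))
```

Applying the same function to the per-residue conservation gradients gives the second coordinate of each binned point.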

Fig 3 — (A) Average conservation gradient (calculated as the average Pearson correlation between conservation of residues and their distance from a site) as a function of average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins as well as the average conservation gradients of different types of functional sites. (B) Average conservation gradient without the relative contribution of SC-WCN as a function of average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins as well as the average conservation gradients without the relative contribution of SC-WCN of different types of functional sites.

Conservation gradients calculated as Spearman correlations exhibit a similar linear trend (S5 Fig), as do conservation gradients calculated up to 30Å away from the reference residue (S6 and S7 Figs).

Catalytic sites induce stronger conservation gradients than predicted by the conservation-percolation trend

The annotation of functional and structural site residues was transferred to yeast proteins from their structural models using sequence alignment. The average conservation gradient induced from each type of site is shown in Fig 3A. As expected, functional residues evolve more slowly (low dN/dS) than other residues. The average conservation gradients induced from ligand binding sites in enzymes, allosteric sites and protein-protein interaction sites can all be predicted reasonably well from the linear conservation-percolation trend. Remarkably, catalytic sites show the largest deviation from the linear conservation-percolation trend. The average conservation gradient induced from catalytic site residues is significantly stronger than expected from the linear conservation-percolation trend. This can also be seen from the significantly stronger conservation gradients from catalytic sites compared with non-functional sites with similarly high conservation rank (Fig 2). These results suggest that the strong conservation gradient induced from catalytic residues cannot be solely attributed to the percolation of the strong selective pressure on them. We repeated the analysis with the x-axis changed to the conservation rank (S8 Fig); the results agree with Fig 3, supporting the conclusion that catalytic sites induce stronger conservation gradients on average than other sites with similar conservation. Ligand binding sites in nonenzymes in our dataset also induce somewhat stronger conservation gradients than expected (although not to the same extent as catalytic sites). This could be due to undiscovered catalytic sites, and further work is required to test this hypothesis.

Conservation gradients calculated as Spearman correlations exhibit similar trends to those observed with conservation gradients expressed as Pearson correlations (S5 Fig). When conservation gradients are measured up to 30Å (S6 and S7 Figs), the difference between those induced from catalytic sites and those induced from other sites with similar evolutionary rate is even larger than when conservation gradients are calculated over the entire protein domain. Notably, conservation gradients up to 30Å from binding sites in nonenzymes are significantly lower compared with those from catalytic sites.

A possible caveat in comparing the conservation gradients from different types of functional sites is that their constituent residues span different ranges of conservation ranks. While catalytic sites are smaller and contain mainly highly conserved residues, protein-protein interaction sites are larger and can include residues that contribute little to the function and are not highly conserved. We therefore repeated our analysis taking into account only the three most conserved residues from each functional site. These highly conserved residues have lower evolutionary rates than the entire functional site and induce higher conservation gradients (S9 Fig compared with Fig 3). The trend of the results is maintained, showing that the most conserved residues within catalytic sites induce significantly stronger conservation gradients compared with the most conserved residues of other functional sites. The average conservation gradient from the most conserved residues of binding sites in nonenzymes is shown to be almost identical to that from catalytic sites.

The observed differences between conservation gradients induced from different functional sites could be dictated by structural determinants, such that tightly packed functional sites, or those that lie in a groove and hence close to the protein core, exhibit stronger conservation gradients. We have previously shown that any structural determinant of the protein backbone is unlikely to be a major determinant of the strong conservation gradients from enzymes[25]. We wanted to further test this hypothesis and control for site packing in our dataset. First, we calculated the SC-WCN (side-chain weighted contact number)[9,10] for each residue in each protein in our dataset. SC-WCN is a measure of residue packing and centrality. While the average SC-WCN for catalytic sites is higher than for other functional sites, the average SC-WCN for catalytic sites is not significantly different from that of ligand binding sites in enzymes (Table 1). In addition, non-functional buried sites with high SC-WCN induce significantly weaker conservation gradients than exposed functional sites such as protein-protein interaction sites. These results imply that packing does not dictate the difference in conservation gradients between these sites. Moreover, while the overall correlation coefficient between conservation gradients and conservation ranks over all the residues in our dataset is 0.43, the overall correlation coefficient between conservation gradients and SC-WCN values is significantly weaker (0.24).
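For readers unfamiliar with the packing measure, a weighted contact number can be sketched as below. The text does not state the formula, so this assumes the commonly used definition, in which the WCN of residue i is the sum over all other residues j of 1/r_ij², with r_ij here taken between side-chain geometric centers. The coordinates and function name are hypothetical.

```python
import math

def weighted_contact_number(centers):
    """WCN_i = sum over j != i of 1 / r_ij**2.

    Assumption: `centers` holds one side-chain geometric center per residue,
    so the result corresponds to a side-chain WCN (SC-WCN).
    """
    wcn = []
    for i, ci in enumerate(centers):
        total = 0.0
        for j, cj in enumerate(centers):
            if i != j:
                total += 1.0 / math.dist(ci, cj) ** 2
        wcn.append(total)
    return wcn

# Toy example: the middle residue is the most densely packed.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(weighted_contact_number(centers))
```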

Table 1. Average side-chain weighted contact number (SC-WCN) and distance to the protein center as well as average conservation gradient of the different types of functional sites.

Functional sites	Mean SC-WCN	Mean distance to protein center	Mean conservation gradient
Catalytic sites	2.62±0.03	7.1±0.4	0.41±0.01
Non-catalytic ligand binding sites in enzymes	2.65±0.03	9.3±0.4	0.30±0.01
Ligand binding sites in nonenzymes	1.92±0.02	7.4±0.3	0.37±0.01
Protein-protein interaction sites	1.70±0.01	14.8±0.1	0.21±0.00
Allosteric sites	2.03±0.01	10.8±0.2	0.24±0.00
Buried non-functional sites^#	3.32±0.00^a	2.5±0.00^b	0.24±0.00^a; 0.20±0.00^b

^# Chosen as either (a) or (b):

^a sites for which SC-WCN > 3.0

^b sites for which the distance to the protein center is < 5.0

We then constructed a linear regression model for conservation gradients of residues as a function of both their conservation rank and SC-WCN value. We subtracted the contribution of SC-WCN from the conservation gradient of every residue and plotted the new conservation gradients which are independent of SC-WCN (Fig 3B). The overall trends and differences in conservation gradients between different types of functional sites are maintained and are not strongly affected by controlling for the contribution of burial/packing. Similar results were obtained when the structural measure used was the proximity to the center of the protein (calculated as the distance from the residue with highest SC-WCN) (Table 1 and S10 Fig). Therefore, structural determinants of burial/packing or proximity of the functional site to the protein center are not the main cause of the significantly stronger conservation gradients from catalytic sites compared with non-catalytic sites.
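The control for packing can be illustrated with a two-predictor least-squares fit. This is only a sketch: it assumes an ordinary least-squares model of the form gradient = b0 + b1·rank + b2·SC-WCN and assumes that "subtracting the contribution of SC-WCN" means subtracting b2·SC-WCN from each residue's gradient. The function names and toy values are hypothetical.

```python
def fit_two_predictors(y, x1, x2):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2 (closed form)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (v - my) for a, v in zip(x1, y))
    s2y = sum((b - m2) * (v - my) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the two predictors are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

def remove_packing_contribution(gradients, ranks, sc_wcn):
    """Subtract the fitted SC-WCN term from every conservation gradient."""
    _, _, b2 = fit_two_predictors(gradients, ranks, sc_wcn)
    return [g - b2 * w for g, w in zip(gradients, sc_wcn)]

# Toy data generated from gradient = 0.1 - 0.5*rank + 0.2*sc_wcn.
ranks = [0.1, 0.4, 0.6, 0.9]
sc_wcn = [1.0, 3.0, 2.0, 2.5]
gradients = [0.1 - 0.5 * r + 0.2 * w for r, w in zip(ranks, sc_wcn)]
print(fit_two_predictors(gradients, ranks, sc_wcn))
print(remove_packing_contribution(gradients, ranks, sc_wcn))
```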

Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates

We have shown that on average, catalytic site residues induce stronger conservation gradients than expected by their average evolutionary rate. Next, we further reinforce this result by comparing the conservation gradient induced from subsets of catalytic and non-catalytic site residues with similar evolutionary rates. We have sampled 1,000 random subsets of residues from each type of functional site. For each such subset (represented by a circle in Fig 4A), we calculated the average evolutionary rate and average conservation gradient. As expected, catalytic site residue subsets span the lowest dN/dS values (x-axis), followed by non-catalytic ligand binding sites and allosteric sites, protein-protein interaction sites, ligand binding sites in nonenzymes and finally buried residues. The linear trend between evolutionary rate and conservation gradient holds for each of these functional site types (Fig 4A). This result indicates the robustness of the conservation-percolation linear trend regardless of the functional or structural role of the residues. Interestingly, residue subsets from different functional sites with similar evolutionary rates induce conservation gradients with different magnitudes. In particular, catalytic sites induce significantly stronger conservation gradients than all other non-catalytic sites with similar evolutionary rates, including non-catalytic ligand binding sites (Fig 4A). Notably, highly conserved buried nonfunctional residues that have similar evolutionary rates as those of catalytic sites, are shown here to induce significantly weaker conservation gradients (see also S11 Fig). This further shows that burial/packing of the functional site is not the main cause of the significantly stronger conservation gradients from catalytic sites compared with non-catalytic sites. 
Finally, protein-protein interaction site residues induce lower conservation gradients than most other functional site residues with similar evolutionary rates, possibly due to the tendency for protein-protein interactions to rewire during evolution.
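The subset sampling described above can be sketched as follows. The subset size is not given in the text, so `subset_size` is a hypothetical parameter, as are the function name, the fixed seed, and the choice to sample without replacement within each subset.

```python
import random

def subset_averages(rates, gradients, subset_size, n_subsets=1000, seed=0):
    """Draw random residue subsets and return one (mean dN/dS, mean gradient)
    point per subset, analogous to one circle per subset in a scatter plot."""
    rng = random.Random(seed)
    indices = range(len(rates))
    points = []
    for _ in range(n_subsets):
        chosen = rng.sample(indices, subset_size)  # without replacement
        mean_rate = sum(rates[i] for i in chosen) / subset_size
        mean_grad = sum(gradients[i] for i in chosen) / subset_size
        points.append((mean_rate, mean_grad))
    return points

# Toy data for one hypothetical functional-site type.
rates = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20]
gradients = [0.45, 0.40, 0.38, 0.33, 0.30, 0.25]
points = subset_averages(rates, gradients, subset_size=3)
print(len(points), points[0])
```

Repeating this for each functional-site type gives one cloud of points per type, which can then be compared at matched evolutionary rates.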

We have repeated the analysis in Fig 4A with the x-axis changed to the conservation rank (S12 Fig). Results show broad agreement with Fig 4, supporting our main conclusion that catalytic sites induce stronger conservation gradient on average than other functional and non-functional sites, even after controlling for site-specific conservation level. In addition to catalytic sites, other ligand binding sites also exhibit somewhat higher conservation gradient than allosteric sites and protein-protein interaction sites, likely due to hidden, unannotated catalytic sites in our dataset of ligand binding sites.

When conservation gradients from subsets of residues are calculated with Spearman correlations, similar trends to those with Pearson correlations are obtained (S13 Fig). Moreover, conservation gradients up to 30Å exhibit similar trends to those calculated over the entire protein domain (S14 and S15 Figs). Even though conservation gradients up to 30Å are generally smaller in magnitude, the large difference between those induced from catalytic sites and those from non-catalytic functional sites with similar evolutionary rate is even more pronounced than conservation gradients computed over the protein domain.

When considering only the three most conserved residues from each functional site (Fig 4B), subsets of residues exhibit lower evolutionary rates and higher conservation gradients compared with subsets drawn from all functional site residues (Fig 4A). Interestingly, even subsets of non-catalytic site residues with the same or lower evolutionary rates than catalytic site residues induce significantly lower conservation gradients. These results further emphasize the unique behavior of catalytic sites, which induce significantly stronger conservation gradients than other sites with similar evolutionary conservation; this behavior cannot be completely explained by their low evolutionary rates. Conservation gradients from ligand binding sites in nonenzymes are mostly lower than those from catalytic sites, although some induce similar conservation gradients, which might be due to possible ‘hidden catalytic sites’.

Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein

We have shown that conservation gradients induced by functional site residues in general correlate linearly with the evolutionary rates of these functional site residues. We have also shown that catalytic site residues are special in that they induce the strongest conservation gradients among subsets of different functional site residues with similar average evolutionary rates. However, our analyses have so far been carried out by grouping together functional site residues from different proteins. In this section, we perform a stringent, per-protein analysis by focusing on multi-functional proteins with at least two distinct functional sites and comparing the evolutionary properties of different functional site residues within the same protein. Some proteins in our dataset contain both catalytic sites as well as non-catalytic functional sites, whereas other proteins in our dataset contain two distinct non-catalytic functional sites. We found that for the majority of the functional site residue pairs within the same multi-functional protein, the more conserved functional site residue indeed induces stronger conservation gradient (80% if the more conserved residue is catalytic, Fig 5A; 67% if both residues are non-catalytic, Fig 5B). These results agree with the hypothesis that induced conservation gradients are largely driven by the percolation of selective pressure acting on functional sites.

Fig 5 — Within the same protein, (A) more conserved catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P<<0.001); (B) more conserved non-catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P <<0.001); (C) less conserved catalytic site residues often induce stronger conservation gradient than more conserved non-catalytic site residues (binomial test, P<<0.001). Functional site residue pairs for which the ordering of residue conservation agrees with the ordering of induced conservation gradient (concordance) are marked in blue. Functional site residue pairs for which the ordering of residue conservation disagrees with the ordering of induced conservation gradient (discordance) are marked in orange.

Remarkably, in cases where the catalytic site residue is less conserved than the non-catalytic functional site residue within the same multi-functional protein, the catalytic site residue still induces a stronger conservation gradient than the non-catalytic functional site residue in most of these cases (discordance of 80%, Fig 5C). This large discordance, observed when the less conserved residue is a catalytic site residue, is significantly higher than in cases where the less conserved residue is a non-catalytic site residue (binomial test P<<0.01). Similar trends are obtained with conservation gradients calculated as Spearman correlations (S16 Fig), when the analysis is focused only on the three most conserved residues from each functional site (S17 Fig), and when conservation gradients are computed up to 30Å (S18 Fig).
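The pairwise comparison can be made concrete with a small sketch. It assumes that each residue is summarized by a conservation value and a gradient strength expressed as a positive magnitude, that tied pairs count as non-concordant, and that the binomial test is an exact two-sided test against a null proportion of 0.5; none of these details are specified in the text, and the function names are hypothetical.

```python
from math import comb

def count_concordant(pairs):
    """Count residue pairs in which the more conserved residue also induces
    the stronger gradient. Each pair is ((cons_a, strength_a), (cons_b, strength_b))."""
    concordant = 0
    for (cons_a, strength_a), (cons_b, strength_b) in pairs:
        if (cons_a - cons_b) * (strength_a - strength_b) > 0:
            concordant += 1
    return concordant

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than observing k successes in n trials."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Toy pairs of (conservation, gradient strength) within the same protein.
pairs = [((0.9, 0.5), (0.4, 0.3)),
         ((0.8, 0.2), (0.6, 0.4)),
         ((0.3, 0.1), (0.7, 0.6))]
k = count_concordant(pairs)
print(k, len(pairs), binomial_two_sided(k, len(pairs)))
```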

These results clearly show that within the same protein, less conserved catalytic site residues often induce stronger conservation gradient than the more conserved non-catalytic site. Therefore, the strong conservation gradients from catalytic sites cannot be entirely explained by the percolation of the strong selective pressure acting on the catalytic sites.

Discussion

In this paper we have shown a linear relationship between two measures of evolutionary importance of protein sites under purifying selection. These are the degree of evolutionary conservation of the site itself, as well as the percolation of evolutionary conservation induced from the site via neighboring residues in the protein tertiary structure. Despite major differences between these two measures, we have shown that the linear relationship between the two measures is universal as it holds for different types of proteins as well as for different types of functional sites in proteins. However, catalytic sites in enzymes are the principal exception to this rule. We have shown here that catalytic sites in enzymes induce significantly stronger conservation gradients in the rest of the protein than expected from the degree of conservation of the site alone. Catalytic sites have a unique and complex functionality as they both bind a substrate as well as reduce the free energy barrier required for a chemical reaction to occur. These catalytic sites were shown to be under stronger selective pressure compared with other functional sites such as protein-protein interaction sites and ligand binding sites[13,15]. It was also shown that they induce a significantly stronger evolutionary rate gradient than other functional sites. One hypothesis regarding the origin of the strong conservation gradient from catalytic sites is that it is simply due to the strong selective pressure acting on these sites percolating through the rest of the protein via residue-residue contacts. However, we have shown here that the strong selective pressure acting on catalytic sites cannot entirely explain the strong conservation gradients induced from catalytic sites.

The main determinant of the conservation gradient from catalytic sites is still not completely understood. Local structural constraints (such as residue burial and packing, WCN) are usually potential contributors as they are known to generally have a significant effect on residue evolutionary rate[1,2,5,10]. However, it was shown that generally, local structural constraints are not the main determinants of conservation gradients in enzymes[13,25,26]. Moreover, the fact that non-catalytic ligand binding sites and allosteric binding sites induce significantly weaker evolutionary rate gradients implies that the ligand binding and allosteric function are not the main determinants of the conservation gradients in enzymes either[15]. We are therefore left with the hypothesis that the uniquely strong conservation gradient in enzymes is imposed by the special requirement for catalytic sites to differentially bind the transition state of a chemical reaction rather than the reactants or products with very similar properties [16], a function which is unique to catalytic sites compared with other functional binding sites. A recent physical model of residue evolutionary rates in enzymes introduced an activation term in addition to a stability term and showed improved predictive ability[27]. The model attributed the activity to the free energy required to transform a distorted catalytic site upon mutation to its native conformation. The improved ability of the model supports the hypothesis that the main determinant of the observed conservation gradient in enzymes is a functional rather than structural constraint. Overall, our results suggest that the stringent requirement for the catalytic site to differentially bind to and stabilize the transition state of the catalyzed chemical reaction imposes extensive evolutionary constraints on a large portion of the enzyme beyond just the catalytic site, all of which play key roles in maintaining the catalytic function.

Accurate quantification of selective pressure on proteins at single-site resolution is an important task in evolutionary biology[2]. We have shown here that there are two different ways to quantify the evolutionary importance of a protein residue: the classical, “intrinsic” measure, which is the conservation of the site itself, and the “extrinsic” measure, which is the conservation gradient the site exerts on the rest of the protein. The combination of these two measures provides a complete, quantitative picture of evolutionary conservation patterns within proteins induced by functional sites. The linear relationship between the degree of conservation of a protein functional site and the induced conservation gradient in the rest of the protein suggests that the strong selective pressure acting on the functional site percolates through the rest of the protein via residue-residue contacts. Our results also clearly show the unique evolutionary behavior of enzymes, in which the catalytic site induces significantly stronger evolutionary constraints on its surroundings than can be explained by percolation alone. Moreover, our results emphasize that catalysis requires the participation of a much larger set of residues than just the few key catalytic residues.
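The notion of a gradient that is "stronger than expected" from conservation alone can be illustrated with a small sketch: fit a straight line to gradient versus conservation rank over background residues, then measure how far a given site lies above that line. This is only an illustration under stated assumptions, not the analysis code of this study; the function names and all numerical values are hypothetical.

```python
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def excess_gradient(rank: float, gradient: float,
                    slope: float, intercept: float) -> float:
    """Observed gradient minus the gradient predicted by the linear trend."""
    return gradient - (slope * rank + intercept)


# Hypothetical background trend: conservation rank vs. induced gradient.
background_rank = [0.0, 0.5, 1.0]
background_gradient = [0.1, 0.2, 0.3]
slope, intercept = linear_fit(background_rank, background_gradient)

# Hypothetical catalytic residue: highly conserved, with a gradient above the trend.
excess = excess_gradient(1.0, 0.45, slope, intercept)
```

A positive `excess` corresponds to a site that induces a stronger gradient than its own conservation would predict under the fitted trend.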

The current study is empirical, using available data on annotated functional sites and their conservation gradient patterns. In future work it will be interesting to use simulation lattice models[14] or biophysical models[27] to examine the effect of different factors on conservation gradient patterns and to unify the empirical and theoretical studies.

Methods

Protein dataset collection and functional site annotations

The current dataset is based on a dataset of structural homologs of yeast proteins[15]. The dataset was first created using gapped BLAST[28] searches between protein subunit sequences with solved structure from the Protein Data Bank[17] and 5,861 translated open reading frames (ORFs) of the yeast Saccharomyces cerevisiae[29]. The ORF–subunit pairs were chosen such that both the subunit sequence and the ORF sequence had coverage of ≥50% in the alignment and E-value <10⁻⁵, and such that the ORF could be paired with its orthologs in four other closely-related yeast species: S. paradoxus, S. mikatae, S. bayanus and S. pombe. This way, 1,555 yeast ORFs were mapped to homologs in the PDB. For each ORF, one structural model was then selected using the following procedure:

First, if one of the homology-based structural models of a yeast ORF had an annotated allosteric site, this model was chosen. For 171 yeast proteins, a structural model with a known allosteric site and pre-calculated conservation scores in ConSurf-DB[18] was identified. For all other yeast ORFs, if they had structural models with known ligand binding sites that do not overlap with catalytic sites, the model with the lowest E-value among them was chosen. Overall, 39 nonenzymes with 42 ligand binding sites and 20 enzymes with 25 non-catalytic ligand-binding sites were part of the dataset. For all remaining yeast ORFs, the structural model with the lowest E-value was chosen. In this manner, 976 more ORFs for which the best structural model had pre-calculated conservation scores in ConSurf-DB were added to the dataset. Overall, 1,206 ORF-subunit pairs were included in the study. Out of them, 147 yeast proteins were identified with 282 protein-protein interaction sites and 107 proteins were identified with catalytic sites. The full list of yeast proteins and their structural models, along with the identified functional sites, can be found in S2 Table in the Supporting Material.
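The filtering thresholds and the selection priority described above can be sketched as follows. This is a minimal illustration rather than the scripts used in this study: the `Model` record, its field names, the placeholder chain identifiers and the E-value tie-break among allosteric models are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    """Hypothetical record for one homology-based structural model of an ORF."""
    pdb_chain: str
    e_value: float
    has_allosteric_site: bool = False
    has_noncatalytic_ligand_site: bool = False


def passes_filter(orf_coverage: float, subunit_coverage: float,
                  e_value: float) -> bool:
    """Alignment thresholds stated in the text: coverage >= 50%, E-value < 1e-5."""
    return orf_coverage >= 0.5 and subunit_coverage >= 0.5 and e_value < 1e-5


def choose_model(models: List[Model]) -> Optional[Model]:
    """Pick one structural model per ORF following the priority in the text."""
    if not models:
        return None
    # 1. Prefer a model with an annotated allosteric site.
    allosteric = [m for m in models if m.has_allosteric_site]
    if allosteric:
        # Tie-break by E-value (assumption: the text does not specify one).
        return min(allosteric, key=lambda m: m.e_value)
    # 2. Otherwise, lowest E-value among models with a non-catalytic ligand site.
    ligand = [m for m in models if m.has_noncatalytic_ligand_site]
    if ligand:
        return min(ligand, key=lambda m: m.e_value)
    # 3. Otherwise, the model with the lowest E-value overall.
    return min(models, key=lambda m: m.e_value)


# Placeholder identifiers, not real dataset entries.
candidates = [
    Model("model_A", 1e-40),
    Model("model_B", 1e-20, has_noncatalytic_ligand_site=True),
]
chosen = choose_model(candidates)
```

In this toy case the ligand-binding model is selected even though the other model has a lower E-value, which mirrors the stated priority of functional annotation over alignment score.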

Functional site annotations

Allosteric sites within the structural subunits were found and their residues annotated using the Allosteric Database (ASD)[22–24]. Biologically-relevant ligand-binding sites were found using Binding MOAD (Mother of All Databases)[30,31], a database of biologically significant protein-ligand binding in the PDB. Using MOAD, sites with bound crystallographic additives, buffers, salts and metals, as well as sites with covalently linked ligands, were excluded. Ligand-binding residues were identified using the BioLiP[21] database. Catalytic sites within the protein structural subunits were found using the M-CSA[20], also taking into account all protein chains in the PDB which are more than 95% identical to protein chains found in M-CSA[32]. In order to find proteins in our dataset that participate in protein-protein interactions, we identified structural subunits where each subunit is in physical contact with another subunit and the corresponding modelled ORFs are reported as interacting by at least one physical experiment in BioGRID[33,34]. Our dataset contains 147 proteins with 282 protein-protein interaction interfaces. Interfacial residues were identified as residues whose solvent accessibility values in the complex differ from those obtained when the interacting partner is manually deleted from the tertiary structure. Distances between residues were calculated as distances between their respective Cα atoms. All functional site annotations of the chosen structural subunits were transferred to the yeast ORF sequence according to the sequence alignment.
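The two geometric definitions above (Cα–Cα distances and interface residues defined by a change in solvent accessibility) can be sketched in a few lines. This is a hedged illustration only: the input dictionaries, the function names and the `tol` threshold are assumptions, and the solvent accessibility values would in practice come from a separate structure-analysis program that is not shown here.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def ca_distances(ca_coords: Dict[int, Coord], reference: int) -> Dict[int, float]:
    """Distance (in Å) from the reference residue's Cα to every other Cα."""
    ref = ca_coords[reference]
    return {
        res: math.dist(ref, xyz)
        for res, xyz in ca_coords.items()
        if res != reference
    }


def interface_residues(sasa_complex: Dict[int, float],
                       sasa_alone: Dict[int, float],
                       tol: float = 0.0) -> List[int]:
    """Residues whose solvent accessibility differs once the partner is removed.

    `tol` is a hypothetical tolerance; the text only requires the values to differ.
    """
    return sorted(
        res for res in sasa_alone
        if abs(sasa_alone[res] - sasa_complex[res]) > tol
    )


# Toy coordinates and accessibility values (not taken from any real structure).
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (0.0, 0.0, 10.0)}
dist_from_1 = ca_distances(coords, reference=1)
interface = interface_residues({1: 10.0, 2: 50.0}, {1: 35.0, 2: 50.0})
```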

Evolutionary conservation and rate calculations

In this study we calculated both evolutionary conservation scores for the residues of the structural subunits and average absolute evolutionary rates (dN/dS) for the yeast ORFs. Evolutionary conservation scores were taken from ConSurf-DB[18], which is a database of pre-calculated conservation scores of residues in proteins with known structures in the Protein Data Bank (PDB). ConSurf-DB conservation scores are computed from collected sequence homologs of the PDB structure using the Rate4Site algorithm[35]. S1 Text in the Supporting Material provides all conservation scores obtained from ConSurf-DB for all the proteins used in this study. Calculated conservation gradients for each residue in every protein in the dataset can be found in S2 Text in the Supporting Material. S2 Text also lists the conservation gradients calculated using Spearman correlation, calculated up to 30Å away from the reference residue and calculated with the relative contribution of SC-WCN eliminated.
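As described in the supporting figure legends, the conservation gradient of a reference residue is a correlation (Pearson, or Spearman as a variant) between the conservation of the other residues and their distance from that residue, optionally restricted to a distance window such as up to 30Å or between 6Å and 30Å. A minimal sketch of this calculation is given below. It is not the authors' script: the function names are hypothetical, the window is assumed to be inclusive at both ends, and the sign of the result depends on whether the per-residue score increases with conservation or with evolutionary rate.

```python
import math
from typing import List, Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def conservation_gradient(scores: Sequence[float], distances: Sequence[float],
                          d_min: float = 0.0, d_max: Optional[float] = None,
                          use_ranks: bool = False) -> float:
    """Correlation between residue scores and distance from the reference residue.

    Only residues with d_min <= distance <= d_max are used (inclusive bounds are
    an assumption). With use_ranks=True this is a Spearman correlation.
    """
    kept = [(d, s) for s, d in zip(scores, distances)
            if d >= d_min and (d_max is None or d <= d_max)]
    ds = [d for d, _ in kept]
    ss = [s for _, s in kept]
    if use_ranks:
        ds, ss = ranks(ds), ranks(ss)
    return pearson(ds, ss)


# Toy data: scores and Cα distances (Å) of five residues from a reference site.
toy_distances = [5.0, 10.0, 15.0, 20.0, 40.0]
toy_scores = [0.1, 0.2, 0.3, 0.4, -5.0]
gradient_30 = conservation_gradient(toy_scores, toy_distances, d_max=30.0)
```

The `d_min` and `d_max` arguments correspond to the alternative windows mentioned in the supporting figure legends, for example `d_min=6.0, d_max=30.0`.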

To calculate the average evolutionary rates (dN/dS) for residues of S. cerevisiae, we first used the orthology assignment of the protein-coding genes of S. cerevisiae with four other closely-related yeast species (S. paradoxus, S. mikatae, S. bayanus, and S. pombe), according to the Fungal Orthogroup Repository[36]. We then aligned the ORFs using MAFFT[37]. Then, evolutionary rates were calculated using the program codeml within the PAML software package[38]. The tree was specified as ((((S. cerevisiae, S. paradoxus), S. mikatae), S. bayanus), S. pombe). Codon frequencies were assumed equal (CodonFreq = 0) and other parameters in codeml were left to their default values. The codon alignments can be found in S3 Text in the Supporting Material.
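For readers who wish to set up a comparable run, the tree topology and the one explicitly stated setting can be written out programmatically. The sketch below builds the Newick string for the topology given above and a partial codeml control file. Only `CodonFreq = 0` comes from the text; `seqtype = 1` (codon sequences) is an assumption implied by the dN/dS calculation, the file names and underscore-separated taxon labels are hypothetical, and all options not listed are left to codeml's defaults, as stated above.

```python
from typing import List

# Taxon labels with underscores are an assumption about the alignment headers.
SPECIES = ["S_cerevisiae", "S_paradoxus", "S_mikatae", "S_bayanus", "S_pombe"]


def ladder_tree(names: List[str]) -> str:
    """Newick string for the nested topology ((((A,B),C),D),E)."""
    tree = f"({names[0]},{names[1]})"
    for name in names[2:]:
        tree = f"({tree},{name})"
    return tree + ";"


def codeml_ctl(seqfile: str, treefile: str, outfile: str) -> str:
    """Partial codeml control file; unlisted options keep their defaults."""
    options = {
        "seqfile": seqfile,    # codon alignment (hypothetical file name)
        "treefile": treefile,  # Newick tree file (hypothetical file name)
        "outfile": outfile,    # main result file (hypothetical file name)
        "seqtype": 1,          # codon sequences (assumption, implied by dN/dS)
        "CodonFreq": 0,        # equal codon frequencies, as stated in the text
    }
    return "\n".join(f"{key} = {value}" for key, value in options.items()) + "\n"


newick = ladder_tree(SPECIES)
control = codeml_ctl("orf.codon.phy", "yeast.tree", "orf.codeml.out")
```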

Statistical analysis

For Fig 4, 1,000 random samples of 250 residues each were collected from each type of functional site residue.
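A minimal sketch of this resampling step is shown below. It assumes that residues are drawn without replacement within each sample and that every functional site type has at least 250 annotated residues; neither detail is specified in the text, and the residue identifiers and seed are hypothetical.

```python
import random
from typing import List, Sequence


def draw_samples(residue_ids: Sequence[str], n_samples: int = 1000,
                 sample_size: int = 250, seed: int = 0) -> List[List[str]]:
    """Draw n_samples random subsets of sample_size residues each.

    Sampling is without replacement within a subset (an assumption).
    """
    rng = random.Random(seed)
    pool = list(residue_ids)
    return [rng.sample(pool, sample_size) for _ in range(n_samples)]


# Hypothetical pool of 2,000 residue identifiers for one functional site type.
pool = [f"res{i}" for i in range(2000)]
samples = draw_samples(pool)
```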

Standard errors of our measurements of conservation gradients (Pearson correlations) and of dN/dS values were estimated using 50 rounds of bootstrap resampling.
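The bootstrap estimate of a standard error can be sketched as follows for the case of a Pearson correlation. This is an illustrative implementation under assumptions, not the code used in this study: residues are resampled with replacement as (distance, score) pairs, the standard error is taken as the sample standard deviation of the 50 bootstrap estimates, and the toy data and seeds are hypothetical.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_se(x: Sequence[float], y: Sequence[float],
                 n_rounds: int = 50, seed: int = 0) -> float:
    """Standard error of the Pearson correlation from n_rounds bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_rounds):
        # Resample paired observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    mean = sum(estimates) / n_rounds
    variance = sum((e - mean) ** 2 for e in estimates) / (n_rounds - 1)
    return math.sqrt(variance)


# Toy data: distances (Å) and noisy scores that increase with distance.
noise = random.Random(1)
toy_x = [float(i) for i in range(40)]
toy_y = [0.05 * v + noise.gauss(0.0, 0.5) for v in toy_x]
se = bootstrap_se(toy_x, toy_y)
```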

Supporting information

S1 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(A) Violin plots and respective average of conservation gradient (calculated as a Spearman correlation between a residue conservation and its distance from a site) as a function of conservation rank for all residues in the dataset binned into 20 equally spaced bins of conservation rank along with the linear fit calculated over all residues. (B) Distribution of per-protein Pearson correlation between residues’ conservation ranks and conservation gradients (conservation gradients calculated as Spearman correlations).

(TIF)

S2 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(A) Violin plots and respective average of conservation gradient (calculated as a Pearson correlation between a residue conservation and its distance from a site up to 30Å away) as a function of conservation rank for all residues in the dataset binned into 20 equally spaced bins of conservation rank along with the linear fit calculated over all residues. (B) Distribution of per-protein Pearson correlation between residues’ conservation ranks and conservation gradients (conservation gradients calculated as Pearson correlations up to 30Å away).

(TIF)

S3 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(A) Violin plots and respective average of conservation gradient (calculated as a Pearson correlation between a residue conservation and its distance from a site between 6 Å and 30Å away) as a function of conservation rank for all residues in the dataset binned into 20 equally spaced bins of conservation rank along with the linear fit calculated over all residues. (B) Distribution of per-protein Pearson correlation between residues’ conservation ranks and conservation gradients (conservation gradients calculated as Pearson correlations for between 6 Å and 30Å away).

(TIF)

S4 Fig. Evolutionary rate is correlated with conservation rank.

Evolutionary rate (dN/dS) as a function of conservation rank for all residues in the dataset grouped according to their conservation rank and binned into 100 equally spaced bins of conservation rank.

(TIF)

S5 Fig. Conservation gradient induced from residues is linearly (negatively) correlated with their evolutionary rate (dN/dS), with catalytic site residues inducing stronger conservation gradient than expected by the linear trend.

Average conservation gradient (calculated as the average Spearman correlation between conservation of residues and their distance from a site) as a function of the average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins (black) as well as the average conservation gradients of different types of functional sites.

(TIF)

S6 Fig. Conservation gradient induced from residues is linearly (negatively) correlated with their evolutionary rate (dN/dS), with catalytic site residues inducing stronger conservation gradient than expected by the linear trend.

Average conservation gradient (calculated as the average Pearson correlation between conservation of residues and their distance from a site up to 30Å away) as a function of the average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins (black) as well as the average conservation gradients of different types of functional sites.

(TIF)

S7 Fig. Conservation gradient induced from residues is linearly (negatively) correlated with their evolutionary rate (dN/dS), with catalytic site residues inducing stronger conservation gradient than expected by the linear trend.

Average conservation gradient (calculated as the average Pearson correlation between conservation of residues and their distance from a site between 6Å and 30Å away) as a function of the average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins (black) as well as the average conservation gradients of different types of functional sites.

(TIF)

S8 Fig. Conservation gradient induced from residues is linearly (negatively) correlated with their evolutionary conservation, with catalytic site residues inducing stronger conservation gradient than expected by the linear trend.

Average conservation gradient as a function of average conservation rank for all yeast protein residues binned into 100 equally spaced bins as well as the average conservation gradients of different types of functional sites.

(TIF)

S9 Fig. Conservation gradient induced from residues is linearly (negatively) correlated with their evolutionary rate (dN/dS), with catalytic site residues inducing stronger conservation gradient than expected by the linear trend.

Average conservation gradients (calculated as the average Pearson correlation between conservation of residues and their distance from a site) as a function of the average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins (black) as well as the average conservation gradients of the three most-conserved residues within each functional site.

(TIF)

S10 Fig. Average conservation gradient after reduction of the relative contribution of proximity to the protein center, as a function of average evolutionary rate (dN/dS) for all yeast protein residues binned according to their annotated conservation rank into 100 equally spaced bins as well as the average conservation gradients without the relative contribution of proximity to the protein center of different types of functional sites.

(TIF)

S11 Fig. Average conservation gradients of subsets of buried and exposed, non-functional site residues.

Each circle represents a subset of residues.

(TIF)

S12 Fig. Average conservation gradients of functional sites binned according to conservation rank up to 0.65.

(TIF)

S13 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

Average conservation gradients (calculated as the average Spearman correlation between conservation of residues and their distance from a site). Each circle represents a subset of residues, coloured by the different types of functional sites.

(TIF)

S14 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

Average conservation gradients (calculated as the average Pearson correlation between conservation of residues and their distance from a site up to 30Å away). Each circle represents a subset of residues, coloured by the different types of functional sites.

(TIF)

S15 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

Average conservation gradients (calculated as the average Pearson correlation between conservation of residues and their distance from a site between 6Å and 30Å away). Each circle represents a subset of residues, coloured by the different types of functional sites.

(TIF)

S16 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

Within the same protein, when conservation gradients are calculated as Spearman correlation between conservation of residues and their distance from a site (A) more conserved catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P<<0.001); (B) more conserved non-catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P <<0.001); (C) less conserved catalytic site residues often induce stronger conservation gradient than more conserved non-catalytic site residues (binomial test, P<<0.001). Functional site residue pairs for which the ordering of residue conservation agrees with the ordering of induced conservation gradient (concordance) are marked in blue. Functional site residue pairs for which the ordering of residue conservation disagrees with the ordering of induced conservation gradient (discordance) are marked in orange.

(TIF)

S17 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

Within the same protein, considering only conservation gradients from the three most-conserved residues within each functional site (A) more conserved catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P<<0.001); (B) more conserved non-catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P <<0.001); (C) less conserved catalytic site residues often induce stronger conservation gradient than more conserved non-catalytic site residues (binomial test, P<0.001). Functional site residue pairs for which the ordering of residue conservation agrees with the ordering of induced conservation gradient (concordance) are marked in blue. Functional site residue pairs for which the ordering of residue conservation disagrees with the ordering of induced conservation gradient (discordance) are marked in orange.

(TIF)

S18 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

Within the same protein, when conservation gradients are calculated as Pearson correlation between conservation of residues and their distance from a site up to 30Å away (A) more conserved catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P<<0.001); (B) more conserved non-catalytic site residues tend to induce stronger conservation gradient than less conserved non-catalytic site residues (binomial test, P <<0.001); (C) less conserved catalytic site residues often induce stronger conservation gradient than more conserved non-catalytic site residues (binomial test, P<<0.001). Functional site residue pairs for which the ordering of residue conservation agrees with the ordering of induced conservation gradient (concordance) are marked in blue. Functional site residue pairs for which the ordering of residue conservation disagrees with the ordering of induced conservation gradient (discordance) are marked in orange.

(TIF)

S1 Table. Linear regression of conservation gradients as a function of conservation rank over all residues in different types of functional and structural categories.

(XLSX)

S2 Table. List of all yeast proteins, their structural models and identified functional sites.

(XLSX)

S1 Text. Conservation scores downloaded from ConSurf-DB for all the proteins participating in this study.

(RAR)

S2 Text. Calculated conservation gradients for each residue in every protein in the dataset.

The file also lists the conservation gradients calculated using Spearman correlation, calculated up to 30Å away from the reference residue and calculated with the relative contribution of SC-WCN eliminated.

(RAR)

S3 Text. Codon alignments of protein coding genes of S. cerevisiae with its four closely-related yeast species (S. paradoxus, S. mikatae, S. bayanus, and S. pombe).

(RAR)

Data Availability

Analysis scripts are available from https://github.com/AvitalSharirIvry/Quantifying-Evolutionary-Importance-of-Protein-Sites-A-Tale-of-Two-Measures.git.

Funding Statement

This work was supported by Natural Sciences and Engineering Research Council of Canada grant numbers RGPIN-2019-05952, RGPAS-2019-00012 (Y.X.) (https://www.nserc-crsng.gc.ca/index_eng.asp), Canada Foundation for Innovation grant number JELF-33732 (Y. X.) (https://www.innovation.ca/), and Canada Research Chairs program (Y. X.) (https://www.chairs-chaires.gc.ca/home-accueil-eng.aspx). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Worth CL, Gong S, Blundell TL. Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol. 2009;10: 709–720. doi:10.1038/nrm2762
2. Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nat Rev Genet. 2016;17: 109–121. doi:10.1038/nrg.2015.18
3. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1992;1: 216–226. doi:10.1002/pro.5560010203
4. Conant GC, Stadler PF. Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009;26: 1155–1161. doi:10.1093/molbev/msp031
5. Franzosa EA, Xia Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol. 2009;26: 2387–2395. doi:10.1093/molbev/msp146
6. Franzosa EA, Xue R, Xia Y. Quantitative residue-level structure-evolution relationships in the yeast membrane proteome. Genome Biol Evol. 2013;5: 734–744. doi:10.1093/gbe/evt039
7. Sharir-Ivry A, Xia Y. The impact of native state switching on protein sequence evolution. Mol Biol Evol. 2017;34: 1378–1390. doi:10.1093/molbev/msx071
8. Hamelryck T. An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins. 2005;59: 38–48. doi:10.1002/prot.20379
9. Yeh SW, Huang TT, Liu JW, Yu SH, Shih CH, Hwang JK, et al. Local packing density is the main structural determinant of the rate of protein sequence evolution at site level. Biomed Res Int. 2014; 572409. doi:10.1155/2014/572409
10. Marcos ML, Echave J. Too packed to change: side-chain packing and site-specific substitution rates in protein evolution. PeerJ. 2015;3: e911. doi:10.7717/peerj.911
11. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues in enzyme active sites. J Mol Biol. 2002;324: 105–121. doi:10.1016/s0022-2836(02)01036-7
12. Tóth-Petróczy A, Tawfik DS. Slow protein evolutionary rates are dictated by surface-core association. Proc Natl Acad Sci U S A. 2011;108: 11151–6. doi:10.1073/pnas.1015994108
13. Jack BR, Meyer AG, Echave J, Wilke CO. Functional sites induce long-range evolutionary constraints in enzymes. PLOS Biol. 2016;14: e1002452. doi:10.1371/journal.pbio.1002452
14. Nelson ED, Grishin N V. Evolution of off-lattice model proteins under ligand binding constraints. Phys Rev E. 2016;94: 022410. doi:10.1103/PhysRevE.94.022410
15. Sharir-Ivry A, Xia Y. Non-catalytic Binding Sites Induce Weaker Long-Range Evolutionary Rate Gradients than Catalytic Sites in Enzymes. J Mol Biol. 2019;431: 3860–3870. doi:10.1016/j.jmb.2019.07.019
16. Warshel A, Sharma PK, Kato M, Xiang Y, Liu H, Olsson MHM. Electrostatic basis for enzyme catalysis. Chem Rev. 2006;106: 3210–3235. doi:10.1021/cr0503106
17. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28: 235–42. doi:10.1093/nar/28.1.235
18. Goldenberg O, Erez E, Nimrod G, Ben-Tal N. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures. Nucleic Acids Res. 2009;37: D323–7. doi:10.1093/nar/gkn822
19. Celniker G, Nimrod G, Ashkenazy H, Glaser F, Martz E, Mayrose I, et al. ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function. Isr J Chem. 2013;53: 199–206. doi:10.1002/ijch.201200096
20. Ribeiro AJM, Holliday GL, Furnham N, Tyzack JD, Ferris K, Thornton JM. Mechanism and Catalytic Site Atlas (M-CSA): A database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 2018;46: D618–D623. doi:10.1093/nar/gkx1012
21. Yang J, Roy A, Zhang Y. BioLiP: A semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013;41: D1096–1103. doi:10.1093/nar/gks966
22. Liu X, Lu S, Song K, Shen Q, Ni D, Li Q, et al. Unraveling allosteric landscapes of allosterome with ASD. Nucleic Acids Res. 2020;48: D394–D401. doi:10.1093/nar/gkz958
23. Huang Z, Zhu L, Cao Y, Wu G, Liu X, Chen Y, et al. ASD: a comprehensive database of allosteric proteins and modulators. Nucleic Acids Res. 2011;39: D663–D669. doi:10.1093/nar/gkq1022
24. Shen Q, Wang G, Li S, Liu X, Lu S, Chen Z, et al. ASD v3.0: unraveling allosteric regulation with structural mechanisms and biological networks. Nucleic Acids Res. 2016;44: D527–D535. doi:10.1093/nar/gkv902
25. Sharir-Ivry A, Xia Y. Nature of Long-Range Evolutionary Constraint in Enzymes: Insights from Comparison to Pseudoenzymes with Similar Structures. Mol Biol Evol. 2018;35: 2597–2606. doi:10.1093/molbev/msy177
26. Sharir-Ivry A, Xia Y. Using Pseudoenzymes to Probe Evolutionary Design Principles of Enzymes. Evol Bioinforma. 2019;15: 117693431985593. doi:10.1177/1176934319855937
27. Echave J. Beyond Stability Constraints: A Biophysical Model of Enzyme Evolution with Selection on Stability and Activity. Mol Biol Evol. 2019;36: 613–620. doi:10.1093/molbev/msy244
28. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi:10.1093/nar/25.17.3389
29. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998;26: 73–79. doi:10.1093/nar/26.1.73
30. Hu L, Benson ML, Smith RD, Lerner MG, Carlson HA. Binding MOAD (Mother Of All Databases). Proteins Struct Funct Bioinforma. 2005;60: 333–340. doi:10.1002/prot.20512
31. Ahmed A, Smith RD, Clark JJ, Dunbar JB, Carlson HA. Recent improvements to Binding MOAD: a resource for protein–ligand binding affinities and structures. Nucleic Acids Res. 2015;43: D465–D469. doi:10.1093/nar/gku1088
32. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35: 1026–1028. doi:10.1038/nbt.3988
33. Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34: D535–9. doi:10.1093/nar/gkj109
34. Chatr-Aryamontri A, Oughtred R, Boucher L, Rust J, Chang C, Kolas NK, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017;45: D369–D379. doi:10.1093/nar/gkw1102
35. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18: S71–S77. doi:10.1093/bioinformatics/18.suppl_1.s71
36. Wapinski I, Pfeffer A, Friedman N, Regev A. Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007;449: 54–61. doi:10.1038/nature06107
37. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30: 772–780. doi:10.1093/molbev/mst010
38. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24: 1586–1591. doi:10.1093/molbev/msm088

PLoS Genet. doi: 10.1371/journal.pgen.1009476.r001

Decision Letter 0

Jianzhi Zhang, Kirsten Bomblies

7 Sep 2020

Dear Dr Sharir-Ivry,

Thank you very much for submitting your Research Article entitled 'Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by two independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Jianzhi Zhang

Associate Editor

PLOS Genetics

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In the manuscript “Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures”, the authors report the relationship between the evolutionary constraint on a site in a protein and its conservation gradient. For a given site, the gradient is the Pearson correlation between the constraints on the other sites and their distances to the focal site. A stronger gradient (larger correlation) indicates that the neighboring sites tend to have stronger constraints than the distal sites. The authors observed a linear relationship, suggesting that sites with strong constraints also have strong conservation gradients. The authors concluded that this relationship is likely due to residue-residue contacts among the sites. In particular, the authors found that catalytic sites in enzymes have much stronger conservation gradients than other sites, and that these strong gradients cannot be explained by the particularly strong constraints on the focal sites. The authors therefore concluded that the observation is likely caused by the catalytic function. Although the results and conclusions are interesting, I have several major concerns.

Major points

One main result of the manuscript is that sites with strong constraints tend to have strong conservation gradients. This linear relationship is convincing and expected. Neighboring residues are known to be involved in the same biological function (e.g. catalysis, binding, or protein stability) and thus tend to have similar constraints. For example, in reference 13, the residues in close proximity to the strongly constrained residues also have strong constraints, whereas the distal residues are less constrained. Therefore, strongly constrained residues are expected to have relatively larger gradients than residues with weak constraints, resulting in the observed relationship. However, contrary to what the authors claim, I think the linear relationship is likely quite weak, given the large variance within each bin. It is also worth reporting more details on the regression, e.g. the R-squared and whether only the median/mean of each bin was used for the regression, which would artificially reduce the variation in the data.
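The concern about regressing on bin averages can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the data are simulated, the slope and noise level are arbitrary, and the helper names `r_squared` and `bin_means` are hypothetical. This is not the authors' data or analysis; it only shows that the same weak trend yields a much higher R-squared when fitted to bin means than when fitted to individual residues.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a simple linear fit of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def bin_means(xs, ys, n_bins):
    """Average x and y within equally spaced bins of x on [0, 1]; empty bins are skipped."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b][0] += x
        sums[b][1] += y
        sums[b][2] += 1
    bx = [sx / c for sx, sy, c in sums if c > 0]
    by = [sy / c for sx, sy, c in sums if c > 0]
    return bx, by


# Synthetic data (hypothetical): a weak linear trend between conservation
# rank (x) and conservation gradient (y), buried in residue-level noise.
rng = random.Random(0)
rank = [rng.random() for _ in range(5000)]
gradient = [0.2 * x + rng.gauss(0.0, 0.2) for x in rank]

r2_raw = r_squared(rank, gradient)
r2_binned = r_squared(*bin_means(rank, gradient, 20))
print(f"R^2 on individual residues: {r2_raw:.3f}")
print(f"R^2 on 20 bin means:        {r2_binned:.3f}")
```

Averaging within bins removes most of the residue-level noise, so the binned R-squared describes how well the trend fits the bin means, not how much of the variation among individual residues it explains.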

Another major result is that the catalytic residues have particularly strong conservation gradients. The authors concluded that the high conservation gradients probably arise because the unique requirement for the active site to selectively stabilize the transition state of the catalyzed chemical reaction imposes additional selective constraints on the rest of the enzyme. However, the large gradients of catalytic sites may instead arise because all the identified catalytic sites in a catalytic region have strong constraints. Taking PPI sites as a counterexample, it is known that the PPI interface is relatively large but only a few key PPI residues have strong constraints, while the neighboring sites have weak constraints. As a result, even the key residues have low conservation gradients. In addition, catalytic sites tend to be buried rather than exposed on the protein surface like other binding sites. Being located in the core region of a protein further increases conservation gradients because of the paths from the core (high constraints) to the surface (very low constraints). In sum, without controlling for these factors, the high gradients of the catalytic sites may not be due to intrinsic catalytic properties.

It may be interesting to quantify the influence of all these factors using a simple lattice model. The functional sites, core sites and surface sites would have their constraints sampled from their respective constraint distributions, from which the gradients are calculated. The factors may include the size of the functional region in a protein that has core and surface regions, the mean and variance of site constraints in the functional region, and the location of the region (surface or core/groove).
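One way such a lattice model could be set up is sketched below. It is a toy under stated assumptions, not a model from the manuscript: the protein is an n x n x n cubic lattice, boundary sites count as surface and interior sites as core, evolutionary rates are drawn from Gaussian distributions with arbitrary means and standard deviations, and the gradient is taken as the Pearson correlation between rate and distance to the focal site. No separate functional region is modeled, and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def build_lattice(n, rng, core=(0.2, 0.1), surface=(0.8, 0.1)):
    """Return coordinates and sampled rates for an n x n x n lattice.

    Boundary sites are 'surface', all others are 'core'. Each rate is drawn
    from a Gaussian with the (mean, sd) of its class (assumed values).
    """
    coords, rates = [], []
    for x in range(n):
        for y in range(n):
            for z in range(n):
                on_surface = (0 in (x, y, z)) or ((n - 1) in (x, y, z))
                mean, sd = surface if on_surface else core
                coords.append((x, y, z))
                rates.append(rng.gauss(mean, sd))
    return coords, rates


def gradient(coords, rates, focal):
    """Correlation between rate and distance to the focal site (focal excluded)."""
    dists = [math.dist(c, coords[focal]) for i, c in enumerate(coords) if i != focal]
    others = [r for i, r in enumerate(rates) if i != focal]
    return pearson(dists, others)


rng = random.Random(1)
coords, rates = build_lattice(7, rng)
center_gradient = gradient(coords, rates, coords.index((3, 3, 3)))
surface_gradient = gradient(coords, rates, coords.index((0, 3, 3)))
print(f"gradient from buried central site:  {center_gradient:.2f}")
print(f"gradient from surface site:         {surface_gradient:.2f}")
```

In this toy setting the focal sites carry no special function at all, yet the buried central site shows a stronger gradient than the surface site simply because of the core-to-surface difference in rates, which is the confounding effect the comment asks to control for.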

Overall, the gradients, calculated using Pearson correlation, are moderate or small. Spearman correlation, which is robust to outliers, may be necessary to confirm the discovery.
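The difference between the two correlation measures can be sketched as follows. The example assumes that the gradient of a focal residue is the correlation between the evolutionary rates of all other residues and their distances to the focal residue; the coordinates and rates are invented, and `conservation_gradient` is a hypothetical helper, not the authors' code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based), so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def conservation_gradient(coords, rates, focal, method=pearson):
    """Correlate the rates of all non-focal residues with their distance to the focal residue."""
    dists, others = [], []
    for i, (xyz, rate) in enumerate(zip(coords, rates)):
        if i == focal:
            continue
        dists.append(math.dist(xyz, coords[focal]))
        others.append(rate)
    return method(dists, others)


# Invented toy protein: six residues on a line, with one outlier rate at the far end.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0)]
rates = [0.05, 0.10, 0.12, 0.15, 0.20, 5.00]

g_pearson = conservation_gradient(coords, rates, focal=0, method=pearson)
g_spearman = conservation_gradient(coords, rates, focal=0, method=spearman)
print(f"Pearson gradient:  {g_pearson:.3f}")
print(f"Spearman gradient: {g_spearman:.3f}")
```

Because the rates increase monotonically with distance, the rank-based gradient is unaffected by the single extreme value, whereas the Pearson gradient is pulled away from a perfect correlation by it.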

Minor points

The authors mention very briefly that their discovery is important for phylogenetic inference, accurate quantification of selective pressure at single-site resolution, etc. Please elaborate on these points in the Discussion.

The authors use the term “long-range” conservation in the manuscript. It would be useful to define the long range, e.g. up to 30 Å. However, it is possible that many of the general conservation gradients observed by the authors are mainly due to “short-range” residues.

In this manuscript, many sites from different proteins were pooled together to estimate an average dN/dS for these sites. Many of those sites may have quite different dN/dS values. PAML could be used to test whether the sites in a protein have different evolutionary rates, and then to estimate the rates separately for each group of sites. The multiple groups of sites with different rates may be informative for the analyses.

Reviewer #2: Overall, this is a nice contribution. However, I have one major concern: Most proteins have a natural conservation gradient from the outside to the inside. So any study trying to identify some alternative cause for a conservation gradient must very carefully control for this strong confounder. I don't think the present study does so. I would argue the present study doesn't even properly discuss this issue.

To me, the key question is to what extent sites create a conservation gradient given where they are in the protein structure. The authors look at buried and exposed sites, but that's a very crude classification. A site can be buried but relatively close to the surface or right in the center of the protein, and these two sites will experience both different selection pressures and different conservation gradients.

A good measure to assess how close a site is to the center of the protein core is the weighted contact number (WCN), using an inverse square distance weighting. In fact, WCN is literally a measure of centrality, rather than a measure of number of contacts. (As an aside, many authors in the field misunderstand this issue.) If the authors correlate WCN with conservation gradient, they should find a fairly strong correlation. Then, the authors can build a regression model that regresses the conservation gradient against both WCN and conservation rank. The degree to which conservation rank contributes to such a model is a measure of the intrinsic conservation gradient a site generates, independent of where in the structure it is located. It may well be that if the authors perform this analysis, catalytic sites stand out even more.
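A minimal sketch of WCN with inverse square distance weighting is given below, assuming one representative coordinate per residue and the common form WCN_i = sum over j != i of 1 / r_ij^2. The 3 x 3 x 3 grid of points is a made-up stand-in for a protein structure and the function name is hypothetical.

```python
import math


def weighted_contact_number(coords):
    """WCN_i = sum over all j != i of 1 / r_ij**2, one value per residue."""
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Made-up structure: 27 points on a 3 x 3 x 3 grid with unit spacing.
coords = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
wcn = weighted_contact_number(coords)

center = coords.index((1, 1, 1))
corner = coords.index((0, 0, 0))
print(f"WCN at the central point: {wcn[center]:.2f}")
print(f"WCN at a corner point:    {wcn[corner]:.2f}")
```

Every point contributes to every other point's WCN, weighted by distance, so the value is largest at the geometric center and falls off toward the periphery, which is why it behaves as a centrality measure. The resulting per-residue WCN could then enter a two-predictor least-squares fit of conservation gradient on WCN and conservation rank, as the comment suggests.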

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: No: I think the authors should provide their raw data and analysis scripts. As is, the study is not reproducible.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Genet. 2021 Apr 7;17(4):e1009476. doi: 10.1371/journal.pgen.1009476.r002

Author response to Decision Letter 0

7 Dec 2020

Attachment

Submitted filename: Reply to reviewers plos genetics.pdf

Click here for additional data file.^{(208.6KB, pdf)}

PLoS Genet. doi: 10.1371/journal.pgen.1009476.r003

Decision Letter 1

Jianzhi Zhang, Kirsten Bomblies

6 Jan 2021

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Xia,

Thank you very much for submitting your Research Article entitled 'Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by the two original peer reviewers. While one reviewer is fully satisfied, the other has a few comments that need to be addressed.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Jianzhi Zhang

Associate Editor

PLOS Genetics

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this revision, the authors added more results and analyses, and the conclusions are more solid. However, there are still several concerns.

In the manuscript, the authors emphasize “long-range” gradients. I commented on this before, suggesting that the authors define long vs. short range. In this revision, the authors analyzed the residues within 30 Å. I feel that to support “long-range”, the authors should have analyzed the residues beyond a certain cutoff (within a shell) for catalytic residues and other residues.
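A shell-restricted gradient of the kind suggested here could be computed along the following lines. This is a sketch under assumptions: the gradient is taken as the Pearson correlation between evolutionary rate and distance to the focal residue, the 15 Å and 30 Å cutoffs are chosen only for illustration, and the toy data (residues on a line, with rates that saturate a few ångströms from the focal residue plus Gaussian noise) are simulated. `shell_gradient` is a hypothetical helper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def shell_gradient(coords, rates, focal, r_min, r_max):
    """Gradient using only residues with r_min <= distance < r_max from the focal residue.

    Returns None when fewer than three residues fall inside the shell.
    """
    dists, selected = [], []
    for i, (xyz, rate) in enumerate(zip(coords, rates)):
        d = math.dist(xyz, coords[focal])
        if i != focal and r_min <= d < r_max:
            dists.append(d)
            selected.append(rate)
    if len(dists) < 3:
        return None
    return pearson(dists, selected)


# Simulated toy data: residues spaced 1 Å apart on a line; the rate rises
# steeply near the focal residue (index 0) and is flat plus noise far away.
rng = random.Random(2)
coords = [(float(d), 0.0, 0.0) for d in range(61)]
rates = [1.0 - math.exp(-d / 5.0) + rng.gauss(0.0, 0.05) for d in range(61)]

near = shell_gradient(coords, rates, focal=0, r_min=0.0, r_max=15.0)
far = shell_gradient(coords, rates, focal=0, r_min=30.0, r_max=math.inf)
print(f"gradient within 15 Å: {near:.2f}")
print(f"gradient beyond 30 Å: {far:.2f}")
```

In this simulated case the overall gradient is carried entirely by the nearby residues, so restricting the calculation to a distant shell is what distinguishes a short-range effect from a long-range one.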

I suggested that the strong correlation between constraint and conservation gradient may be due to the location of catalytic residues, i.e. close to the core of proteins. (The core-surface location and the catalytic function together lead to the high gradient.) To address this comment, the authors used the number of contact residues to indicate the packing. However, catalytic residues (and other ligand-binding residues) may be in a groove (and thus close to the core) but have no contact residues. The number of contact residues may not be a direct measure for this purpose.

Fig 2 shows the relationship between the normalized conservation ranks of residues and their conservation gradients for different types of residues, such as ligand binding sites, catalytic sites, and allosteric sites. The conclusion is that the relationship for catalytic sites is quite unique. However, from the figure, it seems that this is likely because the catalytic sites span the x range 0.95 to 0.65. For some other residue types, the relationships in this range seem similar to that of catalytic sites. The authors may need to add regression lines using only that x range. The R-squared (correlation squared) is quite small, indicating that the fit is at most moderate.

For Fig 4, are the black dots “all residues (w/o functional sites)”? Their results are missing in panel B. The results for such residues can tell how much the constraints on residues alone influence the gradients.

Conservation gradients depend on the relative residue constraints within each protein. The normalization used in Figs 1 and 2 is more reasonable than comparing dN/dS across different proteins. It seems that the normalized conservation ranks could be used for the key analyses in Figs 3 and 4, with the x axis changed accordingly.

Regarding the conclusion in the Discussion, I think the measures in this manuscript probably cannot be informative for “de-novo functional site prediction and protein design”, because many functional sites, except catalytic sites, are similar to non-functional sites in terms of these measures.

The following sentences may contain typos.

When considering only the three most conserved residues from each functional site (Fig 4B), subset of residues exhibits lower evolutionary rates and higher conservation gradients compared with subsets from all functional sites residues (Fig 4A).

Beyond the classical, “intrinsic” measure of conservation and the “extrinsic” measure which is conservation gradient the site exerts on the rest of the protein.

Reviewer #2: Thank you for your careful revisions. I have no further comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Genet. 2021 Apr 7;17(4):e1009476. doi: 10.1371/journal.pgen.1009476.r004

Author response to Decision Letter 1

25 Feb 2021

Attachment

Submitted filename: Response to reviewer PGenetics-D-20-01275R1.pdf

Click here for additional data file.^{(151.2KB, pdf)}

PLoS Genet. doi: 10.1371/journal.pgen.1009476.r005

Decision Letter 2

Jianzhi Zhang, Kirsten Bomblies

9 Mar 2021

Dear Dr Xia,

We are pleased to inform you that your manuscript entitled "Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Jianzhi Zhang

Associate Editor

PLOS Genetics

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors addressed my comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01275R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

PLoS Genet. doi: 10.1371/journal.pgen.1009476.r006

Acceptance letter

Jianzhi Zhang, Kirsten Bomblies

23 Mar 2021

PGENETICS-D-20-01275R2

Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures

Dear Dr Xia,

We are pleased to inform you that your manuscript entitled "Quantifying Evolutionary Importance of Protein Sites: A Tale of Two Measures" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


Supplementary Materials

S1 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(TIF)

Click here for additional data file.^{(300.6KB, tif)}

S2 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(TIF)

Click here for additional data file.^{(492.3KB, tif)}

S3 Fig. Conservation gradient induced from a protein residue is linearly correlated with its conservation within the protein.

(TIF)

Click here for additional data file.^{(490.7KB, tif)}

S4 Fig. Evolutionary rate is correlated with conservation rank.

Evolutionary rate (dN/dS) as a function of conservation rank for all residues in the dataset grouped according to their conservation rank and binned into 100 equally spaced bins of conservation rank.

(TIF)

Click here for additional data file.^{(211KB, tif)}

(TIF)

Click here for additional data file.^{(505.3KB, tif)}

(TIF)

Click here for additional data file.^{(535.8KB, tif)}

(TIF)

Click here for additional data file.^{(439.2KB, tif)}

(TIF)

Click here for additional data file.^{(530.2KB, tif)}

(TIF)

Click here for additional data file.^{(486.5KB, tif)}

(TIF)

Click here for additional data file.^{(561.9KB, tif)}

S11 Fig. Average conservation gradients of subsets of buried and exposed, non-functional site residues.

Each circle represents a subset of residues.

(TIF)

Click here for additional data file.^{(872.2KB, tif)}

S12 Fig. Average conservation gradients of functional sites binned according to conservation rank up to 0.65.

(TIF)

Click here for additional data file.^{(1.4MB, tif)}

S13 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

(TIF)

Click here for additional data file.^{(935KB, tif)}

S14 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

(TIF)

Click here for additional data file.^{(983KB, tif)}

S15 Fig. Catalytic site residues induce stronger conservation gradients than non-catalytic functional site residues with similar evolutionary rates.

(TIF)

Click here for additional data file.^{(871.7KB, tif)}

S16 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

(TIF)

Click here for additional data file.^{(619.8KB, tif)}

S17 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

(TIF)

Click here for additional data file.^{(617.1KB, tif)}

S18 Fig. Catalytic site residues often induce stronger conservation gradients than more conserved non-catalytic functional site residues within the same protein.

(TIF)

Click here for additional data file.^{(617.5KB, tif)}

S1 Table. Linear regression of conservation gradients as a function of conservation rank over all residues in different types of functional and structural categories.

(XLSX)

Click here for additional data file.^{(11.3KB, xlsx)}

S2 Table. List of all yeast proteins, their structural models and identified functional sites.

(XLSX)

Click here for additional data file.^{(342.8KB, xlsx)}

S1 Text. Conservation scores downloaded from ConSurf-DB for all the proteins participating in this study.

(RAR)

Click here for additional data file.^{(7.6MB, rar)}

S2 Text. Calculated conservation gradients for each residue in every protein in the dataset.

(RAR)

Click here for additional data file.^{(28.6MB, rar)}

S3 Text. Codon alignments of protein coding genes of S. cerevisiae with its four closely-related yeast species (S. paradoxus, S. mikatae, S. bayanus, and S. pombe).

(RAR)

Click here for additional data file.^{(986KB, rar)}


Data Availability Statement

Analysis scripts are available from https://github.com/AvitalSharirIvry/Quantifying-Evolutionary-Importance-of-Protein-Sites-A-Tale-of-Two-Measures.git.

[pgen.1009476.ref001] 1.Worth CL, Gong S, Blundell TL. Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol. 2009;10: 709–720. 10.1038/nrm2762 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref002] 2.Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nat Rev Genet. 2016;17: 109–121. 10.1038/nrg.2015.18 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref003] 3.Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1992;1: 216–226. 10.1002/pro.5560010203 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref004] 4.Conant GC, Stadler PF. Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009;26: 1155–1161. 10.1093/molbev/msp031 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref005] 5.Franzosa EA, Xia Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol. 2009;26: 2387–2395. 10.1093/molbev/msp146 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref006] 6.Franzosa EA, Xue R, Xia Y. Quantitative residue-level structure-evolution relationships in the yeast membrane proteome. Genome Biol Evol. 2013;5: 734–744. 10.1093/gbe/evt039 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref007] 7.Sharir-Ivry A, Xia Y. The impact of native state switching on protein sequence evolution. Mol Biol Evol. 2017;34: 1378–1390. 10.1093/molbev/msx071 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref008] 8.Hamelryck T. An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins. 2005;59: 38–48. 10.1002/prot.20379 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref009] 9.Yeh SW, Huang TT, Liu JW, Yu SH, Shih CH, Hwang JK, et al. Local packing density is the main structural determinant of the rate of protein sequence evolution at site level. Biomed Res Int. 2014; 572409. 10.1155/2014/572409 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref010] 10.Marcos ML, Echave J. Too packed to change: side-chain packing and site-specific substitution rates in protein evolution. PeerJ. 2015;3: e911. 10.7717/peerj.911 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref011] 11.Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues in enzyme active sites. J Mol Biol. 2002;324: 105–121. 10.1016/s0022-2836(02)01036-7 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref012] 12.Tóth-Petróczy A, Tawfik DS. Slow protein evolutionary rates are dictated by surface-core association. Proc Natl Acad Sci U S A. 2011;108: 11151–6. 10.1073/pnas.1015994108 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref013] 13.Jack BR, Meyer AG, Echave J, Wilke CO. Functional sites induce long-range evolutionary constraints in enzymes. PLOS Biol. 2016;14: e1002452. 10.1371/journal.pbio.1002452 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref014] 14.Nelson ED, Grishin N V. Evolution of off-lattice model proteins under ligand binding constraints. Phys Rev E. 2016;94: 022410. 10.1103/PhysRevE.94.022410 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref015] 15.Sharir-Ivry A, Xia Y. Non-catalytic Binding Sites Induce Weaker Long-Range Evolutionary Rate Gradients than Catalytic Sites in Enzymes. J Mol Biol. 2019;431: 3860–3870. 10.1016/j.jmb.2019.07.019 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref016] 16.Warshel A, Sharma PK, Kato M, Xiang Y, Liu H, Olsson MHM. Electrostatic basis for enzyme catalysis. Chem Rev. 2006;106: 3210–3235. 10.1021/cr0503106 [DOI] [PubMed] [Google Scholar]

[pgen.1009476.ref017] 17.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28: 235–42. 10.1093/nar/28.1.235 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref018] 18.Goldenberg O, Erez E, Nimrod G, Ben-Tal N. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures. Nucleic Acids Res. 2009;37: D323–7. 10.1093/nar/gkn822 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009476.ref019] 19.Celniker G, Nimrod G, Ashkenazy H, Glaser F, Martz E, Mayrose I, et al. ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function. Isr J Chem. 2013;53: 199–206. 10.1002/ijch.201200096 [DOI] [Google Scholar]

[pgen.1009476.ref020] 20.Ribeiro AJM, Holliday GL, Furnham N, Tyzack JD, Ferris K, Thornton JM. Mechanism and Catalytic Site Atlas (M-CSA): A database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 2018;46: D618–D623. 10.1093/nar/gkx1012 [DOI] [PMC free article] [PubMed] [Google Scholar]

21. Yang J, Roy A, Zhang Y. BioLiP: A semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013;41: D1096–1103. doi:10.1093/nar/gks966

22. Liu X, Lu S, Song K, Shen Q, Ni D, Li Q, et al. Unraveling allosteric landscapes of allosterome with ASD. Nucleic Acids Res. 2020;48: D394–D401. doi:10.1093/nar/gkz958

23. Huang Z, Zhu L, Cao Y, Wu G, Liu X, Chen Y, et al. ASD: a comprehensive database of allosteric proteins and modulators. Nucleic Acids Res. 2011;39: D663–D669. doi:10.1093/nar/gkq1022

24. Shen Q, Wang G, Li S, Liu X, Lu S, Chen Z, et al. ASD v3.0: unraveling allosteric regulation with structural mechanisms and biological networks. Nucleic Acids Res. 2016;44: D527–D535. doi:10.1093/nar/gkv902

25. Sharir-Ivry A, Xia Y. Nature of Long-Range Evolutionary Constraint in Enzymes: Insights from Comparison to Pseudoenzymes with Similar Structures. Mol Biol Evol. 2018;35: 2597–2606. doi:10.1093/molbev/msy177

26. Sharir-Ivry A, Xia Y. Using Pseudoenzymes to Probe Evolutionary Design Principles of Enzymes. Evol Bioinforma. 2019;15: 117693431985593. doi:10.1177/1176934319855937

27. Echave J. Beyond Stability Constraints: A Biophysical Model of Enzyme Evolution with Selection on Stability and Activity. Mol Biol Evol. 2019;36: 613–620. doi:10.1093/molbev/msy244

28. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi:10.1093/nar/25.17.3389

29. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998;26: 73–79. doi:10.1093/nar/26.1.73

30. Hu L, Benson ML, Smith RD, Lerner MG, Carlson HA. Binding MOAD (Mother Of All Databases). Proteins Struct Funct Bioinforma. 2005;60: 333–340. doi:10.1002/prot.20512

31. Ahmed A, Smith RD, Clark JJ, Dunbar JB, Carlson HA. Recent improvements to Binding MOAD: a resource for protein–ligand binding affinities and structures. Nucleic Acids Res. 2015;43: D465–D469. doi:10.1093/nar/gku1088

32. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35: 1026–1028. doi:10.1038/nbt.3988

33. Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34: D535–9. doi:10.1093/nar/gkj109

34. Chatr-Aryamontri A, Oughtred R, Boucher L, Rust J, Chang C, Kolas NK, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017;45: D369–D379. doi:10.1093/nar/gkw1102

35. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18: S71–S77. doi:10.1093/bioinformatics/18.suppl_1.s71

36. Wapinski I, Pfeffer A, Friedman N, Regev A. Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007;449: 54–61. doi:10.1038/nature06107

37. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30: 772–780. doi:10.1093/molbev/mst010

38. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24: 1586–1591. doi:10.1093/molbev/msm088
