Abstract
Functional residues in proteins tend to be highly conserved over evolutionary time. However, to what extent functional sites impose evolutionary constraints on nearby or even more distant residues is not known. Here, we report pervasive conservation gradients toward catalytic residues in a dataset of 524 distinct enzymes: evolutionary conservation decreases approximately linearly with increasing distance to the nearest catalytic residue in the protein structure. This trend encompasses, on average, 80% of the residues in any enzyme, and it is independent of known structural constraints on protein evolution such as residue packing or solvent accessibility. Further, the trend exists in both monomeric and multimeric enzymes and irrespective of enzyme size and/or location of the active site in the enzyme structure. By contrast, sites in protein–protein interfaces, unlike catalytic residues, are only weakly conserved and induce only minor rate gradients. In aggregate, these observations show that functional sites, and in particular catalytic residues, induce long-range evolutionary constraints in enzymes.
Catalytic sites in enzymes are highly conserved, but do they affect the evolutionary conservation of neighboring sites? This study shows that not just nearby neighbors but also second, third, fourth, and even fifth neighbors of a catalytic residue experience evolutionary constraint compared to a random site.
Author Summary
The basic biochemical functions of life are carried out by large molecules called enzymes. Enzymes consist of long chains of amino acids folded into a three-dimensional structure. Within that structure, a specific cluster of amino acids, known as the active site, performs the biochemical function. Substituting one amino acid for another in the active site typically results in a defective, non-functional enzyme, and therefore mutations at or near enzyme active sites are often lethal. Moreover, even mutations far from the active site have been found to disrupt function. Nonetheless, as organisms evolve, enzymes accumulate random mutations. Where in enzymes’ structures do these mutations accumulate without causing harm? Here, we observe evidence for extensive interactions between active sites and distant regions of the enzyme structure, in a comprehensive set of over 500 enzymes. We show that active sites tightly control the substitutions that an enzyme can tolerate. This control extends far beyond regions of the enzyme immediately adjacent to the active site, covering over 80% of a typical enzyme structure. Our findings have broad implications for molecular evolution, for enzyme engineering, and for the computational prediction of active-site locations in novel enzymes.
Introduction
Enzymes facilitate the chemical reactions necessary for life. To function properly, enzymes must reconcile two competing demands: they must fold stably into the correct three-dimensional conformation, and they must display the correct catalytic residues in their active sites. As enzymes evolve, mutations that are functionally beneficial are often deleterious for stability, and vice versa [1–3]. Thus, the patterns of evolutionary divergence observed in enzyme evolution are shaped by the interplay of these two potentially conflicting constraints.
Mutations affecting fold stability can occur anywhere in the protein structure, though in general stability effects tend to be more pronounced in the interior, more densely packed regions of a structure than on the protein surface [4,5]. By contrast, where mutations affect function in a protein structure is less clear. Site-directed mutagenesis experiments demonstrate that mutations at catalytic residues, unsurprisingly, disable enzyme function [6,7]. Accordingly, residues directly involved in protein function tend to be more conserved over evolutionary time than other residues [8–10]. Less intuitively, however, mutations 20 Å or more from a catalytic residue can attenuate catalytic activity in enzymes such as glycosidase [11], TEM-lactamase [12], or copper nitrite reductase [13]. Similarly, a study of a small set of α/β-barrel enzymes has found that evolutionary conservation decays continuously with the distance to the nearest catalytic residue [14]. These results suggest that residues far from an active site may be functionally important, but that this importance may decline with distance in physical, three-dimensional space.
Here, we analyze a dataset of 524 distinct enzyme structures spanning the six major functional classes of enzymes. We systematically assess how site-specific evolutionary variation in these enzymes relates to the geometric location of residues relative to the nearest catalytic residue. We find that, across all six major classes of enzymes, the constraining effects of catalytic residues extend to most of an enzyme’s structure, irrespective of protein size. These effects exist regardless of whether an active site is located on the surface or in the core of a protein, and they remain even when controlling for other structural features predicting evolutionary variation. Finally, we find that we can use site-specific conservation gradients to accurately recover active sites in more than 50% of enzymes. In summary, these findings demonstrate that functional sites induce long-range evolutionary constraints in enzyme structures.
Results
Functional Sites Induce Gradients of Conservation
To systematically explore the relationship between site-specific evolutionary rates and distance to the nearest catalytic residue, we have analyzed 524 diverse enzyme structures. We have chosen these structures as a subset of enzymes analyzed previously for their relationship between protein structure and evolutionary variation [15]. The structures represent all six major classes of enzymes, and no two structures in the dataset share more than 25% of their respective amino-acid sequences. The dataset includes both single subunit proteins (monomers) and multi-subunit proteins (multimers), and annotations describing the biological assembly and the location of the catalytic residues are available for each structure (see Methods for details). For each enzyme, we have constructed alignments of up to 300 homologous sequences, selected from the UniRef90 database [16,17]. We estimate evolutionary variation at each site in each alignment by calculating a site-specific relative evolutionary rate, using the software Rate4Site [18]. The relative rates are normalized such that a value of one corresponds to the average rate in a given protein, and larger or smaller values represent proportionally larger or smaller rates. For brevity, we will also refer to the relative rates simply as “rates.” In mathematical expressions, rates will be denoted by the letter K.
We first ask whether there is an overall trend toward increased evolutionary conservation near active sites. To address this question, we pool all sites from all structures into one combined dataset and then calculate the mean evolutionary rate as a function of Euclidean distance to the nearest catalytic residue in the respective structure. As expected, we find that evolutionary rates are, on average, the lowest at or directly near catalytically active sites. Moreover, we find that rates increase approximately linearly with increasing distance to the nearest catalytic residue, up to a distance of approximately 27.5 Å (Fig 1A). Beyond this distance, rates level off. Importantly, 80% of all residues in our dataset fall within a distance of 27.5 Å to the nearest catalytic residue (Fig 1B). Thus, the vast majority of all residues in each protein appear to experience some amount of purifying selection mediated by catalytic residues.
We can think of sites in a protein as organized into shells according to their distance to the closest catalytic residue. Each shell is 5 Å in width, the approximate minimum distance between two amino acid side-chains. The boundaries between these discrete shells are indicated in Fig 1B with dashed lines, and we can see clear dips in the distribution at 2.5 Å and 7.5 Å, the boundaries between the 0th and 1st and the 1st and 2nd shells (the boundaries between shells become less precise for higher shell numbers). We can subdivide the sites of our dataset into these discrete shells and then plot the rate distribution within each shell (Fig 1C). We find that the mean rate for each shell increases up to shell 5 (27.5 Å) and then stabilizes. Similarly, the width of the distribution also increases up to shell 5. Thus, all shells include some proportion of conserved sites, but increasingly distant shells include an increasing fraction of moderately or highly variable sites.
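The shell assignment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the use of a single representative coordinate per residue, and the toy coordinates are all hypothetical. Shell 0 covers distances below 2.5 Å, and every further shell is 5 Å wide.

```python
import math

SHELL_WIDTH = 5.0  # Å, the shell width stated in the text


def distance_to_nearest_catalytic(coords, catalytic_idx):
    """Euclidean distance (Å) from each residue to its closest catalytic residue.

    coords: list of (x, y, z) tuples, one representative point per residue
            (which atom or centroid to use is an assumption of this sketch).
    catalytic_idx: indices of the catalytic residues within coords.
    """
    return [
        min(math.dist(point, coords[c]) for c in catalytic_idx)
        for point in coords
    ]


def shell_index(d):
    """Map a distance to a shell: 0 for d < 2.5 Å, 1 for 2.5-7.5 Å, and so on."""
    return int((d + SHELL_WIDTH / 2) // SHELL_WIDTH)


# Toy example: residue 0 is catalytic, the others lie along the x axis.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (10.0, 0.0, 0.0), (28.0, 0.0, 0.0)]
distances = distance_to_nearest_catalytic(coords, catalytic_idx=[0])
shells = [shell_index(d) for d in distances]
```

A catalytic residue has distance 0 to itself and therefore always falls into shell 0.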
Conservation Gradients Are Distinct from Known Structural Constraints on Protein Evolution
The broad rate distributions that we observe within individual shells, in particular within shells distant from catalytic residues, highlight that there are other factors besides distance that also influence the extent and type of selection acting on individual sites. In fact, one important evolutionary constraint is the requirement for proteins to fold stably into their active conformation [19]. This constraint causes sites in the interior of the protein, shielded from the solvent and involved in many inter-residue contacts, to be more evolutionarily conserved than sites on the protein surface [4,15,20–29] (for a recent review, see [30]).
Two structural measures are commonly used to quantify this structural constraint: relative solvent accessibility (RSA) [31] and weighted contact number (WCN) [32]. RSA measures the exposure of a given residue to a hypothetical small solvent molecule, typically water. RSA is useful for determining if a residue is on the surface or the interior of a protein structure. WCN measures the local packing density of a given residue. WCN is high in the core of the protein, where residues are tightly packed. We have calculated both WCN and RSA for each site in each protein in our dataset. We have based this calculation on the published biological assembly of each protein, so that inter-chain contacts are properly accounted for in the case of enzymes that natively function in a multimeric state. As has been reported previously, on average WCN displays higher correlations with site-specific rate than RSA does, in particular when WCN is calculated with respect to the side-chain coordinates of each residue (see also S1 Fig) [33]. However, in our dataset, correlations of rate with WCN are only moderately higher than correlations with RSA, and there are proteins for which RSA outperforms WCN (S1 Fig). Therefore, throughout this work, we consider both WCN and RSA as measures of structural constraints acting on site-specific protein evolution. Importantly, neither WCN nor RSA makes any assumptions about catalytic residues in proteins. Both quantities are purely geometric measures of protein structure. Conversely, the distance d to the closest catalytic residue does not explicitly contain information about packing density or solvent accessibility. Yet, in our dataset, the three quantities WCN, RSA, and d are all correlated with each other (S2 Fig). Therefore, we next ask to what extent the distance d captures an evolutionary constraint that is distinct from the constraints captured by WCN and RSA.
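To make the packing measure concrete, the following sketch computes WCN under the commonly used inverse-square definition, in which each residue sums 1/r² over all other residues. That definition is an assumption here (the text defers to [32] for the exact form), as are the function name, the choice of one coordinate per residue, and the toy coordinates.

```python
import math


def weighted_contact_number(coords):
    """WCN per residue: sum over all other residues of 1 / r_ij**2.

    coords: list of (x, y, z) tuples, one point per residue, for example a
            side-chain center (the choice of point is an assumption).
    """
    wcn = []
    for i, p in enumerate(coords):
        total = 0.0
        for j, q in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(p, q) ** 2
        wcn.append(total)
    return wcn


# Toy example: three residues on a line, 1 Å apart.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
wcn = weighted_contact_number(coords)
```

The middle residue has the most neighbors at short range and so receives the largest WCN. Passing the coordinates of every chain in the biological assembly, rather than of a single subunit, is what lets contacts between subunits contribute to the sum.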
To address this question, we regress site-specific evolutionary rates K against WCN, RSA, and d, in all possible combinations, and separately for each enzyme in our dataset. We then record the R² for each model and each enzyme (Fig 2A). We find that the best purely structural model, using both WCN and RSA as predictor variables, explains on average 39% of the variation in rate (Fig 2A). Adding distance as a third predictor to this model increases the average R² to 44%. Thus, distance explains on average at least 5% of rate variation that cannot be attributed to purely structural factors, and possibly more than that; by itself, distance explains on average 25% of the variation in rate. Some of that variation may be accidentally captured by WCN or RSA, because active sites are frequently located closer to the interior than to the surface of the protein structure.
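The comparison between the structural model and the model that adds distance can be illustrated with a small ordinary least-squares sketch. The code below is not the analysis code of the study: the helper names are hypothetical, the per-site values are synthetic, and the fit uses plain normal equations so that the example needs only the Python standard library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic per-site data for one hypothetical protein (not from the study).
wcn = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rsa = [0.5, 0.1, 0.4, 0.2, 0.6, 0.3]
dist = [3.0, 9.0, 4.0, 12.0, 6.0, 15.0]
# Rates constructed so that distance carries signal beyond WCN and RSA.
rates = [0.2 - 0.05 * w + 1.0 * r + 0.02 * d for w, r, d in zip(wcn, rsa, dist)]

r2_structural = r_squared(rates, [wcn, rsa])   # K ~ WCN + RSA
r2_full = r_squared(rates, [wcn, rsa, dist])   # K ~ WCN + RSA + d
gain = r2_full - r2_structural                 # extra variance explained by d
```

Because the two models are nested, the gain can never be negative. It is a lower bound on the contribution of distance, since any variance that distance shares with WCN or RSA is already credited to the structural model.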
To further assess the independent contribution of distance to the pattern of site-specific rates, we compare model predictions and empirical rates as functions of distance to the active site. We compare rates predicted by the linear models K ∼ WCN + RSA and K ∼ WCN + RSA + d, which are fit for each protein individually. For visualization only, we average within shells, as explained above. We find that a linear model containing only WCN and RSA tends to overestimate site-specific evolutionary rates within the first three to four shells (green line in Fig 2B and 2C). Adding distance to this model removes nearly all of the overestimation (orange line in Fig 2B and 2C). These findings demonstrate that structural metrics alone are unable to accurately predict conservation patterns near active sites.
Functional and Structural Constraints Align for Active Sites in the Interior But Not for Those on the Surface
Enzymes often sequester substrates into a buried catalytic core. This sequestration allows them to facilitate chemistry that would otherwise be impossible in the broader cellular environment. For this reason, many enzymes tend to have active sites in the protein interior, where local packing density is high and solvent accessibility is low [19]. For those enzymes, we expect the distance metric d to correlate with WCN and/or RSA. By contrast, if the active site is located on the protein surface, then distance should correlate very little or not at all with either WCN or RSA.
To further disentangle active-site effects from WCN and RSA, we can identify individual structures from our dataset in which distance is sufficiently uncorrelated (defined as r < 0.25) with both WCN and RSA. Among these structures, we find four for which distance correlates strongly with evolutionary rate (defined as r ≥ 0.55) (see Fig 3). They correspond to the enzymes dihydrofolate reductase (DHFR, protein databank identifier [PDB ID]: 1DHF) [34], superoxide reductase (SOR, PDB ID: 1DO6) [35], anti-sigma factor SpoIIAB (PDB ID: 1L0O) [36], and the Serratia endonuclease (PDB ID: 1SMN) [37]. All of these enzymes perform different biological functions, and they are active in multimeric conformations. In these proteins, rate correlates more strongly with distance than it does with WCN or RSA (Fig 3A), and the mean rate increases linearly with distance throughout the entire structure (Fig 3B). In all four cases shown, the active sites are located near the protein surface (mean active-site RSA ranges from 0.19 to 0.25) and away from the protein center (Fig 3C).
To analyze the effect of active-site location on rate variation more systematically, we next subdivide our entire dataset into three categories based on active site location, measured by the mean RSA of all catalytic residues in the structure. We define these categories as active site in the protein interior (mean catalytic-residue RSA < 0.05), active site with intermediate solvent exposure (mean catalytic-residue RSA between 0.05 and 0.25), and active site on the protein surface (mean catalytic-residue RSA ≥ 0.25). Our dataset contains 98, 367, and 59 proteins in these three categories, respectively.
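The three location categories follow directly from the thresholds given above. A minimal sketch, with a hypothetical function name and made-up RSA values:

```python
def active_site_category(catalytic_rsa):
    """Classify an enzyme by the mean RSA of its catalytic residues.

    catalytic_rsa: list of RSA values, one per catalytic residue.
    Thresholds are the ones stated in the text (0.05 and 0.25).
    """
    mean_rsa = sum(catalytic_rsa) / len(catalytic_rsa)
    if mean_rsa < 0.05:
        return "interior"
    if mean_rsa < 0.25:
        return "intermediate"
    return "surface"


# Made-up RSA values for the catalytic residues of three hypothetical enzymes.
examples = {
    "buried": active_site_category([0.00, 0.04]),
    "partly exposed": active_site_category([0.10, 0.20]),
    "exposed": active_site_category([0.30, 0.40]),
}
```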
As before, we find that the purely structural metrics RSA and WCN tend to overestimate site-specific evolutionary rates near the active site in all three groups (Fig 4). Moreover, the structure-based models perform worse as the active site moves from the core of the enzyme to the surface. In all cases, incorporating distance into the model corrects most rate overestimation near the active site. Interestingly, all models perform better for active sites in the core than for active sites on the surface (Fig 4). We interpret this observation as follows: When the active site is located in the core of an enzyme, functional and structural constraints are aligned. The sites most conserved due to function are also the sites most conserved due to structure, and this overall trend is captured well in the linear models. By contrast, when the active site is located on the surface, functional and structural constraints are at odds with each other. The sites most conserved due to function are now the sites least conserved due to structure, and vice versa. In this case, since there are now two opposing trends within one structure, it is more difficult for any linear model to accurately capture rate variation throughout the structure.
Distance Effect Extends Further in Larger Proteins
We have previously seen that approximately 80% of all residues in our dataset fall within the 27.5 Å cutoff, inside of which evolutionary variation is reduced in proportion to distance to the nearest active site (Fig 1B). However, the 80% figure may be somewhat misleading, because in that analysis we have pooled all residues from all proteins. Our dataset comprises proteins of very different sizes, from 95 to 1,287 amino acids long, and for small proteins every residue falls within 27.5 Å of an active site, while for large proteins only one-half to two-thirds of the residues lie within the 27.5 Å distance cutoff.
To ascertain whether the relationship between functional sites and evolutionary rates depends on enzyme size, we can re-analyze our data by protein size. We define three evenly sized groups: small proteins (95–268 sites), medium-sized proteins (270–385 sites), and large proteins (386–1,287 sites). These groups contain 175, 175, and 174 structures, respectively. We observe that as enzyme size increases, the rate–distance slope decreases (Fig 5A). Distance effects are weaker in larger proteins but also extend further out. The effect remains visible when we analyze the distance–rate relationship for individual proteins and in the context of WCN and RSA (Figs 5B and S5): purely structural models, which use only WCN and RSA to predict rate, overestimate rate up to shell 3 in small proteins, up to shell 4 in medium-sized proteins, and up to shell 5 in larger proteins.
In summary, we see that the leveling off of rate at shell 5, around 27.5 Å, in the pooled dataset, does not represent a universal cutoff but rather an average obtained from combining many different structures into one analysis. For any individual protein, there will generally be a distance effect, but it may extend only to shell 3 or 4 in small proteins while extending to shell 6 (and possibly beyond) for very large proteins.
Selection Gradients Recover Catalytic Residues
As we have seen from the preceding analyses, active sites in enzymes impose a selection gradient that can be detected throughout the majority of the protein structure. This observation leads us to ask whether we can use this gradient to identify active sites when their location is not known. To answer this question, we blindly search for distance–rate gradients in our dataset. We systematically use one residue at a time as a reference point in the structure and fit a linear model of rate versus the distance to that reference point. We record the resulting R² for each model, and we consider the reference point with the highest R² as the putative active site in the structure.
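The blind search described above amounts to a scan over reference residues. The sketch below is one possible reading of that procedure, with hypothetical function names and a toy one-dimensional "structure"; it uses the fact that, for a linear model with a single predictor and an intercept, R² equals the squared Pearson correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def putative_active_site(coords, rates):
    """Return (index, R^2) of the reference residue whose distance best predicts rate.

    coords: list of (x, y, z) tuples, one point per residue (assumed input format).
    rates:  site-specific relative rates K, in the same order as coords.
    """
    best_idx, best_r2 = None, -1.0
    for ref, ref_point in enumerate(coords):
        dist = [math.dist(ref_point, p) for p in coords]
        r2 = pearson_r(dist, rates) ** 2  # R^2 of the fit rate ~ distance
        if r2 > best_r2:
            best_idx, best_r2 = ref, r2
    return best_idx, best_r2


# Toy example: seven residues on a line, with rates rising away from residue 3.
coords = [(float(x), 0.0, 0.0) for x in range(7)]
rates = [0.1 + 0.3 * abs(x - 3) for x in range(7)]
idx, r2 = putative_active_site(coords, rates)
```

This sketch ranks reference points by R² alone, as the text describes, and does not inspect the sign of the fitted slope. The scan costs one distance computation per residue pair, which is inexpensive for proteins of the sizes considered here.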
We find that in 18% of the structures in our dataset, the putative active site coincides with a known catalytic residue (Fig 6). In an additional 37% of structures, the putative active site falls within 7.5 Å of a catalytic residue but is not a catalytic residue itself. A distance of 7.5 Å corresponds to one shell, i.e., it captures residues in direct contact with a catalytic residue. Note that the gap visible between 0 and 2.5 Å in Fig 6 corresponds to the closest distance at which two side chains can physically contact each other. A putative active site either is a catalytic residue, in which case it has a distance of 0 Å to the nearest catalytic residue (i.e., itself), or it has to be at least 2.5 Å away from the catalytic residue. In summary, for more than half (55%) of the 524 enzymes in our dataset, we can use the existing selection gradient to identify either a catalytic residue or an immediate neighbor.
As a control, we have also considered a model that places the active site at the core of the protein, at the residue with the overall highest WCN (since, as stated above, the active site is located in the protein interior for many enzymes). We find that this control approach recovers catalytic residues or their immediate neighbors in 31% of enzymes (S6 Fig). Thus, while the control approach can recover active sites in a substantial fraction of enzymes, the selection-gradient-based method performs significantly better (odds ratio = 2.8, p < 1.7 × 10⁻¹⁵, Fisher’s Exact Test, S7 Fig, S1 Table).
Catalytic Residues Evolve More Slowly Than Protein–Protein Interface Residues
Many enzymes function as components of multimeric protein complexes. In fact, more than half of the enzymes in our dataset contain multiple subunits in their biological assemblies. The arrangement of and interaction between these subunits could substantially modify how protein structure and protein function shape protein evolution, especially if the active site occurs at the interface of two subunits. In our dataset, we find that residues in protein–protein interfaces are, on average, only slightly more conserved than any other residues, whereas catalytic residues are much more conserved (Fig 7A): residues in interfaces evolve, on average, at a rate of 0.91 relative to the average residue, while catalytic residues evolve, on average, at a relative rate of 0.10. To verify that the little conservation we see in interface sites is not an artifact of our enzyme dataset, we have also analyzed rates in a set of 17 non-enzymatic protein–protein complexes, consisting of 30 individual proteins total (see Methods). Again, we find that residues in protein–protein interfaces show only moderate conservation relative to all other residues (relative rate of 0.82, Fig 7A). Moreover, consistent with their weak conservation, protein–protein interfaces induce only very minor gradients of conservation, if any, in both enzyme and non-enzyme proteins (compare Fig 7B–7D with Fig 1). Thus, protein–protein interactions impose much weaker evolutionary constraints than catalytic sites.
The structural metrics RSA and WCN are also sensitive to subunit arrangement, and subunit arrangement can be incorrectly annotated in the biological assembly. To assess whether subunit arrangement and/or annotation errors affect the distance–rate relationship, we re-analyze our enzyme data using three additional variations of analysis choices: (i) RSA and WCN are calculated using the biological assembly, and any residues at the interface between subunits are excluded; (ii) RSA and WCN are calculated using a single subunit, and no residues are excluded; and (iii) RSA and WCN are calculated using a single subunit, and all interface residues are excluded (S8–S34 Figs). In all three cases, our results remain qualitatively unchanged from our prior results: rate increases with increasing distance to a catalytic residue, up to about 27.5 Å; distance has an effect on rate variation that is independent from the purely structural metrics WCN and RSA. Thus, in summary, there is a positive distance–rate relationship that is independent of WCN or RSA, and it exists regardless of how we treat multi-subunit enzymes and interface residues.
Discussion
We have shown that many enzymes exhibit a clear, nearly linear relationship between site-specific evolutionary rates and distance to the nearest catalytic residue. We have found this trend consistently throughout a large dataset of 524 diverse enzymes, and we have found that the relationship extends to most of the residues in any given enzyme structure. Using combined linear models containing RSA, WCN, and distance, we have found that distance explains at least 5% of the variance in rate after controlling for WCN and RSA, and potentially up to approximately 36% (Fig 3) in proteins in which the active site is located near the protein surface. Moreover, models containing only the structural predictors WCN and RSA consistently overestimate evolutionary variation near active sites, through shell 5 (27.5 Å) in large proteins. Finally, we have shown that in over half of the enzymes in our dataset, we can recover catalytic residues or their immediate neighbors from the evolutionary gradients they imprint throughout the protein structure.
For some enzymes in our dataset, we have found little evidence for functional or structural constraints on site-specific evolutionary rates. There are some proteins for which less than 10% of the variation in evolutionary rate can be accounted for with distance to a catalytic residue, RSA, or WCN. These low correlations suggest that either the rates themselves are uninformative, or that the available PDB structures are not reflective of protein structure in vivo. In the first case, the sequence alignments used to determine evolutionary rate could contain a mix of proteins with very different arrangements in vivo. We have no way of determining the biological assembly of every sequence in the alignment, so differences in corresponding subunit arrangement could bias the site-specific evolutionary rates. Additionally, the RCSB protein database may have conflicting biological assemblies. For example, the biological assembly for human dihydrofolate reductase (PDB ID: 1DHF) is classified as a homodimer, while the biological assembly for a separate structure (PDB ID: 1DRF) of the same protein is classified as a monomer. We have attempted to control for possible structural variability in the sequence alignments and biological assemblies by re-analyzing all structures as monomers and/or removing residues at the interface of subunits before computing correlations, and the overall trends observed remain the same (S8–S34 Figs). Regardless, all of the factors mentioned here could result in rates that correlate poorly with any structural predictors.
The field of molecular evolution has long sought to understand the relationship between protein structure, function, and sequence evolution [30]. Here, we have assessed this relationship by comparing distance to the nearest catalytic residue with site-specific evolutionary rates. Past work has employed covariation analyses to reveal clusters of co-evolving residues in protein structures, deemed “protein sectors” [38]. In specific cases, such as in serine proteases, these protein sectors also correspond to different functional biochemical regions of the structure. A recent reanalysis of the seminal protein sector work demonstrates that, in proteins with just one sector, sequence conservation recovers clusters of functional residues just as well as covariation analyses [39]. Our work demonstrates that not only are clusters of functional residues highly conserved, but that such residues induce gradients of conservation within a structure. This finding of long-range interactions between residues is consistent with the sector model of large regions of co-evolving residues.
We have found that active sites are among the most highly conserved sites in proteins, whereas residues involved in protein–protein interactions are only weakly conserved relative to the average site in a protein. Moreover, the gradients of conservation induced by protein–protein interfaces are much less marked than those induced by catalytic sites. This finding is consistent with prior work on protein–protein interactions. While several prior works have found increased conservation in interface regions [40–42], effect sizes have generally been found to be small. For example, [42] found that the reduction in evolutionary rate in a protein–protein interface was mostly (though not entirely) explained by the reduction in solvent accessibility induced by complex formation. Also, complementation assays and computer simulations suggest that protein–protein interfaces can experience extensive divergence without loss of function [43], again supporting the notion that protein–protein interfaces are frequently not under strong purifying selection. For these reasons, we believe that the rate gradients we have found toward active sites are not strongly confounded by protein–protein binding interfaces.
Long-range interactions between residues in a protein have historically been studied in the context of allostery. Initially proposed in 1961, allostery describes the process by which a small molecule (ligand) binds to one area of an enzyme (allosteric site) and induces a conformational change at a distant active site [44]. Studies of allosteric interactions shed light on two key aspects of our findings. First, biophysical models have been developed that explain long-range interactions. The Monod-Wyman-Changeux model, the most widely studied of allostery models, proposes that ligand binding stabilizes a biologically active or inactive quaternary structure [44]. Recent studies, however, demonstrate that some monomeric proteins also contain allosteric sites [44]. In G-protein coupled receptors, for example, a simplified model of conserved, physically connected amino-acid residues explains the long-range interactions between allosteric sites and active sites [45]. Our dataset contains a mix of monomeric and multimeric proteins, and we observe long-range interactions in both types of proteins. Thus, our findings suggest that allosteric-like couplings between active sites and distant residues may be more common than previously thought. The physical distance between allosteric ligand-binding sites and active sites ranges from 20 Å in hemoglobin to 60 Å in glycogen phosphorylase [46]. Therefore, the selection gradients we have observed here extend to distances well within the range of experimentally observed allosteric interactions. Second, while the observed selection gradients have allowed us to recover residues in close proximity to the active site in a little over half of the proteins in our dataset, in many proteins (45%) the selection gradient points toward a residue >7.5 Å from the active site. It is possible that these non-catalytic residues, which are highly predictive of the overall patterns of evolutionary rates in the structure, may be allosteric sites. Allosteric sites tend to be highly conserved, although typically not as conserved as active sites [47]. In summary, studies of allostery provide biophysical explanations for long-range interactions between residues and may explain why we failed to recover catalytic residues from selection gradients in some proteins in our dataset.
That selection gradients can recover active sites has potentially broad applications, even beyond enzymatic proteins. For example, some of us have previously used optimized distance to identify important functional sites in influenza A hemagglutinin (HA) [48]. HA, a viral surface protein, interacts directly with sialic acid found on the surface of human cells. Viral infection requires binding of HA to sialic acid, and antibodies bind near the sialic-acid binding region to inhibit viral infection. Residues in that region are thus under strong positive selection for immune escape, and consequently the selection gradient in HA revealed a rapidly evolving functional site. This finding suggests that selection gradients could effectively recover diverse types of functional sites, not only those that are well conserved. More broadly, evolutionary history is a useful predictor of active sites [8,9] and binding partners [10]. Assuming that a given structure has been crystallized, the rate gradients we have found here could improve computational predictions of active sites and binding sites.
Methods
Datasets and Site-Specific Evolutionary Rates
We selected 524 of 554 previously characterized enzymes [15] to conduct our analysis. We removed 30 structures because they contained chains with no available catalytic residue information, or because the UniRef90 database did not contain enough homologous sequences to construct a diverse alignment. These enzymes consist of 204 monomers and 320 multimers, and no two enzymes in the dataset have more than 25% sequence identity. For each enzyme, we obtained catalytic residue information from the Catalytic Site Atlas [49].
We acquired PDB structures of the biological assemblies for these proteins from the RCSB protein database [50]. A biological assembly represents the functional form of a given enzyme in vivo based on the best experimental data available. When available, we used biological assemblies that are author-provided or both author-provided and software-supported (labeled “A” and “A+S,” respectively, in the RCSB protein database). If author-provided biological assemblies were not available, we used biological assemblies predicted by PISA (protein interfaces, surfaces, and assemblies, http://www.ebi.ac.uk/pdbe/pisa/) (labeled “S”). PISA biological assemblies are entirely predicted by software. In cases in which there were multiple author-provided biological assemblies, we chose the first of those assemblies listed in the RCSB protein database.
In addition to the enzyme dataset, we also compiled a non-enzyme dataset as a control. We selected 17 of 179 protein–protein complexes from the Protein–Protein Interaction Affinity Database 2.0 [51]. We selected only non-enzymatic proteins based on interaction classification, absence of enzyme commission (EC) number, and UniProt annotations. We also excluded complexes containing antibodies, since antibodies evolve on a different time-scale and by different mechanisms than other cellular proteins. We acquired structures of the protein–protein complexes from the RCSB protein database.
To calculate site-specific evolutionary rates, we first extracted the amino-acid sequences from the PDB structures. Using PSI-BLAST [52], we then queried the UniRef90 database [16,17] to retrieve homologous sequences for each enzyme. Among these homologous sequences for each enzyme, we removed sequences with less than 10% pairwise divergence to any other sequence, to reduce phylogenetic bias. Next, we randomly downsampled the homologous sequences to a maximum of 300 sequences per enzyme. Then, we performed a multiple sequence alignment (MSA) of the sequences with MAFFT 7.215 (Multiple Alignment using Fast Fourier Transform) [53] and generated phylogenetic trees with RAxML 7.2.8 (Randomized Axelerated Maximum Likelihood) [54] using the LG substitution matrix (named after Le and Gascuel) [55] and the PROTCAT model of rate heterogeneity [56]. We calculated site-specific evolutionary rates with the program Rate4Site 2.01 [18], using the MSAs and phylogenetic trees from the previous step as input. We used the empirical Bayes approach for rate estimation and the JTT (Jones, Taylor, and Thornton) model of amino acid replacement [57]. Lastly, we normalized the rates such that the rates for each protein have a mean of 1. Because these rates are measured relative to the average divergence rate of the entire protein, they are dimensionless. Throughout this work, we refer to these site-specific relative rates as K, or simply “rates.”
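The last two bookkeeping steps of this pipeline, random downsampling to at most 300 sequences and rescaling the rates to a mean of 1, can be illustrated with a minimal Python sketch. This is not the authors' code (their analyses were run in R); the function names `downsample` and `normalize_rates` and the fixed random seed are assumptions made here for illustration.

```python
import random


def downsample(sequences, max_n=300, seed=0):
    """Randomly keep at most max_n sequences.

    The seed is a hypothetical choice made only so the sketch is reproducible.
    """
    sequences = list(sequences)
    if len(sequences) <= max_n:
        return sequences
    rng = random.Random(seed)
    return rng.sample(sequences, max_n)


def normalize_rates(rates):
    """Rescale raw site rates so that their mean over the protein is 1."""
    mean = sum(rates) / len(rates)
    if mean == 0:
        raise ValueError("mean rate is zero; cannot normalize")
    return [r / mean for r in rates]


# Example: four sites with raw rates; the normalized rates K average to 1.
K = normalize_rates([0.5, 1.0, 1.5, 3.0])
```

After this rescaling, a site with K below 1 evolves more slowly than the average site of its protein, and a site with K above 1 evolves faster.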
Predictor Variables
For each protein structure, we calculated several predictor variables at each site. First, we calculated the weighted contact number WCN for each residue i as follows:
$$\mathrm{WCN}_i = \sum_{j \neq i} \frac{1}{r_{ij}^{2}} \tag{1}$$
Here, rij is the distance between the geometric center of the side-chain atoms in residue i and the geometric center of side-chain atoms in residue j. To calculate these distances for residue pairs involving glycine, which has no side-chain, we used the location of the Cα in those residues instead. Unless noted otherwise, WCN was calculated using the complete biological assembly of the protein.
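As an illustration, the following Python sketch computes WCN from a set of per-residue reference points, assuming the usual inverse-square form in which every other residue j contributes 1/r_ij^2 to residue i. The function name `weighted_contact_number` and the input layout (one coordinate triple per residue, already reduced to the side-chain geometric center or Cα) are assumptions of this sketch, not part of the published pipeline.

```python
import numpy as np


def weighted_contact_number(coords):
    """Return WCN_i = sum over j != i of 1 / r_ij^2 for every residue i.

    coords: (N, 3) array of per-residue reference points in angstroms
    (side-chain geometric centers, or C-alpha positions for glycine).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Exclude the j == i term: 1 / inf contributes exactly 0 to the sum.
    np.fill_diagonal(sq_dist, np.inf)
    return np.sum(1.0 / sq_dist, axis=1)


# Three residues on a line at x = 0, 1, and 3 angstroms.
wcn = weighted_contact_number([[0, 0, 0], [1, 0, 0], [3, 0, 0]])
```

Densely packed residues have many close neighbors and therefore a high WCN. The sketch builds the full N × N distance matrix, which is simple but memory-hungry for very large assemblies.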
Next, we calculated the relative solvent accessibility (RSA) at each site. To this end, we first calculated the accessible surface area (ASA) using the software mkdssp [58,59]. We then normalized ASAs by the maximum solvent accessibility for each residue in a Gly-X-Gly tripeptide [31]. Peptide linkages across chains, typically disulfide bridges, were assigned an RSA of zero. Unless noted otherwise, RSA was calculated using the complete biological assembly of the protein.
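The normalization step amounts to dividing each residue's ASA by a residue-type-specific maximum. A minimal Python sketch follows; the function name `relative_solvent_accessibility` and the input layout are hypothetical, and the maximum values in the example are placeholders rather than the published values, which should be taken from [31].

```python
def relative_solvent_accessibility(residues, max_asa):
    """Divide each residue's ASA by the maximum ASA for its residue type.

    residues: list of (three_letter_code, asa) pairs, e.g. parsed from DSSP output.
    max_asa: dict mapping three-letter code to maximum ASA (same units as asa).
    """
    rsa = []
    for name, asa in residues:
        if name not in max_asa:
            raise KeyError(f"no maximum ASA available for residue type {name}")
        rsa.append(asa / max_asa[name])
    return rsa


# Placeholder maxima for illustration only; these are NOT the values from [31].
placeholder_max = {"ALA": 100.0, "GLY": 80.0}
rsa = relative_solvent_accessibility([("ALA", 25.0), ("GLY", 80.0)], placeholder_max)
```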
Finally, we calculated the distance d to the nearest catalytic residue for each residue in each structure. Most enzymes have multiple catalytic residues, so for each residue we took d to be the distance to whichever catalytic residue is closest. As was the case for WCN, distances were measured from the geometric center of the side-chain of one residue to the geometric center of the side-chain of another residue. In the case of glycines, the position of Cα was again used in place of the side-chain geometric center. Any residue with d = 0 is therefore a catalytic residue, and conversely, all catalytic residues lie at d = 0.
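The definition of d can be sketched in Python as a minimum over the distances to all catalytic residues. The function name `distance_to_nearest_catalytic` and the index-based way of flagging catalytic residues are assumptions of this sketch; the coordinates are the same per-residue reference points used for WCN.

```python
import numpy as np


def distance_to_nearest_catalytic(coords, catalytic_idx):
    """Return, for every residue, the distance d to the closest catalytic residue.

    coords: (N, 3) array of per-residue reference points in angstroms.
    catalytic_idx: indices (rows of coords) of the catalytic residues.
    """
    coords = np.asarray(coords, dtype=float)
    catalytic = coords[np.asarray(catalytic_idx, dtype=int)]
    if catalytic.size == 0:
        raise ValueError("at least one catalytic residue is required")
    # (N, C) matrix of distances from each residue to each catalytic residue.
    dists = np.linalg.norm(coords[:, None, :] - catalytic[None, :, :], axis=-1)
    return dists.min(axis=1)


# Residues 0 and 2 are catalytic, so they lie at d = 0.
d = distance_to_nearest_catalytic([[0, 0, 0], [3, 4, 0], [10, 0, 0]], [0, 2])
```

In this toy example the middle residue is 5 Å from the first catalytic residue and about 8.1 Å from the second, so its d is 5 Å.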
We defined interface residues as residues for which RSA differed by a minimum of 10% when calculated for the full biological assembly or for a single chain. All interface residues were included in the analyses presented in the main body of the text, but we excluded interface residues in the analyses presented in S17–S34 Figs.
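A minimal Python sketch of this interface criterion is given below. It assumes that RSA is expressed as a fraction and that "10%" means an absolute difference of 0.10 in RSA; the text does not state whether the threshold is absolute or relative, so this reading, like the function name `interface_residues`, is an assumption.

```python
def interface_residues(rsa_assembly, rsa_chain, threshold=0.10):
    """Flag residues whose RSA changes by at least `threshold` between the
    full biological assembly and the isolated chain.

    Assumes RSA values are fractions and the threshold is an absolute
    difference (an interpretation, not confirmed by the text).
    """
    if len(rsa_assembly) != len(rsa_chain):
        raise ValueError("RSA lists must describe the same residues")
    return [abs(a - c) >= threshold for a, c in zip(rsa_assembly, rsa_chain)]


# The first residue loses half of its exposure upon assembly and is flagged.
flags = interface_residues([0.25, 0.28, 0.05], [0.50, 0.30, 0.05])
```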
Linear Models
For each enzyme in the dataset, we fit the following linear models (represented in standard R notation): K ~ d, K ~ RSA, K ~ WCN, K ~ RSA + d, K ~ WCN + d, and K ~ RSA + WCN + d, where K is site-specific evolutionary rate, RSA is relative solvent accessibility, WCN is weighted contact number, and d is distance to the nearest catalytic residue.
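For readers who do not use R, the per-enzyme fits can be approximated with ordinary least squares in Python. The sketch below is not the authors' code: the function name `fit_linear_model` is hypothetical, and it relies only on NumPy's `numpy.linalg.lstsq`. Each keyword argument supplies one predictor column, so `fit_linear_model(K, RSA=rsa, WCN=wcn, d=d)` corresponds to the R formula K ~ RSA + WCN + d.

```python
import numpy as np


def fit_linear_model(K, **predictors):
    """Ordinary least-squares fit of K on an intercept plus the named predictors.

    Returns (coefficients, r_squared), where coefficients maps
    "intercept" and each predictor name to its fitted value.
    """
    y = np.asarray(K, dtype=float)
    names = list(predictors)
    columns = [np.ones_like(y)] + [np.asarray(predictors[n], dtype=float) for n in names]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_total = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - np.sum(residuals ** 2) / ss_total
    return dict(zip(["intercept"] + names, beta)), r_squared


# Synthetic example: rates that increase exactly linearly with distance d.
d = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
K = 0.2 + 0.05 * d
coef, r2 = fit_linear_model(K, d=d)
```

Because the synthetic rates are an exact linear function of d, the fit recovers the slope of 0.05 and the intercept of 0.2 up to floating-point error. Real site-specific rates are noisy, so per-enzyme R² values will be well below 1.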
Statistical Analyses, Plots, and Data Availability
All statistical analyses were carried out using the R software package [60]. Linear models were fit to each enzyme individually. After fitting the models, we binned the data for visualization purposes. Plots were generated with ggplot2 [61]. All code and data necessary to reproduce our analyses are available in a GitHub repository at: https://github.com/benjaminjack/enzyme_distance. Processed enzyme data are also provided as S1 Data. Processed data from the non-enzyme dataset are available as S2 Data. Parameter estimates for each linear model fitted to each enzyme are available in S3 Data.
Supporting Information
Acknowledgments
J. Echave is Principal Investigator of CONICET. The Texas Advanced Computing Center (TACC) provided high-performance computing resources.
Abbreviations
- ASA: accessible surface area
- EC: enzyme commission
- MSA: multiple sequence alignment
- PDB ID: protein databank identifier
- PISA: protein interfaces, surfaces and assemblies
- RSA: relative solvent accessibility
- WCN: weighted contact number
Data Availability
All processed data are provided as Supporting Information. All computer code and raw data files are available at: https://github.com/benjaminjack/enzyme_distance
Funding Statement
This work was supported in part by NIH grant R01 GM088344, DTRA grant HDTRA1-12-C-0007, NSF Cooperative agreement DBI-0939454 (BEACON Center), and ARO grant W911NF-12-1-0390 to COW. J. Echave is Principal Investigator of CONICET. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Protein stability promotes evolvability. Proc Natl Acad Sci USA. 2006;103:5869–5874.
- 2. Bloom JD, Gong LI, Baltimore D. Permissive secondary mutations enable the evolution of influenza oseltamivir resistance. Science. 2010;328:1272–1275. doi:10.1126/science.1187816
- 3. Gong LI, Suchard MA, Bloom JD. Stability-mediated epistasis constrains the evolution of an influenza protein. eLife. 2013;2:e00631. doi:10.7554/eLife.00631
- 4. Shakhnovich E, Abkevich V, Ptitsyn O. Conserved residues and the mechanism of protein folding. Nature. 1996;379:96–98.
- 5. Tokuriki N, Tawfik DS. Stability effects of mutations and protein evolvability. Curr Opin Struct Biol. 2009;19:596–604. doi:10.1016/j.sbi.2009.08.003
- 6. Gerlt JA. Relationships between enzymatic catalysis and active site structure revealed by applications of site-directed mutagenesis. Chem Rev. 1987;87:1079–1105.
- 7. Kanaya S, Kohara A, Miura Y, Sekiguchi A, Iwai S, Inoue H, et al. Identification of the amino acid residues involved in an active site of Escherichia coli ribonuclease H by site-directed mutagenesis. J Biol Chem. 1990;265:4615–4621.
- 8. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996;257:342–358.
- 9. Mihalek I, Reš I, Lichtarge O. A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004;336:1265–1282.
- 10. Huang YW, Chang CM, Lee CW, Hwang JK. The conservation profile of a protein bears the imprint of the molecule that is evolutionarily coupled to the protein. Proteins. 2015;83:1407–1413. doi:10.1002/prot.24809
- 11. Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proc Natl Acad Sci USA. 2015;112:7159–7164. doi:10.1073/pnas.1422285112
- 12. Abriata LA, Palzkill T, Dal Peraro M. How structural and physicochemical determinants shape sequence constraints in a functional enzyme. PLoS ONE. 2015;10:e0118684. doi:10.1371/journal.pone.0118684
- 13. Leferink NGH, Antonyuk SV, Houwman JA, Scrutton NS, Eady RR, Hasnain SS. Impact of residues remote from the catalytic centre on enzyme catalysis of copper nitrite reductase. Nat Commun. 2014;5:4395. doi:10.1038/ncomms5395
- 14. Dean AM, Neuhauser C, Grenier E, Golding GB. The pattern of amino acid replacements in alpha/beta-barrels. Mol Biol Evol. 2002;19:1846–1864.
- 15. Shih CH, Chang CM, Lin YS, Lo WC, Hwang JK. Evolutionary information hidden in a single protein structure. Proteins. 2012;80:1647–1657. doi:10.1002/prot.24058
- 16. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288.
- 17. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi:10.1093/bioinformatics/btu739
- 18. Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol Biol Evol. 2004;21:1781–1791.
- 19. Sikosek T, Chan HS. Biophysics of protein evolution and evolutionary protein biophysics. J R Soc Interface. 2014;11:20140419. doi:10.1098/rsif.2014.0419
- 20. Mirny LA, Shakhnovich EI. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999;291:177–196.
- 21. Echave J, Jackson EL, Wilke CO. Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol. 2015;12:025002. doi:10.1088/1478-3975/12/2/025002
- 22. Grahnen JA, Nandakumar P, Kubelka J, Liberles DA. Biophysical and structural considerations for protein sequence evolution. BMC Evol Biol. 2011;11:361. doi:10.1186/1471-2148-11-361
- 23. Wilke CO. Bringing Molecules Back into Molecular Evolution. PLoS Comput Biol. 2012;8:e1002572. doi:10.1371/journal.pcbi.1002572
- 24. Scherrer MP, Meyer AG, Wilke CO. Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol Biol. 2012;12:179. doi:10.1186/1471-2148-12-179
- 25. Bustamante CD, Townsend JP, Hartl DL. Solvent Accessibility and Purifying Selection Within Proteins of Escherichia coli and Salmonella enterica. Mol Biol Evol. 2000;17:301–308.
- 26. Shahmoradi A, Sydykova DK, Spielman SJ, Jackson EL, Dawson ET, Meyer AG, et al. Predicting evolutionary site variability from structure in viral proteins: buriedness, packing, flexibility, and design. J Mol Evol. 2014;79:130–142. doi:10.1007/s00239-014-9644-x
- 27. Huang TT, Marcos ML, Hwang JK, Echave J. A mechanistic stress model of protein evolution accounts for site-specific evolutionary rates and their relationship with packing density and flexibility. BMC Evol Biol. 2014;14:78. doi:10.1186/1471-2148-14-78
- 28. Yeh SW, Liu JW, Yu SH, Shih CH, Hwang JK, Echave J. Site-specific structural constraints on protein sequence evolutionary divergence: local packing density versus solvent exposure. Mol Biol Evol. 2014;31:135–139. doi:10.1093/molbev/mst178
- 29. Ramsey DC, Scherrer MP, Zhou T, Wilke CO. The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics. 2011;188:479–488. doi:10.1534/genetics.111.128025
- 30. Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nature Rev Genet. 2016;17:109–121. doi:10.1038/nrg.2015.18
- 31. Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum Allowed Solvent Accessibilites of Residues in Proteins. PLoS ONE. 2013;8:e80635. doi:10.1371/journal.pone.0080635
- 32. Lin CP, Huang SW, Lai YL, Yen SC, Shih CH, Lu CH, et al. Deriving protein dynamical properties from weighted protein contact number. Proteins. 2008;72:929–935. doi:10.1002/prot.21983
- 33. Marcos ML, Echave J. Too packed to change: side-chain packing and site-specific substitution rates in protein evolution. PeerJ. 2015;3:e911. doi:10.7717/peerj.911
- 34. Davies JF, Delcamp TJ, Prendergast NJ. Crystal structures of recombinant human dihydrofolate reductase complexed with folate and 5-deazafolate. Biochemistry. 1990;29:9467–9479.
- 35. Yeh AP, Hu Y, Jenney FE, Adams MWW, Rees DC. Structures of the Superoxide Reductase from Pyrococcus furiosus in the Oxidized and Reduced States. Biochemistry. 2000;39:2499–2508.
- 36. Campbell EA, Masuda S, Sun JL, Muzzin O, Olson CA, Wang S, et al. Crystal structure of the Bacillus stearothermophilus anti-sigma factor SpoIIAB with the sporulation sigma factor sigmaF. Cell. 2002;108:795–807.
- 37. Miller MD, Krause KL. Identification of the Serratia endonuclease dimer: structural basis and implications for catalysis. Protein Sci. 1996;5:24–33.
- 38. Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009;138:774–786. doi:10.1016/j.cell.2009.07.038
- 39. Teşileanu T, Colwell LJ, Leibler S. Protein Sectors: Statistical Coupling Analysis versus Conservation. PLoS Comput Biol. 2015;11:e1004091. doi:10.1371/journal.pcbi.1004091
- 40. Mintseris J, Weng Z. Structure, function, and evolution of transient and obligate protein–protein interactions. Proc Natl Acad Sci USA. 2005;102:10930–10935.
- 41. Kim PM, Lu LJ, Xia Y, Gerstein MB. Relating three-dimensional structures to protein networks provides evolutionary insights. Science. 2006;314:1938–1941.
- 42. Franzosa EA, Xia Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol. 2009;26:2387–2395. doi:10.1093/molbev/msp146
- 43. Kachroo AH, Laurent JM, Yellman CM, Meyer AG, Wilke CO, Marcotte EM. Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science. 2015;348:921–925. doi:10.1126/science.aaa0769
- 44. Changeux JP. 50 years of allosteric interactions: the twists and turns of the models. Nat Rev Mol Cell Biol. 2013;14:819–829. doi:10.1038/nrm3695
- 45. Süel GM, Lockless SW, Wall MA, Ranganathan R. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol. 2003;10:59–69.
- 46. Changeux JP. Allostery and the Monod-Wyman-Changeux model after 50 years. Annu Rev Biophys. 2012;41:103–133. doi:10.1146/annurev-biophys-050511-102222
- 47. Lu S, Huang W, Wang Q, Shen Q, Li S, Nussinov R, et al. The structural basis of ATP as an allosteric modulator. PLoS Comput Biol. 2014;10:e1003831. doi:10.1371/journal.pcbi.1003831
- 48. Meyer AG, Wilke CO. Geometric Constraints Dominate the Antigenic Evolution of Influenza H3N2 Hemagglutinin. PLoS Pathog. 2015;11:e1004940. doi:10.1371/journal.ppat.1004940
- 49. Furnham N, Holliday GL, de Beer TAP, Jacobsen JOB, Pearson WR, Thornton JM. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res. 2014;42:D485–D489. doi:10.1093/nar/gkt1243
- 50. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
- 51. Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the Integrated Protein–Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol. 2015;427:3031–3041. doi:10.1016/j.jmb.2015.07.016
- 52. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi:10.1186/1471-2105-10-421
- 53. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–780. doi:10.1093/molbev/mst010
- 54. Stamatakis A. RAxML Version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi:10.1093/bioinformatics/btu033
- 55. Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25:1307–1320. doi:10.1093/molbev/msn067
- 56. Stamatakis A. Phylogenetic models of rate heterogeneity: a high performance computing perspective. IEEE; 2006. doi:10.1109/IPDPS.2006.1639535
- 57. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992;8:275–282.
- 58. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
- 59. Joosten RP, te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011;39:D411–D419. doi:10.1093/nar/gkq1105
- 60. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2014. Available from: http://www.R-project.org/.
- 61. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer; 2009.