Abstract
Proteins play important roles in biology, biotechnology and pharmacology, and missense variants are a common cause of disease. Discovering functionally important sites in proteins is a central but difficult problem because of the lack of large, systematic data sets. Sequence conservation can highlight residues that are functionally important but is often convoluted with a signal for preserving structural stability. We here present a machine learning method to predict functional sites by combining statistical models for protein sequences with biophysical models of stability. We train the model using multiplexed experimental data on variant effects and validate it broadly. We show how the model can be used to discover active sites, as well as regulatory and binding sites. We illustrate the utility of the model by prospective prediction and subsequent experimental validation on the functional consequences of missense variants in HPRT1 which may cause Lesch-Nyhan syndrome, and pinpoint the molecular mechanisms by which they cause disease.
Subject terms: Protein function predictions, Machine learning, Mutation, Computational biophysics
An important step in understanding and using proteins is to identify the residues that are important for function. The authors present a machine-learning based method to predict functional sites that leverages and combines the information available in protein sequences and structures.
Introduction
Proteins carry out most functions in a cell, generally through interactions with other molecules. These molecular interactions often involve specific sites or regions whose identification plays a fundamental role in understanding biology and disease. Progress has been made in the development of methods to identify some types of functional sites1–5 and efforts have been made to understand the relationship between sequence variability, protein function, stability and the onset of diseases6–10. With the entry of accurate and large-scale protein structure prediction, structure-based methods for understanding biology are becoming even more important.
Analyses of large-scale mutagenesis studies have been used to probe the role of individual residues in the stability, abundance and function of a protein11–13. Since most proteins need to be folded to function, it may, however, be difficult to deconvolute the effects of amino acid substitutions on intrinsic function from their effects on stability and cellular abundance14. We note here that we use the term ‘function’ and ‘functional sites’ in a relatively general sense since our goal is to examine protein function broadly. In a few, favourable cases, multiplexed assays of variant effects (MAVEs) have been used to probe the consequence of almost all individual substitutions using both a functional readout and a readout that probes cellular abundance. Analyses of such data have been used to shed light on the molecular mechanisms underlying perturbed function, and more specifically to pinpoint which functional properties an amino acid residue contributes to15–18. For example, variants that lose function together with loss of abundance are likely to be caused by perturbations to the overall protein fold and stability, whereas variants that lose function while retaining wild-type-like abundance in the cell are likely to be caused by perturbing sites that directly play a role in function15,16,18. Alternative to such multi-readout experiments, global analyses of large datasets of multi-mutant variants can be used to deconvolute effects on for example stability and binding19.
In a recent analysis of abundance and activity assays16 we showed that approximately half of the single-point variants that show loss-of-function do so together with loss of protein abundance. This result suggests that if one uses a functional readout to detect residues that are directly involved in function, then about half of the variants detected as important are so simply because they cause lowered abundance. This in turn makes it difficult to separate residues that are directly involved in function (for example in catalysis, binding and signalling) from those positions that are conserved mostly due to structural constraints20.
The other half of loss-of-function substitutions, instead, mostly affects the protein via for example perturbing interactions with substrates or binding partners rather than stability16,21. We call this class of substitutions ‘stable but inactive’ (SBI) variants emphasising their tight and direct involvement in protein function. Pinpointing and predicting which variants are SBI is important to understand how amino acid substitutions might affect protein functions, why they possibly cause disease, and to ultimately aid the development of personalised therapeutic treatments. A very practical utility of the detection of SBI variants is that they enable the identification of amino acid residues that play a direct role in function. Indeed, much of our biochemical understanding of how proteins function relies of protein engineering studies where the effects of amino acid substitutions on various biochemical readouts are probed. Thus, positions where most substitutions affect function, but not structural stability, are often found in functional sites16–18.
As an alternative to experimental measurements, computational predictions can offer a faster, cheaper and more scalable approach to pinpoint positions that are important for function. Computational prediction of SBI variants, however, is not trivial. Many existing computational protocols are based on evaluating sequence conservation to find positions and variants that cause loss of function5,22–26. This strategy alone, however, may not be sufficient to identify residues that are conserved due to direct functional roles because sequence evolution is subject to both functional and structural constraints, and thus the different signals cannot be easily disentangled. One possible strategy to overcome this problem is to combine evolutionary data with analyses that report directly on the effects of substitutions on protein stability16,27–29.
Here, we aim to create a robust and easy-to-use prediction method to identify functional residues in proteins and provide insights into the biochemical role of these residues in the target protein. To avoid biases from annotations of functional sites, we use data generated by MAVEs that report on the effects of a wide range of substitutions on both function and abundance to train a supervised machine learning model. As input to the model we use a combination of sequence conservation measures, free energy changes accompanying substitutions, and physicochemical properties. We begin by showing how our model can capture a wide range of functional sites that include those in active sites, but also distributed throughout the protein structure including potential allosteric and regulatory sites. We then show how we achieve good accuracy in pinpointing functional amino acids in different validation scenarios. Across several proteins we find that roughly one in ten of the positions are functionally relevant and conserved for reasons different than structural stability. Having validated the model, we use it to provide examples of the kinds of insight that it may provide including identifying catalytic sites, regions that interact with substrates, and interfaces in complexes. Finally, we performed prospective predictions of the mechanisms of disease variants in HPRT1 (encoding the protein Hypoxanthine-Guanine Phosphoribosyltransferase 1, HPRT1) and validate these by measuring variant effects on function and abundance. The code for our model is freely available and we also provide access to it via a web implementation.
Results
Previously we have used evolutionary analyses combined with stability calculations to predict variant effects on protein function and stability and showed how these measures correlate with changes in cellular abundance or function13,16. Here, we build on these ideas to construct a model to identify functional sites in proteins via the identification of SBI variants (Fig. 1). Before detailing the model, we first provide an intuitive description of the basic idea. Our goal is to identify positions in proteins that play some role in protein function and regulation; these may for example include active sites in enzymes, but also ‘second-shell’ residues around the active site, protein-protein interfaces, allosteric and regulatory sites, or residues involved in recognizing ligands and substrates. Experimentally, these residues may be identified since amino acid substitutions at these sites might cause loss of ‘function’ (in some readout). Many variants, however, cause loss of function via loss of stability and cellular abundance, and we remove such indirect effects by requiring that the variants have close to wild-type-like abundance (i.e., the variants belong to the SBI class). Computationally, we can identify these variants by calculations of effects on protein stability and conservation, thus finding positions that are conserved during evolution, but not due to a role in protein stability. We note two effects of these choices. First, the broad definition means that we can assign a functional role to a relatively large number of residues e.g., well beyond active sites in enzymes. Second, some residues will have a direct role in function, but also be important for protein stability; our analysis will miss those residues, but as shown below, our results suggest that most functional sites do not fall in this class.
Based on the ideas above, we collected the results from two complementary types of MAVEs that respectively probe cellular-abundance and functional effects for three different proteins: NUDT1530, PTEN31,32, and CYP2C933 for a total of 9945 variants at 923 positions. Based on the experimental abundance and function scores, we assigned each variant to one of the following four classes16: WT-like (high abundance and high activity), total loss (low abundance and low activity), SBI (high abundance and low activity), or low abundance and high activity. We also selected input features for each variant. These were calculated from the three-dimensional structure of a protein and a multiple sequence alignment. More precisely, we included the following features: (i) the predicted change in thermodynamic protein stability (ΔΔG) calculated using Rosetta34, (ii) the evolutionary sequence information scores (which we term ΔΔE, by analogy with ΔΔG) using GEMME26, (iii) the hydrophobicity of the amino acid35, and (iv) the weighted contact number36,37 (Fig. 1A).
We first examined whether one of these features on its own would be sufficient, but did not find any that individually separates SBI variants from the remainder (Supplementary Fig. 1). Thus, we used the experimental labels and the input features to train a gradient boosting classifier (Fig. 1A and Supplementary Fig. 2) that predicts abundance (high/low) and activity (high/low) for each variant (see Methods for further details on the feature choices).
We determined model hyperparameters using a stratified cross-validation procedure, where each validation set contained 20% of the data. The resulting model has an average validation accuracy of 58% and a Matthews correlation coefficient of 0.57. On the entire training data the model classified 9540/9945 variants (95%) correctly, of which 1638/1819 (90%) are SBI variants. Having assigned effects of the individual variants, we then used this data to pinpoint the functional residues in the target proteins. To this end, we assigned a residue to a class if at least half of the substitutions at that position belonged to that class. In particular, we focused our attention on amino acids classified as ‘functional residues’, where 50% or more of their variants are stable but inactive (Fig. 1B). We compared the functional positions identified from the MAVEs with the predictions from the model (Fig. 2). For the three proteins included in the training we correctly identify 116 out of 127 residues. Accuracy and true positive rates are similar for the three proteins (Fig. 2).
After feature selection and training of the final model, we tested its performance on an independent dataset (GRB2 SH3 domain,18) using as baseline a model using only cutoff values for ΔΔE and ΔΔG (Supplementary Fig. 3A) . We find that our model substantially outperforms the baseline model, especially in labelling functional residues. We also validated the performance of the model on a reduced variant training set (Supplementary Table 1), and by training on two of the proteins and evaluating on the third (Supplementary Table 2); in both cases we found reduced performance compared to the full model, but no evidence of substantial overfitting. We also examined other choices of features including both subsets of those used in our final model and a model with a wider set of features, and evaluated both by cross-validation and the independent dataset (Supplementary Fig. 3B). The results show that the chosen feature set outperforms the other sets, including the model with a larger number of features.
Validation and applications
Our model predicts sites and residues that play a broad range of functional roles. This, however, makes it more difficult to validate the model as most experiments only probe the role of a small number of sites. We therefore tested the method on diverse tasks and data.
Validation using multiplexed and high-throughput data
First, we tested our model against a set of data that is relatively similar to the training data. Specifically, we applied it to two proteins (an SH3 domain from GRB2; Uniprot ID P62993 and the PDZ3 domain from PSD95; Uniprot ID P78352) that each have been assayed using two MAVEs that probe abundance and binding18. A joint analysis of this data in turn enabled determining ΔΔG for both folding and binding. We then used our model to predict variants and residues that are important for folding and function (here binding), and compared the results to the experiments (Supplementary Fig. 4). For the SH3 domain, we find a good agreement between the predictions and experiments. Specifically, we find that variants and residues that are predicted to cause loss of function due to loss of abundance are generally found to have a large ΔΔG for folding in the experiments (Supplementary Fig. 4A, B). Similarly, the variants and residues that we predict to cause loss of function without loss of abundance, generally have a large ΔΔG for binding, but are experimentally found not to affect folding. Note that our model did not use information about binding interactions, but simply discovers residues important for peptide binding by the fact that they are conserved, but not for stability reasons. A similar analysis of the PDZ domain showed a more complex picture, where we find a correlation (rather than separation) between the ΔΔG for folding and binding, and that the predictions appear to divide the variants and residues into three categories with progressively greater ΔΔG values (Supplementary Figure 4C,D). We note that for PDZ3 we found a limited number of homologues in Uniref30 for the PDZ3 domain, which might make the ΔΔE values more uncertain.
We then analysed data from MAVEs on five different proteins to validate the performance of the variant classification step (Table 1). In contrast to the data that were used to train the model, we here only had data for a single MAVE for each protein. Four of these experiments are functional assays and the last a cellular-abundance assay. Because these experiments only probe one dimension of our two-dimensional ‘function-and-abundance’ landscape, we simplified the output labels from the model from four to two classes. For the four MAVEs that probe function we combined the model’s output into either functional or inactive (independent of the predicted effect on abundance), and for the abundance-based MAVE we combine the predicted labels into high/low abundance (independent of the predicted effect on function). We compared these predictions to the data generated by the MAVEs which we also reduced to binary labels. On average we find an accuracy of 72% over all the variants and 68% when focusing on the variants labelled as SBI by our model (Table 1).
Table 1.
We also compared our predictions with the results of a high-throughput experiment on alkaline phosphatase (PafA; Uniprot ID Q9KJX538). In these experiments, the catalytic efficiency (kcat/KM) was measured for the wild-type protein as well as variants where each residue was changed into either glycine or valine. We first analysed how well our model was able to distinguish between variants with different levels of activity, and found a lower median activity (kcat/KM) for variants classified as total-loss and SBI compared to WT-like variants (Supplementary Fig. 5A, B). The experiments also revealed different effects of valine and glycine variants, and we analysed this observation in light of our predictions. For buried positions, we find that that variants at total-loss positions generally gave rise to lowered activity, in particular when substituting to glycine (Supplementary Fig. 5C). At exposed positions, we generally predicted variants to have a smaller effect (Supplementary Fig. 5D). While the experiments revealed that most variants did not cause global unfolding, a number of variants appeared to affect the in vitro translation that was used to produce the proteins38. In particular, some variants resulted in less active protein when expressed at 37 °C compared to 23 °C. Using a ten-fold change as a cutoff38, we find most (70%) variants that show this level of difference belong to the total-loss class (Supplementary Fig. 5E, F), in line with the fact that these are predicted to cause folding defects.
Finally, we compared predictions from our model with results from two previously published models for detecting functional residues27,39. We selected 10 enzymes that had previously been studied and for which the percentage of true positives from the previous work was available. Using our model, we find 77 out of the 109 known functional sites (true positive rate of 0.71) (Supplementary Fig. 6), which can be compared to values of 0.40 and 0.60 from the earlier models27,39.
Predicting functional sites in enzymes and protein interactions
As an example of the kinds of insights one might gain on specific proteins, we applied the model to the Anti-sigma F factor (Uniprot: O32727, Fig. 3A), for which previous studies have shown the importance of a number of sites40,41. Our model predicted 509 SBI variants (20% of the total number of variants) and we used these to label 21 positions as predicted functional sites. When comparing these to previous biochemical studies, we found reported functional roles for 17 of the 21 predicted positions40,42. Of these 17 amino acids, nine are located in the proximity to the active site (Fig. 3B). His54, Gly55, Thr99, Gly107, Gly109 and Thr130 interact and stabilise the ADP/ATP in the active site, Asn50 is involved in the chelation of a Mg2+ ion, Glu46 acts as the catalytic base in the phosphorylation reaction and Arg105 stabilises the transition state in the phosphorylation reaction. In addition to these active site residues, we found another cluster of functional residues (Glu104, Thr49, Glu16, Ser45, Glu39 and Arg20) in the proximity to the binding site for the Anti-sigma F factor antagonist (Uniprot O32723, Fig. 3B), while predicted functional positions Arg20 and Lys41 have been described as mediators of the interaction of the enzyme with the sigma factor. The roles of the remaining four positions that our model highlighted (Asn3, Asn15, Gly129, and Pro95) have, to our knowledge, not been analysed; they could either be false positives or residues with functional roles that have not yet been characterized. We observed that the functional sites in the Anti-sigma F factor could be grouped in two structurally compact clusters (Supplementary Fig. 7). The first cluster includes all of the residues that have a role in the catalytic process, while the second cluster contains positions in the interaction network with Anti-sigma F factor antagonist and the sigma-factor.
Having shown the model performance using the Anti-sigma F factor as an example, we extended this analysis of functional positions to a larger data set, consisting of 20 enzymes in the Mechanism and Catalytic Site Atlas database (M-CSA43), as well as five non-enzymes from the Protein–Protein Interaction Affinity Database 2.044. We applied our model to these 25 proteins and identified 16167 SBI variants (15.2% of the total) and assigned 588 residues to be important for function (12.7% of the total positions). We found a small difference in the fraction of predicted functional sites between the enzymes (13.9% of the total positions) and the other five proteins (11.6%). For the 20 proteins in M-CSA, we collected a curated list of 87 residues known to be involved in catalysis43. Given the well-defined functional roles we would expect that most substitutions at these sites would affect the enzymatic function. Our model assigned 62 of these 87 positions (71%) as functional positions (Fig. 4A). In 9 out of 20 enzymes the entire set of catalytic residues were classified as functional residues, while in the rest of the dataset the fraction of matching sites ranged between 33% and 80%.
We examined in more detail the 25 of the 87 positions that our model did not assign as SBI. Most of these (17/25 residues; 68%) have a median ΔΔG greater 2 kcal/mol, and our model labelled them as ‘total-loss’ positions. This finding highlights one of the limitations of our approach. While we can assign conserved residues that do not affect stability to have a likely functional role beyond structural stability, the reverse is not true. In these particular cases, ca. 20% (17/85) of the amino acids in the active sites appeared to be important both for function and structural stability. As an example we show the results for an Endo-1,4-beta-xylanase (Supplementary Fig. 8) where three of the five catalytic positions were assigned as functional residues by our model, and the remaining two have a median ΔΔG > 2 kcal/mol.
A strength of our model is that it identifies residues with likely functional roles beyond those directly involved in for example catalysis. We examined the positions that our model predicted to be functional sites in the 20 enzymes from M-CSA, and found that a substantial fraction of these are localised in the vicinity of the catalytic site (Fig. 4B); on average 48% of the predicted functional residues are located less than 10 Å from the closest catalytic site residue. We expect that many of these residues are important for the catalytic process, as we show for glutamate racemase (Uniprot: P56868, Fig. 4C), where 19 out of 26 of the predicted functional residues (including M-CSA catalytic positions) are less than 10 Å from the D-glutamine in the catalytic site45.
In some cases we also found that the predicted functional sites were located both close to as well as further away (20 Å or more) from the catalytic site, as illustrated by orotate phosphoribosyltransferase (OPRTase, Uniprot: P0A7E3, Supplementary Fig. 9), where predicted functional residues are located in several distinct regions. Most proteins in the M-CSA set have clusters of predicted functional sites further away from catalytic sites. We found that these positions, for example, may be involved in interactions with substrates, but not directly involved in the catalytic activity. As an example we show a thymine-DNA glycosylase (Uniprot Q13569, Fig. 4D), where we identified a subset of functional residues far (14.5 Å) from the enzymatic active site. These include Arg275 and the residues in loop 274–27741,46. These amino acids are not directly involved in catalysis, but their role is to push the target DNA base into the catalytic pocket, which consists of two residues: Arg140 and His15146,47. Our model identified the former as a functional position, while the latter was predicted to be a total-loss position.
In addition to finding positions that are important for catalysis and binding of substrates, our model can also help identify positions that play a role in forming protein-protein interactions as we show for thymidylate synthase (P00469, Fig. 4E), a homo-dimeric enzyme with a catalytic pocket in both of the subunits. Of the 41 functional residues identified by our model, 20 are located close to the protein-protein interface. We note here that, unless otherwise noted, all the stability calculations are performed on monomeric structures, and so we identify these residues as being conserved, but not due to the structural stability of the individual subunits. Looking at the predicted functional sites at the interface, we find that some are involved in forming the active site which includes residues from both subunits (Arg178, Arg179, Cys198,48). We, however, also found a number of other residues, which we suggest are conserved as they stabilize the (obligatory) dimer structure.
To examine whether we could extract more information on interaction interfaces we also calculated ΔΔG values using the structure of the dimer of thymidylate synthase, introducing each amino acid substitution in both chains. Together with contact numbers calculated using the dimer structure we used these new scores as input to our model. We compared the resulting classification with the results obtained when using using the monomer structure as input to our model (Fig. 5). The analysis shows that nine residues at the interface are classified as total-loss when the calculations are based on the dimer structure (compared to just one when the monomer structure is used). Of these nine total-loss positions, three were labelled as functional residues in calculations based on the monomer structures. Residues at the interface that are known to be important for the catalytic activity (Arg178, Arg179,48) remained labelled as functional sites. We performed a similar set of calculations comparing the classification based on structures of the monomer or dimer structure for OPRTase (Supplementary Fig. 10A,B). As for thymidylate synthase, we found differences including an increased number of total-loss positions at the interface in the classification made using the dimer; also in this case the model correctly identified the catalytic residues located at the protein-protein interface.
The results described above show that comparing the predictions based on structures of monomers and oligomers can help disentangle functional residues that are key for protein–protein interactions from those with other roles. Specifically, when the structure of the oligomer is known and used as input, our model focuses on residues that are important for functions other than protein-protein interactions. To illustrate this point further, we compared the classification for (monomeric) myoglobin with that for the α2β2 tetramer haemoglobin. For haemoglobin, we made predictions for both the α and β subunits, and using both the monomer and tetramer structures as input. We first compared the classification results for all three proteins when calculations are based on the monomer structure (Supplementary Fig. 10C). For all three proteins we found a comparable classification of the residues surrounding the heme group, with 7–9 residues classified as functional. We, however, found differences between myoglobin and the two haemoglobin chains at the residues that form interfaces in the haemoglobin tetramer (Supplementary Fig. 10C). For these residues, the model assigned the WT-like label for most of the positions in myoglobin, while several interface residues in haemoglobin are classified as functional. This difference likely arises because the calculated ΔΔG values at the interface residues are small (because we used the monomer structures as input), but the residues in haemoglobin are more conserved than those in myoglobin. In line with this hypothesis, many of the interface residues in haemoglobin are classified as total-loss when we use the tetramer structure as input to our model (Supplementary Fig. 10C). In this case, substitutions at the interface are more destabilizing, and so the model assigns these residues to the class that corresponds to destabilization of the functional protein (the tetramer). Together, the results on thymidylate synthase (Fig. 5), OPRTase (Supplementary Fig. 10A, B), and myoglobin/haemoglobin (Supplementary Fig. 10C) show that residues at key protein-protein interaction sites behave differently depending on whether one considers destabilization of the monomer or oligomer form. In cases where the oligomer structure is known, this difference can help to distinguish residues that play functional roles due to protein-protein interactions from those that are for example directly involved in catalysis. Similarly, when using the monomer structure as input, the model can help shed light on key interaction interfaces, which appear as surface exposed patches of conserved residues.
Predicting and understanding disease variants in HPRT1
Having validated and exemplified our model using previously published data, we then used our model in a prospective study to predict the impact and mechanism of human missense variants. We and others have previously shown that many, but not all, disease-causing missense variants cause loss of function by loss of abundance. The ability to assign a functional status to so-called variants of uncertain significance is one of the major outstanding challenges in clinical genetics. In order to understand the molecular origin of disease and develop potential treatments it is, however, also important to be able to predict why variants cause loss of function.
We therefore selected hypoxanthine-guanine phosphoribosyltransferase-1 (HPRT1, Uniprot: P00492), an enzyme involved in the onset of Lesch-Nyhan disease and its attenuated variants49,50, for a prospective study of the accuracy and utility of our model. We estimated structural and sequence features for 190 of 210 residues in HPRT1; of the 3610 total possible single amino acid variants, our model predicts 471 to be SBI and 1046 to be total-loss variants.
We then selected 17 variants for experimental characterization. These variants were selected either from gnomAD51, ClinVar52 or variants that we selected to test our model more broadly. Five of the variants were predicted to have wild-type-like activity, six variants were predicted to be total-loss (i.e., loss of activity and loss of abundance), and six variants were predicted to be SBI (i.e., loss of activity without loss of abundance) (Fig. 6).
To test our predictions, we established a yeast-based growth assay for HPRT1 function. HPRT1 catalyses the formation of inosine and guanosine monophosphate (IMP and GMP) from hypoxanthine and guanine, respectively. Previous studies have shown that mycophenolic acid (MPA) acts as an inhibitor of IMP dehydrogenase, which is responsible for the conversion of IMP to xanthosine monophosphate (XMP,53,54). Thus, in the presence of MPA, yeast cells can only generate GMP through the salvage pathway, and the yeast HPRT1 orthologue, Hpt1, therefore becomes essential (Supplementary Figure 11). We introduced each of the 17 variants as well as a wild-type and vector controls into a yeast strain lacking Hpt1 to test their effects on function and abundance. We found that 11/17 variants grew worse that the wild-type control in the presence of MPA, showing that they cause loss of function (Fig. 6B). We also measured the abundance of the wild-type and 17 variants using western blots and found that 3/17 variants had substantially reduced levels (Fig. 6C).
We used the experimental data to classify the variants into wild-type-like, SBI and total-loss and compared the results to the computational predictions (Fig. 6D). Overall we find that 13/17 (76%) of the variants are predicted correctly including all (6/6) of the SBI residues. This result shows that our model can predict variant effects relatively well, and in particular can help separate loss-of-function variants that lose function due to intrinsic function from those due to structural stability. We note that some of these variants have also been characterized biochemically49, with overall good agreement between our predictions, the yeast assays and the biochemical experiments.
Making the model more easily accessible
Evaluating ΔΔG values with Rosetta is relatively slow, making the widespread application of our model less straightforward. We have, however, recently developed a method, called RaSP, for rapid stability predictions55, which uses a deep-learning representation to approximate Rosetta ΔΔG values orders of magnitude faster than Rosetta. We first tested the results when using ΔΔG values from RaSP as input to the functional model described above (which was trained with ΔΔG values generated by Rosetta). Despite the relatively high correlation between the Rosetta and RaSP ΔΔG values (average Spearman correlation coefficient of 0.78), we found that the performance of the model was lower during the cross-validation when we used ΔΔG generated by RaSP (Supplementary Fig. 12A). This result suggests that differences in the ΔΔG values generated by RaSP and Rosetta combined with the threshold-based structure of gradient boosting machines can shift the prediction for variants with feature values close to the threshold values used inside the model. We therefore retrained our model using instead ΔΔG values generated by RaSP. We found that the RaSP-based model performs as well as the model trained with Rosetta on our validation sets (Supplementary Fig. 12B) and therefore decided to test the performance for many of the tasks we used to test our Rosetta-based model on. We found that the RaSP-trained model identifies the same number of functional sites as the Rosetta-trained model (62/87) when we examine the enzymes in the M-CSA set, with 57/62 being the same (Supplementary Fig. 12C). We also found that the RaSP-based model predicts the effects of the HPRT1 variants as well as the Rosetta-trained model (Supplementary Fig. 12D).
We make our model available via a notebook that can be run using Google Colaboratory (available via https://github.com/KULL-Centre/_2022_functional-sites-cagiada). The notebook guides the user with a step-by-step procedure to generate input data, using for example the GEMME and RaSP web implementations, and to generate the predictions of functional sites.
Discussion
There is a long history of studying protein function through analyses of sequence conservation, and conserved residues are generally important for the protein. Because most proteins need to be folded to function it is, however, difficult to disentangle the role of individual amino acids on the stability of the overall fold from more direct roles of individual amino acids for protein function. Clearly, it may not always be possible to separate the two, and we have shown for example that substitutions at some catalytic residues may result in loss of stability. In many cases, however, substitutions at functionally important sites either have a small negative effect on stability, or may even give rise to increased stability56,57. Building on these ideas we have here presented a method that finds functionally important sites as those that are conserved, but not immediately due to a role in protein stability.
To construct and train our model we leverage (i) large-scale experimental assays reporting on protein activity and abundance and (ii) computational methods to estimate variant effects on protein stability as well as general functional effects using conservation in multiple sequence alignments. We have validated our model using both detailed biochemical experiments on individual proteins (via M-CSA), as well as larger scale data generated by high-throughput experiments. We also validated the method through prospective predictions of the mechanism of disease variants in HPRT1. Our model is freely available. For users that want to avoid slower, more costly Rosetta calculations, we have also trained a model that uses our deep-learning-based method for stability predictions55. Since both Rosetta58 and RaSP55 give relatively accurate results using structures predicted using AlphaFold, our model can also be applied to predicted protein structures. Very recently59, it has been shown that one can also use large-scale measurements of stability changes as input to an approach inspired by our work.
Our model for functionally important sites, in some sense, lies between two extremes that have previously been used to analyse proteins. At one end of the scale, activity-based MAVEs and sequence conservation analyses provide a global view of residues that play functional roles, but often do not give direct mechanistic insights. At the other end of the scale, detailed biochemical, biophysical and structural experiments are often needed to pinpoint how individual amino acids affect function, but are (with exceptions such as work by38) difficult to scale to a large number of variants and proteins. To test our model we have thus both compared it to data generated by MAVEs, reducing to a one-dimensional view when only a single assay is available, as well as results from detailed assessments of the role of individual amino acids in enzymes and protein-protein interactions.
Technological and methodological developments have led to rapid increases in the number of known protein sequences, as well as our ability to assign overall biological functions to many of these. The ability to identify functionally important sites in proteins is important for our understanding of how proteins work, our ability to engineer enzymes and to understand the mechanisms that underlie diseases. Our model may, for example, be used to select residues for protein engineering experiments, or for generating focused libraries in enzyme optimization. As additional data is generated by MAVEs reporting on both function and abundance, the model can be improved further. As we have illustrated, the model can help pinpoint the molecular mechanisms of disease variants, which may both guide further experiments and be used as starting point for developing mechanism-based therapies such as finding variant that may or may not be rescued by pharmacological chaperones. Recent work shows how it is possible to annotate enzyme function at larger scale60 and our model can help pinpoint functionally important sites in newly discovered enzymes. We thus anticipate that the method that we have described here will help researchers make progress in these and many other areas.
Methods
Preprocessing of the training data
The training data include paired MAVE data for three different proteins: NUDT15, PTEN, and CYP2C9. For each protein, we selected thresholds for the scores generated by each MAVE as outlined previously16. Briefly, we fit the variant score distributions to three Gaussians and use the intersection of the first and last Gaussian as the cutoff. We use such binary classification for each of the two MAVEs to classify the variants into four categories16.
Sequence features
For each protein in the training (Supplementary Table 3) and validation (Supplementary Tables 4,5 and 6) data we performed a statistical analysis of multiple sequence alignments (MSA). We extracted the sequence of the first isoform from Uniprot and used HHBlits61 to build an MSA, on which we applied additional filters before evaluating the substitution effects and conservation scores. The first filter keeps only the positions (columns) that are present in the target sequence, and the second filter removes the sequences (rows) where the number of total gaps exceeds 50% of the total number of positions. We used GEMME26 to calculate an evolutionary conservation score. GEMME explicitly models the evolutionary history of the protein, returning a score (ΔΔE) which estimates an ‘evolutionary distance’ of a variant from the query wild-type sequence. The ΔΔE scores for the variants range between 0 (conservative substitution) and − 7 (substitutions that appear incompatible based on the MSA). In addition to the effect of the individual variants, we also calculated a ‘neighbour score’ as the average of the ΔΔE values for the previous and proceeding residue. In addition to the ΔΔE scores, we used the hydrophobicity35 of the target amino acid as a feature in our model.
Structural features
We predicted changes in thermodynamic stability using Rosetta (GitHub SHA1 99d33ec59ce9fcecc5e4f3800c778a54afdf8504). We used the Cartesian ddG protocol34 on crystal structures corresponding to the Uniprot sequence listed in Supplementary Tables 3,4 and 5. The ΔΔG values obtained from Rosetta were divided by 2.9 to bring them from Rosetta energy units onto a scale corresponding to kcal/mol34. Stability changes using RaSP were performed as described55. We used these ΔΔG values as a feature for the classifier, setting all values below/above a range of 0 – 5 kcal/mol, to 0 or 5 kcal/mol, respectively. We also calculated a neighbour score, averaging over the two nearest neighbours, as for the ΔΔE values.
Finally, we used MDTraj v.1.9.362 to calculate the weighted contact number (WCN) for each residue:
1 |
where ri,j is the Cα distance between residue i and j, r0 is a switching parameter (set to 7.0 Å).
We determined solvent exposed and buried regions using GetArea63 using 20% as a threshold to divide the data in to two classes.
Training the classifier
We began by including up to eleven features as input to our classifier. After testing on the three proteins (NUDT15,PTEN and CYP2C9), we proceeded with only eight of them. The features used for each variant were: (1, 2) variant ΔΔG and ΔΔE, (3–6) residue average and neighbour average ΔΔG and ΔΔE scores, (7) the hydrophobicity of the target amino acid, and (8) the WCN. The three features discarded during model building were: wild-type amino acid type, binary exposure classes based on solvent exposure and hydrophobicity of the wild-type amino acid.
We used a gradient descent machine to classify the variants, as implemented in Catboost64 v.0.26.1, using a Multinomial/Multiclass Cross Entropy Loss and L2 regularization. We first set the number of gradient descent iterations (Supplementary Figure 2A) on the ‘vanilla’ model and then used a grid scan to find optimal value for the remaining hyperparameters. We measured precision and recall using five-fold cross validation on the training data to optimise the hyperparameters (Supplementary Figure 2B). We compared the model performance with a null model from the sklearn python package (sklearn.dummy.DummyClassifier) using the ‘prior’ strategy, which returns the most frequent class label in the observed argument passed to it. We also used a random forest classifier model (sklearn.ensemble.RandomForestClassifier) in the comparison, optimizing the hyperparameters using a grid scan protocol and k-fold cross validation.
Cloning
Full-length human HPRT1 was expressed in yeast from the pYES2 vector (Invitrogen). An N-terminal RGS6His-tag was inserted upstream of an SRS linker peptide, before HPRT Met1. Point mutations were generated by GenScript.
Yeast strains and techniques
The hpt1Δ (MatA, his3Δ1, leu2Δ0, met15Δ0, ura3Δ0, hpt1::KanMX) Saccharomyces cerevisiae yeast strain was from Invitrogen. The cells were cultured in synthetic complete (SC) medium (2% glucose, 6.7 g/L yeast nitrogen base without amino acids and with ammonium sulfate (Sigma)) and supplemented for selection with 1.92 g/L uracil drop-out mix (US Biological). For expression, the glucose was replaced with 2% galactose.
Yeast transformations were performed with lithium acetate65. For growth assays, the cells were cultured at 30 °C to exponential phase and diluted to an OD(600nm) of 0.40. From this, dilution series (5-fold) were prepared and 5 μL of each dilution was applied as droplets on agar plates. Colonies formed after 2–3 days of incubation at 30 °C. The agar contained 50 μg/mL guanine (Sigma) and 0.1 mg/mL MPA (Sigma).
Whole cell lysates for western blotting were prepared from exponential phase cultures using glass beads and trichloroacetic acid (TCA) as described before66. Briefly, 1.2 × 108 cells in exponential phase were harvested and washed in water by centrifugation (3000 g, 5 min). The cells were resuspended in 1 mL of 20% TCA and centrifuged (3000 g, 5 min). The supernatant was discarded and the pellet was resuspended in 200 μL 20% TCA and transferred to 2 mL screwcap tubes containing 0.5 mL 400–600 micron glass beads (Sigma). The tubes were then applied to a Mini-BeadBeater machine (BioSpec Products Inc.) set at 3 cycles of 10 s. With a needle, a hole was made in the bottom of the tube and the tube was placed inside a 15 mL tube. Then, 400 μL 5% TCA was added and the material was eluted by centrifugation (1000 g, 5 min) into the 15 mL tube. The eluted material was centrifuged (10000 g, 5 min) at 4 °C. The pellet washed twice with ice-cold 80% acetone. Finally, the pellet was resuspended in 100 μL sample buffer for SDS-PAGE (62.5 mM Tris/HCl pH 6.8, 2% SDS, 25% glycerol, 0.01% bromphenol blue, 5% β-mercaptoethanol) and incubated for 5 min at 100 °C.
SDS-PAGE and western blotting
SDS-PAGE was performed using 12.5% acrylamide gels. After electrophoresis, the proteins were transferred to 0.2 μm nitrocellulose membranes (Advantec) by electro-blotting. After transfer, the blots were stained with Ponceau S (0.1% Ponceau S in 5% acetic acid) and blocked in PBS (10 mM Na2HPO4, 1.8 mM KH2PO4, 137 mM NaCl, 3 mM KCl, pH 7.4) with 5% skimmed milk powder. The antibodies were mouse IgG1 anti-RGSHis (Qiagen, Cat. No. 34650) diluted 1:2000 and peroxidase-conjugated polyclonal rabbit anti-mouse antibody (Dako, Cat. No. P0260) diluted 1:5000.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Acknowledgements
The research was supported by the PRISM (Protein Interactions and Stability in Medicine and Genomics) centre funded by the Novo Nordisk Foundation (NNF18OC0033950, to A.S., R.H.P., and K.L.-L.). We acknowledge access to computing resources from the Biocomputing Core Facility at the Department of Biology, University of Copenhagen.
Source data
Author contributions
M.C., S.B., and K.L.-L. conceived the overall study, building on preliminary work by A.S., K.L.-L., and R.H.P. M.C. and S.B. wrote the first draft of the manuscript with input from K.L.-L. M.C. performed the computational analyses and experiments with input from S.B. and K.L.-L. R.H.P, K.L.-L., and A.S. conceived the experiments on HPRT, which were performed by S.L. and S.M.S. under the supervision of R.H.P. All authors provided input to the study, and edited and provided input to the manuscript.
Peer review
Peer review information
Nature Communications thanks Gabriel Rocklin, Guillaume Lamoureux and the other anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Data availability
The input data for the model and the predictions used in this study are available at https://github.com/KULL-Centre/_2022_functional-sites-cagiada which also contains the final trained model. The datasets are also available on Zenodo67, and can be accessed via 10.5281/zenodo.8046585. Uniprot and PDB ID codes for the proteins studied are available in Supplementary Tables 4, 5 and 6. Source data (full images) for Fig. 6 are provided as a Source Data File in zip format. Source data are provided with this paper.
Code availability
The code used to generate the main text figures, to recreate the predictions, and to generate new predictions is available at https://github.com/KULL-Centre/_2022_functional-sites-cagiada. An implementation of the model is also available via Google Colaboratory via the same link. The code is also available on Zenodo67, and can be accessed via 10.5281/zenodo.8046585.
Competing interests
K.L.-L. holds stock options in and is a consultant for Peptone Ltd. All other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Rasmus Hartmann-Petersen, Email: rhpetersen@bio.ku.dk.
Kresten Lindorff-Larsen, Email: lindorff@bio.ku.dk.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-023-39909-0.
References
- 1.del Sol Mesa A, Pazos F, Valencia A. Automatic methods for predicting functionally important residues. J. Mol. Biol. 2003;326:1289–1302. doi: 10.1016/S0022-2836(02)01451-1. [DOI] [PubMed] [Google Scholar]
- 2.Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 2007;8:995–1005. doi: 10.1038/nrm2281. [DOI] [PubMed] [Google Scholar]
- 3.Radivojac P, et al. A large-scale evaluation of computational protein function prediction. Nat. Methods. 2013;10:221–227. doi: 10.1038/nmeth.2340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Kulmanov M, Khan MA, Hoehndorf R. Deepgo: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34:660–668. doi: 10.1093/bioinformatics/btx624. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Torng W, Altman RB. High precision protein functional site detection using 3d convolutional neural networks. Bioinformatics. 2019;35:1503–1512. doi: 10.1093/bioinformatics/bty813. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Yue P, Li Z, Moult J. Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. 2005;353:459–473. doi: 10.1016/j.jmb.2005.08.020. [DOI] [PubMed] [Google Scholar]
- 7.Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat. methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wagih O, et al. A resource of variant effect predictions of single nucleotide variants in model organisms. Mol. Syst. Biol. 2018;14:e8430. doi: 10.15252/msb.20188430. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Livesey BJ, Marsh JA. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 2020;16:e9380. doi: 10.15252/msb.20199380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Gray VE, Hause RJ, Fowler DM. Analysis of large-scale mutagenesis data to assess the impact of single amino acid substitutions. Genetics. 2017;207:53–61. doi: 10.1534/genetics.117.300064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Dunham AS, Beltrao P. Exploring amino acid functions in a deep mutational landscape. Mol. Syst. Biol. 2021;17:e10305. doi: 10.15252/msb.202110305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Høie MH, Cagiada M, Frederiksen AHB, Stein A, Lindorff-Larsen K. Predicting and interpreting large-scale mutagenesis data using analyses of protein stability and conservation. Cell Rep. 2022;38:110207. doi: 10.1016/j.celrep.2021.110207. [DOI] [PubMed] [Google Scholar]
- 14.Li X, Lehner B. Biophysical ambiguities prevent accurate genetic prediction. Nat. Commun. 2020;11:1–11. doi: 10.1038/s41467-020-18694-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Jepsen, M. M., Fowler, D. M., Hartmann-Petersen, R., Stein, A. & Lindorff-Larsen, K. in Chapter 5 - classifying disease-associated variants using measures of protein activity and stability (ed. Pey, A. L.) Protein Homeostasis Diseases 91–107 (Academic Press, 2020).
- 16.Cagiada M, et al. Understanding the origins of loss of protein function by analyzing the effects of thousands of variants on activity and abundance. Mol. Biol. Evolution. 2021;38:3235–3246. doi: 10.1093/molbev/msab095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Chiasson MA, et al. Multiplexed measurement of variant abundance and activity reveals vkor topology, active site and human variant impact. elife. 2020;9:e58026. doi: 10.7554/eLife.58026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Faure AJ, et al. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604:175–183. doi: 10.1038/s41586-022-04586-4. [DOI] [PubMed] [Google Scholar]
- 19.Otwinowski J. Biophysical inference of epistasis and the effects of mutations on protein stability and function. Mol. Biol. evolution. 2018;35:2345–2354. doi: 10.1093/molbev/msy141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Echave J, Wilke CO. Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu. Rev. Biophys. 2017;46:85–103. doi: 10.1146/annurev-biophys-070816-033819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Nielsen SV, Hartmann-Petersen R, Stein A, Lindorff-Larsen K. Multiplexed assays reveal effects of missense variants in msh2 and cancer predisposition. PLoS Genet. 2021;17:e1009496. doi: 10.1371/journal.pgen.1009496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 1996;257:342–358. doi: 10.1006/jmbi.1996.0167. [DOI] [PubMed] [Google Scholar]
- 23.Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the sift algorithm. Nat. Protoc. 2009;4:1073–1081. doi: 10.1038/nprot.2009.86. [DOI] [PubMed] [Google Scholar]
- 24.Choi Y, Chan AP. Provean web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics. 2015;31:2745–2747. doi: 10.1093/bioinformatics/btv195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods. 2018;15:816–822. doi: 10.1038/s41592-018-0138-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Laine E, Karami Y, Carbone A. Gemme: a simple and fast global epistatic model predicting mutational effects. Mol. Biol. Evol. 2019;36:2604–2619. doi: 10.1093/molbev/msz179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Cheng G, Qian B, Samudrala R, Baker D. Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acids Res. 2005;33:5861–5867. doi: 10.1093/nar/gki894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Wang K, Horst JA, Cheng G, Nickle DC, Samudrala R. Protein meta-functional signatures from combining sequence, structure, evolution, and amino acid property information. PLoS Comput. Biol. 2008;4:e1000181. doi: 10.1371/journal.pcbi.1000181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3d structure. PLoS Comput. Biol. 2009;5:e1000585. doi: 10.1371/journal.pcbi.1000585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Suiter CC, et al. Massively parallel variant characterization identifies NUDT15 alleles associated with thiopurine toxicity. Proc. Natl Acad. Sci. USA. 2020;117:5394–5401. doi: 10.1073/pnas.1915680117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Matreyek KA, et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat. Genet. 2018;50:874–882. doi: 10.1038/s41588-018-0122-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Mighell TL, Evans-Dutson S, O’Roak BJ. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype-phenotype relationships. Am. J. Hum. Genet. 2018;102:943–955. doi: 10.1016/j.ajhg.2018.03.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Amorosi CJ, et al. Massively parallel characterization of cyp2c9 variant enzyme activity and abundance. Am. J. Hum. Genet. 2021;108:1735–1751. doi: 10.1016/j.ajhg.2021.07.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Park H, et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. theory Comput. 2016;12:6201–6212. doi: 10.1021/acs.jctc.6b00819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Monera OD, Sereda TJ, Zhou NE, Kay CM, Hodges RS. Relationship of sidechain hydrophobicity and α-helical propensity on the stability of the single-stranded amphipathic α-helix. J. Pept. Sci. 1995;1:319–329. doi: 10.1002/psc.310010507. [DOI] [PubMed] [Google Scholar]
- 36.Shih C-H, Chang C-M, Lin Y-S, Lo W-C, Hwang J-K. Evolutionary information hidden in a single protein structure. Proteins: Struct. Funct. Bioinf. 2012;80:1647–1657. doi: 10.1002/prot.24058. [DOI] [PubMed] [Google Scholar]
- 37.Jack BR, Meyer AG, Echave J, Wilke CO. Functional sites induce long-range evolutionary constraints in enzymes. PLoS Biol. 2016;14:e1002452. doi: 10.1371/journal.pbio.1002452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Markin C, et al. Revealing enzyme functional architecture via high-throughput microfluidic enzyme kinetics. Science. 2021;373:eabf8761. doi: 10.1126/science.abf8761. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Chelliah V, Chen L, Blundell TL, Lovell SC. Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J. Mol. Biol. 2004;342:1487–1504. doi: 10.1016/j.jmb.2004.08.022. [DOI] [PubMed] [Google Scholar]
- 40.Campbell EA, et al. Crystal structure of the bacillus stearothermophilus anti-σ factor spoiiab with the sporulation σ factor σf. Cell. 2002;108:795–807. doi: 10.1016/S0092-8674(02)00662-1. [DOI] [PubMed] [Google Scholar]
- 41.Fu T, et al. Thymine dna glycosylase recognizes the geometry alteration of minor grooves induced by 5-formylcytosine and 5-carboxylcytosine. Chem. Sci. 2019;10:7407–7417. doi: 10.1039/C9SC02807B. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Masuda S, et al. Crystal structures of the adp and atp bound forms of the bacillus anti-σ factor spoiiab in complex with the anti-anti-σ spoiiaa. J. Mol. Biol. 2004;340:941–956. doi: 10.1016/j.jmb.2004.05.040. [DOI] [PubMed] [Google Scholar]
- 43.Ribeiro AJM, et al. Mechanism and catalytic site atlas (m-csa): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 2018;46:D618–D623. doi: 10.1093/nar/gkx1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Vreven T, et al. Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J. Mol. Biol. 2015;427:3031–3041. doi: 10.1016/j.jmb.2015.07.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Hwang KY, et al. Structure and mechanism of glutamate racemase from aquifex pyrophilus. Nat. Struct. Biol. 1999;6:422–426. doi: 10.1038/8223. [DOI] [PubMed] [Google Scholar]
- 46.Maiti A, Morgan MT, Drohat AC. Role of two strictly conserved residues in nucleotide flipping and n-glycosylic bond cleavage by human thymine dna glycosylase. J. Biol. Chem. 2009;284:36680–36688. doi: 10.1074/jbc.M109.062356. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Kanaan N, Crehuet R, Imhof P. Mechanism of the glycosidic bond cleavage of mismatched thymine in human thymine dna glycosylase revealed by classical molecular dynamics and quantum mechanical/molecular mechanical calculations. J. Phys. Chem. B. 2015;119:12365–12380. doi: 10.1021/acs.jpcb.5b05496. [DOI] [PubMed] [Google Scholar]
- 48.Pookanjanatavip M, Yuthavong Y, Greene PJ, Santi DV. Subunit complementation of thymidylate synthase. Biochemistry. 1992;31:10303–10309. doi: 10.1021/bi00157a018. [DOI] [PubMed] [Google Scholar]
- 49.Fu R, Jinnah HA. Genotype-phenotype correlations in lesch-nyhan disease. J. Biol. Chem. 2012;287:2997–3008. doi: 10.1074/jbc.M111.317701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Fu R, et al. Genotype–phenotype correlations in neurogenetics: Lesch-nyhan disease as a model disorder. Brain. 2014;137:1282–1303. doi: 10.1093/brain/awt202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Landrum MJ, et al. Clinvar: improving access to variant interpretations and supporting evidence. Nucleic acids Res. 2018;46:D1062–D1067. doi: 10.1093/nar/gkx1153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Woods RA, Roberts DG, Friedman T, Jolly D, Filpula D. Hypoxanthine: guanine phosphoribosyltransferase mutants in saccharomyces cerevisiae. Mol. Gen. Genet. MGG. 1983;191:407–412. doi: 10.1007/BF00425755. [DOI] [PubMed] [Google Scholar]
- 54.Escobar-Henriques M, Daignan-Fornier B. Transcriptional regulation of the yeast gmp synthesis pathway by its end products. J. Biol. Chem. 2001;276:1523–1530. doi: 10.1074/jbc.M007926200. [DOI] [PubMed] [Google Scholar]
- 55.Blaabjerg LM, et al. Rapid protein stability prediction using deep learning representations. Elife. 2023;12:e82593. doi: 10.7554/eLife.82593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Shoichet BK, Baase WA, Kuroki R, Matthews BW. A relationship between protein stability and protein function. Proc. Natl Acad. Sci. USA. 1995;92:452–456. doi: 10.1073/pnas.92.2.452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Protein stability promotes evolvability. Proc. Natl Acad. Sci. USA. 2006;103:5869–5874. doi: 10.1073/pnas.0510098103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Akdel M, et al. A structural biology community assessment of alphafold2 applications. Nat. Struct. Mol. Biol. 2022;29:1056–1067. doi: 10.1038/s41594-022-00849-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and protein design. bioRxiv 2022–12 (2022). [DOI] [PMC free article] [PubMed]
- 60.Yu T, et al. Enzyme function prediction using contrastive learning. Science. 2023;379:1358–1363. doi: 10.1126/science.adf2465. [DOI] [PubMed] [Google Scholar]
- 61.Remmert M, Biegert A, Hauser A, Söding J. Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment. Nat. Methods. 2012;9:173. doi: 10.1038/nmeth.1818. [DOI] [PubMed] [Google Scholar]
- 62.McGibbon RT, et al. Mdtraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 2015;109:1528 – 1532. doi: 10.1016/j.bpj.2015.08.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Fraczkiewicz R, Braun W. Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. J. Comput. Chem. 1998;19:319–333. doi: 10.1002/(SICI)1096-987X(199802)19:3<319::AID-JCC6>3.0.CO;2-W. [DOI] [Google Scholar]
- 64.Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. Catboost: unbiased boosting with categorical features. arXiv preprint arXiv:1706.09516 (2017).
- 65.Gietz RD, Schiestl RH. High-efficiency yeast transformation using the liac/ss carrier dna/peg method. Nat. Protoc. 2007;2:31–34. doi: 10.1038/nprot.2007.13. [DOI] [PubMed] [Google Scholar]
- 66.Kampmeyer, C. et al. Disease-linked mutations cause exposure of a protein quality control degron. Structure (2022). [DOI] [PubMed]
- 67.Cagiada, M. et al. _2022_functional-sites-cagiada: v.1.0-publication Zendo10.5281/zenodo.8046585 (2023).
- 68.Sun S, et al. A proactive genotype-to-patient-phenotype map for cystathionine beta-synthase. Genome Med. 2020;12:1–18. doi: 10.1186/s13073-020-0711-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Stiffler MA, Hekstra DR, Ranganathan R. Evolvability as a function of purifying selection in tem-1 β-lactamase. Cell. 2015;160:882–892. doi: 10.1016/j.cell.2015.01.035. [DOI] [PubMed] [Google Scholar]
- 70.Kitzman JO, Starita LM, Lo RS, Fields S, Shendure J. Massively parallel single-amino-acid mutagenesis. Nat. Methods. 2015;12:203–206. doi: 10.1038/nmeth.3223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Weile J, et al. A framework for exhaustively mapping functional missense variants. Mol. Syst. Biol. 2017;13:957. doi: 10.15252/msb.20177908. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The input data for the model and the predictions used in this study are available at https://github.com/KULL-Centre/_2022_functional-sites-cagiada which also contains the final trained model. The datasets are also available on Zenodo67, and can be accessed via 10.5281/zenodo.8046585. Uniprot and PDB ID codes for the proteins studied are available in Supplementary Tables 4, 5 and 6. Source data (full images) for Fig. 6 are provided as a Source Data File in zip format. Source data are provided with this paper.
The code used to generate the main text figures, to recreate the predictions, and to generate new predictions is available at https://github.com/KULL-Centre/_2022_functional-sites-cagiada. An implementation of the model is also available via Google Colaboratory via the same link. The code is also available on Zenodo67, and can be accessed via 10.5281/zenodo.8046585.