PLOS One. 2022 Jan 21;17(1):e0261829. doi: 10.1371/journal.pone.0261829

Linking protein structural and functional change to mutation using amino acid networks

Cristina Sotomayor-Vivas 1, Enrique Hernández-Lemus 1,2, Rodrigo Dorantes-Gilardi 3,*
Editor: Sriparna Saha 4
PMCID: PMC8782487  PMID: 35061689

Abstract

The function of a protein is strongly dependent on its structure. During evolution, proteins acquire new functions through mutations in the amino-acid sequence. With the advance of deep mutational scanning, recent studies have found functional change to be position dependent, notwithstanding the chemical properties of mutant and mutated amino acids. This could indicate that structural properties of a given position are potentially responsible for the functional relevance of a mutation. Here, we looked at the relation between structure and function of positions using five proteins with experimental data of functional change available. In order to measure structural change, we modeled mutated proteins via amino-acid networks and quantified the perturbation of each mutation. We found that structural change is position dependent, and strongly related to functional change. Strong changes in protein structure correlate with functional loss, and positions with functional gain due to mutations tend to be structurally robust. Finally, we constructed a computational method to predict functionally sensitive positions to mutations using structural change that performs well on all five proteins with a mean precision of 74.7% and recall of 69.3% of all functional positions.

Introduction

Proteins are complex biomolecules that have been subject to mutational dynamics for billions of years and whose tasks are essential for the maintenance, development, and survival of well-functioning cells. They start from a sequence of amino acids that folds into a three-dimensional (3D) structure that determines their function [1]. Understanding the underlying relations between sequence, structure, and function of a protein has been an active research topic in molecular biology for decades [2, 3].

Structure and function prediction from the amino acid sequence has been an open problem even prior to Anfinsen’s discovery of the thermodynamic hypothesis, which states that, under normal conditions, the protein sequence is responsible for the native configuration of a protein [1, 4]. In the last couple of decades, widely available datasets of protein 3D structures like the Protein Data Bank [5], machine learning methods such as deep learning [6], as well as high-throughput methods to quantify functional scores at massive scales [7–9], have brought us closer to understanding the interconnections between protein sequence, structure, and function.

In particular, with the advent of the big data paradigm there has been a renewed interest in the laws yielding structure and function from the one dimensional amino acid sequence [10]. Machine learning methods developed to predict residue-residue contacts in the 3D structure have recently shown a relation between residue proximity and coevolution measured by the covariance of positions in homologous protein sequences [10–14]. Coevolving positions have also been shown to be functionally sensitive to mutations using deep mutational scanning data [15, 16], reinforcing their prime role in protein structure and function.

The replacement of an amino acid in the sequence (a mutation) can have structural consequences on the resulting protein and thus has a potential effect on its function. In general, mutations occur naturally and have no effect on the protein function: this is called protein robustness [17, 18]. Protein adaptation or evolvability also requires that some mutations can change the protein’s function [19, 20]; indeed, a mutation can make the protein obtain a different function [21, 22]. Finally, a small set of mutations can leave the protein without the original function [23], either because of loss or adaptation, yielding protein fragility.

Experimental evidence on the interrelation between function, structure, and mutation has been shown before, for instance via the analysis of missense mutations of the tumor suppressor p53, where mutations at the DNA-binding structural domain were found to produce functional loss more often [24]. Computational studies of the effects of in silico mutations in protein structure have shown that most positions are structurally robust independently from the chemical properties of the mutant residue [25], and sensitivity depends on their structural neighborhood [26]. In terms of function, experimental research has shown that functional change (fragility or adaptation) is, in general, exclusively dependent on the sequence position mutated and not on the amino acid or its mutants [16, 27].

In the case of a mutation, the fact that sequence positions seem to contain the necessary information for structure and protein fitness raises the question of the relation between functionally and structurally sensitive positions. Although deep mutational scanning has brought results in many areas of molecular biology [8], its data are not yet ubiquitous, and datasets have often been created for the analysis of epistatic effects and thus do not include single mutations. This brings the additional question of whether a relation between structure and function can be observed by alternative, cheaper methods. Specifically, given a protein, can we obtain the functional relevance of its sequence positions by looking at its 3D structure?

Network science has been successfully used in biology to model a variety of systems including co-expression networks [28–31], metabolic pathways [19, 32], protein-protein interactions [33–36], detection of protein function [37], and protein structure [38–40]. Amino acid networks, where amino acids are represented by nodes that are connected if they are within a distance threshold, have been used to model protein structure [41, 42] and study the effects of mutation on structural fitness [25, 26]. A great advantage of computing structural change under this framework is the availability of more than 144,000 structural protein models based on their 3D atomic coordinates in the Protein Data Bank [43].

Here, we propose to use this methodology to study the relation between change in protein structure and function by considering five proteins for which deep mutational scanning data is available [16, 44–47]. For these proteins, the functional change resulting from a mutation has been quantified for all amino acid substitutions, in most sequence positions. We obtained corresponding structural change data in silico using the perturbation network of a mutation obtained by comparing the 3D structure of the original protein and that of its mutation.

We found that structurally sensitive positions (SSPs) are not only position dependent but are also strongly correlated to functionally sensitive positions (FSPs) in all 5 proteins. Prediction of FSPs using SSPs yields a mean precision of 74.7% and recall of 69.3% across all five proteins. Moreover, the area under the receiver operating characteristic (ROC) curve, a quantity often used to assess the quality of the prediction, has a mean value of 0.83 ± 0.04, showing a clear relevance of positions’ structure in functional fragility due to mutations.

To measure structural change, we considered three different topological measures of the perturbation network, namely its size (in nodes), its number of edges, and its weighted sum. In practice, the size of the perturbation network represents the number of amino acids affected by the mutation; its edges, in turn, represent the structural contacts between amino acids changed, and its sum of weights is the number of atomic pairs that moved either inside or outside a chosen distance threshold. We show that the mean structural change of sequence positions accounted for by each measure is correlated to experimentally-obtained functional change. However, aggregating the perturbation measures increases the correlation between functional and structural disruption. This relation was found for amino acid networks defined by 71 different atomic distance thresholds in the range of 3–10 Ångstroms (Å).

Comparing the scores obtained for predictions using a distance threshold of 9 Å with the scores obtained from all other thresholds in the 4–10 Å range, we observed that predictions using a 9 Å threshold achieve similar or better scores than all other thresholds. This is true across the five proteins studied and using all perturbation measures. We suggest that 9 Å is indeed a good choice of threshold for obtaining accurate predictions of FSPs independently from protein size.

Finally, the complement of the SSPs, the set of structurally robust positions (SRPs), correlates well with the top 40% of positions with weaker functional loss (or with a gain in function). Within those positions many have a functional change close to zero, suggesting a relation between structural and functional robustness.

Results and discussion

The relationship between structural and functional change studied here is based on the comparison between the perturbation network of mutations and their corresponding experimentally obtained functional change in five proteins. We combined three network-based measures representing structural change to ultimately be able to predict positions sensitive to mutations. Below is a summary of the results found:

  • Structural sensitivity (or robustness) to mutations is position dependent.

  • Significant correlations show that there is a relationship between protein structural and functional change due to mutations.

  • Predictions for functionally sensitive positions based on individual network measures—nodes, edges or weight—achieve considerable scores. Aggregating multiple network measures to obtain predictions improves the precision.

  • Stronger structural perturbation is related to stronger functional change.

  • The use of network parameters allows us to design predictions maximizing different values, be it precision, recall, or both simultaneously.

  • A relationship between robust positions to mutations and those that have small functional change can also be observed.

Distance thresholds

Correlation between structure and function

Weighted amino acid networks as we have constructed here are usually defined for distance thresholds between 5 Å and 8 Å, depending on the intended chemical interactions to capture [38]. Threshold distances for atom-atom interactions usually vary between 4.5 Å [48] and 5 Å [41, 49]. In general, the edges of amino acid networks are supposed to be at least loosely based on the underlying chemical interactions of the protein. Here, we took a different approach: we did not aim to model the biological interactions between amino acids, but their structural neighborhoods, spanning much larger distances than those chemically feasible [38].

Given a mutation, the perturbation network resulting from the comparison of a mutated structure to the original three-dimensional (3D) conformation quantifies the structural change of the mutation. Four parameters of the perturbation network were considered as perturbation measures, namely, its number of nodes, its number of edges and their weight sum, and its diameter (Methods).
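The Methods section is not reproduced here, so the following is only a minimal sketch of how the first three perturbation measures could be computed. It assumes that an edge weight is the number of atom pairs within the distance threshold, and that the perturbation network keeps the residue pairs whose weight differs between the wild-type and mutant structures, weighted by the absolute difference. The function names and the toy coordinates are hypothetical, and the diameter is omitted.

```python
from itertools import combinations
from math import dist

def contact_weights(residues, threshold):
    """Map each residue pair to its number of atom pairs within `threshold` (in Å).

    `residues` is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    weights = {}
    for a, b in combinations(sorted(residues), 2):
        n = sum(1 for p in residues[a] for q in residues[b] if dist(p, q) <= threshold)
        if n:
            weights[(a, b)] = n
    return weights

def perturbation_network(wild_type, mutant, threshold):
    """Residue pairs whose contact weight changed (assumed definition)."""
    w_wt = contact_weights(wild_type, threshold)
    w_mut = contact_weights(mutant, threshold)
    edges = {}
    for pair in set(w_wt) | set(w_mut):
        diff = abs(w_wt.get(pair, 0) - w_mut.get(pair, 0))
        if diff:
            edges[pair] = diff
    return edges

def perturbation_measures(edges):
    """Number of nodes, number of edges, and weight sum of a perturbation network."""
    nodes = {residue for pair in edges for residue in pair}
    return {"nodes": len(nodes), "edges": len(edges), "weight": sum(edges.values())}

# Toy structures with one atom per residue (hypothetical coordinates).
wild_type = {1: [(0, 0, 0)], 2: [(3, 0, 0)], 3: [(20, 0, 0)]}
mutant = {1: [(0, 0, 0)], 2: [(12, 0, 0)], 3: [(20, 0, 0)]}
measures = perturbation_measures(perturbation_network(wild_type, mutant, 9.0))
```

In the toy example, residue 2 moves away from residue 1 and toward residue 3, so one contact is lost and one is gained at the 9 Å threshold.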

To identify the best distance threshold to use, we first calculated Spearman correlation values between functional change of sequence positions and their perturbation-network parameters. For each protein and each parameter, we compared the mean functional value and the mean perturbation measure score per sequence position. Higher perturbation scores resulted from 3D structures farther away from the original, hence possibly more likely to have a disrupted function. This would be reflected by stronger negative correlations, relating higher perturbation scores with lower functional scores. For simplicity, we set all correlations to absolute values.
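As an illustration of this comparison, the sketch below averages scores per sequence position and computes a Spearman correlation as the Pearson correlation of average ranks. The data are hypothetical toy values, not taken from any of the five proteins, and the helper names are ours.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def position_means(scores_by_position):
    """Mean score over all mutations at each sequence position."""
    return {pos: sum(v) / len(v) for pos, v in scores_by_position.items()}

# Hypothetical toy data: position -> scores of its mutations.
functional = {1: [-0.1, 0.0], 2: [-2.0, -1.5], 3: [-0.8, -0.6], 4: [0.2, 0.1]}
perturbation = {1: [3, 5], 2: [40, 38], 3: [12, 20], 4: [2, 1]}

f_mean = position_means(functional)
p_mean = position_means(perturbation)
positions = sorted(f_mean)
rho = spearman([f_mean[p] for p in positions], [p_mean[p] for p in positions])
```

Here the positions with the largest perturbation have the lowest functional scores, so the toy correlation is strongly negative.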

We found consistent results between the five proteins studied when comparing each perturbation measure to functional change (Fig 1). The mean and standard deviation of the Spearman correlation (ρ) were −0.56 ± 0.12 for nodes, −0.53 ± 0.1 for edges, −0.51 ± 0.1 for weight, and −0.3 ± 0.11 for diameter. For most measures we found statistically significant correlations between structural and functional change. For nodes, edges, and weight the correlations were significant (mean p-value = 3.6 × 10⁻⁴ ± 6.2 × 10⁻³); however, that was not the case for the diameter of the perturbation network (mean p-value = 1.6 × 10⁻² ± 5.3 × 10⁻²).

Fig 1. Spearman correlation between positions’ mean structural and functional scores by protein, perturbation measure, and distance threshold.

For measures nodes, edges, and weight, we found that correlations increased steadily for thresholds between 3 and 4 Å, showing a slight peak around 3.8 Å, and then stabilized from around 4 Å onward at values between 0.3 and 0.65. In the case of the measure ‘diameter’, correlations peaked between 3.5–3.8 Å for all five proteins and then decreased for higher distance thresholds.

Relations between structure and function shown here suggest that protein structure can be studied with much higher distance cutoffs (∼9 Å). By arbitrarily ignoring chemical-based interactions we were able to better account for the structural change around a mutated position, suggesting that studies exclusively looking at the protein structure may benefit from including higher distance thresholds.

Prediction of functionally sensitive positions

We also analyzed predictions considering exclusively individual measures, that is, given a measure, we set a perturbation cutoff and selected all positions that had at least one mutation above the cutoff (S1 Fig). We considered cutoff 1.5, representing 1.5 standard deviations above the mean, and looked at both precision and recall (Methods).
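A minimal sketch of such a single-measure prediction follows. It assumes that scores are standardized over all mutations of a protein using the population standard deviation, which may differ from the choice made in Methods; the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(scores_by_position):
    """Convert raw scores to z-scores using all mutations of the protein.

    Assumption: population standard deviation over the pooled mutations.
    """
    flat = [s for scores in scores_by_position.values() for s in scores]
    mu, sigma = mean(flat), pstdev(flat)
    return {pos: [(s - mu) / sigma for s in scores]
            for pos, scores in scores_by_position.items()}

def sensitive_positions(z_by_position, cutoff=1.5):
    """Positions with at least one mutation above the perturbation cutoff."""
    return {pos for pos, zs in z_by_position.items() if max(zs) > cutoff}

def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical toy data: position -> number of perturbed nodes per mutation.
nodes = {1: [2, 3], 2: [2, 2], 3: [3, 30], 4: [2, 3], 5: [3, 2]}
predicted = sensitive_positions(standardize(nodes), cutoff=1.5)
precision, recall = precision_recall(predicted, actual={3, 5})
```

Only position 3 has a mutation far above the mean, so it is the only predicted position in this toy case.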

To compare the different perturbation measures, we took the predictions obtained from single measures and averaged scores over all thresholds and proteins for each measure. We found that the number of nodes had the highest mean precision (72.66%), weight had the highest mean recall (71.76%), and diameter had the lowest score in both cases (52.58% and 49.02%, for precision and recall, respectively).

With these correlations and predictions based on single measures, we saw that in most cases, we got more information from higher thresholds, reflected by higher correlations and precision scores. Since the diameter of perturbation networks had less predictable behavior compared to the other three measures, lower correlation scores, and lower scores when predicting based just on this measure, we will not include it when making predictions of functional positions. We believe that this measure is too sensitive, as adding or removing a single edge could significantly change the maximal smallest path without significantly changing the network itself; its sensitivity to small threshold changes can be seen in Fig 1. For nodes, edges, and weight, we considered the average scores between the 5 proteins and 3 measures, and found that precision is maximized at 9.3 Å, while recall is maximized at 8.4 Å, suggesting that an optimal threshold can be found in that range. Hereafter, we considered 9 Å as the representative threshold.

Perturbation cutoffs and minimum counts

Aggregating perturbation measures

When selecting perturbation cutoffs and minimum counts—the cutoffs defining structurally unstable positions and the number of altered measures required for instability, respectively (Methods)—we started from the idea that stricter predictions, those arising from higher cutoffs and counts, reflected higher structural changes. In other words, we assumed that the more the structure of the protein was modified, the more likely it was that the function was disrupted. Hence, we expected stricter predictions to result in higher precision.
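The interplay between cutoffs and minimum counts can be sketched as follows, assuming that a position counts as altered for a measure when at least one of its mutations has a standardized score above that measure's cutoff. The function names and z-scores are hypothetical.

```python
def flagged(z_scores, cutoff):
    """True if at least one mutation exceeds the cutoff for this measure."""
    return any(z > cutoff for z in z_scores)

def aggregate_prediction(z_by_measure, cutoffs, min_count):
    """Positions altered in at least `min_count` perturbation measures.

    z_by_measure: measure -> position -> list of z-scores (one per mutation)
    cutoffs:      measure -> perturbation cutoff (the "cutoff vector")
    """
    positions = set.intersection(*(set(m) for m in z_by_measure.values()))
    predicted = set()
    for pos in positions:
        count = sum(flagged(z_by_measure[m][pos], cutoffs[m]) for m in z_by_measure)
        if count >= min_count:
            predicted.add(pos)
    return predicted

# Hypothetical standardized scores for three positions.
z = {
    "nodes":  {1: [2.0, 0.1], 2: [1.6, 0.0], 3: [0.2, 0.3]},
    "edges":  {1: [1.8, 0.0], 2: [0.4, 0.2], 3: [0.1, 0.0]},
    "weight": {1: [1.7, 0.3], 2: [1.9, 0.1], 3: [1.55, 0.2]},
}
cutoffs = {"nodes": 1.5, "edges": 1.5, "weight": 1.5}
by_count = {k: aggregate_prediction(z, cutoffs, k) for k in (1, 2, 3)}
```

Raising the minimum count shrinks the predicted set, which is the sense in which higher counts give stricter predictions.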

Testing different cutoffs and minimum counts confirmed this hypothesis, as well as the fact that more lenient predictions were more likely to have a higher recall, while sacrificing precision (S2 Fig, Fig 2, Table 1). In Table 1, we can see that the mean precision increased as the number of perturbation measures considered (minimum count) increased, while recall decreased. When we considered only one perturbation measure, we got a mean precision and recall of 65.96% and 82.58%, respectively. Inversely, when considering all three measures, we obtained a mean precision of 78.6% and recall of 51.47%. This shows that aggregated scores predict better than single scores when the aim is to obtain higher precision, suggesting that the three perturbation measures are relevant to account for structural change.

Fig 2. Top, comparing precision and recall scores with functional percentage, leaving parameters fixed at (1,1,1) and minimum count of 2, varying functional percentage from 30 to 70%; bottom, comparing precision and recall scores with prediction percentage, leaving functional percentage fixed at 40%, minimum count fixed at 2, and varying the functional cutoffs from 1 to 2 to obtain different prediction percentages (percentages were rounded and missing values filled in through linear interpolation).

The line represents the mean over the proteins, while the shaded area represents the 95% confidence interval.

Table 1. For all three minimum counts, we evaluated predictions with 51 different perturbation cutoffs ranging from 1 to 2 (same cutoff for all three measures), and calculated the mean over all cutoffs and proteins, obtaining a mean score for precision and recall for each minimum count.
Minimum count    Mean precision (%)    Mean recall (%)
1                65.96                 82.58
2                72.78                 67.42
3                78.6                  51.47

Percentages of FSPs and SSPs

Precision and recall scores are also closely related with the percentage of functionally sensitive positions (FSPs), and structurally sensitive positions (SSPs, predicted positions), respectively. In Fig 2, we compared how changes in these percentages were reflected in the precision and recall scores. To obtain changes in the prediction percentage, we varied the cutoffs from 1 to 2, in intervals of 0.02, for a total of 51 cutoffs. As cutoffs increased, structurally sensitive positions decreased, which was reflected in higher precision and lower recall, providing further evidence on the relationship between stricter measures and higher precision scores.

With SSPs covering between 18% and 30% of positions, we obtained a precision of at least 75%. Larger percentages of SSPs decreased precision in all proteins. Positions captured with stricter cutoffs and minimum counts had mutations with larger perturbation networks relative to mutations at other sequence positions. This may be due to some particularity in their 3D structural neighborhood, whose interconnections are sensitive to most mutations. In this sense, the more unique the 3D neighborhood of the position, the greater the mean structural change is to be expected. An example may be the active sites in enzymes, which usually take a different substructure from the rest of the protein, whether it be a pocket, a cleft, an oligomeric interface, or another 3D shape [50]. Indeed, mutations happening at or close to active sites tend to affect the protein activity either by enhancing it [51, 52], losing it or adapting it [53].

Similarly, positions within an allosteric path which conveys signals from the active site to a distant position are found to be co-evolving within protein families, which in turn tend to be functionally sensitive to mutations [54]. These positions could be subject to structural particularities in their close neighborhoods, relative to other positions, and thus have greater structural changes. A more thorough analysis of the structural perturbation and its relation to the neighborhoods of biologically relevant positions is needed in this regard.

In all of our predictions we considered perturbation cutoffs and minimum counts such that the percentage of structurally sensitive positions returned was informative, ranging from around 25% when maximizing precision to around 50% when maximizing recall. As a basis, we compared all of our predictions with the 40% of positions with lowest functional values, and we henceforth refer to them simply as functionally sensitive positions, or FSPs. Leaving all other values fixed (perturbation cutoffs, minimum count, and distance thresholds), increasing this percentage led to a better precision and lower recall, while decreasing this percentage had the opposite effect (Fig 2). We focused on 40% as a balance between obtaining more precise predictions and selecting positions with significant disruption in their function.
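The selection of FSPs can be sketched as below. The rounding rule used to turn 40% into a number of positions is an assumption, and the mean functional scores are hypothetical.

```python
def functionally_sensitive(mean_functional, fraction=0.40):
    """Positions in the lowest `fraction` of mean functional scores.

    Assumption: the number of positions kept is rounded to the nearest integer.
    """
    ranked = sorted(mean_functional, key=mean_functional.get)
    k = round(fraction * len(ranked))
    return set(ranked[:k])

# Hypothetical mean functional score per sequence position.
mean_functional = {
    1: -0.1, 2: -2.0, 3: -0.7, 4: 0.2, 5: -1.5,
    6: 0.0, 7: -0.3, 8: -1.1, 9: 0.1, 10: -0.05,
}
fsps_40 = functionally_sensitive(mean_functional, 0.40)
fsps_30 = functionally_sensitive(mean_functional, 0.30)
```

Lowering the functional percentage keeps only the positions with the strongest functional loss, so the smaller set is contained in the larger one.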

With distance cutoffs and functional percentage fixed, we focused on combinations of perturbation cutoffs and minimum counts to make different predictions. For all of them, we considered the precision, recall and improvement scores (Methods), the latter representing the ratio between obtained scores and expected scores from random predictions.

Since the correlations between the mean structural and functional scores by protein showed similar scores among the thresholds 4–10 Å for nodes, edges and weight scores (Fig 1), we calculated the precision, recall and improvement scores for each of the possible thresholds for the three measures, to compare them to the predictions we obtained using the threshold 9 Å for those measures and evaluate the choice of parameters (Fig 3).

Fig 3. Comparing the precision and recall obtained from varying the threshold for nodes, edges and weight scores from 4 Å to 10 Å, with threshold 9 Å, representing our predictions for FSPs, highlighted in red.

Different minimum count and cutoff vectors were used to A) Maximize precision, B) Maximize recall, and C) Maximize precision and recall.

Protein structure-function relation

Distance threshold of 9 Å

Having established that stricter measures result in predictions with higher precision but lower recall and vice versa, we considered three sets of parameters to give predictions focusing on high precision, high recall, and an equilibrium between both. We refer to these predictions as maximizing precision, maximizing recall, and maximizing both.

In Fig 2, we varied cutoffs between 1 and 2, and looking at the prediction percentages, we obtained ranges of 38.7% to 63.2% for minimum count 1, 26.6% to 50% for minimum count 2, and 15.1% to 39.1% for minimum count 3 (with lower values for cutoff 2, higher for 1). Based on this, and the known behavior of the parameters, we chose the minimum counts and cutoffs for predictions depending on the value to maximize and to keep informative prediction percentages: lower to maximize precision and higher to maximize recall.

First, to maximize precision, we selected stricter measures, considering a minimum count of three, as it had the highest mean precision, and a perturbation cutoff vector (1.5, 1.5, 1.5). This resulted in a mean precision of 80.5%, a mean recall of 53.1%, and a mean prediction percentage of 26.3% (Fig 3A), as well as an improvement by a factor of 2. In other words, using the three perturbation measures to account for structural change, we got a set of functionally sensitive positions with high precision.

In other studies, coevolving positions in protein families obtained using statistical coupling analysis, usually called protein sectors, have been found to form physically connected subnetworks of amino acids [15, 55]. Most of these positions, around 20% of all sequence positions, have been found to be sensitive to mutations. In particular the protein sector of the PSDpdz3 protein, one of the proteins studied here, has been related to functional loss from single mutations [16]. Predicted positions maximizing precision, showing a similar percentage of the amino acid sequence, may also be related to protein sectors in other proteins but further research is needed in this direction.

Next, to maximize recall, we focused on more lenient measures. We considered the perturbation cutoff vector consisting of all ones, and a minimum count of 2. This minimum count showed more balanced results between precision and recall, while the perturbation cutoff vector (1, 1, 1) helped maintain a high recall. Comparing those predictions with the functionally sensitive positions, we found a mean recall of 82.2% over the 5 proteins studied, with a mean precision of 65.7%, while the mean percentage of predicted positions was 50% (Fig 3B). This resulted in an improvement of random predictions by a factor of 1.64.

A more general prediction, aiming to maximize precision and recall simultaneously, was achieved by once again using the perturbation cutoff vector (1.5, 1.5, 1.5), but with a minimum count of 2. This prediction resulted in a mean precision of 74.7%, a mean recall of 69.2%, and a mean prediction percentage of 37% (Fig 3C), as well as an improvement by a factor of 1.87. By predicting roughly the same number of positions as the number of FSPs, we believe this combination of parameters is a good general prediction if there are no functional values to compare to.
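As a quick arithmetic check, the improvement factors for the three predictions of sensitive positions can be approximately recovered under one assumption: that the improvement score is the mean precision divided by the precision expected from random predictions, taken here as the 40% of positions labeled as FSPs. The exact definition is given in Methods, which is not reproduced here, so this is only a sketch.

```python
# Assumed baseline: a random prediction hits an FSP about 40% of the time.
BASE_RATE = 0.40

# Mean precision reported for each prediction of sensitive positions.
reported_precision = {
    "maximize precision": 0.805,
    "maximize recall": 0.657,
    "maximize both": 0.747,
}

# Assumed definition: improvement = precision / expected random precision.
improvement = {name: round(p / BASE_RATE, 2) for name, p in reported_precision.items()}
```

Under this assumption the ratios come out close to the reported factors of 2, 1.64, and 1.87.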

Functional prediction from SSPs

As we can see in Fig 4B for VIM-2 protein, in S3–S7 Figs for the other four proteins, and in Table 2, structural change due to mutation, similar to functional change, is position-dependent and independent from chemical properties of either the mutant amino acid or the one being replaced, supporting similar results from previous work [26]. This position dependence is also found in terms of functional change from mutations (Fig 4A, Table 2), suggesting that both structure and function relevance is determined by the position and not by the amino acid occupying it. Notably, structural measures with a higher position independence also showed higher correlations with functional change. This further supports the importance of the structural neighborhoods of positions disregarding chemical bonds to study protein structure.

Fig 4.

A) Experimentally obtained functional data from deep mutational scan of VIM-2 protein, with darker values representing higher functional disruption, specifically blue is loss of function while red represents gain of function [47]. B) Standardized data of the number of nodes perturbed by each mutation where each entry is the number of standard deviations from the mean of the distribution. The perturbation network was constructed using a threshold of 9 Å; blue represents highest structural perturbation, and red represents lowest. C) Predictions maximizing precision. X-axis has the sequence positions, Y-axis has the experimentally obtained mean functional value. Blue dots are SSPs—our predictions for FSPs—while shaded blue area contains the 40% of sequence positions with lowest functional scores representing strongest functional loss. Top row shows the functional values experimentally obtained for VIM-2 protein, bottom row the other four proteins studied. D) Predictions maximizing recall. E) Predictions maximizing both measures.

Table 2. Considering structural and functional data, we looked at perturbation values per position, and considered the percentage of positive scores and negative scores, keeping the maximum of the two.

This presents a measure of consensus between the changes at each position, the higher values represent that most mutations result in the same effect (whether positive or negative values), independent of the mutant amino acid or the amino acid being replaced. We present averages and standard deviations over positions and proteins for the nodes, edges, and weight measures (structural data) and for the functional data.

Measure            Positions sharing same sign (%)
Nodes              87.9 ± 14.4
Edges              83.9 ± 15.8
Weight             77.7 ± 15.3
Functional data    81.4 ± 15
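The consensus measure of Table 2 can be sketched as follows. The sketch assumes that structural scores are standardized, so that their sign is relative to the mean, that zero scores count toward neither sign, and that the spread is a population standard deviation; none of these details is stated in the caption. Names and values are hypothetical.

```python
from statistics import mean, pstdev

def sign_consensus(scores):
    """Percentage of a position's scores that share the majority sign.

    Assumption: zero scores count toward neither sign.
    """
    positive = sum(s > 0 for s in scores)
    negative = sum(s < 0 for s in scores)
    return 100 * max(positive, negative) / len(scores)

def consensus_summary(scores_by_position):
    """Mean and (population) standard deviation of the consensus over positions."""
    values = [sign_consensus(s) for s in scores_by_position.values()]
    return mean(values), pstdev(values)

# Hypothetical standardized scores: position -> one score per mutation.
toy = {1: [-1.0, -0.5, -2.0, 0.3], 2: [0.1, 0.2, 0.3, 0.4]}
avg, spread = consensus_summary(toy)
```

A value near 100% means that almost all mutations at a position push the score in the same direction, whichever amino acid is introduced.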

To further evaluate our model, we obtained the receiver operating characteristic curve, plotting the True Positive Rate against the False Positive Rate. We fixed the threshold at 9 Å and varied the perturbation cutoff from 0.03 to 3. As we have seen, changing this value varies the recall, or True Positive Rate, with lower values corresponding to a higher recall. For a minimum count of two, evaluating predictions for sensitive positions resulted in a mean area under the curve of 0.83 ± 0.04 (Fig 5).
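A minimal sketch of this ROC construction is given below. It assumes a cutoff step of 0.03 (only the range is stated above), adds the end points (0, 0) and (1, 1), and integrates with the trapezoidal rule. Inputs are the largest z-score of each position per measure, since a position is altered when at least one mutation exceeds the cutoff. All names and values are hypothetical.

```python
def predict(max_z, cutoff, min_count=2):
    """Positions whose largest z-score exceeds `cutoff` in at least `min_count` measures."""
    positions = next(iter(max_z.values()))
    return {p for p in positions
            if sum(max_z[m][p] > cutoff for m in max_z) >= min_count}

def roc_points(max_z, actual, cutoffs, min_count=2):
    """Sorted (false positive rate, true positive rate) pairs over all cutoffs."""
    positions = set(next(iter(max_z.values())))
    negatives = positions - actual
    points = {(0.0, 0.0), (1.0, 1.0)}
    for cutoff in cutoffs:
        predicted = predict(max_z, cutoff, min_count)
        tpr = len(predicted & actual) / len(actual)
        fpr = len(predicted & negatives) / len(negatives)
        points.add((fpr, tpr))
    return sorted(points)

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data: measure -> position -> largest z-score among its mutations.
max_z = {
    "nodes":  {1: 2.5, 2: 1.2, 3: 1.3, 4: 0.1},
    "edges":  {1: 2.0, 2: 1.5, 3: 0.5, 4: 0.2},
    "weight": {1: 2.2, 2: 0.4, 3: 1.4, 4: 0.05},
}
fsps = {1, 2}
cutoffs = [round(0.03 * i, 2) for i in range(1, 101)]  # 0.03, 0.06, ..., 3.0 (assumed step)
area = auc(roc_points(max_z, fsps, cutoffs))
```

In the toy data one robust position (3) is more perturbed than one sensitive position (2), which lowers the area below 1.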

Fig 5.

ROC curves for predictions of A) sensitive positions and B) robust positions are shown. ROC curves were obtained from varying cutoff-vector elements from 0.03 to 3, and fixing a minimum count of 2.

High precision to predict functional positions reinforces evidence about the relation between function and structure. Interestingly, on average more than 80% of the most structurally sensitive positions (top 26%) tended to also be FSPs for all five proteins studied, yielding a precision that to our knowledge is not yet met in other non-experimental scenarios. Our framework could have applications in fields where a high precision in determining not yet known functional positions could be of significance, e.g. to inhibit the function of a target protein related to disease by a single mutation. Moreover, precision rates of around 70% were similar using perturbation networks and coevolving positions [16]. Given a protein, the advantage of our method is the lack of need of its protein family to predict its functionally sensitive positions, thus enlarging the scope of proteins to be used.

Structural and functional robustness

Our predictions so far have focused on identifying positions likely to be functionally sensitive (FSPs). Thus, by turning to the positions left out of a certain prediction, we could identify those more likely to be functionally robust (FRPs) to mutation. We compared these new predictions, obtained from the complement of different predictions for unstable positions, with the 40% of positions with highest mean functional scores (Fig 6). This percentile of positions represents those with a gain of function, or those with small functional changes resulting in scores similar to the WT amino acid at that position.

Fig 6.

A) Functional data from deep mutational scan of PTEN protein, with lighter values representing smaller functional disruption, specifically blue is loss of function while white/red represents functional robustness to mutation [44]. B) Standardized data of the number of nodes perturbed by each mutation where each entry is the number of standard deviations from the mean of the distribution. The perturbation network was constructed using a threshold of 9 Å; blue represents highest structural perturbation, and red represents lowest. C) Predictions maximizing precision. X-axis has sequence positions, Y-axis has mean functional value. Blue dots are SRPs (our predictions for FRPs), while shaded red area contains the 40% of sequence positions with higher functional robustness. Top row shows PTEN protein, bottom row shows the other four proteins studied.

In this case, we considered the cutoff vector consisting of only ones to define the structurally sensitive positions. By considering the complement of these positions as structurally robust, we kept positions where all mutations had scores less than one standard deviation above the mean, those closer to the wild type. Since we considered the complement of sensitive positions, smaller minimum counts lead to stricter predictions for robust positions and vice versa.

We considered one prediction, with a minimum count of one, to showcase the relationship between structural and functional robustness. The minimum count of one guarantees that all mutations for predicted positions are less than one standard deviation above the mean for all three measures. This prediction had a mean precision of 70.4%, mean recall of 65% and a mean prediction percentage of 36.8%, which resulted in an improvement of 1.44. Once again we calculated the receiver operating characteristic curve, with threshold 9 Å. Predictions for robust positions with a minimum count of one resulted in a mean area under the curve of 0.80 ± 0.04 (Fig 5).
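The complement construction can be sketched as follows, again assuming that a position is altered for a measure when at least one mutation exceeds the cutoff. The function name and z-scores are hypothetical.

```python
def robust_positions(z_by_measure, cutoff=1.0, min_count=1):
    """Complement of the structurally sensitive positions.

    With min_count=1, a position is kept only if every mutation stays at or
    below `cutoff` standard deviations above the mean in every measure.
    """
    positions = set(next(iter(z_by_measure.values())))
    sensitive = {
        p for p in positions
        if sum(max(z_by_measure[m][p]) > cutoff for m in z_by_measure) >= min_count
    }
    return positions - sensitive

# Hypothetical standardized scores: measure -> position -> one score per mutation.
z = {
    "nodes":  {1: [0.2, 0.9], 2: [0.2, 1.2], 3: [2.0, 0.1]},
    "edges":  {1: [0.1, 0.5], 2: [0.3, 0.4], 3: [1.5, 0.0]},
    "weight": {1: [-0.3, 0.8], 2: [0.1, 0.6], 3: [0.3, 0.2]},
}
strict = robust_positions(z, cutoff=1.0, min_count=1)
lenient = robust_positions(z, cutoff=1.0, min_count=2)
```

Position 2 exceeds the cutoff in a single measure, so it is excluded with a minimum count of one but kept with a minimum count of two, which illustrates why smaller minimum counts are stricter for robust positions.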

We also tested predictions for robustness analogously to those for sensitivity, selecting positions with at least one value below a threshold for a specific number of measures. This resulted in predictions with lower precision, between 50 and 60%, compared to 70.4% as presented above. This shows that, while a single ‘bad’ mutation can be telling of a sensitive position with high precision, the same cannot be said for ‘good’ mutations and robust positions. Instead, we find good predictions for robust positions when all mutations have scores closer to the mean, suggesting higher constraints for stability are required from the structural neighborhoods.

Protein evolvability depends on the ability of a protein to obtain a new function from a set of mutations (protein innovability), as well as on protein robustness (the ability to withstand mutations) [17, 56]. Specifically, robustness is the ability of the protein to maintain both structure and function in the case of mutations. The fact that, over all five proteins, on average 70% of the top 30% structurally resilient positions were also among the most functionally robust may be a consequence of this property.

Concluding remarks

We set out to explore the relationship between change in protein structure and function through the use of protein three-dimensional coordinates, in silico mutagenesis, and published deep mutational scanning datasets. We developed a method to predict functionally sensitive positions using structural data, and found a mean precision of 74.7% and a mean recall of 69.2% when comparing the predictions to functionally sensitive positions. By considering the complement of a set of predictions as structurally stable positions, we found a mean precision of 70.4% and a mean recall of 65% when comparing to the 40% of positions with highest functional values. Predicting randomly would lead to precision values close to the 40% of positions deemed functionally sensitive (or stable), and these predictions improve on random predictions by factors of 1.87 and 1.44, respectively.

By changing the prediction parameters, we were able to obtain predictions with higher precision or recall, and we found a relationship between stricter parameters for structural sensitivity, requiring a bigger effect in the perturbation network, and more precise predictions. When predicting stable positions, more lenient parameters for sensitivity translate to stricter requirements for stability, and the same effect on precision was obtained. This supported a close relationship between structural and functional change in a protein. On the other hand, more lenient predictions for structural sensitivity lead to a greater recall, which relates to the greater percentage of positions included in the predictions. Our predictions maximizing precision improve that value by a factor of 2 for sensitive positions, compared to random predictions.

The method described can be used to predict sensitive positions in a protein without resorting to experimental methods, and it can be used as a standalone or in combination with other variant effect predictors [57], with the advantage that only the three-dimensional coordinate file is required as well as its in silico mutations. By knowing how the choice of parameters relates to the precision and recall in the proteins studied, we can estimate the probability of certain positions being functionally sensitive, and combine predictions to obtain positions most likely to be functionally sensitive, and predictions likely to encompass most functionally sensitive positions.

The predictions we considered and their respective scores show that it is harder to predict which positions are likely to show functional values above zero, showing gain of function, or close to zero, showing little or no functional change. However, we were able to observe a clear relationship between structural and functional robustness by looking at their correlation and their mutual position dependence.

The present approach may prove particularly relevant in the design of protein structures via directed mutagenesis methods [58, 59]. It may also be relevant to protein structure-based drug design [60, 61], whether for pharmaceutical and biotechnological applications or for disease modeling, as in the context of, for instance, the COVID-19 pandemic [62, 63]. Also of contemporary relevance are the potential applications of our approach to predicting the structural and functional effects of CRISPR-Cas9 modifications on proteins [64–68]. The common scenario is that CRISPR-Cas9 genome editing allows us to determine gene sequences via highly specific modifications. Less clear, however, is the potential impact that such gene-specific changes may have on protein structure and function. In view of the plethora of applications of CRISPR-Cas9 genome editing in health, agriculture and biotechnology, it will be useful to have tools to predict such effects, even if only approximately.

A potential continuation of the work presented here is the use of machine learning classification models, trained on structural data from perturbation networks, to predict functional positions. In the same direction, if the structural data could be obtained exclusively from the 3D atomic coordinates, using network parameters local to the position, these models would not need further mutagenesis software. The need for this software represents the major weakness of our approach, as it requires the availability and know-how of third-party software, a trade-off for not requiring additional data beyond the atomic coordinates of the protein. However, the computational time required was less than or similar to that of other state-of-the-art methods showing good agreement with our predictions, such as DynaMut [73] (see Methods). The position dependence of structural change should encourage further research on the identification of atypical neighborhoods in the structural vicinity of a position, and their relation with functional sensitivity to mutations.

Methods

Code availability

All the code used in this work is publicly available at https://github.com/CrisSotomayor/perturbation-networks.

Protein selection

We selected five proteins with published deep mutational scanning data and corresponding three-dimensional coordinates available in the Protein Data Bank [5], focusing on enzymes with substrate binding assays. The proteins selected were PSD95pdz3 (PDB: 1BE9) [16], phosphatase and tensin homolog (PTEN) (PDB: 1D5R) [44], APH(3’)II (PDB: 1ND4) [45], Src kinase catalytic domain (Src CD, PDB: 3DQW) [46], and VIM-2 metallo-β-lactamase (PDB: 4BZ3) [47].

Functional change

We used the deep mutational scanning data to obtain functional scores for individual mutations. We considered the mean functional change at each position: the average score for all mutations at a particular position for which scores are available. Using these values and the percentage of positions we want to consider, we define functionally sensitive positions (FSPs) by sorting positions and selecting said percentage of positions with the greatest loss of function: the lowest mean functional change. Similarly, functionally robust positions (FRPs) are defined by selecting a percentage of positions with the weakest loss of function: the highest mean functional values. These values translate to positions with positive mean values or values close to zero. Throughout this paper we will consider 40% of positions for both FSPs and FRPs.
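The selection of FSPs and FRPs described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the use of `None` for unavailable scores, and the toy scores are all assumptions made for the example.

```python
import math

def mean_functional_change(scores_by_position):
    """Average the available scores at each position (None marks a missing score)."""
    means = {}
    for position, scores in scores_by_position.items():
        available = [s for s in scores if s is not None]
        if available:  # positions without any score are left out
            means[position] = sum(available) / len(available)
    return means

def select_positions(means, fraction=0.4, lowest=True):
    """Return the given fraction of positions with the lowest (or highest) mean score."""
    n_selected = math.floor(len(means) * fraction)  # round down
    ranked = sorted(means, key=means.get, reverse=not lowest)
    return set(ranked[:n_selected])

# Hypothetical toy data: position -> functional scores of its mutations
toy_scores = {
    1: [-2.0, -1.5, None],
    2: [0.1, -0.1, 0.0],
    3: [-0.8, -1.2, -1.0],
    4: [0.4, 0.2, 0.3],
    5: [-0.2, -0.4, None],
}
means = mean_functional_change(toy_scores)
fsps = select_positions(means, fraction=0.4, lowest=True)   # greatest loss of function
frps = select_positions(means, fraction=0.4, lowest=False)  # highest mean values
```

With five toy positions and a fraction of 0.4, two positions are selected in each direction.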

Amino acid networks

Given the three-dimensional atomic coordinates of a protein and a distance threshold t, an amino acid network G(t) is a network where nodes correspond to sequence positions and an edge between two nodes exists if there is a pair of atoms, one in each amino acid, at distance less than t. Moreover, each edge in the network has a weight corresponding to the number of atomic pairs at distance less than t between the two nodes.

The construction of the networks was done in the Python programming language and implemented in a library called Biographs [69] based on the popular libraries NetworkX [70] and Biopython [71].
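As an illustration of the definition of G(t), the sketch below builds the weighted edges directly from atomic coordinates using only the Python standard library. It does not use or reproduce the Biographs API; the function name, the input layout (position mapped to a list of atom coordinates), and the toy coordinates are assumptions.

```python
import math
from itertools import combinations

def amino_acid_network(residue_atoms, threshold):
    """Return {(i, j): weight}, where weight counts atom pairs closer than threshold.

    residue_atoms maps a sequence position to a list of (x, y, z) atom coordinates.
    An edge exists only if at least one atom pair is at distance less than threshold.
    """
    edges = {}
    for i, j in combinations(sorted(residue_atoms), 2):
        close_pairs = sum(
            1
            for atom_i in residue_atoms[i]
            for atom_j in residue_atoms[j]
            if math.dist(atom_i, atom_j) < threshold
        )
        if close_pairs > 0:
            edges[(i, j)] = close_pairs
    return edges

# Hypothetical coordinates (in Å) for three residues
toy_residues = {
    1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    2: [(3.0, 0.0, 0.0), (4.5, 0.0, 0.0)],
    3: [(20.0, 0.0, 0.0)],
}
network = amino_acid_network(toy_residues, threshold=4.0)
```

In this toy case only residues 1 and 2 are connected, with three atom pairs below 4 Å. The all-pairs loop is quadratic in the number of atoms, so a spatial index would be preferable for real structures.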

Perturbation networks

Corresponding structural change data is obtained by first producing the same mutations in silico for each protein. Then, the resulting 3D structure of each mutation is modeled with an amino acid network and compared to the network of the wild-type 3D structure. The structural change of the mutation is represented by the topological difference of the two networks and called the perturbation network of the mutation, which accounts for the structural change of the protein. In this model, each topological measure of the perturbation network quantifies an effect of the mutation on a different structural property of the protein, and can be used to identify structurally sensitive positions to mutations. The full details of how we constructed the perturbation networks are described below.

For each protein, we performed in silico mutagenesis using the algorithm FoldX 5.0 [72]. We mutated every position with corresponding functional data to the other nineteen amino acids. For each mutation, FoldX yields a three-dimensional structure that was used to construct the mutation’s amino acid network. For each protein, its corresponding wild type amino acid network is obtained from the original structural file (PDB file). Using multithreading with 30 cores, computing the structures for all the point mutations took less than 24 hours per protein.

The perturbation network of a specific mutation is obtained by comparing wild type and mutation networks [42]. Given a distance threshold t, let A and B denote the adjacency matrices for the wild type and mutation networks G(t) and M(t), respectively. Let matrix C denote the absolute difference of the matrices A and B:

C=|A-B|.

After removing all the rows and columns containing only zeros from matrix C, we obtain the adjacency matrix that defines the perturbation network P(t) (Fig 7). We consider four topological attributes of P(t), namely its size (referred to here as ‘nodes’), number of edges (‘edges’), total sum of weights (‘weight’), and its diameter (the longest shortest path between any two nodes, called here ‘diameter’). However, the diameter was ultimately not considered for making predictions.

Fig 7. Example of an amino-acid network G with three nodes and three edges.

The network M represents a mutation in node b, resulting in nodes a and b losing three pairs of atoms, and nodes c and b losing one edge. The network P is the perturbation network of the mutation. In this example, P has 3 nodes, 2 edges, weight 4, and diameter 2.
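The construction of P(t) and its four measures can be sketched with plain Python lists. This is an illustrative sketch rather than the authors' implementation: it assumes both networks are given as symmetric weighted adjacency matrices over the same positions, and it computes the diameter over reachable node pairs only. The edge weights in the toy matrices are hypothetical values chosen to match the losses described in the Fig 7 caption.

```python
from collections import deque

def perturbation_network(wild_type, mutant):
    """Return C = |A - B| without all-zero rows/columns, plus the kept indices."""
    n = len(wild_type)
    diff = [[abs(wild_type[i][j] - mutant[i][j]) for j in range(n)] for i in range(n)]
    kept = [i for i in range(n) if any(diff[i])]  # symmetric: row zero iff column zero
    return [[diff[i][j] for j in kept] for i in kept], kept

def diameter(matrix):
    """Longest shortest path (counted in edges) found by BFS from every node."""
    n = len(matrix)
    longest = 0
    for source in range(n):
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in range(n):
                if matrix[node][neighbor] and neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        longest = max(longest, max(distance.values()))
    return longest

def perturbation_measures(matrix):
    """Compute nodes, edges, weight, and diameter of a perturbation network."""
    n = len(matrix)
    upper = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return {
        "nodes": n,
        "edges": sum(1 for w in upper if w),
        "weight": sum(upper),
        "diameter": diameter(matrix),
    }

# Nodes a, b, c; the wild-type weights are hypothetical
A = [[0, 5, 2], [5, 0, 1], [2, 1, 0]]  # wild type G
B = [[0, 2, 2], [2, 0, 0], [2, 0, 0]]  # mutant M: a-b loses 3 atom pairs, b-c loses its edge
P, kept = perturbation_network(A, B)
measures = perturbation_measures(P)
```

For these matrices the measures are 3 nodes, 2 edges, weight 4, and diameter 2, the values stated in the caption.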

Structural change

For proteins with multiple identical chains, we obtained the perturbation networks for all mutations of all chains, and then calculated the average of each mutation over the different chains to obtain a single score per position. In order to capture a broad range of atomic distances, we constructed all networks using 71 different thresholds between 3 Å and 10 Å, where consecutive thresholds are spaced by a 0.1 Å step.
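A short sketch of the threshold grid and the averaging over chains follows. The data layout (chain identifier mapped to per-mutation scores) and the mutation labels are hypothetical, and the function name is an assumption.

```python
# 71 thresholds from 3.0 Å to 10.0 Å in 0.1 Å steps; integer tenths avoid float drift
thresholds = [tenths / 10 for tenths in range(30, 101)]

def average_over_chains(scores_by_chain):
    """Average each mutation's score over the chains in which it was computed.

    scores_by_chain: {chain_id: {mutation_label: score}}
    """
    totals, counts = {}, {}
    for chain_scores in scores_by_chain.values():
        for mutation, score in chain_scores.items():
            totals[mutation] = totals.get(mutation, 0.0) + score
            counts[mutation] = counts.get(mutation, 0) + 1
    return {mutation: totals[mutation] / counts[mutation] for mutation in totals}

# Hypothetical scores for two identical chains
toy_chains = {
    "A": {"K10A": 4, "K10R": 6},
    "B": {"K10A": 6, "K10R": 8},
}
averaged = average_over_chains(toy_chains)
```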

The four perturbation measures have different scales, and vary in magnitude according to the distance threshold, so we standardized the data to make comparisons between the different measures. We considered four data arrays per protein and threshold, one for each measure, containing the corresponding scores of every possible mutation (4 measures × 5 proteins × 71 thresholds = 1420 data arrays). We removed the null scores resulting from mutation to the same wild type amino acid to preserve the range of values obtained from non-synonymous mutations, and then standardized each array. A visual comparison of the standardized perturbation network data and the functional data is shown in Fig 4.

Given the standardized data, every mutation in every protein has four scores called nodes, edges, weight, and diameter, respectively. We refer to these four scores as the perturbation scores.

Data standardization

In order to make all structural measures comparable across all mutations, the value of the perturbation of each mutation is measured in standard deviations from the mean; that is, the mean of the perturbation of all mutations was subtracted from each value, and the result was divided by the standard deviation. The perturbation value of each of the structural measures considered (nodes, edges, and weight) is given by the absolute difference between the network obtained by the mutation and the original (wild-type) network; thus negative values are closer to the original network’s values, while positive values show a stronger structural perturbation.
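The standardization of one data array (one measure, one protein, one threshold) can be sketched as below. The use of the population standard deviation and of `None` to mark the removed wild-type entries are assumptions; the text does not specify either.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert scores to standard deviations from the mean.

    None entries (mutation to the wild-type amino acid) are excluded from the
    mean and standard deviation and stay None in the output.
    Assumes the remaining values are not all identical (nonzero deviation).
    """
    values = [s for s in scores if s is not None]
    mu, sigma = mean(values), pstdev(values)  # population standard deviation (assumption)
    return [None if s is None else (s - mu) / sigma for s in scores]

# Hypothetical 'nodes' scores for five mutations, one of them null
raw_nodes = [2, 4, None, 4, 6]
z_nodes = standardize(raw_nodes)
```

Negative entries of `z_nodes` then correspond to mutations closer to the wild-type network, and positive entries to stronger perturbations.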

Functional data was not standardized. In order to define the functionally sensitive positions we used the bottom 40% of the positions in terms of mean functional change (smaller values). We did this because the data had varying distributions and few positions above the standard deviation cutoffs, probably because the functional change of mutations was measured in independent experiments with different methodologies.

Defining structural sensitivity

Since each measure represents different changes in the perturbation network, and therefore on the protein structure, each provides a different way to identify structurally sensitive positions (SSPs). First, we say that a mutation is sensitive for a certain measure if its corresponding perturbation score is above a particular perturbation cutoff. Given the four perturbation scores, we consider a perturbation cutoff vector containing four values including the specific cutoffs for sensitivity in terms of nodes, edges, weight and diameter measures, respectively. Thus, modifying the values in this vector yields different structurally sensitive mutations. Each cutoff corresponds to the number of standard deviations from the mean taken in each distribution, e.g. a cutoff of 1.5 corresponds to the value obtained by adding 1.5 standard deviations to the mean of the distribution.

For each of the four perturbation measures, we identify sensitive positions if at least one mutation at that position has a score above the corresponding cutoff. In other words, sensitive positions for a certain measure are all the positions in the protein with one or more sensitive mutations. Since a position can be sensitive for each of the four measures, we define the minimum count as the number of measures for which a position needs to be sensitive in order to be considered a structurally sensitive position (SSP). The positions defined as structurally sensitive will serve as our predictions for functionally sensitive positions (FSPs) (Fig 7).

Given a distance threshold and a protein, predictions are thus made based on a cutoff vector and a minimum count. For example, considering the perturbation cutoff vector (1,1,1,1), and minimum count of 2, predictions include all positions in the protein that have at least one mutation with perturbation measure score one standard deviation above average, for at least two of the four measures.

We will also consider structurally robust positions (SRPs), as the complement of SSPs for certain parameters. That is, all positions not defined as structurally sensitive will be considered structurally robust.
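The prediction rule based on a cutoff vector and a minimum count, together with the complement that defines SRPs, can be sketched as follows. The data layout and the toy standardized scores are hypothetical, and the function name is an assumption.

```python
MEASURES = ("nodes", "edges", "weight", "diameter")

def sensitive_positions(scores, cutoffs, min_count):
    """Return positions that are sensitive for at least min_count measures.

    scores: {position: {measure: [standardized score of each mutation]}}
    cutoffs: one cutoff per measure, in the order given by MEASURES.
    A position is sensitive for a measure if any mutation scores above the cutoff.
    """
    predicted = set()
    for position, by_measure in scores.items():
        n_sensitive = sum(
            any(z > cutoff for z in by_measure[measure])
            for measure, cutoff in zip(MEASURES, cutoffs)
        )
        if n_sensitive >= min_count:
            predicted.add(position)
    return predicted

# Hypothetical standardized scores: three positions, two mutations each
toy = {
    10: {"nodes": [1.4, 0.2], "edges": [1.1, -0.3], "weight": [0.5, 0.1], "diameter": [0.0, 0.2]},
    11: {"nodes": [0.3, -0.5], "edges": [1.2, 0.0], "weight": [0.4, 0.2], "diameter": [-0.1, 0.1]},
    12: {"nodes": [-0.6, 0.1], "edges": [-0.2, 0.3], "weight": [0.0, -0.4], "diameter": [0.2, 0.5]},
}
ssps = sensitive_positions(toy, cutoffs=(1, 1, 1, 1), min_count=2)
srps = set(toy) - ssps  # structurally robust positions as the complement
# A smaller minimum count for sensitivity gives a stricter set of robust positions
strict_srps = set(toy) - sensitive_positions(toy, cutoffs=(1, 1, 1, 1), min_count=1)
```

Here position 10 is sensitive for two measures and position 11 for one, so lowering the minimum count from 2 to 1 shrinks the robust set from two positions to one.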

Assessing accuracy of predictions

In order to test the predictions obtained from the perturbation network data, we identified functionally sensitive positions from the deep mutational scanning data. To have a single functional value to define FSPs, we first attempted to standardize the data and look at positions with values above a certain cutoff. However, given the different data distributions of the proteins, this yielded vastly different percentages of FSPs. Two proteins had no positions with mean values one standard deviation above average, and when considering 0.5 standard deviations, percentages ranged from 17% to 38%. This made predictions hard to evaluate and highly dependent on each individual protein and its functional-change distribution.

Instead, we evaluated positions with the lowest mean functional-change value using the 40th percentile, and compared these with predictions made from different cutoff vectors and minimum counts. That is, we consider FSPs to be the 40% of positions with the strongest functional loss. Rounding down from the number of positions times 0.4, we obtain a mean functional percentage of 39.8%.

Given a perturbation cutoff vector and a minimum count, we get a set of predictions, positions likely to be functionally sensitive based on their perturbation networks, and compare them to 40% of positions with lowest mean functional values. We considered two measures to score these predictions: the recall, i.e. what percentage of FSPs we were able to predict, and the precision, i.e. what percentage of our predictions were functionally sensitive.

Given a set of predictions, or SSPs, and a set of FSPs, the intersection of them represents the true positives. The precision and recall scores can then be expressed as:

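\[
\mathrm{Precision} = \frac{\mathrm{TruePositives}}{\mathrm{SSPs}} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}}
\]
\[
\mathrm{Recall} = \frac{\mathrm{TruePositives}}{\mathrm{FSPs}} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalseNegatives}}
\]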

The prediction percentage represents the ratio between SSPs and the total number of positions in the protein, while the functional percentage represents the ratio between FSPs and total positions:

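\[
\mathrm{PredictionPercentage} = \frac{\mathrm{SSPs}}{\mathrm{TotalPositions}} = \frac{\mathrm{TruePositives} + \mathrm{FalsePositives}}{\mathrm{TotalPositions}}
\]
\[
\mathrm{FunctionalPercentage} = \frac{\mathrm{FSPs}}{\mathrm{TotalPositions}} = \frac{\mathrm{TruePositives} + \mathrm{FalseNegatives}}{\mathrm{TotalPositions}}
\]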

We can also think about the precision and recall scores in terms of conditional probabilities:

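\[
\mathrm{Precision} = P[x \in \mathrm{FSPs} \mid x \in \mathrm{SSPs}] = \frac{P[x \in \mathrm{FSPs} \cap x \in \mathrm{SSPs}]}{P[x \in \mathrm{SSPs}]}
\]
\[
\mathrm{Recall} = P[x \in \mathrm{SSPs} \mid x \in \mathrm{FSPs}] = \frac{P[x \in \mathrm{FSPs} \cap x \in \mathrm{SSPs}]}{P[x \in \mathrm{FSPs}]}
\]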

Assuming independence between a position being functionally sensitive (x ∈ FSPs) and a position being structurally sensitive (x ∈ SSPs)—as would be the case if predictions were done randomly—we obtain that:

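\[
\mathrm{NullPrecision} = \frac{P[x \in \mathrm{FSPs} \cap x \in \mathrm{SSPs}]}{P[x \in \mathrm{SSPs}]} = \frac{P[x \in \mathrm{FSPs}] \cdot P[x \in \mathrm{SSPs}]}{P[x \in \mathrm{SSPs}]} = P[x \in \mathrm{FSPs}] = \frac{\mathrm{FSPs}}{\mathrm{TotalPositions}} = \mathrm{FunctionalPercentage}
\]
\[
\mathrm{NullRecall} = \frac{P[x \in \mathrm{SSPs} \cap x \in \mathrm{FSPs}]}{P[x \in \mathrm{FSPs}]} = \frac{P[x \in \mathrm{FSPs}] \cdot P[x \in \mathrm{SSPs}]}{P[x \in \mathrm{FSPs}]} = P[x \in \mathrm{SSPs}] = \frac{\mathrm{SSPs}}{\mathrm{TotalPositions}} = \mathrm{PredictionPercentage}
\]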

Given these null scores, which would result from random predictions, we can obtain a single improvement score by dividing the corresponding real and null values:

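\[
\mathrm{ImprovementScore} = \frac{\mathrm{RealRecall}}{\mathrm{NullRecall}} = \frac{\mathrm{RealPrecision}}{\mathrm{NullPrecision}} = \frac{\mathrm{TruePositives} \times \mathrm{TotalPositions}}{\mathrm{FSPs} \times \mathrm{SSPs}}
\]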
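The scores defined in this section can be computed from two sets of positions, as in the sketch below. The function name and the toy sets are hypothetical; the formulas follow the definitions given above.

```python
def score_predictions(ssps, fsps, total_positions):
    """Score a set of predicted positions (SSPs) against the reference set (FSPs)."""
    true_positives = len(ssps & fsps)
    return {
        "precision": true_positives / len(ssps),
        "recall": true_positives / len(fsps),
        # Null recall under independence equals the prediction percentage
        "prediction_percentage": len(ssps) / total_positions,
        # Null precision under independence equals the functional percentage
        "functional_percentage": len(fsps) / total_positions,
        "improvement": true_positives * total_positions / (len(fsps) * len(ssps)),
    }

# Hypothetical protein with 10 positions
toy_ssps = {1, 2, 3, 4, 5}
toy_fsps = {1, 2, 3, 6}
result = score_predictions(toy_ssps, toy_fsps, total_positions=10)
```

In this toy case the precision is 0.6 against a null precision of 0.4, and the recall is 0.75 against a null recall of 0.5; both ratios give the same improvement score of 1.5.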

Comparison with state of the art methods

We compared our predictions with those obtained from DynaMut [73], which generates a prediction of the impact of a mutation on protein stability. Due to time constraints, and as this software is implemented on a web server, we performed an alanine scan instead of obtaining all point mutations, with results taking around three and a half weeks. We considered the ΔΔG value obtained from DynaMut for each position, and compared the mean value for our predictions maximizing precision, our predictions maximizing recall, and the positions not predicted to be functionally sensitive in either case. We found a general agreement between the results, as shown in S8 Fig, with both sets of predicted functionally sensitive positions consistently obtaining a lower ΔΔG than positions not predicted, indicating mutations that make the protein less stable. Considering that ΔΔG was calculated for a single mutation per position, compared to our thorough mutational scan, we believe that the results show a good agreement between the two approaches.

Supporting information

S1 Fig. Predictions based on individual measures, considering cutoff 1.5, comparing precision and recall scores obtained from varying the threshold for measures nodes, edges, weight, and diameter, from 3 Å to 10 Å.

(TIF)

S2 Fig. Precision and recall across 51 different perturbation cutoffs, ranging from 1 to 2 in intervals of 0.02.

Each row and column represents a different minimum count and protein, respectively.

(TIF)

S3 Fig. Matrix of normalized structural change across all mutations for protein PSD95pdz3.

Red and blue colors represent structural loss and robustness, respectively.

(TIF)

S4 Fig. Matrix of normalized structural change across all mutations for protein PTEN.

Red and blue colors represent structural loss and robustness, respectively.

(TIF)

S5 Fig. Matrix of normalized structural change across all mutations for protein APH(3’)II.

Red and blue colors represent structural loss and robustness, respectively.

(TIF)

S6 Fig. Matrix of normalized structural change across all mutations for protein SRC CD.

Red and blue colors represent structural loss and robustness, respectively.

(TIF)

S7 Fig. Matrix of normalized structural change across all mutations for protein VIM-2.

Red and blue colors represent structural loss and robustness, respectively.

(TIF)

S8 Fig. Point plot displaying the mean ΔΔG obtained from DynaMut [73] for three sets of positions for each of the five proteins studied, those included in the maximum precision prediction, those included in the maximum recall prediction, and those not included in either.

(TIF)

Acknowledgments

EHL is a recipient of the 2016 Marcos Moshinsky Fellowship in the Physical Sciences. CSV is an undergraduate student from the Program in Genomic Sciences, Universidad Nacional Autónoma de México (UNAM).

Data Availability

Data and code to create the figures can be found at https://github.com/CrisSotomayor/perturbation-networks.

Funding Statement

This work was supported by CONACYT (grant no. 285544/2016 Ciencia Básica, and grant no. 2115 Fronteras de la Ciencia), as well as by federal funding from the National Institute of Genomic Medicine (Mexico). Additional support has been granted by the National Laboratory of Complexity Sciences (grant no. 232647/2014 CONACYT). EHL acknowledges additional support from the 2016 Marcos Moshinsky Fellowship in the Physical Sciences. The funders have no role in the design or development of this project.

References

  • 1. Dill KA, MacCallum JL. The protein-folding problem, 50 years on. science. 2012;338(6110):1042–1046. doi: 10.1126/science.1219021 [DOI] [PubMed] [Google Scholar]
  • 2. Sadowski M, Jones D. The sequence–structure relationship and protein function prediction. Current opinion in structural biology. 2009;19(3):357–362. doi: 10.1016/j.sbi.2009.03.008 [DOI] [PubMed] [Google Scholar]
  • 3. Dill KA, Ozkan SB, Shell MS, Weikl TR. The protein folding problem. Annu Rev Biophys. 2008;37:289–316. doi: 10.1146/annurev.biophys.37.092707.153558 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–230. doi: 10.1126/science.181.4096.223 [DOI] [PubMed] [Google Scholar]
  • 5. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The protein data bank. Nucleic acids research. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. LeCun Y, Bengio Y, Hinton G. Deep learning. nature. 2015;521(7553):436–444. doi: 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
  • 7. Araya CL, Fowler DM. Deep mutational scanning: assessing protein function on a massive scale. Trends in biotechnology. 2011;29(9):435–442. doi: 10.1016/j.tibtech.2011.04.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nature methods. 2014;11(8):801–807. doi: 10.1038/nmeth.3027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Fowler DM, Stephany JJ, Fields S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nature protocols. 2014;9(9):2267–2284. doi: 10.1038/nprot.2014.153 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710. doi: 10.1038/s41586-019-1923-7 [DOI] [PubMed] [Google Scholar]
  • 11. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence-and structure-rich era. Proceedings of the National Academy of Sciences. 2013;110(39):15674–15679. doi: 10.1073/pnas.1314045110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Kuhlman B, Bradley P. Advances in protein structure prediction and design. Nature Reviews Molecular Cell Biology. 2019;20(11):681–697. doi: 10.1038/s41580-019-0163-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences. 2011;108(49):E1293–E1301. doi: 10.1073/pnas.1111471108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Schmiedel JM, Lehner B. Determining protein structures using deep mutagenesis. Nature genetics. 2019;51(7):1177. doi: 10.1038/s41588-019-0431-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009;138(4):774–786. doi: 10.1016/j.cell.2009.07.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. McLaughlin RN, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. The spatial architecture of protein function and adaptation. Nature. 2012;. doi: 10.1038/nature11500 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Tóth-Petróczy Á, Tawfik DS. The robustness and innovability of protein folds. Current opinion in structural biology. 2014;26:131–138. doi: 10.1016/j.sbi.2014.06.007 [DOI] [PubMed] [Google Scholar]
  • 18. Wagner A. Robustness and evolvability in living systems. vol. 24. Princeton university press; 2013. [Google Scholar]
  • 19. Wagner A, Fell DA. The small world inside large metabolic networks. Proceedings of the Royal Society of London Series B: Biological Sciences. 2001;268(1478):1803–1810. doi: 10.1098/rspb.2001.1711 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Tokuriki N, Tawfik DS. Stability effects of mutations and protein evolvability. Current opinion in structural biology. 2009;19(5):596–604. doi: 10.1016/j.sbi.2009.08.003 [DOI] [PubMed] [Google Scholar]
  • 21. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Protein stability promotes evolvability. Proceedings of the National Academy of Sciences. 2006;103(15):5869–5874. doi: 10.1073/pnas.0510098103 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution. Nature reviews Molecular cell biology. 2009;10(12):866–876. doi: 10.1038/nrm2805 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335(6070):823–828. doi: 10.1126/science.1215040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Kato S, Han SY, Liu W, Otsuka K, Shibata H, Kanamaru R, et al. Understanding the function–structure and function–mutation relationships of p53 tumor suppressor protein by high-resolution missense mutation analysis. Proceedings of the National Academy of Sciences. 2003;100(14):8424–8429. doi: 10.1073/pnas.1431692100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Achoch M, Dorantes-Gilardi R, Wymant C, Feverati G, Salamatian K, Vuillon L, et al. Protein structural robustness to mutations: an in silico investigation. Physical Chemistry Chemical Physics. 2016;18(20):13770–13780. doi: 10.1039/C5CP06091E [DOI] [PubMed] [Google Scholar]
  • 26. Dorantes-Gilardi R, Bourgeat L, Pacini L, Vuillon L, Lesieur C. In proteins, the structural responses of a position to mutation rely on the Goldilocks principle: not too many links, not too few. Physical Chemistry Chemical Physics. 2018;20(39):25399–25410. doi: 10.1039/C8CP04530E [DOI] [PubMed] [Google Scholar]
  • 27. Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KH, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182(5):1295–1310. doi: 10.1016/j.cell.2020.08.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Dorantes-Gilardi R, García-Cortés D, Hernández-Lemus E, Espinal-Enríquez J. Multilayer approach reveals organizational principles disrupted in breast cancer co-expression networks. Applied Network Science. 2020;5(1):1–23. doi: 10.1007/s41109-020-00291-1 [DOI] [Google Scholar]
  • 29. Dorantes-Gilardi Rodrigo and García-Cortés Diana and Hernández-Lemus Enrique and Espinal-Enríquez Jesús. k-core genes underpin structural features of breast cancer. Scientific Reports. 2021;11(1):1–17. doi: 10.1038/s41598-021-95313-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Lopes-Ramos CM, Chen CY, Kuijjer ML, Paulson JN, Sonawane AR, Fagny M, et al. Sex differences in gene expression and regulatory networks across 29 human tissues. Cell reports. 2020;31(12):107795. doi: 10.1016/j.celrep.2020.107795 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Kuijjer ML, Tung MG, Yuan G, Quackenbush J, Glass K. Estimating sample-specific regulatory networks. Iscience. 2019;14:226–240. doi: 10.1016/j.isci.2019.03.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. The large-scale organization of metabolic networks. Nature. 2000;407(6804):651–654. doi: 10.1038/35036627 [DOI] [PubMed] [Google Scholar]
  • 33. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks. Nature biotechnology. 2003;21(6):697–700. doi: 10.1038/nbt825 [DOI] [PubMed] [Google Scholar]
  • 34. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature. 2005;437(7062):1173–1178. doi: 10.1038/nature04209 [DOI] [PubMed] [Google Scholar]
  • 35. Ali W, Rito T, Reinert G, Sun F, Deane CM. Alignment-free protein interaction network comparison. Bioinformatics. 2014;30(17):i430–i437. doi: 10.1093/bioinformatics/btu447 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, et al. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347 (6224). doi: 10.1126/science.1257601 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Molecular systems biology. 2007;3(1):88. doi: 10.1038/msb4100129 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Di Paola L, De Ruvo M, Paci P, Santoni D, Giuliani A. Protein contact networks: an emerging paradigm in chemistry. Chemical reviews. 2013;113(3):1598–1613. doi: 10.1021/cr3002356 [DOI] [PubMed] [Google Scholar]
  • 39. Bagler G, Sinha S. Assortative mixing in Protein Contact Networks and protein folding kinetics. Bioinformatics. 2007;23(14):1760–1767. doi: 10.1093/bioinformatics/btm257 [DOI] [PubMed] [Google Scholar]
  • 40.Di Paola L, Giuliani A. Mapping active allosteric loci SARS-CoV spike proteins by means of protein contact networks. arXiv preprint arXiv:200305200. 2020;.
  • 41. Chakrabarty B, Parekh N. NAPS: Network analysis of protein structures. Nucleic acids research. 2016;44(W1):W375–W382. doi: 10.1093/nar/gkw383 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Pacini L, Vuillon L, Lesieur C. Induced Perturbation Network and tiling for modeling the L55P Transthyretin amyloid fiber. Procedia Computer Science. 2020;178:8–17. doi: 10.1016/j.procs.2020.11.002 [DOI] [Google Scholar]
  • 43. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic acids research. 2019;47(D1):D520–D528. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Mighell TL, Evans-Dutson S, O’Roak BJ. A Saturation Mutagenesis Approach to Understanding PTEN Lipid Phosphatase Activity and Genotype-Phenotype Relationships. American Journal of Human Genetics. 2018. doi: 10.1016/j.ajhg.2018.03.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Melnikov A, Rogov P, Wang L, Gnirke A, Mikkelsen TS. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Research. 2014. doi: 10.1093/nar/gku511 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Ahler E, Register AC, Chakraborty S, Fang L, Dieter EM, Sitko KA, et al. A Combined Approach Reveals a Regulatory Mechanism Coupling Src’s Kinase Activity, Localization, and Phosphotransferase-Independent Functions. Molecular Cell. 2019. doi: 10.1016/j.molcel.2019.02.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Chen JZ, Fowler DM, Tokuriki N. Comprehensive exploration of the translocation, stability and substrate recognition requirements in vim-2 lactamase. eLife. 2020. doi: 10.7554/eLife.56707 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Yao XQ, Momin M, Hamelberg D. Establishing a Framework of Using Residue–Residue Interactions in Protein Difference Network Analysis. Journal of chemical information and modeling. 2019;59(7):3222–3228. doi: 10.1021/acs.jcim.9b00320 [DOI] [PubMed] [Google Scholar]
  • 49. Yan W, Zhou J, Sun M, Chen J, Hu G, Shen B. The construction of an amino acid network for understanding protein structure and function. Amino acids. 2014;46(6):1419–1439. doi: 10.1007/s00726-014-1710-6 [DOI] [PubMed] [Google Scholar]
  • 50. Jacobs SA, Harp JM, Devarakonda S, Kim Y, Rastinejad F, Khorasanizadeh S. The active site of the SET domain is constructed on a knot. Nature structural biology. 2002;9(11):833–838. [DOI] [PubMed] [Google Scholar]
  • 51. Morley KL, Kazlauskas RJ. Improving enzyme properties: when are closer mutations better? Trends in biotechnology. 2005;23(5):231–237. [DOI] [PubMed] [Google Scholar]
  • 52. Koch AA, Hansen DA, Shende VV, Furan LR, Houk K, Jiménez-Osés G, et al. A single active site mutation in the pikromycin thioesterase generates a more effective macrocyclization catalyst. Journal of the American Chemical Society. 2017;139(38):13456–13465. doi: 10.1021/jacs.7b06436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Kaiser E, Lawrence D. Chemical mutation of enzyme active sites. Science. 1984;226(4674):505–511. doi: 10.1126/science.6238407 [DOI] [PubMed] [Google Scholar]
  • 54. Süel GM, Lockless SW, Wall MA, Ranganathan R. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nature structural biology. 2003;10(1):59–69. doi: 10.1038/nsb881 [DOI] [PubMed] [Google Scholar]
  • 55. Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286(5438):295–299. doi: 10.1126/science.286.5438.295 [DOI] [PubMed] [Google Scholar]
  • 56. Rorick MM, Wagner GP. Protein structural modularity and robustness are associated with evolvability. Genome biology and evolution. 2011;3:456–475. doi: 10.1093/gbe/evr046 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Livesey BJ, Marsh JA. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Molecular Systems Biology. 2020;16(7). doi: 10.15252/msb.20199380 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Wang YQ, Cao C, Ying YL, Li S, Wang MB, Huang J, et al. Rationally designed sensing selectivity and sensitivity of an aerolysin nanopore via site-directed mutagenesis. ACS sensors. 2018;3(4):779–783. doi: 10.1021/acssensors.8b00021 [DOI] [PubMed] [Google Scholar]
  • 59. Xia Y, Li K, Li J, Wang T, Gu L, Xun L. T5 exonuclease-dependent assembly offers a low-cost method for efficient cloning and site-directed mutagenesis. Nucleic acids research. 2019;47(3):e15–e15. doi: 10.1093/nar/gky1169 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Śledź P, Caflisch A. Protein structure-based drug design: from docking to molecular dynamics. Current opinion in structural biology. 2018;48:93–102. doi: 10.1016/j.sbi.2017.10.010 [DOI] [PubMed] [Google Scholar]
  • 61. Oliveira ER, Mohana-Borges R, de Alencastro RB, Horta BA. The flavivirus capsid protein: Structure, function and perspectives towards drug design. Virus research. 2017;227:115–123. doi: 10.1016/j.virusres.2016.10.005 [DOI] [PubMed] [Google Scholar]
  • 62. Liu X, Luo Y, Li P, Song S, Peng J. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS computational biology. 2021;17(8):e1009284. doi: 10.1371/journal.pcbi.1009284 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Dong N, Yang X, Ye L, Chen K, Chan EWC, Yang M, et al. Genomic and protein structure modelling analysis depicts the origin and infectivity of 2019-nCoV, a new coronavirus which caused a pneumonia outbreak in Wuhan, China. bioRxiv. 2020. [Google Scholar]
  • 64. Gandhi S, Piacentino ML, Vieceli FM, Bronner ME. Optimization of CRISPR/Cas9 genome editing for loss-of-function in the early chick embryo. Developmental biology. 2017;432(1):86–97. doi: 10.1016/j.ydbio.2017.08.036 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. McCullough KT, Boye SL, Fajardo D, Calabro K, Peterson JJ, Strang CE, et al. Somatic gene editing of GUCY2D by AAV-CRISPR/Cas9 alters retinal structure and function in mouse and macaque. Human gene therapy. 2019;30(5):571–589. doi: 10.1089/hum.2018.193 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Dong D, Guo M, Wang S, Zhu Y, Wang S, Xiong Z, et al. Structural basis of CRISPR–SpyCas9 inhibition by an anti-CRISPR protein. Nature. 2017;546(7658):436–439. doi: 10.1038/nature22377 [DOI] [PubMed] [Google Scholar]
  • 67. Wang R, da Rocha Tavano EC, Lammers M, Martinelli AP, Angenent GC, de Maagd RA. Re-evaluation of transcription factor function in tomato fruit development and ripening with CRISPR/Cas9-mutagenesis. Scientific reports. 2019;9(1):1–10. doi: 10.1038/s41598-018-38170-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Lin CS, Hsu CT, Yang LH, Lee LY, Fu JY, Cheng QW, et al. Application of protoplast technology to CRISPR/Cas9 mutagenesis: from single-cell mutation detection to mutant plant regeneration. Plant biotechnology journal. 2018;16(7):1295–1310. doi: 10.1111/pbi.12870 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Dorantes-Gilardi R. Biographs: Amino acid networks in python; 2020. [Google Scholar]
  • 70. Hagberg A, Swart P, Schult D. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Lab.(LANL), Los Alamos, NM (United States); 2008. [Google Scholar]
  • 71. Cock P, Antao T, Chang J, Chapman B, Cox C, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: An online force field. Nucleic Acids Research. 2005. doi: 10.1093/nar/gki387 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Rodrigues CHM, Pires DEV, Ascher DB. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Research. 2018. doi: 10.1093/nar/gky300 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Sriparna Saha

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

5 Oct 2021

PONE-D-21-15101

Linking protein structural and functional change to mutation using amino acid networks

PLOS ONE

Dear Dr. Dorantes-Gilardi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Authors are requested to update the manuscript as per the suggestions by both the reviewers.

==============================

Please submit your revised manuscript by Nov 19 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for stating the following in the Competing Interests section:

“EHL is an Academic Editor at PLoS ONE.”

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article is very interesting & informative. However I have few minor comments/queries/suggestions for betterment-

1. Line no. 77-89: Most of this part will go to methods section

2. Line no. 90-118- Most of this part has been discussed in the section of "results & discussion" so here it becomes redundant.

3. Line no. 92--- Which 5 protein & why these ? mention here itself.

4. Line no. 98. Write full form of ROC.

5. Line no. 101-102-- I think you have considered four different parameter (but here you have mentioned three)

6. Line no. 125-126: (In general ….. of the protein) The statement is not clear to me. If possible put a reference here (to support the statement)

7. I would recommend to summarize the major findings (with few bullets) at the beginning of the result & discussion section

8. Line no. 396-400- You may cite a more recent study here (Liu X, Luo Y, Li P, Song S, Peng J (2021) Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol 17(8): e1009284. https://doi.org/10.1371/journal.pcbi.1009284)

9. You should write a small paragraph mentioning the strengths (e.g. inclusive design, statistical methods etc.) & weaknesses (only considered perturbation data, not deletion data) of your model/study

10. Lastly title & abbreviations may be incorporated in the figure ( in the picture itself) so that the figures can stand alone.

Reviewer #2: The article titled "Linking protein structural and functional change to mutation using amino acid networks" tries to establish a relation between structural and functional changes in terms of mutational dynamics of protein sequences. Authors use network modelling to study the relation between protein structure and functional variations considering 5 different proteins validating from deep mutational scanning databases. The idea of finding the structurally sensitive positions by observing its behaviour in the perturbation network seems interesting. Authors also claim that there is a strong correlation between the structurally sensitive positions (SSP) and functionally sensitive positions(FSP). They predicted FSPs from SSPs with a mean precision of 74.7% and recall of 69.3% for all the 5 proteins studied.

Though the article demonstrates various empirical results with statistical validations, the results are mainly some numerical values, in most of the cases, with no/inadequate explanation.

Some major issues illustrating this observation are as follows:

1. Line 458, "We consider four topological attributes of P(t), namely its size (referred here as `nodes'), number of edges (`edges'), total sum of weights (`weight'), and its diameter (maximal smallest path, called here `diameter')."

It was never explained intuitively why these measures have been chosen. There could have been several other properties of a graph to choose from. During experimentation, a correlation has been established with these measures using a statistical count. Whereas, in a perturbation network, the no. of nodes indicates the number of positions affected or perturbed due to the mutation considered. The no. of edges indicate the connection between these nodes/positions affected. Similarly, the summation may capture the score of the total impact caused by this mutation. And, finally, the diameter encompasses the reach or spread of the perturbation in the network.

This kind of intuitive explanation could be helpful to the readers.

2. Again in line 154, "In the case of the measure 'diameter', correlations peaked between 3.5–3.8 Å for all five proteins and then decreased for higher distance thresholds.", some measures have been reported with no proper explanation. Any intuition why the behaviour of diameter is so different from the others?

The authors have decided not to include diameter in the prediction model. However, diameter could have been an important measure to indicate the spread of the perturbation, which has been lost from sight because of some values. If it really is not that important for the prediction, then it should be justified with proper explanation.

3. Line 169, "We found that the number of nodes had the highest mean precision (72.66%), weight had the highest mean recall (71.76%), and diameter had the lowest score in both cases (52.58% and 49.02%, for precision and recall, respectively)."

Again, there is an observation with inadequate explanation. Why is the result suggesting that this measure is crucial over others? No explanation!!

4. How are the standardized data arrays obtained in Figure 4? Why are the figures 4(A) and 4(B) looking different from C, D, and E. Are they representing the same color code? No explanation of the figures.

5. There should be a detailed stepwise explanation of Figure 7. First of all, is it "distance" or diameter in figure 7A? How are the positive and negative fraction values appearing in the tables of Figure 7B? If it is the output of standardization, then it should be properly explained. Maybe, the standardization was inappropriate which results in a differential behaviour of diameter in comparison to others.

Moreover, Standard Deviation encompasses the deviation (lower/higher) in the distribution. Does it mean that the smaller changes in the perturbation measures i.e., those which are less than the mean values should also be filtered out? Is the cutoff based on standard deviation meaningful here?

6. In line 493, authors say that "For each of the four perturbation measures, we identify sensitive positions if at least one mutation at that position has a score above the corresponding cutoff. In other words, sensitive positions for a certain measure are all the positions in the protein with one or more sensitive mutations."

Is it significant to call a position sensitive, if only 1 out of 10 mutations is showing a structural change in the perturbation network. Because 9 other mutations of the same position are demonstrating no structural change. Are all measures equally affected by a mutation at a particular position? If not, this needs to be studied methodically. Because, the authors already claimed that whether a mutation is effective in causing any functional change is dependent on the position of the mutation not on mutation itself.

This should be clarified with proper justification.

7. Any justification on deciding the cut-off vector to be [1.5,1.5,1.5,1.5]? In line 514, authors mentioned that "Two proteins had no positions with mean values one standard deviation above average, and when considering 0.5 standard deviations, percentages ranged from 17% to 38%." Then why the value 1.5?

8. The results presented in this article seem incomplete without the comparisons with some state of the art methods ( as mentioned in [1] and [2]) which predicted protein structure changes as result of mutational variation in the sequence.

[1] Carlos HM Rodrigues, Douglas EV Pires, David B Ascher, DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W350–W355, https://doi.org/10.1093/nar/gky300

[2]Lijun Quan, Qiang Lv, Yang Zhang, STRUM: structure-based prediction of protein stability changes upon single-point mutation, Bioinformatics, Volume 32, Issue 19, 1 October 2016, Pages 2936–2946, https://doi.org/10.1093/bioinformatics/btw361

A demonstration of mutations and structural changes that are predicted by the proposed article, are also concurred by the methods like DynaMut and STRUM would be an interesting experimentation. If these two methods result in different findings than the proposed one, then this should be explained and validated as well.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Abhijit Dey

Reviewer #2: Yes: Angana Chakraborty, PhD

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jan 21;17(1):e0261829. doi: 10.1371/journal.pone.0261829.r002

Author response to Decision Letter 0


23 Nov 2021

Dear Dr. Sriparna Saha, Academic Editor:

We would like to take the opportunity to thank you and the reviewers for your valuable feedback. We are sure our manuscript has been considerably improved as a consequence.

Please find below the response to each point raised by the reviewers. When specific lines in the manuscript are mentioned we refer to the revised manuscript with track changes.

Reviewer #1:

The article is very interesting & informative. However I have few minor comments/queries/suggestions for betterment-

The authors would like to thank Reviewer 1 for their professional critique and assessment of our work. In what follows we will present a point-by-point response to their comments and suggestions.

1. Line no. 77-89: Most of this part will go to methods section

We have revised our manuscript in accordance with your suggestion (lines 81–90 of the revised manuscript with track changes). Most of the paragraph was moved to the Methods section.

2. Line no. 90-118- Most of this part has been discussed in the section of "results & discussion" so here it becomes redundant.

This section has been re-written to eliminate redundancies.

3. Line no. 92--- Which 5 protein & why these ? mention here itself.

We selected these proteins based on the completeness of the point mutations evaluated, as well as a focus on experimental assays evaluating enzyme binding. We also required that the proteins not be too large, since all point mutations were evaluated. The manuscript has been updated to state this in the introduction and methodology.

4. Line no. 98. Write full form of ROC.

The manuscript has been updated with its full form.

5. Line no. 101-102-- I think you have considered four different parameter (but here you have mentioned three)

As stated in the methods section, the diameter was ultimately not used for predictions, which is why it is not mentioned here. We have updated lines 182–185 of the manuscript with track changes to explain its exclusion.

6. Line no. 125-126: (In general ….. of the protein) The statement is not clear to me. If possible put a reference here (to support the statement)

We have rephrased this section and included some supporting references in lines 132–134.

7. I would recommend to summarize the major findings (with few bullets) at the beginning of the result & discussion section

A summary of the major findings was included at the beginning of the Results and discussion section (lines 135–140 of the revised manuscript with tracked changes).

8. Line no. 396-400- You may cite a more recent study here (Liu X, Luo Y, Li P, Song S, Peng J (2021) Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol 17(8): e1009284. https://doi.org/10.1371/journal.pcbi.1009284)

We have added this reference to the concluding remarks of the manuscript in line 406.

9. You should write a small paragraph mentioning the strengths (e.g. inclusive design, statistical methods etc.) & weaknesses (only considered perturbation data, not deletion data) of your model/study

A brief subsection on the scope and limitations of our work has been included in the revised version of our manuscript in lines 434–440.

10. Lastly title & abbreviations may be incorporated in the figure ( in the picture itself) so that the figures can stand alone.

Figs 4, 6, and 7 were updated to include additional information, such as titles and abbreviations.

Reviewer #2:

The article titled "Linking protein structural and functional change to mutation using amino acid networks" tries to establish a relation between structural and functional changes in terms of mutational dynamics of protein sequences. Authors use network modelling to study the relation between protein structure and functional variations considering 5 different proteins validating from deep mutational scanning databases. The idea of finding the structurally sensitive positions by observing its behaviour in the perturbation network seems interesting. Authors also claim that there is a strong correlation between the structurally sensitive positions (SSP) and functionally sensitive positions(FSP). They predicted FSPs from SSPs with a mean precision of 74.7% and recall of 69.3% for all the 5 proteins studied.

Though the article demonstrates various empirical results with statistical validations, the results are mainly some numerical values, in most of the cases, with no/inadequate explanation.

The authors are thankful to Reviewer 2 for their professional academic review of our manuscript. Below we will present a point-by-point response to your comments and concerns.

Some major issues illustrating this observation are as follows:

1. Line 458, "We consider four topological attributes of P(t), namely its size (referred here as `nodes'), number of edges (`edges'), total sum of weights (`weight'), and its diameter (maximal smallest path, called here `diameter')."

It was never explained intuitively why these measures have been chosen. There could have been several other properties of a graph to choose from. During experimentation, a correlation has been established with these measures using a statistical count. Whereas, in a perturbation network, the no. of nodes indicates the number of positions affected or perturbed due to the mutation considered. The no. of edges indicate the connection between these nodes/positions affected. Similarly, the summation may capture the score of the total impact caused by this mutation. And, finally, the diameter encompasses the reach or spread of the perturbation in the network.

This kind of intuitive explanation could be helpful to the readers.

The manuscript has been revised to give a clearer picture of the reasons for using these topological attributes, as well as the rationale behind the statistical analysis and the scoring procedures. This can be seen in the (new) lines 104 to 107 of the revised manuscript (see the manuscript with tracked changes for easy reading).
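As an illustration of the four attributes discussed in this point, the following minimal sketch computes them for a toy perturbation network. It is not the authors' implementation (the article cites Biographs and NetworkX for that purpose). The function name `perturbation_measures`, the residue labels, and the weights are hypothetical, and the diameter is assumed here to be counted in unweighted hops, taking the largest value over connected components.

```python
from collections import deque


def perturbation_measures(weighted_edges):
    """Return nodes, edges, weight and diameter of a small undirected network.

    weighted_edges maps (u, v) pairs to non-negative weights; each edge is
    assumed to appear once.
    """
    adjacency = {}
    for u, v in weighted_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def eccentricity(source):
        # Breadth-first search: longest shortest path (in hops) from source
        # to any node reachable from it.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        return max(dist.values())

    return {
        "nodes": len(adjacency),                 # positions perturbed
        "edges": len(weighted_edges),            # contacts perturbed
        "weight": sum(weighted_edges.values()),  # total size of the change
        # Assumption: unweighted diameter, maximised over components.
        "diameter": max((eccentricity(n) for n in adjacency), default=0),
    }


# Hypothetical perturbation network: a chain of four residues.
toy = {("A10", "A11"): 2, ("A11", "A45"): 1, ("A45", "A46"): 3}
measures = perturbation_measures(toy)
```

For this toy chain the sketch reports four nodes, three edges, a total weight of 6 and a diameter of 3, matching the reading of the measures as extent (nodes, edges), overall impact (weight) and reach (diameter).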

2. Again in line 154, "In the case of the measure 'diameter', correlations peaked between 3.5–3.8 Å for all five proteins and then decreased for higher distance thresholds.", some measures have been reported with no proper explanation. Any intuition why the behaviour of diameter is so different from the others?

The authors have decided not to include diameter in the prediction model. However, diameter could have been an important measure to indicate the spread of the perturbation, which has been lost from sight because of some values. If it really is not that important for the prediction, then it should be justified with proper explanation.

Reviewer 2 is right that our former presentation was not clear enough on these issues; we have therefore modified the manuscript to make this more evident (lines 181–184). Diameter, in particular, was not included because we believe it is too sensitive to small changes in thresholds, as can be seen in Figure 1, and because predictions based on diameter alone obtained low recall and precision scores. Adding or removing a single edge can significantly change the maximal smallest path without significantly changing the network itself.

3. Line 169, "We found that the number of nodes had the highest mean precision (72.66%), weight had the highest mean recall (71.76%), and diameter had the lowest score in both cases (52.58% and 49.02%, for precision and recall, respectively)."

Again, there is an observation with inadequate explanation. Why is the result suggesting that this measure is crucial over others? No explanation!!

We believe that the difference between the diameter and the other measures lies in its sensitivity to mutations. Given that the perturbation network is typically small, the loss of a single edge can split the network so that the mutated position remains in an even smaller connected component, giving a very small diameter. This is reflected in its poor performance when used on its own to predict functional perturbation (a difference of about 20% in precision and recall relative to the other measures). The number of edges, however, can serve as a proxy for the extent of the network. An explicit mention of the exclusion of the diameter was added to the manuscript.
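The argument above can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the diameter is taken as the unweighted diameter of the connected component containing the mutated position, and the function `component_diameter`, the node labels, and the chain topology are all hypothetical.

```python
from collections import deque


def component_diameter(edges, source):
    """Diameter (in hops) of the connected component that contains `source`."""
    adjacency = {source: set()}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def bfs(start):
        # Hop distances from `start` to every node reachable from it.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        return dist

    component = bfs(source)
    return max(max(bfs(node).values()) for node in component)


# Toy chain in which the mutated position "M" sits at one end.
chain = [("M", "a"), ("a", "b"), ("b", "c"), ("c", "d")]
pruned = [edge for edge in chain if edge != ("a", "b")]  # drop a single edge

before = (len(chain), component_diameter(chain, "M"))    # (edges, diameter)
after = (len(pruned), component_diameter(pruned, "M"))
```

In this toy case, removing one edge lowers the edge count from 4 to 3 but drops the diameter from 4 to 1, which illustrates why the edge count behaves more smoothly than the diameter.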

4. How are the standardized data arrays obtained in Figure 4? Why are the figures 4(A) and 4(B) looking different from C, D, and E. Are they representing the same color code? No explanation of the figures.

We have updated the manuscript to clarify the standardization in a subsection of the methodology; values were standardized by subtracting the mean and dividing by the standard deviation. Figures 4A and 4B show experimental and computational data for each point mutation, respectively, while Figures 4C, 4D and 4E show which positions were predicted in each case, together with the mean functional value from the experimental data. The manuscript and figure caption have been updated to clarify this.

The explanation of the color code for each plot was included as requested by the reviewer.
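For readers who want the standardization from the response to point 4 spelled out, the following is a minimal sketch rather than the authors' code. The function name `standardize` and the sample values are hypothetical, and the population standard deviation is assumed because the response does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values identical: no spread, so every z-score is defined as 0 here.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw perturbation scores for eight mutations.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(raw)
print(z)
```

After this transformation a cutoff such as 1.5 reads directly as "1.5 standard deviations above the mean" for that measure.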

5. There should be a detailed stepwise explanation of Figure 7. First of all, is it "distance" or diameter in figure 7A? How are the positive and negative fraction values appearing in the tables of Figure 7B? If it is the output of standardization, then it should be properly explained. Maybe, the standardization was inappropriate which results in a differential behaviour of diameter in comparison to others.

Figure 7 has been updated to include the correct term (diameter) and the origin of the values.

Moreover, Standard Deviation encompasses the deviation (lower/higher) in the distribution. Does it mean that the smaller changes in the perturbation measures i.e., those which are less than the mean values should also be filtered out? Is the cutoff based on standard deviation meaningful here?

As we are considering the perturbation network as the absolute value of the difference between the two networks (mutation network and wild type network), for all measures, a value of zero would be expected if the networks were identical, and no negative values are possible. Therefore, values below the mean are closer to the original network, and we only filter values above the mean. The manuscript has been updated to better reflect this in the methods section.
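A small sketch can make the non-negativity argument concrete. It is an illustration under stated assumptions, not the published implementation: each network is assumed to be a dictionary mapping a residue pair to an edge weight, a missing edge is treated as weight 0, and the names `perturbation_network` and `measures` as well as the residue labels are hypothetical.

```python
def perturbation_network(wild_type, mutant):
    """Map each changed edge to |w_mutant - w_wild_type|."""
    diff = {}
    for edge in set(wild_type) | set(mutant):
        # Assumption: an edge absent from one network has weight 0 there.
        delta = abs(mutant.get(edge, 0) - wild_type.get(edge, 0))
        if delta > 0:
            diff[edge] = delta
    return diff


def measures(network):
    """Node count, edge count, and total weight of a perturbation network."""
    nodes = {residue for edge in network for residue in edge}
    return {
        "nodes": len(nodes),
        "edges": len(network),
        "weight": sum(network.values()),
    }


# Hypothetical wild-type and mutant amino acid networks.
wild_type = {("A10", "L15"): 4, ("L15", "G20"): 2, ("G20", "K30"): 3}
mutant = {("A10", "L15"): 1, ("L15", "G20"): 2, ("G20", "P31"): 5}

changed = measures(perturbation_network(wild_type, mutant))
identical = measures(perturbation_network(wild_type, wild_type))
print(changed, identical)
```

Identical networks give zero for every measure, and any difference can only push the values upward, which is why only the upper tail is filtered.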

6. In line 493, authors say that "For each of the four perturbation measures, we identify sensitive positions if at least one mutation at that position has a score above the corresponding cutoff. In other words, sensitive positions for a certain measure are all the positions in the protein with one or more sensitive mutations."

Is it significant to call a position sensitive, if only 1 out of 10 mutations is showing a structural change in the perturbation network? Because 9 other mutations of the same position are demonstrating no structural change. Are all measures equally affected by a mutation at a particular position? If not, this needs to be studied methodically. Because, the authors already claimed that whether a mutation is effective in causing any functional change is dependent on the position of the mutation not on mutation itself.

This should be clarified with proper justification.

This is a valid point raised by the reviewer. Because structural perturbation depends on the position being mutated rather than on the particular mutation, the data are standardized considering the full set of mutations and their effect on the structure of the original protein. In general, the presence of a mutation that alters the structure in such a way suggests that the other mutations at that position also alter the structure, although probably to a lesser degree. This is shown in Table 2 of the manuscript, which reports, for each position, the percentage of mutations with the same sign (positive or negative).

We are working with computationally obtained approximations of what the structure might look like if the protein were mutated. On this basis, we believe that in vivo mutations could distort the structure more than in silico mutations show, and we therefore considered a single “bad” mutation a good indication of a possibly sensitive neighborhood in the protein. In other words, we believe it is more likely that mutations that appear not to alter the structure have a bigger effect in vivo than vice versa. We have updated the manuscript to better reflect this reasoning.
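The "at least one mutation above the cutoff" rule discussed in point 6 can be sketched as follows. This is a hedged illustration, not the repository code: the function `sensitive_positions`, the dictionary layout keyed by (position, mutant residue), and the scores themselves are hypothetical, and the scores are assumed to be already standardized for a single perturbation measure.

```python
def sensitive_positions(z_scores, cutoff=1.5):
    """Positions with one or more mutations scoring strictly above `cutoff`."""
    flagged = {
        position
        for (position, _residue), z in z_scores.items()
        if z > cutoff
    }
    return sorted(flagged)


# Hypothetical standardized scores for one measure, keyed by (position, mutant residue).
z_scores = {
    (12, "A"): 0.3, (12, "W"): 1.9, (12, "G"): -0.2,
    (13, "A"): 0.1, (13, "P"): 1.2,
    (14, "D"): 1.6, (14, "K"): 1.7,
}

sensitive = sensitive_positions(z_scores)
print(sensitive)
```

Position 12 is flagged even though only one of its three mutations exceeds the cutoff, which is exactly the behavior the reviewer questions and the authors justify above.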

7. Any justification on deciding the cut-off vector to be [1.5,1.5,1.5,1.5]? In line 514, authors mentioned that "Two proteins had no positions with mean values one standard deviation above average, and when considering 0.5 standard deviations, percentages ranged from 17% to 38%." Then why the value 1.5?

The quote refers to the experimental data we used, which we found to have varying distributions and few positions above the standard deviation cutoffs. This led us to filter experimental data by quantiles, settling on 40%. On the other hand, the computational data obtained from in silico mutations had much more even distributions, so using standard deviation cutoffs was possible. The value 1.5 was selected based on how many positions passed the cutoff. The manuscript has been updated to better explain the distinction between the filtering of the data in the “data standardization” subsection of the methodology.

8. The results presented in this article seem incomplete without the comparisons with some state of the art methods ( as mentioned in [1] and [2]) which predicted protein structure changes as result of mutational variation in the sequence.

[1] Carlos HM Rodrigues, Douglas EV Pires, David B Ascher, DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W350–W355, https://doi.org/10.1093/nar/gky300

[2] Lijun Quan, Qiang Lv, Yang Zhang, STRUM: structure-based prediction of protein stability changes upon single-point mutation, Bioinformatics, Volume 32, Issue 19, 1 October 2016, Pages 2936–2946, https://doi.org/10.1093/bioinformatics/btw361

A demonstration of mutations and structural changes that are predicted by the proposed article, are also concurred by the methods like DynaMut and STRUM would be an interesting experimentation. If these two methods result in different findings than the proposed one, then this should be explained and validated as well.

We submitted the proteins studied to both the DynaMut and STRUM web servers by October 19th; however, we only received results from DynaMut. We found general agreement with our results, which we discuss in a new methodology section of the manuscript. We especially appreciate this comment, as we believe it brought an important addition to our manuscript.

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 1

Sriparna Saha

13 Dec 2021

Linking protein structural and functional change to mutation using amino acid networks

PONE-D-21-15101R1

Dear Dr. Dorantes-Gilardi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed the queries adequately. The manuscript now looks much better. The editor may accept it for publication.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Abhijit Dey

Reviewer #2: Yes: Dr. Angana Chakraborty

Acceptance letter

Sriparna Saha

17 Dec 2021

PONE-D-21-15101R1

Linking protein structural and functional change to mutation using amino acid networks

Dear Dr. Dorantes-Gilardi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sriparna Saha

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Predictions based on individual measures, considering cutoff 1.5, comparing precision and recall scores obtained from varying the threshold for measures nodes, edges, weight, and diameter, from 3 Å to 10 Å.

    (TIF)

    S2 Fig. Precision and recall across 51 different perturbation cutoffs, ranging from 1 to 2 in intervals of 0.02.

    Each row and column represents a different minimum count and protein, respectively.

    (TIF)

    S3 Fig. Matrix of normalized structural change across all mutations for protein PSD95pdz3.

    Red and blue colors represent structural loss and robustness, respectively.

    (TIF)

    S4 Fig. Matrix of normalized structural change across all mutations for protein PTEN.

    Red and blue colors represent structural loss and robustness, respectively.

    (TIF)

    S5 Fig. Matrix of normalized structural change across all mutations for protein APH(3’)II.

    Red and blue colors represent structural loss and robustness, respectively.

    (TIF)

    S6 Fig. Matrix of normalized structural change across all mutations for protein SRC CD.

    Red and blue colors represent structural loss and robustness, respectively.

    (TIF)

    S7 Fig. Matrix of normalized structural change across all mutations for protein VIM-2.

    Red and blue colors represent structural loss and robustness, respectively.

    (TIF)

    S8 Fig. Point plot displaying the mean ΔΔG obtained from DynaMut [73] for three sets of positions for each of the five proteins studied, those included in the maximum precision prediction, those included in the maximum recall prediction, and those not included in either.

    (TIF)

    Data Availability Statement

    Data and code to create the figures can be found at https://github.com/CrisSotomayor/perturbation-networks.


    Articles from PLoS ONE are provided here courtesy of PLOS
