PLOS Computational Biology. 2020 Apr 30;16(4):e1007722. doi: 10.1371/journal.pcbi.1007722

Insight into the protein solubility driving forces with neural attention

Daniele Raimondi 1, Gabriele Orlando 2, Piero Fariselli 3, Yves Moreau 1,*
Editor: Emil Alexov
PMCID: PMC7217484  PMID: 32352965

Abstract

Protein solubility is a key aspect for many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of the solubility of proteins may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor, and we show how it can provide novel insight into the protein solubility mechanisms thanks to its neural attention architecture. First, we show that SKADE compares favourably with state of the art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this peculiarity to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the amino acids, they suggest that the N- and C-termini are the most relevant regions for solubility prediction and are predictive for complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large-scale in-silico mutagenesis of proteins in order to maximize their solubility.

Author summary

The solubility of proteins is a crucial biophysical aspect when it comes to understanding many human diseases and to improving the industrial processes for protein production. Due to its relevance, computational methods have been devised to study and possibly optimize the solubility of proteins. In this work we apply a deep-learning technique, called neural attention, to predict protein solubility while “opening” the model itself to interpretability, even though Machine Learning models are usually considered black boxes. Thanks to the attention mechanism, we show that i) our model implicitly learns complex patterns related to emergent, protein folding-related aspects, such as the recognition of β-amyloidosis regions, and that ii) the N- and C-termini are the regions with the highest signal for solubility prediction. When it comes to enhancing the solubility of proteins, we propose, for the first time, to investigate the synergistic effects of tandem mutations instead of “single” mutations, suggesting that this could minimize the number of required proposed mutations.

Introduction

Proteins have evolved within the different cellular environments to improve or preserve their functions, maintaining at the same time the degree of hydrophobicity necessary for proper folding and enough solubility to prevent protein precipitation. Solubility is thus a fundamental property that must be properly balanced to maintain protein function and avoid aggregation [1]. Protein solubility is also an essential aspect in diagnostic and therapeutic applications [2, 3], as well as being a critical requirement for protein homeostasis [1, 4, 5]. A solubility deficit can hamper protein-based drug development, generating insoluble protein precipitates that can be toxic and may elicit an immune response in patients [6, 7]. Protein aggregation is considered a hallmark of more than forty human diseases, spanning from neurodegenerative illnesses to cancer and metabolic disorders such as diabetes [8].

Since the soluble expression of proteins is crucial for protein production [9] for both pharmacological and research goals [10], in-silico bioinformatics models have been developed to predict i) the solubility of proteins [10–13] or ii) the solubility change upon mutation [9, 12, 14, 15], in order to, respectively, i) select the most likely soluble proteins [12] and ii) run in-silico mutagenesis to increase the solubility of existing proteins [9].

The methods developed so far in both categories use various Machine Learning (ML) approaches, such as Neural Networks (NN) [10], Gradient Boosting Machines [11], Support Vector Machines (SVM) [13], Logistic Regression [12] or simpler statistical methods [9, 14]. Most of these approaches compute predictions starting from just the protein sequence, with the exception of CamSol, which uses PDB structures [14]. Among the sequence-based features used, the most common are: i) biophysical propensity scale values (e.g. hydrophobicity, charge), ii) various forms of k-mer frequencies (such as mono-, bi- or tri-peptide occurrences), iii) predictions from other methods (e.g. disorder, Secondary Structure, Relative Solvent Accessibility or aggregation predictors) and iv) global features such as sequence length and the fraction of residues exposed to the solvent.

A common issue that methods predicting protein solubility have to face is that the input protein sequences may have very different lengths; indeed, building ML models able to work with protein sequences is a common task in structural bioinformatics. From the ML standpoint, this task is not trivial because the variable length of proteins poses issues to conventional ML methods, such as SVMs or Random Forests. This problem is usually addressed by using sliding-window techniques to predict each residue independently [16, 17], but different solutions are needed when a single prediction must be associated with an entire protein sequence [13, 14, 18], since the information content of an entire sequence needs to be compressed into a single predictive scalar value.

Neural Networks (NN) are flexible models that can elegantly address this issue. The classical approach consists of building a pyramid-like architecture [10] that takes the protein sequence as input and reduces it to a fixed size through subsequent abstraction layers, ending with a feed-forward sub-network that yields the final scalar prediction.

Here we propose a novel solution to this issue, inspired by the neural attention mechanisms developed for Natural Language Processing and machine translation [19, 20]. Our model, called SKADE, uses a neural attention-like architecture to elegantly process the information contained in protein sequences towards the prediction of their solubility. By comparing it with state of the art methods, we show that it achieves competitive performance while requiring just the protein sequence as input.

Additionally, the use of neural attention allows our model to be interpreted, showing that the learned patterns correlate with complex sequence aspects such as the presence of aggregating regions or the protein contact density, while it does not correlate with more trivial aspects such as biophysical propensity scales or solvent accessibility.

We also show that, even if it has not been specifically trained for the task, SKADE can distinguish between mutations that increase or decrease the proteins’ solubility. This, coupled with the fact that it can generate hundreds of predictions per second, makes it ideal for performing in-silico optimization of protein solubility. To show its potential in this sense, we performed a complete in-silico mutagenesis of the UPF0235 protein MTH_637, computing both the effect of all the possible single mutations and all possible pairs of tandem mutations (> 2 × 10^6 pairs). This allowed us to investigate the possible effects of synergistic interactions between mutations, indicating that, in certain regions of the protein, the implementation of pairs of mutations could have a larger effect than the sum of the effects of independent mutations. Finally, we show that the predicted synergistic effects have a significant correlation with the average contact distances between residues, extracted from the protein PDB structure, suggesting that SKADE is able to catch a glimpse of complex emergent properties such as the contact density.

Materials and methods

Datasets

To train and test our model, we used the protein solubility datasets adopted in [10, 11]. Using the same training/testing data and procedure allowed us to compare the performances of SKADE with the most recently published methods. The training set contains 28972 soluble and 40448 insoluble proteins that have been annotated with the pepcDB [21] “soluble” (or subsequent stages) annotations in [12]. The test dataset contains 1000 soluble and 1001 insoluble proteins, and has been compiled by [22].

To validate the ability of SKADE to distinguish between variants increasing or decreasing the overall solubility of the protein, we adopted the dataset used in CamSol [14], which contains 142 variants known to increase or decrease the solubility of 42 proteins.

Neural attention for protein sequences

SKADE is a NN model which consists of two sub-networks: the predictor network P and the attention network A.

The NN takes as input just the target protein sequence, without using evolutionary information or other kinds of annotations. As a first step, each residue in the input sequence is translated into a trainable 20-dimensional embedding encoding the 20 possible amino acids. The embedded sequences are then sorted and padded, in order to maximize the efficiency of the NN computations on GPUs.

Both A and P consist of 2 layers of bidirectional Gated Recurrent Unit (GRU) [23] networks with 20 hidden dimensions. The GRU network runs on the input sequence, outputting a 20-dimensional vector for each residue. Since the GRU is bidirectional, for each protein it provides two 20 × L tensors as output, where L is the length of the sequence. The two tensors obtained from the forward and backward passes of the bidirectional GRU are concatenated into a 40 × L tensor and further processed by a linear layer that produces a single scalar value for each residue, obtaining a 1 × L tensor for each protein.

At this point the A and P networks differ: the last layer of the predictor P has a LeakyReLU activation, while the attention network A has a SoftMax activation that is applied globally over the entire 1 × L tensor T, ensuring that the attention values sum to one over the sequence (Σi Ti = 1).

To obtain the final prediction, for each protein we compute the scalar product aᵀp between the 1 × L tensors a and p, obtained respectively from the A and P sub-networks, allowing the SoftMax-ed attention vector a to select the positions of the vector p that should be given the highest relevance for the final prediction. The resulting value is passed through a Sigmoid activation, constraining the final output of SKADE within the 0–1 range.

Conceptually, the predictor network P should assign a different solubility-related value to each residue in the protein, also considering the local sequence context. In parallel, the attention network A should learn which protein regions its attention should be focused on.
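The forward pass described above can be sketched in PyTorch (the framework used for SKADE). The stated dimensions (20-dimensional embeddings, 2-layer bidirectional GRUs with 20 hidden units, LeakyReLU and SoftMax heads, dot product followed by a Sigmoid) are taken from the text; class and variable names are our own, and the sorting/padding machinery is omitted, so this is a simplifying sketch rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class SkadeSketch(nn.Module):
    """Minimal sketch of the two-subnetwork attention architecture
    (dimensions from the text; other details are assumptions)."""
    def __init__(self, n_aa=20, emb_dim=20, hidden=20):
        super().__init__()
        self.emb = nn.Embedding(n_aa, emb_dim)
        # Both subnetworks: 2 layers of bidirectional GRU with 20 hidden dims
        self.gru_p = nn.GRU(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.gru_a = nn.GRU(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Linear layers collapsing the 40-dim (2 * hidden) GRU output
        # to a single scalar per residue (the 1 x L tensors)
        self.lin_p = nn.Linear(2 * hidden, 1)
        self.lin_a = nn.Linear(2 * hidden, 1)
        self.act_p = nn.LeakyReLU()

    def forward(self, seq):            # seq: (batch, L) integer-encoded residues
        x = self.emb(seq)              # (batch, L, 20)
        p = self.act_p(self.lin_p(self.gru_p(x)[0])).squeeze(-1)  # (batch, L)
        # SoftMax over the whole sequence, so the attention sums to 1
        a = torch.softmax(self.lin_a(self.gru_a(x)[0]).squeeze(-1), dim=1)
        return torch.sigmoid((a * p).sum(dim=1))  # dot product, then Sigmoid

model = SkadeSketch()
score = model(torch.randint(0, 20, (1, 50)))  # one random 50-residue "protein"
```

With untrained weights the output is meaningless, but the shapes and the 0–1 range of the final score match the description above.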

The final model has 25462 trainable parameters and uses a batch size of 1001 with an initial learning rate of 0.01. We used the Adam optimizer with an L2 regularization of 10^-6, and the model is trained for 50 epochs. The model has been implemented in PyTorch version 1.0.1 [24]. The code is freely available at https://bitbucket.org/eddiewrc/skade/src.

Validation procedure and performance evaluation

In order to compare our results with [10, 11], we reproduced their validation procedure. We trained our model on the 69420 proteins in the training set and we tested it on the 2001 proteins in the test set from [22]. We used the widely adopted Sensitivity (Sen), Specificity (Spe), Precision (Pre), Accuracy (Acc), Matthews Correlation Coefficient (MCC), Area Under the ROC Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) metrics to evaluate the predictions. The Balanced Accuracy (BAC) is computed as the arithmetic mean of Sen and Spe.
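For reference, the threshold-dependent metrics listed above can all be computed from the four entries of the binary confusion matrix; the helper below is an illustrative implementation, not the evaluation code used in the paper:

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute Sen, Spe, Pre, Acc, BAC and MCC from binary labels and
    binary predictions (illustrative helper, not the authors' code)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sen = tp / (tp + fn)                    # Sensitivity (recall)
    spe = tn / (tn + fp)                    # Specificity
    pre = tp / (tp + fp)                    # Precision
    acc = (tp + tn) / len(y_true)           # Accuracy
    bac = (sen + spe) / 2                   # Balanced Accuracy, as in the text
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(Sen=sen, Spe=spe, Pre=pre, Acc=acc, BAC=bac, MCC=mcc)

m = evaluation_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

AUC and AUPRC are threshold-independent and require the raw scores rather than binary predictions, so they are not included in this sketch.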

To assess the ability of SKADE to distinguish between variants increasing or decreasing the solubility of proteins, we followed the procedure used in SODA [9].

Results

SKADE predicts protein solubility from just the protein amino acid sequence

The most recently developed tools for protein solubility prediction [10, 11] use various kinds of features. These can be divided into sequence-related features (e.g. k-mers and their frequencies of occurrence, sequence descriptions based on biophysical propensity scales), global features (e.g. sequence length, molecular weight, fraction of residues with certain biophysical properties) and structural features, such as secondary structure assignments (SS) and Relative Solvent Accessibility (RSA) [10, 11].

During the development of SKADE we chose to use only the protein sequences as inputs because the addition of any other kind of features would pose certain restrictions on our model. For example, in [25] it has been shown that, although extremely valuable, the use of Multiple Sequence Alignments (MSAs) as features could bias predictors towards more studied proteins (e.g. proteins for which many homologous sequences are known). Another recent study we conducted [26] showed that the use of biophysical propensity scales coupled with sophisticated ML methods could be just a discrete and intrinsically limited way to find an optimized embedding description of the amino acids in proteins, and that using entirely random scales could give extremely similar results [26]. Finally, our rationale for avoiding global features is that, since the solubility of a protein is a crucial step for its expression and large-scale manufacture, we envision SKADE as a tool for the in-silico optimization of protein solubility, and thus we deemed global features such as the protein length, the molecular weight and the absolute charge unsuitable for our framework, since they are not characteristics on which we want to act directly while performing the optimization.

S6 Fig shows a PCA of the learned embeddings, and S2 Text contains their actual 20-dimensional values.

SKADE positively compares with state of the art predictors

We trained and tested SKADE following the same procedure adopted in [10, 11], which are two of the most recently developed solubility predictors. This allowed us to compare SKADE with the most relevant state of the art methods, extending the benchmark initially proposed in [22]. This procedure involved a training set containing 69420 proteins and a test set of 2001 proteins. The results obtained by SKADE and the comparison with other methods are shown in Table 1.

Table 1. Table showing the comparison between SKADE and the state of the art tools benchmarked in [22].

Methods Sen Spe Pre Acc MCC AUC AUPRC
SKADE 66 81 78 73 47 82 82
PaRSnIP 76 72 70 74 48 82 80
DeepSol S1 75 71 69 73 46 81 81
PROSO II 67 68 69 64 34 74 71
CCSOL 54 54 51 54 8 - -
SOLPRO 62 58 51 60 20 - -
PROSO 58 57 54 58 16 - -
RPSP 52 51 44 52 3 - -
SCM 65 57 42 60 21 - -

From DeepSol [10] we reported the performance of their S1 model, which is the DeepSol version that uses only protein sequence information. PaRSnIP and all the other methods, on the other hand, use various additional features, some of which include information related to Secondary Structure, Relative Solvent Accessibility and other biophysical aspects. Notwithstanding this limitation, Table 1 shows that SKADE has the best AUPRC and the best AUC, on par with PaRSnIP.

The role of attention

As shown in Fig 1, SKADE’s NN is divided into two sub-networks. The sub-network A is used to compute the per-residue SoftMax-ed attention values a, while the sub-network P is used to compute the per-residue predictions p. The final solubility prediction for each protein is obtained as σ(aᵀp), where σ is a Sigmoid activation. The pre-activation score is thus a weighted sum of the per-residue predictions proposed by P, with the A network responsible for assigning the weights. This is analogous to a bias-less Logistic Regression in which the weights are dynamically predicted for each sample instead of being learned once and for all, and it allows SKADE to work with input sequences of arbitrary length, making it suitable for bioinformatics applications.

Fig 1. Figure showing the structure of the NN.


The protein sequence is translated into a 20-dimensional embedding and then passed to the predictor network P (left) and the attention network A (right). Each of these subnetworks contains 2 layers of bidirectional GRU networks, followed by a feed-forward layer. The predictor ends with a LeakyReLU activation, while the attention network has a SoftMax activation. Finally, the 1 × L outputs of the two networks are reduced to a predictive, probability-like scalar value by a dot product operation, followed by a Sigmoid activation.

Moreover, since σ(aᵀp) = σ(Σi ai × pi), with σ monotonically increasing, residues i with a positive value of ai × pi steer the prediction towards class 1 (soluble), while residues with negative values are actually “voting” for class 0. This means that these per-residue values can be considered as the solubility profile of the protein (see S3–S5 Figs for the attention and prediction profiles of the proteins Q8TC59, Q9HBE1 and P25984).
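This "voting" interpretation can be illustrated numerically; the attention and prediction values below are made up for a toy six-residue protein and are not SKADE outputs:

```python
import numpy as np

# Toy per-residue attention a_i (SoftMax-ed, sums to 1) and predictions p_i
a = np.array([0.30, 0.05, 0.05, 0.05, 0.25, 0.30])
p = np.array([ 1.2, -0.4,  0.1, -0.8,  0.9,  1.5])

profile = a * p                        # per-residue contribution a_i * p_i
logit = profile.sum()                  # equals the dot product a . p
solubility = 1 / (1 + np.exp(-logit))  # Sigmoid of the dot product

soluble_votes = profile > 0            # residues steering towards class 1
```

Residues with a positive entry in `profile` push the Sigmoid towards the soluble class; the attention value scales how much each residue's prediction matters.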

In order to analyse the behavior of SKADE and to investigate the learned attention patterns, we tried to correlate the per-residue attention and solubility profiles with known biophysical properties, such as hydrophobicity, polarity, charge, volume and the propensity of residues to be in α-helical or β-sheet conformations, but no correlation with these values was detected (see Table A1 in S1 Text), indicating that the behavior of the A and P networks cannot be directly associated with trivial sequence properties. We also tested the correlation of the profiles against in-house Relative Solvent Accessibility (RSA) predictions, obtaining a Pearson correlation of r = 0.0237.

SKADE solubility profiles detect aggregation patches

We then tested the predictive power of SKADE’s solubility profiles on the AMYL dataset [27], which contains 34 amyloidogenic proteins whose aggregating regions have been used in [28] to benchmark the performances of per-residue in-silico aggregation predictors. The AMYL dataset contains annotations indicating the involvement of each residue in β-amyloidosis aggregation, for a total of 7732 residues, 2599 of which are responsible for aggregation.

SKADE’s solubility profiles show an interesting signal towards the prediction of aggregating regions in these proteins. The profiles have an AUC of 65 even though SKADE has not been trained on the proteins in the AMYL dataset nor on the β-amyloidosis aggregation task at any point. To better contextualize the magnitude of this signal, in Table 2 we compared the performance of SKADE’s solubility profiles with state of the art tools that have been specifically developed to predict aggregation. We can see that even though the solubility profiles of SKADE are a byproduct of the neural attention model and have been obtained in a completely unsupervised way with respect to the task at hand, SKADE provides quite competitive performance when it comes to identifying aggregating regions in proteins.

Table 2. Comparison of the performance of SKADE with state of the art aggregation predictors on the amyl33 dataset [27].

Results are reported from [28].

Method Sen Spe BAC MCC
AgMata 43 84 66 25
PASTA 2 (85 specificity) 41 85 63 24
PASTA 2 (90 specificity) 30 90 60 22
AMYLPRED2 39 84 62 22
SKADE 55 71 63 20
MetAmyl 52 71 62 17
Tango 14 96 55 14
Aggrescan 35 79 57 13
FishAmyloid 14 94 54 10
FoldAmyloid 21 87 54 8

We can thus conclude that, although the neural attention profiles do not show correlation with trivial biophysical aspects of the protein sequence, the ai × pi per-residue solubility profiles that are responsible for the final prediction show an interesting correlation with a complex and still only partially understood behavior such as β-amyloidosis aggregation of proteins.

The N- and C-termini are the most relevant regions for the protein solubility prediction

Further analysing the data extracted from the neural attention used in SKADE, in Fig 2 we show the mean and median values of attention a (red) and prediction p (blue) as a function of the relative position of the residues in the sequence. S1 Fig shows the distributions of the ai × pi per-residue solubility profile values, where i is the residue position in each sequence. From both figures it clearly appears that SKADE focuses its attention on the N- and C-termini of the proteins, indicating that they contain signal for the prediction of protein solubility. To investigate whether this behavior is an artifact of the recurrent modules in our NN, we tested SKADE on the 2001 sequences in the test set after first removing the initial and final 20% of residues from each of them. The obtained AUC is very close to random (0.55), indicating that the beginning and the end of the sequences might indeed carry an important signal for the prediction of protein solubility. On the contrary, when we test SKADE on the 2001 sequences in the test set after removing the central residues (located in the relative sequence positions between the 20th and 80th percentiles), the performance is very similar to normal (AUC of 0.81).
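The two ablations above (removing the termini versus removing the central 20th-80th percentile of positions) amount to simple sequence slicing; a minimal sketch, with the function names being our own:

```python
def strip_termini(seq, frac=0.2):
    """Remove the first and last `frac` of residues (termini-ablation test)."""
    k = int(len(seq) * frac)
    return seq[k:len(seq) - k]

def strip_core(seq, lo=0.2, hi=0.8):
    """Remove the central residues between relative positions `lo` and `hi`,
    keeping only the termini (core-ablation test)."""
    i, j = int(len(seq) * lo), int(len(seq) * hi)
    return seq[:i] + seq[j:]

s = "M" + "A" * 98 + "W"       # toy 100-residue sequence
no_termini = strip_termini(s)  # 60 central residues, termini removed
no_core = strip_core(s)        # 40 terminal residues, core removed
```

Each truncated set of sequences is then simply re-submitted to the trained predictor and the AUC recomputed.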

Fig 2. Plot showing the mean (solid red) and median (light red) attention values, compared with the mean (solid blue) and median (light blue) prediction values, as a function of the position in the sequence.


On average, SKADE assigns higher values to positions close to the N- and C-termini.

To further ensure that this behavior is due to the signal carried by the N- and C-terminal regions and not an artifact of SKADE’s architecture, we repeated the same experiments using the DeepSol S1 webserver to predict the 2001 proteins in the test set after removing i) the initial and final 20% of residues from each sequence, and ii) the central portion of each protein (from the 20th to the 80th percentile). Similarly to the results obtained with SKADE, DeepSol S1 produces almost random predictions when the N- and C-termini are removed from the input sequences (AUC = 0.53), and virtually normal predictions when the central part of the protein is removed (AUC = 0.81). This shows that DeepSol S1, which is based on a completely different architecture from SKADE, also implicitly exploits the information contained in the extremities of the sequences to perform its prediction.

We also searched the literature to see whether the relevance of the N- and C-termini was already established, and although we did not find exhaustive studies analysing this behavior, a number of papers investigated specific cases. For example, it has been shown that mutating the N-terminus of mitochondrial aminoacyl-tRNA synthetases [29], sperm whale myoglobin [30] and hen egg-white lysozyme [31] enhances their expression and solubility.

Unsupervised prediction of solubility change upon mutation

One of the possible applications of protein solubility predictors is the in-silico optimization of protein sequences to enhance their solubility, for example by selecting the smallest possible set of variants able to increase the solubility of the sequence while minimizing the structural and functional alterations to the original protein.

To do so, it is necessary that prediction methods are able to distinguish between variants that increase or decrease the overall solubility of the sequence. Unfortunately, very little experimental data is available in this respect, and in particular, to the best of our knowledge, no data concerning the effect of multiple mutations or insertions/deletions on the solubility of proteins is available.

To evaluate the ability of SKADE to identify mutations increasing or decreasing protein solubility, we used the CamSol [14] dataset, which contains 56 experimentally validated mutations. S7 Fig shows the distribution of the variants in CamSol on the corresponding protein sequences.

Since SKADE is designed to take the whole target protein sequence as input, to evaluate the solubility change upon mutation we predicted both the wildtype and the mutated target sequence, computing the difference in predicted solubility between the wildtype (WTs) and the mutant (MUTs) as ΔS = MUTs − WTs.
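A minimal sketch of this wildtype-versus-mutant protocol follows; the scoring function is a trivial stand-in used only to make the sketch runnable (SKADE itself is not reimplemented here):

```python
def delta_s(predict, wt_seq, pos, new_aa):
    """ΔS = MUTs - WTs for a single substitution at `pos` (0-based).
    `predict` is any callable mapping a sequence to a solubility score."""
    mut_seq = wt_seq[:pos] + new_aa + wt_seq[pos + 1:]
    return predict(mut_seq) - predict(wt_seq)

# Stand-in predictor (fraction of lysines), purely for illustration
toy_predict = lambda s: s.count("K") / len(s)

ds = delta_s(toy_predict, "MAGKLV", 1, "K")  # A -> K at position 1
```

A positive ΔS means the mutation is predicted to increase solubility, a negative ΔS to decrease it.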

In Table 3 we used the CamSol dataset to compare the performance of SKADE with methods designed for the task of predicting the effect of mutations on the protein solubility. To do so we reproduced the benchmark performed in SODA [9], where 4 existing predictors have been tested. SKADE correctly identifies 69% of the mutations increasing solubility and 100% of the mutations decreasing it, with an AUC of 82 and an AUPRC of 99%.

Table 3. Table showing the comparison of the unsupervised predictions of SKADE on the CamSol dataset of mutations.

Results are reported from [9].

Methods Accuracy Total
SODA 100.0 56/56
CamSol 96 54/56
SKADE 71 40/56
SolPro 71 40/56
PROSO II 57 32/56

An important consideration is that while methods such as SODA (see Table 3) have been specifically designed and trained to discriminate mutations increasing or decreasing the solubility, SKADE faces this task in a completely unsupervised way, since it has been trained to predict the solubility of entire protein sequences.

In-silico mutational screening and analysis of the synergistic effects of mutations

SKADE is a NN and can thus be easily run in parallel on GPUs, computing predictions for hundreds or thousands of sequences per second. This can be used to perform in-silico mutational screenings of proteins, as shown in Fig 3 for the UPF0235 protein MTH_637 (Uniprot ID: O26734), which is present in the test dataset [22]. The protein is annotated as soluble in the dataset, and SKADE predicts it correctly (score of 0.733). We implemented each possible mutation in O26734 and we computed the change in solubility ΔS = MUTs − WTs, meaning that positive ΔS values (shown in red) indicate that the mutation increases the solubility of the protein, while negative ΔS values (in blue) decrease it. Fig 3 shows that, since the predicted solubility of O26734 is already very high, most of the mutations have the effect of decreasing the predicted solubility. In particular, there are a few regions with very high effect, such as the residues from 1 to 25, from 49 to 51 and close to the C-terminus. Among the possible mutations, it appears that most mutations of a wildtype residue into a Met are predicted to heavily decrease the solubility. We listed the mutations with the largest ΔS in S1 Text.

Fig 3. In-silico mutational screening of the UPF0235 protein MTH_637 (O26734), showing the effect of every possible mutation on the solubility of the protein.


The change in solubility is computed as ΔS = MUTs − WTs, meaning that positive ΔS values (shown in red) indicate that the mutation increases the solubility of the protein, while negative ΔS values (in blue) decrease it.

An aspect that is not usually considered when performing classical in-silico mutational screenings is the difference between the effects of two mutations mi and mj when they are considered independently or in combination (i.e. both implemented at the same time), thus investigating the possible synergistic effects of mutations. This screening of pairs of mutations is usually not feasible due to the extremely high number of possibilities that need to be tested, but SKADE is fast enough to also allow the exhaustive analysis of the effects of combinations of variants. As an example, we ran this experiment on the protein O26734, which is 103 residues long, and we tested the solubility changes ΔSi,j due to all the possible pairs of variants mi and mj, for a total of (103 × 20 × 102 × 20)/2 = 2101200 pairs. We then computed the change in solubility due to two independent mutations mi and mj from Fig 3 as ΔSsingle = ΔSi + ΔSj, and the change in solubility obtained when both mi and mj are implemented in the sequence as ΔSpair = ΔSi,j = MUTi,j − WTs. We then computed the synergistic versus individual effects as ΔE = ΔSsingle − ΔSpair for each pair of variants mi and mj.
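The quantities defined above can be written compactly; the sketch below uses a deliberately additive stand-in predictor, for which the synergy ΔE is exactly zero, to check the bookkeeping (any real predictor such as SKADE would generally yield non-zero ΔE):

```python
def synergy(predict, wt, i, aa_i, j, aa_j):
    """ΔE = ΔSsingle - ΔSpair for mutations at 0-based positions i and j,
    following the definitions in the text; `predict` is any
    sequence-to-score callable."""
    def mutate(seq, pos, aa):
        return seq[:pos] + aa + seq[pos + 1:]
    wt_s = predict(wt)
    ds_i = predict(mutate(wt, i, aa_i)) - wt_s          # ΔS_i
    ds_j = predict(mutate(wt, j, aa_j)) - wt_s          # ΔS_j
    ds_pair = predict(mutate(mutate(wt, i, aa_i), j, aa_j)) - wt_s  # ΔS_i,j
    return (ds_i + ds_j) - ds_pair                      # ΔE

# With an additive stand-in predictor the synergy is exactly zero:
additive = lambda s: sum(1.0 for c in s if c == "K") / len(s)
e = synergy(additive, "MAGALV", 1, "K", 3, "K")
```

A negative ΔE indicates that implementing the pair together changes the solubility more than the sum of the two independent effects.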

In Fig 4 we averaged ΔE for each position of O26734, indicating for which pairs of sequence positions the synergistic effects are stronger than the effects of two independent variants, averaged over all the possible mutations that could be implemented at those positions. Fig 4 shows that SKADE predicts that most positions in O26734 are more strongly affected by synergistic effects between mutations in distant parts of the sequence than by two single mutations acting independently. This might indicate that the most effective way to optimize protein solubility could involve the study of the synergistic effects of mutations, instead of implementing variants whose effect on solubility has been assessed in isolation.

Fig 4. Heatmap showing the average single versus synergistic effects ΔE = ΔSsingle − ΔSpair on each position of the O26734 protein.


Negative (blue) values indicate that synergistic effects are stronger, while positive (red) values indicate that the effect of independent mutations is higher.

In S2 Fig we show the mean predicted synergistic effects, averaged over the possible pairs of amino acid mutations. Residues such as A, N, P, W, Y and I are on average less prone to show synergistic effects, while C, D, Q, M and V are, on average, involved in stronger synergistic effects. Interestingly, among these residues, Cs appear to experience little influence from mutations of Qs, and Hs are generally not influenced by mutations of Ps and Rs.

In Fig 5 we analysed the mean synergistic effects on the protein O26734 as a function of the sequence separation |i − j| between pairs of mutated residues at positions i, j. We see that residues which are very close (1 ≤ |i − j| ≤ 10) or very distant (90 ≤ |i − j| ≤ 100) tend to experience the highest synergistic effects (blue lines). In order to find an explanation for this behavior, we compared the magnitude of the synergistic effects with the distribution of the actual 3D distances between residues extracted from the 1JRM PDB structure. As shown in Fig 5, we noticed that a certain correlation exists between the magnitude of the synergistic effects and the mean contact distance between residues (red lines). The Pearson correlation between the mean predicted synergy and the mean distance in Angstrom between the residues’ C-β atoms is r = 0.29 (p-value = 0.003) and the Spearman correlation is r = 0.36 (p-value = 0.0002). This shows that the attention-based architecture on which SKADE is built is indeed able to catch a glimpse of more complex structural aspects of proteins, such as the distribution of contacts. S8 Fig shows a scatter-plot version of the same data.
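The comparison boils down to Pearson and Spearman correlations between two per-separation profiles (mean predicted synergy and mean Cβ distance at each |i − j|). The sketch below uses synthetic stand-in profiles, not the paper's data, purely to illustrate the computation:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient via the correlation matrix."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

rng = np.random.default_rng(0)
sep = np.arange(1, 101)                                  # separations |i - j|
mean_dist = 10 * np.log1p(sep) + rng.normal(0, 2, 100)   # toy mean C-beta distances
mean_syn = 0.01 * mean_dist + rng.normal(0, 0.05, 100)   # toy synergy, weakly coupled

r_p = pearson(mean_syn, mean_dist)
r_s = spearman(mean_syn, mean_dist)
```

With the real profiles, the same two calls yield the r = 0.29 and r = 0.36 values reported above (p-values would come from e.g. scipy.stats).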

Fig 5. Plot showing the correlation between the average spatial distance between residues at a certain sequence separation |i − j| (red) and the magnitude of the synergistic effects between tandem mutations at the same positions i, j (blue).


Solid colors indicate the mean, while light colors indicate the median values.

Discussion

In this study we propose a novel Neural Network (NN) architecture for the prediction of the solubility of protein sequences, and we exploit its attention-based interpretability to investigate the molecular forces driving protein solubility. This model, called SKADE, is based on a neural attention mechanism inspired by machine translation tasks and takes as input only the target protein sequence. SKADE’s performance compares favourably with state of the art solubility predictors, and the neural attention architecture offers some opportunities for the interpretation of the predictions, as a first step towards opening the Machine Learning (ML) black box.

We indeed analyzed the attention profiles learned by the model during training to investigate whether they showed a significant correlation with biophysical properties of proteins that may relate to the solubility of the chain. Interestingly, we did not find any correlation with trivial biophysical characteristics of amino acids, such as those described in biophysical propensity scales, but we showed that the solubility profiles extracted from the model can be used as an unsupervised predictor of aggregation-prone regions in proteins. This suggests that the attention-like mechanism in SKADE indeed learns non-trivial emergent biophysical characteristics of the protein sequences and uses them as building blocks to compute the final solubility prediction.

From the analysis of the attention profiles it also appears that the portions of the protein that carry the strongest signal for predicting protein solubility are the ones closest to the N- and C-termini, more specifically the first and last 20% of the residues of the sequence.
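As a hypothetical illustration of how such a terminal-region signal can be quantified, the fraction of attention mass falling in the first and last 20% of a sequence could be computed as below; the helper name and the assumption of non-negative per-residue attention weights are ours.

```python
import numpy as np

def terminal_attention_fraction(attention, frac=0.2):
    """Fraction of total attention mass in the first and last `frac`
    of the sequence positions (attention: non-negative per-residue weights)."""
    a = np.asarray(attention, dtype=float)
    k = max(1, int(round(frac * len(a))))
    return (a[:k].sum() + a[-k:].sum()) / a.sum()
```

For a uniform profile this returns roughly 2 × frac; values well above that would indicate that the model concentrates its attention on the termini.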

Protein solubility predictors are generally used to determine which minimal set of mutations could increase the overall solubility of a protein, thus facilitating its expression and production. In our analysis we also show that, although SKADE has not been trained for this task, the model can distinguish between mutations that increase or decrease protein solubility, and can thus be used to perform in-silico mutational screening with the goal of optimizing the solubility of proteins by selecting an optimal set of mutations.
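Such a single-mutation in-silico scan can be sketched as follows; `predict_solubility` is a placeholder for any trained sequence-to-score model such as SKADE, and the toy predictor used below is purely illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_mutation_scan(sequence, predict_solubility):
    """Score every possible single-point mutation of `sequence`.

    `predict_solubility` stands in for a trained model: any callable
    mapping a sequence string to a solubility score. Returns a list of
    (position, wild-type, mutant, delta) tuples, sorted so that the most
    solubility-enhancing mutations come first.
    """
    wt_score = predict_solubility(sequence)
    results = []
    for pos, wt in enumerate(sequence):
        for mut in AMINO_ACIDS:
            if mut == wt:
                continue
            mutant = sequence[:pos] + mut + sequence[pos + 1:]
            results.append((pos, wt, mut, predict_solubility(mutant) - wt_score))
    return sorted(results, key=lambda r: r[3], reverse=True)

# Toy predictor (illustration only): fraction of charged/polar residues.
toy = lambda s: sum(c in "DEKRNQST" for c in s) / len(s)
top = single_mutation_scan("MKVLA", toy)
```

A sequence of length L yields L × 19 candidate mutants, so the scan stays cheap for single mutations; it is the move to pairs and triplets that requires the GPU parallelism discussed below.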

When solubility predictors are used to screen mutations to increase protein solubility, usually only single mutations are analysed, because the number of possible pairs (or triplets, quadruplets) of mutations grows exponentially. One of the advantages of SKADE is that its NN implementation can be heavily parallelized on GPUs, so that an unprecedented number of predictions can be computed in a very short time. SKADE can thus perform the in-silico mutational screening of pairs or triplets of mutations in minutes, including the possible synergistic effects of multiple mutations in the analysis. As an example, we predicted all the possible pairs of mutations on the fairly short protein O26734, and we compared the solubility changes due to pairs of independent mutations with those due to pairs of tandem mutations. From this analysis it appears that many regions of O26734 are predicted to be affected by synergistic interactions between mutations, and we thus hypothesize that modelling the synergistic effects of mutations may provide an optimal route towards the optimization of proteins with respect to specific biophysical desiderata.
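The synergy between a pair of tandem mutations can be quantified, for instance, as the deviation of the double-mutant effect from additivity. A minimal sketch, with `predict_solubility` again standing in for a trained model such as SKADE:

```python
def pairwise_synergy(sequence, predict_solubility, i, mi, j, mj):
    """Synergistic effect of a tandem mutation (position i -> mi, j -> mj),
    defined as the solubility change of the double mutant minus the sum of
    the two single-mutant changes. Zero means the effects are additive."""
    def mutate(seq, pos, aa):
        return seq[:pos] + aa + seq[pos + 1:]

    wt = predict_solubility(sequence)
    d_i = predict_solubility(mutate(sequence, i, mi)) - wt
    d_j = predict_solubility(mutate(sequence, j, mj)) - wt
    d_ij = predict_solubility(mutate(mutate(sequence, i, mi), j, mj)) - wt
    return d_ij - (d_i + d_j)
```

Under a strictly additive scoring function this quantity is exactly zero for every pair, so any non-zero value reflects an interaction between the two positions that the model has learned.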

Finally, while analysing the distribution of the magnitude of the synergistic effects with respect to the sequence separation, we noticed that the synergies predicted by SKADE from the O26734 protein sequence correlate significantly with the average contact distance between residues, extracted from the corresponding PDB structure. SKADE is thus able to catch a glimpse of a complex emergent aspect of protein sequences, such as their contact density, from the sequence alone.

Supporting information

S1 Text. Supplementary information.

PDF file containing additional analyses.

(PDF)

S2 Text. Supplementary information.

CSV file containing the amino acid embedding values.

(CSV)

S1 Fig. Plot showing the distributions of the a_i × p_i values over the protein sequence.

We grouped these values by their relative position in their respective sequence. Although the distributions are quite similar, the variance is generally higher at the beginning and at the end of the sequences, indicating that the NN might find a stronger signal in these regions.

(EPS)

S2 Fig. Plot showing the average synergistic effect of pairs of mutations, averaged over the type of mutated amino acid.

(EPS)

S3 Fig. Plot showing the attention and prediction profiles of protein Q8TC59.

(EPS)

S4 Fig. Plot showing the attention and prediction profiles of protein Q9HBE1.

(EPS)

S5 Fig. Plot showing the attention and prediction profiles of protein P25984.

(EPS)

S6 Fig. Plot showing the first two principal components of a PCA computed over the 20-dimensional embeddings learned by SKADE.

(EPS)

S7 Fig. Plot showing the distributions of the mutations on the sequences in the CAMSOL dataset.

(EPS)

S8 Fig. Plot showing the correlation between the mean spatial distance (in Angstroms) and the average synergistic effects of pairs of residues at the same sequence separation in the O26734 protein.

(EPS)

Acknowledgments

DR is grateful to A. L. Mascagni for the constructive discussion and to A. Reynolds for the inspiration.

Data Availability

All relevant data are publicly available from the respective papers. The code described in this study is available at: https://bitbucket.org/eddiewrc/skade/src.

Funding Statement

DR is funded by a FWO post-doctoral fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Ciryam P, Tartaglia GG, Morimoto RI, Dobson CM, Vendruscolo M. Widespread aggregation and neurodegenerative diseases are associated with supersaturated proteins. Cell Reports. 2013;5(3):781–790. doi:10.1016/j.celrep.2013.09.043
  • 2. Lee CC, Perchiacca JM, Tessier PM. Toward aggregation-resistant antibodies by design. Trends in Biotechnology. 2013;31(11):612–620. doi:10.1016/j.tibtech.2013.07.002
  • 3. Perchiacca JM, Tessier PM. Engineering aggregation-resistant antibodies. Annual Review of Chemical and Biomolecular Engineering. 2012;3:263–286. doi:10.1146/annurev-chembioeng-062011-081052
  • 4. Balch WE, Morimoto RI, Dillin A, Kelly JW. Adapting proteostasis for disease intervention. Science. 2008;319(5865):916–919. doi:10.1126/science.1141448
  • 5. Kundra R, Ciryam P, Morimoto RI, Dobson CM, Vendruscolo M. Protein homeostasis of a metastable subproteome associated with Alzheimer's disease. Proceedings of the National Academy of Sciences. 2017;114(28):E5703–E5711. doi:10.1073/pnas.1618417114
  • 6. Manning MC, Chou DK, Murphy BM, Payne RW, Katayama DS. Stability of protein pharmaceuticals: an update. Pharmaceutical Research. 2010;27(4):544–575. doi:10.1007/s11095-009-0045-6
  • 7. Bye JW, Platts L, Falconer RJ. Biopharmaceutical liquid formulation: a review of the science of protein stability and solubility in aqueous environments. Biotechnology Letters. 2014;36(5):869–875. doi:10.1007/s10529-013-1445-6
  • 8. Chiti F, Dobson CM. Protein misfolding, amyloid formation, and human disease: a summary of progress over the last decade. Annual Review of Biochemistry. 2017;86:27–68. doi:10.1146/annurev-biochem-061516-045115
  • 9. Paladin L, Piovesan D, Tosatto SC. SODA: prediction of protein solubility from disorder and aggregation propensity. Nucleic Acids Research. 2017;45(W1):W236–W240. doi:10.1093/nar/gkx412
  • 10. Khurana S, Rawi R, Kunji K, Chuang GY, Bensmail H, Mall R. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34(15):2605–2613. doi:10.1093/bioinformatics/bty166
  • 11. Rawi R, Mall R, Kunji K, Shen CH, Kwong PD, Chuang GY. PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics. 2017;34(7):1092–1098. doi:10.1093/bioinformatics/btx662
  • 12. Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D. PROSO II–a new method for protein solubility prediction. The FEBS Journal. 2012;279(12):2192–2200. doi:10.1111/j.1742-4658.2012.08603.x
  • 13. Agostini F, Cirillo D, Livi CM, Delli Ponti R, Tartaglia GG. ccSOL omics: a webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli. Bioinformatics. 2014;30(20):2975–2977. doi:10.1093/bioinformatics/btu420
  • 14. Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. Journal of Molecular Biology. 2015;427(2):478–490. doi:10.1016/j.jmb.2014.09.026
  • 15. Magnan CN, Randall A, Baldi P. SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics. 2009;25(17):2200–2207. doi:10.1093/bioinformatics/btp386
  • 16. Buchan DW, Minneci F, Nugent TC, Bryson K, Jones DT. Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Research. 2013;41(W1):W349–W357. doi:10.1093/nar/gkt381
  • 17. Raimondi D, Orlando G, Pancsa R, Khan T, Vranken WF. Exploring the sequence-based prediction of folding initiation sites in proteins. Scientific Reports. 2017;7(1):8826. doi:10.1038/s41598-017-08366-3
  • 18. Raimondi D, Orlando G, Moreau Y, Vranken WF. Ultra-fast global homology detection with Discrete Cosine Transform and Dynamic Time Warping. Bioinformatics. 2018;1:8.
  • 19. Rush AM, Chopra S, Weston J. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. 2015.
  • 20. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; 2017. p. 5998–6008.
  • 21. Berman HM, Westbrook JD, Gabanyi MJ, Tao W, Shah R, Kouranov A, et al. The Protein Structure Initiative structural genomics knowledgebase. Nucleic Acids Research. 2008;37(suppl_1):D365–D368. doi:10.1093/nar/gkn790
  • 22. Chang CCH, Song J, Tey BT, Ramanan RN. Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Briefings in Bioinformatics. 2013;15(6):953–962. doi:10.1093/bib/bbt057
  • 23. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.
  • 24. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch. 2017.
  • 25. Orlando G, Raimondi D, Vranken W. Observation selection bias in contact prediction and its implications for structural bioinformatics. Scientific Reports. 2016;6:36679. doi:10.1038/srep36679
  • 26. Raimondi D, Orlando G, Vranken WF, Moreau Y. Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis. Scientific Reports. 2019;9(1):1–11. doi:10.1038/s41598-019-53324-w
  • 27. Tsolis AC, Papandreou NC, Iconomidou VA, Hamodrakas SJ. A consensus method for the prediction of 'aggregation-prone' peptides in globular proteins. PLoS ONE. 2013;8(1):e54175. doi:10.1371/journal.pone.0054175
  • 28. Walsh I, Seno F, Tosatto SC, Trovato A. PASTA 2.0: an improved server for protein aggregation prediction. Nucleic Acids Research. 2014;42(W1):W301–W307. doi:10.1093/nar/gku399
  • 29. Gaudry A, Lorber B, Neuenfeldt A, Sauter C, Florentz C, Sissler M. Re-designed N-terminus enhances expression, solubility and crystallizability of mitochondrial protein. Protein Engineering, Design & Selection. 2012;25(9):473–481. doi:10.1093/protein/gzs046
  • 30. Ribeiro EA, Ramos CH. Circular permutation and deletion studies of myoglobin indicate that the correct position of its N-terminus is required for native stability and solubility but not for native-like heme binding and folding. Biochemistry. 2005;44(12):4699–4709. doi:10.1021/bi047908c
  • 31. Mine S, Ueda T, Hashimoto Y, Imoto T. Improvement of the refolding yield and solubility of hen egg-white lysozyme by altering the Met residue attached to its N-terminus to Ser. Protein Engineering. 1997;10(11):1333–1338. doi:10.1093/protein/10.11.1333
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007722.r001

Decision Letter 0

Ruth Nussinov, Emil Alexov

7 Jan 2020

Dear Dr Raimondi,

Thank you very much for submitting your manuscript 'Insight into the protein solubility driving forces with neural attention' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Emil Alexov

Guest Editor

PLOS Computational Biology

Ruth Nussinov

Editor-in-Chief

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is an interesting manuscript in the burgeoning field of protein solubility prediction. It is based on machine learning models, and produces results in line with other available methods. The authors describe their use of ML methods as novel in the field and this may be the case, I am afraid that I am not expert in ML methods. What I regard as novel are a couple of observations made in the manuscript, and that is where I would like to see more work carried out to clarify the results.

First, a statement is made to the effect that the predictions / model are not based on the commonly used biophysical properties (of amino acids). But, I do not see how this can be asserted if amino acids themselves are central to the prediction scheme - as I see it, sequence is the input to the training algorithm.

Second, and of the most interest to me, is the observation that N-t and C-t seem to possess (Fig. 2) more or less all the predictive power. I would not be surprised to find a tendency towards greater predictive power, but a flat line (Fig. 2) over the central 70% of the sequence I find amazing. Is it possible that the method is learning sequences (i.e. homology) and that this is best expressed at NT and CT when sequences of variable lengths are mapped to percentiles?

My own speculation is probably wrong, but I feel that the authors need more work here. e.g. What happens if the standard bioinformatics methods for reducing to non-redundant training set is used?

In the data reported for SKADE analysis of the CAMSOL mutation set, are all the mutations that affect solubility in the NT and CT regions - as might be expected given the results of Fig. 2?

Similarly, the synergistic mutations in later figures, are they also in NT and CT regions?

Reviewer #2: The paper presents a novel computational approach, SKADE, to predicting protein and peptide solubility from amino acid sequence alone. The model behind the method is a Recurrent Neural Network with soft self-attention and dot-product score function, built using Gated Recurrent Units. It is implemented in Python using the Pytorch framework and supplies a pre-trained network for immediate applications.

SKADE is a great example of a neural network design without explicit feature engineering. The resulting model is comparable in performance or outperforming (depending on metrics) state of the art methods such as PaRSnIP and DeepSol that rely on knowledge-based and engineered features. The distributions of weights and scores along protein sequences suggest that N- and C-termini are most important determinants of protein and peptide solubility.

In addition, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, making it a useful and fast method for scanning mutagenesis, not limited to single mutations. Testing the effects of protein truncation on solubility is also possible with SKADE.

Major:

1. Although the model provides per-residue attention values and scores, interpretability of the methods remains limited. The only conclusion is that N- and C- termini are more important, but that is about it. Besides, these are the output layers of the network, what is inside is still a “black box” in terms of interpretability, I suggest deemphasizing this point in abstract and conclusions. The lack of correlations with biophysical properties contradicts this conclusion.

2. It is quite surprising to see that there is no correlation of attention and solubility profiles to any physico-chemical and/or structural properties of amino acids. Particularly, because these features work quite well in other solubility prediction methods. I think Table 1 in supporting information files does not comprehensively cover the biophysical features, for instance aggregation propensity, SASA. Besides, it is not clear how the correlations were calculated. Could the lack of correlation be related to scalability of attention/profile with sequence length? I.e. values of a short protein could be incomparable to values of a large protein.

3. According to neural network architecture it seems that the Embedding for converting 20 amino acid categories into a 20-dimensional tensor acts simply as a One Hot encoder. If there is any actual embedding that is learnt during training could you visualize the resulting amino acid similarities? Otherwise I suggest renaming embedding to encoding.

4. Python 2.7 is not acceptable by any means. This Python branch reached end of life on January 1 2020 and is no longer supported. Moreover, it is currently being removed from many computer systems for security concerns because it is unsupported. See https://devguide.python.org/#status-of-python-branches . It looks like there should not be any reason not to update the code to Python 3.

5. Please format code according to PEP8 https://www.python.org/dev/peps/pep-0008/ Automatic conversion with autopep8 --aggressive could be used, for instance. Note that there is a lot of imported but unused modules, variables, and other issues that a lint program could fix, please also run pylint to get rid of the software issues.

Minor:

1. There is an imbalance between soluble and insoluble classes, since you have not used weighted sampling. To what extent do you think this imbalance affected the results? I suspect this could be the reason behind performance gap with respect to positive/negative class prediction separating SKADE and PaRSnIP.

2. The details of model convergence should be provided in supporting information. I could see that tensorboard was used in the program, so it should be possible to include those plots.

3. Line 98, why i<0 ?

4. Line 298: According to uniprot the protein should be 104, not 103 residues long https://www.uniprot.org/uniprot/O26734

5. The equation in line 300 is not clear, were the pairs of mutations generated including two mutations on the same position? Shouldn’t it be something like 104x19x19x103/2?

6. Current Pytorch v 1.3.1, version was 1.0.1 used in the paper. Model needs to be recalculated

7. Please submit the model to kipoi.org repository, that would make it much easier to run it

8. Peptide N- and C-termini are effectively charged residues in terms of their physico-chemical characteristics affecting solubility, i.e. Ala is hydrophobic, but will have a charge on the N-termini. However, it is not treated/encoded in any way different than any other amino acid. Was the reasoning behind it to an unbiased representation in the model or something else?

9. What is BAC in table 2?

10. Line 65, correlation is reported as AUC, which is not convenient

11. Line 319: “my” > “by”

12. In my opinion it is better to show dependence of synergetic effect on spatial distance directly as a scatter plot to emphasize on the correlation, particularly in proximity

13. In Figure 3 according to heatmap the first WT residue in O26734 should be V (the only white shaded residue), however this is not the case. What gives?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Alexander Goncearenco

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007722.r003

Decision Letter 1

Ruth Nussinov, Emil Alexov

10 Feb 2020

Dear Dr. Raimondi,

We are pleased to inform you that your manuscript 'Insight into the protein solubility driving forces with neural attention' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch within two working days with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Emil Alexov

Guest Editor

PLOS Computational Biology

Ruth Nussinov

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors responded to all my questions.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Alexander Goncearenco

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007722.r004

Acceptance letter

Ruth Nussinov, Emil Alexov

6 Apr 2020

PCOMPBIOL-D-19-02103R1

Insight into the protein solubility driving forces with neural attention

Dear Dr Raimondi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.


    Attachment

    Submitted filename: answersReviewers.pdf
