Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Ayşegül Özen; Mehmet Gönen; Ethem Alpaydın; Türkan Haliloğlu

doi:10.1186/1472-6807-9-66

. 2009 Oct 19;9:66. doi: 10.1186/1472-6807-9-66

Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Ayşegül Özen ^1,^#, Mehmet Gönen ^2,^#, Ethem Alpaydın ^2,^✉, Türkan Haliloğlu ^1,^✉

PMCID: PMC2777163 PMID: 19840377

Abstract

Background

Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the output of stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high.

Results

We investigate three different approaches: early, intermediate and late integration, which respectively combine features, kernels over feature subsets, and decisions. We perform simulations on two data sets: (1) S1615 is used in previous studies, (2) S2783 is the updated version (as of July 2, 2009) extracted also from ProTherm. For S1615 data set, our highest accuracy using both sequence and structure information is 0.842 on cross-validation and 0.904 on testing using early integration. Newly added features, namely, local compositional packing and the mobility extent of the mutated residues, improve accuracy significantly with intermediate integration. For S2783 data set, we also train regression methods to estimate not only the sign but also the amount of stability change and apply risk-based classification to reject when the learner has low confidence and the loss of misclassification is high. The highest accuracy is 0.835 on cross-validation and 0.832 on testing using only sequence information. The percentage of false positives can be decreased to less than 0.005 by rejecting 10 per cent using late integration.

Conclusion

We find that in both early and late integration, combining inputs or decisions is useful in increasing accuracy. Intermediate integration allows assessing the contributions of individual features by looking at the assigned weights. Overall accuracy of regression is not better than that of classification but it has less false positives, especially when combined with the reject option. The server for stability prediction for three integration approaches and the data sets are available at http://www.prc.boun.edu.tr/appserv/prc/mlsta.

Background

In protein design and analysis, understanding the stability in sequence, structure, and function paradigms is of importance [1] and hence there is a need for predicting the protein stability change due to mutation. Single amino acid mutations can significantly change the stability of a protein structure [2]. To acquire a set of experimental annotations for every possible random mutation is combinatorial and requires significant resources and time. Thus, accurate computational prediction would be of use for suggesting the destructive mutations as well as the most favorable and stable novel protein sequences. To this end, the prediction of protein stability change due to amino acid substitutions remains a challenging task in the field of molecular biology.

Recent approaches fall into two major types: energy-based methods and machine learning approaches. Energy-based methods using physical, statistical, or empirical forcefields perform a direct computation of the magnitude of the relative change in the free energy [3-8]. Average assignment method [7] and different machine learning algorithms, such as support vector machines [2], neural networks [9], and decision trees [7] are trained on a data set to predict protein stability change. There are also hybrid approaches that combine energy-based and machine learning methods [10-12]; they basically generate the input features fed into machine learning algorithms using energy-based models.

One can predict the direction towards which the mutation shifts the stability of the protein (namely the sign of ΔΔG). It could be positive or negative, corresponding to an increase or decrease in stability, respectively. From a machine learning perspective, this is a binary classification task, where given x, information about the single-site amino acid substitution, the aim is to decide whether this is a positive or negative example, depending on whether the mutation is favorable or not. A third class of "doubt" can be defined for small changes that may be considered insignificant, and in such a case, one can train a three-class classifier [13] or a two-class classifier with the reject option.

Given a sample of n independent and identically distributed training instances, (x₁, y₁),(x₂, y₂), ...,(x_n, y_n), where x_iis the d-dimensional input vector coding the relevant information and y_i∈ {-1, +1} is its class label, i = 1, ..., n, a classifier estimates P(+|x) and assigns the test instance to the positive class if P(+|x) > 0.5, and to the negative class otherwise. There can be different representations in coding x. Deciding on the best data representation used is as important as selecting the classification algorithm.

Another possibility in solving this using machine learning is to define it as a regression problem with ΔΔG directly as the numeric output. One can then decide based on whether the prediction is positive or negative, and again predictions that are close to zero can be rejected if the risk of misclassification is high. No single machine learning algorithm nor representation, in classification or regression, induces always the most accurate learner in any domain. The usual approach is to try many and choose the one that performs the best on a separate validation set unused during training. Recently, it has been shown that accuracy may be improved by combining multiple learners [14,15]. There are three possible methods for combining multiple learners: early, late, and intermediate integration [16].

In early integration, inputs are concatenated as one large vector and a single learner (classifier or regressor) is used. In late integration, multiple classifiers/regressors are trained over different inputs and their decisions are combined by a trained learner. These two approaches can be applied with any classification/regression algorithm.

Late integration has been extensively used in bioinformatics. Weighted voting was used in classifier combination for protein fold recognition [17]. Majority voting was used for prediction of the drug resistance of HIV protease mutants [18], secondary structure prediction [19], detecting rare events in human genomic DNA [20] and identification of new tumor classes using gene expression profiles [21]. A trained combiner was used for secondary structure prediction [22,23]. A mixture of localized experts was used for gene identification [24]. Cascading, which is a multi-stage sequential combination method, was used for secondary structure prediction [25].

Support vector machines allow combination in a third way, using multiple kernels; this is also called intermediate integration [16]. Kernel functions basically measure similarity between data instances and a single learner can combine separate kernels for different data sources, instead of combining data before training a single learner (as in early integration) or combining decisions from multiple learners (as in late integration).

Intermediate integration was used for protein location prediction and protein function prediction tasks, respectively, by combining kernels applied to different representations such as protein sequences, hydropathy profile, protein interactions, and gene expressions [26,27]. This method is also used in glycan classification by combining different tree kernels [28].

Our work has four aspects: (1) Introduction of new protein residue features: The temperature factors of the backbone and side-chain carbon atoms (B-factor) that reflect the thermal mobility/flexibility of the mutated residue; the local packing information in a higher resolution than that has previously been incorporated by considering the side-chain atoms as well; amino acid substitution likelihoods from PAM250 matrix. (2) Implementation of three different machine learning approaches (early, late, and intermediate integration), two of which, namely late and intermediate, have not been used before in the computational prediction of protein stability change. (3) Comparison of classification and regression methods. (4) The use of a reject option in both classification and regression to check for cases where the learner has low confidence.

Data

Data Sets

The first data set (S1615) was compiled from the data available online [29], originally extracted [9] from the ProTherm database [30]. This data set has been used previously and provides a basis for comparison [2,9,31]. The set originally contains 1615 single-site mutation data from 42 different proteins. Each instance has the following features: PDB code of the protein, mutated position and mutation itself, solvent accessibility, pH value, temperature (T), and the change in the free energy, ΔΔG, due to a mutation in a single position. As there are instances for the same mutation and position where ΔΔG differs with T and pH values, T and pH are kept as features in our data set. A subset (388 instances) of the training set (1615 instances) was previously used as a test set for comparison between different predictors [2]. Though some studies include the test set also in the training set, we remove it from the training set to have disjoint training and test sets, as done in [2].

We also extract an up-to-date version (as of July 2, 2009) (S2783) that contains 2783 single-site mutations with known PDB code of the protein and ΔΔG values also from the ProTherm database. On this larger data set, we implement and compare both classification and regression integration methods and also their versions with the reject option.

Added Features

The substitution frequency of an amino acid for another is considered here as an additional feature with the Point Accepted Mutation (PAM) matrix [32]. PAM250 is chosen for the score of each amino acid substitution and is based on the frequency of that substitution in closely related proteins that have experienced a certain amount of evolutionary divergence.

Another feature considered is the mobility/flexibility of the amino acid position in a given structure. The B-factors reported in the PDB file is a good and quick indicator of this feature. Neighbors of the mutated residue in both amino acid sequence and 3D structure are the two other features that have been used recently [2,9]. A window size of seven in the sequence [2] and a cutoff distance of 9Å in space was previously used to find the neighbors of the mutated position as the optimum sequence length and distance, respectively [9]. In our implementation, in addition to alpha-carbon atoms (C_α), beta-carbon (C_β) atoms are also considered to reflect the packing at a relatively higher resolution.

A mutation in a position of a protein sequence will change the number of side-chain atoms of the residue in that position. This may trigger a conformational change or local readjustments that may result also in a change in the atomic packing around that residue and the fluctuations of the surrounding residues and the mutated residue itself. Nevertheless, as in other studies [2,9,31], we neglect this effect.

Removing the instances with non-available features and the redundant instances from S1615 leaves us with training and test sets of 1122 and 383 instances with total of 31 and 14 proteins. Stabilizing mutations are 32.35 per cent and 11.49 per cent, respectively. After removing the instances with non-available features, S2783 reduces to 2471 instances from 68 different proteins and 755 of them (30.55 per cent) are stabilizing mutations. Both data sets are available online.

Table 1 gives a list of the representations, original features, and the new features that we introduce. The information coming only from the sequence (SO), and the topology of the protein structure (TO), and both (ST) are encoded in the same way as defined in previous studies [2]. An added asterisk, for example, (SO*), denotes the representation with newly added features. Neighbors of the mutated position in the sequence, mutation, T, and pH are encoded in SO/SO*. Sequence information is not used in TO/TO*; instead, spatial neighbors and the solvent accessibility of the mutated position are encoded. In ST/ST*, all information are combined. The substitution likelihood of an amino acid is added to the existing data as a new feature in all three representations. Crystallographic B-factors of the C_αand C_βatoms are used in TO* and ST*. For discrete features like amino acid identities, 1-of-n encoding is used, that is, if the variable can take one of n different values, one is set to 1 and all others to 0.

Table 1.

Representations, original features, and the new features.

Repr.	Original Feat.	New Repr.	New Feat.
SO	± 3 neighbors (± 3 NE) Mutation (MUT) T/pH	SO*	PAM250 (PAM)

TO	Mutation (MUT) C_αcontacts (CA) SA/T/pH	TO*	PAM250 (PAM) C_αB-factor (BFA) C_αB-factor (BFB) C_αand C_βcontacts (CB)

ST	± 3 neighbors (± 3 NE) Mutation (MUT) C_αcontacts (CA) SA/T/pH	ST*	PAM250 (PAM) C_αB-factor (BFA) C_αB-factor (BFB) C_αand C_βcontacts (CB)

#	Representation	PAM	CB	BFA	BFB
1	SO	-	-	-	-
2	SO	+	-	-	-

3	TO	-	-	-	-
4	TO	+	-	-	-
5	TO	-	+	-	-
6	TO	-	-	+	-
7	TO	-	-	-	+
8	TO	+	+	-	-
9	TO	+	-	+	-
10	TO	+	-	-	+
11	TO	-	+	+	-
12	TO	-	+	-	+
13	TO	-	-	+	+
14	TO	+	+	+	-
15	TO	+	+	-	+
16	TO	+	-	+	+
17	TO	-	+	+	+
18	TO	+	+	+	+

19	ST	-	-	-	-
20	ST	+	-	-	-
21	ST	-	+	-	-
22	ST	-	-	+	-
23	ST	-	-	-	+
24	ST	+	+	-	-
25	ST	+	-	+	-
26	ST	+	-	-	+
27	ST	-	+	+	-
28	ST	-	+	-	+
29	ST	-	-	+	+
30	ST	+	+	+	-
31	ST	+	+	-	+
32	ST	+	-	+	+
33	ST	-	+	+	+
34	ST	+	+	+	+

1:	Initialize the subset Z as empty set
2:	Initialize the subset R as all possible 102 (R.D.B) groups
3:	Remove the most accurate (R.D.B) from R and add to Z
4:	Perform McNemar's test for all pairs between Z and R
5:	Decrease the degree of confidence, α, for McNemar's test
6:	if There is at least one diverse (R.D.B) in R then
7:	Select the most accurate and most diverse (R.D.B) from R and add it to Z
8:	Go to Step 4
9:	else
10:	Use the (R.D.B) triplets in Z as the current base-learners to be combined
11:	end if

	k-NN		DT		SVM
	cv	test	cv	test	cv	test
SO	0.814	0.778	0.752	0.703	0.838	0.904
SO*	0.812	0.781	0.766	0.702	0.839	0.904

TO	0.812	0.819	0.770	0.739	0.822	0.905
TO*	0.817	0.844	0.788	0.756	0.825	0.909

ST	0.814	0.777	0.771	0.734	0.838	0.904
ST*	0.817	0.775	0.800	0.729	0.842	0.904

	SVM
	SO	TO*	ST*
Precision	0.711	0.800	0.702
Recall	0.284	0.282	0.284
FP rate	0.015	0.009	0.016

	cv	test
Accuracy	0.847	0.903
Precision	0.819	0.694
Recall	0.677	0.284
FP rate	0.071	0.017

	(2)	(3)	(4)
(1) ST.PAMCB.SVM	11.72	66.61	154.10
(2) TO.BFA.SVM		42.12	135.64
(3) ST.CBBFB.DT			41.32
(4) TO.CBBFABFB.k-NN

SO	(0.19)1NE + (0.15)2NE + (0.31)3NE + (0.30)MUT + (0.03)T + (0.03)pH
SO*	(0.19)1NE + (0.15)2NE + (0.31)3NE + (0.30)MUT + (0.03)T + (0.03)pH + (0.00)PAM

TO	(0.36)MUT + (0.40)CA + (0.15)SA + (0.04)T + (0.04)pH

TO*	(0.21)MUT + (0.21)CA + (0.10)SA + (0.01)T + (0.01)pH + (0.00)PAM + (0.30)CB + (0.09)BFA + (0.07)BFB

ST	(0.04)1NE + (0.04)2NE + (0.09)3NE + (0.38)MUT + (0.25)CA + (0.11)SA + (0.04)T + (0.04)pH

ST*	(0.03)1NE + (0.02)2NE + (0.06)3NE + (0.20)MUT + (0.18)CA + (0.09)SA + (0.01)T + (0.01)pH + (0.00)PAM + (0.26)CB + (0.08)BFA + (0.06)BFB

	early	late	intermediate
	(ST.PAMCB.SVM)	(ST.PAMCB.SVM) + (TO.BFA.SVM) + (ST.CBBFB.DT) + (TO.CBBFABFB.k-NN)	(TO.PAMCBBFABFB.SVM)
cv	0.842 ± 0.047	0.847 ± 0.046	0.826 ± 0.044
test	0.904 ± 0.004	0.903 ± 0.005	0.879 ± 0.006

	k-NN		DT		SVM		SVR
	cv	test	cv	test	cv	test	cv	test
SO	0.795	0.794	0.748	0.762	0.829	0.832	0.825	0.828
SO*	0.793	0.794	0.751	0.756	0.829	0.829	0.824	0.827

TO	0.804	0.803	0.762	0.769	0.821	0.824	0.813	0.818
TO*	0.806	0.799	0.770	0.780	0.826	0.829	0.818	0.824

ST	0.797	0.797	0.758	0.766	0.829	0.831	0.825	0.828
ST*	0.798	0.797	0.766	0.782	0.829	0.830	0.825	0.828

SO	(0.19)1NE + (0.20)2NE + (0.23)3NE + (0.27)MUT + (0.09)T + (0.03)pH
SO*	(0.19)1NE + (0.20)2NE + (0.22)3NE + (0.27)MUT + (0.09)T + (0.03)pH + (0.00)PAM

TO	(0.19)MUT + (0.56)CA + (0.17)SA + (0.05)T + (0.02)pH

TO*	(0.21)MUT + (0.23)CA + (0.12)SA + (0.06)T + (0.02)pH + (0.00)PAM + (0.23)CB + (0.07)BFA + (0.06)BFB

ST	(0.04)1NE + (0.03)2NE + (0.04)3NE + (0.21)MUT + (0.45)CA + (0.15)SA + (0.06)T + (0.02)pH

ST*	(0.02)1NE + (0.02)2NE + (0.03)3NE + (0.21)MUT + (0.21)CA + (0.11)SA + (0.06)T + (0.02)pH + (0.00)PAM + (0.19)CB + (0.06)BFA + (0.06)BFB

Accuracy
Error Rate

Precision

Recall

FP Rate

		Decision
		+	-	r

Truth	+	0	λ	1

	-	αλ	0	1

	Accuracy	Precision	Recall	FP Rate
SO	0.872	0.381	0.176	0.038
SO*	0.872	0.381	0.176	0.038

TO	0.833	0.343	0.485	0.122
TO*	0.879	0.459	0.258	0.040

ST	0.818	0.311	0.470	0.137
ST*	0.878	0.448	0.252	0.041

	SVM			SVR
	SO	TO*	ST*	SO	TO*	ST*
Precision	0.790	0.807	0.784	0.854	0.868	0.855
Recall	0.612	0.579	0.614	0.527	0.501	0.529
FP rate	0.072	0.061	0.075	0.040	0.034	0.040

SO	(0.15)1NE+ (0.25)2NE+ (0.22)3NE+ (0.31)MUT + (0.04)T + (0.02)pH
SO*	(0.16)1NE+ (0.26)2NE+ (0.22)3NE+ (0.29)MUT + (0.05)T + (0.01)pH + (0.01)PAM

TO	(0.25)MUT + (0.72)CA + (0.02)SA + (0.01)T + (0.00)pH

TO*	(0.28)MUT + (0.10)CA + (0.05)SA + (0.08)T + (0.03)pH + (0.01)PAM + (0.43)CB + (0.01)BFA + (0.01)BFB

ST	(0.02)1NE+ (0.02)2NE+ (0.02)3NE+ (0.26)MUT + (0.57)CA + (0.04)SA + (0.07)T + (0.03)pH

ST*	(0.01)1NE+ (0.01)2NE+ (0.01)3NE+ (0.30)MUT + (0.10)CA + (0.06)SA + (0.07)T + (0.01)pH + (0.00)PAM + (0.43)CB + (0.01)BFA + (0.01)BFB

		cv					test
λ	α	Acc.	Prec.	Recall	FP Rate	Reject	Acc.	Prec.	Recall	FP Rate	Reject
2	1	0.829	0.793	0.602	0.071	0.000	0.831	0.788	0.615	0.073	0.000
2	2	0.834	0.813	0.582	0.059	0.024	0.839	0.816	0.596	0.058	0.025
2	5	0.840	0.839	0.544	0.043	0.059	0.845	0.844	0.560	0.042	0.060
5	1	0.842	0.815	0.599	0.058	0.064	0.847	0.821	0.615	0.057	0.066
5	2	0.848	0.839	0.569	0.044	0.092	0.852	0.844	0.587	0.044	0.094
5	5	0.854	0.871	0.531	0.029	0.122	0.857	0.874	0.545	0.030	0.127
10	1	0.884	0.839	0.735	0.058	0.298	0.884	0.844	0.743	0.058	0.303
10	2	0.891	0.863	0.712	0.043	0.322	0.892	0.870	0.717	0.042	0.329
10	5	0.897	0.863	0.621	0.028	0.364	0.894	0.885	0.620	0.031	0.371

Ref.	Method	Data Set Size	Accuracy	Information
[41]	SVM	2048	0.77 (20-fold cv)	Seq

[42]	SVM	1383 *	0.73 (20-fold cv)	Seq

[9]	NN NN+FOLDX	1615	0.79 (20-fold cv) 0.87 (test set†) 0.93 (test set†)	Seq+Str

[2]	SVM	1496‡	SO: 0.84, TO: 0.85, ST: 0.85 (20-fold cv) SO: 0.86, TO: 0.86, ST: 0.86 (test set)	Seq+Str

[31]	iPTREE	1615	0.87 (10-fold cv)	Seq+Str

Ours	Early Late Intermediate	1122 (training) 383 (test)	0.842 (20-fold cv), 0.904 (test set) 0.847 (20-fold cv), 0.903 (test set) 0.826 (20-fold cv), 0.879 (test set)	Seq+Str

PERMALINK

Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Ayşegül Özen

Mehmet Gönen

Ethem Alpaydın

Türkan Haliloğlu

Abstract

Background

Results

Conclusion

Background

Data

Data Sets

Added Features

Table 1.

Methods

The Effect of Adding New Features to the Original Data Sets

Table 2.

Performance Assessment

Figure 1.

Table 3.

Table 4.

Early Integration

k-Nearest Neighbor (k-NN) Classifier

Decision Tree (DT)

Support Vector Machine (SVM)

Support Vector Regression (SVR)

Late Integration

Table 5.

Intermediate Integration

Results

S1615 Data Set

Early Integration

Figure 2.

Table 6.

Table 7.

Late Integration

Table 8.

Table 9.

Intermediate Integration

Table 10.

Table 11.

Overall Comparison of Integration Methods

Table 12.

S2783 Data Set

Early Integration

Table 13.

Table 14.

Late Integration

Table 15.

Intermediate Integration

Table 16.

Table 17.

Table 18.

Classification with Reject Option

Table 19.

Table 20.

Table 21.

Table 22.

Table 23.

Table 24.

Figure 3.

Figure 4.

Discussion

Sufficiency of the Data Sets

Figure 5.

Integration Approaches

Prediction Using Only the Amino Acid Sequence

Classification with Reject Option

Comparison with Other Studies

Table 25.

Conclusion

Authors' contributions

Acknowledgments

Acknowledgements

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES