Abstract
Identification of protein binding sites is critical for studying the function of the proteins. In this paper, we proposed a method for protein binding site prediction, which combined the order profile propensities and hidden Markov support vector machine (HM-SVM). This method employed the sequential labeling technique to the field of protein binding site prediction. The input features of HM-SVM include the profile-based propensities, the Position-Specific Score Matrix (PSSM), and Accessible Surface Area (ASA). When tested on different data sets, the proposed method showed promising results, and outperformed some closely relative methods by more than 10% in terms of AUC.
1. Introduction
Prediction of protein binding sites provides valuable information for studying the function of proteins. The most efficient approaches are the computational methods. By using these approaches, the functionally important amino acid residues can be identified [1].
These computational methods used different features extracted from protein sequences, PSSM, or structure information. Hydrophobic and polar residues tend to occur in protein binding regions [2, 3]. The conservation scores of amino acid are often used as features, because the protein binding sites are more conserved than other surface residues [4]. Some kinds of conservation scores were proposed; a comprehensive evaluation of these scores was reported in [5]. One of the most widely used features is the Accessible Surface Area (ASA) [4], because the binding sites show higher ASA values than those of the other surface residues [6].
Some machine learning methods treated protein binding site prediction as a binary classification task, and some well-known machine learning techniques have been applied to this field, such as support vector machine [7, 8], neural network [1], Bayesian network [9], and hidden Markov model [10]. A comparison of these methods has been performed by Zhou and Qin [11].
In our previous study [12], we introduced a novel profile-level propensity for protein binding site prediction. Experimental results showed that this propensity can significantly improve the performance of the SVM based methods. Recently, we applied hidden Markov support vector machine (HM-SVM) to this field [13], which takes protein binding site prediction as a sequence-labeling task. The advantage of this method is that it is able to incorporate the sequence-order effects into the predictor. However, this method only uses two basic features (PSSM and ASA features) as input for protein binding site prediction. Therefore, it is interesting to explore whether the order profile propensity can improve the performance of HM-SVM based method or not. In this study, we proposed a computational method for protein binding site prediction by combining the hidden Markov support vector machine and the order profile propensity. When tested on six different data sets, the HM-SVM predictor using order profile propensity as an extra feature consistently outperformed the predictor only using two basic features (PSSM and ASA features); in particular, in terms of AUC, the performance is improved by more than 10 percent, indicating that combining the order profile propensity and the HM-SVM is a suitable approach to improve the accuracy of protein binding site prediction.
2. Methods
2.1. Dataset Description
The datasets used in this study have been described in the study [13]. 1124 protein chains were selected from the Protein Data Bank (PDB) [14]. The chains were divided into six types of datasets according to homology of interacting chains and the definition of the interface. The information of the six datasets is shown in Table 1, and the process of dataset preparation is shown in the left part of Figure 1. The six datasets can be downloaded from http://bioinformatics.hitsz.edu.cn/HMSVM-OP.
Table 1.
Data set | Chains | Res. | Surface res. | Interface res. |
---|---|---|---|---|
Heterocomplex Ia | 504 | 109829 | 92797 | 26085 |
Homocomplex I | 620 | 172917 | 141295 | 38170 |
Mixb I | 1124 | 282746 | 234092 | 64255 |
Heterocomplex IIc | 504 | 109829 | 92797 | 32386 |
Homocomplex II | 620 | 172917 | 141295 | 45633 |
Mix II | 1124 | 282746 | 234092 | 78019 |
aType I data set with minor interface as negative samples.
bThe mixed data set of heterocomplexes and homocomplexes.
cType II data set with minor interface as positive samples.
2.2. Feature Description
2.2.1. Order Profile Propensity
The detailed information of how to calculate the order profile propensity was introduced in study [12]. Here we only briefly introduce this process. The order profile propensities were profile-based features, which extracted the evolutionary information from frequency profiles. The frequency profiles were calculated from the multiple sequence alignments outputted by running the PSI-BLAST software [13] searching against the nrdb90 database from EBI [15] with parameters of j = 10 and e = 0.001. The frequency profiles were converted into order profiles by combining the amino acids whose frequencies were higher than a given threadhold optimized on the benchmark dataset. Order profile can be viewed as a profile-based building block of proteins, which has been used for many tasks in the field of bioinformatics [12, 16].
The order profile propensity was based on the order profile occurrence differences between protein binding regions and other surface regions. The equations of how to calculate this feature were given by [12, Equations (3)–(5)].
2.2.2. Position-Specific Score Matrix (PSSM)
PSSM was another profile-based feature, which was generated by using PSI-BLAST [13] with the parameters j and e set as 10 and 0.001, respectively.
2.2.3. Accessible Surface Area (ASA)
We employed the DSSP program [17] to calculate the Accessible Surface Area (ASA) features, which were scaled by the nominal maximum area of each residue.
2.3. Hidden Markov Support Vector Machine
Hidden Markov support vector machine proposed by Altun et al. [15] was a sequential labelling model. In our previous study [13], it showed that when using the two basic features (PSSM and ASA features), the HM-SVM based method outperformed other machine learning methods, such as SVM, CRF, and ANN. In this study, we explored new features to improve the performance of HM-SVM based methods. For more information of HM-SVM, please refer to this paper [13].
The flowchart of the proposed computational method for protein binding site prediction was shown in Figure 1, in which the left part shows the process of dataset construction, and the right part shows the prediction process of the model based on HM-SVM.
In this paper, SVMhmm toolkit (V3.10) was employed as the software of HM-SVM model with parameters c and e set as 0.1 and 1, respectively. This parameter combination was optimized with the training data. The input features of HM-SVM include order profile propensity, ASA, and PSSM. These features were extracted from the target residues and its 6 neighbouring residues in each direction.
2.4. Evaluation Methodology
The sensitivity (Sn), specificity (Sp), overall accuracy (Acc), F1 measure (F1), Matthews correlation coefficient MCC, and AUC can be, respectively, expressed as [18–22]
(1) |
where TP represents the true positive, TN represents the true negative, FN represents the false negative, and FP represents the false positive.
3. Results
In order to validate whether the order profile propensities can improve the performance of the HM-SVM based methods or not, two HM-SVM predictors with different features were constructed. The first HM-SVM employed the PSSMs and ASA as input features. This predictor was treated as a baseline predictor. For the second HM-SVM predictor, order profile propensity is added as an extra feature to evaluate whether this feature can improve the performance or not. The performance of the two HM-SVM predictors was evaluated by fivefold cross-validation.
The results of the two HM-SVM predictors on the six datasets are shown in Table 2. It can be seen that the first HM-SVM predictor using the two basic features achieved the lowest performance. The second HM-SVM predictor using the order profile propensity as an extra feature achieved the best performance on all the six data sets, especially its AUC score being about 10% higher than that of the first HM-SVM predictor, indicating that order profile propensity can significantly improve the performance of the HM-SVM based methods. In our previous study [13], we showed that the first HM-SVM predictor outperformed some state-of-the-art methods, such as ANN, CRF, and SVM. The second HM-SVM predictor significantly outperformed the first HM-SVM predictor, indicating that the proposed computational method for protein binding site prediction is a good method in this field.
Table 2.
Dataset | Method | Sp % | Sn % | F1 % | Acc % | MCC | AUC % |
---|---|---|---|---|---|---|---|
Heterocomplex I | HM-SVM 1a | 44.9 | 56.0 | 49.8 | 68.3 | 0.274 | 69.5 |
HM-SVM 2b | 52.4 | 73.5 | 61.2 | 73.8 | 0.436 | 81.4 | |
| |||||||
Homocomplex I | HM-SVM 1 | 45.4 | 60.0 | 51.70 | 69.7 | 0.309 | 72.2 |
HM-SVM 2 | 54.5 | 74.6 | 62.9 | 76.3 | 0.474 | 83.6 | |
| |||||||
Mix I | HM-SVM 1 | 45.5 | 58.0 | 51.0 | 69.4 | 0.297 | 71.2 |
HM-SVM 2 | 53.5 | 74.0 | 62.1 | 75.0 | 0.455 | 82.5 | |
| |||||||
Heterocomplex II | HM-SVM 1 | 54.0 | 56.7 | 55.3 | 68.0 | 0.305 | 70.7 |
HM-SVM 2 | 60.8 | 71.7 | 65.8 | 74.0 | 0.454 | 81.2 | |
| |||||||
Homocomplex II | HM-SVM 1 | 53.3 | 60.1 | 56.5 | 70.1 | 0.340 | 73.4 |
HM-SVM 2 | 61.1 | 73.8 | 66.8 | 76.4 | 0.493 | 83.7 | |
| |||||||
Mix II | HM-SVM 1 | 53.6 | 58.6 | 56.0 | 69.3 | 0.326 | 72.4 |
HM-SVM 2 | 61.0 | 72.7 | 66.3 | 75.2 | 0.474 | 82.4 |
aResults of HM-SVM 1 on the six data sets are obtained from [13]. HM-SVM 1 represents the HM-SVM predictor with the basic feature set using PSSM and ASA features; bHM-SVM 2 represents the HM-SVM predictor with the feature set using PSSM, ASA, and order profile propensity features.
Šikić et al. [23] proposed a method based on random forest, which was evaluated on a heterocomplex data set, and achieved good performance (Sp = 76.45%, Sn = 38.06%, F1 = 50.82%, and Acc = 80.05%). Our method (results of heterocomplex II dataset) outperformed this method by 14.98% in terms of F1, which further confirms the better performance of our method than some state-of-the-art methods.
4. Conclusion
In this study, we proposed a computational method for protein binding site prediction, which combines the order profile propensity and hidden Markov support vector machine. This method predicts the protein binding sites with a sequential labelling approach and uses a recently proposed feature to further improve the performance: order profile propensity, which contains the evolutionary information extracted from the sequence profiles. The main contribution of this study is that we validate the fact that order profile propensity can significantly improve the performance of the HM-SVM based method. The main advantage of the proposed method is that it treats the protein sequence as a whole and is able to use the label information of neighbour residues and the evolutionary information extracted from the frequency profiles. However, the order profile propensity was generated based on the frequency profiles, which require the computational expensive multiple sequences alignment process. It is the main disadvantage of the proposed method.
As noted by Li et al. [24], choosing proper features is a challenging task, especially for sequential labelling method, such as HM-SVM and conditional random field (CRF). In their experiments, the authors found that by simply adding some features into CRF cannot improve the performance of their method. Therefore, the obvious performance improvement when using order profile propensity as an extra feature will benefit our future studies, especially for the research on applying sequential method to this field. As pointed out in a comprehensive review and carried out in a series of recent publications [25–43], finding suitable features is the key step to improve the performance.
Furthermore, since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors [44, 45], we shall make efforts in our future work to provide a web server for the method presented in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 61300112, 61272383), the Natural Science Foundation of Guangdong Province (no. S2012040007390), the Scientific Research Innovation Foundation in Harbin Institute of Technology (Project no. HIT.NSRIF.201310b3), the Shanghai Key Laboratory of Intelligent Information Processing, China (Grant no. IIPL-2012-002), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, Strategic Emerging Industry Development Special Funds of Shenzhen (ZDSY20120613125401420 and JCYJ20120613151940045), and Key Basic Research Foundation of Shenzhen (JC201005260118A, JC201005260175A).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
- 1.Porollo A, Meller J. Prediction-based fingerprints of protein-protein interactions. Proteins: Structure, Function and Genetics. 2007;66(3):630–645. doi: 10.1002/prot.21248. [DOI] [PubMed] [Google Scholar]
- 2.Glaser F, Steinberg DM, Vakser IA, Ben-Tal N. Residue frequencies and pairing preferences at protein-protein interfaces. Proteins. 2001;43(2):89–102. [PubMed] [Google Scholar]
- 3.DeLano WL. Unraveling hot spots in binding interfaces: progress and challenges. Current Opinion in Structural Biology. 2002;12(1):14–20. doi: 10.1016/s0959-440x(02)00283-x. [DOI] [PubMed] [Google Scholar]
- 4.Zhou H, Shan Y. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins: Structure, Function and Genetics. 2001;44(3):336–343. doi: 10.1002/prot.1099. [DOI] [PubMed] [Google Scholar]
- 5.Valdar WSJ. Scoring residue conservation. Proteins: Structure, Function and Genetics. 2002;48(2):227–241. doi: 10.1002/prot.10146. [DOI] [PubMed] [Google Scholar]
- 6.Chen H, Zhou HX. Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins: Structure, Function and Genetics. 2005;61(1):21–35. doi: 10.1002/prot.20514. [DOI] [PubMed] [Google Scholar]
- 7.Koike A, Takagi T. Prediction of protein-protein interaction sites using support vector machines. Protein Engineering, Design and Selection. 2004;17(2):165–173. doi: 10.1093/protein/gzh020. [DOI] [PubMed] [Google Scholar]
- 8.Wang B, Chen P, Huang D, Li J, Lok T, Lyu MR. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Letters. 2006;580(2):380–384. doi: 10.1016/j.febslet.2005.11.081. [DOI] [PubMed] [Google Scholar]
- 9.Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR. Insights into protein-protein Interfaces using a Bayesian network prediction method. Journal of Molecular Biology. 2006;362(2):365–386. doi: 10.1016/j.jmb.2006.07.028. [DOI] [PubMed] [Google Scholar]
- 10.Friedrich T, Pils B, Dandekar T, Schultz J, Müller T. Modelling interaction sites in protein domains with interaction profile hidden Markov models. Bioinformatics. 2006;22(23):2851–2857. doi: 10.1093/bioinformatics/btl486. [DOI] [PubMed] [Google Scholar]
- 11.Zhou H, Qin S. Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics. 2007;23(17):2203–2209. doi: 10.1093/bioinformatics/btm323. [DOI] [PubMed] [Google Scholar]
- 12.Liu B, Wang X, Lin L, Dong Q, Wang X. Exploiting three kinds of interface propensities to identify protein binding sites. Computational Biology and Chemistry. 2009;33(4):303–311. doi: 10.1016/j.compbiolchem.2009.07.001. [DOI] [PubMed] [Google Scholar]
- 13.Liu B, Wang X, Lin L, Tang B, Dong Q. Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics. 2009;10(article 381) doi: 10.1186/1471-2105-10-381. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kouranov A, Xie L, de la Cruz J, et al. The RCSB PDB information portal for structural genomics. Nucleic Acids Research. 2006;34:D302–D305. doi: 10.1093/nar/gkj120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Altun Y, Tsochantaridis I, Hofmann T. Hidden markov support vector machines. Proceedings of the 20th International Conference on Machine Learning (ICML '03); August 2003; pp. 3–10. [Google Scholar]
- 16.Liu B, Lin L, Wang X. Protein remote homology detection using order profiles. Proceedings of the International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS '09); August 2009; Shanghai, China. pp. 255–260. [Google Scholar]
- 17.Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637. doi: 10.1002/bip.360221211. [DOI] [PubMed] [Google Scholar]
- 18.Liu B, Wang X, Chen Q, Dong Q, Lan X. Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS ONE. 2012;7(9) doi: 10.1371/journal.pone.0046633.e46633 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Liu B, Wang X, Lin L, Dong Q. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics. 2008;9, article 510 doi: 10.1186/1471-2105-9-510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Liu B, Wang X, Zou Q, Dong Q, Chen Q. Protein remote homology detection by combining Chou's pseudo amino acid composition and profile-based protein representation. Molecular Informatics. 2013;32(9-10):775–782. doi: 10.1002/minf.201300084. [DOI] [PubMed] [Google Scholar]
- 21.Liu B, Xu J, Zou Q, et al. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics. 2014;15(supplement 2):p. S3. doi: 10.1186/1471-2105-15-S2-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Liu B. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014;30(4):472–479. doi: 10.1093/bioinformatics/btt709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Šikić M, Tomić S, Vlahoviček K. Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Computational Biology. 2009;5(1) doi: 10.1371/journal.pcbi.1000278.e1000278 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Li MH, Lin L, Wang X, Liu T. Protein-protein interaction site prediction based on conditional random fields. Bioinformatics. 2007;23(5):597–604. doi: 10.1093/bioinformatics/btl660. [DOI] [PubMed] [Google Scholar]
- 25.Liu B. QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions. BMC Genomics. 2013;14(supplement 8):p. S3. doi: 10.1186/1471-2164-14-S8-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Chen W, Feng P, Lin H. Prediction of ketoacyl synthase family using reduced amino acid alphabets. Journal of Industrial Microbiology and Biotechnology. 2012;39(4):579–584. doi: 10.1007/s10295-011-1047-z. [DOI] [PubMed] [Google Scholar]
- 27.Lin H, Chen W, Ding H. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS ONE. 2013;8(10) doi: 10.1371/journal.pone.0075726.e75726 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Chen W, Lin H, Feng PM. DNA physical parameters modulate nucleosome positioning in the Saccharomyces cerevisiae genome. Current Bioinformatics. 2014;9(2):188–193. [Google Scholar]
- 29.Feng P-M, Lin H, Chen W. Identification of antioxidants from sequence information using Naïve Bayes. Computational and Mathematical Methods in Medicine. 2013;2013:5 pages. doi: 10.1155/2013/567529.567529 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Ding C, Yuan L, Guo S, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. Journal of Proteomics. 2012;77:321–328. doi: 10.1016/j.jprot.2012.09.006. [DOI] [PubMed] [Google Scholar]
- 31.Chen W, Lin H. Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Computers in Biology and Medicine. 2012;42(4):504–507. doi: 10.1016/j.compbiomed.2012.01.003. [DOI] [PubMed] [Google Scholar]
- 32.Feng PM, Chen W, Lin H, Chou K. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry. 2013;442(1):118–125. doi: 10.1016/j.ab.2013.05.024. [DOI] [PubMed] [Google Scholar]
- 33.Chen W, Lin H, Feng P, Ding C, Zuo Y, Chou K. iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE. 2012;7(10) doi: 10.1371/journal.pone.0047843.e47843 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Guo SH, Deng EZ, Xu LQ, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014;30(11):1522–1529. doi: 10.1093/bioinformatics/btu083. [DOI] [PubMed] [Google Scholar]
- 35.Chen W, Feng P, Lin H, Chou K. IRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013;41(6):p. e68. doi: 10.1093/nar/gks1450. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Feng P, Ding H, Chen W, Lin H. Naïve bayes classifier with feature selection to identify phage virion proteins. Computational and Mathematical Methods in Medicine. 2013;2013:6 pages. doi: 10.1155/2013/530696.530696 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Ding H, Guo S, Deng E, et al. Prediction of Golgi-resident protein types by using feature selection technique. Chemometrics and Intelligent Laboratory Systems. 2013;124:9–13. [Google Scholar]
- 38.Chen W, Lin H. Prediction of midbody, centrosome and kinetochore proteins based on gene ontology information. Biochemical and Biophysical Research Communications. 2010;401(3):382–384. doi: 10.1016/j.bbrc.2010.09.061. [DOI] [PubMed] [Google Scholar]
- 39.Lin H, Ding H, Guo F, Huang J. Prediction of subcellular location of mycobacterial protein using feature selection techniques. Molecular Diversity. 2010;14(4):667–671. doi: 10.1007/s11030-009-9205-1. [DOI] [PubMed] [Google Scholar]
- 40.Lin H, Chen W. Prediction of thermophilic proteins using feature selection technique. Journal of Microbiological Methods. 2011;84(1):67–70. doi: 10.1016/j.mimet.2010.10.013. [DOI] [PubMed] [Google Scholar]
- 41.Chen W, Lei TY, Jin DC, Lin H, Chou KC. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 2014;456:53–60. doi: 10.1016/j.ab.2014.04.001. [DOI] [PubMed] [Google Scholar]
- 42.Lin H, Chen W, Yuan L, Li Z, Ding H. Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheoretica. 2013;61(2):259–268. doi: 10.1007/s10441-013-9181-9. [DOI] [PubMed] [Google Scholar]
- 43.Zou Q, Chen W, Huang Y, Liu X, Jiang Y. Identifying multi-functional enzyme by hierarchical multi-label classifier. Journal of Computational and Theoretical Nanoscience. 2013;10(4):1038–1043. [Google Scholar]
- 44.Zou Q, Li X, Jiang Y, Zhao Y, Wang G. Binmempredict: a web server and software for predicting membrane protein types. Current Proteomics. 2013;10(1):2–9. [Google Scholar]
- 45.Lin C, Zou Y, Qin J, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS ONE. 2013;8(2) doi: 10.1371/journal.pone.0056499.e56499 [DOI] [PMC free article] [PubMed] [Google Scholar]