Abstract
Conotoxins are a kind of neurotoxin which can specifically interact with potassium, sodium type, and calcium channels. They have become potential drug candidates to treat diseases such as chronic pain, epilepsy, and cardiovascular diseases. Thus, correctly identifying the types of ion channel-targeted conotoxins will provide important clue to understand their function and find potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion-targeted conotoxins. Three kinds of new properties of residues were proposed to use in pseudo amino acid composition to formulate conotoxins samples. The support vector machine was utilized as classifier. A feature selection technique based on F-score was used to optimize features. Jackknife cross-validated results showed that the overall accuracy of 94.6% was achieved, which is higher than other published results, demonstrating that the proposed method is superior to published methods. Hence the current method may play a complementary role to other existing methods for recognizing the types of ion-target conotoxins.
1. Introduction
The marine cone snail can secrete venom for predation and defense. A key component of venom is called conotoxin which is a kind of disulfide-rich neurotoxic peptide with 10–30 residues long. The high diversity of their sequences makes it difficult to systemically study them. It has been reported that there are over 100,000 conotoxins existing in approximately 700 species of cone snails [1]. Conotoxins can target G protein-coupled receptors (GPCRs) [2], nicotinic acetylcholine, and neurotensin receptors. Particularly, they interact with ion channels with extremely high specificity and affinity [3]. Thus, they have been regarded as important drug candidates to treat chronic pain, epilepsy, spasticity, and cardiovascular diseases [4, 5].
With more and more conotoxins being discovered, biochemical experiments-based method to investigate the function of conotoxins becomes more and more difficult because of high cost and long period of wet experiment. Using computational method to predict the function of conotoxins provides us with a convenient way to perform systemic analysis of conotoxins. In 2006, Mondal et al. combined support vector machine (SVM) with pseudo amino acid composition (PseAAC) to predict the superfamily of conotoxins [6]. Subsequently, Lin and Li developed a novel method called increment of diversity (ID) to describe dipeptide sequence and used quadratic discriminant (QD) to predict superfamily and family of conotoxins [7]. Zaki et al. used sequence alignment which was also used by Zou et al. [8] combined with amino acid composition to predict superfamily of conotoxins by use of SVM [9]. They further provide a SVM-Freescore method to improve accuracy [10]. Recently, Yin et al. developed a method called dHKNN to predict superfamily of conotoxins and achieved the overall accuracy of 90.3% by using hidden Markov model to select best features [11, 12]. Lisacek et al. used profile Hidden Markov Models (pHMMs) and position-specific scoring matrix (PSSM) to improve accuracy for conotoxin superfamily prediction [13–15].
Although the methods and results mentioned above can give some guide to study conotoxins, they did not provide more information for the prediction of conotoxins' function. A case shows that two conotoxins (delta-conotoxin-like Ac6.1 and omega-conotoxin-like Ai6.2) belong to the same superfamily; however, they can target different ion channels [16]. Thus, it is necessary to develop new bioinformatics tools to identify the function of conotoxins. In 2007, Saha and Raghava proposed a method based on SVM and PSI-BLAST to predict the function of neurotoxins [17]. Soli et al. developed a statistical-based model to predict the activity of scorpion toxins by using motifs and secondary structure information [18]. Recently, Yuan et al. developed a feature selection technique based on binomial distribution to predict the types of ion channel-targeted conotoxins by using radial basis function network [19]. Subsequently, they improved the accuracy by using SVM with optimal dipeptide composition [20]. However, the prediction accuracy can be further improved.
Thus, the present study aimed to develop a new prediction method to improve the prediction quality of conotoxins' types. We incorporated three kinds of new properties of residues into PseAAC for formulating conotoxins samples. Subsequently, we used SVM to perform classification. After feature selection, we found that the accuracy was dramatically improved in jackknife cross-validation. In the following section, we will introduce the process of model construction in detail.
2. Materials and Methods
2.1. Benchmark Dataset
The benchmark dataset extracted from the UniProt [21] was constructed by Lin's group [19, 20]. The dataset is reliable and objective because (i) the conotoxins with ambiguous annotations have been excluded, (ii) the function of all conotoxins in benchmark dataset has been experimentally confirmed, and (iii) high similar sequences (cutoff = 80%) have been pruned by using CD-HIT program. The benchmark dataset contains 112 mature conotoxins peptide sequences including 24 potassium ion channel-targeted conotoxins (K-conotoxins), 43 sodium ion channel-targeted conotoxins (Na-conotoxins), and 45 calcium ion channel-targeted conotoxins (Ca-conotoxins). All calculations and model construction in the following section are based on the data.
2.2. Feature Extraction
A key point in protein prediction is how to extract important information from peptide sequences. In the past studies, the amino acid composition has been widely used in protein prediction. To consider the correlation of residues, the dipeptide composition was used in prediction model. Chou proposed a very popular and elegant descriptor called PseAAC which describes not only the correlation of physicochemical properties of residues but also the amino acid composition [22]. Furthermore, recently some web servers or stand-alone tools have been proposed to generate different modes of PseAAC, such as PseKNC [23], PseKNC-General [24], Pse-in-One [25], repRNA [26], and repDNA [27]. The authors should introduce these tools. In this study, we proposed three kinds of new properties, that is, rigidity, flexibility, and irreplaceability. The flexibility and rigidity of residues correlate with the protein structure and function. The irreplaceability of residues can reflect the evolution of life. The values of three properties for 20 residues [28] have been listed in Table 1. In the following, we will describe how to formulate conotoxins with PseAAC [22].
Table 1.
The values of rigidity, flexibility, and irreplaceability of 20 residues.
Residues | Rigidity | Flexibility | Irreplaceability |
---|---|---|---|
G | −1.097 | −2.746 | 0.56 |
A | −1.338 | −3.102 | 0.52 |
V | −1.641 | −1.339 | 0.54 |
L | −1.741 | 0.424 | 0.58 |
I | −1.741 | 0.424 | 0.65 |
F | 2.877 | −0.466 | 0.86 |
W | 5.913 | −1.000 | 1.82 |
Y | 2.714 | −0.672 | 0.98 |
D | −0.204 | 0.424 | 0.77 |
H | 2.269 | −0.223 | 0.94 |
N | −0.204 | 0.424 | 0.79 |
E | −0.365 | 2.009 | 0.76 |
K | −1.822 | 3.950 | 0.81 |
Q | −0.365 | 2.009 | 0.86 |
M | −1.741 | 2.484 | 1.25 |
R | 1.169 | 3.06 | 0.6 |
S | −1.511 | 0.957 | 0.64 |
T | −1.641 | −1.339 | 0.56 |
C | −1.511 | 0.957 | 1.12 |
P | 1.979 | −2.404 | 0.61 |
Consider a conotoxin P = R 1 R 2 R 3 R 4 ⋯ R L, where R 1, R 2, and R L denote the 1st, 2nd, and Lth residue of the conotoxin sample P; it can be defined by a 400 + 3λ-dimensional vector as shown by
(1) |
where
(2) |
where f u is the normalized frequency of the 400 dipeptides in conotoxin P and can be defined as
(3) |
where n u denotes the number of occurrences of uth dipeptide in conotoxin P.
In (2), ω is weight factor for sequence order effect. τ j is called the j-tier sequence correlation factor computed by the following formula:
(4) |
where H i,i+λ n (n = 1,2, 3 denotes rigidity, flexibility, and irreplaceability) is called the correlation function and can be given by
(5) |
where h n(R i) is the nth kind of the physicochemical values of the amino acid R i. The values should be converted to standard type by
(6) |
where h 0 n(R i) is the original physicochemical values of the ith amino acid.
For the purpose of finding the best feature subset which can produce the maximum accuracy, we performed feature selection by using the algorithm called F-score which can be defined as
(7) |
where and are the average values of the ith feature in whole dataset and the kth dataset; x ij k is the value of the ith feature of the jth conotoxin in the kth dataset; and N k is the numbers of conotoxin in the kth dataset. We noticed that the larger the F(i) value is, the better the predictive capability the ith feature has. We used a python script fselect.py downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ to calculate F-score.
2.3. Support Vector Machine
SVM is a very popular machine learning method which is very suitable for small sample classification [29–31] and regressions [32, 33]. Its basic idea is to map the original samples into a high-dimensional space and search for the best hyperplane in this space which can separate different samples. In this study, the LibSVM soft package was used to implement SVM. The radial basis function (RBF) usually exhibits excellent performance in nonlinear classification [34]. Thus the RBF kernel function was used in the current work. We utilized grid search method to find out the best values of the regularization parameter C and kernel parameter γ via jackknife cross-validation. The search spaces for C and γ are [215, 2−5] and [2−5, 2−15] with steps being 2−1 and 2, respectively.
2.4. The Evaluation of Model Performance
We used jackknife cross-validation to evaluate the performance of proposed method. Three metrics, namely, sensitivity (Sn), overall accuracy (OA), and average accuracy (AA) as defined in [19, 20], were used to quantitatively estimate the accuracy of the model:
(8) |
where N k is the total number of the kth types of conotoxins and m k denotes the number of the kth types of conotoxins which was correctly recognized.
3. Results and Discussion
As we can see from (2), the results of the proposed method depend on two parameters λ and ω, where λ represents the long-range sequence order effect and ω is called weight factor which reflects the weight imposed between the local and global effects. Generally speaking, the greater λ is, the more global sequence order information it contains. However, if λ is too large, it would cause the high-dimensional disaster as mentioned above. Therefore, our searching for the optimal values of the three parameters was carried out in the following regions:
(9) |
From (9), a total of 10 × 10 = 100 individual combinations needed to be considered for finding the optimal parameter combination. This was actually a routine but tedious process to optimize the model via a 2-dimensional grid search. We used the jackknife cross-validation approach to deal with the parameter optimization. The results show that when λ = 6 and ω = 0.2, the accuracy reaches to maximum value. We noticed that the current model contains 418 features which is still so large that the high-dimensional and overfitting problems will appear.
Therefore, we must select the key features from the 418 components. These key features can produce the maximum Acc. The best feature subset will be obtained by investigating all the combinations of features. However, it is time-consuming and even beyond computational capability for most computers to examine all possible combinations. Based on this reason, we used F-score defined in (7) to perform feature selection. At first, all 418 features were ranked according to their F-scores from large to small. Secondly, the SVM was used to classify three samples and calculate the accuracy based on the feature with maximum F-score. Thirdly, a new feature subset was produced by adding the feature with the second highest F value to the former feature subset. We repeated the process until all combinations were investigated and the accuracies were calculated.
We plotted the accuracies with feature dimension in Figure 1 and noticed that the maximum accuracy is 94.6% when 180 best features were used. The detailed results were recorded in Table 1. Other published results were also listed in Table 2. We noticed that Sns of Na- and Ca-conotoxins of our method are 95.3% and 95.6%, respectively, which are higher than those of RBF network-based method [19]. The Sns of K- and Ca-conotoxins of our method are 91.7% and 95.6%, respectively, which are higher than those of iCTX-Type [20]. Thus, in summary, our proposed method is superior to other published methods.
Figure 1.
A plot to show the feature selection results. When the top 180 features were used to perform prediction, the overall success rate reached its peak of 94.6%.
Table 2.
Comparison of the current method with published methods.
4. Conclusion
In this paper, we designed a new method based on three kinds of new properties to predict three kinds of ion channel-targeted conotoxins. By using feature selection technique, prediction accuracy was dramatically improved. Comparison with published methods demonstrated the advantage of our method. The properties of residues used in this paper can also be used in other fields of protein classification. In the future, we will construct a free webserver based on the proposed method for the convenience of the vast majority of experimental scientists.
Acknowledgments
This work was supported by the Applied Basic Research Program of Sichuan Province (LZ-LY-45) and the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122).
Competing Interests
The authors declare that they have no competing financial interests.
References
- 1.Daly N. L., Craik D. J. Structural studies of conotoxins. IUBMB Life. 2009;61(2):144–150. doi: 10.1002/iub.158. [DOI] [PubMed] [Google Scholar]
- 2.Liao Z., Ju Y., Zou Q. Prediction of G protein-coupled receptors with SVM-prot features and random forest. Scientifica. 2016;2016:10. doi: 10.1155/2016/8309253.8309253 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Terlau H., Olivera B. M. Conus venoms: a rich source of novel ion channel-targeted peptides. Physiological Reviews. 2004;84(1):41–68. doi: 10.1152/physrev.00020.2003. [DOI] [PubMed] [Google Scholar]
- 4.Han T. S., Teichert R. W., Olivera B. M., Bulaj G. Conus venoms—a rich source of peptide-based therapeutics. Current Pharmaceutical Design. 2008;14(24):2462–2479. doi: 10.2174/138161208785777469. [DOI] [PubMed] [Google Scholar]
- 5.Watters M. R. Tropical marine neurotoxins: venoms to drugs. Seminars in Neurology. 2005;25(3):278–289. doi: 10.1055/s-2005-917664. [DOI] [PubMed] [Google Scholar]
- 6.Mondal S., Bhavna R., Babu R. M., Ramakumar S. Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. Journal of Theoretical Biology. 2006;243(2):252–260. doi: 10.1016/j.jtbi.2006.06.014. [DOI] [PubMed] [Google Scholar]
- 7.Lin H., Li Q.-Z. Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochemical and Biophysical Research Communications. 2007;354(2):548–551. doi: 10.1016/j.bbrc.2007.01.011. [DOI] [PubMed] [Google Scholar]
- 8.Zou Q., Hu Q., Guo M., Wang G. HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics. 2015;31(15):2475–2481. doi: 10.1093/bioinformatics/btv177. [DOI] [PubMed] [Google Scholar]
- 9.Zaki N., Sibai F., Campbell P. Conotoxin protein classification using pairwise comparison and amino acid composition. Proceedings of the 13th Annual Genetic and Evolutionary Computation Conference (GECCO '11); July 2011; Dublin, Ireland. ACM; pp. 323–330. [DOI] [Google Scholar]
- 10.Zaki N., Wolfsheimer S., Nuel G., Khuri S. Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinformatics. 2011;12, article 217 doi: 10.1186/1471-2105-12-217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Fan Y.-X., Song J., Kong X., Shen H.-B. PredCSf: an integrated feature-based approach for predicting conotoxin superfamily. Protein and Peptide Letters. 2011;18(3):261–267. doi: 10.2174/092986611794578341. [DOI] [PubMed] [Google Scholar]
- 12.Yin J.-B., Fan Y.-X., Shen H.-B. Conotoxin superfamily prediction using diffusion maps dimensionality reduction and subspace classifier. Current Protein and Peptide Science. 2011;12(6):580–588. doi: 10.2174/138920311796957702. [DOI] [PubMed] [Google Scholar]
- 13.Koua D., Brauer A., Laht S., et al. ConoDictor: a tool for prediction of conopeptide superfamilies. Nucleic Acids Research. 2012;40(1):W238–W241. doi: 10.1093/nar/gks337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Koua D., Laht S., Kaplinski L., et al. Position-specific scoring matrix and hidden Markov model complement each other for the prediction of conopeptide superfamilies. Biochimica et Biophysica Acta (BBA)—Proteins and Proteomics. 2013;1834(4):717–724. doi: 10.1016/j.bbapap.2012.12.015. [DOI] [PubMed] [Google Scholar]
- 15.Laht S., Koua D., Kaplinski L., Lisacek F., Stöcklin R., Remm M. Identification and classification of conopeptides using profile Hidden Markov Models. Biochimica et Biophysica Acta (BBA)—Proteins and Proteomics. 2012;1824(3):488–492. doi: 10.1016/j.bbapap.2011.12.004. [DOI] [PubMed] [Google Scholar]
- 16.Gowd K. H., Dewan K. K., Iengar P., Krishnan K. S., Balaram P. Probing peptide libraries from Conus achatinus using mass spectrometry and cDNA sequencing: identification of δ and ω-conotoxins. Journal of Mass Spectrometry. 2008;43(6):791–805. doi: 10.1002/jms.1377. [DOI] [PubMed] [Google Scholar]
- 17.Saha S., Raghava G. P. S. Prediction of neurotoxins based on their function and source. In Silico Biology. 2007;7(4-5):369–387. [PubMed] [Google Scholar]
- 18.Soli R., Kaabi B., Barhoumi M., El-Ayeb M., Srairi-Abid N. Bioinformatic characterizations and prediction of K+ and Na+ ion channels effector toxins. BMC Pharmacology. 2009;9, article 4 doi: 10.1186/1471-2210-9-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Yuan L.-F., Ding C., Guo S.-H., Ding H., Chen W., Lin H. Prediction of the types of ion channel-targeted conotoxins based on radial basis function network. Toxicology in Vitro. 2013;27(2):852–856. doi: 10.1016/j.tiv.2012.12.024. [DOI] [PubMed] [Google Scholar]
- 20.Ding H., Deng E.-Z., Yuan L.-F., et al. ICTX-type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Research International. 2014;2014:10. doi: 10.1155/2014/286419.286419 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Magrane M., UniProt Consortium UniProt Knowledgebase: a hub of integrated protein data. Database. 2011;2011 doi: 10.1093/database/bar009.bar009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Chou K.-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function and Genetics. 2001;43(3):246–255. doi: 10.1002/prot.1035. [DOI] [PubMed] [Google Scholar]
- 23.Chen W., Lei T.-Y., Jin D.-C., Lin H., Chou K.-C. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 2014;456(1):53–60. doi: 10.1016/j.ab.2014.04.001. [DOI] [PubMed] [Google Scholar]
- 24.Chen W., Zhang X., Brooker J., Lin H., Zhang L., Chou K.-C. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31(1):119–120. doi: 10.1093/bioinformatics/btu602. [DOI] [PubMed] [Google Scholar]
- 25.Liu B., Liu F., Wang X., Chen J., Fang L., Chou K. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 2015;43(W1):W65–W71. doi: 10.1093/nar/gkv458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Liu B., Liu F., Fang L., Wang X., Chou K.-C. repRNA: a web server for generating various feature vectors of RNA sequences. Molecular Genetics and Genomics. 2016;291(1):473–481. doi: 10.1007/s00438-015-1078-7. [DOI] [PubMed] [Google Scholar]
- 27.Liu B., Liu F., Fang L., Wang X., Chou K.-C. RepDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015;31(8):1307–1309. doi: 10.1093/bioinformatics/btu820. [DOI] [PubMed] [Google Scholar]
- 28.Tang H., Chen W., Lin H. Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique. Molecular BioSystems. 2016;12(4):1269–1275. doi: 10.1039/c5mb00883b. [DOI] [PubMed] [Google Scholar]
- 29.Zhu P.-P., Li W.-C., Zhong Z.-J., et al. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. Molecular BioSystems. 2015;11(2):558–563. doi: 10.1039/c4mb00645c. [DOI] [PubMed] [Google Scholar]
- 30.Wang R., Xu Y., Liu B. Recombination spot identification Based on gapped k-mers. Scientific Reports. 2016;6 doi: 10.1038/srep23934.23934 [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
- 31.Li D., Ju Y., Zou Q. Protein Folds Prediction with Hierarchical Structured SVM. Current Proteomics. 2016;13(2):79–85. doi: 10.2174/157016461302160514000940. [DOI] [Google Scholar]
- 32.Cao R., Wang Z., Cheng J. Designing and evaluating the MULTICOM protein local and global model quality prediction methods in the CASP10 experiment. BMC Structural Biology. 2014;14(1, article 13) doi: 10.1186/1472-6807-14-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Cao R., Wang Z., Wang Y., Cheng J. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines. BMC Bioinformatics. 2014;15(1, article 120) doi: 10.1186/1471-2105-15-120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Chen J., Wang X., Liu B. IMiRNA-SSF: improving the identification of MicroRNA precursors by combining negative sets with different distributions. Scientific Reports. 2016;6, article 19062 doi: 10.1038/srep19062. [DOI] [PMC free article] [PubMed] [Google Scholar]