Skip to main content
Computational and Mathematical Methods in Medicine logoLink to Computational and Mathematical Methods in Medicine
. 2013 May 15;2013:530696. doi: 10.1155/2013/530696

Naïve Bayes Classifier with Feature Selection to Identify Phage Virion Proteins

Peng-Mian Feng 1, Hui Ding 2, Wei Chen 3,*, Hao Lin 2,*
PMCID: PMC3671239  PMID: 23762187

Abstract

Knowledge about the protein composition of phage virions is a key step to understand the functions of phage virion proteins. However, the experimental method to identify virion proteins is time consuming and expensive. Thus, it is highly desirable to develop novel computational methods for phage virion protein identification. In this study, a Naïve Bayes based method was proposed to predict phage virion proteins using amino acid composition and dipeptide composition. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 79.15% for phage virion and nonvirion proteins classification, which are superior to that of other state-of-the-art classifiers. These results indicate that the proposed method could be as an effective and promising high-throughput method in phage proteomics research.

1. Introduction

Phage is a virus that infects and replicates within bacteria. Phages are widely distributed in locations populated by bacterial hosts, such as soil or the intestines of animals. A complete infectious phage viral particle (also, namely, phage virion) consists of an inner core of nucleic acid which gives the virus infectivity and a protein coat (called a capsid) which encases the nucleic acid and provides specificity, that is, determines which organisms the virus can infect.

The nucleic acid of phage virions is either RNA or DNA. Proteins of phage virions include structural proteins and nonstructural proteins. Structural proteins commonly termed “phage virion proteins” are essential materials of the infectious viral particles, including shell proteins, envelope proteins, and virus particle enzymes. Nonstructural proteins (namely, phage nonvirion proteins) refer to that encoded by the viral genome and play important roles in biological process of viral genome replication and expression, but they do not bind to phage virions. Due to the distinct functions between phage virion proteins and phage nonvirion proteins, knowledge about the protein composition of phage virions is an essential step to further understand the functions of phage virions.

Although the use of mass spectrometry (MS) for the identification of phage virion proteins has become popular [1], it has not kept pace with the explosive growth of protein sequences generated in the postgenomic age. Hence, it is highly desired to develop automated methods for timely and reliably classifying the protein composition of phage virions.

To the best of our knowledge, there is no computational system for the classification of phage virion proteins. In the current study, we propose a Naïve Bayes based computational model for predicting phage virion proteins using amino acid compositions and dipeptide compositions. The correlation-based feature subset selection algorithm [2] was introduced to find the optimal feature set. By using the optimized features, the proposed model was evaluated in a benchmark dataset in the jackknife test. The performance demonstrates that this model could be a potentially useful tool for the annotation of the phage proteins.

According to some recent comprehensive reviews [3, 4] and demonstrated by a series of recent publications [510], to establish a really useful statistical predictor, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web server for the predictor that is accessible to the public. In the following, let us describe how to deal with these steps one by one.

2. Materials and Methods

2.1. Dataset

The raw datasets adopted in this research were extracted from the UniProt [11]. For the purpose of obtaining a reliable benchmark dataset, the following steps were considered. Firstly, only the experimentally confirmed phage virion and phage nonvirion protein sequences were included. Secondly, the sequences which are fragments of other proteins were dislodged. Thirdly, sequences containing nonstandard letters, that is, “B,” “X,” or “Z,” were excluded as their meanings are ambiguous. After following the previous strict screening procedures, we obtained 121 phage virion protein sequences and 231 phage nonvirion protein sequences.

To prepare a high quality dataset, the CD-HIT program [12] was used to prune the data. By setting the cutoff of sequence identity to 40%, 307 sequences were remained in the final benchmark dataset, including 99 phage virion protein sequences and 208 phage nonvirion protein sequences.

2.2. Feature Vector

One of the most important parts for identifying protein attributes is to generate a set of proper informative parameters to encode the protein sequences. To avoid completely losing the sequence-order information, the pseudo amino acid composition (PseAAC) was proposed [13, 14] to replace the simple amino acid composition (AAC) for representing the sample of a protein. Since the concept of PseAAC was proposed in 2001 [13], it has been widely used to study various attributes of proteins, such as identifying bacterial virulent proteins [15], predicting supersecondary structure [16], predicting protein subcellular location [1619], predicting membrane protein types [20], discriminating outer membrane proteins [21], identifying antibacterial peptides [22], identifying allergenic proteins [23], predicting metalloproteinase family [24], predicting protein structural class [25], identifying GPCRs and their types [26], identifying protein quaternary structural attributes [27], predicting protein submitochondria locations [28], identifying risk type of human papillomaviruses [29], identifying cyclin proteins [30], predicting GABA(A) receptor proteins [31], and classifying amino acids [32], among many others (see a long list of papers cited in the References section of [3]). Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [7, 9], as well as other biological samples (see, e.g., [33, 34]). Because it has been widely and increasingly used, recently two powerful softwares, called “PseAAC-Builder” [35] and “propy” [36], were established for generating various special Chou's pseudoamino acid compositions.

The amino acid composition and dipeptide composition are the general forms of PseAAC and the simplest parameters, which also have been widely applied in the realm of protein prediction [3740]. Hence, every protein sequence in the benchmark dataset was encoded in a discrete vector as

F=[f1,f2,,f420]T, (1)

where f i is the normalized occurrence frequencies of the 20 amino acids (i = 1, 2,…, 20) and the 400 dipeptides (i = 21, 22, …, 420) in the protein sequence, respectively. T is the transposing operator.

2.3. Feature Selection

Inclusion of redundant and noisy features in the model building process would cause poor predictive performance and increased computation. Feature selection is the process of removing irrelevant features and is extremely useful in reducing the dimensionality of the data and improving the predictive accuracy. To reduce the dimension of the feature space and improve the precision of phage virion and nonvirion protein classification, the filter method Correlation-based Feature Selection [2] combined with Best-first search strategy was used in the process of feature selection in the current work.

The process starts with an empty set of features and generates all possible single feature expansions. The subset with the highest accuracy is chosen and expanded in the same way by adding single features. If the accuracy does not maximize with the expansion of a subset, the search drops back to the next best unexpanded subset and continues from there until all features are added. The subset with the highest accuracy will be selected as the final optimized feature set [41].

2.4. Naïve Bayes

Naïve Bayes is an effective statistical classification algorithm [42] and has been successfully used in the realm of bioinformatics [4346]. The basic theory of Naïve Bayes is similar to that of Covariance Determinant (CD) [4752]. But for Naïve Bayes, it assumes the attribute variables to be independent from each other given the outcome. This assumption greatly simplifies the calculation of conditional probabilities and also overcomes the divergent problem when using the CD prediction engine to deal with those systems in which the components of constituent feature vectors are normalized.

In the Naïve Bayes framework, a classification problem can be seen as the problem of finding the outcome with maximum probability given a set of observed variables. Given a phage viral protein example, described by its feature vector F = (f 1, f 2,…, f n), we are looking for a class C that maximizes the likelihood P(F | C) = P(f 1, f 2,…, f n | C). Since the current work is intend to classify phage virion and nonvirion proteins, a binary class C ∈ {0,1} was generated, where 1 denotes that the sample was predicted as a phage virion protein and 0 denotes phage nonvirion protein. For the binary classification, the class for the protein sample could be determined by comparing two posteriors as

P(C=1F=f1,f2,,fn)P(C=0F=f1,f2,,fn)=P(C=1)i=1nPi(fiC=1)P(C=0)i=1nPi(fiC=0). (2)

Taking the logarithm of (2), we obtain

logP(C=1F=f1,f2,,fn)P(C=0F=f1,f2,,fn)=logP(C=1)P(C=0)+i=1nlogPi(fiC=1)Pi(fiC=0). (3)

Hence the sample will be predicted as 1 (phage virion protein) if

logP(C=1F=f1,f2,,fn)P(C=0F=f1,f2,,fn)θ (4)

and 0 (phage nonvirion protein) for otherwise. θ is the threshold determining the trade-off between sensitivity and specificity and can be trained on the training dataset to maximize the prediction performance.

2.5. Performance Evaluation

The performance of the proposed model was evaluated using sensitivity (Sn), specificity (Sp), and accuracy (Acc), which are expressed as

Sn=TPTP+FN,Sp=TNTN+FP,Acc=TP+TNTP+FN+TN+FP. (5)

TP, TN, FP, and FN represent the number of the correctly recognized phage virion proteins, the number of the correctly recognized phage nonvirion proteins, the number of phage nonvirion proteins recognized as phage virion proteins, and the number of phage virion proteins recognized as phage nonvirion proteins, respectively.

As the performance of the current classifier depends on the threshold θ as given in (4), the threshold independent parameter, receiver operating characteristic curve, was employed as well. Therefore, the quality of a classifier can be objectively evaluated by measuring the area under the receiver operating characteristic curve (auROC). The value of auROC score ranges from 0 to 1, with a score of 0.5 corresponding to a random guess and a score of 1.0 indicating a perfect separation.

3. Results and Discussion

Three cross-validation methods, namely, subsampling test, independent dataset test, and jackknife test, are often employed to evaluate the predictive capability of a predictor. Among the three methods, the jackknife test is deemed the most objective and rigorous one that can always yield a unique outcome as demonstrated by a penetrating analysis in a recent comprehensive review [53] and hence has been widely and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., [7, 19, 21, 30, 5456]). Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated without including the one being identified.

3.1. Prediction of Phage Virion Proteins

We trained the Naïve Bayes classifier using Waikato Environment for Knowledge Analysis (WEKA) [57] on the benchmark dataset. As shown in Table 1, an auROC score of 0.758 and an accuracy of 75.57% with an average sensitivity of 53.54% and an average specificity of 83.17% were obtained for the classification of phage virion and nonvirion proteins by using all the 420 features, that is, 20 amino acid compositions and 400 dipeptide compositions.

Table 1.

Predictive performance of Naïve Bayes based on different features.

Feature dimensions Sn (%) Sp (%) Acc (%) auROC
420 53.54 83.17 75.57 0.758
38 75.76 80.77 79.15 0.855

In order to identify prominent features that can distinguish between phage virion and nonvirion proteins, feature selection method as introduced in Section 2.3 was carried out to eliminate the redundant features using WEKA in a tenfold cross-validation approach on the benchmark dataset. We found that the proposed method achieved a maximum accuracy of 79.48% and auROC of 0.86 when the feature dimension reduced to 38 (i.e., V, T, A, H, K, E, R, S, LE, VT, VG, MK, TA, TS, AT, HI, KL, KI, KH, KN, KK, KD, KE, KW, KR, DK, EF, EL, EV, EK, EE, EW, CE, WK, RE, SG, GV, and GG). The jackknife test results of the Naïve Bayes classifier based on the 38 optimized features were listed in Table 1. As it can be seen from Table 1, the current method yielded a best auROC score of 0.855 and a predictive accuracy of 79.15% with an average sensitivity of 75.76% and an average specificity of 80.77% (Table 1). Both predictive accuracy and auROC are higher than that of the model based on the 420 features.

3.2. Comparison with Other Methods

To the best of our knowledge, there exists no theoretical method for phage virion and nonvirion protein classifications. Therefore, we cannot provide the comparison analysis with published results to confirm that the model proposed here is superior to other methods. However, the proposed Naïve Bayes classifier was compared with other state-of-the-art classifiers, that is, BayesNet, RBFnetwork, Random Forest, J48, Support Vector Machine (SVM), and LogitBoot. All the classifiers were compared on the benchmark dataset based on the optimized features (i.e., V, T, A, H, K, E, R, S, LE, VT, VG, MK, TA, TS, AT, HI, KL, KI, KH, KN, KK, KD, KE, KW, KR, DK, EF, EL, EV, EK, EE, EW, CE, WK, RE, SG, GV, and GG). Their best predictive results from jackknife test were shown in Table 2.

Table 2.

Comparison of Naïve Bayes with other methods by using optimized features.

Classifier Sn (%) Sp (%) Acc (%) auROC
BayesNet 68.69 79.81 76.22 0.799
RBFnetwork 72.73 82.21 79.15 0.839
Random Forest 55.56 84.62 75.24 0.802
LogitBoot 52.53 85.10 74.59 0.795
SVM 63.64 86.54 79.15 0.836
J48 61.62 77.88 72.64 0.671
Naïve Bayes 75.76 80.77 79.15 0.855

The predictive accuracy of Naïve Bayes is approximately 3%, 4%, 5%, and 7% higher than that of the BayesNet, Random Forest, LogitBoot, and J48 classifiers, respectively. Although the accuracies of RBFnetwork and SVM are equal to that of Naïve Bayes, their auROC scores are lower than that of Naïve Bayes. These results indicate that the proposed Naïve Bayes model can be effectively used to classify phage virion and nonvirion proteins.

4. Conclusions

In this study, the Naïve Bayes classifier with feature selection method is presented to identity phage virion proteins based on the primary sequence information. By using Correlation-based Feature Subset Selection algorithm, the feature dimensions were reduced, and 38 prominent features that could remarkably improve the predictive accuracies were obtained. However, the detailed analyses of the selected features are required to provide more information about their roles in biological activity. The accuracy for the classification of phage virion and nonvirion proteins reached 79.15% in the jackknife test, indicating that the proposed method is an effective tool for phage virion protein identification. It is expected that the presented model will provide novel insights into the research on phage proteomics. Since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors [58], we shall make efforts in our future work to provide a web server for the method presented in this paper.

Acknowledgments

The authors wish to express their gratitude to three anonymous reviewers whose constructive comments were very helpful in strengthening the presentation of this paper. This work was supported by the National Nature Scientific Foundation of China (nos. 61100092, 61202256), the Fundamental Research Funds for the Central Universities (ZYGX2012J113).

References

  • 1.Lavigne R, Ceyssens PJ, Robben J. Phage proteomics: applications of mass spectrometry. Methods in Molecular Biology. 2009;502:239–251. doi: 10.1007/978-1-60327-565-1_14. [DOI] [PubMed] [Google Scholar]
  • 2.Hall MA. Correlation-Based Feature Selection for Machine Learning: Data Mining, Inference and Prediction. Berlin, Germany: Springer; 2008. [Google Scholar]
  • 3.Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology. 2011;273(1):236–247. doi: 10.1016/j.jtbi.2010.12.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Chou KC. Some remarks on predicting multi-label attributes in molecular biosystems. Molecular Biosystems. 2013 doi: 10.1039/c3mb25555g. [DOI] [PubMed] [Google Scholar]
  • 5.Lin WZ, Fang JA, Xiao X, Chou KC. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. Molecular BioSystems. 2013;9(9):634–644. doi: 10.1039/c3mb25466f. [DOI] [PubMed] [Google Scholar]
  • 6.Xiao X X, Wang P, Lin WZ, Jia JH, Chou KC. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry. 2013;436(2):168–177. doi: 10.1016/j.ab.2013.01.019. [DOI] [PubMed] [Google Scholar]
  • 7.Chen W, Feng PM, Lin H, Chou KC. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013;41(6, article e68) doi: 10.1093/nar/gks1450. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Chou KC, Wu ZC, Xiao X. iLoc-Hum: using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Molecular Biosystems. 2012;8(2):629–641. doi: 10.1039/c1mb05420a. [DOI] [PubMed] [Google Scholar]
  • 9.Chen W, Lin H, Feng PM, Ding C, Zuo Y-C, Chou K-C. iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE. 2012;7(10) doi: 10.1371/journal.pone.0047843.e47843 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Xu Y, Ding J, Wu LY, Chou K-C. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE. 2013;8(2) doi: 10.1371/journal.pone.0055844.e55844 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt) Nucleic Acids Research. 2012;40:D71–D75. doi: 10.1093/nar/gkr981. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi: 10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]
  • 13.Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43(3):246–255. doi: 10.1002/prot.1035. [DOI] [PubMed] [Google Scholar]
  • 14.Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21(1):10–19. doi: 10.1093/bioinformatics/bth466. [DOI] [PubMed] [Google Scholar]
  • 15.Nanni L, Lumini A, Gupta D, Garg A. Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9(2):467–475. doi: 10.1109/TCBB.2011.117. [DOI] [PubMed] [Google Scholar]
  • 16.Zou D, He Z, He J, Xia Y. Supersecondary structure prediction using Chou's pseudo amino acid composition. Journal of Computational Chemistry. 2011;32(2):271–278. doi: 10.1002/jcc.21616. [DOI] [PubMed] [Google Scholar]
  • 17.Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids. 2008;34(4):565–572. doi: 10.1007/s00726-007-0010-9. [DOI] [PubMed] [Google Scholar]
  • 18.Kandaswamy KK, Pugalenthi G, Möller S, et al. Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition. Protein and Peptide Letters. 2010;17(12):1473–1479. doi: 10.2174/0929866511009011473. [DOI] [PubMed] [Google Scholar]
  • 19.Mei S. Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. Journal of Theoretical Biology. 2012;310:80–87. doi: 10.1016/j.jtbi.2012.06.028. [DOI] [PubMed] [Google Scholar]
  • 20.Chen YK, Li KB. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2013;318:1–12. doi: 10.1016/j.jtbi.2012.10.033. [DOI] [PubMed] [Google Scholar]
  • 21.Hayat M, Khan A. Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC. Protein and Peptide Letters. 2012;19(4):411–421. doi: 10.2174/092986612799789387. [DOI] [PubMed] [Google Scholar]
  • 22.Khosravian M, Faramarzi FK, Beigi MM, Behbahani M, Mohabatkar H. Predicting antibacterial peptides by the concept of Chou's pseudo-amino acid composition and machine learning methods. Protein and Peptide Letters. 2012;20(2):180–186. doi: 10.2174/092986613804725307. [DOI] [PubMed] [Google Scholar]
  • 23.Mohabatkar H, Beigi MM, Abdolahi K, Mohsenzadeh S. Prediction of allergenic proteins by means of the concept of Chou's pseudo amino acid composition and a machine learning approach. Medicinal Chemistry. 2013;9(1):133–137. doi: 10.2174/157340613804488341. [DOI] [PubMed] [Google Scholar]
  • 24.Beigi MM, Behjati M, Mohabatkar H. Prediction of metalloproteinase family based on the concept of Chou's pseudo amino acid composition using a machine learning approach. Journal of Structural and Functional Genomics. 2011;12(4):191–197. doi: 10.1007/s10969-011-9120-4. [DOI] [PubMed] [Google Scholar]
  • 25.Sahu SS, Panda G. A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry. 2010;34(5-6):320–327. doi: 10.1016/j.compbiolchem.2010.09.002. [DOI] [PubMed] [Google Scholar]
  • 26.Zia-Ur R, Khan A. Identifying GPCRs and their types with Chou's pseudo amino acid composition: an approach from multi-scale energy representation and position specific scoring matrix. Protein and Peptide Letters. 2012;19(8):890–903. doi: 10.2174/092986612801619589. [DOI] [PubMed] [Google Scholar]
  • 27.Sun XY, Shi SP, Qiu JD, Suo SB, Huang SY, Liang RP. Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou's PseAAC via discrete wavelet transform. Molecular BioSystems. 2012;8(12):3178–3184. doi: 10.1039/c2mb25280e. [DOI] [PubMed] [Google Scholar]
  • 28.Nanni L, Lumini A. Genetic programming for creating Chou's pseudo amino acid based features for submitochondria localization. Amino Acids. 2008;34(4):653–660. doi: 10.1007/s00726-007-0018-1. [DOI] [PubMed] [Google Scholar]
  • 29.Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 2010;263(2):203–209. doi: 10.1016/j.jtbi.2009.11.016. [DOI] [PubMed] [Google Scholar]
  • 30.Mohabatkar H. Prediction of cyclin proteins using Chou's pseudo amino acid composition. Protein and Peptide Letters. 2010;17(10):1207–1214. doi: 10.2174/092986610792231564. [DOI] [PubMed] [Google Scholar]
  • 31.Mohabatkar H, Mohammad Beigi M, Esmaeili A. Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology. 2011;281(1):18–23. doi: 10.1016/j.jtbi.2011.04.017. [DOI] [PubMed] [Google Scholar]
  • 32.Georgiou DN, Karakasidis TE, Nieto JJ, Torres A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2009;257(1):17–26. doi: 10.1016/j.jtbi.2008.11.003. [DOI] [PubMed] [Google Scholar]
  • 33.Li BQ, Huang T, Liu L, Cai YD, Chou KC. Identification of colorectal cancer related genes with mRMR and shortest path in protein-protein interaction network. PLoS ONE. 2012;7(4) doi: 10.1371/journal.pone.0033393.e33393 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Huang T, Wang J, Cai YD, Yu H, Chou KC. Hepatitis C virus network based classification of hepatocellular cirrhosis and carcinoma. PLoS ONE. 2012;7(4) doi: 10.1371/journal.pone.0034460.e34460 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Du P, Wang X, Xu C, Gao Y. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Analytical Biochemistry. 2012;425(2):117–119. doi: 10.1016/j.ab.2012.03.015. [DOI] [PubMed] [Google Scholar]
  • 36.Cao DS, Xu QS, Liang YZ. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics. 2013;29(7):960–962. doi: 10.1093/bioinformatics/btt072. [DOI] [PubMed] [Google Scholar]
  • 37.Montanucci L, Fariselli P, Martelli PL, Casadio R. Predicting protein thermostability changes from sequence upon multiple mutations. Bioinformatics. 2008;24(13):i190–i195. doi: 10.1093/bioinformatics/btn166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Gromiha MM, Suresh MX. Discrimination of mesophilic and thermophilic proteins using machine learning algorithms. Proteins. 2008;70(4):1274–1279. doi: 10.1002/prot.21616. [DOI] [PubMed] [Google Scholar]
  • 39.Wu LC, Lee JX, Huang HD, Liu BJ, Horng JT. An expert system to predict protein thermostability using decision tree. Expert Systems with Applications. 2009;36(5):9007–9014. [Google Scholar]
  • 40.Lin H, Chen W. Prediction of thermophilic proteins using feature selection technique. Journal of Microbiological Methods. 2011;84(1):67–70. doi: 10.1016/j.mimet.2010.10.013. [DOI] [PubMed] [Google Scholar]
  • 41.Chen W, Lin H. Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Computers in Biology and Medicine. 2012;42(4):504–507. doi: 10.1016/j.compbiomed.2012.01.003. [DOI] [PubMed] [Google Scholar]
  • 42.Mitchell TM. Machine Learning. New York, NY, USA: McGraw-Hill; 1997. [Google Scholar]
  • 43.Cao J, Panetta R, Yue S, Steyaert A, Young-Bellido M, Ahmad S. A naive Bayes model to predict coupling between seven transmembrane domain receptors and G-proteins. Bioinformatics. 2003;19(2):234–240. doi: 10.1093/bioinformatics/19.2.234. [DOI] [PubMed] [Google Scholar]
  • 44.Yousef M, Jung S, Kossenkov AV, Showe LC, Showe MK. Naïve Bayes for microRNA target predictions—machine learning for microRNA targets. Bioinformatics. 2007;23(22):2987–2992. doi: 10.1093/bioinformatics/btm484. [DOI] [PubMed] [Google Scholar]
  • 45.Murakami Y, Mizuguchi K. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics. 2010;26(15):1841–1848. doi: 10.1093/bioinformatics/btq302. [DOI] [PubMed] [Google Scholar]
  • 46.Sambo F, Trifoglio E, Di Camillo B, Toffolo GM, Cobelli C. Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data. BMC Bioinformatics. 2012;13(supplement 14, article S2) doi: 10.1186/1471-2105-13-S14-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Chou KC. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins. 1995;21(4):319–344. doi: 10.1002/prot.340210406. [DOI] [PubMed] [Google Scholar]
  • 48.Zhou GP. An intriguing controversy over protein structural class prediction. Journal of Protein Chemistry. 1998;17(8):729–738. doi: 10.1023/a:1020713915365. [DOI] [PubMed] [Google Scholar]
  • 49.Chou KC, Elrod DW. Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochemical and Biophysical Research Communications. 1998;252(1):63–68. doi: 10.1006/bbrc.1998.9498. [DOI] [PubMed] [Google Scholar]
  • 50.Zhou GP, Assa-Munt N. Some insights into protein structural class prediction. Proteins. 2001;44(1):57–59. doi: 10.1002/prot.1071. [DOI] [PubMed] [Google Scholar]
  • 51.Pan YX, Zhang ZZ, Guo ZM, Feng GY, Huang ZD, He L. Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. Journal of Protein Chemistry. 2003;22(4):395–402. doi: 10.1023/a:1025350409648. [DOI] [PubMed] [Google Scholar]
  • 52.Zhou GP, Doctor K. Subcellular location prediction of apoptosis proteins. Proteins. 2003;50(1):44–48. doi: 10.1002/prot.10251. [DOI] [PubMed] [Google Scholar]
  • 53.Chou KC, Shen HB. Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Natural Science. 2010;2(10):1090–1103. doi: 10.1038/nprot.2007.494. [DOI] [PubMed] [Google Scholar]
  • 54.Hayat M, Khan A. MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM. Journal of Theoretical Biology. 2012;292:93–102. doi: 10.1016/j.jtbi.2011.09.026. [DOI] [PubMed] [Google Scholar]
  • 55.Xiao X, Wu ZC, Chou KC. iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. Journal of Theoretical Biology. 2011;284(1):42–51. doi: 10.1016/j.jtbi.2011.06.005. [DOI] [PubMed] [Google Scholar]
  • 56.Chou KC, Wu ZC, Xiao X. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE. 2011;6(3) doi: 10.1371/journal.pone.0018258.e18258 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Witten H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, Calif, USA: Morgan Kaufmann; 2005. [Google Scholar]
  • 58.Chou KC, Shen HB. Review: recent advances in developing web-servers for predicting protein attributes. Natural Science. 2009;2:63–92. [Google Scholar]

Articles from Computational and Mathematical Methods in Medicine are provided here courtesy of Wiley

RESOURCES