Skip to main content
Computational and Mathematical Methods in Medicine logoLink to Computational and Mathematical Methods in Medicine
. 2014 May 18;2014:715494. doi: 10.1155/2014/715494

Quad-PRE: A Hybrid Method to Predict Protein Quaternary Structure Attributes

Yajun Sheng 1, Xingye Qiu 1, Chen Zhang 1, Jun Xu 1, Yanping Zhang 1, Wei Zheng 1, Ke Chen 2,*
PMCID: PMC4052169  PMID: 24963340

Abstract

The protein quaternary structure is very important to the biological process. Predicting their attributes is an essential task in computational biology for the advancement of the proteomics. However, the existing methods did not consider sufficient properties of amino acid. To end this, we proposed a hybrid method Quad-PRE to predict protein quaternary structure attributes using the properties of amino acid, predicted secondary structure, predicted relative solvent accessibility, and position-specific scoring matrix profiles and motifs. Empirical evaluation on independent dataset shows that Quad-PRE achieved higher overall accuracy 81.7%, especially higher accuracy 92.8%, 93.3%, and 90.6% on discrimination for trimer, hexamer, and octamer, respectively. Our model also reveals that six features sets are all important to the prediction, and a hybrid method is an optimal strategy by now. The results indicate that the proposed method can classify protein quaternary structure attributes effectively.

1. Introduction

As is well known, the prediction of protein quaternary structure attributes (such as monomer, dimmer, trimer, tetramer, pentamer, hexamer, heptamer, and octamer) plays an important role in the structure bioinformatics. It can confirm how many subunits form the protein. It is the real requirement for the Anfinsen's dogma [1]. A variety of experimental techniques can determine protein quaternary structure. However, most methods are time-consuming and expensive. Moreover, the oligomers may be homooligomers or heterooligomers; the former consist of identical polypeptide chains, whereas the latter are nonidentical. Many computational methods are proposed.

As far as we know, the earliest work to study the quaternary structure type was in 2001 [2]. In this paper, Garian proposed a method named Quaternary Structure Explorer (QSE), which just judges whether or not a given protein is a homodimer. In 2003, Zhang et al. [3] first introduced support vector machine (SVM) to discriminate the differences of the primary sequences of both homodimer and nonhomodimer. Chou and Cai [4] solved the 2-state problem by using the pseudo amino acid composition. In 2006, Shi el al. [5] classified homooligomers based on amino acid composition distribution (AACD) and showed that the 2DPCA was an effective approach to decrease the high dimension of feature vector. In 2007, Carugo [6] proposed a method which is able to predict the quaternary structural type of hetero oligomeric proteins. Levy [7] proposed the PiQSi to get the annotations of about 15,000 proteins in PDB, which can be used as the benchmark dataset to test the quality of a method to predict the quaternary structure type. In 2009, Xiao and Lin introduced the grey incidence degree measure [8] to predict the protein quaternary structure attributes. The method is implemented as a web-server called Quat-2L [9], which firstly identifies the protein as homooligomer or heterooligomer and secondly justifies how many subunits. In 2012, Sun et al. utilized discrete wavelet transform [10] based on Chou's PseAAC to identify the protein quaternary structure attribute. All these methods to predict the quaternary structure attributes are based on one set of features, and mostly for 2 states.

In this paper, we proposed a new method Quad-PRE to predict protein quaternary structures attributes among 6 states only based on the primary sequences, removing both pentamer and heptamer because of insufficient data. With 10 fold cross validation, our models achieved higher overall accuracy 81.7%, especially higher accuracy 92.8%, 93.3%, and 90.6% on discrimination for trimer, hexamer, and octamer, respectively. Our method could be an effective tool to predict the protein quaternary structure attributes.

2. Materials and Methods

2.1. Benchmark Dataset

The dataset is from the quaternary structure library PiQSi (http://www.PiQSi.org/) built by Levy [7]. Our original dataset was downloaded on December 12, 2011. Firstly, we download a whole annotated list including about 15,000 protein sequences and a nonredundant set including 1755 sequences (30% sequence id.) from the library and then remove sequences which are not in the nonredundant set from the whole annotated list. In order to use a set of “good” PDB files, we use the subset of those annotated as “NOT” or “PROBABLY NOT” being errors. In addition, the number of pentamer and heptamer is too little to analyze and we also removed them. Finally, we get a protein quaternary structure dataset with primary sequence as shown in Table 1.

Table 1.

The numbers of monomer, dimmer, trimer, tetramer, hexamer, and octamer in our benchmark dataset.

Total Monomer Dimer Trimer Tetramer Hexamer Octamer
1040 366 338 53 155 67 61

2.2. Features

In this paper, we used three traditional methods and three tools (BLAST, GLAM2, and GIBBS) to select 632 features only based on unique primary sequences and denoted them as six terms: ART_1 feature, ART_2 feature, ART_3 feature, BLAST feature, GLAM2 feature, and GIBBS feature). The summary of the considered features is shown in Table 2 (See Tables S1–S3 in Supplementary Material available online at http://dx.doi.org/10.1155/2014/715494 for more detailed information).

Table 2.

Summary of the considered features, where y denotes one of the three secondary structure states and x denotes one of the 20 common AAs.

Feature sets Description
Sequence-based (79) Sequence length (1)
Composition vector (20)
The number of AAs in the sequence belonging to {R group, Electronic group, Hydrophobicity group, Exchange group} (18)
First and second order composition moment vector (40)

PSSM-based (203) From the PSSM matrix

Secondary structure (217) Based on the features utilized in the PSI-Pred method (90)
Based on the predicted secondary structure which describes collocation
of helical and strand segments (127)

Average RSA based (23) Average RSA of the residues with AA type x (20)
Average RSA of the residues with secondary structure type y (3)

Average isoelectric point (1) pI = 1/Ni=1 N pI i, the pI i values in the paper [11]

Auto-correlation functions based on FHi, EHi, and Hp indices (25) A n a = 1/(Nn)∑i=1 Nn a i a i+n, where a defines the corresponding physicochemical properties, such as two hydrophobicity indices (the Fauchere-Pliska's (FH) with n = 1,2,…, 10 and the Eisenberg's (EH) n = 1,2,…, 6), and hydropathy (HP) index with n = 1,2,…, 9.

Auto-correlation functions based on cumulative FHi index (6) A n a = ∑i=1 Nn(∑j=1 i a j) × (∑j=1 i+n a j)/(Nn), where a is the FH index with n = 1,2,…, 6.

Sum of hydrophobicities based on FHi and EHi (2) H sum a = ∑i=1 N a i, where a is the FH or the EH index.

R groups (5) RGi, where i = 1 corresponds to nonpolar aliphatic AAs (AVLIMG), i = 2 to polar uncharged AAs (SPTCNQ), i = 3 to positively charged AAs (KHR), i = 4 to negative AAs (DE), and i = 5 to aromatic AAs (FYW); the composition percentage of each group in the sequence is computed

Electronic groups (5) EGi, where i = 1corresponds to electron donor AAs (DEPA), i = 2 to weak electron donor AAs (LIV), i = 3 to electron acceptor AAs (KNR), i = 4 to weak electron acceptor AAs (FYMTQ), and i = 5 to neutral AAs (GHWS); the composition percentage of each group in the sequence is computed

Blast based (30) Refer to subsection “Features

GLAM2-based (30) Refer to subsection “Features

GIBBS-based (6) Refer to subsection “Features

Firstly, we use three traditional methods to get the three feature sets, that is, the ART_1 feature by [12], ART_2 feature by [13], and ART_3 feature by [11], respectively. The sources of data used to generate the features from the original sequence include the protein sequence, the position-specific scoring matrix (PSSM) generated by PSI-BLAST [14], the secondary structure predicted by PSI-Pred [15], the solvent accessible surface area (ASA) values predicted using Real-SPINE [16], and the relative solvent accessibility (RSA) defined as the ratio of ASA of a residue observed in its three-dimensional structure to that observed in an extended (Gly-X-Gly or Ala-X-Ala) tripeptide conformation [17].

Secondly, we generate other three features sets by BLAST, GLAM2, and GIBBS, respectively. The three methods can describe the inherent properties of sequences. Primarily, we divide equally the feature set into 10 portions randomly, making sure that every portion contains at least one element of each one of 6 states (monomer, dimmer, trimer, tetramer, hexamer, and octamer) so that we have 10 datasets

{Si||Si|=104,i=1,,10}. (1)

Every S i contains 6 subsets

Si={ppsic,c=1,2,3,4,6,8}, (2)

where each subset s ic contains sequences which has c subunits in S i. It is noted that the generated features depend on the original 10 fixed datasets.

For each sequence P = a 1 a 2a LS i, we select the most similar five sequences in each one of 6 sets {p | ps kc, ki},  c = 1,2, 3,4, 6,8 by PSI-Blastall. So we can get 30 features for each given sequence P based on the Evalue's index of the scientific notation from the results of the tool.

The sequence motifs can describe many properties of protein, such as transcription factor binding sites, splice junctions, and protein-protein interaction sites. Both GIBBS and GLAM2 are employed to find motifs from our datasets. In the same way, for each sequence PS i, we get the motifs of each one of 6 sets {p | ps kc,  ki},  c = 1,2, 3,4, 6,8 by both GLAM2 and GIBBS, denoted as follows, respectively:

McPGLAM2,c=1,2,3,4,6,8,McPGIBBS,c=1,2,3,4,6,8. (3)

In fact, there are many gaps in some motifs generated by GLAM2 so that we need to preprocess these motifs as follows.

  1. If a motif has more than five consecutive gaps, we delete those gaps and divide this motif into two new motifs.

  2. If the AAs of a motif are less than five, we delete it.

Then we get updated

McPGLAM2,c=1,2,3,4,6,8. (4)

We use the modified Smith-Waterman dynamic programming (SW-DP) algorithm to make sequence alignment between the given sequence P and each one of M c PGLAM2,  c = 1,2, 3,4, 6,8. The given sequence P acquires the five highest alignment scores from each of M c PGLAM2,  c = 1,2, 3,4, 6,8, so that we can get 30 more features for the given sequence. The specific procedure is as follows. In fact, each position of each motif generated by GLAM2 possibly has more than one AA after preprocessing. We use

MGLAM2=m1m2mn (5)

to represent a motif with n length, where m i = {b ij} and b ij may be one of 20 common AAs or a gap. For the protein sequence P = a 1 a 2a L, the penalty function is defined as

dGLAM2(mi,aj)={1ifajmi,ajgap 0ifajmi,aj=gap 1ifajmi,ajgap 13ifajmi,aj=gap. (6)

Then we use the SW-DP algorithm to compute the alignment score between P and M GLAM2.

In addition, GIBBS can find a motif like

MGIBBS=t1t2tn (7)

for each one of M c PGIBBS,  c = 1,2, 3,4, 6,8, where

ti=(pib1,pib2,,pib21)T,i=1,2,3,,n (8)

represent probabilities of 20 common AAs and gap in the position i, and

{bj,j=1,2,,21}={A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,}. (9)

For the protein sequence P = a 1 a 2a L, the penalty function is defined as

dGIBBS(ti,aj)=piaj. (10)

We employ the SW-DP algorithm to calculate the alignment score between P and M GIBBS again, and then we gain other 6 features for the sequence P by GIBBS.

2.3. The Overall Design

Gaining a protein quaternary structure dataset, we design our method Quad-PRE from primary sequence as below.

  1. Select the features based on properties of amino acid, PSSM, the secondary structure, the solvent accessible surface area, and the physicochemical property.

  2. In addition, we divide our dataset equally into ten portions randomly, but making sure that every portion contains at least one element of each one of 6 states. And then we obtain the new features of each sequence using BLAST, GIBBS, and GLAM2, respectively.

Our scheme is a hybrid method and we give a diagram for making it easy to follow, shown in Figure 1.

Figure 1.

Figure 1

The diagram of Quad-PRE.

2.4. Classification

Support vector machine (SVM), which was shown to provide high quality predictions in classification, regression, and density estimation area, was implemented with LIBSVM [18] package. The support vector classification C-SVC is selected in this paper. There are several strategies to solve multiclass problem, such as one-versus-rest and one-versus-one. One-versus-rest strategy is used in this paper. The prediction performance was examined by n-fold cross validation, in which the training dataset is randomly divided into n subsets equally. The n − 1 subsets are used to train the model and the remaining one subset is used to evaluate the model, repeated n times. If n is the number of the samples, it was named jackknife test (or leave-one-out cross validation).

We designed a predictor with 10-fold cross validation. First of all, the input sequence is converted into the feature space, and then the corresponding features are passed to the classifier. The prediction class of the sequence that corresponds to one has the highest probability. Overall accuracy (ACC), the sensitivity or true positive rate (TPR), the false positive rate (FPR), the specificity (SPC), the precision (PPV), and Matthew's correlation coefficient (MCC) for each class are used to measure the prediction performance; they are defined as follows:

(11)ACC=(TP+TN)N,(12)TPR=TP(TP+FN),FPR=FP(FP+TN),SPC=TN(FP+TN)=1FPR,PPV=TP(TP+FP),MCC=TP×TNFP×FN(TP+FN)(TP+FP)(TN+FP)(TN+FN),

where TP is true positive number, TN is true negative, FP is false positive, FN is false negative, and N is total number of sequences. However, these metrics are not quite intuitive and easier-to-understand and we can adopt the formulation proposed recently to really understand them [1921]. We also calculate the area under the ROC curve (AUC) to evaluate the predictions. Higher values of these measures indicate better quality of predictions.

3. Results and Discussion

3.1. Results and Comparison with Garian's QSE

The choice of the penalty factor C and the kernel function type is very important since SVM is sensitive to parameterization. In this paper, we consider the radial basis function (RBF) of kernel types following the Chang and lin [22]

K(xi,xj)=exp(γ||xixj||2), (13)

where γ is the width of the RBF function. To identify the optimal C and γ, a systematic grid search was conducted for

C={0.1,0.2,0.5,1,2,5,10,20,50,100}γ={0.0025,0.005,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24} (14)

by the 10-fold cross validation. Then we find the optimal C and γ are 0.1 and 0.01 with the average AUC value 0.704. With the best parameters, the average accuracy is 45.3% by 10-fold cross validation. The predicting matrix is as follows; the rw ij is the number of the i class predicted as the j class

RW=(28555121581851049872524913016555032841518206016722850224). (15)

The TPR, SPC, PPV, MCC, and AUC of every class are shown in Table 3 and the ROC curves are shown in Figure 2. Following from Table 3, Quad-PRE achieved higher overall ACC 81.7%, especially higher accuracy 92.8%, 93.3%, and 90.6% on discrimination for trimer, hexamer, and octamer, respectively. And overall SPC is 87.0%, especially 96.5%, 99.0%, 98.0%, and 93.8% on discrimination for trimer, tetramer, hexamer, and octamer, respectively. These results show that our hybrid method has high accuracy and specificity.

Table 3.

Predicted results with C = 0.1 and gamma = 0.01.

Monomer Dimer Trimer Tetramer Hexamer Octamer Average
ACC 63.0% 63.8% 92.8% 87.0% 93.3% 90.6% 81.7%
TPR 77.9% 30.8% 24.5% 18.1% 23.9% 39.3% 35.7%
SPC 54.9% 79.8% 96.5% 99.0% 98.0% 93.8% 87.0%
PPV 48.4% 42.3% 27.1% 75.7% 45.8% 28.2% 44.6%
MCC 0.316 0.116 0.220 0.328 0.299 0.284 0.260
AUC 0.703 0.582 0.702 0.765 0.711 0.758 0.704

Figure 2.

Figure 2

The ROC curves of six classes.

In addition, we can see that it is a little more difficult to predict dimer from Figure 2, because the AUC for predicting dimer is smaller than other oligomers. More specifically, the AUC of dimer is 0.582, while those of monomer, trimer, tetramer, hexamer, and octamer are 0.703, 0.702, 0.765, 0.711, and 0.758, respectively (see Table 2). However, when comparing with the predicted results of Garian's QSE [2] of classifying homodimer and nonhomodimer, the ACC, SPC, PPV, MCC, and AUC of Quad-PRE are all larger than QSE's, other than the TPR (see Table 4). Apparently, Quad-PRE performs better than QSE's (ROC curves of two methods are shown in Figure 3).

Table 4.

Comparison with Garian's method.

ACC TPR SPC PPV MCC AUC
Quad-PRE 63.8% 30.8% 79.8% 42.3% 0.116 0.582
QSE 46.2% 73.8% 32.6% 34.7% 0.065 0.522

Figure 3.

Figure 3

The ROC curves comparison Quad-PRE with Garian's QSE.

3.2. Discussion with Six Feature Groups

For confirming our generated new features (TOTAL) can improve the prediction of protein quaternary structure attributes, we compared the results from TOTAL features with those from each one of the six feature sets (ART_1, ART_2, ART_3, BLAST, GLAM2, and GIBBS), which are shown in Table 5. The ROC curves for predicting every attribute by six sets are shown in Figure 4, respectively.

Table 5.

Comparison with results are generated by different feature groups.

ART_1 ART_2 ART_3 BLAST GLAM2 GIBBS Total
ave-ACC 42.4% 38.5% 39.9% 34.9% 23.7% 30.6% 43.5%
ave-TPR 23.9% 21.8% 23.2% 33.0% 22.7% 15.3% 35.7%
ave-SPC 85.5% 84.7% 85.1% 85.3% 84.5% 82.6% 87.0%
ave-PPV 50.5% 27.4% 28.9% 35.2% 20.5% 10.2% 44.6%
ave-MCC 0.153 0.090 0.111 0.189 0.051 −0.024 0.260
ave-AUC 0.680 0.662 0.661 0.660 0.573 0.510 0.704

Figure 4.

Figure 4

Comparison with the ROC curves of different classes for different feature groups.

From Figure 4, we can see that the average AUC, ACC, TPR, SPC,and MCC of any of 6 features sets are all smaller than TOTAL features except the PPV. In particular, there are almost the same average SPC values for all feature sets. And the two feature sets from both GIBBS and GLAM2 all do not perform well in every metric. From Table 5 we also know that ART_1, BLAST, ART_1, ART_1, BLAST, and ART_1 play key roles in improving average ACC, TPR, SPC, PPV, MCC, and AUC of our method, respectively, because the corresponding values of them are close to those of TOTAL. These results mean each feature set contributes to the improvement of our hybrid method, especially ART_1 because the average ACC, TPR, SPC, PPV, MCC, and AUC from which are almost superior to others (see Table 5).

From the view of the average AUC, the importance of the six feature sets from high to low is ART_1, ART_2, ART_3, BLAST, GLAM2, and GIBBS (see Table 5). And the AUC values of ART_1, ART_2, and ART_3 for every protein attribute are almost larger than those of BLAST, GIBBS, and GLAM2 (see Figure 4). We think that the possible reason should be that the ART_1, ART_2, and ART_3 have much more features than BLAST, GIBBS, and GLAM2. And because similar sequences should have similar structures and functions, the features from BLAST are superior to those from both GIBBS and GLAM2 in the performance of SVM.

4. Conclusions

To predict protein quaternary structure attribute is indeed a challenging problem. This paper presents a novel approach, that is, Quad-PRE, to solve the problem. Quad-PRE starts to consider the features about motifs generated by some tools. From analysis results, we know the number of these features is too little to play important roles in improving the performance of our method, so that we will attempt to find motif features more important in the future work. In addition, Quad-PRE is a multistate method classifying monomer, trimer, tetramer, hexamer, and octamer very well, while other previous methods to predict the quaternary structure attributes are mostly for 2 states.

In fact, the hybrid method Quad-PRE is high accuracy and specificity on discrimination for trimer, tetramer, hexamer, and octamer, respectively. But we compare the Garian's QSE with our Quad-PRE using our dataset for confirming our method is effective. The results show that our hybrid method performs better than Garian's QSE in predicting the homodimmer or not from metrics ACC, SPC, PPV, MCC, and AUC. In addition, we analyze the importance of the six feature sets. The result clearly shows that each of six features sets contributes to the improvement in prediction, especially the ART_1 feature set. And three new feature sets gained by BLAST, GLAM2, and GIBBS are all effective, because these motif features describe the inherent properties of the sequence inherent and the motifs in protein sequences can help us to understand the structure and function of the molecules the sequences represent [23].

In this paper, we did not consider feature selection because we want to make full use of each feature as many as possible and analyze the importance of each one of six features sets. We believe that future improvements will be possible by designing better sequence representations rather than applying more complex classifiers.

Since user-friendly and publicly accessible web-servers [24] represent the future direction for developing practically more useful predictors, we shall make efforts in our future work to provide a web-server for the method presented in this paper.

Supplementary Material

Table S1 is the detailed definitions of the protein features considered in our method. Table S2 is the property groups used to aggregate similar amino acids, and Table S3 is the isoelectric point value, Fauchere-Pliska hydrophobicity value, Eisenberg hydrophobicity value and hydropathy value of the standard amino acids. Table S2 and Table S3 are used for generating the protein features.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11201334) and Science and Technology Commission of Tianjin Municipality (Grant no. 12JCYBJC31900) to Ke Chen They gratefully acknowledge the help from Dr. Jianzhao Gao for valuable suggestions and comments.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

  • 1.Anfinsen CB, Haber E, Sela M, White FH., Jr. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proceedings of the National Academy of Sciences of the United States of America. 1961;47:1309–1314. doi: 10.1073/pnas.47.9.1309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Garian R. Prediction of quaternary structure from primary structure. Bioinformatics. 2001;17(6):551–556. doi: 10.1093/bioinformatics/17.6.551. [DOI] [PubMed] [Google Scholar]
  • 3.Zhang S-W, Pan Q, Zhang H-C, Zhang Y-L, Wang H-Y. Classification of protein quaternary structure with support vector machine. Bioinformatics. 2003;19(18):2390–2396. doi: 10.1093/bioinformatics/btg331. [DOI] [PubMed] [Google Scholar]
  • 4.Chou K-C, Cai Y-D. Predicting protein quaternary structure by pseudo amino acid composition. Proteins: Structure, Function and Genetics. 2003;53(2):282–289. doi: 10.1002/prot.10500. [DOI] [PubMed] [Google Scholar]
  • 5.Shi JY, Pan Q, Zhang SW, et al. Classification of protein homo-oligomers using amino acid composition distribution. Acta Biophysica Sinica. 2006;22(1):49–55. [Google Scholar]
  • 6.Carugo O. A structural proteomics filter: prediction of the quaternary structural type of hetero-oligomeric proteins on the basis of their sequences. Journal of Applied Crystallography. 2007;40(6):986–989. [Google Scholar]
  • 7.Levy ED. PiQSi: protein quaternary structure investigation. Structure. 2007;15(11):1364–1367. doi: 10.1016/j.str.2007.09.019. [DOI] [PubMed] [Google Scholar]
  • 8.Xiao X, Lin W-Z. Application of protein grey incidence degree measure to predict protein quaternary structural types. Amino Acids. 2009;37(4):741–749. doi: 10.1007/s00726-008-0212-9. [DOI] [PubMed] [Google Scholar]
  • 9.Xiao X, Wang P, Chou K-C. Quat-2L: a web-server for predicting protein quaternary structural attributes. Molecular Diversity. 2011;15(1):149–155. doi: 10.1007/s11030-010-9227-8. [DOI] [PubMed] [Google Scholar]
  • 10.Sun X-Y, Shi S-P, Qiu J-D, Suo S-B, Huang S-Y, Liang R-P. Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou’s PseAAC via discrete wavelet transform. Molecular BioSystems. 2012;8:3178–3184. doi: 10.1039/c2mb25280e. [DOI] [PubMed] [Google Scholar]
  • 11.Kurgan L, Chen K. Prediction of protein structural class for the twilight zone sequences. Biochemical and Biophysical Research Communications. 2007;357(2):453–460. doi: 10.1016/j.bbrc.2007.03.164. [DOI] [PubMed] [Google Scholar]
  • 12.Mizianty MJ, Kurgan L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics. 2009;10, article 414 doi: 10.1186/1471-2105-10-414. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Zhang H, Zhang T, Gao J, Ruan J, Shen S, Kurgan L. Determination of protein folding kinetic types using sequence and predicted secondary structure and solvent accessibility. Amino acids. 2012;42(1):271–283. doi: 10.1007/s00726-010-0805-y. [DOI] [PubMed] [Google Scholar]
  • 14.Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT. Protein structure prediction servers at University College London. Nucleic Acids Research. 2005;33(2):W36–W38. doi: 10.1093/nar/gki410. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Huang J-T, Cheng J-P, Chen H. Secondary structure length as a determinant of folding rate of proteins with two- and three-state kinetics. Proteins: Structure, Function and Genetics. 2007;67(1):12–17. doi: 10.1002/prot.21282. [DOI] [PubMed] [Google Scholar]
  • 17.Ahmad S, Gromiha MM, Sarai A. Real value prediction of solvent accessibility from amino acid sequence. Proteins: Structure, Function and Genetics. 2003;50(4):629–635. doi: 10.1002/prot.10328. [DOI] [PubMed] [Google Scholar]
  • 18.Hsu C-W, Lin C-J. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks. 2002;13(2):415–425. doi: 10.1109/72.991427. [DOI] [PubMed] [Google Scholar]
  • 19.Chen W, Feng PM, Lin H, Chou K-C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013;41(6, article e68) doi: 10.1093/nar/gks1450. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Qiu WR, Xiao X, Chou KC. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. International Journal of Molecular Sciences. 2014;15(2):1746–1766. doi: 10.3390/ijms15021746. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Xu Y, Ding J, Wu LY, Chou KC. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE. 2013;8(2, article e55844) doi: 10.1371/journal.pone.0055844. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2(3, article 27) [Google Scholar]
  • 23.Bailey TL. Discovering sequence motifs. Methods in Molecular Biology. 2007;395:271–292. doi: 10.1007/978-1-59745-514-5_17. [DOI] [PubMed] [Google Scholar]
  • 24.Xiao X, Lin WZ, Chou KC. Recent advances in predicting protein classification and their applications to drug development. Current Topics in Medicinal Chemistry. 2013;13(10):1622–1635. doi: 10.2174/15680266113139990113. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Table S1 is the detailed definitions of the protein features considered in our method. Table S2 is the property groups used to aggregate similar amino acids, and Table S3 is the isoelectric point value, Fauchere-Pliska hydrophobicity value, Eisenberg hydrophobicity value and hydropathy value of the standard amino acids. Table S2 and Table S3 are used for generating the protein features.


Articles from Computational and Mathematical Methods in Medicine are provided here courtesy of Wiley

RESOURCES