Abstract
Knowledge of supersecondary structures can provide important information about the spatial structure of a protein. Several approaches have been developed for the prediction of protein supersecondary structure. However, the features used by these approaches are primarily based on amino acid sequences. In this study, a novel model is presented to predict protein supersecondary structure by use of chemical shift (CS) information derived from nuclear magnetic resonance (NMR) spectroscopy. Using these CSs as inputs to quadratic discriminant (QD) analysis, we achieve an overall prediction accuracy of 77.3% in threefold cross-validation, which is competitive with the same method applied to amino acid compositions. Moreover, our findings suggest that the particular combination of chemical shifts used influences the prediction accuracy.
1. Introduction
The prediction of protein structure has long been one of the most important research topics in bioinformatics. However, it is very difficult to predict the spatial structure directly from the protein sequence. Therefore, the prediction of supersecondary structure is an important step in the prediction of protein spatial structure. Supersecondary structural motifs are composed of a few secondary structural elements (namely, α or β) connected by loops. There are four kinds of simple supersecondary structures, namely, α-loop-β, α-loop-α, β-loop-α, and β-loop-β. These motifs play an important role in protein folding and stability because a large number of them occur in protein spatial structures. Many studies have focused on exploring methods for protein supersecondary structure prediction [1, 2]. In 1997, Sun et al. predicted protein supersecondary structure by using neural networks and achieved an accuracy of between 70 and 80% [3]. Chou and Blinn presented methods for predicting beta turns [4–6], alpha turns [7], and all the tight turns [6]. Cruz et al. distinguished β-hairpins from non-β-hairpins [8]. Hu and Li identified four kinds of simple supersecondary structures in 2088 proteins and achieved an accuracy of 78–83% [9]. Zou et al. also predicted four kinds of simple supersecondary structures from 3088 proteins by using a support vector machine and achieved an overall accuracy of 78% [10]. The features used in these studies were mainly derived from amino acid compositions or dipeptide compositions.
The nuclear magnetic resonance (NMR) technique plays an important role in the determination of three-dimensional structures of biological macromolecules. NMR chemical shifts encode subtle information about the local chemical environment of nuclear spins. For many years, there has been growing interest in accessing this information and utilizing it for biomolecular structure determination [11, 12]. Recent progress was made by combining chemical shifts with protein structure prediction programs [13–20], showing that chemical shift information is a powerful parameter for the determination of protein structure. In this paper, we used chemical shifts as parameters to predict four kinds of simple supersecondary structures in proteins by the method of quadratic discriminant analysis. Using the benchmark dataset and six CSs (C, C α, C β, H, H α, N) as features, we achieved an average sensitivity of 76.3%, an average specificity of 74.3%, and an overall prediction accuracy of 77.3% in threefold cross-validation. Moreover, we performed the prediction with different combinations of chemical shifts as features. The results showed that redundant information has a great influence on the accuracy.
2. Materials and Methods
2.1. Database
The chemical shifts of six nuclei (C, C α, C β, H, H α, N) in proteins were extracted from the re-referenced protein chemical shift database RefDB [21]. The dataset was constructed in the following steps. Firstly, only proteins with assigned CSs for all six nuclei were considered. Secondly, only proteins with supersecondary structure information in ArchDB40 [22] were retained. Finally, the PISCES program [23] was used to remove highly similar sequences. Following these procedures strictly, we obtained 114 proteins that have both CSs and supersecondary structure information. Among the 114 proteins, 92% (105 sequences) have less than 25% sequence identity, and the sequence identity of the remainder ranges from 25 to 30%. The appendix lists the 114 proteins used in this study. From these proteins, we obtained 90 α-loop-α (HH), 89 α-loop-β (HE), 97 β-loop-α (EH), and 122 β-loop-β (EE) motifs, the last including both β-β links and β-β hairpins.
2.2. Feature Parameter
In the four data subsets {HH, HE, EH, EE}, we calculated the averaged CSs of six nuclei for a sequence of length l using the following formula:
| (1) |
where i = C, C α, C β, H, H α, N. Therefore, a sequence can be converted into a six-dimensional vector R : {t i}.
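As a minimal sketch of this averaging step, the following Python function reduces a motif of length l to its six-dimensional feature vector. The array layout, the column order in `NUCLEI`, and the use of NaN for unassigned shifts (for example, C β in glycine) are illustrative assumptions, not details taken from the study.

```python
import numpy as np

# Hypothetical column order for the six nuclei; the paper does not prescribe one.
NUCLEI = ("C", "CA", "CB", "H", "HA", "N")


def averaged_shifts(shifts):
    """Average per-residue chemical shifts over a motif of length l.

    shifts: array-like of shape (l, 6), one row per residue and one column
            per nucleus; np.nan marks an unassigned shift (an assumption).
    Returns a length-6 vector of per-nucleus means over assigned residues.
    """
    arr = np.asarray(shifts, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != len(NUCLEI):
        raise ValueError("expected an (l, 6) array of chemical shifts")
    # nanmean ignores missing entries when averaging each column.
    return np.nanmean(arr, axis=0)
```

Ignoring missing entries is only one possible convention; motifs with unassigned nuclei could equally be excluded beforehand.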
2.3. Prediction Algorithm
Designing an efficient and accurate prediction algorithm is the key step in protein supersecondary structure prediction. Quadratic discriminant analysis [24] is a powerful algorithm that has been widely applied in genomic and proteomic bioinformatics. Thus, we used it here to perform the prediction.
2.4. Quadratic Discriminant Analysis (QD)
For a sequence X to be classified, we calculated the averaged CSs of six nuclei using (1). So, the sequence is converted into a six-dimensional vector R : {t i}:
| (2) |
Here we integrated the six-dimensional vector by using the quadratic discriminant analysis function. Consider a sequence X that is to be classified into one of four groups (HH, HE, EH, EE). The discriminant analysis function between group i and group j is defined by
| (3) |
According to Bayes' Theorem, we deduce
| (4) |
The result can be generalized to four groups directly and described as follows.
Set
| (5) |
where
| (6) |
where p v denotes the number of samples in group v; δ v is the squared Mahalanobis distance between R and μ v with respect to Σv; μ v denotes the chemical shift values of the six nuclei R : {t i} averaged over group v; and |Σv| is the determinant of the matrix Σv (note that μ v and |Σv| are calculated from the training set).
The six-dimensional vector μ v can be written as
| (7) |
where v = HH, EH, HE, EE; i = C, C α, C β, H α, H, N; Σv is the 6 × 6 covariance matrix, quantifying correlations between the chemical shifts of the six nuclei:
| (8) |
where the element
| (9) |
Here v = HH, EH, HE, EE; i, j = C, C α, C β, H α, H, N.
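For orientation, the textbook form of the quadratic discriminant under Gaussian class-conditional densities, with priors proportional to the group sizes, can be written in the notation defined above. This is an assumption about how the quantities p v, δ v, μ v, and Σv fit together; it is a sketch of the standard result, not a transcription of the numbered equations.

```latex
\begin{align}
\delta_v &= (R-\mu_v)^{\mathsf T}\,\Sigma_v^{-1}\,(R-\mu_v),\\
\eta_v &= \ln p_v \;-\; \frac{\delta_v}{2} \;-\; \frac{1}{2}\ln\lvert\Sigma_v\rvert,\\
\ln\frac{p(\omega_i\mid X)}{p(\omega_j\mid X)}
  &= \ln\frac{p_i}{p_j} \;-\; \frac{\delta_i-\delta_j}{2}
     \;-\; \frac{1}{2}\ln\frac{\lvert\Sigma_i\rvert}{\lvert\Sigma_j\rvert}
   \;=\; \eta_i-\eta_j .
\end{align}
```

Under these assumptions the posterior ratio of any two groups depends only on the difference of their η values, so comparing η v across groups is equivalent to comparing the posteriors.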
From (4) and (5), we conclude
| (10) |
It can be easily proved that p(ω k∣X) is the maximum of p(ω v∣X), if η k is the maximal one in η v (v = HH, EH, HE, EE). Then, we predict that X belongs to group k.
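The decision rule can be sketched in Python as follows. This is a minimal illustration that assumes the standard Gaussian form η v = ln p v − δ v / 2 − (1/2) ln|Σv|; the function names `fit_qd`, `eta`, and `predict_qd` are hypothetical, and the error-allowed-scope correction described in the next subsection is not included.

```python
import numpy as np


def fit_qd(X, y):
    """Estimate, per group, the sample count, mean vector, and covariance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    params = {}
    for label in sorted(set(y.tolist())):
        rows = X[y == label]
        # rowvar=False: each column is one feature (one nucleus).
        sigma = np.cov(rows, rowvar=False)
        params[label] = (len(rows), rows.mean(axis=0), sigma)
    return params


def eta(r, n_v, mu_v, sigma_v):
    """Assumed discriminant score: ln p_v - delta_v / 2 - ln|Sigma_v| / 2."""
    diff = np.asarray(r, dtype=float) - mu_v
    # Squared Mahalanobis distance, computed without forming the inverse.
    delta = diff @ np.linalg.solve(sigma_v, diff)
    _, logdet = np.linalg.slogdet(sigma_v)
    return np.log(n_v) - 0.5 * delta - 0.5 * logdet


def predict_qd(r, params):
    """Return the group with the largest eta, together with all scores."""
    scores = {label: eta(r, *p) for label, p in params.items()}
    return max(scores, key=scores.get), scores
```

In use, `fit_qd` would be called on the feature vectors and labels of the training folds only, and `predict_qd` on each vector of the held-out fold. The covariance matrix of every group must be non-singular, which requires more training motifs per group than features.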
2.5. Correction in the Error Allowed Scope
A sequence X is assigned to one of the four kinds of supersecondary structures by using (1)–(10): if η i is the maximal one among η k (k = HH, EH, HE, EE), then we predict that X belongs to group i. However, the differences among the η k (k = HH, EH, HE, EE) can be slight. To correct the predicted results, we define the coefficient of the error allowed scope as
| (11) |
where η corr denotes the η value of the class to which X actually belongs and η wro denotes the η value of another class into which X is predicted. For example, if X is a supersecondary structure of type HH, then η corr is η HH and η wro is the maximum among η EH, η HE, and η EE.
2.6. Performance Evaluation
In statistical prediction, the independent dataset test, the cross-validation test, and the jackknife test can be used to examine the effectiveness of a predictor in practical application. Among the three test methods, the jackknife test is deemed the least arbitrary, since it always yields a unique result for a given benchmark dataset [25], and it has been widely used to examine the performance of various predictors [26–37]. However, to reduce the computational time, in this study we used threefold cross-validation to examine the performance of our method: we randomly divided the dataset into three parts, two of which are used for training and the remaining one for testing. The process is repeated three times. The following three parameters, sensitivity (SNi), specificity (SPi), and overall accuracy (Q total), are used to evaluate the predictive performance of our approach:
| (12) |
| (13) |
| (14) |
where i = HH, HE, EH, EE and TP, FN, TN, and FP denote, respectively, true positives, false negatives, true negatives, and false positives. N is the total number of sequences in the four data subsets.
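A minimal sketch of how such per-class measures can be computed from true and predicted labels is given below. The definitions used here are assumptions: SN = TP / (TP + FN), SP = TP / (TP + FP), and Q total = (sum of TP over classes) / N. The SP definition in particular is a guess; it appears to be numerically consistent with the values reported in Table 2, but the definition used in the study may differ (for instance, one based on TN). The function name `per_class_metrics` is hypothetical.

```python
import numpy as np

CLASSES = ("HH", "HE", "EH", "EE")


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return ({class: (SN, SP)}, Q_total) under the assumed definitions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = {}
    correct = 0
    for c in classes:
        tp = int(np.sum((y_true == c) & (y_pred == c)))
        fn = int(np.sum((y_true == c) & (y_pred != c)))
        fp = int(np.sum((y_true != c) & (y_pred == c)))
        # Assumed: SN = TP/(TP+FN); SP = TP/(TP+FP). NaN if undefined.
        sn = tp / (tp + fn) if tp + fn else float("nan")
        sp = tp / (tp + fp) if tp + fp else float("nan")
        per_class[c] = (sn, sp)
        correct += tp
    # Overall accuracy: correctly classified sequences over all sequences.
    q_total = correct / len(y_true)
    return per_class, q_total
```

The averaged SN and SP reported alongside Q total would then be the unweighted means of the four per-class values.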
3. Results and Discussion
On the benchmark dataset, we calculated the average chemical shift values using (1). The sequences from the four data subsets are each converted into six-dimensional vectors derived from the chemical shift values of the six nuclei; μ is likewise a six-dimensional mean vector, calculated for each of the data subsets. In the training sets, the determinant and the inverse of the covariance matrix Σv are calculated. Given a sequence from the testing sets, we calculate η v by using (4)–(10) and compare the results. The class of sequence X is then determined by the maximum of η v (v = HH, HE, EH, EE). Moreover, the coefficient R given in (11) is used to correct the predicted results. The current study used R < 0.4. The results of threefold cross-validation are listed in Table 1.
Table 1.
The predicted accuracies by using six CSs as features (3-fold cross-validation, R < 0.4).
| Class structure | SN (%) | SP (%) | Average SN (%) | Average SP (%) | Q total (%) |
|---|---|---|---|---|---|
| HH | 73.0 | 71.0 | 76.3 | 74.3 | 77.3 |
| EH | 75.8 | 78.1 | | | |
| HE | 69.0 | 66.7 | | | |
| EE | 87.5 | 81.4 | | | |
From Table 1, we can see that the averaged sensitivity, specificity, and overall accuracy of four kinds of supersecondary structures are 76.3%, 74.3%, and 77.3%, respectively, indicating that CSs are highly informative with regard to supersecondary structures.
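The threefold protocol behind these figures can be sketched as follows. This is an illustrative implementation under stated assumptions: the split is a plain random partition (the text does not say whether it was stratified by class), and the function name `threefold_indices` and the `seed` parameter are hypothetical.

```python
import numpy as np


def threefold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for a random threefold split.

    Each of the three parts serves once as the test set while the other
    two parts are merged into the training set.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, 3)
    for k in range(3):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
        yield train_idx, test_idx
```

For the 398 motifs of the benchmark dataset this gives test parts of 133, 133, and 132 motifs. Group means, covariance matrices, and determinants would be re-estimated on each training set before scoring the corresponding test part.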
Generally speaking, chemical shift measurements can be incomplete for a multitude of reasons. Often, chemical shifts can only be assigned partially or are missing. To assess the impact of incomplete chemical shift assignments, we performed the prediction by using the combination of the different chemical shifts as features. The results are shown in Table 2.
Table 2.
Predicted results of different feature combinations (R < 0.4).
| Feature combinations | HH SN (%) | HH SP (%) | EH SN (%) | EH SP (%) | HE SN (%) | HE SP (%) | EE SN (%) | EE SP (%) | Average SN (%) | Average SP (%) | Q total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C, C α, C β, H, H α | 63.3 | 77.0 | 84.5 | 45.6 | 34.8 | 100 | 71.3 | 77.0 | 63.4 | 74.9 | 64.6 |
| C, C α, C β, H α, N | 90.0 | 85.3 | 66.0 | 97.0 | 85.4 | 86.4 | 93.4 | 75.5 | 83.7 | 86.1 | 84.2 |
| C, C α, C β, N | 55.6 | 87.7 | 61.9 | 80 | 44.9 | 93.0 | 95.1 | 52.5 | 64.4 | 78.3 | 66.8 |
| C α, C β, N | 90.0 | 87.1 | 94.8 | 83.6 | 79.8 | 93.4 | 91.0 | 91.7 | 88.9 | 89.0 | 89.2 |
| C, H α, N | 90.0 | 73.6 | 75.3 | 82.0 | 79.8 | 81.6 | 73.8 | 80.4 | 79.7 | 79.4 | 79.1 |
| AAC | 73.3 | 73.6 | 73.0 | 77.8 | 72.4 | 71.3 | 77.5 | 75.8 | 74.1 | 74.6 | 75.8 |
From Table 2, we found that omitting some CSs can result in radically different accuracy. In theory, incomplete chemical shifts provide less information, so the prediction accuracy would be expected to decline. In practice, however, it did not always do so. Using only the CSs of C α, C β, and N as features gave the highest prediction accuracy (89.2%), indicating that the results are affected by redundant data. On the basis of these performances, we concluded that the CSs of N, C α, and C β are the most informative features for the prediction of the four kinds of protein supersecondary structures. In addition, the CSs of C, H α, and N are commonly available in protein databases; using them as the only inputs, we achieved a prediction accuracy of 79.1%.
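Restricting the classifier to a subset of nuclei amounts to selecting columns of the feature matrix before training and testing. A minimal sketch, assuming the features are stored as an n × 6 array with the hypothetical column order given in `NUCLEI`:

```python
import numpy as np

# Hypothetical column order for the six nuclei; not prescribed by the paper.
NUCLEI = ("C", "CA", "CB", "H", "HA", "N")


def select_nuclei(features, nuclei):
    """Keep only the columns of an (n, 6) feature matrix for given nuclei."""
    cols = [NUCLEI.index(name) for name in nuclei]
    return np.asarray(features, dtype=float)[:, cols]
```

For example, `select_nuclei(X, ("CA", "CB", "N"))` would produce the three-dimensional vectors corresponding to the C α, C β, N row of Table 2; the mean vectors and covariance matrices are then estimated in that reduced space.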
To test the method and facilitate comparison with other features, we used amino acid compositions (AAC) as inputs to quadratic discriminant analysis. The results of this comparison are also recorded in Table 2. They show that the performance of CSs is superior to that of AAC for supersecondary structure prediction, except for the HE structure (compared with six CSs).
4. Conclusions
In this paper, we have introduced a prediction model for supersecondary structures based on protein chemical shifts. Our model is both simple and easy to apply. However, because both supersecondary structure information and the corresponding chemical shifts of all six nuclei are required, only 114 proteins could be selected for this study. Based on the benchmark dataset, we investigated the relationship between supersecondary structures and chemical shifts. We achieved an overall accuracy of 77.3% by using six CSs as features and a maximum overall accuracy of 89.2% by using the combination of the CSs of N, C α, and C β. The results show that chemical shift is a good parameter for the prediction of the four kinds of protein supersecondary structures. In summary, we expect chemical shifts to become a new parameter for the prediction of protein supersecondary structures in the near future.
Acknowledgments
The author is grateful to the anonymous reviewers for their valuable suggestions and comments, which have led to the improvement of this paper. The work was supported by Inner Mongolia Agriculture University PhD Research Fund (no. BJ08-30) and Basic Science of Inner Mongolia Agriculture University Research Fund (no. JC2013004).
Appendix
See Table 3.
Table 3.
The 114 PDB chains used in this work.
| 1a6g | 1a6j | 1a7g | 1ail | 1akh | 1am7 | 1avs | 1b2v |
| 1b56 | 1bdo | 1bed | 1bgf | 1bja | 1by9 | 1byf | 1c44 |
| 1cex | 1cy5 | 1dfu | 1dhn | 1dqe | 1dtl | 1dyt | 1e0c |
| 1edh | 1ejf | 1ekg | 1epf | 1ew4 | 1f2l | 1f35 | 1f3v |
| 1f80 | 1F8H | 1fdq | 1ff3 | 1fil | 1g6a | 1g6h | 1gaw |
| 1gns | 1gnu | 1go4 | 1gwy | 1gwy | 1h4a | 1h70 | 1hcb |
| 1hfc | 1hh8 | 1hrh | 1hsl | 1huu | 1i4f | 1ifo | 1iho |
| 1iko | 1iw0 | 1iwm | 1j1v | 1j54 | 1j7d | 1j97 | 1jr1 |
| 1jiw | 1jr2 | 1jl3 | 1jrl | 1jhf | 1k82 | 1l0s | 1l1d |
| 1l6x | 1lfo | 1ljp | 1lld | 1m1f | 1ml4 | 1mo1 | 1mxe |
| 1naq | 1ng2 | 1o15 | 1o5u | 1oqr | 1osp | 1php | 1ppf |
| 1pz4 | 1q4r | 1qav | 1qfj | 1qg7 | 1qog | 1qst | 1r5r |
| 1rro | 1rsy | 1scj | 1slm | 1snc | 1t15 | 1tkv | 1tn3 |
| 1tph | 1umu | 1uoh | 1uuh | 1uv0 | 1vap | 1vjh | 1ycq |
| 1ze3 | 256b |
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
References
- 1. Blundell T, Carney D, Gardner S, et al. Knowledge-based protein modelling and design. European Journal of Biochemistry. 1988;172(3):513–520. doi: 10.1111/j.1432-1033.1988.tb13917.x.
- 2. Dyson HJ, Wright PE. Peptide conformation and protein folding. Current Opinion in Structural Biology. 1993;3(1):60–65.
- 3. Sun Z, Rao X, Peng L, Xu D. Prediction of protein supersecondary structures based on the artificial neural network method. Protein Engineering. 1997;10(7):763–769. doi: 10.1093/protein/10.7.763.
- 4. Chou KC. Prediction of beta-turns in proteins. Journal of Peptide Research. 1997;49:120–144.
- 5. Chou K-C, Blinn JR. Classification and prediction of β-turn types. Journal of Protein Chemistry. 1997;16(6):575–595. doi: 10.1023/a:1026366706677.
- 6. Chou K-C. Prediction of tight turns and their types in proteins. Analytical Biochemistry. 2000;286(1):1–16. doi: 10.1006/abio.2000.4757.
- 7. Chou K-C. Prediction and classification of α-turn types. Biopolymers. 1997;42(7):837–853. doi: 10.1002/(sici)1097-0282(199712)42:7<837::aid-bip9>3.0.co;2-u.
- 8. de la Cruz X, Hutchinson EG, Shepherd A, Thornton JM. Toward predicting protein topology: an approach to identifying β hairpins. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(17):11157–11162. doi: 10.1073/pnas.162376199.
- 9. Hu XZ, Li QZ. Prediction of the β-hairpins in proteins using support vector machine. Protein Journal. 2008;27(2):115–122. doi: 10.1007/s10930-007-9114-z.
- 10. Zou DS, He ZS, He JY, Xia Y. Supersecondary structure prediction using Chou's pseudo amino acid composition. Journal of Computational Chemistry. 2011;32(2):271–278. doi: 10.1002/jcc.21616.
- 11. Case DA. The use of chemical shifts and their anisotropies in biomolecular structure determination. Current Opinion in Structural Biology. 1998;8(5):624–630. doi: 10.1016/s0959-440x(98)80155-3.
- 12. Wishart DS, Case DA. Use of chemical shifts in macromolecular structure determination. Methods in Enzymology. 2001;338:3–34. doi: 10.1016/s0076-6879(02)38214-4.
- 13. Cavalli A, Salvatella X, Dobson CM, Vendruscolo M. Protein structure determination from NMR chemical shifts. Proceedings of the National Academy of Sciences of the United States of America. 2007;104(23):9615–9620. doi: 10.1073/pnas.0610313104.
- 14. Shen Y, Lange O, Delaglio F, et al. Consistent blind protein structure generation from NMR chemical shift data. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(12):4685–4690. doi: 10.1073/pnas.0800256105.
- 15. Lin H, Ding C, Song Q, et al. The prediction of protein structural class using averaged chemical shifts. Journal of Biomolecular Structure & Dynamics. 2012;29(6):643–649. doi: 10.1080/07391102.2011.672628.
- 16. Mechelke M, Habeck M. A probabilistic model for secondary structure prediction from protein chemical shifts. Proteins. 2013;81(6):984–993. doi: 10.1002/prot.24249.
- 17. Mielke SP, Krishnan VV. Protein structural class identification directly from NMR spectra using averaged chemical shifts. Bioinformatics. 2003;19(16):2054–2064. doi: 10.1093/bioinformatics/btg280.
- 18. Pastore A, Saudek V. The relationship between chemical shift and secondary structure in proteins. Journal of Magnetic Resonance. 1990;90(1):165–176.
- 19. Wang Y. Secondary structural effects on protein NMR chemical shifts. Journal of Biomolecular NMR. 2004;30(3):233–244. doi: 10.1007/s10858-004-3098-1.
- 20. Mao WS, Cong PS, Wang ZH, Lu LJ, Zhu ZL, Li TH. NMRDSP: an accurate prediction of protein shape strings from NMR chemical shifts and sequence data. PLoS ONE. 2013;8(12):e83532. doi: 10.1371/journal.pone.0083532.
- 21. Zhang H, Neal S, Wishart DS. RefDB: a database of uniformly referenced protein chemical shifts. Journal of Biomolecular NMR. 2003;25(3):173–195. doi: 10.1023/a:1022836027055.
- 22. Fernandez-Fuentes N, Hermoso A, Espadaler J, Querol E, Aviles FX, Oliva B. Classification of common functional loops of kinase super-families. Proteins. 2004;56(3):539–555. doi: 10.1002/prot.20136.
- 23. Wang G, Dunbrack RL Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Research. 2005;33(2):W94–W98. doi: 10.1093/nar/gki402.
- 24. Feng Y, Luo L. Use of tetrapeptide signals for protein secondary-structure prediction. Amino Acids. 2008;35(3):607–614. doi: 10.1007/s00726-008-0089-7.
- 25. Chou K-C, Shen H-B. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols. 2008;3(2):153–162. doi: 10.1038/nprot.2007.494.
- 26. Chou K-C. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology. 2011;273(1):236–247. doi: 10.1016/j.jtbi.2010.12.024.
- 27. Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 2010;263(2):203–209. doi: 10.1016/j.jtbi.2009.11.016.
- 28. Hayat M, Khan A. Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou’s PseAAC. Protein and Peptide Letters. 2012;19(4):411–421. doi: 10.2174/092986612799789387.
- 29. Ding C, Yuan L-F, Guo S-H, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. Journal of Proteomics. 2012;77:321–328. doi: 10.1016/j.jprot.2012.09.006.
- 30. Chen C, Shen Z-B, Zou X-Y. Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou’s pseudo amino acid composition. Protein and Peptide Letters. 2012;19(4):422–429. doi: 10.2174/092986612799789332.
- 31. Chou K-C, Shen H-B. Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE. 2010;5(6):e11335. doi: 10.1371/journal.pone.0011335.
- 32. Chen W, Feng P-M, Lin H, Chou K-C. IRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013;41(6, article e68). doi: 10.1093/nar/gks1450.
- 33. Chen W, Lin H, Feng P-M, Ding C, Zuo Y-C, Chou K-C. iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE. 2012;7(10):e47843. doi: 10.1371/journal.pone.0047843.
- 34. Lin H, Chen W, Yuan L-F, Li Z-Q, Ding H. Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheoretica. 2013;61(2):259–268. doi: 10.1007/s10441-013-9181-9.
- 35. Lin H, Ding C, Yuan L-F, et al. Predicting subchloroplast locations of proteins based on the general form of Chou’s pseudo amino acid composition: approached from optimal tripeptide composition. International Journal of Biomathematics. 2013;6(2)13500034
- 36. Lin W-Z, Fang J-A, Xiao X, Chou K-C. ILoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. Molecular BioSystems. 2013;9(4):634–644. doi: 10.1039/c3mb25466f.
- 37. Xiao X, Wang P, Lin W-Z, Jia J-H, Chou K-C. IAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry. 2013;436(2):168–177. doi: 10.1016/j.ab.2013.01.019.
