Abstract
We demonstrate the applicability of our previously developed Bayesian probabilistic approach for predicting residue solvent accessibility to the problem of predicting secondary structure. Using only single-sequence data, this method achieves a three-state accuracy of 67% over a database of 473 non-homologous proteins. This approach is more amenable to inspection and less likely to overlearn specifics of a dataset than "black box" methods such as neural networks. It is also conceptually simpler and less computationally costly. We also introduce a novel method for representing and incorporating multiple-sequence alignment information within the prediction algorithm, achieving 72% accuracy over a dataset of 304 non-homologous proteins. This is accomplished by creating a statistical model of the evolutionarily derived correlations between patterns of amino acid substitution and local protein structure. This model consists of parameter vectors, termed "substitution schemata," which probabilistically encode the structure-based heterogeneity in the distributions of amino acid substitutions found in alignments of homologous proteins. The model is optimized for structure prediction by maximizing the mutual information between the set of schemata and the database of secondary structures. Unlike "expert heuristic" methods, this approach has been demonstrated to work well over large datasets. Unlike the opaque neural network algorithms, this approach is physicochemically intelligible. Moreover, the model optimization procedure, the formalism for predicting one-dimensional structural features and our previously developed method for tertiary structure recognition all share a common Bayesian probabilistic basis. This consistency starkly contrasts with the hybrid and ad hoc nature of methods that have dominated this field in recent years.
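As an illustration of the two quantities named in the abstract, the following minimal sketch computes a three-state accuracy and the mutual information between a discrete per-residue assignment and secondary-structure labels. It is not the authors' optimization procedure. The function names, the use of integer schema indices, and the H/E/C label alphabet are assumptions made for this example only.

```python
import math
from collections import Counter


def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def mutual_information(schema_labels, structure_labels):
    """Mutual information, in bits, between two equal-length label sequences.

    schema_labels: hypothetical per-residue schema indices (assumption).
    structure_labels: per-residue secondary-structure states, e.g. H, E, C.
    """
    n = len(schema_labels)
    if n != len(structure_labels) or n == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    joint = Counter(zip(schema_labels, structure_labels))
    schema_counts = Counter(schema_labels)
    structure_counts = Counter(structure_labels)
    mi = 0.0
    for (s, t), count in joint.items():
        p_joint = count / n
        # p(s,t) / (p(s) * p(t)) written in terms of raw counts
        ratio = count * n / (schema_counts[s] * structure_counts[t])
        mi += p_joint * math.log2(ratio)
    return mi


if __name__ == "__main__":
    print(q3_accuracy("HHEC", "HHCC"))
    print(mutual_information([0, 0, 1, 1], "HHEE"))
    print(mutual_information([0, 1, 0, 1], "HHEE"))
```

In this toy setting, an assignment that tracks the structure labels exactly carries the full entropy of those labels, while an assignment that is statistically independent of them carries none. A model tuned to maximize this quantity would therefore favor schemata whose occurrence is informative about local structure.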