Abstract
We have developed a new representation for structural and functional motifs in protein sequences based on correlations between pairs of amino acids and applied it to alpha-helical and beta-sheet sequences. Existing probabilistic methods for representing and analyzing protein sequences have traditionally assumed conditional independence of evidence. In other words, amino acids are assumed to have no effect on each other. However, analyses of protein structures have repeatedly demonstrated the importance of interactions between amino acids in conferring both structure and function. Using Bayesian networks, we are able to model the relationships between amino acids at distinct positions in a protein sequence in addition to the amino acid distributions at each position. We have also developed an automated program for discovering sequence correlations using standard statistical tests and validation techniques. In this paper, we test this program on sequences from secondary structure motifs, namely alpha-helices and beta-sheets. In each case, the correlations our program discovers correspond well with known physical and chemical interactions between amino acids in structures. Furthermore, we show that, using different chemical alphabets for the amino acids, we discover structural relationships based on the same chemical principle used in constructing the alphabet. This new representation of 3-dimensional features in protein motifs, such as those arising from structural or functional constraints on the sequence, can be used to improve sequence analysis tools including pattern analysis and database search.
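As a rough illustration of what testing for a correlation between two sequence positions can look like, the sketch below computes a Pearson chi-square statistic for every pair of columns in a set of equal-length sequences and ranks the pairs by that statistic. The abstract says only that "standard statistical tests" were used, so the choice of chi-square is an assumption here. The function names `chi_square` and `rank_pairs` and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def chi_square(seqs, i, j):
    """Return (statistic, degrees of freedom) for independence of columns i and j.

    Assumption: a Pearson chi-square test stands in for the unspecified
    "standard statistical tests" mentioned in the abstract.
    """
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)  # observed joint counts
    left = Counter(s[i] for s in seqs)                 # marginal counts, column i
    right = Counter(s[j] for s in seqs)                # marginal counts, column j
    stat = 0.0
    for a in left:
        for b in right:
            expected = left[a] * right[b] / n          # count expected under independence
            observed = pair_counts.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(left) - 1) * (len(right) - 1)
    return stat, dof


def rank_pairs(seqs):
    """Score every pair of positions and sort by descending statistic."""
    length = len(seqs[0])
    scored = []
    for i, j in combinations(range(length), 2):
        stat, dof = chi_square(seqs, i, j)
        scored.append((stat, dof, i, j))
    return sorted(scored, reverse=True)


# Hypothetical toy data: positions 0 and 4 always carry opposite charges (E/K),
# while position 2 varies independently of position 0.
toy = ["EAAAK", "KAAAE", "EALAK", "KALAE", "EAAAK", "KAAAE"]

if __name__ == "__main__":
    for stat, dof, i, j in rank_pairs(toy)[:3]:
        print(f"positions {i},{j}: chi2={stat:.2f}, dof={dof}")
```

On the toy data the pair (0, 4) ranks first, since the two columns co-vary perfectly. With real sequence sets many cells of the contingency table have small counts, so the chi-square approximation becomes unreliable and a significance threshold would need the kind of validation the abstract mentions. A reduced chemical alphabet (for example, grouping residues by charge or hydrophobicity before counting) is one way to raise the per-cell counts.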