Abstract
A comprehensive database is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and the k-tuplet analysis both give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman gambler" is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess the number of amino acids that could appropriately be substituted at each position in a sequence. As is the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. This number is much smaller than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.
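The k-tuplet estimate and the count of most probable sequences described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the common block-entropy formulation, in which the conditional entropy of a residue given its k-1 predecessors is H(k) - H(k-1), and it assumes the standard approximation that the number of most probable sequences of length N is about 2^(N·H). The function names and the sequence length used in the note below are hypothetical choices for illustration. On a finite database the estimate is biased downward as k grows, so the sketch shows only the arithmetic.

```python
import math
from collections import Counter


def block_entropy(seq, k):
    """Shannon entropy in bits of the overlapping k-tuplets found in seq."""
    if k == 0:
        return 0.0  # the empty tuplet carries no information
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(seq, k):
    """Entropy of one residue given the k-1 preceding residues: H(k) - H(k-1)."""
    return block_entropy(seq, k) - block_entropy(seq, k - 1)


def log10_typical_sequences(entropy_bits, length):
    """log10 of 2**(length * entropy_bits), the approximate number of
    most probable sequences of the given length (assumed approximation)."""
    return length * entropy_bits * math.log10(2)
```

For a hypothetical chain of 100 residues, `log10_typical_sequences(2.5, 100)` is about 75, while the unconstrained count 20^100 corresponds to about 130, which illustrates how an entropy of 2.5 bits/amino acid shrinks the set of most probable sequences relative to all possible sequences.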