Abstract
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a set of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In the third stage, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.
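To make the first training stage more concrete, the following is a minimal sketch of a one-dimensional SOM trained on ungapped sequence segments. It is an illustration under stated assumptions, not the authors' implementation: segments are one-hot encoded and compared by squared Euclidean distance (the abstract instead describes an initial amino acid similarity matrix), the map size, learning-rate schedule, and neighbourhood radius are arbitrary choices, and all function names and the toy sequences are hypothetical. The second and third stages (matrix selection and feature positions) are not covered.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def windows(seq, width):
    """Return every ungapped segment of the given width from one sequence."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def encode(segment):
    """One-hot encode a segment (assumption: stands in for a similarity matrix)."""
    vec = [0.0] * (len(AMINO_ACIDS) * len(segment))
    for pos, aa in enumerate(segment):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_unit(weights, vec):
    """Index of the map unit whose weight vector is closest to vec."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], vec))


def train_som(vectors, n_units=6, epochs=30, seed=0):
    """Train a 1-D SOM with a Gaussian neighbourhood and linearly decaying rates."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Small random initial weights for every unit on the map.
    weights = [[rng.uniform(0.0, 0.1) for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # goes from 1 towards 0
        rate = 0.5 * frac + 0.01              # learning rate (arbitrary schedule)
        radius = (n_units / 2.0) * frac + 0.5  # neighbourhood width
        order = vectors[:]
        rng.shuffle(order)
        for vec in order:
            bmu = best_unit(weights, vec)
            for k, w in enumerate(weights):
                # Units near the best-matching unit are pulled more strongly.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * radius ** 2))
                for j in range(dim):
                    w[j] += rate * h * (vec[j] - w[j])
    return weights


def mean_error(weights, vectors):
    """Average squared distance between each vector and its best-matching unit."""
    return sum(sq_dist(weights[best_unit(weights, v)], v) for v in vectors) / len(vectors)


if __name__ == "__main__":
    # Toy, invented sequences that share the segment "AYIAKQ".
    sequences = ["MKTAYIAKQR", "GSAYIAKQLL", "PPTAYIAKQW"]
    segments = [s for seq in sequences for s in windows(seq, 6)]
    vectors = [encode(s) for s in segments]
    som = train_som(vectors)
    for seg, vec in zip(segments, vectors):
        print(seg, "-> unit", best_unit(som, vec))
```

In this sketch, segments that map to the same unit can be read as candidates for one shared feature. A fixed window width is used for simplicity, whereas the abstract reports features of varying length (usually 4-16 residues).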
