Abstract
We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm, using as inputs the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences longer than 50 amino acids stored in the SwissProt database (release 19.0). In the final 2-dimensional, topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. In an attempt to shorten the time-consuming learning procedure, we also compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]) and another of 30 epochs (6.7 CPU-h). Representing the sequences with a matrix of 11 x 11 components reduced the learning time further, by a factor of about 3.3, with similar protein clustering results. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis.
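The pipeline summarized in the abstract (dipeptide composition as input, a Kohonen map as classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. The sketch assumes a 20 x 20 dipeptide matrix flattened to 400 frequencies, a Gaussian neighborhood function, and linearly decaying learning rate and radius. The abstract states none of these details, and all function names and toy sequences below are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_vector(sequence):
    """Return the 400 dipeptide frequencies of a sequence as a flat list."""
    counts = [0.0] * (len(AMINO_ACIDS) ** 2)
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        if first in INDEX and second in INDEX:  # skip non-standard residues
            counts[INDEX[first] * len(AMINO_ACIDS) + INDEX[second]] += 1.0
            total += 1
    return [c / total for c in counts] if total else counts


def best_matching_unit(weights, vector):
    """Return the (row, col) of the neuron whose weights are closest to vector."""
    best, best_dist = None, math.inf
    for position, w in weights.items():
        dist = sum((a - b) ** 2 for a, b in zip(w, vector))
        if dist < best_dist:
            best, best_dist = position, dist
    return best


def train_som(vectors, side=15, epochs=30, seed=0):
    """Train a side x side Kohonen map; returns {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() / dim for _ in range(dim)]
               for r in range(side) for c in range(side)}
    for epoch in range(epochs):
        frac = epoch / max(1, epochs - 1)
        # Assumed schedules: linear decay of learning rate and neighborhood radius.
        rate = 0.5 * (1.0 - frac) + 0.01 * frac
        radius = (side / 2.0) * (1.0 - frac) + 1.0 * frac
        for vector in vectors:
            br, bc = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                # Assumed Gaussian neighborhood around the winning neuron.
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                if influence < 1e-3:
                    continue  # neuron too far from the winner to matter
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


if __name__ == "__main__":
    # Toy sequences, purely illustrative; not taken from SwissProt.
    toy = ["ACDEFGHIKLACDEFGHIKL", "ACDEFGHIKLMNPQACDEFG", "WYWYWYVVTTSSRRQQPPNN"]
    vectors = [dipeptide_vector(s) for s in toy]
    som = train_som(vectors, side=4, epochs=10)
    for s, v in zip(toy, vectors):
        print(s, "->", best_matching_unit(som, v))
```

Classifying a new protein needs only one `best_matching_unit` call against the trained map, which fits the abstract's point that classification is fast even though training is slow. If the full matrix is indeed 20 x 20, an 11 x 11 matrix has 121 components instead of 400, and 400/121 ≈ 3.3, which matches the reported reduction in learning time.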