Abstract
We present a method for assigning functions to uncharacterized sequences by finding correlations between short sequence signals and functional annotations in a protein database. The approach uses the keyword (KW) and feature (FT) information stored in the SWISS-PROT database: keywords describe particular protein characteristics, and features locate those characteristics at specific sequence positions. A keyword is therefore assigned to a sequence only if sequence similarity is found at the position described by the FT field. Exhaustive tests on sequences with homologues (cluster set) and without homologues (singleton set) in the database show that function assignment is much 'cleaner' when domain information (FT field) is used than when keywords alone are used.
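The position-aware rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Feature` and `Hit` types, the explicit link between a feature and a keyword, and the `min_cover` threshold of 0.8 are all assumptions made for illustration. The sketch contrasts keyword-only transfer with transfer restricted to keywords whose FT region is covered by the aligned part of the database entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical view of an FT annotation tied to one keyword."""
    keyword: str  # keyword (KW) this feature is assumed to support
    start: int    # 1-based inclusive start of the annotated region
    end: int      # 1-based inclusive end of the annotated region


@dataclass(frozen=True)
class Hit:
    """Region of the database entry covered by a local alignment."""
    subject_start: int  # 1-based inclusive
    subject_end: int    # 1-based inclusive


def overlap_fraction(feature: Feature, hit: Hit) -> float:
    """Fraction of the feature region that lies inside the aligned region."""
    lo = max(feature.start, hit.subject_start)
    hi = min(feature.end, hit.subject_end)
    covered = max(0, hi - lo + 1)
    return covered / (feature.end - feature.start + 1)


def transfer_keywords(keywords, features, hit, min_cover=0.8):
    """Return (keyword-only transfer, FT-supported transfer).

    min_cover is an illustrative threshold, not a value from the article.
    """
    kw_only = set(keywords)
    ft_supported = {
        f.keyword
        for f in features
        if f.keyword in kw_only and overlap_fraction(f, hit) >= min_cover
    }
    return kw_only, ft_supported


# Made-up entry: two keywords, each located by one feature.
entry_keywords = ["Zinc-finger", "Transmembrane"]
entry_features = [
    Feature("Zinc-finger", 10, 30),
    Feature("Transmembrane", 200, 220),
]
# The query aligns only to residues 5-60 of the database entry.
hit = Hit(subject_start=5, subject_end=60)

kw_only, ft_supported = transfer_keywords(entry_keywords, entry_features, hit)
print(sorted(kw_only))       # both keywords are transferred
print(sorted(ft_supported))  # only the keyword whose region was aligned
```

In this made-up example, keyword-only transfer would label the query with both keywords, while the position-aware rule keeps only the one whose annotated region falls inside the alignment.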