Abstract
The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class using only information available from the sequence. These rules predict a functional class for 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.