Abstract
The Database of Interacting Proteins (DIP) (Xenarios et al., 2000) is a large repository of protein interactions: its March 2000 release included 2379 protein pairs whose interactions have been detected by experimental methods. Although many of these pairs involve poorly characterized proteins identified in large-scale yeast two-hybrid screens, 851 correspond to interactions detected using direct biochemical methods.
We used information retrieval technology to search automatically for sentences in Medline abstracts that support these 851 DIP interactions. Surprisingly, we found Medline sentences describing the interaction for only 30% of the DIP protein pairs. This low coverage has interesting consequences for the quality of the annotations (references) entered in the database and for the limits of applying information extraction (IE) technology to molecular biology. The restriction to abstracts rather than full papers and the lack of standard protein names are clearly far greater obstacles than the limitations of the IE methodology employed. A positive finding is that the IE system can identify new relations between proteins, even within a set of proteins previously characterized by human experts, and that it does so with considerable precision.
This is, to our knowledge, the first large-scale assessment of the capacity of IE to detect previously known interactions. We therefore propose the DIP data set as a biological reference for benchmarking IE systems.
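The coverage figure reported above can be illustrated with a minimal sketch. It is not the IE system used in this work, whose rules are not given in the abstract. It assumes a simple co-occurrence criterion: a pair counts as supported when both protein names and an interaction word appear in the same sentence. The word list, the function names, and the placeholder protein names (ProtA, ProtB, ...) are all hypothetical.

```python
import re

# Hypothetical interaction vocabulary; the real system's word list is not given here.
INTERACTION_WORDS = ("interacts", "binds", "associates", "phosphorylates")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, name):
    """True if `name` occurs in `sentence` as a whole token, ignoring case."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def supporting_sentences(pair, abstracts):
    """Return sentences that name both proteins and contain an interaction word."""
    first, second = pair
    hits = []
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            lowered = sentence.lower()
            has_verb = any(word in lowered for word in INTERACTION_WORDS)
            if has_verb and mentions(sentence, first) and mentions(sentence, second):
                hits.append(sentence)
    return hits


def coverage(pairs, abstracts):
    """Fraction of protein pairs with at least one supporting sentence."""
    if not pairs:
        return 0.0
    supported = sum(1 for pair in pairs if supporting_sentences(pair, abstracts))
    return supported / len(pairs)


# Toy data with placeholder names; not taken from DIP or Medline.
abstracts = [
    "ProtA binds ProtB in vitro. The complex is stable.",
    "ProtC was purified. ProtD interacts with ProtE.",
]
pairs = [("ProtA", "ProtB"), ("ProtC", "ProtD"), ("ProtF", "ProtG")]

print(supporting_sentences(("ProtA", "ProtB"), abstracts))
print(coverage(pairs, abstracts))
```

In this toy run only the first pair is supported. ProtC and ProtD appear in the same abstract but in different sentences, and ProtF and ProtG are never named, which loosely mirrors the two obstacles highlighted above: evidence lying outside the sentences examined and names that do not match.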
Selected References
These references are in PubMed. This may not be the complete list of references from this article.
- Andrade M. A., Valencia A. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics. 1998;14(7):600–607. doi: 10.1093/bioinformatics/14.7.600.
- Bader G. D., Donaldson I., Wolting C., Ouellette B. F., Pawson T., Hogue C. W. BIND--The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001 Jan 1;29(1):242–245. doi: 10.1093/nar/29.1.242.
- Bairoch A., Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000 Jan 1;28(1):45–48. doi: 10.1093/nar/28.1.45.
- Benson D. A., Boguski M. S., Lipman D. J., Ostell J., Ouellette B. F. GenBank. Nucleic Acids Res. 1998 Jan 1;26(1):1–7. doi: 10.1093/nar/26.1.1.
- Blaschke C., Andrade M. A., Ouzounis C., Valencia A. Automatic extraction of biological information from scientific text: protein-protein interactions. Proc Int Conf Intell Syst Mol Biol. 1999:60–67.
- Blaschke C., Oliveros J. C., Valencia A. Mining functional information associated with expression arrays. Funct Integr Genomics. 2001 Mar;1(4):256–268. doi: 10.1007/s101420000036.
- Chien C. T., Bartel P. L., Sternglanz R., Fields S. The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc Natl Acad Sci U S A. 1991 Nov 1;88(21):9578–9582. doi: 10.1073/pnas.88.21.9578.
- Eilbeck K., Brass A., Paton N., Hodgman C. INTERACT: an object oriented protein-protein interaction database. Proc Int Conf Intell Syst Mol Biol. 1999:87–94.
- Eisenberg D., Marcotte E. M., Xenarios I., Yeates T. O. Protein function in the post-genomic era. Nature. 2000 Jun 15;405(6788):823–826. doi: 10.1038/35015694.
- Enright A. J., Iliopoulos I., Kyrpides N. C., Ouzounis C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature. 1999 Nov 4;402(6757):86–90. doi: 10.1038/47056.
- Fromont-Racine M., Rain J. C., Legrain P. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet. 1997 Jul;16(3):277–282. doi: 10.1038/ng0797-277.
- Fukuda K., Tamura A., Tsunoda T., Takagi T. Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput. 1998:707–718.
- Hishiki T., Collier N., Nobata C., Okazaki-Ohta T., Ogata N., Sekimizu T., Steiner R., Park H. S., Tsujii J. Developing NLP Tools for Genome Informatics: An Information Extraction Perspective. Genome Inform Ser Workshop Genome Inform. 1998;9:81–90.
- Ito T., Tashiro K., Muta S., Ozawa R., Chiba T., Nishizawa M., Yamamoto K., Kuhara S., Sakaki Y. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A. 2000 Feb 1;97(3):1143–1147. doi: 10.1073/pnas.97.3.1143.
- Jenssen T. K., Laegreid A., Komorowski J., Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001 May;28(1):21–28. doi: 10.1038/ng0501-21.
- Ohta Y., Yamamoto Y., Okazaki T., Uchiyama I., Takagi T. Automatic construction of knowledge base from biological papers. Proc Int Conf Intell Syst Mol Biol. 1997;5:218–225.
- Proux D., Rechenmann F., Julliard L. A pragmatic information extraction strategy for gathering data on genetic interactions. Proc Int Conf Intell Syst Mol Biol. 2000;8:279–285.
- Proux D., Rechenmann F., Julliard L., Pillet V., Jacq B. Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Inform Ser Workshop Genome Inform. 1998;9:72–80.
- Rain J. C., Selig L., De Reuse H., Battaglia V., Reverdy C., Simon S., Lenzen G., Petel F., Wojcik J., Schächter V. The protein-protein interaction map of Helicobacter pylori. Nature. 2001 Jan 11;409(6817):211–215. doi: 10.1038/35051615.
- Rindflesch T. C., Hunter L., Aronson A. R. Mining molecular binding terminology from biomedical text. Proc AMIA Symp. 1999:127–131.
- Schwikowski B., Uetz P., Fields S. A network of protein-protein interactions in yeast. Nat Biotechnol. 2000 Dec;18(12):1257–1261. doi: 10.1038/82360.
- Stapley B. J., Benoit G. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput. 2000:529–540. doi: 10.1142/9789814447331_0050.
- Sussman J. L., Lin D., Jiang J., Manning N. O., Prilusky J., Ritter O., Abola E. E. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998 Nov 1;54(Pt 6 1):1078–1084. doi: 10.1107/s0907444998009378.
- Tanabe L., Scherf U., Smith L. H., Lee J. K., Hunter L., Weinstein J. N. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999 Dec;27(6):1210-4, 1216-7. doi: 10.2144/99276bc03.
- Thomas J., Milward D., Ouzounis C., Pulman S., Carroll M. Automatic extraction of protein interactions from scientific abstracts. Pac Symp Biocomput. 2000:541–552. doi: 10.1142/9789814447331_0051.
- Uetz P., Giot L., Cagney G., Mansfield T. A., Judson R. S., Knight J. R., Lockshon D., Narayan V., Srinivasan M., Pochart P. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000 Feb 10;403(6770):623–627. doi: 10.1038/35001009.
- Xenarios I., Rice D. W., Salwinski L., Baron M. K., Marcotte E. M., Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000 Jan 1;28(1):289–291. doi: 10.1093/nar/28.1.289.
- Yakushiji A., Tateisi Y., Miyao Y., Tsujii J. Event extraction from biomedical papers using a full parser. Pac Symp Biocomput. 2001:408–419. doi: 10.1142/9789814447362_0040.
