EURASIP Journal on Bioinformatics and Systems Biology. 2007 Dec 9;2007(1):53096. doi: 10.1155/2007/53096

Extraction of Protein Interaction Data: A Comparative Analysis of Methods in Use

Hena Jose, Thangavel Vadivukarasi, Jyothi Devakumar
PMCID: PMC3171344  PMID: 18274648

Abstract

Several natural language processing (NLP) tools, both commercial and freely available, are used to extract protein interactions from publications. The methods these tools use range from pattern matching to dynamic programming, and each tool reports its own recall and precision rates. A methodical survey that compares these tools with manual analysis, keeping in mind the minimum interaction information a researcher would need, has not been carried out. We compared data generated by selected NLP tools with manually curated protein interaction data (PathArt and IMaps) to determine their recall and precision rates. The rates were found to be lower than the published scores when a normalized definition of interaction is applied. Each data point that a tool captured wrongly or failed to pick up was analyzed. Our evaluation highlights critical failures of NLP tools and provides pointers for the development of an ideal NLP tool.
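
As a rough illustration of the kind of comparison described above, the following sketch scores a set of tool-extracted interactions against a manually curated set. It is not the authors' procedure: the function names `normalize` and `precision_recall` are hypothetical, the protein pairs are made-up example data rather than PathArt or IMaps records, and treating an interaction as an unordered, case-insensitive pair of protein names is only an assumed stand-in for the normalized definition of interaction used in the study.

```python
from typing import FrozenSet, Set, Tuple

# An interaction is modeled as an unordered pair of protein names.
# ASSUMPTION: this is a simplified stand-in, not the paper's definition.
Interaction = FrozenSet[str]


def normalize(protein_a: str, protein_b: str) -> Interaction:
    """Return an order-independent, case-insensitive interaction key."""
    return frozenset({protein_a.strip().upper(), protein_b.strip().upper()})


def precision_recall(
    extracted: Set[Interaction], curated: Set[Interaction]
) -> Tuple[float, float]:
    """Score tool output against a manually curated reference set.

    precision = correct extractions / all extractions
    recall    = correct extractions / all curated interactions
    """
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Illustrative data only (hypothetical, not taken from the study).
curated = {
    normalize("TP53", "MDM2"),
    normalize("EGFR", "GRB2"),
    normalize("AKT1", "GSK3B"),
    normalize("RAF1", "MAP2K1"),
}
extracted = {
    normalize("mdm2", "tp53"),   # matches a curated pair after normalization
    normalize("EGFR", "GRB2"),   # matches a curated pair
    normalize("TP53", "EGFR"),   # not in the curated set: a false positive
}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A stricter definition of an interaction (for example, one that also requires the interaction type or direction to match) would shrink the set of true positives, which is one way measured rates can fall below published scores.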

[1–18]

Contributor Information

Hena Jose, Email: hena_jose@jubilantbiosys.com.

Thangavel Vadivukarasi, Email: vadivukarasi.t@jubilantbiosys.com.

Jyothi Devakumar, Email: dr_devakumar@jubilantbiosys.com.

References

  1. Hunter L, Cohen KB. Biomedical language processing: what's beyond PubMed? Molecular Cell. 2006;21(5):589–594. doi: 10.1016/j.molcel.2006.02.012.
  2. Fukuda K, Tamura A, Tsunoda T, Takagi T. Toward information extraction: identifying protein names from biological papers. Pacific Symposium on Biocomputing. 1998. pp. 707–718.
  3. Stephens M, Palakal M, Mukhopadhyay S, Raje R, Mostafa J. Detecting gene relations from Medline abstracts. Pacific Symposium on Biocomputing. 2001. pp. 483–495.
  4. Sekimizu T, Park HS, Tsujii J. Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts. Genome Informatics. 1998;9:62–71.
  5. Novichkova S, Egorov S, Daraselia N. MedScan, a natural language processing engine for Medline abstracts. Bioinformatics. 2003;19(13):1699–1706. doi: 10.1093/bioinformatics/btg207.
  6. Yakushiji A, Tateisi Y, Miyao Y, Tsujii J. Event extraction from biomedical papers using a full parser. Pacific Symposium on Biocomputing. 2001. pp. 408–419.
  7. Thomas J, Milward D, Ouzounis C, Pulman S, Carroll M. Automatic extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing. 2000. pp. 541–552.
  8. Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics. 2004;20(18):3604–3612. doi: 10.1093/bioinformatics/bth451.
  9. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005;21(11):2759–2765. doi: 10.1093/bioinformatics/bti390.
  10. Jenssen T-K, Lægreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics. 2001;28(1):21–28. doi: 10.1038/ng0501-21.
  11. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics. 2001;17(suppl 1):S74–S82. doi: 10.1093/bioinformatics/17.suppl_1.S74.
  12. Corney DPA, Buxton BF, Langdon WB, Jones DT. BioRAT: extracting biological information from full-length papers. Bioinformatics. 2004;20(17):3206–3213. doi: 10.1093/bioinformatics/bth386.
  13. Ahmed ST, Chidambaram D, Davulcu H, Baral C. IntEx: a syntactic role driven protein-protein interaction extractor for bio-medical text. Association for Computational Linguistics. 2005. pp. 54–61.
  14. Eom J, Zhang B. PubMiner: machine learning-based text mining for biomedical information analysis. Genomics & Informatics. 2004;2(2):99–106.
  15. Donaldson I, Martin J, de Bruijn B, et al. PreBIND and Textomy—mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003;4(1):11–23. doi: 10.1186/1471-2105-4-11.
  16. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I. Extracting human protein interactions from Medline using a full-sentence parser. Bioinformatics. 2004;20(5):604–611. doi: 10.1093/bioinformatics/btg452.
  17. Jang H, Lim J, Lim J-H, Park S-J, Lee K-C, Park S-H. Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics. 2006;22(14):e220–e226. doi: 10.1093/bioinformatics/btl203.
  18. Corney DPA, Buxton BF, Langdon WB, Jones DT. BioRAT: extracting biological information from full-length papers. Bioinformatics. 2004;20(17):3206–3213. doi: 10.1093/bioinformatics/bth386.

