Author manuscript; available in PMC: 2012 Nov 19.
Published in final edited form as: Pac Symp Biocomput. 2010:485–487. doi: 10.1142/9789814295291_0051

EXTRACTION OF GENOTYPE-PHENOTYPE-DRUG RELATIONSHIPS FROM TEXT: FROM ENTITY RECOGNITION TO BIOINFORMATICS APPLICATION

Adrien Coulet 1,2, Nigam Shah 2, Lawrence Hunter 4, Chitta Baral 5, Russ B Altman 1,3
PMCID: PMC3501138  NIHMSID: NIHMS418474  PMID: 19904832

Abstract

Advances in concept recognition and natural language parsing have led to the development of various tools that identify biomedical entities, and the relationships between them, in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (GPD-Rx workshop) is to examine the current state of the art and to discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of the multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. To advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerge, against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from the research literature in order to identify the key groups interested in creating such a benchmark.

Keywords: NLP, Pharmacogenomics, Entity Recognition, Event Extraction, Genotype-Phenotype-Drug Relationships

1. Introduction

Efforts in the BioNLP community, such as BioCreative II1 and the BioNLP Shared Task’09,2 have led to the development of efficient BioNLP methods for entity recognition and event extraction. The aim of the GPD-Rx workshop is to discuss how the results of previous shared tasks can be adapted and improved to provide, efficiently, a detailed representation of the complex pharmacogenomic processes described in the literature. Extracting a structured and fine-grained representation is key to evaluating and validating the hypotheses that emerge from translational studies. The objective of the GPD-Rx workshop is thus to advance in this direction by identifying key groups and proposing corpora, standard vocabularies, a knowledge representation language, and evaluation methods that would enable the comparison and interoperability of future results.

2. Entity Recognition

Entity recognition or named entity recognition is the task of identifying, in free text, words that mention a known entity. Most of the efforts aimed at extracting relationships between entities start with this fundamental task in order to identify entities to be related.

Entity recognition has been extensively studied in the biomedical domain with varying results. Some of the proposed methods are generic and can identify any kind of entity that is part of a dictionary provided as a reference to the system.3,4 Other methods are specialized in the recognition of specific kinds of entities such as genes/proteins,5 genomic variations,6,7 diseases,8 or drugs.9 Machine learning approaches are commonly integrated with entity recognition methods to improve their results.10
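As an illustration of the generic, dictionary-driven family of methods, the following is a minimal sketch in Python and not the implementation of any of the cited systems. The dictionary contents, the placeholder identifiers (GENE:0001 and so on), the example sentence, and the names Mention and find_mentions are all assumptions made for this example.

```python
import re
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # surface form as it appears in the text
    entity_id: str   # identifier taken from the reference dictionary


def find_mentions(text: str, dictionary: Dict[str, str]) -> List[Mention]:
    """Return non-overlapping dictionary matches, preferring longer names."""
    if not dictionary:
        return []
    # Longest names first, so a multi-word name wins over a shorter prefix.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lowered = {name.lower(): ident for name, ident in dictionary.items()}
    return [
        Mention(m.start(), m.end(), m.group(0), lowered[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Toy dictionary: the identifiers are placeholders, not real database IDs.
TOY_DICTIONARY = {
    "CYP2C9": "GENE:0001",
    "VKORC1": "GENE:0002",
    "warfarin": "DRUG:0001",
    "bleeding": "PHENOTYPE:0001",
}

sentence = "Variants in CYP2C9 and VKORC1 alter the warfarin dose and bleeding risk."
mentions = find_mentions(sentence, TOY_DICTIONARY)
for mention in mentions:
    print(mention)
```

Exact string matching of this kind misses spelling variants, abbreviations, and ambiguous names, which is one reason specialized and machine learning based recognizers exist.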

The first goal of the GPD-Rx workshop is to discuss issues in the recognition of entities relevant to pharmacogenomics.

3. Extraction of Relationships between Entities

The second goal of the workshop is to discuss the application, in pharmacogenomics, of methods that extract relationships between relevant entities (e.g. genomic variation, phenotype, drug).

One simple approach is based on the hypothesis that two entities that are frequently mentioned together are associated. Entity recognition methods have been applied to search for co-occurrences of entities with the goal of discovering associated ones.11 This approach has been applied to the construction of gene networks12 and the guidance of biomedical curation.13 In such co-occurrence-driven approaches, an association is more likely to be true when the co-occurrence is observed within a small amount of text (e.g. a sentence) and less likely to be true when it is observed within a larger amount (e.g. a full section).
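The effect of the text window on co-occurrence can be sketched as follows. This is a toy illustration under stated assumptions: the sentence splitter is deliberately naive, entity matching is plain whole-word string search, and the entity names, the example text, and the function names are invented for this example.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def entities_in(span: str, names: Iterable[str]) -> Set[str]:
    """Return the names that occur in the span as whole words."""
    return {
        n for n in names
        if re.search(r"\b" + re.escape(n) + r"\b", span, re.IGNORECASE)
    }


def cooccurrence_counts(spans: Iterable[str], names: Iterable[str]) -> Counter:
    """Count, for each unordered entity pair, the spans that mention both."""
    names = list(names)
    counts: Counter = Counter()
    for span in spans:
        found = sorted(entities_in(span, names))
        counts.update(combinations(found, 2))
    return counts


NAMES = ["CYP2C9", "warfarin", "bleeding"]
text = (
    "CYP2C9 variants reduce warfarin clearance. "
    "Patients on warfarin were monitored for bleeding. "
    "CYP2C9 genotyping was performed at baseline."
)

# Same text, two window sizes: one span per sentence, or the whole passage.
by_sentence = cooccurrence_counts(split_sentences(text), NAMES)
by_section = cooccurrence_counts([text], NAMES)
print(by_sentence)
print(by_section)
```

With the passage treated as a single window, CYP2C9 and bleeding are counted as co-occurring even though no sentence mentions both, which is the kind of weaker evidence that larger windows introduce.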

The development of natural language parsers has led to a second approach: by providing the grammatical structure of sentences, parsers enable the extraction of relationships (or events) mentioned in the text. The importance of learning protein-protein interactions in biology has motivated many researchers to use parsers to extract such relations with high accuracy. The work of Fundel et al.,14 Rebholz-Schuhmann et al.,15 Hunter et al.,16 and Miyao et al.17 illustrates the latest research in extracting biomedical relationships from text.
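One common way to exploit grammatical structure is to look at the path that connects two entity mentions in a dependency parse. The sketch below assumes that a parse is already available; the dependency edges are written by hand for a toy sentence (with labels in the style of Universal Dependencies) and are not the output of any parser or system cited here. The function name shortest_path is likewise an assumption.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Toy sentence: "CYP2C9 variants reduce warfarin clearance"
TOKENS = ["CYP2C9", "variants", "reduce", "warfarin", "clearance"]

# Hand-written dependency edges: (head index, dependent index, label).
EDGES: List[Tuple[int, int, str]] = [
    (2, 1, "nsubj"),      # reduce -> variants
    (1, 0, "compound"),   # variants -> CYP2C9
    (2, 4, "obj"),        # reduce -> clearance
    (4, 3, "compound"),   # clearance -> warfarin
]


def shortest_path(
    edges: List[Tuple[int, int, str]], source: int, target: int
) -> Optional[List[int]]:
    """Breadth-first search over the parse, treated as an undirected graph."""
    neighbours: Dict[int, List[int]] = {}
    for head, dependent, _label in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    previous: Dict[int, Optional[int]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back from the target to the source, then reverse.
            path: List[int] = []
            current: Optional[int] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # the two tokens are not connected in the parse


path = shortest_path(EDGES, TOKENS.index("CYP2C9"), TOKENS.index("warfarin"))
path_words = [TOKENS[i] for i in path] if path is not None else []
print(path_words)
```

The verb on the path ("reduce" in this toy case) is a candidate for the type of the relationship, whereas sentence-level co-occurrence alone would only indicate that the two entities are somehow related.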

Similar approaches have already been developed for the extraction of Genotype-Phenotype-Drug relationships.18–21 The GPD-Rx workshop aims to identify issues specific to this task and to the use of its output. For example, comparing extracted relationships, to determine agreement or to point out a contradiction, is key to making them actionable.
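Comparing relationships for agreement or contradiction can be sketched as set operations over typed triples. Everything specific in this example is an assumption: the Relation type, the placeholder identifiers, the two hypothetical extraction systems, and the small table of opposite relationship types. The comparison is only meaningful when both systems use the same identifiers and relationship vocabulary.

```python
from typing import Dict, NamedTuple, Set, Tuple


class Relation(NamedTuple):
    subject: str    # identifier of the first entity
    predicate: str  # relationship type from a shared vocabulary
    obj: str        # identifier of the second entity


# Hypothetical vocabulary of mutually exclusive relationship types.
OPPOSITES: Dict[str, str] = {
    "increases": "decreases",
    "decreases": "increases",
}


def compare(
    a: Set[Relation], b: Set[Relation]
) -> Tuple[Set[Relation], Set[Tuple[Relation, Relation]]]:
    """Return the relations both sets contain and the pairs that conflict."""
    agreements = a & b
    contradictions = {
        (ra, rb)
        for ra in a
        for rb in b
        if ra.subject == rb.subject
        and ra.obj == rb.obj
        and OPPOSITES.get(ra.predicate) == rb.predicate
    }
    return agreements, contradictions


# Output of two hypothetical extraction systems (placeholder identifiers).
system_a = {
    Relation("GENE:0001", "decreases", "PHENOTYPE:0002"),
    Relation("DRUG:0001", "increases", "PHENOTYPE:0001"),
}
system_b = {
    Relation("GENE:0001", "increases", "PHENOTYPE:0002"),
    Relation("DRUG:0001", "increases", "PHENOTYPE:0001"),
}

agreements, contradictions = compare(system_a, system_b)
print(agreements)
print(contradictions)
```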

4. Standards

We believe that BioNLP groups focused on relationship extraction tasks would have a mutual interest in using shared standards to facilitate the comparison and the interoperability of their results. The main ones are:

  • the use of unique identifiers for entities involved in relationships,

  • the use of a common knowledge representation language for the description of relationships,

  • evaluation methods for the extraction of relationships,

  • shared text corpora and vocabularies of entity names and relationship types,

  • a set of gold-standard relationships.

The workshop aims to stimulate discussion on identifying, sharing, and widely adopting such standards when applying text mining in the realm of pharmacogenomics.
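To make the role of a gold standard concrete, the following sketch scores a set of extracted relationships against a set of gold-standard relationships using precision, recall, and F1. This is one widely used evaluation scheme, shown here as an assumption and not as a method the workshop prescribes. Relationships are compared by exact match on (entity identifier, relationship type, entity identifier), and all identifiers and relationship types below are placeholders.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (entity id, relationship type, entity id)


def precision_recall_f1(
    predicted: Set[Triple], gold: Set[Triple]
) -> Tuple[float, float, float]:
    """Exact-match scores of predicted relationships against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder gold standard and system output.
gold = {
    ("GENE:0001", "affects", "DRUG:0001"),
    ("GENE:0002", "affects", "DRUG:0001"),
    ("DRUG:0001", "causes", "PHENOTYPE:0001"),
    ("GENE:0001", "associated_with", "PHENOTYPE:0001"),
}
predicted = {
    ("GENE:0001", "affects", "DRUG:0001"),
    ("DRUG:0001", "causes", "PHENOTYPE:0001"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is only possible when systems and the gold standard share entity identifiers and relationship types, which is why the items in the list above depend on one another.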

Acknowledgments

We would like to thank the PSB 2010 organizers and particularly Tiffany Murray for helping us in the organization of the GPD-Rx workshop.

References

  • 1. Hirschman Lynette, Krallinger Martin, Wilbur John, Valencia Alfonso, editors. The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge. Genome Biology. 2008;9(S2)
  • 2. Tsujii Jun’ichi. Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. 2009
  • 3. Aronson Alan R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program; Proceedings of the AMIA Symposium; 2001. pp. 17–21.
  • 4. Dai Manhong, Shah Nigam H, Xuan Wei, Musen Mark A, Watson Stanley J, Athey Brian D, Meng Fan. An Efficient Solution for Mapping Free Text to Ontology Terms. Proceedings of the AMIA Summit on Translational Bioinformatics. 2008
  • 5. Smith Larry, Tanabe Lorraine K, Ando Rie Johnson nee, Kuo Cheng-Ju, Chung I-Fang, Hsu Chun-Nan, Lin Yu-Shi, et al. Overview of BioCreative II gene mention recognition. Genome Biology. 2008;9(S2) doi: 10.1186/gb-2008-9-s2-s2.
  • 6. Caporaso J Gregory, Baumgartner William A, Jr, Randolph David A, Cohen K Bretonnel, Hunter Lawrence. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics. 2007;23(14):1862–1865. doi: 10.1093/bioinformatics/btm235.
  • 7. Baker Christopher JO, Rebholz-Schuhmann Dietrich, editors. Proceedings of the European Conference on Computational Biology (ECCB) 2008 Workshop: Annotations, interpretation and management of mutations (AIMM) Bioinformatics. 2009;10(S8)
  • 8. Xu Rong, Supekar Kaustubh, Morgan Alex, Das Amar, Garber Alan M. Unsupervised Method for Automatic Construction of a Disease Dictionary from a Large Free Text Collection; Proceedings of the AMIA Symposium; 2008.
  • 9. Segura-Bedmar Isabel, Martínez Paloma, Segura-Bedmar María. Drug name recognition and classification in biomedical texts: A case study outlining approaches underpinning automated systems. Drug Discovery Today. 2008;13(17–18):816–823. doi: 10.1016/j.drudis.2008.06.001.
  • 10. Leaman Robert, Gonzalez Graciela. Banner: An executable survey of advances in biomedical named entity recognition; Pacific Symposium on Biocomputing; 2008. pp. 652–663.
  • 11. Gonzalez Graciela, Uribe Juan C, Tari Luis, Brophy Colleen, Baral Chitta. Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity; Pacific Symposium on Biocomputing; 2007. pp. 28–39.
  • 12. Jenssen Tor-Kristian, Laegreid Astrid, Komorowski Jan, Hovig Eivind. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics. 2001;28(1):21–28. doi: 10.1038/ng0501-21.
  • 13. Garten Yael, Altman Russ B. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics. 2009;10(Suppl 2):S6. doi: 10.1186/1471-2105-10-S2-S6.
  • 14. Fundel Katrin, Küffner Robert, Zimmer Ralf. Relex - relation extraction using dependency parse trees. Bioinformatics. 2007;23(3):365–371. doi: 10.1093/bioinformatics/btl616.
  • 15. Rebholz-Schuhmann Dietrich, Kirsch Harald, Arregui Miguel, Gaudan Sylvain, Riethoven Mark, Stoehr Peter. Ebimed - text crunching to gather facts for proteins from medline. Bioinformatics. 2007;23(2):237–244. doi: 10.1093/bioinformatics/btl302.
  • 16. Hunter Lawrence, Lu Zhiyong, Firby James, Baumgartner William A, Jr, Johnson Helen L, Ogren Philip V, Cohen K Bretonnel. Opendmap: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics. 2009;9(78) doi: 10.1186/1471-2105-9-78.
  • 17. Miyao Yusuke, Sagae Kenji, Saetre Rune, Matsuzaki Takuya, Tsujii Jun’ichi. Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics. 2009;25(3):394–400. doi: 10.1093/bioinformatics/btn631.
  • 18. Rindflesch Thomas C, Tanabe Lorraine, Weinstein John N, Hunter Lawrence. EDGAR: extraction of drugs, genes and relations from the biomedical literature; Pacific Symposium on Biocomputing; 2000. pp. 517–528.
  • 19. Ahlers Caroline B, Fiszman Marcelo, Demner-Fushman Dina, Lang François-Michel, Rindflesch Thomas C. Extracting semantic predications from medline citations for pharmacogenomics; Pacific Symposium on Biocomputing; 2007. pp. 209–220.
  • 20. Tari Luis, Hakenberg Jörg, Gonzalez Graciela, Baral Chitta. Querying parse tree database of medline text to synthesize user-specific biomolecular networks; Pacific Symposium on Biocomputing; 2009. pp. 87–98.
  • 21. Tari Luis, Anwar Saadat, Liang Shanshan, Hakenberg Jörg, Baral Chitta. Synthesis of Pharmacokinetic Pathways Through Knowledge Acquisition and Automated Reasoning; Pacific Symposium on Biocomputing; 2010.
