Abstract
Summary: Effective access to the vast biomedical knowledge present in the scientific literature is challenging. Semantic relations are increasingly used in knowledge management applications supporting biomedical research to help address this challenge. We describe SemMedDB, a repository of semantic predications (subject–predicate–object triples) extracted from the entire set of PubMed citations. We propose the repository as a knowledge resource that can assist in hypothesis generation and literature-based discovery in biomedicine as well as in clinical decision-making support.
Availability and implementation: The SemMedDB repository is available as a MySQL database for non-commercial use at http://skr3.nlm.nih.gov/SemMedDB. A UMLS Metathesaurus license is required.
Contact: kilicogluh@mail.nih.gov
1 INTRODUCTION
Scientific discoveries depend on synthesis of knowledge from the literature and generation of novel hypotheses for testing. In biomedicine, the overwhelming size of literature makes it difficult for researchers to identify promising new avenues, which may involve insights from seemingly unrelated subfields of the domain. Large-scale information extraction in the form of semantic relations is increasingly proposed for advanced knowledge management and discovery systems (Björne et al., 2010, 2012; Cohen et al., 2010; Hristovski et al., 2006).
In this note, we describe SemMedDB, a repository of semantic predications extracted from the titles and abstracts of all PubMed citations by SemRep (Rindflesch and Fiszman, 2003), a rule-based semantic interpreter. Elements of semantic predications are drawn from the Unified Medical Language System (UMLS) knowledge sources (Bodenreider, 2004); the subject and object pair corresponds to Metathesaurus concepts, and the predicate to a relation type in an extended version of the semantic network. For example, SemRep extracts the predication Infection-CAUSES-Guillain-Barre Syndrome from the sentence ‘Infections can trigger GBS’. By normalizing the free text describing a relation to UMLS domain knowledge, SemRep makes it possible to combine and link knowledge from various sources and to aggregate knowledge at PubMed scale. SemRep extracts 30 predicate types, largely relating to clinical medicine (e.g. TREATS, DIAGNOSES, ADMINISTERED_TO, PROCESS_OF), substance interactions (e.g. INTERACTS_WITH, INHIBITS, STIMULATES), genetic etiology of disease (e.g. ASSOCIATED_WITH, CAUSES, PREDISPOSES) and pharmacogenomics (e.g. AFFECTS, AUGMENTS, DISRUPTS). For a full list and descriptions of these predicate types, we refer the reader to Kilicoglu et al. (2011). Several evaluations of SemRep have focused on different domains and linguistic structures; their results are summarized in Table 1. The number in parentheses in the first column is the number of predications evaluated.
Table 1.
Evaluation type (no. of predications evaluated) | Reference | Precision (%) | Recall (%) |
---|---|---|---|
Gene-disease relations (1124) | Rindflesch et al., 2003 | 76 | — |
Pharmacogenomics (623) | Ahlers et al., 2007 | 73 | 55 |
Hypernymic relations (830) | Rindflesch and Fiszman, 2003 | 83 | — |
Comparative structures (300) | Fiszman et al., 2007 | 96 | 70 |
Nominalizations (300) | Kilicoglu et al., 2010 | 75 | 57 |
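For the rows of Table 1 that report both precision and recall, the two figures can be combined into a balanced F-score, F1 = 2PR / (P + R). The short sketch below does this arithmetic; the F-scores are derived here for convenience and are not reported in the evaluations cited, and the names `f1` and `TABLE1` are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from the rows of Table 1 that report both.
TABLE1 = {
    "Pharmacogenomics": (73, 55),
    "Comparative structures": (96, 70),
    "Nominalizations": (75, 57),
}

# Derived values, rounded to one decimal place.
f_scores = {name: round(f1(p, r), 1) for name, (p, r) in TABLE1.items()}

if __name__ == "__main__":
    for name, score in f_scores.items():
        print(f"{name}: F1 = {score}")
```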
The SemMedDB repository stores the semantic predications that SemRep extracts from PubMed citations in a preprocessing step, organized for efficient access. The repository underpins Semantic MEDLINE (http://skr3.nlm.nih.gov/SemMed), a Web application that combines PubMed-based information retrieval with semantic predications, automatic summarization and visualization (Kilicoglu et al., 2008; Rindflesch et al., 2011). In this note, we propose the repository as a stand-alone, large-scale knowledge resource that can be exploited independently of Semantic MEDLINE. With access to the full repository, users can bypass the ‘search–summarize–visualize’ model of Semantic MEDLINE and apply advanced data mining algorithms directly to the predications, for hypothesis generation and literature-based discovery in support of biomedical research as well as clinical practice.
2 OVERVIEW OF SemMedDB
The SemMedDB repository is implemented primarily as a MySQL relational database that consists of tables holding information regarding PubMed citations, relevant UMLS knowledge and the semantic predications. A brief description of the content of each table and the approximate number of records in it are provided in Table 2. The UMLS-related tables (CONCEPT*) contain information from a modified version of the UMLS 2006AA release, adapted for SemRep. The data fields of each table are described in detail in the online documentation, which also provides the entity-relationship diagram of the database.
Table 2.
Name | Number of records | Content |
---|---|---|
CITATION | 21 M | Metadata relevant for each PubMed citation |
SENTENCE | 119.1 M | Sentences from each PubMed citation |
CONCEPT | 1.3 M | Relevant information about UMLS Metathesaurus concepts |
CONCEPT_SEMTYPE | 1.5 M | One-to-many relationships between concepts and their semantic types from UMLS semantic network |
PREDICATION | 12.9 M | Unique predications |
PREDICATION_ARGUMENT | 27.5 M | Links between each predication and its subject and object contained in CONCEPT table |
SENTENCE_PREDICATION | 57.6 M | Links between a sentence and a predication extracted from it |
PREDICATION_AGGREGATE | 57.6 M | Convenience table that aggregates information from all of the tables above for more efficient access |
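The distinction in Table 2 between unique predications (PREDICATION) and their occurrences in sentences (SENTENCE_PREDICATION) can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for MySQL. The table names follow Table 2, but all column names are assumptions made for illustration (the actual fields are given in the online documentation), the schema is simplified by storing subject and object inline rather than through PREDICATION_ARGUMENT and CONCEPT, and the rows are made-up toy data.

```python
import sqlite3

# Hypothetical, simplified schema: table names follow Table 2, column names
# are assumptions and do not reproduce the real SemMedDB fields.
SCHEMA = """
CREATE TABLE PREDICATION (
    predication_id INTEGER PRIMARY KEY,
    subject   TEXT,
    predicate TEXT,
    object    TEXT
);
CREATE TABLE SENTENCE_PREDICATION (
    sentence_id    INTEGER,
    predication_id INTEGER REFERENCES PREDICATION(predication_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO PREDICATION VALUES (?, ?, ?, ?)",
        [
            (1, "Infection", "CAUSES", "Guillain-Barre Syndrome"),
            (2, "Drug X", "TREATS", "Disease Y"),  # placeholder names
        ],
    )
    # One unique predication may be linked to many sentences.
    conn.executemany(
        "INSERT INTO SENTENCE_PREDICATION VALUES (?, ?)",
        [(101, 1), (102, 1), (103, 1), (201, 2)],  # made-up sentence ids
    )
    conn.commit()
    return conn


def predication_frequencies(conn):
    """Count, for each unique predication, the sentences it was extracted from."""
    query = """
        SELECT p.subject, p.predicate, p.object, COUNT(*) AS n_sentences
        FROM PREDICATION AS p
        JOIN SENTENCE_PREDICATION AS sp
          ON sp.predication_id = p.predication_id
        GROUP BY p.predication_id, p.subject, p.predicate, p.object
        ORDER BY n_sentences DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in predication_frequencies(build_demo_db()):
        print(row)
```

In this toy setting, two unique predications yield four sentence-level links, which mirrors on a small scale why SENTENCE_PREDICATION holds more records than PREDICATION.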
The preprocessing of PubMed for SemMedDB takes about 2 months. Citations are first retrieved from PubMed using the NCBI E-utilities API and then processed by SemRep in a distributed computing environment. The latest version of the repository has ∼57.6 M predications extracted from 21 M citations (dated June 30, 2012 or earlier). We process newly added citations at regular intervals and update the repository with the extracted predications.
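As a rough illustration of the retrieval step, the sketch below only builds NCBI E-utilities request URLs (ESearch to list PubMed identifiers up to a cut-off date, EFetch to download the corresponding records) without sending any request. It is not the pipeline used to build SemMedDB: the function names, the `all[sb]` search term, the choice of publication date (`pdat`) as the date field and the lower date bound are all assumptions made for this example.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(max_date, retstart=0, retmax=1000):
    """Build an ESearch URL listing PubMed IDs dated max_date or earlier.

    max_date uses the YYYY/MM/DD form. Filtering on publication date and
    the 1800/01/01 lower bound are assumptions of this sketch.
    """
    params = {
        "db": "pubmed",
        "term": "all[sb]",       # assumed term selecting all citations
        "datetype": "pdat",
        "mindate": "1800/01/01",
        "maxdate": max_date,
        "retstart": retstart,    # offset for paging through results
        "retmax": retmax,        # page size
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL returning the given PubMed records as XML."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("2012/06/30"))
    print(efetch_url([1, 2]))  # placeholder identifiers
```

The titles and abstracts in the fetched records would then be passed to SemRep; that step is outside the scope of this sketch.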
3 USE CASES
SemMedDB is being used to support a range of biomedical applications, especially for literature-based discovery and hypothesis generation. Miller et al. (2012) proposed a mechanistic link between cortisol, testosterone and age-related sleep-quality decline, while Wilkowski et al. (2011) applied graph-theoretical notions to elucidate the relation between sleep and depression. More recently, Goodwin et al. (2012) replicated these results by exploiting information foraging theory. Cohen et al. (2012) used analogical reasoning in a high-dimensional vector space to suggest drug therapies. Similarly, Hristovski et al. (2012) proposed predication space for drug target discovery and drug repurposing, whereas Hristovski et al. (2010) combined predications with DNA microarray data to generate novel hypotheses on Parkinson disease. We are currently combining predications with microarray data to automatically generate gene regulatory networks.
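To make the idea of literature-based discovery over predications concrete, the following minimal sketch applies the classic open-discovery (Swanson 'ABC') pattern to subject-predicate-object triples: starting from a concept A, it collects concepts C that are reachable through some intermediate concept B but are not directly linked to A. This is a generic illustration and not the method of any study cited above; the function name `open_discovery` and the toy triples with placeholder concept names are hypothetical.

```python
from collections import defaultdict


def open_discovery(predications, start):
    """Return {C: [bridging B concepts]} for chains A -> B -> C with no direct A -> C.

    predications: iterable of (subject, predicate, object) triples.
    start: the starting concept A.
    """
    out_edges = defaultdict(set)
    for subj, _pred, obj in predications:
        out_edges[subj].add(obj)

    direct = set(out_edges.get(start, ()))  # concepts already linked to A
    hypotheses = defaultdict(set)
    for b in direct:
        for c in out_edges.get(b, ()):
            # Keep only targets not already known to be related to A.
            if c != start and c not in direct:
                hypotheses[c].add(b)
    return {c: sorted(bs) for c, bs in hypotheses.items()}


# Made-up toy triples with placeholder concept names (not real findings).
TOY_PREDICATIONS = [
    ("Substance A", "INHIBITS", "Gene B1"),
    ("Substance A", "STIMULATES", "Gene B2"),
    ("Substance A", "TREATS", "Disease D"),
    ("Gene B1", "ASSOCIATED_WITH", "Disease C"),
    ("Gene B2", "ASSOCIATED_WITH", "Disease C"),
    ("Gene B1", "ASSOCIATED_WITH", "Disease D"),
]

if __name__ == "__main__":
    print(open_discovery(TOY_PREDICATIONS, "Substance A"))
```

Here 'Disease C' is proposed as a candidate because it is reachable only through the two intermediate genes, whereas 'Disease D' is excluded because it is already directly linked to the starting concept. A practical system would also restrict predicate types and rank candidates, which this sketch omits.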
SemMedDB has formed the basis of several clinical applications. Jonnalagadda et al. (2012) generate therapy-oriented summaries to support clinical decision making, while Liu et al. (2012) elucidate the association between medical concepts co-occurring in clinical reports. The repository also underpinned a recent study to identify interactions between drugs mentioned in clinical reports.
4 CONCLUSION
We described the SemMedDB repository, which makes the semantic content of all PubMed citations, as extracted by SemRep, available to the research community. The utility of the repository for hypothesis generation, literature-based discovery and clinical decision making has so far been demonstrated both within the Semantic MEDLINE paradigm and independently of it. Given its scale, this repository can further serve as the basis of advanced data-mining techniques and assist in uncovering novel relationships in biomedicine.
Funding: This research was supported in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.
Conflict of Interest: none declared.
REFERENCES
- Ahlers CB, et al. Pacific Symposium on Biocomputing. Maui, HI, USA: World Scientific; 2007. Extracting semantic predications from Medline citations for pharmacogenomics; pp. 209–220.
- Björne J, et al. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP’10) Uppsala, Sweden: Association of Computational Linguistics; 2010. Scaling up biomedical event extraction to the entire PubMed; pp. 28–36.
- Björne J, et al. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP’12) Montreal, Canada: Association of Computational Linguistics; 2012. PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations; pp. 82–90.
- Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–D270. doi: 10.1093/nar/gkh061.
- Cohen T, et al. EpiphaNet: an interactive tool to support biomedical discoveries. J. Biomed. Discov. Collab. 2010;5:21–49.
- Cohen T, et al. Proceedings of the Sixth International Conference on Quantum Interaction (QI’12) 2012. Many paths lead to discovery: analogical retrieval of cancer therapies. Springer, Paris, France (in press)
- Fiszman M, et al. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP’07) Prague, Czech Republic: Association of Computational Linguistics; 2007. Interpreting comparative constructions in biomedical text; pp. 137–144.
- Goodwin JC, et al. IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW) PA, USA: IEEE, Philadelphia; 2012. Discovery by scent: closed literature-based discovery system based on the information foraging theory; pp. 232–239.
- Hristovski D, et al. AMIA Annual Symposium Proceedings. Washington, DC, USA: American Medical Informatics Association; 2006. Exploiting semantic relations for literature-based discovery; pp. 349–353.
- Hristovski D, et al. Combining semantic relations and DNA microarray data for novel hypothesis generation. In: Blaschke C, Shatkay H, editors. ISMB/ECCB2009, Lecture Notes in Bioinformatics. Heidelberg: Springer; 2010. pp. 53–61.
- Hristovski D, et al. Using literature-based discovery to identify novel therapeutic approaches. Cardiovasc. Hematol. Agents. Med. Chem. 2012 doi: 10.2174/1871525711311010005. (in press)
- Jonnalagadda S, et al. Automatically extracting sentences from Medline citations to support clinicians’ information needs. J. Am. Med. Inform. Assn. 2012 doi: 10.1136/amiajnl-2012-001347. (in press)
- Kilicoglu H, et al. Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008) Turku, Finland: Turku Centre for Computer Science (TUCS); 2008. Semantic MEDLINE: a web application to manage the results of PubMed searches; pp. 69–76.
- Kilicoglu H, et al. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP’10) Uppsala, Sweden: Association of Computational Linguistics; 2010. Arguments of nominals in semantic interpretation of biomedical text; pp. 46–54.
- Kilicoglu H, et al. BMC Bioinformatics. Vol. 12. Chicago, IL, USA: American Medical Informatics Association; 2011. Constructing a semantic predication gold standard from the biomedical literature; p. 486.
- Liu Y, et al. AMIA Annual Symposium Proceedings. 2012. Using SemRep to label semantic relations extracted from clinical text. American Medical Informatics Association, Chicago, IL, USA.
- Miller CM, et al. A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep. 2012;35:279–285. doi: 10.5665/sleep.1640.
- Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 2003;36:462–477. doi: 10.1016/j.jbi.2003.11.003.
- Rindflesch TC, et al. AMIA Annual Symposium Proceedings. Washington, DC, USA: American Medical Informatics Association; 2003. Semantic relations asserting the etiology of genetic diseases; pp. 554–558.
- Rindflesch TC, et al. Semantic MEDLINE: an advanced information management application for biomedicine. Inform. Services Use. 2011;31:15–21.
- Wilkowski B, et al. AMIA Annual Symposium Proceedings. Washington, DC, USA: American Medical Informatics Association; 2011. Graph-based methods for discovery browsing with semantic predications; pp. 1514–1523.