A methodology for extending domain coverage in SemRep

Graciela Rosemblat; Dongwook Shin; Halil Kilicoglu; Charles Sneiderman; Thomas C Rindflesch

doi:10.1016/j.jbi.2013.08.005

Author manuscript; available in PMC: 2014 Dec 1.

Published in final edited form as: J Biomed Inform. 2013 Aug 21;46(6):10.1016/j.jbi.2013.08.005. doi: 10.1016/j.jbi.2013.08.005

PMCID: PMC3844019 NIHMSID: NIHMS524910 PMID: 23973273

Abstract

We describe a domain-independent methodology to extend SemRep coverage beyond the biomedical domain. SemRep, a natural language processing application originally designed for biomedical texts, uses the knowledge sources provided by the Unified Medical Language System (UMLS©). Ontological and terminological extensions to the system are needed in order to support other areas of knowledge. We extended SemRep's application by developing a semantic representation of a previously unsupported domain. This was achieved by adapting well-known ontology engineering phases and integrating them with the UMLS knowledge sources on which SemRep crucially depends. While the process to extend SemRep coverage has been successfully applied in earlier projects, this paper presents in detail the stepwise approach we followed and the mechanisms implemented. A case study in the field of medical informatics illustrates how the ontology engineering phases have been adapted for optimal integration with the UMLS. We provide qualitative and quantitative results, which indicate the validity and usefulness of our methodology.

Keywords: Natural Language Processing Application, Domain-Independent Ontology Development Methodology, Semantic Predications, UMLS Knowledge Sources

1. Introduction

We present an approach to extend SemRep [1,2], a natural language processing (NLP) application, to domains not previously supported. Relying on the Unified Medical Language System (UMLS©) [3] for domain knowledge, SemRep identifies and extracts semantic relations (called semantic predications) in biomedical text. Semantic predications represent assertions in text as triples that take UMLS Metathesaurus concepts as subject and object arguments and a Semantic Network relation to connect the arguments. When applied to a set of citations, they provide a semantic representation of their content which can be summarized and visualized as a network of concepts and relations, as in the Semantic MEDLINE application [4,5]. Semantic MEDLINE has been exploited for hypothesis generation [6], inference of discovery patterns from known therapeutic relationships [7], and support of clinicians’ information needs [8], among others.

SemRep's dependence on the UMLS entails that it can only support predications in the biomedical domain. One way to extend SemRep coverage to other areas is by developing domain ontologies to adequately capture domain knowledge. Incorporating the new knowledge adhering to UMLS formalization parameters allows extending Semantic MEDLINE to other domains, as well.

Ontologies often underpin a semantic representation of a domain and provide a structural framework for NLP applications. Aside from the philosophical discipline of formal ontologies and properties and the various formal theories in ontology development, a domain ontology has been defined as ‘a hierarchically structured set of concepts describing a specific domain of knowledge’ [9]. A top level structure for an ontology may not always be necessary to meet the expected goals [9] if it is intended to be used by an automatic system. The ontology need not represent an exhaustive categorization of a domain. The scope can be limited by defining only those elements and relationships needed to meet pre-set goals.

Domain ontologies represent a given conceptualization of a domain, described through a set of entities or concepts and their interactions [10]. However, the consensus on the importance of building ontologies in developing knowledge-based systems does not extend to the content, format, or structure of the models, even when created for similar purposes or to represent the same domain [11]. For McCray [12], “all conceptualizations are biased, both because they depend on the purposes for which they have been created, and because they are closely tied to the world view of their designers.” Similarly in [13]: “ontology design is a creative process and no two ontologies designed by different people would be the same.” Hence, conceptualizations may not be universal but rather task-based. For others, domain ontologies can be too limiting, focusing on “abstract descriptions of knowledge organization” [14]. But incorporating context can render models less abstract, with better representation of reality [15]. Thus, conceptualization and contextualization can be complementary rather than mutually exclusive.

We adopt the latter view that when constructing domain models it is important to incorporate context as a way to ground the ontology in actual data and provide a targeted representation of the domain. Our approach aims to develop such a representation, articulating core aspects of the ontological space of the domain, so the data can inform the model. In the resulting representation model, the key actors and concepts in the data are represented, along with the semantic relationships into which they enter. An understanding of this network of concepts and relations is needed to carve out the semantic space of the domain and outline the scope of the application. Structured knowledge can be incorporated by mapping the concepts and relationships to a higher-level terminology or ontology, such as the UMLS.

By adapting well-known generic ontology engineering phases [16] for integration with the UMLS, we leverage existing UMLS knowledge while extending coverage within a newly defined semantic space. SemRep domain extensions have so far been deployed in pharmacogenomics [17] (now part of general SemRep), public health [18], and disaster information management [19]. The last two applications are available to the research community upon request. In this paper, we illustrate our approach with a new case study in medical informatics.

2. Related Work

One characteristic of ontologies is the possibility of using them for different purposes and in different applications. Thus, they can be classified by their intended use [20,21]. In biomedicine, for example, ontologies have been used in applications for disease surveillance systems [22] (as mentioned in [23]), outbreak alerts [23], and in terminology management, data exchange, knowledge reuse, and decision support [24]. By structuring the knowledge and providing the necessary conceptualizations and relations, ontologies facilitate knowledge sharing beyond a list of domain terms, paving the way for data reuse [24]. This allows capturing regularities in the use of knowledge for different goals [25], laying the groundwork for future semantic integration [26]. A more general top-level ontology can be extended to apply to a focused domain, and combined into other application ontologies later if needed [20]. This is a cost-effective way to render ontologies useful for purposes or domains other than those for which they were originally designed [26]. In other biomedicine examples, an expansion of terms about muscle biology was later incorporated into the Gene Ontology [27]. Chen et al. [28] enhanced the UMLS Semantic Network to allow differentiation between chemical complex types and conjugates. The latter supports the view that domain ontologies “need not contain all possible information about the domain” [26], since they can be extended or revised if more granularity or further specialization is needed later [28].

To develop a domain ontology to extend SemRep's coverage, we drew on Kuziemsky and Lau's 4-stage generic ontology-engineering approach [16] for health information system design: Specification and conceptualization define the ontology purpose and scope and provide the vocabulary, relationships, and concepts for ontology design. Formalization draws ontological hierarchies and relationships (PART_OF/IS_A) to develop a domain ontology/sub-ontologies for use in Implementation. Lastly, evaluation and maintenance focus on user evaluation of different components and technical/formative evaluation during ontology development. Reliance on user-involved participatory design and on data-based Grounded Theory [29,30] set this approach apart from earlier works with which it has significant correspondences [31,32]. Grounded Theory, an inductive social science methodology related to content analysis, derives concepts and categories from the data. It has been used in biomedicine to guide annotation of clinical conditions in emergency department reports [33], and more recently, to create a document-based, application-specific semantic representation for automatic information extraction from dental records [13].

While the approaches in [13] and [16] bear methodological parallels with our own, those projects built new ontologies without implementing the semantics afforded by mapping the new knowledge to more formal existing terminologies. In contrast, knowledge linking to the UMLS is paramount in our NLP application SemRep. SemRep's applicability in other domains requires ontological and terminological extensions for domain coverage, such that it recognizes domain-specific classes of concepts that may not play a role in biomedicine, while relying on the structured knowledge in the UMLS. For UMLS compatibility, the approach in [16] required adaptations, as described in Methods.

3. Background

3.1. SemRep

SemRep is a semantic interpreter [1,2] that uses underspecified syntactic analysis and the UMLS knowledge sources to provide partial semantic interpretation of the biomedical research literature. The output consists of text-derived assertions expressed as subject-relation-object triples called semantic predications, in which the relation is a UMLS Semantic Network relation. The subject and object arguments are drawn from the UMLS Metathesaurus. For example, the semantic predication in (2) is derived from text in (1):

The UMLS Metathesaurus comprises over 100 controlled vocabularies, such as MeSH and SNOMED-CT. All UMLS Metathesaurus concepts, which contain synonyms, are assigned a semantic type according to their semantic properties, such as Pharmacologic Substance and Disease or Syndrome in (2) for Fish Oils and Coronary heart disease, respectively. Concepts with similar semantic properties are assigned the same semantic type. For example, Diabetes and Malaria are both assigned semantic type Disease or Syndrome. SemRep generates a range of semantic relations, such as TREATS, DIAGNOSES, CAUSES, among others.

Underspecified syntactic analysis relies on the UMLS SPECIALIST Lexicon [34], the MedPost part-of-speech tagger [35] and a parser that identifies simple noun phrases. MetaMap [36] then maps these noun phrases to concepts in the UMLS Metathesaurus.

Several mechanisms are involved in interpreting (2) as a semantic predication of (1). First, manually crafted “indicator rules” map syntactic elements in text--nominalizations, prepositions, verbs--to relations in the Semantic Network, such as TREATS and PREVENTS. The indicator rule needed for (1) is (3):

While the UMLS concept mapping process is done through MetaMap, there is no information in the UMLS about mapping a text expression (an “indicator”) such as protect against in (1) to a Semantic Network relation such as PREVENTS in (2). Crafting indicator rules is a manual process, based on semantic interpretation. Prevention, prevent, and immunization can also mean PREVENTS when used with arguments bearing the semantic types allowed for PREVENTS in the Semantic Network.

Following the application of indicator rules, argument identification rules establish syntactic relations between indicators and the heads of simple noun phrases serving as arguments. Such rules are stated in general terms. For example, rules for verbs specify that subjects occur to the left and objects to the right. For a semantic predication to be generated, the semantic types of the UMLS Metathesaurus concepts to which the syntactic arguments are mapped must match the semantic types allowed in the Semantic Network for an ontological predication, which denotes an allowed relationship between two semantic types. For example, the ontological predication in (4) indicates that any concept with semantic type Pharmacologic Substance is allowed as the subject of a PREVENTS relation, with any concept assigned Disease or Syndrome as object. SemRep uses 8550 ontological predications, extracted and refined from the UMLS Semantic Network and expanded with ontological predications pertinent to molecular biology.

The semantic types Pharmacologic Substance [phsu] and Disease or Syndrome [dsyn] in (4) match those for the syntactic subject fish oils and object coronary heart disease in the PREVENTS relation. The semantic predication (5) substitutes the UMLS Metathesaurus concepts for the relevant textual arguments whose semantic types match those in the Semantic Network relation in (4).

At the ontological level, semantic types interact with other semantic types. In the semantic predications, concepts categorized by the semantic types interact with each other to represent textual assertions.
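
The matching step described above can be illustrated with a minimal sketch. It is not SemRep's implementation: the class names, the function license_predication, and the one-entry ALLOWED set are hypothetical, and only the PREVENTS pattern with [phsu] and [dsyn] is taken from the text.

```python
from typing import NamedTuple, Optional, Tuple


class OntologicalPredication(NamedTuple):
    """An allowed relationship between two semantic types (abbreviations)."""
    subject_type: str
    predicate: str
    object_type: str


class Concept(NamedTuple):
    """A Metathesaurus-style concept with its semantic type abbreviations."""
    name: str
    semantic_types: frozenset


# Hypothetical one-entry inventory; the text reports 8550 such patterns in SemRep.
ALLOWED = {OntologicalPredication("phsu", "PREVENTS", "dsyn")}


def license_predication(subject: Concept, predicate: str, obj: Concept,
                        allowed=ALLOWED) -> Optional[Tuple[str, str, str]]:
    """Return a concept-level triple if some pair of semantic types is allowed."""
    for s_type in sorted(subject.semantic_types):
        for o_type in sorted(obj.semantic_types):
            if OntologicalPredication(s_type, predicate, o_type) in allowed:
                return (subject.name, predicate, obj.name)
    return None  # no ontological predication licenses this combination


fish_oils = Concept("Fish Oils", frozenset({"phsu"}))
chd = Concept("Coronary heart disease", frozenset({"dsyn"}))
print(license_predication(fish_oils, "PREVENTS", chd))
print(license_predication(chd, "PREVENTS", fish_oils))
```

In this sketch the first call yields a triple, while the second yields nothing because the argument order does not fit the allowed pattern.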

3.2. SemRep Generalities

There is a distinction in SemRep between those systematic aspects that apply in general, such as argument identification, and those needed for a focused domain. We refer to the former as ‘general procedural knowledge.’ Only domain knowledge required for text interpretation need be developed to extend SemRep's coverage to an unsupported domain. The domain knowledge that needs to be developed includes domain concepts and relations, which are integrated as UMLS extensions, and the “indicator rules,” which indicate the way the relations are expressed in domain texts. SemRep has 800 indicator rules.

4. Methodology

We tailored the approach in [16] to SemRep's implementation design and the existing semantic representation for the biomedical domain. Our modifications allow mapping to the UMLS, maximizing use of its resources—such as UMLS synonymy information—and fostering concept reuse and integration. A detailed description of our adaptation in each stage follows.

4.1. Specification and Conceptualization

Data collection and analysis characterize this stage, supported by Grounded Theory and Participatory Design. Through Participatory Design, domain experts guide the data collection process to assure a representative corpus on the targeted representation and scope. Grounded Theory provides the theoretical foundation for the initial data analysis of the corpus collected. Grounded Theory has 3 coding cycles [16,29]: Open Coding conceptualizes and categorizes the data into representative sentences, identified by domain experts through relevant domain concepts. Potential main themes in the corpus are highlighted. In Axial and Selective Coding, the core domain themes, the concepts and their binding relations are defined, aided by sentence sorting by topic similarity. These represent the ontology scope and define the semantic space of the domain, paving the way for sentence-derived SemRep predications.

To facilitate conceptualization, we processed the collected texts with SemRep's sentence splitter, parser and MetaMap. Next, a program we developed that operates on this output separated the noun phrases in the text identified by the parser into two categories: those that were mapped to Metathesaurus concepts and those that were not. These two categories of noun phrases were stored in separate files, which allowed a preliminary assessment of the extent of domain coverage in the UMLS Metathesaurus. For example, the program reported that the head of the noun phrase automated tools mapped to the UMLS Metathesaurus concept “Tool, device (physical object)” (semantic type Manufactured Object), while the heads of the noun phrases semantic predication and expert did not map to any UMLS Metathesaurus concepts. A third file contains all verbs identified by the program in the sentences processed, to be used in the analysis of the relationships expressed by these verbs and the discovery of indicators.
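
A minimal sketch of such a look-up step is given below. It assumes the parser and MetaMap output has already been reduced to simple records; the NounPhrase class, the function build_lookup_files, and the three file names are hypothetical, and the actual program and its file formats are not described in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NounPhrase:
    """A parsed noun phrase and, if any, the concept its head mapped to."""
    text: str
    concept: Optional[str] = None
    semantic_type: Optional[str] = None


def build_lookup_files(noun_phrases: List[NounPhrase],
                       verbs: List[str]) -> Dict[str, list]:
    """Group the output into the three lists that would be written to files."""
    return {
        # Noun phrases whose head mapped to a Metathesaurus concept
        "mapped_noun_phrases.txt": [p for p in noun_phrases if p.concept],
        # Noun phrases with no mapping: candidates for new domain concepts
        "unmapped_noun_phrases.txt": [p for p in noun_phrases if not p.concept],
        # Distinct verbs: candidates for indicator discovery
        "verbs.txt": sorted(set(verbs)),
    }


phrases = [
    NounPhrase("automated tools", "Tool, device (physical object)",
               "Manufactured Object"),
    NounPhrase("semantic predication"),
    NounPhrase("expert"),
]
files = build_lookup_files(phrases, ["contains", "using", "contains"])
for name, entries in files.items():
    print(name, len(entries))
```

The share of entries in the unmapped list gives the kind of preliminary coverage estimate mentioned above.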

4.2. Formalization

In this stage, the domain ontology is developed as a formal model of concepts and relationships. Also developed are problem-solving approaches to domain- or application-specific issues. For example, UMLS concepts’ semantic types may not always be appropriate to capture their semantics and use within the context of a new domain. Through semantic and contextual analysis, we redefine concepts by reassigning their semantic type so they can be identified by SemRep as arguments in semantic predications. For example, UMLS Metathesaurus concept User is assigned Idea or Concept in the UMLS Metathesaurus, a non-specific semantic type that fails to define this concept. Population Group would be a more appropriate semantic type, as it would capture the fact that a user is typically a person. By changing the semantic type this way, SemRep will correctly identify User as an argument in predicates that take Population Group as either subject or object. Semantic and linguistic analysis, refinements, and contextualization form the groundwork of the semantic representation of the domain.
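
The effect of reassigning a semantic type can be sketched as a domain override consulted before the original assignment. The dictionaries and the function semantic_type are hypothetical illustrations; only the User example (Idea or Concept changed to Population Group) comes from the text, and the way SemRep stores such overrides is not specified here.

```python
from typing import Optional

# Original assignments (a tiny, illustrative excerpt)
UMLS_TYPES = {
    "User": "Idea or Concept",
    "Fish Oils": "Pharmacologic Substance",
}

# Domain-specific reassignments; assumed to take precedence over UMLS_TYPES
DOMAIN_OVERRIDES = {
    "User": "Population Group",
}


def semantic_type(concept: str,
                  overrides=DOMAIN_OVERRIDES,
                  base=UMLS_TYPES) -> Optional[str]:
    """Return the domain semantic type if one was reassigned, else the original."""
    return overrides.get(concept, base.get(concept))


print(semantic_type("User"))       # reassigned for the domain
print(semantic_type("Fish Oils"))  # unchanged
```

Keeping the overrides separate from the base assignments leaves the original knowledge source untouched, which is one plausible way to keep a domain extension reversible.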

SemRep uses UMLS hierarchical information to extract ISA predications, such as Clozapine ISA Antipsychotic Agents. While hierarchical relations are also developed in the Formalization stage [16], extending such hierarchical information in SemRep to new domains is beyond the scope of the current work. Thus, a domain ontology supporting SemRep's extension is considered an application ontology [37,38], used for data structuring and management, and document retrieval, rather than for capturing the taxonomic knowledge of the new domain.

4.3. Implementation

Implementation of new domain knowledge focuses on compatibility with SemRep and the UMLS framework (i.e., formatting) as well as knowledge integration with UMLS concepts and relations. This entails encoding the new domain concepts, redefining some existing UMLS concepts for the domain, encoding the semantic types and ontological predications allowed in the domain as well as the indicator rules that map textual expressions to relevant domain-specific relations.

4.4. Evaluation and Maintenance

Ontology development is an iterative process. It involves updating, extending, and correcting the implemented ontology based on iterative evaluation results along the way. These bring to light the need for corrections and lead to refinements to the ontology representation. We use the Semantic MEDLINE application [4,5], which visualizes semantic predications as a network of concepts and relations and links them to the specific text sentence(s) from which they derive, to assist in checking the validity of the domain-specific predications extracted by our approach.

5. Case study - Medical Informatics

5.1. Specification and Conceptualization

5.1.1. Data collection – Participatory Design

For this project we did not engage outside experts to guide data collection but relied instead on two team members (co-authors DS and TR) who are experts in the field of medical informatics.

To collect a representative corpus, DS conducted an initial PubMed search (Dec 2011) using the search query (“AMIA Annu Symp Proc”[Journal] OR “Proc AMIA Symp”[Journal] OR “Proc AMIA Annu Fall Symp”[Journal]) AND (“information retrieval”[Title/Abstract] OR “text retrieval”[Title/Abstract] OR “data mining”[Title/Abstract] OR “intelligent search”[Title/Abstract] OR “natural language”[Title/Abstract] OR “information extraction”[Title/Abstract] OR “decision making”[Title/Abstract] OR “lexical analysis”[Title/Abstract] OR “sentence parsing”[Title/Abstract] OR “language processing”[Title/Abstract] OR “question answering”[Title/Abstract]). The query retrieved 488 MEDLINE citations. Of those, a random subset of 300 citations containing 1,977 sentences was set aside for training and in-depth data analysis.
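
The structure of the training query (a group of journal restrictions combined with a group of title/abstract phrases) can be made explicit with a short sketch. The helper or_group and the variable names are hypothetical; the journal names, phrases, and field tags are copied from the query above, and no search is actually submitted.

```python
def or_group(terms, field):
    """Join quoted terms carrying the same field tag with OR, in parentheses."""
    return "(" + " OR ".join(f'"{t}"[{field}]' for t in terms) + ")"


JOURNALS = ["AMIA Annu Symp Proc", "Proc AMIA Symp", "Proc AMIA Annu Fall Symp"]

TOPICS = [
    "information retrieval", "text retrieval", "data mining",
    "intelligent search", "natural language", "information extraction",
    "decision making", "lexical analysis", "sentence parsing",
    "language processing", "question answering",
]

# Journal restriction AND topical restriction, as in the training query
training_query = or_group(JOURNALS, "Journal") + " AND " + or_group(TOPICS, "Title/Abstract")
print(training_query)
```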

5.1.2. Data Analysis – Grounded Theory and Participatory Design

Data analysis was carried out with domain expert TR in open discussion sessions (Participatory Design). Grounded Theory's open coding focused on the semantic interpretation of the sentences and the relevant, most informative concepts in them. Our look-up program aided in the identification of domain concepts not present in the UMLS Metathesaurus. Axial coding established the interactions between concepts. In selective coding we identified the main themes in the corpus, bringing together the concepts, their semantic types, and their binding relations. This represented the ontological space of the domain. By partitioning the ontological space into its core themes, we were able to conceptualize how these themes and their interactions are expressed in text. Seven main themes were identified: Topic of Importance, Provenance and Location, Methods and Systems, Support and Development, Knowledge Linking and Representation, Data Processing and Manipulation, and Analysis and Evaluation. Sorting sentences by topic similarity aided concept and relation identification within each theme.

5.2. Formalization

SemRep's processing of the training set provided the sentential context for concept interpretation, to assess whether new relations were needed, whether Metathesaurus-mapping concepts needed semantic type reassignment (5.2.1.2), or MetaMap mappings needed filtering out (5.2.1.3). Non-mapping noun phrases were similarly analyzed, to determine their most appropriate semantic type. They were also analyzed for potential synonymy to Metathesaurus-mapping and non-mapping noun phrases, within the context of their binding relations and the sentential interpretation (5.2.1.4).

5.2.1. Domain Concepts

All domain-relevant Metathesaurus-mapping and non-mapping concepts were analyzed for their actual or potential semantic type, respectively, with reassignment or assignment as per the criteria below.

5.2.1.1. Non-Metathesaurus-Mapping Noun Phrases: Semantic Type Assignment

For non-Metathesaurus-mapping domain-relevant concepts for which no semantically suitable UMLS semantic type was available, eight new semantic types were defined (Table I) and manually assigned through linguistic / semantic analysis and group discussions.

Table I.

Newly created domain-specific semantic types and example concepts within each.

New domain-specific semantic types	Abbreviation	Examples of new concepts
Attribute	[attr]	Qualities, Attribute, Interoperability
Community Characteristic	[comc]	Health, Incidence
Information Construct	[infc]	Information, Paper, Data, Document
Linguistic Artifact	[lart]	Sentences, Synonym, Term
Linguistic Phenomenon	[lphn]	Anaphora, Ambiguity, Word Senses
Program Type	[prty]	Initiative, Project, Program
Text Characteristics	[txch]	Format, Readability, Coverage
Unit of Measure	[umes]	Specificity, Performance, Recall

Other times, a semantically suitable UMLS semantic type was available for these new concepts, such as Professional or Occupational Group for Annotator, or Machine Activity for Semantic Processing. Our domain files contain 452 new concepts with a UMLS semantic type and 287 with a new one from Table I.

5.2.1.2. Metathesaurus-Mapping Noun Phrases: Semantic Type Re-Assignment (contextualization)

Domain-relevant Metathesaurus-mapping noun phrases in the corpus were analyzed to assess whether their semantic types, best suited for the biomedical domain, were semantically applicable in medical informatics. If they were not, they were reassigned more semantically appropriate ones, as shown in Table III. 470 UMLS Metathesaurus concepts were re-assigned new semantic types in this way.

Table III.

Examples of semantic type contextualization of existing UMLS Metathesaurus concepts

UMLS concept	UMLS semantic type	Changed to
Manual Extraction	Therapeutic or Preventive Procedure [topp]	Activity [acty]
Performance	Individual Behavior [inbe]	Unit of Measure [umes]
Discharge summary	Intellectual Product [inpr]	Information Construct [infc]
Computer Applications	Functional Concept [ftcn]	Manufactured Object [mnob]
Term	Time Concept [tmco]	Linguistic Artifact [lart]

5.2.1.3. Filtering out Domain-Inappropriate MetaMap Mappings

MetaMap mappings appropriate for the biomedical domain may trigger false positive or false negative predications in medical informatics texts. For example, the adjective little mapped to the UMLS Metathesaurus concept Little's Disease [Disease or Syndrome], triggering a false positive predication; the mapping of corpus (set of documents) to Body of uterus [Body Part, Organ, or Organ Component] would result in a missed predication that called for an argument with semantic type Information Construct. In a supplemental file, we blocked 62 such mappings so that MetaMap ignores them.
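
Blocking of this kind amounts to filtering candidate mappings against a list of (text, concept) pairs. The sketch below is illustrative only: the BLOCKED set, the tuple layout, and the function filter_mappings are assumptions, and the format of the actual supplemental file is not given in the text.

```python
# Pairs of (lower-cased text, concept) to be ignored in this domain;
# both entries are the examples given in the text.
BLOCKED = {
    ("little", "Little's Disease"),
    ("corpus", "Body of uterus"),
}


def filter_mappings(candidates, blocked=BLOCKED):
    """Drop candidate (text, concept, semantic type) mappings that are blocked."""
    return [
        (text, concept, semtype)
        for (text, concept, semtype) in candidates
        if (text.lower(), concept) not in blocked
    ]


candidates = [
    ("little", "Little's Disease", "Disease or Syndrome"),
    ("Corpus", "Body of uterus", "Body Part, Organ, or Organ Component"),
    ("tools", "Tool, device (physical object)", "Manufactured Object"),
]
print(filter_mappings(candidates))
```

Once the biomedical mapping of corpus is filtered out, the phrase is free to be handled as a domain concept with a type such as Information Construct.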

5.2.1.4. Synonymy Determination

Semantic analysis revealed new concepts that were semantically similar to UMLS Metathesaurus concepts and were added as their synonyms in our domain files, inheriting their semantic types. For example, medical attention was identified as a synonym (variant) of UMLS Metathesaurus medical care [Therapeutic or Preventive Procedure]. Non-Metathesaurus-mapping concepts that showed strong semantic similarity among themselves were grouped together. Within those lexical groups and based on group discussions, the best candidate for a new concept was selected as the preferred term, the rest as its synonyms. We identified 139 new synonyms for UMLS Metathesaurus concepts and 235 for non-Metathesaurus-mapping ones.

5.2.2. Domain Relations

To capture the nature of some of the relationships connecting new or redefined concepts, new predicates not in the UMLS Semantic Network were also needed. Twenty-four such semantic predicates were defined: ANALYZES, CATEGORIZES, COLLECTS, INTERPRETS, COORDINATES_WITH, DEVELOPS, DISPLAYS, ENHANCES, EVALUATES, EXTRACTS, FACILITATES, FOCUSES_ON, IDENTIFIES, LINKS_TO, PERFORMS, PROCESSES, PROVIDES, REPRESENTS, SEARCHES, SOURCE_OF, SUPPORTS, VALIDATES. Example predications of new relations are shown in Table VI.

Table VI.

Examples of new ontological predications (b examples) and the semantic predications they trigger (c) from corpus sentences (a). Semantic type abbreviations in Tables I-III.

PMID	Sentence in abstract (a)	Ontological predication (b)	Semantic predication (c)
21347110	(7a) After ... filtering, the ...augmented UMLS Metathesaurus contains 518,835 terms	(7b) [infc] CONTAINS [lart]	(7c) UMLS Metathesaurus CONTAINS Term
21347116	(8a) Improving Search for Evidence-based Practice using Information Extraction	(8b) [acty] USES [mcha]	(8c) Search USES Information Extraction
10999004	(9a) Slightly more than half of the symptom terms...do not match the UMLS	(9b) [lart] LINKS_TO [mnob]	(9c) Medical term NEG_LINKS_TO UMLS
21347067	(10a)...provide the foundation for tools that enable epidemiological research exploration...	(10b) [mnob] FACILITATES [resa]	(10c) Tool FACILITATES Epidemiologic Studies

5.2.3. Indicator Rules

The links between text expressions and the new semantic predicates they indicate were determined by careful manual analysis through semantic interpretation. For example, text indicators for the newly added Medical Informatics ENHANCES predicate were: boost (verb), boost (noun), enhance, enhancement, enrich, enrichment, augment, augmentation, improve, improvement, increase (verb), increase (noun), optimize, optimization. A total of 360 indicator rules were added specifically for Medical Informatics, averaging 15 for each of the 24 new semantic predicates.
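
Indicator rules of this kind can be pictured as a look-up from a lexical item and its part of speech to a predicate. The sketch below encodes the ENHANCES indicators listed above; the tuple representation and the function predicate_for are hypothetical, and the part-of-speech labels for items the text does not mark (for example enhance as a verb, enhancement as a noun) are our assumption.

```python
# (lemma, part of speech) pairs indicating ENHANCES, from the list in the text
ENHANCES_INDICATORS = [
    ("boost", "verb"), ("boost", "noun"),
    ("enhance", "verb"), ("enhancement", "noun"),
    ("enrich", "verb"), ("enrichment", "noun"),
    ("augment", "verb"), ("augmentation", "noun"),
    ("improve", "verb"), ("improvement", "noun"),
    ("increase", "verb"), ("increase", "noun"),
    ("optimize", "verb"), ("optimization", "noun"),
]

INDICATOR_RULES = {key: "ENHANCES" for key in ENHANCES_INDICATORS}


def predicate_for(lemma, pos):
    """Return the predicate signalled by an indicator, or None if there is none."""
    return INDICATOR_RULES.get((lemma.lower(), pos))


print(len(ENHANCES_INDICATORS))          # number of ENHANCES rules in this sketch
print(predicate_for("improve", "verb"))
print(360 / 24)                          # average rules per new predicate
```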

Iterative evaluation results uncovered four SemRep indicator rules (Table V) that triggered unwanted results in medical informatics. For example, indicator of triggers valid LOCATION_OF results with anatomy arguments in biomedical texts, but not in this corpus (6). Preposition for as a USES indicator conflicted with its use in FOCUSES_ON, which was prioritized. These rules were disabled for the enhanced version.

Table V.

SemRep indicator rules disabled for the enhanced medical informatics version.

Predicate	Indicator	Indicators’ Part-of-Speech
TREATS	for	Preposition
TREATS	in	Preposition
USES	for	Preposition
LOCATION_OF	of	Preposition

(6) FP: Radiology report [Information Construct] LOCATION_OF repository [Manufactured Object]
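
Disabling the rules in Table V can be sketched as removing them from the active rule set before processing domain text. The rule tuples and the function active_rules are hypothetical; the four disabled rules come from Table V, and the FOCUSES_ON rule for the preposition for reflects the prioritization described above.

```python
# (predicate, indicator, part of speech): the four rules listed in Table V
DISABLED = {
    ("TREATS", "for", "preposition"),
    ("TREATS", "in", "preposition"),
    ("USES", "for", "preposition"),
    ("LOCATION_OF", "of", "preposition"),
}

# Illustrative rule inventory: the four generic rules plus one domain rule
RULES = sorted(DISABLED) + [("FOCUSES_ON", "for", "preposition")]


def active_rules(rules, disabled=DISABLED):
    """Keep only the rules that are not disabled for the domain."""
    return [rule for rule in rules if rule not in disabled]


def predicates_for(indicator, pos, rules):
    """List the predicates that an indicator can still trigger."""
    return [p for (p, ind, tag) in rules if ind == indicator and tag == pos]


domain_rules = active_rules(RULES)
print(predicates_for("for", "preposition", RULES))         # before disabling
print(predicates_for("for", "preposition", domain_rules))  # after disabling
```

After disabling, the preposition for is left with a single reading in this sketch, which removes the conflict between USES and FOCUSES_ON.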

5.3. Implementation

To implement the ontology and obtain the desired semantic predications in medical informatics, ontological predications were added as needed, to allow their use with domain-relevant semantic types as subject-object arguments. Conforming to the Semantic Network format (a predicate that takes semantic types as arguments), we crafted 520 ontological predications based on the relevant concepts and predicates identified in the corpus. Table VI shows sentences (7a-10a), the ontological predications (7b-10b) that apply, and the semantic predications (7c-10c) they generate. Ontological predications included those with Semantic Network predicates and newly created semantic types (7b), those with existing ones in combinations not originally specified (8b), and those with new domain predicates with new (9b) and existing semantic types (10b). UMLS CONTAINS and LOCATION_OF predicates were redefined by specifying as their arguments semantic types not stipulated for these predicates in the Semantic Network, as in (7b) in Table VI. Thus, contextualization extends to domain ontology concepts and predicates due to the way different SemRep components work together. The verbs identified by our look-up program in the (a) sentences of Table VI (contains, using, match, enable) became indicators for predicates CONTAINS, USES, LINKS_TO, and FACILITATES, respectively.

5.4. Evaluation and Maintenance

We corrected and enhanced the implemented ontology through analysis of iterative training set results generated by extended SemRep. This analysis task was facilitated by Semantic MEDLINE [4,5] visualization of the information extracted.

5.4.1. Test Set

To evaluate results, a more encompassing PubMed search was conducted (Feb 2012) using the query (“information retrieval”[Title/Abstract] OR “text retrieval”[Title/Abstract] OR “data mining”[Title/Abstract] OR “intelligent search”[Title/Abstract] OR “natural language” [Title/Abstract] OR “information extraction”[Title/Abstract] OR “lexical analysis” [Title/Abstract] OR “sentence parsing”[Title/Abstract] OR “language processing”[Title/Abstract] OR “question answering” [Title/Abstract]). We retrieved 5,205 MEDLINE citations. To allow predications to capture the action or activity performed by the authors, in a pre-processing step we substituted the concept author for the commonly occurring we, as SemRep does not handle pronouns or pronominal anaphora. Then, to obtain a prototype representation, we processed the 5,205 citations through the enhanced SemRep version. This run generated 23,860 predications. A subset of 500 randomly selected citations (evaluation set) containing 3,767 sentences and 2,092 predications was set aside and processed through generic SemRep without any domain knowledge, for a baseline evaluation against the results obtained with the enhanced version of SemRep.
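
The pronoun substitution in the pre-processing step can be sketched with a word-boundary regular expression. This is an illustration, not the script used in the study: the function substitute_author is hypothetical, and preserving sentence-initial capitalization is our assumption.

```python
import re

# Match "we" only as a whole word, in any capitalization
_WE = re.compile(r"\bwe\b", flags=re.IGNORECASE)


def substitute_author(sentence: str) -> str:
    """Replace the pronoun 'we' with 'author', keeping an initial capital."""
    def repl(match):
        return "Author" if match.group(0)[0].isupper() else "author"
    return _WE.sub(repl, sentence)


print(substitute_author("We developed a parser."))
print(substitute_author("In this paper we evaluate the web interface."))
```

The word boundaries keep words such as web or were unchanged, so only the free-standing pronoun is turned into a mappable argument.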

5.4.2. Evaluations

Two types of evaluation were conducted. The first evaluation focused on comparison of predications and predicate frequencies in both SemRep versions (Tables VII, VIII) on 500 citations. Examples of new predications are given in Table IX. We also compared the frequency of occurrence and the performance of two predicates, TREATS and ADMINISTERED_TO, in the two SemRep runs (Table X).

Table VII.

Generic SemRep vs. SemRep enhanced for medical informatics.

SemRep runs	Citations (N=)	Sentences	Predications
Generic	500	3,767	495
Medical informatics	500	3,767	2,065

Table VIII.

Frequency count distributions by predicate types.

Generic SemRep		Enhanced for medical informatics
Predicate	Frequency count	Predicate	Frequency count
PROCESS_OF	77 (19)	FOCUSES_ON	505 (0)
LOCATION_OF	63 (280)	USES	293 (26)
COEXISTS_WITH	31 (23)	LOCATION_OF	280 (63)
AFFECTS	31 (33)	PROVIDES	128 (0)
USES	26 (293)	DEVELOPS	76 (0)
TREATS	26 (9)	EXTRACTS	64 (0)
METHOD_OF	19 (17)	IDENTIFIES	62 (0)
ADMINISTERED_TO	17 (1)	PROCESSES	48 (0)
ASSOCIATED_WITH	13 (13)	CONTAINS	47 (0)
INTERACTS_WITH	12 (12)	ANALYZES	36 (0)

Table IX.

Examples of new USES and LOCATION_OF ontological predications and results of their application.

PMID	Sentence in abstract	Ontological predication	Semantic predication
9609492	(11a) These include more rapid knowledge acquisition using data mining.	(11b) [mcha] USES [mcha]	(11c) Knowledge acquisition USES data mining
9571082	(12a) When this method was applied to a test database, an improvement ...	(12b) [mnob] USES [infc]	(12c) Techniques USES Databases
5921470	(13a) This paper supplements information given in earlier papers on the ...	(13b) [infc] LOCATION_OF [infc]	(13c) Paper LOCATION_OF Information
4566036	(14a) ... two computerized information retrieval systems at the University of Georgia.	(14b) [orgt] LOCATION_OF [mnob]	(14c) Universities LOCATION_OF Information Retrieval Systems

Table X.

Evaluation judgments for TREATS and ADMINISTERED_TO in the two SemRep versions.

Relation	Generic SemRep TP	Generic SemRep FP	Generic SemRep Subtotal	Enhanced SemRep TP	Enhanced SemRep FP	Enhanced SemRep Subtotal
TREATS	12	14	26	9	0	9
ADMINISTERED_TO	2	15	17	1	0	1
TOTAL PREDICATIONS	14	29	43	10	0	10

The second type of evaluation focused on performance of the enhanced SemRep and was separately conducted on a random 135-citation subset of the 500-citation evaluation set. One of the co-authors (CS), a physician and a biomedical informatics researcher familiar with both SemRep and the UMLS, manually annotated this subset (659 sentences, 308 predications). CS did not participate in any stage of the development of SemRep's coverage extension. Also annotated were 42 false negative predications that the system failed to generate. Altogether, these added up to 350 judgments.

5.4.2.1. Comparison of Generic and Enhanced SemRep

The counts for each run (Table VII) show that SemRep's enhanced version retrieved about four times as many predications as generic SemRep (2,065 vs. 495). This is in line with the domain knowledge implemented, which was tailored to medical informatics texts.

Table VIII lists the 10 most frequent predicates in each run, ranked by the raw number of predications per predicate. Values in parentheses give the frequency of the same predicate in the other run.
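
The layout of Table VIII, with each predicate's count followed by its count in the other run, can be reproduced with a short sketch. The counts below are a subset taken from Table VIII; the function name ranked_with_other is an assumption for illustration.

```python
from collections import Counter

# Subset of the counts in Table VIII; a Counter returns 0 for predicates
# that do not occur in a run.
generic = Counter({"PROCESS_OF": 77, "LOCATION_OF": 63, "USES": 26, "TREATS": 26})
enhanced = Counter({"FOCUSES_ON": 505, "USES": 293, "LOCATION_OF": 280,
                    "PROCESS_OF": 19, "TREATS": 9})


def ranked_with_other(run, other, n=10):
    """Rank predicates by frequency in `run`, pairing each with its
    frequency in `other` (the parenthesized value in Table VIII)."""
    return [(pred, count, other[pred]) for pred, count in run.most_common(n)]


for pred, count, other_count in ranked_with_other(enhanced, generic, n=3):
    print(f"{pred}\t{count} ({other_count})")
```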

The medical informatics predicates FOCUSES_ON, PROVIDES, DEVELOPS, EXTRACTS, IDENTIFIES, PROCESSES, and ANALYZES are new, which explains their zero counts in generic SemRep (Table VIII, values in parentheses, right-hand column). The zero count for CONTAINS is likewise explained by the new ontological predications for this predicate. Assertions about the systems, methods, and tools used are common in medical informatics, which explains the large number of predications involving USES in enhanced SemRep. Table IX shows examples of representative predications obtained by redefining predicates to allow novel co-occurring combinations with new and existing semantic types.

Table VIII also shows a marked disparity in the number of TREATS and ADMINISTERED_TO relations between the two runs. We analyzed these relations to assess how enhancing SemRep for a particular domain affects the extraction of relations already covered by generic SemRep. The evaluation results for these two relations (Table X) show that, while enhanced SemRep extracts far fewer of these predications (10 vs. 43), they are considerably more accurate; in fact, all TREATS and ADMINISTERED_TO predications extracted by enhanced SemRep are correct. This suggests that extending the coverage of SemRep to a novel domain via conceptualization and contextualization has the added benefit of increasing the accuracy of core biomedical relations in that domain. For example, consider the sentence in (15). Generic SemRep generates the erroneous predication shown below because the UMLS Metathesaurus concept Extraction has the semantic type Therapeutic or Preventive Procedure rather than the more appropriate Machine Activity. This causes a false positive error. Adding the new domain concept Information Extraction with the semantic type Machine Activity prevents predications of this type from being generated.

FP: Extraction [Therapeutic or Preventive Procedure] ADMINISTERED_TO Author [Human]
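
The accuracy contrast in Table X can be made explicit by computing precision per relation and per run from the TP and FP judgments. The sketch below uses the counts from Table X; the dictionary layout and function name are assumptions for illustration.

```python
# TP/FP judgments from Table X: relation -> {run: (TP, FP)}.
TABLE_X = {
    "TREATS": {"generic": (12, 14), "enhanced": (9, 0)},
    "ADMINISTERED_TO": {"generic": (2, 15), "enhanced": (1, 0)},
}


def precision(tp, fp):
    """Fraction of extracted predications judged correct."""
    total = tp + fp
    return tp / total if total else float("nan")


for relation, runs in TABLE_X.items():
    for run, (tp, fp) in runs.items():
        print(f"{relation:16s} {run:9s} precision = {precision(tp, fp):.2f}")
```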

5.4.2.2. Enhanced SemRep: Performance Evaluation

From the 659 sentences manually annotated with 350 semantic predications, enhanced SemRep generated 308 predications. We calculated precision, recall, and F-score for these results, shown in Table XI.

Table XI.

Evaluation of enhanced SemRep.

Evaluator judgment	True positive	False positive	False negative
Judgments issued	212	96	42

Measure	Precision	Recall	F-score
Result	0.69	0.83	0.76
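
The measures in Table XI can be recomputed from the judgment counts as a quick consistency check. The paper does not state the formulas used, so the sketch assumes the standard definitions of precision and recall and a balanced F-score (harmonic mean of the two).

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and balanced F-score from judgment counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Judgment counts from Table XI.
p, r, f = precision_recall_f1(tp=212, fp=96, fn=42)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```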

The 308 predications generated by enhanced SemRep contained 337 UMLS concepts with their original semantic type, 169 UMLS concepts with a contextualized semantic type (i.e., changed from the original UMLS-assigned type), and 110 entirely new concepts. We analyzed the 96 predications erroneously generated by enhanced SemRep. They fall into the categories shown in Table XII. The error analysis indicates that the majority of errors (53%) are caused by general SemRep processing issues, which are independent of domain extensions. Such errors include tagger and parser problems, failure to recognize negation, and failure to handle appositives and impersonal passive constructions. Addressing these problems in generic SemRep will translate into improved results in extensions to other domains. Despite the mechanisms applied in the enhanced SemRep version, issues in conceptualization and contextualization, including ambiguity (part-of-speech, lexical, conceptual) and missing concepts, account for 31% of the errors. The remaining errors (16%) are due to missing or ill-defined ontological predications or indicator rules. We discuss these errors in more detail below and provide examples.

Table XII.

Categories of false positive errors.

Example	Error type	Error subtype	Number of predications	Percent
	Metathesaurus-related			31%
(16)		Inappropriate semantic type/concept mapping	2
(17)		Missing concepts	7
(18,19)		Part-of-speech, lexical, or other ambiguity	19
(20)		Multiword expressions not recognized as such	3
	Rule-related			16%
(21)		Missing ontological predications	2
(22)		Indicator rules (misapplying, ill-defined, etc.)	13
	General SemRep processing			53%
		Negation	1
(23,24)		Other (parsing errors, appositives, ...)	49
Total FP predications			96	100%

5.4.2.2.1. Metathesaurus-related Errors

Semantic types

Not all infelicitous mappings were eliminated by semantic type contextualization. For example, “clinical condition” in (16) was missed as the object argument because of its UMLS-assigned Qualitative Concept semantic type, rather than the more appropriate Disease or Syndrome:

FP: Natural Language processor [Manufactured Object] IDENTIFIES Report [Information Construct]

Missing concepts

Care is not a UMLS Metathesaurus concept and it did not appear in the training set. As a result, it had not been added to our supplemental files as a synonym to Health Care [Health Care Activity]. Thus, information was taken as the object in (17):

FP: Dental hygienist [Professional or Occupational Group] PROVIDES Information [Information Construct]

Ambiguity

Although SemRep uses a method for ambiguity resolution [39], unwanted results are still triggered. For example, attendant in (18) is taken as a noun, not as an adjective (part-of-speech ambiguity); stress in (19) is a synonym of emphasis, not a medical condition (lexical ambiguity):

FP: Attendants [Professional or Occupational Group] DEVELOPS Computers [Manufactured Object]

FP: Fact [Information Construct] LOCATION_OF Stress [Finding]

Multiword expressions

Multiword units not recognized as such may result in their individual terms being used as arguments in a predication. The unit “for example” triggered 3 FPs, with “example” used as an argument (20):

FP: Databases [Information Construct] FOCUSES_ON Example [Information Construct]

5.4.2.2.2. Rule-related

Missing ontological predications

Failure to specify the co-occurrence of allowable semantic types as arguments for a given predicate may result in the true argument being ignored and another argument, whose semantic type is allowed by an existing rule, being used instead. In (21), a missing LOCATION_OF ontological predication with [Manufactured Object] as both subject and object resulted in the Information Construct “article”, rather than the Manufactured Object “category”, being used as the object:

FP: Framework [Manufactured Object] LOCATION_OF article [Information Construct]
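
The effect of a missing ontological predication on argument selection can be sketched as follows. This is a simplified illustration under stated assumptions, not SemRep's procedure: the function pick_object, the candidate ordering, and the rule sets are hypothetical, and only the semantic types and the predicate come from example (21).

```python
# Hypothetical sketch of argument selection constrained by ontological
# predications; SemRep's real procedure is more elaborate.

def pick_object(subject_type, predicate, candidates, rules):
    """Return the first candidate (concept, semantic type) whose type is
    licensed as object for this subject type and predicate, else None."""
    for concept, sem_type in candidates:
        if (subject_type, predicate, sem_type) in rules:
            return concept
    return None


# Candidates for example (21), in an assumed order of preference: the
# intended object first, then a competing noun phrase.
candidates = [("Category", "mnob"), ("Article", "infc")]

# Rule set without, and with, the [mnob] LOCATION_OF [mnob] predication.
rules_missing = {("mnob", "LOCATION_OF", "infc")}
rules_complete = rules_missing | {("mnob", "LOCATION_OF", "mnob")}

print(pick_object("mnob", "LOCATION_OF", candidates, rules_missing))   # Article
print(pick_object("mnob", "LOCATION_OF", candidates, rules_complete))  # Category
```

With the rule missing, the intended object is skipped and the next licensed candidate is chosen, which corresponds to the false positive above.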

Indicator rules

Rules may fit some contexts but trigger wrong results in others. The indicator “from” for PROVIDES generated many TP predications in this domain, but it triggered an FP in (22):

FP: Dictionary [Information Construct] PROVIDES Analyzer [Manufactured Object]

5.4.2.2.3. General SemRep Processing

Some FP predications were triggered by the inability of SemRep to handle certain constructions, such as higher clause negation, or appositive numbers. For example, due to appositive number ‘2’ in (23), SemRep failed to generate the semantic predication ‘Experiment ANALYZES relationship’:

FP: Relationships [Manufactured Object] USES Approach [Manufactured Object]

FN: Experiment [Research Activity] ANALYZES Relationship [Manufactured Object]

Parsing and tagger issues accounted for most errors in this category, as in (24), where the passive construction involving the predicate index was not recognized due to a parser error.

FP: Author [Professional or Occupational Group] PROCESSES Databases [Information Construct]

System processing issues, such as argument identification problems and missing co-occurrence rules, often prevent predications from being generated even though the corresponding rules may exist: the fact that a rule exists is not a guarantee that it will apply. Focusing on the process followed, the case study presented illustrates with concrete examples how the different elements that resulted from the process work together.

7. Conclusions and Future Work

We extended SemRep coverage to the medical informatics domain by applying and adapting the 4-step ontology engineering process proposed by Kuziemsky and Lau [16]. Our results provide further evidence of the usefulness and extensibility of their approach, which they had applied to only one specific domain. Furthermore, our results demonstrate the suitability of the UMLS knowledge source model (in particular, the UMLS Metathesaurus and Semantic Network) as a general model to capture domain knowledge. They also validate SemRep as a general, knowledge-based natural language processing system that can be extended to new domains in a straightforward, methodical way. This further establishes SemRep as a valuable tool for ontology-based NLP applications, such as those relating to knowledge management. For example, SemRep is currently being explored as an addition to the suite of tools used in grant or research portfolio analysis of the type carried out at the National Cancer Institute at the National Institutes of Health [40].

While we use automated tools to suggest terms that are potentially useful in new domains, much of the work of assessing their appropriateness, and of determining whether and how they are covered in the UMLS, remains a manual process. These tasks can be time-consuming and error-prone. Thus, future work includes automating this process further and providing better validation. For example, synonymy discovery techniques based on co-occurrence patterns [41] or dictionary definitions [42] could be exploited to automatically conflate terms into conceptual groups. Similarly, hypernymy relations (IS-A) can be discovered to some extent by exploiting syntactic patterns like ‘A, such as B’ [43] or ‘C tool’. However, while these techniques may lessen the burden on domain experts, manual intervention will still be required because the techniques provide less than perfect results. Based on our experience with the ontology engineering process and the errors encountered, we have also started implementing tools to automatically check the validity of encoded domain knowledge with respect to the UMLS and SemRep, including checks for unique concept identifiers, rule formatting, and so on.
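
As an illustration of the pattern-based hypernymy discovery mentioned above, the following sketch extracts IS-A candidates from the ‘A, such as B’ pattern. It is a deliberately simplified assumption: noun phrases are approximated by single lowercase words, whereas practical systems operate on parsed noun phrases, and the function name hypernym_pairs is hypothetical.

```python
import re

# Simplified "A, such as B" pattern. Each noun phrase is approximated by a
# single word, which is an assumption made to keep the sketch short.
_WORD = r"[a-z][a-z-]*"
_SUCH_AS = re.compile(rf"\b({_WORD}),? such as ({_WORD})")


def hypernym_pairs(text):
    """Return (hyponym, 'IS-A', hypernym) candidates found in text."""
    return [(m.group(2), "IS-A", m.group(1))
            for m in _SUCH_AS.finditer(text.lower())]


print(hypernym_pairs("Bibliographic databases, such as MEDLINE, are searched."))
```

Candidates produced this way would still need review by a domain expert before being added to the ontology.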

Highlights

Integrating new domain knowledge into the UMLS extends SemRep core coverage.
An existing ontology-building method is adapted to accommodate the new knowledge.
Text analysis uncovers non-UMLS domain knowledge for a medical informatics case study.
New concepts, semantic types, and relations are defined for the new domain.
The new knowledge and semantic representation allow SemRep to apply to new domains.

Table II.

Examples of non-Metathesaurus-mapping domain concepts and semantic type assignment.

Domain concepts	Semantic type assigned	Semantic type source
Coder	Professional or Occupational Group [prog]	UMLS
Token	Manufactured Object [mnob]	UMLS
Computerization	Machine Activity [mcha]	UMLS
Natural Language	Language [lang]	UMLS
Header	Information Construct [infc]	New, domain-specific
Punctuation sign	Linguistic Artifact [lart]	New, domain-specific

Table IV.

Examples of new synonyms identified.

Metathesaurus concepts	Synonyms identified
Question answering	question-answering; qa; q-a
Date of Birth	birthdate
MEDLINE	Medline collection; Medline bibliographic database; Medline database

New domain concepts	Synonyms identified
Training Set	training data set; training dataset; training data; training corpus
Web Browser	Internet browser; browser
De-identification task	deidentification task; de-identification; deidentification

Acknowledgments

This study was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Footnotes

Conflicts of Interest: None

(1) ... fish oils can protect against coronary heart disease...

(2) Fish Oils [Pharmacologic Substance] PREVENTS Coronary heart disease [Disease or Syndrome]

(3) protect against (verb) → PREVENTS

(4) Pharmacologic Substance [phsu] PREVENTS Disease or Syndrome [dsyn]

(5) Fish Oils PREVENTS Coronary heart disease

(6) ...mining large repositories of radiology reports...could enable epidemiologic studies... [PMID 5152570]

(15) ...the author shows that it is possible to perform accurate information extraction. [PMID 10786287]

(16) MedLEE, a general-purpose natural language processor ...was compared to physicians’ ability to detect seven clinical conditions in... radiograph reports. [PMID 9550840]

(17) ...Dental hygienists need to access ... information sources to provide quality care. [PMID 9745646]

(18) ... the third generation of computers with attendant preparation; .... [PMID 5921470]

(19) Stress is laid on the fact that information outside the biomedical area... [PMID 9789091]

(20) ... search two or more databases-for example, in EMBASE and TOXLINE ... [PMID 9849544]

(21) ... 21 articles met the inclusion criteria for ... the categories in the framework. [PMID 9794316]

(22) It describes the design of an analyzer that can profit from a dictionary. [PMID 9779890]

(23) Experiment 2 examined the relationship ... using a spatial occlusion approach. [PMID 9635326]

(24) All the records were classified by journal and author's name and were verified for each record whether or not it was indexed in each database. [PMID 9849544]

References

1. Rindflesch TC, Fiszman M, Libbus B. Semantic interpretation for the biomedical research literature. In: Chen, Fuller, Hersh, Friedman, editors. Medical informatics: Knowledge management and data mining in biomedicine. Springer; 2005. pp. 399–422.
2. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003 Dec;36(6):462–77. doi: 10.1016/j.jbi.2003.11.003.
3. Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The Unified Medical Language System: An informatics research collaboration. J Am Med Inform Assoc. 1998 Jan-Feb;5(1):1–11. doi: 10.1136/jamia.1998.0050001.
4. Rindflesch TC, Fiszman M, Rosemblat G, Shin D. Semantic MEDLINE: An advanced information management application for biomedicine. Information Systems and Use. 2011;31(1,2):15–21.
5. Kilicoglu H, Fiszman M, Rodriguez A, Shin D, Ripple AM, Rindflesch TC. Semantic MEDLINE: A Web application to manage the results of PubMed searches. Proc. Third International Symposium for Semantic Mining in Biomedicine. 2008:69–76.
6. Miller CM, Rindflesch TC, Fiszman M, Hristovski D, Shin D, Rosemblat G, Zhang H, Strohl KP. A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep. 2012;35(2):279–285. doi: 10.5665/sleep.1640.
7. Cohen T, Widdows D, Schvaneveldt RW, Davies P, Rindflesch TC. Discovering discovery patterns with predication-based Semantic Indexing. J Biomed Inform. 2012 Dec;45(6):1049–65. doi: 10.1016/j.jbi.2012.07.003.
8. Jonnalagadda S, Del Fiol G, Medlin R, Weir C, Fiszman M, Mostafa J, Liu H. Automatically extracting sentences from Medline citations to support clinicians’ information needs. Journal of the American Society for Information Science and Technology (JASIST) 2012;0:1–6. doi: 10.1136/amiajnl-2012-001347.
9. Blomqvist E, Öhgren A, Sandkuhl K. Ontology construction in an enterprise context: comparing and evaluating two approaches. Proc. ICEIS. 2006. pp. 86–93.
10. Van Heijst G, Schreiber AT, Wielinga BJ. Using explicit ontologies for KBS development. Intl Journal of Human-Computer Studies. 1997;46(2-3).
11. Visser PRS, Bench-Capon TJM. A comparison of four ontologies for the design of legal knowledge systems. Artificial Intelligence and Law. 1998;6:27–57.
12. McCray AT. Conceptualizing the world: lessons from history. J Biomed Inform. 2006 Jun;39(3):267–73. doi: 10.1016/j.jbi.2005.08.007.
13. Noy N, McGuinness L. Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880. 2001.
14. Irwin Y, Harkema H, Christensen L, Schleyer T, Haug P, Chapman W. Methodology to Develop and Evaluate a Semantic Representation for NLP. AMIA Annu Symp Proc. 2009:271–275.
15. Analyti A, Theodorakis M, Spyratos N, Constantopoulos P. Contextualization as an independent abstraction mechanism for conceptual modeling. Inform Syst. 2007;32:24–60.
16. Kuziemsky CE, Lau F. A four stage approach for ontology-based health information system design. Artificial Intelligence in Medicine. 2010;50:133–48. doi: 10.1016/j.artmed.2010.04.012.
17. Ahlers CB, Fiszman M, Demner-Fushman D, Lang FM, Rindflesch TC. Extracting semantic predications from Medline citations for pharmacogenomics. Pac Symp Biocomput. 2007:209–20.
18. Rosemblat G, Resnick MP, Auston I, Shin D, Sneiderman C, Fiszman M, Rindflesch TC. Extending SemRep to the public health domain. J. Am. Soc. Inf. Sci. 2013. doi: 10.1002/asi.22899.
19. Keselman A, Rosemblat G, Kilicoglu H, Fiszman M, Jin H, Shin D, Rindflesch TC. Adapting semantic natural language processing technology to address information overload in influenza epidemic management. Journal of the American Society for Information Science and Technology (JASIST) 2010;61(12):2531–2543. doi: 10.1002/asi.21414.
20. Chandrasekaran B, Josephson JR, Benjamins VR. Ontology of Tasks and Methods. Proc. Workshop on Knowledge Acquisition, Modeling and Management (KAW'98); Banff, Canada: 1998.
21. Van Heijst G, Schreiber AT, Wielinga BJ. Using explicit ontologies for KBS development. Intl Journal of Human-Computer Studies. 1997;46(2-3).
22. Public Health Agency of Canada. Global Public Health Intelligence Network (GPHIN). 2004.
23. Collier N, Kawazoe A, Jin L, Shigematsu M, Dien D, et al. The BioCaster Ontology: A multilingual ontology for infectious disease outbreak surveillance: Rationale, design and challenges. J Lang Resources Eval. 2007;40:405–413. doi: 10.1007/s10579-007-9019-7.
24. Bodenreider O. Biomedical Ontologies in Action: Role in Knowledge Management, Data Integration and Decision Support. Yearb. Med. Inform. 2008;47:67–79.
25. Brewster C, Alani H, Dasmahapatra S, Wilks Y. Data Driven Ontology Evaluation. Proc. of Intl Conf. on Language Resources and Evaluation. Portugal: 2004. pp. 24–30.
26. Noy NF. Ontology Mapping. In: Staab S, Studer R, editors. Handbook on Ontologies, International Handbooks on Information Systems. 2009;(Part 4):573–590.
27. Feltrin E, Campanaro S, Diehl AD, Ehler E, Faulkner G, Fordham J, Gardin C, Harris M, Hill D, Knoell R, Laveder P, Mittempergher L, Nori A, Reggiani C, Sorrentino V, Volpe P, Zara I, Valle G, Deegan née Clark J. Muscle Research and Gene Ontology: New standards for improved data integration. BMC Med Genomics. 2009;2:6. doi: 10.1186/1755-8794-2-6.
28. Chen L, Morrey CP, Gu H, Halper M, Perl Y. Modeling multi-typed structurally viewed chemicals with the UMLS Refined Semantic Network. J Am Med Inform Assoc. 2009 Jan-Feb;16(1):116–31. doi: 10.1197/jamia.M2604.
29. Strauss A, Corbin J. Grounded theory methodology: an overview. In: Denzin NK, Lincoln YS, editors. Handbook of qualitative research. Sage; Thousand Oaks: 1994. pp. 273–85.
30. Glaser B, Strauss A. The Discovery of Grounded Theory: strategies for qualitative research. Chicago: Aldine; 1967.
31. Fernández M, Gómez-Pérez A, Juristo N. METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Proc. AAAI Spring Symp Series. AAAI Press; Menlo Park, Calif.: 1997. pp. 33–40.
32. Pinto SF, Martins JP. Ontologies: how can they be built? Knowledge Inform Syst. 2004;6:441–64.
33. Chapman W, Dowling JN. Inductive creation of an annotation schema for manually indexing clinical conditions from emergency department reports. J Biomed Inform. 2006;39(2):196–208. doi: 10.1016/j.jbi.2005.06.004.
34. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235–9.
35. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004;20(14):2320–1. doi: 10.1093/bioinformatics/bth227.
36. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. Proc AMIA Annual Symp. 2001:17–21.
37. Studer R, Benjamins R, Fensel D. Knowledge Engineering: Principles and Methods. Data Knowl. Eng. 1998;25(1-2):161–197. [Sowa, 2000] Guided Tour of Ontology. Sowa JF. Accessed at http://www.jfsowa.com/ontology/guided.htm.
38. Uschold M, Grüninger M. Ontologies: principles, methods, and applications. Knowledge Engineering Review. 1996;1:96–137.
39. Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC. Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing. Journal of the American Society for Information Science and Technology (JASIST) 2006;57(1):96–113. doi: 10.1002/asi.20257.
40. Krueger L, Vos M, Bukowski M, Johnson C, Lau W, Collie K. NCI and CIT: Creating a Knowledge Management Toolkit. 2012 Feb; Accessed at http://dcb.cit.nih.gov/PortfolioParnership.pdf.
41. Mohammad S, Hirst G. Distributional measures of concept-distance: a task-oriented evaluation. Proc. Conference on Empirical Methods in Natural Language Processing. Sydney, Australia: 2006. pp. 35–43.
42. Wang T, Hirst G. Exploring patterns in dictionary definitions for synonym extraction. Journal of Natural Language Engineering. 2012;18(3):313–342.
43. Hearst M. Automatic acquisition of hyponyms from large text corpora. Proc. 14th International Conference on Computational Linguistics. Vol. 2. Nantes, France: 1992. pp. 539–545.

[R1] 1.Rindflesch TC, Fiszman M, Libbus B. Semantic interpretation for the biomedical research literature. In: Chen, Fuller, Hersh, Friedman, editors. Medical informatics: Knowledge management and data mining in biomedicine. Springer; 2005. pp. 399–422. [Google Scholar]

[R2] 2.Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003 Dec;36(6):462–77. doi: 10.1016/j.jbi.2003.11.003. [DOI] [PubMed] [Google Scholar]

[R3] 3.Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The Unified Medical language System: An informatics research collaboration. J Am Med Inform Assoc. 1998 Jan-Feb;5(1):1–11. doi: 10.1136/jamia.1998.0050001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Rindflesch TC, Fiszman M, Rosemblat G, Shin D. Semantic MEDLINE: An advanced information management application for biomedicine. Information Systems and Use. 2011;31(1,2):15–21. [Google Scholar]

[R5] 5.Kilicoglu H, Fiszman M, Rodriguez A, Shin D, Ripple AM, Rindflesch TC. Semantic MEDLINE: A Web application to manage the results of PubMed searches. Proc. Third International Symposium for Semantic Mining in Biomedicine. 2008:69–76. [Google Scholar]

[R6] 6.Miller CM, Rindflesch TC, Fiszman M, Hristovski D, Shin D, Rosemblat G, Zhang H, Strohl KP. A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep. 2012;35(2):279–285. doi: 10.5665/sleep.1640. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Cohen T, Widdows D, Schvaneveldt RW, Davies P, Rindflesch TC. Discovering discovery patterns with predication-based Semantic Indexing. J Biomedical Inform. 2012 Dec;45(6):1049–65. doi: 10.1016/j.jbi.2012.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Jonnalagadda S, Del Fiol G, Medlin R, Weir C, Fiszman M, Mostafa J, Liu H. Automatically extracting sentences from Medline citations to support clinicians’ information needs. Journal of the American Society for Information Science and Technology (JASIST) 2012;0:1–6. doi: 10.1136/amiajnl-2012-001347. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Blomqvist E, Öhgren A, Sandkuhl K. Ontology construction in an enterprise context: comparing and evaluating two approaches. Proc. ICEIS. 2006. pp. 86–93.

[R10] 10.Van Heijst G, Schreiber AT, Wielinga BJ. Using explicit ontologies for KBS development. Intl Journal of Human-Computer Studies. 1977;46(2-3) [Google Scholar]

[R11] 11.Visser PRS, Bench-Acpon TJM. A comparison of four ontologies for the design of legal knowledge systems. Artificial Intelligence and Law. 1998;6:27–57. 1998. [Google Scholar]

[R12] 12.McCray AT. Conceptualizing the world: lessons from history. J Biom Inform. 2006 Jun;39(3):267–73. doi: 10.1016/j.jbi.2005.08.007. [DOI] [PubMed] [Google Scholar]

[R13] 13.Noy N, McGuinness L. Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880. 2001.

[R14] 14.Irwin Y, Harkema H, Christensen L, Schleyer T, Haug P, Chapman W. Methodology to Develop and Evaluate a Semantic Representation for NLP. Proc. AMIA Annual Symp Proc. 2009:271–275. [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Analyti A, Theodorakis M, Spyratos N, Constantopoulos P. Contextualization as an independent abstraction mechanism for conceptual modeling. Inform Syst. 2007;32:24–60. [Google Scholar]

[R16] 16.Kuziemsky CE, Lau F. A four stage approach for ontology-based health information system design. Artificial Intelligence in Medicine. 2010;50:133–48. doi: 10.1016/j.artmed.2010.04.012. [DOI] [PubMed] [Google Scholar]

[R17] 17.Ahlers CB, Fiszman M, Demner-Fushman D, Lang FM, Rindflesch TC. Extracting semantic predications from Medline citations for pharmacogenomics. Pac Symp Biocomput. 2007:209–20. [PubMed] [Google Scholar]

[R18] 18.Rosemblat G, Resnick MP, Auston I, Shin D, Sneiderman C, Fiszman M, Rindflesch TC. Extending SemRep to the public health domain. J. Am. Soc. Inf. Sci. 2013 doi: 10.1002/asi.22899. doi: 10.1002/asi.22899. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Keselman A, Rosemblat G, Kilicoglu H, Fiszman M, Jin H, Shin D, Rindflesch TC. Adapting semantic natural language processing technology to address information overload in influenza epidemic management. Journal of the American Society for Information Science and Technology (JASIST) 2010;61(12):2531–2543. doi: 10.1002/asi.21414. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Chandrasekaran B, Josephson JR, Benjamins VR. Proc. Workshop on Knowledge Acquisition. Modeling and Management (KAW'98); Banff, Canada: 1998. Ontology of Tasks and Methods. [Google Scholar]

[R21] 21.Van Heijst G, Schreiber AT, Wielinga BJ. Using explicit ontologies for KBS development. Intl Journal of Human-Computer Studies. 1977;46(2-3) [Google Scholar]

[R22] 22.Public Health Agency of Canada Global Public Health Intelligence Network (GPHIN) 2004.

[R23] 23.Collier N, Kawazoe A, Jin L, Shigematsu M, Dien D, et al. The BioCaster Ontology: A multilingual ontology for infectious disease outbreak surveillance: Rationale, design and challenges. J Lang Resources Eval. 2007;40:405–413. doi: 10.1007/s10579-007-9019-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Bodenreider O. Biomedical Ontologies in Action: Role in Knowledge Management, Data Integration and Decision Support. Yearb. Med. Inform. 2008;47:67–79. [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Brewster C, Alani H, Dasmahapatra S, Wilks Y. Proc. of Intl Conf. on Language Resources and Evaluation. Portugal: 2004. Data Driven Ontology Evaluation. pp. 24–30. [Google Scholar]

[R26] 26. Noy NF. Ontology Mapping. In: Staab S, Studer R, editors. Handbook on Ontologies, International Handbooks on Information Systems. 2009;(Part 4):573–590.

[R27] 27. Feltrin E, Campanaro S, Diehl AD, Ehler E, Faulkner G, Fordham J, Gardin C, Harris M, Hill D, Knoell R, Laveder P, Mittempergher L, Nori A, Reggiani C, Sorrentino V, Volpe P, Zara I, Valle G, Deegan née Clark J. Muscle Research and Gene Ontology: New standards for improved data integration. BMC Med Genomics. 2009;2:6. doi: 10.1186/1755-8794-2-6.

[R28] 28. Chen L, Morrey CP, Gu H, Halper M, Perl Y. Modeling multi-typed structurally viewed chemicals with the UMLS Refined Semantic Network. J Am Med Inform Assoc. 2009 Jan-Feb;16(1):116–31. doi: 10.1197/jamia.M2604.

[R29] 29. Strauss A, Corbin J. Grounded theory methodology: an overview. In: Denzin NK, Lincoln YS, editors. Handbook of qualitative research. Sage; Thousand Oaks: 1994. pp. 273–85.

[R30] 30. Glaser B, Strauss A. The Discovery of Grounded Theory: strategies for qualitative research. Aldine; Chicago: 1967.

[R31] 31. Fernández M, Gómez-Pérez A, Juristo N. METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Proc. AAAI Spring Symp Series. AAAI Press; Menlo Park, Calif.: 1997. pp. 33–40.

[R32] 32. Pinto SF, Martins JP. Ontologies: how can they be built? Knowledge Inform Syst. 2004;6:441–64.

[R33] 33. Chapman W, Dowling JN. Inductive creation of an annotation schema for manually indexing clinical conditions from emergency department reports. J. Biomed Inform. 2006;39(2):196–208. doi: 10.1016/j.jbi.2005.06.004.

[R34] 34. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235–9.

[R35] 35. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004;20(14):2320–1. doi: 10.1093/bioinformatics/bth227.

[R36] 36. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. Proc AMIA Annual Symp. 2001:17–21.

[R37] 37. Studer R, Benjamins R, Fensel D. Knowledge Engineering: Principles and Methods. Data Knowl. Eng. 1998;25(1-2):161–197. [Sowa, 2000] Sowa JF. Guided Tour of Ontology. Accessed at http://www.jfsowa.com/ontology/guided.htm.

[R38] 38. Uschold M, Grüninger M. Ontologies: principles, methods, and applications. Knowledge Engineering Review. 1996;1:96–137.

[R39] 39. Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC. Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing. Journal of the American Society for Information Science and Technology (JASIST) 2006;57(1):96–113. doi: 10.1002/asi.20257.

[R40] 40. Krueger L, Vos M, Bukowski M, Johnson C, Lau W, Collie K. NCI and CIT: Creating a Knowledge Management Toolkit. 2012 Feb. Accessed at http://dcb.cit.nih.gov/PortfolioParnership.pdf.

[R41] 41. Mohammad S, Hirst G. Distributional measures of concept-distance: a task-oriented evaluation. Proc. Conference on Empirical Methods in Natural Language Processing. Sydney, Australia: 2006. pp. 35–43.

[R42] 42. Wang T, Hirst G. Exploring patterns in dictionary definitions for synonym extraction. Journal of Natural Language Engineering. 2012;18(3):313–342.

[R43] 43. Hearst M. Automatic acquisition of hyponyms from large text corpora. Proc., 14th International Conference on Computational Linguistics. Vol. 2. Nantes, France: 1992. pp. 539–545.

