Abstract
Objective: The Enriched Semantic Network (ESN) was introduced as an extension of the Unified Medical Language System (UMLS) Semantic Network (SN). Its multiple subsumption configuration and concomitant multiple inheritance make the ESN's relationship structures and semantic type assignments different from those of the SN. A technique for deriving the relationship structures of the ESN's semantic types and an automated technique for deriving the ESN's semantic type assignments from those of the SN are presented.
Design: The technique to derive the ESN's relationship structures finds all newly inherited relationships in the ESN. All such relationships are audited for semantic validity, and the blocking mechanism is used to block invalid relationships. The mapping technique to derive the ESN's semantic type assignments uses current SN semantic type assignments and preserves nonredundant categorizations, while preventing new redundant categorizations.
Results: Among the 426 newly inherited relationships, 326 are deemed valid. Seven blockings are applied to avoid inheritance of the 100 invalid relationships. Sixteen semantic types have different relationship structures in the ESN as compared to those in the SN. The mapping of semantic type assignments from the SN to the ESN avoids the generation of 26,950 redundant categorizations. The resulting ESN contains 138 semantic types, 149 IS-A links, 7,303 relationships, and 1,013,876 semantic type assignments.
Conclusion: The ESN's multiple inheritance provides more complete relationship structures than in the SN. The ESN's semantic type assignments avoid the existing redundant categorizations appearing in the SN and prevent new ones that might arise due to multiple parents. Compared to the SN, the ESN provides a more accurate unifying semantic abstraction of the UMLS Metathesaurus.
The Unified Medical Language System (UMLS) Metathesaurus (META) is a concept repository containing over one million biomedical concepts from many source terminologies.1,2,3 The Semantic Network (SN), a high-level unifying semantic structure for the META,4 consists of 135 semantic types arranged in a hierarchy of two trees based on the IS-A (subsumption) relationship.5 At present, each semantic type is restricted to having exactly one parent.6 The roots, Event (semantic types are in bold throughout the text, except in tables and figures) and Entity, do not have any parent. This arrangement is restrictive in modeling medical knowledge because it does not allow a given semantic type to be a specialization of more than one other semantic type. A study7 was conducted to evaluate how well the UMLS could support clinical information systems at Columbia-Presbyterian Medical Center as compared to the local Medical Entities Dictionary (MED).8,9 A recommendation resulting from this study was that multiple parents be permitted in the SN. In previous work,10 we addressed this issue by extending the SN (of the 2002AB release) into the Enriched Semantic Network (ESN), whose key characteristic is an IS-A hierarchy permitting multiple parents for a single semantic type.
The IS-A links of the SN are the means by which semantic relationships are inherited by more specific semantic types from more general semantic types. As the ESN has additional IS-A links, new inheritance paths were created. In our previous work,10 we concentrated on the configuration of the ESN's IS-A hierarchy and did not investigate these inheritance issues. They form a central part of this paper. Specifically, a technique to derive the complete set of semantic relationships (what we call the relationship structure) of each of the ESN's semantic types is presented. This includes the semantic relationships introduced at each semantic type and the semantic relationships inherited by each.
In the SN, two mechanisms exist for interrupting the inheritance of semantic relationships. While inheritance is a reasoning mechanism that has universal validity in the sense of logic, the NLM has introduced these mechanisms into the UMLS for the convenience of the knowledge engineers editing it. We are operating completely within the paradigm of the NLM and therefore are also using these two mechanisms. In our approach, a human auditor needs to verify whether an inherited semantic relationship is valid at a semantic type, and if it is not, it needs to be manually blocked. We note that our research shows that the need for these mechanisms rarely arises. All newly inherited semantic relationships were audited for semantic validity, and those deemed invalid were excluded from the ESN. We report on the numbers of inherited relationships as well as cases of blocked inheritance.
Every concept in the META has been assigned one or more semantic types of the SN. To complete the picture of the ESN, we must carry out such semantic type assignments with respect to each concept. We present a mapping (and accompanying algorithm) through which the semantic type assignments of the ESN are derived automatically from those of the SN. The mapping is complicated by the ESN's new IS-A links and the need to comply with the principle that each concept be assigned the lowest (most specialized) possible semantic type in the IS-A hierarchy.11 An assignment that violates this principle (i.e., an assignment of an ancestor semantic type to a concept when there already exists an assignment of a descendant semantic type to that concept) is referred to as a “redundant categorization.”12 Our algorithm guarantees that no redundant categorizations currently existing in the SN are transferred to the ESN and no new redundant categorizations are introduced due to the ESN's expanded IS-A configuration. We report on the number of potential redundant categorizations avoided by our algorithm.
Finally, one of the most popular uses of the UMLS involves free-text searches for concepts that map to terms in a particular terminology. The SN can facilitate such searches via filtering based on semantic types. However, filtering term look-ups with respect to the SN can be incomplete due to the single-parent restriction. We discuss the improvement that the complete ESN affords to such filtering.
Background
Our ESN,10 based on the original SN, allows a given semantic type to have more than one parent. Thus, the ESN exhibits a directed acyclic graph (DAG) hierarchy, in contrast to the SN's two-tree–structured hierarchy. Note that a tree is a specialized form of a DAG, in which each semantic type is restricted to having at most one parent. The ESN also contains some additional semantic types included to support the new multiple subsumption framework. Overall, the ESN contains 138 semantic types and 149 IS-A links. (The original SN from which the ESN was derived contained 134 semantic types and 132 IS-A links.) Figure 1 shows part of the Entity hierarchy of the ESN. To emphasize the changes from the original SN, we use dashed arrows to denote new IS-A links and dashed rectangles to denote new semantic types. Thin dashed rectangles denote semantic types that originally were in the SN's Event tree.
Figure 1.
Part of the Entity component of the Enriched Semantic Network.
As in the SN, semantic types of the ESN are also connected by semantic (non-IS-A) relationships of 53 different kinds. Such relationships can be directly introduced at a semantic type or inherited. When a relationship is defined at a semantic type but not at its parent, we call the semantic type an introduction point of the relationship. All the descendants of an introduction point inherit this introduced relationship, unless the inheritance is explicitly blocked. There are two mechanisms for blocking inheritance in the SN. The first mechanism, called “blocking,” nullifies the definition of an inherited relationship. The second mechanism allows a newly introduced relationship to be designated as “defined but not inherited” (DNI). This means that the relationship is not inherited by any of the children (and thus descendants) of the semantic type that is introducing it.
We call the entire set of semantic relationships exhibited by a semantic type, including those inherited and those introduced, its “relationship structure.” The relationship structures of the semantic types have played a major role in the analysis of a partition of the SN.13 The relationship structure of a given semantic type in the ESN may in general differ from that of the same type in the SN. This is a result of the fact that in the ESN, a semantic type can have more than one parent and inherit relationships independently from each, a situation referred to as “multiple inheritance.” The ESN was designed so that all semantic types should at least preserve their original relationship structures from the SN. That is, the relationship structure of a semantic type in the ESN can either be equal to or be a superset of the relationship structure of the same semantic type in the SN. It is noted, however, that introduction points for relationships may have changed.
The SN is designed to serve as a high-level unifying semantic structure for the underlying META,4 with each concept being assigned one or more semantic types. As noted, any assignment should be the lowest possible semantic type in the IS-A hierarchy.11 Assignments of higher semantic types to a concept can be inferred via the IS-A links. In previous work,12,14 we have found many situations in the UMLS where a concept was assigned both a descendant semantic type and its ancestor type simultaneously. Such a situation, which we refer to as a “redundant categorization,” must be avoided in the ESN's concept configuration. Our mapping ensures that the ESN is free of any redundant categorizations.
Issues of single inheritance versus multiple inheritance have been studied in different subfields of computer science, such as object-oriented programming languages and knowledge representation. For example, C++15 supports full-fledged multiple inheritance, while Java16 allows only one parent for each class. Knowledge representation systems,17 if they support a taxonomy at all, typically prefer the expressive power achieved by multiple inheritance and are willing to pay the price of much more complicated implementations.
Schulze-Kremer et al.18 defined a set of five desiderata for the SN, and it was shown that the failure of the current UMLS in certain reasoning situations is due to the fact that these desiderata are ignored. However, fulfilling these five desiderata would require a serious reorganization of the current SN.
Burgun and Bodenreider19 analyzed the relationship between the SN and two other widely known terminologies, the Upper CYC Ontology (UCO) and WordNet. As did Schulze-Kremer et al.,18 the authors show that structural inaccuracies lead to reasoning mistakes. For example, WordNet misclassifies Fever as a Psychological Feature. In our research, we have added IS-A relationships and semantic types to the SN, resulting in the ESN. While our approach was structurally motivated, Yu et al.20 present a proposed extension to the SN that is topically motivated. There, the authors add six new semantic types and 16 new semantic relations to the SN, with the purpose of better capturing the genomic environment.
Methods
Derivation of Semantic Types' Relationship Structures
In the original SN, there are 6,977 semantic relationship occurrences of the 53 different kinds. Hence, the average number of occurrences per semantic type is about 52 since there are 134 semantic types. For example, there is an affects relationship from Anatomical Abnormality to Alga and also an affects relationship from Amino Acid, Peptide or Protein to Biologic Function. Each of them is an occurrence of affects, with different source and target semantic types. Furthermore, a semantic type may be the source of several occurrences of the same kind of relationship, with different targets. For brevity, we use “occurrence” and “relationship” interchangeably whenever there is no possibility of confusion.
Relationships fall into two categories: introduced relationships and inherited relationships. We use the notation r(X, Y) to denote an occurrence of the relationship of kind r from source semantic type X to target semantic type Y. Let X and Y be two semantic types, and let Px be the parent of X. A relationship r(X, Y) is an introduced relationship if there does not exist a relationship r(Px, Y) in the SN; otherwise, it is an inherited relationship unless r(Px, Y) is a DNI relationship at Px or a blocked relationship at X.
There are 422 introduced relationships in the SN and in total 6,977 − 422 = 6,555 inherited relationships. There are only 27 DNI relationships (about 6% of the introduced relationships) and ten “blocking” relationships.
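The distinction between introduced and inherited relationships can be illustrated with a minimal Python sketch. The function name `classify`, the toy types `Root`, `Mid`, and `Leaf`, and the decision to label a relationship as introduced when its parent's copy is DNI or blocked are assumptions made for illustration; they are not part of the SN distribution.

```python
from typing import Dict, Optional, Set, Tuple

Rel = Tuple[str, str, str]  # (kind r, source X, target Y)


def classify(rel: Rel,
             parent: Dict[str, Optional[str]],
             relations: Set[Rel],
             dni: Set[Rel],
             blocked: Set[Rel]) -> str:
    """Label r(X, Y) as 'introduced' or 'inherited' in a single-parent hierarchy."""
    kind, source, target = rel
    px = parent.get(source)
    if px is None:
        return "introduced"            # a root has no parent to inherit from
    parent_rel = (kind, px, target)
    if parent_rel not in relations:
        return "introduced"            # r(Px, Y) does not exist
    if parent_rel in dni or rel in blocked:
        # Assumption: inheritance is interrupted here, so an r(X, Y) that is
        # nevertheless present is treated as introduced at X.
        return "introduced"
    return "inherited"


# Hypothetical toy data, not taken from the SN.
parent = {"Root": None, "Mid": "Root", "Leaf": "Mid"}
relations = {
    ("affects", "Mid", "Root"), ("affects", "Leaf", "Root"),
    ("part_of", "Mid", "Root"), ("part_of", "Leaf", "Root"),
}
dni = {("part_of", "Mid", "Root")}     # defined at Mid but not inherited
labels = {rel: classify(rel, parent, relations, dni, set()) for rel in relations}
```

In this toy hierarchy only affects (Leaf, Root) is inherited; part_of (Leaf, Root) counts as introduced because the parent's copy is DNI.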
The relationships appearing in the ESN are derived from those of the SN according to the following two rules and review step.
Rule 3.1.1: A relationship r(X, Y) in the SN implies a relationship r(X, Y) in the ESN. If r(X, Y) is an inherited relationship in the SN, then it is also an inherited relationship in the ESN.
Rule 3.1.2: If a semantic type T has multiple parents (or ancestors) in the ESN, then initially T inherits all the relationships of its new parents (or ancestors) except for those that have been explicitly blocked or are DNI relationships.
Review Step: A domain expert (for example, an MD or PhD in an appropriate area with experience in medical terminologies) manually checks the semantic validity of all newly inherited relationships in the ESN. Only relationships that are deemed semantically valid are retained; otherwise, blocking or DNI is used to avoid inheritance of an invalid relationship.
All existing relationships in the SN are preserved in the ESN according to Rule 3.1.1. In particular, if r(X, Y) is an inherited relationship in the SN, then r(X, Y) will exist in the ESN and also be an inherited relationship. On the other hand, if r(X, Y) is an introduced relationship in the SN, then r(X, Y) in the ESN can either be an introduced relationship or an inherited relationship due to the changing inheritance pattern brought on by multiple parents. We note, though, that most introduced relationships in the SN remain introduced relationships in the ESN.
For each semantic type having multiple parents, Rule 3.1.2 will find all newly inherited relationships from the new parent(s). For example, Gene or Genome has a new parent Molecular Sequence in the ESN. According to Rule 3.1.2, it will inherit all nonblocked and non-DNI relationships from the new parent. There is a relationship result_of (Molecular Sequence, Mental Process) that, in the SN, is defined neither at Gene or Genome nor at its unique parent Fully Formed Anatomical Structure. Therefore, according to Rule 3.1.2, Gene or Genome will initially inherit the result_of relationship in the ESN. That is, there is a relationship result_of (Gene or Genome, Mental Process) in the ESN waiting to be reviewed by our domain expert in the Review Step. The expert's review deemed this relationship valid. Therefore, the ESN retains the relationship result_of (Gene or Genome, Mental Process).
Rule 3.1.2 implies that a semantic type with multiple parents might have more relationships in the ESN than in the SN because it could inherit new relationships from its new parents. The same is true for its descendants.
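Rule 3.1.2 can be sketched in Python as a search for candidate relationships that a type gains from its new ancestors. The helper names `ancestors` and `newly_inherited`, and the tiny data fragment (which holds only the Gene or Genome example above, not the full SN), are illustrative assumptions. The sketch also assumes that `structure` already holds each type's full relationship structure as (kind, target) pairs.

```python
from typing import Dict, List, Set, Tuple

Rel = Tuple[str, str, str]  # (kind, source, target)


def ancestors(t: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of t in a hierarchy that may have multiple parents."""
    seen: Set[str] = set()
    stack = list(parents.get(t, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, []))
    return seen


def newly_inherited(t: str,
                    sn_parents: Dict[str, List[str]],
                    esn_parents: Dict[str, List[str]],
                    structure: Dict[str, Set[Tuple[str, str]]],
                    dni: Set[Rel],
                    blocked: Set[Rel]) -> Set[Rel]:
    """Candidate relationships for the Review Step (Rule 3.1.2)."""
    new_anc = ancestors(t, esn_parents) - ancestors(t, sn_parents)
    own = structure.get(t, set())
    candidates: Set[Rel] = set()
    for a in new_anc:
        for kind, target in structure.get(a, set()):
            if (kind, a, target) in dni:        # DNI at the ancestor
                continue
            if (kind, t, target) in blocked:    # explicitly blocked at t
                continue
            if (kind, target) not in own:       # not already present in the SN
                candidates.add((kind, t, target))
    return candidates


# Fragment based on the Gene or Genome example; all other data are omitted.
sn_parents = {"Gene or Genome": ["Fully Formed Anatomical Structure"]}
esn_parents = {"Gene or Genome": ["Fully Formed Anatomical Structure",
                                  "Molecular Sequence"]}
structure = {"Molecular Sequence": {("result_of", "Mental Process")},
             "Gene or Genome": set()}  # left empty purely for illustration
to_review = newly_inherited("Gene or Genome", sn_parents, esn_parents,
                            structure, set(), set())
```

The returned set is only the input to the Review Step; whether a candidate is retained still depends on the expert's judgment.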
Deriving MRSTYE from MRSTY for the ESN
To complete the ESN, all the META concepts must be assigned one or more of the ESN's semantic types. The UMLS set of such assignments is in the distribution file MRSTY. Similarly, we will generate the file MRSTYE, holding all semantic type assignments for the ESN.
The simplest way to generate the ESN's semantic type assignments is to use those of the SN. That is, if a concept C in the META was assigned a set of semantic types {T1, T2,…, Tm} in the SN, then in the ESN, the concept C will also be assigned these same types. This mapping will possibly yield two kinds of redundant categorizations in the ESN. In the first case, an already existing redundant categorization is copied over to the ESN. In the second case, a new redundant categorization arises as a result of a semantic type having more parents than it did before. Our mapping deals with these situations in order to prevent introducing any redundant categorizations in the ESN. For the latter case, we must check each pair of semantic types having a new IS-A path between them in the ESN. If such a new IS-A path has the potential for introducing new redundant categorizations, then this must be accounted for in the mapping.
For example, besides the current parent Conceptual Entity, Organism Attribute has a new parent Physiologic Function in the ESN (Figure 1). Among the 2,381 concepts assigned Organism Attribute, 14 concepts are simultaneously assigned Physiologic Function (Table 1). Since Physiologic Function is now a parent of Organism Attribute in the ESN, the assignments of Physiologic Function to the 14 concepts would be redundant categorizations if they were carried over to the ESN. Therefore, our mapping will eliminate these 14 assignments.
Table 1.
Concepts Assigned Organism Attribute and Physiologic Function in the Semantic Network
Concept ID | Concept Name |
---|---|
C0489560 | Intrachamber Diastolic |
C0489561 | Intrachamber Mean |
C0489562 | Intrachamber Systolic |
C0489564 | Intravascular End Diastolic |
C0489565 | Intravascular Mean |
C0489566 | Intravascular Systolic |
C0489568 | Intravascular Systolic.Inspiration - Expiration |
C0489703 | Forearm Blood Pressure Systolic |
C0489708 | Hepatic Capillary Wedge Pressure |
C0489726 | Left Upper Arm Blood Pressure Mean |
C0489728 | Maximum Systolic Blood Pressure |
C0489731 | Mean Systolic Blood Pressure |
C0489733 | Minimum Systolic Blood Pressure |
C0489763 | Right Thigh Blood Pressure Systolic |
We now define the mapping as follows. If a concept C was assigned the semantic types T1, T2,…, Tm in the SN, the assignments for concept C follow these three rules.
Rule 3.2.1: If in the ESN no pair of types (Ti, Tj) (1 ≤ i ≠ j ≤ m) has an IS-A path between them, then C is assigned in the ESN each of the types T1, T2,…, Tm.
Rule 3.2.2: If among T1, T2,…, Tm there exists a pair (Ti, Tj) (1 ≤ i ≠ j ≤ m) such that Ti is an ancestor of Tj in the SN, then exclude the assignment of Ti to C from the ESN.
Rule 3.2.3: If among T1, T2,…, Tm there exists a pair (Ti, Tj) (1 ≤ i ≠ j ≤ m) such that Ti is a new ancestor of Tj in the ESN, then the assignment of Ti to C is excluded from the ESN.
Rule 3.2.1 is used to preserve all nonredundant categorizations in the SN. Rule 3.2.2 excludes all redundant categorizations currently existing in the SN from the ESN. Rule 3.2.3 averts the introduction of new redundant categorizations arising from multiple-parent cases in the ESN. The application of the three rules yields a complete set of semantic type assignments for the ESN that preserves the SN's assignments while purging existing redundant categorizations and avoiding new ones.
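Applied to a single concept, the three rules amount to dropping every assigned type that is an ancestor, in either the SN or the ESN, of another assigned type. The following Python sketch illustrates this under stated assumptions: the function name `map_concept` is hypothetical, and the ancestor maps list only the links mentioned in the text rather than the full hierarchies.

```python
from typing import Dict, Set


def map_concept(types: Set[str],
                sn_anc: Dict[str, Set[str]],
                esn_anc: Dict[str, Set[str]]) -> Set[str]:
    """Return the ESN assignments for one concept given its SN assignments."""
    kept: Set[str] = set()
    for ti in types:
        others = types - {ti}
        # Rule 3.2.2: Ti is already an ancestor of some Tj in the SN.
        old_redundant = any(ti in sn_anc.get(tj, set()) for tj in others)
        # Rule 3.2.3: Ti is an ancestor of some Tj in the ESN.
        new_redundant = any(ti in esn_anc.get(tj, set()) for tj in others)
        if not (old_redundant or new_redundant):
            kept.add(ti)               # Rule 3.2.1: nothing to exclude
    return kept


# Partial ancestor maps; only links stated in the text are included.
sn_anc = {"Organism Attribute": {"Conceptual Entity"}}
esn_anc = {"Organism Attribute": {"Conceptual Entity", "Physiologic Function"}}

# A Table 1 concept such as C0489560 carries both of these types in the SN.
table1_case = map_concept({"Organism Attribute", "Physiologic Function"},
                          sn_anc, esn_anc)
# Hypothetical concept with a redundancy that already exists in the SN.
old_case = map_concept({"Organism Attribute", "Conceptual Entity"},
                       sn_anc, esn_anc)
# Hypothetical concept whose types have no IS-A path in these partial maps.
plain_case = map_concept({"Vitamin", "Enzyme"}, sn_anc, esn_anc)
```

In the first two cases only Organism Attribute survives; in the third, both types are carried over unchanged.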
We note that the four new semantic types of the ESN should each be assigned at least the corresponding concept of the same name.21,22 (These new concepts are now included in the UMLS.) Furthermore, a domain expert should review the concepts assigned the parents and children of each of the four new semantic types to check whether any of the concepts should instead be assigned one of these four semantic types.
The application of the three mapping rules involves the use of algorithms that detect all existing or potential redundant categorizations. For Rule 3.2.2, the algorithm of Peng et al.12 (here referred to as “DetectRedundantCatgs”) is used to scan through all the SN's semantic type assignments (as supplied in MRSTY) and mark those it determines to be redundant categorizations. Subsequently, these marked assignments are not introduced with respect to the ESN.
For Rule 3.2.3, the following DetectNewRedundantCatgs algorithm is applied to detect and mark all potentially new redundant categorizations arising from new IS-A paths in the ESN. This algorithm functions similarly to DetectRedundantCatgs. However, this version is more efficient in that it iterates over a much narrower set of semantic types. In the algorithm, E_T denotes the set of concepts assigned a semantic type T in the SN. NewAncestors(T) is the set of all new ancestor semantic types (including the new parent[s]) of T in the ESN. E_T1 ∩ E_T2 is the intersection of the concept sets of T1 and T2. Following the UMLS convention, we use (C|T) to denote the assignment of the semantic type T to the concept C.
DetectNewRedundantCatgs algorithm: mark potentially new redundant categorizations.
for (each semantic type T with a new parent(s) or a new ancestor(s))
{
    for (each semantic type Y ∈ NewAncestors(T))
    {
        if (E_T ∩ E_Y ≠ ∅ in the SN)
            //potential redundant categorization
            for (each concept C ∈ E_T ∩ E_Y)
                Mark the assignment (C|Y) as potentially redundant
    }
}
With the algorithms to detect the existing and potentially new redundant categorizations, we can define the algorithm to implement the mapping and generate all semantic type assignments of the ESN, in the form of the file MRSTYE, as follows: first call the function DetectRedundantCatgs, and then call DetectNewRedundantCatgs. For each assignment (C|T) in MRSTY that was not marked by either of the two function calls, insert (C|T) into MRSTYE.
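A minimal Python sketch of this pipeline is given below. It assumes that MRSTY has been loaded as a list of (concept, semantic type) pairs and that the output of DetectRedundantCatgs is supplied as a set, since that algorithm is not reproduced here. The function names and the concept ID `C_EXAMPLE` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Assignment = Tuple[str, str]  # (concept C, semantic type T), i.e. (C|T)


def detect_new_redundant_catgs(mrsty: List[Assignment],
                               new_ancestors: Dict[str, Set[str]]) -> Set[Assignment]:
    """Mark (C|Y) whenever C is assigned both T and a new ancestor Y of T."""
    concepts_of: Dict[str, Set[str]] = defaultdict(set)   # E_T for every T
    for concept, sem_type in mrsty:
        concepts_of[sem_type].add(concept)
    marked: Set[Assignment] = set()
    for t, ancs in new_ancestors.items():
        for y in ancs:
            for concept in concepts_of[t] & concepts_of[y]:
                marked.add((concept, y))
    return marked


def build_mrstye(mrsty: List[Assignment],
                 existing_redundant: Set[Assignment],
                 new_ancestors: Dict[str, Set[str]]) -> List[Assignment]:
    """Keep every assignment that neither detection step has marked."""
    marked = existing_redundant | detect_new_redundant_catgs(mrsty, new_ancestors)
    return [a for a in mrsty if a not in marked]


# Two Table 1 concepts plus one hypothetical concept with a single type.
mrsty = [
    ("C0489560", "Organism Attribute"), ("C0489560", "Physiologic Function"),
    ("C0489561", "Organism Attribute"), ("C0489561", "Physiologic Function"),
    ("C_EXAMPLE", "Organism Attribute"),
]
new_ancestors = {"Organism Attribute": {"Physiologic Function"}}  # partial
mrstye = build_mrstye(mrsty, set(), new_ancestors)
```

Indexing the concepts by semantic type once keeps the inner step a plain set intersection, which mirrors the E_T notation of the algorithm.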
Note that the mapping handles a redundant categorization such that the assignment of the parent (or ancestor) to a concept will always be the one excluded from the ESN. But it is possible that in the original SN, the assignment of the parent (or ancestor) to a concept is actually correct, while the assignment of the child to the concept is wrong. Then, the assignment of the parent should be preserved in the ESN, while the assignment of the child should be excluded. If such a case is found by a human expert and corrected in the original SN, our algorithms can be rerun after the correction to guarantee that the concept is assigned the correct type in the ESN.
Results
ESN Relationship Structures
Our study is based on the UMLS 2002AB release. (Note that even though there have been a number of releases of the UMLS in the interim, the SN has changed only slightly with the addition of the semantic type Drug Delivery Device and its accompanying IS-A.) By applying the two rules and the Review Step in the section “Derivation of Semantic Types' Relationship Structures,” we obtained relationship structures for all the ESN's semantic types. Rule 3.1.1 preserved all of the SN's 6,977 relationships (including 422 introduced relationships and 6,555 inherited relationships) in the ESN. One introduced relationship in the SN, namely, part_of (Anatomical Structure, Organism), changed from an introduced relationship to an inherited relationship. In the SN, Anatomical Structure is the introduction point of this relationship, but in the ESN that type inherits this relationship from the new parent Physical Anatomical Entity instead. In fact, the new semantic type Anatomical Entity is the introduction point of this relationship, and part_of (Anatomical Entity, Organism) is an introduced relationship in the ESN.
Although the introduction pattern of Anatomical Structure was affected by the change of introduction points for the part_of relationship, Anatomical Structure's relationship structure did not change. It still exhibits the exact same relationships. Moreover, the relationship structure for nine of its descendants did not change either in the ESN. (The tenth descendant Gene or Genome inherits a new relationship from its new parent Molecular Sequence as noted in the section “Derivation of Semantic Types' Relationship Structures.”)
By Rule 3.1.2, 426 newly inherited relationships were obtained through multiple inheritance. In the Review Step, all 426 new relationships were audited by our domain expert, James J. Cimino, MD, who was a contractor of the UMLS and has extensive experience in medical terminologies.
Among the 426 new relationships, 12 involve the four semantic types appearing exclusively in the ESN and not in the SN. These were deemed valid upon review. As an example of a new semantic type, Anatomical Entity has three relationships in the ESN: the introduced part_of relationship to Organism and the two occurrences of issue_in to Occupation or Discipline and Biomedical Occupation or Discipline inherited from the parent Entity. The other three new semantic types of the ESN inherit these relationships from Anatomical Entity.
The remaining 414 (= 426 −12) newly inherited relationships involve the SN's semantic types having multiple parents (or ancestors) in the ESN. Among all 134 semantic types in the SN, 21 semantic types have multiple parents (or ancestors) in the ESN. They are Anatomical Structure with its ten descendants, Organism Attribute with its child Clinical Attribute, and eight other leaf semantic types. Hence, at most 21 semantic types of the SN can have different relationship structures in the ESN. The 414 newly inherited relationships involve these 21 semantic types.
A review of the 414 new relationships found that 314 of them (about 76%) are valid and are thus retained in the ESN. These are inherited by a total of 12 semantic types out of the 21 types having multiple parents. The other 100 relationships are semantically invalid and are blocked from being inherited by the children from their new parents. (An economical implementation of this blocking is described later in this section.) Therefore, those 12 semantic types have different relationship structures in the ESN from those types in the SN. For example, Body Substance in the ESN has a different relationship structure from that in the SN since it inherits a valid part_of relationship to Organism from its new parent Material Physical Anatomical Entity, a relationship that is missing in the SN. As an example of an invalid new relationship, Organism Attribute's new parent Physiologic Function has a process_of relationship to Organism that might be inherited by Organism Attribute in the ESN. After being reviewed by our domain expert, process_of (Organism Attribute, Organism) is deemed invalid and is excluded (“blocked”) in the ESN. Table 2 presents these 12 semantic types, the number of newly inherited relationships reviewed, the number of valid relationships in the ESN, and the number of invalid (blocked) relationships in the ESN.
Table 2.
Relationships Inherited from New Parent Semantic Types in the Enriched Semantic Network
Child Semantic Type | New Parent Semantic Type | No. of New Relationships Reviewed | Valid | Invalid |
---|---|---|---|---|
Body Location or Region | Conceptual Anatomical Entity | 1 | 1 | 0 |
Body Space or Junction | Physical Anatomical Entity | 1 | 1 | 0 |
Body Substance | Material Physical Anatomical Entity | 1 | 1 | 0 |
Body System | Conceptual Anatomical Entity | 1 | 1 | 0 |
Clinical Attribute | Physiologic Function | 92 | 52 | 40 |
Enzyme | Amino Acid, Peptide, or Protein | 1 | 1 | 0 |
Gene or Genome | Molecular Sequence | 1 | 1 | 0 |
Injury or Poisoning | Disease or Syndrome | 112 | 92 | 20 |
Laboratory or Test Result | Phenomenon or Process | 22 | 22 | 0 |
Organism Attribute | Physiologic Function | 92 | 52 | 40 |
Receptor | Cell Component | 67 | 67 | 0 |
Vitamin | Pharmacologic Substance | 23 | 23 | 0 |
Total: 12 | | 414 | 314 | 100 |
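
The totals reported for Table 2 can be cross-checked with a short Python tally. The figures are copied from the table and from the counts stated elsewhere in the text (12 relationships for the four new semantic types, 6,977 relationships in the SN); the variable names are illustrative.

```python
# (child semantic type, reviewed, valid, invalid), copied from Table 2
rows = [
    ("Body Location or Region", 1, 1, 0),
    ("Body Space or Junction", 1, 1, 0),
    ("Body Substance", 1, 1, 0),
    ("Body System", 1, 1, 0),
    ("Clinical Attribute", 92, 52, 40),
    ("Enzyme", 1, 1, 0),
    ("Gene or Genome", 1, 1, 0),
    ("Injury or Poisoning", 112, 92, 20),
    ("Laboratory or Test Result", 22, 22, 0),
    ("Organism Attribute", 92, 52, 40),
    ("Receptor", 67, 67, 0),
    ("Vitamin", 23, 23, 0),
]

reviewed = sum(r[1] for r in rows)
valid = sum(r[2] for r in rows)
invalid = sum(r[3] for r in rows)

NEW_TYPE_RELS = 12        # relationships of the four new semantic types
SN_RELS = 6977            # relationships preserved from the SN

all_new = reviewed + NEW_TYPE_RELS     # all newly inherited relationships
all_valid = valid + NEW_TYPE_RELS      # those retained after review
esn_total = SN_RELS + all_valid        # relationships in the resulting ESN

print(reviewed, valid, invalid, all_new, all_valid, esn_total)
```

The tally reproduces the 414, 314, and 100 of the table, the 426 and 326 of the abstract, and an ESN total of 7,303 relationships.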
Now let us consider the semantic types for which blocking occurs. In the ESN, Injury or Poisoning has a new parent Disease or Syndrome. This new IS-A relationship causes 112 newly inherited relationships for Injury or Poisoning. After being reviewed, 92 are deemed valid and are retained, while 20 are invalid and excluded. For example, there is a new valid relationship, affects (Injury or Poisoning, Organism), inherited from Disease or Syndrome. Another new relationship, degree_of (Injury or Poisoning, Pathologic Function), was found invalid and was excluded. Table 3 shows the 20 invalid relationships. Because of space limitations, we do not show all 92 valid relationships.
Table 3.
Invalid Semantic Relationships of Injury or Poisoning Blocked in the Enriched Semantic Network
degree_of (Injury or Poisoning, Pathologic Function) |
degree_of (Injury or Poisoning, Cell or Molecular Dysfunction) |
degree_of (Injury or Poisoning, Disease or Syndrome) |
degree_of (Injury or Poisoning, Experimental Model of Disease) |
degree_of (Injury or Poisoning, Mental or Behavioral Dysfunction) |
degree_of (Injury or Poisoning, Neoplastic Process) |
manifestation_of (Injury or Poisoning, Pathologic Function) |
manifestation_of (Injury or Poisoning, Physiologic Function) |
manifestation_of (Injury or Poisoning, Cell Function) |
manifestation_of (Injury or Poisoning, Cell or Molecular Dysfunction) |
manifestation_of (Injury or Poisoning, Disease or Syndrome) |
manifestation_of (Injury or Poisoning, Experimental Model of Disease) |
manifestation_of (Injury or Poisoning, Genetic Function) |
manifestation_of (Injury or Poisoning, Injury or Poisoning) |
manifestation_of (Injury or Poisoning, Mental Process) |
manifestation_of (Injury or Poisoning, Mental or Behavioral Dysfunction) |
manifestation_of (Injury or Poisoning, Molecular Function) |
manifestation_of (Injury or Poisoning, Neoplastic Process) |
manifestation_of (Injury or Poisoning, Organ or Tissue Function) |
manifestation_of (Injury or Poisoning, Organism Function) |
The 20 invalid relationships of Injury or Poisoning are cases of degree_of and manifestation_of. We note that the target semantic type for the first invalid degree_of relationship of Injury or Poisoning in Table 3 is Pathologic Function. The other five targets of these degree_of relationships are descendants of Pathologic Function in the SN. It is sufficient to block the invalid relationship, degree_of (Injury or Poisoning, Pathologic Function), to make sure the degree_of from Injury or Poisoning is not inherited by any of the descendants of Pathologic Function.
The situation for the invalid manifestation_of relationships of Injury or Poisoning is similar. The first two such relationships in Table 3 are to Pathologic Function and Physiologic Function. The remaining 12 manifestation_of relationships in Table 3 are to semantic types that are descendants of either Pathologic Function or Physiologic Function. Hence, it is sufficient to block only the first two manifestation_of relationships of Injury or Poisoning to make sure that this relationship is not inherited by any of the descendants of either Pathologic Function or Physiologic Function. In total, blocking only three relationships (one degree_of and two manifestation_of) is needed to prevent the inheritance of the 20 invalid relationships of Table 3.
Similarly, to block the 40 invalid relationships for Organism Attribute (Table 2), we need just to block four relationships since the targets of all other invalid relationships are descendants of one of the targets of these four blocked relationships. The 40 invalid relationships of Clinical Attribute have the same names and targets as those of the invalid relationships of Organism Attribute, its parent. Hence, the blocking of the invalid relationships of Organism Attribute prevents the inheritance of all these invalid relationships. Therefore, only seven explicit blockings (three for Injury or Poisoning and four for Organism Attribute) are needed to avoid the inheritance of the 100 invalid relationships of Table 2.
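The economical blocking described above can be sketched in Python: an invalid relationship needs an explicit blocking only if no ancestor of its target is also the target of an invalid relationship of the same kind from the same source. The function names are hypothetical, and the parent links form an assumed fragment of the hierarchy that is consistent with the statement that all other targets descend from Pathologic Function or Physiologic Function.

```python
from typing import Dict, List, Set, Tuple

Rel = Tuple[str, str, str]  # (kind, source, target)


def ancestors(t: str, parents: Dict[str, List[str]]) -> Set[str]:
    seen: Set[str] = set()
    stack = list(parents.get(t, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, []))
    return seen


def minimal_blockings(invalid: Set[Rel], parents: Dict[str, List[str]]) -> Set[Rel]:
    """Keep only the invalid relationships not covered by a higher target."""
    keep: Set[Rel] = set()
    for kind, source, target in invalid:
        covered = any((kind, source, a) in invalid
                      for a in ancestors(target, parents))
        if not covered:
            keep.add((kind, source, target))
    return keep


# Assumed fragment of the ESN IS-A hierarchy over the targets of Table 3.
parents = {
    "Disease or Syndrome": ["Pathologic Function"],
    "Cell or Molecular Dysfunction": ["Pathologic Function"],
    "Experimental Model of Disease": ["Pathologic Function"],
    "Mental or Behavioral Dysfunction": ["Disease or Syndrome"],
    "Neoplastic Process": ["Disease or Syndrome"],
    "Injury or Poisoning": ["Disease or Syndrome"],
    "Organism Function": ["Physiologic Function"],
    "Organ or Tissue Function": ["Physiologic Function"],
    "Cell Function": ["Physiologic Function"],
    "Molecular Function": ["Physiologic Function"],
    "Mental Process": ["Organism Function"],
    "Genetic Function": ["Molecular Function"],
}
pathologic = ["Pathologic Function", "Cell or Molecular Dysfunction",
              "Disease or Syndrome", "Experimental Model of Disease",
              "Mental or Behavioral Dysfunction", "Neoplastic Process"]
physiologic = ["Physiologic Function", "Cell Function", "Genetic Function",
               "Mental Process", "Molecular Function",
               "Organ or Tissue Function", "Organism Function"]
src = "Injury or Poisoning"
invalid = {("degree_of", src, t) for t in pathologic}
invalid |= {("manifestation_of", src, t)
            for t in pathologic + physiologic + ["Injury or Poisoning"]}
explicit = minimal_blockings(invalid, parents)
```

Under these assumptions the 20 invalid relationships of Injury or Poisoning reduce to three explicit blockings, in agreement with the count given in the text.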
An example of valid inherited relationships involves Laboratory or Test Result, which has the new parent Phenomenon or Process. This new IS-A relationship causes 22 new relationships for Laboratory or Test Result that might be inherited from Phenomenon or Process. In the Review Step, all of them were deemed valid and were retained in the ESN. For example, the new relationship result_of (Laboratory or Test Result, Acquired Abnormality) is deemed valid on review since a test result may be caused by an acquired abnormality. Table 4 shows all 22 new valid relationships of Laboratory or Test Result.
Table 4.
Laboratory or Test Result's New Relationships Inherited from Phenomenon or Process
result_of (Laboratory or Test Result, Acquired Abnormality) |
result_of (Laboratory or Test Result, Anatomical Abnormality) |
result_of (Laboratory or Test Result, Biologic Function) |
result_of (Laboratory or Test Result, Cell Function) |
result_of (Laboratory or Test Result, Cell or Molecular Dysfunction) |
result_of (Laboratory or Test Result, Congenital Abnormality) |
result_of (Laboratory or Test Result, Disease or Syndrome) |
result_of (Laboratory or Test Result, Environmental Effect of Humans) |
result_of (Laboratory or Test Result, Experimental Model of Disease) |
result_of (Laboratory or Test Result, Genetic Function) |
result_of (Laboratory or Test Result, Human-caused Phenomenon or Process) |
result_of (Laboratory or Test Result, Injury or Poisoning) |
result_of (Laboratory or Test Result, Mental Process) |
result_of (Laboratory or Test Result, Mental or Behavioral Dysfunction) |
result_of (Laboratory or Test Result, Molecular Function) |
result_of (Laboratory or Test Result, Natural Phenomenon or Process) |
result_of (Laboratory or Test Result, Neoplastic Process) |
result_of (Laboratory or Test Result, Organ or Tissue Function) |
result_of (Laboratory or Test Result, Organism Function) |
result_of (Laboratory or Test Result, Pathologic Function) |
result_of (Laboratory or Test Result, Phenomenon or Process) |
result_of (Laboratory or Test Result, Physiologic Function) |
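How candidate relationships such as those in Table 4 arise from a new IS-A link can be sketched as follows. This is a simplified illustration, not the derivation technique itself: it assumes relationships are stored as (name, source, target) triples that are already expanded per source type, and it only lists the candidates that would still need to pass the Review Step. The function name `newly_inherited` and the placeholder entries `hypothetical_rel` and `Hypothetical Type` are assumptions made for the example.

```python
def newly_inherited(child, new_parent, relationships):
    """List relationships the child would gain through a new IS-A link.

    relationships is a set of (name, source, target) triples. A triple of
    the new parent becomes a candidate for the child unless the child
    already has a relationship with the same name and target.
    """
    existing = {(name, tgt) for (name, src, tgt) in relationships if src == child}
    offered = {(name, tgt) for (name, src, tgt) in relationships if src == new_parent}
    return sorted((name, child, tgt) for (name, tgt) in offered - existing)


RELS = {
    ("result_of", "Phenomenon or Process", "Acquired Abnormality"),
    ("result_of", "Phenomenon or Process", "Disease or Syndrome"),
    # Placeholder pair, present at both types, to show that relationships
    # the child already has are not reported as new.
    ("hypothetical_rel", "Phenomenon or Process", "Hypothetical Type"),
    ("hypothetical_rel", "Laboratory or Test Result", "Hypothetical Type"),
}

candidates = newly_inherited("Laboratory or Test Result", "Phenomenon or Process", RELS)
for name, source, target in candidates:
    print(f"{name} ({source}, {target})")
```

Each candidate produced this way is only a proposal; in the approach described here, it is retained or blocked after an audit for semantic validity.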
There are 7,303 relationships (including both introduced and inherited relationships) in the ESN vs. 6,977 in the SN. Among the 138 semantic types in the ESN, 122 have the same relationship structure as in the SN, and 16 have a different relationship structure. Of these 16, four are new semantic types, and the other 12 are semantic types having newly inherited relationships. Table 5 shows these 16 semantic types and their numbers of relationships in the SN and ESN. As an example of a semantic type having newly inherited relationships, Vitamin has 86 relationships in the SN as opposed to 109 relationships in the ESN.
Table 5.
Semantic Types with Different Relationship Structures in the Semantic Network and the Enriched Semantic Network
Semantic Type | No. of Relationships in Semantic Network | No. of Relationships in Enriched Semantic Network | Diff |
---|---|---|---|
Anatomical Entity | N/A | 3 | N/A |
Physical Anatomical Entity | N/A | 3 | N/A |
Conceptual Anatomical Entity | N/A | 3 | N/A |
Material-Physical Anatomical Entity | N/A | 3 | N/A |
Body Space or Junction | 42 | 43 | 1 |
Body Location or Region | 34 | 35 | 1 |
Body System | 5 | 6 | 1 |
Body Substance | 28 | 29 | 1 |
Gene or Genome | 72 | 73 | 1 |
Enzyme | 86 | 87 | 1 |
Injury or Poisoning | 86 | 178 | 92 |
Laboratory or Test Result | 105 | 127 | 22 |
Organism Attribute | 69 | 121 | 52 |
Clinical Attribute | 69 | 121 | 52 |
Receptor | 86 | 153 | 67 |
Vitamin | 86 | 109 | 23 |
Total: 16 | | | 314 |
Semantic Type Assignments in the ESN
The mapping described in section “Deriving MRSTYE from MRSTY for the ESN” did not allow any of the 5,653 existing redundant categorizations to appear in MRSTYE as semantic type assignments with respect to the ESN. For example, Enzyme has the old parent Biologically Active Substance in the ESN. Among the 19,226 concepts assigned Enzyme, 54 were also assigned Biologically Active Substance. Therefore, the assignments of Biologically Active Substance to the 54 concepts would be redundant categorizations in the ESN because they can be inferred from the assignments of Enzyme. All 54 of those redundant categorizations are excluded by the mapping process.
Altogether, the mapping prevented 21,297 potential new redundant categorizations in the process of generating the ESN's file MRSTYE. For example, the semantic type Enzyme has a new parent Amino Acid, Peptide, or Protein. Enzyme was assigned to 19,226 concepts. Of these 19,226 concepts, 18,941 were also assigned Amino Acid, Peptide, or Protein, and 88 were also assigned Organic Chemical, the parent of Amino Acid, Peptide, or Protein (researchers are, in fact, divided over whether nonprotein substances such as catalytic RNA should be considered enzymes or merely considered to produce enzymatic activity). The new IS-A relationship would have made these assignments redundant categorizations if the mapping did not prevent the assignments of the new parent and ancestor. Table 6 shows all the potential new redundant categorizations prevented by the mapping. Column 2 shows the number of concepts in the child semantic type. Column 3 shows the new parent(s) (or ancestors) of the child semantic type, the assignments of which would become redundant categorizations. Column 4 contains the number of prevented redundant categorizations with respect to the different new parents or ancestors.
Table 6.
Redundant Categorizations Involving Two Semantic Types Having New IS-A Links
Child ST | No. of Concepts | New Parent (Ancestor) Semantic Type | No. of Joint Concepts |
---|---|---|---|
Organism Attribute | 2,381 | Physiologic Function | 14 |
Injury or Poisoning | 30,778 | Disease or Syndrome | 556 |
| | | Pathologic Function | 105 |
Enzyme | 19,266 | Amino Acid, Peptide, or Protein | 18,941 |
| | | Organic Chemical | 88 |
Vitamin | 1,208 | Pharmacologic Substance | 948 |
| | | Organic Chemical | 644 |
| | | Chemical Viewed Structurally | 1 |
Total: 4 | 53,633 | 8 | 21,297 |
Another example is Vitamin, which has two new parents in the ESN (▶): Organic Chemical and Pharmacologic Substance. Among the 1,208 concepts assigned Vitamin, 644 were also assigned Organic Chemical, and 948 were also assigned Pharmacologic Substance. The mapping also prevented these potential redundant categorizations.
In total, the mapping avoids the generation of 26,950 (= 5,653 + 21,297) redundant categorizations in the construction of MRSTYE. The value 26,811 (= 5,514 + 21,297) is a good upper bound estimate on the number of concepts that had been assigned multiple semantic types in the SN but now have only one semantic type in the ESN. (The value 5,514 is used rather than 5,653 because there were 139 redundant categorizations that were duplicate in the sense of involving the same concept as another redundant categorization.)12
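The redundancy criterion behind these counts can be illustrated with a short sketch. It is not the mapping procedure from the section “Deriving MRSTYE from MRSTY for the ESN”; it only shows the underlying test, namely that an assignment is redundant when the same concept is also assigned a descendant of that semantic type. The dictionary `ESN_PARENTS` contains only IS-A links mentioned in the text, and the function names are hypothetical.

```python
# IS-A links mentioned in the text (child -> parents); a small fragment,
# not the complete ESN hierarchy.
ESN_PARENTS = {
    "Enzyme": ["Biologically Active Substance", "Amino Acid, Peptide, or Protein"],
    "Amino Acid, Peptide, or Protein": ["Organic Chemical"],
    "Vitamin": ["Organic Chemical", "Pharmacologic Substance"],
}


def all_ancestors(semantic_type, parents):
    """Collect every ancestor reachable through one or more IS-A links."""
    seen = set()
    stack = list(parents.get(semantic_type, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


def drop_redundant(assigned, parents):
    """Remove assignments that are implied by a more specific assignment."""
    assigned = set(assigned)
    implied = set()
    for semantic_type in assigned:
        implied |= all_ancestors(semantic_type, parents)
    return sorted(assigned - implied)


enzyme_concept = {"Enzyme", "Amino Acid, Peptide, or Protein", "Organic Chemical"}
print(drop_redundant(enzyme_concept, ESN_PARENTS))
```

In this sketch, a concept assigned Enzyme together with its new parent and that parent's ancestor keeps only Enzyme, while two assignments that are unrelated in the hierarchy are both kept.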
Discussion
Comparing the ESN to the SN
The ESN has about 13% more IS-As than the SN (149 vs. 132). In contrast, the increase in the number of relationships is only about 4.7% (314 + 12 = 326 new relationships). The main reason for the relatively low impact of the extra IS-As on the increase in the number of relationships in the ESN is the position of these IS-As in the ESN. Most of the semantic types with multiple parents are leaf semantic types or parents of leaves. Thus, most of the increase in relationship numbers happens at leaf semantic types where no further inheritance occurs and the expansion is limited.
Maintenance of the multiple subsumption hierarchy of the ESN is more complex than that of the two-tree hierarchy of the SN. However, knowledge representation techniques have improved since the time the UMLS was initiated, and this maintenance can be handled properly. In our view, the more accurate model of the ESN, including the extra relationships and removed assignments discussed in this paper, clearly outweighs the enhanced programming needed to handle multiple subsumption.
To some extent, the study of the validity of the new relationships inherited due to the new IS-As serves as an evaluation of the ESN. The fact that about 25% (i.e., 100 out of 414) of the newly inherited relationships are found invalid and need to be blocked is alarming at first glance and may cast doubt on the validity of the whole enhancement of the SN as manifested by the ESN. However, as we showed in section “ESN Relationship Structures,” all these 100 invalid relationships are blocked by just seven blockings. Hence, the number of blockings in the ESN is 17 compared to ten in the SN, and the number of DNI is the same as in the SN (27). Therefore, the magnitude of blocking in the ESN is well within the acceptable range for the SN. Furthermore, all 100 invalid relationships are inherited due to just two new IS-A links, from Organism Attribute to Physiologic Function and from Injury or Poisoning to Disease or Syndrome. It is our opinion that these two new IS-As are justified by the definitions of the semantic types involved. Moreover, 196 valid relationships are gained due to the addition of these two IS-As. We think that adding 196 valid relationships that are missing from the SN is worth the trade-off of seven more blockings. Hence, from the perspective of the number of blockings needed in the ESN, this evaluation study justifies, in our opinion, the introduction of the new IS-As and the 314 newly inherited relationships that were missing from the SN.
However, this is just our recommendation. It is up to the NLM to make an authoritative decision about which of the new IS-As should be added to the SN. There is an option of not adding the above two IS-As, but instead using our research in individually introducing these 196 new relationships at proper semantic types. As a matter of fact, the 52 new valid relationships at Clinical Attribute will be inherited from Organism Attribute; therefore, only 144 new relationships need to be introduced individually. In this way, no new blocking will be added to the ESN. However, adding 144 newly introduced relationships to the current 422 such relationships in the SN, about a third more, also has a price. The introduced relationships need to be actually set in the UMLS structure, while all other relationships are automatically inherited. It is up to the NLM to make a choice between the options and their trade-offs.
We realize that even without the new IS-As, one could have introduced the newly inherited relationships at the proper semantic types in the SN since they are not inherited in the SN. To be more specific, if a semantic type A lacks an IS-A to another type B, one could have duplicated at A those semantic relationships defined at B, since they would not be inherited. If such steps had been taken, then the new IS-As of the ESN would not imply much of a difference between the relationship structure of the corresponding semantic types of the ESN and the SN.
Our observation in the section “ESN Relationship Structures” is that only three such duplicate relationship introductions appear in the SN; they involve the relationship result_of at Organism Attribute and Clinical Attribute and part_of at Anatomical Structure. In the ESN, these three relationships were obtained by the respective types via inheritance rather than explicit introduction. On the other hand, the 314 new relationships that appear in the ESN were not defined previously at the proper semantic types in the SN.
A similar issue can be raised regarding the assignment of semantic types to concepts. If, as before, an IS-A from A to B is lacking, one could have assigned B to all the concepts to which A was assigned. In this way, each such concept would be assigned both A and B, even though A IS-A B is not modeled. We actually see such a phenomenon in the assignment of Amino Acid, Peptide, or Protein to 18,941 concepts among the 19,226 concepts assigned Enzyme. Similarly, we see the assignment of Pharmacologic Substance or Organic Chemical to many concepts assigned Vitamin. See Table 6 for more details. Thus, the redundant categorizations that would (potentially) be caused by the addition of an IS-A to the SN expose exactly those places where the knowledge was already modeled accurately in the SN through assignments, even though the IS-A itself did not originally appear. As we see in Table 6, this phenomenon can be found in a few of the cases, but it does not seem to be widespread across all missing IS-As.
In summary, judging from our studies of the impact of adding IS-As to the SN on the semantic types' relationship structures and the semantic type assignments, we cannot identify a general phenomenon in the design of the UMLS that compensates for the lack of multiple parents. Nevertheless, we see some cases of such a compensation in the assignments of types to concepts.
Applications of the ESN
One of the most popular uses of the UMLS is in free-text searches for concepts that map to terms in a particular terminology. For example, a user looking for the Read Code term for Thallium Poisoning (concept names are set in small caps) can search the META for this term and find the concept with unique identifier C0238452. This concept, in turn, maps to the Read Code term “Thallium or thallium compound causing toxic effect.” Free-text searching of the META, however, can be daunting due to the large number of terms that are often found when keyword searching. A typical filtering technique is to exploit the semantic-type information to constrain the search, either by asking the user to select a semantic type or by employing information about the user's task. This filtering can vastly reduce the number of terms returned to the user. As an example, a normalized (case-insensitive) string search of the 2004 version of the META for the word “Thallium” returns 91 concepts that have terms containing Thallium as a proper substring. But when the search is constrained to those concepts with the semantic type Injury or Poisoning, only two concepts are returned.
The importance of appropriate classification of semantic types becomes clear when their use for filtering term look-up is considered. For example, if a physician wishes to use the META to look up patient diagnosis terms and performs a search for “thallium” that is constrained to the semantic type Disease or Syndrome and its two child semantic types (in the current UMLS SN), the search will find Toxic encephalitis due to thallium and Thallium encephalopathy. However, using the ESN, where Injury or Poisoning is a new third child of Disease or Syndrome, the search will also find Thallium poisoning, Thallium sulfate toxicity, Thallium or thallium compound causing toxic effect, Accidental poisoning by thallium, and Accidental poisoning by thallium compounds. As the example demonstrates, filtering term look-up in the SN may be incomplete due to the missing IS-A relationships. However, the missing concepts are obtained when using the ESN for the filtering instead.
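The effect of the hierarchy on filtered look-up can be sketched as follows. This is a toy illustration, not a search over the actual META: the concept list `CONCEPTS` and its semantic type assignments are assumed for the example, the two SN children of Disease or Syndrome are named here as an assumption about the SN hierarchy, and the function names are hypothetical.

```python
def descendants_or_self(semantic_type, children):
    """Return the semantic type together with all of its descendants."""
    result = {semantic_type}
    stack = [semantic_type]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def filtered_search(keyword, root_type, concepts, children):
    """Case-insensitive substring search restricted to root_type and below."""
    allowed = descendants_or_self(root_type, children)
    key = keyword.lower()
    return sorted(
        name for name, types in concepts.items()
        if key in name.lower() and allowed & types
    )


# Assumed assignments, chosen only to mirror the example in the text.
CONCEPTS = {
    "Toxic encephalitis due to thallium": {"Disease or Syndrome"},
    "Thallium encephalopathy": {"Disease or Syndrome"},
    "Thallium poisoning": {"Injury or Poisoning"},
    "Thallium": {"Element, Ion, or Isotope"},
}

SN_CHILDREN = {
    "Disease or Syndrome": ["Mental or Behavioral Dysfunction", "Neoplastic Process"],
}
# The ESN adds Injury or Poisoning as a third child of Disease or Syndrome.
ESN_CHILDREN = {
    "Disease or Syndrome": SN_CHILDREN["Disease or Syndrome"] + ["Injury or Poisoning"],
}

sn_hits = filtered_search("thallium", "Disease or Syndrome", CONCEPTS, SN_CHILDREN)
esn_hits = filtered_search("thallium", "Disease or Syndrome", CONCEPTS, ESN_CHILDREN)
print(sn_hits)
print(esn_hits)
```

The search code is identical in both runs; only the hierarchy differs, which is why the ESN run additionally returns the concept assigned Injury or Poisoning.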
Note that other searching methods would be possible by taking into account the inconsistent structure of the original SN. However, by representing the knowledge more consistently, as in the ESN, we avoid having to take exceptional steps for each retrieval task and can instead rely on the natural arrangement that obtains.
Another application of our research involves naming of unnamed relationships in the META, following the newly inherited relationships in the ESN. In the UMLS, the relationships in the META ideally correspond to the relationships in the SN.23 That is, a nonhierarchical relationship from a concept assigned a semantic type A to another concept assigned a semantic type B should ideally correspond to an SN relationship from A to B. This desired correspondence has been used23 to infer validation or rejection of interconcept relationships in the META by comparison to the corresponding SN relationships between semantic types.
Many relationships in the META are either unnamed or named “other.” In some cases, such relationships do not fit, in their semantics, an existing corresponding relationship in the SN, but do fit a corresponding relationship added to the ESN. For example, the relationship occurs_in was added, according to the analysis in this paper, to the ESN from Organism Attribute to Temporal Concept. As a matter of fact, there are quite a few unnamed relationships in the META from a concept assigned Organism Attribute to a concept assigned Temporal Concept, the semantics of which fits the relationship occurs_in. An example is the relationship from Biological Immaturity to Childhood. All such relationships can now, with the addition of the corresponding relationship occurs_in from Organism Attribute to Temporal Concept, be named, accordingly, occurs_in. Before adding this relationship to the ESN, these META relationships could not be named in a way that conforms to the above correspondence.
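The naming idea can be sketched as a simple lookup. The sketch assumes that the semantic-type relationships are available as a table keyed by (source type, target type) with inheritance already expanded, and that each concept's semantic type assignments are known; the names `ESN_RELATIONSHIPS`, `CONCEPT_TYPES`, and `candidate_names` are hypothetical. The proposed names are candidates only and would still require review of each interconcept relationship's semantics.

```python
# Semantic-type relationships keyed by (source type, target type).
# Only the relationship discussed in the text is included.
ESN_RELATIONSHIPS = {
    ("Organism Attribute", "Temporal Concept"): {"occurs_in"},
}

# Semantic type assignments of the two concepts from the example.
CONCEPT_TYPES = {
    "Biological Immaturity": {"Organism Attribute"},
    "Childhood": {"Temporal Concept"},
}


def candidate_names(source, target, concept_types, st_relationships):
    """Propose names for an unnamed relationship between two concepts.

    A name is proposed when some semantic type of the source concept has a
    relationship of that name to some semantic type of the target concept.
    """
    names = set()
    for source_type in concept_types.get(source, ()):
        for target_type in concept_types.get(target, ()):
            names |= st_relationships.get((source_type, target_type), set())
    return sorted(names)


proposed = candidate_names(
    "Biological Immaturity", "Childhood", CONCEPT_TYPES, ESN_RELATIONSHIPS
)
print(proposed)
```

The lookup is directional: reversing the source and target concepts yields no candidate in this toy table.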
Limitations
Our work on the design of the ESN has several limitations.
Only two techniques were used to identify missing IS-As. Other techniques might uncover missing IS-As that ours did not. However, we doubt that there are many more such IS-As since no additional missing IS-As were found in a random sample of 550 pairs of semantic types reviewed in our previous evaluation.10
The resulting ESN just represents our opinion. It is up to the NLM to make authoritative decisions regarding the inclusion of each suggested IS-A in the ESN. However, our previously applied techniques10 and the analysis in this paper provide support for the decision process.
Currently, our techniques are limited to the enrichment of the UMLS SN. Unfortunately, there are currently no other two-layered terminological knowledge bases with assignment of broad categories to concepts connecting the two layers. However, we expect other terminologies and ontologies, in both medical informatics and computer science, to follow the UMLS in adopting such a structure, as we reported.24 A recent such effort to add assignments to connect SUO,25 a Standard Upper Ontology, to the WordNet ontology26 was reported.27 As was demonstrated with the UMLS, such a two-layered terminological knowledge base helps in maintenance, auditing, and integration of terminologies.
Conclusions
The semantic relationship structures in the ESN are more complex than those in the SN due to the new multiple-parent IS-A hierarchy. In the ESN, relationships can be inherited from more than one source. In this paper, we presented a technique for deriving the relationship structures of the ESN's semantic types from those of the SN. The technique sought to preserve relationship introductions and existing relationship inheritance. All the newly inherited relationships were audited for semantic validity. Based on the audit step in our technique, we obtained the complete set of the ESN's relationship structures.
The entire set of assignments of semantic types to concepts in the ESN was derived automatically according to three rules. The process ensured that a concept is only assigned the most appropriate specialized semantic types. In this way, redundant categorizations were avoided completely, unlike in the original SN.
The resulting complete ESN contains 138 semantic types, 149 IS-A links, and 7,303 semantic relationships. There are in total 1,013,876 semantic type assignments. Compared to the SN, the ESN serves as a more accurate and refined unifying semantic abstraction of the META.
This research was partially supported by contract #N01-1-3543 from the National Library of Medicine and by the New Jersey Commission for Science and Technology.
We thank the anonymous reviewers whose insightful remarks and suggestions helped to meaningfully improve the paper.
References
- 1. Humphreys BL, Lindberg DAB. Building the Unified Medical Language System. In: Kingsland LC, editor. Proceedings of the Thirteenth Annual Symposium on Computer Applications in Medical Care. Washington, DC: 1989. p. 475–80.
- 2. Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: representing different views of biomedical concepts. Bull Med Libr Assoc. 1993;81:217–22.
- 3. U.S. Department of Health and Human Services, National Institutes of Health, National Library of Medicine. Unified Medical Language System. Updated periodically.
- 4. McCray AT, Miller RA. Making the conceptual connections: the Unified Medical Language System (UMLS) after a decade of research and development. J Am Med Inform Assoc. 1998;5:129–30.
- 5. McCray AT. UMLS Semantic Network. In: Kingsland LC, editor. Proceedings of the Thirteenth Annual Symposium on Computer Applications in Medical Care. Washington, DC: 1989. p. 503–7.
- 6. McCray AT. Representing biomedical knowledge in the UMLS Semantic Network. In: Broering NC, editor. High-performance medical libraries: advances in information management for the virtual era. Westport, CT: Mekler, 1993, 45–55.
- 7. Cimino JJ, Johnson SB. Use of the Unified Medical Language System in Patient Care. Methods Inf Med. 1995;34:158–64.
- 8. Cimino JJ, Clayton PD, Hripcsak G, Johnson SB. Knowledge-based approaches to the maintenance of a large controlled medical terminology. J Am Med Inform Assoc. 1994;1:35–50.
- 9. Johnson SB, Friedman C, Cimino JJ, Clark BT, Hripcsak G, Clayton PD. Conceptual data model for a central patient database. In: Clayton PD, editor. Proceedings of the 15th Annual Symposium on Computer Applications in Medical Care. Washington, DC: 1991. p. 381–5.
- 10. Zhang L, Perl Y, Halper M, Geller J, Cimino JJ. An enriched Unified Medical Language System semantic network with a multiple subsumption hierarchy. J Am Med Inform Assoc. 2004;11:195–206.
- 11. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods Inf Med. 1995;34:193–201.
- 12. Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. In: Kohane IS, editor. Proc AMIA Annu Symp. San Antonio, TX. 2002;612–6.
- 13. Bodenreider O, McCray AT. Exploring semantic groups through visual approaches. JBI. 2003;36:414–32.
- 14. Gu H, Perl Y, Geller J, Halper M, Liu L, Cimino JJ. Representing the UMLS as an OODB: modeling issues and advantages. J Am Med Inform Assoc. 2000;7:66–80.
- 15. Stroustrup B. The C++ Programming Language. 3rd ed. Reading, MA: Addison-Wesley, 1997.
- 16. Gosling J, Joy B, Steele G. The Java language specification. Reading, MA: Addison-Wesley, 1996.
- 17. Sowa JF. Knowledge representation: logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole, 2000.
- 18. Schulze-Kremer S, Smith B, Kumar A. Revising the UMLS Semantic Network. In: Fieschi M, Coiera E, Li Y-C, editors. Proceedings of Medinfo 2004. San Francisco, CA; 2004. p. 1700.
- 19. Burgun A, Bodenreider O. Mapping the UMLS semantic network into general ontologies. In: Bakken S, editor. Proc AMIA Annu Symp. 2001;81–5.
- 20. Yu H, Friedman C, Rzhetsky A, Kra P. Representing genomic knowledge in the UMLS semantic network. In: Lorenzi NM, editor. Proc AMIA Annu Symp. 1999;181–5.
- 21. Michael J, Mejino JLV, Rosse C. The role of definitions in biomedical concept representation. In: Bakken S, editor. Proc AMIA Annu Symp. 2001;463–7.
- 22. Rosse C, Mejino JLV. A reference ontology for biomedical informatics: the foundational model of anatomy. JBI. 2003;36:478–500.
- 23. McCray AT, Bodenreider O. A conceptual framework for the biomedical domain. In: Green R, Bean CA, Myaeng SH, editors. The semantics of relationships: an interdisciplinary perspective. Boston: Kluwer Academic Publishers, 2002, 181–98.
- 24. Perl Y, Geller J. Research on structural issues of the UMLS—past, present, and future. JBI. 2003;36:409–13.
- 25. Niles I, Pease A. Towards a standard upper ontology. In: Welty C, Smith B, editors. Proc. FOIS 2001. Ogunquit, ME; 2001. p. 2–9.
- 26. Fellbaum C. WordNet: an electronic lexical database. Cambridge, MA: The MIT Press, 1998.
- 27. Niles I, Pease A. Linking lexicons and ontologies: mapping WordNet to the Suggested Upper Merged Ontology. In: Proceedings of the International Conference on Information and Knowledge Engineering 2003 (IKE'03). Las Vegas, NV; 2003. p. 412–6.