Abstract
Objective
Each Unified Medical Language System (UMLS) concept is assigned one or more semantic types (STs). A dynamic methodology for aiding an auditor in finding concepts that are missing the assignment of a given ST, S, is presented.
Design
The first part of the methodology exploits the previously introduced Refined Semantic Network and accompanying refined semantic types (RST) to help narrow the search space for offending concepts. The auditing is focused in a neighborhood surrounding the extent of an RST, T (of S) called an envelope, consisting of parents and children of concepts in the extent. The audit moves outward as long as missing assignments are discovered. In the second part, concepts not reached previously are processed and reassigned T as needed during the processing of S's other RSTs. The set of such concepts is expanded in a similar way to that in the first part.
Measurements
The number of errors discovered is reported. To measure the methodology's efficiency, “error hit rates” (i.e., errors found in concepts examined) are computed.
Results
The methodology was applied to three STs: Experimental Model of Disease (EMD), Environmental Effect of Humans, and Governmental or Regulatory Activity. The EMD experienced the most drastic change. For its RST “EMD ∩ Neoplastic Process” (RST “EMD”) with only 33 (31) original concepts, 915 (134) concepts were found by the first (second) part to be missing the EMD assignment. Changes to the other two STs were smaller.
Conclusion
The results show that the proposed auditing methodology can help to effectively and efficiently identify concepts lacking the assignment of a particular semantic type.
Introduction
The concept database of the Unified Medical Language System (UMLS), the Metathesaurus (META), contains about 1.5 million concepts in its 2007AC release. 1–6 Its Semantic Network (SN) overlays a consistent categorization via the assignment of one or more of its 135 semantic types (STs) to each concept. 7,8 However, because the META is so large and inherently complex, ST-assignment errors are all but unavoidable. Furthermore, the UMLS's construction through the integration of many source vocabularies that are not necessarily consistent may contribute to such problems. The differing views of various subject experts who carry out the ST assignments can also be seen as a contributing factor. In fact, ST mis-assignments may reflect a variety of misunderstandings, including inaccurate or incorrect meanings or ambiguities with respect to concepts. An ST mis-assignment may therefore imply the presence of other errors.
In a study involving UMLS users, 9 it was clearly expressed that significant attention should be paid to auditing. The ST mis-assignments were found to be among the leading concerns. Thus, weeding them out should be an important aspect of UMLS maintenance.
Regarding a specific ST S (semantic types are written in bold, while concept names are in italics), there are several possibilities for an assignment error: (i) a concept may be assigned S incorrectly; (ii) it may be assigned S correctly but have errors with respect to its other semantic types; or (iii) it may be missing the assignment of S. The first two possibilities were addressed in Chen et al. 2008. 10 We present a methodology for dealing with the third possibility in this paper.
Randomly searching through the META for concepts that warrant an assignment of S is certain to be tedious and unlikely to prove fruitful. The challenge is to design an effective algorithmic technique to identify “suspicious” concepts that may be missing the assignment of S, using a technique similar to that of Chen et al. 2008 10 for finding other such concepts with respect to different kinds of errors. However, in contrast to Chen et al. 2008, 10 where the basis was the overall extent (i.e., set of assigned concepts) of S, there is no obvious set from which to commence the search for omissions of the S assignment. This observation makes the current ST-assignment errors more difficult to uncover than those encountered in Chen et al. 2008. 10
To overcome this difficulty, we define a guided search for the auditor that emanates outward from the extent of S. We proceed from the assumption that concepts requiring the assignment of S are in all likelihood already in the vicinity of its extent. Therefore, the search for erroneous concepts is focused in a neighborhood surrounding the extent of a semantic type. Actually, to refine the search space further, we use a refined semantic type (RST)—a subtype of a semantic type—from our previously introduced Refined Semantic Network 11,12 as the starting point, since concepts in the same RST extent tend to share an overarching uniform broad meaning, which is not necessarily true for the entire ST. Having a search space of concepts with uniform broad meaning tends to simplify the auditing work, since concepts lacking the expected uniform meaning naturally stand out in a review.
The first part of our methodology concentrates on auditing concepts in a neighborhood surrounding the extent of an RST T of S that we call an envelope. An envelope is defined with respect to the UMLS's parent/child relationships whose origins are in the various UMLS source vocabularies. From there, the search space emanates outward in a concentric progression to encompass more and more distant neighborhoods as type assignment errors continue to be discovered by the auditor. Overall, the methodology allows ancestors and descendants of the concepts in T to be systematically examined and possibly brought into T's extent—when a previous ancestor or descendant in the progression is reassigned T. These ancestors and descendants are related to a concept (previously assigned an RST) via the parent/child relationships. All RSTs of S are, in turn, processed in this manner.
The second part of the methodology constitutes a cross-processing step, where concepts potentially needing an assignment of an RST T are identified and reassigned T while processing another RST T′. Subsequently, those concepts are processed in a manner similar to that of the original extent of T in the first part of the methodology, with various tiers of envelopes created and audited.
We demonstrate our methodology by applying it to three semantic types: Experimental Model of Disease, Environmental Effect of Humans, and Governmental or Regulatory Activity. The errors discovered during this effort are reported, and the effectiveness of our approach is discussed.
Background
Refined Semantic Network
We have previously introduced the Refined Semantic Network (RSN), a modified version of the existing Semantic Network, as an enhanced abstraction mechanism for the UMLS. 11,12 It consists of two types: pure semantic types (PSTs) and intersection semantic types (ISTs). Collectively, we refer to them as refined semantic types (RSTs). The RSTs are derived automatically from the existing STs in the Semantic Network (SN) and their assignments to concepts.
One PST in the RSN is defined for each ST S in the SN. While the PST is given the same name as its corresponding ST S in the SN, we will often denote it as S R to avoid confusion. The assignments of S R in the RSN differ from those of S. Specifically, S R is assigned strictly to those concepts that originally had S as their sole ST assignment. The ISTs serve to provide assignments for the remaining concepts originally in the extent of S, denoted E(S). Such a concept will have been assigned at least one other semantic type. In fact, let us assume that some concepts were assigned S and one other ST, say, U simultaneously. This implies the existence of an IST named S ∩ U that is assigned to exactly those concepts originally assigned both S and U and no other types. The symbol “∩ ” is mathematical intersection, and we use it and “intersection type” because E(S ∩ U) = E(S) ∩ E(U). That is, the extent of the IST is the intersection of the extents of the STs from the SN.
Let us note that an empty intersection of E(S) and the extent of another type, say, W means that S ∩ W would not appear in the RSN. This avoids any potential combinatorial explosion of ISTs. The ISTs can involve more than two types.
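The derivation of RSTs from ST assignments can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `derive_rsts` and the dictionary layout are assumptions made for illustration, and the toy data covers only a handful of concepts mentioned in this paper. Concepts are grouped by their exact set of STs, so a combination that no concept carries never produces a type.

```python
from collections import defaultdict


def derive_rsts(assignments):
    """Group concepts by their exact set of semantic types.

    `assignments` maps a concept name to the set of STs assigned to it.
    A singleton set corresponds to a pure semantic type (PST); a larger
    set corresponds to an intersection semantic type (IST).
    """
    extents = defaultdict(set)
    for concept, sts in assignments.items():
        extents[frozenset(sts)].add(concept)
    return dict(extents)


# Toy data (illustrative only, not the full extents).
assignments = {
    "Arthritis, Experimental": {"EMD"},
    "Disease Model": {"EMD"},
    "Melanoma, Experimental": {"EMD", "NP"},
    "Experimental Hepatoma": {"EMD", "NP"},
    "mouse carcinoma": {"NP"},
}

rsts = derive_rsts(assignments)
for st_set, extent in rsts.items():
    print(" ∩ ".join(sorted(st_set)), "->", sorted(extent))
```

Because every concept has exactly one set of STs, the resulting extents are disjoint and together cover all concepts in the input.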
As an example, let us consider the ST Experimental Model of Disease (EMD). Figure 1 uses a Venn diagram 13 to show some of the concepts assigned EMD and its overlap with Neoplastic Process (NP). The ellipses represent the respective extents of the STs. As we see, the concepts Arthritis, Experimental and Disease Model, along with 29 others, are solely assigned EMD. Thus, these 31 concepts would be assigned the PST EMD R with respect to the RSN (see Figure 2a). By contrast, Melanoma, Experimental and Experimental Hepatoma, along with 31 more concepts, are assigned both EMD and NP. In the RSN, these 33 concepts would be assigned the IST EMD ∩ NP (Figure 2a). Figure 2b shows the portion of the RSN involving EMD, NP, and EMD ∩ NP.
Figure 1.
Semantic types EMD and NP and their intersection.
Figure 2.
RSTs derived from the semantic type EMD.
An important characteristic of the RSTs is that they collectively serve to partition the concepts of S. (Overall, the RSN's types partition the entire META.) That is, all concepts have unique assignments in the RSN. A concept originally assigned just S will be uniquely assigned S R. A concept originally assigned S and one or more other STs at the same time will now be uniquely assigned the appropriate IST. As a consequence, all concepts in the extent of the same RST have exactly the same ST assignments in the context of the SN. This property, which we call semantic uniformity, is one of the main benefits of the RSN. In previous work, we have exploited it in the creation of auditing techniques for the UMLS. 10–12,14,15 Here, again, we make use of it in the attempted expansion of the extent of a given ST.
Auditing the Unified Medical Language System
Auditing is an important phase in a terminology's life cycle. 16 Various techniques have been proposed and applied to ensure the quality of the UMLS's contents. Issues addressed include the detection of classification errors, 17,18,14 redundant and circular hierarchical relationships, 19–21 and unintended synonymy. 22 An algorithm for rooting out each redundant ST assignment, in which a concept is assigned both an ST, say, X, and an ancestor of X simultaneously, was formulated in Peng et al. 2002. 23 Object-oriented models of the UMLS have been developed for use in auditing. 12,24 The SN itself has been targeted for revisions in order, for example, to enrich its hierarchical structure and enhance support for source vocabulary integration. 25–27 Our work on the RSN can be seen in this light. 11,12
In Chen et al. 2008, 10 we developed a group-centered approach to facilitate the task of auditing ST assignments in the context of the extent of some given ST. The methodology employed a “divide and conquer” approach that used the semantic uniformity of the RSTs. For each RST extent, only “suspicious” concepts were audited. More specifically, a suspicious concept c is a concept such that at least one of its parents is assigned an ST S that is not assigned to c and is not an ancestor of c's semantic types. An algorithm was formulated to identify such suspicious concepts.
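The definition of a suspicious concept can be sketched as a predicate. This is an illustration under stated assumptions, not the algorithm of Chen et al. 2008: the function `is_suspicious`, the mappings `parents`, `st_of`, and `st_ancestors`, and the concept and type names (`c1`, `p1`, `X`, `Y`, `Z`) are all hypothetical.

```python
def is_suspicious(concept, parents, st_of, st_ancestors):
    """Return True if some parent carries an ST that is neither assigned
    to `concept` nor an ancestor (in the SN) of one of its STs."""
    own = st_of[concept]
    covered = set(own)
    for st in own:
        covered |= st_ancestors.get(st, set())
    return any(
        st not in covered
        for parent in parents.get(concept, ())
        for st in st_of[parent]
    )


# Hypothetical SN fragment: Z is an ancestor of X; Y is unrelated to X.
st_ancestors = {"X": {"Z"}}
st_of = {"c1": {"X"}, "p1": {"Z"}, "c2": {"X"}, "p2": {"Y"}}
parents = {"c1": {"p1"}, "c2": {"p2"}}

print(is_suspicious("c1", parents, st_of, st_ancestors))  # parent's ST is an ancestor of X
print(is_suspicious("c2", parents, st_of, st_ancestors))  # parent's ST is unrelated to X
```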
The auditing methodology is dynamic, with a reinvocation occurring after the correction of an ST mis-assignment at a parent concept. This can lead to the discovery of suspicious children that were not initially deemed such. Its dynamic nature enables the auditor to increase the number of errors found with only a small increase in effort.
The methodology was applied to the 73 concepts of E(EMD), resulting in changes to the ST assignments of 15 concepts. Only 64 concepts retained their assignment of EMD after the corrections. Nine concepts lost their EMD assignment, and six others gained an NP assignment. Figure 2 reflects the two RSTs EMD R and EMD ∩ NP after these corrections. In Chen et al., 28 it is shown that concepts found to have an erroneous ST assignment in Chen et al. 2008 10 have a high likelihood of missing IS-A relationships. A methodology for exploring such errors is presented there.
It may not be the case that all concepts that should be assigned some RST are. For example, the concept mouse carcinoma is assigned NP, but it is an animal model of disease used for research in carcinomas in humans. Thus, it should be assigned EMD as well. This concept was a parent of the suspicious concept Mouse Choroid Plexus Carcinoma. 10 But in Chen et al. 2008, 10 we concentrated only on auditing the extent of an RST. Here, we look beyond a given RST's extent with an eye toward expanding it.
Methods
The goal of our methodology is to find concepts that warrant an assignment of a semantic type S but have not been given it previously. Such a concept may simply be lacking the assignment of S among its other ST assignments. It may be the case that there exists an incorrect ST assignment that should be replaced by S. These decisions are made by the auditor in the process of being guided by our methodology.
As mentioned, we proceed from the assumption that some concepts lacking the assignment of S are already in the local vicinity of the extent of S. Actually, the starting point of our search will be an RST T rather than the entire ST S itself. This serves to further refine the search space. In the first part of the methodology, we focus on concepts in a neighborhood surrounding T's extent that we call an envelope. From there, the search space emanates outward in a breadth-first (concentric) progression as type mis-assignments are discovered. Note that by repeating this process for all RSTs of S, we ultimately obtain any required expansion of its whole extent.
In the second part of the methodology, we proceed in a similar manner but start from a group of concepts tagged for inspection during the processing of an RST of S other than T. This cross-processing allows us to reach concepts warranting the assignment of T but not being reachable via the first part of the methodology.
Part 1: Expanding the Extent of a Refined Semantic Type T
The notion of envelope is fundamental to Part 1 of our methodology, so let us start with its definition. In the following, T is an RST of the semantic type S.
Definition (Envelope of T): The envelope of T, denoted V(T), is the set containing all concepts c such that c is not assigned T or another RST of S, and c has either a child or a parent (or both) in E(T).
Let us note that typically when one audits a concept, one also reviews its neighborhood containing its parents, children, and sometimes the targets of its lateral relationships. 29 When the focus of the audit is a group of concepts—as in this case—rather than a single concept, then the envelope functions analogously to the neighborhood.
Note that a concept assigned another RST of S does not need to enter the envelope. An error in its ST assignment can be corrected using the auditing methodology presented in Chen et al. 2008. 10
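The envelope definition translates directly into set operations. The following is a minimal sketch under stated assumptions: the function `envelope` and its argument names are hypothetical, the PAR/CHD relationships are given as plain dictionaries, and `other_rst_parent` is a made-up concept standing in for one that already carries another RST of S.

```python
def envelope(extent_T, other_rst_concepts, parents, children):
    """Compute V(T): parents and children of concepts in E(T) that are
    assigned neither T nor another RST of the same semantic type S."""
    env = set()
    for c in extent_T:
        env |= parents.get(c, set()) | children.get(c, set())
    return env - extent_T - other_rst_concepts


# Toy fragment of E(EMD ∩ NP); only some relationships are shown.
extent_T = {"Melanoma, Experimental", "Tumor Virus Infections"}
parents = {
    "Melanoma, Experimental": {"Mouse Neoplasms"},
    "Tumor Virus Infections": {"other_rst_parent"},  # hypothetical concept
}
children = {"Tumor Virus Infections": {"Avian Leukosis", "Marek Disease"}}
other_rst_concepts = {"other_rst_parent"}

v_T = envelope(extent_T, other_rst_concepts, parents, children)
print(sorted(v_T))
```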
As we see, the envelope comprises those concepts not assigned T but instead related to its concepts via PAR/CHD relationships. As an example, the extent E(EMD ∩ NP) can be seen as the inner circle in Figure 3. Examples of its concepts are Neoplasms, Experimental; Rous Sarcoma; and Tumor Virus Infections. The envelope V(EMD ∩ NP) is the outer ring. Parent concepts in the envelope include Neoplasms, Virus Diseases, and Experimental Organism Diagnosis. Children include Avian Leukosis, Marek Disease, and Common Wart.
Figure 3.
Envelope V(EMD ∩ NP).
The concepts in the envelope are deemed to be those most likely to be inadvertently lacking the assignment of T, and hence the auditing process begins with them. Each is examined solely for a potential assignment of T. If none requires this assignment, then the auditing process comes to an end. However, if a concept o is found to be missing the assignment, then the search space is expanded to include all the parents and children of every such concept o that are not already assigned T or are not already in V(T). Those concepts constitute a second-tier envelope—call it V2(T)—containing grandchildren, grandparents, and siblings, that is processed only after the scan of V(T) is finished. If warranted, this fanning out in a concentric manner to encompass a third-tier envelope V3(T)—and beyond—continues until no more concepts in an envelope are found to require the assignment of T. This way, not only are the concepts in the immediate vicinity of the original extent of T examined for a T assignment, but also concepts in the vicinity of any concepts that were assigned T during the course of the audit. In other words, any concept that is a child or parent of a concept assigned T using our methodology is also inspected to determine if it requires the assignment, too. Another way to look at the process is as follows: any concept reachable from a concept in the initial E(T) via a path of PAR/CHD relationships connecting concepts that have been reassigned T is audited for a possible T assignment.
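The tiered expansion can be sketched as a breadth-first traversal in which the auditor's judgment is supplied as a callback. This is an illustrative sketch, not the authors' implementation: `build_neighbors`, `expand_extent`, and `needs_T` are hypothetical names, `E0` is a made-up concept standing in for a member of the original extent, its two edges are assumed, and the auditor's answers in the toy run are assumed for illustration.

```python
def build_neighbors(edges):
    """Build an undirected PAR/CHD adjacency map from (parent, child) pairs."""
    nb = {}
    for parent, child in edges:
        nb.setdefault(parent, set()).add(child)
        nb.setdefault(child, set()).add(parent)
    return nb


def expand_extent(extent, other_rst, neighbors, needs_T):
    """Audit envelopes tier by tier; return (final extent, audited tiers)."""
    extent = set(extent)
    seen = set(extent)      # concepts in E(T) or in an envelope already built
    frontier = set(extent)  # concepts whose neighbors form the next tier
    tiers = []
    while frontier:
        tier = set()
        for c in frontier:
            tier |= neighbors.get(c, set())
        tier -= seen
        tier -= set(other_rst)
        if not tier:
            break
        seen |= tier
        tiers.append(tier)
        # Only concepts reassigned T trigger further expansion.
        frontier = {c for c in tier if needs_T(c)}
        extent |= frontier
    return extent, tiers


edges = [
    ("Experimental Organism Diagnosis", "E0"),   # assumed edge
    ("E0", "Mouse Islet Cell Neoplasm"),         # assumed edge
    ("Mouse Pancreatic Neoplasm", "Mouse Islet Cell Neoplasm"),
    ("Mouse Islet Cell Neoplasm", "Mouse Insulinoma"),
    ("Mouse Pancreatic Disorder", "Mouse Pancreatic Neoplasm"),
]
accepted = {"Mouse Islet Cell Neoplasm", "Mouse Pancreatic Neoplasm",
            "Mouse Insulinoma"}  # assumed auditor decisions

final_extent, tiers = expand_extent({"E0"}, set(), build_neighbors(edges),
                                    lambda c: c in accepted)
for i, tier in enumerate(tiers, start=1):
    print(f"tier {i}:", sorted(tier))
```

Keeping a single `seen` set ensures that a concept rejected in one tier is not presented to the auditor again in a later tier.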
The stages of auditing can be depicted as expanding outward in a series of concentric circles, as shown for EMD ∩ NP in Figure 4. For example, the two concepts Mouse Islet Cell Neoplasm and Experimental Organism Diagnosis reside in V(EMD ∩ NP). They are processed first. Mouse Islet Cell Neoplasm is found to be lacking the assignment of EMD ∩ NP. Thus, its parent, Mouse Pancreatic Neoplasm, and its children, Mouse Somatostatinoma, Mouse Insulinoma, and Mouse Islet Cell Adenoma, not already in E(EMD ∩ NP) or V(EMD ∩ NP), are included in the second-tier envelope V2(EMD ∩ NP) and await auditing until the scan of V(EMD ∩ NP) is complete. Similarly, Mouse Pancreatic Neoplasm is later found to be missing the assignment of EMD ∩ NP. Then its two parents, Mouse Digestive System Neoplasms and Mouse Pancreatic Disorder, and six of its children, Benign Mouse Pancreatic Neoplasm, Malignant Mouse Pancreatic Neoplasm, Mouse Pancreatic Acinar Neoplasm, Pancreatic Intraepithelial Neoplasia-1, Mouse Pancreatic Intraepithelial Neoplasia-2, and Mouse Pancreatic Intraepithelial Neoplasia-3, enter the third-tier envelope V3(EMD ∩ NP) that is processed after V2(EMD ∩ NP). As it happens, many of these concepts require the assignment of EMD ∩ NP, as indicated by their green shading. After the auditing process terminates, all the green-shaded concepts are deemed to belong to E(EMD ∩ NP).
Figure 4.
Auditing the RST EMD ∩ NP.
Part 2: Further Expansion of T as a Result of Processing Other RSTs
As noted, the overall problem we are trying to solve is finding Metathesaurus concepts that are missing an assignment of a given semantic type S. In the previous subsection, we presented a technique that takes as its starting point an RST T, which has the benefit of uniform semantics in the sense that all its concepts have the exact same set of assigned semantic types. From there, we fan out to search the local vicinity of T in an effort to find additional concepts that warrant the assignment of T—and hence S. The semantic uniformity of T is taken to be an aid to the auditor when deciding whether concepts in the search space (the envelopes of T) should also be assigned T.
An issue of concern is whether the expanding envelope approach reaches as many concepts as possible (needing the assignment of T) in the vicinity of T. As defined, our methodology will reach a concept c′ that is missing a T-assignment if there is a path of PAR/CHD relationships (in either direction) starting from a concept c in the original E(T) such that each concept on the path has been reassigned T in the process. If there is no such path, then c′ will not be audited and corrected. We see cases of this with concepts representing non-cancer diseases in a mouse, e.g., Mouse Pulmonary Disorder and Mouse Prostate Disorder. These are often incorrectly assigned Disease or Syndrome rather than the expected EMD.
To address this, Part 2 of our methodology uses errors that might be discovered by the auditor in the process of searching around another RST of S, say, T′. While reviewing the envelope of T′, the auditor might realize that some concepts not appropriate for T′ nonetheless have incorrect current ST assignments. In particular, some of the concepts in the envelope of T′ may need to be assigned T. For example, let T′ be EMD ∩ NP. When processing EMD ∩ NP, we find that some concepts in the envelope are assigned Disease or Syndrome. As mentioned, this happens for concepts representing diseases that are not cancers but are instead experimental human diseases in animals. For instance, we encounter the concept Animal Model in V(EMD ∩ NP), which is mis-assigned Disease or Syndrome, an ST used exclusively for diseases of humans. The concept should be assigned EMD (more specifically, the RST EMD R) instead. We find Mouse Pancreatic Disorder with the same mistake while scanning V3(EMD ∩ NP). Two other such concepts, Mouse Skeletal System Disorder and Mouse Hematologic Disorder, are seen in V4(EMD ∩ NP). None of these concepts are found using Part 1 of our methodology on EMD R. Other similar examples occur when auditing Environmental Effect of Humans (EEH) and Governmental or Regulatory Activity (GRA), as discussed in the Results section.
That type of erroneous assignment to a concept p is replaced by the auditor with the proper RST assignment of T, and p is inserted into a set called the auxiliary extent of T, denoted AUX(T). Once the processing of Part 1 of the methodology on all other RSTs is complete, the auditor's attention is turned to the auxiliary extent AUX(T). This set is processed in a manner analogous to E(T). In fact, we extend the definition of envelope to any set of concepts within the hierarchy that are assigned (or reassigned) the same RST. In particular, we define V(AUX(T)) to be the set containing all concepts c such that c is not in AUX(T) and c has either a child or a parent (or both) in AUX(T). (Furthermore, c would not already have the assignment of another RST of T's semantic type S.) The second-tier, third-tier, etc., envelopes of AUX(T) are defined as well. The auditing then proceeds through the various envelopes of AUX(T) since they may contain concepts warranting the assignment of T. Note that when auditing the envelope V(AUX(T)), one would skip the concepts of V(T) to avoid duplicate review.
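Part 2 can be sketched in the same style, starting from AUX(T) instead of E(T). The sketch is illustrative only: `envelope_of`, `process_aux`, and `needs_T` are hypothetical names, `P_hyp` and `Q_hyp` are made-up concepts, and skipping every concept already reviewed for T in Part 1 (rather than only those of V(T)) is an assumption that generalizes the rule stated above.

```python
def envelope_of(group, excluded, neighbors):
    """Parents and children of `group` that are neither in it nor excluded."""
    env = set()
    for c in group:
        env |= neighbors.get(c, set())
    return env - group - excluded


def process_aux(aux, already_reviewed, other_rst, neighbors, needs_T):
    """Expand AUX(T) tier by tier; return AUX(T) plus newly reassigned concepts."""
    added = set(aux)
    # Concepts that must not be presented to the auditor (again).
    seen = set(aux) | set(already_reviewed) | set(other_rst)
    frontier = set(aux)
    while frontier:
        tier = envelope_of(frontier, seen, neighbors)
        seen |= tier
        frontier = {c for c in tier if needs_T(c)}
        added |= frontier
    return added


# Undirected PAR/CHD adjacency for a toy fragment.
neighbors = {
    "Mouse Pancreatic Disorder": {"Mouse Pancreatic Neoplasm", "P_hyp"},
    "Mouse Pancreatic Neoplasm": {"Mouse Pancreatic Disorder"},
    "P_hyp": {"Mouse Pancreatic Disorder", "Q_hyp"},
    "Q_hyp": {"P_hyp"},
}

result = process_aux(
    aux={"Mouse Pancreatic Disorder"},
    already_reviewed=set(),
    other_rst={"Mouse Pancreatic Neoplasm"},  # reassigned EMD ∩ NP in Part 1
    neighbors=neighbors,
    needs_T=lambda c: c == "P_hyp",           # assumed auditor decision
)
print(sorted(result))
```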
Figure 5 illustrates the processing of AUX(EMD R). Only some of the concepts of AUX(EMD R), V(AUX(EMD R)), and V2(AUX(EMD R)) are shown. For a concept of AUX(EMD R), we include the tier number of the envelope of EMD ∩ NP in which it appeared in parentheses. For example, the concept Mouse Pancreatic Disorder is written as “Mouse Pancreatic Disorder (3)” since it appeared in V3(EMD ∩ NP). Let us note that the concepts of an auxiliary extent AUX(T) may not be mutually related hierarchically since they are discovered via the Part 1 processing of other RSTs.
Figure 5.
Processing of AUX(EMD R).
Results
As a demonstration, we applied our auditing methodology to the RSTs pertaining to three STs in the UMLS 2007AC: EMD, which is defined as “representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment,” EEH, defined as “change in the natural environment that is a result of the activities of human beings,” and GRA with the definition of “an activity carried out by officially constituted governments, or an activity related to the creation or enforcement of the rules or regulations governing some field of endeavor.” Two RSTs are derived from EMD: EMD R and EMD ∩ NP. Three are derived from EEH: EEH R, EEH ∩ Hazardous and Poisonous Substance (HPS), and EEH ∩ Substance. Two RSTs are derived from GRA: GRA R and GRA ∩ Intellectual Product (IP). For better accuracy, we started with the version of the RSTs obtained via an audit in Chen et al. 2008. 10
Part 1 Processing
1) Refined Semantic Type EMD ∩ NP
There are 33 concepts in E(EMD ∩ NP), shown as yellow boxes in Figure 6. The arrows stand for PAR/CHD relationships among concepts. The envelope V(EMD ∩ NP) also contains 33 concepts, shown as white boxes. Examples of concepts in the envelope include Animal Model (parent of Animal Cancer Model); Mouse Papilloma (parent of Mouse Choroid Plexus Papilloma); and Avian Leukosis, Epstein–Barr virus Infections, and Marek Disease (all three being children of Tumor Virus Infections).
Figure 6.
E(EMD ∩ NP) and V(EMD ∩ NP).
The auditing began with an examination of the 33 concepts in the envelope V(EMD ∩ NP) (see Table 1). The auditing was performed by two of the authors (YC, JX) who have training in medicine. We found nine concepts, shaded in Table 1, which should be assigned the RST EMD ∩ NP. Thus, they were eventually moved into E(EMD ∩ NP).
Table 1.
Contents of V(EMD ∩ NP) with Shaded Concepts Reassigned EMD ∩ NP (EMD)
EMD = Experimental Model of Disease; NP = Neoplastic Process.
Due to the corrections of those nine concepts in V(EMD ∩ NP), the auditing process proceeded on to the second-tier envelope V2(EMD ∩ NP) consisting of all 44 parents and children of the above nine concepts. Among these 44 concepts (see Table 2), we found 26 concepts (shaded in Table 2) that should be assigned RST EMD ∩ NP. Note that it is straightforward for the reader to verify these reassignments for most of the concepts shaded in Table 2. For the EMD assignment, one can look for “mouse” or “experimental” with the description of a disease or syndrome. For the NP assignment, one can look for one of the keywords that indicates a neoplastic process, e.g., “carcinoma” or “papilloma.”
Table 2.
Contents of V2(EMD ∩ NP) with Shaded Concepts Reassigned EMD ∩ NP
EMD = Experimental Model of Disease; NP = Neoplastic Process.
Due to the discovery of missing assignments, this process continued up to the twelfth-tier envelope V12(EMD ∩ NP), which turned out to be empty. Table 3 summarizes the results of the processing with respect to each of the envelopes. Included are the size of the envelope (i.e., number of concepts audited), the number of concepts found missing the assignment of EMD ∩ NP and hence in error, the “hit rate” (i.e., percentage of errors found among concepts examined), and the new cardinality of the extent E(EMD ∩ NP) at the end of the scan of the envelope (listed in the column headed E(EMD ∩ NP)). For example, V3(EMD ∩ NP) contained 79 concepts, of which 78 (99%) were found to be missing the assignment of EMD ∩ NP. After processing this envelope, the extent E(EMD ∩ NP) expanded to include 146 concepts. Altogether, 1,012 concepts were scanned. Of these, 915 (90%) were found to need an EMD ∩ NP assignment. The cardinality of E(EMD ∩ NP) increased from an original 33 to 948.
Table 3.
Results of Processing Envelopes of EMD ∩ NP
Envelope | Number of Concepts | Number Added to E(EMD ∩ NP) | Hit Rate (%) | E(EMD ∩ NP) |
---|---|---|---|---|
V | 33 | 9 | 27 | 42 |
V2 | 44 | 26 | 59 | 68 |
V3 | 79 | 78 | 99 | 146 |
V4 | 212 | 201 | 95 | 347 |
V5 | 214 | 204 | 95 | 551 |
V6 | 137 | 135 | 99 | 686 |
V7 | 145 | 119 | 83 | 805 |
V8 | 97 | 92 | 95 | 897 |
V9 | 32 | 32 | 97 | 929 |
V10 | 17 | 17 | 100 | 946 |
V11 | 2 | 2 | 100 | 948 |
V12 | — | — | — | 948 |
Total: | 1,012 | 915 | 90 | 948 |
EMD = Experimental Model of Disease; NP = Neoplastic Process.
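The hit rate and the running cardinality of the extent follow from the first two numeric columns of Table 3. The short sketch below recomputes them; the function name `summarize` is hypothetical, and the per-tier rates are left unrounded so that they can be compared against the published column.

```python
# (envelope, number of concepts audited, number added to E(EMD ∩ NP)),
# copied from Table 3; the empty envelope V12 is omitted.
rows = [
    ("V", 33, 9), ("V2", 44, 26), ("V3", 79, 78), ("V4", 212, 201),
    ("V5", 214, 204), ("V6", 137, 135), ("V7", 145, 119), ("V8", 97, 92),
    ("V9", 32, 32), ("V10", 17, 17), ("V11", 2, 2),
]


def summarize(rows, initial_extent):
    """Return (envelope, hit rate in percent, extent size after the tier)."""
    out = []
    size = initial_extent
    for name, audited, added in rows:
        size += added
        out.append((name, 100.0 * added / audited, size))
    return out


summary = summarize(rows, initial_extent=33)
total_audited = sum(audited for _, audited, _ in rows)
total_added = sum(added for _, _, added in rows)
overall_rate = 100.0 * total_added / total_audited

for name, rate, size in summary:
    print(f"{name:>3}  hit rate {rate:5.1f}%  extent size {size}")
print(f"total audited {total_audited}, added {total_added}, "
      f"overall hit rate {overall_rate:.1f}%")
```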
Among the 915 erroneously assigned concepts, 911 were from the Experimental Organism Diagnosis hierarchy of the NCI thesaurus. 30 The other four, Avian Leukosis; Marek Disease; Carcinoma, Brown-Pearce; and Pulmonary Adenomatosis, Ovine, were from CRISP (2 concepts), MeSH (4 concepts), NDFRT (3 concepts), and/or SNOMED (3 concepts). Some concepts were in several of these sources. In contrast, only seven of the original 33 concepts were from the NCI thesaurus.
Figure 7 illustrates the progression of the auditing process leading to some of the corrections that were found in the portion of the hierarchy rooted at Mouse Neoplasms. As can be seen, these corrections originated from the two concepts Melanoma, Experimental and Mouse Choroid Plexus Carcinoma (white boxes) in the original E(EMD ∩ NP). In particular, let us consider the correction of Papillary Serous Carcinoma of the Mouse Endometrium, which lies at the end of one of the longest paths (starting at a concept in the current extent) leading to a correction. (The path is colored in yellow in Figure 7.) At the outset, Mouse Neoplasms entered the envelope V(EMD ∩ NP) due to being the parent of Melanoma, Experimental. After review, Mouse Neoplasms was deemed to warrant the assignment of EMD ∩ NP and was moved into the extent. Because of this assignment correction, its child Mouse Neoplasms by Location entered the second-tier envelope V2(EMD ∩ NP) and was audited. It, too, ended up in the extent. Continued processing in this manner eventually found Papillary Serous Carcinoma of the Mouse Endometrium missing the assignment of EMD ∩ NP during the scan of V11(EMD ∩ NP), the last envelope in which there were concepts to audit. Note the path first moved upward across a PAR/CHD relationship before moving downward along a sequence of such relationships.
Figure 7.
Progression of the auditing process.
The up and down movement of the auditing is further illustrated in Figure 7 beginning with Mouse Choroid Plexus Carcinoma. Mouse Choroid Plexus Tumors (in pink) is its parent. Thus, it initially entered V(EMD ∩ NP) and then was moved to E(EMD ∩ NP) after review. The change of its type assignment caused its parent Mouse Tumors of Neuroepithelial Tissue (also in pink) to become a part of V2(EMD ∩ NP), allowing it to be reviewed and corrected. The correction of Mouse Tumors of Neuroepithelial Tissue then progressed to its parent Mouse Nervous System Neoplasms and its children Mouse Glial Tumors of Uncertain Origin, Mouse Ependymal Tumors, etc. The auditing then continued to its grandparents and grandchildren and further generations of ancestors and descendants. (Note that at that stage the parent Mouse Neoplasms by Location was already also assigned EMD as part of the yellow path in Figure 7, due to the concentric progression of the methodology.)
2) EMD R
There were 31 concepts in E(EMD R). The envelope V(EMD R) contained 30. Three concepts were found to be missing the assignment of EMD R. The first of these was Animal Model, which is the parent of Animal Disease Model and was erroneously assigned Animal. The other two, sham rage and diencephalic hyperactivity, are both children of diencephalic brain model and were erroneously assigned Organ or Tissue Function. According to their definitions, they should be assigned EMD R. The auditing in this case did not discover any errors beyond the first-tier envelope: seven concepts were added to V2(EMD R), but no errors in their ST assignments were found.
3) EEH R
The ST EEH was divided into three RSTs: EEH R, EEH ∩ Substance, and EEH ∩ HPS. One concept, Biodegradation, assigned EEH ∩ Natural Phenomenon or Process, among the 66 in V(EEH R) was found to warrant inclusion in E(EEH R). This was determined from its definition, which states: “Elimination of ENVIRONMENTAL POLLUTANTS; PESTICIDES and other waste using living organisms, usually involving intervention of environmental or sanitation engineers (MeSH).” The auditing did not proceed further since Biodegradation does not have any child, and its parent is in E(EEH) already.
4) EEH ∩ Hazardous and Poisonous Substance (HPS)
Among the 25 concepts in V(EEH ∩ HPS), three, toxic industrial waste, Noxious fumes, and Environmental Tobacco Smoke, were moved to E(EEH ∩ HPS) because of type mis-assignments. These corrections caused four concepts, Physical force, Tobacco smoke, Specific Occupational Equipment and Hazards, and Poisons, to enter V2(EEH ∩ HPS) for processing. Only one of the four, Tobacco smoke, was found missing the assignment EEH ∩ HPS.
5) EEH ∩ Substance
In E(EEH ∩ Substance), there were only three concepts, garbage, Sewage, and industry waste. Eight concepts formed the envelope, and no erroneous ST assignments were identified among these.
6) GRA R
The ST GRA consisted of two RSTs: GRA R and GRA ∩ Intellectual Product (IP). There were 520 concepts in E(GRA R). The envelope V(GRA R) contained 314 concepts. Eighteen of them were moved to E(GRA R) due to erroneous or inconsistent ST assignments. An example is Medicare, which is a child of Government Program and was assigned Regulation or Law. This concept is defined as “Government health care program for the aged …” and is similar to its sibling Medicaid, which was assigned GRA. Therefore, Medicare should have the GRA assignment instead of Regulation or Law. As other examples, six children of [X]Legal intervention (which itself is assigned GRA), e.g., [X]Legal intervention involving explosives, were all reassigned GRA. The correction of these eighteen concepts caused two more concepts out of the 42 in V2(GRA R), namely, Marijuana Legalization and [X]War operations, unspecified, to enter E(GRA R). In total, the ST assignments of 20 concepts were changed to GRA (see Appendix A, available as an online data supplement at www.jamia.org).
7) GRA ∩ IP
In E(GRA ∩ IP), there were 22 concepts. Those concepts typically describe a policy (the IP assignment) of a governmental or regulatory activity or program (the GRA assignment), e.g., foreign policy and energy policy. Among the nine concepts in V(GRA ∩ IP), two, AOD public policy strategy and Public Policy, entered E(GRA ∩ IP). The processing of E(GRA ∩ IP) expanded through four tiers of envelopes and 31 concepts, and resulted in the identification of ten concepts that were missing the GRA ∩ IP assignment (see ▶).
Figure 8.
Processing of E(GRA ∩ IP) by envelope tier.
Part 2 Processing
1) Semantic Type EMD
The auxiliary extent AUX(EMD R) contains 20 concepts (see ▶). As it happened, all of these were identified while originally processing EMD ∩ NP during Part 1 of our methodology. All are concepts representing experimental diseases in animals and had an erroneous assignment of Disease or Syndrome (DS), when they should have had EMD. For each concept in ▶, we list the related concept and its envelope tier with respect to EMD ∩ NP that resulted in its being audited. Note that all these 20 concepts are parents of the related concepts which describe similar neoplastic diseases or syndromes. In general, the search for concepts of AUX(EMD R) can be limited to concepts added to the envelope V(EMD ∩ NP) as parents, reducing the needed effort.
Table 4. Concepts in AUX(EMD R) and Related Concepts with Respect to EMD ∩ NP
Concept | Related Concept | Envelope Tier |
---|---|---|
Animal Model | Animal Cancer Model | 1 |
Mouse pancreatic disorder | Mouse pancreatic Neoplasm | 3 |
Mouse skeletal system disorder | Mouse skeletal system neoplasms | 4 |
Mouse hematologic disorder | Mouse hematologic Neoplasms and related disorders | 4 |
Mouse head and Neck disorder | Mouse head and Neck Neoplasms | 4 |
Epithelial proliferative lesions of the mouse pulmonary system | Tumors of the mouse pulmonary system | 4 |
Mouse nervous system disorder | Mouse nervous system neoplasms | 4 |
Mouse Cardiovascular system disorder | Mouse Cardiovascular system Neoplasms | 4 |
Mouse Connective and soft tissue disorder | Mouse Connective and soft tissue Neoplasms | 4 |
Mouse endocrine gland system disorder | Mouse endocrine gland system neoplasms | 4 |
Hyperplasia of the mouse pulmonary system | Epithelial proliferative lesions of the mouse pulmonary system | 5 |
Mouse reproductive system disorder | Mouse reproductive system neoplasms | 5 |
Pulmonary proliferative lesions of the Mouse | Epithelial proliferative lesions of the mouse pulmonary system | 5 |
Mouse Mammary gland disorder | Neoplasms of the Mouse Mammary gland | 5 |
High risk proliferative disease of the Mouse prostate gland of unknown or premalignant potential | Neoplasms of the Mouse prostate gland | 5 |
Mouse skin disorder | Neoplasms of the Mouse skin | 5 |
Mouse liver disorder | Mouse hepatic system neoplasm | 5 |
Mouse urinary tract disorder | Mouse urinary tract Neoplasm | 5 |
Mouse prostate disorder | Neoplasms of the Mouse prostate gland | 5 |
Preinvasive lesions of the mouse pulmonary system | Epithelial proliferative lesions of the mouse pulmonary system | 5 |
AUX(EMD R) = auxiliary extent of EMD R; EMD = Experimental Model of Disease; NP = Neoplastic Process.
The processing of AUX(EMD R) proceeded through five tiers of envelopes comprising a total of 176 concepts. Out of these, 114 were reassigned EMD R. Therefore, an additional 134 (= 20+114) concepts were reassigned EMD R in comparison to only 31 concepts with an original EMD R assignment. For a sample of these, see Appendix B (available as an online data supplement at www.jamia.org). No more concepts were moved to E(EMD ∩ NP) as a result of processing the envelopes of AUX(EMD R).
2) EEH
Four concepts, Engine Exhaust, Coke oven emission, Pesticide Residues, and Air contaminant, were moved to E(EEH ∩ HPS) from AUX(EEH R). The expanding process did not result in any more concepts being moved to E(EEH ∩ HPS). No more concepts were moved to AUX(EEH ∩ HPS) or AUX(EEH ∩ Substance) while processing the other RSTs of EEH. For example, when processing EEH ∩ HPS, no concept requiring the assignment of EEH R or EEH ∩ Substance was discovered.
3) GRA
During the Part 1 processing of GRA R, five concepts, monetary policy, fiscal policy, tax policy, international trade policy, and Family Planning Policy, in V(GRA R) were identified as having ST assignments of IP but were missing the GRA assignment. Because the creation or enforcement of all the policies represented by those concepts involves government or regulation, their ST assignment should be GRA ∩ IP, similar to current concepts in GRA ∩ IP. The auxiliary extent AUX(GRA ∩ IP), therefore, contained those five concepts. While processing the envelopes of AUX(GRA ∩ IP), the auditors identified one more concept, Economic Policies, that should be moved to E(GRA ∩ IP), raising the total number to six. No concepts were moved from the GRA ∩ IP envelopes to E(GRA R) when applying Part 2. The results were reported to the UMLS team. The results regarding EMD were also reported to the NCI thesaurus team.
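The Part 2 bookkeeping described in this subsection, in which a concept corrected while processing one RST is set aside for a different RST, can be sketched as follows. This is an illustrative sketch only: modeling an RST as a frozen set of ST abbreviations, and the names `route_correction`, `extents`, and `aux_extents`, are assumptions made for the example rather than structures taken from the methodology.

```python
from typing import Dict, FrozenSet, Iterable, Set

RST = FrozenSet[str]  # assumption: an RST is modeled by its set of STs


def route_correction(
    concept: str,
    corrected_types: Iterable[str],
    current_rst: RST,
    extents: Dict[RST, Set[str]],
    aux_extents: Dict[RST, Set[str]],
) -> RST:
    """File a corrected concept under the RST matching its corrected ST set."""
    target = frozenset(corrected_types)
    # Same RST as the one being processed: the concept joins its extent (Part 1).
    # Different RST: the concept seeds that RST's auxiliary extent (Part 2).
    bucket = extents if target == current_rst else aux_extents
    bucket.setdefault(target, set()).add(concept)
    return target


GRA_R: RST = frozenset({"GRA"})
GRA_IP: RST = frozenset({"GRA", "IP"})
extents: Dict[RST, Set[str]] = {}
aux_extents: Dict[RST, Set[str]] = {}

# Corrections made while processing the envelope of GRA R (examples from the text).
route_correction("Medicare", {"GRA"}, GRA_R, extents, aux_extents)
route_correction("monetary policy", {"GRA", "IP"}, GRA_R, extents, aux_extents)
route_correction("fiscal policy", {"GRA", "IP"}, GRA_R, extents, aux_extents)

print(sorted(extents[GRA_R]))       # joins E(GRA R)
print(sorted(aux_extents[GRA_IP]))  # seeds AUX(GRA ∩ IP)
```

Once an auxiliary extent has been collected this way, its envelopes can be processed with the same tiered expansion used in Part 1.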
Discussion
Evaluation and Interpretation
A very high hit rate of 90% is observed when Part 1 of our methodology is applied to the RST EMD ∩ NP, as 915 of the 1,012 concepts audited were found to be missing this assignment. Similarly, a high hit rate of 65% is observed when Part 2 of our methodology is applied to EMD R, with 114 of the 176 audited concepts being reassigned EMD R. Among the 915 concepts, 911 are from the Experimental Organism Diagnosis hierarchy of the NCI thesaurus, one of its 22 hierarchies. The current ST assignment for these 911 concepts is only NP. The current assignment of the 114 reassigned EMD R is Disease or Syndrome. The assignments reflect the nature of this source; many concepts in the NCI thesaurus are cancer-related. Experimental Organism Diagnosis is defined as the abnormal conditions of affected Organisms, which are mice and rats in the NCI thesaurus. Therefore, concepts in this hierarchy that represent diseases fit the definition of EMD: “representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment.” That is the reason why most of the audited concepts joined E(EMD ∩ NP), while the others were reassigned just EMD R, covering non-neoplastic diseases.
The case for EMD ∩ NP is also extreme in the sense that more than 96% of the concepts eventually assigned EMD ∩ NP were discovered during auditing versus the 33 concepts originally assigned EMD ∩ NP. The corresponding percentage for EMD R is 82%. This seems to reflect some misconception of either the ST EMD of the Semantic Network or the Experimental Organism Diagnosis hierarchy of the NCI thesaurus, which manifested itself in the process of integrating the NCI thesaurus into the UMLS. Our guess is that the incorrect ST assignments were the result of applying some natural language processing technique that recognized keywords associated with cancer, while ignoring keywords like mouse.
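As a quick check of the first figure, and assuming the percentage denotes the share of the final extent of EMD ∩ NP that was contributed by the audit (915 concepts added to the 33 originally assigned), the arithmetic works out as:

```latex
\[
\frac{915}{915 + 33} = \frac{915}{948} \approx 0.965
\]
```

That is about 96.5%, consistent with "more than 96%".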
The hit rates when Part 1 of the methodology was applied to EMD R, EEH R, EEH ∩ HPS, EEH ∩ Substance, GRA R, and GRA ∩ IP are 10% (= 3/30), 1.5% (= 1/66), 14% (= [3 + 1]/[25 + 4]), 0, 5.6% (= [18 + 2]/[314 + 42]) and 32% (= 10/31), respectively.
In Part 2 of the methodology, we identified 20, four, and five concepts for the auxiliary extents AUX(EMD R), AUX(EEH ∩ HPS), and AUX(GRA ∩ IP), respectively. The hit rates during the expansion of AUX(EMD R), AUX(EEH ∩ HPS), and AUX(GRA ∩ IP) are 64.8% (= 114/176), 0, and 50% (= 1/2), respectively. Note that the hit rate for Part 2 is calculated only with respect to the auxiliary extent's envelopes, similar to the calculation for Part 1 with respect to the original extent's envelopes. It does not count the discovery of the concepts joining an auxiliary extent via the processing of original extents of other RSTs.
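The hit rates quoted above can be recomputed from the reported counts. The short sketch below does so; the counts are copied from the text, while the helper name `hit_rate` and the dictionary layout are illustrative assumptions. AUX(EEH ∩ HPS) is left out because the number of concepts audited in its envelopes is not stated.

```python
from typing import Dict, Tuple

# (errors found, concepts audited), copied from the counts reported in the text.
PART1: Dict[str, Tuple[int, int]] = {
    "EMD ∩ NP": (915, 1012),
    "EMD R": (3, 30),
    "EEH R": (1, 66),
    "EEH ∩ HPS": (3 + 1, 25 + 4),      # first- and second-tier envelopes
    "EEH ∩ Substance": (0, 8),
    "GRA R": (18 + 2, 314 + 42),       # first- and second-tier envelopes
    "GRA ∩ IP": (10, 31),
}
PART2: Dict[str, Tuple[int, int]] = {
    "AUX(EMD R)": (114, 176),
    "AUX(GRA ∩ IP)": (1, 2),
}


def hit_rate(errors: int, audited: int) -> float:
    """Percentage of audited concepts found to be missing the assignment."""
    if audited == 0:
        raise ValueError("no concepts audited")
    return 100.0 * errors / audited


rates = {
    name: round(hit_rate(*counts), 1)
    for name, counts in {**PART1, **PART2}.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Summing errors and audited concepts over all tiers before dividing matches the way the per-RST rates are written in the text, e.g., (3 + 1)/(25 + 4) for EEH ∩ HPS.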
In those cases, the percentage of erroneous concepts among those audited is not high, but it still justifies the effort needed to find the errors. These results cover only a sample of three STs with small extents, and further studies with more STs are needed to draw conclusions about an average hit rate. However, a particular missing type assignment can occur anywhere in the META's vast repository outside the extent of the ST in question, so such errors are difficult to find in general. Against that background, these hit rates can be deemed successful.
Limitations
Experiments with STs having large extents are needed to further examine the efficiency of our methodology. Only parents and children were inserted into envelopes. It is suggested that broader/narrower relationships could also play a useful role. It would be interesting to compare the performance of our methodology using these relationships with its performance using the PAR/CHD relationships. Unlike the PAR/CHD relationships, the broader/narrower relationships are not marked as hierarchical in their source vocabularies; that is only done by the NLM. 20
We dealt with the problem of expanding the extent of a semantic type. This problem is more challenging than the related internal auditing of an extent because the kind of error we are seeking (a concept missing a specific ST assignment) can occur for any of the UMLS's concepts except for those already in the ST's extent. Hence, the potential search space is huge, and the number of concepts for which the error occurs is expected to be very small by comparison. The proverbial “needle in a haystack” is appropriate in this circumstance.
The main challenge we faced was to find limited-sized sets of concepts for which the likelihood of a given missing ST assignment is quite high. Part 1 of the methodology that we presented indeed tackles this by directing the efforts toward the immediate surroundings of the extent of the RST, without venturing too far away unless dictated by a trail of such discovered errors. The expansion can be seen as a continuity along PAR/CHD relationships of concepts that are missing the assignment of the given RST. The case of EMD ∩ NP showed that Part 1 of the methodology was able to find many missing ST assignments when such errors existed.
The challenge of overcoming a discontinuity in the expansion was taken up in Part 2 of the methodology. The case of the RST EMD R demonstrated the viability of the approach.
Conclusions
We presented a two-part auditing methodology pertaining to the assignments of a given UMLS semantic type S. The methodology is geared toward expanding the extent of S to include additional concepts. The methodology is based on the previously defined notion of refined semantic type, which serves to partition the concepts of the META. Since, in general, most of the META's concepts will not need the assignment of S, it is inefficient—and impractical—to randomly review concepts residing outside of S's extent. Instead, our methodology, consisting of two complementary parts, tightly limits the search space to a series of surrounding neighborhoods of the extent, in which mis-assignments are likely to be found. The search space dynamically expands only when mis-assignment errors are discovered. This follows from our auditing experience that where there are errors, more errors are likely to be. In this way, our methodology tends not to overburden the auditor with the processing of unnecessary concepts. The methodology was demonstrated on the extents of three semantic types, Experimental Model of Disease, Environmental Effect of Humans, and Governmental or Regulatory Activity. The results showed that the methodology was able to effectively and efficiently steer the auditor to concepts worth investigating regarding the assignment of a type.
Footnotes
This work was partially supported by the NLM under grant R-01-LM008445-01A2.
References
- 1. Tuttle MS, Sherertz DD, Olson NE, et al. Using META-1, the first version of the UMLS Metathesaurus. In: Proc. Fourteenth Annual SCAMC; 1990. pp. 131-135.
- 2. Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: Representing different views of biomedical concepts. Bull Med Libr Assoc 1993;81(2):217-222.
- 3. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267-D270.
- 4. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Meth Inf Med 1993;32:281-291.
- 5. Campbell KE, Oliver DE, Shortliffe EH. The Unified Medical Language System: Toward a collaborative approach for solving terminologic problems. J Am Med Inform Assoc 1998;5(1):12-16.
- 6. Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: An informatics research collaboration. J Am Med Inform Assoc 1998;5(1):1-11.
- 7. McCray AT, Hole WT. The scope and structure of the first version of the UMLS Semantic Network. In: Proc. Fourteenth Annual SCAMC, Los Alamitos, CA; November 1990. pp. 126-130.
- 8. McCray AT. An upper-level ontology for the biomedical domain. Comp Funct Genomics 2003;4:80-84.
- 9. Chen Y, Perl Y, Geller J, Cimino JJ. Analysis of a study of the users, uses and future agenda of the UMLS. J Am Med Inform Assoc 2007;14(2):221-231.
- 10. Chen Y, Gu H, Perl Y, Geller J, Halper M. Structural group auditing of a UMLS semantic type's extent. J Biomed Inform 2009;42(1):41-52.
- 11. Geller J, Gu H, Perl Y, Halper M. Semantic refinement and error correction in large terminological knowledge bases. Data Knowl Eng 2003;45(1):1-32.
- 12. Gu H, Perl Y, Geller J, et al. Representing the UMLS as an OODB: Modeling issues and advantages. J Am Med Inform Assoc Jan-Feb 2000;7(1):66-80. Selected for reprint in: Haux R, Kulikowski C, editors. Yearbook of Medical Informatics: Digital Libraries and Medicine (International Medical Informatics Association). Stuttgart, Germany: Schattauer; 2001. pp. 271-285.
- 13. Johnsonbaugh R. Discrete Mathematics. 6th edn. Pearson Prentice-Hall; 2005.
- 14. Gu H, Perl Y, Elhanan G, et al. Auditing concept categorizations in the UMLS. Artif Intell Med May 2004;31(1):29-44.
- 15. Gu H, Hripcsak G, Chen Y, et al. Evaluation of a UMLS auditing process of semantic type assignments. In: Teich JM, Suermondt J, Hripcsak G, editors. Proc. AMIA Annual Symposium, Chicago, IL; November 2007. pp. 294-298.
- 16. Min H, Perl Y, Chen Y, et al. Auditing as part of the terminology design life cycle. J Am Med Inform Assoc Nov-Dec 2006;13(6):676-690.
- 17. Cimino JJ. Auditing the Unified Medical Language System with semantic methods. J Am Med Inform Assoc 1998;5:41-51.
- 18. Cimino JJ. Battling Scylla and Charybdis: The search for redundancy and ambiguity in the 2001 UMLS Metathesaurus. In: Bakken S, editor. Proc. AMIA Annual Symposium; 2001. pp. 120-124.
- 19. Bodenreider O. Strength in numbers: Exploring redundancy in hierarchical relations across biomedical terminologies. Proc. AMIA Annual Symposium; 2003. pp. 101-105.
- 20. Bodenreider O. Circular hierarchical relationships in the UMLS: Etiology, diagnosis, treatment, complications and prevention. Proc. AMIA Symposium; 2001. pp. 57-61.
- 21. Mougin F, Bodenreider O. Approaches to eliminating cycles in the UMLS Metathesaurus: Naïve vs. formal. Proc. AMIA Annual Symposium; 2005. pp. 550-554.
- 22. Hole WT, Srinivasan S. Discovering missed synonymy in a large concept-oriented metathesaurus. In: Overhage JM, editor. Proc. AMIA Annual Symposium, Los Angeles, CA; November 2000. pp. 354-358.
- 23. Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. In: Proc. 2002 AMIA Annual Symposium, San Antonio, TX; November 2002. pp. 612-616.
- 24. Bodenreider O. An object-oriented model for representing semantic locality in the UMLS. Proc. Medinfo 2001;10(1):161-165.
- 25. Schulze-Kremer S, Smith B, Kumar A. Revising the UMLS Semantic Network. Proc. Medinfo 2004, San Francisco, CA; September 2004. p. 1700.
- 26. Zhang L, Perl Y, Geller J, Halper M, Cimino JJ. An enriched UMLS Semantic Network with a multiple inheritance hierarchy. J Am Med Inform Assoc 2004;11(3):195-206.
- 27. Zhang L, Halper M, Perl Y, Geller J, Cimino JJ. Relationship structures and semantic type assignments of the UMLS enriched semantic network. J Am Med Inform Assoc July 2005;12(6):657-666.
- 28. Chen Y, Gu H, Perl Y, Geller J. Structural group-based auditing of missing hierarchical relationships in UMLS. J Biomed Inform 2009;42(3):452-467.
- 29. Morrey CP, Geller J, Halper M, Perl Y. The Neighborhood Auditing Tool: A hybrid interface for auditing the UMLS. J Biomed Inform 2009;42(3):468-489.
- 30. Sioutos N, de Coronado S, Haber MW, et al. NCI thesaurus: A semantic model integrating cancer-related clinical and molecular information. J Biomed Inform February 2007;40(1):30-43.