Abstract
Objective: The authors evaluated the use of the Unified Medical Language System (UMLS) as a medical knowledge source for the representation of medical procedures in the MAOUSSC system.
Design: MAOUSSC, a multiaxial coding system, was used for the representation of 1500 procedures from 15 clinical specialties, using UMLS concepts (augmented by full sources for three new vocabularies being added to the UMLS) and relationships whenever possible. Evaluation criteria for the UMLS included (1) completeness of representation of concepts and of inter-concept relationships, (2) consistency in the categorization of both concepts and inter-concept relationships, and (3) usability, including adaptability of the UMLS to a foreign language (French), its suitability to a geographic region with different medical practices than the USA, and issues relative to the annual update changes in the test vocabularies.
Results: During the MAOUSSC trial, the number of missing concepts or relationships identified in the augmented UMLS sources was deemed to be inconsequential relative to overall project goals. “Missing” UMLS inter-concept relationships were identified, although they were small in number. Some inconsistencies in the UMLS were noted, especially in the area of hierarchic relationships.
Conclusion: After UMLS was used for five years as a knowledge source for representing 1500 complex medical procedures in MAOUSSC, its value is considered significant. Future editions of the UMLS are expected to improve representation of inter-concept relationships and global consistency.
The Unified Medical Language System (UMLS) has been developed and enhanced by the National Library of Medicine (NLM) since 1990 and provides a representation of knowledge in the biomedical domain.1 Since this knowledge representation has not been designed for a particular application, the UMLS is currently used in a wide variety of biomedical applications developed by the NLM (including Internet Grateful Med2,* and the UMLS Knowledge Source Server3,4,†) and by other institutions.5,6,7,8,9,10,11,12
In a given context (e.g., electronic patient records), the major requirements for a knowledge source to be used as medical background knowledge are:
Completeness. The knowledge source is expected to provide a good coverage of the medical specialties involved. The basic concepts needed in this context must be found, as must the relationships between concepts.
Consistency. The categorization of the concepts is expected to be coherent whatever the specialty, and the relationship between concepts should allow one to navigate the knowledge consistently.
Usability. The data must be integrated in a local database or requested from a remote server. Furthermore, the data are expected to be usable by a natural language processing system.
The UMLS was selected as the background knowledge for the representation of medical procedures based on a semantic approach (MAOUSSC‡ model),13 because it seemed to fulfill these requirements and was available at the time the experiment started, in 1992. Compared with the UMLS, which is based on a bottom-up approach, GALEN also attempts to provide a medical terminology but from scratch, with a top-down method, and the operational phase (GALEN-IN-USE) has been available only since 1996.14 MAOUSSC has been used successfully in France since 1994, and some 1500 medical procedures have been described, covering 15 medical specialties.
Background
Using the context of the representation of medical procedures, we evaluated the use of the UMLS§ as a medical knowledge source, applying the criteria of completeness, consistency, and usability. Earlier papers have presented MAOUSSC as a conceptual model for the representation of medical procedures based on a semantic approach,16 its uses for the analysis of medical procedure terms and for the analysis of inpatient discharge summaries,17 and its current implementation as a Web-based application.18 Only the features involved in the evaluation of the UMLS are summarized here.
A Multiaxial Model
Eight semantic categories (or axes) can be used for the description of a procedure. Four of these are mandatory: nature (what action is performed, e.g., a resection), topography (to which part of the body the action is applied, e.g., the carotid artery), instrumentation (what equipment is used to perform the action, e.g., a resectoscope), and approach (how an anatomic site is reached, e.g., laparoscopic approach). The four other axes may or may not be filled out, depending on the kind of action performed: additional topography (required for the description of a shunt), matter/device (used to describe what material, organic or not, is moved, removed, or implanted, e.g., a prosthesis), body process (instantiated only to describe the physiologic process involved, e.g., intracranial pressure), and disease. Usually, the disease axis must not be used, because pathologic conditions do not affect the description of the procedure. When the original term is vague, however (e.g., correction of tetralogy of Fallot), its minimal description as treatment of a disease allows the user to keep the imprecision.
Compositional Rules
MAOUSSC has compositional rules to combine elementary procedures together in order to describe complex procedures. Each elementary procedure can use no more than one value in each semantic axis. Furthermore, slight differences between procedures are often described for billing purposes (e.g., emergency) but do not modify the basic description of the procedure. These modifiers are simply attached to the procedure but are not reported in a semantic axis.
A Controlled Vocabulary
The vocabulary used to instantiate the MAOUSSC axes comes from the UMLS Metathesaurus. Concepts that do not exist in the UMLS can be added. Since the UMLS contains only 18,000 terms from the French translation of the MeSH, existing concepts often need to be translated. While this translation makes the work of end-users more comfortable, it does not affect the representation.
Two kinds of restrictions are applied to the UMLS vocabulary: (1) Since a procedure belongs to a medical specialty (e.g., prostatectomy belongs to urology), only UMLS concepts in the appropriate domain can be used to describe the procedure (see Selection of Terms Related to a Particular Medical Domain, below). (2) Each semantic axis of MAOUSSC is tied to one or more semantic types in the UMLS, and only concepts categorized with these semantic types can be used as values for the axis. The representation of a transurethral prostatectomy is shown as an example in Table 1.
Table 1.
MAOUSSC axis | Axis status | Concept | Allowed Semantic Types* |
---|---|---|---|
Nature | Mandatory | Excision (C0015252) | (Closed list of concepts) |
Topography | Mandatory | Prostate (C0033572) | <u>Body Part, Organ, Organ Comp.</u>; Embryonic Structure; Body Location or Region; Body Space or Junction |
Instrumentation | Mandatory | Resectoscope (C0182966) | <u>Medical Device</u> |
Approach | Mandatory | Transurethral Approach (C0205497) | <u>Approach</u>† |
Additional topography | Optional | None | Body Location or Region |
Matter/Device | Mandatory | Prostate (C0033572) | Body Substance; Biomedical or Dental Material; Embryonic Structure; Acquired Abnormality; <u>Body Part, Organ, Organ Comp.</u>; Tissue; Congenital Abnormality; Implantable Device† |
Body process | Not relevant | None | |
Disease | Not relevant | None | |

* The semantic type of the concept is underlined.
† Not a UMLS semantic type.
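The two restrictions and the one-value-per-axis rule can be illustrated with a minimal Python sketch. It is not the MAOUSSC implementation: the names `ALLOWED_TYPES`, `NATURE_CONCEPTS`, `Concept`, `is_allowed`, and `validate` are hypothetical, the closed list for Nature contains only the single entry visible in Table 1, and the semantic types attached to each example concept are assumptions made for illustration.

```python
from dataclasses import dataclass

MANDATORY_AXES = ("Nature", "Topography", "Instrumentation", "Approach")

# Allowed semantic types per axis, transcribed from Table 1.
# "Approach" and "Implantable Device" are MAOUSSC additions, not UMLS semantic types.
ALLOWED_TYPES = {
    "Topography": {
        "Body Part, Organ, Organ Comp.",
        "Embryonic Structure",
        "Body Location or Region",
        "Body Space or Junction",
    },
    "Instrumentation": {"Medical Device"},
    "Approach": {"Approach"},
    "Additional topography": {"Body Location or Region"},
    "Matter/Device": {
        "Body Substance",
        "Biomedical or Dental Material",
        "Embryonic Structure",
        "Acquired Abnormality",
        "Body Part, Organ, Organ Comp.",
        "Tissue",
        "Congenital Abnormality",
        "Implantable Device",
    },
}

# Nature takes its values from a closed list of concepts. Only the entry
# shown in Table 1 is listed here; the full list is not given in the text.
NATURE_CONCEPTS = {"C0015252"}  # Excision


@dataclass(frozen=True)
class Concept:
    cui: str
    name: str
    semantic_types: frozenset


def is_allowed(axis: str, concept: Concept) -> bool:
    """True if the concept may be used as a value for the given axis."""
    if axis == "Nature":
        return concept.cui in NATURE_CONCEPTS
    return bool(ALLOWED_TYPES.get(axis, set()) & concept.semantic_types)


def validate(procedure: dict) -> list:
    """procedure maps an axis name to one Concept (at most one value per axis)."""
    errors = [f"missing mandatory axis: {a}" for a in MANDATORY_AXES if a not in procedure]
    errors += [
        f"{c.name} is not allowed on axis {a}"
        for a, c in procedure.items()
        if not is_allowed(a, c)
    ]
    return errors


# Transurethral prostatectomy as in Table 1 (semantic types are assumed).
excision = Concept("C0015252", "Excision", frozenset())
prostate = Concept("C0033572", "Prostate", frozenset({"Body Part, Organ, Organ Comp."}))
resectoscope = Concept("C0182966", "Resectoscope", frozenset({"Medical Device"}))
transurethral = Concept("C0205497", "Transurethral Approach", frozenset({"Approach"}))

prostatectomy = {
    "Nature": excision,
    "Topography": prostate,
    "Instrumentation": resectoscope,
    "Approach": transurethral,
    "Matter/Device": prostate,
}
```

Using a dictionary keyed by axis name makes the rule of one value per axis hold by construction; `validate(prostatectomy)` returns an empty list, while a description lacking mandatory axes returns one message per missing axis.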
Completeness
The UMLS has been examined in several studies to determine its coverage of particular biomedical domains (e.g., clinical nursing language,19,20 ambulatory family practice clinical terms,21 clinical content of patient records,22 general medicine clinical diagnoses,23 genetic disorders,24 hypertension,25 clinical radiology terms,26 and clinical laboratory procedures27). The largest and most recent study, the NLM/AHCPR large-scale vocabulary test (LSVT), was designed by the NLM and the Agency for Health Care Policy and Research (AHCPR) to determine the ability of the test vocabularies to serve as a source of controlled vocabulary for health data systems and applications.28 The test vocabularies (comprising 735,000 terms and more than 330,000 concepts) include not only the 1996 version of the UMLS Metathesaurus but also three other sources planned to be added to the UMLS—the portions of SNOMED International29 not yet included, the Read Clinical Classification,30 and the LOINC vocabulary.31
These studies generally examined whether the concepts of a biomedical domain could be found in the UMLS Metathesaurus using mapping techniques followed by human review. The completeness of the available semantic categories and the relationships between concepts should be taken into account as well.
Concepts
The concepts used in MAOUSSC are either UMLS concepts or user-added concepts (UACs). The UACs were submitted to the LSVT in order to find out whether their creation was justified. This section presents a preliminary analysis of the UACs.
While two medical procedure nomenclatures are part of the UMLS Metathesaurus (CPT32 and the procedures chapter of the ICD9CM33), the basic concepts needed in MAOUSSC are expected to come from vocabularies such as SNOMED International (anatomic sites)29 or the Universal Medical Device (UMD) nomenclature system34 because of the level of granularity required for the description of the procedures. Users are allowed to create a new concept if the concept they need to describe a procedure does not exist in the UMLS vocabulary. The new concept must be given at least one semantic type by the user before being added in MAOUSSC.
As of October 1996, 479 UACs had been created, representing 1500 procedures from 15 specialties. The UACs were translated into English and submitted to the LSVT application in December 1996. Although the decision of the official reviewers has not yet been disclosed, the results of our preliminary review show that 69% of the UACs were found in the test vocabularies; that is, only 148 concepts are actually missing from the test vocabularies (▶).
Fifty percent of UACs could be found in the 1996 UMLS Metathesaurus because (1) most of these concepts were created against an older version of the Metathesaurus (i.e., the concept was added in a later version and has still not been replaced by its corresponding UMLS concept in MAOUSSC), or (2) users could not find a concept under English terms (i.e., the concept did exist, but no French term was available). Since they have been found in the planned additions (especially in the Read Thesaurus), 67 more UACs could be replaced by their corresponding UMLS concepts once the integration of these new vocabularies has been completed.
The following concepts are not currently part of the Metathesaurus but will be added in future editions: mitral annulus (Read), balloon catheter (Read), and mucosal flap (Read, SNOMED International). UACs related to existing terms are, for example, “recovery period,” which is broader than “anesthesia recovery period”; “percutaneous biliary drainage tube,” which is more specific than “drainage tube”; and “kidney pedicle,” which can be associated with “distortion of kidney pedicle.” Finally, some UACs could not be found or related to a concept from the LSVT vocabulary: No approximate match was found for “medial angle of the eye”; “terminal” as in “terminal anastomosis” is not equivalent in meaning to “terminal” used as a synonym of “final stage” in the UMLS; and “extrafascicular,” “ipsilateral,” and “ohmmeter” could not be found.
Sometimes a term could not be found but could be described as a combination of existing concepts. For example, the concept “veins of the digestive tract” does not exist in the UMLS, but both “veins” and “digestive tract” are UMLS concepts. Since some modifiers (e.g., acute) are UMLS concepts, many new concepts can be built by combining an existing concept with a modifier. Combination rules could be an interesting alternative to the physical creation of new concepts (e.g., “acute” + “hematoma” = “acute hematoma”).
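As a rough illustration of such a combination rule, the sketch below represents a combined term as a tuple of existing concepts instead of creating a new concept. The function name `compose` and the tiny `UMLS_CONCEPTS` set are hypothetical stand-ins for a lookup against the Metathesaurus.

```python
from typing import Optional, Tuple

# Tiny illustrative subset standing in for a Metathesaurus lookup.
UMLS_CONCEPTS = {"acute", "hematoma", "veins", "digestive tract"}


def compose(*parts: str) -> Optional[Tuple[str, ...]]:
    """Return the parts as a combined expression if every part is an
    existing concept; otherwise return None."""
    if parts and all(p in UMLS_CONCEPTS for p in parts):
        return tuple(parts)
    return None
```

With this subset, `compose("acute", "hematoma")` yields a combined expression, whereas `compose("acute", "ohmmeter")` yields `None` because one part has no existing concept and would still require a UAC.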
Semantic Categories
A correspondence has been established between the semantic axes of MAOUSSC and the semantic types of the UMLS semantic network. For example, concepts of these UMLS semantic types—“Body Part, Organ, Organ Component,” “Body Location or Region,” “Body Space or Junction,” and “Embryonic Structure”—can be used as instances of the Topography axis. As shown in ▶, the correspondence is good for all axes except Approach, for which no semantic type is identified in the semantic network.
Although more than 500 procedure terms from the UMLS contain an explicit reference to the approach (e.g., “Repair of diaphragmatic hernia by transpleural approach”), only 24 approaches currently exist in the UMLS Metathesaurus (e.g., “Vaginal Approach”). The parent concept of all these approaches is “Anatomic modifiers of procedural approaches,” and their semantic type is “Spatial concept.” Other concepts with this semantic type are administration routes (e.g., “Intracutaneous route”) or receptor sites (e.g., “Angiotensin receptor site”). The Read Classification will provide a more complete list of approaches (about 70 new concepts), so that adding the approach category to the Semantic Network could be justified.
Like Volot et al.,6 we created some subtypes of existing categories in order to have a better correspondence between the MAOUSSC axes and the UMLS categories. For example, we added “Implantable Devices” as a subtype of the “Medical Devices,” so that only implantable devices can be used for the description of the implanted material (Matter/Device axis).
Relationships
In MAOUSSC, the relationships between concepts are used in two processes: the selection of terms related to a particular medical domain, and the interactive navigation of the users in the UMLS knowledge base.
Selection of Terms Related to a Particular Medical Domain
An algorithm was designed to automatically extract the UMLS concepts related to a particular domain in order to allow only relevant concepts to be used for the instantiation of the MAOUSSC axes with regard to the domain to which the procedure belongs. Starting with a few terms related to the domain, the algorithm selects recursively all their descendants and then the neighbors (ancestors and their related terms) of the descendants.13,35 For example, starting from “Cardiovascular System,” the algorithm gets “Heart” and “Coronary Vessels” (descendants) and then “Anatomy” (an ancestor). Finally, “Incision of heart,” “Coronary Circulation,” and other concepts related to those in the previous subset are added.
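The selection process can be sketched as follows. This is one possible reading of the description, not the published algorithm: the function `select_domain` is hypothetical, neighbors are taken to be the direct parents and the related terms of the seeds and their descendants, and the small relationship tables (in particular which concept each related term is attached to) are assumptions built from the example above.

```python
def select_domain(seeds, children, parents, related):
    """Return the concepts selected for a domain.

    children, parents, related: dicts mapping a concept to a list of concepts.
    """
    # Step 1: the seeds and all their descendants (iterative traversal,
    # which also tolerates cycles in the hierarchy).
    selected = set()
    stack = list(seeds)
    while stack:
        concept = stack.pop()
        if concept in selected:
            continue
        selected.add(concept)
        stack.extend(children.get(concept, ()))

    # Step 2: neighbors of the selected concepts, i.e. their direct
    # ancestors and their related terms (assumed interpretation).
    neighbors = set()
    for concept in selected:
        neighbors.update(parents.get(concept, ()))
        neighbors.update(related.get(concept, ()))

    return selected | neighbors


# Illustrative data built from the example in the text (assumed links).
children = {"Cardiovascular System": ["Heart", "Coronary Vessels"]}
parents = {
    "Cardiovascular System": ["Anatomy"],
    "Heart": ["Cardiovascular System"],
    "Coronary Vessels": ["Cardiovascular System"],
}
related = {
    "Heart": ["Incision of heart"],
    "Coronary Vessels": ["Coronary Circulation"],
}

cardio_subset = select_domain(["Cardiovascular System"], children, parents, related)
```

In this sketch a concept with no recorded relationship to any selected concept is never reached, which is the behavior the discussion of silence relies on.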
Since the algorithm makes heavy use of hierarchic and other relationships between concepts, the lack of relationship will result in silence in the subset of concepts captured at the end of the process. Silence is estimated by comparing the concepts in the subset to the concepts needed for the representation of the procedures from the domain. The level of silence is generally low. For example, only eight of the 200 anatomic concepts needed for the representation of the procedures in digestive surgery were not captured by the algorithm.
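The silence estimate described above amounts to the fraction of needed concepts that the subset fails to capture. A minimal sketch, with a hypothetical function name and placeholder concept identifiers standing in for the 200 anatomic concepts of digestive surgery:

```python
def silence(needed: set, captured: set) -> float:
    """Fraction of the needed concepts that are absent from the captured subset."""
    if not needed:
        raise ValueError("the set of needed concepts must not be empty")
    return len(needed - captured) / len(needed)


# Placeholder identifiers: 200 needed concepts, 8 of which were not captured.
needed = {f"concept_{i}" for i in range(200)}
captured = {f"concept_{i}" for i in range(8, 200)}
digestive_silence = silence(needed, captured)  # 8 / 200
```

For the digestive surgery figures quoted above, the ratio is 8/200, or 4%.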
Silence related to a lack of relationships between concepts has been observed especially with medical devices. In the UMLS, most of the medical devices come from the UMD nomenclature and are often not related to the procedures in which these devices are used. For example, the concept “resectoscope” is found in the urology subset, since it is related to prostate. The concept “sphincterotome,” however, has no relationship with “Oddi's sphincter” and cannot be selected automatically as part of the gastroenterology subset, even though this device is used to perform an “endoscopic sphincterotomy.”
Interactive Navigation of the Users in the UMLS Knowledge Base
During the process of representation of a medical procedure, the user has to assign a value to each relevant MAOUSSC axis. If the value belongs to the subset of values selected for the current domain, the user just has to pick it from a list. Otherwise, the candidate values are all UMLS concepts whose semantic types are compatible with the axis being described. In this case, instead of building a huge list of values, the user is shown a browser so that he or she can navigate the UMLS knowledge base until the closest concept is found. This interactive navigation is based on the relationships between concepts. As before, the lack of relationships results in silence: A concept might seem not to exist in the UMLS only because it cannot be reached from another concept using their relationships.
Consistency
Although the goal of the UMLS Metathesaurus is to provide a uniform, integrated distribution format for more than 30 biomedical vocabularies, the Metathesaurus incorporates existing vocabularies while respecting the hierarchy of concepts in each thesaurus. Knowledge representation can differ slightly from one vocabulary to another, and a single source can even contain several representations. The Metathesaurus simply reflects these multiple hierarchies rather than providing a single one. In a graph representation in which concepts are nodes and hierarchic relationships between concepts are links, several paths can be found between two nodes. The shortest path simply reflects the lowest level of granularity. ▶ shows the multiple paths between “Hip Fracture” and “Fractures” with the source of the links.
Unlike granularity, the categorization of the concepts and the hierarchic relationships are expected to be consistent. Some characteristics of the relationships between concepts in the Metathesaurus have the potential to be problematic for approaches such as MAOUSSC.
Categorization of the Concepts
In MAOUSSC, axes are tightly related to the UMLS semantic types so that whether concepts are allowed to be used as values for a given axis depends directly on their categories. The categorization of the concepts is an important and difficult part of the work of inserting a new vocabulary into the Metathesaurus.36 Default semantic types are first assigned automatically to a set of concepts according to the hierarchy of the vocabulary. Then concepts are reviewed by editors. Because semantics is involved, it is difficult to define rules for quality assurance of the categorization process. Nor can simple queries be made for testing the consistency of assigned categories.
Despite this difficulty, the formal properties of the Metathesaurus can be exploited to prevent inconsistencies. In the UMLS, semantics is represented not only by the categorization of the concepts and the semantic network. Three other sources of semantic information are term information (synonym terms providing different expressions of a unique concept), contextual information (relationships between concepts reflecting the structure of a particular thesaurus), and co-occurrence data (statistical association of concepts within MEDLINE citations).37 Both semantic categorization and “is_a” relationships define classes that would be expected to be consistent, so that members of the same class share not only the “is_a” relationship to their hypernym but also inherit its semantic types.38 On the other hand, multiple hierarchies can coexist in the Metathesaurus, which is a directed acyclic graph rather than a tree. In this case, the semantic types of a concept come from multiple inheritance and can be defined as the union of the semantic types of every hypernym of this concept.
These properties can be used to evaluate how well the hierarchy of a particular thesaurus has been formed.37 The evaluation of the quality of the categorization using the hierarchic relations is another possible use of these properties. The expected results are: (1) that all concepts of a class (hyponyms of another concept or siblings) will have the same semantic types as their hypernym (or a hyponym of these semantic types; see ▶), and (2) that semantic types of a member of a class that are not shared by the other members will have been inherited from another hypernym of the concept (multiple inheritance; see ▶). In a previous study,38 Nelson et al. found that the proportion of siblings with a common semantic type was not more than 70%.
More surprisingly, some concepts in the UMLS have neither the expected semantic types according to their “is_a” relationships nor the expected relationships according to their position in the hierarchy. For example, “serous membranes” (SMs) are membranes “lining the external walls of the body cavities and reflected over the surfaces of protruding organs.” Pericardium, peritoneum, and pleura are known as serous membranes.
The three members of the SM class are expected to be all hyponyms of “Serous Membrane.” Although they all can be found in the same MeSH hierarchy, “Peritoneum” and “Pleura” are found as hyponyms of SM (“is_a” relationship), while “Pericardium” is a meronym (“part_of” relationship). “Pleura” and “Peritoneum” inherit the “Tissue” semantic type from their hypernym. “Pericardium” has both the “Tissue” and “Body Part, Organ, Organ Component” semantic types, although no explicit “is_a” relationship allows it to inherit them automatically. Finally, although the parents of “Pleura” and “Peritoneum” also have other semantic types, only “Pericardium” inherits one of its parents' semantic types. ▶ summarizes the representation of the SM class in the UMLS.
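The expected inheritance can be checked mechanically. The sketch below, which uses hypothetical names and only the facts stated above for the SM class, computes the semantic types a concept inherits from its direct “is_a” hypernyms and reports those it carries without any hypernym to justify them. The other parents of “Pleura” and “Peritoneum,” whose semantic types are not listed in the text, are left out.

```python
# Semantic types as described in the text for the serous membrane class.
SEMANTIC_TYPES = {
    "Serous Membrane": {"Tissue"},
    "Peritoneum": {"Tissue"},
    "Pleura": {"Tissue"},
    "Pericardium": {"Tissue", "Body Part, Organ, Organ Component"},
}

# Explicit "is_a" links only; "Pericardium" is linked by "part_of" instead.
IS_A = {
    "Peritoneum": ["Serous Membrane"],
    "Pleura": ["Serous Membrane"],
}


def inherited_types(concept: str) -> set:
    """Union of the semantic types of the concept's direct hypernyms."""
    types = set()
    for hypernym in IS_A.get(concept, ()):
        types |= SEMANTIC_TYPES[hypernym]
    return types


def uninherited_types(concept: str) -> set:
    """Semantic types of the concept that no direct hypernym accounts for."""
    return SEMANTIC_TYPES[concept] - inherited_types(concept)
```

Here `uninherited_types("Pleura")` is empty, while `uninherited_types("Pericardium")` returns both of its semantic types, flagging the concept for review.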
Hierarchic Relationships
Hierarchic relationships (HRs) can be used to assess the consistency of the categorization but also need to be consistent on their own. In the Metathesaurus, HRs between concepts come from the different thesauri in which the concepts are found and reflect the structure of these vocabularies, or they are assigned during the building process of the Metathesaurus. HRs are parent_of (PAR), child_of (CHD), broader_than (RB), and narrower_than (RN). Moreover, some of these HRs are labeled as hypernymy (is_a) or meronymy (part_of). Most of the thesauri do not, however, provide information about the nature of the HRs.37 Only 8868 of the 238,882 HRs (4%) are currently labeled, and two thirds of them are “is_a” relationships. For Nelson et al.38 most of the unlabeled HRs can be considered “some type of relationship which involves subsumption.” Since HRs are asymmetric, each A PAR B has its B CHD A counterpart, and each A′ RB B′ has its B′ RN A′ counterpart.
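The pairing of each HR with its counterpart lends itself to a simple check. The following sketch assumes relationships are held as (A, relation, B) triples read as “A relation B”; the function name and the triple layout are assumptions and do not reflect the Metathesaurus file format.

```python
INVERSE = {"PAR": "CHD", "CHD": "PAR", "RB": "RN", "RN": "RB"}


def missing_counterparts(relations: set) -> set:
    """Return the inverse triples that are expected but absent."""
    return {
        (b, INVERSE[rel], a)
        for a, rel, b in relations
        if (b, INVERSE[rel], a) not in relations
    }


# Illustrative triples; the second relationship deliberately lacks its counterpart.
relations = {
    ("Anxiety Disorders", "PAR", "Agoraphobia Without History of Panic Disorder"),
    ("Agoraphobia Without History of Panic Disorder", "CHD", "Anxiety Disorders"),
    ("Antihypertensive Agents", "RB", "guanadrel"),
}
gaps = missing_counterparts(relations)
```

An empty result means every PAR has its CHD and every RB has its RN.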
Consistency and Redundancy
Four types of redundancy can be found within the HR:
Redundancy from multiple hierarchies. The HRs can appear in several hierarchies, and the Metathesaurus keeps a trace of the HRs in each hierarchy. For example, the A PAR B relationship between two concepts—(A) “Anxiety Disorders” and (B) “Agoraphobia Without History of Panic Disorder”—is found in both DSM4 and SNOMED International.
Redundancy between hyponyms. Hyponym relationships coming from multiple thesauri are called CHD in some thesauri and RN in others; hypernym relationships are called PAR in some thesauri and RB in others. For example, the relationship between the two concepts (A) “Antihypertensive Agents” and (B) “guanadrel” is found as A PAR B in SNOMED International and as A RB B in MeSH.
“Incestuous” redundancy. Some concepts that are considered siblings in one thesaurus have hierarchic relationships in another. For example, the relationship between the two concepts (A) “Gastro-intestinal System” and (B) “Esophagus” is found as A SIB B (siblings) in the CRISP thesaurus and as A PAR B in MeSH. Unlike the two previous types of redundancy, “incestuous” redundancy reflects different knowledge representations or different levels of granularity.
Circular relationships. Since HRs are asymmetric, relationships like A PAR B and B PAR A are expected not to be found simultaneously. Nevertheless, a few hundred circular relationships (cyclic graphs) do exist in the Metathesaurus. The merging of thesauri providing different knowledge representations can lead to inconsistency. For example, “colorectal neoplasms” is parent of “colon neoplasm” in CRISP, while it is child of “colonic neoplasms” in MeSH.
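Circular relationships of this kind can be found with a standard depth-first search over the merged parent-to-child links. The sketch below is a generic cycle detector, not a tool distributed with the UMLS; it assumes that “colon neoplasm” (CRISP) and “colonic neoplasms” (MeSH) denote the same concept once the vocabularies are merged.

```python
def find_cycle(children: dict):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the parent-to-child graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in children.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(children):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Parent -> children links after merging the two sources.
merged = {
    "Colorectal Neoplasms": ["Colonic Neoplasms"],  # from CRISP
    "Colonic Neoplasms": ["Colorectal Neoplasms"],  # from MeSH
}
cycle = find_cycle(merged)
```

The recursive form is adequate for small subsets; a hierarchy as deep as the full Metathesaurus might call for an iterative traversal instead.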
Surprisingly, however, the main source of inconsistency is somewhere else: In some thesauri, terms belonging to different levels of the hierarchy are considered synonyms. For example, “renal cell cancer” and “renal cell adenocarcinoma,” which are in two different levels of the “cancer” hierarchy in PDQ, are both considered synonyms of “Carcinoma, Renal Cell.”
Other Sources of Inconsistency
Inconsistency could also be found without any redundancy and within one vocabulary. For example, in SNOMED the anterior branch of the right hepatic duct is part of the “Bile Ducts, Extrahepatic” tree, although it is intrahepatic. In fact, the right hepatic duct is an extrahepatic structure resulting from the union of two smaller intrahepatic branches (anterior and posterior branches of the right hepatic duct), collecting the bile inside the liver. But from a taxonomic point of view, the right hepatic duct can be seen as being one of the two branches of the common hepatic duct and as dividing itself into two intrahepatic branches.
The “branch_of” relationship does not exist in the UMLS but would be a meronymy rather than a hyponymy, so it is not necessarily transitive and it does not allow one concept to automatically inherit properties from its meronym. In the UMLS, “anterior branch of right hepatic duct” (intrahepatic) and “right hepatic duct” (extrahepatic) are two of the children of “Bile ducts, Extrahepatic,” and this relationship is not otherwise labeled.
Finally, in the SM example, while “Peritoneal Cavity” is parent of “Peritoneum,” it is surprising to not find “Pleural Cavity” as parent of “Pleura.” “Pleural Cavity” does exist in the UMLS, but it is described in SNOMED International as child of “Pleura,” while “Peritoneal Cavity” is described in CRISP as parent of “Peritoneum.” It is a typical example of the difference in knowledge representation between two vocabularies. Unlike inconsistency based on circular relationships, these inconsistent representations of the medical knowledge are difficult to detect.
In summary, these inconsistencies have not been found to be really detrimental to the representation of medical procedures. Because of the existence of cyclic graphs, however, the computability of the knowledge is made more difficult. MAOUSSC users can also be confused by inconsistencies: The description of a pericardectomy as an excision (Nature) of pericardium (Topography) is correct, since “Body Part, Organ, Organ Component” is one of the semantic types allowed for Topography. The representation of an excision of the peritoneum would not be correct, however, since “Tissue” is not allowed as a semantic type for Topography, and since “Peritoneum” does not have the “Body Part, Organ, Organ Component” semantic type.
Usability
The NLM currently provides two different ways to access the UMLS data: (1) files from the CD-ROM distribution to be integrated into a local system, and (2) online access to the Knowledge Source Server through a command line interface, an application programming interface (API), and a Web-based application.3,4 The online method not only provides access to the data but also makes it possible to use a large number of tools that are not yet available in the CD-ROM distribution (e.g., matching tools based on morphologic variations).
The major concerns about the usability of the UMLS in MAOUSSC are: (1) How can the annual changes in the UMLS be made consistent with the procedures already described? (2) How suitable is the UMLS for use in the French language? and (3) How suitable is the UMLS for use in the context of French medical culture?
A Changing Vocabulary
Since 1990, eight annual versions of the UMLS have been released. Each provides many changes reflecting the integration of new vocabularies. Cimino and Clayton39 have proposed a typology for the changes, including “renamed terms,” “deleted terms,” and “added terms” (simple additions, refinements, redundancy, and disambiguation). As mentioned above, MAOUSSC does not use UMLS concepts exclusively but also allows users to create new concepts. If one of the UACs is part of the new version of the UMLS, it can be replaced by the corresponding UMLS concept only if (1) the two terms have exactly the same meaning, and (2) the UMLS concept has the same semantic types as the UAC. Furthermore, the relationship between the UAC and one or more medical domains should be transferred to the UMLS concept as well.
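The replacement conditions can be summarized in a short sketch. All names are hypothetical, the judgment that two terms have exactly the same meaning is assumed to come from a human reviewer and is passed in as a flag, and the semantic type and domain given to the example concept are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    name: str
    semantic_types: frozenset
    domains: set = field(default_factory=set)


def replace_uac(uac: Concept, umls: Concept, same_meaning: bool) -> Optional[Concept]:
    """Return the UMLS concept enriched with the UAC's domains,
    or None if the replacement conditions are not met."""
    if not same_meaning:
        return None
    if uac.semantic_types != umls.semantic_types:
        return None
    # Transfer the links between the UAC and its medical domains.
    return Concept(umls.name, umls.semantic_types, umls.domains | uac.domains)


# Illustrative values (semantic type and domain are assumed).
body_part = frozenset({"Body Part, Organ, Organ Component"})
uac = Concept("mitral annulus", body_part, {"cardiac surgery"})
umls = Concept("Mitral annulus", body_part)
merged_concept = replace_uac(uac, umls, same_meaning=True)
```

Returning a new object leaves both inputs untouched, so a rejected replacement has no side effects on the existing descriptions.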
In MAOUSSC, the axis in which one concept is used represents a particular context. This context could be used as a heuristic for resolving word-sense ambiguity. This can be useful when an existing concept (A) appears to have multiple meanings (polysemy) and is split into several concepts (A1, A2, etc.) in a further edition of the UMLS. The concept A can be replaced by A1 if the meaning of A1 corresponds to the meaning of A used in the particular context of one MAOUSSC axis.
Some changes are not simple to detect and handle, however. The evolution of the representation of “Cryptorchidism” (Cry) and “Ectopia testis” (ET) gives an example of this problem. Cryptorchidism means that the testicle failed to descend into the scrotum but is located at some point on its migration path, while ectopia testis means that the testicle has not descended into the scrotum and is not located at any point on its migration path. Because of this distinction, the treatment of Cry and ET can be different. In a surgical context, Cry and ET have two different meanings.
Initially, only one concept was defined in the UMLS for the two meanings. The distinction between Cry and ET appeared in the 7th edition, when ET was removed from the synonyms of Cry and a new concept for ET was created. The visible part of this modification is the creation of a new concept for ET, but its counterpart (the removal of ET from the terms of Cry) is silent. Like Cimino and Clayton, we think that such modification should be made explicit by creating a new concept for each meaning as a child of the original term rather than by shifting one part of the initial meaning into another concept.
Even if the number of UACs is not large, the implementation of a new version of the UMLS in MAOUSSC is a complex operation. This task (called knowledge administration) is distinct from both database administration (which requires technical skills) and the administration of the conceptual model (which requires model-related knowledge).
UMLS and French Language
Although the UMLS is composed largely of vocabularies in English, the current version also includes French, German, Spanish, and Portuguese translations of the MeSH terms. The French translation includes 25,932 terms corresponding to 18,277 concepts. This opening to foreign languages is an important step toward the use of the UMLS as background knowledge in non-English applications.
A 7-bit character set has been chosen for the storage of the terms in the UMLS. This character set has an entry for every alphabetic character (from a to z, lowercase and uppercase) but does not allow the representation of diacritic marks (e.g., acute, grave, and circumflex accents; tilde, cedilla, dieresis) or ligatures (connected letters), and therefore is not well adapted to Western European languages like French (abcès, érysipéloïde, fœtus), German (Überdosis, Anästhesie, Mißbildungen), Spanish (células, corazón, riñón), and Portuguese (micção, esteróides, náusea).
The loss of diacritic marks could result in ambiguity. For example, the two French words “côte” (rib) and “côté” (side) would both be transformed into “cote” (quotation). Moreover, in the UMLS, non-English terms are all uppercased, which makes it more difficult for acronyms and proper nouns to be discovered.
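The ambiguity is easy to reproduce. The sketch below uses Python's standard `unicodedata` module to strip diacritic marks and uppercase a term; it is a stand-in for whatever transliteration was actually applied when the terms were loaded into the UMLS, which is not described here, and the function name is hypothetical.

```python
import unicodedata


def to_7bit_upper(term: str) -> str:
    """Strip diacritic marks, drop anything outside 7-bit ASCII, and uppercase."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (e.g., ligatures) are simply
    # dropped in this sketch.
    return stripped.encode("ascii", "ignore").decode("ascii").upper()


rib = to_7bit_upper("côte")
side = to_7bit_upper("côté")
```

Both calls return the same string, so the two words can no longer be told apart once stored.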
The 7-bit character set is not suitable for the storage of non-English terms. Other character sets could be used, such as ISO 8859-1 or Unicode. The ISO 8859-1 set is an 8-bit character set, suitable for the representation of Western European languages, and currently used by the HyperText Markup Language (HTML). The Unicode standard is a 16-bit character set that provides the representation of virtually all existing character sets.40 Since it provides uniform representation for several character sets, Unicode could be seen as a kind of Metathesaurus of character sets. Only a few computer systems and applications currently support Unicode characters, however, so Unicode-encoded terms would have to be translated into a familiar 7- or 8-bit character set before they could be used.
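The trade-off between these character sets can be checked with a small sketch that uses Python's standard codecs as stand-ins: `ascii` for the 7-bit set, `iso-8859-1` for the 8-bit set, and `utf-16-be` for a 16-bit Unicode encoding. The helper `encodable` is hypothetical. One caveat the sketch makes visible is that the ligature œ (as in fœtus) has no code point in ISO 8859-1, so the 8-bit set covers most, but not all, of the examples given earlier.

```python
def encodable(term: str, encoding: str) -> bool:
    """Return True if every character of the term exists in the encoding."""
    try:
        term.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


terms = ["abcès", "Überdosis", "corazón", "micção", "fœtus"]
encodings = ("ascii", "iso-8859-1", "utf-16-be")

for term in terms:
    print(term, {enc: encodable(term, enc) for enc in encodings})

# Storage cost for one term: one byte per character in the 8-bit set,
# two bytes per character in the 16-bit encoding.
print(len("abcès".encode("iso-8859-1")), len("abcès".encode("utf-16-be")))
```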
Finally, the lexical tools distributed with the UMLS or available through the Knowledge Source Server handle only English terms. For example, the approximate matching search cannot be performed on the French vocabularies.
UMLS and French Medical Culture
Classical anatomy of the liver is described from the surface view. The liver is divided into four lobes (right lobe, left lobe, caudate lobe, and quadrate lobe). The right lobe is divided into anterior and posterior segments, while the left lobe has medial and lateral segments.
In the 1950s Couinaud, a French surgeon, proposed an alternative classification based on a surgical point of view.41 According to this view, the liver is divided into eight independent segments, each having its own vascular inflow, outflow, and biliary drainage, so that any segment can be resected without damaging the remaining ones. While classical anatomy is based on the surface anatomic landmarks, Couinaud segments are defined by the underlying vascular planes.∥
In the UMLS, liver anatomy comes mainly from SNOMED and refers to classical anatomy rather than to Couinaud nomenclature. Therefore, since French surgeons currently use Couinaud segments to describe liver anatomy and surgical procedures, they can hardly describe procedures like “Left hepatectomy (Segments: II, III and IV)” using the UMLS.
Discussion
In the 10 years since the UMLS adventure started, eight versions of the UMLS Knowledge Sources have been released. The schema of the UMLS remains stable over time, despite the fact that the number of concepts in the Metathesaurus has dramatically increased.
Trees Versus Graph
The UMLS never attempted to substitute a unique hierarchy for those coming from the different vocabularies. Rather, the projections of a concept to each hierarchy to which it belongs are part of the concept information in the Knowledge Source Server, together with its definition, its semantic types, its sources, and the list of terms. This representation is tree-centered. For a given purpose, one particular tree would be given preference by the user.
Since every tree contributes to the global meaning of a concept, we prefer an alternative representation in which the different trees are combined into a directed acyclic graph (DAG), regardless of the source of the information. Terms or relationships present in several trees reflect a high degree of universality for a piece of knowledge, and this kind of redundancy across multiple hierarchies could serve as a measure of the inertia (or stability) of the Metathesaurus. Conversely, concepts and relationships coming from a single thesaurus reflect the specificity of that thesaurus in terms of granularity or knowledge representation. In either case, once in the Metathesaurus, the multiple points of view from each thesaurus give a more complete representation of the medical concept, which must be consistent. Thus, for us, the Metathesaurus is better seen as a DAG than as multiple trees. Instead of having a preferred source, a basic subset of the Metathesaurus graph could be selected as the nodes and links common to a certain number of sources. Moreover, a hypertext-based interface could easily transform such a representation into a real knowledge browser (not only a knowledge server) and allow users to navigate the knowledge as MetaCard did.42,43
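The graph-centered view can be sketched as follows, under stated assumptions: each source hierarchy is reduced to a list of (parent, child) links, the source names and links are invented toy data, and the function names are hypothetical. The sketch merges the trees into one graph, records which sources support each link, selects the links common to a minimum number of sources, and checks that the merged graph is acyclic.

```python
from collections import defaultdict


def merge_hierarchies(trees):
    """Map each (parent, child) link to the set of sources asserting it."""
    support = defaultdict(set)
    for source, links in trees.items():
        for parent, child in links:
            support[(parent, child)].add(source)
    return dict(support)


def core_links(support, min_sources):
    """Select the links present in at least min_sources hierarchies."""
    return {link for link, sources in support.items() if len(sources) >= min_sources}


def is_acyclic(links):
    """Kahn's algorithm: True if the links form a directed acyclic graph."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in links:
        nodes.update((parent, child))
        if child not in children[parent]:
            children[parent].add(child)
            indegree[child] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited == len(nodes)


# Toy data: three hypothetical sources with partly overlapping hierarchies.
trees = {
    "SOURCE_A": [("Organ", "Liver"), ("Liver", "Right lobe"), ("Liver", "Left lobe")],
    "SOURCE_B": [("Organ", "Liver"), ("Liver", "Left lobe"), ("Liver", "Caudate lobe")],
    "SOURCE_C": [("Organ", "Liver"), ("Liver", "Segment IV")],
}

support = merge_hierarchies(trees)
print(sorted(core_links(support, min_sources=2)))
print(is_acyclic(support.keys()))
```

Raising `min_sources` shrinks the selected subset toward the most widely shared links, while the links supported by a single source show what is specific to that source.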
More Terms Versus More Links
The number of user-added concepts (UACs), that is, concepts needed for the representation of medical procedures and not found in the UMLS, is small (479). In other words, for the description of medical procedures, the current number of concepts in the UMLS is nearly sufficient. On the other hand, a higher number of interconcept relationships would make the selection of terms related to a particular medical domain more accurate and make it easier for users to navigate the knowledge.
Nevertheless, we think that standard nomenclatures like the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10)44 would be valuable additions to the UMLS for the following reasons: (1) ICD-10 is widely used as a diagnosis coding system in a large variety of contexts; (2) it has not only a strong hierarchic structure but also nonhierarchic labeled relationships (“caused_by” and “manifestation_of”); (3) it has been translated into several languages and could be a source of foreign terms. Moreover, because the structure of ICD-10 is the same whatever the language, foreign translations could be easily added into the Metathesaurus as foreign terms once the integration of the English version was complete.
More Versus Better
Since the earlier ICD version, ICD-9-CM, and other sources of diagnosis terms are already part of the Metathesaurus, ICD-10 would not be expected to provide a large number of new diagnosis terms. Rather, it would strengthen the UMLS by providing (consistent) redundancy for both concepts and interconcept relationships. Thus, the core of a basic medical terminology could be defined as a set of concepts and interconcept relationships present in a large number of vocabularies.
Rather than include more vocabularies, however, the UMLS should first reinforce its structure (e.g., by increasing the number of labeled interconcept relationships or by providing more accurate hierarchies when a particular nomenclature fails to be accurate enough) and its global consistency (by systematically crosschecking the validity of semantic types and interconcept relationships).
Finally, multiple knowledge representations should be either removed during the Metathesaurus building process or reported to the users, who can select their preferred representation.
Conclusion
Some 1500 medical procedures have been described using MAOUSSC and the UMLS as its background knowledge source. Although specific concepts still need to be created for the representation of the procedures, the UMLS appears to have progressively filled its gaps and can be expected to be a reasonably complete source of concepts once the planned additions have been included. The semantic network includes only broad categories, so that developers of MAOUSSC and other applications have decided to refine certain categories and even define new ones. On the other hand, interconcept relationships remain the weakest part of the UMLS Metathesaurus. More links, and particularly more labeled links, would make the UMLS knowledge more computable.
Overall, the major issue for the UMLS will certainly be improving its consistency. Applications using the UMLS as a knowledge source need to rely on a strong and consistent structure. Technical and human efforts should focus on the enforcement of the formal properties of the UMLS rather than on the integration of a large number of vocabularies. Nevertheless, standard international nomenclatures should continue to be included in the UMLS.
Finally, a truly internationalized version of the UMLS is highly desirable and would require the National Library of Medicine to change the current character coding system and make lexical tools work with foreign languages as well.
Acknowledgments
The authors gratefully acknowledge the encouragement and advice they received from Alexa McCray and Tom Rindflesch.
This work was supported in part by 3M Laboratories and by the following associations: Association des Utilisateurs des Nomenclatures Nationales et Internationales de Santé (AUNIS), Collège des Praticiens Spécialistes en Information et Communication Médicales (COPSICOM), Information Médicale et Gestion des Établissements (Groupe IMAGE), and Contribuer à la Recherche et à l'Innovation au Service du Traitement et de l'Analyse des Langages utilisés dans les Systèmes de Soins (CRISTALS).
Footnotes
Modèle d'Aide à l'Orientation d'un Utilisateur au Sein de Systèmes de Codage (Model for Assistance in the Orientation of a User within Coding Systems).
Unless otherwise specified, citations of UMLS refer to the 8th edition.15
A tutorial on liver anatomy featuring both classical and Couinaud nomenclatures is available at <http://everest.radiology.uiowa.edu/DPI/nlm/apps/liver/liver.html>.
References
- 1.Lindberg D, Humphreys B, McCray A. The Unified Medical Language System. Methods Inf Med. 1993;32(4): 281-91.
- 2.Jachna J, Powsner S, Miller P. Augmenting GRATEFUL MED with the UMLS Metathesaurus: an initial evaluation. Bull Med Libr Assoc. 1993;81(1): 20-8.
- 3.McCray A, Razi A, Bangalore A, Browne A, Starvi P. The UMLS Knowledge Source Server: a versatile Internet-based research tool. Proc AMIA Annu Fall Symp. 1996: 164-8.
- 4.McCray A, Razi A. The UMLS Knowledge Source server. Medinfo. 1995;8(1): 144-7.
- 5.Cimino J. Use of the Unified Medical Language System in patient care at the Columbia-Presbyterian Medical Center. Methods Inf Med. 1995;34(1-2): 158-64.
- 6.Volot F, Zweigenbaum P, Bachimont B, et al. Structuration and acquisition of medical knowledge; using UMLS in the conceptual graph formalism. Proc Annu Symp Comput App Med Care. 1993; 710-4.
- 7.Carenini G, Moore J. Using the UMLS Semantic Network as a basis for constructing a terminological knowledge base: a preliminary report. Proc Annu Symp Comput App Med Care. 1993; 725-9.
- 8.Cimino J, Johnson S, Peng P, Aguirre A. From ICD9-CM to MeSH using the UMLS: a how-to guide. Proc Annu Symp Comput App Med Care. 1993: 730-4.
- 9.McCormick K, Lang N, Zielstorff R, Milholland D, Saba V, Jacox A. Toward standard classification schemes for nursing language: recommendations of the American Nurses Association Steering Committee on Databases to Support Clinical Nursing Practice. J Am Med Inform Assoc. 1994;1(6): 421-7.
- 10.Rindflesch T, Aronson A. Ambiguity resolution while mapping free text to the UMLS Metathesaurus. Proc Annu Symp Comput App Med Care. 1994; 240-4.
- 11.Sneiderman C, Rindflesch T, Aronson A. Finding the findings: identification of findings in medical literature using restricted natural language processing. Proc AMIA Annu Fall Symp. 1996; 239-43.
- 12.Nelson S, Olson N, Fuller L, Tuttle M, Cole W, Sherertz D. Identifying concepts in medical knowledge. Medinfo. 1995;8(1): 33-6.
- 13.Burgun A, Botti G, Bodenreider O, et al. Methodology for using the UMLS as a background knowledge for the description of surgical procedures. Int J Biomed Comput. 1996; 43(3): 189-202.
- 14.Rector A, Salomon W, Nowlan W, Rush T, Zanstra P, Classen W. A medical terminology server for medical language and medical information systems. Methods Inf Med. 1995;34(1-2): 147-57.
- 15.UMLS Knowledge Sources, 8th ed. Bethesda, MD: National Library of Medicine, 1997.
- 16.Burgun A, Botti G, Lukacs B, et al. A system that facilitates the orientation within procedure nomenclatures through a semantic approach. Med Inf (Lond) 1994;19(4): 297-310.
- 17.Bodenreider O, Burgun A, Botti G, et al. Is medical information usable for multiple purposes? The case of medical procedures described with the MAOUSSC model. In: van der Lei J, Beckers W (eds). Proc AMICE. Amsterdam, The Netherlands. 1995; 171-9.
- 18.Burgun A, Denier P, Bodenreider O, et al. A web terminology server using UMLS for the description of medical procedures. J Am Med Inform Assoc. 1997;4(5): 356-63.
- 19.Lange L. Representation of everyday clinical nursing language in UMLS and SNOMED. Proc AMIA Annu Fall Symp. 1996; 140-4.
- 20.Zielstorff R, Cimino C, Barnett G, Hassan L, Blewett D. Representation of nursing terminology in the UMLS Metathesaurus: a pilot study. Proc Annu Symp Comput App Med Care. 1992; 392-6.
- 21.Mullins H, Scanland P, Collins D, et al. The efficacy of SNOMED, Read Codes, and UMLS in coding ambulatory family practice clinical records. Proc AMIA Annu Fall Symp. 1996; 135-9.
- 22.Chute C, Cohn S, Campbell K, Oliver D, Campbell J. The content coverage of clinical classifications. (For The Computer-based Patient Record Institute's Work Group on Codes and Structures.) J Am Med Inform Assoc. 1996;3(3): 224-33.
- 23.Rosenberg K, Coultas D. Acceptability of Unified Medical Language System terms as substitute for natural language general medicine clinic diagnoses. Proc Annu Symp Comput App Med Care. 1994; 193-7.
- 24.O'Keefe K, Sievert M, Mitchell J. Mendelian inheritance in man: diagnoses in the UMLS. Proc Annu Symp Comput App Med Care. 1993; 735-9.
- 25.Campbell J, Kallenberg G, Sherrick R. The clinical utility of META: an analysis for hypertension. Proc Annu Symp Comput App Med Care. 1992; 397-401.
- 26.Friedman C. The UMLS coverage of clinical radiology. Proc Annu Symp Comput App Med Care. 1992; 309-13.
- 27.Cimino J. Representation of clinical laboratory terminology in the Unified Medical Language System. Proc Annu Symp Comput App Med Care. 1991; 199-203.
- 28.Humphreys B, Hole W, McCray A, Fitzmaurice J. Planned NLM/AHCPR large-scale vocabulary test: using UMLS technology to determine the extent to which controlled vocabularies cover terminology needed for health care and public health. J Am Med Inform Assoc. 1996;3(4): 281-7.
- 29.Cote R (ed). Systematized Nomenclature of Human and Veterinary Medicine (SNOMED International), version 3.1. Northfield, IL: College of American Pathologists, and Schaumburg, IL, American Veterinary Medical Association, 1995.
- 30.The Read Thesaurus, version 3.1 (October 1995). Bethesda, MD: National Health Service, National Coding and Classification Center, 1995.
- 31.Logical Observations Identifiers, Names and Codes (LOINC). Indianapolis, IN: The Regenstrief Institute, 1996.
- 32.Physician's Current Procedural Terminology, 4th ed. Chicago, IL: American Medical Association, 1996.
- 33.The International Classification of Diseases, Clinical Modification (ICD9CM), 6th ed. Washington, DC: Health Care Financing Administration, 1996.
- 34.Universal Medical Device Nomenclature System: Product Category Thesaurus. Washington, DC: ECRI, 1997.
- 35.Burgun A, Bodenreider O, Denier P, et al. Knowledge acquisition from the UMLS sources: application to the description of surgical procedures. Medinfo. 1995;8(1): 75-9.
- 36.Sperzel W, Erlbaum M, Fuller L, et al. Editing the UMLS Metathesaurus: review and enhancement of a computed knowledge source. Proc Annu Symp Comput App Med Care. 1990; 136-40.
- 37.McCray A, Nelson S. The representation of meaning in the UMLS. Methods Inf Med. 1995;34(1-2): 193-201.
- 38.Nelson S, Fuller L, Erlbaum M, Tuttle M, Sherertz D, Olson N. The semantic structure of the UMLS Metathesaurus. Proc Annu Symp Comput App Med Care. 1992; 649-53.
- 39.Cimino J, Clayton P. Coping with changing controlled vocabularies. Proc Annu Symp Comput App Med Care. 1994; 135-9.
- 40.The Unicode Consortium. The Unicode Standard, Version 2.0. Menlo Park, CA: Addison-Wesley, 1997.
- 41.Couinaud C. Le foie—Études anatomiques et chirurgicales. Paris: Masson, 1957.
- 42.Tuttle M, Cole W, Sherertz D, Nelson S. Navigating to knowledge. Methods Inf Med. 1995;34(1-2): 214-31.
- 43.Nelson S, Sherertz D, Tuttle M, Erlbaum M. Using Metacard: a HyperCard browser for biomedical knowledge sources. Proc Annu Symp Comput App Med Care. 1990; 151-4.
- 44.International Statistical Classification of Diseases and Related Health Problems, 10th revision. Geneva: WHO, 1992.