CELEBRATING THE START OF SOMETHING BIG
November 2020 is the 30th anniversary of the release of the first edition of the Unified Medical Language System (UMLS) Knowledge Sources1 by the US National Library of Medicine (NLM). This special issue of JAMIA celebrates an early milestone in Open Science by highlighting research that makes use of the current versions of the UMLS resources. A previous special section of JAMIA, published in 1998, helped to mark the first decade of UMLS research and development.2
The 1990 UMLS Knowledge Sources3 were the result of a large multisite and multidisciplinary research project launched by NLM Director Donald Lindberg 4 years earlier.4 They were the first iteration of novel and freely distributed digital resources intended to help computers behave as if they understand biomedical meaning. In other words, the UMLS Knowledge Sources were designed to help developers produce systems able to retrieve and integrate semantically related information from disparate electronic sources (eg, literature, electronic health records [EHRs], databanks, research registries), irrespective of significant differences in the vocabularies and code sets used within them.
In 1990, the problem the UMLS was targeting was not as obvious as the Web would soon make it; the idea of customizing multipurpose resources was unfamiliar to informatics developers; the notion of a Semantic Network to represent biomedical “common sense” was strange; and the format of the first version of the UMLS Metathesaurus was difficult to understand. With 64 123 concepts and 208 559 concept names from 7 sources, the 1990 Metathesaurus was both too small (in concept coverage) and too large (in file size) for many informatics groups. As a result, NLM and the institutions funded by the UMLS project were the principal users of the first several versions of the UMLS Knowledge Sources.5
Fortunately, early use was sufficient to provide important feedback. Although complexity remains an issue, the free and regularly updated, expanded, and enhanced UMLS Knowledge Sources are a good illustration of Lindberg’s maxim that “systems that get used get better,” thereby driving yet more use. As a premier example, the 1994 addition of the SPECIALIST Lexicon and Lexical Tools,6 and their use to index the Metathesaurus, made the UMLS Knowledge Sources an unparalleled free resource for biomedical natural language processing (NLP) and had a major impact on research in that field. The increasing availability of UMLS-related tools, such as MetamorphoSys,7 MetaMap,8 cTAKES,9 and CLAMP,10 produced at NLM and elsewhere, helped users to cope with customization and complexity and also drove additional use.
Perhaps even more important, external developments in information technology,11 enormous increases in genomic and EHR data, and new biomedical research priorities [eg, 12] promoted use of the UMLS Knowledge Sources and tools. These developments reduced technical barriers to UMLS use [eg, 13], fostered greater understanding of the UMLS goal, and provided compelling new use cases related to maintenance and use of EHR standards, quality measures [eg, 14, 15], and clinical decision tools; analysis and research use of observational health data [eg, 16, 17]; and knowledge discovery across heterogeneous data sources [eg, 18].
Thirty years after the release of the first experimental edition, the UMLS Knowledge Sources underpin a wide variety of consequential informatics research and applications. The Metathesaurus (2020AA edition) now contains 4.28 million concepts and 15.5 million concept names (including some in 25 different spoken languages) from 214 vocabulary sources; the Semantic Network has 127 types and 54 relationships; the SPECIALIST Lexicon includes 983 420 lexical items; and there are many associated lexical programs and tools. Seen in hindsight, some features that add complexity to the UMLS Metathesaurus also make terminology data more FAIR (findable, accessible, interoperable, and reusable).19
EXAMPLES OF UMLS RESEARCH AND APPLICATIONS TODAY
The articles in this special issue include examples of just some of the many types of research and development involving the UMLS resources.
Overview of UMLS users and uses
Two articles provide complementary overviews of UMLS users and uses, giving a broader picture of UMLS impact beyond the specific research and applications highlighted in this issue. Drawing on data from annual reports from more than 5000 UMLS users, statistics on downloads and API accesses, and a scoping review of a random sample of recent research literature, Amos et al present a perspective on the heavy direct use of UMLS resources, including in many production applications and commercial systems not reflected in published literature, and comment on how this use aligns with the stated purpose of the UMLS.20 Kim et al present the results of a scientometric review of more than 10 000 bibliographic records for UMLS-related publications from 1986 to 2019, mapping the multiple disciplines and themes involved in published UMLS work.21
Enhancement of UMLS resources
A perennial type of UMLS research is aimed at enhancing the content, characteristics, and distribution formats of the UMLS resources. Two articles in this issue fit this category. A case report by Lu et al describes recent developments in the SPECIALIST Lexicon and lexical tools that further enhance their utility for NLP research and applications.22 Vasilakes et al report the results of an assessment of the coverage of dietary supplement terminology in the UMLS Metathesaurus and identify a method for improving it.23
UMLS auditing and quality improvement methods
Key value-added contributions of the UMLS Metathesaurus are concept organization (cross-source synonymy), assignment of semantic types to concepts, and representation of all source vocabularies in a single common machine-readable format. In a number of cases, eg, the International Classification of Diseases, 9th Edition, Clinical Modification, the initial incorporation of a source vocabulary into the Metathesaurus was the first time that its semantics were explicitly represented in machine-readable form. The fully specified Metathesaurus format enabled novel research on terminology auditing techniques that have been applied both to the unique properties of the Metathesaurus and to assessment and comparison of its source vocabularies. Two articles in this issue focus on this line of research. L. Zheng, He, et al review the published literature on auditing techniques applied to the unique semantic features of the UMLS Metathesaurus,24 and F. Zheng et al introduce a transformation-based method for auditing IS-A hierarchies.25
UMLS in NLP method development
As free and regularly enhanced linguistic and semantic resources, the UMLS Knowledge Sources and related tools promoted interest in biomedical and clinical NLP and have been heavily used in developing, testing, and comparing methods for biomedical information retrieval and extraction tasks. Weinzierl et al explore how knowledge embeddings learned from the UMLS affect the quality of relation extraction from text.26 In work focused on automated measurement of the semantic relatedness among concepts, Mao and Fung describe methods that rely on public off-the-shelf word and graph embedding tools and the UMLS as the sole corpus.27 UMLS resources have been employed in developing test collections for NLP challenges, including some sponsored by the Text REtrieval Conference (TREC), BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology), and the National NLP Clinical Challenges (n2c2, formerly known as i2b2 NLP Shared Tasks), and have been used in systems developed by many challenge participants. Three articles stem from the 2019 n2c2/Open Health NLP shared task on clinical concept normalization, whose goal was to map terms in clinical text to UMLS concept unique identifiers (CUIs). Henry and Uzuner provide an overview of the shared task and its results, comparing the participating systems.28 Chen et al29 and Xu et al30 each describe a high-performing entry in the shared task, along with enhancements made after the task evaluation that leverage the UMLS.
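For readers unfamiliar with concept normalization, the task can be pictured with a deliberately simple sketch: look up a text mention in a table of known concept names and return the associated CUI. The tiny lexicon and the function names below are illustrative assumptions, not the method of any shared-task system; the systems described in this issue add candidate generation, ranking, and machine learning on top of this kind of lookup, and a real lexicon would be built from a licensed Metathesaurus release.

```python
import re

# Toy lexicon mapping lowercased concept names to CUIs.
# Assumption: a hand-picked handful of names for illustration only;
# a real system would load names from a Metathesaurus release.
TOY_LEXICON = {
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "hypertension": "C0020538",
    "high blood pressure": "C0020538",
}


def normalize_string(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def map_mention_to_cui(mention, lexicon=TOY_LEXICON):
    """Return the CUI for an exactly matching normalized name, else None."""
    return lexicon.get(normalize_string(mention))


if __name__ == "__main__":
    for mention in ["Heart  Attack", "high blood-pressure", "fever"]:
        print(mention, "->", map_mention_to_cui(mention))
```

Two different surface forms resolving to the same CUI is the cross-source synonymy that the Metathesaurus supplies; mentions with no match (here, "fever") are where more sophisticated methods are needed.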
Applications of the UMLS to specific problems
In addition to their importance in research methods, the UMLS Knowledge Sources and tools are essential to the development and maintenance of a very wide range of applications. Four articles in this issue provide examples of the diversity of problems to which UMLS resources are applied. Reimer et al used the UMLS in building the Transport Data Repository to enable study of patients who undergo medical transfer.31 Wang et al determined that use of the UMLS could improve automated categorization of patient safety incident reports.32 Rasmy et al found the UMLS useful in representing electronic health record (EHR) data in predictive modeling.33 Bitton et al mapped transliterated terms to UMLS concepts to improve retrieval in a Hebrew online health community,34 an example of using the UMLS both to extract information from social media and to aid interpretation of non-English text.
FUTURE PROSPECTS
Thirty years after the initial UMLS release, differences in vocabularies and codes used in different digital information sources, and in the terminology employed by different users, show no signs of disappearing, despite progress in standardizing some data elements in EHRs [eg, 35]. The UMLS currently helps thousands of system developers and researchers to overcome variations in the way concepts are expressed, a task that remains critical to effective retrieval, analysis, aggregation, and semantically interoperable exchange of biomedical and health-related information and data. Use of the UMLS resources underpins systems collectively used by millions of scientists, health professionals, patients, and consumers, and by thousands of computer programs, every day. A precise accounting of UMLS use and impact is not possible, however, because statistics and published papers are not available for many commercial products and institution-specific systems that rely on the UMLS.
The first UMLS resources were conceived and produced before the arrival of many things that define the informatics landscape today. These include widespread public access to high-speed Internet, Web browsers, search engines that rely on precomputed connections, inexpensive computers and information appliances, immense quantities of digital information and “big data,” and scientific and public policy priorities aimed at leveraging and advancing these developments. For the past 30 years, changes in the broad informatics environment and in biomedical and artificial intelligence research have facilitated the production, expansion, distribution, and use of the UMLS resources. They have also increased, rather than reduced, UMLS utility. It remains to be seen whether this pattern will hold into the long-term future. In the meantime, NLM will serve the informatics field—and users of biomedical and health information everywhere—by continuing to update, enhance, and simplify the UMLS Knowledge Sources, taking advantage of user feedback, ongoing developments in information technology, and advances in research methods that the UMLS itself has helped to foster.
CONFLICT OF INTEREST STATEMENT
BLH retired from the NLM in 2017. She served as UMLS Project Director at NLM from 1986 to 2006.
AUTHOR CONTRIBUTIONS
All authors participated in defining and organizing the topics covered and in drafting and editing the manuscript. All authors approved the final version of the manuscript.
This special issue is dedicated to the memory of Donald Allan Bror Lindberg, MD (1933–2019). An informatics pioneer, leader, and longest serving Director of the US National Library of Medicine (NLM), Dr. Lindberg conceived, initiated, and led the Unified Medical Language System (UMLS) effort in its formative years and provided sustained support for free dissemination and regular maintenance of the UMLS resources.
REFERENCES
- 1. Unified Medical Language System (UMLS). https://www.nlm.nih.gov/research/umls/index.html Accessed July 23, 2020
- 2. McCray AT, Miller RA. Making the conceptual connections: The UMLS after a decade of research and development. J Am Med Inform Assoc 1998; 5 (1): 129.
- 3. Lindberg DAB, Humphreys BL. The UMLS knowledge sources: tools for building better user interfaces. Proc Annu Symp Comput Appl Med Care 1990; 121–125.
- 4. Humphreys BL, Lindberg DA, Schoolman HM, et al. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc 1998; 5 (1): 1–11.
- 5. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993; 32 (4): 281–91.
- 6. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care 1994; 235–9.
- 7. MetamorphoSys Help. https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html Accessed August 3, 2020
- 8. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010; 17 (3): 229–36.
- 9. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation, and applications. J Am Med Inform Assoc 2010; 17 (5): 507–13.
- 10. Soysal E, Wang J, Jiang M, et al. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc 2018; 25 (3): 331–6.
- 11. Lindberg DA, Humphreys BL. The High-Performance Computing and Communications program, the national information infrastructure and health care. J Am Med Inform Assoc 1995; 2 (3): 156–9.
- 12. National Research Council (US). Committee on Developing a Framework for Developing A New Taxonomy of Disease. Toward Precision Medicine: Building A Knowledge Network for Biomedical Research and A New Taxonomy of Disease. Washington, DC: National Academies Press, 2011. https://www.ncbi.nlm.nih.gov/books/NBK91503/ Accessed August 6, 2020
- 13. McCray AT, Razi AM, Bangalore AK, et al. The UMLS Knowledge Source Server: a versatile Internet-based research tool. Proc AMIA Annu Fall Symp 1996; 164–8. PMID: 8947649.
- 14. Winnenberg R, Bodenreider O. Issues in creating and maintaining value sets for clinical quality measures. AMIA Annu Symp Proc 2012; 2012: 988–96.
- 15. Bodenreider O, Nguyen D, Chiang P, et al. The NLM Value Set Authority Center. Stud Health Technol Inform 2013; 192: 1224.
- 16. Milinovich A, Kattan MW. Extracting and utilizing electronic health data from Epic for research. Ann Transl Med 2018; 6 (3): 42.
- 17. Hripcsak G, Levine ME, Shang N, et al. Effect of vocabulary mapping for conditions on phenotype cohorts. J Am Med Inform Assoc 2018; 25 (12): 1618–25.
- 18. Li H, Fan J, Vitali F, et al. Novel disease syndromes unveiled by integrative multiscale network analysis of diseases sharing molecular effectors and comorbidities. BMC Med Genomics 2018; 11 (S6): 112.
- 19. FAIR principles. https://www.go-fair.org/fair-principles/ Accessed August 3, 2020
- 20. Amos L, Anderson D, Brody S, et al. UMLS users and uses: a current overview. J Am Med Inform Assoc 2020; 27 (10).
- 21. Kim MC, Nam S, Wang F, et al. Mapping scientific landscapes in UMLS research: a scientometric review of 30-year published literature. J Am Med Inform Assoc 2020; 27 (10).
- 22. Lu CJ, Payne A, Mork JG. The UMLS SPECIALIST lexicon and lexical tools: development and applications. J Am Med Inform Assoc 2020; 27 (10).
- 23. Vasilakes J, Bompelli A, Bishop JR, et al. Assessing the enrichment of dietary supplement coverage in the UMLS. J Am Med Inform Assoc 2020; 27 (10).
- 24. Zheng L, He Z, Wei D, et al. A review of auditing techniques for the Unified Medical Language System. J Am Med Inform Assoc 2020; 27 (10).
- 25. Zheng F, Shi J, Yang Y, et al. A transformation-based method for auditing the IS-A hierarchy of biomedical terminologies in the Unified Medical Language System (UMLS). J Am Med Inform Assoc 2020; 27 (10).
- 26. Weinzierl MA, Maldonado R, Harabagiu SM. The impact of learning UMLS embeddings in relation extraction from biomedical texts. J Am Med Inform Assoc 2020; 27 (10).
- 27. Mao Y, Fung KW. Use of word and graph embedding to measure semantic relatedness between UMLS concepts. J Am Med Inform Assoc 2020; 27 (10).
- 28. Henry S, Uzuner O. 2019 n2c2/OHNLP shared task on clinical concept normalization for clinical records. J Am Med Inform Assoc 2020; 27 (10).
- 29. Chen L, Wenbo F, Gu Y, et al. Clinical concept normalization with a hybrid NLP system combining multi-level matching and machine learning ranking. J Am Med Inform Assoc 2020; 27 (10).
- 30. Xu D, Gopala M, Zhang J, et al. UMLS resources improve sieve-based generation and BERT-based ranking for concept normalization. J Am Med Inform Assoc 2020; 27 (10).
- 31. Reimer AP, Milinovich A. Using UMLS for electronic health data standardization and database design. J Am Med Inform Assoc 2020; 27 (10).
- 32. Wang Y, Coiera E, Magrabi F. Can UMLS-based semantic representation improve automated identification of patient safety incident reports by type and severity? J Am Med Inform Assoc 2020; 27 (10).
- 33. Rasmy L, Tiryaki F, Zhou Y, et al. Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies. J Am Med Inform Assoc 2020; 27 (10).
- 34. Bitton Y, Cohen R, Schifter T, et al. Cross-lingual UMLS entity linking in online health communities. J Am Med Inform Assoc 2020; 27 (10).
- 35. Bhargava A, Kim T, Quine DB, et al. A 20-year evaluation of LOINC in the United States' largest integrated health system. Arch Pathol Lab Med 2020; 144 (4): 478–84.