Author manuscript; available in PMC 2025 Sep 10.
Published in final edited form as: Proceedings (IEEE Int Conf Bioinformatics Biomed). 2022 Jan 14;2021:2609–2611. doi: 10.1109/bibm52615.2021.9669475

Data Normalization Improves Semantic Annotation – a Case Study of Rare Disease Name Annotation

Charlie Tang 1, Yanji Xu 2, Qian Zhu 3
PMCID: PMC12419296  NIHMSID: NIHMS2109302  PMID: 40932881

Abstract

Despite the individually low prevalence of rare diseases, they collectively constitute a significant challenge to human health. Accurate annotation of rare diseases in biomedical data through natural language processing (NLP) could be instrumental to biomedical informatics research. Herein, we propose a data normalization-based annotation approach, complementary to the widely used biomedical data annotation tool MetaMap, for annotating diseases in free-text biomedical data.

Keywords: Rare disease, data normalization, semantic annotation

I. Introduction

In the United States (US), a rare disease (RD) is defined as a disease or condition that affects fewer than 200,000 persons in the US [1]. There are more than 7,000 RDs, which collectively affect ~25–30 million people in the US. Research on and treatment of these diseases are highly underdeveloped because of the low prevalence of each individual RD. Only about 500 RDs have a specific ICD (International Classification of Diseases) code [2]. Without data standardization, it is difficult to connect and integrate multiple data resources for data reuse in RD research. Natural language processing (NLP)-based semantic annotation is emerging as a promising solution: it recognizes biomedical terms that appear in various forms in unstructured data, including scientific literature, clinical notes, and grant applications, and represents them with standard terminologies. Jovanović et al. reviewed the current landscape of semantic annotation in biomedicine and highlighted that “annotation of biomedical documents with machine intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search.” [3]

Previously, we developed a knowledge graph that integrates National Institutes of Health (NIH) grant data into the Genetic and Rare Diseases Information Center (GARD) program [4], managed by the NIH’s National Center for Advancing Translational Sciences (NCATS), to connect RDs to funded projects and better understand the funding environment for RDs [5]. In that study, we annotated grant project titles via regular expressions and MetaMap [6], a widely used semantic annotator that extracts medical terms from English text using both NLP and domain knowledge. From that study, we learned the limitations of MetaMap in annotating RD data: it tended to produce a high rate of false annotations, especially with large texts, and its scoring was neither a good indicator of annotation confidence nor useful for ranking and filtering the results. Thus, we now further examine the capacity and performance of MetaMap on relatively large texts, i.e., project abstracts, and consequently propose a potential solution for semantic annotation of RD-related unstructured data.

II. Method

A. Rare disease data and NIH funding data preparation.

We obtained a listing of RDs included in GARD with names and synonyms from the NCATS GARD Knowledge Graph (NGKG) [7] and tried to map the RDs to NIH-funded projects from NIH ExPORTER [8]. In total, 6,061 RDs from GARD and 10,000 proposals from NIH ExPORTER were used in this study.

B. Semantic annotation of project abstracts via MetaMap.

The MetaMap 2010 version was installed locally. Each sentence from the abstract of a given project proposal was fed into the tool. All annotated disease-related terms in the Unified Medical Language System (UMLS) [9] that belonged to the semantic group of disorders, labelled “DISO” [10], were collected and used to map to the GARD RDs.
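The DISO filtering step above can be sketched as follows. This is an illustrative sketch, not the authors' code: the concept tuples are examples of the fields a MetaMap output parser would yield, and the semantic-type abbreviation list approximates the UMLS "Disorders" group membership.

```python
# Semantic-type abbreviations in the UMLS "Disorders" (DISO) semantic group
# (approximate membership; the authoritative list is the UMLS Semantic
# Groups file).
DISO_SEMTYPES = {
    "acab", "anab", "cgab", "comd", "dsyn", "emod",
    "fndg", "inpo", "mobd", "neop", "patf", "sosy",
}

def filter_disorder_concepts(concepts):
    """Keep only concepts whose semantic-type abbreviation is in the DISO
    group; `concepts` is a list of (cui, name, semtype) tuples."""
    return [c for c in concepts if c[2] in DISO_SEMTYPES]

# Illustrative parsed MetaMap output for one sentence
concepts = [
    ("C0012634", "Disease", "dsyn"),
    ("C0032961", "Pregnancy", "orgf"),   # organism function: not DISO, dropped
    ("C0006826", "Malignant Neoplasms", "neop"),
]
disorders = filter_disorder_concepts(concepts)
```

Only the disorder-group concepts survive the filter and proceed to the GARD RD mapping step.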

C. Data normalization-based annotation (NormMap).

RD names and project abstracts were both subjected to a series of normalization steps to standardize special patterns that are prone to being presented differently in text, leading to annotation failures. The normalization mainly aimed to standardize semantic notations, word formats, and special expressions, including letter case, hyphens, parentheses, forward slashes, and Roman numerals, in different contexts.
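A minimal sketch of such a normalization pass is shown below. The exact rules and their ordering in NormMap are not specified in the text, so this is an assumption-laden illustration of the kinds of steps described (case folding, hyphen/slash/parenthesis handling, Roman numerals).

```python
import re

# Roman-numeral tokens commonly seen in disease subtype names; the cutoff
# at "v" is an assumption for this sketch.
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

def normalize(text):
    text = text.lower()                    # unify letter case
    text = re.sub(r"[-/]", " ", text)      # hyphens and slashes become spaces
    text = re.sub(r"[()]", " ", text)      # drop parentheses, keep their content
    # rewrite Roman-numeral tokens as digits so "type III" and "type 3" align
    tokens = [ROMAN.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

For example, `normalize("Mucopolysaccharidosis type III")` and `normalize("mucopolysaccharidosis type 3")` yield the same normalized string, so the two surface forms match downstream.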

Each disease name or abstract sentence was normalized via the above steps and then tokenized as a list. One GARD RD may have multiple names, and one abstract may have multiple sentences; as a result, multiple token lists were generated for each GARD RD and each project abstract. For a match between an RD and a project to be successful, at least one token list of the RD had to be contained as a subset of at least one token list of the abstract. Because this approach relies heavily on data normalization, it is referred to as “NormMap”.
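The matching rule can be sketched as below. Whether the original implementation requires the RD tokens to be contiguous in the sentence is not stated; plain subset containment is assumed here.

```python
def matches(rd_token_lists, abstract_token_lists):
    """True if any (normalized, tokenized) name of the RD is contained,
    as a token subset, in any (normalized, tokenized) abstract sentence."""
    return any(
        set(name).issubset(sentence)
        for name in rd_token_lists
        for sentence in abstract_token_lists
    )

rd = [["young", "syndrome"]]   # token lists for one RD's names
abstract = [
    ["this", "project", "studies", "young", "syndrome", "patients"],
]
```

Here `matches(rd, abstract)` succeeds because both RD tokens occur in the first sentence's token list.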

III. Results

We examined 10,000 project abstracts against the 6,061 GARD RDs using both MetaMap and NormMap as described above. Of all tested abstracts, about 10% were mapped to 252 GARD RDs by MetaMap, and about 5% were mapped to 242 GARD RDs by NormMap; about 2.5%, or 261 abstracts, were mapped to 99 GARD RDs by both methods (see Figure 1). Of these 261 abstracts, 175 were mapped to at least one identical GARD RD by both methods. Note that the mapping could be many-to-many, and the numbers are not mutually exclusive.

Figure 1. Numbers of abstracts annotated with rare diseases by MetaMap and NormMap.

These numbers themselves are not indicative of annotation efficiency or quality, but a very high occurrence of a specific disease name mapped to the abstracts usually suggests false mapping due to the lack of uniqueness of the name. We observed that MetaMap overall tended to map diseases with higher occurrences (Figure 2). There were two extreme outliers; the top one was “Infantile neuroaxonal dystrophy” (GARD:0003957). This GARD RD has the alternative name “PLAN”, and 488 abstracts containing the word “plan” were thus incorrectly annotated. In NormMap, we addressed words of this type, which occur at high frequency yet contain special letter or case patterns, such as all uppercase or certain mixed cases. However, names that appear as regular words were not yet handled well by NormMap. For example, the disease name “Young syndrome” (GARD:0000341) was mis-assigned to 23 abstracts because of the word “Young”.
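One way to guard against acronym-like synonyms such as “PLAN” is to require exact-case matching for short, all-uppercase names while matching other names case-insensitively. This is a sketch of the idea described above, with an assumed length threshold, not the authors' exact rule; note that it would still mis-match regular-word names like “Young syndrome”, consistent with the limitation noted in the text.

```python
def acronym_safe_match(synonym, text):
    """Match a disease synonym in free text, treating short all-uppercase
    synonyms (likely acronyms) as case-sensitive."""
    if synonym.isupper() and len(synonym) <= 5:   # threshold is an assumption
        return synonym in text.split()            # exact-case token match only
    return synonym.lower() in text.lower()        # regular case-insensitive match
```

With this guard, the sentence “we plan to extend the study” no longer triggers a match for “PLAN”, while “patients with PLAN were enrolled” still does.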

Figure 2. Cumulative distribution of occurrences of GARD RDs in the annotation results of MetaMap and NormMap, respectively.

It is difficult to quantify the accuracy of the results since there is no gold standard for such annotation. Therefore, to evaluate the mapping results, we manually reviewed 77 abstracts that were annotated by both methods to get a sense of where the problems lay and how to improve NormMap. The mapping was many-to-many, and here we considered each mapped pair as one event. The evaluation results are summarized in Figure 3. A Chi-square test on the true and false events suggested that NormMap had a significant improvement in specificity (p-value = 0.001), though this should be interpreted with caution since the number of reviewed abstracts was relatively small and might not be representative.
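The test above compares true/false annotation events per method in a 2×2 contingency table. A minimal sketch of the statistic is below; the counts are hypothetical placeholders, not the paper's reviewed data.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (df = 1, no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical true/false event counts (NOT the paper's actual numbers):
# method 1: 60 true, 40 false; method 2: 80 true, 20 false.
stat = chi_square_2x2(60, 40, 80, 20)
# With df = 1, a statistic above 3.84 is significant at alpha = 0.05.
```

For these placeholder counts the statistic is about 9.52, well above the 3.84 critical value, illustrating how the true/false split drives the significance result.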

Figure 3. Evaluation of annotations by MetaMap and NormMap.

IV. Discussion

Accurate annotation of disease terms is foundational to subsequent data mining. To meet this need, we developed a data normalization-based annotation approach called NormMap. Our preliminary results show that this approach offers improved specificity over MetaMap. In this study, we have not benchmarked the accuracy of the two methods against each other; in fact, we recommend that the two methods be used in combination to achieve better performance.

Annotating disease names from natural language text is an important topic in biomedical informatics. Exact pattern matching alone would be expected to yield low recall. Machine learning-based approaches hold promise; however, the lack of training data has so far made it difficult to build such models [11], and the situation is even worse for RDs due to data scarcity. MetaMap employs a knowledge-based NLP methodology and is currently the standard community tool for semantic annotation. It is able to infer disease annotations even when the names do not literally appear in the text. For example, in a project proposing to study maternal PKU, MetaMap correctly assigned the disease Microencephaly (GARD:0007038) despite the name not appearing in the abstract. Conversely, such inference can also produce erroneous or inaccurate annotations and is thus a source of false annotations. Another source of annotation errors is disease name variation, which is one component of the data normalization we addressed in NormMap. To minimize the false annotation rate, we will further refine our algorithm by creating a scoring system to rank and filter out low-confidence mappings.

Nevertheless, given the inherent complexity of natural language, any text processing and mining effort will have limitations in terms of disease annotation quality. Despite these limitations, data standardization can greatly improve such efforts and thereby help us tap into the rich resource of biomedical text data, as we have observed in this study with the NIH funding data.

Acknowledgement

This research was supported in part by the Intramural/Extramural research program of the NCATS, NIH.

Contributor Information

Charlie Tang, Axle Informatics, LLC, Rockville, USA.

Yanji Xu, Office of Rare Disease Research, National Center for Advancing Translational Sciences, Bethesda, USA.

Qian Zhu, Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences, Rockville, USA.

References
