Abstract
Objective
Medical concept normalization (MCN), the task of linking textual mentions to concepts in an ontology, provides a solution to unify different ways of referring to the same concept. In this paper, we present a simple neural MCN model that takes mentions as input and directly predicts concepts.
Materials and Methods
We evaluate our proposed model on clinical datasets from the ShARe/CLEF eHealth 2013 shared task and the 2019 n2c2/OHNLP shared task track 3. Our neural MCN model consists of an encoder and a normalized temperature-scaled softmax (NT-softmax) layer that maximizes the cosine similarity score of matching the mention to the correct concept. We adopt SAPBERT as the encoder and initialize the weights in the NT-softmax layer with pre-computed concept embeddings from SAPBERT.
Results
Our proposed neural model achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. Yet this model is simpler than most prior work: it requires no complex pipelines, no hand-crafted rules, and no preprocessing, making it easy to apply in new settings.
Discussion
Analyses of our proposed model show that the NT-softmax is better than the conventional softmax on the MCN task, and both the CUI-less threshold parameter and the initialization of the weight vectors in the NT-softmax layer contribute to the improvements.
Conclusion
We propose a simple neural model for clinical MCN, a one-step approach with simpler inference and more effective performance than prior work. Our analyses suggest that future work on MCN may need to devote more effort to unseen concepts.
Keywords: Deep Learning, Natural Language Processing, Medical Concept Normalization, Vector Space Model, Normalized Temperature-scaled Softmax
1. Introduction
Mining and analyzing the constantly growing unstructured text in clinical notes offers great opportunities to improve a wide range of medical and healthcare practices, from clinical trial recruitment[1] to pharmacovigilance[2,3] to clinical decision support[4,5]. However, texts describing biomedical concepts such as findings, symptoms, diseases, diagnoses, procedures, and medications exhibit many morphological and orthographic variations, and use different word orderings or equivalent words. For instance, heart attack, coronary attack, MI, myocardial infarction, cardiac infarction, and cardiovascular stroke all refer to the same concept. Linking such terms to their corresponding concepts in an ontology or knowledge base is critical for data interoperability and for the development of downstream natural language processing (NLP) techniques.
Medical concept normalization (MCN) -- a form of entity linking or entity normalization in the biomedical domain -- maps concept mentions, the in-text natural-language mentions of medical concepts, to concept entries in a standardized ontology or knowledge base. In this work, we focus on MCN in clinical notes, since it is less studied than MCN in consumer health[6–8] or in scientific articles[9–12], largely due to fewer publicly available datasets and the restrictions of data use agreements. Three main challenges of MCN are: 1) novel concept mentions (i.e., CUI-less mentions) that cannot be linked to any concept in an ontology; 2) ambiguous mentions, where a single term or phrase can refer to multiple distinct concepts; and 3) lexical variants of concept mentions such as aliases, misspellings, and abbreviations. Among these challenges, ambiguity is less common in the existing clinical MCN datasets[13–16]. In the following sections, we therefore focus on approaches that take only the mentions as input, and leave the resolution of ambiguous mentions, which will require exploration of contextual information, for future work.
Approaches developed for the clinical MCN datasets include rule-based and supervised approaches. Rule-based approaches[13,17–21] focus on constructing a look-up dictionary and performing string matching. They rely heavily on matching rules and domain knowledge: for example, D'Souza and Ng[13] define 10 types of rules at different priority levels to measure morphological similarity between mentions and concept names from an ontology, and Kate[18,21] implements a system that automatically learns generalizable edit-distance patterns from the Unified Medical Language System (UMLS) and the given training data. These approaches make it easy to interpret how mentions are normalized, but they work poorly when there is little lexical overlap between mentions and concept names, for example, in consumer-health data like social media posts[6].
Supervised approaches can be categorized into classification-based approaches, learning-to-rank approaches, and vector space models (VSMs). Classification-based approaches[6,22–26] feed mentions into various neural architectures, such as Gated Recurrent Units (GRUs) with attention mechanisms[26] or pre-trained transformer networks with extra features[23], and predict concepts with a softmax classification layer. Such approaches work well on a few consumer-health datasets[6,23], where only a smaller subset of medical concepts is of interest and each concept has relatively more training examples, but they struggle when an ontology contains hundreds of thousands of concepts and the annotated training data does not contain examples for all of them. To alleviate these problems, researchers apply learning-to-rank approaches[14,16,27–32], such as a generate-and-rank framework that typically includes a non-trained candidate generator, which produces a short list of candidate concepts, and a supervised candidate ranker, which takes both the mention and a candidate concept as input to rank the candidates. However, these approaches require complex pipelines and fail when their candidate generators do not find the gold-truth concepts.
Recently, VSMs have been widely applied to MCN because they cast it as a one-step process requiring no complex pipelines: they encode mentions and concept names as vectors, and mentions are matched to concepts by computing the similarity between the vector representations of mentions and of all concept names from an ontology. VSMs are trained with similar objectives that pull semantically close texts together and push semantically dissimilar ones apart, but differ in their implementations: different encoders such as ELMo[33] or pre-trained transformer networks[34–38]; different instance-mining techniques such as online hard-instance mining[34–36,38] or offline hard-instance mining[33,37]; and different loss functions such as multi-similarity loss[34], cosine similarity loss[38], triplet loss[37], adaptive hinge loss[33], batch hard soft-margin loss[36], or maximizing the marginal likelihood over synonyms[35]. Among these approaches, SAPBERT[34] and BIOSYN[35] (when initialized with SAPBERT) achieve new state-of-the-art performance on multiple MCN tasks in the social media and scientific article domains, but it is unclear whether these approaches generalize to clinical texts, where there are more CUI-less mentions. Another limitation of these two approaches is that they separate their training objective from the task objective: to predict concepts during inference, they first match the mention representation to the representation of a concept text using nearest-neighbor search, and then find the corresponding concept (unique identifier) for that concept text, which is not always a one-to-one mapping. For instance, the concept text “potassium” can map to the medical substance (C0032821) or to the procedure of measuring potassium (C0202194), and their solution is to pick one arbitrarily. Moreover, each concept has multiple texts (synonyms), and searching over texts is more computationally expensive than searching over concepts, especially considering all the concepts from UMLS¹. To alleviate these limitations, Xu and Bethard[36] explore similarity search using concept representations, but their concept representations are not learnable during VSM training.
In this paper, we propose a neural MCN model (shown in Figure 1) that takes the mention text as input and directly predicts the medical concept or the CUI-less label. Our system consists of a transformer network that encodes the mention as a vector, and a normalized temperature-scaled softmax (NT-softmax) layer that maximizes the cosine similarity score of matching the mention to the correct concept label, where the representation of each concept is pre-computed and then updated during training. Our system outperforms previous state-of-the-art systems on two clinical MCN datasets, and is simpler than most prior work: it requires no pre-processing, no complex pipelines, no rules for CUI-less prediction, and no nearest-neighbor search. We also conduct analyses to understand what contributes to the improvements, finding that the NT-softmax layer is better than the conventional softmax, and that initializing the weights in NT-softmax with concept embeddings generated from SAPBERT further improves performance.
Figure 1. Architecture of our proposed end-to-end neural MCN model.
2. MATERIALS AND METHODS
2.1. Dataset
To the best of our knowledge, there are four shared tasks for MCN in clinical notes: ShARe/CLEF eHealth 2013, SemEval-2014 Task 7, SemEval-2015 Task 14, and the 2019 n2c2/OHNLP shared task track 3. The first three focus on normalizing disease/disorder concept mentions from the same ShARe corpus, and their datasets build on each other, with more annotations added each year. The 2019 n2c2/OHNLP shared task track 3 uses the MCN corpus[39] (hereafter 2019-n2c2-MCN) and covers a broader set of concepts, including medical problems, treatments, and tests. In this study, we select the datasets from ShARe/CLEF eHealth 2013 and the 2019 n2c2/OHNLP shared task track 3. Table 1 shows the statistics of these two datasets.
Table 1.
Statistics of two clinical note datasets used in this study.
| | | ShARe/CLEF 2013: train + dev | ShARe/CLEF 2013: test | 2019-n2c2-MCN: train + dev | 2019-n2c2-MCN: test |
|---|---|---|---|---|---|
| Data | # of documents | 199 | 99 | 50 | 50 |
| | # of mentions | 5,816 | 5,351 | 6,684 | 6,925 |
| | # of CUI-less mentions | 1,641 | 1,750 | 151 | 217 |
| | # of ambiguous mentions | 397 | 240 | 190 | 192 |
| | # of unique concepts | 1,011 | 796 | 2,331 | 2,579 |
| | # of unseen concepts | - | 623 | - | 2,067 |
| | # of mentions per concept | 5.75 | 6.72 | 2.5 | 2.69 |
| Ontology | source | SNOMED-CT | | SNOMED-CT + RxNorm | |
| | # of concepts | 88,150 | | 434,056 | |
| | # of concept synonyms | 312,506 | | 1,127,171 | |
| | # of synonyms per concept | 3.55 | | 2.6 | |
# of CUI-less mentions: # of mentions that are annotated with the CUI-less label.
# of ambiguous mentions: # of mentions that are annotated with more than one concept.
# of unseen concepts: # of mentions in the test set whose normalized concepts do not appear in the train or dev set.
The ShARe/CLEF 2013 corpus consists of 298 de-identified clinical notes from the ShARe corpus. Each note was annotated for mentions of disorders, normalized using the 88,150 CUIs in the SNOMED-CT subset of the 2012AB version of UMLS. We keep the same shared-task setting: 199 clinical notes containing 5,816 disorder mentions are used as the train and development set, and the 5,351 mentions from the remaining 99 clinical notes form the test set. One notable characteristic of this dataset is that around 30.4% of the mentions cannot be mapped to any concept in the ontology and are assigned the CUI-less label. The 2019-n2c2-MCN corpus contains 100 clinical notes derived from the 2010 i2b2/VA shared task. All mentions are mapped to one of 434,056 possible concepts from the SNOMED-CT and RxNorm subsets of UMLS version 2017AB. We take 40 clinical notes from the released data as training, consisting of 5,334 mentions, and use the standard evaluation data, 6,925 mentions from 50 notes, as our test set. Around 2% of mentions in 2019-n2c2-MCN are assigned the CUI-less label.
From Table 1, we can see that both datasets have few ambiguous mentions, 4.5% of mentions in the ShARe/CLEF 2013 test set and 2.8% in the 2019-n2c2-MCN test set, so the context in which mentions appear plays only a minor role in these datasets. The two major challenges in these datasets are unseen concepts (concepts that do not appear in the train or dev set but appear in the test set) and few instances per concept label: 11.6% of test mentions have unseen concepts in ShARe/CLEF 2013, versus 29.8% in 2019-n2c2-MCN; and there are around 6 instances per concept in ShARe/CLEF 2013, versus only about 2.5 in 2019-n2c2-MCN. Systems that memorize the training data or rely on it to determine the space of output concepts will thus perform poorly.
2.2. System description
We define a concept mention m as a textual string referring to a medical concept. Given a list of pre-identified concept mentions {m1, m2, …, mt} from clinical notes and an ontology O with a set of concepts {c1, c2, …, cn}, the goal of MCN is to find a mapping function that maps each textual mention m to its corresponding concept c, or to the CUI-less label if there is no matching concept for m.
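In symbols (the notation here is ours, restating the prose definition above), the predicted concept for a mention m is:

```latex
% MCN as a mapping from a mention m to a concept in O or the CUI-less label
\hat{c}(m) = \operatorname*{arg\,max}_{c \,\in\, O \cup \{\mathrm{CUI\text{-}less}\}} p(c \mid m),
\qquad O = \{c_1, c_2, \ldots, c_n\}
```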
We propose a neural MCN model that mainly consists of a transformer network and an NT-softmax layer (shown in Figure 1). We first feed mention m into the transformer network to generate a vector representation f(m); the NT-softmax layer then computes the cosine similarity score $s_i$ between f(m) and the weight vector $w_i$ for each concept $c_i$ in the ontology, formally, $s_i = \cos(w_i, f(m))$. To model the CUI-less label, we adopt a scalar threshold parameter $s_0$, encouraging $\max_i s_i > s_0$ for all linkable mentions and $\max_i s_i < s_0$ for all CUI-less mentions. To learn these restrictions, we concatenate $s_0$ with the scores $s_1, \ldots, s_n$ and feed them into an NT-softmax to compute the probability of the mention being mapped to concept $c_i$ (i > 0) or to the CUI-less label (i = 0):
$$p(c_i \mid m) = \frac{\exp(s_i/\tau)}{\sum_{j=0}^{n} \exp(s_j/\tau)} \qquad (1)$$

where $s_i = \frac{w_i \cdot f(m)}{\lVert w_i \rVert \, \lVert f(m) \rVert}$ is the dot product between the $\ell_2$-normalized $w_i$ and $f(m)$, i.e., $s_i = \cos(w_i, f(m))$; $\tau$ is a temperature/scaling parameter that controls the concentration level of the distribution[40]. For simplicity, we omit $s_0$ in the rest of the system description.
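As a concrete illustration of equation (1), a minimal PyTorch sketch of the NT-softmax layer might look as follows (this is our reconstruction, not the released implementation; class and variable names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NTSoftmaxLayer(nn.Module):
    """Sketch of equation (1): cosine-similarity logits over all concepts,
    scaled by a temperature tau, with a scalar threshold s0 prepended as
    the CUI-less "concept" at index 0."""

    def __init__(self, concept_embeddings: torch.Tensor,
                 tau: float = 0.03, s0: float = 0.9):
        super().__init__()
        # Learnable weight matrix W (n_concepts x d), initialized from
        # pre-computed SAPBERT concept embeddings as described above.
        self.W = nn.Parameter(concept_embeddings.clone())
        self.tau = tau
        # The paper reports that keeping s0 fixed during training worked
        # better than learning it, so we register it as a buffer.
        self.register_buffer("s0", torch.tensor(s0))

    def forward(self, f_m: torch.Tensor) -> torch.Tensor:
        # s_i = cos(w_i, f(m)) for every concept c_i in the ontology.
        sims = F.normalize(f_m, dim=-1) @ F.normalize(self.W, dim=-1).T
        # Prepend the CUI-less threshold as the 0th score.
        s0 = self.s0.expand(f_m.size(0), 1)
        return torch.cat([s0, sims], dim=1) / self.tau

# Usage sketch: logits = layer(f_m); loss = F.cross_entropy(logits, labels),
# where label 0 denotes CUI-less and label i (i > 0) denotes concept c_i.
```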
We adopt the off-the-shelf SAPBERT[34] as the encoder to generate vector representations, where the representation f(m) of each mention can be either the representation of the [CLS] token or the average pooling of the representations of all sub-word tokens. SAPBERT is initialized with PubMedBERT[41] and further fine-tuned on UMLS synonyms via a contrastive learning objective; it has shown better capabilities than PubMedBERT (a state-of-the-art pre-trained language model for the biomedical domain) at representing medical texts in a well-separated space. Inspired by the memory-bank technique in contrastive learning[42,43], we similarly maintain a matrix that stores the representations of all concepts, initializing the weight vectors W in the NT-softmax layer with a pre-computed concept embedding matrix V. We compute the concept embedding $v_i$ by averaging the representations, generated by SAPBERT, of all the synonyms of concept $c_i$, formally:
$$v_i = \frac{1}{|\mathrm{Text}(c_i)|} \sum_{t \in \mathrm{Text}(c_i)} f(t) \qquad (2)$$
where Text(ci) is the set of all concept texts denoting concept ci, including synonyms from the ontology and mentions from the annotated corpus. Conceptually, initializing with such concept embeddings can speed up training, since both the mention representations and the concept embeddings are generated by the same encoder, and the embeddings of different concepts from SAPBERT are better separated than randomly initialized weight vectors.
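A sketch of how the matrix V can be pre-computed following equation (2), assuming the publicly released SAPBERT checkpoint on the Hugging Face hub and the average-pooling variant of f(·) (function names here are ours, for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id for the released SAPBERT model.
MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL).eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool SAPBERT sub-word representations for each text."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=16, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (b, len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # (b, len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # average pooling

def concept_embedding(synonyms: list[str]) -> torch.Tensor:
    """Equation (2): average the representations of all texts for c_i."""
    return embed(synonyms).mean(0)

# v_i for one concept; stacking these rows over the ontology gives V.
v_i = concept_embedding(["myocardial infarction", "heart attack", "MI"])
```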
2.2.1. Normalized Temperature-scaled Softmax
To motivate the NT-softmax in our model, we first compare it with the conventional softmax, which approaches MCN as a multiclass classification problem in which each concept, including the CUI-less label, is an individual class. However, unlike typical classification settings, where the number of classes is small and each class has many instances, most MCN tasks involve more than ten thousand concepts in an ontology, and each concept has only a few training instances. In each mini-batch during training, we only have mentions from tens of concepts, and only the weight vectors of those concepts are activated, which leads to high-variance gradients for the weight vectors, so they cannot be learned effectively[42]. Another limitation of directly applying softmax to MCN is the modeling of the CUI-less label: the softmax assumes all CUI-less mentions are encoded in a similar output space. This assumption is inappropriate for CUI-less mentions, since they represent areas where the ontology lacks coverage but do not necessarily share semantics, and are therefore likely to map to highly varying parts of the concept space.
Under the conventional softmax formulation, the probability of a mention being classified as the ith concept is:

$$p(c_i \mid m) = \frac{\exp(w_i \cdot f(m))}{\sum_{j} \exp(w_j \cdot f(m))} \qquad (3)$$

where the similarity metric is an inner product between a mention representation f(m) and a weight vector $w_i$ for concept $c_i$. Comparing equations (1) and (3), there are two differences: the $\ell_2$-normalized inner product (cosine similarity) and the scaling parameter. For the $\ell_2$-normalized inner product, previous experiments[42–47] show that the angle between feature vectors is a good similarity metric compared with Euclidean distance or inner-product operations. As for the scaling parameter, it is an important hyperparameter for learning from cosine similarity scores[44]. Wang et al.[40] also show that the scaling parameter controls the strength of penalties on hard negative samples, in our case, the mentions that are likely to be mapped to incorrect concepts that have high cosine similarity scores.
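To make the effect of the scaling parameter concrete, a quick numeric check (the similarity values here are illustrative, not taken from our experiments) shows how τ = 0.03 turns small cosine-similarity gaps into a sharply peaked distribution, while τ = 1 leaves it nearly uniform:

```python
import torch
import torch.nn.functional as F

# Illustrative cosine similarities for one mention against four concepts.
sims = torch.tensor([0.80, 0.75, 0.70, 0.60])

print(F.softmax(sims / 1.0, dim=0))   # ~[0.27, 0.26, 0.25, 0.22]: nearly uniform
print(F.softmax(sims / 0.03, dim=0))  # ~[0.82, 0.15, 0.03, 0.00]: sharply peaked
```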
We further discuss the connection of our NT-softmax layer to contrastive learning, as the NT-softmax is also widely used there[43,45,46,48]. In contrastive learning, the probability that a text $x_j$ denotes the same concept as a text $x_i$ is:

$$p(x_j \mid x_i) = \frac{\exp(s_{ij}/\tau)}{\sum_{k} \exp(s_{ik}/\tau)} \qquad (4)$$

where $s_{ij} = \cos\big(f(x_i), g(x_j)\big)$ is the cosine similarity score between texts of the ith and jth concepts. In most contrastive learning approaches[45,46,48], g(∙) is the same encoder as f(∙), and the negative examples are sampled within a smaller mini-batch. Alternatively, g(∙) can be a memory bank[42,43], a matrix that stores the representations of all available concepts in an ontology, in which case negative examples are sampled from the entire ontology during training. Our NT-softmax layer has commonalities with both settings: first, g(∙) is the weight matrix of the NT-softmax layer, which compares a mention against all concepts during training; second, the concept embeddings used to initialize g(∙) are computed with the encoder f(∙).
2.3. Implementation Details
Unless noted otherwise, we keep the default hyperparameters of Hugging Face's PyTorch implementation across all datasets. We set the learning rate to 2e-5, the sequence length to 16, and the temperature parameter τ to 0.03 for both datasets. On the ShARe/CLEF 2013 dataset, we set the batch size to 16, the number of epochs to 10, and s0 = 0.9; on the 2019-n2c2-MCN dataset, we set the batch size to 128, the number of epochs to 5, and s0 = 0.65. Following the setting of Xu and Bethard[36], we combine the mentions from the annotated datasets and the synonyms from their corresponding ontologies to train our system. During training, we initialize the weight vectors of the NT-softmax layer with the pre-computed concept embeddings and keep them learnable. We experimented with different settings for the CUI-less threshold s0, and found that fixing s0 during training leads to better performance. In preliminary experiments, we also tried both the [CLS] token and the average pooling of all sub-word tokens to generate representations; average pooling achieved slightly better performance, so we use it in all experiments.
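For reference, the dataset-specific settings above can be summarized as a configuration sketch (a plain dictionary; the key names are ours):

```python
# Hyperparameters reported above; all other settings keep the Hugging Face
# defaults. s0 is fixed (not learned) during training.
COMMON = {"learning_rate": 2e-5, "max_seq_length": 16, "tau": 0.03}

CONFIGS = {
    "share_clef_2013": dict(COMMON, batch_size=16, num_epochs=10, s0=0.90),
    "2019_n2c2_mcn":   dict(COMMON, batch_size=128, num_epochs=5, s0=0.65),
}
```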
2.4. Experiments
2.4.1. Evaluation Metrics
For all experiments, we adopt the standard MCN evaluation metric, accuracy: the percentage of entity mentions that are correctly normalized.
Comparison with Related Work
We compare our proposed approach with the following related work:
Sieve-based dictionary look-up[13,39]: rule-based approaches that carefully select combinations and orders of dictionaries, exact and partial matching, and heuristic rules.
BERT-based ranker[14,29]: a generate-and-rank framework that consists of a dictionary look-up system and a supervised candidate ranker.
VSM triplet network[36]: a vector space model that encodes mentions and concepts as vectors via a transformer network. The transformer network is trained via a triplet objective with online hard triplet mining.
SAPBERT[34] (off-the-shelf): a state-of-the-art MCN system on social media and scientific article datasets. The architecture of SAPBERT is similar to [36], but it is trained on all concept synonyms from UMLS via a multi-similarity loss.
SAPBERT (fine-tuned): the off-the-shelf SAPBERT model further fine-tuned on the ShARe/CLEF 2013 and 2019-n2c2-MCN datasets.
Since our focus is individual systems, not ensembles, we compare only against other non-ensembles. TTI-COIN[49] is an ensemble of five different runs that achieves 85.26% accuracy on 2019-n2c2-MCN. Their individual model is a transformer network that takes a mention as input and generates a vector representation, which is matched to learnable concept vectors via cosine similarity; the cosine similarity is then fed into an ArcFace loss[50] for optimization. Their individual model is similar to our approach but requires two-step training. The average performance of their five individual runs (TTI-COIN-Average) is 84.40%[51].
2.4.2. Models
For all the models, we apply SAPBERT as encoders and experiment with the following output layers:
NT-softmax: our complete model with an NT-softmax layer and a threshold parameter to model the CUI-less label. The weight vectors in NT-softmax are initialized with the pre-computed concept embeddings from SAPBERT.
NT-softmax (Random): same as our complete model but with weight vectors randomly initialized.
NT-softmax (Fixed): same as our complete model, except that the weight vectors initialized with the pre-computed concept embeddings from SAPBERT are not updated during fine-tuning.
Softmax: a conventional softmax layer that approaches the concept normalization as a multiclass classification problem, where each concept including the CUI-less label is an individual class. The weight vectors are randomly initialized.
Softmax (CUI-less): similar to the above conventional softmax, but modeling the CUI-less label using a threshold parameter as in NT-softmax.
3. RESULTS
Table 2 shows the results of our proposed models on the two clinical MCN datasets, alongside the best models in the literature. Our complete model, SAPBERT + NT-softmax, achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. On ShARe/CLEF 2013, our model outperforms the previous state-of-the-art[13,14] by small margins, yet it requires no complex pipelines, no hand-crafted rules, and no preprocessing such as acronym expansion or spelling correction. On 2019-n2c2-MCN, a dataset with a more realistic setting where the number of candidate concepts is large and many test concepts were never seen during training, we see bigger performance gains over previous work: 1.8 points over the BERT-based ranker[29] and around 1 point over TTI-COIN-Average[51]. Compared with the VSM triplet network[36], which also requires no hand-crafted rules or preprocessing and is easily adapted to both clinical datasets, our model is around 1 point better on ShARe/CLEF 2013 and around 1.5 points better on 2019-n2c2-MCN.
Table 2.
Comparison of our proposed concept normalization system against the current state-of-the-art performances on ShARe/CLEF 2013 and 2019-n2c2-MCN datasets.
| Model | ShARe/CLEF 2013: Dev | ShARe/CLEF 2013: Test | 2019-n2c2-MCN: Dev | 2019-n2c2-MCN: Test |
|---|---|---|---|---|
| Sieve-based dictionary look-up [13] | - | 90.75 | - | - |
| Sieve-based dictionary look-up [39] | - | - | - | 76.35 |
| VSM triplet network [36] | 90.2 | 90.4 | 84.67 | 83.7 |
| TTI-COIN-Average [51] | - | - | - | 84.40 |
| BERT-based ranker [29] | - | - | 84.44 | 83.56 |
| BERT-based ranker [14] | - | 91.1 | - | - |
| SAPBERT [34] (off-the-shelf) | 83.97 | 82.77 | 82.89 | 83.02 |
| SAPBERT (fine-tuned) | 86.91 | 83.33 | 85.17 | 83.46 |
| SAPBERT+ Softmax | 85.55 | 83.30 | 78.88 | 78.30 |
| SAPBERT+ Softmax (CUI-less) | 87.34 | 85.37 | 82.60 | 81.31 |
| SAPBERT + NT-softmax (Random) | 90.93 | 90.66 | 83.70 | 82.65 |
| SAPBERT + NT-softmax (Fixed) | 92.31 | 91.18 | 90.52 | 85.15 |
| SAPBERT + NT-softmax | 92.51 | 91.31 | 90.74 | 85.32 |
The last six rows of Table 2 present the performance of different SAPBERT settings. First, the off-the-shelf SAPBERT performs worse than the previous state of the art, around 8 points worse on ShARe/CLEF 2013, where more than 30% of mentions are annotated with the CUI-less label. This is most likely because SAPBERT does not model the CUI-less label during training and treats it like any other concept during inference. After fine-tuning SAPBERT on each task, the performance gains on the test sets are marginal, consistent with the fine-tuning experiments of Liu et al.[34] on multiple MCN datasets from scientific publications.
When applying SAPBERT in neural models with different output layers, we find that approaching MCN as a multiclass classification problem with a conventional softmax barely works on the two datasets, performing no better than the fine-tuned SAPBERT. Adding a threshold parameter to model the CUI-less label improves performance over the conventional softmax layer. Comparing + Softmax (CUI-less) with + NT-softmax (Random), both of which have a CUI-less threshold parameter and randomly initialized weight vectors, we find that part of the performance gains come from the NT-softmax itself, i.e., from using a cosine similarity function and a scaling parameter in the softmax. Lastly, initializing the weights in NT-softmax with pre-computed concept embeddings (NT-softmax (Fixed) and NT-softmax) further improves performance, especially on the 2019-n2c2-MCN data, where many dev and test mentions have unseen concepts. When the weights in NT-softmax are fixed, performance on both datasets is close to our complete model, and tuning the weights improves it only slightly, indicating that the initialization with pre-computed concept embeddings is the key to our success.
4. Discussion
4.1. Impacts of Model Initialization
To see whether our proposed approach only works when starting from SAPBERT, we conducted experiments that use the same hyperparameters but adopt PubMedBERT as the encoder, with two weight-vector initializations: 1) PubMedBERT as the encoder with randomly initialized weight vectors, PubMedBERT + NT-softmax (Random); and 2) PubMedBERT with pre-computed concept embeddings from PubMedBERT, PubMedBERT + NT-softmax. Table 3 shows their performance on the two datasets. When the weight vectors are randomly initialized, SAPBERT is slightly better than PubMedBERT: 90.66 vs. 89.77 on the ShARe/CLEF test set, and 82.65 vs. 82.17 on the 2019-n2c2-MCN test set. However, initializing the weight vectors with PubMedBERT concept embeddings yields much lower scores than with SAPBERT concept embeddings, e.g., an around 30-point drop on the 2019-n2c2-MCN test set. Since the performance gap between SAPBERT + NT-softmax and PubMedBERT + NT-softmax is so large, we did not spend much time on hyperparameter tuning; in preliminary experiments, we explored different hyperparameters for PubMedBERT such as learning rate, batch size, and the CUI-less parameter s0, but none of these made a big difference.
Table 3.
Comparison of our proposed concept normalization system using different encoders and concept embeddings on ShARe/CLEF 2013 and 2019-n2c2-MCN datasets.
| Model | ShARe/CLEF 2013: Dev | ShARe/CLEF 2013: Test | 2019-n2c2-MCN: Dev | 2019-n2c2-MCN: Test |
|---|---|---|---|---|
| PubMedBERT + NT-softmax (Random) | 90.17 | 89.77 | 83.26 | 82.17 |
| PubMedBERT + NT-softmax | 52.11 | 52.94 | 60.07 | 58.99 |
| SAPBERT + NT-softmax (Random) | 90.93 | 90.66 | 83.70 | 82.65 |
| SAPBERT + NT-softmax | 92.51 | 91.31 | 90.74 | 85.32 |
To study why the pre-computed concept embeddings from PubMedBERT make model training harder, we examine the cosine similarity scores between mentions and their gold-truth concepts, and between mentions and their top 10 negative concepts, across five epochs for all four models on the 2019-n2c2-MCN dev set. We simplify the training objective as increasing the cosine similarity between mentions and their gold-truth concepts while decreasing the cosine similarity between mentions and their top 10 negative concepts. From Figure 2, we can see that all four models are optimized towards this objective, as the difference between these two cosine similarity scores grows during training. However, the difference for PubMedBERT + NT-softmax is much smaller than for the other three models, especially in the early epochs; at the first epoch, for instance, it is around 0.06. This indicates that PubMedBERT concept embeddings for dissimilar concepts are much closer in the vector space than SAPBERT's, making it harder for the model to distinguish them during training. For SAPBERT + NT-softmax, by contrast, the difference between the gold-truth concept and the top 10 negative concepts is already large at the first epoch, likely because SAPBERT is already trained to distinguish medical texts with different semantics. Surprisingly, when the concept embeddings are randomly initialized, both the PubMedBERT and SAPBERT models can distinguish dissimilar concepts even at the early stage of training, but the cosine similarity scores for the gold-truth concepts are relatively small and close to the CUI-less threshold hyperparameter. To conclude, PubMedBERT is not a good starting point for our proposed approach, and the initialization of the weight vectors in NT-softmax has more impact on training than the initialization of the encoder.
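This diagnostic can be computed with a few lines of tensor code; the sketch below is our reconstruction of the analysis behind Figure 2, not released code, and the function name is ours:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def similarity_diagnostics(f_m: torch.Tensor, W: torch.Tensor,
                           gold: torch.Tensor, k: int = 10):
    """Average cosine similarity to gold concepts vs. top-k negatives.

    f_m: (n_mentions, d) mention vectors; W: (n_concepts, d) concept
    weight vectors; gold: (n_mentions,) gold concept indices.
    """
    sims = F.normalize(f_m, dim=-1) @ F.normalize(W, dim=-1).T
    rows = torch.arange(len(gold))
    gold_sim = sims[rows, gold].mean().item()
    sims[rows, gold] = float("-inf")          # exclude the gold concept
    neg_sim = sims.topk(k, dim=1).values.mean().item()
    return gold_sim, neg_sim
```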
Figure 2. The average cosine similarity scores on the 2019-n2c2-MCN dev set across 5 training epochs. "Gold concept" denotes the cosine similarity scores between mentions and their gold-truth concepts during each epoch; "Top 10 negative concepts" denotes the average cosine similarity scores between mentions and their top 10 negative concepts (concepts that differ from the gold-truth concept but have the highest cosine similarity scores). Both scores are averaged across all mentions in the dev set.
4.2. Qualitative Analysis
Table 4 shows the top 5 predicted CUIs from SAPBERT + NT-softmax during training (before training vs. the 5th epoch). For both mentions, our best model SAPBERT + NT-softmax finds the gold-truth concept both before and after fine-tuning, but after fine-tuning it learns better rankings. For the mention "Her alkaline phosphatase," the model depends more on text overlap than on the semantic type of the mention before fine-tuning: although all the incorrect predictions contain the same terms "alkaline phosphatase" as the mention, all of them denote the chemical substance. After fine-tuning, our model understands that "Her alkaline phosphatase" is a measurement or finding. For the mention "pelvic lymph node dissection," the preferred name of the gold-truth concept is "Pelvic lymphadenectomy," meaning "the surgery to remove lymph nodes in the pelvis for examination." The prediction "Excision of pelvic lymph node (C0398429)" has a similar meaning to the gold-truth concept, but is ranked 5th using the original SAPBERT concept embeddings; after fine-tuning, the model re-ranks it at position 2.
Table 4.
Comparison of the top 5 predicted CUIs from SAPBERT + NT-softmax before any training and at the end of training (5th epoch) for the mentions "Her alkaline phosphatase" and "pelvic lymph node dissection". The bolded concepts are the gold-truth concepts for these two mentions.
| Rank | "Her alkaline phosphatase": before training | "Her alkaline phosphatase": 5th epoch | "pelvic lymph node dissection": before training | "pelvic lymph node dissection": 5th epoch |
|---|---|---|---|---|
| 1 | Alkaline phosphatase measurement (C0201850) | Alkaline phosphatase measurement (C0201850) | Pelvic lymphadenectomy (C0193883) | Pelvic lymphadenectomy (C0193883) |
| 2 | Alkaline phosphatase isoenzyme, bone fraction (C0312399) | Alkaline phosphatase level - finding (C0428332) | Block dissection of pelvic lymph nodes (C0398405) | Excision of pelvic lymph node (C0398429) |
| 3 | Alkaline Phosphatase (C0002059) | CUI-less | Operation on pelvic lymph node draining prostate (C0401597) | Block dissection of pelvic lymph nodes (C0398405) |
| 4 | Alkaline phosphatase stain (C1318717) | Alkaline Phosphatase (C0002059) | Excision of inguinal lymph node and pelvic lymph node (C2959401) | Pelvic lymph node group (C0729595) |
| 5 | Alkaline phosphatase isoenzyme, renal fraction (C0368634) | Serum alkaline phosphatase measurement (C0036776) | Excision of pelvic lymph node (C0398429) | Pelvic lymph nodes sampling (C0398426) |
4.3. Evaluation on CUI-less Mentions and Unseen Concepts
To estimate how well our proposed approach would work in a realistic setting, we analyze the performance of our system on unseen concepts and CUI-less mentions in both datasets, since both are likely to appear more frequently in real-life clinical data. Table 5 shows that our proposed approach achieves more than 90% accuracy on seen concepts for both datasets, while accuracy on unseen concepts is around 70%. Compared with previous work[28] on the 2019-n2c2-MCN dataset, our approach obtains improvements on both seen and unseen concepts, but prediction of unseen concepts remains a major challenge for the MCN task.
Table 5.
Accuracy of our best model SAPBERT + NT-softmax on seen concepts, unseen concepts, and CUI-less mentions in test data from ShARe/CLEF 2013 and 2019-n2c2-MCN.
| | ShARe/CLEF 2013 | 2019-n2c2-MCN |
|---|---|---|
| Seen concepts | 94.16 | 93.40 |
| Unseen concepts | 71.68 | 69.15 |
| CUI-less | 93.56 | 66.36 |
4.4. Limitations and Future Work
One limitation of our approach is the memory footprint of the weight vectors in the NT-softmax layer. For the experiments on the 2019-n2c2-MCN dataset, which uses SNOMED-CT and RxNorm as the concept space, the weight vectors in NT-softmax are initialized with a 434,056 × 768 concept embedding matrix, around 1.4 GB of memory. Such an embedding matrix would be about 10 times larger if we considered all 4 million concepts from UMLS. In the future, we would like to explore strategies to reduce the size of the embedding matrix, such as adding a projection matrix between the encoder and the NT-softmax layer.
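As an illustration of that direction, one could insert a learned projection that maps the 768-dimensional encoder output into a smaller space before the NT-softmax. The sketch below is ours, and the dimension 128 is an arbitrary choice; note that initializing W from the 768-dimensional SAPBERT concept embeddings would no longer apply directly under this design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedNTSoftmax(nn.Module):
    """NT-softmax preceded by a learned projection to a smaller dimension.

    With d = 128 instead of 768, the fp32 weight matrix for 434,056
    concepts shrinks from roughly 1.3 GB to roughly 0.2 GB, at a
    possible cost in accuracy.
    """
    def __init__(self, n_concepts: int, enc_dim: int = 768,
                 d: int = 128, tau: float = 0.03):
        super().__init__()
        self.proj = nn.Linear(enc_dim, d, bias=False)
        # Randomly initialized here; the pre-computed 768-d concept
        # embeddings would need to be projected to d dimensions first.
        self.W = nn.Parameter(torch.randn(n_concepts, d) * 0.02)
        self.tau = tau

    def forward(self, f_m: torch.Tensor) -> torch.Tensor:
        z = F.normalize(self.proj(f_m), dim=-1)
        return (z @ F.normalize(self.W, dim=-1).T) / self.tau
```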
Although our simple one-step approach directly matches the mention representation to the concept embeddings, its ability to normalize ambiguous mentions is limited by our context-independent mention representation, as we feed only the mention text to the encoder. However, as both our data analysis and previous work[15] show, the ambiguity ratios in most clinical-note MCN datasets are relatively low, so it would not be fair to evaluate ambiguity-resolution approaches on the current datasets. An interesting direction for future research would be the creation of, and evaluation on, datasets designed to maximize ambiguity, along the lines of new adversarial datasets in other domains[52].
5. CONCLUSION
In this paper, we present a simple one-step neural MCN model consisting of a transformer network that encodes the mention as a vector, and an NT-softmax layer that maximizes the cosine similarity score of matching the mention to the correct concept label. Our proposed approach achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. Our analyses show that each component of the NT-softmax layer is important, especially the initialization of the weight vectors with pre-computed concept embeddings from SAPBERT. Our approach is simpler and more effective than prior work, but prediction of unseen concepts remains challenging.
The code for these experiments is publicly available in a GitHub repository: https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers. The repository also contains a FastAPI-based demonstration server that sets up a REST server to predict the CUI for a given concept text.
ACKNOWLEDGMENTS
We thank the anonymous reviewers as well as the associate editor, Dr. Meliha Yetisgen, for helpful comments on an earlier draft of this paper.
FUNDING
Research reported in this publication was supported by the National Library Of Medicine of the National Institutes of Health under Award Numbers R01LM012918 and R01LM012973. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
¹ The 2021AB Metathesaurus contains approximately 4.54 million concepts and 16.5 million unique concept names from 220 source vocabularies. Searching over concept names therefore requires about 4 times more computation than searching over concepts.
COMPETING INTERESTS
None declared.
Contributor Information
Dongfang Xu, Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts, USA; Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA.
Timothy Miller, Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts, USA; Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA.
REFERENCES
- 1. Wu H, Toti G, Morley KI, et al. SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J Am Med Inform Assoc 2018;25:530–7.
- 2. Lependu P, Liu Y, Iyer S, et al. Analyzing patterns of drug use in clinical notes for patient safety. AMIA Jt Summits Transl Sci Proc 2012;2012:63–70.
- 3. Li Y, Salmasian H, Vilar S, et al. A method for controlling complex confounding effects in the detection of adverse drug reactions using electronic health records. J Am Med Inform Assoc 2014;21:308–14.
- 4. Topaz M, Lai K, Dowding D, et al. Automated identification of wound information in clinical notes of patients with heart diseases: Developing and validating a natural language processing application. Int J Nurs Stud 2016;64:25–31.
- 5. Shao Y, Mohanty AF, Ahmed A, et al. Identification and Use of Frailty Indicators from Text to Examine Associations with Clinical Outcomes Among Patients with Heart Failure. AMIA Annu Symp Proc 2016;2016:1110–1118.
- 6. Limsopatham N, Collier N. Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2016:1014–1023.
- 7. Sarker A, Belousov M, Friedrichs J, et al. Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H)-2017 shared task. J Am Med Inform Assoc 2018;25:1274–1283.
- 8. Karimi S, Metke-Jimenez A, Kemp M, et al. Cadec: A corpus of adverse drug event annotations. J Biomed Inform 2015;55:73–81.
- 9. Roberts K, Demner-Fushman D, Tonning JM. Overview of the TAC 2017 Adverse Reaction Extraction from Drug Labels Track. Text Anal Conf Proc 2017. https://tac.nist.gov/publications/2017/additional.papers/TAC2017.ADR_overview.proceedings.pdf?attredirects=0 (accessed 3 Oct 2021).
- 10. Doǧan RI, Leaman R, Lu Z. NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1–10.
- 11. Morgan AA, Lu Z, Wang X, et al. Overview of BioCreative II gene normalization. Genome Biol 2008;9:S3.
- 12. Li J, Sun Y, Johnson RJ, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford) 2016;2016:1–10.
- 13. D’Souza J, Ng V. Sieve-Based Entity Linking for the Biomedical Domain. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) 2015:297–302.
- 14. Ji Z, Wei Q, Xu H. BERT-based Ranking for Biomedical Entity Normalization. AMIA Jt Summits Transl Sci Proc 2020;2020:269–277.
- 15. Newman-Griffis D, Divita G, Desmet B, et al. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc 2020;00:1–17.
- 16. Li H, Chen Q, Tang B, et al. CNN-based ranking for biomedical entity normalization. BMC Bioinformatics 2017;18:385.
- 17. Jonnagaddala J, Jue TR, Chang NW, et al. Improving the dictionary lookup approach for disease normalization using enhanced dictionary and query expansion. Database (Oxford) 2016;2016:baw112.
- 18. Kate RJ. Normalizing clinical terms using learned edit distance patterns. J Am Med Inform Assoc 2016;23(2):380–386.
- 19. Leaman R, Wei CH, Lu Z. tmChem: A high performance approach for chemical named entity recognition and normalization. J Cheminform 2015;7(1):1–10.
- 20. Kang N, Singh B, Afzal Z, et al. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inform Assoc 2013;20(5):876–881.
- 21. Kate RJ. Clinical term normalization using learned edit patterns and subconcept matching: System development and evaluation. JMIR Med Inform 2021;9(1):e23104.
- 22. Jimeno Yepes A. Word embeddings and recurrent neural networks based on Long-Short Term Memory nodes in supervised biomedical word sense disambiguation. J Biomed Inform 2017;73:137–147.
- 23. Miftahutdinov Z, Tutubalina E. Deep Neural Models for Medical Concept Normalization in User-Generated Texts. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2019:393–399.
- 24. Niu J, Yang Y, Zhang S, et al. Multi-task Character-Level Attentional Networks for Medical Concept Normalization. Neural Process Lett 2018;49:1239–1256.
- 25. Lee K, Hasan SA, Farri O, et al. Medical Concept Normalization for Online User-Generated Texts. In: Proceedings of the 2017 IEEE International Conference on Healthcare Informatics (ICHI) 2017:462–469.
- 26. Tutubalina E, Miftahutdinov Z, Nikolenko S, et al. Medical concept normalization in social media posts with recurrent neural networks. J Biomed Inform 2018;84:93–102.
- 27. Nguyen TN, Nguyen MT, Dang TH. Disease Named Entity Normalization Using Pairwise Learning To Rank and Deep Learning. Technical report, VNU University of Engineering and Technology 2018. https://eprints.uet.vnu.edu.vn/eprints/id/eprint/3200/1/TechnicalReport_NguyenThanhNgan%20(1).pdf (accessed 3 Nov 2021).
- 28. Xu D, Gopale M, Zhang J, et al. Unified medical language system resources improve sieve-based generation and bidirectional encoder representations from transformers (BERT)-based ranking for concept normalization. J Am Med Inform Assoc 2020;27:1510–1519.
- 29. Xu D, Zhang Z, Bethard S. A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020:8452–8464.
- 30. Liu H, Xu Y. A deep learning way for disease name representation and normalization. In: National CCF Conference on Natural Language Processing and Chinese Computing 2017;10619.
- 31. Leaman R, Doǧan RI, Lu Z. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics 2013;29(22):2909–2917.
- 32. Mondal I, Purkayastha S, Sarkar S, et al. Medical Entity Linking using Triplet Network. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop 2019:95–100.
- 33. Schumacher E, Mulyar A, Dredze M. Clinical Concept Linking with Contextualized Neural Representations. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020:8585–8592.
- 34. Liu F, Shareghi E, Meng Z, et al. Self-alignment Pre-training for Biomedical Entity Representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2021:4228–4238.
- 35. Sung M, Jeon H, Lee J, et al. Biomedical Entity Representations with Synonym Marginalization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020:3641–3650.
- 36. Xu D, Bethard S. Triplet-Trained Vector Space and Sieve-Based Search Improve Biomedical Concept Normalization. In: Proceedings of the 20th Workshop on Biomedical Language Processing 2021:11–22.
- 37. Miftahutdinov Z, Kadurin A, Kudrin R, et al. Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer. arXiv preprint arXiv:2101.09311, 2021.
- 38. Priyatam N, Patil S, Palshikar G, et al. Medical Concept Normalization by Encoding Target Knowledge. In: Proceedings of the Machine Learning for Health NeurIPS Workshop, PMLR 2020;116:246–259.
- 39. Luo YF, Sun W, Rumshisky A. MCN: A comprehensive corpus for medical concept normalization. J Biomed Inform 2019;92:103132.
- 40. Wang F, Liu H. Understanding the Behaviour of Contrastive Loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021:2495–2504.
- 41. Gu Y, Tinn R, Cheng H, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Transactions on Computing for Healthcare (HEALTH) 2021;3(1):1–23.
- 42. Xiao T, Li S, Wang B, et al. Joint Detection and Identification Feature Learning for Person Search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017:3415–3424.
- 43. Wu Z, Xiong Y, Yu SX, et al. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018:3733–3742.
- 44. Wang F, Xiang X, Cheng J, et al. NormFace: L2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM International Conference on Multimedia (MM) 2017:1041–1049.
- 45. Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, PMLR 2020;119:1597–1607.
- 46. Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021:6894–6910.
- 47. Liu W, Wen Y, Yu Z, et al. SphereFace: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017:212–220.
- 48. Yan Y, Li R, Wang S, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) 2021:5065–5075.
- 49. Henry S, Wang Y, Shen F, et al. The 2019 national natural language processing (NLP) clinical challenges (n2c2)/Open health NLP (OHNLP) shared task on clinical concept normalization for clinical records. J Am Med Inform Assoc 2020;27:1529–37.
- 50. Deng J, Guo J, Xue N, et al. ArcFace: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019:4690–4699.
- 51. Tsujimura T, Mori N, Asada M, et al. TTI-COIN at n2c2 2019 Track 3: Neural Medical Concept Normalization with Two-Step Training [presentation]. 2019 n2c2/OHNLP Shared-Task and Workshop, Washington, DC, USA, Nov 15, 2019.
- 52. Kiela D, Bartolo M, Nie Y, et al. Dynabench: Rethinking Benchmarking in NLP. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2021:4110–4124.
