Abstract
We present a method to enrich a controlled medication terminology from free-text drug labels. This is important because, while controlled medication terminologies capture well-structured medication information, much of the information pertaining to medications is still found in free-text. First, we compared different Named Entity Recognition (NER) models, including rule-based, feature-based and deep learning-based models with Transformers, as well as ChatGPT, few-shot and fine-tuned GPT-3, to find the most suitable model for accurately extracting medication entities (ingredients, brand, dose, etc.) from free-text. Then, a rule-based Relation Extraction algorithm transforms the NER results into a well-structured medication knowledge graph. Finally, a Medication Searching method takes the knowledge graph and matches it to relevant medications in the terminology server. An empirical evaluation on real-world drug labels shows that BERT-CRF was the most effective NER model, with an F-measure of 95%. After performing term normalization, Medication Searching achieved an accuracy of 77% when matching a label to the relevant medication in the terminology server. The NER and Medication Searching models could be deployed as a web service capable of accepting free-text queries and returning structured medication information, thus providing a useful means of better managing medication information found in different health systems.
1. Introduction
Accurately recording and sharing medication information is paramount in modern, electronic health systems. As such, standardised medication terminologies such as RxNorm in the U.S. and the Australian Medicines Terminology (AMT) have been developed. These provide a rich source of medication information; however, much medication information is in free-text form and thus not mapped to a standardised medication terminology. Ideally, automated methods would be able to take an arbitrary piece of free-text mentioning a medication and map it directly to a relevant entry in a medication terminology.
Mapping free-text to a medication terminology is non-trivial. The problem can be broken into two stages. First, Named Entity Recognition, where free-text medication mentions are decomposed into their different entity types: active ingredients, dosage, unit of measure, brand, etc. Second, Medication Searching, where the aforementioned entities are searched and matched to their relevant entries in a controlled medication terminology (e.g., RxNorm or AMT).
Generally, automated approaches only tackle one of these two problems, not both. Additionally, the effectiveness of previous approaches is not high enough to meet the requirements for adoption in real, automated systems.
In this paper, we propose an end-to-end system that does both Named Entity Recognition (NER) and Medication Searching. The paper makes the following contributions:
A Named Entity Recognition (NER) method that converts a free-text medication mention into a structured knowledge graph representing the medication with all its relevant entities (ingredients, dose, etc.). We investigate a number of models for this, including BERT-based and GPT-based models.
A Medication Searching method that translates the knowledge graph from the NER model into a structured graph query; this query is then issued to a terminology server to retrieve candidate medication entries. Candidate entries are ranked before a final medication entry is selected as relevant.
A rigorous empirical evaluation of both the above approaches on real medications data and the Australian Medicines Terminology, the official standard for representing medication terminology data in Australia.
Our evaluation demonstrates that, using a NER model, medication entities can be extracted from free-text with high accuracy and very few errors. This allows one to efficiently extract medication entities (e.g., dose or ingredients) from free-text in real-world applications. Using Medication Searching, medications could be further matched to the medication terminology. This allows one to convert unstructured medication information into a controlled vocabulary, with all the benefits that this provides (interoperability, data quality, etc.). We also discuss how our proposed terminology extraction and matching approach can be used to help handle new medications by identifying when these need to be added to an existing medical terminology.
2. Problem Statement
The Australian Medicines Terminology (AMT)1 records approved medications in Australia in a knowledge graph database, in which the details of each medicine are represented through linked nodes with unique identifiers (IDs) to provide information at different levels of granularity. Figure 1 shows a simplified graph of those components for a registered medicine with the title: “AMOXIL amoxicillin 250mg (as trihydrate) capsule blister pack”.
Figure 1:
Linked components in the AMT knowledge graph model
To create a new medication in AMT, a clinical terminologist performs three manual steps: 1) break the new medication label into its constituent parts (brand, ingredients, etc.); 2) for each part identified, search to see if it already exists in AMT, mapping it to an AMT entry if it does and creating a new entry if it does not; 3) update AMT with the new medication, providing values for all its required attributes (Trade Product, Medicinal Product, Trade Product Unit of Use, Medicinal Product Unit of Use, Trade Product Pack, Medicinal Product Pack, Containered Trade Product Pack). The first two steps are the most laborious and impact the overall accuracy of the whole process. We aim to automate these first two steps via Named Entity Recognition and Medication Searching, respectively.
3. Related Work
Knowledge graphs provide structured, factual knowledge in a machine-readable form; they have been adopted in a wide range of applications, especially in the pharmaceutical and healthcare domains, where they can provide standardization and interoperability.
Knowledge graph construction from text is an information extraction pipeline that consists of ontology design for the domain knowledge, named entity recognition, and relation extraction among entities1. In our case, an ontology for medication information already exists in the AMT, so we do not need to tackle the end-to-end task of knowledge graph construction. However, when medication information is found in free-text, there remains the issue of how to populate the AMT ontology. For this, named entity recognition (NER) and relation extraction (RE) are still required and thus explored in this paper.
Previous work has explicitly looked at medication NER; however, most of it attempted to extract medication information from free-text such as doctor notes or discharge summaries from patient health records. MedEx2 and MedXN3 are typical rule-based systems designed for general-purpose medication information extraction. They mainly rely on semantic rules and lexical resources (such as RxNorm2) to build a parser that extracts medication information used in clinical settings. As such, some of the entity types are relevant to our interest (e.g., drug name, strength, form), some are redundant (e.g., frequency, intake time, duration, route, refill, dispense amount) and others are missing, such as brand name, container type, package information and details of strength with units of measure. To customize the types of named entities being extracted, machine learning based approaches have been investigated. Indeed, NER can be seen as a sequence labeling problem, where a machine learning model assigns a suitable tag/label to a span of tokens in a given text. Various machine learning models have been proposed for NER, ranging from CRF4 and RNN/CNN5,6 to recent state-of-the-art models with Transformers7,8. In this work, we implemented different NER approaches for comparison on our dataset, including a dictionary method, a feature-based method with the industrial-strength Spacy3, deep neural network based methods with BERT7, as well as zero-shot ChatGPT, few-shot and fine-tuned GPT based methods.
Similar to medication NER, most medication RE methods focus on the impact of drugs in clinical settings, such as drug-drug interactions, adverse drug effects, and drug targets for diseases/genes9,10, whereas we focus on extracting key drug details such as the strength and unit of measure for each active ingredient. There are two broad approaches to RE: syntactic patterns and supervised machine learning. Pattern-based approaches apply a dependency parser and rule-based approach to drug descriptions to identify links between medication entities11. Supervised approaches require extensive training data but can be done with many off-the-shelf learning algorithms12. Our approach is similar to the syntactic pattern approach, in that we define rules to detect relations between entities.
After its creation, a knowledge graph will continuously grow by incorporating new medications. This is a complex process known as knowledge graph expansion and enrichment13. The expansion can involve new entities, new types of relationships, or the discovery of missing links between entities. In our work, information on new medicines will strictly comply with the AMT data model, so we do not invent new relationships, but populate new data entries for new medicines and establish links between them and the existing nodes in the AMT knowledge graph.
Overall, our work tackles the end-to-end process of both identifying drug mentions in free-text via Named Entity Recognition, and mapping those to relevant entries in the medications terminology via Medication Searching, while also providing a mechanism for updating the terminology when new medications are released.
4. Methods
Our proposed approach works as follows: a free-text medicine description is input to the Named Entity Recognition (NER) component, which splits the text into separate segments and assigns a corresponding entity type to each of them. Those entities are fed into the Knowledge Graph Building module to build a drug-specific knowledge graph representation. Next, the output knowledge graph is passed to the Graph Query module to craft a graph query suitable for querying a medicine terminology server. Finally, a Search & Update component searches whether a similar medicine exists in the medicine terminology server; otherwise, it updates the server with the new medicine data.
4.1. Data Acquisition and Annotation
To provide training data for the NER model, we crawled 1450 drug textual labels from the Therapeutic Goods Administration (TGA) website.4 The TGA is Australia’s regulatory authority for therapeutic goods and is akin to the FDA in the U.S. Texts were manually annotated with the following entity types: (i) Brand captures the registered brand name of a medicinal trade product; (ii) Ingredient shows the main active ingredient contained in a medicine; (iii) Modification provides additional information about the active ingredient in a medicine; (iv) Strength is a numeric value showing the amount of an active ingredient in a medicine; (v) Unit refers to the unit of measure of the ingredient; (vi) Percentage is similar to Strength, but shows the percentage amount of an ingredient in a compound medicine; (vii) DoseForm is the physical form of a dose of a chemical compound used as a drug or medication intended for administration or consumption; and (viii) ContainerType is a pharmaceutical container that holds the drug and may be in direct contact with the product.
To facilitate training NER models, the annotated text was represented in Begin, In, Out (BIO) format: each token of text is assigned either B, for the beginning of an entity, I, for inside an entity, or O, for outside any entity. For example, for the segment “ENTRESTO 24 / 26”, the token “ENTRESTO” has the tag B-Brand (beginning of a Brand entity); “24”, “/” and “26” all have the tag I-Brand (still within the Brand entity). On completion of the process, every token in the text has a corresponding NER tag: B-X or I-X or O, where X is one of the aforementioned entity types, and O means no entity. The BIO tags then represent the different categories used to train a multi-class token classifier as the NER model.
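The span-to-BIO conversion can be sketched as follows. This is a minimal illustration; the tokens after the paper's “ENTRESTO 24 / 26” example are hypothetical additions for demonstration only.

```python
def to_bio(tokens, spans):
    """Convert token-level entity spans (start, end_exclusive, type) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # tokens still inside the entity
    return tags

tokens = ["ENTRESTO", "24", "/", "26", "sacubitril", "film-coated", "tablet"]
spans = [(0, 4, "Brand"), (4, 5, "Ingredient"), (6, 7, "DoseForm")]
print(to_bio(tokens, spans))
# ['B-Brand', 'I-Brand', 'I-Brand', 'I-Brand', 'B-Ingredient', 'O', 'B-DoseForm']
```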
4.2. Named Entity Recognition Methods
Named Entity Recognition methods fall into three main categories: non-learning methods, discriminative and generative machine learning models.
Non-learning methods typically take a dictionary approach. A vocabulary is constructed from different resources such as DrugBank5, TGA and SnomedCT6. Intuitively, these large medicinal resources may cover most words in medicine labels. The NER task can then be treated as string matching of terms from the dictionary against a given input text. A well-known solution uses a Trie data structure with the efficient Aho-Corasick search algorithm. In this paper, we call this the Dictionary method. A widely adopted example of this approach is the MetaMap system.
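The dictionary approach can be sketched with a greedy longest-match over whitespace tokens; a production system would instead compile the lexicon into a Trie and scan with Aho-Corasick as described above. The lexicon entries here are illustrative.

```python
def dictionary_ner(text, lexicon):
    """Greedy longest-match of lexicon terms (term -> entity type) over tokens."""
    tokens = text.split()
    results, i = [], 0
    while i < len(tokens):
        match_end = None
        # try the longest candidate phrase starting at token i first
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j]).lower()
            if phrase in lexicon:
                results.append((phrase, lexicon[phrase]))
                match_end = j
                break
        i = match_end if match_end is not None else i + 1
    return results

lexicon = {"amoxicillin": "Ingredient", "capsule": "DoseForm",
           "blister pack": "ContainerType"}
print(dictionary_ner("AMOXIL amoxicillin 250mg capsule blister pack", lexicon))
# [('amoxicillin', 'Ingredient'), ('capsule', 'DoseForm'), ('blister pack', 'ContainerType')]
```

Note that, as discussed in Section 5.1, such matching cannot distinguish the role a term plays in context (e.g., an Ingredient name embedded in a Brand name).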
Discriminative machine learning models treat NER as a classification task, in which each token of the given text is assigned one of the predefined labels. In this category, Spacy-NER and Transformer-based NER models have been investigated. Spacy7 is widely used for Natural Language Processing problems at industrial strength. It defines sophisticated feature extraction methods to train a classification model, and was designed to give a good balance of efficiency, accuracy and adaptability. State-of-the-art results for NER tasks have been achieved by Transformer models. We implement three popular Transformer models that all use Bidirectional Encoder Representations from Transformers (BERT)7; these are BERT-Softmax, BERT-CRF and BERT-BiLSTM-CRF. For all three, we use a pre-trained BioBERT14 model fine-tuned on our medications dataset.
The main difference between the three BERT-based NER models is the token classification layer on top of the BERT pre-trained model. In the BERT-Softmax model, the classification layer is a simple linear neural network with a Softmax function as the output. It projects each token’s embedding vector into a list of probabilities over all predefined labels, then returns the one with the highest probability. The intuition for BERT-Softmax is that, after fine-tuning, the output embedding vector for each token will capture the meaning of that token in the context of the input sentence. BERT-CRF uses a Conditional Random Field (CRF)4 layer instead of a simple linear neural network on top of BERT. CRF is popular in sequence labelling tasks: it does not directly predict a single label for each individual token, but predicts a whole sequence of labels for the given text. The intuition behind the CRF is that it favours valid consecutive patterns of labels and avoids invalid ones. BERT-BiLSTM-CRF adds a Bidirectional Long Short-Term Memory (BiLSTM) neural network layer between BERT and CRF. It aims to enrich the tokens’ embedding vectors before feeding them into the CRF module. The intuition is that, while BERT is a self-attention network, meaning every token attends to all other tokens in the input text, the BiLSTM network focuses on the propagation of attention across consecutive tokens on both sides. In other words, if BERT captures the global context of a token in the input text, BiLSTM captures the local context surrounding the token. The combination of BERT and BiLSTM should therefore, theoretically, improve the semantic representation of the token embedding vectors and the classification that follows.
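To make the CRF intuition concrete, the following sketch runs Viterbi decoding with a hand-written BIO transition constraint: even when I-Brand has the highest per-token score at the first position, the constraint forces a valid B-Brand then I-Brand sequence. The emission scores are made up for illustration, and a real CRF learns soft transition weights rather than hard constraints.

```python
import math

def viterbi(emissions, labels, allowed):
    """Decode the best label sequence given per-token label scores and a
    transition predicate allowed(prev, cur), as a CRF layer would."""
    # the sentence start behaves like following an O tag, so I-X cannot open a sequence
    best = [{lab: ((emissions[0].get(lab, -math.inf) if allowed("O", lab)
                    else -math.inf), None) for lab in labels}]
    for t in range(1, len(emissions)):
        row = {}
        for cur in labels:
            cands = [(best[t - 1][prev][0] + emissions[t].get(cur, -math.inf), prev)
                     for prev in labels if allowed(prev, cur)]
            row[cur] = max(cands) if cands else (-math.inf, None)
        best.append(row)
    lab = max(best[-1], key=lambda l: best[-1][l][0])   # best final label
    path = [lab]
    for t in range(len(emissions) - 1, 0, -1):          # backtrack
        lab = best[t][lab][1]
        path.append(lab)
    return path[::-1]

def allowed(prev, cur):
    # BIO constraint: an I-X tag must directly follow B-X or I-X
    if cur.startswith("I-"):
        return prev in (f"B-{cur[2:]}", cur)
    return True

labels = ["O", "B-Brand", "I-Brand"]
emissions = [{"O": 1.0, "B-Brand": 2.0, "I-Brand": 3.0},  # I-Brand scores highest but is invalid first
             {"O": 1.0, "B-Brand": 0.5, "I-Brand": 2.0}]
print(viterbi(emissions, labels, allowed))  # ['B-Brand', 'I-Brand']
```

A per-token Softmax would greedily emit the invalid I-Brand at the first position; the sequence-level decoding is what the CRF layer contributes.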
Generative machine learning models, instead of classifying each input token, generate a sequence of output tokens for a given text. We experiment with zero-shot learning, few-shot learning and a fine-tuned Generative Pre-trained Transformer (GPT-3) model. For few-shot learning, we provide sample prompts with a predefined format to GPT-3. Here is an example prompt: “Label: AMISOLAN 50 amisulpride 50 mg uncoated tablet blister pack” “Entities: AMISOLAN 50:Brand | amisulpride:Ingredient | 50:Strength | mg:Unit | uncoated:O | tablet:Doseform | blister pack:ContainerType”. In this way, the text label is segmented into different spans corresponding to different kinds of entities. For fine-tuning, we randomly collect 100 prompts with the predefined output format as training data for a fine-tuned GPT-3 model. The fine-tuned GPT-3 is then used to generate answers detecting the requested entities from the medicine label. Finally, for zero-shot learning, we experiment with prompting ChatGPT8 to answer what the brand name, active ingredients, strength, unit of measure, dose form and container type are in a given medicine label.
ChatGPT provided verbose responses that needed post-processing. While we do not suggest ChatGPT as an actual solution to this problem, we include it as a reference point to understand how it performs on this task.
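A few-shot prompt in the format above might be assembled as follows; the helper function and the trailing incomplete "Entities:" cue are our own illustration of the prompt structure, not the paper's exact implementation.

```python
def build_prompt(examples, query_label):
    """Assemble a few-shot prompt in the 'Label: ... / Entities: ...' format."""
    parts = []
    for label, entities in examples:
        ent_str = " | ".join(f"{span}:{etype}" for span, etype in entities)
        parts.append(f"Label: {label}\nEntities: {ent_str}")
    # leave the final Entities line open for the model to complete
    parts.append(f"Label: {query_label}\nEntities:")
    return "\n\n".join(parts)

examples = [("AMISOLAN 50 amisulpride 50 mg uncoated tablet blister pack",
             [("AMISOLAN 50", "Brand"), ("amisulpride", "Ingredient"),
              ("50", "Strength"), ("mg", "Unit"), ("uncoated", "O"),
              ("tablet", "Doseform"), ("blister pack", "ContainerType")])]
print(build_prompt(examples, "AMOXIL amoxicillin 250mg capsule blister pack"))
```

The prompt string would then be sent to the completion endpoint, and the generated "span:Type | ..." segments parsed back into entities.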
4.3. Knowledge Graph Building
NER captures entities and not the relationship between them (e.g., relationship between dosage and active ingredient). To capture this important information, we construct a knowledge graph (KG) from the NER output.
Consider the medication label “Estalis continuous 50/250 estradiol & norethisterone acetate 50/250mcg/day patch sachet”. To determine the relationships between ingredients, strength values, percentage values and units of measure, we define three heuristic rules. The Consistency rule implies that the same unit of measure applies to all ingredients in the same dose form. In this running example, the NER detects two Unit entities, namely “mcg” and “day”, separated by a delimiter symbol, i.e., “/”. So, they can be inferred as a composite unit of measure (“mcg/day”), which consists of a numerator unit (“mcg”) and a denominator unit (“day”). The Grouping rule states that a strength value and its unit of measure are always close to each other. The Ordering rule states that the relative order of ingredients and the relative order of their corresponding strengths and units of measure must be the same. According to the three rules, we can infer that the amount of use for “estradiol” is “50 mcg/day” and the amount of use for “norethisterone acetate” is “250 mcg/day” per serve. Figure 2 shows the individual knowledge graph for the example drug’s label.
Figure 2:
KG for “Estalis continuous 50/250 estradiol & norethisterone acetate 50/250mcg/day patch sachet”. Square graph nodes are extracted and inferred entities; rounded graph edge labels are the predefined entity types.
The knowledge graph provides a structured representation of a medication that was originally expressed in unstructured free-text. The knowledge graph can be easily interrogated to obtain specific information on the drug (e.g., specific dose information; or specific active ingredients). Beyond this though, it forms an intermediate step in then mapping/adding to a standardised medications terminology.
4.4. Medication Searching
Once a medication knowledge graph is constructed, it can be used to match against a medical terminology. Note that some entities in the graph may already exist in the terminology, while others may be new. (For example, the brand name may be new for a new drug but the active ingredient may already exist in AMT.) Medication Searching needs to determine what is existing information that needs to be mapped and what is new information that needs to be added. Medications in AMT are represented with underlying formal logic based on SNOMED CT. Thus, they can be interrogated with a corresponding graph query.
4.4.1. Graph Query Construction
To search AMT using a knowledge graph we have constructed, we need to convert it to a query in the SNOMED CT Expression Constraint Language (ECL)9. This process consists of two steps: Term normalization and Query translation.
Term normalization involves finding the corresponding concept in AMT for each textual value in each node of the individual knowledge graph. This task is solved using traditional keyword searching methods and string similarity measures. For trivial cases, the searched text can be found as an exact match to a concept in AMT. For non-trivial cases, we use Jaccard similarity to find the closest matching concept.
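A minimal sketch of term normalization with token-level Jaccard similarity follows; the candidate concept labels are illustrative, and a real implementation would search the full AMT concept set rather than a short list. The 0.9 threshold anticipates the relaxed matching constraint discussed below.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def normalise(term, concept_labels, threshold=0.9):
    """Map a free-text value to the closest concept label,
    or None if no candidate is similar enough."""
    best = max(concept_labels, key=lambda c: jaccard(term, c))
    return best if jaccard(term, best) >= threshold else None

print(normalise("blister pack", ["sachet", "blister pack", "patch"]))   # exact match
print(normalise("unknown thing", ["sachet", "blister pack", "patch"]))  # no match -> None
```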
Query translation involves converting edges in the drug’s individual knowledge graph into conditional statements in the graph query. For example, Figure 3 shows how we generate the ECL query for our earlier knowledge graph in Figure 2. The ECL query reads as follows: find all drugs belonging to the group of Containered trade product packs (CTPP) (1); where the brand name is Estalis Continuous 50/250 (2); where the container type is sachet (3); and where the medicinal unit of use (4) is made up of: (5) an active ingredient estradiol (6), an active ingredient norethisterone acetate (7), and the form is patch (8).
Figure 3:
The ECL query to find all drugs having properties described in the Figure 2.
The graph query must be carefully constructed. The more query conditions, the higher the precision of the results, but the greater the risk of reducing recall. The Term normalization’s Jaccard similarity controls the matching constraint; thus we relax the threshold for matching to 0.9 to capture very similar but not exactly identical matching conditions. Furthermore, we ignore conditions regarding strengths and units of measure, because the amount of an ingredient in a drug can be measured on different unit scales; for example, microgram vs. milligram. Instead, we run the query without considering such information, treat the results as a candidate set, and apply a ranking and matching step afterwards to get the final match.
4.4.2. Ranking with a Similarity Score
We compute a similarity score between the knowledge graph and each Containered trade product pack (CTPP) concept returned from the graph query. Four types of information influence similarity scoring: (i) the group of each ingredient with its corresponding strength and unit of measure; (ii) dose form; (iii) container type; and (iv) brand name. We used the Jaro-Winkler string similarity, which accounts for slight string variations.
To compute the similarity of groups of ingredient, strength and unit of measure, we first check whether the two groups have the same ingredient. If not, their similarity is defined as 0.0. Next, we standardise the units of measure so that the similarity score is the ratio of the strength values. For example, 0.05 g is converted to 50 mg and 10000 mcg is converted to 10 mg, so their similarity score equals 10/50 = 0.2. If the two groups use units that cannot be standardised to a common scale, their similarity is 0.
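The standardisation step can be sketched as follows, reproducing the 0.05 g vs. 10000 mcg example above. The conversion table is limited to mass units for illustration.

```python
TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001}  # mass units only, for illustration

def strength_similarity(val1, unit1, val2, unit2):
    """Ratio of two strengths after standardising both to milligrams;
    0.0 if either unit cannot be standardised to the common scale."""
    if unit1 not in TO_MG or unit2 not in TO_MG:
        return 0.0
    a, b = val1 * TO_MG[unit1], val2 * TO_MG[unit2]
    return min(a, b) / max(a, b)

print(strength_similarity(0.05, "g", 10000, "mcg"))  # 0.05 g = 50 mg vs 10 mg -> ~0.2
```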
The output of the similarity function is a list of four similarity scores corresponding to the four types of information mentioned above. The ranking of the results returned from the graph query is based on the following priority order: the highest priority is the similarity between groups of ingredients, strengths and units of measure; the second priority is the similarity of dose forms; then the similarity of container types; and the last priority is the similarity of brand names. The final matching results are those that have all similarity scores above specific threshold values.
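A sketch of the ranking and thresholding step follows. The field names are our own, and the threshold values are those later adopted in the evaluation (Section 5.2).

```python
def rank_candidates(candidates, thresholds):
    """Rank by the priority order (ingredient group, dose form, container
    type, brand name), then keep candidates above every threshold."""
    keys = ("ingredient", "doseform", "container", "brand")
    ranked = sorted(candidates, key=lambda c: tuple(c[k] for k in keys), reverse=True)
    return [c for c in ranked if all(c[k] >= thresholds[k] for k in keys)]

thresholds = {"ingredient": 0.9, "doseform": 0.5, "container": 0.9, "brand": 0.5}
a = {"id": "A", "ingredient": 1.0, "doseform": 0.8, "container": 1.0, "brand": 0.6}
b = {"id": "B", "ingredient": 1.0, "doseform": 0.9, "container": 0.4, "brand": 0.9}
print([c["id"] for c in rank_candidates([a, b], thresholds)])  # ['A'] (B fails container)
```

Sorting on the score tuple realises the priority order because Python compares tuples left to right: the ingredient score dominates, with the other scores breaking ties.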
5. Empirical Evaluation
Separate empirical evaluations were done for Named Entity Recognition and Medication Searching. In the first experiment, all the NER models mentioned in Section 4 were trained and tested on our dataset of 1450 drug labels. We randomly split the data into train, validation and test sets with a ratio of 0.7:0.15:0.15. In the second experiment, we evaluated how effective the Medication Searching process was at finding relevant concepts from the Australian Medicines Terminology given a free-text drug label. We collected 12,300 drug labels from the Therapeutic Goods Administration (TGA) website that contained at least one Containered trade product pack concept indexed in the Australian Medicines Terminology. Precision, recall and F-measure were the evaluation measures.
5.1. Named Entity Recognition Evaluation
Table 1 reports the effectiveness of all NER models. Overall, the most effective models were the discriminative, BERT-based models. The dictionary method had lower effectiveness; this method strongly depends on the size and the quality of the collected vocabulary. For example, with the huge number of terms collected for medicinal substances, the method may cover almost all Ingredient entities in the medicine labels, because each substance usually has only one international unique name. However, this does not apply to Brand names, which are proposed by the drug manufacturer without any standard editorial rules. This explains why the recall for Ingredient is high, whereas the recall for Brand is low. In addition, the string searching algorithm in the Dictionary method cannot distinguish different roles of a term in a given text. We noticed that in many cases a Brand name also contains an Ingredient name, which is a reason for decreased precision.
Table 1:
Effectiveness of different Named Entity Recognition Models on the medication extraction task: P = Precision; R = Recall; F =F-measure.
| Model | Spacy-NER | BERT-Softmax | BERT-CRF | BERT-BiLSTM-CRF | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tag type | n= | P | R | F | P | R | F | P | R | F | P | R | F |
| Brand | 216 | 84.79 | 85.19 | 84.99 | 81.58 | 86.11 | 83.78 | 85.14 | 87.50 | 86.30 | 83.26 | 85.19 | 84.21 |
| Ingredient | 215 | 91.63 | 91.63 | 91.63 | 91.74 | 93.0 | 92.38 | 93.02 | 93.02 | 93.02 | 93.81 | 91.63 | 92.71 |
| Modification | 46 | 95.65 | 95.65 | 95.65 | 91.84 | 97.83 | 94.74 | 88.24 | 97.83 | 92.78 | 93.88 | 100.00 | 96.84 |
| Strength | 232 | 98.28 | 98.28 | 98.28 | 98.70 | 97.84 | 98.27 | 98.70 | 97.84 | 98.27 | 99.15 | 100.00 | 99.57 |
| Unit | 237 | 98.71 | 97.05 | 97.87 | 98.73 | 98.73 | 98.73 | 99.15 | 98.73 | 98.94 | 98.33 | 99.16 | 98.74 |
| DoseForm | 202 | 93.50 | 92.57 | 93.03 | 92.59 | 99.01 | 95.69 | 94.69 | 97.03 | 95.84 | 94.26 | 97.52 | 95.86 |
| ContainerType | 192 | 98.43 | 97.92 | 98.17 | 97.95 | 99.48 | 98.71 | 97.40 | 97.40 | 97.40 | 97.44 | 98.96 | 98.19 |
| Percentage | 17 | 94.44 | 100.00 | 97.14 | 94.12 | 94.12 | 94.12 | 88.89 | 94.12 | 91.43 | 83.33 | 88.24 | 85.71 |
| Micro Average | 94.30 | 93.96 | 94.13 | 93.57 | 95.73 | 94.62 | 94.46 | 95.36 | 94.89 | 94.28 | 95.50 | 94.88 | |
| Model | Dictionary | Zero-shot ChatGPT | Few-shot GPT | Fine-tuned GPT | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tag type | n= | P | R | F | P | R | F | P | R | F | P | R | F |
| Brand | 216 | 57.06 | 44.91 | 50.26 | 55.50 | 58.33 | 56.88 | 80.82 | 81.94 | 81.38 | 83.82 | 79.17 | 81.43 |
| Ingredient | 215 | 51.64 | 95.35 | 66.99 | 66.80 | 73.95 | 70.20 | 89.43 | 94.42 | 91.86 | 94.34 | 93.02 | 93.68 |
| Modification | 46 | 72.73 | 52.17 | 60.76 | 67.34 | 71.74 | 69.47 | 62.16 | 100.00 | 76.67 | 75.93 | 89.13 | 82.00 |
| Strength | 232 | 98.20 | 93.97 | 96.04 | 76.09 | 75.43 | 75.76 | 95.36 | 97.41 | 96.38 | 97.01 | 97.84 | 97.42 |
| Unit | 237 | 97.88 | 97.47 | 97.67 | 77.89 | 66.97 | 72.02 | 98.71 | 97.05 | 97.87 | 98.28 | 96.20 | 97.23 |
| DoseForm | 202 | 65.74 | 70.30 | 67.94 | 69.44 | 61.88 | 65.44 | 83.08 | 80.20 | 81.61 | 87.08 | 90.10 | 88.56 |
| ContainerType | 192 | 96.79 | 94.24 | 95.51 | 65.28 | 65.63 | 65.45 | 87.89 | 86.98 | 87.43 | 97.88 | 96.35 | 97.11 |
| Percentage | 17 | 89.74 | 100.00 | 94.44 | 38.89 | 61.88 | 65.44 | 82.35 | 82.35 | 82.35 | 77.78 | 82.35 | 80.00 |
| Micro Average | 78.21 | 82.17 | 78.96 | 67.85 | 67.04 | 67.44 | 88.52 | 90.27 | 89.23 | 92.40 | 91.97 | 92.15 | |
Among the generative methods, few-shot learning, which involved providing only a few examples, performed quite well considering how little training data was provided. However, the benefit of adding more training data can be seen in the fine-tuned GPT results, which exhibit a definite improvement over few-shot GPT. Zero-shot ChatGPT produced the lowest results. However, considering that the language model underlying ChatGPT was trained for the general domain and does not require any training samples for this task, it may have potential. We note that one of the main issues with ChatGPT was not that it failed to identify the relevant information, but rather that its output contained a lot of other information that impacted the downstream evaluation.
Among the discriminative models, we posit that the features extracted by the BERT models generalise better than the Spacy features; this would need further investigation to be definitive. Moreover, the three BERT-based models use the same pre-trained BioBERT base model, so differences in effectiveness were the result of the model architecture used in fine-tuning. BERT-CRF and BERT-BiLSTM-CRF achieved higher precision than BERT-Softmax but slightly reduced recall. This implies that the CRF layer improves the accuracy of token prediction in comparison with a simple linear neural network layer. Overall, the BERT-CRF model was the most effective. We thus select BERT-CRF as the NER model to be used next in the Medication Searching process.
5.2. Medication Searching Evaluation
Recall that executing the graph query results in a candidate set of medications that is then ranked according to four similarity heuristics: Ingredient similarity, Dose Form similarity, Container Type similarity and Brand Name similarity. Only candidates above a certain threshold similarity score are marked as matching. Finding appropriate threshold values for these similarity scores is a non-trivial problem. Our approach was a greedy algorithm that determines the threshold value for each type of similarity as follows.
For drug matching, the active ingredients and their amount of use serve as the most important information. Therefore, a threshold value for Ingredient similarity was determined first. Figure 4a shows the impact of the threshold value for Ingredient similarity on effectiveness, while keeping a threshold value of 0.5 for the rest. Noticeably, as the threshold moves beyond 0.9, recall slightly increases then stabilises, but precision starts dropping. Therefore, we chose threshold = 0.9 as the cut-off point for Ingredient similarity.
Dose form is the next priority. Figure 4b shows the impact of the threshold value for Dose Form similarity on effectiveness, while still keeping a threshold value of 0.5 for the Container Type and Brand Name similarities. F-measure stabilises in the range 0.5 to 0.55. Therefore, we chose threshold = 0.5 as the cut-off point for Dose Form similarity.
Figure 4c shows the impact of changing the threshold value for Container Type similarity. The tipping point was threshold = 0.9, so we chose this as the cut-off point for Container Type similarity.
Figure 4d is similar to Figure 4b. Therefore, we chose threshold = 0.5 as the cut-off point for Brand Name similarity.
Figure 4:
Impact of changing threshold values (x-axis) on matching effectiveness (y-axis).
Following this investigation of the impact of threshold values for each of the four similarity types, the following configuration was adopted: 0.9 for Ingredient similarity, 0.5 for Dose Form similarity, 0.9 for Container Type similarity and 0.5 for Brand Name similarity.
Another factor that heavily impacted matching effectiveness was the combination of Term normalization and Graph Query construction. An incorrect code assigned to a text value in the individual knowledge graph, which was constructed from a given drug’s label, would lead to an incorrect condition being added to the graph query and consequently wrong answers, or no answer returned. To investigate the impact of this combination on the matching results, we simulated this process by running different templates of the graph query. This is essentially an ablation study to see which types of information most impacted matching effectiveness.
The graph query templates were constructed as follows. The basic template includes only declarations of the constituent ingredients of the drug (we denote this the I condition). Other templates were built by adding combinations of dose form (F), container type (C) and brand name (B) conditions. For example, the template IF means the graph query contains condition statements about only ingredients and dose form. It simulates a situation where Term normalization cannot find an appropriate code for either the container type or the brand name.
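The template set can be enumerated as the mandatory I condition plus every combination of the optional F, C and B conditions:

```python
from itertools import combinations

def query_templates():
    """Enumerate templates: ingredients (I) always present, plus any
    combination of dose form (F), container type (C), brand name (B)."""
    optional = "FCB"
    return ["I" + "".join(combo)
            for r in range(len(optional) + 1)
            for combo in combinations(optional, r)]

print(query_templates())
# ['I', 'IF', 'IC', 'IB', 'IFC', 'IFB', 'ICB', 'IFCB']
```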
Figure 5 shows effectiveness for different graph query templates. For precision, Brand Name (B) was paramount. This is understandable, since many drugs share the same active ingredient (e.g., there are 1362 medications containing “paracetamol” in the form of “tablet” and contained in a “blister pack”) and without the particular brand name there are many false positives, resulting in poor precision. For recall, ingredients (I) were important. In contrast, using Dose Form (F) tended to reduce recall by making the query too specific.
Figure 5: Matching results of different graph query templates.
The best overall was to make use of just ingredient and brand (IB). This combination meant that a wide variety of medications were retrieved using the ingredient (high recall); then the addition of brand helped to narrow that list down to the relevant medication (and avoid false positives).
5.3. Discussion
Some interesting points need further discussion: (i) what factors impact matching effectiveness; (ii) why Dose Form does not help matching; and (iii) what happens if the medication does not exist in the medical terminology.
Medication Searching effectiveness depends on three factors: the accuracy of the NER model, the accuracy of the Term normalization, and the choice of similarity measures (with threshold cut-offs). Recall is heavily impacted by entities detected by the NER model that do not exactly match the equivalent concept in the medical terminology. For example, the NER recognizes “DBL” as a Brand Name for the drug label “DBL MORPHINE SULFATE 10mg/1mL Injection BP”, but Term normalization could not find its corresponding concept in the medical terminology. This is because the concept’s Brand Name is “Morphine Sulfate (DBL)”, which has a low Jaro-Winkler similarity score (0.45) with “DBL”. This example is a false negative, so it reduces recall.
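For illustration, the low score in the DBL example can be approximated with a plain Jaro-Winkler implementation. This is a sketch of the standard algorithm, not the system's actual code; after lowercasing, the score comes out at about 0.46, in line with the 0.45 reported above.

```python
# Plain-Python Jaro-Winkler sketch (standard algorithm; not the paper's
# exact implementation). Strings are lowercased before comparison.
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                 # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    t, k = 0, 0                                # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    s1, s2 = s1.lower(), s2.lower()
    j = jaro(s1, s2)
    prefix = 0                                 # common prefix, capped at 4
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("DBL", "Morphine Sulfate (DBL)"), 2))  # ~0.46
```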
Another source of errors was normalisation of strength and unit of measure. For example, the NER model detected “mcg” and “day” as shown in Figure 2, but the Term normalization failed to assign codes to “mcg” and “day”, so the ranking algorithm could not compute a similarity score for Ingredient in this case. This is another example of a false negative that decreased recall. Similarly, for Percentage entities that represent the strength of an ingredient, the NER model does not handle scale normalisation for percentages, again reducing recall.
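One remedy is a small alias table for units plus explicit scale normalisation for percentage strengths. The mapping below is a hypothetical sketch, not the system's actual table, and a real system would map to the terminology's own unit concepts.

```python
# Hypothetical unit-normalisation sketch. The alias table and target
# names are illustrative; a production system would map tokens to the
# terminology's unit-of-measure concepts instead of plain strings.
UNIT_ALIASES = {
    "mcg": "microgram",
    "ug": "microgram",
    "mg": "milligram",
    "g": "gram",
    "ml": "millilitre",
}

def normalise_unit(token: str):
    """Return a canonical unit name, or None if the token is unknown."""
    return UNIT_ALIASES.get(token.strip().lower())

def percent_to_mg_per_ml(percent_w_v: float) -> float:
    """Scale-normalise a % w/v strength: 1% w/v = 1 g/100 mL = 10 mg/mL."""
    return percent_w_v * 10.0

print(normalise_unit("MCG"))       # 'microgram'
print(percent_to_mg_per_ml(0.5))   # 5.0  (0.5% w/v = 5 mg/mL)
```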
Overall, the NER model was very effective; matching errors were more the result of the similarity measure not being a good discriminator to avoid retrieving non-relevant matches (i.e., false positives). Our similarity score is a simple heuristic that combines ingredient, dose, container and brand information: simple Jaro-Winkler similarity is used for string comparisons and simple normalisation is done for numeric values. A more principled approach would be to train a specific machine learning model to predict the score. Such a model would use features from both the knowledge graph and the candidate medications to derive a more accurate similarity score. In fact, the score itself is only used to rank the candidates so that a single medication can be selected as the final match; ranking is what we care about. There are learning-to-rank models designed specifically for such use cases15; we intend to implement one as part of future work.
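To indicate what a learned ranker might look like, the sketch below trains a pairwise linear ranker (a perceptron-style simplification of learning-to-rank, not part of the evaluated system) on hypothetical per-candidate similarity features.

```python
# Minimal pairwise ranking sketch: learn weights w so that relevant
# candidates score above non-relevant ones. Features and training pairs
# are hypothetical; a production system would use a learning-to-rank
# library with features from the knowledge graph and candidates.
def train_pairwise(pairs, n_features, epochs=50, lr=0.1):
    """pairs: list of (better_features, worse_features) tuples."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            if margin <= 0:  # perceptron-style update on violated pairs
                w = [wi + lr * d for wi, d in zip(w, diff)]
    return w

def score(w, feats):
    return sum(wi * f for wi, f in zip(w, feats))

# Feature order: [ingredient_sim, dose_form_sim, brand_sim]
pairs = [([0.9, 0.6, 0.9], [0.9, 0.7, 0.3]),   # brand should dominate
         ([1.0, 0.5, 0.8], [0.6, 0.9, 0.2])]
w = train_pairwise(pairs, n_features=3)
ranked = sorted([[0.9, 0.7, 0.3], [0.9, 0.6, 0.9]],
                key=lambda f: score(w, f), reverse=True)
print(ranked[0])  # the high-brand-similarity candidate ranks first
```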
The Medication Searching approach provides an additional service not considered in this paper but important in medications management: that of adding new medications to an existing terminology. For a new drug not recorded in the medications terminology, we can still process its label via the NER model and get a knowledge graph with all its relevant types (ingredients, dose, etc.). This can be converted to a graph query and issued to a terminology server.
If no matches are found and manual review identifies it as a new medication, it is trivial to add it to the terminology, as we have already captured its underlying structure in the graph query. Thus the NER and Medication Searching methods have the additional benefit of keeping the medication terminology up-to-date as new medications come on the market. To fully evaluate this, we would need to establish a dataset specifically containing new medications and evaluate our methods on it. This is an active line of future work.
6. Conclusion
We presented a method to match free-text medication mentions (e.g., drug labels) to medications terminologies. The approach has two parts: Named Entity Recognition and Medication Searching. The NER model extracts relevant entities (ingredients, dose, brand, etc.) from free-text and represents these as an individual knowledge graph. We trained Spacy-NER, BERT-based and GPT-3-based NER models, pre-trained on biomedical text and fine-tuned on real drug labels. The BERT-based models were accurate at extracting a range of entity types (precision and recall ≈ 95%). For Medication Searching, the knowledge graph from the NER was used to generate a graph query to search for relevant medications in a medications terminology. To differentiate between many similar medications, we proposed a similarity score to determine the final matching medication. Empirical evaluation was done using real drug labels from Australia’s regulatory authority, the TGA (akin to the FDA). Medication Searching is best done using ingredient and brand information (precision 77% and recall 76%). Similarity scoring is one source of errors that could be improved by training a specific machine learning model to predict better scores. Valuable medications information is recorded in structured medical terminologies such as RxNorm and AMT, but medications information such as drug labels is often found in unstructured, free-text descriptions. Methods to match this free-text to a medical terminology help facilitate accurate recording and sharing of medications information, which is paramount in modern, electronic health systems.
References
- 1. Zhong Lingfeng, Wu Jia, Li Qian, Peng Hao, Wu Xindong. A comprehensive survey on automatic knowledge graph construction. arXiv preprint arXiv:2302.05019. 2023.
- 2. Xu Hua, Stenner Shane P, Doan Son, Johnson Kevin B, Waitman Lemuel R, Denny Joshua C. MedEx: a medication information extraction system for clinical narratives. JAMIA. 2010;17(1):19–24. doi: 10.1197/jamia.M3378.
- 3. Sohn Sunghwan, Clark Cheryl, Halgrim Scott R, Murphy Sean P, Chute Christopher G, Liu Hongfang. MedXN: an open source medication extraction and normalization tool for clinical text. JAMIA. 2014;21(5):858–865. doi: 10.1136/amiajnl-2013-002190.
- 4. Lafferty John D., McCallum Andrew, Pereira Fernando C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML ’01. 2001. pp. 282–289.
- 5. Lample Guillaume, Ballesteros Miguel, Subramanian Sandeep, Kawakami Kazuya, Dyer Chris. Neural architectures for named entity recognition. CoRR, abs/1603.01360. 2016.
- 6. Chiu Jason P.C., Nichols Eric. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics. 2016;4:357–370.
- 7. Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT. Minneapolis, Minnesota. June 2019. pp. 4171–4186.
- 8. Gutierrez Bernal Jimenez, McNeal Nikolas, Washington Clayton, Chen You, Li Lang, Sun Huan, Su Yu. Thinking about GPT-3 in-context learning for biomedical IE? Think again. EMNLP. December 2022. pp. 4497–4512.
- 9. Bose Priyankar, Srinivasan Sriram, Sleeman William C., Palta Jatinder, Kapoor Rishabh, Ghosh Preetam. A survey on recent named entity recognition and relationship extraction techniques on clinical texts. Applied Sciences. 2021;11(18).
- 10. Perera Nadeesha, Dehmer Matthias, Emmert-Streib Frank. Named entity recognition and relation detection for biomedical information extraction. Frontiers in Cell and Developmental Biology. 2020;8. doi: 10.3389/fcell.2020.00673.
- 11. MacKinlay Andrew D., Verspoor Karin M. Extracting structured information from free-text medication prescriptions using dependencies. Data and Text Mining in Biomedical Informatics. New York, NY, USA: ACM; 2012. pp. 35–40.
- 12. Soares Livio Baldini, FitzGerald Nicholas, Ling Jeffrey, Kwiatkowski Tom. Matching the blanks: Distributional similarity for relation learning. ACL. Florence, Italy. July 2019. pp. 2895–2905.
- 13. Yoo SoYeop, Jeong OkRan. Automating the expansion of a knowledge graph. Expert Syst. Appl. 2020;141(C).
- 14. Lee Jinhyuk, Yoon Wonjin, Kim Sungdong, Kim Donghyeon, Kim Sunkyu, So Chan Ho, Kang Jaewoo. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–1240. doi: 10.1093/bioinformatics/btz682.
- 15. Liu Tie-Yan. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval. 2009;3(3):225–331.