J Am Med Inform Assoc. 2024 Jun 27;31(8):1725–1734. doi: 10.1093/jamia/ocae159

Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools

Tom M Seinen, Jan A Kors, Erik M van Mulligen, Peter R Rijnbeek
PMCID: PMC11258409  PMID: 38934643

Abstract

Objective

To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora.

Materials and Methods

Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English.

Results

The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision.

Discussion

Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools.

Conclusion

This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.

Keywords: named entity recognition, clinical concept extraction, machine learning, natural language processing, text mining, corpus annotation

Introduction

Electronic health records (EHRs) have become an invaluable source of real-world data for observational research, offering insights into disease prevalence, patient outcomes, and treatment effectiveness.1,2 While structured data, such as coded conditions, measurements, and prescriptions, are frequently used for analysis, a significant portion of valuable patient information remains locked within free text, such as nursing and physician notes.3,4 The extraction of information from these unstructured data in a structured manner, such as standardized clinical concepts from the Unified Medical Language System (UMLS),5 can greatly enhance observational research by providing additional rich, detailed clinical information at scale.4,6,7 Numerous tools for this natural language processing (NLP) task of clinical concept extraction, which consists of both named entity recognition (NER) and named entity linking (NEL), have been developed for English clinical texts,8–10 including tools such as cTAKES,11 MetaMap,12 QuickUMLS,13 and MedCAT,14 cloud-based tools,15 and tools using generative large language models (LLMs).16

However, the need for concept extraction tools and their validation extends beyond English,10 particularly with the rise of real-world data utilization in observational clinical research across the multilingual continent of Europe,17 as seen in initiatives like the European Medical Information Framework (EMIF),18 the European Health Data & Evidence Network (EHDEN),19 and the Data Analytics and Real World Interrogation Network (DARWIN EU).20 Utilizing unstructured data in large-scale analyses within standardized frameworks, such as the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM),21,22 highlights the importance of reliable information extraction across languages. Nevertheless, the landscape of concept extraction tools for relatively small non-English languages, such as Dutch, remains underdeveloped, and the few tools currently available for Dutch clinical text, including adapted versions of QuickUMLS6 and MedCAT,23 have not been publicly evaluated. At the same time, while it is not uncommon for extraction tools to lack validation,24 most English extraction tools are validated on public corpora annotated with clinical concepts, for example, i2b2,25 ShARe/CLEF,26 and MedMentions.27 While benchmarks exist for various other Dutch NLP tasks,28 the absence of Dutch annotated clinical corpora poses a significant challenge for validating and comparing extraction tools in this language.29,30

Creating an annotated clinical corpus in any language is resource-intensive, requiring significant labor to manually annotate numerous clinical texts in great detail.10,31 The use of pre-trained LLMs for data augmentation and generation to create new annotated corpora has been proposed as an alternative to the manual annotation effort.32–34 For instance, a recent study demonstrated that LLM data generation can produce clinical texts in German, with entities annotated according to broad semantic categories.35 Besides synthesizing new data, a scalable option that relies on the models' creativity and domain knowledge, LLMs also enable data augmentation, notably by translating existing English corpora into other languages.34,36 While machine translation using LLMs has significantly improved in recent years,28,37 merely translating the clinical texts of an annotated corpus is insufficient because the word locations of clinical entities within the text shift during translation, causing the loss of annotation information tied to specific text locations.38 Although these annotations could be manually repositioned or aligned using secondary word alignment software after translation,36,38,39 we propose a method that preserves annotation locations during translation by embedding the annotations within the text before translation and retrieving them afterward.

Our study investigates the feasibility of validating non-English, specifically Dutch, concept extraction tools using English annotated corpora translated via machine translation with embedded annotations. We evaluate 2 English concept extraction tools, adapted to Dutch, on 2 annotated English corpora and their Dutch translations, as well as on a multilingual annotated corpus, and compare the tools' concept extraction performance between the languages.

Methods

Experimental setup

The experimental setup consisted of 2 main parts. The first part was the corpus translation and preparation phase, in which 3 publicly available annotated corpora were standardized to the same format; this included translating the English corpora into Dutch while preserving annotations and creating training and test sets. The second part involved applying and evaluating 2 concept extraction tools on the test sets; the tool that supported supervised training also used the training sets. The setup is visualized in Figure 1.

Figure 1. Schema of the experimental setup: (1) preparation and translation of 3 different corpora (ShARe/CLEF, MedMentions, and Mantra) and (2) training, application, and evaluation of 2 concept extraction tools (MedCAT and MedSpaCy). CUI = concept unique identifier.

Corpora

The annotated corpora used in our study include the MedMentions corpus (MM),27 the corpus from the ShARe/CLEF eHealth evaluation lab task 2 (SC),26 and the multilingual Mantra corpus (MT).29 MM is a comprehensive biomedical corpus containing 4392 abstracts from PubMed, annotated with concepts across a wide range of biomedical semantic types. SC is a corpus derived from 432 clinical notes and is designed to facilitate tasks related to understanding clinical text, including entity recognition and normalization. The multilingual MT corpus provides annotations of 200 short texts from different parallel corpora (Medline abstract titles [MDL] and sentences of drug labels from the European Medicines Agency [EMA]), in multiple languages, including English and Dutch. All 3 corpora feature annotations that link text spans to a UMLS concept, identified by a Concept Unique Identifier (CUI). To facilitate uniform analysis, all corpora were standardized into the same tabular format. This involved separating the text documents (Attributes: DocumentId, Text) and the concept annotations (Attributes: DocumentId, CUI, SpanStart, SpanEnd, SpanText). The SC corpus is pre-partitioned into training and test sets, whereas for MM and MT, we randomly allocated 80% of the data for training and 20% for testing.
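As an illustration, a minimal sketch of this standardized tabular format in Python; the column names follow the text above, while the example rows and the pandas representation are our own assumptions:

```python
import pandas as pd

# Standardized format described above: documents and annotations in separate tables.
documents = pd.DataFrame(
    [{"DocumentId": "doc-1",
      "Text": "Temporary kidney enlargement in the newborn infant"}]
)
annotations = pd.DataFrame(
    [{"DocumentId": "doc-1", "CUI": "C0542518",
      "SpanStart": 10, "SpanEnd": 28, "SpanText": "kidney enlargement"},
     {"DocumentId": "doc-1", "CUI": "C0021289",
      "SpanStart": 36, "SpanEnd": 50, "SpanText": "newborn infant"}]
)

# Sanity check: the span offsets must recover the span text.
for row in annotations.itertuples():
    doc_text = documents.loc[documents.DocumentId == row.DocumentId, "Text"].iloc[0]
    assert doc_text[row.SpanStart:row.SpanEnd] == row.SpanText
```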

Corpus translation

To develop the Dutch corpora of annotated clinical texts, we used the English annotated corpora as a starting point. Directly translating an English text would yield a Dutch text, but the exact locations of the annotations would be lost. To address this, we propose a method that preserves the locations of annotated concepts in 3 steps (see Table 1 for an example). First, annotations are embedded directly in the clinical text by enclosing the text span and the CUI in square brackets, ie, [[text span][CUI]]. Next, this text with embedded annotations is translated using machine translation, keeping the annotations intact. Finally, the annotations are extracted from the translated text using a simple regular expression pattern ("\[\[([^\]\[]*)\]\[(C[0-9]*)\]\]"), yielding separate text documents and annotations again.

Table 1. Example phrase from the MT corpus illustrating the steps in the in-text annotation translation process.

| Process step | Literal text | Annotations (CUI: text span [index span]) |
|---|---|---|
| Original text with separate annotations | Temporary kidney enlargement in the newborn infant | C0542518: kidney enlargement [10-28]; C0021289: newborn infant [36-50] |
| Text with embedded annotations | Temporary [[kidney enlargement][C0542518]] in the [[newborn infant][C0021289]] | |
| Translated text with embedded annotations | Tijdelijke [[niervergroting][C0542518]] bij de [[pasgeboren baby][C0021289]] | |
| Translated text with extracted annotations | Tijdelijke niervergroting bij de pasgeboren baby | C0542518: niervergroting [11-25]; C0021289: pasgeboren baby [33-48] |
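A minimal sketch of the embedding and extraction steps in Python, using the regular expression given in the text; the helper names and span bookkeeping are our own:

```python
import re

# Extraction pattern as given in the text.
ANNOTATION_PATTERN = re.compile(r"\[\[([^\]\[]*)\]\[(C[0-9]*)\]\]")

def embed_annotations(text, annotations):
    """Embed (CUI, start, end) annotations as [[span][CUI]] markers.
    Working right-to-left keeps earlier character offsets valid."""
    for cui, start, end in sorted(annotations, key=lambda a: a[1], reverse=True):
        text = text[:start] + f"[[{text[start:end]}][{cui}]]" + text[end:]
    return text

def extract_annotations(translated):
    """Strip the markers from the translated text and recompute span offsets."""
    parts, annotations, last = [], [], 0
    for m in ANNOTATION_PATTERN.finditer(translated):
        parts.append(translated[last:m.start()])
        start = sum(len(p) for p in parts)
        parts.append(m.group(1))
        annotations.append((m.group(2), start, start + len(m.group(1))))
        last = m.end()
    parts.append(translated[last:])
    return "".join(parts), annotations

# The Table 1 example round-trips to the offsets shown there.
text, spans = extract_annotations(
    "Tijdelijke [[niervergroting][C0542518]] bij de [[pasgeboren baby][C0021289]]"
)
assert spans == [("C0542518", 11, 25), ("C0021289", 33, 48)]
```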

To experimentally assess the impact of translation in this process, we utilized and compared 2 different machine translation services: the Cloud Translation API from Google (referred to as Google) and the GPT-4 Turbo (gpt-4-0125-preview) API from OpenAI (referred to as GPT).40 Google's service offered direct machine translation of documents. In contrast, GPT, a generative text model, required a system prompt, in addition to the document text, to guide a zero-shot translation process:

“Translate the document to Dutch (Nederlands). Keep the formatting the same, including the in-text annotations: [[span][code]].”
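A minimal sketch of this zero-shot translation call, assuming the OpenAI Python client; the function wrapper and temperature setting are our own choices, not taken from the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Translate the document to Dutch (Nederlands). Keep the formatting the same, "
    "including the in-text annotations: [[span][code]]."
)

def translate_document(document_text: str) -> str:
    """Zero-shot translation of one document with embedded annotations."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",  # GPT-4 Turbo snapshot named in the text
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": document_text},
        ],
        temperature=0,  # our assumption, for more reproducible output
    )
    return response.choices[0].message.content
```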

This approach allowed us to compare a traditional translation service with a state-of-the-art generative text model in preserving annotated concept locations during translation. We evaluated the quality of the Google and GPT translations by assessing their similarity to each other and to manual Dutch translations in the Mantra corpus. We used the bilingual evaluation understudy (BLEU) algorithm41 and the character n-gram F-score (chrF)42 as translation evaluation metrics. Furthermore, to evaluate the quality of annotation preservation, we compared descriptive statistics like document size and the number of annotations before and after translation. Additionally, we quantified formatting errors, ie, the erroneous placing of brackets in the translated text, by counting the CUIs in the final translated text, as an annotation with a wrong bracket pattern is not extracted, and its CUI remains in the text.
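The paper does not name its BLEU/chrF implementation; below is a sketch using the sacrebleu package with an invented sentence pair. sacrebleu reports scores on a 0-100 scale, so we rescale to match Table 2:

```python
import sacrebleu

hypothesis = "Tijdelijke niervergroting bij de pasgeboren baby"   # machine translation
references = ["Tijdelijke niervergroting bij de pasgeborene"]     # manual reference

bleu = sacrebleu.sentence_bleu(hypothesis, references).score / 100
chrf = sacrebleu.sentence_chrf(hypothesis, references).score / 100
print(f"BLEU: {bleu:.2f}, chrF: {chrf:.2f}")
```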

Concept extraction tools

In our study, we validated and compared 2 concept extraction tools: MedSpaCy (https://github.com/medspacy/medspacy) and the Medical Concept Annotation Toolkit (MedCAT; https://github.com/CogStack/MedCAT). These Python tools, both open source and publicly available, were initially designed for extracting concepts from English texts and have been adapted for Dutch.6,23

MedSpaCy extends the spaCy software library for clinical NLP tasks, including clinical concept extraction. It uses an adaptation of QuickUMLS, a tool for fast, unsupervised biomedical concept extraction based on string similarity and a reference concept dictionary. For English concept extraction, we utilized all UMLS concepts with English terms. For Dutch concept extraction, we used all Dutch vocabularies from UMLS. We replaced the English Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary with the Dutch SNOMED CT translation (https://github.com/mi-erasmusmc/medspacy_dutch), maintained by NICTIZ, the Dutch National IT Institute for Healthcare (https://www.snomed.org/member/netherlands). If no Dutch version of a UMLS vocabulary existed, we kept the English one, as many concepts, such as drug and laboratory concepts, are language-independent. Further details on QuickUMLS settings are provided in Table S1, and detailed information on the concept dictionaries is available in Table S2.
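A minimal sketch of Dutch concept extraction with the standalone quickumls package that MedSpaCy builds on; the index path is hypothetical, and building the index from the Dutch UMLS vocabularies is a separate preprocessing step:

```python
from quickumls import QuickUMLS

# Hypothetical path to a QuickUMLS index built from the Dutch UMLS vocabularies,
# including the Nictiz SNOMED CT translation.
matcher = QuickUMLS("/data/quickumls_dutch_index")

text = "Tijdelijke niervergroting bij de pasgeboren baby"
for match_group in matcher.match(text, best_match=True, ignore_syntax=False):
    for candidate in match_group:
        print(candidate["cui"], candidate["term"], candidate["similarity"])
```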

MedCAT is an entity recognition and linking tool that employs context similarity based on word embeddings for concept recognition and disambiguation.14 It allows for both unsupervised training on unannotated clinical texts and supervised training on annotated texts. In this study, we used the publicly available pre-trained models for English and Dutch, developed using unsupervised training. The English model, trained on clinical notes from the Medical Information Mart for Intensive Care (MIMIC) III,43 used a subset of the English UMLS as its concept dictionary. The Dutch model was trained on Dutch medical Wikipedia articles, incorporating all UMLS concepts with Dutch descriptions and the Dutch SNOMED CT translation.23 Additionally, to showcase MedCAT's supervised training capabilities, we fine-tuned the pre-trained unsupervised models with supervised learning, creating a separate supervised MedCAT model for the training set of each corpus.
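A minimal sketch of applying a MedCAT model and fine-tuning it with supervised learning; the model pack and training file names are hypothetical:

```python
from medcat.cat import CAT

# Load a pre-trained model pack (the Dutch pack is distributed separately).
cat = CAT.load_model_pack("medcat_dutch_modelpack.zip")  # hypothetical filename

entities = cat.get_entities("Tijdelijke niervergroting bij de pasgeboren baby")
for ent in entities["entities"].values():
    print(ent["cui"], ent["source_value"], ent["start"], ent["end"])

# Supervised fine-tuning on an annotated training set (MedCATtrainer JSON export).
cat.train_supervised(data_path="mm_train_annotations.json")  # hypothetical filename
```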

To summarize, our analysis involves 3 types of concept extraction models: one MedSpaCy model and one unsupervised MedCAT model per language, plus 10 supervised MedCAT models: 3 for English (one per corpus) and 7 for Dutch (one per corpus translation, including the existing Dutch MT corpus).

Concept extraction evaluation

All models were applied to their respective test sets in both English and Dutch. We evaluated the extraction performance using precision, recall, and their harmonic mean, the F1 score.27 A concept was considered correctly extracted only if both the CUI and its location were accurately identified; all other predicted concepts were counted as false positives, and all unmatched reference concepts as false negatives. Furthermore, to compare overall performance across languages and concept extraction models, we mean-centered the evaluation results by corpus, subtracting the corpus mean metric value from each result to allow comparisons independent of the specific corpora. The statistical significance of differences between the distributions of metric values was assessed using Bonferroni-adjusted Wilcoxon tests.
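A minimal sketch of this scoring rule and the per-corpus mean-centering; the data values are illustrative only:

```python
import pandas as pd

def evaluate_exact(predicted, reference):
    """Precision, recall, and F1 with exact (CUI, start, end) matching."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)                      # true positives: exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = [("C0542518", 11, 25), ("C0021289", 33, 48)]
predicted = [("C0542518", 11, 25), ("C0036572", 0, 10)]
print(evaluate_exact(predicted, reference))   # (0.5, 0.5, 0.5)

# Mean-centering by corpus: subtract each corpus's mean metric value so that
# models and languages can be compared independently of corpus difficulty.
results = pd.DataFrame({"corpus": ["MM", "MM", "SC", "SC"],
                        "f1": [0.60, 0.50, 0.70, 0.40]})
results["f1_centered"] = results.f1 - results.groupby("corpus").f1.transform("mean")
```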

Results

Corpus translation

The 3 corpora, MM, SC, and MT, were processed and translated using the 2 machine translation services, Google and GPT, while preserving the annotations. Table 2 shows the translation performance on the MT corpus and the agreement between the Google and GPT translations across all corpora, as measured by average BLEU and chrF scores. The machine translations were reasonably close to the Dutch references in the MT corpus, with BLEU scores ranging from 0.28 to 0.39 and chrF scores around 0.70. The agreement between the Google and GPT translations was very good, with BLEU scores ranging from 0.46 to 0.58 and chrF scores between 0.75 and 0.87 across all corpora. Table S3 provides translation examples to give a sense of the individual BLEU and chrF scores, showing that machine translations with low similarity to a loosely translated reference could still be natural, of good quality, and in high agreement with each other. The MM and MT corpora translations are publicly available for transparency and reproducibility (https://github.com/mi-erasmusmc/DutchClinicalCorpora).

Table 2. Performance of the machine translation compared to the Dutch reference in the MT corpus, and the agreement between Google and GPT machine translations across all corpora, measured by average BLEU and chrF scores with standard deviations (SD).

| Corpus | Comparison | BLEU (SD) | chrF (SD) |
|---|---|---|---|
| MT EMA | Google—Reference | 0.39 (±0.31) | 0.71 (±0.16) |
| MT EMA | GPT—Reference | 0.37 (±0.28) | 0.69 (±0.14) |
| MT EMA | Google—GPT | 0.55 (±0.28) | 0.81 (±0.13) |
| MT MDL | Google—Reference | 0.28 (±0.32) | 0.73 (±0.17) |
| MT MDL | GPT—Reference | 0.28 (±0.33) | 0.73 (±0.18) |
| MT MDL | Google—GPT | 0.58 (±0.37) | 0.87 (±0.14) |
| MM | Google—GPT | 0.50 (±0.09) | 0.80 (±0.05) |
| SC test and train | Google—GPT | 0.46 (±0.11) | 0.75 (±0.04) |

The characteristics of the corpora and their translations are presented in Table 3. MM was the largest corpus and also had the highest annotation density, containing many annotations in relatively short texts, especially compared to the SC corpus. Translation from English to Dutch (with the annotations extracted) increased the number of characters per document, a pattern also visible in the existing Dutch translations in the multilingual MT corpus.

Table 3. Characteristics of the original English and the Dutch translated corpora.

| Characteristic | Language | MM | SC train | SC test | MT EMA | MT MDL |
|---|---|---|---|---|---|---|
| No. of documents | | 4392 | 299 | 133 | 100 | 100 |
| No. of unique UMLS semantic types | | 126 | 11 | 11 | 73 | 73 |
| Median no. of characters per document | English | 1493 | 2610 | 7370 | 97.5 | 54.5 |
| | Google | 1742 | 2764 | 7840 | 111 | 58.5 |
| | GPT | 1769 | 2727 | 7850 | 108.5 | 57.5 |
| | Dutch | | | | 115 | 59.5 |
| Median no. of annotations per document | English | 79 | 31 | 54 | 3 | 2 |
| | Google | 79 | 31 | 54 | 3 | 2 |
| | GPT | 75 | 31 | 52 | 3 | 2 |
| | Dutch | | | | 3 | 2 |
| Total no. of annotations | English | 352 496 | 8381 | 11 937 | 363 | 222 |
| | Dutch | | | | 364 | 214 |
| Total no. of missing annotations (% of annotations) | Google | 2877 (0.8%) | 54 (0.6%) | 52 (0.4%) | 6 (1.7%) | 0 (0.0%) |
| | GPT | 20 806 (5.9%) | 232 (2.8%) | 203 (1.7%) | 2 (0.6%) | 0 (0.0%) |
| No. of annotations missing due to formatting errors (% of missing annotations) | Google | 2753 (95.7%) | 51 (94.4%) | 29 (55.8%) | 6 (100.0%) | 0 (0.0%) |
| | GPT | 123 (0.6%) | 12 (5.2%) | 4 (2.0%) | 1 (50.0%) | 0 (0.0%) |

The quality of in-text annotation preservation was measured by the number of annotations missing after translation and by how many of these were missing due to formatting errors during translation. Overall, the preservation of annotations during translation was effective: the median number of annotations per document was the same as or close to that of the original English version. However, GPT translations exhibited the highest percentage of missing annotations, with 5.9% in MM and 2.8% in SC. Google translations performed better, with less than 1% missing annotations, most of which could be attributed to formatting errors. In contrast, GPT showed a very low rate of formatting errors, and its missing annotations were primarily due to outright loss of embedded annotations during translation: the annotations were ignored in the generated text while the sentence structure was kept intact. Table 4 presents examples of both types of annotation preservation errors. Upon further inspection of the missing annotations in the GPT translation, we found that GPT primarily struggled with annotations of verbal phrases or generic nouns, such as "investigates," "performed," "comparison," and "evaluation." These missing concepts were predominantly categorized under more generic semantic types like Functional Concept (3715 concepts, 18%), Qualitative Concept (3593, 17%), and Activity (1591, 8%), while semantic types related to diseases and medicine were barely affected. For a detailed breakdown of the missing annotations in GPT's MM translations by semantic type and its most frequent unpreserved concepts, see Table S4.

Table 4. Examples of sentences from the MM corpus with formatting errors and loss of annotations.

| Error type | English sentence | Translated sentence |
|---|---|---|
| Formatting error | … its [[symptoms][C1457887]] are broad and place [[patients][C0030705]] at crossroads between … | … de [[symptomen][C1457887]] zijn breed en plaatsen [[patiënten]][C0030705]] op het kruispunt tussen … |
| Formatting error | … of [[non-neoplastic kidney][C0022646]] could enable [[early identification][C0814435]] | … van [[niet-neoplastische nier][C0022646]] [[vroege identificatie]mogelijk zou kunnen maken][C0814435]] |
| Annotation loss | [[Selection][C1707391]] of the [[route][C0449444]] of [[hysterectomy][C0020699]] … | De keuze van de [[route][C0449444]] van [[hysterectomie][C0020699]] … |
| Annotation loss | … In addition, the [[bactericidal effect][C0544570]] was [[investigated][C1292732]] using a … | … Daarnaast werd het [[bacteriedodende effect][C0544570]] onderzocht met behulp van een … |

The text in italics indicates the affected annotation, and the bold text indicates the problem. The sentence translations are correct.

Concept extraction

In total, 30 model and corpus combinations were evaluated. The performances, measured by the F1 score (F), recall (R), and precision (P), are visualized in Figure 2; for the exact values, see Table S5. Overall, the concept extraction tools performed similarly on the English and the translated Dutch corpora. Despite some differences between English and Dutch within individual corpora, there were, on average, no significant language differences across corpora (Figure 3A). Additionally, there were no performance differences between models using the Dutch Google or GPT translations, or between the MT translations and the existing Dutch MT corpus.

Figure 2. Concept extraction performance per model type and corpus combination on the English version (blue) and the (translated) Dutch versions (orange) of the 3 main corpora, measured by the 3 metrics: F1 score (F), precision (P), and recall (R).

Figure 3. Performance comparison of (A) the different corpus languages and (B) the different concept extraction models independent of the corpora, using mean-centered metric value distributions. The dashed line indicates the mean center. Bonferroni-adjusted Wilcoxon test results between the distributions are shown above the boxplots when significant, where "*" indicates a P-value < .01. The points represent the underlying data.

Within each corpus, performance differences could be observed between the concept extraction model types. In the MM corpus, the supervised MedCAT model performed best according to the F1 score, followed by MedSpaCy, in both languages. The differences in performance could mainly be attributed to large differences in recall. Notably, the low recall of the unsupervised MedCAT models improved drastically with supervised learning. Furthermore, the high recall of the English MedSpaCy was not mirrored in the Dutch version, possibly due to the wide range of semantic types in MM and the lower number of concepts with Dutch terms in the Dutch MedSpaCy dictionary. In the MT corpus, the supervised MedCAT model again performed best, with MedSpaCy performing similarly in English, followed by unsupervised MedCAT in Dutch. Again, the strong performance of MedSpaCy in English could be attributed to its high recall, as precision differences were relatively small. Conversely, in the SC corpus, the differences between the model types were mainly due to differences in precision. Both unsupervised and supervised MedCAT models performed similarly, with the unsupervised model slightly outperforming the supervised one on the English corpus: although supervised training improved recall, it reduced precision. Figure 3B presents the mean-centered performance across corpora, showing that, on average, supervised models performed best. While MedSpaCy models had a high recall similar to supervised MedCAT, their precision was consistently lower.

Discussion

Evaluation of concept extraction using translated corpora

This study explored the feasibility of validating Dutch concept extraction tools on annotated corpora derived from translating existing English corpora. We validated 2 concept extraction tools in Dutch and English using 1 annotated multilingual corpus and 2 annotated English corpora. The results demonstrated the effective generation of Dutch annotated corpora through our proposed method, which preserves annotation location through translation, facilitating rapid, efficient, and accurate creation of Dutch corpora annotated with clinical concepts, without necessitating further post-processing for text alignment.36

We successfully utilized 2 machine translation services, Google and GPT, for corpus translation. While both provided good-quality translations, Google encountered more issues with annotation formatting, whereas GPT translations had a larger number of missing annotations, primarily involving verbal phrases and generic nouns. This issue was particularly notable in the MM corpus, which has a high annotation density. The exact reason for these missing annotations is unclear but may be related to the fact that such phrases and nouns are not typically annotated as clinical entities.

The translation process from English to Dutch did not significantly impact the performance of concept extraction, with models showing comparable effectiveness across languages and corpora. Moreover, no significant differences were observed in model performance between Google- and GPT-translated corpora or between the Dutch MT corpus and the MT translations. These results confirm the feasibility of accurately translating existing annotated corpora for multilingual use: the method proved effective for Dutch and, as it is not language-specific, can be expected to transfer to other languages.

When comparing the performance of the concept extraction models across the different corpora, we found that the supervised MedCAT model generally performed best. The fine-tuning of the unsupervised models using supervised learning showed much improvement, especially in the MM and MT corpora. These findings for MedCAT align with those reported by the authors.14 MedSpaCy demonstrated a high recall across all corpora but suffered from lower precision, likely due to its reliance on an extensive concept database that led to the extraction of many correct concepts alongside numerous unannotated ones.

Overall, this research enhanced our understanding of the challenges and opportunities in creating multilingual annotated clinical corpora and validating non-English concept extraction tools, contributing to clinical NLP and data harmonization to improve observational research.

Biomedical settings

In this study, we included 3 different annotated biomedical corpora commonly used for evaluating concept extraction. However, the specific settings of these corpora should be considered when evaluating concept extraction for practical use. For example, SC contains clinical notes from an American hospital EHR, which might differ from its Dutch counterpart due to differences in healthcare systems and practices. This presents an important limitation and potential source of bias when translating corpora: the context of one corpus might not transfer well to other biomedical settings, underscoring the importance of choosing a suitable corpus. Suitability can be assessed by comparing the corpus with texts in the target setting and language, focusing on differences in terminology, reporting practices, and healthcare delivery. Nonetheless, while translating existing English corpora provides a rapid method for generating new corpora in other languages, creating new corpora based on texts in the target language and, crucially, within the target setting remains preferable.

Translation and annotation methods

For corpus translation, we relied on 2 leading machine translation services. Initially, we explored more services but decided to focus on Google and GPT to keep the scope manageable and the narrative clear: Google was chosen for its widespread recognition and GPT for its state-of-the-art text interpretation and generation capabilities, alongside their relative cost-effectiveness. We also explored various methods for embedding annotations in text, such as using curly or angle brackets and Standard Generalized Markup Language (SGML). We found square brackets easy to implement and effective, with fewer formatting errors during translation than the SGML-based methods. Square brackets also appeared less frequently in the original text than other types of brackets, simplifying retrieval. However, the choice of embedding method might depend on the data, and a formal comparison could be conducted in future work.

Machine translation evaluation

We evaluated the accuracy of machine translations on the MT corpus, which includes manual Dutch translations for reference. We did not compare the MM and SC translations against a manual reference but observed high agreement between the Google and GPT translations. Moreover, the findings from the MT corpus are likely applicable to the other corpora, as recent studies have shown similar translation performance,44–46 and our own inspection and comparison during the study surfaced no issues. Despite this, machine translation is not infallible, and nuances or the naturalness of the text may be lost, potentially impacting the reliability of the annotated corpus. We assessed and quantified errors in the annotation-preserving translation process using simple metrics and published details on missing annotations for public scrutiny. With GPT, we employed only zero-shot prompting and observed good results despite occasional annotation losses and minimal formatting errors; GPT allows further improvement through techniques like prompt optimization and few-shot learning, highlighting its versatility. The Google translations exhibited more annotation formatting errors than GPT. While the Google translation model cannot be directly altered, addressing these errors with more elaborate regular expressions would further improve annotation preservation.

Concept extraction tool validation

We validated and compared 2 concept extraction tools chosen for their ease of use, integration of both NER and NEL, and availability in Dutch and English. While more advanced biomedical embedding models for NEL, such as BioLORD-2023-M47 and mSapBERT,48 exist, evaluating only the NEL task or integrating these embedding models into MedCAT or MedSpaCy was beyond this study's scope but remains interesting for future research. The performance of the concept extraction models was not optimal, with F1 scores of the best models ranging between 0.5 and 0.7 across corpora. Although we used default settings for the MedSpaCy and MedCAT models, further optimization could enhance performance. Moreover, our stringent evaluation required an exact match between the predicted and annotated CUI to be considered correct.27 We observed that many predicted concepts closely matched the annotated concepts and, in some instances, could be considered more accurate. For instance, the word "seizure" is annotated as C0036572: Seizures, but the model extracts the similar C4229252: Seizure-like activity. Similarly, the phrase "cocaine use" is annotated as C0009171: Cocaine Abuse, but the model extracts C3496069: Cocaine Use. A less strict evaluation based on close concept similarity, measured by hierarchical or concept-embedding distance, would therefore likely yield higher performance.

Future work

Future work should explore the generalization of our corpus translation method to languages beyond Dutch. While translating existing corpora offers an efficient alternative to creating new ones from scratch, comparing this method to others, such as exploring synthesizing corpora using LLMs to generate new CUI annotated data based on examples or combining multiple strategies, would be worthwhile. Our annotation preservation technique shows promise, but further research is needed to optimize its accuracy. Improvement could involve experimenting with various LLM models, employing one-shot or few-shot prompting, using more extensive prompts with more instructions, and fine-tuning models. A comparative analysis of our annotation preservation method with post-translation word alignment techniques, as proposed by others,36 would also be valuable. Lastly, others can use the translated corpora from our study to evaluate different concept extraction tools, train models, and adapt our translation approach for translating other corpora in various clinical settings and NLP tasks.

Conclusion

This study demonstrated the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English. The proposed method of preserving in-text annotations during translation through language models offers a promising alternative to post-translation realignment of words. The research extended to 3 different corpora, 2 machine translation services, and 2 extraction tools, showcasing the method's versatility and potential for multilingual clinical NLP advancement. While machine translation services like Google and GPT were effective in translating annotated clinical corpora, some issues were encountered, highlighting the need for ongoing optimization and error assessment. The comparison of concept extraction models showed that the supervised MedCAT model generally performed best, with MedSpaCy demonstrating high recall but lower precision. Future work should focus on generalizing the corpus translation method to other languages, optimizing annotation preservation techniques, and exploring different strategies for embedding annotations in text. Comparative analysis with post-translation word alignment techniques and further experimentation with various language models and prompting techniques could also enhance the accuracy and efficiency of concept extraction in multilingual settings. This study contributes valuable insights into expanding clinical data augmentation and concept extraction research for non-English languages, paving the way for more extensive multilingual clinical NLP applications and advancements in the field.

Supplementary Material

ocae159_Supplementary_Data

Contributor Information

Tom M Seinen, Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.

Jan A Kors, Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.

Erik M van Mulligen, Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.

Peter R Rijnbeek, Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.

Author contributions

Tom M. Seinen proposed the methodology, designed and implemented the study protocol, and performed the data analysis. Jan A. Kors, Erik M. van Mulligen, and Peter R. Rijnbeek provided critical feedback, helped interpret the results, and shaped the research and analysis. Tom M. Seinen wrote the article with valuable input from all other authors.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work has received support from the European Health Data & Evidence Network (EHDEN) project. EHDEN has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No. 806968. The JU receives support from the European Union's Horizon 2020 research and innovation program and EFPIA.

Conflicts of interest

None declared.

Data availability

The aggregated data used for generating the results, conclusions, and figures/tables in this study are available as supplementary data. The translated MedMentions and Mantra corpora are made publicly available to enhance transparency and reproducibility and facilitate further research (https://github.com/mi-erasmusmc/DutchClinicalCorpora). Due to access restrictions and a data use agreement with PhysioNet, the translated ShARe/CLEF corpus cannot be made publicly available.

References

1. Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.
2. Knevel R, Liao KP. From real-world electronic health record data to real-world results using artificial intelligence. Ann Rheum Dis. 2023;82(3):306-311.
3. Percha B. Modern clinical text mining: a guide and review. Annu Rev Biomed Data Sci. 2021;4:165-187.
4. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23(5):1007-1015.
5. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl_1):D267-D270.
6. Seinen TM, Kors JA, van Mulligen EM, et al. The added value of text from Dutch general practitioner notes in predictive modeling. J Am Med Inform Assoc. 2023;30(12):1973-1984.
7. Seinen TM, Fridgeirsson EA, Ioannou S, et al. Use of unstructured text in prognostic clinical prediction models: a systematic review. J Am Med Inform Assoc. 2022;29(7):1292-1302.
8. Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14-29.
9. Fraile Navarro D, Ijaz K, Rezazadegan D, et al. Clinical named entity recognition and relation extraction using natural language processing of medical free text: a systematic review. Int J Med Inform. 2023;177:105122.
10. AlShuweihi M, Salloum SA, Shaalan K. Biomedical corpora and natural language processing on clinical text in languages other than English: a systematic review. In: Al-Emran M, Shaalan K, Hassanien AE, eds. Recent Advances in Intelligent Systems and Smart Applications. Cham: Springer International Publishing; 2021:491-509. doi:10.1007/978-3-030-47411-9_27
11. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513.
12. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001:17-21.
13. Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, SIGIR; 2016.
14. Kraljevic Z, Searle T, Shek A, et al. Multi-domain clinical natural language processing with MedCAT: the Medical Concept Annotation Toolkit. Artif Intell Med. 2021;117:102083.
15. Bai L, Mulvenna MD, Wang Z, et al. Clinical entity extraction: comparison between MetaMap, cTAKES, CLAMP and Amazon Comprehend Medical. In: 2021 32nd Irish Signals and Systems Conference (ISSC). IEEE; 2021:1-6. doi:10.1109/ISSC52156.2021.9467856
16. Hu Y, Ameer I, Zuo X, et al. Zero-shot clinical entity recognition using ChatGPT. arXiv. 2023. arXiv:2303.16416.
17. Arlett P, Kjaer J, Broich K, et al. Real-world evidence in EU medicines regulation: enabling use and establishing value. Clin Pharmacol Ther. 2022;111(1):21-23.
18. Lovestone S; EMIF Consortium. The European Medical Information Framework: a novel ecosystem for sharing healthcare data across Europe. Learn Health Syst. 2020;4(2):e10214.
19. Gauffin O, Brand JS, Vidlin SH, et al. Supporting pharmacovigilance signal validation and prioritization with analyses of routinely collected health data: lessons learned from an EHDEN network study. Drug Saf. 2023;46(12):1335-1352.
20. European Medicines Agency. Data Analysis and Real World Interrogation Network (DARWIN EU). Accessed June 17, 2024. https://www.darwin-eu.org/
21. Overhage JM, Ryan PB, Reich CG, et al. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54-60.
22. Reich C, Ostropolets A, Ryan P, et al. OHDSI Standardized Vocabularies—a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc. 2024;31(3):583-590.
23. van Es B, Reteig LC, Tan SC, et al. Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods. BMC Bioinformatics. 2023;24(1):10.
24. Kersloot MG, van Putten FJ, Abu-Hanna A, et al. Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. J Biomed Semant. 2020;11(1):1-21.
25. Uzuner Ö, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc. 2011;18(5):552-556.
26. Mowery DL, Velupillai S, South BR, et al. Task 2: ShARe/CLEF eHealth evaluation lab 2014. In: Proceedings of CLEF 2014. 2014:31-42.
27. Mohan S, Li D. MedMentions: a large biomedical corpus annotated with UMLS concepts. arXiv. 2019. arXiv:1902.09476.
28. De Vries W, Wieling M, Nissim M. DUMB: a benchmark for smart evaluation of Dutch models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023:7221-7241.
29. Kors JA, Clematide S, Akhondi SA, et al. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. J Am Med Inform Assoc. 2015;22(5):948-956.
30. Névéol A, Dalianis H, Velupillai S, et al. Clinical natural language processing in languages other than English: opportunities and challenges. J Biomed Semant. 2018;9(1):1-13.
31. Patel P, Davey D, Panchal V, et al. Annotation of a large clinical entity corpus. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2018:2033-2041. doi:10.18653/v1/D18-1228
32. Anaby-Tavor A, Carmeli B, Goldbraich E, et al. Do not have enough data? Deep learning to the rescue! In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI; 2020:7383-7390. doi:10.1609/aaai.v34i05.6233
33. Schick T, Schütze H. Generating datasets with pretrained language models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2021:6943-6951. doi:10.18653/v1/2021.emnlp-main.555
34. Whitehouse C, Choudhury M, Aji AF. LLM-powered data augmentation for enhanced crosslingual performance. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023:671-686.
35. Frei J, Kramer F. Annotated dataset creation through large language models for non-English medical NLP. J Biomed Inform. 2023;145:104478.
36. Frei J, Kramer F. GERNERMED: an open German medical NER model. Software Impacts. 2022;11:100212.
37. Wang H, Wu H, He Z, Huang L, Church KW. Progress in machine translation. Engineering. 2022;18:143-153.
38. Gaschi F, Fontaine X, Rastin P, et al. Multilingual clinical NER: translation or cross-lingual transfer? In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2023:289-311. doi:10.18653/v1/2023.clinicalnlp-1.34
39. Frei J, Frei-Stuber L, Kramer F. GERNERMED++: semantic annotation in German medical NLP through transfer-learning, translation and word alignment. J Biomed Inform. 2023;147:104513.
40. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv. 2023. arXiv:2303.08774.
41. Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2002:311-318. doi:10.3115/1073083.1073135
42. Popović M. chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. Association for Computational Linguistics; 2015:392-395. doi:10.18653/v1/W15-3049
43. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):160035.
44. Aiken M. An updated evaluation of Google Translate accuracy. Stud Linguist Literature. 2019;3(3):253-260.
45. Jiao W, Wang W, Huang J-T, et al. Is ChatGPT a good translator? Yes with GPT-4 as the engine. arXiv. 2023. arXiv:2301.08745.
46. Son J, Kim B. Translation performance from the user's perspective of large language models and neural machine translation systems. Information. 2023;14(10):574.
47. Remy F, Demuynck K, Demeester T. BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights. J Am Med Inform Assoc. 2024. doi:10.1093/jamia/ocae029
48. Liu F, Vulić I, Korhonen A, et al. Learning domain-specialised representations for cross-lingual biomedical entity linking. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Short Papers). Association for Computational Linguistics; 2021:565-574.
