Journal of Biomedical Semantics. 2025 Aug 21;16:15. doi: 10.1186/s13326-025-00337-2

Mapping between clinical and preclinical terminologies: eTRANSAFE’s Rosetta stone approach

Erik M van Mulligen 1, Rowan Parry 1, Johan van der Lei 1, Jan A Kors 1
PMCID: PMC12372267  PMID: 40841963

Abstract

Background

The eTRANSAFE project developed tools that support translational research. One of the challenges in this project was to combine preclinical and clinical data, which are coded with different terminologies and granularities, and are expressed as single pre-coordinated, clinical concepts and as combinations of preclinical concepts from different terminologies. This study develops and evaluates the Rosetta Stone approach, which maps combinations of preclinical concepts to clinical, pre-coordinated concepts, allowing for different levels of exactness of mappings.

Methods

Concepts from preclinical and clinical terminologies used in eTRANSAFE have been mapped to the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). SNOMED CT acts as an intermediary terminology that provides the semantics to bridge between pre-coordinated clinical concepts and combinations of preclinical concepts with different levels of granularity. The mappings from clinical terminologies to SNOMED CT were taken from existing resources, while mappings from the preclinical terminologies to SNOMED CT were manually created. A coordination template defines the relation types that can be explored for a mapping and assigns a penalty score that reflects the inexactness of the mapping. A subset of 60 pre-coordinated concepts was mapped both with the Rosetta Stone semantic approach and with a lexical term matching approach. Both results were manually evaluated.

Results

A total of 34,308 concepts from preclinical terminologies (Histopathology terminology, Standard for Exchange of Nonclinical Data (SEND) code lists, Mouse Adult Gross Anatomy Ontology) and a clinical terminology (MedDRA) were mapped to SNOMED CT as the intermediary bridging terminology. A terminology service has been developed that returns dynamically the exact and inexact mappings between preclinical and clinical concepts. On the evaluation set, the precision of the mappings from the terminology service was high (95%), much higher than for lexical term matching (22%).

Conclusion

The Rosetta Stone approach uses a semantically rich intermediate terminology to map between pre-coordinated clinical concepts and a combination of preclinical concepts with different levels of exactness. The ability to generate not only exact but also inexact mappings makes it possible to relate larger amounts of preclinical and clinical data, which can be helpful in translational use cases.

Keywords: Terminology mapping, SNOMED CT, Coordination, eTRANSAFE

Introduction

Many data-driven science projects must deal with a variety of data from heterogeneous sources. Data can be standardized with a terminology that provides a list of uniquely identified concepts, where each concept is an abstract idea or general notion representing something in the real world. A concept is linked to one or more terms that are used to reference the concept. If formal relations between concepts are provided as well, we refer to a terminology as an ontology. In the eTRANSAFE project, a research project funded by the Innovative Medicines Initiative (IMI), the focus was on bringing together preclinical and clinical data sources to support translational drug safety assessment and research [1, 2]. A challenge in eTRANSAFE is that the preclinical data use different terminologies than the clinical data. Another difference between preclinical and clinical data is the coordination. For preclinical data, combinations of concepts from different preclinical terminologies are used to capture different aspects of the data (e.g., a finding and an organ/location), while for clinical data these aspects are pre-coordinated in one term. Terminology mapping, the process of linking concepts of one terminology to concepts from another terminology, may resolve these differences. In eTRANSAFE the preclinical data stores findings and specimens/organs as distinctive data items based on different terminologies whereas the clinical data has been expressed with terminology that pre-coordinates a finding and specimen/organ as a single term. In addition, for many preclinical terms, an exact clinical counterpart is not available. For instance, preclinical terms like tail (mapped to the broader SNOMED CT Lower body part structure concept) and rumen (mapped to the broader SNOMED CT Stomach structure concept) do not have an exact clinical equivalent. 
Differences in granularity of terminologies can also prevent the establishment of exact equivalents, although near equivalents may be available. A terminology mapping method that allows for mappings with a certain level of inexactness would therefore be helpful.

In the past, different approaches to terminology mapping have been described. One of the largest terminology resources in the medical domain is the Unified Medical Language System (UMLS) [3]. The UMLS Metathesaurus maps terms from different terminologies to a unique biomedical concept. This one-to-one mapping is done by lexical matching of a term with all terms already associated with the concept [4]. Lexical matching has limitations since similar terms may refer to different concepts and different terms may denote the same concept.

Fung et al. further exploited the UMLS and combined synonymity between terms with hierarchical and explicit mapping relations provided by some source terminologies contained in the UMLS to find additional concept mappings between different terminologies [5]. They evaluated the use of the UMLS for mapping from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) to the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) [6]. This approach of Fung is limited to one-to-one concept mappings in the clinical domain where the concepts are contained in the UMLS.

Mappings have also been created manually. An example of a large manual mapping effort was a task within the Innovative Medicine Initiative project WEB-RADR 2 to develop a one-to-one mapping between SNOMED CT and the Medical Dictionary of Regulatory Activities (MedDRA) [7, 8].

These mapping techniques have been used by providers of terminology repositories to cross-reference between terminologies. For example, in the National Center for Biomedical Ontology (NCBO) BioPortal, a set of mapping rules has been defined to create mappings automatically [9]. The ontology cross-reference service (OxO) developed by EMBL-EBI is a repository of mappings between existing ontologies obtained from multiple sources [10]. Mappings are discovered based on common annotation properties (external references) for concepts as provided by EMBL-EBI’s Ontology Lookup Service (OLS). The mappings in these systems link a single source concept to a single target concept and do not include one-to-many mappings.

An interesting approach to map pre-coordinated concepts from the Human Phenotype Ontology (HPO) to SNOMED CT was described by Dhombres et al. [11]. They used the ontological information in SNOMED CT to decompose the pre-coordinated HPO concepts and then created new post-coordinated concepts in SNOMED CT to allow one-to-one mapping. Thus, they were able to map 50% more concepts than without using decomposition [11].

Mapping between ontologies is a scientific field on its own. Various advanced and complex mapping strategies exist as described in surveys of Thiéblin et al. [12] and Ochleng et al. [13]. Since most of the terminologies used in our preclinical and clinical data sources lack semantic relations other than hierarchical relations, these advanced ontology mapping strategies will not work.

The coordination differences between preclinical and clinical data sources hamper the use of one-to-one terminology matching strategies, while the absence of semantics in the preclinical and clinical terminologies precludes the use of advanced ontology mapping approaches.

In this paper, we propose a terminology mapping method dubbed the Rosetta Stone approach, which uses an intermediary ontology that provides the semantics to bridge between combinations of preclinical concepts and pre-coordinated clinical concepts. As intermediary ontology, we selected SNOMED CT because of its large coverage and rich and well-defined semantic relations between concepts. The method also allows for inexact mappings with different levels of inexactness.

Methods and materials

eTRANSAFE databases

Within the eTRANSAFE project [1], the focus has been on integrating preclinical and clinical databases with the purpose of translational drug safety assessment and research. As preclinical databases, the eTOXsys database resulting from the eTOX project [14] and a SEND database with preclinical studies from the pharmaceutical companies participating in eTRANSAFE were used. The clinical databases contain adverse drug events data from ClinicalTrials.gov [15], the FDA Adverse Event Reporting System (FAERS) [16], drug label texts (DailyMed) [17], and published literature (Medline) [18].

Terminologies

The preclinical data are expressed as concepts from the histopathology terminology (HPATH) [19], the Mouse Adult Gross Anatomy Ontology (MA) [20], and the Clinical Data Interchange Standards Consortium (CDISC) Standard for Exchange of Nonclinical Data (SEND) controlled terminology [21]. The HPATH terminology was developed in the eTOX project [14, 19] and contains 1,047 concepts. The MA terminology is a publicly available resource that covers about 3,100 concepts [20]. The SEND terminology consists of various code lists. The data in eTRANSAFE predominantly use the Specimen, Laboratory test name, Neoplasm type, and Non-neoplastic finding type code lists, which contain about 3,700 concepts. The clinical data are expressed as concepts from the Medical Dictionary for Regulatory Activities (MedDRA) terminology, which contains about 80,000 concepts [22]. MedDRA concepts are organized in a five-level hierarchy: system organ classes, high-level group terms, high-level terms, preferred terms (PTs), and lowest-level terms (LLTs), where the LLTs mostly are synonyms of the PTs. The roughly 25,000 MedDRA PTs are used as mapping targets.

Mapping approach

Observations in the preclinical domain are often a combination of a finding and a specific organ (e.g., “Necrosis” and “Liver”). In the clinical domain, these observations are pre-coordinated findings and represent the combination of a morphologic abnormality and a location (e.g., “Hepatic necrosis”). To bridge the gap between preclinical data and clinical data and to be able to suggest alternative mappings, our approach was to map each terminology to an intermediate terminology, SNOMED CT, and use the rich semantic concept model of SNOMED CT to handle differences in coordination between preclinical and clinical concepts. An overview of the mapping approach is shown in Fig. 1. SNOMED CT is a large terminology that arranges over 450,000 medical concepts in a hierarchical structure. In addition, each concept has a semantic class label, e.g., disorder, body structure, or morphologic abnormality. Semantic relations such as finding site and associated morphology link concepts that have different semantic classes. We developed a coordination template to specify which SNOMED CT semantic classes can be combined in the mapping process and what hierarchical traversals can be made.

Fig. 1.

A schematic overview of the mapping approach. The preclinical database has fields that refer to different SEND terminologies (specimen and non-neoplastic finding). Combinations of concepts of these terminologies have been manually mapped to SNOMED CT concepts. Within SNOMED CT the relations defined in the coordination template and hierarchical relations are explored to find a SNOMED CT concept that has been mapped to a MedDRA concept. Relations are associated with different penalties that add up to a penalty score for a mapping. The clinical database refers to MedDRA pre-coordinated concepts. The path from a pair of SEND codes via SNOMED CT to a MedDRA code yields a mapping between preclinical and clinical records

Mappings to SNOMED CT

For mapping the preclinical MA, HPATH, and SEND terminologies to SNOMED CT we developed a tool called CodeMapper. CodeMapper can read the various preclinical terminologies and use the preferred term of a preclinical concept to lexically match against all terms in SNOMED CT. The tool shows for each preclinical concept a list of SNOMED CT terms ordered by lexical similarity, with the code, semantic class, and synonyms for each suggested SNOMED CT concept. Combinations of preclinical concepts can be expressed by coordinating a group of concepts with the AND operator. Multiple closely related inexact mappings can be specified using an OR operator. As an example, the HPATH concept Adenoma, leydig cell can be mapped to the SNOMED CT concepts Leydig cell hyperplasia OR Leydig cell tumor, benign. Each of these mappings is seen as a potential mapping and can be exploited to construct a mapping between a preclinical and clinical term. If an exact mapping does not exist, one can define an inexact mapping with a more general term or more specific term and indicate that the mapping is broader or narrower. For the clinical mappings from MedDRA to SNOMED CT, we used existing mappings available from OHDSI [23], the WEB-RADR 2 project [7], and the UMLS [3]. Both OHDSI and WEB-RADR 2 provide a manually curated concept mapping. Most of the available mappings are to MedDRA PTs, which are used in clinical databases, but some mappings are to MedDRA LLTs. Since each LLT is uniquely related to a single PT it was trivial to convert these mappings from SNOMED CT to MedDRA PTs.
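
The AND and OR operators described above can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class `ManualMapping`, its field names, and the relation label "unspecified" are hypothetical and do not describe CodeMapper's internal format. The two example entries are taken from the text; the relation type of the Leydig cell mapping is not given there, so it is left unspecified.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ManualMapping:
    """Hypothetical record for one manually created mapping to SNOMED CT."""
    source: str                                 # preclinical term
    alternatives: Tuple[Tuple[str, ...], ...]   # OR over AND-groups of SNOMED CT terms
    relation: str                               # "exact", "broader", "narrower", or "unspecified"


MAPPINGS = [
    # Two closely related alternatives combined with OR (example from the text).
    ManualMapping(
        "Adenoma, leydig cell",
        (("Leydig cell hyperplasia",), ("Leydig cell tumor, benign",)),
        "unspecified",  # assumption: the text does not state the relation type
    ),
    # A single, broader target (example from the Introduction).
    ManualMapping("tail", (("Lower body part structure",),), "broader"),
]


def potential_targets(mapping: ManualMapping) -> List[List[str]]:
    """Return each OR alternative as a list of AND-coordinated SNOMED CT terms."""
    return [list(group) for group in mapping.alternatives]


def is_inexact(mapping: ManualMapping) -> bool:
    """A mapping is treated as inexact unless it is explicitly marked exact."""
    return mapping.relation != "exact"
```

Each alternative returned by `potential_targets` would be a separate starting point for the search within SNOMED CT.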

Coordination template, hierarchies, and penalty scores

The search for a mapping from preclinical concepts to clinical concepts starts with a pair of finding and organ concepts from preclinical terminologies. If this pair is mapped to SNOMED CT, the search continues to find a traversal path from the SNOMED CT concepts to a clinical target concept. The coordination template defines which semantic relations can be explored within SNOMED CT and specifies the preferred path for mapping. The preferred path consists of an exact mapping from a pair of preclinical concepts to an initial pair of SNOMED CT concepts, followed only by relations defined in the coordination template that lead to a common final SNOMED CT concept that has been mapped to a MedDRA preferred term (PT). The preferred path has a penalty of zero. The coordination template associates each deviation from this preferred path with a cost. The penalty score of a path is the sum of the costs of all deviations from the preferred path. Paths can be sorted by penalty score, and a maximum penalty score can be specified as a parameter of the mapping algorithm. In an application this maximum penalty could be set by the user, affecting the precision and recall of returned mappings. Similarly, the coordination template can be used to search for reverse mappings from clinical terminology to preclinical terminology.

As an example, Fig. 2 shows how a SEND preclinical pair of concepts from the classes finding and organ is mapped to SNOMED CT concepts of the classes morphologic abnormality and body structure, respectively, which are then coordinated into a disorder concept and subsequently mapped to a MedDRA finding concept. The coordination template specifies that the semantic relations finding site and associated morphology can be explored to coordinate a set of preclinical concepts into a clinical concept. The coordination template limits the set of semantic relations that can be followed in SNOMED CT and thus reduces the computational search effort. If not all semantic relations as specified in the coordination template are satisfied, this may result in inexact mappings. For example, if the coordination with a body structure is missing, one or more mappings based only on the morphologic abnormality may be generated as inexact mappings. Inexact mappings can also originate from the mapping function following the is-a or subsumes semantic relations of a concept as obtained from the hierarchy of SNOMED CT.

Fig. 2.

An example of how an organ and finding concept from CDISC SEND can be mapped via SNOMED CT to a MedDRA finding concept. The red lines indicate the mappings provided to allow for CDISC SEND mapping to SNOMED CT and from SNOMED CT to MedDRA. The blue lines denote the semantic relations provided by SNOMED CT. The path shown from SEND via SNOMED CT to MedDRA is the preferred path as defined in the coordination template

Mappings with a penalty of zero are exact mappings, and mappings with a non-zero score are inexact mappings. The mapping shown in Fig. 2 is an exact mapping: the path between the preclinical concepts Liver and Necrosis and the clinical concept Hepatic necrosis has a penalty score of 0.
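
The preferred path for the Liver and Necrosis example can be sketched in a few lines. This is a toy illustration, not the eTRANSAFE implementation: the dictionaries, the function name `coordinate`, and the concept labels in the toy graph are assumptions made for this example, and only the zero-penalty path is covered.

```python
from typing import Dict, List, Tuple

# Toy stand-in for SNOMED CT disorder concepts and their defining relations.
# The labels are illustrative and not guaranteed to match SNOMED CT names.
DISORDERS: Dict[str, Dict[str, str]] = {
    "Necrosis of liver (toy disorder)": {
        "finding site": "Liver structure",
        "associated morphology": "Necrosis",
    },
    "Necrosis of kidney (toy disorder)": {
        "finding site": "Kidney structure",
        "associated morphology": "Necrosis",
    },
}

# Hypothetical manual mappings (SEND -> SNOMED CT) and existing mappings
# (SNOMED CT -> MedDRA PT).
SEND_TO_SNOMED = {"Liver": "Liver structure", "Necrosis": "Necrosis"}
SNOMED_TO_MEDDRA = {"Necrosis of liver (toy disorder)": "Hepatic necrosis"}


def coordinate(organ: str, finding: str) -> List[Tuple[str, float]]:
    """Follow the preferred path only and return (MedDRA PT, penalty) pairs."""
    site = SEND_TO_SNOMED.get(organ)
    morphology = SEND_TO_SNOMED.get(finding)
    results = []
    for disorder, relations in DISORDERS.items():
        # Both relations from the coordination template must be satisfied.
        if (relations["finding site"] == site
                and relations["associated morphology"] == morphology):
            preferred_term = SNOMED_TO_MEDDRA.get(disorder)
            if preferred_term is not None:
                results.append((preferred_term, 0.0))  # exact mapping
    return results
```

In this sketch `coordinate("Liver", "Necrosis")` yields the single exact mapping to Hepatic necrosis. Inexact mappings would require relaxing the two conditions and adding the corresponding penalties.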

The coordination template distinguishes between larger semantic inexactness (i.e., missing the organ or making a hierarchical step) and smaller inexactness, with penalty scores of 1.0 and 0.1, respectively. SNOMED CT contains three types of body structure concepts: the entire body structure, part, or structure (e.g., Entire liver, Liver part, or Liver structure). A preclinical organ concept is always mapped to the SNOMED CT structure concept. To find paths from a combination of preclinical concepts to a clinical concept, traversals between the three body structure concepts are allowed, incurring the small penalty score. Different types of inexactness are accumulated in an overall penalty score. The exact values of the penalties are somewhat arbitrary; their main purpose is to rank the mappings in terms of exactness. The penalties for different types of inexactness are listed in Table 1.

Table 1.

Overview of the penalties for different types of deviations from the preferred path as defined in the coordination template

| Major penalty (1 point) | Minor penalty (0.1 point) |
|---|---|
| Traversing one level up or down in the SNOMED CT hierarchy | Using broader-narrower mappings from the manual mappings |
| Ignoring body structures in combinations | Traversal from MedDRA lowest-level term (LLT) to preferred term (PT) |
| | Using MedDRA SNOMED CT mappings in the inverse direction |
| | Traversal within SNOMED CT between three types of body structure concepts (entire, part, and structure) |

The MedDRA-SNOMED CT mappings provided by WEB-RADR 2 are not symmetric: both directions have to be stated explicitly, since the reverse relation is not automatically implied. For example, the MedDRA concept Emotional lability (LLT; the corresponding PT used in our mapping is Affect lability) maps to the SNOMED CT concept Mood swings (finding), whereas the SNOMED CT concept Mood swings (finding) maps to the MedDRA concept Mood swings (LLT; the corresponding PT used in our mapping is Mood swings). In addition to such asymmetric mappings, some mappings are provided in one direction only. The coordination template can allow for automatic inference of the reverse relation with a penalty of 0.1 points.

The clinical sources within eTRANSAFE are based on MedDRA PTs, but some of the existing mappings map SNOMED CT concepts to MedDRA LLTs. Therefore, an additional step has been implemented to go from an LLT to a PT with a penalty score of 0.1. SNOMED CT contains three types of body structure concepts (entire, part, and structure), which are hierarchically related, e.g., Liver part and Entire liver are subsumed by Liver structure. Therefore, if, for a given organ, a hierarchical traversal is needed between the body structure concepts, a small penalty of 0.1 is added. The total penalty score is an indication of the exactness of the mapping. The mapping function will return all mappings that have a penalty score that does not exceed a user-specified threshold.

In Fig. 3, three examples of concept mappings with their associated penalty scores are shown. The first example shows how a SEND finding Necrosis without specification of an organ is mapped to the SNOMED CT Necrosis of anatomical site concept. The latter concept is mapped to the MedDRA term Necrosis and follows the preferred coordination path. The second example shows how the mapping follows an additional is-a relation in the SNOMED CT hierarchy. The total penalty for this mapping is 1.0. In the third example, an additional is-a relation from SNOMED CT is followed (+1.0 penalty) and the organ is ignored (+1.0 penalty), followed by a traversal in MedDRA from an LLT to a PT (+0.1 penalty), leading to a total penalty of 2.1. Penalty scores for mappings that do not include the organ are given as negative scores, to make it easy to distinguish the mappings that include an organ from those that do not. In this example, the mapping function therefore returns −2.1.
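
The arithmetic behind these three examples can be written down as a short sketch. The penalty values follow Table 1; the deviation labels, the dictionary `PENALTIES`, and the function `penalty_score` are hypothetical names introduced here for illustration and are not part of the terminology service.

```python
from typing import Iterable

MAJOR = 1.0  # major penalty from Table 1
MINOR = 0.1  # minor penalty from Table 1

# Hypothetical labels for the deviation types listed in Table 1.
PENALTIES = {
    "hierarchy_step": MAJOR,                   # one level up or down in SNOMED CT
    "organ_ignored": MAJOR,                    # body structure dropped from the combination
    "broader_narrower_manual_mapping": MINOR,
    "llt_to_pt": MINOR,                        # MedDRA LLT to PT traversal
    "inverse_meddra_snomed_mapping": MINOR,
    "body_structure_variant": MINOR,           # entire / part / structure traversal
}


def penalty_score(deviations: Iterable[str]) -> float:
    """Sum the costs of all deviations; negate the total if the organ was ignored."""
    deviations = list(deviations)
    total = round(sum(PENALTIES[d] for d in deviations), 1)
    return -total if "organ_ignored" in deviations else total


example_a = penalty_score([])                                              # preferred path
example_b = penalty_score(["hierarchy_step"])                              # one is-a step
example_c = penalty_score(["hierarchy_step", "organ_ignored", "llt_to_pt"])
```

With these inputs the three scores are 0.0, 1.0, and −2.1, matching the examples in Fig. 3. Rounding to one decimal is a choice made in this sketch to avoid floating-point noise when many minor penalties are summed.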

Fig. 3.

Three examples of concept mappings with their penalty scores: (a) for a SEND finding without specification of an organ (score 0.0), (b) for a SEND finding and organ with a SNOMED CT hierarchy traversal (score 1.0), and (c) for the same finding and organ with unsuccessful mapping of the organ (score 1.0), hierarchy traversal (score 1.0), and traversal from MedDRA LLT to PT (score 0.1), for a total score of 2.1 (negative because of the absent organ). The red lines indicate the mappings provided to allow for CDISC SEND mapping to SNOMED CT and from SNOMED CT to MedDRA. The blue lines denote the semantic relations provided by SNOMED CT

Rosetta stone approach

The Rosetta Stone approach to mapping between pre-coordinated clinical concepts and combinations of preclinical concepts has been implemented as a web service that is used in eTRANSAFE applications [24]. The service takes as input the concepts to be mapped, the source and target terminologies, and a maximum penalty score, and returns all mappings that have a penalty score less than or equal to the maximum. The web service generally returns results for a request within a second.

The terminology service uses a database schema based on the terminology section of the OMOP Common Data Model (CDM) maintained by the Observational Health Data Sciences and Informatics network (OHDSI) [25]. MedDRA and SNOMED CT were already available in the CDM schema. For the preclinical terminologies, a conversion script has been developed that converts and stores these terminologies and their mappings to SNOMED CT according to the CDM schema. The mappings obtained from WEB-RADR 2 and UMLS were imported into this database as well as the manual mappings created with CodeMapper.
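
To illustrate how mappings can be stored in a CDM-style vocabulary schema, the following sketch builds a heavily reduced version of the `concept` and `concept_relationship` tables in an in-memory SQLite database. It is an assumption-laden illustration: the real OMOP tables have more columns, the concept identifiers and the vocabulary label "SEND" are made up, and the helper `maps_to` is hypothetical. The paper does not describe the actual database engine or queries used by the service.

```python
import sqlite3

# In-memory database with a reduced, OMOP-like vocabulary schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept (
        concept_id    INTEGER PRIMARY KEY,
        concept_name  TEXT NOT NULL,
        vocabulary_id TEXT NOT NULL
    );
    CREATE TABLE concept_relationship (
        concept_id_1    INTEGER NOT NULL,
        concept_id_2    INTEGER NOT NULL,
        relationship_id TEXT NOT NULL
    );
    """
)

# Toy rows; all identifiers are invented for this example.
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [
        (1, "Liver", "SEND"),
        (2, "Necrosis", "SEND"),
        (3, "Liver structure", "SNOMED"),
        (4, "Necrosis", "SNOMED"),
    ],
)
conn.executemany(
    "INSERT INTO concept_relationship VALUES (?, ?, ?)",
    [(1, 3, "Maps to"), (2, 4, "Maps to")],
)


def maps_to(name, vocabulary):
    """Return (name, vocabulary) of all concepts the given concept maps to."""
    rows = conn.execute(
        """
        SELECT target.concept_name, target.vocabulary_id
        FROM concept AS source
        JOIN concept_relationship AS rel ON rel.concept_id_1 = source.concept_id
        JOIN concept AS target ON target.concept_id = rel.concept_id_2
        WHERE source.concept_name = ? AND source.vocabulary_id = ?
          AND rel.relationship_id = 'Maps to'
        ORDER BY target.concept_id
        """,
        (name, vocabulary),
    )
    return rows.fetchall()
```

Storing the manual mappings as ordinary relationship rows means that the same join pattern can serve preclinical and clinical terminologies alike.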

Mapping evaluation

For each of the two preclinical terminology sets, SEND and HPATH/MA, the Rosetta Stone terminology service was used to compile, for each concept, the best mappings to MedDRA, which were saved with their minimum penalty score. From these lists we randomly selected, per preclinical terminology set, 10 mappings with a minimum score of 0.0, 10 with a score of 0.1, and 10 with a score of 1.0 (6 sets with a total of 60 mappings).

Each mapping was reviewed independently by two of the authors (JK and EvM). Using the SNOMED CT browser and the MedDRA browser, each mapping was analyzed and, if a better match from MedDRA was available, annotated with that match. In only a few cases did the two reviewers have different annotations; consensus was reached for the whole set in an interactive session. Additionally, a lexical search based on the search endpoint of the UMLS Knowledge Source Server was used to find MedDRA concepts for each mapping [26]. The organ together with the finding was used as input for the search. If the results returned by the service contained one of the MedDRA concepts from the best mapping, it was considered a match.

Results

The number of concepts in the different terminologies and the number of these concepts that were mapped to SNOMED CT are shown in Table 2. Not all concepts from all terminologies have been mapped: for some concepts no matching equivalent could be found in SNOMED CT, and other concepts were not used in the databases and were excluded from the mapping process. For instance, the concept harderian gland from the MA terminology cannot be mapped to a SNOMED CT concept. Table 2 also shows the percentage of records in the databases that have mapped concepts: for all databases at least 87% of the records were mapped. We prioritized the mapping according to the frequency with which a concept occurred in the databases. The percentage of used concepts that were not mapped varies between 0 and 43%, but this affects only 0–13% of the database records. The number of concepts that are actually used in the (pre)clinical databases is much smaller than the total number of concepts. Since the manual mapping process is laborious, for the larger preclinical terminologies we focused on mapping concepts that occur in the preclinical and clinical data sources used in eTRANSAFE.

Table 2.

Number of mappings from different preclinical and clinical terminologies to SNOMED CT. The number of used concepts for HPATH and MA is based on eTOXsys, SEND code lists on the SEND database, and MedDRA on the clinical databases. The last column shows the percentage of records from these data sources with concepts that have been mapped

| Terminology | No. of concepts | No. of mapped concepts | No. of used concepts | No. of used concepts that were mapped (%) | Percentage of records with mapped concepts |
|---|---|---|---|---|---|
| Histopathology | 1,047 | 527 | 617 | 450 (73) | 87 |
| Mouse Adult Gross Anatomy | 3,143 | 1,148 | 558 | 558 (100) | 100 |
| SEND laboratory test | 2,551 | 834 | 471 | 268 (57) | 93 |
| SEND neoplastic finding | 307 | 307 | 113 | 113 (100) | 100 |
| SEND non-neoplastic finding | 309 | 208 | 155 | 135 (87) | 87 |
| SEND specimen | 518 | 518 | 184 | 184 (100) | 100 |
| MedDRA | 25,433 | 14,916 | 13,906 | 10,471 (75) | 97 |
| Total | 34,308 | 18,458 | 16,444 | 12,811 (78) | |

We determined how well these mappings from preclinical and clinical terminologies to the intermediary SNOMED CT can be combined into full mappings between preclinical and clinical concepts using the template-driven inference in SNOMED CT. The mapping function of the eTRANSAFE semantic service yields these full mappings with for each mapping a penalty score. Mappings for the preclinical databases could include combinations of a finding concept with an organ concept, i.e., combining a Histopathology finding with an MA organ concept, or combining a finding concept from SEND’s neoplasm or non-neoplastic terminologies with an organ concept from SEND specimen, and finding those MedDRA concepts that have a penalty score less than or equal to a user-specified threshold. Inversely, mappings from a clinical MedDRA concept can yield a combination of a preclinical finding and an organ concept. We heuristically set our maximum penalty score to less than three as mappings with larger penalties were considered semantically too distant. As an example, Table 3 shows all mappings for the SEND concepts Necrosis and Liver.

Table 3.

MedDRA concepts mapped from the combination of SEND concepts for organ Liver and finding Necrosis. A penalty score of 0 denotes exact mapping, other scores are inexact. Negative scores indicate that the mapping disregarded the organ

| MedDRA concept | Penalty score |
|---|---|
| Hepatic necrosis | 0 |
| Liver injury | 1 |
| Zieve syndrome | 1 |
| Hepatic cytolysis | 1 |
| Cirrhosis alcoholic | 1 |
| Hepatitis alcoholic | 1 |
| Fatty liver alcoholic | 1 |
| Hepatocellular injury | 1 |
| Alcoholic liver disease | 1 |
| Acute yellow liver atrophy | 1 |
| Hepatocellular damage neonatal | 1 |
| Traumatic liver injury | 1 |
| Hepatic lesion | 2 |
| Hepatic infarction | 2.1 |
| Necrosis | -1 |
| Gangrene | -2 |
| Infarction | -2 |
| Necrobiosis | -2 |
| Fat necrosis | -2 |
| Septic necrosis | -2 |
| Tumour necrosis | -2 |
| Necrosis ischaemic | -2 |
| Osteonecrosis | -2.1 |
| Tissue injury | -2.1 |
| Skin exfoliation | -2.1 |

The best mapping results for different combinations of source and target terminologies are shown in Table 4. Penalty scores were rounded off. A score of 0 is a perfect mapping and a score of Inf indicates that no mapping has been found or that the mapping has a penalty of three or more. The number of terms that cannot be mapped varies considerably (between 15% and 64%). The large number of unmapped MedDRA concepts can be explained by the fact that MedDRA is much larger than the preclinical terminologies and contains many concepts that do not have a preclinical counterpart. HPATH and SEND contain fine-grained concepts for describing morphological abnormalities in the preclinical domain, whereas MedDRA focuses on conditions. Many of the MedDRA (pre-coordinated) concepts are, however, not used in the clinical data sources as can be seen from Table 2.

Table 4.

Number of mappings from preclinical to clinical concepts and vice versa. The source for the mappings is based on all unique combinations of HPATH/MA and SEND terms in the preclinical databases and all MedDRA preferred terms. If multiple mappings for a source concept are available, only one mapping with the best (lowest) penalty score is counted. Penalty scores are rounded. Mappings with absolute penalty scores of three or more are combined with the missing mappings under inf. Negative penalties indicate that the organ is not mapped

| Source | Target | Score 0 (%) | Score 1 (%) | Score 2 (%) | Score -1 (%) | Score -2 (%) | Inf (%) | N |
|---|---|---|---|---|---|---|---|---|
| HPATH/MA | MedDRA | 686 (8) | 1,333 (16) | 1,125 (14) | 629 (8) | 2,053 (25) | 2,383 (29) | 8,209 |
| MedDRA | HPATH/MA | 2,856 (11) | 2,356 (9) | 1,101 (4) | 2,355 (9) | 1,039 (4) | 15,726 (62) | 25,433 |
| SEND | MedDRA | 428 (13) | 677 (20) | 653 (19) | 717 (21) | 372 (11) | 505 (15) | 3,352 |
| MedDRA | SEND | 2,697 (11) | 2,388 (9) | 1,192 (5) | 1,872 (7) | 1,050 (4) | 16,234 (64) | 25,433 |

The percentage of mappings with a penalty score of zero is relatively small but mapping coverage shows a considerable increase when allowing inexact mappings (with positive or negative scores of one or two). For example, for the mappings from MedDRA to HPATH/MA, mapping coverage increases from 11 to 38%, and for HPATH/MA to MedDRA from 8 to 71%. Analysis of a sample of mappings with negative scores – the mapping does not include the organ – made clear that preclinical terminologies are more focused on recording changes at the cellular level. In addition, clinical terminologies (MedDRA) include a substantial number of conditions that are difficult to observe in animals, such as headache, dizziness, and back pain. Also, clinical signs and symptoms that are difficult to link to a specific physical abnormality, such as diarrhea, vomiting, headache, or chest pain, could not be mapped. Table 5 shows the percentage of MedDRA concepts per system organ class that could be mapped to preclinical concepts. Not surprisingly, system organ classes with the lowest percentage of mapped concepts were Social circumstances and Product issues.
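
The coverage figures quoted above follow directly from the counts in Table 4: exact coverage is the number of score-0 mappings divided by N, and coverage with inexact mappings is N minus the Inf count, divided by N. A minimal sketch of this arithmetic, with the dictionary layout and function name chosen here for illustration:

```python
# Counts taken from Table 4: exact (score 0), unmapped (Inf), and total N.
TABLE_4 = {
    ("HPATH/MA", "MedDRA"): {"exact": 686, "inf": 2383, "n": 8209},
    ("MedDRA", "HPATH/MA"): {"exact": 2856, "inf": 15726, "n": 25433},
    ("SEND", "MedDRA"): {"exact": 428, "inf": 505, "n": 3352},
    ("MedDRA", "SEND"): {"exact": 2697, "inf": 16234, "n": 25433},
}


def coverage(row):
    """Return (exact coverage %, coverage % including inexact mappings)."""
    exact_pct = 100 * row["exact"] / row["n"]
    any_pct = 100 * (row["n"] - row["inf"]) / row["n"]
    return round(exact_pct), round(any_pct)


meddra_to_hpath = coverage(TABLE_4[("MedDRA", "HPATH/MA")])
hpath_to_meddra = coverage(TABLE_4[("HPATH/MA", "MedDRA")])
```

This reproduces the increase from 11 to 38% for MedDRA to HPATH/MA and from 8 to 71% for HPATH/MA to MedDRA.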

Table 5.

The percentage of mapped concepts per MedDRA system organ class. The maximum penalty score is between −2.0 and 2.0

| MedDRA system organ class | Mapped concepts (%) |
|---|---|
| Ear and labyrinth disorders | 79 |
| Cardiac disorders | 73 |
| Gastrointestinal disorders | 64 |
| Eye disorders | 64 |
| Endocrine disorders | 64 |
| Hepatobiliary disorders | 62 |
| Respiratory, thoracic and mediastinal disorders | 60 |
| Musculoskeletal and connective tissue disorders | 59 |
| Nervous system disorders | 56 |
| Renal and urinary disorders | 54 |
| Vascular disorders | 53 |
| Reproductive system and breast disorders | 53 |
| Skin and subcutaneous tissue disorders | 51 |
| Blood and lymphatic system disorders | 50 |
| Infections and infestations | 46 |
| Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 44 |
| Immune system disorders | 44 |
| Congenital, familial and genetic disorders | 39 |
| Pregnancy, puerperium and perinatal conditions | 34 |
| Metabolism and nutrition disorders | 31 |
| Surgical and medical procedures | 30 |
| General disorders and administration site conditions | 26 |
| Psychiatric disorders | 22 |
| Injury, poisoning and procedural complications | 22 |
| Investigations | 18 |
| Social circumstances | 7 |
| Product issues | 7 |

Mapping evaluation

The results of the 60 mappings produced by the Rosetta Stone and by UMLS lexical search are shown in Table 6. The Rosetta Stone approach found the best match in 95% of the cases, compared with 22% for the UMLS-based lexical approach.

Table 6.

Manual validation of the best mapping found using the Rosetta Stone and the UMLS service. For each source terminology, we randomly selected 10 examples for each of the penalty scores 0.0 (exact), 0.1 (MedDRA traversal from LLT to PT), and 1.0 (hierarchical traversal in SNOMED CT) and checked whether the MedDRA terms were the semantically best matches

Source | Target | N | Score | Best match Rosetta Stone, n (%) | Best match UMLS, n (%)
HPATH/MA | MedDRA | 10 | 0.0 | 10 (100.0) | 5 (50.0)
HPATH/MA | MedDRA | 10 | 0.1 | 8 (80.0) | 2 (20.0)
HPATH/MA | MedDRA | 10 | 1.0 | 9 (90.0) | 1 (10.0)
SEND | MedDRA | 10 | 0.0 | 10 (100.0) | 3 (30.0)
SEND | MedDRA | 10 | 0.1 | 10 (100.0) | 1 (10.0)
SEND | MedDRA | 10 | 1.0 | 10 (100.0) | 1 (10.0)
Total | | 60 | | 57 (95.0) | 13 (21.7)
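
The totals in Table 6 can be rechecked from the per-row counts. The short Python sketch below transcribes the counts from the table and recomputes the percentages; the variable and function names are ours and are not part of the Rosetta Stone software.

```python
# Counts transcribed from Table 6:
# (source, penalty score, N, best matches Rosetta Stone, best matches UMLS)
ROWS = [
    ("HPATH/MA", 0.0, 10, 10, 5),
    ("HPATH/MA", 0.1, 10, 8, 2),
    ("HPATH/MA", 1.0, 10, 9, 1),
    ("SEND", 0.0, 10, 10, 3),
    ("SEND", 0.1, 10, 10, 1),
    ("SEND", 1.0, 10, 10, 1),
]


def pct(hits, n):
    """Percentage of best matches, rounded to one decimal."""
    return round(100.0 * hits / n, 1)


total_n = sum(row[2] for row in ROWS)
total_rosetta = sum(row[3] for row in ROWS)
total_umls = sum(row[4] for row in ROWS)

for source, score, n, rosetta, umls in ROWS:
    print(f"{source:8s} score={score:.1f}  "
          f"Rosetta Stone {pct(rosetta, n):5.1f}%  UMLS {pct(umls, n):5.1f}%")
print(f"Total n={total_n}: Rosetta Stone {pct(total_rosetta, total_n)}%, "
      f"UMLS {pct(total_umls, total_n)}%")
```

The recomputed totals are 57 of 60 for the Rosetta Stone and 13 of 60 for the UMLS-based lexical search.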

Discussion

We have presented the Rosetta Stone approach that utilizes a semantically rich ontology, SNOMED CT, to map between preclinical and clinical terminologies. The semantics of SNOMED CT support the mapping between pre-coordinated clinical terms and combinations of preclinical terms. The approach has also been used to find inexact mappings where the mapped concepts are close in meaning but not the same. The implemented mapping function contains a penalty mechanism that determines the number and precision of the mappings returned by the function. Within the eTRANSAFE project, the Rosetta Stone has been successfully applied to three distinct use cases as described in the deliverable “Final report on results of use cases” [27].

BioPortal, UMLS, and OHDSI use textual similarity between concepts to find a one-to-one exact mapping [3, 9, 28]. Others such as Fung et al. rely on an initial set of mapped concepts and exploit the hierarchical structure of the terminologies for additional mappings [5, 6]. However, differences between concept combinations from preclinical terminologies and pre-coordinated clinical terminologies require more sophistication and go beyond one-to-one concept mappings. Dhombres used SNOMED CT to compose new post-coordinated concepts and map these to the Human Phenotype Ontology using different SNOMED CT post-coordination expression templates [11]. These templates, however, implement rules that are specific to the concepts contained in the Human Phenotype Ontology and the created mappings are exact. Our coordination template contains information to facilitate the creation of inexact mappings with a penalty score that can considerably increase the number of mappings. Within the context of the eTRANSAFE project, we only implemented one coordination template for mapping a combination of body structure and morphologic abnormality to a finding. Additional templates can be created that yield mapping paths based on other combinations of SNOMED CT classes.
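
As an illustration of the coordination template described above, the following Python sketch pairs a body structure with a morphologic abnormality and looks up pre-coordinated findings that have exactly those attribute values. It is a minimal sketch under our own assumptions: the class `CoordinationTemplate`, the role names, and the two toy findings are hypothetical, the concept labels are illustrative and were not checked against a SNOMED CT release, and the code is not taken from the Rosetta Stone implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinationTemplate:
    """Hypothetical template: which attribute roles are combined into which class."""
    target_class: str
    roles: tuple


# The single template discussed in the text: body structure + morphologic abnormality -> finding.
TEMPLATE = CoordinationTemplate("finding", ("finding_site", "associated_morphology"))

# Toy stand-in for pre-coordinated findings and their defining attributes (illustrative labels).
PRECOORDINATED = {
    "Hepatic necrosis": {"finding_site": "Liver structure",
                         "associated_morphology": "Necrosis"},
    "Fibrosis of lung": {"finding_site": "Lung structure",
                         "associated_morphology": "Fibrosis"},
}


def exact_matches(template, combination):
    """Return findings whose attributes equal the combination for every template role."""
    key = tuple(combination[role] for role in template.roles)
    return sorted(
        name
        for name, attrs in PRECOORDINATED.items()
        if tuple(attrs.get(role) for role in template.roles) == key
    )


print(exact_matches(TEMPLATE, {"finding_site": "Liver structure",
                               "associated_morphology": "Necrosis"}))
```

A second template would only need a different `roles` tuple, which is one way to read the remark that additional templates can yield mapping paths based on other combinations of SNOMED CT classes.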

The preclinical and the clinical domains use different sets of terminologies that are difficult to map. Preclinical research (often in vitro or in vivo animal studies) investigates structural and functional changes in tissues and organs to understand disease mechanisms. It focuses on morphological abnormalities (e.g., cellular damage, tissue degeneration, organ malformations) because these are early indicators of pathological changes before symptoms manifest. Clinical research focuses more on measurable symptoms than on microscopic or cellular changes, prioritising how patients experience a disease and the effects of treatment. The challenge of translational research is to translate between these different worlds. While a perfect single mapping often does not exist, the Rosetta Stone approach can generate closely related inexact mappings. Different levels of inexactness are indicated by a penalty score, and the number of inexact mappings can be limited by a maximum penalty score that is passed as a parameter to the mapping function. Changing the penalty scheme in the coordination template can influence the kind of concepts that are returned as inexact matches. For now, we only distinguish between major (penalty of 1.0) and minor (penalty of 0.1) penalties.
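
The role of the maximum penalty score can be sketched in a few lines of Python. The penalty values 0.0, 0.1, and 1.0 are those used in this study (exact match, MedDRA traversal from LLT to PT, and hierarchical traversal in SNOMED CT). Everything else is an assumption made for illustration: the step names, the candidate list, the function names, and in particular the rule that penalties of consecutive steps are summed. Negative scores are not modeled here.

```python
MINOR, MAJOR = 0.1, 1.0  # minor and major penalties as distinguished in the text

# Hypothetical step names; the penalty per step follows the scores used in the evaluation.
STEP_PENALTY = {"exact": 0.0, "llt_to_pt": MINOR, "snomed_hierarchy": MAJOR}

# Hypothetical candidate mappings: (target term, traversal steps needed to reach it).
CANDIDATES = [
    ("PT_exact", ["exact"]),
    ("PT_via_LLT", ["llt_to_pt"]),
    ("PT_via_parent", ["snomed_hierarchy"]),
    ("PT_via_parent_and_LLT", ["snomed_hierarchy", "llt_to_pt"]),
]


def path_penalty(steps):
    """Assumption: the penalty of a mapping is the sum of its step penalties."""
    return round(sum(STEP_PENALTY[step] for step in steps), 1)


def filter_mappings(candidates, max_penalty):
    """Keep candidates whose penalty does not exceed max_penalty, best first."""
    scored = [(target, path_penalty(steps)) for target, steps in candidates]
    kept = [(target, penalty) for target, penalty in scored if penalty <= max_penalty]
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


for limit in (0.0, 0.1, 1.0):
    print(limit, filter_mappings(CANDIDATES, limit))
```

Raising `max_penalty` from 0.0 to 1.0 admits progressively less exact candidates, which mirrors the trade-off between coverage and exactness described above.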

The Rosetta Stone approach is generic and can also be applied if a new terminology is added to the terminologies already mapped in eTRANSAFE (HPATH, MA, SEND, and MedDRA). A new terminology needs only to be mapped to SNOMED CT. In eTRANSAFE, a total of 16,444 mappings from concepts used in the preclinical and clinical data sources to SNOMED CT were provided, of which the majority were mappings from MedDRA to SNOMED CT that could be compiled from existing resources. The total number of exact and inexact mappings between the (combinations of) preclinical concepts and the clinical concepts runs into the millions. Clearly, it is infeasible to provide all these mappings manually.
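
The benefit of an intermediary terminology can be made concrete with a small Python sketch. It composes two mappings that both point to the hub and counts how many mapping sets are needed with and without a hub. The codes (`N1`, `M1`, `snomed:A`, and so on) are placeholders rather than real identifiers, the function names are ours, and only exact composition is shown, so this is a simplified picture of what the terminology service does.

```python
from collections import defaultdict


def compose(source_to_hub, target_to_hub):
    """Map each source code to the target codes that share at least one hub concept."""
    hub_to_target = defaultdict(set)
    for target, hubs in target_to_hub.items():
        for hub in hubs:
            hub_to_target[hub].add(target)
    return {
        source: sorted({t for hub in hubs for t in hub_to_target.get(hub, ())})
        for source, hubs in source_to_hub.items()
    }


def mapping_sets_needed(n_terminologies):
    """Return (hub-based, pairwise) numbers of mapping sets for n terminologies."""
    return n_terminologies, n_terminologies * (n_terminologies - 1) // 2


# Placeholder codes for a newly added terminology and for MedDRA.
new_to_snomed = {"N1": {"snomed:A"}, "N2": {"snomed:B", "snomed:C"}, "N3": {"snomed:Z"}}
meddra_to_snomed = {"M1": {"snomed:A"}, "M2": {"snomed:C"}}

print(compose(new_to_snomed, meddra_to_snomed))
print(mapping_sets_needed(4))  # HPATH, MA, SEND, MedDRA
print(mapping_sets_needed(5))  # after adding one more terminology
```

With four terminologies the hub design needs four mapping sets instead of six pairwise ones, and adding a fifth terminology costs one new mapping set rather than four.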

Validation of the Rosetta Stone mapping service is hampered by the absence of a reference set of mappings. We validated our approach by manually checking 60 randomly selected mappings with different penalty scores. The results of this validation show that our approach yields better results than a UMLS-based lexical approach, not only for the inexact mappings but also for the exact ones. The lexical approach could, however, be made more sophisticated.

For the eTRANSAFE project, one coordination template that combines a morphological abnormality and a body structure has been implemented. This coordination template can be extended to increase the number and precision of mappings between the preclinical and clinical terminologies. For instance, the coordination template could be extended with a qualifier value concept to indicate the severity of a finding, or with a staging and scales concept to indicate the onset of a finding. The concept Subacute hepatic necrosis is an example of a concept that combines a SNOMED CT staging and scales concept and a finding site concept with a finding concept. The SNOMED CT concept class staging and scales could be used to map SEND Vital Signs Test Code, if applicable in the preclinical data sources.

Conclusion

The Rosetta Stone approach enables the mapping of combinations of preclinical concepts to clinical, pre-coordinated concepts using SNOMED CT as a semantic intermediary terminology. When exact mappings are not available, the approach provides inexact mappings between closely related concepts in the preclinical and clinical worlds.

Acknowledgements

Not applicable.

Author contributions

E.M. and J.K. wrote the main manuscript text. R.P. and E.M. designed the work and created the new software used in the work. E.M., J.L., J.K., and R.P. interpreted the data. J.K. and E.M. performed the manual evaluation of the subset of mappings. All authors reviewed the manuscript.

Funding

This research received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement eTRANSAFE (777365). This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA.

Data availability

The Rosetta Stone software is available from https://github.com/mi-erasmusmc/ets-rosetta-stone and the mappings from https://github.com/mi-erasmusmc/send-snomed-mappings.

Declarations

Ethics approval

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References


Articles from Journal of Biomedical Semantics are provided here courtesy of BMC
