Abstract
Background and objectives.
Assistance Publique - Hôpitaux de Paris (AP-HP) is implementing a new laboratory management system (LMS) common to its 12 hospital groups. The first step in this process was to acquire a biological analysis dictionary. This dictionary is interfaced with the international nomenclature LOINC and has been developed in collaboration with experts from all biological disciplines. In this paper we describe, in three steps (modeling, data migration and integration/verification), the implementation of a platform for publishing and maintaining the AP-HP laboratory data dictionary (AnaBio).
Material and Methods.
Due to the complexity and volume of the data, setting up a platform dedicated to terminology management was a key requirement. The platform addresses identified weaknesses of the previous spreadsheet tool. Our core model supports interoperability with data exchange standards and accommodates dictionary evolution.
Results.
We completed our goals within one year. In addition, structuring the data representation has led to a significant improvement in data quality (impacting more than 10% of the data). The platform is active in the 21 hospitals of the institution, across 165 laboratories.
Keywords: Controlled vocabulary, LOINC, Biological analysis, Authoring framework, Nomenclature, Standardization
Introduction
Context
Assistance Publique - Hôpitaux de Paris (AP-HP) is the largest European university hospital, with 12 hospital groups comprising 44 hospitals in and around Paris (23,000 beds, 1,000,000 inpatients per year and 4,000,000 outpatients per year).
The objective of AP-HP is to build an information system based on information sharing that integrates the clinical, medico-technical and medico-economic domains. In 2005, AP-HP decided to develop a new laboratory management system (LMS) common to the 12 hospital groups and based on a single shared core application. To this end, it was necessary to acquire a biological analysis dictionary common to the whole production chain: ordering, analysis processing in the LMS and transmission of the result.
The dictionary must be usable by all the LMS, which are active in the 21 hospitals of the institution across 165 laboratories, yet the repository on which it is based should ideally remain independent of the constraints of these tools. However, beyond the [code, label] pair, which is available in French in the repository, other elements must be standardized, such as the display label, the editing label, the mnemonic codes and the units of each analysis.
To secure the communication of analysis results to health practitioners wherever they practice, the dictionary must rely on a nomenclature shared as widely as possible. LOINC (Logical Observation Identifiers Names and Codes) is currently the most widely developed internationally [1].
AP-HP decided to interface its analysis dictionary, named AnaBio, with LOINC and to ensure its maintenance. Other hospitals have made the same choice, with a more or less complete interfacing with LOINC [2]. The French translation of LOINC labels is a cooperative effort between the French Society of Laboratory Computing (SFIL) and AP-HP.
Objectives
We aim to implement a platform for publishing and maintaining the AP-HP biomedical analysis dictionary that meets the requirements of the repository itself and of its maintenance and distribution processes. The current repository management software, based on a spreadsheet (Excel®), adapts poorly to the requirements and future evolution of the dictionary. To achieve this goal, our efforts focus on defining a model that supports the representation of the terminologies used and of adjunct knowledge. The model is scalable and can represent any type of terminology, independently of the discipline.
This data modeling and formalization effort should provide the following benefits:
- multi-terminological repository;
- terminology update (e.g. half-yearly LOINC update);
- storage of LOINC translations;
- improving data quality.
First, we introduce the structure of the AP-HP biomedical analysis dictionary, the requirements linked to it and the limitations of the previous tool (section 2). Second, we describe the development of the model (section 3). We then present the results of the platform implementation project and the data quality improvements brought by the change of representation (section 4). Finally, we discuss the difficulties encountered (section 5) and conclude (section 6).
Biomedical analysis (AnaBio) dictionary and LOINC
Description of the terminologies
The biomedical analysis dictionary, implemented at AP-HP since 2006, includes more than 39000 references used by 58 laboratories. Its development draws on the lists of analyses performed by the AP-HP laboratories involved in renewing their LMS.
Each analysis is described along 5 axes (analytes, parameters, systems, methods and units) and by a label (AP-HP label), to which are assigned a 5-character alphanumeric code (AP-HP index) and, when necessary, a LOINC code. Each axis is organized on 3 levels. Elements necessary for LMS setup are added: display labels, editing labels and analysis calling codes (mnemonic codes). All the analyses included in the dictionary are themselves organized. This results in a single electronic presentation intended for clinicians, allowing unambiguous consultation of result reports.
We use the successive LOINC versions released since December 2005 (v2.16 to v2.34), which are freely downloadable from the website1. The LOINC nomenclature includes four parts, but the classification of laboratory terms (Laboratory Term Classes) is the only one used in this work. It grew from 31437 labels in 2005 to 44000 labels in 2010, spread over 11 chapters covering 6 disciplines (biochemistry, hematology, immunology, pharmacology/toxicology, microbiology and molecular biology).
AnaBio components rely in part on the terms of the Names-Lab nomenclature [3]. The choice of an interfaced dictionary with the LOINC nomenclature is motivated by some inadequacies in the use of the LOINC repository alone, including:
- incompleteness: LOINC is based on a pre-coordinated system of 6 axes which, first, offers associations that are not necessarily useful for AP-HP laboratories and, second, does not cover all their needs;
- granularity: AP-HP biologists sometimes require a higher level of detail than LOINC provides, since some axes must be described more specifically in specialized disciplines of biology;
- specific parameters: French regulatory constraints require additional attributes in the repository for the LMS (for example, the NABM code2);
- language: data must be available in French.
The AnaBio dictionary offers the management flexibility necessary for daily use while maintaining semantic interoperability with other international health organizations through its alignment with LOINC. In this regard, the biomedical analysis dictionary can be considered a local terminology interfaced with a reference terminology [5]. The dictionary is also linked to external data such as the list of hospital units and their contacts. This choice requires a daily alignment task between the two repositories to maintain interoperability.
Interaction with health providers
A central cell (Figure 1) is in charge of collecting analyses from the institution's laboratories, enriching the dictionary, maintaining it (creation/modification/deletion) and interfacing it with LOINC.
Figure 1.
Data flow around the biology knowledge database.
Thus, this dictionary serves as a common core for the overall LMS configuration at AP-HP. Dictionary content will be locked against modification within the LMS; only the central cell can import updates.
System requirements and limitations
The management of the AnaBio terminology is subject to requirements linked to the repository itself and to the maintenance process. The main requirements identified are:
- to have a tool able to manage a continuously evolving data volume;
- to implement a validation process for the analyses used;
- to have means of controlling data integrity (duplicate control, etc.);
- to generate statistics on the repository content;
- to allow several users to edit the repository simultaneously;
- to take into account the data associated with the analyses (units, referents);
- to maintain the links between associated data and analyses (use, etc.);
- to ensure traceability of the dictionary items;
- to be able to export all or part of the knowledge database in standard formats.
These requirements rule out general-purpose tools such as spreadsheets and led us to develop an application based on the Intelligent Topic Manager (ITM) management tool developed by Mondeca3.
Method
ITM is based on modeling the domain as a knowledge graph expressed in a semantic web format. Approaches based on syntactic and semantic interoperability between terminologies include:
- the implementation of alignments between distributed terminologies [6][7][8];
- the definition of a top-ontology or a top-thesaurus to link the knowledge organization systems [9][10][11];
- the design of a hub meta-model supervising these organization systems [12][13].
Our method follows the latter approach by defining a single representation model. The advantage of this single model is that it integrates the different terminologies in a single server and thus allows these repositories to be edited. Implementing the platform requires transforming semi-structured data into a structured model representation, which relies on knowledge engineering techniques [14]. Three steps were conducted: modeling, data migration, and integration/validation.
Modeling
The modeling task is conducted in close collaboration with the terminology maintenance unit. This collaboration aims at understanding the usefulness of each component and at identifying the new needs that affect the model to be designed. For example, the addition of status properties allows better traceability of the components over time.
ITM is generic enough to represent any type of terminology, including LOINC and AnaBio but also future resources that will be useful to improve interoperability, such as SNOMED. The model nevertheless remains extensible, in order to account for the particularities of each terminology (for example, the NABM codes for biomedical analysis results) and to link the AnaBio dictionary to adjunct knowledge such as hospital units and contacts.
In the field of knowledge organization system representation, norms and standards such as SKOS and BS 8723 exist. Our approach does not claim to define a model ex nihilo but aims to follow good practice for controlled vocabulary representation. Our method reuses and extends parts of the modeling in these standards. SKOS [15] is defined as an extensible language for representing knowledge organization systems such as thesauri, taxonomies, or any other type of controlled or structured vocabulary. This standard provides primitives related to its central entity, the Concept: alignment links, editorial annotations, etc. SKOS defines some links as transitive, which allows logical inferences to be performed. The ISO norms on terminologies are changing4, thanks notably to the work of the British Standards Institution and its BS 8723 project [16]. Linguistic management is finer-grained in BS 8723 than in SKOS, but it is not needed in our work.
The OWL5 standard allows knowledge to be expressed using formal semantics based on predicate logic. In this respect, OWL provides the formal expressivity necessary for interoperability and often serves as the format for defining models in knowledge representation. Its formalism and expressivity allow models such as SKOS or BS 8723 to be represented. Our model extends SKOS and is at the same level of abstraction. The OWL language and its description logic expressivity are therefore used to describe our model [17].
The final model is presented in Figure 2 in a simplified version to improve readability. It includes the notion of Concept (identical to the definition given in SKOS or BS 8723) and sub-classes including Analysis, RecordLOINC and Axis. The Analysis entity has specific attributes such as NABMcode or isUsedBy, which allows data tracking. An Analysis is composed of one to five axes, concerns one discipline and is potentially used by one or more hospital units.
Figure 2.
Simplified UML class model of the biology knowledge database.
The alignment part comes from SKOS. The alignment link between an Analysis and a RecordLOINC, called AnalysisRecordMatch, is reified, meaning that it is defined as a class and can therefore declare attributes. This reification is the means of adding editing metadata such as the link creator and the creation date [18]. This is one of the differences from the SKOS standard, whose purpose is not the editing but the distribution of knowledge organization systems.
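The reification described above can be illustrated with a minimal Python sketch. It is not the OWL model used in ITM: the class names Concept, Analysis, RecordLOINC and AnalysisRecordMatch come from the text, while the field names, the helper `matches_for` and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Concept:
    """Generic terminology entry; the two fields are a hypothetical minimum."""
    identifier: str
    label: str


@dataclass(frozen=True)
class Analysis(Concept):
    nabm_code: Optional[str] = None  # stands in for the NABMcode attribute


@dataclass(frozen=True)
class RecordLOINC(Concept):
    pass


@dataclass(frozen=True)
class AnalysisRecordMatch:
    """Reified alignment link: a class of its own, so it can carry metadata."""
    analysis: Analysis
    record: RecordLOINC
    creator: str       # editing metadata: who created the link
    created_on: date   # editing metadata: when the link was created


def matches_for(analysis: Analysis,
                matches: List[AnalysisRecordMatch]) -> List[AnalysisRecordMatch]:
    """Return every alignment link that starts from the given analysis."""
    return [m for m in matches if m.analysis == analysis]


# Placeholder data: neither identifier is a real AP-HP index or LOINC code.
sample_analysis = Analysis("A0001", "placeholder analysis")
sample_record = RecordLOINC("00000-0", "placeholder LOINC record")
sample_link = AnalysisRecordMatch(sample_analysis, sample_record,
                                  creator="editor-1",
                                  created_on=date(2010, 1, 15))
```

Because the link is an object rather than a bare pair, metadata such as `creator` can be queried or audited without touching the two concepts it connects.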
To be compatible with the future exchange standards for laboratories (IHE-Lab51), the analyses and axes must be organized within a hierarchy. In our previous work [19], we defined a modeling pattern called “ConceptGroup”, which represents the concepts of a group either by extension (explicit membership) or by intension (through a query); the group itself may be organized in a hierarchy. These groups are the way to represent a whole terminology (all concepts of a given type) or a subset (also called a “value set”).
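As a rough illustration of the ConceptGroup pattern, the sketch below models a group whose members are given either explicitly (extension) or through a predicate standing in for a query (intension), with optional sub-groups. Only the name ConceptGroup comes from the text; the fields, the `resolve` method and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Concept:
    identifier: str
    label: str
    kind: str  # e.g. "Analysis" or "Axis"


@dataclass
class ConceptGroup:
    name: str
    # Extension: identifiers listed explicitly as members.
    members: Set[str] = field(default_factory=set)
    # Intension: a predicate playing the role of a query over concepts.
    query: Optional[Callable[[Concept], bool]] = None
    # Groups can themselves be organized in a hierarchy.
    subgroups: List["ConceptGroup"] = field(default_factory=list)

    def resolve(self, concepts: Iterable[Concept]) -> Set[str]:
        """Return identifiers belonging to this group or any of its sub-groups."""
        concepts = list(concepts)
        found = set()
        for concept in concepts:
            explicit = concept.identifier in self.members
            queried = self.query is not None and self.query(concept)
            if explicit or queried:
                found.add(concept.identifier)
        for sub in self.subgroups:
            found |= sub.resolve(concepts)
        return found


# Placeholder concepts; identifiers are invented for the example.
catalog = [
    Concept("A0001", "placeholder analysis 1", "Analysis"),
    Concept("A0002", "placeholder analysis 2", "Analysis"),
    Concept("X0001", "placeholder axis", "Axis"),
]
value_set = ConceptGroup("value set", members={"A0001"})          # by extension
all_axes = ConceptGroup("all axes", query=lambda c: c.kind == "Axis")  # by intension
root = ConceptGroup("root", subgroups=[value_set, all_axes])
```

Here `value_set` plays the role of a subset ("value set") and `all_axes` the role of a whole terminology defined as all concepts of a given type.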
The model must take into account the business knowledge related to the components of knowledge organization systems, for example Disciplines or the use of an analysis in a HospitalUnit. This knowledge supports the maintenance of terminology entities by providing indicators on use, queries and validations. All entities are derived from very generic classes including Object, link and Relation. These meta-classes are the way to ensure the scalability of the model.
Data migration
The data migration task starts in parallel with the model design. It requires transforming the entire spreadsheet data so that they can be integrated and conform to the new formal model. Other data on adjunct knowledge are also provided in spreadsheet format. During this migration, data inconsistencies are identified and corrected (duplicate control, spelling errors, integrity control on some values, etc.). Data migration relies on a Java tool developed specifically to generate RDF/XML files from XLS files. The resulting RDF/XML is consistent with the schema described in the model and ready for integration. This step makes it possible, first, to validate the model against the data and, second, to improve the quality of the data in the dictionary while preserving the AP-HP identifiers and labels (axis concatenation), which must remain fixed throughout the hospital information system.
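The migration principle can be sketched as follows. This is not the Java tool mentioned above: it is a small Python illustration that takes spreadsheet rows already loaded as dictionaries, applies a duplicate control and whitespace normalization, and serializes the result as RDF/XML with the standard library. The `http://example.org/anabio#` namespace, the column names `code` and `label`, and the sample rows are assumptions.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
EX_NS = "http://example.org/anabio#"  # hypothetical namespace for the sketch
XML_NS = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)
ET.register_namespace("ex", EX_NS)


def rows_to_rdfxml(rows):
    """Turn rows ({'code': ..., 'label': ...}) into an RDF/XML string."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    seen = set()
    for row in rows:
        code = row["code"].strip()
        if code in seen:
            continue  # duplicate control: keep the first row for each code
        seen.add(code)
        node = ET.SubElement(
            root, f"{{{EX_NS}}}Analysis", {f"{{{RDF_NS}}}about": EX_NS + code}
        )
        label = ET.SubElement(
            node, f"{{{SKOS_NS}}}prefLabel", {f"{{{XML_NS}}}lang": "fr"}
        )
        # Collapse stray whitespace so that equal labels compare equal.
        label.text = " ".join(row["label"].split())
    return ET.tostring(root, encoding="unicode")


# Placeholder rows; codes and labels are invented for the example.
sample_rows = [
    {"code": "A0001", "label": "  Lactate   sang "},
    {"code": "A0001", "label": "Lactate sang"},
    {"code": "A0002", "label": "Glucose sang"},
]
rdf_xml = rows_to_rdfxml(sample_rows)
```

The identifier is carried over unchanged into the resource URI, in line with the constraint that AP-HP identifiers must remain fixed during migration.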
Validation
The modeling and data migration stages are refined iteratively. To be validated, modeling proposals are imported into the platform together with the migrated data. During this stage, the terminology maintenance team works in parallel with the spreadsheet and the ITM tool. After each validation, corrections and improvements are made to the model and, consequently, to the data migration. After 6 validation cycles, the platform is deployed in the production environment.
Results
Modeling and platform implementation
The definition of business rules relies on the formal expressivity of the model described in OWL. Indeed, this model contains cardinality, domain and co-domain (range) restrictions. For example, an Analysis belongs to one and only one Discipline. The link IsUsedBy has Analysis as its domain and zero or more HospitalUnits as its co-domain. These restrictions constrain the edited information and allow a reasoning module to validate data integrity in the knowledge database.
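The two restrictions given as examples can be mimicked procedurally. The sketch below is not an OWL reasoner and does not reproduce ITM's reasoning module; it only shows, in Python, what checking "exactly one Discipline" and "IsUsedBy targets are HospitalUnits" amounts to. The `validate` function and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Discipline:
    name: str


@dataclass(frozen=True)
class HospitalUnit:
    name: str


@dataclass
class Analysis:
    code: str
    disciplines: List[Discipline] = field(default_factory=list)
    is_used_by: List[object] = field(default_factory=list)


def validate(analysis: Analysis) -> List[str]:
    """Return human-readable violations of the two example restrictions."""
    errors = []
    # Cardinality restriction: exactly one Discipline per Analysis.
    if len(analysis.disciplines) != 1:
        errors.append(
            f"{analysis.code}: expected exactly 1 discipline, "
            f"found {len(analysis.disciplines)}"
        )
    # Co-domain restriction: every IsUsedBy target must be a HospitalUnit;
    # zero targets is allowed.
    for target in analysis.is_used_by:
        if not isinstance(target, HospitalUnit):
            errors.append(f"{analysis.code}: {target!r} is not a HospitalUnit")
    return errors


# Placeholder instances for illustration.
valid = Analysis("A0001", [Discipline("Biochemistry")], [HospitalUnit("Unit 1")])
unused = Analysis("A0002", [Discipline("Hematology")])
invalid = Analysis("A0003", [], ["not a unit"])
```

In the platform itself these constraints are declared in the model, so they apply at editing time rather than through separate checking code.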
The model used is generic. Applied here to biomedical analysis and LOINC, it can also represent repositories from other fields. Its extensibility enables adjunct data to be linked with the dictionary. The modeling is based on knowledge representation standards and is written in a semantic web language. It is therefore possible to transform models to or from these standards, which helps resolve semantic interoperability difficulties. The platform is able to communicate directly with information systems using messages compliant with the IHE standard6 (LCSD profile7).
We integrated our model into a multi-user, multilingual software tool (ITM). This software interprets formal models (OWL format) and uses the restrictions they express to constrain the modeled domain knowledge. It stores knowledge graph data in an Oracle relational database, which ensures its ability to manage the increasing repository volume. The ITM tool also has a reasoning module and can generate statistical reports.
AnaBio data quality evolution and improvement
The improvement in the quality of the data in the AnaBio dictionary is a major result of this project. The transition from semi-structured data (spreadsheet) to structured data (constrained by the formal model) forced the correction of data considered inconsistent. Most of these inconsistencies were differences in letter case or spelling, or a lack of normalization of values that should have been identical. For example, “cysterceques”, “Cysterceque” and “Cysticerques anticorps” are all changed to “Cysticerques anticorps”.
To achieve this objective, the team in charge of the AnaBio dictionary defined the set of possible values for each axis and more formal writing rules. For example: all acids should be written in the form of their salts (lactate rather than lactic acid), microorganisms should be named in Latin (Streptococcus pneumoniae rather than Pneumocoque), and chemical molecules should be designated by their international nonproprietary name (INN; DCI in French), e.g. paracetamol instead of acetaminophen.
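Case and spelling variants of the kind listed above can be surfaced semi-automatically before an expert decides on the canonical form. The following Python sketch is an assumption about how such a helper might look, not the team's procedure: `case_variants` groups values that differ only in case or spacing, and `near_duplicates` uses the standard `difflib` similarity ratio with an arbitrary threshold of 0.9 to suggest likely misspellings.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(value):
    """Collapse whitespace and ignore letter case."""
    return " ".join(value.split()).casefold()


def case_variants(values):
    """Group raw values that differ only by case or spacing."""
    groups = defaultdict(set)
    for value in values:
        groups[normalize(value)].add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


def near_duplicates(values, threshold=0.9):
    """Suggest pairs of normalized values that are probably the same term."""
    keys = sorted({normalize(v) for v in values})
    pairs = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if SequenceMatcher(None, first, second).ratio() >= threshold:
                pairs.append((first, second))
    return pairs


# Two values come from the example in the text; the others are invented.
raw_values = ["cysterceques", "Cysterceque", "Lactate", "lactate "]
```

Such suggestions only shortlist candidates; choosing the target value (for instance a corrected spelling) remains a manual decision by the biologists.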
Then, we developed a program to automatically detect the dictionary axes that are not part of the allowed list. We ran this detection 6 times between July 2009 and July 2010. It led to part of the changes in the AnaBio dictionary. Normal daily evolution, the update of some disciplines, and manual data cleaning also led to additional changes: merging of synonymous axis values (red blood cells and erythrocytes, cobalamin and vitamin B12) and merging of similar terms expressing the same axis (IL-2 and interleukine-2).
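The principle of the detection can be sketched in a few lines of Python. This is an illustration under assumptions, not the program used in the project: the data layout (a mapping from analysis code to axis values), the function names and the sample values are hypothetical.

```python
def detect_anomalies(analyses, allowed):
    """List (code, axis name, value) triples whose value is not permitted.

    analyses: mapping of analysis code -> {axis name: value}
    allowed:  mapping of axis name -> set of permitted values
    """
    anomalies = []
    for code, axes in analyses.items():
        for axis_name, value in axes.items():
            if value not in allowed.get(axis_name, set()):
                anomalies.append((code, axis_name, value))
    return anomalies


def anomaly_rate(analyses, allowed):
    """Share of axis values flagged as anomalies, between 0 and 1."""
    total = sum(len(axes) for axes in analyses.values())
    if total == 0:
        return 0.0
    return len(detect_anomalies(analyses, allowed)) / total


# Placeholder dictionary content and permitted value lists.
sample_analyses = {
    "A0001": {"analyte": "lactate", "system": "blood"},
    "A0002": {"analyte": "lactic acid", "system": "blood"},
}
permitted = {"analyte": {"lactate"}, "system": {"blood"}}
```

Running the same check at intervals and tracking the rate gives the kind of trend reported for the six detection runs.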
We now analyze the evolution of the repository and try to quantify the part that can be attributed to the detection software, and thus to the structuring of data for inclusion in a formal model.
The numbers of analyses and axes increased over one year, as shown in Figure 3, from 26458 to 35714 and from 103758 to 144126, respectively. However, this increase was not linear. When analyzing the numbers of creations, deletions and changes, we observe two major cleaning stages and one major dictionary enrichment stage.
Figure 3.
Evolution of the numbers of analyses and axes over the project duration.
Stage 1 (between August and October 2009): numerous analyses and axes were deleted, and half of them were replaced by creations. 96% of analysis deletions concerned the discipline of Microbiology, subdivided into Virology (47%), Bacteriology (33%) and Myco-Parasitology (16%). 84% of analysis creations concerned the disciplines of Bacteriology (48%) and Myco-Parasitology (36%). This first stage marks the beginning of a task that is also performed in parallel by the team in charge of the AnaBio dictionary. The aim is to continuously enrich the repository and to clean the data and improve their quality (the predominant activity at this stage). During this stage, the biologists in charge of Microbiology initiated a redesign that led to a large number of deletions.
Stage 2 (between November 2009 and February 2010): over this period, a very large number of analyses was created (8719), 83% of them for Virology. This increase in the number of axes and, thus, of analyses can be explained in part by the integration of new analyses and also by the duplication of analyses according to the technique used (e.g. GC-MS, LC-MS or immunoassay, which have very different reference values). This constraint, applied mainly in pharmacology, ensures that clinicians compare only comparable values.
Stage 3 (between March 2010 and July 2010): a more moderate number of analyses was created and deleted. 85% of deletions concerned Microbiology (59%) and Pharmaco-Toxicology (26%), while creations were shared between several disciplines. This stage consolidates the dictionary by cleaning the data so that they fit the model and can be integrated into the tool.
From July 2009, the detection program had a major impact on the search for axis inconsistencies, as shown in Figure 4. On the first run, nearly 11% of axes were detected as inconsistent; this then fell to less than 1% and reached 0% in July 2010.
Figure 4.
Detection and handling of axis anomalies relative to the total number of axes over the project duration.
It is of interest to understand why some axes detected as abnormal were not corrected by the AP-HP team (Figure 5). Several hypotheses can explain the absence of correction:
- their deletion was postponed to a later date because the values of a discipline were being replaced;
- the workload was too large and the correction was postponed;
- the value was correct and was added to the lists of permitted axis values.
Figure 5.
Data-based explanation of why detected anomalies were not handled.
When the project started, some detected anomalies were not handled immediately. This can be explained by the fact that all the Microbiology analyses and axes were due to be deleted following the work performed by the institutional biologists in this discipline.
Until the beginning of 2010, some changes were postponed because of the maintenance team's workload and the introduction of new disciplines.
Towards the end of the project, corrections were applied more regularly. Most of the remaining detected values came from the analyses of the new disciplines; they were accurate and were added to the lists of permitted axis values.
Discussion
Implementing a platform to manage biomedical analyses and their associated data brought to light problems that had been hidden until then. The transition from free entry without real data validation to a knowledge database, where all information is constrained by a formal model, reveals data inconsistencies. The removal of these inconsistencies can be partially automated but requires manual expertise from biologists. More than 10% of the original data were affected by this task, mainly through the merging of identical information, the deletion of incorrect information, and the addition of missing data. Through regular use of the detection program and the adoption of a more rigorous working method (identification of the value sets permitted for each axis), inconsistencies were progressively reduced and data quality improved.
The implemented solution automates and integrates a larger number of tasks (automatic index creation, control of the input constraints defined in the model, etc.), freeing the team in charge of the AnaBio dictionary from tasks outside their expertise. A major difficulty lies in the need to integrate, as the project progresses, change requests that affect the modeling. Indeed, the flow of information related to biomedical analysis is quite advanced in terms of standardization. It also serves as a pilot for standardized communication within Paris hospitals. At present, the platform integrates biomedical analysis results. A model extension to include analysis prescriptions is under development.
The reflections on the integration model for heterogeneous terminologies have fed projects carried out in parallel in other application areas, such as the InterSTIS project8 and the editing of the Eurovoc thesaurus9. The proposal and integration of the concept group idea in the new ISO 25964 norm, which concerns the presentation of terminologies, result from close collaboration with the modeling work on standards.
Conclusions and perspectives
Represented as a knowledge database, this model is a particularly suitable and scalable solution for the needs of a terminology used daily. Unlike an XLS file, which is constrained by its structure, this model can be extended without affecting the controls, exports and statistics already implemented. Its generic nature allows the future integration of other terminologies. Its formal definition also allows restriction, inference and control rules to be defined. The communication between the knowledge database management tool and the laboratory tools improves control over the impact of information changes. The biomedical analysis dictionary illustrates the need to define a local terminology, in response to specific needs of use, aligned with a reference terminology. The direct use of LOINC cannot currently meet the requirements of completeness, granularity and specificity.
At present, this alignment covers on average 31% of analyses. In this project, the work concentrated on improving the data quality of the AnaBio terminology. It will be followed by work on the alignment with LOINC, based on semi-automatic alignment techniques. This interoperability work, integrated into the established model, will be used for the standardized exchange of biomedical data between AP-HP information systems or with international hospital units compliant with the Lab-51 recommendations of IHE.
Footnotes
2. Nomenclature des Actes de Biologie Médicale (Nomenclature of medical biology procedures).
4. ISO 25964 revises, merges, and extends two existing international standards: ISO 2788 for the establishment and development of monolingual thesauri and ISO 5964 for the establishment and development of multilingual thesauri. This new standard is based on the BS 8723 project.
5. Web Ontology Language.
6. Integrating the Healthcare Enterprise. See http://www.ihe.net/
7. Laboratory Code Sets Distribution. See http://wiki.ihe.net/index.php?title=Laboratory_Code_Sets_Distribution.
8. InterSTIS: French multi-terminologies server project. See: http://www.interstis.org/
9. Multilingual, multidisciplinary thesaurus covering the activities of the EU. See: http://europa.eu/eurovoc/
References
1. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003 Apr;49(4):624–33. doi: 10.1373/49.4.624.
2. Lin M, Vreeman D, McDonald C, Huff S. A Characterization of Local LOINC Mapping for Laboratory Tests in Three Large Institutions. Methods Inf Med. 2010:49. doi: 10.3414/ME09-01-0072.
3. Cormont S, Erman A, Burckel Y, Carayon A. Names-Lab: a model for the standardization of biology message exchanges. Ann Biol Clin (Paris). 2002 Mar-Apr;60(2):173–81.
4. Cormont S, Zweigenbaum P, Brunel L, Lepage É. Construction d’un référentiel francophone d’analyses biologiques lié à LOINC. JFIM. 2007.
5. Daniel C, Buemi A, Mazuel L, Ouagne D, Charlet J. Functional requirements of terminology services for coupling interface terminologies to reference terminologies. Studies in health technology and informatics. 2009.
6. Si LE, Obrien A, Probets S. Integration of distributed terminology resources to facilitate subject cross-browsing for library portal systems. In ISKO UK. 2009.
7. Isaac A, Schlobach S, Matthezing H, Zinn C. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review. 2009;57(3):187–199.
8. Macgregor G, Joseph A, Nicholson D. A skos core approach to implementing an m2m terminology mapping server. In International Conference on Semantic Web and Digital Libraries (ICSD Proceedings of the) 2007:21–23.
9. Fung K, Bodenreider O. Utilizing the UMLS for semantic mapping between terminologies. AMIA Annual Symposium Proceedings. 2005:266.
10. Stenzhorn H, Beißwanger E, Schulz S. Towards a top-domain ontology for linking biomedical ontologies. Studies in health technology and informatics, IOS Press; 1999. 2007;129:1225.
11. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L. Sweetening ontologies with DOLCE. Lecture notes in computer science, Springer. 2002:166–181.
12. Picca D, Gliozzo A, Gangemi A. LMM: an owl-dl metamodel to represent heterogeneous lexical knowledge. Proceedings of LREC; Marrakech, Morocco. 2008.
13. Pathak J, Solbrig H, Buntrock J, Johnson T, Chute C. LexGrid: A Framework for Representing, Storing, and Querying Biomedical Terminologies from Simple to Sublime. Journal of the American Medical Informatics Association, Elsevier. 2009;16:305–315. doi: 10.1197/jamia.M3006.
14. Aussenac-Gilles N. Méthodes ascendantes pour l’ingénierie des connaissances. Université Paul Sabatier. 2005.
15. Miles A. SKOS: requirements for standardization. DC-2006: Proceedings of the International Conference on Dublin Core and Metadata Applications; 2006. pp. 55–64.
16. British Standards Institution. 2008. BS 8723. Structured vocabularies for information retrieval, Part 4: Interoperability between vocabularies.
17. Vandenbussche P-Y, Charlet J. Méta-modèle général de description de ressources terminologiques et ontologiques. Ingénierie de la Connaissance (IC) 2009.
18. Noy N, Rector A. Defining N-ary relations on the Semantic Web: Use with individuals. W3C Working Draft. 2004.
19. Vandenbussche P-Y, Charlet J. ConceptGroup pattern, poster presentation. MedInfo. 2010.





