BMC Medical Informatics and Decision Making. 2026 Feb 9;26:52. doi: 10.1186/s12911-026-03378-4

Full-scale indexing and semantic annotation of CT imaging: boosting FAIRness

Hannes Ulrich 1,2, Robin Hendel 3, Björn Bergh 1, Björn Schreiweis 1,2
PMCID: PMC12980909  PMID: 41664112

Abstract

Background

The integration of artificial intelligence into medicine has led to significant advances, particularly in diagnostics and treatment planning. However, the reliability of AI models is highly dependent on the quality of the training data, especially in medical imaging, where varying patient data and evolving medical knowledge pose a challenge to the accuracy and generalizability of given datasets.

Results

The proposed approach focuses on the integration and enhancement of clinical computed tomography (CT) image series for better findability, accessibility, interoperability, and reusability. Through an automated indexing process, CT image series are semantically enhanced using the TotalSegmentator framework for segmentation and resulting SNOMED CT annotations. The metadata is standardized with HL7 FHIR resources to enable efficient data recognition and data exchange between research projects.

Conclusions

The study successfully integrates a robust process within the UKSH MeDIC, leading to the semantic enrichment of over 1.7 million CT image series and over 50 million SNOMED CT annotations. The standardized representation using HL7 FHIR resources improves discoverability and facilitates interoperability, providing a foundation for the FAIRness of medical imaging data. However, developing automated annotation methods that can keep pace with growing clinical datasets remains a challenge to ensure continued progress in large-scale integration and indexing of medical imaging for advanced healthcare AI applications.

Keywords: Data standardization, Semantic interoperability, Artificial intelligence, Medical image processing, Computed tomography

Background

In recent years, integrating artificial intelligence (AI) into medicine has made promising progress and revolutionized diagnostics, treatment planning, and healthcare management [1]. However, the effectiveness of AI models depends heavily on the quality and reliability of the data sets used to train them. Reliable and real-world clinical data play a crucial role in ensuring the accuracy and generalizability of AI in medicine [2, 3]. Patient data can vary significantly due to factors such as age, gender, genetics, and lifestyle. In addition, healthcare professionals often deal with rare or novel cases that are not well represented in standard datasets [4]. Therefore, reliable datasets must express the clinical reality and be continuously updated so that models can adapt to new medical challenges. This adaptability is crucial for the robust performance of AI systems in the face of evolving medical knowledge and the discovery of previously unknown diseases. Most medical AI models are based on medical imaging [5], an important diagnostic tool. However, currently, existing datasets are rarely updated, so approaches are needed to provide up-to-date and customized datasets that reflect real-world clinical environments.

The IMPETUS junior research group focuses on precisely these challenges. The main objective of the IMPETUS group is to upgrade the Medical Data Integration Center (MeDIC) [6], which was set up at the UKSH as part of the HiGHmed consortium [7], to enable the integration and reuse of all multimedia objects and reports in a standardized environment, regardless of format, storage, or presentation. In a prior study, we integrated the productive picture archiving and communication system (PACS) of University Hospital Schleswig-Holstein, the second-largest university hospital in Germany, and established an automatic integration process to retrieve new imaging series and metadata on a daily basis [8]. By February 2024, over 33 million clinical imaging series had been successfully integrated. Unfortunately, the metadata needed to identify the corresponding data for incoming research requests is incomplete or missing. The metadata is not routinely curated for the purpose of findability, and radiological reports are often semi-structured or focused on relevant pathologies (without listing all visible structures). For research and for the creation of training data, however, an annotation and indexing process can improve the discoverability of relevant anatomical data. In this study, we focus on a subset of the imaging data, the computed tomography (CT) imaging series, and present our approach to increase the findability, accessibility, interoperability, and reusability (FAIRness) of the clinical imaging series [9]. The goal is the conceptualization and establishment of a robust indexing process to improve the metadata of the integrated CT imaging series. The incoming series and the corresponding metadata should be semantically enhanced to provide more granular search criteria for discovering relevant imaging series within the growing dataset.

Implementation

The integration and indexing process must seamlessly integrate with the established MeDIC architecture, ensuring compatibility and continuity. The indexing process should exhibit robustness, particularly in handling diverse CT imaging protocols. Time efficiency is paramount, with computing requirements below the daily incoming volume and fully scalable to allow the processing of legacy data in parallel. Results must conform to standardized formats, facilitating adherence to national and international data-sharing initiatives and thereby enhancing accessibility and reusability of the data for researchers.

Image retrieval and semantic enhancement

The process is triggered by the daily image series integration [8] and integrates into the current MeDIC architecture, as seen in Fig. 1. The daily CTs are indexed in a fully automatic process flow. For each incoming CT series, an indexing task is created after non-processable series (single-slice series and spectral CT series) have been filtered out. The tasks are published to a Kafka topic and consumed by an indexing service instance, which loads the series from the PACS, segments it, and analyzes it. The indexing service is based on the TotalSegmentator introduced by Wasserthal et al. [10]. Their framework segments 104–124 anatomical structures in CT datasets, depending on the version used. The service uses the TotalSegmentator full-body task. The result of the segmentation process is a single statistics file, since the actual segmentation masks are not saved during the process. This statistics file includes all identified anatomical structures, their volumes in mm³, and the corresponding Hounsfield unit intensities. To semantically enrich the results, a comprehensive mapping between TotalSegmentator labels and standardized terminologies such as SNOMED CT and RadLex is provided. This mapping adheres to the principles outlined in ISO/TR 12300:2014, “Health informatics - Principles of mapping between terminological systems” [11]. The mapping was implemented and validated by a radiological specialist to ensure accuracy. A total of 124 labels were mapped to SNOMED CT and RadLex.
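The enrichment step described above can be illustrated with a minimal Python sketch. It assumes the statistics file is a JSON object keyed by TotalSegmentator label with `volume` and `intensity` fields, and that a structure counts as present when its volume is greater than zero; both are assumptions, not details confirmed by the text. The two-entry `LABEL_TO_SNOMED` table and the `annotate` helper are hypothetical stand-ins for the validated 124-label mapping, and the codes shown should be checked against that mapping.

```python
import json

SNOMED_SYSTEM = "http://snomed.info/sct"

# Illustrative two-entry excerpt (assumption); the mapping described in the
# text covers 124 labels and was validated by a radiological specialist.
LABEL_TO_SNOMED = {
    "liver": ("10200004", "Liver structure"),
    "spleen": ("78961009", "Splenic structure"),
}


def annotate(statistics, mapping=LABEL_TO_SNOMED):
    """Turn per-structure statistics into a list of coded annotations."""
    annotations = []
    for label in sorted(statistics):
        entry = statistics[label]
        volume = float(entry.get("volume", 0.0))
        # Assumption: a structure counts as present if its volume is positive.
        # Labels without a mapping entry are skipped rather than guessed.
        if volume <= 0.0 or label not in mapping:
            continue
        code, display = mapping[label]
        annotations.append({
            "label": label,
            "system": SNOMED_SYSTEM,
            "code": code,
            "display": display,
            "volume_mm3": volume,
            "intensity_hu": float(entry.get("intensity", 0.0)),
        })
    return annotations


# Hypothetical statistics content for one CT series.
raw = """{
  "liver": {"volume": 1500000.0, "intensity": 95.2},
  "spleen": {"volume": 0.0, "intensity": 0.0},
  "unmapped_label": {"volume": 10.0, "intensity": 1.0}
}"""
annotations = annotate(json.loads(raw))
print(annotations)
```

In this example only the liver yields an annotation: the spleen has zero volume, and the unmapped label has no terminology entry.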

Fig. 1

Technical overview of the implemented indexing process within the UKSH MeDIC. The process starts with Apache NiFi creating a task based on the incoming data. The task is forwarded to the indexing services via Kafka. After successful indexing, the results are sent back via Kafka and forwarded to the designated destination.
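The task creation step in this flow might look roughly like the following Python sketch. It only covers filtering and message serialization; publishing to Kafka is left out so that the example runs without a broker. The `SeriesInfo` fields, the `is_spectral` flag, and the message layout are hypothetical, since the text does not describe how spectral CT series are detected or what a task message contains.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesInfo:
    series_uid: str           # DICOM SeriesInstanceUID
    modality: str             # DICOM Modality, e.g. "CT"
    number_of_instances: int  # number of slices in the series
    is_spectral: bool         # hypothetical flag; detection is not described


def is_processable(series):
    """Keep CT series that have more than one slice and are not spectral."""
    return (
        series.modality == "CT"
        and series.number_of_instances > 1
        and not series.is_spectral
    )


def build_task_messages(series_list):
    """Serialize one task per processable series as UTF-8 encoded JSON."""
    messages = []
    for series in series_list:
        if is_processable(series):
            payload = {"series_uid": series.series_uid}
            messages.append(json.dumps(payload).encode("utf-8"))
    return messages


incoming = [
    SeriesInfo("1.2.3.1", "CT", 312, False),  # regular series, kept
    SeriesInfo("1.2.3.2", "CT", 1, False),    # single slice, filtered out
    SeriesInfo("1.2.3.3", "CT", 280, True),   # spectral CT, filtered out
]
messages = build_task_messages(incoming)
print(messages)
```

Each resulting byte string could then be handed to a Kafka producer as the message value; an indexing service instance would decode it and load the referenced series from the PACS.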

Standardized representation using HL7 FHIR

For the MeDIC internal indexing process, the annotation file is enhanced with the established mapping and indexed within a central ElasticSearch instance, a key component of our data lakehouse architecture [6]. This enables fast findability for downstream processing. To enable a more advanced and interoperable search, the annotation file is transformed into a standardized representation using HL7 FHIR resources [12] containing patient information, imaging metadata, and the corresponding SNOMED CT annotations, as shown in Fig. 2. The software versions of the indexing service and the relevant libraries are modeled in the Device resource, more precisely in the included DeviceVersion. Annotations and versions are linked via a Provenance resource so that future version changes can be tracked and remain comprehensible. The resource definitions are formalized in the FHIR Shorthand language [13], compiled into a set of FHIR profiles using SUSHI [14], and are available on GitHub [15].

Fig. 2

All associated resources and the references used for the CT-Indexer profiles. The Patient and ImagingStudy resources include general information to identify the patient and the corresponding imaging series. The annotations are represented as BodyStructure resources, including the assigned SNOMED CT codes. The Device and Provenance resources contain process information and ensure retrospective traceability of the annotations.
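As a rough illustration of the resources named in the caption, the following Python sketch builds one annotation and its provenance link as plain dictionaries. It assumes the FHIR R5 element names `includedStructure.structure` for BodyStructure and `target`, `recorded`, and `agent.who` for Provenance. The resource ids, the references, and the SNOMED CT code are hypothetical examples, the link to the ImagingStudy is omitted, and the published profiles may constrain these resources differently.

```python
import json

SNOMED_SYSTEM = "http://snomed.info/sct"


def body_structure(resource_id, patient_ref, code, display):
    """One anatomical annotation as a FHIR R5 BodyStructure (sketch)."""
    return {
        "resourceType": "BodyStructure",
        "id": resource_id,
        "includedStructure": [
            {
                "structure": {
                    "coding": [
                        {"system": SNOMED_SYSTEM, "code": code, "display": display}
                    ]
                }
            }
        ],
        "patient": {"reference": patient_ref},
    }


def provenance(target_refs, device_ref, recorded):
    """Link annotations to the Device that carries the software versions."""
    return {
        "resourceType": "Provenance",
        "target": [{"reference": ref} for ref in target_refs],
        "recorded": recorded,
        "agent": [{"who": {"reference": device_ref}}],
    }


# Hypothetical ids and references, for illustration only.
annotation = body_structure(
    "bs-liver-1", "Patient/example", "10200004", "Liver structure"
)
link = provenance(
    ["BodyStructure/bs-liver-1"], "Device/ct-indexer", "2025-11-01T00:00:00Z"
)
print(json.dumps([annotation, link], indent=2))
```

Keeping the software versions on the Device and referencing it from the Provenance means that a later change of the segmentation network only requires a new Device resource, while existing annotations stay traceable to the version that produced them.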

Results

We successfully included the process in the established imaging integration flow and enabled automatic indexing for all CT imaging series within the UKSH MeDIC. By November 1, 2025, over 1.7 million CT imaging series had been semantically enhanced using the presented approach. In total, more than 50.5 million SNOMED CT annotations were added to the image series. The indexing service runs as eight parallel instances, resulting in an average throughput of 200 imaging series per hour. We established a priority-based processing sequence so that the imaging data from the previous day are indexed first. While the daily data are processed, the legacy data are continuously retrieved from the repository in chronological order and indexed until the entire PACS is fully semantically enriched. The standardized representation using HL7 FHIR R5 yields a maximum of five different FHIR resources per CT imaging series. The less dynamic resources, e.g., Patient and Device, are defined as conditional creates to reduce redundancy within the FHIR export and downstream repository. The 230,000 indexed series result in ca. 5.6 million unique FHIR resources that are stored on a separate server to enable effective data discovery. The indexing service is open-source and freely available on GitHub [16]. The semantic annotation was successfully tested in two data extraction projects, in which image series were found using fine-grained anatomical annotations. The first project needed imaging series with a specific muscle group between the heart and the L1 vertebra. In the second project, series showing the sacrum were successfully identified.
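The priority-based processing sequence can be sketched with a small priority queue in Python. This is only one possible reading of the description: series from the previous day onward get the higher priority, and legacy series follow oldest first, which is an assumption about what "chronological order" means here. The function name `processing_order` and the series identifiers are hypothetical.

```python
import heapq
from datetime import date, timedelta


def processing_order(series, today):
    """Return series ids with recent data first, then legacy data oldest first.

    `series` is an iterable of (series_id, study_date) pairs.
    """
    yesterday = today - timedelta(days=1)
    heap = []
    for series_id, study_date in series:
        # Priority 0: data from the previous day onward; priority 1: legacy.
        priority = 0 if study_date >= yesterday else 1
        heapq.heappush(heap, (priority, study_date, series_id))
    order = []
    while heap:
        _, _, series_id = heapq.heappop(heap)
        order.append(series_id)
    return order


today = date(2025, 11, 1)
backlog = [
    ("legacy-2011", date(2011, 5, 3)),
    ("daily-a", date(2025, 10, 31)),
    ("legacy-2004", date(2004, 1, 15)),
]
order = processing_order(backlog, today)
print(order)  # ['daily-a', 'legacy-2004', 'legacy-2011']
```

Because the tuple comparison falls back to the study date within each priority class, the same queue serves both the daily data and the legacy backlog without a second scheduler.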

Discussion

In the field of medical imaging, the ability to efficiently navigate and discover vast datasets is paramount. This necessitates sophisticated methodologies that enable fine-grained search within the exponentially expanding datasets. Moreover, the standardization and semantic enhancement of imaging series representation are crucial for fostering interoperability and facilitating seamless data exchange among diverse research projects. Currently, most imaging series are hardly semantically annotated since the task is time-consuming and resource-intensive. However, radiological imaging continues to be one of the most clinically useful tools in various clinical fields and research projects, such as the European Health Data Space [17], which emphasizes the importance of standardized exchange formats to enable cross-institutional findability and reuse. Our approach aims to strengthen the four FAIR principles: (F) via the continuous integration of the PACS, the imaging series are retrievable; (A) due to an established application process [18, 19], the data can be requested for medical research; (I) and (R) the use of FHIR as a standardized exchange format in combination with automated SNOMED CT annotations strengthens interoperability and reusability [20].

Traditionally, automatic image annotation has been a cornerstone of downstream data analysis processes. With the emergence of artificial intelligence and the adoption of various classification and segmentation networks, the development and exploration of traditional automatic annotation methods have slowed down, as these approaches often struggle with the effort required and with maintaining accuracy in growing clinical datasets.

The use of machine-learning methods for semantic annotation is logical given the current research momentum. The use of segmentation networks for semantic annotation may seem unusual at first, but the frameworks are remarkably robust on clinical data. The integrated imaging series vary widely because different CT devices and acquisition protocols have been used over the decades. It was therefore necessary to apply a stable and robust method: the TotalSegmentator is a state-of-the-art framework that achieves high accuracy (Dice coefficient of 0.943) and outperforms other available segmentation frameworks in terms of label variety. Additionally, it has broad applicability, especially in clinical settings. The framework performs robustly overall; only the processing of large full-body CTs is unstable due to a bug in the underlying library. The proposed service primarily uses this functionality to index the imaging series, but it also has a segmentation mode to provide the segmentation masks to researchers as a MeDIC service on request. The implemented service is only a wrapper around the TotalSegmentator, so the underlying network can easily be replaced in the future. Given the current research momentum, new frameworks with more labels [21] or more robust performance are very likely. The presented approach is focused on CT as the processed modality. Future services that index independently of modality [22], and thus enable consistent indexing across the entire PACS, are conceivable.
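For readers unfamiliar with the metric quoted above, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). The following Python sketch computes it on flattened masks. It illustrates the metric only and is not the evaluation code behind the reported value of 0.943; treating two empty masks as a perfect match is a convention chosen here.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap of two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    if size_a + size_b == 0:
        # Convention chosen here: two empty masks count as identical.
        return 1.0
    return 2.0 * intersection / (size_a + size_b)


# Toy example: 3 voxels each, 2 of them shared, so Dice = 2*2 / (3+3) = 2/3.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(predicted, reference)
print(round(score, 3))  # 0.667
```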

Conclusions

The integration of artificial intelligence into medicine has made remarkable progress and revolutionized diagnostics and treatment planning. However, the effectiveness of the current models depends on a continuous flow of representative training data from clinical routine in order to keep the models relevant. Our efforts to enhance the metadata semantically for CT image series within UKSH MeDIC are a crucial step towards improving findability, accessibility, interoperability, and reusability – FAIRness – in medical imaging. Significant progress has been made through the semantic preparation of more than 1.7 million CT image series, enabling a better discovery and resulting in a standardized representation using HL7 FHIR resources. The development of robust methods for automatic annotation remains paramount, especially in the context of growing clinical datasets, to ensure the continuing progress of large-scale integration and indexing of medical imaging.

Acknowledgements

Not applicable.

Abbreviations

HL7

Health Level 7

FHIR

Fast Healthcare Interoperability Resources

AI

Artificial Intelligence

MeDIC

Medical Data Integration Center

PACS

Picture Archiving and Communication System

CT

Computed Tomography

Author contributions

HU developed the indexing concept and implementation. HU and RH established and evaluated the semantic mapping. BB and BS contributed conceptually and conducted review and editing. HU and BS wrote the manuscript. All authors read and approved the final manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL. This research was funded by the German Federal Ministry of Education and Research, grant number 01ZZ2011. The funding agency had no role in study design, data collection, data analysis, results interpretation, or in writing the manuscript.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability and requirements

Project name: IMPETUS CT Indexer. Project home page: https://github.com/IMIS-MIKI/impetus-ct-indexer. Operating system(s): Platform independent. Programming Language: Python. Other requirements: Python 3.10, GPU is recommended. License: Apache-2.0. Any restrictions to use by non-academics: none.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Obermeyer Z, Emanuel EJ. Predicting the future – big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–1219. 10.1056/NEJMp1606181.
2. Panayides AS, et al. AI in medical imaging informatics: current challenges and future directions. IEEE J Biomed Health Inform. 2020;24(7):1837–1857. 10.1109/JBHI.2020.2991043.
3. Askin S, Burkhalter D, Calado G, El Dakrouni S. Artificial intelligence applied to clinical trials: opportunities and challenges. Health Technol. 2023;13(2):203–213. 10.1007/s12553-023-00738-2.
4. Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236–1246.
5. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31–38.
6. Kock-Schoppenhauer A-K, et al. Medical data engineering – theory and practice. Commun Comput Inform Sci. 2021;1481 CCIS:269–284. 10.1007/978-3-030-87657-9_21.
7. Haarbrandt B, et al. HiGHmed - An open platform approach to enhance care and research across institutional boundaries. Methods Inf Med. 2018;57(S 01):e66–e81. 10.3414/ME18-02-0002.
8. Ulrich H, Anywar M, Kinast B, Schreiweis B. Large-scale standardized image integration for secondary use research projects. Stud Health Technol Inform. 2024;310:174–178. 10.3233/SHTI230950.
9. Wilkinson MD, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016 Mar;3:160018. 10.1038/sdata.2016.18.
10. Wasserthal J, et al. TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiol Artif Intell. 2023;5(5). 10.1148/ryai.230024.
11. ISO/TR 12300:2014. Health informatics – principles of mapping between terminological systems [Internet]. 2014 [cited 2021 Oct 29]. Available from: https://www.iso.org/standard/51344.html.
12. Benson T, Grieve G. Principles of health interoperability: FHIR, HL7 and SNOMED CT. Health Information Technology Standards. Cham: Springer International Publishing; 2021. 10.1007/978-3-030-56883-2.
13. HL7.FHIR.UV.SHORTHAND. FHIR Shorthand – FHIR v4.0.1 [Internet]. 2024 [cited 2024 Feb 15]. Available from: https://hl7.org/fhir/uv/shorthand/.
14. FHIR/sushi. TypeScript. FHIR [Internet]. 2024 Feb 2 [cited 2024 Feb 14]. Available from: https://github.com/FHIR/sushi.
15. Ulrich H. impetus-ct-indexer-fhir [Internet]. 2024 Mar 4. Available from: 10.5281/ZENODO.10779322.
16. Ulrich H. impetus-ct-indexer [Internet]. 2024 Mar 6. Available from: 10.5281/ZENODO.10784585.
17. European Commission. A European Health Data Space: harnessing the power of health data for people, patients and innovation [Internet]. Available from: https://health.ec.europa.eu/system/files/2022-05/com_2022-196_en.pdf.
18. Richter G, Krawczak M, Lieb W, Wolff L, Schreiber S, Buyx A. Broad consent for health care–embedded biobanking: understanding and reasons to donate in a large patient sample. Genet Med. 2018;20(1):76–82. 10.1038/gim.2017.82.
19. Lieb W, et al. Linking pre-existing biorepositories for medical research: the PopGen 2.0 Network. J Community Genet. 2019 Oct;10(4):523–530. 10.1007/s12687-019-00417-8.
20. van Damme P, Löbe M, Benis N, de Keizer NF, Cornet R. Assessing the use of HL7 FHIR for implementing the FAIR guiding principles: a case study of the MIMIC-IV emergency department module. JAMIA Open. 2024;7(1):ooae002.
21. Sundar LKS, et al. Fully automated, semantic segmentation of whole-body 18F-FDG PET/CT images based on data-centric artificial intelligence. J Nucl Med. 2022 Dec;63(12):1941–1948. 10.2967/jnumed.122.264063.
22. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654.
