[Preprint]. 2024 Sep 3:2024.09.02.24312917. [Version 1] doi: 10.1101/2024.09.02.24312917

LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models

Isabella Catharina Wiest (1),(2), Fabian Wolf (2), Marie-Elisabeth Leßmann (2),(3), Marko van Treeck (2), Dyke Ferber (2),(4), Jiefu Zhu (2), Heiko Boehme (7), Keno K Bressem (8), Hannes Ulrich (9), Matthias P Ebert (1),(6), Jakob Nikolas Kather (2),(3),(4),(+)
PMCID: PMC11398444  PMID: 39281753

Abstract

In clinical science and practice, text data, such as clinical letters or procedure reports, is stored in an unstructured way. In this form, the data cannot be used for quantitative analyses, and any manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM-based information extraction (LLM-AIx), enabling extraction of predefined entities from unstructured text using privacy-preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where the efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis.

The protocol consists of four main processing steps: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE and 4) output evaluation. LLM-AIx can be deployed on local hospital hardware without transferring any patient data to external servers. As example tasks, we applied LLM-AIx to the anonymization of fictitious clinical letters from patients with pulmonary embolism, and additionally extracted the symptoms and the laterality of the pulmonary embolism from these letters. We demonstrate troubleshooting of potential problems within the pipeline using IE on a real-world dataset, 100 pathology reports from The Cancer Genome Atlas (TCGA), for TNM stage extraction. LLM-AIx can be operated without any programming knowledge via an easy-to-use interface, and a run completes within minutes to hours, depending on the LLM selected.

Introduction

Development of the protocol

Medical free text contains essential information, such as details about patient characteristics and therapy course, and maps the patient journey substantially better than structured medical information from electronic health records alone1–3. This free text captures the main reasoning as well as observations of medical staff across a variety of report types, such as clinical letters and documentation of diagnostic and therapeutic procedures4. In its unstructured form, text is not available for quantitative analysis and is therefore not accessible for research, quality analysis or interoperable data exchange5. Forcing medical staff into structured documentation, however, is not feasible given time constraints and personnel shortages in the healthcare system; it would increase the documentation burden and decrease the time available for actual patient care6. Systematically extracting information from free text is therefore crucial for the medical field: it enables researchers to investigate rare diseases7, allows better tracking, overview, and exchange of patient information among different inpatient and outpatient providers via a comprehensive health record, and supports systematic quality control assessment8,9. Previous methods to mine medical free text fall short because they either cannot process large amounts of text and have limited capabilities to grasp context, or need task-specific fine-tuning10, whereas our method relies solely on in-context learning of large language models. In-context learning enhances LLMs’ performance on new tasks by using examples or step-by-step instructions within the prompt11,12. Additionally, narrative medical text comes from various source systems, which complicates streamlined processing: some reports may only be accessible as portable document format (PDF) exports from the clinical information system (CIS), while others originate from secondary software in a variety of formats13. Data transformation processes that harmonize all data formats from their source systems within one central database are not ubiquitously established14. We therefore present an open-source, LLM-based pipeline that tackles these challenges in medical information extraction (IE). Our pipeline extracts structured information elements that can be flexibly defined by the user. This is advantageous compared to traditional IE, where only predefined categories and relationships are extracted, and offers a highly flexible process for handling large-scale unstructured data15.

Our pipeline transforms various types of unstructured medical text data—such as clinical notes, procedure reports or entire clinical letters—into structured CSV format, suitable for quantitative analysis. This development was motivated by the need for a scalable solution that accommodates the technical expertise and deep medical domain understanding required for effective data utilization in healthcare. The method has been developed, applied and tested in several use cases: extracting suicidality from psychiatric admission notes16, tested with different Large Language Models from Meta AI (Llama-2 models); extracting symptoms and diagnoses for the detection of decompensated liver cirrhosis from emergency room (ER) admission notes17; and extracting adverse events from endoscopy reports of endoscopic mucosal resection (EMR) and colonoscopies18. These proof-of-concept studies led to the development of the pipeline presented here, which comprises an intuitive graphical user interface (GUI), data preprocessing, LLM-based IE and automated evaluation within one pipeline. Previously, we introduced the LLM Anonymizer, a special case of IE with the purpose of anonymizing medical reports19.

The latest open-source LLMs can easily be implemented within the pipeline, which also facilitates benchmarking of different models in accurately extracting relevant entities and information based on the specific needs of requestors. Currently, all models available in the Generative Pre-trained Transformer (GPT)-Generated Unified Format (GGUF) can be included in the pipeline, such as Llama-2 with 7 billion parameters (7b), Llama-2 70b, Llama-3 8b, Llama-3 70b, Llama-2 “Sauerkraut” 70b, Phi-3, Mistral 7b and many more. By producing outputs in CSV format, we enable seamless integration with existing data analysis tools and workflows, facilitating quantitative analysis without the need for specialized computational skills. For example, existing databases such as cancer registries or clinical databases could be populated with the help of our pipeline.

Overview of the protocol

The protocol consists of four main stages: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE and 4) output evaluation (Figure 1). The protocol facilitates any kind of IE from medical free-text documents, with a variety of input formats possible. It is easy to use for clinical researchers without NLP expertise and allows the application of the latest LLMs for medical IE. We have shown that the protocol is broadly applicable to any kind of medical text data. The protocol is available as an open-source codebase on GitHub (https://github.com/KatherLab/LLMAIx). Additionally, our method can be run on modest hardware resources (e.g., a single graphics processing unit (GPU) with 48 GB of video random-access memory (VRAM)), making it more accessible and cost-effective compared to systems with higher computational demands.

Figure 1 -. Information Extraction Workflow.


A The information extraction pipeline follows a common path that includes data preprocessing, optional Optical Character Recognition (OCR), document splitting, and support for various file formats (CSV, Excel, PDF, or TXT). B After preprocessing, users can specify model parameters such as hyperparameters, prompts, and the desired output structure. Once these are defined, the LLM-based information extraction process begins. C The resulting ZIP file contains the output CSV with LLM predictions of the desired information and the original reports. The evaluation process offers two options. D If the pipeline is used for information extraction, it identifies and extracts the required information into a CSV file. This extracted CSV file can then be compared to a ground truth CSV file. Confusion matrices and comprehensive performance metrics are generated to visualize and evaluate the pipeline’s performance. E If the pipeline is used for document anonymization, the original documents are redacted to obscure personal identifiers and can be compared to annotated data files. The pipeline automatically generates confusion matrices that visualize matching and mismatching characters, facilitating easy performance evaluation. The anonymization part of this figure is based on the workflow depiction of our previous publication19.

Applications of the method

The pipeline has been applied in several clinical use cases to demonstrate its versatility and effectiveness, and can be used wherever quantifiable, structured data is required from unstructured medical text. Unlike traditional NLP methods, which often require task-specific training and fine-tuning20, this pipeline utilizes LLMs, which excel in zero-shot applications: they can make predictions on data that was not encountered during training, without requiring any task-specific fine-tuning. This makes LLMs ideal for processing a wide range of medical documents.

Furthermore, the performance of LLMs can be optimized through direct interaction via prompting and prompt engineering21,22. This facilitates in-context learning, a methodology wherein the user presents the desired output to the model and provides one or multiple examples of the correct solution11. This approach is entirely based on model inference, eliminating the need for extensive retraining of the model.

The procedure can be applied for a variety of purposes:

  1. Interdisciplinary collaboration. When exchanging healthcare data across multiple centers for research, a uniform data standard is key. LLM information extraction based on a predefined data standard could support interdisciplinary cooperation without the need to exchange original text data; the original documents, which may contain sensitive information, can remain at the sites where the data emerges.

  2. Clinical research. Quantitative research requires quantitative data and cannot be performed on information hidden in free text. Current practice is manual extraction of information from medical text by medical documentalists or scientific assistants, which is time-consuming and complicated by personnel shortages in the healthcare sector. Additionally, the tool could support filling patient registries such as cancer registries and clinical trial documentation (e.g., by structured extraction of adverse events from free-text clinical notes).

  3. To build quantitative downstream models. For example, to predict certain outcomes from other data modalities, such as radiology images, one needs outcome information about the respective patients, which is usually hidden in free-text documentation. This information can be extracted with our pipeline and then serve as labels for training predictive machine learning models23–25.

  4. Quality assurance and auditing. The tool could help to complete clinical data used for measuring quality of care.

  5. EHR data integration. The structured information could enrich patients’ electronic health records to reflect a more complete picture of patient history and treatment in a way that is interoperable and accessible for all healthcare providers.

Comparison with other methods

Until now, due to the shortcomings of other methods, the gold standard in medical IE has been labor-intensive, manual IE by medical documentalists or medical staff. Traditional methods such as machine-learning-based named entity recognition (NER) typically extract a fixed set of entities and offer limited flexibility. For example, they extract all names, dates and locations from a text without being able to interpret the context: if only the surgery date is needed from a report, it cannot easily be identified among all the dates mentioned in the text.

In contrast, our LLM-based approach allows for the flexible definition of entities to be extracted through advanced prompt engineering and in-context learning capabilities. This adaptability makes it more suitable for the dynamic and varied needs of medical data analysis.

NLP in the pre-transformer era - Beginnings of pre-trained models

Initially, IE relied on hand-crafted rules and required extensive manual effort to define patterns, which limited adaptability to different domains26,27. Machine learning techniques, which used labeled data to train models, improved IE performance in the NLP domain28. These methods leveraged features such as part-of-speech tags (nouns, verbs or adjectives)29, syntactic structures (noun phrases, verb phrases and adjective phrases)30, and lexical cues, i.e., words indicating a specific relation or entity, such as intensifiers like “very” in the sentence “she was very happy”31, to improve the accuracy of entity recognition and relation extraction. Notable algorithms like Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) were widely adopted for these tasks32,33. Non-neural methods such as n-gram models34–36, and neural network methods, particularly recurrent neural networks (RNNs) and convolutional neural networks (CNNs), improved the capture of contextual information37. However, labeled data is scarce, particularly in the medical domain, so unsupervised and semi-supervised learning approaches were advanced in parallel; they aim to use large amounts of unlabeled text to automatically discover patterns in the data. Word embeddings, which represent words in continuous vector spaces, such as Word2Vec or GloVe, further enhanced generalizability across different contexts38. ULMFiT followed as one of the first approaches to pre-trained models39. However, all of these methods suffered from limited context understanding at the document level, did not capture polysemy, had a fixed vocabulary size, required large text corpora for training and remained insufficient for IE in the medical field40.

New prospects with LLMs

The development of the transformer architecture, a deep learning architecture based on multi-head attention41,42, substantially changed the NLP landscape. In particular, the introduction of Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) advanced the field. These models capture language patterns and context, and subsequent fine-tuning of pre-trained models on specific tasks achieved good results in entity recognition43. BERT-based models have also been established for the biomedical domain (BioBERT, SciBERT, ClinicalBERT, BioMedRoBERTa) and tested on several benchmark datasets (GLUE, MultiNLI, SQuAD)44,45. Nevertheless, BERT models require fine-tuning for successful IE, which demands procedural and programming knowledge, and they have a very limited context length10.

LLMs, which have larger parameter sizes than BERT-based language models, have shown great potential in classical IE tasks46. They offer high zero-shot performance and shift the emphasis from fine-tuning and model training towards prompt engineering. In-context learning, which does not alter the model’s weights and operates purely in natural language, potentially allows medical staff without programming knowledge to seamlessly integrate these tools into their daily routines. Furthermore, it provides maximum flexibility to extract contextually relevant information as specified by the requester while requiring minimal programming knowledge, making it ideal for information extraction in the medical field47. The strength of our approach lies in its robust performance across datasets of any size, ensuring efficiency and accuracy whether analyzing a single report or aggregating insights from a vast collection of documents.

Experimental design

To validate the efficacy of our protocol, we conducted experiments across different datasets in different languages and clinical settings. Each use case was designed to test the protocol’s ability to accurately and efficiently process unstructured text into structured data while addressing specific clinical questions. Performance was assessed in terms of accuracy, sensitivity, specificity, F1-score and precision, as well as the ability to maintain data integrity and privacy.

Expertise needed to implement the protocol

We have designed the pipeline to require almost no programming knowledge, with a user interface that allows intuitive data processing for non-technical users. However, the pipeline setup requires some knowledge of virtual environments and navigating a terminal.

Additionally, a useful application requires domain knowledge. It is therefore crucial that medical experts clearly define the entities of interest to enable concise and effective prompting, which is central to the protocol’s operation. This highlights the importance of a well-understood and agreed-upon definition of the entities among the clinical team members to facilitate accurate extraction of information.

Limitations

While our pipeline significantly enhances the accessibility and utility of unstructured medical text data, it does have limitations:

  • Dependence on High-Quality Data Inputs: The effectiveness of the LLM is contingent on the quality and diversity of the input data. Handwritten documents and poorly scanned files may not be effectively processed by the implemented Optical Character Recognition (OCR) engines.

  • Computational Resources: Although the pipeline can run on consumer hardware, the need for a GPU with substantial VRAM may limit implementation in resource-constrained settings. Models with larger parameter sizes (such as Llama 3.1 405B) may require additional hardware.

  • LLM-inherent constraints: LLMs may generate information or statements that are factually incorrect, misleading, or fabricated, so-called “hallucinations”. These can be mitigated by adjusting hyperparameters and providing proper in-context learning, but cannot be completely eliminated. However, LLM-AIx can still reduce IE time, even with a human in the loop. When using a homogeneous dataset, the error rate in IE can be considered comparable to that measured on a small evaluation subset18.

Materials

Data

The unstructured text data used to run the pipeline can have different formats. Our pipeline allows processing of portable document format (PDF), raw text (TXT), comma-separated values (CSV) and Excel files.

Hardware

The pipeline can be run fully locally on consumer hardware (such as an NVIDIA RTX 4090 or an Apple Silicon M2 MacBook). We ran the pipeline on a single graphics processing unit (GPU), an NVIDIA RTX A6000 with 48 gigabytes (GB) of video random-access memory (VRAM). In principle, the pipeline can be deployed on consumer hardware with any GPU with at least 12 GB of VRAM; processing times, context length and memory may limit deployment on consumer hardware to smaller LLMs and datasets.

To enable use on comparatively low-resource hardware, we employed only quantized models (4- and 5-bit quantization), which are smaller than unquantized LLMs but maintain comparable performance48.

Software

The pipeline can be used through a graphical user interface without any programming knowledge. It can be downloaded as a Docker image for a quick setup, including all its dependencies except the model files in GGUF format.

Alternatively, manual setup is possible by installing the required Python packages as well as additional software packages (tesseract, llama.cpp), as described in the README.md file (https://github.com/KatherLab/LLMAIx). The pipeline requires Python 3.12 or higher.

The data preprocessing stage includes different OCR options for processing image-only PDFs. We implemented the popular open-source OCR engine “tesseract” via the package OCRmyPDF, as well as potentially superior alternatives such as “surya”49, which can be selected by the user; the default is tesseract.
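
The snippet below is a minimal sketch, not the pipeline's internal code, of how OCRmyPDF can add a text layer to an image-only PDF with the tesseract engine; the file names are illustrative.

    # Minimal sketch: add a searchable text layer to an image-only PDF with
    # OCRmyPDF (which drives the tesseract engine). File names are illustrative.
    import ocrmypdf

    ocrmypdf.ocr(
        "scanned_report.pdf",    # image-only input PDF
        "report_with_text.pdf",  # output PDF with a text layer
        language="eng",          # tesseract language pack to apply
        skip_text=True,          # leave pages that already contain text untouched
    )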

The protocol adopts the llama.cpp framework, which enables the application of a variety of LLMs formatted in GGUF. It allows LLM inference with state-of-the-art performance on a variety of hardware, locally and in the cloud50. It is an open-source project that enables the use of Llama (Large Language Model Meta AI) and other models in C++.
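
To illustrate how text completion against a local llama.cpp server works, the following hedged sketch sends a request to the server's /completion endpoint; it assumes a llama.cpp server has already been started with a GGUF model on its default port 8080, and the prompt is illustrative.

    # Hedged sketch: query a locally running llama.cpp server. Assumes the
    # server was started with a GGUF model and listens on the default port 8080.
    import requests

    response = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": "Extract the TNM stage from this report: ...",
            "n_predict": 256,    # maximum number of tokens to generate
            "temperature": 0.0,  # deterministic sampling suits extraction tasks
        },
        timeout=600,
    )
    print(response.json()["content"])  # the model's raw text answer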

Equipment Setup

To set up the pipeline, two main steps are necessary: 1) model download, followed by either 2A) Docker pipeline setup or 2B) manual pipeline setup:

  • 1) Download the desired models in GGUF format onto your local system.

  • 2A) Edit docker-compose.yml with the correct model path and follow the instructions. Then run the docker image as described in the README.md.

  • 2B) Download the pre-built llama.cpp from GitHub and follow the installation instructions50. Then clone the pipeline’s GitHub repository, create a virtual environment and install all necessary Python packages within the environment. The detailed setup and all dependencies are described in the README.md file.

Procedure

Stage 1: Problem definition and data preparations


1. Define the use case:

TIMING: variable

Preparation for utilizing our protocol requires users to define their specific extraction tasks clearly. This includes specifying the nature of the information to be extracted (for example, identifying complications from endoscopy reports), the format and volume of the data under analysis, and the desired output categories for subsequent analysis. To accommodate documents in various formats (TXT, PDF, or CSV), our protocol standardizes the data into a uniform format (CSV) through automatic conversion and compilation. This standardization is critical for ensuring consistent analysis across diverse datasets.

2. Assess the input data:

TIMING: variable

Identify the raw text data and the format in which it is available. A patient can have multiple documents; processing happens per document. If the user plans to process CSV or Excel files, the text needs to be placed in a dedicated column called “report”, and each text document needs a unique ID in the column “id”. If the user plans to upload single files for each document (e.g. PDF), the files need to be named by the unique ID. Consistency in data labeling needs to be ensured in this data preparation step.
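
The following sketch shows the expected tabular layout, one row per document with a unique “id” and the free text in a “report” column; the file name and contents are illustrative.

    # Illustrative input table: one row per document, unique "id", text in "report".
    import pandas as pd

    docs = pd.DataFrame({
        "id": ["pat001_letter", "pat001_pathology", "pat002_letter"],
        "report": [
            "Discharge letter text ...",
            "Pathology report text ...",
            "Discharge letter text ...",
        ],
    })
    assert docs["id"].is_unique, "every document needs a unique id"
    docs.to_csv("input_reports.csv", index=False)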

3. Define the validation strategy:

TIMING: variable

This can either be done with document-wide labels (IE-Pipeline) or annotated text data (IE-Anonymizer).

4. Prepare ground truth data:

TIMING: variable

If automated evaluation of the LLM IE pipeline is to be performed, a ground truth table for the data or a data subset needs to be provided. If there are unique labels per document, the ground truth needs to be provided in tabular format (CSV or Excel), containing all variables that should be extracted as columns and the corresponding ground truth, with the same characteristics as defined for IE. If text annotations and the respective labels are to be compared to the LLM output, annotated JSON files (one per document) need to be zipped. We recommend annotation with INCEpTION51, for which the annotation comparison of this pipeline has been optimized.

5. Download the desired LLMs

TIMING: variable

If not done beforehand, the user needs to create a folder on the local computer and download the desired LLMs into that folder. All models must be downloaded in GGUF format and can be accessed via Hugging Face (huggingface.co), a company and open-source community that provides a wide range of tools and libraries for NLP and machine learning. We tested several models within the pipeline, among them Llama-3.152 and Llama-353 models, Mistral54, Llama-255,56, Gemma57 and Phi models58.
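
As an alternative to downloading via the website, GGUF files can be fetched programmatically; the sketch below uses the huggingface_hub library, and the repository and file names are illustrative placeholders for the desired model.

    # Hedged sketch: download a quantized GGUF model file with huggingface_hub.
    # Repository and file names are illustrative and must be replaced.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-GGUF",  # example GGUF repository
        filename="llama-2-7b.Q4_K_M.gguf",   # example 4-bit quantized variant
        local_dir="models",                  # folder the pipeline reads models from
    )
    print(model_path)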

6. Setup for running the pipeline

TIMING: ~5 minutes

The pipeline can either be run via Docker image or set up and started manually. Detailed descriptions can be found in the README.md. The procedure is described here for the Docker version; Docker needs to be installed59.

Stage 2: Data preprocessing

7. Pipeline initiation:

TIMING: ~1 min

A) Docker: The docker-compose.yml must be adapted with the correct model path as described in the README.md. Then run the Docker image with “docker-compose up”. B) Manual: To initiate the pipeline, the user activates the virtual environment and navigates into the repository via the terminal. There, the app can be started with the terminal command “python app.py --model_path USERMODELPATH”. The application will then be loaded on the local server and can be used browser-based; click on the link provided in the terminal (default: http://localhost:5000). The user then chooses the mode based on preferences; either IE or anonymization mode can be selected (Figure 2).

Figure 2 -. Preprocessing Procedure GUI - Stage 2, 7–8.


A Schematic depiction of the preprocessing stage. Files can be uploaded and will be split into smaller document chunks if necessary, according to the split size determined. A zip file containing the original documents and a CSV file that organizes all documents for LLM processing can be downloaded at the end. B When the browser-based application is started, the mode of action can be chosen: either “LLM Anonymizer” or “LLM Information Extraction”. Both modes share the same preprocessing and data processing, but the evaluation part differs. Modes can be switched at any time during the process via the drop-down menu in the upper right corner. C The user’s documents can be uploaded in the preprocessing step. Images, portable document format (PDF), raw text, Excel and comma-separated value (CSV) files are allowed. The split size, which can be set according to the expected context window size of the LLM, is 14,000 characters by default but can be adapted if necessary. Clicking the “Preprocess Files” button starts the preprocessing; progress is indicated with a progress bar. The “Download” button allows downloading and storing the preprocessed zip file, which is needed in the further processing steps.

8. Preprocess the data:

TIMING: variable, depending on the number of reports that require OCR

After the data has been prepared, it can be preprocessed. The “Preprocessing” tab allows uploading the data files by clicking the “Upload Data” button. PDF files can be uploaded both as image-only PDFs and as PDF documents with a text layer. The name of each PDF should contain the document ID and match the ID in the annotated ground truth file. Additionally, Excel or CSV files can be uploaded; they need to contain an “id” column and a “report” column with one report per line. Once all documents are uploaded, the user can select the desired OCR method from a drop-down menu and specify the split size in characters. If needed, documents will then be split to accommodate the limited context windows of certain models. After clicking the “Preprocess Files” button, a progress bar indicates the preprocessing status. As soon as preprocessing is finished, the preprocessed data needs to be downloaded as a zipped folder, which contains all original documents as separate PDF files as well as a CSV file (Figure 2).
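
Conceptually, the splitting step cuts any report longer than the split size into fixed-length chunks; the sketch below shows this as a minimal assumption about the behavior, and the pipeline's actual splitting logic may differ in detail.

    # Minimal sketch of character-based document splitting; the pipeline's
    # actual implementation may differ in detail.
    def split_report(text: str, split_size: int = 14000) -> list[str]:
        """Cut a report into chunks of at most split_size characters."""
        return [text[i:i + split_size] for i in range(0, len(text), split_size)]

    chunks = split_report("very long report text ...", split_size=14000)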

Stage 3: LLM-based information extraction

9. Prepare the LLM-based information extraction (Figure 3)

Figure 3 -. LLM-processing Procedure - Stage 3, 9–10.


A Schematic depiction of the LLM-processing stage. The large language model (LLM) based information extraction requires uploading the preprocessed zip file. Then the LLM settings can be determined. The prompt field allows inserting a specific prompt; at the “{report}” indicator, the original report text will be inserted. Additionally, LLM hyperparameters can be set (temperature and the number of tokens to be predicted (n_predict)) and the desired model can be chosen via a drop-down menu. To ensure a consistent output structure, the model can be given a JavaScript Object Notation (JSON) schema. This can be defined manually, which is error-prone; we therefore implemented a “Grammar Builder” that allows defining the feature names and values to be extracted. The grammar configuration shown in B can be downloaded, stored on the local computer and loaded whenever needed. With the button “Generate Full Grammar”, it is loaded into the processing mask. Afterwards, “Run LLM Processing” starts the information extraction. The chosen model is then loaded onto the local Graphical Processing Unit (GPU) before the information extraction starts; this is indicated by a loading circle in the Graphical User Interface (GUI) or the uploading dot indicators in the terminal. As soon as the model is successfully loaded onto the GPU, the LLM-based processing starts and progress is indicated with a progress bar.

TIMING: depends on the number of variables and the complexity of the prompt, ~2 min

Model Selection

The user specifies the desired model by choosing it from the drop-down menu. We have predefined the most common open-source models; additional models can be added by downloading them to the predefined model folder and adding them to the YAML configuration file.

Upload the preprocessed zip file

With the “upload” button, the preprocessed zip file can be uploaded for further processing.

Prompt definition

The prompt can be defined in the “Prompt” field and customized according to the user’s needs. At the location of the “{report}” placeholder, the respective report will be inserted; this element therefore needs to stay within the prompt as is, although its location can be varied as needed. We recommend a prompt structured in two main parts: giving background to the model and instructing the model with the task. If the IE task is more complex, it can be helpful to insert examples for few-shot prompting that leverage the in-context learning abilities of the models. It is important to present the examples in the same format as the desired output, as in the sketch below.
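
The following is an illustrative two-part prompt, background plus task, with the mandatory “{report}” placeholder; the wording is an example, not a prescribed prompt.

    # Illustrative prompt template; the pipeline substitutes each report text
    # at the "{report}" placeholder.
    prompt_template = (
        "You are a helpful medical assistant. You extract structured "
        "information from clinical reports.\n"
        "Task: State whether a pulmonary embolism is described and, if so, "
        "its side (left, right or bilateral).\n"
        "This is the report: {report}"
    )
    prompt = prompt_template.format(report="Example report text ...")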

Grammar specification

The “Grammar” section contains a JSON schema defining the output structure of the model. The grammar-based sampling approach applied here enables the user to flexibly define the features to be extracted from the text and their possible values. The desired output JSON schema can either be adapted manually (error-prone and therefore not recommended) or built with the “Grammar Builder”: the user assigns a label name for the information to be extracted, selects the desired output format (string, boolean, categories, number) and further specifies the categories (as a comma-separated list) or the character length of the string or number. Once defined, the grammar can be downloaded and stored as a CSV file by clicking the “Save Configuration” button; whenever the same configuration is required again, it can be uploaded via the “Load Configuration” button. Once the grammar is set up correctly, the user loads it into the “Grammar” section with the button “Generate Full Grammar”. An illustrative schema is shown below.
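
For orientation, the sketch below shows a JSON schema of the kind the “Grammar” section expects; the field names, types and categories are illustrative, not the Grammar Builder's literal output.

    # Hedged example of an output-constraining JSON schema; field names,
    # types and categories are illustrative.
    import json

    output_schema = {
        "type": "object",
        "properties": {
            "shortness_of_breath": {"type": "boolean"},
            "embolism_side": {
                "type": "string",
                "enum": ["left", "right", "bilateral", "not mentioned"],
            },
        },
        "required": ["shortness_of_breath", "embolism_side"],
    }
    print(json.dumps(output_schema, indent=2))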

If prompt, grammar and hyperparameters are correctly defined and the preprocessed file is uploaded, clicking the button “Run LLM Processing” initiates the process.

10. Run the LLM based information extraction

TIMING: variable, depending on model size and number of reports. First, the respective model is loaded onto the local server; this may take a few seconds up to several minutes, depending on the model size and hardware specifications. When the model is loaded, processing starts and a progress bar indicates the remaining number of reports and the estimated remaining time, which is updated after each processed report. After the process is finished, the user can download and store the processed zip file. This file contains all original reports as PDFs, the preprocessed CSV and an output CSV that contains all LLM answers as well as meta-information including prompt and hyperparameter settings.

Stage 4: Output evaluation


11. Run evaluations

TIMING: variable, depending on the number of reports; ~1–2 sec/report

A. Run evaluation in Information Extraction Mode

The “Label Annotation Viewer” enables uploading both the output ZIP file and the ground truth CSV file. It is essential to ensure that the IDs and variable names are unique and match in both files (Figure 4A). The process can be initiated by clicking “Label Annotation Metrics Summary” to obtain all metrics and “Label Annotation Viewer” to review results on a document-by-document basis. Before results are obtained, the data types of the extracted variables have to be confirmed to ensure proper evaluation (Figure 4B).

Figure 4 -. Information extraction evaluation initiation- Stage 4, 11A.


A Schematic depiction of the evaluation process. LLM output and human-made ground truth are compared automatically; metrics as well as confusion matrices are calculated and displayed. These can be downloaded in a metrics zip file which contains metrics, figures, original documents and LLM output. B The LLM Annotation Metrics and Viewer requires uploading the LLM output file (zip file) and the annotated file (CSV or Excel). C After initiating the comparison of the output file and the ground truth file with the “Label Annotation Metrics Summary” button, the label data types of the extracted information variables need to be confirmed.

B. Run evaluation in Anonymizer Mode

To evaluate the output in Anonymizer mode, the output zip file can be uploaded in the same way. The ground truth needs to contain annotations for the individual reports in JSON format. For annotation tasks, we used the open-source annotation tool “INCEpTION”51. The annotated JSON exports need to be zipped and can then be uploaded analogously to the IE evaluation.

12. Revise Metrics and Files

TIMING: variable, up to the user’s preferences

The “Label Annotation Summary” provides global metrics for the experiment by comparing the output data to the ground truth data. For boolean IE, it presents a comprehensive set of metrics, including Accuracy, F1 Score, Precision, Recall, False Positive Rate, and False Negative Rate, both across all reports and variables, and for each variable individually. A confusion matrix visualizes true and false positives and negatives (Figure 5A). For categorical variables, the confusion matrix shows matches between ground truth categorical values and output values. For string variables, a string match is displayed. In anonymization tasks, a character-wise overview of truly and falsely redacted characters is given, both as a global metric and for each patient identifier individually. Additionally, each report is listed below the metrics overview. Users can individually select and review each report to double-check and compare the LLM output with the original text and ground truth annotation (Figure 5B).
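
The same comparison can be reproduced outside the GUI; the sketch below assumes the LLM output and the ground truth are available as CSV files keyed by “id” (file and column names are illustrative) and uses scikit-learn for the metrics.

    # Hedged sketch: recompute evaluation metrics for one boolean variable.
    # File and column names are illustrative.
    import pandas as pd
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    pred = pd.read_csv("llm_output.csv").set_index("id")
    truth = pd.read_csv("ground_truth.csv").set_index("id")
    merged = pred.join(truth, lsuffix="_pred", rsuffix="_true").dropna()

    y_pred = merged["shortness_of_breath_pred"].astype(bool)
    y_true = merged["shortness_of_breath_true"].astype(bool)
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))  # rows: ground truth, columns: prediction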

Figure 5 -. Evaluation of Information Extraction Results GUI, Stage 4, 11–12.


A Upload the output zip file and the ground truth file. IDs have to match the document IDs from preprocessing, and feature labels have to match the feature names defined for processing. Start the evaluation process by clicking “Label Annotation Metrics Summary”. B The label annotation summary provides the global metrics accuracy and F1 score, but also allows checking each document individually for all features as well as each feature individually. C Clicking on the respective report displays the original text and the model output as well as the respective ground truth provided. This enables a seamless evaluation and document revision process.

Timing

The time required to complete this protocol varies depending on the size of the dataset, the size of the LLM and the available computational resources. The most time-intensive steps are Step 9 “Prepare the LLM based information extraction”, Step 10 “Run the LLM based information extraction” and Step 12 “Revise Metrics and Files”. The time estimates provided in Table 1 are based on running the protocol on the example dataset (n=100 TCGA pathology reports with the Llama 3.1 70B model) using an NVIDIA RTX A6000 GPU on a Windows workstation.

Table 1 -.

Short description of Information Extraction procedure

Step Name Timing for TCGA report example
Stage 1: Problem definition and data preparations
1 Define the use case predefined
2 Assess the input data predefined
3 Define the validation strategy predefined
4 Prepare the ground truth data predefined
5 Download the desired LLMs ~20 min
6 Setup for running the pipeline ~2 min
Stage 2: Data preprocessing
7 Pipeline initiation ~1 min
8 Preprocess the data ~5 min
Stage 3: LLM-based information extraction
9 Prepare the LLM based information extraction ~5 min
10 Run the LLM based Information Extraction ~60 min
Stage 4: Output evaluation
11 A Run evaluation in Information Extraction Mode B Run Evaluation in Anonymizer Mode ~5 min
12 Revise Metrics and Files ~60 min

Anticipated Results

Our protocol has demonstrated its efficacy and versatility through application to diverse datasets, notably the MIMIC dataset17 as well as psychiatric16 and endoscopy report analyses. We demonstrate the pipeline results for both anonymization and IE of eight fictitious clinical letters for patients with pulmonary embolism. Llama-3 70B correctly identified all patients’ first names and last names, gender, age, date of birth and patient ID, which led to correct redaction of this information in the reports. The character-wise evaluation yields 99.9% specificity and 100% sensitivity for the anonymization of the 8 clinical letters, with 98.2% precision for full name redaction, 94.3% precision for first name and 90% for last name redaction (Table 2). The discrepancy between the IE of all personal identifiers (100% precision and sensitivity) and the redaction metrics arises because one fictitious patient’s last name matched the provider’s last name (“Miller”, see Supplement) and was therefore also redacted by the exact string match redaction of our pipeline.

Table 2 -.

Results LLM-Anonymizer (Macro-Averages)

All labels Patient name First name Last name Sex Patient id Age Date of birth
Accuracy in % 99.9 99.9 99.9 99.8 100 100 99.9 100
F1 0.994 0.988 0.969 0.944 1.0 1.0 0.958 1.0
Precision in % 98.9 97.7 94.3 90.0 100 100 93.8 100
Recall in % 100 100 100 100 100 100 100 100
FPR 0.0008 0.001 0.0013 0.0026 0.0 0.0 0.0001 0.0
FNR 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The fictitious clinical letters describe the clinical course of patients with pulmonary embolism, each with a different constellation of etiology and risk profile for this disease. We therefore aimed at extracting the presence of leading symptoms such as shortness of breath, chest pain, leg pain or swelling, heart palpitations, cough and dizziness from the clinical letters (Table 3), as well as information about the embolism side (left, right or bilateral). The presence of all symptoms was correctly identified with 100% precision and sensitivity, except for the symptom “heart palpitations”, which was missed in one clinical letter.

Table 3 -.

Results Information Extraction Pulmonary Embolism on Fictitious Reports

Shortness of breath Chest pain Leg pain or swelling Heart palpitations Cough Dizziness Embolism side
Accuracy in % 100 100 100 88.0 100 100 100
F1 1.0 1.0 1.0 0.8 1.0 1.0 1.0
Precision in % 100 100 100 100 100 100 100
Recall in % 100 100 100 67.0 100 100 100
FPR 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FNR 0.0 0.0 0.0 0.33 0.0 0.0 0.0

To demonstrate the capacity of our pipeline, we additionally used a dataset previously used for IE, n=100 TCGA pathology reports of patients with colorectal cancer, aiming to extract information about TNM stage60. We ran the experiment with the Llama 3.1 70B parameter model and achieved an overall accuracy of 87% across all extracted variables. Extracting the T stage was accurate in 89% of the reports (F1 Score 0.57, Precision 52%, Recall 68%). The N stage was accurately extracted in 92% of all reports (F1 Score 0.86, Precision 85%, Recall 87%). The M stage was accurately extracted in 82% (F1 Score 0.69, Precision 68%, Recall 93%). The number of lymph nodes examined was correctly extracted in 87% and the number of lymph nodes positive for cancer cells in 90%. Whether the resection margin was tumor-free could be identified with an accuracy of 86% (F1 Score 0.92, Precision 87%, Recall 99%, FPR 93%, FNR 1%). The extraction of whether lymphatic invasion was present achieved an accuracy of 86% (F1 Score 0.82, Precision 70%, Recall 100%, FPR 21%, FNR 0%).

A report-by-report error analysis revealed an erroneous ground truth for some reports. After correcting the mistaken ground truth, overall accuracy increased to 88%, with 90% for T stage, 92% for N stage, 82% for M stage, 87% for the lymph nodes examined, 90% for cancer-positive lymph nodes, 91% for the tumor-free resection margin and 87% for lymphatic invasion (all metrics are shown in Table 4). In some cases, the original report contained conflicting information; for example, one report stated “resection margin negative (carcinoma less than 1mm from the radial margin)”, which is in fact often defined as “resection margin positive” because of the high risk of tumor recurrence61–63. In another example, “M0” and “Mx” were apparently stated at the same time by accident, and “M0” was extracted by the LLM. In four cases, the error arose from incorrect OCR of low-quality scans or handwriting within the report: the LLM cited the characters exactly as OCR had incorrectly stored them in the document (e.g. “N1” was recognized as “N:”, “MA” was recognized as “M1” and the roman numerals “I/IV” for positive lymph nodes were recognized as “1/9” and cited by the LLM as such). We repeated preprocessing of the documents and forced OCR with a different OCR engine, “Surya”, and found a slight increase in performance metrics (Table 5). Detailed guides through the experiments with screenshots, also containing the prompts and grammar used, are given in the Supplement.

Table 4 -.

Results after adapting the erroneous ground truth and adjusting the categories

All T stage N stage M stage Number of lymph nodes examined Number of positive lymph nodes Tumor free resection margin Lymphatic invasion
Data type categorical categorical categorical number number boolean boolean
Accuracy in % 88.0 90 92 82 87 90 91 87
F1 0.57 0.86 0.69 0.95 0.85
Precision in % 52 85 68 92 77
Recall in % 68 87 93 99 95
FPR 0.89 0.18
FNR 0.01 0.05

These examples demonstrate the successful application of the LLM-AIx-Pipeline for clinical research questions and use cases. It facilitates extracting information from unstructured medical reports with LLMs, thus enabling a structured downstream processing of relevant healthcare data.

Troubleshooting

Step Problem Possible Reason Solution
Data preprocessing
6 Expecting xx lines in, saw yy Erroneous CSV, wrong encoding Make sure you store the data in proper CSV files and ensure UTF-8 encoding.
LLM-based information extraction
10 Model could not be loaded on Server GPU capacity exceeded, too many processes are running in parallel or model is too big for hardware resources Terminate other processes running on the GPU. If model size is the problem, use a smaller model or higher quantization.
10 LLM processing very slow Pipeline is running on the CPU of consumer hardware Loading the LLM onto a GPU, if available, increases the speed of the process. The pipeline loads models onto the GPU by default.
10 Empty LLM output The LLM output is invalid JSON because the model did not answer anything, responded with invalid JSON, or the answer was cut off because n_predict is too low Increase n_predict. Test with a less complex grammar or prompt, and try other models.
Output Evaluation
11 Mismatch LLM output with annotation LLM hallucinated (Supplementary Figure 12) Improve the explanation for this variable in the prompt
11 Mismatch LLM output with annotation Ground truth is wrong Refine the ground truth through expert discussion
11 Mismatch LLM output with annotation Input text data is conflicting (Supplementary Figures 13, 14) Refine the ground truth through expert discussion
11 Mismatch LLM output with annotation Annotation is coarser than the LLM output (Supplementary Figure 15) Refine the ground truth through expert discussion
11 Mismatch LLM output with annotation Annotation is more detailed than the LLM output (Supplementary Figure 16) Refine the ground truth through expert discussion
11 Mismatch LLM output with annotation Classes are not sufficiently distinguishable, e.g. the model is unable to distinguish between “no” and “none” (Supplementary Figure 17) Classes need to be defined as clearly mutually exclusive, collectively exhaustive and close to the interpretation of natural language. E.g., if the presence of “lymphatic invasion” is questioned, the answer options are either “yes”, “no” or “not mentioned”. These classes are closer to human language and therefore clearer for an LLM than “yes, no, none”.
11 Mismatch LLM output with annotation Wrong OCR due to bad-quality PDFs (Supplementary Figure 18) If documents contain a high share of handwritten information and PDFs are of very bad quality, text cannot be extracted correctly; the extraction failure arises from data preprocessing and can be overcome by selecting another OCR method for preprocessing. “Surya” generally surpasses “tesseract”, and the vision LLM Phi or trOCR are better at detecting handwritten text.
11 Mismatch LLM output with annotation The LLM output is accurate but has minor character discrepancies compared to the annotation (Supplementary Figure 19) The desired output needs to be specified in as much detail as possible within the prompt and grammar.
11 Mismatch LLM output with annotation Information is present but could not be detected by the LLM (Supplementary Figure 20) Add a more detailed explanation within the prompt and give few-shot examples.
11 Mismatch LLM output with annotation LLM extracts information better than the human rater (Supplementary Figure 21) Refine the ground truth through expert discussion
11 Mismatch LLM output with annotation LLM lacks implicit knowledge (Supplementary Figure 22) The desired output and its definition need to be specified in as much detail as possible within the prompt. Add few-shot examples.
11 Mismatch LLM output with annotation Output categories are misdefined (Supplementary Figure 23) Make sure the output categories are the same as in the annotated ground truth

Supplementary Material

Supplement 1
media-1.pdf (5.5MB, pdf)

Table 5 -. Results after adapted OCR method (output-categories and boolean).

Standard Prompt: “You are a helpful medical assistant. You are supposed to extract information from a pathology report from a patient with colorectal cancer. I need to know the TNM stage of the patient. This is a system to describe the amount and spread of cancer in a patient’s body, using TNM. T describes the size of the tumor and any spread of cancer into nearby tissue; N describes spread of cancer to nearby lymph nodes; and M describes metastasis (spread of cancer to other parts of the body). If you find no information about the T, N or M stage, give Tx, Nx or Mx, respectively. If there is “pT1” or “pN”, just skip the “p” and give “T1” etc. Additionally, I need information about the number of lymph nodes examined and the number of positive lymph nodes. Let me know if the resection margin was tumor free and if there was lymphatic invasion. If you do not find information about resection margin or lymphatic invasion, say “not mentioned”.

This is the report: {report} ”

All T stage N stage M stage Number of lymph nodes examined Number of positive lymph nodes Tumor free resection margin Lymphatic invasion
Accuracy in % 89 90 92 88 88 91 91 88
F1 0.6 0.86 0.74 0.95 0.86
Precision in % 58 85 70 93 81
Recall in % 67 87 96 98 93
FPR 0.88 0.15
FNR 0.02 0.07

Acknowledgements

Thank you to the testers of the protocol, Chiara M. L. Loeffler, Hannah S. Muti and Ariana Bonetti, who tested the protocol on several information extraction tasks. Special thanks are extended to Chiara M. L. Loeffler for providing the annotated TCGA dataset and for support in pipeline naming. Appreciation is also expressed to Katherine Hewitt for pathological advice in adjusting the ground truth for TCGA reports.

Competing Interests

JNK declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; AstraZeneca, UK; Scailyte, Switzerland; Mindpeak, Germany; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI GmbH, Germany, has received a research grant from GSK, and has received honoraria from AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. ICW has received honoraria from AstraZeneca. KKB has received honoraria from Canon Medical Systems Corporation and GE Healthcare. No further potential COIs are disclosed by any of the authors.

Funding

JNK is supported by the German Cancer Aid (DECADE, 70115166), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan; Come2Data, 16DKZ2044A; DEEP-HCC, 031L0315A), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (TransplantKI, 01VSF21048) the European Union’s Horizon Europe and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council (ERC; NADIR, 101114631), the National Institutes of Health (EPICO, R01 CA263318) and the National Institute for Health and Care Research (NIHR, NIHR203331) Leeds Biomedical Research Centre. KKB is supported by the European Union’s Horizon Europe and innovation programme (COMFORT, 101079894), Bayern Innovativ and Wilhelm-Sander Foundation. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.


Footnotes

Code availability

The open-source software for the implementation of the IE experiments is available on GitHub under a CC-BY-NC license (https://github.com/KatherLab/LLMAIx).

Data availability

All research procedures were conducted in accordance with the Declaration of Helsinki. Ethics approval was granted by the ethics committee of Technical University Dresden, reference number BO-EK-400092023.

References

1. Meystre S. & Haug P. J. Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J. Biomed. Inform. 39, 589–599 (2006).
2. Sezgin E., Hussain S.-A., Rust S. & Huang Y. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Form. Res. 7, e43014 (2023).
3. Mehra T., Wekhof T. & Keller D. I. Additional Value From Free-Text Diagnoses in Electronic Health Records: Hybrid Dictionary and Machine Learning Classification Study. JMIR Med. Inform. 12, e49007 (2024).
4. Jensen P. B., Jensen L. J. & Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat. Rev. Genet. 13, 395–405 (2012).
5. Price S. J., Stapley S. A., Shephard E., Barraclough K. & Hamilton W. T. Is omission of free text records a possible source of data loss and bias in Clinical Practice Research Datalink studies? A case–control study. BMJ Open 6, e011664 (2016).
6. Joukes E., Abu-Hanna A., Cornet R. & de Keizer N. F. Time Spent on Dedicated Patient Care and Documentation Tasks Before and After the Introduction of a Structured and Standardized Electronic Health Record. Appl. Clin. Inform. 9, 46–53 (2018).
7. Moynihan D. et al. Author Correction: Analysis and visualisation of electronic health records data to identify undiagnosed patients with rare genetic diseases. Sci. Rep. 14, 10084 (2024).
8. Ford E. et al. Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK. J. Med. Ethics 46, 367–377 (2020).
9. Capurro D., Yetisgen M., van Eaton E., Black R. & Tarczy-Hornoch P. Availability of structured and unstructured clinical data for comparative effectiveness research and quality improvement: a multisite assessment. EGEMS (Wash. DC) 2, 1079 (2014).
10. Khurana D., Koli A., Khatter K. & Singh S. Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82, 3713–3744 (2023).
11. Ferber D. et al. In-context learning enables multimodal large language models to classify cancer pathology images. arXiv [cs.CV] (2024).
12. Mo Y. et al. C-ICL: Contrastive In-context Learning for Information Extraction. arXiv [cs.CL] (2024).
13. Ong T. C. et al. Dynamic-ETL: a hybrid approach for health data extraction, transformation and loading. BMC Med. Inform. Decis. Mak. 17, 134 (2017).
14. Landolsi M. Y., Hlaoua L. & Ben Romdhane L. Information extraction from electronic medical documents: state of the art and future research directions. Knowl. Inf. Syst. 65, 463–516 (2023).
15. Liu P. et al. A survey on open Information Extraction from rule-based model to large language model. arXiv [cs.CL] (2022).
16. Wiest I. C. et al. Detection of suicidality through privacy-preserving Large Language Models. medRxiv (2024) doi:10.1101/2024.03.06.24303763.
17. Wiest I. C. et al. From text to tables: A local privacy preserving large language model for structured information retrieval from medical documents. medRxiv (2023) doi:10.1101/2023.12.07.23299648.
18. Wiest I. C. et al. Deep sight: Enhancing periprocedural adverse event recording in endoscopy by structuring text documentation with privacy preserving large language models. iGIE (2024).
19. Wiest I. C. et al. Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer. medRxiv (2024) doi:10.1101/2024.06.11.24308355.
20. Su P. & Vijay-Shanker K. Investigation of improving the pre-training and fine-tuning of BERT model for biomedical relation extraction. BMC Bioinformatics 23, 120 (2022).
21. Wei J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
22. Sahoo P. et al. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv [cs.AI] (2024).
23. Gilardi F., Alizadeh M. & Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. U. S. A. 120, e2305016120 (2023).
24. Goel A. et al. LLMs Accelerate Annotation for Medical Information Extraction. in Proceedings of the 3rd Machine Learning for Health Symposium (eds. Hegselmann S. et al.) vol. 225, 82–100 (PMLR, 2023).
25. Ziems C. et al. Can Large Language Models Transform Computational Social Science? Comput. Linguist. 1–55 (2024).
26. Neves M. & Ševa J. An extensive review of tools for manual annotation of documents. Brief. Bioinform. 22, 146–163 (2021).
27. Waltl B., Bonczek G. & Matthes F. Rule-based information extraction: Advantages, limitations, and perspectives. (2018).
28. Freitag D. Machine Learning for Information Extraction in Informal Domains. Mach. Learn. 39, 169–202 (2000).
29. Banko M. & Moore R. C. Part-of-Speech Tagging in Context. in COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics 556–561 (COLING, Geneva, Switzerland, 2004).
30. Naradowsky J., Riedel S. & Smith D. Improving NLP through Marginalization of Hidden Syntactic Structure. in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (eds. Tsujii J., Henderson J. & Paşca M.) 810–820 (Association for Computational Linguistics, Jeju Island, Korea, 2012).
31. Khazaei T. & Xiao L. Corpus-based analysis of rhetorical relations: A study of lexical cues. in Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015) 417–423 (IEEE, 2015).
32. Pritam K. Advancements and methodologies in Natural Language Processing and machine learning: A comprehensive review. Int. J. Res. Appl. Sci. Eng. Technol. 12, 1495–1500 (2024).
33. Nagarhalli T. P., Vaze V. & Rana N. K. Impact of Machine Learning in Natural Language Processing: A Review. in 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) 1529–1534 (IEEE, 2021).
34. Brown P. F., Della Pietra V. J., de Souza P. V., Lai J. C. & Mercer R. L. Class-Based n-gram Models of Natural Language. Comput. Linguist. 18, 467–479 (1992).
35. Ando R. K. & Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005).
36. Blitzer J., McDonald R. & Pereira F. Domain adaptation with structural correspondence learning. in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing 120–128 (Association for Computational Linguistics, 2006).
37. Mikolov T., Sutskever I., Chen K., Corrado G. & Dean J. Distributed Representations of Words and Phrases and their Compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119 (2013).
38. Pennington J., Socher R. & Manning C. GloVe: Global Vectors for Word Representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds. Moschitti A., Pang B. & Daelemans W.) 1532–1543 (Association for Computational Linguistics, Doha, Qatar, 2014).
39. Howard J. & Ruder S. Universal Language Model Fine-tuning for Text Classification. arXiv [cs.CL] (2018).
40. Biswas R. & De S. A Comparative Study on Improving Word Embeddings Beyond Word2Vec and GloVe. in 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC) 113–118 (IEEE, 2022).
41. Vaswani A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
42. Perez-Lopez R., Ghaffari Laleh N., Mahmood F. & Kather J. N. A guide to artificial intelligence for cancer researchers. Nat. Rev. Cancer 24, 427–441 (2024).
43. Patwardhan N., Marrone S. & Sansone C. Transformers in the Real World: A Survey on NLP Applications. Information 14, 242 (2023).
44. Lewis P., Ott M., Du J. & Stoyanov V. Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. in Proceedings of the 3rd Clinical Natural Language Processing Workshop 146–157 (Association for Computational Linguistics, Online, 2020).
45. Wang B. et al. Pre-trained Language Models in Biomedical Domain: A Systematic Survey. ACM Comput. Surv. 56, 1–52 (2023).
46. Dagdelen J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
47. Alkhalaf M., Yu P., Yin M. & Deng C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J. Biomed. Inform. 156, 104662 (2024).
48. Li S. et al. Evaluating Quantized Large Language Models. arXiv [cs.CL] (2024).
49. Paruchuri V. surya: Release v0.4.15. GitHub https://github.com/VikParuchuri/surya/releases/tag/v0.4.15 (2024).
50. Gerganov G. llama.cpp. GitHub https://github.com/ggerganov/llama.cpp (2023).
51. Klie J.-C., Bugert M., Boullosa B., Eckart de Castilho R. & Gurevych I. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. in Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations (ed. Zhao D.) 5–9 (Association for Computational Linguistics, Santa Fe, New Mexico, 2018).
52. bartowski/Meta-Llama-3.1-8B-Instruct-GGUF. Hugging Face https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF.
53. QuantFactory/Meta-Llama-3-8B-GGUF. Hugging Face https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF.
54. TheBloke/Mistral-7B-Instruct-v0.1-GGUF. Hugging Face https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF.
55. TheBloke/Llama-2-7B-GGUF. Hugging Face https://huggingface.co/TheBloke/Llama-2-7B-GGUF.
56. VAGO solutions. SauerkrautLM-70b-v1. Hugging Face.
57. google/gemma-7b-GGUF. Hugging Face https://huggingface.co/google/gemma-7b-GGUF.
58. Microsoft. microsoft/Phi-3-mini-4k-instruct-gguf. Hugging Face https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf.
59. Docker. Get started with Docker. https://www.docker.com/get-started/.
60. Truhn D. et al. Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). J. Pathol. (2023) doi:10.1002/path.6232.
61. Ferlitsch M. et al. Colorectal polypectomy and endoscopic mucosal resection (EMR): European Society of Gastrointestinal Endoscopy (ESGE) Clinical Guideline. Endoscopy 49, 270–297 (2017).
62. Shaukat A. et al. Endoscopic Recognition and Management Strategies for Malignant Colorectal Polyps: Recommendations of the US Multi-Society Task Force on Colorectal Cancer. Gastrointest. Endosc. 92, 997–1015.e1 (2020).
63. Gijsbers K. M. et al. Impact of ≥ 0.1-mm free resection margins on local intramural residual cancer after local excision of T1 colorectal cancer. Endosc. Int. Open 10, E282–E290 (2022).


Supplementary Materials

Supplement 1: media-1.pdf (5.5 MB)


