2026 Feb 18;10:e2500133. doi: 10.1200/CCI-25-00133

Development and Assessment of a Pipeline for Extracting Structured Data From Free-Text Medical Reports Using a Large Language Model

Enzo Joseph 1, Paul Vallee 1, Tanguy Perennec 2, Nicolas Wagneur 2, Jean-Sébastien Frenel 3,4, Mario Campone 3,4, François Bocquet 1,5, Florent Le Borgne 1
PMCID: PMC12928813  PMID: 41707099

Abstract

PURPOSE

Medical free texts such as pathology reports contain valuable clinical data but are challenging to structure at scale. Traditional natural language processing approaches require extensive annotated data and training. We investigate the use of a large language model (LLM), Mistral, to automatically extract three breast cancer (BC) biomarkers from pathology reports.

MATERIALS AND METHODS

We developed and evaluated a pipeline combining Mistral Large LLM and a postprocessing phase. The pipeline's performance was assessed both at document and patient levels. For evaluation, two data sets were used: a data set of 1,152 pathology reports associated with 150 patients with BC focused solely on biomarker values and a gold standard database containing 101 patients with metastatic BC, enriched with detailed patient and tumor characteristics and double-blind validated by clinical research assistants. We also explored the pipeline's performance according to the use of a confidence prompt (CP), a chain of thought (CoT), and few-shot examples.

RESULTS

Our extraction pipeline achieved F1 scores of more than 95% and both recall and precision of more than 94% for each biomarker of interest (ie, estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 status and score) at the document level. At the patient level, the F1 score decreased to between 87% and 90%, with a greater drop in recall (ranging between 83% and 87%) than in precision, which remained >90%. The results were similar whether the pipeline included a CP, CoT, or few-shot examples.

CONCLUSION

Our study provides strong evidence of the potential of LLMs like Mistral Large for extracting structured BC biomarker data from pathology reports and the potential of such methods for broader digital transformation of health care documents.

INTRODUCTION

Artificial intelligence (AI) is transforming health care, particularly in diagnostics, clinical workflows, and decision making. Applying AI to unstructured electronic health records (EHRs) enables efficient use of real-world data (RWD) for research and clinical practice.1-5 Unlike data from controlled clinical trials, RWD capture the diversity and complexity of routine care but remain difficult to exploit because most information is stored as unstructured text.6,7 Manual abstraction is labor-intensive, time-consuming, and prone to errors, limiting the potential utility of these valuable data sets.8

CONTEXT

  • Key Objective

  • Can a large language model (LLM) such as Mistral reliably extract structured breast cancer (BC) biomarker data from unstructured pathology reports without extensive task-specific training?

  • Knowledge Generated

  • Our pipeline accurately extracted estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 status from pathology reports using Mistral, with F1 scores exceeding 95% at the document level. At the patient level, F1 scores ranged between 87% and 90%, rising above 90% when restricted to results from 2014 onward.

  • Relevance (F. Lin)

  • LLMs can accurately automate the extraction of variables, such as receptor status in BC, from unstructured pathology reports without the need for manual chart review, enabling scalable, efficient data collection for cancer registries, clinical research, and real-world evidence generation.*

  • *Relevance section written by JCO CCI Deputy Editor Frank Lin, PhD, MB ChB, FRACP, FAIDH.

Natural language processing (NLP) has become an indispensable tool in medicine, particularly for structuring data from EHRs. Early rule-based methods9-11 lacked flexibility, whereas deep learning approaches, such as named entity recognition with transformer models like BERT,12 improved extraction capabilities and have been shown to perform well in identifying and extracting key clinical variables such as the patient's age or sex,13 smoking status,14 tumor characteristics,7,15-17 biomarkers,18,19 treatments,20,21 and recurrences22,23 from free-text medical notes. However, these methods still require extensive annotation and struggle with implicit information.19,24-28 Large language models (LLMs), such as generative pre-trained transformer (GPT), offer promising alternatives by leveraging contextual understanding to reduce annotation needs. While the results in the study by Sushil et al29 were mixed, other studies have reported stronger results. For example, Huang et al30 evaluated ChatGPT's performance in extracting TNM classification from free-text pathology reports, highlighting both the potential and limitations of general-purpose LLMs in structured data extraction. Similarly, Wals Zurita et al31 showed that LLMs with well-designed prompts could match or even surpass medical specialists in extracting comorbidities from complex clinical reports. By automating the identification of key variables such as diagnoses, treatments, and biomarker status, AI models can transform unstructured data into actionable insights, enabling large-scale research and real-time clinical decision support.32 Yet, few solutions currently exist for automatically structuring data for research purposes.28

Precision medicine in oncology refers to an innovative approach to cancer treatment that takes into account individual tumor characteristics.33-35 Identifying specific biomarkers such as estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) is necessary for the effective diagnosis and treatment planning of breast cancer (BC). Pathology reports describing these biomarkers are written in unstructured free text, requiring manual extraction by clinical research assistants (CRA) to build curated research databases. In this work, we investigate the use of Mistral Large LLM to automatically extract ER, PR, and HER2 biomarkers from BC pathology reports. We detail the extraction pipeline and prompt design and evaluate performance across different prompting strategies (confidence prompt [CP], chain of thought [CoT], and few-shot learning).

MATERIALS AND METHODS

Data

This study was conducted at the Institut de Cancérologie de l'Ouest (ICO) Comprehensive Cancer Center, part of the UNICANCER network of 18 centers in France. The study was conducted in 2024 based on retrospective reports of patients treated at ICO between January 2014 and December 2023. The ICO repository includes nearly 10 million clinical texts in French, with approximately 650,000 new reports each year, 5% of which are pathology reports. Various databases were used to develop and evaluate the pipeline at both document and patient levels, and all results presented were obtained using independent data sets from those used during the pipeline development, including prompt engineering. We used internal ICO reports and external reports from multiple laboratories across western France, introducing substantial variability in vocabulary and reporting styles and formats. All reports underwent Optical Character Recognition using Tesseract 3.5. In accordance with institutional practice at ICO, ER and PR positivity were defined using a 10% cutoff, corresponding to the percentage of nuclear positivity via immunohistochemical staining. This threshold, which may differ across countries, has remained the standard in France.

Large Language Model

We used the Mistral-Large-2411 LLM from Mistral AI through Microsoft's Azure AI Foundry without fine-tuning. This choice was driven by two primary considerations. First, Azure's infrastructure includes dedicated data centers in France, with Health Data Hosting certification, ensuring General Data Protection Regulation compliance for sensitive health care data. Second, the model offers an optimal balance between cost, accessibility, and robust AI capabilities.

Our implementation, built in Python 3.12, interfaced with Azure AI services using the Azure AI Projects SDK (v1.0.0b5) and Command-Line Interface authentication for secure access. The architecture follows a project-based approach, where a dedicated connection string configures the Azure AI project. The model interaction is handled through a chat completion interface, which facilitates structured communication with the language model through an inference layer. This setup enabled both efficient transmission of prompts and systematic retrieval of model responses, while maintaining secure authentication.

Iterative Approach

Our goal was to leverage an LLM to extract structured BC biomarkers from anatomic pathology reports. To refine both the extraction pipeline and the prompts, we used an iterative prompt engineering approach with three steps. First, we modified the process or prompt to test a hypothesis, for example, by rephrasing, moving sections, or applying CoT, zero-shot, few-shot, or confidence prompting. Second, we evaluated these changes using predefined metrics on training data sets. Finally, we analyzed errors and formulated hypotheses about the causes of extraction failures, which fed the next iteration.

Extraction Pipeline

Several features were implemented to improve the LLM's responses and reduce hallucinations, which can cause false positives (FP). To mitigate hallucinations, a CP was introduced before the extraction prompt, asking whether attribute values were present in the report. If the model lacked confidence about an attribute, the extraction prompt was skipped and the attribute was considered unavailable. LLMs are powerful at generating free-text responses but are more error-prone when required to respect a standardized format. To overcome this, we defined a JSON schema for each prompt and instructed the LLM to produce output strictly conforming to this structure. If discrepancies remained (eg, unexpected values), we applied normalization using the Python difflib library (eg, mapping variations like “Positive” or “positives” to “positive”). If the output still failed to match the schema or expected values, the LLM was asked to correct its answer, up to three times. If unsuccessful, the report was marked as failed and the script proceeded to the next document. The architecture is schematized in Figure 1.
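The two safeguards described above (difflib normalization of off-schema values and a bounded correction loop) can be sketched in a few lines of Python. The helper names (`normalize`, `extract_with_retries`, `ask_llm`) are hypothetical illustrations, not the authors' code; only the difflib call reflects the library actually cited.

```python
import difflib

ALLOWED = {"positive", "negative", "equivocal"}

def normalize(value, allowed=ALLOWED, cutoff=0.6):
    """Map free-form model output (eg, 'Positives') to the closest allowed value."""
    matches = difflib.get_close_matches(value.strip().lower(), allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def extract_with_retries(ask_llm, report, max_attempts=3):
    """Query the model, normalize the answer, and retry up to max_attempts times.

    `ask_llm` is a placeholder callable standing in for the chat-completion call;
    it receives the report and the attempt index (so a retry can carry a
    correction instruction).
    """
    for attempt in range(max_attempts):
        raw = ask_llm(report, attempt)
        value = normalize(raw)
        if value is not None:
            return value
    return "failed"  # report marked as failed; the pipeline moves to the next document
```

For instance, `normalize("Positives")` maps to `"positive"`, while an unrecognizable answer triggers a retry.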

FIG 1.

Architecture of the extraction pipeline.

Prompts

Each prompt was designed to extract all biomarker attributes simultaneously and followed a structured format: context; attribute definitions; required format; zero, one, or multiple examples; and the pathology report. To handle potential contradictions within a report, the JSON schema was constrained to allow only one value per biomarker and per sample. The LLM was instructed to select a single value, giving priority to conclusions when multiple or conflicting mentions were found. We evaluated zero-shot, one-shot, and two-shot prompts. The final version of the pipeline used one-shot prompting, which provided a good balance between performance and prompt size. We also implemented CoT reasoning, prompting the model to explain intermediate steps before producing its final output in a JSON format. This could help to reduce hallucinations and improve accuracy. To further support reproducibility, a representative example of the final prompt and the LLM response (translated into English for clarity) is provided in the Data Supplement (S2).
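The prompt skeleton described above (context; attribute definitions; required format; zero, one, or multiple examples; report) can be sketched as follows. All section texts are invented placeholders, not the authors' actual French prompt (which is provided, translated, in their Data Supplement S2), and the function name is hypothetical.

```python
def build_prompt(report: str, examples: list[str]) -> str:
    """Assemble an extraction prompt in the order described in the text:
    context, attribute definitions, required format, optional few-shot
    examples, and finally the pathology report. Section contents here are
    illustrative placeholders only."""
    sections = [
        "CONTEXT: You are extracting breast cancer biomarkers from a pathology report.",
        ("DEFINITIONS: ER/PR status is positive or negative; "
         "HER2 score is 0, 1+, 2+, or 3+; HER2 status is positive, negative, or equivocal."),
        ("FORMAT: Answer with a single JSON object, one value per biomarker and per sample; "
         "if mentions conflict, prefer the report's conclusion."),
    ]
    # Zero-, one-, or two-shot: append however many worked examples are supplied.
    sections += [f"EXAMPLE {i + 1}:\n{ex}" for i, ex in enumerate(examples)]
    sections.append(f"REPORT:\n{report}")
    return "\n\n".join(sections)
```

One-shot prompting, as retained in the final pipeline, corresponds to calling this builder with a single example.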

Postprocessing Phase

To enhance pipeline performance, we added a postprocessing step to refine LLM-extracted data. In this step, at the document level, biomarker results were completed by cross-referencing related values. For instance, ER and PR staining percentages were extracted, with values below 10% indicating a negative result, allowing us to infer ER/PR status when percentages were available. Similarly, ER and PR were classified as negative if the hormone receptor (HR) was negative. HER2 status was completed using both the HER2 score (negative for 0 or 1+, equivocal for 2+, and positive for 3+) and the in situ hybridization (ISH) result extracted as amplified or not amplified.
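These cross-referencing rules are simple enough to state as code. The sketch below mirrors the logic of this paragraph (the institutional 10% ER/PR cutoff, HR-based completion, and HER2 completion from the IHC score and ISH result); the dict field names are hypothetical, not the pipeline's actual schema.

```python
def complete_biomarkers(rec: dict) -> dict:
    """Fill in missing statuses from related values, per the rules above.

    `rec` is a per-document record with hypothetical keys, eg
    {"er_pct": 5, "hr_status": None, "her2_score": "2+", "ish": "amplified"}.
    """
    out = dict(rec)
    for m in ("er", "pr"):
        pct = out.get(f"{m}_pct")
        # ER/PR status from the staining percentage (10% positivity cutoff).
        if out.get(f"{m}_status") is None and pct is not None:
            out[f"{m}_status"] = "positive" if pct >= 10 else "negative"
        # A negative hormone receptor (HR) result implies ER and PR negative.
        if out.get(f"{m}_status") is None and out.get("hr_status") == "negative":
            out[f"{m}_status"] = "negative"
    # HER2 status from the IHC score: 0/1+ negative, 2+ equivocal, 3+ positive.
    score = out.get("her2_score")
    if out.get("her2_status") is None and score is not None:
        out["her2_status"] = {"0": "negative", "1+": "negative",
                              "2+": "equivocal", "3+": "positive"}[score]
    # An ISH result resolves an equivocal (2+) case.
    if out.get("her2_status") == "equivocal" and out.get("ish") is not None:
        out["her2_status"] = "positive" if out["ish"] == "amplified" else "negative"
    return out
```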

At the patient level, the ICO's digital archive contains duplicated or highly similar documents. In some cases, the same content appeared in multiple documents that differed only in structure or format. In other cases, a complementary result was appended to a document, so the shared section appeared in both, duplicating the results. In real-world patient databases, manual input typically retains only a single result, making comparisons more challenging. To address this, duplicated biomarker results at the same date were removed. In addition, results from different documents on the same date were combined when complementary. For example, if one document reported a HER2 score of 2+ (equivocal) and another indicated amplified HER2 via ISH, they were merged into a single 2+ positive HER2 result, mimicking RWD curation.
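The same-date deduplication and complementary merge can be illustrated with a minimal sketch. The record layout and function name are hypothetical; the refinement rule (an ISH-derived positive status overriding a 2+ equivocal one) follows the example in the text.

```python
def merge_same_date(results: list[dict]) -> list[dict]:
    """Collapse duplicated or complementary same-date biomarker results.

    Each result is a hypothetical dict like
    {"date": "2020-03-05", "her2_score": "2+", "her2_status": "equivocal"}.
    Exact duplicates collapse; complementary fields are combined.
    """
    by_date: dict[str, dict] = {}
    for r in results:
        merged = by_date.setdefault(r["date"], {"date": r["date"]})
        for key, value in r.items():
            if value is None:
                continue
            if key == "her2_status" and merged.get(key) == "equivocal" and value == "positive":
                # An ISH-amplified result refines the equivocal (2+) IHC result.
                merged[key] = "positive"
            else:
                merged.setdefault(key, value)  # keep first value; duplicates add nothing
    return list(by_date.values())
```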

Comparative Analysis

To assess the quality of the data extracted by the LLM pipeline, recall, precision, and F1 score metrics were used to compare the LLM extraction with the manually collected data from the document-level database. While document-level metrics were straightforward to compute, patient-level metrics required an algorithm since manually collected databases do not specify the source report for each biomarker (see the Data Supplement, S3, for more details). We computed the temporal distance as the difference in days between the date recorded in the gold standard (GS) and the date identified by the LLM. This metric was used to verify that the dates extracted by the LLM were correctly centered around the expected dates.
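For reference, the comparison metrics reduce to their usual definitions, and the temporal distance follows the convention stated above (GS date minus LLM date, so positive values mean the LLM-extracted date is earlier). The function names below are illustrative, not the authors' code.

```python
from datetime import date

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions used for the document- and patient-level comparisons."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def temporal_distance(gs_date: date, llm_date: date) -> int:
    """Days between the gold-standard date and the LLM-extracted date;
    positive when the LLM date is earlier than the GS date."""
    return (gs_date - llm_date).days
```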

Ethical Approval

The ICO Data Protection Officer included the study in the register of ongoing studies as requested by the Commission nationale de l'informatique et des libertés (National Commission for Data Protection and Liberties). All the patients included in the study signed a generic research consent form. At any time, the patient can object to their participation in the study by contacting the ICO Data Protection Officer. A data processing sheet containing the essential information of the study and the data collected is public and available online for all the patients at mesdonnees.unicancer.fr.

RESULTS

Data Sets

The document-level validation data set included 720 pathology reports corresponding to 100 patients, with 249 to 260 available results per biomarker (ER, PR, HER2 status, and HER2 score). For HER2, there were between zero and eight results per patient, with a median of three. The patient-level validation database included 101 patients associated with 624 pathology reports (see the Data Supplement, S1, for more details on all the data sets used). Table 1 presents the characteristics of the 101 patients used for patient-level validation.

TABLE 1.

Characteristics of the 101 Patients Included in the GS Database Used for Patient-Level Validation

Variable No. NA %
Female 99 0 98.0
Personal history of cancer 9 0 8.9
ECOG performance status ≥2 (v 0-1) 43
 0-1 48 82.8
 ≥2 10 17.2
Grade at diagnosis 14
 1 5 5.7
 2 49 56.3
 3 33 37.9
Histologic type at diagnosis 4
 Lobular carcinoma 16 16.5
 Nonspecific carcinoma 75 77.3
 Other 6 6.2
cT stage at diagnosis 73
 0 0 0.0
 I 6 21.4
 II 12 42.9
 III 10 35.7
cN stage at diagnosis 71
 0 18 60.0
 I 8 26.7
 II 2 6.7
 III 2 6.7
De novo mBC 30 13 34.1
ER-positive 67 7 71.3
PR-positive 48 7 51.1
HER2 score (IHC) at diagnosis 8
 0 53 57.0
 1+ 14 15.1
 2+ 16 17.2
 3+ 10 10.8
HER2 status at diagnosis 7
 Negative 81 86.2
 Equivocal 1 1.1
 Positive 12 12.8
Year of diagnosis 0
 Before 2015 41 40.6
 From 2015 to 2019 39 38.6
 After 2019 21 20.8
Year of metastatic diagnosis 0
 2018 22 21.8
 2019 29 28.7
 2020 27 26.7
 2021 10 9.9
 2022 13 12.9
Variable Missing Median Min-Max
Age (years) 0 61.5 27.0-91.9
BMI 32 24.7 15.9-51.7

Abbreviations: ECOG, Eastern Cooperative Oncology Group; ER, estrogen receptor; GS, gold standard; HER2, human epidermal growth factor receptor 2; IHC, immunohistochemistry; mBC, metastatic breast cancer; PR, progesterone receptor.

Overall Performance

The final pipeline consists of one CP and one extraction prompt using CoT and one-shot prompting. The average execution time per document was 6.2 seconds. Table 2 presents the document-level results. Recall and precision ranged from 94.0% to 99.1% across all biomarkers, whereas the F1 score ranged from 95.3% for PR to 96.6% for HER2 status. These results demonstrate the LLM's stability in detecting biomarkers, with a good balance between FP and false negatives (FN).

TABLE 2.

Document-Level Results in Terms of Recall, Precision, and F1 Score for ER, PR, and HER2 (both score and status), Based on 720 Pathology Reports and Compared With the Manually Extracted Validation Data Set

Variable Recall (95% CI) Precision (95% CI) F1 Score (95% CI)
Document-level validation (n = 720 documents)
 ER (positive, negative) 95.5 (92.5 to 98.1) 96.9 (94.5 to 98.8) 96.2 (94.3 to 97.8)
 PR (positive, negative) 94.1 (90.9 to 97.2) 96.4 (93.9 to 98.4) 95.3 (92.8 to 97.4)
 HER2 status (positive, negative, equivocal) 98.1 (96.1 to 99.6) 95.1 (92.4 to 97.5) 96.6 (94.6 to 98.2)
 HER2 score (0, 1+, 2+, 3+) 99.1 (97.7 to 100.0) 94.0 (90.9 to 96.9) 96.5 (94.5 to 98.0)

Abbreviations: ER, estrogen receptor; GS, gold standard; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.

At the patient level, the results from the 101 patients in the GS (Table 3) show a slight decrease in precision, ranging from 91.3% for HER2 status to 94.1% for PR, and a more pronounced drop in recall, with results between 83.1% for ER and HER2 status and 86.8% for HER2 score. The F1 score was between 87.0% for HER2 status and 89.8% for the HER2 score. The matching temporal distances are presented in Figure 2. The temporal distance distribution is strongly centered at zero, with a large number of exact matches between LLM and GS dates. The few discrepancies observed were all negative, indicating that when the LLM differs from the reference, it generally assigns slightly later dates. The decline in recall is due to CRAs having access to full patient records, allowing them to retrieve biomarker results from multiple sources, whereas the LLM is limited to pathology reports. Providing the LLM with the same full set of documents was possible but resulted in a substantial increase in FP. Some reports contain results from multiple samples; the pipeline's performance on these documents is presented in the Data Supplement (S4). When limiting the analysis to biomarker measurements from January 2014 onward, precision remained unchanged, whereas recall increased to around 90%. This cutoff corresponds to the start of the institutional data warehouse, ensuring more complete digital records. Excluding measurements before 2014 also removed very old documents from previous diagnoses, which were more often missing or poorly scanned, thereby explaining the lower recall observed in the unrestricted analysis.

TABLE 3.

Patient-Level Performance in Terms of Recall, Precision, and F1 Score for ER, PR, and HER2 (both score and status), Evaluated Against the 101-Patient GS Validation Database

Variable Recall (95% CI) Precision (95% CI) F1 Score (95% CI)
Patient-level GS (n = 101 patients)
 ER (positive, negative) 83.1 (77.1 to 88.2) 93.7 (90.8 to 96.5) 88.1 (84.0 to 91.4)
 PR (positive, negative) 83.9 (78.5 to 88.7) 94.1 (91.1 to 97.1) 88.7 (84.9 to 92.0)
 HER2 status (positive, negative, equivocal) 83.1 (77.9 to 88.5) 91.3 (87.0 to 95.1) 87.0 (82.9 to 90.8)
 HER2 score (0, 1+, 2+, 3+) 86.8 (81.3 to 91.7) 93.1 (89.5 to 96.4) 89.8 (85.9 to 93.6)
Patient-level GS restricted to biomarker values since 2014 (n = 101 patients)
 ER (positive, negative) 89.6 (84.6 to 94.1) 93.5 (90.5 to 96.5) 91.5 (87.8 to 94.7)
 PR (positive, negative) 89.8 (85.1 to 93.9) 94.0 (90.7 to 97.1) 91.9 (88.3 to 94.9)
 HER2 status (positive, negative, equivocal) 89.2 (84.3 to 93.7) 92.7 (88.9 to 96.0) 90.9 (87.4 to 94.0)
 HER2 score (0, 1+, 2+, 3+) 90.3 (85.5 to 94.4) 93.9 (90.4 to 97.2) 92.1 (88.4 to 95.2)

Abbreviations: ER, estrogen receptor; GS, gold standard; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.

FIG 2.

Distribution of matching temporal distances between LLM results and GS results. Positive distances mean that the date of the result found by the LLM is earlier than the date of the result present in the GS. (A) ER, (B) PR, (C) HER2 status, and (D) HER2 score. ER, estrogen receptor; GS, gold standard; HER2, human epidermal growth factor receptor 2; LLM, large language model; PR, progesterone receptor.

Across all evaluated scenarios, between 11 and 13 documents triggered the LLM correction prompt; however, no extraction ultimately failed after the maximum of three correction attempts. An error analysis was performed by comparing results with the document-level testing set (Data Supplement, S5).

To understand how individual parts of the prompt and pipeline affected the results, analyses under different settings are presented below.

Providing Examples

Studies suggest that providing an LLM with examples improves its performance, possibly because the examples clarify the task to be performed.36,37 In Table 4, results with no example (zero-shot), one example (one-shot), and two examples (two-shot) are presented. Performance varied only minimally across settings: the F1 score ranged from 94.1% (PR, two-shot) to 97.0% (HER2 score, zero-shot). The variation in F1 scores across the zero-, one-, and two-shot settings for each biomarker (ER, PR, HER2 status, and HER2 score) was below 1.5 percentage points.

TABLE 4.

Results in Terms of Recall, Precision, and F1 Score for ER, PR, and HER2 (both score and status) With Zero, One, and Two Shots on the Document-Level Validation Set (n = 720 documents)

Variable Recall (95% CI) Precision (95% CI) F1 Score (95% CI)
0-Shot
 ER (positive, negative) 95.8 (93.3 to 98.1) 95.5 (92.8 to 97.7) 95.7 (93.8 to 97.4)
 PR (positive, negative) 94.5 (91.6 to 97.2) 96.4 (93.7 to 98.4) 95.5 (93.4 to 97.3)
 HER2 status (positive, negative, equivocal) 96.9 (94.6 to 98.8) 96.5 (94.1 to 98.4) 96.7 (94.9 to 98.3)
 HER2 score (0, 1+, 2+, 3+) 96.4 (93.7 to 98.6) 97.7 (95.8 to 99.5) 97.0 (95.3 to 98.4)
1-Shot
 ER (positive, negative) 95.5 (92.5 to 98.1) 96.9 (94.7 to 98.9) 96.2 (94.3 to 97.8)
 PR (positive, negative) 94.1 (90.8 to 96.9) 96.4 (94.0 to 98.4) 95.3 (92.9 to 97.2)
 HER2 status (positive, negative, equivocal) 98.1 (96.1 to 99.6) 95.1 (92.4 to 97.7) 96.6 (94.6 to 98.3)
 HER2 score (0, 1+, 2+, 3+) 99.1 (97.7 to 100) 94.0 (90.5 to 96.9) 96.5 (94.4 to 98.2)
2-Shot
 ER (positive, negative) 95.1 (92.1 to 97.7) 96.2 (93.7 to 98.1) 95.6 (93.8 to 97.3)
 PR (positive, negative) 93.0 (89.6 to 96.1) 95.2 (92.4 to 97.5) 94.1 (91.6 to 96.3)
 HER2 status (positive, negative, equivocal) 96.9 (94.6 to 98.8) 94.3 (91.5 to 97.0) 95.6 (93.3 to 97.6)
 HER2 score (0, 1+, 2+, 3+) 98.2 (96.4 to 100.0) 94.7 (91.5 to 97.4) 96.4 (94.3 to 98.2)

Abbreviations: ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.

Confidence and CoT

We evaluated the individual and combined effects of the CoT and CP strategies on the performance of the extraction pipeline. Table 5 presents the results. Performance was similar across all methods, with no clear advantage of one over the others. Removing CP while maintaining CoT (CoT, no CP) had minimal impact on all metrics for HER2 but slightly reduced precision (96% to 92%) and increased recall (94% to 98%) for ER and PR. Removing CoT while maintaining CP (no CoT, CP) had no impact on ER and PR but slightly favored CoT for the HER2 score and status (F1 scores of 96.5% v 95.6%). Interestingly, the baseline configuration without either enhancement (no CoT, no CP) showed relatively robust performance, with F1 scores above 95%, suggesting that the model's base capabilities are already well suited to this task.

TABLE 5.

Results in Terms of Recall, Precision, and F1 Score for ER, PR, and HER2 (both score and status) According to the Use of CoT and/or CP on the Document-Level Validation Set (n = 720 documents)

Variable Recall (95% CI) Precision (95% CI) F1 Score (95% CI)
CoT, CP
 ER (positive, negative) 95.5 (92.5 to 98.1) 96.9 (94.5 to 98.8) 96.2 (94.3 to 97.8)
 PR (positive, negative) 94.1 (90.9 to 97.2) 96.4 (93.9 to 98.4) 95.3 (92.8 to 97.4)
 HER2 status (positive, negative, equivocal) 98.1 (96.1 to 99.6) 95.1 (92.4 to 97.5) 96.6 (94.6 to 98.2)
 HER2 score (0, 1+, 2+, 3+) 99.1 (97.7 to 100.0) 94.0 (90.9 to 96.9) 96.5 (94.5 to 98.0)
No CoT, CP
 ER (positive, negative) 95.5 (92.5 to 97.7) 96.9 (94.7 to 98.9) 96.2 (94.3 to 97.9)
 PR (positive, negative) 94.1 (90.8 to 97.2) 96.4 (94.0 to 98.4) 95.3 (92.9 to 97.4)
 HER2 status (positive, negative, equivocal) 96.9 (94.7 to 98.8) 94.3 (91.3 to 97.0) 95.6 (93.6 to 97.4)
 HER2 score (0, 1+, 2+, 3+) 97.7 (95.5 to 99.5) 93.5 (90.1 to 96.5) 95.6 (93.5 to 97.4)
CoT, no CP
 ER (positive, negative) 98.1 (96.3 to 99.6) 93.2 (90.1 to 96.0) 95.6 (93.6 to 97.3)
 PR (positive, negative) 97.3 (94.6 to 99.2) 91.9 (88.5 to 95.2) 94.5 (92.0 to 96.7)
 HER2 status (positive, negative, equivocal) 97.3 (95.3 to 99.2) 95.1 (92.3 to 97.3) 96.2 (94.1 to 97.9)
 HER2 score (0, 1+, 2+, 3+) 98.6 (97.2 to 100.0) 94.3 (91.1 to 97.0) 96.4 (94.6 to 98.0)
No CoT, no CP
 ER (positive, negative) 97.0 (94.5 to 99.2) 95.9 (93.3 to 98.1) 96.4 (94.5 to 97.9)
 PR (positive, negative) 96.1 (93.1 to 98.4) 95.3 (92.6 to 97.7) 95.7 (93.6 to 97.7)
 HER2 status (positive, negative, equivocal) 97.7 (95.8 to 99.2) 94.0 (90.8 to 96.6) 95.8 (93.8 to 97.6)
 HER2 score (0, 1+, 2+, 3+) 98.2 (96.4 to 99.6) 93.5 (90.0 to 96.5) 95.8 (93.8 to 97.6)

Abbreviations: CoT, chain of thought; CP, confidence prompt; ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.

DISCUSSION

This study explores the use of the Mistral LLM to extract structured data from pathology reports, automating real-world database creation and reducing manual effort. Traditional NLP methods require a large amount of annotated data, which is labor-intensive and costly to produce. LLMs bypass this by combining problem-specific information and domain knowledge with general knowledge. A key requirement for effectively using an LLM is creating one or more appropriate prompts. In this study, we tested a CP before extraction, CoT reasoning, and various prompting strategies (zero-, one-, and two-shot). Our results demonstrate that the pipeline effectively extracts structured data on three BC biomarkers from pathology reports.

Our extraction pipeline achieved F1 scores above 95%, with recall and precision over 94% for each of the three tested biomarkers at the document level. This highlights the potential of LLMs in structuring medical information without costly annotation. At the patient level, recall declined to 83%-87%, whereas precision remained above 90%, leading to F1 scores between 87% and 90%. This reflects the difference between extracting information from a single document and aggregating all the unique data about a patient's real-life situation. The decrease in recall can be attributed to the LLM being limited to pathology reports, whereas manual data entry is based on the patient's entire medical record. Including these additional documents in our pipeline improved recall but caused excessive precision loss. Restricting the analysis to biomarker measurements from 2014 onward, most relevant for contemporary research projects, improved recall to around 90% while maintaining precision, resulting in F1 scores above 90%. Surprisingly, pipeline variations (with or without CP, CoT, or different prompting strategies) yielded similar results, and we retained the CP, CoT, and one-shot configuration for its balance of performance and stability. The acceptable level of error for an AI-based algorithm depends on the application. For tasks such as improving the efficiency of manual data entry, high recall is crucial to ensure that as much relevant data as possible is captured, even if lower precision means reviewing additional results. By contrast, for applications like real-world evidence studies, where data accuracy is paramount, the tolerance for errors is significantly lower. Balancing these metrics is essential and should align with each use case. Algorithm performance must also be periodically reassessed to ensure that document format changes do not affect accuracy.

At the document level, our results are close to the best obtained with NLP methods, such as the study by Schiappa et al,19 which reported an average recall of 0.96 and precision of 0.94 for extracting ER, PR, and HER2 values from pathology reports. While no studies have reported recall and precision for BC biomarkers using LLMs, some have explored similar information extraction tasks in other pathologies, demonstrating an average recall and precision of 0.89 using ChatGPT.30 Our results outperform these, illustrating the steady improvement of LLMs and the possible gains from prompt engineering. The results obtained for the three biomarkers are consistent, which is promising for extending the approach to other biomarkers and potentially other variables. A key advantage of this method is that it requires no annotation for training, reducing human costs and increasing adaptability to different document formats.

However, several limitations remain. LLMs require high computing power, making large-scale or routine use challenging. While document-level performance is strong, improvements are needed to match the accuracy of manually curated patient-level databases. Further work is needed to better integrate other document types, such as consultation reports. Common errors included missed values and hallucinated results, which could be mitigated by self-evaluation prompts or additional checks for digitization errors. Another challenge is the lack of explainability, which makes errors harder to interpret than with rule-based or traditional NLP approaches. Future work should assess the robustness of our pipeline across different health care centers, data sets, and biomarkers and include comparisons with other LLMs or domain-specific models.

In conclusion, this study highlights the capacities of LLMs for extracting structured BC biomarker data from free-text pathology reports and the promising potential of such methods for the broader digital transformation of health care documents. Such models could enhance the use of the rich historical digital archives and improve RWD exploitation. While our work is motivated by structuring RWD for research purposes, our methods could also be used to support data accessibility and harmonization since communicating diagnostic methods and results from the pathologist to the clinician remains predominantly in free-text form that is not optimal for patient management. We focus on an application in oncology, but the process can be reproduced in other medical fields.


Footnotes

* E.J. and P.V. contributed equally to this work.

AUTHOR CONTRIBUTIONS

Conception and design: Enzo Joseph, Paul Vallee, Tanguy Perennec, Nicolas Wagneur, François Bocquet, Florent Le Borgne

Collection and assembly of data: Enzo Joseph, Paul Vallee

Data analysis and interpretation: Enzo Joseph, Paul Vallee, Jean-Sébastien Frenel, Mario Campone, François Bocquet, Florent Le Borgne

Manuscript writing: All authors

Final approval of manuscript: All authors

Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Jean-Sébastien Frenel

Consulting or Advisory Role: Novartis (Inst), Pfizer, Lilly, AstraZeneca (Inst), Daiichi Sankyo Europe GmbH (Inst), GlaxoSmithKline, Amgen, Seagen, Gilead Sciences, Clovis Oncology (Inst), MSD Oncology (Inst), Exact Sciences (Inst), Eisai, AbbVie

Travel, Accommodations, Expenses: Novartis, Lilly, Pfizer, Daiichi Sankyo Europe GmbH, AstraZeneca, Gilead Sciences, Seagen, MSD Oncology

Mario Campone

Honoraria: Pfizer (Inst)

Consulting or Advisory Role: Pfizer (Inst)

Speakers' Bureau: Novartis (Inst), Lilly (Inst)

Travel, Accommodations, Expenses: Pfizer, Novartis, AstraZeneca

No other potential conflicts of interest were reported.


Articles from JCO Clinical Cancer Informatics are provided here courtesy of Wolters Kluwer Health
