BMJ Open Gastroenterology
2025 Sep 18;12(1):e001896. doi: 10.1136/bmjgast-2025-001896

Large language models for extracting histopathologic diagnoses of colorectal cancer and dysplasia from electronic health records

Brian Johnson 1, Tyler Bath 1, Xinyi Huang 2, Mark Lamm 2, Ashley Earles 2, Hyrum Eddington 1, Anna M Dornisch 3,4; VA Million Veteran Program, Lily J Jih 4,5, Samir Gupta 4,6,7, Shailja C Shah 4,6,7, Kit Curtius 1,4,7
PMCID: PMC12458811  PMID: 40973184

Abstract

Objective

Accurate data resources are essential for impactful medical research, but available structured datasets are often incomplete or inaccurate. Recent advances in open-weight large language models (LLMs) enable more accurate data extraction from unstructured text in electronic health records (EHRs); however, thorough validation of such approaches is lacking. Our objective was to create a validated approach using LLMs for identifying histopathologic diagnoses in pathology reports from the nationwide Veterans Health Administration (VHA) database, including patients with genotype data within the Million Veteran Program (MVP) biobank.

Methods

Our approach utilises search term filtering followed by simple ‘yes/no’ question prompts for the following phenotypes of interest: any colorectal dysplasia, high-grade dysplasia and/or colorectal adenocarcinoma (HGD/CRC) and invasive CRC. We first developed the LLM prompts using example reports from patients with inflammatory bowel disease (IBD). We then validated the approach in IBD and non-IBD cohorts by applying the fixed prompts to a separate corpus of 116 373 pathology reports generated in the VHA between 1999 and 2024. We compared model outputs to blinded manual chart review of 200–300 pathology reports for each patient cohort and diagnostic task, totalling 3816 reviewed reports, and calculated F1 scores as a balanced accuracy measure.

Results

In patients with IBD in MVP, we achieved F1-scores of 96.9% (95% CI 94.0% to 99.6%) for identifying dysplasia, 93.7% (95% CI 88.2% to 98.4%) for identifying HGD/CRC and 98.0% (95% CI 96.3% to 99.4%) for identifying CRC. In patients without IBD in MVP, we achieved F1-scores of 99.2% (95% CI 98.2% to 100%) for identifying any colorectal dysplasia, 96.5% (95% CI 93.0% to 99.2%) for identifying HGD/CRC and 95.0% (95% CI 92.8% to 97.2%) for identifying CRC using LLM Gemma-2.

Conclusion

LLMs provided excellent accuracy in extracting the diagnoses of interest from EHRs. Our validated methods generalised to unstructured pathology notes, even within the constraints of resource-limited computing environments. Given the minimal human-led development required, this may be a promising approach for other clinical phenotypes.

Keywords: COLORECTAL CANCER, INFLAMMATORY BOWEL DISEASE, ARTIFICIAL INTELLIGENCE


WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Extracting structured data from free-text health records, such as pathology reports, remains a significant challenge in clinical research. Traditional natural language processing methods require extensive development and are often difficult to generalise across settings, limiting their usefulness for large-scale, reproducible data extraction.

WHAT THIS STUDY ADDS

  • This study demonstrates that relatively small (8–9 billion parameter) publicly available large language models can accurately extract cancer and dysplasia diagnoses from pathology reports without additional task-specific training or fine-tuning.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • By enabling accurate data extraction from clinical text, large language models offer a scalable and accessible solution for structuring clinical data, reducing the burden of algorithm development and/or manual data curation. These advancements facilitate expanded access to high-quality real-world medical data for clinical and translational research.

Introduction

The expected breakthroughs in personalised treatments and improved medical outcomes have yet to fully materialise despite the exponential increase in volume of healthcare data available for research. One obstacle impeding these advances is the quality and accessibility of the vast data generated and stored as part of usual healthcare.

As an example use case, tailoring colonoscopy screening ages and surveillance intervals based on accurate risk stratification informed by large, high-quality datasets has real potential to reduce both the incidence of colorectal cancer as well as the number of unnecessary colonoscopies that add burden to both patients and the healthcare system.1 Current risk stratification approaches in both the general population and those with inflammatory bowel disease (IBD) are based on few clinical variables, which are often associated with widely varied published estimates of risk.2 3 For example, when lesions such as adenomas or flat low-grade dysplasia (LGD) are diagnosed, current guidelines for all-comers recommend surveillance colonoscopy every 1–10 years (or 1–5 years in patients with IBD) based on clinical risk stratification.3 4 As such, screening guidelines essentially serve as heuristics, representing the best approach with limited data.5 Our overall goal is to improve the quality of available large-scale data resources—an essential prerequisite for accurate downstream analyses to improve personalised medicine—by leveraging artificial intelligence to extract clinical information, with a focus on histopathologic diagnoses in the present study.

Traditionally, ‘rule-based’ natural language processing (NLP) algorithms have dominated structured data extraction in this area. Briefly, NLP translates natural language into data formats that are easier for computers to process and analyse.6 Performance of these algorithms is excellent, with F1-scores frequently above 99% for identifying adenomas in data from the general population.7–14 In contrast, published deep-learning or embedding-based models are less common for classifying pathology findings, though work by Syed et al reported an F1-score of 95% for identifying neoplastic (dysplastic) polyps.15 However, current approaches have many drawbacks. Development requires extensive human effort to refine algorithms: creating concepts (eg, enumerating all possible ways that each diagnosis could be written), identifying negation (ensuring that expressions of the absence or uncertainty of a diagnosis are captured and related to the correct concept), associating terms with their respective anatomical locations and modifying the algorithm to address ‘edge’ cases. Adapting these algorithms to new use cases or different databases presents similar challenges, as development is often tailored to the formatting and style of a specific hospital system, patient cohort and/or time period.16

Previous NLP approaches for such diagnostic tasks have not been tested in IBD populations specifically and may not be appropriate for use in the setting of IBD. For example, pathology reports from patients with IBD often differ slightly from reports from the non-IBD population. In the IBD patient cohort, historical terminology may be present (eg, Dysplasia Associated Lesion or Mass, or ‘DALM’) and there are more instances of negation, where pathologists explicitly rule out dysplasia and carcinoma, than in patients without IBD. Additionally, few of these NLP approaches have been rigorously tested in identifying varying severities of dysplasia (eg, low grade, high grade) and adenocarcinoma in either IBD or non-IBD populations. Thus, previous methods are largely insufficient for these examples, and manual chart review of millions of patient records is infeasible in terms of the human labour required. Consequently, inaccurate data or lack of large-scale data availability can have negative effects on patient outcomes, for example, through misinformed screening guidelines that may miss deadly cancers.

As opposed to traditional NLP methods, large language models (LLMs) are capable of many tasks ‘out of the box’ without additional tedious human-led development. Because of this, LLMs should be less susceptible to differences in formatting and style, working better across settings.17 For example, applying LLMs to determine colonoscopy follow-up time recommendations that are in line with established guidelines has been shown to be feasible without task-specific training.18 Recently, there have also been remarkable advances in the quality of open-weight LLMs released under permissive licences. Open-weight models avoid critical legal and regulatory issues, allowing researchers to conduct inference without significant privacy risks. Specifically, these models can be uploaded to the same computing environment where the data are stored, enabling researchers to structure data without it leaving this secured space. Because these models do not undergo training or fine-tuning within the computing environment, there is no risk of them ‘remembering’ patient information or accidental data breaches. Moreover, using these models does not involve sending data to third parties, avoiding the associated logistic and privacy challenges. While current LLMs require significant computational bandwidth, they are rapidly becoming viable alternatives for large-scale applications as their efficiency improves.

Here, our study objectives were to test and compare the performance of LLMs, without any task-specific training, on their ability to extract and characterise the presence versus absence of dysplasia and adenocarcinoma from unstructured colonoscopy-associated pathology reports. We hypothesised that an LLM approach would be straightforward to implement and achieve greater accuracy than previously published historic methods on similar tasks both in IBD and non-IBD populations. We show that LLMs, even in resource-limited environments, are accurate at identifying features from pathology reports in a way that is easily reproducible.

Methods

As a brief overview, the framework described below first uses simple search terms (specifically, terms indicating both a colorectal specimen location and a dysplastic and/or neoplastic microscopic histology) to filter all colonoscopy pathology reports and identify those that were potentially diagnostic for each of the three following concepts in the colon or rectum: any dysplasia, high-grade dysplasia or adenocarcinoma (high-grade dysplasia and/or colorectal cancer (HGD/CRC)) and invasive adenocarcinoma (CRC). For IBD, we also considered a task for indefinite for dysplasia (IND). For each task, we posed a yes-or-no question to an LLM to determine whether the report contained a diagnosis of that concept. These diagnostic labels can then be used in downstream analyses (see workflow for implementation of these steps in figure 1). We iteratively developed specific prompts for each concept as necessary and then validated the model performance in a separate corpus of pathology reports. The study population was divided into IBD and non-IBD, and model validation was performed independently in both populations. We used stratified sampling for validation to account for the potentially low prevalence of some concepts.

Figure 1. Workflow for extracting diagnoses from electronic health data. (A): Two Pathology Domain sections were used for simple key term filtering using regular expressions targeting (1) colorectal location in the ‘Specimen’ section and (2) relevant dysplastic or neoplastic histology descriptors in the ‘Microscopic exam’ section. This was designed to reduce the total number of reports fed to the LLM while still capturing all pathology reports with possible diagnoses. Then the Microscopic exam section of the report (or the full pathology report when indicated) is integrated into a prompt fed to the LLMs. A partial prompt is shown for illustrative purposes; full prompts are provided in Methods and online supplemental methods. The LLM answers (‘Yes’ or ‘No’) are converted to structured data, which can be used for downstream applications, such as estimating a stratified risk of CRC based on colonoscopy findings. (B): Diagnoses to be obtained by four prompts utilised in this study. By applying all four prompts, the most advanced diagnosis can be distinctly categorised as one of the following: no dysplasia or adenocarcinoma; indefinite for dysplasia; low-grade dysplasia; high-grade dysplasia OR intramucosal adenocarcinoma OR carcinoma in situ; and invasive adenocarcinoma. CRC, colorectal cancer (invasive colorectal adenocarcinoma); HGD/CRC, high-grade dysplasia or adenocarcinoma; LLM, large language model; Path, pathology.


Datasets and compute environment

Patient databases

We applied our methods to data from the nationwide Veterans Health Administration (VHA), one of the largest integrated health systems in the USA. The Corporate Data Warehouse (CDW) in the Veterans Affairs (VA) contains all electronic health record (EHR) data from Veteran healthcare encounters, including notes, International Classification of Diseases (ICD) codes and other registries such as the National Death Index, all of which can be used and are intended for research purposes. In total, our provisioned CDW database contains the EHRs of 15.2 million current and former patients cared for through the VHA. This consists of roughly 6.2 billion notes, with a mean of more than 400 notes per patient. The earliest notes and other data relevant for our purposes, such as ICD codes, date back to around the year 2000, when records began to be consistently stored digitally.

The Million Veteran Program (MVP) is a research initiative where Veterans volunteer to have additional health, survey and complete genetic data collected and made available for research in an anonymised way. To date, over one million Veterans have volunteered to become a part of this initiative. Our provisioned dataset contains 913 318 patients (V.22 data release, August 2023). Veteran volunteers in MVP are demographically similar to patients in CDW and are a representative subsample,19 though reidentification or any linking of clinical data is strictly disallowed to protect the privacy of MVP participants. Therefore, cohort building must be done in each dataset separately. Furthermore, any overlap of patients and/or notes between MVP and CDW datasets is due to random chance and is not known to researchers.

VINCI workspaces

The VA Informatics and Computing Infrastructure (VINCI) is a platform where researchers can access clinical data from both CDW and MVP. VINCI allows researchers to analyse these data sources using various computational resources in a secured environment. Within VINCI, structured and unstructured free-text data are organised in Structured Query Language (SQL) tables. R and RStudio were used to process text from SQL tables into ‘.txt’ files for reading into compiled C++ software, a fork we created of the open-source llama.cpp GitHub project.20 We first used the standard 4-core Central Processing Units (CPUs) available on VINCI development workspaces. We were then also able to compare results using two Nvidia A40 Graphics Processing Units (GPUs) provisioned through VINCI and calculated the inference speed increase that GPU use enabled.

VA pathology domain

The VINCI team has created a dataset domain called the Pathology Domain, which takes full pathology reports and extracts certain sections based on their appropriate section headers. The resulting table contains columns representing each section header, with the associated text for each note. Our method uses the columns ‘Specimen’, to determine whether the report described tissue from the colon or rectum, and ‘Microscopic exam’, to extract the diagnosis. We also tested the method on full text of the corresponding pathology report, where available.

The pathology domain is available in CDW to approved VINCI researchers. In MVP, the pathology domain data must be requested. We requested all Microscopic exam and Specimen sections where specimen had one of the following terms: ‘rectum’, ‘colorectal’, ‘rectal’, ‘cecum’, ‘colon’, ‘hepatic flexure’, ‘ileocecal valve’, ‘rectosigmoid’, ‘splenic flexure’ or ‘colonic flexure’. This term search was not case sensitive and used word boundaries to identify terms.
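The case-insensitive, word-boundary term search described above can be sketched in Python. This is an illustrative reconstruction only (the study performed this filtering against SQL tables within VINCI); the helper name `is_colorectal_specimen` is hypothetical.

```python
import re

# Colorectal specimen terms listed in the Methods, matched case-insensitively
# with word boundaries. A sketch of the location filter, not the production
# SQL/R code used in the study.
SPECIMEN_TERMS = [
    "rectum", "colorectal", "rectal", "cecum", "colon", "hepatic flexure",
    "ileocecal valve", "rectosigmoid", "splenic flexure", "colonic flexure",
]
SPECIMEN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in SPECIMEN_TERMS) + r")\b",
    re.IGNORECASE,
)

def is_colorectal_specimen(specimen_text: str) -> bool:
    """Return True if the Specimen section mentions a colorectal location."""
    return bool(SPECIMEN_RE.search(specimen_text))
```

The word boundaries mean that, for example, ‘colon’ matches as a standalone word but is not matched inside longer words such as ‘colonoscopy’.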

LLM approach and application

Large language models

The main model used is Gemma-2-9B-It-SPPO,21 22 referred to herein as Gemma-2 (9 billion parameters). We also applied Llama-3-8B-Instruct,23 referred to herein as Llama-3 (8 billion parameters), to all tasks/cohorts. All models used have licences that allow commercial and research use, as required by VA policy. All models were run as GPT-Generated Unified Format (GGUF or .gguf) files, a format which combines the model architecture, weights, tokeniser and metadata into a single file for streamlined use with llama.cpp.20 Gemma-2-9B-It-SPPO and Llama-3-8B-Instruct were run at FP16 precision and were converted from the Safetensors format using llama.cpp. Quantised versions of Llama-3.2-3B-Instruct (Q8_0) and Gemma-2-2b-It (Q8_0) were used for ease of uploading to VINCI. This quantisation stores the model parameters at reduced integer precision24 and is compatible with the GGUF file format. For details on model selection, see online supplemental methods section Large Language Model (LLM) selection.

Identifying colonoscopy pathology reports in the pathology domain

As mentioned in VA Pathology Domain, the partitioned data in the MVP pathology domain were filtered to include only those reports where the Specimen section matched colon or rectum terms. In CDW, we applied the same term matching: ‘rectum’, ‘colorectal’, ‘rectal’, ‘cecum’, ‘colon’, ‘hepatic flexure’, ‘ileocecal valve’, ‘rectosigmoid’ or ‘splenic flexure’. As in MVP, this term search was not case sensitive. This led to a corpus of relevant notes for our study totalling n=2 899 321 reports from 1 834 930 unique patients in CDW and n=279 964 reports from 170 806 unique patients in MVP. Similar filtering in the Pathology Domain was previously used to validate a method for extracting adenoma detection rates.25 Pathology reports with inaccurate specimen extraction (due to typos, for example) were not included in our corpus of relevant notes.

Tasks

Our approach identified the presence or absence of the following three concepts (clinical conditions) on any colonoscopy-associated pathology reports: any dysplasia, HGD/CRC and invasive adenocarcinoma (CRC). Additionally, for the IBD-specific population, we considered a concept for IND. Our definitions for each of these four concepts are as follows:

  • Any confirmed dysplasia: presence of any dysplasia in the colon or rectum explicitly stated in the report (eg, low-grade, mild, moderate, high-grade, etc). This includes presence of any adenoma or adenomatous lesions in the colon or rectum, excluding sessile serrated adenoma unless there is an explicit statement of sessile serrated adenoma with dysplasia. Excludes ‘IND’ and other uncertain phrases. Adenomas were counted in the definition of any dysplasia because all adenomas contain at least LGD.

  • HGD/CRC: presence of HGD or any adenocarcinoma in the colon or rectum. Includes carcinoma in situ, adenocarcinoma in situ and intramucosal adenocarcinoma. Excludes uncertain phrases such as ‘bordering on high-grade dysplasia’. This concept was chosen for its relevance in IBD (advanced neoplasia as a clinical outcome in treatment decision-making).

  • Invasive colorectal cancer (CRC): presence of invasive adenocarcinoma of the colon or rectum. Invasive is defined as T stage of 1 or greater, or equivalent language (eg, ‘invades into submucosa’). Excludes metastatic adenocarcinoma suspected or known to be from a different primary location (ie, primary is not colon or rectum). Excludes uncertain phrases such as ‘cannot rule out invasive adenocarcinoma’ and ‘suspicious for invasion’ (further details in online supplemental methods section Model validation details).

  • IND26: in IBD, presence of changes in the colon or rectum that raise concern for dysplasia but lack definitive features to confirm. This includes findings described as ambiguous due to factors such as inflammation, regenerative changes or poor quality. Excludes explicit statements confirming the presence of dysplasia or adenomas.

We chose to evaluate these tasks as examples for our study, but the robust LLM framework described below can also distinguish other groupings if desired (such as LGD separately). However, we note that applying these tasks simultaneously does allow us to automatically classify the most advanced lesion in each pathology report (figure 1B). For example, if the report is positive for dysplasia and negative for HGD/CRC, we can conclude that the sample has LGD. Similarly, if the report is positive for HGD/CRC and negative for CRC, we can conclude that the sample has HGD or intramucosal CRC only. While both HGD/CRC and CRC are indications for resection in IBD, prognosis for these may differ in the non-IBD population. Furthermore, distinguishing between invasive and non-invasive adenocarcinoma or HGD is useful for incidence reporting and epidemiological studies, so we separated these herein.
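The logic for combining the per-task yes/no answers into a single most-advanced diagnosis (figure 1B) can be made concrete with a short sketch. The function name and return labels below are hypothetical illustrations of that hierarchy, not code from the study.

```python
def most_advanced_diagnosis(dysplasia: bool, hgd_crc: bool, crc: bool,
                            ind: bool = False) -> str:
    """Combine the four per-report LLM answers into one category,
    checking the most advanced concept first (sketch of figure 1B)."""
    if crc:
        return "invasive adenocarcinoma"
    if hgd_crc:
        # Positive for HGD/CRC but negative for invasive CRC:
        # HGD, intramucosal adenocarcinoma or carcinoma in situ only.
        return "HGD/intramucosal adenocarcinoma/carcinoma in situ"
    if dysplasia:
        # Positive for any dysplasia but negative for HGD/CRC: LGD.
        return "low-grade dysplasia"
    if ind:
        return "indefinite for dysplasia"
    return "no dysplasia or adenocarcinoma"
```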

Creation of plausible sets of notes

We then used simple search terms to reduce the number of colonoscopy pathology reports to only include those potentially diagnostic of a particular concept. For CRC, these were reports where the Microscopic exam section text matched ‘%carcinoma%’, ‘%tumour%’ or ‘%invasi%’, where ‘%’ represents a wildcard. The pathology reports matching this search are considered a part of the ‘plausible set’ for CRC identification. The plausible sets for dysplasia and HGD/CRC were expanded to include more search terms identifying those diagnoses. See online supplemental methods sections HGD/CRC ascertainment, Dysplasia ascertainment and Indefinite for dysplasia ascertainment for details. Online supplemental table S1 contains the exact numbers of reports considered for all patient cohorts at each step of filtering across tasks, starting with 16.3 million and 2.6 million total pathology reports in CDW and MVP, respectively.
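The SQL LIKE patterns for the CRC plausible set amount to simple substring checks, sketched below in Python (an illustration; the study ran this step as SQL against the Pathology Domain tables, and the case handling here is an assumption, since LIKE case sensitivity depends on the database collation).

```python
# Terms from the '%carcinoma%', '%tumour%' and '%invasi%' LIKE patterns.
CRC_TERMS = ("carcinoma", "tumour", "invasi")

def in_crc_plausible_set(microscopic_text: str) -> bool:
    """Return True if the Microscopic exam text matches any CRC search term."""
    text = microscopic_text.lower()
    return any(term in text for term in CRC_TERMS)
```

Note that this filter is deliberately permissive: a report stating ‘no carcinoma seen’ still enters the plausible set, and distinguishing such negations is left to the LLM step.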

LLM prompt development

LLMs require a ‘prompt’ to perform a given task. A ‘prompt’ is defined as the input text given to the model. The model then evaluates the prompt and generates additional text. The prompt we provide to the model consists of some text that defines the task and the question to be answered. Additionally, the prompt includes the text from the pathology report or section to be evaluated. We developed the initial prompts using 48 pathology report Microscopic exam sections from the plausible sets of notes and then performed some minor refinement of prompts (figure 2, development steps 1 and 2). No a priori performance targets were applied after step 1. For details on the evolution of the prompt text, see online supplemental methods section Details on LLM prompt development steps. Then for each task, the final sets of plausible notes used in model implementation and validation excluded all reports from the development sets, as detailed below.

Figure 2. Workflow of large language model selection and prompt development. Step 1 consisted of consideration of a small set of manually deidentified reports, which allowed for rapid prompt iteration and evaluation of multiple models. Once models were selected and uploaded to VINCI, step 2 consisted of performing further prompt refinement until the errors could no longer be obviously fixed by an improved prompt. Only IBD reports were used in development step 2. Note that indefinite for dysplasia (IND) prompt was not used in the process of model selection, and this task did not have a development set nor any prompt iteration. Finally, the LLMs were used for implementation and validation in IBD and non-IBD. A subset of plausible notes was used for non-IBD reports to ensure adequate numbers of model-predicted positive and negative reports to allow for stratified sampling when calculating performance metrics (see online supplemental file S1 for implementation numbers in IBD and non-IBD for all tasks). CDW, Corporate Data Warehouse; HGD/CRC, high-grade dysplasia or adenocarcinoma; IBD, inflammatory bowel disease; LLM, large language model; MVP, Million Veteran Program; VINCI, VA Informatics and Computing Infrastructure.


Determining presence versus absence of given diagnosis using LLM

For each report in the plausible set of pathology reports, we feed either the Microscopic exam section or the full-text pathology note to an LLM, which determines if an individual has a specific pathological diagnosis (figure 1A). The input text is integrated into the prompt (see online supplemental methods sections CRC ascertainment, HGD/CRC ascertainment, Dysplasia ascertainment and Indefinite for dysplasia ascertainment for exact prompts used). The model then responds ‘yes’ or ‘no’. This response is recorded as an output ‘.txt’ file with the corresponding ID of the pathology domain entry. Llama.cpp20 is used for model inference. We then implemented this approach for 115 417 unique reports not used in the model development (see online supplemental file S1 for number breakdown by task) to obtain adequate numbers of putative positive and negative cases for use in validation.
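The per-report step can be sketched as follows. The exact prompt wording is given in the online supplement, so `PROMPT_TEMPLATE` below is an illustrative stand-in, and the parsing helper is hypothetical; in the study, inference itself is performed by llama.cpp.

```python
# Illustrative stand-in for the task prompts (real prompts are in the
# online supplemental methods). The report text is interpolated into
# the prompt before it is passed to the model.
PROMPT_TEMPLATE = (
    "You are reviewing a colorectal pathology report.\n"
    "Report:\n{report}\n"
    "Does this report contain a diagnosis of {concept}? Answer Yes or No."
)

def parse_yes_no(model_output: str) -> bool:
    """Map the model's free-text answer onto a boolean label."""
    answer = model_output.strip().lower()
    if answer.startswith("yes"):
        return True
    if answer.startswith("no"):
        return False
    raise ValueError(f"Unparseable model answer: {model_output!r}")
```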

Model validation

Validation was performed independently in IBD and non-IBD populations using the same models and three prompts (see online supplemental methods for cohort creation). To create validation sets, either N=100 (CDW) or N=150 (MVP) randomly selected putative positive cases and the same number of putative negative cases were selected for review. Putative positive (negative) cases were defined as cases where Llama-3 responded ‘yes’ (‘no’). Validation was performed by two independent, blinded reviewers (BJ and AD). Disagreements between AD and BJ, of which there were 131 across 3816 total reviewed pathology reports (<4% of reviewed pathology reports), were resolved by a third, blinded reviewer (HE). If HE lacked certainty, which occurred for four reports total, LJJ resolved the disagreement as a blinded fourth reviewer. Cohen’s kappa was used to measure inter-reviewer agreement between BJ and AD. Validation was performed at the level of the pathology report, consistent with the LLM prompt asking if the given features are present in any colon or rectal sample. Validation was performed independently for each of the tasks, even if notes overlapped by chance in validation sets across tasks.
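To make the agreement statistic concrete, Cohen’s kappa for two reviewers’ binary labels follows the standard formula (this is a generic sketch, not the study’s analysis code):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' binary (0/1) labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal positive rate.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```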

We performed validation only in the ‘plausible set’ of notes that passed our search term filters (online supplemental table S1) and recorded run times (online supplemental table S2). Considering the very low expected prevalence (potentially zero) outside of these filters, this approach provides a more informative assessment of the LLMs’ performance, as we are considerably more likely to include some false-negative cases in our validation (see online supplemental methods for details). For testing generalisability and validity, we also evaluated performance using 956 full pathology reports as LLM prompt input. Full-text pathology reports were considered in validation analyses only and were not used in prompt development, which only used semi-structured reports.

Performance metrics

We provide an estimate of the prevalence of cases in the reports from the plausible set as well as the positive predictive value (PPV), negative predictive value (NPV), sensitivity (recall), specificity, F1-score and Matthews correlation coefficient (MCC). To better compare across tasks with varying class ratios (prevalence), we also show the calibrated precision and calibrated F1 score (F1c) based on previous work by Siblini et al,27 computed using a reference ratio π0=0.5. These calibrated metrics correct for the dependence of precision-based performance metrics on the positive class ratio. More specifically, if true positives are rare, precision will be lower than if the same models were evaluated on a more balanced dataset. For a given reference ratio, π0, the calibrated metrics provide the expected PPV and F1-score for a dataset with a positive class ratio of π0.27
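On our reading of the calibration in Siblini et al, the calibrated precision is the PPV one would expect at the reference prevalence π0, computed from sensitivity and specificity; F1c then combines calibrated precision with recall. The sketch below reflects that interpretation and is not the study’s own code.

```python
def calibrated_ppv(recall: float, specificity: float, pi0: float = 0.5) -> float:
    """Precision expected at reference positive-class ratio pi0:
    pi0*TPR / (pi0*TPR + (1-pi0)*FPR)."""
    fpr = 1.0 - specificity
    return (pi0 * recall) / (pi0 * recall + (1.0 - pi0) * fpr)

def calibrated_f1(recall: float, specificity: float, pi0: float = 0.5) -> float:
    """Harmonic mean of calibrated precision and recall."""
    ppvc = calibrated_ppv(recall, specificity, pi0)
    return 2.0 * ppvc * recall / (ppvc + recall)
```

With π0=0.5, a model with equal sensitivity and specificity has calibrated precision equal to that common value, which makes tasks with very different prevalences directly comparable.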

Because we use stratified sampling, that is, selecting N model-predicted positive and N model-predicted negative reports for validation, calculating the performance metrics listed above requires corrections to account for the conditional probabilities introduced by our sampling. The explicit forms of these equations are derived in online supplemental methods section Calculating performance metrics. This approach helps minimise the number of cases needed for validation, especially when prevalence is imbalanced, and builds on previous work by Liu et al.28 Figure 2 shows the flowchart for all steps of LLM development through validation (numbers of all pathology reports evaluated for each task and model outcomes provided in online supplemental file S2). This study on diagnostic accuracy followed the Standards for Reporting Diagnostic Accuracy guidelines (see reporting checklist provided in online supplemental file S3).29
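The exact corrections are derived in the online supplement; a plausible reconstruction uses the standard identities relating PPV and NPV (measured directly on the stratified samples) to prevalence, sensitivity and specificity via the model’s predicted-positive rate, denoted q here (a symbol we introduce for illustration).

```python
def corrected_metrics(ppv, npv, q):
    """Recover prevalence, sensitivity and specificity from PPV/NPV measured
    on stratified validation samples, given q = fraction of reports the model
    labelled positive in the full plausible set (standard Bayes identities;
    the study's exact derivations are in its online supplement)."""
    prevalence = q * ppv + (1.0 - q) * (1.0 - npv)
    sensitivity = q * ppv / prevalence
    specificity = (1.0 - q) * npv / (1.0 - prevalence)
    return prevalence, sensitivity, specificity
```

As a self-consistency check: for a corpus of 1000 reports with 300 true positives where the model finds 270 of them with 70 false positives, q=0.34, PPV=270/340 and NPV=630/660, and the identities recover prevalence 0.30 and sensitivity and specificity of 0.90 each.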

Results

We applied LLMs to extract pathologic diagnoses from text in the VA Pathology Domain (figure 1) and free text pathology reports. We tested two LLMs (Gemma-2 and Llama-3) for each classification task. After prompt development, we validated our methods by comparing model predictions to blinded chart review by two independent reviewers evaluating randomly chosen sets of reports for each task (any dysplasia, HGD/CRC and CRC) in each patient cohort (IBD and non-IBD) and dataset (MVP and CDW). The overall agreement between the two reviewers was excellent for these tasks, with Cohen’s kappa ranging from 89% to 97%. The task of IND was also evaluated in the IBD cohort, where Cohen’s kappa was 78.1% and 93.1% in CDW and MVP, respectively.

LLMs extract pathologic diagnoses with high accuracy in patients with IBD

In model validation using strictly distinct reports from those used for prompt development (see the Methods section), all tasks in IBD achieved excellent performance using LLM Gemma-2 (table 1). Metrics such as precision (PPV) and F1 are dependent on the class ratio, meaning that the uncalibrated F1 and PPV are not as useful for comparing across tasks with very different prevalences.27 This is especially relevant when comparing the main three diagnostic tasks to the IND task, which is relatively rare in the plausible set of pathology reports filtered using the dysplasia terms in IBD. The calibrated F1 score, which combines calibrated precision and recall, for IND was 98.6% (95% CI 96.9 to 99.8%) in MVP and 95.6% (95% CI 93.0 to 98.9%) in CDW when using the Microscopic exam section as input. In comparison, the task with the highest calibrated F1 score was diagnosis of CRC in MVP data (F1c=99.3% (95% CI 98.7 to 99.8%)).

Table 1. Validated performance results for IBD patients in MVP and CDW using Gemma-2.

| Task | Source | Model prevalence estimate | PPV (LB–UB) | NPV (LB–UB) | PPVc | Recall (sensitivity) | Specificity | F1 | F1c | MCC | Cohen’s kappa |
| CRC | MVP | 0.257 | 0.962 (0.92–0.99) | 1.000 (0.98–1.00) | 0.986 | 1.000 | 0.987 | 0.980 | 0.993 | 0.974 | 0.889 |
| CRC | CDW | 0.266 | 0.916 (0.84–0.99) | 0.967 (0.92–0.99) | 0.971 | 0.910 | 0.969 | 0.913 | 0.939 | 0.881 | 0.910 |
| HGD/CRC | MVP | 0.176 | 0.921 (0.84–0.99) | 0.990 (0.96–1.00) | 0.983 | 0.953 | 0.983 | 0.937 | 0.968 | 0.924 | 0.947 |
| HGD/CRC | CDW | 0.219 | 0.941 (0.82–1.00) | 0.990 (0.95–1.00) | 0.983 | 0.962 | 0.983 | 0.951 | 0.973 | 0.938 | 0.970 |
| Dysplasia | MVP | 0.262 | 0.957 (0.92–0.99) | 0.993 (0.96–1.00) | 0.985 | 0.981 | 0.985 | 0.969 | 0.983 | 0.958 | 0.920 |
| Dysplasia | CDW | 0.313 | 0.964 (0.90–1.00) | 1.000 (0.96–1.00) | 0.983 | 1.000 | 0.984 | 0.982 | 0.992 | 0.974 | 0.950 |
| IND | MVP | 0.030 | 0.752 (0.57–0.93) | 1.000 (0.98–1.00) | 0.990 | 0.983 | 0.992 | 0.852 | 0.986 | 0.856 | 0.931 |
| IND | CDW | 0.031 | 0.526 (0.42–0.82) | 0.999 (0.97–1.00) | 0.974 | 0.938 | 0.985 | 0.674 | 0.956 | 0.696 | 0.781 |

Shading progression has lower value (red)=0.5, middle value (white)=0.9 and upper value (green)=1.

95% CIs for PPV and NPV were approximated using both bootstrapping and the binomial distribution, with the more conservative interval reported (see online supplemental methods).

F1c, calibrated F1 score27; HGD/CRC, high-grade dysplasia and/or colorectal adenocarcinoma; IBD, inflammatory bowel disease; IND, indefinite for dysplasia; Invasive CRC, invasive colorectal cancer (invasive colorectal adenocarcinoma); LB, lower bound; MCC, Matthew’s correlation coefficient; MVP, Million Veteran Program; NPV, negative predictive value; PPV, positive predictive value; PPVc, calibrated positive predictive value (calibrated precision)27; UB, upper bound.
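The CI strategy in the footnote above can be sketched as follows; this is a hypothetical implementation assuming a percentile bootstrap and a normal-approximation (Wald) binomial interval, taking the wider bound on each side, whereas the paper's exact procedure is described in its online supplemental methods.

```python
import random
from math import sqrt

def conservative_ci(successes, n, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for a proportion such as PPV = TP / (TP + FP): compute a
    percentile bootstrap interval and a normal-approximation (Wald) binomial
    interval, then keep the more conservative (outer) bound from each."""
    p = successes / n
    # Normal-approximation binomial (Wald) interval, clipped to [0, 1]
    se = sqrt(p * (1 - p) / n)
    wald = (max(0.0, p - 1.96 * se), min(1.0, p + 1.96 * se))
    # Percentile bootstrap: resample the n validated reports with replacement
    rng = random.Random(seed)
    data = [1] * successes + [0] * (n - successes)
    boots = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    # Conservative: take the wider bound from the two methods on each side
    return min(wald[0], lo), max(wald[1], hi)

# Example: 96 true positives among 100 model-positive reports (PPV 0.96)
ci = conservative_ci(96, 100)
```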

We found similar but slightly lower performance when using LLM Llama-3 (online supplemental table S3). As expected, smaller LLMs with fewer parameters (2–3 billion) were less accurate in the four tasks for IBD (online supplemental table S4). GPU use significantly reduced run times (online supplemental table S2). Model performance for IBD-CRC in MVP was also similar if only filtering for colorectal location (ie, only step 1 of filtering in figure 1A) before implementing LLM prompts (online supplemental table S5).

Validation of LLM approach in non-IBD colorectal dysplasia and cancer

We then applied the same validation approach, with no changes to model prompts, to records from patients without IBD (ie, no IBD colitis ICD code found in patient clinical history) and again achieved highly accurate results in the three relevant tasks for non-IBD reports (table 2). Specifically, we found that the F1-score for identifying dysplasia in patients without IBD was ~99% using both Gemma-2 and Llama-3 (table 2, online supplemental table S3). F1-scores were slightly lower but still excellent for HGD/CRC (>96%) and CRC (>95%) using Gemma-2. These results highlight the flexibility of the LLM approach for similar histopathologic diagnoses in different patient groups.

Table 2. Validation performance results for non-IBD colitis patients in MVP and CDW using Gemma-2.

| Task | Source | Model prevalence estimate | PPV (LB–UB) | NPV (LB–UB) | PPVc | Recall (sensitivity) | Specificity | F1 | F1c | MCC | Cohen's kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRC | MVP | 0.522 | 0.982 (0.94–1.00) | 0.908 (0.85–0.95) | 0.982 | 0.921 | 0.978 | 0.950 | 0.950 | 0.895 | 0.920 |
| CRC | CDW | 0.525 | 0.971 (0.92–1.00) | 0.951 (0.88–0.99) | 0.970 | 0.957 | 0.968 | 0.964 | 0.963 | 0.924 | 0.940 |
| HGD/CRC | MVP | 0.306 | 0.977 (0.93–1.00) | 0.979 (0.94–1.00) | 0.990 | 0.953 | 0.990 | 0.965 | 0.971 | 0.949 | 0.947 |
| HGD/CRC | CDW | 0.411 | 0.966 (0.90–1.00) | 0.990 (0.94–1.00) | 0.976 | 0.985 | 0.976 | 0.975 | 0.981 | 0.958 | 0.950 |
| Dysplasia | MVP | 0.896 | 0.986 (0.95–1.00) | 0.993 (0.96–1.00) | 0.889 | 0.999 | 0.890 | 0.992 | 0.941 | 0.933 | 0.953 |
| Dysplasia | CDW | 0.870 | 0.988 (0.95–1.00) | 0.990 (0.94–1.00) | 0.923 | 0.998 | 0.923 | 0.993 | 0.959 | 0.949 | 0.920 |

In the original table, cell shading progresses from red (lower value=0.5) through white (middle value=0.9) to green (upper value=1).

95% CIs for PPV and NPV were approximated using both bootstrapping and the binomial distribution, with the more conservative interval reported (see online supplemental methods).

CDW, Corporate Data Warehouse; CRC, invasive colorectal cancer (invasive colorectal adenocarcinoma); F1c, calibrated F1 score27; HGD/CRC, high-grade dysplasia and/or colorectal adenocarcinoma; IBD, inflammatory bowel disease; IND, indefinite for dysplasia; LB, lower bound; MCC, Matthew’s correlation coefficient; MVP, Million Veteran Program; NPV, negative predictive value; PPV, positive predictive value; PPVc, calibrated positive predictive value (calibrated precision)27; UB, upper bound.

Accuracy of applying LLM methods to full-text pathology reports

To evaluate the generalisability of our model to environments without semistructured resources such as the VA Pathology Domain, we applied our LLM approach to full pathology reports and again found excellent performance using Gemma-2 (table 3). In IBD, the task with the best performance was the presence of any dysplasia (F1=97.1% (95% CI 93.5% to 100%)) and the lowest performance of the three main tasks was diagnosis of HGD/CRC (F1=86.7% (95% CI 78.6% to 94.6%)). See online supplemental table S6 for details on the number of full-text reports evaluated. Higher values were found for the calibrated F1-scores. While both Gemma-2 and Llama-3 were trained with context lengths up to 8192 tokens, and the full notes never exceeded this threshold, performance decreased slightly when using Llama-3 (online supplemental table S3). We found performance similar to Gemma-2 alone when requiring either or both models to answer 'yes' for a report to be deemed a positive case (online supplemental table S7).
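The either/both agreement rule is simple to state precisely; a minimal sketch, assuming each model's free-text answer has already been parsed to a boolean:

```python
def ensemble(gemma_yes: bool, llama_yes: bool, mode: str = "either") -> bool:
    """Combine two models' yes/no answers for a single report: 'either'
    (logical OR) trades precision for recall, while 'both' (logical AND)
    does the opposite. Mirrors the comparison in supplemental table S7."""
    if mode == "either":
        return gemma_yes or llama_yes
    if mode == "both":
        return gemma_yes and llama_yes
    raise ValueError(f"unknown mode: {mode}")
```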

Table 3. Validation results using full pathology report in IBD population in MVP.

| Task | Input | Model prevalence estimate | PPV (LB–UB) | NPV (LB–UB) | PPVc | Recall (sensitivity) | Specificity | F1 | F1c | MCC | Cohen's kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRC | Microscopic exam | 0.257 | 0.962 (0.92–0.99) | 1.000 (0.98–1.00) | 0.986 | 1.000 | 0.987 | 0.980 | 0.993 | 0.974 | 0.889 |
| CRC | Full pathology report | 0.292 | 0.822 (0.73–0.91) | 0.994 (0.95–1.00) | 0.919 | 0.982 | 0.931 | 0.895 | 0.950 | 0.863 | 0.877 |
| HGD/CRC | Microscopic exam | 0.176 | 0.921 (0.84–0.99) | 0.990 (0.96–1.00) | 0.983 | 0.953 | 0.983 | 0.937 | 0.968 | 0.924 | 0.947 |
| HGD/CRC | Full pathology report | 0.198 | 0.792 (0.67–0.92) | 0.991 (0.95–1.00) | 0.942 | 0.958 | 0.951 | 0.867 | 0.950 | 0.844 | 0.941 |
| Dysplasia | Microscopic exam | 0.262 | 0.957 (0.92–0.99) | 0.993 (0.96–1.00) | 0.985 | 0.981 | 0.985 | 0.969 | 0.983 | 0.958 | 0.920 |
| Dysplasia | Full pathology report | 0.255 | 0.967 (0.91–1.00) | 0.992 (0.95–1.00) | 0.989 | 0.976 | 0.989 | 0.971 | 0.982 | 0.962 | 0.967 |
| IND | Microscopic exam | 0.030 | 0.752 (0.57–0.93) | 1.000 (0.98–1.00) | 0.990 | 0.983 | 0.992 | 0.852 | 0.986 | 0.856 | 0.931 |
| IND | Full pathology report | 0.028 | 0.636 (0.39–0.91) | 0.992 (0.96–1.00) | 0.988 | 0.696 | 0.989 | 0.665 | 0.817 | 0.656 | 0.935 |

Comparison of the Microscopic exam section (these rows are repeated from table 1) and the full pathology report as input to the LLM. Full pathology reports were evaluated by LLMs in all cases where the Pathology Domain entry had a matching full note. Where the full note was not available, the number validated for the full pathology report was less than the total number validated for 'Microscopic exam'. All analyses still had >100 model-positive and model-negative cases validated. 95% CIs for PPV and NPV were approximated using both bootstrapping and the binomial distribution, with the more conservative interval reported (see online supplemental methods).

CRC, invasive colorectal cancer (invasive colorectal adenocarcinoma); F1c, calibrated F1 score27; HGD/CRC, high-grade dysplasia and/or colorectal adenocarcinoma; IBD, inflammatory bowel disease; IND, indefinite for dysplasia; LB, lower bound; MCC, Matthew’s correlation coefficient; MVP, Million Veteran Program; NPV, negative predictive value; PPV, positive predictive value; PPVc, calibrated positive predictive value (calibrated precision)27; UB, upper bound.

Discussion

We have shown that LLMs are powerful, potentially generalisable tools for accurately extracting important information from semistructured and unstructured clinical text while requiring little human-led development. With performance validated by blinded manual chart review in MVP data as well, we expect that this approach will enable large-scale health research studies that incorporate patient genomics in disease risk assessment and prediction. Another strength of this work is that the methods are relatively simple. While not explicitly tested, we expect our findings to adapt relatively easily to other pathological diagnoses, healthcare systems, patient populations and time periods. No aspect of the prompt or models used was specific to the VA or our cohorts, and no additional model fine-tuning was performed. The barriers to implementation are minimal; any researcher can clone the llama.cpp20 repository from GitHub, add their desired prompt, compile and begin development. Our repository is available on GitHub for the community.
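As a sketch of that workflow, the filter-then-prompt approach from the Methods can be wired up as below. The search terms, prompt wording and `ask_llm` wrapper are illustrative stand-ins, not the study's exact artefacts; in practice `ask_llm` would be a thin wrapper around a locally compiled llama.cpp model.

```python
import re

# Illustrative search terms and prompt; the study's actual term lists and
# prompt wording are given in its methods and supplement
DYSPLASIA_TERMS = re.compile(r"dysplas|adenom|carcinom", re.IGNORECASE)

PROMPT = ("Answer yes or no: does the following pathology report describe "
          "colorectal dysplasia?\n\n{report}\n\nAnswer:")

def classify_reports(reports, ask_llm):
    """Two-step approach: cheap search-term filtering first, then a yes/no
    LLM prompt only on the plausible reports. `ask_llm` is any callable
    mapping a prompt string to the model's raw text answer."""
    results = []
    for report in reports:
        if not DYSPLASIA_TERMS.search(report):
            results.append(False)  # filtered out without an LLM call
            continue
        answer = ask_llm(PROMPT.format(report=report))
        results.append(answer.strip().lower().startswith("yes"))
    return results
```

Keeping the LLM call behind a plain callable is what makes the approach portable: swapping Gemma-2 for Llama-3, or a future model, changes only the wrapper, not the pipeline.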

While previous NLP approaches show excellent performance in identifying common features like adenoma,10 11 14 few have maintained excellent performance under thorough testing on rarer advanced features such as HGD, carcinoma in situ and invasive adenocarcinoma. Additionally, few have tested approaches in differing contexts, such as differing geographical locations, practice types (eg, academic vs private practice) and compensation structures (eg, salary vs fee-for-service).16 When their system was adapted across four practice sites, Carrell et al reported an F1-score of 95% for identifying adenoma and highlighted the considerable, time-consuming challenges they encountered in adapting the NLP system.16 The most comparable analysis in our study to previously published algorithms was the task of identifying any dysplasia in the non-IBD colitis cohort, where Gemma-2 had an F1-score of 99.2% (95% CI 98.2% to 100%) and Llama-3 had an F1-score of 99.2% (95% CI 98.1% to 100%) in MVP, with even higher scores in CDW. Published analyses in similar cohorts report similarly high F1-scores, such as Bae et al, who report an F1-score of 99% for identifying the presence of 'conventional adenoma',14 and Nayor et al, who report a perfect F1-score of 100% for identifying 'adenoma'.11 Online supplemental table S8 shows a comparison of performance using previous methods across comparable tasks. Notably, code for these rule-based approaches was not made available in most cases to enable direct application in our data.

While LLMs remain computationally expensive, the size and associated compute cost of proficient models have dropped drastically, with the best small (9 billion parameter), open-weight model available at the time of this work (Gemma-2-9b-it,21 released 27 June 2024) generally performing better than the largest proprietary models from a year prior (GPT-4-0613,30 released 13 June 2023).31 If such improvements in efficiency continue, boosted by potential advancements in the underlying transformer architecture,32 LLMs will become more attractive in domains where the current computational expense makes their use unfeasible. Importantly, the LLM approach is robust and can be readily adapted with minimal, if any, tweaks to prompts/code as newer models become available online. Even without further improvements in the models themselves, the increasing availability of GPUs and the throughput of new chip architectures33–35 may make current models a viable alternative for data structuring at scale. Given the ease of our model implementation, with results as accurate as more complicated rule-based approaches, we suggest an LLM approach for many free-text classification tasks in biomedical research going forward.

Our work has some limitations. First, while we expect our approach to adapt more flexibly to different settings, we did not explicitly test our LLM approach in other large-scale EHR datasets beyond the VHA, though recent success applying LLMs in other health systems has been shown.17 18 Second, without long-term access to GPUs, we could not feasibly test larger models, which may overcome some of the shortcomings seen in smaller models; this addition can be expected to increase performance above what we find herein. Finally, we could not rule out overlap between MVP and CDW reports, though our results in either cohort considered alone are sufficient validation compared with previously published work.

Ongoing work includes adapting our approach to detect stage and location of cancers, identifying features of dysplasia (size, shape, type, location, inflammation level, etc) and endoscopic resection details as well as incorporation with genetic data. Some tasks, such as identifying IBD subtypes and dates of diagnosis, may require larger models that are more capable of handling longer input text. Nonetheless, the general framework lends itself to many applications beyond the use cases analysed here, including the potential for real-time data integration in models used to aid in shared decision-making (so-called medical digital twins).36 While we show that the applications for research are immediate (low compute requirements, free access to open weight models that do not risk patient privacy), these exciting opportunities for LLM integration into existing healthcare systems will have to overcome certain challenges for real-world deployment (eg, IT infrastructure, potential biases and user trust) that will be the focus of many future quality improvement studies.

Accurate clinical data are essential for understanding trends in patient disease risk and for predictive models to be clinically useful. In an era of increasing opportunities for personalised medicine, we show that LLMs offer a very useful tool for quickly and accurately obtaining relevant patient data to potentially inform medical decisions in real time.

Supplementary material

online supplemental file 1
bmjgast-12-1-s001.csv (1.4KB, csv)
DOI: 10.1136/bmjgast-2025-001896
online supplemental file 2
bmjgast-12-1-s002.csv (104.7KB, csv)
DOI: 10.1136/bmjgast-2025-001896
online supplemental file 3
bmjgast-12-1-s003.docx (34.7KB, docx)
DOI: 10.1136/bmjgast-2025-001896
online supplemental file 4
bmjgast-12-1-s004.pdf (373.5KB, pdf)
DOI: 10.1136/bmjgast-2025-001896

Footnotes

Funding: This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by MVP070 and MVP000 as well as Merit Review Award I01 BX005958 from the United States (U.S.) Department of Veterans Affairs Biomedical Laboratory Research and Development Service. The contents do not represent the views of the U.S. Department of Veterans Affairs or the United States Government. This work was supported by AGA Research Foundation (AGA Research Scholar Award AGA2022-13-05), NIH grants (R01 CA270235, P30 CA023100), and National Library of Medicine Training Grant (NIH grant T15LM011271). The study was supported in part by the NIDDK-funded San Diego Digestive Diseases Research Center (P30 DK120515).

Provenance and peer review: Not commissioned; externally peer reviewed.

Patient consent for publication: Not applicable.

Ethics approval: The Research and Development Committee of VA San Diego Healthcare System and the VA Central Institutional Review Board (IRB) reviewed and approved the IRB protocols for this study (E220040 and CIRB E22-5). We have a waiver of consent for our IRB-approved exempt study. A waiver of individual authorisation for use of Protected Health Information (PHI) for full study purposes can be granted by the Institutional Review Board as stipulated by the HIPAA Privacy Rule, 45 CFR 164 Section 512(1).

Data availability free text: A CSV (all_results.csv) with aggregated results from all validated runs, including additional details such as prevalence of model positives, exact numbers validated, and the full confusion matrix, is available in the online supplemental information. This CSV is the source for all three main text tables and online supplemental tables S3–S5 and S7. Prevalence in 'all_results.csv' is pulled from the supplementary file 'prevalence_CDW_and_MVP.csv'. Raw data access is reserved for VA investigators with appropriate research approvals.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

References

  • 1.Choi C-HR, Rutter MD, Askari A, et al. Forty-Year Analysis of Colonoscopic Surveillance Program for Neoplasia in Ulcerative Colitis: An Updated Overview. Am J Gastroenterol. 2015;110:1022–34. doi: 10.1038/ajg.2015.65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Shah SC, Itzkowitz SH. Colorectal Cancer in Inflammatory Bowel Disease: Mechanisms and Management. Gastroenterology. 2022;162:715–30. doi: 10.1053/j.gastro.2021.10.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Gupta S, Lieberman D, Anderson JC, et al. Recommendations for Follow-Up After Colonoscopy and Polypectomy: A Consensus Update by the US Multi-Society Task Force on Colorectal Cancer. Gastroenterology. 2020;158:1131–53. doi: 10.1053/j.gastro.2019.10.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Murthy SK, Feuerstein JD, Nguyen GC, et al. AGA Clinical Practice Update on Endoscopic Surveillance and Management of Colorectal Dysplasia in Inflammatory Bowel Diseases: Expert Review. Gastroenterology. 2021;161:1043–51. doi: 10.1053/j.gastro.2021.05.063. [DOI] [PubMed] [Google Scholar]
  • 5.Rubin DT, Ananthakrishnan AN, Siegel CA, et al. ACG Clinical Guideline: Ulcerative Colitis in Adults. Am J Gastroenterol. 2019;114:384–413. doi: 10.14309/ajg.0000000000000152. [DOI] [PubMed] [Google Scholar]
  • 6.Locke S, Bashall A, Al-Adely S, et al. Natural language processing in medicine: A review. Trends in Anaesthesia and Critical Care. 2021;38:4–9. doi: 10.1016/j.tacc.2021.02.007. [DOI] [Google Scholar]
  • 7.Benson R, Winterton C, Winn M, et al. Leveraging Natural Language Processing to Extract Features of Colorectal Polyps From Pathology Reports for Epidemiologic Study. JCO Clin Cancer Inform. 2023;7:e2200131. doi: 10.1200/CCI.22.00131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Fevrier HB, Liu L, Herrinton LJ, et al. A Transparent and Adaptable Method to Extract Colonoscopy and Pathology Data Using Natural Language Processing. J Med Syst. 2020;44:151. doi: 10.1007/s10916-020-01604-8. [DOI] [PubMed] [Google Scholar]
  • 9.Gupta S, Earles A, Bustamante R, et al. Adenoma Detection Rate and Clinical Characteristics Influence Advanced Neoplasia Risk After Colorectal Polypectomy. Clin Gastroenterol Hepatol. 2023;21:1924–36. doi: 10.1016/j.cgh.2022.10.003. [DOI] [PubMed] [Google Scholar]
  • 10.Harkema H, Chapman WW, Saul M, et al. Developing a natural language processing application for measuring the quality of colonoscopy procedures. J Am Med Inform Assoc. 2011;18 Suppl 1:i150–6. doi: 10.1136/amiajnl-2011-000431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Nayor J, Borges LF, Goryachev S, et al. Natural Language Processing Accurately Calculates Adenoma and Sessile Serrated Polyp Detection Rates. Dig Dis Sci. 2018;63:1794–800. doi: 10.1007/s10620-018-5078-4. [DOI] [PubMed] [Google Scholar]
  • 12.Imler TD, Morea J, Kahi C, et al. Natural language processing accurately categorizes findings from colonoscopy and pathology reports. Clin Gastroenterol Hepatol. 2013;11:689–94. doi: 10.1016/j.cgh.2012.11.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Raju GS, Lum PJ, Slack RS, et al. Natural language processing as an alternative to manual reporting of colonoscopy quality metrics. Gastrointest Endosc. 2015;82:512–9. doi: 10.1016/j.gie.2015.01.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Bae JH, Han HW, Yang SY, et al. Natural Language Processing for Assessing Quality Indicators in Free-Text Colonoscopy and Pathology Reports: Development and Usability Study. JMIR Med Inform. 2022;10:e35257. doi: 10.2196/35257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Syed S, Angel A, Syeda H, et al. The h-ann model: comprehensive colonoscopy concept compilation using combined contextual embeddings. Biomed Eng Syst Technol Int Jt Conf BIOSTEC Revis Sel Pap. 2022:189–200. doi: 10.5220/0010903300003123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Carrell DS, Schoen RE, Leffler DA, et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. J Am Med Inform Assoc. 2017;24:986–91. doi: 10.1093/jamia/ocx039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Patel PV, Davis C, Ralbovsky A, et al. Large Language Models Outperform Traditional Natural Language Processing Methods in Extracting Patient-Reported Outcomes in Inflammatory Bowel Disease. Gastro Hep Adv . 2025;4:100563. doi: 10.1016/j.gastha.2024.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Acharya V, Kumaresan V, England J, et al. Use of Large Language Models to Identify Surveillance Colonoscopy Intervals-A Feasibility Study. Gastroenterology. 2025;168:382–4. doi: 10.1053/j.gastro.2024.09.032. [DOI] [PubMed] [Google Scholar]
  • 19.Gaziano JM, Concato J, Brophy M, et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214–23. doi: 10.1016/j.jclinepi.2015.09.016. [DOI] [PubMed] [Google Scholar]
  • 20.Gerganov G. llama.cpp. 2024. Available: https://github.com/ggml-org/llama.cpp
  • 21.Team G, Mesnard T, Hardin C, et al. Gemma. n.d. [DOI]
  • 22.Wu Y, Sun Z, Yuan H, et al. Self-play preference optimization for language model alignment. 2024. [Google Scholar]
  • 23.AI@Meta. Llama 3 model card. 2024. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  • 24.Frantar E, Ashkboos S, Hoefler T, et al. GPTQ: accurate post-training quantization for generative pre-trained transformers. 2022. [Google Scholar]
  • 25.Gawron AJ, Mckee G, Dominitz JA, et al. Validation of a National Pathology Database for Colonoscopy Quality Reporting and Assurance. Clin Gastroenterol Hepatol. 2025;23:866–8. doi: 10.1016/j.cgh.2024.08.017. [DOI] [PubMed] [Google Scholar]
  • 26.Riddell RH, Goldman H, Ransohoff DF, et al. Dysplasia in inflammatory bowel disease: standardized classification with provisional clinical applications. Hum Pathol. 1983;14:931–68. doi: 10.1016/s0046-8177(83)80175-0. [DOI] [PubMed] [Google Scholar]
  • 27.Siblini W, Fréry J, He-Guelton L, et al. In: Advances in intelligent data analysis XVIII. Vol 12080. Lecture notes in computer science. Berthold MR, Feelders A, Krempl G, editors. Springer International Publishing; 2020. Master your metrics with calibration; pp. 457–69. [Google Scholar]
  • 28.Liu L, Bustamante R, Earles A, et al. A strategy for validation of variables derived from large-scale electronic health record data. J Biomed Inform. 2021;121:103879. doi: 10.1016/j.jbi.2021.103879. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ . 2015;351:h5527. doi: 10.1136/bmj.h5527. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Achiam J, Adler S, et al. OpenAI GPT-4 technical report. 2023. [DOI]
  • 31.Chiang WL, Zheng L, Sheng Y, et al. Chatbot arena: an open platform for evaluating LLMs by human preference. arXiv. 2024 doi: 10.48550/ARXIV.2403.04132. [DOI] [Google Scholar]
  • 32.Ye T, Dong L, Xia Y, et al. Differential transformer. arXiv. 2024 doi: 10.48550/arXiv.2410.05258. [DOI] [Google Scholar]
  • 33.Abts D, Kimmell G, Ling A, et al. A software-defined tensor streaming multiprocessor for large-scale machine learning. Proceedings of the 49th Annual International Symposium on Computer Architecture; 2022. pp. 567–80. [DOI] [Google Scholar]
  • 34.Prabhakar R, Sivaramakrishnan R, Gandhi D, et al. SambaNova SN40L: scaling the AI memory wall with dataflow and composition of experts. arXiv. 2024 doi: 10.48550/ARXIV.2405.07518. [DOI] [Google Scholar]
  • 35.Lie S. Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning. IEEE Micro. 2023;43:18–30. doi: 10.1109/MM.2023.3256384. [DOI] [Google Scholar]
  • 36.Johnson B, Curtius K. Digital twins are integral to personalizing medicine and improving public health. Nat Rev Gastroenterol Hepatol . 2024;21:740–1. doi: 10.1038/s41575-024-00992-3. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from BMJ Open Gastroenterology are provided here courtesy of BMJ Publishing Group
