JAMIA Open. 2025 Aug 13;8(4):ooaf092. doi: 10.1093/jamiaopen/ooaf092

EchoLLM: extracting echocardiogram entities with light-weight, open-source large language models

Jonathan Chi 1, Yazan Rouphail 2, Ethan Hillis 3, Ningning Ma 4, An Nguyen 5, Jane Wang 6, Mackenzie Hofford 7, Aditi Gupta 8, Patrick G Lyons 9, Adam Wilcox 10, Albert M Lai 11, Philip R O Payne 12, Marin H Kollef 13, Caitlin Dreisbach 14,15, Andrew P Michelson 16,17
PMCID: PMC12349756  PMID: 40809469

Abstract

Objectives

Large language models (LLMs) have demonstrated high levels of performance in clinical information extraction compared to rule-based systems and traditional machine-learning approaches, offering scalability, contextualization, and easier deployment. However, most studies rely on proprietary models with privacy concerns and high costs, limiting accessibility. We aim to evaluate 14 publicly available open-source LLMs for extracting clinically relevant findings from free-text echocardiogram reports and examine the feasibility of their implementation in information extraction workflows.

Materials and Methods

We used 14 open-source LLMs to extract clinically relevant entities from echocardiogram reports (n = 507). Each report was manually annotated by 2 independent health-care professionals and adjudicated by a third. Lexical variance and length of each echocardiogram report were collected. Precision, recall, and F1 scores were calculated for the 9 extracted entities via multiclass classification.

Results

In aggregate, Gemma2:9b-instruct had the highest precision, recall, and F1 scores at 0.973 (0.962-0.983), 0.959 (0.947-0.973), and 0.965 (0.951-0.975), respectively. In comparison, Phi3:3.8b-mini-instruct had the lowest precision score at 0.831 (0.804-0.856), while Gemma:7b-instruct had the lowest recall and F1 scores at 0.382 (0.356-0.408) and 0.392 (0.356-0.428), respectively.

Discussion and Conclusion

Using LLMs for entity extraction for echocardiogram reports has the potential to support both clinical research and health-care delivery. Our work demonstrates the feasibility of using open-source models for more efficient computation and extraction.

Keywords: natural language processing, clinical decision support, clinical and research data collection, large language models, electronic health records

Background and significance

Unstructured free-text data comprise 80% of all digital health-care data and have the potential to improve health-care delivery, contextualize patient health, advance clinical research, and train artificial intelligence models.1–5 Unfortunately, due to their complex heterogeneity and variability, unstructured data are challenging to use.1,2,4 Previous attempts have used rule-based and pattern-matching systems to extract information, but these systems often require extensive validation due to lexical variations, negations, and ambiguous medical documentation language. Rule-based information extraction systems also require manual updating of regular expressions as the data become more complex, making them difficult and expensive to maintain.6 More advanced analyses have attempted to use traditional deep learning models and natural language processing (NLP) to derive semantic and lexical patterns from the data automatically.5–9 However, these solutions are limited by slow development and deployment times, requirements for vast amounts of high-quality training data, and the need for fine-tuning.5,7,8 Furthermore, a large majority of these systems are specifically tailored for single-institutional, single-task needs, reducing their scalability and ease of use across diverse health-care systems or health records.8,9

More recently, large language models (LLMs) have emerged as a new form of deep learning trained on large quantities of information that provide them with the ability to understand context, meaning, and widely variable language patterns. They have been applied as entity extraction agents in a diverse array of clinical settings, outperforming traditional rule-based and machine-learning approaches.10–14 Furthermore, LLMs do not need additional retraining, offering contextualization across many health-care domains and easier deployment for health-care providers. A large majority of current studies, however, have relied on proprietary, expensive, closed-source models such as OpenAI’s ChatGPT. Although these larger models can draw from more computing resources and have demonstrated success in information extraction tasks, there are significant drawbacks. Privacy concerns in transferring private health information to third-party remote servers and expensive pay-per-token costs make closed-source LLMs inaccessible to many health-care institutions.

In contrast, publicly available open-source, locally run LLMs are rapidly emerging and may provide options to circumvent these limitations. In recent years, they have seen substantial improvements in transformer architecture and performance, providing a viable alternative to their expensive, third-party counterparts. These open-source LLMs can also be deployed on commercially available computing hardware, allowing sensitive health information to stay within the health-care network infrastructure and mitigating the chances of data security breaches. Despite such benefits, few previous clinical information extraction studies have investigated these rapidly improving, publicly available LLMs. Thus, to better understand the capabilities of publicly available open-source models, we evaluate 14 open-source LLMs and report the quality of their entity extractions (ie, clinical measurements and values) from narrative, free-text echocardiogram reports. We compare the LLM entity extractions to gold-standard manual annotations by clinicians.

Objectives

This study aims to evaluate the potential of open-source LLMs to accurately extract clinically relevant data from echocardiogram reports. We hope to characterize this inexpensive and data-secure clinical information extraction approach and determine its ability to transform unstructured electronic health records into usable and informative data, supporting downstream clinical research and decision-making.

Methods

Data

Echocardiogram reports from all patients admitted to an 11-hospital health-care system between July 1, 2018 and October 30, 2022 were extracted. A proportionally stratified random sample was selected from each hospital for analysis (n = 507). This project received approval from the Institutional Review Board at Washington University in St Louis (IRB #201804121) with a waiver of informed consent.

Nine entities from the echocardiogram report were chosen for extraction based on clinical significance. The entities included left ventricular ejection fraction (LVEF), left ventricular diastolic function (DF), pulmonary arterial systolic pressure (PASP), right ventricular heart size (RHS), right ventricular systolic function (RVF), along with the presence and multilevel severity of valvular heart disease such as mitral valve stenosis (MS), mitral valve regurgitation (MR), aortic valve stenosis (AS), and aortic valve regurgitation (AR).
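To make the extraction target concrete, the sketch below shows the kind of JSON record this 9-entity schema implies. The study's actual output template is in Table SA1; the key names, value sets, and example values here are illustrative assumptions only.

```python
# Hypothetical example of a structured record for one echocardiogram report.
# Key names and value sets are assumptions; the study's template is in Table SA1.
example_extraction = {
    "LVEF": 55,          # left ventricular ejection fraction, % (continuous)
    "DF": "normal",      # left ventricular diastolic function
    "PASP": 30,          # pulmonary arterial systolic pressure, mm Hg (continuous)
    "RHS": "normal",     # right ventricular heart size
    "RVF": "normal",     # right ventricular systolic function
    "MS": "normal",      # mitral valve stenosis severity
    "MR": "mild",        # mitral valve regurgitation severity
    "AS": "normal",      # aortic valve stenosis severity
    "AR": "normal",      # aortic valve regurgitation severity
}
```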

Annotation

Each echocardiogram report was manually annotated by 2 independent health-care professionals (selected from among 1 medical student, 2 clinical fellows, and 2 attending hospitalist physicians), with discrepancies adjudicated by an expert and independent third reviewer (a board-certified pulmonary and critical care physician). To standardize the annotation process, all reviewers were provided with a guide for annotation before reviewing the data, and each data point was constrained to a set of predefined options. When a categorical entity was not explicitly reported in the echocardiogram text, reviewers were instructed to label it as “normal.” For continuous variables, reviewers were instructed to leave missing values blank and, when a range was reported, to extract the worst (most abnormal) value. All discrepancies were labeled as “difficult study,” “discrepancy within the report,” “transcription error/reviewer mistake,” or “vague/unclear correct choice.”

Models

Fourteen high-performing, publicly available LLMs were chosen for analysis based on manual review of performance on the Hugging Face Leaderboard15 and model size (Table 1). To estimate performance in resource-constrained settings, both 4-bit quantized models and their fp16 instruct counterparts were utilized, except for Llama3:70b, where the fp16 instruct model would have required drastically more compute resources. All models were run using the default settings of the Ollama framework due to its accessibility and ease of use. Computations were run on a system equipped with an Intel Xeon W-2195 CPU, 384 GB of RAM, and 4 Nvidia RTX 2080 Ti GPUs totaling 44 GB of VRAM. For clarity, the suffixes -4k, -fp16, and -fp-16 are removed from model names in the body of this manuscript. Full model names are detailed in Table 1.
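As a minimal sketch of this setup, the snippet below sends one prompt to a locally served model through Ollama's REST endpoint and times the call. The endpoint and request shape follow Ollama's documented API; the helper name is an assumption, and the model tags mirror Table 1.

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def run_model(model: str, prompt: str) -> tuple[str, float]:
    """Send one report prompt to a locally served model and time the call."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"], time.perf_counter() - start

# Example: iterate over a subset of the Table 1 tags for one report.
for tag in ["gemma2:9b", "gemma2:9b-instruct-fp16", "llama3:8b"]:
    text, seconds = run_model(tag, "...")  # "..." stands in for the study prompt
```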

Table 1.

Publicly available models evaluated.

Model No. of parameters Date released Company
Gemma:7b-instruct-fp-16 7 billion February 21, 2024 Google
Gemma:7b 7 billion February 21, 2024 Google
Mixtral:8x7b 47 billion April 10, 2024 Mistral AI
Mixtral:instruct 47 billion April 10, 2024 Mistral AI
Llama3:8b 8 billion April 18, 2024 Meta
Llama3:8b-instruct-fp16 8 billion April 18, 2024 Meta
Llama3:70b 70 billion April 18, 2024 Meta
Phi3:3.8b-mini-instruct-4k-fp16 3.8 billion April 23, 2024 Microsoft
Phi3:14b-medium-4k-instruct-f16 14 billion April 23, 2024 Microsoft
Phi3:medium 14 billion April 23, 2024 Microsoft
Mistral:v0.3 7 billion May 22, 2024 Mistral AI
Mistral:7b-instruct-v0.3-fp16 7 billion May 22, 2024 Mistral AI
Gemma2:9b 9 billion June 27, 2024 Google
Gemma2:9b-instruct-fp-16 9 billion June 27, 2024 Google

Model name, number of parameters, release date, and developer are in chronological order of release. Number of parameters refers to the number of individual weights a model has and is a measure of the model’s capacity and size.

Prompting and postprocessing

A series of LLM instructions was iteratively evaluated and refined. Ultimately, the models were prompted without in-context learning (ICL, also known as zero-shot learning) and instructed to return a JSON dictionary, with each entity represented as a key-value pair (entity: measurement). The prompt consisted of the task description and a template of the expected JSON dictionary (Table SA1). Although ICL and few-shot prompting have been shown to improve LLM performance,10,12,14 they required significant amounts of text that, when added to the length of the echocardiogram report, threatened to exceed the context window of many models and were therefore not utilized. To avoid hallucination and increase reproducibility, the parameters for each model were set such that the temperature, top_p, top_k, seed, and mirostat_tau were consistently 0, 0.9, 40, 1, and 5, respectively. No other fine-tuning or further training of the models was conducted. Each model’s extracted entity was mapped to the desired output and evaluated for its correctness in postprocessing. For example, “no AS seen” would be mapped to “normal.” If not completed during LLM-based extraction, outputs indicating the absence of a categorical value were converted to “normal,” while continuous variables (LVEF and PASP) were left blank when the LLM returned no value. To ensure consistent comparison with manual annotations, the same postprocessing was applied to the manually labeled entities. To promote generalizability and reduce token usage, the LLM prompt did not include specific instructions for handling all lexical variations indicating unmeasured entities (such as “unable to assess”). Instead, this determination was made by the LLM during report interpretation when given the general instruction: “If an entity is not clearly stated, report it as ‘normal’” (Table SA1). The time to completion, ability to stay within the context window length, and number of extraneous data points returned were also collected.
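A sketch of this prompting and postprocessing flow, assuming a local Ollama server, is shown below. The decoding options mirror the values stated above; the prompt wording paraphrases Table SA1, and the synonym set used for mapping is an invented illustration.

```python
import json
import requests

PROMPT_TEMPLATE = (  # paraphrase of the study prompt; exact wording is in Table SA1
    "Extract the following entities from the echocardiogram report below and "
    "return them as a JSON dictionary. If an entity is not clearly stated, "
    "report it as 'normal'.\n\nReport:\n{report}"
)
OPTIONS = {  # decoding parameters reported above, passed through Ollama
    "temperature": 0, "top_p": 0.9, "top_k": 40, "seed": 1, "mirostat_tau": 5,
}
NORMAL_SYNONYMS = {"none", "not present", "no as seen", "absent"}  # illustrative set

def extract_entities(model: str, report: str) -> dict | None:
    """Prompt one model for one report and map outputs onto the expected labels."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model,
              "prompt": PROMPT_TEMPLATE.format(report=report),
              "options": OPTIONS,
              "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    try:
        raw = json.loads(resp.json()["response"])
    except json.JSONDecodeError:
        return None  # counted as a failed extraction attempt
    clean = {}
    for entity, value in raw.items():
        if entity in ("LVEF", "PASP"):
            # Continuous entities stay blank (None) when no value is returned.
            clean[entity] = value if value not in (None, "") else None
        else:
            # Absent categorical findings map to "normal" in postprocessing.
            text = str(value).strip().lower()
            clean[entity] = "normal" if text in NORMAL_SYNONYMS else text
    return clean
```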

Statistical analyses

Lexical variance and length of each echocardiogram report were calculated. Precision, recall, and F1 scores were calculated for the 9 extracted entities via multiclass classification with a weighted sample average and reported as median (IQR). For ejection fraction and pulmonary arterial systolic pressure, the 2 continuous variables, root mean square error (RMSE), coefficient of determination (R2), and number of hallucinations and missing entries were also computed. Error was estimated using a bootstrap analysis performed over 250 iterations, with each resample including 100 echocardiogram reports. A subanalysis of precision, recall, and F1 scores was also conducted for abnormal entities (categorical entities presenting disease or deviation from normal physiology). All analyses were done in Python (version 3.11.4) in the Jupyter Notebook environment.16
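One way to implement this bootstrap, sketched under assumptions, is shown below using scikit-learn's weighted multiclass metrics. The function name is hypothetical, and resampling flattened gold/predicted label pairs (rather than whole 9-entity reports) is a simplification.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def bootstrap_scores(y_true, y_pred, n_iter=250, sample_size=100, seed=0):
    """Bootstrap weighted precision/recall/F1, mirroring the 250-iteration,
    100-report resampling scheme described above."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    draws = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(y_true), size=sample_size)  # resample with replacement
        p, r, f1, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="weighted", zero_division=0
        )
        draws.append((p, r, f1))
    draws = np.array(draws)
    # Report median (IQR) for each metric, as in the Results section.
    return {name: (np.median(col),
                   np.percentile(col, 25),
                   np.percentile(col, 75))
            for name, col in zip(["precision", "recall", "f1"], draws.T)}
```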

Results

Echocardiogram report characterization

In total, 507 echocardiograms were chosen for annotation and analysis. The unprocessed echocardiogram report distribution from the manual health provider annotations can be found in Table SA2. The overall interrater agreement (Cohen’s Kappa) for entities relating to valvular function was 0.908. Out of 3549 categorical entities, 1025 (28.8%) had some degree of abnormality. The median (IQR) length of the reports was 401 (378-429) words. Across all echocardiogram reports, there were 209 766 total words, of which 4996 were unique, yielding a token ratio of 0.02, suggesting low lexical variance. Individually, the echocardiogram reports had a median (IQR) token ratio of 0.54 (0.43-0.56), suggesting each word is used about twice per report.
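The token ratio reported here is a type-token ratio (unique words divided by total words), which can be computed as in the brief sketch below; simple whitespace tokenization is an assumption.

```python
def token_ratio(text: str) -> float:
    """Unique-to-total word ratio; lower values indicate more repetitive,
    lower-variance language (0.02 corpus-wide vs ~0.54 per report above)."""
    words = text.lower().split()  # assumes whitespace tokenization
    return len(set(words)) / len(words) if words else 0.0
```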

Model processing time and failed data extraction attempts

The frequency at which models were unable to process an echocardiogram report was captured. Among the 14 models analyzed, Gemma2:9b-instruct extracted data from the largest share of echocardiogram reports (99.4% [n = 504]), while Phi3:3.8b-mini-instruct extracted data from the fewest (35.5% [n = 180]). The median processing time in seconds per report was lowest for Gemma:7b at 2.169 (2.064-2.335) and highest for Llama3:70b at 63.242 (59.91-66.568). Aside from Llama3:70b, the median processing time per report was less than 10 s for all evaluated models (Table 2).

Table 2.

Model compute time and failure rate.

Model Successfully analyzed echocardiogram reports, n (%) Processing time per report in seconds, median (IQR) Unique extracted entity labels, n
Gemma2:9b-instruct 504 (99.4%) 5.737 (5.484-6.06) 30
Gemma2:9b 503 (99.2%) 3.583 (3.34-3.774) 31
Llama3:8b 499 (98.4%) 2.509 (2.349-2.594) 11
Gemma:7b 499 (98.4%) 2.169 (2.064-2.335) 86
Llama3:8b-instruct 498 (98.2%) 4.89 (4.684-5.109) 18
Llama3:70b 497 (98.0%) 63.242 (59.91-66.568) 19
Mistral:v0.3 462 (91.1%) 2.742 (2.585-2.959) 226
Mistral:7b-instruct-v0.3 462 (91.1%) 6.088 (5.701-6.737) 246
Mixtral:instruct 454 (89.5%) 6.919 (6.402-7.889) 114
Mixtral:8x7b 453 (89.3%) 6.915 (6.407-7.99) 101
Phi3:14b-medium-instruct 443 (87.4%) 9.044 (8.199-10.03) 139
Phi3:medium 392 (77.3%) 3.737 (3.339-4.12) 234
Gemma:7b-instruct 356 (70.2%) 6.657 (6.089-7.419) 141
Phi3:3.8b-mini-instruct 180 (35.5%) 2.662 (2.168-4.102) 96

The number of extracted echocardiograms, the processing time per echocardiogram in seconds, and the number of unique extractions for each model are denoted. The number of unique extractions is ideally 9 (1 for each entity type) and is a measure of postprocessing mapping needed.

Overall model performance

In aggregate, Gemma2:9b-instruct had the highest precision, recall, and F1 scores at 0.973 (0.962-0.983), 0.959 (0.947-0.973), and 0.965 (0.951-0.975), respectively. In comparison, Phi3:3.8b-mini-instruct had the lowest precision score at 0.831 (0.804-0.856), while Gemma:7b-instruct had the lowest recall and F1 scores at 0.382 (0.356-0.408) and 0.392 (0.356-0.428), respectively (Figure 1).

Figure 1.

Horizontal bar charts showing precision, recall, and F1 scores for multiple large language models across all extracted echocardiographic entities.

Aggregate model performance across all extracted entities. Median and interquartile ranges for precision, recall, and F1 scores are shown. All models are 4-bit quantized, except the instruct models, which are not quantized. Aggregate performance was computed by calculating scores for each entity and averaging them, giving equal weight to each entity.

Model performance by extracted entity

As Gemma2:9b-instruct had the best aggregate performance (Table SA3), its performance for each extracted entity is shown in Figure 2, and the remainder of the models are available in Figures SA1-SA13. Gemma2:9b-instruct most accurately extracted AR results with an F1 score of 0.990 (0.981-1.000). The model also had high-quality extractions for LVEF, MR, MS, PASP, and RVF with F1 scores of 0.967, 0.980, 0.981, 0.955, and 0.985, respectively. In total, there were 4 echocardiogram reports without a measured LVEF and 171 without a measured PASP. Across all models, LLMs hallucinated an LVEF value in 0 (0%) to 2 (50%) of these cases and hallucinated a PASP value in 3 (1.75%) to 92 (53.8%) of these cases (Tables SA4 and SA5). The top-performing model, Gemma2:9b-instruct, hallucinated an LVEF in 1 (25%) case where none was reported and hallucinated a PASP value in 15 (8.77%) cases where none was reported. The LVEF and PASP values extracted by the model were also highly concordant with manual annotation, with RMSEs of 2.984 and 4.742 and coefficients of determination (R2) of 0.969 and 0.940, respectively (Tables SA4 and SA5). For abnormal entities, Gemma2:9b-instruct maintained high-quality extractions, as measured by F1, for AR, MR, and DF with scores of 0.960, 0.987, and 0.995, respectively. The model provided moderate-quality extractions for MS, RHS, and RVF at 0.835, 0.899, and 0.908, respectively, and poor, inconsistent extractions for AS with an F1 score falling to 0.479 (Table SA6).
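For the continuous entities, the hedged sketch below shows one way to tally hallucinations (a value returned where none was reported), misses, and agreement statistics of the kind cited above; the convention that None marks a missing value, and the function name, are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def continuous_entity_summary(gold: list, pred: list) -> dict:
    """Summarize one continuous entity (eg, LVEF or PASP): hallucination and
    miss counts plus RMSE/R^2 over cases where both values are present."""
    hallucinated = sum(1 for g, p in zip(gold, pred) if g is None and p is not None)
    missed = sum(1 for g, p in zip(gold, pred) if g is not None and p is None)
    pairs = [(g, p) for g, p in zip(gold, pred) if g is not None and p is not None]
    if not pairs:
        return {"hallucinated": hallucinated, "missed": missed, "rmse": None, "r2": None}
    g, p = (np.array(v, dtype=float) for v in zip(*pairs))
    return {"hallucinated": hallucinated,
            "missed": missed,
            "rmse": float(np.sqrt(mean_squared_error(g, p))),
            "r2": float(r2_score(g, p))}
```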

Figure 2.

Grouped bar chart showing precision, recall, and F1 scores for the Gemma2:9b-instruct model across individual echocardiographic entities: AR, AS, DF, LVEF, MR, MS, PASP, RVS, and RVF.

Entity-level model performance for Gemma2:9b-instruct. Median and interquartile ranges for precision, recall, and F1 scores are shown. Abbreviations: AR, aortic valve regurgitation; AS, aortic valve stenosis; DF, diastolic function; LVEF, left ventricular ejection fraction; MR, mitral valve regurgitation; MS, mitral valve stenosis; PASP, pulmonary arterial systolic pressure; RVF, right ventricular systolic function; RVS, right ventricular heart size.

Discussion

Our analysis demonstrates the advantages of using open-source LLMs in information extraction tasks for unstructured free-text electronic record data. We present the clinical extraction abilities of 14 leading, high-performing open-source LLMs on echocardiogram narratives. Using a zero-shot prompt without ICL and no task-specific fine-tuning, Llama3:70b, Gemma2:9b-instruct, and Gemma2:9b performed high-quality extractions approaching gold-standard manual clinician annotation. To the best of our knowledge, this is one of the largest echocardiogram LLM entity extraction studies and among the first to assess the performance of publicly accessible, nonproprietary models. These results are especially significant because open-source LLMs are scalable, easy to deploy, inexpensive, and require no prior knowledge about the structure of reports. While traditional approaches often require high-quality training data, time-intensive manual annotation, and significant financial investment, these LLMs can be prompted and implemented into data extraction workflows within days without significant compromises to accuracy or quality.

Gemma2:9b-instruct and Gemma2:9b stood out in performance and computational efficiency. These models successfully extracted data from the highest numbers of echocardiograms (99.4% [n = 504] and 99.2% [n = 503], respectively) and consistently provided the most accurate extraction for 4 out of 9 clinical entity types, with aggregate F1 scores of 0.965 and 0.960, respectively. They are also equipped with only 9 billion parameters, allowing them to process echocardiograms on modest computer hardware with median processing times of 5.737 and 3.583 s per report, respectively. Their low computational costs are particularly notable when considering the potential for institution-wide deployment of these models. This efficiency, coupled with superior performance compared to other tested models, is likely attributable to their recent release date and reflects the recent rapid advancements in transformer architecture.

Past approaches to echocardiogram entity extraction have predominantly relied on static, rule-based approaches such as regular expression (regex)-driven retrieval and text-mining methods. Many studies primarily seek to extract LVEF, as this numeric measure is one of the most reliable prognostic indicators in patients with cardiovascular disease and greatly influences treatment decisions.17,18 Given the additional cardiac parameters presented in this study, a complete comparative analysis of our proposed methodology may not be feasible. However, comparisons specific to LVEF extraction can provide insights into relative performance differences in methodologies.

A previous publication by Garvin et al. used a task-defined regular expression and string-matching approach and was able to extract LVEF values from 765 echocardiogram reports with precision, recall, and F1 scores of 0.95, 0.889, and 0.919, respectively.19 In a separate analysis, Patterson et al. created an NLP system equipped with a custom-constructed dictionary to extract LVEF values from 100 reports, achieving precision and recall scores of 1.00 and 0.801.20 Both studies demonstrate that automated regular expressions and dictionary-driven NLP can yield high accuracy but lack generalizability and scalability due to their dependence on task-specific and institution-specific development. Regular expressions also require IT expertise and a priori assumptions about the report structure, further limiting this approach. More recently, Szekér et al. proposed a text-mining framework for extracting a broader range of numerical echocardiogram entities. Their method, which leveraged automated dictionary construction and text similarity-based entity matching to identify measurements, achieved precision, recall, and F1 scores of 1.0, 0.901, and 0.948, respectively, for LVEF extraction.21 They demonstrated that this method can produce results with high confidence and improved generalizability relative to regex methods. However, the approach remained restricted to numerical values, could not integrate important contextualizing information, and produced errors when entities appeared in unusual recording formats.

In contrast, our study, as exhibited by Gemma2:9b-instruct’s precision, recall, and F1 scores of 0.974, 0.970, and 0.967 for LVEF, demonstrates that open-source LLMs are competitive with, and sometimes surpass, traditional approaches without institution-specific or entity-specific development. Unlike previous studies that primarily extract numerical values, LLMs facilitate extraction across a large variety of clinical entities, both numeric and categorical. Based on these results, open-source LLMs ultimately provide similar or better performance than other methods published in the literature for extraction from echocardiogram narratives.

Although LLMs had not been evaluated for echocardiogram extraction before the present study, their performance has been measured on other clinical data reports, most commonly in radiology. A recent study by Dorfner et al. compared open-source LLMs and their commercial counterparts in extracting entities from randomly selected chest radiograph reports from the Massachusetts General Hospital.13 The commercial models GPT-3.5 Turbo and GPT-4 were evaluated alongside the open-source models Mistral:7b, Mixtral:8x7b, Llama2:13b, and Llama2:70b. They found that the highest performing open-source model (Llama2:70b) achieved an F1 score of 0.97, while GPT-4 achieved 0.98 in both zero-shot and few-shot prompting, demonstrating that open-source models can serve as viable alternatives to commercial LLMs and outperform traditional specialized NLP approaches in text classification tasks for unstructured radiograph reports. Their conclusions align with our finding that open-source LLMs can successfully conduct structuring tasks and extract data while safeguarding protected health information.

Despite promising results, challenges remain in our study. The poor performance of even the best-performing model (Gemma2:9b-instruct) on some abnormal entities, such as AS, highlights the limitations of current open-source LLMs. The low extraction quality for AS can likely be attributed to the distribution of report findings, with normal entities comprising 89.15% (n = 452) of the echocardiograms analyzed. This limitation is significant because clinical research informed by echocardiograms often relies on accurately identifying and analyzing abnormal measures rather than normal ones, and these are often the minority of echocardiogram findings. Failure to reliably extract information on abnormal findings undermines the utility of these models in clinical and research applications, where the focus is typically on detecting and interpreting deviations from normality.

Furthermore, the absence of ICL in this study, due to limitations in context window size, may have impacted model performance. Open-source LLMs have a smaller context window and are limited in the number of tokens they can contextualize, process, and analyze. This constraint makes it challenging to include both an example echocardiogram and its expected output in the initial prompt without risking a decline in performance from exceeding the token limit. As previous analyses have demonstrated that LLMs benefit greatly from few-shot prompting in many circumstances,10,12,14 future work should explore optimized prompting strategies, including ICL. Although the echocardiograms were derived from 11 distinct hospitals, the institutions belonged to the same health-care network, meaning that the models’ performance in our study may not be transferable to other echocardiogram samples. Additionally, while overall model performance was high, our analysis revealed that LLMs occasionally failed to follow instructions or hallucinated values not present in the original reports. For instance, the top-performing model, Gemma2:9b-instruct, hallucinated an LVEF in 1 of 4 cases where it was not reported and hallucinated a PASP value in 15 of 171 null entries (Tables SA4 and SA5). Although these hallucinations occurred less commonly in the best-performing models, they represent a persistent risk that must be considered in clinical applications. It is likely that the need for validation and postprocessing of outputs will always be present in LLM implementation for streamlining information extraction workflows, both due to the stochastic nature of transformer architectures and these occasional instruction-following failures. However, as newer model iterations continue to demonstrate improved instruction adherence and reduced hallucination rates, these limitations may diminish over time.

Conclusion

Using LLMs for entity extraction for echocardiogram reports has the potential to support both secondary clinical research and health-care delivery. Our work demonstrates the feasibility of using open-source models for more efficient computation and extraction. Future studies should evaluate the performance of these LLMs across a more diverse and representative echocardiogram report set, encompassing multiple health-care systems and a wider variety of clinician inputs.


Acknowledgments

The authors have no acknowledgments.

Contributor Information

Jonathan Chi, Goergen Institute for Data Science and Artificial Intelligence, University of Rochester, Rochester, NY 14627, United States.

Yazan Rouphail, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Ethan Hillis, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Ningning Ma, Division of Hospital Medicine, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, United States.

An Nguyen, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, United States.

Jane Wang, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, United States.

Mackenzie Hofford, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Aditi Gupta, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Patrick G Lyons, Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, Oregon Health & Science University, Portland, OR 97239, United States.

Adam Wilcox, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Albert M Lai, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Philip R O Payne, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.

Marin H Kollef, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, United States.

Caitlin Dreisbach, Goergen Institute for Data Science and Artificial Intelligence, University of Rochester, Rochester, NY 14627, United States; School of Nursing, University of Rochester, Rochester, NY 14627, United States.

Andrew P Michelson, Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63110, United States.

Author contributions

Jonathan Chi (Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing—original draft, Writing—review & editing), Yazan Rouphail (Conceptualization, Formal analysis, Investigation, Methodology), Ethan Hillis (Conceptualization, Formal analysis, Investigation, Methodology), Ningning Ma (Conceptualization, Formal analysis, Investigation, Methodology), An Nguyen (Conceptualization, Formal analysis, Investigation, Methodology), Jane Wang (Conceptualization, Formal analysis, Investigation, Methodology), Mackenzie Hofford (Conceptualization, Formal analysis, Investigation, Methodology), Aditi Gupta (Conceptualization, Formal analysis, Investigation, Methodology), Patrick Lyons (Conceptualization, Formal analysis, Investigation, Methodology), Adam B. Wilcox (Conceptualization, Formal analysis, Investigation, Methodology), Albert Max Lai (Conceptualization, Formal analysis, Investigation, Methodology), Philip Richard Orrin Payne (Conceptualization, Formal analysis, Investigation, Methodology), Marin H. Kollef (Conceptualization, Formal analysis, Investigation, Methodology), Caitlin Dreisbach (Writing—original draft, Writing—review & editing), and Andrew Michelson (Conceptualization, Investigation, Methodology, Software, Supervision, Writing—original draft, Writing—review & editing)

Supplementary material

Supplementary material is available at JAMIA Open online.

Funding

Funding for this study was provided in part by the National Institutes of Health (NIH) National Library of Medicine under award number R25LM014224 and by the Washington University Institute of Clinical and Translational Sciences grant UL1TR002345 from the National Center for Advancing Translational Sciences of the NIH.

Conflicts of interest

The authors declare that they have no competing interests in the research.

Data availability

The data underlying this current study cannot be made publicly available due to the protected nature of the health information contained in the dataset. However, data may be made available upon completion of appropriate regulatory approval processes.

Ethics statement

This project received approval from the Institutional Review Board at Washington University in St Louis (IRB #201804121) with a waiver of informed consent.

References

1. Sedlakova J, Daniore P, Wintsch AH, et al.; University of Zurich Digital Society Initiative (UZH-DSI) Health Community. Challenges and best practices for digital unstructured data enrichment in health research: a systematic narrative review. PLOS Digit Health. 2023;2:e0000347. doi: 10.1371/journal.pdig.0000347
2. Adnan K, Akbar R, Khor SW, Ali ABA. Role and challenges of unstructured big data in healthcare. In: Sharma N, Chakrabarti A, Balas VE, eds. Data Management, Analytics and Innovation. Springer; 2020:301-323.
3. Cowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2017;106:1-9. doi: 10.1007/s00392-016-1025-6
4. Kim E, Rubinstein SM, Nead KT, Wojcieszynski AP, Gabriel PE, Warner JL. The evolving use of electronic health records (EHR) for research. Semin Radiat Oncol. 2019;29:354-361. doi: 10.1016/j.semradonc.2019.05.010
5. Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;25:1419-1428. doi: 10.1093/jamia/ocy068
6. Hassanpour S, Langlotz CP. Information extraction from multi-institutional radiology reports. Artif Intell Med. 2016;66:29-39. doi: 10.1016/j.artmed.2015.09.007
7. Malmasi S, Hosomura N, Chang LS, Brown CJ, Skentzos S, Turchin A. Extracting healthcare quality information from unstructured data. AMIA Annu Symp Proc. 2018;2017:1243-1252.
8. Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14-29. doi: 10.1016/j.jbi.2017.07.012
9. Li I, Pan J, Goldwasser J, et al. Neural natural language processing for unstructured data in electronic health records: a review. Comput Sci Rev. 2022;46:100511. doi: 10.1016/j.cosrev.2022.100511
10. Huang J, Yang DM, Rong R, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med. 2024;7:106-113. doi: 10.1038/s41746-024-01079-8
11. Hu D, Liu B, Zhu X, Lu X, Wu N. Zero-shot information extraction from radiological reports using ChatGPT. Int J Med Inform. 2024;183:105321. doi: 10.1016/j.ijmedinf.2023.105321
12. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Goldberg Y, Kozareva Z, Zhang Y, eds. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2022:1998-2022.
13. Dorfner FJ, Jürgensen L, Donle L, et al. Comparing commercial and open-source large language models for labeling chest radiograph reports. Radiology. 2024;313:e241139. doi: 10.1148/radiol.241139
14. Adams LC, Truhn D, Busch F, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology. 2023;307:e230725. doi: 10.1148/radiol.230725
15. Hugging Face community. Open LLM Leaderboard—a Hugging Face Space. Accessed February 25, 2025. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
16. Kluyver T, Ragan-Kelley B, Pérez F, et al. Jupyter Notebooks—a publishing format for reproducible computational workflows. In: Loizides F, Schmidt B, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press; 2016:87-90.
17. Kim Y, Garvin JH, Goldstein MK, et al. Extraction of left ventricular ejection fraction information from various types of clinical reports. J Biomed Inform. 2017;67:42-48. doi: 10.1016/j.jbi.2017.01.017
18. Xie F, Zheng C, Yuh-Jer Shen A, Chen W. Extracting and analyzing ejection fraction values from electronic echocardiography reports in a large health maintenance organization. Health Informatics J. 2017;23:319-328. doi: 10.1177/1460458216651917
19. Garvin JH, DuVall SL, South BR, et al. Automated extraction of ejection fraction for quality measurement using regular expressions in unstructured information management architecture (UIMA) for heart failure. J Am Med Inform Assoc. 2012;19:859-866. doi: 10.1136/amiajnl-2011-000535
20. Patterson OV, Freiberg MS, Skanderson M, Fodeh SJ, Brandt CA, DuVall SL. Unlocking echocardiogram measurements for heart disease research through natural language processing. BMC Cardiovasc Disord. 2017;17:151. doi: 10.1186/s12872-017-0580-8
21. Szekér S, Fogarassy G, Vathy-Fogarassy Á. A general text mining method to extract echocardiography measurement results from echocardiography documents. Artif Intell Med. 2023;143:102584. doi: 10.1016/j.artmed.2023.102584
