Abstract
Computational phenotyping is a central informatics activity with resulting cohorts supporting a wide variety of applications. However, it is time-intensive because of manual data review and limited automation. Since large language models (LLMs) have demonstrated promising capabilities for text classification, comprehension, and generation, we posit they will perform well at repetitive manual review tasks traditionally performed by human experts. To support next-generation computational phenotyping, we developed SHREC, a framework for integrating LLMs into end-to-end phenotyping pipelines. We applied and tested three lightweight LLMs (Gemma2 27 billion, Mistral Small 24 billion, and Phi-4 14 billion) on classifying concepts and phenotyping patients using phenotypes for Acute Respiratory Failure (ARF) respiratory support therapies. All models performed well on concept classification, with the best model (Mistral) achieving an AUROC of 0.896. For phenotyping, the models demonstrated near-perfect specificity for nearly all phenotypes, and the top-performing model (Mistral) achieved an average AUROC of 0.853 for no-therapy and single-therapy phenotypes. In conclusion, lightweight LLMs can assist researchers with resource-intensive phenotyping tasks. Advantages of LLMs include their ability to adapt to new tasks with prompt engineering alone and their ability to incorporate raw EHR data. Future steps include determining optimal strategies for integrating biomedical data and understanding reasoning errors.
1.2. Introduction
Computational, or electronic, phenotyping is a central informatics activity focused on defining, extracting, and validating meaningful clinical representations of digital data from electronic health records (EHRs) and other relevant information systems.[2, 1] It is particularly fundamental to observational studies, large-scale pragmatic clinical trials, and healthcare quality improvement initiatives, where standardized, computable phenotypes allow for robust cohort discovery and monitoring of real-world outcomes.[3] Computable phenotypes have been developed for a wide variety of clinical outcomes and conditions, including acute conditions such as acute kidney injury,[4] Acute Respiratory Distress Syndrome,[5] and acute brain dysfunction in pediatric sepsis,[6] and chronic conditions such as breast cancer,[7] hypertension,[8] and Post-Acute Sequelae of SARS-CoV-2 infection (PASC).[9] They have also supported a variety of downstream tasks, including recruitment for clinical trials, development of clinical decision support systems, and hospital quality reporting.[3, 11, 10]
The process of developing computable phenotypes typically includes identification and construction of relevant data elements for classification and then application of an algorithm to produce the cohort(s) of interest.[12] Traditionally, these processes involve multiple time- and resource-intensive tasks requiring manual data review, such as mapping of data elements to controlled vocabularies.[11] Despite increased adoption of controlled vocabularies in EHR systems and improvements in Natural Language Processing (NLP) and machine learning methods, computational phenotyping remains complicated and costly.[11, 10] As a result, many of the desiderata for phenotyping identified over a decade ago are still relevant today, indicating the need for substantial improvements to these methods.[1, 13] To illustrate these issues, we highlight challenges in the development of computable phenotypes for PASC.[9] Since the phenotype definition was based on symptom presence, manual expert review of 6,569 concepts was first required to determine which were relevant to the 151 symptoms of interest. A series of data transformations were then applied to assess symptom presence relative to SARS-CoV-2 infection. Any new dataset, especially one not mapped to a controlled vocabulary, would require further manual review of concepts, rework of the algorithm, or both.
Given the existing opportunities with computational phenotyping and the minimal overall progress towards methodological improvements, it is natural to consider what will drive the next significant enhancement, or “next-generation,” of phenotyping methods. In particular, with advances in machine learning and artificial intelligence, we also reconsider how much of the computational phenotyping process requires direct human involvement. The idea of human-machine synergy, with each component enhancing the abilities of the other, is fundamental to the field of informatics.[14] However, this synergy has yet to be achieved in computational phenotyping since humans still perform a majority of the phenotyping tasks, including ones where machines may excel. Therefore, we propose exploring the potential of Large Language Models (LLMs) for this domain. As a relatively new addition to biomedical research, LLMs introduce a novel set of text analysis, comprehension, and generation capabilities that allow them to analyze and generate text in ways that were previously possible only for humans or for extensively trained, topic-specific NLP models.[15] Additionally, since LLMs are widely available as pretrained foundation models, they can be adapted to new tasks through prompt engineering alone, which is a more accessible and portable method of model adaptation than model retraining or domain adaptation.[16] Furthermore, biomedically fine-tuned models have underperformed general-purpose LLMs on clinical tasks, indicating that model retraining (a costly and time-intensive process) is not necessarily the preferred approach for LLM adaptation.[17] Thus, the capabilities and advantages of LLMs address many of the current deficits in computational phenotyping methods, suggesting their potential as foundational tools for next-generation phenotyping.
While some studies have applied LLMs to various clinical phenotyping tasks, none have explored the capability of LLMs to improve computational phenotyping specifically. The clinical phenotyping tasks studied include entity extraction and matching in clinical text,[18, 19] query generation for patient extraction,[20] evaluation of hospital quality measures,[21] and creation of phenotype definitions from standardized vocabulary codes.[10] When used to develop queries for identifying patients with type 2 diabetes mellitus, dementia, and hypothyroidism, GPT-4 produced queries that still required substantial oversight from human reviewers to generate accurate cohorts.[20] Additionally, when LLMs were used to generate computable phenotypes based on standardized vocabularies, GPT-4 only achieved an average accuracy of approximately 50% on both code matching and string matching when compared to the original definition.[10] However, SOLAR 10.7B only slightly underperformed human categorizations for hospital quality measures and even provided a better response than human review in 4 out of 10 cases where responses between humans and LLMs differed.[21] Additionally, GPT-4o demonstrated perfect accuracy for classification of antibiotics from raw EHR data.[22] Therefore, while LLMs struggle with query and algorithm generation, even lightweight models have demonstrated the ability to categorize relevant clinical concepts from EHR data, further indicating potential application of LLMs for development of computable phenotypes.
Considering the opportunities in computational phenotyping methods and the novel capabilities of LLMs, we applied and evaluated LLMs to support computable phenotype development. We previously developed PHEONA (Evaluation of PHEnotyping for Observational Health Data), a framework specifically for evaluating LLMs for computational phenotyping tasks.[23] In this study, we expanded upon these methods to construct a broader view of next-generation phenotyping. The objectives of this study were thus the following:
1. Develop SHREC (SHifting to language model-based REal-world Computational phenotyping), a companion framework to PHEONA that outlines how to integrate LLMs into computational phenotyping.
2. Apply and demonstrate SHREC using previously developed computable phenotypes for Acute Respiratory Failure (ARF) respiratory support therapies.
3. Highlight future work and next steps to encourage progress towards next-generation phenotyping methods.
1.3. Materials and Methods
We first outline the development of SHREC along with its individual components (Figure 1 and Table 1) and then we discuss how we used LLMs to perform various tasks for a specific phenotyping use case.
Figure 1:
An overview of SHREC (SHifting to language model-based REal-world Computational phenotyping), including both an overview of the individual phenotyping tasks and a representation of the progress required to advance next-generation computational phenotyping.
Table 1:
Overview of the individual end-to-end computational phenotyping tasks for SHREC (SHifting to language model-based REal-world Computational phenotyping), a framework for next-generation computational phenotyping with Large Language Model (LLM)-based methods. Some tasks were based on a previously developed framework for phenotype development using machine learning algorithms.[12]
| Step | Name | Description | Analog to Previous Framework |
|---|---|---|---|
| 1 | Assess Fitness-for-Purpose | Determine the clinical outcome of interest, assess clinical significance, and assess any sources of clinical or data complexity. | Assess Fitness-for-Purpose |
| 2 | Identify Data Sources | Identify data source(s) to use for phenotype development and evaluation. | Assess Fitness-for-Purpose |
| 3 | Gather Ground Truth Data | Determine ground truth labels to use for validation of the phenotyping algorithm. | Create Gold Standard Data |
| 4 | Outline Phenotyping Heuristics | Determine the tasks necessary in the phenotyping process to obtain the resulting phenotypes from the input data. Will likely include the inclusion and exclusion criteria. | None |
| 5 | LLM Model Selection | If used, determine which LLMs can be tested for specific tasks and how these models can be evaluated.[23] For studies not using LLMs, can either skip or identify other machine learning or Natural Language Processing (NLP) models. | Develop Models |
| 6 | Concept Selection | Classify or identify relevant data elements from the electronic health record (EHR) data. | Engineer Features |
| 7 | Apply Algorithm | Apply the algorithm tasks to each record and identify the appropriate phenotype. | None |
| 8 | Evaluate Algorithm | Determine the effectiveness of the phenotyping algorithm against the ground truth data. | Evaluate Models |
1.3.1. Development of SHREC
1.3.1.1. Theoretical Foundation
To understand issues with computational phenotyping, we revisited the Fundamental Theorem of Informatics, which states that optimized human-machine interactions should drive informatics methods,[14] and distributed cognition, which describes how overall cognitive load is shared between internal and external agents.[24, 25] In traditional phenotyping, humans generally could not delegate tasks to external agents without costly, time-consuming, or even impossible modifications.[15, 11, 10] Therefore, humans were responsible for both repetitive and complex tasks despite not being as inherently well-suited for repetitive work as machines. If LLMs are indeed capable of performing repetitive tasks well, these tasks can be offloaded to external LLM-based agents while humans remain responsible only for complex ones, reducing the overall cognitive burden on researchers and improving the efficiency of the phenotyping process.
1.3.1.2. Framework Components
Using this foundation, we constructed SHREC to include both an end-to-end phenotyping pipeline and a broader vision for next-generation computational phenotyping. To develop our end-to-end pipeline, we extended an existing framework originally developed for machine learning–based cohort discovery to more broadly capture computational phenotyping tasks.[12] Specifically, we added tasks for developing (Outline Phenotyping Heuristics) and implementing (Apply Algorithm) the phenotyping algorithm since they were previously implicit in development of the machine learning model. The end-to-end pipeline is detailed in Table 1. Meanwhile, the broader overview indicates overall progress towards optimal human-machine synergy in next-generation phenotyping (Figure 1). In this study, we only conducted a feasibility assessment of LLMs for specific phenotyping tasks.
1.3.2. Application of SHREC to Phenotyping Use Case
1.3.2.1. Phenotyping Use Case
We leveraged computable phenotypes for Acute Respiratory Failure (ARF) respiratory support therapies to demonstrate LLM-based methods for phenotyping tasks. Encounters were phenotyped based on the type and order of respiratory therapies received during individual Intensive Care Unit (ICU) encounters.[26] The phenotypes were 1) Invasive Mechanical Ventilation (IMV) only; 2) Noninvasive Positive Pressure Ventilation (NIPPV) only; 3) High-Flow Nasal Insufflation (HFNI) only; 4) NIPPV Failure (or NIPPV to IMV); 5) HFNI Failure (or HFNI to IMV); 6) IMV to NIPPV; and 7) IMV to HFNI.
1.3.2.2. Identification of Phenotyping Tasks
A comparison of the methods performed in the original study and this study is presented in Figure 2. In the original study, data from the eICU Collaborative Research Database (eICU-CRD)[27] were manually reviewed first to determine relevance to the therapies or medications of interest and second to produce a phenotyping algorithm.[26] These processes mapped to the Concept Selection and Apply Algorithm tasks of the end-to-end phenotyping pipeline within SHREC, respectively. In this study, we used LLMs to perform the Concept Selection and Apply Algorithm tasks, and we also implemented Outline Phenotyping Heuristics, LLM Model Selection, and Evaluate Algorithm because they were required to execute and test the LLM-based methods. Since the remaining tasks were not related to the LLM methods, they were not included in this study but have been discussed in depth previously.[12, 26]
Figure 2:
Comparison of traditional and next-generation methods for constructing phenotypes for Acute Respiratory Failure (ARF) respiratory support therapies, using the end-to-end phenotyping pipeline from SHREC (SHifting to language model-based REal-world Computational phenotyping). The traditional methods were used for initial phenotype development[26] while the next-generation methods were implemented in this study.
1.3.3. Implementation of LLMs for Phenotyping Tasks
The following sections detail the methods for each of the implemented phenotyping tasks from SHREC.
1.3.3.1. Outline Phenotyping Heuristics
We determined the following heuristics from the previously developed algorithm:[26]
1. Identified the first encounter for each unique patient and removed additional encounters to ensure a single encounter per patient.
2. Removed individuals less than 18 years old at the start of the encounter.
3. Extracted concepts from all distinct EHR records across all encounters and determined which were relevant to the respiratory support therapies (IMV, NIPPV, or HFNI) or medications (see Step 5a) of interest. For example, “BiPAP/CPAP” indicates NIPPV and “Hi Flow NC” indicates HFNI.
4. Filtered records for each encounter to only the extracted concepts for the respiratory support therapies and medications of interest.
5. Identified which of the respiratory support therapies were received during the encounter, using the following criteria:
   - a. IMV: at least two records indicating IMV and at least one record indicating use of a specific medication related to pre-, intra-, or post-intubation care (e.g., rapid sequence intubation medications, neuromuscular blocking agents, or continuous sedative agents).
   - b. NIPPV: at least two records indicating use of NIPPV AND no records indicating use of HFNI.
   - c. HFNI: the criteria for NIPPV are met AND there is at least one additional record indicating use of HFNI.
6. Determined the start and end of each treatment based on the offset time from ICU admission. When applicable, removed NIPPV and HFNI records that occurred between IMV records from consideration and reassessed whether the criteria for NIPPV or HFNI were still met.
7. Classified each encounter into one of the following eight phenotypes based on treatment criteria and ordering (a code sketch of this final step follows the list): 1) IMV only; 2) NIPPV only; 3) HFNI only; 4) NIPPV Failure (or NIPPV to IMV); 5) HFNI Failure (or HFNI to IMV); 6) IMV to NIPPV; 7) IMV to HFNI; and 8) No Therapies Received.
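To make the final classification step concrete, the following is a minimal Python sketch of mapping the therapies received during an encounter, in order of first occurrence, to one of the eight phenotypes. The function name and input structure are illustrative rather than the original implementation, and the record-count and medication criteria from Steps 5 and 6 are assumed to have been applied beforehand.

```python
def assign_phenotype(therapies_in_order: list[str]) -> str:
    """Map the ordered therapies received during an encounter (e.g., ["NIPPV", "IMV"])
    to one of the eight phenotypes. Illustrative sketch only; the published algorithm
    applies additional record-count and medication criteria before this step."""
    received = list(dict.fromkeys(therapies_in_order))  # unique therapies, order-preserving
    if not received:
        return "No Therapies Received"
    if received == ["IMV"]:
        return "IMV only"
    if received == ["NIPPV"]:
        return "NIPPV only"
    if received == ["HFNI"]:
        return "HFNI only"
    if received[:2] == ["NIPPV", "IMV"]:
        return "NIPPV Failure"
    if received[:2] == ["HFNI", "IMV"]:
        return "HFNI Failure"
    if received[:2] == ["IMV", "NIPPV"]:
        return "IMV to NIPPV"
    if received[:2] == ["IMV", "HFNI"]:
        return "IMV to HFNI"
    return "No Therapies Received"
```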
1.3.3.2. LLM Model Selection
To promote reproducibility and adaptability of our methods, we selected LLMs available at the time of this study from Ollama, an open-source package that serves open-source models locally.[28] Due to graphics processing unit (GPU) constraints, we selected the following lightweight, instruction-tuned models for testing: Mistral Small 24 billion with Q8_0 quantization (model tag: 20ffe5db0161), Phi-4 14 billion with Q8_0 quantization (model tag: 310d366232f4), and Gemma2 27 billion with Q8_0 quantization (model tag: dab5dca674db).[28] DeepSeek-R1 32 billion with Q4_K_M quantization (model tag: 38056bbcbb2d) was previously tested on sampled data but was not used in this study due to high response latencies.[28, 23] Models were run on a single Nvidia V100 32GB GPU. Temperature and top-p were set to 0.0 and 0.99, respectively, for all experiments to avoid responses outside the requested format.
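As an illustration of this decoding configuration, the snippet below shows how a single prompt might be sent to a locally served model through the ollama Python client. The model tag and prompt text are placeholders, not the exact values used in this study.

```python
import ollama  # Python client for a locally running Ollama server

prompt_text = "..."  # placeholder for a Concept Selection or Apply Algorithm prompt

response = ollama.chat(
    model="mistral-small:24b",  # placeholder tag; substitute the locally pulled model
    messages=[{"role": "user", "content": prompt_text}],
    options={"temperature": 0.0, "top_p": 0.99},  # settings used to constrain response format
)
answer = response["message"]["content"]
```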
1.3.3.3. Concept Selection
For Concept Selection, we generated constructed concepts from eICU-CRD tables. Tables that did not include timestamped data but contained information on respiratory therapies or medications (such as the apacheApsVar table), or that were unlikely to contain descriptions of respiratory therapies or medications (such as the vitalPeriodic table), were not processed further. Nine tables were used to construct input concepts (Table 2). In early testing, we achieved the best results when classifying the respiratory therapies separately from the relevant medications. We therefore developed two prompts, resulting in two LLM responses per constructed concept (Supplementary Material). Concept definitions within the prompts were produced by two clinician experts who summarized and consolidated notes related to the terms of interest (Supplementary Material). We used Chain-of-Thought (CoT) prompting by including a series of questions and answers to help the model determine the relevancy of each constructed concept. CoT is a prompt engineering technique that has improved the performance of LLMs by using a series of reasoning steps to guide the model to the final response.[29, 30] The answer to the final question for each prompt was parsed using string methods to obtain a final response of “YES” or “NO” for whether the constructed concept was relevant. Since there were two prompts, the final response was “YES” if at least one of the responses was “YES” and “NO” otherwise.
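A minimal sketch of the response-parsing and combination logic is shown below, assuming each Chain-of-Thought response ends with a final “YES” or “NO” answer; the function names are illustrative rather than the original code.

```python
import re

def parse_final_answer(response_text: str) -> str:
    """Return 'YES' or 'NO' from the final answer of a Chain-of-Thought response."""
    matches = re.findall(r"\b(YES|NO)\b", response_text.upper())
    return matches[-1] if matches else "NO"  # default to not relevant if unparsable

def combine_responses(therapy_response: str, medication_response: str) -> str:
    """A concept is relevant if either the therapy or the medication prompt answers YES."""
    answers = {parse_final_answer(therapy_response), parse_final_answer(medication_response)}
    return "YES" if "YES" in answers else "NO"
```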
Table 2:
Constructed concept pattern for each selected table in the eICU Collaborative Research Database (eICU-CRD).[27] Italicized text was replaced with values from the relevant column in each table, as demonstrated in the example for each table.
| Table Name | Constructed Concept Pattern | Example Constructed Concept |
|---|---|---|
| Care Plan General | Source = Care Plan General; Concept = *cplgroup*: *cplitemvalue* | Source = Care Plan General; Concept = Route-Status: Oral - low sodium |
| Infusion Drug | Source = Infusion Drug; Concept = *drugname* | Source = Infusion Drug; Concept = Amiodarone (mg/min) |
| Medication | Source = Medication; Concept = *drugname* | Source = Medication; Concept = LOPRESSOR |
| Note | Source = Note; Concept = *notevalue*: *notetext* | Source = Note; Concept = denies fevers: denies fevers |
| Nurse Care | Source = Nurse Care; Concept = *cellattributevalue* | Source = Nurse Care; Concept = emergency equipment at bedside |
| Nurse Charting | Source = Nurse Charting; Concept = *nursingchartcelltypevalname*: *nursingchartvalue* | Source = Nurse Charting; Concept = O2 Admin Device: BiPAP/CPAP |
| Respiratory Care | Source = Respiratory Care; Concept = *airwaytype* | Source = Respiratory Care; Concept = Oral ETT |
| Respiratory Charting | Source = Respiratory Charting; Concept = *respcharttypecat*: *respchartvaluelabel*: *respchartvalue* | Source = Respiratory Charting; Concept = respFlowPtVentData: SaO2: 25 |
| Treatment | Source = Treatment; Concept = *treatmentstring* | Source = Treatment; Concept = cardiovascular—myocardial ischemia / infarction—antiplatelet agent—aspirin |
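As a worked example of the patterns in Table 2, a row from a source table could be converted into a constructed concept string with a small helper like the one below. The helper function is hypothetical; the column values follow the eICU-CRD schema.

```python
def construct_concept(source: str, *values: str) -> str:
    """Build a constructed concept string following the patterns in Table 2."""
    return f"Source = {source}; Concept = " + ": ".join(str(v) for v in values)

# Example row from the nurseCharting table
concept = construct_concept("Nurse Charting", "O2 Admin Device", "BiPAP/CPAP")
# -> "Source = Nurse Charting; Concept = O2 Admin Device: BiPAP/CPAP"
```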
1.3.3.4. Apply Algorithm
After we identified the relevant constructed concepts across the eICU-CRD dataset, we applied the phenotyping heuristics. We identified the first encounter for each unique patient and then removed individuals who were less than 18 years old at the start of the encounter. For each encounter, we filtered data from the original 9 tables to the selected constructed concepts and then ordered each distinct constructed concept by its first occurrence based on the encounter admission time. The data, or constructed descriptions, for phenotyping were created by inserting each individual constructed concept into a string template and concatenating all of the unique constructed concepts together for the prompt. The template was “#: {constructed concept}” where “#” was the order of the constructed concept based on its first occurrence in the encounter records. Since timestamps were generalized to concept order and there was no encounter-specific information in the constructed descriptions, we phenotyped the unique constructed descriptions and then mapped the phenotypes to the relevant encounters for evaluation. Our phenotyping prompt used CoT (Supplementary Material). We parsed the answer to the final question to identify the selected phenotype.
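A minimal sketch of this step is shown below, assuming the selected constructed concepts for each encounter have already been ordered by first occurrence; all names are illustrative.

```python
def build_constructed_description(ordered_concepts: list[str]) -> str:
    """Concatenate the unique constructed concepts for an encounter using the
    '#: {constructed concept}' template, where # is the order of first occurrence."""
    unique_concepts = list(dict.fromkeys(ordered_concepts))  # de-duplicate, keep order
    return "\n".join(f"{i}: {c}" for i, c in enumerate(unique_concepts, start=1))

# Identical descriptions are phenotyped once, then mapped back to their encounters.
encounter_concepts = {
    "enc_001": ["Source = Nurse Charting; Concept = O2 Admin Device: BiPAP/CPAP"],
}  # illustrative input: encounter id -> concepts ordered by first occurrence
description_to_encounters: dict[str, list[str]] = {}
for encounter_id, concepts in encounter_concepts.items():
    description = build_constructed_description(concepts)
    description_to_encounters.setdefault(description, []).append(encounter_id)
```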
1.3.3.5. Evaluate Algorithm
Models were evaluated for both Concept Selection and Apply Algorithm using components of PHEONA, an evaluation framework for LLM-based approaches to computational phenotyping.[23] Previously, we evaluated the models for Concept Selection using a random sample of constructed concepts.[23] In this study, we evaluated these models on the full set of constructed concepts using Accuracy (the ability of the model to produce accurate results) as the primary evaluation criterion and Model Response Latency (how quickly model results were returned) as the secondary evaluation criterion from PHEONA. We used the area under the receiver operating characteristic curve (AUROC) to measure response accuracy against the concept ground truths. We measured response latency as the number of seconds required for the model to return a response for each constructed concept and then averaged these values for each prompt. For Apply Algorithm, since we had not previously used PHEONA for model evaluation, we evaluated model performance on a randomly selected subsample (Supplementary Material) and then evaluated the best-performing models on all encounters. We again used the Accuracy and Model Response Latency criteria to assess model performance, with AUROC (and additionally sensitivity and specificity) calculated using the original encounter ground truths.[26]
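For each phenotype, these metrics can be computed in a one-vs-rest fashion from binary ground truth labels and binary model predictions; the sketch below uses scikit-learn and is not the original evaluation code.

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate_phenotype(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """One-vs-rest evaluation of a single phenotype.
    y_true/y_pred are 0/1 lists: 1 if the encounter has (or is assigned) the phenotype."""
    auroc = roc_auc_score(y_true, y_pred)  # with binary predictions, equals (sens + spec) / 2
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return auroc, sensitivity, specificity
```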
1.4. Results
1.4.1. Phenotyping Use Case
There were initially 200,859 encounters across 166,355 patients. After applying the inclusion and exclusion criteria, there were 159,701 encounters for 159,701 patients. Using the previously developed phenotyping algorithm, the encounters were phenotyped as follows: 16,736 (10.5%) as IMV only; 6,833 (4.3%) as NIPPV only; 1,089 (0.7%) as HFNI only; 1,466 (0.9%) as NIPPV Failure; 568 (0.4%) as HFNI Failure; 601 (0.4%) as IMV to NIPPV; 186 (0.1%) as IMV to HFNI; and 132,222 (82.8%) as None.[26]
1.4.2. Implementation of LLMs for Phenotyping Tasks
1.4.2.1. Concept Selection
There were 572 concept ground truths (404 ARF respiratory support therapy concepts and 168 medication concepts) from the original phenotyping study.[26] Classification results based on the concept ground truths are presented in Table 3. Mistral had the highest accuracy, with an AUROC of 0.896 for classification of all concepts; however, it also had the highest total average latency (26.2 seconds, compared with 22.0 seconds or less for the other models). All models performed better at medication classification (AUROC of 0.997 or higher) than at respiratory support therapy classification (AUROC of 0.765 or higher).
Table 3:
Results of Concept Selection using the Gemma2 27 billion, Mistral Small 24 billion, and Phi-4 14 billion Large Language Models (LLMs). The number of concepts selected, area under the receiver operating characteristic curve (AUROC), and average latency (measured in seconds) were measured for both the Acute Respiratory Failure (ARF) respiratory support therapies and medications prompts.
| Model | Total N | Total AUROC | Total Latency (s) | ARF Therapy N | ARF Therapy AUROC | ARF Therapy Latency (s) | Medication N | Medication AUROC | Medication Latency (s) |
|---|---|---|---|---|---|---|---|---|---|
| Gemma | 30,062 | 0.792 | 22.0 | 29,394 | 0.783 | 17.7 | 674 | 0.997 | 4.4 |
| Mistral | 7,143 | 0.896 | 26.2 | 6,754 | 0.872 | 16.6 | 389 | 0.996 | 9.6 |
| Phi | 13,829 | 0.809 | 19.0 | 13,397 | 0.765 | 12.2 | 433 | 0.998 | 6.8 |
AUROC: Area under the receiver operating characteristic curve; N: number of concepts selected.
1.4.2.2. Apply Algorithm
There were 97,583 unique constructed descriptions for Gemma, 62,499 for Mistral, and 65,581 for Phi. However, since Gemma underperformed on the subsample of constructed descriptions (Supplementary Material), only Mistral and Phi were tested on the entire dataset. Across all constructed descriptions, Mistral had an average response latency of 27.3 seconds and Phi of 20.8 seconds. The AUROC, sensitivity, and specificity for each phenotype are presented in Table 4. Overall, Mistral performed better than Phi, with a higher AUROC on all phenotypes except IMV to NIPPV. Mistral performed best on the no-therapy and single-therapy phenotypes (None and IMV, NIPPV, and HFNI Only), with an average AUROC of 0.853, while it achieved an average AUROC of only 0.604 on the remaining, multi-therapy phenotypes. Both models also had nearly perfect specificity for all phenotypes except IMV Only.
Table 4:
Results of Apply Algorithm using the Mistral Small 24 billion and Phi-4 14 billion Large Language Models (LLMs). The number of phenotyped encounters, area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity were measured for both models across 159,701 encounters.
| Phenotype | Ground Truth N | Mistral N | Mistral AUROC | Mistral Sens. | Mistral Spec. | Phi N | Phi AUROC | Phi Sens. | Phi Spec. |
|---|---|---|---|---|---|---|---|---|---|
| IMV Only | 16,736 | 33,537 | 0.881 | 0.898 | 0.864 | 36,343 | 0.863 | 0.936 | 0.791 |
| NIPPV Only | 6,833 | 7,060 | 0.809 | 0.637 | 0.981 | 7,158 | 0.758 | 0.547 | 0.969 |
| HFNI Only | 1,089 | 2,325 | 0.825 | 0.661 | 0.989 | 843 | 0.543 | 0.093 | 0.994 |
| NIPPV Failure | 1,466 | 2,063 | 0.717 | 0.443 | 0.991 | 1,463 | 0.669 | 0.347 | 0.992 |
| HFNI Failure | 568 | 377 | 0.513 | 0.028 | 0.998 | 208 | 0.503 | 0.007 | 0.998 |
| IMV to NIPPV | 601 | 1,725 | 0.526 | 0.063 | 0.989 | 1,536 | 0.565 | 0.143 | 0.987 |
| IMV to HFNI | 186 | 1,545 | 0.659 | 0.328 | 0.990 | 1,026 | 0.539 | 0.086 | 0.991 |
| None | 132,222 | 103,990 | 0.896 | 0.824 | 0.968 | 66,952 | 0.845 | 0.744 | 0.947 |
AUROC: Area under the receiver operating characteristic curve.
Sens.: Sensitivity.
Spec.: Specificity.
IMV: Invasive Mechanical Ventilation.
NIPPV: Noninvasive Positive Pressure Ventilation.
HFNI: High-Flow Nasal Insufflation.
1.5. Discussion
In this study, we introduced SHREC, a framework for applying LLM-based methods to computational phenotyping. We outlined the components of SHREC and demonstrated how LLMs can be used for computational phenotyping tasks.
1.5.1. Development and Application of SHREC
The primary contribution of this study was SHREC and its components. For the first component, the end-to-end phenotyping pipeline, we expanded upon a previously developed framework and generalized it to apply to all computable phenotypes, not just those involving machine learning or LLMs.[12] For the second component, we outlined a novel vision of LLM-based next-generation computational phenotyping. In its future state, we envision all repetitive tasks (including manual data review) being offloaded to LLM agents[31] while humans guide complex tasks and provide deliberate oversight, thereby best satisfying the Fundamental Theorem of Informatics.[14] Towards this end, we noted several key properties of LLM-based methods that would support widespread integration into the end-to-end pipeline. First, prompt engineering alone was sufficient for adapting the models to both Concept Selection and Apply Algorithm without additional retraining or algorithm development. Second, minimal data processing was required: other than tagging concepts with the original table name and recording order, raw EHR data were used for both tasks. Therefore, even the lightweight models tested in this study demonstrated clear advantages over traditional phenotyping methods, including advanced NLP and machine learning algorithms.
1.5.2. Implementation of LLMs for Phenotyping Tasks
The second contribution of this study was the demonstration of LLMs for the tasks of Concept Selection and Apply Algorithm from the end-to-end pipeline. All models performed well at concept classification, especially classification of the medications concepts (Table 3). For the phenotyping tasks, Mistral and Phi generally performed better at determining phenotypes with only a single treatment when compared to those with a sequence of treatments (Table 4). We suspect the layered thought process of assigning records to a treatment and then determining treatment order was too complex for the models tested. We hypothesize that either mapping each constructed concept to a specific treatment or performing a second phenotyping step solely for treatment ordering would improve phenotyping performance. These results suggest that lightweight LLMs can be readily applied to concept classification and simple phenotypes but may currently be insufficient for complex phenotypes without enhancements to the base models, prompts, or pipeline within Apply Algorithm.
One outstanding question for all biomedical tasks performed with LLMs is how to best incorporate specialized medical knowledge, including standardized vocabularies and ontologies. In this study, we injected medical information into the prompts and relied on the inherent capabilities of each model for data synthesis and comprehension. This method is applicable to many other computable phenotypes, including the previously discussed phenotypes for PASC, where Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) concepts could be categorized by injecting symptom information into the prompts.[9] Outside of computational phenotyping, studies have also explored methods for providing knowledge to LLMs, including prompt injection, model pretraining, and model finetuning, although almost none of these studies reported performance results after incorporating this knowledge.[32] Furthermore, a recent study found that finetuned biomedical LLMs underperformed generalist models on multiple clinical benchmarking tasks.[17] Therefore, while there is some evidence that prompt injection (with or without retrieval-augmented generation) may be the best method for incorporating domain knowledge, our results suggest there may be limits to the effectiveness of this method. Thus, there remain significant gaps in understanding how to best incorporate specialized medical knowledge into LLMs for both general biomedical and computational phenotyping tasks.
1.5.3. Study Limitations
There were several limitations to the framework and methods implemented in this study. First, repetitive manual review was still required for model evaluation. Furthermore, since we used previously developed ground truths, we performed less manual review than would be required for a novel phenotyping study. However, manual review for LLM-based methods can be reduced by reviewing samples rather than the entire dataset. Another limitation is scalability. For Concept Selection and Apply Algorithm, each constructed concept and description required individual LLM responses and thus, after a certain number of records, it would become infeasible to use LLMs. However, for this study, almost a sixfold increase in records would be required before the time for LLM-based phenotyping became comparable to the time for traditional phenotyping methods. Additionally, as improvements in model performance, retrieval-augmented generation architecture, and prompt engineering are developed, we expect the issue of scalability for LLM-based methods to be lessened, although not completely mitigated.
1.5.4. Future Directions
There are many future directions to explore outside of those highlighted by the broader vision of computational phenotyping from SHREC. One direction is to understand how LLMs reason with respect to computational phenotyping. Although we used CoT because of its demonstrated ability to produce accurate results,[29, 30, 33] recent studies have suggested that CoT reasoning may actually be unfaithful.[35, 34, 36] Given the complexity of phenotyping tasks, we suggest studying CoT reasoning to understand when and how logical inconsistencies may arise. Another future direction is to adapt phenotype definitions to LLMs. For example, the clinician experts indicated some of the medications may be present for noninvasive therapies, but not as a continuous infusion (Supplementary Material). However, for the previously developed definition, only medication presence was used due to complexities in algorithmically processing medication information in relation to respiratory therapies. In future iterations, we would update the phenotype definition to include administration method rather than simply asking the model to look for concept presence. Finally, we propose development of industry standards for evaluation of LLM-based methods specifically with respect to automated processes to ensure appropriate oversight.
1.6. Conclusion
We developed SHREC, a framework that describes how to apply LLM-based methods to computational phenotyping. SHREC outlines both an end-to-end pipeline for computable phenotype development along with a broader vision of next-generation phenotyping using LLM-based methods. We demonstrated SHREC on a phenotyping use case to assess the feasibility of LLMs for specific phenotyping tasks and promote further research into next-generation computational phenotyping methods. This work is applicable to all computational phenotyping studies, particularly those using manual review for phenotype development.
References
- [1].Hripcsak George and Albers David J. “Next-Generation Phenotyping of Electronic Health Records”. In: Journal of the American Medical Informatics Association 20.1 (Jan. 1, 2013), pp. 117–121. doi: 10.1136/amiajnl-2012-001145.
- [2].Callahan Tiffany J. et al. “Characterizing Patient Representations for Computational Phenotyping”. In: AMIA Annual Symposium Proceedings 2022 (Apr. 29, 2023), pp. 319–328.
- [3].Banda Juan M. et al. “Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models”. In: Annual Review of Biomedical Data Science 1 (July 20, 2018), pp. 53–68. doi: 10.1146/annurev-biodatasci-080917-013315.
- [4].Ozrazgat-Baslanti Tezcan et al. “Development and Validation of A Race-Agnostic Computable Phenotype for Kidney Health in Adult Hospitalized Patients”. In: PLOS ONE 19.4 (Apr. 23, 2024), e0299332. doi: 10.1371/journal.pone.0299332.
- [5].Li Heyi et al. “Rule-Based Cohort Definitions for Acute Respiratory Distress Syndrome: A Computable Phenotyping Strategy Based on the Berlin Definition”. In: Critical Care Explorations 3.6 (June 2021), e0451. doi: 10.1097/CCE.0000000000000451.
- [6].Alcamo Alicia M. et al. “Validation of a Computational Phenotype to Identify Acute Brain Dysfunction in Pediatric Sepsis”. In: Pediatric Critical Care Medicine 23.12 (Dec. 2022), pp. 1027–1036. doi: 10.1097/PCC.0000000000003086.
- [7].Neely Benjamin et al. “Design and Evaluation of a Computational Phenotype to Identify Patients With Metastatic Breast Cancer Within the Electronic Health Record”. In: JCO Clinical Cancer Informatics 6 (Sept. 2022), e2200056. doi: 10.1200/CCI.22.00056.
- [8].McDonough Caitrin W. et al. “Optimizing Identification of Resistant Hypertension: Computable Phenotype Development and Validation”. In: Pharmacoepidemiology and Drug Safety 29.11 (2020), pp. 1393–1401. doi: 10.1002/pds.5095.
- [9].Pungitore Sarah et al. “Computable Phenotypes for Post-acute sequelae of SARS-CoV-2: A National COVID Cohort Collaborative Analysis”. In: AMIA Annual Symposium Proceedings 2023 (Jan. 11, 2024), p. 589.
- [10].Tekumalla Ramya and Banda Juan M. “Towards Automated Phenotype Definition Extraction using Large Language Models”. In: Genomics & Informatics 22.1 (Oct. 31, 2024), p. 21. doi: 10.1186/s44342-024-00023-2.
- [11].Shang Ning et al. “Making Work Visible for Electronic Phenotype Implementation: Lessons Learned from the eMERGE Network”. In: Journal of Biomedical Informatics 99 (Nov. 1, 2019), p. 103293. doi: 10.1016/j.jbi.2019.103293.
- [12].Carrell David S. et al. “A General Framework for Developing Computable Clinical Phenotype Algorithms”. In: Journal of the American Medical Informatics Association 31.8 (Aug. 1, 2024), pp. 1785–1796. doi: 10.1093/jamia/ocae121.
- [13].Wen Andrew et al. “The IMPACT Framework and Implementation for Accessible in Silico Clinical Phenotyping in the Digital Era”. In: npj Digital Medicine 6.1 (July 21, 2023), pp. 1–8. doi: 10.1038/s41746-023-00878-9.
- [14].Friedman Charles P. “A “Fundamental Theorem” of Biomedical Informatics”. In: Journal of the American Medical Informatics Association 16.2 (2009), pp. 169–170. doi: 10.1197/jamia.M3092.
- [15].Raiaan Mohaimenul Azam Khan et al. “A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges”. In: IEEE Access 12 (2024), pp. 26839–26874. doi: 10.1109/ACCESS.2024.3365742.
- [16].Liu Pengfei et al. “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing”. In: Association for Computing Machinery Computing Surveys 55.9 (Jan. 16, 2023), 195:1–195:35. doi: 10.1145/3560815.
- [17].Dorfner Felix J. et al. “Evaluating the Effectiveness of Biomedical Fine-Tuning for Large Language Models on Clinical Tasks”. In: Journal of the American Medical Informatics Association (Apr. 7, 2025), ocaf045. doi: 10.1093/jamia/ocaf045.
- [18].Baddour Moussa et al. “Phenotypes Extraction from Text: Analysis and Perspective in the LLM Era”. In: IS 2024 – 12th IEEE International Conference on Intelligent Systems. Aug. 29, 2024, p. 1.
- [19].Zelin Charlotte et al. “Rare Disease Diagnosis using Knowledge Guided Retrieval Augmentation for ChatGPT”. In: Journal of Biomedical Informatics 157 (Sept. 1, 2024), p. 104702. doi: 10.1016/j.jbi.2024.104702.
- [20].Yan Chao et al. “Large Language Models Facilitate the Generation of Electronic Health Record Phenotyping Algorithms”. In: Journal of the American Medical Informatics Association 31.9 (Sept. 1, 2024), pp. 1994–2001. doi: 10.1093/jamia/ocae072.
- [21].Boussina Aaron et al. “Large Language Models for More Efficient Reporting of Hospital Quality Measures”. In: NEJM AI 1.11 (Oct. 24, 2024), AIcs2400420. doi: 10.1056/AIcs2400420.
- [22].Matos João et al. “EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models”. In: Applications of Medical Artificial Intelligence. Ed. by Wu Shandong, Shabestari Behrouz, and Xing Lei. 2025, pp. 210–220. doi: 10.1007/978-3-031-82007-6_20.
- [23].Pungitore Sarah, Yadav Shashank, and Subbian Vignesh. “PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping”. In: arXiv arXiv:2503.19265 (Mar. 25, 2025). doi: 10.48550/arXiv.2503.19265.
- [24].Hazlehurst Brian, Gorman Paul N., and McMullen Carmit K. “Distributed Cognition: An Alternative Model of Cognition for Medical Informatics”. In: International Journal of Medical Informatics 77.4 (Apr. 1, 2008), pp. 226–234. doi: 10.1016/j.ijmedinf.2007.04.008.
- [25].Patel Vimla L. and Kaufman David R. “Cognitive Informatics”. In: Biomedical Informatics: Computer Applications in Health Care and Biomedicine. Ed. by Shortliffe Edward H. and Cimino James J. Cham: Springer International Publishing, 2021, pp. 121–152. doi: 10.1007/978-3-030-58721-5_4.
- [26].Essay P., Mosier J., and Subbian V. “Rule-Based Cohort Definitions for Acute Respiratory Failure: Electronic Phenotyping Algorithm”. In: JMIR Medical Informatics 8.4 (Apr. 15, 2020), e18402. doi: 10.2196/18402.
- [27].Pollard Tom J. et al. “The eICU Collaborative Research Database, A Freely Available Multi-Center Database for Critical Care Research”. In: Scientific Data 5.1 (Sept. 11, 2018), p. 180178. doi: 10.1038/sdata.2018.178.
- [28].ollama/ollama. Dec. 19, 2024. url: https://github.com/ollama/ollama.
- [29].Reynolds Laria and McDonell Kyle. “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm”. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, May 8, 2021, pp. 1–7. doi: 10.1145/3411763.3451760.
- [30].Wei Jason et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In: Advances in Neural Information Processing Systems 35 (Dec. 6, 2022), pp. 24824–24837.
- [31].Qiu Jianing et al. “LLM-based Agentic Systems in Medicine and Healthcare”. In: Nature Machine Intelligence 6.12 (Dec. 2024), pp. 1418–1420. doi: 10.1038/s42256-024-00944-1.
- [32].Chang Eunsuk and Sung Sumi. “Use of SNOMED CT in Large Language Models: Scoping Review”. In: JMIR Medical Informatics 12.1 (Oct. 7, 2024), e62924. doi: 10.2196/62924.
- [33].Zhang Zhuosheng et al. “Igniting Language Intelligence: The Hitchhiker’s Guide from Chain-of-Thought Reasoning to Language Agents”. In: Association for Computing Machinery Computing Surveys 57.8 (Mar. 22, 2025), 206:1–206:39. doi: 10.1145/3719341.
- [34].Chen Yanda et al. “Reasoning Models Don’t Always Say What They Think”. In: arXiv arXiv:2505.05410 (May 8, 2025). doi: 10.48550/arXiv.2505.05410.
- [35].Arcuschin Iván et al. “Chain-of-Thought Reasoning in the Wild is not Always Faithful”. In: Workshop on Reasoning and Planning for Large Language Models. Mar. 5, 2025.
- [36].Turpin Miles et al. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”. In: Thirty-seventh Conference on Neural Information Processing Systems. Nov. 2, 2023.


