Abstract
Background
Natural language processing (NLP) can facilitate research utilizing data from electronic health records (EHRs). Large language models can potentially improve NLP applications leveraging EHR notes. The objective of this study was to assess the performance of zero-shot learning using Chat Generative Pre-trained Transformer 4 (ChatGPT-4) for extraction of symptoms and signs, and compare its performance to baseline machine learning and rule-based methods developed using annotated data.
Methods and Results
From unstructured clinical notes in the national EHR data of the Veterans Affairs healthcare system, we extracted 1999 text snippets containing relevant keywords for heart failure symptoms and signs, which were then annotated by two clinicians. We also created 102 synthetic snippets that were semantically similar to snippets randomly selected from the original 1999 snippets. We applied zero-shot learning with ChatGPT-4, using two different forms of prompt engineering, to extract symptoms and signs from the synthetic snippets. For comparison, baseline models using machine learning and rule-based methods were trained on the original 1999 annotated text snippets and then used to classify the 102 synthetic snippets.
The best zero-shot learning application achieved 90.6% precision, 100% recall, and 95% F1 score, outperforming the best baseline method, which achieved 54.9% precision, 82.4% recall, and 65.5% F1 score. Prompt style and temperature settings influenced zero-shot learning performance.
Conclusions
Zero-shot learning utilizing ChatGPT-4 significantly outperformed traditional machine learning and rule-based NLP. Prompt type and temperature settings affected zero-shot learning performance. These findings suggest a more efficient means of symptom and sign extraction than traditional machine learning and rule-based methods.
Keywords: Heart Failure, Information Extraction, Large Language Models, Zero-Shot Learning
Introduction
Clinical notes from electronic health records (EHR) are the focus of numerous natural language processing (NLP) tasks, with a variety of NLP techniques having been developed over the past few decades. The earliest clinical NLP systems were rule-based 1–5. Recent research has highlighted the contributions of machine learning (ML) to clinical NLP tasks 6–11. One persistent challenge of ML applications is the requirement for sample data. Supervised ML algorithms, in particular, require substantial human-annotated data for training. ML can also be combined with rule-based methods 12–14, implementing rule-based regular expression patterns to enhance performance. These patterns can be especially useful in identifying template data and the standard types of language used in clinical notes, boosting overall NLP performance. However, because rules are often specific to a note type or text corpus, contemporary clinical NLP systems that rely only on rule-based classifications are not common.
With the recent surge in artificial intelligence (AI), there is strong interest in applying large language models (LLMs) to NLP clinical tasks 15–18. LLMs are trained on massive datasets, benefiting from data volume beyond what is available to most scientists relying on annotated data for ML. In clinical NLP, a typical application of LLMs involves feeding raw text into an LLM and then using the LLM output as input for text classification. This approach still requires annotated data while taking advantage of the LLM’s ability to process the text into better “features.”
LLM-based zero-shot learning (ZSL) 19–22, on the other hand, does not require training data and has gained popularity. ZSL uses engineered prompts to query the LLM. For example, in a binary task, a prefix prompt can pass sufficient instructions to an LLM to answer questions regarding accompanying text. Prefix prompts, in the form of ‘Given the context X, determine Y’, are one of several prompt strategies that can aid LLM applications in medicine 23, depending on their form and the task at hand.
The main objective for this study was to assess the performance of ZSL using Chat Generative Pre-trained Transformer 4 (ChatGPT-4) for symptom and sign extraction and compare its performance to baseline ML and rule-based methods developed using annotated data. This project focused on specific symptoms and signs relevant to heart failure (HF). EHR clinical notes from the U.S. Department of Veteran Affairs (VA) served as the initial data source, using the VA Informatics and Computing Infrastructure (VINCI) secure research platform 24. This study received the proper ethical oversight. To our knowledge, this is the first study of its kind to compare ChatGPT-4-based symptom and sign extraction with traditional ML and rule-based methods for this specific task.
Methods
Original Annotated Snippets
The traditional ML methods utilized 1999 snippets extracted from VA clinical notes. We sampled the snippets for 10 symptoms and signs. Within each symptom/sign group there were approximately 200 snippets. Each snippet captured from a VA clinical note contained a relevant key phrase surrounded by 20 words before and after. The symptoms and signs, and their key phrases, are in Table 1. Each snippet was independently annotated by two clinicians to indicate whether the patient had the symptom/sign, according to the use of the key phrase, using classifications of ‘yes’ (i.e., positive), ‘no’ (i.e., negative), or ‘uncertain’. For example, for the symptom ‘dyspnea at rest’, the key phrase ‘dyspnea’ had to be used within the context of rest in order for the classification to be positive. After independent annotation, the clinicians reviewed annotations where their classifications did not match, discussed them, and reached a final decision by consensus. The distributions of the final annotations, by symptom/sign group, are in Figure 1. Dyspnea at Rest had the largest proportion of snippets classified as uncertain (59, 29.4%). Mobility Issues had the most positive snippets (121, 60.2%). Paroxysmal Nocturnal Dyspnea had the most negative snippets (189, 94.5%).
Table 1.
Symptoms and signs, and key phrases
| Symptom/Sign | Key Phrases |
|---|---|
| Dyspnea at Rest | shortness of breath, winded, short of breath, dyspnea, dyspneic, SOB |
| Dyspnea on Exertion | shortness of breath, winded, short of breath, dyspnea, DOE, dyspneic, SOB |
| Edema | swelling, fluid, edema |
| Fatigue | fatigue, tiredness, tired, energy, exhausted |
| Mobility Issues | wheelchair, cane, walker |
| Orthopnea | shortness of breath, orthopnea, winded, short of breath, dyspnea, dyspneic, SOB |
| Paroxysmal Nocturnal Dyspnea | shortness of breath, short of breath, PND, SOB, dyspnea, dyspneic, paroxysmal nocturnal dyspnea |
| Pulmonary Congestion | congestion, interstitial, septal |
| Pulmonary Edema | pulmonary edema |
| Rales | crepts, crackles, rales, crepitation |
Figure 1.

Annotation classification distributions by symptom/sign group
The annotated snippets were then randomized, and split into 80% and 20% groups for training and testing, respectively, of the baseline models. No snippets classified as ‘uncertain’ were included, due to their ambiguity.
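A minimal sketch of this split, assuming the annotated snippets are loaded into a pandas DataFrame with hypothetical columns `text` and `label` (values ‘yes’, ‘no’, or ‘uncertain’); the file name and seed are illustrative:

```python
# Minimal sketch of the 80/20 split; the file name, column names, and seed
# are assumptions rather than the exact code used in the study.
import pandas as pd
from sklearn.model_selection import train_test_split

snippets = pd.read_csv("annotated_snippets.csv")  # hypothetical file

# Exclude snippets annotated as 'uncertain' due to their ambiguity.
labeled = snippets[snippets["label"] != "uncertain"]

train_df, test_df = train_test_split(
    labeled, test_size=0.20, random_state=42, shuffle=True
)
```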
Synthetic Data
Because of HIPAA regulations, we could not upload the original snippets to the public ChatGPT Application Programming Interface (API), an interface that allows researchers to integrate ChatGPT into their own applications. To create synthetic data, we randomly chose approximately 10 snippets from each symptom/sign group to model the synthetic text. When the original snippets were templates (i.e., artifacts from pull-down forms, questionnaires, and other common EHR tools), we recreated the same template. Otherwise, the synthetic data were created to be similar but not identical to the original text, while retaining the same keywords and essential clinical context.
This resulted in 102 synthetic snippets: ten for each symptom/sign, except for Edema and Paroxysmal Nocturnal Dyspnea, which each had 11. The key phrases and the distribution of positive and negative classifications (based on the annotators’ classifications of the corresponding original snippets) are in Table 2.
Table 2.
Synthetic snippets, key phrases and classifications along with counts, by symptom/sign
| Symptom/Sign | Key Phrases & Counts | Annotated Values |
|---|---|---|
| Dyspnea at rest | dyspnea (2), dyspneic (3), short of breath (2), shortness of breath (1), SOB (2) | positive (4), negative (6) |
| Dyspnea on exertion | DOE (6), short of breath (2), SOB (2) | positive (4), negative (6) |
| Edema | edema (4), fluid (2), swelling (5) | positive (5), negative (6) |
| Fatigue | energy (3), fatigue (3), tired (3), tiredness (1) | positive (5), negative (5) |
| Mobility issues | cane (4), walker (2), wheelchair (4) | positive (5), negative (5) |
| Orthopnea | dyspnea (1), orthopnea (8), shortness of breath (1) | positive (6), negative (4) |
| Paroxysmal nocturnal dyspnea | dyspnea (2), dyspneic (1), paroxysmal nocturnal dyspnea (2), PND (4), shortness of breath (1), SOB (1) | positive (5), negative (6) |
| Pulmonary congestion | congestion (2), interstitial (4), septal (4) | positive (4), negative (6) |
| Pulmonary edema | pulmonary edema (10) | positive (5), negative (5) |
| Rales | crackles (2), crepitation (2), crepts (2), rales (4) | positive (5), negative (5) |
ML Baseline Models
We experimented with ten different ML algorithms to serve as baselines (i.e., Support Vector Machine, Random Forest, Decision Trees, Gaussian Naïve Bayes, KNN, Gaussian Process, Logistic Regression, AdaBoost, Extra Trees, and Bagging). Preprocessing consisted of removing all extra white spaces and non-word characters from the snippets; this retained single white spaces, to enable tokenization (separation into words), and characters consisting of letters, numbers, and underscores. Because the small number of snippets in each symptom/sign group made it impractical to fit ML models for individual symptoms and signs, we aggregated all the snippets. Further experimentation revealed that a bag-of-words model, with features consisting of bigrams made up entirely of letters and occurring 5 or more times across the full snippet set, provided the best performance across algorithm types. A bigram is two adjacent words found in the text, like “patient reports” or “denies cough”. The ten ML algorithm models were trained and tested using the split annotated snippet dataset. We used the Python ML modules within the scikit-learn package 25, version 1.0.2. After some experimentation with hyperparameter settings, we set the learning rate to 1.25 and the number of estimators to 50 for AdaBoost, the smoothing value to 0.000005 for Gaussian Naïve Bayes, and the C value to 1 for the Support Vector Machine; otherwise, the default hyperparameter settings were used for each model. We used a set of fixed random seed values for all models to ensure reproducibility.
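A sketch of this configuration in scikit-learn follows. The token pattern, the mapping of the Gaussian Naïve Bayes “smoothing value” to `var_smoothing`, and the use of `min_df=5` for the frequency threshold are assumptions, not the study’s exact code:

```python
# Illustrative sketch of the bag-of-words bigram features and the reported
# hyperparameters (scikit-learn 1.0.2); details are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Bigrams consisting entirely of letters; min_df=5 approximates the
# "occurring 5 or more times" threshold described above.
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"[A-Za-z]+", min_df=5)

# train_df and test_df come from the split sketch above.
X_train = vectorizer.fit_transform(train_df["text"]).toarray()
y_train = (train_df["label"] == "yes").astype(int)
X_test = vectorizer.transform(test_df["text"]).toarray()

models = {
    "AdaBoost": AdaBoostClassifier(learning_rate=1.25, n_estimators=50, random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "Decision Trees": DecisionTreeClassifier(random_state=42),
    "Gaussian NB": GaussianNB(var_smoothing=5e-6),
    "SVM": SVC(C=1, random_state=42),
    # ... remaining algorithms run with default settings
}
for name, model in models.items():
    model.fit(X_train, y_train)
```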
Rule-Based Patterns
We supplemented our baseline ML classification efforts by randomly reviewing the snippets and forming regular expressions specific to each symptom/sign. Regular expressions are specialized character patterns designed to match patterns found in text. For example, positive questionnaire template data, in which the author places an ‘x’ within brackets to indicate that the symptom fatigue is present, can be captured with a regular expression that recognizes the content preceding the positively indicated instance of fatigue and matches either a lower-case or an upper-case x, plus any amount of extra white space within the brackets of interest.
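The original pattern is not reproduced here; a hypothetical regular expression consistent with that description (a bracketed, possibly padded ‘x’ marking fatigue on a questionnaire line) might look like the following:

```python
# Hypothetical example only, not the exact pattern used in the study:
# match a bracketed 'x' or 'X', with any amount of padding white space,
# preceding the word "fatigue", e.g. "[ X ] fatigue".
import re

positive_fatigue = re.compile(r"\[\s*[xX]\s*\]\s*fatigue")

print(bool(positive_fatigue.search("Symptoms reported: [ X ] fatigue  [ ] edema")))  # True
```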
To evaluate each annotated snippet using the regular expressions, we developed a system that first evaluated positive pattern matches. If any were matched, the snippet was classified as positive. Then if no positive pattern was matched, it evaluated negative pattern matches. If a negative pattern was matched, the snippet was classified as negative. This general approach worked well in previous work. Assessing each symptom/sign’s snippets in this fashion, we then tested the rule-based classifier on the 1999 annotated original snippets. It was able to classify 985 snippets that contained relevant patterns, achieving 98.4% precision, 94.7% recall, and 96.5% F1.
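A minimal sketch of this positive-first cascade, using hypothetical per-symptom lists of compiled positive and negative patterns:

```python
# Sketch of the rule-based cascade: positive patterns are checked first;
# negative patterns are consulted only when no positive pattern matches.
# The pattern lists below are hypothetical placeholders.
import re
from typing import Optional

positive_patterns = [re.compile(r"\[\s*[xX]\s*\]\s*fatigue")]
negative_patterns = [re.compile(r"denies\s+fatigue"), re.compile(r"\[\s*\]\s*fatigue")]

def rule_based_classify(snippet: str) -> Optional[str]:
    if any(p.search(snippet) for p in positive_patterns):
        return "positive"
    if any(p.search(snippet) for p in negative_patterns):
        return "negative"
    return None  # no relevant pattern; the snippet is left unclassified
```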
Development of Final Baseline Models
There were 892 unique bigrams used as features for training the ten initial ML algorithms. The best performing ML models were AdaBoost (55.8% precision, 94.7% recall, 70.2% F1), Bagging (64.3% precision, 75.6% recall, 69.5% F1), and Decision Trees (59.2% precision, 73.1% recall, 65.4% F1). We then used these three top performing trained models as baselines to classify the synthetic snippets, both without and with the rule-based classifier enhancement. In the enhanced tests, the rule-based classifier also classified any synthetic snippet that contained matching text; whenever both the trained ML model and the rule-based classifier produced a prediction for a snippet, the rule-based prediction was used in the final results. Figures 2a and 2b illustrate this.
Figures 2a and 2b.

Classification without and with rule-based (RB) enhancement. In Figure 2a (top), all input synthetic snippets are classified by the trained machine learning (ML) model. In Figure 2b, for synthetic snippets that are classified by both the trained ML model and the RB enhancement, the RB classification is used in final output. Note that only synthetic snippets containing matching regular expression patterns can be classified by the RB enhancement.
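A sketch of the combination shown in Figure 2b, reusing the hypothetical `rule_based_classify` helper above: when the rule-based classifier produces a prediction it overrides the ML prediction; otherwise the ML prediction stands. The function and variable names are illustrative.

```python
# Sketch of the rule-based (RB) enhancement from Figure 2b: the RB prediction,
# when available, replaces the trained ML model's prediction for that snippet.
def classify_with_rb_enhancement(snippet: str, ml_model, vectorizer) -> str:
    rb_prediction = rule_based_classify(snippet)  # hypothetical helper above
    if rb_prediction is not None:
        return rb_prediction
    features = vectorizer.transform([snippet]).toarray()
    return "positive" if ml_model.predict(features)[0] == 1 else "negative"
```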
Zero-Shot Learning
We created a Jupyter Notebook 26 that utilized the OpenAI API 27. We experimented with two forms of prompts. The first was a prefix-style prompt of the form “Does X indicate Y in the following text? Limit your answer to yes or no.”, where in place of X we used the key phrase in the synthetic snippet, and in place of Y we used the symptom/sign. An example of this would be “Does dyspnea indicate dyspnea at rest in the following text? Limit your answer to yes or no.”
The second was a more basic prompt of the form “Does the following text mean the patient has Y?”, where in place of Y we used the symptom/sign. An example of this would be “Does the following text mean the patient has dyspnea at rest?”
We ran experiments using the different prompt styles with the synthetic snippets. Through use of the OpenAI API, ChatGPT-4 first received the prompt, then the corresponding synthetic snippet, and then produced the classification in the form of “yes” or “no”. This general process is illustrated in Figure 3.
Figure 3.

ZSL using the OpenAI API. Instructional prompt sent to OpenAI API, followed by the text to be analyzed; the API returns a “yes” or “no” answer indicating a positive or negative classification for the text.
A range of temperature settings was used to explore the effect of temperature on the task. Temperature, the “amount of randomness injected to the response” 28, can affect output for LLMs that implement this hyperparameter. We used gpt-4-0613, the most current version of ChatGPT-4 available at the time of execution.
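A minimal sketch of one such call through the OpenAI Python client is shown below. The prompt wording follows the examples above; the message layout (prompt followed by the snippet in a single user message), function name, and variable names are assumptions:

```python
# Sketch of a ZSL classification call via the OpenAI Python client (v1+).
# Model and temperature match the paper; the message layout is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zsl_classify(snippet: str, key_phrase: str, symptom_sign: str,
                 prompt_style: str = "basic", temperature: float = 0.0) -> str:
    if prompt_style == "prefix":
        prompt = (f"Does {key_phrase} indicate {symptom_sign} in the following text? "
                  "Limit your answer to yes or no.")
    else:
        prompt = f"Does the following text mean the patient has {symptom_sign}?"
    response = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=temperature,
        messages=[{"role": "user", "content": f"{prompt}\n\n{snippet}"}],
    )
    return response.choices[0].message.content.strip().lower()  # "yes" or "no"
```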
Evaluation on Synthetic Data
To evaluate the performance of all methods on the synthetic snippets we measured precision, recall, and F1. We also performed error analyses on ZSL, by prompt style, and the baseline models to gain greater insight into performance.
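For illustration only, these metrics can be computed in the standard way, assuming gold labels and predictions encoded as “positive”/“negative” (the toy values below are placeholders):

```python
# Illustrative computation of precision, recall, and F1 for the binary task.
from sklearn.metrics import precision_recall_fscore_support

gold = ["positive", "negative", "positive", "positive"]
pred = ["positive", "negative", "negative", "positive"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, pos_label="positive", average="binary"
)
print(f"precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
```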
Patient Characteristics
To better place these findings in a human context, we extracted demographic data for the patients from whose records the original snippets were drawn.
Results
Patient Characteristics
Patients varied by age, sex of record, marital status, race, and ethnicity (Table 3). Age was determined using each patient’s birthdate and the date of the associated clinical note. For patients who had multiple clinical notes, the first note’s date was used. The majority of patients were white and not Hispanic. Race and age data to a degree reflect recent population findings 29. Fewer of the study’s patients were female, as currently the majority of Veterans are male.
Table 3.
Patient demographic data
| Patient Demographics (N = 1611) | n | % |
|---|---|---|
| Average age, years (SD) | 72.1 (11.1) | |
| Sex | ||
| Male | 1558 | 97% |
| Female | 53 | 3% |
| Marital Status | ||
| Married | 776 | 48% |
| Divorced | 431 | 27% |
| Widowed | 220 | 14% |
| Never Married | 121 | 8% |
| Separated | 53 | 3% |
| Missing or Unknown | 10 | 1% |
| Race | ||
| White | 1144 | 71% |
| Black or African American | 309 | 19% |
| Declined to Answer | 58 | 4% |
| Unknown by Patient | 20 | 1% |
| Native Hawaiian or Other Pacific Islander | 11 | 1% |
| American Indian or Alaska Native | 9 | 1% |
| Asian | 2 | <1% |
| Unknown | 58 | 4% |
| Ethnicity | ||
| Not Hispanic or Latino | 1452 | 90% |
| Hispanic or Latino | 62 | 4% |
| Declined to Answer | 32 | 2% |
| Unknown by Patient | 21 | 1% |
| Unknown | 44 | 3% |
ZSL and Baseline Models’ Performance
With optimal temperature settings, both ZSL prompt types significantly outperformed the baseline ML models, both without and with the rule-based enhancement. Table 4 summarizes the results. ZSL using the basic prompt at a temperature setting of 0 yielded the best performance of all methods.
Table 4.
Top performance of ZSL by prompt type, and top ML models without and with rule-based (RB) enhancement
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Basic Prompt | 90.6% | 100.0% | 95.0% |
| Prefix Prompt | 77.4% | 85.4% | 81.2% |
| AdaBoost | 48.8% | 83.3% | 61.5% |
| AdaBoost with RB | 52.7% | 79.5% | 63.3% |
| Bagging | 50.0% | 10.4% | 17.2% |
| Bagging with RB | 77.3% | 35.4% | 48.6% |
| Decision Trees | 48.8% | 83.3% | 61.5% |
| Decision Trees with RB | 54.9% | 82.4% | 65.5% |
ZSL Prompt Styles
ZSL performed especially well using the basic prompt. Best performance was 90.6% precision, 100% recall, and 95% F1 at a temperature setting of 0 (Figure A, Supplemental Appendix). Performance using the prefix prompt achieved 77.4% precision, 85.4% recall, and 81.2% F1 at a temperature setting of 0 (Figure B, Supplemental Appendix).
Baseline Models
Performance of the top 3 trained ML models in classifying the synthetic snippets, without and with the rule-based classifier enhancement, varied (Figures C and D, Supplemental Appendix). The rule-based classifier alone was able to classify 32 of the synthetic snippets, achieving 93.3% precision, 82.4% recall, and 87.5% F1. Without the rule-based enhancement, AdaBoost and Decision Trees achieved nearly identical performance, with both achieving F1 scores of 61.5%; Bagging achieved an F1 of 17.2%. Bagging benefited the most from the rule-based enhancement: its F1 increased to 48.6%, precision increased from 50.0% to 77.3%, and recall increased from 10.4% to 35.4%. The rule-based enhancement also increased the AdaBoost and Decision Trees F1 scores (to 63.3% and 65.5%, respectively); however, AdaBoost’s recall decreased from 83.3% to 79.5%.
Error Analysis
To better understand the errors from the best performing ZSL prompt strategies according to temperature setting, we reviewed errors by type (false positive or false negative), symptom/sign, and key phrase (Table A, Appendix).
The experiment using the basic prompt at temperature setting 0 yielded 5 false positives and no false negatives. Even though the basic prompt did not include the key phrase, the key phrase was contained in the synthetic snippet passed to ChatGPT-4 after the prompt. Interestingly, the prefix prompt style, which instructed ChatGPT-4 to specify whether or not the synthetic snippet indicated the symptom/sign based on the use of the key phrase, yielded more errors, including five false negatives for the orthopnea symptom/sign paired with the key phrase ‘orthopnea’.
We also performed error analyses for the baseline models with the rule-based enhancement. (Table B, Appendix). The AdaBoost and Decision Trees models produced more false positives (34 and 32, respectively) than false negatives (10 and 9, respectively). The Bagging model produced 5 false positives and 31 false negatives.
Discussion
The most significant finding of this study is that ZSL utilizing ChatGPT-4 significantly outperformed traditional ML and rule-based NLP. Although the traditional ML approach used a relatively small training dataset, it required considerable effort from clinicians to annotate the dataset and from informaticians to test different ML algorithms and parameters. ML performance was dependent on a small, usually imbalanced amount of data for each symptom/sign. Even with the rule-based enhancements, performance remained limited. Rule-based classification alone performed adequately, but required manual effort to develop the rules, which are most suitable for templated text with clear and consistent patterns. ZSL, on the other hand, required no task-specific training, potentially saving months of time spent on annotation and NLP development.
ZSL benefits from the vast volume of data and the computational resources available in LLM creation. The version of ChatGPT-4 that we used in this study accommodates a context window of over 8000 tokens 30. Individual research projects typically do not have access to such extensive resources for creating a language model. Our previous work in creating a ZSL resource to identify suicidality in clinical notes yielded good results, but not as good or as generalizable as the results obtained using a commercial language model, as in this study.
Prompt type and temperature settings did affect ZSL performance. Surprisingly, the basic prompt style yielded better results than the prefix prompt. Lower temperature settings improved performance for both prompt styles. Limiting ChatGPT-4 to a yes or no answer not only fulfilled the study’s objective of pursuing a binary classification task, but also likely prevented ChatGPT-4 from returning tangential and possibly hallucinatory narratives. We observed that ChatGPT-4 indeed generated text with a small amount of hallucination when not directed to provide only a yes or no answer.
The objective of our study was information extraction, not diagnosis. As such, we were asking ChatGPT-4 to determine the meaning of the text at hand, not causes or outcomes. Symptom/sign extraction is neither simple nor trivial. Nevertheless, there are NLP tasks that are more challenging for a general LLM; for example, in our experiment with ambient clinical note generation, more specific and detailed prompts were very helpful. Other studies have shown that LLMs perform well in diagnostic tasks and are capable of passing medical exams 31–33.
Future Work
Future work will include experimentation with other types of wording for prefix prompts, and other types of prompt styles, including few-shot prompting. We also intend to test ChatGPT’s ability to identify different types of medical concepts.
Limitations
Our prompt engineering was limited in this prototype study. The prefix prompt style could have been designed in other ways; we used the word ‘indicate’ to determine the presence of a symptom/sign given the key phrase. As noted earlier, future research will include other wording and other prompt styles. Nevertheless, the fact that a very generic and simple prompt achieved the best performance suggests that more elaborate prompts may not be better at this particular task.
Even though we sought to make the synthetic data similar to the original EHR snippets, it may lack certain subtle features utilized by the ML models. Ideally, we would like to utilize a private ChatGPT instance in a HIPAA-compliant environment on the actual snippets. At the same time, the synthetic texts are semantically equivalent to the original text, and generalizable ML models should be able to classify them correctly.
Conclusion
Utilizing ZSL with LLMs can potentially improve NLP applications leveraging EHR clinical notes. We applied ZSL with ChatGPT-4, using two different forms of prompt engineering, to extract symptom/sign data from hand-crafted synthetic notes. For comparison, we also developed baseline models using ML and rule-based methods to process the synthetic notes. Applications using either ZSL prompt style outperformed the best baseline method. Prompt style and temperature settings affected ZSL performance. These findings suggest a more efficient means of symptom and sign extraction than traditional ML and rule-based methods.
Supplementary Material
Funding:
This work was funded by the United States Department of Veterans Affairs HSR&D grant I01 HX002422-01A2 (Dr. Ahmed and Dr. Zeng-Treitler). Department of Veterans Affairs, Washington, DC, USA
Abbreviations
- API
Application Programming Interface
- AI
Artificial Intelligence
- EHR
Electronic Health Record
- LLM
Large Language Model
- ML
Machine Learning
- NLP
Natural Language Processing
- VA
Veterans Affairs
- ZSL
Zero-Shot Learning
Footnotes
Disclosures: The content is solely the responsibility of the authors and does not necessarily represent the views of the United States government or the affiliated academic institutions. No authors report any conflict of interest related to this manuscript.
References
- 1. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1(2):161–74.
- 2. Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ. Natural language processing and the representation of clinical data. J Am Med Inform Assoc 1994;1(2):142–60.
- 3. Sager N, Friedman C, Lyman MS. Medical language processing: computer management of narrative data. Reading, MA: Addison-Wesley; 1987.
- 4. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003;36(6):462–77.
- 5. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301–10.
- 6. Kalmady SV, Salimi A, Sun W, Sepehrvand N, Nademi Y, Bainey K, et al. Development and validation of machine learning algorithms based on electrocardiograms for cardiovascular diagnoses at the population level. NPJ Digit Med 2024;7(1):133.
- 7. Sadegh-Zadeh SA, Soleimani Mamalo A, Kavianpour K, Atashbar H, Heidari E, Hajizadeh R, et al. Artificial intelligence approaches for tinnitus diagnosis: leveraging high-frequency audiometry data for enhanced clinical predictions. Front Artif Intell 2024;7:1381455.
- 8. Merhbene G, Puttick A, Kurpicz-Briki M. Investigating machine learning and natural language processing techniques applied for detecting eating disorders: a systematic literature review. Front Psychiatry 2024;15:1319522.
- 9. Wieland-Jorna Y, van Kooten D, Verheij RA, de Man Y, Francke AL, Oosterveld-Vlug MG. Natural language processing systems for extracting information from electronic health records about activities of daily living. A systematic review. JAMIA Open 2024;7(2):ooae044.
- 10. Wang H, Alanis N, Haygood L, Swoboda TK, Hoot N, Phillips D, et al. Using natural language processing in emergency medicine health service research: A systematic review and meta-analysis. Acad Emerg Med 2024.
- 11. Le Glaz A, Haralambous Y, Kim-Dufor DH, Lenca P, Billot R, Ryan TC, et al. Machine Learning and Natural Language Processing in Mental Health: Systematic Review. J Med Internet Res 2021;23(5):e15708.
- 12. Ozbek MA, Basak AT, Cakici N, Evran S, Kayhan A, Saygi T, et al. Comparison of Clinical and Radiologic Outcomes between Dural Splitting and Duraplasty for Adult Patients with Chiari Type I Malformation. J Neurol Surg A Cent Eur Neurosurg 2023;84(4):370–6.
- 13. Meystre SM, Thibault J, Shen S, Hurdle JF, South BR. Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents. J Am Med Inform Assoc 2010;17(5):559–62.
- 14. Zhang K, Demner-Fushman D. Automated classification of eligibility criteria in clinical trials to facilitate patient-trial matching for specific patient populations. J Am Med Inform Assoc 2017;24(4):781–7.
- 15. Lee C, Mohebbi M, O’Callahaghan E, Winsberg M. Crisis prediction among tele-mental health patients: A large language model and expert clinician comparison. JMIR Ment Health 2024.
- 16. Chiang CC, Luo M, Dumkrieger G, Trivedi S, Chen YC, Chao CJ, et al. A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records. Headache 2024;64(4):400–9.
- 17. Lorenzoni G, Gregori D, Bressan S, Ocagli H, Azzolina D, Da Dalt L, et al. Use of a Large Language Model to Identify and Classify Injuries With Free-Text Emergency Department Data. JAMA Netw Open 2024;7(5):e2413208.
- 18. Gill JK, Chetty M, Lim S, Hallinan J. Large language model based framework for automated extraction of genetic interactions from unstructured data. PLoS One 2024;19(5):e0303231.
- 19. Lampert CH, Nickisch H, Harmeling S, editors. Learning to detect unseen object classes by between-class attribute transfer. 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009: IEEE.
- 20. Sun X, Gu J, Sun H. Research progress of zero-shot learning. Applied Intelligence 2021;51:3600–14.
- 21. Romera-Paredes B, Torr P, editors. An embarrassingly simple approach to zero-shot learning. International Conference on Machine Learning; 2015: PMLR.
- 22. Pourpanah F, Abdar M, Luo Y, Zhou X, Wang R, Lim CP, et al. A Review of Generalized Zero-Shot Learning Methods. IEEE Trans Pattern Anal Mach Intell 2023;45(4):4051–70.
- 23. Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med Inform 2024;12:e55318.
- 24. VA Informatics and Computing Infrastructure (VINCI). Available from: https://www.hsrd.research.va.gov/for_researchers/vinci/. Accessed 12 August 2024.
- 25. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011;12:2825–30.
- 26. Kluyver T, Ragan-Kelley B, Pérez F, Granger BE, Bussonnier M, Frederic J, et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. Elpub 2016;2016:87–90.
- 27. OpenAI documentation. OpenAI. Available from: https://platform.openai.com/docs/overview. Accessed 12 August 2024.
- 28. API Reference: temperature. Anthropic. Available from: https://docs.anthropic.com/en/api/messages. Accessed 12 August 2024.
- 29. Gilligan C. Who Are America’s Veterans? US News & World Report. 2022.
- 30. Models. OpenAI. Available from: https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4. Accessed 12 August 2024.
- 31. Farhat F, Chaudhry BM, Nadeem M, Sohail SS, Madsen DO. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard. JMIR Med Educ 2024;10:e51523.
- 32. Tsoutsanis P, Tsoutsanis A. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam. Comput Biol Med 2024;168:107794.
- 33. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, et al. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clin Pract 2023;13(6):1460–87.