AMIA Summits on Translational Science Proceedings. 2025 Jun 10;2025:518–526.

Exploring ChatGPT 3.5 for structured data extraction from oncological notes

Ty J Skyles 1,*, Isaac J Freeman 2,*, Georgewilliam Kalibbala 3,*, David Davila-Garcia 3,4,*, Kendall Kiser 3, Silpa Raju 3, Adam Wilcox 3
PMCID: PMC12150697  PMID: 40502225

Abstract

In large-scale clinical informatics, there is a need to maximize the amount of usable data from electronic health records. With the adoption of large language models in medical research, there is potential to use them to extract structured data from unstructured clinical notes. We explored how ChatGPT could be used to improve data availability in cancer research. We assessed how GPT used clinical notes to answer six relevant clinical questions. Four prompt engineering strategies were used: zero-shot, zero-shot with context, few-shot, and few-shot with context. Few-shot prompting often decreased the accuracy of GPT outputs, and context did not consistently improve accuracy. GPT extracted patients’ Gleason scores and ages with an F1 score of 0.99, and it identified whether patients were receiving palliative care and whether patients were in pain, each with an F1 score of 0.86. Effective use of LLMs has the potential to increase interoperability between healthcare and clinical research.

Introduction

The widespread use of electronic health records (EHR) has changed health data science and created new opportunities to improve clinical care.1–3 It has made large amounts of data available to inform clinical decisions and research.4 However, despite advances in EHR systems, many physicians continue to document patient information in an unstructured narrative format, which limits its usability for large-scale, programmatic analyses and its interoperability.5,6 Accordingly, there is a significant need to extract information from unstructured narrative text reports to enhance data standardization and leverage analytic capabilities.6–8 Furthermore, there is a need for these tools and systems to be easily understood by clinicians.8

Conventional natural language processing (NLP) tools have been shown to be able to extract information from text; however, their use is limited in clinical settings, and they are not generally integrated into existing systems.9,10 This limits the potential benefits of extracting valuable insights from narrative data, especially for clinicians.10,11 Large language models (LLMs), although less accurate than corpus-based or rule-based NLP, are significantly cheaper and more accessible to researchers.12 In particular, ChatGPT (OpenAI, San Francisco, CA) was the fastest-adopted internet application in history at the time and has the potential to rapidly change the way clinical data is gathered and analyzed.13 There is a need to evaluate the feasibility of using ChatGPT to extract structured data from clinical notes that contain personal health information.10 As such, in this study, we assess the ability of ChatGPT to extract relevant information from unstructured clinical notes on six clinically relevant tasks.

Methods

We obtained unstructured data from the EHRs used by BJC Healthcare, a large academic medical system in St. Louis, MO, for patients seen in a radiation oncology department between 2018 and 2024. In late 2023, Washington University in St. Louis adopted an internal, firewall-secured ChatGPT instance that can be queried with sensitive clinical data in a Health Insurance Portability and Accountability Act (HIPAA)-compliant manner.14 We queried three different types of unstructured data: oncologist clinic notes (n=2,908,739), radiology reports (n=1,723,485), and pathology narratives (n=95,842). Because many of our questions were specific to a narrow patient population, we created an enriched set of notes by querying for keywords in the procedure types, specialty names, and the note itself. A bag-of-words search was first used to determine which notes could potentially contain the target information, allowing for a more efficient analysis. For example, when using GPT to read radiology reports of brain MRIs, notes were filtered for text that contained the words “MRI” and “brain”.
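The keyword pre-filter described above can be sketched as a simple case-insensitive whole-word search. This is an illustrative reimplementation, not the study's actual query code; the function name and note strings are hypothetical.

```python
import re

def keyword_filter(notes, keywords):
    """Return only the notes whose text contains every keyword as a
    case-insensitive whole word, mirroring a bag-of-words pre-filter
    used to build an enriched note set."""
    patterns = [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE)
                for k in keywords]
    return [n for n in notes if all(p.search(n) for p in patterns)]

# Example: keep only radiology reports mentioning both "MRI" and "brain".
reports = [
    "MRI of the brain shows two enhancing lesions.",
    "CT of the chest without contrast.",
    "Brain MRI: no acute intracranial findings.",
]
filtered = keyword_filter(reports, ["MRI", "brain"])
# filtered keeps the first and third reports
```

Requiring all keywords rather than any keeps the enriched set small, at the cost of missing notes that use synonyms (e.g., "magnetic resonance imaging" spelled out).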

The ability of ChatGPT to retrieve information from clinical text was evaluated in conjunction with the development of a large cancer registry. It is intended that GPT would read clinical notes from oncologists and add extracted information as structured data to the cancer registry for future research.

A HIPAA-secure version of ChatGPT 3.5 turbo 16k was used to read the notes and provide structured output for six clinically relevant tasks; this was important to ensure patient privacy and protect personal health information. The overall study methodology is shown in Figure 1. To ensure reproducibility, the model temperature, a variable that configures word-choice variability, was set to 0. To protect sensitive data, all patient information was stored in a data warehouse and all technical infrastructure was behind a firewall.
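A temperature-0 chat request of the kind described might be assembled as below. This is a minimal sketch: the model identifier, prompt text, and helper name are placeholders, and the institutional firewalled instance would supply its own endpoint and deployment details.

```python
def build_request(note_text: str, task_prompt: str) -> dict:
    """Assemble a chat-completion request body with temperature 0 so that
    repeated runs produce (nearly) deterministic output. The model name
    below is an assumed deployment label, not the study's exact config."""
    return {
        "model": "gpt-3.5-turbo-16k",  # placeholder deployment name
        "temperature": 0,              # minimizes word-choice variability
        "messages": [
            {"role": "system", "content": task_prompt},
            {"role": "user", "content": note_text},
        ],
    }

request = build_request(
    "Patient is a 67-year-old male seen in follow-up...",
    "Report the patient's age at the time of the visit as a single integer.",
)
```

The request body would then be sent to the firewall-secured instance over an authenticated channel; no protected health information leaves the institutional boundary.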

Figure 1.

Figure 1.

Flowchart of our note import, note filtering, annotation, prompt engineering, and GPT analysis.

Each question was asked in four different iterations: zero-shot, zero-shot with context, few-shot, and few-shot with context. Results were compared to our manually annotated standard.

The reference standard was developed through manual annotation of randomly selected clinical notes from each note type. Annotations were reviewed by two radiation oncology residents who assisted with jargon, abbreviations, and contextual understanding.

For each task, the accuracy of GPT was measured in four separate iterations: zero-shot, zero-shot with additional context, few-shot, and few-shot with additional context. In the zero-shot iterations, GPT was prompted only with the relevant task. In few-shot prompt engineering, the model was provided three examples of a clinical note along with the correct output. In two additional iterations, the zero-shot and few-shot prompts were accompanied by additional context, which included explanations of medical terminology or clarifying instructions. For example, when GPT was prompted to use radiology reports to count the number of metastases in a patient’s brain, it was given context explaining the different words radiologists use to describe metastases. It was also given special instructions on how to differentiate metastases in the brain from those near the brain. These clinical instructions were given with the oversight of an oncology expert.
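The four prompt variants can be generated from one template by toggling the context and example components. The wording below is illustrative only; the study's exact prompt text is not reproduced here, and the function and example strings are hypothetical.

```python
def build_prompt(task, note, context=None, examples=None):
    """Assemble one of four prompt variants: zero-shot, zero-shot with
    context, few-shot, or few-shot with context, depending on which
    optional components are supplied."""
    parts = [task]
    if context:
        parts.append(f"Context: {context}")
    for example_note, example_answer in (examples or []):
        parts.append(f"Example note: {example_note}\n"
                     f"Correct output: {example_answer}")
    parts.append(f"Note: {note}")
    return "\n\n".join(parts)

zero_shot = build_prompt(
    "Report the Gleason score as a single number.",
    "Prostate biopsy: Gleason 3+4=7 adenocarcinoma.",
)
few_shot_with_context = build_prompt(
    "Report the Gleason score as a single number.",
    "Prostate biopsy: Gleason 4+4=8 adenocarcinoma.",
    context="The score usually appears in the form 'Gleason a+b=c'.",
    examples=[("Biopsy shows Gleason 3+3=6 disease.", "6")],
)
```

In the study, the few-shot variants carried three full note/output pairs rather than the single short pair shown here, which made the prompts far longer.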

For each task, the performance of the GPT model was assessed against a manually annotated reference standard. The primary metric was the F1 score, calculated as the harmonic mean of precision and recall. The F1 score was chosen because it accounts for both false positives and false negatives, offering a balanced and comprehensive evaluation of the model’s performance. Precision, recall, and accuracy were also calculated.
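For a binary task, the metrics above reduce to simple counts of true positives, false positives, and false negatives. The sketch below shows the standard definitions; the label vectors are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 (the harmonic mean of the two)
    for binary labels, as used to score model output against a
    manually annotated reference standard."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# tp=2, fp=1, fn=1 -> precision = recall = F1 = 2/3
p, r, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```

Because F1 ignores true negatives, it penalizes a model that achieves high raw accuracy simply by predicting the majority class, which is why it was preferred over accuracy alone.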

Results

GPT was used to analyze 297 clinical notes (37-80 notes for each clinical task). Task difficulty was determined by consensus among a team including a physician and multiple data scientists. Of the six tasks given, two were considered difficult, two were considered intermediate difficulty, and two were considered easy. Easy tasks required reporting information that was clearly stated in the text, like the age of the patient and the Gleason score of the tumor. Difficult tasks involved understanding a large amount of complex medical terminology or reporting an answer that was synthesized from multiple pieces of information.

Relevant notes were a mixture of 50 inpatient oncology notes, 167 outpatient oncology notes, 40 pathology reports, and 40 radiology reports. This is shown in Table 1.

Table 1.

The six clinical questions GPT was tasked with answering in structured format using unstructured notes, reports, and narratives. Also displayed are the type of note, the number of notes analyzed, and the difficulty of each task.

Difficulty | Question | Note type | Sample size
Easy | What is the Gleason score for the patient’s prostate cancer? | Inpatient Oncology Notes | 50
Easy | How old was the patient at the time of his/her visit? | Outpatient Oncology Notes | 50
Medium | Is this patient experiencing pain? | Outpatient Oncology Notes | 37
Medium | Is the patient in consideration for palliative care? | Outpatient Oncology Notes | 80
Hard | Were there positive surgical margins on the patient’s surgical pathology? | Pathology Reports | 40
Hard | How many metastases does the patient have in his/her brain? | Radiology Reports | 40

GPT had the highest precision and accuracy when asked to extract the Gleason score from pathology narratives or patient age from clinical notes. Using zero-shot prompt engineering without context, GPT extracted the Gleason score and patient age with an F1 score of 0.89. When prompted with more context, it predicted the Gleason score with an F1 score of 0.99. In contrast, when few-shot prompting was used, the Gleason score report was highly inaccurate (F1=0.21 with no context and F1=0.39 with context).

Because the prompt was so simple, no added-context iterations were run when GPT was asked to report the age of the patient at the time of the appointment. Zero-shot prompting yielded an F1 score of 0.98. When few-shot prompting was used for the same task, the F1 score decreased to 0.04.

GPT was also assessed on more complex tasks. It was prompted to count a patient’s brain metastases from radiology reports and to report whether a tumor had positive surgical margins from pathology reports. These tasks involved interpreting complex medical terminology, synthesizing findings with clinical reasoning, and (particularly in the case of radiology reports) handling indeterminate findings. The F1 scores for GPT reporting whether a tumor had positive margins were 0.50 (zero-shot), 0.82 (zero-shot with context), 0.39 (few-shot), and 0.39 (few-shot with context). The F1 scores for GPT reporting the number of brain metastases were 0.77 (zero-shot), 0.67 (zero-shot with context), 0.10 (few-shot), and 0.30 (few-shot with context). In both cases, few-shot prompting was less successful than zero-shot prompting.

Discussion

In our setting, we discovered that GPT was best at extracting concepts with a simple structure and minimal context dependence. Concepts like binary variables and “score-based” variables performed very well. As an example, the Gleason score was one of our most successful concepts to extract because it has a clearly structured appearance: a very simple format for GPT to parse, with most cases appearing as “Gleason #+#=#”. At times, GPT produced a technically correct result but failed to follow the format instructions, for example reporting Gleason score = “3+4” instead of “7”. This distinction is not clinically inconsequential: Gleason 3+4 disease denotes a pathologic disease pattern that is known to have a less aggressive prognostic course than Gleason 4+3 disease. In general, Gleason 4+3 disease often signifies an indication for treatment with androgen deprivation therapy, an indication that may not exist in Gleason 3+4 disease. It is worth noting that GPT performed better than previous attempts to use NLP to extract Gleason scores from clinical notes.15 Age is very similar in this respect, with the added challenge of requiring GPT to do arithmetic in some cases to find the correct age. There are many more cancer-specific examples of “score-based” variables with clinically validated prognostic or predictive importance in the oncological landscape that could be explored in future experiments.
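The “Gleason #+#=#” pattern is regular enough that even a deterministic parser can handle the common cases. The helper below is a hypothetical illustration, not the study’s pipeline; keeping the primary and secondary patterns separate preserves the clinically meaningful 3+4 vs. 4+3 distinction that the total alone loses.

```python
import re

# Matches forms like "Gleason 3+4=7", "Gleason score 4+3", "Gleason 3 + 3".
GLEASON_RE = re.compile(
    r"Gleason\s*(?:score\s*)?(\d)\s*\+\s*(\d)\s*=?\s*(\d{1,2})?",
    re.IGNORECASE,
)

def extract_gleason(text):
    """Return primary, secondary, and total Gleason components from
    free text, or None if no recognizable pattern is present."""
    m = GLEASON_RE.search(text)
    if not m:
        return None
    primary, secondary = int(m.group(1)), int(m.group(2))
    total = int(m.group(3)) if m.group(3) else primary + secondary
    return {"primary": primary, "secondary": secondary, "total": total}

result = extract_gleason("Prostate biopsy: Gleason 3+4=7 adenocarcinoma.")
# -> {'primary': 3, 'secondary': 4, 'total': 7}
```

A regex like this fails on nonstandard phrasings (“Gleason grade group 2”, narrative descriptions), which is exactly where an LLM reader can add value over rule-based extraction.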

Overall, the additional context provided in the prompt did not consistently yield improved accuracy. We hypothesized that context would improve the accuracy of complex data extraction tasks, but this was not always the case. For example, explaining how to identify positive surgical margins drastically improved accuracy, but explaining how to count brain metastases decreased accuracy. Additionally, when few-shot engineering was added to the prompt, the accuracy of GPT often declined significantly. When extracting the age from the notes, the F1 score decreased from 0.98 in zero-shot to 0.04 in few-shot.

GPT was highly successful in extracting data when the task was relatively simple. When given a zero-shot prompt with no context, GPT extracted the patient’s age at the time of the appointment with near-perfect accuracy (F1 = 0.98). When given additional context, it predicted the patient’s Gleason score highly accurately (F1 = 0.99). It is worth noting that in the absence of an explicitly written age, GPT was able to infer the patient’s age from the date of the appointment and the patient’s date of birth. Additionally, with zero-shot prompting and no context, GPT successfully identified whether a patient was in pain (F1 = 0.86) or receiving palliative care (F1 = 0.86). Structured data extraction was optimal when the task was simple and the data was clearly identifiable.
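The date arithmetic GPT performed when no explicit age was written is the standard whole-years calculation sketched below; the dates are invented for illustration.

```python
from datetime import date

def age_at_visit(dob: date, visit: date) -> int:
    """Age in whole years at the visit date, derived from the date of
    birth, mirroring the inference needed when a note states the DOB
    and visit date but never the age itself."""
    years = visit.year - dob.year
    # Subtract one if the birthday has not yet occurred in the visit year.
    if (visit.month, visit.day) < (dob.month, dob.day):
        years -= 1
    return years

before_birthday = age_at_visit(date(1956, 8, 15), date(2023, 5, 2))   # 66
on_birthday = age_at_visit(date(1956, 8, 15), date(2023, 8, 15))      # 67
```

The birthday correction is the step a naive year subtraction misses, and plausibly the kind of off-by-one arithmetic that few-shot prompting disrupted.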

Regarding the weaknesses of ChatGPT in an applied clinical oncological context, we noticed that concepts with many synonyms, heavy context dependence, or a requirement for GPT to count or categorize performed poorly. For example, the task of counting brain metastases from radiology reports was challenged by medical terminology and by inherent limitations in what radiologists could definitively classify as a metastasis. Given that this task is challenging for physicians, it is impressive that GPT performed as well as it did.

Contrary to our initial expectation, few-shot prompt engineering did not improve GPT’s performance. In most cases, GPT fixated on reproducing one of the three examples given in the few-shot prompt. This is consistent with previous research on the subject.12 It is not known why GPT often performs worse when given few-shot prompting, but we believe it is because the examples provided in the prompts were too complex for effective learning. Most clinical notes were large, often consisting of several hundred words. Additionally, clinical notes were not written in a consistent format, so the relevant information was not always in the same place. It would be difficult for a statistically based language model to associate an output with the relevant information in a large body of text with only a few examples. Future use of GPT in clinical data extraction should account for this bias. Additionally, GPT exhibited decreased performance in cases where the given examples were long. For example, brain MRI radiology reports were often several hundred words long, and the majority of the text communicated results that were extraneous to the question of counting metastases.

Despite the limitations of ChatGPT in extracting structured data from unstructured clinical notes, its low cost and ease of use make it a promising tool for supplementing clinical research. The main alternatives for data extraction from narrative text are NLP and manual annotation, both of which are very expensive. Computing costs with the WUSTL HIPAA-secure GPT instance were estimated to be $3 USD per thousand clinical notes, which was much less expensive than conventional NLP. Currently in the field of clinical data science, NLP is not commonly used because of its high costs and complexity of use.16 When compared to the cost of hiring a doctor to annotate large numbers of notes, large language models are much cheaper. Although GPT is not perfect, it is likely the only feasible option for large-scale data extraction projects. Future research should evaluate whether information added from GPT analysis can increase the quality of large-scale machine learning projects.

Limitations

This study is limited by a small sample size of clinical notes. This limitation was due to the difficulty of annotating notes, rather than a paucity of clinical notes. However, we believe that a larger number of annotations would not significantly impact the results of our study. Although the sample size was small, the annotations were of high quality and validated by expert clinicians. In future work, we plan to conduct a greater number of small studies rather than investing in one study with a larger number of notes, and we intend to utilize GPT structured outputs as features in machine learning models trained to predict clinical outcomes.

One limitation of the study is that the results are not deterministic. The temperature was set to zero, so variance between prompts was expected to be minimal. That being said, results from each prompt are not completely deterministic, and future studies should include multiple trials for each prompt.

Although the latest available version of ChatGPT is GPT-4, our study used GPT-3.5 because a HIPAA-compliant version of GPT-4 was not available to the researchers at the time of the study. Although this presents a limitation, similar research has shown that GPT-3.5 and GPT-4 perform similarly when reading electronic health records.17

This study does not directly compare ChatGPT to more conventional forms of NLP. The authors do not expect large language models to outperform conventional forms of NLP. Rather, the low cost and ease of use makes ChatGPT a strong candidate for widespread use in clinical research. We evaluated GPT on its own merits rather than comparing it to more conventional NLP.

Conclusion

In conclusion, GPT 3.5 presents a viable, cost-effective alternative to conventional NLP methods of large-scale analyses of clinical oncology notes. While it can perform a wide variety of tasks, it exhibits optimal performance when extracting binary or highly structured quantitative concepts. Currently, the high cost of annotation and NLP is a barrier to the widespread use of clinical notes. When GPT is used to extract structured data from clinical notes, our results indicate that few-shot prompt engineering did not improve GPT performance. Finally, GPT performance was dependent on the clinical context, suggesting that GPT output should be validated in each clinical context to which it is applied.

Figures & Tables

Figure 3.

Figure 3.

Comparison of F1 scores between each iteration of prompt engineering for when GPT was used to report the Gleason score and the patient’s age.

Figure 4.

Figure 4.

Comparison of F1 scores between each iteration of prompt engineering for when GPT was used to report if the patient was in palliative care or experiencing pain.

Figure 5.

Figure 5.

Comparison of F1 scores between each iteration of prompt engineering for when GPT was used to count metastases from radiology reports and identify if tumor resections had positive margins.

Table 2:

F1, precision, accuracy, and recall scores for each iteration of each type of structured data GPT was tasked with extracting.


References

  1. Trout KE, Chen LW, Wilson FA, Tak HJ, Palm D. The impact of meaningful use and electronic health records on hospital patient safety. Int J Environ Res Public Health. 2022;19(19):12525. doi:10.3390/ijerph191912525.
  2. Colicchio TK, Cimino JJ, Del Fiol G. Unintended consequences of nationwide electronic health record adoption: challenges and opportunities in the post-meaningful use era. J Med Internet Res. 2019;21(6):e13313. doi:10.2196/13313.
  3. Office for Civil Rights (OCR). HITECH Act enforcement interim final rule. October 28, 2009 [Internet]. [cited 2024 July 29]. Available from: https://www.hhs.gov/hipaa/for-professionals/special-topics/hitech-act-enforcement-interim-final-rule/index.html
  4. Batko K, Ślęzak A. The use of big data analytics in healthcare. J Big Data. 2022;9(1):3. doi:10.1186/s40537-021-00553-4.
  5. Bloom MV, Huntington MK. Faculty, resident, and clinic staff’s evaluation of the effects of EHR implementation. Fam Med. 2010;42(8):562–566.
  6. Pervaz Iqbal M, Manias E, Mimmo L, et al. Clinicians’ experience of providing care: a rapid review. BMC Health Serv Res. 2020;20(1):952. doi:10.1186/s12913-020-05812-3.
  7. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195. doi:10.1186/s12916-019-1426-2.
  8. Kather JN, Ferber D, Wiest IC, et al. Large language models could make natural language again the universal interface of healthcare. Nat Med. 2024. doi:10.1038/s41591-024-03199-w.
  9. Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34–49. doi:10.1016/j.jbi.2017.11.011.
  10. Tamang S, Humbert-Droz M, Gianfrancesco M, Izadi Z, Schmajuk G, Yazdany J. Practical considerations for developing clinical natural language processing systems for population health management and measurement. JMIR Med Inform. 2023;11:e37805. doi:10.2196/37805.
  11. Pavlenko E, Strech D, Langhof H. Implementation of data access and use procedures in clinical data warehouses: a systematic review of literature and publicly available policies. BMC Med Inform Decis Mak. 2020;20(1):157. doi:10.1186/s12911-020-01177-z.
  12. Patil R, Gudivada V. A review of current trends, techniques, and challenges in large language models (LLMs). Appl Sci. 2024;14(5):2074. doi:10.3390/app14052074.
  13. Why ChatGPT is the fastest growing web platform ever | TIME [Internet]. [cited 2024 Aug 7]. Available from: https://time.com/6253615/chatgpt-fastest-growing/
  14. Brian. Washington University ChatGPT Beta is now available. Information Technology. December 19, 2023 [Internet]. [cited 2024 Aug 7]. Available from: https://it.wustl.edu/2023/12/washington-university-chatgtp-beta-is-now-available/
  15. Yu S, Le A, Feld E, et al. A natural language processing–assisted extraction system for Gleason scores: development and usability study. JMIR Cancer. 2021;7(3):e27970. doi:10.2196/27970.
  16. Wilcox AB, Hripcsak G. The role of domain knowledge in automating medical text report classification. J Am Med Inform Assoc. 2003;10(4):330–338. doi:10.1197/jamia.M1157.
  17. Bhattarai K, Oh IY, Sierra JM, et al. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spaCy’s rule-based and machine learning-based methods. JAMIA Open. 2024;7(3):ooae060. doi:10.1093/jamiaopen/ooae060.
