Abstract
Objectives
To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients. To assess appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings.
Materials and Methods
Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with high likelihood of requiring follow-up, further sub-stratified as “definitely actionable” (DA) or “possibly actionable—clinical correlation” (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy identifying either DA or PA-CC findings, then secondarily for DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale.
Results
For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were “hallucinated” outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or minor revision.
Conclusion
GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via “human-in-the-loop” workflows remains critical for clinical implementation.
Keywords: ChatGPT, GPT, radiology, incidentaloma, follow-up, large language model
Background and significance
Incidental findings are abnormalities discovered on imaging tests performed for unrelated reasons. Amidst recent advances in imaging, the widespread adoption of high-resolution technologies has led to a corresponding rise in incidental findings.1–7
The interpretation, communication, and management of incidental findings, especially when their clinical significance is uncertain, can be complex and resource-intensive. Challenges are accentuated in the emergency department (ED) where high patient throughput, care complexity, and frequent provider transitions can hinder appropriate acknowledgement and communication of incidental findings. Approximately 5%-30% of imaging studies include actionable incidental findings (AIFs).1,3–5 Only 10%-49% of incidental findings may be documented within the discharge instructions for ED patients8–11 and as few as 17% of patients discharged from the ED follow-up on their AIFs.12 This gap in care results in adverse patient health outcomes, uneven quality of care delivery, missed opportunities for earlier diagnosis of potential malignancy, and medicolegal repercussions.2,3,10–15
Best practice guidelines recommend that clinicians communicate the presence of AIFs to the patient at the time of the ED visit and document this information into the patient’s discharge instructions.4 Although rates of verbal communication are not easily measured, specific documentation of AIFs into a patient’s discharge summary are associated with timely completion of follow-up.16 In addition, artificial intelligence (AI)-generated summaries can improve the readability and accessibility of written instructions for patients,17,18 thus providing the potential to improve compliance of follow-up.
In this context, advancements in machine learning (ML) and natural language processing (NLP) have shown potential for automated analysis of free-text radiology reports to detect incidental findings10,19–22 and may assist with prompting adherence to best practices. Traditional ML and NLP models have long demonstrated their utility in the identification of incidental findings for ED patients.10,19,21 However, these rule-based algorithms, often relying on key word stems, regular expressions, or extensive fine-tuning required for custom tasks, limit the generalizability of this technology to unseen examples across a broadened range of imaging modalities and across institutions. In contrast, large language models (LLMs) such as GPT-4 may offer an evolved approach benefitting from huge swaths of training data to understand human language and, in turn, the potential to outperform specialized, fine-tuned models with additional adaptability on domain-specific tasks.23 Recently, GPT-4 performed well at identifying incidental adrenal nodules, pancreatic cysts, and vascular calcifications from abdominal CTs.24 Building upon these applications for specific incidental pathologies and other narrow tasks, we sought to investigate GPT-4's performance on a broader scale.
In this quality improvement initiative, we describe our investigation using a HIPAA-compliant instance of GPT-4 to detect and classify AIFs in radiology reports from ED patients using best practice prompt engineering techniques. We also investigate the feasibility of using GPT-4 to assist clinicians in generating patient-facing instructions about these AIFs. This preliminary exploration serves as a foundational analysis to inform future deployment of an LLM-based tool into our live electronic health record environment. This offers the potential to devise timely reminders to ED clinicians about AIFs at critical transition points in the patient’s care (ie, discharge or handoff) and to draft patient-friendly instructions about AIFs that clinician can, after review, opt to incorporate into a patient’s written discharge instructions.
Methods
Classification of incidental findings
We developed a structured classification scheme to delineate and categorize incidental findings found within reports based on likelihood of requiring further action for follow-up testing, imaging, or evaluation after ED care. By this definition, we excluded the following:
Non-incidental findings directly related to the clinical indication for imaging.
Emergent findings requiring immediate attention during the patient’s ED visit.
Incidental findings described as chronic, stable, or unchanged.
Incidental findings were classified into 3 categories (see Figures 1 and 2):
Figure 1.
Instruction prompt to identify and categorize non-emergent, actionable incidental findings (AIFs).
Figure 2.
Sample radiology report with categorizations of incidental findings by actionability.
“Definitely Actionable” (DA): findings with explicit instructions from the radiologist for non-emergent follow-up, or any finding concerning for malignancy.
“Possibly Actionable—Clinical Correlation” (PA-CC): findings with radiologist recommendation for clinical correlation.
“Indeterminately Actionable” (IA): findings not meeting DA or PA-CC criteria.
Derivation of radiology report datasets
We executed multiple structured queries to extract radiology reports of multiple imaging modalities (x-rays, computed tomography scans, and ultrasounds) from our health system’s electronic health record across 4 academic ED sites at NYU Langone Health.
To derive our validation set, we pulled 50 cases over a 1-month period (July to August 2021), half of which contained at least 1 DA or PA-CC finding. The dataset was searched to eliminate duplicate medical record numbers, manually labeled via expert review (see below), and used for the initial experimentation, development, and optimization of our prompt.
For our test dataset, we performed a second query over an extended period (July 2021 to August 2023) across the same locations and imaging modalities that used search terms of type (eg, node, nodule, cyst, lesion), measurement (eg, mm, cm), and instruction (eg, follow-up). We executed a third query without adding the requirement of the presence of the aforementioned search terms to obtain a baseline sample of cases over a 2-day period in November 2023, selecting dates that ensured no overlap in patient visits from prior data queries. After eliminating duplicate medical record numbers, 124 cases from the second query and 306 cases from the third query were randomly selected and combined to form our 430-case test dataset. To ensure a high case mix complexity to test GPT-4 performance while also more closely mirroring real-word incidence, we targeted a positivity rate of 20%-30% for DA or PA-CC findings in this enriched test set. Additionally, each case from our validation set was manually cross-referenced against the 430 cases in our test set to confirm there were no duplications, ensuring that cases from the test set against which our optimized prompt would be applied were unseen from the validation set.
All patient radiology reports were handled with strict confidentiality without de-identification to eliminate duplicate reports and allow for chart review of AIFs. This project met NYU Grossman School of Medicine’s Institutional Review Board criteria for quality improvement work not requiring IRB review.
Expert review/case labeling
Two board-certified emergency physician reviewers (Unblind: KW and GS, 13 and 6 years’ experience, respectively) independently applied the above classification scheme to cases for identification of DA and PA-CC findings due to their high likelihood of actionability. Labeling of IA findings was deferred, as individual reports could contain dozens of indeterminate or normal findings with unspecified clinical significance that were generally non-actionable. Inter-rater reliability of DA and PA-CC identification was assessed using Cohen’s Kappa for a sample of 30 independent cases randomly selected from the same derivation pool as the test cases, prior to randomized selection of the final test set. Agreement was substantial for AIF categorization of actionability, including the possibility of a lack of any AIF. Agreement of classification scheme ratings was 93.1% with K = 0.74, standard error = 0.13, P < .001. Subsequently, each of our 430 test cases were reviewed by 1 of the same 2 emergency physician reviewers. Cases with potentially uncertain categorizations by 1 of the 2 expert reviewers were flagged for further discussion and consensus review. A third rater (board-certified radiologist: SKK, 10 years’ experience) resolved any remaining disagreement.
Task specification
We prompted GPT-4 to perform 3 main functions (see Supplementary Appendix A): (1) review the full text of each radiology report (including indication for imaging) to identify all non-emergent, AIFs and any associated recommendations; (2) categorize the findings by the classification scheme defined above and extract the associated radiologist recommendation, request for clinical correlation, or concern for malignancy associated with each finding; and (3) for DA findings, generate patient-centered instructions that summarize the radiologist’s recommendations for follow-up at a seventh-grade reading level.
Prompt optimization
We employed a HIPAA-compliant instance of GPT-4 v0613 via Microsoft Azure OpenAI’s API to conduct extensive experimentation on our 50-case validation set targeting iterative performance improvement. Multiple trials were performed with varying prompts, encompassing thousands of experimental runs on both targeted individual cases as well as across the dataset. Python scripting was used to perform frequent, automated sample runs, parse free-text GPT-4 responses, and format outputs into discrete data types for clinician review.
Analysis and refinements of the prompt were anchored in prompt engineering best practices, including but not limited to task decomposition, sequential tasking, error checking, chain of thought reasoning, and example-based learning techniques. Parameter settings were maintained throughout prompt iterations as follows: Max Response—1000; Temperature—0.0, Top P—0.95, Frequency Penalty—0; Presence Penalty—0.
Evaluation of GPT-4 outputs
Each DA or PA-CC classification by GPT-4 was graded for presence of a True Positive (TP), False Positive (FP), or False Negative (FN) instance against the manual labelings previously described. In the primary analysis, outputs were graded for accuracy of identifying DA or PA-CC findings, encompassing a broad range of findings with high likelihood of actionability. In the secondary analysis, outputs were graded for accuracy of classifying DA findings alone, limiting analysis to findings with only the highest likelihood of actionability. Consequently, a True Negative (TN) event was defined by the absence of any DA or PA-CC findings per report for the primary outcome, or the absence of any DA findings for the secondary outcome. Multiple AIFs within the same report were counted as separate instances, unless findings were inherently related (eg, 2 renal cysts). FP instances were evaluated for frequency and the type of misclassification error (eg, non-incidental, emergent, indeterminately actionable, or chronic, stable, or unchanged). The numbers of TPs, FPs, and FNs were then used to calculate recall, precision, and F-1 for the primary (DA or PA-CC) and secondary (DA only) analyses, respectively.
To further evaluate performance, all GPT-4 outputs were independently reviewed by 4 practicing ED physicians for the presence of hallucinations of any finding or inferences of any recommendation without explicit foundation. Subsequently, physician reviewers rated the appropriateness and usability of GPT-4-generated patient-facing instructions for all TP DA findings using a 5-point Likert scale (1: Needs COMPLETE revision; 2: Needs MAJOR revision; 3: Needs MODERATE revision; 4: Needs MINOR revision; 5: Needs NO revision). Ratings were reported by mean (standard deviation) and frequency of each response category.
Chart review of cases with DA findings
To evaluate the potential for improvement that GPT-4 could present, we manually reviewed the electronic health records of patients with DA findings present to determine if AIFs were communicated with the patient. Communication was defined as documentation in the chart of having discussed the finding with the patient or inclusion of the finding or recommendations in the patient’s discharge instructions or handoff to an admitting inpatient team.
Results
Validation set
Our 50-case validation set contained 17 DA and 16 PA-CC findings. 50.0% (25/50) of cases were positive for 1 or more DA or PA-CC findings, while 28.0% (14/50) of cases were positive for 1 or more DA findings. Our prompt was iteratively optimized until it demonstrated 100% recall, 89% precision, and 94% F-1 for DA or PA-CC findings and 100% recall, 65% precision, and 79% F-1 for DA findings alone.
Test set
Manual review of our 430-case test set revealed 104 DA and 42 PA-CC findings across 108 unique reports. 25.1% of reports were positive for 1 or more DA or PACC findings, achieving our target positivity between 20% and 30% for our enriched dataset. 18.4% (79/430) of cases were positive for DA findings, with more than 1 per report in 19 cases. Our test dataset included a diverse mix of imaging modalities (42.6% x-rays, 47.0% computed tomography scans, and 10.5% ultrasounds). Each call using our prompt averaged 2735 prompt tokens for input and 227 completion tokens for output, comprising an average of 9.6 cents per report. Our cumulative costs accounting for all experiments conducted amounted to $438.00 (input) and $141.60 (output).
Primary outcome (classification of DA or PA-CC findings)
GPT-4 identification of DA or PA-CC findings performed with 99.3% [95% CI, 95.6%-100.0%] recall, 73.6% [95% CI, 66.8%-79.5%] precision, and 84.5% F-1 (see Table 1).
Table 1.
Contingency table of non-emergent AIFs, on a per-finding basis.
| Manual labeling |
|||||
|---|---|---|---|---|---|
| DA | PA-CC | IAa (among FPs) | Non-AIFa (among FPs) | ||
| GPT-4 categorization | DA | 99 | 9 | 6 | 14 |
| PA-CC | 4 | 33 | 17 | 15 | |
| IAa (among FNs) | 1 | 0 | n/aa | n/aa | |
| Total | 104 | 42 | 23 | 29 | |
Only those IA findings or non-AIFs (ie, non-incidental, emergent, and/or chronic, stable, or unchanged findings) that contributed to FP or FN instances are displayed here.
Fifty-two FP categorizations (see Table 2) were observed over 42 cases (8 cases included 2-3 findings and accounted for 18 occurrences), accounting for a FP rate of 15.3%. FP findings were comprised of 51.9% (27/52) non-incidental findings, 38.5% (23/52) IA findings that were misclassified by GPT-4 as either PA-CC or DA, 13.5%% (7/52) emergent findings, and 1.9% (1/52) finding that were chronic, stable, or unchanged (in this case chronic). (Five non-AIFs were both non-incidental and emergent and a sixth was both non-incidental and chronic.) Among the 23 IA findings that were misclassified by GPT-4 as PA-CC or DA, 9 were linked to GPT-4's inferences beyond what the radiologist stated (see Inferences and Hallucinations section).
Table 2.
Sample false positives representing common types of error with relevant text additionally underlined.
| Radiologist text of finding | Manual grading | GPT-4 categorization | Commentary |
|---|---|---|---|
|
Right adnexal/right lower quadrant fluid—non-incidental and emergent | PA-CC | This should have technically been excluded based on prompt instructions. While clinical correlation was indeed recommended, the finding was considered both non-incidental, since the clinical indication was for right lower quadrant pain and also emergent, since concern for acute appendicitis warrants immediate consideration of a surgical evaluation. |
|
R IJ thrombus—emergent | DA | While GPT-4 was correct in assessing that this was “definitely actionable,” strict adherence to the prompt would have excluded this finding due to its emergent nature. |
| ___ mm calcified extra-axial focus along the right parietal convexity may reflect a small calcified meningioma, considering slight enlargement as compared to prior head CT from [date] measuring then ____ mm. No appreciable mass effect on the subjacent brain parenchyma. | Parietal focus—IA | PA-CC | GPT-4 graded the finding’s actionability based on its own inferences, although on clinical review, this could be a reasonable clinical recommendation that a clinician might want to communicate to the patient. |
| Nonspecific chronic appearing patchy white matter hypodensity with associated volume loss in the left frontal lobe white matter and periventricular region, with slight ex vacuo prominence of the adjacent frontal horn. This could be due to prior ischemia vs less likely chronic demyelinating lesion. | Patchy white matter hypodensity—IA and chronic | PA-CC | GPT-4 assessed the finding as possibly requiring follow-up. The findings typically describe age-related changes, but “less likely chronic demyelinating lesion” could reasonably be assessed by a clinician for any need for further action. |
The most common types of GPT-4 error were classifying as incidental when the finding was related to the indication, or when a finding did not have an accompanying recommendation in the report (IA) but GPT-4 categorized the finding as needing clinical follow-up (DA) or clinical correlation (PA-CC).
Only 1 FN categorization was observed, constituting a 0.7% FN rate (see Table 3).
Table 3.
False Negatives, with relevant text additionally underlined.
| Radiologist text of finding | Manual grading | GPT-4 categorization | Commentary |
|---|---|---|---|
| “IMPRESSION: [] SOFT TISSUE ATTENUATION LESION…REQUIRES CONTRAST-ENHANCED CT CHEST WHICH MAY BE OBTAINED NONEMERGENTLY. LYMPHADENOPATHY/NEOPLASM NOT EXCLUDED. IMPORTANT FINDING. THIS REPORT WILL BE FLAGGED (!) IN EPIC. No evidence of acute fracture or facet dislocation. Severe spondylosis with severe canal stenosis at C4-5 and C5-6; consider nonemergent cervical spine MRI for further evaluation of the canal and cord. Advanced multilevel neural foraminal stenoses. 1 cm hypodense focus in the right thyroid lobe. Nonemergent ultrasound recommended to evaluate further.” | DA | IA | GPT-4 correctly identified the soft tissue lesion and thyroid focus as DA findings but missed the severe cervical spondylosis/stenosis that was followed by a recommendation to consider MRI. Replacement of the semicolon with a period did not correct this error on manual testing. |
| “MODERATE SPLENOMEGALY AND ENLARGED PREDOMINANTLY LEFT-SIDED ILIAC AND INGUINAL AND TO LESSER EXTENT RETROPERITONEAL LYMPH NODES, WITH ASSOCIATED FAT STRANDING. THE DIFFERENTIAL DIAGNOSIS INCLUDES INFLAMMATORY OR INFECTIOUS PROCESS (POSSIBLY WITH THE SOURCE IN THE LEFT LOWER EXTREMITY, GIVEN THE RELATIVE UNILATERALITY OF THESE FINDINGS) VS LYMPHOPROLIFERATIVE DISORDER (INCLUDING LYMPHOMA).” | DA due to concern for malignancy (ie, lymphoma) in the differential | PA-CC | GPT-4 missed identifying concern for neoplasm, cancer, or malignancy when “lymphoproliferative disorder (including lymphoma)” was listed within a list of differential diagnoses. |
| “Mildly enlarged left axillary and borderline enlarged pretracheal lymph nodes. This is amenable to interval follow-up” | DA through consensus in interpreting the level of actionability with the phrase “amenable to follow-up.” | PA-CC | GPT-4 interpretation of “amenable to interval follow-up” as only PA-CC seems understandable given the vague language. |
|
DA (lymph nodes) due to recommendation for follow-up non-contrast CT. | PA-CC, with GPT-4 citing the radiologist recommendation for the hilar lymph nodes as: “There are probably enlarged bilateral hilar lymph nodes, difficult to accurately assess on this noncontrast study.” | Inconsistent terminology between the findings and impression sections (hilar lymph nodes vs hilar adenopathy), lack of recommendation under the “findings” section, while lumping the ultimate recommendation for non-contrast CT with pneumonia likely contributed to this error. GPT-4 appears to have conflated the radiologist’s description of the limitations of the study as a possible recommendation or clinical correlation. |
| “Coarse calcification in the left breast is again noted, with adjacent spiculated appearing parenchymal opacity grossly unchanged since the prior chest CT. Correlate with mammographic evaluation.” | DA | PA-CC, with GPT-4 citing the radiologist recommendation for the breast calcifications as, “Correlate with mammographic evaluation.” | GPT-4 may have misinterpreted a “correlation with a mammographic evaluation” as a “clinical correlation” or assumed that this referred to a mammogram that concurrently or previously happened. |
Secondary outcome (classification of DA findings only)
GPT-4 identification of DA findings only performed at 95.2% [95% CI, 88.6%-98.2%] recall, 77.3% [95% CI, 68.9%-84.1%] precision, and 85.3% F-1 (see Table 1).
Twenty-nine FP DA categorizations (see Table 2) were observed over 25 cases (3 cases accounted for 7 instances), accounting for a FP rate of 8.1%. FP findings were comprised of 41.4% (12/29) non-incidental, 31.0% (9/29) PA-CC, 20.7% (6/29) IA, 17.2% (5/29) emergent, and 3.4% (1/29) chronic, stable, or unchanged findings (in this instance, chronic). (Three findings were both non-incidental and emergent and a fourth was both non-incidental and chronic.)
Five FN DA categorizations were observed (each from different cases), constituting a 1.5% FN rate. In addition to the 1 FN DA finding previously identified above that was misclassified as IA, the remaining 4 FN results were DA findings that the model misclassified as PA-CC (see Table 3).
Inferences and hallucinations
Among our 430 test cases, there were no observed hallucinations of any incidental findings that did not exist in the radiology report. However, in the recommendations for follow-up, 2.8% (12/430) of outputs included text that was inferred by GPT-4 rather than directly summarized from the radiologist’s recommendations (see Table 4).
Table 4.
GPT-4 inferences from radiologist recommendations, with relevant text additionally underlined.
| Finding, followed by radiologist recommendation (if present) | GPT’s output of radiologist recommendations and/or patient instructions | Commentary |
|---|---|---|
| Markedly thickened ____ cm endometrium with a suggestion of cystic changes. Follow-up is recommended | GPT patient instructions: “The lining of your uterus, also known as the endometrium, is thicker than usual and shows some changes that might be cysts. Cysts are small sacs filled with fluid. It is recommended that you have another test to check this area again.” | Follow-up would likely start first with an outpatient GYN referral in order to determine what tests are needed. |
| ___ mm calcified extra-axial focus along the right parietal convexity may reflect a small calcified meningioma, considering slight enlargement as compared to prior head CT from [date] measuring then ___ mm. No appreciable mass effect on the subjacent brain parenchyma. [No follow-up recommendations stated.] | Radiologist recommendations (per GPT): “No specific recommendation for follow-up given, but clinical correlation may be needed given the slight enlargement compared to prior CT.” | Clinical correlation in the setting of an enlarging meningioma would have been an appropriate recommendation, though GPT-4 appears to have inferred this on its own. |
| Partially visualized left renal cystic lesion, incompletely characterized without intravenous contrast. [No follow-up recommendations stated.] | Patient instructions: “There is a small cyst, or fluid-filled sac, in your left kidney. This was not fully seen on the scan, so it is important to have another scan to get a better look at it.” | GPT-4 did not actually recommend repeat imaging, although that is not an unreasonable follow-up recommendation to consider. |
| Mild to moderately thickened esophagus. Esophagitis is a consideration and follow-up is recommended. | Patient instructions: “We also noticed that the wall of your esophagus, which is the tube that connects your throat to your stomach, is thicker than normal. This could be due to inflammation, a condition called esophagitis. You should have more tests to confirm this and decide on the best treatment.” | The patient should be clinically evaluated to determine what (if any) further tests are recommended. |
|
Radiologist recommendations (per GPT): “Increased liver echotexture—Clinical correlation suggested. Nonspecific increased renal cortical echotexture—Clinical correlation suggested.” | Consideration of medical renal disease would be a type of clinical correlation. However, clinicians would have preferred to see that text more faithfully reproduced by GPT-4 instead of its own paraphrasing. GPT-4 then added clinical correlation to the abnormal liver echotexture as well on its own. |
|
Radiologist recommendations (per GPT): “Borderline enlarged ____ mm para-aortic node—Clinical correlation is recommended.” “Prominent soft tissue in the subcarinal region—Clinical correlation is recommended.” “Borderline enlarged ___ mm left paratracheal node—Clinical correlation is recommended.” | GPT-4 added “clinical correlation is recommended” where this was not stated by the radiologist to 3 separate findings. Of note, there was a true recommendation for clinical correlation for a different incidental finding—thickened esophageal wall—within the same radiology report. |
| ___ cm globular densely mineralized lesion within the posterior spinal canal at C3-4 (possibly likely reflecting a calcified meningioma) causing moderate spinal canal stenosis at this level. Follow-up MRI could be performed for further characterization. |
|
In prompt instructions, GPT-4 was asked to state concerns for malignancy only IF present. It appears to have stated lack of concern for malignancy in this version of the reproduction of the recommendations and either built on that to define the lesion as likely benign or researched the definition of a meningioma on its own for the patient instructions. While this was not incorrect, this would require close clinician oversight. |
| ___ cm right ovarian dermoid, which compared to report from in [date] is slightly increased in size. The previous ultrasound images are not available for review. Clinically warranted further evaluation with pelvic MRI can be obtained. | Patient instructions: “We found a small growth, called a dermoid, on your right ovary that has grown slightly since it was last seen in [date]. A dermoid is a type of benign (likely not cancerous) tumor that can contain a variety of different types of tissue. Because it has grown, it may be a good idea to have a special type of imaging called an MRI of your pelvis to get a better look at it.” | GPT-4 appeared to research the definition of a dermoid on its own to arrive at the “benign” designation. While not incorrect, this would require close clinician oversight. In addition, the wording makes it sound more definitively benign than most ED clinicians would want to commit to. |
| [Findings consistent with rare congenital heart/lung syndrome. No further radiology recommendations given.] |
|
GPT-4 graded this known congenital abnormality for the patient as a DA even though it recognized there were no specific follow-up recommendations given by the radiologist. |
| Nonspecific subpleural nodule lingula measuring ____ cm series ___ image ____. No other pulmonary nodules….Punctate nonobstructing intrarenal calculus midpole right kidney. [No follow-up radiology recommendations for either finding.] |
|
GPT-4 recognized that there were no radiology recommendations for both findings, yet decided to add that it “may need to be looked at again” or “may need to be monitored” anyway. GPT appears to be (correctly) intuiting that the presence of these findings requires future monitoring despite any explicit actionable recommendations—illustrating the complexity of our negative group: indeterminately actionable |
| ___ × ____ cm right paraesophageal nodule series ___ image___ increased since prior chest CT when it measured ___ × ____cm. This may represent an enlarged lymph node. Follow-up is recommended. |
|
GPT-4 removed the radiologist description that it could be an enlarged lymph node. It recognized the radiologist only stated “follow-up recommended,” but inferred on its own (not incorrectly) that the follow-up plan was to repeat a scan. |
| The ascending aorta is mildly dilated measuring ____ cm. Follow-up recommended. |
|
Technically not an incorrect inference, but GPT-4 added the recommendation for “another imaging test” on its own. |
Evaluation of performance for patient instructions
The mean (SD) Likert score for GPT-generated patient instructions for TP DA findings was 4.5 (0.65) on a 5-point Likert scale, indicating that reviewers perceived most outputs would require only minor (if any) revisions prior to inclusion into the patient’s discharge instructions. The most frequent rating was a 5 (no revision required), for 55.5% of reports; the range of scores was 3-5 with only a small minority (8.4%) indicated as needing moderate revision (Figure 3).
Figure 3.
Distribution of likert ratings for true positive DA non-emergent incidental findings.
Chart review of cases with DA findings
Seventy-nine out of 430 cases contained 1 or more DA findings. Per chart review, 55.7% (44 of 79) had documentation of communication with the patient about the finding at discharge or with the inpatient team upon admission.
Discussion
Performance
GPT-4 performance was noteworthy, especially given the model’s lack of clinical pretraining and complexity of the task. Nearly all DA or PA-CC findings were successfully screened by GPT-4. In comparison, only about half of DA findings were documented as communicated with the patient upon discharge or with the inpatient team upon admission, highlighting a void that this LLM tool could successfully fill. Outputs did not always reliably distinguish non-emergent DA/PA-CC findings from those that were non-incidental, IA, or emergent. However, the clinical consequences of these FPs are expected to be minor for patients who would already be receiving active treatment for emergent or non-incidental findings.
That the majority of FP misattributions were non-incidental highlights the model’s limitations in fully evaluating a radiologic finding in the context of the study’s clinical indication, which requires deep relational knowledge of medicine and nuanced semantic understanding. Additionally, we observed the model occasionally misattribute a radiologist’s follow-up recommendation to multiple upstream findings in preceding sentences, rather than to the specific antecedent finding to which the recommendation was referring, highlighting challenges in contextual interpretation.
Notably, we observed no cases where a FP finding was “hallucinated” de novo without an associated referenceable finding in the radiology report. However, GPT-4 on occasion inferred actionability of the finding that was not explicitly stated, asserting (reasonably so in most instances) that a finding was benign, or that clinical correlation or repeat imaging was warranted. This highlights the importance of active clinician review and oversight of GPT-4’s outputs.
Clinicians felt that the majority of GPT-4 outputs for TP DA findings were appropriate. However, free-text feedback indicated specific areas for improvement. For example, GPT-4 tended to avoid medical terminology when instructed to simplify the instructions to a seventh-grade reading level, so words like “nodule” could become replaced with ambiguous terms such as “lump.” GPT-4 sometimes paraphrased words that were actual medical terminology, turning lung “opacities” into “shadows,” or “atypical pneumonia” into an “unusual type of pneumonia.” GPT-4 sometimes added overly descriptive language, ranging from unnecessary (eg, describing a thyroid as “shaped like a butterfly, smaller in the middle with 2 wide wings that extend around the side of your throat”) to potentially upsetting (eg, defining a uterus as “the organ where a baby grows during pregnancy,” which would be inappropriate for a patient suffering from possible miscarriage or infertility). GPT-4 also tended to be overly confident in stating what a finding most likely represented but was not yet confirmed (eg, cyst, benign mass), which requires clinician oversight.
Prompt experimentation
Our process of prompt optimization though iterative experimentation was critical for enhancing model performance to complete the designated task utilizing a zero shot approach. Consistent with best practices, our strategy involved observing AI model outputs to varying prompt adjustments, identifying errors or misattributions, and revising the prompt accordingly over time.
We found that task decomposition, the process of dividing a complex task into many smaller, specific tasks within the same prompt, improved our model’s performance significantly. Additionally, sequential tasking proved instrumental in guiding the model’s interpretative processes. We then coupled this with an internal review or error check to revise initial classifications prior to a final output, which further improved GPT-4 accuracy.
Notably, we encountered challenges related to the formatting of GPT-4’s output in early testing. Initial attempts to direct the model’s output towards a JSON structure included specific example-based instruction. We observed that detailed clinical examples often biased clinical classification with resultant degradation of performance, which we attributed to the model’s tendency to overfit to the detailed classifications and format in the provided example. By minimizing detail within examples but maintaining general format guidelines, we achieved 100% adherence to the desired JSON format with minimal degradation.
Limitations
We occasionally employed multiple runs testing on the same prompt to gauge the stability of GPT-4’s responses, an inherent challenge for generative AI models. Anecdotally, this approach revealed a degree of sensitivity to the phrasing of instructions and highlighted occasional instability in outputs. Integrating an internal self-check mechanism within our prompt and adjusting model parameters (eg, temperature) were observed to mitigate some of this inconsistency. However, extensive high-volume testing for further stability assessment was not feasible as a limitation in our methodology. Our work also highlights the inherent challenge in the interpretation of “actionability.” While the distinct categories of DA, PA-CC and IA were clinically reasoned and useful for our purposes here, this inherently introduces some degree of subjectivity into ground truth classification. Our criteria for “actionability” were anchored in explicit radiologist recommendations for follow-up, clinical correlation, or concerns of malignancy, which, in real-life scenarios, may not encompass all clinically relevant AIFs. The PA-CC classification proved to be complex due to ambiguous phrasings suggestive of correlation even to human readers. Differentiating “non-incidental” findings presented its own set of complexities, reflecting the nuanced and subjective nature of clinical decision-making. This underscores the difficulty in translating the subtleties of clinical assessments into a rigid AI-driven categorization framework.
In addition, the derivation of our datasets may harbor unanticipated sources of bias. The use of a key word search to yield a high capture of AIFs may have limited the types of AIFs rendered. To ensure mutual exclusivity of case visits, the baseline cases within our test set were derived from a brief, convenience sample of time (November 2023) several months after the timeframe used for the radiology reports derived via key word search (July 2021 to August 2023). Still, from July 2021 to November 2023, there were no significant differences in use of structured reporting templates by the radiologists that would be anticipated to affect GPT-4's task at hand.
Future directions
Anticipated future advancements in state-of-the-art general LLMs and the evolution of models specialized to the medical domain can also be expected to yield improved performance. At our institution, we plan to silently deploy this model in a controlled framework that enables integration of AIF classification into the provider workflow and monitoring of real-world performance. For findings with high likelihood of actionability (DA or PA-CC), we envision creating clinician-facing alerts at the point of discharge or inpatient handoff as a just-in-time reminder at critical transition points in a patient’s ED visit. For findings with the highest likelihood of actionability (DA only), we envision presenting AI-generated instructions for the clinician to review, edit as needed, and incorporate into the patient’s discharge instructions if desired. It should be noted, however, that the same pressures that may predispose clinicians to overlook AIFs can also expose clinicians to a different but equally important risk of automation bias leading to over-reliance on AI outputs without thorough scrutiny. Deployment of this technology necessitates accompanying safeguards to mitigate such biases and ensure that AI tools are used as aids, not substitutes, for clinical expertise. As AI technology advances, reevaluation of these risks will need to be weighed against the benefits of enhancing clinicians’ ability to deliver informed, personalized, and patient-centered care.
Conclusion
As a novel investigation into the evaluation of GPT-4 performance in a highly relevant medical context, our work demonstrates a general model’s reasonable proficiency in the highly specialized and complex task of discerning and translating radiologic AIFs. We also highlight the need for rigorous scrutiny of GPT use in medicine, reinforcing the notion that while AI can provide significant assistance, it cannot supplant the nuanced judgement of healthcare experts in current state. Through rigorous testing of AI methodologies in an academic setting with real patient data, we endeavor to advance future applications that augment patient care while maintaining the highest standards of patient safety. Our study aims lay the groundwork for future investigations into the downstream applications of this technology to assist with provider-facing clinical decision support and patient-facing health communication.
Supplementary Material
Acknowledgments
We thank Dr Catherine Jamin, Dr Soterios Gyftopoulos, Dr Sarah Spiegel, Dr Leland Chan, and Duo (Walter) Wang for their support. We also thank the NYU Langone Health Medical Center IT for supporting exploratory and API access to the secure, HIPAA-compliant instance of GPT-4 that enabled this work.
Contributor Information
Kar-mun C Woo, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States.
Gregory W Simon, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States.
Olumide Akindutire, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States.
Yindalon Aphinyanaphongs, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY 10016, United States.
Jonathan S Austrian, Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY 10016, United States; Department of Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States.
Jung G Kim, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States; Institute for Innovations in Medical Education, NYU Langone Health, New York, NY 10016, United States.
Nicholas Genes, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY 10016, United States.
Jacob A Goldenring, Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY 10016, United States.
Vincent J Major, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY 10016, United States.
Chloé S Pariente, Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY 10016, United States.
Edwin G Pineda, MCIT Clinical Systems—ASAP application, NYU Langone Health, New York, NY 10016, United States.
Stella K Kang, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Department of Radiology, NYU Grossman School of Medicine, New York, NY 10016, United States.
Author contributions
Kar-mun C. Woo and Gregory W. Simon developed the instruction prompt, manually labeled cases, tested individual radiology reports within GPT-4, and drafted the manuscript. Stella K. Kang provided expert guidance on project direction, aims, labeling schema, dataset design, and arbitrated on manual labeling of cases. Chloé S. Pariente and Vincent J. Major developed and executed the Python code to enable automated high volume GPT testing and format outputs for review. Yindalon Aphinyanaphongs, Jonathan S. Austrian, and Vincent J. Major served as expert AI project mentors, guided project direction, and provided institutional executive leadership. Chloé S. Pariente, Nicholas Genes, Vincent J. Major, Yindalon Aphinyanaphongs, Jonathan S. Austrian, and Edwin G. Pineda assisted with prompt and output optimization. Nicholas Genes, Chloé S. Pariente, and Gregory W. Simon performed data queries to extract ED radiology reports. Olumide Akindutire and Jacob A. Goldenring performed chart reviews of cases with DA findings. Jacob A. Goldenring, Olumide Akindutire, Gregory W. Simon, and Kar-mun C. Woo performed Likert ratings of GPT outputs and reviewed outputs for hallucinations. Jung G. Kim assisted with statistical calculations. All authors reviewed, revised, and approved the submitted version.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was supported by the NYU Langone Health, MCIT. Y.A. and V.J.M. were partially supported by NIH 3UL1TR001445-05 and Y.A. was supported by the National Science Foundation award #1928614 and #2129076.
Conflicts of interest
None declared.
Data availability
Available upon reasonable request.
References
- 1. Berland LL. Overview of white papers of the ACR incidental findings committee II on adnexal, vascular, splenic, nodal, gallbladder, and biliary findings. J Am Coll Radiol. 2013;10(9):672-674. [DOI] [PubMed] [Google Scholar]
- 2. Evans CS, Arthur R, Kane M, et al. Incidental radiology findings on CT studies in the ED: a systematic review and meta-analysis. Ann Emerg Med. 2022;80(3):243-256. [DOI] [PubMed] [Google Scholar]
- 3. Lumbreras B, Donat L, Hernández-Aguado I. Incidental findings in imaging diagnostic tests a systematic review. Br J Radiol. 2010;83(988):276-289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Moore CL, Baskin A, Chang AM, et al. White paper: best practices in the communication and management of actionable incidental findings in emergency department imaging. J Am Coll Radiol. 2023;20(4):422-430. [DOI] [PubMed] [Google Scholar]
- 5. O'Sullivan JW, Muntinga T, Grigg S, Ioannidis JPA. Prevalence and outcomes of incidental imaging findings: umbrella review. BMJ. 2018;361:k2387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Vernooij MW, Ikram MA, Tanghe HL, et al. Incidental findings on brain MRI in the general population. N Engl J Med. 2007;357(18):1821-1828. [DOI] [PubMed] [Google Scholar]
- 7. Smith-Bindman R, Kwan ML, Marlow EC, et al. Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000-2016. JAMA. 2019;322(9):843-856. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Munk M-D, Peitzman AB, Hostler DP, Wolfson AB. Frequency and follow-up of incidental findings on trauma computed tomography scans: experience at a level one trauma center. J Emerg Med. 2010;38(3):346-350. [DOI] [PubMed] [Google Scholar]
- 9. Thompson RJ, Wojcik SM, Grant WD, Ko PY. Incidental findings on CT scans in the emergency department. Emerg Med Int. 2011;2011:624847. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Dutta S, Long WJ, Brown DFM, Reisner AT. Automated detection using NLP and radiologists' recommendations for additional imaging of incidental findings. Ann Emerg Med. 2013;62(2):162-169. [DOI] [PubMed] [Google Scholar]
- 11. Berge P, Darsonval A, Nedelcu C, Paisant A, Aubé C. Incidental findings on emergency CT scans: predictive factors and medico-economic impact. Eur J Radiol. 2020;129:109072. [DOI] [PubMed] [Google Scholar]
- 12. Liao GJ, Liao JM, Lalevic D, Zafar HM, Cook TS. Location, location, location: the association between imaging setting and follow-up of findings of indeterminate malignant potential. J Am Coll Radiol. 2019;16(6):781-787. [DOI] [PubMed] [Google Scholar]
- 13. Barrett TW, Garland NM, Freeman CL, et al. Catching those who fall through the cracks: integrating a follow-up process for emergency department patients with incidental radiologic findings. Ann Emerg Med. 2022;80(3):235-242. [DOI] [PubMed] [Google Scholar]
- 14. Hansra SS, Loehfelm TW, Wilson M, Corwin MT. Factors affecting adherence to recommendations for additional imaging of incidental findings in radiology reports. J Am Coll Radiol. 2021;18(2):233-239. [DOI] [PubMed] [Google Scholar]
- 15. Moore CL, Kadom N, Seidenwurm D, et al. Incidental findings: a survey of radiologists and emergency physicians. J Am Coll Radiol. 2021;18(6):853-856. [DOI] [PubMed] [Google Scholar]
- 16. Kwan JL, Yermak D, Markell L, Paul NS, Shojania KG, Cram P. Follow up of incidental high-risk pulmonary nodules on computed tomography pulmonary angiography at care transitions. J Hosp Med. 2019;14(6):349-352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Lee T-C, Staller K, Botoman V, Pathipati MP, Varma S, Kuo B. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023;165(2):509.e7-511.e7. 10.1053/j.gastro.2023.04.033. [DOI] [PubMed] [Google Scholar]
- 18. Kirchner GJ, Kim RY, Weddle JB, Bible JE. Can artificial intelligence improve the readability of patient education materials? Clin Orthop Relat Res. 2023;481(11):2260-2267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Evans CS, Dorris HD, Kane MT, et al. A NLP and ML approach to ID incidental radiology findings in trauma patients discharged from the ED. Ann Emerg Med. 2023;81(3):262-269. [DOI] [PubMed] [Google Scholar]
- 20. Fu S, Leung LY, Wang Y, et al. Natural language processing for the identification of silent brain infarcts from neuroimaging reports. JMIR Med Inform. 2019;7(2):e12109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Kang SK, Garry K, Chung R, et al. Natural language processing for identification of incidental pulmonary nodules in radiology reports. J Am Coll Radiol. 2019;16(11):1587-1594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Pham AD, Névéol A, Lavergne T, et al. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinformatics. 2014;15(1):266. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Nori H, Lee YT, Zhang S, et al. 2023. Can generalist foundation models outcompete special-purpose tuning? A case study in medicine. arXiv, arXiv:2311.16452v1 [cs.CL].
- 24. Bhayana R, Elias G, Datta D, Bhambra N, Deng Y, Krishna S. Use of GPT-4 with single-shot learning to identify incidental findings in radiology reports. AJR Am J Roentgenol. 2024;222(3):e233065. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Available upon reasonable request.



