Evaluating GPT-4 as an academic support tool for clinicians: a comparative analysis of case records from the literature

BL Fabre; MAF Magalhaes Filho; PN Aguiar, Jr; FM da Costa; B Gutierres; WN William, Jr; A Del Giglio

doi:10.1016/j.esmorw.2024.100042

. 2024 May 13;4:100042. doi: 10.1016/j.esmorw.2024.100042

Evaluating GPT-4 as an academic support tool for clinicians: a comparative analysis of case records from the literature

BL Fabre ¹, MAF Magalhaes Filho ^1,^∗, PN Aguiar Jr ², FM da Costa ¹, B Gutierres ², WN William Jr ², A Del Giglio ³

PMCID: PMC12836686 PMID: 41647785

Abstract

Background

Artificial intelligence (AI) and natural language processing (NLP) advancements have led to sophisticated tools like GPT-4.0, allowing clinicians to explore its utility as a health care management support tool. Our study aimed to assess the capability of GPT-4 in suggesting definitive diagnoses and appropriate work-ups to minimize unnecessary procedures.

Materials and methods

We conducted a retrospective comparative analysis, extracting clinical data from 10 cases published in the New England Journal of Medicine after 2022 and inputting this data into GPT-4 to generate diagnostic and work-up recommendations. Primary endpoint: the ability to correctly identify the final diagnosis. Secondary endpoints: its ability to list the definitive diagnosis as the first of the five most likely differential diagnoses and determine an adequate work-up.

Results

The AI could not identify the definitive diagnosis in 2 out of 10 cases (20% inaccuracy). Among the eight cases correctly identified by the AI, five (63%) listed the definitive diagnosis at the top of the differential diagnosis list. In terms of diagnostic tests and exams, the AI suggested unnecessary procedures in two cases, representing 40% of the cases where it failed to correctly identify the final diagnosis. Moreover, the AI could not suggest adequate treatment for seven cases (70%). Among them, the AI suggested inappropriate management for two cases, and the remaining five received incomplete or non-specific advice, such as chemotherapy, without specifying the best regimen.

Conclusions

Our study demonstrated GPT-4’s potential as an academic support tool, although it cannot correctly identify the final diagnosis in 20% of the cases and the AI requested unnecessary additional diagnostic tests for 40% of the patients. Future research should focus on evaluating the performance of GPT-4 using a more extensive and diverse sample, incorporating prospective assessments, and investigating its ability to improve diagnostic and therapeutic accuracy.

Key words: artificial intelligence in health care, clinical decision support systems

Highlights

•
GPT-4 identified the final diagnosis in 80% of the cases.
•
GPT-4 suggested unnecessary procedures in 20% of cases and inadequate treatment recommendations in 70%.
•
GPT-4 identifies the final diagnosis within the top five differential diagnosis 80% of the time.

Introduction

Artificial intelligence (AI) has become an important asset for the field of data science, particularly in medical research.¹ Accurate diagnosis is crucial for effective management of patient problems in health care. Medical professionals often rely on their specialized knowledge and a variety of patient scenarios to guide clinical judgments. However, the increasing complexity of cases, particularly those requiring referral to specific specialties, and the continuous advancement of medical knowledge highlight the need for more comprehensive digital diagnostic support. In fact, a single-center study reported diagnostic error rates of 2%,² while a systematic review found that error rates in older adult patients could exceed 10%.³

The development of advanced AI and natural language processing (NLP) has resulted in the invention of sophisticated systems that clinicians could use as health care management support tools.⁴ AI chatbots, such as OpenAI’s ChatGPT, utilize a sophisticated language model called a pre-trained generative transformer (GPT) that is based on NLP.⁵ This model can generate responses comparable to those generated by humans in answering user questions.⁶

Nevertheless, AI chatbots are not immune to constraints and hazards.⁷^,⁸ The limitations include concerns regarding transparency,⁹ insufficient specialist medical knowledge, outdated medical information, inherent biases, and the possibility of spreading inaccurate information, referred to as ‘hallucinations’ in AI parlance.¹⁰ Previous studies have indicated that the diagnostic precision of the differential diagnosis lists produced by ChatGPT for clinical vignettes, which are succinct narratives used in studies to depict a clinical situation, may change. A study that assessed the diagnostic accuracy of the GPT-4 AI model in a series of challenging cases demonstrated that in 64% of cases, the model included the final diagnosis in its differential.¹¹ On the other hand, another study evaluated the accuracy of differential diagnosis lists generated by ChatGPT-3 for clinical vignettes containing the correct diagnosis within the five items lists was 83%.¹²

Our study aimed to assess the efficacy of GPT-4 in providing a correct diagnosis and clinical investigation suggestions in clinical oncology using clinical case data from cancer cases discussed at some of the weekly conferences held at the Massachusetts General Hospital (MGH) and published in the New England Journal of Medicine (NEJM).

Materials and Methods

Study design

We conducted a retrospective comparative analysis, extracting relevant clinical data from 10 cases published in the NEJM after 2022, as GPT-4, at the time of this study, included data up to 2021. In so doing, we wanted to preclude the possibility of the GPT-4 model already being exposed to the selected cases.

Subsequently, we input these data into the GPT-4, which is a NLP model, to obtain differential diagnosis and work-up recommendations. The primary endpoint of our study was to assess the effectiveness of the model in accurately determining the final diagnosis. In addition, secondary endpoints included evaluating the model’s capacity to accurately identify the correct diagnosis among the top five possibilities and its effectiveness in recommending correct tests for a diagnostic evaluation.

This study was conducted at the Oncology Department of the Beneficência Portuguesa de São Paulo, São Paulo, Brazil, and ABC Foundation School of Medicine.

Endpoints

Primary endpoint: the ability to correctly identify the final diagnosis. Secondary endpoints: its ability to list the definitive diagnosis as the first of the five most likely differential diagnoses and determine an adequate work-up.

Ethical considerations

Since this study utilized data obtained from previously published anonymous case reports, it did not require formal approval from an ethics committee or the need for individual consent.

Prompt created for GPT-4 to generate differential diagnosis lists and investigation recommendations

We utilized ChatGPT, an implementation of the GPT-4 models (version from March 2023; ChatGPT-4, OpenAI, LLC), on 10 August 2023, for our analysis. This GPT-4 model has not received specific training for the purpose of medical diagnosis. The physician entered the following text into the prompt: ‘You are an AI-assisted physician with a vast knowledge base up to September 2021. I will deliver patient data, including sex, age, main complaint, history of present illness, family history, social history, past medical history, allergies, gynecological history, clinical history, physical examination, laboratory findings, imaging findings, clinical evolution, and surgery results (if available). Based on this information, you will generate five possible diagnostic hypotheses, ranked in order of probability (from most probable to least likely). For each diagnosis, you will explain the rationale behind your conclusions. While you cannot access external tools such as medical imaging software or machine learning programs, you will use your pre-existing knowledge to suggest potential diagnoses and recommend further investigations. Please note that your response is intended for educational purposes only and should not be used as a substitute for professional medical advice. After describing your findings as requested, you will present a summarized table format, with the first column containing the diagnosis, and the second column listing recommendations for the next steps in investigation. Be sure to include any references utilized in your analysis. We acknowledge the limitations of the AI model and the importance of consulting with a qualified health care professional for accurate diagnoses and treatment decisions’.

The challenge was designed to prompt the ChatGPT model to generate a thorough and inclusive list of possible differential diagnoses. The selection of this prompt was based on initial testing, during which a variety of prompts were methodically evaluated for their effectiveness in stimulating the models to generate comprehensive lists of diseases. To eliminate any potential impact from previous answers, we methodically erased the previous discussion before presenting new sets of medical situations. We selected the top five differential diagnosis produced by ChatGPT-4.

Results

Representative examples of differential diagnosis lists, performance in exam requests and recommended management, are presented in Table 1.

Table 1.

Detailed comparison of diagnostic assessments and treatment recommendations between GPT-4 and the New England Journal of Medicine (NEJM) for 10 clinical cases

Case	Case diagnosis (NEJM)	Top diagnosis (GPT-4)	Correct exams by GPT-4	Proposed treatment (NEJM)	GPT-4 indication	Treatment discrepancy
Case 1	Breast cancer-associated thrombotic microangiopathy	1. Microangiopathic hemolytic anemia (MAHA) related to systemic sclerosis 2. Atypical hemolytic uremic syndrome (aHUS) 3. Thrombotic thrombocytopenic purpura (TTP) 4. Cancer-associated MAHA 5. Drug-induced hemolytic anemia	Yes, but only for the third diagnostic hypothesis	Chemotherapy	Treat main disease	Partial
Case 2	Cushing syndrome due to ectopic secretion of adrenocorticotropic hormone from small-cell lung cancer	1. Small-cell lung cancer with paraneoplastic syndrome 2. Adrenal insufficiency 3. Hyperaldosteronism 4. Delirium due to metabolic causes 5. Drug-induced psychiatric symptoms	Yes	Mifepristone	Steroids	Yes
Case 3	Adrenocortical carcinoma	1. Adrenal adenoma 2. Adrenocortical carcinoma 3. Cushing syndrome 4. Pheochromocytoma 5. Other adrenal lesion or incidentaloma	Yes	Mitotane	Send to oncologist	Yes
Case 4	Meningioma, meningothelial type, World Health Organization grade 1	1. Optic nerve sheath meningioma 2. Atypical optic neuritis 3. Sarcoidosis 4. Neuroretinitis 5. Inflammatory orbital pseudotumor	Yes	Radiotherapy (RT)	Surgery	Yes
Case 5	Poorly differentiated adenocarcinoma of the lung complicated by superior vena cava syndrome	1. Superior vena cava syndrome (SVCS) due to lung cancer 2. Non-small-cell lung cancer (NSCLC) 3. Small-cell lung cancer (SCLC) 4. Lymphoma 5. Infection (e.g. pneumonia or lung abscess)	NA	Stent and RT	Stent and RT	No
Case 6	Linitis plastica (invasive gastric adenocarcinoma)	1. Gastric cancer 2. Linitis plastic 3. Peptic ulcer 4. Gastritis 5. Gastroparesis	Yes	Chemotherapy	Staging-dependent decision	Partial
Case 7	Well-differentiated pancreatic insulin-secreting neuroendocrine tumor (insulinoma)	1. Insulinoma 2. Reactive hypoglycemia 3. Hypoglycemia due to insulin or oral hypoglycemic agent misuse 4. Adrenal insufficiency 5. Hypothyroidism	Yes	Surgery	Surgery	No
Case 8	Plasma-cell myeloma post-transplantation lymphoproliferative disorder (PTLD)	1. Plasma-cell myeloma or PTLD (plasma-cell subtype) 2. Plasmacytoma or isolated plasma-cell neoplasm 3. Central nervous system graft-versus-host disease (CNS-GVHD) 4. Central nervous system infection 5. Demyelinating disease (e.g. multiple sclerosis)	No	Daratumumab single-agent	Antineoplastic agents’ combination (not precise)	Yes
Case 9	Pancreatic ductal adenocarcinoma	1. Pancreatic cancer 2. Autoimmune pancreatitis 3. Pancreatic neuroendocrine tumor 4. Chronic pancreatitis 5. Metastatic cancer from another primary site	Yes	Surgery—FOLFIRINOX	Staging-dependent decision	Yes
Case 10	High-grade B-cell lymphoma, not otherwise specified	1. Lymphatic obstruction due to lymphoma or other malignancy 2. Protein-losing enteropathy 3. Sarcoidosis 4. Tuberculosis 5. Congestive heart failure	Yes	DA-EPOCH-R	Antineoplastic agents’ combination (not precise)	Yes

Open in a new tab

Each row represents a distinct case with the following columns:

Case: Lists the case number (1-10).

Case diagnosis (NEJM): Provides the official diagnosis as reported by New England Journal of Medicine (NEJM).

Top diagnosis (GPT-4): Lists the top five differential diagnoses generated by GPT-4.

Correct exams by GPT-4: Indicates whether GPT-4 suggested the correct diagnostic tests (‘Yes’, ‘No’, or ‘NA’ for not applicable).

Proposed treatment (NEJM): Describes the treatment recommended by NEJM.

GPT-4 indication: Shows the treatment suggested by GPT-4.

Treatment discrepancy: Specifies whether there is a discrepancy between the NEJM and GPT-4 treatment recommendations (‘Yes’, ‘No’, or ‘Partial’ for partial discrepancy).

The AI was able to identify the definitive diagnosis within the top five differential diagnoses in 8 out of the 10 cases. Among the eight cases whose diagnosis was correctly identified by the AI, only five had the definitive diagnosis listed as the top differential diagnoses in the generated list (Figure 1).

Furthermore, GPT-4 suggested inappropriate diagnostic tests in two cases and failed to suggest adequate treatment in seven cases. In two instances, it suggested improper management, and in the remaining five, it provided incomplete or non-specific advice, such as chemotherapy, without specifying the best regimen.

Discussion

Our research evaluated the precision of the ChatGPT-4-generated differential diagnosis lists for oncological patients by utilizing information extracted from case reports. ChatGPT-4 achieved an 80% concordance in correctly diagnosing the right disease within the top five differential diagnoses that were listed. Additionally, it achieved a 63% concordance in identifying the correct disease as the first option in the list of differential diagnoses it generated. These results closely correspond to the observations made by Hirosawa et al.,¹³ who reported rates of 81% and 60%, respectively. The similarity of results can be explained by the similarity in the evaluation method using clinical vignettes of complex cases from case reports.

In contrast, Kanjee et al. revealed that the generative AI model (GPT-4) accurately diagnosed the correct condition in 64% of challenging situations and identified it as the top diagnosis in only 39% of cases.¹¹ Methodological discrepancies between both studies may be partially explained by our choice of using cases from the NEJ M clinicopathologic conferences. These conferences involve difficult medical situations, comprehensively investigated. The richness of clinical data in these vignettes may improve GPT-4 performance and therefore produce higher rates of successful diagnosis as compared with the real world. We feel, however, that a 20% diagnostic error rate is still too high to validate ChatGPT-4 routine use at this time in differential diagnosis and work-up suggestions for oncological patients. Furthermore, our results also showed that in terms of the suggestions for diagnostic tests, ChatGPT-4 did poorly as it did with therapeutic advice. Therefore, the use of ChatGPT-4 at this time may increase the costs of work-up of oncological diseases.

This study has several limitations. Although these case studies provided significant insights into complex diagnostic scenarios, the small number of case reports included do not cover the full spectrum of patient clinical presentations. Therefore, the extent to which our findings may be applied to different situations is restricted in terms of external validity and generalizability. This study also does not account for the possibility of self-learning abilities that ChatGPT-4 may have, which could improve its performance in the future.

Conclusions

Our study demonstrates that the current potential of GPT-4 as a diagnostic support tool is unwarranted. The rate of 20% of misdiagnosis and the non-adequate diagnostic tests suggested for work-up may increase the overall costs of health care of cancer patients. In the future, through self-learning based on many more cases, performance of ChatGPT may improve and become a much better tool to help physicians care for cancer patients.

Acknowledgments

Funding

None declared.

Disclosure

The authors have declared no conflicts of interest.

References

1.Hunter D.J., Holmes C. Where medical statistics meets artificial intelligence. N Engl J Med. 2023;389(13):1211–1219. doi: 10.1056/NEJMra2212850. [DOI] [PubMed] [Google Scholar]
2.Harada Y., Otaka Y., Katsukura S., Shimizu T. Effect of contextual factors on the prevalence of diagnostic errors among patients managed by physicians of the same specialty: a single-centre retrospective observational study. BMJ Qual Saf. 2023 doi: 10.1136/bmjqs-2022-015436. [DOI] [PubMed] [Google Scholar]
3.Skinner T., Scott I., Martin J. Diagnostic errors in older patients: a systematic review of incidence and potential causes in seven prevalent diseases. Int J Gen Med. 2016;9:137–146. doi: 10.2147/IJGM.S96741. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Wani S.U.D., Khan N.A., Thakur G., et al. Utilization of artificial intelligence in disease prevention: diagnosis, treatment, and implications for the healthcare workforce. Healthcare (Basel) 2022;10(4):608. doi: 10.3390/healthcare10040608. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Curtis N. ChatGPT. To ChatGPT or not to ChatGPT? The impact of artificial intelligence on academic publishing. Pediatr Infect Dis J. 2023;42(4):275. doi: 10.1097/INF.0000000000003852. [DOI] [PubMed] [Google Scholar]
6.Kung T.H., Cheatham M., Medenilla A., et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2) doi: 10.1371/journal.pdig.0000198. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Vaishya R., Misra A., Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metab Syndr. 2023;17(4) doi: 10.1016/j.dsx.2023.102744. [DOI] [PubMed] [Google Scholar]
8.Lee P., Bubeck S., Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–1239. doi: 10.1056/NEJMsr2214184. [DOI] [PubMed] [Google Scholar]
9.Zheng H., Zhan H. ChatGPT in scientific writing: a cautionary tale. Am J Med. 2023;136(8):725–726.e6. doi: 10.1016/j.amjmed.2023.02.011. [DOI] [PubMed] [Google Scholar]
10.Haug C.J., Drazen J.M. Artificial intelligence and machine learning in clinical medicine, 2023. N Engl J Med. 2023;388(13):1201–1208. doi: 10.1056/NEJMra2302038. [DOI] [PubMed] [Google Scholar]
11.Kanjee Z., Crowe B., Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80. doi: 10.1001/jama.2023.8288. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Hirosawa T., Harada Y., Yokose M., Sakamoto T., Kawamura R., Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20(4):3378. doi: 10.3390/ijerph20043378. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Hirosawa T., Kawamura R., Harada Y., et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11 doi: 10.2196/48808. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib1] 1.Hunter D.J., Holmes C. Where medical statistics meets artificial intelligence. N Engl J Med. 2023;389(13):1211–1219. doi: 10.1056/NEJMra2212850. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Harada Y., Otaka Y., Katsukura S., Shimizu T. Effect of contextual factors on the prevalence of diagnostic errors among patients managed by physicians of the same specialty: a single-centre retrospective observational study. BMJ Qual Saf. 2023 doi: 10.1136/bmjqs-2022-015436. [DOI] [PubMed] [Google Scholar]

[bib3] 3.Skinner T., Scott I., Martin J. Diagnostic errors in older patients: a systematic review of incidence and potential causes in seven prevalent diseases. Int J Gen Med. 2016;9:137–146. doi: 10.2147/IJGM.S96741. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Wani S.U.D., Khan N.A., Thakur G., et al. Utilization of artificial intelligence in disease prevention: diagnosis, treatment, and implications for the healthcare workforce. Healthcare (Basel) 2022;10(4):608. doi: 10.3390/healthcare10040608. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Curtis N. ChatGPT. To ChatGPT or not to ChatGPT? The impact of artificial intelligence on academic publishing. Pediatr Infect Dis J. 2023;42(4):275. doi: 10.1097/INF.0000000000003852. [DOI] [PubMed] [Google Scholar]

[bib6] 6.Kung T.H., Cheatham M., Medenilla A., et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2) doi: 10.1371/journal.pdig.0000198. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Vaishya R., Misra A., Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metab Syndr. 2023;17(4) doi: 10.1016/j.dsx.2023.102744. [DOI] [PubMed] [Google Scholar]

[bib8] 8.Lee P., Bubeck S., Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–1239. doi: 10.1056/NEJMsr2214184. [DOI] [PubMed] [Google Scholar]

[bib9] 9.Zheng H., Zhan H. ChatGPT in scientific writing: a cautionary tale. Am J Med. 2023;136(8):725–726.e6. doi: 10.1016/j.amjmed.2023.02.011. [DOI] [PubMed] [Google Scholar]

[bib10] 10.Haug C.J., Drazen J.M. Artificial intelligence and machine learning in clinical medicine, 2023. N Engl J Med. 2023;388(13):1201–1208. doi: 10.1056/NEJMra2302038. [DOI] [PubMed] [Google Scholar]

[bib11] 11.Kanjee Z., Crowe B., Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80. doi: 10.1001/jama.2023.8288. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Hirosawa T., Harada Y., Yokose M., Sakamoto T., Kawamura R., Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20(4):3378. doi: 10.3390/ijerph20043378. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Hirosawa T., Kawamura R., Harada Y., et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11 doi: 10.2196/48808. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Evaluating GPT-4 as an academic support tool for clinicians: a comparative analysis of case records from the literature

BL Fabre

MAF Magalhaes Filho

PN Aguiar Jr

FM da Costa

B Gutierres

WN William Jr

A Del Giglio

Abstract

Background

Materials and methods

Results

Conclusions

Highlights

Introduction

Materials and Methods

Study design

Endpoints

Ethical considerations

Prompt created for GPT-4 to generate differential diagnosis lists and investigation recommendations

Results

Table 1.

Figure 1.

Discussion

Conclusions

Acknowledgments

Funding

Disclosure

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Evaluating GPT-4 as an academic support tool for clinicians: a comparative analysis of case records from the literature

BL Fabre

MAF Magalhaes Filho

PN Aguiar Jr

FM da Costa

B Gutierres

WN William Jr

A Del Giglio

Abstract

Background

Materials and methods

Results

Conclusions

Highlights

Introduction

Materials and Methods

Study design

Endpoints

Ethical considerations

Prompt created for GPT-4 to generate differential diagnosis lists and investigation recommendations

Results

Table 1.

Figure 1.

Discussion

Conclusions

Acknowledgments

Funding

Disclosure

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases