PLOS One. 2024 Jul 31;19(7):e0307383. doi: 10.1371/journal.pone.0307383

Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians

Ali Hadi 1, Edward Tran 1, Branavan Nagarajan 1, Amrit Kirpalani 1,2,*
Editor: Fateen Ata
PMCID: PMC11290643  PMID: 39083523

Abstract

Background

ChatGPT is a large language model (LLM) trained on over 400 billion words from books, articles, and websites. This extensive training base makes it potentially valuable as a diagnostic aid, and its capacity to comprehend and generate human language allows medical trainees to interact with it, enhancing its appeal as an educational resource. This study aims to investigate ChatGPT’s diagnostic accuracy and utility in medical education.

Methods

150 Medscape case challenges (September 2021 to January 2023) were inputted into ChatGPT. The primary outcome was the number (%) of cases for which the answer given was correct. Secondary outcomes included diagnostic accuracy, cognitive load, and quality of medical information. A qualitative content analysis was also conducted to assess its responses.

Results

ChatGPT answered 49% (74/150) of cases correctly. It had an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, a specificity of 82.89%, and an AUC of 0.66. Most answers were of low cognitive load (51%; 77/150), and most answers were complete and relevant (52%; 78/150).

Discussion

ChatGPT in its current form is not accurate as a diagnostic tool. Despite the vast amount of information it was trained on, ChatGPT does not reliably provide factually correct answers. Based on our qualitative analysis, ChatGPT struggles with the interpretation of laboratory values and imaging results, and may overlook key information relevant to the diagnosis. However, it still offers utility as an educational tool. ChatGPT was generally correct in ruling out specific differential diagnoses and providing reasonable next diagnostic steps. Additionally, its answers were easy to understand, showcasing a potential benefit in simplifying complex concepts for medical learners. Our results should guide future research into harnessing ChatGPT’s potential educational benefits, such as simplifying medical concepts and offering guidance on differential diagnoses and next steps.

Introduction

Artificial Intelligence (AI) refers to computer systems that can perform tasks that require human intelligence, such as visual perception, decision-making, and language understanding [1]. Natural Language Processing (NLP), a crucial field in AI, focuses on the interaction between human language and computer systems [2]. NLP algorithms are capable of analyzing and generating human language, making them valuable tools in various sectors, including healthcare [2].

In the healthcare sector, NLP can be applied in several ways, such as in clinical documentation, coding and billing, monitoring drug safety, and keeping track of patients [3–5]. Large Language Models (LLMs) are a type of NLP model that can perform various language tasks, such as text completion, summarization, translation, and question-answering [6]. LLMs are trained on massive amounts of text data and can generate human-like responses to natural language queries [6, 7].

ChatGPT is a Large Language Model (LLM) developed by OpenAI, capable of performing a diverse array of natural language tasks [8]. At the moment, ChatGPT is arguably the most well-known, commercially available LLM. Its widespread accessibility appeals to a broad audience, including medical trainees and physicians, who are likely to be curious about its performance in a clinical setting.

A study recently found that ChatGPT was able to accurately answer biomedical and clinical questions on the United States Medical Licensing Examination (USMLE) at a level that approached or exceeded the passing threshold [9]. The study also found that ChatGPT’s accuracy was characterized by high concordance and density of insight, indicating its potential to generate novel insights and assist in medical education [9]. While these results have ignited discussions around potential implications for ChatGPT in healthcare, they also highlight the potential use of this tool in medical education. Whereas the ability of ChatGPT to answer concise, encyclopedic questions has been studied, the quality of its responses to complex medical cases remains unclear [10].

In this study, we aim to evaluate ChatGPT’s performance as a diagnostic tool for complex clinical cases to explore its diagnostic accuracy, the cognitive load of its answers, and the overall relevance of its responses. We aim to understand the potential benefits and limitations of ChatGPT in clinical education.

ChatGPT is powered by Generative Pre-trained Transformer (GPT) 3.5, an LLM trained on a massive dataset of text with over 400 billion words from the internet including books, articles, and websites [8]. However, this dataset is private and therefore lacks transparency as users have no convenient means to validate the accuracy or the source of the information being generated. We plan to conduct qualitative analysis to evaluate the quality of medical information ChatGPT provides.

While ChatGPT is able to generate novel responses that closely resemble natural human language [11], it lacks genuine comprehension of the content it receives or produces.

Once again, this underscores the importance of evaluating the responses provided by ChatGPT. While responses may sound grammatically correct and offer correct medical information, it is essential to assess their overall relevance to the medical question at hand so as not to mislead medical trainees.

Medscape Clinical Challenges include complex cases that are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. The cases are often based on real-world scenarios and may involve multiple comorbidities, unusual presentations, and diagnostic dilemmas [12]. By employing these challenges, we can evaluate ChatGPT’s ability to answer medical queries, diagnose conditions, and select appropriate treatment plans in a context that closely resembles actual clinical practice [13].

Materials and methods

Artificial intelligence

ChatGPT operates as a server-based language model, meaning it cannot access the internet. All responses are generated in real-time, relying on the abstract associations between words ("tokens") within the neural network. This constraint mirrors real-life clinical settings where professionals do not have the freedom to easily access additional scientific literature and also allows us to accurately evaluate ChatGPT’s knowledge.

Input source

We tested the performance of ChatGPT in answering Medscape Clinical Challenges. These complex cases are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. These challenges present a clinical scenario that includes patient history, physical examination findings, and laboratory or imaging results. Healthcare professionals are required to make a diagnosis or choose an appropriate treatment plan using multiple-choice questions [12]. Feedback is provided after each answer with explanations of the correct diagnosis and treatment plan. The distribution of answer options selected by Medscape users is also provided. This feedback mechanism allows an accurate evaluation of ChatGPT’s responses compared to correct answers and also allows us to directly compare its thought process and decision making to healthcare professionals.

Medscape’s Case Challenges were selected because they are open-source and freely accessible. To prevent any possibility of ChatGPT having prior knowledge of the cases, only those authored after its training cutoff in August 2021 were included, ensuring that each case presented was entirely novel to the model and that it could not already know the answers.

Data collection

Data were collected by three authors, all medical trainees (A.H., B.N., and E.T.), and all content was reviewed by a staff physician (A.K.). We felt it was most appropriate for medical trainees to be the primary evaluators of ChatGPT’s responses, given that medical trainees are the group most likely to rely on it as an external resource. The three authors (A.H., B.N., and E.T.) used publicly available clinical case challenges from Medscape, published between September 2021 and January 2023, after the training cutoff of ChatGPT’s 3.5 model. A total of 150 Medscape cases were analyzed; cases were randomized among the three authors, with each case reviewed by at least two authors. We excluded any cases with visual assets, such as clinical images, medical photography, and graphs, to ensure the consistency of the input format for ChatGPT.

Input and prompt standardization

To ensure consistency in the input provided to ChatGPT, the three independent reviewers transformed the case challenge content into one standardized prompt. Each prompt included an unbiased script of what we wanted from the output, followed by the relevant case presentation and multiple-choice answers. Standardizing the prompts ensures consistent and reproducible responses across different users and addresses OpenAI’s restriction on using ChatGPT for healthcare advice.

Prompts were standardized as follows; all information is available in the data extraction supplementary file:

Prompt 1: I’m writing a literature paper on the accuracy of CGPT of correctly identified a diagnosis from complex, WRITTEN, clinical cases. I will be presenting you a series of medical cases and then presenting you with a multiple choice of what the answer to the medical cases.

Prompt 2: Come up with a differential and provide rationale for why this differential makes sense and findings that would cause you to rule out the differential. Here are your multiple choice options to choose from and give me a detailed rationale explaining your answer.

[Insert multiple choices]

[Insert all Case info]

[Insert radiology description]
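As an illustration only, the standardized prompt above could be assembled programmatically; the function and placeholder names below are hypothetical and not taken from the study’s data extraction file:

```python
# Illustrative sketch: assembling the study's standardized Prompt 2.
# The function name and placeholder fields are hypothetical.
PROMPT_2 = (
    "Come up with a differential and provide rationale for why this "
    "differential makes sense and findings that would cause you to rule out "
    "the differential. Here are your multiple choice options to choose from "
    "and give me a detailed rationale explaining your answer.\n\n"
    "{choices}\n\n{case_info}\n\n{radiology}"
)

def build_prompt(choices: str, case_info: str, radiology: str = "") -> str:
    """Fill the standardized template with one case's content."""
    return PROMPT_2.format(choices=choices, case_info=case_info,
                           radiology=radiology)
```

In the study, the multiple-choice options, case information, and radiology description were supplied in this order for every case.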

ChatGPT interaction and data extraction

The standardized prompts were input into ChatGPT using the legacy 3.5 model, and the model generated responses containing the suggested answer to the case challenge as well as background information on the disease, reasons for ruling in the diagnosis, and reasons for ruling out other diagnoses.

Primary outcome assessment

Each case was evaluated by at least two independent raters (A.H., B.N., or E.T.) who were blinded to each other’s responses. ChatGPT’s responses were extracted, and the primary outcome was analyzed as the percentage of cases for which the answer given was correct.

Secondary outcome assessment

All cases were evaluated by at least two independent raters (A.H., B.N., or E.T.). To assess secondary outcomes, we employed three validated medical education evaluation scales:

  1. Diagnostic Accuracy: The raters assessed the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates of ChatGPT’s answers, considering the suggested differentials and the final diagnosis provided. Each case had four answer options, and ChatGPT’s explanation for each of the four answer options was categorized as either true or false, positive or negative [13]. We then calculated the accuracy, precision, sensitivity, and specificity as shown:

    Accuracy: (TP + TN)/Total Responses

    Precision: TP/ (TP + FP)

    Sensitivity: TP/ (TP + FN)

    Specificity: TN / (TN + FP)

    To further evaluate the model’s performance, we generated a Receiver Operating Characteristic (ROC) curve and calculated the Area Under the Curve (AUC). This involved collecting model scores or probabilities for each instance, sorting instances based on their scores, iterating thresholds to calculate True Positive Rate (TPR) and False Positive Rate (FPR) for each threshold, plotting the FPR against the TPR to create the ROC curve, and computing the AUC to quantify the model’s discriminative ability. This thorough analysis provided both visual representation and scalar measurement to assess the model’s efficacy in diagnostic accuracy.

  2. Cognitive Load: The raters evaluated the cognitive load of ChatGPT’s answers as low, moderate, or high, based on the complexity and clarity of the information provided according to the following scale [14]:

    Low cognitive load: The answer is easy to understand and requires minimal cognitive effort to process

    Moderate cognitive load: The answer requires moderate cognitive effort to process

    High cognitive load: The answer is complex and requires significant cognitive effort to process

  3. Quality of Medical Information: The raters assessed the quality of the medical information provided by ChatGPT according to the following criteria:

    Complete: The answer includes all relevant information for making an accurate diagnosis

    Incomplete: The answer is missing some relevant information for making an accurate diagnosis

    Relevant: The answer includes information that is directly relevant to the diagnosis

    Irrelevant: The answer includes information that is not directly relevant to the diagnosis

    Using the above scale, answers were categorized as one of: complete/relevant, complete/irrelevant, incomplete/relevant, and incomplete/irrelevant [15].

    Discrepancies between raters were resolved through discussion and consensus. In order to assess the inter-rater reliability of our outcomes, we used Cohen’s Kappa coefficient. This statistical measure evaluates the agreement between two raters who each classify items into mutually exclusive categories. It is particularly useful in this study, as it accounts for any agreement that might occur by chance, which is important given the variability of responses from ChatGPT.
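The threshold-sweep procedure described under Diagnostic Accuracy can be sketched in Python. This is a minimal illustration, not the authors’ code; it assumes one model score per answer option and a binary label marking whether that option was the correct diagnosis:

```python
def roc_auc(scores, labels):
    """ROC points and AUC via the threshold sweep described above.

    scores: model-assigned score/probability for each answer option
    labels: 1 if the option was the correct diagnosis, else 0
    Assumes both positive and negative labels are present.
    """
    P = sum(labels)              # number of positives
    N = len(labels) - P          # number of negatives
    # Sweep a threshold over every distinct score, highest first.
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        points.append((fp / N, tp / P))          # (FPR, TPR)
    points.append((1.0, 1.0))
    # Trapezoidal integration of TPR over FPR yields the AUC.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc
```

Plotting the returned (FPR, TPR) points gives the ROC curve; the trapezoidal sum quantifies the model’s discriminative ability as a single scalar.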

Content analysis

A content analysis was conducted on ChatGPT’s responses to identify patterns of strength and weakness. This analysis focused on the model’s ability to rule out specific differential diagnoses, provide reasonable diagnostic steps, and interpret laboratory values, specialized diagnostic testing, and imaging results. Additionally, we assessed the model’s ability to consider key information relevant to the diagnosis.

Data analysis

Results

A total of 150 Medscape cases were included in the analysis (see Table 1), with a total of 600 answer options (four per case) provided to ChatGPT.

Table 1. Summary of ChatGPT’s performance on MedScape clinical case challenges.

Case Case Name Answer Correct?
1 Internal Medicine Case Challenge: A Teacher’s Assistant With Bipolar Disorder Has Lung Problems no
2 A 27-Year-Old Factory Worker With Incontinence and Imbalance yes
3 Cardio Case Challenge: A 17-Year-Old in Cardiac Arrest After Collision Playing Sports yes
4 A 21-Year-Old Man With Epigastric Pain After a Wild Party yes
5 A 19-Year-Old With Hypercholesterolemia, Transaminitis, and IBD yes
6 Emergency Med Case Challenge: A 46-Year-Old Beauty Pageant Winner With Sudden Blindness no
7 Gastro Case Challenge: Excruciating Abdominal Pain in a Woman Taking Benzodiazepines and Narcotics no
8 A Teenager Shot Multiple Times Develops Further Complications yes
9 After Consuming Alcohol With Raw Beef, a Man Has Seizure, Pain no
10 Fingernail, Toenail Changes and Flank Pain in a 20-Year-Old no
11 Dermatology Case Challenge: Colorful Skin Patches on a Man With Fatigue Who Smokes Cigars yes
12 Diarrhea, PPI Use, and Pain in a Restaurant Worker From Mexico yes
13 A Woman With AF After Husband’s Death, Grandkids’ Drug Abuse yes
14 Emergency Med Case Challenge: Hemorrhoids, Urinary and Blood Infections in a Woman With Rigors yes
15 Morning Stiffness, Dry Eyes, Back Pain in a Fit 58-Year-Old no
16 Gastro Case Challenge: A Coffee Drinker With Chronic Diarrhea, Epigastric Pain, and Fever yes
17 Oncology Case Challenge: A Construction Worker Who Drinks Daily Has an Eyelid Lesion yes
18 A 22-Year-Old Football Player Who Collapsed Has Urine Changes no
19 A Woman With Multiple New Sexual Partners Has Fatigue, Pain no
20 Endo Case Challenge: Pubic Hair and Violent Behavior in a Strong 19-Month-Old Girl no
21 A 45-Year-Old Teacher With a Groin Rash That Is Spreading yes
22 Emergency Med Case Challenge: Pain, Wheezing in a Nonverbal Man Who Keeps Rubbing His Chest yes
23 Recurrent UTIs, Ulcerations, Foot Drop in 50-Year-Old Woman yes
24 Dermatology Case Challenge: Painful Lesions, Open Wounds on a 45-Year-Old Woman no
25 Palpitations, Cough in a Woman Who Lives Next to a Zookeeper yes
26 Neuro Case Challenge: A 35-Year-Old With Angry, Aggressive Outbursts, Memory Loss, and Insomnia yes
27 Gastro Case Challenge: Pain, Vomiting in a 48-Year-Old on Levothyroxine, Metformin yes
28 A Woman Who Owns a Hot Tub and Chickens Has Dyspnea, Cough yes
29 Emergency Case Challenge: After Argument, Unresponsive Woman Found By Her Boyfriend no
30 Beer, Aspirin Worsen Nasal Issues in a 35-Year-Old With Asthma yes
31 Endo Case Challenge: Amenorrhea for Months, Mood Swings, Weight Gain in a 38-Year-Old Woman yes
32 Rectal Bleeding in a 47-Year-Old Farmer Who Can’t Pass Flatus no
33 Derm Case Challenge: Rash on Chest, Buttocks and Toenail Changes in a Middle-Aged Man yes
34 Gastro Case Challenge: A 33-Year-Old Man Who Can’t Swallow His Own Saliva yes
35 A Recently Married 27-Year-Old With Hot Flashes, Amenorrhea yes
36 Delirious, Incontinent 45-Year-Old Found Crawling on the Floor no
37 Violent Cough, Slurred Speech, and Ptosis in a Middle-Aged Man yes
38 Emergency Med Case Challenge: A 41-Year-Old on Sildenafil With a Headache While Sleeping yes
39 After Unprotected Sex, 50-Year-Old Has Rash, Severe Weakness no
40 Endo Case Challenge: Rash, Brain Fog, and Sleep Issues in a 50-Year-Old IT Director yes
41 Psychiatry Case Challenge: Nightmares and Poor Grades in a Third Grader Allergic to Cats no
42 ED Case Challenge: After New Sexual Partner, Dysuria, Discharge in a 21-Year-Old no
43 Loss of Taste, Rash, and Dyspnea in a 46-Year-Old With GERD yes
44 A 27-Year-Old Woman With Constant Headache Too Tired to Party no
45 Intentional Overdose in a Suicidal 28-Year-Old With Lupus no
46 Oncology Case Challenge: A Daily Beer Drinker With Bruises, Back Pain, and Bleeding yes
47 Neurology Case Challenge: A Man With Buttocks Pain, Bladder and Bowel Incontinence no
48 A Man With Hypokalemia, Sleep Apnea, and Resistant Hypertension no
49 An Adopted 43-Year-Old With Bad Breath, Dyspnea, Dysphagia no
50 Gastro Case Challenge: A Daily Cannabis User With Sharp, Intense Epigastric Pain yes
51 A 51-Year-Old Man Avoiding Sexual Intercourse Due to Rectal Pain yes
52 Neuro Case Challenge: A 16-Year-Old With Quadriparesis After Respiratory Infection yes
53 Psychiatry Case Challenge: Alarming Behavior in a 26-Year-Old Soldier and Father of Three yes
54 A Scrotal Rash Lasting Months in a Man With Genital Edema no
55 A 22-Year-Old Female College Athlete With Wild Mood Swings yes
56 Pediatric Case Challenge: A 7-Year-Old Boy With a Limp and Obesity Who Fell in the Street yes
57 Endo Case Challenge: A Cannabis User With Excessive Sweating and Syncope at Work no
58 An Athletic Teen Suddenly Prone to Falls and Fractures yes
59 An Office Worker With Abdominal Cramps, Burning Chest, Dyspnea yes
60 Cardio Case Challenge: A Confused 35-Year-Old With Headache, Fever, and Sore Chest no
61 A Woman With Back, Chest Pain After Eating Wings at a Restaurant no
62 Neurology Case Challenge: A 19-Year-Old With Tinnitus, Vision Problems, and Headaches no
63 Ob/Gyn Case Challenge: A 33-Year-Old Woman Trying to Conceive Has Dyspnea, Pain no
64 A Patient Who Collapsed in Agony After Echocardiography yes
65 Oncology Case Challenge: A 45-Year-Old Father Seeking Vasectomy Has Alarming Findings no
66 A Divorced Man With Back Pain After Trip With New Girlfriend no
67 Facial Spasms in a Man Recently Released From the Hospital no
68 Gastro Case Challenge: A Woman Who Abstains from Alcohol Has Worsening Abdominal Pain no
69 Oncology Case Challenge: A Retired Man With Left Upper Quadrant Pain, Leukocytosis yes
70 A Coffee Drinker With Sudden-Onset Dyspnea, Tachycardia yes
71 PCP Case Challenge: Lesions on the Hands, Palms, and Feet of a 57-Year-Old Man yes
72 A 36-Year-Old Woman With Flatulence and Memory Problems yes
73 A School Nurse With Anxiety, Diarrhea, Palpitations, and Cough yes
74 Cardio Case Challenge: Syncope in a 53-Year-Old Woman With Dyspnea and Morning Chest Pain yes
75 A 12-Year-Old With Urinary Retention Who Can’t Grasp Objects no
76 Oncology Case Challenge: A 46-Year-Old Mother With Severe, Constant Abdominal Pain yes
77 A 42-Year-Old Tennis Player With Dyspnea Blamed on Anxiety no
78 Neurology Case Challenge: Drooling and Dysphagia in a Man Who Can’t Speak no
79 Urination Problems After Procedure in a Man Treated for BPH yes
80 Endo Case Challenge: A 36-Year-Old Has Cramping, Lung Issues and Can’t Lose Weight yes
81 Seizure After Sudden Headache in a 16-Year-Old Cyclist no
82 Cardiology Case Challenge: Worsening Chest Pain After a Respiratory Infection in a Man With Hypertension no
83 A 53-Year-Old Waitress With a Cough and Constant Back Pain no
84 Gastro Case Challenge: An Incarcerated 24-Year-Old With Dyspnea, Fatigue, and Chronic Nausea yes
85 17-Year-Old With Hair Loss, Dysmenorrhea, Thrush, and Diarrhea no
86 Sexually Active Man With Foreign-Body Feeling, Eye Discharge yes
87 Case Report: Cardiac Arrest in a Man Who Has Overdosed yes
88 Primary Care Case Challenge: An Accountant With Bilateral Neck Masses yes
89 A 13-Year-Old Athlete With Chest Pain, Cough After Practice no
90 Oncology Case Challenge: A 37-Year-Old Woman With Multiple Fibroadenomas yes
91 Vaginal Discharge, Fever in Pregnant Woman After Hawaii Trip yes
92 Emergency Medicine Case Challenge: An Active-Duty Soldier With a Burning, Spreading Rash and Sore Throat yes
93 A 42-Year-Old With Declining Cognition and Frequent Vomiting yes
94 Emergency Medicine Case Challenge: A Young Girl With Discolored Feet, Facial Swelling, and Cough yes
95 Endocrinology Case Challenge: A 55-Year-Old With Impotence, Decreased Libido, and Hyponatremia yes
96 5-Month-Old Rushed to the ED for Severe Abdominal Distention yes
97 A 23-Year-Old Unaware She’s Pregnant With Hematuria, ECG Abnormalities yes
98 An Anxious Hiker With Recurring Annular Rash and Sleep Loss no
99 Psychiatry Case Challenge: A 9-Year-Old With Suicidal Behavior no
100 Neurology Case Challenge: Visual and Auditory Hallucinations in a Patient With Parkinson Disease no
101 A 16-Year-Old Girl With Full-Body Rash, Dyspnea, and Swelling yes
102 A Sexually Active 30-Year-Old Woman With Rash and Wrist Pain yes
103 A Retired Teacher With a Constant Headache and Vomiting no
104 Star Athlete With a Blinking Fixation Struggling in College yes
105 Seizures in a 42-Year-Old Who Left a Hospital Against Advice yes
106 Edible Marijuana Use, Chest Pain, and Cough in a 53-Year-Old no
107 After Drinking 21 Beers, a 27-Year-Old Can’t Stop Vomiting yes
108 A Noncompliant Construction Worker With a Pulsating Abdomen yes
109 A 47-Year-Old With Progressive Dyspnea and Weepy Nodules no
110 What’s Causing This Rapidly Growing, Golf Ball–Sized Mass? no
111 Recurrent Syncope in a 30-Year-Old Whose Uncle Died Suddenly no
112 A Veteran With Lesions, Alcohol Use, and Opioid Dependence no
113 A Nonverbal 33-Year-Old Woman With Intellectual Impairment yes
114 A Mail Carrier With Gross Hematuria Whose Sister Has Lupus no
115 A Sexually Active 23-Year-Old With Seizures and Tongue Pain no
116 A 17-Year-Old With Hallucinations About Martians and Paranoia no
117 A 30-Year-Old With a Full-Body Rash, Vomiting, and Confusion no
118 An Adopted 42-Year-Old With Slurred Speech and Memory Loss yes
119 A 26-Year-Old With Fever and Malaise Now Can’t Tie His Shoes yes
120 Decreased Speech and Jerky Eye Movements in a ’Clumsy’ Toddler yes
121 A 50-Year-Old With Telangiectasia, Cough, and Epistaxis yes
122 After Travel, a 50-Year-Old Grandfather Has Dyspnea, Fever yes
123 A 28-Year-Old Soccer Player With Odd Abdominal Pain, Fatigue yes
124 A 47-Year-Old With Diplopia, Limb Tingling, and Imbalance no
125 A 47-Year-Old With Diplopia, Limb Tingling, and Imbalance yes
126 A 53-Year-Old Social Media Worker With Dysphonia and Paresis no
127 After a Wild Party, a 24-Year-Old Has Intense Abdominal Pain no
128 An Accountant Who Loves Aerobics With Hiccups and Incoordination yes
129 A Woman With DVT After a Flight, Anemia, and Bowel Changes yes
130 A 37-Year-Old Man With Chest Pain and Elbow/Eyelid Papules no
131 A Marijuana User With Sudden Chest Pain Radiating to His Neck no
132 A Farmer With Diffuse Pruritus and a Suntan That Won’t Fade no
133 A 28-Year-Old Writer With Bilious Vomiting After Egg Donation yes
134 A 35-Year-Old Soldier With Galactorrhea and Amenorrhea no
135 A 52-Year-Old Man With a Hole in His Jaw and Alcoholism yes
136 A 57-Year-Old Man With a Fever Who Can’t Stop Bleeding yes
137 Abdominal Pain, Anemia, and Oliguria in a Distressed Woman yes
138 A Sexually Active 29-Year-Old Man With a Weak Urine Stream no
139 A 51-Year-Old Who Lost Her Job Due to Cognitive Decline yes
140 Strange Stool Color and Fatigue in a Man With COPD and Atrial Fibrillation yes
141 Painful, Discolored Toes With Sores in a 43-Year-Old Woman yes
142 Pleural Effusion and an Axillary Mass in a Woman With Hypertension ***(not diagnostic case, cancer origin case)*** yes
143 Penis Injury and Hematuria in a Man Who Fell on a Log yes
144 A 25-Year-Old Mother With Joint Pain Who Feels Faint yes
145 A 42-Year-Old Office Assistant With Chronic Leg and Back Pain yes
146 Blackout at Rest and Slurring in a Man Afraid of COVID-19 yes
147 Chronic Gastritis, a Lesion, and Weight Loss in a Teenager no
148 Clitoromegaly, Amenorrhea, and Hair Loss in a 32-Year-Old no
149 A Daily Beer Drinker With Agonizing Gas and Back Pain no
150 A Former Cocaine User Whose Specialist Told Her She’s Dying yes

Primary outcome

Out of the 150 cases, ChatGPT provided the correct answer in 74/150 (49%) of cases (Fig 1). In 92/150 (61%) of cases, ChatGPT selected the same answer as the majority of Medscape users for the same question.

Fig 1. Percentage of correct answers, most common answer and correct answer despite the majority incorrect by ChatGPT 3.5 with MedScape clinical case challenges.


Secondary outcomes

Diagnostic accuracy

There were a total of 150 questions, each with four multiple-choice options, resulting in 600 possible answers with only one correct answer per question. We found true positives in 73/600 (12%), false positives in 77/600 (13%), true negatives in 373/600 (62%), and false negatives in 77/600 (13%) (Fig 2). ChatGPT demonstrated an accuracy of 74%, with a precision of 49%. Its sensitivity was 49%, while it achieved a specificity of 83%. The AUC for the ROC curve was 0.66.
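The reported metrics follow directly from these counts; a quick arithmetic check using the values above and the formulas from the Methods section:

```python
# Confusion-matrix counts reported for the 600 answer options.
TP, FP, TN, FN = 73, 77, 373, 77

total = TP + FP + TN + FN                      # 600 answer options
accuracy    = (TP + TN) / total                # (73 + 373) / 600
precision   = TP / (TP + FP)                   # 73 / 150
sensitivity = TP / (TP + FN)                   # 73 / 150
specificity = TN / (TN + FP)                   # 373 / 450

print(f"accuracy={accuracy:.0%} precision={precision:.2%} "
      f"sensitivity={sensitivity:.2%} specificity={specificity:.2%}")
# → accuracy=74% precision=48.67% sensitivity=48.67% specificity=82.89%
```

These match the figures in the abstract; the rounded 49%/83% values in this section are the same quantities to fewer decimal places.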

Fig 2. Confusion matrix evaluating the diagnostic accuracy of ChatGPT 3.5, considering each answer within the 150 MedScape clinical case challenges.


Cognitive load

Out of the 150 responses, 78/150 (52%) were categorized as low cognitive load, 61/150 (41%) as moderate cognitive load, and 11/150 (7%) as high cognitive load (Fig 4).

Fig 3. Receiver Operating Characteristic (ROC) curve for the diagnostic accuracy of ChatGPT 3.5 answers within 150 MedScape clinical case challenges.


Quality of medical information

Responses were complete and relevant in 78/150 (52%) of cases. No responses (0/150, 0%) were complete but irrelevant, 64/150 (43%) were incomplete yet relevant, and 8/150 (5%) were both incomplete and irrelevant (Fig 5).

Fig 4. Cognitive load of ChatGPT 3.5 answers given in response to 150 MedScape clinical case challenges.


Cohen’s kappa for diagnostic accuracy, cognitive load, and quality of medical information was 0.78 (substantial inter-rater reliability), 0.64 (substantial inter-rater reliability), and 1.0 (perfect agreement), respectively.
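Cohen’s kappa as used here can be sketched as follows; the labels in the test are illustrative, not the study’s data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Corrects observed agreement for the agreement expected by chance,
    computed from each rater's marginal label frequencies.
    Assumes the raters are not in total chance agreement (p_e < 1).
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from the marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 1.0, as observed for quality of medical information, means the raters labelled every case identically.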

Fig 5. Quality of medical answers given by ChatGPT 3.5 response to 150 MedScape clinical case challenges.


Content analysis

We collated the main strengths of ChatGPT’s responses into four themes: clinical rationale, identifying pertinent positives and negatives, ruling out differential diagnoses, and suggesting future investigations. A representative sample of responses, including rationale, is provided in Table 2.

Table 2. Qualitative analysis of the strengths associated with ChatGPT’s answers in response to MedScape clinical case challenges.

Strengths
THEME CASE GPT Response Rationale
Clinical rationale A Noncompliant Construction Worker With a Pulsating Abdomen “The most likely diagnosis in this case is Abdominal Aortic Aneurysm (AAA). The reason for this is due to the physical examination finding of a widened distal circumference of the abdominal aorta, which was confirmed on ultrasound with a diameter of 4.4 cm. The Doppler effect applied to the ultrasound also showed no flow in the area of suspected thrombus or plaque buildup within the vessel, further suggesting an AAA.” Very consistent in presenting rationale for its findings. Explains the underlying mechanisms of diseases, facilitating education and serving as a refresher on disease manifestations or the rationale behind GPT’s diagnostic analysis.
Identify pertinent positives and negatives Gastro Case Challenge: Excruciating Abdominal Pain in a Woman Taking Benzodiazepines and Narcotics “The most likely diagnosis for this patient is primary epiploic appendagitis. The patient’s symptoms of severe abdominal pain, nausea, and lack of fever, chills, melena, or hematochezia are all consistent with the presentation of primary epiploic appendagitis. The patient’s history of overweight, sedentary lifestyle, and previous abdominal surgeries, as well as her previous unremarkable CT scans, also support the diagnosis” Appropriate for identifying pertinent positives and negatives. This provides a fallback option for medical students and clinicians to evaluate even if the final diagnosis is incorrect.
Ruling out specific differential diagnoses Morning Stiffness, Dry Eyes, Back Pain in a Fit 58-Year-Old “The negative result HLA-B27 is a less supportive finding, as this gene is strongly associated with Ankylosing Spondylitis, but a negative result does not rule out the disease. Additionally, the negative results for rheumatoid factor and anti-CCP antibody levels support a diagnosis of Ankylosing Spondylitis as opposed to a diagnosis of rheumatoid arthritis.” ChatGPT demonstrated an ability to rule out specific differential diagnoses by justifying its reasons for excluding certain conditions based on the case information provided
Suggest future investigations Cardio Case Challenge: A 17-Year-Old in Cardiac Arrest After Collision Playing Sports “To rule out other differential diagnosis, I suggest to perform a follow-up ECG, Echo-cardiogram, and Holter monitoring to see if the patient has any underlying heart issue and if there is any arrhythmia or other cardiac event that could be related to the cardiac arrest. Also, a genetic test for Long-QT syndrome is recommended." Presents missing investigations, and future steps to rule out other diagnoses and to provide more clinical bases to rule in a diagnosis

The model’s main weaknesses were categorized as: misinterpretation of numerical values, inability to handle images, difficulty with nuanced diagnoses, hallucinations, and neglected information (Table 3).

Table 3. Qualitative analysis and examples of the weaknesses associated with ChatGPT’s answers in response to MedScape clinical case challenges.

Weaknesses
THEME CASE GPT Response Rationale
Misinterprets numerical values Gastro Case Challenge: A Coffee Drinker With Chronic Diarrhea, Epigastric Pain, and Fever “Chronic pancreatitis associated with diverticulosis is ruled out due to the lack of a history of heavy alcohol consumption and normal lipase level.” ChatGPT may struggle with interpreting numerical values because it relies on the context and language patterns learned during training, which may not always prioritize specific numeric details. As a result, the model might overlook critical numeric information or misinterpret its significance, leading to incorrect conclusions, such as ruling out chronic pancreatitis when the patient’s lipase was elevated at 160 U/L (reference range, 10–140 U/L)
Cannot evaluate imaging A 21-Year-Old Man With Epigastric Pain After a Wild Party “The abnormality seen on the x-ray in the patient is most likely an implanted medical device.” Without interpretation of the x-ray findings, ChatGPT is very limited in providing the correct diagnosis.
Difficulty with nuanced diagnoses Psychiatry Case Challenge: Nightmares and Poor Grades in a Third Grader Allergic to Cats “Most likely diagnosis is Generalized Anxiety Disorder. This differential is supported by the patient’s symptoms of frequent waking up in the middle of the night with feelings of breathlessness and fright, difficulty falling asleep on the night before important tests, and tightness in the chest during periods of increased anxiety such as during recess”
General Anxiety Disorder resembles Panic Disorder, but the differentiation is that the patient is fearful of the physical symptoms and that they appear “out of the blue”
GPT 3.5 struggles with diseases that have similar yet subtly different presentations. While the patient showed performance-related anxiety and family concerns, generalized anxiety disorder seemed less likely due to their primary symptoms being centred around dysphoric physical symptoms—pointing it to panic disorder. The intricate process of psychiatric diagnosis requires understanding patients and seeing situations through their lens, which ultimately determines their diagnosis.
Hallucinations Edible Marijuana Use, Chest Pain, and Cough in a 53-Year-Old “The elevated cardiac troponin levels (0.012 ng/mL and 0.016 ng/mL) are diagnostic of AMI, as they indicate myocardial injury.”
Ignores history of GERD, weight loss and dysphagia consistent with esophageal adenocarcinoma
ChatGPT occasionally generated incorrect or implausible information. This issue highlights the risk of relying solely on ChatGPT for medical guidance and emphasizes the importance of a human-in-the-loop approach.
Neglected information 17-Year-Old With Hair Loss, Dysmenorrhea, Thrush, and Diarrhea "The most likely diagnosis in this case is Addison’s disease (AD). The patient presents with characteristic symptoms such as chronic oral fungal infections, darkening of skin and oral mucosa, low-grade fever, and diarrhea. Additionally, her physical examination shows evidence of hyperpigmentation in the palmar creases and other areas, which is a common finding in Addison’s disease. The low blood pressure and blood hemoglobin levels support the diagnosis, as Addison’s disease can cause low blood pressure, decreased blood volume, and anemia." Response excludes a lot of important detail that points towards PGA-1. Findings such as dysmenorrhea and recurrent oral fungal infections were not talked about which are crucial for the diagnosis of PGA-1 gearing away from Addison’s disease. ChatGPT sometimes overlooked key information relevant to the diagnosis.

Discussion

Diagnostic accuracy

ChatGPT demonstrated a case accuracy of 49%, an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, and a specificity of 82.89%. ChatGPT’s AUC was 0.66, indicating moderate discriminative ability between correct and incorrect diagnoses.

In our assessment of ChatGPT’s diagnostic accuracy on 150 complex clinical cases from Medscape, it is important to distinguish between case accuracy and overall accuracy. Case accuracy, the proportion of cases in which the model identified the single correct answer, stood at 49%. Overall accuracy, which also credits the model for correctly rejecting incorrect options across all multiple-choice elements, reached 74.33%. This higher value stems from ChatGPT’s ability to identify true negatives (incorrect options), which contributes substantially to overall accuracy and enhances its utility in eliminating incorrect choices. The difference highlights ChatGPT’s high specificity: it excels at ruling out incorrect diagnoses but needs better precision and sensitivity before it can reliably identify the correct one. Precision and sensitivity are crucial for a diagnostic tool because missed diagnoses can deprive patients of necessary treatments or further diagnostic testing, resulting in worse health outcomes.
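These metrics can all be derived from a single option-level confusion matrix when each case is treated as a multiple-choice question. Two assumptions are made in the sketch below, since neither is stated explicitly in this excerpt: each case has four options (one correct answer plus three distractors), and 73 cases were answered correctly, the count implied by the reported sensitivity of 48.67% (73/150); the abstract’s 49% figure rounds 74/150.

```python
# Sketch of the study's option-level confusion matrix. ASSUMPTIONS (not stated
# in this excerpt): 4 answer options per case, and 73 correctly answered cases.
CASES = 150
correct_cases = 73                 # implied by sensitivity = 73/150 = 48.67%
wrong_cases = CASES - correct_cases

# A correct case contributes 1 TP and 3 TN (all distractors rejected).
# An incorrect case contributes 1 FN (missed correct answer), 1 FP (chosen
# distractor), and 2 TN (remaining distractors rejected).
tp = correct_cases
fn = wrong_cases
fp = wrong_cases
tn = 3 * correct_cases + 2 * wrong_cases

total = tp + fp + fn + tn          # 600 option-level decisions
overall_accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2
false_positive_share = fp / total  # share of all decisions that are FPs

print(f"overall accuracy:  {overall_accuracy:.2%}")     # 74.33%
print(f"sensitivity:       {sensitivity:.2%}")          # 48.67%
print(f"precision:         {precision:.2%}")            # 48.67%
print(f"specificity:       {specificity:.2%}")          # 82.89%
print(f"balanced accuracy: {balanced_accuracy:.2f}")    # 0.66
print(f"FP share:          {false_positive_share:.0%}") # 13%
```

Under these assumptions the sketch reproduces every reported figure, including the 13% false-positive and false-negative shares (77/600 each). Note also that with a single operating point, the ROC curve is two line segments and its area equals the balanced accuracy, (sensitivity + specificity)/2 ≈ 0.66, which may be how the reported AUC was obtained.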

Overall, these results raise concerns about its accuracy as a diagnostic and education tool for clinicians and medical learners. Several factors led to ChatGPT’s mediocre performance in diagnosing complex clinical cases. Its training data is sourced from diverse texts like books, articles, and websites [8]. These sources offer the AI model a broad understanding of everyday topics and English language nuances but may lack in-depth knowledge in specialized fields like medicine, hindering its ability to diagnose complex cases [16]. Additionally, the training data only includes information up until September 2021 [8]. As a result, recent advancements in various fields may not be reflected in ChatGPT’s knowledge, potentially leading to outdated or inaccurate information being provided by the AI model. To improve diagnostic accuracy, it is crucial that ChatGPT’s training data be augmented with up-to-date, specialized medical information and that the model’s architecture be adapted to handle the nuances of clinical case analysis better.

ChatGPT produced a considerable number of false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool in clinical practice. In the context of false positives and false negatives, it is crucial to consider the role of AI hallucinations, as they can significantly impact the accuracy of the information given [17]. Hallucinations are outputs generated by an AI model that seem coherent but are not based on factual information; they arise from biases, errors, or over-optimization in the model’s training data, or from its inability to accurately decipher ambiguous or incomplete input [16]. A false positive occurs when the AI model identifies a condition or disease that is not present, which can lead to unnecessary treatments or interventions and cause undue stress and anxiety. A false negative occurs when the model fails to identify a condition or disease that is present, potentially delaying necessary treatments or interventions and allowing worse outcomes. AI hallucinations contribute in part to the emergence of false positives and false negatives, emphasizing the importance of refining AI models’ training and enhancing their capacity to process intricate information. By doing so, we can potentially improve diagnostic accuracy and reduce the influence of AI hallucinations on medical diagnoses and decision-making [17].

Completeness and relevance of medical answers

ChatGPT’s extensive training on diverse textual data has enabled it to generate complete and coherent responses with proficiency in grammar, context, and a wide range of topics [8]. In most cases, its responses were either complete and relevant (78/150; 52%) or incomplete but still relevant (64/150; 43%) to the user’s inquiry. However, despite these capabilities, ChatGPT may still produce irrelevant responses due to factors such as a lack of true understanding, ambiguous or insufficient input, and over-optimization for coherence [16].

Despite ChatGPT’s proficiency in pattern-matching and generating text based on those patterns, its lack of genuine understanding of the content may result in incomplete answers [18]. In some instances, the AI model produces responses that, while syntactically correct and logical in appearance, only partially address the core issue or question. This can be attributed to the model’s struggle to grasp broader context or nuances, such as the interconnectedness of symptoms, patient history, and risk factors. When crucial information is overlooked or relevant details are not connected, the generated answers might be incomplete, not fully meeting the user’s needs or expectations. However, these incomplete answers can still hold some relevance to the topic at hand, providing users with partial information or guidance that could be of value.

In the context of medical learners, ChatGPT’s ability to generate incomplete but still relevant answers can provide valuable insights and learning opportunities. Although the AI model may not always deliver a comprehensive response, the partial information it offers can still contribute to the learner’s understanding of various medical concepts, symptoms, patient histories, or risk factors. These relevant fragments can encourage medical learners to actively engage in critical thinking and problem-solving, prompting them to seek further information to fill in the gaps and develop a more comprehensive understanding of the subject matter. In this way, ChatGPT may hold potential as a supplementary learning tool, however, learners and educators must be wary of the potential for inaccuracy and concepts should be cross-referenced from trusted sources.

In real-world clinical settings, patient information can be ambiguous, incomplete or even incorrect, which poses a challenge for ChatGPT [19]. While patients may not provide all the details relevant to their clinical case, a human healthcare provider can make inferences and use their medical knowledge to put ambiguous details into context, helping them to make informed medical decisions [20]. In contrast, ChatGPT may struggle to make these inferences and as a result, generate irrelevant responses due to an over-reliance on the information provided. As a result, while ChatGPT may assist healthcare providers, it cannot yet replace the expertise and judgment of a human provider [21]. A human healthcare provider can also take into account nonverbal cues and recognize when a patient may omit or miss important details that could affect their diagnosis or treatment [22]. These factors are not easily captured in text-based interactions, making human expertise essential in the diagnostic and treatment process.

Cognitive load

ChatGPT tends to generate responses with low (77/150; 51%) to moderate (61/150; 41%) cognitive load, emphasizing accessibility and readability for users. This characteristic may be advantageous for novice medical students, as it facilitates improved learner engagement and information retention [23]. However, the combination of this ease of understanding with potentially incorrect or irrelevant information can result in misconceptions and a false sense of comprehension. This issue poses a significant challenge for ChatGPT’s application as a medical education tool, as its efficacy is heavily influenced by the learner’s preexisting knowledge, expertise, and cognitive capacity. In the absence of approaches tailored to these factors, ChatGPT may hinder learners’ ability to apply their knowledge in complex or unfamiliar situations. Addressing this limitation requires adaptive algorithms that adjust cognitive load to the individual user, along with supplementary resources to ensure a comprehensive understanding of the content [24]. Consequently, it is crucial to exercise caution and verify information when relying on ChatGPT for medical inquiries.

Content analysis strengths and weakness

Our analysis revealed several key limitations in ChatGPT’s diagnostic capabilities. First, the model had difficulty interpreting numerical values, likely because it relies on context and language patterns learned during training, which occasionally led it to overlook or misinterpret critical laboratory values [9]. Second, ChatGPT’s inability to evaluate medical images hindered its diagnostic performance, especially when imaging was vital for an accurate diagnosis.

ChatGPT also struggled to distinguish between diseases with subtly different presentations. In addition, the model occasionally generated incorrect or implausible information, known as AI hallucinations, underscoring the risk of relying solely on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process [17].

Finally, ChatGPT sometimes ignored key information relevant to the diagnosis. Its failure to contextualize all of the given information highlights the importance of human input in ensuring that critical details are considered during the diagnostic process [21].

Ethical considerations

As technology becomes increasingly integrated into healthcare, with Electronic Medical Records (EMRs) and other digital tools becoming commonplace, the imperative to securely manage sensitive medical data has never been more critical. Patient privacy and data security are not just ethical imperatives but also crucial for maintaining trust in medical systems [21]. However, as we integrate more advanced technologies into healthcare, new challenges emerge. One significant concern is the potential for algorithms to perpetuate existing biases present in their training data. The selection of this data, often influenced by human biases, can inadvertently reinforce disparities in medical diagnoses and treatment plans, further exacerbating racial and other disparities in healthcare outcomes [25]. Moreover, while AI can provide valuable insights, the importance of human oversight cannot be overstated. Physicians must consider the broader clinical context and individual patient needs, recognizing that the most statistically accurate diagnosis or treatment plan might not align with a patient’s cultural or religious values [21]. As AI’s role in healthcare grows, so does the need for a clear legal framework addressing liability. Questions arise regarding responsibility for misdiagnosis: Should the onus lie with AI development teams, the physicians who rely on these tools, or a combination of both? As we navigate these complexities, the overarching goal remains to ensure that AI serves as a tool to enhance, not replace, the human touch in medicine.

Limitations

There are several limitations to consider in this study of ChatGPT’s use in medical education. First, our study focused on a single AI model (ChatGPT 3.5), which may not be representative of other AI models or future iterations of ChatGPT. Second, we used only Medscape Clinical Challenges, which, while complex and diverse, may not cover all aspects of medical education [12]. Our initial approach was to develop a list of cases encompassing other aspects of medicine, such as management, pharmacotherapy, and pathophysiology; however, the 150 Medscape cases focus primarily on differential diagnosis [12]. Finally, the input and prompt standardization process relied on the expertise of the authors, and alternative standardization methods could influence the model’s performance. Future studies should explore different AI models, case sources, and educational contexts to further assess the utility of AI in medical education.

Future perspectives

ChatGPT has gained significant popularity as a teaching tool in medical education [26]. Its access to extensive medical knowledge, combined with its ability to deliver real-time, unique, insightful responses, is invaluable. In conjunction with traditional teaching methods, ChatGPT can help students bridge gaps in knowledge and simplify complex concepts by delivering instantaneous, personalized answers to clinical questions [27–31]. However, the use of ChatGPT in medical education poses challenges; outdated training data and hallucinations can lead to the dissemination of inaccurate and misleading information to students [32–35]. To overcome this problem, we foresee future advancements in other LLMs, either trained on medical literature or integrated with real-time medical databases. These specialized models would offer users accurate medical knowledge and up-to-date clinical guidelines. Beyond integration, it is important to explore the long-term implications of using LLMs such as ChatGPT in health care and medical education. Although numerous studies, including ours, have evaluated ChatGPT for medical education, further research is essential to determine its quality and efficacy as a tool in this field [32–35].
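The integration with real-time medical databases envisioned above is, in essence, retrieval-augmented generation: fetch current reference material first, then have the model reason only over what was retrieved. The sketch below is purely illustrative; the function names (search_guidelines, build_grounded_prompt), the two-entry toy corpus, and the prompt wording are all hypothetical placeholders, not any real clinical API.

```python
# Illustrative retrieval-augmented prompting pattern. Every name and data
# item here is a hypothetical stand-in for a real, curated medical database.

def search_guidelines(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever over an up-to-date clinical reference corpus."""
    corpus = {
        "elevated lipase": "Lipase reference range 10-140 U/L; values above "
                           "3x the upper limit support acute pancreatitis.",
        "hla-b27": "HLA-B27 is associated with ankylosing spondylitis, but a "
                   "negative result does not exclude the disease.",
    }
    # Toy keyword match standing in for embedding/semantic search.
    return [text for key, text in corpus.items() if key in query.lower()][:k]

def build_grounded_prompt(case_text: str) -> str:
    """Prepend retrieved excerpts so the LLM reasons over current sources
    instead of relying solely on its (possibly outdated) training data."""
    evidence = search_guidelines(case_text)
    context = "\n".join(f"- {snippet}" for snippet in evidence)
    return (
        "Use ONLY the reference excerpts below when interpreting lab values.\n"
        f"References:\n{context}\n\n"
        f"Case:\n{case_text}\n"
        "Question: What is the most likely diagnosis?"
    )

prompt = build_grounded_prompt("58-year-old with elevated lipase of 160 U/L")
print(prompt)
```

The grounded prompt would then be sent to the LLM in place of the bare case text; because the reference snippets travel with every query, updating the database updates the model’s effective knowledge without retraining.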

Future research should focus on discerning the competency of medical professionals who are over-reliant on ChatGPT, assessing patient confidence in AI-supported diagnoses, and evaluating the overall impact on clinical outcomes. These studies will aid in developing guidelines for integrating AI into both medical education and clinical practice. While many agree that there is an urgent need for appropriate guidelines and regulations for the application of ChatGPT in healthcare and medical education, it is equally important to proceed cautiously, ensuring that LLMs like ChatGPT are implemented in a responsible and ethical manner [26, 36, 37]. As we learn to embrace AI in healthcare, further research in this field will shape the future of patient care and medical training, and caution is warranted when using ChatGPT both as a diagnostic tool and as a teaching aid.

Conclusion

The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading [24]. While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, they also reveal the tool’s shortcomings in providing factually correct medical information, as evidenced by its low diagnostic accuracy. Additional research should focus on enhancing the accuracy and dependability of ChatGPT as a diagnostic instrument. Integrating ChatGPT into medical education and clinical practice will require a thorough examination of its educational and clinical limitations. Transparent guidelines should be established for ChatGPT’s clinical use, and medical students and clinicians should be trained to employ the tool effectively and responsibly.

Supporting information

S1 File

(XLSX)

pone.0307383.s001.xlsx (687.4KB, xlsx)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall; 2010.
2. Khurana D, Koli A, Khatter K, et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023;82:3713–3744. doi: 10.1007/s11042-022-13428-4
3. Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med. 2013;84(8):890–5.
4. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;17(1):128–44.
5. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30. doi: 10.1136/amiajnl-2013-001935
6. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33.
7. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. doi: 10.1038/s41591-018-0316-z
8. OpenAI. ChatGPT [Internet]. 2021 [cited 2023 Apr 11]. Available from: https://www.openai.com/research/chatgpt/
9. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. 2022.
10. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst. 2023;47(1):33. doi: 10.1007/s10916-023-01925-4
11. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. OpenAI Blog. 2018;1.
12. Medscape. Clinical challenges [Internet]. [cited 2023 Apr 11]. Available from: https://www.medscape.com/casechallengehub
13. Deeks JJ, Altman DG, Gatsonis C. Cochrane handbook for systematic reviews of diagnostic test accuracy. Cochrane Book Series; 2004.
14. Paas F, van Merriënboer JJ. Cognitive-load theory: methods to manage working memory load in the learning of complex tasks. Curr Dir Psychol Sci. 2020;29(4):394–8.
15. Demner-Fushman D, Lin J. Answering clinical questions with knowledge-based and statistical techniques. Comput Linguist. 2007;33(1):63–103.
16. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191–200.
17. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2). doi: 10.7759/cureus.35179
18. Shortliffe EH, Sepúlveda MJ. Clinical decision support in the era of artificial intelligence. JAMA. 2018;320(21):2199–2200. doi: 10.1001/jama.2018.17163
19. Hollander JE, Carr BG. Virtually perfect? Telemedicine for Covid-19. N Engl J Med. 2020;382(18):1679–1681. doi: 10.1056/NEJMp2003539
20. Eva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39(1):98–106. doi: 10.1111/j.1365-2929.2004.01972.x
21. Char DS, Shah NH, Magnus D, Hsiao AL, Scherer RW. Implementing machine learning in health care: addressing ethical challenges. N Engl J Med. 2018;378(11):981–3. doi: 10.1056/NEJMp1714229
22. Matheny ME, Whicher D, Thadaney Israni S. Artificial intelligence in health care: a report from the National Academy of Medicine. JAMA. 2019;323(6):509–10.
23. Wartman SA, Combs CD. Reimagining medical education in the age of AI. AMA J Ethics. 2019;21(2):E146–152. doi: 10.1001/amajethics.2019.146
24. Miller DD, Brown EW. Artificial intelligence in medical practice: the question to the answer? Am J Med. 2018;131(2):129–133. doi: 10.1016/j.amjmed.2017.10.035
25. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. doi: 10.1126/science.aax2342
26. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11(6):887. doi: 10.3390/healthcare11060887
27. Khan A, Jawaid M, Khan A, Sajjad M. ChatGPT: reshaping medical education and clinical management. Pak J Med Sci. 2023;39:605–7. doi: 10.12669/pjms.39.2.7653
28. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
29. Gunawan J. Exploring the future of nursing: insights from the ChatGPT model. Belitung Nurs J. 2023;9:1–5. doi: 10.33546/bnj.2551
30. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347–1358. doi: 10.1056/NEJMra1814259
31. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614:224–6. doi: 10.1038/d41586-023-00288-7
32. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. medRxiv. 2023. Preprint. doi: 10.1016/j.xops.2023.100324
33. Ahn C. Exploring ChatGPT for information of cardiopulmonary resuscitation. Resuscitation. 2023;185:109729. doi: 10.1016/j.resuscitation.2023.109729
34. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof. 2023;20:1.
35. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. 2023.
36. Alberts IL, Mercolli L, Pyka T, Prenosil G, Shi K, Rominger A, et al. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging. 2023. Online ahead of print.
37. Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical, dental, pharmacy, and public health education: a descriptive study. Narra J. 2023;3:e103.


